Pinyin Tones Transformation

This is for those of you learning Mandarin. Sometimes you need to type a phrase in pinyin with tone marks, because numbered notation (Wo3 shi4 Mei3guo2 ren2) looks ugly and Wǒ shì Měiguó rén looks way better. I’ve just made a WordPress plugin for this. This is the plugin homepage, but if you want to know how it works or just need the code outside WordPress, please read the rest of this entry.

The code is very simple because the rules are very simple:

  1. We split all words into syllables. Pinyin notation is designed to avoid confusion: you must use an apostrophe to separate syllables wherever a misreading is possible. For instance, “Tian’anmen” can only be read as “Tian an men”, not “Ti an an men”. With digits it’s even simpler: in “Tian1an1men2” the tone digits mark the syllable boundaries, so no confusion is possible unless there is a neutral (unnumbered) tone inside, and I can’t think of an example. In that case you can use an apostrophe as well.
  2. The diacritic mark appears on one of the syllable’s vowels. How do we decide which one?
    1. If there is an “a” or an “e”, it takes the mark.
    2. If there is “ou”, the “o” takes the mark.
    3. In all other cases, the last vowel takes it.
  3. The letter ü is typed as “v”, because “v” is the only Latin letter that pinyin itself never uses.
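
The rules above can be sketched on their own, separately from the plugin code. This is only an illustration: the function names `split_numbered_syllables` and `tone_vowel_index` are made up for this example and are not part of the plugin.

```php
<?php
// Hypothetical helpers, not part of the plugin: they only illustrate the rules above.

// Rule 1: split numbered pinyin into (letters, tone digit) pairs.
function split_numbered_syllables(string $text): array
{
    preg_match_all('/([a-z]{1,6})([1-4])/i', $text, $m, PREG_SET_ORDER);
    $pairs = array();
    foreach ($m as $hit) {
        $pairs[] = array($hit[1], (int) $hit[2]);
    }
    return $pairs;
}

// Rule 2: return the index of the vowel that carries the tone mark, or -1 if none.
function tone_vowel_index(string $syllable): int
{
    $s = strtolower($syllable);
    // 2.1: an "a" or an "e" always takes the mark.
    $pos = strcspn($s, 'ae');
    if ($pos < strlen($s)) {
        return $pos;
    }
    // 2.2: in "ou" the "o" takes the mark.
    $pos = strpos($s, 'ou');
    if ($pos !== false) {
        return $pos;
    }
    // 2.3: otherwise the last vowel takes it ("v" stands for ü, rule 3).
    for ($i = strlen($s) - 1; $i >= 0; $i--) {
        if (strpos('iouv', $s[$i]) !== false) {
            return $i;
        }
    }
    return -1;
}
```

With these helpers, “Tian1an1men2” splits into Tian/1, an/1, men/2, and `tone_vowel_index('guo')` returns 2, the position of the “o”.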

OK, now the code. It transforms everything inside a [pinyin][/pinyin] block into great-looking pinyin with tone marks.

function transform_pinyin_tones($content)
{
    if(!preg_match_all('`\[pinyin\](.*)\[/pinyin\]`Uis', $content, $r)) return $content;

    $tones = array(
    'a1' => '257',
    'a2' => '225',
    'a3' => '462',
    'a4' => '224',
    'e1' => '275',
    'e2' => '233',
    'e3' => '283',
    'e4' => '232',
    'i1' => '299',
    'i2' => '237',
    'i3' => '464',
    'i4' => '236',
    'o1' => '333',
    'o2' => '243',
    'o3' => '466',
    'o4' => '242',
    'u1' => '363',
    'u2' => '250',
    'u3' => '468',
    'u4' => '249',
    'v1' => '470',
    'v2' => '472',
    'v3' => '474',
    'v4' => '476'
    );

    $vowels = array('a', 'e', 'i', 'o', 'u', 'v');

    foreach($r[0] as $i => $match)
    {
        $digital = $r[1][$i];
        $diacritic = $digital;
        if(!preg_match_all('`([a-z]{1,6})([1-4])`is', $digital, $syllables)) continue;
        foreach($syllables[0] as $k => $syllable)
        {
            $s = $syllables[1][$k];
            $t = $syllables[2][$k];
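            // Rule 2.1: the first "a" or "e" takes the mark (replace one occurrence only).
            // Note: a capital vowel comes back lowercase, since the table has no capitals.
            if(preg_match('`(a|e)`i', $s, $r2))
            {
                $s = preg_replace('`'.$r2[1].'`i', '&#'.$tones[strtolower($r2[1]).$t].';', $s, 1);
            }
            // Rule 2.2: match "ou" case-insensitively, as the replacement below does.
            elseif(preg_match('`ou`i', $s, $r2))
            {
                $s = preg_replace('`ou`i', '&#'.$tones['o'.$t].';u', $s);
            }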
            else
            {
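                // Rule 2.3: scan from the last character back to the first (index 0 included).
                for($j=strlen($s)-1;$j>=0;$j--)
                {
                    $v = strtolower($s[$j]);
                    if(in_array($v, $vowels))
                    {
                        // Replace only the character at position $j.
                        $s = substr_replace($s, '&#'.$tones[$v.$t].';', $j, 1);
                        break;
                    }
                }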
            }

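            // Rule 3: a "v" that did not take the mark still stands for ü (e.g. "lve4").
            $s = str_replace(array('v', 'V'), array('&#252;', '&#220;'), $s);
            $diacritic = str_replace($syllable, $s, $diacritic);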
        }

        $content = str_replace($match, $diacritic, $content);
    }
    return $content;
}
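
The function emits numeric HTML entities such as `&#466;`, which is fine inside a WordPress post. If you use the code outside HTML, you probably want real UTF-8 characters instead. A minimal sketch, assuming PHP's standard `html_entity_decode()`; the wrapper name `pinyin_entities_to_utf8` is made up for this example:

```php
<?php
// Hypothetical helper, not part of the plugin: turns the numeric HTML
// entities produced by the transform into real UTF-8 characters.
function pinyin_entities_to_utf8(string $text): string
{
    // html_entity_decode() understands decimal entities such as "&#466;".
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}
```

For example, passing `W&#466; sh&#236;` through this helper gives `Wǒ shì`.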
