Notes on Character Encoding Conversions

I did a quick bit of research on Japanese character encodings and how functions in PHP handle the conversions between them.

The table below summarizes the results (click to enlarge).

We can see the following;

Although Shift-JIS (SJIS) is still the most common format in Japan, it is terrible at handling special “hankaku” (single-width) characters. It simply leaves out a lot of them; even the ones that we would like to use quite frequently.
The PHP mb_convert_encoding function gives up when it can’t find a matching character, and deletes the character. On the other hand, iconv does a pretty good job of finding a good substitute if we specify //TRANSLIT.
Gathering from webpages that I can find on the subject, a lot of people seem to prefer mb_convert_encoding with the sjis-win encoding. This is a lousy solution if you are using special “hankaku” characters. It’s better to use iconv with CP932 encoding and //TRANSLIT. There is one snag with CP932 encoding with //TRANSLIT and that is with regards to the “hankaku” yen character (“¥”). Converting to “yen” isn’t really a nice solution. You can see however that //TRANSLIT always converts to ASCII, and “yen” probably is the only way you can sensibly convert the ¥ mark. Otherwise, it’s a good idea to use the “zenkaku” (double-width) “￥”.
The micro mark “µ” is not supported in Shift-JIS but the greek mu “μ” is. Therefore, if you want to write a micro mark in Shift-JIS, you should use the greek mu instead. Again, iconv with //TRANSLIT does the correct thing (converting it to “u”).

Leave a comment Cancel reply