Notes on Character Encoding Conversions

I did a quick bit of research on Japanese character encodings and how functions in PHP handle the conversions between them.

The table below summarizes the results (click to enlarge).

スクリーンショット 2014-03-07 16.11.06

We can see the following;

  1. Although Shift-JIS (SJIS) is still the most common format in Japan, it is terrible at handling special “hankaku” (single-width) characters. It simply leaves out a lot of them; even the ones that we would like to use quite frequently.
  2. The PHP mb_convert_encoding function gives up when it can’t find a matching character, and deletes the character. On the other hand, iconv does a pretty good job of finding a good substitute if we specify //TRANSLIT.
  3. Gathering from webpages that I can find on the subject, a lot of people seem to prefer mb_convert_encoding with the sjis-win encoding. This is a lousy solution if you are using special “hankaku” characters. It’s better to use iconv with CP932 encoding and //TRANSLIT. There is one snag with CP932 encoding with //TRANSLIT and that is with regards to the “hankaku” yen character (“¥”). Converting to “yen” isn’t really a nice solution. You can see however that //TRANSLIT always converts to ASCII, and “yen” probably is the only way you can sensibly convert the ¥ mark. Otherwise, it’s a good idea to use the “zenkaku” (double-width) “¥”.
  4. The micro mark “µ” is not supported in Shift-JIS but the greek mu “μ” is. Therefore, if you want to write a micro mark in Shift-JIS, you should use the greek mu instead. Again, iconv with //TRANSLIT does the correct thing (converting it to “u”).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: