Handling UTF-16 line endings in Ruby

A quick memo of a problem that I was having with Ruby.

I was reading in a UTF-16 Little-Endian text file with Windows (CR+LF) line endings, using the Ruby ‘read’ command, then converting it to UTF-8 using the NKF library. I kept running into a problem where some of the characters came out garbled.

After some digging around, I found this post on Ruby List (in Japanese).

What it says is that a UTF-16 Little-Endian CR+LF line ending is encoded as

"\r\00\n\00"

The problem is that since the Ruby gets command uses “\n” as the default separator string, the string that is actually read in ends with

"\r\00\n"

As a result, the final character “\n” is only 8 bits long and is not a valid UTF-16 character, and its leftover 00 byte ends up at the start of the next line. This causes NKF to misbehave and garble the text (Iconv spits out an error and quits).
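To see the problem at the byte level, here is a small sketch. It is not the original script: it uses an in-memory StringIO in place of the file, and the sample text "ab" / "cd" is made up for illustration.

```ruby
require "stringio"

# Two lines of UTF-16LE text with CR+LF endings, held as raw bytes.
# (Sample data is hypothetical; a real file would be opened in binary mode.)
data = "ab\r\ncd\r\n".encode("UTF-16LE").b
io = StringIO.new(data)

first  = io.gets   # default separator is "\n", i.e. the single byte 0A
second = io.gets

first_hex  = first.unpack1("H*")
second_hex = second.unpack1("H*")

puts first_hex    # 610062000d000a   (7 bytes: the 00 after 0a is missing)
puts second_hex   # 00630064000d000a (starts with the leftover 00)
```

The first line has an odd number of bytes, and every line after it is shifted by one byte, so none of them decode as the UTF-16 text they were meant to be.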

Instead of using a plain gets to fetch a line from UTF-16 Little-Endian CR+LF text, simply use

gets("\n\00")

You can then use either NKF or Iconv without any problems.
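Putting it together, a minimal sketch of the fixed loop might look like this. As assumptions of mine rather than the original code: the input is an in-memory StringIO with made-up sample text, and the conversion uses Ruby's built-in String#encode instead of NKF or Iconv.

```ruby
require "stringio"

# Hypothetical sample input: UTF-16LE bytes with CR+LF line endings.
data = "ab\r\ncd\r\n".encode("UTF-16LE").b
io = StringIO.new(data)

# The bytes 0A 00: a line feed as it appears in UTF-16LE.
separator = "\n\00".b

lines = []
while (raw = io.gets(separator))
  # Each chunk now holds whole 16-bit units, so it converts cleanly.
  lines << raw.force_encoding("UTF-16LE").encode("UTF-8")
end

p lines   # ["ab\r\n", "cd\r\n"]
```

One caveat: the separator is matched as raw bytes, so the pair 0A 00 could in principle also turn up across the boundary of two other characters. I have not needed to handle that case, but it is worth keeping in mind for text outside the usual ranges.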
