Using Hex coded characters inside Ruby regex character classes

I ran into a (bug? | annoyance) in Ruby 1.8.6 on Mac OS X.

Run all of the following code with

$KCODE = 'u'

I'm working with the hex-coded em-dash character.

puts "\xE2\x80\x94"
output: —
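
As a quick sanity check, here is a minimal sketch showing that the three hex escapes are just the UTF-8 bytes of a single code point, U+2014 (the em dash). The variable names are mine, chosen for illustration only.

```ruby
# The hex-coded string used throughout this post.
escaped = "\xE2\x80\x94"

# View it as raw bytes: three of them, 0xE2 0x80 0x94.
bytes = escaped.unpack("C*")
hex = bytes.map { |b| "%02X" % b }.join(" ")

# Decode the same bytes as UTF-8: one code point, U+2014.
codepoints = escaped.unpack("U*")

puts hex                        # E2 80 94
puts "U+%04X" % codepoints[0]   # U+2014
```

So anything that handles the escapes one byte at a time sees three separate values, while anything that decodes them as UTF-8 sees one character.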

I can successfully use the hex coding to generate a simple regular expression.

puts "—" =~ /\xE2\x80\x94/
output: 0

However, this doesn’t work if I put the hex coded character inside a character class.

puts "—" =~ /[\xE2\x80\x94]/
output: nil

I can work around this by evaluating the hex-coded character into a UTF-8 character before putting it into the character class brackets.

puts "—" =~ /[#{"\xE2\x80\x94"}]/
output: 0
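
The same workaround can be wrapped up so the character class is always built from already-evaluated strings. This is only a sketch: `char_class` is a hypothetical helper name of my own, not something Ruby provides, and I'm assuming the script is read as UTF-8 (`$KCODE = 'u'` on 1.8, the default source encoding on newer Rubies).

```ruby
# Hypothetical helper: build a character class from strings that Ruby
# has already evaluated, so the regex engine never sees the \x escapes.
def char_class(*chars)
  # Regexp.escape guards against characters such as "]" or "-" that
  # would otherwise be special inside the brackets.
  Regexp.new("[" + chars.map { |c| Regexp.escape(c) }.join + "]")
end

pattern = char_class("\xE2\x80\x94")

puts "—" =~ pattern     # 0
puts "a—b" =~ pattern   # 1
```

Escaping each character first also means the helper keeps working if I later add something like a hyphen to the class.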

To see what’s happening, I inspected the regex objects.

/[\xE2\x80\x94]/.inspect
output: "/[\xE2\x80\x94]/"
/#{"\xE2\x80\x94"}/.inspect
output: "/—/"

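It looks like the regex literal keeps the hex escapes exactly as written, while the interpolated version contains the actual character. So if I want to reliably use Unicode in Ruby regular expressions, putting the hex code directly inside the regex is a bad idea. I should evaluate the hex code into a Unicode character first, and then stick that into the regex.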
