I ran into a (bug? | annoyance) in Ruby 1.8.6 on Mac OS X.
Run all of the following code with
$KCODE = 'u'
I'm working with the hex-coded em-dash character.
puts "\xE2\x80\x94"
output:—
I can successfully use the hex coding in a simple regular expression.
puts "—" =~ /\xE2\x80\x94/
output:0
However, this doesn’t work if I put the hex-coded character inside a character class.
puts "—" =~ /[\xE2\x80\x94]/
output:nil
I can work around this by evaluating the hex-coded character first, so that it becomes a UTF-8 character before it goes inside the character class brackets.
puts "—" =~ /[#{"\xE2\x80\x94"}]/
output:0
To see what’s happening, I inspected the regex objects.
/[\xE2\x80\x94]/.inspect
output:"/[\xE2\x80\x94]/"
/#{"\xE2\x80\x94"}/.inspect
output:"/—/"
It looks like putting the hex code directly inside a regex is a bad idea if I want to use Unicode reliably in Ruby regular expressions. Instead, I should evaluate the hex code into a UTF-8 character first, then interpolate that character into the regex.
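As a minimal sketch of that workaround, a small helper can build the character class from strings that have already been evaluated. The names EM_DASH and char_class are hypothetical (they are not part of Ruby), and I'm assuming Regexp.escape leaves the multibyte character untouched. On Ruby 1.8.6 this would still need $KCODE = 'u' set first, as above; I have not verified the helper on every Ruby version.

```ruby
# Hypothetical constant: the em-dash, written as its three UTF-8 bytes.
# Inside a double-quoted string the escapes are evaluated right away,
# so the regex never sees the \x sequences themselves.
EM_DASH = "\xE2\x80\x94"

# Hypothetical helper: build a character class from already-evaluated strings.
# Regexp.escape protects regex metacharacters such as "-" and "]".
def char_class(*chars)
  Regexp.new("[" + chars.map { |c| Regexp.escape(c) }.join + "]")
end

dash_or_hyphen = char_class("-", EM_DASH)
p(EM_DASH =~ dash_or_hyphen)  # index of the first match, or nil if none
```

Building the class from a string keeps the hex coding in one place (the constant) and out of the regex source.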