I’ve got 99 Problems, and Unicode Ain’t One

A friend of mine posed this JavaScript problem to me today:

I need to replace [fullwidth] numerals (\uFF10-\uFF19) with [regular] numerals (\u0030-\u0039)

Well, immediately, of course, I thought of Perl’s transliteration operator, which would make this a snap. This is JavaScript, though, not Perl, so we have to be a little more clever.

Since we are replacing a contiguous set of characters with another contiguous set of characters, we can be very clever indeed. Here’s the solution:

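function replaceFullWidthNumerals(s) {
  // Match each fullwidth digit (U+FF10 to U+FF19) individually.
  return s.replace(/[\uFF10-\uFF19]/g, function (m) {
    // 0xFF10 - 0x30 = 0xFEE0, so subtracting it yields the regular digit's code.
    return String.fromCharCode(m.charCodeAt(0) - 0xFEE0);
  });
}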

So why’s it work? Well, let’s break it down.

We start by using a regular expression (/[\uFF10-\uFF19]/g) to match any fullwidth numerals in the input string. Since we’re using a character class with no quantifier, it’ll match each character individually. JavaScript’s replace method allows you to specify the replacement with a function, which is called once per match and whose return value becomes the replacement text. (Side note: I was initially tempted to skip the function and just use the $& replacement variable, but it’s not possible to do it that way…I’ll leave the reason why as an exercise for the reader.)
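To see the per-character matching in action, here is a small sketch. The sample string and the names seen and masked are made up for illustration; the callback just records what it was handed and returns a placeholder.

```javascript
// Hypothetical demo: record every match the callback receives.
const seen = [];
const masked = "a\uFF11b\uFF12\uFF13".replace(/[\uFF10-\uFF19]/g, function (m) {
  seen.push(m); // m is always a single fullwidth digit
  return "#";   // whatever the callback returns replaces the match
});

console.log(masked);      // "a#b##"
console.log(seen.length); // 3
```

Even though two of the fullwidth digits are adjacent, the callback runs once for each of them, so the function only ever has to deal with one character at a time.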

So far so good: we’re matching all fullwidth numerals, and invoking a function to replace them with something else. One obvious way out here would be a lookup table: replace \uFF10 with \u0030, \uFF11 with \u0031, etc. A ten-item lookup table isn’t so bad, and certainly that’s the approach we would have to take if the two sets of characters weren’t contiguous. But since they are contiguous, we can avoid the lookup table.
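For comparison, the lookup-table approach might look something like the sketch below. The names FULLWIDTH_TO_REGULAR and replaceFullWidthNumeralsWithTable are invented here for illustration; this is the alternative we are about to avoid, not the solution above.

```javascript
// Hypothetical lookup table: one entry per fullwidth digit.
const FULLWIDTH_TO_REGULAR = {
  "\uFF10": "0", "\uFF11": "1", "\uFF12": "2", "\uFF13": "3", "\uFF14": "4",
  "\uFF15": "5", "\uFF16": "6", "\uFF17": "7", "\uFF18": "8", "\uFF19": "9"
};

function replaceFullWidthNumeralsWithTable(s) {
  // Same regular expression as before; only the replacement strategy differs.
  return s.replace(/[\uFF10-\uFF19]/g, function (m) {
    return FULLWIDTH_TO_REGULAR[m];
  });
}

console.log(replaceFullWidthNumeralsWithTable("\uFF11\uFF12\uFF13 main st")); // "123 main st"
```

It works, and it would still work if the target characters were scattered all over the code space, but it’s ten entries of typing that the arithmetic makes unnecessary.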

Basically, every character has a corresponding numeric code. In JavaScript, you can extract that code with a string’s charCodeAt method. Once you have the code, you can normalize it by subtracting the code at the beginning of your contiguous range, in this case 0xFF10. Now \uFF10 becomes 0, \uFF11 becomes 1, etc. The regular numerals start at \u0030 and go up to \u0039, and what we really want is for \uFF10 to map to \u0030, \uFF11 to map to \u0031, etc. So we take our normalized number and just add 0x30. In other words, to get the character code of the regular numeral, we have the formula n - 0xFF10 + 0x30. A little bit of arithmetic reduces this to n - 0xFEE0, and all that’s left to do is convert this character code back into an actual character, which you can do with JavaScript’s String.fromCharCode method.
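Here’s that arithmetic worked through for a single character, fullwidth five (\uFF15). The variable names are just for illustration.

```javascript
// Walk one fullwidth digit through the arithmetic, step by step.
const code = "\uFF15".charCodeAt(0);   // 0xFF15
const normalized = code - 0xFF10;      // 5: offset from the start of the fullwidth range
const regularCode = normalized + 0x30; // 0x35: same offset from the start of the regular range
const shortcut = code - 0xFEE0;        // 0x35 again, since 0xFF10 - 0x30 = 0xFEE0

console.log(String.fromCharCode(regularCode)); // "5"
console.log(regularCode === shortcut);         // true
```

The two-step version and the single subtraction land on the same code, which is why the solution can get away with one magic constant.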

Fast, efficient, and compact: my kind of code.


One Comment on I’ve got 99 Problems, and Unicode Ain’t One

  1. +++ for the Jay Z reference.