Perfection Kills

by kangax

Exploring Javascript by example

Whitespace deviations

August 23rd, 2009

I am reading Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan. It’s a truly excellent book on the subject, with an incredible level of attention to detail. I am only halfway through the book, but have already learned a few things about regular expressions, both general and JavaScript-related ones.

One thing I noticed missing in the book was a mention of whitespace character class (\s) discrepancies in current ECMAScript implementations. The Cookbook rightfully explains that \s in JavaScript matches any character defined as whitespace by the Unicode standard. What it fails to mention is how horribly this rule is actually implemented in modern browsers. While most implementations correctly handle ASCII whitespace characters such as U+0020 (Space), U+000B (Vertical Tab) and U+000A (Line Feed), there’s much more chaos among the non-ASCII ones, from U+00A0 (No-Break Space) upwards.

In practice such non-conformance can lead to surprising results when implementing something like a trim function. If trim were to rely on \s, then it could miss quite common characters like U+00A0 (No-Break Space). In fact, trim as used in jQuery or Prototype does exactly that: it uses the standard whitespace character class (\s), and so fails with any of these troublesome characters in non-conforming browsers. One solution, of course, is to replace \s with a custom character class, e.g.: [\u0009\u000A\u000B\u000C\u000D\u0020\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\u2028\u2029]
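
A minimal sketch of such a replacement, assuming the character class above is the desired definition of whitespace. The names `WS` and `trimWhitespace` are made up for illustration and are not taken from jQuery or Prototype:

```javascript
// Explicit whitespace class from the paragraph above, with adjacent code
// points collapsed into ranges. Using it instead of \s keeps the behaviour
// independent of how a given engine implements \s.
var WS = '[\\u0009-\\u000D\\u0020\\u00A0\\u1680\\u180E\\u2000-\\u200A' +
         '\\u2028\\u2029\\u202F\\u205F\\u3000]';

var LEADING = new RegExp('^' + WS + '+');
var TRAILING = new RegExp(WS + '+$');

// Hypothetical helper: strips leading and trailing whitespace as defined by WS.
function trimWhitespace(str) {
  return String(str).replace(LEADING, '').replace(TRAILING, '');
}
```

For example, `trimWhitespace('\u00A0 foo \u3000')` returns `'foo'` regardless of how the engine treats \s.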

This topic comes up once in a while on comp.lang.javascript and there have been some efforts to document these discrepancies. I wanted to make a simple table of modern browsers’ compliance and used a test once provided by Richard Cornford (also available online for anyone to try out).
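
The idea behind such a test can be sketched roughly as follows. This is not Richard Cornford’s actual test, only an illustration of the approach; `CODE_POINTS` and `probeWhitespace` are names made up here:

```javascript
// Code points taken from the custom character class above.
var CODE_POINTS = [
  0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020, 0x00A0, 0x1680, 0x180E,
  0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008,
  0x2009, 0x200A, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000
];

// Returns an array of { codePoint, pass } objects, where `pass` says whether
// the engine's own \s matched the character.
function probeWhitespace(codePoints) {
  var out = [];
  for (var i = 0; i < codePoints.length; i++) {
    var chr = String.fromCharCode(codePoints[i]);
    out.push({ codePoint: codePoints[i], pass: /\s/.test(chr) });
  }
  return out;
}

// Print one line per code point in the form "0x0020 PASS".
var results = probeWhitespace(CODE_POINTS);
for (var i = 0; i < results.length; i++) {
  var hex = ('0000' + results[i].codePoint.toString(16).toUpperCase()).slice(-4);
  console.log('0x' + hex + ' ' + (results[i].pass ? 'PASS' : 'FAIL'));
}
```

The output depends entirely on the engine running it, which is the point of the exercise.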

Here’s a table demonstrating the above-mentioned deviations. It’s good to see Safari 4+ and Chrome 2+ conforming to the spec fully. Hopefully, upcoming versions of Firefox will also take care of the remaining “failures”.

| Code point / Browser | Firefox 2-3.5 | Safari 2.0-3.2.1 | Safari 4 | Opera 9.25, 9.64 | Opera 10 | IE 6-8 | Chrome 2-3 | Konqueror 4.2.2 |
|---|---|---|---|---|---|---|---|---|
| (0x0009) [ASCII Tab] | PASS | PASS | PASS | PASS | PASS | PASS | PASS | PASS |
| (0x000A) [ASCII Line Feed] | PASS | PASS | PASS | PASS | PASS | PASS | PASS | PASS |
| (0x000B) [ASCII Vertical Tab] | PASS | PASS | PASS | PASS | PASS | PASS | PASS | FAIL |
| (0x000C) [ASCII Form Feed] | PASS | PASS | PASS | PASS | PASS | PASS | PASS | PASS |
| (0x000D) [ASCII Carriage Return] | PASS | PASS | PASS | PASS | PASS | PASS | PASS | PASS |
| (0x0020) SPACE | PASS | PASS | PASS | PASS | PASS | PASS | PASS | PASS |
| (0x00A0) NO-BREAK SPACE | PASS | FAIL | PASS | PASS | PASS | FAIL | PASS | FAIL |
| (0x1680) OGHAM SPACE MARK | FAIL | FAIL | PASS | PASS | FAIL | FAIL | PASS | FAIL |
| (0x180E) MONGOLIAN VOWEL SEPARATOR | FAIL | FAIL | PASS | FAIL | FAIL | FAIL | PASS | FAIL |
| (0x2000) EN QUAD | PASS | FAIL | PASS | PASS | PASS | FAIL | PASS | FAIL |
| (0x2001) EM QUAD | PASS | FAIL | PASS | PASS | PASS | FAIL | PASS | FAIL |
| (0x2002) EN SPACE | PASS | FAIL | PASS | PASS | PASS | FAIL | PASS | FAIL |
| (0x2003) EM SPACE | PASS | FAIL | PASS | PASS | PASS | FAIL | PASS | FAIL |
| (0x2004) THREE-PER-EM SPACE | PASS | FAIL | PASS | PASS | PASS | FAIL | PASS | FAIL |
| (0x2005) FOUR-PER-EM SPACE | PASS | FAIL | PASS | PASS | PASS | FAIL | PASS | FAIL |
| (0x2006) SIX-PER-EM SPACE | PASS | FAIL | PASS | PASS | PASS | FAIL | PASS | FAIL |
| (0x2007) FIGURE SPACE | PASS | FAIL | PASS | PASS | PASS | FAIL | PASS | FAIL |
| (0x2008) PUNCTUATION SPACE | PASS | FAIL | PASS | PASS | PASS | FAIL | PASS | FAIL |
| (0x2009) THIN SPACE | PASS | FAIL | PASS | PASS | PASS | FAIL | PASS | FAIL |
| (0x200A) HAIR SPACE | PASS | FAIL | PASS | PASS | PASS | FAIL | PASS | FAIL |
| (0x2028) LINE SEPARATOR | PASS | FAIL | PASS | PASS | PASS | FAIL | PASS | FAIL |
| (0x2029) PARAGRAPH SEPARATOR | PASS | FAIL | PASS | PASS | PASS | FAIL | PASS | FAIL |
| (0x202F) NARROW NO-BREAK SPACE | FAIL | FAIL | PASS | PASS | PASS | FAIL | PASS | FAIL |
| (0x205F) MEDIUM MATHEMATICAL SPACE | FAIL | FAIL | PASS | FAIL | FAIL | FAIL | PASS | FAIL |
| (0x3000) IDEOGRAPHIC SPACE | PASS | FAIL | PASS | PASS | PASS | FAIL | PASS | FAIL |

Tests for Firefox, Safari and Opera were performed on Mac OS X (10.5.8); IE and Chrome on Windows XP Pro SP2 (via VMware); and Konqueror on Ubuntu 9.04 (via VMware).

Edit (28/08/2009)

Clarified operating systems (and their versions) used for testing; aligned characters in the table by code point; updated Opera to 10RC, added Chrome 3 to results, combined FF columns into one, since they are identical; sorted table by code point. Thanks to Dr J R Stockton and Luke Smith for suggestions.

Edit (04/09/2009)

Updated Opera 10RC to Opera 10 (thanks to Garrett Smith for the test); tested and updated the table with results for Safari 2.x and older 3.x versions; fixed a bug in the testcase where the `char` identifier (one of the future reserved words as per ES3) would prevent script parsing in Safari 2.x.

Comments (17)

  1. Rod said on Aug 24, 2009 @ 2:04

    Thanks for that kangax. I had no idea about any of this and I don’t think I’ve ever come across a situation where this might be a problem (yet). But it’s great that you’ve put in the effort to document it and bring it to people’s attention.

  2. Mikuso said on Aug 24, 2009 @ 8:49

    I know this isn’t too helpful, but you could reduce the character class to this:
    [\s\x0B\xA0\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]
    And still cover all bases.

    But I suppose that’s not the real issue. :)
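
Whether a shortened class still “covers all bases” can be checked mechanically against the article’s enumerated list. A rough sketch; `EXPECTED` and `missingFrom` are hypothetical names introduced for this example:

```javascript
// The 25 code points from the article's enumerated character class.
var EXPECTED = [
  0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020, 0x00A0, 0x1680, 0x180E,
  0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008,
  0x2009, 0x200A, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000
];

// Hypothetical helper: returns the expected code points (as "U+XXXX" strings)
// that the given regular expression does NOT match. Pass a non-global regex,
// since a global one carries lastIndex state between test() calls.
function missingFrom(regex) {
  var missing = [];
  for (var i = 0; i < EXPECTED.length; i++) {
    if (!regex.test(String.fromCharCode(EXPECTED[i]))) {
      var hex = EXPECTED[i].toString(16).toUpperCase();
      missing.push('U+' + ('0000' + hex).slice(-4));
    }
  }
  return missing;
}

// The result for a class that contains \s depends on the engine's own \s.
console.log(missingFrom(/[\s\x0B\xA0\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]/));
```

An empty array means the class covers the whole list in the engine running the check; anything else names the characters that slipped through.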

  3. Morgan Roderick said on Aug 24, 2009 @ 17:54

    Wow, great research!!!

    Thanks for sharing!

  4. kangax (article author) said on Aug 26, 2009 @ 16:15

    @Rod, @Morgan Roderick
    Thanks! Glad you liked it.

    @Mikuso
    I wouldn’t want to rely on \s conformance in clients when creating a fully compliant implementation. I would use something like:
    [\x09\x0A\-\x0D\x20\xA0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000\u2028\u2029]

  5. Mikuso said on Aug 27, 2009 @ 3:28

    @kangax
    Surely, you mean [\x09-\x0D\x20\xA0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000\u2028\u2029]?
    (Admittedly, I missed the \u1680 in my previous post.)

    Anyway, no doubt then there must be discrepancies with other pre-defined character classes. Most likely \w and \b.

    Depending on the specific implementation (or perhaps, locale), \w may match a different set of characters to another implementation.

    \b, on the other hand, relies on the definition of \w to perform its function. This one can’t just easily be replaced by a fully-defined character class. It relies upon look-arounds, and doesn’t actually match any single character.

    Your blog title is correct, perfection does kill.
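
In engines that support lookbehind (added to the language in ES2018, long after this discussion), a word boundary can be spelled out with look-arounds over an explicit class, which sidesteps whatever a given engine thinks \w is. A sketch under that assumption; `WORD`, `BOUNDARY` and `markBoundaries` are illustrative names:

```javascript
// Explicit word-character class: ASCII letters, digits and underscore.
var WORD = '[A-Za-z0-9_]';

// A position is a boundary when exactly one side of it is a word character.
var BOUNDARY = new RegExp(
  '(?<=' + WORD + ')(?!' + WORD + ')|(?<!' + WORD + ')(?=' + WORD + ')',
  'g'
);

// Mark every boundary with "|" to make the zero-width matches visible.
function markBoundaries(str) {
  return str.replace(BOUNDARY, '|');
}

console.log(markBoundaries('foo bar')); // "|foo| |bar|"
```

Swapping a different class into `WORD` changes what counts as a boundary, which is exactly what a bare \b does not let you control.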

  6. Dr J R Stockton said on Aug 27, 2009 @ 12:25

    For Mikuso: “ECMAScript Language Specification Edition 3 24-Mar-00” and ECMA Final Draft 5th Edition require RegExp \w to match ONLY the 63 characters a-z A-Z 0-9 _.
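
A quick way to check this claim in a given engine is to count what \w actually matches. A small sketch; `countWordChars` is a name invented for this example:

```javascript
// Count how many UTF-16 code units (0x0000 to 0xFFFF) the engine's \w matches.
function countWordChars() {
  var re = /\w/;
  var count = 0;
  for (var cp = 0; cp <= 0xFFFF; cp++) {
    if (re.test(String.fromCharCode(cp))) {
      count++;
    }
  }
  return count;
}

console.log(countWordChars()); // 63 in an engine that follows the spec
```

Any other number means the engine matches more (or fewer) characters than a-z, A-Z, 0-9 and the underscore.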

  7. Mikuso said on Aug 28, 2009 @ 3:29

    @Dr J R Stockton
    Despite what the spec says, kangax’s post shows us that in reality, the implementation doesn’t always follow the spec.

    Now, I’ve gone and done some testing and I can tell you that the following browsers will match \w to most (if not all) of the upper-ASCII alphabet:
    * Netscape Navigator 8
    * Firefox 1

    I can hear you laughing, and I know that these browsers are ancient and rarely ever seen these days – but it shows that again, the implementation will sometimes differ from the spec.

  8. Tomas said on Dec 28, 2009 @ 9:02

    There is only 1 FAIL in Opera 10.50 (pre-alpha).
    (cp = 0x200B)

  9. kangax (article author) said on Dec 28, 2009 @ 14:14

    @Tomas
    Yes, I noticed it too :) The reason Opera fails U+200B is probably due to conformance to ES5 (ECMA-262, 5th edition), where that character is now considered a whitespace. So… it fails ES3, but passes ES5.

  10. Dr J R Stockton said on Apr 10, 2010 @ 12:20

    In my WinXP sp3 IE8, two copies, \w also matches dotted capital I (İ) (but not undotted lower-case i (ı) ); that affects \b.

    Happily, \d always, in my tests, ignores the 24 Roman numeral characters Ⅰ-Ⅻ ⅰ-ⅻ.

  11. Tim said on Jun 5, 2010 @ 8:28

    @kangax: thanks for this work.

    Apologies for the tangential spin-off nature of my question, but I’m trying to find out which, if any, fonts that ship with Windows and OS X accurately render these various space-glyphs. I am trying to transcribe an ancient manuscript where space-width has linguistic and contextual significance.

  12. Bobby Jack said on Apr 10, 2012 @ 6:55

    Interesting, and – despite being a couple of years old – still relevant. I’ve been taking a look at this very issue this morning, after coming across a problem in some code in which trim() and .replace(/\s/) were behaving differently. So I threw together a quick test page, probably much like the one referenced (although that now seems to be a broken link). What I’ve found:

    * You appear to be missing a Unicode whitespace character from your list, 0x0085 (next line). Whether this is truly a whitespace character or not may be up for debate, but I’ve seen plenty of references to 26 whitespace Unicode characters; there are only 25 in the table above. This one appears to be problematic: in Firefox, neither /\s/ nor trim() consider this char whitespace; in Chrome, on the other hand, trim() considers it whitespace, /\s/ doesn’t.

    * ‘Whitespace characters being consistently recognised as whitespace’ is one problem, the other half of the problem is non-whitespace characters and the behaviour of trim() and /\s/ in regard to them. ‘zero width no break space’ (0xFEFF) is an extreme case, and possibly not worth worrying about, but the whole character that started me off originally is ‘zero width space’ (0x200B), quite a useful one. Firefox handles it well but, in Chrome, trim() considers it whitespace, /\s/ does not. There are 4 other ‘space, but not whitespace’ characters but they appear to be handled correctly across the board.

    In short, roll your own trim() function if you’re concerned about how it will behave with any of these outlying cases.
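
A comparison of this kind can be sketched in a few lines, assuming an engine that provides a native `String.prototype.trim`. `classify` is a hypothetical helper, and the three code points are the ones discussed in the comment above:

```javascript
// Reports how the engine's native trim() and its \s each classify a single
// character. For a one-character string, trim() returns '' exactly when the
// engine's trim() treats that character as whitespace.
function classify(codePoint) {
  var chr = String.fromCharCode(codePoint);
  return {
    trim: chr.trim() === '',
    regex: /\s/.test(chr)
  };
}

// NEXT LINE, ZERO WIDTH SPACE, ZERO WIDTH NO-BREAK SPACE
var suspects = [0x0085, 0x200B, 0xFEFF];
for (var i = 0; i < suspects.length; i++) {
  var r = classify(suspects[i]);
  console.log(
    'U+' + ('0000' + suspects[i].toString(16).toUpperCase()).slice(-4),
    'trim():', r.trim,
    '/\\s/:', r.regex,
    r.trim === r.regex ? '' : '<- disagree'
  );
}
```

Any line flagged with `<- disagree` marks a character for which trim() and .replace(/\s/) would behave differently in that engine.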

Trackbacks

  1. JSToolbox – all about JavaScript » Blog Archive » The trim function in JavaScript said:

    [...] You can read more about this here (in English). From the same source, here are the results [...]

  2. Perfection kills » Sputniktests web runner said:

    [...] regards to the notion of whitespace character. Passing plain U+0020 does the job, but U+00A0 (and a whole slew of other ones) often doesn’t. Instead, NaN is returned for what should really be a [...]

  3. The trim function in JavaScript « Everything for your site said:

    [...] as “whitespace”. You can read more about this here (in English). From the same source, here are the results [...]

  4. JScript and DOM changes in IE9 preview 3 - 隐遁峰 said:

    [...] character class (as in /\s/) still doesn’t match majority of whitespace characters (as defined by specs). These include “U+00A0”, “U+2000” to [...]

  5. Perfection kills » JScript and DOM changes in IE9 preview 3 said:

    [...] character class (as in /\s/) still doesn’t match majority of whitespace characters (as defined by specs). These include “U+00A0”, “U+2000” to [...]
