Perfection kills

Exploring Javascript by example

Experimenting with html minifier

March 9th, 2010 by kangax

In Optimizing HTML, I mentioned that the state of HTML minifiers is rather crude at the moment. We have a large variety of JS and CSS minification tools, but almost no HTML ones. This is actually quite understandable.

First of all, minifying scripts and stylesheets usually results in better savings overall. Second, the nature of document markup is much more dynamic than that of scripts and styles. As a result, HTML minification has to be done “on demand”, and carries a certain overhead. Minification is beneficial only when this overhead is less than the difference in time between delivering the minified and the original document. In some cases, though, savings in document size (and so bandwidth) can be more important than time spent on minification.
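The trade-off can be written down as a rough inequality: minification pays off when the time spent minifying is smaller than the transfer time saved. Here is a minimal sketch of that check. The function name, the bandwidth figure and the timings are all assumptions made up for illustration, not measurements.

```javascript
// Hypothetical helper: is on-demand minification worth it for one response?
// Sizes are in bytes, bandwidth in bytes per second, minifyTime in seconds.
function isMinificationWorthwhile(originalSize, minifiedSize, bandwidth, minifyTime) {
  // Transfer time saved by sending the smaller document.
  var timeSaved = (originalSize - minifiedSize) / bandwidth;
  return minifyTime < timeSaved;
}

// Assumed numbers: 10KB saved on a 100KB/s connection saves 0.1s of transfer.
isMinificationWorthwhile(200 * 1024, 190 * 1024, 100 * 1024, 0.05); // true
isMinificationWorthwhile(200 * 1024, 190 * 1024, 100 * 1024, 1);    // false
```

This ignores latency, caching and compression, so treat it only as a first approximation.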

It’s no surprise that HTML minification is almost always a low-priority optimization. When it comes to client-side performance, there are certainly other, more important things to pay attention to. Only when those other aspects have been taken care of is it worth minifying document markup.

A few weeks ago, I decided to experiment with a Javascript-based HTML minifier and created an online tool with lint-like capabilities. After some tweaking, the script was able to parse and minify markup of almost any random website. The goal was to see how easy it is to implement something like this, learn HTML a bit more, and have fun in the process. Ultimately, I wanted to minify some of the popular websites and see if savings are worth all the trouble.

Today, I’d like to share this tool with you. I’ll quickly go over some of the initial features, explain how the minifier works, and look into possible side effects of minification. Please note that the script is still at a very early stage, and shouldn’t be used in production. If you are not interested in the inner workings, feel free to skip to tests or conclusions.


Screenshot of HTMLMinifier

How it works

Parser

At its core, the minifier relies on the HTML parser by John Resig. John’s parser was capable of handling quite complex documents, but would sometimes trip on some of the more obscure structures. For example, doctype declarations were not understood at all. Whenever an attribute name contained characters like “-” (e.g. as in “http-equiv”), the parser would fail. There were also some deficiencies in the regular expressions for matching comments and CDATA sections: newlines inside them were not accounted for, so multiline comments simply weren’t matched. CDATA sections and comments inside elements with CDATA content model (e.g. SCRIPT and STYLE) were getting stripped for no apparent reason.

All of these are now fixed.
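To illustrate the multiline comment problem: in Javascript regular expressions, “.” does not match line terminators, so a pattern built on it never matches a comment that spans several lines. The two patterns below are my own simplified sketch, not the actual expressions used in the parser.

```javascript
// "." does not match newlines, so this pattern misses multiline comments.
var singleLineOnly = /<!--(.*?)-->/;
// [\s\S] matches any character, including newlines.
var multiline = /<!--([\s\S]*?)-->/;

var comment = '<!-- bar\n\n moo -->';

singleLineOnly.test(comment); // false
multiline.test(comment);      // true

// Stripping every comment from a string with the newline-aware pattern:
var stripped = '<!-- foo --><div>baz</div><!-- bar\n\n moo -->'
  .replace(/<!--[\s\S]*?-->/g, '');
stripped; // '<div>baz</div>'
```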

Minifier

The minifier is a very small “wrapper” on top of the parser. As of now it’s only about 250 LOC. It takes an input string and a configuration object, passes the input string to the parser, and builds the final output according to the specified options.

For example, we can tell it to remove comments:

    var input = '<!-- foo --><div>baz</div><!-- bar\n\n moo -->';
    minify(input, { removeComments: true }); // '<div>baz</div>'

or to collapse boolean attributes:

    var input = '<input disabled="disabled">';
    minify(input, { collapseBooleanAttributes: true }); // '<input disabled>'

Test suite

One of the goals I had for this little project was to have a robust test suite. HTML minifier is fully unit tested with ~100 tests at the moment. This has a few benefits: anyone can change, tweak or add things without worrying about breaking existing functionality. It takes literally seconds to tell if the script is functional in a certain browser (or even in a non-browser implementation, such as node.js on a server), simply by running the test suite. Finally, tests can serve as documentation for how the minifier handles some of the edge cases.

Lint

While working on the minifier, I realized that oftentimes the most wasteful part of the markup is not white space, comments or boolean attributes, but inline styles, scripts, presentational or deprecated elements and attributes. None of these can be simply stripped, as that could affect the state of the document and is just too obtrusive. What can be done, however, is reporting these occurrences to the user. HTMLLint is an even smaller script, whose job is exactly that: to log any deprecated or presentational elements/attributes encountered during parsing. Additionally, it detects event attributes (e.g. onclick, onmouseover, etc.). The rationale for this is that moving contents of event attributes to an external script makes it possible to take advantage of resource caching.
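Here is a rough idea of what such logging could look like for a single start tag. This is a sketch only: lintElement, the shape of the attrs array and the (deliberately partial) list of deprecated elements are assumptions for illustration and do not reflect HTMLLint’s real interface.

```javascript
// Partial, illustrative list of elements deprecated in HTML 4.01.
var deprecatedElements = { font: true, center: true, basefont: true, strike: true, u: true };

// Hypothetical helper: returns log messages for one start tag.
// attrs is assumed to be an array of { name: ..., value: ... } objects.
function lintElement(tag, attrs) {
  var log = [];
  if (deprecatedElements.hasOwnProperty(tag.toLowerCase())) {
    log.push('Deprecated element: ' + tag);
  }
  for (var i = 0; i < attrs.length; i++) {
    // Event attributes all start with "on" (onclick, onmouseover, etc.).
    if (/^on[a-z]+$/i.test(attrs[i].name)) {
      log.push('Event attribute: ' + attrs[i].name);
    }
  }
  return log;
}

lintElement('font', [{ name: 'onclick', value: 'go()' }]);
// ['Deprecated element: font', 'Event attribute: onclick']
```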

Options

Before we begin, it’s important to understand that the minifier parses documents as HTML, not XHTML. This makes it possible to employ such optimizations as “remove optional tags and quotes”, “collapse boolean attributes”, etc. Note that almost none of the options affect document validity, as per HTML 4.01. XHTML support might be added in the future, but considering that in the context of the public web it’s mostly pointless at the moment, I see little reason in doing so. Besides, minifying XHTML documents (given that they’re actually served to clients properly, with “application/xhtml+xml”) doesn’t reduce size as much as if they were HTML.

The following is a list of the current options in the minifier. It is far from exhaustive, and will most likely be extended in the future. Let’s look at each one of them quickly:

Remove comments

    var input = '<!-- some comment --><p>blah</p>';
    var output = minify(input, { removeComments: true });
 
    output; // '<p>blah</p>'

This one should be self-explanatory. Passing truthy removeComments tells minifier to strip HTML comments. Note that comments inside elements with CDATA content model, such as SCRIPT and STYLE, are left intact (but see next option).

    var input = '<script type="text/javascript"><!-- some comment --></script>';
    var output = minify(input, { removeComments: true });
 
    output; // '<script type="text/javascript"><!-- some comment --></script>'

Remove comments from scripts and styles

When this option is enabled, HTML comments in scripts and styles are stripped as well:

    var input = '<script type="text/javascript"><!--\n alert(1) --></script>';
    var output = minify(input, { removeCommentsFromCDATA: true });
 
    output; // '<script type="text/javascript">alert(1)</script>'

It’s worth pointing out that there’s a slight difference in the way HTML comments are treated inside SCRIPT and STYLE elements. In scripts, comment start delimiter (“<!--”) tells parser to ignore everything until newline:

    <!-- alert(1); // alert never happens!
    <!--
    alert(2); // but this one does!
    // "<!--" acts as a single-line JS comment ("//").

In styles, however, “<!--” is simply ignored when it’s present in the beginning of input (I haven’t tested what happens in other parts of a stylesheet). Contrary to script behavior, anything that follows “<!--” still remains present:

    <!-- body { color: red; } -->
 
    /*  treated as:
        body { color: red; }
    */

Explanation of why you might want to strip comments.

Remove CDATA sections

This option removes CDATA sections from script and style elements:

    var input = '<script>/* <![CDATA[ \n\n */alert(1)/* ]]> */</script>';
    var output = minify(input, { removeCDATASectionsFromCDATA: true });
 
    output; // '<script>alert(1)</script>'

Explanation of why you might want to do this.

Collapse whitespace

This option collapses white space that contributes to text nodes in a document tree. For example:

    var input = '<div> <p>    foo </p>    </div>';
    var output = minify(input, { collapseWhitespace: true });
 
    output; // '<div><p>foo</p></div>'

It doesn’t affect significant white space; e.g. in contents of elements like SCRIPT, STYLE, PRE or TEXTAREA.

    var input = '<script>    alert("foo     bar")</script>';
    var output = minify(input, { collapseWhitespace: true });
 
    output; // '<script>alert("foo     bar")</script>'
 
    input = '<textarea>     x x   x </textarea>';
    output = minify(input, { collapseWhitespace: true });
 
    output; // '<textarea>     x x   x </textarea>'

Now, it’s worth mentioning that this modification can have side effects, and significantly change document representation.

For example, markup like <span>foo</span> <span>bar</span> is usually displayed as “foo bar” in browsers, with one space character in between two words. White space in markup is represented as text node in document tree. This text node’s value is a white space (e.g. U+0020), and as long as two adjacent elements are inline-level—as they are in this example—it is this white space that contributes to a gap in between “foo” and “bar”. As soon as we remove that white space (i.e. changing markup to <span>foo</span><span>bar</span>), representation changes from “foo bar” to “foobar”.

There are two ways to work around this issue.

First one is not to rely on such white space for document representation, and instead style elements to have margins and paddings as needed. In previous example, this could have been: <span class="foo">foo</span><span>bar</span> (where foo class would be declared with, say, margin-right: 0.25em;). At first, this might seem like overkill. After all, adding a class seems to defeat the purpose, resulting in larger output when compared to a version with just one white space character. However, depending on the context, giving a few elements a class for styling purposes, and then stripping white space from the entire document, can result in smaller output.

Second option is to never fully remove white space characters, and instead always collapse them to one white space character. HTML 4.01 is actually specified to do just that, so there’s no harm in doing it upfront. Because of this, the following 2 snippets should render identically:

  <span>foo</span>
 
     <span>bar</span>

and:

    <span>foo</span> <span>bar</span>

…with one space in between “foo” and “bar”. Note how in first case, there’s an entire sequence of white space characters (including line breaks).

This second option—collapsing to one white space—has not yet been added to minifier.
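Just to show how small such a change could be, here is a minimal sketch of collapsing to a single space. The function name is hypothetical, and a real implementation would still have to skip elements like PRE and TEXTAREA.

```javascript
// Hypothetical helper: collapse every run of white space into one space.
// Only space, tab, line feed, form feed and carriage return are listed,
// so characters like non-breaking space (U+00A0) are left alone.
function collapseToSingleSpace(text) {
  return text.replace(/[ \t\n\f\r]+/g, ' ');
}

collapseToSingleSpace('<span>foo</span>\n\n     <span>bar</span>');
// '<span>foo</span> <span>bar</span>'
```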

Another noticeable effect white space removal can have on a document is related to CSS white-space property. As I mentioned earlier, by default, adjacent sequences of white space in most of the elements collapse into one space character. But white-space property changes it all. Some of its values result in different collapsing behavior. white-space: pre, for example, makes whitespace render exactly as it occurs in a markup.

As a result, snippet like this:

<span style="white-space:pre;">  foo     bar</span>

renders exactly as is, and becomes:

  foo     bar

As of now, the minifier doesn’t respect space-preserving white-space values (i.e. “pre” and “pre-wrap”). It doesn’t even understand them. Unfortunately, computing elements’ styles and determining their white-space values would be just way too complex and impractical [1]. On the bright side, it seems that the white-space property is not used very often. In the future, it should be possible to add an option to the minifier for preventing certain elements from having their content collapsed. Filtering could be based on a class, a simple selector, or maybe even on parsing an element’s style attribute.
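The simplest of those filters, inspecting the style attribute, could look roughly like this. It is only a sketch under the assumption that the declaration appears inline; preservesWhitespace is a made-up name, and declarations coming from stylesheets are not covered at all.

```javascript
// Hypothetical check: does an inline style ask for white space to be preserved?
// Only "pre" and "pre-wrap" are recognized.
function preservesWhitespace(styleValue) {
  return /(^|;)\s*white-space\s*:\s*pre(-wrap)?\s*(;|$)/i.test(styleValue || '');
}

preservesWhitespace('white-space:pre;');                  // true
preservesWhitespace('color: red; white-space: pre-wrap'); // true
preservesWhitespace('white-space: nowrap');               // false
```

A minifier could call a check like this on each start tag and skip collapsing inside matching elements.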

Collapse boolean attributes

HTML 4.01 has so-called boolean attributes, such as “selected”, “disabled”, “checked”, etc. These may appear in a minimized (collapsed) form, where the attribute value is fully omitted. For example, instead of writing <input disabled="disabled">, we can simply write <input disabled>.

Minifier has an option to perform this optimization, called collapseBooleanAttributes:

    var input = '<input value="foo" readonly="readonly">';
    var output = minify(input, { collapseBooleanAttributes: true });
 
    output; // '<input value="foo" readonly>'

A potential caveat here is that if you target elements by attribute name and value, things might break after applying this optimization. Granted, this kind of case seems rather unreal, but here’s an example. If we had these rules:

    input[disabled] { color: red }
    input[disabled="disabled"] { color: green }
    input:disabled { color: blue }

and markup like <input disabled="disabled">, then after transforming it to <input disabled>, second rule—input[disabled="disabled"]—would stop matching an element. First and third ones, however, would still work as expected. I can’t imagine why someone would use this second version, and you probably won’t ever stumble upon issues like these, but it’s good to be aware of them.

Remove attribute quotes

By default, SGML (which HTML originates from) requires that all attribute values be delimited using either double or single quotes. But in certain cases—when attribute values contain a specific set of characters—quotes can be omitted altogether. Note that HTML specification recommends to always use quotes. There’s also an interesting explanation of why always quoting is a good idea by Jukka Korpela (although none of the dangers he’s talking about apply here). Please, use this optimization with care.

Relevant option is removeAttributeQuotes, and it tells minifier to omit quotes when it is safe to do so:

    var input = '<p class="foo-bar" id="moo" title="blah blah">foo</p>';
    var output = minify(input, { removeAttributeQuotes: true });
 
    output; // '<p class=foo-bar id=moo title="blah blah">foo</p>'
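What does “safe” mean here? HTML 4.01 allows unquoted values only when they consist of letters, digits, hyphens, periods, underscores and colons. A check along those lines could be sketched as follows (canOmitQuotes is a hypothetical name, not necessarily how the minifier does it):

```javascript
// Hypothetical check: quotes may be dropped only if the value is non-empty
// and made up solely of letters, digits, "-", ".", "_" and ":".
function canOmitQuotes(value) {
  return /^[a-zA-Z0-9\-._:]+$/.test(value);
}

canOmitQuotes('foo-bar');   // true
canOmitQuotes('blah blah'); // false (contains a space)
canOmitQuotes('');          // false (empty value still needs quotes)
```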

Remove redundant attributes

Some attributes in HTML 4.01 have default values. For example, input’s type attribute defaults to “text” and form’s method—to “get”. When enabling corresponding option in minifier (removeRedundantAttributes), these default attribute name-value pairs get stripped from the output.

There are also a few other redundancies that are taken care of as part of this optimization.

One of them is removing the deprecated language attribute on SCRIPT elements. It was among the markup smells I mentioned recently. Another one is coexisting “name” and “id” attributes on anchors. And finally, redundant “javascript” labels in event handlers.
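To make this more concrete, here is a sketch of two of these checks. The lookup table holds only the two defaults mentioned above, and both function names are assumptions for illustration rather than the minifier’s actual internals.

```javascript
// Only the two defaults mentioned in the text; a real table would be longer.
var defaultAttributes = {
  input: { type: 'text' },
  form: { method: 'get' }
};

// Hypothetical check: is this attribute just restating its default value?
function isRedundantAttribute(tag, name, value) {
  tag = tag.toLowerCase();
  name = name.toLowerCase();
  if (!defaultAttributes.hasOwnProperty(tag)) {
    return false;
  }
  var defaults = defaultAttributes[tag];
  return defaults.hasOwnProperty(name) && defaults[name] === value.toLowerCase();
}

// Hypothetical helper: drop a leading "javascript:" label from an event handler.
function stripJavascriptLabel(handler) {
  return handler.replace(/^\s*javascript:\s*/i, '');
}

isRedundantAttribute('input', 'type', 'text');     // true
isRedundantAttribute('input', 'type', 'checkbox'); // false
stripJavascriptLabel('javascript:alert(1)');       // 'alert(1)'
```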

Use short doctype

This optimization is the only one affecting document validity. That is, if the document is defined to be anything but HTML5 (such as HTML 4.01). When the useShortDoctype option is enabled, the existing doctype is replaced with its short (HTML5) version: <!DOCTYPE html>. As mentioned before, this replacement is generally pretty safe, but you should decide for yourself if this is something worth doing.
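At the string level, the replacement could be sketched like this. It is an approximation: the function name is made up, and the pattern assumes the doctype contains no “>” before its end (so no internal subset).

```javascript
// Hypothetical helper: swap the first doctype declaration for the short one.
function shortenDoctype(html) {
  return html.replace(/<!DOCTYPE[^>]*>/i, '<!DOCTYPE html>');
}

shortenDoctype(
  '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" ' +
  '"http://www.w3.org/TR/html4/strict.dtd"><p>x</p>'
);
// '<!DOCTYPE html><p>x</p>'
```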

Remove empty (or blank) attributes

The corresponding option is removeEmptyAttributes, and when enabled, all attributes with empty values are simply removed from the output. This includes blank values as well—those consisting of white space only.

    var input = '<p id="" STYLE=" " title="\n" >foo</p>';
    var output = minify(input, { removeEmptyAttributes: true });
 
    output; // '<p>foo</p>'

Note that not all “empty” attributes are removed. For example, both “src” and “alt” attributes are required on IMG elements, so we can’t remove them, even if they’re empty. Right now, only core attributes (id, class, style, title), i18n ones (lang, dir) and event ones (onclick, ondblclick, etc.) are considered “safe” for removal.

The caveat here is that, similar to “collapse boolean attributes” option, this change can affect certain style or script behavior. For example, you might want to target all elements with class attribute—*[class] { ... }. This will apply to elements with empty class, such as <p class="">bar</p>, but obviously not to those without—<p>bar</p>.

This might not be a big issue, but take it into consideration.
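The “safe for removal” rule described above can be sketched as a whitelist check. The function name and the regular expression are my assumptions for illustration; the attribute groups are the ones listed earlier.

```javascript
// Core attributes, i18n attributes, and anything that looks like an event attribute.
var safeToRemoveWhenEmpty = /^(?:id|class|style|title|lang|dir|on[a-z]+)$/i;

// Hypothetical check: attribute is on the whitelist and its value is empty or blank.
function canRemoveEmptyAttribute(name, value) {
  return safeToRemoveWhenEmpty.test(name) && /^\s*$/.test(value);
}

canRemoveEmptyAttribute('id', '');      // true
canRemoveEmptyAttribute('STYLE', ' ');  // true
canRemoveEmptyAttribute('alt', '');     // false (required on IMG)
canRemoveEmptyAttribute('class', 'x');  // false (not empty)
```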

Remove optional tags

Some elements in HTML 4.01 are allowed to have their tags omitted. Optional tags are either end one (e.g. </td>) or both—start and end ones (e.g. <tbody> and </tbody>). Note that start tag can never be optional on its own.

Corresponding option in minifier is removeOptionalTags. Currently, it only strips end tags of HTML, HEAD, BODY, THEAD, TBODY and TFOOT elements. I don’t fully understand the process of creating document tree from “unclosed” markup, so I’m not sure when it’s safe to omit tags like </p>.
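As a string-level approximation of what this option does today, consider the sketch below. The real minifier works on parser events rather than on raw strings, so this is only an illustration, and removeOptionalEndTags is a made-up name.

```javascript
// End tags currently stripped: HTML, HEAD, BODY, THEAD, TBODY and TFOOT.
var optionalEndTags = /<\/(?:html|head|body|thead|tbody|tfoot)\s*>/gi;

// Hypothetical helper; note that it would also touch matching text
// inside scripts or comments, which a parser-based approach avoids.
function removeOptionalEndTags(html) {
  return html.replace(optionalEndTags, '');
}

removeOptionalEndTags('<table><tbody><tr><td>x</td></tr></tbody></table>');
// '<table><tbody><tr><td>x</td></tr></table>'
```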

For example, I can see how removing BODY start tag can have side effects. Let’s say we have a markup like this (with omitted HTML 4.01 doctype, for brevity):

    <head>
      <title>x</title>
    </head>
    <body><script type="text/javascript"></script>
      <p>x</p>
      <script type="text/javascript">
        document.write(document.body.childNodes[0].nodeName);
      </script>
    </body>

and the same markup with HEAD and BODY tags removed:

    <title>x</title>
    <script type="text/javascript"></script>
    <p>x</p>
    <script type="text/javascript">
      document.write(document.body.childNodes[0].nodeName);
    </script>

Note that second version is a perfectly valid document. It just has start and end tags of HEAD and BODY elements omitted. Now what seems to happen here, in a second version, is this:

Browser starts parsing, encounters TITLE tag, and given lack of starting HTML and HEAD tags, creates both elements implicitly (first, HTML, then HEAD as its immediate child). It then continues parsing, up until it stumbles upon P element, which, as per DTD, can not be a child of HEAD. Browser is therefore forced to implicitly close HEAD element, start BODY element, and continue parsing further. P element becomes first child in BODY, and SCRIPT element becomes last child in HEAD.

Now, if we were to display both of these documents, the first one would output “SCRIPT” and the second one “P”. This is because in the original version, the SCRIPT element is defined explicitly to be a child of BODY, and in the modified version it is a child of HEAD (due to the way parsing works). The behavior of the two documents is therefore not identical. We’ve got a “problem”.

Just like with previous “gotchas”, I’m not sure how likely this type of scenario is to appear in real life. From what I can see, the only other element (besides SCRIPT), allowed as child of both—HEAD and BODY, is OBJECT. As for the future, it should be possible to make minifier strip other optional tags as well. But only in safe scenarios.

It’s also worth mentioning that unclosed elements can result in slightly slower parsing times. Unfortunately, there are no extensive benchmarks done on this topic, and results seem to vary across browsers.

Remove empty elements

This optimization is probably one of the most obtrusive ones, which is why it is disabled by default. Think of it as an experimental addition, and employ it with great care. There are dozens of valid use cases for the occurrence of empty elements in a document. They can be used as placeholders for content inserted later with scripting; or for presentational purposes, such as to implement rounded corners, shadows, float clearing, etc. There are probably other cases, which I can’t think of at the moment.

When enabled, minifier simply removes all elements with empty contents (but not those with empty content model, such as IMG, LINK, or BR).

For example:

    var input = '<p></p>';
    var output = minify(input, { removeEmptyElements: true });
 
    output; // ''
 
    input = '<div>blah<span></span></div>';
    output = minify(input, { removeEmptyElements: true });
 
    output; // '<div>blah</div>';

There are a few things to be aware of. First of all, elements containing only other empty elements are not removed. For example:

    var input = '<div><div><div></div></div></div>';
    var output = minify(input, { removeEmptyElements: true });
 
    output; // '<div><div></div></div>'

Note how only inner DIV element—the one with actual empty contents—is removed.

Second of all, only truly empty string is considered an empty content. This does not include spaces, newlines, or other white space characters:

    var input = '<p> </p>'; // note one space character in between
    var output = minify(input, { removeEmptyElements: true });
 
    output; // '<p> </p>'

Also note that comments are parsed as separate entities and so don’t affect “emptiness” of elements:

    var input = '<p><!-- comment --></p>';
    var output = minify(input, { removeEmptyElements: true });
 
    output; // ''

As with other optimizations, some of these limitations will likely be removed in the future.

Validate input through HTML lint

This option simply toggles linting. You can create new HTMLLint object and pass it to minifier. During minification, lint object silently logs all “suspicious” activity. It exposes populate method, which accepts element and inserts its log into this element:

    var lint = new HTMLLint();
    minify(' some input... ', { lint: lint });
 
    lint.populate(document.getElementById('someElement'));

Field-testing

So how does minifier stand against real-life markup? Let’s take a look at minification results of some of the popular websites (note that when gzip’ing documents, 6th level of compression (default) was used):

Amazon.com

Original size: 217KB (35.8KB gzipped)
Minified size: 206.6KB (34.3KB gzipped)
Savings: 10.4KB (1.5KB gzipped)

Minifying the home page of amazon.com saves about 10KB with an uncompressed document, and only 1.5KB with a compressed one. What’s interesting is that the humongous 217KB is actually a result of a myriad of inline styles and scripts scattered throughout the document. Replacing those with external scripts would be the best optimization. Getting rid of occasional style attributes would help too.

Digg.com

Original size: 82KB (18.2KB gzipped)
Minified size: 74.9KB (17.2KB gzipped)
Savings: 7KB (1KB gzipped)

On digg.com, reduction is slightly smaller—7KB (1KB gzipped). The markup is not as cluttered as on amazon, but still has smells: inline scripts (and unnecessary comments in them), deprecated attributes, anchors defunct without scripting, etc. The benefits of minification are rather small here.

Ajaxian.com

Original size: 177.6KB (32.4KB gzipped)
Minified size: 157.3KB (29.7KB gzipped)
Savings: 20.3KB (2.7KB gzipped)

Trying out the home page of ajaxian.com, we see a difference of ~20KB, an even better reduction in size. And again, compressed documents show savings of only 2.7KB. Speaking of compression, ajaxian.com shamelessly serves its 177KB-large document uncompressed. There’s also some redundant markup, like unnecessary &nbsp;’s, excessive style attributes, lots of comments, and a few inline scripts. Removing all of those and turning on compression would be the ultimate optimization.

Linkedin.com

Original size: 128.8KB (19.8KB gzipped)
Minified size: 89.4KB (17.1KB gzipped)
Savings: 39.4KB (2.7KB gzipped)

linkedin.com surprises with savings of almost 40KB (!) after minification. Looking at the source, we see that the large number is explained by an excessive amount of whitespace. This is a good example of how carelessly used whitespace can add up to a huge number like this. And again, gzip saves the day; minifying the compressed document reduces it by only 2.7KB.

ECMAScript language specification

Original size: 703KB (122.5KB gzipped)
Minified size: 572KB (106.4KB gzipped)
Savings: 131KB (16KB gzipped)

Large static documents are where HTML minification truly shines, and the HTML version of the ECMAScript (3rd ed.) language specification is a clear demonstration of it. Minifying the document results in savings of 131KB (!) for an uncompressed document, and 16KB for a compressed one. Since the document is served statically, there’s hardly any reason not to apply minification here.
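To compare sites of different sizes, it helps to express these savings as percentages. The helper below is just arithmetic over the numbers reported above; savingsPercent is a name I made up for this sketch.

```javascript
// Relative savings in percent, rounded to one decimal place.
function savingsPercent(originalKB, minifiedKB) {
  return Math.round((originalKB - minifiedKB) / originalKB * 1000) / 10;
}

savingsPercent(128.8, 89.4);  // linkedin.com, uncompressed: 30.6
savingsPercent(19.8, 17.1);   // linkedin.com, gzipped: 13.6
savingsPercent(703, 572);     // ECMAScript spec, uncompressed: 18.6
savingsPercent(122.5, 106.4); // ECMAScript spec, gzipped: 13.1
```

In relative terms, linkedin.com’s uncompressed home page benefits the most, while after gzip the savings on both documents are around 13%.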

Cost and benefits

It’s pretty obvious that the best candidates for HTML minification are large static documents. Or just static documents: FAQ’s, standalone articles, etc. Anything that can’t be compressed (e.g. if there are not enough access rights to enable gzip on a server) would benefit from minification as well. Even when serving gzipped content, it’s worth remembering that not everyone is getting gzip. So clients that are being sent gzipped content could receive a 2-3KB smaller file, whereas those receiving uncompressed content could end up with files up to a whopping 10-20KB smaller than the original ones.

One of the biggest problems I see, when it comes to dynamic minification, is the possibility of error. The core of the issue is that minification relies on parsing, and parsing HTML is a pretty tricky business. Even though minifier applies a strict set of rules—removing quotes and optional tags only when it is absolutely safe to do so, a single misplaced character in start tag can trip parser and wreak havoc on an entire document. This is especially relevant when there’s an inclusion of user-generated content.

As an example, browsers usually understand empty end tags (allowed in HTML)—<p> test </>, but parser, which minifier is based on, would immediately choke here and stop. Another example is attributes containing “weird” characters—<a href="http://example.com""> test </a> (note trailing quote after an attribute). Many browsers happily parse this element, ignoring trailing quote. But parser, once again, falls short and bails out.

It’s certainly possible to tame errors and simply output original, uncompressed document. But this brings us to another downside—time spent on minification. Even when errors are not an issue, there’s an actual overhead of parsing and processing document tree. Minifying home page of amazon.com in pretty speedy nightly webkit, for example, takes exactly 1 second. Most of that time is consumed by parsing. 1 second is quite a lot. An acceptable time for real-time minification would be somewhere around 50-100ms. This problem can be mitigated by optimizing parser, or porting script to be executed in a faster environment (v8 on a server?).
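Taming errors and keeping an eye on time could be combined in a small wrapper like the one below. This is a sketch: minifyWithFallback is a hypothetical name, and minifyFn stands for whatever minification function is being wrapped.

```javascript
// Hypothetical wrapper: run minifyFn, fall back to the original input on error,
// and report how long the attempt took (in milliseconds).
function minifyWithFallback(input, minifyFn) {
  var start = new Date().getTime();
  var output;
  try {
    output = minifyFn(input);
  } catch (e) {
    // Parsing failed; serve the original, unminified document.
    output = input;
  }
  return { output: output, elapsed: new Date().getTime() - start };
}

// A stand-in for a minifier that chokes on its input.
var failingMinify = function () { throw new Error('parse error'); };

minifyWithFallback('<p> test </>', failingMinify).output; // '<p> test </>'
```

The elapsed value could then be compared against a budget (say, the 50-100ms mentioned above) to decide whether real-time minification is worth keeping on.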

Curiously, Opera 10.50 beta (on Mac OS X) managed to beat WebKit and completed this task almost twice as fast (~500ms). Unfortunately, this version suffers from some bugs in regex matching, and fails half of the test suite. Hopefully, those issues will be resolved in later revisions.

Another interesting performance observation was with the V8 engine. When testing with version 1.3.x, the time it took to minify the amazon.com home page was 0.6 secs. However, version 2.1.2.6 (currently the latest stable) performed the same task in an excruciatingly long 2 seconds.

Future

I can think of many other things to improve in the minifier. Unfortunately, I don’t have much time to work on it. The project is licensed under MIT, and is free for use/modification by anyone interested. The test suite should make collaboration easy. There’s a short todo list at the bottom of the project page. Among other things, it lists some of the known bugs.

As always, any questions, corrections, and suggestions are very much welcomed.

Enjoy.

1. “white-space: pre” declaration could be part of a rule from within an external stylesheet; getting computed style would require downloading, parsing and analyzing every single stylesheet linked from the document (or imported from within another stylesheet).

Categories: html, optimizations

Javascript quiz

February 8th, 2010 by kangax

I was recently reminded about Dmitry Baranovsky’s Javascript test, when N. Zakas answered and explained it in a blog post. First time I saw those questions explained was by Richard Cornford in comp.lang.javascript, although not as thoroughly as by Nicholas.

I decided to come up with my own little quiz. I wanted to keep the questions not too obscure, practical, yet challenging. They would also cover a wider range of topics.

Host objects

Contrary to Dmitry’s test, the quiz does not involve host objects (e.g. window), as their behavior is unspecified and can vary sporadically across implementations. We are talking about pure ECMAScript (3rd ed.) behavior. Now, it’s worth pointing out that sometimes implementations deviate from the standard collectively, forming their own, de-facto standard. An example of this is the for-in statement, where none of the popular implementations throw TypeError when the expression evaluates to null or undefined (e.g. for (var prop in null) { ... }) and instead just silently ignore it. I tried to avoid these non-standard cases. Every question has a correct answer that can be reproduced in at least one of the major implementations.
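The for-in deviation is easy to observe. In the snippet below (variable names are mine), neither loop throws and neither loop body ever runs in popular implementations:

```javascript
var visited = 0;

// Per ES3, evaluating these should throw TypeError;
// in practice the loops are silently skipped.
for (var prop in null) { visited++; }
for (var prop2 in undefined) { visited++; }

visited; // 0
```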

So what are we testing?

Not a lot really. Quiz mainly focuses on knowledge of scoping, function expressions (and how they differ from function declarations), references, process of variable and function declaration, order of evaluation, and a couple more things like delete operator and object instantiation. These are all relatively simple concepts, which I think every professional Javascript developer should know. Most of these are applied in practice quite often. Ideally, even if you can’t answer a question, you should be able to infer answer from specs (without executing the snippet). When creating these questions, I made sure I can answer each one of them off the top of my head, to keep things relatively simple.

Note, however, that not all questions are very practical, so don’t worry if you can’t answer some of them. We don’t often use with statement, for example, so failing to know/remember its exact behavior is understandable.

Few notes about code

  • Assuming ECMAScript 3rd edition (not 5th)
  • Implementation quirks do not count (assuming standard behavior only)
  • Every snippet is run as a global code (not as eval or function one)
  • There are no other variables declared (and host environment is not extended with anything beyond what’s defined in specs)
  • Answer should correspond to exact return value of entire expression/statement (or last line)
  • “Error” in answer indicates that overall snippet results in a runtime error

Quiz

Please make sure you select an answer for each question, as lack of an answer is not checked and counts as a failure. The final score is simply the number of wrong answers; fewer is better. Quiz requires Javascript to be enabled.

  1.

        (function(){ 
          return typeof arguments;
        })();
  2.

        var f = function g(){ return 23; };
        typeof g();
  3.

        (function(x){
          delete x;
          return x;
        })(1);
  4.

        var y = 1, x = y = typeof x;
        x;
  5.

        (function f(f){ 
          return typeof f(); 
        })(function(){ return 1; });
  6.

        var foo = { 
          bar: function() { return this.baz; }, 
          baz: 1
        };
        (function(){ 
          return typeof arguments[0]();
        })(foo.bar);
  7.

        var foo = {
          bar: function(){ return this.baz; },
          baz: 1
        }
        typeof (f = foo.bar)();
  8.

        var f = (function f(){ return "1"; }, function g(){ return 2; })();
        typeof f;
  9.

        var x = 1;
        if (function f(){}) {
          x += typeof f;
        }
        x;
  10.

        var x = [typeof x, typeof y][1];
        typeof typeof x;
  11.

        (function(foo){
          return typeof foo.bar;
        })({ foo: { bar: 1 } });
  12.

        (function f(){
          function f(){ return 1; }
          return f();
          function f(){ return 2; }
        })();
  13.

        function f(){ return f; }
        new f() instanceof f;
  14.

        with (function(x, undefined){}) length;

I hope you liked it. Please leave your score in the comments. I’ll try to explain these questions sometime in the near future, unless someone else does it before me. Meanwhile, you can take a look at my articles on function expressions and the delete operator, understanding which would help you answer some of these questions, and more importantly, explain their answers.

Categories: ECMA-262, Quiz, [[Delete]], delete operator, with

Understanding delete

January 10th, 2010 by kangax
  1. Theory
  2. Firebug confusion
  3. Browsers compliance
  4. IE bugs
  5. Misconceptions
  6. `delete` and host objects
  7. ES5 strict mode
  8. Summary

A couple of weeks ago, I had a chance to glance through Stoyan Stefanov’s Object-Oriented Javascript. The book had an exceptionally high rating on Amazon (12 reviews with 5 stars), so I was curious to see if it was something worth recommending. I started reading through the chapter on functions, and really enjoyed the way things were explained there; the flow of examples was structured in such a nice, progressive way that it seemed even beginners would grasp it easily. However, almost immediately I stumbled upon an interesting misconception present throughout the entire chapter: deleting functions. There were some other mistakes (such as the difference between function declarations and function expressions), but we aren’t going to talk about them now.

The book claims that “function is treated as a normal variable—it can be copied to a different variable and even deleted.”. Following that explanation, there is this example:

  >>> var sum = function(a, b) {return a + b;} 
  >>> var add = sum; 
  >>> delete sum
  true
  >>> typeof sum;
  "undefined"

Ignoring a couple of missing semicolons, can you see what’s wrong with this snippet? The problem, of course, is that deleting the sum variable should not succeed; the delete expression should not evaluate to true, and typeof sum should not result in “undefined”. All because it’s not possible to delete variables in Javascript. At least not when they are declared in such a way.

So what’s going on in this example? Is it a typo? A diversion? Probably not. This whole snippet is actually a real output from the Firebug console, which Stoyan must have been using for quick testing. It’s almost as if Firebug follows some other rules of deletion. It is Firebug that has led Stoyan astray! So what is really going on here?

To answer this question, we need to understand how the delete operator works in Javascript: what exactly can and cannot be deleted and why. Today I’ll try to explain this in detail. We’ll take a look at Firebug’s “weird” behavior and realize that it’s not all that weird; we’ll delve into what’s going on behind the scenes when declaring variables, functions, assigning properties and deleting them; we’ll look at browsers’ compliance and some of the most notorious bugs; we’ll also talk about the strict mode of the 5th edition of ECMAScript, and how it changes the delete operator’s behavior.

I’ll be using Javascript and ECMAScript interchangeably to really mean ECMAScript (unless explicitly talking about Mozilla’s JavaScript™ implementation).

Unsurprisingly, explanations of delete on the web are rather scarce. The MDC article is probably the most comprehensive resource, but unfortunately misses a few interesting details about the subject; curiously, one of these forgotten things is the cause of Firebug’s tricky behavior. The MSDN reference is practically useless.

Theory

So why is it that we can delete object properties:

  var o = { x: 1 }; 
  delete o.x; // true
  o.x; // undefined

but not variables, declared like this:

  var x = 1; 
  delete x; // false
  x; // 1

or functions, declared like this:

  function x(){}
  delete x; // false
  typeof x; // "function"

Note that delete only returns false when a property can not be deleted.

To understand this, we first need to grasp such concepts as variable instantiation and property attributes, something that’s unfortunately rarely covered in books on Javascript. I’ll try to go over these very concisely in the next few paragraphs. It’s not hard to understand them at all! If you don’t care about why things work the way they work, feel free to skip this chapter.

Type of code

There are 3 types of executable code in ECMAScript: Global code, Function code and Eval code. These types are somewhat self-descriptive, but here’s a short overview:

  1. When a source text is treated as a Program, it is executed in a global scope, and is considered a Global code. In a browser environment, content of SCRIPT elements is usually parsed as a Program, and is therefore evaluated as a Global code.
  2. Anything that’s executed directly within a function is, quite obviously, considered a Function code. In browsers, content of event attributes (e.g. <p onclick="...">) is usually parsed and treated as a Function code.
  3. Finally, text that’s supplied to a built-in eval function is parsed as Eval code. We will soon see why this type is special.

Execution context

When ECMAScript code executes, it always happens within a certain execution context. An execution context is a somewhat abstract entity, which helps to understand how scope and variable instantiation work. For each of the three types of executable code, there’s an execution context. When a function is executed, it is said that control enters the execution context for Function code; when Global code executes, control enters the execution context for Global code, and so on.

As you can see, execution contexts can logically form a stack. First there might be Global code with its own execution context; that code might call a function, with its own execution context; that function could call another function, and so on and so forth. Even if a function is calling itself recursively, a new execution context is entered with every invocation.
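As a small illustration (the function name `countdown` is made up for this sketch), each recursive call below enters a fresh execution context, so every invocation gets its own `local` variable:

```javascript
// Hypothetical helper: records the value of `local` seen by each invocation.
function countdown(n, seen) {
  var local = n;            // lives in this invocation's own context
  if (n > 0) {
    countdown(n - 1, seen); // enters a new execution context
  }
  seen.push(local);         // still the value set by *this* invocation
  return seen;
}

var result = countdown(2, []);
// result is [0, 1, 2]: the innermost context finishes first
```

If all invocations shared a single set of variables, the inner calls would have clobbered `local` and the recorded values would all be the same.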

Activation object / Variable object

Every execution context has a so-called Variable object associated with it. Similarly to the execution context, the Variable object is an abstract entity, a mechanism to describe variable instantiation. Now, the interesting part is that variables and functions declared in a source text are actually added as properties of this Variable object.

When control enters execution context for Global code, a Global object is used as a Variable object. This is precisely why variables or functions declared globally become properties of a Global object:

  /* remember that `this` refers to global object when in global scope */
  var GLOBAL_OBJECT = this;
 
  var foo = 1;
  GLOBAL_OBJECT.foo; // 1
  foo === GLOBAL_OBJECT.foo; // true
 
  function bar(){}
  typeof GLOBAL_OBJECT.bar; // "function"
  GLOBAL_OBJECT.bar === bar; // true

Ok, so global variables become properties of Global object, but what happens with local variables — those declared in Function code? The behavior is actually very similar: they become properties of Variable object. The only difference is that when in Function code, a Variable object is not a Global object, but a so-called Activation object. Activation object is created every time execution context for Function code is entered.

Not only do variables and functions declared within Function code become properties of Activation object; this also happens with each of function arguments (under names corresponding to formal parameters) and a special Arguments object (under arguments name). Note that Activation object is an internal mechanism and is never really accessible by program code.

  (function(foo){
 
    var bar = 2;
    function baz(){}
 
    /*
    In abstract terms,
 
    Special `arguments` object becomes a property of containing function's Activation object: 
      ACTIVATION_OBJECT.arguments; // Arguments object
 
    ...as well as argument `foo`:
      ACTIVATION_OBJECT.foo; // 1
 
    ...as well as variable `bar`:
      ACTIVATION_OBJECT.bar; // 2
 
    ...as well as function declared locally:
      typeof ACTIVATION_OBJECT.baz; // "function"
    */
 
  })(1);

Finally, variables declared within Eval code are created as properties of calling context’s Variable object. Eval code simply uses Variable object of the execution context that it’s being called within:

  var GLOBAL_OBJECT = this;
 
  /* `foo` is created as a property of calling context Variable object,
      which in this case is a Global object */
 
  eval('var foo = 1;');
  GLOBAL_OBJECT.foo; // 1
 
  (function(){
 
    /* `bar` is created as a property of calling context Variable object,
      which in this case is an Activation object of containing function */
 
    eval('var bar = 1;');
 
    /* 
      In abstract terms, 
      ACTIVATION_OBJECT.bar; // 1
    */
 
  })();

Property attributes

We are almost there. Now that it’s clear what happens with variables (they become properties), the only remaining concept to understand is property attributes. Every property can have zero or more attributes from the following set — ReadOnly, DontEnum, DontDelete and Internal. You can think of them as flags — an attribute can either exist on a property or not. For the purposes of today’s discussion, we are only interested in DontDelete.
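These ES3 attributes are internal and cannot be inspected from script. As a rough illustration only, ECMAScript 5 (whose strict mode is discussed later in this post) exposes a similar flag, `configurable`, through `Object.defineProperty`; a property with `configurable: false` behaves much like one with DontDelete. The object and property names below are made up, and `Reflect.deleteProperty` (from later editions of the language) stands in for `delete` only so that the result is a plain boolean whether or not the surrounding code is strict:

```javascript
var o = {};

// Roughly "DontDelete": the property cannot be removed.
Object.defineProperty(o, 'locked', { value: 1, configurable: false });

// Roughly "no DontDelete": the property can be removed.
Object.defineProperty(o, 'free', { value: 2, configurable: true });

var lockedDeleted = Reflect.deleteProperty(o, 'locked'); // false
var freeDeleted = Reflect.deleteProperty(o, 'free');     // true

'locked' in o; // true
'free' in o;   // false
```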

When declared variables and functions become properties of a Variable object (either an Activation object for Function code, or the Global object for Global code), these properties are created with the DontDelete attribute. However, any explicit (or implicit) property assignment creates a property without the DontDelete attribute. And this is essentially why we can delete some properties, but not others:

  var GLOBAL_OBJECT = this;
 
  /*  `foo` is a property of a Global object.
      It is created via variable declaration and so has DontDelete attribute.
      This is why it can not be deleted. */
 
  var foo = 1;
  delete foo; // false
  typeof foo; // "number"
 
  /*  `bar` is a property of a Global object.
      It is created via function declaration and so has DontDelete attribute.
      This is why it can not be deleted either. */
 
  function bar(){}
  delete bar; // false
  typeof bar; // "function"
 
  /*  `baz` is also a property of a Global object.
      However, it is created via property assignment and so has no DontDelete attribute.
      This is why it can be deleted. */
 
  GLOBAL_OBJECT.baz = 'blah';
  delete GLOBAL_OBJECT.baz; // true
  typeof GLOBAL_OBJECT.baz; // "undefined"

Built-ins and DontDelete

So this is what it’s all about: a special attribute on a property that controls whether this property can be deleted or not. Note that some of the properties of built-in objects are specified to have DontDelete, and so can not be deleted. Special arguments variable (or, as we know now, a property of Activation object) has DontDelete. length property of any function instance has DontDelete as well:

  (function(){
 
    /* can't delete `arguments`, since it has DontDelete */
 
    delete arguments; // false
    typeof arguments; // "object"
 
    /* can't delete function's `length`; it also has DontDelete */
 
    function f(){}
    delete f.length; // false
    typeof f.length; // "number"
 
  })();

Properties corresponding to function arguments are created with DontDelete as well, and so can not be deleted either:

  (function(foo, bar){
 
    delete foo; // false
    foo; // 1
 
    delete bar; // false
    bar; // 'blah'
 
  })(1, 'blah');

Undeclared assignments

As you might remember, undeclared assignment creates a property on a global object. That is unless that property is found somewhere in the scope chain before global object. And now that we know the difference between property assignment and variable declaration — latter one sets DontDelete, whereas former one doesn’t — it should be clear why undeclared assignment creates a deletable property:

  var GLOBAL_OBJECT = this;
 
  /* create global property via variable declaration; property has DontDelete */
  var foo = 1;
 
  /* create global property via undeclared assignment; property has no DontDelete */
  bar = 2;
 
  delete foo; // false
  typeof foo; // "number"
 
  delete bar; // true
  typeof bar; // "undefined"

Note that attributes are determined at the moment a property is created (for a property created by assignment, none are set). Later assignments don’t modify the attributes of an existing property. It’s important to understand this distinction.

  /* `foo` is created as a property with DontDelete */
  function foo(){}
 
  /* Later assignments do not modify attributes. DontDelete is still there! */
  foo = 1;
  delete foo; // false
  typeof foo; // "number"
 
  /* But assigning to a property that doesn't exist, 
     creates that property with empty attributes (and so without DontDelete) */
 
  this.bar = 1;
  delete bar; // true
  typeof bar; // "undefined"

Firebug confusion

So what happens in Firebug? Why is it that variables declared in console can be deleted, contrary to what we have just learned? Well, as I said before, Eval code has a special behavior when it comes to variable declaration. Variables declared within Eval code are actually created as properties without DontDelete:

  eval('var foo = 1;');
  foo; // 1
  delete foo; // true
  typeof foo; // "undefined"

and, similarly, when called within Function code:

  (function(){
 
    eval('var foo = 1;');
    foo; // 1
    delete foo; // true
    typeof foo; // "undefined"
 
  })();

And this is the gist of Firebug’s abnormal behavior. All the text in console seems to be parsed and executed as Eval code, not as a Global or Function one. Obviously, any declared variables end up as properties without DontDelete, and so can be easily deleted. Be aware of these differences between regular Global code and Firebug console.
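A minimal sketch of the difference, assuming a console that simply hands whatever you type to `eval` (a guess about Firebug’s internals, not something confirmed here). Both snippets are run through the `Function` constructor so that they execute as non-strict Function code wherever this is pasted; the variable names are made up:

```javascript
// Declared directly in Function code: the property gets DontDelete.
var declaredDirectly = new Function(
  'var foo = 1;' +
  'var deleted = delete foo;' +
  'return [deleted, typeof foo];'
)();

// Declared through eval, the way a console might run typed text:
// the property is created without DontDelete.
var declaredViaEval = new Function(
  'eval("var foo = 1;");' +
  'var deleted = delete foo;' +
  'return [deleted, typeof foo];'
)();

declaredDirectly; // [false, "number"]
declaredViaEval;  // [true, "undefined"]
```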

Deleting variables via eval

This interesting eval behavior, coupled with another aspect of ECMAScript can technically allow us to delete non-deletable properties. The thing about function declarations is that they can overwrite same-named variables in the same execution context:

  function x(){ }
  var x;
  typeof x; // "function"

Note how the function declaration takes precedence and overwrites the same-named variable (or, in other words, the same property of the Variable object). This is because a variable declaration leaves an existing same-named property untouched, whereas a function declaration is allowed to overwrite it. Not only does a function declaration replace the previous value of a property, it also replaces that property’s attributes. If we declare a function via eval, that function should also replace that property’s attributes with its own. And since declarations made from within eval create properties without DontDelete, instantiating this new function should essentially remove the existing DontDelete attribute from the property in question, making that property deletable (and of course changing its value to reference the newly created function).

  var x = 1;
 
  /* Can't delete, `x` has DontDelete */
 
  delete x; // false
  typeof x; // "number"
 
  eval('function x(){}');
 
  /* `x` property now references function, and should have no DontDelete */
 
  typeof x; // "function"
  delete x; // should be `true`
  typeof x; // should be "undefined"

Unfortunately, this kind of spoofing doesn’t work in any implementation I tried. I might be missing something here, or this behavior might simply be too obscure for implementors to pay attention to.

Browsers compliance

Knowing how things work in theory is useful, but practical implications are paramount. Do browsers follow standards when it comes to variable/property creation/deletion? For the most part, yes.

I wrote a simple test suite to check compliance of delete operator with Global code, Function code and Eval code. Test suite checks both — return value of delete operator, and whether properties are deleted (or not) as they are supposed to. delete return value is not as important as its actual results. It’s not very crucial if delete returns true instead of false, but it’s important that properties with DontDelete are not deleted and vice versa.

Modern browsers are generally pretty compliant. Besides this eval peculiarity I mentioned earlier, the following browsers pass test suite fully: Opera 7.54+, Firefox 1.0+, Safari 3.1.2+, Chrome 4+.

Safari 2.x and 3.0.4 have problems with deleting function arguments; those properties seem to be created without DontDelete, so it is possible to delete them. Safari 2.x has even more problems — deleting non-reference (e.g. delete 1) throws error; function declarations create deletable properties (but, strangely, not variable declarations); variable declarations in eval become non-deletable (but not function declarations).

Similar to Safari, Konqueror (3.5, but not 4.3) throws error when deleting non-reference (e.g. delete 1) and erroneously makes function arguments deletable.

Gecko DontDelete bug

Gecko 1.8.x browsers (Firefox 2.x, Camino 1.x, Seamonkey 1.x, etc.) exhibit an interesting bug where explicitly assigning to a property can remove its DontDelete attribute, even if that property was created via variable or function declaration:

    function foo(){}
    delete foo; // false (as expected)
    typeof foo; // "function" (as expected)
 
    /* now assign to a property explicitly */
 
    this.foo = 1; // erroneously clears DontDelete attribute
    delete foo; // true
    typeof foo; // "undefined"
 
    /* note that this doesn't happen when assigning property implicitly */
 
    function bar(){}
    bar = 1;
    delete bar; // false
    typeof bar; // "number" (although assignment replaced property)

Surprisingly, Internet Explorer 5.5 – 8 passes test suite fully except that deleting non-reference (e.g. delete 1) throws error (just like in older Safari). But there are actually more serious bugs in IE, that are not immediately apparent. These bugs are related to Global object.

IE bugs

The entire chapter just for bugs in Internet Explorer? How unexpected!

In IE (at least, 6-8), the following expression throws error (when evaluated in Global code):

    this.x = 1;
    delete x; // TypeError: Object doesn't support this action

and this one as well, but different exception, just to make things interesting:

    var x = 1;
    delete this.x; // TypeError: Cannot delete 'this.x'

It’s as if variable declarations in Global code do not create properties on Global object in IE. Creating property via assignment (this.x = 1) and then deleting it via delete x throws error. Creating property via declaration (var x = 1) and then deleting it via delete this.x throws another error.

But that’s not all. Creating property via explicit assignment actually always throws error on deletion. Not only is there an error, but created property appears to have DontDelete set on it, which of course it shouldn’t have:

    this.x = 1;
 
    delete this.x; // TypeError: Object doesn't support this action
    typeof x; // "number" (still exists, wasn't deleted as it should have been!)
 
    delete x; // TypeError: Object doesn't support this action
    typeof x; // "number" (wasn't deleted again)

Now, contrary to what one would think, undeclared assignments (those that should create a property on global object) do create deletable properties in IE:

    x = 1;
    delete x; // true
    typeof x; // "undefined"

But if you try to delete such a property by referencing it via this in Global code (delete this.x), a familiar error pops up:

    x = 1;
    delete this.x; // TypeError: Cannot delete 'this.x'

If we were to generalize this behavior, it would appear that delete this.x from within Global code never succeeds. When property in question is created via explicit assignment (this.x = 1), delete throws one error; when property is created via undeclared assignment (x = 1) or via declaration (var x = 1), delete throws another error.

delete x, on the other hand, only throws error when property in question is created via explicit assignment — this.x = 1. If a property is created via declaration (var x = 1), deletion simply never occurs and delete correctly returns false. If a property is created via undeclared assignment (x = 1), deletion works as expected.

I was pondering this issue back in September, and Garrett Smith suggested that in IE “The global variable object is implemented as a JScript object, and the global object is implemented by the host.” Garrett used Eric Lippert’s blog entry as a reference.

We can somewhat confirm this theory by performing a few tests. Note how this and window seem to reference the same object (if we can believe the === operator), but the Variable object (the one on which the function is declared) is different from whatever this references.

    /* in Global code */
    function getBase(){ return this; }
 
    getBase() === this.getBase(); // false
    this.getBase() === this.getBase(); // true
    window.getBase() === this.getBase(); // true
    window.getBase() === getBase(); // false

Misconceptions

The beauty of understanding why things work the way they work should not be underestimated. I’ve seen a few misconceptions on the web related to misunderstanding of the delete operator. For example, there’s this answer on Stackoverflow (with a surprisingly high rating), confidently explaining how “delete is supposed to be no-op when target isn’t an object property”. Now that we understand the core of delete behavior, it becomes pretty clear that this answer is rather inaccurate. delete doesn’t differentiate between variables and properties (in fact, for delete, those are all References) and really only cares about the DontDelete attribute (and property existence).

It’s also interesting to see how misconceptions bounce off of each other, where in the very same thread someone first suggests to just delete variable (which won’t work unless it’s declared from within eval), and another person provides a wrong correction how it’s possible to delete variables in Global code but not in Function one.

Be careful with Javascript explanations on the web, and ideally, always seek to understand the core of the issue ;)

`delete` and host objects

An algorithm for delete is specified roughly like this:

  1. If operand is not a reference, return true
  2. If object has no direct property with such name, return true (where, as we now know, object can be Activation object or Global object)
  3. If property exists but has DontDelete, return false
  4. Otherwise, remove property and return true
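The steps above can be sketched as a toy model. Everything here (`makeObject`, `declare`, `assign`, `modelDelete`, the `{ value, dontDelete }` records) is invented for illustration and is not how any engine actually stores properties:

```javascript
// Toy model of an object: name -> { value, dontDelete }.
function makeObject() {
  return { props: Object.create(null) };
}

// Variable/function declaration: property is created with DontDelete.
function declare(obj, name, value) {
  obj.props[name] = { value: value, dontDelete: true };
}

// Assignment: creates the property with no attributes if it is missing,
// and never changes the attributes of an existing property.
function assign(obj, name, value) {
  if (name in obj.props) {
    obj.props[name].value = value;
  } else {
    obj.props[name] = { value: value, dontDelete: false };
  }
}

// `ref` is { base, name } for a reference, or null for a non-reference.
function modelDelete(ref) {
  if (ref === null) return true;                 // step 1
  var props = ref.base.props;
  if (!(ref.name in props)) return true;         // step 2
  if (props[ref.name].dontDelete) return false;  // step 3
  delete props[ref.name];                        // step 4
  return true;
}

var scope = makeObject();
declare(scope, 'foo', 1);
assign(scope, 'bar', 2);
assign(scope, 'foo', 3); // value changes, DontDelete stays

var r1 = modelDelete({ base: scope, name: 'foo' });     // false
var r2 = modelDelete({ base: scope, name: 'bar' });     // true
var r3 = modelDelete({ base: scope, name: 'missing' }); // true
var r4 = modelDelete(null);                             // true
```

Note that the model returns true both when a property is removed and when there was nothing to remove, which is why the return value alone says little about what happened.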

However, the behavior of the delete operator with host objects can be rather unpredictable. And there’s actually nothing wrong with that: host objects are allowed (by specification) to implement any kind of behavior for operations such as read (internal [[Get]] method), write (internal [[Put]] method) or delete (internal [[Delete]] method), among a few others. This allowance for custom [[Delete]] behavior is what makes host objects so chaotic.

We’ve already seen some IE oddities, where deleting certain objects (which are apparently implemented as host objects) throws errors. Some versions of Firefox throw when trying to delete window.location. You can’t trust return values of delete either, when it comes to host objects; take a look at what happens in Firefox:

    /* "alert" is a direct property of `window` (if we were to believe `hasOwnProperty`) */
    window.hasOwnProperty('alert'); // true
 
    delete window.alert; // true
    typeof window.alert; // "function"

Deleting window.alert returns true, even though there’s nothing about this property that should lead to such result. It resolves to a reference (so can’t return true on the first step). It’s a direct property of a window object (so can’t return true on a second step). The only way delete could return true is after reaching step 4 and actually deleting a property. Yet, property is never deleted.
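Host objects cannot be written in plain script, but a `Proxy` (a feature of later ECMAScript editions, used here purely as an analogy) lets ordinary code supply its own delete behavior, which gives a feel for what a host object is permitted to do. The `fakeWindow` object below is made up and has nothing to do with how Firefox implements `window`:

```javascript
// An object whose delete handler claims success without removing anything.
var fakeWindow = new Proxy(
  { alert: function () {} },
  {
    deleteProperty: function (target, name) {
      return true; // report success, but leave `target` untouched
    }
  }
);

var returned = delete fakeWindow.alert; // true
typeof fakeWindow.alert;                // "function"
```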

The moral of the story is to never trust host objects.

ES5 strict mode

So what does strict mode of ECMAScript 5th edition bring to the table? A few restrictions are introduced. A SyntaxError is now thrown when the expression in a delete operator is a direct reference to a variable, function argument or function identifier. In addition, if the property has internal [[Configurable]] == false, a TypeError is thrown:

  (function(foo){
 
    "use strict"; // enable strict mode within this function
 
    var bar;
    function baz(){}
 
    delete foo; // SyntaxError (when deleting argument)
    delete bar; // SyntaxError (when deleting variable)
    delete baz; // SyntaxError (when deleting variable created with function declaration)
 
    /* `length` of function instances has { [[Configurable]] : false } */
 
    delete (function(){}).length; // TypeError
 
  })();

In addition, deleting an undeclared variable (or, in other words, an unresolved Reference) throws a SyntaxError as well:

    "use strict";
    delete i_dont_exist; // SyntaxError

This is somewhat similar to the way undeclared assignment in strict mode behaves (except that ReferenceError is thrown instead of a SyntaxError):

    "use strict";
    i_dont_exist = 1; // ReferenceError

As you now understand, all these restrictions somewhat make sense, given how much confusion deleting variables, function declarations and arguments causes. Instead of silently ignoring deletion, strict mode takes more aggressive and descriptive measures.
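Because the SyntaxError is raised when the code is parsed rather than when it runs, it cannot be caught with try/catch placed around the `delete` itself. One way to observe it, sketched here with a made-up helper `errorNameOf`, is to compile each snippet separately with the `Function` constructor:

```javascript
// Hypothetical helper: compiles and runs `body` as strict Function code and
// returns the name of the error it raises, or null if it raises none.
function errorNameOf(body) {
  try {
    new Function('"use strict";' + body)();
    return null;
  } catch (e) {
    return e.name;
  }
}

errorNameOf('var bar; delete bar;');          // "SyntaxError"
errorNameOf('delete i_dont_exist;');          // "SyntaxError"
errorNameOf('i_dont_exist = 1;');             // "ReferenceError"
errorNameOf('delete Object.prototype;');      // "TypeError" (non-configurable)
errorNameOf('var o = { x: 1 }; delete o.x;'); // null
```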

Summary

This post turned out to be quite lengthy, so I’m not going to talk about things like removing array items with delete and what the implications of it are. You can always refer to MDC article for that particular explanation (or read specs and experiment yourself).

Here’s a short summary of how deletion works in Javascript:

  • Variables and function declarations are properties of either Activation or Global objects.
  • Properties have attributes, one of which — DontDelete — is responsible for whether a property can be deleted.
  • Variable and function declarations in Global and Function code always create properties with DontDelete.
  • Function arguments are also properties of Activation object and are created with DontDelete.
  • Variable and function declarations in Eval code always create properties without DontDelete.
  • New properties are always created with empty attributes (and so without DontDelete).
  • Host objects are allowed to react to deletion however they want.

If you’d like to get more familiar with things described here, please refer to ECMA-262 3rd edition specification.

I hope you enjoyed this overview and learned something new. Any questions, suggestions and corrections are, as always, welcome.

Categories: DontDelete, ECMA-262, [[Delete]], delete operator, review, strict-mode
