Perfection kills

Exploring Javascript by example

What’s wrong with extending the DOM

April 5th, 2010 by kangax

I was recently surprised to find out how little the topic of DOM extensions is covered on the web. What’s disturbing is that the downsides of this seemingly useful practice don’t seem to be well known, except in certain secluded circles. The lack of information could well explain why there are scripts and libraries built today that still fall into this trap. I’d like to explain why extending the DOM is generally a bad idea, by showing some of the problems associated with it. We’ll also look at possible alternatives to this harmful exercise.

But first of all, what exactly is DOM extension? And how does it all work?

How DOM extension works

DOM extension is simply the process of adding custom methods/properties to DOM objects. Custom properties are those that don’t exist in a particular implementation. And what are the DOM objects? These are host objects implementing Element, Event, Document, or any of dozens of other DOM interfaces. During extension, methods/properties can be added to objects directly, or to their prototypes (but only in environments that have proper support for it).

The most commonly extended objects are probably DOM elements (those that implement Element interface), popularized by Javascript libraries like Prototype and Mootools. Event objects (those that implement Event interface), and documents (Document interface) are often extended as well.

In an environment that exposes the prototype of Element objects, an example of DOM extension would look something like this:

  Element.prototype.hide = function() {
    this.style.display = 'none';
  };
  ...
  var element = document.createElement('p');
 
  element.style.display; // ''
  element.hide();
  element.style.display; // 'none'

As you can see, “hide” function is first assigned to a hide property of Element.prototype. It is then invoked directly on an element, and element’s “display” style is set to “none”.

This “works” because the object referred to by Element.prototype is actually one of the objects in the prototype chain of the P element. When the hide property is resolved on the element, it is searched for along the prototype chain until it is found on this Element.prototype object.

In fact, if we were to examine prototype chain of P element in some of the modern browsers, it would usually look like this:

  // "^" denotes connection between objects in prototype chain
 
  document.createElement('p');
    ^
  HTMLParagraphElement.prototype
    ^
  HTMLElement.prototype
    ^
  Element.prototype
    ^
  Node.prototype
    ^
  Object.prototype
    ^
  null

Note how the nearest ancestor in the prototype chain of P element is object referred to by HTMLParagraphElement.prototype. This is an object specific to type of an element. For P element, it’s HTMLParagraphElement.prototype; for DIV element, it’s HTMLDivElement.prototype; for A element, it’s HTMLAnchorElement.prototype, and so on.
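The chain can also be inspected programmatically. The following is a minimal sketch that uses plain constructor functions as stand-ins for the DOM interfaces, so that it runs outside a browser. FakeNode, FakeElement, FakeParagraph and the chainOf helper are hypothetical names introduced for illustration; they are not part of any DOM.

```javascript
// Stand-ins for DOM interfaces; hypothetical, for illustration only.
function FakeNode() {}
function FakeElement() {}
function FakeParagraph() {}

// Wire up inheritance: FakeParagraph -> FakeElement -> FakeNode -> Object
FakeElement.prototype = Object.create(FakeNode.prototype);
FakeElement.prototype.constructor = FakeElement;
FakeParagraph.prototype = Object.create(FakeElement.prototype);
FakeParagraph.prototype.constructor = FakeParagraph;

// Collect the constructor name of every object in the prototype chain.
function chainOf(object) {
  var names = [];
  var proto = Object.getPrototypeOf(object);
  while (proto !== null) {
    names.push(proto.constructor.name);
    proto = Object.getPrototypeOf(proto);
  }
  return names;
}

// A method added higher in the chain is visible on every descendant.
FakeElement.prototype.hide = function () {
  this.hidden = true;
  return this;
};

var p = new FakeParagraph();
var chain = chainOf(p);       // ['FakeParagraph', 'FakeElement', 'FakeNode', 'Object']
var resolved = typeof p.hide; // 'function'
```

In a browser that exposes these prototype objects, the same kind of walk with Object.getPrototypeOf could presumably be applied to a real element, although the exact chain depends on the implementation.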

But why such strange names, you might ask?

These names actually correspond to interfaces defined in DOM Level 2 HTML Specification. That same specification also defines inheritance between those interfaces. It says, for example, that “… HTMLParagraphElement interface have all properties and functions of the HTMLElement interface …” (source) and that “… HTMLElement interface have all properties and functions of the Element interface …” (source), and so on.

Quite obviously, if we were to create a property on “prototype object” of paragraph element, that property would not be available on, say, anchor element:

  HTMLParagraphElement.prototype.hide = function() {
    this.style.display = 'none';
  };
  ...
  typeof document.createElement('a').hide; // "undefined"
  typeof document.createElement('p').hide; // "function"

This is because the anchor element’s prototype chain never includes the object referred to by HTMLParagraphElement.prototype, but instead includes the one referred to by HTMLAnchorElement.prototype. To “fix” this, we can assign to a property of an object positioned further up the prototype chain, such as the one referred to by HTMLElement.prototype, Element.prototype or Node.prototype.

Similarly, creating a property on Element.prototype would not make it available on all nodes, but only on nodes of element type. If we wanted to have property on all nodes (e.g. text nodes, comment nodes, etc.), we would need to assign to property of Node.prototype instead. And speaking of text and comment nodes, this is how interface inheritance usually looks for them:

  document.createTextNode('foo'); // < Text.prototype < CharacterData.prototype < Node.prototype
  document.createComment('bar'); // < Comment.prototype < CharacterData.prototype < Node.prototype

Now, it’s important to understand that exposure of these DOM object prototypes is not guaranteed. DOM Level 2 specification merely defines interfaces, and inheritance between those interfaces. It does not state that there should exist global Element property, referencing object that’s a prototype of all objects implementing Element interface. Neither does it state that there should exist global Node property, referencing object that’s a prototype of all objects implementing Node interface.

Internet Explorer 7 (and below) is an example of such an environment; it does not expose global Node, Element, HTMLElement, HTMLParagraphElement, or other properties. Another such browser is Safari 2.x (and most likely Safari 1.x).

So what can we do in environments that don’t expose these global “prototype” objects? A workaround is to extend DOM objects directly:

  var element = document.createElement('p');
  ...
  element.hide = function() {
    this.style.display = 'none'; 
  };
  ...
  element.style.display; // ''
  element.hide();
  element.style.display; // 'none'

What went wrong?

Being able to extend DOM elements through prototype objects sounds amazing. We are taking advantage of Javascript’s prototypal nature, and scripting the DOM becomes very object-oriented. In fact, DOM extension seemed so temptingly useful that a few years ago, the Prototype Javascript library made it an essential part of its architecture. But what hides behind this seemingly innocuous practice is a huge load of trouble. As we’ll see in a moment, when it comes to cross-browser scripting, the downsides of this approach far outweigh any benefits. DOM extension is one of the biggest mistakes Prototype.js has ever made.

So what are these problems?

Lack of specification

As I have already mentioned, exposure of “prototype objects” is not part of any specification. DOM Level 2 merely defines interfaces and their inheritance relations. For an implementation to conform fully to DOM Level 2, there’s no need to expose those global Node, Element, HTMLElement, etc. objects. Neither is there a requirement to expose them in any other way. Given that it’s always possible to extend DOM objects manually, this doesn’t seem like a big issue. But the truth is that manual extension is a rather slow and inconvenient process (as we will see shortly). And the fact that fast, “prototype object”-based extension is merely a de facto standard among a few browsers makes this practice unreliable when it comes to future adoption or portability across non-conventional platforms (e.g. mobile devices).

Host objects have no rules

The next problem with DOM extension is that DOM objects are host objects, and host objects are the worst bunch. By specification (ECMA-262, 3rd ed.), host objects are allowed to do things no other objects can even dream of. To quote the relevant section [8.6.2]:

Host objects may implement these internal methods with any implementation-dependent behaviour, or it may be that a host object implements only some internal methods and not others.

The internal methods the specification talks about are [[Get]], [[Put]], [[Delete]], etc. Note how it says that the behavior of internal methods is implementation-dependent. What this means is that it’s absolutely normal for a host object to throw an error on invocation of, say, the [[Get]] method. And unfortunately, this isn’t just a theory. In Internet Explorer, we can easily observe exactly this, a host object’s [[Get]] throwing an error:

  document.createElement('p').offsetParent; // "Unspecified error."
  new ActiveXObject("MSXML2.XMLHTTP").send; // "Object doesn't support this property or method"

Extending DOM objects is kind of like walking in a minefield. By definition, you are working with something that’s allowed to behave in an unpredictable and completely erratic way. And not only can things blow up; there’s also a possibility of silent failures, which is an even worse scenario. Examples of erratic behavior are applet, object and embed elements, which in certain cases throw errors on assignment of properties. A similar disaster happens with XML nodes:

  var xmlDoc = new ActiveXObject("Microsoft.XMLDOM");
  xmlDoc.loadXML('<foo>bar</foo>');
  xmlDoc.firstChild.foo = 'bar'; // "Object doesn't support this property or method"

There are other cases of failures in IE, such as document.styleSheets[99999] throwing “Invalid procedure call or argument” or document.createElement('p').filters throwing “Member not found.” exceptions. But not only MSHTML DOM is the problem. Trying to overwrite “target” property of event object in Mozilla throws TypeError, complaining that property has only a getter (meaning that it’s readonly and can not be set). Doing same thing in WebKit, results in silent failure, where “target” continues to refer to original object after assignment.

When creating API for working with event objects, there’s now a need to consider all of these readonly properties, instead of focusing on concise and descriptive names.
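Both outcomes of writing to a getter-only property can be reproduced with ordinary objects. The sketch below is only a simulation under that assumption: createFakeEvent, tryStrictWrite and trySilentWrite are hypothetical helpers, and a plain object with a getter is not how either browser implements its event objects.

```javascript
// A plain object imitating an event whose `target` has only a getter.
function createFakeEvent(originalTarget) {
  var event = {};
  Object.defineProperty(event, 'target', {
    get: function () { return originalTarget; }
  });
  return event;
}

// Strict-mode write: throws TypeError (the behavior described for Mozilla).
function tryStrictWrite(event, value) {
  'use strict';
  try {
    event.target = value;
    return 'assigned';
  } catch (error) {
    return error instanceof TypeError ? 'TypeError' : 'unexpected error';
  }
}

// Reflect.set reports a failed write by returning false instead of throwing,
// which mirrors the silent failure described for WebKit.
function trySilentWrite(event, value) {
  return Reflect.set(event, 'target', value);
}

var original = { id: 'original' };
var fakeEvent = createFakeEvent(original);

var strictOutcome = tryStrictWrite(fakeEvent, { id: 'replacement' }); // 'TypeError'
var silentOutcome = trySilentWrite(fakeEvent, { id: 'replacement' }); // false
var stillOriginal = fakeEvent.target === original;                    // true
```

Either way, the property keeps its original value, so an API that wants to expose its own “target” has to pick a different name or keep the value somewhere other than the event object.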

A good rule of thumb is to avoid touching host objects as much as possible. Trying to base architecture on something that—by definition—can behave so sporadically is hardly a good idea.

Chance of collisions

An API based on DOM element extensions is hard to scale. It’s hard to scale for developers of the library (when adding new or changing core API methods) and for library users (when adding domain-specific extensions). The root of the issue is the high chance of collisions. DOM implementations in popular browsers usually all have proprietary APIs. What’s worse is that these APIs are not static, but constantly change as new browser versions come out. Some parts get deprecated; others are added or modified. As a result, the set of properties and methods present on DOM objects is somewhat of a moving target.

Given the huge number of environments in use today, it becomes impossible to tell whether a certain property is already part of some DOM. And if it is, can it be overwritten? Or will it throw an error when attempting to do so? Remember that it’s a host object! And if we can quietly overwrite it, how would it affect other parts of the DOM? Would everything still work as expected? If everything is fine in one version of such a browser, is there a guarantee that the next version doesn’t introduce a same-named property? The list of questions goes on.

Some examples of proprietary extensions that broke Prototype are wrap property on textareas in IE (colliding with Element#wrap method), and select method on form control elements in Opera (colliding with Element#select method). Even though both of these cases are documented, having to remember these little exceptions is annoying.

Proprietary extensions are not the only problem. HTML5 brings new methods and properties to the table. And most of the popular browsers have already started implementing them. At some point, WebForms defined replace property on input elements, which Opera decided to add to their browser. And once again, it broke Prototype, due to conflict with Element#replace method.

But wait, there’s more!

Due to long-standing DOM Level 0 tradition, there’s this “convenient” way to access form controls off of form elements, simply by their name. What this means is that instead of using standard elements collection, you can access form control like this:

  <form action="">
    <input name="foo">
  </form>
  ...
  <script type="text/javascript">
    document.forms[0].foo; // non-standard access
    // compare to
    document.forms[0].elements.foo; // standard access
  </script>

So, say you extend form elements with login method, which for example checks validation and submits login form. If you also happen to have form control with “login” name (which is pretty likely, if you ask me), what happens next is not pretty:

  <form action="">
    <input name="login">
    ...
  </form>
  ...
  <script type="text/javascript">
    HTMLFormElement.prototype.login = function(){ 
      return 'logging in'; 
    };
    ...
    $(myForm).login(); // boom!
    // $(myForm).login references input element, not `login` method
  </script>

Every named form control shadows properties inherited through prototype chain. The chance of collisions and unexpected errors on form elements is even higher.
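The shadowing itself is ordinary property lookup: an own property always wins over an inherited one. Here is a rough sketch with plain objects, where FakeForm is a hypothetical stand-in for HTMLFormElement and the own “login” property imitates what a browser exposes for a control named “login”.

```javascript
// Stand-in for HTMLFormElement; hypothetical, so the sketch runs without a DOM.
function FakeForm() {}

FakeForm.prototype.login = function () {
  return 'logging in';
};

var plainForm = new FakeForm();
var formWithLoginField = new FakeForm();

// Imitate <input name="login">: the control becomes reachable as a property
// on the form itself, hiding the inherited method of the same name.
Object.defineProperty(formWithLoginField, 'login', {
  value: { tagName: 'INPUT', name: 'login' },
  configurable: true
});

var worksOnPlainForm = plainForm.login();              // 'logging in'
var typeOnNamedForm = typeof formWithLoginField.login; // 'object'

var callFailed = false;
try {
  formWithLoginField.login();
} catch (error) {
  callFailed = error instanceof TypeError; // calling a non-function throws
}

// The inherited method is still there; it is just unreachable by name.
var stillOnPrototype =
  typeof Object.getPrototypeOf(formWithLoginField).login; // 'function'
```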

Situation is somewhat similar with named form elements, where they can be accessed directly off document by their names:

  <form name="foo">
    ...
  </form>
  ...
  <script type="text/javascript">
    document.foo; // [object HTMLFormElement]
  </script>

When extending document objects, there’s now an additional risk of form names conflicting with extensions. And what if script is running in legacy applications with tons of rusty HTML, where changing/removing such names is not a trivial task?

Employing some kind of prefixing strategy can alleviate the problem, but will probably also bring extra noise.

Not modifying objects you don’t own is the ultimate recipe for avoiding collisions. Breaking this rule already got Prototype into trouble, when it overwrote document.getElementsByClassName with its own, custom implementation. Following it also means playing nice with other scripts running in the same environment, no matter if they modify DOM objects or not.

Performance overhead

As we’ve seen before, browsers that don’t support element extensions (like IE 6, 7, Safari 2.x, etc.) require manual object extension. The problem is that manual extension is slow, inconvenient and doesn’t scale. It’s slow because an object needs to be extended with what’s often a large number of methods/properties. And ironically, these browsers are the slowest ones around. It’s inconvenient because an object first needs to be extended in order to be operated on. So instead of document.createElement('p').hide(), you would need to do something like $(document.createElement('p')).hide(). This, by the way, is one of the most common stumbling blocks for beginners with Prototype. Finally, manual extension doesn’t scale well because adding API methods affects performance pretty much linearly. If there are 100 methods on Element.prototype, 100 assignments have to be made to the element in question; if there are 200 methods, 200 assignments have to be made, and so on.
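To make the linear cost concrete, here is a minimal sketch of manual extension. It is not Prototype’s actual implementation: the Methods table, the extend function, the _extended flag and createFakeElement are all hypothetical, and plain objects stand in for elements so the sketch runs without a DOM.

```javascript
// Hypothetical method table; a real library would hold many more entries.
var Methods = {
  hide: function () { this.style.display = 'none'; return this; },
  show: function () { this.style.display = ''; return this; }
};

var assignmentCount = 0;

// Copy every method onto the object unless it was already extended.
function extend(element) {
  if (element._extended) return element;
  for (var name in Methods) {
    if (Object.prototype.hasOwnProperty.call(Methods, name)) {
      element[name] = Methods[name];
      assignmentCount++;
    }
  }
  element._extended = true;
  return element;
}

// Plain objects standing in for elements.
function createFakeElement() {
  return { style: { display: '' } };
}

var elements = [];
for (var i = 0; i < 1000; i++) {
  elements.push(extend(createFakeElement()));
}

var methodCount = Object.keys(Methods).length;
// assignmentCount now equals methodCount * elements.length
var display = elements[0].hide().style.display; // 'none'
```

Every method added to the table costs one more assignment per element, which is exactly the linear growth described above.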

Another performance hit is with event objects. Prototype follows a similar approach with events and extends them with a certain set of methods. Unfortunately, some events in browsers (mousemove, mouseover, mouseout and resize, to name a few) can fire literally dozens of times per second. Extending each one of them is an incredibly expensive process. And what for? Just to invoke what could be a single method on the event object?

Finally, once you start extending elements, the library API most likely needs to return extended elements everywhere. As a result, querying methods like $$ could end up extending every single element in a query. It’s easy to imagine the performance overhead of such a process when we’re talking about hundreds or thousands of elements.

IE DOM is a mess

As shown in previous section, manual DOM extension is a mess. But manual DOM extension in IE is even worse, and here’s why.

We all know that in IE, circular references between host and native objects leak, and are best avoided. But adding methods to DOM elements is a first step towards creation of such circular references. And since older versions of IE don’t expose “object prototypes”, there’s not much to do but extend elements directly. Circular references and leaks are almost inevitable. And in fact, Prototype suffered from them for most of its lifetime.

Another problem is the way IE DOM maps properties and attributes to each other. The fact that attributes are in the same namespace as properties increases the chance of collisions and all kinds of unexpected inconsistencies. What happens if an element has a custom “show” attribute and is then extended by Prototype? You’ll be surprised, but the “show” attribute would get overwritten by Prototype’s Element#show method. extendedElement.getAttribute('show') would return a reference to a function, not the value of the “show” attribute. Similarly, extendedElement.hasAttribute('hide') would say “true”, even if there was never a custom “hide” attribute on the element. Note that IE<8 lacks hasAttribute, but we could still see the attribute/property conflict: typeof extendedElement.attributes['show'] != "undefined".

Finally, one of the lesser-known downsides is the fact that adding properties to DOM elements causes reflow in IE, so mere extension of element becomes a quite expensive operation. This actually makes sense, given the deficient mapping of attributes and properties in its DOM.

Bonus: browser bugs

If everything we’ve been over so far is not enough (in which case, you’re probably a masochist), here are a couple more bugs to top it all off.

In some versions of Safari 3.x, there’s a bug where navigating to a previous page via back button wipes off all host object extensions. Unfortunately, the bug is undetectable, so to work around the issue, Prototype has to do something horrible. It sniffs browser for that version of WebKit, and explicitly disables bfcache by attaching “unload” event listener to window. Disabled bfcache means that browser has to re-fetch page when navigating via back/forward buttons, instead of restoring page from the cached state.

Another bug is with HTMLObjectElement.prototype and HTMLAppletElement.prototype in IE8, and the way object and applet elements don’t inherit from those prototype objects. You can assign to a property of HTMLObjectElement.prototype, but that property is never “resolved” on object element. Ditto for applets. As a result, those elements always have to be extended manually, which is another overhead.

IE8 also exposes only a subset of prototype objects, when compared to other popular implementations. For example, there’s HTMLParagraphElement.prototype (as well as other type-specific ones), and Element.prototype, but no HTMLElement (and so no HTMLElement.prototype) or Node (and so no Node.prototype). Element.prototype in IE8 also doesn’t inherit from Object.prototype. These are not bugs, per se, but are something to keep in mind nevertheless: there’s nothing good about trying to extend a non-existent Node, for example.

Wrappers to the rescue

One of the most common alternatives to this whole mess of DOM extension is object wrappers. This is the approach jQuery has taken from the start, and few other libraries followed later on. The idea is simple. Instead of extending elements or events directly, create a wrapper around them, and delegate methods accordingly. No collisions, no need to deal with host objects madness, easier to manage leaks and operate in dysfunctional MSHTML DOM, better performance, saner maintenance and painless scaling.

And you still avoid procedural approach.
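A minimal sketch of the wrapper idea follows. The ElementWrapper constructor, the wrap function and their methods are hypothetical names chosen for illustration (this is not jQuery’s API), and a plain object stands in for a DOM element so the sketch runs without a browser.

```javascript
// Wrapper constructor; the name and API are hypothetical.
function ElementWrapper(element) {
  this.element = element;
}

// Methods live on the wrapper's own prototype, never on the host object.
ElementWrapper.prototype.hide = function () {
  this.element.style.display = 'none';
  return this; // allow chaining
};

ElementWrapper.prototype.show = function () {
  this.element.style.display = '';
  return this;
};

ElementWrapper.prototype.isVisible = function () {
  return this.element.style.display !== 'none';
};

function wrap(element) {
  return new ElementWrapper(element);
}

// Stand-in for a DOM element that already has a conflicting `hide` property.
var fakeElement = { style: { display: '' }, hide: 'some host property' };

var wrapped = wrap(fakeElement);
var visibleAfterHide = wrapped.hide().isVisible(); // false
var visibleAfterShow = wrapped.show().isVisible(); // true
var hostPropertyIntact = fakeElement.hide;         // 'some host property'
```

Because the only thing written to the element is its style, a same-named property on the host object neither breaks the wrapper nor gets overwritten by it, and creating a wrapper costs one object regardless of how many methods the API has.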

Prototype 2.0

The good news is that this Prototype mistake is something that’s going away in the next major version of the library. As far as I know, all core developers understand the problems mentioned above, and that the wrapper approach is the saner way to move forward. I’m not sure what the plans are in other DOM-extending libraries like Mootools. From what I can see, they are already using wrappers with events, but still extend elements. I’m certainly hoping they move away from this madness in the near future.

Controlled environments

So far we’ve looked at DOM extension from the point of view of a cross-browser scripting library. In that context, it’s clear how troublesome this idea really is. But what about controlled environments, where a script is only run in one or two environments, such as those based on Gecko, WebKit or any other modern non-MSHTML DOM? Perhaps it’s an intranet application that’s accessed through certain browsers. Or a desktop, WebKit-based app.

In that case, the situation is definitely better. Let’s look at the points listed above.

Lack of specification becomes somewhat irrelevant, as there’s no need to worry about compatibility with other platforms, or future editions. Most of the non-MSHTML DOM environments have exposed DOM object prototypes for quite a while, and are unlikely to drop them in the near future. There’s still a possibility for change, however.

Point about host objects unreliability also loses its weight, since host objects in Gecko or WebKit -based DOMs are much, much saner than those in MSHTML DOM. But they are still host objects, and so should be treated with care. Besides, there are readonly properties covered before, which could easily cripple the flexibility of API.

The point about collisions still holds weight. These environments support non-standard form controls access, have proprietary API, and are constantly implementing new HTML5 features. Modifying objects you don’t own is still a wicked idea and can lead to hard-to-find bugs and inconsistencies.

Performance overhead is practically non-existent, as these DOMs support prototype-based DOM extension. Performance can actually be even better compared to, say, the wrapper approach, as there’s no need to create any additional objects in order to invoke methods (or access properties) off DOM objects.

Extending the DOM in a controlled environment sure seems like a perfectly healthy thing to do. But even though collisions are the main remaining problem, I would still advise employing wrappers instead. It’s a safer way to move forward, and will save you from maintenance overhead in the future.

Afterword

Hopefully, you can now clearly see the whole truth behind what looks like an elegant approach. Next time you design a Javascript framework, just say no to DOM extensions. Say no, and save yourself from all the trouble of maintaining a cumbersome API and suffering unnecessary performance overhead. If, on the other hand, you’re considering employing a Javascript library that extends the DOM, stop for a second, and ask yourself if you’re willing to take the risk. Is the elusive convenience of DOM extension really worth all the trouble?

Categories: DOMLint, ECMA-262, annoyances, don'ts

Experimenting with html minifier

March 9th, 2010 by kangax

In Optimizing HTML, I mentioned that state of HTML minifiers is rather crude at the moment. We have a large variety of JS and CSS minification tools, but almost no HTML ones. This is actually quite understandable.

First of all, minifying scripts and stylesheets usually results in better savings overall. Second, the nature of document markup is much more dynamic than that of scripts and styles. As a result, HTML minification has to be done “on demand”, and carries a certain overhead. Only when this overhead is less than the difference in time for delivering the minified versus the original document is there a benefit in minification. In some cases, though, savings in document size (and so bandwidth) can be more important than time spent on minification.

It’s no surprise that HTML minification is almost always a low-priority optimization. When it comes to client-side performance, there are certainly other, more important things to pay attention to. Only when those other aspects have been taken care of is it worth minifying document markup.

A few weeks ago, I decided to experiment with a Javascript-based HTML minifier and created an online tool with lint-like capabilities. After some tweaking, the script was able to parse and minify the markup of almost any random website. The goal was to see how easy it is to implement something like this, learn a bit more about HTML, and have fun in the process. Ultimately, I wanted to minify some of the popular websites and see if the savings are worth all the trouble.

Today, I’d like to share this tool with you. I’ll quickly go over some of the initial features, explain how minifier works, and look into possible side effects of minification. Please note that the script is still in very early stage, and shouldn’t be used in production. If you are not interested in inner workings, feel free to skip to tests or conclusions.


Screenshot of HTMLMinifier

How it works

Parser

At its core, the minifier relies on the HTML parser by John Resig. John’s parser was capable of handling quite complex documents, but would sometimes trip on some of the more obscure structures. For example, doctype declarations were not understood at all. Whenever an attribute name contained characters like “-” (e.g. as in “http-equiv”), the parser would fail. There were also some deficiencies in the regular expressions for matching comments and CDATA sections: newlines inside them were not accounted for, so multiline comments simply weren’t matched. CDATA sections and comments inside elements with CDATA content model (e.g. SCRIPT and STYLE) were getting stripped for no apparent reason.

All of these are now fixed.

Minifier

Minifier is a very small “wrapper” on top of parser. As of now it’s only about 250 LOC. It takes input string and configuration object; passes this input string to parser, and builds final output according to specified options.

For example, we can tell it to remove comments:

    var input = '<!-- foo --><div>baz</div><!-- bar\n\n moo -->';
    minify(input, { removeComments: true }); // '<div>baz</div>'

or to collapse boolean attributes:

    var input = '<input disabled="disabled">';
    minify(input, { collapseBooleanAttributes: true }); // '<input disabled>'

Test suite

One of the goals I had for this little project was to have a robust test suite. HTML minifier is fully unit tested, with ~100 tests at the moment. This has a few benefits. Anyone can change, tweak or add things without worrying about breaking existing functionality. It takes literally seconds to tell if the script is functional in a certain browser (or even in a non-browser implementation, such as node.js on a server), simply by running the test suite. Finally, tests can serve as documentation for how the minifier handles some of the edge cases.

Lint

While working on the minifier, I realized that oftentimes the most wasteful part of the markup is not white space, comments or boolean attributes, but inline styles, scripts, and presentational or deprecated elements and attributes. None of these can be simply stripped, as that could affect the state of the document and is just too obtrusive. What can be done, however, is reporting these occurrences to the user. HTMLLint is an even smaller script, whose job is exactly that: to log any deprecated or presentational elements/attributes encountered during parsing. Additionally, it detects event attributes (e.g. onclick, onmouseover, etc.). The rationale for this is that moving the contents of event attributes to an external script makes it possible to take advantage of resource caching.
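The kind of check described here can be sketched in a few lines. This is an illustration only, not HTMLLint’s actual code: lintElement, its message format and the (partial) list of HTML 4.01 deprecated elements are assumptions made for the example.

```javascript
// Partial, illustrative list of elements deprecated in HTML 4.01;
// not necessarily the list HTMLLint uses.
var DEPRECATED_ELEMENTS = ['font', 'center', 'strike', 'u'];

// Return lint messages for one start tag, given its name and attribute names.
function lintElement(tagName, attributeNames) {
  var messages = [];
  var tag = tagName.toLowerCase();

  if (DEPRECATED_ELEMENTS.indexOf(tag) !== -1) {
    messages.push('deprecated element: ' + tag);
  }

  attributeNames.forEach(function (name) {
    // Event attributes start with "on" (onclick, onmouseover, ...).
    if (/^on[a-z]+$/i.test(name)) {
      messages.push('event attribute: ' + name.toLowerCase() + ' on ' + tag);
    }
  });

  return messages;
}

var report = lintElement('FONT', ['color', 'onclick']);
// ['deprecated element: font', 'event attribute: onclick on font']
var clean = lintElement('p', ['class']);
// []
```

A function like this would presumably be called from the parser’s start-tag handler, so linting costs little more than the parse that minification already requires.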

Options

Before we begin, it’s important to understand that the minifier parses documents as HTML, not XHTML. This makes it possible to employ such optimizations as “remove optional tags and quotes”, “collapse boolean attributes”, etc. Note that almost none of the options affect document validity, as per HTML 4.01. XHTML support might be added in the future, but considering that in the context of the public web it’s mostly pointless at the moment, I see little reason to do so. Besides, minifying XHTML documents (given that they’re actually served to clients properly, with “application/xhtml+xml”) doesn’t reduce size as much as if they were HTML.

The following is a list of the current options in the minifier. It is far from exhaustive, and will most likely be extended in the future. Let’s look at each one of them quickly:

Remove comments

    var input = '<!-- some comment --><p>blah</p>';
    var output = minify(input, { removeComments: true });
 
    output; // '<p>blah</p>'

This one should be self-explanatory. Passing truthy removeComments tells minifier to strip HTML comments. Note that comments inside elements with CDATA content model, such as SCRIPT and STYLE, are left intact (but see next option).

    var input = '<script type="text/javascript"><!-- some comment --></script>';
    var output = minify(input, { removeComments: true });
 
    output; // '<script type="text/javascript"><!-- some comment --></script>'

Remove comments from scripts and styles

When this option is enabled, HTML comments in scripts and styles are stripped as well:

    var input = '<script type="text/javascript"><!--\n alert(1) --></script>';
    var output = minify(input, { removeCommentsFromCDATA: true });
 
    output; // '<script type="text/javascript">alert(1)</script>'

It’s worth pointing out that there’s a slight difference in the way HTML comments are treated inside SCRIPT and STYLE elements. In scripts, comment start delimiter (“<!--”) tells parser to ignore everything until newline:

    <!-- alert(1); // alert never happens!
    <!--
    alert(2); // but this one does!
    // "<!--" acts as a single-line JS comment ("//").

In styles, however, “<!--” is simply ignored when it’s present in the beginning of input (I haven’t tested what happens in other parts of a stylesheet). Contrary to script behavior, anything that follows “<!--” still remains present:

    <!-- body { color: red; } -->
 
    /*  treated as:
        body { color: red; }
    */

Explanation of why you might want to strip comments.

Remove CDATA sections

This option removes CDATA sections from script and style elements:

    var input = '<script>/* <![CDATA[ \n\n */alert(1)/* ]]> */</script>';
    var output = minify(input, { removeCDATASectionsFromCDATA: true });
 
    output; // '<script>alert(1)</script>'

Explanation of why you might want to do this.

Collapse whitespace

This option collapses white space that contributes to text nodes in a document tree. For example:

    var input = '<div> <p>    foo </p>    </div>';
    var output = minify(input, { collapseWhitespace: true });
 
    output; // '<div><p>foo</p></div>'

It doesn’t affect significant white space; e.g. in contents of elements like SCRIPT, STYLE, PRE or TEXTAREA.

    var input = '<script>    alert("foo     bar")</script>';
    var output = minify(input, { collapseWhitespace: true });
 
    output; // '<script>alert("foo     bar")</script>'
 
    input = '<textarea>     x x   x </textarea>';
    output = minify(input, { collapseWhitespace: true });
 
    output; // '<textarea>     x x   x </textarea>'

Now, it’s worth mentioning that this modification can have side effects, and significantly change document representation.

For example, markup like <span>foo</span> <span>bar</span> is usually displayed as “foo bar” in browsers, with one space character in between two words. White space in markup is represented as text node in document tree. This text node’s value is a white space (e.g. U+0020), and as long as two adjacent elements are inline-level—as they are in this example—it is this white space that contributes to a gap in between “foo” and “bar”. As soon as we remove that white space (i.e. changing markup to <span>foo</span><span>bar</span>), representation changes from “foo bar” to “foobar”.

There are two ways to work around this issue.

First one is not to rely on such white space for document representation, and instead style elements to have margins and paddings as needed. In previous example, this could have been: <span class="foo">foo</span><span>bar</span> (where foo class would be declared with, say, margin-right: 0.25em;). At first, this might seem like an overkill. After all, adding class seems to defeat the purpose, resulting in larger output, when compared to a version with just one white space character. However, depending on a context, giving few elements a class for styling purposes, and then stripping white space from the entire document, can result in a smaller output.

Second option is to never fully remove white space characters, and instead always collapse them to one white space character. HTML 4.01 is actually specified to do just that, so there’s no harm in doing it upfront. Because of this, the following 2 snippets should render identically:

  <span>foo</span>
 
     <span>bar</span>

and:

    <span>foo</span> <span>bar</span>

…with one space in between “foo” and “bar”. Note how in first case, there’s an entire sequence of white space characters (including line breaks).

This second option—collapsing to one white space—has not yet been added to minifier.
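A minimal sketch of what this second option could look like. The collapseToSingleSpace helper is hypothetical (it is not part of minifier), and it assumes it is only ever given the text found between tags, never attribute values or contents of elements like PRE:

```javascript
// Hypothetical helper, not part of minifier's API.
// Assumption: "white space" means space, tab, form feed and line breaks.
var WHITESPACE_RUN = /[ \t\n\f\r]+/g;

function collapseToSingleSpace(text) {
  // Any run of white space characters becomes exactly one space.
  return text.replace(WHITESPACE_RUN, ' ');
}

collapseToSingleSpace('\n \n     ');      // ' '
collapseToSingleSpace('foo   \n\t bar');  // 'foo bar'
```

Since the gap between two inline elements is never dropped entirely, this variant trades a few bytes for not changing how “foo bar” is displayed.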

Another noticeable effect white space removal can have on a document is related to CSS white-space property. As I mentioned earlier, by default, adjacent sequences of white space in most of the elements collapse into one space character. But white-space property changes it all. Some of its values result in different collapsing behavior. white-space: pre, for example, makes whitespace render exactly as it occurs in a markup.

As a result, snippet like this:

<span style="white-space:pre;">  foo     bar</span>

renders exactly as is, and becomes:

  foo     bar

As of now, minifier doesn’t respect space-preserving white-space values (i.e. “pre” and “pre-wrap”). It doesn’t even understand them. Unfortunately, computing elements’ styles and determining their white-space values would be just way too complex and impractical [1]. On a bright side, it seems that white-space property is not used very often. In a future, it should be possible to add an option to minifier for specifying a way to prevent certain elements from having their content collapsed. A filtering can be based on a class, a simple selector, or maybe even by parsing element’s style attribute.
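As a rough illustration of the last idea (parsing element’s style attribute), here is a sketch of a check that such an option could rely on. The preservesWhiteSpace name is made up for this example, and the check only sees inline styles; rules coming from stylesheets, or declarations using !important, are not handled:

```javascript
// Hypothetical helper, not part of minifier's API.
// Looks for "white-space: pre" or "white-space: pre-wrap" in a style attribute value.
var PRESERVING_DECLARATION = /(?:^|;)\s*white-space\s*:\s*(?:pre|pre-wrap)\s*(?:;|$)/i;

function preservesWhiteSpace(styleValue) {
  return PRESERVING_DECLARATION.test(styleValue || '');
}

preservesWhiteSpace('white-space:pre;');                  // true
preservesWhiteSpace('color: red; white-space: pre-wrap'); // true
preservesWhiteSpace('white-space: nowrap');               // false
```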

Collapse boolean attributes

HTML 4.01 has so-called boolean attributes, such as “selected”, “disabled”, and “checked”. These may appear in a minimized (collapsed) form, where attribute value is fully omitted. For example, instead of writing <input disabled="disabled">, we can simply write <input disabled>.

Minifier has an option to perform this optimization, called collapseBooleanAttributes:

    var input = '<input value="foo" readonly="readonly">';
    var output = minify(input, { collapseBooleanAttributes: true });
 
    output; // '<input value="foo" readonly>'

A potential caveat here is that if you target elements by attribute name and value, things might break after applying this optimization. Granted, this kind of case seems rather unreal, but here’s an example. If we had these rules:

    input[disabled] { color: red }
    input[disabled="disabled"] { color: green }
    input:disabled { color: blue }

and markup like <input disabled="disabled">, then after transforming it to <input disabled>, second rule—input[disabled="disabled"]—would stop matching an element. First and third ones, however, would still work as expected. I can’t imagine why someone would use this second version, and you probably won’t ever stumble upon issues like these, but it’s good to be aware of them.
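To make the transformation itself more concrete, here is a sketch of how a single attribute could be serialized with this option in mind. This is only an illustration under my own assumptions, not minifier’s actual code: serializeAttribute is a hypothetical name, values are not escaped, and an attribute is collapsed only when its value repeats its name.

```javascript
// Hypothetical sketch; minifier's real implementation may differ.
// Boolean attributes defined in HTML 4.01.
var BOOLEAN_ATTRIBUTES = /^(?:checked|compact|declare|defer|disabled|ismap|multiple|nohref|noresize|noshade|nowrap|readonly|selected)$/i;

function serializeAttribute(name, value) {
  // Collapse only the canonical form, e.g. readonly="readonly".
  if (BOOLEAN_ATTRIBUTES.test(name) && value.toLowerCase() === name.toLowerCase()) {
    return name;
  }
  return name + '="' + value + '"';
}

serializeAttribute('readonly', 'readonly'); // 'readonly'
serializeAttribute('value', 'foo');         // 'value="foo"'
```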

Remove attribute quotes

By default, SGML (which HTML originates from) requires that all attribute values be delimited using either double or single quotes. But in certain cases—when attribute values contain a specific set of characters—quotes can be omitted altogether. Note that HTML specification recommends to always use quotes. There’s also an interesting explanation of why always quoting is a good idea by Jukka Korpela (although none of the dangers he’s talking about apply here). Please, use this optimization with care.

Relevant option is removeAttributeQuotes, and it tells minifier to omit quotes when it is safe to do so:

    var input = '<p class="foo-bar" id="moo" title="blah blah">foo</p>';
    var output = minify(input, { removeAttributeQuotes: true });
 
    output; // '<p class=foo-bar id=moo title="blah blah">foo</p>'
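The “safe to do so” part can be sketched as a simple test on the attribute value. HTML 4.01 allows unquoted values consisting only of letters, digits, hyphens, periods, underscores and colons. The canRemoveAttributeQuotes name is hypothetical, and minifier may well apply a different rule:

```javascript
// Hypothetical helper, not part of minifier's API.
// Characters HTML 4.01 permits in an unquoted attribute value.
var SAFE_UNQUOTED_VALUE = /^[a-zA-Z0-9\-._:]+$/;

function canRemoveAttributeQuotes(value) {
  return SAFE_UNQUOTED_VALUE.test(value);
}

canRemoveAttributeQuotes('foo-bar');   // true
canRemoveAttributeQuotes('blah blah'); // false (contains a space)
canRemoveAttributeQuotes('');          // false (empty value needs quotes)
```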

Remove redundant attributes

Some attributes in HTML 4.01 have default values. For example, input’s type attribute defaults to “text” and form’s method—to “get”. When enabling corresponding option in minifier (removeRedundantAttributes), these default attribute name-value pairs get stripped from the output.

There are also few other redundancies that are taken care of as part of this optimization.

One of them is removing deprecated language attribute on SCRIPT elements. It was among markup smells I mentioned recently. Another one is coexisting “name” and “id” attributes on anchors. And finally, redundant “javascript” labels in event handlers.
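Some of these checks can be sketched in a few lines. Function names here are made up for illustration. I’m also assuming that the language attribute is only dropped when its value is “javascript”, and that the label in event handlers is a leading “javascript:” prefix. The “name”/“id” case on anchors is left out.

```javascript
// Hypothetical sketch; names and exact rules are assumptions, not minifier's API.
function trimLower(value) {
  return String(value).replace(/^\s+|\s+$/g, '').toLowerCase();
}

function isRedundantAttribute(tag, name, value) {
  tag = tag.toLowerCase();
  name = name.toLowerCase();
  value = trimLower(value);
  return (
    (tag === 'input' && name === 'type' && value === 'text') ||
    (tag === 'form' && name === 'method' && value === 'get') ||
    // Assumption: only the "javascript" value is treated as redundant.
    (tag === 'script' && name === 'language' && value === 'javascript')
  );
}

// Strips a leading "javascript:" label, but only from event handler attributes.
function stripJavascriptLabel(name, value) {
  return /^on[a-z]+$/i.test(name) ? value.replace(/^\s*javascript:\s*/i, '') : value;
}

isRedundantAttribute('input', 'type', 'text');           // true
stripJavascriptLabel('onclick', 'javascript:alert(1)');  // 'alert(1)'
```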

Use short doctype

This optimization is the only one affecting document validity, that is, if document is declared to be anything but HTML5 (such as HTML 4.01). When useShortDoctype option is enabled, existing doctype is replaced with its short (HTML5) version: <!DOCTYPE html>. As mentioned before, this replacement is generally pretty safe, but you should decide for yourself if this is something worth doing.

Remove empty (or blank) attributes

The corresponding option is removeEmptyAttributes, and when enabled, all attributes with empty values are simply removed from the output. This includes blank values as well—those consisting of white space only.

    var input = '<p id="" STYLE=" " title="\n" >foo</p>';
    var output = minify(input, { removeEmptyAttributes: true });
 
    output; // '<p>foo</p>'

Note that not all “empty” attributes are removed. For example, both “src” and “alt” attributes are required on IMG elements, so we can’t remove them, even if they’re empty. Right now, only core attributes (id, class, style, title), i18n ones (lang, dir) and event ones (onclick, ondblclick, etc.) are considered “safe” for removal.

The caveat here is that, similar to “collapse boolean attributes” option, this change can affect certain style or script behavior. For example, you might want to target all elements with class attribute—*[class] { ... }. This will apply to elements with empty class, such as <p class="">bar</p>, but obviously not to those without—<p>bar</p>.

This might not be a big issue, but take it into consideration.
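The “safe for removal” rule described above could be sketched like this. The helper name is hypothetical, and the event attribute check (anything starting with “on”) is a deliberately loose approximation:

```javascript
// Hypothetical helper, not part of minifier's API.
// Core attributes, i18n attributes, and event handler attributes.
var SAFE_EMPTY_ATTRIBUTES = /^(?:id|class|style|title|lang|dir|on[a-z]+)$/i;

function canRemoveEmptyAttribute(name, value) {
  // Empty or blank (white space only) values qualify.
  return SAFE_EMPTY_ATTRIBUTES.test(name) && /^\s*$/.test(value);
}

canRemoveEmptyAttribute('id', '');      // true
canRemoveEmptyAttribute('STYLE', ' ');  // true
canRemoveEmptyAttribute('alt', '');     // false (required on IMG)
```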

Remove optional tags

Some elements in HTML 4.01 are allowed to have their tags omitted. Optional tags are either end one (e.g. </td>) or both—start and end ones (e.g. <tbody> and </tbody>). Note that start tag can never be optional on its own.

Corresponding option in minifier is removeOptionalTags. Currently, it only strips end tags of HTML, HEAD, BODY, THEAD, TBODY and TFOOT elements. I don’t fully understand the process of creating document tree from “unclosed” markup, so I’m not sure when it’s safe to omit tags like </p>.

For example, I can see how removing BODY start tag can have side effects. Let’s say we have a markup like this (with omitted HTML 4.01 doctype, for brevity):

    <head>
      <title>x</title>
    </head>
    <body><script type="text/javascript"></script>
      <p>x</p>
      <script type="text/javascript">
        document.write(document.body.childNodes[0].nodeName);
      </script>
    </body>

and the same markup with HEAD and BODY tags removed:

    <title>x</title>
    <script type="text/javascript"></script>
    <p>x</p>
    <script type="text/javascript">
      document.write(document.body.childNodes[0].nodeName);
    </script>

Note that second version is a perfectly valid document. It just has start and end tags of HEAD and BODY elements omitted. Now what seems to happen here, in a second version, is this:

Browser starts parsing, encounters TITLE tag, and given lack of starting HTML and HEAD tags, creates both elements implicitly (first, HTML, then HEAD as its immediate child). It then continues parsing, up until it stumbles upon P element, which, as per DTD, can not be a child of HEAD. Browser is therefore forced to implicitly close HEAD element, start BODY element, and continue parsing further. P element becomes first child in BODY, and SCRIPT element becomes last child in HEAD.

Now, if we were to display both of these documents, first one would output “SCRIPT” and second one “P”. This is because in original version, SCRIPT element is defined explicitly to be a child of BODY, and in modified version it is a child of HEAD (due to the way parsing works). The behavior of two documents is therefore not identical. We’ve got a “problem”.

Just like with previous “gotchas”, I’m not sure how likely this type of scenario is to appear in real life. From what I can see, the only other element (besides SCRIPT), allowed as child of both—HEAD and BODY, is OBJECT. As for the future, it should be possible to make minifier strip other optional tags as well. But only in safe scenarios.

It’s also worth mentioning that unclosed elements can result in slightly slower parsing times. Unfortunately, there are no extensive benchmarks done on this topic, and results seem to vary across browsers.

Remove empty elements

This optimization is probably one of the most obtrusive ones, which is why it is disabled by default. Think of it as an experimental addition, and employ with great care. There are dozens of valid use cases for occurrence of empty elements in a document. They can be used as placeholders for content inserted later with scripting; or for presentational purposes, such as to implement rounded corners, shadows, float clearing, etc. There are probably other cases, which I can’t think of at the moment.

When enabled, minifier simply removes all elements with empty contents (but not those with empty content model, such as IMG, LINK, or BR).

For example:

    var input = '<p></p>';
    var output = minify(input, { removeEmptyElements: true });
 
    output; // ''
 
    input = '<div>blah<span></span></div>';
    output = minify(input, { removeEmptyElements: true });
 
    output; // '<div>blah</div>'

There are few things to be aware of. First of all, elements containing only other empty elements are not removed. For example:

    var input = '<div><div><div></div></div></div>';
    var output = minify(input, { removeEmptyElements: true });
 
    output; // '<div><div></div></div>'

Note how only inner DIV element—the one with actual empty contents—is removed.

Second of all, only truly empty string is considered an empty content. This does not include spaces, newlines, or other white space characters:

    var input = '<p> </p>'; // note one space character in between
    var output = minify(input, { removeEmptyElements: true });
 
    output; // '<p> </p>'

Also note that comments are parsed as separate entities and so don’t affect “emptiness” of elements:

    var input = '<p><!-- comment --></p>';
    var output = minify(input, { removeEmptyElements: true });
 
    output; // ''

As with other optimizations, some of these limitations will likely be removed in the future.

Validate input through HTML lint

This option simply toggles linting. You can create new HTMLLint object and pass it to minifier. During minification, lint object silently logs all “suspicious” activity. It exposes populate method, which accepts element and inserts its log into this element:

    var lint = new HTMLLint();
    minify(' some input... ', { lint: lint });
 
    lint.populate(document.getElementById('someElement'));

Field-testing

So how does minifier stand against real-life markup? Let’s take a look at minification results of some of the popular websites (note that when gzip’ing documents, 6th level of compression (default) was used):

Amazon.com

Original size: 217KB (35.8KB gzipped)
Minified size: 206.6KB (34.3KB gzipped)
Savings: 10.4KB (1.5KB gzipped)

Minifying home page of amazon.com saves about 10KB with uncompressed document, and only 1.5KB with compressed one. What’s interesting is that the humongous 217KB is actually a result of a myriad of inline styles and scripts scattered throughout the document. Replacing those with external scripts would be the best optimization. Getting rid of occasional style attributes would help too.

Digg.com

Original size: 82KB (18.2KB gzipped)
Minified size: 74.9KB (17.2KB gzipped)
Savings: 7KB (1KB gzipped)

On digg.com, reduction is slightly smaller—7KB (1KB gzipped). The markup is not as cluttered as on amazon, but still has smells: inline scripts (and unnecessary comments in them), deprecated attributes, anchors defunct without scripting, etc. The benefits of minification are rather small here.

Ajaxian.com

Original size: 177.6KB (32.4KB gzipped)
Minified size: 157.3KB (29.7KB gzipped)
Savings: 20.3KB (2.7KB gzipped)

Trying out home page of ajaxian.com, we see a difference of ~20KB—even better reduction in size. And again, compressed documents show savings of only 2.7KB. Speaking of compression, ajaxian.com shamelessly serves its 177KB-large document uncompressed. There’s also some redundant markup, like unnecessary &nbsp;’s, excessive style attributes, lots of comments, and few inline scripts. Removing all of those, and turning on compression would be an ultimate optimization.

Linkedin.com

Original size: 128.8KB (19.8KB gzipped)
Minified size: 89.4KB (17.1KB gzipped)
Savings: 39.4KB (2.7KB gzipped)

linkedin.com surprises with savings of almost 40KB (!) after minification. Looking at the source, we see that large number is explained by excessive amount of whitespace. This is a good example of how carelessly used whitespace can add up to huge number like this. And again, gzip saves the day; minifying compressed document reduces it only by 2.7KB.

ECMAScript language specification

Original size: 703KB (122.5KB gzipped)
Minified size: 572KB (106.4KB gzipped)
Savings: 131KB (16KB gzipped)

Large static documents are where HTML minification truly shines, and HTML version of ECMAScript (3rd ed.) language specification is a clear demonstration of it. Minifying document results in savings of 131KB (!) for an uncompressed document, and 16KB for compressed one. Since document is served statically, there’s hardly any reason not to apply minification here.
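To put these numbers into perspective, savings can also be expressed relative to original size. The savings helper below is just a convenience for this article; the inputs are the uncompressed sizes (in KB) listed above:

```javascript
// Illustrative helper: absolute and relative savings, rounded to one decimal.
function savings(originalKB, minifiedKB) {
  var saved = originalKB - minifiedKB;
  return {
    savedKB: +saved.toFixed(1),
    percent: +(saved / originalKB * 100).toFixed(1)
  };
}

savings(217, 206.6);   // amazon.com:         { savedKB: 10.4, percent: 4.8 }
savings(128.8, 89.4);  // linkedin.com:       { savedKB: 39.4, percent: 30.6 }
savings(703, 572);     // ECMAScript spec:    { savedKB: 131, percent: 18.6 }
```

Seen this way, whitespace-heavy or large static documents stand out much more clearly than script-heavy pages like amazon.com.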

Cost and benefits

It’s pretty obvious that the best candidates for HTML minification are large static documents. Or just static documents, such as FAQs, standalone articles, etc. Anything that can’t be compressed (e.g. when access rights are insufficient to enable gzip on a server) would benefit from minification as well. Even when serving gzipped content, it’s worth remembering that not everyone is getting gzip. So clients that are being sent gzipped content could receive a 2-3KB smaller file, whereas those receiving uncompressed content could end up with files up to a whopping 10-20KB smaller than original ones.

One of the biggest problems I see, when it comes to dynamic minification, is the possibility of error. The core of the issue is that minification relies on parsing, and parsing HTML is a pretty tricky business. Even though minifier applies a strict set of rules—removing quotes and optional tags only when it is absolutely safe to do so, a single misplaced character in start tag can trip parser and wreak havoc on an entire document. This is especially relevant when there’s an inclusion of user-generated content.

As an example, browsers usually understand empty end tags (allowed in HTML)—<p> test </>, but parser, which minifier is based on, would immediately choke here and stop. Another example is attributes containing “weird” characters—<a href="http://example.com""> test </a> (note trailing quote after an attribute). Many browsers happily parse this element, ignoring trailing quote. But parser, once again, falls short and bails out.

It’s certainly possible to tame errors and simply output original, uncompressed document. But this brings us to another downside—time spent on minification. Even when errors are not an issue, there’s an actual overhead of parsing and processing document tree. Minifying home page of amazon.com in pretty speedy nightly webkit, for example, takes exactly 1 second. Most of that time is consumed by parsing. 1 second is quite a lot. An acceptable time for real-time minification would be somewhere around 50-100ms. This problem can be mitigated by optimizing parser, or porting script to be executed in a faster environment (v8 on a server?).
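Falling back to original markup, and measuring how long minification took, could be sketched as follows. The safeMinify wrapper is hypothetical. It takes the minify function as an argument, so nothing is assumed about how minifier is loaded, and it assumes that a failed parse surfaces as a thrown error:

```javascript
// Hypothetical wrapper, not part of minifier.
// Assumption: minifyFn throws when the parser cannot handle the input.
function safeMinify(minifyFn, input, options) {
  var start = Date.now();
  var output;
  try {
    output = minifyFn(input, options);
  } catch (err) {
    // Serve the original, unminified document rather than a broken one.
    output = input;
  }
  return { output: output, elapsedMs: Date.now() - start };
}

// Usage with a stand-in function that always fails:
var result = safeMinify(function () { throw new Error('Parse Error'); }, '<p> test </>', {});
result.output; // '<p> test </>'
```

The elapsedMs value makes it easy to log documents that take longer than the 50-100ms budget mentioned above.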

Curiously, Opera 10.50 beta (on Mac OS X) managed to beat WebKit and completed this task almost twice as fast (~500ms). Unfortunately, this version suffers from some bugs in regex matching, and fails half of the test suite. Hopefully, those issues will be resolved in later revisions.

Another interesting performance observation was with V8 engine. When testing with version 1.3.x, the time it took to minify amazon.com home page was 0.6 secs. However, version 2.1.2.6 (currently latest stable) performed same task in excruciatingly long 2 seconds.

Future

I can think of many other things to improve in minifier. Unfortunately, I don’t have much time to work on it. The project is licensed under MIT, and is free for use/modification by anyone interested. Test suite should make collaboration easy. There’s a short todo list at the bottom of the project page. Among other things, it lists some of the known bugs.

As always, any questions, corrections, and suggestions are very much welcome.

Enjoy.

1. “white-space: pre” declaration could be part of a rule from within an external stylesheet; getting computed style would require downloading, parsing and analyzing every single stylesheet linked from the document (or imported from within another stylesheet).

Categories: html, optimizations

Javascript quiz

February 8th, 2010 by kangax

I was recently reminded about Dmitry Baranovsky’s Javascript test, when N. Zakas answered and explained it in a blog post. First time I saw those questions explained was by Richard Cornford in comp.lang.javascript, although not as thoroughly as by Nicholas.

I decided to come up with my own little quiz. I wanted to keep questions not very obscure, practical, yet challenging. They would also cover wider range of topics.

Host objects

Contrary to Dmitry’s test, quiz does not involve host objects (e.g. window), as their behavior is unspecified and can vary sporadically across implementations. We are talking about pure ECMAScript (3rd ed.) behavior. Now, it’s worth pointing out that sometimes implementations deviate from the standard collectively, forming their own, de-facto standard. An example of this is for-in statement, where none of the popular implementations throw TypeError when expression evaluates to null or undefined, as in for (var prop in null) { ... }, and instead just silently ignore it. I tried to avoid these non-standard cases. Every question has a correct answer that can be reproduced in at least one of the major implementations.
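The for-in deviation is easy to observe. This short snippet (not one of the quiz questions) shows what popular implementations do; the visited counter is only there to make the result visible:

```javascript
var visited = 0;

// Per ES3, evaluating these statements should result in TypeError.
// Popular implementations skip the loop body instead.
for (var prop in null) { visited++; }
for (var other in undefined) { visited++; }

visited; // 0
```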

So what are we testing?

Not a lot really. Quiz mainly focuses on knowledge of scoping, function expressions (and how they differ from function declarations), references, process of variable and function declaration, order of evaluation, and a couple more things like delete operator and object instantiation. These are all relatively simple concepts, which I think every professional Javascript developer should know. Most of these are applied in practice quite often. Ideally, even if you can’t answer a question, you should be able to infer answer from specs (without executing the snippet). When creating these questions, I made sure I can answer each one of them off the top of my head, to keep things relatively simple.

Note, however, that not all questions are very practical, so don’t worry if you can’t answer some of them. We don’t often use with statement, for example, so failing to know/remember its exact behavior is understandable.

Few notes about code

  • Assuming ECMAScript 3rd edition (not 5th)
  • Implementation quirks do not count (assuming standard behavior only)
  • Every snippet is run as a global code (not as eval or function one)
  • There are no other variables declared (and host environment is not extended with anything beyond what’s defined in specs)
  • Answer should correspond to exact return value of entire expression/statement (or last line)
  • “Error” in answer indicates that overall snippet results in a runtime error

Quiz

Please make sure you select answer in each question, as lack of answer is not checked and counts as failure. The final score is simply a number of wrong answers, less is better. Quiz requires Javascript to be enabled.

  1.

        (function(){ 
          return typeof arguments;
        })();
  2.

        var f = function g(){ return 23; };
        typeof g();
  3.

        (function(x){
          delete x;
          return x;
        })(1);
  4.

        var y = 1, x = y = typeof x;
        x;
  5.

        (function f(f){ 
          return typeof f(); 
        })(function(){ return 1; });
  6.

        var foo = { 
          bar: function() { return this.baz; }, 
          baz: 1
        };
        (function(){ 
          return typeof arguments[0]();
        })(foo.bar);
  7.

        var foo = {
          bar: function(){ return this.baz; },
          baz: 1
        }
        typeof (f = foo.bar)();
  8.

        var f = (function f(){ return "1"; }, function g(){ return 2; })();
        typeof f;
  9.

        var x = 1;
        if (function f(){}) {
          x += typeof f;
        }
        x;
  10.

        var x = [typeof x, typeof y][1];
        typeof typeof x;
  11.

        (function(foo){
          return typeof foo.bar;
        })({ foo: { bar: 1 } });
  12.

        (function f(){
          function f(){ return 1; }
          return f();
          function f(){ return 2; }
        })();
  13.

        function f(){ return f; }
        new f() instanceof f;
  14.

        with (function(x, undefined){}) length;

I hope you liked it. Please leave your score in the comments. I’ll try to explain these questions sometime in a near future, unless someone else does it before me. Meanwhile, you can take a look at my articles on function expressions and delete operator, understanding which would help you answer some of these questions, and more importantly, explain their answers.

Categories: ECMA-262, Quiz, [[Delete]], delete operator, with
