Perfection kills

Exploring Javascript by example

Javascript quiz

February 8th, 2010 by kangax

I was recently reminded about Dmitry Baranovsky’s Javascript test, when N. Zakas answered and explained it in a blog post. First time I saw those questions explained was by Richard Cornford in comp.lang.javascript, although not as thoroughly as by Nicholas.

I decided to come up with my own little quiz. I wanted to keep the questions not very obscure, practical, yet challenging. They would also cover a wider range of topics.

Host objects

Contrary to Dmitry’s test, this quiz does not involve host objects (e.g. window), as their behavior is unspecified and can vary sporadically across implementations. We are talking about pure ECMAScript (3rd ed.) behavior. Now, it’s worth pointing out that sometimes implementations deviate from the standard collectively, forming their own de-facto standard. An example of this is the for-in statement, where none of the popular implementations throw TypeError when the expression evaluates to null or undefined, as in for (var prop in null) { ... }, and instead just silently ignore it. I tried to avoid these non-standard cases. Every question has a correct answer that can be reproduced in at least one of the major implementations.
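The for-in case is easy to observe directly. The following is a minimal sketch, assuming a current engine (ECMAScript 5 and later adopted the de-facto behavior, so no TypeError is expected there); countEnumerated is a hypothetical helper name, not something from the quiz.

```javascript
// Hypothetical helper: counts how many properties a for-in loop visits.
function countEnumerated(value) {
  var count = 0;
  for (var prop in value) {
    count++;
  }
  return count;
}

// null and undefined are silently skipped instead of throwing.
var countForNull = countEnumerated(null);
var countForUndefined = countEnumerated(undefined);
var countForObject = countEnumerated({ a: 1, b: 2 });
```

In such an engine both countForNull and countForUndefined end up as 0, while the ordinary object is enumerated as usual.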

So what are we testing?

Not a lot really. Quiz mainly focuses on knowledge of scoping, function expressions (and how they differ from function declarations), references, process of variable and function declaration, order of evaluation, and a couple more things like delete operator and object instantiation. These are all relatively simple concepts, which I think every professional Javascript developer should know. Most of these are applied in practice quite often. Ideally, even if you can’t answer a question, you should be able to infer answer from specs (without executing the snippet). When creating these questions, I made sure I can answer each one of them off the top of my head, to keep things relatively simple.

Note, however, that not all questions are very practical, so don’t worry if you can’t answer some of them. We don’t often use with statement, for example, so failing to know/remember its exact behavior is understandable.

Few notes about code

  • Assuming ECMAScript 3rd edition (not 5th)
  • Implementation quirks do not count (assuming standard behavior only)
  • Every snippet is run as a global code (not as eval or function one)
  • There are no other variables declared (and host environment is not extended with anything beyond what’s defined in specs)
  • Answer should correspond to exact return value of entire expression/statement (or last line)
  • “Error” in answer indicates that overall snippet results in a runtime error
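For checking an answer after reasoning about it, a small harness along these lines might help. It is only a sketch: runAsGlobalCode is a hypothetical helper, it relies on indirect eval (so the snippet is technically Eval code evaluated in the global scope, not true Global code), and current engines implement later editions than ECMAScript 3, so a few results may differ from what the 3rd edition specifies.

```javascript
// Hypothetical helper: evaluates a snippet in the global scope and
// reports either its completion value or the string "Error".
function runAsGlobalCode(source) {
  try {
    // Indirect eval: the snippet cannot see this function's local variables.
    return (0, eval)(source);
  } catch (e) {
    return 'Error';
  }
}

// The value of the last statement is what gets returned.
var completionValue = runAsGlobalCode('var sample = [1, 2, 3];\nsample.length;');

// A runtime error anywhere in the snippet is reported as "Error".
var failure = runAsGlobalCode('null.foo;');
```

The sample snippets here are deliberately unrelated to the quiz questions.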

Quiz

Please make sure you select an answer in each question, as lack of answer is not checked and counts as failure. The final score is simply the number of wrong answers; fewer is better. Quiz requires Javascript to be enabled.

  1.     (function(){ 
          return typeof arguments;
        })();
  2.     var f = function g(){ return 23; };
        typeof g();
  3.     (function(x){
          delete x;
          return x;
        })(1);
  4.     var y = 1, x = y = typeof x;
        x;
  5.     (function f(f){ 
          return typeof f(); 
        })(function(){ return 1; });
  6.     var foo = { 
          bar: function() { return this.baz; }, 
          baz: 1
        };
        (function(){ 
          return typeof arguments[0]();
        })(foo.bar);
  7.     var foo = {
          bar: function(){ return this.baz; },
          baz: 1
        }
        typeof (f = foo.bar)();
  8.     var f = (function f(){ return "1"; }, function g(){ return 2; })();
        typeof f;
  9.     var x = 1;
        if (function f(){}) {
          x += typeof f;
        }
        x;
  10.     var x = [typeof x, typeof y][1];
        typeof typeof x;
  11.     (function(foo){
          return typeof foo.bar;
        })({ foo: { bar: 1 } });
  12.     (function f(){
          function f(){ return 1; }
          return f();
          function f(){ return 2; }
        })();
  13.     function f(){ return f; }
        new f() instanceof f;
  14.     with (function(x, undefined){}) length;

I hope you liked it. Please leave your score in the comments. I’ll try to explain these questions sometime in the near future, unless someone else does it before me. Meanwhile, you can take a look at my articles on function expressions and delete operator, understanding which would help you answer some of these questions, and more importantly, explain their answers.

Categories: ECMA-262, Quiz, [[Delete]], delete operator, with

Understanding delete

January 10th, 2010 by kangax
  1. Theory
  2. Firebug confusion
  3. Browsers compliance
  4. IE bugs
  5. Misconceptions
  6. `delete` and host objects
  7. ES5 strict mode
  8. Summary

A couple of weeks ago, I had a chance to glance through Stoyan Stefanov’s Object-Oriented Javascript. The book had an exceptionally high rating on Amazon (12 reviews with 5 stars), so I was curious to see if it was something worth recommending. I started reading through the chapter on functions, and really enjoyed the way things were explained there; the flow of examples was structured in such a nice, progressive way that it seemed even beginners would grasp it easily. However, almost immediately I stumbled upon an interesting misconception present throughout the entire chapter: deleting functions. There were some other mistakes (such as the difference between function declarations and function expressions), but we aren’t going to talk about them now.

The book claims that “function is treated as a normal variable—it can be copied to a different variable and even deleted.” Following that explanation, there is this example:

  >>> var sum = function(a, b) {return a + b;} 
  >>> var add = sum; 
  >>> delete sum
  true
  >>> typeof sum;
  "undefined"

Ignoring a couple of missing semicolons, can you see what’s wrong with this snippet? The problem, of course, is that deleting sum variable should not be successful; delete statement should not evaluate to true and typeof sum should not result in “undefined”. All because it’s not possible to delete variables in Javascript. At least not when declared in such way.

So what’s going on in this example? Is it a typo? A diversion? Probably not. This whole snippet is actually a real output from the Firebug console, which Stoyan must have been using for quick testing. It’s almost as if Firebug follows some other rules of deletion. It is Firebug that has led Stoyan astray! So what is really going on here?

To answer this question, we need to understand how the delete operator works in Javascript: what exactly can and cannot be deleted and why. Today I’ll try to explain this in detail. We’ll take a look at Firebug’s “weird” behavior and realize that it’s not all that weird; we’ll delve into what’s going on behind the scenes when declaring variables, functions, assigning properties and deleting them; we’ll look at browsers’ compliance and some of the most notorious bugs; we’ll also talk about strict mode of 5th edition of ECMAScript, and how it changes delete operator behavior.

I’ll be using Javascript and ECMAScript interchangeably to really mean ECMAScript (unless explicitly talking about Mozilla’s JavaScript™ implementation).

Unsurprisingly, explanations of delete on the web are rather scarce. The MDC article is probably the most comprehensive resource, but unfortunately misses a few interesting details about the subject. Curiously, one of these forgotten things is the cause of Firebug’s tricky behavior. The MSDN reference is practically useless.

Theory

So why is it that we can delete object properties:

  var o = { x: 1 }; 
  delete o.x; // true
  o.x; // undefined

but not variables, declared like this:

  var x = 1; 
  delete x; // false
  x; // 1

or functions, declared like this:

  function x(){}
  delete x; // false
  typeof x; // "function"

Note that delete only returns false when a property can not be deleted.

To understand this, we need to first grasp such concepts as variable instantiation and property attributes, something that’s unfortunately rarely covered in books on Javascript. I’ll try to go over these very concisely in the next few paragraphs. It’s not hard to understand them at all! If you don’t care about why things work the way they work, feel free to skip this chapter.

Type of code

There are 3 types of executable code in ECMAScript: Global code, Function code and Eval code. These types are somewhat self-descriptive, but here’s a short overview:

  1. When a source text is treated as a Program, it is executed in a global scope, and is considered a Global code. In a browser environment, content of SCRIPT elements is usually parsed as a Program, and is therefore evaluated as a Global code.
  2. Anything that’s executed directly within a function is, quite obviously, considered a Function code. In browsers, content of event attributes (e.g. <p onclick="...">) is usually parsed and treated as a Function code.
  3. Finally, text that’s supplied to a built-in eval function is parsed as Eval code. We will soon see why this type is special.

Execution context

When ECMAScript code executes, it always happens within certain execution context. Execution context is a somewhat abstract entity, which helps understand how scope and variable instantiation works. For each of three types of executable code, there’s an execution context. When a function is executed, it is said that control enters execution context for Function code; when Global code executes, control enters execution context for Global code, and so on.

As you can see, execution contexts can logically form a stack. First there might be Global code with its own execution context; that code might call a function, with its own execution context; that function could call another function, and so on and so forth. Even if a function is calling itself recursively, a new execution context is being entered with every invocation.
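The point about recursion can be made visible with a closure per invocation. This is an illustrative sketch only; enter, localDepth and snapshots are names made up for the example.

```javascript
// Each call to `enter` is a separate execution context with its own `localDepth`.
function enter(depth, snapshots) {
  var localDepth = depth;
  // Capture this invocation's variable so it can be read back later.
  snapshots.push(function () { return localDepth; });
  if (depth < 2) {
    enter(depth + 1, snapshots);
  }
  return snapshots;
}

var snapshots = enter(0, []);
// Read every captured variable after all invocations have finished.
var observed = snapshots.map(function (read) { return read(); });
```

If the recursive calls shared one set of variables, every reader would return the same value; instead observed holds 0, 1 and 2.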

Activation object / Variable object

Every execution context has a so-called Variable Object associated with it. Similarly to execution context, Variable object is an abstract entity, a mechanism to describe variable instantiation. Now, the interesting part is that variables and functions declared in a source text are actually added as properties of this Variable object.

When control enters execution context for Global code, a Global object is used as a Variable object. This is precisely why variables or functions declared globally become properties of a Global object:

  /* remember that `this` refers to global object when in global scope */
  var GLOBAL_OBJECT = this;
 
  var foo = 1;
  GLOBAL_OBJECT.foo; // 1
  foo === GLOBAL_OBJECT.foo; // true
 
  function bar(){}
  typeof GLOBAL_OBJECT.bar; // "function"
  GLOBAL_OBJECT.bar === bar; // true

Ok, so global variables become properties of Global object, but what happens with local variables — those declared in Function code? The behavior is actually very similar: they become properties of Variable object. The only difference is that when in Function code, a Variable object is not a Global object, but a so-called Activation object. Activation object is created every time execution context for Function code is entered.

Not only do variables and functions declared within Function code become properties of Activation object; this also happens with each of function arguments (under names corresponding to formal parameters) and a special Arguments object (under arguments name). Note that Activation object is an internal mechanism and is never really accessible by program code.

  (function(foo){
 
    var bar = 2;
    function baz(){}
 
    /*
    In abstract terms,
 
    Special `arguments` object becomes a property of containing function's Activation object: 
      ACTIVATION_OBJECT.arguments; // Arguments object
 
    ...as well as argument `foo`:
      ACTIVATION_OBJECT.foo; // 1
 
    ...as well as variable `bar`:
      ACTIVATION_OBJECT.bar; // 2
 
    ...as well as function declared locally:
      typeof ACTIVATION_OBJECT.baz; // "function"
    */
 
  })(1);
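The Activation object itself cannot be reached, but its effects can be observed from inside a function. A small sketch, with inspectBindings as a hypothetical name: formal parameters, variables, function declarations and arguments all live side by side, which is why redeclaring a parameter with var does not reset it and why a function declared after return is already available.

```javascript
function inspectBindings(foo) {
  var foo;   // same name as the parameter: no second property is created
  var bar = 2;
  return [foo, bar, typeof baz, typeof arguments, arguments.length];
  function baz() {}   // instantiated on entering the function, before any statement runs
}

var bindings = inspectBindings(1);
// bindings: [1, 2, "function", "object", 1]
```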

Finally, variables declared within Eval code are created as properties of calling context’s Variable object. Eval code simply uses Variable object of the execution context that it’s being called within:

  var GLOBAL_OBJECT = this;
 
  /* `foo` is created as a property of calling context Variable object,
      which in this case is a Global object */
 
  eval('var foo = 1;');
  GLOBAL_OBJECT.foo; // 1
 
  (function(){
 
    /* `bar` is created as a property of calling context Variable object,
      which in this case is an Activation object of containing function */
 
    eval('var bar = 1;');
 
    /* 
      In abstract terms, 
      ACTIVATION_OBJECT.bar; // 1
    */
 
  })();
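Because ACTIVATION_OBJECT above is only notation, here is a runnable sketch of the same idea. It assumes non-strict code (strict mode gives eval its own variable environment), which is why the function is built with new Function; evalLocalDemo is a made-up variable name.

```javascript
// `new Function` produces a non-strict function body no matter how this file is loaded.
var evalInsideFunction = new Function(
  "eval('var evalLocalDemo = 1;');\n" +
  "return typeof evalLocalDemo;"
);

// The variable declared by eval is visible in the calling function...
var seenInsideFunction = evalInsideFunction();

// ...but it never reached the global scope.
var seenFromGlobalCode = (0, eval)('typeof evalLocalDemo');
```

Here seenInsideFunction is "number" and seenFromGlobalCode is "undefined".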

Property attributes

We are almost there. Now that it’s clear what happens with variables (they become properties), the only remaining concept to understand is property attributes. Every property can have zero or more attributes from the following set — ReadOnly, DontEnum, DontDelete and Internal. You can think of them as flags — an attribute can either exist on a property or not. For the purposes of today’s discussion, we are only interested in DontDelete.

When declared variables and functions become properties of a Variable object (either Activation object for Function code, or Global object for Global code), these properties are created with DontDelete attribute. However, any explicit (or implicit) property assignment creates property without DontDelete attribute. And this is essentially why we can delete some properties, but not others:

  var GLOBAL_OBJECT = this;
 
  /*  `foo` is a property of a Global object.
      It is created via variable declaration and so has DontDelete attribute.
      This is why it can not be deleted. */
 
  var foo = 1;
  delete foo; // false
  typeof foo; // "number"
 
  /*  `bar` is a property of a Global object.
      It is created via function declaration and so has DontDelete attribute.
      This is why it can not be deleted either. */
 
  function bar(){}
  delete bar; // false
  typeof bar; // "function"
 
  /*  `baz` is also a property of a Global object.
      However, it is created via property assignment and so has no DontDelete attribute.
      This is why it can be deleted. */
 
  GLOBAL_OBJECT.baz = 'blah';
  delete GLOBAL_OBJECT.baz; // true
  typeof GLOBAL_OBJECT.baz; // "undefined"
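ECMAScript 3 offers no way to inspect DontDelete from program code. As a rough modern analogue, and assuming an ES5-capable engine, the configurable flag reported by Object.getOwnPropertyDescriptor plays the inverse role. The names holder and tryDelete below are hypothetical.

```javascript
var holder = {};
holder.assigned = 'blah';                    // plain assignment: deletable
Object.defineProperty(holder, 'locked', {    // explicitly non-deletable
  value: 1,
  configurable: false
});

// Built with `new Function` so the body is non-strict and a failed
// `delete` returns false instead of throwing.
var tryDelete = new Function('obj', 'key', 'return delete obj[key];');

var assignedIsConfigurable =
  Object.getOwnPropertyDescriptor(holder, 'assigned').configurable; // true
var lockedIsConfigurable =
  Object.getOwnPropertyDescriptor(holder, 'locked').configurable;   // false

var deletedAssigned = tryDelete(holder, 'assigned'); // true
var deletedLocked = tryDelete(holder, 'locked');     // false
```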

Built-ins and DontDelete

So this is what it’s all about: a special attribute on a property that controls whether this property can be deleted or not. Note that some of the properties of built-in objects are specified to have DontDelete, and so can not be deleted. Special arguments variable (or, as we know now, a property of Activation object) has DontDelete. length property of any function instance has DontDelete as well:

  (function(){
 
    /* can't delete `arguments`, since it has DontDelete */
 
    delete arguments; // false
    typeof arguments; // "object"
 
    /* can't delete function's `length`; it also has DontDelete */
 
    function f(){}
    delete f.length; // false
    typeof f.length; // "number"
 
  })();
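A note for anyone re-running this today: engines implementing ECMAScript 2015 or later make a function’s length configurable, so the delete f.length result above reflects the 3rd edition only. The sketch below therefore uses Math.PI as a stand-in for a built-in property that is still non-deletable; builtinChecks is a hypothetical name, and the code is wrapped in new Function to keep it non-strict.

```javascript
var builtinChecks = new Function(
  'return [delete arguments, typeof arguments, delete Math.PI, typeof Math.PI];'
)();
// builtinChecks: [false, "object", false, "number"]
```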

Properties corresponding to function arguments are created with DontDelete as well, and so can not be deleted either:

  (function(foo, bar){
 
    delete foo; // false
    foo; // 1
 
    delete bar; // false
    bar; // 'blah'
 
  })(1, 'blah');

Undeclared assignments

As you might remember, undeclared assignment creates a property on a global object. That is unless that property is found somewhere in the scope chain before global object. And now that we know the difference between property assignment and variable declaration — latter one sets DontDelete, whereas former one doesn’t — it should be clear why undeclared assignment creates a deletable property:

  var GLOBAL_OBJECT = this;
 
  /* create global property via variable declaration; property has DontDelete */
  var foo = 1;
 
  /* create global property via undeclared assignment; property has no DontDelete */
  bar = 2;
 
  delete foo; // false
  typeof foo; // "number"
 
  delete bar; // true
  typeof bar; // "undefined"
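The scope chain caveat deserves its own illustration. In this sketch (non-strict code via new Function; shadowedName and freshGlobalName are made-up identifiers), one assignment finds an existing variable in an enclosing function, while the other finds nothing and ends up creating a deletable global property.

```javascript
var scopeDemo = new Function(
  'var shadowedName = 0;\n' +
  '(function () {\n' +
  '  shadowedName = 1;      // found in the enclosing function: no global is created\n' +
  '  freshGlobalName = 2;   // not found anywhere: created on the global object\n' +
  '})();\n' +
  'return [shadowedName, typeof freshGlobalName, delete freshGlobalName, typeof freshGlobalName];'
)();
// scopeDemo: [1, "number", true, "undefined"]

// The variable that was found in the scope chain never became a global.
var shadowedIsGlobal = (0, eval)('typeof shadowedName') !== 'undefined'; // false
```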

Note that it is during property creation that attributes are determined (i.e. none are set). Later assignments don’t modify attributes of existing property. It’s important to understand this distinction.

  /* `foo` is created as a property with DontDelete */
  function foo(){}
 
  /* Later assignments do not modify attributes. DontDelete is still there! */
  foo = 1;
  delete foo; // false
  typeof foo; // "number"
 
  /* But assigning to a property that doesn't exist, 
     creates that property with empty attributes (and so without DontDelete) */
 
  this.bar = 1;
  delete bar; // true
  typeof bar; // "undefined"

Firebug confusion

So what happens in Firebug? Why is it that variables declared in console can be deleted, contrary to what we have just learned? Well, as I said before, Eval code has a special behavior when it comes to variable declaration. Variables declared within Eval code are actually created as properties without DontDelete:

  eval('var foo = 1;');
  foo; // 1
  delete foo; // true
  typeof foo; // "undefined"

and, similarly, when called within Function code:

  (function(){
 
    eval('var foo = 1;');
    foo; // 1
    delete foo; // true
    typeof foo; // "undefined"
 
  })();

And this is the gist of Firebug’s abnormal behavior. All the text in console seems to be parsed and executed as Eval code, not as a Global or Function one. Obviously, any declared variables end up as properties without DontDelete, and so can be easily deleted. Be aware of these differences between regular Global code and Firebug console.
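To see how this reproduces the book’s output, here is a sketch of a console that feeds every line to eval. It is an assumption that Firebug worked roughly like this; consoleLine is a hypothetical stand-in, and it uses indirect eval so that declarations land in the global scope.

```javascript
// Hypothetical stand-in for a console: every entered line goes through eval.
function consoleLine(source) {
  return (0, eval)(source);
}

consoleLine('var sum = function(a, b) { return a + b; };');
consoleLine('var add = sum;');

var deleteResult = consoleLine('delete sum;');     // true, as in the book
var typeAfterDelete = consoleLine('typeof sum;');  // "undefined", as in the book
var addStillWorks = consoleLine('add(1, 2);');     // 3: the function itself lives on

consoleLine('delete add;'); // clean up the remaining global
```

Declaring the same variables in a regular script instead would leave both of them non-deletable.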

Deleting variables via eval

This interesting eval behavior, coupled with another aspect of ECMAScript can technically allow us to delete non-deletable properties. The thing about function declarations is that they can overwrite same-named variables in the same execution context:

  function x(){ }
  var x;
  typeof x; // "function"

Note how function declaration takes precedence and overwrites same-named variable (or, in other words, same property of Variable object). This is because function declarations are instantiated after variable declarations, and are allowed to overwrite them. Not only does function declaration replace the previous value of a property, it also replaces that property’s attributes. If we declare function via eval, that function should also replace that property’s attributes with its own. And since variables declared from within eval create properties without DontDelete, instantiating this new function should essentially remove existing DontDelete attribute from the property in question, making that property deletable (and of course changing its value to reference newly created function).

  var x = 1;
 
  /* Can't delete, `x` has DontDelete */
 
  delete x; // false
  typeof x; // "number"
 
  eval('function x(){}');
 
  /* `x` property now references function, and should have no DontDelete */
 
  typeof x; // "function"
  delete x; // should be `true`
  typeof x; // should be "undefined"

Unfortunately, this kind of spoofing doesn’t work in any implementation I tried. I might be missing something here, or this behavior might simply be too obscure for implementors to pay attention to.

Browsers compliance

Knowing how things work in theory is useful, but practical implications are paramount. Do browsers follow standards when it comes to variable/property creation/deletion? For the most part, yes.

I wrote a simple test suite to check compliance of delete operator with Global code, Function code and Eval code. Test suite checks both — return value of delete operator, and whether properties are deleted (or not) as they are supposed to. delete return value is not as important as its actual results. It’s not very crucial if delete returns true instead of false, but it’s important that properties with DontDelete are not deleted and vice versa.

Modern browsers are generally pretty compliant. Besides this eval peculiarity I mentioned earlier, the following browsers pass test suite fully: Opera 7.54+, Firefox 1.0+, Safari 3.1.2+, Chrome 4+.

Safari 2.x and 3.0.4 have problems with deleting function arguments; those properties seem to be created without DontDelete, so it is possible to delete them. Safari 2.x has even more problems — deleting non-reference (e.g. delete 1) throws error; function declarations create deletable properties (but, strangely, not variable declarations); variable declarations in eval become non-deletable (but not function declarations).

Similar to Safari, Konqueror (3.5, but not 4.3) throws error when deleting non-reference (e.g. delete 1) and erroneously makes function arguments deletable.

Gecko DontDelete bug

Gecko 1.8.x browsers (Firefox 2.x, Camino 1.x, Seamonkey 1.x, etc.) exhibit an interesting bug where explicitly assigning to a property can remove its DontDelete attribute, even if that property was created via variable or function declaration:

    function foo(){}
    delete foo; // false (as expected)
    typeof foo; // "function" (as expected)
 
    /* now assign to a property explicitly */
 
    this.foo = 1; // erroneously clears DontDelete attribute
    delete foo; // true
    typeof foo; // "undefined"
 
    /* note that this doesn't happen when assigning property implicitly */
 
    function bar(){}
    bar = 1;
    delete bar; // false
    typeof bar; // "number" (although assignment replaced property)

Surprisingly, Internet Explorer 5.5 – 8 passes test suite fully except that deleting non-reference (e.g. delete 1) throws error (just like in older Safari). But there are actually more serious bugs in IE, that are not immediately apparent. These bugs are related to Global object.

IE bugs

The entire chapter just for bugs in Internet Explorer? How unexpected!

In IE (at least, 6-8), the following expression throws error (when evaluated in Global code):

    this.x = 1;
    delete x; // TypeError: Object doesn't support this action

and this one as well, but different exception, just to make things interesting:

    var x = 1;
    delete this.x; // TypeError: Cannot delete 'this.x'

It’s as if variable declarations in Global code do not create properties on Global object in IE. Creating property via assignment (this.x = 1) and then deleting it via delete x throws error. Creating property via declaration (var x = 1) and then deleting it via delete this.x throws another error.

But that’s not all. Creating property via explicit assignment actually always throws error on deletion. Not only is there an error, but created property appears to have DontDelete set on it, which of course it shouldn’t have:

    this.x = 1;
 
    delete this.x; // TypeError: Object doesn't support this action
    typeof x; // "number" (still exists, wasn't deleted as it should have been!)
 
    delete x; // TypeError: Object doesn't support this action
    typeof x; // "number" (wasn't deleted again)

Now, contrary to what one would think, undeclared assignments (those that should create a property on global object) do create deletable properties in IE:

    x = 1;
    delete x; // true
    typeof x; // "undefined"

But if you try to delete such property by referencing it via this in Global code (delete this.x), a familiar error pops up:

    x = 1;
    delete this.x; // TypeError: Cannot delete 'this.x'

If we were to generalize this behavior, it would appear that delete this.x from within Global code never succeeds. When property in question is created via explicit assignment (this.x = 1), delete throws one error; when property is created via undeclared assignment (x = 1) or via declaration (var x = 1), delete throws another error.

delete x, on the other hand, only throws error when property in question is created via explicit assignment — this.x = 1. If a property is created via declaration (var x = 1), deletion simply never occurs and delete correctly returns false. If a property is created via undeclared assignment (x = 1), deletion works as expected.

I was pondering about this issue back in September, and Garrett Smith suggested that in IE “The global variable object is implemented as a JScript object, and the global object is implemented by the host.” Garrett used Eric Lippert’s blog entry as a reference.

We can somewhat confirm this theory by performing a few tests. Note how this and window seem to reference the same object (if we can believe === operator), but Variable object (the one on which function is declared) is different from whatever this references.

    /* in Global code */
    function getBase(){ return this; }
 
    getBase() === this.getBase(); // false
    this.getBase() === this.getBase(); // true
    window.getBase() === this.getBase(); // true
    window.getBase() === getBase(); // false

Misconceptions

The beauty of understanding why things work the way they work cannot be overstated. I’ve seen a few misconceptions on the web related to misunderstanding of delete operator. For example, there’s this answer on Stackoverflow (with surprisingly high rating), confidently explaining how “delete is supposed to be no-op when target isn’t an object property”. Now that we understand the core of delete behavior, it becomes pretty clear that this answer is rather inaccurate. delete doesn’t differentiate between variables and properties (in fact, for delete, those are all References) and really only cares about DontDelete attribute (and property existence).

It’s also interesting to see how misconceptions bounce off of each other, where in the very same thread someone first suggests to just delete variable (which won’t work unless it’s declared from within eval), and another person provides a wrong correction how it’s possible to delete variables in Global code but not in Function one.

Be careful with Javascript explanations on the web, and ideally, always seek to understand the core of the issue ;)

`delete` and host objects

An algorithm for delete is specified roughly like this:

  1. If operand is not a reference, return true
  2. If object has no direct property with such name, return true (where, as we now know, object can be Activation object or Global object)
  3. If property exists but has DontDelete, return false
  4. Otherwise, remove property and return true
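The four steps above can be sketched as a plain function. This is an approximation under stated assumptions, not the specification’s text: sketchDelete is a hypothetical name, the isReference flag is passed in by hand because program code cannot inspect the internal Reference type, and DontDelete is approximated by the ES5 configurable flag.

```javascript
function sketchDelete(isReference, base, name) {
  if (!isReference) {
    return true;                                 // step 1: not a reference
  }
  var descriptor = Object.getOwnPropertyDescriptor(base, name);
  if (!descriptor) {
    return true;                                 // step 2: no direct property
  }
  if (!descriptor.configurable) {
    return false;                                // step 3: DontDelete
  }
  delete base[name];                             // step 4: remove the property
  return true;
}

var target = { removable: 1 };
Object.defineProperty(target, 'fixed', { value: 2, configurable: false });

var stepOne = sketchDelete(false, null, null);          // true
var stepTwo = sketchDelete(true, target, 'missing');    // true
var stepThree = sketchDelete(true, target, 'fixed');    // false
var stepFour = sketchDelete(true, target, 'removable'); // true
```

Note that true comes back in three different situations, and only one of them involves anything actually being removed.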

However, behavior of delete operator with host objects can be rather unpredictable. And there’s actually nothing wrong with that: host objects are allowed (by specification) to implement any kind of behavior for operations such as read (internal [[Get]] method), write (internal [[Put]] method) or delete (internal [[Delete]] method), among few others. This allowance for custom [[Delete]] behavior is what makes host objects so chaotic.

We’ve already seen some IE oddities, where deleting certain objects (which are apparently implemented as host objects) throws errors. Some versions of Firefox throw when trying to delete window.location. You can’t trust return values of delete either, when it comes to host objects; take a look at what happens in Firefox:

    /* "alert" is a direct property of `window` (if we were to believe `hasOwnProperty`) */
    window.hasOwnProperty('alert'); // true
 
    delete window.alert; // true
    typeof window.alert; // "function"

Deleting window.alert returns true, even though there’s nothing about this property that should lead to such result. It resolves to a reference (so can’t return true on the first step). It’s a direct property of a window object (so can’t return true on a second step). The only way delete could return true is after reaching step 4 and actually deleting a property. Yet, property is never deleted.
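The same kind of behavior can be imitated without a browser. The sketch below uses Proxy, which arrived in ECMAScript 2015 and is only an analogy for a host object with custom [[Delete]]; fakeHost is a made-up name and nothing here claims to describe how Firefox implements window.

```javascript
var fakeHost = new Proxy({ alert: function () {} }, {
  // Custom delete behavior: report success, remove nothing.
  deleteProperty: function () {
    return true;
  }
});

var hasOwnAlert = Object.prototype.hasOwnProperty.call(fakeHost, 'alert'); // true
var deleteReported = delete fakeHost.alert;                                // true
var typeAfterDelete = typeof fakeHost.alert;                               // "function"
```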

The moral of the story is to never trust host objects.

ES5 strict mode

So what does strict mode of ECMAScript 5th edition bring to the table? A few restrictions are being introduced. SyntaxError is now thrown when the expression in delete operator is a direct reference to a variable, function argument or function identifier. In addition, if the property has its internal [[Configurable]] attribute set to false, a TypeError is thrown:

  (function(foo){
 
    "use strict"; // enable strict mode within this function
 
    var bar;
    function baz(){}
 
    delete foo; // SyntaxError (when deleting argument)
    delete bar; // SyntaxError (when deleting variable)
    delete baz; // SyntaxError (when deleting variable created with function declaration)
 
    /* `length` of function instances has { [[Configurable]] : false } */
 
    delete (function(){}).length; // TypeError
 
  })();

In addition, deleting undeclared variable (or in other words, unresolved Reference) throws SyntaxError as well:

    "use strict";
    delete i_dont_exist; // SyntaxError

This is somewhat similar to the way undeclared assignment in strict mode behaves (except that ReferenceError is thrown instead of a SyntaxError):

    "use strict";
    i_dont_exist = 1; // ReferenceError

As you now understand, all these restrictions somewhat make sense, given how much confusion deleting variables, function declarations and arguments causes. Instead of silently ignoring deletion, strict mode takes more aggressive and descriptive measures.
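Since a SyntaxError prevents the whole snippet from running, these cases are easiest to try one at a time. The helper below is a sketch (errorNameOf is a hypothetical name) that compiles each body as a separate strict function and reports which error, if any, it produces. Math.PI is used as an example of a property with [[Configurable]] set to false.

```javascript
// Hypothetical helper: runs `body` as strict Function code and
// returns the name of the error it throws, or "none".
function errorNameOf(body) {
  try {
    new Function('"use strict";\n' + body)();
    return 'none';
  } catch (e) {
    return e.name;
  }
}

var deleteVariable = errorNameOf('var bar; delete bar;');             // "SyntaxError"
var deleteUnresolved = errorNameOf('delete i_dont_exist;');           // "SyntaxError"
var assignUnresolved = errorNameOf('i_dont_exist = 1;');              // "ReferenceError"
var deleteNonConfigurable = errorNameOf('delete Math.PI;');           // "TypeError"
var deleteOrdinary = errorNameOf('var o = { x: 1 }; delete o.x;');    // "none"
```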

Summary

This post turned out to be quite lengthy, so I’m not going to talk about things like removing array items with delete and what the implications of it are. You can always refer to MDC article for that particular explanation (or read specs and experiment yourself).

Here’s a short summary of how deletion works in Javascript:

  • Variables and function declarations are properties of either Activation or Global objects.
  • Properties have attributes, one of which — DontDelete — is responsible for whether a property can be deleted.
  • Variable and function declarations in Global and Function code always create properties with DontDelete.
  • Function arguments are also properties of Activation object and are created with DontDelete.
  • Variable and function declarations in Eval code always create properties without DontDelete.
  • New properties are always created with empty attributes (and so without DontDelete).
  • Host objects are allowed to react to deletion however they want.

If you’d like to get more familiar with things described here, please refer to ECMA-262 3rd edition specification.

I hope you enjoyed this overview and learned something new. Any questions, suggestions and corrections are as always welcomed.

Categories: DontDelete, ECMA-262, [[Delete]], delete operator, review, strict-mode

Optimizing HTML

December 29th, 2009 by kangax
  1. Why clean markup?
  2. Markup smells
  3. Additional optimizations
  4. Aggressive optimizations
  5. When things go wrong
  6. Antipatterns
  7. Tools
  8. Future considerations

Why clean markup?

Client-side optimization is getting a lot of attention lately, but some of its basic aspects seem to go unnoticed. If you look carefully at pages on the web (even those that are supposed to be highly optimized), it’s easy to spot a good amount of redundancies, and inefficient or archaic structures in their markup. All this baggage adds extra weight to pages that are supposed to be as light as possible.

The reason to keep documents clean is not so much about faster load times, as it is about having a solid and robust foundation to build upon. Clean markup means better accessibility, easier maintenance, and good search engine visibility. Smaller size is just a property of clean documents, and another reason to keep them this way.

In this post, we’ll take a look at HTML optimization: removing some of the common markup smells; reducing document size by getting rid of redundant structures, and employing minification techniques. We’ll look at currently available minification tools, and analyze what they do wrong and right. We’ll also talk about what can be done in the future.

Markup smells

So what are the most common offenders?

1. HTML comments in scripts

One of the gross redundancies nowadays is the inclusion of HTML comments (<!-- -->) in script blocks. There’s not much to say here, except that browsers that actually need this error-prevention measure (such as ‘95 Netscape 1.0) are pretty much extinct. Comments in scripts are just unnecessary baggage and should be removed ferociously.

2. <![CDATA[ … ]]> sections

Another often needless error-prevention measure is inclusion of CDATA blocks in SCRIPT elements:

  <script type="text/javascript">
    //<![CDATA[
      ...
    //]]>
  </script>

It’s a noble goal that falls short in reality. While CDATA blocks are a perfectly good way to prevent an XML processor from recognizing < and & as the start of markup, this only matters in true XHTML documents: those that are served with the “application/xhtml+xml” content-type. The majority of the web is still served as “text/html” (since, for example, IE doesn’t understand XHTML to this date), and so is parsed by browsers as HTML, not as XML.

Unless you’re serving documents as “application/xhtml+xml”, there’s little reason to have CDATA sections hanging around. Even if you’re planning to use XHTML in the future, it might make sense to remove unnecessary weight from the document, and only add it later, when actually needed.

And, of course, the ultimate solution here is to avoid inline scripts altogether (to take advantage of external script caching).
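Where inline scripts have to stay, the two wrappers described above can be stripped mechanically. Here is a minimal sketch; stripLegacyScriptWrappers is a made-up name, and it assumes the wrappers sit on their own lines at the very start and end of the script text, as in the examples above.

```javascript
// Hypothetical helper (not part of any existing tool): removes the legacy
// "hiding" wrappers from the text content of an inline SCRIPT element.
// Assumption: each wrapper sits on its own line at the start/end of the script.
function stripLegacyScriptWrappers(scriptText) {
  return scriptText
    // opening "<!--" line (browsers treat the rest of that line as a comment)
    .replace(/^\s*(?:\/\/[ \t]*)?<!--[^\n]*\n?/, '')
    // closing "//-->" or "-->" line
    .replace(/\n[ \t]*(?:\/\/[ \t]*)?-->\s*$/, '')
    // opening "//<![CDATA[" line (only the commented-out form is handled)
    .replace(/^\s*\/\/[ \t]*<!\[CDATA\[[^\n]*\n?/, '')
    // closing "//]]>" line
    .replace(/\n[ \t]*\/\/[ \t]*\]\]>\s*$/, '')
    .trim();
}

console.log(stripLegacyScriptWrappers('\n//<![CDATA[\n  var a = 1;\n//]]>\n')); // var a = 1;
console.log(stripLegacyScriptWrappers('<!--\nalert(1);\n//-->'));               // alert(1);
```

Other wrapper variants (block-comment forms, several wrappers nested in each other) are deliberately left alone here.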

3. onclick="…", onmouseover="…", etc.

There are some valid use cases for intrinsic event attributes, such as for performance reasons or to target ancient browsers (although I’m not aware of any environment that would understand event attributes, such as onclick="...", but not property-based assignments, such as element.onclick = ...). Besides well-known reasons to avoid them, such as separation of concerns and reusability, there’s a matter of markup pollution. By moving event logic to an external script, we can take advantage of that script’s caching. Event handler logic doesn’t need to be transferred to the client every time the document is requested.

4. onclick="javascript:…"

An interesting confusion of the javascript: pseudo-protocol and intrinsic event handlers results in this redundant mix (with 106,000 (!) occurrences). The truth is that the entire contents of an event handler attribute become the body of a function. That function then serves as the event handler (usually after having its scope augmented to include some or all of the ancestors and the element itself). The “javascript:” prefix merely becomes an unnecessary label and rarely serves any purpose.
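The “label” part is easy to check outside of a browser. The sketch below uses the Function constructor as a stand-in for the way an attribute value becomes a function body (real handlers also get the augmented scope mentioned above, which is ignored here). stripJavascriptLabel is a made-up helper name.

```javascript
// The text is compiled as a function body, so a leading "javascript:" is
// parsed as a statement label, not as a protocol.
const withLabel = new Function('javascript: return 1 + 1;');
const withoutLabel = new Function('return 1 + 1;');

// Hypothetical helper: drops the redundant prefix from an event attribute value.
function stripJavascriptLabel(attributeValue) {
  return attributeValue.replace(/^\s*javascript:\s*/i, '');
}

console.log(withLabel() === withoutLabel());                    // true
console.log(stripJavascriptLabel('javascript:doSomething()')); // doSomething()
```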

5. href="javascript:void(0)"

Continuing with the “javascript:” pseudo-protocol, there’s the infamous href="javascript:void(0)" snippet, used as a way to prevent default anchor behavior. This terrible practice of course makes the anchor completely inaccessible when Javascript is disabled, not available, or errors out. It should go without saying that the ideal solution is to include a proper URL in href and stop the default anchor behavior in an event handler. If, on the other hand, the anchor element is created dynamically and then inserted into a document (or is hidden initially, then shown via Javascript), plain href="#" is a leaner and faster alternative to the “javascript:” version.
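A rough sketch of the “proper URL plus event handler” approach follows. cancelDefaultAfter, the element id and showDetails are all made up for illustration. The handler both calls preventDefault (where an event object with that method is passed) and returns false, which cancels the default action when the function is assigned as element.onclick.

```javascript
// Hypothetical helper: wraps an action so that the link's default navigation
// is cancelled after the action has run.
function cancelDefaultAfter(action) {
  return function (event) {
    action(this);
    // Some older browsers pass no event object to the handler.
    if (event && typeof event.preventDefault === 'function') {
      event.preventDefault();
    }
    return false; // cancels the default action for element.onclick handlers
  };
}

// Usage in a page (names below are assumptions, not from any real page):
//   <a id="details-link" href="/details.html">Details</a>
//   document.getElementById('details-link').onclick = cancelDefaultAfter(showDetails);
```

With scripting unavailable, the link simply navigates to /details.html.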

6. style="…"

There’s nothing inherently wrong with the style attribute, except that by moving its contents to an external stylesheet, we can take advantage of resource caching. This is similar to avoiding event attributes, mentioned earlier. Even if you only need to style one particular element and are not planning to reuse its styles, remember that style information has to be transferred every time the document is requested. Moving styles to an external resource prevents this, as the stylesheet is transferred once and then cached on the client.

7. <script language="Javascript" … >

Probably one of the most misunderstood attributes is SCRIPT’s “language”. This attribute is so archaic that it was already deprecated in 1999 (!), 10 years ago, when HTML 4.01 became an official recommendation. There’s absolutely no reason to use this attribute, except for the rare cases when language version needs to be specified (and even that is somewhat unreliable and should probably be avoided if possible).

8. <script charset="…" … >

Another misunderstood part of the SCRIPT element is its charset attribute. Sometimes I see documents that include this kind of markup:

  <script type="text/javascript" charset="UTF-8">
    ...
  </script>

The thing is that charset attribute only really makes sense on “external” SCRIPT elements — those that have “src” attribute. HTML 4.01 even says:

Note that the charset attribute refers to the character encoding of the script designated by the src attribute; it does not concern the content of the SCRIPT element.

Testing shows that actual browser behavior also matches the spec in this regard.

Searching for this pattern reveals about 2000 occurrences. Not surprising, given that even popular apps like Textmate include wrong usage of charset.
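Removing the attribute from inline scripts only is simple enough to automate. This is a regex-based sketch, not a real HTML parser: stripInlineScriptCharset is a made-up name, and it assumes no “>” characters appear inside attribute values of SCRIPT tags.

```javascript
// Hypothetical helper: removes charset from SCRIPT start tags that have no src.
// Assumption: attribute values of SCRIPT tags contain no ">" characters.
function stripInlineScriptCharset(html) {
  return html.replace(/<script\b[^>]*>/gi, function (tag) {
    // External script: charset describes the referenced file, so keep it.
    if (/\ssrc\s*=/i.test(tag)) {
      return tag;
    }
    return tag.replace(/\s+charset\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/i, '');
  });
}

console.log(stripInlineScriptCharset('<script type="text/javascript" charset="UTF-8">'));
// <script type="text/javascript">
```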

Additional optimizations

We’ve covered some of the bad practices that should almost always be avoided. But there’s still more ahead, and that “more” is removing redundant parts. Optimizations explained below are often questionable, as they compromise clarity for size. Therefore I include them here not as a recommendation, but merely as an option. Employ with careful consideration.

1. <style media="all" …>

HTML 4.01 defines the media attribute on STYLE elements as a way of targeting a specific medium: screen, print, handheld, and so on. One of the possible values for media is “all”, which also happens to be the de-facto default among modern (and not so modern) browsers. If you find yourself using media="all", it should be safe to just omit it and let the browser set the value implicitly.

Interestingly, HTML 4.01 states that default value for media is “screen”. However, none of the browsers I tested [1] implement it as per specs, and default to “all” instead. This is probably why HTML 5 draft specifies default value as “all” — to match actual browsers’ behavior.

2. <form method="get" …>

Another default value — GET — of FORM element’s “method” attribute is often specified explicitly. There’s no harm in dropping it, except for lesser clarity. Note that HTML 5 draft leaves this behavior untouched.

3. <input type="text" …>

INPUT element’s “type” defaults to “text” in both HTML 4.01 and the HTML 5 draft. Dropping this attribute can result in substantial size savings on pages with lots of text fields.
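The three default values covered so far can be dropped with one small pass. Again, this is only a regex-based sketch: dropDefaultAttributes and the DEFAULT_ATTRIBUTES table are made-up names, and the code assumes attribute values contain no “>” characters.

```javascript
// Hypothetical table of attributes whose listed value is the browser default.
const DEFAULT_ATTRIBUTES = [
  { tag: 'style', attr: 'media', value: 'all' },
  { tag: 'form', attr: 'method', value: 'get' },
  { tag: 'input', attr: 'type', value: 'text' }
];

// Hypothetical helper: removes the attributes above when they carry the default.
function dropDefaultAttributes(html) {
  return DEFAULT_ATTRIBUTES.reduce(function (result, rule) {
    const tagPattern = new RegExp('<' + rule.tag + '\\b[^>]*>', 'gi');
    // Whole attribute, optionally quoted, followed by whitespace, "/" or ">".
    const attrPattern = new RegExp(
      '\\s+' + rule.attr + '\\s*=\\s*(["\']?)' + rule.value + '\\1(?=[\\s>/])', 'i');
    return result.replace(tagPattern, function (tag) {
      return tag.replace(attrPattern, '');
    });
  }, html);
}

console.log(dropDefaultAttributes('<form method="get" action="/s"><input type="text" name="q"></form>'));
// <form action="/s"><input name="q"></form>
```

One thing to watch out for: CSS or script selectors such as input[type="text"] match on the attribute itself, so they stop matching once the attribute is gone.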

4. <meta http-equiv="Content-type" …>

Specifying a document’s character encoding has always been a source of great confusion. Contrary to common belief, a META element that specifies Content-type does not have higher priority than the “Content-type” HTTP header that the document is served with. When both the header and the META element are specified, the header takes precedence.

If you control the server response and can set up the Content-type header properly, it’s safe to omit the META element. The only reason to keep it is to specify the encoding when the document is viewed offline.
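The precedence rule can be written down as a tiny function. This is only a model of the rule as described above, not of a browser’s full encoding detection; both function names are made up.

```javascript
// Hypothetical helper: extracts the charset parameter from a Content-type value,
// e.g. "text/html; charset=UTF-8". Returns null when there is none.
function charsetFromContentType(value) {
  const match = /charset\s*=\s*["']?([^\s;"']+)/i.exec(value || '');
  return match ? match[1].toLowerCase() : null;
}

// Header first; the META element's content is only consulted as a fallback.
function effectiveCharset(httpHeader, metaContent) {
  return charsetFromContentType(httpHeader) || charsetFromContentType(metaContent);
}

console.log(effectiveCharset('text/html; charset=ISO-8859-1', 'text/html; charset=UTF-8')); // iso-8859-1
console.log(effectiveCharset('text/html', 'text/html; charset=UTF-8'));                     // utf-8
```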

5. <a id="…" name="…" …>

The main reason the “name” attribute is still used together with “id” is for compatibility with ancient browsers (e.g. Netscape 4). Those couldn’t link to anchors by “id”, so “name” had to be used. If you have elements with matching name/id pairs, and don’t care about ancient browsers, feel free to get rid of this archaic pattern.

Watch out for any side effects. If you’re referencing elements by name in scripts (document.getElementsByName, document.evaluate, document.querySelectorAll, etc.), replacing names with ids might break things. Also remember that document.anchors only returns elements with name attributes.

6. <!doctype html>

A little more than a year ago, Dustin Diaz proposed using the HTML 5 doctype as a way to cut down on document size. This is not a major optimization, but if you don’t care about validation and need to squeeze every single byte out of the page, using <!doctype html> is a viable option. Tests revealed that this fancy doctype triggers standards mode in a large variety of browsers.

Aggressive optimizations

If you’re still craving more, here are a few extreme ideas. Some of these (e.g. omitting optional tags) have been circulating around for a while. Others I haven’t heard mentioned. Even though these might seem way too obtrusive, note that none of them really invalidate a document. That is, if the document is in HTML, not XHTML. But you’re serving documents as HTML anyway, aren’t you? ;)

  1. Remove HTML comments
  2. Remove/collapse whitespace
  3. Remove optional closing tags (<p>foo</p> → <p>foo)
  4. Remove quotes around attribute values, when allowed (<p class="foo"> → <p class=foo>)
  5. Remove optional values from boolean attributes (<option selected="selected"> → <option selected>)
  6. Munge inline styles, inline scripts and event attributes (if it’s not possible to remove them)
  7. Munge classes and ids (needs to be in sync with scripts and style declarations)
  8. Strip scheme names off of URLs (http://example.com → //example.com)
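To give an idea of how little code the simpler items take, here is a sketch covering items 1, 2, 4 and 5. It is regex-based and makes loud assumptions: the input contains no PRE, TEXTAREA, SCRIPT or STYLE elements, no conditional comments, and no “>” inside attribute values. minifyHtmlSketch and minifyStartTag are made-up names.

```javascript
// Hypothetical helper: shortens attributes inside a single start tag.
function minifyStartTag(tag) {
  return tag
    // 5. selected="selected" -> selected (a few common boolean attributes only)
    .replace(/\s(checked|selected|disabled|readonly|multiple)\s*=\s*(["']?)\1\2(?=[\s>])/gi, ' $1')
    // 4. class="foo" -> class=foo, only for values that are safe without quotes
    .replace(/(\s[\w-]+)\s*=\s*(["'])([\w.:-]+)\2(?=[\s>])/g, '$1=$3');
}

// Assumes: no PRE/TEXTAREA/SCRIPT/STYLE content and no conditional comments.
function minifyHtmlSketch(html) {
  return html
    .replace(/<!--[\s\S]*?-->/g, '')          // 1. remove comments
    .replace(/\s+/g, ' ')                     // 2. collapse whitespace runs
    .replace(/<[a-z][^>]*>/gi, minifyStartTag)
    .trim();
}

console.log(minifyHtmlSketch(
  '<!-- choices -->\n<select class="menu">\n  <option selected="selected">a</option>\n</select>'
));
// <select class=menu> <option selected>a</option> </select>
```

Whitespace is only collapsed, not removed, since removing it between inline elements can change rendering. Optional closing tags are left out on purpose: deciding when, say, </p> can go requires knowing what follows it.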

But we have compression!

Do all of these optimizations even matter when document is compressed? Doesn’t gzip eliminate most of the markup overhead? After all, it’s a textual format we’re talking about!

It still matters.

First of all, it’s good to remember that not everyone is getting gzip. This is very sad, but the good thing is that in such cases HTML optimization plays an even more significant role.

Second, even if document is served compressed, there are still savings of 5-10KB after compression (on an average document). Savings are even bigger with large documents. This might not seem like a lot, but in reality every byte counts.

As an example of compressing a large document, I munged the unofficial HTML version of the ECMA-262, 3rd edition specs, which originally weighed about 750KB (131KB gzipped), down to 606KB (115KB gzipped). That’s a saving of 16KB after gzipping, simply by removing whitespace, comments, attribute quotes and optional tags. You can see that the optimized version looks the same as the old one.
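For reference, here are the same numbers as absolute and relative savings. The sizes are the ones quoted above; savings is just a throwaway helper.

```javascript
// Throwaway helper: absolute saving in KB and relative saving in percent
// (rounded to one decimal place).
function savings(beforeKB, afterKB) {
  const savedKB = beforeKB - afterKB;
  return {
    savedKB: savedKB,
    percent: Math.round((savedKB / beforeKB) * 1000) / 10
  };
}

console.log(savings(750, 606)); // { savedKB: 144, percent: 19.2 }  (uncompressed)
console.log(savings(131, 115)); // { savedKB: 16, percent: 12.2 }   (gzipped)
```

So gzip absorbs part of the gain, but not all of it.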

Finally, optimizations like stripping whitespace and comments actually make resulting document tree lighter, potentially improving page rendering performance.

When things go wrong

As with any optimization, it’s very easy to get carried away. HTML Compact is a good example of HTML compression taken too far. This wonderful Windows app takes a “unique” approach to compressing HTML… by writing it into a document via Javascript.

Turning this perfectly clean document:

    <html>
      <head>
        <title></title>
      </head>
      <body>
        <div>
          <ul>
            <li>foo</li><li>bar</li><li>baz</li>
            <!-- few more dozens of list elements ... -->
          </ul>
        </div>
      </body>
    </html>

into this mess:

  <!--hcpage status="compressed"-->
  <html>
    <head>
      <SCRIPT LANGUAGE="JavaScript" SRC="hc_decoder.js"></SCRIPT>
      <title></title>
    </head>
    <BODY>
      <NOSCRIPT>To display this page you need a browser with JavaScript support.</NOSCRIPT>
      <SCRIPT LANGUAGE="JavaScript">
        <!--
          hc_d0("Mv#d|\x3C:,&c@w4YFAtD1 [... and so on, another couple hundreds of characters ...]");
        //-->
      </SCRIPT>
    </BODY>
  </html>

Needless to say, this kind of “optimization” should never be performed on the public web, unless the intention is to make documents inaccessible to users and search engines. And it hurts me seeing those NOSCRIPT elements, which fall short in clients behind Javascript-blocking firewalls. Bad idea, bad execution.

Antipatterns

The previous snippet was a good example of an optimization anti-pattern. There are, however, a few more you should be aware of:

1. Removing doctype

HTML Compressor has an option (on by default) to strip the doctype. I can’t think of a case where stripping it would be beneficial. On the contrary, a missing doctype triggers quirks mode and, as a result, wreaks havoc on page layout and behavior. Doctypes should be left alone or, instead, replaced with the shorter HTML 5 version.
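A quick way to check which mode a processed page ended up in is document.compatMode, which reports “BackCompat” in quirks mode and “CSS1Compat” otherwise. A minimal sketch (isQuirksMode is a made-up name):

```javascript
// Takes a document-like object so that it can also be tried outside a browser.
function isQuirksMode(doc) {
  return doc.compatMode === 'BackCompat';
}

// In a browser console: isQuirksMode(document)
console.log(isQuirksMode({ compatMode: 'BackCompat' })); // true
console.log(isQuirksMode({ compatMode: 'CSS1Compat' })); // false
```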

2. Replacing STRONG with B and EM with I

Another harmful option in the same HTML Compressor is to replace elements with their shorter “alternatives”. The problem here is that B is not really an alternative to STRONG. Neither is I a replacement for EM. STRONG and EM elements have semantic meaning (emphasis), whereas B and I are simply font-style elements; they affect text rendering, but carry no semantic meaning.

Even though browsers usually display these elements identically, screen readers and search engines very much understand the difference.

3. Removing title, alt attributes, and LABEL elements.

A good rule of thumb is to never optimize at the expense of accessibility. You might be tempted to remove that optional “alt” attribute on IMG elements, or “title” on anchors, but saving a few dozen bytes is really not worth the often-critical accessibility loss.

Tools

It’s more or less trivial to automate most of the tweaks from the “additional optimizations” section. There already exist tools that strip comments and whitespace, and remove quotes around attribute values. But these are still in their infancy and perform a very limited set of optimizations. We can definitely do better.

A couple of months ago, hakunin and I started working on a similar, Ruby-based compressor, but never had a chance to finish it.

So what do we have so far?

  1. Absolute HTML Compressor (desktop, windows)

    Does a great job, but only after turning off options like stripping doctype and replacing STRONG with B.

  2. HTML Compact (desktop, windows)

    Makes document inaccessible. Avoid.

  3. HTML Compressor (desktop, windows)

    Only removes whitespace, and even in whitespace-sensitive elements, such as PRE. Not very useful.

  4. Pretty Diff (web-based)

    Doesn’t have an option to completely remove whitespace (only collapses it). Doesn’t perform any optimizations except collapsing whitespace and removing newlines. Doesn’t respect whitespace-sensitive elements. Not very useful.

  5. htmlcompressor (java-based)

    Performs most of the optimizations described here (but doesn’t remove optional tags or shorten boolean attributes). Respects whitespace-sensitive elements. It is more or less the best option at the moment.

As you can see, the current state of affairs is pretty disappointing. There seem to be no compression tools for Mac/Linux, and those for Windows are hardly useful.
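Respecting whitespace-sensitive elements, which several of the tools above fail to do, doesn’t take much. A sketch, assuming only PRE and TEXTAREA need protecting and that they are not nested inside each other (collapseWhitespaceSafely is a made-up name):

```javascript
// Collapses whitespace runs to a single space, but copies PRE and TEXTAREA
// elements through untouched.
// Assumption: no nested PRE/TEXTAREA, no ">" inside their attribute values.
function collapseWhitespaceSafely(html) {
  // First alternative: a whole whitespace-sensitive element.
  // Second alternative: a whitespace run anywhere else.
  const pattern = /<(pre|textarea)\b[^>]*>[\s\S]*?<\/\1\s*>|\s+/gi;
  return html.replace(pattern, function (match, tagName) {
    return tagName ? match : ' ';
  });
}

console.log(JSON.stringify(collapseWhitespaceSafely('a  \n b<pre>  x\n  y</pre>  c')));
// "a b<pre>  x\n  y</pre> c"
```

Note that whitespace inside attribute values of other elements is collapsed too, which a real tool would have to avoid.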

Future considerations

Whereas munging and stripping can (and should) be done during production, markup smells are something that should never happen in the first place. Neither in production, nor in development. Not unless, for whatever reason, they are absolutely necessary.

Unsurprisingly, the best optimization one can do is often a manual one: changing document structure to avoid repeating classes on multiple elements (and instead moving them to the parent element), or eliminating chunks that are not immediately needed, and instead loading them dynamically. Replacing myriads of <br>’s or &nbsp;’s used inefficiently for presentational purposes, or that old table-based layout, are other good examples of manual cleaning.

As for all the other little tweaks, I expect more compression tools to appear in the near future, pushing size-reduction boundaries even further.

If you know more ways to optimize HTML, please share. I’d be glad to hear any questions, suggestions or corrections.

  1. Tested browsers were:
    Firefox 1, 1.5, 2, 3, 3.5;
    Opera 7.54, 8.54, 9.27, 9.64, 10.10;
    Safari 2.0.4, 3.0.4, 4;
    Chrome 4 — on Mac OS X 10.6.2.
    Internet Explorer 6, 7, 8 on Windows XP Pro SP2, and
    Konqueror 4.3.2 on Ubuntu 9.10.
