Beware the Unicode Byte Order Mark when merging files

What is the Unicode Byte Order Mark (BOM)?

The Byte Order Mark is the Unicode character U+FEFF ("zero-width no-break space") placed, optionally, at the very start of a text stream; from its actual byte representation, a text processor can work out the endianness of the characters that follow the BOM. For example, in UTF-16, if the first character of a string is represented as FE FF, the bytes in the string use the Big Endian order, while if the string used the Little Endian order, the first character would be represented as FF FE. This rule applies only to UTF-16 and UTF-32 encoded strings, and not to UTF-8: UTF-8 is defined as a sequence of single bytes (a character takes from one to four of them, always in the same order), so there is no byte order to specify.

Unfortunately for us, it's common practice (many Windows programs do this) to add a Byte Order Mark to UTF-8 encoded files as well, just to mark them as UTF-8 (in this case, the BOM is represented as EF BB BF).
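To see these byte sequences for yourself, here is a minimal sketch that asks .NET for the BOM (which it calls the "preamble") of an encoding. The class and method names are made up for illustration; only Encoding.GetPreamble and BitConverter.ToString are framework calls.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Subtext.
public static class BomBytes
{
    // Returns the BOM ("preamble") that .NET associates with an encoding,
    // formatted as space-separated hex bytes.
    public static string PreambleHex(Encoding encoding)
    {
        // BitConverter.ToString separates bytes with dashes; swap them for spaces.
        return BitConverter.ToString(encoding.GetPreamble()).Replace("-", " ");
    }
}
```

Calling BomBytes.PreambleHex with Encoding.BigEndianUnicode, Encoding.Unicode (UTF-16 Little Endian) and Encoding.UTF8 should give back the three sequences mentioned above: FE FF, FF FE and EF BB BF.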

Why am I telling you all of this, and why does it matter?

Because if you merge many UTF-8 files into one and you do it the wrong way, you will end up with BOMs in the middle of the file. A BOM is only meaningful at the very start of a stream, so some applications (like the CSS parser of Firefox) might treat the stray ones as "zero-width no-break spaces" or as other characters (like the ISO-8859-1 reading of the same bytes, "ï»¿"), and possibly break something.
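The two readings of the same three bytes can be reproduced with a small sketch. The class and method names are hypothetical; the point is only that the result depends on which encoding the consumer assumes.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class BomMisread
{
    private static readonly byte[] Utf8Bom = { 0xEF, 0xBB, 0xBF };

    // Decodes the three BOM bytes as ISO-8859-1: one character per byte.
    public static string AsLatin1()
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(Utf8Bom);
    }

    // Decodes the same bytes as UTF-8. GetString does not strip the BOM,
    // so the result is the single character U+FEFF.
    public static string AsUtf8()
    {
        return Encoding.UTF8.GetString(Utf8Bom);
    }
}
```

AsLatin1 returns the three visible characters "ï»¿", while AsUtf8 returns one invisible character, which is why the problem is so easy to miss when looking at the file in an editor.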

How I came into this problem

I was developing a new feature for Subtext, an HttpHandler that serves all the CSS files belonging to a skin as a single HTTP response, and I ran into the following issue: in Firefox the first CSS rule of each file was not interpreted at all (but everything was fine in IE).

Using Firebug I found out that Firefox was seeing the first line of each merged CSS as " .classname" (with a leading space) even if the physical line started with the dot (".classname").

This is the code I used to merge the files (and it is the wrong way to merge UTF-8 files):

// styles is the list of all the files to be merged
// with the path relative to the root of the website
foreach (string style in styles)
{
    context.Response.WriteFile(context.Server.MapPath(style));
}

The problem was that the WriteFile method wrote each file to the output stream in full, including the leading "EF BB BF". So the merged output had this sequence of bytes in the middle as well, and the CSS parser of Firefox interpreted it as a space and put it before the first rule, breaking it (apparently because the stray character ended up attached to the rule's selector).
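The effect is easy to reproduce outside ASP.NET. The following sketch does not call WriteFile; it simply assumes that copying each file's raw bytes one after the other is equivalent to what happened in the response. All names are invented for the example.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical stand-in for the raw, byte-level merge.
public static class NaiveMerge
{
    // Builds the bytes of a UTF-8 "file" saved with a BOM.
    public static byte[] WithBom(string text)
    {
        return Encoding.UTF8.GetPreamble()
            .Concat(Encoding.UTF8.GetBytes(text))
            .ToArray();
    }

    // Concatenates the raw bytes of each file; nothing removes the BOMs.
    public static byte[] ConcatRaw(params byte[][] files)
    {
        return files.SelectMany(f => f).ToArray();
    }
}
```

Merging WithBom("a{}") and WithBom(".b{}") with ConcatRaw and decoding the result as UTF-8 gives a string with U+FEFF at index 0 and again at index 4, right before ".b{}", which matches what Firebug was showing.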

So I changed the merging code to the following:

// styles is the list of all the files to be merged
// with the path relative to the root of the website
foreach (string style in styles)
{
    string cssFile = File.ReadAllText(context.Server.MapPath(style));
    context.Response.Write(cssFile);
}

The ReadAllText method reads all the text inside the file and, since it detects the encoding from the BOM, it strips the BOM out of the returned string. So the final output doesn't include any Byte Order Mark, and all the CSS rules are interpreted correctly.
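If the files are large and you would rather not load each one into a string, a StreamReader gives the same BOM handling while copying in chunks. This is only a sketch of an alternative, not what Subtext ships: the CssMerger class, the buffer size and the idea of passing in a TextWriter are my assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical alternative to the ReadAllText loop.
public static class CssMerger
{
    // Copies the text of each file to the output writer. StreamReader
    // consumes a leading BOM while detecting the encoding, so no U+FEFF
    // reaches the merged output.
    public static void Merge(IEnumerable<string> paths, TextWriter output)
    {
        char[] buffer = new char[4096];
        foreach (string path in paths)
        {
            // Third argument: detect the encoding from the byte order mark.
            using (var reader = new StreamReader(path, Encoding.UTF8, true))
            {
                int read;
                while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
                {
                    output.Write(buffer, 0, read);
                }
            }
        }
    }
}
```

In an HttpHandler the writer could be context.Response.Output, with the paths obtained through Server.MapPath as in the loop above.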

This issue is caused by the combination of two things:

first, the text editor used to save the CSS files added a BOM at the beginning of each UTF-8 file, which the Unicode standard neither requires nor recommends for UTF-8
second, the CSS parser inside Firefox treated the stray BOM as a space instead of ignoring it

Just to add more complexity to the issue, the problem happened only if the CSS file started with a comment (and the broken rule was the first after the comment), but not if it started directly with a CSS rule.
