A Sane Way of Sanitizing HTML

Web sites based on user-generated content need to …well… let users generate content (obviously you read this blog for its insights!).

Except that opens a huge can of worms. Some users will want to deface content that other users have created. Others will try to upload content that covers your web site in platypuses – just to prove they can. Still others will try to upload malicious content so they can steal other users’ personal information, such as passwords.

Let’s look at one small part of the problem – letting users add content to a site. Say you have a nice little form that looks something like this (note this form does not work):

Create Some Content!

Soon your first user wanders by – and is aghast at the primitiveness of your solution. How do I make things bold? Color? Links? Pictures?

At this point a bright idea pops into your head – why not write a nice, simple, little formatting language? Maybe we could make italic text like ''this''. And bold text like '''this'''. Thankfully, you soon come to your senses and realize this has all been done before – just pick your favorite among MediaWiki, Textile, Markdown, and scores of others.

Then again, why bother making users learn some strange new markup language? What about a nice editor – something like this. Then again, if you hate your users, you could try something like this. Either way, you happily code up your nice editor and wait for good things to happen.

When Bad Things Happen

Before you know it, you’re running the next MySpace, and you’re barely off the phone with Yahoo when Google comes calling. But then disaster strikes – one of your valued customers, Sammy, has decided to befriend every other member. Your website crashes under the load, Yahoo and Google decide you’re an incompetent lout and they shower their millions on your evil competitor.

What went wrong? You got burned by a classic example of cross-site scripting. Sammy embedded JavaScript into the content he uploaded for his profile. When a user, Sally, looked at it, the embedded JavaScript was executed in her browser, causing Sally to unknowingly become Sammy’s friend. And then when Sue looked at Sally’s profile, she became Sammy’s friend. And soon everyone loved Sammy.

Don’t Trust Your Users

The sad moral of the story is don’t trust your users. You have to sanitize everything uploaded to your web site. Your nice little editor produces HTML – which you blithely accept and store right into the database.

Although a bit ashamed at your gaffe, you figure any fool can clean up a bit of HTML. You code up a baroque regular expression, that only you can understand, and update your web site. You figure Google will be back begging tomorrow.

Tomorrow dawns, and now everyone is friends with Sally. Egads – what went wrong this time? After a bit of searching you come across a sketchy-looking site that has pages and pages of examples on how to defeat your primitive defences.
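To see why a regular expression makes a poor defence, here is a minimal sketch. The `naive_sanitize` helper and both attack strings are hypothetical, made up for illustration rather than taken from any real site. The filter strips `<script>` blocks and still lets two well-known tricks through.

```ruby
# Hypothetical naive sanitizer: strips <script>...</script> blocks with a regex.
def naive_sanitize(html)
  html.gsub(/<script\b[^>]*>.*?<\/script>/im, '')
end

# Attack 1: no <script> element at all; the code hides in an event attribute.
event_attack = %q{<img src="missing.png" onerror="alert('friend me')" />}

# Attack 2: nested tags, so that removing the inner blocks assembles a new one.
nested_attack = %q{<scr<script></script>ipt>alert('friend me')</scr<script></script>ipt>}

puts naive_sanitize(event_attack)   # unchanged, the onerror handler survives
puts naive_sanitize(nested_attack)  # prints <script>alert('friend me')</script>
```

Patching the expression for these two cases only invites the next variation, which is why the rest of the story moves to a real parser.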

Despairing, you take the day off and ponder becoming a rock star, or if that fails, a real estate agent.

Friends Abound

The next day, everyone has decided to become friends with Sue. Your heart warms as you watch all your users getting to know each other. That feeling lasts through your first cup of coffee, when you get slapped by a lawsuit from some weird-sounding European country you’ve never heard of, claiming that you’ve willfully exposed personal details about your users without their consent. You figure it’s time to buckle down and solve this problem once and for all.

A bit of searching on Google turns up a bewildering number of choices. You find one site though, HTML Purifier, that offers a nice comparison of the options. By now it’s lunch, and you call up your buddy Bill to grab some food.


You tell Bill about your woes. He calls you an idiot and says that you have to buy him lunch. He then points out that your fancy little editor generates XHTML, right? So why don’t you use libxml to parse the XHTML, and tell it to validate it against XHTML Basic? Seeing the bewildered look in your eye, Bill sighs deeply, and tries again.

Look, he says, trying to validate HTML leads right into the morass of tag soup, where browsers do their best to render whatever you throw at them. Although that sounds nice, in reality it leaves a vast attack surface for someone to slip in malicious content, often using invalid HTML.

If you switch to XHTML, you immediately eliminate that problem. Continuing, Bill also points out that you can reuse all the work that has gone into building the XML tool chain. And even better, he explains, the W3C has kindly spent the last five years breaking XHTML into different modules. Each module is rigorously defined via a DTD.

They have also been kind enough to define XHTML Basic, which combines a subset of XHTML modules to create a simplified version of XHTML that can run on small devices such as PDAs and cellphones. It eliminates most of the nasty XHTML elements – with a tweak here and a tweak there you can get rid of them all. And while you are at it, it’s probably best to eliminate all the predefined character entities (you do use UTF-8, don’t you?).

So all you have to do is take the XHTML Basic DTD and validate user input against it. You say that sounds awfully difficult. Bill laughs, and quickly writes a few lines of Ruby code on a napkin:

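require 'xml/libxml_so'

# Returns true when the markup parses as XML and validates against the DTD.
def verify(html)
  dtd = XML::Dtd.new("public", 'xhtml1-transitional.dtd')
  parser = XML::Parser.string(html)
  doc = parser.parse # fails here if the markup is not well-formed XML
  doc.validate(dtd)
end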

You stare incredulously – that’s it? Bill replies – not quite. A DTD can say which attributes an element may carry, but it can’t check what their values contain, so you still have to make sure there aren’t any nasty JavaScript fragments lurking in them. For example:

<img src="javascript:alert('you have been hacked')" />

He also mentions that you could have libxml validate against a Relax NG schema instead, which supports validation of attributes.
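The attribute check Bill describes could be sketched roughly as follows. This is an illustration, not part of Bill’s napkin: the `safe_url?` helper and the `ALLOWED_SCHEMES` list are hypothetical names, and the sketch assumes the value has already been decoded by the XML parser, so entity tricks are resolved before the check runs.

```ruby
# Assumed whitelist of URL schemes; adjust it to what your site really needs.
ALLOWED_SCHEMES = %w[http https mailto].freeze

# Hypothetical helper: true when an href or src value is safe to keep.
def safe_url?(value)
  # Browsers tolerate whitespace and control characters inside a scheme
  # name, so drop them before looking at the scheme.
  cleaned = value.gsub(/[[:space:][:cntrl:]]/, '')
  scheme = cleaned[/\A([a-zA-Z][a-zA-Z0-9+.\-]*):/, 1]
  return true if scheme.nil? # relative URL, no scheme to worry about

  ALLOWED_SCHEMES.include?(scheme.downcase)
end

puts safe_url?("http://example.com/cat.png")                 # prints true
puts safe_url?("javascript:alert('you have been hacked')")   # prints false
puts safe_url?("Java\tScript:alert(1)")                      # prints false
```

A whitelist is used rather than a blacklist of known-bad schemes, on the theory that it is easier to list the few schemes you want than every one you do not.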

And voila, you’ve successfully plugged at least one security hole in your web site. Undoubtedly there are many more to be found.

Update – It’s definitely worth reading the great comment from Ambush Commander, who is the author of HTML Purifier, an HTML sanitization library. I should have been more clear that XHTML Basic is a good starting point since it removes a large portion of XHTML that you don’t want to support. However, it still includes dangerous elements like <script> and <object>, so clearly you have to remove those. Anyway, it’s worth checking out his comment and my response.
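The comments below also point out that validators check whole documents, so a submitted fragment has to be wrapped first, and that the fragment should not be allowed to bring its own declarations. A minimal sketch of that step, assuming a hypothetical `wrap_fragment` helper (the name and the exact wrapper markup are illustrative):

```ruby
# Hypothetical helper: turns a user-submitted fragment into a full document
# so that a whole-document validator can check it.
def wrap_fragment(fragment)
  # Refuse anything that could introduce its own DOCTYPE, entity
  # declarations, or processing instructions.
  if fragment =~ /<!DOCTYPE|<!ENTITY|<\?/i
    raise ArgumentError, 'declarations and processing instructions are not allowed'
  end

  <<~XHTML
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head><title>user content</title></head>
      <body>#{fragment}</body>
    </html>
  XHTML
end

puts wrap_fragment('<p>Hello, <strong>world</strong>!</p>')
```

The wrapped string can then be handed to a validator, with the DTD or schema chosen by your code rather than by the submitter.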

  1. February 10, 2007

    Thanks for linking to my comparison article at HTML Purifier. Being the author of that library, I’d like to throw in my two cents on the issue.

    You noted some of the problems of attempting to validate a document using the XHTML Basic DTD (no attribute parsing) and the ability of RELAX NG schema to peek in those attributes, but there are still some problems:

    1. The schemas validate *entire* documents, not the HTML fragments that users usually submit. You’ll need to wrap the HTML in at least an html and a body tag. In a similar vein, XHTML Basic allows things like html, meta and body tags, which aren’t appropriate for XHTML fragments.
    2. The schemas are, to put it lightly, not very forgiving. Even with a WYSIWYG editor, things can get past, and even the slightest element in the wrong place will result in failed validation and a cryptic error message. XML currently does not have facilities for graceful error handling.
    3. XHTML Basic allows script tags. And object tags. And events. Sorry, but if you’re going to piggy-back off XML validation functionality, construct your own schema! XHTML Basic is not an XSS-defender.

    (Plug Warning): HTML Purifier validates according to hand-vetted XHTML 1.0 DTDs, with bad stuff taken out, as well as comprehensive attribute validation schemes. Recent developmental versions validate according to XHTML 1.1’s modularization. I strongly recommend you take a second look at the library! Thanks.

  2. Charlie Savage –
    February 10, 2007

    Hi Ambush Commander,

    Thanks for the great comment. I was very impressed by your HTML Purifier library. I think your approach of using a full-fledged parser and validating against a white list is the only workable solution to this problem. And in fact, the approach I discussed, using libxml to validate against a DTD or schema, is the same. The reason why I prefer that route is that you can reuse the XML infrastructure that has been built over the last five years.

    Caveats do apply though:

    1. You have to use XHTML.
    2. As you mention, you have to wrap the text to parse with an <html> element. This seems like a good idea anyway though, since otherwise someone could embed a processing instruction at the top of the submitted text that points to a different schema to validate against (I totally forgot to mention that in the article).
    3. Agreed that XHTML Basic still leaves in dangerous elements like script and object, and those need to be removed. That is what I meant by the sentence “with a tweak here and tweak there.” I should have been more explicit. So yes, you have to create your own schema. XHTML Basic is a good starting point since it’s a lot closer to what you want than full XHTML.

    Thanks again for the comment. It sounds to me like we are in general agreement, but let me know if that’s not the case! I am curious as to why you took the approach you did though, instead of using libxml or another validating parser?

  3. February 11, 2007

    >*You have to use XHTML*

    Surprisingly, this is not always the case. PHP’s [DOM](http://us2.php.net/manual/en/ref.dom.php) extension has a method called DOMDocument->loadHTML() which accepts poorly formed HTML documents. I believe DOM is based off of libxml.

    >*It sounds to me we are in general agreement, but let me know if that’s not the case!*

    For the most part. In my opinion, using an XML parser/validator is a quick and easy way to get some HTML filtering capabilities, although it isn’t necessarily the best. But I do agree that the only way to filter HTML is to understand it fully.

    >*I am curious as to why you took the approach you did though, instead of using libxml or another validating parser?*

    Besides the limitations I enumerated in my first comment as well as the fact that even Relax NG’s treatment of attribute validation isn’t satisfactory either (it will allow javascript:// URIs, for instance), this library, quite simply put, wasn’t available.

    HTML Purifier has support for PHP 4 and PHP 5, but libxml is only bundled with PHP 5.1.0 or higher. The PHP 4 library, expat, does not have validating capabilities. When libxml is available, I use it to parse the document (yes, wrapping the code in html tags), but then I use my own validator system.

    P.S. It’s Ambush *Commander* not Ambush *Defender*. You can use my real name (Edward Z. Yang) though. 😉

  4. Charlie Savage –
    February 12, 2007

    Hi Edward,

    My apologies on your name – got it right in the comment and wrong on the blog. It’s fixed now.

    As far as HTML and using PHP to correctly parse it, that sounds good. Another alternative that I’m sure you know about (mentioning more for others to see) is HTML Tidy – not sure if that is better or worse.

    And thanks for the explanation about when you can and can’t use libxml, definitely makes sense. I do most of my work with Ruby. Ruby doesn’t have libxml support built-in, but luckily there is an open source extension that provides Ruby bindings.

    Thanks again for the great comments.

  5. greenday
    February 8, 2008

    check out the htmLawed library