A Sane Way of Sanitizing HTML

Web sites based on user-generated content need to …well… let users generate content (obviously you read this blog for its insights!).

Except that opens a huge can of worms. Some users will want to deface content that other users have created. Others will try to upload content that covers your web site in platypuses – just to prove they can. Still others will try to upload malicious content so they can steal other users’ personal information, such as passwords.

Let’s look at one small part of the problem – letting users add content to a site. Say you have a nice little form that looks something like this (note this form does not work):

Create Some Content!

Soon your first user wanders by – and is aghast at the primitiveness of your solution. How do I make things bold? Color? Links? Pictures?

At this point a bright idea pops into your head – why not write a nice, simple, little formatting language? Maybe we could make italic text like ''this''. And bold text like '''this'''. Thankfully, you soon come to your senses and realize this has all been done before – just pick your favorite among MediaWiki, Textile, Markdown, and scores of others.

Then again, why bother making users learn some strange new markup language? What about a nice editor – something like this. Then again, if you hate your users, you could try something like this. Either way, you happily code up your nice editor and wait for good things to happen.

When Bad Things Happen

Before you know it, you’re running the next MySpace, and you’re barely off the phone with Yahoo when Google comes calling. But then disaster strikes – one of your valued customers, Sammy, has decided to befriend every other member. Your website crashes under the load, Yahoo and Google decide you’re an incompetent lout and they shower their millions on your evil competitor.

What went wrong? You got burned by a classic example of cross-site scripting. Sammy embedded JavaScript into the content he uploaded for his profile. When a user, Sally, looked at it, the embedded JavaScript was executed, causing Sally to unknowingly become Sammy’s friend. And then when Sue looked at Sally’s profile, she became Sammy’s friend. And soon everyone loved Sammy.

Don’t Trust Your Users

The sad moral of the story is don’t trust your users. You have to sanitize everything uploaded to your web site. Your nice little editor produces HTML – which you blithely accept and store right into the database.

Although a bit ashamed at your gaffe, you figure any fool can clean up a bit of HTML. You code up a baroque regular expression, that only you can understand, and update your web site. You figure Google will be back begging tomorrow.

Tomorrow dawns, and now everyone is friends of Sally. Eeek gads – what went wrong this time? After a bit of searching you come across a sketchy looking site that has pages and pages of examples on how to defeat your primitive defences.
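To see why the baroque regular expression keeps losing, here is a minimal sketch in Ruby. The naive_sanitize helper and both payloads are made up for illustration (they are not taken from the site in the story); the only point is that a filter which deletes <script> blocks in a single pass is easy to sidestep.

```ruby
# Hypothetical naive sanitizer: delete anything that looks like a
# <script>...</script> block and hope for the best.
def naive_sanitize(html)
  html.gsub(/<script\b[^>]*>.*?<\/script>/im, '')
end

# Trick 1: split the tag so that deleting the inner match glues the
# outer pieces back together into a real <script> element.
nested = '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>'

# Trick 2: skip <script> altogether and use an event-handler attribute.
handler = '<img src="x" onerror="alert(1)" />'

puts naive_sanitize(nested)
puts naive_sanitize(handler)
```

The first payload comes out the other side as a working <script> element, and the second never contained one to begin with, so the filter passes it through untouched.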

Despairing, you take the day off and ponder becoming a rock star, or if that fails, a real estate agent.

Friends Abound

The next day, everyone has decided to become friends with Sue. Your heart warms as you watch all your users getting to know each other. That feeling lasts through your first cup of coffee, when you get slapped by a lawsuit from some weird-sounding European country you’ve never heard of claiming that you’ve willfully exposed personal details about your users without their consent. You figure it’s time to buckle down and solve this problem once and for all.

A bit of searching on Google turns up a bewildering number of choices. You find one site though, HTML Purifier, that offers a nice comparison of the options. By now it’s lunch, and you call up your buddy Bill to grab some food.


You tell Bill about your woes. He calls you an idiot and says that you have to buy him lunch. He then points out that your fancy little editor generates XHTML, right? So why don’t you use libxml to parse the XHTML and tell it to validate it against XHTML Basic? Seeing the bewildered look in your eye, Bill sighs deeply, and tries again.

Look, he says, trying to validate HTML leads right into the morass of tag soup, where browsers do their best to render whatever you throw at them. Although that sounds nice, in reality it leaves a vast attack space for someone to slip in malicious content, oftentimes using invalid HTML.

If you switch to XHTML, you immediately eliminate that problem. Continuing, Bill also points out that you can reuse all the work that has gone into building the XML tool chain. And even better, he explains, the W3C has kindly spent the last five years breaking XHTML into different modules. Each module is rigorously defined via a DTD.

They have also been kind enough to define XHTML Basic, which combines a subset of XHTML modules to create a simplified version of XHTML that can run on small devices such as PDAs and cellphones. It eliminates most of the nasty XHTML elements – with a tweak here and there you can get rid of them all. And while you are at it, it’s probably best to eliminate all the predefined character entities (you do use UTF-8, don’t you?).
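Bill’s point about tag soup is easy to check with any strict XML parser. The sketch below uses REXML, which ships with Ruby, purely as a stand-in for libxml; well_formed? is a hypothetical helper name, and the two inputs are made-up examples. It only checks well-formedness, not validity against a DTD.

```ruby
require 'rexml/document'

# Hypothetical helper: true only if the markup parses as well-formed XML.
# A browser would happily render the second example below; an XML parser
# refuses it outright.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

puts well_formed?('<p><b>fine</b></p>')
puts well_formed?('<p><b>tag soup</p></b>')
```

Rejecting anything that fails to parse means the sloppy, half-valid markup that many attacks rely on never reaches the rest of the checks.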

So all you have to do is take the XHTML Basic DTD and validate user input against it. You say that sounds awfully difficult. Bill laughs, and quickly writes a few lines of Ruby code on a napkin:

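require 'xml' # libxml-ruby gem; the original napkin said 'xml/libxml_so'

def verify(html)
  # DTD to validate against (the napkin names Transitional; use your XHTML Basic DTD)
  dtd = XML::Dtd.new("public", 'xhtml1-transitional.dtd')
  # Parsing raises on anything that is not well-formed XML
  doc = XML::Parser.string(html).parse
  # True if the document conforms; some libxml-ruby versions raise XML::Error instead
  doc.validate(dtd)
end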

You stare incredulously – that’s it? Bill replies – not quite. DTDs can’t verify what is inside attribute values, so you still have to make sure there aren’t any nasty JavaScript fragments lurking in them. For example:

<img src="javascript:alert('you have been hacked')" />
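One way to catch an attribute like that is to check the URL scheme against a short allowlist. This is a rough sketch, not a complete defense: safe_url? and SAFE_SCHEMES are hypothetical names, the list of schemes is an assumption, and the value is assumed to have already been decoded by the XML parser (so character references such as &#106; have been resolved before the check runs).

```ruby
# Assumed allowlist; adjust to whatever your site really needs.
SAFE_SCHEMES = %w[http https mailto].freeze

# Hypothetical helper: true if the attribute value is a relative URL or
# uses one of the allowed schemes.
def safe_url?(value)
  # Browsers tend to ignore whitespace and control characters inside the
  # scheme, so strip them before looking at it.
  cleaned = value.gsub(/[\s\x00-\x1f]/, '')
  scheme = cleaned[/\A([a-zA-Z][a-zA-Z0-9+.\-]*):/, 1]
  return true if scheme.nil? # no scheme at all: a relative URL
  SAFE_SCHEMES.include?(scheme.downcase)
end

puts safe_url?("javascript:alert('you have been hacked')")
puts safe_url?('http://example.com/platypus.png')
puts safe_url?('/images/platypus.png')
```

Allowing a few known schemes is safer than trying to list the bad ones, since anything new or unexpected is rejected by default.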

He also mentions that you could have libxml validate against a RELAX NG schema instead, which supports validation of attribute values.

And voila, you’ve successfully plugged at least one security hole in your web site. Undoubtedly there are many more to be found.

Update – It’s definitely worth reading the great comment from Ambush Commander, who is the author of HTML Purifier, an HTML sanitization library. I should have been more clear that XHTML Basic is a good starting point since it removes a large portion of XHTML that you don’t want to support. However, it still includes dangerous elements like <script> and <object>, so clearly you have to remove those. Anyway, it’s worth checking out his comment and my response.
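As a rough illustration of that last point, here is a sketch that walks a parsed document and reports any elements you have decided to forbid. It uses REXML (bundled with Ruby) rather than libxml, and the FORBIDDEN list and forbidden_elements helper are hypothetical names chosen for the example, not part of HTML Purifier or any other library.

```ruby
require 'rexml/document'

# Assumed denylist, taken from the two elements named above.
FORBIDDEN = %w[script object].freeze

# Hypothetical helper: names of forbidden elements found anywhere in the
# (well-formed) XHTML fragment, in document order.
def forbidden_elements(xhtml)
  doc = REXML::Document.new(xhtml)
  found = []
  doc.each_recursive do |element|
    found << element.name if FORBIDDEN.include?(element.name.downcase)
  end
  found
end

p forbidden_elements('<div><p>hi</p><script>alert(1)</script></div>')
p forbidden_elements('<p>perfectly innocent</p>')
```

In practice it is probably cleaner to delete those elements from your copy of the DTD so that validation rejects them directly, but a check like this makes a handy second line of defense.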