<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>cfis : A Sane Way of Sanitizing HTML</title>
    <link>http://cfis.savagexi.com</link>
    <atom:link rel="self" type="application/rss+xml" href="http://cfis.savagexi.com/2007/02/08/a-sane-way-of-sanitizing-html?format=rss"/>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Charlie Savage's Blog</description>
    <item>
      <title>Trackback from cfis: Ruby, libxml and Windows on A Sane Way of Sanitizing HTML</title>
      <description>Let's say my last post tempted you to check out libxml's validation functionality. However, you work in a shop that develops on multiple platforms, including Windows, OS X and Linux, so your code has to run on all three platforms. You happily note th...</description>
      <pubDate>Fri, 09 Feb 2007 01:06:18 -0700</pubDate>
      <guid isPermaLink="false">urn:uuid:0568ef1c-061f-4854-aa23-cd58b7aaab1b</guid>
      <link>http://cfis.savagexi.com/2007/02/08/a-sane-way-of-sanitizing-html#trackback-112</link>
    </item>
    <item>
      <title>Comment on A Sane Way of Sanitizing HTML by Ambush Commander</title>
      <description>&lt;p&gt;Thanks for linking to my comparison article at HTML Purifier. Being the author of that library, I&amp;#8217;d like to throw in my two cents on the issue.&lt;/p&gt;

&lt;p&gt;You noted some of the problems of attempting to validate a document using the XHTML Basic DTD (no attribute parsing) and the ability of RELAX NG schema to peek in those attributes, but there are still some problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The schemas validate &lt;em&gt;entire&lt;/em&gt; documents, not HTML fragments that users usually submit. You&amp;#8217;ll need to wrap html in at least an html and body tag. In a similar vein, XHTML Basic allows things like html, meta and body tags, which aren&amp;#8217;t appropriate for XHTML fragments&lt;/li&gt;
&lt;li&gt;The schemas are, to put it lightly, not very forgiving. Even with a WYSIWYG editor, things can get past, and even the slightest element in the wrong place will result in failed validation and a cryptic error message. XML currently does not have facilities for graceful error handling&lt;/li&gt;
&lt;li&gt;XHTML Basic allows script tags. And object tags. And events. Sorry, but if you&amp;#8217;re going to piggy-back on XML validation functionality, construct your own schema! XHTML Basic is not an XSS-defender.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;(Plug Warning): HTML Purifier validates according to hand-vetted XHTML 1.0 DTDs, with bad stuff taken out, as well as comprehensive attribute validation schemes. Recent developmental versions validate according to XHTML 1.1&amp;#8217;s modularization. I strongly recommend you take a second look at the library! Thanks.&lt;/p&gt;</description>
      <pubDate>Fri, 09 Feb 2007 21:06:01 -0700</pubDate>
      <guid isPermaLink="false">urn:uuid:9015fdc0-3c60-4802-81c3-47d6152c0cd3</guid>
      <link>http://cfis.savagexi.com/2007/02/08/a-sane-way-of-sanitizing-html#comment-113</link>
    </item>
    <item>
      <title>Comment on A Sane Way of Sanitizing HTML by cfis</title>
      <description>&lt;p&gt;Hi Ambush Commander,&lt;/p&gt;

&lt;p&gt;Thanks for the great comment. I was very impressed by your HTML Purifier library. I think your approach of using a full-fledged parser and validating against a white list is the only workable solution to this problem. And in fact, the approach I discussed, using libxml to validate against a DTD or schema, is the same. The reason why I prefer that route is that you can reuse the xml infrastructure that has been built over the last five years.&lt;/p&gt;

&lt;p&gt;Caveats do apply though:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You have to use XHTML.&lt;/li&gt;
&lt;li&gt;As you mention, you have to wrap the text to parse with an &amp;lt;html&amp;gt; element.  This seems like a good idea anyways though, since otherwise someone could embed a processing instruction at the top of the submitted text that points to a different schema to validate against (I totally forgot to mention that in the article).&lt;/li&gt;
&lt;li&gt;Agreed that XHTML Basic still leaves in dangerous elements like script and object, and those need to be removed.  That is what I meant by the sentence &amp;#8220;with a tweak here and tweak there.&amp;#8221;  I should have been more explicit.   So yes, you have to create your own schema. XHTML Basic is a good starting point since it&amp;#8217;s a lot closer to what you want than full XHTML.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;Thanks again for the comment. It sounds to me we are in general agreement, but let me know if that&amp;#8217;s not the case!  I am curious as to why you took the approach you did though, instead of using libxml or another validating parser?&lt;/p&gt;</description>
      <pubDate>Sat, 10 Feb 2007 12:02:23 -0700</pubDate>
      <guid isPermaLink="false">urn:uuid:c0929fe0-6d16-472f-8aab-9ad452fb843d</guid>
      <link>http://cfis.savagexi.com/2007/02/08/a-sane-way-of-sanitizing-html#comment-115</link>
    </item>
    <item>
      <title>Comment on A Sane Way of Sanitizing HTML by Edward Z. Yang</title>
      <description>&lt;blockquote&gt;&lt;p&gt;&lt;em&gt;You have to use XHTML&lt;/em&gt;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Surprisingly, this is not always the case. PHP&amp;#8217;s &lt;a href="http://us2.php.net/manual/en/ref.dom.php" rel="nofollow"&gt;DOM&lt;/a&gt; extension has a method called DOMDocument-&gt;loadHTML() which accepts poorly formed HTML documents. I believe DOM is based off of libxml.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;&lt;em&gt;It sounds to me we are in general agreement, but let me know if that&#8217;s not the case!&lt;/em&gt;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;For the most part. In my opinion, using an XML parser/validator is a quick and easy way to get some HTML filtering capabilities, although it isn&amp;#8217;t necessarily the best. But I do agree that the only way to filter HTML is to understand it fully.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;&lt;em&gt;I am curious as to why you took the approach you did though, instead of using libxml or another validating parser?&lt;/em&gt;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Besides the limitations I enumerated in my first comment as well as the fact that even Relax NG&amp;#8217;s treatment of attribute validation isn&amp;#8217;t satisfactory either (it will allow javascript:// URIs, for instance), this library, quite simply put, wasn&amp;#8217;t available.&lt;/p&gt;

&lt;p&gt;HTML Purifier has support for PHP 4 and PHP 5, but libxml is only bundled with PHP 5.1.0 or higher. The PHP 4 library, expat, does not have validating capabilities. When libxml is available, I use it to parse the document (yes, wrapping the code in html tags), but then I use my own validator system.&lt;/p&gt;

&lt;p&gt;P.S. It&amp;#8217;s Ambush &lt;em&gt;Commander&lt;/em&gt; not Ambush &lt;em&gt;Defender&lt;/em&gt;. You can use my real name (Edward Z. Yang) though. ;-)&lt;/p&gt;</description>
      <pubDate>Sun, 11 Feb 2007 13:37:40 -0700</pubDate>
      <guid isPermaLink="false">urn:uuid:c1f4564d-49e5-4ec6-8e9d-5f4e7f5f2cb9</guid>
      <link>http://cfis.savagexi.com/2007/02/08/a-sane-way-of-sanitizing-html#comment-114</link>
    </item>
    <item>
      <title>Comment on A Sane Way of Sanitizing HTML by cfis</title>
      <description>&lt;p&gt;Hi Edward,&lt;/p&gt;

&lt;p&gt;My apologies on your name - got it right in the comment and wrong on the blog.  It&amp;#8217;s fixed now.&lt;/p&gt;

&lt;p&gt;As far as HTML and using PHP to correctly parse it, that sounds good.  Another alternative that I&amp;#8217;m sure you know about (mentioning more for others to see) is HTML Tidy - not sure if that is better or worse.&lt;/p&gt;

&lt;p&gt;And thanks for the explanation about when you can and can&amp;#8217;t use libxml, definitely makes sense.  I do most of my work with Ruby.  Ruby doesn&amp;#8217;t have libxml support built-in, but luckily there is an open source extension that provides Ruby bindings.&lt;/p&gt;

&lt;p&gt;Thanks again for the great comments.&lt;/p&gt;</description>
      <pubDate>Sun, 11 Feb 2007 20:45:33 -0700</pubDate>
      <guid isPermaLink="false">urn:uuid:4859e87d-0978-42fe-bc34-c878fbabea6c</guid>
      <link>http://cfis.savagexi.com/2007/02/08/a-sane-way-of-sanitizing-html#comment-116</link>
    </item>
    <item>
      <title>Comment on A Sane Way of Sanitizing HTML by greenday</title>
      <description>&lt;p&gt;check out the &lt;a href="http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/index.php" rel="nofollow"&gt;htmLawed&lt;/a&gt; library&lt;/p&gt;</description>
      <pubDate>Thu, 07 Feb 2008 22:00:44 -0700</pubDate>
      <guid isPermaLink="false">urn:uuid:148bcd6f-3544-473b-b019-57aee2f290e3</guid>
      <link>http://cfis.savagexi.com/2007/02/08/a-sane-way-of-sanitizing-html#comment-6210</link>
    </item>
    <item>
      <title>Trackback from cfis: Resurrecting libxml-ruby on A Sane Way of Sanitizing HTML</title>
      <description>There is general discontent with the state of XML processing in Ruby - see for example here or here. An obvious solution is to use libxml. However that has been a non-starter since the libxml Ruby bindings have historically caused numerous segemen...</description>
      <pubDate>Wed, 16 Jul 2008 10:41:02 -0600</pubDate>
      <guid isPermaLink="false">urn:uuid:a7513f74-99c1-43f0-a162-29b7cc5d2ce6</guid>
      <link>http://cfis.savagexi.com/2007/02/08/a-sane-way-of-sanitizing-html#trackback-22071</link>
    </item>
  </channel>
</rss>
