There is general discontent with the state of XML processing in Ruby – see for example here or here. An obvious solution is to use libxml. However, that has been a non-starter, since the libxml Ruby bindings have historically caused numerous segmentation faults, don’t run on Windows and recently lost their maintainer, Dan Janowski. Making it even more frustrating, Dan had spent the last year rearchitecting the bindings and had successfully fixed the segmentation faults.
Since MapBuzz heavily depends on libxml, it seemed time to step in and contribute. Over the last two weeks I’ve added support for Windows, cleaned out the bug database and patch list, resolved the few remaining segmentation issues, greatly improved the RDocs and refactored large portions of the code base to conform with modern Ruby extension standards.
After iterating through a couple of releases over the last two weeks, the Ruby libxml community is happy to announce the availability of version 0.8.0, which we believe is ready for prime time. It offers a great combination of speed, functionality and conformance (libxml passes all 1800+ tests in the OASIS XML Test Suite).
So give it a try – it’s as easy to install as:
gem install libxml-ruby
If you’re on Windows, there may be an extra step. If you haven’t already installed libxml2, the libxml-ruby distribution includes a prebuilt libxml2 DLL in the libxml-ruby/mingw directory. Copy the DLL to libxml-ruby/lib, your Ruby bin directory, or somewhere on your path (basically, anywhere Windows can find it).
Undoubtedly there are still some bugs left, so please report any you find so we can fix them in future releases.
The major reason people consider using libxml-ruby is performance. Here are the results from running (on my laptop) a few simple benchmarks that have recently been blogged about on the Web (you can find them in the benchmark directory of the libxml distribution).
From Zack Chandler:
             user     system      total        real
libxml   0.032000   0.000000   0.032000 (  0.031000)
Hpricot  0.640000   0.031000   0.671000 (  0.890000)
REXML    1.813000   0.047000   1.860000 (  2.031000)
From Stephen Bannasch:
              user     system      total        real
libxml    0.641000   0.031000   0.672000 (  0.672000)
hpricot   5.359000   0.062000   5.421000 (  5.516000)
rexml    22.859000   0.047000  22.906000 ( 23.203000)
From Andreas Meingast:
LIBXML THROUGHPUT:
10.2570516817665 MB/s
10.2570830340359 MB/s
12.6992253283934 MB/s
10.2570516817665 MB/s
8.51116888387252 MB/s
10.2570830340359 MB/s

HPRICOT THROUGHPUT:
0.211597647822036 MB/s
0.202390771964726 MB/s
0.180272812529665 MB/s
0.198474511420818 MB/s
0.198474499681793 MB/s
0.180925089981179 MB/s

REXML THROUGHPUT:
0.130301425548982 MB/s
0.131630590068325 MB/s
0.128316078417727 MB/s
0.125203555921636 MB/s
0.120181872867636 MB/s
0.115330940074107 MB/s
I can’t vouch for the appropriateness of the tests, but going by the numbers above they show libxml running roughly 8x to 50x faster than Hpricot and 35x to 80x faster than REXML, depending on the test. I’d be happy to accept additional or more appropriate tests if you have any.
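For readers who want to try something similar, here is a minimal sketch of how a benchmark of this kind can be put together. It is not one of the three benchmarks above. The sample document, the sample_xml and throughput_mb_per_s helpers, and the choice of 1 MB = 1024 * 1024 bytes are all assumptions made for illustration. REXML is always measured, and libxml is measured only if the gem is installed.

```ruby
require 'benchmark'
require 'rexml/document'

begin
  require 'libxml' # optional: only measured when libxml-ruby is installed
rescue LoadError
  warn 'libxml-ruby not found, benchmarking REXML only'
end

# Hypothetical helper: builds a small sample document in memory.
def sample_xml(items)
  body = (1..items).map { |i| "<item id=\"#{i}\"><name>Item #{i}</name></item>" }.join
  "<items>#{body}</items>"
end

# Hypothetical helper: throughput in MB/s, assuming 1 MB = 1024 * 1024 bytes.
def throughput_mb_per_s(bytes, seconds)
  bytes / seconds / (1024.0 * 1024.0)
end

xml = sample_xml(1_000)

# Each entry parses the whole string into a document tree.
parsers = { 'rexml' => ->(s) { REXML::Document.new(s) } }
parsers['libxml'] = ->(s) { LibXML::XML::Parser.string(s).parse } if defined?(LibXML)

# user / system / total / real timings, in the same layout as the tables above.
Benchmark.bm(8) do |bm|
  parsers.each do |name, parse|
    bm.report(name) { 5.times { parse.call(xml) } }
  end
end

# Throughput figures for a single parse.
parsers.each do |name, parse|
  seconds = Benchmark.realtime { parse.call(xml) }
  puts format('%-8s %.2f MB/s', name, throughput_mb_per_s(xml.bytesize, seconds))
end
```

The numbers this prints will depend heavily on the machine, the Ruby version and the shape of the document, so treat them as relative rather than absolute.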
An Embarrassment of Riches
In addition to performance, the libxml-ruby bindings provide impressive coverage of libxml’s functionality. Goodies include:
- XMLReader (streaming interface)
- XML Schema
- XSLT (split into the libxslt-ruby bindings)
Now, your first reaction might be that SAX, DOM and XPath are all you need, but validating parsers make it a whole lot easier to sanitize user contributed content on web sites. And the XMLReader offers a clever way of combining the DOM’s ease of use (well, ok, compared to SAX at least) with SAX’s memory and speed advantages.
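To give a feel for the XMLReader style, here is a small sketch that counts element names while pulling one node at a time from the reader, so the whole document never has to be held in memory as a tree. It assumes a recent libxml-ruby release in which XML::Reader.string is available and read returns true or false (the exact calls in 0.8.0 may differ). The count_elements helper and the sample feed are made up for illustration.

```ruby
begin
  require 'libxml'
rescue LoadError
  warn 'libxml-ruby not installed; try: gem install libxml-ruby'
end

# Hypothetical helper: counts how often each element name occurs.
# The reader advances node by node, much like a cursor over the document.
def count_elements(xml)
  reader = LibXML::XML::Reader.string(xml)
  counts = Hash.new(0)
  while reader.read
    # Only opening element nodes are counted; text and closing nodes are skipped.
    counts[reader.name] += 1 if reader.node_type == LibXML::XML::Reader::TYPE_ELEMENT
  end
  counts
end

if defined?(LibXML)
  xml = '<feed><entry><title>a</title></entry><entry><title>b</title></entry></feed>'
  p count_elements(xml)
end
```

Unlike SAX, there are no callbacks to register: the loop asks for the next node when it is ready, which tends to keep the control flow easy to follow.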
Better yet, most of this functionality is exposed via an easy-to-use, Ruby-like API. There are of course still some warts lurking in the code, where libxml’s C API leaks through to Ruby, but they are being removed one by one. And for those of you who aren’t C hackers, much of this work can be done in good old Ruby.
A Long History
For such a useful and full-featured library, the libxml-ruby bindings have a star-crossed history. Out of curiosity, I went back and traced their lineage. Sean Chittenden originally wrote them back in 2002. At the start of 2005, Trans Onoma adopted the project after Sean had moved on, and at the end of 2005 the bindings found their current home on RubyForge. At that point Ross Bamford took over maintenance and worked on the bindings for roughly a year, until early 2007, when the bindings again became unmaintained. Dan Janowski picked up the ball in 2007 and completely overhauled the bindings’ memory model. Sadly, Dan had to give up active support this spring.
But on the bright side, Trans, Dan and Sean are all once again active on the mailing list, providing valuable experience and insight. From my point of view, with the renewed push towards a production-quality release and the effort to bring in new users, the libxml-ruby community is as healthy as it has been in a long while.