Resurrecting libxml-ruby

There is general discontent with the state of XML processing in Ruby (see for example here or here). An obvious solution is to use libxml. However, that has been a non-starter: the libxml Ruby bindings have historically caused numerous segmentation faults, don’t run on Windows, and recently lost their maintainer, Dan Janowski. Even more frustrating, Dan had spent the last year rearchitecting the bindings and had successfully fixed the segmentation faults.

Since MapBuzz heavily depends on libxml, it seemed time to step in and contribute. Over the last two weeks I’ve added support for Windows, cleaned out the bug database and patch list, resolved the few remaining segmentation issues, greatly improved the RDocs and refactored large portions of the code base to conform with modern Ruby extension standards.

After iterating through a couple of releases over the last two weeks, the Ruby libxml community is happy to announce the availability of version 0.8.0, which we believe is ready for prime time. It offers a great combination of speed, functionality and conformance (libxml passes all 1800+ tests in the OASIS XML Test Suite).

So give it a try. It’s as easy to install as:

gem install libxml-ruby

If you’re on Windows, there is an extra step unless libxml2 is already installed. The libxml-ruby distribution includes a prebuilt libxml2 dll in the libxml-ruby/mingw directory. Copy the dll to libxml-ruby/lib, your Ruby bin directory, or anywhere else on your path where Windows can find it.
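A quick way to check that the install worked is to load the bindings and parse a tiny document. This is only a sketch: it assumes a recent libxml-ruby release, where the bindings are loaded with `require 'libxml'` and live under the `LibXML::XML` namespace, and the helper name `libxml_available?` is made up for illustration.

```ruby
# Returns true when the libxml-ruby extension (and the libxml2 library it
# links against) can be loaded, false otherwise.
# `libxml_available?` is a hypothetical helper, not part of libxml-ruby.
def libxml_available?
  require 'libxml'
  true
rescue LoadError => e
  warn "libxml-ruby failed to load: #{e.message}"
  false
end

if libxml_available?
  # Parse a one-element document and print the text of its root node.
  doc = LibXML::XML::Document.string('<greeting>hello</greeting>')
  puts doc.root.content
else
  puts 'Check that libxml2 (libxml2.dll on Windows) is somewhere on your path.'
end
```

If the load fails on Windows, the usual suspect is that libxml2.dll is not in a directory Windows searches.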

Undoubtedly there are still some bugs left, so please report anything you find so we can fix it in a future release.

Blindingly Fast

The major reason people consider using libxml-ruby is performance. Here are the results from running (on my laptop) a few simple benchmarks that have recently been blogged about on the Web (you can find them in the benchmark directory of the libxml distribution).

From Zack Chandler:

              user     system      total        real
libxml    0.032000   0.000000   0.032000 (  0.031000)
Hpricot   0.640000   0.031000   0.671000 (  0.890000)
REXML     1.813000   0.047000   1.860000 (  2.031000)

From Stephen Bannasch:

              user     system      total        real
libxml    0.641000   0.031000   0.672000 (  0.672000)
hpricot   5.359000   0.062000   5.421000 (  5.516000)
rexml    22.859000   0.047000  22.906000 ( 23.203000)

From Andreas Meingast:

LIBXML THROUGHPUT:
	10.2570516817665 MB/s
	10.2570830340359 MB/s
	12.6992253283934 MB/s
	10.2570516817665 MB/s
	8.51116888387252 MB/s
	10.2570830340359 MB/s

HPRICOT THROUGHPUT:
	0.211597647822036 MB/s
	0.202390771964726 MB/s
	0.180272812529665 MB/s
	0.198474511420818 MB/s
	0.198474499681793 MB/s
	0.180925089981179 MB/s

REXML THROUGHPUT:
	0.130301425548982 MB/s
	0.131630590068325 MB/s
	0.128316078417727 MB/s
	0.125203555921636 MB/s
	0.120181872867636 MB/s
	0.115330940074107 MB/s

I can’t vouch for the appropriateness of the tests, but going by the numbers above they show libxml clocking in at anywhere from roughly 8x to 50x the speed of Hpricot and 30x to 80x that of REXML, depending on the test. I’d be happy to accept additional or more appropriate tests if you have any.
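If you want to try a comparison of your own, the tables above have the shape produced by Ruby's standard Benchmark module, so a harness along the following lines is one way to start. It is a rough sketch, not one of the benchmarks listed above: the sample document and the helper names `sample_xml` and `parsers` are invented for illustration, and it assumes a recent libxml-ruby (loaded with `require 'libxml'`, exposing `LibXML::XML::Document.string` and `find`). The libxml row is skipped if the gem is not installed.

```ruby
require 'benchmark'
require 'rexml/document'

# libxml-ruby is optional here: it is only benchmarked if it loads.
begin
  require 'libxml'
  HAVE_LIBXML = true
rescue LoadError
  HAVE_LIBXML = false
end

# Builds a small, made-up catalog document with the given number of items.
def sample_xml(items = 100)
  body = (1..items).map { |i| "<item id=\"#{i}\"><name>Item #{i}</name></item>" }.join
  "<catalog>#{body}</catalog>"
end

# Maps a parser name to a lambda that parses the string into a DOM
# and counts the <item> elements via XPath.
def parsers
  table = {
    'REXML' => ->(xml) { REXML::XPath.match(REXML::Document.new(xml), '//item').size }
  }
  if HAVE_LIBXML
    table['libxml'] = ->(xml) { LibXML::XML::Document.string(xml).find('//item').length }
  end
  table
end

xml = sample_xml
Benchmark.bm(8) do |bm|
  parsers.each do |name, parse|
    bm.report(name) { 10.times { parse.call(xml) } }
  end
end
```

A document this small mostly measures startup overhead, so swap in a file that resembles your real data before drawing conclusions.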

An Embarrassment of Riches

In addition to performance, the libxml-ruby bindings provide impressive coverage of libxml’s functionality. Goodies include:

  • SAX
  • DOM
  • XMLReader (streaming interface)
  • XPath
  • XPointer
  • XML Schema
  • DTDs
  • XSLT (split into the libxslt-ruby bindings)

Now, your first reaction might be that SAX, DOM and XPath are all you need, but validating parsers make it a whole lot easier to sanitize user-contributed content on web sites. And the XMLReader offers a clever way of combining the DOM’s ease of use (well, ok, compared to SAX at least) with SAX’s memory and speed advantages.
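To give a feel for the XMLReader style, here is a minimal sketch that pulls the text out of every `<name>` element while walking the document node by node instead of loading it all into a tree. It assumes a recent libxml-ruby, where `LibXML::XML::Reader#read` returns true or false (older releases may behave differently); the sample XML and the helper name `names_via_reader` are made up for illustration.

```ruby
begin
  require 'libxml'
rescue LoadError
  warn 'libxml-ruby is not installed; the example below will be skipped.'
end

# Collects the text of every <name> element using the streaming reader,
# so the whole document never has to be held in memory as a DOM.
# `names_via_reader` is a hypothetical helper, not part of libxml-ruby.
def names_via_reader(xml)
  reader = LibXML::XML::Reader.string(xml)
  names = []
  while reader.read
    # Only element start nodes are of interest; skip text, end tags, etc.
    next unless reader.node_type == LibXML::XML::Reader::TYPE_ELEMENT
    names << reader.read_string if reader.name == 'name'
  end
  names
end

if defined?(LibXML)
  xml = '<catalog><item><name>Widget</name></item><item><name>Gadget</name></item></catalog>'
  p names_via_reader(xml)
end
```

The loop reads like DOM code, but only the current node is available at any time, which is where the memory savings come from.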

Better yet, most of this functionality is exposed via an easy-to-use, Ruby-like API. There are of course still some warts lurking in the code, where libxml’s C API leaks through to Ruby, but they are being removed one by one. And for those of you who aren’t C hackers, much of this work can be done in good old Ruby.

A Long History

For such a useful and full-featured library, the libxml-ruby bindings have a star-crossed history. Out of curiosity, I went back and traced their lineage. Sean Chittenden originally wrote them back in 2002. At the start of 2005, Trans Onoma adopted the project after Sean had moved on, and at the end of 2005 the bindings found their current home on RubyForge. At that point Ross Bamford took over maintenance and worked on the bindings for roughly a year, until early 2007, when the bindings again became unmaintained. Dan Janowski picked up the ball in 2007 and completely overhauled the bindings’ memory model. Sadly, Dan had to give up active support this spring.

But on the bright side, Trans, Dan and Sean are all once again active on the mailing list, providing valuable experience and insight. From my point of view, with the renewed push towards a production quality release, and bringing in new users, the libxml-ruby community is as healthy as it has been in a long while.

  1. July 16, 2008

    Hey Charlie, what great news you give us!

    Great work to all the guys involved in this project, now and in the past.

    For the libxml2.dll requirement maybe we can build libxml statically instead, but I think the next step is worthy.

    Again, great work!

  2. Charlie
    July 16, 2008

    Hi Luis,

    I just read an interview with you from earlier this year on Akita on Rails. Great work on the one-click installer. Ping me at my email (it’s on the about page) if you’d like to talk about it a bit more. I’ve gotten pretty good at this MSVC/MinGW Windows Ruby stuff after doing it for a couple of years now.

  3. Charlie Savage –
    July 16, 2008

    Hi Luis,

    About the libxml2.dll. It would be great if libxml-ruby could come in the one-click installer, and in that case, I’d put the dll into the Ruby bin directory like is done for other libraries. And if we’re doing that, how about adding libxslt-ruby also?

  4. July 16, 2008

    Hey Charlie,

    The current VC6 One-Click Installer is a huge monolithic script, and it is hard to maintain and test everything that has been bundled there.

    We are working on the new rubyinstaller sandbox and build environment that uses Windows Installer and provides a smaller runtime (Ruby+RubyGems) to get you started, plus a Developer Kit (MinGW+MSYS) to build those gems that don’t come pre-built for Windows.

    I’ve been thinking of fixing the binary file bundling in rubygems to allow .dll files to be part of gems…

    I’ll get in touch with you soon about libxml 🙂
    (in any case, I invite you to rubyinstaller-devel mailing list):

    http://rubyforge.org/projects/rubyinstaller

    Thanks again for your kind words!

  5. July 16, 2008

    This is awesome news! Thanks for all your hard work!

  6. malcontent
    July 16, 2008

    You are a fucking hero!.

    Awesome work.

  7. filterfish
    July 17, 2008

    What malcontent said!

  8. July 17, 2008

    I’d like to talk to you about possibly building a Java backend for the libxml library. Since so many people seem to be interested in a good XML library these days, and since XML libraries by and large look pretty much the same on the surface, I’m pretty sure we could map the libxml APIs onto JAXP for JRuby. That would allow people even more confidence to move to libxml for all XML purposes, since they wouldn’t be tied to the C impls. So hey, drop me a line or find me on FreeNode @ #jruby.

  9. Rajmohan
    July 17, 2008

    On Windows, you mentioned the extra step of copying the prebuilt libxml2.dll. Is there a place I can copy the prebuilt dll from, or would I need to build it?

  10. Jesus
    July 17, 2008

    Thanks, boy. That’s a great piece of work you can be proud of!

  11. Charlie Savage –
    July 17, 2008

    Rajmohan,

    A prebuilt libxml2.dll is already in the libxml/mingw directory. If you already have libxml installed on your system, you won’t have to do anything. If you don’t, then copy the dll to your ruby/bin directory or somewhere on your Windows path.

    Also, I’ve updated the post to be a bit more clear on how to get things up and running on Windows.

  12. le_fnord
    July 17, 2008

    so many thanks for your work,

    that’s exactly the thing
    I was searching for for my thesis.

  13. July 17, 2008

    Awesome work!! Does anyone know if this is “plug and play” for Rails or does it need adapting? For ActiveResource based apps this could mean a significant performance boost! Congrats!

  14. July 17, 2008

    This is awesome! Thanks for the hard work.

  15. Charlie Savage –
    July 17, 2008

    AkitaOnRails,

    Not sure. Want to find out? We gladly accept patches 🙂

  16. July 17, 2008

    That’s great news!

  17. July 17, 2008

    Fantastic! Thank you very much for all your work.

  18. Dan Healy
    July 17, 2008

    First off, great work and thanks for taking up libxml-ruby.

    Even after changing the search path to something that exists in the document I’m parsing, this example still doesn’t seem to work for me:

    `doc.find('//root_node/foo/bar').each do |node|
      puts "Node path: #{node.path} \t Contents: #{node.content}"
    end`

    In fact, I can’t seem to get anything related to .find to work at all (.empty? always returns true). Got any suggestions?

  19. July 17, 2008

    http://saxon.sourceforge.net/saxon6.5.3/expressions.html

    More specifically, if you look at the table at the bottom of that page, I can get things like ‘*’ and ‘*[last()]’ to work, but anything involving a literal string like ‘XXX’ or ‘//XXX’ will return a nil NodeSet.

  20. July 19, 2008

    Currently I’m using REXML and ruby-xsl for a project, and the code is ugly and slow. I’m going to rewrite the code to come back to libxml-ruby. Great work!

  21. Charlie Savage –
    July 19, 2008

    Sounds good Alex. If you feel like doing the intermediate step, writing a wrapper for libxml with the same api as REXML, then do submit it to the libxml project and we’ll include it in the distribution (as long as it passes all REXML tests).

    Charlie

  22. Kurt Burns
    July 21, 2008

    I am getting segmentation faults on 64-bit Red Hat fairly randomly and fairly regularly while doing a large amount of data importing. I was using 0.5.4 and never once had a seg fault. After upgrading, I am getting them a lot. Any idea if seg faults are an issue with 64-bit right now? I am currently downgrading version by version to see if I can determine whether this is truly related to the upgrade.

  23. Charlie Savage –
    July 21, 2008

    Hi Kurt,

    Could you post to the libxml-devel mailing list, where we can try to work out the issue? Any additional information would be helpful: an example XML file, example code, and best of all a stack trace.

  24. jim Cropcho
    July 25, 2008

    sweet dude; thanks!

  25. Junaid
    July 30, 2008

    Great work. Thanks

  26. November 10, 2008

    HUZZAH!
