There is general discontent with the state of XML processing in Ruby – see for example here or here. An obvious solution is to use libxml. However, that has been a non-starter since the libxml Ruby bindings have historically caused numerous segmentation faults, don’t run on Windows and recently lost their current maintainer, Dan Janowski. Making it even more frustrating, Dan had spent the last year rearchitecting the bindings, successfully fixing the segmentation faults.
Since MapBuzz heavily depends on libxml, it seemed time to step in and contribute. Over the last two weeks I’ve added support for Windows, cleaned out the bug database and patch list, resolved the few remaining segmentation issues, greatly improved the RDocs and refactored large portions of the code base to conform with modern Ruby extension standards.
After iterating through a couple of releases over the last two weeks, the Ruby libxml community is happy to announce the availability of version 0.8.0, which we believe is ready for prime time. It offers a great combination of speed, functionality and conformance (libxml passes all 1800+ tests in the OASIS XML Test Suite).
So give it a try – it’s as easy to install as:
gem install libxml-ruby
If you’re on Windows, there is an extra step if you haven’t already installed libxml2. The libxml-ruby distribution includes a prebuilt libxml2 dll in the libxml-ruby/mingw directory. Copy the dll to libxml-ruby/lib, your Ruby bin directory, or somewhere on your path (basically, put it someplace where Windows can find it).
Undoubtedly there are still some bugs left, so please report anything you find so we can fix it in future releases.
Blindingly Fast
The major reason people consider using libxml-ruby is performance. Here are the results from running (on my laptop) a few simple benchmarks that have recently been blogged about on the Web (you can find them in the benchmark directory of the libxml distribution).
From Zack Chandler:
          user       system     total       real
libxml    0.032000   0.000000   0.032000 (  0.031000)
Hpricot   0.640000   0.031000   0.671000 (  0.890000)
REXML     1.813000   0.047000   1.860000 (  2.031000)
From Stephen Bannasch:
          user        system     total        real
libxml     0.641000   0.031000    0.672000 (  0.672000)
hpricot    5.359000   0.062000    5.421000 (  5.516000)
rexml     22.859000   0.047000   22.906000 ( 23.203000)
From Andreas Meingast:
LIBXML THROUGHPUT:
10.2570516817665 MB/s
10.2570830340359 MB/s
12.6992253283934 MB/s
10.2570516817665 MB/s
8.51116888387252 MB/s
10.2570830340359 MB/s
HPRICOT THROUGHPUT:
0.211597647822036 MB/s
0.202390771964726 MB/s
0.180272812529665 MB/s
0.198474511420818 MB/s
0.198474499681793 MB/s
0.180925089981179 MB/s
REXML THROUGHPUT:
0.130301425548982 MB/s
0.131630590068325 MB/s
0.128316078417727 MB/s
0.125203555921636 MB/s
0.120181872867636 MB/s
0.115330940074107 MB/s
I can’t vouch for the appropriateness of the tests, but across the three of them libxml clocks in at roughly 8x to 50x faster than hpricot and 35x to 80x faster than REXML. I’d be happy to accept additional tests or more appropriate tests if you have any.
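For anyone who wants to contribute additional tests, here is a minimal sketch of a harness that prints tables in the same user/system/total/real format as the results above. It is not the code from the benchmark directory: the synthetic document, the `PARSERS` registry and the `run_benchmarks` helper are all made up for illustration, and the libxml call assumes a recent libxml-ruby release. Parsers that aren’t installed are simply skipped.

```ruby
require 'benchmark'

# Synthetic input, invented for this sketch; a real benchmark should
# use representative documents instead.
XML_DOC = '<items>' +
          (1..1000).map { |i| "<item id=\"#{i}\">value #{i}</item>" }.join +
          '</items>'

# Hypothetical registry: parser name => lambda that parses a string.
PARSERS = {}

begin
  require 'rexml/document'
  PARSERS['rexml'] = ->(xml) { REXML::Document.new(xml) }
rescue LoadError
  # REXML is not available; leave it out of the comparison.
end

begin
  require 'libxml'
  PARSERS['libxml'] = ->(xml) { LibXML::XML::Parser.string(xml).parse }
rescue LoadError
  # libxml-ruby is not installed; leave it out of the comparison.
end

# Parse the document `iterations` times with each registered parser and
# return a hash of parser name => Benchmark::Tms timing.
def run_benchmarks(xml, iterations = 20)
  results = {}
  Benchmark.bm(8) do |bm|
    PARSERS.each do |name, parse|
      results[name] = bm.report(name) { iterations.times { parse.call(xml) } }
    end
  end
  results
end

run_benchmarks(XML_DOC)
```

Parse-only timings like these say nothing about XPath queries or serialization, so treat the numbers as a starting point.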
An Embarrassment of Riches
In addition to performance, the libxml-ruby bindings provide impressive coverage of libxml’s functionality. Goodies include:
- SAX
- DOM
- XMLReader (streaming interface)
- XPath
- XPointer
- XML Schema
- DTDs
- XSLT (split into the libxslt-ruby bindings)
Now, your first reaction might be that SAX, DOM and XPath are all you need, but validating parsers make it a whole lot easier to sanitize user-contributed content on web sites. And the XMLReader offers a clever way of combining the DOM’s ease of use (well, ok, compared to SAX at least) with SAX’s memory and speed advantages.
Better yet, most of this functionality is exposed via an easy-to-use, Ruby-like API. There are still of course some warts lurking in the code, where libxml’s C API leaks through to Ruby, but they are being removed one by one. And for those of you who aren’t C hackers, much of this work can be done in good old Ruby.
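To make the XMLReader point concrete, here is a rough sketch of pulling attribute values out of a document without ever building the full tree. The `item_ids` helper and the sample element names are hypothetical, and the code assumes a recent libxml-ruby release in which `XML::Reader.string` exists and `read` returns true or false; older releases may differ.

```ruby
begin
  require 'libxml'
rescue LoadError
  warn 'libxml-ruby is not installed; try: gem install libxml-ruby'
end

# Hypothetical helper: collect the id attribute of every <item> element.
# The reader visits one node at a time, so the whole document never has
# to sit in memory the way it does with a DOM tree.
def item_ids(xml)
  reader = LibXML::XML::Reader.string(xml)
  ids = []
  # Assumption: read returns false once the end of the document is reached.
  while reader.read
    next unless reader.node_type == LibXML::XML::Reader::TYPE_ELEMENT
    next unless reader.name == 'item'
    ids << reader['id']
  end
  ids
end

if defined?(LibXML)
  puts item_ids('<items><item id="1"/><item id="2"/></items>').inspect
end
```

Compared with SAX there are no callbacks to register: the loop pulls nodes when it wants them, which is what makes the code read more like DOM traversal.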
A Long History
For such a useful and full-featured library, the libxml-ruby bindings have a star-crossed history. Out of curiosity, I went back and traced their lineage. Sean Chittenden originally wrote them back in 2002. At the start of 2005, Trans Onoma adopted the project after Sean had moved on, and at the end of 2005 the bindings found their current home on RubyForge. At that point Ross Bamford took over maintenance and worked on the bindings for roughly a year, until early 2007, when the bindings again became unmaintained. Dan Janowski picked up the ball in 2007 and completely overhauled the bindings’ memory model. Sadly, Dan had to give up active support this spring.
But on the bright side, Trans, Dan and Sean are all once again active on the mailing list, providing valuable experience and insight. From my point of view, with the renewed push towards a production quality release, and bringing in new users, the libxml-ruby community is as healthy as it has been in a long while.
–
July 16, 2008
Hey Charlie, what great news you give us!
Great work by everyone involved in this project, now and in the past.
For the libxml2.dll requirement maybe we can build libxml statically instead, but I think the next step is worthy.
Again, great work!
Charlie
July 16, 2008
Hi Luis,
I just read an interview with you from earlier this year on Akita on Rails. Great work on the one-click installer. Ping me at my email (it’s on the about page) if you’d like to talk about it a bit more. I’ve gotten pretty good at this MSVC/MinGW Windows Ruby stuff after doing it for a couple of years now.
Charlie Savage –
July 16, 2008
Hi Luis,
About the libxml2.dll. It would be great if libxml-ruby could come in the one-click installer, and in that case, I’d put the dll into the Ruby bin directory like is done for other libraries. And if we’re doing that, how about adding libxslt-ruby also?
–
July 16, 2008
Hey Charlie,
The current VC6 One-Click Installer is a huge monolithic script, and it is kind of hard to maintain and test everything that has been bundled there.
We are working on the new rubyinstaller sandbox and build environment that uses Windows Installer and provides a smaller runtime (Ruby+RubyGems) to get you started and a Developer Kit (MinGW+MSYS) to build those gems that don’t come pre-built for Windows.
I’ve been thinking about fixing the binary file bundling in rubygems to allow .dll files to be part of gems…
I’ll get in touch with you soon about libxml 🙂
(in any case, I invite you to rubyinstaller-devel mailing list):
http://rubyforge.org/projects/rubyinstaller
Thanks again for your kind words!
–
July 16, 2008
This is awesome news! Thanks for all your hard work!
malcontent
July 16, 2008
You are a fucking hero!
Awesome work.
filterfish
July 17, 2008
What malcontent said!
–
July 17, 2008
I’d like to talk to you about possibly building a Java backend for the libxml library. Since so many people seem to be interested in a good XML library these days, and since XML libraries by and large look pretty much the same on the surface, I’m pretty sure we could map the libxml APIs onto JAXP for JRuby. That would allow people even more confidence to move to libxml for all XML purposes, since they wouldn’t be tied to the C impls. So hey, drop me a line or find me on FreeNode @ #jruby.
Rajmohan
July 17, 2008
On Windows, you mentioned the extra step of copying the prebuilt libxml2.dll. Is there a place I can copy the prebuilt dll from, or would I need to build it?
Jesus
July 17, 2008
Thanks, boy. That’s a great piece of work you can be proud of!
Charlie Savage –
July 17, 2008
Rajmohan,
A prebuilt libxml2.dll is already in the libxml/mingw directory. If you already have libxml installed on your system, you won’t have to do anything. If you don’t, then copy the dll to your ruby/bin directory or somewhere on your Windows path.
Also, I’ve updated the post to be a bit more clear on how to get things up and running on Windows.
le_fnord
July 17, 2008
so many thanks for your work,
that’s exactly the thing
i was searching for for my thesis.
–
July 17, 2008
Awesome work!! Does anyone know if this is “plug and play” for Rails or does it need adapting? For ActiveResource based apps this could mean a significant performance boost! Congrats!
–
July 17, 2008
This is awesome! Thanks for the hard work.
–
July 17, 2008
http://www.rubyinside.com/ruby-xml-crisis-over-libxml-0-8-0-released-955.html
I like the graph. I like the speed.
Charlie Savage –
July 17, 2008
AkitaOnRails,
Not sure. Want to find out? We gladly accept patches 🙂
–
July 17, 2008
That’s great news!
–
July 17, 2008
Fantastic! Thank you very much for all your work.
Dan Healy
July 17, 2008
First off, great work and thanks for taking up libxml-ruby.
Even after changing the search path to something that exists in the document I’m parsing, this example still doesn’t seem to work for me:
` doc.find('//root_node/foo/bar').each do |node|
  puts "Node path: #{node.path} \t Contents: #{node.content}"
end`
In fact, I can’t seem to get anything related to .find to work at all (.empty? always returns true). Got any suggestions?
–
July 17, 2008
http://saxon.sourceforge.net/saxon6.5.3/expressions.html
More specifically, if you look at the table at the bottom of that page, I can get things like ‘*’ and ‘*[last()]’ to work, but anything involving a literal string like ‘XXX’ or ‘//XXX’ will return a nil NodeSet.
Charlie Savage –
July 18, 2008
Hi Dan,
Best to post issues to the libxml ruby forge tracker – better chance of getting a response.
My guess is your document has a default namespace ( ). In that case, refer to these Rdocs:
http://libxml.rubyforge.org/rdoc/classes/LibXML/XML/XPath.html
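A minimal sketch of the default-namespace situation, assuming a recent libxml-ruby release: the sample document, the namespace URI, the `ns` prefix and the `bar_contents` helper are all invented for illustration. An unprefixed XPath step only matches elements that are in no namespace, so the URI has to be bound to a prefix (any prefix will do) and that prefix used on every step.

```ruby
begin
  require 'libxml'
rescue LoadError
  warn 'libxml-ruby is not installed; try: gem install libxml-ruby'
end

# Made-up document whose root declares a default namespace.
SAMPLE = '<root_node xmlns="http://example.com/ns">' \
         '<foo><bar>hello</bar></foo></root_node>'

# Hypothetical helper: run an XPath query and return the text of each match.
# `namespaces` uses the "prefix:uri" string form accepted by find.
def bar_contents(xml, xpath, namespaces = nil)
  doc = LibXML::XML::Parser.string(xml).parse
  doc.find(xpath, namespaces).map { |node| node.content }
end

if defined?(LibXML)
  # Unprefixed names do not match elements in the default namespace.
  p bar_contents(SAMPLE, '//root_node/foo/bar')
  # Binding the URI to a prefix lets the same path match.
  p bar_contents(SAMPLE, '//ns:root_node/ns:foo/ns:bar',
                 'ns:http://example.com/ns')
end
```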
Dan Healy
July 18, 2008
Aha. I figured the problem was most likely user error and not a bug, but I had no idea why. You were right about the namespace thing.
This link also helped solve my issue:
http://thebogles.com/blog/an-hpricot-style-interface-to-libxml/
–
July 19, 2008
Currently I’m using REXML and ruby-xsl for a project, and the code is ugly and slow. I’m going to rewrite the code to come back to libxml-ruby. Great work!
Charlie Savage –
July 19, 2008
Sounds good Alex. If you feel like doing the intermediate step, writing a wrapper for libxml with the same api as REXML, then do submit it to the libxml project and we’ll include it in the distribution (as long as it passes all REXML tests).
Charlie
Kurt Burns
July 21, 2008
I am getting segmentation faults on 64-bit redhat fairly randomly and fairly regularly while doing a large amount of data importing. I was using 0.5.4 and never once had a seg fault. After upgrading, I am getting them a lot. Any idea if seg faults are an issue with 64-bit right now? I am currently versioning down to see if I can determine if this is truly related to the upgrade.
Charlie Savage –
July 21, 2008
Hi Kurt,
Could you post to the libxml-devel mailing list, where we can try and work out the issue? Any additional information would be helpful – example XML file, example code, and best yet a stack trace, etc.
jim Cropcho
July 25, 2008
sweet dude; thanks!
Junaid
July 30, 2008
Great work. Thanks
cocotteman
November 4, 2008
Hi,
First of all: great work => the code is much much faster than with REXML…
I’m working on Windows system, and I have an error : “[BUG] Segmentation fault
ruby 1.8.6 (2007-03-13) [i386-mswin32]”
I’ve found a discussion at http://www.mail-archive.com/libxml-devel@rubyforge.org/msg01049.html which gives a patch.
But I don’t know how to apply this patch (http://www.acidlunchbox.com/bagby/ref_count.diff)…
Can anyone help me, please?
–
November 10, 2008
HUZZAH!