Building Ruby 1.9.1 on Windows

Posted by Charlie Sun, 29 Mar 2009 04:29:00 GMT

As noted elsewhere, Ruby 1.9.1 hasn't exactly bounded out of the gate. That's not particularly surprising, considering 1.9.1 has been available for only a couple of months and requires changes to existing code. In addition, there are a number of incompatible gems, giving rise to the isitruby19 website as a clearinghouse of information. So despite the great efforts from the Rails team, the rest of the community is still lagging behind.

That's particularly true on Windows, where a new one-click installer isn't yet available. According to the latest market share stats from Net Applications, Windows controls 88% of the desktop market. I have no idea how many Ruby installations exist, or how they are divided by operating system. But looking at RubyForge, by far the most popular download of all time is the Windows one-click installer, with over 3 million downloads.

Luis Lavena has taken over stewardship of the one-click installer, and clearly needs a bit of help. So although I have very little free time, I offered to pitch in where I could. While Luis is concentrating on putting together a new version of the one-click installer using MinGW and MSYS, I thought I could help out by putting 1.9.1 through its paces on Windows.

My approach was simply to start with the basics:

  • Build Ruby with Visual Studio 2008
  • Build the default extensions and libraries Ruby uses (zlib, iconv, openssl, etc)
  • Run Ruby's unit tests

That was almost a month ago. Thirty-nine patches later (I have no doubt Nobu is getting sick of me), I just about have Ruby 1.9.1's test suite running on Windows. There are still a few remaining issues, in particular a couple of IMAP tests that hang.

As for Visual Studio, I'm using it for two reasons. First, it has a lights-out debugger that makes it much easier to track down and fix problems. Second, it lets you compile instrumented executables and libraries that can detect incorrect API usage, heap corruption, stack corruption and mismatched calling conventions.

It quickly became obvious that no one had ever done that with Ruby, because it turned up a whole host of issues. For example, the dl extension used the cdecl calling convention to call the Windows API instead of stdcall. Another was a set of memory leaks in printf/sprintf.

The other thing that was bothersome was the huge number of compiler warnings generated by building Ruby. See for yourself - and then realize the original list doesn't include any of the warnings generated by building Ruby's extensions. Cleaning up the warnings took a number of patches, but at this point most of them have been fixed. All credit goes to Nobu for working through my patches, fixing them and applying them. My knowledge of the Ruby runtime is fairly limited, so most of my patches were not quite right.

Anyway, since it's not all that obvious how to build Ruby on Windows (with Visual Studio or MinGW), I'll see if I can put together a few posts that describe how to do it for anyone who wants to roll their own.


libxml-ruby 1.1.3 - Boosting Performance

Posted by Charlie Sun, 22 Mar 2009 05:19:00 GMT

I'm happy to announce the release of libxml-ruby 1.1.3. Besides including the usual assortment of new features and bug fixes, this release also includes a speed boost of roughly 10% to 20%.

This resulted from RubyInside's recent post summarizing the performance of Ruby XML parsers. As expected, libxml-ruby blew away Hpricot and REXML in pure parsing speed (which of course is a simplistic view of what is important in an XML processor, but nevertheless still important). But it consistently finished a bit behind Nokogiri.

I was a bit surprised by that since both libxml-ruby and Nokogiri use the libxml2 library as their parsing engine. Since the specific test cases almost exclusively tested parsing, the two extensions should have identical run times.

Since the times were different, the obvious conclusion was that the two extensions were using different libxml2 APIs or different settings. I suspected the second, but when investigating performance you never know beforehand.

Not to bore everyone with the nitty-gritty details of using libxml2, but when looking into the first test, parsing an in-memory string, it didn't look like there was much difference in API calls.

For libxml-ruby:

xmlCreateMemoryParserCtxt
xmlParseDocument

For Nokogiri:

xmlReadMemory
-> xmlCreateMemoryParserCtxt
-> xmlDoRead
-> xmlParseDocument

So that didn't solve the mystery.

The next possibility was that xmlDoRead was modifying the libxml2 parser context. Now a libxml2 parser context is a beast of a thing - for those brave souls who want to take a peek, it's defined in libxml2's online documentation.

Working through the options one-by-one, I finally found the culprit, an obscure field in the structure:

int	dictNames	: Use dictionary names for the tree

What this setting controls is whether libxml2 uses a dictionary to cache strings it has previously parsed. Caching strings makes a big difference, so by default it should be enabled. That is now the case with libxml-ruby 1.1.3 and higher.
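
To illustrate why a dictionary helps, here is a minimal Ruby sketch of the general idea. It is not libxml2's implementation; the NameDictionary class and its intern method are hypothetical names invented for this example. Each distinct element name is stored once, and every later occurrence reuses the stored copy instead of allocating a new string.

```ruby
# Illustrative only: a toy string dictionary, not libxml2's actual code.
# NameDictionary and #intern are hypothetical names for this sketch.
class NameDictionary
  def initialize
    @names = {}
  end

  # Return the single stored copy of +name+, storing it on first sight.
  def intern(name)
    @names[name] ||= name.dup.freeze
  end

  # Number of distinct names stored so far.
  def size
    @names.size
  end
end

dict = NameDictionary.new

# Simulate a parser seeing the same element names over and over.
seen = %w[item title item title item].map { |n| dict.intern(n) }

puts dict.size                 # distinct names stored
puts seen[0].equal?(seen[2])   # true when both "item" entries share one object
```

In a document with thousands of repeated element and attribute names, that reuse saves both allocations and string comparisons, which is presumably where the speed difference comes from.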

Rerunning the published benchmarks now shows libxml-ruby and Nokogiri to have equivalent performance. If you run the tests yourself, beware though. The order in which the extensions are tested changes the results. Whichever extension is tested first will always be faster, at least on my Fedora 10 box. I assume that's because the first parser has more memory available to it when the test begins and therefore invokes Ruby's garbage collector a few fewer times.
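
One way to check the garbage collector theory is to record how many GC runs each benchmark triggers. The sketch below uses Ruby's built-in GC.count and the Benchmark module; the measure helper and the placeholder workload are hypothetical stand-ins, not the published benchmark code.

```ruby
require 'benchmark'

# Hypothetical helper: time a block and count the GC runs it triggered.
def measure(label)
  gc_before = GC.count
  seconds = Benchmark.realtime { yield }
  { label: label, seconds: seconds, gc_runs: GC.count - gc_before }
end

# Placeholder workload; swap in the real parser calls when benchmarking.
workload = lambda { 50_000.times.map { |i| "<item>#{i}</item>" } }

# Run the same workload twice so any first-run advantage shows up.
results = [measure("first") { workload.call },
           measure("second") { workload.call }]

results.each do |r|
  puts format("%-6s %.4fs  %d GC runs", r[:label], r[:seconds], r[:gc_runs])
end
```

If the second run consistently reports more GC runs than the first, that would support the explanation above; swapping the order of the two parsers and comparing again is the fairer test.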


libxml-ruby reaches 1.0!

Posted by Charlie Wed, 11 Mar 2009 05:52:00 GMT

A mere seven years after its inception, libxml-ruby has finally reached version 1.0. libxml-ruby provides Ruby, via the libxml2 library, the super fast, feature rich XML parser that it has sorely lacked.

Last year I posted about the resurrection of the project, and since then we've made enormous progress. The 1.0 release marks the culmination of this work, and comes with tons of goodies:

  • Ruby 1.9.1 support
  • Out of the box support for OS X 10.5 and MacPorts
  • Greatly expanded documentation
  • Much better test coverage
  • A nice, clean API that makes it easy to do simple things, but provides all the power of libxml2 if you need it

Not to mention that libxml-ruby is blindingly fast and incredibly feature rich (see my post from last year for all the details), making it the choice for a number of high-traffic websites.

So give it a try - it's as easy to install as:

gem install libxml-ruby

And if you feel like polishing your Ruby, XML, or C skills, come join the community!


Window Functions and the Connected Workspace

Posted by Charlie Sun, 08 Mar 2009 01:20:00 GMT

One of the great new features of the upcoming PostgreSQL 8.4 release is the addition of window functions. Previously limited to enterprise databases such as Oracle and DB2, they open up a whole new world of functionality to SQL queries.

Window functions are one of the more obscure parts of the SQL standard, so you may never have heard of them. In a nutshell, they let you perform calculations based on the current record and its set of related records. This turns out to be quite useful. A good place to find out more is the PostgreSQL documentation, which does a good job of explaining some of the more common use cases.

Workplace of the Future

My introduction to window functions was almost five years ago while doing a project for Ubisense. Ubisense sells indoor tracking systems, based on ultra-wideband, that can locate tags within 6 inches.

For one of our projects, we worked on Cisco's Connected Workspace. The Connected Workspace was designed to see if office space could be laid out in a way that increased worker happiness and productivity. To do this, Cisco took all the cubicles out of the main floor of one of its buildings and replaced them with a fairly radical design. Roughly half of the floor was made into a large open space with individual and group desks. The remainder of the floor was split between a large kitchen with a really nice eating room and offices that ranged in size from 1 to 12 people. Here is a picture of the main floor area (courtesy Cisco Systems):

Cisco Connected Workspace

For a few more pictures, check out Cisco's presentation.

The idea was that employees could sit wherever they wanted; there were no assigned seats. If employees needed to collaborate they could work in the open areas, if they needed privacy they could grab one of the smaller offices, and if they needed to do a conference call they could grab one of the larger offices.

The other impetus behind the experiment was financial. Cisco has a huge campus in Santa Clara with hundreds of buildings, each costing millions of dollars to maintain. Was it possible to pack more people into each building and maintain, or improve, their happiness and productivity?

The Experiment

Ubisense was hired to figure out how well the different parts of the Connected Workspace were utilized. By giving each employee a tag, the system could anonymously keep track of each time someone entered or left a room. This aggregate data could then be used to gain insight into the effectiveness of the new floor plan:

  • Did employees spend time in the open area?
  • If so, in which parts of the open area (it was divided into 5 subdivisions)?
  • How much were the individual offices being used? Were there too many or too few?
  • What about the larger conference room?
  • How much was the kitchen and eating area utilized?

To do this, I hooked into Ubisense's platform API to monitor each time a tag entered or left a room. That information was then entered into an Oracle database (without any user information, so the data was totally anonymous). Thus the Oracle table consisted of millions of rows of data, with each row representing a tag entering or leaving a room. For example, here is a simplified view of the data:

tag_id  room_id        event  time
1       Conference #1  Enter  10:00am
2       Office #2      Enter  11:15am
2       Office #2      Leave  11:20am
1       Conference #1  Leave  11:30am

Window Functions to the Rescue

The next trick was to analyze the data to answer the questions I posed above. Doing that required figuring out how much time each tag spent in each room. So something like this:

room_id        enter    leave    duration
Conference #1  10:00am  11:30am  1 hour 30 min
Office #2      11:15am  11:20am  5 min

Obviously you could write a script in the language of your choice to process the raw data and populate this new table. But that adds another level of complexity to the system and makes it hard to do ad hoc queries.

And this is where window functions are so useful. Using window functions, you can implement the basic algorithm fully in SQL:

  1. Sort the data by tag_id, room_id and id so that room enter records for a tag are directly followed by room exit records
  2. Select the room exit records
  3. Use the lag window function to pull the previous record, which is the room enter record, and then subtract the two times to get the duration
  4. Wrap this query up in a view, let's call it room_usage, that can serve as the basis for ad hoc queries or reports.

Without window functions, item #3 is impossible with SQL because there is no way to relate a record to its surrounding records (i.e., a window).

And thus window functions provide a great new data analysis tool, which PostgreSQL will make available to everyone at no cost.
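
The steps above can be sketched in plain Ruby to show what the query computes. This is an emulation of the lag logic, not the SQL view itself. The Event struct, the id values, the arbitrary date and the room_usage method are assumptions made for illustration; the sample rows come from the simplified table earlier. In the real view, step 3 is the lag window function; here each_cons(2) plays that role by pairing every row with the one before it.

```ruby
# Hypothetical row type mirroring the simplified event table, plus an id.
Event = Struct.new(:id, :tag_id, :room_id, :event, :time)

# Sample rows from the simplified table; the date itself is arbitrary.
EVENTS = [
  Event.new(1, 1, "Conference #1", "Enter", Time.utc(2009, 3, 2, 10, 0)),
  Event.new(2, 2, "Office #2",     "Enter", Time.utc(2009, 3, 2, 11, 15)),
  Event.new(3, 2, "Office #2",     "Leave", Time.utc(2009, 3, 2, 11, 20)),
  Event.new(4, 1, "Conference #1", "Leave", Time.utc(2009, 3, 2, 11, 30))
]

# Emulates the view: one row per completed visit, with its duration.
def room_usage(events)
  # Step 1: sort so each Enter is directly followed by its Leave.
  sorted = events.sort_by { |e| [e.tag_id, e.room_id, e.id] }

  # Steps 2 and 3: for each Leave, look back one row (what lag does)
  # and subtract the two times.
  sorted.each_cons(2).map do |prev, cur|
    next unless cur.event == "Leave" && prev.event == "Enter"
    next unless prev.tag_id == cur.tag_id && prev.room_id == cur.room_id
    { room_id: cur.room_id, enter: prev.time, leave: cur.time,
      minutes: ((cur.time - prev.time) / 60).round }
  end.compact
end

room_usage(EVENTS).each do |row|
  puts "#{row[:room_id]}: #{row[:minutes]} min"
end
```

For the sample rows this yields a 90 minute visit to Conference #1 and a 5 minute visit to Office #2, matching the duration table above. The advantage of the SQL version is that the database does this pairing itself, so no separate script or derived table is needed.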
