The Agony of Upgrading Fedora

Posted by Charlie Tue, 29 Jan 2008 18:04:00 GMT

Time for a rant.

A few years ago I decided it was finally time to learn Linux after having used DOS, Windows, and the Mac OS for years. My plan of attack was to run my own domain - savagexi.com, complete with a website, blogs, mail server, DNS server and DHCP server. And if I'd ever find the time, MythTv.

Back then Fedora seemed like the best choice, and every year or so I upgrade the servers in the basement to the latest version. Upgrading Fedora always sucks, but my experience over the weekend warrants a big, resounding F.

When working on my own machines, I tend to go beyond flying by the seat-of-my-pants to wanton recklessness. There's nothing quite like a nasty error message (disk failure, missing partition, broken boot loader, misconfigured X server etc.) to focus the mind and learn how things really work. Over the years, my reckless attitude has cost me only once, when a disk drive that was part of a Logical Volume gave up its soul when it screeched to a dreadful halt. And even then, I almost managed to rescue the data I needed off the remaining disk, finding out five minutes too late what I should have done instead of what I did. Since then, I've eschewed LVM and gone with nice, simple RAID 1 arrays (which means having 2 disks that mirror each other so if one breaks you can get your data from the other one) to at least provide a modicum of redundancy.

The impetus for upgrading this time around was spam. I've always heard how wonderful greylisting is, and after one too many emails about navigating the love canal with confidence, it was time to take action. But of course I ran into a roadblock - setting up greylisting on Fedora 6 using a program called PostGrey didn't work because it conflicted with SELinux (see, I'm a glutton for punishment, using SELinux on a home network). Of course that took some doing to figure out, since Fedora 6 doesn't bother to actually log a message about the problem. So after reading the Fedora 8 release notes about how PostGrey and SELinux are best of buddies, I decided it was worth the pain to upgrade the email server.

From past experience, I was under no illusion it would be easy. But little did I suspect just how dreadful it would be. I decided to do the upgrade using a network install since I don't have a DVD burner (yeah, yeah), which means the bytes are downloaded on demand across the Internet. It actually works pretty well if you pick a fast mirror, such as facebook. But when things go wrong you have to stop the installation, reboot the machine, Google around a bit, fix whatever the problem is, start the installation over and redownload the bytes. Remember the stop-reboot-fix-install sequence; I must have done it twenty times.

Day 1

Attempt #1. Things got off to a rousing start with Anaconda, the Fedora installer, complaining that the disk partitions on the two drives in the machine had to be labeled. Of course Anaconda should have just fixed the problem itself, but no, it is a remarkably unhelpful program.

Attempt #2. So stop-reboot-google around-fix the problem - and try again. This time Anaconda bitched about not finding any valid partitions, or in English, it couldn't read the 2 hard-drives on the machine and thus couldn't update them. Since I had just rebooted the machine, it stretched the imagination that Anaconda could be so dumb. But either way, back to the stop-reboot-fix-reboot-start cycle. Except this time there was no fix, since the machine booted just fine.

Attempt #3. Try again. This time I gave in to Anaconda when it offered the choice of wiping the drives clean, and hit the next button. I then quickly decided that was a bad move, and hit the back button. No luck. Although the installation hadn't started yet (I was on a screen that was asking me some question I don't remember), when I rebooted the machine I was greeted by the message GRUB. Mind you, not a grub prompt, just four capital letters that spelled GRUB. Ugh.

So it was now time to dig out the Fedora 6 rescue disk and run it. It couldn't find any partitions either, and dumped me at a command prompt. From there I could run the ever exciting program fdisk, which lets you manage the partitions on your disk. fdisk is a nice, easy to use program, but it's living on the edge - one false move and you can easily delete your data. From fdisk I noted that the machine had two hard-drives, the first was 80GB and the second 60GB. I also saw that the first drive (80GB) no longer had a partition table thanks to Anaconda. Working backwards, I recreated its partitions. That was easy to do, since the two drives are part of a RAID array and thus I assumed the first partition on the first disk should be 60GB.

Attempt #4. Reboot and .... get greeted by the ever friendly GRUB message again.

Attempt #5. Reboot, but this time I hit the F12 key to open the Boot menu. I then noticed that the last choice in the boot menu was to start a utility disk, which miraculously opened to a grub prompt (it wasn't until the next day I figured out how to run grub from the rescue disk, although I suspected it was possible). Of course I don't know diddly about Grub, so it took another 30 minutes of Googling to figure out how to fix the problem (basically reinstall grub on the drive).

Attempt #6. This time, Anaconda had the decency to recognize my partitions and even offered me a chance to upgrade them. Hooray. Pushing my luck, I hit the next button, and watched Anaconda check the dependencies for all installed packages. 5%, 10%, 15%, 25%, 26%...and then nothing. Of course.

More Googling, and finally enlightenment. Turns out I was hardly the first to run into this show-stopper bug. If that wasn't bad enough, the bug was still open 2 months after it was reported, and none of the mirrors had been updated (there has been a respin of the Fedora 8 CDs, but it's hardly useful if I can't get to it). So I read through the whole thread, and in one of the comments a Fedora developer had posted a link to an "update image" on his website. After a bit of research, I figured out what an update image is and how to use it.

Attempt #7. If at first you don't succeed, try, try again. This time Anaconda got past the dependency checker, and amazingly enough finished. Success was near at hand. NOT.

Attempt #8. Reboot the machine and watch in horror as the dreaded GRUB message rears its ugly head.

So back to the rescue disk - which of course can't mount any partitions (wtf?) and spits me out to a Linux prompt. Back to fdisk. And once again enlightenment - Disk #1 had once again lost its partition table. Fix it. Boy this is getting tedious.

Attempt #9. Surely things are fixed by now. Reboot. And then watch in amazement as the computer tries to load Fedora Core 6, spits out pages and pages of errors, and unceremoniously dumps me to a login prompt. Of course the login prompt doesn't work. WTF?

Ah - my favorite pastime, loading the rescue disk. Try fdisk again, everything looks ok. So the next obvious thing is the RAID array is broken somehow. Go read about mdadm, which is the Linux program for creating and managing software RAID arrays. Using the wonders of Google, I found a very helpful article that explains how to rescue your RAID array. Following the instructions, I remount the array and discover that only Disk #2 is available. And then it dawns on me - somehow Anaconda only updated Drive #2, thus leaving Drive #1 with Fedora 6 in a very broken state. So a bit more Googling, and I learn how to re-add Disk #1 back into the array. And then nothing. Hmm. More Googling - how exactly do you know what a RAID array is doing?

That didn't take long, and I stare in wonderment as something actually goes right - mdadm is happily resyncing Disk #1 with Disk #2 and says it will be done in a bit over an hour. At this point it's 3:30 am, so I call it a day.

Day 2

Attempt #10. After a good night's sleep, it was time for more fun. The RAID array had successfully fixed itself overnight, so crossing my fingers I rebooted the machine. My heart sank when I was greeted with lines and lines of warnings about disk overflow errors. But wait, those were for the extra partition on Disk #1 (remember only the first 60GB are used in the RAID array, leaving 20GB free). Once the cruft had cleared, the machine managed to boot all the way to the Fedora 8 welcome screen. Hallelujah! Of course a fair bit was broken, including the DNS server, which meant at least a few hours in BIND hell (BIND and I simply don't get along). But first things first.

However, I was worried about the disk overflow errors. For some reason, the kernel thought the 20GB partition was smaller than it really was. A bit of Googling turned up a couple of potential causes and solutions, but none worked. So back to fdisk. I figured the best course of action was to just delete the 2nd partition and recreate it.

Attempt #11. After recreating the problem partition, it was time to reboot the machine. And of course back to my old friend GRUB. I have no idea how I ended up back there, but clearly old flings die slowly. But at this point I was an old hand at moving on, and rescue disk in hand, it was time to work some magic at the grub prompt. And to be on the safe side, I Googled around a bit more to see if somehow I had mistakenly configured GRUB with RAID and could kick this habit once and for all. Fortunately, I turned up this gem of an article and promptly changed things around based on its recommendations.

Attempt #12. And finally, one day later, a clean boot to Fedora 8 (minus of course BIND being unhappy).

Denouement

It beats me how any normal person manages to maintain their own Linux system - I only succeed through sheer determination and stubbornness. I realize that Fedora recommends a clean install with each new version, but to do that without losing your personal data and system configuration takes knowledge and effort beyond almost anyone who lives on this planet, including myself. So overall - I give Fedora an F for its horribly broken upgrade program.

And of course the kicker - PostGrey still doesn't work with SELinux on Fedora 8. But at least in FC8 it's polite enough to actually log an error. So anyone for creating and compiling their own policy files? Ah, I feel another rant coming along about SELinux.

Posted in  | 18 comments | no trackbacks

Fighting Trac Spam

Posted by Charlie Tue, 29 Jan 2008 03:06:00 GMT

For MapBuzz, we use a popular open source project called Trac for managing our bugs, feature requests, release schedules, etc.  As long as you don't have complex requirements, Trac is pretty good - it's a lot more pleasant to use than expensive commercial products such as Rational ClearQuest.

Unlike ClearQuest, Trac is designed to live on the Web.  But living on the Web can be dangerous - in recent months our database was getting overwhelmed by spam.  Cleaning it out was becoming a tedious, daily chore.

After trying a variety of countermeasures over a period of a few months, I finally gave up and handed it over to Anders (and do take a look at the very cool URI he has).  It took him about one minute to diagnose the problem - spammers weren't coming in through the front door, they were coming in through the back door.  I had assumed that spammers were using Trac's web interface to further their nefarious causes, but instead they were using our automated email ticket submission system.  The way that works is when an error is generated, either on a MapBuzz client or server, an email with all the relevant information is sent to trac@mapbuzz.com.  Bugs submitted that way are easy to spot - we use the imaginative names "MapBuzz Client Error" or "MapBuzz Server Error" for them.

The solution was obvious - only let computers from within the mapbuzz domain email tickets.  But figuring out how to do it was another thing.  The problem with not having a full-time admin is that there is always a huge startup cost in fixing IT problems as you rack your brain trying to remember how some complex piece of software works.  In this case it was Postfix, and after an hour of rummaging through the manuals, we finally discovered the right incantation.  Undoubtedly there are other ways to do this, and probably better ways, but we added the following line to the file roleaccount_exceptions:

# Only allow sending to trac from local domain
trac@mapbuzz.com permit_mynetworks,reject

Or in English, only machines in the MapBuzz domain can send tickets to Trac. And Voila - no more spam!

Posted in  | 1 comment | no trackbacks

Profiling Ruby Code

Posted by Charlie Wed, 16 Aug 2006 06:00:00 GMT

Pat Eyler has written a nice set of articles about profiling Ruby code. He shows how to use the built-in Ruby profiler as well as ruby-prof (my personal favorite :). He talks a bit about call graphs, so if you're not familiar with them, his article is a good place to start.

Posted in , ,  | 2 comments | no trackbacks

Selenium and Mouse Events

Posted by Charlie Tue, 15 Aug 2006 07:06:00 GMT

For MapBuzz browser testing, we need to control the x and y locations of the mouse cursor. Selenium didn't support this functionality, so I coded it up one evening and submitted a patch.

Nelson Sproul took the patch, refactored it a bit, and included it in the recent 0.7.1 Selenium release. Note that there is a major limitation when using it with Internet Explorer. IE does not bubble script-generated events. Thus, if you send a mouseclick event to an element, the element will receive it, but the event will not bubble up to its parent in the DOM tree.

Posted in ,  | no comments | no trackbacks

Selenium and XHTML

Posted by Charlie Wed, 02 Aug 2006 06:09:00 GMT

Last month I blogged about Selenium, which is an open source project that lets you test web applications running in a variety of browsers. Unfortunately, Selenium doesn't work out of the box with XHTML - any XPath expressions you use stop working.

I fixed this last month in my local copy, but I've noticed other people are starting to have the same issue. The problem is that Selenium does not implement a namespace resolver as described in the Mozilla XPath documentation. For HTML documents, XPath expressions look like this:

div/p[@id="foo"]

For XHTML documents, they must include a namespace prefix like this:

x:div/x:p[@id="foo"]

The choice of "x" is arbitrary; however, it's what XPath Checker (a Firefox extension) uses.

Luckily, Selenium is easily extensible since JavaScript is a language that gets out of your way. The fix is to add the following code into your user-extensions.js file:

PageBot.prototype.namespaceResolver = 
function(prefix)
{
  if (prefix == 'html' ||
      prefix == 'xhtml' ||
      prefix == 'x')
  {
    return 'http://www.w3.org/1999/xhtml';
  }
  else if (prefix == 'mathml')
  {
    return 'http://www.w3.org/1998/Math/MathML'
  }
  else
  {
    throw new Error("Unknown namespace: " + prefix + ".")
  }
}

PageBot.prototype.findElementUsingFullXPath = 
function(xpath, inDocument) {
    if (browserVersion.isIE && !inDocument.evaluate) {
        addXPathSupport(inDocument);
    }

    // HUGE hack - remove namespace from xpath for IE
    if (browserVersion.isIE)
        xpath = xpath.replace(/x:/g,'')

    // Use document.evaluate() if it's available
    if (inDocument.evaluate) {
        // cfis - the original call passed null instead of a namespace resolver:
        //return inDocument.evaluate(xpath,
        //      inDocument, null, 0, null).iterateNext();
        return inDocument.evaluate(xpath,
          inDocument, this.namespaceResolver, 0, null).iterateNext();
    }

    // If not, fall back to slower JavaScript implementation
    var context = new XPathContext();
    context.expressionContextNode = inDocument;
    var xpathResult = new XPathParser().parse(xpath).evaluate(context);
    if (xpathResult && xpathResult.toArray) {
        return xpathResult.toArray()[0];
    }
    return null;
};

There are two big hacks. First, the hard-coded "x" prefix. And second, Internet Explorer does not support XHTML so the code strips out any namespace prefixes.

Last, if you are using the Firefox Selenium IDE, make sure to point it at your updated user-extensions.js file (do this using the options menu).

Posted in , ,  | no comments | no trackbacks

Porting ruby-prof to Windows

Posted by Charlie Fri, 09 Jun 2006 19:14:00 GMT

Yesterday I wanted to profile some methods I'm using on a Rails controller. To get a feel for profiling Ruby code I put together a test case, added a "require 'prof'" to the top of the file and eagerly waited for the results. And waited, and waited, and waited. Thinking I did something wrong, I ran the code without a profiler - it took about 2 seconds. With the profiler it took so long I gave up. And this was on a dual core Pentium D processor with 1 Gig of memory running Fedora Core 5.

Time for some investigation. It turns out this is a well known problem - the built-in Ruby profiler, which is written in Ruby, is so slow as to be useless. I came across two alternatives - ruby-prof, a C extension written by Shugo Maeda, and ZenProfile, an inline C extension done by Ryan Davis.

I went with ruby-prof. On Linux it was easy enough to download, build and install and it worked like a charm. But I do most of my work on my laptop which runs Windows XP. So I opened up MinGW and built and installed the extension on Windows (that's not quite true, I had to hack the C a bit, more info below). But when I ran the test script I was met with a program fail message saying that the stack was empty. Ugh.

Since I find it impossible to debug extensions on Windows with MinGW I fired up Visual Studio 2005, rebuilt the extension, and tried again. Same issue.

Digging deeper, it turns out the profilers (as well as the wonderful rcov project) work by registering a callback with Kernel::set_trace_func. When Ruby executes a line of code, enters a new Ruby or C method, or exits a Ruby or C method, the callback is activated.
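To see the callback in action, here is a minimal sketch. The helper trace_method_events and the method shout are names made up for illustration; set_trace_func itself is the real Kernel method, although newer Ruby versions mark it as obsolete in favor of TracePoint. The exact list of events you get back will vary with the Ruby version.

```ruby
# Collect the method-level events reported by set_trace_func while the
# given block runs. (Hypothetical helper, not part of ruby-prof.)
def trace_method_events
  events = []
  tracer = proc do |event, _file, _line, id, _binding, _klass|
    events << [event, id] if %w[call return c-call c-return].include?(event)
  end

  set_trace_func(tracer)    # install the callback
  begin
    yield
  ensure
    set_trace_func(nil)     # uninstall the callback
  end
  events
end

# A Ruby method that calls a C method (String#upcase).
def shout(text)
  text.upcase
end

trace_method_events { shout("hello") }.each do |event, id|
  puts "#{event} #{id}"
end
```

Note that the very first and very last events belong to set_trace_func itself, which is exactly where the imbalance described below comes from.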

The problem is that ruby-prof assumes that each call into a method is matched by a return - and if it's not then the failure I saw is triggered. To understand the problem, let's look at a super simple test case:

require 'profiler'

I said it was simple - didn't I! Here's the trace from Linux:

return start_profile
call print_profile
call stop_profile
c-call set_trace_func

start_profile is the method that installs the set_trace_func callback - so it makes sense that the first thing we see is returning from that method. Once the program is done, the profiler calls print_profile, which calls stop_profile, which calls set_trace_func, which uninstalls the callback. So the method enters and returns do not balance.

Although the method names ruby-prof uses are slightly different, the problem remains the same. ruby-prof hacks through it by pushing and popping extra items on its stack to counterweigh the imbalanced method calls. Thus it's hard-coded to a specific sequence of method calls. So why doesn't it work on Windows?

A quick trace running our test program on Windows shows the problem:

return start_profile
return require
call print_profile
call stop_profile
c-call set_trace_func

There is an extra "return require", which is generated by RubyGems. And if you run the program in Arachno, which uses a modified version of Ruby to support its fantastic debugger (it's fast enough that I always run Rails under the debugger so I can set breakpoints at key places - definitely go check it out), the trace changes yet again:

c-return set_trace_func 
return start_profile 
c-return require__ 
return require 
c-return require__ 
return require 
call print_profile 
call stop_profile 
c-call set_trace_func 

It quickly becomes clear that assuming a balanced stack is a bad idea. If you look at the built-in Ruby profiler, it doesn't make such an assumption.
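The imbalance is easy to reproduce by replaying the traces above against a plain call stack. This is only a sketch of the idea, not ruby-prof's actual code; unmatched_returns and the two constants are made-up names.

```ruby
# Replay "event method" trace lines against a call stack and collect every
# return that arrives when there is nothing left on the stack to pop.
def unmatched_returns(trace)
  stack = []
  unmatched = []
  trace.each do |line|
    event, name = line.split
    case event
    when "call", "c-call"
      stack.push(name)
    when "return", "c-return"
      if stack.empty?
        unmatched << name   # a profiler assuming balance would fail here
      else
        stack.pop
      end
    end
  end
  unmatched
end

LINUX_TRACE = [
  "return start_profile",
  "call print_profile",
  "call stop_profile",
  "c-call set_trace_func"
]

WINDOWS_TRACE = [
  "return start_profile",
  "return require",
  "call print_profile",
  "call stop_profile",
  "c-call set_trace_func"
]

p unmatched_returns(LINUX_TRACE)    # => ["start_profile"]
p unmatched_returns(WINDOWS_TRACE)  # => ["start_profile", "require"]
```

A workaround that compensates for exactly one unmatched return handles the Linux trace but trips over the second one on Windows.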

These changes have been merged into ruby-prof-0.4.0 which is now available as a RubyGem.

So, I've patched ruby-prof to remove this assumption and to make it compile on Windows. I'll submit the patch to Shugo Maeda, but in the meantime, I've provided windows binaries for anyone who wants to use the profiler on windows. To install:

1. Download the Windows extension, prof.so, and put it in your ruby\lib\ruby\site_ruby\1.8\i386-msvcrt directory.

2. Download unprof.rb and put it in your ruby\lib\ruby\site_ruby\1.8 directory.

3. To use the profiler simply require 'unprof' at the top of the file

One thing to note about my changes. The self-time for the "toplevel" method will always show "0". It looks like the Ruby profiler does the same thing, so I think this is ok.

Assembly Hacking

This section is for anyone who's interested in some lower level details - feel free to skip it.

Getting ruby-prof to compile on Windows required a few of the usual changes. For example, making sure that the extension's initialization method is properly exported using __declspec(dllexport), etc.

However, ruby-prof provides an extra twist. It can measure time in several ways including using some low-level functionality provided by more recent Pentium and PowerPC processors. To access this information it uses this inline assembly call:


static prof_clock_t
cpu_get_clock()
{
#if defined(__i386__)
    unsigned long long x;
    __asm__ __volatile__ ("rdtsc" : "=A" (x));
    return x;
#elif defined(__powerpc__) || defined(__ppc__)
    unsigned long long x, y;

    __asm__ __volatile__ ("\n\
1:	mftbu   %1\n\
	mftb    %L0\n\
	mftbu   %0\n\
	cmpw    %0,%1\n\
	bne-    1b"
	: "=r" (x), "=r" (y));
    return x;
#endif
}

For x86 chips, what it does is execute the rdtsc instruction, which returns the number of clock cycles that have elapsed. So if you call cpu_get_clock, wait 1 second, and call cpu_get_clock again, you can calculate the chip's clock frequency. Using this information, you can time method calls. For instance, if the chip's frequency is 500MHz and a method takes 250,000,000 cycles to complete, you can calculate it took 0.5 seconds.
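The arithmetic is simple enough to sketch in a few lines of Ruby. The function names here are made up for illustration and are not part of ruby-prof; the cycle counts are the ones from the example above.

```ruby
# Estimate the clock frequency (in Hz) from two cycle-counter readings taken
# a known number of seconds apart.
def estimate_frequency(start_cycles, end_cycles, elapsed_seconds)
  (end_cycles - start_cycles) / elapsed_seconds.to_f
end

# Convert a cycle count into seconds at the given clock frequency (in Hz).
def cycles_to_seconds(cycles, frequency_hz)
  cycles / frequency_hz.to_f
end

# Two readings one second apart on a 500MHz chip.
frequency = estimate_frequency(0, 500_000_000, 1)
puts cycles_to_seconds(250_000_000, frequency)   # prints 0.5
```

If the frequency changes between the calibration and the measurement, the result is off by the same ratio.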

This of course won't work with Visual C++ because it uses its own syntax for inline assembly calls. In this case there are a couple of ways of porting this code. Newer versions of Visual C++ support compiler intrinsics, and there is one for rdtsc. However, I thought it would be better to use inline assembly to support older versions. Here's the code:

static prof_clock_t
cpu_get_clock()
{
    prof_clock_t cycles = 0;

    __asm
    {
        rdtsc
        mov DWORD PTR cycles, eax
        mov DWORD PTR [cycles + 4], edx
    }
    return cycles;
}

To use this timing method you have to specifically enable it by including the following lines in your Ruby code:

ENV["RUBY_PROF_CLOCK_MODE"] = "cpu"
require 'unprof'

However, I can't say this works very well. The calculated frequency for my chip is always different. I don't know why - my best guess is that it's a Pentium M with Intel's SpeedStep technology so the clock frequency varies to save power. However, I'm usually plugged in so I don't think that's it. Note you can tell ruby-prof your clock frequency like this:

ENV["RUBY_PROF_CLOCK_MODE"] = "cpu"
ENV["RUBY_PROF_CPU_FREQUENCY"]= "466000000"
require 'unprof'

So my recommendation is just use the default ruby-prof timing method - it does the job perfectly well.

These changes have been merged into ruby-prof-0.4.0 so I've taken them offline

Posted in , , ,  | 4 comments | no trackbacks

Firebug

Posted by Charlie Fri, 31 Mar 2006 20:12:00 GMT

Just noticed that Joe Hewitt has posted an updated version of Firebug. He's taken a great extension and turned it into the best Firefox extension by far. It's an absolutely invaluable development tool. If you develop web apps, you should go download it now.

Posted in ,  | no comments | no trackbacks
