Porting ruby-prof to Windows

Yesterday I wanted to profile some methods I’m using on a Rails controller. To get a feel for profiling Ruby code I put together a test case, added a “require ‘prof'” to the top of the file and eagerly waited for the results. And waited, and waited, and waited. Thinking I did something wrong, I ran the code without a profiler – it took about 2 seconds. With the profiler it took so long I gave up. And this was on a dual core pentium D processor with 1 Gig of memory running Fedora Core 5.

Time for some investigation. It turns out this is a well known problem – the built-in Ruby profiler, which is written in Ruby, is so slow as to be useless. I came across two alternatives – ruby-prof, a C extension written by Shugo Maeda, and ZenProfile, an inline C exension done by Ryan Davis.

I went with ruby-prof. On Linux it was easy enough to download, build and install and it worked like a charm. But I do most of my work on my laptop which runs Windows XP. So I opened up MingW and built and installed the extension on Windows (that’s not quite true, I had to hack the C a bit, more info below). But when I ran the test script I was met with a program fail message saying that the stack was empty. Ugh.

Since I find it impossible to debug extensions on Windows with MingW I fired up Visual Studio 2005, rebuilt the extension, and tried again. Same issue.

Digging deeper, it turns out the profilers (as well as the wonderful rcov project)
work by registering a callback with Kernel::set_trace_func. When Ruby executes a line of code, enters a new Ruby or C method, or exists a Ruby or C method, the callback is activated.

The problem is that ruby-prof assumes that each call into a method is matched
by a return – and if its not then the failure I see is triggered. To understand the problem, let’s look at a super simple test case:

require 'profiler'

I said it was simple – didn’t I! Here’s the trace from Linux:

return start_profile
call print_profile
call stop_profile
c-call set_trace_func

Start_profile is the method that installs the set_trace callback – so it makes
sens that the first thing we see is returning from that method. Once the program
is done, the profiler calls print_profile, which calls stop_profile, which
calls set_trac_func which uninstalls the callback. So the method enters and
returns do not balance.

Although the method names ruby-prof uses are slightly different, the problem
remains the same. ruby-prof hacks through it by pushing and popping
extra items on its stack to counterweigh the imbalanced method calls. Thus its
hard-coded to a specific sequence of method calls. So why doesn’t it work on Windows?

A quick trace running our test program on Windows shows the problem:

return start_profile
return require
call print_profile
call stop_profile
c-call set_trace_func

There is an extra “return require” which is being generated by Ruby gems. And if you run the program in Arachno, which uses a modified version of Ruby to supports
its fantastic debugger (its fast enough that I always run Rails under the debugger
so I can set breakpoints at key places – definitely go check it out).

c-return set_trace_func 
return start_profile 
c-return require__ 
return require 
c-return require__ 
return require 
call print_profile 
call stop_profile 
c-call set_trace_func 

It quickly becomes clear that assuming a balanced stack is a bad idea. If you look
at the built in Ruby profile it doesn’t make such an assumption.

These changes have been merged into ruby-prof-0.4.0 which is now available as a RubyGem.

So, I’ve patched ruby-prof to remove this assumption and to make it compile on
Windows. I’ll submit the patch to Shugo Maeda, but in the meantime, I’ve provided
windows binaries for anyone who wants to use the profiler on windows. To install:

1. Download the windows extension, prof.xo, and put it in your ruby\lib\ruby\site_ruby\1.8\i386-msvcrt
directory.

2. Download unprof.rb and put it in your ruby\lib\ruby\site_ruby\1.8 directory.

3. To use the profiler simply require ‘unprof’ at the top of the file

One thing to note about my changes. The self-time for the “toplevel” method will
always show “0”. Its looks like the Ruby profiler does the same thing, so I think
this is ok.

Assembly Hacking

This section is for anyone who’s interested in some lower level details – feel
free to skip it.

Getting ruby-prof to compile on windows required a few of the usual changes.
For example, making sure that the extension’s initialization method is property
exported using __declspec(dllexport), etc.

However, ruby-prof provides an extra twist. It can measure time in several ways including using some low-level functionality provided by more
recent Pentium and PowerPC processors. To access this information it uses this
inline assembly call:


static prof_clock_t
cpu_get_clock()
{
#if defined(__i386__)
    unsigned long long x;
    __asm__ __volatile__ ("rdtsc" : "=A" (x));
    return x;
#elif defined(__powerpc__) || defined(__ppc__)
    unsigned long long x, y;

    __asm__ __volatile__ ("\n\
1:	mftbu   %1\n\
	mftb    %L0\n\
	mftbu   %0\n\
	cmpw    %0,%1\n\
	bne-    1b"
	: "=r" (x), "=r" (y));
    return x;
#endif
}

For x86 chips, what it does is call the rdtsc assembly function which returns
the number of clock cycles that have been executed. So if you call get_cpu_clock,
wait 1 second, and call get_cpu_clock again, you can calculate the chip’s clock
frequency. Using this information, you can time method calls. For instance,
if the chip’s frequency is 500Mhz and a method takes 250,000,000 cycles to complete, you can calculate it took 0.5 seconds.

This of course won’t work with Visual C++ because it uses its own syntax for
inline assembly calls. In this case there are couple ways of porting this code.
Newer versions of Visual C++ support compiler intrinsics, and there is one for
rdtsc. However, I thought it would be better to use inline assembly to support older versions. Here’s the code:

static prof_clock_t
cpu_get_clock()
{
    prof_clock_t cycles = 0;

    __asm
    {
        rdtsc
        mov DWORD PTR cycles, eax
        mov DWORD PTR [cycles + 4], edx
    }
    return cycles;
}

To use this timing method you have to specifically enable it by including the
following line in your ruby code.

ENV["RUBY_PROF_CLOCK_MODE"] = "cpu"
require 'unprof'

However, I can’t say this works very well. The calculated
frequency for my chip is always different. I don’t know why – my best guess is that
its a Pentium M with Intel’s speed step technology so the clock frequency varies
to save power. However, I’m usually plugged in so I don’t think that’s it. Note
you can tell ruby-prof your click frequency like this:

ENV["RUBY_PROF_CLOCK_MODE"] = "cpu"
ENV["RUBY_PROF_CPU_FREQUENCY"]= "466000000"
require 'unprof'

So my recommendation is just use the default ruby-prof timing method – it does
the job perfectly well.

These changes have been merged into ruby-prof-0.4.0 so I’ve taken them offline
  1. June 11, 2006

    This patch is much easier to read:

    --- ruby_prof.c.orig	2006-06-09 23:06:22.000000000 +0200
    +++ ruby_prof.c.new	2006-06-09 23:06:29.000000000 +0200
    @@ -73,7 +73,7 @@
     
     #define CLOCK_MODE_CLOCK 0
     #define CLOCK_MODE_GETTIMEOFDAY 1
    -#if defined(__GNUC__) && (defined(__i386__) || defined(__powerpc__) || defined(__ppc__))
    +#if defined(_WIN32) || (defined(__GNUC__) && (defined(__i386__) || defined(__powerpc__) || defined(__ppc__)))
     #define CLOCK_MODE_CPU 2
     static double cpu_frequency;
     #endif
    @@ -109,6 +109,9 @@
     
     #ifdef CLOCK_MODE_CPU
     
    +
    +#if defined(__GNUC__)
    +
     static prof_clock_t
     cpu_get_clock()
     {
    @@ -130,6 +133,57 @@
     #endif
     }
     
    +#elif defined(_WIN32)
    +
    +static prof_clock_t
    +cpu_get_clock()
    +{
    +    prof_clock_t cycles = 0;
    +
    +    __asm
    +    {
    +        rdtsc
    +            mov DWORD PTR cycles, eax
    +            mov DWORD PTR [cycles + 4], edx
    +    }
    +    return cycles;
    +}
    +
    +#endif
    +
    +
    +/* The _WIN32 check is needed for msys (and maybe cygwin?) */
    +#if defined(__GNUC__) && !defined(_WIN32)
    +
    +double get_cpu_frequency()
    +{
    +    unsigned long long x, y;
    +
    +    struct timespec ts;
    +    ts.tv_sec = 0;
    +    ts.tv_nsec = 500000000;
    +    x = cpu_get_clock();
    +    nanosleep(&ts, );
    +    y = cpu_get_clock();
    +    return (y - x) * 2;
    +}
    +
    +#elif defined(_WIN32)
    +
    +double get_cpu_frequency()
    +{
    +    unsigned long long x, y;
    +    double frequency;
    +    x = cpu_get_clock();
    +    /* Use the windows sleep function, not Ruby's */
    +    Sleep(500);
    +    y = cpu_get_clock();
    +    frequency = 2*(y-x);
    +    printf("clock speed: %f", frequency);
    +    return frequency;
    +}
    +#endif
    +
     static double
     cpu_clock2sec(prof_clock_t c)
     {
    @@ -190,13 +244,13 @@
         }
         return stack->ptr++;
     }
    -
     static inline prof_data_t *
     stack_pop(prof_stack_t *stack)
     {
         if (stack->ptr == stack->start)
    -        rb_fatal("empty stack");
    -    return --stack->ptr;
    +        return ;
    +    else
    +        return --stack->ptr;
     }
     
     static inline prof_data_t *
    @@ -459,6 +513,8 @@
     
     	    now = get_clock();
     	    data = stack_pop(stack);
    +	    if (data == )
    +	        break;
     	    if (!NIL_P(klass)) {
     		if (TYPE(klass) == T_ICLASS) {
     		    klass = RBASIC(klass)->klass;
    @@ -486,10 +542,10 @@
     static VALUE
     prof_start(VALUE self)
     {
    -    prof_data_t *data;
    -    
    +    prof_clock_t now;
    +
         if (stack_tbl) {
    -	rb_raise(rb_eRuntimeError, "Prof.start was already called");
    +        rb_raise(rb_eRuntimeError, "Prof.start was already called");
         }
         stack_tbl = stack_table_create();
         class_tbl = rb_hash_new();
    @@ -497,13 +553,17 @@
         stack = stack_create();
         stack_table_insert(stack_tbl, profiling_thread, stack);
         result_tbl = minfo_table_create();
    +
    +    now = get_clock();
         toplevel = prof_result_create(Qnil, rb_intern("#toplevel"));
    +    toplevel->count = 1;
    +    toplevel->total_time = now;
    +    toplevel->self_time = 0;
    +
         rb_add_event_hook(prof_event_hook,
    -		      RUBY_EVENT_CALL | RUBY_EVENT_RETURN |
    -		      RUBY_EVENT_C_CALL | RUBY_EVENT_C_RETURN);
    -    data = stack_push(stack);
    -    data->start_time = get_clock();
    -    data->child_cost = 0;
    +        RUBY_EVENT_CALL | RUBY_EVENT_RETURN |
    +        RUBY_EVENT_C_CALL | RUBY_EVENT_C_RETURN);
    +
         return Qnil;
     }
     
    @@ -528,19 +588,16 @@
     static VALUE
     prof_stop(VALUE self)
     {
    -    prof_data_t *data;
         VALUE result_list;
         prof_clock_t now;
     
         if (stack_tbl == ) {
    -	rb_raise(rb_eRuntimeError, "Prof.start is not called yet");
    +        rb_raise(rb_eRuntimeError, "Prof.start is not called yet");
         }
    +
         now = get_clock();
    -    stack_pop(stack); /* data for Prof.stop */
    -    data = stack_pop(stack);
    -    toplevel->count = 1;
    -    toplevel->total_time = now - data->start_time;
    -    toplevel->self_time = toplevel->total_time - data->child_cost;
    +    toplevel->total_time = now - toplevel->total_time;
    +
         st_foreach(stack_tbl, free_stack, 0);
         st_free_table(stack_tbl);
         stack_tbl = ;
    @@ -581,17 +638,8 @@
     	break;
     #ifdef CLOCK_MODE_CPU
         case CLOCK_MODE_CPU:
    -	if (cpu_frequency == 0) {
    -	    unsigned long long x, y;
    -	    struct timespec ts;
    -
    -	    ts.tv_sec = 0;
    -	    ts.tv_nsec = 500000000;
    -	    x = cpu_get_clock();
    -	    nanosleep(&ts, );
    -	    y = cpu_get_clock();
    -	    cpu_frequency = (y - x) * 2;
    -	}
    +    if (cpu_frequency == 0)
    +      cpu_frequency = get_cpu_frequency();
     	get_clock = cpu_get_clock;
     	clock2sec = cpu_clock2sec;
     	break;
    @@ -604,6 +652,9 @@
         return val;
     }
     
    +#if defined(_WIN32)
    +__declspec(dllexport) 
    +#endif
     void
     Init_prof()
     {
    

    Reply
  2. Charlie Savage –
    June 12, 2006

    Hi Mauricio,

    Curious what problems you had reading the patch? Was it due to Unix line endings versus Windows, or something else?

    Great job on [rcov](http://eigenclass.org/hiki.rb?rcov+0.5.0) – I use it almost every day.

    Reply
  3. June 14, 2006

    I believe the patch is hard to read because if you look closely, the patch is ‘removing’ and then ‘adding’ the same lines, cancelling each other out – but taking up a whole lot of space.

    Removing the redundant lines, makes the patch much smaller like the above.

    Reply
  4. Charlie
    June 16, 2006

    Thanks Paul – hadn’t noticed that. I’ve made some more changes in the last couple of days and will post them once they’re read.

    Reply

Leave a Reply

Your email address will not be published.

Top