Posted by Charlie
Sat, 02 Feb 2008 22:58:00 GMT
A couple of days ago, Alex Dymo from Pluron sent me an email describing some of the great work he has done optimizing the performance of their online project managment software Accunote. His great insight was that their performance problems were caused by allocating too much memory, thus forcing Ruby's Garbage Collector to frequently run ruining performance.
Using a patched version of ruby and ruby-prof, Alex was able to more than double performance (with hints of more to come) and reduced memory consumption by 75%, or 750MB (yes - that is Megabytes). Alex does a wonderful job of documenting his approach with a series of blog posts here and here.
The main culprit was Rail's handling of attributes, which is dreadfully designed (an obvious case of the simplest solution to a problem is the wrong solution - something I've been meaning to blog about for almost a year now ). But he also implicated Ruby's built-in benchmarking module.
Even better, Alex provided patches to ruby core (already accepted), Rails (already accepted) and ruby-prof. We'll also gladly accept his patches, and since its about time for a ruby-prof refresh, we'll spin out a new release as soon as we can. More to follow.
Posted in ruby-prof | 5 comments | no trackbacks
Posted by Charlie
Wed, 18 Jul 2007 17:12:00 GMT
Last week I showed how to profile a Rails application using ruby-prof. The post was driven by our desire to improve MapBuzz's map rendering speed. Our nightly log analysis showed that rendering speed has significantly degraded over the last few weeks. Since the whole point of MapBuzz is to share maps with others, we needed to find out what was taking so long.
A day of profiling and we had our answer. With a few simple changes we were able to reduce rendering time from 7.8 seconds to 0.92 seconds (note these times are gathered while profiling the application, so the real times are 2 to 3 times less).
What we found should be generally applicable to all Rails applications, so I'd like to share them:
If you interested in the nitty-gritty details of each change, keep reading!
A Bit of Background
MapBuzz renders maps in a two step process. First, the html page is created and shown to the user. Second, the browser makes an Ajax request to get the features that should be shown on the map. The reason for having a two step process is that it allows users to zoom and pan around the map, without having to reload the whole page.
Log analysis revealed that the Ajax request was taking a long time - 1 to 2 seconds, with a big standard deviation. Our goal is to reduce the average time to less than half a second and to significantly reduce the standard deviation.
If you look at the request URI, you'll see its an Atom feed of features and not an HTML page. That has performance implications.
First, Atom feeds tend to have lots of links. In our case, each entry has 10 to 15 links. We limit the number of entries to 50, so that means there are roughly 600 link per page.
Second, we generate Atom feeds in a modular fashion - each entry is created from 10 partials. There is a partial for the user, one for ratings, one for content, one for icons, one for geometries, etc. Why did we do it this way? Simple - to keep things DRY. Having a set of partials creates an extremely modular system - we can mix and match them as needed depending on the context of the request.
Setup
Testing was done with two machines. The first was my laptop, which is a Dell Latitude D610 running windows, Ruby 1.8.4, and Rails 1.2.3. The second is our staging server, which is fairly high end Dell Desktop running Fedora Core 6 with a copy of our production database (which is about 10Gb). Profiling was done with ruby-prof 0.5.1 (of course), using the Rails plugin I described previously. Note that the logging level was set to :debug, which will show up in the test results.
Baseline
To start we measured the baseline performance of our test setup. It was ugly indeed:
Total: 7.828
%self total self wait child calls name
24.11 1.95 1.89 0.00 0.06 3605 Kernel#clone
13.15 1.04 1.03 0.02 0.00 569 IO#write
7.78 4.13 0.61 0.00 3.52 441 <Module::Benchmark>#
realtime-1
7.41 0.58 0.58 0.00 0.00 125 PGconn#exec
One bit of solace is that profiling an application will slow it down 2 to 3 times, but that sill leaves us with a 4 seconds per request. So let's start digging through the results.
The first thing that jumps out is the huge amount of time Kernel#clone takes. A look at the call graph shows that the caller is clone_attribute_value.
And a bit more digging reveals this custom code in our application:
def rating
# Convert cached rating from BigDecimal to float,
# otherwise strange rounding errors happen
attributes['rating'].to_f
end
The problem is that when you call ActiveRecord#attributes it returns a copy of the attributes, thus generating all the clone calls. Uuggh. I think this is a result of ActiveRecord's flawed attribute implementation, but that is for another blog.
What we want is access to the untyped, original value of the attribute. In theory you are supposed to use the auto generated method, rating_before_type_cast, for this. This has the advantage of skipping the clone call, but it relies on method_missing, which has some overhead (we did not measure how much). Or, you could use read_attribute directly, which also skips the clone call. This is usually your best choice.
For performance critical code, you may wish to read the untyped attribute directly (see the discussion about time below for an example). That can be done using the method read_attribute_before_type_cast, except that it is private. However, with Ruby, that's easily solved:
module ActiveRecord
class Base
public :read_attribute_before_type_cast
end
end
Lesson - Don't use attributes
Gain - 24%
Let's look at our updated results:
Total: 3.39
%self total self wait child calls name
16.73 0.57 0.57 0.00 0.00 569 IO#write
12.83 0.44 0.44 0.00 0.00 125 PGconn#exec
8.76 1.77 0.30 0.00 1.48 441 <Module::Benchmark>#
realtime-1
5.99 0.22 0.20 0.00 0.01 40 PGconn#query
What stands out are the 125 calls to PGconn#exec - each one is a separate query. Looks like we forgot to specify a table or two to be eager loaded via ActiveRecord#find's :include option.
Note there is one downside to using :include - you lose the ability to use the :select option. For us that is important - and can be worked around by using the rails select_parser I've blogged about before.
Lesson - Get your :includes right
Gain - 12%
After fixing attributes and includes, here is our current timings:
Total: 2.594
%self total self wait child calls name
12.64 1.14 0.33 0.00 0.81 361 <Module::Benchmark>#
realtime-1
12.61 0.33 0.33 0.00 0.00 409 IO#write
4.90 0.13 0.13 0.00 0.00 403 <Class::File>#mtime
4.86 0.16 0.13 0.00 0.03 30250 Hash#[]
3.62 0.09 0.09 0.00 0.00 5 PGconn#exec
A couple of things jump out. First, using Benchmark#realtime is fairly time consuming.
Second, logging also takes its toll as seen in the times for IO#write (remember we have
logging set to :debug).
However, the #mtime call looks suspicious. Some more digging through Rails shows all 403 calls come from
ActionView::Base#compile_template?. Let's take a look:
def compile_template?(template, file_name, local_assigns)
method_key = file_name || template
render_symbol = @@method_names[method_key]
if @@compile_time[render_symbol] &&
supports_local_assigns?(render_symbol, local_assigns)
if file_name && !@@cache_template_loading
@@compile_time[render_symbol] < File.mtime(file_name) ||
(File.symlink?(file_name) &&
(@@compile_time[render_symbol] < File.lstat(file_name).mtime))
end
else
true
end
end
Remember I mentioned that it takes roughly 500 partial calls to generate the output? This is where it bites us -
by default Rails checks the timestamp of cached templates before running them. In a production environment that
is totally unnecessary. The solution is simple - just update your production.rb file like this:
# Don't check view timestamps!
config.action_view.cache_template_loading = true
And while we are at it, we'll change the debug level to :info to reduce its impact on the results.
Lesson - Set cache_template_loading to true
Gain - 5%
We've managed to reduce the request time from 7.8 to 2.5 seconds. Not bad. But we still have more work to do.
Total: 2.531
%self total self wait child calls name
13.59 1.91 0.34 0.00 1.56 45 <Module::Benchmark>#
realtime
12.25 0.31 0.31 0.00 0.00 409 IO#write
4.98 0.81 0.13 0.00 0.69 361 <Module::Benchmark>#
realtime-1
4.94 0.13 0.13 0.00 0.00 40 ActiveRecord::Associations::
HasManyThroughAssociation#
construct_sql
3.71 0.09 0.09 0.00 0.00 5 PGconn#exec
3.04 0.08 0.08 0.00 0.00 29847 Hash#[]
3.00 0.11 0.08 0.00 0.03 2097 Hash#each
1.86 0.28 0.05 0.00 0.23 322 ActionController::Routing::
RouteSet#generate
Of particular interest is the ActionController::Routing::RouteSet#generate. Although it only runs for 0.05 seconds, its total time, including children, is 0.28. Thus creating 322 urls takes 11% of the request time.
People have previously mentioned how slow url_for is - well, its true. Instead of using url_for, just create your urls manually using string mashing. For example, if you want an absolute uri, then do this:
"#{request.protocol}#{request.host_with_port}/controller/#{map.id}"
Lesson - Don't use url_for or link_to
Gain - 11%
After removing url_for, we've smashed the 2 second barrier:
Total: 1.391
%self total self wait child calls name
9.06 0.13 0.13 0.00 0.00 726 String#sub!
8.99 0.13 0.13 0.00 0.00 40 ActionView::Base::CompiledTemplates#
_run_ratom_47app47views47geometry47_geometry46ratom
6.69 0.09 0.09 0.00 0.00 5 PGconn#exec
5.54 0.08 0.08 0.00 0.00 741 Item#id
5.54 0.30 0.08 0.00 0.22 121 <Class::Date>#_parse
Next up is <Class::Date>#_parse, whose total time, including children, is a whopping 21% of the time. Ouch.
If you take a look at the code (its in the standard ruby library), you can see its quite complicated
since it tries to parse a variety of date formats. In addition, it makes use of rational numbers, which are none too fast.
Taking a look at a graph profile, we see that <Class::Date>#_parse is called from <Class::ActiveRecord::ConnectionAdapters::Column>#string_to_time.
Thus Rails uses it to convert timestamps received from the database into Ruby objects.
As I detailed in a recent post, the problem is that Ruby's DateTime implementation is extremely slow compared to its Time implementation. The solution is to avoid DateTime's entirely and use custom time parsing code. Fortunately, our database can output time in ISO 861 format, which Time can quickly parse.
So, any place we access times, we simply roll our own attribute readers and writers like this (note that we don't use read_attribute!).
def created_on
@created_on || Time.iso8601(read_attribute_before_type_cast('created_on'))
end
def created_on=(value)
write_attribute(:created_on, value)
@created_on = nil
end
def updated_on
@updated_on || Time.iso8601(read_attribute_before_type_cast('created_on'))
end
Lesson - Don't let Rails parse timestamps - do it yourself
Gain - 20%
We've almost hit the 1 second barrier:
Total: 1.156
%self total self wait child calls name
8.13 0.11 0.09 0.00 0.01 642 Kernel#send-3
8.13 0.09 0.09 0.00 0.00 5 PGconn#exec
6.75 0.08 0.08 0.00 0.00 7497 String#concat
4.15 0.05 0.05 0.00 0.00 817 Hash#each
4.07 0.05 0.05 0.00 0.00 40 ActiveRecord::Associations::
HasAndBelongsToManyAssociation#
construct_sql
2.77 0.03 0.03 0.00 0.00 6360 String#match
2.77 0.06 0.03 0.00 0.03 47 Array#each_index
2.77 0.03 0.03 0.00 0.00 280 IconType#size
2.60 0.03 0.03 0.00 0.00 4309 Array#[]
At this point, most of the easy wins are gone. However, there is at least one left. Although its not shown in the flat profile, the graph profile shows that the method ActiveSupport::CoreExtensions::Hash::Keys#symbolize_keys takes 2.77% of the time including its children. Additional rummaging through the call graph shows that the calls come from ActionView#compile_and_render_template:
def compile_and_render_template(extension, template = nil,
file_path = nil, local_assigns = {})
# convert string keys to symbols if requested
local_assigns = local_assigns.symbolize_keys if
@@local_assigns_support_string_keys
A bit of research shows that local_assigns_support_string_keys is deprecated and slated for removal from Rails! Sweet - that means we can save an additional 3% simply by adding this line to environment.rb:
config.action_view.local_assigns_support_string_keys = false
If only all optimizations were so simple.
Lesson - Don't symbolize keys
Gain - 3%
Wrapping Up
So what's our final time? Let's see:
Total: 0.922
%self total self wait child calls name
10.30 0.10 0.10 0.00 0.00 5 PGconn#exec
3.47 0.17 0.03 0.00 0.14 312 Array#each-2
3.47 0.03 0.03 0.00 0.00 212 ActiveRecord::ConnectionAdapters::
Quoting#quote
3.47 0.03 0.03 0.00 0.00 19455 Hash#[]
3.47 0.36 0.03 0.00 0.33 360 ActionView::Base#render-2
3.36 0.06 0.03 0.00 0.03 47 Array#each_index
3.36 0.03 0.03 0.00 0.00 299 ActiveRecord::Associations::
AssociationProxy#initialize
3.36 0.14 0.03 0.00 0.11 705 ActiveRecord::Associations::
AssociationProxy#method_missing
Not bad - we reduced the request time from 7.8 to 0.92 seconds with a day's worth of work. Obviously there is still much to be done - we need to run the code through our performance monitoring suite and make sure that the average time holds across a number of requests and that the standard deviation is in line. But we've at least made a good start.
Posted in Programming, Rails, Ruby, ruby-prof | 27 comments | 7 trackbacks
Posted by Charlie
Tue, 10 Jul 2007 16:58:00 GMT
Yesterday I mentioned that ruby-prof now supports profiling rails applications. Lets see how that works.
But a word of caution before beginning. Like all web applications, Rails apps are complex. They encompass many pieces - clients on remote machines, network connections, any number of application servers and database servers. Getting the whole ensemble to work can be tricky - let alone making it perform well.
To avoid drowning in complexity (and data), its crucial to use a rigorous approach to analyzing your application's performance. For a good example of how to approach the problem, take a look at Zed Shaw's analysis of Ruby/Odeum's performance.
Ok - enough lecturing - onto the good stuff.
Installation and Setup
Assume that after much hard work, you've identified a request that takes too long (where of course you have to define what too long is). In addition, you know that the culprit is your Rails code - not the client, not the network and not the database.The trick is to find out what is taking so long - and there is where ruby-prof helps.
First install ruby-prof by opening a command line and typing:
gem install ruby-prof
Next, copy ruby-prof's Rails plugin to your Rails plugin directory:
cp rails_plugin rails_app/vendor/plugins
Finally create a new Rails configuration with the following setting:
config.cache_classes = true
If you don't cache classes, then the time Rails spends reloading and compiling your models and controllers will overwhelm anything else your code does. And obviously try to make your test setup as similar as possible to your production environment - use a copy of the same database, use the same web server, etc.
Profiling a Request
Once you've completed your setup, and started your Rails server, make a single request. Obviously for real testing you'll need to make many requests, but let's start with one to see how ruby-prof works.
To make this as realistic as possible, I'm going to show some old profiling data for MapBuzz, which is a social mapping site I started that let's you see what's going on in your neighborhood.
One of the things that is important to us is how long it takes to render the home page - so let's make a request to the home page. Once the request is complete, the results are output in the Rails log file.
Completed in 0.71900 (1 reqs/sec)
Rendering: 0.37600 (52%) | DB: 0.18700 (26%)
200 OK [http://localhost/]
Thread ID: 175635930
Total: 0.719
%self total self wait child calls name
26.15 0.19 0.19 0.00 0.00 23 PGconn#exec
26.01 0.19 0.19 0.00 0.00 27 PGresult#each
4.45 0.03 0.03 0.00 0.00 1403 String#match
4.45 0.03 0.03 0.00 0.00 171 IO#write
2.23 0.02 0.02 0.00 0.00 872 String#to_s
2.23 0.02 0.02 0.00 0.00 467 String#gsub
2.23 0.03 0.02 0.00 0.02 76 ManyParser#_parse
2.23 0.05 0.02 0.00 0.03 60 MonitorMixin
#synchronize
2.23 0.02 0.02 0.00 0.00 60 Parsers#sequence
2.23 0.02 0.02 0.00 0.00 1640 ParseContext#eof
2.23 0.02 0.02 0.00 0.00 31 ActionView::
Partials
#add_object_to_
local_assigns!
2.23 0.02 0.02 0.00 0.00 11 ActionView::
Helpers::
UrlHelper#link_to
2.23 0.02 0.02 0.00 0.00 243 Object#add_error
2.23 0.02 0.02 0.00 0.00 175 Kernel#clone
The first line shows the overall time as measured by the benchmark class used by Rails. Below that, you'll see a flat profile generated by ruby-prof that shows the slowest methods. By default methods that take more than 1% of the time will be printed - for brevity's sake I've culled it to 2%.
A quick scan of the results shows some interesting things. First, the most time is spent in PGconn#exec, which is used to query the database via a C extension. Most likely there isn't much to optimize there on the Rails side, although there certainly may be room for optimization on the database side.
Next, PGconn#each is likely looping over the query results. Its time seems excessive, but we'll worry about that later since we notice there are 1,403 calls to String#match. That seems overly excessive - what is causing that?
Digging Deeper
Flat profiles are great for providing an overview of where time is spent in your program, but they leave out a lot of useful information. In this case, we'd like to know what code is calling String#match. To do that we need information provided by graph profiles.
To create a graph profile, open up vendor/ruby-prof/lib/profiling.rb, and edit it like this:
module ActionController #:nodoc:
module Profiling #:nodoc:
def perform_action_with_profiling
if RubyProf.running? or
...
else
...
# Example for Graph html printer
printer = RubyProf::GraphHtmlPrinter.new(result)
File.open('request.html', 'w') do |file|
printer.print(file, {:min_percent => 1,
:print_file => true})
end
end
end
end
end
This will create a new file, request.html, that shows a graph profile of the request. Rerun the request, and then open up request.html.
In our case we are interested in the callers of String#match, so let's take a look:
| %Total |
%Self |
Total |
Self |
Wait |
Child |
Calls |
Name |
Line |
|
| |
|
0.00 |
0.00 |
0.00 |
0.00 |
1/1403 |
ActionController::AbstractRequest#accepts |
65 |
| |
|
0.00 |
0.00 |
0.00 |
0.00 |
195/1403 |
Array#each-1 |
103 |
| |
|
0.00 |
0.00 |
0.00 |
0.00 |
1/1403 |
ActionController::Rescue#local_machine? |
10 |
| |
|
0.00 |
0.00 |
0.00 |
0.00 |
232/1403 |
Array#each-2 |
103 |
| |
|
0.00 |
0.00 |
0.00 |
0.00 |
1/1403 |
ApplicationController#sanitize_status_codes |
118 |
| |
|
0.00 |
0.00 |
0.00 |
0.00 |
8/1403 |
ActiveRecord::ConnectionAdapters::
PostgreSQLAdapter#translate_field_type |
55 |
| |
|
0.00 |
0.00 |
0.00 |
0.00 |
232/1403 |
Array#each-3 |
103 |
| |
|
0.00 |
0.00 |
0.00 |
0.00 |
11/1403 |
Geos::GeometryFactory#create_geom_from_hex |
39 |
| |
|
0.02 |
0.02 |
0.00 |
0.00 |
380/1403 |
Array#each-4 |
103 |
| |
|
0.02 |
0.02 |
0.00 |
0.00 |
342/1403 |
Array#each-5 |
103 |
| 4.45% |
4.45% |
0.03 |
0.03 |
0.00 |
0.00 |
1403 |
String#match |
39 |
| |
|
0.00 |
0.00 |
0.00 |
0.00 |
1403/1429 |
Regexp#match |
118 |
|
If you haven't seen a graph profile before, you're probably feeling a bit overwhelmed. To help out a bit, I've written some documentation here. However, the quick summary is that the method of interest is shown in bold. Methods above it are the method's callers, while the methods below are the methods callees.
Notice the calls Array#each, Array#each-1, Array#each-2, etc? The dash and number indicate that a method has been called recursively. Thus Array#each-1 means that Array#each was called and then it in turn called Array#each either directly or indirectly.
Going back to our investigation, we see that Array#each-4 called String#match the most times. So the next step would be to click the Array#each-4 hyperlink and see where it is called from.
Visualizing a Profile
Once you create your own graph profile, you'll quickly realize how valuable the html hyperlinks are for navigating through the mass of information. But wouldn't it be nice to visualize the profile? Glad you asked!
ruby-prof 0.5.0 can now output call tree information thanks to a patch from Carl Shimer. To create a call out put format, once again you'll have to modify profiling.rb:
module ActionController #:nodoc:
module Profiling #:nodoc:
def perform_action_with_profiling
if RubyProf.running? or
...
else
...
# Example for Graph html printer
printer = RubyProf::CallTreePrinter.new(result)
File.open('callgrind.out', 'w') do |file|
printer.print(file, {:min_percent => 1,
:print_file => true})
end
end
end
end
end
Once you've created the output file, open it with KCachegrind (sorry, this is Linux only, although there is a port to Windows which I have not tried using). KCachegrind take a bit of work to get used to, but once you do you can create visualizations that show you what your program is doing. In this case, here is a picture that shows String#match's callers.

Wrapping Up
Remember - most of the time you won't need ruby-prof. It provides much too much information to be used as a general way of finding performance bottlenecks - instead analyze your web log files, Rails log files and database log files.
But once you know the problem is in your Rails code, ruby-prof is invaluable in pointing out where it is.
Posted in Rails, ruby-prof | 10 comments | 4 trackbacks
Posted by Charlie
Mon, 09 Jul 2007 06:56:00 GMT
Make sure to grab the latest version of ruby-prof, currently 0.5.2, which includes some performance tweaks and bug fixes.
A year has passed since the last release of ruby-prof. If you haven't used ruby-prof, its a superfast, open-source, profiler for Ruby that shows you where your program is slow.
Over the last few months the list of bugs and enhancements on Ruby Forge has been accumulating at an embarassing rate. And not to mention, I need to profile our rails app.
So its time for another release.
And this isn't just a minor bug fix release - its a major update that adds lots of new stuff. To install it, open a command prompt and type:
gem install ruby-prof
Then pick the appropriate gem for your platform.
Multithreaded Applications
The most important change is support for multi-threaded applications. Previously, ruby-prof correctly recognized the threads in a program and their independent call stacks, but it got the thread times wrong. ruby-prof would incorrectly add a child thread's time to a parent thread's time, making hard to figure out where the time in a program was spent. Now, ruby-prof keeps track of a thread's "wait time," which is the time it spends waiting for other threads. Thanks to Sylvain Joyeux for a patch that got the ball rolling on better thread support.
Measurement Modes
Sylvain also submitted a second patch that suprised me. Instead of tracking method times, why not track object allocations? That seemed reasonable, but he implemented it as another type of clock mode (ruby-prof supports 3 different ways of measuring time - process time, wall time and cpu time).
At first I was hesitant about the patch - shoehorning object allocations into clock modes didn't make much sense to me. But after thinking about it a bit, I decided Sylvain was onto something. Instead of thinking about "clock modes" I started thinking about "measurement modes" (catchy, isn't it?). And once you have that perspective, then it makes sense for ruby-prof to measure all sorts of things in a running program - time, object allocations, memory usage, etc.
So with a bit of refactoring, ruby-prof's now supports for "measure modes." They are process time, wall time, cpu time an object allocation. And now its easy to add more if anyone wants to submit a patch!
I should note there is one downside to the object allocations mode - it only works if you have a patched Ruby interpreter (if you don't ruby-prof works fine, you just don't get access to allocation information).
Rails
What use would a Ruby profiler be if it didn't work with Rails? ruby-prof now provides some hooks to make easier to profile your rails app. But I'll leave those for a future blog post.
Call Graph Format
Carl Shimer implemented support for the call graph format which is used by KCachegrind to visualize calling information.
Recursive Methods
Finally, ruby-prof has much improved support for recursive method calls as explained in the updated readme file.
Bug Fixes
And last, this release fixes some nasty bugs:
Unknown Singleton Object - This was a tough one since I didn't understand how to reproduce it. But luckily Matthew Fallshaw submitted a reproducable test case, and from there it was just a matter of reading the translated version of the Ruby Hacking Guide to unravel the mysteries of Ruby's singleton classes.
64-bit support - The 0.4.x releases used ints to store pointers instead of longs, which meant they didn't compile and/or work on 64 bit machines. This is now fixed thanks to a patch submitted by Diego 'Flameeyes' Pettenò.
IRB support - Previously you couldn't start ruby-prof in a method and then exit that method. Which meant that ruby-prof couldn't be used in IRB. That restriction has now been removed.
sort order - The sort orderan in generated reports was wrong. Graph reports now sort based on a method's total percent time while flat reports use a method's self percent time.
So give version 0.5.0 a try and see how you like it!
Posted in Ruby, ruby-prof | 2 comments | 2 trackbacks
Posted by Charlie
Wed, 16 Aug 2006 06:00:00 GMT
Pat Eyler has written a nice set
of articles about profiling Ruby code. He shows how
to use the built in Ruby profiler as well as ruby-prof (my personal
favorite :).
He talks a bit about call
graphs, so if you're not familiar with them, his article is
a good place to start.
Posted in Ruby, ruby-prof, Tools | 2 comments | no trackbacks
Posted by Charlie
Wed, 21 Jun 2006 19:06:00 GMT
Update: ruby-prof 0.5.0 is now available and has significantly more features than 0.4.0, including better threading support, rails support, call tree output, etc.
After porting ruby-prof to Windows a couple of weeks ago, Shugo
Maeda, ruby-prof's author, and I started working together. We're
happy to announce the release of ruby-prof 0.4.0, which is chock
full of new features:
- Addition of call graph profiles similar to GProf
- Improved speed - overhead is now as low as 15% for some code, although
you should generally expect around 50%
- Full support for multiple threads
- New cross-referenced html reports
- New ruby-prof script that makes it easy to profile your programs
without modifying them
- Vastly improved documentation
- Detection of recursive calls and
call cycles
- Support for windows
Best of all, ruby-prof is now distributed as a gem so it's as easy to
install as:
gem install ruby-prof
If you're on Windows, the gem includes a pre-built windows binary. If
you're on Linux or Unix, the binary will be automatically built on intallation.
Graph Profiles
My favorite new feature - and the raison d’être for this
release, is the addition of call graphs. Here is the source code for the following examples.
Most profiling tools for Ruby1
output flat profiles, which are useful for identifying which methods
take the longest. They are concise and easy to understand.
Let's take a look at an example to
see what I mean.
For short programs, flat profiles work well. But for
long programs, it's really helpful to understand more about the context
in which a method is called. For example, which methods called the one
we're interested in? Which methods did it call?
A quick word of warning - call graphs can be overwhelming if you haven't
seen one before. Let's venture forth and take a look at one that
shows the same results as the flat profile above.
Quite a bit different, isn't it? If you haven't used a call graph before
I've put together a little tutorial here.
There are also a number of excellent examples on the Web - search Google
for "gprof call graph tutorial."
Speed
The thing that got me started working on ruby-prof was that the
built in Ruby profiler is beyond slow (and doesn't support call graphs).
So how does ruby-prof compare?
We spent a good deal of time profiling ruby-prof, using oprofile on
Linux and Rational PurifyPlus on
Windows. When we started, ruby-prof added over 100% overhead to a program.
Thus, if a program took 10 seconds to run the profile run would take at
least 20. That's not too bad, but there was clearly room for improvement.
Internally ruby-prof maintains two main data structures. The first is
a stack that keeps track of the current call sequence. The second is a
hash table with one entry per method profiled per thread.
The slowest part was looking up the method information in the hash table.
This happens each time a method is entered or exited.
To give you some idea of the number of times this happens, take a look
at the the Fixnum# method
in the graph profile above. For a program that lasted only 8 seconds, Fixnum#
got called a whopping 250,000 times (and these results are from last week
before our optimization work, so the program really only takes around 3
or 4 seconds to run).
The key to speeding up ruby-prof was to reduce these lookups as much as
possible. This was done by a combination of caching to avoid lookups,
simplyifing the hash key to make lookups faster,
and making the method tracking logic as simple as possible.
The result are quite encouraging. On "normal" programs, like Rails
apps, ruby-prof's overhead is now less than 50% (and in fact, when I profile
our unit tests it is more in the range of 10% to 20%). For
programs that stress profilers, like the prime test above or a
factorial method, the overhead is somewhere in the 50% to 80% range.
Next time - some results from profiling Rails.
The latest release of Mauricio Fernandez excellent rcov extension looks like it now has this functionality
Posted in Ruby, ruby-prof | 15 comments | no trackbacks
Posted by Charlie
Fri, 09 Jun 2006 19:14:00 GMT
Yesterday I wanted to profile some methods I'm using on a Rails controller.
To get a feel for profiling Ruby code I put together a test
case, added a
"require 'prof'" to the top of the file and eagerly waited for the
results. And waited, and waited, and waited. Thinking I did something
wrong, I ran the code without a profiler - it took about 2 seconds. With the
profiler it took so long I gave up. And this
was on a dual core pentium D processor with 1 Gig of memory running Fedora Core
5.
Time for some investigation. It turns out this is a well known problem -
the built-in Ruby profiler, which is written in Ruby, is so slow as to be useless.
I came across two alternatives - ruby-prof, a
C extension written by Shugo Maeda, and ZenProfile, an inline C exension done by Ryan Davis.
I went with ruby-prof. On Linux it was easy enough to download, build and
install and it worked like a charm. But I do most of my work on my laptop which
runs Windows XP. So I opened up MingW and built and installed the extension on
Windows (that's not quite true, I had to hack the C a bit, more info below). But when I ran the test script I was met with a program fail message
saying that the stack was empty. Ugh.
Since I find it impossible to debug extensions on Windows with MingW I
fired up Visual Studio 2005, rebuilt the extension, and tried again. Same issue.
Digging deeper, it turns out the profilers (as well as the wonderful rcov project)
work by registering a callback with Kernel::set_trace_func. When Ruby
executes a line of code, enters a new Ruby or C method, or exists a Ruby or C
method, the callback is activated.
The problem is that ruby-prof assumes that each call into a method is matched
by a return - and if its not then the failure I see is triggered. To understand
the problem, let's look at a super simple test case:
require 'profiler'
I said it was simple - didn't I! Here's the trace from Linux:
return start_profile
call print_profile
call stop_profile
c-call set_trace_func
Start_profile is the method that installs the set_trace callback - so it makes
sens that the first thing we see is returning from that method. Once the program
is done, the profiler calls print_profile, which calls stop_profile, which
calls set_trac_func which uninstalls the callback. So the method enters and
returns do not balance.
Although the method names ruby-prof uses are slightly different, the problem
remains the same. ruby-prof hacks through it by pushing and popping
extra items on its stack to counterweigh the imbalanced method calls. Thus its
hard-coded to a specific sequence of method calls. So why doesn't it work on Windows?
A quick trace running our test program on Windows shows the problem:
return start_profile
return require
call print_profile
call stop_profile
c-call set_trace_func
There is an extra "return require" which is being generated by Ruby gems.
And if you run the program in
Arachno, which uses a modified version of Ruby to supports
its fantastic debugger (its fast enough that I always run Rails under the debugger
so I can set breakpoints at key places - definitely go check it out).
c-return set_trace_func
return start_profile
c-return require__
return require
c-return require__
return require
call print_profile
call stop_profile
c-call set_trace_func
It quickly becomes clear that assuming a balanced stack is a bad idea. If you look
at the built in Ruby profile it doesn't make such an assumption.
These changes have been merged into ruby-prof-0.4.0 which is now available as a RubyGem.
So, I've patched ruby-prof to remove this assumption and to make it compile on
Windows. I'll submit the patch to Shugo Maeda, but in the meantime, I've provided
windows binaries for anyone who wants to use the profiler on windows. To install:
1. Download the windows extension, prof.xo, and put it in your ruby\lib\ruby\site_ruby\1.8\i386-msvcrt
directory.
2. Download unprof.rb and put it in your ruby\lib\ruby\site_ruby\1.8 directory.
3. To use the profiler simply require 'unprof' at the top of the file
One thing to note about my changes. The self-time for the "toplevel" method will
always show "0". Its looks like the Ruby profiler does the same thing, so I think
this is ok.
Assembly Hacking
This section is for anyone who's interested in some lower level details - feel
free to skip it.
Getting ruby-prof to compile on windows required a few of the usual changes.
For example, making sure that the extension's initialization method is property
exported using __declspec(dllexport), etc.
However, ruby-prof provides an extra twist. It can measure time in several ways including using some low-level functionality provided by more
recent Pentium and PowerPC processors. To access this information it uses this
inline assembly call:
static prof_clock_t
cpu_get_clock()
{
#if defined(__i386__)
unsigned long long x;
__asm__ __volatile__ ("rdtsc" : "=A" (x));
return x;
#elif defined(__powerpc__) || defined(__ppc__)
unsigned long long x, y;
__asm__ __volatile__ ("\n\
1: mftbu %1\n\
mftb %L0\n\
mftbu %0\n\
cmpw %0,%1\n\
bne- 1b"
: "=r" (x), "=r" (y));
return x;
#endif
}
For x86 chips, what it does is call the rdtsc assembly function which returns
the number of clock cycles that have been executed. So if you call get_cpu_clock,
wait 1 second, and call get_cpu_clock again, you can calculate the chip's clock
frequency. Using this information, you can time method calls. For instance,
if the chip's frequency is 500Mhz and a method takes 250,000,000 cycles to complete,
you can calculate it took 0.5 seconds.
This of course won't work with Visual C++ because it uses its own syntax for
inline assembly calls. In this case there are couple ways of porting this code.
Newer versions of Visual C++ support compiler intrinsics, and there is one for
rdtsc. However,
I thought it would be better to use inline assembly to support older versions.
Here's the code:
static prof_clock_t
cpu_get_clock()
{
prof_clock_t cycles = 0;
__asm
{
rdtsc
mov DWORD PTR cycles, eax
mov DWORD PTR [cycles + 4], edx
}
return cycles;
}
To use this timing method you have to specifically enable it by including the
following line in your ruby code.
ENV["RUBY_PROF_CLOCK_MODE"] = "cpu"
require 'unprof'
However, I can't say this works very well. The calculated
frequency for my chip is always different. I don't know why - my best guess is that
its a Pentium M with Intel's speed step technology so the clock frequency varies
to save power. However, I'm usually plugged in so I don't think that's it. Note
you can tell ruby-prof your click frequency like this:
ENV["RUBY_PROF_CLOCK_MODE"] = "cpu"
ENV["RUBY_PROF_CPU_FREQUENCY"]= "466000000"
require 'unprof'
So my recommendation is just use the default ruby-prof timing method - it does
the job perfectly well.
These changes have been merged into
ruby-prof-0.4.0 so I've taken them offline
Posted in Rails, Ruby, ruby-prof, Tools | 4 comments | no trackbacks