Making Rails Go Vroom

Last week I showed how to profile a Rails application using ruby-prof. The post was driven by our desire to improve MapBuzz’s map rendering speed. Our nightly log analysis showed that rendering speed had significantly degraded over the last few weeks. Since the whole point of MapBuzz is to share maps with others, we needed to find out what was taking so long.

A day of profiling and we had our answer. With a few simple changes we were able to reduce rendering time from 7.8 seconds to 0.92 seconds (note these times were gathered while profiling the application, so the real times are 2 to 3 times less).

What we found should be generally applicable to all Rails applications, so I’d like to share the lessons:

- Don’t use ActiveRecord#attributes
- Get your :includes right
- Set cache_template_loading to true
- Don’t use url_for
- Don’t let Rails parse timestamps
- Don’t symbolize keys

If you’re interested in the nitty-gritty details of each change, keep reading!

A Bit of Background

MapBuzz renders maps in a two-step process. First, the HTML page is created and shown to the user. Second, the browser makes an Ajax request to get the features that should be shown on the map. The reason for having a two-step process is that it allows users to zoom and pan around the map without having to reload the whole page.

Log analysis revealed that the Ajax request was taking a long time – 1 to 2 seconds, with a large standard deviation. Our goal was to reduce the average time to less than half a second and to significantly reduce the standard deviation.

If you look at the request URI, you’ll see it’s an Atom feed of features, not an HTML page. That has performance implications.

First, Atom feeds tend to have lots of links. In our case, each entry has 10 to 15 links. We limit the number of entries to 50, so that means there are roughly 600 links per page.

Second, we generate Atom feeds in a modular fashion – each entry is created from 10 partials. There is a partial for the user, one for ratings, one for content, one for icons, one for geometries, etc. Why did we do it this way? Simple – to keep things DRY. Having a set of partials creates an extremely modular system – we can mix and match them as needed depending on the context of the request.

Setup

Testing was done with two machines. The first was my laptop, a Dell Latitude D610 running Windows, Ruby 1.8.4, and Rails 1.2.3. The second was our staging server, a fairly high-end Dell desktop running Fedora Core 6 with a copy of our production database (which is about 10 GB). Profiling was done with ruby-prof 0.5.1 (of course), using the Rails plugin I described previously. Note that the logging level was set to :debug, which will show up in the test results.

Baseline

To start we measured the baseline performance of our test setup. It was ugly indeed:

Total: 7.828
%self     total     self     wait    child    calls  name
24.11      1.95     1.89     0.00     0.06     3605  Kernel#clone
13.15      1.04     1.03     0.02     0.00      569  IO#write
 7.78      4.13     0.61     0.00     3.52      441  <Module::Benchmark>#
                                                     realtime-1
 7.41      0.58     0.58     0.00     0.00      125  PGconn#exec

One bit of solace is that profiling an application slows it down 2 to 3 times, but that still leaves us with roughly 4 seconds per request. So let’s start digging through the results.

Don’t Use ActiveRecord#attributes

The first thing that jumps out is the huge amount of time Kernel#clone takes. A look at the call graph shows that the caller is clone_attribute_value.

And a bit more digging reveals this custom code in our application:

def rating
  # Convert cached rating from BigDecimal to float,
  # otherwise strange rounding errors happen
  attributes['rating'].to_f
end

The problem is that when you call ActiveRecord#attributes it returns a copy of the attributes, thus generating all the clone calls. Uuggh. I think this is a result of ActiveRecord’s flawed attribute implementation, but that is for another blog.

What we want is access to the untyped, original value of the attribute. In theory you are supposed to use the auto-generated method, rating_before_type_cast, for this. It has the advantage of skipping the clone call, but it relies on method_missing, which has some overhead (we did not measure how much). Alternatively, you can use read_attribute directly, which also skips the clone call. This is usually your best choice.

For performance critical code, you may wish to read the untyped attribute directly (see the discussion about time below for an example). That can be done using the method read_attribute_before_type_cast, except that it is private. However, with Ruby, that’s easily solved:

module ActiveRecord
  class Base
    public :read_attribute_before_type_cast
  end
end
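
Here’s a minimal plain-Ruby sketch (not the real ActiveRecord internals, and the Record class is made up) of why the two access paths differ in cost:

```ruby
class Record
  def initialize(raw)
    @attributes = raw
  end

  # Mimics ActiveRecord#attributes: hands back a copy, duplicating
  # every value -- this is where all the Kernel#clone calls came from
  def attributes
    @attributes.inject({}) do |copy, (key, value)|
      copy[key] = value.dup
      copy
    end
  end

  # Mimics read_attribute: returns the stored value directly, no copy
  def read_attribute(name)
    @attributes[name.to_s]
  end
end

record = Record.new('rating' => '4.5')
record.read_attribute(:rating).to_f   # => 4.5, no cloning involved
```

Calling attributes 3,605 times, as the profile showed, means 3,605 rounds of copying; read_attribute does none.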

Lesson – Don’t use attributes

Gain – 24%

Use include

Let’s look at our updated results:

Total: 3.39

 %self     total     self     wait    child    calls  name
 16.73      0.57     0.57     0.00     0.00      569  IO#write
 12.83      0.44     0.44     0.00     0.00      125  PGconn#exec
  8.76      1.77     0.30     0.00     1.48      441  <Module::Benchmark>#
	                                                    realtime-1
  5.99      0.22     0.20     0.00     0.01       40  PGconn#query

What stands out are the 125 calls to PGconn#exec – each one is a separate query. Looks like we forgot to specify a table or two to be eager loaded via ActiveRecord#find’s :include option.
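
What :include buys us can be sketched in plain Ruby (the table names and the query helper here are made up for illustration – the post doesn’t show MapBuzz’s actual schema):

```ruby
# Simulated query log standing in for PGconn#exec calls
QUERY_LOG = []

def query(sql)
  QUERY_LOG << sql
  # ... a real implementation would hit PostgreSQL here ...
end

# Lazy loading: one query for the features, then one more per feature
# for its associated user -- the classic N+1 pattern
def render_lazy(feature_ids)
  query("SELECT * FROM features WHERE id IN (#{feature_ids.join(',')})")
  feature_ids.each do |id|
    query("SELECT * FROM users WHERE feature_id = #{id}")
  end
end

# Eager loading (what :include does): two queries total, no matter
# how many features are on the page
def render_eager(feature_ids)
  query("SELECT * FROM features WHERE id IN (#{feature_ids.join(',')})")
  query("SELECT * FROM users WHERE feature_id IN (#{feature_ids.join(',')})")
end
```

With 50 entries per page, the lazy version issues 51 queries where the eager version issues 2 – which is exactly the kind of gap that showed up as 125 calls to PGconn#exec.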

Note there is one downside to using :include – you lose the ability to use the :select option. For us that is important, but it can be worked around by using the Rails select_parser I’ve blogged about before.

Lesson – Get your :includes right

Gain – 12%

cache_template_loading=true

After fixing attributes and includes, here are our current timings:

Total: 2.594

 %self     total     self     wait    child    calls  name
 12.64      1.14     0.33     0.00     0.81      361  <Module::Benchmark>#
                                                      realtime-1
 12.61      0.33     0.33     0.00     0.00      409  IO#write
  4.90      0.13     0.13     0.00     0.00      403  <Class::File>#mtime
  4.86      0.16     0.13     0.00     0.03    30250  Hash#[]
  3.62      0.09     0.09     0.00     0.00        5  PGconn#exec

A couple of things jump out. First, using Benchmark#realtime is fairly time consuming.
Second, logging also takes its toll as seen in the times for IO#write (remember we have
logging set to :debug).

However, the #mtime call looks suspicious. Some more digging through Rails shows all 403 calls come from
ActionView::Base#compile_template?. Let’s take a look:

def compile_template?(template, file_name, local_assigns)
  method_key    = file_name || template
  render_symbol = @@method_names[method_key]

  if @@compile_time[render_symbol] &&
     supports_local_assigns?(render_symbol, local_assigns)
    if file_name && !@@cache_template_loading
      @@compile_time[render_symbol] < File.mtime(file_name) ||
        (File.symlink?(file_name) &&
        (@@compile_time[render_symbol] < File.lstat(file_name).mtime))
    end
  else
    true
  end
end

Remember I mentioned that it takes roughly 500 partial calls to generate the output? This is where it bites us –
by default Rails checks the timestamp of cached templates before running them. In a production environment that
is totally unnecessary. The solution is simple – just update your production.rb file like this:

# Don't check view timestamps!
config.action_view.cache_template_loading = true

And while we are at it, we’ll change the debug level to :info to reduce its impact on the results.

Lesson – Set cache_template_loading to true

Gain – 5%

Don’t use url_for

We’ve managed to reduce the request time from 7.8 to 2.5 seconds. Not bad. But we still have more work to do.


Total: 2.531

 %self     total     self     wait    child    calls  name
 13.59      1.91     0.34     0.00     1.56       45  <Module::Benchmark>#
                                                      realtime
 12.25      0.31     0.31     0.00     0.00      409  IO#write
  4.98      0.81     0.13     0.00     0.69      361  <Module::Benchmark>#
                                                      realtime-1
  4.94      0.13     0.13     0.00     0.00       40  ActiveRecord::Associations::
                                                      HasManyThroughAssociation#
                                                      construct_sql
  3.71      0.09     0.09     0.00     0.00        5  PGconn#exec
  3.04      0.08     0.08     0.00     0.00    29847  Hash#[]
  3.00      0.11     0.08     0.00     0.03     2097  Hash#each
  1.86      0.28     0.05     0.00     0.23      322  ActionController::Routing::
                                                      RouteSet#generate

Of particular interest is ActionController::Routing::RouteSet#generate. Although its self time is only 0.05 seconds, its total time, including children, is 0.28 seconds. Thus creating 322 URLs takes 11% of the request time.

People have previously mentioned how slow url_for is – well, it’s true. Instead of using url_for, just create your URLs manually using string mashing. For example, if you want an absolute URI, do this:

"#{request.protocol}#{request.host_with_port}/controller/#{map.id}"

Lesson – Don’t use url_for or link_to

Gain – 11%

Don’t let Rails parse timestamps

After removing url_for, we’ve smashed the 2 second barrier:

Total: 1.391

 %self     total     self     wait    child    calls  name
  9.06      0.13     0.13     0.00     0.00      726  String#sub!
  8.99      0.13     0.13     0.00     0.00       40  ActionView::Base::CompiledTemplates#
	                                                    _run_ratom_47app47views47geometry47_geometry46ratom
  6.69      0.09     0.09     0.00     0.00        5  PGconn#exec
  5.54      0.08     0.08     0.00     0.00      741  Item#id
  5.54      0.30     0.08     0.00     0.22      121  <Class::Date>#_parse

Next up is <Class::Date>#_parse, whose total time, including children, is a whopping 21% of the request. Ouch.
If you take a look at the code (it’s in the standard Ruby library), you can see it’s quite complicated,
since it tries to parse a variety of date formats. In addition, it makes use of rational numbers, which are none too fast.

Taking a look at a graph profile, we see that <Class::Date>#_parse is called from <Class::ActiveRecord::ConnectionAdapters::Column>#string_to_time.
Thus Rails uses it to convert timestamps received from the database into Ruby objects.

As I detailed in a recent post, the problem is that Ruby’s DateTime implementation is extremely slow compared to its Time implementation. The solution is to avoid DateTime entirely and use custom time parsing code. Fortunately, our database can output times in ISO 8601 format, which Time can quickly parse.
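
Both parsers ship with the standard library, so the difference is easy to try for yourself (a quick sketch, nothing MapBuzz-specific):

```ruby
require 'time'   # adds Time.iso8601
require 'date'   # DateTime, shown only for comparison

stamp = '2007-05-01T12:34:56Z'

# Time.iso8601 matches the ISO 8601 format directly -- cheap
fast = Time.iso8601(stamp)

# DateTime.parse goes through Date._parse and rational arithmetic
# to handle many formats -- expensive
slow = DateTime.parse(stamp)

fast.year   # => 2007
```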

So, any place we access times, we simply roll our own attribute readers and writers like this (note the readers use read_attribute_before_type_cast, not read_attribute):

def created_on
  @created_on ||= Time.iso8601(read_attribute_before_type_cast('created_on'))
end

def created_on=(value)
  write_attribute(:created_on, value)
  @created_on = nil
end

def updated_on
  @updated_on ||= Time.iso8601(read_attribute_before_type_cast('updated_on'))
end

Lesson – Don’t let Rails parse timestamps – do it yourself

Gain – 20%

Don’t symbolize keys

We’ve almost hit the 1 second barrier:

Total: 1.156

 %self     total     self     wait    child    calls  name
  8.13      0.11     0.09     0.00     0.01      642  Kernel#send-3
  8.13      0.09     0.09     0.00     0.00        5  PGconn#exec
  6.75      0.08     0.08     0.00     0.00     7497  String#concat
  4.15      0.05     0.05     0.00     0.00      817  Hash#each
  4.07      0.05     0.05     0.00     0.00       40  ActiveRecord::Associations::
                                                      HasAndBelongsToManyAssociation#
                                                      construct_sql
  2.77      0.03     0.03     0.00     0.00     6360  String#match
  2.77      0.06     0.03     0.00     0.03       47  Array#each_index
  2.77      0.03     0.03     0.00     0.00      280  IconType#size
  2.60      0.03     0.03     0.00     0.00     4309  Array#[]
		

At this point, most of the easy wins are gone. However, there is at least one left. Although it’s not shown in the flat profile, the graph profile shows that the method ActiveSupport::CoreExtensions::Hash::Keys#symbolize_keys takes 2.77% of the time including its children. Additional rummaging through the call graph shows that the calls come from ActionView::Base#compile_and_render_template:

def compile_and_render_template(extension, template = nil,
                                file_path = nil, local_assigns = {})
  # convert string keys to symbols if requested
  local_assigns = local_assigns.symbolize_keys if
    @@local_assigns_support_string_keys

A bit of research shows that local_assigns_support_string_keys is deprecated and slated for removal from Rails! Sweet – that means we can save an additional 3% simply by adding this line to environment.rb:

config.action_view.local_assigns_support_string_keys = false

If only all optimizations were so simple.
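
For reference, here is a plain-Ruby equivalent of what symbolize_keys does (a sketch, not the ActiveSupport source):

```ruby
# Rebuild a hash with its string keys converted to symbols.
# Cheap in isolation, but Rails ran it for every one of the
# ~500 partial renders in this request.
def symbolize_keys(hash)
  hash.inject({}) do |result, (key, value)|
    result[key.respond_to?(:to_sym) ? key.to_sym : key] = value
    result
  end
end

symbolize_keys('feature' => 'f1', 'user' => 'u1')
# => { :feature => 'f1', :user => 'u1' }
```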

Lesson – Don’t symbolize keys

Gain – 3%

Wrapping Up

So what’s our final time? Let’s see:

Total: 0.922

 %self     total     self     wait    child    calls  name
 10.30      0.10     0.10     0.00     0.00        5  PGconn#exec
  3.47      0.17     0.03     0.00     0.14      312  Array#each-2
  3.47      0.03     0.03     0.00     0.00      212  ActiveRecord::ConnectionAdapters::
                                                      Quoting#quote
  3.47      0.03     0.03     0.00     0.00    19455  Hash#[]
  3.47      0.36     0.03     0.00     0.33      360  ActionView::Base#render-2
  3.36      0.06     0.03     0.00     0.03       47  Array#each_index
  3.36      0.03     0.03     0.00     0.00      299  ActiveRecord::Associations::
                                                      AssociationProxy#initialize
  3.36      0.14     0.03     0.00     0.11      705  ActiveRecord::Associations::
                                                      AssociationProxy#method_missing

Not bad – we reduced the request time from 7.8 to 0.92 seconds with a day’s worth of work. Obviously there is still much to be done – we need to run the code through our performance monitoring suite and make sure that the average time holds across a number of requests and that the standard deviation is in line. But we’ve at least made a good start.
