Last week I showed how to profile a Rails application using ruby-prof. The post was driven by our desire to improve MapBuzz‘s map rendering speed. Our nightly log analysis showed that rendering speed has significantly degraded over the last few weeks. Since the whole point of MapBuzz is to share maps with others, we needed to find out what was taking so long.
A day of profiling and we had our answer. With a few simple changes we were able to reduce rendering time from 7.8 seconds to 0.92 seconds (note these times are gathered while profiling the application, so the real times are 2 to 3 times less).
What we found should be generally applicable to all Rails applications, so I’d like to share them:
- Don’t use ActiveRecord#attributes
- Get your :includes right
- Don’t check template timestamps ( cache_template_loading = true)
- Don’t use url_for
- Don’t let Rails parse timestamps
- Don’t symbolize keys (local_assigns_support_string_keys = false)
If you interested in the nitty-gritty details of each change, keep reading!
A Bit of Background
MapBuzz renders maps in a two step process. First, the html page is created and shown to the user. Second, the browser makes an Ajax request to get the features that should be shown on the map. The reason for having a two step process is that it allows users to zoom and pan around the map, without having to reload the whole page.
Log analysis revealed that the Ajax request was taking a long time – 1 to 2 seconds, with a big standard deviation. Our goal is to reduce the average time to less than half a second and to significantly reduce the standard deviation.
If you look at the request URI, you’ll see its an Atom feed of features and not an HTML page. That has performance implications.
First, Atom feeds tend to have lots of links. In our case, each entry has 10 to 15 links. We limit the number of entries to 50, so that means there are roughly 600 link per page.
Second, we generate Atom feeds in a modular fashion – each entry is created from 10 partials. There is a partial for the user, one for ratings, one for content, one for icons, one for geometries, etc. Why did we do it this way? Simple – to keep things DRY. Having a set of partials creates an extremely modular system – we can mix and match them as needed depending on the context of the request.
Setup
Testing was done with two machines. The first was my laptop, which is a Dell Latitude D610 running windows, Ruby 1.8.4, and Rails 1.2.3. The second is our staging server, which is fairly high end Dell Desktop running Fedora Core 6 with a copy of our production database (which is about 10Gb). Profiling was done with ruby-prof 0.5.1 (of course), using the Rails plugin I described previously. Note that the logging level was set to :debug, which will show up in the test results.
Baseline
To start we measured the baseline performance of our test setup. It was ugly indeed:
Total: 7.828 %self total self wait child calls name 24.11 1.95 1.89 0.00 0.06 3605 Kernel#clone 13.15 1.04 1.03 0.02 0.00 569 IO#write 7.78 4.13 0.61 0.00 3.52 441 <Module::Benchmark># realtime-1 7.41 0.58 0.58 0.00 0.00 125 PGconn#exec
One bit of solace is that profiling an application will slow it down 2 to 3 times, but that sill leaves us with a 4 seconds per request. So let’s start digging through the results.
Don’t Use ActiveRecord:attributes
The first thing that jumps out is the huge amount of time Kernel#clone takes. A look at the call graph shows that the caller is clone_attribute_value.
And a bit more digging reveals this custom code in our application:
def rating # Convert cached rating from BigDecimal to float, # otherwise strange rounding errors happen attributes['rating'].to_f end
The problem is that when you call ActiveRecord#attributes it returns a copy of the attributes, thus generating all the clone calls. Uuggh. I think this is a result of ActiveRecord’s flawed attribute implementation, but that is for another blog.
What we want is access to the untyped, original value of the attribute. In theory you are supposed to use the auto generated method, rating_before_type_cast, for this. This has the advantage of skipping the clone call, but it relies on method_missing, which has some overhead (we did not measure how much). Or, you could use read_attribute directly, which also skips the clone call. This is usually your best choice.
For performance critical code, you may wish to read the untyped attribute directly (see the discussion about time below for an example). That can be done using the method read_attribute_before_type_cast, except that it is private. However, with Ruby, that’s easily solved:
module ActiveRecord class Base public :read_attribute_before_type_cast end end
Lesson – Don’t use attributes
Gain – 24%
Use include
Let’s look at our updated results:
Total: 3.39 %self total self wait child calls name 16.73 0.57 0.57 0.00 0.00 569 IO#write 12.83 0.44 0.44 0.00 0.00 125 PGconn#exec 8.76 1.77 0.30 0.00 1.48 441 <Module::Benchmark># realtime-1 5.99 0.22 0.20 0.00 0.01 40 PGconn#query
What stands out are the 125 calls to PGconn#exec – each one is a separate query. Looks like we forgot to specify a table or two to be eager loaded via ActiveRecord#find’s :include option.
Note there is one downside to using :include – you lose the ability to use the :select option. For us that is important – and can be worked around by using the rails select_parser I’ve blogged about before.
Lesson – Get your :includes right
Gain – 12%
cache_template_loading=true
After fixing attributes and includes, here is our current timings:
Total: 2.594 %self total self wait child calls name 12.64 1.14 0.33 0.00 0.81 361 <Module::Benchmark># realtime-1 12.61 0.33 0.33 0.00 0.00 409 IO#write 4.90 0.13 0.13 0.00 0.00 403 <Class::File>#mtime 4.86 0.16 0.13 0.00 0.03 30250 Hash#[] 3.62 0.09 0.09 0.00 0.00 5 PGconn#exec
A couple of things jump out. First, using Benchmark#realtime is fairly time consuming.
Second, logging also takes its toll as seen in the times for IO#write (remember we have
logging set to :debug).
However, the #mtime call looks suspicious. Some more digging through Rails shows all 403 calls come from
ActionView::Base#compile_template?. Let’s take a look:
def compile_template?(template, file_name, local_assigns) method_key = file_name || template render_symbol = @@method_names[method_key] if @@compile_time[render_symbol] && supports_local_assigns?(render_symbol, local_assigns) if file_name && !@@cache_template_loading @@compile_time[render_symbol] < File.mtime(file_name) || (File.symlink?(file_name) && (@@compile_time[render_symbol] < File.lstat(file_name).mtime)) end else true end end
Remember I mentioned that it takes roughly 500 partial calls to generate the output? This is where it bites us –
by default Rails checks the timestamp of cached templates before running them. In a production environment that
is totally unnecessary. The solution is simple – just update your production.rb file like this:
# Don't check view timestamps! config.action_view.cache_template_loading = true
And while we are at it, we’ll change the debug level to :info to reduce its impact on the results.
Lesson – Set cache_template_loading to true
Gain – 5%
Don’t use url_for
We’ve managed to reduce the request time from 7.8 to 2.5 seconds. Not bad. But we still have more work to do.
Total: 2.531 %self total self wait child calls name 13.59 1.91 0.34 0.00 1.56 45 <Module::Benchmark># realtime 12.25 0.31 0.31 0.00 0.00 409 IO#write 4.98 0.81 0.13 0.00 0.69 361 <Module::Benchmark># realtime-1 4.94 0.13 0.13 0.00 0.00 40 ActiveRecord::Associations:: HasManyThroughAssociation# construct_sql 3.71 0.09 0.09 0.00 0.00 5 PGconn#exec 3.04 0.08 0.08 0.00 0.00 29847 Hash#[] 3.00 0.11 0.08 0.00 0.03 2097 Hash#each 1.86 0.28 0.05 0.00 0.23 322 ActionController::Routing:: RouteSet#generate
Of particular interest is the ActionController::Routing::RouteSet#generate. Although it only runs for 0.05 seconds, its total time, including children, is 0.28. Thus creating 322 urls takes 11% of the request time.
People have previously mentioned how slow url_for is – well, its true. Instead of using url_for, just create your urls manually using string mashing. For example, if you want an absolute uri, then do this:
"#{request.protocol}#{request.host_with_port}/controller/#{map.id}"
Lesson – Don’t use url_for or link_to
Gain – 11%
Don’t let Rails parse timestamps
After removing url_for, we’ve smashed the 2 second barrier:
Total: 1.391 %self total self wait child calls name 9.06 0.13 0.13 0.00 0.00 726 String#sub! 8.99 0.13 0.13 0.00 0.00 40 ActionView::Base::CompiledTemplates# _run_ratom_47app47views47geometry47_geometry46ratom 6.69 0.09 0.09 0.00 0.00 5 PGconn#exec 5.54 0.08 0.08 0.00 0.00 741 Item#id 5.54 0.30 0.08 0.00 0.22 121 <Class::Date>#_parse
Next up is <Class::Date>#_parse, whose total time, including children, is a whopping 21% of the time. Ouch.
If you take a look at the code (its in the standard ruby library), you can see its quite complicated
since it tries to parse a variety of date formats. In addition, it makes use of rational numbers, which are none too fast.
Taking a look at a graph profile, we see that <Class::Date>#_parse is called from <Class::ActiveRecord::ConnectionAdapters::Column>#string_to_time.
Thus Rails uses it to convert timestamps received from the database into Ruby objects.
As I detailed in a recent post, the problem is that Ruby’s DateTime implementation is extremely slow compared to its Time implementation. The solution is to avoid DateTime’s entirely and use custom time parsing code. Fortunately, our database can output time in ISO 861 format, which Time can quickly parse.
So, any place we access times, we simply roll our own attribute readers and writers like this (note that we don’t use read_attribute!).
def created_on @created_on || Time.iso8601(read_attribute_before_type_cast('created_on')) end def created_on=(value) write_attribute(:created_on, value) @created_on = nil end def updated_on @updated_on || Time.iso8601(read_attribute_before_type_cast('created_on')) end
Lesson – Don’t let Rails parse timestamps – do it yourself
Gain – 20%
Don’t symbolize keys
We’ve almost hit the 1 second barrier:
Total: 1.156 %self total self wait child calls name 8.13 0.11 0.09 0.00 0.01 642 Kernel#send-3 8.13 0.09 0.09 0.00 0.00 5 PGconn#exec 6.75 0.08 0.08 0.00 0.00 7497 String#concat 4.15 0.05 0.05 0.00 0.00 817 Hash#each 4.07 0.05 0.05 0.00 0.00 40 ActiveRecord::Associations:: HasAndBelongsToManyAssociation# construct_sql 2.77 0.03 0.03 0.00 0.00 6360 String#match 2.77 0.06 0.03 0.00 0.03 47 Array#each_index 2.77 0.03 0.03 0.00 0.00 280 IconType#size 2.60 0.03 0.03 0.00 0.00 4309 Array#[]
At this point, most of the easy wins are gone. However, there is at least one left. Although its not shown in the flat profile, the graph profile shows that the method ActiveSupport::CoreExtensions::Hash::Keys#symbolize_keys takes 2.77% of the time including its children. Additional rummaging through the call graph shows that the calls come from ActionView#compile_and_render_template:
def compile_and_render_template(extension, template = nil, file_path = nil, local_assigns = {}) # convert string keys to symbols if requested local_assigns = local_assigns.symbolize_keys if @@local_assigns_support_string_keys
A bit of research shows that local_assigns_support_string_keys is deprecated and slated for removal from Rails! Sweet – that means we can save an additional 3% simply by adding this line to environment.rb:
config.action_view.local_assigns_support_string_keys = false
If only all optimizations were so simple.
Lesson – Don’t symbolize keys
Gain – 3%
Wrapping Up
So what’s our final time? Let’s see:
Total: 0.922 %self total self wait child calls name 10.30 0.10 0.10 0.00 0.00 5 PGconn#exec 3.47 0.17 0.03 0.00 0.14 312 Array#each-2 3.47 0.03 0.03 0.00 0.00 212 ActiveRecord::ConnectionAdapters:: Quoting#quote 3.47 0.03 0.03 0.00 0.00 19455 Hash#[] 3.47 0.36 0.03 0.00 0.33 360 ActionView::Base#render-2 3.36 0.06 0.03 0.00 0.03 47 Array#each_index 3.36 0.03 0.03 0.00 0.00 299 ActiveRecord::Associations:: AssociationProxy#initialize 3.36 0.14 0.03 0.00 0.11 705 ActiveRecord::Associations:: AssociationProxy#method_missing
Not bad – we reduced the request time from 7.8 to 0.92 seconds with a day’s worth of work. Obviously there is still much to be done – we need to run the code through our performance monitoring suite and make sure that the average time holds across a number of requests and that the standard deviation is in line. But we’ve at least made a good start.