Sooner or later, most large websites have to bite the bullet and implement some form of asynchronous processing to deal with long-running tasks. For example, with MapBuzz we have several long-running tasks:
- Importing data
- Batch geocoding
- Emailing event notifications to users
If you’re developing a Facebook application, moving long-running tasks to a background process or thread is critical since Facebook times out requests to your server within ten to twelve seconds.
So Many Choices
Having decided you need asynchronous processing, the next question is how to do it. And this is where things get complicated – there are a myriad of approaches, each applicable for certain problem domains. Let’s look at some possibilities:
- Spawn/Fork – Create processes on demand to perform background tasks
- Distributed Objects – Use a distributed object protocol (RMI, CORBA, DCOM, DRb, etc.) to communicate with another process that performs background tasks
- Job Queue – Persist tasks in shared files or databases and execute them using background processes
- Message Processing – Send messages to another process via a message bus
In the Ruby world, there are a number of implementations for each approach – a few examples include:
- Spawn/Fork – Spawn
- Distributed Objects – BackgroundDRb
- Job Queue – Delayed Job (DJ), BackgroundJob (BJ), BackgroundFu, Starling, Beanstalkd, Sparrow
- Message Processing – AP4R
Not surprisingly, most of these solutions are designed to work with Rails – after all, there’s no need to speed up processing when it’s just another machine on the other end rather than an impatient human.
Selecting the best one for your application depends entirely on your use cases. Having said that, it’s still possible to reach some broad conclusions. Spawning or forking processes makes it impossible to offload processing to additional machines, so you’ll quickly run into scalability limits. Distributed objects solve that problem, but experience has shown distributed object protocols are very brittle because they bind clients and servers so tightly together – thus I would never use them. Job queues are more reliable because tasks are represented in a standard format (usually text based, such as XML) that is persisted to files or database tables. Message queues are similar, but add significantly more functionality, such as message routing, transformation, prioritization, etc.
For many websites, a job queue is the best solution. Job queues are relatively lightweight and let you distribute processing across multiple machines. However, the Ruby-based solutions listed above require installing and managing additional software as well as writing the job-processing code itself. They also make it more difficult to develop and test software, since you now have to debug multiple processes at once.
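To make that trade-off concrete, here is a minimal sketch of what a database-backed job queue boils down to. The Job model, its columns, and the polling loop are assumptions for illustration – this is not code from any of the libraries above:

class Job < ActiveRecord::Base
  # Assumed columns: handler (text), run_at (datetime), completed_at (datetime)
end

# A separate background process polls the table and runs pending jobs.
loop do
  job = Job.find(:first,
                 :conditions => ['completed_at IS NULL AND run_at <= ?', Time.now])
  if job
    YAML.load(job.handler).perform                # handler is a serialized task object
    job.update_attribute(:completed_at, Time.now)
  else
    sleep 5                                       # nothing pending, poll again shortly
  end
end

Even this toy version implies extra moving parts: a jobs table, a serialization format, and a long-running daemon to start, monitor, and restart.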
A Simple HTTP Based Solution
So what’s a simpler solution? Reuse what you already have. Most Rails applications are divided into multiple instances, distributed across one or more machines, each embedding an HTTP server (Mongrel, Thin, Ebb) to handle requests. Thus we already have our background processes and an easy way to communicate with them – HTTP (of course!). And if you’re using Mongrel or a proxy server (Pound, Lighttpd, Nginx, Apache, etc.), then you also get a built-in request queue.
In other words:
simple background queue = HTTP + Load Balancer + Rails instances
Besides simplicity, a big advantage to this approach is that background tasks run within the Rails environment, giving you access to ActiveRecord, your models, etc.
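Stripped of any plugin, the core idea is just an ordinary HTTP POST from the foreground instance to a background one. Here is a rough sketch using Net::HTTP – the host, port, and path are made up, and in practice the request would go through your load balancer:

require 'net/http'

# Hand the work to a background Rails instance by POSTing the task parameters.
Net::HTTP.start('127.0.0.1', 8500) do |http|
  http.post('/import/process', 'file_name=geodata.csv&map_id=42')
end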
Worker Plugin
Thus enters a new Rails plugin called worker (yeah, the name leaves something to be desired). Let’s look at an example:
class ImportController < ApplicationController
  # Add support for using workers
  include Worker

  # Incoming requests are handled by this method
  resource :Geodata do
    def post
      read_file(params)
    end
  end

  # This method handles requests in a worker process
  resource :process do
    def post
    end
  end

  private

  def read_file(params)
    worker_params = {:file_name  => file_name,
                     :tags       => params['tags'],
                     :controller => 'import',
                     :resource   => 'process',
                     :map_id     => @map.id}

    # Create worker request
    create_worker.post(worker_params)
  end
end
So how does this work? A user POSTs a file to http://myserver/import/geodata. That method does various checks (deleted for brevity) and then sends a request to http://myserver/import/process, which runs in a separate Rails instance. Although this controller delegates back to itself (in a separate process), it could call any controller it wishes.
The worker plugin will pass a session key, if available, to the background process. That turns out to be very useful: if you’re storing session data in memcached or a database, the foreground and background processes can share session information. That means you can use the same authentication and authorization mechanisms in the background process as you do in the foreground process.
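For example, assuming a Rails 2.x-style memcached session store (the memcached servers themselves are configured separately), the relevant line in your environment file would look something like this and would need to be the same for both the user-facing and worker instances:

# Both the user-facing and worker instances use the same memcached-backed
# session store, so a session key passed along with a worker request
# resolves to the same session data in either process.
config.action_controller.session_store = :mem_cache_store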
In addition, all worker requests are signed with an MD5 hash to verify that no one in the middle is spoofing requests.
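The plugin’s exact signing scheme isn’t shown here, but the general idea is a shared secret hashed together with the request parameters – something along these lines, where the shared secret and the parameter serialization are illustrative assumptions:

require 'digest/md5'

SHARED_SECRET = 'known-only-to-your-rails-instances'

# Sender: compute a signature over the serialized parameters and include
# it with the worker request.
def sign(params)
  serialized = params.sort_by { |key, value| key.to_s }.flatten.join('|')
  Digest::MD5.hexdigest(serialized + SHARED_SECRET)
end

# Receiver: recompute the signature and reject the request on a mismatch.
def verified?(params, signature)
  sign(params) == signature
end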
Environments and Configuration
By default, Rails applications use three environments – test, development, and production. Each environment is quite different, which affects how you want to use worker processes. To deal with these differences, the worker plugin uses a strategy pattern for invoking requests.
In a test environment, there are no background Rails instances running. More importantly, you need to be able to verify that worker requests complete correctly. Thus you want worker requests to happen synchronously and within the test process. This is the Worker::Controller strategy, which works similarly to Rails’ render_component functionality. To set this up, add the following lines to your test environment file:
config.after_initialize do
  Worker::Config.strategy = Worker::Controller
end
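Because the worker request runs synchronously inside the test process, a plain functional test can assert on its results right away. A hypothetical sketch – the test class, action name, parameters, and assertions are invented for illustration:

class ImportControllerTest < ActionController::TestCase
  def test_geodata_post_runs_worker_synchronously
    post :post, :file_name => 'geodata.csv', :tags => 'parks'
    # With the Worker::Controller strategy the worker request has already
    # completed by the time the action returns, so its side effects
    # (imported rows, generated records, etc.) can be asserted here.
    assert_response :success
  end
end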
In development mode, you have one Rails instance running. In this case, you want worker requests to happen asynchronously but within the single development process. This is the Worker::HttpAsync strategy. To set this up, add the following lines to your development environment file:
config.after_initialize do
  Worker::Config.strategy = Worker::HttpAsync
end
Note this assumes that your development process is running on the standard port 3000.
Finally, in production mode, you’ll have multiple Rails instances running. To be on the safe side, some of these instances should be dedicated solely to fulfilling worker requests. The easiest way to do this is to bind them to an internally accessible address and port, say 8500, that outsiders cannot reach. Thus the port, and perhaps the IP address, of the user-facing Rails instances will differ from those of the worker instances. To set this up, add an additional line to your config file that globally sets the host and port number of the workers. Note this assumes that there is either one worker or a pool of workers at the given host and port.
config.after_initialize do
  Worker::Config.strategy = Worker::HttpAsync
  Worker::HttpAsync.options = {:host => 'some_other_host', :port => '8500'}
end
The Code
We’re releasing the worker plugin under an MIT license. If there is sufficient interest, we’re happy to set up a RubyForge project.