Posted by Charlie
Wed, 27 Aug 2008 18:55:00 GMT
One of the projects we've been working on for MapBuzz the last few weeks is building an interactive map that shows all the events going on in Denver during the Democratic National Convention. Users can pick the event type and date they are interested in, and the map refreshes with icons for relevant events. By clicking on a given event, the user can see exactly where and when the event is taking place. I think the map turned out pretty well - it's a good example of a mashup pulling data from different sources. In this case, base maps from Google, event information from Zvents, and all rendering/styling/page from MapBuzz.
It did clarify my thinking on a few points. First, Rails' built-in page caching is really limited - it ignores query parameters and only works for HTML. So we had to hack around that; more info coming in a later post. Second, for building mashups XML really is superior to JSON simply because it supports namespaces (for all their pain points, namespaces really do facilitate merging data from multiple sources). Third, when you need it, XSLT is invaluable. Zvents serves its data using RSS, but our client only supports Atom. The simple solution was a quick XSL transformation to convert Zvents' RSS feed over to Atom using libxslt (and thus MapBuzz's contribution back to the Ruby community to get the libxml and libxslt bindings back into good shape).
Posted in Design | 2 comments | no trackbacks
Posted by Charlie
Mon, 23 Jun 2008 01:51:00 GMT
The web is full of articles discussing how to render transparent 24-bit png images in IE6 - so why write another one?
Three reasons. First, although the long-hoped-for demise of IE6 is finally showing some progress, IE6 still has a 20% to 40% market share, which is still more than Firefox. Second, none of the existing articles talk about the terrible performance degradation caused by the most common solutions proposed for displaying png files, or how to avoid it. Third, I discovered an alternate solution using VML that I've never seen documented.
AlphaImageLoader is Slow
Let's start with the performance issues. Most websites recommend enabling transparent png's in IE6 by running javascript code when the page loads. The javascript finds all png images on a page and applies Microsoft's proprietary AlphaImageLoader filter.
What almost no article mentions (the one I linked to above being an exception) is how badly this degrades page load performance. But don't take my word for it - I've set up an example page that loads sea-level rise data from Peter Black's excellent climate atlas blog.
At the top of the page are several buttons - click the far left one, titled Slow, to see how the typical png solution performs. Go ahead, try it now (using IE6 of course). Notice how the browser freezes for over 10 seconds and doesn't display any images until the very end? Not very good, is it?
But if you view the same files using Google maps, the performance is much better. And you can also see each tile as it loads, instead of waiting for the end. So how did Google achieve this sleight of hand? A bit of digging shows the trick - instead of fixing all the images at once, Google fixes them one at a time. This is done by attaching an onload event handler to each image that needs to be fixed. When the image is loaded, the onload event applies the AlphaImageLoader filter. This avoids IE freezing. And to avoid the annoying flashing caused by applying the filter, images start off "hidden" and are only made visible when their onload events are fired. Clever, isn't it?
Now go back to the example page and click the second button, titled Fast. The difference is amazing - like night and day. As an extra bonus, the onload event for image elements that point to invalid or non-existent images never fires, meaning that they are never made visible. That neat trick avoids IE displaying an annoying red x for invalid images.
A New Approach - VML
A couple of years ago I discovered an entirely new way of displaying 24-bit transparent png images in IE - use VML. Somehow I've never managed to put it down on paper (wrong metaphor I know, but humor me). I figure I'd better do it now before this tidbit of knowledge becomes irrelevant.
It's little known that VML supports not only vector graphics, but also raster images via its image element. And I've never seen it mentioned that the vml:image element supports transparent 24-bit pngs.
To see VML in action, go back to the example page and click the third button, titled VML. The first time you click the button, the performance won't be very good since all the images are first downloaded. But once the page is loaded, click the Clear button and then click the VML button. From my testing, the performance is noticeably better than using the AlphaImageLoader.
There are several things to note about the VML code that took me hours to figure out:
- The setupVml function enables VML via Javascript - its two lines of code are barely documented on the web and thus took hours of fiddling to get working
- You must set the width and height of the VML images, otherwise nothing is displayed
- You must set the coordsize of the VML images - otherwise they will randomly be one pixel too short or too wide
Sadly, the VML solution suffers from three problems that make it unsuitable most of the time. First, notice the images are a bit fuzzy when displayed with VML. I haven't a clue as to why.
Second, images that don't load are shown with IE's annoying red x mentioned above. The problem is that VML images don't seem to support the onload event (or onreadystate), so there is no way to start them off as hidden and then make them visible once the image has loaded.
Third, IE6 doesn't cache VML images across page loads. To see this, reload the example page and press the VML button again. Notice the long delay? If you watch IE's http traffic (say using Fiddler), you'll see that IE6 requests each image again. It does not do that for html image elements, which you can verify by running the same experiment but clicking the Fast button instead.
Together, these three issues make the VML solution inferior to the AlphaImageLoader solution, but I thought I'd write it down in case someone ever needs to know about it.
Posted in Design, Web | 1 comment | no trackbacks
Posted by Charlie
Tue, 17 Jun 2008 05:55:00 GMT
Sooner or later, for most large websites you have to bite the bullet and implement some form of asynchronous processing to deal with long-running tasks. For example, with MapBuzz we have several long-running tasks:
- Importing data
- Batch geocoding
- Emailing event notifications to users
If you're developing a Facebook application, moving long-running tasks to a background process or thread is critical since Facebook times out requests to your server within ten to twelve seconds.
So Many Choices
Having decided you need asynchronous processing, the next question is how to do it. And this is where things get complicated - there are a myriad of approaches, each applicable for certain problem domains. Let's look at some possibilities:
- Spawn/Fork - Create processes on demand to perform background tasks
- Distributed Objects - Use a distributed object protocol (RMI, CORBA, DCOM, DRb, etc.) to communicate with another process to perform background tasks
- Job Queue - Persist tasks in shared files or databases and execute them using background processes
- Message Processing - Send messages to another process via a message bus
In the Ruby world, there are a number of implementations for each approach - a few examples include:
Not surprisingly, most of these solutions are designed to work with Rails, since there's no need to speed up processing if it's just another machine on the other end instead of an impatient human.
Selecting the best one for your application is totally dependent on your use cases. Having said that, it's still possible to reach some broad conclusions. Spawning or forking processes makes it impossible to offload processing to additional machines, so you'll quickly run into scalability limits. Distributed objects solve that problem, but experience has shown distributed object protocols are very brittle because they bind clients and servers so tightly together - thus I would never use them. Job queues are more reliable because tasks are represented in a standard format (usually text based, such as XML) that is persisted to files or database tables. Message queues are similar, but add significantly more functionality such as message routing, transformation, prioritization, etc.
For many websites, a job queue is the best solution. Job queues are relatively lightweight and let you distribute processing across multiple machines. However, the Ruby-based solutions listed above require installing and managing additional software as well as writing the job processing code itself. They also make it more difficult to develop and test software since you now have to debug multiple processes at once.
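To make the job queue idea concrete, here is a minimal in-process sketch built on Ruby's standard Queue and a single worker thread. It is not one of the libraries discussed here, and the class and task names are made up for illustration. A real job queue would persist tasks to files or a database table so that other machines could pick them up.

```ruby
# A toy, in-memory job queue: tasks are plain hashes pushed onto a
# thread-safe Queue and executed by one background worker thread.
class TinyJobQueue
  attr_reader :results

  def initialize
    @jobs = Queue.new
    @results = Queue.new
    @worker = Thread.new { work }
  end

  # Enqueue a task; returns immediately so the caller is not blocked.
  def enqueue(task)
    @jobs << task
  end

  # Ask the worker to stop once the queued tasks are done, then wait for it.
  def shutdown
    @jobs << :stop
    @worker.join
  end

  private

  # Pop tasks in order until the :stop marker arrives.
  def work
    while (task = @jobs.pop) != :stop
      @results << "#{task[:type]} finished for #{task[:target]}"
    end
  end
end

queue = TinyJobQueue.new
queue.enqueue(type: 'geocode', target: 'batch 1')
queue.enqueue(type: 'email', target: 'event 7')
queue.shutdown
```

Even in this toy version you can see the debugging cost mentioned above: the work happens on another thread, so failures surface away from the code that queued the task.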
A Simple HTTP Based Solution
So what's a simpler solution? Reuse what you already have. Most Rails applications are divided into multiple instances, distributed across one or more machines, that embed an HTTP server (mongrel, thin, ebb) for requests. Thus we already have our background processes and an easy way to communicate with them - HTTP (of course!). And if you're using mongrel or a proxy server (Pound, Lighttpd, Nginx, Apache, etc.), then you also get a built-in request queue.
In other words:
simple background queue = HTTP + Load Balancer + Rails instances
Besides simplicity, a big advantage to this approach is that background tasks run within the Rails environment, giving you access to ActiveRecord, your models, etc.
Worker Plugin
Thus enters a new Rails plugin called worker (yeah the name leaves something to be desired). Let's look at an example:
class ImportController < ApplicationController
  # Add support for using workers
  include Worker

  # Incoming requests are handled by this method
  resource :Geodata do
    def post
      read_file(params)
    end
  end

  # This method handles requests in a worker process
  resource :process do
    def post
    end
  end

  private

  def read_file(params)
    worker_params = {:file_name => file_name,
                     :tags => params['tags'],
                     :controller => 'import',
                     :resource => 'process',
                     :map_id => @map.id}
    # Create worker request
    create_worker.post(worker_params)
  end
end
So how does this work? A user POSTs a file to http://myserver/import/geodata. That method does various checks (deleted for brevity) and then sends a request to http://myserver/import/process which runs in a separate Rails instance. Although this controller delegates back to itself (in a separate process) it could call any controller it wishes.
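The plugin's internals are not shown here, so the following is only a rough sketch of the general idea using Ruby's standard Net::HTTP: build a POST for the worker URL and send it from a thread so the caller returns right away. The helper names, the path layout, the host and the port are all assumptions, and error handling is omitted.

```ruby
require 'net/http'

# Hypothetical helper: build the POST that hands a task to a worker instance.
# The /controller/resource path mirrors the example URLs in the text.
def build_worker_request(params)
  path = "/#{params[:controller]}/#{params[:resource]}"
  request = Net::HTTP::Post.new(path)
  form = params.reject { |key, _| [:controller, :resource].include?(key) }
  request.set_form_data(form.transform_keys(&:to_s).transform_values(&:to_s))
  request
end

# Hypothetical helper: send the request from a thread so the caller does not
# wait for the worker to finish. Host and port are assumed values.
def post_async(request, host: 'localhost', port: 3000)
  Thread.new do
    Net::HTTP.start(host, port) { |http| http.request(request) }
  end
end

request = build_worker_request(controller: 'import', resource: 'process',
                               file_name: 'parks.kml', map_id: 42)
```

Calling post_async(request) would then fire the request at whichever Rails instance the load balancer picks.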
The worker plugin will pass a session key, if available, to the background process. That turns out to be very useful since it allows sharing session information between the foreground process and background process if you're storing session information in memcached or a backing database. That means you can use the same authentication and authorization mechanisms in the background process as you do in the foreground process.
In addition, all worker requests are signed with an MD5 hash to verify that no one in the middle is spoofing requests.
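The exact signing scheme is not described, so here is one way such a signature could work, sketched with Ruby's standard Digest::MD5. The shared secret, the canonical string format and the method names are assumptions rather than the plugin's actual behaviour. For new code an HMAC would be the more robust choice than a bare MD5 of secret plus payload.

```ruby
require 'digest/md5'

SHARED_SECRET = 'change-me' # hypothetical secret known to both processes

# Build a canonical string from the sorted parameters and hash it together
# with the secret.
def sign(params, secret = SHARED_SECRET)
  canonical = params.sort_by { |key, _| key.to_s }
                    .map { |key, value| "#{key}=#{value}" }
                    .join('&')
  Digest::MD5.hexdigest("#{secret}:#{canonical}")
end

# The worker recomputes the signature and compares it with the one it received.
def valid_signature?(params, signature, secret = SHARED_SECRET)
  sign(params, secret) == signature
end

params = { 'map_id' => '42', 'tags' => 'parks' }
signature = sign(params)
```

A request whose parameters were altered in transit no longer matches the signature, so the worker can reject it.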
Environments and Configuration
By default, Rails applications use three environments - test, development and production. Each environment is quite different, which affects how you want to use worker processes. To deal with these differences, the worker plugin uses a strategy pattern to invoke requests.
In a test environment, there are no Rails instances running in the background. More importantly, you need to be able to check that worker requests complete correctly. Thus you want worker requests to happen synchronously and within the test process. This is the Worker::Controller strategy, and it works similarly to Rails' render_component functionality. To set this up, add the following lines to your test environment file:
config.after_initialize do
  Worker::Config.strategy = Worker::Controller
end
In development mode, you have one Rails instance running. In this case, you want worker requests to happen asynchronously but within the single development process. This is the Worker::HttpAsync strategy. To set this up, add the following lines to your development environment file:
config.after_initialize do
  Worker::Config.strategy = Worker::HttpAsync
end
Note this assumes that your development process is running on the standard port 3000.
Finally, in production mode, you'll have multiple Rails instances running. To be on the safe side, some of these instances should be dedicated to only fulfilling worker requests. The easiest way to do this is to put them on an internally accessible port, say 8500, that outsiders cannot access. Thus the port, and perhaps IP address, of the user-facing Rails instances will be different than that of the worker instances. To set this up, add an additional line to your config file that globally sets the host and port number of workers. Note this assumes that there is either one worker or a pool of workers at the given host and port.
config.after_initialize do
  Worker::Config.strategy = Worker::HttpAsync
  Worker::HttpAsync.options = {:host => 'some_other_host',
                               :port => '8500'}
end
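For readers unfamiliar with the strategy pattern, the sketch below shows the shape of the idea in plain Ruby. DemoWorker, Inline and Threaded are hypothetical stand-ins, not the plugin's Worker::Controller or Worker::HttpAsync classes. The calling code stays the same while a single setting decides whether work runs synchronously or in the background.

```ruby
# Hypothetical stand-ins for the plugin's strategies: each one responds to
# `call` and decides how the work is actually run.
module DemoWorker
  # Run the work immediately in the calling thread (handy in tests).
  Inline = ->(work) { work.call }

  # Run the work in a background thread and return that thread.
  Threaded = ->(work) { Thread.new { work.call } }

  class << self
    attr_accessor :strategy
  end
  self.strategy = Inline

  # Callers always use post; the configured strategy does the rest.
  def self.post(&work)
    strategy.call(work)
  end
end

log = []
DemoWorker.post { log << :inline_done }

DemoWorker.strategy = DemoWorker::Threaded
DemoWorker.post { log << :threaded_done }.join
```

Swapping the strategy per environment, as the configuration snippets do, means tests can assert on results immediately while production hands the same call to another process.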
The Code
We're releasing the worker plugin under an MIT license. If there is sufficient interest, we're happy to set up a RubyForge project.
Posted in Design, Rails | 16 comments | 2 trackbacks
Posted by Charlie
Mon, 11 Feb 2008 18:11:00 GMT
If there were such a thing as the Ten Commandments of programming, code reuse would surely be included. Now you're probably thinking I've lost my mind, because any good developer knows that code reuse is a pipe dream. But that's because you are thinking on the macro level and not the micro level. At the micro level, code reuse has an altogether more pleasant acronym, DRY, or don't repeat yourself.
The tussle in the Rails community over components illustrated this tension well. On one extreme, advocates dreamed of creating plug-and-play components that could be reused across multiple applications. On the other extreme, detractors sneered at code reuse as a hopeless endeavor and vowed to strip out all traces of components from the Rails 2.0 release. Left out in the cold was the view that reusing code, specifically controllers, within a single application was a good thing.
Reusing Controllers
Rails is built using the Model-View-Controller pattern, which is designed to segment an application into controllers, models and views. Rails also adds in the concept of filters, which are pieces of code that run before and after controllers.
The Rails community encourages reuse of models, filters and views, but seems to actively discourage the reuse of controllers if the Ruby on Rails book is any indication:
When Rails was initially released, it came with a system for creating components. Unfortunately, the implementation of components left a lot to be desired: performance was poor, and there were unanticipated side effects. As a result, components are being phased out.
Instead, the common wisdom now is to synthesize component-like functionality using a combination of before filters and partials. Use the before filter to set up the context for the partial, and then render the fragment you want using a regular render :partial call.
Like much conventional wisdom, this advice is hogwash.
Filters + Partials != Controllers
The problem with just using filters and partials is that it only applies to a subset of web applications. A good example, and of course the one used in the Ruby on Rails book, is a shopping website. The focal point of most shopping websites is a shopping cart. The point of the application is to make it easy for users to add things to the cart, modify the cart and hopefully buy the contents of the cart. Since the cart plays such a crucial role, it often makes sense to have a filter that sets up the cart so each controller has easy access to it. A nice side effect of this approach is that views also have access to the cart, as described in the quote above.
But for many other types of applications, filters and partials can't make up for controllers. For example, take a look at the Boulder community on MapBuzz. The top-left side of the page is rendered by the community controller, the comments on the bottom left by a comment controller and the map listings on the right by a map browser controller. If you log in, then a couple of additional tabs are added to the page, each rendered by its own controller.
This type of composition is quite common in Web 2.0 applications. Take most social networking sites - they'll mix together news feeds, discussion boards, friends/friends lists, pictures, etc., in a variety of different ways depending on the current page.
The design problem is that any given controller can be called in two different contexts:
- When the whole page is rendered via a Browser page refresh
- When just the controller is rendered via an Ajax call
Trying to do this with just filters and helpers is a non-starter, because you end up with one big controller that needs to run different filters depending on the context of the call.
The better approach is to divide your controllers into logical units, and then have a separate page controller for the entire page. When the page is rendered, the page controller should delegate rendering the various sub-parts (for example, tabs) of the page to the appropriate controller. When just one of the sub-parts of the page needs to be rerendered, due to an Ajax call, then you directly call the appropriate controller.
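The delegation described here can be sketched in plain Ruby, leaving Rails out so the structure is easy to see. The class names are invented for illustration, and in a real application each section would be a Rails controller rather than a bare class.

```ruby
# Plain-Ruby stand-ins for Rails controllers; each renders one fragment.
class CommentsSection
  def render(params)
    "<div id=\"comments\">Comments for #{params[:community]}</div>"
  end
end

class MapListSection
  def render(params)
    "<div id=\"maps\">Maps for #{params[:community]}</div>"
  end
end

# The page controller only composes; it holds no section-specific logic.
class CommunityPage
  SECTIONS = { 'comments' => CommentsSection, 'maps' => MapListSection }.freeze

  # Full page refresh: delegate to every section in turn.
  def render(params)
    SECTIONS.values.map { |section| section.new.render(params) }.join("\n")
  end

  # Ajax call: render just the one section that changed.
  def render_section(name, params)
    SECTIONS.fetch(name).new.render(params)
  end
end

page = CommunityPage.new
full_page = page.render(community: 'Boulder')
fragment  = page.render_section('comments', community: 'Boulder')
```

Because both paths go through the same section class, the markup for a full refresh and for an Ajax update cannot drift apart.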
Performance
In Rails, a controller or view can call another controller using the much maligned render_component method. Part of the problem is the method is misnamed. It no longer has anything to do with rendering components - instead it's used to invoke another controller. Therefore, it would be more appropriately named render_controller, call_controller, invoke_controller, etc.
Assuming you agree with me so far, reading the Rails documentation for render_component will certainly give you pause:
Components should be used with care. They're significantly slower than simply splitting reusable parts into partials and conceptually more complicated. Don't use components as a way of separating concerns inside a single application. Instead, reserve components to those rare cases where you truly have reusable view and controller elements that can be employed across many applications at once.
So to repeat: Components are a special-purpose approach that can often be replaced with better use of partials and filters.
Undoubtedly this was true once upon a time. Is it still? There is one way to find out - run a test. I created a new Rails application using the built-in generators and then added the following simple code:
controller/main_controller.rb
class MainController < ApplicationController
  def get_without_controller
    a = 1
  end

  def get_with_controller
    a = 1
  end
end
controller/sidebar_controller.rb
class SidebarController < ApplicationController
  def get
    render(:partial => 'sidebar/content')
  end
end
views/main/get_without_controller.html.erb
<p>Some fun content goes here</p>
<div class="sidebar">
  <%= render(:partial => 'sidebar/content') %>
</div>
views/main/get_with_controller.html.erb
<p>Some fun content goes here</p>
<div class="sidebar">
  <%= render_component(:controller => SidebarController,
                       :action => 'get') %>
</div>
views/sidebar/_content.html.erb
<p>Hi there</p>
There are two paths through this application:
- GET '/main/get_without_controller.rb'
- GET '/main/get_with_controller.rb'
In case it's not obvious, get_without_controller uses render(:partial) to include the sidebar content while get_with_controller uses render_component. Using both benchmark and ruby-prof, I ran each method 100 times using a souped-up integration test (more about that in a future post). The results, using Rails 2.0.2 on Ruby 1.8.4 on Windows XP on a Pentium M laptop (about 3 years old), are:
| Method | 100 Requests (s) | 1 Request (s) |
| get_without_controller | 0.30 | 0.0030 |
| get_with_controller | 0.45 | 0.0045 |
So using components is 50% slower, but the overhead is a minuscule 0.0015 seconds per request. That overhead is obviously lost in a real application. Of course you have to be careful when using render_component not to try to do too much per HTTP request - but the same is true using filters and partials.
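For anyone who wants to repeat this kind of measurement, here is a small sketch using Ruby's standard Benchmark module, followed by the arithmetic behind the two figures quoted above. The time_runs helper and the block it times are placeholders, not the integration test used for the original numbers.

```ruby
require 'benchmark'

# Time `count` runs of a block and return [total_seconds, seconds_per_run].
def time_runs(count)
  total = Benchmark.realtime { count.times { yield } }
  [total, total / count]
end

# Hypothetical usage; the block stands in for issuing one request.
total, per_run = time_runs(100) { 'Hi there'.upcase }

# The arithmetic behind the reported results (seconds per 100 requests).
without_controller = 0.30
with_controller    = 0.45
overhead_per_request = ((with_controller - without_controller) / 100).round(4)
slowdown_percent = ((with_controller / without_controller - 1) * 100).round
```

Here overhead_per_request works out to 0.0015 seconds and slowdown_percent to 50, matching the figures in the text.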
DRY Up Your Controllers
In truth, render_component is the most primitive way imaginable of reusing controllers. But it does let you DRY up your Rails application by letting you create more cohesive controllers that can be reused within a single website. For most websites you won't need this functionality, but when you do, there isn't a substitute for it, and don't let anyone browbeat you into thinking there is.
Posted in Design, Rails | 13 comments | 2 trackbacks
Posted by Charlie
Thu, 08 Feb 2007 08:58:00 GMT
Web sites based on user generated content need to ...well... let users generate content (obviously you read this blog for its insights!).
Except that opens a huge can of worms. Some users will want to deface content that other users have created. Others will try to upload content that covers your web site in platypuses - just to prove they can. Still others will try to upload malicious content so they can steal other users' personal information, such as passwords.
Let's look at one small part of the problem - letting users add content to a site. Say you have a nice little form that looks something like this (note this form does not work):
Soon your first user wanders by - and is aghast at the primitiveness of your solution. How do I make things bold? Color? Links? Pictures?
At this point a bright idea pops into your head - why not write a nice, simple, little formatting language? Maybe we could make italic text like ''this''. And bold text like '''this'''. Thankfully, you soon come to your senses and realize this has all been done before - just pick your favorite from MediaWiki, Textile, Markdown, and scores of others.
Then again, why bother making users learn some strange new markup language? What about a nice editor - something like this. Then again, if you hate your users, you could try something like this. Either way, you happily code up your nice editor and wait for good things to happen.
When Bad Things Happen
Before you know it, you're running the next MySpace, and you're barely off the phone with Yahoo when Google comes calling. But then disaster strikes - one of your valued customers, Sammy, has decided to befriend every other member. Your website crashes under the load, Yahoo and Google decide you're an incompetent lout and they shower their millions on your evil competitor.
What went wrong? You got burned by a classic example of cross-site scripting. Sammy embedded javascript into the content he uploaded for his profile. When a user, Sally, looked at it, the embedded JavaScript was executed, causing Sally to unknowingly become Sammy's friend. And then when Sue looked at Sally's profile, she became Sammy's friend. And soon everyone loved Sammy.
Don't Trust Your Users
The sad moral of the story is don't trust your users. You have to sanitize everything uploaded to your web site. Your nice little editor produces HTML - which you blithely accept and store right into the database.
Although a bit ashamed at your gaffe, you figure any fool can clean up a bit of HTML. You code up a baroque regular expression, that only you can understand, and update your web site. You figure Google will be back begging tomorrow.
Tomorrow dawns, and now everyone is friends of Sally. Eeek gads - what went wrong this time? After a bit of searching you come across a sketchy looking site that has pages and pages of examples on how to defeat your primitive defences.
Despairing, you take the day off and ponder becoming a rock star, or if that fails, a real estate agent.
Friends Abound
The next day, everyone has decided to become friends with Sue. Your heart warms as you watch all your users getting to know each other. That feeling lasts through your first cup of coffee, when you get slapped with a lawsuit from some weird sounding European country you've never heard of, claiming that you've willfully exposed personal details about your users without their consent. You figure it's time to buckle down and solve this problem once and for all.
A bit of searching on Google turns up a bewildering number of choices. You find one site though, HTML Purifier, that offers a nice comparison of the options. By now it's lunch, and you call up your buddy Bill to grab some food.
XHTML Basic
You tell Bill about your woes. He calls you an idiot and says that you have to buy him lunch. He then points out that your fancy little editor generates XHTML, right? So why don't you use libxml to parse the XHTML, and tell it to validate it against XHTML Basic. Seeing the bewildered look in your eye, Bill sighs deeply, and tries again.
Look, he says, trying to validate HTML leads right into the morass of tag soup where browsers do their best to render whatever you throw at them. Although that sounds nice, in reality it leaves a vast attack space for someone to slip in malicious content, often using invalid HTML.
If you switch to XHTML, you immediately eliminate that problem. Continuing, Bill also points out that you can reuse all the work that has gone into building the XML tool chain. And even better, he explains, the W3C has kindly spent the last five years breaking XHTML into different modules. Each module is rigorously defined via a DTD.
They have also been kind enough to define XHTML Basic, which combines a subset of XHTML modules to create a simplified version of XHTML that can run on small devices such as PDAs and cellphones. It eliminates most of the nasty XHTML elements - with a tweak here and there you can get rid of them all. And while you are at it, it's probably best to eliminate all the predefined character entities (you do use UTF-8, don't you?).
So all you have to do is take the XHTML Basic DTD and validate user input against it. You say that sounds awfully difficult. Bill laughs, and quickly writes a few lines of Ruby code on a napkin:
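require 'xml/libxml_so'

# Validate user-supplied markup against a DTD
def verify(html)
  # Arguments are the DTD's public and system identifiers
  dtd = XML::Dtd.new("public", 'xhtml1-transitional.dtd')
  # Parse the string into a document, then validate that document
  document = XML::Parser.string(html).parse
  document.validate(dtd)
end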
You stare incredulously - that's it? Bill replies - not quite. DTDs can't verify attribute values, so you still have to make sure there aren't any nasty JavaScript fragments lurking in them. For example:
<img src="javascript:alert('you have been hacked')" />
He also mentions that you could have libxml validate against a Relax NG schema instead, which supports validation of attributes.
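The attribute check Bill describes could start with something like the sketch below, which only accepts URL attribute values whose scheme is on an allow-list. The method name and the list of schemes are assumptions, and this is nowhere near a complete defence. It simply shows the allow-list idea applied to values such as src and href.

```ruby
ALLOWED_SCHEMES = %w[http https mailto].freeze

# Return true when a URL attribute value is relative or uses an allowed scheme.
def safe_url?(value)
  # Strip whitespace and control characters first, since attackers use them
  # to disguise a scheme such as "java\tscript:".
  cleaned = value.to_s.gsub(/[\s\x00-\x1f]/, '')
  scheme = cleaned[/\A([a-zA-Z][a-zA-Z0-9+.\-]*):/, 1]
  return true if scheme.nil? # no scheme, so treat it as a relative URL
  ALLOWED_SCHEMES.include?(scheme.downcase)
end

hacked = safe_url?("javascript:alert('you have been hacked')")
normal = safe_url?('http://example.com/map.png')
```

Here hacked is false and normal is true. Allow-listing known-good schemes is safer than trying to block-list bad ones, for the same reason the regular expression approach failed earlier in the story.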
And voila, you've successfully plugged at least one security hole in your web site. Undoubtedly there are many more to be found.
Update - It's definitely worth reading the great comment from Ambush Commander, who is the author of HTML Purifier, an HTML sanitization library. I should have been more clear that XHTML Basic is a good starting point since it removes a large portion of XHTML that you don't want to support. However, it still includes dangerous elements like <script> and <object>, so clearly you have to remove those. Anyway, it's worth checking out his comment and my response.
Posted in Design, Ruby, Technology | 5 comments | 2 trackbacks
Posted by Charlie
Wed, 31 Jan 2007 09:16:00 GMT
One of the great mysteries of software development is what is the best way to do it. It's just like diets - every year or two the latest fad comes roaring through and everyone jumps on board. And just like diets, there is no one right way - it depends on what you are trying to accomplish and what type of team you have.
When it comes to writing software, on one extreme is the belief that every possible piece of functionality should be designed ahead of time in excruciating detail - resulting in thousands of pages of specifications that almost no one reads. On the other extreme is the belief that the best thing to do is jump right in and see how things work out. Surprisingly, sometimes either of the extremes is the right answer. However, most of the time an approach somewhere in the middle is better.
Yet no development process can account for human creativity. Just like painting, or writing music, or designing buildings, software development is a creative process. A couple of months ago we implemented one of the more complex parts of MapBuzz - who can view or edit features that are on a map. This turns out to be quite tricky - so over a month we thought long and hard to figure out the best design. Once we thought we had it, we spent a couple of weeks implementing it and thoroughly testing it.
And yet we were wrong. In a flash of insight last week, we discovered a much better solution - one that greatly simplifies the code and makes it reusable for permissions on any type of object (map, feature, comment, etc.). It took another week of work to implement the new solution - obviously begging the question why we didn't see it at first.
And this question goes right to the heart of design. Good design is based on experience - without it you don't have the requisite knowledge to solve difficult problems. But often times experience is not enough - difficult problems require that you immerse yourself in them trying any number of solutions until one clicks. Even more maddeningly, you usually then get so engrossed in a problem that you lose your ability to see the solution. At that point you have to walk away from the problem and give your subconscious a chance to put all the pieces together. If you're lucky, it will present you with a solution in the form of a Eureka moment when you least expect it. And that is exactly what happened to us.
Which leads to an obvious conclusion. The more you work, the less creative you are because you don't give your subconscious enough time to find the best solution. This is hardly a novel thought - a quick search on Google turns up a number of studies of this phenomenon. But it's worth keeping in mind - if you have a really tough problem, and you've put in the necessary hours to fully understand it, then your best bet for solving it is a day at the beach or skiing!
Posted in Design | 1 comment | no trackbacks
Posted by Charlie
Sun, 30 Jul 2006 05:14:00 GMT
Sean mentioned the idea of using
YAML for MapServer configuration files instead of XML. I wholeheartedly agree.
The most important thing for a configuration file is that it is easily readable
and editable. YAML's simpler syntax and conciseness is a big win in this case.
After reading his article, a brilliant idea occurred to me (well,
at least I thought it was brilliant!) - use YAML to serialize RDF. RDF's
graph data model maps poorly onto XML's hierarchical model, making RDF's XML
serialization format more complicated than RDF itself. This complexity has been
the subject of endless debates and
has led to a number of alternative syntaxes
over the years, such as Notation
3, Turtle, RPV,
etc. Yet none of them have have been widely adopted.
XML's main advantages,
compared
to other serialization formats, are its extensibility, strong internationalization
support and strong tool support.
A quick look at the YAML specification shows
it supports UTF8 and UTF16 encodings, so its internationalization
support looks good (I haven't done any testing to see if the reality is different).
YAML's tool support is also good - you can find YAML parsers for Ruby, Python,
PHP, JavaScript (using the JSON subset)
and other languages as well.
Just like with XML, you can serialize custom vocabularies using YAML.
However, YAML lacks a schema language. But if you're using RDF and require a
schema, then clearly you will use RDF
schema.
RDF schema reuses the RDF XML serialization format, so I would take
the same approach with YAML and use it to both encode RDF and RDF schemas.
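To make the idea concrete, here is a minimal sketch in Ruby of one way RDF triples could be written in YAML and read back. The layout (subject URI at the top level, predicate URIs nested beneath it, objects as values or lists) is purely my assumption, not an established format, and the example.org URIs and the triples_from_yaml helper are made up for illustration.

```ruby
require 'yaml'

# Assumed layout: top-level keys are subjects, nested keys are predicates,
# and values are objects (a single value or a list of values).
RDF_YAML = <<~EOS
  "http://example.org/alice":
    "http://xmlns.com/foaf/0.1/name": "Alice"
    "http://xmlns.com/foaf/0.1/knows":
      - "http://example.org/bob"
      - "http://example.org/carol"
EOS

# Flatten the nested mapping into [subject, predicate, object] triples.
def triples_from_yaml(text)
  YAML.safe_load(text).flat_map do |subject, properties|
    properties.flat_map do |predicate, objects|
      Array(objects).map { |object| [subject, predicate, object] }
    end
  end
end

TRIPLES = triples_from_yaml(RDF_YAML)
TRIPLES.each { |triple| puts triple.join(' ') }
```

Note that this sketch does not distinguish literals from resources, and it ignores blank nodes and datatypes, all of which a real serialization would have to address.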
Curious to see what work has been done in this area, I did a quick search.
It turns out that Micah Dubinko suggested using
YAML for RDF more than three years ago on the xml-dev mailing
list.
As far as implementations go, the only example I could find was a Perl module
that
hasn't been updated since 2003. Now that some of the hype around XML has died
down, and YAML has established itself as a legitimate format, it seems
like a good time to revisit this idea.
Posted in Design, Modeling, YAML | 2 comments | no trackbacks
Posted by Charlie
Thu, 20 Jul 2006 07:55:00 GMT
A great debate in
linguistics is how much language influences thought. In the world of computer
science, I am a firm believer in the theory. Paul Graham, amongst many
others, has nicely argued the yes side of the debate.
Many developers start off in statically typed languages, as I did. I learned
to program using Pascal, did a bit of Assembly and then settled in
with Pascal via Delphi.
The first dynamic language I used was Magik - it was quite a shock. Since no one has ever heard of Magik: it's a proprietary language used in the Smallworld GIS system that is quite similar to Ruby.
Magik had awful tools, no static type checking, no debugger, no GUI building tools,
etc. And no compiler, at least not in the sense I was used to.
First Impressions Can Be Misleading
Needless to say, my first impressions were less than enthusiastic. Everyone
kept telling me the environment was so much more productive, but I didn't buy
it. I could churn out Object Pascal almost as fast, and Delphi's blazingly fast
compiler and great tools made up for any difference. I suppose people were comparing
to C++, where at the time you might as well have gone off and had several cups
of coffee between each edit-compile-test cycle. Or maybe to toy languages like AML (another proprietary language from the GIS world).
Anyway, for beginners, the environment was just awful. The library documentation
was non-existent - if you wanted to know what a method did, you went and read
the source code. No sir,
no fancy online hyper-linked context sensitive help files.
And the final nail in the coffin, the Magik IDE was an albatross called
Emacs. Emacs drove me to such distraction I went off and wrote my own
IDE (sorry, ten years haven't changed my mind about Emacs, but I sure like VIM).
But I was paid to write Magik, so I wrote Magik. After a rough start, things
started looking up a bit. It sure was
nice not worrying about memory allocation and deallocation. And having an interactive
console, where you could poke around inside a running program, that sure was
neat. And then there were the truly weird things - like dynamically
loading classes. Even better, you could replace methods in an existing
class by simply redefining them in another file (we called it reopening
classes, the term today is monkey patching). And you could pass functions as parameters via the use of procs which
were closures - although I didn't know that at the time.
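I can't show Magik here, but since it is quite similar to Ruby, a rough Ruby sketch of the last two ideas looks something like this. The Greeter class and the make_counter and call_twice helpers are invented for illustration.

```ruby
# Original definition, as it might ship in a library file.
class Greeter
  def greet(name)
    "Hello #{name}"
  end
end

# Later, possibly in another file: reopen the class and replace the method.
class Greeter
  def greet(name)
    "Hello, #{name}!"
  end
end

# A proc is a closure: it keeps access to `count` after make_counter returns.
def make_counter
  count = 0
  proc { count += 1 }
end

# A method that takes a proc as a parameter and calls it twice,
# returning the result of the second call.
def call_twice(callable)
  callable.call
  callable.call
end

COUNTER = make_counter
puts Greeter.new.greet("Magik")
puts call_twice(COUNTER)
```

Every existing and future Greeter instance picks up the second definition of greet, and each counter returned by make_counter carries its own private count.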
Stuck With Complexity
Almost ten years later, I find myself mostly programming in Ruby, JavaScript
and C. Yikes, what happened? I'm as surprised as anyone.
My take is that contrary to popular wisdom,
a good language gets out of your way and lets you do what you need to. This
is quite counterintuitive. Computer programs are pinnacles of brittle complexity
- one tiny mistake in millions of lines of code brings the whole edifice crashing
down. The natural inclination is to make the walls of that edifice as thick and
strong as possible. Java is a great example of this line of thought; you can
see examples of it throughout its design:
- Use of static type checking
- Polymorphism only through
inheritance of classes or interfaces
- Final classes
- The forced use of exception specifications
- The forced handling of exception specifications
- Difficulty in modifying code at runtime
- Strong encapsulation
- Clunky reflection
These things make the language less malleable. In return, the payoff should
be more robust programs. But do you really get that? My experience is no, but
I would love to hear about any references to studies or research that can provide
a definitive answer either way.
Trusting my experience, I don't believe programs written in Java (or C++,
etc.) are on average more robust than programs written in Python, Ruby, Smalltalk,
Perl, etc. So what has the loss of malleability cost you? Once again, my experience
tells me quite a lot.
Given a reasonably sized program, I can guarantee you a
few things:
- It contains bugs
- It's used in ways the designers and developers
never imagined
- Its execution environment is constantly changing
If you're stuck with a brittle edifice of complexity, you don't want it to be
a fortress complete with ten-foot walls and surrounded by a serpent-filled
moat. What you want is a building with an open floor plan
where you can nudge a wall here, add one there and remove one over
there.
In more concrete terms, if code is buggy then
you want to be able to write up a patch, throw it in a directory somewhere, and
have the application load it automatically replacing the invalid code. Or closely
related, you want to provide a simple mechanism to add in new functionality,
just like Selenium does
via its user-extensions.js file.
Maybe you need to graft on a major piece of new functionality, such as adding
support for serializing objects to JSON.
One approach is to open up the base Object class and add a new method, to_json,
just as Rails 1.1 did.
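Here is a minimal sketch of that approach in Ruby. It is not how Rails implements it: the method name to_json_sketch is made up to avoid clashing with any real JSON library, the string escaping only handles quotes and backslashes, and the Event class is hypothetical.

```ruby
# Reopen the base class so that every object gains the new method.
class Object
  # Fallback: serialize an object's instance variables as a JSON object.
  def to_json_sketch
    pairs = instance_variables.map do |ivar|
      key = ivar.to_s.delete('@')
      "#{key.to_json_sketch}:#{instance_variable_get(ivar).to_json_sketch}"
    end
    "{#{pairs.join(',')}}"
  end
end

# More specific classes override the fallback.
class String
  def to_json_sketch
    # Simplified escaping: only double quotes and backslashes.
    '"' + gsub(/["\\]/) { |char| "\\#{char}" } + '"'
  end
end

class Numeric
  def to_json_sketch
    to_s
  end
end

class Array
  def to_json_sketch
    "[#{map(&:to_json_sketch).join(',')}]"
  end
end

# A hypothetical application class that knows nothing about JSON.
class Event
  def initialize(name, capacity, tags)
    @name = name
    @capacity = capacity
    @tags = tags
  end
end

EVENT_JSON = Event.new("Rally", 500, ["free", "outdoor"]).to_json_sketch
puts EVENT_JSON  # {"name":"Rally","capacity":500,"tags":["free","outdoor"]}
```

The Event class never mentions JSON, yet it can be serialized because the method was grafted onto its ancestor after the fact.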
Or let's say you find yourself typing in the same boilerplate code
over and over. Why not write a method that tells the language to do this for
you? Ruby and Rails are filled with this type of metaprogramming, just as Magik
is and of course the granddaddy of the technique, Lisp. Soon you're on the road
to creating your own domain
specific languages, one of the hot topics
du jour.
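As a small illustration, here is a sketch of a class-level macro in Ruby that generates the boilerplate for you. The Validations module, the required_attr macro and the Venue class are all hypothetical names; the only real API used is Ruby's define_method and the instance variable accessors.

```ruby
module Validations
  # Class macro: for each name, generate a reader and a writer
  # that refuses nil, instead of typing both out by hand.
  def required_attr(*names)
    names.each do |name|
      define_method(name) { instance_variable_get("@#{name}") }

      define_method("#{name}=") do |value|
        raise ArgumentError, "#{name} is required" if value.nil?
        instance_variable_set("@#{name}", value)
      end
    end
  end
end

class Venue
  extend Validations

  # One declarative line replaces four hand-written methods.
  required_attr :name, :city
end

venue = Venue.new
venue.name = "Main Hall"
puts venue.name
```

The declaration in Venue already reads like a tiny language of its own, which is how this style of code tends to drift toward a DSL.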
Or maybe you need to retrofit Object X so that it can be processed by Method
A which expects to be passed an Object A. For some reason, Object X cannot inherit
from Object A. So instead you leverage duck
typing to add the needed methods to Object X.
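In Ruby, a sketch of that retrofit might look like the following. The describe method plays the role of Method A and FeedEntry plays Object X; both names, along with the title and starts_at methods, are invented for the example.

```ruby
# "Method A": it only relies on its argument responding to
# #title and #starts_at, not on the argument's class.
def describe(event)
  "#{event.title} at #{event.starts_at}"
end

# "Object X": defined elsewhere, with its own vocabulary, and unable
# to inherit from whatever class describe was written against.
class FeedEntry
  attr_reader :headline, :published

  def initialize(headline, published)
    @headline = headline
    @published = published
  end
end

# Retrofit: reopen the class and add the methods describe expects.
class FeedEntry
  def title
    headline
  end

  def starts_at
    published
  end
end

puts describe(FeedEntry.new("Rally", "18:00"))
```

No inheritance relationship is involved: describe works with any object that answers the two messages it sends.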
These things are easy to do in some languages, hard in others, and impossible
in others.
Really, I Know What I'm Doing
Mastering a skill requires mastering its tools - be it construction, sword fighting,
cooking, bike riding, flying, etc. The techniques above are some of the sharp
tools of programming. You can use them to quickly make mincemeat
out of your problem - or, on not so good days, mincemeat out of your fingers.
But when you make dinner tonight, I'm guessing you're not reaching for the dullest
knife in the drawer.
Posted in Design, JavaScript, Magik, Ruby, Smallworld | 5 comments | 1 trackback