Under the Influence of Metcalfe's Law

Posted by Charlie Wed, 09 May 2007 22:03:00 GMT

It's not every day someone takes the time to write me an open letter - I have to say it's kind of fun. Brian added some additional thoughts to our ongoing conversation about GML. In truth, this is where blogging breaks down a bit: it would be much easier to sit down in a room for an hour and have a great in-depth technical discussion (of course, then our discussion wouldn't be available for the whole world to see, which is a significant downside).

Since it's a bit hard to sift through where things stand in a long discussion, let me recap the points I think we agree on:

  • GML is a toolkit that provides rules for translating your proprietary data model into XML
  • Having translated your data model into GML/XML, it is then necessary to code both clients and servers to understand it

Where we disagree is whether this is a good idea or not.

I see at least three very different use cases here:

  • I want to share within my own organization
  • I want to share with a preselected set of outside organizations
  • I want to share with the world

I'll agree with Brian that for the first two use cases, GML 2 (and 3) provides a workable solution (although I think GML 1 was a better solution and that the overhead of GML 2 is prohibitive).

It's item #3, though, that really matters. One of the things that makes the Web different is that Metcalfe's Law (and Reed's Law) becomes predominant - the value of something grows much faster than the number of people using it. Which leads me to the conclusion that everyone has to agree to a shared data model and format. Otherwise you end up with thousands of one-off data integrations, which does nothing to solve the general problem.
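To put rough numbers on the "thousands of one-off data integrations" point, here is a back-of-the-envelope sketch. It assumes the worst case, where every pair of parties writes its own translation, and compares that with every party writing a single adapter for one shared format. The function names are made up for illustration.

```python
def pairwise_integrations(parties: int) -> int:
    """One-off integrations needed if every pair of parties translates directly."""
    return parties * (parties - 1) // 2


def shared_format_adapters(parties: int) -> int:
    """Adapters needed if every party reads and writes one shared format."""
    return parties


# Print the two counts side by side for a few community sizes.
for parties in (10, 100, 1000):
    print(parties, pairwise_integrations(parties), shared_format_adapters(parties))
```

With 100 parties that is 4,950 pairwise integrations against 100 adapters, and the gap keeps widening as more people join.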

There are obvious downsides to agreeing to a general data model - it will always be a lowest common denominator and won't work for many complex integrations that live in the realm of the first two use cases. But there is an obvious upside - it is the only thing that has any chance of working out on the web. If you don't agree, then please show me a real-life example that disproves it.

So where does that leave us? I believe that GML as it is formulated has no chance of success out on the Web because it's simply not designed for it. The obvious consequence is the emergence of the Atom / GeoRSS combination and KML. And truth be told, those standards solve the problem of rendering maps made up of multiple geographic data sources well enough.

What they don't solve is exchanging attribute data between systems. And this leads right into the hornet's nest of the Semantic Web and data modeling - no one has ever come up with a solution to this problem and I doubt anyone ever will.

So faced with that daunting task - why not try the simplest thing that could possibly work - which ironically was more or less GML 1:

<Feature typeName="Road">
  <description>M11</description>
  <property typeName="classification">motorway</property>
  <property typeName="number" type="integer">11</property>
  <geometricProperty typeName="linearGeometry">
    <LineString srsName="EPSG:4326">
      <coordinates>
        0.0,100.0 100.0,0.0
      </coordinates>
    </LineString>
  </geometricProperty>
</Feature>

In today's world, I'd modify this a bit and start with Atom, add in GeoRSS, and then add in a new namespace that encodes properties like above. And I'd stick the same stuff in the KML metadata tag.
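As a rough sketch of what that combination might look like, the Python below builds an Atom entry carrying a GeoRSS line and a handful of properties. The Atom and GeoRSS namespace URIs are the real ones; the property namespace (http://example.org/ns/properties), its `property` element, and the coordinates are invented for illustration. The entry is also incomplete as Atom goes, since a valid entry needs an id and an updated timestamp.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
GEORSS = "http://www.georss.org/georss"
PROPS = "http://example.org/ns/properties"  # hypothetical property namespace

# Register readable prefixes so the serialized output is easy on the eyes.
ET.register_namespace("atom", ATOM)
ET.register_namespace("georss", GEORSS)
ET.register_namespace("p", PROPS)


def feature_entry(title, properties, line):
    """Build an Atom entry with a GeoRSS line and simple name/value properties.

    `line` is a list of (latitude, longitude) pairs; GeoRSS Simple writes
    them as one space-separated string in latitude-longitude order.
    """
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    coords = " ".join(f"{lat} {lon}" for lat, lon in line)
    ET.SubElement(entry, f"{{{GEORSS}}}line").text = coords
    for name, value in properties.items():
        prop = ET.SubElement(entry, f"{{{PROPS}}}property", {"name": name})
        prop.text = str(value)
    return entry


entry = feature_entry(
    "M11",
    {"classification": "motorway", "number": 11},
    [(51.56, 0.02), (52.23, 0.09)],  # made-up coordinates
)
print(ET.tostring(entry, encoding="unicode"))
```

The point of the design is that a client which knows nothing about the property namespace can still draw the line on a map, while one that does can show the attributes too.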

Now, I don't expect this to do diddly-squat for machine to machine integration. What I do expect it to do is make it easy for clients to show a nice property browser to users when they mouse over a feature on a map. And for the web, that's good enough since it all comes down to humans in the end anyway.
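To show how little a client needs to do for that property browser, here is a minimal sketch that reads the GML 1 style feature from above and turns it into name/value rows. The function name `property_rows` is an invention for this example, and it makes no attempt to interpret the `type` attribute or the geometry.

```python
import xml.etree.ElementTree as ET

# The feature from the example above, unchanged.
FEATURE_XML = """
<Feature typeName="Road">
  <description>M11</description>
  <property typeName="classification">motorway</property>
  <property typeName="number" type="integer">11</property>
  <geometricProperty typeName="linearGeometry">
    <LineString srsName="EPSG:4326">
      <coordinates>
        0.0,100.0 100.0,0.0
      </coordinates>
    </LineString>
  </geometricProperty>
</Feature>
"""


def property_rows(feature_xml):
    """Return (name, value) rows that a property browser could display as-is."""
    feature = ET.fromstring(feature_xml)
    rows = [("type", feature.get("typeName"))]
    description = feature.findtext("description")
    if description is not None:
        rows.append(("description", description))
    # Every value stays a string; a human reader does not need more than that.
    for prop in feature.findall("property"):
        rows.append((prop.get("typeName"), prop.text))
    return rows


for name, value in property_rows(FEATURE_XML):
    print(f"{name}: {value}")
```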


Drinking the Kool-Aid

Posted by Charlie Sun, 06 May 2007 23:42:00 GMT

From Stefan:
There are way too many examples of this; my favorite one being XMI.
Amen. I drank that Kool-Aid once myself (nothing like learning by doing)...


On Long Transactions

Posted by Charlie Sun, 06 Aug 2006 03:12:00 GMT

It was with great interest that I read that PostGIS 1.1.3 supports long transactions. Except there is one problem - if you dig into the documentation you'll see it does no such thing.

Instead, PostGIS supports record-level locking as defined in OGC's Web Feature Service (WFS) specification. According to the spec (section 10, page 34):

The purpose of the LockFeature operation is to expose a long term feature locking mechanism to ensure consistency. The lock is considered long term because network latency would make feature locks last relatively longer than native commercial database locks.

This has nothing to do with long transactions, and really should be called something like "record locking."

So What is a Long Transaction?

The term long transaction came out of the GIS industry to describe updates that take days or weeks or months to complete. It was coined to highlight the difference between normal database transactions, or "short transactions," that take milliseconds to complete.

Most of what we do on computers consists of long transactions - writing documents, creating spreadsheets, drawing graphics, writing new software, etc.

In the GIS world, long transactions are crucial for modeling the world. For example, imagine a developer wants to build a new subdivision. Part of the required work is to design the subdivision's networks - roads, water pipes, sewer pipes, electrical lines and phone lines. Another part is to lay out the parcels - where the houses will go. Creating these designs can take months, and it is often necessary to create several different designs to find the optimal one.

While this work is being done, you want it to be isolated from other users so as to not disturb their work.

Versioned Databases

Two naive ways of implementing long transactions are:

Both of these approaches were tried in the industry, and unsurprisingly, failed. The problem is that they don't scale in multi-user systems. Before long, users start stepping on each other's toes and the whole system grinds to a halt.

Instead, what is needed is an approach that allows users to create their own "version" of the database, work on it as long as needed, and once it's done, merge it back into the main database. If you are a developer, this should sound awfully familiar. It's the exact same functionality that branches in source control systems provide.
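The branch-and-merge idea can be sketched in a few lines of Python. This is a toy model, not how Smallworld, Oracle or ESRI implement versioning: it copies the whole main database into each version, and it simply refuses to merge when a record changed in main after the version was created. All class and method names are invented for the example.

```python
class MergeConflict(Exception):
    """Raised when a record changed in main after the version was branched."""


class Version:
    """A private working copy: a snapshot of main plus the user's own edits."""

    def __init__(self, base):
        self.base = base      # records as they were when the version was created
        self.changes = {}     # edits made inside this long transaction

    def put(self, key, value):
        self.changes[key] = value

    def get(self, key):
        return self.changes.get(key, self.base.get(key))


class VersionedStore:
    def __init__(self, records=None):
        self.main = dict(records or {})

    def branch(self):
        # Copying everything is only workable for a toy; it keeps the idea visible.
        return Version(dict(self.main))

    def merge(self, version):
        # A record conflicts if main no longer matches what the version started from.
        conflicts = [key for key in version.changes
                     if self.main.get(key) != version.base.get(key)]
        if conflicts:
            raise MergeConflict(conflicts)
        self.main.update(version.changes)


store = VersionedStore({"parcel-1": "vacant"})
roads = store.branch()       # one designer lays out roads for months
houses = store.branch()      # another works on parcels in the meantime
roads.put("road-7", "planned")
houses.put("parcel-1", "house")
store.merge(roads)
store.merge(houses)
print(store.main)
```

Neither designer sees the other's unfinished work, and nobody holds a lock while the designs take shape. The hard part in a real system is what to do when `merge` finds a conflict, which this sketch leaves to the caller.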

Implementations

Smallworld was the first commercial implementation of a GIS that had a versioned database that supported long transactions. Later, Oracle, working with Smallworld, introduced a similar technology in Oracle 9i, which they called Workspace Manager. ESRI, the largest GIS vendor, also now supports long transactions.

Unfortunately, PostgreSQL/PostGIS does not support versioned databases. And the new locking functionality it provides is almost useless because it won't scale in multi-user environments. Of course, the PostGIS developers are just implementing a poorly thought-out part of the WFS specification.


It's the data, stupid!

Posted by Charlie Tue, 01 Aug 2006 21:53:00 GMT

Let's say you've been tasked with integrating several applications in your organization. A quick Google later and you're overwhelmed with different opinions about the right technology to use. What programming language, what platform, what messaging infrastructure, what database (or no database), what hardware - the list goes on and on. And just as quickly, you'll find plenty of war stories about unsupportive management, political difficulties of getting different parts of an organization to work together, turmoil caused by reorganizations, difficulties with outsourcing - and on and on.

Yet try to find information about how to model your problem domain. This does not mean what is the best UML tool, or why XML is superior to XYZ. It means what information do your systems capture about the real world and what hidden assumptions do they use to manipulate that information.

The reason you won't find much information is that the problem is extraordinarily hard. Computers are digital - they divide the world into sharp distinctions that don't exist. The real world is analog. For example, can you tell me the difference between a stream, brook, run, creek, river and water course? Of course you can't - they all blend into each other. Words in a language are fuzzy, ambiguous representations of things in the world (or maybe not, do dragons exist?). They mean whatever a group of people have decided they mean. Your definition of a creek is undoubtedly different than mine.

One of the wonders of human intelligence is that we are able to sort through all this fuzziness and can usually communicate with each other. Computers are not so fortunate. Trying to share information between different applications is fraught with error.

Let's take an example from the book Data and Reality, which, as I've written before, is by far the best book I've read on the subject of data modeling (go buy a copy now before it goes out of print again!). Let's say you want to share employee data between two different applications. Sounds easy, doesn't it? But then let's start asking some questions:

  • Do employees include contractors?
  • Do employees include part-time workers?
  • Do employees include retired workers?
  • Do employees include workers on leave?
  • Do employees include workers serving in the military?
  • Do employees include workers who have just signed a contract but have not shown up to work yet?

And on and on. The answers will be different depending on what department of the organization you ask. An employee on leave may exist according to the benefits department but not the payroll department. Or how about a couple working for the same company - is the wife a dependent or an employee (and of course vice versa)?

No matter what you do, you won't get this right. Every application includes hidden assumptions about what its data means and how it is processed. Those assumptions inevitably vary between different applications.

In the next post we'll look at a real-world example.
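Here is a small sketch of what those hidden assumptions look like once they are written down as code. The two rules are hypothetical: they assume payroll only counts people currently being paid, while benefits also counts people on leave and retirees. Neither rule comes from the book or from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    name: str
    status: str  # one of "active", "on_leave", "retired", "contractor"


def is_employee_for_payroll(worker: Worker) -> bool:
    # Assumed payroll rule: only people drawing a salary right now.
    return worker.status == "active"


def is_employee_for_benefits(worker: Worker) -> bool:
    # Assumed benefits rule: anyone the company still covers.
    return worker.status in {"active", "on_leave", "retired"}


workers = [
    Worker("Ann", "active"),
    Worker("Bob", "on_leave"),
    Worker("Cy", "retired"),
    Worker("Dee", "contractor"),
]

payroll = [w.name for w in workers if is_employee_for_payroll(w)]
benefits = [w.name for w in workers if is_employee_for_benefits(w)]
print("payroll: ", payroll)
print("benefits:", benefits)
```

Both applications store something called an "employee", yet the same four people produce two different employee lists. Any integration between the two has to pick one definition or invent a third.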


YAML and RDF

Posted by Charlie Sun, 30 Jul 2006 05:14:00 GMT

Sean mentioned the idea of using YAML for MapServer configuration files instead of XML. I wholeheartedly agree. The most important thing for a configuration file is that it is easily readable and editable. YAML's simpler syntax and conciseness are a big win in this case.

After reading his article, a brilliant idea occurred to me (well, at least I thought it was brilliant!) - use YAML to serialize RDF. RDF's graph data model maps poorly onto XML's hierarchical model, making RDF's XML serialization format more complicated than RDF itself. This complexity has been the subject of endless debates and has led to a number of alternative syntaxes over the years, such as Notation 3, Turtle, RPV, etc. Yet none of them have been widely adopted.

XML's main advantages, compared to other serialization formats, are its extensibility, strong internationalization support and strong tool support.

A quick look at the YAML specification shows it supports UTF-8 and UTF-16 encodings, so its internationalization support looks good (I haven't done any testing to see if the reality is different). YAML's tool support is also good - you can find YAML parsers for Ruby, Python, PHP, JavaScript (using the JSON subset) and other languages as well.

Just like with XML, you can serialize custom vocabularies using YAML. However, YAML lacks a schema language. But if you're using RDF and require a schema, then clearly you will use RDF Schema. RDF Schema reuses the RDF XML serialization format, so I would take the same approach with YAML and use it to encode both RDF and RDF Schema.
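To make the idea concrete, here is one way RDF triples might be laid out in YAML: a mapping from subject to predicate to a list of objects. This layout is just one possibility and not an established format. The sketch writes the YAML by hand using only the standard library, quotes every string with `json.dumps` (JSON strings are also valid YAML double-quoted scalars), and ignores blank nodes, datatypes, language tags, and the difference between resource and literal objects. The example.org URIs are placeholders.

```python
import json
from collections import defaultdict


def triples_to_yaml(triples):
    """Serialize (subject, predicate, object) string triples as nested YAML."""
    graph = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        graph[subject][predicate].append(obj)

    lines = []
    for subject, predicates in graph.items():
        lines.append(f"{json.dumps(subject)}:")
        for predicate, objects in predicates.items():
            lines.append(f"  {json.dumps(predicate)}:")
            for obj in objects:
                lines.append(f"    - {json.dumps(obj)}")
    return "\n".join(lines) + "\n"


triples = [
    ("http://example.org/road/M11", "http://example.org/ns/classification", "motorway"),
    ("http://example.org/road/M11", "http://example.org/ns/number", "11"),
]
print(triples_to_yaml(triples), end="")
```

Grouping by subject is what keeps this readable: the graph stays a graph, but each resource reads like a small configuration block.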

Curious to see what work has been done in this area, I did a quick search. It turns out that Micah Dubinko suggested using YAML for RDF more than three years ago on the xml-dev mailing list.

As for implementations, the only example I could find was a Perl module that hasn't been updated since 2003. Now that some of the hype around XML has died down, and YAML has established itself as a legitimate format, it seems like a good time to revisit this idea.

