It’s the data, stupid!

Let’s say you’ve been tasked with integrating several applications in your organization. A quick Google search later and you’re overwhelmed with different opinions about the right technology to use: what programming language, what platform, what messaging infrastructure, what database (or no database), what hardware. The list goes on and on. Just as quickly, you’ll find plenty of war stories about unsupportive management, the political difficulties of getting different parts of an organization to work together, turmoil caused by reorganizations, difficulties with outsourcing, and so on.

Yet try to find information about how to model your problem domain. By that I don’t mean which UML tool is best, or why XML is superior to XYZ. I mean: what information do your systems capture about the real world, and what hidden assumptions do they use to manipulate that information?

The reason you won’t find much information is that the problem is extraordinarily hard. Computers are digital: they divide the world into sharp distinctions that don’t exist. The real world is analog. For example, can you tell me the difference between a stream, brook, run, creek, river and watercourse? Of course you can’t, because they all blend into each other. Words in a language are fuzzy, ambiguous representations of things in the world (or maybe not: do dragons exist?). They mean whatever a group of people has decided they mean. Your definition of a creek is undoubtedly different from mine.

One of the wonders of human intelligence is that we are able to sort through all this fuzziness and can usually communicate with each other. Computers are not so fortunate. Trying to share information between different applications is fraught with error.

Let’s take an example from the book Data and Reality, which, as I’ve written before,
is by far the best book I’ve read on the subject of data modeling (go buy a copy now before it goes out of print again!). Let’s say you want to share employee data between two different applications. Sounds easy, doesn’t it? But then let’s start asking some questions:

  • Do employees include contractors?
  • Do employees include part-time workers?
  • Do employees include retired workers?
  • Do employees include workers on leave?
  • Do employees include workers serving in the military?
  • Do employees include workers who have just signed a contract but have not
    shown up to work yet?

And on and on. The answers will be different depending on which department of the organization you ask. An employee on leave may exist according to the benefits department but not according to the payroll department. Or how about a couple working for the same company: is the wife her husband’s dependent or an employee (and, of course, vice versa)?
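To make the disagreement concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the `Status` values, the sample people, and the two `is_employee_for_*` rules are assumptions about how a payroll and a benefits application might each define "employee", not rules taken from any real system or from Data and Reality.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    ON_LEAVE = "on_leave"
    RETIRED = "retired"
    CONTRACTOR = "contractor"


@dataclass(frozen=True)
class Person:
    name: str
    status: Status


# Hypothetical rule: payroll only counts people it currently pays a salary.
def is_employee_for_payroll(person: Person) -> bool:
    return person.status is Status.ACTIVE


# Hypothetical rule: benefits also covers people on leave and retirees.
def is_employee_for_benefits(person: Person) -> bool:
    return person.status in {Status.ACTIVE, Status.ON_LEAVE, Status.RETIRED}


# Made-up sample data, one person per status.
people = [
    Person("Ada", Status.ACTIVE),
    Person("Ben", Status.ON_LEAVE),
    Person("Cy", Status.RETIRED),
    Person("Di", Status.CONTRACTOR),
]

payroll = {p.name for p in people if is_employee_for_payroll(p)}
benefits = {p.name for p in people if is_employee_for_benefits(p)}

# People the two applications disagree about (in one set but not the other).
disputed = sorted(payroll ^ benefits)
print(disputed)  # prints ['Ben', 'Cy']
```

Both functions answer the question "is this person an employee?", and both are reasonable for the department that wrote them. Neither rule appears in the data being exchanged, so an integration that simply copies "employees" from one system to the other silently inherits the wrong definition.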

No matter what you do, you won’t get this right. Every application includes hidden assumptions about what its data means and how it is processed. Those assumptions inevitably vary between different applications.

In the next post we’ll look at a real-world example.
