Technical Paper 8 – GIS Databases are Different

by Peter Batty and Richard G. Newell

Synopsis

There has been much debate in the GIS industry about the suitability of standard
commercial database management systems (DBMS’s) for use in GIS. Historically most
GIS’s used proprietary storage mechanisms for their geographical data, as the performance
of commercial DBMS’s did not really make them a viable proposition. However, in
the last few years, advances in hardware and software performance have made it
possible to develop GIS products on top of commercial DBMS’s. This approach has
several obvious attractions.

However, the main argument of this paper is that current commercial DBMS technology
has fundamental restrictions for GIS, and that a radically different client-server
architecture based on version management has major advantages over traditional
DBMS architectures. This new architecture is described and its benefits are explained.
Clearly integration with existing databases and conformance to existing standards
are also very important, and these issues are also discussed.

The Use of Standard DBMS’s for GIS

The attractions of using standard DBMS’s for GIS have been described in some detail
by various people, including one of the authors of this paper (see Batty (1), Seaborn
(2)). In summary, the advantages of this approach are that the GIS vendor should
be able to take advantage of functions provided by the DBMS vendor and concentrate
on developing GIS functions.

In particular, functions such as security and backup and recovery are well proven
in standard DBMS’s and the GIS can take advantage of these. The GIS user can exploit
existing database skills and use common procedures for many database administration
tasks, for both GIS and non-GIS data. Integration between GIS and non-GIS data
should be easier to achieve when using this approach. There is increasingly good
capability available for integration of multiple DBMS’s from different vendors
in a heterogeneous distributed environment.

The drawbacks of the standard DBMS approach are perhaps less obvious, with the
exception of performance. There is an argument that performance will not be an
issue in the longer term because of advances in technology. Certainly performance
has improved, and it is now possible to implement a reasonably efficient spatial
indexing system on top of a standard DBMS, especially one which supports physical
clustering of data. However, the performance issue is more complex than this, and
we will discuss later how a client-server version managed architecture can offer
an order of magnitude better performance than conventional systems in a networked
GIS with a large number of users.

The biggest single drawback of standard DBMS’s is their lack of capability in
the area of handling long transactions and version management, which is discussed
in the next section. Handling long transactions is a fundamentally different problem
from that of handling short transactions, which is what standard DBMS’s are designed
for, and handling the former would require significant re-architecting of existing
DBMS products.

Some of the other apparent advantages of standard DBMS’s are not as clear cut
as they might appear. While the GIS vendor may not have to worry about writing
code to cater for some DBMS tasks such as backup and recovery, this advantage may
be outweighed by the amount of code which has to be written to implement other
functions which are not provided by the standard DBMS, such as the storage and
retrieval of spatial data, and the provision of some sort of long transaction handling
mechanism such as checkout.

The integration of geographic data into external applications is also not necessarily
simplified by storing it in a standard DBMS. Because spatial data types are not
explicitly supported by most standard DBMS’s, the GIS vendor typically has to use
complex record structures and retrieval algorithms for the geographic data, which
means that it cannot be easily read or updated by external applications. The external
application also needs to understand any special locking mechanisms which are used
by the GIS to handle long transactions, since the standard DBMS locking does not
handle this. The most obvious short term solution to this problem is for the GIS
vendor to provide an API (Application Programming Interface) which can be used
by external applications, but that approach can be used equally well whether or
not a standard DBMS is used for the GIS data. This is an area where progress is
being made by the standard DBMS vendors, as they begin to provide the ability to
define new data types and operators within SQL. The draft SQL3 standard includes
capabilities which will be useful in this respect.

Long Transactions

A database transaction is a group of related updates against a database which
form a single
“unit of work”. Standard DBMS’s are designed to handle “short transactions”,
so called because they usually take a short time – typically a few seconds at most.
A simple example of a short transaction in a banking system would be transferring
some money from a current account to a savings account. The system first has to
subtract the appropriate amount from an entry in the current account table, and
then add the same amount to an entry in the savings account table. If the system
fails halfway through the transaction, then the database is in an inconsistent state,
and it must be rolled back to the state at the start of the transaction to make
sure that the database is still consistent. An inconsistent state could also arise
if someone else were able to update the records involved while the transaction
was in progress, so these records will be locked and nobody else will be able to
update them for the duration of the transaction. Standard DBMS’s are designed around
being able to handle this sort of transaction in a very robust way which will handle
system failures appropriately.
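
To make the idea concrete, the sketch below shows such a transfer as a single short
transaction. It is only an illustration (it uses SQLite from Python, and the table
and account identifiers are invented for the example), but it shows the essential
property: either both updates are committed, or a failure part way through rolls
both back.

    import sqlite3

    # Invented example schema: one table holding both current and savings accounts.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, kind TEXT, balance REAL)")
    conn.execute("INSERT INTO accounts VALUES (1, 'current', 500.0), (2, 'savings', 0.0)")
    conn.commit()

    def transfer(conn, amount, from_id, to_id):
        # One short transaction: both updates succeed together or neither is applied.
        with conn:  # sqlite3 commits on normal exit and rolls back if an exception occurs
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, from_id))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_id))

    transfer(conn, 100.0, from_id=1, to_id=2)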

In contrast, a typical GIS transaction is something very different. For example,
an engineer designing an extension to a gas utility network will need to digitise
a number of pipes, valves and other objects, which may take hours, days or even
weeks. While he is in the process of doing this design, the database will not be
in a consistent state – if someone else looked at the work he was doing when he
was halfway through it, it would probably not make engineering sense. So, in some
sense, he needs to have a private copy of the data for the duration of the transaction,
which he can make available to other users when he has finished his work. He may
also want to create multiple different designs for his network extension and do
some analysis on each alternative design before deciding which one to use. Once
the work has been carried out, it may be necessary to make some modifications to
the database to reflect any changes which were made to the original design when
the work was done (“as-built” changes).

This whole process may well take weeks or months, and since separate copies of
the data have to be managed, the issue of concurrency control also has to be addressed.
Some mechanism is required to handle the situation where multiple people
want to work in the same area at the same time. There are two basic approaches
to the concurrency problem: optimistic or pessimistic. The pessimistic approach
prevents more than one person working on the same area at the same time by locking
out any area where a user is currently working. The optimistic approach allows
multiple users to work in the same area with no constraints, on the assumption
that conflict in what they are doing is unlikely to occur. Conflicts are checked
for at the time that the changes are posted to the master version of the database
and they can be resolved at that time. The optimistic approach is generally most
suitable for GIS, since typically the volume of updates is small in relation to
the total amount of data, and also working practices often dictate that there should
not be conflicts even if two people are working in the same area.
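
The optimistic check itself can be stated very simply. The following sketch (invented
for illustration, not taken from any particular product) compares the records changed
in the master since the long transaction began with the records changed in the user's
working version; only records changed in both places are conflicts needing resolution
at post time.

    def detect_conflicts(base, master, working):
        """Optimistic concurrency check at post time.

        base    - records as they were when the long transaction started: {id: value}
        master  - the same records as they stand in the master database now
        working - the records as edited in the user's working version
        Returns the ids changed both in the master and in the working version.
        """
        changed_in_master = {rid for rid in base if master.get(rid) != base[rid]}
        changed_by_user = {rid for rid in base if working.get(rid) != base[rid]}
        return changed_in_master & changed_by_user

    # Two users have touched overlapping data, but only record 2 is a genuine conflict.
    base    = {1: "pipe A", 2: "valve B", 3: "pipe C"}
    master  = {1: "pipe A", 2: "valve B (moved)", 3: "pipe C"}
    working = {1: "pipe A", 2: "valve B (resized)", 3: "pipe C (rerouted)"}
    print(detect_conflicts(base, master, working))   # {2}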

Checkout

The most common approach which has been used to address the long transaction problem
is checkout. In this approach, the user specifies an area in which he wishes to
work, and the data selected is copied to a separate working area which is just
used by that user. The working area may or may not use the same DBMS and data structures
as the master database. Updates are made to this working data set, and when the
work has been completed the changes are applied to the master database. The concurrency
control mechanism used may either be pessimistic, in which case all data retrieved
by a user is locked so that other users can view it but not update it, or it can be optimistic,
in which case multiple users can retrieve overlapping areas and conflicts are identified
when changes are passed back to the master database. For a more detailed description
of a checkout based system, see Batty (3).
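
In outline, a checkout mechanism behaves something like the following sketch (a
deliberately simplified illustration using invented names, not a description of any
particular product): a spatial query copies the selected objects into a private
working set, optionally locking them, and the edited set is later applied back to
the master.

    class CheckoutStore:
        """Minimal sketch of checkout-based long transactions (invented for illustration)."""

        def __init__(self, master):
            self.master = master      # {object_id: (x, y, attributes)}
            self.locked = set()       # ids locked under a pessimistic policy

        def checkout(self, area, pessimistic=True):
            # Copy every object falling inside the requested area into a private working set.
            xmin, ymin, xmax, ymax = area
            working = {oid: (x, y, attrs) for oid, (x, y, attrs) in self.master.items()
                       if xmin <= x <= xmax and ymin <= y <= ymax}
            if pessimistic:
                self.locked |= set(working)   # others may view, but not check out, these objects
            return working

        def checkin(self, working):
            # Apply the edited working set back to the master and release any locks.
            self.master.update(working)
            self.locked -= set(working)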

Checkout has a number of disadvantages. The first is that the initial retrieval
of data can take a long time – in general, checkout times tend to be measured in
minutes rather than seconds. A second drawback is that the user has a restricted
subset of the database to work in. If he discovers that he needs to work outside
the area which he originally requested, then it is necessary to do another retrieval,
which may again take a significant time. If relationships (including topological
relationships) are allowed between objects in the database, this introduces further
complications in deciding exactly how much data to check out – some mechanism is
required for controlling updates to objects which have been extracted but which are
related to objects which have not been extracted.

Version Management

Another approach to handling long transactions is to use a version managed database.
In such a database it is possible to create different versions of the database
called alternatives. Only one user can update an alternative at one time, but any
number of users can read an alternative. Changes made by a user within an alternative
are only seen within that alternative. The whole database can be seen within an
alternative, but data is not replicated: only the changes relative to the parent
version are stored in an alternative. To implement this efficiently requires version
management to be built in at a fundamental level in the DBMS. When changes to an
alternative have been completed, they can be posted up to the parent alternative.
An optimistic approach to concurrency control is used, and any conflicts are detected
and corrected at this time. It is also possible to have a tree structure of alternatives,
so in addition to handling simple long transactions, this approach also provides
a mechanism for handling alternative designs in an elegant way. Version management
overcomes all the problems with checkout mentioned above: there is no initial retrieval
time, no copying of data is required, and the user has access to the whole database
at all times. For a more detailed description of version management see Newell
(4) and Easterfield (5).
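
The behaviour of alternatives can be sketched in a few lines. In the illustration
below (not the Smallworld implementation, just a model of the idea), each alternative
stores only its own changes; a read that finds nothing locally is answered by the
parent, so the whole database is visible without any copying, and posting simply
pushes the local changes up to the parent.

    class Alternative:
        """Sketch of a version managed table: an alternative stores only its own changes."""

        def __init__(self, parent=None):
            self.parent = parent
            self.delta = {}           # {record_id: value} changed in this alternative only

        def read(self, record_id):
            # A record not changed here is seen via the parent, so the whole database
            # is visible in every alternative without replicating any data.
            if record_id in self.delta:
                return self.delta[record_id]
            return self.parent.read(record_id) if self.parent else None

        def write(self, record_id, value):
            self.delta[record_id] = value     # visible only within this alternative

        def post(self):
            # Post completed changes up to the parent alternative. A real system would
            # first run an optimistic conflict check, as described earlier.
            self.parent.delta.update(self.delta)
            self.delta.clear()

    master = Alternative()
    master.write("pipe-17", "6 inch main")
    design = Alternative(parent=master)        # one engineer's design alternative
    design.write("pipe-17", "8 inch main")     # change seen only within this alternative
    print(master.read("pipe-17"), design.read("pipe-17"))   # 6 inch main / 8 inch main
    design.post()
    print(master.read("pipe-17"))              # 8 inch main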

A Client-server Implementation of Version Management

This section looks at the way in which version management is implemented in the
Smallworld Version Managed Data Store (VMDS). It is necessary to consider the structure
of the database at quite a low level in order to appreciate the difference between
this architecture and that of traditional DBMS’s in terms of the performance which
is achievable. This section will just summarise the most important points about
the implementation – for a more detailed discussion see Easterfield (5).

Tables in VMDS are implemented using a standard structure called a B-tree which
allows efficient navigation based on a key value to find a record with that key.
Many DBMS’s implement their tables using B-trees. However, the key difference between
the VMDS approach and more traditional approaches, for the purposes of this discussion
on performance, is that the datastore never updates a disk block directly while
any version of the database still refers to it. Whenever some data is changed and
committed to disk, a copy of the original disk block containing that data is created
and the update is made to that copy. In turn, any other disk blocks in the tree
which referred to the old disk block are also copied and updated so that they point
to the new version of the data. These other disk blocks may still point to other
disk blocks which have not been changed – so common data is shared between versions.
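
The copying behaviour can be illustrated with a small sketch (a simplified model of
the idea, not the VMDS block format): updating a record copies the leaf block that
holds it and every block on the path up to the root, while blocks off that path
remain shared between the old and new versions of the database.

    class Block:
        """A block is never modified in place once it has been written."""
        def __init__(self, children=None, records=None):
            self.children = children or []    # interior block: list of child blocks
            self.records = records or {}      # leaf block: {key: value}

    def update(block, path, key, value):
        """Return a new root reflecting the change, copying only the blocks on 'path'.

        'path' is the list of child indices leading from the root down to the leaf
        that holds 'key'. Blocks off the path are shared with the old version.
        """
        if not path:                                   # reached the leaf
            records = dict(block.records)
            records[key] = value
            return Block(records=records)
        i = path[0]
        children = list(block.children)                # copy this interior block...
        children[i] = update(children[i], path[1:], key, value)   # ...and the changed child
        return Block(children=children)

    # Updating a key in leaf 0 copies leaf 0 and the root; leaf 1 is untouched.
    leaf0, leaf1 = Block(records={"a": 1}), Block(records={"b": 2})
    old_root = Block(children=[leaf0, leaf1])
    new_root = update(old_root, [0], "a", 99)
    print(old_root.children[0].records, new_root.children[0].records)   # {'a': 1} {'a': 99}
    print(old_root.children[1] is new_root.children[1])                 # True: shared block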

This is in contrast to traditional DBMS architectures, where an update modifies
the data in place, in the original block where it was stored. In
such a short transaction based DBMS, a change to a record is immediately seen by
all users. In contrast, when an update is made by a user of VMDS, it is only seen
by that user until a version management operation such as a merge is carried out
by another user.

A major benefit of the fact that VMDS never updates a disk block in place is that
it becomes possible to cache disk blocks on client workstations.
In any subsequent processing, if a disk block is required and it has been cached
on the local workstation, that block can be immediately used, in the knowledge
that it has not been updated on the server. This is not the case with a standard
DBMS, since as soon as a block was updated by any user, a mechanism would be required
for updating or uncaching all cached copies of that block (in a way which will
work consistently even in the event of hardware, software or network failures on
part of the system). This is a very complex unsolved problem.
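
The consequence for the client is that a block cache needs no invalidation protocol
at all, as in the following sketch (an illustration of the principle rather than
the actual VMDS client): a block found in the local cache can always be used, and
a miss is satisfied by a simple request to the block server.

    class ClientBlockCache:
        """Sketch of client-side caching made safe by blocks never being updated in place."""

        def __init__(self, fetch_from_server):
            self.fetch_from_server = fetch_from_server   # callable: block_id -> block contents
            self.cache = {}                              # in-memory cache on the workstation

        def get_block(self, block_id):
            # A cached block can be used immediately: it cannot have changed on the
            # server, because the server never overwrites a block that is still referenced.
            if block_id not in self.cache:
                self.cache[block_id] = self.fetch_from_server(block_id)
            return self.cache[block_id]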

Performance

This ability to cache blocks on the client workstation also means that most of
the processing can be done by the client. Thus this approach removes the two main
bottlenecks in GIS performance which arise with standard DBMS’s – processing on
the server and network traffic. The requirement for a typical GIS redraw transaction
is to be able to retrieve hundreds or thousands of records, and potentially megabytes
of data, from the database in a few seconds. In standard DBMS’s, all the optimisation
and query processing is done by the server, which quickly leads to a processing
bottleneck on the server when processing large numbers of concurrent queries of
this complexity. The second bottleneck is that all the data which is returned from
the query needs to be transferred across the network, which leads to serious problems
in terms of network traffic.

The version managed client server approach offers enormous improvements in both
of these areas, as very little processing is done by the server – it is essentially
just a very simple fast block server – and typically much (or all) of the data
required will be cached on the client, in which case network traffic is reduced
and processing on the server is also further cut down. This makes it possible to
have very large numbers of users concurrently updating a single continuous database.
Tests carried out at one customer showed no significant loss in performance as
the number of users accessing a single server on a local area network was increased
from 1 to 35. The users were all running an intensive data capture application.
The only way in which anywhere near this level of performance could be achieved
with the standard DBMS approach is by checking out data and storing it locally
on each workstation, but this approach has other significant drawbacks which were
discussed earlier.

Distributed Database

A very common requirement for companies who use GIS is to be able to access the
GIS data in multiple geographically separate locations, which may not be connected
by very fast communications. However, for some applications it may be necessary
to view the database as a whole, and people working in one district may occasionally
want to look at data in other districts. The disk block caching described in the
previous section has a natural extension to this sort of distributed database environment
(Newell (6)).

It is possible to have a central master database, and then at each remote location
which is accessing the database, have a slave database which stores cached blocks
on disk rather than in memory. This persistent cache will store blocks from one
session to another, and these cached blocks can be accessed by all users working
at this remote site. If users at the remote site typically access a specific subset
of the master database (which could be a geographic subset, such as one of ten
districts of a utility, or a functional subset of the database, such as water network
objects), then the local disk cache will automatically become populated with blocks
containing data from that subset. Updates made at remote sites still go directly
back to the master database.

This persistent cache approach to distributed database further extends the capabilities
described in the previous section, of reducing the load on the central server and
reducing network traffic. It provides a very elegant solution to the distributed
database problem, in that data is automatically distributed to the appropriate
location based on demand. If the usage of the database at a remote location changes
– for example, if the areas managed by different district offices are changed –
then the data cached at the location will automatically change over time as users
look at the new areas.

Once again, this whole approach will only work in a version managed environment
because of the problem of synchronising updates to multiple copies of a block in
a short transaction environment.

The persistent cache approach can also be used in a multi-level way. A district
could have its own large persistent cache, and outlying depots have their own persistent
cache. A request for a block from a user in the depot would first check the cache
on that client machine, then look in the persistent cache on the depot server,
and then the persistent cache on the district machine, before finally looking in
the master server.
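
The sketch below illustrates such a chain of caches (invented names, simplified to
dictionaries in memory; real persistent caches would of course live on disk). Each
level answers from its own cache where it can, passes misses to the level above,
and keeps a copy of whatever it fetches, so each cache becomes populated with exactly
the blocks its users demand.

    class BlockSource:
        """One level in the cache hierarchy: a cache in front of a slower source."""

        def __init__(self, name, upstream=None, master_store=None):
            self.name = name
            self.upstream = upstream            # next level to ask on a miss
            self.master_store = master_store    # only set on the master server itself
            self.cache = {}                     # stands in for a memory or persistent disk cache

        def get_block(self, block_id):
            if block_id in self.cache:
                return self.cache[block_id]
            if self.master_store is not None:
                block = self.master_store[block_id]
            else:
                block = self.upstream.get_block(block_id)   # ask the next level up
            self.cache[block_id] = block        # each level is populated by demand
            return block

    # Master -> district persistent cache -> depot persistent cache -> client memory cache.
    master   = BlockSource("master", master_store={42: b"pipe data"})
    district = BlockSource("district", upstream=master)
    depot    = BlockSource("depot", upstream=district)
    client   = BlockSource("client", upstream=depot)
    client.get_block(42)    # the first request travels to the master and populates every level
    client.get_block(42)    # the second request is satisfied from the client's own cache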

Benefits of Version Management for System Development

Version management can also be extremely useful when doing development, or when
making changes to a production database. Not only can changes be made to data within
a version, but it is also possible to make changes to the data model within a version
without affecting the data model (or data) in any other versions. This is extremely
useful for an application developer, who can test any changes to the data model
against the whole database in their own version, before applying the same changes
to the master version of the database. Similarly, any operations which do large
scale changes to existing data in the production databases can be run in a version
and the results can be thoroughly tested before posting the changes to the master
version. If any problems are found then that version can just be thrown away without
having done any harm to the master version of the database.

Version management is also an important requirement for CASE (Computer Aided Software
Engineering) applications, and indeed other aspects of GIS technology are useful
in implementing a CASE tool, such as the ability to define objects with data which
can be represented graphically, and the ability to maintain complex relationships
between them. Smallworld have used this fact to implement a CASE tool which can
be used to define and modify data models for GIS applications. This uses version
management techniques extensively to help provide functions like the ability to
create alternative versions of the data model, apply these to alternative versions
of a populated database for testing and development, and finally apply the changes
to the master version of the database with a mechanism for propagating the schema
changes down to other versions in the database.

Integration

Integration between GIS and non-GIS data and applications is generally accepted
as being very important in gaining maximum business benefits from the GIS. So although
significant advantages can be obtained by using a new type of DBMS for GIS, it
is also very important that any new DBMS should provide good integration with existing
DBMS’s and applications.

The Smallworld system caters for this by allowing tables to be defined either
in the Smallworld version managed data store or in a standard DBMS such as Oracle
or Ingres. To both users and application developers, access to either type of table
is identical (an example of encapsulation in an object-oriented system). The only
difference is that any updates to tables in a short transaction database
will be immediately visible to all users in all versions of the database, whereas
updates made to the Smallworld VMDS will only be visible in the alternative in
which the change is made. Any distributed functions provided by the external DBMS
can also be used to access other DBMS’s on remote machines. It is also possible
to use SQL gateway products like SequeLink to access remote DBMS’s.
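
The sketch below illustrates the kind of encapsulation described (the class and
method names are invented for the example and are not the Smallworld Magik API):
application code calls the same select operation whether the rows come from a version
managed overlay of changes or from an external DBMS reached through SQL.

    import sqlite3

    class VersionManagedTable:
        """Hypothetical stand-in for a table in the version managed data store:
        the rows seen are the parent's rows overlaid with this alternative's changes."""
        def __init__(self, parent_rows, delta):
            self.parent_rows, self.delta = parent_rows, delta
        def select(self, predicate):
            merged = {**self.parent_rows, **self.delta}    # changes seen only in this alternative
            return [row for row in merged.values() if predicate(row)]

    class ExternalTable:
        """Hypothetical stand-in for a table held in an external DBMS reached through SQL."""
        def __init__(self, connection, name):
            self.connection, self.name = connection, name
        def select(self, predicate):
            cursor = self.connection.execute(f"SELECT id, status FROM {self.name}")
            return [row for row in ({"id": i, "status": s} for i, s in cursor) if predicate(row)]

    def faulty(table):
        # The application uses one interface whichever database actually holds the table.
        return table.select(lambda row: row["status"] == "faulty")

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE valves (id INTEGER, status TEXT)")
    conn.execute("INSERT INTO valves VALUES (1, 'ok'), (2, 'faulty')")
    parent = {10: {"id": 10, "status": "ok"}}
    delta  = {11: {"id": 11, "status": "faulty"}}
    for table in (VersionManagedTable(parent, delta), ExternalTable(conn, "valves")):
        print(faulty(table))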

External applications may also need to access data stored in the GIS. This can
be done by providing an SQL server which allows external applications to query
the data using standard SQL. There are some issues with this approach in that standard
SQL does not support the spatial operators or data types which are supported by
the Smallworld database. However, some standard DBMS’s now include the ability
to add new operators and data types, and the SQL3 standard also addresses these
areas. Nevertheless, there are certainly shortcomings in SQL in this area at the moment.
Another option for providing external access to the GIS data is to provide an API
which allows queries or commands written in the GIS development language (Smallworld
Magik, in the case of the Smallworld system) to be executed by an external application
and to get the results back. This provides the potential for executing more complex
GIS queries such as network traces, which would be very difficult to express in
SQL.

One issue which still requires some careful thought is the exact nature of the
integration between long and short transaction data and how commits in both environments
interact. Although any table can be stored in either the short or the long transaction
database depending on what type of updates are most frequently carried out against
that table, there may be occasions where related updates are required across both
environments. There may be some cases where controlled replication of certain data
in both environments is a valid option.

Acknowledgements

The authors would like to acknowledge the work of the developers of the technology
described in this paper. Mark Easterfield developed the Version Managed Data Store
and most of the concepts explained in this paper. Nicola Terry developed the Data
Dictionary and Gillian Kendrick developed the CASE tool, which were touched on
briefly in this paper and will be revisited in more detail in a future paper. Betty
Barber developed the capability to integrate external DBMS’s with VMDS.

Conclusion

The conclusion of this paper is that, although there are some attractions to using
standard DBMS’s for GIS, an alternative client-server database architecture based
on version management has very significant advantages over traditional DBMS architectures
for GIS. These advantages are firstly in the area of handling long transactions
and managing alternative versions of data, which are critical issues in handling
large multi-user GIS databases, and secondly in the area of performance with large
numbers of users in a networked environment.

The authors have personal experience of implementing both approaches, and believe
that, although progress is being made on some of the shortcomings of standard DBMS’s
for GIS, it is hard to see how they can provide the advantages of the database
architecture described in this paper without some major re-architecting, which
seems unlikely to happen. Thus it is our belief that the best way forward is for
us to continue to develop the database technology described in this paper, and
further enhance its ability to integrate with commercial DBMS’s using SQL standards.
The alternative is to wait for the day when the DBMS vendors provide equivalent
functionality to what we now have.

References

1. Batty P.M. An Introduction to GIS database issues, Proceedings of AGI 92, Birmingham
1992.

2. Seaborn D. 1995: The Year GIS Disappeared, AM/FM XV Conference Proceedings,
San Antonio, April 1992, pp 822-826.

3. Batty P.M. Exploiting Relational Database Technology in a GIS, Mapping Awareness
magazine, July/August 1990.

4. Newell R.G., Theriault D.T. and Easterfield M.E. Temporal GIS – modelling the
evolution of spatial data in time, Computers and Geosciences, Vol 18, No 4, pp 427-433,
1992.

5. Easterfield M.E., Newell R.G. and Theriault D.G. Version Management in GIS:
Applications and Techniques, Proceedings of EGIS ’90, Amsterdam, April 1990.

6. Newell R.G. Distributed Database versus Fast Communications, AM/FM XVI Conference
Proceedings, Orlando, March 1993, pp 647-659.
