Technical Paper No. 1 – Difficult Problems in Building a GIS

Richard G. Newell & David G. Theriault

Abstract

Building a GIS is a fruitful area if one likes the challenge of having difficult
technical problems to solve. Some problems have been solved in other technologies
such as CAD or database management. However, GIS throws up new demands that require
new solutions. This paper examines ten difficult problems, explains why they need
to be solved and gives some indication of the state of the art of current solutions.

Introduction

The subject of Geographical Information Systems has moved a long way from the
time when it was thought to be concerned only with digital mapping. Whereas digital
mapping is limited to solving problems in cartography, GIS is much more concerned
with the modelling, analysis and management of geographically related resources.
At a time when the planet is coming under increasing pressure from an ever more
demanding, increasing population, the arrival of GIS technology has come none
too soon.

However, there is a widespread lack of awareness as to the true potential of
GIS systems in the future. When the necessary education has been completed, will
the systems be there to handle the challenge? It has to be said that the perfect
GIS system has not yet been developed.

Many of the fundamental problems have now been identified, and indeed attempts
(sometimes good ones) to solve them have been embodied in today’s commercially
available systems. However, no system has an adequate solution to all of them
and indeed, some are not addressed by any system (Rhind and Green, 1988).

In this paper we have chosen to examine ten important problems which we believe
need to be solved. Ten is a somewhat arbitrary number since it would have been
easy to find many more. However, one has to draw the line somewhere. Also we
call the problems difficult, as opposed to unsolved, because for most of them,
there is now some light at the end of the tunnel and, indeed, implemented solutions
exist.

The fundamental causes of difficulty in GIS stem from the huge volumes of spatially
related data, the nature of the data itself, and the wide spectrum of different
organisations with diverse application requirements. Problems can be classified
under the following four categories:

  • Data capture – this will be the single biggest cost in implementing a new
    system.
  • Performance – most systems strain under the volume of data and scale of
    the tasks invoked.
  • Customisation – every organisation is different; no black box system is
    going to be adequate.
  • Integration – today’s systems are disjoint, in the future data and functions
    will need to be provided in one seamless environment.

Our ten problems are really different manifestations of these four and so the
rest of this paper is devoted to a commentary on each of them in turn. As most
organisations are at the data capture stage, it is not surprising that this is
the area where most work has been done in trying to improve the process.

The Topology Capture Problem

The lack of a really good solution to this problem is probably going to contribute
most to the slow speed of implementation of GIS systems in the future. Although
much progress has been made using scanning and vectorisation technology to capture
map geometry (Nimtz, 1988), the addition of feature information, the generation
of meaningful topology and the association of user data involves a large human
element. Some modern topological editors do help to ease the burden (Visvalingam
et al, 1986), but essentially, the only solution is to throw man-hours at it
(Chrisman, 1987). Thus for some time we will be seeing expedient solutions using
raster backdrops without topology.

Large Data Volumes

The volume of data required to represent the resources and infrastructure of
a municipality is of the order of 0.5 to 1 gigabyte per 100,000 of population;
supporting examples exist, such as an urban region which would require 14 million
points simply to represent the map planimetry (Bordeaux, 1988). Not only are
the municipal records required to be handled in the database, but also large
amounts of map data in the form of coordinates, lines, polygons and associations
of these. Today’s database technology is barely up to the task of allowing the
handling of geographic data by large numbers of users with adequate performance.
Serious questions have been raised as to whether the most popular form of database,
the relational model, will be able to handle the geometric data with adequate
response. Certainly, if this data is accessed via the approved route of SQL calls,
the achievable speed is orders of magnitude less than that which can be achieved
by a model structure built for the task (Frank, 1988).

A common approach to this problem is to partition the database into separate
map sheets. Frequently this is achieved by interfacing a sheet based CAD drafting
system to a database (but see continuous mapping below). Another approach for
avoiding the shortcomings of commercially available database systems is to build
a special purpose database system to handle the graphics, which lives alongside
the commercially available database system. A third approach, based on checking-out
subsets of the database, followed by checking-in the changes, can overcome the
problem of reviewing and editing (but see continuous mapping and version management
below).

The idealistic solution of implementing one database system to handle both kinds
of data with optimum performance is a large undertaking, and even then does not
satisfy the user who has already committed to an existing commercially available
database.

Continuous Mapping – the seamless database

As one begins to build a GIS, the problem of large data volumes appears to be
the most important one to solve. A solution in the form of the map sheet approach,
chosen by most systems today, simply mirrors what happens with physical map sheets.
This is then combined with map management techniques to try to hide the cracks.
While this partitioning helps to solve the first problem, it destroys the end
user’s dream of a single, continuous, seamless landscape. At best this may be
termed contiguous mapping, but in all cases, such schemes lead to difficulties
when one is trying to deal with objects which span more than one map sheet. The
problem of dealing with variable density data, for example between urban and
rural areas, is tackled in an ad hoc manner as there is no single size of map
sheet to yield optimal performance in all cases.

The reason that a map sheet approach is popular is that it can easily be implemented
with a conventional filing system and it is straightforward to find which files
should be searched for a given spatial query. However, the system can hardly
be called integrated, since the geometry, topology and the relationship between
graphic and application data are all handled in an ad hoc manner.

What is now demanded is that map data is handled just as it is in the real world:
truly continuously, with no cracks. But this must be done so that spatial queries
of the form “inside an area” or “near to an object”
can be handled efficiently. A common query is of the form “conforming to
such-and-such and is on the screen”, i.e. filtering out the rest of the world.
Normal indexing methods provided with conventional database systems are inadequate
to answer this type of query, and so a variety of spatial indexing methods have
been invented to try to solve this problem.

The general idea is to construct a tree structured index (e.g. quad tree, R-tree,
field tree), the leaves of which are not disjoint: they overlap in two-dimensional
space, and this accommodates objects of different sizes at different levels. In
this respect, the quad tree (Samet, 1984) is beautifully simple but has serious
deficiencies for large data volumes. The R-tree (Guttman, 1984) is rather
more complicated but is more efficient. Systems which are optimal for retrievals
are not necessarily optimal for updates and vice versa. As data is read far more
often than it is written, the emphasis should be on providing maximum retrieval
performance, while still maintaining an acceptable rate of interactive update.

However, given today’s disk technology, it matters little how efficient the
index is if the data is scattered on disk. Thus it is essential that data is
clustered on disk so that spatially similar data have similar addresses (Bundock,
1987). This can be achieved in the relational and other tabular models provided
that data is stored ordered by a properly generated spatial key. There are many
good ideas in this area, but it remains to be seen which are the best.
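
As an illustration of what such a spatial key might look like (a sketch only, not
the scheme of any particular system), the following Python fragment derives a key by
bit-interleaving the two coordinates, so that records sorted by the key tend to sit
near their spatial neighbours on disk. The coordinate bounds and the 16-bit grid are
arbitrary assumptions, as are the record names.

    # Sketch: a spatial key generated by bit-interleaving (Morton order) so that
    # records stored in key order are clustered by location on disk.
    def morton_key(x, y, x_min, y_min, x_max, y_max, bits=16):
        """Map a coordinate to an interleaved integer key; nearby points
        usually (though not always) receive numerically close keys."""
        # Scale each coordinate into an integer cell index in [0, 2**bits - 1].
        cx = int((x - x_min) / (x_max - x_min) * ((1 << bits) - 1))
        cy = int((y - y_min) / (y_max - y_min) * ((1 << bits) - 1))
        key = 0
        for i in range(bits):
            key |= ((cx >> i) & 1) << (2 * i)       # even bit positions from x
            key |= ((cy >> i) & 1) << (2 * i + 1)   # odd bit positions from y
        return key

    # Records stored ordered by this key place spatially similar data at
    # similar disk addresses, as suggested in the text.
    records = [("manhole_17", 532100.0, 181450.0), ("lamp_3", 532110.0, 181460.0)]
    records.sort(key=lambda r: morton_key(r[1], r[2], 530000, 180000, 540000, 190000))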

Accommodating Existing Databases

There are a number of shortcomings in existing available database technology
for the building of a GIS. Nevertheless, there are now many organisations which
have committed in a big way to one of the emerging de-facto standard database
systems. It is not acceptable for a GIS vendor to try to displace such a database
with something else tuned for GIS applications. It is necessary to engineer a
solution which preserves the users’ investment while at the same time doing as
good a job as possible in providing a GIS capability. As mentioned above, if
one tries to handle all data in the commercial DBMS, then it is highly likely
that a serious performance problem will result. If one runs a geometric DBMS
alongside, then serious problems of backout, recovery and integrity may result.
It would seem that what is needed is some “virtual DBMS” which can
act as a front end to two or more physical DBMSs, and that this should have a
knowledge of all aspects to do with data dictionary, integrity, and access to
the various data.
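
A minimal sketch of this idea follows, with entirely hypothetical class names: a
front end that owns the data dictionary and routes each table access to whichever
physical store has been registered for it. Real back ends would of course also have
to cooperate on locking, recovery and integrity.

    class AttributeStore:
        """Stand-in for a commercial DBMS holding ordinary records."""
        def __init__(self):
            self.tables = {}
        def fetch(self, table, key):
            return self.tables.get(table, {}).get(key)
        def put(self, table, key, row):
            self.tables.setdefault(table, {})[key] = row

    class GeometryStore(AttributeStore):
        """Stand-in for a special purpose store holding geometry."""
        pass

    class VirtualDBMS:
        """Front end holding the data dictionary and routing each access."""
        def __init__(self):
            self.dictionary = {}                  # table name -> physical store
        def register(self, table, store):
            self.dictionary[table] = store
        def fetch(self, table, key):
            return self.dictionary[table].fetch(table, key)
        def put(self, table, key, row):
            self.dictionary[table].put(table, key, row)

    db = VirtualDBMS()
    db.register("parcel_attributes", AttributeStore())
    db.register("parcel_boundaries", GeometryStore())
    db.put("parcel_attributes", 101, {"owner": "Anytown Council"})
    db.put("parcel_boundaries", 101, [(0, 0), (10, 0), (10, 8), (0, 8)])
    print(db.fetch("parcel_attributes", 101), db.fetch("parcel_boundaries", 101))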

Version Management – the problem of the long transaction

It is necessary that large databases can be accessed simultaneously by multiple
users. The normal assumption in a commercial DBMS is that a transaction is a
short term affair, and that a locking system can be used to keep users from unintentionally
interfering with one another. Such methods are used to preserve the integrity
of the database at all times. As far as any particular user is concerned, there
are two versions of the database of which he is aware: the version when he started
his transaction and the version he can see at the instant prior to committing
it. The problem is that when a user is in mid-transaction in such systems, all
other users are locked out of doing certain things, particularly updating any
records that have been locked. This is all fine, provided that the transactions
are short.

However, in a GIS system, as in design database systems in general, a transaction
can be very long indeed, possibly days or even weeks. A planner for example may
take a considerable time to complete his work before he gets it approved and
therefore available to the other users of the system. In these circumstances
it is totally unreasonable to expect other users to avoid the area.

Copying the whole database is completely impractical in many cases. A system
of check-out, check-in can be used, but this gives rise to the problem of achieving
integrity through combining incompatible changes in different checked-in versions.
It is quite possible for two separate sets of changes to have integrity, but
for the combination to lack it.
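
The mechanical part of such a scheme can be sketched as below: a check-in is refused
when another user has already checked in a change to the same object since the first
user's check-out. Note that this bookkeeping only catches direct conflicts; the deeper
problem described above, where two individually consistent change sets combine into
an inconsistent whole, is not solved by it. All names and structures are invented for
illustration.

    class VersionedStore:
        def __init__(self):
            self.objects = {}        # object id -> (value, revision number)
        def check_out(self, ids):
            """Return a private working copy plus the revisions it was based on."""
            return {i: self.objects[i] for i in ids}
        def check_in(self, working_copy, changes):
            """Apply changes, or report objects changed by someone else meanwhile."""
            conflicts = [i for i, (_, base_rev) in working_copy.items()
                         if i in changes and self.objects[i][1] != base_rev]
            if conflicts:
                return conflicts     # caller must resolve before committing
            for i, value in changes.items():
                _, rev = self.objects[i]
                self.objects[i] = (value, rev + 1)
            return []

    store = VersionedStore()
    store.objects = {"pipe_9": ("steel", 1)}
    planner_a = store.check_out(["pipe_9"])
    planner_b = store.check_out(["pipe_9"])
    print(store.check_in(planner_a, {"pipe_9": "plastic"}))   # -> [] (accepted)
    print(store.check_in(planner_b, {"pipe_9": "copper"}))    # -> ["pipe_9"] (conflict)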

Users who are starting the implementation of a GIS system may be blissfully
unaware of this problem. To start a project, most organisations will initiate
a pilot study: a simplified data-model, only one or two users of the system and
test data that can be thrown away if the pilot database becomes corrupted.

The situation in a production environment is the opposite of the above: a complex
data-model, a large number of users and concurrent activities, and a large investment
in the creation of the database. The problem of version management could dominate
all database activities in a production system and yet does not manifest itself
in a pilot project.

Versioning is not addressed by today’s round of DBMSs. The complete problem
covers the management of alternatives, the management of chronology (what did
my database look like three months ago?), the management and policing of change,
and various forms of semantic versioning such as time series, variant designs
and “as built” records.

Hybrid Vector Raster Database

It is an amusing paradox that whereas the bottle-neck with vector based GIS
is in capturing the map base, the problem with remotely sensed data is the sheer
volume of it that pours down upon us from the heavens, a veritable fire-hose
of data. For example, SPOT-1, at 50 Mbits/second, captures an area 60 km x 60 km in
9 seconds. Increasing the resolution to 5 metres would mean boosting transmission
to 200 Mbits per second for every hour of daylight.
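
As a back-of-envelope check of these figures (assuming the quoted 50 Mbit/s stream
corresponds to 10 metre pixels, the SPOT-1 panchromatic resolution), halving the
pixel size quadruples the number of pixels per scene and hence the data rate:

    # Illustrative arithmetic only; the 10 m baseline resolution is an assumption.
    current_rate_mbit_s = 50
    current_pixel_m, finer_pixel_m = 10, 5
    scale = (current_pixel_m / finer_pixel_m) ** 2   # 4 times as many pixels
    print(current_rate_mbit_s * scale)               # -> 200.0 Mbit/s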

Whereas the value of an attribute-less map base is limited, the value of satellite
data is potentially immense, for here surely we can observe and monitor the catastrophic
way in which we humans are changing and damaging this paradise in which we live.
Green issues aside, the vector people are still struggling with data capture
while the image processing people have a major computational problem with the
sheer bulk of data.

These two forms must come together in one system since the combination will
yield far more than the parts. Currently vector based GIS systems tend to be
separate from raster based image processing systems. Raster data is derived from
a variety of sources e.g. rasterized paper maps (from scanners), grid data derived
from digital terrain models and remotely sensed data (from satellites).

Many applications such as overlay analysis (described below), recording changes
in land use and changes in the environment can be tackled much more effectively
when the two forms of data can be manipulated, viewed and analysed in one integrated,
seamless environment.

Overlay Analysis – a problem of robustness

Given a multi-thematic map base, overlay analysis is used to answer questions
of the type: “show me all areas with sandy soil and agricultural land use,
where the ground slope is less than 5 degrees because I want to grow potatoes”.

The idea behind overlay analysis is very neat: one starts by producing a single
combined topological model of all the themes which are to contribute to the analysis.
Producing the combined topology is both a geometric as well as a topological
problem. Once it has been produced, all queries and analyses can be carried out
wholly in the topological domain without any further reference to the geometry
and are therefore amenable to be tackled via conventional query languages such
as SQL (Frank 1987).
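
The attraction can be seen in a small sketch: once each face of the combined topology
carries the attributes of every contributing theme, the potato query above reduces to
a plain attribute filter with no further geometric work. The face records, attribute
names and values below are, of course, hypothetical.

    # Each face of the (already computed) combined overlay carries one attribute
    # per contributing theme, so spatial analysis becomes attribute filtering.
    combined_faces = [
        {"id": 1, "soil": "sandy", "land_use": "agricultural", "slope_deg": 3.0},
        {"id": 2, "soil": "clay",  "land_use": "agricultural", "slope_deg": 2.0},
        {"id": 3, "soil": "sandy", "land_use": "forest",       "slope_deg": 8.0},
    ]

    suitable = [f for f in combined_faces
                if f["soil"] == "sandy"
                and f["land_use"] == "agricultural"
                and f["slope_deg"] < 5.0]
    print([f["id"] for f in suitable])      # -> [1]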

The problem is that producing the combined topology robustly is far harder
than it seems at first sight. Anyone who has wrapped their brain around the problem
of producing a robust solid modeller will know the kind of pathological cases
that can arise. Fortunately, the problem is not quite as bad as producing a robust
solid modeller; however, any system implementor who underestimates the difficulty
of this problem had better beware – schoolboy geometry just isn’t good enough.

But robustness is not the only problem. It turns out that not infrequently two
or more themes may share the same boundary curve. Typically these will have been
digitized separately and with different accuracies, therefore their representations
will be different in detail. If two such themes are now combined, then many superfluous
small polygons (called slivers) will be produced. Slivers can easily outnumber
the real polygons leading to an explosion in data volumes, messy graphics, a
collapse in efficiency and questionable analysis. Sliver removal is just another
of those irritating issues which must be addressed if a practical system is to
result.
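
One simple and commonly used test flags candidate slivers by their compactness (area
relative to perimeter); the sketch below is illustrative only and its threshold is
arbitrary. A production system would more likely prevent slivers in the first place
by snapping nearly coincident boundaries together within a tolerance before overlay.

    import math

    def polygon_area_perimeter(points):
        """Shoelace area and perimeter of a simple polygon given as (x, y) tuples."""
        area, perim = 0.0, 0.0
        for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
            area += x1 * y2 - x2 * y1
            perim += math.hypot(x2 - x1, y2 - y1)
        return abs(area) / 2.0, perim

    def is_sliver(points, compactness_threshold=0.01):
        """Flag polygons that are extremely thin relative to their length."""
        area, perim = polygon_area_perimeter(points)
        compactness = 4.0 * math.pi * area / (perim * perim)   # 1.0 for a circle
        return compactness < compactness_threshold

    # A long, hair-thin quadrilateral of the kind mismatched boundaries produce:
    print(is_sliver([(0, 0), (100, 0.0), (100, 0.05), (0, 0.05)]))   # -> True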

In any analysis, it is important to try to make some assessment of the accuracy
of the results. In practice, each theme of data will contain inaccuracies and
errors, the bounds of which may be known. The consequent effects of these individual
errors on the results from many combined themes may be impossible to gauge (Walsh
et al, 1987); overlay analysis systems available today should therefore carry the
caveat: “let the user beware”. (As an aside here, the whole subject of quality meta data,
including such aspects as source, accuracy and coverage would be a good candidate
for the eleventh problem.)

Overlay analysis is one of those techniques which is mind-blowingly tedious
to do without a GIS. However, without robust clean algorithms there is a grave
risk that the results produced will be meaningless.

Very Large Polygons

When Slartibartfast designed the coast of Norway (The Hitch Hiker’s Guide to
the Galaxy, Adams, 1979), he was really setting a major challenge for the designers
of Scandinavian GIS systems. He probably also designed the Stockholm Archipelago
with its 23,000 islands as a challenge to those trying to do spatial analysis
in the Baltic. Whereas very large linear features can be straightforwardly subdivided
into manageable pieces, very large polygonal areas, including those with many
embedded islands are not amenable to the same techniques.

However, one doesn’t need to travel to scenic northern regions to encounter
this problem, since very large polygons can occur in everyday applications. The
GIS designer is confronted with the problem of doing spatial analysis on small
parts of such polygons, as well as answering questions to do with inside and
outside, without processing the millions of points which there may be in such
polygons.

As with very large databases, the goal is to produce systems whose performance
is dependent only on the data of interest and not to be burdened with the irrelevant.
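
One way to approach this, sketched below with an arbitrary tile size and invented
names, is to bucket the edges of a large polygon into grid tiles so that a query over
a small window touches only the handful of edges in the tiles it covers. Answering
inside/outside questions locally needs a little more (for instance recording, per
empty tile, whether its interior lies inside or outside the polygon), which is
omitted here.

    from collections import defaultdict

    def build_edge_index(points, tile_size=1000.0):
        """Map tile coordinates -> edges ((x1, y1), (x2, y2)) touching that tile."""
        index = defaultdict(list)
        for p, q in zip(points, points[1:] + points[:1]):
            x_lo, x_hi = sorted((p[0], q[0]))
            y_lo, y_hi = sorted((p[1], q[1]))
            for tx in range(int(x_lo // tile_size), int(x_hi // tile_size) + 1):
                for ty in range(int(y_lo // tile_size), int(y_hi // tile_size) + 1):
                    index[(tx, ty)].append((p, q))
        return index

    def edges_in_window(index, x_lo, y_lo, x_hi, y_hi, tile_size=1000.0):
        """Candidate edges for a query window, without scanning the whole polygon."""
        found = set()
        for tx in range(int(x_lo // tile_size), int(x_hi // tile_size) + 1):
            for ty in range(int(y_lo // tile_size), int(y_hi // tile_size) + 1):
                found.update(index.get((tx, ty), []))
        return found

    # A coarse square "coastline" indexed once; a small window sees few edges.
    coast = [(0.0, 0.0), (50000.0, 0.0), (50000.0, 50000.0), (0.0, 50000.0)]
    idx = build_edge_index(coast)
    print(len(edges_in_window(idx, 100.0, 100.0, 900.0, 900.0)))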

The Front End Language – the seamless programming environment

A problem with GIS customers, seen from the system vendor’s viewpoint, is that
they are all different. Each customer will have a different data-model and will
want substantial enhancements to his user interface and functionality. Whereas
there are good facilities for data-modelling, the tools for developing and customising
today’s systems leave a lot to be desired. Much effort has gone into toolkits
for developing and customising systems including standard graphics libraries,
user interface managers, data managers, windowing systems etc. However, if one
wishes to get in and program any of the systems, one is confronted with operating
system commands, Fortran, C, SQL, embedded SQL, some online command language,
domain specific 4GLs or a combination of these; not to mention auxiliary “languages” to
define user syntax, menus, data definitions, data dictionaries etc. With these
kinds of programming tools it can take many man-months of skilled programmer
effort to achieve even modest system customisation.

A common problem with systems whose parts are front ended by different languages
is that it is not possible to integrate them properly. For
example, a graphics system for mapping, which is “hooked into” a
database, typically does not allow the full power of the database to be accessed
from within the graphics command language, nor can the power of the graphics
system be invoked from within the database query language. What is really needed
is a system such that all data and functions can be accessed and manipulated
in one seamless programming environment (Butler, 1988).

What has been shown by a number of organisations is that the same development
carried out with an on-line object oriented programming language can cut such
development times by a very large factor (e.g. 20). Object orientation does not
just mean that there is a database with objects in it, but that the system is
organised around the concept of objects which have behaviour (methods). Objects
belong to classes which are arranged in a hierarchy. Subclasses inherit behaviour,
and communication between objects is via a system of message passing, thereby
providing very robust interfaces.
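
A toy sketch of this style (the classes and methods are invented for illustration):
subclasses inherit or specialise behaviour, and the calling code sends the same
“message” to any map object without knowing its concrete class.

    class MapObject:
        def __init__(self, name):
            self.name = name
        def describe(self):
            return f"{self.__class__.__name__} '{self.name}'"

    class Road(MapObject):
        def __init__(self, name, surface):
            super().__init__(name)
            self.surface = surface
        def describe(self):                      # specialised behaviour
            return super().describe() + f", surface {self.surface}"

    class WaterMain(MapObject):
        pass                                     # inherits describe() unchanged

    for obj in (Road("High Street", "asphalt"), WaterMain("Main 42")):
        print(obj.describe())                    # same message, different behaviour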

Firstly, such a language should be able to provide facilities covering the normal
requirements of an operating system, customisation, applications programming
and most systems programming. Secondly, the language should have a friendly syntax
that allows casual users to write simple programs for quick online customisation.
Larger programs should be in a form that is easily readable and debuggable. Thirdly,
the language must be usable in an interactive mode as well as being able to be “supercharged” by
means of full compilation.

There are several languages around which satisfy some of these requirements:
Basic is all right for simple programs (but look at what some people now achieve
in the micro world with some of the more advanced modern Basics); Lisp has much
of the power and speed, but is hardly readable (but just see what Autocad has
achieved with Autolisp). Smalltalk has both speed and object orientation, but
with the total exclusion of any other programming construct. Postscript is very
widely used and has a number of the desired features, but is another write-only
language (i.e. “unreadable” by anyone, including the original programmer).
Hypertalk is wonderful, but you would not write a large system in it. C++ has
much of the required syntax and semantics, but it is not available as a front
end language and can therefore only be accessed by a select few system builders,
normally employed by the system vendor.

Having dismissed most of the well known languages developed during the last
30 years, what then is required? It is an on-line programming language, with
a friendly Algol-like control structure and the powerful object oriented semantics
of languages like Smalltalk.

The Query Language

Modern query languages such as SQL are not sufficient in either performance
or sophistication for much of the major development required in a GIS system
– but then one would argue that they were not intended for this. One can see
why people like SQL; it can give immense power in return for some fairly simple “select”
constructs. A problem which has to be addressed is spatial queries within the
language, since trying to achieve this with the standard set of predicates provided
is extremely difficult and clumsy. (An example of a spatial query is to select
objects “inside” a given polygon.)
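
As a sketch of what such a predicate might look like when it is an ordinary function
of the front end language (so that it can be mixed freely with attribute tests in a
single query expression), the following uses a simple ray-crossing test; the district
polygon and hydrant records are hypothetical.

    def inside(point, polygon):
        """True if point lies inside the simple polygon (list of (x, y) vertices)."""
        x, y = point
        crossings = 0
        for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
            if (y1 > y) != (y2 > y):                            # edge spans the ray
                x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                if x_cross > x:
                    crossings += 1
        return crossings % 2 == 1

    district = [(0, 0), (10, 0), (10, 10), (0, 10)]
    hydrants = [{"id": 1, "pos": (5, 5), "status": "ok"},
                {"id": 2, "pos": (15, 5), "status": "ok"}]
    print([h["id"] for h in hydrants
           if h["status"] == "ok" and inside(h["pos"], district)])   # -> [1]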

If the route adopted is to provide two databases in parallel, a commercial one
driven by SQL and a geometry database to hold the graphics, then there is a problem
constructing queries that address both databases. Ideally, the query language
should be a natural subset of the front end language allowing access to the same
seamless environment that the front end language provides. Much work needs to
be done in the area of query languages for GIS.

Conclusion

Ten years ago the evolution of CAD systems was at about the same stage that
the evolution of GIS systems is now. One can recall the arguments over whether
mechanical design systems should be 2D or 3D; what was it that allowed a 3D system
to claim to be a solid modeller? The purists endlessly discussed CSG versus BReps;
Euler’s rules had to be obeyed at all costs – while in the meantime the faceted
modeller brigade stole the show. The surface designers insisted that C2 was a
minimum and some pronounced that C7 was essential! Your surface system was definitely
not macho unless it was founded on NURBS – even though the humble conic can adequately
handle 95% of practical problems. Today, most of the world seems to be happy
hacking away on a fairly modest little drafting system called Autocad.

So it is now with GIS: What is a GIS and what isn’t a GIS? Do the ever longer
synonyms for GIS say anything new? Should the system be based on raster or vector?
One of us recently asked a vector bigot what percentage of spatial analysis problems
can be tackled in the raster domain – back came his answer: “None!”.
Arguments about topological models are rife while protagonists argue about what
are the acceptable ways to break the strict rules laid down for data-modelling,
in order to achieve a working system.

Let us beware in GIS: too much concentration on how things are done, as opposed
to what we are trying to do, will lead to much wasted time and to wrong decisions
being made. It is all very well for the salesman to try to identify unique selling
features in his product in order to exclude another vendor from the sale, but
let us learn from the experience of CAD when this is happening. It is important
to identify the real application requirements of a system; meeting these can be
crucially dependent on that system having strong fundamentals. These can be exposed by
careful benchmarking (McLaren, 1988).

In this paper we have attempted to lay bare ten difficult problems in GIS. In
fact they are not wholly independent problems; indeed, some are a consequence
of trying to solve another. The encouraging thing is that most of them have now
been identified, and indeed for many there are some promising solutions.

Acknowledgements

The authors gratefully acknowledge Robin McLaren for reading the draft of this
paper and thank him for his thoughtful comments.

References

Adams, Douglas (1979). The Hitch Hiker’s Guide to the Galaxy, Pan Books.

Bordeaux, La Communauté Urbaine de (1988). Programme du Concours pour un système
interactif graphique de gestion de données urbaines localisées.

Bundock, Michael (1987). Integrated DBMS Approach to Geographical Information
Systems, Auto Carto 8.

Butler, R. (1988). The Use of Artificial Intelligence in GIS, Mapping Awareness
and Integrated Spatial Information Systems, Vol. 2, No. 3.

Chrisman, N. R. (1987). Efficient digitizing for error detection and editing,
International Journal of Geographical Information Systems, Vol. 1, No. 3.

Frank, Andrew U. (1987). Overlay Processing in Spatial Information Systems,
Auto Carto 8.

Frank, Andrew U. (1988). Requirements for a Database Management System for a
GIS, Photogrammetric Engineering and Remote Sensing, Vol. 54, No. 11.

Guttman, Antonin (1984). R-Trees: A Dynamic Index Structure for Spatial Searching,
Proceedings of the ACM SIGMOD International Conference on Management of Data.

McLaren, Robin A. (1988) Benchmarking within the GIS Procurement Process, Mapping
Awareness and Integrated Spatial Information Systems, Vol. 2, No. 4.

Nimtz, H. (1988). Scanning und Vektorisierungs Software zur Erfassung von Katasterplanwerken,
Proceedings Automated Mapping/Facilities Management European Conference IV.

Rhind, D. W., Green, N. P. A. (1988). Design of a Geographical Information System
for a Heterogeneous Scientific Community, International Journal of Geographical
Information Systems, Vol. 2, No. 2.

Samet, H., (1984). The Quadtree and Related Hierarchical Data Structures, ACM
Computing Surveys, Vol. 16, No. 2.

Visvalingam, M., Wade, P., Kirby, G. H. (1986). Extraction of area topology
from line geometry, Proceedings Auto Carto London.

Walsh, Stephen J., Lightfoot, Dale R., Butler, David R. (1987). Assessment of
Inherent and Operational Errors in Geographic Information Systems, Technical
Papers, 1987 ASPRS-ACSM Annual Convention, Vol. 5.
