Technical Paper 6 – Integrated Data Capture

by David G. Theriault, Nicola Terry, Cees van der Leeden and Richard G. Newell

For the past 15 years GIS users who enthusiastically embarked on grandiose projects
have, with very few exceptions, spent enormous amounts of time and effort mired
in the initial phase of data capture. The organisations worst affected are those
where the capture effort is concerned with the conversion of old documents to a
digital, structured database, e.g. utilities, local authorities and cadastral agencies.
The difficulties arise mainly because inappropriate tools are applied to
the problem. This paper presents an approach in which the problem has been completely
re-examined at four levels. The first consideration is at the project management
level where a proper analysis of goals and targets is required. The next level
is concerned with the rapid development of specific applications in an ergonomic,
integrated object-oriented environment. These applications must be supported by
facilities for rapid take-on of data (e.g. raster-vector integration, interactive
vectorisation tools and multiple grey-level images) and a topologically structured,
seamless spatial database. Finally, powerful workstations with a visually pleasing
window environment will be needed. Benchmarks so far seem to indicate that capture
time may be reduced by a factor of 3 to 5.


Everybody now recognises that one of the largest costs (possibly the largest cost)
of implementing a GIS is data conversion. In particular, the problem of converting
data held on maps into a database containing points, lines and polygons which
are intelligently structured is especially burdensome. Much effort has been
invested in trying to improve the process beyond the traditional methods based
on manual digitizing. These include the use of scanned raster data, automatic conversion
of raster into structured vector, automatic symbol and text recognition and “heads
up” (on-screen) digitizing. All of these have their place, but we will advocate
in this paper that other aspects of a system’s functionality and its structure
can have more dramatic effects in reducing conversion costs in many cases. Our
experiences stem from applying these methods to the capture of utility foreground
data and cadastral polygon data, where automatic approaches based on artificial
intelligence break down.

Putting it another way, it seems that, with the current state of the art of automatic
recognition methods for data conversion, there will always be a large proportion
of documents which can only be interpreted and input by human operators. The exercise
then becomes one of providing the best possible environment for humans to operate in.
This means not just providing much improved user interfaces, but also attending to the
very structure of the system itself. Badly structured, poorly integrated systems contribute
greatly to the costs of conversion.

We contend that data capture is not a separate function carried out in isolation
from the GIS it is to feed; rather, the data capture system itself should embody much
of what one would expect from a modern corporate GIS.

Goals and Targets

It is important at the start of a GIS implementation to identify what it is that
the GIS is intended for. This may be an obvious thing to say, but it is amazing
how in many cases in the past, the conversion exercise has borne no relationship
to any pre-assigned goals. An organisation embarking on GIS can define the business
case for the whole project. This will result in:

  • identifying the end users of the system (including those who might
    never actually sit down in front of a terminal).
  • a detailed breakdown of activities on the system.
  • an outline of possible future requirements.

From the analysis it is possible to know the exact data requirements, including
the coverage required and the level of intelligence required in the data structures.

This business case analysis has not always been done rigorously. The result of
this is that frequently far more is converted than is strictly necessary. It does
not seem very many years since the silly arguments about the relative merits of
raster and vector. Now that discs are cheap and processors and displays
are fast, everybody recognises that a raster representation for a landbase is an
excellent and cost effective way to get started. Raster not only provides a perfectly
adequate representation for producing displays and plots, but also is a good basis
for the capture of intelligent data.

For a utility company, the data may be divided into background and foreground
data. The source of the background data is good quality paper maps from the national
mapping agency (e.g. the Ordnance Survey in the UK and the IGN in France). A complete
digital vector representation of these background maps is usually not available
or is expensive, so we would advocate in this proposal that these maps should be
scanned using a good quality scanner. The cost of producing a scanned background
mapbase is a small fraction (perhaps one tenth) of the cost of producing a vector
background mapbase.

It is thus essential to decide at an early stage which objects in the
real world need to be represented in the system, so that the foreground
capture can be limited to these.

After all, a GIS is no more than a corporate database management system, in which
one of the query and report environments is a map. Thus the same data modelling
and analysis that goes into DBMS implementation needs to be done for GIS implementation.
This data model will determine that which needs to have an intelligently structured
graphical representation and that which does not.

It is essential that goals are set so that a complete coverage of a defined and
useful area can be achieved in about one year to 18 months. If the time frame is
longer than this, the management will get impatient to see results, the maintenance
problem will overtake the primary data-capture and the whole thing will fall into
disrepute. We would go as far as to say, if the organisation cannot commit the
money and resources to achieve this, then it should wait until it can.

Development of Specific Applications

For the purpose of the data conversion project, specific applications for input
and edit need to be developed for each data item. In parallel, it is necessary
to implement and test the end applications of the system. Traditionally in a GIS
project, these tasks have consumed an enormous amount of time and valuable resource.
For sound reasons, the data capture project is postponed until these applications
show robust results. The advances that have been made in data processing and database
management, such as rapid prototyping and 4GL environments, have only recently
surfaced in some commercial GIS systems.

This task of applications development is made much easier if the GIS already has
an extensive library of generic functions which can be applied to the specific
data model implemented. These functions might include generic editors for geometry,
attributes and topology, network functions as well as polygon operations including
buffering and overlay. It is in this area that interactive object-oriented environments
really come to the fore as new functionality can inherit much of the existing functionality
of a generic system as well as providing very rapid prototyping and development.

Integrated System Approach

This section contains the main burden of our argument. In the past, most data
capture was carried out using conventional digitizers on single user systems, one
sheet at a time, in isolation from the GIS. We contend that data capture should be
carried out using on-screen digitizing in a multi-user environment using a continuous
mapbase in the environment of a fully integrated GIS.

Such an approach gives considerable savings in time and cost in making a complete
coverage. The basis of the approach is to eliminate many of the activities which
take time in systems which are not well integrated.

The procedure starts by capturing an accurate background landbase. This may well
be a base constructed of vectors and polygons, if it is available, such as the
data available from the Ordnance Survey. Alternatively, a base constructed of scanned
raster is much cheaper. Scanning at 200 dots per inch is barely adequate for this
purpose and it is worth the marginal additional expense of scanning at a higher
resolution of say 300 or even 400 dots per inch. With modern compression techniques,
400 dpi only takes twice, not four times, the space of 200 dpi. Semi-automatic
vectorisation (see below) is considerably improved at the higher resolutions.
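
The storage argument can be illustrated with a small sketch. It is a toy model
under stated assumptions, not a description of any particular compression scheme:
the sheet dimensions, the synthetic drawing of vertical ink lines and the helper
names (raw_bytes, scan_vertical_lines, count_runs) are all hypothetical. The raw
1-bit size grows with the square of the resolution, whereas the number of
same-colour runs per scan line, a rough proxy for the cost of run-length style
coding of line work, stays the same, so only the number of scan lines grows.

```python
def raw_bytes(width_in, height_in, dpi):
    """Uncompressed size in bytes of a 1-bit scan of a sheet (sizes in inches)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels // 8


def scan_vertical_lines(width_in, height_in, dpi, line_positions_in, line_width_in):
    """Synthetic 1-bit scan of a sheet carrying vertical ink lines (toy drawing)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row = [0] * width_px
    for x_in in line_positions_in:
        start = round(x_in * dpi)
        end = min(width_px, start + max(1, round(line_width_in * dpi)))
        for i in range(start, end):
            row[i] = 1
    # Every scan line is identical for this drawing; the rows are never mutated.
    return [row] * height_px


def count_runs(rows):
    """Total number of same-colour runs over all scan lines."""
    total = 0
    for row in rows:
        runs = 1
        for previous, current in zip(row, row[1:]):
            if current != previous:
                runs += 1
        total += runs
    return total
```

For a hypothetical 2 inch square carrying three thin lines, going from 200 to
400 dpi quadruples raw_bytes but only doubles count_runs. Real maps and real
compression schemes will deviate from this idealised picture.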

[ Figure 1 not available ]

Figure 1: Basic Data Model

The next stage is to input data which can be used to locate objects and mapsheets
as well as providing an essential geometric reference base. This includes such things
as street gazetteers, map sheet locators and street centre-lines. Thus locating
any street or map sheet and getting it onto the screen should be very efficient.
Conversely, the system should be able to identify the mapsheet(s) containing any
point or object indicated on the screen. Some locational data may already exist
in existing company databases. Loading this locational data before the foreground
capture begins can improve overall efficiency.
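
A map sheet locator of this kind can be sketched in a few lines. The sketch
assumes that each sheet is described by a rectangular extent in mapbase
coordinates; the class names MapSheet and SheetLocator, the sheet names and the
coordinates are hypothetical and serve only to show the two lookups described
above: from a sheet name to its extent, and from a point to the sheet(s)
containing it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapSheet:
    name: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, x, y):
        # Edges are inclusive, so a point on a shared edge belongs to both sheets.
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax


class SheetLocator:
    def __init__(self, sheets):
        self._by_name = {sheet.name: sheet for sheet in sheets}

    def extent_of(self, name):
        """Extent to bring onto the screen when a sheet is requested by name."""
        s = self._by_name[name]
        return (s.xmin, s.ymin, s.xmax, s.ymax)

    def sheets_containing(self, x, y):
        """Names of all sheets covering the indicated point, in sorted order."""
        return sorted(s.name for s in self._by_name.values() if s.contains(x, y))


# Two hypothetical adjacent sheets sharing the edge x = 500.
locator = SheetLocator([
    MapSheet("A", 0, 0, 500, 500),
    MapSheet("B", 500, 0, 1000, 500),
])
```

A linear scan over the sheets is used here for brevity; a production system
would more plausibly use a spatial index.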

Finally, the source maps containing the foreground data of interest can be scanned
and the process of on-screen data capture can begin in an environment in which
many of the hurdles in conventional systems are removed. Figure 1 illustrates the
order in which information is assembled.

The following is a discussion of several important issues, including:

  • On-screen digitizing
  • Digitizing across many sheets without interruption
  • Geometry, topology and attribute capture in one integrated GIS environment
  • Multi-user data capture into one continuous database
  • The full facilities of a GIS for validation and running test applications

On-screen Digitizing

The conventional way of digitizing map data is to use a tablet or digitizing table.
While, for certain kinds of data, this is usually of sufficient accuracy, the operator
spends a significant amount of time validating the input on a workstation
screen, which is a separate surface from that on which the digitizing is performed.
Thus it is difficult to check for digitizing errors and also for completeness,
as the two images are not together.

Further, it is common for there to be minor misalignments between a quality background
map sheet and the map sheet on which the foreground data is drawn. With on-screen
digitizing it is possible to do a quick local realignment to compensate for such
misalignments.
The process can be considerably improved with facilities for semi-automatic vectorisation
of the raster data, so that high accuracy can be obtained while working at a relatively
small scale. This reduces the need to zoom in and out frequently as the automatic
vectorising facilities are usually more accurate than positioning coordinates by

It is also necessary to be able to handle greyscale scanned maps in cases where
the source maps are of low quality, or contain a wide variety of line densities,
such as pencil marks on a map drawn in ink. Compression techniques should allow
a large number of greyscale or colour maps to be stored efficiently.

Digitizing Across Many Map Sheets Without Interruption

With conventional digitizing, it is not possible to digitize at one time any object
which lies across a map sheet boundary. Objects such as pipes or electricity cables
frequently cross map sheet boundaries at least once. In these cases, the parts
of the object must be connected together by an additional interactive process.
Ad hoc techniques such as “off sheet connectors” and artificial segmentation of
the object must be avoided.

With on-screen digitizing, the process starts by scanning all (or at least a significant
number of) adjacent foreground maps into the database before digitizing commences.
This, in effect, simulates a single large continuous raster mapbase. Thus the digitizing
of complete objects can be achieved without interruption, as all of the information
about the object is presented together on the screen.

Geometry, Topology and Attribute Capture in one Integrated GIS Environment

It is also necessary for the input, edit and query functions of the GIS to be
available to the operator during data capture. Thus, little time is wasted moving
from one user interface environment to another. Objects with similar attributes
to objects already in the system can use those attributes as a starting template
for modification.

The integration of the digitizing capability with the geometric CAD type facilities
means that constructive geometry can be mixed with digitized geometry at will.
Further, the integration of the geometry capability with the topology facilities
means that most topology can be generated automatically by the system at the time
that objects are positioned on the screen. For example, a water valve placed on
a water pipe causes a node to be inserted in the water pipe, and the correct connectivity
to be inserted in the database.
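
The valve example can be sketched as follows. This is an illustration under
simplifying assumptions, not the data model of any particular GIS: pipes are
straight links between two nodes, and the Network class with its add_pipe,
place_valve and neighbours methods is hypothetical. Placing the valve snaps the
indicated position onto the pipe, inserts a node there and replaces the pipe by
two links that meet at the new node, so connectivity is stored without any
further operator action.

```python
class Network:
    def __init__(self):
        self.nodes = {}   # node id -> (x, y)
        self.links = {}   # link id -> (start node id, end node id)
        self._counter = 0

    def _new_id(self, prefix):
        self._counter += 1
        return f"{prefix}{self._counter}"

    def add_node(self, x, y):
        node_id = self._new_id("n")
        self.nodes[node_id] = (float(x), float(y))
        return node_id

    def add_pipe(self, start, end):
        a = self.add_node(*start)
        b = self.add_node(*end)
        link_id = self._new_id("p")
        self.links[link_id] = (a, b)
        return link_id

    def place_valve(self, pipe_id, x, y):
        """Snap (x, y) onto the pipe, insert a node there and split the pipe."""
        a, b = self.links[pipe_id]
        (ax, ay), (bx, by) = self.nodes[a], self.nodes[b]
        dx, dy = bx - ax, by - ay
        length_sq = dx * dx + dy * dy
        if length_sq == 0:
            raise ValueError("cannot place a valve on a zero-length pipe")
        # Parameter of the closest point on the segment, clamped to its ends.
        t = ((x - ax) * dx + (y - ay) * dy) / length_sq
        t = min(1.0, max(0.0, t))
        valve = self.add_node(ax + t * dx, ay + t * dy)
        del self.links[pipe_id]
        first = self._new_id("p")
        self.links[first] = (a, valve)
        second = self._new_id("p")
        self.links[second] = (valve, b)
        return valve, first, second

    def neighbours(self, node_id):
        """Nodes directly connected to node_id by a link."""
        found = set()
        for a, b in self.links.values():
            if a == node_id:
                found.add(b)
            elif b == node_id:
                found.add(a)
        return found
```

A fuller implementation would also carry the attributes of the original pipe
over to both halves and handle pipes with intermediate vertices.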

The inclusion of gazetteering functions considerably reduces the amount of time
spent locating objects in the database.

Multi-user Data Capture into one Continuous Database

Approaches based on digitizing one sheet at a time by means of single user data
capture systems result in a lot of time being spent merging and managing the many
independent data files generated. The problem becomes particularly acute when edits
are required to data which has already been input.

The system should permit many users to simultaneously update one continuous mapbase.
This can only be achieved efficiently if the system allows any number of versions
and simultaneous alternatives to be held in the database. This results in making
the task of administering many users engaged in simultaneous additions and modifications
to the database much easier.
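
The idea of alternatives can be sketched with a deliberately simple model. It
assumes that the shared mapbase is a dictionary of records and that each
operator works in a private alternative which is later posted back; the
Alternative class and its put, get and post methods are hypothetical and say
nothing about how a real version managed database is implemented.

```python
class Alternative:
    """One operator's private view: a snapshot of the parent plus local edits."""

    def __init__(self, parent_records):
        self._parent = parent_records
        self._base = dict(parent_records)   # state of the parent at branch time
        self._edits = {}

    def put(self, key, value):
        self._edits[key] = value

    def get(self, key):
        return self._edits.get(key, self._base.get(key))

    def post(self):
        """Merge local edits into the parent.

        The merge is refused if another operator has changed one of the same
        objects in the parent since this alternative was created.
        """
        conflicts = sorted(
            key for key in self._edits
            if self._parent.get(key) != self._base.get(key)
        )
        if conflicts:
            raise RuntimeError(f"conflicting edits on: {', '.join(conflicts)}")
        self._parent.update(self._edits)
        self._base = dict(self._parent)
        self._edits = {}
```

Two operators capturing different objects can post in any order without
interfering; only edits to the same object need human attention. Copying the
whole parent at branch time is a shortcut that would not scale to a real
mapbase.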

The overhead of merging and managing independent data files is one of the dominant
reasons why data capture can be very time consuming and costly.

The GIGO myth

A popular cliché these days is the GIGO (garbage in, garbage out) story. Well,
it is not actually true in many cases, or at least the situation may be recoverable.
It is our experience that many of today’s GIS do in fact contain so-called garbage.
We say this because frequently the geometry is good enough, but the
structure of the data is defective: often objects are split across map sheet boundaries
and object attributes are wrong or just plain missing. In other words, the data
cannot be used for the purpose for which it was intended.

In our experience, there is usually enough in these databases to bring the data
up to a satisfactory quality by a process of “laundering”. This is a process where
the data is imported into a GIS environment, automatic processors are applied
to the data to infer missing topology and missing attributes, and a range of validators
is run to detect and highlight problems graphically for a human operator to
deal with. The cost of doing this is usually very much less than the cost of the
original data capture, but it does require a rich GIS environment in which to do
it.
It is particularly useful if the system contains rapid prototyping and development
facilities, since these make the writing of special programs to clean up particular
datasets more feasible. These programs may well be used only once, then thrown
away.
User Interface and Platform Issues

Much of what has been proposed above has only been made possible by the huge improvements
in hardware and software in the last three years. On the hardware front, the drop
in price combined with the huge increases in performance, memory and disk sizes
means that data capture can be carried out in the environment of a large seamless
database, managed by a modern GIS. The improvement in emerging user interface standards,
coupled with the arrival of object-oriented development environments for the rapid
development of custom built user interfaces for particular data capture tasks, is
another contributing factor.


Whereas most effort in the past in improving data capture has been expended on
fully automated methods for converting source map data into a structured data base,
we believe that for many types of documents, the human operator cannot be eliminated,
but the present situation can be considerably improved by providing a fully integrated
environment.
Many of the hurdles which contribute to the cost of data-capture can be eliminated
by several levels of integration including:

  • The integration of source and target images in one database and on one screen
    via the use of raster scanned maps.
  • The integration of all map sheets in one seamless mapbase.
  • The integration of map sheets and locational data.
  • The integration of data capture, validation and GIS functions in one environment.
  • The integration of many users capturing data into one version managed multi-user
    database.

As well as these issues of integration, the application of robust intelligent
techniques, such as semi-automatic vectorisation and automatic topology generation,
further improves the process. Finally, the advent of modern user interfaces and
object-oriented development environments leads to the production of optimised custom
built user interfaces ideally tuned to the task in hand.