Technical Paper 4 – Version Management in GIS – Applications and Techniques

by Mark E. Easterfield, Richard G. Newell & David G. Theriault

Abstract

The majority of today’s GIS systems adopt an approach to multi-user working which
is based upon the standard model underpinning conventional commercial RDBMS systems.
This model is appropriate in applications where only the most recent state of the
database is of interest, but it does not work well over long transactions, and
is inappropriate for manipulating multiple versions. Many GIS activities involve
transactions lasting days or weeks, and others could benefit from the ability to
develop, compare, and retain or merge different versions of data. This paper proposes
an alternative approach to multi-user working, based around deferred detection
of conflict. The paper reviews a number of benefits which the approach brings,
such as multi-user data capture, developing and assessing alternative designs,
and the ability to include external mapbase updates over time. The paper concludes
by considering the features required of the underlying system in order to support
the approach.

Introduction

The majority of today’s GIS systems adopt an approach to multi-user working based
on the one designed for conventional data processing requirements. This approach
is based on the principal that it is only permissible to make changes to the most
up to date version of the data, a rule which is satisfied by a combination of locking
and enforced rollback.

This rule is entirely appropriate in situations where changes tend to be small,
where the data represents a single state, and where conflict must be detected at
the earliest opportunity. A good example is a ticket reservation system, where
it is vital to know with absolute certainty whether a seat has been sold, or is
still available.

The approach is much less appropriate to most GIS applications, where changes
may be substantial, and be self consistent only over a considerable period of time,
and where it may be legitimate to construct alternative versions of data before
deciding upon one or another to represent the real world. However, most of the
GIS products available currently are locked into the conventional model, as they
tend to be heavily dependant upon standard commercial DBMS offerings, which are
designed for the DP environment.

The particular deficiency of conventional DBMS systems concerned with their inability
to handle long transactions has been recognised, particularly where graphical data
is concerned. However, attribute data is still generally managed in the traditional
manner, and there are problems in maintaining consistency between the two forms
of data. Currently, there is little support for the management of alternative versions
of data.

This paper examines the benefits of a data management system which supports alternative
versions. It illustrates how such a system also provides a much more appropriate
multi-user model in the GIS environment, and provides additional spin-off benefits
in a distributed environment.

The paper goes on to consider the requirements of such a system, independent of
its implementation, and concludes with one possible implementation strategy.

[ Figure 1 not available ]

The Benefits Of Enabling Alternative Versions

Traditional database technology has encouraged a view of the world, as represented
in the database, in which there is a single, canonical state. It is possible to
modify the database, from one state to the next, but it is not permissible to have
multiple canonical states simultaneously.

At first sight this seems most reasonable there is only one true state of the
world (in real life). Furthermore, it precludes incompatible changes from being
made in a very simple way there need never be any doubt about the logical integrity
of the data. The approach is good at representing a static past, or a present which
evolves in small, simple, discrete steps.

However, many applications are actually used to model the future, or to model
a loose form of the present in which changes in the real world are not represented
as a continuous series of small transitions but rather as a few discreet modifications,
each of which is substantial.

When modelling the future, the ability to represent alternatives is of great value.
For example, when planning a new development, a number of different possibilities
can be reviewed and costed before one is chosen. A system which permits alternative
models of the (future) world to co-exist independently of one another must also
inherently permit these alternative views to remain separate from the accepted
view of the present, so that if there is a canonical view of the world it is not
affected by “what if”s.

This requirement to keep a canonical data set apart from changes to it also occurs
during customisation. In this case, the alternatives do not necessarily represent
real futures at all, but simply experiments to check that the customisation work
does what is intended. Traditionally, such verification is performed on a small
subset of the data, which has been copied out of the master database. However,
this approach tends to conceal problems of scale until the customisation is released,
when it is too late to rectify them. Using alternatives, customisations can be
tested on the real database, highlighting problems immediately, but without interfering
with users in any way.

When modelling the present, there is benefit in being able to defer making individual
changes visible until a complete set of changes is in a logically consistent state.
In the case of adding a newly acquired data set, or of making substantial changes
to the use of a piece of land, say, the data may remain inconsistent for some time.
The type and duration of change is more akin to that in a CADCAM environment, rather
than the small transitional changes encountered in DP. In an environment which
provides alternatives, the canonical state represents one possibility, and the
new state resulting from the modifications represents another. The new state can
remain hidden from general view for as long as is required, and only released when
the whole change is considered complete.

Looking further afield, there are other advantages in an environment which supports
alternatives. If a solution has been found to the problem of merging independently
developed versions (which is fundamental to the entire alternative strategy), then
it is also possible to contemplate merging changes which have occurred entirely
outside the particular database.

One example is that of the map base vendor, who issues changes to the map base
from time to time to a set of customers. Each customer may have wished to modify
his existing base over time, to keep it in step with the real world. At the time
the new base is issued, some of the changes will have ceased to be relevant, because
they have now been incorporated by the vendor, but others, for reasons of timing
or of the nature of the data, will still be required. In such a situation, it is
necessary to be able to identify the differences from the original base, and to
re- apply those modifications which are still relevant to the new base.

This is illustrated in Figure 1, where the act of bringing the map base up to
date would actually prove undesirable if allowed to proceed in an uncontrolled
manner. The alternative management approach permits the user’s foreground to be
reconciled with the new map base in a controlled fashion, so that day to day users
of the data never see a dangerous and misleading combination.

A second example where the comparison and merging of disjoint versions is of value
is in a distributed database environment. To be truly distributed, parts of the
database must be able to function whilst they are cut off from other parts. This
necessarily demands that changes must be able to be made which cannot be wholly
checked for logical consistency until all the database is visible again. An approach
which accepts the mutual coexistence of different versions, with consistency validation
deferred until late in the day, makes this feasible.

An approach which allows alternative (and therefore necessarily conflicting) versions
of the state of the database to exist is mutually exclusive to an approach which
guarantees a single, fully consistent state. In many DP applications, the latter
approach is absolutely necessary. We contend, however, that in GIS applications,
trans- actions are naturally long, concurrent alternative scenarios are desirable,
and that in practice mutually conflicting changes are either rare or wholly unavoidable,
and that therefore a model based upon alternatives is more appropriate.

The last of the three points is the most crucial to this argument. If there is
a large amount of avoidable conflict which has to be resolved late in the day,
then the amount of work required to create a consistent state outweighs the benefits
which have been proposed. We believe, however, that day to day conflict will generally
be rare in a reasonably well managed environment, and that the other problem of
substantial unavoidable conflict is one which is outside the scope of conventional
database management techniques.

The Relationship Between Versions And The Management Of Time

The management of versions and the managment of time are often referred to synonymously.
However, the two are not the same, and should not be confused (Langran 1988).

Version management, with which this paper is principally concerned, offers control
over database time, which runs strictly forwards. As will be seen later, the management
of versions may be conveniently implemented by successive overlaying of changes.

Management of real world time is different, because it is possible to modify the
past by adding new data (and therefore time in one sense is no longer strictly
forwards). For example, the data in a map base may reflect the external world at
the current time, say 1990; later on, data may be added which describes what the
world was like at an earlier date, for example in 1953. Database time and real
world time are thus running in opposite directions.

It is possible to implement mechanisms which manage time in a database, but the
methods described here are not generally appropriate. As with alternative version
management, an efficient solution demands functions built into the database manager
itself which are outside the scope of this paper.

The Requirements Of A System Providing Alternative Versions

Any system which permits alternative states of data to exist concurrently must
minimally provide a certain set of features. These are considered in this section,
and comprise:

The ability for different users to work on seemingly wholly independent views
of the database, so that other users remain unaware of the changes they are making
The ability to commit changes to disk without the changes becoming visible
to other users
The ability to make changes potentially visible to other users when they have
been authorised as a consistent set
The ability of a user to defer, until it is convenient, the making visible
of changes effected by someone else
The ability to determine precisely what has changed between two versions
The ability to validate the consistency of a set of changes to the database.

The first point is an underlying pre-requisite for alternative version management.
Conventional GIS solutions to the problem of data access which are based on check-out
and check-in suffer from restricting access through excessive locking of data,
perform poorly during the check-in/check-out operations themselves, and are expensive
in their use of disk. A system providing effective alternative version management
demands solutions more directly in tune with that specific approach.

The separation of the action of committing changes to disk, to that of making
changes visible to others, is actually more related to long transactions than to
alternative management per se. However, since an alternative itself may evolve
through a number of states, and may exist for an indefinite period of time, there
is the de facto requirement to be able to save individual states of an alternative
while it is still hidden from general view.

The traditional DP approach to transaction management combines the act of committing
to disk, and that of making changes visible to others. It also incorporates rollback
using the same technique, for good measure. These three functions, although related,
do not have to be combined, and can actually be considered separately. In an environment
which supports a number of individually identifiable alternatives, any one can
be committed to disk and rolled back without affecting the others. The act of making
the changes visible to a wider audience then becomes a separate administrative
function, rather akin to the formal approval and release of drawings or documents.

In an environment where conflicting alternatives may co-exist, the merging of
two such alternatives requires a high degree of formality. Three distinct phases
are required. The first, making a set of changes potentially visible, has already
been described. However, before a user working on another alternative can be allowed
to see those changes, he too must take positive action, to prevent the view of
the database he is using suddenly going from a self consistent state to an inconsistent
one. Just as the user has been buffered from changes going on elsewhere up until
now, so he should continue to be protected until there is a good chance that the
uptake of the new view will not cause conflict. This process of accepting a set
of changes which have been issued from another alternative is called refreshing.
The refresh process itself consists of two parts, the first validating that the
new combined state is logically consistent, and the second actually making the
combined state the one to be used.

When two alternatives are developed independently, the only data which can truly
be considered common between the two is that which existed in a common ancestor.
The difference between a pair of alternatives is a function of the modifications
made to that ancestor in each case. And as these modifications are made in isolation
of one another, there is no physical relationship between them, so they may only
be compared by considering what logical changes have been made. In particular,
it is not possible to simply physically merge two versions together, it is necessary
to re-apply the logical changes that have been made.

Thus the ability to refresh demands the ability to determine what changes have
been applied to the database during the life of an alternative version.

Before a refresh can be completed, the new view of the database which would be
created by incorporating the changes of one alternative into the changes of another
must be validated to determine whether any integrity constraints have been broken.
In databases based on the relational model, two sorts of conflict can arise. There
is a conflict if the same record (identified by primary key value) has been modified
in a different way in the two alternatives; and there may be a conflict arising
from an application constraint. The former is easy to detect, the latter more difficult,
requiring application specific code. However, things are not as bad as they may
seem, since if the application possesses the ability to verify the integrity of
data generally (for example, as it is entered) then precisely the same verification
functions can be used on the set of changed records during the refresh process.

The detection of either form of conflict can be built upon the ability to retrieve
the set of records which have changed between the version being refreshed and the
common ancestral version. The ancillary problems of how to report or display conflict,
and of providing tools to rectify the situation, fall into the application and
user interface domain, and are not considered here.

The foregoing has described a set of requirements for a platform which supports
the management of alternatives. We next describe one possible mechanism by which
these requirements can be fulfilled.

An Implementation Strategy

Many of the demands made of a system which manages alternative versions of data
concern the ability to identify the differences between them. Thus we are considering
different states of the data, and, most importantly, what has happened to get from
one state to the next.

Using the relational model, the changes between one state and another can be described
by the set of changed records, identified by primary key value, and the type of
change (insertion, deletion or modification). Non relational models present more
problems, because there is no simple way of comparing entities between two independently
generated sets of data. This paper only considers the relational case.

The implementation described here is based on a tabular datastore constructed
from conventional B*-Trees, with the fundamental versioning mechanism built into
the underlying data structure itself. This has the properties of low space overhead
and almost zero performance overhead during conventional access. This level of
the implementation permits multiple versions of the data to co-exist, with blocks
which are common to two or more versions being shared between them.

Note that the approach does not demand that all blocks reside in the same file.
In particular, benefit can be gained in the long term by storing a version on a
mass storage medium such as CD ROM and maintaining the changes on conventional
disk.

In order to determine the differences between a version and one of its ancestors,
it is only necessary to examine those blocks in that version which differ from
the ancestor’s version, rather than all the blocks in the database.

Different versions of the same tree are simply accessed via their different rootblocks,
and the action of committing a new version to disk simply involves storing the
new root block number in an appropriate place. The old version continues to remain
accessible while its root block number is still known.

[ Figure 2 not available ]

Alternative versions form a hierarchy, as illustrated in Figure 3. At the top
(the root) is the master database, which would probably correspond to what is considered
the canonical state. Nodes which are derived from this could be used, for example,
to differentiate between themes (e.g. pipework, electrical, cadastral, etc.), between
different spatial areas, or between any other aspects where the probability of
conflict is low. At the leaves of the tree, each
“user” has an alternative in which to work without conflicting with others. In
many cases, the levels of the tree will have close correspondence with levels of
approval.

[ Figure 3 not available ]

Changes are posted (i.e. made more widely visible) one level at a time. Provided
that the up to date version at the higher level is the one from which the changed
version was derived, only application validation in the form of
“approval” is required the version itself will be self consistent.

If the version at the higher level has since been updated, the version about to
post must first refresh, so that it is the up to date state which is being posted
into. If this formality were not undertaken, the intervening changes to that higher
level version would be lost.

The posting process can generally be implemented using purely physical means;
the refresh process cannot. The refresh process is effected by re-applying into
the new state the changes which had been made since the last refresh, by determining
which records had changed, and the form of change, and repeating the process.

Of all the processes, refreshing is the most complex and potentially time consuming.
However, the time taken is proportional to the degree of change, not the overall
size of the database.

The cycle of modification, posting and refresh is illustrated in Figure 4.

[ Figure 4 not available ]

Accessing Conventional RDBMs

A GIS may be called upon to access and maintain data which is handled using a
conventional DBMS system. We know of no system around today that supports the sort
of functionality described in this paper, so if the functionality is to be provided
to the end-user it must be achieved by means of a version managed front end to
the non version managed underlying system. This solution is bound not to be wholly
satisfactory, and includes such problems as:

Finding an optimal way of populating the version managed datastore tables which
performs well and doesn’t require too much disk
Coping with situations where the conventional database is updated by the external
access (i.e. outside the version managed environment)
Retaining version information once data is posted back to the conventional
database

The first of these problems is almost certainly soluble by techniques such as
remembering frequently used queries and dynamic downloading of data. The latter
two problems are much more difficult to resolve. However, we believe that the benefits
of a system which supports multiple versions and alternatives in GIS applications
outweighs these potential difficulties.

Conclusions

In this paper we have suggested that the conventional DP transaction approach
to multi-user working is inappropriate for GIS applications. There are a number
of essential differences between the conventional approach and the one described
here. These include the ability to maintain, merge and administer a hierarchy of
alternative versions in a controlled manner and to safely allow multi-user access
without locking.

We contend that a system which uses deferred detection of conflict to permit alternative
versions of data to be worked on concurrently provides many benefits. These include
the ability to experiment and cost alternative scenarios, to develop and test customisations
on a realistic size of dataset, and to resolve many of the current problems relating
to long transactions and distributed working.

We have identified a number of requisites of a system which is to successfully
support alternative working, which include the ability to separate the process
of making one user’s changes visible to another into discreet stages, and the ability
to determine what has changed between one version and another.

Finally, we have demonstrated the feasibility of the approach by proposing one
possible implementation technique which fulfils these requirements efficiently.

References

Armstrong Marc P. (1988). Temporality in Spatial Databases, Proceedings GIS/LIS’88,
San Antonio 1988, pp 880-889.

Barrett, Myles 1989. Object-Oriented Language Extensions for CAD/CAM: Journal
of Object-Oriented Programming, Vol. 2, No. 2, July/Aug 1989.

Chance, A., Newell, R. G., and Theriault, D. G. 1990. An Object Oriented GIS Issues
and Solutions. Conference Proceedings of EGIS, Amsterdam, April 1990.

Chou H., and Kim, W., 1986. A Unifying Framework for Version Control in a CAD
Environment. Proceedings of VLDB86.

Ecklund, D. J., Ecklund, E. F., Eifrig, R. O., and Tonge, F. M. 1987. DVSS: A
Distributed Version Storage Server for CAD Applications. Proceedings of VLDB87.
Katz, R. H. and Chang, E. 1987. Managing Change in a Computer-Aided Design Database.
Proceedings of VLDB87.

Kent, William 1989. An Overview of the Versioning Problem. ACM SIGMOD 1989 Conference
in Portland, Oregon. SIGMOD record Vol. 18 No 2, June 1989.

Langran Gail and Chrisman Nicholas R. 1988. A Framework for Temporal Geographic
Information. CARTOGRAPHICA Vol. 25, No. 3, 1988 pp1-14.

CFIS

CFIS

CFIS

Technical Paper 4 – Version Management in GIS – Applications and Techniques

Abstract

Introduction

The Benefits Of Enabling Alternative Versions

The Relationship Between Versions And The Management Of Time

The Requirements Of A System Providing Alternative Versions

An Implementation Strategy

Accessing Conventional RDBMs

Conclusions

References