25 Aug 1997 ............... Length about 4800 words (29000 bytes).
This is a WWW version of a document. You may copy it.
To ftp a postscript version of it to print
click this.
Temporal and Situated Aspects of Data: Final report
Stephen W. Draper, Richard Cooper
Department of Psychology
University of Glasgow
Glasgow G12 8QQ U.K.
email: steve@psy.gla.ac.uk
WWW URL: http://www.psy.gla.ac.uk/~steve
This is the detailed report (section 16 of the final report), written in
August 1997, for EPSRC grant GR/K25021 "Temporal and situated aspects of data",
which ran October 1994 (actually April 1995 because of a late start) to May
1997.
This project aimed to study a number of problems in data modelling,
particularly those associated with the treatment of time and with varying
validity in the data. The strategy however was not to identify these problems,
develop a general solution, and then to illustrate it by reference to some
applications, but to use one particular problem domain as the focus and attempt
to solve as many problems as possible that occurred in that domain.
The problem domain chosen was that of simple accountancy: for instance
managing a personal bank account, the budget for a research project, or the
accounts of a small company. These simple accounts are currently usually
handled by spreadsheets. We were interested in identifying and solving those
issues in this domain that are poorly supported by spreadsheets and represent
quite general problems for data modelling and databases. A number of these
problems and issues were described in the grant application. In addition we
interviewed experienced users in our domain (responsible for managing small
budgets and producing simple accounts), and this produced an important
improvement in our understanding of the domain and the problems it presents to
software attempting to support it.
We have developed a data model based on three types of object: time varying
functions of money, status variables, and an explicit representation of the
network of formulae linking money items. We can show how these are sufficient
to solve many issues in our problem domain, supporting facilities useful to
users and far in advance of what can be accomplished by spreadsheets or indeed
many complex accounting packages. We have implemented a program incorporating
these objects and demonstrating their application to accounting, and how they
solve in this particular domain a variety of data modelling problems of wide
interest.
In the rest of this report we first describe what we learned of the nature of
our problem domain; then define our data model objects; then how these can be
applied in many ways to address the problems; then the general issues in data
modelling these can be seen as solving; and finally report on our practical
work.
We are concerned with managing small budgets such as personal bank
accounts, research grants, a department or firm with two or three employees.
Many items are recurrent e.g. standing orders for mortgage repayments. Many in
fact are recurrent but with varying amounts e.g. salaries incremented each
year, direct debits for gas bills whose amount is adjusted annually.
More importantly, many items go through a standard set of states, rather than
being one-off items like a grocery bill that is only entered at the time
payment is made. For instance, standing orders are committed to in advance
(when the order is signed), but payment is made on a particular date. Cheques
are committed to when signed and handed over, but only paid from the bank
account many days later. The first implication from this is that such items
must be modelled as entities that persist over time, but change status. The
second implication is that account holders want more than one view of their
account: a view for reconciling bank statements which deals only in cleared
funds, and another view (which the users we interviewed told us is called
"commitment accounting") that shows how much money you have left after all the
payments in the pipeline have been made.
In addition to these aspects of the domain, our interviews taught us the
existence and importance of the following three issues which we had not
foreseen. Firstly, that even with simple accounts, users need multiple views
of what to them are "the same" accounts. Committed versus cleared funds is
just one example, others being the requirement to produce accounts for
different financial years such as the tax year (from April), the university
financial year (from September), and from the start of a grant (often October,
but could be any month). Secondly, the facility to produce accounts at varying
levels of detail. Committees and individuals typically look at the totals
first, then the first level of subdivision. If there is something unexpected,
then more and more details will be required for the subheading producing the
anomaly down until it is understood. That is, from the user point of view, the
level of detail required is not fixed in advance, but ideally should be very
easily changed. While control of the level of detail is an issue for the user
interface, the associated user task is that of looking for the justification
(reasons for) a particular item (such as the budget total).
The third and most important issue, though, is that such accounts are not
produced simply for the sake of recording history but mainly to support
decisions. Individuals want to know how much money they have to spare, and
whether they have overspent, and so does a budget holder. One of our
informants produces as a matter of routine totals, next to planned expenditure,
and a "variation" column indicating significant differences. The fundamental
user action here is not that of reading off a total, but of comparing current
totals against previous plans, and if necessary then modifying the plan. Both
plans and actual spending actions need to be recorded in the same system, and
indeed closely related as frequently they refer to the same items. Supporting
this caused a major modification to our initial designs because it corresponds
to a change in understanding about the essential nature of this domain: not
that of recording historical payment actions and calculating some totals, but
comparing two views (planned and actual) of "the same" thing with particular
attention to differences.
We found that we could support all the tasks and functions for our
domain using only three new kinds of object.
Money amounts are modelled, not as simple integer quantities, but as
functions that take a time period (in the world being described by the
database) and return a money value. (Of course the same machinery can address
domains where the "currency" is not money but, say, exam marks.) One-off
payments are a special degenerate case, returing either zero or the one-off
amount depending on whether it occurred during the period specifed as the
argument. Standing orders pay out repeatedly on specified dates. Totals (e.g.
"equipment budget") are functions like spreadsheet formulae that refer to other
functions (passing on the time period they received as an argument).
Items (time functions of money) can have one or more status variables
associated with them. For a cheque, a status variable might be used to
represent its state e.g. written, signed, handed over (or sent) to the payee,
paid in by payee, cleared by payer's bank. In a small company, a job might
have a status variable with values such as: prospect, estimate given, order
received, job executed, invoice sent, payment received. In constructing a
plan, items might be given a status variable with values such as: residual
funds, estimated cost, actual cost.
A status variable is an enumerated set of values, specified by the user in
setting up a particular model (schema). The values in general form a directed
graph, possibly with cycles; but may have an order defined over them. That
is, users are allowed to enter events that change a status value in any way,
and in many cases such transitions will be meaningful as representing unusual
but possible changes to status (e.g. a cheque having bounced being
re-presented). However the defined order is useful to users so that they can
ask for a view of the data that reports (only) on items with a status greater
than or equal to some value e.g. to reconcile a bank statement ask for all
items actually paid, but to calculate spare funds, ask for all items either
committed or actually paid. A company chasing unpaid bills would ask for a
list of jobs with status billed but not paid, while when it wants to direct its
salesmen it might ask for a list of jobs quoted for but not ordered.
When a money function is created, if the function itself is a formula
referring to other items, this is added to the graph relating formulae, with
the new item as a parent node to those referred to. The user is also asked to
specify what set(s) a new item belongs to (e.g. an item might be named
"computer" and added to the set "equipment"), and this tacitly adds it to the
formula graph that specifies the sum (subtotal) for that set.
Informally we can think of this graph as a tree whose root is the budget's
overall total, but since it is not uncommon for alternative groupings of items
to be used simultaneously, the structure of the network is in general a DAG
(Directed Acyclic Graph). The formula DAG is stored explicitly, and used in a
number of ways.
All data (and schema) entry is done by creating an "event". This is
time-stamped for the time of data entry, but often also has an associated world
time, entered by the user, to which it refers (e.g. the time the payment is
made). An event may change the value of a status variable, create a time
function, change the time function associated with a given item or its
parameters. It may also change the metadata rather than the data in the sense
of creating or modifying an item representing a new subtotal i.e. a set that
groups other items in a new way e.g. moves an item from equipment to
consumables.
The user can ask for a report or view, specifying the period of world time the
view is to describe, and the values for any status variables to be used to
filter the calculation. Values for all functions (including totals) are then
computed. Alternatively, the user can ask for a trace (graphical or textual)
of the values of a particular item as it changes over time.
To implement a normal account, we expect a user to create and maintain a
"history stream" and a "plan stream" referring to the same budget or account.
The history stream can be thought of as a set of items organised bottom-up that
can be evaluated together to give the current state of an account, while the
plan stream is a corresponding set of items organised top-down from the overall
total down to some level. The streams share a common formula DAG: the
structure of formulae grouping items into subtotals and totals. All items are
time varying functions of money, as explained above. The formula DAG is
created as money functions are created: if a function contains a formula that
refers to other items, then it is a superordinate node to those items, and when
created the user assigns it to one or more sets which are its superordinate
sets.
Like the history stream, the plan stream also consists of a set of money
functions, but these are tagged as of type "plan" with a given plan version
number (so that alternative plans and plans made at different dates can be held
together indefinitely). As in the history stream, every item corresponds to
one specific node in the formula DAG. A plan item is either a pointer to the
formula from that node (i.e. its value depends on the corresponding subordinate
plan items), or a pointer to an item that is a money function in the history
stream (to represent a plan for a specific item such as the purchase of a major
piece of equipment), or else a separate money function (to represent an overall
plan for a whole subtree e.g. petty cash, travel expenses at the rate of
£1000 per month, etc.). By specifying a world-time and a plan
version number, the plan stream can be evaluated to show the planned budget at
that date. This can be compared to a view of the history stream. That is, the
planned and actual versions of the account can be juxtaposed (textually or
graphically) or, alternatively, the history of the actual and planned versions
of an item can similarly be compared.
A typical practice might be to review the accounts once a month, but to enter
expenditure daily as it arises. At the end of the month, the latest plan (now
a month old) would be evaluated for the current date to show what the planned
spend should have been under each heading, while the history stream would also
be evaluated with the current date to show actual expenditure. If the
comparison showed discrepancies then a new plan could be made, by copying and
modifying the last plan, to take account of what has happened. For instance if
travel were underspent, the function for the rate of spend could be increased,
while if equipment were overspent then either money could be moved ("vired")
from another heading or the rate of the spend function for the remaining period
reduced.
In creating a plan, typically the totals are known and some items or subtotals
must be adjusted to conform with this. Functions in the interface are offered
to the user to make this easy. A status variable associated with every plan
item marks totals and pointers to history stream items as "fixed" and
plan-stream money functions as "provisional". Proposed values (for the user to
modify and approve) for the provisional items can be calculated by using the
formula DAG to work backwards from the "fixed" totals.
Status variables can be used so that an item can be represented
throughout its financial history. Thus in a research budget, a large item of
equipment might be specified in the original application, and have status
values "proposed", "ordered", "delivered", "paid for". Thus the specific plan
for buying this equipment could be represented by entering this item at the
start, even though purchase might not take place for some months. Actions that
change its status are represented by entering "events" that refer to the item
and specify the change. Its amount i.e. the price of the equipment is likely
to be modified (by an event) too, since only when it is ordered is the exact
cost likely to become fixed and known.
Similarly cheques and direct debits for, say, regular subscriptions could be
handled with their own similar status variable, thus allowing a user to enter
an item for an anticipated bill months in advance, while keeping track of what
is paid when.
Besides data entry, the basic user action is to request a view on the
data by specifying a period. For instance, to produce the accounts for a
month, the user specifies the period (e.g. 1 March 1997 to 31 March 1997) plus
status variable values (e.g. to specify whether cleared or committed funds are
to be reported) and the software will produce the values of all items including
totals for that period. The essential advantage of this approach is that many
different views can be produced of the same data: different accounting periods,
different status subsets.
In general there are two approaches to modelling time: qualitative and
quantitative. The former, usually associated with logical treatments,
describes orderings of events without describing the size of durations, while
the latter describes durations. (This is not exactly the distinction between
discrete and continuous, as durations may be described in discrete units such
as days.) We have provided separate mechanisms to handle both these modelling
requirements. Status variables allow qualitative time changes as an ordered
sequence of values, specific to a user-defined type of object. Time as a
quantity is handled as a basic value, passed as an argument to functions.
All user actions are time stamped and a non-deletable record is kept.
To correct an entry, the user enters a new event superceding the previous one.
It is thus possible to do complete roll-back to the state of the database at
any given user time, and reproduce the state as the database believed it to be
then. In the accountancy domain, this allows a complete record of errors to be
retained even after correcting them. This could be used for calculating
compensation, calculating error rates with a view to improving the human and
overall operations, and showing what the company believed at a given time in
order to deal with claims about misleading customers or auditors, or about
responsible management. Given that many numbers are subject to change
independently of human error (e.g. prices are corrected from estimates to
actual billings, plans are modified frequently), this is hard to do in present
systems. Currently it is most often done by keeping complete paper or disk
dumps of accounts at frequent intervals and keeping them for many years:
clearly our approach is more economical.
Note that that facility is wholly about the user-machine (i.e.
user-database) interaction over time. That "user time" is quite separate from
"world time": the times stored that represent time in the world being modelled
e.g. dates of bank payments. Because in practice, at least as we understand
it, the users' main task is to review and manage an account, they are doing
equal amounts of planning and of recording definite (past) events. Both of
these refer to world time, but view it from opposite directions. Thus planning
tries to reason about world times that are still in the future (using
estimates). Thus in our design, a user requests a view of the data in a single
uniform way, specifying some world date which might be in the past or in the
future.
This single uniform approach is esthetically pleasing, but it also draws
attention to the fact that any model of the world, including of the recent
past, is only an estimate: it is only as accurate as the data currently
entered. In much database work, it is assumed that the design of a data model
or schema defines how much detail the database will hold, but then it is
assumed to be in exact correspondence (within those limits) with the world.
Data entry errors are held to a low level by validation procedures at data
entry. In the world of accounting, people are much more aware that the
relationship of validity between a database and the world it models is more
complex and more problematic than that. One of the biggest issues is the time
delay in acquiring data. Companies typically publish accounts months after the
end of the accounting year, reflecting how long it takes to acquire and enter
financial data. Government figures on the economy, e.g. for high street
spending, are even more subject to both delay and inaccuracy. Thus in fact any
given report should be specified in terms of both world time and user time i.e.
both the time in the world which the report tries to describe, and the time at
which the database was interrogated (not necessarily the present time if
rollback were used to undo some data entry actions). Very small companies may
also have significant lags in data entry (e.g. accounts may only be updated and
bills paid once a week or once a month), just as large companies do.
Users who have a model of the kinds of uncertainty in the data may define
status variables to describe this (e.g. provisional vs. confirmed figures).
Another example might be, in a personal account, to enter credit card items as
they are spent, but use the itemised bill sent by the credit card company to
check items and to catch items that had been forgotten. In this way
uncertainties in the validity of the data may be modelled.
A frequent task for account managers is seek out where a figure comes
from: why it is the amount it is. For instance if consumables expenditure
seems high this week, then the manager will want to look at the items making it
up. Another example is in looking at an estimated total (perhaps in a plan),
the manager will want to understand what this estimate depends on, and what
items could best be changed in order to change the estimate i.e. which are
least certain or least fixed. During the project, we took ideas from the
Artificial Intelligence literature on Reason Maintainence Systems and applied
them here. In this domain, a figure is calculated by a formula, which in turn
draws on other items. If the user asks "why/explain" and points to an item,
the formula DAG can be used to identify the set of items on which it depends.
If a status variable has been defined marking some items as "certain", others
as "estimated" and so on, with an order defined on the values for that status
variable, then the mechanism will identify the subset with the lowest values:
an inference is only as good as the weakest (least certain) assumption on which
it is based.
A visual approach corresponding to that is the fisheye view approach,
another idea from the literature that we have adapted for this domain. A
fisheye lens shows the centre of the field at high magnification, and
peripheral things at progressively lower resolution. A computer equivalent is
to be able to show any item at a number of levels of detail or size on the
display, and to use a measure of "interest" to control this. Here, if the user
points to an item (say a subtotal), this defines the centre of interest. Other
items are assigned interest in inverse proportion to their distance in the
formula DAG from the point of interest. Status variables can be used to modify
this interest function further.
In practice, many database users would like to modify the database's
schema during its lifetime, raising problems that are in general intractable.
However because of the simple structure of our data model, we can allow this.
Modifications are changes to the formula DAG. New nodes can be inserted or
removed without leaving any ambiguities. Items can be moved from one place in
the DAG to another, as long as the interface prevents them being left isolated.
Such changes will be logged and timestamped as "events", even though they
change the metadata rather than the data modelling the world. Because of the
time stamps, it would then be possible to ask for a view of the world
specifying one world time and a different metadata time i.e. you could ask for
the accounts using a new structure, but referring to a past time, or conversely
the accounts of the current time but using an earlier structure. This is
likely to prove useful in the accounting world, as practices periodically
change making comparisons between different years difficult if they were
reported following different standards. Furthermore some companies report
using more than one structure: one they think the best reflection of the "true
state of affairs", and another reflecting current legal requirements for
reporting.
Because our long term plan was and is to test our software on real
users, it is important to be able to deliver it on multiple platforms,
particularly the PC (most businesses) and Macs (many local university account
managers, and some multimedia businesses). We therefore spent some time at the
start of the project identifying a suitable environment, settling on a version
of Smalltalk that would support iterative experimental programming, but would
run on both Macs and PCs. Early outline designs were implemented using this,
as were student projects on graph drawing packages, a Reason Maintainence
system, and a Fisheye package. At the end of the project we switched to an
implementation in Java which ported instantly between a Sun and a PC running
WindowsNT, and also a Mac. All the facilities mentioned above have been
implemented.
Of the objectives listed on the application form, we have accomplished all of
them except going back to the outside users we interviewed early on, to test
the resulting software. We feel disappointed by this, but the persistent ill
health of the RA on this project held back the work over and above the two
months unpaid leave she asked for. Extensive user testing will also require
additional software work, particularly on routines to read in existing
spreadsheets and convert them into our data structures in order to persuade
users to use our software on their real work.
These delays also mean that we are only now writing up papers to report on this
work, and formulating a new grant application for further work.
The immediate further work required is to test our software on users,
both ourselves trying to use it to manage our accounts, and other outside
users. This would doubtless lead to many revisions, some to the user
interface, perhaps some to our data modelling ideas, and some to our
identification of and support for the main user tasks in this domain.
However this would probably be only the start of a new issue. Judging from
what we personally have learned about "simple" accounting during this project,
a major issue will be teaching users new practices: to understand that what
they are really doing is not just writing down sums spent, but managing
accounts, that there could be substantial advantages in entering items months
in advance when they are first foreseen, and that there isn't really any such
thing as "printing out the accounts" but rather the question of what view they
want to see for what purpose. It will be one thing to polish up a user
interface for a user with a deep understanding of the domain, and quite another
to teach accounting-naive users what the domain is about through the medium of
the software interface and documentation.
This reflects the issues involved in building a task or domain-specific program
as opposed to a general purpose tool like a spreadsheet. A special purpose
tool will have better adapted functionality, and save every user having to
repeat many pieces of implementation (e.g. calculating salaries with increment
dates, having separate spreadsheets for each required accounting period). But
it raises rather than solves the increasingly important issue in software
design of how to teach users about the domain the tool addresses. This may
well be a cruical issue in databases in general, where apparent poor usability
may often really be due to a lack of support for the user to understand the
view of the domain implied by a given database, and the relationship of the
database to the world it is trying to represent. Addressing this issue, of how
to present a domain to a user not already familiar with it, is an important and
general issue that we would like to explore through this particular domain of
simple accountancy.
In our view, this work could inspire commercial software products in future,
but is still early in its development phase.
We are pleased with our strategy of focussing on a particular small
problem domain as a way of studying general problems. Not only does it force
us to pay attention to problems because they matter to real users (at least in
one domain) rather than selecting them arbitrarily, but the solutions are
nevertheless much more general than the domain we studied, even if they do not
have the complete generality that, say, relational calculus does for databases.
Time varying functions, status variables, and formula DAGs not only solve
multiple problems in this domain, but it seems clear they will be useful in
other application areas. Furthermore they capture what is common about what is
the same item to users (e.g. buying one item of equipment), yet handled in
duplicate in both spreadsheet and manual systems of accounting (e.g. double
entry book-keeping) depending on whether the item is being viewed as a plan or
a past expenditure, part of the equipment budget or an item with an invoice to
be paid.