27 Jul 1995 ............... Length about 4000 words (25000 bytes).
This is a WWW version of a document. You may copy it.

Brief ideas about IR evaluation

by Steve Draper ( http://www.psy.gla.ac.uk/~steve/)
University of Glasgow, U.K.

Contents (click to jump)

Points to accommodate in an evaluation framework for IR
General background for an evaluation framework for IR
An evaluation framework
Conclusion
References

Here are some ideas towards a framework for evaluation of information retrieval (IR) software, written around January 1995. While I am interested in evaluation of many kinds, in many domains, this paper addresses the field of information retrieval from free-text documents, and its traditional focus on the measures of precision and recall. However the ideas may be relevant to other interfaces to information, such as 3D visualisations, and so are offered for the Fadiva2 proceedings.

First comes a list of points that the framework should accommodate; then a longer, more connected argument towards a framework.

Points to accommodate in an evaluation framework for IR

* Any framework should be able to accommodate the current (and foreseeable future) state of technological development and application. In particular, librarians and other expert users are not the only or even the main class of user: how well can the software serve children or shoppers? Similarly, the engines are now cheap enough in their use of computational resources to be applied to quite small or even trivial collections: the framework should be able to handle tests designed to answer the question of how well this re-deployment would work. E.g. would it be useful to use IR on a "collection" of Macintosh file names and info-box contents? Or, in Unix, to apply "strings" to executable files (which will extract all the error messages and perhaps the variable symbol table)? The average quality of user interfaces has risen markedly in the last 2 decades, and interface quality is likely to have a marked effect on whether and how successfully IR software is used: so evaluation must encompass these effects.

* Document collections should, in the framework, be treated as a major variable, along with the retrieval engine and the task. It is entirely possible, even likely, that an engine's performance may depend on some characteristic of the collection, so evaluation should have a place for systematically varying the collection. (E.g. in highly technical writing, jargon may be used so consistently that the terms never co-exist with more common paraphrases, so that queries entered in non-technical language may frequently fail; whereas in journalism this may not be a problem.) Conversely, there may be a useful role for artificial tiny collections for some kinds of performance tests, just as the acceleration from 0 to 60 m.p.h. is generally quoted for cars even though this "task" is very seldom executed in real usage.

* The framework must include the observation of real tasks, and of real user interaction. (This does not mean that all measures must be of user performance.)

* The framework could use measures with respect to an idealised user for 2 purposes: as benchmarks of engine performance, but also for learnability measures, i.e. if we establish what an ideal user can achieve, then we can also measure how long it takes real users to reach this level of performance, whether they do so at all, and whether the user interfaces support this learning adequately.

* It is desirable to measure how good measures themselves are with respect to deeper or more directly valid measures. E.g. we can measure how well precision correlates with user success, user satisfaction, etc. (a sketch of such a check appears at the end of this list of points). (If the user interface supports quick scanning and selection of documents from long lists, it may be that precision is unimportant: certainly its importance must depend on the lengths of the lists presented, and the accuracy of selection by users from those lists, which in turn depends on the design of the summary information (e.g. title) presented on those lists.)

* Contemporary IR systems, like most other software, have fast interactive interfaces. Consequently the framework must be concerned, not with a single cycle of the retrieval engine and how it performs on single queries, but with a typical iterative episode involving a number of retrievals.

* Simple formulations of tasks or goals (e.g. in early AI planning theory) take them to be the achievement of a state of the world, e.g. changing the world by constructing a tower, or retrieving a certain set of documents. A more accurate formulation in general is that a goal consists of two parts: a state of the world, plus a state of knowledge, or at least confidence, in the human agent that that state now obtains. This is particularly important in IR: it is not enough to retrieve the best set of documents; it is equally important that the user is confident that this is the best set. E.g. if you ask a machine to pick the best restaurant to go to for lunch and it returns the empty set, you need to be sure that that is the right answer (and should turn to selecting a take-away or other alternative plan). Thus among the outcome measures must be, in addition to measures of time and accuracy, measures of how confident the user feels in the result and how accurate (how well justified) that confidence in fact is. In fact, as precision is only moderate in IR software, the software needs to be designed to deliver information about precision for human judgement, besides the documents themselves. Thinking of this as an explicit part of the user task allows it to be an explicit design requirement, and a testable part of evaluation.
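
As an illustration of the earlier point about measuring the measures themselves, here is a minimal sketch (in Python; the session data, relevance judgements and success ratings are all invented for illustration) of how one might check how well precision predicts a deeper measure such as rated user success: compute precision for each observed session, collect a success rating from the user of that session, and correlate the two.

  # A sketch of checking whether precision predicts rated user success.
  # All session data and ratings below are invented for illustration.

  from math import sqrt

  def precision(retrieved, relevant):
      """Fraction of retrieved documents judged relevant."""
      if not retrieved:
          return 0.0
      return len(set(retrieved) & set(relevant)) / len(retrieved)

  def pearson(xs, ys):
      """Pearson correlation coefficient of two equal-length lists."""
      n = len(xs)
      mx, my = sum(xs) / n, sum(ys) / n
      cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      sx = sqrt(sum((x - mx) ** 2 for x in xs))
      sy = sqrt(sum((y - my) ** 2 for y in ys))
      return cov / (sx * sy)

  # One record per observed session: what the engine returned, what a judge
  # deemed relevant, and how successful the user rated the session (0-10).
  sessions = [
      {"retrieved": ["d1", "d2", "d3"], "relevant": ["d1", "d4"], "success": 6},
      {"retrieved": ["d5", "d6"],       "relevant": ["d5", "d6"], "success": 9},
      {"retrieved": ["d7", "d8", "d9"], "relevant": ["d2"],       "success": 2},
      {"retrieved": ["d1", "d4"],       "relevant": ["d1", "d4"], "success": 8},
  ]

  precisions = [precision(s["retrieved"], s["relevant"]) for s in sessions]
  ratings    = [s["success"] for s in sessions]
  print("correlation of precision with rated user success:",
        round(pearson(precisions, ratings), 2))

A low correlation for a given interface would suggest that precision, taken on its own, is a poor proxy for what users actually achieve with it.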


General background for an evaluation framework for IR

In many areas of engineering, including traditional computer science, the performance of artifacts is judged mainly by measures that do not refer to human actions or judgement, e.g. load bearing and resistance to side winds for bridges, frequency response for hifi equipment, fuel consumption for cars. In information retrieval (IR) the measures have usually been precision and recall. This approach seems obviously objective, as opposed to the "subjective" issues that arise if humans are allowed into the measurement process, and certainly it lends itself to reproducible measurements, which is desirable. However objective measurements are not useful if they are irrelevant to the practical circumstances of interest. For bridges, this is not a problem: the weight of loads and the speed of the wind do sum up the important aspects of real circumstances. Fuel economy, however, illustrates the difficulties of a simple numerical approach: when the US government first required manufacturers to publish fuel consumption figures, they naturally chose circumstances that made these figures as good as possible (warming up cars first, then measuring circuits driven at constant speed on level ground) but wholly irrelevant to the driving conditions experienced by customers. If unrealistic user tasks are used in tests, then you get objective but irrelevant measurements.
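
For reference, the two standard measures just mentioned are easily stated in code. Here is a minimal sketch (Python; the document identifiers and relevance judgements are invented): precision is the fraction of retrieved documents that are relevant, recall the fraction of relevant documents that are retrieved.

  # Precision and recall for a single retrieval, given relevance judgements.
  # The document identifiers and judgements below are invented.

  def precision_recall(retrieved, relevant):
      retrieved, relevant = set(retrieved), set(relevant)
      hits = len(retrieved & relevant)
      p = hits / len(retrieved) if retrieved else 0.0
      r = hits / len(relevant) if relevant else 0.0
      return p, r

  retrieved = ["d3", "d7", "d9", "d12"]           # what the engine returned
  relevant  = ["d3", "d9", "d15", "d20", "d31"]   # what a judge deemed relevant

  p, r = precision_recall(retrieved, relevant)
  print(f"precision = {p:.2f}, recall = {r:.2f}")  # 0.50 and 0.40

The argument of this section is not that such numbers are wrong, but that without knowing how the query and the relevance judgements relate to real tasks they may be objective yet irrelevant.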

This is a serious problem for IR. Hitherto, standard test sets have contained queries chosen without any empirical knowledge of whether or how frequently they arise in practice. In fact it may be best to regard this as concealing not one but three problems: what is the set and relative frequency of tasks that actually arise for end users, what is the set of queries that are then entered into search engines (which may not express the real task well), and how good are the results then obtained? Because little work seems to have been done on the first two, we cannot be sure we know much about the third: e.g. perhaps tests have only been done on queries that do not arise in practice. This state of ignorance of the relationship of the search engines to users' needs is ironic, given that it is widely thought that the field of IR arose as a reaction to the difficulties users experienced in using boolean query interfaces. However it is very hard to find in the literature any empirical work on those difficulties with boolean interfaces, and there are few papers on how well tasks are achieved in practice with IR software. Thus the IR field is in the position of having almost no empirical confirmation of its original motivation, and no good measurements of how well it is performing in relation to the original aim of making things easier for users.

It seems clear that it is desirable to construct and then apply a framework for performance measurements of retrieval systems that can address these considerations: in particular the issue of how well a given setup actually serves users in the achievement of their tasks. Such a framework should apply to boolean retrieval as well, so that comparisons can be made. It should be able to cover current technology and practice: cheap computation; the ability to apply search engines to areas they were not originally designed for, e.g. both collections bigger than previously imagined and also very small ones; highly interactive and visual user interfaces; and a wide range of users (not just information specialists).

What might Human Computer Interaction (HCI) bring to the construction of such a framework? As we shall see, considering users explicitly enters in two distinct ways. Firstly, what matters is not how a machine might behave with an ideal operator but what results are actually achieved with a typical user. Secondly, much of what we are trying to study depends on human judgements which must be extracted from people: what documents are relevant, what information is actually extracted from a retrieval session, what functions are useful, what tasks are actually found to be wanted by people, how well is a task judged to have been achieved.

The simplest approach is to put users at the centre, and measure how the combined system of human and machine performs. The simple framework often used is a 3-factor one of tasks, users, and machines. It might be implemented as follows. First a survey of users in their workplace is done, to estimate what representative tasks occur in practice. Then trials can be run in which people typical of the target user group are used: each is given the same set of tasks and the same machine, and their success and other performance measures are taken (time and errors being the most common). Although human users are more variable than the machines, a fairly small set of measurements (e.g. 6 users) is often enough to estimate the mean and variance likely to be encountered in any attempt to reproduce the measurements.
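
To make the last point concrete, here is a minimal sketch (Python; the completion times are invented) of the kind of summary statistics such a small trial yields: the sample mean, standard deviation and a rough standard error for task completion times from six users.

  # Summary statistics for a small user trial: invented task completion
  # times (in seconds) for the same task from six users of the target group.

  from math import sqrt
  from statistics import mean, stdev

  times = [210, 185, 340, 265, 198, 420]   # hypothetical completion times

  m  = mean(times)
  sd = stdev(times)              # sample standard deviation (n - 1)
  se = sd / sqrt(len(times))     # rough standard error of the mean

  print(f"n = {len(times)}")
  print(f"mean completion time   = {m:.0f} s")
  print(f"standard deviation     = {sd:.0f} s")
  print(f"standard error of mean = {se:.0f} s")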

This approach is most appropriate for studying relatively simple devices such as pocket calculators and word processors, where the functions provided are not technically challenging and are assumed to have been tested in advance, but where the user interface makes a big difference to the overall performance. In IR, as in fact in many other interesting cases (e.g. spelling correctors, compiler error messages), these limiting approximations do not hold. In particular the functionality needs study in its own right: it is both more difficult to provide, and its specification is an issue in itself, namely discovering what functionality actually serves users best. How can this be accommodated in a framework? Firstly we must remember, as is sometimes neglected in formulations of HCI, that not only usability (costs to the user in time and disruption caused by poor interfaces) but also utility (benefits to the user of the work done by the machine) is important, and in general users require the best tradeoff between these factors. In other words, they will not use a machine if it does nothing for them, no matter how good the interface design, but they will put up with considerable penalties in time and effort if the machine does enough useful work for them. Measuring utility requires measuring both what useful functions are provided, and what value each user puts on these: in other words utility involves both a subjective weight and an objective service.

There are two ways to try to build on this. Probably both need to be employed in evaluating IR.
(A) To continue to focus on real users and realistic tasks, and to measure net benefit for these tasks. This ensures the relevance of the results, and because it concentrates on the net result it avoids difficulties in defining what exactly the different functions are and in making separate measures of the utility of each. E.g. in IR, this approach need not worry about the separate contributions of the retrieval engine, the format of the list of documents retrieved, the skill of the users tested in choosing queries and documents etc. On the other hand, although this might be the best evaluation for choosing between two alternative retrieval machines, it does not give information about what to change to improve a design.
(B) The other approach is to work towards isolating functions in the machine so that they can be separately measured and developed. By "function", I mean here distinctly identifiable outcomes that matter to the user. E.g. with cars, fuel consumption, top speed, and rate of acceleration (or time from 0-60 m.p.h.) are separate functions because they are separately measurable effects, and matter independently to a user, even though all are involved in any task (e.g. shopping, a 200 mile motorway trip, etc.). This approach is closer to the old approach of user-independent benchmarks, but here also involves attempting to measure utility weights for each function, and how relevant test tasks are to real users.
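
Approach (B) implies some way of combining separately measured functions with the value each user places on them. Here is a minimal sketch (Python; the particular functions, scores and weights are invented) of such a weighted utility score, i.e. an objective service term multiplied by a subjective weight for each function.

  # Approach (B): combine separately measured functions into a utility score.
  # The function names, normalised scores and user weights are invented.

  # Objective measurements of distinct functions of one IR setup,
  # each normalised to a 0-1 scale by some benchmark procedure.
  function_scores = {
      "precision_of_first_screen": 0.7,
      "speed_of_response":         0.9,
      "quality_of_summaries":      0.5,
  }

  # Subjective weights elicited from one user (how much each function
  # matters to them), summing to 1.
  user_weights = {
      "precision_of_first_screen": 0.5,
      "speed_of_response":         0.2,
      "quality_of_summaries":      0.3,
  }

  utility = sum(function_scores[f] * user_weights[f] for f in function_scores)
  print(f"weighted utility for this user: {utility:.2f}")   # 0.68

In practice both halves are awkward to measure (which functions to isolate, and what weight each user really gives them), which is why approach (A) retains its value as a check on the net result.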

How must the simple 3-factor model of users, tasks, and machine be developed to accommodate these approaches and the specific nature of IR? Firstly, the "machine" should be divided into (at least) two independent variables: the retrieval engine, and the collection. It is quite possible for an engine to do better with one collection than another, and the separation is convenient for other reasons (see below). Secondly, "users" appear here in at least two roles: the human using the machine, and the human who owns and originates the task. If a specialist advisor translates the task into commands, as is often done in libraries for instance, then these may be very different people.

Thirdly, the "tasks" factor is more complex than is often admitted in HCI (Draper; 1993). (a) For instance in reality tasks are neither wholly prior to design, nor fixed: users take a machine and gradually discover what they can do with it, often applying it to tasks unforeseen by the designer. Thus standard tasks cannot sensibly be fixed by an evaluation standard, but must be studied and observed, and may differ between machines even for the same users. (b) It is also important, particularly for approach (B) above, to identify distinct functions with a view to measuring and optimising them independently, and to consider subtasks within the overall task. In IR one important subtask is the formulation of a query command, apart from then issuing it. Another is the selection of relevant documents by the user from the short list returned: clearly both user skill and the detailed design of the information summarised in the list are crucial here Another is finding the information within a document when it is retrieved: highlighting matched terms within the document is a machine function designed to aid this (and I have seen users miss the relevance of a document when this was not done well). (c) Most important here is to make an explicit place for issues of social interaction and other workplace constraints.

Fourthly is the issue of what measures to use. The key difficulty is in establishing measures of how good the material retrieved is with respect to the task. It is possible to ask people to rate the usefulness of particular documents, but the main difficulty is in discovering what other documents, not retrieved, would in fact have been useful. There are several approaches.
1) Use a very small collection, so that the user or the evaluator can inspect every document in it for relevance. This should be done for applications to small collections. It may however also be one useful test of engines designed for large collections: could an engine that failed here plausibly be claimed to work well on a large collection?
2) Use a contrived collection, where a large number of wholly irrelevant documents are merged with a few that are relevant to the test task. Obviously there are many avenues to explore here. Basically, this type of test can provide exact measures of an engine's ability to find a division when a sharp division exists. Specific issues in retrieval engine design can be tested by manipulating the relationship between the irrelevant and relevant subsets. Consider for instance one test that merged academic geology papers with a few newspaper articles on politics; and another that merged 19th century newspaper articles with a few contemporary ones and did not warn (or allow) users to employ time specifications in their queries.
3) Devise a task (necessarily artificial to some degree) in which the evaluator knows some important fact that is withheld from or forbidden to the test users. Their results can then be compared with those obtained using the withheld fact. For instance, searching for notorious murder cases without using the names of famous murderers, or searching for articles on the outbreak of a war without using its date. In general, this corresponds to testing IR by people ignorant of a topic against an expert's own knowledge.
4.1) With realistic collections, you could compare an actual user session with a greatly prolonged effort by an expert, using the latter as an estimate of the ideal retrieval.
4.2) Similarly, use the union of results from many experts and many engines to estimate the ideal retrieval, and compare each particular performance against that. This is the method that has been most used in IR so far in "test collections". The problem with these techniques is that there may be some weakness in common between experts and/or engines with respect to that task and collection, but this cannot be discovered within the technique.
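
Approach 4.2 can be sketched as follows (Python; the document identifiers are invented): pool the results of several experts and engines, treat the pooled union as the estimate of the ideal retrieval, and score each individual run against it.

  # Approach 4.2: estimate the ideal retrieval as the union ("pool") of what
  # several experts and engines found, then score each run against that pool.
  # Document identifiers below are invented for illustration.

  runs = {
      "expert_1": {"d1", "d2", "d3", "d7"},
      "expert_2": {"d2", "d3", "d9"},
      "engine_A": {"d1", "d3", "d4", "d8"},
      "engine_B": {"d3", "d7", "d9", "d11"},
  }

  pool = set().union(*runs.values())   # stand-in for the ideal retrieval

  for name, found in runs.items():
      relative_recall = len(found & pool) / len(pool)
      print(f"{name}: {len(found)} retrieved, "
            f"relative recall against pool = {relative_recall:.2f}")

  # Note the caveat in the text: any relevant document missed by every expert
  # and every engine is absent from the pool, so that weakness is invisible here.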

An evaluation framework

What model do we end up with, taking these issues into consideration? It could be summarised as follows.

0) There are 4 major variables in the testing: tasks, users, collections, and retrieval engines.

1) At the centre is the iterative session of issuing queries to satisfy a single top level initial goal (task). The central measurements are of these sessions, for various input parameter values of tasks, users, collections, and engines. What comes out is a partial achievement of the task (perhaps measured along more than one dimension), and measures of cost (machine time, iteration number, user effort).

2) Prior to that is the top level task (probably not consciously articulated even in the user's mind), and the social environment that may affect it. The task (if we are able to determine what it actually is) contains a specification not only of the content of the information required, but of its amount (all items, 1 item, 20 items), the degree of relevance (e.g. optimal, above a minimum), and the level of confidence required for each of these aspects. For instance, in finding a restaurant for dinner, confidence in adequate quality is important but confidence in optimality probably unimportant; in searching for all the books by a specific author, confidence in exhaustiveness may be crucial; in doing a general literature search to support a dissertation, finding about 20 (say) papers may be the most important thing, not that they are all the relevant papers or precisely the most relevant. But the wider social context may mean that the real task requirements have to do with other factors than these. For instance, it may be that in some situations being seen to use the IR machine matters far more than any results it produces, while in other situations using the IR machine to give authority to a conclusion already arrived at may be the tacit requirement.
2.2) Surveys or field studies can be done to determine which of these are observed in practice, and the findings could then be used to establish a set of test tasks to represent some type of work.

3) Outputs, and measures of performance. Many measures can be taken: total time; the number of iterations; when and how the user decides to stop iterating, and whether that decision was taken too early (a better set could have been found) or too late (iterations were wasted); how good each query the user formulates is, compared to some independent measure of query quality; how confident the user feels in the result when they stop; how good the final retrieved set of documents is; and what information the user finally extracts from the retrieved set (and how good it is). A sketch illustrating some of these session-level measures follows point (4) below.
3.2) Many measures will require special cases to be run in (1) to provide reference points such as the best document set to be found.

4) (3) and (2) act as surrounding context for (1), larger issues within which particular cases of (1) are measured. It is in principle quite possible to zoom in further, and look at parts of (1) in isolation e.g. how an engine performs with specific standard queries. However this may not be of great importance, as there are many such components including: the generation (by the user) of a query as the next step, the selection (by the user) of a document from the list returned for use in relevance feedback, the generation (by the machine) of a new query based on that document, the extraction of the desired information by first selecting one or more documents to open, and then finding the relevant parts of them to read. It is how these subtasks and functions work together in practice in a session, not how each one operates in isolation, that probably determines overall performance. For instance, if a different initial query leads after a few rounds of relevance feedback to much the same set, then performance on specific queries may not matter as much as achieving this convergence.
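
As promised under point (3), here is a minimal sketch (Python; every record, scale and threshold is invented) of the kind of per-session outcome record such a framework might collect, together with one derived measure: how well each user's stated confidence when stopping is justified by the actual quality of the final retrieved set, judged against a reference of the kind described in (3.2).

  # Per-session outcome records of the kind points (1)-(3) call for.
  # Every field, scale and threshold here is invented for illustration.

  sessions = [
      # stated confidence (0-1) when the user stopped, and the actual quality
      # of the final set (0-1), judged against a reference retrieval as in (3.2)
      {"user": "u1", "iterations": 4, "total_time_s": 380,
       "stated_confidence": 0.9, "actual_quality": 0.8},
      {"user": "u2", "iterations": 2, "total_time_s": 150,
       "stated_confidence": 0.8, "actual_quality": 0.3},
      {"user": "u3", "iterations": 7, "total_time_s": 610,
       "stated_confidence": 0.4, "actual_quality": 0.7},
  ]

  # One crude calibration measure: mean absolute gap between how confident
  # the user felt and how good the result actually was (0 = fully justified).
  gaps = [abs(s["stated_confidence"] - s["actual_quality"]) for s in sessions]
  calibration_error = sum(gaps) / len(gaps)

  for s in sessions:
      over = s["stated_confidence"] - s["actual_quality"]
      verdict = ("overconfident" if over > 0.2
                 else "underconfident" if over < -0.2
                 else "well calibrated")
      print(f'{s["user"]}: {s["iterations"]} iterations, '
            f'{s["total_time_s"]} s, {verdict}')

  print(f"mean calibration error across sessions: {calibration_error:.2f}")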

Conclusion

I have suggested varying 4 major factors in testing — tasks, users, collections, and retrieval engines — and considering both artificial and real cases of each. Hitherto, IR testing has usually used real engines and collections, arbitrarily chosen tasks (it is unclear whether they are realistic or not), and avoided users by using fixed queries (again, unclear whether they are realistic). In general, it would seem informative both to observe actual tasks, users, and collections, and to make use of deliberately artificial or idealised tasks, users, and collections as tests for engines. The artificial can provide fixed points and limiting cases for benchmark tests, while real case observations can allow us to relate these to the conditions that are encountered in practice.

In general it is possible to distinguish three kinds of evaluation approach. Analytic evaluation analyses designs on paper with respect to fixed principles, thereby presupposing that all the important issues are known in advance. At the opposite extreme is empirical evaluation, which observes what actually happens in real cases, but while good for spotting unexpected disasters does not give rich information about what to modify. A third approach, consistent with the above discussion, could be called functional evaluation. It attempts to identify distinct functions (e.g. acceleration, precision), which are then measured separately, e.g. by varied and extreme cases, regardless of how realistic these cases are. This gives information designers can use to improve designs. However the identification of such functions must be closely linked with observations of what matters in normal use and practice, or the whole approach may become irrelevant. An important aim in developing a new framework for IR evaluation will be to identify a more relevant set of such functions, and to provide empirical evidence about the degree of their relevance, as well as establishing measures for them that include the human element in the overall system performance.

References

Draper, S.W. (1993) "The notion of task in HCI" pp.207-208, Interchi'93 Adjunct Proceedings, (eds.) S. Ashlund, K. Mullet, A. Henderson, E. Hollnagel & T. White (ACM).

Galliers, J.R. & Sparck Jones, K. (1993) Evaluating natural language processing systems, Tech. report no. 291, University of Cambridge Computer Laboratory.

Green, T.R.G. & Hendry, D.G. (1993) "Spelling mistakes: How well do correctors perform?" pp.83-84, Interchi'93 Adjunct Proceedings, (eds.) S. Ashlund, K. Mullet, A. Henderson, E. Hollnagel & T. White (ACM).

Green, T.R.G. & Hendry, D.G. (1993) "Spelling correctors and spelling mistakes" ?? Report from Applied Psychology Unit, Cambridge.