Last changed 16 Aug 1998. Length about 3,500 words (22,000 bytes).
This is a WWW document maintained by Steve Draper, installed at http://www.psy.gla.ac.uk/~steve/stefano.html. You may copy it.
Mizzaro's framework for relevance
by
Stephen W. Draper
Stefano Mizzaro offers a model or framework of the senses of relevance
in IR. For us, the question is whether it covers all the issues important in
real information retrieval today, and hence could be used to structure our
discussions of how to evaluate IR software in actual, interactive, multimedia
retrieval.
Mizzaro's framework has 4 dimensions:
- Information need: RIN, PIN, EIN, FIN: the real, perceived (mental),
expressed (natural language) and formal (query) representations.
- Components: topic (the information itself), task (what it is used for),
context (meta-information about the information).
- Time: the point in the retrieval session at which the relevance judgement is made.
- Information resource: document, surrogate, the information itself.
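To make the framework concrete, here is a minimal sketch (in Python, with class and field names of my own choosing; an illustration only, not anything Mizzaro himself specifies) of a relevance judgement located in this four-dimensional space:

```python
# Illustrative sketch only: a relevance judgement located in Mizzaro's four
# dimensions.  The names are mine, not Mizzaro's.

from dataclasses import dataclass
from enum import Enum

class NeedType(Enum):
    RIN = "real information need"
    PIN = "perceived information need"
    EIN = "expressed information need (request)"
    FIN = "formalised information need (query)"

class Component(Enum):
    TOPIC = "the information itself"
    TASK = "what the information is used for"
    CONTEXT = "meta-information about the information"

class Resource(Enum):
    DOCUMENT = "document"
    SURROGATE = "surrogate"
    INFORMATION = "the information itself"

@dataclass
class RelevanceJudgement:
    need: NeedType
    component: Component
    time: int          # which point in the retrieval session
    resource: Resource
    score: float       # the judged relevance itself

rj = RelevanceJudgement(NeedType.PIN, Component.TOPIC, time=3,
                        resource=Resource.SURROGATE, score=0.7)
```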
I have changed the order he uses. His framework is published in:
He defines the types of information need in terms of the representation in which the need is expressed.
RIN = Real Information Need. The need external to the user, perhaps not fully
graspable by them.
PIN = Perceived Information Need. The mental representation in the user's
mind.
EIN = Expressed Information Need = Request: the need expressed in
natural language.
FIN = Formalised Information Need = Query: the need formalised in a
machine language.
The components (actually aspects of the need, or information subgoals) are:
Topic: the central and final goal, the information sought.
Task: the activity for which the information need will be used (in effect, a
higher level goal than the information need).
Context: information about the subject domain used during the activity.
E.g. meta-information about the topic.
An important contrast here is that of topic vs. context. Stefano emphasises subject matter vs. broader domain background; I would emphasise the information needed (e.g. a fact) vs. meta-information that may modify the goal, i.e. the information need itself (e.g. learning that there are many papers with almost the same content, so that no one citation is the best here).
Retrieval sessions nowadays typically involve many retrievals. The
relevance of an item changes over time: it can fall if the item has already
been found; it can rise, if information was found about the "context" that now
lets the user recognise its importance for the first time. Thus the relevance
of the same document for the same information need often changes during the
course of a session. In fact in many tasks, two documents with the same
content (e.g. two editions of a book) should not from the user's viewpoint both
be retrieved, because given the first, the second now has zero relevance.
One interesting example is a dictionary that would let you understand a document in another language. Given the task, the dictionary and the document are only relevant in combination, so a) their relevance depends on each other, and b) if you retrieve one at a time, then the relevance of the first depends upon the future (whether you retrieve the second).
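As a rough illustration of how relevance can change during a session (the rule below is a toy of my own, not part of the framework), a system could discount documents whose content has already been retrieved earlier in the same session:

```python
# Illustrative sketch only: relevance that depends on the session so far.
# The scoring rule and all names are assumptions, not Mizzaro's definitions.

def session_relevance(doc, static_score, already_retrieved):
    """Adjust a static relevance score in the light of what the session
    has already retrieved.

    doc               -- dict with a 'content_key' identifying its content
                         (two editions of a book share the same key)
    static_score      -- relevance estimated without session history
    already_retrieved -- set of content_keys seen earlier in the session
    """
    if doc["content_key"] in already_retrieved:
        # Same content already found: relevance falls to zero for this user.
        return 0.0
    return static_score

# Example: the second edition of an already-retrieved book scores zero.
seen = {"smith-ir-textbook"}
print(session_relevance({"content_key": "smith-ir-textbook"}, 0.9, seen))  # 0.0
print(session_relevance({"content_key": "jones-survey"}, 0.6, seen))       # 0.6
```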
The information resource dimension distinguishes the relevance of a document, of its surrogate (e.g. a summary), and of the information it contains. If a user wants a fact (not a document for its own sake), then a big document containing the fact is less relevant than a small document containing it, because of dilution.
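A toy way to express this dilution effect (the formula is my own invention, purely for illustration) is to damp the score of a fact-bearing document by its length:

```python
import math

# Toy illustration of "dilution" (my own formula, not from the framework):
# a fact buried in a long document is worth less than the same fact in a
# short one, because the user must wade through more irrelevant material.

def diluted_relevance(contains_fact, doc_length_words):
    if not contains_fact:
        return 0.0
    # Damp by document length; the log keeps the penalty gentle.
    return 1.0 / math.log2(2 + doc_length_words)

print(diluted_relevance(True, 200))    # short note: higher score
print(diluted_relevance(True, 20000))  # long report: lower score
```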
Here are some notes on issues that the framework may not cover
properly.
- Binary or scalar relevance judgements.
(But see [7] below.)
In traditional IR, human judges are asked to make binary judgements of a
document's relevance, while the software is required to assign a scalar or at
least ordinal score. Comparing the two (apples and oranges) is obviously a
mistake. We could require the software to make binary judgements, but that would mean that the lists of documents offered were no longer sorted by estimated relevance, which is clearly one of the biggest benefits of IR as opposed to database retrieval. So it seems clear that we should ask experts to
make scalar not binary judgements of relevance in order to test IR software
properly.
This is the subject of Fabio's session. It seems clearly right and necessary.
Stefano's framework does not really help here: it seems more likely to lead to
eliciting a large set of binary judgements (one for each type of relevance),
rather than the ordinal score required.
Now Stefano might respond that even though he talks in terms of binary
judgements, we could easily take a scalar measure for each of his dimensions.
But there is a further crucial point here. A single ordinal score is almost
certainly what is truly necessary in the software, because only that allows
the user to scan a list in order of merit. Even if it is true that relevance
is "really" multi-dimensional, software will not be useful unless it can
produce one-dimensional ordered lists for users because the user's task is a
decision task: choosing a single document to examine next. This argument is
essentially similar to those about cost-benefit calculations and whether all
kinds of benefits (human life, works of art, etc.) can all be reduced to the
single measure of money. The argument about decision tasks says yes: a choice
must be made, so everything must be compared on a single ordinal dimension.
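As a sketch of that reduction to a single dimension (the component names follow the framework, but the weights and the combining rule are arbitrary assumptions of mine, purely for illustration):

```python
# Sketch of the point above: even if relevance is "really" multi-dimensional
# (topic, task, context scores per document), the user's decision task needs
# a single ordered list.  The weights here are arbitrary illustrations, not
# part of Mizzaro's framework.

WEIGHTS = {"topic": 0.6, "task": 0.3, "context": 0.1}

def combined_score(doc_scores):
    """doc_scores: dict of per-component relevance scores in [0, 1]."""
    return sum(WEIGHTS[k] * doc_scores.get(k, 0.0) for k in WEIGHTS)

docs = {
    "doc_a": {"topic": 0.9, "task": 0.2, "context": 0.5},
    "doc_b": {"topic": 0.4, "task": 0.9, "context": 0.9},
}
ranking = sorted(docs, key=lambda d: combined_score(docs[d]), reverse=True)
print(ranking)  # a one-dimensional ordering the user can scan
```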
- Information needs should be distinguished by representation, not cause
Stefano's approach to the information need dimension is to define the
distinctions in terms of the representation of the need. This is an attractive
approach, as it seems easy to relate it to operational measures: if the need is
expressed in natural language then it is a request, and so on. This is better than, say, an attempt to make distinctions based on time sequence (e.g. before the user expresses it, after it, and so on), because a need may well begin life as a written request to the user from their manager, and the user then adds background knowledge to create a PIN by working "back" from the request.
Thus Stefano's definition of information need type by representation is a
good move because a) it is independent of time (as I showed, the EIN may come
before or after the PIN in historical sequence; and that is one reason why he
has time as a separate dimension); b) it is independent of cause (the EIN may
come from the PIN, or vice versa).
- External, socially specified information needs.
Jane Reid points out that many information needs come to a user from another
person e.g. your boss orders you to do some task, which will require you to do
some IR. The framework allows for this well in one way (distinguishing what is spoken or written about the need from what is implicit in background knowledge). However it does not explicitly allow for these two (or more) agents. Yet this is an important feature of many real-life IR tasks, and we may need a framework that describes this. For instance, software that supports the interaction between humans about the IR task may be more useful in many cases than software that does not, but we could not describe the difference using only Stefano's framework.
- Information needs: are they ordered?
Stefano's approach is to have an ordered dimension of information need. But is
there really an order there, or only an unordered set? As argued above, at
first it seems like the order is that of time and cause (RIN comes first, and
then the PIN is made from it, etc.); but that is often not what happens
e.g. the Request may come first and the PIN be created from it.
Another possible order is that of completeness and detail: the amount of
information in each representation. The RIN seems the whole thing, and the
PIN a subset (the part the user consciously realises), and then the request
is a subset of that (the part the user knows how and decides to express).
But this may not always be the case. A query may be more precise than the
request, and only the requirement of having to formalise the query may make
the user calculate exactly what they need. Or a user may believe (the PIN)
they need complete information for a task, when in fact (the RIN) they only
need approximate information: e.g. they request a photo to recognise someone,
but in fact there is only one person in the room they do not already know so
recognition is easy; or they request a full bibliographic citation, but in fact just the journal and author are enough because there is only one paper by an author of that name.
So there are apparent orderings by time, causality, and amount (completeness)
of information, but none of these is valid and general. The information need
dimension must be unordered.
[Jane Reid's comment] The discrepancy between these different information need elements is important, not the ordering. To me, the vital point of time is where the PIN (against which relevance judgements (RJs) are made by the user) most closely resembles the RIN, i.e. the point at which the user has the best, closest and clearest idea of his "real" information need. It is for this reason that I am using post-feedback RJs in my PhD work.
- It follows that a complete description of a task's information need would
require the following kinds of description:
- The set of types of information need in terms of type of representation
- The time sequence in which various representations appeared
- The causal sequence (actually, partial order) of these
- The sequence or set of human agents involved in formulating some of these.
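A sketch of what such a description might look like as a record (all names are mine, chosen only for illustration):

```python
# A sketch (my own names, not Mizzaro's) of what a complete description of a
# task's information need might record, following the kinds of description
# listed above.

from dataclasses import dataclass, field

@dataclass
class NeedRepresentation:
    kind: str            # "RIN", "PIN", "EIN" (request) or "FIN" (query)
    content: str         # the need as expressed in that representation
    formulated_by: str   # human agent who produced it, e.g. "user", "boss"
    time_index: int      # when it appeared in the history of the task

@dataclass
class InformationNeedDescription:
    representations: list = field(default_factory=list)  # of NeedRepresentation
    # Causal links as (earlier, later) pairs of representation kinds; only a
    # partial order, since e.g. a request may precede or follow the PIN.
    causal_order: list = field(default_factory=list)

need = InformationNeedDescription(
    representations=[
        NeedRepresentation("EIN", "Find papers on interactive IR", "boss", 0),
        NeedRepresentation("PIN", "recent evaluations of interactive IR", "user", 1),
    ],
    causal_order=[("EIN", "PIN")],  # here the request caused the PIN
)
```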
Measuring relevance
The following further comments arose from a conversation with
Jane Reid on 6 Aug 1998.
We need to measure relevance (e.g. in creating test collections) in order to
test software. And in connection with that, we are interested in the question
of the degree of consensus between judges.
This involves thinking of relevance judgements (RJs) as actions (not
abstract, eternal value quantities), and of how they can be measured. The
usual way is to ask experts to give a judgement. In contrast, in current IR
systems the user typically makes a decision e.g. about whether to open a
document within one or two seconds: this cannot sensibly be considered as reflecting the user's best evaluation of a document's relevance. Research on relevance judgements must equally beware of this: asking people about relevance is not necessarily much connected to what their behaviour would be when using IR software.
So note that the main reason for studying RJs is to support user actions
better while they are using IR software. We want ultimately to support their
making optimal actions. Relevance, however defined, is only part of this.
Note too that experiments on RJs are a kind of action, but usually a different
kind of action, so we need to be aware of these issues in order to relate
experimental findings with user behaviour during IR, rather than naively
equating them.
- Time to make the relevance judgement: do we ask them to make the judgement in (say) 2 seconds maximum (i.e. our estimate of the time taken to make most user decisions in current IR), or do we ask them to reflect and discuss and come up with a final judgement? Of course, now I ask this, we probably want to run studies where we do both, and report on whether there is a difference in the judgements made.
- RJs may be
- binary (the traditional way of collecting them in test collections); or
- scalar (e.g. scored as numbers between zero and one, i.e. these are absolute values); or
- ordinal (the judge ranks the set of documents in a full ordering i.e.
relative relevance); or
- just the single most relevant is selected.
Note that in current IR software, the decision users actually have to take is
just to pick one action from a set of (typically) 12: 10 ranked documents, or
choose to look at the next 10 surrogates, or choose to abandon this query
and/or task. None of the four alternative ways of measuring relevance just
listed actually correspond to the user's decision task. A fifth alternative
would be:
- the actual 12-way choice. This seems like the binary RJ, but unlike the
binary RJ we would expect it to depend in part on whether there are more
surrogates to view, whether the user could think of different queries to try,
etc.
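A sketch of that decision set as a data structure (the option names and the page size of 10 are illustrative assumptions):

```python
# Sketch of the user's actual decision task described above: a single choice
# among (typically) 12 alternatives rather than a relevance score per
# document.  The option names are mine, chosen for illustration.

from enum import Enum

class SessionAction(Enum):
    NEXT_PAGE = "view the next 10 surrogates"
    ABANDON = "abandon this query and/or task"

def decision_options(ranked_docs, page_size=10):
    """Return the full option set the user actually chooses from."""
    return list(ranked_docs[:page_size]) + [SessionAction.NEXT_PAGE,
                                            SessionAction.ABANDON]

options = decision_options([f"doc_{i}" for i in range(25)])
print(len(options))  # 12: ten documents plus the two session-level actions
```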
- Another measure that perhaps should be collected is the degree of
consensus about a judgement. In experiments, we would like to
collect this to teach us more about the nature of RJs. But in software, such
a number could be useful too: it would be like having a sensitivity or confidence score associated with the RJ score, and this could be displayed to the user, used to decide whether and where to threshold lists (if this is done), and so on.
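One possible way (my choice of measure, offered only as an illustration) to attach such a consensus or confidence score to an RJ:

```python
from statistics import mean, pstdev

# A minimal sketch of one possible consensus measure (my choice, not taken
# from the text): take several judges' scalar RJs for the same document and
# report the mean as the RJ plus a spread-based confidence score.

def rj_with_consensus(judge_scores):
    """judge_scores: list of scalar RJs in [0, 1] from different judges."""
    score = mean(judge_scores)
    # High spread = low consensus; map the std dev onto a 0..1 confidence.
    confidence = 1.0 - 2.0 * pstdev(judge_scores)
    return score, max(0.0, confidence)

print(rj_with_consensus([0.8, 0.9, 0.85]))  # close agreement, high confidence
print(rj_with_consensus([0.1, 0.9, 0.5]))   # wide disagreement, low confidence
```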
- Group relevance judgements. An interesting way to study RJs, though
it does raise still more issues, is to have a set of users or experts discuss
an RJ to see if this increases consensus. After all, if there is a lot of
variation in RJ values this could be because there is a lot of genuine
disagreement, or because some users missed some point but would quickly agree
on the "real" RJ value if this is pointed out. One possible format for
studying this might be to give an RJ problem to a group of people; ask each
person separately to give an RJ instantly (within a 2 second limit); then to
think more carefully until they are happy with an opinion and to record that;
then to discuss it as a group and agree a single group RJ; then to give again
a private RJ. This design can show the effect of group discussion (if any) on
private judgements.
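A sketch of the record such a four-phase study might keep per participant (the field names are mine, for illustration only):

```python
# Sketch of a record for the four-phase group-study format described above
# (instant private RJ, considered private RJ, agreed group RJ, final private
# RJ).  Field names are mine, for illustration only.

from dataclasses import dataclass

@dataclass
class GroupStudyRecord:
    participant: str
    instant_rj: float        # given within the ~2 second limit
    considered_rj: float     # after careful individual thought
    group_rj: float          # single value agreed by the whole group
    final_private_rj: float  # given privately after the discussion

    def discussion_shift(self) -> float:
        """How far the discussion moved this person's private judgement."""
        return self.final_private_rj - self.considered_rj

r = GroupStudyRecord("P1", instant_rj=0.3, considered_rj=0.5,
                     group_rj=0.8, final_private_rj=0.75)
print(r.discussion_shift())  # 0.25: the group discussion raised the judgement
```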
- Look at IR and relevance in the wild i.e. while a user is doing IR,
and measure both their behaviour (what documents they select e.g. to open) AND
what they say about its relevance (e.g. as soon as they select one, ask them
how relevant they think it is i.e. ask why they selected it). There may well
be cases where users select a document while saying it is not relevant.
(Possible cases: a retrieved document says "free holiday for the first to read
this" even though that is not related to the query; i.e. money, sex, etc. are
likely to motivate the user independently of their IR task; or a document
that says "Today's tips for improving your retrieval methods with this
collection"; i.e. the user may decide that the document will help them
indirectly, and so describe this as "irrelevant" even though it will help
with the information need.) If such divergences are in fact observed, there are broadly two possible issues. Firstly, can the relevance framework cover the triggering of new information needs? Secondly, users may, when asked, use the word "relevance" in a narrower sense than is needed for a comprehensive framework like Mizzaro's, which must affect how we do research on relevance.
The above points do not really challenge Mizzaro's framework, but raise issues
about where it fits into IR. That framework tacitly supposes that relevance
is an abstract quantity that has some real absolute amount. That is in
practice an acceptable assumption because of how it is used within a typical
IR system: where documents are fully ordered by a score, and the question is
whether the order the machine produces is the same as a user would give.
Thus difficult questions about absolute relevance are in fact irrelevant to
how "relevance" scores will be used in the software. However these points are
important with respect to how we could or should collect RJs experimentally.
Should RJs be expressed as binary, scalar, ordinal, or what?
Since what we are going to use the RJs for is to rank documents, it seems
likely that we should ask experts to rank documents, not give either binary or
scalar judgements. But since the user's decision task is actually not exactly that, but also includes decisions on whether to choose any document at all, this point needs further work.
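If we do collect rankings from experts, one possible (and here purely illustrative) way to score the software would be a rank correlation, e.g. Spearman's rho, between the system's ordering and the expert's ordering of the same documents:

```python
# Sketch of one way to score software against ranked (rather than binary or
# scalar) expert judgements: Spearman rank correlation between the system's
# ordering and the expert's ordering.  This is my choice of measure, offered
# only to illustrate the point above; it assumes no tied ranks.

def spearman_rho(system_order, expert_order):
    """Both arguments are lists of the same document ids, each in ranked order."""
    n = len(system_order)
    expert_rank = {doc: i for i, doc in enumerate(expert_order)}
    d_squared = sum((i - expert_rank[doc]) ** 2
                    for i, doc in enumerate(system_order))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

system = ["d1", "d2", "d3", "d4"]
expert = ["d2", "d1", "d3", "d4"]
print(spearman_rho(system, expert))  # 0.8: close but not perfect agreement
```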
Probably we should measure the degree of consensus in RJs all the time (i.e. we should criticise software that comes up with different RJs more when the human judges all agree on those judgements than when they disagree a lot).
Jane Reid points out there are still further issues about how to combine
the RJs from several judges e.g. average binary judgements to give a scalar
(that could be interpreted as the probability that the next judge will give a
positive binary judgement), etc.
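The simplest such combination, as a sketch:

```python
# Sketch of the combination Jane Reid mentions: averaging several judges'
# binary RJs gives a scalar that can be read as an estimate of the
# probability that the next judge would also say "relevant".

def combine_binary_rjs(binary_rjs):
    """binary_rjs: list of 0/1 judgements from different judges."""
    return sum(binary_rjs) / len(binary_rjs)

print(combine_binary_rjs([1, 1, 0, 1]))  # 0.75
```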
To the extent that the time taken to make the judgement, and group discussion, change RJs, we should perhaps always use the longest, slowest, most discussion-based method to make RJs, i.e. go for the "best" RJs, because we want our software to make the best RJs (the best ordering).
Where there are big differences between these "best" RJs and the quick judgements users actually make, this is probably a signal that our software needs to support the decision with better or different surrogates or other supporting information.
User actions are NOT the same as, nor predicted by, topic relevance. But Mizzaro's second dimension above, components, describes this: a user may select a document that is not relevant to the topic, but is relevant in another way, perhaps to the task. However, we must measure and use these separately if we are to design software to support user choices.
Furthermore [Jane Reid's comment] the
context is defined as "information about the subject domain used during the
activity". I don't recall whether this is exactly how Stefano himself
described it, but, if so, this seems to me rather a narrow interpretation of
context. There will be, after all, contextual factors which have little or
nothing to do with the domain which may well influence RJs. These include
other indirect domain information the user might bring to bear, e.g. "my
supervisor doesn't think very much of this researcher, so any papers by him
are worth little to me", or completely independent factors, e.g. what level
of enthusiasm or motivation the user has on a particular day. I can think of
lots of different examples in these areas. In other words, I think either:
- the definition of context needs to be widened to take this into account or
- there is really a SET of contexts, rather than one simple one, e.g. a
domain context, a task context, a topic context...
Of course, it is possible
that this could be interpreted as being part of the information need (PIN,
RIN), but it doesn't seem to fit well into that category. And again, if so,
this should be made explicit.
The issues of the speed of making a judgement, and of whether discussion modifies judgements, are important for measuring relevance, but may not affect the framework.
Try to measure the best relevance possible (allow the judges time, and
discussion).
We need a whole other approach to measure what influences users in making
the right (or wrong) relevance judgements while doing IR. This is not
relevant to the framework, but it is relevant to another important part of
evaluation in IR.
Demonstrations are a good way of getting people to believe in the importance
of the issues you raise: personal experience is more convincing than reading
about someone else's experiment.
Demonstrations are mainly single examples. A single example cannot prove a point about all possible cases, but single examples can prove some things. A single
counterexample disproves a theory (or other universal statement); a single
example proves that a proposition is possible in the world. A single example
that strikes everyone as absolutely typical can prove that an issue is
important (not just applying to special and unusual cases).
So for each phenomenon (e.g. how much consensus is there in relevance judgements?), we might look for three demo cases: one extreme (a case where everyone agrees), the other extreme (a case where everyone disagrees), and a prototypical case (a case that seems extremely common, so that if there is some disagreement about relevance even there, then lack of consensus must be a significant issue in general).
Cases to look for?
- A porn or money item inserted in a retrieval list, to show people selecting
a document to look at which they say is not relevant.
- A bad or misleading surrogate to demonstrate a case where people's fast (2
second) relevance judgement is different from their slower, more careful
judgement.
- A case to show up differences between binary, ordinal, and scalar RJs, e.g. a query "door on left of window" and a picture with the door on the right of the window. (In this case, the picture is absolutely irrelevant and might be scored zero, yet will probably be ranked as more worth a second look than pictures with no doors and no windows.)
- Do we need binary relevance judgements, or scalar (quantitative) ones?
- Or, further, which of: binary, scalar, ordinal, single most relevant
document, single most relevant action, degree of consensus?
- Must we have a single relevance judgement, even though relevance
is really multi-dimensional?
- Is the information need dimension unordered or ordered?