Last changed 16 Aug 1998. Length about 3,500 words (22,000 bytes).
This is a WWW document maintained by Steve Draper, installed at http://www.psy.gla.ac.uk/~steve/stefano.html. You may copy it.
Mizzaro's framework for relevance
by
Stephen W. Draper
Stefano Mizzaro offers a model or framework of the senses of relevance
in IR. For us, the question is whether it covers all the issues important in
real information retrieval today, and hence could be used to structure our
discussions of how to evaluate IR software in actual, interactive, multimedia
retrieval.
Mizzaro's framework has 4 dimensions:
- Information need: RIN, PIN, EIN, FIN: the real, perceived (mental),
expressed (natural language) and formal (query) representations.
- Components: topic (the information itself), task (what it is used for),
context (meta-information about the information).
- Time: the point in the retrieval session at which the relevance judgement is made.
- Information resource: document, surrogate, the information itself.
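To make the framework concrete, here is a minimal sketch (in Python, with class and field names of my own choosing; an illustration only, not anything Mizzaro himself specifies) of a relevance judgement located in this four-dimensional space:

```python
# Illustrative sketch only: a relevance judgement located in Mizzaro's four
# dimensions.  The names are mine, not Mizzaro's.

from dataclasses import dataclass
from enum import Enum

class NeedType(Enum):
    RIN = "real information need"
    PIN = "perceived information need"
    EIN = "expressed information need (request)"
    FIN = "formalised information need (query)"

class Component(Enum):
    TOPIC = "the information itself"
    TASK = "what the information is used for"
    CONTEXT = "meta-information about the information"

class Resource(Enum):
    DOCUMENT = "document"
    SURROGATE = "surrogate"
    INFORMATION = "the information itself"

@dataclass
class RelevanceJudgement:
    need: NeedType
    component: Component
    time: int          # which point in the retrieval session
    resource: Resource
    score: float       # the judged relevance itself

rj = RelevanceJudgement(NeedType.PIN, Component.TOPIC, time=3,
                        resource=Resource.SURROGATE, score=0.7)
```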
I have changed the order he uses. His framework is published in:
He defines the types of information need in terms of the representation in which the need is expressed.
RIN = Real Information Need. The need external to the user, perhaps not fully
graspable by them.
PIN = Perceived Information Need. The mental representation in the user's
mind.
EIN = Expressed Information Need = Request: the need expressed in
natural language.
FIN = Formalised Information Need = Query: the need formalised in a
machine language.
The components (actually aspects of the need, or information subgoals) are:
Topic: the central and final goal, the information sought.
Task: the activity for which the information need will be used (in effect, a
higher level goal than the information need).
Context: information about the subject domain used during the activity.
E.g. meta-information about the topic.
An important contrast here is that of topic vs. context. Stefano emphasises subject matter vs. broader domain background; I would emphasise the information needed (e.g. a fact) vs. meta-information that may modify the goal, i.e. the information need itself (e.g. learning that there are many papers with almost the same content, so that no one citation is the best here).
Retrieval sessions nowadays typically involve many retrievals. The
relevance of an item changes over time: it can fall if the item has already
been found; it can rise, if information was found about the "context" that now
lets the user recognise its importance for the first time. Thus the relevance
of the same document for the same information need often changes during the
course of a session. In fact in many tasks, two documents with the same
content (e.g. two editions of a book) should not from the user's viewpoint both
be retrieved, because given the first, the second now has zero relevance.
One interesting example is a dictionary that would let you understand a document in another language. Given the task, the dictionary and the document are only relevant in combination, so a) their relevance depends on each other, and b) if you retrieve one at a time, then the relevance of the first depends upon the future (whether you retrieve the second).
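As a rough illustration of how relevance can change during a session (the rule below is a toy of my own, not part of the framework), a system could discount documents whose content has already been retrieved earlier in the same session:

```python
# Illustrative sketch only: relevance that depends on the session so far.
# The scoring rule and all names are assumptions, not Mizzaro's definitions.

def session_relevance(doc, static_score, already_retrieved):
    """Adjust a static relevance score in the light of what the session
    has already retrieved.

    doc               -- dict with a 'content_key' identifying its content
                         (two editions of a book share the same key)
    static_score      -- relevance estimated without session history
    already_retrieved -- set of content_keys seen earlier in the session
    """
    if doc["content_key"] in already_retrieved:
        # Same content already found: relevance falls to zero for this user.
        return 0.0
    return static_score

# Example: the second edition of an already-retrieved book scores zero.
seen = {"smith-ir-textbook"}
print(session_relevance({"content_key": "smith-ir-textbook"}, 0.9, seen))  # 0.0
print(session_relevance({"content_key": "jones-survey"}, 0.6, seen))       # 0.6
```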
The information resource dimension distinguishes the relevance of a document, of its surrogate (e.g. a summary), and of the information it contains. If a user wants a fact (not a document for its own sake), then a big document containing the fact is less relevant than a small document containing it, because of dilution.
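A toy way to express this dilution effect (the formula is my own invention, purely for illustration) is to damp the score of a fact-bearing document by its length:

```python
import math

# Toy illustration of "dilution" (my own formula, not from the framework):
# a fact buried in a long document is worth less than the same fact in a
# short one, because the user must wade through more irrelevant material.

def diluted_relevance(contains_fact, doc_length_words):
    if not contains_fact:
        return 0.0
    # Damp by document length; the log keeps the penalty gentle.
    return 1.0 / math.log2(2 + doc_length_words)

print(diluted_relevance(True, 200))    # short note: higher score
print(diluted_relevance(True, 20000))  # long report: lower score
```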
Here are some notes on issues that the framework may not cover
properly.
- Binary or scalar relevance judgements.
(But see [7] below.)
In traditional IR, human judges are asked to make binary judgements of a
document's relevance, while the software is required to assign a scalar or at
least ordinal score. Comparing the two (apples and oranges) is obviously a
mistake. We could require the software to make binary judgements, but that would mean that the lists of documents offered were no longer sorted by estimated relevance, which is clearly one of the biggest benefits of IR as opposed to database retrieval. So it seems clear that we should ask experts to
make scalar not binary judgements of relevance in order to test IR software
properly.
This is the subject of Fabio's session. It seems clearly right and necessary.
Stefano's framework does not really help here: it seems more likely to lead to
eliciting a large set of binary judgements (one for each type of relevance),
rather than the ordinal score required.
Now Stefano might respond that even though he talks in terms of binary
judgements, we could easily take a scalar measure for each of his dimensions.
But there is a further crucial point here. A single ordinal score is almost
certainly what is truly necessary in the software, because only that allows
the user to scan a list in order of merit. Even if it is true that relevance
is "really" multi-dimensional, software will not be useful unless it can
produce one-dimensional ordered lists for users because the user's task is a
decision task: choosing a single document to examine next. This argument is
essentially similar to those about cost-benefit calculations and whether all
kinds of benefits (human life, works of art, etc.) can all be reduced to the
single measure of money. The argument about decision tasks says yes: a choice
must be made, so everything must be compared on a single ordinal dimension.
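As a sketch of that reduction to a single dimension (the component names follow the framework, but the weights and the combining rule are arbitrary assumptions of mine, purely for illustration):

```python
# Sketch of the point above: even if relevance is "really" multi-dimensional
# (topic, task, context scores per document), the user's decision task needs
# a single ordered list.  The weights here are arbitrary illustrations, not
# part of Mizzaro's framework.

WEIGHTS = {"topic": 0.6, "task": 0.3, "context": 0.1}

def combined_score(doc_scores):
    """doc_scores: dict of per-component relevance scores in [0, 1]."""
    return sum(WEIGHTS[k] * doc_scores.get(k, 0.0) for k in WEIGHTS)

docs = {
    "doc_a": {"topic": 0.9, "task": 0.2, "context": 0.5},
    "doc_b": {"topic": 0.4, "task": 0.9, "context": 0.9},
}
ranking = sorted(docs, key=lambda d: combined_score(docs[d]), reverse=True)
print(ranking)  # a one-dimensional ordering the user can scan
```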
- Information needs should be distinguished by representation, not cause
Stefano's approach to the information need dimension is to define the
distinctions in terms of the representation of the need. This is an attractive
approach, as it seems easy to relate it to operational measures: if the need is
expressed in natural language then it is a request, and so on. This is better than, say, an attempt to make distinctions based on time sequence (e.g. before the user expresses it, after it, and so on), because a need may well begin life as a written request to the user from their manager, and the user then adds background knowledge to create a PIN by working "back" from the request.
Thus Stefano's definition of information need type by representation is a
good move because a) it is independent of time (as I showed, the EIN may come
before or after the PIN in historical sequence; and that is one reason why he
has time as a separate dimension); b) it is independent of cause (the EIN may
come from the PIN, or vice versa).
- External, socially specified information needs.
Jane Reid points out that many information needs come to a user from another
person e.g. your boss orders you to do some task, which will require you to do
some IR. The framework allows for this well in one way (distinguishing what is spoken or written about the need from what is implicit in background knowledge). However it does not explicitly allow for these two (or more) agents. Yet this is an important feature of many real-life IR tasks, and we may need a framework that describes this. For instance, software that supports the interaction between humans about the IR task may be more useful in many cases than software that does not, but we could not describe the difference using only Stefano's framework.
- Information needs: are they ordered?
Stefano's approach is to have an ordered dimension of information need. But is
there really an order there, or only an unordered set? As argued above, at
first it seems like the order is that of time and cause (RIN comes first, and
then the PIN is made from it, etc.); but that is often not what happens
e.g. the Request may come first and the PIN be created from it.
Another possible order is that of completeness and detail: the amount of
information in each representation. The RIN seems the whole thing, and the
PIN a subset (the part the user consciously realises), and then the request
is a subset of that (the part the user knows how and decides to express).
But this may not always be the case. A query may be more precise than the
request, and only the requirement of having to formalise the query may make
the user calculate exactly what they need. Or a user may believe (the PIN)
they need complete information for a task, when in fact (the RIN) they only
need approximate information: e.g. they request a photo to recognise someone,
but in fact there is only one person in the room they do not already know so
recognition is easy; or they request a full bibliographic citation, but in fact just the journal and author are enough because there is only one paper by an author of that name.
So there are apparent orderings by time, causality, and amount (completeness)
of information, but none of these is valid and general. The information need
dimension must be unordered.
[Jane Reid's comment] The discrepancy between these different information need elements is important, not the ordering. To me, the vital point of time is where the PIN (against which relevance judgements (RJs) are made by the user) most closely resembles the RIN, i.e. the point at which the user has the best, closest and clearest idea of his "real" information need. It is for this reason that I am using post-feedback RJs in my PhD work.
- It follows that a complete description of a task's information need would
require the following kinds of description:
- The set of types of information need in terms of type of representation
- The time sequence in which various representations appeared
- The causal sequence (actually, partial order) of these
- The sequence or set of human agents involved in formulating some of these.
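A sketch of what such a description might look like as a record (all names are mine, chosen only for illustration):

```python
# A sketch (my own names, not Mizzaro's) of what a complete description of a
# task's information need might record, following the kinds of description
# listed above.

from dataclasses import dataclass, field

@dataclass
class NeedRepresentation:
    kind: str            # "RIN", "PIN", "EIN" (request) or "FIN" (query)
    content: str         # the need as expressed in that representation
    formulated_by: str   # human agent who produced it, e.g. "user", "boss"
    time_index: int      # when it appeared in the history of the task

@dataclass
class InformationNeedDescription:
    representations: list = field(default_factory=list)  # of NeedRepresentation
    # Causal links as (earlier, later) pairs of representation kinds; only a
    # partial order, since e.g. a request may precede or follow the PIN.
    causal_order: list = field(default_factory=list)

need = InformationNeedDescription(
    representations=[
        NeedRepresentation("EIN", "Find papers on interactive IR", "boss", 0),
        NeedRepresentation("PIN", "recent evaluations of interactive IR", "user", 1),
    ],
    causal_order=[("EIN", "PIN")],  # here the request caused the PIN
)
```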
Measuring relevance
The following further comments arose from a conversation with
Jane Reid on 6 Aug 1998.
We need to measure relevance (e.g. in creating test collections) in order to
test software. And in connection with that, we are interested in the question
of the degree of consensus between judges.
This involves thinking of relevance judgements (RJs) as actions (not
abstract, eternal value quantities), and of how they can be measured. The
usual way is to ask experts to give a judgement. In contrast, in current IR
systems the user typically makes a decision e.g. about whether to open a
document within one or two seconds: this cannot sensibly be considered as reflecting the user's best evaluation of a document's relevance. Research on relevance judgements must equally beware of this: asking people about relevance is not necessarily much connected to what their behaviour would be when using IR software.
So note that the main reason for studying RJs is to support user actions
better while they are using IR software. We want ultimately to support their
making optimal actions. Relevance, however defined, is only part of this.
Note too that experiments on RJs are a kind of action, but usually a different
kind of action, so we need to be aware of these issues in order to relate
experimental findings with user behaviour during IR, rather than naively
equating them.
- Time to make the relevance judgement: do we ask them to make the judgement in (say) 2 seconds maximum (i.e. our estimate of the time taken to make most user decisions in current IR), or do we ask them to reflect and discuss and come up with a final judgement? Of course, now I ask this, we probably want to run studies where we do both, and report on whether there is a difference in the judgements made.
- RJs may be
- binary (the traditional way of collecting them in test collections); or
- scalar (e.g. scored as numbers between zero and one, i.e. these are absolute values); or
- ordinal (the judge ranks the set of documents in a full ordering i.e.
relative relevance); or
- just the single most relevant is selected.
Note that in current IR software, the decision users actually have to take is
just to pick one action from a set of (typically) 12: 10 ranked documents, or
choose to look at the next 10 surrogates, or choose to abandon this query
and/or task. None of the four alternative ways of measuring relevance just
listed actually correspond to the user's decision task. A fifth alternative
would be:
- the actual 12-way choice. This seems like the binary RJ, but unlike the
binary RJ we would expect it to depend in part on whether there are more
surrogates to view, whether the user could think of different queries to try,
etc.
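A sketch of that decision set as a data structure (the option names and the page size of 10 are illustrative assumptions):

```python
# Sketch of the user's actual decision task described above: a single choice
# among (typically) 12 alternatives rather than a relevance score per
# document.  The option names are mine, chosen for illustration.

from enum import Enum

class SessionAction(Enum):
    NEXT_PAGE = "view the next 10 surrogates"
    ABANDON = "abandon this query and/or task"

def decision_options(ranked_docs, page_size=10):
    """Return the full option set the user actually chooses from."""
    return list(ranked_docs[:page_size]) + [SessionAction.NEXT_PAGE,
                                            SessionAction.ABANDON]

options = decision_options([f"doc_{i}" for i in range(25)])
print(len(options))  # 12: ten documents plus the two session-level actions
```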
- Another measure that perhaps should be collected is the degree of
consensus about a judgement. In experiments, we would like to
collect this to teach us more about the nature of RJs. But in software, such
a number could be useful too: it would be like having a sensitivity or confidence score associated with the RJ score, and this could be displayed to the user, used to decide whether and where to threshold lists (if this is done), and so on.
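One possible way (my choice of measure, offered only as an illustration) to attach such a consensus or confidence score to an RJ:

```python
from statistics import mean, pstdev

# A minimal sketch of one possible consensus measure (my choice, not taken
# from the text): take several judges' scalar RJs for the same document and
# report the mean as the RJ plus a spread-based confidence score.

def rj_with_consensus(judge_scores):
    """judge_scores: list of scalar RJs in [0, 1] from different judges."""
    score = mean(judge_scores)
    # High spread = low consensus; map the std dev onto a 0..1 confidence.
    confidence = 1.0 - 2.0 * pstdev(judge_scores)
    return score, max(0.0, confidence)

print(rj_with_consensus([0.8, 0.9, 0.85]))  # close agreement, high confidence
print(rj_with_consensus([0.1, 0.9, 0.5]))   # wide disagreement, low confidence
```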
- Group relevance judgements. An interesting way to study RJs, though
it does raise still more issues, is to have a set of users or experts discuss
an RJ to see if this increases consensus. After all, if there is a lot of
variation in RJ values this could be because there is a lot of genuine
disagreement, or because some users missed some point but would quickly agree
on the "real" RJ value if this is pointed out. One possible format for
studying this might be to give an RJ problem to a group of people; ask each
person separately to give an RJ instantly (within a 2 second limit); then to
think more carefully until they are happy with an opinion and to record that;
then to discuss it as a group and agree a single group RJ; then to give again
a private RJ. This design can show the effect of group discussion (if any) on
private judgements.
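A sketch of the record such a four-phase study might keep per participant (the field names are mine, for illustration only):

```python
# Sketch of a record for the four-phase group-study format described above
# (instant private RJ, considered private RJ, agreed group RJ, final private
# RJ).  Field names are mine, for illustration only.

from dataclasses import dataclass

@dataclass
class GroupStudyRecord:
    participant: str
    instant_rj: float        # given within the ~2 second limit
    considered_rj: float     # after careful individual thought
    group_rj: float          # single value agreed by the whole group
    final_private_rj: float  # given privately after the discussion

    def discussion_shift(self) -> float:
        """How far the discussion moved this person's private judgement."""
        return self.final_private_rj - self.considered_rj

r = GroupStudyRecord("P1", instant_rj=0.3, considered_rj=0.5,
                     group_rj=0.8, final_private_rj=0.75)
print(r.discussion_shift())  # 0.25: the group discussion raised the judgement
```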
- Look at IR and relevance in the wild i.e. while a user is doing IR,
and measure both their behaviour (what documents they select e.g. to open) AND
what they say about its relevance (e.g. as soon as they select one, ask them
how relevant they think it is i.e. ask why they selected it). There may well
be cases where users select a document while saying it is not relevant.
(Possible cases: a retrieved document says "free holiday for the first to read
this" even though that is not related to the query; i.e. money, sex, etc. are
likely to motivate the user independently of their IR task; or a document
that says "Today's tips for improving your retrieval methods with this
collection"; i.e. the user may decide that the document will help them
indirectly, and so describe this as "irrelevant" even though it will help
with the information need.) If such divergences are in fact observed, there are broadly two possible issues. Firstly, can the relevance framework cover the triggering of new information needs? Secondly, users may, when asked, use the word "relevance" in a narrower sense than is needed for a comprehensive framework like Mizzaro's, which must affect how we do research on relevance.
The above points do not really challenge Mizzaro's framework, but raise issues
about where it fits into IR. That framework tacitly supposes that relevance
is an abstract quantity that has some real absolute amount. That is in
practice an acceptable assumption because of how it is used within a typical
IR system: where documents are fully ordered by a score, and the question is
whether the order the machine produces is the same as a user would give.
Thus difficult questions about absolute relevance are in fact irrelevant to
how "relevance" scores will be used in the software. However these points are
important with respect to how we could or should collect RJs experimentally.
Should RJs be expressed as binary, scalar, ordinal, or what?
Since what we are going to use the RJs for is to rank documents, it seems
likely that we should ask experts to rank documents, not give either binary or
scalar judgements. But since the user's decision task is actually not exactly that, but also includes decisions on whether to choose any document at all, this point needs further work.
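If we do collect rankings from experts, one possible (and here purely illustrative) way to score the software would be a rank correlation, e.g. Spearman's rho, between the system's ordering and the expert's ordering of the same documents:

```python
# Sketch of one way to score software against ranked (rather than binary or
# scalar) expert judgements: Spearman rank correlation between the system's
# ordering and the expert's ordering.  This is my choice of measure, offered
# only to illustrate the point above; it assumes no tied ranks.

def spearman_rho(system_order, expert_order):
    """Both arguments are lists of the same document ids, each in ranked order."""
    n = len(system_order)
    expert_rank = {doc: i for i, doc in enumerate(expert_order)}
    d_squared = sum((i - expert_rank[doc]) ** 2
                    for i, doc in enumerate(system_order))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

system = ["d1", "d2", "d3", "d4"]
expert = ["d2", "d1", "d3", "d4"]
print(spearman_rho(system, expert))  # 0.8: close but not perfect agreement
```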
Probably we should measure the degree of consensus in RJs all the time (i.e. we should criticise software that comes up with different RJs more when the human judges all agree on those judgements than when they disagree a lot).
Jane Reid points out there are still further issues about how to combine
the RJs from several judges e.g. average binary judgements to give a scalar
(that could be interpreted as the probability that the next judge will give a
positive binary judgement), etc.
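The simplest such combination, as a sketch:

```python
# Sketch of the combination Jane Reid mentions: averaging several judges'
# binary RJs gives a scalar that can be read as an estimate of the
# probability that the next judge would also say "relevant".

def combine_binary_rjs(binary_rjs):
    """binary_rjs: list of 0/1 judgements from different judges."""
    return sum(binary_rjs) / len(binary_rjs)

print(combine_binary_rjs([1, 1, 0, 1]))  # 0.75
```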
To the extent that the time taken to make the judgement, and group discussion, change RJs, we should perhaps always use the longest, slowest, most discussion-based method to make RJs, i.e. go for the "best" RJs, because we want our software to make the best RJs (the best ordering).
Where there are big differences between these "best" RJs and the quick judgements users actually make, this is probably a signal that our software needs to support the decision with better or different surrogates or other supporting information.
User actions are NOT the same as, nor predicted by, topic relevance. But Mizzaro's second dimension above, components, describes this: a user may select a document that is not relevant to the topic, but is relevant in another way, perhaps to the task. However, we must measure and use these separately if we are to design software to support user choices.
Furthermore [Jane Reid's comment] the
context is defined as "information about the subject domain used during the
activity". I don't recall whether this is exactly how Stefano himself
described it, but, if so, this seems to me rather a narrow interpretation of
context. There will be, after all, contextual factors which have little or
nothing to do with the domain which may well influence RJs. These include
other indirect domain information the user might bring to bear, e.g. "my
supervisor doesn't think very much of this researcher, so any papers by him
are worth little to me", or completely independent factors, e.g. what level
of enthusiasm or motivation the user has on a particular day. I can think of
lots of different examples in these areas. In other words, I think either:
- the definition of context needs to be widened to take this into account or
- there is really a SET of contexts, rather than one simple one, e.g. a
domain context, a task context, a topic context...
Of course, it is possible
that this could be interpreted as being part of the information need (PIN,
RIN), but it doesn't seem to fit well into that category. And again, if so,
this should be made explicit.
The issues of the speed of making a judgement, and of whether discussion modifies judgements, are important for measuring relevance, but may not affect the framework.
Try to measure the best relevance possible (allow the judges time, and
discussion).
We need a whole other approach to measure what influences users in making
the right (or wrong) relevance judgements while doing IR. This is not
relevant to the framework, but it is relevant to another important part of
evaluation in IR.
Demonstrations are a good way of getting people to believe in the importance
of the issues you raise: personal experience is more convincing than reading
about someone else's experiment.
Demonstrations are mainly single examples. A single example cannot prove a point about all possible cases, but single examples can prove some things. A single
counterexample disproves a theory (or other universal statement); a single
example proves that a proposition is possible in the world. A single example
that strikes everyone as absolutely typical can prove that an issue is
important (not just applying to special and unusual cases).
So for each phenomenon (e.g. how much consensus is there in relevance judgements?), we might look for three demo cases: one extreme (a case where everyone agrees), the other extreme (a case where everyone disagrees), and a prototypical case (a case that seems extremely common, so that if there is some disagreement about relevance even there, then lack of consensus must be a significant issue in general).
Cases to look for?
- A porn or money item inserted in a retrieval list, to show people selecting
a document to look at which they say is not relevant.
- A bad or misleading surrogate to demonstrate a case where people's fast (2
second) relevance judgement is different from their slower, more careful
judgement.
- A case to show up differences between binary, ordinal, and scalar RJs, e.g. a query "door on left of window" and a picture with the door on the right of the window. (In this case, the picture is absolutely irrelevant and might be scored zero, yet will probably be ranked as more worth a second look than pictures with no doors and no windows.)
- Do we need binary relevance judgements, or scalar (quantitative) ones?
- Or, further, which of: binary, scalar, ordinal, single most relevant
document, single most relevant action, degree of consensus?
- Must we have a single relevance judgement, even though relevance
is really multi-dimensional?
- Is the information need dimension unordered or ordered?