Last changed 28 Nov 1997. Length about 2,000 words (14,000 bytes).
This is a WWW document by Steve Draper, installed at http://www.psy.gla.ac.uk/~steve/Dag.html.
You may copy it.
Dagstuhl: what I learned
by
Stephen W. Draper
This is a review, or rather a short essay, on what I learned at Dagstuhl,
written for the web as part of the workshop output. It is partly for that
purpose (to contribute to the post-workshop deliverable), partly for my own
sake, partly a small foundation for planning Nancy, and partly material for a
possible future paper.
At the MIRA workshop at Dagstuhl (14-18 April 1997) I personally learned
two important things. This is a note expanding on the one-sentence summaries
we were asked to give at the workshop. Each of the two things has a section
here; but in brief, they were: the importance of "context" in what a retrieval
system needs to deliver to the user; and that using text queries to retrieve
images is not a cheat, but rather is what important classes of user need, and
is also representative of one of the most central issues in multimedia
information retrieval: how media can cross-refer to other media.
If you are using a textbook and want the places where it discusses a term,
you look the term up in the index, go to the page listed there, and scan the
page for the term. But you do NOT then
just start reading from that term onwards: at the very least you scan backwards
to the beginning of the sentence, and more probably you look back to see what
chapter and section it is part of in order to see the context in which the term
is discussed. If you did not do that, you would probably not get the
information you needed. Thus the structure of the document and the language
are used implicitly but crucially in delivering the content you want.
It is important to realise this. First, we must recognise that even though the
query engine in a typical IR program does not use the document or language
structures, the overall system does in a crucial way. Indeed the whole point
of an IR system is usually not just to re-print the query terms if they are
found in a document, but to print all or part of the document. Furthermore,
because so many words (at least in English) are ambiguous, the word itself does
not carry the meaning: the context carries the information about which of the
word's meanings is valid here. To put it another way: without the context the
text is not information but only data, and the user may not be able to judge
its relevance. The underlying technical and philosophical point is that while
formal languages rest their meaning on separate, prior definitions which
transmitter and receiver are assumed to have agreed and synchronised in
advance by mechanisms outside the language, natural languages carry at least
part of the meaning within themselves and their "context". That is why taking a word or
a sentence out of context often conveys a different meaning from the one
intended by the speaker and understood by the original audience.
The IR design issue is how to present the document or document-part so that the
presentation carries the context the user wants. In practice, simply handing
over the document is often not what is wanted. On the one hand, if the whole
document is retrieved without the query words being highlighted, many users
reject it because they cannot see its relevance. At the other extreme, if only
the query words are re-printed, or if only the part of the document from the
first query hit onwards is printed, the user will not find that useful either.
So the general design
problem, to which whole-document presentation is a poor solution, is how to
present retrieved documents so as to be most useful to the user given the
search query. The solution will be one which presents the right amount of
"context" around the hits, probably using the document structure even though
that may not have been used in the query-driven retrieval.
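To make the design problem concrete, here is a minimal sketch (in Python; the
function and all its details are mine for illustration, not any real system's)
of presenting hits with surrounding "context" and with the query terms marked,
using sentence boundaries as a crude stand-in for the document structure:

    import re

    def present_hits(document, query_terms, window=1):
        # Present each query hit with the surrounding sentences as "context".
        # Sentences stand in, crudely, for the document structure (chapters,
        # sections) that a real system might exploit.
        sentences = re.split(r'(?<=[.!?])\s+', document)
        terms = [t.lower() for t in query_terms]
        shown = []
        for i, sentence in enumerate(sentences):
            if any(t in sentence.lower() for t in terms):
                lo = max(0, i - window)
                hi = min(len(sentences), i + window + 1)
                chunk = ' '.join(sentences[lo:hi])
                # Mark the hits so the user can see why this part matched.
                for t in terms:
                    chunk = re.sub('(?i)(' + re.escape(t) + ')', r'*\1*', chunk)
                if chunk not in shown:
                    shown.append(chunk)
        return '\n...\n'.join(shown)

The point of the sketch is only that the presentation step uses structure
(here, sentences) that played no part in the query-term matching itself.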
Another aspect of this is suggested by the observation that in some image
retrieval tasks, users are explicit that they do NOT want a single image
returned even if the software were so good that it could correctly calculate
which image would finally be chosen for the user's task. (For example,
designers asking a stock agency to supply a picture for an illustration.)
Instead, such users say they want a set returned: apparently this is not just
so that they can make a choice, but so that they can see something about the
set of neighbours. This may be an image retrieval equivalent of "context": of
seeing the sentence in which a query word appeared.
One of the half-day sessions focussed on image retrieval, and it also
focussed on particular user tasks. Taking these in turn brought home to me
several lessons, the first of which is that there are quite different user
tasks all involving the retrieval of images.
The first case described was pictures of "art" i.e. paintings and photographs,
and a multi-dimensional description system founded on professional thinking in
that area. Professionals think in terms of concepts, and retrieval based on a
database-like system matches this: the database should be structured to
represent these concepts. The different "dimensions" (i.e. attributes) mean
that images can be retrieved in a number of independent ways e.g. what an image
is of (e.g. a woman) or about (e.g. motherhood). Next some experiments using
non-experts to group images in various ways extended the conceptual analysis.
Thirdly, there are commercial agencies with several million images
(photographs) on file who supply them to clients. These retrievals are based
on words filled in on a form, i.e. again essentially a database-like system.
Because an industry is based around this, it seems we can conclude that there
is a large and important user group who have needs for images expressed in
text. A variation of this would be a journalism-oriented collection, where
pictures could be categorised by who (e.g. Chancellor Kohl) and by expression
(e.g. laughing). Finally, however, another example was a retrieval program that
allowed users to express part of the query in terms of 2D space e.g. having a
lot of empty sky at the top right of the picture. This was associated with a
user group of graphic designers, who would use the pictures on leaflets but
would also need space to superimpose words. Because their work was essentially
that of 2D layout on the page, a 2D spatial query system matched their needs.
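The examples suggest at least two quite different query styles, which can be
made concrete with a toy sketch (in Python; the records, field names, and
thresholds are all invented for illustration):

    # Toy image records: conceptual attributes plus a crude "emptiness" score
    # for each quadrant of the picture (0 = visually busy, 1 = empty).
    images = [
        {'id': 'p1', 'of': ['woman', 'child'], 'about': ['motherhood'],
         'empty': {'top_left': 0.1, 'top_right': 0.8,
                   'bottom_left': 0.2, 'bottom_right': 0.3}},
        {'id': 'p2', 'of': ['Chancellor Kohl'], 'about': ['politics'],
         'empty': {'top_left': 0.3, 'top_right': 0.2,
                   'bottom_left': 0.4, 'bottom_right': 0.1}},
    ]

    def by_concept(images, of=None, about=None):
        # Database-like retrieval on the conceptual dimensions.
        return [im for im in images
                if (of is None or of in im['of'])
                and (about is None or about in im['about'])]

    def with_empty_region(images, region, threshold=0.7):
        # Spatial retrieval: pictures leaving room (e.g. empty sky at the
        # top right) for a designer to superimpose words.
        return [im for im in images if im['empty'][region] >= threshold]

    print(by_concept(images, about='motherhood'))    # the art cataloguer's query
    print(with_empty_region(images, 'top_right'))    # the graphic designer's query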
What does this tell me?
- Image retrieval is not one problem: there are at least as many different user
tasks involving retrieving images as there are involving retrieving words.
Different users may need completely different software to support their
tasks.
- Studying particular real user groups is very informative for us; not only in
designing a particular IR program, but also in understanding general ideas
about multimedia IR.
- Using text queries to retrieve images is not a "cheat"; using visual queries
may be entirely the wrong way to retrieve images for some user groups e.g.
journalists, art historians. If users think about images in terms of verbal
concepts, then that is how their queries should be expressed.
- In fact, this probably represents a key issue for multimedia retrieval: how
one medium (text) refers to another (images).
Inter-reference of one kind of modality or medium to another is in fact a
general issue that is relevant in a number of ways to IR.
In human-computer interaction, one underlying but pervasive issue is how
user input may often not mean something by itself but only by referring either
to computer output (e.g. with menus, mouse input only means something by
referring to the menu displayed on the screen) or to previous user input (e.g.
in the Unix shell's history mechanism, or in any undo command). Output may
also refer to previous output or user input (e.g. in many error messages). For
a longer discussion, see Draper, S.W. (1986) "Display managers as the basis
for user-machine communication", in D.A. Norman & S.W. Draper (eds.) User
Centered System Design (Erlbaum: London), pp. 339-352.
Note that interaction seems to entail inter-reference of input and output. If
IR uses interactive systems, it involves inter-referentiality.
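To make the inter-referentiality point concrete, here is a toy sketch (in
Python; the '!!' and '!n' syntax is the familiar Unix shell one, but the
implementation is invented for illustration) of input that only means
something by reference to previous input:

    history = []

    def resolve(line):
        # Make an input line self-contained by resolving its references to
        # previous inputs, as the Unix shell's history mechanism does.
        if line == '!!':                        # refers to the previous command
            line = history[-1]
        elif line.startswith('!') and line[1:].isdigit():
            line = history[int(line[1:]) - 1]   # refers to command number n
        history.append(line)
        return line

    resolve('grep context notes.txt')
    print(resolve('!!'))   # -> 'grep context notes.txt': '!!' alone means nothing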
At the MIRA workshop in Monteselice a member of my group described how
the Swiss government has a considerable requirement for cross-language document
retrieval. There are four official languages in Switzerland. Government
officials are usually competent readers in these languages, but may be most
competent in writing only one of these. Hence their need to be able to issue a
query in one language, but to retrieve documents in all of them.
Because basic IR techniques have no understanding of language apart from word
stem matches, this would require an index of words in one language to refer to
collections in all four languages. This is probably comparable to the issues
of having text refer to images.
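A toy sketch of how such an index might behave (in Python; the translation
table, documents, and matching are all invented, and real cross-language
retrieval is of course far more involved than word substitution):

    # Tiny, invented translation table from an English query word to its
    # equivalents in the other languages.
    translations = {
        'tax': ['steuer', 'impôt', 'imposta'],
    }

    documents = {
        'doc_de': 'Die neue Steuer tritt 1998 in Kraft.',
        'doc_fr': 'Le nouvel impôt entre en vigueur en 1998.',
        'doc_en': 'The new tax takes effect in 1998.',
    }

    def cross_language_search(term):
        # Expand the query term into all the languages, then do a plain
        # word match over the whole collection.
        variants = [term] + translations.get(term, [])
        return [doc_id for doc_id, text in documents.items()
                if any(v in text.lower() for v in variants)]

    print(cross_language_search('tax'))   # -> ['doc_de', 'doc_fr', 'doc_en']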
Getting text queries to refer to images is just one of the many
combinations we need to consider in media retrieval problems.
In addition, another whole kind of inter-media reference is the kind tackled by
Lynda Hardman and her colleagues: to do with delivering ("playing") a
multimedia document. Naive multimedia technology uses only time sequence (the
time line) to organise and relate multiple media. In fact the structure within
each medium is important, and could cross-refer. The very common film
technique of cutting the visuals and the soundtrack at different times when
moving between scenes (e.g. you hear the sound of the new scene several
seconds before you see the new shot) shows that the relationship is
non-trivial. For instance, consider a pop video. The music has a structure (e.g. bars of music),
and the lyrics (words) are divided into lines; if you have multi-lingual
subtitles, then these are also divided into lines; the visuals will have a
structure of shots, and this structure probably has a relationship to the
lyrics.
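A sketch of the contrast (in Python; all the timings and structures are
invented): instead of positions on one shared time line, each medium keeps its
own structure, and cross-references between the structures express things like
the sound-before-picture cut:

    # Each medium has its own structure (shots, sound, lyric lines), not just
    # positions on one shared time line. Times are in seconds.
    shots = [
        {'scene': 1, 'start': 0.0,  'end': 30.0},
        {'scene': 2, 'start': 30.0, 'end': 55.0},
    ]
    sound = [
        {'scene': 1, 'start': 0.0,  'end': 27.0},
        {'scene': 2, 'start': 27.0, 'end': 55.0},  # sound cut leads the picture cut
    ]
    lyrics = [
        {'line': 'first line of the verse',  'start': 2.0,  'scene': 1},
        {'line': 'second line of the verse', 'start': 31.0, 'scene': 2},
    ]

    def audio_lead(scene):
        # How many seconds the soundtrack of a scene starts before its shot:
        # a cross-media relationship a bare time line does not express.
        shot = next(s for s in shots if s['scene'] == scene)
        track = next(s for s in sound if s['scene'] == scene)
        return shot['start'] - track['start']

    print(audio_lead(2))   # -> 3.0: we hear scene 2 before we see it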
I hope this makes it clear that this is a big subject: I will say no more
here.
It may be worth a moment's thought to reflect on the ways in which in
classical IR the different items in a typical retrieval cycle refer to each
other.
- A formulated query (i.e. a set of terms) is meant to refer to a user's
information need; and bridging that gap is one big issue.
- A query refers to an ordered set of documents: that is what the IR engines
do. How good that mapping is forms the subject of classical evaluations.
- A surrogate (i.e. the summary description of a document on the ordered list
returned from a retrieval) is meant to refer to a whole document in a way that
allows a user to make a relevance judgement quickly.
- A document presentation must also support relevance judgements: hence the
considerable importance of highlighting query terms within the document, and
considering how to present the relevant context around them. These are in
effect attempts to improve the effectiveness with which a document can refer to
the user's information need.
- Relevance feedback is an attempt to get subsets of documents to refer to or
represent the user's information need, and to refer to another (to be
retrieved) subset of documents (see the sketch after this list).
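The last bullet can be made concrete with the classical Rocchio reweighting
scheme, a standard technique I name here only for illustration (it is not
discussed above): the query vector is moved towards the documents the user
marked relevant and away from those marked non-relevant.

    def rocchio(query, relevant, nonrelevant,
                alpha=1.0, beta=0.75, gamma=0.15):
        # Classical Rocchio relevance feedback over term-weight dicts:
        # new_q = alpha*q + beta*mean(relevant) - gamma*mean(nonrelevant).
        terms = set(query)
        for d in relevant + nonrelevant:
            terms |= set(d)
        new_q = {}
        for t in terms:
            w = alpha * query.get(t, 0.0)
            if relevant:
                w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
            if nonrelevant:
                w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
            new_q[t] = max(w, 0.0)   # negative weights are usually clipped
        return new_q

    q = {'bank': 1.0}
    rel = [{'bank': 0.5, 'river': 0.9}]   # user marked a river-bank document relevant
    non = [{'bank': 0.6, 'money': 0.8}]   # ...and a financial one non-relevant
    print(rocchio(q, rel, non))           # 'river' gains weight; 'money' is suppressed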
Clearly this is a complex subject. In the light of the above, the major
components may be:
- Each medium and its own technical problems, including: a) how to do user
input of queries in the medium when that is appropriate for a user group; b)
how to present a "document" when retrieved including how to present b2) its
"context" and b3) why it matched the query; c) how to present surrogates e.g.
thumbnail images, speeded up audio.
- User tasks, for which the retrieval machine will only be one tool that is a
means to part of the task. There is an infinite variety of such tasks.
- Human-machine interaction: how to design a complete user-machine interactive
cycle that maximises retrieval success and, more importantly, user-task-success
over a number of cycles.
- Inter-reference: between media, and between user input and machine output.