1: \documentclass{acm_proc_article-sp}
2: \begin{document}
3:
4: % CIKM 2000 version, modified after refereeing
5: % Adapted for ACM style sheet
6:
7: \title{Retrieval from Captioned Image Databases Using Natural Language Processing}
8: \numberofauthors{1}
9: \author{
10: \alignauthor David Elworthy\titlenote{Now at: Microsoft Research Limited, St
11: George House, 1 Guildhall Street, Cambridge CB2 3NH, United Kingdom}\\
12: \affaddr{Canon Research Centre Europe, Guildford, United Kingdom}\\
13: \email{dahe@acm.org}
14: }
15:
16: \maketitle
17: \begin{abstract}
18: At first sight, it might appear that natural language processing should
19: improve the accuracy of information retrieval systems, by making available a
20: more detailed analysis of queries and documents. Although past results appear
21: to show that this is not so, if the focus is shifted to short phrases rather
22: than full documents, the situation becomes somewhat different. The ANVIL
23: system uses a natural language technique to obtain high accuracy retrieval of
24: images which have been annotated with a descriptive textual caption. The
25: natural language techniques also allow additional contextual information to be
26: derived from the relation between the query and the caption, which can help
27: users to understand the overall collection of retrieval results. The
techniques have been successfully used in an information retrieval system which
29: forms both a testbed for research and the basis of a commercial system.
30: \end{abstract}
31:
32: \category{H.3.1}{Information Systems}{Content Analysis and Indexing}
33: \category{H.3.3}{Information Systems}{Information Search and Retrieval}
34: \keywords{Information retrieval, Natural language processing, Image databases}
35:
36: \newcommand{\aum}{\"{a}}
37:
38: \section{Introduction}
39:
40: Text information retrieval is concerned with finding documents which match
against a user's query, and assigning a measure according to the closeness of
42: the match. Natural language processing (NLP) can provide rich information
43: about the text, and it might appear reasonable that this would result in
44: better retrieval than conventional ``bag of words''
45: approaches. Fagan \cite{Fagan:1987} reports experiments in which simple
46: keywords were augmented with compound terms consisting of pairs of
47: keywords. While the addition of compound terms produced better accuracy, no
48: significant difference was observed between terms selected on the basis of
49: their linguistic relationship and ones selected purely on the basis of their
50: statistical association. Smeaton \cite{Smeaton:1997} makes similar observations and
51: on the basis of a number of experiments concludes that NLP has little to offer
52: IR. However, there are exceptions in some niche areas. For example,
53: Flank \cite{Flank:1998} describes a retrieval system in which the ``documents''
54: are short image captions. She uses the techniques of searching on heads and
55: head-modifier combinations introduced by Strzalkowski \cite{Strz:1994}, and obtains high
56: precision and recall. It therefore appears that in specialised applications, NLP
57: may have something to offer.
58:
59: Here we will look at a technique called {\em phrase matching}, which attempts
60: to use lightweight, symbolic natural language analysis to improve retrieval
61: accuracy. Like Strzalkowski's work, it relies on looking for combinations of
62: words which stand in certain modification relationships, and like Flank, we
63: have applied it to searching a database of annotated images. It differs from
64: the earlier work in two important ways. Firstly, it does not simply use the
65: analysis of the captions and queries as a source of compound terms, as
66: Strzalkowski does. Instead it
67: recursively explores the structure of the caption and query, checking that
68: terms stand in equivalent modification relations in the two phrases. This also
69: allows the match score to be finely tuned and special cases such as negation
70: to be handled. Secondly, by means of a further algorithm called {\em context
71: extraction}, information about non-matching parts of the caption, related
72: to the parts which did match, can be obtained. The retrieval results can then
73: be organised and categorised by the contexts they have in common, with the
74: goal of helping users of the retrieval system to understand and organise the
75: results. This is an important step, because it provides information which is
76: unavailable without natural language analysis, and shows that NLP can
77: contribute in adding new functionality as well as improving accuracy.
78:
79: In section~2 of this paper, we will introduce the phrase matching algorithm,
80: and give some evaluation results. Section~3 then moves on to context
81: extraction. Some conclusions and suggestions for future work are presented in
82: section~4. We first briefly look at the application for which the work was
83: intended.
84:
85: \subsection{The ANVIL system}
86: ANVIL (Accurate Natural Language Visual Information Locator) is a retrieval
87: system for databases of digital photographs, intended for operation over the
88: world wide web. The photographs are annotated with captions, typically between
89: 10 and 30 words in length, which describe the subject matter of the image. The
90: system is intended for casual users, and it is therefore important to make it
91: easy to formulate and refine queries, and to help the users understand the
92: results. This is the main motivation for using phrasal captions and phrase
93: matching: while traditional IR techniques over collections of keywords may
94: give good recall, they are not really suitable for users who will give up if
95: they do not see an acceptable result in the first few presented by the
96: system. ANVIL is further enhanced by an interactive user interface, details of
97: which can be found in Rose et al. \cite{Rose:2000}.
98:
99: In outline, the processing in ANVIL proceeds as follows. When images are
100: registered with the system, their captions are analysed into a meaning
101: representation. The terms from the captions are stored in an index database,
102: pointing to records containing the image identifier and the analysed
103: caption. In retrieval, the terms are extracted from the query and used to find
104: candidate captions using conventional IR techniques such as vector-cosine
105: matching or the similarity techniques of Smeaton and Quigley \cite{SQ:1996}; this phase is
106: called simple matching. The query is analysed to a meaning representation in
107: the same way as the captions, and the representations of the query and
108: candidate captions are compared using natural language matching techniques. The
109: result of the comparison is a score, which is combined with the score from
simple matching. Contexts may also be extracted at this stage, and the resulting
111: images with their scores, captions and contexts are presented to the user.
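
The two-stage flow just described can be sketched in a few lines of
Python. This is an illustration only and not ANVIL's code: the function
\texttt{retrieve}, its parameters and the toy data are assumptions. In
particular, the way the two scores are combined is not specified here, so
the sketch leaves the combination to a function supplied by the caller.
\begin{verbatim}
def retrieve(query_terms, index, simple_score,
             phrase_score, combine):
    """Two-stage retrieval sketch (hypothetical names).

    index maps a term to the ids of captions containing it.
    simple_score and phrase_score map a caption id to a
    score; combine merges the two scores into one.
    """
    # Stage 1: candidate captions from the term index.
    candidates = set()
    for term in query_terms:
        candidates.update(index.get(term, ()))

    # Stage 2: re-score each candidate and rank.
    ranked = []
    for caption_id in candidates:
        simple = simple_score(caption_id)
        phrase = phrase_score(caption_id)
        ranked.append((combine(simple, phrase), caption_id))
    return sorted(ranked, reverse=True)


# Toy data; the product is one possible combination only.
index = {"camera": ["c1", "c2"], "lens": ["c1"]}
simple = {"c1": 0.9, "c2": 0.5}
phrase = {"c1": 1.0, "c2": 0.1}
ranked = retrieve(["camera", "lens"], index, simple.get,
                  phrase.get, lambda s, p: s * p)
\end{verbatim}
Passing the combination in as a parameter keeps the sketch neutral about
how the simple and phrase matching scores are actually merged.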
112:
113: \section{Phrase matching}
114:
115: The basic idea in phrase matching is as follows. We start by analysing the
116: query and the caption into dependency structures, in which the words are
117: connected by labelled links indicating the relationship between them. One word
118: (or occasionally more) will not be a modifier of any other words. It is
119: designated the head, and is the word which says, in most general terms, what
120: the caption is about. The head of the query is compared against words in the
121: caption, starting from its own head and progressing to modifiers if no match
122: is found. If there is a match, the modifiers of the query head are compared
123: against modifiers of the corresponding term in the caption. For each word that
124: matches, the process recurses in a similar way down through the dependency
125: structure. The modification relationships can be simple ones, or they can
126: involve tracing through several dependency links. Each stage of the comparison
127: has a score associated with it, so that strong and weak matches can be
128: assigned different scores. Finally, we allow matching of elements in the
129: dependency structure against fixed expressions, to detect special cases such
130: as negation.
131:
132: Figure~\ref{f1} shows the dependency structures for two phrases with similar
133: meanings.
134: \begin{figure} %[htb]
135: \centering
136: \setlength{\unitlength}{1cm}
137: \begin{picture}(12,6.5)(0,1)
138: \put(0,1.5){\makebox{\rm colour}}
139: \put(2,1.5){\makebox{\rm document}}
140: \put(3.9,1.5){\makebox{\rm copier}}
141: \put(1.6,2){\oval(1.7,1)[t]}
142: \put(1.5,2.5){\vector(1,0){0.3}}
143: \put(1.3,2.7){\makebox{\tt mod}}
144: \put(3.3,2){\oval(1.7,1)[t]}
145: \put(3.2,2.5){\vector(1,0){0.3}}
146: \put(3.0,2.7){\makebox{\tt mod}}
147: %\end{picture}
148: %\\
149: %\begin{picture}(12,3.5)(0,1)
150: \put(0,4.5){\makebox{\rm copier}}
151: \put(2.0,4.5){\makebox{\rm for}}
152: \put(3.9,4.5){\makebox{\rm colour}}
153: \put(5.8,4.5){\makebox{\rm documents}}
154: \put(1.2,5){\oval(2,1)[t]}
155: \put(1.3,5.5){\vector(-1,0){0.3}}
156: \put(0.8,5.8){\makebox{\tt prep}}
157: \put(4.2,5){\oval(4,2)[t]}
158: \put(4.2,6){\vector(-1,0){0.3}}
159: \put(3.2,6.2){\makebox{\tt phead}}
160: \put(5.2,5){\oval(2,1)[t]}
161: \put(5.2,5.5){\vector(1,0){0.3}}
162: \put(4.8,5.7){\makebox{\tt mod}}
163: %\put(5.8,1.5){\makebox{\rm copier}}
164: %\put(7.8,1.5){\makebox{\rm for}}
165: %\put(9.7,1.5){\makebox{\rm colour}}
166: %\put(11.6,1.5){\makebox{\rm documents}}
167: %put(7,2){\oval(2,1)[t]}
168: %\put(7.1,2.5){\vector(-1,0){0.3}}
169: %\put(6.6,2.8){\makebox{\tt prep}}
170: %\put(10,2){\oval(4,2)[t]}
171: %\put(10,3){\vector(-1,0){0.3}}
172: %\put(9,3.2){\makebox{\tt phead}}
173: %\put(11,2){\oval(2,1)[t]}
174: %\put(11,2.5){\vector(1,0){0.3}}
175: %\put(10.6,2.7){\makebox{\tt mod}}
176: \end{picture}
177: \caption{Example dependency structures}
178: \label{f1}
179: \end{figure}
180: Dependencies are shown as pointing from a modifier to the term it
181: modifies. Although dependency structures go some way to abstracting away from
182: the syntactic analysis, we still need a way of assigning a similarity between
183: non-identical structures. In this example, we want the noun-noun modification
184: between {\em copier} and {\em document} in the lower phrase to have a high
185: similarity to the modification via the preposition {\em for} in the upper one.
186:
187: For convenience, we represent dependency structures using a notation of
188: indexed variables, in which the name of the variable stands for the name of the
189: dependency, and the variable is indexed on the modified word. An unindexed
190: variable is used for the head. The examples can then be written as
191: \begin{verbatim}
192: colour document copier
193: head = copier
194: mod[copier] = document
195: mod[document] = colour
196:
197: copier for colour documents
198: head = copier
199: prep[copier] = for
200: phead[for] = documents
201: mod[documents] = colour
202: \end{verbatim}
203: Thus, for example, \texttt{mod[copier] = document} indicates that {\em copier}
stands in the \texttt{mod} relation to {\em document}, i.e.\ the \texttt{mod}(ifier)
205: of {\em copier} is {\em document}.
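
As an illustration, the indexed-variable notation maps naturally onto a
nested dictionary. The Python sketch below is one possible encoding and is
not taken from ANVIL; the names \texttt{make\_structure} and
\texttt{lookup}, and the use of \texttt{None} as the index of the unindexed
head, are assumptions made for the example.
\begin{verbatim}
# Hypothetical encoding of the indexed-variable notation:
# variable name -> {indexing word -> list of values}.
# The unindexed head is stored under the index None.

def make_structure(entries):
    """Build a structure from (variable, index, value)."""
    structure = {}
    for var, index, value in entries:
        by_index = structure.setdefault(var, {})
        by_index.setdefault(index, []).append(value)
    return structure


def lookup(structure, var, index=None):
    """Return the words in var[index] (empty if none)."""
    return structure.get(var, {}).get(index, [])


# "colour document copier"
noun_compound = make_structure([
    ("head", None, "copier"),
    ("mod", "copier", "document"),
    ("mod", "document", "colour"),
])

# "copier for colour documents"
prep_phrase = make_structure([
    ("head", None, "copier"),
    ("prep", "copier", "for"),
    ("phead", "for", "documents"),
    ("mod", "documents", "colour"),
])
\end{verbatim}
Storing a list under each index allows a word to have several modifiers
in the same relation.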
206:
207: Dependency structures are especially suitable for this kind of
208: processing. They are closely related to the syntactic form, but abstract away
209: from the linear order of the words and fine details of phrase structure. From
210: a practical point of
view, dependency structures can be computed quickly and efficiently; see, for
212: example, the dependency parser built by J\"{a}rvinen and Tapanainen \cite{Jarvinen:1997} or the Link
213: grammar parser of Sleator et al. \cite{Sleator:1991}. We use a finite-state parser which
214: has been modified to deliver the dependencies as well as the phrase bracketing
215: (Elworthy \cite{Elworthy:2000}). It works in time roughly proportional to the square
216: of the number of words in the phrase.
217:
218: \subsection{Matching rules}
219:
220: A system of rules specifies what relationships can be treated as equivalent. A
221: small set of example rules appears in figure~\ref{f2}.
222: \begin{figure} %[htb]
223: \centering
224: \begin{verbatim}
225: head_rule
226: {
227: head = head 1.0 => mod_rule 0.7;
228: head = mod[] 0.5 => mod_rule 0.7;
229: mod[] ? 0.3 => Done 1.0;
230: }
231:
232: mod_rule
233: {
234: mod[] = mod[] 1.0 => mod_rule 1.0;
235:
236: phead:prep[] = phead:prep[] 1.0 => mod_rule 1.0;
237: phead:prep[] = mod[] 1.0 => mod_rule 1.0;
238: mod[] = phead:prep[] 1.0 => mod_rule 1.0;
239:
240: vhead:cop:rel[]
241: = vhead:cop:rel[] 1.0 => mod_rule 1.0;
242: vhead:cop:rel[] = mod[] 1.0 => mod_rule 1.0;
243: mod[] = vhead:cop:rel[] 1.0 => mod_rule 1.0;
244:
245: amod[] = amod[] 1.0 => Done 1.0;
246: 'not' = amod[] 0.0 => Done 0.0;
247: amod[] = 'not' 0.0 => Done 0.0;
248: }
249: \end{verbatim}
250: \caption{Example matching rules}
251: \label{f2}
252: \end{figure}
253: The left and right hand sides of a comparison express paths through the
254: dependency structure. The idea is that if we have already found a query
255: word which matches a word from the caption, we then follow the specified paths
256: from these words, and compare the words lying at the end of the paths.
257:
258: It is convenient to gather rules into named groups, such as \texttt{head\_rule}
259: and \texttt{mod\_rule}. One group is designated the start group, and its rules
260: are applied to start the matching process. Within a group, the rules are
261: applied in order, so that later rules in a group can be used to test words
262: which were not caught by the earlier rules. Each rule has a continuation,
263: which specifies what should happen after it has been applied. As an example,
264: consider the rule
265: \begin{verbatim}
266: head = head 1.0 => mod_rule 0.7;
267: \end{verbatim}
268: This says that after matching head words, continue with the rule
269: group \texttt{mod\_rule}. The words which have just matched provide the
270: starting point for paths in the continuation; in effect they are substituted
where \texttt{[]} appears in \texttt{mod\_rule}. The special continuation {\tt
272: Done} indicates that no further comparison is to be carried out from the words
273: that matched.
274:
275: The process is started by comparing words without indexing, stored in {\tt
276: head}. Thus, the structures in the examples of figure~\ref{f1} can be matched
by starting with \texttt{head = head} and then continuing with \texttt{mod[] = phead:prep[]},
indicating that a modifier of the head (\texttt{mod[]}) can be compared with the
head of a prepositional phrase (\texttt{phead}), reached by following a
preposition link (\texttt{prep[]}) from a matched caption word.
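
Following a path such as \texttt{phead:prep[]} can be sketched as below.
The function \texttt{follow\_path} and the dictionary encoding (variable
name, then indexing word, then a list of values) are assumptions made for
this example, not ANVIL's implementation. The links are applied from right
to left, so \texttt{phead:prep} means a \texttt{prep} link followed by a
\texttt{phead} link.
\begin{verbatim}
def follow_path(structure, path, start):
    """Follow a path such as 'phead:prep' from start.

    Links are applied right to left: 'phead:prep' takes
    prep[start] first and then phead of each result.
    structure maps variable -> {index word -> [values]}.
    """
    words = [start]
    for var in reversed(path.split(":")):
        links = structure.get(var, {})
        words = [v for w in words
                 for v in links.get(w, [])]
    return words


# "copier for colour documents"
caption = {
    "prep": {"copier": ["for"]},
    "phead": {"for": ["documents"]},
    "mod": {"documents": ["colour"]},
}
# "colour document copier"
query = {
    "mod": {"copier": ["document"],
            "document": ["colour"]},
}

caption_end = follow_path(caption, "phead:prep", "copier")
query_end = follow_path(query, "mod", "copier")
\end{verbatim}
The two paths end at {\em documents} and {\em document} respectively.
Treating these as the same word needs some normalisation of word forms,
which the sketch does not attempt.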
281:
282: There are two special sorts of rules: mopping-up rules and token
283: rules. Mopping-up rules specify that certain words are to be considered to
284: have matched, without actually consuming any words from the other phrase. One
285: use is to catch words from the query which did not have a counterpart in the
286: caption. For example,
287: \begin{verbatim}
288: mod[] ? 0.3 => Done 1.0;
289: \end{verbatim}
290: causes modifiers from the query to be mopped up\footnote{Since this is
291: in the start rule group, the whole unmatched range of the \texttt{mod}
292: variable is used, without indexing.}. Token rules allow matching against
293: specific words. For example, the rule
294: \begin{verbatim}
295: 'not' = amod[] 0.0 => Done 0.0;
296: \end{verbatim}
297: allows an \texttt{amod} (``adverbial'' modifier) in the caption to be tested
298: against the literal word {\em not}, with an effect on scoring described below. There are a few
299: further variants of rules which we will not discuss here, for example rules
300: with a negated test, and ones which are sensitive to word order.
301:
302: \subsection{The scoring scheme}
303:
304: The scoring scheme is a critical part of phrase matching, as it will allow us
305: to distinguish exact and near-exact matches from partial and weak ones. The general
306: approach is to assign each word of the query phrase two numeric values, called
307: the {\em score} and the {\em weight}. The score of a query word is a measure
308: of how well it matched considered in isolation from the rest of the caption,
309: while the weight indicates the importance of the rule application. In general,
310: words which are compared in the start rule group, such as the head, will be
311: more important than ones compared as a result of a continuation, such as
312: modifiers. Scores are assigned to query words if they actually matched a word
313: in the caption, or if they were caught by a mopping up rule or token rule. The
314: score does not take the caption words into account, other than an allowance
for their similarity with query words. A special score, called an {\em
up-score}, is also used to handle words at the end of paths for special cases
such as negation. Writing the scores as $s_{i}$ and the weights as $w_{i}$,
the overall score of the match is $\sum s_{i}w_{i} /\sum w_{i}$, modified by
319: the up-scores as described below.
320:
321: The rules are annotated with two values, called the $t$ (term) factor and the
322: $d$ (down) factor. In general, the $t$-factor provides the basic score for
323: words which were matched by the rule, and the $d$-factor sets the weight for
324: continuations. Thus, in
325: \begin{verbatim}
326: head = mod[] 0.5 => mod_rule 0.7;
327: \end{verbatim}
328: the $t$-factor is 0.5 and the $d$-factor is 0.7.
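
To make the layout of a rule concrete, the following Python sketch splits a
rule line into its two paths, its $t$-factor, its continuation and its
$d$-factor. The regular expression, the \texttt{Rule} type and
\texttt{parse\_rule} are assumptions written for this example. They cover
only the rule forms shown in figure~\ref{f2}, not the further variants
mentioned earlier.
\begin{verbatim}
import re
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    left: str             # query-side path or token
    right: Optional[str]  # caption side; None if mopping-up
    t_factor: float
    continuation: str
    d_factor: float


RULE = re.compile(
    r"^\s*(?P<left>\S+)\s*"
    r"(?:=\s*(?P<right>\S+)|\?)\s+"
    r"(?P<t>\d+(?:\.\d+)?)\s*=>\s*"
    r"(?P<cont>\w+)\s+(?P<d>\d+(?:\.\d+)?)\s*;\s*$"
)


def parse_rule(line):
    """Parse one rule such as 'head = head 1.0 => Done 1.0;'."""
    m = RULE.match(line)
    if m is None:
        raise ValueError("not a rule: " + line)
    return Rule(m["left"], m["right"], float(m["t"]),
                m["cont"], float(m["d"]))


example = parse_rule("head = mod[] 0.5 => mod_rule 0.7;")
mop_up = parse_rule("mod[] ? 0.3 => Done 1.0;")
\end{verbatim}
Here \texttt{example.t\_factor} is 0.5 and \texttt{example.d\_factor} is
0.7, while the mopping-up rule has no caption-side path.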
329:
330: At the start of matching, the weight is 1.0. As we follow through
331: continuations, it is the product of the $d$-factors of the rules leading to
332: this point. If the rule above were in the start group, the weight of words
333: matched in \texttt{mod\_rule} would be 0.7, and if \texttt{mod\_rule} contained
334: \begin{verbatim}
335: mod[] = mod[] 1.0 => submod_rule 0.6;
336: \end{verbatim}
337: then the weight in \texttt{submod\_rule} would be $0.7\times 0.6$. The scores are
338: formed from the product of the $t$-factor of the rule, and two special
339: factors. Firstly, the similarity between the words can be used. For example,
340: we might allow {\em car} to match {\em vehicle}, but with a reduced
341: score. This factor could be calculated using lexical similarity metrics such
342: as those of Resnik \cite{Resnik:1995} or Jiang and Conrath
343: \cite{JiCon:1997}. A further extension
344: would be to recognise that the agentive suffix {\em X-er} (as in {\em copier})
345: allows a
346: match against the whole phrase {\em machine for X-ing} (as in {\em machine for
347: copying}), and similar rules based on derivational morphology. We do not take
348: this step in the current version of phrase matching.
349:
350: The second special factor is the up-score. When a \texttt{Done} continuation
351: is reached,
352: its $d$-factor is multiplied into the score assigned by the rule which invoked
353: it\footnote{This represents a harmless overloading of the rule
notation.}. Usually the factor will be 1.0, but in special cases it may be some
355: other
356: value. An example of where this is useful can be found in the rules involving
357: `not' in figure~\ref{f2}. When a negation is seen, we effectively cancel the
358: score on the word which is negated, by using a $d$-factor, and hence an
359: up-score, of 0. Note that
360: making this kind of adjustment based on word pairs without the recursion
361: through the overall structure, as in Fagan's and Strzalkowski's work, is
362: very hard to do.
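
The weight and score of the $i$-th query word, as described above, can be
summarised in symbols. The notation $d_{j}$, $t_{i}$, $\sigma_{i}$ and
$u_{i}$ is introduced only for this summary:
\[
w_{i} = \prod_{j=1}^{k} d_{j}, \qquad
s_{i} = t_{i}\,\sigma_{i}\,u_{i},
\]
where $d_{1},\ldots,d_{k}$ are the $d$-factors of the rules whose
continuations led to the rule group in which the word was matched (so that
$w_{i}=1$ in the start group), $t_{i}$ is the $t$-factor of the matching
rule, $\sigma_{i}$ is the similarity between the query word and the caption
word (1 for identical words), and $u_{i}$ is the up-score (1 unless a
\texttt{Done} continuation supplies another value). For the
\texttt{submod\_rule} example, $w_{i} = 0.7 \times 0.6 = 0.42$, and a negated
word has $u_{i}=0$ and hence $s_{i}=0$.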
363:
364: To show the rules in operation, suppose the query {\em yellow car} is tested
365: against {\em yellow car}, {\em car which is yellow} and {\em car which is not
366: yellow}. The dependency structures, written as variables, are shown in
367: figure~\ref{f3}, and a trace through the matching process appears in
368: figure~\ref{f4}.
369: \begin{figure} %[htb]
370: \centering
371: \begin{verbatim}
372: yellow car
373: head = car
374: mod[car] = yellow
375:
376: car which is yellow
377: head = car
378: rel[car] = which
379: cop[which] = is
380: vhead[is] = yellow
381:
382: car which is not yellow
383: head = car
384: rel[car] = which
385: cop[which] = is
386: vhead[is] = yellow
387: amod[yellow] = not
388: \end{verbatim}
389: \caption{Dependency structures for the matching example}
390: \label{f3}
391: \end{figure}
392: In particular, note how the rule
393: \begin{verbatim}
394: 'not' = amod[] 0.0 => Done 0.0;
395: \end{verbatim}
causes the previous score assignment for {\em yellow} to be replaced by 0 when
397: comparing against {\em car which is not yellow}.
398: \begin{figure*} %[htb]
399: \centering
400: \textbf{yellow car + yellow car}\\
401: \begin{tabular}{|l|l|l|c|c|} \hline
402: Query word & Rule group & Comparison & Score & Weight \\ \hline
403: car & head\_rule & head = head & 1.0 & 1.0 \\
404: yellow & mod\_rule & mod[] = mod[] & 1.0 & 0.7 \\ \hline
405: \end{tabular}
406: \\Overall match score = $(1.0\times 1.0 + 1.0\times 0.7) / (1.0 + 0.7) = 1.0$
407: \vspace*{0.5cm}\\
408: \textbf{yellow car + car which is yellow}\\
409: \begin{tabular}{|l|l|l|c|c|} \hline
410: Query word & Rule group & Comparison & Score & Weight \\ \hline
411: car & head\_rule & head = head & 1.0 & 1.0 \\
412: yellow & mod\_rule & mod[] = vhead:cop:rel[] & 1.0 & 0.7 \\ \hline
413: \end{tabular}
414: \\Overall match score = $(1.0\times 1.0 + 1.0\times 0.7) / (1.0 + 0.7) = 1.0$
415: \vspace*{0.5cm}\\
416: \textbf{yellow car + car which is not yellow}\\
417: \begin{tabular}{|l|l|l|c|c|} \hline
418: Query word & Rule group & Comparison & Score & Weight \\ \hline
419: car & head\_rule & head = head & 1.0 & 1.0 \\
420: yellow & mod\_rule & mod[] = vhead:cop:rel[] & 1.0 (initially) & 0.7 \\
421: (none) & mod\_rule & 'not' = amod[] & 0.0 & 0.0 \\
yellow & mod\_rule & mod[] = vhead:cop:rel[] & 0.0 (on up-score) & 0.7 \\ \hline
423: \end{tabular}
424: \\Overall match score = $(1.0\times 1.0 + 0.0\times 0.7) / (1.0 + 0.7) = 0.59$
425: \caption{Matching in action}
426: \label{f4}
427: \end{figure*}
428: The scores in this rule set are chosen on the basis of examining a variety of
429: examples, some of which might be expected to provide a close match, some a
430: partial match, and some a weak match. No experiments on learning the scores
431: from data have been carried out.
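
The traces in figure~\ref{f4} can be reproduced with a much simplified
Python sketch of the recursion. It is not ANVIL's implementation. It
assumes exact word equality, a fixed start rule \texttt{head = head},
and no mopping-up rules, and all function names are hypothetical. The
treatment of the up-score, as a factor applied to the word from which a
token rule's path starts, is inferred from the trace rather than stated as
a general rule.
\begin{verbatim}
def follow(structure, path, start):
    """Follow e.g. 'vhead:cop:rel', right to left."""
    words = [start]
    for var in reversed(path.split(":")):
        links = structure.get(var, {})
        words = [v for w in words
                 for v in links.get(w, [])]
    return words


def apply_group(groups, name, query, caption, q, c,
                weight, out):
    """Apply a rule group below the matched pair (q, c).

    Stores (score, weight) per query word in out and
    returns the up-score to multiply into q's score.
    """
    up = 1.0
    for left, right, t, cont, d in groups[name]:
        targets = follow(caption, right, c)
        if left.startswith("'"):       # token rule
            if left.strip("'") in targets:
                up *= d                # d-factor of Done
            continue
        for qw in follow(query, left, q):
            if qw in out or qw not in targets:
                continue
            out[qw] = (t, weight)
            if cont != "Done":
                u = apply_group(groups, cont, query,
                                caption, qw, qw,
                                weight * d, out)
                out[qw] = (t * u, weight)
    return up


def phrase_match(groups, query, caption):
    """Start rule: head = head 1.0 => mod_rule 0.7."""
    qh = query["head"][None][0]
    ch = caption["head"][None][0]
    out = {}
    if qh == ch:
        out[qh] = (1.0, 1.0)
        u = apply_group(groups, "mod_rule", query,
                        caption, qh, ch, 0.7, out)
        out[qh] = (1.0 * u, 1.0)
    return out


def match_score(out):
    """Weighted mean of scores: sum(s * w) / sum(w)."""
    total = sum(w for _, w in out.values())
    if total == 0:
        return 0.0
    return sum(s * w for s, w in out.values()) / total


GROUPS = {"mod_rule": [
    ("mod", "mod", 1.0, "mod_rule", 1.0),
    ("mod", "vhead:cop:rel", 1.0, "mod_rule", 1.0),
    ("'not'", "amod", 0.0, "Done", 0.0),
]}

query = {"head": {None: ["car"]},
         "mod": {"car": ["yellow"]}}
relative = {"head": {None: ["car"]},
            "rel": {"car": ["which"]},
            "cop": {"which": ["is"]},
            "vhead": {"is": ["yellow"]}}
negated = dict(relative, amod={"yellow": ["not"]})
\end{verbatim}
Calling \texttt{match\_score(phrase\_match(GROUPS, query, c))} with
\texttt{c} set to \texttt{query}, \texttt{relative} and \texttt{negated} in
turn gives the three overall scores of figure~\ref{f4}.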
432:
433: \subsection{Evaluation}
434:
435: Evaluation of image caption retrieval is limited by the lack of suitable large test collections. We therefore created our own captions for a
436: set of digital photographs. The captions were prepared according to a set of
437: guidelines, so that they emphasised the objects in the image rather than
layout or composition. The guidelines were formulated to overcome problems with quality which
had been seen both in a pilot study and in the captions used by
440: Smeaton and Quigley \cite{SQ:1996}. There were 1932 captions in the set, with lengths ranging
441: from 1 to 22 words (9.0 average). Almost all of the captions were noun
442: phrases. It is relatively easy to construct a grammar which correctly analyses
443: all the phrases.
444:
445: A query set was constructed by taking pictures from another source, and
446: devising phrases which should elicit a related image. An initial set of
447: results was obtained by pooling several keyword-based retrieval runs,
448: discarding queries which produced no results\footnote{With such a small test
collection and using a single retrieval system, it might have been better to
construct complete relevance judgements rather than use pooling. However, time
pressures precluded doing this.}. The top results from phrase
452: matching with each query were then judged for relevance by two human
453: assessors, acting separately. Neither assessor was responsible for writing the
454: captions; one of them devised the queries. A standard precision-recall measure
455: was then calculated, using the TREC interpolation procedure (from {\tt
456: http://trec.nist.gov/}). An example of the output for a query, showing some
sample captions, appears in figure~\ref{fout}.
458:
459: The main comparison point between different tests was chosen to be the
precision at 10\% recall. This represents the case of naive or casual users,
461: who do not care about completeness in the results and who want high accuracy
462: in the first few (Pollock and Hockey \cite{pollock}). The precision at 5
463: documents and the R-precision were also calculated, although they are less
useful, partly because there is often a very small number of
465: relevant results in such a small test set. Table~\ref{t1} shows the results
466: for a simple weighted
467: keyword matching strategy, and for phrase matching, using the two sets of
468: relevance judgements.
469: \begin{table*} %[htb]
470: \centering
471: \begin{tabular}{|l|c|c|c|} \hline
472: Run (assessor) & Precision at & Precision at & R-precision \\
473: & 10\% recall & 5 documents & \\ \hline
474: Simple matching (I) & 85\% & 45\% & 61\% \\
475: Phrase matching (I) & 92\% & 46\% & 66\% \\
476: Simple matching (II) & 87\% & 49\% & 63\% \\
477: Phrase matching (II) & 95\% & 53\% & 72\% \\ \hline
478: \end{tabular}
479: \caption{Evaluation results}
480: \label{t1}
481: \end{table*}
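
For reference, the three measures can be computed as in the Python sketch
below. It assumes the usual interpolation rule, in which the interpolated
precision at a recall level is the highest precision observed at that level
or at any higher one. The function names and the toy ranking are
hypothetical and are not data from the evaluation. The set of relevant
items is assumed to be non-empty.
\begin{verbatim}
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    hits = sum(1 for d in ranked[:k] if d in relevant)
    return hits / k


def r_precision(ranked, relevant):
    """Precision at rank R, R = number of relevant items."""
    return precision_at_k(ranked, relevant, len(relevant))


def interpolated_precision(ranked, relevant, level):
    """Highest precision at any rank whose recall is at
    least level."""
    best, hits = 0.0, 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        if hits / len(relevant) >= level:
            best = max(best, hits / rank)
    return best


# Toy ranking: c9 is relevant but was never retrieved.
ranked = ["c1", "c2", "c3", "c4", "c5"]
relevant = {"c2", "c3", "c9"}

p5 = precision_at_k(ranked, relevant, 5)
rp = r_precision(ranked, relevant)
p10 = interpolated_precision(ranked, relevant, 0.10)
\end{verbatim}
In this toy case the first result is irrelevant, yet the interpolated
precision at 10\% recall is taken from the third rank, where two of three
results are relevant. This shows how interpolation can hide an early miss.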
482:
Phrase matching improves on simple matching on every measure in table~\ref{t1},
raising precision at 10\% recall by 7 to 8 percentage points. 43 of the 47
484: queries in the best phrase matching run gave a precision of 100\% at 10\%
485: recall. Inspection of the remaining results shows that the errors could
486: typically only be fixed with a richer semantic representation allowing
487: interaction between the meaning of the words. For example, the query {\em
488: plastic toys} fails to match {\em plastic sword} because a sword is not
489: normally a toy. The precision at 5 documents shows less of an improvement as a
490: result of the small numbers of relevant captions.
491:
492: Note that due to the lack of sources of good quality relevance judgements for this
493: kind of application, the results should be taken as suggestive of the
494: quality of phrase matching rather than as a definitive statement. An
495: evaluation was also carried out using the data from Smeaton and Quigley \cite{SQ:1996}, but we
496: concluded that the results could not be trusted, because the relevance
497: judgements were made against the images rather than the captions, and both the
498: captions and queries were of relatively low quality. In some cases we found
499: pairs of almost identical captions, one of which was judged relevant and one
500: irrelevant by Smeaton and Quigley's assessors. For comparison, the best
501: precision at 10\% recall reported by Smeaton and Quigley is around 62\%.
502:
503: \section{Context extraction}
504:
505: Context extraction is a means of obtaining additional information about
506: phrases which matched, by using the unmatched parts of the caption which are
507: close in the dependency structure to parts which did match. For example, if
508: the query was {\em camera lens}, and the captions included {\em long camera
509: lens} and {\em camera lens on a table}, then the contexts would be {\em long}
510: and {\em on a table}. Context extraction becomes valuable when there are many
511: retrieval results. Captions with similar contexts can be grouped together, for
example as shown in the bottom half of figure~\ref{c3}. A user can therefore
513: select or reject several retrieval results in one go by examining just the
514: contexts.
515:
516: The algorithm for extracting the context is quite straightforward. It is
517: outlined in figure~\ref{c2}.
518: % Figure fout is here to try to keep the page count under control
519: \begin{figure*} %[htb]
520: \centering
521: \begin{verbatim}
522: Query = 'camera with a lens.'
523: 5 results:
524: SCORE CAPTION
525: 1 black SLR camera, with zoom lens, on a white surface.
526: * camera: black, SLR, on a white surface
527: * lens: zoom
528: 1 old-style black camera, with protruding lens, on a white surface.
529: * camera: black, on a white surface
530: * lens: protruding
531: 0.588 old camera, hip flask, box and album filled with sepia photographs.
532: * camera: old
533: 0.5 Canon camera, magnifying lens and fashion magazine on grey ridged surface.
534: * camera: Canon, on a grey ridged surface
535: 0.1 an astronaut floating within a space craft, showing the on-board cameras.
536: * cameras: on-board
537: \end{verbatim}
538: \caption{Example ANVIL query result}
539: \label{fout}
540: \end{figure*}
541:
542: \begin{figure*} %[htb]
543: \centering
544: \begin{verbatim}
let P be the set of context rules (input)
546: let T be the set of current words, initialised to all matched words (input)
547: let U be the set of available words, initialised to all unmatched words (input)
let S be the set of contexts, initially empty (output)
549:
550: while T is not empty
551: {
552: select a word t from T
553:
554: for each word u in U
555: {
556: if there is a context rule <rt,rv,rp,ru,rC> in P such that
557: has_pos(t,rt)
558: AND in_var(t,rv)
559: AND has_pos(u,ru)
560: AND on_path(t,u,rp)
561: then
562: find the smallest phrase C such that valid_phrase(C,rC,u)
563: if there is such a C then
564: add the context <t,C> to S
565: remove u from U
566: }
567:
568: remove t from T
569: }
570:
571: where
572: has_pos(t,rt) if t has part of speech rt
573: in_var(t,rv) if t is stored in variable rv
574: on_path(t,u,rp) if the path rp connects t and u
575: \end{verbatim}
576: \caption{The context extraction algorithm}
577: \label{c2}
578: \end{figure*}
579:
580: The algorithm uses pre-defined context rules of the form $\langle
581: r_{t},r_{v},r_{p},r_{u},r_{C} \rangle$. In essence, it looks for words which
582: successfully matched, have a given part of speech $r_{t}$ and are stored in a
583: variable $r_{v}$. It then follows a path $r_{p}$ through the dependency structure,
584: arriving at an unmatched word $u$ with part of speech $r_{u}$, and then
585: extracts the syntactic context around it using the phrase type $r_{C}$ (for
586: example, {\em PP}, prepositional phrase). The restriction to the smallest
phrase is simply for cases where a phrase of a given type is embedded within
588: another phrase of the same type. The elements of the rule can be wildcards,
589: which match anything. The algorithm delivers a set of pairs, each of a matched word
590: and its context. Simpler versions of the rules which do not have all of these
591: elements might also be possible.
592:
An example context rule is $\langle noun,*,mod,*,* \rangle$, which
selects all modifiers of nouns. A rule of this sort might be used for
extracting {\em long} in the examples above. To get the context {\em on a
table}, a suitable rule might be $\langle noun,*,phead:prep,*,PP \rangle$,
597: i.e. from a matched word, follow a {\em prep} link followed by a {\em phead}
598: link, and select the {\em PP} surrounding the resulting word. Some example
599: contexts resulting from these rules are shown in figure~\ref{c3}, for the
600: query {\em camera with a lens}. The bottom half of the figure shows the
601: results gathered together by context. Presenting the results to the user in
602: this way would allow selection or rejection of several results with a single
603: decision, thus making it easier to manage large result sets.
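
A minimal Python sketch of the algorithm in figure~\ref{c2}, using the two
example rules, is given below. It makes several simplifying assumptions
that are not part of the original description: words are identified by
their spelling (so each word occurs once), the phrase bracketing is
supplied as a list of typed word lists, and a wildcard phrase type returns
the unmatched word on its own. The caption {\em long camera lens on a
table} and its analysis are invented for the example, and all names are
hypothetical.
\begin{verbatim}
WILD = "*"


def follow(structure, path, start):
    """Follow e.g. 'phead:prep', right to left."""
    words = [start]
    for var in reversed(path.split(":")):
        links = structure.get(var, {})
        words = [v for w in words
                 for v in links.get(w, [])]
    return words


def smallest_phrase(phrases, ptype, word):
    """Smallest phrase of type ptype containing word."""
    if ptype == WILD:
        return [word]
    found = [ws for t, ws in phrases
             if t == ptype and word in ws]
    return min(found, key=len) if found else None


def extract_contexts(rules, matched, unmatched, pos,
                     var_of, structure, phrases):
    contexts = []
    available = list(unmatched)
    for t in matched:
        for u in list(available):
            for rt, rv, rp, ru, rc in rules:
                if rt not in (WILD, pos[t]):
                    continue
                if rv not in (WILD, var_of[t]):
                    continue
                if ru not in (WILD, pos[u]):
                    continue
                if u not in follow(structure, rp, t):
                    continue
                phrase = smallest_phrase(phrases, rc, u)
                if phrase is not None:
                    contexts.append((t, " ".join(phrase)))
                    available.remove(u)
                    break
    return contexts


# Query "camera lens" against the invented caption
# "long camera lens on a table".
rules = [("noun", WILD, "mod", WILD, WILD),
         ("noun", WILD, "phead:prep", WILD, "PP")]
pos = {"lens": "noun", "camera": "noun", "long": "adj",
       "on": "prep", "a": "det", "table": "noun"}
var_of = {"lens": "head", "camera": "mod"}
structure = {"mod": {"lens": ["long", "camera"]},
             "prep": {"lens": ["on"]},
             "phead": {"on": ["table"]}}
phrases = [("PP", ["on", "a", "table"])]

contexts = extract_contexts(
    rules, ["lens", "camera"],
    ["long", "on", "a", "table"],
    pos, var_of, structure, phrases)
\end{verbatim}
With these inputs the contexts found are {\em long} and {\em on a table},
both attached to {\em lens}, the head of the invented caption.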
604: \begin{figure} %[htb]
605: \centering
606: \begin{verbatim}
Query = camera with a lens

Captions and contexts
---------------------
Camera with a lens
  {none}

Large camera with a lens
  <camera [mod], large>

camera with a lens on a table
  <camera [phead:prep], [on a table]PP>

large camera with a zoom lens
  <camera [mod], large>
  <lens [mod], zoom>

camera on a table with a long zoom lens
  <camera [phead:prep], [on a table]PP>
  <lens [mod], zoom>
  <lens [mod], long>

Captions gathered by context
----------------------------
camera with a lens:
  camera modifiers:
    {none} (1)
    large (2)
    on a table (2)
  lens modifiers:
    zoom (2)
    long (1)
639: \end{verbatim}
640: \caption{Example contexts}
641: \label{c3}
642: \end{figure}
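
The gathering step shown in the bottom half of figure~\ref{c3} could be
sketched as follows. This is a minimal illustration rather than the system's
own code: the names {\tt RESULTS}, {\tt gather} and {\tt reject} are
hypothetical, and the input pairs are transcribed from the captions and
contexts listed in the top half of the figure.
\begin{verbatim}
from collections import defaultdict

# (caption, [(matched word, context), ...]), from the figure.
RESULTS = [
    ("Camera with a lens", []),
    ("Large camera with a lens",
     [("camera", "large")]),
    ("camera with a lens on a table",
     [("camera", "on a table")]),
    ("large camera with a zoom lens",
     [("camera", "large"), ("lens", "zoom")]),
    ("camera on a table with a long zoom lens",
     [("camera", "on a table"),
      ("lens", "zoom"), ("lens", "long")]),
]

def gather(results):
    # Map matched word -> context -> captions showing it.
    groups = defaultdict(lambda: defaultdict(list))
    plain = []  # captions with no extracted context
    for caption, pairs in results:
        if not pairs:
            plain.append(caption)
        for word, context in pairs:
            groups[word][context].append(caption)
    return groups, plain

def reject(results, word, context):
    # Drop every result showing the given context.
    return [(c, p) for (c, p) in results
            if (word, context) not in p]

GROUPS, PLAIN = gather(RESULTS)
for word, contexts in GROUPS.items():
    for context, captions in contexts.items():
        print(word, context, len(captions))
\end{verbatim}
Rejecting the context {\em on a table} for {\em camera} removes two of the
five results in one step, which is the kind of single decision described
above.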
643:
644: Perhaps the most important point about context extraction is not the algorithm
or exactly what the results look like, but the use of NLP to provide extra
information. Although there is some work in IR on extracting relevant parts of
the text, for example using named entity extraction, in general IR systems
just output a ranked list of matching documents. Context extraction
demonstrates that by using NLP, which works with more detailed information
650: structures than traditional IR, we can produce a richer form of
651: output.
652:
653: \section{Discussion}
654:
655: The approach most closely related to phrase matching is that of
656: Sheridan and Smeaton \cite{SheSme:1992}. They start by constructing a
657: dependency tree (of a different form to ours), in
658: which interior nodes can be labelled, for example to mark the head or record
659: the preposition which links words on the nodes under it. The matching process
660: looks for pairs of words which are syntactically related in the query tree,
661: and which both appear in the tree for the key (caption). The nearest parent
662: nodes for the pairs of words are then checked for compatibility. Any parts of
663: the dependency structure which hang off the paths to the parent node, called
664: the residual structure, are examined to see if they could disrupt the
matching. For example, if the words were both nouns, a verb in the residual would
666: block the match, since its presence indicates the nouns cannot stand in a
667: head/modifier relationship. The whole process is launched by looking at the
668: rightmost node in the query structure. A score is assigned based on the
669: proportion of words which match, possibly modified by certain residual nodes.
670:
671: The main way in which this differs from our algorithm is that the selection
672: of nodes to try is {\em ad hoc}, rather than being guided directly by the
673: modification structure. The use of rules with a reduced score (such as {\tt
674: head = mod[]} above) and mopping up rules is also more explicit and modular
675: than the use of residuals. Furthermore, the scoring process in our phrase
matching takes the depth through the structure (and hence the significance
of the terms) into account better, and is arguably more perspicuous. Some
further related work can be found in Schwarz \cite{Schw:1990}, in which
syntactic structures are first converted to a normal form and then compared.
680:
The work was conducted before the rise of interest in question-answering
(Voorhees and Tice \cite{Voor:1999}), which also uses short, precise queries to locate specific
information. Most of the TREC-8 question-answering systems used IR followed by entity
extraction, and one important limitation of this technique for the
application described here is worth noting. The entity extracted as the answer
can appear anywhere in the retrieved text and consequently could be part of some
modifying phrase rather than the main point of the caption, and so result in
retrieving images which do not correspond well to the request. By contrast,
689: the phrase matching rules can penalise such matches, provided the captions
690: model the content of the images well.
691:
Two challenges follow. The first is to adapt techniques of this sort to full text
documents, in which there is a much richer linguistic structure, and where different
parts of the text may have different information content (a title compared to
a sentence in parentheses, for example). The second is the need for
evaluation measures which place more emphasis on interactive retrieval and user reaction.
The assumption in much IR is that the results are simply
judged by their relevance to the user's information needs, essentially as a
binary decision. With an extension such as context extraction, where the
retrieval results contain extra information over the original data, we need
an evaluation technique which is able to take into
account the benefit the user obtains from the results.
702:
703: \section*{Acknowledgements}
704:
705: The algorithms were implemented by the author and Aaron Kotcheff. The ideas
706: also benefitted from discussions with Tony Rose and Amanda Clare, and (at an
707: earlier stage) Tom Wachtel and Evelyn van de Veen.
708:
709: \bibliographystyle{abbrv}
710: \bibliography{pm}
711:
712: \end{document}
713: