0011:cs0011028/pm.tex

1: \documentclass{acm_proc_article-sp}

2: \begin{document}

3:

4: % CIKM 2000 version, modified after refereeing

5: % Adapted for ACM style sheet

6:

7: \title{Retrieval from Captioned Image Databases Using Natural Language Processing}

8: \numberofauthors{1}

9: \author{

10: \alignauthor David Elworthy\titlenote{Now at: Microsoft Research Limited, St

11: George House, 1 Guildhall Street, Cambridge CB2 3NH, United Kingdom}\\

12:   \affaddr{Canon Research Centre Europe, Guildford, United Kingdom}\\

13:   \email{dahe@acm.org}

14: }

15:

16: \maketitle

17: \begin{abstract}

18: At first sight, it might appear that natural language processing should

19: improve the accuracy of information retrieval systems, by making available a

20: more detailed analysis of queries and documents. Although past results appear

21: to show that this is not so, if the focus is shifted to short phrases rather

22: than full documents, the situation becomes somewhat different. The ANVIL

23: system uses a natural language technique to obtain high accuracy retrieval of

24: images which have been annotated with a descriptive textual caption. The

25: natural language techniques also allow additional contextual information to be

26: derived from the relation between the query and the caption, which can help

27: users to understand the overall collection of retrieval results. The

28: techniques have been successfully used in an information retrieval system which

29: forms both a testbed for research and the basis of a commercial system.

30: \end{abstract}

31:

32: \category{H.3.1}{Information Systems}{Content Analysis and Indexing}

33: \category{H.3.3}{Information Systems}{Information Search and Retrieval}

34: \keywords{Information retrieval, Natural language processing, Image databases}

35:

36: \newcommand{\aum}{\"{a}}

37:

38: \section{Introduction}

39:

40: Text information retrieval is concerned with finding documents which match

41: against a user's query, and assigning a measure according to the closeness of

42: the match. Natural language processing (NLP) can provide rich information

43: about the text, and it might appear reasonable that this would result in

44: better retrieval than conventional ``bag of words''

45: approaches. Fagan \cite{Fagan:1987} reports experiments in which simple

46: keywords were augmented with compound terms consisting of pairs of

47: keywords. While the addition of compound terms produced better accuracy, no

48: significant difference was observed between terms selected on the basis of

49: their linguistic relationship and ones selected purely on the basis of their

50: statistical association. Smeaton \cite{Smeaton:1997} makes similar observations and

51: on the basis of a number of experiments concludes that NLP has little to offer

52: IR. However, there are exceptions in some niche areas. For example,

53: Flank \cite{Flank:1998} describes a retrieval system in which the ``documents''

54: are short image captions. She uses the techniques of searching on heads and

55: head-modifier combinations introduced by Strzalkowski \cite{Strz:1994}, and obtains high

56: precision and recall. It therefore appears that in specialised applications, NLP

57: may have something to offer.

58:

59: Here we will look at a technique called {\em phrase matching}, which attempts

60: to use lightweight, symbolic natural language analysis to improve retrieval

61: accuracy. Like Strzalkowski's work, it relies on looking for combinations of

62: words which stand in certain modification relationships, and like Flank, we

63: have applied it to searching a database of annotated images. It differs from

64: the earlier work in two important ways. Firstly, it does not simply use the

65: analysis of the captions and queries as a source of compound terms, as

66: Strzalkowski does. Instead it

67: recursively explores the structure of the caption and query, checking that

68: terms stand in equivalent modification relations in the two phrases. This also

69: allows the match score to be finely tuned and special cases such as negation

70: to be handled. Secondly, by means of a further algorithm called {\em context

71: extraction}, information about non-matching parts of the caption, related

72: to the parts which did match, can be obtained. The retrieval results can then

73: be organised and categorised by the contexts they have in common, with the

74: goal of helping users of the retrieval system to understand and organise the

75: results. This is an important step, because it provides information which is

76: unavailable without natural language analysis, and shows that NLP can

77: contribute in adding new functionality as well as improving accuracy.

78:

79: In section~2 of this paper, we will introduce the phrase matching algorithm,

80: and give some evaluation results. Section~3 then moves on to context

81: extraction. Some conclusions and suggestions for future work are presented in

82: section~4. We first briefly look at the application for which the work was

83: intended.

84:

85: \subsection{The ANVIL system}

86: ANVIL (Accurate Natural Language Visual Information Locator) is a retrieval

87: system for databases of digital photographs, intended for operation over the

88: world wide web. The photographs are annotated with captions, typically between

89: 10 and 30 words in length, which describe the subject matter of the image. The

90: system is intended for casual users, and it is therefore important to make it

91: easy to formulate and refine queries, and to help the users understand the

92: results. This is the main motivation for using phrasal captions and phrase

93: matching: while traditional IR techniques over collections of keywords may

94: give good recall, they are not really suitable for users who will give up if

95: they do not see an acceptable result in the first few presented by the

96: system. ANVIL is further enhanced by an interactive user interface, details of

97: which can be found in Rose et al. \cite{Rose:2000}.

98:

99: In outline, the processing in ANVIL proceeds as follows. When images are

100: registered with the system, their captions are analysed into a meaning

101: representation. The terms from the captions are stored in an index database,

102: pointing to records containing the image identifier and the analysed

103: caption. In retrieval, the terms are extracted from the query and used to find

104: candidate captions using conventional IR techniques such as vector-cosine

105: matching or the similarity techniques of Smeaton and Quigley \cite{SQ:1996}; this phase is

106: called simple matching. The query is analysed to a meaning representation in

107: the same way as the captions, and the representations of the query and

108: candidate captions are compared using natural language matching techniques. The

109: result of the comparison is a score, which is combined with the score from

110: simple matching. Contexts may also be extracted at this stage, and the resulting

111: images with their scores, captions and contexts are presented to the user.

112:

113: \section{Phrase matching}

114:

115: The basic idea in phrase matching is as follows. We start by analysing the

116: query and the caption into dependency structures, in which the words are

117: connected by labelled links indicating the relationship between them. One word

118: (or occasionally more) will not be a modifier of any other words. It is

119: designated the head, and is the word which says, in most general terms, what

120: the caption is about. The head of the query is compared against words in the

121: caption, starting from its own head and progressing to modifiers if no match

122: is found. If there is a match, the modifiers of the query head are compared

123: against modifiers of the corresponding term in the caption. For each word that

124: matches, the process recurses in a similar way down through the dependency

125: structure. The modification relationships can be simple ones, or they can

126: involve tracing through several dependency links. Each stage of the comparison

127: has a score associated with it, so that strong and weak matches can be

128: assigned different scores. Finally, we allow matching of elements in the

129: dependency structure against fixed expressions, to detect special cases such

130: as negation.

131:

132: Figure~\ref{f1} shows the dependency structures for two phrases with similar

133: meanings.

134: \begin{figure} %[htb]

135: \centering

136: \setlength{\unitlength}{1cm}

137: \begin{picture}(12,6.5)(0,1)

138: \put(0,1.5){\makebox{\rm colour}}

139: \put(2,1.5){\makebox{\rm document}}

140: \put(3.9,1.5){\makebox{\rm copier}}

141: \put(1.6,2){\oval(1.7,1)[t]}

142: \put(1.5,2.5){\vector(1,0){0.3}}

143: \put(1.3,2.7){\makebox{\tt mod}}

144: \put(3.3,2){\oval(1.7,1)[t]}

145: \put(3.2,2.5){\vector(1,0){0.3}}

146: \put(3.0,2.7){\makebox{\tt mod}}

147: %\end{picture}

148: %\\

149: %\begin{picture}(12,3.5)(0,1)

150: \put(0,4.5){\makebox{\rm copier}}

151: \put(2.0,4.5){\makebox{\rm for}}

152: \put(3.9,4.5){\makebox{\rm colour}}

153: \put(5.8,4.5){\makebox{\rm documents}}

154: \put(1.2,5){\oval(2,1)[t]}

155: \put(1.3,5.5){\vector(-1,0){0.3}}

156: \put(0.8,5.8){\makebox{\tt prep}}

157: \put(4.2,5){\oval(4,2)[t]}

158: \put(4.2,6){\vector(-1,0){0.3}}

159: \put(3.2,6.2){\makebox{\tt phead}}

160: \put(5.2,5){\oval(2,1)[t]}

161: \put(5.2,5.5){\vector(1,0){0.3}}

162: \put(4.8,5.7){\makebox{\tt mod}}

163: %\put(5.8,1.5){\makebox{\rm copier}}

164: %\put(7.8,1.5){\makebox{\rm for}}

165: %\put(9.7,1.5){\makebox{\rm colour}}

166: %\put(11.6,1.5){\makebox{\rm documents}}

167: %put(7,2){\oval(2,1)[t]}

168: %\put(7.1,2.5){\vector(-1,0){0.3}}

169: %\put(6.6,2.8){\makebox{\tt prep}}

170: %\put(10,2){\oval(4,2)[t]}

171: %\put(10,3){\vector(-1,0){0.3}}

172: %\put(9,3.2){\makebox{\tt phead}}

173: %\put(11,2){\oval(2,1)[t]}

174: %\put(11,2.5){\vector(1,0){0.3}}

175: %\put(10.6,2.7){\makebox{\tt mod}}

176: \end{picture}

177: \caption{Example dependency structures}

178: \label{f1}

179: \end{figure}

180: Dependencies are shown as pointing from a modifier to the term it

181: modifies. Although dependency structures go some way to abstracting away from

182: the syntactic analysis, we still need a way of assigning a similarity between

183: non-identical structures. In this example, we want the noun-noun modification

184: between {\em copier} and {\em document} in the lower phrase to have a high

185: similarity to the modification via the preposition {\em for} in the upper one.

186:

187: For convenience, we represent dependency structures using a notation of

188: indexed variables, in which the name of the variable stands for the name of the

189: dependency, and the variable is indexed on the modified word. An unindexed

190: variable is used for the head. The examples can then be written as

191: \begin{verbatim}

192: colour document copier

193:     head = copier

194:     mod[copier]   = document

195:     mod[document] = colour

196:

197: copier for colour documents

198:     head = copier

199:     prep[copier]   = for

200:     phead[for]     = documents

201:     mod[documents] = colour

202: \end{verbatim}

203: Thus, for example, \texttt{mod[copier] = document} indicates that {\em copier}

204: stands in the \texttt{mod} relation to {\em document}, i.e.\ the \texttt{mod}(ifier)

205: of {\em copier} is {\em document}.
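As an illustration only, the indexed-variable notation can be mirrored in a small data structure. The sketch below is in Python rather than in any notation used by ANVIL; the dictionary layout and the helper name lookup are assumptions made for this example.

```python
# Hypothetical encoding (not ANVIL's internal format): each dependency name
# maps a modified word to the list of words stored under that index, and the
# unindexed "head" entry holds the head word.
colour_document_copier = {
    "head": "copier",
    "mod": {"copier": ["document"], "document": ["colour"]},
}

copier_for_colour_documents = {
    "head": "copier",
    "prep": {"copier": ["for"]},
    "phead": {"for": ["documents"]},
    "mod": {"documents": ["colour"]},
}


def lookup(structure, dependency, word):
    """Return the words stored in dependency[word], or an empty list."""
    return structure.get(dependency, {}).get(word, [])


print(lookup(colour_document_copier, "mod", "copier"))
print(lookup(copier_for_colour_documents, "prep", "copier"))
```

Here lookup(colour_document_copier, "mod", "copier") corresponds to reading off \texttt{mod[copier]}.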

206:

207: Dependency structures are especially suitable for this kind of

208: processing. They are closely related to the syntactic form, but abstract away

209: from the linear order of the words and fine details of phrase structure. From

210: a practical point of

211: view, dependency structures can be computed quickly and efficiently; see for

212: example, the dependency parser built by J\"{a}rvinen and Tapanainen \cite{Jarvinen:1997} or the Link

213: grammar parser of Sleator et al. \cite{Sleator:1991}. We use a finite-state parser which

214: has been modified to deliver the dependencies as well as the phrase bracketing

215: (Elworthy \cite{Elworthy:2000}). It works in time roughly proportional to the square

216: of the number of words in the phrase.

217:

218: \subsection{Matching rules}

219:

220: A system of rules specifies what relationships can be treated as equivalent. A

221: small set of example rules appears in figure~\ref{f2}.

222: \begin{figure} %[htb]

223: \centering

224: \begin{verbatim}

225: head_rule

226: {

227:  head = head  1.0 => mod_rule 0.7;

228:  head = mod[] 0.5 => mod_rule 0.7;

229:  mod[]  ?     0.3 => Done 1.0;

230: }

231:

232: mod_rule

233: {

234:  mod[] = mod[] 1.0 => mod_rule 1.0;

235:

236:  phead:prep[] = phead:prep[] 1.0 => mod_rule 1.0;

237:  phead:prep[] = mod[] 1.0 => mod_rule 1.0;

238:  mod[] = phead:prep[] 1.0 => mod_rule 1.0;

239:

240:  vhead:cop:rel[]

241:      = vhead:cop:rel[] 1.0 => mod_rule 1.0;

242:  vhead:cop:rel[] = mod[] 1.0 => mod_rule 1.0;

243:  mod[] = vhead:cop:rel[] 1.0 => mod_rule 1.0;

244:

245:  amod[] = amod[]  1.0 => Done 1.0;

246:  'not' = amod[]  0.0 => Done 0.0;

247:  amod[] = 'not' 0.0 => Done 0.0;

248: }

249: \end{verbatim}

250: \caption{Example matching rules}

251: \label{f2}

252: \end{figure}

253: The left and right hand sides of a comparison express paths through the

254: dependency structure. The idea is that if we have already found a query

255: word which matches a word from the caption, we then follow the specified paths

256: from these words, and compare the words lying at the end of the paths.

257:

258: It is convenient to gather rules into named groups, such as \texttt{head\_rule}

259: and \texttt{mod\_rule}. One group is designated the start group, and its rules

260: are applied to start the matching process. Within a group, the rules are

261: applied in order, so that later rules in a group can be used to test words

262: which were not caught by the earlier rules. Each rule has a continuation,

263: which specifies what should happen after it has been applied. As an example,

264: consider the rule

265: \begin{verbatim}

266: head = head    1.0 => mod_rule 0.7;

267: \end{verbatim}

268: This says that after matching head words, continue with the rule

269: group \texttt{mod\_rule}. The words which have just matched provide the

270: starting point for paths in the continuation; in effect they are substituted

271: where \texttt{[]} appears in \texttt{mod\_rule}. The special continuation {\tt

272: Done} indicates that no further comparison is to be carried out from the words

273: that matched.

274:

275: The process is started by comparing words without indexing, stored in {\tt

276: head}. Thus, the structures in the examples of figure~\ref{f1} can be matched

277: by starting with \texttt{head = head} and then continuing with \texttt{mod[] = phead:prep[]},

278: indicating that a modifier of the head (\texttt{mod[]}) can be compared with the

279: head of a prepositional phrase (\texttt{phead}) reached by following a

280: preposition link (\texttt{prep[]}) from a matched caption word.
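Path following can be sketched as below, under the assumption that a path such as \texttt{phead:prep} is read from right to left (the \texttt{prep} link first, then \texttt{phead}). The function name follow\_path and the dictionary encoding are hypothetical and serve only this example.

```python
def follow_path(structure, path, word):
    """Follow a colon-separated path from word and return the words reached.

    Links are applied right to left, so "phead:prep" follows a prep link
    first and then a phead link (an assumption about the notation).
    """
    current = [word]
    for dependency in reversed(path.split(":")):
        links = structure.get(dependency, {})
        current = [target for w in current for target in links.get(w, [])]
    return current


# Hypothetical encoding of the two phrases of the dependency figure.
query = {"head": "copier",
         "mod": {"copier": ["document"], "document": ["colour"]}}
caption = {"head": "copier",
           "prep": {"copier": ["for"]},
           "phead": {"for": ["documents"]},
           "mod": {"documents": ["colour"]}}

print(follow_path(query, "mod", "copier"))
print(follow_path(caption, "phead:prep", "copier"))
```

The two calls reach {\em document} and {\em documents} respectively. Treating these as the same word needs some normalisation of word forms, which this sketch does not attempt.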

281:

282: There are two special sorts of rules: mopping-up rules and token

283: rules. Mopping-up rules specify that certain words are to be considered to

284: have matched, without actually consuming any words from the other phrase. One

285: use is to catch words from the query which did not have a counterpart in the

286: caption. For example,

287: \begin{verbatim}

288: mod[] ?     0.3 => Done 1.0;

289: \end{verbatim}

290: causes modifiers from the query to be mopped up\footnote{Since this is

291: in the start rule group, the whole unmatched range of the \texttt{mod}

292: variable is used, without indexing.}. Token rules allow matching against

293: specific words. For example, the rule

294: \begin{verbatim}

295: 'not' = amod[]  0.0 => Done 0.0;

296: \end{verbatim}

297: allows an \texttt{amod} (``adverbial'' modifier) in the caption to be tested

298: against the literal word {\em not}, with an effect on scoring described below. There are a few

299: further variants of rules which we will not discuss here, for example rules

300: with a negated test, and ones which are sensitive to word order.

301:

302: \subsection{The scoring scheme}

303:

304: The scoring scheme is a critical part of phrase matching, as it will allow us

305: to distinguish exact and near-exact matches from partial and weak ones. The general

306: approach is to assign each word of the query phrase two numeric values, called

307: the {\em score} and the {\em weight}. The score of a query word is a measure

308: of how well it matched considered in isolation from the rest of the caption,

309: while the weight indicates the importance of the rule application. In general,

310: words which are compared in the start rule group, such as the head, will be

311: more important than ones compared as a result of a continuation, such as

312: modifiers. Scores are assigned to query words if they actually matched a word

313: in the caption, or if they were caught by a mopping up rule or token rule. The

314: score does not take the caption words into account, other than an allowance

315: for their similarity with query words. A special score, called an {\em

316: up-score} is also used to handle words at the end of paths for special cases

317: such as negation. Writing the scores as $s_{i}$ and the weights as $w_{i}$,

318: the overall score of the match is $\sum s_{i}w_{i} /\sum w_{i}$, modified by

319: the up-scores as described below.
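The combination of scores and weights can be checked with a few lines of Python. This is a minimal sketch that follows the arithmetic of figure~\ref{f4}, where the weighted sum of the scores is divided by the sum of the weights; the function name overall\_score is an assumption, and the up-score is taken to be already folded into each word's score.

```python
def overall_score(matches):
    """Combine (score, weight) pairs, one per query word, into one score.

    The result is the weighted mean of the scores. Returning 0.0 when
    nothing matched is a choice made for this sketch only.
    """
    total_weight = sum(weight for _, weight in matches)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in matches) / total_weight


# Values taken from the worked example: head "car" (score 1.0, weight 1.0)
# and modifier "yellow" (weight 0.7) with score 1.0, or 0.0 when negated.
print(round(overall_score([(1.0, 1.0), (1.0, 0.7)]), 2))
print(round(overall_score([(1.0, 1.0), (0.0, 0.7)]), 2))
```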

320:

321: The rules are annotated with two values, called the $t$ (term) factor and the

322: $d$ (down) factor. In general, the $t$-factor provides the basic score for

323: words which were matched by the rule, and the $d$-factor sets the weight for

324: continuations. Thus, in

325: \begin{verbatim}

326: head = mod[]   0.5 => mod_rule 0.7;

327: \end{verbatim}

328: the $t$-factor is 0.5 and the $d$-factor is 0.7.

329:

330: At the start of matching, the weight is 1.0. As we follow through

331: continuations, it is the product of the $d$-factors of the rules leading to

332: this point. If the rule above were in the start group, the weight of words

333: matched in \texttt{mod\_rule} would be 0.7, and if \texttt{mod\_rule} contained

334: \begin{verbatim}

335: mod[] = mod[]   1.0 => submod_rule 0.6;

336: \end{verbatim}

337: then the weight in \texttt{submod\_rule} would be $0.7\times 0.6$. The scores are

338: formed from the product of the $t$-factor of the rule, and two special

339: factors. Firstly, the similarity between the words can be used. For example,

340: we might allow {\em car} to match {\em vehicle}, but with a reduced

341: score. This factor could be calculated using lexical similarity metrics such

342: as those of Resnik \cite{Resnik:1995} or Jiang and Conrath

343: \cite{JiCon:1997}. A further extension

344: would be to recognise that the agentive suffix {\em X-er} (as in {\em copier})

345: allows a

346: match against the whole phrase {\em machine for X-ing} (as in {\em machine for

347: copying}), and similar rules based on derivational morphology. We do not take

348: this step in the current version of phrase matching.

349:

350: The second special factor is the up-score. When a \texttt{Done} continuation

351: is reached,

352: its $d$-factor is multiplied into the score assigned by the rule which invoked

353: it\footnote{This represents a harmless overloading of the rule

354: notation}. Usually the factor will be 1.0, but in special cases it may be some

355: other

356: value. An example of where this is useful can be found in the rules involving

357: `not' in figure~\ref{f2}. When a negation is seen, we effectively cancel the

358: score on the word which is negated, by using a $d$-factor, and hence an

359: up-score, of 0. Note that

360: making this kind of adjustment based on word pairs without the recursion

361: through the overall structure, as in Fagan's and Strzalkowski's work, is

362: very hard to do.
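The way the factors combine can be written down directly. In this sketch the weight is the product of the $d$-factors on the route from the start group, and the score of a word is the product of the $t$-factor, a word similarity and the up-score. The function names and the default values of 1.0 are assumptions for illustration.

```python
from math import prod


def weight(d_factors):
    """Weight at a point in matching: product of the d-factors of the rules
    leading to it, which is 1.0 at the start (empty list)."""
    return prod(d_factors, start=1.0)


def word_score(t_factor, similarity=1.0, up_score=1.0):
    """Score of a matched query word: the t-factor of the rule scaled by the
    word similarity and by the up-score passed back from a Done continuation."""
    return t_factor * similarity * up_score


print(weight([]))                      # start group
print(weight([0.7, 0.6]))              # inside submod_rule in the example
print(word_score(1.0, up_score=0.0))   # negated word: score cancelled
```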

363:

364: To show the rules in operation, suppose the query {\em yellow car} is tested

365: against {\em yellow car}, {\em car which is yellow} and {\em car which is not

366: yellow}. The dependency structures, written as variables, are shown in

367: figure~\ref{f3}, and a trace through the matching process appears in

368: figure~\ref{f4}.

369: \begin{figure} %[htb]

370: \centering

371: \begin{verbatim}

372: yellow car

373:      head = car

374:      mod[car] = yellow

375:

376: car which is yellow

377:     head = car

378:     rel[car]   = which

379:     cop[which] = is

380:     vhead[is]  = yellow

381:

382: car which is not yellow

383:     head = car

384:     rel[car]     = which

385:     cop[which]   = is

386:     vhead[is]    = yellow

387:     amod[yellow] = not

388: \end{verbatim}

389: \caption{Dependency structures for the matching example}

390: \label{f3}

391: \end{figure}

392: In particular, note how the rule

393: \begin{verbatim}

394: 'not' = amod[]  0.0 => Done 0.0;

395: \end{verbatim}

396: causes the previous score assignment for {\em yellow} to be replaced by 0 when

397: comparing against {\em car which is not yellow}.

398: \begin{figure*} %[htb]

399: \centering

400: \textbf{yellow car + yellow car}\\

401: \begin{tabular}{|l|l|l|c|c|} \hline

402: Query word & Rule group & Comparison & Score & Weight \\ \hline

403: car    & head\_rule & head = head   & 1.0 & 1.0 \\

404: yellow & mod\_rule  & mod[] = mod[] & 1.0 & 0.7 \\ \hline

405: \end{tabular}

406: \\Overall match score = $(1.0\times 1.0 + 1.0\times 0.7) / (1.0 + 0.7) = 1.0$

407: \vspace*{0.5cm}\\

408: \textbf{yellow car + car which is yellow}\\

409: \begin{tabular}{|l|l|l|c|c|} \hline

410: Query word & Rule group & Comparison & Score & Weight \\ \hline

411: car    & head\_rule & head = head             & 1.0 & 1.0 \\

412: yellow & mod\_rule  & mod[] = vhead:cop:rel[] & 1.0 & 0.7 \\ \hline

413: \end{tabular}

414: \\Overall match score = $(1.0\times 1.0 + 1.0\times 0.7) / (1.0 + 0.7) = 1.0$

415: \vspace*{0.5cm}\\

416: \textbf{yellow car + car which is not yellow}\\

417: \begin{tabular}{|l|l|l|c|c|} \hline

418: Query word & Rule group & Comparison & Score & Weight \\ \hline

419: car    & head\_rule & head = head             & 1.0 & 1.0 \\

420: yellow & mod\_rule  & mod[] = vhead:cop:rel[] & 1.0 (initially) & 0.7 \\

421: (none) & mod\_rule  & 'not' = amod[]          & 0.0 & 0.0 \\

422: yellow & mod\_rule  & mod[] = vhead:cop:rel[] & 0.0 (on up-score) & 0.7 \\ \hline

423: \end{tabular}

424: \\Overall match score = $(1.0\times 1.0 + 0.0\times 0.7) / (1.0 + 0.7) = 0.59$

425: \caption{Matching in action}

426: \label{f4}

427: \end{figure*}

428: The scores in this rule set are chosen on the basis of examining a variety of

429: examples, some of which might be expected to provide a close match, some a

430: partial match, and some a weak match. No experiments on learning the scores

431: from data have been carried out.
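The trace of figure~\ref{f4} can be reproduced with a heavily simplified matcher. The sketch below is not the ANVIL implementation: it models only \texttt{head = head} and three of the \texttt{mod\_rule} comparisons, hard-codes the negation token rule, and omits mopping-up rules and word similarity. All function and variable names are assumptions.

```python
# Simplified sketch: rules are (query path, caption path, t-factor) triples.
MOD_RULES = [
    ("mod", "mod", 1.0),
    ("mod", "phead:prep", 1.0),
    ("mod", "vhead:cop:rel", 1.0),
]
HEAD_D_FACTOR = 0.7  # d-factor of "head = head 1.0 => mod_rule 0.7"


def follow(structure, path, word):
    """Follow a path such as 'vhead:cop:rel' from word, rightmost link first."""
    current = [word]
    for dependency in reversed(path.split(":")):
        links = structure.get(dependency, {})
        current = [m for w in current for m in links.get(w, [])]
    return current


def match_modifiers(query, caption, q_word, c_word, weight, results):
    """Compare modifiers of q_word with words reachable from c_word."""
    for q_path, c_path, t_factor in MOD_RULES:
        for q_mod in follow(query, q_path, q_word):
            if any(word == q_mod for word, _, _ in results):
                continue  # already matched by an earlier rule
            if q_mod in follow(caption, c_path, c_word):
                # Negation token rule, hard-coded here: up-score of 0.
                negated = "not" in follow(caption, "amod", q_mod)
                up_score = 0.0 if negated else 1.0
                results.append((q_mod, t_factor * up_score, weight))
                # mod_rule continues with mod_rule, d-factor 1.0.
                match_modifiers(query, caption, q_mod, q_mod,
                                weight * 1.0, results)


def match(query, caption):
    """Return the overall match score of a query against a caption."""
    if query["head"] != caption["head"]:
        return 0.0
    results = [(query["head"], 1.0, 1.0)]  # (word, score, weight)
    match_modifiers(query, caption, query["head"], caption["head"],
                    HEAD_D_FACTOR, results)
    total_weight = sum(w for _, _, w in results)
    return sum(s * w for _, s, w in results) / total_weight


yellow_car = {"head": "car", "mod": {"car": ["yellow"]}}
car_which_is_yellow = {
    "head": "car", "rel": {"car": ["which"]},
    "cop": {"which": ["is"]}, "vhead": {"is": ["yellow"]},
}
car_which_is_not_yellow = dict(car_which_is_yellow, amod={"yellow": ["not"]})

for caption in (yellow_car, car_which_is_yellow, car_which_is_not_yellow):
    print(round(match(yellow_car, caption), 2))
```

Run on the three captions of figure~\ref{f3}, the sketch gives the same three overall scores as the figure. A fuller version would read the rules and their continuations from the rule file rather than fixing them in code.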

432:

433: \subsection{Evaluation}

434:

435: Evaluation of image caption retrieval is limited by the lack of suitable large test collections. We therefore created our own captions for a

436: set of digital photographs. The captions were prepared according to a set of

437: guidelines, so that they emphasised the objects in the image rather than

438: layout or composition. The guidelines were formulated to overcome problems with quality which

439: had been seen both in a pilot study and in the captions used by

440: Smeaton and Quigley \cite{SQ:1996}. There were 1932 captions in the set, with lengths ranging

441: from 1 to 22 words (9.0 average). Almost all of the captions were noun

442: phrases. It is relatively easy to construct a grammar which correctly analyses

443: all the phrases.

444:

445: A query set was constructed by taking pictures from another source, and

446: devising phrases which should elicit a related image. An initial set of

447: results was obtained by pooling several keyword-based retrieval runs,

448: discarding queries which produced no results\footnote{With such a small test

449: collection and using a single retrieval system, it might have been better to

450: construct complete relevance judgements rather than use pooling. However, time

451: pressures prevented us from doing this.}. The top results from phrase

452: matching with each query were then judged for relevance by two human

453: assessors, acting separately. Neither assessor was responsible for writing the

454: captions; one of them devised the queries. A standard precision-recall measure

455: was then calculated, using the TREC interpolation procedure (from {\tt

456: http://trec.nist.gov/}). An example of the output for a query, showing some

457: sample captions, appears in figure~\ref{fout}.

458:

459: The main comparison point between different tests was chosen to be the

460: precision at 10\% recall. This represents the case of naive or casual users,

461: who do not care about completeness in the results and who want high accuracy

462: in the first few (Pollock and Hockey \cite{pollock}). The precision at 5

463: documents and the R-precision were also calculated, although they are less

464: useful, partly because there is often a very small number of

465: relevant results in such a small test set. Table~\ref{t1} shows the results

466: for a simple weighted

467: keyword matching strategy, and for phrase matching, using the two sets of

468: relevance judgements.

469: \begin{table*} %[htb]

470: \centering

471: \begin{tabular}{|l|c|c|c|} \hline

472: Run (assessor) & Precision at & Precision at & R-precision \\

473:     & 10\% recall  & 5 documents  &             \\ \hline

474: Simple matching (I)  & 85\% & 45\% & 61\% \\

475: Phrase matching (I)  & 92\% & 46\% & 66\% \\

476: Simple matching (II) & 87\% & 49\% & 63\% \\

477: Phrase matching (II) & 95\% & 53\% & 72\% \\ \hline

478: \end{tabular}

479: \caption{Evaluation results}

480: \label{t1}

481: \end{table*}
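For reference, the three measures can be computed as in the following sketch. It assumes the usual definitions: precision at $k$ documents divides by $k$, R-precision is the precision at rank $R$ where $R$ is the number of relevant documents, and interpolated precision at a recall level is the highest precision at any rank whose recall reaches that level. The function names and the toy ranking are illustrative and are not data from the evaluation.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant (divides by k)."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


def interpolated_precision(ranked, relevant, recall_level):
    """Highest precision at any rank whose recall is >= recall_level."""
    best = 0.0
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        if relevant and hits / len(relevant) >= recall_level:
            best = max(best, hits / rank)
    return best


# Toy data, invented for this example only.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "x"}

print(precision_at_k(ranked, relevant, 5))
print(r_precision(ranked, relevant))
print(interpolated_precision(ranked, relevant, 0.10))
```

With only three relevant items, precision at 5 documents cannot exceed 60\% in this toy case, which illustrates why that measure is less informative on a small collection.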

482:

483: Phrase matching improves precision at 10\% recall by 7 to 8 percentage points over simple matching (table~\ref{t1}). 43 of the 47

484: queries in the best phrase matching run gave a precision of 100\% at 10\%

485: recall. Inspection of the remaining results shows that the errors could

486: typically only be fixed with a richer semantic representation allowing

487: interaction between the meaning of the words. For example, the query {\em

488: plastic toys} fails to match {\em plastic sword} because a sword is not

489: normally a toy. The precision at 5 documents shows less of an improvement as a

490: result of the small numbers of relevant captions.

491:

492: Note that due to the lack of sources of good quality relevance judgements for this

493: kind of application, the results should be taken as suggestive of the

494: quality of phrase matching rather than as a definitive statement. An

495: evaluation was also carried out using the data from Smeaton and Quigley \cite{SQ:1996}, but we

496: concluded that the results could not be trusted, because the relevance

497: judgements were made against the images rather than the captions, and both the

498: captions and queries were of relatively low quality. In some cases we found

499: pairs of almost identical captions, one of which was judged relevant and one

500: irrelevant by Smeaton and Quigley's assessors. For comparison, the best

501: precision at 10\% recall reported by Smeaton and Quigley is around 62\%.

502:

503: \section{Context extraction}

504:

505: Context extraction is a means of obtaining additional information about

506: phrases which matched, by using the unmatched parts of the caption which are

507: close in the dependency structure to parts which did match. For example, if

508: the query was {\em camera lens}, and the captions included {\em long camera

509: lens} and {\em camera lens on a table}, then the contexts would be {\em long}

510: and {\em on a table}. Context extraction becomes valuable when there are many

511: retrieval results. Captions with similar contexts can be grouped together, for

512: example as shown in the bottom half of figure~\ref{c3}. A user can therefore

513: select or reject several retrieval results in one go by examining just the

514: contexts.

515:

516: The algorithm for extracting the context is quite straightforward. It is

517: outlined in figure~\ref{c2}.

518: % Figure fout is here to try to keep the page count under control

519: \begin{figure*} %[htb]

520: \centering

521: \begin{verbatim}

522: Query = 'camera with a lens.'

523: 5 results:

524: SCORE  CAPTION

525: 1      black SLR camera, with zoom lens, on a white surface.

526:        * camera: black, SLR, on a white surface

527:        * lens:   zoom

528: 1      old-style black camera, with protruding lens, on a white surface.

529:        * camera: black, on a white surface

530:        * lens:   protruding

531: 0.588  old camera, hip flask, box and album filled with sepia photographs.

532:        * camera: old

533: 0.5    Canon camera, magnifying lens and fashion magazine on grey ridged surface.

534:        * camera: Canon, on a grey ridged surface

535: 0.1    an astronaut floating within a space craft, showing the on-board cameras.

536:        * cameras: on-board

537: \end{verbatim}

538: \caption{Example ANVIL query result}

539: \label{fout}

540: \end{figure*}

541:

542: \begin{figure*} %[htb]

543: \centering

544: \begin{verbatim}

545: let P be the set of context rules (input)

546: let T be the set of current words, initialised to all matched words (input)

547: let U be the set of available words, initialised to all unmatched words (input)

548: let S be the set of contexts, initially empty (output)

549:

550: while T is not empty

551: {

552:     select a word t from T

553:

554:     for each word u in U

555:     {

556:        if there is a context rule <rt,rv,rp,ru,rC> in P such that

557:            has_pos(t,rt)

558:            AND in_var(t,rv)

559:            AND has_pos(u,ru)

560:            AND on_path(t,u,rp)

561:        then

562:            find the smallest phrase C such that valid_phrase(C,rC,u)

563:            if there is such a C then

564:                add the context <t,C> to S

565:                remove u from U

566:     }

567:

568:     remove t from T

569: }

570:

571: where

572:   has_pos(t,rt)   if t has part of speech rt

573:   in_var(t,rv)    if t is stored in variable rv

574:   on_path(t,u,rp) if the path rp connects t and u

575: \end{verbatim}

576: \caption{The context extraction algorithm}

577: \label{c2}

578: \end{figure*}

579:

580: The algorithm uses pre-defined context rules of the form $\langle

581: r_{t},r_{v},r_{p},r_{u},r_{C} \rangle$. In essence, it looks for words which

582: successfully matched, have a given part of speech $r_{t}$ and are stored in a

583: variable $r_{v}$. It then follows a path $r_{p}$ through the dependency structure,

584: arriving at an unmatched word $u$ with part of speech $r_{u}$, and then

585: extracts the syntactic context around it using the phrase type $r_{C}$ (for

586: example, {\em PP}, prepositional phrase). The restriction to the smallest

587: phrase is simply for cases where a phrase of a given type is embedded within

588: another phrase of the same type. The elements of the rule can be wildcards,

589: which match anything. The algorithm delivers a set of pairs, each of a matched word

590: and its context. Simpler versions of the rules which do not have all of these

591: elements might also be possible.
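The algorithm of figure~\ref{c2} can be sketched in Python as follows. Several details are assumptions made for this example: the dictionary encoding of the dependency structure, the representation of phrases as (type, words) pairs, and the treatment of a wildcard phrase type as selecting the unmatched word on its own. The caption {\em long camera lens on a table} and its analysis are likewise invented for illustration.

```python
def follow(structure, path, word):
    """Follow a path such as 'phead:prep' from word, rightmost link first."""
    current = [word]
    for dependency in reversed(path.split(":")):
        links = structure.get(dependency, {})
        current = [m for w in current for m in links.get(w, [])]
    return current


def in_var(structure, word, var):
    """True if word is stored in variable var ('*' matches anything)."""
    if var == "*":
        return True
    stored = structure.get(var, {})
    if isinstance(stored, str):          # the unindexed head variable
        return stored == word
    return any(word in mods for mods in stored.values())


def smallest_phrase(phrases, phrase_type, word):
    """Smallest phrase of the given type containing word, or None.
    Assumption: a wildcard type selects the word itself."""
    if phrase_type == "*":
        return (word,)
    found = [ws for t, ws in phrases if t == phrase_type and word in ws]
    return min(found, key=len) if found else None


def extract_contexts(rules, matched, unmatched, pos, structure, phrases):
    """Return (matched word, context phrase) pairs, as in the figure."""
    contexts = []
    available = list(unmatched)
    for t in matched:
        for u in list(available):
            for rt, rv, rp, ru, rc in rules:
                if rt not in ("*", pos[t]) or ru not in ("*", pos[u]):
                    continue
                if not in_var(structure, t, rv):
                    continue
                if u not in follow(structure, rp, t):
                    continue
                phrase = smallest_phrase(phrases, rc, u)
                if phrase is not None:
                    contexts.append((t, phrase))
                    available.remove(u)
                    break
    return contexts


# Hypothetical analysis of "long camera lens on a table", query "camera lens".
structure = {"head": "lens",
             "mod": {"lens": ["camera", "long"]},
             "prep": {"lens": ["on"]},
             "phead": {"on": ["table"]}}
pos = {"lens": "noun", "camera": "noun", "long": "adj",
       "on": "prep", "a": "det", "table": "noun"}
phrases = [("PP", ("on", "a", "table"))]
rules = [("noun", "*", "mod", "*", "*"),
         ("noun", "*", "phead:prep", "*", "PP")]

result = extract_contexts(rules, ["lens", "camera"],
                          ["long", "on", "a", "table"],
                          pos, structure, phrases)
print(result)
```

In this example both contexts attach to {\em lens}, since that is the word the adjective and the prepositional phrase depend on in the assumed analysis.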

592:

593: An example path rule is $\langle noun,*,mod,*,* \rangle$, which

594: selects all modifiers of nouns. A rule of this sort might be used for

595: extracting {\em long} in the examples above. To get the context {\em on a

596: table}, a suitable rule might be $\langle noun,*,phead:prep,*,PP \rangle$,

597: i.e. from a matched word, follow a {\em prep} link followed by a {\em phead}

598: link, and select the {\em PP} surrounding the resulting word. Some example

599: contexts resulting from these rules are shown in figure~\ref{c3}, for the

600: query {\em camera with a lens}. The bottom half of the figure shows the

601: results gathered together by context. Presenting the results to the user in

602: this way would allow selection or rejection of several results with a single

603: decision, thus making it easier to manage large result sets.

604: \begin{figure} %[htb]

605: \centering

606: \begin{verbatim}

607: Query = camera with a lens

608:

609: Captions and contexts

610: ---------------------

611:   Camera with a lens

612:     {none}

613:

614:   Large camera with a lens

615:     <camera [mod], large>

616:

617:   camera with a lens on a table

618:     <camera [phead:prep], [on a table]PP>

619:

620:   large camera with a zoom lens

621:     <camera [mod], large>

622:     <lens [mod], zoom>

623:

624:   camera on a table with a long zoom lens

625:     <camera [phead:prep], [on a table]PP>

626:     <lens [mod], zoom>

627:     <lens [mod], long>

628:

629: Captions gathered by context

630: ----------------------------

631:   camera with a lens:

632:     camera modifiers:

633:       {none}     (1)

634:       large      (2)

635:       on a table (2)

636:     lens modifiers:

637:       zoom       (2)

638:       long       (1)

639: \end{verbatim}

640: \caption{Example contexts}

641: \label{c3}

642: \end{figure}
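
Gathering results by context, as in the bottom half of figure~\ref{c3}, amounts to inverting the per-caption contexts into an index keyed by matched word and context. The following Python fragment is a minimal sketch of that idea, not the system's actual code: the function name, the data layout and the treatment of captions with no context are assumptions, and the input is simply the caption and context pairs listed in the top half of the figure.

```python
from collections import defaultdict

# (caption, [(matched word, context), ...]) taken from the top half of the figure.
results = [
    ("Camera with a lens", []),
    ("Large camera with a lens", [("camera", "large")]),
    ("camera with a lens on a table", [("camera", "on a table")]),
    ("large camera with a zoom lens",
     [("camera", "large"), ("lens", "zoom")]),
    ("camera on a table with a long zoom lens",
     [("camera", "on a table"), ("lens", "zoom"), ("lens", "long")]),
]


def gather_by_context(results):
    """Index captions by matched word and context.

    Returns (groups, without_context), where groups[word][context] is the
    list of captions carrying that context, and without_context lists the
    captions for which no context was extracted.
    """
    groups = defaultdict(lambda: defaultdict(list))
    without_context = []
    for caption, contexts in results:
        if not contexts:
            without_context.append(caption)
        for word, context in contexts:
            groups[word][context].append(caption)
    return groups, without_context


groups, without_context = gather_by_context(results)
for word, by_context in groups.items():
    print(word)
    for context, captions in by_context.items():
        print(f"  {context} ({len(captions)})")
```

Because each context maps to the full list of captions that carry it, accepting or rejecting a context such as {\em on a table} selects or discards all of those captions in one step.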

643:

644: Perhaps the most important point about context extraction is not the algorithm

645: or exactly what the results look like, but the use of NLP to provide extra

646: information. Although there is some work in IR in extracting relevant parts of

647: the text, for example using named entity extraction, in general IR systems

648: just output a ranked list of matching documents. Context extraction

649: demonstrates that using NLP, which works with more detailed information

650: structures than traditional IR, we can produce a richer form of

651: output.

652:

653: \section{Discussion}

654:

655: The approach most closely related to phrase matching is that of

656: Sheridan and Smeaton \cite{SheSme:1992}. They start by constructing a

657: dependency tree (of a different form to ours), in

658: which interior nodes can be labelled, for example to mark the head or record

659: the preposition which links words on the nodes under it. The matching process

660: looks for pairs of words which are syntactically related in the query tree,

661: and which both appear in the tree for the key (caption). The nearest parent

662: nodes for the pairs of words are then checked for compatibility. Any parts of

663: the dependency structure which hang off the paths to the parent node, called

664: the residual structure, are examined to see if they could disrupt the

665: matching. For example, if the words were both nouns, a verb in the residual would

666: block the match, since its presence indicates the nouns cannot stand in a

667: head/modifier relationship. The whole process is launched by looking at the

668: rightmost node in the query structure. A score is assigned based on the

669: proportion of words which match, possibly modified by certain residual nodes.

670:

671: The main way in which this differs from our algorithm is that the selection

672: of nodes to try is {\em ad hoc}, rather than being guided directly by the

673: modification structure. The use of rules with a reduced score (such as {\tt

674: head = mod[]} above) and mopping up rules is also more explicit and modular

675: than the use of residuals. Furthermore, the scoring process in our phrase

676: matching takes the depth through the structure (and hence the significance

677: of the terms) into account better, and is arguably more perspicuous. Some

678: further related work can be found in Schwarz \cite{Schw:1990}, in which

679: syntactic structures are first converted to a normal form and then compared.

680:

681: The work was conducted before the rise of interest in question-answering

682: (Voorhees and Tice \cite{Voor:1999}) which also uses short, precise queries to locate specific

683: information. Most of the TREC-8 question-answering systems used IR followed by entity

684: extraction, and one important limitation of this technique when applied to the

685: application described here is worth noting. The entity extracted as the answer

686: can appear anywhere in the retrieved text and consequently could be part of some

687: modifying phrase rather than the main point of the caption, and so result in

688: retrieving images which do not correspond well to the request. By contrast,

689: the phrase matching rules can penalise such matches, provided the captions

690: model the content of the images well.

691:

692: Two challenges follow. The first is to adapt techniques of this sort to full text

693: documents, in which there is a much richer linguistic structure, and where different

694: parts of the text may have different information content (a title compared to

695: a sentence in parentheses, for example). Secondly, there is a need to use

696: evaluation measures which place more emphasis on interactive retrieval and user reaction.

697: The assumption in much IR is that the results are simply

698: judged by their relevance to the user's information needs, essentially as a

699: binary decision. With an extension such as context extraction, where the

700: retrieval results contain extra information over the original data, we need an evaluation technique which is able to take into

701: account the benefit obtained from the results by the information user.

702:

703: \section*{Acknowledgements}

704:

705: The algorithms were implemented by the author and Aaron Kotcheff. The ideas

706: also benefitted from discussions with Tony Rose and Amanda Clare, and (at an

707: earlier stage) Tom Wachtel and Evelyn van de Veen.

708:

709: \bibliographystyle{abbrv}

710: \bibliography{pm}

711:

712: \end{document}

713: