
1: %%

2: %% LREC2000 camera ready

3: %%

4: \documentstyle[lrec2000]{article}

5:

6: \title{A Novelty-based Evaluation Method for Information Retrieval}

7:

8: \name{Atsushi Fujii, Tetsuya Ishikawa}

9:

10: \address{University of Library and Information Science \\

11: 1-2 Kasuga Tsukuba 305-8550, JAPAN \\

12: {\{fujii, ishikawa\}@ulis.ac.jp}}

13:

14: \abstract{In information retrieval research, precision and recall have

15: long been used to evaluate IR systems. However, given that a number of

16: retrieval systems resembling one another are already available to the

17: public, it is valuable to retrieve novel relevant documents, i.e.,

18: documents that cannot be retrieved by those existing systems. In view

19: of this problem, we propose an evaluation method that favors systems

20: retrieving as many novel documents as possible. We also used our

21: method to evaluate systems that participated in the IREX workshop.}

22:

23: \newcommand{\etal}{et~al.}

24: \newcommand{\etaleos}{et~al}

25: \newcommand{\eq}[1]{(\ref{#1})}

26:

27: \begin{document}

28:

29: \maketitleabstract

30:

31: \section{Introduction}

32: \label{sec:introduction}

33:

34: In information retrieval (IR) research, the notions of precision and

35: recall have commonly been used to evaluate the empirical performance

36: of systems~\cite{keen:ipm-92,salton:ipm-92}. Precision is the ratio of

37: the number of relevant documents retrieved by a system under

38: evaluation, compared to the total number of documents retrieved by the

39: system. On the other hand, recall is the ratio of the number of

40: relevant documents retrieved by the system, compared to the total

41: relevant documents in a given benchmark test collection.

42:

43: In other words, the precision/recall-based evaluation method regards

44: all the relevant documents as equally important or informative for the

45: user, and thus highly values systems that retrieve as many relevant

46: documents as possible, with little noise.

47:

48: However, in the real world, where a number of IR systems are

49: available, for example, on the World Wide Web, it is often the case

50: that the user has already read some of the relevant documents using other

51: systems. Thus, systems that always retrieve relevant documents similar

52: to those retrieved by ubiquitous systems have little practical

53: utility. In addition, meta search systems, which integrate document

54: sets retrieved by more than one system, are less effective, in the

55: case where individual systems retrieve similar documents.

56:

57: In view of these problems, our proposed IR evaluation method favors

58: systems that retrieve more {\em novel\/} documents, that is, relevant

59: documents which cannot be retrieved by other existing systems.

60:

61: From a different perspective, our evaluation method is also effective

62: in producing test collections. The pooling

63: method~\cite{voorhees:sigir-98}, which has commonly been used to

64: produce test collections, requires a variety of participating systems.

65: However, in the case where most participating systems adopt similar

66: techniques, it is not feasible to collect a sufficient ``pool'' (i.e.,

67: a set of candidates for relevant documents).  Our evaluation method is

68: expected to promote the development of IR systems based on diverse concepts,

69: and therefore resolve the above problem.

70:

71: Section~\ref{sec:measure} formalizes the evaluation measure based on

72: the novelty of documents, and Section~\ref{sec:case_study} applies

73: this measure to evaluate IR systems that participated in the IREX

74: workshop~\cite{sekine:irex-99}.

75:

76: \section{Formalizing the Measure}

77: \label{sec:measure}

78:

79: Instead of the notions of precision and recall, we propose as a new

80: evaluation measure the utility of system $x$ with respect to relevant

81: document $d$, \mbox{$U_{d}(x)$}. This measure denotes the extent to

82: which $x$ contributes to providing the user with $d$, for a given

83: query.  Note that in this paper, $d$ generally refers to a {\em

84: relevant\/} document.

85:

86: From an information-theoretic point of view, we calculate

87: \mbox{$U_{d}(x)$} as the ratio of the probability that the user reads

88: document $d$ by using system $x$, \mbox{$P(D=d|S=x)$}, compared to the

89: probability that the user reads $d$ by using another system (i.e.,

90: even without using $x$), \mbox{$P(D=d)$}, as shown in

91: Equation~\eq{eq:udx}.

92: \begin{equation}

93:   \label{eq:udx}

94:   U_{d}(x) = \log\frac{\textstyle P(D=d|S=x)}{\textstyle P(D=d)}

95: \end{equation}

96: In the case where system $x$ adopts a ubiquitous retrieval technique,

97: the value of \mbox{$P(D=d|S=x)$} becomes similar to that of

98: \mbox{$P(D=d)$}, and thus the utility of $x$ becomes small.  On the

99: other hand, the utility of $x$ becomes greater as the number of {\em

100: novel\/} relevant documents provided by $x$ increases.

101:

102: We then calculate the {\em total\/} utility of $x$, $U(x)$, by summing

103: up $U_{d}(x)$'s of all the relevant documents for the query, as shown

104: in Equation~\eq{eq:ux}.

105: \begin{equation}

106:   \label{eq:ux}

107:   U(x) = \sum_{d} U_{d}(x)

108: \end{equation}

109: To sum up, our evaluation method favors systems with greater

110: \mbox{$U(x)$}.

111:

112: In Equation~\eq{eq:udx}, \mbox{$P(D=d)$} is the summation of

113: \mbox{$P(D=d|S=y)$}'s for existing systems, weighted by the

114: probability that the user utilizes system $y$, \mbox{$P(S=y)$}.  Thus,

115: given a set $E$ of existing systems excluding $x$, we calculate

116: \mbox{$P(D=d)$} as in Equation~\eq{eq:pd}.

117: \begin{eqnarray}

118:   \label{eq:pd}

119:   \begin{array}{lll}

120:     P(D=d) & = & {\displaystyle \sum_{y\in E}P(D=d|S=y)\cdot P(S=y)} \\

121:     \noalign{\vskip 2ex}

122:     & \approx & {\displaystyle \sum_{y\in

123:     E}P(D=d|S=y)\cdot\frac{\textstyle 1}{\textstyle |E|}}

124:   \end{array}

125: \end{eqnarray}

126: Here, note that we assume uniformity with respect to \mbox{$P(S=y)$}.
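As an illustration with hypothetical values (not taken from the IREX
data), suppose that $E$ consists of two systems $y_{1}$ and $y_{2}$ with
\mbox{$P(D=d|S=y_{1}) = 0.5$} and \mbox{$P(D=d|S=y_{2}) = 0.1$}, and that
\mbox{$P(D=d|S=x) = 1$}. Assuming the natural logarithm, which the
definition above does not fix, Equations~\eq{eq:pd} and~\eq{eq:udx}
would give
\[
  P(D=d) \approx \frac{\textstyle 0.5 + 0.1}{\textstyle 2} = 0.3,
  \qquad
  U_{d}(x) = \log\frac{\textstyle 1}{\textstyle 0.3} \approx 1.20 .
\]
If instead \mbox{$P(D=d|S=x) = 0.3$}, the ratio is 1 and
\mbox{$U_{d}(x) = 0$}, that is, $x$ adds nothing beyond the existing
systems for this document.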

127:

128: Finally, the crucial issue is how to estimate

129: \mbox{$P(D=d|S=x)$}, i.e., the probability that the user reads

130: document $d$ by using system $x$. It can safely be assumed that the

131: user always reads the top document, $d_1$, and thus $P(D=d_{1}|S=x)$

132: always equals 1. However, the probability that the user reads the remaining

133: documents decreases as their ranking becomes lower.

134:

135: Given $N$ documents sorted according to their relevance degree, in

136: descending order, the user can choose a threshold for the ranking

137: (i.e., the boundary until which he/she continues to read) out of $N$

138: choices. Consequently, documents ranked lower than the threshold will

139: be discarded.

140:

141: In other words, we can calculate \mbox{$P(D=d|S=x)$} as the

142: probability that the user chooses a threshold equal to or greater than

143: the ranking of $d$, as in Equation~\eq{eq:pdx}.

144: \begin{equation}

145:   \label{eq:pdx}

146:   \begin{array}{lll}

147:     P(D=d|S=x) & = & {\displaystyle \sum_{i = r_{x,d}}^{N}

148:     \frac{\textstyle 1}{\textstyle N}} \\

149:     \noalign{\vskip 2ex}

150:     & = & \frac{\textstyle N - r_{x,d} + 1}{\textstyle N}

151:   \end{array}

152: \end{equation}

153: Here, $r_{x,d}$ is the ranking of document $d$ determined by system

154: $x$.
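As a worked illustration, assuming a hypothetical ranked list of
\mbox{$N = 300$} documents, Equation~\eq{eq:pdx} would give
\[
  \frac{\textstyle 300 - 1 + 1}{\textstyle 300} = 1, \qquad
  \frac{\textstyle 300 - 151 + 1}{\textstyle 300} = 0.5, \qquad
  \frac{\textstyle 300 - 300 + 1}{\textstyle 300} \approx 0.0033
\]
for \mbox{$r_{x,d} = 1$}, $151$ and $300$, respectively. Under this
estimate, the probability decreases linearly with the ranking.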

155:

156: \section{A Case Study using the IREX Collection}

157: \label{sec:case_study}

158:

159: Our concern in this section is to investigate the characteristics of

160: our evaluation method. For this purpose, we targeted the IR systems

161: that participated in the IREX workshop~\cite{sekine:irex-99}, and compared

162: the result obtained with our newly proposed evaluation method

163: with that based on precision/recall. We also investigated reasons

164: behind the difference between those two results, if any.

165:

166: \subsection{Overview of the IREX Collection}

167: \label{subsec:irex}

168:

169: The IREX collection was produced through the IREX

170: workshop~\cite{sekine:irex-99}, which consists of TREC-style IR and

171: MUC-style named entity (NE) tasks for Japanese.\footnote{{\tt

172: http://cs.nyu.edu/cs/projects/proteus/irex/\\index-e.html}} Hereafter,

173: the IREX collection/workshop refers solely to that related to the IR

174: task.

175:

176: The IREX collection consists of 30 queries, 211,853 articles collected

177: from two years' worth of ``Mainichi Shimbun'' newspaper

178: articles~\cite{mainichi:94-95},\footnote{Practically speaking, the

179: IREX collection provides only article IDs, which correspond to

180: articles in the Mainichi Shimbun newspaper CD-ROM '94-'95. Participants

181: must get a copy of the CD-ROMs themselves.} relevance assessment for

182: each query, retrieval results of 22 participating systems, and

183: technical details of each system.

184:

185: Each query consists of an ID, a description, and a narrative.  While

186: descriptions are usually phrases to briefly express the topic,

187: narratives consist of several sentences and synonyms associated with

188: the topic. Figure~\ref{fig:query} shows an example query in the SGML

189: form (translated into English by one of the organizers of the IREX

190: workshop).

191:

192: \begin{figure}[htbp]

193:   \begin{center}

194:     \leavevmode

195:     \small

196:     \begin{quote}

197:       \tt

198:       <TOPIC> \\

199:       <TOPIC-ID>1001</TOPIC-ID> \\

200:       <DESCRIPTION>Corporate merging</DESCRIPTION> \\

201:       <NARRATIVE>The article describes a corporate merging and in the

202:       article, the name of companies have to be identifiable. Information

203:       including the field and the purpose of the merging have to be

204:       identifiable. Corporate merging includes corporate acquisition,

205:       corporate unifications and corporate buying.</NARRATIVE> \\

206:       </TOPIC>

207:     \end{quote}

208:     \caption{An example query in the IREX collection.}

209:     \label{fig:query}

210:   \end{center}

211: \end{figure}

212:

213: Relevance assessment was performed based on the pooling

214: method~\cite{voorhees:sigir-98}. That is, candidates for relevant

215: documents were first pooled using the 22 participating systems.

216: Thereafter, for each candidate document, human experts assigned one of

217: three ranks of relevance, i.e., ``relevant'', ``partially relevant''

218: and ``irrelevant''.  The average number of documents pooled for each

219: query is 2,105, among which the numbers of relevant and partially

220: relevant documents are 68 and 116, respectively.

221:

222: Each retrieval result consists of the top 300 articles submitted in

223: the same format as used in TREC.\footnote{{\tt

224: http://trec.nist.gov/pubs.html}} For each of the 22 results, the TREC

225: evaluation software was used to investigate the performance (e.g.,

226: non-interpolated average precision).  Figure~\ref{fig:trec} shows a

227: fragment of the retrieval result obtained with one of the

228: participating systems, which consists of the query ID, dummy field,

229: article ID, ranking of the article, relevance degree computed by the

230: system, and system ID.

231:

232: \begin{figure}[htbp]

233:   \begin{center}

234:     \leavevmode

235:     \small

236:     \tt

237:     \begin{tabular}{llllll}

238:       1007 & 0 & 940228106 & 1 & 0.306856 & 1106 \\

239:       1007 & 0 & 940110130 & 2 & 0.246505 & 1106 \\

240:       1007 & 0 & 950106119 & 3 & 0.237173 & 1106 \\

241:       1007 & 0 & 940131126 & 4 & 0.236115 & 1106 \\

242:       1007 & 0 & 940614009 & 5 & 0.223313 & 1106 \\

243:       1007 & 0 & 940614002 & 6 & 0.222998 & 1106 \\

244:       1007 & 0 & 941107114 & 7 & 0.217324 & 1106 \\

245:       1007 & 0 & 940428222 & 8 & 0.215979 & 1106

246:     \end{tabular}

247:     \caption{A fragment of the retrieval result of system ``1106''.}

248:     \label{fig:trec}

249:   \end{center}

250: \end{figure}

251:

252: \begin{table*}[htbp]

253:   \tabcolsep=3pt

254:   \begin{center}

255:     \leavevmode

256:     \small

257:     \begin{tabular}{ll} \hline\hline

258:       {\hfill\centering Question\hfill} &

259:       {\hfill\centering Answers\hfill} \\ \hline

260:       query information used & only description (8),

261:       description+narrative (14) \\

262:       indexing method & word (9), n-gram (3), word+character (2),

263:       character (1), syntactic phrase (1), \\

264:       & statistical phrase (1) \\

265:       proper noun identification & yes (5) \\

266:       query expansion & local feedback (2), use of a thesaurus (2) \\

267:       retrieval method & vector space model (13), probabilistic model

268:       (4), latent semantic indexing (1) \\

269:       \hline

270:     \end{tabular}

271:     \caption{A fragment of the result of the IREX questionnaire.}

272:     \label{tab:spec}

273:   \end{center}

274: \end{table*}

275:

276: It should be noted that using relevance assessment and retrieval

277: results for each system, we can easily calculate \mbox{$P(D=d|S=x)$}

278: in Equation~\eq{eq:pdx}, which is the central issue in estimating our

279: evaluation measure.

280:

281: Technical details of participating systems were collected from

282: questionnaires answered by each participant, where questions ranged

283: from retrieval algorithms used to execution time. Although several

284: questions are relatively vague, a number of questions are effective in

285: characterizing each system.

286:

287: Table~\ref{tab:spec} shows representative questions in terms of

288: retrieval accuracy. In this table, the number of answers is indicated

289: in parentheses. However, answers classified as ``no'', ``unknown'' and

290: ``etc.'' are not shown. Roughly speaking, most systems adopted

291: word-based indexing and the vector space model combined with TF$\cdot$IDF

292: term weighting.

293:

294: On the other hand, note that in the IREX workshop, the correspondence

295: between system IDs and participants is not available to the

296: public. Additionally, several participants did not give oral

297: presentations or contribute papers to the proceedings. Consequently, for some

298: systems it is difficult to obtain sufficient technical details.

299:

300: For example, although most participants answered ``TF$\cdot$IDF'' for

301: the question about term weighting method, it is not possible to

302: identify the exact formula used, out of a number of

303: variants~\cite{salton:ipm-88,zobel:sigir-forum-98}, for several

304: systems.

305:

306: \subsection{Experimentation}

307: \label{subsec:experiment}

308:

309: As explained in Section~\ref{subsec:irex}, the 22 IREX participating

310: systems have already been ranked based on the conventional

311: precision/recall, using the TREC evaluation software.

312:

313: Thus, we re-evaluated the 22 systems based on our evaluation method,

314: and compared results derived from different evaluation methods. To put

315: it more precisely, we conducted 22 trials in each of which a different

316: system was under evaluation and the rest were regarded as existing

317: systems. That is, the former and latter correspond to $x$ and $E$ in

318: Section~\ref{sec:measure}, respectively.

319:

320: Note that in this evaluation, we did not regard ``partially relevant''

321: documents as relevant ones, because interpretation of ``partially

322: relevant'' is not fully clear to the authors.

323:

324: Table~\ref{tab:all_A} compares rankings obtained based on

325: non-interpolated average precision and the utility factor we proposed

326: in this paper. Table~\ref{tab:qbq_A} compares rankings obtained with

327: two evaluation methods on a query-by-query basis, where we show solely

328: the difference of rankings for enhanced readability. Since in the IREX

329: collection, every query ID consists of four digits starting with

330: ``10'', we simply show the remaining two digits in

331: Table~\ref{tab:qbq_A}.

332:

333: \begin{table}[htbp]

334:   \begin{center}

335:     \leavevmode

336:     \small

337:     \begin{tabular}{cccc} \hline\hline

338:       System ID &

339:       {\hfill\centering Avg. Precision\hfill} &

340:       {\hfill\centering Utility\hfill} &

341:       {\hfill\centering Difference\hfill} \\ \hline

342:       1144b & 2 & 1 & +1 \\

343:       1135a & 3 & 2 & +1 \\

344:       1144a & 1 & 3 & -2 \\

345:       1135b & 4 & 4 & 0 \\

346:       1103b & 5 & 5 & 0 \\

347:       1106 & 17 & 6 & +11 \\

348:       1145b & 16 & 7 & +9 \\

349:       1122b & 7 & 8 & -1 \\

350:       1103a & 10 & 9 & +1 \\

351:       1128b & 9 & 10 & -1 \\

352:       1142 & 6 & 11 & -5 \\

353:       1122a & 8 & 12 & -4 \\

354:       1110 & 11 & 13 & -2 \\

355:       1133a & 19 & 14 & +5 \\

356:       1133b & 18 & 15 & +3 \\

357:       1128a & 12 & 16 & -4 \\

358:       1120 & 14 & 17 & -3 \\

359:       1145a & 13 & 18 & -5 \\

360:       1112 & 15 & 19 & -4 \\

361:       1146 & 20 & 20 & 0 \\

362:       1132 & 22 & 21 & +1 \\

363:       1126 & 21 & 22 & -1 \\

364:       \hline

365:     \end{tabular}

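366:     \caption{Comparison of rankings obtained based on

367:     non-interpolated average precision and utility factor. ``Difference'' is the average-precision rank minus the utility rank, so positive values indicate a higher rank under the utility factor.}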

368:     \label{tab:all_A}

369:   \end{center}

370: \end{table}

371:

372: \begin{table*}[htbp]

373:   \tabcolsep=3pt

374:   \begin{center}

375:     \leavevmode

376:     \scriptsize

377:     \begin{tabular}{lrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr} \hline\hline

378:       & \multicolumn{30}{c}{Query ID} \\ \cline{2-31}

379:       System ID & 07 & 08 & 09 & 10 & 11 & 12 & 13 & 14 & 15 & 16 & 17 &

380:       18 & 19 & 20 & 21 & 22 & 23 & 24 & 25 & 26 & 27 & 28 & 29 & 30 &

381:       31 & 32 & 33 & 34 & 35 & 36 \\ \hline

382:       ~~~1103a & 8 & -7 & 14 & 0 & 8 & 3 & 3 & -14 & 1 & 13 & 5 & -3 & 0

383:       & -4 & -2 & 3 & -6 & -3 & 6 & 1 & -2 & 13 & 2 & 14 & -3 & -5 &

384:       -7 & -2 & -3 & 3 \\

385:       ~~~1103b & -2 & -5 & 6 & 4 & -1 & -3 & -6 & -9 & 4 & -5 & -1 & 1 &

386:       -3 & -2 & -1 & 8 & 0 & -2 & 1 & -2 & -1 & 7 & 1 & -3 & -5 & -1 &

387:       -6 & -3 & -2 & 5 \\

388:       ~~~1106 & 8 & -4 & -9 & -2 & 9 & -2 & 7 & 11 & 5 & -1 & -2 & -4 & 5

389:       & 4 & 0 & -3 & -3 & 2 & 0 & 0 & -1 & -1 & 1 & 2 & 1 & 2 & 0 & 2

390:       & 17 & 0 \\

391:       ~~~1110 & 6 & -1 & -4 & 4 & -1 & 9 & -4 & -10 & -1 & 0 & 4 & -2 &

392:       -5 & -1 & 0 & 3 & 0 & -2 & -1 & 0 & 0 & 16 & 13 & -1 & -3 & -3 &

393:       8 & 1 & 3 & -2 \\

394:       ~~~1112 & -2 & -5 & 0 & 0 & -5 & 3 & -3 & 1 & -11 & 0 & 5 & -5 & 12

395:       & -2 & -1 & 5 & -3 & -4 & -3 & -1 & -1 & -4 & -6 & -4 & 3 & 1 &

396:       -4 & -2 & 0 & 0 \\

397:       ~~~1120 & 1 & -2 & -2 & -1 & 0 & -3 & 4 & -8 & -1 & 0 & 5 & -2 & 7

398:       & 1 & 0 & 5 & 0 & 2 & 0 & 2 & 0 & -3 & -1 & -1 & 2 & 2 & 6 & 5 &

399:       -1 & 0 \\

400:       ~~~1122a & -2 & 2 & -2 & -7 & -5 & 5 & -5 & -11 & -1 & -5 & 1 & 8 &

401:       -1 & -6 & -2 & -8 & 1 & 1 & 0 & -1 & 4 & -4 & 1 & -1 & -3 & -1 &

402:       3 & -2 & -3 & -1 \\

403:       ~~~1122b & -5 & 0 & -8 & 1 & 0 & -8 & 1 & -5 & -9 & -5 & 0 & -2 &

404:       -3 & -6 & 1 & -4 & 4 & 0 & -2 & 1 & 7 & -3 & -2 & -4 & -4 & 0 &

405:       6 & 0 & -1 & -2 \\

406:       ~~~1126 & 0 & 4 & -10 & 0 & 0 & -2 & 0 & 3 & -1 & -1 & -1 & 1 & -1

407:       & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & -2 & -3 & 0 & 0 & -3 & -1

408:       & 0 & 0 \\

409:       ~~~1128a & -1 & -1 & 4 & -2 & -3 & 0 & 3 & -6 & -8 & -1 & -3 & 4 &

410:       2 & 9 & 1 & -13 & 0 & 6 & 2 & -1 & 0 & -2 & 1 & 0 & -1 & 1 & 4 &

411:       -4 & 0 & 4 \\

412:       ~~~1128b & -2 & 14 & -4 & -4 & -7 & -5 & 11 & 9 & -2 & -2 & -5 & 4

413:       & -1 & 3 & -2 & -13 & -1 & 1 & 2 & 2 & 0 & 1 & 0 & -5 & 1 & -1 &

414:       0 & -4 & 0 & -1 \\

415:       ~~~1132 & 0 & 16 & -9 & 2 & 0 & 0 & 0 & 12 & 21 & 0 & 0 & 10 & 0 &

416:       8 & 15 & 0 & -4 & 0 & 0 & 0 & 0 & 0 & 2 & 0 & 0 & -1 & 0 & 13 &

417:       0 & 0 \\

418:       ~~~1133a & -2 & -2 & -4 & 0 & 3 & 2 & 3 & 15 & 11 & 1 & -5 & -1 & 1

419:       & 7 & -1 & 3 & 4 & 1 & 4 & 1 & 0 & -2 & -1 & 1 & 4 & 7 & -1 & 0

420:       & 0 & 1 \\

421:       ~~~1133b & -3 & -2 & -4 & 2 & 3 & 1 & 11 & 15 & 3 & 0 & -4 & 2 & 0

422:       & 5 & 1 & 6 & 5 & 0 & 3 & 1 & 0 & -3 & -5 & -1 & 10 & 3 & -2 &

423:       -2 & 1 & -1 \\

424:       ~~~1135a & -1 & -2 & 9 & -2 & 4 & -11 & -6 & 4 & 9 & 2 & -6 & -4 &

425:       -1 & -1 & -1 & -2 & -3 & -1 & -1 & -1 & 0 & -2 & -2 & 0 & 1 & -1

426:       & -1 & 0 & -1 & -3 \\

427:       ~~~1135b & 2 & 0 & 6 & -1 & -12 & -13 & -6 & 1 & 2 & 0 & -3 & 1 &

428:       -5 & -6 & -3 & -1 & -3 & -2 & 0 & -1 & -4 & -7 & -2 & 0 & 0 & -2

429:       & -1 & -7 & -2 & 0 \\

430:       ~~~1142 & -4 & -1 & 10 & 0 & -5 & -1 & -7 & -14 & -7 & -3 & -2 & -3

431:       & -4 & -7 & -5 & -2 & 4 & -3 & -3 & -1 & -2 & -2 & -2 & -5 & 2 &

432:       -6 & -7 & -6 & -1 & -4 \\

433:       ~~~1144a & -2 & -1 & -1 & 3 & -1 & 5 & -16 & -9 & -3 & 5 & 1 & -6 &

434:       -1 & -2 & 0 & 6 & -1 & -2 & -2 & -3 & 0 & 0 & -2 & -1 & 0 & -4 &

435:       7 & 2 & -1 & -1 \\

436:       ~~~1144b & -2 & 3 & -1 & 2 & -2 & 5 & -16 & -5 & -2 & 5 & 2 & -5 &

437:       2 & -2 & 1 & 5 & -3 & 1 & 1 & -1 & 0 & 0 & -5 & -2 & 0 & 1 & 4 &

438:       2 & -1 & 2 \\

439:       ~~~1145a & 0 & -4 & -7 & -4 & -5 & -1 & 5 & 11 & -2 & -1 & -1 & -3

440:       & -1 & -1 & -1 & 1 & 8 & -3 & -5 & 5 & -1 & -4 & 5 & 6 & -2 & 2

441:       & -4 & -3 & 1 & -3 \\

442:       ~~~1145b & 3 & -3 & -5 & 5 & 13 & 7 & 12 & 13 & -5 & -1 & -2 & 8 &

443:       -3 & 4 & 0 & 2 & 1 & 1 & -2 & 0 & -1 & 0 & 5 & 6 & -2 & 7 & 0 &

444:       13 & -5 & 0 \\

445:       ~~~1146 & 0 & 1 & 21 & 0 & 7 & 9 & 9 & -4 & -3 & -1 & 12 & 1 & 0 &

446:       -1 & 0 & -1 & 0 & 7 & 0 & -2 & 1 & 0 & -1 & 2 & -1 & -1 & -2 &

447:       -2 & -1 & 3 \\

448:       \hline

449:     \end{tabular}

450:     \caption{Query-by-query comparison of rankings obtained based on

451:     non-interpolated average precision and utility factor.}

452:     \label{tab:qbq_A}

453:   \end{center}

454: \end{table*}

455:

456: \subsection{Discussion}

457: \label{subsec:discussion}

458:

459: Looking at Table~\ref{tab:all_A}, one may notice that rankings of

460: systems ``1106'', ``1145b'', ``1133a'' and ``1133b'' were

461: significantly improved under our evaluation method. Thus, we

462: investigated properties that characterize each of those four systems,

463: in comparison with other systems.

464:

465: First, we found that ``1106'' adopted a relatively simple

466: implementation, while most systems used more elaborate ones. To put it

467: more precisely, morphological analysis was performed, and nouns/verbs

468: were extracted for word-based indexing. For term weighting, a

469: TF$\cdot$IDF formula as in Equation~\eq{eq:tf_idf} was used, while

470: most systems used different methods, such as the logarithmic TF

471: formulation as in Equation~\eq{eq:log_tf_idf} and one proposed by

472: Robertson and Walker~\shortcite{robertson:sigir-94}.

473: \begin{equation}

474:   \label{eq:tf_idf}

475:   f_{t,d}\cdot\log\frac{\textstyle N}{\textstyle n_{t}}

476: \end{equation}

477: \begin{equation}

478:   \label{eq:log_tf_idf}

479:   (1 + \log f_{t,d})\cdot\log\frac{\textstyle N}{\textstyle n_{t}}

480: \end{equation}

481: Here, $f_{t,d}$ denotes the frequency with which term $t$ appears in

482: document $d$, and $n_{t}$ denotes the number of documents containing

483: term $t$. $N$ is the total number of documents in the collection.
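To illustrate the difference between the two formulae with hypothetical
values, assume \mbox{$f_{t,d} = 10$}, \mbox{$N = 1000$} and
\mbox{$n_{t} = 10$}, and take natural logarithms (the base is an
assumption made only for this example). Equations~\eq{eq:tf_idf}
and~\eq{eq:log_tf_idf} would then give
\[
  10\cdot\log 100 \approx 46.05
  \qquad \mbox{and} \qquad
  (1 + \log 10)\cdot\log 100 \approx 15.21,
\]
respectively. The raw-frequency formula thus rewards repeated
occurrences of a term far more strongly than the logarithmic one.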

484:

485: Second, ``1145b'' performed query expansion~\cite{qiu:sigir-93},

486: while a few other systems used query expansion (e.g., one based on a

487: thesaurus). In addition, a term weighting method based on mutual

488: information between two terms was introduced. A possible rationale

489: behind this method is that two terms that frequently co-occur are

490: effective in characterizing the domain of documents, and are thus

491: assigned greater term weights.

492:

493: Third, ``1133a'' and ``1133b'' also used domain knowledge for term

494: weighting. However, unlike the case of ``1145b'', they regarded pages

495: of news articles as domains. In practice, a greater weight is assigned

496: to terms whose distribution varies more strongly depending on the

497: page, because they are expected to characterize the domain. On the

498: other hand, terms that commonly appear on many pages are assigned a

499: smaller weight.

500:

501: To sum up, our novelty-based evaluation revealed the effectiveness of

502: those properties above, specifically term weighting methods introduced

503: in ``1145b'', ``1133a'' and ``1133b'', which were overshadowed or

504: underestimated within the precision/recall-based evaluation.

505:

506: We devote a little space to considering Table~\ref{tab:qbq_A} for further

507: investigation. We arbitrarily regarded improvements above seven as

508: significant, and focused solely on systems with relatively many

509: significant improvements, that is, ``1103a'' and ``1132''. Although

510: ``1145b'' is associated with the same number of significant

511: improvements as ``1132'', we previously discussed system ``1145b''

512: above.

513:

514: We found that ``1103a'' is one of the five systems that conduct proper

515: noun identification, and that five of the six queries where ``1103a''

516: achieved significant improvements are directly or indirectly

517: associated with proper nouns.

518:

519: Samples of query descriptions directly and indirectly related to

520: proper nouns include ``1016: Nick Price (a golfer)'' and ``1011:

521: arrest of suspects of robbery in the {\it Kanto\/} region'',

522: respectively. Note that in the latter (indirect) case, Japanese

523: prefectures within the ``{\it Kanto\/}'' region, which are not

524: explicitly described in the query (e.g., ``{\it Tokyo\/}'' and ``{\it

525: Kanagawa\/}''), must be identified in news articles.

526:

527: Finally, ``1132'' is the only system that used Latent Semantic

528: Indexing (LSI), which is an extension of the vector space model, so as

529: to retrieve relevant documents that share no terms with a given

530: query. While, as shown in Table~\ref{tab:all_A}, ``1132'' had the

531: lowest ranking in terms of the average precision, our evaluation

532: method indicated that in many cases (queries) an LSI-based method is

533: expected to retrieve relevant documents that other types of methods

534: fail to retrieve.

535:

536: \section{Conclusion}

537: \label{sec:conclusion}

538:

539: Evaluation methods based on precision and recall have long been used

540: in information retrieval (IR) research, where systems that retrieve as

541: many relevant documents as possible are usually highly valued.

542:

543: However, given the fact that a number of retrieval systems resembling

544: one another are available to the public (not only in laboratories), it

545: is valuable to retrieve relevant documents that can never be retrieved

546: by those existing systems. This notion is also true in various

547: contexts that require a variety of IR systems, such as meta search

548: systems and the pooling method in producing IR test collections.

549:

550: In consideration of these factors, we proposed a new evaluation method

551: for IR, which favors systems that retrieve more novel documents, i.e.,

552: relevant documents that many systems fail to retrieve. To realize this

553: notion, we estimated the utility of a system in question by comparing

554: the probability that the user reads relevant documents by using the

555: system, and the probability that the user can read those documents

556: even without using the system.

557:

558: We also applied our evaluation method to the 22 systems that

559: participated in the IREX workshop, and identified several effective

560: techniques that have been underestimated in the conventional

561: precision/recall-based evaluation method.

562:

563: \bibliographystyle{acl}

564: \begin{thebibliography}{}

565:

566: \bibitem[\protect\citename{Keen}1992]{keen:ipm-92}

567: E.~Michael Keen.

568: \newblock 1992.

569: \newblock Presenting results of experimental retrieval comparisons.

570: \newblock {\em Information Processing \& Management}, 28(4):491--502.

571:

572: \bibitem[\protect\citename{{Mainichi Shimbun}}1994 1995]{mainichi:94-95}

573: {Mainichi Shimbun}.

574: \newblock 1994-1995.

575: \newblock Mainichi shimbun {CD-ROM} '94-'95.

576: \newblock (In Japanese).

577:

578: \bibitem[\protect\citename{Qiu and Frei}1993]{qiu:sigir-93}

579: Y.~Qiu and H.~Frei.

580: \newblock 1993.

581: \newblock Concept based query expansion.

582: \newblock In {\em Proceedings of the 16th Annual International ACM SIGIR

583:   Conference on Research and Development in Information Retrieval}, pages

584:   160--169.

585:

586: \bibitem[\protect\citename{Robertson and Walker}1994]{robertson:sigir-94}

587: S.~E. Robertson and S.~Walker.

588: \newblock 1994.

589: \newblock Some simple effective approximations to the 2-{P}oisson model for

590:   probabilistic weighted retrieval.

591: \newblock In {\em Proceedings of the 17th Annual International ACM SIGIR

592:   Conference on Research and Development in Information Retrieval}, pages

593:   232--241.

594:

595: \bibitem[\protect\citename{Salton and Buckley}1988]{salton:ipm-88}

596: Gerard Salton and Christopher Buckley.

597: \newblock 1988.

598: \newblock Term-weighting approaches in automatic text retrieval.

599: \newblock {\em Information Processing \& Management}, 24(5):513--523.

600:

601: \bibitem[\protect\citename{Salton}1992]{salton:ipm-92}

602: Gerard Salton.

603: \newblock 1992.

604: \newblock The state of retrieval system evaluation.

605: \newblock {\em Information Processing \& Management}, 28(4):441--449.

606:

607: \bibitem[\protect\citename{Sekine and Isahara}1999]{sekine:irex-99}

608: Satoshi Sekine and Hitoshi Isahara.

609: \newblock 1999.

610: \newblock {IREX} project overview.

611: \newblock In {\em Proceedings of the IREX Workshop}, pages 7--12.

612:

613: \bibitem[\protect\citename{Voorhees}1998]{voorhees:sigir-98}

614: Ellen~M. Voorhees.

615: \newblock 1998.

616: \newblock Variations in relevance judgments and the measurement of retrieval

617:   effectiveness.

618: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR

619:   Conference on Research and Development in Information Retrieval}, pages

620:   315--323.

621:

622: \bibitem[\protect\citename{Zobel and Moffat}1998]{zobel:sigir-forum-98}

623: Justin Zobel and Alistair Moffat.

624: \newblock 1998.

625: \newblock Exploring the similarity space.

626: \newblock {\em ACM SIGIR Forum}, 32(1):18--34.

627:

628: \end{thebibliography}

629:

630: \section*{Acknowledgments}

631:

632: The authors would like to thank the organizers and participants of the

633: IREX workshop for their support with the IREX collection.

634:

635: \end{document}

636: