1: %%
2: %% LREC2000 camera ready
3: %%
4: \documentstyle[lrec2000]{article}
5: 
6: \title{A Novelty-based Evaluation Method for Information Retrieval}
7: 
8: \name{Atsushi Fujii, Tetsuya Ishikawa}
9: 
10: \address{University of Library and Information Science \\
11: 1-2 Kasuga Tsukuba 305-8550, JAPAN \\
12: {\{fujii, ishikawa\}@ulis.ac.jp}}
13: 
14: \abstract{In information retrieval research, precision and recall have
15: long been used to evaluate IR systems. However, given that a number of
16: retrieval systems resembling one another are already available to the
17: public, it is valuable to retrieve novel relevant documents, i.e.,
18: documents that cannot be retrieved by those existing systems. In view
19: of this problem, we propose an evaluation method that favors systems
20: retrieving as many novel documents as possible. We also used our
21: method to evaluate systems that participated in the IREX workshop.}
22: 
23: \newcommand{\etal}{et~al.}
24: \newcommand{\etaleos}{et~al}
25: \newcommand{\eq}[1]{(\ref{#1})}
26: 
27: \begin{document}
28: 
29: \maketitleabstract
30: 
31: \section{Introduction}
32: \label{sec:introduction}
33: 
In information retrieval (IR) research, the notions of precision and
recall have commonly been used to evaluate the empirical performance
of systems~\cite{keen:ipm-92,salton:ipm-92}. Precision is the ratio of
the number of relevant documents retrieved by a system under
evaluation to the total number of documents retrieved by the
system. Recall, on the other hand, is the ratio of the number of
relevant documents retrieved by the system to the total number of
relevant documents in a given benchmark test collection.
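
As a compact restatement of these definitions, let $R$ denote the set
of relevant documents in the collection and $A_{x}$ the set of
documents retrieved by system $x$ (both symbols are introduced here
purely for illustration). Precision and recall can then be sketched as
\[
  \mbox{precision} = \frac{\textstyle |R \cap A_{x}|}{\textstyle |A_{x}|},
  \qquad
  \mbox{recall} = \frac{\textstyle |R \cap A_{x}|}{\textstyle |R|}
\]
For instance, a hypothetical system that returns 10 documents, 4 of
which are among 8 relevant ones, would have a precision of
\mbox{$4/10 = 0.4$} and a recall of \mbox{$4/8 = 0.5$}.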
42: 
43: In other words, the precision/recall-based evaluation method regards
44: all the relevant documents as equally important or informative for the
45: user, and thus highly values systems that retrieve as many relevant
46: documents as possible, with little noise.
47: 
However, in the real world, where a number of IR systems are
available (for example, on the World Wide Web), it is often the case
that the user has already read some of the relevant documents using other
systems. Thus, systems that always retrieve relevant documents similar
to those retrieved by ubiquitous systems have little practical
utility. In addition, meta search systems, which integrate document
sets retrieved by more than one system, are less effective when the
individual systems retrieve similar documents.
56: 
57: In view of these problems, our proposed IR evaluation method favors
58: systems that retrieve more {\em novel\/} documents, that is, relevant
59: documents which cannot be retrieved by other existing systems.
60: 
61: From a different perspective, our evaluation method is also effective
62: in producing test collections. The pooling
63: method~\cite{voorhees:sigir-98}, which has commonly been used to
64: produce test collections, requires a variety of participating systems.
However, when most participating systems adopt similar
techniques, it is not feasible to collect a sufficient ``pool'' (i.e.,
a set of candidates for relevant documents).  Our evaluation method is
expected to promote the development of IR systems based on diverse
concepts, and thereby resolve the above problem.
70: 
71: Section~\ref{sec:measure} formalizes the evaluation measure based on
72: the novelty of documents, and Section~\ref{sec:case_study} applies
73: this measure to evaluate IR systems that participated in the IREX
74: workshop~\cite{sekine:irex-99}.
75: 
76: \section{Formalizing the Measure}
77: \label{sec:measure}
78: 
Instead of the notions of precision and recall, we propose as a new
80: evaluation measure the utility of system $x$ with respect to relevant
81: document $d$, \mbox{$U_{d}(x)$}. This measure denotes the extent to
82: which $x$ contributes to providing the user with $d$, for a given
83: query.  Note that in this paper, $d$ generally refers to a {\em
84: relevant\/} document.
85: 
From an information-theoretic point of view, we calculate
\mbox{$U_{d}(x)$} as the logarithm of the ratio of the probability
that the user reads document $d$ by using system $x$,
\mbox{$P(D=d|S=x)$}, to the probability that the user reads $d$ by
using another system (i.e., even without using $x$), \mbox{$P(D=d)$},
as shown in Equation~\eq{eq:udx}.
92: \begin{equation}
93:   \label{eq:udx}
94:   U_{d}(x) = \log\frac{\textstyle P(D=d|S=x)}{\textstyle P(D=d)}
95: \end{equation}
96: In the case where system $x$ adopts a ubiquitous retrieval technique,
97: the value of \mbox{$P(D=d|S=x)$} becomes similar to that of
98: \mbox{$P(D=d)$}, and thus the utility of $x$ becomes small.  On the
other hand, the utility of $x$ becomes greater as the number of {\em
novel\/} relevant documents provided by $x$ increases.
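
To illustrate Equation~\eq{eq:udx} with purely hypothetical values,
suppose that \mbox{$P(D=d|S=x) = 0.8$} and \mbox{$P(D=d) = 0.2$}, and
take the base of the logarithm to be 2 (an assumption made only for
this example, since the base is left open above). Then
\[
  U_{d}(x) = \log_{2}\frac{\textstyle 0.8}{\textstyle 0.2}
           = \log_{2} 4 = 2
\]
If instead \mbox{$P(D=d|S=x) = 0.2$}, equal to \mbox{$P(D=d)$}, the
utility would be \mbox{$\log_{2} 1 = 0$}, and for
\mbox{$P(D=d|S=x) = 0.1$} it would be \mbox{$\log_{2} 0.5 = -1$}. In
other words, under these assumed values the utility is positive only
when $x$ makes $d$ more accessible than the existing systems do.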
101: 
We then calculate the {\em total\/} utility of $x$, $U(x)$, by summing
\mbox{$U_{d}(x)$} over all the relevant documents for the query, as shown
in Equation~\eq{eq:ux}.
105: \begin{equation}
106:   \label{eq:ux}
107:   U(x) = \sum_{d} U_{d}(x)
108: \end{equation}
109: To sum up, our evaluation method favors systems with greater
110: \mbox{$U(x)$}.
111: 
In Equation~\eq{eq:udx}, \mbox{$P(D=d)$} is the sum of
\mbox{$P(D=d|S=y)$} over the existing systems $y$, weighted by the
probability that the user utilizes system $y$, \mbox{$P(S=y)$}.  Thus,
given the set $E$ of existing systems excluding $x$, we calculate
\mbox{$P(D=d)$} as in Equation~\eq{eq:pd}.
117: \begin{eqnarray}
118:   \label{eq:pd}
119:   \begin{array}{lll}
120:     P(D=d) & = & {\displaystyle \sum_{y\in E}P(D=d|S=y)\cdot P(S=y)} \\
121:     \noalign{\vskip 2ex}
122:     & \approx & {\displaystyle \sum_{y\in
123:     E}P(D=d|S=y)\cdot\frac{\textstyle 1}{\textstyle |E|}}
124:   \end{array}
125: \end{eqnarray}
Here, note that we assume \mbox{$P(S=y)$} to be uniform over the systems in $E$.
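
As a small worked example of Equation~\eq{eq:pd}, consider a
hypothetical set $E$ of three existing systems for which
\mbox{$P(D=d|S=y)$} takes the values 0.9, 0.5 and 0.1 (these numbers
are invented for illustration only). Under the uniformity assumption,
\[
  P(D=d) \approx \frac{\textstyle 0.9 + 0.5 + 0.1}{\textstyle 3} = 0.5
\]
so that $d$ would be regarded as moderately accessible through the
existing systems, even though one of them ranks it very low.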
127: 
Finally, the crucial issue is how to estimate
\mbox{$P(D=d|S=x)$}, i.e., the probability that the user reads
document $d$ by using system $x$. It can safely be assumed that the
user always reads the top document, $d_1$, and thus $P(D=d_{1}|S=x)$
is always 1. However, the probability that the user reads the remaining
documents decreases as their ranking becomes lower.
134: 
Given $N$ documents sorted according to their relevance degree, in
descending order, the user can choose a threshold for the ranking
(i.e., the rank down to which he or she continues to read) out of $N$
choices, which we assume to be equally likely. Consequently, documents
ranked lower than the threshold will be discarded.
140: 
141: In other words, we can calculate \mbox{$P(D=d|S=x)$} as the
142: probability that the user chooses a threshold equal to or greater than
143: the ranking of $d$, as in Equation~\eq{eq:pdx}.
144: \begin{equation}
145:   \label{eq:pdx}
146:   \begin{array}{lll}
147:     P(D=d|S=x) & = & {\displaystyle \sum_{i = r_{x,d}}^{N}
148:     \frac{\textstyle 1}{\textstyle N}} \\
149:     \noalign{\vskip 2ex}
150:     & = & \frac{\textstyle N - r_{x,d} + 1}{\textstyle N}
151:   \end{array}
152: \end{equation}
153: Here, $r_{x,d}$ is the ranking of document $d$ determined by system
154: $x$.
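
For concreteness, Equation~\eq{eq:pdx} can be evaluated for a few
ranks, taking \mbox{$N = 300$} as a hypothetical value chosen only for
this example:
\[
  \begin{array}{lll}
    r_{x,d} = 1 & : & {\displaystyle P(D=d|S=x) =
      \frac{\textstyle 300 - 1 + 1}{\textstyle 300} = 1} \\
    \noalign{\vskip 2ex}
    r_{x,d} = 151 & : & {\displaystyle P(D=d|S=x) =
      \frac{\textstyle 300 - 151 + 1}{\textstyle 300} = 0.5} \\
    \noalign{\vskip 2ex}
    r_{x,d} = 300 & : & {\displaystyle P(D=d|S=x) =
      \frac{\textstyle 300 - 300 + 1}{\textstyle 300} \approx 0.0033}
  \end{array}
\]
The estimate thus decreases linearly with the rank, from 1 for the top
document down to \mbox{$1/N$} for the last one.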
155: 
156: \section{A Case Study using the IREX Collection}
157: \label{sec:case_study}
158: 
Our concern in this section is to investigate the characteristics of
our evaluation method. For this purpose, we targeted the IR systems
that participated in the IREX workshop~\cite{sekine:irex-99}, and
compared the result obtained with our newly proposed evaluation method
with that based on precision/recall. We also investigated the reasons
behind the differences between those two results, if any.
165: 
166: \subsection{Overview of the IREX Collection}
167: \label{subsec:irex}
168: 
169: The IREX collection was produced through the IREX
170: workshop~\cite{sekine:irex-99}, which consists of TREC-style IR and
171: MUC-style named entity (NE) tasks for Japanese.\footnote{{\tt
172: http://cs.nyu.edu/cs/projects/proteus/irex/\\index-e.html}} Hereafter,
173: the IREX collection/workshop refers solely to that related to the IR
174: task.
175: 
The IREX collection consists of 30 queries, 211,853 articles collected
from two years' worth of ``Mainichi Shimbun'' newspaper
articles~\cite{mainichi:94-95},\footnote{Practically speaking, the
IREX collection provides only article IDs, which correspond to
articles in the Mainichi Shimbun newspaper CD-ROM'94-'95. Participants
must obtain a copy of the CD-ROMs themselves.} relevance assessments for
each query, the retrieval results of 22 participating systems, and
technical details of each system.
184: 
185: Each query consists of the ID, description and narrative.  While
186: descriptions are usually phrases to briefly express the topic,
187: narratives consist of several sentences and synonyms associated with
188: the topic. Figure~\ref{fig:query} shows an example query in the SGML
189: form (translated into English by one of the organizers of the IREX
190: workshop).
191: 
192: \begin{figure}[htbp]
193:   \begin{center}
194:     \leavevmode
195:     \small
196:     \begin{quote}
197:       \tt
198:       <TOPIC> \\
199:       <TOPIC-ID>1001</TOPIC-ID> \\
200:       <DESCRIPTION>Corporate merging</DESCRIPTION> \\
201:       <NARRATIVE>The article describes a corporate merging and in the
202:       article, the name of companies have to be identifiable. Information
203:       including the field and the purpose of the merging have to be
204:       identifiable. Corporate merging includes corporate acquisition,
205:       corporate unifications and corporate buying.</NARRATIVE> \\
206:       </TOPIC>
207:     \end{quote}
208:     \caption{An example query in the IREX collection.}
209:     \label{fig:query}
210:   \end{center}
211: \end{figure}
212: 
213: Relevance assessment was performed based on the pooling
214: method~\cite{voorhees:sigir-98}. That is, candidates for relevant
215: documents were first pooled using the 22 participating systems.
216: Thereafter, for each candidate document, human experts assigned one of
217: three ranks of relevance, i.e., ``relevant'', ``partially relevant''
and ``irrelevant''.  The average number of documents pooled for each
query is 2,105, of which on average 68 are relevant and 116 are
partially relevant.
221: 
Each retrieval result consists of the top 300 articles submitted in
the same form as used in TREC.\footnote{{\tt
http://trec.nist.gov/pubs.html}} For each of the 22 results, the TREC
225: evaluation software was used to investigate the performance (e.g.,
226: non-interpolated average precision).  Figure~\ref{fig:trec} shows a
227: fragment of the retrieval result obtained with one of the
228: participating systems, which consists of the query ID, dummy field,
229: article ID, ranking of the article, relevance degree computed by the
230: system, and system ID.
231: 
232: \begin{figure}[htbp]
233:   \begin{center}
234:     \leavevmode
235:     \small
236:     \tt
237:     \begin{tabular}{llllll}
238:       1007 & 0 & 940228106 & 1 & 0.306856 & 1106 \\
239:       1007 & 0 & 940110130 & 2 & 0.246505 & 1106 \\
240:       1007 & 0 & 950106119 & 3 & 0.237173 & 1106 \\
241:       1007 & 0 & 940131126 & 4 & 0.236115 & 1106 \\
242:       1007 & 0 & 940614009 & 5 & 0.223313 & 1106 \\
243:       1007 & 0 & 940614002 & 6 & 0.222998 & 1106 \\
244:       1007 & 0 & 941107114 & 7 & 0.217324 & 1106 \\
245:       1007 & 0 & 940428222 & 8 & 0.215979 & 1106
246:     \end{tabular}
247:     \caption{A fragment of the retrieval result of system ``1106''.}
248:     \label{fig:trec}
249:   \end{center}
250: \end{figure}
251: 
252: \begin{table*}[htbp]
253:   \tabcolsep=3pt
254:   \begin{center}
255:     \leavevmode
256:     \small
257:     \begin{tabular}{ll} \hline\hline
258:       {\hfill\centering Question\hfill} &
259:       {\hfill\centering Answers\hfill} \\ \hline
260:       query information used & only description (8), 
261:       description+narrative (14) \\
262:       indexing method & word (9), n-gram (3), word+character (2),
263:       character (1), syntactic phrase (1), \\
264:       & statistical phrase (1) \\
265:       proper noun identification & yes (5) \\
266:       query expansion & local feedback (2), use of a thesaurus (2) \\
267:       retrieval method & vector space model (13), probabilistic model
268:       (4), latent semantic indexing (1) \\
269:       \hline
270:     \end{tabular}
271:     \caption{A fragment of the result of the IREX questionnaire.}
272:     \label{tab:spec}
273:   \end{center}
274: \end{table*}
275: 
276: It should be noted that using relevance assessment and retrieval
277: results for each system, we can easily calculate \mbox{$P(D=d|S=x)$}
278: in Equation~\eq{eq:pdx}, which is the central issue in estimating our
279: evaluation measure.
280: 
281: Technical details of participating systems were collected from
282: questionnaires answered by each participant, where questions ranged
from retrieval algorithms used to execution time. Although several
questions are relatively vague, a number of questions are effective for
characterizing each system.
286: 
Table~\ref{tab:spec} shows representative questions related to
retrieval accuracy. In this table, the number of answers is indicated
in parentheses. However, answers classified as ``no'', ``unknown'' and
``etc.'' are not shown. Roughly speaking, most systems adopted
word-based indexing and the vector space model combined with
TF$\cdot$IDF term weighting.
293: 
On the other hand, note that in the IREX workshop, the correspondence
between system IDs and participants is not available to the
public. Additionally, several participants neither gave oral
presentations nor contributed papers to the proceedings. Consequently,
for some systems it is difficult to obtain sufficient technical details.
299: 
300: For example, although most participants answered ``TF$\cdot$IDF'' for
301: the question about term weighting method, it is not possible to
302: identify the exact formula used, out of a number of
303: variants~\cite{salton:ipm-88,zobel:sigir-forum-98}, for several
304: systems.
305: 
306: \subsection{Experimentation}
307: \label{subsec:experiment}
308: 
309: As explained in Section~\ref{subsec:irex}, the 22 IREX participating
310: systems have already been ranked based on the conventional
311: precision/recall, using the TREC evaluation software.
312: 
313: Thus, we re-evaluated the 22 systems based on our evaluation method,
314: and compared results derived from different evaluation methods. To put
315: it more precisely, we conducted 22 trials in each of which a different
316: system was under evaluation and the rest were regarded as existing
317: systems. That is, the former and latter correspond to $x$ and $E$ in
318: Section~\ref{sec:measure}, respectively.
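
To make the setting of each trial explicit, the following sketch
substitutes Equations~\eq{eq:pd} and~\eq{eq:pdx} into
Equation~\eq{eq:udx} with \mbox{$|E| = 21$}, i.e., the 22 systems
minus the one under evaluation. It assumes that $d$ is assigned a rank
by $x$ and by every system in $E$; how a system that does not return
$d$ at all should be treated is not covered by this sketch.
\[
  \begin{array}{lll}
    U_{d}(x) & = & {\displaystyle \log
      \frac{\textstyle (N - r_{x,d} + 1)/N}
           {\textstyle \frac{1}{21}\sum_{y\in E}(N - r_{y,d} + 1)/N}} \\
    \noalign{\vskip 2ex}
    & = & {\displaystyle \log
      \frac{\textstyle 21\,(N - r_{x,d} + 1)}
           {\textstyle \sum_{y\in E}(N - r_{y,d} + 1)}}
  \end{array}
\]
Under this assumption $N$ cancels, and the utility depends only on how
the rank given by $x$ compares with the ranks given by the other 21
systems.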
319: 
320: Note that in this evaluation, we did not regard ``partially relevant''
321: documents as relevant ones, because interpretation of ``partially
322: relevant'' is not fully clear to the authors.
323: 
Table~\ref{tab:all_A} compares rankings obtained based on
non-interpolated average precision and the utility factor we proposed
in this paper. In this table, the difference is the ranking based on
average precision minus the ranking based on utility, so that a
positive value indicates an improvement under our method.
Table~\ref{tab:qbq_A} compares rankings obtained with the
two evaluation methods on a query-by-query basis, where we show solely
the difference in rankings for readability. Since every query ID in the
IREX collection consists of four digits starting with
``10'', we show only the remaining two digits in
Table~\ref{tab:qbq_A}.
332: 
333: \begin{table}[htbp]
334:   \begin{center}
335:     \leavevmode
336:     \small
337:     \begin{tabular}{cccc} \hline\hline
338:       System ID &
339:       {\hfill\centering Avg. Precision\hfill} &
340:       {\hfill\centering Utility\hfill} &
341:       {\hfill\centering Difference\hfill} \\ \hline
342:       1144b & 2 & 1 & +1 \\
343:       1135a & 3 & 2 & +1 \\
344:       1144a & 1 & 3 & -2 \\
345:       1135b & 4 & 4 & 0 \\
346:       1103b & 5 & 5 & 0 \\
347:       1106 & 17 & 6 & +11 \\
348:       1145b & 16 & 7 & +9 \\
349:       1122b & 7 & 8 & -1 \\
350:       1103a & 10 & 9 & +1 \\
351:       1128b & 9 & 10 & -1 \\
352:       1142 & 6 & 11 & -5 \\
353:       1122a & 8 & 12 & -4 \\
354:       1110 & 11 & 13 & -2 \\
355:       1133a & 19 & 14 & +5 \\
356:       1133b & 18 & 15 & +3 \\
357:       1128a & 12 & 16 & -4 \\
358:       1120 & 14 & 17 & -3 \\
359:       1145a & 13 & 18 & -5 \\
360:       1112 & 15 & 19 & -4 \\
361:       1146 & 20 & 20 & 0 \\
362:       1132 & 22 & 21 & +1 \\
363:       1126 & 21 & 22 & -1 \\
364:       \hline
365:     \end{tabular}
366:     \caption{Comparison of rankings obtained based on
367:     non-interpolated average precision and utility factor.}
368:     \label{tab:all_A}
369:   \end{center}
370: \end{table}
371: 
372: \begin{table*}[htbp]
373:   \tabcolsep=3pt
374:   \begin{center}
375:     \leavevmode
376:     \scriptsize
377:     \begin{tabular}{lrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr} \hline\hline
378:       & \multicolumn{30}{c}{Query ID} \\ \cline{2-31}
379:       System ID & 07 & 08 & 09 & 10 & 11 & 12 & 13 & 14 & 15 & 16 & 17 &
380:       18 & 19 & 20 & 21 & 22 & 23 & 24 & 25 & 26 & 27 & 28 & 29 & 30 &
381:       31 & 32 & 33 & 34 & 35 & 36 \\ \hline
382:       ~~~1103a & 8 & -7 & 14 & 0 & 8 & 3 & 3 & -14 & 1 & 13 & 5 & -3 & 0
383:       & -4 & -2 & 3 & -6 & -3 & 6 & 1 & -2 & 13 & 2 & 14 & -3 & -5 &
384:       -7 & -2 & -3 & 3 \\
385:       ~~~1103b & -2 & -5 & 6 & 4 & -1 & -3 & -6 & -9 & 4 & -5 & -1 & 1 &
386:       -3 & -2 & -1 & 8 & 0 & -2 & 1 & -2 & -1 & 7 & 1 & -3 & -5 & -1 &
387:       -6 & -3 & -2 & 5 \\
388:       ~~~1106 & 8 & -4 & -9 & -2 & 9 & -2 & 7 & 11 & 5 & -1 & -2 & -4 & 5
389:       & 4 & 0 & -3 & -3 & 2 & 0 & 0 & -1 & -1 & 1 & 2 & 1 & 2 & 0 & 2
390:       & 17 & 0 \\
391:       ~~~1110 & 6 & -1 & -4 & 4 & -1 & 9 & -4 & -10 & -1 & 0 & 4 & -2 &
392:       -5 & -1 & 0 & 3 & 0 & -2 & -1 & 0 & 0 & 16 & 13 & -1 & -3 & -3 &
393:       8 & 1 & 3 & -2 \\
394:       ~~~1112 & -2 & -5 & 0 & 0 & -5 & 3 & -3 & 1 & -11 & 0 & 5 & -5 & 12
395:       & -2 & -1 & 5 & -3 & -4 & -3 & -1 & -1 & -4 & -6 & -4 & 3 & 1 &
396:       -4 & -2 & 0 & 0 \\
397:       ~~~1120 & 1 & -2 & -2 & -1 & 0 & -3 & 4 & -8 & -1 & 0 & 5 & -2 & 7
398:       & 1 & 0 & 5 & 0 & 2 & 0 & 2 & 0 & -3 & -1 & -1 & 2 & 2 & 6 & 5 &
399:       -1 & 0 \\
400:       ~~~1122a & -2 & 2 & -2 & -7 & -5 & 5 & -5 & -11 & -1 & -5 & 1 & 8 &
401:       -1 & -6 & -2 & -8 & 1 & 1 & 0 & -1 & 4 & -4 & 1 & -1 & -3 & -1 &
402:       3 & -2 & -3 & -1 \\
403:       ~~~1122b & -5 & 0 & -8 & 1 & 0 & -8 & 1 & -5 & -9 & -5 & 0 & -2 &
404:       -3 & -6 & 1 & -4 & 4 & 0 & -2 & 1 & 7 & -3 & -2 & -4 & -4 & 0 &
405:       6 & 0 & -1 & -2 \\
406:       ~~~1126 & 0 & 4 & -10 & 0 & 0 & -2 & 0 & 3 & -1 & -1 & -1 & 1 & -1
407:       & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & -2 & -3 & 0 & 0 & -3 & -1
408:       & 0 & 0 \\
409:       ~~~1128a & -1 & -1 & 4 & -2 & -3 & 0 & 3 & -6 & -8 & -1 & -3 & 4 &
410:       2 & 9 & 1 & -13 & 0 & 6 & 2 & -1 & 0 & -2 & 1 & 0 & -1 & 1 & 4 &
411:       -4 & 0 & 4 \\
412:       ~~~1128b & -2 & 14 & -4 & -4 & -7 & -5 & 11 & 9 & -2 & -2 & -5 & 4
413:       & -1 & 3 & -2 & -13 & -1 & 1 & 2 & 2 & 0 & 1 & 0 & -5 & 1 & -1 &
414:       0 & -4 & 0 & -1 \\
415:       ~~~1132 & 0 & 16 & -9 & 2 & 0 & 0 & 0 & 12 & 21 & 0 & 0 & 10 & 0 &
416:       8 & 15 & 0 & -4 & 0 & 0 & 0 & 0 & 0 & 2 & 0 & 0 & -1 & 0 & 13 &
417:       0 & 0 \\
418:       ~~~1133a & -2 & -2 & -4 & 0 & 3 & 2 & 3 & 15 & 11 & 1 & -5 & -1 & 1
419:       & 7 & -1 & 3 & 4 & 1 & 4 & 1 & 0 & -2 & -1 & 1 & 4 & 7 & -1 & 0
420:       & 0 & 1 \\
421:       ~~~1133b & -3 & -2 & -4 & 2 & 3 & 1 & 11 & 15 & 3 & 0 & -4 & 2 & 0
422:       & 5 & 1 & 6 & 5 & 0 & 3 & 1 & 0 & -3 & -5 & -1 & 10 & 3 & -2 &
423:       -2 & 1 & -1 \\
424:       ~~~1135a & -1 & -2 & 9 & -2 & 4 & -11 & -6 & 4 & 9 & 2 & -6 & -4 &
425:       -1 & -1 & -1 & -2 & -3 & -1 & -1 & -1 & 0 & -2 & -2 & 0 & 1 & -1
426:       & -1 & 0 & -1 & -3 \\
427:       ~~~1135b & 2 & 0 & 6 & -1 & -12 & -13 & -6 & 1 & 2 & 0 & -3 & 1 &
428:       -5 & -6 & -3 & -1 & -3 & -2 & 0 & -1 & -4 & -7 & -2 & 0 & 0 & -2
429:       & -1 & -7 & -2 & 0 \\
430:       ~~~1142 & -4 & -1 & 10 & 0 & -5 & -1 & -7 & -14 & -7 & -3 & -2 & -3
431:       & -4 & -7 & -5 & -2 & 4 & -3 & -3 & -1 & -2 & -2 & -2 & -5 & 2 &
432:       -6 & -7 & -6 & -1 & -4 \\
433:       ~~~1144a & -2 & -1 & -1 & 3 & -1 & 5 & -16 & -9 & -3 & 5 & 1 & -6 &
434:       -1 & -2 & 0 & 6 & -1 & -2 & -2 & -3 & 0 & 0 & -2 & -1 & 0 & -4 &
435:       7 & 2 & -1 & -1 \\
436:       ~~~1144b & -2 & 3 & -1 & 2 & -2 & 5 & -16 & -5 & -2 & 5 & 2 & -5 &
437:       2 & -2 & 1 & 5 & -3 & 1 & 1 & -1 & 0 & 0 & -5 & -2 & 0 & 1 & 4 &
438:       2 & -1 & 2 \\
439:       ~~~1145a & 0 & -4 & -7 & -4 & -5 & -1 & 5 & 11 & -2 & -1 & -1 & -3
440:       & -1 & -1 & -1 & 1 & 8 & -3 & -5 & 5 & -1 & -4 & 5 & 6 & -2 & 2
441:       & -4 & -3 & 1 & -3 \\
442:       ~~~1145b & 3 & -3 & -5 & 5 & 13 & 7 & 12 & 13 & -5 & -1 & -2 & 8 &
443:       -3 & 4 & 0 & 2 & 1 & 1 & -2 & 0 & -1 & 0 & 5 & 6 & -2 & 7 & 0 &
444:       13 & -5 & 0 \\
445:       ~~~1146 & 0 & 1 & 21 & 0 & 7 & 9 & 9 & -4 & -3 & -1 & 12 & 1 & 0 &
446:       -1 & 0 & -1 & 0 & 7 & 0 & -2 & 1 & 0 & -1 & 2 & -1 & -1 & -2 &
447:       -2 & -1 & 3 \\
448:       \hline
449:     \end{tabular}
450:     \caption{Query-by-query comparison of rankings obtained based on
451:     non-interpolated average precision and utility factor.}
452:     \label{tab:qbq_A}
453:   \end{center}
454: \end{table*}
455: 
456: \subsection{Discussion}
457: \label{subsec:discussion}
458: 
Looking at Table~\ref{tab:all_A}, one may notice that the rankings of
systems ``1106'', ``1145b'', ``1133a'' and ``1133b'' were
significantly improved under our evaluation method. Thus, we
investigated properties that characterize each of those four systems,
in comparison with other systems.
464: 
First, we found that ``1106'' adopted a relatively simple
implementation, while most systems used more elaborate ones. To put it
more precisely, morphological analysis was performed, and nouns/verbs
were extracted for word-based indexing. For term weighting, a
TF$\cdot$IDF formula as in Equation~\eq{eq:tf_idf} was used, while
most systems used different methods, such as the logarithmic TF
formulation in Equation~\eq{eq:log_tf_idf} and the one proposed by
Robertson and Walker~\shortcite{robertson:sigir-94}.
473: \begin{equation}
474:   \label{eq:tf_idf}
  f_{t,d}\cdot\log\frac{\textstyle N}{\textstyle n_{t}}
476: \end{equation}
477: \begin{equation}
478:   \label{eq:log_tf_idf}
479:   (1 + \log f_{t,d})\cdot\log\frac{\textstyle N}{\textstyle n_{t}}
480: \end{equation}
Here, $f_{t,d}$ denotes the frequency with which term $t$ appears in
document $d$, and $n_{t}$ denotes the number of documents containing
term $t$. $N$ is the total number of documents in the collection
(not the number of ranked documents as in Equation~\eq{eq:pdx}).
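
For a rough sense of how the two formulas differ, consider a
hypothetical term with \mbox{$f_{t,d} = 10$} and assume natural
logarithms (the base is an assumption made only for this example).
Since the factor \mbox{$\log (N/n_{t})$} is common to
Equations~\eq{eq:tf_idf} and~\eq{eq:log_tf_idf}, the ratio of the two
weights is
\[
  \frac{\textstyle f_{t,d}}{\textstyle 1 + \ln f_{t,d}}
  = \frac{\textstyle 10}{\textstyle 1 + \ln 10}
  \approx \frac{\textstyle 10}{\textstyle 3.303}
  \approx 3.03
\]
That is, the raw-frequency formula would weight this term about three
times as heavily as the logarithmic one, whereas for
\mbox{$f_{t,d} = 1$} the two formulas coincide because
\mbox{$1 + \ln 1 = 1$}.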
484: 
Second, ``1145b'' performed query expansion~\cite{qiu:sigir-93},
whereas only a few systems used query expansion (e.g., one based on a
thesaurus). In addition, a term weighting method based on mutual
information between two terms was introduced. A possible rationale
behind this method is that two terms that frequently co-occur are
effective in characterizing the domain of documents, and are thus
assigned greater term weights.
492: 
Third, ``1133a'' and ``1133b'' also used domain knowledge for term
weighting. However, unlike the case of ``1145b'', they regarded pages
of news articles as domains. In practice, a greater weight is assigned
to terms whose distribution varies more strongly depending on the
page, because they are expected to characterize the domain. On the
other hand, terms that commonly appear on more pages are assigned a
lesser weight.
500: 
501: To sum up, our novelty-based evaluation revealed the effectiveness of
502: those properties above, specifically term weighting methods introduced
503: in ``1145b'', ``1133a'' and ``1133b'', which were overshadowed or
504: underestimated within the precision/recall-based evaluation.
505: 
We devote a little space to considering Table~\ref{tab:qbq_A} for further
investigation. We arbitrarily regarded improvements above seven as
significant, and focused solely on systems with relatively many
significant improvements, that is, ``1103a'' and ``1132''. Although
``1145b'' is associated with the same number of significant
improvements as ``1132'', we have already discussed system ``1145b''
above.
513: 
We found that ``1103a'' is one of the five systems that perform proper
noun identification, and that five of the six queries for which ``1103a''
achieved significant improvements are directly or indirectly
associated with proper nouns.
518: 
519: Samples of query descriptions directly and indirectly related to
520: proper nouns include ``1016: Nick Price (a golfer)'' and ``1011:
521: arrest of suspects of robbery in the {\it Kanto\/} region'',
522: respectively. Note that in the latter (indirect) case, Japanese
523: prefectures within the ``{\it Kanto\/}'' region, which are not
524: explicitly described in the query (e.g., ``{\it Tokyo\/}'' and ``{\it
525: Kanagawa\/}''), must be identified in news articles.
526: 
Finally, ``1132'' is the only system that used Latent Semantic
Indexing (LSI), which is an extension of the vector space model, so as
to retrieve relevant documents that share no terms with a given
query. Although, as shown in Table~\ref{tab:all_A}, ``1132'' had the
lowest ranking in terms of average precision, our evaluation
method indicated that in many cases (queries) an LSI-based method is
expected to retrieve relevant documents that other types of methods
fail to retrieve.
535: 
536: \section{Conclusion}
537: \label{sec:conclusion}
538: 
539: Evaluation methods based on precision and recall have long been used
540: in information retrieval (IR) research, where systems that retrieve as
541: many relevant documents as possible are usually highly valued.
542: 
543: However, given the fact that a number of retrieval systems resembling
544: one another are available to the public (not only in laboratories), it
545: is valuable to retrieve relevant documents that can never be retrieved
546: by those existing systems. This notion is also true in various
547: contexts that require a variety of IR systems, such as meta search
548: systems and the pooling method in producing IR test collections.
549: 
550: In consideration of these factors, we proposed a new evaluation method
551: for IR, which favors systems that retrieve more novel documents, i.e.,
552: relevant documents that many systems fail to retrieve. To realize this
553: notion, we estimated the utility of a system in question by comparing
554: the probability that the user reads relevant documents by using the
555: system, and the probability that the user can read those documents
556: even without using the system.
557: 
558: We also applied our evaluation method to the 22 systems that
559: participated in the IREX workshop, and identified several effective
560: techniques that have been underestimated in the conventional
561: precision/recall-based evaluation method.
562: 
563: \bibliographystyle{acl}
564: \begin{thebibliography}{}
565: 
566: \bibitem[\protect\citename{Keen}1992]{keen:ipm-92}
567: E.~Michael Keen.
568: \newblock 1992.
569: \newblock Presenting results of experimental retrieval comparisons.
570: \newblock {\em Information Processing \& Management}, 28(4):491--502.
571: 
572: \bibitem[\protect\citename{{Mainichi Shimbun}}1994 1995]{mainichi:94-95}
573: {Mainichi Shimbun}.
574: \newblock 1994-1995.
575: \newblock Mainichi shimbun {CD-ROM} '94-'95.
576: \newblock (In Japanese).
577: 
578: \bibitem[\protect\citename{Qiu and Frei}1993]{qiu:sigir-93}
579: Y.~Qiu and H.~Frei.
580: \newblock 1993.
581: \newblock Concept based query expansion.
582: \newblock In {\em Proceedings of the 16th Annual International ACM SIGIR
583:   Conference on Research and Development in Information Retrieval}, pages
584:   160--169.
585: 
586: \bibitem[\protect\citename{Robertson and Walker}1994]{robertson:sigir-94}
587: S.~E. Robertson and S.~Walker.
588: \newblock 1994.
589: \newblock Some simple effective approximations to the 2-poisson model for
590:   probabilistic weighted retrieval.
591: \newblock In {\em Proceedings of the 17th Annual International ACM SIGIR
592:   Conference on Research and Development in Information Retrieval}, pages
593:   232--241.
594: 
595: \bibitem[\protect\citename{Salton and Buckley}1988]{salton:ipm-88}
596: Gerard Salton and Christopher Buckley.
597: \newblock 1988.
598: \newblock Term-weighting approaches in automatic text retrieval.
599: \newblock {\em Information Processing \& Management}, 24(5):513--523.
600: 
601: \bibitem[\protect\citename{Salton}1992]{salton:ipm-92}
602: Gerard Salton.
603: \newblock 1992.
604: \newblock The state of retrieval system evaluation.
605: \newblock {\em Information Processing \& Management}, 28(4):441--449.
606: 
607: \bibitem[\protect\citename{Sekine and Isahara}1999]{sekine:irex-99}
608: Satoshi Sekine and Hitoshi Isahara.
609: \newblock 1999.
610: \newblock {IREX} project overview.
611: \newblock In {\em Proceedings of the IREX Workshop}, pages 7--12.
612: 
613: \bibitem[\protect\citename{Voorhees}1998]{voorhees:sigir-98}
614: Ellen~M. Voorhees.
615: \newblock 1998.
616: \newblock Variations in relevance judgments and the measurement of retrieval
617:   effectiveness.
618: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR
619:   Conference on Research and Development in Information Retrieval}, pages
620:   315--323.
621: 
622: \bibitem[\protect\citename{Zobel and Moffat}1998]{zobel:sigir-forum-98}
623: Justin Zobel and Alistair Moffat.
624: \newblock 1998.
625: \newblock Exploring the similarity space.
626: \newblock {\em ACM SIGIR FORUM}, 32(1):18--34.
627: 
628: \end{thebibliography}
629: 
630: \section*{Acknowledgments}
631: 
632: The authors would like to thank organizers and participants of the
633: IREX workshop for their support with the IREX collection.
634: 
635: \end{document}
636: