1: \documentclass[11pt]{article}
2: \usepackage{acl2001,times}
3: \usepackage{epsfig}
4: \setlength\titlebox{6.5cm} %
5:
6: \title{Looking Under the Hood: Tools for Diagnosing Your Question
7: Answering Engine{\footnotesize$^\textrm{1}$}\footnotemark[0]}
8:
9: \author{Eric Breck$^{\dagger}$, Marc Light$^{\dagger}$, Gideon
10: S. Mann$^{\diamondsuit}$, Ellen Riloff$^{\circ}$, \\ {\bf Brianne
11: Brown$^{\ddagger}$, Pranav Anand$^{*}$, Mats Rooth$^{\mp}$,
12: Michael Thelen$^{\circ}$} \\
13: ~ \\
14: \small
$^{\dagger}$ The MITRE Corporation, 202 Burlington Rd., Bedford, MA
16: 01730, \{ebreck,light\}@mitre.org \\
17: \small
18: $^{\diamondsuit}$ Department of Computer Science, Johns Hopkins
19: University, Baltimore, MD 21218, gsm@cs.jhu.edu \\
20: \small
21: $^{\circ}$ School of Computing, University of Utah, Salt Lake City, UT
22: 84112, \{riloff,thelenm\}@cs.utah.edu \\
23: \small
24: $^{\ddagger}$ Bryn Mawr College, Bryn Mawr, PA 19010,
25: bbrown@brynmawr.edu\\
26: \small
27: $^{*}$ Department of Mathematics, Harvard University, Cambridge, MA
28: 02138, anand@fas.harvard.edu \\
29: \small
30: $^{\mp}$ Department of Linguistics, Cornell University, Ithaca, NY
31: 14853, mr249@cornell.edu
32: \normalsize
33: }
34:
35:
36: \date{}
37:
38:
39: \begin{document}
40: \maketitle
41: \begin{abstract}
42:
43: \footnotetext[1]{This paper contains a revised Table 2 replacing the one appearing in the Proceedings of the Workshop on Open-Domain
44: Question Answering, Toulouse, France 2001.}
45: \setcounter{footnote}{1 }
46:
In this paper we analyze two question answering tasks: the TREC-8
48: question answering task and a set of reading comprehension exams.
49: First, we show that Q/A systems perform better when there are
50: multiple answer opportunities per question. Next, we analyze common
51: approaches to two subproblems: term overlap for answer sentence
52: identification, and answer typing for short answer extraction. We
53: present general tools for analyzing the strengths and limitations of
54: techniques for these subproblems. Our results quantify the
55: limitations of both term overlap and answer typing to distinguish
56: between competing answer candidates.
57:
58:
59: \end{abstract}
60:
61: \section{Introduction}
62:
63: When building a system to perform a task, the most important statistic
64: is the performance on an end-to-end evaluation. For the task of
65: open-domain question answering against text collections,
66: there have been two large-scale end-to-end evaluations:
67: \cite{trec8-proceedings} and \cite{trec9-proceedings}. In addition, a
68: number of researchers have built systems to take reading comprehension
69: examinations designed to evaluate children's reading
70: levels \cite{charniak-readcomp,hirschman99,ng2000,riloff-quarc,harper-readcomp}.
71: The performance statistics have
72: been useful for determining how well techniques work.
73:
74: However, raw performance statistics are not enough. If the score is
75: low, we need to understand what went wrong
76: and how to fix it. If the score is high, it is important to
77: understand why. For example, performance may be dependent on
78: characteristics of the current test set and would not carry over to a
79: new domain. It would also be useful to know if there is a particular
80: characteristic of the system that is central. If so, then the system
81: can be streamlined and simplified.
82:
83: In this paper, we explore ways of gaining insight into question
84: answering system performance. First, we analyze the impact of having
85: multiple answer opportunities for a question. We found that TREC-8 Q/A
86: systems performed better on questions that had multiple answer
87: opportunities in the document collection. Second, we present a variety
88: of graphs to visualize and analyze functions for ranking sentences.
The graphs revealed that relative score, rather than absolute score, is
paramount. Third, we introduce bounds on functions that use term
91: overlap\footnote{Throughout the text, we use ``overlap'' to refer to
92: the intersection of sets of words, most often the words in the
93: question and the words in a sentence.} to rank sentences. Fourth,
94: we compute the expected score of a hypothetical Q/A system that
95: correctly identifies the answer type for a question and correctly
96: identifies all entities of that type in answer sentences. We found
97: that a surprising amount of ambiguity remains because sentences often
98: contain multiple entities of the same type.
99:
100:
101: \section{The data}
102:
103:
104: The experiments in Sections~\ref{ansMult}, \ref{graphs}, and
105: \ref{bounds} were performed on two question answering data sets: (1)
106: the TREC-8 Question Answering Track data set and (2) the CBC reading
107: comprehension data set. We will briefly describe each of these data
108: sets and their corresponding tasks.
109:
110:
111: The task of the TREC-8 Question Answering track was to find the answer
112: to 198 questions using a document collection consisting of roughly
113: 500,000 newswire documents. For each question, systems were allowed
114: to return a ranked list of 5 short (either 50-character or
115: 250-character) responses. As a service to track participants, AT\&T
116: provided top documents returned by their retrieval engine for each of
117: the TREC questions. Sections~\ref{graphs} and \ref{bounds} present
118: analyses that use all sentences in the top 10 of these documents.
Each sentence is classified as correct or incorrect automatically.
This automatic classification judges a sentence to be correct if it
contains at least half of the stemmed content words in the answer
key. We have compared this automatic evaluation to the judgments of
the TREC-8 QA track assessors and found it to agree 93--95\% of the time
\cite{breck2000}.
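
Stated as a formula, with notation introduced here only for
illustration: let $K$ be the set of stemmed content words in the answer
key for a question and let $W_s$ be the set of stemmed words in a
sentence $s$. The automatic judgment described above is then
\begin{displaymath}
s \mbox{ is correct} \iff \frac{|W_s \cap K|}{|K|} \geq \frac{1}{2}
\end{displaymath}
This sketch treats the answer key as a single word set; how a key with
several alternative answers is handled is not specified here.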
125:
126:
127:
128: The CBC data set was created for the Johns Hopkins Summer 2000
129: Workshop on Reading Comprehension. Texts were collected from the
130: Canadian Broadcasting Corporation web page for kids
131: (http://cbc4kids.ca/). They are an average of 24 sentences long. The
132: stories were adapted from newswire texts to be appropriate for
133: adolescent children, and most fall into the following domains:
134: politics, health, education, science, human interest, disaster,
135: sports, business, crime, war, entertainment, and environment. For
each CBC story, 8--12 questions and an answer key were
137: generated.\footnote{This work was performed by Lisa Ferro and Tim
138: Bevins of the MITRE Corporation. Dr. Ferro has professional
139: experience writing questions for reading comprehension exams and led
the question writing effort.} We used a 650-question subset of the
data and the corresponding 75 stories. The answer candidates for
142: each question in this data set were all sentences in the document.
143: The sentences were scored against the answer key by the automatic
144: method described previously.
145:
146: \section{Analyzing the number of answer opportunities per question}
147: \label{ansMult}
148:
149: In this section we explore the impact of multiple answer opportunities
150: on end-to-end system performance. A question may have multiple
answers for two reasons: (1) there may be more than one distinct answer
to the question, and (2) there may be multiple instances of each
153: answer. For example, {\em ``What does the Peugeot company
154: manufacture?''} can be answered by {\em trucks}, {\em cars}, or {\em
155: motors} and each of these answers may occur in many sentences that
156: provide enough context to answer the question. The table insert in
157: Figure~\ref{cbc-histograms} shows that, on average, there are 7 answer
158: occurrences per question in the TREC-8 collection.\footnote{We would
159: like to thank John Burger and John Aberdeen for help preparing
160: Figure~\ref{cbc-histograms}.} In contrast, there are only 1.25 answer
161: occurrences in a CBC document. The number of answer occurrences
162: varies widely, as illustrated by the standard deviations. The median
163: shows an answer frequency of 3 for TREC and 1 for CBC, which perhaps
164: gives a more realistic sense of the degree of answer frequency for
165: most questions.
166:
167: \begin{figure}[htbp]
168: \centering
169: \epsfig{figure=figs/ansMultBar.eps,height=2in,width=3.1in}
170: \caption{Frequency of answers in the TREC-8 (black bars) and CBC
171: (white bars) data sets}
172: \label{cbc-histograms}
173: \end{figure}
174:
175: To gather this data we manually reviewed 50 randomly chosen TREC-8
176: questions and identified all answers to these questions in our text
177: collection. We defined an ``answer'' as a text fragment that contains
178: the answer string in a context sufficient to answer the question.
179: Figure~\ref{cbc-histograms} shows the resulting graph. The $x$-axis
180: displays the number of answer occurrences found in the text
181: collection per question and the $y$-axis shows the percentage of
182: questions that had $x$ answers. For example, 26\%
183: of the TREC-8 questions had
184: only 1 answer occurrence, and 20\%
185: of the TREC-8 questions had exactly 2 answer occurrences (the black
186: bars). The most prolific question had 67 answer occurrences (the
187: Peugeot example mentioned above).
188: Figure~\ref{cbc-histograms} also shows the analysis of 219 CBC
189: questions. In contrast, 80\%
190: of the CBC questions had only 1 answer
191: occurrence in the targeted document, and 16\%
192: had exactly 2 answer occurrences.
193:
194: \begin{figure}[htbp]
195: \centering
196: \epsfig{figure=figs/dupVsCorr.eps,height=2in,width=3.1in}
197: \caption{Answer repetition vs. system response correctness for TREC-8}
198: \label{scatter}
199: \end{figure}
200:
201: Figure~\ref{scatter} shows the effect that multiple answer
202: opportunities had on the performance of TREC-8 systems. Each solid
203: dot in the scatter plot represents one of the 50 questions we
204: examined.\footnote{We would like to thank Lynette Hirschman for
205: suggesting the analysis behind Figure~\ref{scatter} and John Burger
206: for help with the analysis and presentation.} The $x$-axis shows the
207: number of answer opportunities for the question, and the $y$-axis
208: represents the percentage of systems that generated a correct
209: answer\footnote{For this analysis, we say that a system generated a
210: correct answer if a correct answer was in its response set.} for the
question. For example, for the question with 67 answer occurrences,
212: 80\% of the systems produced a correct answer. In
213: contrast, many questions had a single answer occurrence and the
214: percentage of systems that got those correct varied from about 2\% to
215: 60\%.
216:
217: The circles in Figure~\ref{scatter} represent the average percentage
218: of systems that answered questions correctly for all questions with
219: the same number of answer occurrences. For example, on average about
220: 27\% of the systems produced a correct answer for questions that had
221: exactly one answer occurrence, but about 50\% of the systems produced
222: a correct answer for questions with 7 answer opportunities.
223: Overall, a clear pattern emerges: the performance of TREC-8 systems
224: was strongly correlated with the number of answer opportunities
225: present in the document collection.
226:
227: \section{Graphs for analyzing scoring functions of answer candidates}
228: \label{graphs}
229:
230: Most question answering systems generate several answer candidates and
231: rank them by defining a scoring function that maps answer candidates
232: to a range of numbers. In this section, we analyze one particular
233: scoring function: {\em term overlap} between the question and answer
234: candidate. The techniques we use can be easily applied to other
235: scoring functions as well (e.g., weighted term overlap, partial
236: unification of sentence parses, weighted abduction score, etc.). The
237: answer candidates we consider are the sentences from the documents.
238:
239: The expected performance of a system that ranks all sentences using
240: term overlap is 35\% for the TREC-8 data. This number is an expected
241: score because of ties: correct and incorrect candidates may have the
242: same term overlap score. If ties are broken optimally, the best
possible score ({\em maximum}) would be 54\%. If ties are always broken
in the worst possible way, the lowest possible score ({\em minimum}) would
245: be 24\%. The corresponding scores on the CBC data are 58\%
246: expected, 69\% maximum, and 51\% minimum. We would like to
247: understand why the term overlap scoring function works as well as it
248: does and what can be done to improve it.
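
As a sketch of how these three figures relate, assume (this is not
spelled out above) that a question counts as answered correctly exactly
when the single top-ranked sentence is correct. Let $T_q$, a symbol
introduced here only for illustration, be the set of sentences tied for
the highest overlap score on question $q$, and let $C_q$ be the set of
correct sentences for $q$. The contribution of $q$ to each figure is
then
\begin{eqnarray*}
\mbox{expected}(q) & = & \frac{|T_q \cap C_q|}{|T_q|} \\
\mbox{maximum}(q) & = & 1 \mbox{ if } T_q \cap C_q \neq \emptyset, \mbox{ else } 0 \\
\mbox{minimum}(q) & = & 1 \mbox{ if } T_q \subseteq C_q, \mbox{ else } 0
\end{eqnarray*}
and each reported score would be the average of the corresponding
quantity over all questions, with ties broken uniformly at random in
the expected case.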
249:
250: Figures~\ref{camel-overlap-TREC} and \ref{camel-overlap-CBC} compare
251: correct candidates and incorrect candidates with respect to the
252: scoring function. The $x$-axis plots the range of the scoring
253: function, i.e., the amount of overlap. The $y$-axis represents {\bf
254: Pr(overlap=x $\mid$ correct)} and {\bf Pr(overlap=x $\mid$
255: incorrect)}, where separate curves are plotted for correct and
256: incorrect candidates. The probabilities are generated by normalizing
257: the number of correct/incorrect answer candidates with a particular
258: overlap score by the total number of correct/incorrect candidates,
259: respectively.
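
In symbols, writing $n_c(x)$ and $n_i(x)$ (notation assumed here) for
the number of correct and incorrect candidates with overlap $x$, the
normalization is
\begin{eqnarray*}
\Pr(\mbox{overlap}=x \mid \mbox{correct}) & = & \frac{n_c(x)}{\sum_{x'} n_c(x')} \\
\Pr(\mbox{overlap}=x \mid \mbox{incorrect}) & = & \frac{n_i(x)}{\sum_{x'} n_i(x')}
\end{eqnarray*}
Each curve therefore sums to one, so the two curves compare the shapes
of the distributions and not the raw numbers of correct and incorrect
candidates.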
260:
261: \begin{figure}[h]
262: \centerline{\epsfig{figure=figs/out-tcamel-xoverlap-jqaviar-n-i-c1-TREC.eps,height=2.0in}}
263: \caption{Pr(overlap=x$\mid$[in]correct) for TREC-8}
264: \label{camel-overlap-TREC}
265: \vspace*{.2in}
266: \centerline{\epsfig{figure=figs/out-tcamel-xoverlap-jqaviar-n-i-c1-CBC.eps,height=2.0in}}
267: \caption{Pr(overlap=x$\mid$[in]correct) for CBC}
268: \label{camel-overlap-CBC}
269: \end{figure}
270:
Figure~\ref{camel-overlap-TREC} illustrates that the correct
272: candidates for TREC-8 have term overlap scores distributed between 0 and 10 with
273: a peak of 24\% at an overlap of 2. However, the incorrect candidates
274: have a similar distribution between 0 and 8 with a peak of 32\% at an
275: overlap of 0. The similarity of the curves illustrates that it is
276: unclear how to use the score to decide if a candidate is correct or
277: not. Certainly no static threshold above which a candidate is deemed
278: correct will work. Yet the expected score of our TREC term overlap system
279: was 35\%, which is much higher than a random baseline which would get
280: an expected score of less than 3\% because there are over 40 sentences on
281: average in newswire documents.\footnote{We also tried dividing the term overlap
282: score by the length of the question to normalize for query length
283: but did not find that the graph was any more helpful.}
284:
After inspecting some of the data directly, we posited that what
matters for judging a candidate is not its absolute term overlap but
how its overlap score compares to the scores of other candidates. To
288: visualize this, we generated new graphs by plotting the rank of a
289: candidate's score on the $x$-axis. For example, the candidate with the
290: highest score would be ranked first, the candidate with the second
291: highest score would be ranked second, etc.
292: Figures~\ref{camel-overlap-rank-TREC} and \ref{camel-overlap-rank-CBC}
293: show these graphs, which display {\bf Pr(rank=x $\mid$ correct)} and
294: {\bf Pr(rank=x $\mid$ incorrect)} on the $y$-axis. The top-ranked
295: candidate has rank=0.
296:
297: \begin{figure}[h]
298: \centerline{\epsfig{figure=figs/out-tcamel-xoverlap-k-jqaviar-n-i-c1-TREC.eps,height=2.0in}}
299: \caption{Pr(rank=x $\mid$ [in]correct) for TREC-8}
300: \label{camel-overlap-rank-TREC}
301: \vspace*{.2in}
302: \centerline{\epsfig{figure=figs/out-tcamel-xoverlap-k-jqaviar-n-i-c1-CBC.eps,height=2.0in}}
303: \caption{Pr(rank=x $\mid$ [in]correct) for CBC}
304: \label{camel-overlap-rank-CBC}
305: \end{figure}
306:
307: The ranked graphs are more revealing than the graphs of absolute
308: scores: the probability of a high rank is greater for correct answers
309: than incorrect ones. Now we can begin to understand why the term
overlap scoring function worked as well as it did. We see that,
unlike in classification tasks, there is no good threshold for our
scoring function. Instead, relative score is paramount. Systems such
as \cite{ng2000} make explicit use of relative rank in their
algorithms, and now we understand why this is effective.
315:
316:
317: Before we leave the topic of graphing scoring functions, we want to
318: introduce one other view of the data.
319: Figure~\ref{logodds-overlap-TREC} plots term overlap scores on the
320: $x$-axis and the log odds of being correct given a score on the
321: $y$-axis. The log odds formula is:
322: \begin{displaymath}
\log\frac{\Pr(\mbox{correct} \mid \mbox{overlap})}{\Pr(\mbox{incorrect} \mid \mbox{overlap})}
324: \end{displaymath}
325: Intuitively, this graph shows how much more likely a sentence is to be
326: correct versus incorrect given a particular score. A second curve,
327: labeled ``mass,'' plots the number of answer candidates with each
score. Figure~\ref{logodds-overlap-TREC} shows that the log odds of being
correct are negative until an overlap of 10, but the mass curve
reveals that few answer candidates have an overlap score greater than
6.
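
The link between this plot and the raw counts can be made explicit.
Write $n_c(x)$ and $n_i(x)$ (notation assumed here) for the number of
correct and incorrect candidates with overlap $x$, and assume the
conditional probabilities are estimated by relative frequency, so that
$\Pr(\mbox{correct} \mid \mbox{overlap}=x) = n_c(x) / (n_c(x) + n_i(x))$.
Then
\begin{displaymath}
\log\frac{\Pr(\mbox{correct} \mid \mbox{overlap}=x)}{\Pr(\mbox{incorrect} \mid \mbox{overlap}=x)}
= \log\frac{n_c(x)}{n_i(x)}
\end{displaymath}
Under that assumption, the log odds are negative exactly when incorrect
candidates outnumber correct ones at that score, and zero when the two
counts are equal.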
332:
333:
334: \begin{figure}
335: \centerline{\epsfig{figure=figs/out-tlogodds-xoverlap-jqaviar-i-c1-TREC.eps,height=2.0in}}
336: \caption{TREC-8 log odds correct given overlap}
337: \label{logodds-overlap-TREC}
338: \end{figure}
339:
340: \section{Bounds on scoring functions that use term overlap}
341: \label{bounds}
342:
343: The scoring function used in the previous section simply counts the
344: number of terms shared by a question and a sentence. One obvious
345: modification is to weight some terms more heavily than others. We
tried using inverse document frequency (IDF) term weighting on
347: the CBC data but found that it did not improve performance. The graph
348: analogous to Figure~\ref{camel-overlap-rank-CBC} but with IDF term
349: weighting was virtually identical.
350:
351: Could another weighting scheme perform better? How well could an
352: optimal weighting scheme do? How poorly would the maximally
353: suboptimal scheme do? The analysis in this section addresses
these questions. In essence, the answer is the following: the question
and the candidate answers are typically short, and thus the number of
overlapping terms is small. Consequently, many candidate answers
have exactly the same overlapping terms and no weighting scheme could
differentiate them. In addition, subset relations often hold between
overlaps. A candidate whose overlap is a subset of a second
candidate's overlap cannot score higher than that candidate regardless
of the weighting scheme.\footnote{Assuming that all term weights are positive.}
362: We formalize these overlap set relations and then calculate statistics
363: based on them for the CBC and TREC data.
364:
365:
366: \begin{figure}[htbp]
367: \fbox{
368: \begin{minipage}{2.8in}
369: \footnotesize
370: Question: How much was Babe Belanger paid to play amateur basketball? \\
371: \\
372: S1: She was a member of the winningest \\
373: \hspace*{.2in} {\bf basketball} team Canada ever had. \\
374: S2: {\bf Babe} {\bf Belanger} never made a cent for her \\
375: \hspace*{.2in} skills.\\
376: S3: They were just a group of young women \\
377: \hspace*{.2in} from the same school who liked to \\
378: \hspace*{.2in} {\bf play} {\bf amateur} {\bf basketball}. \\
379: S4: {\bf Babe} {\bf Belanger} played with the Grads from \\
380: \hspace*{.2in} 1929 to 1937. \\
381: S5: {\bf Babe} never talked about her fabulous career. \\
382: \hrule
383: \vspace*{1mm}
384: MaxOsets : ( \{S2, S4\}, \{S3\} )
385: \end{minipage}
386: }
387: \caption{Example of Overlap Sets from CBC}
388: \label{qsubset}
389: \end{figure}
390:
391: Figure~\ref{qsubset} presents an example from the CBC data. The four
392: overlap sets are (i) {\em Babe Belanger}, (ii) {\em basketball}, (iii)
393: {\em play amateur basketball}, and (iv) {\em Babe}. In any
394: term-weighting scheme with positive weights, a sentence containing the
395: words \textit{Babe Belanger} will have a higher score than sentences
396: containing just \textit{Babe}, and sentences with \textit{play amateur
397: basketball} will have a higher score than those with just
398: \textit{basketball}. However, we cannot generalize with respect to
399: the relative scores of sentences containing \textit{Babe Belanger} and
400: those containing \textit{play amateur basketball} because some terms
401: may have higher weights than others.
402:
403: The most we can say is that the highest scoring candidate must be a
member of $\{S2,S4\}$ or $\{S3\}$. S5 and S1 cannot be ranked highest
because the overlapping words of each are a proper subset of those of a
competing overlap set. The correct answer is S2, so an optimal weighting scheme would
407: have a 50\% chance of ranking S2 first, assuming that it identified
408: the correct overlap set $\{S2,S4\}$ and then randomly chose between S2
409: and S4. A maximally suboptimal weighting scheme could rank S2 no lower
410: than third.
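
To see the argument in symbols, let $\lambda_t > 0$ be a hypothetical
weight for term $t$, and score a sentence by the sum of the weights of
its overlapping terms; the additive form is an assumption made here for
illustration. For the sentences in Figure~\ref{qsubset},
\begin{eqnarray*}
\mbox{score}(S2) & = & \mbox{score}(S4) \\
 & = & \lambda_{\mbox{\scriptsize Babe}} + \lambda_{\mbox{\scriptsize Belanger}} \\
 & > & \lambda_{\mbox{\scriptsize Babe}} = \mbox{score}(S5) \\
\mbox{score}(S3) & = & \lambda_{\mbox{\scriptsize play}} + \lambda_{\mbox{\scriptsize amateur}} + \lambda_{\mbox{\scriptsize basketball}} \\
 & > & \lambda_{\mbox{\scriptsize basketball}} = \mbox{score}(S1)
\end{eqnarray*}
No choice of positive weights reverses either inequality, whereas the
comparison between $\mbox{score}(S2)$ and $\mbox{score}(S3)$ depends
entirely on the weights chosen.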
411:
412: We will formalize these concepts using the following variables:
413: \begin{quote}
414: {\em q}: a question (a set of words) \\
415: {\em s}: a sentence (a set of words) \\
416: {\em w,v}: sets of intersecting words
417: \end{quote}
We define an {\it overlap set} ($o_{w,q}$) to be a set of sentences
(answer candidates) that have the same words overlapping with the
question. We define a {\it maximal overlap set} as an overlap
set whose overlapping words are not a proper subset of the overlapping
words of any other overlap set for the question, and we write $M_q$
for the set of all maximal overlap sets for $q$.
For simplicity, we will refer to a maximal overlap set as a {\it
MaxOset}.
424: \begin{itemize}
425: \item[] $o_{w,q} = \{s| s\cap q = w\}$
426: \item[] $\Omega_{q} = \mbox{all unique overlap sets for } q$
427: \item[] $maximal(o_{w,q})$ ~if~ $\forall o_{v,q} \in \Omega_q, w \not\subset v$
428: \item[] $M_{q} = \{o_{w,q} \in \Omega_{q}\ \mid maximal(o_{w,q})\}$
429: \item[] $C_{q} = \{s | s \mbox{ correctly answers } q\}$
430: \end{itemize}
431:
432: We can use these definitions to give upper and lower bounds on the
433: performance of term-weighting functions on our two data sets.
Table~\ref{subsetnum} shows the results. The $max$ statistic is the
percentage of questions for which at least one member of one of their
MaxOsets is correct. The $min$ statistic is the percentage of questions for
which all candidates in all of their MaxOsets are correct (i.e., there
is no way to pick a wrong answer). Finally, the {\em expected max} is a
slightly more realistic upper bound. It is equivalent to randomly
choosing among members of the ``best'' maximal overlap set, i.e., the
MaxOset that has the highest percentage of correct members. Formally,
the statistics for a set of questions $Q$ are computed as:
\begin{displaymath}
\mbox{max} =
\frac{|\{q| \exists o \in M_q, \exists s \in o \mbox{ s.t. } s \in
C_q\}|}{|Q|}
\end{displaymath}
\begin{displaymath}
\mbox{min} = \frac{|\{q|\forall o \in M_q, \forall s \in
o,~ s \in C_q\}|}{|Q|}
\end{displaymath}
\begin{displaymath}
\mbox{exp. max} = \frac{1}{|Q|}\sum_{q \in Q} \max_{o \in M_q}
\frac{|\{s | s \in o \mbox{ and } s \in C_q\}|}{|o|}
\end{displaymath}
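
As a check on these definitions, they can be applied to the single
question in Figure~\ref{qsubset}, taking the bold terms in the figure
as the overlapping words of each sentence:
\begin{eqnarray*}
\Omega_q & = & \{\{S2,S4\}, \{S1\}, \{S3\}, \{S5\}\} \\
M_q & = & \{\{S2,S4\}, \{S3\}\} \\
C_q & = & \{S2\}
\end{eqnarray*}
This question counts toward $max$ because S2 belongs to a MaxOset. It
does not count toward $min$ because S4 and S3 are incorrect. Its term
in the {\em exp. max} sum is
\begin{displaymath}
\max\left(\frac{|\{S2\}|}{|\{S2,S4\}|}, \frac{0}{|\{S3\}|}\right) = \frac{1}{2}
\end{displaymath}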
456:
457: The results for the TREC data are considerably lower than the results
458: for the CBC data. One explanation may be that in the CBC data, only
459: sentences from one document containing the answer are considered. In
460: the TREC data, as in the TREC task, it is not known beforehand which
461: documents contain answers, so irrelevant documents may contain
462: high-scoring sentences that distract from the correct sentences.
463:
\begin{table}[hbt]
465: \centering
466: \begin{tabular}{|l|r|r|r|} \hline
467: & exp. max & max & min \\ \hline
468: CBC training & 72.7\% & 79.0\% & 24.4\% \\
469: TREC-8 & 48.8\% & 64.7\% & 10.1\% \\ \hline
470: \end{tabular}
471: \caption{Maximum overlap analysis of scores}\label{subsetnum}
472: \end{table}
473:
474: In Table~\ref{mosbrk}, we present a detailed breakdown of the MaxOset
475: results for the CBC data. (Note that the classifications overlap,
476: e.g., questions that are in ``there is always a chance to get it
477: right'' are also in the class ``there may be a chance to get it
right.'') Using only term weighting, 21\% of the questions are
literally impossible to get right because none of the correct
sentences are in the MaxOsets.
481: This result illustrates that maximal overlap sets can identify the
482: limitations of a scoring function by recognizing that some candidates
483: will \underline{always} be ranked higher than others. Although our
484: analysis only considered term overlap as a scoring function, maximal
485: overlap sets could be used to evaluate other scoring functions as
486: well, for example overlap sets based on semantic classes rather than
487: lexical items.
488:
489:
490:
491:
\begin{table*}[hbt]
493: \small
494: \centerline{\begin{tabular}{|lrr|} \hline
495: & \multicolumn{1}{c}{number of} & \multicolumn{1}{c|}{percentage} \\
496: & \multicolumn{1}{c}{questions} & \multicolumn{1}{c|}{of questions}\\ \hline
497: Impossible to get it wrong & 159 & 24\% \\
498: ($\forall o_w \in M_q, \forall s \in o_w, s \in C_q$) & & \\
499: There is always a chance to get it right & 204 & 31\% \\
500: ($\forall o_w \in M_q, \exists s \in o_w \mbox{ s.t. } s \in C_q$) &
501: & \\
502: There may be a chance to get it right & 514 & 79\% \\
503: ($\exists o_w \in M_q \mbox{ s.t. } \exists s \in o_w \mbox{ s.t. }
504: s \in C_q$) & & \\
505: The wrong answers will always be weighted too highly & 137 & 21\% \\
506: ($\forall o_w \in M_q, \forall s \in o_w, s \not\in C_q$) & & \\
There are no correct answers with any overlap with $q$ & 66 & 10\% \\
508: ($\forall s \in d,s $ is incorrect or $s$ has 0 overlap) & & \\
509: There are no correct answers (auto scoring error) & 12 & 2\% \\
510: ($\forall s \in d,s $ is incorrect) & & \\ \hline
511: \end{tabular}}
512: \caption{Maximal Overlap Set Analysis for CBC data}
513: \label{mosbrk}
514: \end{table*}
515:
516: In sum, the upper bound for term weighting schemes is quite low and
517: the lower bound is quite high. These results suggest that methods
518: such as query expansion are essential to increase the feature sets
519: used to score answer candidates. Richer feature sets could distinguish
520: candidates that would otherwise be represented by the same features
521: and therefore would inevitably receive the same score.
522:
523:
536: \section{Analyzing the effect of multiple answer type occurrences in
537: a sentence}
538: \label{answerType}
539:
540:
541: In this section, we analyze the problem of extracting short answers
542: from a sentence. Many Q/A systems first decide what answer type a
543: question expects and then identify instances of that type in
544: sentences. A scoring function ranks the possible answers using
545: additional criteria, which may include features of the surrounding
546: sentence such as term overlap with the question.
547:
548: For our analysis, we will assume that two short answers that have the
549: same answer type and come from the same sentence are indistinguishable
550: to the system. This assumption is made by many Q/A systems: they do
551: not have features that can prefer one entity over another of the same
552: type in the same sentence.
553:
554:
555:
556:
557: We manually annotated data for 165 TREC-9 questions and 186 CBC
558: questions to indicate perfect question typing, perfect answer
559: sentence identification, and perfect semantic tagging.
560: Using these annotations, we measured how much ``answer confusion''
561: remains if an oracle gives you the correct question type, a sentence
562: containing the answer, and correctly tags all entities in the sentence
563: that match the question type. For example, the oracle tells you that
564: the question expects a person, gives you a sentence containing the
565: correct person, and tags all person entities in that sentence. The one
566: thing the oracle does not tell you is {\it which} person is the
567: correct one.
568:
569: Table~\ref{confusability-table} shows the answer types that we used.
Most of the types are fairly standard, except for {\it Defaultnp}
and {\it Defaultvp}, which are default tags for questions
that desire a noun phrase or verb phrase but cannot be more precisely
typed.
574:
575: We computed an expected score for this hypothetical system as follows:
576: for each question, we divided the number of correct candidates
577: (usually one) by the total number of candidates of the same answer
578: type in the sentence. For example, if a question expects a {\em
579: Location} as an answer and the sentence contains three locations,
580: then the expected accuracy of the system would be 1/3 because the
581: system must choose among the locations randomly. When multiple
582: sentences contain a correct answer, we aggregated the sentences.
583: Finally, we averaged this expected accuracy across all questions for
584: each answer type.
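
Written out, with symbols that are introduced here only for
illustration: for a question $q$ of answer type $t$, let $n_q$ be the
number of tagged candidates of type $t$ in the answer sentence(s) and
let $c_q$ be the number of those candidates that are correct. The
per-question and per-type expected scores are then
\begin{displaymath}
e(q) = \frac{c_q}{n_q}, \qquad
E_t = \frac{1}{|Q_t|}\sum_{q \in Q_t} e(q)
\end{displaymath}
where $Q_t$ is the set of questions of type $t$. This sketch assumes
that aggregating multiple answer sentences means pooling their
candidates into $n_q$ and $c_q$; the exact aggregation is not spelled
out above.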
585:
586:
587:
588: \begin{table}[t]
589: \footnotesize
590: \begin{center}
591: \begin{tabular}{|l|l|c|l|c|} \hline
592: & \multicolumn{2}{c|}{\bf TREC} & \multicolumn{2}{c|}{\bf CBC} \\
593: {\it Answer Type} & {\it Score} & {\it Freq} & {\it Score} & {\it Freq} \\ \hline
594: defaultnp & .33 & 47 & .25 & 28 \\
595: organization & .50 & 1 & .72 & 3 \\
596: length & .50 & 1 & .75 & 2 \\
597: thingname & .58 & 14 & .50 & 1 \\
598: quantity & .58 & 13 & .77 & 14 \\
599: agent & .63 & 19 & .40 & 23 \\
600: location & .70 & 24 & .68 & 29 \\
601: personname & .72 & 11 & .83 & 13 \\
602: city & .73 & 3 & n/a & 0 \\
603: defaultvp & .75 & 2 & .42 & 15 \\
604: temporal & .78 & 16 & .75 & 26 \\
605: personnoun & .79 & 7 & .53 & 5 \\
606: duration & 1.0 & 3 & .67 & 4 \\
607: province & 1.0 & 2 & 1.0 & 2 \\
608: area & 1.0 & 1 & n/a & 0 \\
609: day & 1.0 & 1 & n/a & 0 \\
610: title & n/a & 0 & .50 & 1 \\
611: person & n/a & 0 & .67 & 3 \\
612: money & n/a & 0 & .88 & 8 \\
613: ambigbig & n/a & 0 & .88 & 4 \\
614: age & n/a & 0 & 1.0 & 2 \\
615: comparison & n/a & 0 & 1.0 & 1 \\
616: mass & n/a & 0 & 1.0 & 1 \\
617: measure & n/a & 0 & 1.0 & 1 \\ \hline
618: {\bf Overall} & .59 & 165 & .61 & 186 \\ \hline
619: {\bf Overall-dflts} & .69 & 116 & .70 & 143 \\ \hline
620: \end{tabular}
621: \end{center}
622: \caption{Expected scores and frequencies for each answer type}
623: \label{confusability-table}
624: \end{table}
625:
626: Table~\ref{confusability-table} shows that a system with perfect
627: question typing, perfect answer sentence identification, and perfect
628: semantic tagging would still achieve only 59\% accuracy on the TREC-9
629: data. These results reveal that there are often multiple candidates of
630: the same type in a sentence. For example, {\it Temporal} questions
631: received an expected score of 78\% because there was usually only one
632: date expression per sentence (the correct one), while {\it Default NP}
633: questions yielded an expected score of 25\% because there were four
634: noun phrases per question on average. Some common types were
particularly problematic. {\it Agent} questions (most {\em Who}
questions) had an expected score of only 0.63, while {\it Quantity}
questions had an expected score of 0.58.
638:
639: The CBC data showed a similar level of answer confusion, with an
640: expected score of 61\%, although the confusability of individual
answer types varied from TREC. For example, {\it Agent} questions were even
more difficult, receiving a score of 40\%, but {\it Quantity}
questions were easier, receiving a score of 77\%.
644:
645: Perhaps a better question analyzer could assign more specific types to
646: the {\it Default NP} and {\it Default VP} questions, which skew the
647: results. The {\bf Overall-dflts} row of
Table~\ref{confusability-table} shows the expected scores without
these types, which are still only about 70\%, so a great deal of answer
confusion remains even without those questions. The confusability
651: analysis provides insight into the limitations of the answer type set,
652: and may be useful for comparing the effectiveness of different answer
653: type sets (somewhat analogous to the use of grammar perplexity in
654: speech research).
655:
656: \begin{figure}[htbp]
657: \fbox{
658: \begin{minipage}{2.9in}
659: \footnotesize
660: Q1: {\it What city is Massachusetts General Hospital located in?}
661:
662: A1: It was conducted by a cooperative group of oncologists from Hoag,
663: Massachusetts General Hospital in \underline{{\bf Boston}},
664: \underline{Dartmouth} College in New Hampshire, UC \underline{San Diego} Medical
665: Center, McGill University in \underline{Montreal}
666: and the University of Missouri in \underline{Columbia}. \\
667:
668: Q2: {\it When was Nostradamus born? }
669:
670: A2: Mosley said followers of Nostradamus, who lived from
671: \underline{{\bf 1503}} to \underline{1566},
672: have claimed ...
673: \end{minipage}
674: }
675: \caption{Sentences with Multiple Items of the Same Type}
676: \label{multitypes}
677: \end{figure}
678:
679: However, Figure~\ref{multitypes} shows the fundamental problem behind
680: answer confusability. Many sentences contain multiple instances of the
681: same type, such as lists and ranges. In Q1, recognizing that the
682: question expects a city rather than a general location is still not
683: enough because several cities are in the answer sentence. To achieve
better performance, Q/A systems need to use features that can more
685: precisely target an answer.
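
For a rough sense of scale, assume that the underlined items in
Figure~\ref{multitypes} are the candidates tagged by the oracle and
that the bold item is the single correct one; this is a reading of the
figure's notation, not something stated explicitly. A system that
chooses uniformly at random among the tagged candidates would then
have expected scores of
\begin{displaymath}
\mbox{Q1: } \frac{1}{5} = 0.20, \qquad \mbox{Q2: } \frac{1}{2} = 0.50
\end{displaymath}
since A1 contains five underlined candidates and A2 contains two.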
686:
687:
700: \section{Conclusion}
701:
In this paper we have presented four analyses of question answering
system performance, involving multiple answer occurrences, relative
704: score for candidate ranking, bounds on term overlap performance, and
705: limitations of answer typing for short answer extraction. We hope
706: that both the results {\em and} the tools we describe will be useful
707: to others. In general, we feel that analysis of good performance is
708: nearly as important as the performance itself and that the analysis of
709: bad performance can be equally important.
710:
711:
712:
713: \small
714: \bibliographystyle{acl}
715: \bibliography{riloff,hood}
716: \end{document}