0107:cs0107006/hood.tex

1: \documentclass[11pt]{article}

2: \usepackage{acl2001,times}

3: \usepackage{epsfig}

4: \setlength\titlebox{6.5cm}    %

5:

6: \title{Looking Under the Hood: Tools for Diagnosing Your Question

7: Answering Engine{\footnotesize$^\textrm{1}$}\footnotemark[0]}

8:

9: \author{Eric Breck$^{\dagger}$, Marc Light$^{\dagger}$, Gideon

10:   S. Mann$^{\diamondsuit}$, Ellen Riloff$^{\circ}$, \\ {\bf Brianne

11:   Brown$^{\ddagger}$, Pranav Anand$^{*}$, Mats Rooth$^{\mp}$,

12:   Michael Thelen$^{\circ}$} \\

13:  ~ \\

14: \small

15: $^{\dagger}$ The MITRE Corporation, 202 Burlington Rd., Bedford, MA

16:   01730, \{ebreck,light\}@mitre.org \\

17: \small

18: $^{\diamondsuit}$ Department of Computer Science, Johns Hopkins

19:   University, Baltimore, MD 21218, gsm@cs.jhu.edu \\

20: \small

21: $^{\circ}$ School of Computing, University of Utah, Salt Lake City, UT

22:   84112, \{riloff,thelenm\}@cs.utah.edu \\

23: \small

24: $^{\ddagger}$ Bryn Mawr College, Bryn Mawr, PA 19010,

25:   bbrown@brynmawr.edu\\

26: \small

27: $^{*}$ Department of Mathematics, Harvard University, Cambridge, MA

28:   02138, anand@fas.harvard.edu  \\

29: \small

30: $^{\mp}$ Department of Linguistics, Cornell University, Ithaca, NY

31:   14853, mr249@cornell.edu

32: \normalsize

33: }

34:

35:

36: \date{}

37:

38:

39: \begin{document}

40: \maketitle

41: \begin{abstract}

42:

43: \footnotetext[1]{This paper contains a revised Table 2 replacing the one appearing in the Proceedings of the Workshop on Open-Domain

44: Question Answering, Toulouse, France 2001.}

45: \setcounter{footnote}{1 }

46:

47:   In this paper we analyze two question answering tasks: the TREC-8

48:   question answering task and a set of reading comprehension exams.

49:   First, we show that Q/A systems perform better when there are

50:   multiple answer opportunities per question.  Next, we analyze common

51:   approaches to two subproblems: term overlap for answer sentence

52:   identification, and answer typing for short answer extraction. We

53:   present general tools for analyzing the strengths and limitations of

54:   techniques for these subproblems. Our results quantify the

55:   limitations of both term overlap and answer typing to distinguish

56:   between competing answer candidates.

57:

58:

59: \end{abstract}

60:

61: \section{Introduction}

62:

63: When building a system to perform a task, the most important statistic

64: is the performance on an end-to-end evaluation.  For the task of

65: open-domain question answering against text collections,

66: there have been two large-scale end-to-end evaluations:

67: \cite{trec8-proceedings} and \cite{trec9-proceedings}.  In addition, a

68: number of researchers have built systems to take reading comprehension

69: examinations designed to evaluate children's reading

70: levels \cite{charniak-readcomp,hirschman99,ng2000,riloff-quarc,harper-readcomp}.

71: The performance statistics have

72: been useful for determining how well techniques work.

73:

74: However, raw performance statistics are not enough.  If the score is

75: low, we need to understand what went wrong

76: and how to fix it.  If the score is high, it is important to

77: understand why.  For example, performance may be dependent on

78: characteristics of the current test set and would not carry over to a

79: new domain.  It would also be useful to know if there is a particular

80: characteristic of the system that is central.  If so, then the system

81: can be streamlined and simplified.

82:

83: In this paper, we explore ways of gaining insight into question

84: answering system performance.  First, we analyze the impact of having

85: multiple answer opportunities for a question. We found that TREC-8 Q/A

86: systems performed better on questions that had multiple answer

87: opportunities in the document collection. Second, we present a variety

88: of graphs to visualize and analyze functions for ranking sentences.

89: The graphs revealed that relative score, rather than absolute score, is

90: paramount. Third, we introduce bounds on functions that use term

91: overlap\footnote{Throughout the text, we use ``overlap'' to refer to

92:   the intersection of sets of words, most often the words in the

93:   question and the words in a sentence.}  to rank sentences.  Fourth,

94: we compute the expected score of a hypothetical Q/A system that

95: correctly identifies the answer type for a question and correctly

96: identifies all entities of that type in answer sentences. We found

97: that a surprising amount of ambiguity remains because sentences often

98: contain multiple entities of the same type.

99:

100:

101: \section{The data}

102:

103:

104: The experiments in Sections~\ref{ansMult}, \ref{graphs}, and

105: \ref{bounds} were performed on two question answering data sets: (1)

106: the TREC-8 Question Answering Track data set and (2) the CBC reading

107: comprehension data set. We will briefly describe each of these data

108: sets and their corresponding tasks.

109:

110:

111: The task of the TREC-8 Question Answering track was to find the answer

112: to 198 questions using a document collection consisting of roughly

113: 500,000 newswire documents.  For each question, systems were allowed

114: to return a ranked list of 5 short (either 50-character or

115: 250-character) responses.  As a service to track participants, AT\&T

116: provided top documents returned by their retrieval engine for each of

117: the TREC questions.  Sections~\ref{graphs} and \ref{bounds} present

118: analyses that use all sentences in the top 10 of these documents.

119: Each sentence is classified as correct or incorrect automatically.

120: This automatic classification judges a sentence to be correct if it

121: contains at least half of the stemmed content words in the answer

122: key.  We have compared this automatic evaluation to the judgments of the TREC-8 QA

123: track assessors and found it to agree 93--95\% of the time

124: \cite{breck2000}.
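
As an illustrative formalization (the symbols $K$, $W(s)$, and $\mbox{correct}(s)$ are introduced here only for illustration and are not part of the original scoring procedure), let $K$ be the set of stemmed content words in the answer key and $W(s)$ the set of stemmed words in sentence $s$. The automatic judgment can then be sketched as:
\begin{displaymath}
\mbox{correct}(s) \iff \frac{|W(s) \cap K|}{|K|} \geq \frac{1}{2}
\end{displaymath}
For a hypothetical answer key with four stemmed content words, a sentence containing any two of them would be judged correct, while a sentence containing only one would not.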

125:

126:

127:

128: The CBC data set was created for the Johns Hopkins Summer 2000

129: Workshop on Reading Comprehension.  Texts were collected from the

130: Canadian Broadcasting Corporation web page for kids

131: (http://cbc4kids.ca/). They are an average of 24 sentences long.  The

132: stories were adapted from newswire texts to be appropriate for

133: adolescent children, and most fall into the following domains:

134: politics, health, education, science, human interest, disaster,

135: sports, business, crime, war, entertainment, and environment.  For

136: each CBC story, 8--12 questions and an answer key were

137: generated.\footnote{This work was performed by Lisa Ferro and Tim

138:   Bevins of the MITRE Corporation.  Dr. Ferro has professional

139:   experience writing questions for reading comprehension exams and led

140:   the question writing effort.} We used a 650-question subset of the

141: data and their corresponding 75 stories.  The answer candidates for

142: each question in this data set were all sentences in the document.

143: The sentences were scored against the answer key by the automatic

144: method described previously.

145:

146: \section{Analyzing the number of answer opportunities per question}

147: \label{ansMult}

148:

149: In this section we explore the impact of multiple answer opportunities

150: on end-to-end system performance.  A question may have multiple

151: answers for two reasons: (1) there is more than one different answer

152: to the question, and (2) there may be multiple instances of each

153: answer.  For example, {\em ``What does the Peugeot company

154: manufacture?''} can be answered by {\em trucks}, {\em cars}, or {\em

155: motors} and each of these answers may occur in many sentences that

156: provide enough context to answer the question.  The table insert in

157: Figure~\ref{cbc-histograms} shows that, on average, there are 7 answer

158: occurrences per question in the TREC-8 collection.\footnote{We would

159: like to thank John Burger and John Aberdeen for help preparing

160: Figure~\ref{cbc-histograms}.} In contrast, there are only 1.25 answer

161: occurrences in a CBC document.  The number of answer occurrences

162: varies widely, as illustrated by the standard deviations.  The median

163: shows an answer frequency of 3 for TREC and 1 for CBC, which perhaps

164: gives a more realistic sense of the degree of answer frequency for

165: most questions.

166:

167: \begin{figure}[htbp]

168: \centering

169: \epsfig{figure=figs/ansMultBar.eps,height=2in,width=3.1in}

170: \caption{Frequency of answers in the TREC-8 (black bars) and CBC

171:   (white bars) data sets}

172: \label{cbc-histograms}

173: \end{figure}

174:

175: To gather this data we manually reviewed 50 randomly chosen TREC-8

176: questions and identified all answers to these questions in our text

177: collection. We defined an ``answer'' as a text fragment that contains

178: the answer string in a context sufficient to answer the question.

179: Figure~\ref{cbc-histograms} shows the resulting graph.  The $x$-axis

180: displays the number of answer occurrences found in the text

181: collection per question and the $y$-axis shows the percentage of

182: questions that had $x$ answers.  For example, 26\%

183: of the TREC-8 questions had

184: only 1 answer occurrence, and 20\%

185: of the TREC-8 questions had exactly 2 answer occurrences (the black

186: bars).  The most prolific question had 67 answer occurrences (the

187: Peugeot example mentioned above).

188: Figure~\ref{cbc-histograms} also shows the analysis of 219 CBC

189: questions. In contrast, 80\%

190: of the CBC questions had only 1 answer

191: occurrence in the targeted document, and 16\%

192: had exactly 2 answer occurrences.

193:

194: \begin{figure}[htbp]

195: \centering

196: \epsfig{figure=figs/dupVsCorr.eps,height=2in,width=3.1in}

197: \caption{Answer repetition vs. system response correctness for TREC-8}

198: \label{scatter}

199: \end{figure}

200:

201: Figure~\ref{scatter} shows the effect that multiple answer

202: opportunities had on the performance of TREC-8 systems.  Each solid

203: dot in the scatter plot represents one of the 50 questions we

204: examined.\footnote{We would like to thank Lynette Hirschman for

205: suggesting the analysis behind Figure~\ref{scatter} and John Burger

206: for help with the analysis and presentation.}  The $x$-axis shows the

207: number of answer opportunities for the question, and the $y$-axis

208: represents the percentage of systems that generated a correct

209: answer\footnote{For this analysis, we say that a system generated a

210: correct answer if a correct answer was in its response set.} for the

211: question.  For example, for the question with 67 answer occurrences,

212: 80\% of the systems produced a correct answer.  In

213: contrast, many questions had a single answer occurrence and the

214: percentage of systems that got those correct varied from about 2\% to

215: 60\%.

216:

217: The circles in Figure~\ref{scatter} represent the average percentage

218: of systems that answered questions correctly for all questions with

219: the same number of answer occurrences.  For example, on average about

220: 27\% of the systems produced a correct answer for questions that had

221: exactly one answer occurrence, but about 50\% of the systems produced

222: a correct answer for questions with 7 answer opportunities.

223: Overall, a clear pattern emerges: the performance of TREC-8 systems

224: was strongly correlated with the number of answer opportunities

225: present in the document collection.

226:

227: \section{Graphs for analyzing scoring functions of answer candidates}

228: \label{graphs}

229:

230: Most question answering systems generate several answer candidates and

231: rank them by defining a scoring function that maps answer candidates

232: to a range of numbers.  In this section, we analyze one particular

233: scoring function: {\em term overlap} between the question and answer

234: candidate.  The techniques we use can be easily applied to other

235: scoring functions as well (e.g., weighted term overlap, partial

236: unification of sentence parses, weighted abduction score, etc.). The

237: answer candidates we consider are the sentences from the documents.

238:

239: The expected performance of a system that ranks all sentences using

240: term overlap is 35\% for the TREC-8 data.  This number is an expected

241: score because of ties: correct and incorrect candidates may have the

242: same term overlap score.  If ties are broken optimally, the best

243: possible score ({\em maximum}) would be 54\%. If ties are broken

244: in the least favorable way, the worst possible score ({\em minimum}) would

245: be 24\%.  The corresponding scores on the CBC data are 58\%

246: expected, 69\% maximum, and 51\% minimum.  We would like to

247: understand why the term overlap scoring function works as well as it

248: does and what can be done to improve it.

249:

250: Figures~\ref{camel-overlap-TREC} and \ref{camel-overlap-CBC} compare

251: correct candidates and incorrect candidates with respect to the

252: scoring function. The $x$-axis plots the range of the scoring

253: function, i.e., the amount of overlap.  The $y$-axis represents {\bf

254:   Pr(overlap=x $\mid$ correct)} and {\bf Pr(overlap=x $\mid$

255:   incorrect)}, where separate curves are plotted for correct and

256: incorrect candidates.  The probabilities are generated by normalizing

257: the number of correct/incorrect answer candidates with a particular

258: overlap score by the total number of correct/incorrect candidates,

259: respectively.
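
To make the normalization explicit, one possible notation is the following. The counts $n_c(x)$ and $n_i(x)$ are introduced here for illustration, and the sketch assumes that candidates are pooled over all questions. Writing $n_c(x)$ for the number of correct candidates with overlap $x$:
\begin{displaymath}
\mbox{Pr}(\mbox{overlap}=x \mid \mbox{correct}) = \frac{n_c(x)}{\sum_{x'} n_c(x')}
\end{displaymath}
and analogously for incorrect candidates with $n_i(x)$. Under this reading each curve sums to 1 over $x$, so the two curves can be compared by shape even though incorrect candidates greatly outnumber correct ones.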

260:

261: \begin{figure}[h]

262: \centerline{\epsfig{figure=figs/out-tcamel-xoverlap-jqaviar-n-i-c1-TREC.eps,height=2.0in}}

263: \caption{Pr(overlap=x$\mid$[in]correct) for TREC-8}

264: \label{camel-overlap-TREC}

265: \vspace*{.2in}

266: \centerline{\epsfig{figure=figs/out-tcamel-xoverlap-jqaviar-n-i-c1-CBC.eps,height=2.0in}}

267: \caption{Pr(overlap=x$\mid$[in]correct) for CBC}

268: \label{camel-overlap-CBC}

269: \end{figure}

270:

271: Figure~\ref{camel-overlap-TREC} illustrates that the correct

272: candidates for TREC-8 have term overlap scores distributed between 0 and 10 with

273: a peak of 24\% at an overlap of 2.  However, the incorrect candidates

274: have a similar distribution between 0 and 8 with a peak of 32\% at an

275: overlap of 0.  The similarity of the curves illustrates that it is

276: unclear how to use the score to decide if a candidate is correct or

277: not.  Certainly no static threshold above which a candidate is deemed

278: correct will work.  Yet the expected score of our TREC term overlap system

279: was 35\%, which is much higher than a random baseline which would get

280: an expected score of less than 3\% because there are over 40 sentences on

281: average in newswire documents.\footnote{We also tried dividing the term overlap

282:   score by the length of the question to normalize for query length

283:   but did not find that the graph was any more helpful.}

284:

285: After inspecting some of the data directly, we posited that it was not

286: the absolute term overlap that was important for judging a candidate but

287: how the overlap score compares to the scores of other candidates.  To

288: visualize this, we generated new graphs by plotting the rank of a

289: candidate's score on the $x$-axis. For example, the candidate with the

290: highest score would be ranked first, the candidate with the second

291: highest score would be ranked second, etc.

292: Figures~\ref{camel-overlap-rank-TREC} and \ref{camel-overlap-rank-CBC}

293: show these graphs, which display {\bf Pr(rank=x $\mid$ correct)} and

294: {\bf Pr(rank=x $\mid$ incorrect)} on the $y$-axis.  The top-ranked

295: candidate has rank=0.

296:

297: \begin{figure}[h]

298: \centerline{\epsfig{figure=figs/out-tcamel-xoverlap-k-jqaviar-n-i-c1-TREC.eps,height=2.0in}}

299: \caption{Pr(rank=x $\mid$ [in]correct) for TREC-8}

300: \label{camel-overlap-rank-TREC}

301: \vspace*{.2in}

302: \centerline{\epsfig{figure=figs/out-tcamel-xoverlap-k-jqaviar-n-i-c1-CBC.eps,height=2.0in}}

303: \caption{Pr(rank=x $\mid$ [in]correct) for CBC}

304: \label{camel-overlap-rank-CBC}

305: \end{figure}

306:

307: The ranked graphs are more revealing than the graphs of absolute

308: scores: the probability of a high rank is greater for correct answers

309: than incorrect ones.  Now we can begin to understand why the term

310: overlap scoring function worked as well as it did.  We see that,

311: unlike classification tasks, there is no good threshold for our

312: scoring function.  Instead relative score is paramount.  Systems such

313: as \cite{ng2000} make explicit use of relative rank in their

314: algorithms and now we understand why this is effective.

315:

316:

317: Before we leave the topic of graphing scoring functions, we want to

318: introduce one other view of the data.

319: Figure~\ref{logodds-overlap-TREC} plots term overlap scores on the

320: $x$-axis and the log odds of being correct given a score on the

321: $y$-axis. The log odds formula is:

322: \begin{displaymath}

323: \log\frac{\mbox{Pr}(\mbox{correct} \mid \mbox{overlap})}{\mbox{Pr}(\mbox{incorrect} \mid \mbox{overlap})}

324: \end{displaymath}

325: Intuitively, this graph shows how much more likely a sentence is to be

326: correct versus incorrect given a particular score.  A second curve,

327: labeled ``mass,'' plots the number of answer candidates with each

328: score. Figure~\ref{logodds-overlap-TREC} shows that the log odds of being

329: correct are negative until an overlap of 10, but the mass curve

330: reveals that few answer candidates have an overlap score greater than

331: 6.
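
As a hypothetical illustration, suppose that 1 in 10 candidates with some overlap score is correct. The numbers are invented for the example and are not read off Figure~\ref{logodds-overlap-TREC}, and the natural logarithm is assumed, since the base is not stated. Then
\begin{displaymath}
\log\frac{0.1}{0.9} = \log\frac{1}{9} \approx -2.20
\end{displaymath}
The log odds are negative exactly when $\mbox{Pr}(\mbox{correct} \mid \mbox{overlap}) < 0.5$, zero when correct and incorrect candidates are equally likely at that score, and positive otherwise.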

332:

333:

334: \begin{figure}

335: \centerline{\epsfig{figure=figs/out-tlogodds-xoverlap-jqaviar-i-c1-TREC.eps,height=2.0in}}

336: \caption{TREC-8 log odds correct given overlap}

337: \label{logodds-overlap-TREC}

338: \end{figure}

339:

340: \section{Bounds on scoring functions that use term overlap}

341: \label{bounds}

342:

343: The scoring function used in the previous section simply counts the

344: number of terms shared by a question and a sentence.  One obvious

345: modification is to weight some terms more heavily than others.  We

346: tried using inverse document frequency (IDF) based term weighting on

347: the CBC data but found that it did not improve performance.  The graph

348: analogous to Figure~\ref{camel-overlap-rank-CBC} but with IDF term

349: weighting was virtually identical.

350:

351: Could another weighting scheme perform better?  How well could an

352: optimal weighting scheme do?  How poorly would the maximally

353: suboptimal scheme do?  The analysis in this section addresses

354: these questions.  In essence the answer is the following: the question

355: and the candidate answers are typically short and thus the number of

356: overlapping terms is small -- consequently, many candidate answers

357: have exactly the same overlapping terms and no weighting scheme could

358: differentiate them.  In addition, subset relations often hold between

359: overlaps. A candidate whose overlap is a subset of a second

360: candidate's overlap cannot score higher regardless of the weighting

361: scheme.\footnote{Assuming that all term weights are positive.}

362: We formalize these overlap set relations and then calculate statistics

363: based on them for the CBC and TREC data.

364:

365:

366: \begin{figure}[htbp]

367: \fbox{

368: \begin{minipage}{2.8in}

369: \footnotesize

370: Question: How much was Babe Belanger paid to play amateur basketball? \\

371: \\

372: S1: She was a member of the winningest \\

373: \hspace*{.2in} {\bf basketball} team Canada ever had. \\

374: S2: {\bf Babe} {\bf Belanger} never made a cent for her \\

375: \hspace*{.2in} skills.\\

376: S3: They were just a group of young women \\

377: \hspace*{.2in} from the same school who liked to \\

378: \hspace*{.2in}  {\bf play} {\bf amateur} {\bf basketball}. \\

379: S4: {\bf Babe} {\bf Belanger} played with the Grads from  \\

380: \hspace*{.2in} 1929 to 1937. \\

381: S5: {\bf Babe} never talked about her fabulous career. \\

382: \hrule

383: \vspace*{1mm}

384: MaxOsets : ( \{S2, S4\}, \{S3\} )

385: \end{minipage}

386: }

387: \caption{Example of Overlap Sets from CBC}

388: \label{qsubset}

389: \end{figure}

390:

391: Figure~\ref{qsubset} presents an example from the CBC data.  The four

392: overlap sets are (i) {\em Babe Belanger}, (ii) {\em basketball}, (iii)

393: {\em play amateur basketball}, and (iv) {\em Babe}.  In any

394: term-weighting scheme with positive weights, a sentence containing the

395: words \textit{Babe Belanger} will have a higher score than sentences

396: containing just \textit{Babe}, and sentences with \textit{play amateur

397:   basketball} will have a higher score than those with just

398: \textit{basketball}.  However, we cannot generalize with respect to

399: the relative scores of sentences containing \textit{Babe Belanger} and

400: those containing \textit{play amateur basketball} because some terms

401: may have higher weights than others.

402:

403: The most we can say is that the highest scoring candidate must be a

404: member of $\{S2,S4\}$ or $\{S3\}$. S5 and S1 cannot be ranked highest

405: because their overlapping words are proper subsets of those of competing

406: candidates. The correct answer is S2, so an optimal weighting scheme would

407: have a 50\% chance of ranking S2 first, assuming that it identified

408: the correct overlap set $\{S2,S4\}$ and then randomly chose between S2

409: and S4. A maximally suboptimal weighting scheme could rank S2 no lower

410: than third.

411:

412: We will formalize these concepts using the following variables:

413: \begin{quote}

414: {\em q}: a question (a set of words) \\

415: {\em s}: a sentence (a set of words) \\

416: {\em w,v}: sets of intersecting words

417: \end{quote}

418: We define an {\it overlap set} ($o_{w,q}$) to be a set of sentences

419: (answer candidates) that have the same words overlapping with the

420: question. We define a {\it maximal overlap set} as an overlap set whose

421: overlapping words are not a proper subset of the overlapping words of any other

422: overlap set for the question, and write $M_q$ for the set of all maximal overlap sets.

423: For simplicity, we will refer to a maximal overlap set as a {\it MaxOset}.

424: \begin{itemize}

425: \item[] $o_{w,q} = \{s| s\cap q = w\}$

426: \item[] $\Omega_{q} = \mbox{all unique overlap sets for } q$

427: \item[] $maximal(o_{w,q})$ ~if~ $\forall o_{v,q} \in \Omega_q, w \not\subset v$

428: \item[] $M_{q} = \{o_{w,q} \in \Omega_{q}\ \mid maximal(o_{w,q})\}$

429: \item[] $C_{q} = \{s | s \mbox{ correctly answers } q\}$

430: \end{itemize}
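
Applied to the example in Figure~\ref{qsubset}, these definitions can be instantiated as follows. This is only an illustrative sketch: it takes the bolded words in the figure as each sentence's overlap with the question, and the labels $w_1,\ldots,w_4$ are introduced here for convenience. Let $w_1 = \{\mbox{Babe}, \mbox{Belanger}\}$, $w_2 = \{\mbox{basketball}\}$, $w_3 = \{\mbox{play}, \mbox{amateur}, \mbox{basketball}\}$, and $w_4 = \{\mbox{Babe}\}$. Then
\begin{displaymath}
\begin{array}{rcl}
o_{w_1,q} & = & \{S2, S4\} \\
o_{w_2,q} & = & \{S1\} \\
o_{w_3,q} & = & \{S3\} \\
o_{w_4,q} & = & \{S5\} \\
\Omega_q & = & \{o_{w_1,q}, o_{w_2,q}, o_{w_3,q}, o_{w_4,q}\}
\end{array}
\end{displaymath}
Because $w_4 \subset w_1$ and $w_2 \subset w_3$, the overlap sets $o_{w_4,q}$ and $o_{w_2,q}$ are not maximal, which leaves $M_q = \{o_{w_1,q}, o_{w_3,q}\} = \{\{S2,S4\},\{S3\}\}$, in agreement with the MaxOsets listed in the figure. Since S2 is the correct answer, $C_q = \{S2\}$.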

431:

432: We can use these definitions to give upper and lower bounds on the

433: performance of term-weighting functions on our two data sets.

434: Table~\ref{subsetnum} shows the results.  The $max$ statistic is the

435: percentage of questions for which at least one member of its MaxOsets

436: is correct.  The $min$ statistic is the percentage of questions for

437: which all candidates of all of its MaxOsets are correct (i.e., there

438: is no way to pick a wrong answer).  Finally, the {\em expected max} is a

439: slightly more realistic upper bound.  It is equivalent to randomly

440: choosing among members of the ``best'' maximal overlap set, i.e., the

441: MaxOset that has the highest percentage of correct members.  Formally,

442: the statistics for a set of questions $Q$ are computed as:

443: \begin{displaymath}

444:  \mbox{max} =

445:  \frac{|\{q| \exists o \in M_q, \exists s \in o \mbox{  s.t.  } s \in

446:  C_q\}|}{|Q|}

447:  \end{displaymath}

448: \begin{displaymath}

449:  \mbox{min} = \frac{|\{q|\forall o \in M_q, \forall s \in

450:  o~~~s \in C_q\}|}{|Q|}

451:  \end{displaymath}

452: \begin{displaymath}

453:  \mbox{exp. max} = \frac{1}{|Q|}\sum_{q \in Q} \max_{o \in M_q}

454:  \frac{|\{s \in o \mbox{ and } s \in C_q\}|}{|o|}

455: \end{displaymath}
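
For the single question in Figure~\ref{qsubset}, where $M_q = \{\{S2,S4\},\{S3\}\}$ and $C_q = \{S2\}$, the per-question contributions to the three statistics can be worked out as a sketch (for illustration only; the reported figures average such contributions over all questions in $Q$). The question counts toward max because the correct sentence S2 belongs to the MaxOset $\{S2,S4\}$. It does not count toward min because S4 and S3 are incorrect. Its contribution to exp. max is
\begin{displaymath}
\max\left(\frac{|\{S2\}|}{|\{S2,S4\}|}, \frac{|\emptyset|}{|\{S3\}|}\right) = \max\left(\frac{1}{2}, 0\right) = \frac{1}{2}
\end{displaymath}
which matches the 50\% chance of ranking S2 first noted in the discussion of that example.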

456:

457: The results for the TREC data are considerably lower than the results

458: for the CBC data.  One explanation may be that in the CBC data, only

459: sentences from one document containing the answer are considered.  In

460: the TREC data, as in the TREC task, it is not known beforehand which

461: documents contain answers, so irrelevant documents may contain

462: high-scoring sentences that distract from the correct sentences.

463:

464: \begin{table}[hbt]

465: \centering

466: \begin{tabular}{|l|r|r|r|} \hline

467:  & exp. max & max & min \\ \hline

468: CBC training & 72.7\% & 79.0\% & 24.4\% \\

469: TREC-8 & 48.8\% & 64.7\% & 10.1\% \\ \hline

470: \end{tabular}

471: \caption{Maximum overlap analysis of scores}\label{subsetnum}

472: \end{table}

473:

474: In Table~\ref{mosbrk}, we present a detailed breakdown of the MaxOset

475: results for the CBC data.  (Note that the classifications overlap,

476: e.g., questions that are in ``there is always a chance to get it

477: right'' are also in the class ``there may be a chance to get it

478: right.'')  21\% of the questions are literally impossible to

479: get right using only term weighting because none of the correct

480: sentences are in the MaxOsets.

481: This result illustrates that maximal overlap sets can identify the

482: limitations of a scoring function by recognizing that some candidates

483: will \underline{always} be ranked higher than others. Although our

484: analysis only considered term overlap as a scoring function, maximal

485: overlap sets could be used to evaluate other scoring functions as

486: well, for example overlap sets based on semantic classes rather than

487: lexical items.

488:

489:

490:

491:

492: \begin{table*}[hbt]

493: \small

494: \centerline{\begin{tabular}{|lrr|} \hline

495:  & \multicolumn{1}{c}{number of} & \multicolumn{1}{c|}{percentage} \\

496:  & \multicolumn{1}{c}{questions}  & \multicolumn{1}{c|}{of questions}\\ \hline

497: Impossible to get it wrong        &   159 &   24\% \\

498:   ($\forall o_w \in M_q, \forall s \in o_w, s \in C_q$) & & \\

499: There is always a chance to get it right &   204 &    31\% \\

500:   ($\forall o_w \in M_q, \exists s \in o_w \mbox{ s.t. } s \in C_q$) &

501:   & \\

502: There may be a chance to get it right & 514 &   79\% \\

503:   ($\exists o_w \in M_q \mbox{ s.t. }  \exists s \in o_w \mbox{ s.t. }

504:   s \in C_q$) & & \\

505: The wrong answers will always be weighted too highly   & 137 &   21\% \\

506:   ($\forall o_w \in M_q, \forall s \in o_w, s \not\in C_q$) & & \\

507:   There are no correct answers with any overlap with $q$  &   66 &   10\% \\

508:   ($\forall s \in d,s $ is incorrect or $s$ has 0 overlap) & & \\

509:   There are no correct answers (auto scoring error)  &   12 &    2\% \\

510:   ($\forall s \in d,s $ is incorrect) & & \\ \hline

511: \end{tabular}}

512: \caption{Maximal Overlap Set Analysis for CBC data}

513: \label{mosbrk}

514: \end{table*}

515:

516: In sum, the upper bound for term weighting schemes is quite low and

517: the lower bound is quite high.  These results suggest that methods

518: such as query expansion are essential to increase the feature sets

519: used to score answer candidates. Richer feature sets could distinguish

520: candidates that would otherwise be represented by the same features

521: and therefore would inevitably receive the same score.

522:

523:

524:

525:

526:

527:

528:

529:

530:

531:

532:

533:

534:

535:

536: \section{Analyzing the effect of multiple answer type occurrences in

537:   a sentence}

538: \label{answerType}

539:

540:

541: In this section, we analyze the problem of extracting short answers

542: from a sentence.  Many Q/A systems first decide what answer type a

543: question expects and then identify instances of that type in

544: sentences.  A scoring function ranks the possible answers using

545: additional criteria, which may include features of the surrounding

546: sentence such as term overlap with the question.

547:

548: For our analysis, we will assume that two short answers that have the

549: same answer type and come from the same sentence are indistinguishable

550: to the system. This assumption is made by many Q/A systems: they do

551: not have features that can prefer one entity over another of the same

552: type in the same sentence.

553:

554:

555:

556:

557: We manually annotated data for 165 TREC-9 questions and 186 CBC

558: questions to indicate perfect question typing, perfect answer

559: sentence identification, and perfect semantic tagging.

560: Using these annotations, we measured how much ``answer confusion''

561: remains if an oracle gives you the correct question type, a sentence

562: containing the answer, and correctly tags all entities in the sentence

563: that match the question type.  For example, the oracle tells you that

564: the question expects a person, gives you a sentence containing the

565: correct person, and tags all person entities in that sentence. The one

566: thing the oracle does not tell you is {\it which} person is the

567: correct one.

568:

569: Table~\ref{confusability-table} shows the answer types that we used.

570: Most of the types are fairly standard, except for the {\it Defaultnp}

571: and {\it Defaultvp} which are default tags for questions

572: that desire a noun phrase or verb phrase but cannot be more precisely

573: typed.

574:

575: We computed an expected score for this hypothetical system as follows:

576: for each question, we divided the number of correct candidates

577: (usually one) by the total number of candidates of the same answer

578: type in the sentence.  For example, if a question expects a {\em

579:   Location} as an answer and the sentence contains three locations,

580: then the expected accuracy of the system would be 1/3 because the

581: system must choose among the locations randomly.  When multiple

582: sentences contain a correct answer, we aggregated the sentences.

583: Finally, we averaged this expected accuracy across all questions for

584: each answer type.
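
As a worked illustration with invented numbers (not drawn from the annotated data), suppose an answer type covers two questions: the answer sentence for the first contains three candidates of the expected type, the answer sentence for the second contains only one, and each has exactly one correct candidate. Writing $c_q$ and $n_q$ (notation introduced here for illustration) for the number of correct and total candidates for question $q$, the expected score for that type would be
\begin{displaymath}
\frac{1}{2}\sum_{q} \frac{c_q}{n_q} = \frac{1}{2}\left(\frac{1}{3} + \frac{1}{1}\right) = \frac{2}{3} \approx 0.67
\end{displaymath}
Note that this is a mean of per-question ratios, so it generally differs from the total number of correct candidates divided by the total number of candidates (here $2/4 = 0.5$).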

585:

586:

587:

588: \begin{table}[t]

589: \footnotesize

590: \begin{center}

591: \begin{tabular}{|l|l|c|l|c|} \hline

592: & \multicolumn{2}{c|}{\bf TREC} & \multicolumn{2}{c|}{\bf CBC} \\

593: {\it Answer Type} & {\it Score} & {\it Freq} & {\it Score} & {\it Freq} \\ \hline

594: defaultnp      & .33 & 47 & .25 & 28 \\

595: organization   & .50 & 1 &  .72 & 3 \\

596: length         & .50 & 1 &  .75 & 2 \\

597: thingname      & .58 & 14 & .50 & 1 \\

598: quantity       & .58 & 13 & .77 & 14 \\

599: agent          & .63 & 19 & .40 & 23 \\

600: location       & .70 & 24 & .68 & 29  \\

601: personname     & .72 & 11 & .83 & 13 \\

602: city           & .73 & 3 &   n/a & 0 \\

603: defaultvp      & .75 & 2 &  .42 & 15 \\

604: temporal       & .78 & 16 & .75 & 26 \\

605: personnoun     & .79 & 7 &  .53 & 5 \\

606: duration       & 1.0 & 3 &  .67 & 4 \\

607: province       & 1.0 & 2 &  1.0 & 2 \\

608: area           & 1.0 & 1 &   n/a & 0 \\

609: day            & 1.0 & 1 &   n/a & 0 \\

610: title          & n/a & 0 &  .50 & 1 \\

611: person         & n/a & 0 &  .67 & 3 \\

612: money          & n/a & 0 &  .88 & 8 \\

613: ambigbig       & n/a & 0 &  .88 & 4 \\

614: age            & n/a & 0 &  1.0 & 2 \\

615: comparison     & n/a & 0 &  1.0 & 1 \\

616: mass           & n/a & 0 &  1.0 & 1 \\

617: measure        & n/a & 0 &  1.0 & 1 \\  \hline

618: {\bf Overall}    & .59 & 165 &   .61 & 186 \\ \hline

619: {\bf Overall-dflts} & .69 & 116 &  .70 & 143 \\ \hline

620: \end{tabular}

621: \end{center}

622: \caption{Expected scores and frequencies for each answer type}

623: \label{confusability-table}

624: \end{table}

625:

626: Table~\ref{confusability-table} shows that a system with perfect

627: question typing, perfect answer sentence identification, and perfect

628: semantic tagging would still achieve only 59\% accuracy on the TREC-8

629: data. These results reveal that there are often multiple candidates of

630: the same type in a sentence.  For example, {\it Temporal} questions

631: received an expected score of 78\% because there was usually only one

632: date expression per sentence (the correct one), while {\it Default NP}

633: questions yielded an expected score of only 33\% because the answer

634: sentences usually contained several noun phrases.  Some common types were

635: particularly problematic.  {\it Agent} questions (most {\em Who}

636: questions) had an expected score of only 63\%, while {\it Quantity}

637: questions had an expected score of 58\%.

638:

639: The CBC data showed a similar level of answer confusion, with an

640: expected score of 61\%, although the confusability of individual

641: answer types differed from TREC. For example, {\it Agent} questions were even

642: more difficult, receiving a score of 40\%, but {\it Quantity}

643: questions were easier, receiving a score of 77\%.

644:

645: Perhaps a better question analyzer could assign more specific types to

646: the {\it Default NP} and {\it Default VP} questions, which skew the

647: results.  The {\bf Overall-dflts} row of

648: Table~\ref{confusability-table} shows the expected scores without

649: these types, which are still about 70\%, so a great deal of answer

650: confusion remains even without those questions.  The confusability

651: analysis provides insight into the limitations of the answer type set,

652: and may be useful for comparing the effectiveness of different answer

653: type sets (somewhat analogous to the use of grammar perplexity in

654: speech research).
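
As a rough consistency check, computed here from the rounded entries
of Table~\ref{confusability-table} and assuming that the {\bf Overall}
rows are frequency-weighted means of the per-type scores, write $s_t$
and $f_t$ for the score and frequency of type $t$ (symbols introduced
only for this check).  For TREC,
\[
  \frac{\sum_t s_t f_t}{\sum_t f_t} \approx \frac{97.56}{165} \approx .59 ,
\]
and dropping {\it defaultnp} ($.33 \times 47 = 15.51$) and
{\it defaultvp} ($.75 \times 2 = 1.50$) gives
\[
  \frac{97.56 - 15.51 - 1.50}{165 - 47 - 2} = \frac{80.55}{116} \approx .69 ,
\]
both of which agree with the table.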

655:

656: \begin{figure}[htbp]

657: \fbox{

658: \begin{minipage}{2.9in}

659: \footnotesize

660: Q1: {\it What city is Massachusetts General Hospital located in?}

661:

662: A1: It was conducted by a cooperative group of oncologists from Hoag,

663: Massachusetts General Hospital in \underline{{\bf Boston}},

664: \underline{Dartmouth} College in New Hampshire, UC \underline{San Diego} Medical

665: Center, McGill University in \underline{Montreal}

666: and the University of Missouri in \underline{Columbia}. \\

667:

668: Q2: {\it When was Nostradamus born? }

669:

670: A2: Mosley said followers of Nostradamus, who lived from

671: \underline{{\bf 1503}} to \underline{1566},

672: have claimed ...

673: \end{minipage}

674: }

675: \caption{Sentences with Multiple Items of the Same Type}

676: \label{multitypes}

677: \end{figure}

678:

679: However, Figure~\ref{multitypes} shows the fundamental problem behind

680: answer confusability. Many sentences contain multiple instances of the

681: same type, such as lists and ranges. In Q1, recognizing that the

682: question expects a city rather than a general location is still not

683: enough because several cities are in the answer sentence. To achieve

684: better performance, Q/A systems need to use features that can more

685: precisely target an answer.
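
To make the example concrete, suppose that the underlined items in
Figure~\ref{multitypes} are exactly the candidates that a tagger would
propose (an assumption made here for illustration).  A system that
picks uniformly at random among them would then have expected scores of
\[
  E_{\mathrm{Q1}} = \frac{1}{5} = .20 , \qquad
  E_{\mathrm{Q2}} = \frac{1}{2} = .50 ,
\]
since Q1 has five candidates and Q2 has two.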

686:

687:

688:

689:

690:

691:

692:

693:

694:

695:

696:

697:

698:

699:

700: \section{Conclusion}

701:

702: In this paper we have presented four analyses of question answering

703: system performance involving: multiple answer occurrence, relative

704: score for candidate ranking, bounds on term overlap performance, and

705: limitations of answer typing for short answer extraction.  We hope

706: that both the results {\em and} the tools we describe will be useful

707: to others.  In general, we feel that analysis of good performance is

708: nearly as important as the performance itself and that the analysis of

709: bad performance can be equally important.

710:

711:

712:

713: \small

714: \bibliographystyle{acl}

715: \bibliography{riloff,hood}

716: \end{document}
