\Section{Answer Extraction: The Core Idea}

One of the fields where natural language understanding technology has failed
to deliver convincing results is \em text-based question answering. \em
Systems that read texts, assimilate their content, and answer freely
phrased questions about them would be very useful in a wide variety of
applications, particularly so if questions \em and \em texts could be
written in unrestricted language. They would be the perfect solution to the
problem of information overload in the age of the World Wide Web. However,
as the situation is today (and will remain for a long time to come), such
systems can be implemented only in very small domains, for extremely small
amounts of text, and with very high development costs. One such system,
LILOG~(\cite{Herzog:1991}), absorbed well in excess of 60 person-years of
work and could, in its final stage, still treat merely a few dozen pages of
text. Moreover, it turned out to be extremely costly to port the system from
one domain to another, closely related one (many person-months of work). In
fact, these types of systems have become prototypical cases of non-scalable
laboratory applications with very limited impact on further developments.

When it comes to processing larger amounts of text, there have so far been
only two serious contenders: information retrieval and information
extraction. Unfortunately, both techniques have serious drawbacks.
Standard \em information retrieval \em (IR) techniques allow arbitrary
queries over very large document collections (many gigabytes in size)
covering arbitrary domains, but they usually retrieve \em entire documents
\em (this holds true for traditional systems such as
SMART~\cite{Salton:1989} as well as for probabilistic ones such as
SPIDER~\cite{Schaeuble:1993}). This is unhelpful if documents are dozens,
or hundreds, of pages long. Sometimes the techniques of IR are also used to
retrieve individual passages of documents (one or more sentences or
paragraphs; cf.~\cite{Salton:1993}). In such cases the number of search
terms found in a given sentence, together with their density (how close
together they occur in the sentence), is used to find relevant sentences.
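
The sentence scoring mentioned above can be illustrated by a minimal Python
sketch. It is not the method of the cited systems: the function names and,
in particular, the density measure (matched positions divided by the token
span they cover) are assumptions made purely for illustration.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_sentence(sentence, query_terms):
    """Return (distinct terms matched, density) for one sentence.

    Density is an assumed measure: the number of matched positions divided
    by the length of the token span from the first to the last match.
    """
    tokens = tokenize(sentence)
    positions = [i for i, tok in enumerate(tokens) if tok in query_terms]
    if not positions:
        return (0, 0.0)
    span = positions[-1] - positions[0] + 1
    distinct = len({tokens[i] for i in positions})
    return (distinct, len(positions) / span)


def rank_sentences(sentences, query):
    """Sort sentences from best to worst match for the query."""
    terms = set(tokenize(query))
    return sorted(sentences,
                  key=lambda s: score_sentence(s, terms),
                  reverse=True)
```

Sentences are compared first on how many distinct query terms they contain
and only then on density, so a sentence with all terms spread far apart
still ranks above one that repeats a single term.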

Unfortunately, all IR techniques (whether applied to entire documents or to
individual passages) have a number of limitations that make them unsuitable
for certain important applications. First, they take into account only the
content words of a document (all the function words are thrown away).
Second, in most cases only the stem of such words is used (and this stem is
usually not derived by a proper morphological analysis but by some kind of
stemming algorithm, inevitably resulting in numerous spurious
ambiguities). Finally, and most importantly, the resulting terms are treated
as \em isolated items \em whose unordered combination is used as the content
model of the original document. This holds for Boolean systems as well as
for systems based on the vector space model. Inevitably, neither model can,
as such, distinguish the concept of ``computer design'' from that of
``design computer'' (lost ordering information), or the concept of ``export
from Germany to the UK'' from that of ``export from the UK to Germany''
(lost function word information). True, most systems can use phrasal search
terms (such as ``computer design''), to be found as a whole in the
documents, but then a number of relevant documents (such as those containing
``design of computers'') will no longer be retrieved. All this also holds
for the (few) passage retrieval systems described in the literature (such as
the system described in~\cite{Salton:1993}).
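
The two kinds of information loss can be made concrete with a small Python
sketch. The stop list and the helper names are hypothetical, and no stemming
is applied; the sketch only models a document as an unordered collection of
content words, as described above.

```python
from collections import Counter

# Assumed, deliberately tiny stop list of function words.
STOP_WORDS = {"the", "of", "from", "to", "a", "an"}


def bag_of_words(text):
    """Return the unordered multiset of content words in the text."""
    words = text.lower().split()
    return Counter(w for w in words if w not in STOP_WORDS)


# Lost ordering information: both phrases yield the same bag.
same_order = bag_of_words("computer design") == bag_of_words("design computer")

# Lost function word information: once "from", "to" and "the" are dropped,
# the direction of the export is no longer recoverable.
same_direction = (bag_of_words("export from Germany to the UK")
                  == bag_of_words("export from the UK to Germany"))


def contains_phrase(text, phrase):
    """Exact phrase search: the phrase's words must appear contiguously."""
    words, target = text.lower().split(), phrase.lower().split()
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))
```

Both comparisons evaluate to true, that is, the model cannot tell the
members of each pair apart. The phrase search keeps the ordering but misses
a document that says ``design of computers''.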

\em Information extraction \em (IE) techniques do not suffer from the same
shortcomings. They are similar to IR systems in that they, too, are suitable
for screening very large text collections (of basically unlimited size, such
as streams of messages) covering a potentially wide range of topics.
However, they differ from IR systems in that they not only identify certain
messages in such a stream (those that fall into a number of specific topics)
but also extract highly specific content data from those messages. Typical
examples are newswire reports describing terrorist attacks (where they
extract who attacked whom, how and when, what the outcome of the attack
was, etc.) or business newswire reports on management succession events
(with data on who resigned from which post in which company, who the
successor is, etc.). This predefined information is placed into a template,
or database record, with slots defined for the different role fillers of a
given type of report.
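
Template filling of this kind might be sketched as follows in Python. The
template fields, the single surface pattern and the example sentence are all
hypothetical; systems of the kind described above rely on many patterns and
on far more robust name recognition.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuccessionTemplate:
    """Pre-defined frame for one management succession event."""
    person_out: str
    post: str
    company: str
    successor: Optional[str] = None


# Hypothetical surface pattern for one way of reporting a resignation.
PATTERN = re.compile(
    r"(?P<person_out>[A-Z]\w+ [A-Z]\w+) resigned as (?P<post>[\w ]+?)"
    r" of (?P<company>[A-Z][\w ]+?)"
    r"(?: and will be succeeded by (?P<successor>[A-Z]\w+ [A-Z]\w+))?\."
)


def fill_template(message):
    """Return a filled template, or None if the message does not match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    # Group names coincide with the template's field names.
    return SuccessionTemplate(**match.groupdict())
```

A message that does not fit the pattern yields no record at all, and the
record can only ever hold the four slots fixed in advance.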

Clearly, this kind of information is much more precise and specific than
what is considered by IR systems. On the other hand, IE systems do not allow
for arbitrary questions (as IR systems do). They merely allow for a small
number of \em pre-defined \em information frames to be filled. Worse still,
the ``Message Understanding Conferences'', which have been driving
development in this area since 1987 (the latest so far, with published
proceedings, is MUC-6~\cite{ARPA:1996}), put so much emphasis on very large
text volumes that most of the participating systems that had at first used
a thorough linguistic analysis had to abandon it and adopt a very shallow
approach instead, simply because of the run-time requirements for such
volumes of data~(e.g.\ \cite{Appelt:1993}). This approach, which is now
taken by most systems taking part in MUCs, makes the systems increasingly
less general.

However, there is a growing need today for systems that locate information
in texts \em not \em running into the gigabytes, but that show very high
precision and recall and, furthermore, allow \em arbitrarily phrased \em
questions. Moreover, they should be able to cope with documents written in
syntactically \em unrestricted \em natural language, whereas the \em domain
\em of the texts is normally quite \em restricted\em. Examples of such
systems are interfaces to machine-readable technical manuals, on-line help
systems for complex software, help desk systems in large organisations, and
public inquiry systems accessible over the Internet. For these tasks, very
high precision of retrieval is mandatory (queries may be very specific),
often near-perfect recall is vital (technical manuals typically explain
things only once), and sometimes retrieval time is mission-critical
(retrieving information about a system about to get out of control). What is
needed in such situations is a system that pinpoints the exact \em phrase(s)
\em in a document (collection) from whose \em meaning \em we can infer the
answer to a specific question. This is the core idea of Answer Extraction
(AE). Since we need to determine the meaning of sentences (questions and
texts), we must use (a limited degree of) linguistic (syntactic and
semantic) information, which is expensive. On the other hand, the texts to
be processed are moderately sized (some hundreds of kilobytes, sometimes a
few megabytes), and they typically cover a very limited domain. This makes
Answer Extraction a realistic compromise between full question answering on
the one hand, and mere information extraction or information retrieval on
the other.

We will describe an Answer Extraction system, ``ExtrAns'', for questions
about (a subset of) the on-line Unix manual (the so-called ``man pages'').
Although the system is, for the time being, functional only as a prototype,
it can cope with unedited text and arbitrary questions, with performance
degrading gracefully if input (documents or questions) cannot be analysed
completely. It is incrementally extensible in the sense that refinements of
the grammar and/or the semantic component automatically improve precision
and recall, without the need to change any other components of the system.

\Section{Requirements and components}
\label{sec:arch}

Given that an Answer Extraction system should be able to cope with
unrestricted text, it needs a very reliable tokeniser, a grammar of
considerable coverage, a reasonably robust parser, some way of dealing with
ambiguities, a module that can subject even fragments of syntax structures
to a semantic analysis, and a search engine capable of using the resulting
knowledge base.

In the following we will describe only three components of the system in
some detail. \em First, \em we will point out that preprocessing technical
language goes well beyond what a typical tokeniser does. We will not
describe the syntax analysis module, for which we use and extend an existing
dependency-oriented system that comes with a full-form lexicon, a grammar,
and a parser, viz.\ Sleator and Temperley's ``Link
Grammar''~\cite{Sleator:1991}. It has certain built-in capabilities for
robust parsing, which we supplement by a fall-back strategy that turns
unrecognised constituents into keywords (thus resorting to IR-type
behaviour). \em Second, \em we will describe the design principles for the
semantic representations derived from the (very specific type of) syntax
structures produced by Link Grammar. We will not explain in depth the
disambiguation module, for which we adopt and extend the approach put
forward by Brill and Resnik~\cite{Brill:1994}. \em Third, \em we will show
how we cope with the syntactic ambiguities that survive all our
disambiguation activities. We will also explain the search strategy very
briefly.

\Section{Preprocessing technical language} % Jawad
\label{sec:tokeniser}

The analysis of technical language is, in general, considerably simpler than
that of domain-unspecific language (newspapers etc.), but as far as
preprocessing is concerned it is far more demanding. This holds, in
particular, for tokenisation, normalisation, and document structure
analysis.

\SubSection{Tokenisation and normalisation}

In general, tokenising a text means merely identifying word forms and
sentences. However, in highly technical documents such as the Unix man
pages, this may become a formidable task. Apart from regular word forms, the
ExtrAns tokeniser has to recognise all of the following as tokens and
represent them as normalised expressions:

{\raggedright
\paragraph{Command names:} eject, nice (problem: identify regular words
    when used as names of commands in sentences like ``\bfseries eject
    \mdseries is used for...'', as opposed to their standard use, as
    in ``It is not recommended to physically eject media ...'').

\paragraph{Path names and absolute file names:} /usr/bin/X11; usr/5bin/ls,
    /etc/hostname.le (problems: leading, trailing and internal
    slashes, numbers and periods).

\paragraph{Options of commands:} -C, -ww, -dFinUv (problem: identify where
    a sequence preceded by a dash is an option and where not, as in
    ``... whose name ends with .gz, -gz, .z, -z, \_z or .Z and which
    begins ...'').

\paragraph{Named variables:} \em filename1, device, nickname \em (problem:
    identify words used as named variables, mostly as arguments of
    commands, as in ``... the first \em mm \em is the hour number; \em dd
    \em is the day ...'').

\paragraph{Special characters as parts of tokens:} AF\_UNIX, sun\_path,
    \^{}S (CTRL-S), K\&R, C++, name@domain or
    \%, \%\% (as in: ``A single \%
    is encoded by \%\%.''), various punctuation marks (as in: ``...
    corresponding to cat? or fmt?'', or in ``/usr/man/man?'',
    ``\(<\)signal.h\(>\)'', or
    ``[host!...host!]host!username'').

}

\vspace{1ex}
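
To give an impression of what recognising such tokens involves, here is a
minimal Python sketch based on a single regular expression. The token
classes and their patterns are assumptions made for illustration; the actual
ExtrAns rules are not given here and are certainly more elaborate. Only
absolute path names are handled.

```python
import re

# Assumed token classes, tried in this order at every position.
TOKEN_PATTERN = re.compile(r"""
    (?P<path>(?:/[\w.?+-]+)+/?)        # absolute paths such as /usr/bin/X11
  | (?P<option>(?<![\w.])-[A-Za-z]+)   # options such as -C, -ww, -dFinUv
  | (?P<word>\w[\w+@]*)                # words, incl. AF_UNIX, C++, name@domain
  | (?P<other>[^\w\s])                 # any other single non-space character
""", re.VERBOSE)


def tokenise(text):
    """Return (token, class) pairs; whitespace between tokens is skipped."""
    return [(m.group(), m.lastgroup) for m in TOKEN_PATTERN.finditer(text)]
```

A purely pattern-based pass of this kind also labels the suffix -gz in the
example sentence quoted above as an option. Telling the two uses apart
requires context, which is exactly the difficulty pointed out in the list.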

Normalising such tokens means, among other things, marking special tokens
such as command names appropriately (otherwise the parser chokes on them).
Luckily, the Unix man pages contain a considerable amount of useful
information beyond the purely textual level, namely the information conveyed
by the \em formatting commands\em. Thus command names are, as a rule,
printed in boldface, and expressions used as variables in italics, as in
\begin{lquote}
  \textbf{compress} [ -cfv ] [ -b \em bits \em ] [ \em filename \em...]
\end{lquote}

This type of information is extracted from the formatting commands and added
to the tokens for later modules to use (e.g.\ ``eject'', when used as the
name of a command, is turned into ``eject.com'', and ``filename'', when used
as an argument, into ``filename.arg'').
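
The following Python sketch illustrates this kind of normalisation under a
simplifying assumption: that fonts are marked in the source with the inline
roff escapes for bold, italic, roman and previous font. Man page sources
also use macro requests for the same purpose, which the sketch ignores. The
function and constant names are hypothetical; only the two suffixes are
taken from the examples above.

```python
import re

# Assumption: fonts are switched with the roff escapes \fB (bold) and
# \fI (italic), and switched back with \fR (roman) or \fP (previous).
FONT_SPAN = re.compile(r"\\f([BI])(\w+)\\f[RP]")

# Bold words are taken to be command names, italic words to be arguments.
SUFFIX = {"B": ".com", "I": ".arg"}


def normalise(line):
    """Replace font-marked words by suffixed tokens and drop the markup."""
    return FONT_SPAN.sub(lambda m: m.group(2) + SUFFIX[m.group(1)], line)
```

Words without font markup pass through unchanged, so an ordinary use of
``eject'' as a verb is left alone.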

\SubSection{Document structure analysis}

The formatting instructions in the Unix man pages are, unfortunately,
used in a fairly unsystematic fashion (these pages were written by
dozens of different authors). In order to extract additional
information about tokens from the formatting (see above), the
tokeniser must make up for these inconsistencies in the source texts.
It does so by performing a considerable amount of document structure
analysis. For instance, it has to collect all command names and
argument names from the SYNOPSIS and NAME sections of each manual page
to be sure that it will recognise all of them in the body of the
DESCRIPTION section, even if they are formatted incorrectly. Thus,
processing a man page becomes a case of processing each of its sections
in a particular way.
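
A much simplified Python sketch of this section-wise processing is given
below. It assumes a plain-text page whose section headings stand alone on
upper-case lines, and it uses a crude heuristic for spotting argument names.
All names are hypothetical, and unlike the tokeniser described above the
sketch marks every occurrence of a collected name, without checking whether
the word is actually used as a command or argument.

```python
import re

SECTION_HEADING = re.compile(r"^[A-Z][A-Z ]+$")


def split_sections(page_text):
    """Map each upper-case section heading to the text that follows it."""
    sections, current = {}, None
    for line in page_text.splitlines():
        if SECTION_HEADING.match(line.strip()):
            current = line.strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body) for name, body in sections.items()}


def collect_names(sections):
    """Gather command names from NAME and argument names from SYNOPSIS."""
    # NAME lines look like "compress, uncompress - compress and expand files".
    name_part = sections.get("NAME", "").split(" - ")[0]
    commands = {c.strip() for c in name_part.split(",") if c.strip()}
    # Assumed heuristic: lower-case words in SYNOPSIS that are neither
    # commands nor part of an option are argument names.
    words = re.findall(r"(?<![-\w])[a-z]\w*", sections.get("SYNOPSIS", ""))
    return commands, set(words) - commands


def mark_description(sections):
    """Suffix known names in DESCRIPTION, whatever their formatting."""
    commands, arguments = collect_names(sections)

    def mark(match):
        word = match.group()
        if word in commands:
            return word + ".com"
        if word in arguments:
            return word + ".arg"
        return word

    return re.sub(r"\w+", mark, sections.get("DESCRIPTION", ""))
```

Because the names are collected from NAME and SYNOPSIS first, a command or
argument is still recognised in DESCRIPTION when its font markup there is
missing or wrong.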
