1: \Section{Answer Extraction: The Core Idea}
2:
3: One of the fields where natural language understanding technology has failed
4: to deliver convincing results is \em text-based question answering. \em
5: Systems which read texts, assimilate their content, and answer freely
6: phrased questions about them would be very useful in a wide variety of
7: applications, particularly so if questions \em and \em texts could be
8: written in unrestricted language. They would be the perfect solution to the
9: problem of information overload in the age of the World Wide Web. However,
10: as the situation is today (and will remain for a long time to come), such
11: systems can be implemented only in very small domains, for extremely small
12: amounts of text, and with very high development costs. One such system,
13: LILOG~(\cite{Herzog:1991}), absorbed well in excess of 60 person-years of
14: work and could, in its final stage, still treat merely a few dozen pages of
15: text. Moreover, it turned out to be extremely costly to port the system from
16: one domain to another, closely related one (many person-months of work). In
17: fact, these types of systems have become prototypical cases of non-scalable
18: laboratory applications with very limited impact on further developments.
19:
20: When it comes to processing larger amounts of text, there have so far
21: been only two serious contenders: Information Retrieval
22: and Information Extraction. Unfortunately, both techniques have
23: serious drawbacks. Standard \em information retrieval \em (IR)
24: techniques allow arbitrary queries over very large document
25: collections (many gigabytes in size) covering arbitrary domains but
26: they usually retrieve \em entire documents \em (this holds true for
27: traditional systems such as SMART~\cite{Salton:1989} as well as for
28: probabilistic ones such as SPIDER~\cite{Schaeuble:1993}). However,
29: this is unhelpful if documents are dozens, or hundreds, of pages long.
30: Sometimes the techniques of IR are also used to retrieve individual
31: passages of documents (one or more sentences, paragraphs)
32: (cf.~\cite{Salton:1993}). In such cases the number of search terms
33: found in a given sentence, together with their density (in terms of
34: closeness in a sentence), is used to find relevant sentences.
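
The following Python fragment is a minimal sketch of such a passage scorer, not the method of any of the systems cited. It assumes that the score of a sentence is the number of distinct query terms it contains, multiplied by their density, taken here as the number of hits divided by the token span between the first and the last hit. The function names \texttt{score\_sentence} and \texttt{rank\_sentences}, as well as this particular formula, are illustrative assumptions.

```python
import re


def score_sentence(sentence, query_terms):
    """Score a sentence by term count and term density (illustrative heuristic).

    Assumption: score = (distinct query terms found) * (hits / span), where
    span is the number of tokens from the first hit to the last hit.
    """
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    wanted = {term.lower() for term in query_terms}
    positions = [i for i, tok in enumerate(tokens) if tok in wanted]
    if not positions:
        return 0.0
    distinct = {tokens[i] for i in positions}
    span = positions[-1] - positions[0] + 1  # equals 1 for a single hit
    density = len(positions) / span
    return len(distinct) * density


def rank_sentences(sentences, query_terms):
    """Return the matching sentences, best score first."""
    scored = [(score_sentence(s, query_terms), s) for s in sentences]
    scored.sort(key=lambda pair: -pair[0])
    return [s for score, s in scored if score > 0]
```

With the query terms ``eject'' and ``media'', a sentence in which both terms occur close together is ranked above one in which they are far apart, and sentences containing neither term are dropped.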
35:
36: Unfortunately, all IR techniques (whether applied to entire documents or to
37: individual passages) have a number of limitations that make them unsuitable
38: for certain important applications. First, they take into account only the
39: content words of a document (all the function words are thrown away).
40: Second, in most cases only the stem of such words is used (and this stem is
41: usually not derived by a proper morphological analysis but by means of some
42: kind of stemmer algorithm, inevitably resulting in numerous spurious
43: ambiguities). Finally, and most importantly, the resulting terms are treated
44: as \em isolated items \em whose unordered combination is used as content
45: model of the original document. This holds for Boolean systems as well as
46: for vector space based systems. Inevitably, neither model can, as such,
47: distinguish the concept of ``computer design'' from that of ``design
48: computer'' (lost ordering information), or the concept of ``export from
49: Germany to the UK'' from that of ``export from the UK to Germany'' (lost
50: function word information). True, most systems can use phrasal search terms
51: (such as ``computer design''), to be found as a whole in the documents, but
52: then a number of relevant documents (such as those containing ``design of
53: computers'') will no longer be retrieved. All this also holds for the (few)
54: passage retrieval systems described in the literature (such as the system
55: described in~\cite{Salton:1993}).
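
The loss of information can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the indexing scheme of any particular IR system: the stop list \texttt{STOP\_WORDS} and the plural-stripping function \texttt{crude\_stem} are toy stand-ins for a real stop list and a real stemmer, and \texttt{index\_terms} and \texttt{phrase\_match} are hypothetical names.

```python
from collections import Counter

# Assumption: a tiny stop list standing in for a real function-word list.
STOP_WORDS = {"a", "an", "the", "of", "from", "to", "in", "for", "and"}


def crude_stem(word):
    """Toy stand-in for a stemmer: strips a final plural 's' only."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def index_terms(text):
    """Return the unordered bag of stemmed content words of a text."""
    words = text.lower().split()
    return Counter(crude_stem(w) for w in words if w not in STOP_WORDS)


def phrase_match(phrase, text):
    """Exact phrase search: the phrase must occur as a contiguous word sequence."""
    p, t = phrase.lower().split(), text.lower().split()
    return any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1))


# Ordering information is lost:
print(index_terms("computer design") == index_terms("design computer"))
# Function word information is lost:
print(index_terms("export from Germany to the UK")
      == index_terms("export from the UK to Germany"))
# A phrasal search term misses a relevant variant:
print(phrase_match("computer design", "design of computers"))
```

The first two comparisons print \texttt{True}, since both members of each pair are reduced to the same bag of terms; the phrase search prints \texttt{False}, since the phrase does not occur as a contiguous sequence in the variant.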
56:
57: \em Information extraction \em (IE) techniques do not suffer from the same
58: shortcomings. They are similar to IR systems in that they, too, are suitable
59: for screening very large text collections (of basically unlimited size, such
60: as streams of messages) covering a potentially wide range of topics.
61: However, they differ from IR systems in that they not only identify certain
62: messages in such a stream (those that fall into a number of specific topics)
63: but also extract from those messages highly specific content data. Typical
64: examples are newswire reports describing terrorist attacks (where they
65: extract the information as to who attacked whom, how and when, what
66: the outcome of the attack was, etc.) or business newswire reports on
67: management succession events (with data on who resigned
68: from what post in which company, who the successor is, etc.). This predefined
69: information is placed into a template, or data base record, defined for the
70: different role fillers of a given type of report.
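
A template of this kind can be pictured as a record with one slot per role filler. The Python sketch below is purely illustrative: the class \texttt{SuccessionTemplate}, its slot names and the helper \texttt{unfilled\_slots} are assumptions made for this example and are not taken from the MUC task definitions.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SuccessionTemplate:
    """One record per management succession report.

    Assumption: the slot names are invented for illustration.
    """
    company: Optional[str] = None
    post: Optional[str] = None
    person_out: Optional[str] = None
    person_in: Optional[str] = None


def unfilled_slots(template):
    """Return the names of the slots for which nothing was extracted."""
    return [f.name for f in fields(template) if getattr(template, f.name) is None]


# Invented example values, not taken from any real report.
record = SuccessionTemplate(company="Acme", post="CEO", person_out="J. Doe")
print(unfilled_slots(record))  # ['person_in']
```

The fixed set of slots also illustrates the restriction of the approach: only information for which a slot has been defined in advance can be extracted.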
71:
72: Clearly, this kind of information is much more precise and specific than
73: what is considered by IR systems. On the other hand, IE systems do not allow
74: for arbitrary questions (as IR systems do). They merely allow for a small
75: number of \em pre-defined \em information frames to be filled. Worse still,
76: the ``Message Understanding Conferences'', which have been driving
77: development in this area since 1987 (the latest so far, with published
78: proceedings, is MUC-6~\cite{ARPA:1996}), put so much emphasis on very large
79: text volumes that most of the participating systems that had used, at
80: first, a thorough linguistic analysis had to abandon it and adopt a
81: very shallow approach instead, simply because of run-time requirements for
82: such volumes of data~(e.g. \cite{Appelt:1993}). This approach, which is now taken
83: by most systems taking part in MUCs, makes the systems increasingly less
84: general.
85:
86: However, there is a growing need today for systems that can locate
87: information in texts that do \em not \em run into the gigabytes, that
88: show very high precision and recall, and that
89: furthermore allow \em arbitrarily phrased \em questions. Moreover, they
90: should be able to cope with documents written in syntactically \em
91: unrestricted \em natural language, whereas the \em domain \em of the texts is
92: normally quite \em restricted \em. Examples of such systems are interfaces
93: to machine-readable technical manuals, on-line help systems for complex
94: software, help desk systems in large organisations, and public inquiry
95: systems accessible over the Internet. For these tasks, very high precision
96: of retrieval is mandatory (queries may be very specific), often near perfect
97: recall is vital (technical manuals typically explain things only once), and
98: sometimes retrieval time is mission critical (retrieving information about a
99: system about to get out of control). What is needed in such situations is a
100: system that pinpoints the exact \em phrase(s) \em in a document (collection)
101: from whose \em meaning \em we can infer the answer to a specific question.
102: This is the core idea of Answer Extraction (AE). Since we need to determine
103: the meaning of sentences (questions and texts) we must use (a limited degree
104: of) linguistic (syntactic and semantic) information, which is expensive, but
105: on the other hand the texts to be processed are moderately sized (some
106: hundreds of kilobytes, sometimes a few megabytes), and they typically cover
107: a very limited domain. This makes Answer Extraction a realistic compromise
108: between full question answering on the one hand, and mere information
109: extraction or information retrieval on the other.
110:
111: We will describe an Answer Extraction system, ``ExtrAns'', for questions
112: about (a subset of) the on-line Unix manual (the so-called ``man pages'').
113: Although the system is, for the time being, functional only as a prototype
114: it can cope with unedited text and arbitrary questions, with performance
115: degrading gracefully if input (documents or questions) cannot be analysed
116: completely. It is incrementally extensible in the sense that refinements of
117: the grammar and/or the semantic component automatically improve precision
118: and recall, without the need to change any other components of the system.
119:
120: \Section{Requirements and components}
121: \label{sec:arch}
122:
123: Given the fact that an Answer Extraction system should be able to cope with
124: unrestricted text, it needs a very reliable tokeniser, a grammar of
125: considerable coverage, a reasonably robust parser, some way of dealing with
126: ambiguities, a module that can subject even fragments of syntax structures
127: to a semantic analysis, and a search engine capable of using the resulting
128: knowledge base.
129:
130: In the following we will describe merely three components of the
131: system in some detail. \em First, \em we will point out that
132: preprocessing technical language goes well beyond what a typical
133: tokeniser does. We will not describe the syntax analysis module, for
134: which we use and extend an existing dependency-oriented system that
135: comes with a full form lexicon, a grammar, and a parser, viz. Sleator
136: and Temperley's ``Link Grammar''~\cite{Sleator:1991}. It has certain
137: built-in capabilities for robust parsing, which we supplement by a
138: fall-back strategy that turns unrecognised constituents into keywords
139: (thus resorting to an IR-type behaviour). \em Second, \em we will
140: describe the design principles for the semantic representations
141: derived from the (very specific type of) syntax structures produced by
142: Link Grammar. We will not explain in depth the disambiguation module,
143: for which we adopt and extend the approach put forward by Brill
144: and Resnik~\cite{Brill:1994}. \em Third, \em we will show how we cope
145: with the syntactic ambiguities that survive all our disambiguation
146: activities. We will also explain the search strategy very briefly.
147:
148: \Section{Preprocessing technical language} % Jawad
149: \label{sec:tokeniser}
150:
151: The analysis of technical language is, in general, considerably simpler than
152: that of domain unspecific language (newspapers etc.) but as far as
153: preprocessing is concerned it is far more demanding. This holds, in
154: particular, for tokenisation, normalisation, and document structure
155: analysis.
156:
157: \SubSection{Tokenisation and normalisation}
158:
159: In general, tokenising a text means merely identifying word forms and
160: sentences. However, in highly technical documents such as the Unix man
161: pages, this may become a formidable task. Apart from regular word forms, the
162: ExtrAns tokeniser has to recognise all of the following as tokens and
163: represent them as normalised expressions:
164:
165: {\raggedright
166: \paragraph{Command names:} eject, nice (problem: identify regular words
167: when used as names of commands in sentences like ``\bfseries eject
168: \mdseries is used for...'', as opposed to their standard use, as
169: in ``It is not recommended to physically eject media ...'').
170:
171: \paragraph{Path names and absolute file names:} /usr/bin/X11; usr/5bin/ls,
172: /etc/hostname.le (problems: leading, trailing and internal
173: slashes, numbers and periods).
174:
175: \paragraph{Options of commands:} -C, -ww, -dFinUv (problem: identify where
176: a sequence preceded by a dash is an option and where not, as in
177: ``... whose name ends with .gz, -gz, .z, -z, \_z or .Z and which
178: begins ...'').
179:
180: \paragraph{Named variables:} \em filename1, device, nickname \em (problems:
181: identify words used as named variables, mostly as arguments of
182: commands as in ``... the first \em mm \em is the hour number; \em dd
183: \em is the day ...'').
184:
185: \paragraph{Special characters as parts of tokens:} AF\_UNIX, sun\_path,
186: \^{}S(CTRL-S), K\&R, C++, name@domain or
187: \%, \%\% (as in: ``A single \%
188: is encoded by \%\%.''), various punctuation marks (as in: ``...
189: corresponding to cat? or fmt?'', or in ``/usr/man/man?'',
190: ``\(<\)signal.h\(>\)'', or
191: ``[host!...host!]host!username'').
192:
193: }
194:
195: \vspace{1ex}
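
A first approximation of such a tokeniser can be sketched with regular expressions. The Python fragment below is an illustration only and not the ExtrAns tokeniser: the token classes, the pattern \texttt{TOKEN\_RE} and the function \texttt{tokenise} are assumptions made for this example, and the patterns cover only some of the token types listed above.

```python
import re

# Assumption: four coarse token classes, tried in this order.
TOKEN_RE = re.compile(
    r"""
      (?P<path>/?(?:[\w.-]+/)+(?:[\w?-]+(?:\.[\w?-]+)*)?)  # /usr/bin/X11, usr/5bin/ls
    | (?P<option>(?<![\w-])-[A-Za-z]+)                     # -C, -ww, -dFinUv
    | (?P<word>\w[\w@]*(?:\.\w+)*\+*)                      # AF_UNIX, C++, name@domain
    | (?P<other>\S)                                        # any other single character
    """,
    re.VERBOSE,
)


def tokenise(text):
    """Return (class, token) pairs; whitespace is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


for kind, token in tokenise("Use -C with /usr/bin/X11; see /etc/hostname.le."):
    print(kind, token)
```

A sentence-final period is kept apart from a preceding file name because a period is accepted inside a name only when further name characters follow it. The sketch deliberately leaves the contextual problems open: ``-gz'' in ``ends with .gz, -gz'' is still labelled as an option, and a regular word such as ``and/or'' would be labelled as a path. Such cases cannot be decided from the shape of the token alone.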
196:
197: Normalising such tokens means, among other things, appropriately marking
198: special tokens such as command names (otherwise the parser chokes on them).
199: Luckily, the Unix man pages contain a considerable amount of useful
200: information beyond the purely textual level, namely the information conveyed
201: by the \em formatting commands \em. Thus command names are, as a rule,
202: printed in boldface, and expressions used as variables, in italics, as in
203: \begin{lquote}
204: \textbf{compress} [ -cfv ] [ -b \em bits \em ] [ \em filename \em...]
205: \end{lquote}
206:
207: This type of information is extracted from the formatting commands and added
208: to the tokens for later modules to use (e.g. ``eject'', when used as the
209: name of a command, is turned into ``eject.com'', and ``filename'', when used
210: as an argument, into ``filename.arg'').
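
The marking step can be sketched as follows. The fragment assumes that the man page source marks boldface and italics with the troff font escapes \verb|\fB| and \verb|\fI|, closed by \verb|\fR| or \verb|\fP|, and it ignores the macro forms of the same formatting commands. The names \texttt{FONT\_RE}, \texttt{SUFFIX} and \texttt{normalise\_fonts} are hypothetical, and the sketch is not the ExtrAns implementation.

```python
import re

# Assumption: inline troff font escapes; bold = command, italic = argument.
FONT_RE = re.compile(r"\\f([BI])(.+?)\\f[RP]")
SUFFIX = {"B": ".com", "I": ".arg"}


def normalise_fonts(line):
    """Replace bold and italic spans by suffixed tokens (eject -> eject.com)."""
    def mark(match):
        font, text = match.group(1), match.group(2)
        return " ".join(word + SUFFIX[font] for word in text.split())
    return FONT_RE.sub(mark, line)


print(normalise_fonts(r"\fBeject\fR is used for ejecting \fIdevice\fR"))
# eject.com is used for ejecting device.arg
```

Treating every boldface span as a command name is a simplification; it follows the rule of thumb stated above and would need refining wherever boldface is used for other purposes.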
211:
212: \SubSection{Document structure analysis}
213:
214: The formatting instructions in the Unix man pages are, unfortunately,
215: used in a fairly unsystematic fashion (these pages were written by
216: dozens of different persons). In order to extract additional
217: information about tokens from the formatting (see above), the
218: tokeniser must make up for these inconsistencies in the source texts.
219: It does so by performing a considerable amount of document structure
220: analysis. For instance, it has to collect all command names and
221: argument names from the SYNOPSIS and NAME sections of each manual page
222: to be sure that it will recognise all of them in the body of the
223: DESCRIPTION section, even if formatted incorrectly. Thus, processing a
224: man page becomes a case of processing each of its sections in a
225: particular way.
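
The section-wise processing can be sketched in Python under the following assumptions: sections are introduced by the \texttt{.SH} macro, and names in the NAME and SYNOPSIS sections are marked with the font escapes \verb|\fB| (commands) and \verb|\fI| (arguments). The functions \texttt{split\_sections}, \texttt{collect\_names} and \texttt{mark\_description} are hypothetical names for this illustration and do not reproduce the ExtrAns tokeniser.

```python
import re


def split_sections(source):
    """Map each .SH heading to the text that follows it."""
    sections, current = {}, None
    for line in source.splitlines():
        if line.startswith(".SH "):
            current = line[4:].strip().strip('"')
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body) for name, body in sections.items()}


def collect_names(sections):
    """Gather command (bold) and argument (italic) names from NAME and SYNOPSIS."""
    commands, arguments = set(), set()
    for name in ("NAME", "SYNOPSIS"):
        text = sections.get(name, "")
        commands.update(re.findall(r"\\fB([A-Za-z][\w.-]*)\\f[RP]", text))
        arguments.update(re.findall(r"\\fI([A-Za-z][\w.-]*)\\f[RP]", text))
    return commands, arguments


def mark_description(sections):
    """Mark known names in DESCRIPTION, whether or not they are formatted there."""
    commands, arguments = collect_names(sections)

    def mark(match):
        word = match.group(0)
        if word in commands:
            return word + ".com"
        if word in arguments:
            return word + ".arg"
        return word

    plain = re.sub(r"\\f[BIRP]", "", sections.get("DESCRIPTION", ""))
    return re.sub(r"[A-Za-z][\w-]*", mark, plain)
```

Applied to a page whose DESCRIPTION mentions ``eject'' and ``device'' without any formatting, the sketch still yields ``eject.com'' and ``device.arg'', because both names were collected from the earlier sections. A plain lookup of this kind cannot tell the command ``eject'' from the verb in ``physically eject media''; that distinction requires context.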
226:
227:
228:
229:
230:
231: