cs0106015/main.tex
1: \documentclass[10pt]{article}
2: \usepackage{acl2001,times}
3: \setlength\titlebox{6.5cm}    % Expanding the titlebox
4: 
5: \title{Organizing Encyclopedic Knowledge based on the Web and its
6: Application to Question Answering}
7: 
8: \author{Atsushi Fujii \\
9:   University of Library and \\ Information Science \\
10:   1-2 Kasuga, Tsukuba \\
11:   305-8550, Japan \\
12:   CREST, Japan Science and \\ Technology Corporation \\
13:   {\tt fujii@ulis.ac.jp} \And
14:   Tetsuya Ishikawa \\
15:   University of Library and \\ Information Science \\
16:   1-2 Kasuga, Tsukuba \\
17:   305-8550, Japan \\
18:   {\tt ishikawa@ulis.ac.jp}
19: }
20: 
21: \date{}
22: 
23: \newcommand{\etal}{et~al.}
24: \newcommand{\etaleos}{et~al}
25: \newcommand{\eq}[1]{(\ref{#1})}
26: \renewcommand{\nocite}[1]{\shortcite{#1}}
27: \input{psfig.tex}
28: 
29: \begin{document}
30: \maketitle
31: \begin{abstract}
32:   We propose a method to generate large-scale encyclopedic knowledge,
33:   which is valuable for much NLP research, based on the Web. We first
34:   search the Web for pages containing a term in question.  Then we use
35:   linguistic patterns and HTML structures to extract text fragments
36:   describing the term. Finally, we organize extracted term
37:   descriptions based on word senses and domains. In addition, we apply
38:   an automatically generated encyclopedia to a question answering
39:   system targeting the Japanese Information-Technology Engineers
40:   Examination.
41: \end{abstract}
42: 
43: \section{Introduction}
44: \label{sec:introduction}
45: 
46: Reflecting the growth in utilization of the World Wide Web, a number
47: of Web-based language processing methods have been proposed within the
48: natural language processing (NLP), information retrieval (IR) and
49: artificial intelligence (AI) communities. A sample of these includes
50: methods to {\em extract\/} linguistic
51: resources~\cite{fujii:acl-2000,resnik:acl-99,soderland:kdd-97}, {\em
52: retrieve\/} useful information in response to user
53: queries~\cite{etzioni:ai-magazine-97,mccallum:ijcai-99} and {\em
54: mine/discover\/} knowledge latent in the Web~\cite{inokuchi:pakdd-99}.
55: 
56: In this paper, mainly from an NLP point of view, we explore a method
57: to produce linguistic resources. Specifically, we enhance the method
58: proposed by Fujii and Ishikawa~\shortcite{fujii:acl-2000}, which
59: extracts encyclopedic knowledge (i.e., term descriptions) from the
60: Web.
61: 
62: In brief, their method searches the Web for pages containing a term in
63: question, and uses linguistic expressions and HTML layouts to extract
64: fragments describing the term. They also use a language model to
65: discard non-linguistic fragments.  In addition, a clustering method is
66: used to divide descriptions into a specific number of groups.
67: 
On the one hand, their method is expected to enhance existing
encyclopedias, where vocabulary size is relatively limited, and
therefore the {\em quantity\/} problem has been resolved.
71: 
72: On the other hand, encyclopedias extracted from the Web are not
73: comparable with existing ones in terms of {\em quality}.  In
74: hand-crafted encyclopedias, term descriptions are carefully organized
75: based on domains and word senses, which are especially effective for
76: human usage.  However, the output of Fujii's method is simply a set of
77: unorganized term descriptions.  Although clustering is optionally
78: performed, resultant clusters are not necessarily related to explicit
79: criteria, such as word senses and domains.
80: 
81: To sum up, our belief is that by combining {\em extraction\/} and {\em
82: organization\/} methods, we can enhance both quantity and quality of
83: Web-based encyclopedias.
84: 
85: Motivated by this background, we introduce an organization model to
86: Fujii's method and reformalize the whole framework.  In other words,
87: our proposed method is not only extraction but {\em generation\/} of
88: encyclopedic knowledge.
89: 
90: Section~\ref{sec:system_design} explains the overall design of our
91: encyclopedia generation system, and Section~\ref{sec:organization}
92: elaborates on our organization model.  Section~\ref{sec:application}
93: then explores a method for applying our resultant encyclopedia to NLP
94: research, specifically, question answering.
Section~\ref{sec:experimentation} describes a number of experiments
that evaluate our methods.
97: 
98: \section{System Design}
99: \label{sec:system_design}
100: 
101: \subsection{Overview}
102: \label{subsec:system_overview}
103: 
104: Figure~\ref{fig:system} depicts the overall design of our system,
105: which generates an encyclopedia for input terms.
106: 
107: Our system, which is currently implemented for Japanese, consists of
108: three modules: ``retrieval,'' ``extraction'' and ``organization,''
109: among which the organization module is newly introduced in this paper.
110: In principle, the remaining two modules (``retrieval'' and
111: ``extraction'') are the same as proposed by Fujii and
112: Ishikawa~\shortcite{fujii:acl-2000}.
113: 
In Figure~\ref{fig:system}, terms can be submitted either on-line or
off-line. A reasonable approach is for the system to update the
encyclopedia periodically off-line, while terms not yet indexed in the
encyclopedia are processed dynamically during real-time usage.  In
either case, our system processes input terms one by one.
119: 
120: We briefly explain each module in the following three sections,
121: respectively.
122: 
123: \begin{figure}[htbp]
124:   \begin{center}
125:     \leavevmode
126:     \psfig{file=system.eps,height=2.5in}
127:   \end{center}
128:   \caption{The overall design of our Web-based encyclopedia generation
129:     system.}
130:   \label{fig:system}
131: \end{figure}
132: 
133: \subsection{Retrieval}
134: \label{subsec:retrieval}
135: 
136: The retrieval module searches the Web for pages containing an input
137: term, for which existing Web search engines can be used, and those
138: with broad coverage are desirable.
139: 
140: However, search engines performing query expansion are not always
141: desirable, because they usually retrieve a number of pages which do
142: not contain an input keyword.  Since the extraction module (see
143: Section~\ref{subsec:extraction}) analyzes the usage of the input term
144: in retrieved pages, pages not containing the term are of no use for our
145: purpose.
146: 
147: Thus, we use as the retrieval module ``Google,'' which is one of the
148: major search engines and does not conduct query
149: expansion\footnote{http://www.google.com/}.
150: 
151: \subsection{Extraction}
152: \label{subsec:extraction}
153: 
In the extraction module, given Web pages containing an input term,
we first discard newline codes, redundant white spaces and HTML tags
that are not used in the following processes, in order to standardize
the page format.
158: 
159: Second, we approximately identify a region describing the term in the
160: page, for which two rules are used.
161: 
162: The first rule is based on Japanese linguistic patterns typically used
163: for term descriptions, such as ``X {\it toha\/} Y {\it dearu\/} (X is
164: Y).''  Following the method proposed by Fujii and
165: Ishikawa~\shortcite{fujii:acl-2000}, we semi-automatically produced 20
166: patterns based on the Japanese CD-ROM World
167: Encyclopedia~\cite{heibonsha:98}, which includes approximately 80,000
entries related to various fields.  It is expected that a region
including a sentence that matches one of those patterns can be
a term description.
171: 
172: The second rule is based on HTML layout. In a typical case, a term in
173: question is highlighted as a heading with tags such as \verb|<DT>|,
174: \verb|<B>| and \verb|<Hx>| (``\verb|x|'' denotes a digit), followed by
175: its description. In some cases, terms are marked with the anchor
176: \verb|<A>| tag, providing hyperlinks to pages where they are
177: described.
178: 
Finally, based on the region roughly identified by the above method,
we extract a page fragment as a term description. Since term
descriptions usually consist of a logical segment (such as a
paragraph) rather than a single sentence, we extract a fragment that
matches one of the following patterns, which are sorted in descending
order of preference:
185: \begin{enumerate}
186: \item description tagged with \verb|<DD>| in the case where the term
187:   is tagged with \verb|<DT>|\footnote{\texttt{<DT>} and \texttt{<DD>} are
188:   inherently provided to describe terms in HTML.},
189: \item paragraph tagged with \verb|<P>|,
190: \item itemization tagged with \verb|<UL>|,
191: \item $N$ sentences, where we empirically set \mbox{$N = 3$}.
192: \end{enumerate}
193: 
194: \subsection{Organization}
195: \label{subsec:organization}
196: 
197: As discussed in Section~\ref{sec:introduction}, organizing information
198: extracted from the Web is crucial in our framework.  For this purpose,
199: we classify extracted term descriptions based on word senses and
200: domains.
201: 
202: Although a number of methods have been proposed to generate word
203: senses (for example, one based on the vector space
204: model~\cite{schutze:cl-98}), it is still difficult to accurately
205: identify word senses without explicit dictionaries that define sense
206: candidates.
207: 
208: In addition, since word senses are often associated with
209: domains~\cite{yarowsky:acl-95}, word senses can be consequently
210: distinguished by way of determining the domain of each description.
211: For example, different senses for ``pipeline (processing
212: method/transportation pipe)'' are associated with the computer and
213: construction domains (fields), respectively.
214: 
215: To sum up, the organization module classifies term descriptions based
216: on domains, for which we use domain and description models.  In
217: Section~\ref{sec:organization}, we elaborate on our organization
218: model.
219: 
220: \section{Statistical Organization Model}
221: \label{sec:organization}
222: 
223: \subsection{Overview}
224: \label{subsec:organization_overview}
225: 
226: Given one or more (in most cases more than one) descriptions for a
227: single input term, the organization module selects appropriate
228: description(s) for each domain related to the term.
229: 
230: We do not need all the extracted descriptions as final outputs,
231: because they are usually similar to one another, and thus are
232: redundant.
233: 
234: For the moment, we assume that we know {\it a priori\/} which domains
235: are related to the input term.
236: 
237: From the viewpoint of probability theory, our task here is to select
238: descriptions with greater probability for given domains.  The
239: probability for description $d$ given domain $c$, \mbox{$P(d|c)$}, is
commonly transformed as in Equation~\eq{eq:organization}, through
use of Bayes' theorem.
242: \begin{equation}
243:   \label{eq:organization}
244:   P(d|c) = \frac{\textstyle P(c|d)\cdot P(d)}{\textstyle P(c)}
245: \end{equation}
246: In practice, $P(c)$ can be omitted because this factor is a constant,
247: and thus does not affect the relative probability for different
248: descriptions.
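
To make this explicit (a restatement for illustration, using only the
symbols defined above), consider a fixed domain $c$ and two candidate
descriptions $d_1$ and $d_2$.  Dividing
Equation~\eq{eq:organization} for $d_1$ by the same equation for
$d_2$ cancels $P(c)$:
\[
  \frac{P(d_1|c)}{P(d_2|c)} =
  \frac{P(c|d_1)\cdot P(d_1)}{P(c|d_2)\cdot P(d_2)}
\]
Thus, within one domain, ranking descriptions by the product
$P(c|d)\cdot P(d)$ yields the same order as ranking them by $P(d|c)$.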
249: 
250: In Equation~\eq{eq:organization}, $P(c|d)$ models a probability that
251: $d$ corresponds to domain $c$. $P(d)$ models a probability that $d$
252: can be a description for the term in question, disregarding the
253: domain. We shall call them domain and description models, respectively.
254: 
255: To sum up, in principle we select $d$'s that are strongly associated
256: with a specific domain, and are likely to be descriptions themselves.
257: 
258: Extracted descriptions are not linguistically understandable in the
259: case where the extraction process is unsuccessful and retrieved pages
260: inherently contain non-linguistic information (such as special
261: characters and e-mail addresses).
262: 
263: To resolve this problem, Fujii and Ishikawa~\shortcite{fujii:acl-2000}
used a language model to filter out descriptions with high
perplexity. However, in this paper we integrated a description model,
266: which is practically the same as a language model, with an
267: organization model. The new framework is more understandable with
268: respect to probability theory.
269: 
270: In practice, we first use Equation~\eq{eq:organization} to compute
271: $P(d|c)$ for all the $c$'s predefined in the domain model. Then we
discard those $c$'s whose $P(d|c)$ is below a specific threshold.  As a
273: result, for the input term, related domains and descriptions are
274: simultaneously selected. Thus, we do not have to know {\it a priori\/}
275: which domains are related to each term.
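
The selection step above can be sketched in set notation.  The symbols
$D$ (the descriptions extracted for the input term), $C$ (the domains
predefined in the domain model), $\theta$ (the threshold) and $S$ are
introduced here only for illustration.
\[
  S = \{\, (c,d) \mid c \in C,\ d \in D,\ P(d|c) \geq \theta \,\}
\]
Under this reading, the domains related to the term are those $c$ that
occur in at least one pair of $S$, and the descriptions retained for
such a domain are the corresponding $d$'s, so that domains and
descriptions are obtained in a single pass.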
276: 
277: In the following two sections, we explain methods to realize the
278: domain and description models, respectively.
279: 
280: \subsection{Domain Model}
281: \label{subsec:domain_model}
282: 
283: The domain model quantifies the extent to which description $d$ is
284: associated with domain $c$, which is fundamentally a categorization
285: task.  Among a number of existing categorization methods, we
286: experimentally used one proposed by Iwayama and
287: Tokunaga~\shortcite{iwayama:anlp-94}, which formulates $P(c|d)$ as in
288: Equation~(\ref{eq:domain_model}).
289: \begin{equation}
290:   \label{eq:domain_model}
291:   P(c|d) = P(c)\cdot\sum_{t}\frac{\textstyle P(t|c)\cdot
292:   P(t|d)}{\textstyle P(t)}
293: \end{equation}
294: Here, $P(t|d)$, $P(t|c)$ and $P(t)$ denote probabilities that word $t$
295: appears in $d$, $c$ and all the domains, respectively. We regard
296: $P(c)$ as a constant. While $P(t|d)$ is simply a relative frequency of
297: $t$ in $d$, we need predefined domains to compute $P(t|c)$ and $P(t)$.
298: For this purpose, the use of large-scale corpora annotated with
299: domains is desirable.
300: 
301: However, since those resources are prohibitively expensive, we used
302: the ``Nova'' dictionary for Japanese/English machine translation
303: systems\footnote{Produced by NOVA, Inc.}, which includes approximately
304: one million entries related to 19 technical fields as listed below:
305: \begin{quote}
306:   aeronautics,
307:   biotechnology,
308:   business,
309:   chemistry,
310:   computers,
311:   construction,
312:   defense,
313:   ecology,
314:   electricity,
315:   energy,
316:   finance,
317:   law,
318:   mathematics,
319:   mechanics,
320:   medicine,
321:   metals,
322:   oceanography,
323:   plants,
324:   trade.
325: \end{quote}
326: 
327: We extracted words from dictionary entries to estimate $P(t|c)$ and
328: $P(t)$, which are relative frequencies of $t$ in $c$ and all the
329: domains, respectively.  We used the ChaSen morphological
330: analyzer~\cite{matsumoto:chasen-97} to extract words from Japanese
331: entries.  We also used English entries because Japanese descriptions
332: often contain English words.
333: 
334: It may be argued that statistics extracted from dictionaries are
unreliable, because word frequencies in real-world usage are missing.
336: However, words that are representative for a domain tend to be
337: frequently used in compound word entries associated with the domain,
338: and thus our method is a practical approximation.
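
As a small worked instance of Equation~\eq{eq:domain_model}, suppose
that a description $d$ consists of only two distinct words, $t_1$ and
$t_2$, each with relative frequency $P(t_1|d) = P(t_2|d) = 0.5$.  All
the numbers in this example are hypothetical and chosen only to
illustrate the computation.  Let $P(t_1|c) = 0.02$ and $P(t_1) =
0.01$, so that $t_1$ is twice as frequent in domain $c$ as in all the
domains, and let $P(t_2|c) = 0.001$ and $P(t_2) = 0.002$.  Then
\begin{eqnarray*}
  \sum_{t}\frac{P(t|c)\cdot P(t|d)}{P(t)}
  & = & \frac{0.02\cdot 0.5}{0.01} + \frac{0.001\cdot 0.5}{0.002} \\
  & = & 1.0 + 0.25 \;=\; 1.25
\end{eqnarray*}
and $P(c|d) = 1.25\cdot P(c)$.  Words that are more frequent in $c$
than in the dictionary as a whole (here $t_1$) increase the score,
while words that are less frequent in $c$ (here $t_2$) contribute
little.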
339: 
340: \subsection{Description Model}
341: \label{subsec:desc_model}
342: 
343: The description model quantifies the extent to which a given page
344: fragment is feasible as a description for the input term.  In
345: principle, we decompose the description model into language and
346: quality properties, as shown in Equation~(\ref{eq:desc_model}).
347: \begin{equation}
348:   \label{eq:desc_model}
349:   P(d) = P_{L}(d)\cdot P_{Q}(d)
350: \end{equation}
351: Here, $P_{L}(d)$ and $P_{Q}(d)$ denote language and quality models,
352: respectively.
353: 
354: It is expected that the quality model discards incorrect or misleading
355: information contained in Web pages. For this purpose, a number of
356: quality rating methods for Web
357: pages~\cite{amento:sigir-2000,zhu:sigir-2000} can be used.
358: 
359: However, since Google (i.e., the search engine used in our system)
360: rates the quality of pages based on hyperlink information, and
361: selectively retrieves those with higher quality
362: \cite{brin:compnet-1998}, we tentatively regarded $P_{Q}(d)$ as a
363: constant. Thus, in practice the description model is approximated
364: solely with the language model as in Equation~(\ref{eq:lang_model}).
365: \begin{equation}
366:   \label{eq:lang_model}
367:   P(d) \approx P_{L}(d)
368: \end{equation}
369: 
370: Statistical approaches to language modeling have been used in much NLP
371: research, such as machine translation~\cite{brown:cl-93} and speech
372: recognition~\cite{bahl:ieee-tpami-1983}. Our model is almost the same
373: as existing models, but is different in two respects.
374: 
375: First, while general language models quantify the extent to which a
376: given word sequence is linguistically acceptable, our model also
377: quantifies the extent to which the input is acceptable as a term
378: description.  Thus, we trained the model based on an existing machine
379: readable encyclopedia.
380: 
381: We used the ChaSen morphological analyzer to segment the Japanese
382: CD-ROM World Encyclopedia~\cite{heibonsha:98} into words (we replaced
383: headwords with a common symbol), and then used the CMU-Cambridge
384: toolkit~\cite{clarkson:eurospeech-97} to model a word-based trigram.
385: 
386: Consequently, descriptions in which word sequences are more similar to
387: those in the World Encyclopedia are assigned greater probability
388: scores through our language model.
389: 
390: Second, $P(d)$, which is a product of probabilities for $N$-grams in
391: $d$, is quite sensitive to the length of $d$. In the cases of machine
392: translation and speech recognition, this problem is less crucial
393: because multiple candidates compared based on the language model are
394: almost equivalent in terms of length.
395: 
However, since in our case the lengths of descriptions differ
significantly, shorter descriptions are more likely to be selected,
regardless of their quality.  To avoid this problem, we normalize
$P(d)$ by the number of words contained in $d$.
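
For concreteness, the language model and its normalization can be
sketched as follows.  Let $d$ be segmented into the word sequence
$w_1, \ldots, w_n$ (notation introduced here for illustration).  A
word-based trigram model assigns
\[
  P_{L}(d) = \prod_{i=1}^{n} P(w_i|w_{i-2},w_{i-1})
\]
where the contexts of the first two words are padded in some
conventional way (for example, with boundary symbols).  Because each
factor is at most one, $P_{L}(d)$ does not increase as $n$ grows.  The
exact form of the normalization is not specified above; one common
realization, given here only as an assumption, is the per-word
geometric mean
\[
  \hat{P}_{L}(d) = P_{L}(d)^{1/n}
\]
which is equivalent to averaging $\log P(w_i|w_{i-2},w_{i-1})$ over
the $n$ words of $d$.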
400: 
401: \section{Application}
402: \label{sec:application}
403: 
404: \subsection{Overview}
405: \label{subsec:application_overview}
406: 
407: Encyclopedias generated through our Web-based method can be used in a
408: number of applications, including human usage, thesaurus
409: production~\cite{hearst:coling-92,nakamura:coling-88} and natural
410: language understanding in general.
411: 
412: Among the above applications, natural language understanding (NLU) is
413: the most challenging from a scientific point of view.  Current
414: practical NLU research includes dialogue, information extraction and
415: question answering, among which we focus solely on question answering
416: (QA) in this paper.
417: 
418: A straightforward application is to answer interrogative questions
419: like ``What is X?'' in which a QA system searches the encyclopedia
420: database for one or more descriptions related to X (this application
421: is also effective for dialog systems).
422: 
In general, the performance of QA systems is evaluated based on
424: coverage and accuracy. Coverage is the ratio between the number of
425: questions answered (disregarding their correctness) and the total
426: number of questions. Accuracy is the ratio between the number of
427: correct answers and the total number of answers made by the system.
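
These two measures can be written as follows, where $N$, $N_{a}$ and
$N_{c}$ (symbols introduced here for illustration) denote the total
number of questions, the number of questions answered by the system
and the number of correct answers, respectively.
\[
  \mbox{coverage} = \frac{N_{a}}{N}, \qquad
  \mbox{accuracy} = \frac{N_{c}}{N_{a}}
\]
For example, a hypothetical system that answers 30 out of 40 questions
and is correct on 18 of them has a coverage of $30/40 = 75\%$ and an
accuracy of $18/30 = 60\%$.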
428: 
429: While coverage can be estimated objectively and systematically,
430: estimating accuracy relies on human subjects (because there is no
431: absolute description for term X), and thus is expensive.
432: 
433: In view of this problem, we targeted Information Technology Engineers
434: Examinations\footnote{Japan Information-Technology Engineers
435: Examination Center. http://www.jitec.jipdec.or.jp/}, which are
436: biannual (spring and autumn) examinations necessary for candidates to
437: qualify to be IT engineers in Japan.
438: 
439: Among a number of classes, we focused on the ``Class II'' examination,
440: which requires fundamental and general knowledge related to
information technology. Approximately half of the questions are associated
442: with IT technical terms.
443: 
444: Since past examinations and answers are open to the public, we can
445: evaluate the performance of our QA system with minimal cost.
446: 
447: \subsection{Analyzing IT Engineers Examinations}
448: \label{subsec:analysis}
449: 
450: The Class II examination consists of quadruple-choice questions, among
451: which technical term questions can be subdivided into two types.
452: 
453: In the first type of question, examinees choose the most appropriate
454: description for a given technical term, such as ``memory interleave''
455: and ``router.''
456: 
457: In the second type of question, examinees choose the most appropriate
458: term for a given question, for which we show examples collected from
459: the examination in the autumn of 1999 (translated into English by one
460: of the authors) as follows:
461: \begin{enumerate}
462: \item Which data structure is most appropriate for FIFO (First-In
463:   First-Out)?
464: 
465:   a) binary trees, b) queues, c) stacks, d) heaps
466: \item Choose the LAN access method in which multiple terminals transmit
467:   data simultaneously and thus they potentially collide.
468: 
  a) ATM, b) CSMA/CD, c) FDDI, d) token ring
470: \end{enumerate}
471: 
In the autumn of 1999, out of 80 questions, the numbers of first-type
and second-type questions were 22 and 18, respectively.
474: 
475: \subsection{Implementing a QA system}
476: \label{subsec:implementation}
477: 
478: For the first type of question, human examinees would search their
479: knowledge base (i.e., memory) for the description of a given term, and
480: compare that description with four candidates.  Then they would choose
481: the candidate that is most similar to the description.
482: 
483: For the second type of question, human examinees would search their
484: knowledge base for the description of each of four candidate terms.
485: Then they would choose the candidate term whose description is most
486: similar to the question description.
487: 
488: The mechanism of our QA system is analogous to the above human
489: methods.  However, unlike human examinees, our system uses an
490: encyclopedia generated from the Web as a knowledge base.
491: 
492: In addition, our system selectively uses term descriptions categorized
493: into domains related to information technology.  In other words, the
494: description of ``pipeline (transportation pipe)'' is irrelevant or
495: misleading to answer questions associated with ``pipeline (processing
496: method).''
497: 
498: To compute the similarity between two descriptions, we used techniques
499: developed in IR research, in which the similarity between a user query
500: and each document in a collection is usually quantified based on word
501: frequencies.  In our case, a question and four possible answers
502: correspond to query and document collection, respectively.  We used a
503: probabilistic method~\cite{robertson:sigir-94}, which is one of the
504: major IR methods.
505: 
506: To sum up, given a question, its type and four choices, our QA system
507: chooses one of four candidates as the answer, in which the resolution
508: algorithm varies depending on the question type.
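
The two resolution algorithms can be summarized in the following
sketch.  Here $\mathit{sim}$ denotes the IR-based similarity described
above, $e(x)$ the encyclopedia description of term $x$, $q$ the
question text and $a_1,\ldots,a_4$ the four choices; all of these
symbols are introduced for illustration only.  For the first type of
question, where $x$ is the given term and the choices are
descriptions, the system answers
\[
  \hat{a} = \mathop{\rm argmax}_{a_i} \mathit{sim}(e(x), a_i)
\]
and for the second type, where the choices are terms, it answers
\[
  \hat{a} = \mathop{\rm argmax}_{a_i} \mathit{sim}(q, e(a_i))
\]
When the encyclopedia holds several descriptions for one term, the
scores have to be aggregated in some way (for example, by taking the
maximum); this detail is an assumption of the sketch and is not fixed
by the description above.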
509: 
510: \subsection{Related Work}
511: \label{subsec:related_work}
512: 
513: Motivated partially by the TREC-8 QA
514: collection~\cite{voorhees:sigir-2000}, question answering has of late
515: become one of the major topics within the NLP/IR communities.
516: 
517: In fact, a number of QA systems targeting the TREC QA collection have
518: recently been
519: proposed~\cite{harabagiu:coling-2000,moldovan:acl-2000,prager:sigir-2000}.
520: Those systems are commonly termed ``open-domain'' systems, because
521: questions expressed in natural language are not necessarily limited to
522: explicit axes, including {\em who\/}, {\em what\/}, {\em when\/}, {\em
523: where\/}, {\em how\/} and {\em why}.
524: 
525: However, Moldovan and Harabagiu~\shortcite{moldovan:acl-2000} found
526: that each of the TREC questions can be recast as either a single axis
527: or a combination of axes.  They also found that out of the 200 TREC
528: questions, 64 questions (approximately one third) were associated with
529: the {\em what\/} axis, for which the Web-based encyclopedia is
530: expected to improve the quality of answers.
531: 
532: Although Harabagiu~\etal~\shortcite{harabagiu:coling-2000} proposed a
533: knowledge-based QA system, most existing systems rely on conventional
534: IR and shallow NLP methods. The use of encyclopedic knowledge for QA
535: systems, as we demonstrated, needs to be further explored.
536: 
537: \section{Experimentation}
538: \label{sec:experimentation}
539: 
540: \subsection{Methodology}
541: \label{subsec:eval_method}
542: 
543: We conducted a number of experiments to investigate the effectiveness
544: of our methods.
545: 
546: First, we generated an encyclopedia by way of our Web-based method (see
547: Sections~\ref{sec:system_design} and \ref{sec:organization}), and
548: evaluated the quality of the encyclopedia itself.
549: 
550: Second, we applied the generated encyclopedia to our QA system (see
551: Section~\ref{sec:application}), and evaluated its performance.  The
552: second experiment can be seen as a task-oriented evaluation for our
553: encyclopedia generation method.
554: 
555: In the first experiment, we collected 96 terms from technical term
556: questions in the Class II examination (the autumn of 1999). We used as
557: test inputs those 96 terms and generated an encyclopedia, which was
558: used in the second experiment.
559: 
560: For all the 96 test terms, Google (see Section~\ref{subsec:retrieval})
561: retrieved a positive number of pages, and the average number of pages
562: for one term was 196,503. Since Google practically outputs contents of
563: the top 1,000 pages, the remaining pages were not used in our
564: experiments.
565: 
566: In the following two sections, we explain the first and second
567: experiments, respectively.
568: 
569: \subsection{Evaluating Encyclopedia Generation}
570: \label{subsec:eval_generation}
571: 
572: For each test term, our method first computed $P(d|c)$ using
573: Equation~\eq{eq:organization} and discarded domains whose $P(d|c)$ was
574: below 0.05. Then, for each remaining domain, descriptions with higher
575: $P(d|c)$ were selected as the final outputs.
576: 
577: We selected the top three (not one) descriptions for each domain,
578: because reading a couple of descriptions, which are short paragraphs,
579: is not laborious for human users in real-world usage. As a result, at
580: least one description was generated for 85 test terms, disregarding
581: the correctness.  The number of resultant descriptions was 326 (3.8
582: per term). We analyzed those descriptions from different perspectives.
583: 
584: First, we analyzed the distribution of the Google ranks for the Web
585: pages from which the top three descriptions were eventually retained.
586: Figure~\ref{fig:ranking} shows the result, where we have combined the
587: pages in groups of 50, so that the leftmost bar, for example, denotes
588: the number of used pages whose original Google ranks ranged from 1 to
589: 50.
590: 
591: Although the first group includes the largest number of pages, other
592: groups are also related to a relatively large number of pages.  In
593: other words, our method exploited a number of low ranking pages, which
594: are not browsed or utilized by most Web users.
595: 
596: \begin{figure}[htbp]
597:   \begin{center}
598:     \leavevmode
599:     \psfig{file=ranking.ps,height=2in}
600:   \end{center}
601:   \caption{Distribution of rankings for original pages in Google.}
602:   \label{fig:ranking}
603: \end{figure}
604: 
605: Second, we analyzed the distribution of domains assigned to the 326
606: resultant descriptions.  Figure~\ref{fig:domain_dist} shows the
607: result, in which, as expected, most descriptions were associated with
608: the computer domain.
609: 
However, the law domain was unexpectedly associated with a relatively
large number of descriptions.  We manually analyzed the resultant
612: descriptions and found that descriptions for which appropriate domains
613: are not defined in our domain model, such as sports, tended to be
614: categorized into the law domain.
615: 
616: \begin{figure}[htbp]
617:   \begin{center}
618:     \small
619:     \begin{tabular}{l} \hline\hline
620:       computers (200),
621:       law (41),
622:       electricity (28), \\
623:       plants (15),
624:       medicine (10),
625:       finance (8), \\
626:       mathematics (8),
627:       mechanics (5),
628:       biotechnology (4), \\
629:       construction (2),
630:       ecology (2),
631:       chemistry (1), \\
632:       energy (1),
633:       oceanography (1) \\
634:       \hline
635:     \end{tabular}
636:     \caption{Distribution of domains related to the 326 resultant 
637:     descriptions.}
638:     \label{fig:domain_dist}
639:   \end{center}
640: \end{figure}
641: 
642: Third, we evaluated the accuracy of our method, that is, the quality
643: of an encyclopedia our method generated.  For this purpose, each of
644: the resultant descriptions was judged as to whether or not it is a
645: correct description for a term in question. Each domain assigned to
646: descriptions was also judged correct or incorrect.
647: 
648: We analyzed the result on a description-by-description basis, that is,
649: all the generated descriptions were considered independent of one
650: another. The ratio of correct descriptions, disregarding the domain
651: correctness, was 58.0\% (189/326), and the ratio of correct
652: descriptions categorized into the correct domain was 47.9\% (156/326).
653: 
654: However, since all the test terms are inherently related to the IT
655: field, we focused solely on descriptions categorized into the computer
656: domain.  In this case, the ratio of correct descriptions, disregarding
657: the domain correctness, was 62.0\% (124/200), and the ratio of correct
658: descriptions categorized into the correct domain was 61.5\% (123/200).
659: 
In addition, we analyzed the result on a term-by-term basis, because
reading a couple of descriptions per term is not laborious.  In other words,
662: we evaluated each term (not description), and in the case where at
663: least one correct description categorized into the correct domain was
664: generated for a term in question, we judged it correct.  The ratio of
665: correct terms was 89.4\% (76/85), and in the case where we focused
666: solely on the computer domain, the ratio was 84.8\% (67/79).
667: 
668: In other words, by reading a couple of descriptions (3.8 descriptions
669: per term), human users can obtain knowledge of approximately 90\% of
670: input terms.
671: 
672: Finally, we compared the resultant descriptions with an existing
673: dictionary. For this purpose, we used the ``Nichigai'' computer
674: dictionary~\cite{nichigai_compdic:96}, which lists approximately
675: 30,000 Japanese technical terms related to the computer field, and
676: contains descriptions for 13,588 terms.  In the Nichigai dictionary,
677: 42 out of the 96 test terms were described. Our method, which
678: generated correct descriptions associated with the computer domain for
679: 67 input terms, enhanced the Nichigai dictionary in terms of quantity.
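
Expressed as ratios over the 96 test terms (simple arithmetic on the
counts above), the Nichigai dictionary covers
\[
  \frac{42}{96} \approx 43.8\%
\]
of the terms, whereas our method yields at least one correct
computer-domain description for
\[
  \frac{67}{96} \approx 69.8\%
\]
of them.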
680: 
681: These results indicate that our method for generating encyclopedias is
682: of operational quality.
683: 
684: \subsection{Evaluating Question Answering}
685: \label{subsec:eval_qa}
686: 
687: As test inputs, we used 40 questions about technical terms, which
688: were collected from the Class II examination in the autumn of 1999.
689: 
690: The objective here is not only to evaluate the performance of our QA
691: system itself, but also to evaluate the quality of the encyclopedia
692: generated by our method.
693: 
694: Thus, as performed in the first experiment
695: (Section~\ref{subsec:eval_generation}), we used the Nichigai computer
696: dictionary as a baseline encyclopedia. We compared the following three
697: different resources as a knowledge base:
698: \begin{itemize}
699: \item the Nichigai dictionary (``Nichigai''),
700: \item the descriptions generated in the first experiment (``Web''),
701: \item a combination of both resources (``Nichigai + Web'').
702: \end{itemize}
703: 
704: Table~\ref{tab:eval_qa} shows the result of our comparative
705: experiment, in which ``C'' and ``A'' denote coverage and accuracy,
706: respectively, for variations of our QA system.
707: 
708: Since all the questions we used are four-choice questions, when the
709: system cannot answer a question, a random choice can be made to
710: raise the coverage to 100\%.  Thus, for each knowledge resource we
711: compared cases without/with random choice, which are denoted ``w/o
712: Random'' and ``w/ Random'' in Table~\ref{tab:eval_qa}, respectively.
713: 
714: \begin{table}[htbp]
715:   \begin{center}
716:     \caption{Coverage and accuracy (\%) for different question
717:     answering methods.}
718:     \medskip
719:     \leavevmode
720:     \small
721:     \begin{tabular}{lcccc} \hline\hline
722:       & \multicolumn{2}{c}{w/o Random} & \multicolumn{2}{c}{w/
723:       Random} \\
724:       \multicolumn{1}{c}{Resource} &
725:       C &
726:       A &
727:       C &
728:       A \\ \hline
729:       Nichigai & 50.0 & 65.0 & 100 & 45.0 \\
730:       Web & 92.5 & 48.6 & 100 & 46.9 \\
731:       Nichigai + Web & 95.0 & 63.2 & 100 & 61.3 \\
732:       \hline
733:     \end{tabular}
734:     \label{tab:eval_qa}
735:   \end{center}
736: \end{table}
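
The ``w/ Random'' figures in Table~\ref{tab:eval_qa} are consistent
with reading accuracy as the ratio of correct answers to answered
questions, and with crediting each unanswered four-choice question
with an expected 1/4 of a correct answer. This reading is an
assumption: the counts used below are inferred from the percentages
for 40 questions and are not reported directly. Writing $C$ and $A$
for the coverage and accuracy without random choice (as fractions),
the expected accuracy with random choice would be
\[
  A_{\mathrm{rand}} = C \cdot A + \frac{1 - C}{4},
\]
which gives $(13 + 20/4)/40 = 45.0\%$ for Nichigai, $(18 + 3/4)/40
\approx 46.9\%$ for Web, and $(24 + 2/4)/40 \approx 61.3\%$ for
Nichigai + Web.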
737: 
738: In the case where random choice was not performed, the Web-based
739: encyclopedia noticeably improved on the coverage of the Nichigai
740: dictionary (92.5\% vs. 50.0\%), but decreased the accuracy.
741: However, by combining both resources, the accuracy was noticeably
742: improved over the Web-based encyclopedia alone and became comparable
743: with that for the Nichigai dictionary, while the coverage remained high.
744: 
745: On the other hand, in the case where random choice was performed, the
746: Nichigai dictionary and the Web-based encyclopedia were comparable in
747: terms of both the coverage and accuracy.  Additionally, by combining
748: both resources, the accuracy was further improved.
749: 
750: We also investigated the performance of our QA system where only
751: descriptions related to the computer domain are used. However,
752: coverage/accuracy did not significantly change, because as shown in
753: Figure~\ref{fig:domain_dist}, most of the descriptions were inherently
754: related to the computer domain.
755: 
756: \section{Conclusion}
757: \label{sec:conclusion}
758: 
759: The World Wide Web has become an unprecedentedly enormous information
760: source, for which a number of language processing methods have been
761: explored to extract, retrieve and discover various types of
762: information.
763: 
764: In this paper, we aimed at generating encyclopedic knowledge, which is
765: valuable for many applications including human usage and natural
766: language understanding.
767: For this purpose, we reformalized an existing Web-based extraction
768: method, and proposed a new statistical organization model to improve
769: the quality of extracted data.
770: 
771: Given a term for which encyclopedic knowledge (i.e., descriptions) is
772: to be generated, our method sequentially performs a) retrieval of Web
773: pages containing the term, b) extraction of page fragments describing
774: the term, and c) organization of extracted descriptions based on domains
775: (and consequently word senses).
776: 
777: In addition, we proposed a question answering system, which answers
778: interrogative questions associated with {\it what\/}, by using a
779: Web-based encyclopedia as a knowledge base.  For the purpose of
780: evaluation, we used as test inputs technical terms collected from the
781: Class II IT engineers examination, and found that the encyclopedia
782: generated through our method was of operational quality and quantity.
783: 
784: We also used test questions from the Class II examination, and
785: evaluated the Web-based encyclopedia in terms of question
786: answering. We found that our Web-based encyclopedia improved the
787: system coverage obtained solely with an existing dictionary. In
788: addition, when we used both resources, the performance was further
789: improved.
790: 
791: Future work will include generating information associated with more
792: complex interrogations, such as ones related to {\it how\/} and {\it
793: why\/}, so as to enhance Web-based natural language understanding.
794: 
795: \section*{Acknowledgments}
796: 
797: The authors would like to thank NOVA, Inc. for their support with the
798: Nova dictionary and Katunobu Itou (The National Institute of Advanced
799: Industrial Science and Technology, Japan) for his insightful comments
800: on this paper.
801: 
802: \bibliographystyle{acl}
803: 
804: \begin{thebibliography}{}
805: 
806: \bibitem[\protect\citename{Amento \bgroup et al.\egroup
807:   }2000]{amento:sigir-2000}
808: Brian Amento, Loren Terveen, and Will Hill.
809: \newblock 2000.
810: \newblock Does ``authority'' mean quality? {P}redicting expert quality ratings of
811:   {Web} documents.
812: \newblock In {\em Proceedings of the 23rd Annual International ACM SIGIR
813:   Conference on Research and Development in Information Retrieval}, pages
814:   296--303.
815: 
816: \bibitem[\protect\citename{Bahl \bgroup et al.\egroup
817:   }1983]{bahl:ieee-tpami-1983}
818: Lalit~R. Bahl, Frederick Jelinek, and Robert~L. Mercer.
819: \newblock 1983.
820: \newblock A maximum likelihood approach to continuous speech recognition.
821: \newblock {\em IEEE Transactions on Pattern Analysis and Machine Intelligence},
822:   5(2):179--190.
823: 
824: \bibitem[\protect\citename{Brin and Page}1998]{brin:compnet-1998}
825: Sergey Brin and Lawrence Page.
826: \newblock 1998.
827: \newblock The anatomy of a large-scale hypertextual {Web} search engine.
828: \newblock {\em Computer Networks}, 30(1--7):107--117.
829: 
830: \bibitem[\protect\citename{Brown \bgroup et al.\egroup }1993]{brown:cl-93}
831: Peter~F. Brown, Stephen A.~Della Pietra, Vincent J.~Della Pietra, and Robert~L.
832:   Mercer.
833: \newblock 1993.
834: \newblock The mathematics of statistical machine translation: Parameter
835:   estimation.
836: \newblock {\em Computational Linguistics}, 19(2):263--311.
837: 
838: \bibitem[\protect\citename{Clarkson and Rosenfeld}1997]{clarkson:eurospeech-97}
839: Philip Clarkson and Ronald Rosenfeld.
840: \newblock 1997.
841: \newblock Statistical language modeling using the {CMU}-{Cambridge} toolkit.
842: \newblock In {\em Proceedings of EuroSpeech'97}, pages 2707--2710.
843: 
844: \bibitem[\protect\citename{Etzioni}1997]{etzioni:ai-magazine-97}
845: Oren Etzioni.
846: \newblock 1997.
847: \newblock Moving up the information food chain.
848: \newblock {\em AI Magazine}, 18(2):11--18.
849: 
850: \bibitem[\protect\citename{Fujii and Ishikawa}2000]{fujii:acl-2000}
851: Atsushi Fujii and Tetsuya Ishikawa.
852: \newblock 2000.
853: \newblock Utilizing the {World Wide Web} as an encyclopedia: Extracting term
854:   descriptions from semi-structured texts.
855: \newblock In {\em Proceedings of the 38th Annual Meeting of the Association for
856:   Computational Linguistics}, pages 488--495.
857: 
858: \bibitem[\protect\citename{Harabagiu \bgroup et al.\egroup
859:   }2000]{harabagiu:coling-2000}
860: Sanda~M. Harabagiu, Marius~A. Pa\c{s}ca, and Steven~J. Maiorano.
861: \newblock 2000.
862: \newblock Experiments with open-domain textual question answering.
863: \newblock In {\em Proceedings of the 18th International Conference on
864:   Computational Linguistics}, pages 292--298.
865: 
866: \bibitem[\protect\citename{Hearst}1992]{hearst:coling-92}
867: Marti~A. Hearst.
868: \newblock 1992.
869: \newblock Automatic acquisition of hyponyms from large text corpora.
870: \newblock In {\em Proceedings of the 14th International Conference on
871:   Computational Linguistics}, pages 539--545.
872: 
873: \bibitem[\protect\citename{Heibonsha}1998]{heibonsha:98}
874: Hitachi~Digital Heibonsha.
875: \newblock 1998.
876: \newblock {CD-ROM World Encyclopedia}.
877: \newblock (In Japanese).
878: 
879: \bibitem[\protect\citename{Inokuchi \bgroup et al.\egroup
880:   }1999]{inokuchi:pakdd-99}
881: Akihiro Inokuchi, Takashi Washio, Hiroshi Motoda, Kouhei Kumasawa, and Naohide
882:   Arai.
883: \newblock 1999.
884: \newblock Basket analysis for graph structured data.
885: \newblock In {\em Proceedings of the 3rd Pacific-Asia Conference on Knowledge
886:   Discovery and Data Mining}, pages 420--431.
887: 
888: \bibitem[\protect\citename{Iwayama and Tokunaga}1994]{iwayama:anlp-94}
889: Makoto Iwayama and Takenobu Tokunaga.
890: \newblock 1994.
891: \newblock A probabilistic model for text categorization: Based on a single
892:   random variable with multiple values.
893: \newblock In {\em Proceedings of the 4th Conference on Applied Natural Language
894:   Processing}, pages 162--167.
895: 
896: \bibitem[\protect\citename{Matsumoto \bgroup et al.\egroup
897:   }1997]{matsumoto:chasen-97}
898: Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Yoshitaka Hirano, Osamu
899:   Imaichi, and Tomoaki Imamura.
900: \newblock 1997.
901: \newblock {Japanese} morphological analysis system {ChaSen} manual.
902: \newblock Technical Report NAIST-IS-TR97007, NAIST.
903: \newblock (In Japanese).
904: 
905: \bibitem[\protect\citename{McCallum \bgroup et al.\egroup
906:   }1999]{mccallum:ijcai-99}
907: Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore.
908: \newblock 1999.
909: \newblock A machine learning approach to building domain-specific search
910:   engines.
911: \newblock In {\em Proceedings of the 16th International Joint Conference on
912:   Artificial Intelligence}, pages 662--667.
913: 
914: \bibitem[\protect\citename{Moldovan and Harabagiu}2000]{moldovan:acl-2000}
915: Dan Moldovan and Sanda Harabagiu.
916: \newblock 2000.
917: \newblock The structure and performance of an open-domain question answering
918:   system.
919: \newblock In {\em Proceedings of the 38th Annual Meeting of the Association for
920:   Computational Linguistics}, pages 563--570.
921: 
922: \bibitem[\protect\citename{Nakamura and Nagao}1988]{nakamura:coling-88}
923: Jun'ichi Nakamura and Makoto Nagao.
924: \newblock 1988.
925: \newblock Extraction of semantic information from an ordinary {English}
926:   dictionary and its evaluation.
927: \newblock In {\em Proceedings of the 10th International Conference on
928:   Computational Linguistics}, pages 459--464.
929: 
930: \bibitem[\protect\citename{{Nichigai Associates}}1996]{nichigai_compdic:96}
931: {Nichigai Associates}.
932: \newblock 1996.
933: \newblock {English-Japanese} computer terminology dictionary.
934: \newblock (In Japanese).
935: 
936: \bibitem[\protect\citename{Prager \bgroup et al.\egroup
937:   }2000]{prager:sigir-2000}
938: John Prager, Eric Brown, and Anni Coden.
939: \newblock 2000.
940: \newblock Question-answering by predictive annotation.
941: \newblock In {\em Proceedings of the 23rd Annual International ACM SIGIR
942:   Conference on Research and Development in Information Retrieval}, pages
943:   184--191.
944: 
945: \bibitem[\protect\citename{Resnik}1999]{resnik:acl-99}
946: Philip Resnik.
947: \newblock 1999.
948: \newblock Mining the {Web} for bilingual texts.
949: \newblock In {\em Proceedings of the 37th Annual Meeting of the Association for
950:   Computational Linguistics}, pages 527--534.
951: 
952: \bibitem[\protect\citename{Robertson and Walker}1994]{robertson:sigir-94}
953: S.~E. Robertson and S.~Walker.
954: \newblock 1994.
955: \newblock Some simple effective approximations to the 2-{Poisson} model for
956:   probabilistic weighted retrieval.
957: \newblock In {\em Proceedings of the 17th Annual International ACM SIGIR
958:   Conference on Research and Development in Information Retrieval}, pages
959:   232--241.
960: 
961: \bibitem[\protect\citename{Sch\"{u}tze}1998]{schutze:cl-98}
962: Hinrich Sch\"{u}tze.
963: \newblock 1998.
964: \newblock Automatic word sense discrimination.
965: \newblock {\em Computational Linguistics}, 24(1):97--123.
966: 
967: \bibitem[\protect\citename{Soderland}1997]{soderland:kdd-97}
968: Stephen Soderland.
969: \newblock 1997.
970: \newblock Learning to extract text-based information from the {World Wide Web}.
971: \newblock In {\em Proceedings of the 3rd International Conference on Knowledge
972:   Discovery and Data Mining}.
973: 
974: \bibitem[\protect\citename{Voorhees and Tice}2000]{voorhees:sigir-2000}
975: Ellen~M. Voorhees and Dawn~M. Tice.
976: \newblock 2000.
977: \newblock Building a question answering test collection.
978: \newblock In {\em Proceedings of the 23rd Annual International ACM SIGIR
979:   Conference on Research and Development in Information Retrieval}, pages
980:   200--207.
981: 
982: \bibitem[\protect\citename{Yarowsky}1995]{yarowsky:acl-95}
983: David Yarowsky.
984: \newblock 1995.
985: \newblock Unsupervised word sense disambiguation rivaling supervised methods.
986: \newblock In {\em Proceedings of the 33rd Annual Meeting of the Association for
987:   Computational Linguistics}, pages 189--196.
988: 
989: \bibitem[\protect\citename{Zhu and Gauch}2000]{zhu:sigir-2000}
990: Xiaolan Zhu and Susan Gauch.
991: \newblock 2000.
992: \newblock Incorporating quality metrics in centralized/distributed information
993:   retrieval on the {World Wide Web}.
994: \newblock In {\em Proceedings of the 23rd Annual International ACM SIGIR
995:   Conference on Research and Development in Information Retrieval}, pages
996:   288--295.
997: 
998: \end{thebibliography}
999: 
1000: \end{document}
1001: