0106:cs0106015/main.tex

1: \documentclass[10pt]{article}

2: \usepackage{acl2001,times}

3: \setlength\titlebox{6.5cm}    % Expanding the titlebox

4:

5: \title{Organizing Encyclopedic Knowledge based on the Web and its

6: Application to Question Answering}

7:

8: \author{Atsushi Fujii \\

9:   University of Library and \\ Information Science \\

10:   1-2 Kasuga, Tsukuba \\

11:   305-8550, Japan \\

12:   CREST, Japan Science and \\ Technology Corporation \\

13:   {\tt fujii@ulis.ac.jp} \And

14:   Tetsuya Ishikawa \\

15:   University of Library and \\ Information Science \\

16:   1-2 Kasuga, Tsukuba \\

17:   305-8550, Japan \\

18:   {\tt ishikawa@ulis.ac.jp}

19: }

20:

21: \date{}

22:

23: \newcommand{\etal}{et~al.}

24: \newcommand{\etaleos}{et~al}

25: \newcommand{\eq}[1]{(\ref{#1})}

26: \renewcommand{\nocite}[1]{\shortcite{#1}}

27: \input{psfig.tex}

28:

29: \begin{document}

30: \maketitle

31: \begin{abstract}

32:   We propose a method to generate large-scale encyclopedic knowledge,

33:   which is valuable for much NLP research, based on the Web. We first

34:   search the Web for pages containing a term in question.  Then we use

35:   linguistic patterns and HTML structures to extract text fragments

36:   describing the term. Finally, we organize extracted term

37:   descriptions based on word senses and domains. In addition, we apply

38:   an automatically generated encyclopedia to a question answering

39:   system targeting the Japanese Information-Technology Engineers

40:   Examination.

41: \end{abstract}

42:

43: \section{Introduction}

44: \label{sec:introduction}

45:

46: Reflecting the growth in utilization of the World Wide Web, a number

47: of Web-based language processing methods have been proposed within the

48: natural language processing (NLP), information retrieval (IR) and

49: artificial intelligence (AI) communities. A sample of these includes

50: methods to {\em extract\/} linguistic

51: resources~\cite{fujii:acl-2000,resnik:acl-99,soderland:kdd-97}, {\em

52: retrieve\/} useful information in response to user

53: queries~\cite{etzioni:ai-magazine-97,mccallum:ijcai-99} and {\em

54: mine/discover\/} knowledge latent in the Web~\cite{inokuchi:pakdd-99}.

55:

56: In this paper, mainly from an NLP point of view, we explore a method

57: to produce linguistic resources. Specifically, we enhance the method

58: proposed by Fujii and Ishikawa~\shortcite{fujii:acl-2000}, which

59: extracts encyclopedic knowledge (i.e., term descriptions) from the

60: Web.

61:

62: In brief, their method searches the Web for pages containing a term in

63: question, and uses linguistic expressions and HTML layouts to extract

64: fragments describing the term. They also use a language model to

65: discard non-linguistic fragments.  In addition, a clustering method is

66: used to divide descriptions into a specific number of groups.

67:

68: On the one hand, their method is expected to enhance existing

69: encyclopedias, where vocabulary size is relatively limited, and

70: therefore the {\em quantity\/} problem has been resolved.

71:

72: On the other hand, encyclopedias extracted from the Web are not

73: comparable with existing ones in terms of {\em quality}.  In

74: hand-crafted encyclopedias, term descriptions are carefully organized

75: based on domains and word senses, which are especially effective for

76: human usage.  However, the output of Fujii's method is simply a set of

77: unorganized term descriptions.  Although clustering is optionally

78: performed, resultant clusters are not necessarily related to explicit

79: criteria, such as word senses and domains.

80:

81: To sum up, our belief is that by combining {\em extraction\/} and {\em

82: organization\/} methods, we can enhance both quantity and quality of

83: Web-based encyclopedias.

84:

85: Motivated by this background, we introduce an organization model to

86: Fujii's method and reformalize the whole framework.  In other words,

87: our proposed method is not only extraction but {\em generation\/} of

88: encyclopedic knowledge.

89:

90: Section~\ref{sec:system_design} explains the overall design of our

91: encyclopedia generation system, and Section~\ref{sec:organization}

92: elaborates on our organization model.  Section~\ref{sec:application}

93: then explores a method for applying our resultant encyclopedia to NLP

94: research, specifically, question answering.

95: Section~\ref{sec:experimentation} performs a number of experiments to

96: evaluate our methods.

97:

98: \section{System Design}

99: \label{sec:system_design}

100:

101: \subsection{Overview}

102: \label{subsec:system_overview}

103:

104: Figure~\ref{fig:system} depicts the overall design of our system,

105: which generates an encyclopedia for input terms.

106:

107: Our system, which is currently implemented for Japanese, consists of

108: three modules: ``retrieval,'' ``extraction'' and ``organization,''

109: among which the organization module is newly introduced in this paper.

110: In principle, the remaining two modules (``retrieval'' and

111: ``extraction'') are the same as proposed by Fujii and

112: Ishikawa~\shortcite{fujii:acl-2000}.

113:

114: In Figure~\ref{fig:system}, terms can be submitted either on-line or

115: off-line. A reasonable approach is for the system to update the

116: encyclopedia periodically off-line, and to process terms not yet indexed

117: in the encyclopedia dynamically during real-time usage.  In either case, our

118: system processes input terms one by one.

119:

120: We briefly explain each module in the following three sections,

121: respectively.

122:

123: \begin{figure}[htbp]

124:   \begin{center}

125:     \leavevmode

126:     \psfig{file=system.eps,height=2.5in}

127:   \end{center}

128:   \caption{The overall design of our Web-based encyclopedia generation

129:     system.}

130:   \label{fig:system}

131: \end{figure}

132:

133: \subsection{Retrieval}

134: \label{subsec:retrieval}

135:

136: The retrieval module searches the Web for pages containing an input

137: term, for which existing Web search engines can be used, and those

138: with broad coverage are desirable.

139:

140: However, search engines performing query expansion are not always

141: desirable, because they usually retrieve a number of pages which do

142: not contain an input keyword.  Since the extraction module (see

143: Section~\ref{subsec:extraction}) analyzes the usage of the input term

144: in retrieved pages, pages not containing the term are of no use for our

145: purpose.

146:

147: Thus, we use as the retrieval module ``Google,'' which is one of the

148: major search engines and does not conduct query

149: expansion\footnote{http://www.google.com/}.

150:

151: \subsection{Extraction}

152: \label{subsec:extraction}

153:

154: First, in the extraction module, given Web pages containing an input term,

155: newline codes, redundant white spaces and HTML tags that are not used

156: in the following processes are discarded to standardize the page

157: format.

158:

159: Second, we approximately identify a region describing the term in the

160: page, for which two rules are used.

161:

162: The first rule is based on Japanese linguistic patterns typically used

163: for term descriptions, such as ``X {\it toha\/} Y {\it dearu\/} (X is

164: Y).''  Following the method proposed by Fujii and

165: Ishikawa~\shortcite{fujii:acl-2000}, we semi-automatically produced 20

166: patterns based on the Japanese CD-ROM World

167: Encyclopedia~\cite{heibonsha:98}, which includes approximately 80,000

168: entries related to various fields.  It is expected that a region

169: including the sentence that matched with one of those patterns can be

170: a term description.

171:

172: The second rule is based on HTML layout. In a typical case, a term in

173: question is highlighted as a heading with tags such as \verb|<DT>|,

174: \verb|<B>| and \verb|<Hx>| (``\verb|x|'' denotes a digit), followed by

175: its description. In some cases, terms are marked with the anchor

176: \verb|<A>| tag, providing hyperlinks to pages where they are

177: described.

178:

179: Finally, based on the region approximately identified by the above method,

180: we extract a page fragment as a term description. Since term

181: descriptions usually consist of a logical segment (such as a

182: paragraph) rather than a single sentence, we extract a fragment that

183: matched with one of the following patterns, which are sorted according

184: to preference in descending order:

185: \begin{enumerate}

186: \item description tagged with \verb|<DD>| in the case where the term

187:   is tagged with \verb|<DT>|\footnote{\texttt{<DT>} and \texttt{<DD>} are

188:   inherently provided to describe terms in HTML.},

189: \item paragraph tagged with \verb|<P>|,

190: \item itemization tagged with \verb|<UL>|,

191: \item $N$ sentences, where we empirically set \mbox{$N = 3$}.

192: \end{enumerate}

193:

194: \subsection{Organization}

195: \label{subsec:organization}

196:

197: As discussed in Section~\ref{sec:introduction}, organizing information

198: extracted from the Web is crucial in our framework.  For this purpose,

199: we classify extracted term descriptions based on word senses and

200: domains.

201:

202: Although a number of methods have been proposed to generate word

203: senses (for example, one based on the vector space

204: model~\cite{schutze:cl-98}), it is still difficult to accurately

205: identify word senses without explicit dictionaries that define sense

206: candidates.

207:

208: In addition, since word senses are often associated with

209: domains~\cite{yarowsky:acl-95}, word senses can be consequently

210: distinguished by way of determining the domain of each description.

211: For example, different senses for ``pipeline (processing

212: method/transportation pipe)'' are associated with the computer and

213: construction domains (fields), respectively.

214:

215: To sum up, the organization module classifies term descriptions based

216: on domains, for which we use domain and description models.  In

217: Section~\ref{sec:organization}, we elaborate on our organization

218: model.

219:

220: \section{Statistical Organization Model}

221: \label{sec:organization}

222:

223: \subsection{Overview}

224: \label{subsec:organization_overview}

225:

226: Given one or more (in most cases more than one) descriptions for a

227: single input term, the organization module selects appropriate

228: description(s) for each domain related to the term.

229:

230: We do not need all the extracted descriptions as final outputs,

231: because they are usually similar to one another, and thus are

232: redundant.

233:

234: For the moment, we assume that we know {\it a priori\/} which domains

235: are related to the input term.

236:

237: From the viewpoint of probability theory, our task here is to select

238: descriptions with greater probability for given domains.  The

239: probability for description $d$ given domain $c$, \mbox{$P(d|c)$}, is

240: commonly transformed as in Equation~\eq{eq:organization}, through

241: use of Bayes' theorem.

242: \begin{equation}

243:   \label{eq:organization}

244:   P(d|c) = \frac{\textstyle P(c|d)\cdot P(d)}{\textstyle P(c)}

245: \end{equation}

246: In practice, $P(c)$ can be omitted because this factor is a constant,

247: and thus does not affect the relative probability for different

248: descriptions.
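
Put differently, and using only the quantities already defined, ranking
descriptions for a fixed domain $c$ reduces to comparing the numerator
of Equation~\eq{eq:organization}. The following restatement is a
shorthand introduced here for illustration, not an additional model:
\[
  P(d|c) \propto P(c|d)\cdot P(d)
\]
so that the preferred description for domain $c$ is the $d$ that
maximizes \mbox{$P(c|d)\cdot P(d)$}.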

249:

250: In Equation~\eq{eq:organization}, $P(c|d)$ models a probability that

251: $d$ corresponds to domain $c$. $P(d)$ models a probability that $d$

252: can be a description for the term in question, disregarding the

253: domain. We shall call them domain and description models, respectively.

254:

255: To sum up, in principle we select $d$'s that are strongly associated

256: with a specific domain, and are likely to be descriptions themselves.

257:

258: Extracted descriptions are not linguistically understandable in the

259: case where the extraction process is unsuccessful and retrieved pages

260: inherently contain non-linguistic information (such as special

261: characters and e-mail addresses).

262:

263: To resolve this problem, Fujii and Ishikawa~\shortcite{fujii:acl-2000}

264: used a language model to filter out descriptions with high

265: perplexity. However, in this paper we integrated a description model,

266: which is practically the same as a language model, with an

267: organization model. The new framework is more understandable with

268: respect to probability theory.

269:

270: In practice, we first use Equation~\eq{eq:organization} to compute

271: $P(d|c)$ for all the $c$'s predefined in the domain model. Then we

272: discard such $c$'s whose $P(d|c)$ is below a specific threshold.  As a

273: result, for the input term, related domains and descriptions are

274: simultaneously selected. Thus, we do not have to know {\it a priori\/}

275: which domains are related to each term.
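
One possible reading of this selection step can be written as follows.
The threshold symbol $\theta$ and the set notation are introduced here
for illustration only, and $D$ denotes the set of descriptions
extracted for the input term:
\[
  \{(c, d) \mid d \in D,\ P(d|c) \geq \theta\}
\]
Under this reading, a domain $c$ is retained for the term only if at
least one description $d$ reaches the threshold for it.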

276:

277: In the following two sections, we explain methods to realize the

278: domain and description models, respectively.

279:

280: \subsection{Domain Model}

281: \label{subsec:domain_model}

282:

283: The domain model quantifies the extent to which description $d$ is

284: associated with domain $c$, which is fundamentally a categorization

285: task.  Among a number of existing categorization methods, we

286: experimentally used one proposed by Iwayama and

287: Tokunaga~\shortcite{iwayama:anlp-94}, which formulates $P(c|d)$ as in

288: Equation~(\ref{eq:domain_model}).

289: \begin{equation}

290:   \label{eq:domain_model}

291:   P(c|d) = P(c)\cdot\sum_{t}\frac{\textstyle P(t|c)\cdot

292:   P(t|d)}{\textstyle P(t)}

293: \end{equation}

294: Here, $P(t|d)$, $P(t|c)$ and $P(t)$ denote probabilities that word $t$

295: appears in $d$, $c$ and all the domains, respectively. We regard

296: $P(c)$ as a constant. While $P(t|d)$ is simply a relative frequency of

297: $t$ in $d$, we need predefined domains to compute $P(t|c)$ and $P(t)$.

298: For this purpose, the use of large-scale corpora annotated with

299: domains is desirable.

300:

301: However, since those resources are prohibitively expensive, we used

302: the ``Nova'' dictionary for Japanese/English machine translation

303: systems\footnote{Produced by NOVA, Inc.}, which includes approximately

304: one million entries related to 19 technical fields as listed below:

305: \begin{quote}

306:   aeronautics,

307:   biotechnology,

308:   business,

309:   chemistry,

310:   computers,

311:   construction,

312:   defense,

313:   ecology,

314:   electricity,

315:   energy,

316:   finance,

317:   law,

318:   mathematics,

319:   mechanics,

320:   medicine,

321:   metals,

322:   oceanography,

323:   plants,

324:   trade.

325: \end{quote}

326:

327: We extracted words from dictionary entries to estimate $P(t|c)$ and

328: $P(t)$, which are relative frequencies of $t$ in $c$ and all the

329: domains, respectively.  We used the ChaSen morphological

330: analyzer~\cite{matsumoto:chasen-97} to extract words from Japanese

331: entries.  We also used English entries because Japanese descriptions

332: often contain English words.
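
As a sketch of this estimation, let $f(t,c)$ denote the number of times
word $t$ occurs in dictionary entries assigned to domain $c$. The
symbol $f$ is notation introduced here, and the absence of smoothing is
an assumption:
\[
  P(t|c) = \frac{f(t,c)}{\sum_{t'} f(t',c)}, \qquad
  P(t) = \frac{\sum_{c'} f(t,c')}{\sum_{c'}\sum_{t'} f(t',c')}
\]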

333:

334: It may be argued that statistics extracted from dictionaries are

335: unreliable, because word frequencies in real-world usage are missing.

336: However, words that are representative for a domain tend to be

337: frequently used in compound word entries associated with the domain,

338: and thus our method is a practical approximation.
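
To illustrate Equation~\eq{eq:domain_model}, consider a toy description
$d$ consisting of two words $t_1$ and $t_2$ with
\mbox{$P(t_1|d) = P(t_2|d) = 0.5$}. The values below are hypothetical
and are not taken from the Nova dictionary: \mbox{$P(t_1|c) = 0.02$},
\mbox{$P(t_1) = 0.005$}, \mbox{$P(t_2|c) = 0.001$} and
\mbox{$P(t_2) = 0.002$}. Then
\[
  \sum_{t}\frac{P(t|c)\cdot P(t|d)}{P(t)}
  = \frac{0.02 \cdot 0.5}{0.005} + \frac{0.001 \cdot 0.5}{0.002}
  = 2 + 0.25 = 2.25
\]
The word $t_1$, which is four times more frequent in $c$ than overall,
dominates the score, whereas $t_2$, which is less frequent in $c$ than
overall, contributes little.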

339:

340: \subsection{Description Model}

341: \label{subsec:desc_model}

342:

343: The description model quantifies the extent to which a given page

344: fragment is feasible as a description for the input term.  In

345: principle, we decompose the description model into language and

346: quality properties, as shown in Equation~(\ref{eq:desc_model}).

347: \begin{equation}

348:   \label{eq:desc_model}

349:   P(d) = P_{L}(d)\cdot P_{Q}(d)

350: \end{equation}

351: Here, $P_{L}(d)$ and $P_{Q}(d)$ denote language and quality models,

352: respectively.

353:

354: It is expected that the quality model discards incorrect or misleading

355: information contained in Web pages. For this purpose, a number of

356: quality rating methods for Web

357: pages~\cite{amento:sigir-2000,zhu:sigir-2000} can be used.

358:

359: However, since Google (i.e., the search engine used in our system)

360: rates the quality of pages based on hyperlink information, and

361: selectively retrieves those with higher quality

362: \cite{brin:compnet-1998}, we tentatively regarded $P_{Q}(d)$ as a

363: constant. Thus, in practice the description model is approximated

364: solely with the language model as in Equation~(\ref{eq:lang_model}).

365: \begin{equation}

366:   \label{eq:lang_model}

367:   P(d) \approx P_{L}(d)

368: \end{equation}

369:

370: Statistical approaches to language modeling have been used in much NLP

371: research, such as machine translation~\cite{brown:cl-93} and speech

372: recognition~\cite{bahl:ieee-tpami-1983}. Our model is almost the same

373: as existing models, but is different in two respects.

374:

375: First, while general language models quantify the extent to which a

376: given word sequence is linguistically acceptable, our model also

377: quantifies the extent to which the input is acceptable as a term

378: description.  Thus, we trained the model based on an existing machine

379: readable encyclopedia.

380:

381: We used the ChaSen morphological analyzer to segment the Japanese

382: CD-ROM World Encyclopedia~\cite{heibonsha:98} into words (we replaced

383: headwords with a common symbol), and then used the CMU-Cambridge

384: toolkit~\cite{clarkson:eurospeech-97} to model a word-based trigram.

385:

386: Consequently, descriptions in which word sequences are more similar to

387: those in the World Encyclopedia are assigned greater probability

388: scores through our language model.

389:

390: Second, $P(d)$, which is a product of probabilities for $N$-grams in

391: $d$, is quite sensitive to the length of $d$. In the cases of machine

392: translation and speech recognition, this problem is less crucial

393: because multiple candidates compared based on the language model are

394: almost equivalent in terms of length.

395:

396: However, since in our case the lengths of descriptions are significantly

397: different, shorter descriptions are more likely to be selected,

398: regardless of the quality.  To avoid this problem, we normalize $P(d)$

399: by the number of words contained in $d$.
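
The exact form of this normalization is not specified here. One common
choice, given only as an assumption, is the per-word geometric mean of
the trigram probabilities. For a description
\mbox{$d = w_1 \ldots w_n$}:
\[
  P_{L}(d) = \prod_{i=1}^{n} P(w_i|w_{i-2},w_{i-1}), \qquad
  \hat{P}_{L}(d) = P_{L}(d)^{1/n}
\]
Equivalently, \mbox{$\log \hat{P}_{L}(d)$} is the average
log-probability per word, which no longer decreases merely because $d$
is long.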

400:

401: \section{Application}

402: \label{sec:application}

403:

404: \subsection{Overview}

405: \label{subsec:application_overview}

406:

407: Encyclopedias generated through our Web-based method can be used in a

408: number of applications, including human usage, thesaurus

409: production~\cite{hearst:coling-92,nakamura:coling-88} and natural

410: language understanding in general.

411:

412: Among the above applications, natural language understanding (NLU) is

413: the most challenging from a scientific point of view.  Current

414: practical NLU research includes dialogue, information extraction and

415: question answering, among which we focus solely on question answering

416: (QA) in this paper.

417:

418: A straightforward application is to answer interrogative questions

419: like ``What is X?'' in which a QA system searches the encyclopedia

420: database for one or more descriptions related to X (this application

421: is also effective for dialog systems).

422:

423: In general, the performance of QA systems is evaluated based on

424: coverage and accuracy. Coverage is the ratio between the number of

425: questions answered (disregarding their correctness) and the total

426: number of questions. Accuracy is the ratio between the number of

427: correct answers and the total number of answers made by the system.
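
Stated as formulas, with $Q$ the total number of questions, $A$ the
number of questions answered and $R$ the number of correct answers
(symbols introduced here for convenience):
\[
  \mbox{coverage} = \frac{A}{Q}, \qquad \mbox{accuracy} = \frac{R}{A}
\]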

428:

429: While coverage can be estimated objectively and systematically,

430: estimating accuracy relies on human subjects (because there is no

431: absolute description for term X), and thus is expensive.

432:

433: In view of this problem, we targeted Information Technology Engineers

434: Examinations\footnote{Japan Information-Technology Engineers

435: Examination Center. http://www.jitec.jipdec.or.jp/}, which are

436: biannual (spring and autumn) examinations necessary for candidates to

437: qualify to be IT engineers in Japan.

438:

439: Among a number of classes, we focused on the ``Class II'' examination,

440: which requires fundamental and general knowledge related to

441: information technology. Approximately half of the questions are associated

442: with IT technical terms.

443:

444: Since past examinations and answers are open to the public, we can

445: evaluate the performance of our QA system with minimal cost.

446:

447: \subsection{Analyzing IT Engineers Examinations}

448: \label{subsec:analysis}

449:

450: The Class II examination consists of quadruple-choice questions, among

451: which technical term questions can be subdivided into two types.

452:

453: In the first type of question, examinees choose the most appropriate

454: description for a given technical term, such as ``memory interleave''

455: and ``router.''

456:

457: In the second type of question, examinees choose the most appropriate

458: term for a given question, for which we show examples collected from

459: the examination in the autumn of 1999 (translated into English by one

460: of the authors) as follows:

461: \begin{enumerate}

462: \item Which data structure is most appropriate for FIFO (First-In

463:   First-Out)?

464:

465:   a) binary trees, b) queues, c) stacks, d) heaps

466: \item Choose the LAN access method in which multiple terminals transmit

467:   data simultaneously and thus they potentially collide.

468:

469:   a) ATM, b) CSMA/CD, c) FDDI, d) token ring

470: \end{enumerate}

471:

472: In the autumn of 1999, out of 80 questions, the numbers of questions of the first

473: and second types were 22 and 18, respectively.

474:

475: \subsection{Implementing a QA system}

476: \label{subsec:implementation}

477:

478: For the first type of question, human examinees would search their

479: knowledge base (i.e., memory) for the description of a given term, and

480: compare that description with four candidates.  Then they would choose

481: the candidate that is most similar to the description.

482:

483: For the second type of question, human examinees would search their

484: knowledge base for the description of each of four candidate terms.

485: Then they would choose the candidate term whose description is most

486: similar to the question description.

487:

488: The mechanism of our QA system is analogous to the above human

489: methods.  However, unlike human examinees, our system uses an

490: encyclopedia generated from the Web as a knowledge base.

491:

492: In addition, our system selectively uses term descriptions categorized

493: into domains related to information technology.  In other words, the

494: description of ``pipeline (transportation pipe)'' is irrelevant or

495: misleading to answer questions associated with ``pipeline (processing

496: method).''

497:

498: To compute the similarity between two descriptions, we used techniques

499: developed in IR research, in which the similarity between a user query

500: and each document in a collection is usually quantified based on word

501: frequencies.  In our case, a question and four possible answers

502: correspond to query and document collection, respectively.  We used a

503: probabilistic method~\cite{robertson:sigir-94}, which is one of the

504: major IR methods.

505:

506: To sum up, given a question, its type and four choices, our QA system

507: chooses one of four candidates as the answer, in which the resolution

508: algorithm varies depending on the question type.
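
The two resolution procedures can be summarized as follows. The
notation is introduced here for illustration:
\mbox{$\mbox{sim}(q, x)$} stands for the IR similarity between query
$q$ and document $x$, $E(t)$ for the encyclopedia description(s) of
term $t$, and $a_1,\ldots,a_4$ for the four choices. For the first
type, where the question gives a term $t$ and the choices are
descriptions:
\[
  \hat{a} = \mathop{\rm argmax}_{a_i}\ \mbox{sim}(E(t), a_i)
\]
For the second type, where the question text $q$ is itself a
description and the choices are terms:
\[
  \hat{a} = \mathop{\rm argmax}_{a_i}\ \mbox{sim}(q, E(a_i))
\]
Which side plays the role of the query in the first type is an
assumption made for this sketch.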

509:

510: \subsection{Related Work}

511: \label{subsec:related_work}

512:

513: Motivated partially by the TREC-8 QA

514: collection~\cite{voorhees:sigir-2000}, question answering has of late

515: become one of the major topics within the NLP/IR communities.

516:

517: In fact, a number of QA systems targeting the TREC QA collection have

518: recently been

519: proposed~\cite{harabagiu:coling-2000,moldovan:acl-2000,prager:sigir-2000}.

520: Those systems are commonly termed ``open-domain'' systems, because

521: questions expressed in natural language are not necessarily limited to

522: explicit axes, including {\em who\/}, {\em what\/}, {\em when\/}, {\em

523: where\/}, {\em how\/} and {\em why}.

524:

525: However, Moldovan and Harabagiu~\shortcite{moldovan:acl-2000} found

526: that each of the TREC questions can be recast as either a single axis

527: or a combination of axes.  They also found that out of the 200 TREC

528: questions, 64 questions (approximately one third) were associated with

529: the {\em what\/} axis, for which the Web-based encyclopedia is

530: expected to improve the quality of answers.

531:

532: Although Harabagiu~\etal~\shortcite{harabagiu:coling-2000} proposed a

533: knowledge-based QA system, most existing systems rely on conventional

534: IR and shallow NLP methods. The use of encyclopedic knowledge for QA

535: systems, as we demonstrated, needs to be further explored.

536:

537: \section{Experimentation}

538: \label{sec:experimentation}

539:

540: \subsection{Methodology}

541: \label{subsec:eval_method}

542:

543: We conducted a number of experiments to investigate the effectiveness

544: of our methods.

545:

546: First, we generated an encyclopedia by way of our Web-based method (see

547: Sections~\ref{sec:system_design} and \ref{sec:organization}), and

548: evaluated the quality of the encyclopedia itself.

549:

550: Second, we applied the generated encyclopedia to our QA system (see

551: Section~\ref{sec:application}), and evaluated its performance.  The

552: second experiment can be seen as a task-oriented evaluation for our

553: encyclopedia generation method.

554:

555: In the first experiment, we collected 96 terms from technical term

556: questions in the Class II examination (the autumn of 1999). We used as

557: test inputs those 96 terms and generated an encyclopedia, which was

558: used in the second experiment.

559:

560: For all the 96 test terms, Google (see Section~\ref{subsec:retrieval})

561: retrieved a positive number of pages, and the average number of pages

562: for one term was 196,503. Since Google in practice returns only

563: the top 1,000 pages, the remaining pages were not used in our

564: experiments.

565:

566: In the following two sections, we explain the first and second

567: experiments, respectively.

568:

569: \subsection{Evaluating Encyclopedia Generation}

570: \label{subsec:eval_generation}

571:

572: For each test term, our method first computed $P(d|c)$ using

573: Equation~\eq{eq:organization} and discarded domains whose $P(d|c)$ was

574: below 0.05. Then, for each remaining domain, descriptions with higher

575: $P(d|c)$ were selected as the final outputs.

576:

577: We selected the top three (not one) descriptions for each domain,

578: because reading a couple of descriptions, which are short paragraphs,

579: is not laborious for human users in real-world usage. As a result, at

580: least one description was generated for 85 test terms, disregarding

581: the correctness.  The number of resultant descriptions was 326 (3.8

582: per term). We analyzed those descriptions from different perspectives.

583:

584: First, we analyzed the distribution of the Google ranks for the Web

585: pages from which the top three descriptions were eventually retained.

586: Figure~\ref{fig:ranking} shows the result, where we have combined the

587: pages in groups of 50, so that the leftmost bar, for example, denotes

588: the number of used pages whose original Google ranks ranged from 1 to

589: 50.

590:

591: Although the first group includes the largest number of pages, other

592: groups are also related to a relatively large number of pages.  In

593: other words, our method exploited a number of low ranking pages, which

594: are not browsed or utilized by most Web users.

595:

596: \begin{figure}[htbp]

597:   \begin{center}

598:     \leavevmode

599:     \psfig{file=ranking.ps,height=2in}

600:   \end{center}

601:   \caption{Distribution of rankings for original pages in Google.}

602:   \label{fig:ranking}

603: \end{figure}

604:

605: Second, we analyzed the distribution of domains assigned to the 326

606: resultant descriptions.  Figure~\ref{fig:domain_dist} shows the

607: result, in which, as expected, most descriptions were associated with

608: the computer domain.

609:

610: However, the law domain was unexpectedly associated with a relatively

611: large number of descriptions.  We manually analyzed the resultant

612: descriptions and found that descriptions for which appropriate domains

613: are not defined in our domain model, such as sports, tended to be

614: categorized into the law domain.

615:

616: \begin{figure}[htbp]

617:   \begin{center}

618:     \small

619:     \begin{tabular}{l} \hline\hline

620:       computers (200),

621:       law (41),

622:       electricity (28), \\

623:       plants (15),

624:       medicine (10),

625:       finance (8), \\

626:       mathematics (8),

627:       mechanics (5),

628:       biotechnology (4), \\

629:       construction (2),

630:       ecology (2),

631:       chemistry (1), \\

632:       energy (1),

633:       oceanography (1) \\

634:       \hline

635:     \end{tabular}

636:     \caption{Distribution of domains related to the 326 resultant

637:     descriptions.}

638:     \label{fig:domain_dist}

639:   \end{center}

640: \end{figure}

641:

642: Third, we evaluated the accuracy of our method, that is, the quality

643: of an encyclopedia our method generated.  For this purpose, each of

644: the resultant descriptions was judged as to whether or not it is a

645: correct description for a term in question. Each domain assigned to

646: descriptions was also judged correct or incorrect.

647:

648: We analyzed the result on a description-by-description basis, that is,

649: all the generated descriptions were considered independent of one

650: another. The ratio of correct descriptions, disregarding the domain

651: correctness, was 58.0\% (189/326), and the ratio of correct

652: descriptions categorized into the correct domain was 47.9\% (156/326).

653:

654: However, since all the test terms are inherently related to the IT

655: field, we focused solely on descriptions categorized into the computer

656: domain.  In this case, the ratio of correct descriptions, disregarding

657: the domain correctness, was 62.0\% (124/200), and the ratio of correct

658: descriptions categorized into the correct domain was 61.5\% (123/200).

659:

660: In addition, we analyzed the result on a term-by-term basis, because a

661: user who reads a couple of descriptions needs only one of them to be correct.  In other words,

662: we evaluated each term (not description), and in the case where at

663: least one correct description categorized into the correct domain was

664: generated for a term in question, we judged it correct.  The ratio of

665: correct terms was 89.4\% (76/85), and in the case where we focused

666: solely on the computer domain, the ratio was 84.8\% (67/79).

667:

668: In other words, by reading a couple of descriptions (3.8 descriptions

669: per term), human users can obtain knowledge of approximately 90\% of

670: input terms.

671:

672: Finally, we compared the resultant descriptions with an existing

673: dictionary. For this purpose, we used the ``Nichigai'' computer

674: dictionary~\cite{nichigai_compdic:96}, which lists approximately

675: 30,000 Japanese technical terms related to the computer field, and

676: contains descriptions for 13,588 terms.  In the Nichigai dictionary,

677: 42 out of the 96 test terms were described. Our method, which

678: generated correct descriptions associated with the computer domain for

679: 67 input terms (versus 42), enhanced the Nichigai dictionary in terms of quantity.

680:

681: These results indicate that our method for generating encyclopedias is

682: of operational quality.

683:

684: \subsection{Evaluating Question Answering}

685: \label{subsec:eval_qa}

686:

687: As test inputs, we used 40 questions related to technical

688: terms, collected from the Class II examination held in the autumn of 1999.

689:

690: The objective here is not only to evaluate the performance of our QA

691: system itself, but also to evaluate the quality of the encyclopedia

692: generated by our method.

693:

694: Thus, as in the first experiment

695: (Section~\ref{subsec:eval_generation}), we used the Nichigai computer

696: dictionary as a baseline encyclopedia. We compared the following three

697: resources as the knowledge base:

698: \begin{itemize}

699: \item the Nichigai dictionary (``Nichigai''),

700: \item the descriptions generated in the first experiment (``Web''),

701: \item a combination of both resources (``Nichigai + Web'').

702: \end{itemize}

703:

704: Table~\ref{tab:eval_qa} shows the result of our comparative

705: experiment, in which ``C'' and ``A'' denote coverage and accuracy,

706: respectively, for variations of our QA system.

707:

708: Since all the questions we used are four-choice questions, when the

709: system cannot answer a question, an answer can be chosen at random to

710: raise the coverage to 100\%.  Thus, for each knowledge resource we

711: compared cases without/with random choice, which are denoted ``w/o

712: Random'' and ``w/ Random'' in Table~\ref{tab:eval_qa}, respectively.

713:

714: \begin{table}[htbp]

715:   \begin{center}

716:     \caption{Coverage and accuracy (\%) for different question

717:     answering methods.}

718:     \medskip

719:     \leavevmode

720:     \small

721:     \begin{tabular}{lcccc} \hline\hline

722:       & \multicolumn{2}{c}{w/o Random} & \multicolumn{2}{c}{w/

723:       Random} \\

724:       \multicolumn{1}{c}{Resource} &

725:       C &

726:       A &

727:       C &

728:       A \\ \hline

729:       Nichigai & 50.0 & 65.0 & 100 & 45.0 \\

730:       Web & 92.5 & 48.6 & 100 & 46.9 \\

731:       Nichigai + Web & 95.0 & 63.2 & 100 & 61.3 \\

732:       \hline

733:     \end{tabular}

734:     \label{tab:eval_qa}

735:   \end{center}

736: \end{table}

737:

738: In the case where random choice was not performed, the Web-based

739: encyclopedia noticeably improved the coverage over the Nichigai

740: dictionary (92.5\% versus 50.0\%), but decreased the accuracy (48.6\% versus 65.0\%).

741: However, by combining both resources, the accuracy was noticeably

742: improved over the Web-based encyclopedia alone (63.2\%) and was comparable with that for the Nichigai

743: dictionary, while the coverage remained high (95.0\%).

744:

745: On the other hand, in the case where random choice was performed, the

746: Nichigai dictionary and the Web-based encyclopedia were comparable in

747: terms of both the coverage and accuracy.  Additionally, by combining

748: both resources, the accuracy was further improved.
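
The ``w/ Random'' accuracies in Table~\ref{tab:eval_qa} are consistent
with a simple expected-value calculation. The following is a sketch
under two assumptions that are made here for illustration: each
randomly chosen answer is correct with probability $1/4$ (four
choices), and $C$ and $A$ denote the coverage and accuracy without
random choice, expressed as fractions. The symbol $A_{\mathrm{rand}}$
is likewise introduced only for this check.
\[
  A_{\mathrm{rand}} = C \cdot A + (1 - C) \cdot \frac{1}{4}
\]
For the Nichigai dictionary ($C = 0.500$, $A = 0.650$), this gives
$0.500 \times 0.650 + 0.500 \times 0.25 = 0.450$, matching the 45.0\%
in the table. For ``Nichigai + Web'' ($C = 0.950$, $A = 0.632$), it
gives approximately $0.613$, matching 61.3\%. For ``Web'' ($C = 0.925$,
$A = 0.486$), it gives approximately $0.468$, which agrees with the
reported 46.9\% up to rounding of the input percentages.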

749:

750: We also investigated the performance of our QA system where only

751: descriptions related to the computer domain are used. However,

752: coverage/accuracy did not significantly change, because as shown in

753: Figure~\ref{fig:domain_dist}, most of the descriptions were inherently

754: related to the computer domain.

755:

756: \section{Conclusion}

757: \label{sec:conclusion}

758:

759: The World Wide Web is an unprecedentedly large information

760: source, and a number of language processing methods have been

761: explored to extract, retrieve and discover various types of

762: information from it.

763:

764: In this paper, we aimed at generating encyclopedic knowledge, which is

765: valuable for many applications including human usage and natural

766: language understanding.

767: For this purpose, we reformalized an existing Web-based extraction

768: method, and proposed a new statistical organization model to improve

769: the quality of extracted data.

770:

771: Given a term for which encyclopedic knowledge (i.e., descriptions) is

772: to be generated, our method sequentially performs a) retrieval of Web

773: pages containing the term, b) extraction of page fragments describing

774: the term, and c) organization of extracted descriptions based on domains

775: (and consequently word senses).

776:

777: In addition, we proposed a question answering system, which answers

778: interrogative questions associated with {\it what\/}, by using a

779: Web-based encyclopedia as a knowledge base.  For the purpose of

780: evaluation, we used as test inputs technical terms collected from the

781: Class II IT engineers examination, and found that the encyclopedia

782: generated through our method was of operational quality and quantity.

783:

784: We also used test questions from the Class II examination, and

785: evaluated the Web-based encyclopedia in terms of question

786: answering. We found that our Web-based encyclopedia improved the

787: system coverage obtained solely with an existing dictionary. In

788: addition, when we used both resources, the performance was further

789: improved.

790:

791: Future work would include generating information associated with more

792: complex interrogations, such as ones related to {\it how\/} and {\it

793: why\/}, so as to enhance Web-based natural language understanding.

794:

795: \section*{Acknowledgments}

796:

797: The authors would like to thank NOVA, Inc. for their support with the

798: Nova dictionary and Katunobu Itou (The National Institute of Advanced

799: Industrial Science and Technology, Japan) for his insightful comments

800: on this paper.

801:

802: \bibliographystyle{acl}

803:

804: \begin{thebibliography}{}

805:

806: \bibitem[\protect\citename{Amento \bgroup et al.\egroup

807:   }2000]{amento:sigir-2000}

808: Brian Amento, Loren Terveen, and Will Hill.

809: \newblock 2000.

810: \newblock Does ``authority'' mean quality? {P}redicting expert quality ratings of

811:   {Web} documents.

812: \newblock In {\em Proceedings of the 23rd Annual International ACM SIGIR

813:   Conference on Research and Development in Information Retrieval}, pages

814:   296--303.

815:

816: \bibitem[\protect\citename{Bahl \bgroup et al.\egroup

817:   }1983]{bahl:ieee-tpami-1983}

818: Lalit~R. Bahl, Frederick Jelinek, and Robert~L. Mercer.

819: \newblock 1983.

820: \newblock A maximum likelihood approach to continuous speech recognition.

821: \newblock {\em IEEE Transactions on Pattern Analysis and Machine Intelligence},

822:   5(2):179--190.

823:

824: \bibitem[\protect\citename{Brin and Page}1998]{brin:compnet-1998}

825: Sergey Brin and Lawrence Page.

826: \newblock 1998.

827: \newblock The anatomy of a large-scale hypertextual {Web} search engine.

828: \newblock {\em Computer Networks}, 30(1--7):107--117.

829:

830: \bibitem[\protect\citename{Brown \bgroup et al.\egroup }1993]{brown:cl-93}

831: Peter~F. Brown, Stephen A.~Della Pietra, Vincent J.~Della Pietra, and Robert~L.

832:   Mercer.

833: \newblock 1993.

834: \newblock The mathematics of statistical machine translation: Parameter

835:   estimation.

836: \newblock {\em Computational Linguistics}, 19(2):263--311.

837:

838: \bibitem[\protect\citename{Clarkson and Rosenfeld}1997]{clarkson:eurospeech-97}

839: Philip Clarkson and Ronald Rosenfeld.

840: \newblock 1997.

841: \newblock Statistical language modeling using the {CMU}-{Cambridge} toolkit.

842: \newblock In {\em Proceedings of EuroSpeech'97}, pages 2707--2710.

843:

844: \bibitem[\protect\citename{Etzioni}1997]{etzioni:ai-magazine-97}

845: Oren Etzioni.

846: \newblock 1997.

847: \newblock Moving up the information food chain.

848: \newblock {\em AI Magazine}, 18(2):11--18.

849:

850: \bibitem[\protect\citename{Fujii and Ishikawa}2000]{fujii:acl-2000}

851: Atsushi Fujii and Tetsuya Ishikawa.

852: \newblock 2000.

853: \newblock Utilizing the {World Wide Web} as an encyclopedia: Extracting term

854:   descriptions from semi-structured texts.

855: \newblock In {\em Proceedings of the 38th Annual Meeting of the Association for

856:   Computational Linguistics}, pages 488--495.

857:

858: \bibitem[\protect\citename{Harabagiu \bgroup et al.\egroup

859:   }2000]{harabagiu:coling-2000}

860: Sanda~M. Harabagiu, Marius~A. Pa\c{s}ca, and Steven~J. Maiorano.

861: \newblock 2000.

862: \newblock Experiments with open-domain textual question answering.

863: \newblock In {\em Proceedings of the 18th International Conference on

864:   Computational Linguistics}, pages 292--298.

865:

866: \bibitem[\protect\citename{Hearst}1992]{hearst:coling-92}

867: Marti~A. Hearst.

868: \newblock 1992.

869: \newblock Automatic acquisition of hyponyms from large text corpora.

870: \newblock In {\em Proceedings of the 14th International Conference on

871:   Computational Linguistics}, pages 539--545.

872:

873: \bibitem[\protect\citename{Heibonsha}1998]{heibonsha:98}

874: Hitachi~Digital Heibonsha.

875: \newblock 1998.

876: \newblock {CD-ROM World Encyclopedia}.

877: \newblock (In Japanese).

878:

879: \bibitem[\protect\citename{Inokuchi \bgroup et al.\egroup

880:   }1999]{inokuchi:pakdd-99}

881: Akihiro Inokuchi, Takashi Washio, Hiroshi Motoda, Kouhei Kumasawa, and Naohide

882:   Arai.

883: \newblock 1999.

884: \newblock Basket analysis for graph structured data.

885: \newblock In {\em Proceedings of the 3rd Pacific-Asia Conference on Knowledge

886:   Discovery and Data Mining}, pages 420--431.

887:

888: \bibitem[\protect\citename{Iwayama and Tokunaga}1994]{iwayama:anlp-94}

889: Makoto Iwayama and Takenobu Tokunaga.

890: \newblock 1994.

891: \newblock A probabilistic model for text categorization: Based on a single

892:   random variable with multiple values.

893: \newblock In {\em Proceedings of the 4th Conference on Applied Natural Language

894:   Processing}, pages 162--167.

895:

896: \bibitem[\protect\citename{Matsumoto \bgroup et al.\egroup

897:   }1997]{matsumoto:chasen-97}

898: Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Yoshitaka Hirano, Osamu

899:   Imaichi, and Tomoaki Imamura.

900: \newblock 1997.

901: \newblock {Japanese} morphological analysis system {ChaSen} manual.

902: \newblock Technical Report NAIST-IS-TR97007, NAIST.

903: \newblock (In Japanese).

904:

905: \bibitem[\protect\citename{McCallum \bgroup et al.\egroup

906:   }1999]{mccallum:ijcai-99}

907: Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore.

908: \newblock 1999.

909: \newblock A machine learning approach to building domain-specific search

910:   engines.

911: \newblock In {\em Proceedings of the 16th International Joint Conference on

912:   Artificial Intelligence}, pages 662--667.

913:

914: \bibitem[\protect\citename{Moldovan and Harabagiu}2000]{moldovan:acl-2000}

915: Dan Moldovan and Sanda Harabagiu.

916: \newblock 2000.

917: \newblock The structure and performance of an open-domain question answering

918:   system.

919: \newblock In {\em Proceedings of the 38th Annual Meeting of the Association for

920:   Computational Linguistics}, pages 563--570.

921:

922: \bibitem[\protect\citename{Nakamura and Nagao}1988]{nakamura:coling-88}

923: Jun'ichi Nakamura and Makoto Nagao.

924: \newblock 1988.

925: \newblock Extraction of semantic information from an ordinary {English}

926:   dictionary and its evaluation.

927: \newblock In {\em Proceedings of the 10th International Conference on

928:   Computational Linguistics}, pages 459--464.

929:

930: \bibitem[\protect\citename{{Nichigai Associates}}1996]{nichigai_compdic:96}

931: {Nichigai Associates}.

932: \newblock 1996.

933: \newblock {English-Japanese} computer terminology dictionary.

934: \newblock (In Japanese).

935:

936: \bibitem[\protect\citename{Prager \bgroup et al.\egroup

937:   }2000]{prager:sigir-2000}

938: John Prager, Eric Brown, and Anni Coden.

939: \newblock 2000.

940: \newblock Question-answering by predictive annotation.

941: \newblock In {\em Proceedings of the 23rd Annual International ACM SIGIR

942:   Conference on Research and Development in Information Retrieval}, pages

943:   184--191.

944:

945: \bibitem[\protect\citename{Resnik}1999]{resnik:acl-99}

946: Philip Resnik.

947: \newblock 1999.

948: \newblock Mining the {Web} for bilingual texts.

949: \newblock In {\em Proceedings of the 37th Annual Meeting of the Association for

950:   Computational Linguistics}, pages 527--534.

951:

952: \bibitem[\protect\citename{Robertson and Walker}1994]{robertson:sigir-94}

953: S.~E. Robertson and S.~Walker.

954: \newblock 1994.

955: \newblock Some simple effective approximations to the 2-{Poisson} model for

956:   probabilistic weighted retrieval.

957: \newblock In {\em Proceedings of the 17th Annual International ACM SIGIR

958:   Conference on Research and Development in Information Retrieval}, pages

959:   232--241.

960:

961: \bibitem[\protect\citename{Sch\"{u}tze}1998]{schutze:cl-98}

962: Hinrich Sch\"{u}tze.

963: \newblock 1998.

964: \newblock Automatic word sense discrimination.

965: \newblock {\em Computational Linguistics}, 24(1):97--123.

966:

967: \bibitem[\protect\citename{Soderland}1997]{soderland:kdd-97}

968: Stephen Soderland.

969: \newblock 1997.

970: \newblock Learning to extract text-based information from the {World Wide Web}.

971: \newblock In {\em Proceedings of the 3rd International Conference on Knowledge

972:   Discovery and Data Mining}.

973:

974: \bibitem[\protect\citename{Voorhees and Tice}2000]{voorhees:sigir-2000}

975: Ellen~M. Voorhees and Dawn~M. Tice.

976: \newblock 2000.

977: \newblock Building a question answering test collection.

978: \newblock In {\em Proceedings of the 23rd Annual International ACM SIGIR

979:   Conference on Research and Development in Information Retrieval}, pages

980:   200--207.

981:

982: \bibitem[\protect\citename{Yarowsky}1995]{yarowsky:acl-95}

983: David Yarowsky.

984: \newblock 1995.

985: \newblock Unsupervised word sense disambiguation rivaling supervised methods.

986: \newblock In {\em Proceedings of the 33rd Annual Meeting of the Association for

987:   Computational Linguistics}, pages 189--196.

988:

989: \bibitem[\protect\citename{Zhu and Gauch}2000]{zhu:sigir-2000}

990: Xiaolan Zhu and Susan Gauch.

991: \newblock 2000.

992: \newblock Incorporating quality metrics in centralized/distributed information

993:   retrieval on the {World Wide Web}.

994: \newblock In {\em Proceedings of the 23rd Annual International ACM SIGIR

995:   Conference on Research and Development in Information Retrieval}, pages

996:   288--295.

997:

998: \end{thebibliography}

999:

1000: \end{document}

1001: