
1: %%

2: %% ACL-2000 camera-ready

3: %%

4: \documentstyle[11pt,colacl]{article}

5:

6: \title{Utilizing the World Wide Web as an Encyclopedia: \\ Extracting

7: Term Descriptions from Semi-Structured Texts}

8:

9: \author{Atsushi Fujii \and Tetsuya Ishikawa \\

10: University of Library and Information Science \\

11: 1-2 Kasuga, Tsukuba, 305-8550, JAPAN \\

12: \smallskip

13: {\normalsize\tt fujii@ulis.ac.jp}}

14:

15: \newcommand{\etal}{et~al.}

16: \newcommand{\etaleos}{et~al}

17: \newcommand{\eq}[1]{(\ref{#1})}

18:

19: \renewcommand{\nocite}[1]{\shortcite{#1}}

20:

21: \input{psfig.tex}

22:

23: \begin{document}

24:

25: \maketitle\thispagestyle{empty}

26:

27: \begin{abstract}

28:   In this paper, we propose a method to extract descriptions of

29:   technical terms from Web pages in order to utilize the World Wide

30:   Web as an encyclopedia. We use linguistic patterns and HTML text

31:   structures to extract text fragments containing term descriptions.

32:   We also use a language model to discard extraneous descriptions, and

33:   a clustering method to summarize resultant descriptions.  We show

34:   the effectiveness of our method by way of experiments.

35: \end{abstract}

36:

37: \section{Introduction}

38: \label{sec:introduction}

39:

40: Reflecting the growth in utilization of machine readable texts,

41: extraction and acquisition of linguistic knowledge from large corpora

42: has been one of the major topics within the natural language

43: processing (NLP) community. A sample of linguistic knowledge targeted

44: in past research includes grammars~\cite{kupiec:aaai-slnlp-ws-92},

45: word classes~\cite{hatzivassiloglou:acl-93} and bilingual

46: lexicons~\cite{smadja:cl-96}. While human experts find it difficult to

47: produce exhaustive and consistent linguistic knowledge, automatic

48: methods can help alleviate problems associated with manual

49: construction.

50:

51: Term descriptions, which are usually carefully organized in

52: encyclopedias, are valuable linguistic knowledge, but have seldom been

53: targeted in past NLP literature.  As with other types of linguistic

54: knowledge relying on human introspection and supervision, constructing

55: encyclopedias is quite expensive. Additionally, since existing

56: encyclopedias are usually revised every few years, in many cases users

57: find it difficult to obtain descriptions for newly created terms.

58:

59: To cope with the above limitation of existing encyclopedias, it is

60: possible to use a search engine on the World Wide Web as a substitute,

61: expecting that certain Web pages will describe the submitted

62: keyword. However, since keyword-based search engines often retrieve a

63: surprisingly large number of Web pages, it is time-consuming to

64: identify pages that satisfy the users' information needs.

65:

66: In view of this problem, we propose a method to automatically extract

67: term descriptions from Web pages and summarize them. In this paper, we

68: generally use ``Web pages'' to refer to those pages containing textual

69: contents, excluding those with only image/audio information. Besides

70: this, we specifically target descriptions for technical terms, and

71: thus ``terms'' generally refer to technical terms.

72:

73: In brief, our method extracts fragments of Web pages, based on

74: patterns (or templates) typically used to describe terms.  Web pages

75: are in a sense semi-structured data, because HTML (Hyper Text Markup

76: Language) tags provide the textual information contained in a page

77: with a certain structure. Thus, our method relies on both linguistic

78: and structural description patterns.

79:

80: We used several NLP techniques to semi-automatically produce

81: linguistic patterns.  We call this approach the ``NLP-based method.'' We

82: also produced several heuristics associated with the use of HTML tags,

83: which we call the ``HTML-based method.'' While the former method is

84: language-dependent, and currently applied only to Japanese, the latter

85: method is theoretically language-independent.

86:

87: Our research can be classified from several different perspectives. As

88: explained in the beginning of this section, our research can be seen

89: as linguistic knowledge extraction. Specifically, our research is

90: related to Web mining methods~\cite{nie:sigir-99,resnik:acl-99}.

91:

92: From an information retrieval point of view, our research can be seen

93: as constructing domain-specific (or task-oriented) Web search engines

94: and software agents~\cite{etzioni:ai-magazine-97,mccallum:ijcai-99}.

95:

96: \section{Overview}

97: \label{sec:overview}

98:

99: Our objective is to collect encyclopedic knowledge from the Web, for

100: which we designed a system involving two processes. As with existing

101: Web search systems, in the background process our system periodically

102: updates a database consisting of term descriptions (a description

103: database), while users can browse term descriptions anytime in the

104: foreground process.

105:

106: In the background process, depicted in Figure~\ref{fig:background},

107: a search engine searches the Web for pages containing terms listed in

108: a lexicon.

109:

110: Then, fragments (such as paragraphs) of retrieved Web pages are

111: extracted based on linguistic and structural description

112: patterns. Note that as a preprocessing step for extraction, we

113: discard newline codes, redundant white space, and HTML tags that our

114: extraction method does not use, in order to standardize the layout of

115: Web pages.

116:

117: However, in some cases the extraction process is unsuccessful, and

118: thus extracted fragments are not linguistically understandable.  In

119: addition, Web pages contain some non-linguistic information, such as

120: special characters (symbols) and e-mail addresses for contact, along

121: with linguistic information. Consequently, such noise decreases

122: extraction accuracy.

123:

124: \begin{figure}[t]

125:   \begin{center}

126:     \leavevmode

127:     \psfig{file=background.eps,height=1.8in}

128:   \end{center}

129:   \caption{The control flow of our extraction system.}

130:   \label{fig:background}

131: \end{figure}

132:

133: In view of this problem, we perform filtering to enhance the

134: extraction accuracy. In practice, we use a language model to measure

135: the extent to which a given extracted fragment is linguistically

136: acceptable, and index only fragments judged acceptable into the description

137: database.

138:

139: At the same time, the URLs of Web pages from which descriptions were

140: extracted are also indexed in the database, so that users can browse

141: the full content, in the case where descriptions extracted are not

142: satisfactory.

143:

144: In the case where a number of descriptions are extracted for a single

145: term, the resultant description set is redundant, because it contains

146: a number of similar descriptions. Thus, it is preferable to summarize

147: descriptions, rather than to present all the descriptions as a list.

148:

149: For this purpose, we use a clustering method to divide descriptions

150: for a single term into a certain number of clusters, and present only

151: descriptions that are representative for each cluster. As a result, it

152: is expected that descriptions resembling one another will be in the

153: same cluster, and that each cluster corresponds to different

154: viewpoints and word senses.

155:

156: Possible sources of the lexicon include existing machine readable

157: terminology dictionaries, which often list terms, but lack

158: descriptions. However, since new terms unlisted in existing

159: dictionaries also have to be considered, newspaper articles and

160: magazines distributed via the Web can be possible sources. Specifically,

161: morphological analysis is performed periodically (e.g.,

162: weekly) to identify word tokens from those resources, in order to

163: enhance the lexicon. However, this is not the central issue in this

164: paper.

165:

166: In the foreground process, given an input term, a browser presents one

167: or more descriptions to a user. In the case where the database does

168: not index descriptions for the given term, term descriptions are

169: dynamically extracted as in the background process. The background

170: process is optional, and thus term descriptions can always be obtained

171: dynamically. However, this potentially decreases the time efficiency

172: for a real-time response.

173:

174: Figure~\ref{fig:enigma} shows a Web browser, in which our prototype

175: page presents several Japanese descriptions extracted for the word

176: ``{\it deeta-mainingu\/}~(data mining).'' For example, an English

177: translation for the first description is as follows:

178: \begin{quote}

179:   data mining is a process that collects data for a certain task, and

180:   retrieves relations latent in the data.

181: \end{quote}

182:

183: \begin{figure}[htbp]

184:   \begin{center}

185:     \leavevmode

186:     \psfig{file=enigma.ps,height=3.2in}

187:   \end{center}

188:   \caption{Example Japanese descriptions for ``{\it

189:   deeta-mainingu\/}~(data mining).''}

190:   \label{fig:enigma}

191: \end{figure}

192:

193: In Figure~\ref{fig:enigma}, the descriptions use various expressions,

194: but describe the same content: data mining is a process that

195: discovers rules latent in given databases. It is expected that users

196: can understand what data mining is, by browsing some of those

197: descriptions. In addition, each headword (``{\it deeta-mainingu\/}''

198: in this case) positioned above each description is linked to the Web

199: page from which the description was extracted.

200:

201: In the following sections, we first elaborate on the NLP/HTML-based

202: extraction methods in Section~\ref{sec:extraction}. We then elaborate

203: on noise reduction and clustering methods in Sections~\ref{sec:n-gram}

204: and \ref{sec:clustering}, respectively. Finally, in

205: Section~\ref{sec:experimentation} we investigate the effectiveness of

206: our extraction method by way of experiments.

207:

208: \section{Extracting Term Descriptions}

209: \label{sec:extraction}

210:

211: \subsection{NLP-based Extraction Method}

212: \label{subsec:nlp-based}

213:

214: The crucial issue for the NLP-based extraction method is how to

215: produce linguistic patterns that are typically used to describe technical

216: terms. However, it is difficult to exhaustively enumerate possible

217: description patterns through human introspection alone.

218:

219: Thus, we used NLP techniques to semi-automatically collect description

220: patterns from machine readable encyclopedias, because they usually

221: contain a significantly large number of descriptions for existing

222: terms.  In practice, we used the Japanese CD-ROM World

223: Encyclopedia~\cite{heibonsha:98}, which includes approximately 80,000

224: entries related to various fields.

225:

226: Before collecting description patterns, through a preliminary study on

227: the encyclopedia we used, we found that term descriptions frequently

228: contain salient patterns consisting of two Japanese ``{\it

229: bunsetsu\/}'' phrases. The following sentence, which describes the

230: term ``X,'' contains a typical {\it bunsetsu\/} combination, that is,

231: ``X~{\it toha\/}'' and ``{\it de-aru\/}'':

232: \begin{list}{}{}

233: \item X {\it toha\/} Y {\it de-aru\/}~~~(X is Y).\footnote{Although

234:   ``{\it de-aru\/}'' itself is not a {\it bunsetsu\/} phrase, we use

235:   {\it bunsetsu\/} phrases to refer to combinations of several words.}

236: \end{list}

237: We therefore collected description patterns based on the

238: co-occurrence of two {\it bunsetsu\/} phrases, using the following

239: method.

240:

241: First, we collected entries associated with technical terms listed in

242: the World Encyclopedia, and replaced headwords with a variable ``X.''

243: Note that the World Encyclopedia describes various types of words,

244: including technical terms, historical people and places, and thus

245: description patterns vary depending on the word type. For example,

246: entries for historical people usually contain when/where the people

247: were born and their major contributions to the society.

248:

249: However, for the purposes of our extraction, it is desirable to use

250: entries solely associated with technical terms. We thus consulted the

251: EDR machine readable technical terminology dictionary, which contains

252: approximately 120,000 terms related to the information processing

253: field~\cite{edr-eng:95}, and obtained 2,259 entries associated with

254: terms listed in the EDR dictionary.

255:

256: Second, we used the ChaSen morphological

257: analyzer~\cite{matsumoto:chasen-97}, which is commonly used in

258: Japanese NLP research, to segment collected entries into words,

259: and assign them parts-of-speech. We also developed simple heuristics

260: to produce {\it bunsetsu\/} phrases based on the part-of-speech

261: information.

262:

263: Finally, we collected combinations of two {\it bunsetsu\/} phrases,

264: and sorted them according to their co-occurrence frequency, in

265: descending order. However, since the resultant {\it bunsetsu\/}

266: co-occurrences (even those ranked highly) are often extraneous, we

267: manually supervised (verified, corrected or discarded) the top 100 candidates,

268: and produced 20 description patterns. Figure~\ref{fig:patterns} shows

269: a fragment of the resultant patterns and their English glosses. In

270: this figure, ``X'' and ``Y'' denote variables to which technical terms

271: and sentence fragments can be unified, respectively.

272:

273: \begin{figure}[htbp]

274:   \def\baselinestretch{1}

275:   \begin{center}

276:     \leavevmode

277:     \small

278:     \begin{tabular}[t]{ll} \hline\hline

279:       {\hfill\centering Japanese\hfill}

280:       & {\hfill\centering English Gloss\hfill} \\ \hline

281:       X {\it toha\/} Y {\it de-aru\/}. & X is Y. \\

282:       X {\it ha\/} Y {\it de-aru\/}. & X is Y. \\

283:       Y {\it wo\/} X {\it to-iu}. & Y is called X. \\

284:       X {\it wo\/} Y {\it to-sadameru}. & X is defined as Y. \\

285:       Y {\it wo\/} X {\it to-yobu}. & Y is called X. \\

286:       \hline

287:     \end{tabular}

288:   \end{center}

289:   \caption{A fragment of linguistic description patterns we produced.}

290:   \label{fig:patterns}

291: \end{figure}

292:

293: Given these patterns, we extract sentences that match

294: description patterns from Web pages retrieved by the search engine

295: (see Figure~\ref{fig:background}). In this process, we do not conduct

296: morphological analysis on Web pages, because of its computational

297: cost. Instead, we first segment textual contents in Web pages into

298: sentences, based on the Japanese punctuation system, and then apply

299: surface pattern matching based on regular expressions.

300:

301: However, in most cases term descriptions consist of more than one

302: sentence. This is especially salient in the case where anaphoric

303: expressions and itemization are used. Thus, it is desirable to extract

304: a larger fragment containing sentences that match with description

305: patterns.

306:

307: In view of this problem, we first use linguistic description patterns

308: to roughly identify a zone, and then try the following

309: fragment types in order, relying partially on HTML tags, until a fragment is

310: extracted:\footnote{Although we use HTML tags to identify appropriate

311: text fragments, we call the method described in this section NLP-based

312: method, in a comparison with the method in

313: Section~\ref{subsec:html-based} that relies solely on HTML tags.}

314: \begin{enumerate}

315:   \def\labelenumi{(\theenumi)}

316: \item paragraph tagged with \verb$<P>...</P>$ (or

317:   \verb$<P>...<P>$ in the case where \verb$</P>$ is missing),

318: \item itemization tagged with \verb$<UL>...</UL>$,

319: \item $N$ sentences identified with the Japanese punctuation system,

320:   with the sentence that matched a description pattern

321:   positioned as near the center as possible (we empirically set

322:   \mbox{$N=3$}).

323: \end{enumerate}
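
One way to make the third rule above concrete is sketched below. The
notation is ours and is not part of the original description: we assume
that a page is segmented into sentences $s_1,\ldots,s_n$, that $s_i$ is
the sentence matching a description pattern, and that a window which
would cross the beginning or end of the page is shifted inward rather
than truncated. Under these assumptions the extracted fragment is
$s_b,\ldots,s_e$ with
\[
  h = \left\lfloor \frac{N-1}{2} \right\rfloor , \qquad
  b = \min\bigl(\max(i-h,\,1),\; \max(n-N+1,\,1)\bigr), \qquad
  e = \min(b+N-1,\; n).
\]
For \mbox{$N=3$} this gives $s_{i-1}, s_i, s_{i+1}$ when $1<i<n$,
$s_1, s_2, s_3$ when $i=1$, and $s_{n-2}, s_{n-1}, s_n$ when $i=n$
(assuming $n \geq 3$ in the last two cases).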

324:

325: \subsection{HTML-based Extraction Method}

326: \label{subsec:html-based}

327:

328: Through a preliminary study on existing Web pages, we identified two

329: typical usages of HTML tags associated with describing technical

330: terms.

331:

332: In the first usage, a term in question is highlighted as a heading by

333: way of \verb$<H>...</H>$, \verb$<B>...</B>$ or \verb$<DT>$ tag, and

334: followed by its description in a short fragment. In the second usage,

335: terms that are potentially unfamiliar to readers are tagged with the

336: anchor \verb$<A>$ tag, providing hyperlinks to other pages (or a

337: different position within the same page) where they are described.

338:

339: The crucial factor here is to determine which fragment in the page is

340: extracted as a description. For this purpose, we use the same rules

341: as described in Section~\ref{subsec:nlp-based}. However, unlike the

342: NLP-based method, in the HTML-based method we extract the fragment

343: that {\em follows\/} the heading, or the position linked from the

344: anchor. As an exception, in the case where a term in question is tagged with

345: \verb$<DT>$, we extract the following fragment tagged with

346: \verb$<DD>$. Note that \verb$<DT>$ and \verb$<DD>$ are the HTML tags

347: intended for marking up terms and their descriptions.

348:

349: The HTML-based method is expected to extract term descriptions that

350: cannot be extracted by the NLP-based method, and vice versa. In fact,

351: in Figure~\ref{fig:enigma} the third and fourth descriptions were

352: extracted with the HTML-based method, while the rest were extracted

353: with the NLP-based method.

354:

355: \section{Language Modeling for Filtering}

356: \label{sec:n-gram}

357:

358: Given a set of Web page fragments extracted by the NLP/HTML-based

359: methods, we select fragments that are linguistically understandable,

360: and index them into the description database. For this purpose, we

361: perform language modeling, so as to quantify the extent to which a

362: given text fragment is linguistically acceptable.

363:

364: There are several alternative methods for language modeling. For

365: example, grammars are relatively strict language modeling methods.

366: However, we use a model based on $N$-grams, which is usually more

367: robust than one based on grammars, and measure the perplexity of each fragment

368: under this model: text fragments with lower perplexity values are regarded as more linguistically acceptable.

369:

370: In practice, we used the CMU-Cambridge

371: toolkit~\cite{clarkson:eurospeech-97}, and produced a trigram-based

372: language model from two years of \mbox{Mainichi Shimbun} Japanese

373: newspaper articles~\cite{mainichi:94-95}, which were automatically

374: segmented into words by the ChaSen morphological

375: analyzer~\cite{matsumoto:chasen-97}.

376:

377: In the current implementation, we empirically select as the final

378: extraction results text fragments whose perplexity values are lower

379: than 1,000.
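
For reference, the filtering criterion can be written out as follows.
This is the standard definition of perplexity under a trigram model,
which we assume is the quantity computed here; the symbols $w_i$, $n$
and $\theta$ are introduced only for this illustration, and the
smoothing used to estimate the trigram probabilities is left
unspecified. For a fragment segmented into words $w_1,\ldots,w_n$,
\[
  PP(w_1 \ldots w_n) =
  \exp\left( -\frac{1}{n} \sum_{i=1}^{n}
  \log P(w_i \mid w_{i-2}, w_{i-1}) \right),
\]
and the fragment is retained if and only if
$PP(w_1 \ldots w_n) < \theta$, with $\theta = 1000$. A fragment
consisting mostly of symbols or e-mail addresses yields word sequences
that are improbable under a model trained on newspaper text, and hence
a high perplexity.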

380:

381: \section{Clustering Term Descriptions}

382: \label{sec:clustering}

383:

384: For the purpose of clustering term descriptions extracted using

385: methods in Sections~\ref{sec:extraction} and \ref{sec:n-gram}, we use

386: the Hierarchical Bayesian Clustering (HBC)

387: method~\cite{iwayama:ijcai-95}, which has been used for clustering

388: news articles and constructing thesauri.

389:

390: As with a number of hierarchical clustering methods, the HBC method

391: merges similar items (i.e., term descriptions in our case) in a

392: bottom-up manner, until all the items are merged into a single

393: cluster. That is, a certain number of clusters can be obtained by

394: splitting the resultant hierarchy at a certain level.

395:

396: At the same time, the HBC method also determines the most

397: representative item (centroid) for each cluster. Then, we present only

398: those centroids to users.

399:

400: The similarity between items is computed based on feature vectors that

401: characterize each item. In our case, vectors for each term description

402: consist of frequencies of content words (e.g., nouns and verbs

403: identified through a morphological analysis) appearing in the

404: description.
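
As an illustration, the feature vector of a description $d$ can be
written as follows. The notation ($V$, $w_j$, $f$) is ours, and the
worked example uses the English translation of the first description in
Figure~\ref{fig:enigma} rather than the Japanese original, with a
hypothetical choice of content words (nouns and non-copular verbs).
Let $V = \{w_1,\ldots,w_{|V|}\}$ be the set of content words appearing
in the descriptions extracted for a term, and let $f(w,d)$ be the
number of occurrences of $w$ in $d$. Then
\[
  {\bf v}_d = \bigl( f(w_1,d),\, f(w_2,d),\, \ldots,\, f(w_{|V|},d) \bigr).
\]
For the description ``data mining is a process that collects data for a
certain task, and retrieves relations latent in the data,'' this gives
a count of 3 for ``data,'' a count of 1 for each of ``mining,''
``process,'' ``collect,'' ``task,'' ``retrieve'' and ``relation,'' and
0 for all other words in $V$.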

405:

406: \section{Experimentation}

407: \label{sec:experimentation}

408:

409: \subsection{Methodology}

410: \label{subsec:experiment_method}

411:

412: We evaluated the effectiveness of our extraction method

413: experimentally. However, unlike other research topics where

414: benchmark test collections are available to the public (e.g.,

415: information retrieval), there are two major problems for the purpose

416: of our experimentation, as follows:

417: \begin{itemize}

418: \item production of test terms for which descriptions are extracted,

419: \item judgement for descriptions extracted for those test terms.

420: \end{itemize}

421: For test terms, possible sources are those listed in existing

422: terminology dictionaries. However, since the judgement can be

423: considerably expensive for a large number of test terms, it is

424: preferable to selectively sample a small number of terms that

425: potentially reflect the interest in the real world.

426:

427: In view of this problem, we used as test terms those contained in

428: queries in the NACSIS test collection~\cite{kando:sigir-99}, which

429: consists of 60 Japanese queries and approximately 330,000 abstracts

430: (in either a combination of English and Japanese or either of the

431: languages individually), collected from technical papers published by

432: 65 Japanese associations for various fields.\footnote{\tt

433: {http://www.rd.nacsis.ac.jp/\~{}ntcadm/\\index-en.html}}

434:

435: This collection was originally produced for the evaluation of

436: information retrieval systems, where each query is used to retrieve

437: technical abstracts. Thus, the title field of each query usually

438: contains one or more technical terms. Besides this, since each query

439: was produced based partially on existing technical abstracts, they

440: reflect the real world interest, to some extent.  As a result, we

441: extracted 53 test terms, as shown in Table~\ref{tab:result}. In this

442: table, we romanized Japanese terms, and inserted hyphens between each

443: morpheme for enhanced readability.

444:

445: Note that unlike the case of information retrieval (e.g., a patent

446: retrieval), where every relevant document must be retrieved, in our

447: case even one description can potentially be sufficient. In other

448: words, in our experiments, more weight is attached to accuracy

449: (precision) than recall.

450:

451: For the search engine in Figure~\ref{fig:background}, we used

452: ``goo,''\footnote{{\tt http://www.goo.ne.jp/}} which is one of the

453: major Japanese Web search engines.  Then, for each extracted

454: description, one of the authors judged it correct or incorrect.

455:

456: \subsection{Results}

457: \label{subsec:experiment_result}

458:

459: Out of the 53 test terms extracted from the NACSIS collection, for 44

460: terms goo retrieved one or more Web pages.  Among those 44 test terms,

461: our method extracted at least one term description for 27 terms,

462: disregarding the judgement. Thus, the coverage (or applicability) of

463: our method was 61.4\%. In Table~\ref{tab:result}, the third column

464: denotes the number of Web pages identified by goo. However, goo

465: retrieves contents for only the top 1,000 pages.

466:

467: Table~\ref{tab:result} also shows the number of descriptions judged as

468: correct (the column ``\#C''), the total number of descriptions

469: extracted (the column ``\#T''), and the accuracy (the column ``A''),

470: for both cases with/without the trigram-based language model.

471:

472: \begin{table*}[htbp]

473:   \def\baselinestretch{1}

474:   \begin{center}

475:     \caption{Extraction accuracy for the 27 test terms (\#C = the

476:     number of correct descriptions, \#T = the total number of

477:     extracted descriptions, A = accuracy (\%)).}

478:     \medskip

479:     \leavevmode

480:     \footnotesize

481:     \tabcolsep=3pt

482:     \begin{tabular}{llrrrrrrr} \hline\hline

483:       & & & \multicolumn{3}{c}{w/o Trigram} & \multicolumn{3}{c}{w

484:       Trigram} \\ \cline{4-9}

485:       {\hfill\centering Japanese Term\hfill} &

486:       {\hfill\centering English Gloss\hfill} &

487:       {\hfill\centering \#Pages\hfill}

488:       & {\hfill\centering \#C\hfill} & {\hfill\centering \#T\hfill} &

489:       {\hfill\centering A\hfill} & {\hfill\centering \#C\hfill} &

490:       {\hfill\centering \#T\hfill} & {\hfill\centering A\hfill} \\

491:       \hline

492:       Zipf{\it -no-housoku\/} & Zipf's law & 15 & 1 & 1 & 100 & 1 & 1 &

493:       100 \\

494:       {\it akusesu-seigyo\/} & access control & 6,925 & 10 & 20 & 50.0

495:       & 10 & 20 & 50.0 \\

496:       {\it bunsho-gazou-rikai\/} & document image understanding & 43 &

497:       1 & 1 & 100 & 1 & 1 & 100 \\

498:       {\it chiteki-eejento\/} & intelligent agent & 323 & 3 & 5 & 60.0

499:       & 3 & 5 & 60.0 \\

500:       {\it deeta-mainingu\/} & data mining & 3,389 & 37 & 49 &

501:       75.5 & 30 & 40 & 75.0 \\

502:       {\it denshi-sukashi\/} & digital watermark & 2,124 & 29 & 32 &

503:       90.6 & 29 & 32 & 90.6 \\

504:       {\it denshi-toshokan\/} & digital library & 7,938 & 10 & 26 &

505:       38.5 & 8 & 17 & 47.1 \\

506:       {\it gazou-kensaku\/} & image retrieval & 1,694 & 1 & 4 & 25.0 &

507:       1 & 3 & 33.3 \\

508:       {\it guruupuwea\/} & groupware & 19,760 & 14 & 40 & 35.0 & 12 &

509:       21 & 57.1 \\

510:       {\it hikari-faibaa\/} & optical fiber & 10,078 & 17 & 25 & 68.0

511:       & 14 & 21 & 66.7 \\

512:       {\it ichi-keisoku\/} & position measurement & 735 & 0 & 3 & 0 &

513:       0 & 3 & 0 \\

514:       {\it identeki-arugorizumu\/} & genetic algorithm & 4,686 & 24 &

515:       31 & 77.4 & 22 & 28 & 78.6 \\

516:       {\it jinkou-chinou\/} & artificial intelligence & 18,190 & 10 &

517:       19 & 52.6 & 9 & 13 & 69.2 \\

518:       {\it jiritsu-idou-robotto\/} & autonomous mobile robot & 792 & 2

519:       & 2 & 100 & 2 & 2 & 100 \\

520:       {\it jisedai-intaanetto\/} & next generation Internet & 1,963 &

521:       6 & 10 & 60.0 & 6 & 10 & 60.0 \\

522:       {\it kiiwaado-jidou-chuushutsu\/} & keyword automatic extraction

523:       & 25 & 1 & 1 & 100 & 1 & 1 & 100 \\

524:       {\it kikai-hon'yaku\/} & machine translation & 3,141 & 1 & 10 &

525:       10.0 & 0 & 8 & 0 \\

526:       {\it korokeishon\/} & collocation & 547 & 7 & 16 & 43.8 &

527:       7 & 15 & 46.7 \\

528:       {\it koshou-shindan\/} & fault diagnosis & 1,682 & 2 & 5 & 40.0 &

529:       2 & 4 & 50.0 \\

530:       {\it maruchikyasuto\/} & multicast & 5,758 & 18 & 25 & 72.0 & 15

531:       & 22 & 68.2 \\

532:       {\it media-douki\/} & media synchronization & 46 & 1 & 1 & 100 & 1

533:       & 1 & 100 \\

534:       {\it nettowaaku-toporojii\/} & network topology & 438 & 1 & 4 &

535:       25.0 & 0 & 3 & 0 \\

536:       {\it nyuuraru-nettowaaku\/} & neural network & 9,537 & 37 & 47 &

537:       78.7 & 36 & 45 & 80.0 \\

538:       {\it ringu-gata-nettowaaku\/} & ring network & 44 & 0 & 1 & 0 &

539:       0 & 1 & 0 \\

540:       {\it shisourasu\/} & thesaurus & 3,399 & 21 & 23 &

541:       91.3 & 19 & 20 & 95.0 \\

542:       {\it souraa-kaa\/} & solar car & 3,698 & 12 & 21 &

543:       57.1 & 12 & 21 & 57.1 \\

544:       {\it teromea\/} & telomere & 873 & 26 & 36 & 72.2 &

545:       25 & 34 & 73.5 \\

546:       \hline

547:       {\hfill\centering total\hfill} & {\hfill\centering ---\hfill} &

548:       109,049 & 292 & 460 & 63.5 & 266 & 392 & 67.9 \\ \hline

549:     \end{tabular}

550:     \label{tab:result}

551:   \end{center}

552: \end{table*}

553:

554: Table~\ref{tab:result} shows that the NLP/HTML-based methods extracted

555: appropriate term descriptions with a 63.5\% accuracy, and that the

556: trigram-based language model further improved the accuracy from 63.5\%

557: to 67.9\%. Since roughly two out of three extracted descriptions are correct, browsing

558: about two descriptions is sufficient, on average, for users to understand a term in question. Reading a few descriptions is

559: not time-consuming, because they usually consist of short paragraphs.
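
The arithmetic behind these figures can be made explicit. Accuracy is
the ratio of correct to extracted descriptions, taken here from the
totals row of Table~\ref{tab:result}. The estimate of how many
descriptions a user needs to browse is our own back-of-the-envelope
reading, and assumes that each presented description is correct
independently with probability $A$:
\[
  A_{\rm w/o} = \frac{292}{460} \approx 0.635, \qquad
  A_{\rm w} = \frac{266}{392} \approx 0.679,
\]
\[
  E[\mbox{descriptions browsed until the first correct one}]
  = \frac{1}{A_{\rm w}} \approx \frac{1}{0.679} \approx 1.47 .
\]
Under this assumption, browsing two descriptions is on average enough
to encounter a correct one.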

560:

561: We also investigated the effectiveness of clustering, where for each

562: test term, we clustered descriptions into three clusters (in the case

563: where there are fewer than four descriptions, individual descriptions

564: were regarded as different clusters), and only descriptions determined

565: as representative by the HBC method were presented as the final

566: result. We found that 66.7\% of the descriptions presented were correct.

567: Thus, users can obtain descriptions from different

568: viewpoints and word senses, while roughly maintaining the extraction accuracy

569: obtained above (i.e., 67.9\%).

570:

571: However, we concede that we did not investigate whether or not each

572: cluster corresponds to different viewpoints in a rigorous manner.

573:

574: For the polysemy problem, we investigated all the descriptions

575: extracted, and found that only ``{\it korokeishon\/}~(collocation)''

576: was associated with two word senses, that is, ``word collocations''

577: and ``position of machinery.''  Among the three representative

578: descriptions for ``{\it korokeishon\/}~(collocation),'' two

579: corresponded to the first sense, and one corresponded to the second

580: sense. Thus, in this case the HBC clustering method correctly separated

581: the two word senses.

582:

583: \section{Conclusion}

584: \label{sec:conclusion}

585:

586: In this paper, we proposed a method to extract encyclopedic knowledge

587: from the World Wide Web.

588:

589: For extracting fragments of Web pages containing term descriptions, we

590: used linguistic and HTML structural patterns typically used to describe

591: terms. Then, we used a language model to discard irrelevant

592: descriptions. We also used a clustering method to summarize extracted

593: descriptions based on different viewpoints and word senses.

594:

595: We evaluated our method by way of experiments, and found that the

596: accuracy of our extraction method was sufficient for practical use, that is, a user can

597: understand a term in question by browsing about two descriptions, on

598: average. We also found that the language model and the clustering

599: method further enhanced our framework.

600:

601: Future work will include experiments using a larger number of test

602: terms, and application of extracted descriptions to other NLP

603: research.

604:

605: \section*{Acknowledgments}

606:

607: The authors would like to thank Hitachi Digital Heibonsha, Inc. for

608: their support with the CD-ROM World Encyclopedia, Makoto Iwayama and

609: Takenobu Tokunaga for their support with the HBC clustering software,

610: and Noriko Kando (National Institute of Informatics, Japan) for her

611: support with the NACSIS collection.

612:

613: \bibliographystyle{acl}

614:

615: \begin{thebibliography}{}

616:

617: \bibitem[\protect\citename{Clarkson and Rosenfeld}1997]{clarkson:eurospeech-97}

618: Philip Clarkson and Ronald Rosenfeld.

619: \newblock 1997.

620: \newblock Statistical language modeling using the {CMU}-{Cambridge} toolkit.

621: \newblock In {\em Proceedings of EuroSpeech'97}, pages 2707--2710.

622:

623: \bibitem[\protect\citename{Etzioni}1997]{etzioni:ai-magazine-97}

624: Oren Etzioni.

625: \newblock 1997.

626: \newblock Moving up the information food chain.

627: \newblock {\em AI Magazine}, 18(2):11--18.

628:

629: \bibitem[\protect\citename{Hatzivassiloglou and

630:   McKeown}1993]{hatzivassiloglou:acl-93}

631: Vasileios Hatzivassiloglou and Kathleen~R. McKeown.

632: \newblock 1993.

633: \newblock Towards the automatic identification of adjectival scales: Clustering

634:   adjectives according to meaning.

635: \newblock In {\em Proceedings of the 31st Annual Meeting of the Association for

636:   Computational Linguistics}, pages 172--182.

637:

638: \bibitem[\protect\citename{Heibonsha}1998]{heibonsha:98}

639: Hitachi~Digital Heibonsha.

640: \newblock 1998.

641: \newblock {CD-ROM World Encyclopedia}.

642: \newblock (In Japanese).

643:

644: \bibitem[\protect\citename{Iwayama and Tokunaga}1995]{iwayama:ijcai-95}

645: Makoto Iwayama and Takenobu Tokunaga.

646: \newblock 1995.

647: \newblock Hierarchical {Bayesian} clustering for automatic text classification.

648: \newblock In {\em Proceedings of the 14th International Joint Conference on

649:   Artificial Intelligence}, pages 1322--1327.

650:

651: \bibitem[\protect\citename{{Japan Electronic Dictionary Research

652:   Institute}}1995]{edr-eng:95}

653: {Japan Electronic Dictionary Research Institute}.

654: \newblock 1995.

655: \newblock {EDR} electronic dictionary technical guide.

656:

657: \bibitem[\protect\citename{Kando \bgroup et al.\egroup }1999]{kando:sigir-99}

658: Noriko Kando, Kazuko Kuriyama, and Toshihiko Nozue.

659: \newblock 1999.

660: \newblock {NACSIS} test collection workshop ({NTCIR-1}).

661: \newblock In {\em Proceedings of the 22nd Annual International ACM SIGIR

662:   Conference on Research and Development in Information Retrieval}, pages

663:   299--300.

664:

665: \bibitem[\protect\citename{Kupiec and Maxwell}1992]{kupiec:aaai-slnlp-ws-92}

666: Julian Kupiec and John Maxwell.

667: \newblock 1992.

668: \newblock Training stochastic grammars from unlabelled text corpora.

669: \newblock In {\em Workshop on Statistically-Based Natural Language Processing

670:   Techniques}.

671: \newblock AAAI Technical Reports WS-92-01.

672:

673: \bibitem[\protect\citename{{Mainichi Shimbun}}1994 1995]{mainichi:94-95}

674: {Mainichi Shimbun}.

675: \newblock 1994--1995.

676: \newblock Mainichi shimbun {CD-ROM} '94-'95.

677: \newblock (In Japanese).

678:

679: \bibitem[\protect\citename{Matsumoto \bgroup et al.\egroup

680:   }1997]{matsumoto:chasen-97}

681: Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Osamu Imaichi, and Tomoaki

682:   Imamura.

683: \newblock 1997.

684: \newblock {Japanese} morphological analysis system {ChaSen} manual.

685: \newblock Technical Report NAIST-IS-TR97007, NAIST.

686: \newblock (In Japanese).

687:

688: \bibitem[\protect\citename{McCallum \bgroup et al.\egroup

689:   }1999]{mccallum:ijcai-99}

690: Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore.

691: \newblock 1999.

692: \newblock A machine learning approach to building domain-specific search

693:   engines.

694: \newblock In {\em Proceedings of the 16th International Joint Conference on

695:   Artificial Intelligence}, pages 662--667.

696:

697: \bibitem[\protect\citename{Nie \bgroup et al.\egroup }1999]{nie:sigir-99}

698: Jian-Yun Nie, Michel Simard, Pierre Isabelle, and Richard Durand.

699: \newblock 1999.

700: \newblock Cross-language information retrieval based on parallel texts and

701:   automatic mining of parallel texts from the {Web}.

702: \newblock In {\em Proceedings of the 22nd Annual International ACM SIGIR

703:   Conference on Research and Development in Information Retrieval}, pages

704:   74--81.

705:

706: \bibitem[\protect\citename{Resnik}1999]{resnik:acl-99}

707: Philip Resnik.

708: \newblock 1999.

709: \newblock Mining the {Web} for bilingual texts.

710: \newblock In {\em Proceedings of the 37th Annual Meeting of the Association for

711:   Computational Linguistics}, pages 527--534.

712:

713: \bibitem[\protect\citename{Smadja \bgroup et al.\egroup }1996]{smadja:cl-96}

714: Frank Smadja, Kathleen~R. McKeown, and Vasileios Hatzivassiloglou.

715: \newblock 1996.

716: \newblock Translating collocations for bilingual lexicons: A statistical

717:   approach.

718: \newblock {\em Computational Linguistics}, 22(1):1--38.

719:

720: \end{thebibliography}

721:

722:

723: \end{document}

724: