1: %%
2: %% ACL-2000 camera-ready
3: %%
4: \documentstyle[11pt,colacl]{article}
5:
6: \title{Utilizing the World Wide Web as an Encyclopedia: \\ Extracting
7: Term Descriptions from Semi-Structured Texts}
8:
9: \author{Atsushi Fujii \and Tetsuya Ishikawa \\
10: University of Library and Information Science \\
11: 1-2 Kasuga, Tsukuba, 305-8550, JAPAN \\
12: \smallskip
13: {\normalsize\tt fujii@ulis.ac.jp}}
14:
15: \newcommand{\etal}{et~al.}
16: \newcommand{\etaleos}{et~al}
17: \newcommand{\eq}[1]{(\ref{#1})}
18:
19: \renewcommand{\nocite}[1]{\shortcite{#1}}
20:
21: \input{psfig.tex}
22:
23: \begin{document}
24:
25: \maketitle\thispagestyle{empty}
26:
27: \begin{abstract}
28: In this paper, we propose a method to extract descriptions of
29: technical terms from Web pages in order to utilize the World Wide
30: Web as an encyclopedia. We use linguistic patterns and HTML text
31: structures to extract text fragments containing term descriptions.
32: We also use a language model to discard extraneous descriptions, and
33: a clustering method to summarize resultant descriptions. We show
34: the effectiveness of our method by way of experiments.
35: \end{abstract}
36:
37: \section{Introduction}
38: \label{sec:introduction}
39:
40: Reflecting the growth in utilization of machine readable texts,
41: extraction and acquisition of linguistic knowledge from large corpora
42: has been one of the major topics within the natural language
43: processing (NLP) community. A sample of linguistic knowledge targeted
44: in past research includes grammars~\cite{kupiec:aaai-slnlp-ws-92},
45: word classes~\cite{hatzivassiloglou:acl-93} and bilingual
46: lexicons~\cite{smadja:cl-96}. While human experts find it difficult to
47: produce exhaustive and consistent linguistic knowledge, automatic
48: methods can help alleviate problems associated with manual
49: construction.
50:
51: Term descriptions, which are usually carefully organized in
52: encyclopedias, are valuable linguistic knowledge, but have seldom been
53: targeted in past NLP literature. As with other types of linguistic
54: knowledge relying on human introspection and supervision, constructing
55: encyclopedias is quite expensive. Additionally, since existing
56: encyclopedias are usually revised every few years, in many cases users
57: find it difficult to obtain descriptions for newly created terms.
58:
59: To cope with the above limitation of existing encyclopedias, it is
60: possible to use a search engine on the World Wide Web as a substitute,
61: expecting that certain Web pages will describe the submitted
62: keyword. However, since keyword-based search engines often retrieve a
63: surprisingly large number of Web pages, it is time-consuming to
64: identify pages that satisfy the users' information needs.
65:
66: In view of this problem, we propose a method to automatically extract
67: term descriptions from Web pages and summarize them. In this paper, we
68: generally use ``Web pages'' to refer to those pages containing textual
69: contents, excluding those with only image/audio information. Besides
70: this, we specifically target descriptions for technical terms, and
71: thus ``terms'' generally refer to technical terms.
72:
73: In brief, our method extracts fragments of Web pages, based on
74: patterns (or templates) typically used to describe terms. Web pages
75: are in a sense semi-structured data, because HTML (Hyper Text Markup
76: Language) tags provide the textual information contained in a page
77: with a certain structure. Thus, our method relies on both linguistic
78: and structural description patterns.
79:
We used several NLP techniques to semi-automatically produce
linguistic patterns. We call this approach the ``NLP-based method.'' We
also produced several heuristics associated with the use of HTML tags,
which we call the ``HTML-based method.'' While the former method is
language-dependent, and currently applied only to Japanese, the latter
method is theoretically language-independent.
86:
87: Our research can be classified from several different perspectives. As
88: explained in the beginning of this section, our research can be seen
89: as linguistic knowledge extraction. Specifically, our research is
90: related to Web mining methods~\cite{nie:sigir-99,resnik:acl-99}.
91:
92: From an information retrieval point of view, our research can be seen
93: as constructing domain-specific (or task-oriented) Web search engines
94: and software agents~\cite{etzioni:ai-magazine-97,mccallum:ijcai-99}.
95:
96: \section{Overview}
97: \label{sec:overview}
98:
99: Our objective is to collect encyclopedic knowledge from the Web, for
100: which we designed a system involving two processes. As with existing
101: Web search systems, in the background process our system periodically
102: updates a database consisting of term descriptions (a description
103: database), while users can browse term descriptions anytime in the
104: foreground process.
105:
In the background process, depicted in Figure~\ref{fig:background},
a search engine searches the Web for pages containing terms listed in
a lexicon.
109:
Then, fragments (such as paragraphs) of retrieved Web pages are
extracted based on linguistic and structural description
patterns. Note that as a preprocessing step for extraction, we
discard newline codes, redundant white space, and HTML tags that our
extraction method does not use, in order to standardize the layout of
Web pages.
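
As an illustration of this preprocessing, the following Python sketch
drops tags outside a fixed set and collapses white space. The names
\verb$KEPT_TAGS$, \verb$TAG_RE$ and \verb$normalize$ are assumptions
made for this example, the tag set is only a guess at the tags that an
extraction method of this kind would keep, and the regular expression
is a simplification that ignores comments and scripts.

\begin{verbatim}
```python
import re

# Tags assumed to matter for extraction; everything else is discarded.
KEPT_TAGS = {"p", "ul", "li", "dl", "dt", "dd", "b", "a",
             "h1", "h2", "h3", "h4", "h5", "h6"}

# Simplified tag matcher: optional slash, tag name, then any attributes.
TAG_RE = re.compile(r"<\s*/?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")


def normalize(html):
    """Drop unused tags, then collapse newlines and runs of white space."""
    def keep_or_drop(match):
        name = match.group(1).lower()
        # Kept tags are copied verbatim so that attributes survive.
        return match.group(0) if name in KEPT_TAGS else ""
    text = TAG_RE.sub(keep_or_drop, html)
    return re.sub(r"\s+", " ", text).strip()
```
\end{verbatim}

Kept tags are copied unchanged so that attributes such as the target
of an anchor remain available to later steps.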
116:
However, in some cases the extraction process is unsuccessful, and
the extracted fragments are not linguistically understandable. In
addition, Web pages contain non-linguistic information, such as
special characters (symbols) and e-mail addresses for contact, along
with linguistic information. Such noise decreases the extraction
accuracy.
123:
124: \begin{figure}[t]
125: \begin{center}
126: \leavevmode
127: \psfig{file=background.eps,height=1.8in}
128: \end{center}
129: \caption{The control flow of our extraction system.}
130: \label{fig:background}
131: \end{figure}
132:
In view of this problem, we perform filtering to enhance the
extraction accuracy. In practice, we use a language model to measure
the extent to which a given extracted fragment is linguistically
acceptable, and index only fragments judged as linguistic into the
description database.
138:
139: At the same time, the URLs of Web pages from which descriptions were
140: extracted are also indexed in the database, so that users can browse
141: the full content, in the case where descriptions extracted are not
142: satisfactory.
143:
144: In the case where a number of descriptions are extracted for a single
145: term, the resultant description set is redundant, because it contains
146: a number of similar descriptions. Thus, it is preferable to summarize
147: descriptions, rather than to present all the descriptions as a list.
148:
149: For this purpose, we use a clustering method to divide descriptions
150: for a single term into a certain number of clusters, and present only
151: descriptions that are representative for each cluster. As a result, it
152: is expected that descriptions resembling one another will be in the
153: same cluster, and that each cluster corresponds to different
154: viewpoints and word senses.
155:
Possible sources of the lexicon include existing machine readable
terminology dictionaries, which often list terms, but lack
descriptions. However, since new terms unlisted in existing
dictionaries also have to be considered, newspaper articles and
magazines distributed via the Web can be possible sources. Specifically,
morphological analysis is performed periodically (e.g.,
weekly) to identify word tokens in those resources, in order to
enhance the lexicon. However, this is not the central issue in this
paper.
165:
166: In the foreground process, given an input term, a browser presents one
167: or more descriptions to a user. In the case where the database does
168: not index descriptions for the given term, term descriptions are
169: dynamically extracted as in the background process. The background
170: process is optional, and thus term descriptions can always be obtained
171: dynamically. However, this potentially decreases the time efficiency
172: for a real-time response.
173:
174: Figure~\ref{fig:enigma} shows a Web browser, in which our prototype
175: page presents several Japanese descriptions extracted for the word
176: ``{\it deeta-mainingu\/}~(data mining).'' For example, an English
177: translation for the first description is as follows:
178: \begin{quote}
179: data mining is a process that collects data for a certain task, and
180: retrieves relations latent in the data.
181: \end{quote}
182:
183: \begin{figure}[htbp]
184: \begin{center}
185: \leavevmode
186: \psfig{file=enigma.ps,height=3.2in}
187: \end{center}
188: \caption{Example Japanese descriptions for ``{\it
189: deeta-mainingu\/}~(data mining).''}
190: \label{fig:enigma}
191: \end{figure}
192:
In Figure~\ref{fig:enigma}, the descriptions use various expressions,
but describe the same content: data mining is a process which
discovers rules latent in given databases. It is expected that users
196: can understand what data mining is, by browsing some of those
197: descriptions. In addition, each headword (``{\it deeta-mainingu\/}''
198: in this case) positioned above each description is linked to the Web
199: page from which the description was extracted.
200:
201: In the following sections, we first elaborate on the NLP/HTML-based
202: extraction methods in Section~\ref{sec:extraction}. We then elaborate
203: on noise reduction and clustering methods in Sections~\ref{sec:n-gram}
204: and \ref{sec:clustering}, respectively. Finally, in
205: Section~\ref{sec:experimentation} we investigate the effectiveness of
206: our extraction method by way of experiments.
207:
208: \section{Extracting Term Descriptions}
209: \label{sec:extraction}
210:
211: \subsection{NLP-based Extraction Method}
212: \label{subsec:nlp-based}
213:
The crucial issue for the NLP-based extraction method is how to
produce linguistic patterns that can be used to describe technical
terms. However, it is difficult to exhaustively enumerate possible
description patterns through human introspection alone.
218:
219: Thus, we used NLP techniques to semi-automatically collect description
220: patterns from machine readable encyclopedias, because they usually
221: contain a significantly large number of descriptions for existing
222: terms. In practice, we used the Japanese CD-ROM World
223: Encyclopedia~\cite{heibonsha:98}, which includes approximately 80,000
224: entries related to various fields.
225:
226: Before collecting description patterns, through a preliminary study on
227: the encyclopedia we used, we found that term descriptions frequently
228: contain salient patterns consisting of two Japanese ``{\it
229: bunsetsu\/}'' phrases. The following sentence, which describes the
230: term ``X,'' contains a typical {\it bunsetsu\/} combination, that is,
231: ``X~{\it toha\/}'' and ``{\it de-aru\/}'':
232: \begin{list}{}{}
233: \item X {\it toha\/} Y {\it de-aru\/}~~~(X is Y).\footnote{Although
234: ``{\it de-aru\/}'' itself is not a {\it bunsetsu\/} phrase, we use
235: {\it bunsetsu\/} phrases to refer to combinations of several words.}
236: \end{list}
Accordingly, we collected description patterns based on the
co-occurrence of two {\it bunsetsu\/} phrases, using the following
method.
240:
241: First, we collected entries associated with technical terms listed in
242: the World Encyclopedia, and replaced headwords with a variable ``X.''
243: Note that the World Encyclopedia describes various types of words,
244: including technical terms, historical people and places, and thus
245: description patterns vary depending on the word type. For example,
246: entries for historical people usually contain when/where the people
247: were born and their major contributions to the society.
248:
However, for the purposes of our extraction, it is desirable to use
entries solely associated with technical terms. We therefore consulted the
EDR machine readable technical terminology dictionary, which contains
approximately 120,000 terms related to the information processing
field~\cite{edr-eng:95}, and obtained 2,259 entries associated with
terms listed in the EDR dictionary.
255:
256: Second, we used the ChaSen morphological
257: analyzer~\cite{matsumoto:chasen-97}, which has commonly been used for
258: much Japanese NLP research, to segment collected entries into words,
259: and assign them parts-of-speech. We also developed simple heuristics
260: to produce {\it bunsetsu\/} phrases based on the part-of-speech
261: information.
262:
Finally, we collected combinations of two {\it bunsetsu\/} phrases,
and sorted them according to their co-occurrence frequency, in
descending order. However, since the resultant {\it bunsetsu\/}
co-occurrences (even those with higher rankings) include extraneous
ones, we supervised (verified, corrected or discarded) the top 100
candidates, and produced 20 description patterns. Figure~\ref{fig:patterns} shows
a fragment of the resultant patterns and their English glosses. In
this figure, ``X'' and ``Y'' denote variables to which technical terms
and sentence fragments can be unified, respectively.
272:
273: \begin{figure}[htbp]
274: \def\baselinestretch{1}
275: \begin{center}
276: \leavevmode
277: \small
278: \begin{tabular}[t]{ll} \hline\hline
279: {\hfill\centering Japanese\hfill}
280: & {\hfill\centering English Gloss\hfill} \\ \hline
X {\it toha\/} Y {\it de-aru}. & X is Y. \\
X {\it ha\/} Y {\it de-aru}. & X is Y. \\
283: Y {\it wo\/} X {\it to-iu}. & Y is called X. \\
284: X {\it wo\/} Y {\it to-sadameru}. & X is defined as Y. \\
285: Y {\it wo\/} X {\it to-yobu}. & Y is called X. \\
286: \hline
287: \end{tabular}
288: \end{center}
289: \caption{A fragment of linguistic description patterns we produced.}
290: \label{fig:patterns}
291: \end{figure}
292:
Here, we are in a position to extract sentences that match
description patterns, from Web pages retrieved by the search engine
(see Figure~\ref{fig:background}). In this process, we do not conduct
morphological analysis on Web pages, because of its computational
cost. Instead, we first segment textual contents in Web pages into
sentences, based on the Japanese punctuation system, and then apply
surface pattern matching based on regular expressions.
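
The sentence segmentation and surface matching described above can be
sketched in Python as follows. This is a minimal illustration and not
necessarily the implementation used in the experiments. It covers only
four of the patterns in Figure~\ref{fig:patterns}, and the kana
spellings assigned to the romanized patterns, together with all
function and constant names, are assumptions made for this example.
The kana are written as Unicode escapes so that the listing stays in
ASCII.

\begin{verbatim}
```python
import re

# Kana spellings assumed for the romanized patterns in the figure.
TOHA = "\u3068\u306f"            # toha
HA = "\u306f"                    # ha
WO = "\u3092"                    # wo
DEARU = "\u3067\u3042\u308b"     # de-aru
TOIU = "\u3068\u3044\u3046"      # to-iu
TOYOBU = "\u3068\u547c\u3076"    # to-yobu
STOP = "\u3002"                  # Japanese full stop


def split_sentences(text):
    """Split text into sentences, each keeping its final full stop."""
    return re.findall("[^" + STOP + "]+" + STOP + "?", text)


def build_patterns(term):
    """Compile surface patterns in which X is bound to the given term."""
    x = re.escape(term)
    return [
        re.compile(x + TOHA + ".+" + DEARU),   # X toha Y de-aru
        re.compile(x + HA + ".+" + DEARU),     # X ha Y de-aru
        re.compile(".+" + WO + x + TOIU),      # Y wo X to-iu
        re.compile(".+" + WO + x + TOYOBU),    # Y wo X to-yobu
    ]


def matching_sentences(text, term):
    """Return the sentences that match at least one description pattern."""
    patterns = build_patterns(term)
    return [s for s in split_sentences(text)
            if any(p.search(s) for p in patterns)]
```
\end{verbatim}

Because the term is inserted with \verb$re.escape$, terms containing
characters that are special in regular expressions are matched
literally.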
300:
301: However, in most cases term descriptions consist of more than one
302: sentence. This is especially salient in the case where anaphoric
303: expressions and itemization are used. Thus, it is desirable to extract
304: a larger fragment containing sentences that match with description
305: patterns.
306:
In view of this problem, we first use linguistic description patterns
to roughly identify a zone, and sequentially search for the following
fragments relying partially on HTML tags, until a certain fragment is
extracted:\footnote{Although we use HTML tags to identify appropriate
text fragments, we call the method described in this section the NLP-based
method, in contrast to the method in
Section~\ref{subsec:html-based} that relies solely on HTML tags.}
314: \begin{enumerate}
315: \def\labelenumi{(\theenumi)}
316: \item paragraph tagged with \verb$<P>...</P>$ (or
317: \verb$<P>...<P>$ in the case where \verb$</P>$ is missing),
318: \item itemization tagged with \verb$<UL>...</UL>$,
\item $N$ sentences identified with the Japanese punctuation system,
such that the sentence that matched a description pattern is
positioned as near the center as possible; we empirically set
\mbox{$N=3$}.
323: \end{enumerate}
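
Rule (3) can be sketched in Python as follows. The function name
\verb$sentence_window$ and the behavior at page boundaries (shifting
the window inward so that it still contains $N$ sentences whenever the
page is long enough) are assumptions made for this example.

\begin{verbatim}
```python
def sentence_window(sentences, match_index, n=3):
    """Return up to n consecutive sentences around the matched one.

    The matched sentence is kept as close to the center as the
    boundaries of the sentence list allow.
    """
    half = n // 2
    start = max(0, match_index - half)
    end = min(len(sentences), start + n)
    # Shift the window back if it ran past the end of the list.
    start = max(0, end - n)
    return sentences[start:end]
```
\end{verbatim}

With \verb$n=3$, a match in the middle of a page yields the preceding
sentence, the matched sentence, and the following sentence.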
324:
325: \subsection{HTML-based Extraction Method}
326: \label{subsec:html-based}
327:
328: Through a preliminary study on existing Web pages, we identified two
329: typical usages of HTML tags associated with describing technical
330: terms.
331:
332: In the first usage, a term in question is highlighted as a heading by
333: way of \verb$<H>...</H>$, \verb$<B>...</B>$ or \verb$<DT>$ tag, and
334: followed by its description in a short fragment. In the second usage,
335: terms that are potentially unfamiliar to readers are tagged with the
336: anchor \verb$<A>$ tag, providing hyperlinks to other pages (or a
337: different position within the same page) where they are described.
338:
The crucial factor here is to determine which fragment in the page is
extracted as a description. For this purpose, we use the same rules
described in Section~\ref{subsec:nlp-based}. Unlike the
NLP-based method, however, in the HTML-based method we extract the fragment
that {\em follows\/} the heading or the position linked from the
anchor. As an exception, in the case where a term in question is tagged with
\verb$<DT>$, we extract the following fragment tagged with
\verb$<DD>$. Note that \verb$<DT>$ and \verb$<DD>$ are inherently
provided to describe terms.
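
The \verb$<DT>$/\verb$<DD>$ case can be sketched with the
\verb$html.parser$ module of the Python standard library, as follows.
The class and function names are assumptions made for this example,
and the sketch covers only definition lists, not the heading and
anchor cases.

\begin{verbatim}
```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collect (term, description) pairs from <DT>/<DD> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._current = None   # "dt", "dd", or None
        self._term = []
        self._desc = []

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self._flush()      # a new term closes the previous pair
            self._current = "dt"
        elif tag == "dd":
            self._current = "dd"

    def handle_endtag(self, tag):
        if tag in ("dt", "dd") and self._current == tag:
            self._current = None
        elif tag == "dl":
            self._flush()

    def handle_data(self, data):
        if self._current == "dt":
            self._term.append(data)
        elif self._current == "dd":
            self._desc.append(data)

    def _flush(self):
        term = "".join(self._term).strip()
        desc = "".join(self._desc).strip()
        if term and desc:
            self.pairs.append((term, desc))
        self._term, self._desc = [], []

    def close(self):
        super().close()
        self._flush()


def extract_definitions(html):
    """Return all (term, description) pairs found in the page."""
    parser = DefinitionListParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


def describe(html, term):
    """Return the <DD> fragments whose <DT> equals the given term."""
    return [d for t, d in extract_definitions(html) if t == term]
```
\end{verbatim}

The parser tolerates omitted \verb$</DT>$ and \verb$</DD>$ tags,
because a pair is closed whenever the next \verb$<DT>$ or the end of
the list is reached.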
348:
349: The HTML-based method is expected to extract term descriptions that
350: cannot be extracted by the NLP-based method, and vice versa. In fact,
351: in Figure~\ref{fig:enigma} the third and fourth descriptions were
352: extracted with the HTML-based method, while the rest were extracted
353: with the NLP-based method.
354:
355: \section{Language Modeling for Filtering}
356: \label{sec:n-gram}
357:
Given a set of Web page fragments extracted by the NLP/HTML-based
methods, we select fragments that are linguistically understandable,
and index them into the description database. For this purpose, we
perform language modeling, so as to quantify the extent to which a
given text fragment is linguistically acceptable.
363:
There are several alternative methods for language modeling. For
example, grammars are relatively strict language modeling methods.
However, we use a model based on $N$-grams, which is usually more
robust than one based on grammars. With this model, we quantify
linguistic acceptability by perplexity: text fragments
with lower perplexity values are more linguistically acceptable.
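
For reference, the perplexity of a fragment $w_1 \ldots w_n$ under a
trigram model (\mbox{$N=3$}) is conventionally defined as
\[
  {\rm PP}(w_1 \ldots w_n) =
  \exp\Bigl(-\frac{1}{n}\sum_{i=1}^{n}
  \log P(w_i \mid w_{i-2}, w_{i-1})\Bigr),
\]
where $P(w_i \mid w_{i-2}, w_{i-1})$ is the model probability of word
$w_i$ given the two preceding words. This is the standard textbook
definition, given here as an assumption about the quantity being
measured; details such as smoothing and the treatment of unknown words
depend on the toolkit and are omitted.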
369:
370: In practice, we used the CMU-Cambridge
371: toolkit~\cite{clarkson:eurospeech-97}, and produced a trigram-based
372: language model from two years of \mbox{Mainichi Shimbun} Japanese
373: newspaper articles~\cite{mainichi:94-95}, which were automatically
374: segmented into words by the ChaSen morphological
375: analyzer~\cite{matsumoto:chasen-97}.
376:
377: In the current implementation, we empirically select as the final
378: extraction results text fragments whose perplexity values are lower
379: than 1,000.
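
The thresholding step can be illustrated with a toy trigram model in
Python. This sketch is a stand-in for illustration only: it uses
add-one smoothing rather than the CMU-Cambridge toolkit, and the class
name, method names and padding symbols are assumptions made for this
example. Only the default threshold of 1,000 is taken from the text
above.

\begin{verbatim}
```python
import math
from collections import Counter


class TrigramModel:
    """Toy trigram language model with add-one smoothing."""

    def __init__(self, sentences):
        self.tri = Counter()
        self.bi = Counter()
        self.vocab = {"</s>"}
        for words in sentences:
            padded = ["<s>", "<s>"] + list(words) + ["</s>"]
            self.vocab.update(words)
            for i in range(2, len(padded)):
                self.bi[tuple(padded[i - 2:i])] += 1
                self.tri[tuple(padded[i - 2:i + 1])] += 1

    def prob(self, w1, w2, w3):
        # One extra vocabulary slot is reserved for unknown words.
        size = len(self.vocab) + 1
        return (self.tri[(w1, w2, w3)] + 1) / (self.bi[(w1, w2)] + size)

    def perplexity(self, words):
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        log_sum = 0.0
        for i in range(2, len(padded)):
            log_sum += math.log(self.prob(*padded[i - 2:i + 1]))
        return math.exp(-log_sum / (len(padded) - 2))


def filter_fragments(model, fragments, threshold=1000.0):
    """Keep the fragments whose perplexity is below the threshold."""
    return [f for f in fragments if model.perplexity(f) < threshold]
```
\end{verbatim}

Each fragment is assumed to be already segmented into a list of words.
Absolute perplexity values from such a toy model are not comparable to
those of a model trained on two years of newspaper text, so a suitable
threshold would have to be chosen again for any other model.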
380:
381: \section{Clustering Term Descriptions}
382: \label{sec:clustering}
383:
384: For the purpose of clustering term descriptions extracted using
385: methods in Sections~\ref{sec:extraction} and \ref{sec:n-gram}, we use
386: the Hierarchical Bayesian Clustering (HBC)
387: method~\cite{iwayama:ijcai-95}, which has been used for clustering
388: news articles and constructing thesauri.
389:
390: As with a number of hierarchical clustering methods, the HBC method
391: merges similar items (i.e., term descriptions in our case) in a
392: bottom-up manner, until all the items are merged into a single
393: cluster. That is, a certain number of clusters can be obtained by
394: splitting the resultant hierarchy at a certain level.
395:
396: At the same time, the HBC method also determines the most
397: representative item (centroid) for each cluster. Then, we present only
398: those centroids to users.
399:
400: The similarity between items is computed based on feature vectors that
401: characterize each item. In our case, vectors for each term description
402: consist of frequencies of content words (e.g., nouns and verbs
403: identified through a morphological analysis) appearing in the
404: description.
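
The construction of such feature vectors can be sketched in Python as
follows. The function names are assumptions made for this example. The
cosine measure is included only to show that descriptions sharing
content words obtain a higher similarity; it is not the probabilistic
criterion that the HBC method itself optimizes, which is not
reproduced here.

\begin{verbatim}
```python
import math
from collections import Counter


def feature_vector(content_words):
    """Map each content word of a description to its frequency."""
    return Counter(content_words)


def cosine(u, v):
    """Illustrative similarity between two frequency vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```
\end{verbatim}

The input to \verb$feature_vector$ is assumed to be the list of
content words already identified by a morphological analyzer.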
405:
406: \section{Experimentation}
407: \label{sec:experimentation}
408:
409: \subsection{Methodology}
410: \label{subsec:experiment_method}
411:
We investigated the effectiveness of our extraction method
empirically. However, unlike other research topics where
benchmark test collections are available to the public (e.g.,
information retrieval), our experimentation faces two major
problems, as follows:
417: \begin{itemize}
418: \item production of test terms for which descriptions are extracted,
419: \item judgement for descriptions extracted for those test terms.
420: \end{itemize}
421: For test terms, possible sources are those listed in existing
422: terminology dictionaries. However, since the judgement can be
423: considerably expensive for a large number of test terms, it is
424: preferable to selectively sample a small number of terms that
425: potentially reflect the interest in the real world.
426:
427: In view of this problem, we used as test terms those contained in
428: queries in the NACSIS test collection~\cite{kando:sigir-99}, which
429: consists of 60 Japanese queries and approximately 330,000 abstracts
430: (in either a combination of English and Japanese or either of the
431: languages individually), collected from technical papers published by
432: 65 Japanese associations for various fields.\footnote{\tt
433: {http://www.rd.nacsis.ac.jp/\~{}ntcadm/\\index-en.html}}
434:
435: This collection was originally produced for the evaluation of
436: information retrieval systems, where each query is used to retrieve
437: technical abstracts. Thus, the title field of each query usually
438: contains one or more technical terms. Besides this, since each query
439: was produced based partially on existing technical abstracts, they
440: reflect the real world interest, to some extent. As a result, we
441: extracted 53 test terms, as shown in Table~\ref{tab:result}. In this
442: table, we romanized Japanese terms, and inserted hyphens between each
443: morpheme for enhanced readability.
444:
Note that unlike the case of information retrieval (e.g., patent
retrieval), where every relevant document must be retrieved, in our
447: case even one description can potentially be sufficient. In other
448: words, in our experiments, more weight is attached to accuracy
449: (precision) than recall.
450:
451: For the search engine in Figure~\ref{fig:background}, we used
452: ``goo,''\footnote{{\tt http://www.goo.ne.jp/}} which is one of the
453: major Japanese Web search engines. Then, for each extracted
454: description, one of the authors judged it correct or incorrect.
455:
456: \subsection{Results}
457: \label{subsec:experiment_result}
458:
459: Out of the 53 test terms extracted from the NACSIS collection, for 44
460: terms goo retrieved one or more Web pages. Among those 44 test terms,
461: our method extracted at least one term description for 27 terms,
462: disregarding the judgement. Thus, the coverage (or applicability) of
463: our method was 61.4\%. In Table~\ref{tab:result}, the third column
464: denotes the number of Web pages identified by goo. However, goo
465: retrieves contents for only the top 1,000 pages.
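
The coverage figure follows directly from the two counts above, that
is, the 27 terms with at least one extracted description out of the 44
terms for which pages were retrieved:
\[
  {\rm coverage} = \frac{27}{44} \approx 0.614 .
\]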
466:
Table~\ref{tab:result} also shows the number of descriptions judged as
correct (the column ``\#C''), the total number of descriptions
extracted (the column ``\#T''), and the accuracy (the column ``A''),
both with and without the trigram-based language model.
471:
472: \begin{table*}[htbp]
473: \def\baselinestretch{1}
474: \begin{center}
475: \caption{Extraction accuracy for the 27 test terms (\#C = the
476: number of correct descriptions, \#T = the total number of
477: extracted descriptions, A = accuracy (\%)).}
478: \medskip
479: \leavevmode
480: \footnotesize
481: \tabcolsep=3pt
482: \begin{tabular}{llrrrrrrr} \hline\hline
483: & & & \multicolumn{3}{c}{w/o Trigram} & \multicolumn{3}{c}{w
484: Trigram} \\ \cline{4-9}
485: {\hfill\centering Japanese Term\hfill} &
486: {\hfill\centering English Gloss\hfill} &
487: {\hfill\centering \#Pages\hfill}
488: & {\hfill\centering \#C\hfill} & {\hfill\centering \#T\hfill} &
489: {\hfill\centering A\hfill} & {\hfill\centering \#C\hfill} &
490: {\hfill\centering \#T\hfill} & {\hfill\centering A\hfill} \\
491: \hline
492: Zipf{\it -no-housoku\/} & Zipf's law & 15 & 1 & 1 & 100 & 1 & 1 &
493: 100 \\
494: {\it akusesu-seigyo\/} & access control & 6,925 & 10 & 20 & 50.0
495: & 10 & 20 & 50.0 \\
496: {\it bunsho-gazou-rikai\/} & document image understanding & 43 &
497: 1 & 1 & 100 & 1 & 1 & 100 \\
498: {\it chiteki-eejento\/} & intelligent agent & 323 & 3 & 5 & 60.0
499: & 3 & 5 & 60.0 \\
500: {\it deeta-mainingu\/} & data mining & 3,389 & 37 & 49 &
501: 75.5 & 30 & 40 & 75.0 \\
502: {\it denshi-sukashi\/} & digital watermark & 2,124 & 29 & 32 &
503: 90.6 & 29 & 32 & 90.6 \\
504: {\it denshi-toshokan\/} & digital library & 7,938 & 10 & 26 &
505: 38.5 & 8 & 17 & 47.1 \\
506: {\it gazou-kensaku\/} & image retrieval & 1,694 & 1 & 4 & 25.0 &
507: 1 & 3 & 33.3 \\
508: {\it guruupuwea\/} & groupware & 19,760 & 14 & 40 & 35.0 & 12 &
509: 21 & 57.1 \\
510: {\it hikari-faibaa\/} & optical fiber & 10,078 & 17 & 25 & 68.0
511: & 14 & 21 & 66.7 \\
512: {\it ichi-keisoku\/} & position measurement & 735 & 0 & 3 & 0 &
513: 0 & 3 & 0 \\
514: {\it identeki-arugorizumu\/} & genetic algorithm & 4,686 & 24 &
515: 31 & 77.4 & 22 & 28 & 78.6 \\
516: {\it jinkou-chinou\/} & artificial intelligence & 18,190 & 10 &
517: 19 & 52.6 & 9 & 13 & 69.2 \\
518: {\it jiritsu-idou-robotto\/} & autonomous mobile robot & 792 & 2
519: & 2 & 100 & 2 & 2 & 100 \\
520: {\it jisedai-intaanetto\/} & next generation Internet & 1,963 &
521: 6 & 10 & 60.0 & 6 & 10 & 60.0 \\
522: {\it kiiwaado-jidou-chuushutsu\/} & keyword automatic extraction
523: & 25 & 1 & 1 & 100 & 1 & 1 & 100 \\
524: {\it kikai-hon'yaku\/} & machine translation & 3,141 & 1 & 10 &
525: 10.0 & 0 & 8 & 0 \\
526: {\it korokeishon\/} & collocation & 547 & 7 & 16 & 43.8 &
527: 7 & 15 & 46.7 \\
528: {\it koshou-shindan\/} & fault diagnosis & 1,682 & 2 & 5 & 40.0 &
529: 2 & 4 & 50.0 \\
530: {\it maruchikyasuto\/} & multicast & 5,758 & 18 & 25 & 72.0 & 15
531: & 22 & 68.2 \\
532: {\it media-douki\/} & media synchronization & 46 & 1 & 1 & 100 & 1
533: & 1 & 100 \\
534: {\it nettowaaku-toporojii\/} & network topology & 438 & 1 & 4 &
535: 25.0 & 0 & 3 & 0 \\
536: {\it nyuuraru-nettowaaku\/} & neural network & 9,537 & 37 & 47 &
537: 78.7 & 36 & 45 & 80.0 \\
538: {\it ringu-gata-nettowaaku\/} & ring network & 44 & 0 & 1 & 0 &
539: 0 & 1 & 0 \\
540: {\it shisourasu\/} & thesaurus & 3,399 & 21 & 23 &
541: 91.3 & 19 & 20 & 95.0 \\
542: {\it souraa-kaa\/} & solar car & 3,698 & 12 & 21 &
543: 57.1 & 12 & 21 & 57.1 \\
544: {\it teromea\/} & telomere & 873 & 26 & 36 & 72.2 &
545: 25 & 34 & 73.5 \\
546: \hline
547: {\hfill\centering total\hfill} & {\hfill\centering ---\hfill} &
548: 109,049 & 292 & 460 & 63.5 & 266 & 392 & 67.9 \\ \hline
549: \end{tabular}
550: \label{tab:result}
551: \end{center}
552: \end{table*}
553:
Table~\ref{tab:result} shows that the NLP/HTML-based methods extracted
appropriate term descriptions with a 63.5\% accuracy, and that the
trigram-based language model further improved the accuracy from 63.5\%
to 67.9\%. Since roughly two out of every three extracted descriptions
are correct, browsing two descriptions is, on average, sufficient for
users to understand a term in question. Reading a few descriptions is
not time-consuming, because they usually consist of short paragraphs.
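
The two accuracy figures follow from the totals in the last row of
Table~\ref{tab:result}, where $A_{\rm w/o}$ and $A_{\rm w}$ denote the
accuracy without and with the trigram-based language model:
\[
  A_{\rm w/o} = \frac{292}{460} \approx 63.5\% , \qquad
  A_{\rm w} = \frac{266}{392} \approx 67.9\% .
\]
As a rough check on the number of descriptions a user has to read,
assume, purely for illustration, that each extracted description is
independently correct with probability $p$; this assumption is not
tested in the experiments. The expected number of descriptions read up
to and including the first correct one is then $1/p$, that is,
\[
  \frac{1}{0.679} \approx 1.47 \qquad \mbox{and} \qquad
  \frac{1}{0.635} \approx 1.57
\]
with and without the language model, respectively. Both values are
below two.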
560:
We also investigated the effectiveness of clustering, where for each
test term, we clustered descriptions into three clusters (in the case
where there were fewer than four descriptions, individual descriptions
were regarded as different clusters), and only descriptions determined
as representative by the HBC method were presented as the final
result. We found that 66.7\% of the descriptions presented were correct.
In other words, users can obtain descriptions from different
viewpoints and word senses, while roughly maintaining the extraction
accuracy obtained above (i.e., 67.9\%).
570:
571: However, we concede that we did not investigate whether or not each
572: cluster corresponds to different viewpoints in a rigorous manner.
573:
574: For the polysemy problem, we investigated all the descriptions
575: extracted, and found that only ``{\it korokeishon\/}~(collocation)''
576: was associated with two word senses, that is, ``word collocations''
577: and ``position of machinery.'' Among the three representative
578: descriptions for ``{\it korokeishon\/}~(collocation),'' two
579: corresponded to the first sense, and one corresponded to the second
sense. Thus, in this case, the HBC clustering method correctly
distinguished the two word senses.
582:
583: \section{Conclusion}
584: \label{sec:conclusion}
585:
586: In this paper, we proposed a method to extract encyclopedic knowledge
587: from the World Wide Web.
588:
589: For extracting fragments of Web pages containing term descriptions, we
590: used linguistic and HTML structural patterns typically used to describe
591: terms. Then, we used a language model to discard irrelevant
592: descriptions. We also used a clustering method to summarize extracted
593: descriptions based on different viewpoints and word senses.
594:
595: We evaluated our method by way of experiments, and found that the
596: accuracy of our extraction method was practical, that is, a user can
597: understand a term in question, by browsing two descriptions, on
598: average. We also found that the language model and the clustering
599: method further enhanced our framework.
600:
601: Future work will include experiments using a larger number of test
602: terms, and application of extracted descriptions to other NLP
603: research.
604:
605: \section*{Acknowledgments}
606:
607: The authors would like to thank Hitachi Digital Heibonsha, Inc. for
608: their support with the CD-ROM World Encyclopedia, Makoto Iwayama and
609: Takenobu Tokunaga for their support with the HBC clustering software,
610: and Noriko Kando (National Institute of Informatics, Japan) for her
611: support with the NACSIS collection.
612:
613: \bibliographystyle{acl}
614:
615: \begin{thebibliography}{}
616:
617: \bibitem[\protect\citename{Clarkson and Rosenfeld}1997]{clarkson:eurospeech-97}
618: Philip Clarkson and Ronald Rosenfeld.
619: \newblock 1997.
620: \newblock Statistical language modeling using the {CMU}-{Cambridge} toolkit.
621: \newblock In {\em Proceedings of EuroSpeech'97}, pages 2707--2710.
622:
623: \bibitem[\protect\citename{Etzioni}1997]{etzioni:ai-magazine-97}
624: Oren Etzioni.
625: \newblock 1997.
626: \newblock Moving up the information food chain.
627: \newblock {\em AI Magazine}, 18(2):11--18.
628:
629: \bibitem[\protect\citename{Hatzivassiloglou and
630: McKeown}1993]{hatzivassiloglou:acl-93}
631: Vasileios Hatzivassiloglou and Kathleen~R. McKeown.
632: \newblock 1993.
633: \newblock Towards the automatic identification of adjectival scales: Clustering
634: adjectives according to meaning.
635: \newblock In {\em Proceedings of the 31st Annual Meeting of the Association for
636: Computational Linguistics}, pages 172--182.
637:
638: \bibitem[\protect\citename{Heibonsha}1998]{heibonsha:98}
639: Hitachi~Digital Heibonsha.
640: \newblock 1998.
641: \newblock {CD-ROM World Encyclopedia}.
642: \newblock (In Japanese).
643:
644: \bibitem[\protect\citename{Iwayama and Tokunaga}1995]{iwayama:ijcai-95}
645: Makoto Iwayama and Takenobu Tokunaga.
646: \newblock 1995.
647: \newblock Hierarchical {Bayesian} clustering for automatic text classification.
648: \newblock In {\em Proceedings of the 14th International Joint Conference on
649: Artificial Intelligence}, pages 1322--1327.
650:
651: \bibitem[\protect\citename{{Japan Electronic Dictionary Research
652: Institute}}1995]{edr-eng:95}
653: {Japan Electronic Dictionary Research Institute}.
654: \newblock 1995.
655: \newblock {EDR} electronic dictionary technical guide.
656:
657: \bibitem[\protect\citename{Kando \bgroup et al.\egroup }1999]{kando:sigir-99}
658: Noriko Kando, Kazuko Kuriyama, and Toshihiko Nozue.
659: \newblock 1999.
660: \newblock {NACSIS} test collection workshop ({NTCIR-1}).
661: \newblock In {\em Proceedings of the 22nd Annual International ACM SIGIR
662: Conference on Research and Development in Information Retrieval}, pages
663: 299--300.
664:
665: \bibitem[\protect\citename{Kupiec and Maxwell}1992]{kupiec:aaai-slnlp-ws-92}
666: Julian Kupiec and John Maxwell.
667: \newblock 1992.
668: \newblock Training stochastic grammars from unlabelled text corpora.
669: \newblock In {\em Workshop on Statistically-Based Natural Language Programming
670: Techniques}.
671: \newblock AAAI Technical Reports WS-92-01.
672:
673: \bibitem[\protect\citename{{Mainichi Shimbun}}1994 1995]{mainichi:94-95}
674: {Mainichi Shimbun}.
675: \newblock 1994-1995.
676: \newblock Mainichi shimbun {CD-ROM} '94-'95.
677: \newblock (In Japanese).
678:
679: \bibitem[\protect\citename{Matsumoto \bgroup et al.\egroup
680: }1997]{matsumoto:chasen-97}
681: Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Osamu Imaichi, and Tomoaki
682: Imamura.
683: \newblock 1997.
684: \newblock {Japanese} morphological analysis system {ChaSen} manual.
685: \newblock Technical Report NAIST-IS-TR97007, NAIST.
686: \newblock (In Japanese).
687:
688: \bibitem[\protect\citename{McCallum \bgroup et al.\egroup
689: }1999]{mccallum:ijcai-99}
690: Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore.
691: \newblock 1999.
692: \newblock A machine learning approach to building domain-specific search
693: engines.
694: \newblock In {\em Proceedings of the 16th International Joint Conference on
695: Artificial Intelligence}, pages 662--667.
696:
697: \bibitem[\protect\citename{Nie \bgroup et al.\egroup }1999]{nie:sigir-99}
698: Jian-Yun Nie, Michel Simard, Pierre Isabelle, and Richard Durand.
699: \newblock 1999.
700: \newblock Cross-language information retrieval based on parallel texts and
701: automatic mining of parallel texts from the {Web}.
702: \newblock In {\em Proceedings of the 22nd Annual International ACM SIGIR
703: Conference on Research and Development in Information Retrieval}, pages
704: 74--81.
705:
706: \bibitem[\protect\citename{Resnik}1999]{resnik:acl-99}
707: Philip Resnik.
708: \newblock 1999.
709: \newblock Mining the {Web} for bilingual texts.
710: \newblock In {\em Proceedings of the 37th Annual Meeting of the Association for
711: Computational Linguistics}, pages 527--534.
712:
713: \bibitem[\protect\citename{Smadja \bgroup et al.\egroup }1996]{smadja:cl-96}
714: Frank Smadja, Kathleen~R. McKeown, and Vasileios Hatzivassiloglou.
715: \newblock 1996.
716: \newblock Translating collocations for bilingual lexicons: A statistical
717: approach.
718: \newblock {\em Computational Linguistics}, 22(1):1--38.
719:
720: \end{thebibliography}
721:
722:
723: \end{document}
724: