1: %%
2: %% SIGIR-2000 WS on patent retrieval
3: %%
4: \documentstyle[11pt,twocolumn]{article}
5: 
6: \input{a4params.tex}
7: 
8: %% Title/Author/Affiliation
9: 
10: \title{\bf Applying a Hybrid Query Translation Method to
11: Japanese/English Cross-Language Patent Retrieval}
12: 
13: \author{\Large Masatoshi Fukui$^{\dagger}$~~Shigeto
14: Higuchi$^{\dagger}$~~Youichi Nakatani$^{\dagger}$~~\medskip \\
15: \Large Masao Tanaka$^{\dagger}$~~Atsushi Fujii$^{\ddagger}$~~Tetsuya
16: Ishikawa$^{\ddagger}$}
17: 
18: \date{$^\dagger$Japan Patent Information Organization \\
19: Satoh Daiya Bldg., 1-7 Toyo 4-Chome Koto-ku 135-0016, JAPAN \medskip \\
20: $^\ddagger$University of Library and Information Science \\
21: 1-2 Kasuga Tsukuba 305-8550, JAPAN \\ \smallskip
22: \smallskip {\tt E-mail:~fujii@ulis.ac.jp}}
23: 
24: \newcommand{\etal}{et~al.}
25: \newcommand{\etaleos}{et~al}
26: \newcommand{\eq}[1]{(\ref{#1})}
27: 
28: \input{psfig.tex}
29: 
30: \pagestyle{empty}
31: 
32: \begin{document}
33: 
34: \maketitle\thispagestyle{empty}
35: 
36: \begin{abstract}
37:   This paper applies an existing query translation method to
38:   cross-language patent retrieval. In our method, multiple
39:   dictionaries are used to derive all possible translations for an
40:   input query, and collocational statistics are used to resolve
41:   translation ambiguity. We used Japanese/English parallel patent
42:   abstracts to perform comparative experiments, where our method
43:   outperformed a simple dictionary-based query translation method, and
44:   achieved 76\% of monolingual retrieval in terms of average
45:   precision.
46: \end{abstract}
47: 
48: \section{Introduction}
49: \label{sec:introduction}
50: 
51: Since 1978, JAPIO (Japan Patent Information Organization) has operated
52: PATOLIS, which is one of the first on-line patent retrieval services
53: in Japan, and currently provides clients (i.e., 8,000 Japanese
54: companies) with patent information from 62 countries and 5
55: international organizations. At the same time, since a patent obtained
56: in a single country can be protected in multiple countries
57: simultaneously, users are likely to be interested in retrieving
58: patent information across languages. Against this background,
59: JAPIO manually summarizes each patent document submitted in Japan into
60: approximately 400 characters, and translates the summaries into
61: English; the translations are provided on PAJ (Patent Abstracts of Japan)
62: CD-ROMs\footnote{Copyright by Japan Patent Office.}.
63: 
64: In this paper, we target cross-language information retrieval (CLIR)
65: in the context of patent retrieval, and evaluate its effectiveness
66: using Japanese/English patent abstracts on PAJ CD-ROMs.
67: 
68: In brief, existing CLIR systems are classified into three approaches:
69: (a) translating queries into the document
70: language~\cite{ballesteros:sigir-98,davis:sigir-97}, (b) translating
71: documents into the query language~\cite{mccarley:acl-99,oard:amta-98},
72: and (c) representing both queries and documents in a
73: language-independent
74: space~\cite{carbonell:ijcai-97,gonzalo:chum-98,littman:clir-98,salton:jasis-70}.
75: Since developing a CLIR system from scratch is expensive, we used the
76: existing CLIR system proposed by Fujii and
77: Ishikawa~\cite{fujii:ntcir-99,fujii:emnlp-vlc-99}, which follows the
78: first approach.
79: 
80: This system was developed in part for the NACSIS test
81: collection~\cite{kando:sigir-99}, which consists of 39 Japanese
82: queries and approximately 330,000 technical abstracts in Japanese and
83: English.  Since patent information also typically includes technical
84: terms, we expect this system to perform reasonably well
85: on patent abstracts too.
86: 
87: \section{System Description}
88: \label{sec:system}
89: 
90: Figure~\ref{fig:system} depicts the overall design of our CLIR system,
91: in which we combine a query translation module and an IR engine for
92: monolingual retrieval.  Unlike the original system proposed by Fujii
93: and Ishikawa~\cite{fujii:ntcir-99,fujii:emnlp-vlc-99} targeting the
94: NACSIS collection, we use the JAPIO collection for the target
95: documents. Here, the JAPIO collection is a subset of PAJ CD-ROMs. We
96: will elaborate on this collection in
97: Section~\ref{sec:experimentation}. In this section, we briefly explain
98: the retrieval process based on Figure~\ref{fig:system}.
99: 
100: First, the source-language query is translated into the target
101: language. For this purpose, a hybrid method integrating
102: multiple resources is used. More precisely, the EDR
103: technical/general dictionaries~\cite{edr:95} are used to derive all
104: possible translation candidates for words and phrases included in the
105: source query. In addition, for words not listed in the dictionaries,
106: transliteration is performed to identify phonetic equivalents in the
107: target language.
108: 
109: Then, bi-gram statistics extracted from NACSIS documents in the target
110: language are used to resolve the translation ambiguity. Ideally,
111: bi-gram statistics should be extracted from the JAPIO
112: collection. However, since the number of documents in this collection
113: is relatively small compared with the NACSIS collection (see
114: Section~\ref{sec:experimentation}), we used the NACSIS collection
115: instead to avoid the data sparseness problem.
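
One way to picture this step, as a sketch under simplifying
assumptions rather than the exact formulation of the system, is the
following. Suppose the source query yields candidate sets
$C_1,\ldots,C_n$, one for each query word or phrase, and let
$P(e_i \mid e_{i-1})$ be a bi-gram probability estimated from the
target-language NACSIS documents. The symbols $C_i$, $e_i$ and $P$ are
introduced here for illustration only. Disambiguation can then be
viewed as choosing one candidate from each set so that the resulting
sequence is the most plausible one in the target language:
\begin{displaymath}
  (\hat{e}_1,\ldots,\hat{e}_n) =
  \mathop{\rm arg\,max}_{e_1 \in C_1,\ldots,e_n \in C_n}
  P(e_1) \prod_{i=2}^{n} P(e_i \mid e_{i-1})
\end{displaymath}
Under this view, sparse bi-gram counts would leave many
$P(e_i \mid e_{i-1})$ values at or near zero, which is why a larger
collection is preferable for estimating them.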
116: 
117: Since our system is bidirectional between Japanese and English, we
118: tokenize documents with different methods, depending on their
119: language. For English documents, the tokenization involves eliminating
120: stopwords and identifying root forms for inflected content words. For
121: this purpose, we use WordNet~\cite{fellbaum:wordnet-98}, which
122: contains a stopword list and correspondences between inflected words
123: and their root form.
124: 
125: On the other hand, we segment Japanese documents into lexical units
126: using the ChaSen morphological analyzer~\cite{matsumoto:chasen-97},
127: which is widely used in Japanese NLP research, and
128: extract content words based on their part-of-speech information.
129: 
130: Second, the IR engine searches the JAPIO collection for documents
131: relevant to the translated query, and sorts them according to the
132: degree of relevance, in descending order. Our IR engine is based on
133: the vector space model, in which the similarity between the query and
134: each document (i.e., the degree of relevance of each document) is
135: computed as the cosine of the angle between their associated
136: vectors. We use TF$\cdot$IDF for term weighting. Among the
137: many variations of term weighting
138: methods~\cite{salton:ipm-88,zobel:sigir-forum-98}, we tentatively use
139: the formulae shown in Equation~\eq{eq:tf_idf}.
140: \begin{equation}
141:   \label{eq:tf_idf}
142:   \begin{array}{lll}
143:     TF & = & 1 + \log(f_{t,d}) \\
144:     \noalign{\vskip 1.2ex}
145:     IDF & = & \log(\frac{\textstyle N}{\textstyle n_{t}})
146:   \end{array}
147: \end{equation}
148: Here, $f_{t,d}$ denotes the frequency with which term $t$ appears in
149: document $d$, and $n_{t}$ denotes the number of documents containing
150: term $t$. $N$ is the total number of documents in the collection.
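
As a worked illustration of Equation~\eq{eq:tf_idf}, consider a
hypothetical term that occurs $f_{t,d}=3$ times in a document and
appears in $n_{t}=100$ documents of a collection with $N=7526$
documents. The base of the logarithm is not specified above; the
natural logarithm is assumed here, and the term weight $w_{t,d}$ is
assumed to be the product of the two factors:
\begin{eqnarray*}
  TF & = & 1 + \log(3) \approx 2.099 \\
  IDF & = & \log(7526/100) \approx 4.321 \\
  w_{t,d} & = & TF \times IDF \approx 9.07
\end{eqnarray*}
With such weights, the cosine similarity between a query $q$ and a
document $d$ takes the standard form below, where $w_{t,q}$ and
$w_{t,d}$ are notation introduced for this illustration:
\begin{displaymath}
  {\it sim}(q,d) =
  \frac{\sum_{t} w_{t,q}\, w_{t,d}}
       {\sqrt{\sum_{t} w_{t,q}^{2}}\,\sqrt{\sum_{t} w_{t,d}^{2}}}
\end{displaymath}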
151: 
152: For the indexing process, we first tokenize documents as explained
153: above (i.e., we use WordNet and ChaSen for English and Japanese
154: documents, respectively), and then conduct word-based
155: indexing. That is, we use each content word as a single indexing term.
156: 
157: Finally, since retrieved documents are not in the user's native
158: language, we optionally use a machine translation system to enhance
159: readability of retrieved documents.
160: 
161: \begin{figure}[t]
162:   \begin{center}
163:     \leavevmode
164:     \psfig{file=system.eps,height=1.3in}
165:   \end{center}
166:   \caption{The overall design of our cross-language patent retrieval
167:     system.}
168:   \label{fig:system}
169: \end{figure}
170: 
171: \section{Experimentation}
172: \label{sec:experimentation}
173: 
174: Since no test collection for Japanese/English patent retrieval is
175: available to the public, we produced our test collection (i.e., the
176: JAPIO collection), which consists of three Japanese queries and
177: Japanese/English comparable abstracts.
178: 
179: Each query was produced manually and consists of a description
180: and a narrative. The three queries correspond to different domains, i.e., electrical
181: engineering, mechanical engineering and chemistry.
182: Figure~\ref{fig:query} shows the three query descriptions in the
183: second column.
184: 
185: \begin{figure*}[htbp]
186:   \begin{center}
187:     \leavevmode
188:     \small
189:     \begin{tabular}{llrr} \hline\hline
190:       {\hfill\centering IPC\hfill} & {\hfill\centering
191:       Description\hfill} & \#Relevant & \#Documents \\ \hline
192:       electronics & GPS car navigation system based on VICS & 930 &
193:       7,526 \\
194:       mechanics & eliminating dioxin in burning solid wastes & 451 &
195:       8,214 \\
196:       chemistry & antibacterial plastic combining inorganic
197:       materials & 473 & 5,902 \\
198:       \hline
199:     \end{tabular}
200:     \caption{Query descriptions in the JAPIO collection.}
201:     \label{fig:query}
202:   \end{center}
203: \end{figure*}
204: 
205: In conventional test collections, relevance assessment is usually
206: performed based on the pooling method~\cite{voorhees:sigir-98}, which
207: first pools candidates for relevant documents using multiple retrieval
208: systems. However, since in our case only the single system described in
209: Section~\ref{sec:system} is currently available, a different
210: method of producing relevance judgments was needed.
211: 
212: More precisely, for each query (domain), target documents
213: were first collected based on the IPC classification number, from PAJ
214: CD-ROMs in 1993--1998. Then, for each query, three professional human
215: searchers, who were allowed to enhance queries based on thesauri and
216: their introspection, searched the target documents for relevant
217: documents.
218: 
219: Thus, in practice, the JAPIO collection consists of three different
220: document collections, one for each query.  In
221: Figure~\ref{fig:query}, the third and fourth columns denote the number
222: of relevant documents and the total number of target documents for
223: each query.
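
For reference, these two columns imply the following proportions of
relevant documents in each collection. This is a back-of-the-envelope
reading of Figure~\ref{fig:query} rather than a reported result; each
proportion roughly corresponds to the precision one would expect from
a random ordering of the collection:
\begin{eqnarray*}
  \mbox{electronics} & : & 930/7526 \approx 0.124 \\
  \mbox{mechanics} & : & 451/8214 \approx 0.055 \\
  \mbox{chemistry} & : & 473/5902 \approx 0.080
\end{eqnarray*}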
224: 
225: We compared the following methods:
226: \begin{itemize}
227: \item Japanese-English CLIR, where all possible translations derived
228:   from EDR dictionaries and the transliteration method were used as
229:   query terms (JEALL),
230: \item Japanese-English CLIR, where disambiguation based on bi-gram
231:   statistics was performed, and $k$-best translations were used as
232:   query terms (JEDIS),
233: \item Japanese-Japanese monolingual IR (JJ).
234: \end{itemize}
235: Here, we empirically set \mbox{$k=1$}. Although the performance of
236: JEDIS did not differ greatly as long as $k$ was kept small
237: (e.g., \mbox{$k=5$}), we achieved the best performance with
238: \mbox{$k=1$}.
239: 
240: Figure~\ref{fig:rp} shows recall-precision curves for the above three
241: methods, where JEDIS generally outperformed JEALL, and JJ generally
242: outperformed both JEALL and JEDIS, regardless of the recall.  The
243: difference between JEALL and JEDIS is attributed to the fact that
244: JEDIS resolved translation ambiguity based on bi-gram statistics
245: extracted from the NACSIS collection. Thus, we can conclude that the
246: use of bi-gram statistics (even extracted from a collection other than
247: the JAPIO collection) was effective for the query translation.
248: 
249: Table~\ref{tab:avg_pre} shows the non-interpolated average precision
250: values, averaged over the three queries, for each method.  This table
251: shows that JJ outperformed JEALL and JEDIS, JEDIS outperformed JEALL,
252: and the average precision value for JEDIS was 76\% of that obtained
253: with JJ.
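
For clarity, non-interpolated average precision for a single query is
commonly defined as below. The symbols are introduced here for
illustration, and the exact evaluation tool is not stated: $R$ is the
number of relevant documents, $n$ is the number of retrieved
documents, $P(k)$ is the precision at rank $k$, and ${\it rel}(k)$ is
1 if the document at rank $k$ is relevant and 0 otherwise:
\begin{displaymath}
  AP = \frac{1}{R} \sum_{k=1}^{n} P(k)\, {\it rel}(k)
\end{displaymath}
The ratios in Table~\ref{tab:avg_pre} follow directly from the average
precision values:
\begin{eqnarray*}
  \mbox{JEDIS/JJ} & = & 0.3156/0.4151 \approx 0.7603 \\
  \mbox{JEALL/JJ} & = & 0.2709/0.4151 \approx 0.6526
\end{eqnarray*}
By the same arithmetic, the relative gain of JEDIS over JEALL is
$(0.3156-0.2709)/0.2709 \approx 0.165$, i.e., about 16.5\%.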
254: 
255: Similar results have been observed in existing CLIR experiments using
256: the TREC and NACSIS collections. Thus, we conclude that our
257: cross-language patent retrieval system is comparable in performance to
258: those for newspaper articles and technical abstracts.
259: 
260: However, we could not conduct statistical testing, which investigates
261: whether the difference in average precision is meaningful or simply
262: due to chance~\cite{hull:sigir-93}, because the number of queries is
263: small. We concede that experiments using a larger number of queries
264: are still needed.
265: 
266: \section{Conclusion}
267: \label{sec:conclusion}
268: 
269: In this paper, we explored Japanese/English cross-language patent
270: retrieval. For this purpose, we used an existing cross-language IR
271: system relying on a hybrid query translation method, and evaluated its
272: effectiveness using Japanese queries and English patent abstracts.
273: The experimental results paralleled those of existing experiments. That is, we
274: found that resolving translation ambiguity was effective for the query
275: translation, and that the average precision value for cross-language
276: IR was approximately 76\% of that obtained with monolingual IR.
277: Future work will include qualitative/quantitative analyses based on a
278: larger number of queries.
279: 
280: \begin{figure}[t]
281:   \begin{center}
282:     \leavevmode
283:     \psfig{file=rp-curve.ps,height=3.2in}
284:   \end{center}
285:   \caption{Recall-precision curves for different methods.}
286:   \label{fig:rp}
287: \end{figure}
288: 
289: \begin{table}[htbp]
290:   \begin{center}
291:     \caption{Non-interpolated average precision values,
292:     averaged over the three queries, for different methods.}
293:     \medskip
294:     \leavevmode
295:     \small
296:     \begin{tabular}{lcc} \hline\hline
297:       Method & Avg. Precision & Ratio to JJ \\ \hline
298:       JJ & 0.4151 & -- \\
299:       JEDIS & 0.3156 & 0.7603 \\
300:       JEALL & 0.2709 & 0.6526 \\
301:       \hline
302:     \end{tabular}
303:     \label{tab:avg_pre}
304:   \end{center}
305: \end{table}
306: 
307: \small
308: 
309: \bibliographystyle{jplain}
310: 
311: \begin{thebibliography}{10}
312: 
313: \bibitem{ballesteros:sigir-98}
314: Lisa Ballesteros and W.~Bruce Croft.
315: \newblock Resolving ambiguity for cross-language retrieval.
316: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR
317:   Conference on Research and Development in Information Retrieval}, pp. 64--71,
318:   1998.
319: 
320: \bibitem{carbonell:ijcai-97}
321: Jaime~G. Carbonell, Yiming Yang, Robert~E. Frederking, Ralf~D. Brown, Yibing
322:   Geng, and Danny Lee.
323: \newblock Translingual information retrieval: A comparative evaluation.
324: \newblock In {\em Proceedings of the 15th International Joint Conference on
325:   Artificial Intelligence}, pp. 708--714, 1997.
326: 
327: \bibitem{davis:sigir-97}
328: Mark~W. Davis and William~C. Ogden.
329: \newblock {QUILT}: Implementing a large-scale cross-language text retrieval
330:   system.
331: \newblock In {\em Proceedings of the 20th Annual International ACM SIGIR
332:   Conference on Research and Development in Information Retrieval}, pp. 92--98,
333:   1997.
334: 
335: \bibitem{fellbaum:wordnet-98}
336: Christiane Fellbaum, editor.
337: \newblock {\em {WordNet}: An Electronic Lexical Database}.
338: \newblock MIT Press, 1998.
339: 
340: \bibitem{fujii:ntcir-99}
341: Atsushi Fujii and Tetsuya Ishikawa.
342: \newblock Cross-language information retrieval at {ULIS}.
343: \newblock In {\em Proceedings of the 1st NTCIR Workshop on Research in Japanese
344:   Text Retrieval and Term Recognition}, pp. 163--169, 1999.
345: 
346: \bibitem{fujii:emnlp-vlc-99}
347: Atsushi Fujii and Tetsuya Ishikawa.
348: \newblock Cross-language information retrieval for technical documents.
349: \newblock In {\em Proceedings of the Joint ACL SIGDAT Conference on Empirical
350:   Methods in Natural Language Processing and Very Large Corpora}, pp. 29--37,
351:   1999.
352: 
353: \bibitem{gonzalo:chum-98}
354: Julio Gonzalo, Felisa Verdejo, Carol Peters, and Nicoletta Calzolari.
355: \newblock Applying {EuroWordNet} to cross-language text retrieval.
356: \newblock {\em Computers and the Humanities}, Vol.~32, pp. 185--207, 1998.
357: 
358: \bibitem{hull:sigir-93}
359: David Hull.
360: \newblock Using statistical testing in the evaluation of retrieval experiments.
361: \newblock In {\em Proceedings of the 16th Annual International ACM SIGIR
362:   Conference on Research and Development in Information Retrieval}, pp.
363:   329--338, 1993.
364: 
365: \bibitem{edr:95}
366: {Japan Electronic Dictionary Research Institute}.
367: \newblock {EDR} electronic dictionary technical guide, 1995.
368: \newblock (In Japanese).
369: 
370: \bibitem{kando:sigir-99}
371: Noriko Kando, Kazuko Kuriyama, and Toshihiko Nozue.
372: \newblock {NACSIS} test collection workshop ({NTCIR-1}).
373: \newblock In {\em Proceedings of the 22nd Annual International ACM SIGIR
374:   Conference on Research and Development in Information Retrieval}, pp.
375:   299--300, 1999.
376: 
377: \bibitem{littman:clir-98}
378: Michael~L. Littman, Susan~T. Dumais, and Thomas~K. Landauer.
379: \newblock Automatic cross-language information retrieval using latent semantic
380:   indexing.
381: \newblock In Gregory Grefenstette, editor, {\em Cross-Language Information
382:   Retrieval}, chapter~5, pp. 51--62. Kluwer Academic Publishers, 1998.
383: 
384: \bibitem{matsumoto:chasen-97}
385: Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Osamu Imaichi, and Tomoaki
386:   Imamura.
387: \newblock {Japanese} morphological analysis system {ChaSen} manual.
388: \newblock Technical Report NAIST-IS-TR97007, NAIST, 1997.
389: \newblock (In Japanese).
390: 
391: \bibitem{mccarley:acl-99}
392: J.~Scott McCarley.
393: \newblock Should we translate the documents or the queries in cross-language
394:   information retrieval?
395: \newblock In {\em Proceedings of the 37th Annual Meeting of the Association for
396:   Computational Linguistics}, pp. 208--214, 1999.
397: 
398: \bibitem{oard:amta-98}
399: Douglas~W. Oard.
400: \newblock A comparative study of query and document translation for
401:   cross-language information retrieval.
402: \newblock In {\em Proceedings of the 3rd Conference of the Association for
403:   Machine Translation in the Americas}, pp. 472--483, 1998.
404: 
405: \bibitem{salton:jasis-70}
406: Gerard Salton.
407: \newblock Automatic processing of foreign language documents.
408: \newblock {\em Journal of the American Society for Information Science},
409:   Vol.~21, No.~3, pp. 187--194, 1970.
410: 
411: \bibitem{salton:ipm-88}
412: Gerard Salton and Christopher Buckley.
413: \newblock Term-weighting approaches in automatic text retrieval.
414: \newblock {\em Information Processing \& Management}, Vol.~24, No.~5, pp.
415:   513--523, 1988.
416: 
417: \bibitem{voorhees:sigir-98}
418: Ellen~M. Voorhees.
419: \newblock Variations in relevance judgments and the measurement of retrieval
420:   effectiveness.
421: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR
422:   Conference on Research and Development in Information Retrieval}, pp.
423:   315--323, 1998.
424: 
425: \bibitem{zobel:sigir-forum-98}
426: Justin Zobel and Alistair Moffat.
427: \newblock Exploring the similarity space.
428: \newblock {\em ACM SIGIR FORUM}, Vol.~32, No.~1, pp. 18--34, 1998.
429: 
430: \end{thebibliography}
431: 
432: \end{document}
433: 
434: % Local Variables:
435: % mode: japanese-LaTeX
436: % TeX-master: t
437: % End:
438: