1: %%
2: %% SIGIR-2000 WS on patent retrieval
3: %%
4: \documentstyle[11pt,twocolumn]{article}
5:
6: \input{a4params.tex}
7:
8: %% Title/Author/Affiliation
9:
10: \title{\bf Applying a Hybrid Query Translation Method to
11: Japanese/English Cross-Language Patent Retrieval}
12:
13: \author{\Large Masatoshi Fukui$^{\dagger}$~~Shigeto
14: Higuchi$^{\dagger}$~~Youichi Nakatani$^{\dagger}$~~\medskip \\
15: \Large Masao Tanaka$^{\dagger}$~~Atsushi Fujii$^{\ddagger}$~~Tetsuya
16: Ishikawa$^{\ddagger}$}
17:
18: \date{$^\dagger$Japan Patent Information Organization \\
19: Satoh Daiya Bldg., 1-7 Toyo 4-Chome Koto-ku 135-0016, JAPAN \medskip \\
20: $^\ddagger$University of Library and Information Science \\
21: 1-2 Kasuga Tsukuba 305-8550, JAPAN \\ \smallskip
22: \smallskip {\tt E-mail:~fujii@ulis.ac.jp}}
23:
24: \newcommand{\etal}{et~al.}
25: \newcommand{\etaleos}{et~al}
26: \newcommand{\eq}[1]{(\ref{#1})}
27:
28: \input{psfig.tex}
29:
30: \pagestyle{empty}
31:
32: \begin{document}
33:
34: \maketitle\thispagestyle{empty}
35:
36: \begin{abstract}
37: This paper applies an existing query translation method to
38: cross-language patent retrieval. In our method, multiple
39: dictionaries are used to derive all possible translations for an
40: input query, and collocational statistics are used to resolve
41: translation ambiguity. We used Japanese/English parallel patent
42: abstracts to perform comparative experiments, where our method
43: outperformed a simple dictionary-based query translation method and
44: achieved 76\% of the average precision obtained with monolingual
45: retrieval.
46: \end{abstract}
47:
48: \section{Introduction}
49: \label{sec:introduction}
50:
51: Since 1978, JAPIO (Japan Patent Information Organization) has operated
52: PATOLIS, which is one of the first on-line patent retrieval services
53: in Japan, and currently provides clients (i.e., 8,000 Japanese
54: companies) with patent information from 62 countries and 5
55: international organizations. At the same time, since a patent obtained
56: in a single country can be protected in multiple countries
57: simultaneously, it is likely that users are interested in retrieving
58: patent information across languages. Motivated by this background,
59: JAPIO manually summarizes each patent document submitted in Japan into
60: approximately 400 characters and translates the summaries into
61: English; the translations are provided on PAJ (Patent Abstracts of Japan)
62: CD-ROMs\footnote{Copyright by Japan Patent Office.}.
63:
64: In this paper, we target cross-language information retrieval (CLIR)
65: in the context of patent retrieval, and evaluate its effectiveness
66: using Japanese/English patent abstracts on PAJ CD-ROMs.
67:
68: In brief, existing CLIR systems are classified into three approaches:
69: (a) translating queries into the document
70: language~\cite{ballesteros:sigir-98,davis:sigir-97}, (b) translating
71: documents into the query language~\cite{mccarley:acl-99,oard:amta-98},
72: and (c) representing both queries and documents in a
73: language-independent
74: space~\cite{carbonell:ijcai-97,gonzalo:chum-98,littman:clir-98,salton:jasis-70}.
75: Because developing a new CLIR system is expensive, we used the CLIR
76: system proposed by Fujii and
77: Ishikawa~\cite{fujii:ntcir-99,fujii:emnlp-vlc-99}, which follows the
78: first approach.
79:
80: This system was developed in part for the NACSIS test
81: collection~\cite{kando:sigir-99}, which consists of 39 Japanese
82: queries and approximately 330,000 technical abstracts in Japanese and
83: English. Since patent documents, like technical abstracts, usually
84: include technical terms, we expect this system to perform reasonably
85: well on patent abstracts too.
86:
87: \section{System Description}
88: \label{sec:system}
89:
90: Figure~\ref{fig:system} depicts the overall design of our CLIR system,
91: in which we combine a query translation module and an IR engine for
92: monolingual retrieval. Unlike the original system proposed by Fujii
93: and Ishikawa~\cite{fujii:ntcir-99,fujii:emnlp-vlc-99} targeting the
94: NACSIS collection, we use the JAPIO collection for the target
95: documents. Here, the JAPIO collection is a subset of PAJ CD-ROMs. We
96: will elaborate on this collection in
97: Section~\ref{sec:experimentation}. In this section, we briefly explain
98: the retrieval process based on Figure~\ref{fig:system}.
99:
100: First, the source-language query is translated into the target
101: language. For this purpose, a hybrid method integrating
102: multiple resources is used. More precisely, the EDR
103: technical/general dictionaries~\cite{edr:95} are used to derive all
104: possible translation candidates for words and phrases included in the
105: source query. In addition, for words unlisted in dictionaries,
106: transliteration is performed to identify phonetic equivalents in the
107: target language.
108:
109: Then, bi-gram statistics extracted from NACSIS documents in the target
110: language are used to resolve the translation ambiguity. Ideally,
111: bi-gram statistics should be extracted from the JAPIO
112: collection. However, since this collection is relatively small
113: compared with the NACSIS collection (see
114: Section~\ref{sec:experimentation}), we used NACSIS documents instead
115: to avoid the data sparseness problem.
116:
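As a rough sketch of this disambiguation step (the notation is
introduced here for exposition, and the formulation
in~\cite{fujii:emnlp-vlc-99} may differ in detail), let
$E = e_{1} \ldots e_{m}$ be a candidate sequence of target-language
words for the source query. Assuming that the bi-gram statistics are
used as conditional probabilities $P(e_{i}|e_{i-1})$, candidates can
be ranked by
\[
  \hat{E} = \arg\max_{E} \prod_{i=2}^{m} P(e_{i}|e_{i-1}),
\]
so that translations forming frequent collocations in the target
language are preferred.
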
117: Since our system is bidirectional between Japanese and English, we
118: tokenize documents with different methods, depending on their
119: language. For English documents, the tokenization involves eliminating
120: stopwords and identifying root forms for inflected content words. For
121: this purpose, we use WordNet~\cite{fellbaum:wordnet-98}, which
122: contains a stopword list and correspondences between inflected words
123: and their root form.
124:
125: On the other hand, we segment Japanese documents into lexical units
126: using the ChaSen morphological analyzer~\cite{matsumoto:chasen-97},
127: which is widely used in Japanese NLP research, and
128: extract content words based on their part-of-speech information.
129:
130: Second, the IR engine searches the JAPIO collection for documents
131: relevant to the translated query, and sorts them according to the
132: degree of relevance, in descending order. Our IR engine is based on
133: the vector space model, in which the similarity between the query and
134: each document (i.e., the degree of relevance of each document) is
135: computed as the cosine of the angle between their associated
136: vectors. We use the notion of TF$\cdot$IDF for term weighting. Among a
137: number of variations of term weighting
138: methods~\cite{salton:ipm-88,zobel:sigir-forum-98}, we tentatively use
139: the formulae shown in Equation~\eq{eq:tf_idf}.
140: \begin{equation}
141: \label{eq:tf_idf}
142: \begin{array}{lll}
143: TF & = & 1 + \log(f_{t,d}) \\
144: \noalign{\vskip 1.2ex}
145: IDF & = & \log(\frac{\textstyle N}{\textstyle n_{t}})
146: \end{array}
147: \end{equation}
148: Here, $f_{t,d}$ denotes the frequency with which term $t$ appears in
149: document $d$, and $n_{t}$ denotes the number of documents containing
150: term $t$. $N$ is the total number of documents in the collection.
151:
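For illustration, assume that the weight of term $t$ in document $d$
is the product $w_{t,d} = TF \cdot IDF$ and that query terms are
weighted in the same way as $w_{t,q}$ (these symbols are introduced
here only for exposition). The cosine similarity between query $q$
and document $d$ can then be written as
\[
  sim(q,d) = \frac{\sum_{t} w_{t,q} \, w_{t,d}}
  {\sqrt{\sum_{t} w_{t,q}^{2}} \, \sqrt{\sum_{t} w_{t,d}^{2}}}.
\]
As a hypothetical example with natural logarithms, a term with
$f_{t,d}=3$ and $n_{t}=100$ in a collection of $N=7526$ documents
would receive $TF = 1 + \log 3 \approx 2.10$, $IDF = \log(7526/100)
\approx 4.32$, and hence a weight of approximately $9.07$.
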
152: For the indexing process, we first tokenize documents as explained
153: above (i.e., we use WordNet and ChaSen for English and Japanese
154: documents, respectively), and then perform word-based
155: indexing. That is, we use each content word as a single indexing term.
156:
157: Finally, since retrieved documents are not in the user's native
158: language, we optionally use a machine translation system to enhance
159: readability of retrieved documents.
160:
161: \begin{figure}[t]
162: \begin{center}
163: \leavevmode
164: \psfig{file=system.eps,height=1.3in}
165: \end{center}
166: \caption{The overall design of our cross-language patent retrieval
167: system.}
168: \label{fig:system}
169: \end{figure}
170:
171: \section{Experimentation}
172: \label{sec:experimentation}
173:
174: Since no test collection for Japanese/English patent retrieval is
175: available to the public, we produced our test collection (i.e., the
176: JAPIO collection), which consists of three Japanese queries and
177: Japanese/English comparable abstracts.
178:
179: Each query was manually produced, consists of a description
180: and a narrative, and corresponds to a different domain, namely electrical
181: engineering, mechanical engineering, or chemistry.
182: Figure~\ref{fig:query} shows the three query descriptions in the
183: second column.
184:
185: \begin{figure*}[htbp]
186: \begin{center}
187: \leavevmode
188: \small
189: \begin{tabular}{llrr} \hline\hline
190: {\hfill\centering IPC\hfill} & {\hfill\centering
191: Description\hfill} & \#Relevant & \#Documents \\ \hline
192: electronics & GPS car navigation system based on VICS & 930 &
193: 7,526 \\
194: mechanics & eliminating dioxin in burning solid wastes & 451 &
195: 8,214 \\
196: chemistry & antibacterial plastic combining inorganic
197: materials & 473 & 5,902 \\
198: \hline
199: \end{tabular}
200: \caption{Query descriptions in the JAPIO collection.}
201: \label{fig:query}
202: \end{center}
203: \end{figure*}
204:
205: In conventional test collections, relevance assessment is usually
206: performed based on the pooling method~\cite{voorhees:sigir-98}, which
207: first pools candidates for relevant documents using multiple retrieval
208: systems. However, since only the single system described in
209: Section~\ref{sec:system} is currently available in our case, a different
210: production method was needed.
211:
212: More precisely, for each query (domain), target documents
213: were first collected from the 1993--1998 PAJ CD-ROMs based on the IPC
214: classification number. Then, for each query, three professional human
215: searchers, who were allowed to enhance queries based on thesauri and
216: their introspection, searched the target documents for relevant
217: documents.
218:
219: Thus, in practice, the JAPIO collection consists of three different
220: document collections corresponding to each query. In
221: Figure~\ref{fig:query}, the third and fourth columns denote the number
222: of relevant documents and the total number of target documents for
223: each query.
224:
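For reference, the proportion of relevant documents in each
collection, which equals the expected precision of a random ranking,
follows directly from Figure~\ref{fig:query}:
\[
  \begin{array}{lll}
  930 / 7526 & \approx & 0.124 \\
  451 / 8214 & \approx & 0.055 \\
  473 / 5902 & \approx & 0.080
  \end{array}
\]
for the electronics, mechanics and chemistry queries, respectively.
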
225: We compared the following methods:
226: \begin{itemize}
227: \item Japanese-English CLIR, where all possible translations derived
228: from EDR dictionaries and the transliteration method were used as
229: query terms (JEALL),
230: \item Japanese-English CLIR, where disambiguation based on bi-gram
231: statistics was performed, and $k$-best translations were used as
232: query terms (JEDIS),
233: \item Japanese-Japanese monolingual IR (JJ).
234: \end{itemize}
235: Here, we empirically set \mbox{$k=1$}. Although the performance of
236: JEDIS did not significantly differ as long as we set a small value of
237: $k$ (e.g., \mbox{$k=5$}), we achieved the best performance when we set
238: \mbox{$k=1$}.
239:
240: Figure~\ref{fig:rp} shows recall-precision curves for the above three
241: methods, where JEDIS generally outperformed JEALL, and JJ generally
242: outperformed both JEALL and JEDIS, regardless of the recall. The
243: difference between JEALL and JEDIS is attributed to the fact that
244: JEDIS resolved translation ambiguity based on bi-gram statistics
245: extracted from the NACSIS collection. Thus, we can conclude that the
246: use of bi-gram statistics (even when extracted from a collection other than
247: the JAPIO collection) was effective for the query translation.
248:
249: Table~\ref{tab:avg_pre} shows the non-interpolated average precision
250: values, averaged over the three queries, for each method. This table
251: shows that JJ outperformed JEALL and JEDIS, JEDIS outperformed JEALL,
252: and the average precision value for JEDIS was 76\% of that obtained
253: with JJ.
254:
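For reference, a common definition of non-interpolated average
precision for a single query (the symbols are introduced here for
exposition) is
\[
  AP = \frac{1}{R} \sum_{k=1}^{n} P(k) \cdot rel(k),
\]
where $R$ is the number of relevant documents, $n$ is the number of
retrieved documents, $P(k)$ is the precision at rank $k$, and
$rel(k)$ is 1 if the document at rank $k$ is relevant and 0
otherwise. The ratios in Table~\ref{tab:avg_pre} follow from the
averaged values: $0.3156 / 0.4151 \approx 0.7603$ for JEDIS and
$0.2709 / 0.4151 \approx 0.6526$ for JEALL. Likewise, JEDIS improved on
JEALL by a factor of $0.3156 / 0.2709 \approx 1.165$.
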
255: Similar results have been observed in existing CLIR experiments using
256: the TREC and NACSIS collections. Thus, we conclude that the performance of our
257: cross-language patent retrieval system is comparable with that of
258: systems for newspaper articles and technical abstracts.
259:
260: However, we could not conduct statistical testing, which investigates
261: whether the difference in average precision is meaningful or simply
262: due to chance~\cite{hull:sigir-93}, because the number of queries is
263: small. We concede that experiments using a larger number of queries
264: are still needed.
265:
266: \section{Conclusion}
267: \label{sec:conclusion}
268:
269: In this paper, we explored Japanese/English cross-language patent
270: retrieval. For this purpose, we used an existing cross-language IR
271: system relying on a hybrid query translation method, and evaluated its
272: effectiveness using Japanese queries and English patent abstracts.
273: The experimental results paralleled those of existing experiments. That is, we
274: found that resolving translation ambiguity was effective for the query
275: translation, and that the average precision value for cross-language
276: IR was approximately 76\% of that obtained with monolingual IR.
277: Future work will include qualitative/quantitative analyses based on a
278: larger number of queries.
279:
280: \begin{figure}[t]
281: \begin{center}
282: \leavevmode
283: \psfig{file=rp-curve.ps,height=3.2in}
284: \end{center}
285: \caption{Recall-precision curves for different methods.}
286: \label{fig:rp}
287: \end{figure}
288:
289: \begin{table}[htbp]
290: \begin{center}
291: \caption{Non-interpolated average precision values,
292: averaged over the three queries, for different methods.}
293: \medskip
294: \leavevmode
295: \small
296: \begin{tabular}{lcc} \hline\hline
297: Method & Avg. Precision & Ratio to JJ \\ \hline
298: JJ & 0.4151 & -- \\
299: JEDIS & 0.3156 & 0.7603 \\
300: JEALL & 0.2709 & 0.6526 \\
301: \hline
302: \end{tabular}
303: \label{tab:avg_pre}
304: \end{center}
305: \end{table}
306:
307: \small
308:
309: \bibliographystyle{jplain}
310:
311: \begin{thebibliography}{10}
312:
313: \bibitem{ballesteros:sigir-98}
314: Lisa Ballesteros and W.~Bruce Croft.
315: \newblock Resolving ambiguity for cross-language retrieval.
316: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR
317: Conference on Research and Development in Information Retrieval}, pp. 64--71,
318: 1998.
319:
320: \bibitem{carbonell:ijcai-97}
321: Jaime~G. Carbonell, Yiming Yang, Robert~E. Frederking, Ralf~D. Brown, Yibing
322: Geng, and Danny Lee.
323: \newblock Translingual information retrieval: A comparative evaluation.
324: \newblock In {\em Proceedings of the 15th International Joint Conference on
325: Artificial Intelligence}, pp. 708--714, 1997.
326:
327: \bibitem{davis:sigir-97}
328: Mark~W. Davis and William~C. Ogden.
329: \newblock {QUILT}: Implementing a large-scale cross-language text retrieval
330: system.
331: \newblock In {\em Proceedings of the 20th Annual International ACM SIGIR
332: Conference on Research and Development in Information Retrieval}, pp. 92--98,
333: 1997.
334:
335: \bibitem{fellbaum:wordnet-98}
336: Christiane Fellbaum, editor.
337: \newblock {\em {WordNet}: An Electronic Lexical Database}.
338: \newblock MIT Press, 1998.
339:
340: \bibitem{fujii:ntcir-99}
341: Atsushi Fujii and Tetsuya Ishikawa.
342: \newblock Cross-language information retrieval at {ULIS}.
343: \newblock In {\em Proceedings of the 1st NTCIR Workshop on Research in Japanese
344: Text Retrieval and Term Recognition}, pp. 163--169, 1999.
345:
346: \bibitem{fujii:emnlp-vlc-99}
347: Atsushi Fujii and Tetsuya Ishikawa.
348: \newblock Cross-language information retrieval for technical documents.
349: \newblock In {\em Proceedings of the Joint ACL SIGDAT Conference on Empirical
350: Methods in Natural Language Processing and Very Large Corpora}, pp. 29--37,
351: 1999.
352:
353: \bibitem{gonzalo:chum-98}
354: Julio Gonzalo, Felisa Verdejo, Carol Peters, and Nicoletta Calzolari.
355: \newblock Applying {EuroWordNet} to cross-language text retrieval.
356: \newblock {\em Computers and the Humanities}, Vol.~32, pp. 185--207, 1998.
357:
358: \bibitem{hull:sigir-93}
359: David Hull.
360: \newblock Using statistical testing in the evaluation of retrieval experiments.
361: \newblock In {\em Proceedings of the 16th Annual International ACM SIGIR
362: Conference on Research and Development in Information Retrieval}, pp.
363: 329--338, 1993.
364:
365: \bibitem{edr:95}
366: {Japan Electronic Dictionary Research Institute}.
367: \newblock {EDR} electronic dictionary technical guide, 1995.
368: \newblock (In Japanese).
369:
370: \bibitem{kando:sigir-99}
371: Noriko Kando, Kazuko Kuriyama, and Toshihiko Nozue.
372: \newblock {NACSIS} test collection workshop ({NTCIR-1}).
373: \newblock In {\em Proceedings of the 22nd Annual International ACM SIGIR
374: Conference on Research and Development in Information Retrieval}, pp.
375: 299--300, 1999.
376:
377: \bibitem{littman:clir-98}
378: Michael~L. Littman, Susan~T. Dumais, and Thomas~K. Landauer.
379: \newblock Automatic cross-language information retrieval using latent semantic
380: indexing.
381: \newblock In Gregory Grefenstette, editor, {\em Cross-Language Information
382: Retrieval}, chapter~5, pp. 51--62. Kluwer Academic Publishers, 1998.
383:
384: \bibitem{matsumoto:chasen-97}
385: Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Osamu Imaichi, and Tomoaki
386: Imamura.
387: \newblock {Japanese} morphological analysis system {ChaSen} manual.
388: \newblock Technical Report NAIST-IS-TR97007, NAIST, 1997.
389: \newblock (In Japanese).
390:
391: \bibitem{mccarley:acl-99}
392: J.~Scott McCarley.
393: \newblock Should we translate the documents or the queries in cross-language
394: information retrieval?
395: \newblock In {\em Proceedings of the 37th Annual Meeting of the Association for
396: Computational Linguistics}, pp. 208--214, 1999.
397:
398: \bibitem{oard:amta-98}
399: Douglas~W. Oard.
400: \newblock A comparative study of query and document translation for
401: cross-language information retrieval.
402: \newblock In {\em Proceedings of the 3rd Conference of the Association for
403: Machine Translation in the Americas}, pp. 472--483, 1998.
404:
405: \bibitem{salton:jasis-70}
406: Gerard Salton.
407: \newblock Automatic processing of foreign language documents.
408: \newblock {\em Journal of the American Society for Information Science},
409: Vol.~21, No.~3, pp. 187--194, 1970.
410:
411: \bibitem{salton:ipm-88}
412: Gerard Salton and Christopher Buckley.
413: \newblock Term-weighting approaches in automatic text retrieval.
414: \newblock {\em Information Processing \& Management}, Vol.~24, No.~5, pp.
415: 513--523, 1988.
416:
417: \bibitem{voorhees:sigir-98}
418: Ellen~M. Voorhees.
419: \newblock Variations in relevance judgments and the measurement of retrieval
420: effectiveness.
421: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR
422: Conference on Research and Development in Information Retrieval}, pp.
423: 315--323, 1998.
424:
425: \bibitem{zobel:sigir-forum-98}
426: Justin Zobel and Alistair Moffat.
427: \newblock Exploring the similarity space.
428: \newblock {\em ACM SIGIR Forum}, Vol.~32, No.~1, pp. 18--34, 1998.
429:
430: \end{thebibliography}
431:
432: \end{document}
433:
434: % Local Variables:
435: % mode: japanese-LaTeX
436: % TeX-master: t
437: % End:
438: