1: \documentstyle[chum]{article}
2:
3: \title{Japanese/English Cross-Language Information Retrieval:
4: Exploration of Query Translation and Transliteration\footnote{Computers and the Humanities, Vol.35, No.4, pp.389--420, Nov. 2001}}
5:
6: \author{\Large Atsushi Fujii and Tetsuya Ishikawa}
7:
8: \date{University of Library and Information Science \\
9: 1-2 Kasuga Tsukuba 305-8550, JAPAN \\ \smallskip
10: {\normalsize\tt
11: E-mail:fujii@ulis.ac.jp}}
12:
13: \summary{Cross-language information retrieval (CLIR), where queries
14: and documents are in different languages, has of late become one of
15: the major topics within the information retrieval community. This
paper proposes a Japanese/English CLIR system, where we combine
query translation and retrieval modules. We currently target the
18: retrieval of technical documents, and therefore the performance of our
19: system is highly dependent on the quality of the translation of
20: technical terms. However, the technical term translation is still
21: problematic in that technical terms are often compound words, and thus
22: new terms are progressively created by combining existing base
23: words. In addition, Japanese often represents loanwords based on its
24: special phonogram. Consequently, existing dictionaries find it
25: difficult to achieve sufficient coverage. To counter the first
26: problem, we produce a Japanese/English dictionary for base words, and
27: translate compound words on a word-by-word basis. We also use a
28: probabilistic method to resolve translation ambiguity. For the second
problem, we use a transliteration method, which maps words
unlisted in the base word dictionary to their phonetic equivalents in
31: the target language. We evaluate our system using a test collection
32: for CLIR, and show that both the compound word translation and
33: transliteration methods improve the system performance.}
34:
35: \begin{document}
36:
37: \makeidpage
38: \maketitle
39:
40: \input{psfig.tex}
41:
42: \newcommand{\etal}{et~al.}
43: \newcommand{\etaleos}{et~al}
44: \newcommand{\eq}[1]{(\ref{#1})}
45:
46: \renewcommand{\nocite}[1]{\shortcite{#1}}
47:
48: \section{Introduction}
49: \label{sec:introduction}
50:
51: Cross-language information retrieval (CLIR) is the retrieval process
52: where the user presents queries in one language to retrieve documents
in {\em another\/} language. One of the earliest research
references for CLIR dates back to the 1960s~\cite{mongar:tis-69}. In
55: the 1970s, Salton~\nocite{salton:jasis-70,salton:techrep-72}
56: empirically showed that CLIR using a hand-crafted bilingual thesaurus
57: is comparable with monolingual information retrieval in
58: performance. The 1990s witnessed a growing number of machine readable
59: texts in various languages, including those accessible via the World
Wide Web, but each item of content is usually provided in a limited
number of languages. Thus, it is plausible that users are interested in
retrieving information across languages. Possible users of CLIR are
63: given below:
64: \begin{itemize}
65: \item users who are able to read documents in foreign languages, but
66: have difficulty formulating foreign queries,
\item users who find it difficult to retrieve/read relevant documents,
but need the information, in which case applying machine translation
(MT) systems to the limited number of documents retrieved through
CLIR is computationally more efficient than translating the entire
collection,
72: \item users who know foreign keywords/phrases, and want to read
73: documents associated with them, in their native language.
74: \end{itemize}
75: In fact, CLIR has of late become one of the major topics within the
76: information retrieval (IR), natural language processing (NLP) and
77: artificial intelligence (AI) communities, and numerous CLIR systems
78: have variously been
79: proposed~\cite{aaai-spring-sympo-97,sigir-96-98,trec-92-98}.
80: Note that CLIR can be seen as a subtask of multi-lingual information
81: retrieval (MLIR), which also includes the following cases:
82: \begin{itemize}
83: \item identify the query language (based on, for example, character
84: codes), and search a multilingual collection for documents in the
85: query language,
86: \item retrieve documents, in which each document is in more than one
87: language,
88: \item retrieve documents using a query in more than one
89: language~\cite{fung:acl-99}.
90: \end{itemize}
However, these cases are beyond the scope of this paper. It
92: should also be noted that while CLIR is not necessarily limited to IR
93: within two languages, we consistently use the term ``bilingual,''
94: keeping the potential applicability of CLIR to more than two languages
95: in mind, because the variety of languages used is not the central
96: issue of this paper.
97:
98: Since by definition queries and documents are in different languages,
99: CLIR needs a translation process along with the conventional
100: monolingual retrieval process. For this purpose, existing CLIR
101: systems adopt various techniques explored in NLP research. In brief,
102: dictionaries, corpora, thesauri and MT systems are used to translate
103: queries and/or documents. However, due to the rudimentary nature of
104: existing translation methods, CLIR still finds it difficult to achieve
105: the performance of monolingual IR. Roughly speaking, recent
experiments showed that the average precision of CLIR is 50--75\% of
107: that obtained with monolingual IR~\cite{schauble:trec-97}, which
108: stimulates us to further explore this exciting research area.
109:
In this paper, we propose a Japanese/English bidirectional CLIR system
targeting technical documents, a domain that has been less explored than
newspaper articles in past CLIR literature. Our research is
113: partly motivated by the NACSIS test collection for (CL)IR systems,
114: which consists of Japanese queries and Japanese/English abstracts
115: collected from technical papers~\cite{kando:sigir-99}.\footnote{\tt
116: {http://www.rd.nacsis.ac.jp/\~{}ntcadm/index-en.html}} We will
117: elaborate on the NACSIS collection in
118: Section~\ref{subsec:eval_overview}. As can be predicted, the
119: performance of our CLIR system strongly depends on the quality of the
120: translation of technical terms, which are often unlisted in general
121: dictionaries.
122:
123: Pirkola~\nocite{pirkola:sigir-98}, for example, used a subset of the
124: TREC collection related to health topics, and showed that a
125: combination of general and domain specific (i.e., medical)
126: dictionaries improves the CLIR performance obtained with only a
127: general dictionary. This result shows the potential contribution of
128: technical term translation to CLIR. At the same time, it should be
129: noted that even domain specific dictionaries do not exhaustively list
130: possible technical terms. For example, the EDR technical terminology
131: dictionary~\cite{edr-techdic:95}, which consists of approximately
132: 120,000 Japanese-English translations related to the information
133: processing field, lacks recent terms like ``{\it jouhou
134: chuushutsu\/}~(information extraction).'' We classify problems
135: associated with technical term translation as given below:
136: \begin{itemize}
137: \item technical terms are often compound words, which
138: can be progressively created simply by combining multiple existing
139: morphemes (``base words''), and therefore it is not entirely
140: satisfactory or feasible to exhaustively enumerate newly emerging
141: terms in dictionaries,
142: \item Japanese often represents loanwords (i.e., technical terms and
143: proper nouns imported from foreign languages) using its special
144: phonetic alphabet (or phonogram) called ``{\it katakana},'' with
145: which new words can be spelled out,
146: \item English technical terms are often abbreviated, which can be used
147: as ``Japanese'' words.
148: \end{itemize}
149: To counter the first problem, we propose a compound word translation
150: method, which selects appropriate translations based on the
151: probability of occurrence of each combination of base words in the
152: target language (see Section~\ref{subsec:cwt}). Note that technical
153: compound words sometimes include general words, such as ``AI {\em
154: chess\/}'' and ``digital {\em watermark\/}.'' In this paper, we do not
155: rigorously define general words, by which we mean words that are
156: contained in existing general dictionaries but rarely in technical
157: term dictionaries. For the second problem, we propose a
158: ``transliteration'' method, which identifies phonetic equivalents in
159: the target language (see Section~\ref{subsec:translit}). Finally, to
resolve the third problem, we enhance our bilingual dictionary with
pairs consisting of an abbreviation and its complete form (e.g., ``IR''
and ``information retrieval'') extracted from English corpora (see
Section~\ref{subsec:dictionary_enhancement}). Note that although a
number of methods targeting the above problems have been explored
in past research, no attempt has been made to integrate them in the
context of CLIR.
167:
168: Section~\ref{sec:past_research} surveys past research on CLIR, and
169: clarifies our focus and approach. Section~\ref{sec:system_overview}
170: overviews our CLIR system, and Section~\ref{sec:translation}
elaborates on the translation method aimed at resolving the above
problems associated with technical term translation.
173: Section~\ref{sec:evaluation} then evaluates the performance of our
174: CLIR system using the NACSIS collection.
175:
176: \section{Past Research on CLIR}
177: \label{sec:past_research}
178:
179: \subsection{Retrieval Methodologies}
180: \label{subsec:retrieval_methods}
181:
182: Figure~\ref{fig:retrieval_methods} classifies existing CLIR
approaches in terms of retrieval methodology. The three top-level
categories correspond to the titles of the following items.
185:
186: \paragraph{Query translation approach}
187:
188: This approach translates queries into document languages using
bilingual dictionaries and/or corpora, prior to the retrieval process.
190: Since the retrieval process is fundamentally the same as performed in
191: monolingual IR, the translation module can easily be combined with
192: existing IR engines. This category can be further subdivided into the
193: following three methods.
194:
195: The first subcategory can be called dictionary-based methods. Hull
196: and Grefenstette~\nocite{hull:sigir-96} used a bilingual dictionary to
197: derive all possible translation candidates of query terms, which are
198: used for the subsequent retrieval. Their method is easy to implement,
199: but potentially retrieves irrelevant documents and decreases the time
200: efficiency. To resolve this problem,
201: Hull~\nocite{hull:aaai-spring-sympo-97} combined translation
202: candidates for each query term with the ``OR'' operator, and used the
203: weighted boolean method to assign an importance degree to each
204: translation candidate.
205:
206: Pirkola~\nocite{pirkola:sigir-98} also used structured queries, where
207: each term is combined with different types of operators. Ballesteros
208: and Croft~\nocite{ballesteros:sigir-97} enhanced the dictionary-based
209: translation using the ``local context analysis''~\cite{xu:sigir-96}
210: and phrase-based translation. Dorr and Oard~\nocite{dorr:lrec-98}
211: evaluated the effectiveness of a semantic structure of a query in the
212: query translation. As far as their comparative experiments were
213: concerned, the use of semantic structures was not as effective as
214: MT/dictionary-based query translation methods.
215:
216: The second subcategory, corpus-based methods, uses translations
217: extracted from bilingual corpora, for the query
218: translation~\cite{carbonell:ijcai-97}. In this paper, ``(bilingual)
219: aligned corpora'' generally refer to a pair of two language corpora
220: aligned to each other on a word, sentence, paragraph or document
221: basis. Given such resources, corpus-based methods are expected to
222: acquire domain specific translations unlisted in existing
223: dictionaries. In fact, Carbonell~\etal~\nocite{carbonell:ijcai-97}
224: empirically showed that their corpus-based query translation method
225: outperformed a dictionary-based method. Their comparative evaluation
226: also showed that the corpus-based translation method outperformed
227: GVSM/LSI-based methods (see the following ``Interlingual
228: representation approach'' item for details of GVSM and LSI). Note that
229: for the purpose of corpus-based translation methods, a number of
230: translation extraction techniques explored in NLP
231: research~\cite{fung:acl-95,kaji:coling-96,smadja:cl-96} are
232: applicable.
233:
234: Finally, hybrid methods use corpora to resolve the translation
235: ambiguity inherent in bilingual dictionaries. Unlike the corpus-based
236: translation methods described above, which rely on bilingual corpora,
237: Ballesteros and Croft~\nocite{ballesteros:sigir-98} and
238: Chen~\etal~\nocite{chen:acl-99} independently used a {\em
239: monolingual\/} corpus for the disambiguation, and therefore the
240: implementation cost is less. In practice, their method selects the
241: combination of translation candidates that frequently co-occur in the
242: target language corpus. On the other hand, bilingual corpora are also
243: applicable to hybrid
244: methods. Okumura~\etal~\nocite{okumura:lrec-tlim-ws-98} and
245: Yamabana~\etal~\nocite{yamabana:sigir-ws-96} independently used the
246: same disambiguation method, in that they consider word frequencies in
247: both the source and target languages, obtained from a bilingual
248: aligned corpus. Nie~\etal~\nocite{nie:sigir-99} automatically
249: collected parallel texts in French and English from the World Wide
250: Web, to train a probabilistic query translation model, and suggested
251: its feasibility for CLIR.
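
As a rough sketch of the monolingual-corpus variant described above
(the notation is ours and purely schematic; the cited methods differ in
the statistics and search procedures they actually use), let
$s_{1},\ldots,s_{n}$ be the source query terms, $T(s_{i})$ the set of
translation candidates listed for $s_{i}$ in the dictionary, and
$c(t_{i},t_{j})$ a co-occurrence score for $t_{i}$ and $t_{j}$ computed
from the target language corpus. Disambiguation can then be viewed as
selecting the combination
\[
(\hat{t}_{1},\ldots,\hat{t}_{n}) =
\arg\max_{t_{1} \in T(s_{1}),\ldots,t_{n} \in T(s_{n})}
\sum_{i<j} c(t_{i},t_{j}) .
\]
Here the pairwise sum is only one possible way to aggregate the
co-occurrence scores; it is assumed for illustration.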
252:
253: Davis and Ogden~\nocite{davis:sigir-97} used a bilingual aligned
254: corpus as the document collection for training retrieval. They first
255: derive possible translation candidates using a dictionary. Then,
256: training retrieval trials are performed on the bilingual corpus, in
257: which the source and translated queries are used to retrieve source
258: and target documents, respectively. Finally, they select translations
259: which retrieved documents aligned to those retrieved with the source
260: query. Note that this method provides a salient contrast to other
261: query translation methods, in which translation is performed
262: independently from the retrieval
263: module.
264:
Chen~\etal~\nocite{chen:acl-99} addressed the disambiguation of
polysemy in the target language, along with the translation
disambiguation. Specifically, in the case where a source query term
corresponds to a small number of translations, but some of these
translations are associated with a large number of word senses,
polysemy disambiguation is more crucial than the resolution of
translation ambiguity. To counter this problem, source query terms are
272: expanded with words that frequently co-occur, which are expected to
273: restrict the meaning of polysemous words in the target language
274: documents.
275:
276: \paragraph{Document translation approach}
277:
278: This approach translates documents into query languages, prior to the
279: retrieval. In most cases, existing MT systems are used to translate
280: all the documents in a given
281: collection~\cite{gachot:sigir-ws-96,kwon:cpol-98,oard:amta-98}. Otherwise,
282: a dictionary-based method is used to translate only index
terms~\cite{aone:anlp-97}. It is plausible that when compared with
284: short queries, documents contain a significantly higher volume of
285: information for the translation. In fact, Oard~\nocite{oard:amta-98}
286: showed that the document translation method using an MT system
287: outperformed several types of dictionary-based query translation
288: methods.
289:
290: However, McCarley~\nocite{mccarley:acl-99} showed that the relative
291: superiority between query and document translation approaches varied
depending on the source and target language pair. He also showed that
a hybrid system (not to be confused with the hybrid methods described
in the ``Query translation approach'' item above), where the relevance
degree of each document (i.e., the ``score'') is the mean of those
obtained with query and document translation systems, outperformed
systems based on either the query or the document translation approach
alone. However,
298: generally speaking, the full translation on large-scale collections
299: can be prohibitive.
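
Schematically, and assuming an unweighted arithmetic mean (the notation
below is ours, introduced only for illustration), the score that such a
hybrid system assigns to document $d$ for query $q$ is
\[
\mbox{\rm score}_{hyb}(q,d) = \frac{1}{2}\left(
\mbox{\rm score}_{QT}(q,d) + \mbox{\rm score}_{DT}(q,d) \right),
\]
where $QT$ and $DT$ denote the query translation and document
translation systems, respectively. For instance, with hypothetical
scores of 0.8 and 0.4 from the two systems, the hybrid score is 0.6.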
300:
301: \paragraph{Interlingual representation approach}
302:
303: The basis of this approach is to project both queries and documents in
304: a language-independent (conceptual) space. In other words, as
305: Salton~\nocite{salton:jasis-70,salton:techrep-72} and Sheridan and
306: Ballerini~\nocite{sheridan:sigir-96} identified, the interlingual
307: representation approach is based on query expansion methods proposed
308: for monolingual IR. This category can be subdivided into
309: thesaurus-based methods and variants of the vector space model
310: (VSM)~\cite{salton:83}.
311:
312: Salton~\nocite{salton:jasis-70,salton:techrep-72} applied hand-crafted
313: English/French and English/German thesauri to the SMART
314: system~\cite{salton:71}, and demonstrated that a CLIR version of the
315: SMART system is comparable to the monolingual version in
316: performance. The International Road Research Documentation
317: scheme~\cite{mongar:tis-69} used a trilingual thesaurus associated
318: with English, German and French.
319: Gilarranz~\etal~\nocite{gilarranz:aaai-spring-sympo-97} and
320: Gonzalo~\etal~\nocite{gonzalo:chum-98} used the EuroWordNet
multilingual thesaurus~\cite{vossen:chum-98}. Unlike the above
322: methods relying on manual thesaurus construction, Sheridan and
323: Ballerini~\nocite{sheridan:sigir-96} used a multilingual thesaurus
324: automatically produced from an aligned corpus.
325:
326: The generalized vector space model (GVSM)~\cite{wong:sigir-85} and
327: latent semantic indexing (LSI)~\cite{deerwester:jasis-90}, which were
328: originally proposed as variants of the vector space model for
329: monolingual IR, project both queries and documents into a
language-independent vector space, and therefore these methods are
applicable to CLIR. While Dumais~\etal~\nocite{dumais:sigir-ws-96}
332: explored an LSI-based CLIR,
333: Carbonell~\etal~\nocite{carbonell:ijcai-97} empirically showed that
334: GVSM outperformed LSI in terms of CLIR. Note that like thesaurus-based
335: methods, GVSM/LSI-based methods require aligned corpora.
336:
337: \begin{figure*}[htbp]
338: \def\baselinestretch{1}
339: \begin{center}
340: \leavevmode
341: \small
342: \fbox{
343: $\left\{
344: \begin{array}{l}
345: \mbox{query translation approach}\left\{
346: \begin{array}{l}
347: \mbox{dictionary-based methods} \\
348: \mbox{corpus-based methods} \\
349: \mbox{hybrid methods}\left\{
350: \begin{array}{l}
351: \mbox{bilingual aligned corpora} \\
352: $\underline{\mbox{monolingual corpora}}$
353: \end{array}\right.
354: \end{array}\right. \medskip \\
355: \mbox{document translation approach}\left\{
356: \begin{array}{l}
357: \mbox{full document translation} \\
358: \mbox{index term translation}
359: \end{array}\right. \medskip \\
360: \mbox{interlingual representation approach}\left\{
361: \begin{array}{l}
362: \mbox{thesaurus-based methods}\left\{
363: \begin{array}{l}
364: \mbox{hand-crafted thesauri} \\
365: \mbox{corpus-based thesauri}
366: \end{array}\right.\medskip \\
367: \mbox{vector space models}\left\{
368: \begin{array}{l}
369: \mbox{generalized vector space model} \\
370: \mbox{latent semantic indexing}
371: \end{array}\right.
372: \end{array}\right.
373: \end{array}
374: \right.$}
375: \end{center}
376: \medskip
377: \caption{Classification of CLIR retrieval methods (the method we
378: adopt is underlined)}
379: \label{fig:retrieval_methods}
380: \end{figure*}
381:
382: \subsection{Presentation Methodologies}
383: \label{subsec:presentation_methods}
384:
385: In the case of CLIR, retrieved documents are not always written in the
user's native language. Therefore, the presentation of
retrieval results is a more crucial task than in monolingual IR. It
is desirable to present smaller-sized contents with less noise; in
other words, precision is often given more importance than recall for
390: CLIR systems. Note that effective presentation is also crucial when a
391: user and system interactively retrieve relevant documents, as
392: performed in relevance feedback~\cite{salton:83}.
393:
394: However, a surprisingly small number of references addressing this
395: issue can be found in past research literature.
396: Aone~\etal~\nocite{aone:anlp-97} presented only keywords frequently
397: appearing in retrieved documents, rather than entire documents. Note
398: that since most CLIR systems use frequency information associated with
399: index terms like ``term frequency (TF)'' and ``inverse document
400: frequency (IDF)''~\cite{salton:83} for the retrieval, frequently
401: appearing keywords can be identified without an excessive additional
402: computational cost. Experiments independently conducted by Oard and
403: Resnik~\nocite{oard:ipm-99} and
404: Suzuki~\etal~\nocite{suzuki:signl-98-7} showed that even a simple
405: translation of keywords (such as using all possible translations
defined in a dictionary) improved the efficiency with which users find
relevant foreign documents in the whole retrieval result.
408: Suzuki~\etal~\nocite{suzuki:nlp-99} more extensively investigated the
409: user's retrieval efficiency (i.e., the time efficiency and accuracy
410: with which human subjects find relevant foreign documents) by
411: comparing different presentation methods, in which the following
412: contents were independently presented to the user:
413: \begin{enumerate}
414: \item keywords without translation,
415: \item keywords translated with the first entry defined in a dictionary,
416: \item keywords translated through the hybrid method (see the
417: ``Query translation approach'' item in
418: Section~\ref{subsec:retrieval_methods}),
419: \item documents summarized (by an existing summarization software) and
420: manually translated.
421: \end{enumerate}
422: Their comparative experiments showed that the third content was most
423: effective in terms of the retrieval efficiency.
424:
425: For monolingual IR, automatic summarization methods based on the
426: user's focus/query have recently been explored. Mani and
427: Bloedorn~\nocite{mani:aaai-iaai-98} used machine learning techniques
428: to produce document summarization rules based on the user's focus
429: (i.e., query). Tombros and Sanderson~\nocite{tombros:sigir-98} showed
430: experimental results, in which presenting the fragment of each
retrieved document containing query terms improved the retrieval
432: efficiency of human subjects. Applicability of these methods to CLIR
433: needs to be further explored.
434:
435: \subsection{Evaluation Methodologies}
436: \label{subsec:evaluation_methods}
437:
438: From a scientific point of view, performance evaluation is invaluable
439: for CLIR. In most cases, the evaluation of CLIR is the same as
440: performed for monolingual IR. That is, each system conducts a
441: retrieval trial using a test collection consisting of predefined
442: queries and documents in {\em different\/} languages, and then the
443: performance is evaluated based on the precision and recall. Several
444: experiments used test collections for monolingual IR in which either
445: queries or documents were translated, prior to the
446: evaluation. However, as Sakai~\etal~\nocite{sakai:tipsj-99}
447: empirically showed, the CLIR performance varies depending on the
448: quality of the translation of collections, and thus it is desirable to
449: carefully produce test collections for CLIR. The production of test
450: collections usually involves collecting documents, producing queries
451: and relevance assessment for each query. However, since relevance
452: assessment is expensive, especially for large-scale collections (even
453: in the case where the pooling method~\cite{voorhees:sigir-98} is used
454: to reduce the number of candidates of relevant documents),
455: Carbonell~\etal~\nocite{carbonell:ijcai-97} first translated queries
456: into the document language, and used as (pseudo) relevant documents
457: those retrieved with the translated queries. In other words, this
458: evaluation method investigates the extent to which CLIR maintains the
459: performance of monolingual IR.
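
For reference, precision and recall are computed per query from the
set of retrieved documents and the set of relevant documents. Writing
$Ret$ and $Rel$ for these two sets (notation introduced here for
illustration only), we have
\[
\mbox{\rm precision} = \frac{|Rel \cap Ret|}{|Ret|}, \qquad
\mbox{\rm recall} = \frac{|Rel \cap Ret|}{|Rel|} .
\]
As a hypothetical example, if a system returns 20 documents for a
query that has 10 relevant documents, and 5 of the returned documents
are relevant, then precision is $5/20 = 0.25$ and recall is
$5/10 = 0.5$. Under the pseudo-relevance evaluation described above,
$Rel$ is replaced by the set of documents retrieved with the translated
queries.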
460:
461: For the evaluation of presentation methods, human subjects are often
462: used to investigate the retrieval efficiency, as described in
463: Section~\ref{subsec:presentation_methods}. However, evaluation methods
464: involving human interactions are problematic, because human subjects
465: are in a way trained through repetitive retrieval trials for different
466: systems, which can potentially bias the result. On the other hand, in
the case where each subject uses a single system, differences among
subjects affect the result. To minimize this bias, multiple subjects
469: are usually classified based on, for example, their literacy in terms
470: of the target language, and those falling into the same cluster are
471: virtually regarded as the same person. However, this issue still
472: remains an open question, and needs to be further explored.
473:
474: \subsection{Our Focus and Approach}
475: \label{subsec:our_approach}
476:
Through the discussions in the above three subsections, we identified the
478: following points which should be taken into consideration for our
479: research.
480:
481: For translation methodology, the query translation approach is
482: preferable in terms of implementation cost, because this approach can
483: simply be combined with existing IR engines. On the other hand, other
484: approaches can be prohibitive, because (a) the document translation
485: approach conducts the full translation on the entire collection, and
486: (b) the interlingual representation approach requires alignment of
487: bilingual thesauri/corpora. In fact, we do not have Japanese-English
488: thesauri/corpora with sufficient volume of alignment information at
489: present. One may argue that the NACSIS collection, which is a
large-scale Japanese-English aligned corpus, can be used for the
491: translation. However, note that bilingual corpora for the translation
492: must not be obtained from the test collection used for the evaluation,
493: because in real world usage one of the two language documents in the
494: collection is usually missing. In other words, CLIR has little
495: necessity for bilingual aligned document collections, in that the user
496: can retrieve documents in the query language, without the translation
497: process.
498:
499: However, at the same time we concede that each approach is worth
500: further exploration, and in this paper we do not pretend to draw any
501: premature conclusions regarding the relative merits of different
502: approaches. To sum up, we focus mainly on translating sequences of
503: content words included in queries, rather than the entire
504: collection. Among different methods following the query translation
505: approach, we adopt the hybrid method using a {\em monolingual\/}
506: corpus. In other words, our translation method is relatively similar
to those proposed by Ballesteros and
Croft~\nocite{ballesteros:sigir-98} and
509: Chen~\etal~\nocite{chen:acl-99}. However, unlike their cases, we
510: integrate word-based translation and transliteration methods within
511: the query translation.
512:
513: For presentation methodology, we use keywords translated using the
514: hybrid translation method, which were proven to be effective in
515: comparative experiments by Suzuki~\etal~\nocite{suzuki:nlp-99} (in the
516: case where retrieved documents are not in the user's native language).
517: Note that for the purpose of the translation of keywords, we can use
518: exactly the same method as performed for the query translation,
519: because both queries and keywords usually consist of one or more
520: content words.
521:
522: Finally, for the evaluation of our CLIR system we use the NACSIS
523: collection~\cite{kando:sigir-99}. Since in this collection relevance
524: assessment is performed between Japanese queries and Japanese/English
525: documents, we can easily evaluate our system in terms of
526: Japanese-English CLIR. On the other hand, the evaluation of
527: English-Japanese CLIR is beyond the scope of this paper, because as
528: discussed in Section~\ref{subsec:evaluation_methods} the production of
529: English queries has to be carefully conducted, and is thus expensive.
530: Besides this, in this paper we do not evaluate our system in terms of
presentation methodology, because experiments using human subjects are
532: also expensive and still problematic. These remaining issues need to
533: be further explored.
534:
535: \section{System Overview}
536: \label{sec:system_overview}
537:
538: Figure~\ref{fig:system} depicts the overall design of our CLIR system,
539: in which we combine a translator with an IR engine for monolingual
540: retrieval. In the following, we briefly explain the retrieval process
541: based on this figure.
542:
543: First, the translator processes a query in the source language
544: (query in S) to output the translation (query in T). For this
545: purpose, the translator uses a dictionary to derive possible
546: translation candidates and a collocation to resolve the
547: translation ambiguity. Note that a user can utilize more than one
548: translation candidate, because multiple translations are often
549: appropriate for a single query. By the collocation, we mean
550: bi-gram statistics associated with content words extracted from NACSIS
551: documents. Since our system is bidirectional between Japanese and
552: English, we tokenize documents with different methods, depending on
553: their language. For English documents, the tokenization involves
554: eliminating stopwords and identifying root forms for inflected content
555: words. For this purpose, we use
556: WordNet~\cite{fellbaum:wordnet-98}, which contains a stopword list
557: and correspondences between inflected words and their root form. On
558: the other hand, we segment Japanese documents into lexical units using
559: the ChaSen morphological analyzer~\cite{matsumoto:chasen-97},
560: which has commonly been used for much Japanese NLP research, and
561: extract content words based on their part-of-speech information.
562:
563: Second, the IR engine searches the NACSIS collection for documents
564: (docs in T) relevant to the translated query, and sorts them
565: according to the degree of relevance, in descending order. Our IR
566: engine is currently a simple implementation of the vector space model,
567: in which the similarity between the query and each document (i.e., the
568: degree of relevance of each document) is computed as the cosine of the
569: angle between their associated vectors. We used the notion of
570: TF$\cdot$IDF for term weighting. Among a number of variations of term
571: weighting methods~\cite{salton:ipm-88,zobel:sigir-forum-98}, we
572: tentatively implemented two alternative types of TF (term frequency)
573: and one type of IDF (inverse document frequency), as shown in
574: Equation~\eq{eq:tf_idf}.
575: \begin{equation}
576: \label{eq:tf_idf}
577: \begin{array}{llll}
578: TF & = & f_{t,d} & (\mbox{\rm standard formulation}) \\
579: \noalign{\vskip 1.2ex}
580: TF & = & 1 + \log(f_{t,d}) & (\mbox{\rm logarithmic formulation})
581: \\
582: \noalign{\vskip 1.2ex}
583: IDF & = & \log(\frac{\textstyle N}{\textstyle n_{t}})
584: \end{array}
585: \end{equation}
Here, $f_{t,d}$ denotes the frequency with which term $t$ appears in
document $d$, and $n_{t}$ denotes the number of documents containing
term $t$. $N$ is the total number of documents in the collection. The
second TF type diminishes the effect of $f_{t,d}$, and consequently
IDF affects the similarity computation more. We shall call the first
591: and second TF types ``standard'' and ``logarithmic'' formulations,
592: respectively. For the indexing process, we first tokenize documents as
593: explained above (i.e., we use WordNet and ChaSen for English and
594: Japanese documents, respectively), and then conduct the word-based
595: indexing. That is, we use each content word as a single indexing term.
596: Since our focus in this paper is the query translation rather than the
597: retrieval process, we do not explore other IR techniques, including
598: query expansion and relevance feedback.
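
The ranking described above can be illustrated with a minimal Python
sketch. It assumes that documents and queries are already tokenized
into content words, and that the query vector is weighted in the same
way as the document vectors, which is an assumption made only for this
illustration. All function names and the toy documents are
hypothetical.

```python
import math
from collections import Counter

def tf(freq, logarithmic=True):
    """Term frequency: 1 + log(f) (logarithmic) or f (standard)."""
    return 1.0 + math.log(freq) if logarithmic else float(freq)

def weight_vectors(docs, logarithmic=True):
    """Return IDF values and one TF*IDF vector (dict) per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / n) for term, n in doc_freq.items()}
    vectors = [{t: tf(f, logarithmic) * idf[t] for t, f in Counter(doc).items()}
               for doc in docs]
    return idf, vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def rank(query, docs, logarithmic=True):
    """Return document indices sorted by descending cosine similarity."""
    idf, vectors = weight_vectors(docs, logarithmic)
    # Assumption: the query is weighted like a document; unseen terms get 0.
    q = {t: tf(f, logarithmic) * idf.get(t, 0.0)
         for t, f in Counter(query).items()}
    scored = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]

# Hypothetical toy collection of tokenized documents.
DOCS = [
    ["data", "mining", "methods", "data"],
    ["machine", "translation", "methods"],
    ["information", "retrieval", "system"],
]
```

Switching the ``logarithmic'' flag selects between the two TF
formulations, so both variants can be compared on the same index.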
599:
600: Finally, in the case where retrieved documents are not in the user's
601: native language, we extract keywords from retrieved documents, and
602: translate them into the source language using the translator (KWs in
603: S). Unlike existing presentation methods, where keywords are words
604: frequently appearing in each
605: document~\cite{aone:anlp-97,suzuki:signl-98-7,suzuki:nlp-99}, we
606: tentatively use author keywords. In the NACSIS collection, each
607: document contains roughly 3-5 single/compound keywords provided by the
608: author(s) of the document. In addition, since the NACSIS documents are
609: relatively short abstracts (instead of entire papers), it is not
610: entirely satisfactory to rely on the word frequency information. Note
611: that even in the case where retrieved documents are in the user's
612: native language, presenting author keywords is expected to improve the
613: retrieval efficiency.
614:
615: For future enhancement, we optionally use an MT system to translate
616: entire documents retrieved (or only documents identified as relevant
617: using author keywords) into the user's native language (docs in S). We
618: currently use the Transer Japanese/English MT system, which combines a
619: general dictionary consisting of 230,000 entries, and a computer
620: terminology dictionary consisting of 100,000
621: entries.\footnote{Developed by NOVA, Inc.} Note that the translation
622: of the limited number of retrieved documents is less expensive than
623: that of the whole collection, as performed in the document translation
624: approach (see Section~\ref{subsec:retrieval_methods}).
625:
626: In Section~\ref{sec:translation}, we will explain the translator
627: in Figure~\ref{fig:system}, which involves compound word translation
628: and transliteration methods. While our translation method is
629: applicable to both queries and keywords in documents, in the following
630: we shall call it the query translation method without loss of
631: generality.
632:
633: \begin{figure}[htbp]
634: \begin{center}
635: \leavevmode
636: \psfig{file=system.eps,height=1.8in}
637: \end{center}
638: \caption{The overall design of our CLIR system (S and T
639: denote the source and target languages, respectively)}
640: \label{fig:system}
641: \end{figure}
642:
643: \section{Query Translation Method}
644: \label{sec:translation}
645:
646: \subsection{Overview}
647: \label{subsec:trans_overview}
648:
649: Given a query in the source language, tokenization is first performed
650: as for target documents, that is, we use WordNet and ChaSen for
651: English and Japanese queries, respectively (see
652: Section~\ref{sec:system_overview}). We then discard stopwords and
653: extract only content words. Here, ``content words'' refer to both
654: single and compound words. Let us take the following English query as
655: an example:
656: \begin{list}{}{}
657: \item improvement or proposal of data mining methods.
658: \end{list}
659: For this query, we discard ``or'' and ``of,'' to extract
660: ``improvement,'' ``proposal'' and ``data mining methods.''
661: Thereafter, we translate each extracted content word on a word-by-word
662: basis, maintaining the word order in the source language. A
663: preliminary study showed that approximately 95\% of compound technical
664: terms defined in a bilingual dictionary~\cite{ferber:89} maintain the
665: same word order in both Japanese and English. Note that we currently
do not consider relations (e.g., syntactic relations) between content
667: words, and thus each content word is translated independently. In
668: brief, our translation method consists of the following two phases:
669: \begin{enumerate}
670: \def\labelenumi{(\theenumi)}
671: \item derive all possible translations for base words,
672: \item resolve translation ambiguity using the collocation associated
673: with base word translations.
674: \end{enumerate}
675: While phase~(2) is the same for both Japanese-English and
676: English-Japanese translations, phase~(1) differs depending on the
677: source language. In the case of English-Japanese translation, we
678: simply consult our bilingual dictionary for each base word. However,
679: transliteration is performed whenever base words unlisted in the
680: dictionary are found.
681:
682: On the other hand, in the case of Japanese-English translation, we
683: consider all possible segmentations of the input word, by consulting
684: the dictionary, because Japanese compound words lack lexical
685: segmentation.\footnote{For Japanese query terms used in our evaluation
686: (see Section~\ref{sec:evaluation}), the average number of possible
segmentations was 4.9.} Then, we select the segmentations that
688: consist of the minimal number of base words. This segmentation method
689: parallels that for the Japanese compound noun
690: analysis~\cite{kobayashi:coling-94}. During the segmentation process,
691: the dictionary derives all possible translations for base words. At
692: the same time, transliteration is performed only when {\it katakana\/}
693: words unlisted in the base word dictionary are found.
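
The selection of minimal segmentations can be sketched as follows in
Python. This is only an illustration under simplifying assumptions:
the base word dictionary is a plain set of strings, romanized strings
stand in for Japanese script, and the toy entries (including the short
entries ``sou,'' ``kan'' and ``suu'') are hypothetical.

```python
def segmentations(word, dictionary):
    """Enumerate every way to split `word` into consecutive dictionary entries."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in dictionary:
            for rest in segmentations(word[i:], dictionary):
                results.append([head] + rest)
    return results

def minimal_segmentations(word, dictionary):
    """Keep only the segmentations with the fewest base words."""
    candidates = segmentations(word, dictionary)
    if not candidates:
        return []
    fewest = min(len(c) for c in candidates)
    return [c for c in candidates if len(c) == fewest]

# Hypothetical romanized base words; a real dictionary holds Japanese script.
BASE_WORDS = {"soukan", "kansuu", "sou", "kan", "suu"}
```

With these entries, ``soukankansuu'' has four possible segmentations,
of which only ``soukan'' + ``kansuu'' consists of two base words.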
694:
695: \subsection{Compound Word Translation}
696: \label{subsec:cwt}
697:
698: This section explains our compound word translation method based on a
699: probabilistic model, focusing mainly on the resolution of translation
700: ambiguity. After deriving possible translations for base words (by way
701: of either consulting the base word dictionary or performing
702: transliteration), we can formally represent the source compound word
703: $S$ and one translation candidate $T$ as below.
704: \begin{eqnarray*}
705: S & = & s_{1}, s_{2}, \ldots, s_{n} \\
706: T & = & t_{1}, t_{2}, \ldots, t_{n}
707: \end{eqnarray*}
Here, $s_{i}$ denotes the $i$-th base word, and $t_{i}$ denotes a
translation candidate of $s_{i}$. Our task, i.e., to select the $T$
which maximizes $P(T|S)$, is transformed into
Equation~\eq{eq:trans_model} through use of Bayes' theorem, as
in statistical machine translation~\cite{brown:cl-93}.
713: \begin{eqnarray}
714: \label{eq:trans_model}
715: \arg\max_{T}P(T|S) & = & \arg\max_{T}P(S|T)\cdot P(T)
716: \end{eqnarray}
717: In practice, in the case where the user utilizes more than one
718: translation, $T$'s with greater probabilities are selected. We
719: approximate $P(S|T)$ and $P(T)$ using statistics associated with base
720: words, as in Equation~\eq{eq:approx}.
721: \begin{equation}
722: \label{eq:approx}
723: \begin{array}{lll}
724: P(S|T) & \approx & {\displaystyle \prod_{i=1}^{n}P(s_{i}|t_{i})} \\
725: \noalign{\vskip 1.2ex}
726: P(T) & \approx & {\displaystyle
727: \prod_{i=1}^{n-1}P(t_{i+1}|t_{i})}
728: \end{array}
729: \end{equation}
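For instance, for a compound word consisting of three base words
($n=3$), substituting Equation~\eq{eq:approx} into
Equation~\eq{eq:trans_model} gives the following score for each
candidate $T$, which consists of three translation factors
$P(s_{i}|t_{i})$ and two bi-gram factors $P(t_{i+1}|t_{i})$:
\begin{eqnarray*}
P(S|T)\cdot P(T) & \approx & P(s_{1}|t_{1})\cdot P(s_{2}|t_{2})\cdot
P(s_{3}|t_{3})\cdot P(t_{2}|t_{1})\cdot P(t_{3}|t_{2})
\end{eqnarray*}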
One may notice that this approximation is analogous to that for
statistical part-of-speech tagging, where $s_{i}$ and $t_{i}$ in
732: Equation~\eq{eq:approx} correspond to a word and one of its
733: part-of-speech candidates, respectively~\cite{church:cl-93}. Here, we
734: estimate $P(t_{i+1}|t_{i})$ using the word-based bi-gram statistics
735: extracted from target language documents (i.e., the collocation in
736: Figure~\ref{fig:system}). Before elaborating on the estimation of
737: $P(s_{i}|t_{i})$ we explain the way to produce our bilingual
738: dictionary for base words, because $P(s_{i}|t_{i})$ is estimated using
739: this dictionary.
740:
741: For our dictionary production, we used the EDR technical terminology
742: dictionary~\cite{edr-techdic:95}, which includes approximately 120,000
743: Japanese-English translations related to the information processing
744: field. Since most of the entries are compound words, we need to
745: segment Japanese compound words, and correlate Japanese-English
746: translations on a word-by-word basis. However, the complexity of
747: segmenting Japanese words becomes much greater as the number of
748: component base words increases. In consideration of these factors, we
749: first extracted 59,533 English words consisting of only {\em two\/}
750: base words, and their Japanese translations. We then developed simple
751: heuristics to segment Japanese compound words into two substrings. Our
752: heuristics relies mainly on Japanese character types, i.e., ``{\it
753: kanji},'' ``{\it katakana},'' ``{\it hiragana},'' alphabets and other
characters like numerals. Note that {\it kanji\/} (or Chinese
characters) are the Japanese ideograms, and {\it katakana\/} and {\it
hiragana\/} are phonograms.
757:
758: In brief, we segment each Japanese word at the boundary of different
759: character types (or at the leftmost boundary for words containing more
760: than one character type boundary). Although this method is relatively
simple, a preliminary study showed that we can correctly segment almost
all words that are in one of the following forms: ``{\tt CK},''
763: ``{\tt CA},'' ``{\tt AK}'' and ``{\tt KA\/}.'' Here, ``{\tt C},''
764: ``{\tt K\/}'' and ``{\tt A}'' denote {\it kanji}, {\it katakana\/} and
765: alphabet character sequences, respectively. For other combinations of
character types, we identified one or more cases in which our
segmentation method performed incorrectly.
768:
769: On the other hand, in the case where a given Japanese word consists of
770: a single character type, we segment the word at the middle (or at the
771: left-side of the middle character for words consisting of an odd
772: number of characters). Note that roughly 90\% of Japanese words
773: consisting of four {\it kanji\/} characters can be correctly segmented
774: at the middle~\cite{kobayashi:coling-94}. However, in the case where
775: resultant substrings begin/end with characters that do not appear at
776: the beginning/end of words (for example, Japanese words rarely begin
777: with a long vowel), we shift the segmentation position to the right.
778:
779: Tsuji and Kageura~\nocite{tsuji:nlprs-97} used the HMM to segment
780: Japanese compound words in an English-Japanese bilingual
781: dictionary. Their method can also segment words consisting of more
782: than two base words, and reportedly achieved an accuracy of roughly
783: 80-90\%, whereas our segmentation method is applicable only to those
784: consisting of two base words. However, while the HMM-based
785: segmentation is expected to improve the quality of our dictionary
786: production, in this paper we tentatively show that our
787: heuristics-based method is effective for CLIR despite its simple
788: implementation, by way of experiments (see
789: Section~\ref{sec:evaluation}).
790:
791: As a result, we obtained 24,439 Japanese and 7,910 English base words.
792: We randomly sampled 600 compound words, and confirmed that 95\% of
793: those words were correctly segmented.
794: Figure~\ref{fig:compound_word_dictionary} shows a fragment of the EDR
795: dictionary (after segmenting Japanese words), and
796: Figure~\ref{fig:base_word_dictionary} shows a base word dictionary
797: produced from entries in Figure~\ref{fig:compound_word_dictionary}.
798: Figure~\ref{fig:base_word_dictionary} contains Japanese variants, such
799: as {\it memori\/}/{\it memorii\/} for the English word ``memory.'' We
800: can easily produce a Japanese-English base word dictionary from
801: Figure~\ref{fig:compound_word_dictionary}, using the same procedure.
802:
803: During the dictionary production, we also count the correspondence
804: frequency for each combination of $s_{i}$ and $t_{i}$, in order to
805: estimate $P(s_{i}|t_{i})$. In Figure~\ref{fig:base_word_dictionary},
806: for example, the Japanese base word ``{\it soukan\/}'' corresponds
807: once to ``associative,'' and twice to ``correlation.'' Thus, we can
808: derive Equation~\eq{eq:soukan}.
809: \begin{equation}
810: \label{eq:soukan}
811: \begin{array}{lll}
812: P(\mbox{associative}\:|\:{\it soukan}) & = & 1/3 \\
813: \noalign{\vskip 0.6ex}
814: P(\mbox{correlation}\:|\:{\it soukan}) & = & 2/3
815: \end{array}
816: \end{equation}
817: However, in the case where $s_{i}$ is {\em transliterated\/} into
818: $t_{i}$, we replace $P(s_{i}|t_{i})$ with a probabilistic score
819: computed by our transliteration method (see
820: Section~\ref{subsec:translit}).
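
The counting behind Equation~\eq{eq:soukan} can be reproduced with a
short Python sketch over the entries of
Figure~\ref{fig:compound_word_dictionary}. The function names are
hypothetical, and the sketch assumes that every entry is already
segmented and that base words correspond position by position.

```python
from collections import Counter
from fractions import Fraction

# Entries of the compound word dictionary fragment (English, romanized Japanese).
ENTRIES = [
    ("CCD memory", "CCD memorii"),
    ("IC memory", "IC memori"),
    ("associative learning", "soukan gakushuu"),
    ("associative memory", "rensou memori"),
    ("associative record", "ketsugou rekoodo"),
    ("correlation function", "soukan kansuu"),
    ("error detection", "ayamari kenshutsu"),
    ("factor correlation", "inshi soukan"),
    ("hybrid IC", "haiburiddo shuusekikairo"),
]

def count_pairs(entries):
    """Count how often each (English, Japanese) base word pair corresponds."""
    counts = Counter()
    for english, japanese in entries:
        for e, j in zip(english.split(), japanese.split()):
            counts[(e, j)] += 1
    return counts

def cond_prob(counts, e, j):
    """Relative frequency of English word e among all words aligned with j."""
    total = sum(c for (_, jj), c in counts.items() if jj == j)
    return Fraction(counts[(e, j)], total)

COUNTS = count_pairs(ENTRIES)
```

Here ``cond\_prob(COUNTS, "associative", "soukan")'' evaluates to $1/3$
and the corresponding call for ``correlation'' to $2/3$.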
821:
822: One may argue that $P(s_{i}|t_{i})$ should be estimated based on real
823: world usage, i.e., bilingual corpora. However, such resources are
824: generally expensive to obtain, and we do not have Japanese-English
825: corpora with sufficient volume of alignment information at present
826: (see Section~\ref{subsec:our_approach} for more discussion).
827:
828: \begin{figure}[htbp]
829: \def\baselinestretch{1}
830: \begin{center}
831: \leavevmode
832: \small
833: \begin{tabular}[t]{ll} \hline\hline
834: {\hfill\centering English\hfill} & {\hfill\centering
835: Japanese\hfill} \\ \hline
836: CCD memory & CCD {\it memorii\/} \\
837: IC memory & IC {\it memori\/} \\
838: associative learning & {\it soukan gakushuu\/} \\
839: associative memory & {\it rensou memori\/} \\
840: associative record & {\it ketsugou rekoodo\/} \\
841: correlation function & {\it soukan kansuu\/} \\
842: error detection & {\it ayamari kenshutsu\/} \\
843: factor correlation & {\it inshi soukan\/} \\
844: hybrid IC & {\it haiburiddo shuusekikairo\/} \\ \hline
845: \end{tabular}
846: \end{center}
847: \caption{A fragment of the EDR technical terminology dictionary}
848: \label{fig:compound_word_dictionary}
849: \end{figure}
850:
851: \subsection{Transliteration}
852: \label{subsec:translit}
853:
854: This section explains our transliteration method, which identifies
855: phonetic equivalent translations for words unlisted in the base word
856: dictionary.
857:
858: Figure~\ref{fig:katakana} shows example correspondences between
859: English and (romanized) {\it katakana\/} words, where we insert
860: hyphens between each {\it katakana\/} character for enhanced
861: readability. The basis of our transliteration method is analogous to
862: that for compound word translation described in
Section~\ref{subsec:cwt}. The source word $S$ and one
transliteration candidate $T$ are represented as below.
865: \begin{eqnarray*}
866: S & = & s_{1}, s_{2}, \ldots, s_{n} \\
867: T & = & t_{1}, t_{2}, \ldots, t_{n}
868: \end{eqnarray*}
Here, unlike the case of compound word translation, $s_{i}$ and
$t_{i}$ denote the $i$-th ``symbols'' (each consisting of one or more
letters) of $S$ and $T$, respectively. To derive possible $s_{i}$'s and $t_{i}$'s, we
consider all possible segmentations of the source word $S$, by
consulting a dictionary for symbols, namely the ``transliteration
dictionary.'' Then, we select the segmentations that consist of the
minimal number of symbols. Note that unlike the case of compound word
876: translation, the segmentation is performed for both Japanese-English
877: and English-Japanese transliterations.
878:
879: \begin{figure}[htbp]
880: \def\baselinestretch{1}
881: \begin{center}
882: \leavevmode
883: \small
884: \begin{tabular}[t]{ll} \hline\hline
885: {\hfill\centering English\hfill} & {\hfill\centering
886: Japanese\hfill} \\ \hline
887: CCD & CCD \\
888: IC & IC, {\it shuusekikairo\/} \\
889: associative & {\it soukan}, {\it rensou}, {\it ketsugou\/} \\
890: correlation & {\it soukan\/} \\
891: detection & {\it kenshutsu\/} \\
892: error & {\it ayamari\/} \\
893: factor & {\it inshi\/} \\
894: function & {\it kansuu\/} \\
895: hybrid & {\it haiburiddo\/} \\
896: learning & {\it gakushuu\/} \\
897: memory & {\it memori}, {\it memorii\/} \\
898: record & {\it rekoodo\/} \\ \hline
899: \end{tabular}
900: \end{center}
901: \caption{A fragment of an English-Japanese base word dictionary
902: produced from Figure~\protect\ref{fig:compound_word_dictionary}}
903: \label{fig:base_word_dictionary}
904: \end{figure}
905:
906: \begin{figure}[htbp]
907: \def\baselinestretch{1}
908: \begin{center}
909: \leavevmode
910: \small
911: \begin{tabular}{ll} \hline\hline
912: {\hfill\centering English \hfill} & {\hfill\centering Japanese
913: \hfill} \\ \hline
914: system & {\it shi-su-te-mu\/} \\
915: mining & {\it ma-i-ni-n-gu\/} \\
916: data & {\it dee-ta\/} \\
917: network & {\it ne-tto-waa-ku\/} \\
918: text & {\it te-ki-su-to\/} \\
919: collocation & {\it ko-ro-ke-i-sho-n\/} \\ \hline
920: \end{tabular}
921: \caption{Example correspondences between English and (romanized)
922: Japanese {\it katakana\/} words}
923: \label{fig:katakana}
924: \end{center}
925: \end{figure}
926:
Thereafter, we resolve the transliteration ambiguity based on a
probabilistic model similar to that for the compound word translation.
929: To put it more precisely, we compute $P(T|S)$ for each $T$ using
930: Equation~\eq{eq:trans_model}, and select $T$'s with greater
931: probabilities. Note that $T$'s must be correct words (that are indexed
in the NACSIS document collection). However, Equation~\eq{eq:approx},
which approximates $P(T)$ by combining $P(t_{i+1}|t_{i})$'s for substrings of
$T$, potentially assigns positive probabilities to incorrect
(unindexed) words.
936:
937: In view of this problem, we estimate $P(T)$ as the probability that
938: $T$ occurs in the document collection, and consequently the
probability for unindexed words becomes zero. In practice, during the
segmentation process we simply discard $T$'s that are unindexed
in the document collection, which reduces the cost of computing
$P(T|S)$'s. On the other hand, we approximate $P(S|T)$ as in
943: Equation~\eq{eq:approx}, and estimate $P(s_{i}|t_{i})$ based on the
944: correspondence frequency for each combination of $s_{i}$ and $t_{i}$
945: in the transliteration dictionary.
946:
The crucial issue here is how to produce the transliteration
dictionary, because such dictionaries have rarely been published. For
949: the purpose of dictionary production, we used approximately 35,000
950: {\it katakana\/} Japanese words and their English translations
951: collected from the EDR technical terminology
952: dictionary~\cite{edr-techdic:95} and bilingual
953: dictionary~\cite{edr-bilindic:95}. To illustrate our dictionary
954: production method, we consider Figure~\ref{fig:katakana}
955: again. Looking at this figure, one may notice that the first letter in
956: each {\it katakana\/} character tends to be contained in its
957: corresponding English word. However, there are a few exceptions. A
958: typical case is that since Japanese has no distinction between ``L''
959: and ``R'' sounds, the two English sounds collapse into the same
960: Japanese sound. In addition, a single English letter may correspond to
961: multiple {\it katakana\/} characters, such as ``x'' to ``{\it
962: ki-su\/}'' in \mbox{``$<$text, {\it te-ki-su-to\/}$>$.''} To sum up,
963: English and romanized {\it katakana\/} words are not exactly
964: identical, but similar to each other.
965:
966: We first manually defined the similarity between the English letter
967: $e$ and the first romanized letter for each {\it katakana\/} character
968: $j$, as shown in Table~\ref{tab:katakana}. In this table,
969: ``phonetically similar'' letters refer to a certain pair of letters,
970: such as ``L'' and ``R,'' for which we identified approximately twenty
971: pairs of letters. We then consider the similarity for any possible
972: combination of letters in English and romanized {\it katakana\/}
973: words, which can be represented as a matrix, as shown in
974: Figure~\ref{fig:matrix}. This figure shows the similarity between
975: letters in \mbox{``$<$text, {\it te-ki-su-to\/}$>$.''} We put a dummy
976: letter ``\$,'' which has a positive similarity only to itself, at the
977: end of both English and {\it katakana\/} words.
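
The construction of such a matrix can be sketched in Python as below.
The sketch covers only the similarity values of
Table~\ref{tab:katakana} and not the path search. It is an
illustration under assumptions: only the ``L''/``R'' pair is
registered as phonetically similar, the dummy letter is scored as an
identical match, and the function names are hypothetical.

```python
VOWELS = set("aeiou")
# Only the L/R pair is registered here; the remaining pairs are left out.
SIMILAR = {frozenset(("l", "r"))}

def similarity(e, j):
    """Similarity between an English letter and a romanized katakana letter."""
    if e == j:
        return 3  # identical (assumed to include the dummy letter "$")
    if frozenset((e, j)) in SIMILAR:
        return 2  # phonetically similar
    if e.isalpha() and j.isalpha() and ((e in VOWELS) == (j in VOWELS)):
        return 1  # both vowels or both consonants
    return 0

def similarity_matrix(english, katakana):
    """Rows: English letters plus "$"; columns: the first romanized letter
    of each katakana character plus "$"."""
    rows = list(english) + ["$"]
    cols = [k[0] for k in katakana] + ["$"]
    return [[similarity(e, j) for j in cols] for e in rows]

# Matrix for <text, te-ki-su-to>.
MATRIX = similarity_matrix("text", ["te", "ki", "su", "to"])
```

With a fuller list of phonetically similar pairs, some entries of the
matrix would rise from 1 to 2.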
978:
979: One may notice that matching plausible symbols can be seen as finding
980: the path which maximizes the total similarity from the first to last
981: letters. The best path can efficiently be found by, for example,
982: Dijkstra's algorithm~\cite{dijkstra:nm-59}. From
983: Figure~\ref{fig:matrix}, we can derive the following correspondences:
984: \mbox{``$<$te, {\it te\/}$>$,''} \mbox{``$<$x, {\it ki-su\/}$>$''} and
985: \mbox{``$<$t, {\it to\/}$>$.''} In practice, to exclude noisy
986: correspondences, we used only English-Japanese translations whose
987: total similarity from the first to last letters is above a predefined
threshold. The resultant transliteration dictionary contains 432
Japanese and 1,018 English symbols, from which we estimated
990: $P(s_{i}|t_{i})$.
991:
992: \begin{table}[htbp]
993: \def\baselinestretch{1}
994: \begin{center}
995: \caption{The similarity between English letter $e$ and Japanese
996: letter $j$}
997: \medskip \leavevmode \small
998: \begin{tabular}{lc} \hline\hline
999: {\hfill\centering Condition \hfill} & {\hfill\centering
1000: Similarity \hfill} \\ \hline
1001: $e$ and $j$ are identical & 3 \\
1002: $e$ and $j$ are phonetically similar & 2 \\
$e$ and $j$ are both vowels or both consonants & 1 \\
1004: otherwise & 0 \\ \hline
1005: \end{tabular}
1006: \label{tab:katakana}
1007: \end{center}
1008: \end{table}
1009:
1010: \begin{figure}[htbp]
1011: \begin{center}
1012: \leavevmode
1013: \psfig{file=matrix.eps,height=2in}
1014: \end{center}
1015: \caption{An example matrix for English-Japanese symbol matching
1016: (arrows denote the best path)}
1017: \label{fig:matrix}
1018: \end{figure}
1019:
1020: To evaluate our transliteration method, we extracted Japanese {\it
1021: katakana\/} words (excluding compound words) and their English
1022: translations from an English-Japanese
dictionary~\cite{nichigai_compdic:96}. We then discarded
Japanese/English pairs that were not phonetically equivalent to each
other, as well as pairs that were listed in the EDR dictionaries. For the resultant 248
1026: pairs, the accuracy of our transliteration method was 65.3\%.
1027:
1028: Thus, our transliteration method is less accurate than the word-based
1029: translation. For example, the {\it katakana\/} word ``{\it
1030: re-ji-su-ta}~(register/resistor)'' is transliterated into
1031: ``resister,'' ``resistor'' and ``register,'' with the probability
1032: score in descending order. Note that Japanese seldom represents
1033: ``resister'' as ``{\it re-ji-su-ta\/}'' (whereas it can be
1034: theoretically correct when this word is written in {\it katakana\/}
1035: characters), because ``resister'' corresponds to more appropriate
1036: translations in {\it kanji\/} characters. However, the compound word
1037: translation is expected to select appropriate transliteration
candidates. For example, ``{\it re-ji-su-ta\/}'' in the compound word ``{\it
1039: re-ji-su-ta\/} {\it tensou\/} {\it gengo\/}~(register transfer
1040: language)'' is successfully translated, given a set of base words
1041: ``{\it tensou\/}~(transfer)'' and ``{\it gengo\/}~(language)'' as a
1042: context.
1043:
Finally, we devote a little more space to comparing our transliteration
method with related work.
Chen~\etal~\nocite{chen:coling-acl-98} proposed a Chinese-English
transliteration method. Given a (romanized) source word, their method
computes the similarity between the source word and each target word
listed in the dictionary. In brief, the more letters two words share
in common, the more similar they are. In other words, unlike our case,
their method disregards the order of letters in source and target
words, which potentially degrades the transliteration accuracy. In
addition, since for each source word the similarity is computed
against all the target words (or words that share at least one common
letter with the source word), the similarity computation can be
prohibitively expensive. Lee and Choi~\nocite{lee:iral-97} explored English-Korean
1057: transliteration, where they automatically produced a transliteration
1058: model from a word-aligned corpus. In brief, they first consider all
1059: possible English-Korean symbol correspondences for each word
1060: alignment. Then, iterative estimation is performed to select such
1061: symbol correspondences that maximize transliteration accuracy on
1062: training data. However, when compared with our symbol alignment
1063: method, their iterative estimation method is computationally
1064: expensive. Knight and Graehl~\nocite{knight:cl-98} proposed a
1065: Japanese-English transliteration method based on the mapping
1066: probability between English and Japanese {\it katakana\/}
1067: sounds. However, while their method needs a large-scale phoneme
1068: inventory, we use a simpler approach using surface mapping between
1069: English and {\it katakana\/} characters, as defined in our
transliteration dictionary. Note that none of the above methods has
1071: been evaluated in the context of CLIR. Empirical comparison of
1072: different transliteration methods needs to be further explored.
1073:
1074: \subsection{Further Enhancement of Translation}
1075: \label{subsec:dictionary_enhancement}
1076:
1077: This section explains two additional methods to enhance the query
1078: translation.
1079:
1080: First, we can enhance our base word dictionary with {\em general\/}
1081: words, because technical compound words sometimes include general
1082: words, as discussed in Section~\ref{sec:introduction}. Note that in
1083: Section~\ref{subsec:cwt} we produced our base word dictionary from the
1084: EDR {\em technical\/} terminology dictionary. Thus, we used the EDR
1085: bilingual dictionary~\cite{edr-bilindic:95}, which consists of
1086: approximately 370,000 Japanese-English translations aimed at general
1087: usage. However, unlike in the case of technical terms, it is not
1088: feasible to segment general compound words, such as ``hot dog,'' into
1089: base words. Thus, we simply extracted 162,751 Japanese and 67,136
1090: English single words (i.e., words that consist of a single base word)
1091: from this dictionary. In addition, to minimize the degree of
1092: translation ambiguity, we use general translations only when (a) base
1093: words unlisted in our technical term dictionary are found, and (b) our
1094: transliteration method fails to output any candidates for those
1095: unlisted base words.
1096:
1097: Second, in Section~\ref{sec:introduction} we also identified that
1098: English technical terms are often abbreviated, such as ``IR'' and
1099: ``NLP,'' and they can be used as Japanese words. One solution would be
1100: to output those abbreviated words as they are, for both
1101: Japanese-English and English-Japanese translations. On the other hand,
1102: it is expected that we can improve the recall by using complete forms
1103: along with their abbreviated forms. To realize this notion, we
1104: extracted 7,307 tuples of each abbreviation and its complete form from
1105: the NACSIS English document collection, using simple heuristics. Our
1106: heuristics relies on the assumption that either abbreviations or
1107: complete forms often appear in parentheses headed by their
1108: counterparts, as shown below:
1109: \begin{quote}
1110: Natural Language Processing (NLP), \\
1111: cross-language information retrieval (CLIR), \\
1112: MRDs (machine readable dictionaries).
1113: \end{quote}
1114: While the first example is the most straightforward, in the second and
1115: third examples we disregard a hyphen and lowercase letter (i.e., ``s''
in ``MRDs''), respectively. In practice, we can easily extract such
tuples using regular expression pattern matching.
Figure~\ref{fig:abbreviation} shows example tuples of abbreviations
and complete forms extracted from the NACSIS collection. In this
figure, the column ``Frequency'' denotes the number of times each tuple
appears in the collection, with which we can optionally set a cut-off
1122: threshold for multiple complete forms corresponding to a single
1123: abbreviation (e.g., ``information retrieval,'' ``isoprene rubber'' and
1124: ``insulin receptor'' for ``IR'').
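
One possible realization of these heuristics is sketched below in
Python. It is not the exact pattern used in our experiments: the
regular expressions, the rule that the initials of the complete form
must spell the abbreviation, and the function names are assumptions
made for this illustration.

```python
import re

def initials(phrase):
    """Upper-cased first letters of the words; hyphens count as word breaks."""
    return "".join(w[0].upper() for w in re.split(r"[\s-]+", phrase.strip()) if w)

def extract_pairs(text):
    """Return (abbreviation, complete form) pairs found in `text`."""
    pairs = []
    # Pattern 1: "complete form (ABBR)"; a trailing lowercase "s" is ignored.
    for m in re.finditer(r"\(([A-Z]{2,})s?\)", text):
        abbr = m.group(1)
        # Take as many preceding words as the abbreviation has letters.
        tail = re.search(
            r"((?:[A-Za-z]+[\s-]+){%d}[A-Za-z]+)\s*$" % (len(abbr) - 1),
            text[:m.start()])
        if tail and initials(tail.group(1)) == abbr:
            pairs.append((abbr, tail.group(1).lower()))
    # Pattern 2: "ABBR (complete form)".
    for m in re.finditer(r"\b([A-Z]{2,})s?\s*\(([A-Za-z][A-Za-z\s-]*)\)", text):
        abbr, full = m.group(1), m.group(2)
        if initials(full) == abbr:
            pairs.append((abbr, full.lower()))
    return pairs

SAMPLE = ("Natural Language Processing (NLP), "
          "cross-language information retrieval (CLIR), "
          "MRDs (machine readable dictionaries).")
```

Counting how often each extracted pair occurs over the whole
collection then yields the frequencies used for the cut-off threshold.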
1125:
1126: \begin{figure}[htbp]
1127: \def\baselinestretch{1}
1128: \begin{center}
1129: \leavevmode
1130: \small
1131: \begin{tabular}[t]{llc} \hline\hline
1132: {\hfill\centering Abbreviation\hfill} & {\hfill\centering
1133: Complete form\hfill} & {\hfill\centering Frequency\hfill} \\ \hline
1134: IR & information retrieval & 3 \\
1135: IR & isoprene rubber & 1 \\
1136: IR & insulin receptor & 1 \\
1137: MT & machine translation & 11 \\
1138: MT & mobile telephone & 3 \\
1139: NLP & natural language processing & 8 \\ \hline
1140: \end{tabular}
1141: \end{center}
1142: \caption{Example abbreviations and their complete forms}
1143: \label{fig:abbreviation}
1144: \end{figure}
1145:
1146: \section{Evaluation}
1147: \label{sec:evaluation}
1148:
1149: \subsection{Methodology}
1150: \label{subsec:eval_overview}
1151:
1152: We investigated the performance of our system in terms of
1153: Japanese-English CLIR, based on the TREC-type evaluation methodology.
That is, the system outputs the top 1,000 documents, and the TREC
1155: evaluation software was used to plot recall-precision curves and
1156: calculate non-interpolated average precision values.
1157:
1158: For the purpose of our evaluation, we used a preliminary version of
1159: the NACSIS test collection~\cite{kando:sigir-99}. This collection
includes approximately 330,000 documents (written in English, in
Japanese, or in both languages),
collected from technical papers published by 65 Japanese associations
1163: for various fields.\footnote{The official version of the NACSIS
1164: collection includes 39 Japanese queries and the same document set as
1165: in the preliminary version we used. NACSIS (National Center for
1166: Science Information Systems, Japan) held a TREC-type (CL)IR contest
1167: workshop in August 1999, and participants, including the authors of
1168: this paper, were provided with the whole document set and 21 queries
1169: for training. These 21 queries are included in the final package of
1170: the test collection. See {\tt
1171: http://www.rd.nacsis.ac.jp/\~{}ntcadm/workshop/work-en.html} for
1172: details.} Each document consists of the document ID, title, name(s) of
1173: author(s), name/date of conference, hosting organization, abstract and
1174: keywords, from which we used titles, abstracts and keywords for the
1175: indexing. We used as target documents approximately 187,000 entries
1176: where abstracts are in both English and Japanese.
1177:
1178: This collection also includes 21 Japanese queries. Each query
1179: consists of the query ID, title of the topic, description, narrative
1180: and list of synonyms, from which we used only the
1181: description.\footnote{In the NACSIS workshop, each participant can
1182: submit more than one retrieval result using different
systems. However, at least one result must be obtained using only the
1184: description field.} In general, most topics are related to electronic,
1185: information and control engineering. Figure~\ref{fig:query} shows
1186: example descriptions (translated into English by one of the authors).
1187:
1188: In the NACSIS collection, relevance assessment was performed based on
1189: the pooling method~\cite{voorhees:sigir-98}. That is, candidates for
1190: relevant documents were first obtained with multiple retrieval
1191: systems. Thereafter, for each candidate document, human expert(s)
1192: assigned one of three ranks of relevance, i.e., ``relevant,''
1193: ``partially relevant'' and \mbox{``irrelevant.''} The average number
of candidate documents for each query is 4,400, among which the numbers
of relevant and partially relevant documents are 144 and 13,
1196: respectively. In our evaluation, we did not regard partially relevant
1197: documents as relevant ones, because (a) the result did not
1198: significantly change depending on whether we regarded partially
1199: relevant as relevant or not, and (b) interpretation of partially
1200: relevant is not fully clear to the authors.
1201:
1202: Since the NACSIS collection does not contain English queries, we
1203: cannot estimate a baseline for Japanese-English CLIR performance based
1204: on English-English IR. Instead, we used a Japanese-Japanese IR system,
1205: which uses as documents Japanese titles/abstracts/keywords comparable
1206: to English fields in the NACSIS collection. One may argue that we can
1207: manually translate Japanese queries into English. However, as
1208: discussed in Section~\ref{subsec:evaluation_methods}, the CLIR
1209: performance varies depending on the quality of translation, and thus
1210: we avoided an arbitrary evaluation.
1211:
1212: \begin{figure}[htbp]
1213: \def\baselinestretch{1}
1214: \begin{center}
1215: \leavevmode
1216: \small
1217: \begin{tabular}{cl} \hline\hline
1218: ID & {\hfill\centering Description\hfill} \\ \hline
1219: 0005 & dimension reduction for clustering \\
1220: 0006 & intelligent information retrieval using agent functions \\
1221: 0019 & syntactic analysis methods for Japanese sentences \\
1222: 0024 & machine translation systems \\ \hline
1223: \end{tabular}
1224: \caption{Example query descriptions in the NACSIS collection}
1225: \label{fig:query}
1226: \end{center}
1227: \end{figure}
1228:
1229: \subsection{Quantitative Comparison}
1230: \label{subsec:quantitative}
1231:
1232: We compared the following query translation methods:
1233: \begin{itemize}
1234: \item all possible translations derived from the (original) EDR
1235: technical terminology dictionary~\cite{edr-techdic:95} are used for
1236: query terms, which can be seen as a lower bound method of this
1237: comparative experiment (``EDR''),
1238: \item all possible base word translations derived from our base word
1239: dictionary are used (``ALL''),
1240: \item $k$-best translations selected by our compound word translation
1241: method are used, where transliteration is not used (``CWT''),
1242: \item transliteration is performed for unlisted {\it katakana\/} words
1243: in CWT above, which represents the overall query
1244: translation method we proposed in this paper (``TRL'').
1245: \end{itemize}
1246: One may notice that both EDR and ALL correspond to the
1247: dictionary-based method, and CWT and TRL correspond to the
1248: hybrid method described in Section~\ref{subsec:retrieval_methods}. In
1249: the case of EDR, compound words unlisted in the EDR dictionary
1250: were manually segmented so that substrings (shorter compound words or
1251: base words) could be translated. There was almost no translation
1252: ambiguity in the case of EDR. In addition, preliminary experiments
1253: showed that disambiguation degraded the retrieval performance for
EDR. In CWT and TRL, $k$ is a parametric constant, which we set to
\mbox{$k=1$} because this value achieved the best performance in our
preliminary experiments. By increasing the value of $k$, we
theoretically gain a query expansion effect, because multiple
semantically related translations are used as query
1259: terms. However, in our case, additional translations were rather noisy
1260: with respect to the retrieval performance. Note that in this
experiment, we did not use the general and abbreviation dictionaries.
1262: We will discuss the effect of those dictionaries in
1263: Section~\ref{subsec:dictionary_enhancement}.
1264:
1265: Table~\ref{tab:avg_pre} shows the non-interpolated average precision
1266: values, averaged over the 21 queries, for different combinations of
1267: query translation and retrieval methods. It is worth comparing the
1268: effectiveness of query translation methods with different retrieval
1269: methods, because advanced retrieval methods potentially overcome the
1270: rudimentary nature of query translation methods, and therefore may
1271: overshadow the difference of query translation methods in CLIR
1272: performance. In consideration of this problem, as described in
1273: Section~\ref{sec:system_overview}, we adopted two alternative term
1274: weighting methods, i.e., the standard and logarithmic formulations. In
1275: addition, we used as the IR engine in Figure~\ref{fig:system} the
1276: SMART system~\cite{salton:71}, where the augmented TF$\cdot$IDF term
1277: weighting method (``ATC'') was used for both queries and
1278: documents. This makes it easy for other researchers to rigorously
1279: compare their query translation methods with ours within the same
1280: evaluation environment, because the SMART system is available to the
1281: public.
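
As a point of reference, the ATC weighting is commonly described as
the combination of an augmented term frequency, an inverse document
frequency and cosine normalization. The following is a sketch under
that assumption; the symbols are illustrative, and the exact variant
implemented in a given SMART release may differ:
\[
w_{t,d} = \frac{a_{t,d}\cdot\log(N/n_t)}
{\sqrt{\sum_{t'}\bigl(a_{t',d}\cdot\log(N/n_{t'})\bigr)^2}}, \qquad
a_{t,d} = 0.5 + 0.5\cdot\frac{f_{t,d}}{\max_{t'}f_{t',d}},
\]
where $f_{t,d}$ is the frequency of term $t$ in document (or query)
$d$, $N$ is the number of documents in the collection, and $n_t$ is
the number of documents containing $t$.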
1282:
1283: In Table~\ref{tab:avg_pre}, J-J refers to the baseline performance,
1284: that is, the result obtained by the Japanese-Japanese IR system. Note
1285: that the performance of J-J using the SMART system is not available
1286: because this system is not implemented for the retrieval of Japanese
1287: documents. The column ``\# of Terms'' denotes the average number of
1288: query terms used for the retrieval, where the number of terms used in
1289: ALL was approximately seven times as great as those of other
methods. The following observations can be derived from these results.
1291:
1292: \begin{table}[htbp]
1293: \def\baselinestretch{1}
1294: \begin{center}
1295: \caption{Non-interpolated average precision values,
1296: averaged over the 21 queries, for different combinations of query
1297: translation and retrieval methods}
1298: \medskip
1299: \leavevmode
1300: \small
1301: \begin{tabular}{lccccc} \hline\hline
1302: & & \multicolumn{3}{c}{Retrieval Method} \\ \cline{3-5}
1303: & \# of Terms & Standard TF & Logarithmic TF & SMART \\ \hline
1304: J-J & 4.0 & 0.2085 & 0.2443 & --- \\
1305: TRL & 4.0 & 0.2427 & 0.2911 & 0.3147 \\
1306: CWT & 3.9 & 0.2324 & 0.2680 & 0.2770 \\
1307: ALL & 21 & 0.1971 & 0.2271 & 0.2106 \\
1308: EDR & 4.1 & 0.1785 & 0.2173 & 0.2477 \\ \hline
1309: \end{tabular}
1310: \label{tab:avg_pre}
1311: \end{center}
1312: \end{table}
1313:
1314: First, the relative superiority between EDR and ALL varies
1315: depending on the retrieval method. Since neither case resolved the
1316: translation ambiguity, the difference in performance for the two
1317: translation methods is reduced solely to the difference between the
1318: two dictionaries. Therefore, the base word dictionary we produced was
1319: effective when combined with the standard and logarithmic TF
1320: formulations. However, the translation disambiguation as performed in
1321: CWT improved the performance of ALL, and consequently CWT
1322: outperformed EDR irrespective of the retrieval method. To sum up,
1323: our compound word translation method was more effective than the use
1324: of an existing dictionary, in terms of CLIR performance.
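
To give a sense of the size of this difference, the relative
improvement of CWT over EDR can be computed directly from
Table~\ref{tab:avg_pre}. The percentages below are derived from the
table values and are rounded:
\[
\frac{0.2324-0.1785}{0.1785}\approx 30.2\%, \qquad
\frac{0.2680-0.2173}{0.2173}\approx 23.3\%, \qquad
\frac{0.2770-0.2477}{0.2477}\approx 11.8\%
\]
for the standard TF, the logarithmic TF and the SMART system,
respectively.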
1325:
1326: Second, by comparing results of CWT and TRL, one can see that
1327: our transliteration method further improved the performance of the
1328: compound word translation relying solely on the base word dictionary,
1329: irrespective of the retrieval method. Since TRL represents the
1330: overall performance of our system, it is worth comparing TRL and
1331: EDR (i.e., a lower bound method) more carefully. Thus, we used the
1332: paired t-test for statistical testing, which investigates whether the
1333: difference in performance is meaningful or simply due to
1334: chance~\cite{hull:sigir-93,keen:ipm-92}. We found that the average
1335: precision values of TRL and EDR are significantly different
1336: (at the 5\% level), for any of the three retrieval methods.
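
For concreteness, the paired t-test statistic can be sketched as
follows. The notation is introduced here for illustration: $x_i$ and
$y_i$ denote the average precision values of TRL and EDR for the
$i$-th query, and $n$ is the number of queries:
\[
t = \frac{\bar{d}}{s_d/\sqrt{n}}, \qquad
d_i = x_i - y_i, \qquad
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(d_i-\bar{d})^2}.
\]
With \mbox{$n=21$} queries, $t$ is compared against Student's
t-distribution with \mbox{$n-1=20$} degrees of freedom.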
1337:
1338: Third, the performance was generally improved as a more sophisticated
1339: retrieval method was used, for all of the translation methods
1340: excepting ALL. In other words, enhancements of the query
1341: translation and IR engine independently improved on the performance of
1342: our CLIR system. Note that the difference between the SMART system and
1343: the other two methods is due to more than one factor, including
1344: stemming and term weighting methods. This suggests that our system may
1345: achieve a higher performance using other advanced IR techniques.
1346:
Finally, TRL and CWT outperformed J-J for both of the retrieval
methods for which J-J results are available (i.e., the standard and
logarithmic TF). However, these differences are partially attributed
1349: to the different properties inherent in Japanese and English IR. For
1350: example, the performance of Japanese IR is more strongly dependent on
1351: the indexing method than English IR, since Japanese lacks lexical
1352: segmentation. This issue needs to be further explored.
1353:
Figures~\ref{fig:rp_raw_TF}--\ref{fig:rp_smart} show recall-precision
1355: curves of different query translation methods, for different retrieval
1356: methods, respectively. In these figures, while the superiority of EDR
1357: and ALL in terms of precision varies depending on the recall, one can
1358: see that CWT outperformed EDR and ALL, and that TRL outperformed CWT,
1359: regardless of the recall. In Figures~\ref{fig:rp_raw_TF} and
1360: \ref{fig:rp_log_TF}, J-J generally performed better at lower recall
while the four CLIR methods performed better at higher recall. As
1362: discussed above, possible rationales would include the difference
1363: between Japanese and English IR. To put it more precisely, in Japanese
1364: IR a word-based indexing method (as performed in our IR engine) fails
1365: to retrieve documents in which words are inappropriately segmented.
1366: In addition, the ChaSen morphological analyzer often incorrectly
1367: segments {\it katakana\/} words, which frequently appear in technical
1368: documents. Consequently this drawback leads to a poor recall in the
1369: case of J-J.
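
Recall-precision curves of this kind are conventionally drawn from
interpolated precision at the 11 standard recall levels
\mbox{$0.0, 0.1, \ldots, 1.0$}. Assuming that convention (the
notation below is introduced here for illustration), the value
plotted at recall level $\rho$ for a single query is
\[
P_{\rm interp}(\rho) = \max_{\rho' \geq \rho} P(\rho'),
\]
where $P(\rho')$ denotes the precision observed at recall $\rho'$ in
the ranked list for that query. The per-query values are then
averaged over the queries.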
1370:
1371: \begin{figure}[htbp]
1372: \begin{center}
1373: \leavevmode
1374: \psfig{file=rp-curve_raw_TF.ps,height=3.5in}
1375: \end{center}
1376: \caption{Recall-precision curves using the standard TF}
1377: \label{fig:rp_raw_TF}
1378: \end{figure}
1379:
1380: \begin{figure}[htbp]
1381: \begin{center}
1382: \leavevmode
1383: \psfig{file=rp-curve_log_TF.ps,height=3.5in}
1384: \end{center}
1385: \caption{Recall-precision curves using the logarithmic TF}
1386: \label{fig:rp_log_TF}
1387: \end{figure}
1388:
1389: \begin{figure}[htbp]
1390: \begin{center}
1391: \leavevmode
1392: \psfig{file=rp-curve_smart.ps,height=3.5in}
1393: \end{center}
1394: \caption{Recall-precision curves using the SMART system}
1395: \label{fig:rp_smart}
1396: \end{figure}
1397:
1398: \subsection{Query-by-query Analysis}
1399: \label{subsec:qbq_analysis}
1400:
In this section, we discuss the reasons why our translation method
was effective in terms of CLIR performance, through a query-by-query analysis.
1403:
First, we compared EDR and CWT (see
1405: Section~\ref{subsec:quantitative}), to investigate the effectiveness
1406: of our compound word translation method. For this purpose, we
1407: identified fragments of the NACSIS query that were correctly
1408: translated by CWT but not by EDR, as shown in
1409: Table~\ref{tab:avgpre_qbq_cwt}. In this table, where we insert hyphens
1410: between each Japanese base word for enhanced readability,
1411: Japanese/English words unlisted in the EDR technical terminology
1412: dictionary are underlined. Note that as mentioned in
1413: Section~\ref{subsec:quantitative}, in these cases translations for
1414: remaining base words were used as query terms. However, in the case of
1415: the query 0019, the EDR dictionary lists a phrase translation,
1416: i.e., ``{\it kakariuke-kaiseki\/}~(analysis of dependence relation),''
1417: and thus ``analysis,'' ``dependence'' and ``relation'' were used as
query terms (``of'' was discarded as a stopword). One can see that
CWT outperformed EDR in 13 of the 18 cases, the exceptions being the
five asterisked cases. Note that in the case of 0019, EDR
1421: conducted a phrase-based translation, while CWT conducted a
1422: word-based translation. The relative superiority between these two
1423: translation approaches varies depending on the retrieval method, and
1424: thus we cannot draw any conclusion regarding this point in this paper.
1425: In the case of the query 0006, although the translation in CWT
was linguistically correct, we found that the English phrase ``agent
1427: function'' is rarely used in documents associated with agent research,
1428: and that ``function'' ended up degrading the retrieval performance. In
1429: the case of the query 0020, ``loanword'' would be a more
1430: appropriate translation for ``{\it gairaigo\/}.'' However, even when
1431: we used ``loanword'' for the retrieval, instead of ``foreign'' and
1432: ``word,'' the performance of EDR did not change.
1433:
1434: \begin{table}[htbp]
1435: \def\baselinestretch{1}
1436: \begin{center}
1437: \caption{Query-by-query comparison between EDR and CWT}
1438: \medskip
1439: \leavevmode
1440: \footnotesize
1441: \tabcolsep=3pt
1442: \begin{tabular}{cllll} \hline\hline
1443: & & \multicolumn{3}{c}{Change in Average Precision (EDR
1444: $\rightarrow$ CWT)} \\ \cline{3-5}
1445: ID & {\hfill\centering Japanese (Translation in CWT)\hfill} &
1446: {\hfill\centering Standard TF\hfill} & {\hfill\centering
1447: Logarithmic TF\hfill} & {\hfill\centering SMART\hfill} \\ \hline
1448: 0001 & {\it $\underline{jiritsu}$-idou-robotto\/}
1449: ($\underline{\mbox{autonomous}}$ mobile robot) & 0.2325
1450: $\rightarrow$ 0.3667 & 0.2587 $\rightarrow$ 0.4058 & 0.2259
1451: $\rightarrow$ 0.3441 \\
1452: 0004 & {\it $\underline{bunsho}$-gazou-rikai\/}
1453: ($\underline{\mbox{document}}$ image understanding) & 0.0011
1454: $\rightarrow$ 0.2775 & 0.0091 $\rightarrow$ 0.3768 & 0.0217
1455: $\rightarrow$ 0.2740 \\
1456: 0006 & {\it eejento-$\underline{kinou}$\/} (agent
1457: $\underline{\mbox{function}}$) & 0.2008
1458: $\rightarrow$ 0.1603* & 0.2920 $\rightarrow$ 0.1997* & 0.1430
1459: $\rightarrow$ 0.1395* \\
1460: 0016 & {\it saidai-$\underline{kyoutsuu}$-bubungurafu\/}
1461: (greatest $\underline{\mbox{common}}$ subgraph) &
1462: 0.1615 $\rightarrow$ 0.5039 & 0.4661 $\rightarrow$ 0.6216 &
1463: 0.1295 $\rightarrow$ 0.4460 \\
1464: 0019 & {\it kakariuke-kaiseki\/} (dependency analysis) & 0.0794
1465: $\rightarrow$ 0.3550 & 0.1383 $\rightarrow$ 0.4302 & 0.1852
1466: $\rightarrow$ 0.1449* \\
1467: 0020 & {\it katakana-$\underline{\mbox{\it gairai-go\/}}$\/}
1468: (katakana $\underline{\mbox{foreign word}}$) & 0.4536
1469: $\rightarrow$ 0.4568 & 0.2408 $\rightarrow$ 0.4674 & 0.9429
1470: $\rightarrow$ 0.8769* \smallskip \\ \hline
1471: \end{tabular}
1472: \label{tab:avgpre_qbq_cwt}
1473: \end{center}
1474: \end{table}
1475:
1476: Second, we compared CWT and TRL in Table~\ref{tab:avgpre_qbq_trl},
1477: which uses the same basic notation as Table~\ref{tab:avgpre_qbq_cwt}.
1478: The NACSIS query set contains 20 {\it katakana\/} base word types,
1479: among which ``{\it ma-i-ni-n-gu\/}~(mining)'' and ``{\it
1480: ko-ro-ke-i-sho-n\/}~(collocation)'' were unlisted in our base word
1481: dictionary. Unlike the previous case, transliteration generally
improved the performance. On the other hand, we concede that three
queries alone are not enough to justify the effectiveness of our
1484: transliteration method. In view of this problem, we assumed that every
1485: {\it katakana\/} word in the query is unlisted in our base word
1486: dictionary, and compared the following two extreme cases:
1487: \begin{itemize}
1488: \item every {\it katakana\/} word was untranslated (i.e., they were
1489: simply discarded from queries), which can be seen as a lower bound
1490: method in this comparison,
1491: \item transliteration was applied to every {\it katakana\/} word,
1492: instead of consulting the base word dictionary.
1493: \end{itemize}
Both cases were combined with the CWT method described in
Section~\ref{subsec:quantitative}. Note that in the latter case, when
1496: a {\it katakana\/} word is included in a compound word,
1497: transliteration candidates of the word are disambiguated through the
1498: compound word translation method, and thus noisy candidates are
1499: potentially discarded. It should also be noted that in the case where
a compound word consists solely of {\it katakana\/} words (e.g., {\it
1501: deeta-mainingu\/}~(data mining)), our method automatically segments it
1502: into base words, by transliterating all the possible substrings.
1503:
1504: Table~\ref{tab:avg_pre_kana} shows the average precision values,
averaged over the 21 queries, for the two cases above. By comparing
1506: Tables~\ref{tab:avg_pre} and \ref{tab:avg_pre_kana}, one can see that
the performance was considerably degraded when we disregarded every {\it
katakana\/} word, and that even when we applied transliteration to
every {\it katakana\/} word, the performance was greater than that of CWT
1510: and was quite comparable to that of TRL. Among the 20 {\it
1511: katakana\/} base words, only ``{\it eejento\/}~(agent)'' was
1512: incorrectly transliterated into ``eagent,'' which was due to an
1513: insufficient volume of the transliteration dictionary.
1514:
1515: \begin{table}[htbp]
1516: \def\baselinestretch{1}
1517: \begin{center}
1518: \caption{Query-by-query comparison between CWT and TRL}
1519: \medskip
1520: \leavevmode
1521: \footnotesize
1522: \begin{tabular}{cllll} \hline\hline
1523: & & \multicolumn{3}{c}{Change in Average Precision (CWT
1524: $\rightarrow$ TRL)} \\ \cline{3-5}
1525: ID & Japanese (Translation in TRL) & {\hfill\centering Standard
1526: TF\hfill} & {\hfill\centering Logarithmic
1527: TF\hfill} & {\hfill\centering SMART\hfill} \\ \hline
1528: 0008 & {\it deeta-$\underline{mainingu}$\/} (data
1529: $\underline{\mbox{mining}}$) & 0.0018 $\rightarrow$ 0.0942 &
1530: 0.0299 $\rightarrow$ 0.3363 & 0.3156 $\rightarrow$ 0.7295 \\
1531: 0012 & {\it deeta-$\underline{mainingu}$\/} (data
1532: $\underline{\mbox{mining}}$) & 0.0018 $\rightarrow$ 0.1229 &
1533: 0.0003 $\rightarrow$ 0.1683 & 0.0000 $\rightarrow$ 0.0853 \\
0015 & {\it $\underline{korokeishon}$\/}
1535: ($\underline{\mbox{collocation}}$) & 0.0054 $\rightarrow$ 0.0084
1536: & 0.0389 $\rightarrow$ 0.0485 & 0.0193 $\rightarrow$
1537: 0.3114 \smallskip \\ \hline
1538: \end{tabular}
1539: \label{tab:avgpre_qbq_trl}
1540: \end{center}
1541: \end{table}
1542:
1543: \begin{table}[htbp]
1544: \def\baselinestretch{1}
1545: \begin{center}
1546: \caption{Non-interpolated average precision values,
1547: averaged over the 21 queries, for the evaluation of
1548: transliteration}
1549: \medskip
1550: \leavevmode
1551: \small
1552: \begin{tabular}{lccccc} \hline\hline
1553: & & \multicolumn{3}{c}{Retrieval Method} \\ \cline{3-5}
1554: & \# of Terms & Standard TF & Logarithmic TF & SMART \\ \hline
1555: discard every {\it katakana\/} word & 2.8 & 0.1519 & 0.1840 &
1556: 0.1873 \\
1557: transliterate every {\it katakana\/} word & 4.0 & 0.2354 & 0.2786 &
1558: 0.3024 \\ \hline
1559: \end{tabular}
1560: \label{tab:avg_pre_kana}
1561: \end{center}
1562: \end{table}
1563:
1564: Finally, we discuss the effect of additional dictionaries, i.e., the
1565: general and abbreviation dictionaries. The NACSIS query set contains
the general word ``{\it shimbun-kiji\/}~(newspaper article)'' and
1567: abbreviation ``LFG~(lexical functional grammar)'' unlisted in our
1568: technical base word dictionary. The abbreviation dictionary lists the
1569: correct translation for ``LFG.'' On the other hand, our general
1570: dictionary, which consists solely of single words, does not list the
1571: correct translation for ``{\it shimbun-kiji\/}.'' Instead, the English
1572: word ``story'' was listed as the translation, which would be used in a
1573: particular context. Table~\ref{tab:avgpre_qbq_additional}, where basic
1574: notation is the same as Table~\ref{tab:avgpre_qbq_cwt}, compares
1575: average precision values with/without these translations. From this
1576: table we cannot see any improvement with the additional
1577: dictionaries. However, when the correct translation was provided as in
0023 with ``newspaper article,'' the performance was improved
regardless of the retrieval method. In addition, since we found only
1580: two cases where additional dictionaries could be applied, this issue
1581: needs to be further explored using more test queries.
1582:
1583: \begin{table}[htbp]
1584: \def\baselinestretch{1}
1585: \begin{center}
1586: \caption{Query-by-query comparison for the general and
1587: abbreviation dictionaries}
1588: \medskip
1589: \leavevmode
1590: \footnotesize
1591: \begin{tabular}{cllll} \hline\hline
1592: & & \multicolumn{3}{c}{Change in Average Precision} \\ \cline{3-5}
1593: ID & Japanese (Translation) & {\hfill\centering Standard
1594: TF\hfill} & {\hfill\centering Logarithmic
1595: TF\hfill} & {\hfill\centering SMART\hfill} \\ \hline
1596: 0023 & {\it shimbun-kiji\/} (story) & 0.0003
1597: $\rightarrow$ 0.0000* & 0.0000 $\rightarrow$ 0.0000 & 0.0000
1598: $\rightarrow$ 0.0000 \\
1599: 0023 & {\it shimbun-kiji\/} (newspaper article) & 0.0003
1600: $\rightarrow$ 0.0200 & 0.0000 $\rightarrow$ 0.0858 & 0.0000
1601: $\rightarrow$ 0.1800 \\
1602: 0025 & LFG (lexical functional grammar) & 0.8000 $\rightarrow$
1603: 0.5410* & 0.8000 $\rightarrow$ 0.6879* & 0.9452 $\rightarrow$
1604: 0.8617* \\ \hline
1605: \end{tabular}
1606: \label{tab:avgpre_qbq_additional}
1607: \end{center}
1608: \end{table}
1609:
1610: \section{Conclusion}
1611: \label{sec:conclusion}
1612:
1613: Reflecting the rapid growth in utilization of machine readable
1614: multilingual texts in the 1990s, cross-language information retrieval
1615: (CLIR), which was initiated in the 1960s, has variously been explored
1616: in order to facilitate retrieving information across languages. For
1617: this purpose, a number of CLIR systems have been developed in
1618: information retrieval, natural language processing and artificial
1619: intelligence research.
1620:
1621: In this paper, we proposed a Japanese/English bidirectional CLIR
system targeting technical documents, for which the translation of
technical terms is a crucial task. Since our research methodology must be
1624: contextualized in terms of past research literature, we surveyed
1625: existing CLIR systems, and classified them into three approaches: (a)
1626: translating queries into the document language, (b) translating
1627: documents into the query language, and (c) representing both queries
1628: and documents in a language-independent space. Among these approaches,
1629: we found that the first one, namely the query translation approach, is
1630: relatively inexpensive to implement. Therefore, following this
1631: approach, we combined query translation and monolingual retrieval
1632: modules.
1633:
1634: However, a naive query translation method relying on existing
1635: bilingual dictionaries does not guarantee sufficient system
1636: performance, because new technical terms are progressively created by
1637: combining existing base words or by the Japanese {\it katakana\/}
1638: phonograms. To counter this problem, we proposed compound word
1639: translation and transliteration methods, and integrated them within
1640: one framework. Our methods involve the dictionary production and
1641: probabilistic resolution of translation/transliteration ambiguity,
1642: both of which are fully automated. To produce the dictionary used for
1643: the compound word translation, we extracted base word translations
from the EDR technical terminology dictionary. On the other hand, we
established correspondences between English and Japanese {\it
katakana\/} words on a character basis, to produce the transliteration
dictionary. For the
1647: disambiguation, we used word frequency statistics extracted from the
1648: document collection. We also produced a dictionary for abbreviated
1649: English technical terms, to enhance the translation.
1650:
1651: From a scientific point of view, we investigated the performance of
1652: our CLIR system by way of the standardized IR evaluation method. For
1653: this purpose, we used the NACSIS test collection, which consists of
1654: Japanese queries and Japanese/English technical abstracts, and carried
1655: out Japanese-English CLIR evaluation. Our evaluation results showed
1656: that each individual method proposed, i.e., compound word translation
1657: and transliteration, improved on the baseline performance, and when
1658: used together the improvement was even greater, resulting in a
1659: performance comparable with Japanese-Japanese monolingual IR. We also
1660: showed that the enhancement of the retrieval module improved on our
1661: system performance, independently from the enhancement of the query
1662: translation module.
1663:
1664: Future work will include improvement of each component in our system,
1665: and the effective presentation of retrieved documents using
1666: sophisticated summarization techniques.
1667:
1668: \section*{Acknowledgments}
1669:
1670: The authors would like to thank Noriko Kando (National Institute of
1671: Informatics, Japan) for her support with the NACSIS collection.
1672:
1673: \bibliographystyle{acl}
1674: \begin{thebibliography}{}
1675:
1676: \bibitem[\protect\citename{AAAI}1997]{aaai-spring-sympo-97}
1677: AAAI.
1678: \newblock 1997.
1679: \newblock {\em Electronic Working Notes of the AAAI Spring Symposium on
1680: Cross-Language Text and Speech Retrieval}.
1681: \newblock {\tt http://www.clis.umd.edu/dlrg/filter/sss/papers/}.
1682:
1683: \bibitem[\protect\citename{ACM}1996-1998]{sigir-96-98}
1684: ACM SIGIR.
1685: \newblock 1996-1998.
1686: \newblock {\em Proceedings of the Annual International ACM SIGIR Conference on
1687: Research and Development in Information Retrieval}.
1688:
1689: \bibitem[\protect\citename{Aone \bgroup et al.\egroup }1997]{aone:anlp-97}
1690: Chinatsu Aone, Nicholas Charocopos, and James Gorlinsky.
1691: \newblock 1997.
1692: \newblock An intelligent multilingual information browsing and retrieval system
1693: using information extraction.
1694: \newblock In {\em Proceedings of the 5th Conference on Applied Natural Language
1695: Processing}, pages 332--339.
1696:
1697: \bibitem[\protect\citename{Ballesteros and Croft}1997]{ballesteros:sigir-97}
1698: Lisa Ballesteros and W.~Bruce Croft.
1699: \newblock 1997.
1700: \newblock Phrasal translation and query expansion techniques for cross-language
1701: information retrieval.
1702: \newblock In {\em Proceedings of the 20th Annual International ACM SIGIR
1703: Conference on Research and Development in Information Retrieval}, pages
1704: 84--91.
1705:
1706: \bibitem[\protect\citename{Ballesteros and Croft}1998]{ballesteros:sigir-98}
1707: Lisa Ballesteros and W.~Bruce Croft.
1708: \newblock 1998.
1709: \newblock Resolving ambiguity for cross-language retrieval.
1710: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR
1711: Conference on Research and Development in Information Retrieval}, pages
1712: 64--71.
1713:
1714: \bibitem[\protect\citename{Brown \bgroup et al.\egroup }1993]{brown:cl-93}
1715: Peter~F. Brown, Stephen A.~Della Pietra, Vincent J.~Della Pietra, and Robert~L.
1716: Mercer.
1717: \newblock 1993.
1718: \newblock The mathematics of statistical machine translation: Parameter
1719: estimation.
1720: \newblock {\em Computational Linguistics}, 19(2):263--311.
1721:
1722: \bibitem[\protect\citename{Carbonell \bgroup et al.\egroup
1723: }1997]{carbonell:ijcai-97}
1724: Jaime~G. Carbonell, Yiming Yang, Robert~E. Frederking, Ralf~D. Brown, Yibing
1725: Geng, and Danny Lee.
1726: \newblock 1997.
1727: \newblock Translingual information retrieval: A comparative evaluation.
1728: \newblock In {\em Proceedings of the 15th International Joint Conference on
1729: Artificial Intelligence}, pages 708--714.
1730:
1731: \bibitem[\protect\citename{Chen \bgroup et al.\egroup
1732: }1998]{chen:coling-acl-98}
1733: Hsin-Hsi Chen, Sheng-Jie Huang, Yung-Wei Ding, and Shih-Chung Tsai.
1734: \newblock 1998.
1735: \newblock Proper name translation in cross-language information retrieval.
1736: \newblock In {\em Proceedings of the 36th Annual Meeting of the Association for
1737: Computational Linguistics and the 17th International Conference on
1738: Computational Linguistics}, pages 232--236.
1739:
1740: \bibitem[\protect\citename{Chen \bgroup et al.\egroup }1999]{chen:acl-99}
1741: Hsin-Hsi Chen, Guo-Wei Bian, and Wen-Cheng Lin.
1742: \newblock 1999.
1743: \newblock Resolving translation ambiguity and target polysemy in cross-language
1744: information retrieval.
1745: \newblock In {\em Proceedings of the 37th Annual Meeting of the Association for
1746: Computational Linguistics}, pages 215--222.
1747:
1748: \bibitem[\protect\citename{Church and Mercer}1993]{church:cl-93}
1749: Kenneth~W. Church and Robert~L. Mercer.
1750: \newblock 1993.
1751: \newblock Introduction to the special issue on computational linguistics using
1752: large corpora.
1753: \newblock {\em Computational Linguistics}, 19(1):1--24.
1754:
1755: \bibitem[\protect\citename{Davis and Ogden}1997]{davis:sigir-97}
1756: Mark~W. Davis and William~C. Ogden.
1757: \newblock 1997.
1758: \newblock {QUILT}: Implementing a large-scale cross-language text retrieval
1759: system.
1760: \newblock In {\em Proceedings of the 20th Annual International ACM SIGIR
1761: Conference on Research and Development in Information Retrieval}, pages
1762: 92--98.
1763:
1764: \bibitem[\protect\citename{Deerwester \bgroup et al.\egroup
1765: }1990]{deerwester:jasis-90}
1766: Scott Deerwester, Susan~T. Dumais, George~W. Furnas, Thomas~K. Landauer, and
1767: Richard Harshman.
1768: \newblock 1990.
1769: \newblock Indexing by latent semantic analysis.
1770: \newblock {\em Journal of the American Society for Information Science},
1771: 41(6):391--407.
1772:
1773: \bibitem[\protect\citename{Dijkstra}1959]{dijkstra:nm-59}
Edsger~W. Dijkstra.
1775: \newblock 1959.
1776: \newblock A note on two problems in connexion with graphs.
1777: \newblock {\em Numerische Mathematik}, 1:269--271.
1778:
1779: \bibitem[\protect\citename{Dorr and Oard}1998]{dorr:lrec-98}
1780: Bonnie~J. Dorr and Douglas~W. Oard.
1781: \newblock 1998.
1782: \newblock Evaluating resources for query translation in cross-language
1783: information retrieval.
1784: \newblock In {\em Proceedings of the 1st International Conference on Language
1785: Resources and Evaluation}, pages 759--764.
1786:
1787: \bibitem[\protect\citename{Dumais \bgroup et al.\egroup
1788: }1996]{dumais:sigir-ws-96}
1789: Susan~T. Dumais, Thomas~K. Landauer, and Michael~L. Littman.
1790: \newblock 1996.
1791: \newblock Automatic cross-linguistic information retrieval using latent
1792: semantic indexing.
1793: \newblock In {\em ACM SIGIR Workshop on Cross-Linguistic Information
1794: Retrieval}.
1795:
1796: \bibitem[\protect\citename{Fellbaum}1998]{fellbaum:wordnet-98}
1797: Christiane Fellbaum, editor.
1798: \newblock 1998.
1799: \newblock {\em {WordNet}: An Electronic Lexical Database}.
1800: \newblock MIT Press.
1801:
1802: \bibitem[\protect\citename{Ferber}1989]{ferber:89}
1803: Gene Ferber.
1804: \newblock 1989.
1805: \newblock {\em {English-Japanese}, {Japanese-English} Dictionary of Computer
1806: and Data-Processing Terms}.
1807: \newblock MIT Press.
1808:
1809: \bibitem[\protect\citename{Fung \bgroup et al.\egroup }1999]{fung:acl-99}
1810: Pascale Fung, Liu Xiaohu, and Cheung~Chi Shun.
1811: \newblock 1999.
1812: \newblock Mixed language query disambiguation.
1813: \newblock In {\em Proceedings of the 37th Annual Meeting of the Association for
1814: Computational Linguistics}, pages 333--340.
1815:
1816: \bibitem[\protect\citename{Fung}1995]{fung:acl-95}
1817: Pascale Fung.
1818: \newblock 1995.
1819: \newblock A pattern matching method for finding noun and proper noun
1820: translations from noisy parallel corpora.
1821: \newblock In {\em Proceedings of the 33rd Annual Meeting of the Association for
1822: Computational Linguistics}, pages 236--243.
1823:
1824: \bibitem[\protect\citename{Gachot \bgroup et al.\egroup
1825: }1996]{gachot:sigir-ws-96}
1826: Denis~A. Gachot, Elke Lange, and Jin Yang.
1827: \newblock 1996.
1828: \newblock The {SYSTRAN} {NLP} browser: An application of machine translation
1829: technology in multilingual information retrieval.
1830: \newblock In {\em ACM SIGIR Workshop on Cross-Linguistic Information
1831: Retrieval}.
1832:
1833: \bibitem[\protect\citename{Gilarranz \bgroup et al.\egroup
1834: }1997]{gilarranz:aaai-spring-sympo-97}
1835: Julio Gilarranz, Julio Gonzalo, and Felisa Verdejo.
1836: \newblock 1997.
1837: \newblock An approach to conceptual text retrieval using the {EuroWordNet}
1838: multilingual semantic database.
1839: \newblock In {\em Electronic Working Notes of the AAAI Spring Symposium on
1840: Cross-Language Text and Speech Retrieval}.
1841:
1842: \bibitem[\protect\citename{Gonzalo \bgroup et al.\egroup
1843: }1998]{gonzalo:chum-98}
1844: Julio Gonzalo, Felisa Verdejo, Carol Peters, and Nicoletta Calzolari.
1845: \newblock 1998.
1846: \newblock Applying {EuroWordNet} to cross-language text retrieval.
1847: \newblock {\em Computers and the Humanities}, 32:185--207.
1848:
1849: \bibitem[\protect\citename{Hull and Grefenstette}1996]{hull:sigir-96}
1850: David~A. Hull and Gregory Grefenstette.
1851: \newblock 1996.
1852: \newblock Querying across languages: A dictionary-based approach to
1853: multilingual information retrieval.
1854: \newblock In {\em Proceedings of the 19th Annual International ACM SIGIR
1855: Conference on Research and Development in Information Retrieval}, pages
1856: 49--57.
1857:
1858: \bibitem[\protect\citename{Hull}1993]{hull:sigir-93}
1859: David Hull.
1860: \newblock 1993.
1861: \newblock Using statistical testing in the evaluation of retrieval experiments.
1862: \newblock In {\em Proceedings of the 16th Annual International ACM SIGIR
1863: Conference on Research and Development in Information Retrieval}, pages
1864: 329--338.
1865:
1866: \bibitem[\protect\citename{Hull}1997]{hull:aaai-spring-sympo-97}
1867: David~A. Hull.
1868: \newblock 1997.
1869: \newblock Using structured queries for disambiguation in cross-language
1870: information retrieval.
1871: \newblock In {\em Electronic Working Notes of the AAAI Spring Symposium on
1872: Cross-Language Text and Speech Retrieval}.
1873:
1874: \bibitem[\protect\citename{{Japan Electronic Dictionary Research
1875: Institute}}1995a]{edr-bilindic:95}
1876: {Japan Electronic Dictionary Research Institute}.
1877: \newblock 1995a.
1878: \newblock Bilingual dictionary.
1879: \newblock (In Japanese).
1880:
1881: \bibitem[\protect\citename{{Japan Electronic Dictionary Research
1882: Institute}}1995b]{edr-techdic:95}
1883: {Japan Electronic Dictionary Research Institute}.
1884: \newblock 1995b.
1885: \newblock Technical terminology dictionary (information processing).
1886: \newblock (In Japanese).
1887:
1888: \bibitem[\protect\citename{Kaji and Aizono}1996]{kaji:coling-96}
1889: Hiroyuki Kaji and Toshiko Aizono.
1890: \newblock 1996.
1891: \newblock Extracting word correspondences from bilingual corpora based on word
1892: co-occurrence information.
1893: \newblock In {\em Proceedings of the 16th International Conference on
1894: Computational Linguistics}, pages 23--28.
1895:
1896: \bibitem[\protect\citename{Kando \bgroup et al.\egroup }1999]{kando:sigir-99}
1897: Noriko Kando, Kazuko Kuriyama, and Toshihiko Nozue.
1898: \newblock 1999.
1899: \newblock {NACSIS} test collection workshop ({NTCIR-1}).
1900: \newblock In {\em Proceedings of the 22nd Annual International ACM SIGIR
1901: Conference on Research and Development in Information Retrieval}, pages
1902: 299--300.
1903:
1904: \bibitem[\protect\citename{Keen}1992]{keen:ipm-92}
1905: E.~Michael Keen.
1906: \newblock 1992.
1907: \newblock Presenting results of experimental retrieval comparisons.
1908: \newblock {\em Information Processing \& Management}, 28(4):491--502.
1909:
1910: \bibitem[\protect\citename{Knight and Graehl}1998]{knight:cl-98}
1911: Kevin Knight and Jonathan Graehl.
1912: \newblock 1998.
1913: \newblock Machine transliteration.
1914: \newblock {\em Computational Linguistics}, 24(4):599--612.
1915:
1916: \bibitem[\protect\citename{Kobayashi \bgroup et al.\egroup
1917: }1994]{kobayashi:coling-94}
1918: Yoshiyuki Kobayashi, Takenobu Tokunaga, and Hozumi Tanaka.
1919: \newblock 1994.
1920: \newblock Analysis of {Japanese} compound nouns using collocational
1921: information.
1922: \newblock In {\em Proceedings of the 15th International Conference on
1923: Computational Linguistics}, pages 865--869.
1924:
1925: \bibitem[\protect\citename{Kwon \bgroup et al.\egroup }1998]{kwon:cpol-98}
1926: Oh-Woog Kwon, Insu Kang, Jong-Hyeok Lee, and Geunbae Lee.
1927: \newblock 1998.
1928: \newblock Conceptual cross-language text retrieval based on document
1929: translation using {Japanese}-to-{Korean} {MT} system.
1930: \newblock {\em International Journal of Computer Processing of Oriental
1931: Languages}, 12(1):1--16.
1932:
1933: \bibitem[\protect\citename{Lee and Choi}1997]{lee:iral-97}
1934: Jae~Sung Lee and Key-Sun Choi.
1935: \newblock 1997.
1936: \newblock A statistical method to generate various foreign word
1937: transliterations in multilingual information retrieval system.
1938: \newblock In {\em Proceedings of the 2nd International Workshop on Information
1939: Retrieval with Asian Languages}, pages 123--128.
1940:
1941: \bibitem[\protect\citename{Mani and Bloedorn}1998]{mani:aaai-iaai-98}
1942: Inderjeet Mani and Eric Bloedorn.
1943: \newblock 1998.
1944: \newblock Machine learning of generic and user-focused summarization.
1945: \newblock In {\em Proceedings of AAAI/IAAI-98}, pages 821--826.
1946:
1947: \bibitem[\protect\citename{Matsumoto \bgroup et al.\egroup
1948: }1997]{matsumoto:chasen-97}
1949: Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Osamu Imaichi, and Tomoaki
1950: Imamura.
1951: \newblock 1997.
1952: \newblock {Japanese} morphological analysis system {ChaSen} manual.
1953: \newblock Technical Report NAIST-IS-TR97007, NAIST.
1954: \newblock (In Japanese).
1955:
1956: \bibitem[\protect\citename{McCarley}1999]{mccarley:acl-99}
1957: J.~Scott McCarley.
1958: \newblock 1999.
1959: \newblock Should we translate the documents or the queries in cross-language
1960: information retrieval?
1961: \newblock In {\em Proceedings of the 37th Annual Meeting of the Association for
1962: Computational Linguistics}, pages 208--214.
1963:
1964: \bibitem[\protect\citename{Mongar}1969]{mongar:tis-69}
1965: P.E. Mongar.
1966: \newblock 1969.
1967: \newblock International co-operation in abstracting services for road
1968: engineering.
1969: \newblock {\em The Information Scientist}, 3:51--62.
1970:
1971: \bibitem[\protect\citename{{Nichigai Associates}}1996]{nichigai_compdic:96}
1972: {Nichigai Associates}.
1973: \newblock 1996.
1974: \newblock {English-Japanese} computer terminology dictionary.
1975: \newblock (In Japanese).
1976:
1977: \bibitem[\protect\citename{Nie \bgroup et al.\egroup }1999]{nie:sigir-99}
1978: Jian-Yun Nie, Michel Simard, Pierre Isabelle, and Richard Durand.
1979: \newblock 1999.
1980: \newblock Cross-language information retrieval based on parallel texts and
1981: automatic mining of parallel texts from the {Web}.
1982: \newblock In {\em Proceedings of the 22nd Annual International ACM SIGIR
1983: Conference on Research and Development in Information Retrieval}, pages
1984: 74--81.
1985:
1986: \bibitem[\protect\citename{NIST}1992-1998]{trec-92-98}
1987: {National Institute of Standards \& Technology}.
1988: \newblock 1992--1998.
1989: \newblock {\em Proceedings of the Text REtrieval Conferences}.
1990: \newblock {\tt http://trec.nist.gov/pubs.html}.
1991:
1992: \bibitem[\protect\citename{Oard and Resnik}1999]{oard:ipm-99}
1993: Douglas~W. Oard and Philip Resnik.
1994: \newblock 1999.
1995: \newblock Support for interactive document selection in cross-language
1996: information retrieval.
1997: \newblock {\em Information Processing \& Management}, 35(3):363--379.
1998:
1999: \bibitem[\protect\citename{Oard}1998]{oard:amta-98}
2000: Douglas~W. Oard.
2001: \newblock 1998.
2002: \newblock A comparative study of query and document translation for
2003: cross-language information retrieval.
2004: \newblock In {\em Proceedings of the 3rd Conference of the Association for
2005: Machine Translation in the Americas}, pages 472--483.
2006:
2007: \bibitem[\protect\citename{Okumura \bgroup et al.\egroup
2008: }1998]{okumura:lrec-tlim-ws-98}
2009: Akitoshi Okumura, Kai Ishikawa, and Kenji Satoh.
2010: \newblock 1998.
2011: \newblock Translingual information retrieval by a bilingual dictionary and
2012: comparable corpus.
2013: \newblock In {\em The 1st International Conference on Language Resources and
2014: Evaluation, Workshop on Translingual Information Management: Current Levels
2015: and Future Abilities}.
2016:
2017: \bibitem[\protect\citename{Pirkola}1998]{pirkola:sigir-98}
2018: Ari Pirkola.
2019: \newblock 1998.
2020: \newblock The effects of query structure and dictionary setups in
2021: dictionary-based cross-language information retrieval.
2022: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR
2023: Conference on Research and Development in Information Retrieval}, pages
2024: 55--63.
2025:
2026: \bibitem[\protect\citename{Sakai \bgroup et al.\egroup }1999]{sakai:tipsj-99}
2027: Tetsuya Sakai, Masahiro Kajiura, Kazuo Sumita, Gareth Jones, and Nigel Collier.
2028: \newblock 1999.
2029: \newblock A study on {English}-{Japanese}/{Japanese}-{English} cross-language
2030: information retrieval using machine translation.
2031: \newblock {\em Transactions of Information Processing Society of Japan},
2032: 40(11):4075--4086.
2033: \newblock (In Japanese).
2034:
2035: \bibitem[\protect\citename{Salton and Buckley}1988]{salton:ipm-88}
2036: Gerard Salton and Christopher Buckley.
2037: \newblock 1988.
2038: \newblock Term-weighting approaches in automatic text retrieval.
2039: \newblock {\em Information Processing \& Management}, 24(5):513--523.
2040:
2041: \bibitem[\protect\citename{Salton and McGill}1983]{salton:83}
2042: Gerard Salton and Michael~J. McGill.
2043: \newblock 1983.
2044: \newblock {\em Introduction to Modern Information Retrieval}.
2045: \newblock McGraw-Hill.
2046:
2047: \bibitem[\protect\citename{Salton}1970]{salton:jasis-70}
2048: Gerard Salton.
2049: \newblock 1970.
2050: \newblock Automatic processing of foreign language documents.
2051: \newblock {\em Journal of the American Society for Information Science},
2052: 21(3):187--194.
2053:
2054: \bibitem[\protect\citename{Salton}1971]{salton:71}
2055: Gerard Salton.
2056: \newblock 1971.
2057: \newblock {\em The {SMART} Retrieval System: Experiments in Automatic Document
2058: Processing}.
2059: \newblock Prentice-Hall.
2060:
2061: \bibitem[\protect\citename{Salton}1972]{salton:techrep-72}
2062: Gerard Salton.
2063: \newblock 1972.
2064: \newblock Experiments in multi-lingual information retrieval.
2065: \newblock Technical Report TR 72-154, Computer Science Department, Cornell
2066: University.
2067:
2068: \bibitem[\protect\citename{Sch\"{a}uble and Sheridan}1997]{schauble:trec-97}
2069: Peter Sch\"{a}uble and P\'{a}raic Sheridan.
2070: \newblock 1997.
2071: \newblock Cross-language information retrieval ({CLIR}) track overview.
\newblock In {\em The 6th Text Retrieval Conference}.
2073:
2074: \bibitem[\protect\citename{Sheridan and Ballerini}1996]{sheridan:sigir-96}
2075: P\'{a}raic Sheridan and Jean~Paul Ballerini.
2076: \newblock 1996.
2077: \newblock Experiments in multilingual information retrieval using the {SPIDER}
2078: system.
2079: \newblock In {\em Proceedings of the 19th Annual International ACM SIGIR
2080: Conference on Research and Development in Information Retrieval}, pages
2081: 58--65.
2082:
2083: \bibitem[\protect\citename{Smadja \bgroup et al.\egroup }1996]{smadja:cl-96}
2084: Frank Smadja, Kathleen~R. McKeown, and Vasileios Hatzivassiloglou.
2085: \newblock 1996.
2086: \newblock Translating collocations for bilingual lexicons: A statistical
2087: approach.
2088: \newblock {\em Computational Linguistics}, 22(1):1--38.
2089:
2090: \bibitem[\protect\citename{Suzuki \bgroup et al.\egroup
2091: }1998]{suzuki:signl-98-7}
2092: Masami Suzuki, Naomi Inoue, and Kazuo Hashimoto.
2093: \newblock 1998.
2094: \newblock Effect on displaying translated major keywords of contents as
2095: browsing support in cross-language information retrieval.
2096: \newblock {\em Information Processing Society of Japan SIGNL Notes},
2097: 98(63):99--106.
2098: \newblock (In Japanese).
2099:
2100: \bibitem[\protect\citename{Suzuki \bgroup et al.\egroup }1999]{suzuki:nlp-99}
2101: Masami Suzuki, Naomi Inoue, and Kazuo Hashimoto.
2102: \newblock 1999.
2103: \newblock Effects of partial translation for users' document selection in
2104: cross-language information retrieval.
\newblock In {\em Proceedings of the 5th Annual Meeting of the Association for
2106: Natural Language Processing}, pages 371--374.
2107: \newblock (In Japanese).
2108:
2109: \bibitem[\protect\citename{Tombros and Sanderson}1998]{tombros:sigir-98}
2110: Anastasios Tombros and Mark Sanderson.
2111: \newblock 1998.
2112: \newblock Advantages of query biased summaries in information retrieval.
2113: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR
2114: Conference on Research and Development in Information Retrieval}, pages
2115: 2--10.
2116:
2117: \bibitem[\protect\citename{Tsuji and Kageura}1997]{tsuji:nlprs-97}
2118: Keita Tsuji and Kyo Kageura.
2119: \newblock 1997.
2120: \newblock An {HMM}-based method for segmenting {Japanese} terms and keywords
2121: based on domain-specific bilingual corpora.
2122: \newblock In {\em Proceedings of the 4th Natural Language Processing Pacific
2123: Rim Symposium}, pages 557--560.
2124:
2125: \bibitem[\protect\citename{Voorhees}1998]{voorhees:sigir-98}
2126: Ellen~M. Voorhees.
2127: \newblock 1998.
2128: \newblock Variations in relevance judgments and the measurement of retrieval
2129: effectiveness.
2130: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR
2131: Conference on Research and Development in Information Retrieval}, pages
2132: 315--323.
2133:
2134: \bibitem[\protect\citename{Vossen}1998]{vossen:chum-98}
2135: Piek Vossen.
2136: \newblock 1998.
2137: \newblock Introduction to {EuroWordNet}.
2138: \newblock {\em Computers and the Humanities}, 32:73--89.
2139:
2140: \bibitem[\protect\citename{Wong \bgroup et al.\egroup }1985]{wong:sigir-85}
S.K.M. Wong, W.~Ziarko, and P.C.N. Wong.
2142: \newblock 1985.
2143: \newblock Generalized vector space model in information retrieval.
2144: \newblock In {\em Proceedings of the 8th Annual International ACM SIGIR
2145: Conference on Research and Development in Information Retrieval}, pages
2146: 18--25.
2147:
2148: \bibitem[\protect\citename{Xu and Croft}1996]{xu:sigir-96}
2149: Jinxi Xu and W.~Bruce Croft.
2150: \newblock 1996.
2151: \newblock Query expansion using local and global document analysis.
2152: \newblock In {\em Proceedings of the 19th Annual International ACM SIGIR
2153: Conference on Research and Development in Information Retrieval}, pages
2154: 4--11.
2155:
2156: \bibitem[\protect\citename{Yamabana \bgroup et al.\egroup
2157: }1996]{yamabana:sigir-ws-96}
2158: Kiyoshi Yamabana, Kazunori Muraki, Shinichi Doi, and Shin'ichiro Kamei.
2159: \newblock 1996.
2160: \newblock A language conversion front-end for cross-linguistic information
2161: retrieval.
2162: \newblock In {\em ACM SIGIR Workshop on Cross-Linguistic Information
2163: Retrieval}.
2164:
2165: \bibitem[\protect\citename{Zobel and Moffat}1998]{zobel:sigir-forum-98}
2166: Justin Zobel and Alistair Moffat.
2167: \newblock 1998.
2168: \newblock Exploring the similarity space.
\newblock {\em ACM SIGIR Forum}, 32(1):18--34.
2170:
2171: \end{thebibliography}
2172:
2173: \end{document}
2174: