
1: \documentstyle[chum]{article}

2:

3: \title{Japanese/English Cross-Language Information Retrieval:

4: Exploration of Query Translation and Transliteration\footnote{Computers and the Humanities, Vol.35, No.4, pp.389--420, Nov. 2001}}

5:

6: \author{\Large Atsushi Fujii and Tetsuya Ishikawa}

7:

8: \date{University of Library and Information Science \\

9: 1-2 Kasuga Tsukuba 305-8550, JAPAN \\ \smallskip

10: {\normalsize\tt

11: E-mail:fujii@ulis.ac.jp}}

12:

13: \summary{Cross-language information retrieval (CLIR), where queries

14: and documents are in different languages, has of late become one of

15: the major topics within the information retrieval community. This

16: paper proposes a Japanese/English CLIR system, in which we combine

17: query translation and retrieval modules. We currently target the

18: retrieval of technical documents, and therefore the performance of our

19: system is highly dependent on the quality of the translation of

20: technical terms.  However, technical term translation is still

21: problematic in that technical terms are often compound words, and thus

22: new terms are progressively created by combining existing base

23: words. In addition, Japanese often represents loanwords based on its

24: special phonogram. Consequently, existing dictionaries find it

25: difficult to achieve sufficient coverage. To counter the first

26: problem, we produce a Japanese/English dictionary for base words, and

27: translate compound words on a word-by-word basis.  We also use a

28: probabilistic method to resolve translation ambiguity.  For the second

29: problem, we use a transliteration method, which maps words

30: unlisted in the base word dictionary to their phonetic equivalents in

31: the target language. We evaluate our system using a test collection

32: for CLIR, and show that both the compound word translation and

33: transliteration methods improve the system performance.}

34:

35: \begin{document}

36:

37: \makeidpage

38: \maketitle

39:

40: \input{psfig.tex}

41:

42: \newcommand{\etal}{et~al.}

43: \newcommand{\etaleos}{et~al}

44: \newcommand{\eq}[1]{(\ref{#1})}

45:

46: \renewcommand{\nocite}[1]{\shortcite{#1}}

47:

48: \section{Introduction}

49: \label{sec:introduction}

50:

51: Cross-language information retrieval (CLIR) is the retrieval process

52: where the user presents queries in one language to retrieve documents

53: in {\em another\/} language. One of the traditional research

54: references for CLIR dates back to the 1960s~\cite{mongar:tis-69}. In

55: the 1970s, Salton~\nocite{salton:jasis-70,salton:techrep-72}

56: empirically showed that CLIR using a hand-crafted bilingual thesaurus

57: is comparable with monolingual information retrieval in

58: performance. The 1990s witnessed a growing number of machine readable

59: texts in various languages, including those accessible via the World

60: Wide Web, but each document is usually provided in a limited number of

61: languages. Thus, it is plausible that users are interested in

62: retrieving information across languages. Possible users of CLIR are

63: given below:

64: \begin{itemize}

65: \item users who are able to read documents in foreign languages, but

66:   have difficulty formulating foreign queries,

67: \item users who find it difficult to retrieve/read relevant documents,

68:   but need the information, for whom applying machine translation

69:   (MT) systems only to the limited number of documents retrieved

70:   through CLIR is computationally more efficient than

71:   translating the entire collection,

72: \item users who know foreign keywords/phrases, and want to read

73:   documents associated with them, in their native language.

74: \end{itemize}

75: In fact, CLIR has of late become one of the major topics within the

76: information retrieval (IR), natural language processing (NLP) and

77: artificial intelligence (AI) communities, and numerous CLIR systems

78: have variously been

79: proposed~\cite{aaai-spring-sympo-97,sigir-96-98,trec-92-98}.

80: Note that CLIR can be seen as a subtask of multi-lingual information

81: retrieval (MLIR), which also includes the following cases:

82: \begin{itemize}

83: \item identify the query language (based on, for example, character

84:   codes), and search a multilingual collection for documents in the

85:   query language,

86: \item retrieve documents, in which each document is in more than one

87:   language,

88: \item retrieve documents using a query in more than one

89:   language~\cite{fung:acl-99}.

90: \end{itemize}

91: However, the above cases are beyond the scope of this paper. It

92: should also be noted that while CLIR is not necessarily limited to IR

93: within two languages, we consistently use the term ``bilingual,''

94: keeping the potential applicability of CLIR to more than two languages

95: in mind, because the variety of languages used is not the central

96: issue of this paper.

97:

98: Since by definition queries and documents are in different languages,

99: CLIR needs a translation process along with the conventional

100: monolingual retrieval process.  For this purpose, existing CLIR

101: systems adopt various techniques explored in NLP research. In brief,

102: dictionaries, corpora, thesauri and MT systems are used to translate

103: queries and/or documents. However, due to the rudimentary nature of

104: existing translation methods, CLIR still finds it difficult to achieve

105: the performance of monolingual IR. Roughly speaking, recent

106: experiments showed that the average precision of CLIR is 50--75\% of

107: that obtained with monolingual IR~\cite{schauble:trec-97}, which

108: stimulates us to further explore this exciting research area.

109:

110: In this paper, we propose a Japanese/English bidirectional CLIR system

111: targeting technical documents, which has been less explored than that

112: for newspaper articles in past CLIR literature.  Our research is

113: partly motivated by the NACSIS test collection for (CL)IR systems,

114: which consists of Japanese queries and Japanese/English abstracts

115: collected from technical papers~\cite{kando:sigir-99}.\footnote{\tt

116: {http://www.rd.nacsis.ac.jp/\~{}ntcadm/index-en.html}} We will

117: elaborate on the NACSIS collection in

118: Section~\ref{subsec:eval_overview}. As can be predicted, the

119: performance of our CLIR system strongly depends on the quality of the

120: translation of technical terms, which are often unlisted in general

121: dictionaries.

122:

123: Pirkola~\nocite{pirkola:sigir-98}, for example, used a subset of the

124: TREC collection related to health topics, and showed that a

125: combination of general and domain specific (i.e., medical)

126: dictionaries improves the CLIR performance obtained with only a

127: general dictionary. This result shows the potential contribution of

128: technical term translation to CLIR. At the same time, it should be

129: noted that even domain specific dictionaries do not exhaustively list

130: possible technical terms. For example, the EDR technical terminology

131: dictionary~\cite{edr-techdic:95}, which consists of approximately

132: 120,000 Japanese-English translations related to the information

133: processing field, lacks recent terms like ``{\it jouhou

134: chuushutsu\/}~(information extraction).'' We classify problems

135: associated with technical term translation as given below:

136: \begin{itemize}

137: \item technical terms are often compound words, which

138:   can be progressively created simply by combining multiple existing

139:   morphemes (``base words''), and therefore it is not entirely

140:   satisfactory or feasible to exhaustively enumerate newly emerging

141:   terms in dictionaries,

142: \item Japanese often represents loanwords (i.e., technical terms and

143:   proper nouns imported from foreign languages) using its special

144:   phonetic alphabet (or phonogram) called ``{\it katakana},'' with

145:   which new words can be spelled out,

146: \item English technical terms are often abbreviated, which can be used

147:   as ``Japanese'' words.

148: \end{itemize}

149: To counter the first problem, we propose a compound word translation

150: method, which selects appropriate translations based on the

151: probability of occurrence of each combination of base words in the

152: target language (see Section~\ref{subsec:cwt}). Note that technical

153: compound words sometimes include general words, such as ``AI {\em

154: chess\/}'' and ``digital {\em watermark\/}.'' In this paper, we do not

155: rigorously define general words, by which we mean words that are

156: contained in existing general dictionaries but rarely in technical

157: term dictionaries. For the second problem, we propose a

158: ``transliteration'' method, which identifies phonetic equivalents in

159: the target language (see Section~\ref{subsec:translit}). Finally, to

160: resolve the third problem, we enhance our bilingual dictionary with

161: pairs of an abbreviation and its complete form (e.g., ``IR'' and

162: ``information retrieval'') extracted from English corpora (see

163: Section~\ref{subsec:dictionary_enhancement}).  Note that although a

164: number of methods targeting the above problems have been explored

165: in past research, no attempt has been made to integrate them in the

166: context of CLIR.

167:

168: Section~\ref{sec:past_research} surveys past research on CLIR, and

169: clarifies our focus and approach. Section~\ref{sec:system_overview}

170: overviews our CLIR system, and Section~\ref{sec:translation}

171: elaborates on the translation method aimed to resolve the above

172: problems associated with technical term translation.

173: Section~\ref{sec:evaluation} then evaluates the performance of our

174: CLIR system using the NACSIS collection.

175:

176: \section{Past Research on CLIR}

177: \label{sec:past_research}

178:

179: \subsection{Retrieval Methodologies}

180: \label{subsec:retrieval_methods}

181:

182: Figure~\ref{fig:retrieval_methods} classifies existing CLIR

183: approaches in terms of retrieval methodology. The top level three

184: categories correspond to the different titles of the following items.

185:

186: \paragraph{Query translation approach}

187:

188: This approach translates queries into document languages using

189: bilingual dictionaries and/or corpora, prior to the retrieval process.

190: Since the retrieval process is fundamentally the same as performed in

191: monolingual IR, the translation module can easily be combined with

192: existing IR engines. This category can be further subdivided into the

193: following three methods.

194:

195: The first subcategory can be called dictionary-based methods.  Hull

196: and Grefenstette~\nocite{hull:sigir-96} used a bilingual dictionary to

197: derive all possible translation candidates of query terms, which are

198: used for the subsequent retrieval. Their method is easy to implement,

199: but potentially retrieves irrelevant documents and decreases the time

200: efficiency. To resolve this problem,

201: Hull~\nocite{hull:aaai-spring-sympo-97} combined translation

202: candidates for each query term with the ``OR'' operator, and used the

203: weighted boolean method to assign an importance degree to each

204: translation candidate.

205:

206: Pirkola~\nocite{pirkola:sigir-98} also used structured queries, where

207: each term is combined with different types of operators. Ballesteros

208: and Croft~\nocite{ballesteros:sigir-97} enhanced the dictionary-based

209: translation using the ``local context analysis''~\cite{xu:sigir-96}

210: and phrase-based translation.  Dorr and Oard~\nocite{dorr:lrec-98}

211: evaluated the effectiveness of a semantic structure of a query in the

212: query translation. As far as their comparative experiments were

213: concerned, the use of semantic structures was not as effective as

214: MT/dictionary-based query translation methods.

215:

216: The second subcategory, corpus-based methods, uses translations

217: extracted from bilingual corpora, for the query

218: translation~\cite{carbonell:ijcai-97}. In this paper, ``(bilingual)

219: aligned corpora'' generally refer to a pair of two language corpora

220: aligned to each other on a word, sentence, paragraph or document

221: basis. Given such resources, corpus-based methods are expected to

222: acquire domain specific translations unlisted in existing

223: dictionaries. In fact, Carbonell~\etal~\nocite{carbonell:ijcai-97}

224: empirically showed that their corpus-based query translation method

225: outperformed a dictionary-based method.  Their comparative evaluation

226: also showed that the corpus-based translation method outperformed

227: GVSM/LSI-based methods (see the following ``Interlingual

228: representation approach'' item for details of GVSM and LSI). Note that

229: for the purpose of corpus-based translation methods, a number of

230: translation extraction techniques explored in NLP

231: research~\cite{fung:acl-95,kaji:coling-96,smadja:cl-96} are

232: applicable.

233:

234: Finally, hybrid methods use corpora to resolve the translation

235: ambiguity inherent in bilingual dictionaries.  Unlike the corpus-based

236: translation methods described above, which rely on bilingual corpora,

237: Ballesteros and Croft~\nocite{ballesteros:sigir-98} and

238: Chen~\etal~\nocite{chen:acl-99} independently used a {\em

239: monolingual\/} corpus for the disambiguation, and therefore the

240: implementation cost is lower. In practice, their methods select the

241: combination of translation candidates that frequently co-occur in the

242: target language corpus. On the other hand, bilingual corpora are also

243: applicable to hybrid

244: methods. Okumura~\etal~\nocite{okumura:lrec-tlim-ws-98} and

245: Yamabana~\etal~\nocite{yamabana:sigir-ws-96} independently used the

246: same disambiguation method, in that they consider word frequencies in

247: both the source and target languages, obtained from a bilingual

248: aligned corpus. Nie~\etal~\nocite{nie:sigir-99} automatically

249: collected parallel texts in French and English from the World Wide

250: Web, to train a probabilistic query translation model, and suggested

251: its feasibility for CLIR.
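As an illustration only, the co-occurrence-based selection used by the
monolingual-corpus methods above can be sketched as follows. The
notation is ours and hypothetical, not taken from the cited work:
$E_{i}$ denotes the set of dictionary translations for the $i$-th
source query term, and $c(e_{i},e_{j})$ denotes how often $e_{i}$ and
$e_{j}$ co-occur in the target language corpus. Under these
assumptions, one plausible formulation selects
\[
  (\hat{e}_{1},\ldots,\hat{e}_{n}) =
  \arg\max_{e_{1} \in E_{1},\ldots,e_{n} \in E_{n}}
  \sum_{i<j} c(e_{i},e_{j}).
\]
The cited systems differ in how they define and normalize
co-occurrence, so this should be read as a schematic rather than as
the exact scoring function of either method.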

252:

253: Davis and Ogden~\nocite{davis:sigir-97} used a bilingual aligned

254: corpus as the document collection for training retrieval. They first

255: derive possible translation candidates using a dictionary. Then,

256: training retrieval trials are performed on the bilingual corpus, in

257: which the source and translated queries are used to retrieve source

258: and target documents, respectively. Finally, they select translations

259: which retrieved documents aligned to those retrieved with the source

260: query. Note that this method provides a salient contrast to other

261: query translation methods, in which translation is performed

262: independently from the retrieval

263: module.

264:

265: Chen~\etal~\nocite{chen:acl-99} addressed the disambiguation of

266: polysemy in the target language, along with translation

267: disambiguation. Specifically, in the case where a source query term

268: corresponds to a small number of translations, but some of these

269: translations are associated with a large number of word senses,

270: disambiguating polysemy is more crucial than the resolution of

271: translation ambiguity. To counter this problem, source query terms are

272: expanded with words that frequently co-occur, which are expected to

273: restrict the meaning of polysemous words in the target language

274: documents.

275:

276: \paragraph{Document translation approach}

277:

278: This approach translates documents into query languages, prior to the

279: retrieval. In most cases, existing MT systems are used to translate

280: all the documents in a given

281: collection~\cite{gachot:sigir-ws-96,kwon:cpol-98,oard:amta-98}. Otherwise,

282: a dictionary-based method is used to translate only index

283: terms~\cite{aone:anlp-97}. It is plausible that, when compared with

284: short queries, documents contain a significantly higher volume of

285: information for the translation. In fact, Oard~\nocite{oard:amta-98}

286: showed that the document translation method using an MT system

287: outperformed several types of dictionary-based query translation

288: methods.

289:

290: However, McCarley~\nocite{mccarley:acl-99} showed that the relative

291: superiority between query and document translation approaches varied

292: depending on the source and target language pair. He also showed that

293: a hybrid system (it should not be confused with one described in the

294: ``Query translation approach'' item above), where the relevance degree

295: of each document (i.e., the ``score'') is the mean of those obtained

296: with query and document translation systems, outperformed systems

297: based on either query or document translation approach. However,

298: generally speaking, the full translation on large-scale collections

299: can be prohibitive.
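The score averaging used in the hybrid system above can be written
compactly. Assuming (in our own, hypothetical notation) that
$s_{Q}(d)$ and $s_{D}(d)$ are the relevance scores assigned to
document $d$ by the query translation and document translation
systems, respectively, the hybrid score is
\[
  s_{H}(d) = \frac{s_{Q}(d) + s_{D}(d)}{2}.
\]
For instance, with made-up scores $s_{Q}(d_{1}) = 0.6$ and
$s_{D}(d_{1}) = 0.2$, we obtain $s_{H}(d_{1}) = 0.4$, whereas a
document $d_{2}$ with $s_{Q}(d_{2}) = s_{D}(d_{2}) = 0.45$ obtains
$s_{H}(d_{2}) = 0.45$ and is ranked higher. Whether and how the two
scores are normalized before averaging is not specified here, so the
numbers are purely illustrative.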

300:

301: \paragraph{Interlingual representation approach}

302:

303: The basis of this approach is to project both queries and documents in

304: a language-independent (conceptual) space. In other words, as

305: Salton~\nocite{salton:jasis-70,salton:techrep-72} and Sheridan and

306: Ballerini~\nocite{sheridan:sigir-96} identified, the interlingual

307: representation approach is based on query expansion methods proposed

308: for monolingual IR. This category can be subdivided into

309: thesaurus-based methods and variants of the vector space model

310: (VSM)~\cite{salton:83}.

311:

312: Salton~\nocite{salton:jasis-70,salton:techrep-72} applied hand-crafted

313: English/French and English/German thesauri to the SMART

314: system~\cite{salton:71}, and demonstrated that a CLIR version of the

315: SMART system is comparable to the monolingual version in

316: performance. The International Road Research Documentation

317: scheme~\cite{mongar:tis-69} used a trilingual thesaurus associated

318: with English, German and French.

319: Gilarranz~\etal~\nocite{gilarranz:aaai-spring-sympo-97} and

320: Gonzalo~\etal~\nocite{gonzalo:chum-98} used the EuroWordNet

321: multilingual thesaurus~\cite{vossen:chum-98}.  Unlike the above

322: methods relying on manual thesaurus construction, Sheridan and

323: Ballerini~\nocite{sheridan:sigir-96} used a multilingual thesaurus

324: automatically produced from an aligned corpus.

325:

326: The generalized vector space model (GVSM)~\cite{wong:sigir-85} and

327: latent semantic indexing (LSI)~\cite{deerwester:jasis-90}, which were

328: originally proposed as variants of the vector space model for

329: monolingual IR, project both queries and documents into a

330: language-independent vector space, and therefore these methods can be

331: applicable to CLIR. While Dumais~\etal~\nocite{dumais:sigir-ws-96}

332: explored an LSI-based CLIR,

333: Carbonell~\etal~\nocite{carbonell:ijcai-97} empirically showed that

334: GVSM outperformed LSI in terms of CLIR. Note that like thesaurus-based

335: methods, GVSM/LSI-based methods require aligned corpora.

336:

337: \begin{figure*}[htbp]

338:   \def\baselinestretch{1}

339:   \begin{center}

340:     \leavevmode

341:     \small

342:     \fbox{

343:     $\left\{

344:     \begin{array}{l}

345:       \mbox{query translation approach}\left\{

346:         \begin{array}{l}

347:           \mbox{dictionary-based methods} \\

348:           \mbox{corpus-based methods} \\

349:           \mbox{hybrid methods}\left\{

350:           \begin{array}{l}

351:             \mbox{bilingual aligned corpora} \\

352:             $\underline{\mbox{monolingual corpora}}$

353:           \end{array}\right.

354:         \end{array}\right. \medskip \\

355:       \mbox{document translation approach}\left\{

356:         \begin{array}{l}

357:           \mbox{full document translation} \\

358:           \mbox{index term translation}

359:         \end{array}\right. \medskip \\

360:       \mbox{interlingual representation approach}\left\{

361:         \begin{array}{l}

362:           \mbox{thesaurus-based methods}\left\{

363:             \begin{array}{l}

364:               \mbox{hand-crafted thesauri} \\

365:               \mbox{corpus-based thesauri}

366:             \end{array}\right.\medskip \\

367:           \mbox{vector space models}\left\{

368:             \begin{array}{l}

369:               \mbox{generalized vector space model} \\

370:               \mbox{latent semantic indexing}

371:             \end{array}\right.

372:         \end{array}\right.

373:     \end{array}

374:     \right.$}

375:   \end{center}

376:   \medskip

377:   \caption{Classification of CLIR retrieval methods (the method we

378:   adopt is underlined)}

379:   \label{fig:retrieval_methods}

380: \end{figure*}

381:

382: \subsection{Presentation Methodologies}

383: \label{subsec:presentation_methods}

384:

385: In the case of CLIR, retrieved documents are not always written in the

386: user's native language. Therefore, presentation methodology of

387: retrieval results is a more crucial task than in monolingual IR.  It

388: is desirable to present smaller-sized contents with less noise; in

389: other words, precision is often given more importance than recall for

390: CLIR systems. Note that effective presentation is also crucial when a

391: user and system interactively retrieve relevant documents, as

392: performed in relevance feedback~\cite{salton:83}.

393:

394: However, a surprisingly small number of references addressing this

395: issue can be found in past research literature.

396: Aone~\etal~\nocite{aone:anlp-97} presented only keywords frequently

397: appearing in retrieved documents, rather than entire documents. Note

398: that since most CLIR systems use frequency information associated with

399: index terms like ``term frequency (TF)'' and ``inverse document

400: frequency (IDF)''~\cite{salton:83} for the retrieval, frequently

401: appearing keywords can be identified without an excessive additional

402: computational cost. Experiments independently conducted by Oard and

403: Resnik~\nocite{oard:ipm-99} and

404: Suzuki~\etal~\nocite{suzuki:signl-98-7} showed that even a simple

405: translation of keywords (such as using all possible translations

406: defined in a dictionary) improved the efficiency with which users find

407: relevant foreign documents from the whole retrieval result.

408: Suzuki~\etal~\nocite{suzuki:nlp-99} more extensively investigated the

409: user's retrieval efficiency (i.e., the time efficiency and accuracy

410: with which human subjects find relevant foreign documents) by

411: comparing different presentation methods, in which the following

412: contents were independently presented to the user:

413: \begin{enumerate}

414: \item keywords without translation,

415: \item keywords translated with the first entry defined in a dictionary,

416: \item keywords translated through the hybrid method (see the

417:   ``Query translation approach'' item in

418:   Section~\ref{subsec:retrieval_methods}),

419: \item documents summarized (by an existing summarization software) and

420:   manually translated.

421: \end{enumerate}

422: Their comparative experiments showed that the third content was most

423: effective in terms of the retrieval efficiency.

424:

425: For monolingual IR, automatic summarization methods based on the

426: user's focus/query have recently been explored. Mani and

427: Bloedorn~\nocite{mani:aaai-iaai-98} used machine learning techniques

428: to produce document summarization rules based on the user's focus

429: (i.e., query). Tombros and Sanderson~\nocite{tombros:sigir-98} showed

430: experimental results, in which presenting the fragment of each

431: retrieved document containing query terms improved the retrieval

432: efficiency of human subjects. Applicability of these methods to CLIR

433: needs to be further explored.

434:

435: \subsection{Evaluation Methodologies}

436: \label{subsec:evaluation_methods}

437:

438: From a scientific point of view, performance evaluation is invaluable

439: for CLIR. In most cases, the evaluation of CLIR is the same as

440: performed for monolingual IR. That is, each system conducts a

441: retrieval trial using a test collection consisting of predefined

442: queries and documents in {\em different\/} languages, and then the

443: performance is evaluated based on the precision and recall. Several

444: experiments used test collections for monolingual IR in which either

445: queries or documents were translated, prior to the

446: evaluation. However, as Sakai~\etal~\nocite{sakai:tipsj-99}

447: empirically showed, the CLIR performance varies depending on the

448: quality of the translation of collections, and thus it is desirable to

449: carefully produce test collections for CLIR. The production of test

450: collections usually involves collecting documents, producing queries

451: and relevance assessment for each query. However, since relevance

452: assessment is expensive, especially for large-scale collections (even

453: in the case where the pooling method~\cite{voorhees:sigir-98} is used

454: to reduce the number of candidates of relevant documents),

455: Carbonell~\etal~\nocite{carbonell:ijcai-97} first translated queries

456: into the document language, and used as (pseudo) relevant documents

457: those retrieved with the translated queries. In other words, this

458: evaluation method investigates the extent to which CLIR maintains the

459: performance of monolingual IR.
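To make this evaluation scheme concrete, the following is a sketch in
our own notation (an assumption, not the notation of the cited work).
Let $R_{mono}(q)$ be the set of documents retrieved by the monolingual
system with the translated version of query $q$, which serves as the
set of pseudo relevant documents, and let $R_{CL}(q)$ be the set
retrieved by the CLIR system with the original query. Precision and
recall are then
\[
  P(q) = \frac{|R_{CL}(q) \cap R_{mono}(q)|}{|R_{CL}(q)|}, \qquad
  R(q) = \frac{|R_{CL}(q) \cap R_{mono}(q)|}{|R_{mono}(q)|}.
\]
Provided both sets are non-empty, both values equal 1 exactly when the
CLIR system retrieves the same set as the monolingual system, which is
why this scheme measures how well CLIR preserves monolingual
performance rather than relevance as judged by human assessors.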

460:

461: For the evaluation of presentation methods, human subjects are often

462: used to investigate the retrieval efficiency, as described in

463: Section~\ref{subsec:presentation_methods}. However, evaluation methods

464: involving human interactions are problematic, because human subjects

465: are in a way trained through repetitive retrieval trials for different

466: systems, which can potentially bias the result. On the other hand, in

467: the case where each subject uses a single system, differences among

468: subjects affect the result. To minimize this bias, multiple subjects

469: are usually classified based on, for example, their literacy in terms

470: of the target language, and those falling into the same cluster are

471: virtually regarded as the same person.  However, this issue still

472: remains an open question, and needs to be further explored.

473:

474: \subsection{Our Focus and Approach}

475: \label{subsec:our_approach}

476:

477: Through discussions in the above three sections, we identified the

478: following points which should be taken into consideration for our

479: research.

480:

481: For translation methodology, the query translation approach is

482: preferable in terms of implementation cost, because this approach can

483: simply be combined with existing IR engines. On the other hand, other

484: approaches can be prohibitive, because (a) the document translation

485: approach conducts the full translation on the entire collection, and

486: (b) the interlingual representation approach requires alignment of

487: bilingual thesauri/corpora. In fact, we do not have Japanese-English

488: thesauri/corpora with sufficient volume of alignment information at

489: present. One may argue that the NACSIS collection, which is a

490: large-scale Japanese-English aligned corpus, can be used for the

491: translation. However, note that bilingual corpora for the translation

492: must not be obtained from the test collection used for the evaluation,

493: because in real world usage one of the two language documents in the

494: collection is usually missing. In other words, CLIR has little

495: necessity for bilingual aligned document collections, in that the user

496: can retrieve documents in the query language, without the translation

497: process.

498:

499: However, at the same time we concede that each approach is worth

500: further exploration, and in this paper we do not pretend to draw any

501: premature conclusions regarding the relative merits of different

502: approaches. To sum up, we focus mainly on translating sequences of

503: content words included in queries, rather than the entire

504: collection. Among different methods following the query translation

505: approach, we adopt the hybrid method using a {\em monolingual\/}

506: corpus. In other words, our translation method is relatively similar

507: to that proposed by Ballesteros and

508: Croft~\nocite{ballesteros:sigir-98} and

509: Chen~\etal~\nocite{chen:acl-99}. However, unlike their cases, we

510: integrate word-based translation and transliteration methods within

511: the query translation.

512:

513: For presentation methodology, we use keywords translated using the

514: hybrid translation method, which were proven to be effective in

515: comparative experiments by Suzuki~\etal~\nocite{suzuki:nlp-99} (in the

516: case where retrieved documents are not in the user's native language).

517: Note that for the purpose of the translation of keywords, we can use

518: exactly the same method as performed for the query translation,

519: because both queries and keywords usually consist of one or more

520: content words.

521:

522: Finally, for the evaluation of our CLIR system we use the NACSIS

523: collection~\cite{kando:sigir-99}. Since in this collection relevance

524: assessment is performed between Japanese queries and Japanese/English

525: documents, we can easily evaluate our system in terms of

526: Japanese-English CLIR. On the other hand, the evaluation of

527: English-Japanese CLIR is beyond the scope of this paper, because as

528: discussed in Section~\ref{subsec:evaluation_methods} the production of

529: English queries has to be carefully conducted, and is thus expensive.

530: Besides this, in this paper we do not evaluate our system in terms of

531: presentation methodology, because experiments using human subjects are

532: also expensive and still problematic. These remaining issues need to

533: be further explored.

534:

535: \section{System Overview}

536: \label{sec:system_overview}

537:

538: Figure~\ref{fig:system} depicts the overall design of our CLIR system,

539: in which we combine a translator with an IR engine for monolingual

540: retrieval. In the following, we briefly explain the retrieval process

541: based on this figure.

542:

543: First, the translator processes a query in the source language

544: (query in S) to output the translation (query in T). For this

545: purpose, the translator uses a dictionary to derive possible

546: translation candidates and a collocation to resolve the

547: translation ambiguity. Note that a user can utilize more than one

548: translation candidate, because multiple translations are often

549: appropriate for a single query. By the collocation, we mean

550: bi-gram statistics associated with content words extracted from NACSIS

551: documents. Since our system is bidirectional between Japanese and

552: English, we tokenize documents with different methods, depending on

553: their language. For English documents, the tokenization involves

554: eliminating stopwords and identifying root forms for inflected content

555: words. For this purpose, we use

556: WordNet~\cite{fellbaum:wordnet-98}, which contains a stopword list

557: and correspondences between inflected words and their root form. On

558: the other hand, we segment Japanese documents into lexical units using

559: the ChaSen morphological analyzer~\cite{matsumoto:chasen-97},

560: which has commonly been used for much Japanese NLP research, and

561: extract content words based on their part-of-speech information.

562:

563: Second, the IR engine searches the NACSIS collection for documents

564: (docs in T) relevant to the translated query, and sorts them

565: according to the degree of relevance, in descending order. Our IR

566: engine is currently a simple implementation of the vector space model,

567: in which the similarity between the query and each document (i.e., the

568: degree of relevance of each document) is computed as the cosine of the

569: angle between their associated vectors. We used the notion of

570: TF$\cdot$IDF for term weighting. Among a number of variations of term

571: weighting methods~\cite{salton:ipm-88,zobel:sigir-forum-98}, we

572: tentatively implemented two alternative types of TF (term frequency)

573: and one type of IDF (inverse document frequency), as shown in

574: Equation~\eq{eq:tf_idf}.

575: \begin{equation}

576:   \label{eq:tf_idf}

577:   \begin{array}{llll}

578:     TF & = & f_{t,d} & (\mbox{\rm standard formulation}) \\

579:     \noalign{\vskip 1.2ex}

580:     TF & = & 1 + \log(f_{t,d}) & (\mbox{\rm logarithmic formulation})

581:     \\

582:     \noalign{\vskip 1.2ex}

583:     IDF & = & \log(\frac{\textstyle N}{\textstyle n_{t}})

584:   \end{array}

585: \end{equation}

Here, $f_{t,d}$ denotes the frequency with which term $t$ appears in
document $d$, and $n_{t}$ denotes the number of documents containing
term $t$. $N$ is the total number of documents in the collection. The
second TF type diminishes the effect of $f_{t,d}$, and consequently
IDF affects the similarity computation more. We shall call the first

591: and second TF types ``standard'' and ``logarithmic'' formulations,

592: respectively. For the indexing process, we first tokenize documents as

593: explained above (i.e., we use WordNet and ChaSen for English and

594: Japanese documents, respectively), and then conduct the word-based

595: indexing. That is, we use each content word as a single indexing term.

596: Since our focus in this paper is the query translation rather than the

597: retrieval process, we do not explore other IR techniques, including

598: query expansion and relevance feedback.
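The weighting and ranking scheme above can be illustrated with a
minimal sketch. It is not our actual implementation: the function names
are hypothetical, documents are assumed to be already tokenized into
content words, and weighting the query vector in the same way as the
document vectors is an assumption made only for this example.

```python
import math
from collections import Counter


def tf(freq, logarithmic=True):
    """Term frequency: raw count ("standard") or 1 + log(count) ("logarithmic")."""
    if freq <= 0:
        return 0.0
    return 1.0 + math.log(freq) if logarithmic else float(freq)


def build_vectors(docs, logarithmic=True):
    """docs: list of token lists. Returns (TF.IDF vectors, IDF table)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # n_t: number of documents containing t
    idf = {t: math.log(n_docs / n) for t, n in doc_freq.items()}
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)  # f_{t,d}
        vectors.append({t: tf(f, logarithmic) * idf[t] for t, f in counts.items()})
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, docs, logarithmic=True):
    """Return (similarity, document index) pairs, most similar first."""
    vectors, idf = build_vectors(docs, logarithmic)
    # Assumption: query terms are weighted like document terms;
    # terms that never occur in the collection are ignored.
    q_counts = Counter(t for t in query_tokens if t in idf)
    q_vec = {t: tf(f, logarithmic) * idf[t] for t, f in q_counts.items()}
    scores = [(cosine(q_vec, d), i) for i, d in enumerate(vectors)]
    return sorted(scores, key=lambda x: (-x[0], x[1]))
```

Passing {\tt logarithmic=False} switches to the standard TF
formulation, so the two variants can be compared on the same input.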

599:

600: Finally, in the case where retrieved documents are not in the user's

601: native language, we extract keywords from retrieved documents, and

602: translate them into the source language using the translator (KWs in

603: S). Unlike existing presentation methods, where keywords are words

604: frequently appearing in each

605: document~\cite{aone:anlp-97,suzuki:signl-98-7,suzuki:nlp-99}, we

606: tentatively use author keywords. In the NACSIS collection, each

607: document contains roughly 3-5 single/compound keywords provided by the

608: author(s) of the document. In addition, since the NACSIS documents are

609: relatively short abstracts (instead of entire papers), it is not

610: entirely satisfactory to rely on the word frequency information. Note

611: that even in the case where retrieved documents are in the user's

612: native language, presenting author keywords is expected to improve the

613: retrieval efficiency.

614:

615: For future enhancement, we optionally use an MT system to translate

616: entire documents retrieved (or only documents identified as relevant

617: using author keywords) into the user's native language (docs in S). We

618: currently use the Transer Japanese/English MT system, which combines a

619: general dictionary consisting of 230,000 entries, and a computer

620: terminology dictionary consisting of 100,000

621: entries.\footnote{Developed by NOVA, Inc.} Note that the translation

622: of the limited number of retrieved documents is less expensive than

623: that of the whole collection, as performed in the document translation

624: approach (see Section~\ref{subsec:retrieval_methods}).

625:

626: In Section~\ref{sec:translation}, we will explain the translator

627: in Figure~\ref{fig:system}, which involves compound word translation

628: and transliteration methods. While our translation method is

629: applicable to both queries and keywords in documents, in the following

630: we shall call it the query translation method without loss of

631: generality.

632:

633: \begin{figure}[htbp]

634:   \begin{center}

635:     \leavevmode

636:     \psfig{file=system.eps,height=1.8in}

637:   \end{center}

638:   \caption{The overall design of our CLIR system (S and T

639:   denote the source and target languages, respectively)}

640:   \label{fig:system}

641: \end{figure}

642:

643: \section{Query Translation Method}

644: \label{sec:translation}

645:

646: \subsection{Overview}

647: \label{subsec:trans_overview}

648:

649: Given a query in the source language, tokenization is first performed

650: as for target documents, that is, we use WordNet and ChaSen for

651: English and Japanese queries, respectively (see

652: Section~\ref{sec:system_overview}). We then discard stopwords and

653: extract only content words. Here, ``content words'' refer to both

654: single and compound words. Let us take the following English query as

655: an example:

656: \begin{list}{}{}

657: \item improvement or proposal of data mining methods.

658: \end{list}

659: For this query, we discard ``or'' and ``of,'' to extract

660: ``improvement,'' ``proposal'' and ``data mining methods.''

661: Thereafter, we translate each extracted content word on a word-by-word

662: basis, maintaining the word order in the source language. A

663: preliminary study showed that approximately 95\% of compound technical

664: terms defined in a bilingual dictionary~\cite{ferber:89} maintain the

665: same word order in both Japanese and English. Note that we currently

do not consider relations (e.g., syntactic relations) between content

667: words, and thus each content word is translated independently.  In

668: brief, our translation method consists of the following two phases:

669: \begin{enumerate}

670:   \def\labelenumi{(\theenumi)}

671: \item derive all possible translations for base words,

672: \item resolve translation ambiguity using the collocation associated

673:   with base word translations.

674: \end{enumerate}

675: While phase~(2) is the same for both Japanese-English and

676: English-Japanese translations, phase~(1) differs depending on the

677: source language. In the case of English-Japanese translation, we

678: simply consult our bilingual dictionary for each base word. However,

679: transliteration is performed whenever base words unlisted in the

680: dictionary are found.

681:

682: On the other hand, in the case of Japanese-English translation, we

683: consider all possible segmentations of the input word, by consulting

684: the dictionary, because Japanese compound words lack lexical

685: segmentation.\footnote{For Japanese query terms used in our evaluation

686: (see Section~\ref{sec:evaluation}), the average number of possible

687: segmentations was 4.9.} Then, we select such segmentations that

688: consist of the minimal number of base words. This segmentation method

689: parallels that for the Japanese compound noun

690: analysis~\cite{kobayashi:coling-94}. During the segmentation process,

691: the dictionary derives all possible translations for base words. At

692: the same time, transliteration is performed only when {\it katakana\/}

693: words unlisted in the base word dictionary are found.
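The segmentation step can be sketched as follows. This is only an
illustration under stated assumptions: the lexicon is a hypothetical
set of romanized base words (the actual system operates on Japanese
script and on our base word dictionary), and the function names are
introduced here for the example.

```python
def all_segmentations(word, lexicon):
    """Return every way to split `word` into consecutive entries of `lexicon`."""
    results = []

    def extend(start, parts):
        if start == len(word):
            results.append(list(parts))
            return
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece in lexicon:
                parts.append(piece)
                extend(end, parts)
                parts.pop()

    extend(0, [])
    return results


def minimal_segmentations(word, lexicon):
    """Keep only the segmentations that use the fewest base words."""
    candidates = all_segmentations(word, lexicon)
    if not candidates:
        return []
    fewest = min(len(c) for c in candidates)
    return [c for c in candidates if len(c) == fewest]


# Hypothetical romanized lexicon; "sou", "kan" and "suu" are added only
# to create competing segmentations.
LEXICON = {"soukan", "kansuu", "sou", "kan", "suu"}
print(minimal_segmentations("soukankansuu", LEXICON))
```

For the toy lexicon, four segmentations exist, and only the two-word
one ({\it soukan\/} {\it kansuu\/}) survives the minimality filter.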

694:

695: \subsection{Compound Word Translation}

696: \label{subsec:cwt}

697:

698: This section explains our compound word translation method based on a

699: probabilistic model, focusing mainly on the resolution of translation

700: ambiguity. After deriving possible translations for base words (by way

701: of either consulting the base word dictionary or performing

702: transliteration), we can formally represent the source compound word

703: $S$ and one translation candidate $T$ as below.

704: \begin{eqnarray*}

705:     S & = & s_{1}, s_{2}, \ldots, s_{n} \\

706:     T & = & t_{1}, t_{2}, \ldots, t_{n}

707: \end{eqnarray*}

Here, $s_{i}$ denotes the $i$-th base word, and $t_{i}$ denotes a
translation candidate of $s_{i}$. Our task, i.e., to select the $T$
which maximizes $P(T|S)$, is transformed into
Equation~\eq{eq:trans_model} through use of Bayes' theorem, as
performed in statistical machine translation~\cite{brown:cl-93}.

713: \begin{eqnarray}

714:   \label{eq:trans_model}

715:   \arg\max_{T}P(T|S) & = & \arg\max_{T}P(S|T)\cdot P(T)

716: \end{eqnarray}

717: In practice, in the case where the user utilizes more than one

718: translation, $T$'s with greater probabilities are selected. We

719: approximate $P(S|T)$ and $P(T)$ using statistics associated with base

720: words, as in Equation~\eq{eq:approx}.

721: \begin{equation}

722:   \label{eq:approx}

723:   \begin{array}{lll}

724:     P(S|T) & \approx & {\displaystyle \prod_{i=1}^{n}P(s_{i}|t_{i})} \\

725:     \noalign{\vskip 1.2ex}

726:     P(T) & \approx & {\displaystyle

727:     \prod_{i=1}^{n-1}P(t_{i+1}|t_{i})}

728:   \end{array}

729: \end{equation}

730: One may notice that this approximation is analogous to that for the

731: statistical part-of-speech tagging, where $s_{i}$ and $t_{i}$ in

732: Equation~\eq{eq:approx} correspond to a word and one of its

733: part-of-speech candidates, respectively~\cite{church:cl-93}. Here, we

734: estimate $P(t_{i+1}|t_{i})$ using the word-based bi-gram statistics

735: extracted from target language documents (i.e., the collocation in

736: Figure~\ref{fig:system}). Before elaborating on the estimation of

737: $P(s_{i}|t_{i})$ we explain the way to produce our bilingual

738: dictionary for base words, because $P(s_{i}|t_{i})$ is estimated using

739: this dictionary.
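A minimal sketch of how Equations~\eq{eq:trans_model}
and~\eq{eq:approx} can be combined to rank translation candidates is
given below. The function name, the data layout and all probability
values are assumptions for illustration; in particular, the bi-gram
values are invented, and assigning zero to unseen pairs (i.e., no
smoothing) is a simplification of this sketch.

```python
from itertools import product


def best_translations(source_words, translations, p_s_given_t, p_bigram, top_k=1):
    """Rank candidates T = t_1..t_n by prod P(s_i|t_i) * prod P(t_{i+1}|t_i)."""
    candidate_lists = [translations[s] for s in source_words]
    scored = []
    for target in product(*candidate_lists):
        score = 1.0
        # P(S|T): product of base word correspondence probabilities.
        for s, t in zip(source_words, target):
            score *= p_s_given_t.get((s, t), 0.0)
        # P(T): product of bi-gram probabilities over adjacent target words.
        for prev, nxt in zip(target, target[1:]):
            score *= p_bigram.get((prev, nxt), 0.0)
        scored.append((score, target))
    scored.sort(key=lambda x: -x[0])
    return scored[:top_k]


# Illustrative values only (not taken from our dictionary or collection).
TRANSLATIONS = {"soukan": ["associative", "correlation"], "kansuu": ["function"]}
P_S_GIVEN_T = {
    ("soukan", "associative"): 1 / 3,
    ("soukan", "correlation"): 1.0,
    ("kansuu", "function"): 1.0,
}
P_BIGRAM = {("correlation", "function"): 0.02, ("associative", "function"): 0.001}

print(best_translations(["soukan", "kansuu"], TRANSLATIONS, P_S_GIVEN_T, P_BIGRAM))
```

Setting {\tt top\_k} to a value greater than one corresponds to the
case where the user utilizes more than one translation.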

740:

741: For our dictionary production, we used the EDR technical terminology

742: dictionary~\cite{edr-techdic:95}, which includes approximately 120,000

743: Japanese-English translations related to the information processing

744: field. Since most of the entries are compound words, we need to

745: segment Japanese compound words, and correlate Japanese-English

746: translations on a word-by-word basis.  However, the complexity of

747: segmenting Japanese words becomes much greater as the number of

748: component base words increases. In consideration of these factors, we

749: first extracted 59,533 English words consisting of only {\em two\/}

750: base words, and their Japanese translations. We then developed simple

751: heuristics to segment Japanese compound words into two substrings. Our

heuristics rely mainly on Japanese character types, i.e., ``{\it
kanji},'' ``{\it katakana},'' ``{\it hiragana},'' alphabetic characters
and other characters like numerals. Note that {\it kanji\/} (or Chinese
character) is the Japanese ideogram, and {\it katakana\/} and {\it
hiragana\/} are phonograms.

757:

758: In brief, we segment each Japanese word at the boundary of different

759: character types (or at the leftmost boundary for words containing more

than one character type boundary). Although this method is relatively
simple, a preliminary study showed that it segments almost all words
correctly when they are in one of the following forms: ``{\tt CK},''
``{\tt CA},'' ``{\tt AK}'' and ``{\tt KA\/}.'' Here, ``{\tt C},''
``{\tt K\/}'' and ``{\tt A}'' denote {\it kanji}, {\it katakana\/} and
alphabet character sequences, respectively. For other combinations of
character types, we identified one or more cases in which our
segmentation method performed incorrectly.

768:

769: On the other hand, in the case where a given Japanese word consists of

770: a single character type, we segment the word at the middle (or at the

771: left-side of the middle character for words consisting of an odd

772: number of characters).  Note that roughly 90\% of Japanese words

773: consisting of four {\it kanji\/} characters can be correctly segmented

774: at the middle~\cite{kobayashi:coling-94}. However, in the case where

775: resultant substrings begin/end with characters that do not appear at

776: the beginning/end of words (for example, Japanese words rarely begin

777: with a long vowel), we shift the segmentation position to the right.
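The two-way segmentation heuristics described above can be sketched as
follows. This is a simplified illustration, not our implementation:
the Unicode ranges used to detect character types are an assumption of
this sketch, the input is assumed to contain at least two characters,
and only the long vowel mark is checked when shifting the segmentation
position, whereas the full set of characters that cannot begin or end
a word is not reproduced here.

```python
LONG_VOWEL = "\u30fc"  # katakana-hiragana prolonged sound mark


def char_type(ch):
    """Classify a character as kanji (C), katakana (K), hiragana (H),
    alphabet (A) or other (O). The ranges are assumptions of this sketch."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:
        return "C"
    if 0x30A0 <= code <= 0x30FF:
        return "K"
    if 0x3040 <= code <= 0x309F:
        return "H"
    if ch.isascii() and ch.isalpha():
        return "A"
    return "O"


def split_in_two(word):
    """Split a compound word (two or more characters) into two substrings."""
    # Mixed character types: cut at the leftmost character type boundary.
    for i in range(1, len(word)):
        if char_type(word[i - 1]) != char_type(word[i]):
            return word[:i], word[i:]
    # Single character type: cut at the middle (left of the middle
    # character when the length is odd).
    cut = len(word) // 2
    # Shift right while the second substring would begin with a long vowel.
    while cut < len(word) - 1 and word[cut] == LONG_VOWEL:
        cut += 1
    return word[:cut], word[cut:]


print(split_in_two("相関関数"))    # soukan kansuu (correlation function)
print(split_in_two("連想メモリ"))  # rensou memori (associative memory)
```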

778:

779: Tsuji and Kageura~\nocite{tsuji:nlprs-97} used the HMM to segment

780: Japanese compound words in an English-Japanese bilingual

781: dictionary. Their method can also segment words consisting of more

782: than two base words, and reportedly achieved an accuracy of roughly

783: 80-90\%, whereas our segmentation method is applicable only to those

784: consisting of two base words. However, while the HMM-based

785: segmentation is expected to improve the quality of our dictionary

786: production, in this paper we tentatively show that our

787: heuristics-based method is effective for CLIR despite its simple

788: implementation, by way of experiments (see

789: Section~\ref{sec:evaluation}).

790:

791: As a result, we obtained 24,439 Japanese and 7,910 English base words.

792: We randomly sampled 600 compound words, and confirmed that 95\% of

793: those words were correctly segmented.

794: Figure~\ref{fig:compound_word_dictionary} shows a fragment of the EDR

795: dictionary (after segmenting Japanese words), and

796: Figure~\ref{fig:base_word_dictionary} shows a base word dictionary

797: produced from entries in Figure~\ref{fig:compound_word_dictionary}.

798: Figure~\ref{fig:base_word_dictionary} contains Japanese variants, such

799: as {\it memori\/}/{\it memorii\/} for the English word ``memory.'' We

800: can easily produce a Japanese-English base word dictionary from

801: Figure~\ref{fig:compound_word_dictionary}, using the same procedure.

802:

803: During the dictionary production, we also count the correspondence

804: frequency for each combination of $s_{i}$ and $t_{i}$, in order to

805: estimate $P(s_{i}|t_{i})$. In Figure~\ref{fig:base_word_dictionary},

806: for example, the Japanese base word ``{\it soukan\/}'' corresponds

807: once to ``associative,'' and twice to ``correlation.'' Thus, we can

808: derive Equation~\eq{eq:soukan}.

809: \begin{equation}

810:   \label{eq:soukan}

811:   \begin{array}{lll}

812:     P(\mbox{associative}\:|\:{\it soukan}) & = & 1/3 \\

813:     \noalign{\vskip 0.6ex}

814:     P(\mbox{correlation}\:|\:{\it soukan}) & = & 2/3

815:   \end{array}

816: \end{equation}

817: However, in the case where $s_{i}$ is {\em transliterated\/} into

818: $t_{i}$, we replace $P(s_{i}|t_{i})$ with a probabilistic score

819: computed by our transliteration method (see

820: Section~\ref{subsec:translit}).

821:

822: One may argue that $P(s_{i}|t_{i})$ should be estimated based on real

823: world usage, i.e., bilingual corpora. However, such resources are

824: generally expensive to obtain, and we do not have Japanese-English

825: corpora with sufficient volume of alignment information at present

826: (see Section~\ref{subsec:our_approach} for more discussion).

827:

828: \begin{figure}[htbp]

829:   \def\baselinestretch{1}

830:   \begin{center}

831:     \leavevmode

832:     \small

833:     \begin{tabular}[t]{ll} \hline\hline

834:       {\hfill\centering English\hfill} & {\hfill\centering

835:       Japanese\hfill} \\ \hline

836:       CCD memory & CCD {\it memorii\/} \\

837:       IC memory & IC {\it memori\/} \\

838:       associative learning & {\it soukan gakushuu\/} \\

839:       associative memory & {\it rensou memori\/} \\

840:       associative record & {\it ketsugou rekoodo\/} \\

841:       correlation function & {\it soukan kansuu\/} \\

842:       error detection & {\it ayamari kenshutsu\/} \\

843:       factor correlation & {\it inshi soukan\/} \\

844:       hybrid IC & {\it haiburiddo shuusekikairo\/} \\ \hline

845:     \end{tabular}

846:   \end{center}

847:   \caption{A fragment of the EDR technical terminology dictionary}

848:   \label{fig:compound_word_dictionary}

849: \end{figure}

850:

851: \subsection{Transliteration}

852: \label{subsec:translit}

853:

854: This section explains our transliteration method, which identifies

855: phonetic equivalent translations for words unlisted in the base word

856: dictionary.

857:

858: Figure~\ref{fig:katakana} shows example correspondences between

859: English and (romanized) {\it katakana\/} words, where we insert

860: hyphens between each {\it katakana\/} character for enhanced

861: readability. The basis of our transliteration method is analogous to

862: that for compound word translation described in

Section~\ref{subsec:cwt}. The source word $S$ and one
transliteration candidate $T$ are represented as below.

865: \begin{eqnarray*}

866:     S & = & s_{1}, s_{2}, \ldots, s_{n} \\

867:     T & = & t_{1}, t_{2}, \ldots, t_{n}

868: \end{eqnarray*}

Here, unlike the case of compound word translation, $s_{i}$ and
$t_{i}$ denote the $i$-th ``symbols'' (which consist of one or more
letters) of $S$ and $T$, respectively. To derive possible $s_{i}$'s and $t_{i}$'s, we

872: consider all possible segmentations of the source word $S$, by

873: consulting a dictionary for symbols, namely the ``transliteration

874: dictionary.'' Then, we select such segmentations that consist of the

875: minimal number of symbols. Note that unlike the case of compound word

876: translation, the segmentation is performed for both Japanese-English

877: and English-Japanese transliterations.

878:

879: \begin{figure}[htbp]

880:   \def\baselinestretch{1}

881:   \begin{center}

882:     \leavevmode

883:     \small

884:     \begin{tabular}[t]{ll} \hline\hline

885:       {\hfill\centering English\hfill} & {\hfill\centering

886:       Japanese\hfill} \\ \hline

887:       CCD & CCD \\

888:       IC & IC, {\it shuusekikairo\/} \\

889:       associative & {\it soukan}, {\it rensou}, {\it ketsugou\/} \\

890:       correlation & {\it soukan\/} \\

891:       detection & {\it kenshutsu\/} \\

892:       error & {\it ayamari\/} \\

893:       factor & {\it inshi\/} \\

894:       function & {\it kansuu\/} \\

895:       hybrid & {\it haiburiddo\/} \\

896:       learning & {\it gakushuu\/} \\

897:       memory & {\it memori}, {\it memorii\/} \\

898:       record & {\it rekoodo\/} \\ \hline

899:     \end{tabular}

900:   \end{center}

901:   \caption{A fragment of an English-Japanese base word dictionary

902:   produced from Figure~\protect\ref{fig:compound_word_dictionary}}

903:   \label{fig:base_word_dictionary}

904: \end{figure}

905:

906: \begin{figure}[htbp]

907:   \def\baselinestretch{1}

908:   \begin{center}

909:     \leavevmode

910:     \small

911:     \begin{tabular}{ll} \hline\hline

912:       {\hfill\centering English \hfill} & {\hfill\centering Japanese

913:       \hfill} \\ \hline

914:       system & {\it shi-su-te-mu\/} \\

915:       mining & {\it ma-i-ni-n-gu\/} \\

916:       data & {\it dee-ta\/} \\

917:       network & {\it ne-tto-waa-ku\/} \\

918:       text & {\it te-ki-su-to\/} \\

919:       collocation & {\it ko-ro-ke-i-sho-n\/} \\ \hline

920:     \end{tabular}

921:     \caption{Example correspondences between English and  (romanized)

922:     Japanese {\it katakana\/} words}

923:     \label{fig:katakana}

924:   \end{center}

925: \end{figure}

926:

Thereafter, we resolve the transliteration ambiguity based on a

928: probabilistic model similar to that for the compound word translation.

929: To put it more precisely, we compute $P(T|S)$ for each $T$ using

930: Equation~\eq{eq:trans_model}, and select $T$'s with greater

931: probabilities. Note that $T$'s must be correct words (that are indexed

in the NACSIS document collection).  However, Equation~\eq{eq:approx},
which approximates $P(T)$ by combining $P(t_{i+1}|t_{i})$'s for substrings of
$T$, potentially assigns positive probability values to incorrect
(unindexed) words.

936:

937: In view of this problem, we estimate $P(T)$ as the probability that

938: $T$ occurs in the document collection, and consequently the

939: probability for unindexed words becomes zero. In practice, during the

940: segmentation process we simply discard such $T$'s that are unindexed

941: in the document collection, so that we can enhance the computation for

942: $P(T|S)$'s.  On the other hand, we approximate $P(S|T)$ as in

943: Equation~\eq{eq:approx}, and estimate $P(s_{i}|t_{i})$ based on the

944: correspondence frequency for each combination of $s_{i}$ and $t_{i}$

945: in the transliteration dictionary.
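The candidate selection described above can be sketched as follows.
The function name, the symbol segmentations, the symbol probabilities
and the word frequencies are all invented for illustration; only the
overall scheme (discard unindexed candidates, then multiply the
relative frequency of $T$ by the symbol correspondence probabilities)
follows the description in the text.

```python
def rank_transliterations(candidates, p_symbol, word_freq):
    """candidates: list of [(s_1, t_1), ..., (s_n, t_n)] symbol alignments.
    Returns (score, target word) pairs, best first."""
    total = sum(word_freq.values())
    ranked = []
    for symbols in candidates:
        target = "".join(t for _, t in symbols)
        freq = word_freq.get(target, 0)
        if freq == 0:
            continue  # unindexed word: P(T) = 0, so discard it early
        score = freq / total  # P(T): relative frequency in the collection
        for s, t in symbols:
            score *= p_symbol.get((s, t), 0.0)  # P(S|T) as a product
        ranked.append((score, target))
    ranked.sort(key=lambda x: -x[0])
    return ranked


# Hypothetical inputs for the katakana word "dee-ta".
CANDIDATES = [
    [("dee", "da"), ("ta", "ta")],  # "data"
    [("dee", "de"), ("ta", "ta")],  # "deta", not indexed
]
P_SYMBOL = {("dee", "da"): 0.4, ("dee", "de"): 0.5, ("ta", "ta"): 0.9}
WORD_FREQ = {"data": 120, "date": 40}

print(rank_transliterations(CANDIDATES, P_SYMBOL, WORD_FREQ))
```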

946:

The crucial issue here is the way to produce the transliteration

948: dictionary, because such dictionaries have rarely been published. For

949: the purpose of dictionary production, we used approximately 35,000

950: {\it katakana\/} Japanese words and their English translations

951: collected from the EDR technical terminology

952: dictionary~\cite{edr-techdic:95} and bilingual

953: dictionary~\cite{edr-bilindic:95}. To illustrate our dictionary

954: production method, we consider Figure~\ref{fig:katakana}

955: again. Looking at this figure, one may notice that the first letter in

956: each {\it katakana\/} character tends to be contained in its

957: corresponding English word. However, there are a few exceptions. A

958: typical case is that since Japanese has no distinction between ``L''

959: and ``R'' sounds, the two English sounds collapse into the same

960: Japanese sound. In addition, a single English letter may correspond to

961: multiple {\it katakana\/} characters, such as ``x'' to ``{\it

962: ki-su\/}'' in \mbox{``$<$text, {\it te-ki-su-to\/}$>$.''} To sum up,

963: English and romanized {\it katakana\/} words are not exactly

964: identical, but similar to each other.

965:

966: We first manually defined the similarity between the English letter

967: $e$ and the first romanized letter for each {\it katakana\/} character

968: $j$, as shown in Table~\ref{tab:katakana}. In this table,

969: ``phonetically similar'' letters refer to a certain pair of letters,

970: such as ``L'' and ``R,'' for which we identified approximately twenty

971: pairs of letters. We then consider the similarity for any possible

972: combination of letters in English and romanized {\it katakana\/}

973: words, which can be represented as a matrix, as shown in

974: Figure~\ref{fig:matrix}. This figure shows the similarity between

975: letters in \mbox{``$<$text, {\it te-ki-su-to\/}$>$.''}  We put a dummy

976: letter ``\$,'' which has a positive similarity only to itself, at the

977: end of both English and {\it katakana\/} words.
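The letter similarity of Table~\ref{tab:katakana} and the resulting
matrix can be sketched as below. Several details are assumptions of
this sketch rather than part of our definition: the set of
phonetically similar pairs is a small hypothetical subset (only ``L''
and ``R'' is taken from the text), the vowel set is limited to five
letters, and the dummy letter is given a self-similarity of 3. The
search for the best path through the matrix is not shown.

```python
VOWELS = set("aeiou")
# Hypothetical subset of the phonetically similar pairs; only l/r is
# named in the text.
SIMILAR = {frozenset("lr"), frozenset("bv"), frozenset("ck")}
DUMMY = "$"


def similarity(e, j):
    """Similarity between English letter e and romanized katakana letter j."""
    e, j = e.lower(), j.lower()
    if e == DUMMY or j == DUMMY:
        return 3 if e == j else 0  # assumption: the dummy scores 3 with itself
    if e == j:
        return 3
    if frozenset((e, j)) in SIMILAR:
        return 2
    if (e in VOWELS) == (j in VOWELS):
        return 1  # both vowels, or both consonants
    return 0


def similarity_matrix(english, katakana_chars):
    """Rows: English letters; columns: first letter of each katakana character."""
    rows = list(english) + [DUMMY]
    cols = [k[0] for k in katakana_chars] + [DUMMY]
    return [[similarity(e, j) for j in cols] for e in rows]


for row in similarity_matrix("text", ["te", "ki", "su", "to"]):
    print(row)
```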

978:

979: One may notice that matching plausible symbols can be seen as finding

980: the path which maximizes the total similarity from the first to last

981: letters. The best path can efficiently be found by, for example,

982: Dijkstra's algorithm~\cite{dijkstra:nm-59}. From

983: Figure~\ref{fig:matrix}, we can derive the following correspondences:

984: \mbox{``$<$te, {\it te\/}$>$,''} \mbox{``$<$x, {\it ki-su\/}$>$''} and

985: \mbox{``$<$t, {\it to\/}$>$.''} In practice, to exclude noisy

986: correspondences, we used only English-Japanese translations whose

987: total similarity from the first to last letters is above a predefined

988: threshold. The resultant transliteration dictionary contains 432

989: Japanese and 1018 English symbols, from which we estimated

990: $P(s_{i}|t_{i})$.

991:

992: \begin{table}[htbp]

993:   \def\baselinestretch{1}

994:   \begin{center}

995:     \caption{The similarity between English letter $e$ and Japanese

996:     letter $j$}

997:     \medskip \leavevmode \small

998:     \begin{tabular}{lc} \hline\hline

999:       {\hfill\centering Condition \hfill} & {\hfill\centering

1000:       Similarity \hfill} \\ \hline

1001:       $e$ and $j$ are identical & 3 \\

1002:       $e$ and $j$ are phonetically similar & 2 \\

1003:       both $e$ and $j$ are vowels or consonants & 1 \\

1004:       otherwise & 0 \\ \hline

1005:     \end{tabular}

1006:     \label{tab:katakana}

1007:   \end{center}

1008: \end{table}

1009:

1010: \begin{figure}[htbp]

1011:   \begin{center}

1012:     \leavevmode

1013:     \psfig{file=matrix.eps,height=2in}

1014:   \end{center}

1015:   \caption{An example matrix for English-Japanese symbol matching

1016:   (arrows denote the best path)}

1017:   \label{fig:matrix}

1018: \end{figure}

1019:

1020: To evaluate our transliteration method, we extracted Japanese {\it

1021: katakana\/} words (excluding compound words) and their English

1022: translations from an English-Japanese

dictionary~\cite{nichigai_compdic:96}. We then discarded
Japanese/English pairs that were not phonetically equivalent to each
other, as well as pairs that were listed in the EDR dictionaries. For the resultant 248
pairs, the accuracy of our transliteration method was 65.3\%.

1027:

1028: Thus, our transliteration method is less accurate than the word-based

1029: translation. For example, the {\it katakana\/} word ``{\it

1030: re-ji-su-ta}~(register/resistor)'' is transliterated into

``resister,'' ``resistor'' and ``register,'' in descending order of the
probability score. Note that Japanese seldom represents

1033: ``resister'' as ``{\it re-ji-su-ta\/}'' (whereas it can be

1034: theoretically correct when this word is written in {\it katakana\/}

1035: characters), because ``resister'' corresponds to more appropriate

1036: translations in {\it kanji\/} characters. However, the compound word

1037: translation is expected to select appropriate transliteration

1038: candidates. For example, ``re-ji-su-ta'' in the compound word ``{\it

1039: re-ji-su-ta\/} {\it tensou\/} {\it gengo\/}~(register transfer

1040: language)'' is successfully translated, given a set of base words

1041: ``{\it tensou\/}~(transfer)'' and ``{\it gengo\/}~(language)'' as a

1042: context.

1043:

1044: Finally, we devote a little more space to compare our transliteration

1045: method and other related works.

1046: Chen~\etal~\nocite{chen:coling-acl-98} proposed a Chinese-English

transliteration method. Given a (romanized) source word, their method
computes the similarity between the source word and each target word
listed in the dictionary. In brief, the more letters two words share
in common, the more similar they are. In other words, unlike our case,
their method disregards the order of letters in source and target

1052: words, which potentially degrades the transliteration accuracy. In

1053: addition, since for each source word the similarity is computed

1054: between all the target words (or words that share at least one common

1055: letter with the source word), the similarity computation can be

1056: prohibitive. Lee and Choi~\nocite{lee:iral-97} explored English-Korean

1057: transliteration, where they automatically produced a transliteration

1058: model from a word-aligned corpus. In brief, they first consider all

1059: possible English-Korean symbol correspondences for each word

1060: alignment. Then, iterative estimation is performed to select such

1061: symbol correspondences that maximize transliteration accuracy on

1062: training data. However, when compared with our symbol alignment

1063: method, their iterative estimation method is computationally

1064: expensive. Knight and Graehl~\nocite{knight:cl-98} proposed a

1065: Japanese-English transliteration method based on the mapping

1066: probability between English and Japanese {\it katakana\/}

1067: sounds. However, while their method needs a large-scale phoneme

1068: inventory, we use a simpler approach using surface mapping between

1069: English and {\it katakana\/} characters, as defined in our

1070: transliteration dictionary. Note that none of those above methods has

1071: been evaluated in the context of CLIR. Empirical comparison of

1072: different transliteration methods needs to be further explored.

1073:

1074: \subsection{Further Enhancement of Translation}

1075: \label{subsec:dictionary_enhancement}

1076:

1077: This section explains two additional methods to enhance the query

1078: translation.

1079:

1080: First, we can enhance our base word dictionary with {\em general\/}

1081: words, because technical compound words sometimes include general

1082: words, as discussed in Section~\ref{sec:introduction}. Note that in

1083: Section~\ref{subsec:cwt} we produced our base word dictionary from the

1084: EDR {\em technical\/} terminology dictionary. Thus, we used the EDR

1085: bilingual dictionary~\cite{edr-bilindic:95}, which consists of

1086: approximately 370,000 Japanese-English translations aimed at general

1087: usage. However, unlike in the case of technical terms, it is not

1088: feasible to segment general compound words, such as ``hot dog,'' into

1089: base words. Thus, we simply extracted 162,751 Japanese and 67,136

1090: English single words (i.e., words that consist of a single base word)

1091: from this dictionary.  In addition, to minimize the degree of

1092: translation ambiguity, we use general translations only when (a) base

1093: words unlisted in our technical term dictionary are found, and (b) our

1094: transliteration method fails to output any candidates for those

1095: unlisted base words.

1096:

1097: Second, in Section~\ref{sec:introduction} we also identified that

1098: English technical terms are often abbreviated, such as ``IR'' and

1099: ``NLP,'' and they can be used as Japanese words. One solution would be

1100: to output those abbreviated words as they are, for both

1101: Japanese-English and English-Japanese translations. On the other hand,

1102: it is expected that we can improve the recall by using complete forms

1103: along with their abbreviated forms. To realize this notion, we

1104: extracted 7,307 tuples of each abbreviation and its complete form from

1105: the NACSIS English document collection, using simple heuristics. Our

heuristics rely on the assumption that either abbreviations or

1107: complete forms often appear in parentheses headed by their

1108: counterparts, as shown below:

1109: \begin{quote}

1110:   Natural Language Processing (NLP), \\

1111:   cross-language information retrieval (CLIR), \\

1112:   MRDs (machine readable dictionaries).

1113: \end{quote}

1114: While the first example is the most straightforward, in the second and

third examples we disregard a hyphen and a lowercase letter (i.e., ``s''
in ``MRDs''), respectively. In practice, we can easily extract such
tuples using regular expression pattern matching.

1118: Figure~\ref{fig:abbreviation} shows example tuples of abbreviations

1119: and complete forms extracted from the NACSIS collection.  In this

figure, the column ``Frequency'' denotes the number of times each tuple
appears in the collection, with which we can optionally set a cut-off

1122: threshold for multiple complete forms corresponding to a single

1123: abbreviation (e.g., ``information retrieval,'' ``isoprene rubber'' and

1124: ``insulin receptor'' for ``IR'').
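A minimal sketch of such pattern matching is shown below. The patterns
and function names are assumptions for illustration and are not the
heuristics we actually used: abbreviations are taken to be two or more
capital letters optionally followed by ``s,'' and a complete form is
accepted only when the initials of its words (split at spaces and
hyphens) spell the abbreviation.

```python
import re

PAREN = re.compile(r"\(([^()]+)\)")
WORD = re.compile(r"[A-Za-z]+")
ABBREV = re.compile(r"([A-Z]{2,})s?")


def abbreviation_core(token):
    """Return the capital letters of an abbreviation (dropping a plural "s"),
    or None if the token does not look like an abbreviation."""
    m = ABBREV.fullmatch(token)
    return m.group(1) if m else None


def initials(words):
    return "".join(w[0].upper() for w in words)


def extract_pairs(text):
    """Extract (abbreviation, complete form) tuples around parentheses."""
    pairs = []
    for m in PAREN.finditer(text):
        inside = m.group(1).strip()
        before = text[:m.start()].rstrip()
        core = abbreviation_core(inside)
        if core:
            # Pattern "complete form (ABBR)": use the last len(ABBR) words.
            words = list(WORD.finditer(before))[-len(core):]
            if len(words) == len(core) and initials(w.group() for w in words) == core:
                pairs.append((core, before[words[0].start():]))
            continue
        # Pattern "ABBR (complete form)".
        last = re.search(r"\S+$", before)
        core = abbreviation_core(last.group()) if last else None
        inner_words = WORD.findall(inside)
        if core and len(inner_words) == len(core) and initials(inner_words) == core:
            pairs.append((core, inside))
    return pairs


TEXT = ("Natural Language Processing (NLP), "
        "cross-language information retrieval (CLIR), "
        "MRDs (machine readable dictionaries).")
print(extract_pairs(TEXT))
```

Counting the extracted tuples over the whole collection (e.g., with a
frequency table) would give the ``Frequency'' column used for the
cut-off threshold.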

1125:

1126: \begin{figure}[htbp]

1127:   \def\baselinestretch{1}

1128:   \begin{center}

1129:     \leavevmode

1130:     \small

1131:     \begin{tabular}[t]{llc} \hline\hline

1132:       {\hfill\centering Abbreviation\hfill} & {\hfill\centering

1133:       Complete form\hfill} & {\hfill\centering Frequency\hfill} \\ \hline

1134:       IR & information retrieval & 3 \\

1135:       IR & isoprene rubber & 1 \\

1136:       IR & insulin receptor & 1 \\

1137:       MT & machine translation & 11 \\

1138:       MT & mobile telephone & 3 \\

1139:       NLP & natural language processing & 8 \\ \hline

1140:     \end{tabular}

1141:   \end{center}

1142:   \caption{Example abbreviations and their complete forms}

1143:   \label{fig:abbreviation}

1144: \end{figure}
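
The extraction heuristic described above can be sketched in Python as follows. This is only an illustration under stated assumptions, and not necessarily the implementation used in the experiments: the two regular expressions, the function names ({\tt match\_form}, {\tt extract\_pairs}, {\tt apply\_cutoff}) and the rule that the initials of the trailing words must spell the abbreviation are all hypothetical. Hyphens are treated like spaces and a trailing lowercase ``s'' on the abbreviation is dropped, mirroring the second and third examples.

```python
import re
from collections import Counter

# Hypothetical patterns (assumptions, not taken from the paper):
#   "complete form (ABBR)"  and  "ABBR (complete form)".
# A trailing lowercase "s" after the abbreviation (as in "MRDs") is ignored.
PHRASE = r"(?:[A-Za-z]+[ -]+)+[A-Za-z]+"
FORM_THEN_ABBR = re.compile(r"(" + PHRASE + r") \(([A-Z]{2,})s?\)")
ABBR_THEN_FORM = re.compile(r"\b([A-Z]{2,})s? \((" + PHRASE + r")\)")


def match_form(abbr, phrase):
    """Return the trailing words of `phrase` whose initials spell `abbr`, else None."""
    words = re.split(r"[ -]+", phrase)  # a hyphen counts as a word boundary
    tail = words[-len(abbr):]
    if len(tail) == len(abbr) and all(
        word[0].upper() == letter for word, letter in zip(tail, abbr)
    ):
        return " ".join(tail).lower()
    return None


def extract_pairs(text):
    """Count (abbreviation, complete form) pairs found in `text`."""
    counts = Counter()
    for m in FORM_THEN_ABBR.finditer(text):
        form = match_form(m.group(2), m.group(1))
        if form:
            counts[(m.group(2), form)] += 1
    for m in ABBR_THEN_FORM.finditer(text):
        form = match_form(m.group(1), m.group(2))
        if form:
            counts[(m.group(1), form)] += 1
    return counts


def apply_cutoff(counts, min_freq):
    """Keep only the pairs observed at least `min_freq` times."""
    return {pair: n for pair, n in counts.items() if n >= min_freq}
```

With the frequencies listed in Figure~\ref{fig:abbreviation}, a cut-off of 2 would keep ``information retrieval'' for ``IR'' and drop the two complete forms that appear only once.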

1145:

1146: \section{Evaluation}

1147: \label{sec:evaluation}

1148:

1149: \subsection{Methodology}

1150: \label{subsec:eval_overview}

1151:

1152: We investigated the performance of our system in terms of

1153: Japanese-English CLIR, based on the TREC-type evaluation methodology.

1154: That is, the system outputs the top 1,000 documents, and the TREC

1155: evaluation software was used to plot recall-precision curves and

1156: calculate non-interpolated average precision values.
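
For reference, the non-interpolated average precision of a single query can be sketched in Python as below. The sketch follows the usual TREC-style definition, in which the precision values observed at the ranks of the relevant documents are summed and divided by the total number of relevant documents. The function names and the argument layout are hypothetical, and the code is not the TREC evaluation software itself.

```python
def average_precision(ranked_ids, relevant_ids, cutoff=1000):
    """Non-interpolated average precision for one query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Average over queries; `runs` is a list of (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Dividing by the number of all relevant documents, rather than by the number retrieved, penalizes a run that misses relevant documents within the cut-off.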

1157:

1158: For the purpose of our evaluation, we used a preliminary version of

1159: the NACSIS test collection~\cite{kando:sigir-99}. This collection

1160: includes approximately 330,000 documents (in either a combination of

1161: English and Japanese or either of the languages individually),

1162: collected from technical papers published by 65 Japanese associations

1163: for various fields.\footnote{The official version of the NACSIS

1164: collection includes 39 Japanese queries and the same document set as

1165: in the preliminary version we used. NACSIS (National Center for

1166: Science Information Systems, Japan) held a TREC-type (CL)IR contest

1167: workshop in August 1999, and participants, including the authors of

1168: this paper, were provided with the whole document set and 21 queries

1169: for training. These 21 queries are included in the final package of

1170: the test collection. See {\tt

1171: http://www.rd.nacsis.ac.jp/\~{}ntcadm/workshop/work-en.html} for

1172: details.} Each document consists of the document ID, title, name(s) of

1173: author(s), name/date of conference, hosting organization, abstract and

1174: keywords, from which we used titles, abstracts and keywords for the

1175: indexing. We used as target documents approximately 187,000 entries

1176: where abstracts are in both English and Japanese.

1177:

1178: This collection also includes 21 Japanese queries. Each query

1179: consists of the query ID, title of the topic, description, narrative

1180: and list of synonyms, from which we used only the

1181: description.\footnote{In the NACSIS workshop, each participant can

1182: submit more than one retrieval result using different

1183: systems. However, at least one result must be gained with only the

1184: description field.} In general, most topics are related to electronic,

1185: information and control engineering. Figure~\ref{fig:query} shows

1186: example descriptions (translated into English by one of the authors).

1187:

1188: In the NACSIS collection, relevance assessment was performed based on

1189: the pooling method~\cite{voorhees:sigir-98}. That is, candidates for

1190: relevant documents were first obtained with multiple retrieval

1191: systems. Thereafter, for each candidate document, human expert(s)

1192: assigned one of three ranks of relevance, i.e., ``relevant,''

1193: ``partially relevant'' and \mbox{``irrelevant.''} The average number

1194: of candidate documents for each query is 4,400, among which the numbers

1195: of relevant and partially relevant documents are 144 and 13,

1196: respectively. In our evaluation, we did not regard partially relevant

1197: documents as relevant ones, because (a) the result did not

1198: significantly change depending on whether we regarded partially

1199: relevant as relevant or not, and (b) interpretation of partially

1200: relevant is not fully clear to the authors.
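
The pooling step can be illustrated with the following Python sketch. The function name {\tt build\_pool} and the pool depth are assumptions made for illustration only; the depth actually used for the NACSIS collection is not stated here.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` documents of every run, in first-seen order.

    `runs` is a list of ranked document-ID lists, one per retrieval system.
    The resulting pool is what human assessors would then judge.
    """
    pool = []
    seen = set()
    for ranked_ids in runs:
        for doc_id in ranked_ids[:depth]:
            if doc_id not in seen:
                seen.add(doc_id)
                pool.append(doc_id)
    return pool
```

Documents outside the pool are never judged and are therefore treated as irrelevant by the evaluation.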

1201:

1202: Since the NACSIS collection does not contain English queries, we

1203: cannot estimate a baseline for Japanese-English CLIR performance based

1204: on English-English IR. Instead, we used a Japanese-Japanese IR system,

1205: which uses as documents Japanese titles/abstracts/keywords comparable

1206: to English fields in the NACSIS collection.  One may argue that we can

1207: manually translate Japanese queries into English. However, as

1208: discussed in Section~\ref{subsec:evaluation_methods}, the CLIR

1209: performance varies depending on the quality of translation, and thus

1210: we avoided an arbitrary evaluation.

1211:

1212: \begin{figure}[htbp]

1213:   \def\baselinestretch{1}

1214:   \begin{center}

1215:     \leavevmode

1216:     \small

1217:     \begin{tabular}{cl} \hline\hline

1218:       ID & {\hfill\centering Description\hfill} \\ \hline

1219:       0005 & dimension reduction for clustering \\

1220:       0006 & intelligent information retrieval using agent functions \\

1221:       0019 & syntactic analysis methods for Japanese sentences \\

1222:       0024 & machine translation systems \\ \hline

1223:     \end{tabular}

1224:     \caption{Example query descriptions in the NACSIS collection}

1225:     \label{fig:query}

1226:   \end{center}

1227: \end{figure}

1228:

1229: \subsection{Quantitative Comparison}

1230: \label{subsec:quantitative}

1231:

1232: We compared the following query translation methods:

1233: \begin{itemize}

1234: \item all possible translations derived from the (original) EDR

1235:   technical terminology dictionary~\cite{edr-techdic:95} are used for

1236:   query terms, which can be seen as a lower bound method of this

1237:   comparative experiment (``EDR''),

1238: \item all possible base word translations derived from our base word

1239:   dictionary are used (``ALL''),

1240: \item $k$-best translations selected by our compound word translation

1241:   method are used, where transliteration is not used (``CWT''),

1242: \item transliteration is performed for unlisted {\it katakana\/} words

1243:   in CWT above, which represents the overall query

1244:   translation method we proposed in this paper (``TRL'').

1245: \end{itemize}

1246: One may notice that both EDR and ALL correspond to the

1247: dictionary-based method, and CWT and TRL correspond to the

1248: hybrid method described in Section~\ref{subsec:retrieval_methods}. In

1249: the case of EDR, compound words unlisted in the EDR dictionary

1250: were manually segmented so that substrings (shorter compound words or

1251: base words) could be translated. There was almost no translation

1252: ambiguity in the case of EDR. In addition, preliminary experiments

1253: showed that disambiguation degraded the retrieval performance for

1254: EDR. In CWT and TRL, $k$ is a parametric constant, for

1255: which we set \mbox{$k=1$}, because this value achieved the best

1256: performance in our preliminary experiments. By increasing

1257: the value of $k$, we theoretically gain a query expansion effect,

1258: because multiple semantically related translations are used as query

1259: terms. However, in our case, additional translations were rather noisy

1260: with respect to the retrieval performance. Note that in this

1261: experiment, we did not use the general and abbreviation dictionaries.

1262: We will discuss the effect of those dictionaries in

1263: Section~\ref{subsec:dictionary_enhancement}.
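
The role of the parameter $k$ discussed above can be illustrated with a small Python sketch. It assumes that every combination of base word translations receives some score; the scoring function used here is an arbitrary stand-in with made-up numbers, not the probabilistic model of our compound word translation method, and the names {\tt k\_best\_translations} and {\tt query\_terms} are hypothetical.

```python
import heapq
from itertools import product


def k_best_translations(base_word_candidates, score, k=1):
    """Return the k highest-scoring combinations of base word translations.

    `base_word_candidates` holds one list of candidate translations per
    base word of the compound; `score` maps a tuple of words to a number.
    """
    combinations = product(*base_word_candidates)
    return heapq.nlargest(k, combinations, key=score)


def query_terms(translations):
    """Flatten the selected translations into a duplicate-free term list."""
    terms = []
    for translation in translations:
        for word in translation:
            if word not in terms:
                terms.append(word)
    return terms
```

With \mbox{$k=1$} only the words of the single best translation become query terms; a larger $k$ adds the words of lower-ranked translations, which is the query expansion effect mentioned above and also the source of the noise we observed.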

1264:

1265: Table~\ref{tab:avg_pre} shows the non-interpolated average precision

1266: values, averaged over the 21 queries, for different combinations of

1267: query translation and retrieval methods. It is worth comparing the

1268: effectiveness of query translation methods with different retrieval

1269: methods, because advanced retrieval methods potentially overcome the

1270: rudimentary nature of query translation methods, and therefore may

1271: overshadow the difference of query translation methods in CLIR

1272: performance. In consideration of this problem, as described in

1273: Section~\ref{sec:system_overview}, we adopted two alternative term

1274: weighting methods, i.e., the standard and logarithmic formulations. In

1275: addition, we used as the IR engine in Figure~\ref{fig:system} the

1276: SMART system~\cite{salton:71}, where the augmented TF$\cdot$IDF term

1277: weighting method (``ATC'') was used for both queries and

1278: documents. This makes it easy for other researchers to rigorously

1279: compare their query translation methods with ours within the same

1280: evaluation environment, because the SMART system is available to the

1281: public.
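
As a rough illustration of the ``ATC'' weighting mentioned above, the following Python sketch uses the commonly cited reading of the SMART notation: augmented term frequency, inverse document frequency, and cosine normalization. The function names and the data layout are assumptions, and the sketch does not reproduce other components of the SMART system such as stemming or stopword removal.

```python
import math
from collections import Counter


def atc_weights(tokens, doc_freq, num_docs):
    """Weight the terms of one text with augmented TF, IDF and cosine normalization.

    `doc_freq` maps a term to the number of documents containing it and
    `num_docs` is the collection size (both hypothetical inputs).
    """
    if not tokens:
        return {}
    tf = Counter(tokens)
    max_tf = max(tf.values())
    raw = {}
    for term, freq in tf.items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term unseen in the collection: no IDF available
        augmented_tf = 0.5 + 0.5 * freq / max_tf
        idf = math.log(num_docs / df)
        raw[term] = augmented_tf * idf
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return {}
    return {term: w / norm for term, w in raw.items()}


def cosine_score(query_weights, doc_weights):
    """Inner product of two normalized weight vectors."""
    return sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())
```

Because both vectors are normalized to unit length, the inner product is the cosine of the angle between the query and the document.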

1282:

1283: In Table~\ref{tab:avg_pre}, J-J refers to the baseline performance,

1284: that is, the result obtained by the Japanese-Japanese IR system.  Note

1285: that the performance of J-J using the SMART system is not available

1286: because this system is not implemented for the retrieval of Japanese

1287: documents. The column ``\# of Terms'' denotes the average number of

1288: query terms used for the retrieval, where the number of terms used in

1289: ALL was approximately seven times as great as those of other

1290: methods. The following observations can be derived from these results.

1291:

1292: \begin{table}[htbp]

1293:   \def\baselinestretch{1}

1294:   \begin{center}

1295:     \caption{Non-interpolated average precision values,

1296:     averaged over the 21 queries, for different combinations of query

1297:     translation and retrieval methods}

1298:     \medskip

1299:     \leavevmode

1300:     \small

1301:     \begin{tabular}{lccccc} \hline\hline

1302:       & & \multicolumn{3}{c}{Retrieval Method} \\ \cline{3-5}

1303:       & \# of Terms & Standard TF & Logarithmic TF & SMART \\ \hline

1304:       J-J & 4.0 & 0.2085 & 0.2443 & --- \\

1305:       TRL & 4.0 & 0.2427 & 0.2911 & 0.3147 \\

1306:       CWT & 3.9 & 0.2324 & 0.2680 & 0.2770 \\

1307:       ALL & 21  & 0.1971 & 0.2271 & 0.2106 \\

1308:       EDR & 4.1 & 0.1785 & 0.2173 & 0.2477 \\ \hline

1309:     \end{tabular}

1310:     \label{tab:avg_pre}

1311:   \end{center}

1312: \end{table}

1313:

1314: First, the relative superiority between EDR and ALL varies

1315: depending on the retrieval method. Since neither case resolved the

1316: translation ambiguity, the difference in performance for the two

1317: translation methods is reduced solely to the difference between the

1318: two dictionaries. Therefore, the base word dictionary we produced was

1319: effective when combined with the standard and logarithmic TF

1320: formulations. However, the translation disambiguation as performed in

1321: CWT improved the performance of ALL, and consequently CWT

1322: outperformed EDR irrespective of the retrieval method. To sum up,

1323: our compound word translation method was more effective than the use

1324: of an existing dictionary, in terms of CLIR performance.

1325:

1326: Second, by comparing results of CWT and TRL, one can see that

1327: our transliteration method further improved the performance of the

1328: compound word translation relying solely on the base word dictionary,

1329: irrespective of the retrieval method.  Since TRL represents the

1330: overall performance of our system, it is worth comparing TRL and

1331: EDR (i.e., a lower bound method) more carefully. Thus, we used the

1332: paired t-test for statistical testing, which investigates whether the

1333: difference in performance is meaningful or simply due to

1334: chance~\cite{hull:sigir-93,keen:ipm-92}. We found that the average

1335: precision values of TRL and EDR are significantly different

1336: (at the 5\% level), for any of the three retrieval methods.
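
The paired t-test on per-query average precision values can be sketched in Python as follows. The function name is hypothetical and the per-query scores are the caller's input; no values from our experiments are reproduced here. Assuming a two-sided test, the statistic for 21 queries has 20 degrees of freedom, for which the 5\% critical value is approximately 2.086.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired per-query scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems need one score per query")
    n = len(scores_a)
    if n < 2:
        raise ValueError("at least two queries are required")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if variance == 0.0:
        raise ValueError("all differences are identical")
    return mean / math.sqrt(variance / n), n - 1
```

The difference is judged significant when the absolute value of the statistic exceeds the critical value for the given degrees of freedom.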

1337:

1338: Third, the performance was generally improved as a more sophisticated

1339: retrieval method was used, for all of the translation methods

1340: except ALL. In other words, enhancements of the query

1341: translation and IR engine independently improved on the performance of

1342: our CLIR system. Note that the difference between the SMART system and

1343: the other two methods is due to more than one factor, including

1344: stemming and term weighting methods. This suggests that our system may

1345: achieve a higher performance using other advanced IR techniques.

1346:

1347: Finally, TRL and CWT outperformed J-J for any of the

1348: retrieval methods. However, these differences are partially attributed

1349: to the different properties inherent in Japanese and English IR. For

1350: example, the performance of Japanese IR is more strongly dependent on

1351: the indexing method than English IR, since Japanese lacks lexical

1352: segmentation. This issue needs to be further explored.

1353:

1354: Figures~\ref{fig:rp_raw_TF}--\ref{fig:rp_smart} show recall-precision

1355: curves of different query translation methods, for different retrieval

1356: methods, respectively. In these figures, while the superiority of EDR

1357: and ALL in terms of precision varies depending on the recall, one can

1358: see that CWT outperformed EDR and ALL, and that TRL outperformed CWT,

1359: regardless of the recall. In Figures~\ref{fig:rp_raw_TF} and

1360: \ref{fig:rp_log_TF}, J-J generally performed better at lower recall

1361: while each of the four CLIR methods performed better at higher recall. As

1362: discussed above, possible rationales would include the difference

1363: between Japanese and English IR. To put it more precisely, in Japanese

1364: IR a word-based indexing method (as performed in our IR engine) fails

1365: to retrieve documents in which words are inappropriately segmented.

1366: In addition, the ChaSen morphological analyzer often incorrectly

1367: segments {\it katakana\/} words, which frequently appear in technical

1368: documents. Consequently this drawback leads to a poor recall in the

1369: case of J-J.
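
For readers who wish to reproduce curves of this kind, the following Python sketch computes a recall-precision curve for one query. It assumes the common TREC-style convention of interpolated precision at the eleven recall levels 0.0, 0.1, \ldots, 1.0, where the interpolated precision at a level is the highest precision observed at any recall greater than or equal to that level. The function name is hypothetical.

```python
def eleven_point_curve(ranked_ids, relevant_ids):
    """Return [(recall level, interpolated precision)] for levels 0.0 to 1.0."""
    relevant = set(relevant_ids)
    points = []  # (recall, precision) measured at each relevant document
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    curve = []
    for i in range(11):
        level = i / 10
        reachable = [p for r, p in points if r >= level]
        curve.append((level, max(reachable) if reachable else 0.0))
    return curve
```

Averaging the eleven values over all queries gives one curve per translation method; a method with poor recall shows up as precision values that drop to zero at the higher recall levels.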

1370:

1371: \begin{figure}[htbp]

1372:   \begin{center}

1373:     \leavevmode

1374:     \psfig{file=rp-curve_raw_TF.ps,height=3.5in}

1375:   \end{center}

1376:   \caption{Recall-precision curves using the standard TF}

1377:   \label{fig:rp_raw_TF}

1378: \end{figure}

1379:

1380: \begin{figure}[htbp]

1381:   \begin{center}

1382:     \leavevmode

1383:     \psfig{file=rp-curve_log_TF.ps,height=3.5in}

1384:   \end{center}

1385:   \caption{Recall-precision curves using the logarithmic TF}

1386:   \label{fig:rp_log_TF}

1387: \end{figure}

1388:

1389: \begin{figure}[htbp]

1390:   \begin{center}

1391:     \leavevmode

1392:     \psfig{file=rp-curve_smart.ps,height=3.5in}

1393:   \end{center}

1394:   \caption{Recall-precision curves using the SMART system}

1395:   \label{fig:rp_smart}

1396: \end{figure}

1397:

1398: \subsection{Query-by-query Analysis}

1399: \label{subsec:qbq_analysis}

1400:

1401: In this section, we discuss why our translation method

1402: was effective in terms of CLIR performance, through a query-by-query analysis.

1403:

1404: First, we compared EDR and CWT (see

1405: Section~\ref{subsec:quantitative}), to investigate the effectiveness

1406: of our compound word translation method. For this purpose, we

1407: identified fragments of the NACSIS queries that were correctly

1408: translated by CWT but not by EDR, as shown in

1409: Table~\ref{tab:avgpre_qbq_cwt}. In this table, where we insert hyphens

1410: between each Japanese base word for enhanced readability,

1411: Japanese/English words unlisted in the EDR technical terminology

1412: dictionary are underlined. Note that as mentioned in

1413: Section~\ref{subsec:quantitative}, in these cases translations for

1414: remaining base words were used as query terms. However, in the case of

1415: the query 0019, the EDR dictionary lists a phrase translation,

1416: i.e., ``{\it kakariuke-kaiseki\/}~(analysis of dependence relation),''

1417: and thus ``analysis,'' ``dependence'' and ``relation'' were used as

1418: query terms (``of'' was discarded as a stopword).  One can see that

1419: CWT outperformed EDR in all but the five asterisked cases

1420: out of 18. Note that in the case of 0019, EDR

1421: conducted a phrase-based translation, while CWT conducted a

1422: word-based translation. The relative superiority between these two

1423: translation approaches varies depending on the retrieval method, and

1424: thus we cannot draw any conclusion regarding this point in this paper.

1425: In the case of the query 0006, although the translation in CWT

1426: was linguistically correct, we found that the English term ``agent

1427: function'' is rarely used in documents associated with agent research,

1428: and that ``function'' ended up degrading the retrieval performance. In

1429: the case of the query 0020, ``loanword'' would be a more

1430: appropriate translation for ``{\it gairaigo\/}.'' However, even when

1431: we used ``loanword'' for the retrieval, instead of ``foreign'' and

1432: ``word,'' the performance of EDR did not change.

1433:

1434: \begin{table}[htbp]

1435:   \def\baselinestretch{1}

1436:   \begin{center}

1437:     \caption{Query-by-query comparison between EDR and CWT}

1438:     \medskip

1439:     \leavevmode

1440:     \footnotesize

1441:     \tabcolsep=3pt

1442:     \begin{tabular}{cllll} \hline\hline

1443:       & & \multicolumn{3}{c}{Change in Average Precision (EDR

1444:       $\rightarrow$ CWT)} \\ \cline{3-5}

1445:       ID & {\hfill\centering Japanese (Translation in CWT)\hfill} &

1446:       {\hfill\centering Standard TF\hfill} & {\hfill\centering

1447:       Logarithmic TF\hfill} & {\hfill\centering SMART\hfill} \\ \hline

1448:       0001 & {\it $\underline{jiritsu}$-idou-robotto\/}

1449:       ($\underline{\mbox{autonomous}}$ mobile robot) & 0.2325

1450:       $\rightarrow$ 0.3667 & 0.2587 $\rightarrow$ 0.4058 & 0.2259

1451:       $\rightarrow$ 0.3441 \\

1452:       0004 & {\it $\underline{bunsho}$-gazou-rikai\/}

1453:       ($\underline{\mbox{document}}$ image understanding) & 0.0011

1454:       $\rightarrow$ 0.2775 & 0.0091 $\rightarrow$ 0.3768 & 0.0217

1455:       $\rightarrow$ 0.2740 \\

1456:       0006 & {\it eejento-$\underline{kinou}$\/} (agent

1457:       $\underline{\mbox{function}}$) & 0.2008

1458:       $\rightarrow$ 0.1603* & 0.2920 $\rightarrow$ 0.1997* & 0.1430

1459:       $\rightarrow$ 0.1395* \\

1460:       0016 & {\it saidai-$\underline{kyoutsuu}$-bubungurafu\/}

1461:       (greatest $\underline{\mbox{common}}$ subgraph) &

1462:       0.1615 $\rightarrow$ 0.5039 & 0.4661 $\rightarrow$ 0.6216 &

1463:       0.1295 $\rightarrow$ 0.4460 \\

1464:       0019 & {\it kakariuke-kaiseki\/} (dependency analysis) & 0.0794

1465:       $\rightarrow$ 0.3550 & 0.1383 $\rightarrow$ 0.4302 & 0.1852

1466:       $\rightarrow$ 0.1449* \\

1467:       0020 & {\it katakana-$\underline{\mbox{\it gairai-go\/}}$\/}

1468:       (katakana $\underline{\mbox{foreign word}}$) & 0.4536

1469:       $\rightarrow$ 0.4568 & 0.2408 $\rightarrow$ 0.4674 & 0.9429

1470:       $\rightarrow$ 0.8769* \smallskip \\ \hline

1471:     \end{tabular}

1472:     \label{tab:avgpre_qbq_cwt}

1473:   \end{center}

1474: \end{table}

1475:

1476: Second, we compared CWT and TRL in Table~\ref{tab:avgpre_qbq_trl},

1477: which uses the same basic notation as Table~\ref{tab:avgpre_qbq_cwt}.

1478: The NACSIS query set contains 20 {\it katakana\/} base word types,

1479: among which ``{\it ma-i-ni-n-gu\/}~(mining)'' and ``{\it

1480: ko-ro-ke-i-sho-n\/}~(collocation)'' were unlisted in our base word

1481: dictionary. Unlike the previous case, transliteration generally

1482: improved on the performance. On the other hand, we concede that only

1483: three queries are not enough to justify the effectiveness of our

1484: transliteration method. In view of this problem, we assumed that every

1485: {\it katakana\/} word in the query is unlisted in our base word

1486: dictionary, and compared the following two extreme cases:

1487: \begin{itemize}

1488: \item every {\it katakana\/} word was untranslated (i.e., they were

1489:   simply discarded from queries), which can be seen as a lower bound

1490:   method in this comparison,

1491: \item transliteration was applied to every {\it katakana\/} word,

1492:   instead of consulting the base word dictionary.

1493: \end{itemize}

1494: Both cases were combined with the CWT method described in

1495: Section~\ref{subsec:quantitative}. Note that in the latter case, when

1496: a {\it katakana\/} word is included in a compound word,

1497: transliteration candidates of the word are disambiguated through the

1498: compound word translation method, and thus noisy candidates are

1499: potentially discarded.  It should also be noted that in the case where

1500: a compound word consists solely of {\it katakana\/} words (e.g., {\it

1501: deeta-mainingu\/}~(data mining)), our method automatically segments it

1502: into base words, by transliterating all the possible substrings.
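
The idea of segmenting a {\it katakana\/} compound by trying all of its substrings can be sketched in Python as follows. The sketch is a simplification under stated assumptions: romanized strings stand in for {\it katakana\/}, the callable {\tt transliterate} is a hypothetical lookup that returns the English candidates of a substring (or an empty list), and the exhaustive recursion is an illustrative stand-in that leaves out the probabilistic scoring of our transliteration method.

```python
def segmentations(word, transliterate):
    """Yield every split of `word` into substrings that all have candidates.

    Each result is a list of (substring, candidates) pairs. `transliterate`
    is a hypothetical callable: substring -> list of English candidates.
    """
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        head = word[:end]
        candidates = transliterate(head)
        if candidates:
            # Only continue when the prefix itself can be transliterated.
            for rest in segmentations(word[end:], transliterate):
                yield [(head, candidates)] + rest
```

The candidates of each segment would then be disambiguated by the compound word translation method, so that noisy transliterations are potentially discarded.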

1503:

1504: Table~\ref{tab:avg_pre_kana} shows the average precision values,

1505: averaged over the 21 queries, for the two cases above.  By comparing

1506: Tables~\ref{tab:avg_pre} and \ref{tab:avg_pre_kana}, one can see that

1507: the performance was considerably degraded when we disregarded every {\it

1508: katakana\/} word, and that even when we applied transliteration to

1509: every {\it katakana\/} word, the performance was greater than that of CWT

1510: and was quite comparable to that of TRL. Among the 20 {\it

1511: katakana\/} base words, only ``{\it eejento\/}~(agent)'' was

1512: incorrectly transliterated into ``eagent,'' which was due to an

1513: insufficient volume of the transliteration dictionary.

1514:

1515: \begin{table}[htbp]

1516:   \def\baselinestretch{1}

1517:   \begin{center}

1518:     \caption{Query-by-query comparison between CWT and TRL}

1519:     \medskip

1520:     \leavevmode

1521:     \footnotesize

1522:     \begin{tabular}{cllll} \hline\hline

1523:       & & \multicolumn{3}{c}{Change in Average Precision (CWT

1524:       $\rightarrow$ TRL)} \\ \cline{3-5}

1525:       ID & Japanese (Translation in TRL) & {\hfill\centering Standard

1526:       TF\hfill} & {\hfill\centering Logarithmic

1527:       TF\hfill} & {\hfill\centering SMART\hfill} \\ \hline

1528:       0008 & {\it deeta-$\underline{mainingu}$\/} (data

1529:       $\underline{\mbox{mining}}$) & 0.0018 $\rightarrow$ 0.0942 &

1530:       0.0299 $\rightarrow$ 0.3363 & 0.3156 $\rightarrow$ 0.7295 \\

1531:       0012 & {\it deeta-$\underline{mainingu}$\/} (data

1532:       $\underline{\mbox{mining}}$) & 0.0018 $\rightarrow$ 0.1229 &

1533:       0.0003 $\rightarrow$ 0.1683 & 0.0000 $\rightarrow$ 0.0853 \\

1534:       0015 & {\it $\underline{korokeishon}$\/}

1535:       ($\underline{\mbox{collocation}}$) & 0.0054 $\rightarrow$ 0.0084

1536:       & 0.0389 $\rightarrow$ 0.0485 & 0.0193 $\rightarrow$

1537:       0.3114 \smallskip \\ \hline

1538:     \end{tabular}

1539:     \label{tab:avgpre_qbq_trl}

1540:   \end{center}

1541: \end{table}

1542:

1543: \begin{table}[htbp]

1544:   \def\baselinestretch{1}

1545:   \begin{center}

1546:     \caption{Non-interpolated average precision values,

1547:     averaged over the 21 queries, for the evaluation of

1548:     transliteration}

1549:     \medskip

1550:     \leavevmode

1551:     \small

1552:     \begin{tabular}{lccccc} \hline\hline

1553:       & & \multicolumn{3}{c}{Retrieval Method} \\ \cline{3-5}

1554:       & \# of Terms & Standard TF & Logarithmic TF & SMART \\ \hline

1555:       discard every {\it katakana\/} word & 2.8 & 0.1519 & 0.1840 &

1556:       0.1873 \\

1557:       transliterate every {\it katakana\/} word & 4.0 & 0.2354 & 0.2786 &

1558:       0.3024 \\ \hline

1559:     \end{tabular}

1560:     \label{tab:avg_pre_kana}

1561:   \end{center}

1562: \end{table}

1563:

1564: Finally, we discuss the effect of additional dictionaries, i.e., the

1565: general and abbreviation dictionaries. The NACSIS query set contains

1566: the general word ``{\it shimbun-kiji\/}~(newspaper article)'' and

1567: abbreviation ``LFG~(lexical functional grammar)'' unlisted in our

1568: technical base word dictionary. The abbreviation dictionary lists the

1569: correct translation for ``LFG.''  On the other hand, our general

1570: dictionary, which consists solely of single words, does not list the

1571: correct translation for ``{\it shimbun-kiji\/}.'' Instead, the English

1572: word ``story'' was listed as the translation, which would be used in a

1573: particular context. Table~\ref{tab:avgpre_qbq_additional}, which uses the

1574: same basic notation as Table~\ref{tab:avgpre_qbq_cwt}, compares

1575: average precision values with/without these translations. From this

1576: table we cannot see any improvement with the additional

1577: dictionaries. However, when the correct translation was provided as in

1578: 0023 with ``newspaper article,'' the performance was improved

1579: regardless of the retrieval method. In addition, since we found only

1580: two cases where additional dictionaries could be applied, this issue

1581: needs to be further explored using more test queries.

1582:

1583: \begin{table}[htbp]

1584:   \def\baselinestretch{1}

1585:   \begin{center}

1586:     \caption{Query-by-query comparison for the general and

1587:     abbreviation dictionaries}

1588:     \medskip

1589:     \leavevmode

1590:     \footnotesize

1591:     \begin{tabular}{cllll} \hline\hline

1592:       & & \multicolumn{3}{c}{Change in Average Precision} \\ \cline{3-5}

1593:       ID & Japanese (Translation) & {\hfill\centering Standard

1594:       TF\hfill} & {\hfill\centering Logarithmic

1595:       TF\hfill} & {\hfill\centering SMART\hfill} \\ \hline

1596:       0023 & {\it shimbun-kiji\/} (story) & 0.0003

1597:       $\rightarrow$ 0.0000* & 0.0000 $\rightarrow$ 0.0000 & 0.0000

1598:       $\rightarrow$ 0.0000 \\

1599:       0023 & {\it shimbun-kiji\/} (newspaper article) & 0.0003

1600:       $\rightarrow$ 0.0200 & 0.0000 $\rightarrow$ 0.0858 & 0.0000

1601:       $\rightarrow$ 0.1800 \\

1602:       0025 & LFG (lexical functional grammar) & 0.8000 $\rightarrow$

1603:       0.5410* & 0.8000 $\rightarrow$ 0.6879* & 0.9452 $\rightarrow$

1604:       0.8617* \\ \hline

1605:     \end{tabular}

1606:     \label{tab:avgpre_qbq_additional}

1607:   \end{center}

1608: \end{table}

1609:

1610: \section{Conclusion}

1611: \label{sec:conclusion}

1612:

1613: Reflecting the rapid growth in utilization of machine readable

1614: multilingual texts in the 1990s, cross-language information retrieval

1615: (CLIR), which was initiated in the 1960s, has variously been explored

1616: in order to facilitate retrieving information across languages.  For

1617: this purpose, a number of CLIR systems have been developed in

1618: information retrieval, natural language processing and artificial

1619: intelligence research.

1620:

1621: In this paper, we proposed a Japanese/English bidirectional CLIR

1622: system targeting technical documents, for which translation of technical

1623: terms is a crucial task. Since our research methodology must be

1624: contextualized in terms of past research literature, we surveyed

1625: existing CLIR systems, and classified them into three approaches: (a)

1626: translating queries into the document language, (b) translating

1627: documents into the query language, and (c) representing both queries

1628: and documents in a language-independent space. Among these approaches,

1629: we found that the first one, namely the query translation approach, is

1630: relatively inexpensive to implement. Therefore, following this

1631: approach, we combined query translation and monolingual retrieval

1632: modules.

1633:

1634: However, a naive query translation method relying on existing

1635: bilingual dictionaries does not guarantee sufficient system

1636: performance, because new technical terms are progressively created by

1637: combining existing base words or by the Japanese {\it katakana\/}

1638: phonograms. To counter this problem, we proposed compound word

1639: translation and transliteration methods, and integrated them within

1640: one framework. Our methods involve the dictionary production and

1641: probabilistic resolution of translation/transliteration ambiguity,

1642: both of which are fully automated. To produce the dictionary used for

1643: the compound word translation, we extracted base word translations

1644: from the EDR technical terminology dictionary. On the other hand, we

1645: aligned English and Japanese {\it katakana\/} words on a

1646: character basis, to produce the transliteration dictionary. For the

1647: disambiguation, we used word frequency statistics extracted from the

1648: document collection. We also produced a dictionary for abbreviated

1649: English technical terms, to enhance the translation.

1650:

1651: From a scientific point of view, we investigated the performance of

1652: our CLIR system by way of the standardized IR evaluation method. For

1653: this purpose, we used the NACSIS test collection, which consists of

1654: Japanese queries and Japanese/English technical abstracts, and carried

1655: out Japanese-English CLIR evaluation. Our evaluation results showed

1656: that each individual method proposed, i.e., compound word translation

1657: and transliteration, improved on the baseline performance, and when

1658: used together the improvement was even greater, resulting in a

1659: performance comparable with Japanese-Japanese monolingual IR. We also

1660: showed that the enhancement of the retrieval module improved on our

1661: system performance, independently from the enhancement of the query

1662: translation module.

1663:

1664: Future work will include improvement of each component in our system,

1665: and the effective presentation of retrieved documents using

1666: sophisticated summarization techniques.

1667:

1668: \section*{Acknowledgments}

1669:

1670: The authors would like to thank Noriko Kando (National Institute of

1671: Informatics, Japan) for her support with the NACSIS collection.

1672:

1673: \bibliographystyle{acl}

1674: \begin{thebibliography}{}

1675:

1676: \bibitem[\protect\citename{AAAI}1997]{aaai-spring-sympo-97}

1677: AAAI.

1678: \newblock 1997.

1679: \newblock {\em Electronic Working Notes of the AAAI Spring Symposium on

1680:   Cross-Language Text and Speech Retrieval}.

1681: \newblock {\tt http://www.clis.umd.edu/dlrg/filter/sss/papers/}.

1682:

1683: \bibitem[\protect\citename{ACM}1996-1998]{sigir-96-98}

1684: ACM SIGIR.

1685: \newblock 1996-1998.

1686: \newblock {\em Proceedings of the Annual International ACM SIGIR Conference on

1687:   Research and Development in Information Retrieval}.

1688:

1689: \bibitem[\protect\citename{Aone \bgroup et al.\egroup }1997]{aone:anlp-97}

1690: Chinatsu Aone, Nicholas Charocopos, and James Gorlinsky.

1691: \newblock 1997.

1692: \newblock An intelligent multilingual information browsing and retrieval system

1693:   using information extraction.

1694: \newblock In {\em Proceedings of the 5th Conference on Applied Natural Language

1695:   Processing}, pages 332--339.

1696:

1697: \bibitem[\protect\citename{Ballesteros and Croft}1997]{ballesteros:sigir-97}

1698: Lisa Ballesteros and W.~Bruce Croft.

1699: \newblock 1997.

1700: \newblock Phrasal translation and query expansion techniques for cross-language

1701:   information retrieval.

1702: \newblock In {\em Proceedings of the 20th Annual International ACM SIGIR

1703:   Conference on Research and Development in Information Retrieval}, pages

1704:   84--91.

1705:

1706: \bibitem[\protect\citename{Ballesteros and Croft}1998]{ballesteros:sigir-98}

1707: Lisa Ballesteros and W.~Bruce Croft.

1708: \newblock 1998.

1709: \newblock Resolving ambiguity for cross-language retrieval.

1710: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR

1711:   Conference on Research and Development in Information Retrieval}, pages

1712:   64--71.

1713:

1714: \bibitem[\protect\citename{Brown \bgroup et al.\egroup }1993]{brown:cl-93}

1715: Peter~F. Brown, Stephen A.~Della Pietra, Vincent J.~Della Pietra, and Robert~L.

1716:   Mercer.

1717: \newblock 1993.

1718: \newblock The mathematics of statistical machine translation: Parameter

1719:   estimation.

1720: \newblock {\em Computational Linguistics}, 19(2):263--311.

1721:

1722: \bibitem[\protect\citename{Carbonell \bgroup et al.\egroup

1723:   }1997]{carbonell:ijcai-97}

1724: Jaime~G. Carbonell, Yiming Yang, Robert~E. Frederking, Ralf~D. Brown, Yibing

1725:   Geng, and Danny Lee.

1726: \newblock 1997.

1727: \newblock Translingual information retrieval: A comparative evaluation.

1728: \newblock In {\em Proceedings of the 15th International Joint Conference on

1729:   Artificial Intelligence}, pages 708--714.

1730:

1731: \bibitem[\protect\citename{Chen \bgroup et al.\egroup

1732:   }1998]{chen:coling-acl-98}

1733: Hsin-Hsi Chen, Sheng-Jie Huang, Yung-Wei Ding, and Shih-Chung Tsai.

1734: \newblock 1998.

1735: \newblock Proper name translation in cross-language information retrieval.

1736: \newblock In {\em Proceedings of the 36th Annual Meeting of the Association for

1737:   Computational Linguistics and the 17th International Conference on

1738:   Computational Linguistics}, pages 232--236.

1739:

1740: \bibitem[\protect\citename{Chen \bgroup et al.\egroup }1999]{chen:acl-99}

1741: Hsin-Hsi Chen, Guo-Wei Bian, and Wen-Cheng Lin.

1742: \newblock 1999.

1743: \newblock Resolving translation ambiguity and target polysemy in cross-language

1744:   information retrieval.

1745: \newblock In {\em Proceedings of the 37th Annual Meeting of the Association for

1746:   Computational Linguistics}, pages 215--222.

1747:

1748: \bibitem[\protect\citename{Church and Mercer}1993]{church:cl-93}

1749: Kenneth~W. Church and Robert~L. Mercer.

1750: \newblock 1993.

1751: \newblock Introduction to the special issue on computational linguistics using

1752:   large corpora.

1753: \newblock {\em Computational Linguistics}, 19(1):1--24.

1754:

1755: \bibitem[\protect\citename{Davis and Ogden}1997]{davis:sigir-97}

1756: Mark~W. Davis and William~C. Ogden.

1757: \newblock 1997.

1758: \newblock {QUILT}: Implementing a large-scale cross-language text retrieval

1759:   system.

1760: \newblock In {\em Proceedings of the 20th Annual International ACM SIGIR

1761:   Conference on Research and Development in Information Retrieval}, pages

1762:   92--98.

1763:

1764: \bibitem[\protect\citename{Deerwester \bgroup et al.\egroup

1765:   }1990]{deerwester:jasis-90}

1766: Scott Deerwester, Susan~T. Dumais, George~W. Furnas, Thomas~K. Landauer, and

1767:   Richard Harshman.

1768: \newblock 1990.

1769: \newblock Indexing by latent semantic analysis.

1770: \newblock {\em Journal of the American Society for Information Science},

1771:   41(6):391--407.

1772:

1773: \bibitem[\protect\citename{Dijkstra}1959]{dijkstra:nm-59}

1774: Edsger~W. Dijkstra.

1775: \newblock 1959.

1776: \newblock A note on two problems in connexion with graphs.

1777: \newblock {\em Numerische Mathematik}, 1:269--271.

1778:

1779: \bibitem[\protect\citename{Dorr and Oard}1998]{dorr:lrec-98}

1780: Bonnie~J. Dorr and Douglas~W. Oard.

1781: \newblock 1998.

1782: \newblock Evaluating resources for query translation in cross-language

1783:   information retrieval.

1784: \newblock In {\em Proceedings of the 1st International Conference on Language

1785:   Resources and Evaluation}, pages 759--764.

1786:

1787: \bibitem[\protect\citename{Dumais \bgroup et al.\egroup

1788:   }1996]{dumais:sigir-ws-96}

1789: Susan~T. Dumais, Thomas~K. Landauer, and Michael~L. Littman.

1790: \newblock 1996.

1791: \newblock Automatic cross-linguistic information retrieval using latent

1792:   semantic indexing.

1793: \newblock In {\em ACM SIGIR Workshop on Cross-Linguistic Information

1794:   Retrieval}.

1795:

1796: \bibitem[\protect\citename{Fellbaum}1998]{fellbaum:wordnet-98}

1797: Christiane Fellbaum, editor.

1798: \newblock 1998.

1799: \newblock {\em {WordNet}: An Electronic Lexical Database}.

1800: \newblock MIT Press.

1801:

1802: \bibitem[\protect\citename{Ferber}1989]{ferber:89}

1803: Gene Ferber.

1804: \newblock 1989.

1805: \newblock {\em {English-Japanese}, {Japanese-English} Dictionary of Computer

1806:   and Data-Processing Terms}.

1807: \newblock MIT Press.

1808:

1809: \bibitem[\protect\citename{Fung \bgroup et al.\egroup }1999]{fung:acl-99}

1810: Pascale Fung, Liu Xiaohu, and Cheung~Chi Shun.

1811: \newblock 1999.

1812: \newblock Mixed language query disambiguation.

1813: \newblock In {\em Proceedings of the 37th Annual Meeting of the Association for

1814:   Computational Linguistics}, pages 333--340.

1815:

1816: \bibitem[\protect\citename{Fung}1995]{fung:acl-95}

1817: Pascale Fung.

1818: \newblock 1995.

1819: \newblock A pattern matching method for finding noun and proper noun

1820:   translations from noisy parallel corpora.

1821: \newblock In {\em Proceedings of the 33rd Annual Meeting of the Association for

1822:   Computational Linguistics}, pages 236--243.

1823:

1824: \bibitem[\protect\citename{Gachot \bgroup et al.\egroup

1825:   }1996]{gachot:sigir-ws-96}

1826: Denis~A. Gachot, Elke Lange, and Jin Yang.

1827: \newblock 1996.

1828: \newblock The {SYSTRAN} {NLP} browser: An application of machine translation

1829:   technology in multilingual information retrieval.

1830: \newblock In {\em ACM SIGIR Workshop on Cross-Linguistic Information

1831:   Retrieval}.

1832:

1833: \bibitem[\protect\citename{Gilarranz \bgroup et al.\egroup

1834:   }1997]{gilarranz:aaai-spring-sympo-97}

1835: Julio Gilarranz, Julio Gonzalo, and Felisa Verdejo.

1836: \newblock 1997.

1837: \newblock An approach to conceptual text retrieval using the {EuroWordNet}

1838:   multilingual semantic database.

1839: \newblock In {\em Electronic Working Notes of the AAAI Spring Symposium on

1840:   Cross-Language Text and Speech Retrieval}.

1841:

1842: \bibitem[\protect\citename{Gonzalo \bgroup et al.\egroup

1843:   }1998]{gonzalo:chum-98}

1844: Julio Gonzalo, Felisa Verdejo, Carol Peters, and Nicoletta Calzolari.

1845: \newblock 1998.

1846: \newblock Applying {EuroWordNet} to cross-language text retrieval.

1847: \newblock {\em Computers and the Humanities}, 32:185--207.

1848:

1849: \bibitem[\protect\citename{Hull and Grefenstette}1996]{hull:sigir-96}

1850: David~A. Hull and Gregory Grefenstette.

1851: \newblock 1996.

1852: \newblock Querying across languages: A dictionary-based approach to

1853:   multilingual information retrieval.

1854: \newblock In {\em Proceedings of the 19th Annual International ACM SIGIR

1855:   Conference on Research and Development in Information Retrieval}, pages

1856:   49--57.

1857:

1858: \bibitem[\protect\citename{Hull}1993]{hull:sigir-93}

1859: David Hull.

1860: \newblock 1993.

1861: \newblock Using statistical testing in the evaluation of retrieval experiments.

1862: \newblock In {\em Proceedings of the 16th Annual International ACM SIGIR

1863:   Conference on Research and Development in Information Retrieval}, pages

1864:   329--338.

1865:

1866: \bibitem[\protect\citename{Hull}1997]{hull:aaai-spring-sympo-97}

1867: David~A. Hull.

1868: \newblock 1997.

1869: \newblock Using structured queries for disambiguation in cross-language

1870:   information retrieval.

1871: \newblock In {\em Electronic Working Notes of the AAAI Spring Symposium on

1872:   Cross-Language Text and Speech Retrieval}.

1873:

1874: \bibitem[\protect\citename{{Japan Electronic Dictionary Research

1875:   Institute}}1995a]{edr-bilindic:95}

1876: {Japan Electronic Dictionary Research Institute}.

1877: \newblock 1995a.

1878: \newblock Bilingual dictionary.

1879: \newblock (In Japanese).

1880:

1881: \bibitem[\protect\citename{{Japan Electronic Dictionary Research

1882:   Institute}}1995b]{edr-techdic:95}

1883: {Japan Electronic Dictionary Research Institute}.

1884: \newblock 1995b.

1885: \newblock Technical terminology dictionary (information processing).

1886: \newblock (In Japanese).

1887:

1888: \bibitem[\protect\citename{Kaji and Aizono}1996]{kaji:coling-96}

1889: Hiroyuki Kaji and Toshiko Aizono.

1890: \newblock 1996.

1891: \newblock Extracting word correspondences from bilingual corpora based on word

1892:   co-occurrence information.

1893: \newblock In {\em Proceedings of the 16th International Conference on

1894:   Computational Linguistics}, pages 23--28.

1895:

1896: \bibitem[\protect\citename{Kando \bgroup et al.\egroup }1999]{kando:sigir-99}

1897: Noriko Kando, Kazuko Kuriyama, and Toshihiko Nozue.

1898: \newblock 1999.

1899: \newblock {NACSIS} test collection workshop ({NTCIR-1}).

1900: \newblock In {\em Proceedings of the 22nd Annual International ACM SIGIR

1901:   Conference on Research and Development in Information Retrieval}, pages

1902:   299--300.

1903:

1904: \bibitem[\protect\citename{Keen}1992]{keen:ipm-92}

1905: E.~Michael Keen.

1906: \newblock 1992.

1907: \newblock Presenting results of experimental retrieval comparisons.

1908: \newblock {\em Information Processing \& Management}, 28(4):491--502.

1909:

1910: \bibitem[\protect\citename{Knight and Graehl}1998]{knight:cl-98}

1911: Kevin Knight and Jonathan Graehl.

1912: \newblock 1998.

1913: \newblock Machine transliteration.

1914: \newblock {\em Computational Linguistics}, 24(4):599--612.

1915:

1916: \bibitem[\protect\citename{Kobayashi \bgroup et al.\egroup

1917:   }1994]{kobayashi:coling-94}

1918: Yoshiyuki Kobayashi, Takenobu Tokunaga, and Hozumi Tanaka.

1919: \newblock 1994.

1920: \newblock Analysis of {Japanese} compound nouns using collocational

1921:   information.

1922: \newblock In {\em Proceedings of the 15th International Conference on

1923:   Computational Linguistics}, pages 865--869.

1924:

1925: \bibitem[\protect\citename{Kwon \bgroup et al.\egroup }1998]{kwon:cpol-98}

1926: Oh-Woog Kwon, Insu Kang, Jong-Hyeok Lee, and Geunbae Lee.

1927: \newblock 1998.

1928: \newblock Conceptual cross-language text retrieval based on document

1929:   translation using {Japanese}-to-{Korean} {MT} system.

1930: \newblock {\em International Journal of Computer Processing of Oriental

1931:   Languages}, 12(1):1--16.

1932:

1933: \bibitem[\protect\citename{Lee and Choi}1997]{lee:iral-97}

1934: Jae~Sung Lee and Key-Sun Choi.

1935: \newblock 1997.

1936: \newblock A statistical method to generate various foreign word

1937:   transliterations in multilingual information retrieval system.

1938: \newblock In {\em Proceedings of the 2nd International Workshop on Information

1939:   Retrieval with Asian Languages}, pages 123--128.

1940:

1941: \bibitem[\protect\citename{Mani and Bloedorn}1998]{mani:aaai-iaai-98}

1942: Inderjeet Mani and Eric Bloedorn.

1943: \newblock 1998.

1944: \newblock Machine learning of generic and user-focused summarization.

1945: \newblock In {\em Proceedings of AAAI/IAAI-98}, pages 821--826.

1946:

1947: \bibitem[\protect\citename{Matsumoto \bgroup et al.\egroup

1948:   }1997]{matsumoto:chasen-97}

1949: Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Osamu Imaichi, and Tomoaki

1950:   Imamura.

1951: \newblock 1997.

1952: \newblock {Japanese} morphological analysis system {ChaSen} manual.

1953: \newblock Technical Report NAIST-IS-TR97007, NAIST.

1954: \newblock (In Japanese).

1955:

1956: \bibitem[\protect\citename{McCarley}1999]{mccarley:acl-99}

1957: J.~Scott McCarley.

1958: \newblock 1999.

1959: \newblock Should we translate the documents or the queries in cross-language

1960:   information retrieval?

1961: \newblock In {\em Proceedings of the 37th Annual Meeting of the Association for

1962:   Computational Linguistics}, pages 208--214.

1963:

1964: \bibitem[\protect\citename{Mongar}1969]{mongar:tis-69}

1965: P.E. Mongar.

1966: \newblock 1969.

1967: \newblock International co-operation in abstracting services for road

1968:   engineering.

1969: \newblock {\em The Information Scientist}, 3:51--62.

1970:

1971: \bibitem[\protect\citename{{Nichigai Associates}}1996]{nichigai_compdic:96}

1972: {Nichigai Associates}.

1973: \newblock 1996.

1974: \newblock {English-Japanese} computer terminology dictionary.

1975: \newblock (In Japanese).

1976:

1977: \bibitem[\protect\citename{Nie \bgroup et al.\egroup }1999]{nie:sigir-99}

1978: Jian-Yun Nie, Michel Simard, Pierre Isabelle, and Richard Durand.

1979: \newblock 1999.

1980: \newblock Cross-language information retrieval based on parallel texts and

1981:   automatic mining of parallel texts from the {Web}.

1982: \newblock In {\em Proceedings of the 22nd Annual International ACM SIGIR

1983:   Conference on Research and Development in Information Retrieval}, pages

1984:   74--81.

1985:

1986: \bibitem[\protect\citename{NIST}1992--1998]{trec-92-98}

1987: {National Institute of Standards \& Technology}.

1988: \newblock 1992--1998.

1989: \newblock {\em Proceedings of the Text REtrieval Conferences}.

1990: \newblock {\tt http://trec.nist.gov/pubs.html}.

1991:

1992: \bibitem[\protect\citename{Oard and Resnik}1999]{oard:ipm-99}

1993: Douglas~W. Oard and Philip Resnik.

1994: \newblock 1999.

1995: \newblock Support for interactive document selection in cross-language

1996:   information retrieval.

1997: \newblock {\em Information Processing \& Management}, 35(3):363--379.

1998:

1999: \bibitem[\protect\citename{Oard}1998]{oard:amta-98}

2000: Douglas~W. Oard.

2001: \newblock 1998.

2002: \newblock A comparative study of query and document translation for

2003:   cross-language information retrieval.

2004: \newblock In {\em Proceedings of the 3rd Conference of the Association for

2005:   Machine Translation in the Americas}, pages 472--483.

2006:

2007: \bibitem[\protect\citename{Okumura \bgroup et al.\egroup

2008:   }1998]{okumura:lrec-tlim-ws-98}

2009: Akitoshi Okumura, Kai Ishikawa, and Kenji Satoh.

2010: \newblock 1998.

2011: \newblock Translingual information retrieval by a bilingual dictionary and

2012:   comparable corpus.

2013: \newblock In {\em The 1st International Conference on Language Resources and

2014:   Evaluation, Workshop on Translingual Information Management: Current Levels

2015:   and Future Abilities}.

2016:

2017: \bibitem[\protect\citename{Pirkola}1998]{pirkola:sigir-98}

2018: Ari Pirkola.

2019: \newblock 1998.

2020: \newblock The effects of query structure and dictionary setups in

2021:   dictionary-based cross-language information retrieval.

2022: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR

2023:   Conference on Research and Development in Information Retrieval}, pages

2024:   55--63.

2025:

2026: \bibitem[\protect\citename{Sakai \bgroup et al.\egroup }1999]{sakai:tipsj-99}

2027: Tetsuya Sakai, Masahiro Kajiura, Kazuo Sumita, Gareth Jones, and Nigel Collier.

2028: \newblock 1999.

2029: \newblock A study on {English}-{Japanese}/{Japanese}-{English} cross-language

2030:   information retrieval using machine translation.

2031: \newblock {\em Transactions of Information Processing Society of Japan},

2032:   40(11):4075--4086.

2033: \newblock (In Japanese).

2034:

2035: \bibitem[\protect\citename{Salton and Buckley}1988]{salton:ipm-88}

2036: Gerard Salton and Christopher Buckley.

2037: \newblock 1988.

2038: \newblock Term-weighting approaches in automatic text retrieval.

2039: \newblock {\em Information Processing \& Management}, 24(5):513--523.

2040:

2041: \bibitem[\protect\citename{Salton and McGill}1983]{salton:83}

2042: Gerard Salton and Michael~J. McGill.

2043: \newblock 1983.

2044: \newblock {\em Introduction to Modern Information Retrieval}.

2045: \newblock McGraw-Hill.

2046:

2047: \bibitem[\protect\citename{Salton}1970]{salton:jasis-70}

2048: Gerard Salton.

2049: \newblock 1970.

2050: \newblock Automatic processing of foreign language documents.

2051: \newblock {\em Journal of the American Society for Information Science},

2052:   21(3):187--194.

2053:

2054: \bibitem[\protect\citename{Salton}1971]{salton:71}

2055: Gerard Salton.

2056: \newblock 1971.

2057: \newblock {\em The {SMART} Retrieval System: Experiments in Automatic Document

2058:   Processing}.

2059: \newblock Prentice-Hall.

2060:

2061: \bibitem[\protect\citename{Salton}1972]{salton:techrep-72}

2062: Gerard Salton.

2063: \newblock 1972.

2064: \newblock Experiments in multi-lingual information retrieval.

2065: \newblock Technical Report TR 72-154, Computer Science Department, Cornell

2066:   University.

2067:

2068: \bibitem[\protect\citename{Sch\"{a}uble and Sheridan}1997]{schauble:trec-97}

2069: Peter Sch\"{a}uble and P\'{a}raic Sheridan.

2070: \newblock 1997.

2071: \newblock Cross-language information retrieval ({CLIR}) track overview.

2072: \newblock In {\em The 6th Text Retrieval Conference}.

2073:

2074: \bibitem[\protect\citename{Sheridan and Ballerini}1996]{sheridan:sigir-96}

2075: P\'{a}raic Sheridan and Jean~Paul Ballerini.

2076: \newblock 1996.

2077: \newblock Experiments in multilingual information retrieval using the {SPIDER}

2078:   system.

2079: \newblock In {\em Proceedings of the 19th Annual International ACM SIGIR

2080:   Conference on Research and Development in Information Retrieval}, pages

2081:   58--65.

2082:

2083: \bibitem[\protect\citename{Smadja \bgroup et al.\egroup }1996]{smadja:cl-96}

2084: Frank Smadja, Kathleen~R. McKeown, and Vasileios Hatzivassiloglou.

2085: \newblock 1996.

2086: \newblock Translating collocations for bilingual lexicons: A statistical

2087:   approach.

2088: \newblock {\em Computational Linguistics}, 22(1):1--38.

2089:

2090: \bibitem[\protect\citename{Suzuki \bgroup et al.\egroup

2091:   }1998]{suzuki:signl-98-7}

2092: Masami Suzuki, Naomi Inoue, and Kazuo Hashimoto.

2093: \newblock 1998.

2094: \newblock Effect on displaying translated major keywords of contents as

2095:   browsing support in cross-language information retrieval.

2096: \newblock {\em Information Processing Society of Japan SIGNL Notes},

2097:   98(63):99--106.

2098: \newblock (In Japanese).

2099:

2100: \bibitem[\protect\citename{Suzuki \bgroup et al.\egroup }1999]{suzuki:nlp-99}

2101: Masami Suzuki, Naomi Inoue, and Kazuo Hashimoto.

2102: \newblock 1999.

2103: \newblock Effects of partial translation for users' document selection in

2104:   cross-language information retrieval.

2105: \newblock In {\em Proceedings of the 5th Annual Meeting of the Association for

2106:   Natural Language Processing}, pages 371--374.

2107: \newblock (In Japanese).

2108:

2109: \bibitem[\protect\citename{Tombros and Sanderson}1998]{tombros:sigir-98}

2110: Anastasios Tombros and Mark Sanderson.

2111: \newblock 1998.

2112: \newblock Advantages of query biased summaries in information retrieval.

2113: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR

2114:   Conference on Research and Development in Information Retrieval}, pages

2115:   2--10.

2116:

2117: \bibitem[\protect\citename{Tsuji and Kageura}1997]{tsuji:nlprs-97}

2118: Keita Tsuji and Kyo Kageura.

2119: \newblock 1997.

2120: \newblock An {HMM}-based method for segmenting {Japanese} terms and keywords

2121:   based on domain-specific bilingual corpora.

2122: \newblock In {\em Proceedings of the 4th Natural Language Processing Pacific

2123:   Rim Symposium}, pages 557--560.

2124:

2125: \bibitem[\protect\citename{Voorhees}1998]{voorhees:sigir-98}

2126: Ellen~M. Voorhees.

2127: \newblock 1998.

2128: \newblock Variations in relevance judgments and the measurement of retrieval

2129:   effectiveness.

2130: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR

2131:   Conference on Research and Development in Information Retrieval}, pages

2132:   315--323.

2133:

2134: \bibitem[\protect\citename{Vossen}1998]{vossen:chum-98}

2135: Piek Vossen.

2136: \newblock 1998.

2137: \newblock Introduction to {EuroWordNet}.

2138: \newblock {\em Computers and the Humanities}, 32:73--89.

2139:

2140: \bibitem[\protect\citename{Wong \bgroup et al.\egroup }1985]{wong:sigir-85}

2141: S.K.M. Wong, W.~Ziarko, and P.C.N. Wong.

2142: \newblock 1985.

2143: \newblock Generalized vector space model in information retrieval.

2144: \newblock In {\em Proceedings of the 8th Annual International ACM SIGIR

2145:   Conference on Research and Development in Information Retrieval}, pages

2146:   18--25.

2147:

2148: \bibitem[\protect\citename{Xu and Croft}1996]{xu:sigir-96}

2149: Jinxi Xu and W.~Bruce Croft.

2150: \newblock 1996.

2151: \newblock Query expansion using local and global document analysis.

2152: \newblock In {\em Proceedings of the 19th Annual International ACM SIGIR

2153:   Conference on Research and Development in Information Retrieval}, pages

2154:   4--11.

2155:

2156: \bibitem[\protect\citename{Yamabana \bgroup et al.\egroup

2157:   }1996]{yamabana:sigir-ws-96}

2158: Kiyoshi Yamabana, Kazunori Muraki, Shinichi Doi, and Shin'ichiro Kamei.

2159: \newblock 1996.

2160: \newblock A language conversion front-end for cross-linguistic information

2161:   retrieval.

2162: \newblock In {\em ACM SIGIR Workshop on Cross-Linguistic Information

2163:   Retrieval}.

2164:

2165: \bibitem[\protect\citename{Zobel and Moffat}1998]{zobel:sigir-forum-98}

2166: Justin Zobel and Alistair Moffat.

2167: \newblock 1998.

2168: \newblock Exploring the similarity space.

2169: \newblock {\em ACM SIGIR Forum}, 32(1):18--34.

2170:

2171: \end{thebibliography}

2172:

2173: \end{document}

2174: