1: %%

2: %% SIGIR-2000 WS on patent retrieval

3: %%

4: \documentstyle[11pt,twocolumn]{article}

5:

6: \input{a4params.tex}

7:

8: %% Title/Author/Affiliation

9:

10: \title{\bf Applying a Hybrid Query Translation Method to

11: Japanese/English Cross-Language Patent Retrieval}

12:

13: \author{\Large Masatoshi Fukui$^{\dagger}$~~Shigeto

14: Higuchi$^{\dagger}$~~Youichi Nakatani$^{\dagger}$~~\medskip \\

15: \Large Masao Tanaka$^{\dagger}$~~Atsushi Fujii$^{\ddagger}$~~Tetsuya

16: Ishikawa$^{\ddagger}$}

17:

18: \date{$^\dagger$Japan Patent Information Organization \\

19: Satoh Daiya Bldg., 1-7 Toyo 4-Chome Koto-ku 135-0016, JAPAN \medskip \\

20: $^\ddagger$University of Library and Information Science \\

21: 1-2 Kasuga Tsukuba 305-8550, JAPAN \\ \smallskip

22: \smallskip {\tt E-mail:~fujii@ulis.ac.jp}}

23:

24: \newcommand{\etal}{et~al.}

25: \newcommand{\etaleos}{et~al}

26: \newcommand{\eq}[1]{(\ref{#1})}

27:

28: \input{psfig.tex}

29:

30: \pagestyle{empty}

31:

32: \begin{document}

33:

34: \maketitle\thispagestyle{empty}

35:

36: \begin{abstract}

  This paper applies an existing query translation method to
  cross-language patent retrieval. In this method, multiple
  dictionaries are used to derive all possible translations for an
  input query, and collocational statistics are used to resolve
  translation ambiguity. We used Japanese/English parallel patent
  abstracts to perform comparative experiments, in which our method
  outperformed a simple dictionary-based query translation method and
  achieved 76\% of the average precision of monolingual retrieval.

46: \end{abstract}

47:

48: \section{Introduction}

49: \label{sec:introduction}

50:

Since 1978, JAPIO (Japan Patent Information Organization) has operated
PATOLIS, one of the first on-line patent retrieval services in Japan,
which currently provides its clients (i.e., 8,000 Japanese companies)
with patent information from 62 countries and 5 international
organizations. Because a patent obtained in a single country can be
protected in multiple countries simultaneously, users are likely to be
interested in retrieving patent information across languages. Motivated
by this background, JAPIO manually summarizes each patent document
submitted in Japan into approximately 400 characters and translates
the summaries into English. The translated summaries are provided on
PAJ (Patent Abstracts of Japan)
CD-ROMs\footnote{Copyright by Japan Patent Office.}.

63:

64: In this paper, we target cross-language information retrieval (CLIR)

65: in the context of patent retrieval, and evaluate its effectiveness

66: using Japanese/English patent abstracts on PAJ CD-ROMs.

67:

68: In brief, existing CLIR systems are classified into three approaches:

69: (a) translating queries into the document

70: language~\cite{ballesteros:sigir-98,davis:sigir-97}, (b) translating

71: documents into the query language~\cite{mccarley:acl-99,oard:amta-98},

72: and (c) representing both queries and documents in a

73: language-independent

74: space~\cite{carbonell:ijcai-97,gonzalo:chum-98,littman:clir-98,salton:jasis-70}.

75: However, since developing a CLIR system is expensive, we used the CLIR

76: system proposed by Fujii and

77: Ishikawa~\cite{fujii:ntcir-99,fujii:emnlp-vlc-99}, which follows the

78: first approach.

79:

This system was developed in part for the NACSIS test
collection~\cite{kando:sigir-99}, which consists of 39 Japanese
queries and approximately 330,000 technical abstracts in Japanese and
English.  Since patent information, like these technical abstracts,
usually includes technical terms, we expect the system to perform
reasonably well on patent abstracts too.

86:

87: \section{System Description}

88: \label{sec:system}

89:

90: Figure~\ref{fig:system} depicts the overall design of our CLIR system,

91: in which we combine a query translation module and an IR engine for

92: monolingual retrieval.  Unlike the original system proposed by Fujii

93: and Ishikawa~\cite{fujii:ntcir-99,fujii:emnlp-vlc-99} targeting the

94: NACSIS collection, we use the JAPIO collection for the target

95: documents. Here, the JAPIO collection is a subset of PAJ CD-ROMs. We

96: will elaborate on this collection in

97: Section~\ref{sec:experimentation}. In this section, we briefly explain

98: the retrieval process based on Figure~\ref{fig:system}.

99:

First, the source-language query is translated into the target
language. For this purpose, a hybrid method integrating multiple
resources is used. More precisely, the EDR technical/general
dictionaries~\cite{edr:95} are used to derive all possible translation
candidates for the words and phrases in the source query. In addition,
words not listed in the dictionaries are transliterated to identify
their phonetic equivalents in the target language.
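As an illustration of this step, the following is a minimal Python
sketch, not the actual implementation. The two lookup tables, their
romanized entries, and the function names are hypothetical stand-ins
for the EDR dictionaries and the transliteration module.

```python
from typing import Dict, List

# Hypothetical toy dictionary (romanized Japanese -> English candidates).
# The real system consults the EDR technical/general dictionaries.
DICTIONARY: Dict[str, List[str]] = {
    "tokkyo": ["patent"],
    "kensaku": ["retrieval", "search"],
}

# Hypothetical transliteration table for words missing from the dictionary.
TRANSLITERATION: Dict[str, List[str]] = {
    "nabigeeshon": ["navigation"],
}


def translation_candidates(word: str) -> List[str]:
    """Return every dictionary translation, else fall back to transliteration."""
    if word in DICTIONARY:
        return list(DICTIONARY[word])
    # Unlisted words: try to find a phonetic equivalent instead.
    return list(TRANSLITERATION.get(word, []))


def translate_query(words: List[str]) -> List[List[str]]:
    """Collect the candidate list for each source-query word, in order."""
    return [translation_candidates(w) for w in words]
```

Each query word thus yields a list of candidates, and a word found in
neither resource yields an empty list.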

108:

Then, bi-gram statistics extracted from NACSIS documents in the target
language are used to resolve the translation ambiguity. Ideally,
bi-gram statistics should be extracted from the JAPIO
collection. However, since the number of documents in this collection
is relatively small compared with the NACSIS collection (see
Section~\ref{sec:experimentation}), we used the NACSIS statistics
instead to avoid the data sparseness problem.
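The disambiguation step can be sketched in Python as follows. The
exact scoring function is not given here, so summing the counts of
adjacent target-language bi-grams is an assumption made for this
sketch, and the function name and toy counts are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple


def k_best_translations(
    candidates: List[List[str]],
    bigram_count: Dict[Tuple[str, str], int],
    k: int = 1,
) -> List[Tuple[str, ...]]:
    """Rank every combination of candidates by its summed bi-gram counts."""

    def score(combo: Tuple[str, ...]) -> int:
        # Assumed score: total corpus count of each adjacent word pair.
        return sum(bigram_count.get(pair, 0) for pair in zip(combo, combo[1:]))

    ranked = sorted(product(*candidates), key=score, reverse=True)
    return ranked[:k]


# Toy counts standing in for statistics from a target-language corpus.
counts = {("patent", "retrieval"): 12, ("patent", "search"): 30}
best = k_best_translations([["patent"], ["retrieval", "search"]], counts, k=1)
```

Enumerating all combinations is exponential in the query length, so
this form is only practical for short queries.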

116:

117: Since our system is bidirectional between Japanese and English, we

118: tokenize documents with different methods, depending on their

119: language. For English documents, the tokenization involves eliminating

120: stopwords and identifying root forms for inflected content words. For

121: this purpose, we use WordNet~\cite{fellbaum:wordnet-98}, which

122: contains a stopword list and correspondences between inflected words

123: and their root form.
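The English tokenization can be sketched as follows. The stopword set
and the root-form table are tiny hypothetical stand-ins for the
WordNet resources, and no WordNet API is called.

```python
from typing import Dict, List, Set

# Hypothetical miniature resources standing in for the WordNet data.
STOPWORDS: Set[str] = {"a", "an", "the", "in", "of", "on", "for"}
ROOT_FORMS: Dict[str, str] = {
    "eliminating": "eliminate",
    "burning": "burn",
    "wastes": "waste",
}


def tokenize_english(text: str) -> List[str]:
    """Lowercase, drop stopwords, and map inflected words to root forms."""
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    return [ROOT_FORMS.get(t, t) for t in tokens if t and t not in STOPWORDS]


tokens = tokenize_english("Eliminating dioxin in burning solid wastes.")
```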

124:

Japanese documents, on the other hand, are segmented into lexical
units using the ChaSen morphological
analyzer~\cite{matsumoto:chasen-97}, which is widely used in Japanese
NLP research, and content words are extracted based on their
part-of-speech information.

129:

130: Second, the IR engine searches the JAPIO collection for documents

131: relevant to the translated query, and sorts them according to the

132: degree of relevance, in descending order. Our IR engine is based on

133: the vector space model, in which the similarity between the query and

134: each document (i.e., the degree of relevance of each document) is

135: computed as the cosine of the angle between their associated

vectors. We use TF$\cdot$IDF for term weighting. Among the many
variations of term weighting
methods~\cite{salton:ipm-88,zobel:sigir-forum-98}, we tentatively use
the formulae shown in Equation~\eq{eq:tf_idf}.

140: \begin{equation}

141:   \label{eq:tf_idf}

142:   \begin{array}{lll}

143:     TF & = & 1 + \log(f_{t,d}) \\

144:     \noalign{\vskip 1.2ex}

145:     IDF & = & \log(\frac{\textstyle N}{\textstyle n_{t}})

146:   \end{array}

147: \end{equation}

Here, $f_{t,d}$ denotes the number of times term $t$ appears in
document $d$, $n_{t}$ denotes the number of documents containing
term $t$, and $N$ is the total number of documents in the collection.
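The weighting and the cosine similarity can be sketched in Python as
follows. This is a minimal illustration under stated assumptions: the
natural logarithm is used, the query is weighted in the same way as a
document, and all names and toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_vector(
    tokens: List[str], doc_freq: Dict[str, int], n_docs: int
) -> Dict[str, float]:
    """Weight each term by (1 + log f_td) * log(N / n_t)."""
    vector: Dict[str, float] = {}
    for term, freq in Counter(tokens).items():
        n_t = doc_freq.get(term, 0)
        if n_t == 0:
            continue  # term absent from the collection: IDF is undefined
        vector[term] = (1.0 + math.log(freq)) * math.log(n_docs / n_t)
    return vector


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy collection of already tokenized documents.
docs = {
    "d1": ["patent", "retrieval"],
    "d2": ["patent", "system"],
    "d3": ["dioxin", "waste"],
}
doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))
vectors = {d: tf_idf_vector(toks, doc_freq, len(docs)) for d, toks in docs.items()}
query = tf_idf_vector(["retrieval"], doc_freq, len(docs))
# Sort documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda d: cosine(query, vectors[d]), reverse=True)
```

Note that a term occurring in every document receives an IDF of zero
and therefore does not contribute to the similarity.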

151:

For the indexing process, we first tokenize documents as explained
above (i.e., we use WordNet and ChaSen for English and Japanese
documents, respectively), and then perform word-based indexing. That
is, we use each content word as a single indexing term.
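Word-based indexing can be illustrated with a small inverted index.
The structure below is an assumption for illustration only, since the
actual index layout of the IR engine is not described here, and the
function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_inverted_index(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, int]]:
    """Map each content word to the documents containing it, with counts."""
    index: Dict[str, Dict[str, int]] = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term in tokens:
            # Store the within-document frequency of the term.
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


index = build_inverted_index(
    {"d1": ["patent", "patent", "retrieval"], "d2": ["patent"]}
)
```

With this layout, the number of documents containing a term is the
size of its posting dictionary, and each stored count is the term
frequency in that document.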

156:

157: Finally, since retrieved documents are not in the user's native

158: language, we optionally use a machine translation system to enhance

159: readability of retrieved documents.

160:

161: \begin{figure}[t]

162:   \begin{center}

163:     \leavevmode

164:     \psfig{file=system.eps,height=1.3in}

165:   \end{center}

166:   \caption{The overall design of our cross-language patent retrieval

167:     system.}

168:   \label{fig:system}

169: \end{figure}

170:

171: \section{Experimentation}

172: \label{sec:experimentation}

173:

Since no test collection for Japanese/English patent retrieval is
publicly available, we produced our own test collection (i.e., the
JAPIO collection), which consists of three Japanese queries and
Japanese/English comparable abstracts.

178:

Each query was manually produced, consists of a description and a
narrative, and corresponds to a different domain, i.e., electrical
engineering, mechanical engineering, or chemistry. The second column
of Figure~\ref{fig:query} shows the three query descriptions.

184:

185: \begin{figure*}[htbp]

186:   \begin{center}

187:     \leavevmode

188:     \small

189:     \begin{tabular}{llrr} \hline\hline

190:       {\hfill\centering IPC\hfill} & {\hfill\centering

191:       Description\hfill} & \#Relevant & \#Documents \\ \hline

192:       electronics & GPS car navigation system based on VICS & 930 &

193:       7,526 \\

194:       mechanics & eliminating dioxin in burning solid wastes & 451 &

195:       8,214 \\

196:       chemistry & antibacterial plastic combining inorganic

197:       materials & 473 & 5,902 \\

198:       \hline

199:     \end{tabular}

200:     \caption{Query descriptions in the JAPIO collection.}

201:     \label{fig:query}

202:   \end{center}

203: \end{figure*}

204:

In conventional test collections, relevance assessment is usually
performed based on the pooling method~\cite{voorhees:sigir-98}, which
first pools candidates for relevant documents using multiple retrieval
systems. However, since in our case only one system (the one described
in Section~\ref{sec:system}) is currently available, a different
production method was needed.

211:

More precisely, for each query (domain), target documents were first
collected from the 1993--1998 PAJ CD-ROMs based on the IPC
classification number. Then, for each query, three professional human
searchers, who were allowed to enhance queries based on thesauri and
their own judgment, searched the target documents for relevant
documents.

218:

219: Thus, in practice, the JAPIO collection consists of three different

220: document collections corresponding to each query.  In

221: Figure~\ref{fig:query}, the third and fourth columns denote the number

222: of relevant documents and the total number of target documents for

223: each query.

224:

225: We compared the following methods:

226: \begin{itemize}

227: \item Japanese-English CLIR, where all possible translations derived

228:   from EDR dictionaries and the transliteration method were used as

229:   query terms (JEALL),

\item Japanese-English CLIR, where disambiguation based on bi-gram
  statistics was performed, and $k$-best translations were used as
  query terms (JEDIS),

233: \item Japanese-Japanese monolingual IR (JJ).

234: \end{itemize}

Here, we empirically set \mbox{$k=1$}. The performance of JEDIS did
not differ greatly as long as $k$ was small (e.g., \mbox{$k=5$}), but
we achieved the best performance with \mbox{$k=1$}.

239:

240: Figure~\ref{fig:rp} shows recall-precision curves for the above three

241: methods, where JEDIS generally outperformed JEALL, and JJ generally

242: outperformed both JEALL and JEDIS, regardless of the recall.  The

243: difference between JEALL and JEDIS is attributed to the fact that

244: JEDIS resolved translation ambiguity based on bi-gram statistics

245: extracted from the NACSIS collection. Thus, we can conclude that the

246: use of bi-gram statistics (even extracted from a collection other than

247: the JAPIO collection) was effective for the query translation.

248:

249: Table~\ref{tab:avg_pre} shows the non-interpolated average precision

250: values, averaged over the three queries, for each method.  This table

251: shows that JJ outperformed JEALL and JEDIS, JEDIS outperformed JEALL,

252: and the average precision value for JEDIS was 76\% of that obtained

253: with JJ.
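For reference, non-interpolated average precision can be computed as
in the following Python sketch. It assumes the commonly used
definition, in which the precision at the rank of each retrieved
relevant document is summed and divided by the total number of
relevant documents. The function name and the toy ranking are
hypothetical.

```python
from typing import List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision of one ranked list."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


# Toy example: two of the three relevant documents are retrieved.
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3", "d5"})

# Ratios to JJ, recomputed from the average precision values in the table.
ratio_jedis = 0.3156 / 0.4151
ratio_jeall = 0.2709 / 0.4151
```

The two ratios round to the values listed in the third column of
Table~\ref{tab:avg_pre}.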

254:

Similar results have been observed in existing CLIR experiments using
the TREC and NACSIS collections. Thus, we conclude that the
performance of our cross-language patent retrieval system is
comparable with that of systems for newspaper articles and technical
abstracts.

259:

However, because the number of queries is small, we could not conduct
statistical testing, which investigates whether a difference in
average precision is meaningful or simply due to
chance~\cite{hull:sigir-93}. Experiments using a larger number of
queries are still needed.

265:

266: \section{Conclusion}

267: \label{sec:conclusion}

268:

In this paper, we explored Japanese/English cross-language patent
retrieval. For this purpose, we used an existing cross-language IR
system relying on a hybrid query translation method, and evaluated its
effectiveness using Japanese queries and English patent abstracts.
The experimental results paralleled those of existing experiments.
That is, we found that resolving translation ambiguity was effective
for query translation, and that the average precision value for
cross-language IR was approximately 76\% of that obtained with
monolingual IR. Future work will include qualitative/quantitative
analyses based on a larger number of queries.

279:

280: \begin{figure}[t]

281:   \begin{center}

282:     \leavevmode

283:     \psfig{file=rp-curve.ps,height=3.2in}

284:   \end{center}

285:   \caption{Recall-precision curves for different methods.}

286:   \label{fig:rp}

287: \end{figure}

288:

289: \begin{table}[htbp]

290:   \begin{center}

291:     \caption{Non-interpolated average precision values,

292:     averaged over the three queries, for different methods.}

293:     \medskip

294:     \leavevmode

295:     \small

296:     \begin{tabular}{lcc} \hline\hline

297:       Method & Avg. Precision & Ratio to JJ \\ \hline

298:       JJ & 0.4151 & -- \\

299:       JEDIS & 0.3156 & 0.7603 \\

300:       JEALL & 0.2709 & 0.6526 \\

301:       \hline

302:     \end{tabular}

303:     \label{tab:avg_pre}

304:   \end{center}

305: \end{table}

306:

307: \small

308:

309: \bibliographystyle{jplain}

310:

311: \begin{thebibliography}{10}

312:

313: \bibitem{ballesteros:sigir-98}

314: Lisa Ballesteros and W.~Bruce Croft.

315: \newblock Resolving ambiguity for cross-language retrieval.

316: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR

317:   Conference on Research and Development in Information Retrieval}, pp. 64--71,

318:   1998.

319:

320: \bibitem{carbonell:ijcai-97}

321: Jaime~G. Carbonell, Yiming Yang, Robert~E. Frederking, Ralf~D. Brown, Yibing

322:   Geng, and Danny Lee.

323: \newblock Translingual information retrieval: A comparative evaluation.

324: \newblock In {\em Proceedings of the 15th International Joint Conference on

325:   Artificial Intelligence}, pp. 708--714, 1997.

326:

327: \bibitem{davis:sigir-97}

328: Mark~W. Davis and William~C. Ogden.

329: \newblock {QUILT}: Implementing a large-scale cross-language text retrieval

330:   system.

331: \newblock In {\em Proceedings of the 20th Annual International ACM SIGIR

332:   Conference on Research and Development in Information Retrieval}, pp. 92--98,

333:   1997.

334:

335: \bibitem{fellbaum:wordnet-98}

336: Christiane Fellbaum, editor.

337: \newblock {\em {WordNet}: An Electronic Lexical Database}.

338: \newblock MIT Press, 1998.

339:

340: \bibitem{fujii:ntcir-99}

341: Atsushi Fujii and Tetsuya Ishikawa.

342: \newblock Cross-language information retrieval at {ULIS}.

343: \newblock In {\em Proceedings of the 1st NTCIR Workshop on Research in Japanese

344:   Text Retrieval and Term Recognition}, pp. 163--169, 1999.

345:

346: \bibitem{fujii:emnlp-vlc-99}

347: Atsushi Fujii and Tetsuya Ishikawa.

348: \newblock Cross-language information retrieval for technical documents.

349: \newblock In {\em Proceedings of the Joint ACL SIGDAT Conference on Empirical

350:   Methods in Natural Language Processing and Very Large Corpora}, pp. 29--37,

351:   1999.

352:

353: \bibitem{gonzalo:chum-98}

354: Julio Gonzalo, Felisa Verdejo, Carol Peters, and Nicoletta Calzolari.

355: \newblock Applying {EuroWordNet} to cross-language text retrieval.

356: \newblock {\em Computers and the Humanities}, Vol.~32, pp. 185--207, 1998.

357:

358: \bibitem{hull:sigir-93}

359: David Hull.

360: \newblock Using statistical testing in the evaluation of retrieval experiments.

361: \newblock In {\em Proceedings of the 16th Annual International ACM SIGIR

362:   Conference on Research and Development in Information Retrieval}, pp.

363:   329--338, 1993.

364:

365: \bibitem{edr:95}

366: {Japan Electronic Dictionary Research Institute}.

367: \newblock {EDR} electronic dictionary technical guide, 1995.

368: \newblock (In Japanese).

369:

370: \bibitem{kando:sigir-99}

371: Noriko Kando, Kazuko Kuriyama, and Toshihiko Nozue.

372: \newblock {NACSIS} test collection workshop ({NTCIR-1}).

373: \newblock In {\em Proceedings of the 22nd Annual International ACM SIGIR

374:   Conference on Research and Development in Information Retrieval}, pp.

375:   299--300, 1999.

376:

377: \bibitem{littman:clir-98}

378: Michael~L. Littman, Susan~T. Dumais, and Thomas~K. Landauer.

379: \newblock Automatic cross-language information retrieval using latent semantic

380:   indexing.

381: \newblock In Gregory Grefenstette, editor, {\em Cross-Language Information

382:   Retrieval}, chapter~5, pp. 51--62. Kluwer Academic Publishers, 1998.

383:

384: \bibitem{matsumoto:chasen-97}

385: Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Osamu Imaichi, and Tomoaki

386:   Imamura.

387: \newblock {Japanese} morphological analysis system {ChaSen} manual.

388: \newblock Technical Report NAIST-IS-TR97007, NAIST, 1997.

389: \newblock (In Japanese).

390:

391: \bibitem{mccarley:acl-99}

392: J.~Scott McCarley.

393: \newblock Should we translate the documents or the queries in cross-language

394:   information retrieval?

395: \newblock In {\em Proceedings of the 37th Annual Meeting of the Association for

396:   Computational Linguistics}, pp. 208--214, 1999.

397:

398: \bibitem{oard:amta-98}

399: Douglas~W. Oard.

400: \newblock A comparative study of query and document translation for

401:   cross-language information retrieval.

402: \newblock In {\em Proceedings of the 3rd Conference of the Association for

403:   Machine Translation in the Americas}, pp. 472--483, 1998.

404:

405: \bibitem{salton:jasis-70}

406: Gerard Salton.

407: \newblock Automatic processing of foreign language documents.

408: \newblock {\em Journal of the American Society for Information Science},

409:   Vol.~21, No.~3, pp. 187--194, 1970.

410:

411: \bibitem{salton:ipm-88}

412: Gerard Salton and Christopher Buckley.

413: \newblock Term-weighting approaches in automatic text retrieval.

414: \newblock {\em Information Processing \& Management}, Vol.~24, No.~5, pp.

415:   513--523, 1988.

416:

417: \bibitem{voorhees:sigir-98}

418: Ellen~M. Voorhees.

419: \newblock Variations in relevance judgments and the measurement of retrieval

420:   effectiveness.

421: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR

422:   Conference on Research and Development in Information Retrieval}, pp.

423:   315--323, 1998.

424:

425: \bibitem{zobel:sigir-forum-98}

426: Justin Zobel and Alistair Moffat.

427: \newblock Exploring the similarity space.

428: \newblock {\em ACM SIGIR FORUM}, Vol.~32, No.~1, pp. 18--34, 1998.

429:

430: \end{thebibliography}

431:

432: \end{document}

433:

434: % Local Variables:

435: % mode: japanese-LaTeX

436: % TeX-master: t

437: % End:

438: