
1: \documentclass{article}

2: \usepackage[preprint]{spconf}

3:

4: \title{

5: Language Modeling for Multi-Domain Speech-Driven Text Retrieval

6: }

7:

8: \name{

9: Katunobu Itou$^1$, Atsushi Fujii$^2$, Tetsuya Ishikawa$^2$

10: \thanks{The first and second authors

11:        are also members of CREST, Japan Science and Technology

12:        Corporation.}

13: }

14:

15: \address{

16: $^1$ National Institute of Advanced Industrial Science and Technology\\

17:   1-1-1 Chuuou Daini Umezono, Tsukuba, 305-8568, Japan,

18: E-mail: itou@ni.aist.go.jp\\

19: $^2$ University of Library and Information Science\\

20:       1-2 Kasuga, Tsukuba, 305-8550, Japan,

21:       E-mail: \{fujii,ishikawa\}@ulis.ac.jp \\

22: }

23:

24: \begin{document}

25: \ninept

26: \maketitle

27: \begin{abstract}

28: We report experimental results associated with speech-driven text

29: retrieval, which facilitates retrieving information in multiple

30: domains with spoken queries. Since users speak contents related to a

31: target collection, we produce language models used for speech

32: recognition based on the target collection, so as to improve both the

33: recognition and retrieval accuracy. Experiments using existing test

34: collections combined with dictated queries showed the effectiveness of

35: our method.

36: \end{abstract}

37:

38: \newcommand{\etal}{et~al.}

39: \newcommand{\etaleos}{et~al}

40: \newcommand{\eq}[1]{(\ref{#1})}

41: \input{psfig.tex}

42:

43: \section{Introduction}

44: \label{sec:introduction}

45:

46: Automatic speech recognition, which decodes human voice to generate

47: transcriptions, has recently become a practical technology.  It is now

48: feasible to use speech recognition in real-world computer-based

49: applications, specifically those associated with human language.  In

50: fact, a number of speech-based methods have been explored in the

51: information retrieval (IR) community, which can be classified into the

52: following two fundamental categories:

53: \begin{itemize}

54: \item spoken document retrieval, in which written queries are used to

55:   search speech (e.g., broadcast news audio) archives for relevant

56:   speech information~\cite{garofolo:trec-97}.

57: \item speech-driven retrieval, in which spoken queries are used to

58:   retrieve relevant textual information~\cite{barnett:eurospeech-97,crestani:fqas-2000}.

59: \end{itemize}

60:

61: Initiated partially by the TREC-6 spoken document retrieval (SDR)

62: track~\cite{garofolo:trec-97}, various methods have been proposed for

63: spoken document retrieval.  However, a relatively small number of

64: methods have been explored for speech-driven text retrieval, although

65: they are associated with numerous keyboard-less retrieval

66: applications, such as telephone-based retrieval, car navigation

67: systems, and user-friendly interfaces.

68:

69: Barnett~\etal~\cite{barnett:eurospeech-97} performed comparative

70: experiments related to speech-driven retrieval, where the DRAGON

71: speech recognition system was used as an input interface for the

72: INQUERY text retrieval system.  They used as test inputs 35 queries

73: collected from the TREC topics and dictated by a single male speaker.

74: Crestani~\cite{crestani:fqas-2000} also used the above 35 queries and

75: showed that conventional relevance feedback techniques marginally

76: improved the accuracy for speech-driven text retrieval.

77:

78: The above studies focused solely on improving text retrieval methods

79: and did not address problems of improving speech recognition accuracy.

80: In fact, an existing speech recognition system was used with no

81: enhancement. In other words, speech recognition and text retrieval

82: modules were fundamentally independent and were simply connected by

83: way of an input/output protocol.

84:

85: However, since most speech recognition systems are trained based on

86: specific domains, the accuracy of speech recognition across domains is

87: not satisfactory. Thus, as can easily be predicted, in the cases of

88: Barnett~\etal~\cite{barnett:eurospeech-97} and

89: Crestani~\cite{crestani:fqas-2000}, the speech recognition error rate

90: was relatively high and considerably decreased the retrieval accuracy.

91: Additionally, speech recognition with a high accuracy is crucial for

92: interactive retrieval, such as dialog-based retrieval.

93:

94: Motivated by these problems, in this paper we integrate (not simply

95: connect) speech recognition and text retrieval to improve both

96: recognition and retrieval accuracy in the context of speech-driven

97: text retrieval.

98:

99: Unlike general-purpose speech recognition aimed at decoding any

100: spontaneous speech, in the case of speech-driven text retrieval, users

101: usually speak contents associated with a target collection, from which

102: documents relevant to their information need are retrieved.  In a

103: stochastic speech recognition framework, the accuracy depends

104: primarily on acoustic and language models~\cite{bahl:ieee-tpami-1983}.

105: While acoustic models are related to phonetic properties, language

106: models, which represent linguistic contents to be spoken, are

107: related to target collections.  Thus, it is intuitively reasonable that

108: language models should be produced based on target collections.

109:
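In the notation commonly used for this framework (the symbols $A$, for
the acoustic observations, and $W$, for a candidate word sequence, are
introduced here only for illustration), the recognizer searches for
\[
  \hat{W} = \arg\max_{W} P(W|A) = \arg\max_{W} P(A|W)\,P(W),
\]
where $P(A|W)$ is given by the acoustic model and $P(W)$ by the
language model.  Adapting the language model to a target collection
thus changes only the prior $P(W)$ over what users are expected to
say.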

110: To sum up, our belief is that by adapting a language model based on a

111: target IR collection, we can improve the speech recognition and text

112: retrieval accuracy, simultaneously.

113:

114: Section~\ref{sec:system} describes our speech-driven text retrieval

115: system, which is currently implemented for Japanese.

116: Section~\ref{sec:experimentation} elaborates on comparative

117: experiments, in which IR test collections in different domains are

118: used to evaluate the effectiveness of our system.

119:

120: \section{System Description}

121: \label{sec:system}

122:

123: \subsection{Overview}

124: \label{subsec:system_overview}

125:

126: Figure~\ref{fig:system} depicts the overall design of our

127: speech-driven text retrieval system, which consists of speech

128: recognition and text retrieval modules.

129: In the following sections, we explain the two modules in

130: Figure~\ref{fig:system} in turn.

131:

132: \begin{figure}[htbp]

133:   \begin{center}

134:   \leavevmode \psfig{file=system.eps,height=2in}

135:   \end{center}

136:   \caption{The design of our speech-driven text retrieval system.}

137:   \label{fig:system}

138: \end{figure}

139:

140: \subsection{Speech Recognition}

141: \label{subsec:speech_recognition}

142:

143: For the speech recognition module, we use the Japanese dictation

144: toolkit~\cite{kawahara:icslp-2000}\footnote{http://winnie.kuis.kyoto-u.ac.jp/dictation/},

145: which includes the ``Julius'' recognition engine and acoustic/language

146: models.  Julius performs a two-pass (forward-backward) search using

147: word-based forward bigrams and backward trigrams on the respective

148: passes.

149:
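Schematically, and with notation introduced here only for
illustration, a word sequence $W = w_1 \ldots w_n$ is scored on the
first pass with a forward bigram approximation,
\[
  P(W) \approx \prod_{i=1}^{n} P(w_i | w_{i-1}),
\]
and rescored on the second pass with a backward trigram approximation,
\[
  P(W) \approx \prod_{i=1}^{n} P(w_i | w_{i+1}, w_{i+2}),
\]
where sentence-boundary symbols are assumed at both ends.  This sketch
omits details of the decoder, such as smoothing and the weighting
between acoustic and language scores.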

150: The acoustic model was produced from the ASJ speech databases of

151: phonetically balanced sentences (ASJ-PB) and newspaper article texts

152: (ASJ-JNAS)~\cite{itou:98:a}, which contain approximately 20,000

153: sentences uttered by 132 speakers of both genders.

154: We used a 16-mixture Gaussian distribution triphone

155: Hidden Markov Model, where states were clustered into 2,000 groups by

156: a state-tying method.

157:

158: This toolkit also includes development software, so that acoustic and

159: language models can be produced and replaced depending on the

160: application.  While we use the acoustic model provided in the toolkit,

161: we use new language models produced by way of source documents (i.e.,

162: target IR collections).

163:

164: \subsection{Text Retrieval}

165: \label{subsec:text_retrieval}

166:

167: The text retrieval module is based on the ``Okapi''

168: method~\cite{robertson:sigir-94}, which computes the relevance score

169: between the transcribed query and each document in the collection,

170: based on the distribution of index terms, and sorts retrieved documents

171: according to the score in descending order.

172:
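As an illustration only, one widely used Okapi-style weighting is the
BM25 formula below.  The variant actually implemented may differ in
its details, and all symbols are assumptions introduced here.  For a
query $q$ and a document $d$,
\[
  \mathrm{score}(q,d) = \sum_{t \in q}
    \log\frac{N - n_t + 0.5}{n_t + 0.5} \cdot
    \frac{(k_1 + 1)\, f_{t,d}}
         {f_{t,d} + k_1 \left(1 - b + b\,\frac{|d|}{\overline{|d|}}\right)},
\]
where $f_{t,d}$ is the frequency of index term $t$ in $d$, $n_t$ is
the number of documents containing $t$, $N$ is the number of documents
in the collection, $|d|$ is the length of $d$, $\overline{|d|}$ is the
average document length, and $k_1$ and $b$ are tuning constants.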

173: We use content words extracted from documents as index terms, and

174: perform a word-based indexing. For this purpose, we use the ChaSen

175: morphological analyzer~\cite{matsumoto:chasen-99} to extract content

176: words. We extract terms from transcribed queries using the same

177: method.

178:

179: \section{Experimentation}

180: \label{sec:experimentation}

181:

182: \subsection{Test Collections}

183: \label{subsec:test_collection}

184:

185: To investigate the performance of our multi-domain speech-driven

186: retrieval system, we used two different types of Japanese IR test

187: (benchmark) collections: the NTCIR and IREX collections. Both

188: collections, which resemble those used in the TREC ad hoc retrieval

189: track, include topics (information needs) and relevance assessments

190: (correct judgements) for each topic, along with target

191: documents. However, the two collections are associated with different

192: domains.

193:

194: The NTCIR

195: collection~\cite{ntcir-2001}\footnote{http://research.nii.ac.jp/\~{}ntcadm/index-en.html}

196: includes 736,166 abstracts collected from technical papers published

197: by 65 Japanese associations for various fields.  On the other hand,

198: the IREX

199: collection~\cite{sekine:lrec-2000}\footnote{http://cs.nyu.edu/cs/projects/proteus/irex/index-e.html}

200: includes 211,853 articles collected from two years' worth of ``Mainichi

201: Shimbun'' newspaper articles\footnote{In practice, the IREX collection

202: provides only article IDs, which correspond to articles in the Mainichi

203: Shimbun newspaper CD-ROM'94-'95. Participants must get a copy of the

204: CD-ROMs themselves.}.

205:

206: The NTCIR and IREX collections include 132 and 30 Japanese topics,

207: respectively, for a sample of which English translations are also

208: provided. Figures~\ref{fig:ntcir_topic} and \ref{fig:irex_topic} show

209: example topics in each collection, which consist of different fields

210: (for example, descriptions and narratives) tagged in an SGML form.

211:

212: \begin{figure*}[htbp]

213:   \begin{center}

214:     \leavevmode

215:     \begin{quote}

216:       \tt

217:       \footnotesize

218:       <TOPIC q=0123>\\

219:       <TITLE>Biofilms</TITLE>\\

220:       <DESCRIPTION>Are there any documents about the biofilms produced

221:       by some microorganisms in which chronic diseases are mentioned?</DESCRIPTION>\\

222:       <NARRATIVE>Biofilms are thought to occur when microorganisms

223:       grow in microcolonies embedded in the adherent gel surface on

224:       tunica mucosa, and teeth, or on catheters, prosthetic valves,

225:       and other artifacts. A relevant document will report on any

226:       studies into the relationship between biofilms produced by some

227:       microorganisms and chronic diseases. Documents that include

228:       reports on biofilms produced by non-medical microorganisms that

229:       do not cause infectious diseases are not relevant.</NARRATIVE>\\

230:       </TOPIC>

231:     \end{quote}

232:     \caption{An English translation for an example topic in the NTCIR

233:       collection.}

234:     \label{fig:ntcir_topic}

235:   \end{center}

236: \end{figure*}

237:

238: \begin{figure*}[htbp]

239:   \begin{center}

240:     \leavevmode

241:     \begin{quote}

242:       \tt

243:       \footnotesize

244:       <TOPIC> \\

245:       <TOPIC-ID>1001</TOPIC-ID> \\

246:       <DESCRIPTION>Corporate merging</DESCRIPTION> \\

247:       <NARRATIVE>The article describes a corporate merging and in the

248:       article, the name of companies have to be

249:       identifiable. Information

250:       including the field and the purpose of the merging have to be

251:       identifiable. Corporate merging includes corporate acquisition,

252:       corporate unifications and corporate buying.</NARRATIVE> \\

253:       </TOPIC>

254:     \end{quote}

255:     \caption{An English translation for an example topic in the IREX collection.}

256:     \label{fig:irex_topic}

257:   \end{center}

258: \end{figure*}

259:

260: Since neither collection contains spoken queries, we asked four

261: speakers (two male and two female) to dictate the topics. For this purpose, we

262: selectively used a specific topic field, so as to simulate realistic

263: speech-driven retrieval.

264:

265: In the case of the NTCIR topics, titles are not informative for the

266: retrieval. On the other hand, narratives, which usually consist of

267: several sentences, are too long to speak. Thus, only descriptions,

268: which consist of a single phrase or sentence, were dictated by each

269: speaker, so as to produce four different sets of 132 spoken queries.

270: However, in the case of the IREX topics, since descriptions are not

271: informative for the retrieval, only narratives were dictated by each

272: speaker, to produce four different sets of 30 spoken queries.

273:

274: \subsection{Comparative Evaluation}

275: \label{subsec:comparison}

276:

277: We compared the performance of the following retrieval methods:

278: \begin{itemize}

279: \item text-to-text retrieval, which used written queries, and can be

280:   seen as the perfect speech-driven text retrieval,

281: \item speech-driven text retrieval, in which a language model produced

282:   based on the NTCIR collection was used,

283: \item speech-driven text retrieval, in which a language model produced

284:   based on the IREX collection was used.

285: \end{itemize}

286: For the speech-driven text retrieval methods, queries dictated by

287: the four speakers were used independently, and the final result was

288: obtained by averaging the results over the speakers.

289:

290: Although the Julius decoder outputs more than one transcription

291: candidate for a single utterance, we used only the one with the greatest

292: probability score. The results did not significantly change depending

293: on whether or not we used lower-ranked transcriptions as queries.

294:

295: The only difference in producing the two language models (i.e.,

296: those based on the NTCIR and IREX collections) is the source

297: documents. In other words, both language models had the same

298: vocabulary size (20,000) and were produced with the same

299: software.

300:

301: Table~\ref{tab:lang_model} shows statistics related to word

302: tokens/types in the two collections used for language modeling, where

303: the row ``Coverage'' denotes the ratio of word tokens contained in

304: the resultant language model. Most word tokens were covered

305: irrespective of the collection.

306:

307: \begin{table}[htbp]

308:   \begin{center}

309:     \caption{Statistics related to source words for language

310:     modeling.}

311:     \medskip

312:     \leavevmode

313:     \tabcolsep=3pt

314:     \begin{tabular}{lcc} \hline

315:       & NTCIR & IREX \\ \hline

316:       \# of Types & 454K & 179K \\

317:       \# of Tokens & 175M & 53M \\

318:       Coverage & 97.9\% & 96.5\% \\

319:       \hline

320:     \end{tabular}

321:     \label{tab:lang_model}

322:   \end{center}

323: \end{table}

324:

325: Each method retrieved 1,000 top documents, and the TREC evaluation

326: software was used to calculate non-interpolated average precision

327: values and plot recall-precision curves.

328:
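For reference, non-interpolated average precision for a single topic
is commonly computed as follows (the symbols are introduced here only
for illustration).  Let $R$ be the number of relevant documents for
the topic, $P(k)$ the precision of the top $k$ retrieved documents,
and $rel(k)$ equal to 1 if the $k$-th retrieved document is relevant
and 0 otherwise.  Then
\[
  \mathrm{AP} = \frac{1}{R} \sum_{k=1}^{1000} P(k)\, rel(k),
\]
so that relevant documents not retrieved within the top 1,000
contribute zero.  The value reported for a method is typically the
mean of this quantity over all topics.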

329: Table~\ref{tab:results} shows the non-interpolated average precision

330: values (AP) and word error rate in speech recognition, for different

331: retrieval methods. As with existing experiments for speech

332: recognition, word error rate (WER) is the ratio between the number of

333: word errors (i.e., deletions, insertions, and substitutions) and the

334: total number of words. In addition, we investigated the error rate

335: with respect to query terms (i.e., keywords used for retrieval), which

336: we shall call ``term error rate (TER)''.  Table~\ref{tab:results} also

337: shows trigram test-set perplexity (PP) and test-set out-of-vocabulary

338: rate (OOV).

339:
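In symbols (the notation is introduced here only for illustration), if
$S$, $D$, and $I$ denote the numbers of substitutions, deletions, and
insertions, and $N$ the number of words in the reference
transcription, then
\[
  \mathrm{WER} = \frac{S + D + I}{N}.
\]
Similarly, under the usual definition, the perplexity of a language
model on a test set $w_1 \ldots w_n$ is
\[
  \mathrm{PP} = P(w_1 \ldots w_n)^{-1/n},
\]
and the out-of-vocabulary rate is the fraction of test-set word tokens
that are not in the recognition vocabulary.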

340: It should be noted that for all the evaluation measures in

341: Table~\ref{tab:results} except average precision, smaller values

342: indicate better performance.  The following observations can be

343: derived from these results.

344:

345: \begin{table*}[htbp]

346:   \begin{center}

347:     \caption{Results for different retrieval methods targeting the

348:     NTCIR/IREX collections (AP: average

349:     precision, WER: word error rate, TER: term error rate,

350:     PP: trigram test-set perplexity,

351:     OOV: test-set Out-of-Vocabulary rate).}

352:     \medskip

353:     \leavevmode

354:     \footnotesize

355:     \tabcolsep=5pt

356:     \begin{tabular}{l|ccccc|ccccc} \hline

357:       & \multicolumn{5}{|c|}{NTCIR} & \multicolumn{5}{|c}{IREX} \\

358:       \cline{2-11}

359:       \multicolumn{1}{c|}{Language Model}

360:       & AP & WER & TER & PP & OOV

361:       & AP & WER & TER & PP & OOV \\ \hline

362:       Text & 0.337 & --- & --- & --- & ---

363:            & 0.367 & --- & --- & --- & --- \\

364:       NTCIR & 0.261 & 18.6\% & 23.6\% & 60  & 4.2\%

365:             & 0.166 & 31.1\% & 41.0\% & 138 & 6.1\% \\

366:       IREX  & 0.111 & 41.4\% & 54.6\% & 195 & 9.4\%

367:             & 0.334 & 19.5\% & 22.9\% & 108 & 1.4\% \\

368:       \hline

369:     \end{tabular}

370:     \label{tab:results}

371:   \end{center}

372: \end{table*}

373:

374: First, by comparing results of different language models, one can see

375: that the performance was substantially improved with a language model

376: produced from the target collection, which was observable irrespective

377: of the domain. Thus, producing language models based on target

378: collections was quite effective for speech-driven text retrieval.

379:
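The size of this effect can be read directly from
Table~\ref{tab:results}.  As a worked example, for the NTCIR
collection, replacing the IREX-based language model with the
NTCIR-based one gives
\[
  \frac{0.261}{0.111} \approx 2.35
  \qquad \mbox{and} \qquad
  \frac{41.4 - 18.6}{41.4} \approx 0.55,
\]
that is, average precision more than doubles and WER falls by about
55\% in relative terms.  For the IREX collection, replacing the
NTCIR-based model with the IREX-based one gives
\[
  \frac{0.334}{0.166} \approx 2.01
  \qquad \mbox{and} \qquad
  \frac{31.1 - 19.5}{31.1} \approx 0.37 .
\]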

380: Second, while in the case of the NTCIR collection, the average

381: precision for speech-driven retrieval was approximately 77\% of that

382: obtained with text-to-text retrieval, in the case of the IREX

383: collection, the average precision for speech-driven retrieval was

384: quite comparable to that obtained with text-to-text retrieval (approximately 91\%).

385:
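This comparison follows directly from the AP column of
Table~\ref{tab:results} when each collection is paired with its own
language model:
\[
  \frac{0.261}{0.337} \approx 0.77 \quad \mbox{(NTCIR)},
  \qquad
  \frac{0.334}{0.367} \approx 0.91 \quad \mbox{(IREX)}.
\]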

386: Third, TER was generally higher than WER irrespective of the speaker.

387: In other words, speech recognition for content words was more

388: difficult than for function words, which were not contained in query

389: terms.

390:

391: Finally, we investigated the trade-off between recall and precision.

392: Figures~\ref{fig:ntcir} and \ref{fig:irex} show recall-precision

393: curves of different retrieval methods, for the NTCIR and IREX

394: collections, respectively. In these figures, the relative ordering

395: of the precision values obtained with the different language models in

396: Table~\ref{tab:results} was also observable, regardless of the recall level.

397:

398: \begin{figure}[htbp]

399:   \begin{center}

400:   \leavevmode \psfig{file=ntcir.ps,height=2.8in}

401:   \end{center}

402:   \caption{Recall-precision curves for different methods targeting the

403:   NTCIR collection.}

404:   \label{fig:ntcir}

405: \end{figure}

406: %

407: \begin{figure}[htbp]

408:   \begin{center}

409:   \leavevmode \psfig{file=irex.ps,height=2.8in}

410:   \end{center}

411:   \caption{Recall-precision curves for different methods targeting the

412:   IREX collection.}

413:   \label{fig:irex}

414: \end{figure}

415:

416: \section{Conclusion}

417: \label{sec:conclusion}

418:

419: Aiming at speech-driven text retrieval with a high accuracy, we

420: proposed a method to integrate speech recognition and text retrieval

421: methods, in which target text collections are used to produce

422: statistical language models for speech recognition.  We also showed

423: the effectiveness of our method by way of experiments, where dictated

424: information needs in the NTCIR/IREX collections were used as queries

425: to retrieve documents in different domains.

426:

427: \section*{Acknowledgments}

428:

429: The authors would like to thank the National Institute of Informatics

430: for their support with the NTCIR collection and the IREX committee for

431: their support with the IREX collection.

432:

433: \bibliographystyle{IEEEbib}

434:

435: \begin{thebibliography}{10}

436:

437: \bibitem{garofolo:trec-97}

438: John~S. Garofolo, Ellen~M. Voorhees, Vincent~M. Stanford, and Karen~Sparck

439:   Jones,

440: \newblock ``{TREC-6} 1997 spoken document retrieval track overview and

441:   results,''

442: \newblock in {\em Proceedings of the 6th Text REtrieval Conference}, 1997, pp.

443:   83--91.

444:

445: \bibitem{barnett:eurospeech-97}

446: J.~Barnett, S.~Anderson, J.~Broglio, M.~Singh, R.~Hudson, and S.~W. Kuo,

447: \newblock ``Experiments in spoken queries for document retrieval,''

448: \newblock in {\em Proceedings of Eurospeech97}, 1997, pp. 1323--1326.

449:

450: \bibitem{crestani:fqas-2000}

451: Fabio Crestani,

452: \newblock ``Word recognition errors and relevance feedback in spoken query

453:   processing,''

454: \newblock in {\em Proceedings of the Fourth International Conference on

455:   Flexible Query Answering Systems}, 2000, pp. 267--281.

456:

457: \bibitem{bahl:ieee-tpami-1983}

458: Lalit~R. Bahl, Frederick Jelinek, and Robert~L. Mercer,

459: \newblock ``A maximum likelihood approach to continuous speech recognition,''

460: \newblock {\em IEEE Transactions on Pattern Analysis and Machine Intelligence},

461:   vol. 5, no. 2, pp. 179--190, 1983.

462:

463: \bibitem{kawahara:icslp-2000}

464: T.~Kawahara, A.~Lee, T.~Kobayashi, K.~Takeda, N.~Minematsu, S.~Sagayama,

465:   K.~Itou, A.~Ito, M.~Yamamoto, A.~Yamada, T.~Utsuro, and K.~Shikano,

466: \newblock ``Free software toolkit for {Japanese} large vocabulary continuous

467:   speech recognition,''

468: \newblock in {\em Proceedings of the 6th International Conference on Spoken

469:   Language Processing}, 2000, pp. 476--479.

470:

471: \bibitem{itou:98:a}

472: K.~Itou, M.~Yamamoto, K.~Takeda, T.~Takezawa, T.~Matsuoka, T.~Kobayashi,

473:   K.~Shikano, and S.~Itahashi,

474: \newblock ``The design of the newspaper-based {Japanese} large vocabulary

475:   continuous speech recognition corpus,''

476: \newblock in {\em ICSLP-98}, 1998, pp. 3261--3264.

477:

478: \bibitem{robertson:sigir-94}

479: S.~E. Robertson and S.~Walker,

480: \newblock ``Some simple effective approximations to the {2-Poisson} model for

481:   probabilistic weighted retrieval,''

482: \newblock in {\em Proceedings of the 17th Annual International ACM SIGIR

483:   Conference on Research and Development in Information Retrieval}, 1994, pp.

484:   232--241.

485:

486: \bibitem{matsumoto:chasen-99}

487: Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Yoshitaka Hirano, Hiroshi

488:   Matsuda, and Masayuki Asahara,

489: \newblock ``{Japanese} morphological analysis system {ChaSen} version 2.0

490:   manual 2nd edition,''

491: \newblock Tech. {R}ep. NAIST-IS-TR99009, NAIST, 1999.

492:

493: \bibitem{ntcir-2001}

494: {National Institute of Informatics},

495: \newblock {\em Proceedings of the 2nd NTCIR Workshop Meeting on Evaluation of

496:   Chinese \& Japanese Text Retrieval and Text Summarization}, 2001.

497:

498: \bibitem{sekine:lrec-2000}

499: Satoshi Sekine and Hitoshi Isahara,

500: \newblock ``{IREX:} {IR} and {IE} evaluation project in {Japanese},''

501: \newblock in {\em Proceedings of the 2nd International Conference on Language

502:   Resources and Evaluation}, 2000, pp. 1475--1480.

503:

504: \end{thebibliography}

505:

506: \end{document}

507: