cs0206036/main.tex
1: \documentclass{article}
2: \usepackage[preprint]{spconf}
3: 
4: \title{
5: Language Modeling for Multi-Domain Speech-Driven Text Retrieval
6: }
7: 
8: \name{
9: Katunobu Itou$^1$, Atsushi Fujii$^2$, Tetsuya Ishikawa$^2$
10: \thanks{The first and second authors
11:        are also members of CREST, Japan Science and Technology
12:        Corporation.}
13: }
14: 
15: \address{
16: $^1$ National Institute of Advanced Industrial Science and Technology\\
17:   1-1-1 Chuuou Daini Umezono, Tsukuba, 305-8568, Japan, 
18: E-mail: itou@ni.aist.go.jp\\
19: $^2$ University of Library and Information Science\\
20:       1-2 Kasuga, Tsukuba, 305-8550, Japan, 
21:       E-mail: \{fujii,ishikawa\}@ulis.ac.jp \\
22: }
23: 
24: \begin{document}
25: \ninept
26: \maketitle
27: \begin{abstract}
28: We report experimental results associated with speech-driven text
29: retrieval, which facilitates retrieving information in multiple
30: domains with spoken queries. Since users speak contents related to a
31: target collection, we produce language models used for speech
32: recognition based on the target collection, so as to improve both the
33: recognition and retrieval accuracy. Experiments using existing test
34: collections combined with dictated queries showed the effectiveness of
35: our method.
36: \end{abstract}
37: 
38: \newcommand{\etal}{et~al.}
39: \newcommand{\etaleos}{et~al}
40: \newcommand{\eq}[1]{(\ref{#1})}
41: \input{psfig.tex}
42: 
43: \section{Introduction}
44: \label{sec:introduction}
45: 
46: Automatic speech recognition, which decodes human voice to generate
47: transcriptions, has recently become a practical technology.  It is now
48: feasible to use speech recognition in real-world computer-based
49: applications, specifically those associated with human language.  In
50: fact, a number of speech-based methods have been explored in the
51: information retrieval (IR) community, which can be classified into the
52: following two fundamental categories:
53: \begin{itemize}
54: \item spoken document retrieval, in which written queries are used to
55:   search speech (e.g., broadcast news audio) archives for relevant
56:   speech information~\cite{garofolo:trec-97}.
57: \item speech-driven retrieval, in which spoken queries are used to
58:   retrieve relevant textual information~\cite{barnett:eurospeech-97,crestani:fqas-2000}.
59: \end{itemize}
60: 
61: Initiated partially by the TREC-6 spoken document retrieval (SDR)
62: track~\cite{garofolo:trec-97}, various methods have been proposed for
63: spoken document retrieval.  However, a relatively small number of
64: methods have been explored for speech-driven text retrieval, although
65: they are associated with numerous keyboard-less retrieval
66: applications, such as telephone-based retrieval, car navigation
67: systems, and user-friendly interfaces.
68: 
69: Barnett~\etal~\cite{barnett:eurospeech-97} performed comparative
70: experiments related to speech-driven retrieval, where the DRAGON
71: speech recognition system was used as an input interface for the
72: INQUERY text retrieval system.  They used as test inputs 35 queries
73: collected from the TREC topics and dictated by a single male speaker.
74: Crestani~\cite{crestani:fqas-2000} also used the above 35 queries and
75: showed that conventional relevance feedback techniques marginally
76: improved the accuracy for speech-driven text retrieval.
77: 
78: The above studies focused solely on improving text retrieval methods
79: and did not address the problem of improving speech recognition accuracy.
80: In fact, an existing speech recognition system was used with no
81: enhancement. In other words, speech recognition and text retrieval
82: modules were fundamentally independent and were simply connected by
83: way of an input/output protocol.
84: 
85: However, since most speech recognition systems are trained based on
86: specific domains, the accuracy of speech recognition across domains is
87: not satisfactory. Thus, as can easily be predicted, in the cases of
88: Barnett~\etal~\cite{barnett:eurospeech-97} and
89: Crestani~\cite{crestani:fqas-2000}, the speech recognition error rate
90: was relatively high and considerably decreased the retrieval accuracy.
91: Additionally, speech recognition with a high accuracy is crucial for
92: interactive retrieval, such as dialog-based retrieval.
93: 
94: Motivated by these problems, in this paper we integrate (not simply
95: connect) speech recognition and text retrieval to improve both
96: recognition and retrieval accuracy in the context of speech-driven
97: text retrieval.
98: 
99: Unlike general-purpose speech recognition, which aims to decode any
100: spontaneous speech, in the case of speech-driven text retrieval, users
101: usually speak contents associated with a target collection, from which
102: documents relevant to their information need are retrieved.  In a
103: stochastic speech recognition framework, the accuracy depends
104: primarily on acoustic and language models~\cite{bahl:ieee-tpami-1983}.
105: While acoustic models are related to phonetic properties, language
106: models, which represent linguistic contents to be spoken, are
107: related to target collections.  Thus, it is intuitively reasonable that
108: language models should be produced based on target collections.
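
The dependence on the two models can be made explicit with the standard maximum a posteriori formulation of stochastic speech recognition.  The following is only a sketch, and the symbols $X$ (the acoustic observation sequence) and $W$ (a candidate word sequence) are notation assumed here for illustration:
\[
  \hat{W} = \mathop{\rm argmax}_{W} P(W \mid X)
          = \mathop{\rm argmax}_{W} P(X \mid W)\, P(W),
\]
where $P(X \mid W)$ is given by the acoustic model, $P(W)$ is given by the language model, and the denominator $P(X)$ is dropped because it does not depend on $W$.  Only $P(W)$ reflects what users are likely to say, which is why it is the component that is tied to the target collection.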
109: 
110: To sum up, our belief is that by adapting a language model based on a
111: target IR collection, we can improve speech recognition and text
112: retrieval accuracy simultaneously.
113: 
114: Section~\ref{sec:system} describes our speech-driven text retrieval
115: system, which is currently implemented for Japanese.
116: Section~\ref{sec:experimentation} elaborates on comparative
117: experiments, in which IR test collections in different domains are
118: used to evaluate the effectiveness of our system.
119: 
120: \section{System Description}
121: \label{sec:system}
122: 
123: \subsection{Overview}
124: \label{subsec:system_overview}
125: 
126: Figure~\ref{fig:system} depicts the overall design of our
127: speech-driven text retrieval system, which consists of speech
128: recognition and text retrieval modules. 
129: In the following sections, we explain the two modules in
130: Figure~\ref{fig:system} in turn.
131: 
132: \begin{figure}[htbp]
133:   \begin{center}
134:   \leavevmode \psfig{file=system.eps,height=2in}
135:   \end{center}
136:   \caption{The design of our speech-driven text retrieval system.}
137:   \label{fig:system}
138: \end{figure}
139: 
140: \subsection{Speech Recognition}
141: \label{subsec:speech_recognition}
142: 
143: For the speech recognition module, we use the Japanese dictation
144: toolkit~\cite{kawahara:icslp-2000}\footnote{http://winnie.kuis.kyoto-u.ac.jp/dictation/},
145: which includes the ``Julius'' recognition engine and acoustic/language
146: models.  Julius performs a two-pass (forward-backward) search using
147: word-based forward bigrams and backward trigrams on the respective
148: passes.
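
As a rough sketch of what the two passes score, let $W = w_1, \ldots, w_n$ be a word sequence (this notation is assumed here, and smoothing and sentence-boundary handling are omitted).  A forward bigram model and a backward trigram model then approximate the probability of the sequence as
\[
  P(W) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
  \qquad \mbox{and} \qquad
  P(W) \approx \prod_{i=1}^{n} P(w_i \mid w_{i+1}, w_{i+2}),
\]
respectively, where the backward model conditions each word on the words that follow it rather than on those that precede it.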
149: 
150: The acoustic model was trained on the ASJ speech databases of
151: phonetically balanced sentences (ASJ-PB) and newspaper article texts
152: (ASJ-JNAS)~\cite{itou:98:a}, which contain approximately 20,000
153: sentences uttered by 132 speakers of both genders.
154: We used a 16-mixture Gaussian distribution triphone
155: Hidden Markov Model, where states were clustered into 2,000 groups by
156: a state-tying method.
157: 
158: This toolkit also includes development software, so that acoustic and
159: language models can be produced and replaced depending on the
160: application.  While we use the acoustic model provided in the toolkit,
161: we use new language models produced from source documents (i.e.,
162: target IR collections).
163: 
164: \subsection{Text Retrieval}
165: \label{subsec:text_retrieval}
166: 
167: The text retrieval module is based on the ``Okapi''
168: method~\cite{robertson:sigir-94}, which computes the relevance score
169: between the transcribed query and each document in the collection,
170: based on the distribution of index terms, and sorts retrieved documents
171: according to the score in descending order.
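
The exact weighting variant and parameter values are not specified above.  As an illustration only, the widely used BM25 form of Okapi-style scoring is shown below; the symbols $k_1$, $b$, $N$, $n_t$, $f_{t,d}$, $|d|$, and $\overline{|d|}$ are assumptions introduced for this sketch and are not taken from our implementation:
\[
  \mathrm{score}(q,d) = \sum_{t \in q}
    \log\frac{N - n_t + 0.5}{n_t + 0.5} \cdot
    \frac{(k_1 + 1)\, f_{t,d}}
         {k_1 \Bigl( (1-b) + b\,\frac{|d|}{\overline{|d|}} \Bigr) + f_{t,d}},
\]
where $N$ is the number of documents in the collection, $n_t$ is the number of documents containing term $t$, $f_{t,d}$ is the frequency of $t$ in document $d$, $|d|$ is the length of $d$, and $\overline{|d|}$ is the average document length.  The first factor rewards rare terms, and the second saturates with term frequency while normalizing for document length.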
172: 
173: We use content words extracted from documents as index terms, and
174: perform a word-based indexing. For this purpose, we use the ChaSen
175: morphological analyzer~\cite{matsumoto:chasen-99} to extract content
176: words. We extract terms from transcribed queries using the same
177: method.
178: 
179: \section{Experimentation}
180: \label{sec:experimentation}
181: 
182: \subsection{Test Collections}
183: \label{subsec:test_collection}
184: 
185: To investigate the performance of our multi-domain speech-driven
186: retrieval system, we used two different types of Japanese IR test
187: (benchmark) collections: the NTCIR and IREX collections. Both
188: collections, which resemble the one used in the TREC ad hoc retrieval
189: track, include topics (information needs) and relevance assessments
190: (correct judgements) for each topic, along with target
191: documents. However, the two collections are associated with different
192: domains.
193: 
194: The NTCIR
195: collection~\cite{ntcir-2001}\footnote{http://research.nii.ac.jp/\~{}ntcadm/index-en.html}
196: includes 736,166 abstracts collected from technical papers published
197: by 65 Japanese associations for various fields.  On the other hand,
198: the IREX
199: collection~\cite{sekine:lrec-2000}\footnote{http://cs.nyu.edu/cs/projects/proteus/irex/index-e.html}
200: includes 211,853 articles collected from two years' worth of ``Mainichi
201: Shimbun'' newspaper articles\footnote{In practice, the IREX collection
202: provides only article IDs, which correspond to articles in the Mainichi
203: Shimbun newspaper CD-ROM'94-'95. Participants must obtain a copy of the
204: CD-ROMs themselves.}.
205: 
206: The NTCIR and IREX collections include 132 and 30 Japanese topics,
207: respectively, for a sample of which English translations are also
208: provided. Figures~\ref{fig:ntcir_topic} and \ref{fig:irex_topic} show
209: example topics in each collection, which consist of different fields
210: (for example, descriptions and narratives) tagged in an SGML form.
211: 
212: \begin{figure*}[htbp]
213:   \begin{center}
214:     \leavevmode
215:     \begin{quote}
216:       \tt
217:       \footnotesize
218:       <TOPIC q=0123>\\
219:       <TITLE>Biofilms</TITLE>\\
220:       <DESCRIPTION>Are there any documents about the biofilms produced
221:       by some microorganisms in which chronic diseases are mentioned?</DESCRIPTION>\\
222:       <NARRATIVE>Biofilms are thought to occur when microorganisms
223:       grow in microcolonies embedded in the adherent gel surface on
224:       tunica mucosa, and teeth, or on catheters, prosthetic valves,
225:       and other artifacts. A relevant document will report on any
226:       studies into the relationship between biofilms produced by some
227:       microorganisms and chronic diseases. Documents that include
228:       reports on biofilms produced by non-medical microorganisms that
229:       do not cause infectious diseases are not relevant.</NARRATIVE>\\
230:       </TOPIC>
231:     \end{quote}
232:     \caption{An English translation for an example topic in the NTCIR
233:       collection.}
234:     \label{fig:ntcir_topic}
235:   \end{center}
236: \end{figure*}
237: 
238: \begin{figure*}[htbp]
239:   \begin{center}
240:     \leavevmode
241:     \begin{quote}
242:       \tt
243:       \footnotesize
244:       <TOPIC> \\
245:       <TOPIC-ID>1001</TOPIC-ID> \\
246:       <DESCRIPTION>Corporate merging</DESCRIPTION> \\
247:       <NARRATIVE>The article describes a corporate merging and in the
248:       article, the name of companies have to be
249:       identifiable. Information
250:       including the field and the purpose of the merging have to be
251:       identifiable. Corporate merging includes corporate acquisition,
252:       corporate unifications and corporate buying.</NARRATIVE> \\
253:       </TOPIC>
254:     \end{quote}
255:     \caption{An English translation for an example topic in the IREX collection.}
256:     \label{fig:irex_topic}
257:   \end{center}
258: \end{figure*}
259: 
260: Since neither collection contains spoken queries, we asked four
261: speakers (two male and two female) to dictate topics. For this purpose, we
262: selectively used a specific field, so as to simulate realistic
263: speech-driven retrieval.
264: 
265: In the case of the NTCIR topics, titles are not informative for the
266: retrieval. On the other hand, narratives, which usually consist of
267: several sentences, are too long to speak. Thus, only descriptions,
268: which consist of a single phrase or sentence, were dictated by each
269: speaker, so as to produce four different sets of 132 spoken queries.
270: However, in the case of the IREX topics, since descriptions are not
271: informative for the retrieval, only narratives were dictated by each
272: speaker, to produce four different sets of 30 spoken queries.
273: 
274: \subsection{Comparative Evaluation}
275: \label{subsec:comparison}
276: 
277: We compared the performance of the following retrieval methods:
278: \begin{itemize}
279: \item text-to-text retrieval, which used written queries, and can be
280:   seen as the perfect speech-driven text retrieval,
281: \item speech-driven text retrieval, in which a language model produced
282:   based on the NTCIR collection was used,
283: \item speech-driven text retrieval, in which a language model produced
284:   based on the IREX collection was used.
285: \end{itemize}
286: For the speech-driven text retrieval methods, queries dictated by
287: the four speakers were used independently, and the final result was
288: obtained by averaging the results over speakers.
289: 
290: Although the Julius decoder outputs more than one transcription
291: candidate for a single utterance, we used only the one with the greatest
292: probability score. The results did not significantly change depending
293: on whether or not we used lower-ranked transcriptions as queries.
294: 
295: The only difference in producing the two language models (i.e.,
296: those based on the NTCIR and IREX collections) was the source
297: documents. In other words, both language models had the same
298: vocabulary size (20,000), and were produced with the same
299: software.
300: 
301: Table~\ref{tab:lang_model} shows statistics related to word
302: tokens/types in the two collections used for language modeling, where
303: the row ``Coverage'' denotes the ratio of word tokens contained in
304: the resultant language model. Most word tokens were covered
305: irrespective of the collection.
306: 
307: \begin{table}[htbp]
308:   \begin{center}
309:     \caption{Statistics related to source words for language
310:     modeling.}
311:     \medskip
312:     \leavevmode
313:     \tabcolsep=3pt
314:     \begin{tabular}{lcc} \hline
315:       & NTCIR & IREX \\ \hline
316:       \# of Types & 454K & 179K \\
317:       \# of Tokens & 175M & 53M \\
318:       Coverage & 97.9\% & 96.5\% \\
319:       \hline
320:     \end{tabular}
321:     \label{tab:lang_model}
322:   \end{center}
323: \end{table}
324: 
325: Each method retrieved 1,000 top documents, and the TREC evaluation
326: software was used to calculate non-interpolated average precision
327: values and plot recall-precision curves.
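
For reference, non-interpolated average precision for a single topic can be sketched as follows, where $R$, $\mathrm{rel}(k)$, and $P(k)$ are notation assumed here for illustration:
\[
  \mathrm{AP} = \frac{1}{R} \sum_{k=1}^{1000} P(k)\, \mathrm{rel}(k),
\]
where $R$ is the number of relevant documents for the topic, $\mathrm{rel}(k)$ is 1 if the document at rank $k$ is relevant and 0 otherwise, and $P(k)$ is the precision computed over the top $k$ retrieved documents.  The upper limit reflects the 1,000 documents retrieved per topic, and a collection-level figure is obtained by averaging over topics.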
328: 
329: Table~\ref{tab:results} shows the non-interpolated average precision
330: values (AP) and word error rate in speech recognition, for different
331: retrieval methods. As with existing experiments for speech
332: recognition, word error rate (WER) is the ratio between the number of
333: word errors (i.e., deletion, insertion, and substitution) and the
334: total number of words. In addition, we investigated the error rate
335: with respect to query terms (i.e., keywords used for retrieval), which
336: we shall call ``term error rate (TER)''.  Table~\ref{tab:results} also
337: shows trigram test-set perplexity (PP) and test-set out-of-vocabulary
338: rate (OOV).
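
These measures can be written compactly as follows.  This is a sketch under stated assumptions: the symbols $D$, $I$, $S$, $N$, and $w_1, \ldots, w_n$ are introduced here for illustration, and $N$ is taken to be the number of words in the reference transcription:
\[
  \mathrm{WER} = \frac{D + I + S}{N},
\]
where $D$, $I$, and $S$ are the numbers of deletions, insertions, and substitutions.  TER is the same ratio computed only over query terms.  For a test set $w_1, \ldots, w_n$, trigram perplexity is
\[
  \mathrm{PP} = \exp\Bigl( -\frac{1}{n} \sum_{i=1}^{n}
    \ln P(w_i \mid w_{i-2}, w_{i-1}) \Bigr),
\]
and the OOV rate is the fraction of test-set word tokens that fall outside the 20,000-word vocabulary of the language model.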
339: 
340: It should be noted that for all the evaluation measures in
341: Table~\ref{tab:results} except average precision, smaller values
342: are generally obtained with better methods.  The observations that can be
343: drawn from these results are as follows.
344: 
345: \begin{table*}[htbp]
346:   \begin{center}
347:     \caption{Results for different retrieval methods targeting the
348:     NTCIR/IREX collections (AP: average
349:     precision, WER: word error rate, TER: term error rate,
350:     PP: trigram test-set perplexity, 
351:     OOV: test-set Out-of-Vocabulary rate).}
352:     \medskip
353:     \leavevmode
354:     \footnotesize
355:     \tabcolsep=5pt
356:     \begin{tabular}{l|ccccc|ccccc} \hline
357:       & \multicolumn{5}{|c|}{NTCIR} & \multicolumn{5}{|c}{IREX} \\
358:       \cline{2-11}
359:       \multicolumn{1}{c|}{Language Model}
360:       & AP & WER & TER & PP & OOV
361:       & AP & WER & TER & PP & OOV \\ \hline
362:       Text & 0.337 & --- & --- & --- & --- 
363:            & 0.367 & --- & --- & --- & --- \\
364:       NTCIR & 0.261 & 18.6\% & 23.6\% & 60  & 4.2\%
365:             & 0.166 & 31.1\% & 41.0\% & 138 & 6.1\% \\
366:       IREX  & 0.111 & 41.4\% & 54.6\% & 195 & 9.4\%  
367:             & 0.334 & 19.5\% & 22.9\% & 108 & 1.4\% \\
368:       \hline
369:     \end{tabular}
370:     \label{tab:results}
371:   \end{center}
372: \end{table*}
373: 
374: First, by comparing results of different language models, one can see
375: that the performance was substantially improved with a language model
376: produced from the target collection (AP rose from 0.111 to 0.261 for NTCIR
377: and from 0.166 to 0.334 for IREX), irrespective of the domain. Thus, producing
378: language models based on target collections was quite effective for speech-driven text retrieval.
379: 
380: Second, while in the case of the NTCIR collection, the average
381: precision for speech-driven retrieval was approximately 77\% of that
382: obtained with text-to-text retrieval, in the case of the IREX
383: collection, the average precision for speech-driven retrieval was
384: quite comparable to that obtained with text-to-text retrieval.
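
The ratios behind this comparison follow directly from the AP values in Table~\ref{tab:results}, taking in each case the language model produced from the target collection:
\[
  \frac{0.261}{0.337} \approx 0.77 \quad \mbox{(NTCIR)},
  \qquad
  \frac{0.334}{0.367} \approx 0.91 \quad \mbox{(IREX)}.
\]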
385: 
386: Third, TER was generally higher than WER irrespective of the speaker.
387: In other words, speech recognition for content words was more
388: difficult than for function words, which were not included among query
389: terms.
390: 
391: Finally, we investigated the trade-off between recall and precision.
392: Figures~\ref{fig:ntcir} and \ref{fig:irex} show recall-precision
393: curves of different retrieval methods, for the NTCIR and IREX
394: collections, respectively. In these figures, the relative ordering
395: of precision values across language models seen in
396: Table~\ref{tab:results} was also observable, regardless of the recall.
397: 
398: \begin{figure}[htbp]
399:   \begin{center}
400:   \leavevmode \psfig{file=ntcir.ps,height=2.8in}
401:   \end{center}
402:   \caption{Recall-precision curves for different methods targeting the
403:   NTCIR collection.}
404:   \label{fig:ntcir}
405: \end{figure}
406: %
407: \begin{figure}[htbp]
408:   \begin{center}
409:   \leavevmode \psfig{file=irex.ps,height=2.8in}
410:   \end{center}
411:   \caption{Recall-precision curves for different methods targeting the
412:   IREX collection.}
413:   \label{fig:irex}
414: \end{figure}
415: 
416: \section{Conclusion}
417: \label{sec:conclusion}
418: 
419: Aiming at speech-driven text retrieval with a high accuracy, we
420: proposed a method to integrate speech recognition and text retrieval
421: methods, in which target text collections are used to produce
422: statistical language models for speech recognition.  We also showed
423: the effectiveness of our method by way of experiments, where dictated
424: information needs in the NTCIR/IREX collections were used as queries
425: to retrieve documents in different domains.
426: 
427: \section*{Acknowledgments}
428: 
429: The authors would like to thank the National Institute of Informatics
430: for their support with the NTCIR collection and the IREX committee for
431: their support with the IREX collection.
432: 
433: \bibliographystyle{IEEEbib}
434: 
435: \begin{thebibliography}{10}
436: 
437: \bibitem{garofolo:trec-97}
438: John~S. Garofolo, Ellen~M. Voorhees, Vincent~M. Stanford, and Karen~Sparck
439:   Jones,
440: \newblock ``{TREC-6} 1997 spoken document retrieval track overview and
441:   results,''
442: \newblock in {\em Proceedings of the 6th Text REtrieval Conference}, 1997, pp.
443:   83--91.
444: 
445: \bibitem{barnett:eurospeech-97}
446: J.~Barnett, S.~Anderson, J.~Broglio, M.~Singh, R.~Hudson, and S.~W. Kuo,
447: \newblock ``Experiments in spoken queries for document retrieval,''
448: \newblock in {\em Proceedings of Eurospeech97}, 1997, pp. 1323--1326.
449: 
450: \bibitem{crestani:fqas-2000}
451: Fabio Crestani,
452: \newblock ``Word recognition errors and relevance feedback in spoken query
453:   processing,''
454: \newblock in {\em Proceedings of the Fourth International Conference on
455:   Flexible Query Answering Systems}, 2000, pp. 267--281.
456: 
457: \bibitem{bahl:ieee-tpami-1983}
458: Lalit~R. Bahl, Frederick Jelinek, and Robert~L. Mercer,
459: \newblock ``A maximum likelihood approach to continuous speech recognition,''
460: \newblock {\em IEEE Transactions on Pattern Analysis and Machine Intelligence},
461:   vol. 5, no. 2, pp. 179--190, 1983.
462: 
463: \bibitem{kawahara:icslp-2000}
464: T.~Kawahara, A.~Lee, T.~Kobayashi, K.~Takeda, N.~Minematsu, S.~Sagayama,
465:   K.~Itou, A.~Ito, M.~Yamamoto, A.~Yamada, T.~Utsuro, and K.~Shikano,
466: \newblock ``Free software toolkit for {Japanese} large vocabulary continuous
467:   speech recognition,''
468: \newblock in {\em Proceedings of the 6th International Conference on Spoken
469:   Language Processing}, 2000, pp. 476--479.
470: 
471: \bibitem{itou:98:a}
472: K.~Itou, M.~Yamamoto, K.~Takeda, T.~Takezawa, T.~Matsuoka, T.~Kobayashi,
473:   K.~Shikano, and S.~Itahashi,
474: \newblock ``The design of the newspaper-based {Japanese} large vocabulary
475:   continuous speech recognition corpus,''
476: \newblock in {\em ICSLP-98}, 1998, pp. 3261--3264.
477: 
478: \bibitem{robertson:sigir-94}
479: S.~E. Robertson and S.~Walker,
480: \newblock ``Some simple effective approximations to the 2-{P}oisson model for
481:   probabilistic weighted retrieval,''
482: \newblock in {\em Proceedings of the 17th Annual International ACM SIGIR
483:   Conference on Research and Development in Information Retrieval}, 1994, pp.
484:   232--241.
485: 
486: \bibitem{matsumoto:chasen-99}
487: Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Yoshitaka Hirano, Hiroshi
488:   Matsuda, and Masayuki Asahara,
489: \newblock ``{Japanese} morphological analysis system {ChaSen} version 2.0
490:   manual 2nd edition,''
491: \newblock Tech. {R}ep. NAIST-IS-TR99009, NAIST, 1999.
492: 
493: \bibitem{ntcir-2001}
494: {National Institute of Informatics},
495: \newblock {\em Proceedings of the 2nd NTCIR Workshop Meeting on Evaluation of
496:   Chinese \& Japanese Text Retrieval and Text Summarization}, 2001.
497: 
498: \bibitem{sekine:lrec-2000}
499: Satoshi Sekine and Hitoshi Isahara,
500: \newblock ``{IREX:} {IR} and {IE} evaluation project in {Japanese},''
501: \newblock in {\em Proceedings of the 2nd International Conference on Language
502:   Resources and Evaluation}, 2000, pp. 1475--1480.
503: 
504: \end{thebibliography}
505: 
506: \end{document}
507: