\documentclass[runningheads]{llncs}

\input{psfig.sty}

\begin{document}

\pagestyle{headings}

\mainmatter

\title{Speech-Driven Text Retrieval: Using Target IR
Collections for Statistical Language Model Adaptation in Speech
Recognition}

\titlerunning{Speech-Driven Text Retrieval}

\author{Atsushi Fujii\inst{1} \and Katunobu Itou\inst{2}
\and Tetsuya Ishikawa\inst{1}}

\authorrunning{Atsushi Fujii et al.}

\institute{University of Library and Information Science\\
1-2 Kasuga, Tsukuba, 305-8550, Japan\\
\email{\{fujii,ishikawa\}@ulis.ac.jp}\\
\and
National Institute of Advanced Industrial Science and Technology\\
1-1-1 Chuuou Daini Umezono, Tsukuba, 305-8568, Japan\\
\email{itou@ni.aist.go.jp}}

\maketitle

\begin{abstract}
  Speech recognition has recently become a practical technology for
  real-world applications. Aiming at speech-driven text retrieval,
  which enables users to retrieve information with spoken queries, we
  propose a method that integrates speech recognition and text
  retrieval. Since users' spoken queries are related to the content of
  a target collection, we adapt the statistical language models used
  for speech recognition to the target collection, so as to improve
  both recognition and retrieval accuracy. Experiments using existing
  test collections combined with dictated queries showed the
  effectiveness of our method.
\end{abstract}

\newcommand{\etal}{et~al.}
\newcommand{\etaleos}{et~al}
\newcommand{\eq}[1]{(\ref{#1})}

\section{Introduction}
\label{sec:introduction}

Automatic speech recognition, which decodes human speech to generate
transcriptions, has recently become a practical technology. It is now
feasible to use speech recognition in real-world computer-based
applications, specifically those involving human language. In fact, a
number of speech-based methods have been explored in the information
retrieval community; these can be classified into the following two
fundamental categories:
\begin{itemize}
\item spoken document retrieval, in which written queries are used to
  search speech archives (e.g., broadcast news audio) for relevant
  speech information~\cite{johnson:icassp-99,jones:sigir-96,sheridan:sigir-97,singhal:sigir-99,srinivasan:sigir-2000,wechsler:sigir-98,whittaker:sigir-99},
\item speech-driven (spoken query) retrieval, in which spoken queries
  are used to retrieve relevant textual information~\cite{barnett:eurospeech-97,crestani:fqas-2000}.
\end{itemize}

Partly prompted by the TREC-6 spoken document retrieval (SDR)
track~\cite{garofolo:trec-97}, various methods have been proposed for
spoken document retrieval. However, relatively few methods have been
explored for speech-driven text retrieval, although such methods are
relevant to numerous keyboard-less retrieval applications, such as
telephone-based retrieval, car navigation systems, and user-friendly
interfaces.

Barnett~\etal~\cite{barnett:eurospeech-97} performed comparative
experiments related to speech-driven retrieval, in which an existing
speech recognition system was used as an input interface for the
INQUERY text retrieval system. As test inputs, they used 35 queries
derived from TREC topics 101--135, dictated by a single male
speaker. Crestani~\cite{crestani:fqas-2000} also used these 35
queries and showed that conventional relevance feedback techniques
marginally improved the accuracy of speech-driven text retrieval.

These studies focused solely on improving text retrieval methods
and did not address the problem of improving speech recognition
accuracy. In fact, an existing speech recognition system was used
without modification. In other words, the speech recognition and text
retrieval modules were fundamentally independent and were simply
connected by way of an input/output protocol.

However, since most speech recognition systems are trained on
specific domains, their accuracy on other domains is often
unsatisfactory. Thus, as might be expected, in the experiments of
Barnett~\etal~\cite{barnett:eurospeech-97} and
Crestani~\cite{crestani:fqas-2000}, a relatively high speech
recognition error rate considerably decreased the retrieval accuracy.
Additionally, highly accurate speech recognition is crucial for
interactive retrieval.

Motivated by these problems, in this paper we integrate (rather than
simply connect) speech recognition and text retrieval to improve both
recognition and retrieval accuracy in the context of speech-driven
text retrieval.

Unlike general-purpose speech recognition, which aims to decode
arbitrary spontaneous speech, in speech-driven text retrieval users
usually speak about content associated with a target collection, from
which documents relevant to their information needs are retrieved. In
a stochastic speech recognition framework, the accuracy depends
primarily on acoustic and language models~\cite{bahl:ieee-tpami-1983}.
While acoustic models are related to phonetic properties, language
models, which represent the linguistic content to be spoken, are
strongly related to target collections. Thus, it is intuitively
reasonable to produce language models based on target collections.

In short, we believe that by adapting a language model to a target IR
collection, we can improve speech recognition and text retrieval
accuracy simultaneously.

Section~\ref{sec:system} describes our prototype speech-driven text
retrieval system, which is currently implemented for Japanese.
Section~\ref{sec:experimentation} elaborates on comparative
experiments, in which existing test collections for Japanese text
retrieval are used to evaluate the effectiveness of our system.

\section{System Description}
\label{sec:system}

\subsection{Overview}
\label{subsec:system_overview}

Figure~\ref{fig:system} depicts the overall design of our
speech-driven text retrieval system, which consists of speech
recognition, text retrieval, and adaptation modules. We explain the
retrieval process with reference to this figure.

In the off-line process, the adaptation module uses the entire target
collection (from which relevant documents are retrieved) to produce a
language model, so that user speech related to the collection can be
recognized with high accuracy. In contrast, the acoustic model is
produced independently of the target collection.

In the on-line process, given an information need spoken by a user,
the speech recognition module uses the acoustic and language models to
generate a transcription of the user speech. Then, the text retrieval
module searches the collection for documents relevant to the
transcription, and outputs a fixed number of top-ranked documents in
descending order of relevance.

In principle, these documents are the final outputs. However, when the
target collection covers multiple domains, a language model produced
in the off-line adaptation process is not necessarily well adapted to
a specific information need. Thus, we optionally use the top-ranked
documents obtained in the initial retrieval process for on-line
adaptation, because these documents are more closely related to the
user speech than the entire collection is. We then perform speech
recognition and text retrieval again to obtain the final outputs.

In other words, our system is based on the two-stage retrieval
principle~\cite{kwok:sigir-98}, where top-ranked documents retrieved
in the first stage are intermediate results, and are used to improve
the accuracy of the second (final) stage. From a different
perspective, while the off-line adaptation process produces the {\it
global\/} language model for a target collection, the on-line
adaptation process produces a {\it local\/} language model based on
the user speech.
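
The control flow of this two-stage process can be sketched in Python
as follows. This is a minimal sketch under our own assumptions, not
our actual implementation: \texttt{recognize}, \texttt{retrieve}, and
\texttt{adapt} are hypothetical callables standing in for the speech
recognition, text retrieval, and on-line adaptation modules, and
\texttt{top\_k} (the number of top-ranked documents used for
adaptation) is an illustrative parameter.

\begin{verbatim}
from typing import Callable, List, Sequence


def two_stage_retrieval(
        speech: object,
        global_lm: object,
        recognize: Callable[[object, object], str],
        retrieve: Callable[[str], List[str]],
        adapt: Callable[[object, Sequence[str]], object],
        top_k: int = 10,
        online: bool = True) -> List[str]:
    """Hypothetical two-stage speech-driven retrieval loop."""
    # First stage: decode with the global (off-line) model.
    query = recognize(speech, global_lm)
    ranking = retrieve(query)
    if not online:
        return ranking  # single-stage retrieval
    # Build a local model from the top-ranked documents only.
    local_lm = adapt(global_lm, ranking[:top_k])
    # Second stage: decode the same speech again and retrieve.
    return retrieve(recognize(speech, local_lm))
\end{verbatim}

Passing \texttt{online=False} corresponds to retrieval with the global
language model alone.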

In the following sections, we explain the speech recognition,
adaptation, and text retrieval modules shown in
Figure~\ref{fig:system}.

\begin{figure}[htbp]
  \begin{center}
    \leavevmode \psfig{file=system.eps,height=2.5in}
  \end{center}
  \caption{The overall design of our speech-driven text retrieval system.}
  \label{fig:system}
\end{figure}

\subsection{Speech Recognition}
\label{subsec:speech_recognition}

The speech recognition module generates a word sequence $W$, given a
phoneme sequence $X$. In the stochastic speech recognition framework,
the task is to output the $W$ maximizing $P(W|X)$, which can be
rewritten as in equation~\eq{eq:bayes} using Bayes' theorem (the
denominator $P(X)$ does not depend on $W$ and can therefore be
ignored).
\begin{equation}
  \label{eq:bayes}
  \arg\max_{W}P(W|X) = \arg\max_{W}P(X|W)\cdot P(W)
\end{equation}
Here, $P(X|W)$ models the probability that word sequence $W$ is
transformed into phoneme sequence $X$, and $P(W)$ models the
probability that $W$ is linguistically acceptable. These factors are
usually called the acoustic and language models, respectively.
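
As a toy illustration of equation~\eq{eq:bayes}, the following Python
sketch selects the best word sequence from an explicit list of
candidates by adding log-domain acoustic and language model scores.
The candidates and probabilities are made-up values and the function
\texttt{decode} is hypothetical; an actual decoder such as Julius
searches a far larger hypothesis space instead of enumerating a short
list.

\begin{verbatim}
import math
from typing import Dict, Sequence, Tuple

Words = Tuple[str, ...]


def decode(candidates: Sequence[Words],
           acoustic_logp: Dict[Words, float],
           lm_logp: Dict[Words, float]) -> Words:
    """Return argmax_W log P(X|W) + log P(W) over candidates."""
    # Summing logs avoids underflow from multiplying probabilities.
    return max(candidates,
               key=lambda w: acoustic_logp[w] + lm_logp[w])


cands = [("tv", "conferencing"), ("tv", "conference", "ring")]
acoustic = {cands[0]: math.log(0.20),
            cands[1]: math.log(0.30)}
language = {cands[0]: math.log(0.010),
            cands[1]: math.log(0.001)}
best = decode(cands, acoustic, language)
\end{verbatim}

Although the acoustic scores alone prefer the second candidate, the
combined score makes the first candidate win
($0.20 \times 0.010 > 0.30 \times 0.001$).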

For the speech recognition module, we use the Japanese dictation
toolkit~\cite{kawahara:icslp-2000}\footnote{http://winnie.kuis.kyoto-u.ac.jp/dictation/},
which includes the ``Julius'' recognition engine and acoustic/language
models trained on newspaper articles. This toolkit also includes
development software, so that acoustic and language models can be
produced and replaced depending on the application. While we use the
acoustic model provided in the toolkit, we use new language models
produced by the adaptation process (see
Section~\ref{subsec:lm_adaptation}).

\subsection{Language Model Adaptation}
\label{subsec:lm_adaptation}

The adaptation module produces a word-based $N$-gram model (in our
case, a combination of bigram and trigram models) from source
documents.

In the off-line (global) adaptation process, we use the ChaSen
morphological analyzer~\cite{matsumoto:chasen-99} to extract words
from the entire target collection, and produce the global $N$-gram
model from them.
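
To illustrate the statistics involved, the following Python sketch
counts $N$-grams and their histories in sentences that are assumed to
be already segmented into words (the role ChaSen plays for Japanese
text), and computes unsmoothed maximum-likelihood estimates such as
$P(w_3|w_1,w_2) = C(w_1 w_2 w_3)/C(w_1 w_2)$, where $C(\cdot)$ is a
frequency count. It is a simplified sketch with hypothetical function
names: it omits vocabulary selection, smoothing, back-off, and the way
bigram and trigram statistics are combined in our language models.

\begin{verbatim}
from collections import Counter
from typing import List, Sequence, Tuple

BOS, EOS = "<s>", "</s>"
Gram = Tuple[str, ...]


def ngram_counts(sentences: Sequence[List[str]], n: int):
    """Count n-grams and their (n-1)-word histories."""
    grams: Counter = Counter()
    histories: Counter = Counter()
    for words in sentences:
        # Pad so the first word also has a full history.
        padded = [BOS] * (n - 1) + list(words) + [EOS]
        for i in range(n - 1, len(padded)):
            gram = tuple(padded[i - n + 1:i + 1])
            grams[gram] += 1
            histories[gram[:-1]] += 1
    return grams, histories


def ml_prob(grams: Counter, histories: Counter,
            gram: Gram) -> float:
    """Unsmoothed maximum-likelihood estimate of P(w | h)."""
    total = histories[gram[:-1]]
    return grams[gram] / total if total else 0.0


sents = [["speech", "recognition", "system"],
         ["speech", "recognition", "accuracy"]]
bigrams, bi_hist = ngram_counts(sents, 2)
trigrams, tri_hist = ngram_counts(sents, 3)
\end{verbatim}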

In the on-line (local) adaptation process, by contrast, only the
top-ranked documents retrieved in the first stage are used as source
documents, from which word-based $N$-grams are extracted as in the
off-line process. However, unlike in the off-line process, we do not
rebuild the entire language model. Instead, we re-estimate only the
statistics associated with the top-ranked documents, using the MAP
(Maximum A Posteriori Probability) estimation
method~\cite{masataki:icassp-97}.
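
We do not reproduce the exact formulation of
Masataki~\etal~\cite{masataki:icassp-97} here. As a hedged
illustration of the general idea only, the following Python sketch
uses a common form of MAP estimation for multinomial distributions,
in which the global model serves as a prior with a hypothetical
weight $\tau$. For every history $h$ (the preceding $N-1$ words)
observed in the top-ranked documents, it computes
\[
\hat{P}(w|h) = \frac{C_{\mathrm{local}}(h,w) + \tau \cdot
P_{\mathrm{global}}(w|h)}{C_{\mathrm{local}}(h) + \tau},
\]
where $C_{\mathrm{local}}$ denotes counts in the top-ranked documents;
histories not observed locally keep their global estimates. The
function name, the dictionary representation, and the value of $\tau$
are our own assumptions.

\begin{verbatim}
from collections import defaultdict
from typing import Dict, Tuple

Gram = Tuple[str, ...]


def map_adapt(local_counts: Dict[Gram, int],
              global_prob: Dict[Gram, float],
              tau: float = 100.0) -> Dict[Gram, float]:
    """MAP-style re-estimation of P(w | h) for locally seen h."""
    # C_local(h): total local count of each history h.
    totals: Dict[Gram, int] = defaultdict(int)
    for gram, count in local_counts.items():
        totals[gram[:-1]] += count
    # Start from the global model; unseen histories keep it.
    adapted = dict(global_prob)
    for gram in set(global_prob) | set(local_counts):
        h = gram[:-1]
        if h not in totals:
            continue
        local = local_counts.get(gram, 0)
        prior = global_prob.get(gram, 0.0)
        numerator = local + tau * prior
        adapted[gram] = numerator / (totals[h] + tau)
    return adapted
\end{verbatim}

If the global distribution for each history sums to one, so does the
adapted one, since the numerators sum to $C_{\mathrm{local}}(h) +
\tau$. A small $\tau$ trusts the top-ranked documents more; a large
$\tau$ stays closer to the global model.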

Although on-line adaptation is expected to improve retrieval accuracy,
for real-time use the trade-off between retrieval accuracy and the
computation time required for the on-line process has to be
considered.

Our method is similar to the one proposed by Seymore and
Rosenfeld~\cite{seymore:eurospeech-97} in the sense that both methods
adapt language models based on a small number of documents related to
a specific domain (or topic). However, unlike their method, our method
does not require corpora manually annotated with topic tags.

\subsection{Text Retrieval}
\label{subsec:text_retrieval}

The text retrieval module is based on an existing probabilistic
retrieval method~\cite{robertson:sigir-94}, which computes the
relevance score between the transcribed query and each document in the
collection. The relevance score for document $i$ is computed by
equation~\eq{eq:okapi}.
\begin{equation}
  \label{eq:okapi}
  \sum_{t} \left(\frac{\textstyle TF_{t,i}}{\textstyle
    \frac{\textstyle DL_{i}}{\textstyle avglen} +
    TF_{t,i}}\cdot\log\frac{\textstyle N}{\textstyle DF_{t}}\right)
\end{equation}
Here, $t$ ranges over the terms in the transcribed query. $TF_{t,i}$
denotes the frequency with which term $t$ appears in document $i$.
$DF_{t}$ and $N$ denote the number of documents containing term $t$
and the total number of documents in the collection, respectively.
$DL_{i}$ denotes the length of document $i$ (i.e., the number of
characters contained in $i$), and $avglen$ denotes the average length
of documents in the collection.
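
Equation~\eq{eq:okapi} can be transcribed directly into the following
Python sketch. The function name and its arguments are hypothetical;
we assume that each distinct query term contributes once, and we use
the natural logarithm (the base only rescales all scores by the same
factor and does not change the ranking).

\begin{verbatim}
import math
from typing import Dict, Iterable


def relevance_score(query_terms: Iterable[str],
                    tf: Dict[str, int], doc_len: float,
                    avglen: float, df: Dict[str, int],
                    n_docs: int) -> float:
    """Score of one document i according to equation (2)."""
    score = 0.0
    for t in set(query_terms):
        tf_ti = tf.get(t, 0)
        if tf_ti == 0:
            continue  # absent terms add nothing
        # TF normalized by relative document length, times IDF.
        weight = tf_ti / (doc_len / avglen + tf_ti)
        score += weight * math.log(n_docs / df[t])
    return score


doc_tf = {"speech": 2, "retrieval": 1}
df = {"speech": 10, "retrieval": 100}
s = relevance_score(["speech", "retrieval"], doc_tf,
                    doc_len=100, avglen=100, df=df, n_docs=1000)
\end{verbatim}

A longer document with the same term frequencies receives a lower
score, which is the length normalization expressed by
$DL_{i}/avglen$.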

We use content words extracted from documents as terms, and perform
word-based indexing. For this purpose, we use the ChaSen morphological
analyzer~\cite{matsumoto:chasen-99} to extract content words. We
extract terms from transcribed queries using the same method.

\section{Experimentation}
\label{sec:experimentation}

\subsection{Test Collections}
\label{subsec:test_collection}

We investigated the performance of our system based on the NTCIR
workshop evaluation methodology, which resembles that of the TREC ad
hoc retrieval track. That is, each system outputs the top 1,000
documents, and the TREC evaluation software was used to plot
recall-precision curves and calculate non-interpolated average
precision values.
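
Non-interpolated average precision for a single topic is the mean,
over all relevant documents, of the precision at the rank where each
relevant document is retrieved, with unretrieved relevant documents
contributing zero. The following Python sketch is our own
illustration of this standard definition, not the TREC evaluation
software itself, and its function names are hypothetical. Averaging
the per-topic values over all topics yields one figure per run.

\begin{verbatim}
from typing import Dict, List, Sequence, Set


def average_precision(ranking: Sequence[str],
                      relevant: Set[str],
                      cutoff: int = 1000) -> float:
    """Non-interpolated average precision for one topic."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Relevant documents never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Average of per-topic AP over all judged topics."""
    aps = [average_precision(runs.get(q, []), rel)
           for q, rel in qrels.items()]
    return sum(aps) / len(aps) if aps else 0.0
\end{verbatim}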

The NTCIR workshop was held twice (in 1999 and 2001), and two
different test collections were produced for it: the NTCIR-1 and 2
collections~\cite{ntcir-99,ntcir-2001}\footnote{http://research.nii.ac.jp/\~{}ntcadm/index-en.html}.
However, since these collections do not include spoken queries, we
asked four speakers (two male and two female) to dictate the
information needs in the NTCIR collections, and thereby simulated
speech-driven text retrieval.

The NTCIR collections include documents collected from technical
papers published by 65 Japanese associations in various fields. Each
document consists of the document ID, title, name(s) of author(s),
name/date of conference, hosting organization, abstract, and author
keywords, of which we used the titles, abstracts, and keywords for
indexing. The numbers of documents in the NTCIR-1 and 2 collections
are 332,918 and 736,166, respectively (the NTCIR-1 documents are a
subset of the NTCIR-2 documents).

The NTCIR-1 and 2 collections also include 53 and 49 topics,
respectively. Each topic consists of the topic ID, title of the topic,
description, and narrative. Figure~\ref{fig:topic} shows an English
translation of a fragment of the NTCIR topics\footnote{The NTCIR-2
collection contains Japanese topics and their English translations.},
where each field is tagged in SGML form. In general, titles are not
informative enough for retrieval. On the other hand, narratives, which
usually consist of several sentences, are too long to speak. Thus,
only descriptions, which consist of a single phrase or sentence, were
dictated by each speaker, so as to produce four different sets of 102
(i.e., $53+49$) spoken queries.

\begin{figure*}[htbp]
  \begin{center}
    \leavevmode
    \small
    \begin{quote}
      \tt
      <TOPIC q=0118>\\
      <TITLE>TV conferencing</TITLE>\\
      <DESCRIPTION>Distance education support systems using TV
      conferencing</DESCRIPTION>\\
      <NARRATIVE>A relevant document will provide information on the
      development of distance education support systems using TV
      conferencing. Preferred documents would present examples of
      using TV conferencing and discuss the results. Any reported
      methods of aiding remote teaching are relevant documents (for
      example, ways of utilizing satellite communication, the
      Internet, and ISDN circuits).</NARRATIVE>\\
      </TOPIC>
    \end{quote}
    \caption{An English translation of an example topic in the NTCIR
      collections.}
    \label{fig:topic}
  \end{center}
\end{figure*}

In the NTCIR collections, relevance assessment was performed based on
the pooling method~\cite{voorhees:sigir-98}. First, candidates for
relevant documents were obtained with multiple retrieval
systems. Then, for each candidate document, human expert(s) assigned
one of three ranks of relevance: ``relevant,'' ``partially relevant,''
and \mbox{``irrelevant.''} The NTCIR-2 collection also includes
``highly relevant'' documents. In our evaluation, ``highly relevant''
and ``relevant'' documents were regarded as relevant.
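
The binary relevance criterion used in our evaluation can be
summarized by the following Python sketch, in which ``partially
relevant'' and ``irrelevant'' documents count as non-relevant. The
grade labels and the function name are hypothetical and do not
reflect the actual format of the NTCIR relevance files.

\begin{verbatim}
from typing import Dict, Set

# Grades regarded as relevant in our evaluation.
RELEVANT_GRADES = {"highly relevant", "relevant"}


def binarize(judgments: Dict[str, str]) -> Set[str]:
    """Return the IDs of documents regarded as relevant."""
    return {doc for doc, grade in judgments.items()
            if grade in RELEVANT_GRADES}
\end{verbatim}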

\subsection{Comparative Evaluation}
\label{subsec:comparison}

In order to investigate the effectiveness of the off-line language
model adaptation, we compared the performance of the following
retrieval methods:
\begin{itemize}
\item text-to-text retrieval, which used written descriptions as
  queries, and can be seen as speech-driven text retrieval with
  perfect speech recognition,
\item speech-driven text retrieval, in which a language model produced
  from the NTCIR-2 collection was used,
\item speech-driven text retrieval, in which a language model produced
  from ten years' worth of {\it Mainichi Shimbun\/} Japanese newspaper
  articles (1991--2000) was used.
\end{itemize}
The only difference between the two language models (i.e., those
based on the NTCIR-2 collection and on newspaper articles) is the
source documents. In other words, both language models have the same
vocabulary size (20,000 words) and were produced using the same
software.

Table~\ref{tab:lang_model} shows statistics related to word
tokens/types in the two source corpora for language modeling, where
the row ``Coverage'' denotes the ratio of word tokens covered by the
vocabulary of the resulting language model. Most word tokens were
covered by both language models.
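
Coverage can be computed as in the following Python sketch. For
illustration only, it assumes that the 20,000-word vocabulary consists
of the most frequent word types in the source corpus; this selection
criterion and the function name are assumptions of the sketch.

\begin{verbatim}
from collections import Counter
from typing import Iterable


def vocabulary_coverage(tokens: Iterable[str],
                        vocab_size: int = 20000) -> float:
    """Ratio of tokens covered by the vocab_size top types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Assumed vocabulary: the most frequent word types.
    covered = sum(c for _, c in counts.most_common(vocab_size))
    return covered / total


toy = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
\end{verbatim}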

\begin{table}[htbp]
  \begin{center}
    \caption{Statistics associated with source words for language
    modeling.}
    \medskip
    \leavevmode
    \small
    \tabcolsep=3pt
    \begin{tabular}{lcc} \hline\hline
      & NTCIR & News \\ \hline
      \# of Types & 454K & 315K \\
      \# of Tokens & 175M & 262M \\
      Coverage & 97.9\% & 96.5\% \\
      \hline
    \end{tabular}
    \label{tab:lang_model}
  \end{center}
\end{table}

For speech-driven text retrieval, the queries dictated by each of the
four speakers were used separately. Thus, in practice we compared nine
different retrieval methods (text-to-text retrieval, plus two language
models for each of the four speakers). Although the Julius decoder
outputs more than one transcription candidate for a single speech
input, we used only the one with the highest probability score. The
results did not change significantly depending on whether or not we
used lower-ranked transcriptions as queries.

Table~\ref{tab:results} shows the non-interpolated average precision
values and the word error rate in speech recognition for the different
retrieval methods. As in existing speech recognition experiments, the
word error rate (WER) is the ratio between the number of word errors
(i.e., deletions, insertions, and substitutions) and the total number
of words in the reference transcription. We also investigated the
error rate with respect to query terms (i.e., keywords used for
retrieval), which we call the ``term error rate (TER).''
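
WER can be computed from the minimum edit distance between the
reference and recognized word sequences, as in the following Python
sketch. The TER function is our own assumed reading of the definition
above, in which WER is simply restricted to query terms; the predicate
\texttt{is\_term} stands in for content-word extraction with ChaSen
and, like the function names, is hypothetical.

\begin{verbatim}
from typing import Callable, List, Sequence


def edit_errors(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # match/sub
        prev = cur
    return prev[-1]


def word_error_rate(ref: List[str], hyp: List[str]) -> float:
    if not ref:
        raise ValueError("empty reference")
    return edit_errors(ref, hyp) / len(ref)


def term_error_rate(ref: List[str], hyp: List[str],
                    is_term: Callable[[str], bool]) -> float:
    """WER over query terms only (assumed definition)."""
    return word_error_rate([w for w in ref if is_term(w)],
                           [w for w in hyp if is_term(w)])


def is_content(word: str) -> bool:
    return word not in {"about"}  # toy function-word list


ref = ["papers", "about", "tv", "conferencing", "systems"]
hyp = ["papers", "about", "tv", "conference", "systems"]
\end{verbatim}

In this toy example the only error falls on a content word, so TER
(0.25) exceeds WER (0.2).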

In Table~\ref{tab:results}, the first row shows the results of
text-to-text retrieval, which were relatively high compared with
existing results reported in the NTCIR
workshops~\cite{ntcir-99,ntcir-2001}.

\begin{table*}[htbp]
  \begin{center}
    \caption{Results for different retrieval methods (AP: average
    precision, WER: word error rate, TER: term error rate).}
    \medskip
    \leavevmode
    \small
    \tabcolsep=5pt
    \begin{tabular}{lcccccc} \hline\hline
      & \multicolumn{3}{c}{NTCIR-1} & \multicolumn{3}{c}{NTCIR-2} \\
      \cline{2-7}
      {\hfill\centering Method\hfill}
      & AP & WER & TER
      & AP & WER & TER \\ \hline
      Text & 0.3320 & --- & --- & 0.3118 & --- & --- \\
      M1 (NTCIR) & 0.2708 & 0.1659 & 0.2190 & 0.2504 & 0.1532 & 0.2313 \\
      M2 (NTCIR) & 0.2471 & 0.2034 & 0.2381 & 0.2114 & 0.2180 & 0.2799 \\
      F1 (NTCIR) & 0.2276 & 0.1961 & 0.2857 & 0.1873 & 0.1885 & 0.2500 \\
      F2 (NTCIR) & 0.2642 & 0.1477 & 0.2222 & 0.2376 & 0.1635 & 0.2388 \\
      M1 (News) &  0.1076 & 0.3547 & 0.5143 & 0.0790 & 0.3594 & 0.5149 \\
      M2 (News) &  0.1257 & 0.4044 & 0.5460 & 0.0691 & 0.5022 & 0.6343 \\
      F1 (News) &  0.1156 & 0.3801 & 0.5238 & 0.0798 & 0.4418 & 0.5709 \\
      F2 (News) &  0.1225 & 0.3317 & 0.5016 & 0.0917 & 0.4080 & 0.5858 \\
      \hline
    \end{tabular}
    \label{tab:results}
  \end{center}
\end{table*}

The remaining rows show the results of speech-driven text retrieval
combined with the NTCIR-based language model (rows 2--5) and the
newspaper-based model (rows 6--9), respectively. Here, ``Mx'' and
``Fx'' denote male and female speakers, respectively. These results
suggest the following.

First, for both language models, the results did not change
significantly depending on the speaker. The best average precision
values for speech-driven text retrieval were obtained by combining
queries dictated by a male speaker (M1) with the NTCIR-based language
model; these values were approximately 80\% of those for text-to-text
retrieval.
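
Concretely, the ratios computed from Table~\ref{tab:results} are
\[
\frac{0.2708}{0.3320} \approx 0.816 \quad \mbox{(NTCIR-1)}, \qquad
\frac{0.2504}{0.3118} \approx 0.803 \quad \mbox{(NTCIR-2)}.
\]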

Second, comparing the results of the two language models for each
speaker, one can see that the NTCIR-based model yielded substantially
lower WER and TER than the newspaper-based model, and that the
retrieval method using the NTCIR-based model substantially
outperformed the one using the newspaper-based model. These
differences were observed for every speaker. Thus, we conclude that
adapting language models to target collections is effective for
speech-driven text retrieval.

Third, TER was higher than WER in every case in
Table~\ref{tab:results}, irrespective of the speaker and the language
model. In other words, content words were more difficult to recognize
than function words, which are not used as query terms.

We analyzed the transcriptions of the dictated queries, and found that
speech recognition errors were mainly caused by out-of-vocabulary
words. When major query terms are misrecognized, the retrieval
accuracy decreases substantially. In addition, descriptions in the
NTCIR topics often contain expressions that do not appear in the
documents, such as ``I want papers about\ldots'' Although these
expressions themselves usually do not affect retrieval accuracy,
their misrecognition can degrade the recognition accuracy for the
remaining words, including major query terms. Consequently, retrieval
accuracy decreases due to such partial misrecognition.

Finally, we investigated the trade-off between recall and precision.
Figures~\ref{fig:ntcir1} and \ref{fig:ntcir2} show recall-precision
curves of the different retrieval methods for the NTCIR-1 and 2
collections, respectively. In these figures, the superiority of the
NTCIR-based language model over the newspaper-based model observed in
Table~\ref{tab:results} is also evident at all recall levels.

However, the effectiveness of the on-line adaptation remains an open
question that needs to be explored.

\begin{figure}[htbp]
  \begin{center}
    \leavevmode \psfig{file=ntcir-1.ps,height=3in}
  \end{center}
  \caption{Recall-precision curves for different retrieval methods
  using the NTCIR-1 collection.}
  \label{fig:ntcir1}
\end{figure}

\begin{figure}[htbp]
  \begin{center}
    \leavevmode \psfig{file=ntcir-2.ps,height=3in}
  \end{center}
  \caption{Recall-precision curves for different retrieval methods
  using the NTCIR-2 collection.}
  \label{fig:ntcir2}
\end{figure}

\section{Conclusion}
\label{sec:conclusion}

Aiming at highly accurate speech-driven text retrieval, we proposed a
method to integrate speech recognition and text retrieval methods, in
which target text collections are used to adapt statistical language
models for speech recognition. We also showed the effectiveness of our
method through experiments in which dictated information needs in the
NTCIR collections were used as queries to retrieve technical
abstracts. Future work includes experiments on various collections,
such as newspaper articles and Web pages.

\section{Acknowledgments}

The authors would like to thank the National Institute of Informatics
for their support with the NTCIR collections.

\bibliographystyle{abbrv}
\begin{thebibliography}{10}

\bibitem{bahl:ieee-tpami-1983}
L.~R. Bahl, F.~Jelinek, and R.~L. Mercer.
\newblock A maximum likelihood approach to continuous speech recognition.
\newblock {\em IEEE Transactions on Pattern Analysis and Machine Intelligence},
  5(2):179--190, 1983.

\bibitem{barnett:eurospeech-97}
J.~Barnett, S.~Anderson, J.~Broglio, M.~Singh, R.~Hudson, and S.~W. Kuo.
\newblock Experiments in spoken queries for document retrieval.
\newblock In {\em Proceedings of Eurospeech'97}, pages 1323--1326, 1997.

\bibitem{crestani:fqas-2000}
F.~Crestani.
\newblock Word recognition errors and relevance feedback in spoken query
  processing.
\newblock In {\em Proceedings of the Fourth International Conference on
  Flexible Query Answering Systems}, pages 267--281, 2000.

\bibitem{garofolo:trec-97}
J.~S. Garofolo, E.~M. Voorhees, V.~M. Stanford, and K.~S. Jones.
\newblock {TREC-6} 1997 spoken document retrieval track overview and results.
\newblock In {\em Proceedings of the 6th Text REtrieval Conference}, pages
  83--91, 1997.

\bibitem{johnson:icassp-99}
S.~Johnson, P.~Jourlin, G.~Moore, K.~S. Jones, and P.~Woodland.
\newblock The {Cambridge} {University} spoken document retrieval system.
\newblock In {\em Proceedings of ICASSP'99}, pages 49--52, 1999.

\bibitem{jones:sigir-96}
G.~Jones, J.~Foote, K.~S. Jones, and S.~Young.
\newblock Retrieving spoken documents by combining multiple index sources.
\newblock In {\em Proceedings of the 19th Annual International ACM SIGIR
  Conference on Research and Development in Information Retrieval}, pages
  30--38, 1996.

\bibitem{kawahara:icslp-2000}
T.~Kawahara, A.~Lee, T.~Kobayashi, K.~Takeda, N.~Minematsu, S.~Sagayama,
  K.~Itou, A.~Ito, M.~Yamamoto, A.~Yamada, T.~Utsuro, and K.~Shikano.
\newblock Free software toolkit for {Japanese} large vocabulary continuous
  speech recognition.
\newblock In {\em Proceedings of the 6th International Conference on Spoken
  Language Processing}, pages 476--479, 2000.

\bibitem{kwok:sigir-98}
K.~Kwok and M.~Chan.
\newblock Improving two-stage ad-hoc retrieval for short queries.
\newblock In {\em Proceedings of the 21st Annual International ACM SIGIR
  Conference on Research and Development in Information Retrieval}, pages
  250--256, 1998.

\bibitem{masataki:icassp-97}
H.~Masataki, Y.~Sagisaka, K.~Hisaki, and T.~Kawahara.
\newblock Task adaptation using {MAP} estimation in n-gram language modeling.
\newblock In {\em Proceedings of ICASSP'97}, pages 783--786, 1997.

\bibitem{matsumoto:chasen-99}
Y.~Matsumoto, A.~Kitauchi, T.~Yamashita, Y.~Hirano, H.~Matsuda, and M.~Asahara.
\newblock {Japanese} morphological analysis system {ChaSen} version 2.0 manual
  2nd edition.
\newblock Technical Report NAIST-IS-TR99009, NAIST, 1999.

\bibitem{ntcir-99}
{National Center for Science Information Systems}.
\newblock {\em Proceedings of the 1st NTCIR Workshop on Research in Japanese
  Text Retrieval and Term Recognition}, 1999.

\bibitem{ntcir-2001}
{National Institute of Informatics}.
\newblock {\em Proceedings of the 2nd NTCIR Workshop Meeting on Evaluation of
  Chinese \& Japanese Text Retrieval and Text Summarization}, 2001.

\bibitem{robertson:sigir-94}
S.~Robertson and S.~Walker.
\newblock Some simple effective approximations to the 2-{Poisson} model for
  probabilistic weighted retrieval.
\newblock In {\em Proceedings of the 17th Annual International ACM SIGIR
  Conference on Research and Development in Information Retrieval}, pages
  232--241, 1994.

\bibitem{seymore:eurospeech-97}
K.~Seymore and R.~Rosenfeld.
\newblock Using story topics for language model adaptation.
\newblock In {\em Proceedings of Eurospeech'97}, 1997.

\bibitem{sheridan:sigir-97}
P.~Sheridan, M.~Wechsler, and P.~Sch\"{a}uble.
\newblock Cross-language speech retrieval: Establishing a baseline performance.
\newblock In {\em Proceedings of the 20th Annual International ACM SIGIR
  Conference on Research and Development in Information Retrieval}, pages
  99--108, 1997.

\bibitem{singhal:sigir-99}
A.~Singhal and F.~Pereira.
\newblock Document expansion for speech retrieval.
\newblock In {\em Proceedings of the 22nd Annual International ACM SIGIR
  Conference on Research and Development in Information Retrieval}, pages
  34--41, 1999.

\bibitem{srinivasan:sigir-2000}
S.~Srinivasan and D.~Petkovic.
\newblock Phonetic confusion matrix based spoken document retrieval.
\newblock In {\em Proceedings of the 23rd Annual International ACM SIGIR
  Conference on Research and Development in Information Retrieval}, pages
  81--87, 2000.

\bibitem{voorhees:sigir-98}
E.~M. Voorhees.
\newblock Variations in relevance judgments and the measurement of retrieval
  effectiveness.
\newblock In {\em Proceedings of the 21st Annual International ACM SIGIR
  Conference on Research and Development in Information Retrieval}, pages
  315--323, 1998.

\bibitem{wechsler:sigir-98}
M.~Wechsler, E.~Munteanu, and P.~Sch\"{a}uble.
\newblock New techniques for open-vocabulary spoken document retrieval.
\newblock In {\em Proceedings of the 21st Annual International ACM SIGIR
  Conference on Research and Development in Information Retrieval}, pages
  20--27, 1998.

\bibitem{whittaker:sigir-99}
S.~Whittaker, J.~Hirschberg, J.~Choi, D.~Hindle, F.~Pereira, and A.~Singhal.
\newblock {SCAN}: Designing and evaluating user interfaces to support retrieval
  from speech archives.
\newblock In {\em Proceedings of the 22nd Annual International ACM SIGIR
  Conference on Research and Development in Information Retrieval}, pages
  26--33, 1999.

\end{thebibliography}


\end{document}