cs0206014/main.tex
1: \documentclass[11pt]{article}
2: \usepackage{acl2002}
3: \usepackage{times}
4: \usepackage{latexsym}
5: \setlength\titlebox{6.5cm}    % Expanding the titlebox
6: 
7: \title{A Method for Open-Vocabulary Speech-Driven Text Retrieval}
8: 
9: \author{Atsushi Fujii\thanks{~~~The first and second authors
10:        are also members of CREST, Japan Science and Technology
11:        Corporation.}\\
12:   University of Library and\\
13:   Information Science\\
14:   1-2 Kasuga, Tsukuba\\
15:   305-8550, Japan\\
16:   {\tt fujii@ulis.ac.jp} \And
17:   Katunobu Itou\\
18:   National Institute of\\
19:   Advanced Industrial\\
20:   Science and Technology\\
21:   1-1-1 Chuuou Daini Umezono\\
22:   Tsukuba, 305-8568, Japan\\
23:   {\tt itou@ni.aist.go.jp} \And
24:   Tetsuya Ishikawa\\
25:   University of Library and\\
26:   Information Science\\
27:   1-2 Kasuga, Tsukuba\\
28:   305-8550, Japan\\
29:   {\tt ishikawa@ulis.ac.jp}}
30: 
31: \date{}
32: 
33: \newcommand{\etal}{et~al.}
34: \newcommand{\etaleos}{et~al}
35: \newcommand{\eq}[1]{(\ref{#1})}
36: \renewcommand{\nocite}[1]{\shortcite{#1}}
37: \input{psfig.tex}
38: 
39: \begin{document}
40: \maketitle
41: \begin{abstract}
42:   While recent retrieval techniques do not limit the number of index
43:   terms, out-of-vocabulary (OOV) words remain a crucial problem in speech
44:   recognition. Aiming at retrieving information with spoken queries,
45:   we fill the gap between speech recognition and text retrieval in
46:   terms of the vocabulary size. Given a spoken query, we generate a
47:   transcription and detect OOV words through speech recognition. We
48:   then map detected OOV words to terms indexed in a target
49:   collection to complete the transcription, and search the collection
50:   for documents relevant to the completed transcription. We show the
51:   effectiveness of our method by way of experiments.
52: \end{abstract}
53: 
54: \section{Introduction}
55: \label{sec:introduction}
56: 
57: Automatic speech recognition, which decodes human voice to generate
58: transcriptions, has of late become a practical technology. It is
59: feasible that speech recognition is used in real-world human language
60: applications, such as information retrieval.
61: 
62: Initiated partially by TREC-6, various methods have been proposed for
63: ``spoken document retrieval (SDR),'' in which written queries are used to
64: search speech archives for relevant
65: information~\cite{garofolo:trec-97}.  State-of-the-art SDR methods,
66: where the speech recognition error rate is 20--30\%, are comparable with
67: text retrieval methods in performance~\cite{jourlin:sc-2000}, and thus
68: are already practical. Possible rationales include that recognition
69: errors are overshadowed by a large number of words correctly
70: transcribed in target documents.
71: 
72: However, ``speech-driven retrieval,'' where spoken queries are used to
73: retrieve (textual) information, has not fully been explored, although
74: it is related to numerous keyboard-less applications, such as
75: telephone-based retrieval, car navigation systems, and user-friendly
76: interfaces.
77: 
78: Unlike spoken document retrieval, speech-driven retrieval is still a
79: challenging task, because recognition errors in short queries
80: considerably decrease retrieval accuracy.  A number of references
81: addressing this issue can be found in past research literature.
82: 
83: Barnett~\etal~\shortcite{barnett:eurospeech-97}  and
84: Crestani~\shortcite{crestani:fqas-2000} independently performed
85: comparative experiments related to speech-driven retrieval, where the
86: DRAGON speech recognition system was used as an input interface for
87: the INQUERY text retrieval system.  They used as test queries 35
88: topics in the TREC collection, dictated by a single male speaker.
89: However, these cases focused on improving text retrieval methods and
90: did not address problems in improving speech recognition.  As a
91: result, errors in recognizing spoken queries (error rate was
92: approximately 30\%) considerably decreased the retrieval accuracy.
93: 
94: Although we showed that the use of target document collections in
95: producing language models for speech recognition significantly
96: improved the performance of speech-driven
97: retrieval~\cite{fujii:springer-2002,itou:asru-2001}, a number of
98: issues still remain open questions.
99: 
100: Section~\ref{sec:problem} clarifies problems addressed in this paper.
101: Section~\ref{sec:overview} overviews our speech-driven text retrieval
102: system. Sections~\ref{sec:speech_recognition}--\ref{sec:query_completion}
103: elaborate on our methodology.  Section~\ref{sec:experimentation}
104: describes comparative experiments, in which an existing IR test
105: collection was used to evaluate the effectiveness of our
106: method. Section~\ref{sec:related_work} discusses related research
107: literature.
108: 
109: \section{Problem Statement}
110: \label{sec:problem}
111: 
112: One major problem in speech-driven retrieval is related to
113: out-of-vocabulary (OOV) words.
114: 
115: On the one hand, recent IR systems do not limit the vocabulary size
116: (i.e., the number of index terms), and can be seen as open-vocabulary
117: systems, which allow users to input any keywords contained in a target
118: collection.  It is often the case that a couple of million terms are
119: indexed for a single IR system.
120: 
121: On the other hand, state-of-the-art speech recognition systems still
122: need to limit the vocabulary size (i.e., the number of words in a
123: dictionary), due to problems in estimating statistical language
124: models~\cite{young:ieee-spm-1996} and constraints associated with
125: hardware, such as memory. In addition, computation time is crucial
126: for real-time usage, including speech-driven retrieval.  In view of
127: these problems, for many languages the vocabulary size is limited to
128: a few tens of
129: thousands of words~\cite{itou:jas-1999,paul:darpa-ws-1992,steeneken:eurospeech-1995},
130: which is incomparably smaller than the size of indexes for practical
131: IR systems.
132: 
133: In addition, high-frequency words, such as function words and common
134: nouns, are usually included in dictionaries and recognized with a high
135: accuracy. However, those words are not necessarily useful for
136: retrieval.  On the contrary, low-frequency words appearing in specific
137: documents are often effective query terms.
138: 
139: To sum up, the OOV problem is inherent in speech-driven retrieval, and
140: we need to fill the gap between speech recognition and text retrieval
141: in terms of the vocabulary size. In this paper, we propose a method to
142: resolve this problem aiming at open-vocabulary speech-driven retrieval.
143: 
144: \section{System Overview}
145: \label{sec:overview}
146: 
147: Figure~\ref{fig:system} depicts the overall design of our
148: speech-driven text retrieval system, which consists of speech
149: recognition, text retrieval and query completion modules.  Although
150: our system is currently implemented for Japanese, our methodology is
151: language-independent.  We explain the retrieval process based on this
152: figure.
153: 
154: Given a query spoken by a user, the speech recognition module uses a
155: dictionary and acoustic/language models to generate a transcription of
156: the user speech. During this process, OOV words, which are not listed
157: in the dictionary, are also detected.  For this purpose, our language
158: model includes both words and syllables so that OOV words are
159: transcribed as sequences of syllables.
160: 
161: For example, in the case where ``{\it kankitsu\/}~(citrus)'' is not
162: listed in the dictionary, this word should be transcribed as
163: /\verb|ka N ki tsu|/. However, it is possible that this word is mistakenly
164: transcribed, such as /\verb|ka N ke tsu|/ and /\verb|ka N ke tsu ke ko|/.
165: 
166: To improve the quality of our system, these syllable sequences have to
167: be transcribed as {\em words\/}, which is one of the central issues in
168: this paper. In the case of speech-driven retrieval, where users
169: usually have specific information needs, it is likely that users
170: utter contents related to a target collection. In other words, there
171: is a great possibility that detected OOV words can be identified as
172: index terms that are phonetically identical or similar.
173: 
174: However, since a) a single sound can potentially correspond to more
175: than one word (i.e., homonyms) and b) searching the entire collection
176: for phonetically identical/similar terms is prohibitive,  we need an
177: efficient disambiguation method. Specifically, in the case of
178: Japanese, the homonym problem is compounded because words
179: consist of different character types, i.e., ``{\it kanji},'' ``{\it
180:   katakana},'' ``{\it hiragana},'' alphabetic characters and other characters like
181: numerals\footnote{In Japanese, {\it kanji\/} (Chinese characters) are
182:   ideograms, and {\it katakana\/} and {\it hiragana\/} are
183:   phonograms.}.
184: 
185: To resolve this problem, we use a two-stage retrieval method. In the
186: first stage, we delete OOV words from the transcription, and perform
187: text retrieval using remaining words, to obtain a specific number of
188: top-ranked documents according to the degree of relevance.  Even if
189: speech recognition is not perfect, these documents are potentially
190: associated with the user speech more than the entire collection.
191: Thus, we search only these documents for index terms corresponding to
192: detected OOV words.
193: 
194: Then, in the second stage, we replace detected OOV words with
195: identified index terms so as to complete the transcription, and
196: re-perform text retrieval to obtain final outputs. However, we do not
197: re-perform speech recognition in the second stage.
198: 
199: In the above example, let us assume that the user also utters words
200: related to ``{\it kankitsu\/}~(citrus),'' such as ``{\it
201:   orenji\/}~(orange)'' and ``{\it remon\/}~(lemon),'' and that these
202: words are correctly recognized as words. In this case, it is possible
203: that retrieved documents contain the word ``{\it
204:   kankitsu\/}~(citrus).'' Thus, we replace the syllable sequence
205: /\verb|ka N ke tsu|/ in the query with ``{\it kankitsu\/},'' which is
206: additionally used as a query term in the second stage.
207: 
208: It may be argued that our method resembles the notion of
209: pseudo-relevance feedback (or local feedback) for IR, where documents
210: obtained in the first stage are used to expand query terms, and final
211: outputs are refined in the second stage~\cite{kwok:sigir-98}.
212: However, while relevance feedback is used to improve only the
213: retrieval accuracy, our method improves the speech recognition and
214: retrieval accuracy.
215: 
216: \begin{figure}[htbp]
217:   \begin{center}
218:     \leavevmode \psfig{file=system.eps,height=2.3in}
219:   \end{center}
220:   \caption{The overall design of our speech-driven text retrieval system.}
221:   \label{fig:system}
222: \end{figure}
223: 
224: \section{Speech Recognition}
225: \label{sec:speech_recognition}
226: 
227: The speech recognition module generates word sequence $W$, given phone
228: sequence $X$.  In a stochastic speech recognition
229: framework~\cite{bahl:ieee-tpami-1983}, the task is to select the $W$
230: maximizing $P(W|X)$, which is transformed as in Equation~\eq{eq:bayes}
231: through Bayes' theorem.
232: \begin{equation}
233:   \label{eq:bayes}
234:   \arg\max_{W}P(W|X) = \arg\max_{W}P(X|W)\cdot P(W)
235: \end{equation}
236: Here, $P(X|W)$ models a probability that word sequence $W$ is
237: transformed into phone sequence $X$, and $P(W)$ models a probability
238: that $W$ is linguistically acceptable. These factors are usually
239: called acoustic and language models, respectively.
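
As a brief reminder of the intermediate step (a standard identity, not
specific to this system), Bayes' theorem gives
\[
  P(W|X) = \frac{P(X|W)\cdot P(W)}{P(X)},
\]
and since $P(X)$ does not depend on $W$, it can be dropped when
maximizing over $W$, which yields Equation~\eq{eq:bayes}.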
240: 
241: For the speech recognition module, we use the Japanese dictation
242: toolkit~\cite{kawahara:icslp-2000}\footnote{http://winnie.kuis.kyoto-u.ac.jp/dictation},
243: which includes the ``Julius'' recognition engine and acoustic/language
244: models. The acoustic model was produced by way of the ASJ speech
245: database (ASJ-JNAS)~\cite{itou:98:a,itou:jas-1999}, which contains
246: approximately 20,000 sentences uttered by 132 speakers of
247: both genders.
248: 
249: This toolkit also includes development software so that acoustic and
250: language models can be produced and replaced depending on the
251: application.  While we use the acoustic model provided in the toolkit,
252: we use a new language model including both words and syllables.  For
253: this purpose, we used the ``ChaSen'' morphological
254: analyzer\footnote{http://chasen.aist-nara.ac.jp} to extract words from
255: ten years' worth of ``Mainichi Shimbun'' newspaper articles (1991--2000).
256: 
257: Then, we selected 20,000 high-frequency words to produce a dictionary.
258: At the same time, we segmented remaining lower-frequency words into
259: syllables based on the Japanese phonogram system. The resultant number
260: of syllable types was approximately 700.  Finally, we produced a
261: word/syllable-based trigram language model.  In other words, OOV words
262: were modeled as sequences of syllables. Thus, by using our language
263: model, OOV words can easily be detected.
264: 
265: In spoken document retrieval, an open-vocabulary method, which
266: combines recognition methods for words and syllables in target speech
267: documents, was also proposed~\cite{wechsler:sigir-98}.  However, this
268: method requires an additional computation for recognizing syllables,
269: and thus is expensive. In contrast, since our language model is a
270: regular statistical $N$-gram model, we can use the same speech
271: recognition framework as in Equation~\eq{eq:bayes}.
272: 
273: \section{Text Retrieval}
274: \label{sec:text_retrieval}
275: 
276: The text retrieval module is based on the ``Okapi'' probabilistic
277: retrieval method~\cite{robertson:sigir-94}, which is used to compute
278: the relevance score between the transcribed query and each document in
279: a target collection.  To produce an inverted file (i.e., an index), we
280: use ChaSen to extract content words from documents as terms, and
281: perform a word-based indexing. We also extract terms from transcribed
282: queries using the same method.
283: 
284: \section{Query Completion}
285: \label{sec:query_completion}
286: 
287: \subsection{Overview}
288: \label{subsec:qc_overview}
289: 
290: As explained in Section~\ref{sec:overview}, the basis of the query
291: completion module is to map OOV words detected by speech
292: recognition (Section~\ref{sec:speech_recognition}) to index terms used
293: for text retrieval (Section~\ref{sec:text_retrieval}). However, to
294: identify corresponding index terms efficiently, we limit the number of
295: documents in the first stage retrieval.  In principle, terms that are
296: indexed in top-ranked documents (those retrieved in the first stage)
297: and have the same sound as detected OOV words can be corresponding
298: terms.
299: 
300: However, a single sound often corresponds to multiple words.  In
301: addition, since speech recognition on a syllable-by-syllable basis is
302: not perfect, it is possible that OOV words are incorrectly
303: transcribed.  For example, in some cases the Japanese word ``{\it
304:   kankitsu\/}~(citrus)'' is transcribed as /\verb|ka N ke tsu|/.
305: Thus, we also need to consider index terms that are phonetically {\em
306:   similar\/} to OOV words.  To sum up, we need a disambiguation method
307: to select appropriate corresponding terms, out of a number of
308: candidates.
309: 
310: \subsection{Formalization}
311: \label{subsec:qc_formalization}
312: 
313: Intuitively, it is plausible that appropriate terms:
314: \begin{itemize}
315: \item have a sound identical or similar to that of OOV words detected in
316:   spoken queries,
317: \item frequently appear in a top-ranked document set,
318: \item and appear in higher-ranked documents.
319: \end{itemize}
320: From the viewpoint of probability theory, possible representations for
321: the above three properties include Equation~\eq{eq:prob}, where each
322: property corresponds to different parameters. Our task is to select
323: the $t$ maximizing the value computed by this equation as the
324: corresponding term for OOV word $w$.
325: \begin{equation}
326:   \label{eq:prob}
327:   \sum_{d \in D_{q}} P(w|t) \cdot P(t|d) \cdot P(d|q)
328: \end{equation}
329: Here, $D_{q}$ is the top-ranked document set retrieved in the first
330: stage, given query $q$.  \mbox{$P(w|t)$} is a probability that index
331: term $t$ can be replaced with detected OOV word $w$, in terms of
332: phonetics. \mbox{$P(t|d)$} is the relative frequency of term $t$ in
333: document $d$. \mbox{$P(d|q)$} is a probability that document $d$ is
334: relevant to query $q$, which is associated with the score formalized
335: in the Okapi method.
336: 
337: However, from the viewpoint of empiricism, Equation~\eq{eq:prob} is
338: not necessarily effective. First, it is not easy to estimate $P(w|t)$
339: based on the probability theory. Second, the probability score
340: computed by the Okapi method is an approximation focused mainly on
341: {\em relative\/} superiority among retrieved documents, and thus it is
342: difficult to estimate $P(d|q)$ in a rigorous manner. Finally, it is
343: also difficult to determine the degree to which each parameter
344: influences the final probability score.
345: 
346: In view of these problems, through preliminary experiments we
347: approximated Equation~\eq{eq:prob} and formalized a method to compute
348: the degree (not the probability) to which given index term $t$
349: corresponds to OOV word $w$.
350: 
351: First, we estimate \mbox{$P(w|t)$} by the ratio between the number of
352: syllables commonly included in both $w$ and $t$ and the total number
353: of syllables in $w$.  We use a DP matching method to identify the
354: number of cases related to deletion, insertion, and substitution in
355: $w$, on a syllable-by-syllable basis.
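
As an illustrative calculation (a sketch that assumes the ``commonly
included'' syllables are those matched in the DP alignment), consider
the transcriptions of ``{\it kankitsu\/}'' from
Section~\ref{sec:overview}, with $t$ being /\verb|ka N ki tsu|/. For
$w$ being /\verb|ka N ke tsu|/, the alignment contains one
substitution and three matched syllables out of four; for $w$ being
/\verb|ka N ke tsu ke ko|/, it additionally contains two insertions,
so three syllables out of six are matched:
\[
  P(w|t) \approx \frac{3}{4} = 0.75
  \qquad\mbox{and}\qquad
  P(w|t) \approx \frac{3}{6} = 0.5 .
\]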
356: 
357: Second, \mbox{$P(w|t)$} should be more influential than
358: \mbox{$P(t|d)$} and \mbox{$P(d|q)$} in Equation~\eq{eq:prob}, although
359: the last two parameters are effective in the case where a large number
360: of candidates phonetically similar to $w$ are obtained. To decrease
361: the effect of \mbox{$P(t|d)$} and \mbox{$P(d|q)$}, we tentatively use
362: logarithms of these parameters. In addition, we use the score computed
363: by the Okapi method as \mbox{$P(d|q)$}.
364: 
365: According to the above approximation, we compute the score of $t$ as
366: in Equation~\eq{eq:score}.
367: \begin{equation}
368:   \label{eq:score}
369:   \sum_{d \in D_{q}} P(w|t) \cdot \log(P(t|d) \cdot P(d|q))
370: \end{equation}
371: It should be noted that Equation~\eq{eq:score} is independent of the
372: indexing method used, and therefore $t$ can be any sequence of
373: characters contained in $D_{q}$. In other words, any type of indexing
374: method (e.g., word-based and phrase-based indexing methods) can be
375: used in our framework.
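
Note also that \mbox{$P(w|t)$} does not depend on $d$, so
Equation~\eq{eq:score} can equivalently be written with this factor
outside the sum:
\[
  P(w|t) \cdot \sum_{d \in D_{q}} \log(P(t|d) \cdot P(d|q))
\]
This suggests (as an observation about the formula only) that the
phonetic similarity needs to be computed once per candidate term $t$,
rather than once per document in $D_{q}$.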
376: 
377: \subsection{Implementation}
378: \label{subsec:qc_implementation}
379: 
380: Since computation time is crucial for real-time usage, we preprocess
381: documents in a target collection so as to identify candidate terms
382: efficiently. This process is similar to the indexing process performed
383: in the text retrieval module.
384: 
385: In the case of text retrieval, index terms are organized in an
386: inverted file so that documents including terms that {\em exactly\/}
387: match with query keywords can be retrieved efficiently.
388: 
389: However, in the case of query completion, terms that are included in
390: top-ranked documents need to be retrieved. In addition, to minimize
391: score computation (for example, DP matching is time-consuming), it is
392: desirable to discard terms that are associated with a low
393: phonetic similarity value, \mbox{$P(w|t)$}, prior to the computation
394: of Equation~\eq{eq:score}.  In other words, an index file for query
395: completion has to be organized so that a {\em partial\/} matching
396: method can be used. For example, /\verb|ka N ki tsu|/ has to be
397: retrieved efficiently in response to /\verb|ka N ke tsu|/.
398: 
399: Thus, we implemented a forward/backward partial-matching method, in
400: which entries can be retrieved by any substrings from the first/last
401: characters. In addition, we index words and word-based bigrams,
402: because preliminary experiments showed that OOV words detected by our
403: speech recognition module are usually single words or short phrases,
404: such as ``{\it ozon-houru\/}~(ozone hole).''
405: 
406: \section{Experimentation}
407: \label{sec:experimentation}
408: 
409: \subsection{Methodology}
410: \label{subsec:ex_method}
411: 
412: \begin{figure*}[htbp]
413:   \begin{center}
414:     \leavevmode
415:     \begin{quote}
416:       \tt
417:       \footnotesize
418:       <TOPIC><TOPIC-ID>1001</TOPIC-ID> \\
419:       <DESCRIPTION>Corporate merging</DESCRIPTION> \\
420:       <NARRATIVE>The article describes a corporate merging and in the
421:       article, the name of companies have to be
422:       identifiable. Information
423:       including the field and the purpose of the merging have to be
424:       identifiable. Corporate merging includes corporate acquisition,
425:       corporate unifications and corporate buying.</NARRATIVE></TOPIC>
426:     \end{quote}
427:     \caption{An English translation for an example topic in the IREX collection.}
428:     \label{fig:irex_topic}
429:   \end{center}
430: \end{figure*}
431: 
432: To evaluate the performance of our speech-driven retrieval system, we
433: used the IREX
434: collection\footnote{http://cs.nyu.edu/cs/projects/proteus/irex/index-e.html}. This
435: test collection, which resembles one used in the TREC ad hoc retrieval
436: track, includes 30 Japanese topics (information need) and relevance
437: assessment (correct judgement) for each topic, along with target
438: documents.  The target documents are 211,853 articles collected from
439: two years' worth of ``Mainichi Shimbun'' newspaper (1994--1995).
440: 
441: Each topic consists of the ID, description and narrative. While
442: descriptions are short phrases related to the topic, narratives
443: consist of one or more sentences describing the
444: topic. Figure~\ref{fig:irex_topic} shows an example topic in the SGML
445: form (translated into English by one of the organizers of the IREX
446: workshop).
447: 
448: However, since the IREX collection does not contain spoken queries, we
449: asked four speakers (two male and two female) to dictate the narrative
450: field. Thus, we produced four different sets of 30 spoken queries. By
451: using those queries, we compared the following different methods:
452: \begin{enumerate}
453: \item text-to-text retrieval, which used written narratives as
454:   queries, and can be seen as a perfect speech-driven text retrieval,
455: \item speech-driven text retrieval, in which only words listed in the
456:   dictionary were modeled in the language model (in other words, the
457:   OOV word detection and query completion modules were not used),
458: \item speech-driven text retrieval, in which OOV words detected in
459:   spoken queries were simply deleted (in other words, the query
460:   completion module was not used),
461: \item speech-driven text retrieval, in which our method proposed in
462:   Section~\ref{sec:overview} was used.
463: \end{enumerate}
464: For methods~2--4, queries dictated by four speakers were used
465: independently. Thus, in practice we compared 13 different retrieval
466: results (one for method~1 and four each for methods~2--4).  In addition, for methods~2--4, ten years' worth of {\it
467:   Mainichi Shimbun\/} Japanese newspaper articles (1991--2000) were
468: used to produce language models. However, while method~2 used only
469: 20,000 high-frequency words for language modeling, methods~3 and 4
470: also used syllables extracted from lower-frequency words (see
471: Section~\ref{sec:speech_recognition}).
472: 
473: Following the IREX workshop, each method retrieved 300 top documents
474: in response to each query, and non-interpolated average precision
475: values were used to evaluate each method.
476: 
477: \subsection{Results}
478: \label{subsec:ex_results}
479: 
480: First, we evaluated the performance of detecting OOV words. In the 30
481: queries used for our evaluation, 14 word {\em tokens\/} (13 word {\em
482:   types\/}) were OOV words unlisted in the dictionary for speech
483: recognition. Table~\ref{tab:oov_evaluation} shows the results on a
484: speaker-by-speaker basis, where ``\#Detected'' and ``\#Correct''
485: denote the total number of OOV words detected by our method and the
486: number of OOV words correctly detected, respectively.  In addition,
487: ``\#Completed'' denotes the number of detected OOV words that were
488: mapped to correct index terms in the 300 top documents.
489: 
490: \begin{table*}[htbp]
491:   \begin{center}
492:     \caption{Results for detecting and completing OOV words.}
493:     \medskip
494:     \leavevmode
495:     \small
496:     \begin{tabular}{lcccccc} \hline\hline
497:       Speaker & \#Detected & \#Correct & \#Completed & Recall &
498:       Precision & Accuracy \\ \hline
499:       Female \#1 &  51 &  9 & 18 & 0.643 & 0.176 & 0.353 \\
500:       Female \#2 &  56 & 10 & 18 & 0.714 & 0.179 & 0.321 \\
501:       Male \#1   &  33 &  9 & 12 & 0.643 & 0.273 & 0.364 \\
502:       Male \#2   &  37 & 12 & 16 & 0.857 & 0.324 & 0.432 \\
503:       \hline
504:       Total      & 177 & 40 & 64 & 0.714 & 0.226 & 0.362 \\
505:       \hline
506:     \end{tabular}
507:     \label{tab:oov_evaluation}
508:   \end{center}
509: \end{table*}
510: 
511: It should be noted that ``\#Completed'' was greater than ``\#Correct''
512: because our method often mistakenly detected words in the dictionary
513: as OOV words, but completed them with index terms correctly. We
514: estimated recall and precision for detecting OOV words, and accuracy
515: for query completion, as in Equation~\eq{eq:rpa}.
516: \begin{equation}
517:   \label{eq:rpa}
518:   \begin{array}{lll}
519:     recall & = & \frac{\textstyle \#Correct}{\textstyle 14} \\
520:     \noalign{\vskip 1.2ex}
521:     precision & = & \frac{\textstyle \#Correct}{\textstyle \#Detected} \\
522:     \noalign{\vskip 1.2ex}
523:     accuracy & = & \frac{\textstyle \#Completed}{\textstyle \#Detected}
524:   \end{array}
525: \end{equation}
526: Looking at Table~\ref{tab:oov_evaluation}, one can see that recall was
527: generally greater than precision. In other words, our method tended to
528: detect as many OOV words as possible. In addition, accuracy of
529: query completion was relatively low.
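
For instance, substituting the values for speaker Female~\#1 in
Table~\ref{tab:oov_evaluation} into Equation~\eq{eq:rpa} gives:
\[
  \begin{array}{lllll}
    recall & = & 9/14 & \approx & 0.643 \\
    precision & = & 9/51 & \approx & 0.176 \\
    accuracy & = & 18/51 & \approx & 0.353
  \end{array}
\]
The ``Total'' row appears to pool the four speakers, so that its
recall is computed over $4 \times 14 = 56$ OOV tokens, i.e.,
$40/56 \approx 0.714$.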
530: 
531: Figure~\ref{fig:examples} shows example words in spoken queries,
532: detected as OOV words and correctly completed with index terms. In
533: this figure, OOV words are transcribed with syllables, where
534: ``\verb|:|'' denotes a long vowel. Hyphens are inserted between
535: Japanese words, which inherently lack lexical segmentation.
536: 
537: \begin{figure*}[htbp]
538:   \begin{center}
539:     \small
540:     \begin{tabular}{lll} \hline\hline
541:       {\hfill\centering OOV words\hfill} &
542:       {\hfill\centering Index terms (syllables)\hfill} &
543:       {\hfill\centering English gloss\hfill} \\ \hline
544:       /\verb|gu re : pu ra chi na ga no|/
545:       & {\it gureepu-furuutsu\/}~/\verb|gu re : pu fu ru : tsu|/
546:       & grapefruit \\
547:       /\verb|ya yo i chi ta|/
548:       & {\it Yayoi-jidai\/}~/\verb|ya yo i ji da i|/
549:       & the {\it Yayoi\/} period \\
550:       /\verb|ni ku ku ra i su|/
551:       & {\it nikku-puraisu\/}~/\verb|ni q ku pu ra i su|/
552:       & Nick Price \\
553:       /\verb|be N pi|/
554:       & {\it benpi\/}~/\verb|be N pi|/
555:       & constipation \\
556:       \hline
557:     \end{tabular}
558:     \caption{Example words detected as OOV words and completed
559:       correctly by our method.}
560:     \label{fig:examples}
561:   \end{center}
562: \end{figure*}
563: 
564: Second, to evaluate the effectiveness of our query completion method
565: more carefully, we compared retrieval accuracy for methods~1--4 (see
566: Section~\ref{subsec:ex_method}).  Table~\ref{tab:avg_pre} shows
567: average precision values, averaged over the 30 queries, for each
568: method\footnote{Average precision is often used to evaluate IR
569:   systems, which should not be confused with evaluation measures in
570:   Equation~\eq{eq:rpa}.}. The average precision value of our method
571: (i.e., method~4) was approximately 87\% of that for text-to-text
572: retrieval.
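
Specifically, using the ``Total'' row of Table~\ref{tab:avg_pre}, the
ratio between method~4 and method~1 is:
\[
  \frac{0.3044}{0.3486} \approx 0.873
\]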
573: 
574: By comparing methods~2--4, one can see that our method improved on the average
575: precision values of the other methods irrespective of the speaker.  To
576: put it more precisely, by comparing methods~3 and 4, one can see the
577: effectiveness of the query completion method. In addition, by
578: comparing methods~2 and 4, one can see that a combination of the OOV
579: word detection and query completion methods was effective.
580: 
581: It may be argued that the improvement was relatively small. However,
582: since the number of OOV words inherent in 30 queries was only 14, the
583: effect of our method was overshadowed by a large number of other
584: words. In fact, the number of words used as query terms for our
585: method, averaged over the four speakers, was 421. Since existing test
586: collections for IR research were not produced to explore the OOV
587: problem, it is difficult to derive conclusions that are statistically
588: valid. Experiments using larger-scale test collections where the OOV
589: problem is more crucial need to be further explored.
590: 
591: Finally, we investigated the time efficiency of our method, and found
592: that CPU time required for the query completion process per detected
593: OOV word was 3.5 seconds (AMD Athlon MP 1900+). However, the additional
594: CPU time for detecting OOV words, which can be performed in a
595: conventional speech recognition process, was not significant.
596: 
597: \subsection{Analyzing Errors}
598: \label{subsec:error_analysis}
599: 
600: We manually analyzed seven cases where the average precision value of
601: our method was significantly lower than that obtained with method~2
602: (the total number of cases was 120, the product of the numbers of
603: queries and speakers).
604: 
605: Among these seven cases, in five cases our query completion method
606: selected incorrect index terms, although correct index terms were
607: included in top-ranked documents obtained in the first stage.  For
608: example, in the case of query 1021 dictated by a female speaker,
609: the word ``{\it seido\/}~(institution)'' was mistakenly transcribed as
610: /\verb|se N do|/. As a result, the word ``{\it sendo\/}~(freshness),''
611: which is associated with the same syllable sequence, was selected as
612: the index term. The word ``{\it seido\/}~(institution)'' was the third
613: candidate based on the score computed by Equation~\eq{eq:score}. To
614: reduce these errors, we need to enhance the score computation.
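
To illustrate with the similarity estimate of
Section~\ref{subsec:qc_formalization} (a sketch, assuming that
``{\it seido\/}'' is segmented into three units of which only the
middle one differs from the transcription), for $w$ being
/\verb|se N do|/:
\[
  P(w|\mbox{\it sendo\/}) \approx \frac{3}{3} = 1
  \qquad\mbox{and}\qquad
  P(w|\mbox{\it seido\/}) \approx \frac{2}{3} \approx 0.67 .
\]
Under these assumptions, the phonetic factor alone already favors the
incorrect but phonetically identical term.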
615: 
616: In another case, our speech recognition module did not correctly
617: recognize words in the dictionary, and decreased the retrieval
618: accuracy.
619: 
620: In the final case, a fragment of a narrative sentence consisting of
621: ten words was detected as a single OOV word. As a result, our method,
622: which can complete sequences of up to two words, mistakenly processed that
623: word, and decreased the retrieval accuracy.  However, this case was
624: exceptional. In most cases, function words, which were recognized
625: with a high accuracy, segmented OOV words into shorter fragments.
626: 
627: \begin{table}[htbp]
628:   \begin{center}
629:     \caption{Non-interpolated average precision values, averaged over
630:       30 queries, for different methods.}
631:     \medskip
632:     \leavevmode
633:     \small
634:     \tabcolsep=4pt
635:     \begin{tabular}{lcccc} \hline\hline
636:       Speaker$\backslash$Method & 1 & 2 & 3 & 4 \\ \hline
637:       Female \#1 & -- & 0.2831 & 0.2834 & 0.3195 \\
638:       Female \#2 & -- & 0.2745 & 0.2443 & 0.2846 \\
639:       Male \#1   & -- & 0.3005 & 0.2987 & 0.3179 \\
640:       Male \#2   & -- & 0.2787 & 0.2675 & 0.2957 \\
641:       \hline
642:       Total      & 0.3486 & 0.2842 & 0.2734 & 0.3044 \\
643:       \hline
644:     \end{tabular}
645:     \label{tab:avg_pre}
646:   \end{center}
647: \end{table}
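
As a reading aid for Table~\ref{tab:avg_pre}, the ``Total'' row
appears to be the arithmetic mean over the four speakers rather than a
sum. This is our reading of the table, and small discrepancies in the
last digit can arise from rounding. For method~4, for example,
\[
  \frac{0.3195 + 0.2846 + 0.3179 + 0.2957}{4} \approx 0.3044 ,
\]
and the relative difference between methods~4 and~2 in that row is
\[
  \frac{0.3044 - 0.2842}{0.2842} \approx 0.071 ,
\]
that is, roughly 7\%.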
648: 
649: \section{Related Work}
650: \label{sec:related_work}
651: 
652: The method proposed by Kupiec~\etal~\shortcite{kupiec:arpa-hlt-94} and
653: our method are similar in the sense that both methods use target
654: collections as language models for speech recognition to realize
655: open-vocabulary speech-driven retrieval.
656: 
657: Kupiec~\etaleos's method, which is based on word recognition and
658: accepts only short queries, derives multiple transcription candidates
659: (i.e., possible word combinations), and searches a target collection
for the most plausible word combination. However, in the case of
longer queries, the number of candidates increases, and thus the
search cost becomes prohibitive.  This is one reason why operational
speech recognition systems have to limit the vocabulary size.
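
As a rough illustration of this growth (the symbols $n$ and $k$ are
introduced here only for exposition and are not part of
Kupiec~\etaleos's formulation), suppose that a query consists of $n$
words and that word recognition yields $k$ candidates for each word.
Under this simplifying assumption, the number of word combinations to
be examined is
\[
  \underbrace{k \times k \times \cdots \times k}_{n} = k^{n},
\]
so that, for example, $k=10$ gives $10^{2}=100$ combinations for a
two-word query but $10^{5}=100{,}000$ combinations for a five-word
query.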
664: 
In contrast, our method, which is based on a recent {\em continuous\/}
speech recognition framework, can accept longer
sentences. Additionally, our method uses a two-stage retrieval
principle to limit the search space in a target collection, and
disambiguates only detected OOV words. Thus, the computational cost
can be reduced.
671: 
672: \section{Conclusion}
673: \label{sec:conclusion}
674: 
675: To facilitate retrieving information by spoken queries, the
676: out-of-vocabulary problem in speech recognition needs to be
677: resolved. In our proposed method, out-of-vocabulary words in a query
678: are detected by speech recognition, and completed with terms indexed
679: for text retrieval, so as to improve the recognition accuracy. In
680: addition, the completed query is used to improve the retrieval
681: accuracy. We showed the effectiveness of our method by using dictated
queries in the IREX collection.  Future work will include experiments
using larger-scale test collections in various domains.
684: 
\bibliographystyle{acl}
686: 
687: \begin{thebibliography}{}
688: 
689: \bibitem[\protect\citename{Bahl \bgroup et al.\egroup
690:   }1983]{bahl:ieee-tpami-1983}
Lalit~R. Bahl, Frederick Jelinek, and Robert~L. Mercer.
692: \newblock 1983.
693: \newblock A maximum likelihood approach to continuous speech recognition.
694: \newblock {\em IEEE Transactions on Pattern Analysis and Machine Intelligence},
695:   5(2):179--190.
696: 
697: \bibitem[\protect\citename{Barnett \bgroup et al.\egroup
698:   }1997]{barnett:eurospeech-97}
699: J.~Barnett, S.~Anderson, J.~Broglio, M.~Singh, R.~Hudson, and S.~W. Kuo.
700: \newblock 1997.
701: \newblock Experiments in spoken queries for document retrieval.
702: \newblock In {\em Proceedings of Eurospeech97}, pages 1323--1326.
703: 
704: \bibitem[\protect\citename{Crestani}2000]{crestani:fqas-2000}
705: Fabio Crestani.
706: \newblock 2000.
707: \newblock Word recognition errors and relevance feedback in spoken query
708:   processing.
709: \newblock In {\em Proceedings of the Fourth International Conference on
710:   Flexible Query Answering Systems}, pages 267--281.
711: 
712: \bibitem[\protect\citename{Fujii \bgroup et al.\egroup
713:   }2002]{fujii:springer-2002}
714: Atsushi Fujii, Katunobu Itou, and Tetsuya Ishikawa.
715: \newblock 2002.
716: \newblock Speech-driven text retrieval: Using target {IR} collections for
717:   statistical language model adaptation in speech recognition.
718: \newblock In Anni~R. Coden, Eric~W. Brown, and Savitha Srinivasan, editors,
719:   {\em Information Retrieval Techniques for Speech Applications (LNCS 2273)},
720:   pages 94--104. Springer.
721: 
722: \bibitem[\protect\citename{Garofolo \bgroup et al.\egroup
723:   }1997]{garofolo:trec-97}
John~S. Garofolo, Ellen~M. Voorhees, Vincent~M. Stanford, and Karen~Sp\"{a}rck
  Jones.
726: \newblock 1997.
727: \newblock {TREC-6} 1997 spoken document retrieval track overview and results.
728: \newblock In {\em Proceedings of the 6th Text REtrieval Conference}, pages
729:   83--91.
730: 
731: \bibitem[\protect\citename{Itou \bgroup et al.\egroup }1998]{itou:98:a}
732: K.~Itou, M.~Yamamoto, K.~Takeda, T.~Takezawa, T.~Matsuoka, T.~Kobayashi,
733:   K.~Shikano, and S.~Itahashi.
734: \newblock 1998.
735: \newblock The design of the newspaper-based {Japanese} large vocabulary
736:   continuous speech recognition corpus.
737: \newblock In {\em Proceedings of the 5th International Conference on Spoken
738:   Language Processing}, pages 3261--3264.
739: 
740: \bibitem[\protect\citename{Itou \bgroup et al.\egroup }1999]{itou:jas-1999}
741: Katunobu Itou, Mikio Yamamoto, Kazuya Takeda, Toshiyuki Takezawa, Tatsuo
742:   Matsuoka, Tetsunori Kobayashi, and Kiyohiro Shikano.
743: \newblock 1999.
744: \newblock {JNAS}: {Japanese} speech corpus for large vocabulary continuous
745:   speech recognition research.
\newblock {\em Journal of the Acoustical Society of Japan (E)}, 20(3):199--206.
747: 
748: \bibitem[\protect\citename{Itou \bgroup et al.\egroup }2001]{itou:asru-2001}
749: Katunobu Itou, Atsushi Fujii, and Tetsuya Ishikawa.
750: \newblock 2001.
751: \newblock Language modeling for multi-domain speech-driven text retrieval.
752: \newblock In {\em IEEE Automatic Speech Recognition and Understanding
753:   Workshop}.
754: 
755: \bibitem[\protect\citename{Jourlin \bgroup et al.\egroup
756:   }2000]{jourlin:sc-2000}
757: Pierre Jourlin, Sue~E. Johnson, Karen~Sp\"{a}rck Jones, and Philip~C. Woodland.
758: \newblock 2000.
759: \newblock Spoken document representations for probabilistic retrieval.
760: \newblock {\em Speech Communication}, 32:21--36.
761: 
762: \bibitem[\protect\citename{Kawahara \bgroup et al.\egroup
763:   }2000]{kawahara:icslp-2000}
764: T.~Kawahara, A.~Lee, T.~Kobayashi, K.~Takeda, N.~Minematsu, S.~Sagayama,
765:   K.~Itou, A.~Ito, M.~Yamamoto, A.~Yamada, T.~Utsuro, and K.~Shikano.
766: \newblock 2000.
767: \newblock Free software toolkit for {Japanese} large vocabulary continuous
768:   speech recognition.
769: \newblock In {\em Proceedings of the 6th International Conference on Spoken
770:   Language Processing}, pages 476--479.
771: 
772: \bibitem[\protect\citename{Kupiec \bgroup et al.\egroup
773:   }1994]{kupiec:arpa-hlt-94}
774: Julian Kupiec, Don Kimber, and Vijay Balasubramanian.
775: \newblock 1994.
776: \newblock Speech-based retrieval using semantic co-occurrence filtering.
777: \newblock In {\em Proceedings of the ARPA Human Language Technology Workshop},
778:   pages 373--377.
779: 
780: \bibitem[\protect\citename{Kwok and Chan}1998]{kwok:sigir-98}
K.~L. Kwok and M.~Chan.
782: \newblock 1998.
783: \newblock Improving two-stage ad-hoc retrieval for short queries.
784: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR
785:   Conference on Research and Development in Information Retrieval}, pages
786:   250--256.
787: 
788: \bibitem[\protect\citename{Paul and Baker}1992]{paul:darpa-ws-1992}
789: Douglas~B. Paul and Janet~M. Baker.
790: \newblock 1992.
791: \newblock The design for the {Wall} {Street} {Journal}-based {CSR} corpus.
792: \newblock In {\em Proceedings of DARPA Speech \& Natural Language Workshop},
793:   pages 357--362.
794: 
795: \bibitem[\protect\citename{Robertson and Walker}1994]{robertson:sigir-94}
S.~E. Robertson and S.~Walker.
\newblock 1994.
\newblock Some simple effective approximations to the 2-{P}oisson model for
  probabilistic weighted retrieval.
800: \newblock In {\em Proceedings of the 17th Annual International ACM SIGIR
801:   Conference on Research and Development in Information Retrieval}, pages
802:   232--241.
803: 
804: \bibitem[\protect\citename{Steeneken and van
805:   Leeuwen}1995]{steeneken:eurospeech-1995}
806: Herman J.~M. Steeneken and David~A. van Leeuwen.
807: \newblock 1995.
808: \newblock Multi-lingual assessment of speaker independent large vocabulary
809:   speech-recognition systems: The {SQALE}-project.
810: \newblock In {\em Proceedings of Eurospeech95}, pages 1271--1274.
811: 
812: \bibitem[\protect\citename{Wechsler \bgroup et al.\egroup
813:   }1998]{wechsler:sigir-98}
814: Martin Wechsler, Eugen Munteanu, and Peter Sch\"{a}uble.
815: \newblock 1998.
816: \newblock New techniques for open-vocabulary spoken document retrieval.
817: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR
818:   Conference on Research and Development in Information Retrieval}, pages
819:   20--27.
820: 
821: \bibitem[\protect\citename{Young}1996]{young:ieee-spm-1996}
822: Steve Young.
823: \newblock 1996.
824: \newblock A review of large-vocabulary continuous-speech recognition.
825: \newblock {\em IEEE Signal Processing Magazine}, pages 45--57, September.
826: 
827: \end{thebibliography}
828: 
829: \end{document}
830: