0206:cs0206014/main.tex

1: \documentclass[11pt]{article}

2: \usepackage{acl2002}

3: \usepackage{times}

4: \usepackage{latexsym}

5: \setlength\titlebox{6.5cm}    % Expanding the titlebox

6:

7: \title{A Method for Open-Vocabulary Speech-Driven Text Retrieval}

8:

9: \author{Atsushi Fujii\thanks{~~~The first and second authors

10:        are also members of CREST, Japan Science and Technology

11:        Corporation.}\\

12:   University of Library and\\

13:   Information Science\\

14:   1-2 Kasuga, Tsukuba\\

15:   305-8550, Japan\\

16:   {\tt fujii@ulis.ac.jp} \And

17:   Katunobu Itou\\

18:   National Institute of\\

19:   Advanced Industrial\\

20:   Science and Technology\\

21:   1-1-1 Chuuou Daini Umezono\\

22:   Tsukuba, 305-8568, Japan\\

23:   {\tt itou@ni.aist.go.jp} \And

24:   Tetsuya Ishikawa\\

25:   University of Library and\\

26:   Information Science\\

27:   1-2 Kasuga, Tsukuba\\

28:   305-8550, Japan\\

29:   {\tt ishikawa@ulis.ac.jp}}

30:

31: \date{}

32:

33: \newcommand{\etal}{et~al.}

34: \newcommand{\etaleos}{et~al}

35: \newcommand{\eq}[1]{(\ref{#1})}

36: \renewcommand{\nocite}[1]{\shortcite{#1}}

37: \input{psfig.tex}

38:

39: \begin{document}

40: \maketitle

41: \begin{abstract}

42:   While recent retrieval techniques do not limit the number of index

43:   terms, out-of-vocabulary (OOV) words are crucial in speech

44:   recognition. Aiming at retrieving information with spoken queries,

45:   we fill the gap between speech recognition and text retrieval in

46:   terms of the vocabulary size. Given a spoken query, we generate a

47:   transcription and detect OOV words through speech recognition. We

48:   then map detected OOV words to terms indexed in a target

49:   collection to complete the transcription, and search the collection

50:   for documents relevant to the completed transcription. We show the

51:   effectiveness of our method by way of experiments.

52: \end{abstract}

53:

54: \section{Introduction}

55: \label{sec:introduction}

56:

57: Automatic speech recognition, which decodes human voice to generate

58: transcriptions, has recently become a practical technology. It is now

59: feasible to use speech recognition in real-world human language

60: applications, such as information retrieval.

61:

62: Initiated partially by TREC-6, various methods have been proposed for

63: ``spoken document retrieval (SDR),'' in which written queries are used to

64: search speech archives for relevant

65: information~\cite{garofolo:trec-97}.  State-of-the-art SDR methods,

66: where the speech recognition error rate is 20--30\%, are comparable with

67: text retrieval methods in performance~\cite{jourlin:sc-2000}, and thus

68: are already practical. Possible rationales include that recognition

69: errors are overshadowed by a large number of words correctly

70: transcribed in target documents.

71:

72: However, ``speech-driven retrieval,'' where spoken queries are used to

73: retrieve (textual) information, has not fully been explored, although

74: it is related to numerous keyboard-less applications, such as

75: telephone-based retrieval, car navigation systems, and user-friendly

76: interfaces.

77:

78: Unlike spoken document retrieval, speech-driven retrieval is still a

79: challenging task, because recognition errors in short queries

80: considerably decrease retrieval accuracy.  Several previous studies

81: have addressed this issue.

82:

83: Barnett~\etal~\shortcite{barnett:eurospeech-97}  and

84: Crestani~\shortcite{crestani:fqas-2000} independently performed

85: comparative experiments related to speech-driven retrieval, where the

86: DRAGON speech recognition system was used as an input interface for

87: the INQUERY text retrieval system.  They used as test queries 35

88: topics in the TREC collection, dictated by a single male speaker.

89: However, these cases focused on improving text retrieval methods and

90: did not address problems in improving speech recognition.  As a

91: result, errors in recognizing spoken queries (error rate was

92: approximately 30\%) considerably decreased the retrieval accuracy.

93:

94: Although we showed that the use of target document collections in

95: producing language models for speech recognition significantly

96: improved the performance of speech-driven

97: retrieval~\cite{fujii:springer-2002,itou:asru-2001}, a number of

98: issues still remain open questions.

99:

100: Section~\ref{sec:problem} clarifies problems addressed in this paper.

101: Section~\ref{sec:overview} overviews our speech-driven text retrieval

102: system. Sections~\ref{sec:speech_recognition}--\ref{sec:query_completion}

103: elaborate on our methodology.  Section~\ref{sec:experimentation}

104: describes comparative experiments, in which an existing IR test

105: collection was used to evaluate the effectiveness of our

106: method. Section~\ref{sec:related_work} discusses related research

107: literature.

108:

109: \section{Problem Statement}

110: \label{sec:problem}

111:

112: One major problem in speech-driven retrieval is related to

113: out-of-vocabulary (OOV) words.

114:

115: On the one hand, recent IR systems do not limit the vocabulary size

116: (i.e., the number of index terms), and can be seen as open-vocabulary

117: systems, which allow users to input any keywords contained in a target

118: collection.  It is often the case that a couple of million terms are

119: indexed for a single IR system.

120:

121: On the other hand, state-of-the-art speech recognition systems still

122: need to limit the vocabulary size (i.e., the number of words in a

123: dictionary), due to problems in estimating statistical language

124: models~\cite{young:ieee-spm-1996} and constraints associated with

125: hardware, such as memory. In addition, computation time is crucial

126: for real-time use, including speech-driven retrieval.  In view of

127: these problems, for many languages the vocabulary size is limited to

128: a few tens of

129: thousands~\cite{itou:jas-1999,paul:darpa-ws-1992,steeneken:eurospeech-1995},

130: which is far smaller than the size of indexes for practical

131: IR systems.

132:

133: In addition, high-frequency words, such as functional words and common

134: nouns, are usually included in dictionaries and recognized with a high

135: accuracy. However, those words are not necessarily useful for

136: retrieval.  On the contrary, low-frequency words appearing in specific

137: documents are often effective query terms.

138:

139: To sum up, the OOV problem is inherent in speech-driven retrieval, and

140: we need to fill the gap between speech recognition and text retrieval

141: in terms of the vocabulary size. In this paper, we propose a method to

142: resolve this problem aiming at open-vocabulary speech-driven retrieval.

143:

144: \section{System Overview}

145: \label{sec:overview}

146:

147: Figure~\ref{fig:system} depicts the overall design of our

148: speech-driven text retrieval system, which consists of speech

149: recognition, text retrieval and query completion modules.  Although

150: our system is currently implemented for Japanese, our methodology is

151: language-independent.  We explain the retrieval process based on this

152: figure.

153:

154: Given a query spoken by a user, the speech recognition module uses a

155: dictionary and acoustic/language models to generate a transcription of

156: the user speech. During this process, OOV words, which are not listed

157: in the dictionary, are also detected.  For this purpose, our language

158: model includes both words and syllables so that OOV words are

159: transcribed as sequences of syllables.

160:

161: For example, in the case where ``{\it kankitsu\/}~(citrus)'' is not

162: listed in the dictionary, this word should be transcribed as

163: /\verb|ka N ki tsu|/. However, it is possible that this word is mistakenly

164: transcribed, such as /\verb|ka N ke tsu|/ and /\verb|ka N ke tsu ke ko|/.

165:

166: To improve the quality of our system, these syllable sequences have to

167: be transcribed as {\em words\/}, which is one of the central issues in

168: this paper. In the case of speech-driven retrieval, where users

169: usually have specific information needs, it is feasible that users

170: utter contents related to a target collection. In other words, there

171: is a great possibility that detected OOV words can be identified as

172: index terms that are phonetically identical or similar.

173:

174: However, since a) a single sound can potentially correspond to more

175: than one word (i.e., homonyms) and b) searching the entire collection

176: for phonetically identical/similar terms is prohibitive,  we need an

177: efficient disambiguation method. Specifically, in the case of

178: Japanese, the homonym problem is particularly serious because words

179: consist of different character types, i.e., ``{\it kanji},'' ``{\it

180:   katakana},'' ``{\it hiragana},'' alphabetic characters and other characters like

181: numerals\footnote{In Japanese, {\it kanji\/} (or Chinese characters) are

182:   ideograms, and {\it katakana\/} and {\it hiragana\/} are

183:   phonograms.}.

184:

185: To resolve this problem, we use a two-stage retrieval method. In the

186: first stage, we delete OOV words from the transcription, and perform

187: text retrieval using remaining words, to obtain a specific number of

188: top-ranked documents according to the degree of relevance.  Even if

189: speech recognition is not perfect, these documents are potentially

190: associated with the user speech more than the entire collection.

191: Thus, we search only these documents for index terms corresponding to

192: detected OOV words.

193:

194: Then, in the second stage, we replace detected OOV words with

195: identified index terms so as to complete the transcription, and

196: re-perform text retrieval to obtain final outputs. However, we do not

197: re-perform speech recognition in the second stage.

198:

199: In the above example, let us assume that the user also utters words

200: related to ``{\it kankitsu\/}~(citrus),'' such as ``{\it

201:   orenji\/}~(orange)'' and ``{\it remon\/}~(lemon),'' and that these

202: words are correctly recognized as words. In this case, it is possible

203: that retrieved documents contain the word ``{\it

204:   kankitsu\/}~(citrus).'' Thus, we replace the syllable sequence

205: /\verb|ka N ke tsu|/ in the query with ``{\it kankitsu\/},'' which is

206: additionally used as a query term in the second stage.

207:

208: It may be argued that our method resembles the notion of

209: pseudo-relevance feedback (or local feedback) for IR, where documents

210: obtained in the first stage are used to expand query terms, and final

211: outputs are refined in the second stage~\cite{kwok:sigir-98}.

212: However, while relevance feedback is used to improve only the

213: retrieval accuracy, our method improves the speech recognition and

214: retrieval accuracy.

215:

216: \begin{figure}[htbp]

217:   \begin{center}

218:     \leavevmode \psfig{file=system.eps,height=2.3in}

219:   \end{center}

220:   \caption{The overall design of our speech-driven text retrieval system.}

221:   \label{fig:system}

222: \end{figure}

223:

224: \section{Speech Recognition}

225: \label{sec:speech_recognition}

226:

227: The speech recognition module generates word sequence $W$, given phone

228: sequence $X$.  In a stochastic speech recognition

229: framework~\cite{bahl:ieee-tpami-1983}, the task is to select the $W$

230: maximizing $P(W|X)$, which is transformed as in Equation~\eq{eq:bayes}

231: through the Bayesian theorem.

232: \begin{equation}

233:   \label{eq:bayes}

234:   \arg\max_{W}P(W|X) = \arg\max_{W}P(X|W)\cdot P(W)

235: \end{equation}

236: Here, $P(X|W)$ models a probability that word sequence $W$ is

237: transformed into phone sequence $X$, and $P(W)$ models a probability

238: that $W$ is linguistically acceptable. These factors are usually

239: called acoustic and language models, respectively.
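
The intermediate step behind Equation~\eq{eq:bayes}, which is standard
and only sketched here, is that the Bayesian theorem gives
\[
  P(W|X) = \frac{P(X|W)\cdot P(W)}{P(X)},
\]
and $P(X)$ does not depend on $W$, so that it can be dropped from the
maximization without changing the selected $W$.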

240:

241: For the speech recognition module, we use the Japanese dictation

242: toolkit~\cite{kawahara:icslp-2000}\footnote{http://winnie.kuis.kyoto-u.ac.jp/dictation},

243: which includes the ``Julius'' recognition engine and acoustic/language

244: models. The acoustic model was produced by way of the ASJ speech

245: database (ASJ-JNAS)~\cite{itou:98:a,itou:jas-1999}, which contains

246: approximately 20,000 sentences uttered by 132 speakers of both

247: genders.

248:

249: This toolkit also includes development software so that acoustic and

250: language models can be produced and replaced depending on the

251: application.  While we use the acoustic model provided in the toolkit,

252: we use a new language model including both words and syllables.  For

253: this purpose, we used the ``ChaSen'' morphological

254: analyzer\footnote{http://chasen.aist-nara.ac.jp} to extract words from

255: ten years' worth of ``Mainichi Shimbun'' newspaper articles (1991--2000).

256:

257: Then, we selected 20,000 high-frequency words to produce a dictionary.

258: At the same time, we segmented remaining lower-frequency words into

259: syllables based on the Japanese phonogram system. The resultant number

260: of syllable types was approximately 700.  Finally, we produced a

261: word/syllable-based trigram language model.  In other words, OOV words

262: were modeled as sequences of syllables. Thus, by using our language

263: model, OOV words can easily be detected.
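
For reference, under the usual trigram approximation, the language
model score of a transcription $W = u_{1} \ldots u_{n}$ can be
sketched as
\[
  P(W) \approx \prod_{i=1}^{n} P(u_{i}|u_{i-2},u_{i-1}),
\]
where the unit notation $u_{i}$ is introduced here only for
illustration, and each unit is either one of the 20,000 dictionary
words or a syllable (with sentence-start padding assumed for $i \leq
2$).  Under this reading, a run of consecutive syllable units in the
output presumably corresponds to one detected OOV word.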

264:

265: In spoken document retrieval, an open-vocabulary method, which

266: combines recognition methods for words and syllables in target speech

267: documents, was also proposed~\cite{wechsler:sigir-98}.  However, this

268: method requires an additional computation for recognizing syllables,

269: and thus is expensive. In contrast, since our language model is a

270: regular statistical $N$-gram model, we can use the same speech

271: recognition framework as in Equation~\eq{eq:bayes}.

272:

273: \section{Text Retrieval}

274: \label{sec:text_retrieval}

275:

276: The text retrieval module is based on the ``Okapi'' probabilistic

277: retrieval method~\cite{robertson:sigir-94}, which is used to compute

278: the relevance score between the transcribed query and each document in

279: a target collection.  To produce an inverted file (i.e., an index), we

280: use ChaSen to extract content words from documents as terms, and

281: perform a word-based indexing. We also extract terms from transcribed

282: queries using the same method.

283:

284: \section{Query Completion}

285: \label{sec:query_completion}

286:

287: \subsection{Overview}

288: \label{subsec:qc_overview}

289:

290: As explained in Section~\ref{sec:overview}, the basis of the query

291: completion module is to map OOV words detected by speech

292: recognition (Section~\ref{sec:speech_recognition}) to index terms used

293: for text retrieval (Section~\ref{sec:text_retrieval}). However, to

294: identify corresponding index terms efficiently, we limit the number of

295: documents in the first stage retrieval.  In principle, terms that are

296: indexed in top-ranked documents (those retrieved in the first stage)

297: and have the same sound as detected OOV words can be corresponding

298: terms.

299:

300: However, a single sound often corresponds to multiple words.  In

301: addition, since speech recognition on a syllable-by-syllable basis is

302: not perfect, it is possible that OOV words are incorrectly

303: transcribed.  For example, in some cases the Japanese word ``{\it

304:   kankitsu\/}~(citrus)'' is transcribed as /\verb|ka N ke tsu|/.

305: Thus, we also need to consider index terms that are phonetically {\em

306:   similar\/} to OOV words.  To sum up, we need a disambiguation method

307: to select appropriate corresponding terms, out of a number of

308: candidates.

309:

310: \subsection{Formalization}

311: \label{subsec:qc_formalization}

312:

313: Intuitively, it is feasible that appropriate terms:

314: \begin{itemize}

315: \item sound identical or similar to OOV words detected in spoken

316:   queries,

317: \item frequently appear in a top-ranked document set,

318: \item and appear in higher-ranked documents.

319: \end{itemize}

320: From the viewpoint of probability theory, possible representations for

321: the above three properties include Equation~\eq{eq:prob}, where each

322: property corresponds to different parameters. Our task is to select

323: the $t$ maximizing the value computed by this equation as the

324: corresponding term for OOV word $w$.

325: \begin{equation}

326:   \label{eq:prob}

327:   \sum_{d \in D_{q}} P(w|t) \cdot P(t|d) \cdot P(d|q)

328: \end{equation}

329: Here, $D_{q}$ is the top-ranked document set retrieved in the first

330: stage, given query $q$.  \mbox{$P(w|t)$} is a probability that index

331: term $t$ can be replaced with detected OOV word $w$, in terms of

332: phonetics. \mbox{$P(t|d)$} is the relative frequency of term $t$ in

333: document $d$. \mbox{$P(d|q)$} is a probability that document $d$ is

334: relevant to query $q$, which is associated with the score formalized

335: in the Okapi method.

336:

337: However, from the viewpoint of empiricism, Equation~\eq{eq:prob} is

338: not necessarily effective. First, it is not easy to estimate $P(w|t)$

339: based on the probability theory. Second, the probability score

340: computed by the Okapi method is an approximation focused mainly on

341: {\em relative\/} superiority among retrieved documents, and thus it is

342: difficult to estimate $P(d|q)$ in a rigorous manner. Finally, it is

343: also difficult to determine the degree to which each parameter

344: influences the final probability score.

345:

346: In view of these problems, through preliminary experiments we

347: approximated Equation~\eq{eq:prob} and formalized a method to compute

348: the degree (not the probability) to which given index term $t$

349: corresponds to OOV word $w$.

350:

351: First, we estimate \mbox{$P(w|t)$} by the ratio between the number of

352: syllables commonly included in both $w$ and $t$ and the total number

353: of syllables in $w$.  We use a DP matching method to identify the

354: number of cases related to deletion, insertion, and substitution in

355: $w$, on a syllable-by-syllable basis.
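
As a rough illustration, assuming that the ``syllables commonly
included'' are those matched in the DP alignment, consider the detected
sequence $w =$ /\verb|ka N ke tsu|/ and the index term $t =$ ``{\it
  kankitsu\/}'' (/\verb|ka N ki tsu|/).  The alignment contains three
matches and one substitution (\verb|ke| for \verb|ki|), so that
\[
  P(w|t) \approx \frac{3}{4} = 0.75 .
\]
For the longer mistranscription /\verb|ka N ke tsu ke ko|/, the same
term yields three matches, one substitution and two insertions, i.e.,
$3/6 = 0.5$.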

356:

357: Second, \mbox{$P(w|t)$} should be more influential than

358: \mbox{$P(t|d)$} and \mbox{$P(d|q)$} in Equation~\eq{eq:prob}, although

359: the last two parameters are effective in the case where a large number

360: of candidates phonetically similar to $w$ are obtained. To decrease

361: the effect of \mbox{$P(t|d)$} and \mbox{$P(d|q)$}, we tentatively use

362: logarithms of these parameters. In addition, we use the score computed

363: by the Okapi method as \mbox{$P(d|q)$}.

364:

365: According to the above approximation, we compute the score of $t$ as

366: in Equation~\eq{eq:score}.

367: \begin{equation}

368:   \label{eq:score}

369:   \sum_{d \in D_{q}} P(w|t) \cdot \log(P(t|d) \cdot P(d|q))

370: \end{equation}

371: It should be noted that Equation~\eq{eq:score} is independent of the

372: indexing method used, and therefore $t$ can be any sequence of

373: characters contained in $D_{q}$. In other words, any type of indexing

374: method (e.g., word-based and phrase-based indexing methods) can be

375: used in our framework.
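
Since \mbox{$P(w|t)$} does not depend on $d$,
Equation~\eq{eq:score} can equivalently be written as
\[
  P(w|t) \cdot \sum_{d \in D_{q}} \left( \log P(t|d) + \log P(d|q) \right),
\]
which makes explicit that the phonetic similarity scales the score
linearly, whereas the term frequency and the document score contribute
only through their logarithms.  This rewriting assumes that the sum
ranges over documents in which $t$ occurs, because $\log P(t|d)$ is
undefined when \mbox{$P(t|d) = 0$}; this restriction is an assumption
made here for illustration.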

376:

377: \subsection{Implementation}

378: \label{subsec:qc_implementation}

379:

380: Since computation time is crucial for real-time use, we preprocess

381: documents in a target collection so as to identify candidate terms

382: efficiently. This process is similar to the indexing process performed

383: in the text retrieval module.

384:

385: In the case of text retrieval, index terms are organized in an

386: inverted file so that documents including terms that {\em exactly\/}

387: match with query keywords can be retrieved efficiently.

388:

389: However, in the case of query completion, terms that are included in

390: top-ranked documents need to be retrieved. In addition, to reduce the

391: cost of score computation (for example, DP matching is time-consuming), it is

392: desirable to discard terms that are associated with a low

393: phonetic similarity value, \mbox{$P(w|t)$}, prior to the computation

394: of Equation~\eq{eq:score}.  In other words, an index file for query

395: completion has to be organized so that a {\em partial\/} matching

396: method can be used. For example, /\verb|ka N ki tsu|/ has to be

397: retrieved efficiently in response to /\verb|ka N ke tsu|/.

398:

399: Thus, we implemented a forward/backward partial-matching method, in

400: which entries can be retrieved by any substrings from the first/last

401: characters. In addition, we index words and word-based bigrams,

402: because preliminary experiments showed that OOV words detected by our

403: speech recognition module are usually single words or short phrases,

404: such as ``{\it ozon-houru\/}~(ozone hole).''

405:

406: \section{Experimentation}

407: \label{sec:experimentation}

408:

409: \subsection{Methodology}

410: \label{subsec:ex_method}

411:

412: \begin{figure*}[htbp]

413:   \begin{center}

414:     \leavevmode

415:     \begin{quote}

416:       \tt

417:       \footnotesize

418:       <TOPIC><TOPIC-ID>1001</TOPIC-ID> \\

419:       <DESCRIPTION>Corporate merging</DESCRIPTION> \\

420:       <NARRATIVE>The article describes a corporate merging and in the

421:       article, the name of companies have to be

422:       identifiable. Information

423:       including the field and the purpose of the merging have to be

424:       identifiable. Corporate merging includes corporate acquisition,

425:       corporate unifications and corporate buying.</NARRATIVE></TOPIC>

426:     \end{quote}

427:     \caption{An English translation for an example topic in the IREX collection.}

428:     \label{fig:irex_topic}

429:   \end{center}

430: \end{figure*}

431:

432: To evaluate the performance of our speech-driven retrieval system, we

433: used the IREX

434: collection\footnote{http://cs.nyu.edu/cs/projects/proteus/irex/index-e.html}. This

435: test collection, which resembles one used in the TREC ad hoc retrieval

436: track, includes 30 Japanese topics (information needs) and relevance

437: assessments (correct judgements) for each topic, along with target

438: documents.  The target documents are 211,853 articles collected from

439: two years' worth of the ``Mainichi Shimbun'' newspaper (1994--1995).

440:

441: Each topic consists of the ID, description and narrative. While

442: descriptions are short phrases related to the topic, narratives

443: consist of one or more sentences describing the

444: topic. Figure~\ref{fig:irex_topic} shows an example topic in the SGML

445: form (translated into English by one of the organizers of the IREX

446: workshop).

447:

448: However, since the IREX collection does not contain spoken queries, we

449: asked four speakers (two male and two female) to dictate the narrative

450: field. Thus, we produced four different sets of 30 spoken queries. By

451: using those queries, we compared the following different methods:

452: \begin{enumerate}

453: \item text-to-text retrieval, which used written narratives as

454:   queries, and can be seen as a perfect speech-driven text retrieval,

455: \item speech-driven text retrieval, in which only words listed in the

456:   dictionary were modeled in the language model (in other words, the

457:   OOV word detection and query completion modules were not used),

458: \item speech-driven text retrieval, in which OOV words detected in

459:   spoken queries were simply deleted (in other words, the query

460:   completion module was not used),

461: \item speech-driven text retrieval, in which our method proposed in

462:   Section~\ref{sec:overview} was used.

463: \end{enumerate}

464: In the case of methods~2--4, queries dictated by the four speakers were used

465: independently. Thus, in practice we compared 13 different retrieval

466: results (one for method~1 and four each for methods~2--4).  In addition, for methods~2--4, ten years' worth of {\it

467:   Mainichi Shimbun\/} Japanese newspaper articles (1991--2000) were

468: used to produce language models. However, while method~2 used only

469: 20,000 high-frequency words for language modeling, methods~3 and 4

470: also used syllables extracted from lower-frequency words (see

471: Section~\ref{sec:speech_recognition}).

472:

473: Following the IREX workshop, each method retrieved 300 top documents

474: in response to each query, and non-interpolated average precision

475: values were used to evaluate each method.

476:

477: \subsection{Results}

478: \label{subsec:ex_results}

479:

480: First, we evaluated the performance of detecting OOV words. In the 30

481: queries used for our evaluation, 14 word {\em tokens\/} (13 word {\em

482:   types\/}) were OOV words unlisted in the dictionary for speech

483: recognition. Table~\ref{tab:oov_evaluation} shows the results on a

484: speaker-by-speaker basis, where ``\#Detected'' and ``\#Correct''

485: denote the total number of OOV words detected by our method and the

486: number of OOV words correctly detected, respectively.  In addition,

487: ``\#Completed'' denotes the number of detected OOV words that were

488: mapped to correct index terms in the 300 top documents.

489:

490: \begin{table*}[htbp]

491:   \begin{center}

492:     \caption{Results for detecting and completing OOV words.}

493:     \medskip

494:     \leavevmode

495:     \small

496:     \begin{tabular}{lcccccc} \hline\hline

497:       Speaker & \#Detected & \#Correct & \#Completed & Recall &

498:       Precision & Accuracy \\ \hline

499:       Female \#1 &  51 &  9 & 18 & 0.643 & 0.176 & 0.353 \\

500:       Female \#2 &  56 & 10 & 18 & 0.714 & 0.179 & 0.321 \\

501:       Male \#1   &  33 &  9 & 12 & 0.643 & 0.273 & 0.364 \\

502:       Male \#2   &  37 & 12 & 16 & 0.857 & 0.324 & 0.432 \\

503:       \hline

504:       Total      & 177 & 40 & 64 & 0.714 & 0.226 & 0.362 \\

505:       \hline

506:     \end{tabular}

507:     \label{tab:oov_evaluation}

508:   \end{center}

509: \end{table*}

510:

511: It should be noted that ``\#Completed'' was greater than ``\#Correct''

512: because our method often mistakenly detected words in the dictionary

513: as OOV words, but completed them with index terms correctly. We

514: estimated recall and precision for detecting OOV words, and accuracy

515: for query completion, as in Equation~\eq{eq:rpa}.

516: \begin{equation}

517:   \label{eq:rpa}

518:   \begin{array}{lll}

519:     recall & = & \frac{\textstyle \#Correct}{\textstyle 14} \\

520:     \noalign{\vskip 1.2ex}

521:     precision & = & \frac{\textstyle \#Correct}{\textstyle \#Detected} \\

522:     \noalign{\vskip 1.2ex}

523:     accuracy & = & \frac{\textstyle \#Completed}{\textstyle \#Detected}

524:   \end{array}

525: \end{equation}

526: Looking at Table~\ref{tab:oov_evaluation}, one can see that recall was

527: generally greater than precision. In other words, our method tended to

528: detect as many OOV words as possible. In addition, accuracy of

529: query completion was relatively low.
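
For instance, taking the row for Female~\#1 in
Table~\ref{tab:oov_evaluation} (51 detected, 9 correct and 18
completed), Equation~\eq{eq:rpa} gives
\[
  \begin{array}{lllll}
    recall & = & 9/14 & \approx & 0.643 \\
    precision & = & 9/51 & \approx & 0.176 \\
    accuracy & = & 18/51 & \approx & 0.353
  \end{array}
\]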

530:

531: Figure~\ref{fig:examples} shows example words in spoken queries,

532: detected as OOV words and correctly completed with index terms. In

533: this figure, OOV words are transcribed with syllables, where

534: ``\verb|:|'' denotes a long vowel. Hyphens are inserted between

535: Japanese words, which inherently lack lexical segmentation.

536:

537: \begin{figure*}[htbp]

538:   \begin{center}

539:     \small

540:     \begin{tabular}{lll} \hline\hline

541:       {\hfill\centering OOV words\hfill} &

542:       {\hfill\centering Index terms (syllables)\hfill} &

543:       {\hfill\centering English gloss\hfill} \\ \hline

544:       /\verb|gu re : pu ra chi na ga no|/

545:       & {\it gureepu-furuutsu\/}~/\verb|gu re : pu fu ru : tsu|/

546:       & grapefruit \\

547:       /\verb|ya yo i chi ta|/

548:       & {\it Yayoi-jidai\/}~/\verb|ya yo i ji da i|/

549:       & the {\it Yayoi\/} period \\

550:       /\verb|ni ku ku ra i su|/

551:       & {\it nikku-puraisu\/}~/\verb|ni q ku pu ra i su|/

552:       & Nick Price \\

553:       /\verb|be N pi|/

554:       & {\it benpi\/}~/\verb|be N pi|/

555:       & constipation \\

556:       \hline

557:     \end{tabular}

558:     \caption{Example words detected as OOV words and completed

559:       correctly by our method.}

560:     \label{fig:examples}

561:   \end{center}

562: \end{figure*}

563:

564: Second, to evaluate the effectiveness of our query completion method

565: more carefully, we compared retrieval accuracy for methods~1--4 (see

566: Section~\ref{subsec:ex_method}).  Table~\ref{tab:avg_pre} shows

567: average precision values, averaged over the 30 queries, for each

568: method\footnote{Average precision is often used to evaluate IR

569:   systems, which should not be confused with evaluation measures in

570:   Equation~\eq{eq:rpa}.}. The average precision value of our method

571: (i.e., method~4) was approximately 87\% of that for text-to-text

572: retrieval.

573:

574: By comparing methods~2--4, one can see that our method achieved higher

575: average precision than the other methods irrespective of the speaker.  To

576: put it more precisely, by comparing methods~3 and 4, one can see the

577: effectiveness of the query completion method. In addition, by

578: comparing methods~2 and 4, one can see that a combination of the OOV

579: word detection and query completion methods was effective.
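
In terms of the totals in Table~\ref{tab:avg_pre}, these comparisons
can be summarized by the following ratios, computed here directly from
the reported values:
\[
  \begin{array}{lllll}
    \mbox{method 4 / method 1} & = & 0.3044/0.3486 & \approx & 0.873 \\
    \mbox{method 4 / method 2} & = & 0.3044/0.2842 & \approx & 1.071 \\
    \mbox{method 4 / method 3} & = & 0.3044/0.2734 & \approx & 1.113
  \end{array}
\]
That is, the relative gains over methods~2 and 3 are roughly 7\% and
11\%, respectively.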

580:

581: It may be argued that the improvement was relatively small. However,

582: since the number of OOV words inherent in 30 queries was only 14, the

583: effect of our method was overshadowed by a large number of other

584: words. In fact, the number of words used as query terms for our

585: method, averaged over the four speakers, was 421. Since existing test

586: collections for IR research were not produced to explore the OOV

587: problem, it is difficult to derive conclusions that are statistically

588: valid. Further experiments using larger-scale test collections, in which the OOV

589: problem is more prominent, are needed.

590:

591: Finally, we investigated the time efficiency of our method, and found

592: that CPU time required for the query completion process per detected

593: OOV word was 3.5 seconds (AMD Athlon MP 1900+). However, the additional

594: CPU time for detecting OOV words, which can be performed in a

595: conventional speech recognition process, was not crucial.

596:

597: \subsection{Analyzing Errors}

598: \label{subsec:error_analysis}

599:

600: We manually analyzed seven cases where the average precision value of

601: our method was significantly lower than that obtained with method~2

602: (the total number of cases was the product of numbers of queries and

603: speakers).

604:

605: Among these seven cases, in five cases our query completion method

606: selected incorrect index terms, although correct index terms were

607: included in top-ranked documents obtained in the first stage.  For

608: example, in the case of the query 1021 dictated by a female speaker,

609: the word ``{\it seido\/}~(institution)'' was mistakenly transcribed as

610: /\verb|se N do|/. As a result, the word ``{\it sendo\/}~(freshness),''

611: which is associated with the same syllable sequence, was selected as

612: the index term. The word ``{\it seido\/}~(institution)'' was the third

613: candidate based on the score computed by Equation~\eq{eq:score}. To

614: reduce these errors, we need to enhance the score computation.

615:

616: In another case, our speech recognition module did not correctly

617: recognize words in the dictionary, which decreased the retrieval

618: accuracy.

619:

620: In the final case, a fragment of a narrative sentence consisting of

621: ten words was detected as a single OOV word. As a result, our method,

622: which can complete sequences of up to two words, mistakenly processed that

623: word, which decreased the retrieval accuracy.  However, this case was

624: exceptional. In most cases, function words, which were recognized

625: with high accuracy, segmented OOV words into shorter fragments.

626:

627: \begin{table}[htbp]

628:   \begin{center}

629:     \caption{Non-interpolated average precision values, averaged over

630:       30 queries, for different methods.}

631:     \medskip

632:     \leavevmode

633:     \small

634:     \tabcolsep=4pt

635:     \begin{tabular}{lcccc} \hline\hline

636:       Speaker$\backslash$Method & 1 & 2 & 3 & 4 \\ \hline

637:       Female \#1 & -- & 0.2831 & 0.2834 & 0.3195 \\

638:       Female \#2 & -- & 0.2745 & 0.2443 & 0.2846 \\

639:       Male \#1   & -- & 0.3005 & 0.2987 & 0.3179 \\

640:       Male \#2   & -- & 0.2787 & 0.2675 & 0.2957 \\

641:       \hline

642:       Total      & 0.3486 & 0.2842 & 0.2734 & 0.3044 \\

643:       \hline

644:     \end{tabular}

645:     \label{tab:avg_pre}

646:   \end{center}

647: \end{table}

648:

649: \section{Related Work}

650: \label{sec:related_work}

651:

652: The method proposed by Kupiec~\etal~\shortcite{kupiec:arpa-hlt-94} and

653: our method are similar in the sense that both methods use target

654: collections as language models for speech recognition to realize

655: open-vocabulary speech-driven retrieval.

656:

657: Kupiec~\etaleos's method, which is based on word recognition and

658: accepts only short queries, derives multiple transcription candidates

659: (i.e., possible word combinations), and searches a target collection

660: for the most plausible word combination. However, in the case of

661: longer queries, the number of candidates increases, and thus the

662: search cost is prohibitive.  This is one reason why operational

663: speech recognition systems have to limit the vocabulary size.

664:

665: In contrast, our method, which is based on a recent {\em continuous\/}

666: speech recognition framework, can accept longer

667: sentences. Additionally, our method uses a two-stage retrieval

668: principle to limit the search space in a target collection, and

669: disambiguates only detected OOV words. Thus, the computational cost can

670: be kept small.

671:

672: \section{Conclusion}

673: \label{sec:conclusion}

674:

675: To facilitate retrieving information by spoken queries, the

676: out-of-vocabulary problem in speech recognition needs to be

677: resolved. In our proposed method, out-of-vocabulary words in a query

678: are detected by speech recognition, and completed with terms indexed

679: for text retrieval, so as to improve the recognition accuracy. In

680: addition, the completed query is used to improve the retrieval

681: accuracy. We showed the effectiveness of our method by using dictated

682: queries in the IREX collection.  Future work will include experiments

683: using larger-scale test collections in various domains.

684:

685: \bibliographystyle{acl.bst}

686:

687: \begin{thebibliography}{}

688:

689: \bibitem[\protect\citename{Bahl \bgroup et al.\egroup

690:   }1983]{bahl:ieee-tpami-1983}

691: Lalit~R. Bahl, Frederick Jelinek, and Robert~L. Mercer.

692: \newblock 1983.

693: \newblock A maximum likelihood approach to continuous speech recognition.

694: \newblock {\em IEEE Transactions on Pattern Analysis and Machine Intelligence},

695:   5(2):179--190.

696:

697: \bibitem[\protect\citename{Barnett \bgroup et al.\egroup

698:   }1997]{barnett:eurospeech-97}

699: J.~Barnett, S.~Anderson, J.~Broglio, M.~Singh, R.~Hudson, and S.~W. Kuo.

700: \newblock 1997.

701: \newblock Experiments in spoken queries for document retrieval.

702: \newblock In {\em Proceedings of Eurospeech97}, pages 1323--1326.

703:

704: \bibitem[\protect\citename{Crestani}2000]{crestani:fqas-2000}

705: Fabio Crestani.

706: \newblock 2000.

707: \newblock Word recognition errors and relevance feedback in spoken query

708:   processing.

709: \newblock In {\em Proceedings of the Fourth International Conference on

710:   Flexible Query Answering Systems}, pages 267--281.

711:

712: \bibitem[\protect\citename{Fujii \bgroup et al.\egroup

713:   }2002]{fujii:springer-2002}

714: Atsushi Fujii, Katunobu Itou, and Tetsuya Ishikawa.

715: \newblock 2002.

716: \newblock Speech-driven text retrieval: Using target {IR} collections for

717:   statistical language model adaptation in speech recognition.

718: \newblock In Anni~R. Coden, Eric~W. Brown, and Savitha Srinivasan, editors,

719:   {\em Information Retrieval Techniques for Speech Applications (LNCS 2273)},

720:   pages 94--104. Springer.

721:

722: \bibitem[\protect\citename{Garofolo \bgroup et al.\egroup

723:   }1997]{garofolo:trec-97}

724: John~S. Garofolo, Ellen~M. Voorhees, Vincent~M. Stanford, and Karen~Sp\"{a}rck

725:   Jones.

726: \newblock 1997.

727: \newblock {TREC-6} 1997 spoken document retrieval track overview and results.

728: \newblock In {\em Proceedings of the 6th Text REtrieval Conference}, pages

729:   83--91.

730:

731: \bibitem[\protect\citename{Itou \bgroup et al.\egroup }1998]{itou:98:a}

732: K.~Itou, M.~Yamamoto, K.~Takeda, T.~Takezawa, T.~Matsuoka, T.~Kobayashi,

733:   K.~Shikano, and S.~Itahashi.

734: \newblock 1998.

735: \newblock The design of the newspaper-based {Japanese} large vocabulary

736:   continuous speech recognition corpus.

737: \newblock In {\em Proceedings of the 5th International Conference on Spoken

738:   Language Processing}, pages 3261--3264.

739:

740: \bibitem[\protect\citename{Itou \bgroup et al.\egroup }1999]{itou:jas-1999}

741: Katunobu Itou, Mikio Yamamoto, Kazuya Takeda, Toshiyuki Takezawa, Tatsuo

742:   Matsuoka, Tetsunori Kobayashi, and Kiyohiro Shikano.

743: \newblock 1999.

744: \newblock {JNAS}: {Japanese} speech corpus for large vocabulary continuous

745:   speech recognition research.

746: \newblock {\em Journal of the Acoustical Society of Japan}, 20(3):199--206.

747:

748: \bibitem[\protect\citename{Itou \bgroup et al.\egroup }2001]{itou:asru-2001}

749: Katunobu Itou, Atsushi Fujii, and Tetsuya Ishikawa.

750: \newblock 2001.

751: \newblock Language modeling for multi-domain speech-driven text retrieval.

752: \newblock In {\em IEEE Automatic Speech Recognition and Understanding

753:   Workshop}.

754:

755: \bibitem[\protect\citename{Jourlin \bgroup et al.\egroup

756:   }2000]{jourlin:sc-2000}

757: Pierre Jourlin, Sue~E. Johnson, Karen~Sp\"{a}rck Jones, and Philip~C. Woodland.

758: \newblock 2000.

759: \newblock Spoken document representations for probabilistic retrieval.

760: \newblock {\em Speech Communication}, 32:21--36.

761:

762: \bibitem[\protect\citename{Kawahara \bgroup et al.\egroup

763:   }2000]{kawahara:icslp-2000}

764: T.~Kawahara, A.~Lee, T.~Kobayashi, K.~Takeda, N.~Minematsu, S.~Sagayama,

765:   K.~Itou, A.~Ito, M.~Yamamoto, A.~Yamada, T.~Utsuro, and K.~Shikano.

766: \newblock 2000.

767: \newblock Free software toolkit for {Japanese} large vocabulary continuous

768:   speech recognition.

769: \newblock In {\em Proceedings of the 6th International Conference on Spoken

770:   Language Processing}, pages 476--479.

771:

772: \bibitem[\protect\citename{Kupiec \bgroup et al.\egroup

773:   }1994]{kupiec:arpa-hlt-94}

774: Julian Kupiec, Don Kimber, and Vijay Balasubramanian.

775: \newblock 1994.

776: \newblock Speech-based retrieval using semantic co-occurrence filtering.

777: \newblock In {\em Proceedings of the ARPA Human Language Technology Workshop},

778:   pages 373--377.

779:

780: \bibitem[\protect\citename{Kwok and Chan}1998]{kwok:sigir-98}

781: K.~L. Kwok and M.~Chan.

782: \newblock 1998.

783: \newblock Improving two-stage ad-hoc retrieval for short queries.

784: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR

785:   Conference on Research and Development in Information Retrieval}, pages

786:   250--256.

787:

788: \bibitem[\protect\citename{Paul and Baker}1992]{paul:darpa-ws-1992}

789: Douglas~B. Paul and Janet~M. Baker.

790: \newblock 1992.

791: \newblock The design for the {Wall} {Street} {Journal}-based {CSR} corpus.

792: \newblock In {\em Proceedings of DARPA Speech \& Natural Language Workshop},

793:   pages 357--362.

794:

795: \bibitem[\protect\citename{Robertson and Walker}1994]{robertson:sigir-94}

796: S.~E. Robertson and S.~Walker.

797: \newblock 1994.

798: \newblock Some simple effective approximations to the 2-{P}oisson model for

799:   probabilistic weighted retrieval.

800: \newblock In {\em Proceedings of the 17th Annual International ACM SIGIR

801:   Conference on Research and Development in Information Retrieval}, pages

802:   232--241.

803:

804: \bibitem[\protect\citename{Steeneken and van

805:   Leeuwen}1995]{steeneken:eurospeech-1995}

806: Herman J.~M. Steeneken and David~A. van Leeuwen.

807: \newblock 1995.

808: \newblock Multi-lingual assessment of speaker independent large vocabulary

809:   speech-recognition systems: The {SQALE}-project.

810: \newblock In {\em Proceedings of Eurospeech95}, pages 1271--1274.

811:

812: \bibitem[\protect\citename{Wechsler \bgroup et al.\egroup

813:   }1998]{wechsler:sigir-98}

814: Martin Wechsler, Eugen Munteanu, and Peter Sch\"{a}uble.

815: \newblock 1998.

816: \newblock New techniques for open-vocabulary spoken document retrieval.

817: \newblock In {\em Proceedings of the 21st Annual International ACM SIGIR

818:   Conference on Research and Development in Information Retrieval}, pages

819:   20--27.

820:

821: \bibitem[\protect\citename{Young}1996]{young:ieee-spm-1996}

822: Steve Young.

823: \newblock 1996.

824: \newblock A review of large-vocabulary continuous-speech recognition.

825: \newblock {\em IEEE Signal Processing Magazine}, pages 45--57, September.

826:

827: \end{thebibliography}

828:

829: \end{document}

830: