0309:cs0309021/main.tex

1: \documentclass{article}

2: \usepackage{eurospeech_01,amssymb,amsmath,graphicx}

3: \setcounter{page}{1}

4: \sloppy         % better line breaks

5: \ninept

6: %SM below a registered trademark definition

7: \def\reg{{\rm\ooalign{\hfil

8:      \raise.07ex\hbox{\scriptsize R}\hfil\crcr\mathhexbox20D}}}

9:

10: \title{A Cross-media Retrieval System for Lecture Videos}

11:

12: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

13: %% If multiple authors, uncomment and edit the lines shown below.       %%

14: %% Note that each line must be emphasized {\em } by itself.             %%

15: %% (by Stephen Martucci, author of spconf.sty).                         %%

16: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

17: \makeatletter

18: \def\name#1{\gdef\@name{#1\\}}

19: \makeatother

20: % \name{{\em Firstname1 Lastname1, Firstname2 Lastname2, Firstname3 Lastname3,}\\

21: %       {\em Firstname4 Lastname4, Firstname5 Lastname5, Firstname6 Lastname6,

22: %       Firstname7 Lastname7}}

23: \name{{\em Atsushi Fujii$^{\dagger,\dagger\dagger\dagger}$, Katunobu Itou$^{\dagger\dagger,\dagger\dagger\dagger}$, Tomoyosi Akiba$^{\dagger\dagger}$, Tetsuya Ishikawa$^{\dagger}$}}

24: %%%%%%%%%%%%%%% End of required multiple authors changes %%%%%%%%%%%%%%%%%

25:

26: % \address{Department of Speech and Hearing  \\

27: % University of Voiceland, Voiceland \\

28: % {\small \tt jes@sci.voice.edu}

29: % }

30: \address{$^{\dagger}$ Institute of Library and Information Science,

31:   University of Tsukuba \\

32:   $^{\dagger\dagger}$ National Institute of Advanced Industrial Science

33:   and Technology \\

34:   $^{\dagger\dagger\dagger}$ CREST, Japan Science and Technology

35:   Corporation}

36: %

37:

38: %% (re)newcommands

39:

40: \newcommand{\etal}{et~al.}

41: \newcommand{\etaleos}{et~al}

42: \newcommand{\eq}[1]{(\ref{#1})}

43:

44: \begin{document}

45: \maketitle

46: %

47: \begin{abstract}

48:   We propose a cross-media lecture-on-demand system, in which users

49:   can selectively view specific segments of lecture videos by

50:   submitting text queries. Users can easily formulate queries by

51:   using the textbook associated with a target lecture, even if they

52:   cannot come up with effective keywords. Our system extracts the

53:   audio track from a target lecture video, generates a transcription

54:   by large vocabulary continuous speech recognition, and produces a

55:   text index. Experimental results showed that by adapting speech

56:   recognition to the topic of the lecture, the recognition accuracy

57:   increased and the retrieval accuracy was comparable with that

58:   obtained by human transcription.

59: \end{abstract}

60:

61: \section{Introduction}

62: \label{sec:introduction}

63:

64: The growing number of multimedia contents available via the World Wide

65: Web, CD-ROMs, and DVDs has made information technologies incorporating

66: speech, image, and text processing crucial. Of the various types of

67: contents, lectures (audio/video) are a typical and valuable multimedia

68: resource, in which speeches (i.e., oral presentations) are usually

69: organized based on text materials, such as resumes, slides, and

70: textbooks. In lecture videos, image information, such as flip charts,

71: is often also used. In other words, a single lecture consists of

72: different types of compatible multimedia contents.

73:

74: Because a single lecture often refers to several topics and takes a

75: long time, it is useful to obtain specific segments (passages)

76: selectively so that the audience can satisfy their information needs

77: at minimum cost. To resolve this problem, in this paper we propose a

78: lecture-on-demand system that retrieves relevant video/audio passages

79: in response to user queries. For this purpose, we utilize the benefits

80: of different media types to improve retrieval performance.

81:

82: On the one hand, text has the advantage that users can view/scan the

83: entire contents quickly and can easily identify relevant passages

84: using the layout information (e.g., text structures based on sections

85: and paragraphs). In other words, text contents can be used for

86: random-access purposes.  On the other hand, speech is used mainly for

87: sequential-access purposes.  Therefore, it is difficult to identify

88: relevant passages unless target video/audio data includes additional

89: annotation, such as indexes.  Even if the target data are indexed,

90: users are not necessarily able to provide effective queries.  To

91: resolve this problem, textbooks are desirable materials from which

92: users can extract effective keywords and phrases.  However, while

93: textbooks are usually concise, speech has a high degree of redundancy

94: and therefore is easier to understand than textbooks, especially where

95: additional image information is provided.

96:

97: In view of the above, we model our lecture-on-demand (LOD) system as

98: follows. A user selects text segments (keywords, phrases, sentences,

99: and paragraphs) that are relevant to their information needs from a

100: textbook for a target lecture.  By using selected segments, a text

101: query is generated automatically. That is, queries can be formulated

102: even if users cannot provide effective keywords.  Users can also

103: submit additional keywords as queries, if necessary.  Video passages

104: relevant to a given query are retrieved and presented to the user.  To

105: retrieve the video passages in response to text queries, we extract

106: the audio track from a lecture video, generate a transcription by

107: means of large vocabulary continuous speech recognition, and produce a

108: text index, prior to system use.  Our system is a cross-media

109: system in the sense that users can retrieve video and audio

110: information by means of text queries.

111:

112: \section{System Description}

113: \label{sec:system}

114:

115: \subsection{Overview}

116: \label{subsec:system_overview}

117:

118: Figure~\ref{fig:system} depicts the overall design of our

119: lecture-on-demand system, in which the left and right regions

120: correspond to the on-line and off-line processes, respectively.

121: Although our system is currently implemented for Japanese, our

122: methodology is fundamentally language independent.  For the purpose of

123: research and development, we tentatively target lecture programs on TV

124: for which textbooks are published. We explain the basis of our system

125: using Figure~\ref{fig:system}.

126:

127: In the off-line process, given the video data of a target lecture,

128: audio data are extracted and segmented into a number of

129: passages. Then, a speech recognition system transcribes each

130: passage. Finally, the transcribed passages are indexed as in

131: conventional text retrieval systems, so that each passage can be

132: retrieved efficiently in response to text queries.  To adapt speech

133: recognition to a specific lecturer, we perform unsupervised speaker

134: adaptation using an initial speech recognition result (i.e., a

135: transcription).  To adapt speech recognition to a specific topic, we

136: perform language model adaptation, for which we search a general

137: corpus for documents relevant to the textbook related to a target

138: lecture. Then, retrieved documents (i.e., a topic-specific corpus) are

139: used to produce a word-based N-gram language model.  We also perform

140: image analysis to extract text (e.g., keywords and phrases) from flip

141: charts. These contents are also used to improve our language model.

142:

143: In the on-line process, a user can view specific video passages by

144: submitting any text queries, i.e., keywords, phrases, sentences, and

145: paragraphs, extracted from the textbook. Any queries not in the

146: textbook can also be used.  The current implementation is based on a

147: client-server system on the Web. Both the off-line and on-line

148: processes are performed on servers, but users can access our system

149: using Web browsers on their own PCs.

150:

151: Figure~\ref{fig:lodem} depicts a prototype interface of our LOD

152: system, in which a lecture associated with ``nonlinear multivariate

153: analysis'' is given.  In this interface,  an electronic version of a

154: textbook is displayed on the left side, and a lecture video is played

155: on the right side. In addition, users can submit any text queries in

156: the input box, which is not depicted in Figure~\ref{fig:lodem}. In

157: this scenario, a text paragraph related to ``discriminant analysis''

158: was copied and pasted into the query input box, and top-ranked

159: transcribed passages for the query were listed according to the degree

160: of relevance (in the lower part of Figure~\ref{fig:lodem}). Users can

161: select (click on) transcriptions to play the corresponding video

162: passage.

163:

164: It should be noted that unlike conventional keyword-based retrieval

165: systems, in which users usually submit a small number of keywords, in

166: our system users can easily submit longer queries using textbooks.

167: Where submitted keywords are misrecognized in transcriptions, the

168: retrieval accuracy decreases.  However, longer queries are relatively

169: robust against speech recognition errors, because the effect of

170: misrecognized words is overshadowed by the large number of words

171: correctly recognized.

172:

173: \begin{figure}[htbp]

174:   \begin{center}

175:     \leavevmode

176:     \includegraphics[height=2.9in]{system.eps}

177:   \end{center}

178:   \caption{An overview of our lecture-on-demand system.}

179:   \label{fig:system}

180: \end{figure}

181:

182: \subsection{Passage Segmentation}

183: \label{subsec:passage}

184:

185: The basis of passage segmentation is to divide the entire video data

186: for a single lecture into more than one unit to be retrieved.  We call

187: these smaller units ``passages''.  For this purpose, both speech and

188: image data can provide promising clues. However, in lecture TV

189: programs, it is often the case that a lecturer sitting still is the

190: main focus and a small number of flip charts are used occasionally. In

191: such cases, image data is less informative.  Therefore, tentatively we

192: use only speech data for the passage segmentation process.  However,

193: the desirable segmentation can vary depending on the user query. Thus,

194: it is difficult to predetermine a desirable segmentation in the

195: off-line process.

196:

197: \begin{figure}[htbp]

198:   \begin{center}

199:    \leavevmode

200:     \includegraphics[height=2.3in]{lodem3.eps}

201:   \end{center}

202:    \caption{The interface of our LOD system over the Web.}

203:    \label{fig:lodem}

204: \end{figure}

205:

206: Because of the above problems, we first extract the audio track from a

207: target video and use a simple pause-based segmentation method to

208: obtain minimal speech units, such as sentences and long phrases. In

209: other words, speech units are continuous audio segments that do not

210: include pauses longer than a certain threshold. Finally, we generate

211: variable-length passages from one or more speech units. To put it more

212: precisely, we combine $N$ speech units into a single passage, with $N$

213: ranging from 1 to 5 in the current implementation.

214:
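
As a rough illustration of how many candidate passages this produces, suppose (an assumption, since the text does not state it explicitly) that each passage consists of $N$ consecutive speech units and that every start position is used. For a lecture with $M \ge 5$ speech units, the number of passages is then
\[
  \sum_{N=1}^{5} (M - N + 1) = 5M - 10 ,
\]
so that, for example, a hypothetical lecture with $M = 400$ speech units would yield 1,990 overlapping passages.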

215: \subsection{Speech Recognition}

216: \label{subsec:speech_recognition}

217:

218: The speech recognition module generates word sequence $W$, given phone

219: sequence $X$. In a stochastic framework, the task is to select the $W$

220: maximizing $P(W|X)$, which is transformed as in Equation~\eq{eq:bayes}

221: through Bayes' theorem.

222: \begin{equation}

223:   \label{eq:bayes}

224:   \arg\max_{W}P(W|X) = \arg\max_{W}P(X|W)\cdot P(W)

225: \end{equation}

226: $P(X|W)$ models the probability that the word sequence $W$ is

227: transformed into the phone sequence $X$, and $P(W)$ models the

228: probability that $W$ is linguistically acceptable. These factors are

229: called the acoustic and language models, respectively.

230:
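
For reference, a word-based $n$-gram language model of the kind used in this work approximates $P(W)$ for $W = w_{1} \cdots w_{m}$ by conditioning each word on its $n-1$ predecessors. The notation $w_{i}$, $m$, and $n$ is introduced here for illustration only:
\[
  P(W) \approx \prod_{i=1}^{m} P(w_{i} \mid w_{i-n+1}, \ldots, w_{i-1}),
\]
where $n=2$ gives a bigram model and $n=3$ a trigram model.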

231: We use the Japanese dictation

232: toolkit\footnote{http://winnie.kuis.kyoto-u.ac.jp/dictation/}, which

233: includes the Julius decoder and acoustic/language models. Julius

234: performs a two-pass (forward-backward) search using word-based forward

235: bigrams and backward trigrams.  The acoustic model was produced from

236: the ASJ speech database, which contains approximately 20,000 sentences

237: uttered by 132 speakers including both gender groups. A 16-mixture

238: Gaussian distribution triphone Hidden Markov Model, in which states

239: are clustered into 2,000 groups by a state-tying method, is used.  We

240: adapt the provided acoustic model by means of an MLLR-based

241: unsupervised speaker adaptation method, for which in practice we use

242: the HTK toolkit\footnote{http://htk.eng.cam.ac.uk/}.

243:

244: Existing methods to adapt language models can be classified into two

245: fundamental categories. In the first category---the {\em

246:   integration\/} approach---general and topic-specific corpora are

247: integrated to produce a topic-specific language

248: model~\cite{auzanne:riao-2000,seymore:eurospeech-97}. Because the

249: sizes of those corpora differ, N-gram statistics are calculated using

250: the weighted average of the statistics extracted independently from

251: those corpora. However, it is difficult to determine the optimal

252: weight depending on the topic.  In the second category---the {\em

253:   selection\/} approach---a topic-specific subset is selected from a

254: general corpus and is used to produce a language model. This approach

255: is effective when a general corpus contains documents associated with

256: target topics whose N-gram statistics would otherwise be

257: overshadowed by other documents in the resultant language model.

258:
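
One common way to realize the weighted average in the integration approach is linear interpolation. The following is a sketch under that assumption rather than the exact formulation of the cited work; $P_{G}$, $P_{T}$, $h$, and $\lambda$ are symbols introduced here:
\[
  P(w \mid h) = \lambda \, P_{T}(w \mid h) + (1-\lambda) \, P_{G}(w \mid h),
  \qquad 0 \le \lambda \le 1,
\]
where $P_{G}$ and $P_{T}$ are N-gram probabilities estimated from the general and topic-specific corpora, $h$ is the word history, and $\lambda$ is the topic-dependent weight whose optimal value is difficult to determine. The selection approach avoids $\lambda$ by estimating a single model from the selected subset.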

259: We followed the selection approach, because the 10M Web page

260: corpus~\cite{eguchi:sigir-2002} containing mainly Japanese pages

261: associated with various topics was publicly available. The quality of

262: the selection approach depends on the method of selecting

263: topic-specific subsets. An existing

264: method~\cite{chen:adaptation_ws-2001} uses hypotheses in the initial

265: speech recognition phase as queries to retrieve topic-specific

266: documents from a general corpus. However, errors in the initial

267: hypotheses have the potential to decrease the retrieval

268: accuracy. Instead, we use textbooks related to target lectures as

269: queries to improve the retrieval accuracy and consequently the quality

270: of the language model adaptation.

271:

272: \subsection{Retrieval}

273: \label{subsec:retrieval}

274:

275: Given transcribed passages and text queries, the basis of the

276: retrieval module is the same as that for text retrieval.  We use an

277: existing probabilistic text retrieval method~\cite{robertson:sigir-94}

278: to compute the relevance score between the query and each passage in

279: the database.  The relevance score for passage $p$ is computed by

280: Equation~\eq{eq:okapi}.

281: \begin{equation}

282:   \footnotesize

283:   \label{eq:okapi}

284:   \sum_{t} f_{t,q}\cdot\frac{\textstyle (K+1)\cdot

285:   f_{t,p}}{K\cdot\{(1-b)+\textstyle\frac{\textstyle

286:   dl_{p}}{\textstyle b\cdot avgdl}\} +

287:   f_{t,p}}\cdot\log\frac{\textstyle N - n_{t} + 0.5}{\textstyle

288:   n_{t} + 0.5}

289: \end{equation}

290: where $f_{t,q}$ and $f_{t,p}$ denote the frequency with which term $t$

291: appears in query $q$ and passage $p$, respectively. $N$ and $n_{t}$

292: denote the total number of passages in the database and the number of

293: passages containing term $t$, respectively. $dl_{p}$ denotes the

294: length of passage $p$, and $avgdl$ denotes the average length of

295: passages in the database. We empirically set $K=2.0$ and $b=0.8$.

296: We use content words, such as nouns, extracted from

297: transcribed passages as index terms, and perform word-based

298: indexing. We use the ChaSen morphological

299: analyzer\footnote{http://chasen.aist-nara.ac.jp/} to extract content

300: words. The same method is used to extract terms from queries.

301:
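
To illustrate how Equation~\eq{eq:okapi}, taken exactly as printed, treats term frequency, consider a hypothetical passage of average length ($dl_{p} = avgdl$) with $K=2.0$ and $b=0.8$. The term in braces is then $(1-0.8) + 1/0.8 = 1.45$, so $K$ times it is $2.9$, and the term-frequency factor becomes
\[
  \frac{(K+1) \cdot f_{t,p}}{2.9 + f_{t,p}} =
  \begin{cases}
    3/3.9 \approx 0.77 & (f_{t,p} = 1) \\
    9/5.9 \approx 1.53 & (f_{t,p} = 3)
  \end{cases}
\]
and approaches $K+1 = 3$ as $f_{t,p}$ grows. Repeated occurrences of a term therefore add progressively less to the score, while the logarithmic factor gives more weight to terms that occur in few passages.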

302: However, retrieved passages are not disjoint, because top-ranked

303: passages often overlap with one another in terms of the temporal

304: axis. It is redundant simply to list the top-ranked retrieved passages

305: as they are.  Therefore, we reorganize those overlapped passages into

306: a single passage.  The relevance score for a group (a new passage) is

307: computed by averaging the scores of all passages belonging to the

308: group. New passages are sorted according to the degree of relevance

309: and are presented to users as the final result.

310:
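
Written out, if a group $G$ consists of overlapping passages $p$ with relevance scores $s_{p}$ (notation introduced here), the score of the merged passage is
\[
  s_{G} = \frac{1}{|G|} \sum_{p \in G} s_{p} ,
\]
so that, for instance, three overlapping passages with hypothetical scores 4.0, 3.0, and 2.0 are replaced by one passage with score 3.0.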

311: \section{Experimentation}

312: \label{sec:experimentation}

313:

314: \subsection{Methodology}

315: \label{subsec:ex_method}

316:

317: To evaluate the performance of our LOD system, we produced a test

318: collection (as a benchmark data set) and performed experiments

319: partially resembling a task performed in the TREC spoken document

320: retrieval (SDR) track~\cite{garofolo:trec-97}.  Five lecture programs

321: on TV (each lecture was 45 minutes long), for which printed textbooks

322: were also published, were videotaped in DV and were used as target

323: lectures. Each lecture was manually transcribed and sentence

324: boundaries with temporal information (i.e., correct speech units) were

325: also identified manually.

326: Each paragraph in the corresponding textbook was used as a query

327: independently. For each query, a human assessor (a graduate student

328: not an author of this paper) identified one or more relevant sentences

329: in the human transcription.

330:

331: Using our test collection, we evaluated the accuracy of speech

332: recognition and passage retrieval.

333: For the five lectures, our system used the sentence boundaries in

334: human transcriptions to identify speech units, and performed speech

335: recognition. We also used human transcriptions as perfect speech

336: recognition results and investigated the extent to which speech

337: recognition errors affect the retrieval accuracy.  Our system

338: retrieved top-ranked passages in response to each query.  Note that

339: the passages here are those grouped based on the temporal axis, which

340: should not be confused with those obtained from the passage

341: segmentation method.

342:

343: \subsection{Results}

344: \label{subsec:results}

345:

346: \begin{table*}[htbp]

347:   \begin{center}

348:      \caption{Experimental results for speech recognition and passage

349:        retrieval.}

350:     \medskip

351:     \leavevmode

352:     \footnotesize

353:     \begin{tabular}{llccccccccccccccc} \hline\hline

354:       ID &

355:       & \multicolumn{3}{c}{\#1}

356:       & \multicolumn{3}{c}{\#2}

357:       & \multicolumn{3}{c}{\#3}

358:       & \multicolumn{3}{c}{\#4}

359:       & \multicolumn{3}{c}{\#5}

360:       \\

361:       \hline

362:       Topic & &

363:       \multicolumn{3}{c}{Criminal law} &

364:       \multicolumn{3}{c}{Greek history} &

365:       \multicolumn{3}{c}{Domestic relations} &

366:       \multicolumn{3}{c}{Food and body} &

367:       \multicolumn{3}{c}{Solar system}

368:       \\

369:       \hline

370:       &

371:       & HUM &

372:       {\hfill\centering ASR\hfill} &

373:       {\hfill\centering +LA\hfill}

374:       & HUM &

375:       {\hfill\centering ASR\hfill} &

376:       {\hfill\centering +LA\hfill}

377:       & HUM &

378:       {\hfill\centering ASR\hfill} &

379:       {\hfill\centering +LA\hfill}

380:       & HUM &

381:       {\hfill\centering ASR\hfill} &

382:       {\hfill\centering +LA\hfill}

383:       & HUM &

384:       {\hfill\centering ASR\hfill} &

385:       {\hfill\centering +LA\hfill}

386:       \\

387:       \hline

388:       \multicolumn{2}{l}{OOV}

389:       & {\hfill\centering ---\hfill} & .044 & .020

390:       & {\hfill\centering ---\hfill} & .073 & .082

391:       & {\hfill\centering ---\hfill} & .039 & .049

392:       & {\hfill\centering ---\hfill} & .053 & .041

393:       & {\hfill\centering ---\hfill} & .051 & .053

394:       \\

395:       \multicolumn{2}{l}{PP}

396:       & {\hfill\centering ---\hfill} & 48.9 & 43.2

397:       & {\hfill\centering ---\hfill} & 122 & 96.7

398:       & {\hfill\centering ---\hfill} & 136 & 132

399:       & {\hfill\centering ---\hfill} & 89.3 & 108

400:       & {\hfill\centering ---\hfill} & 163 & 130

401:       \\

402:       \multicolumn{2}{l}{WER}

403:       & {\hfill\centering ---\hfill} & .209 & .133

404:       & {\hfill\centering ---\hfill} & .516 & .423

405:       & {\hfill\centering ---\hfill} & .604 & .543

406:       & {\hfill\centering ---\hfill} & .488 & .416

407:       & {\hfill\centering ---\hfill} & .637 & .482

408:       \\

409:       \hline

410:         & R & .695 & .726 &.732 & .449 & .258 & .551

411:         & .632 & .291 & .505 & .451 & .220 & .357 & .296 & .138 & .241 \\

412:         $N$=1 & P & .534 & .548 & .519 & .377 & .319 & .386

413:         & .479 & .362 & .464 & .414 & .277 & .337 & .529 & .358 & .436 \\

414:         & F & .604 & .624 & .607 & .410 & .286 & .454

415:         & .545 & .322 & .484 & .432 & .245 & .347 & .379 & .200 & .311 \\

416:       \hline

417:         & R & .847 & .858 & .832 & .663 & .360 & .674

418:         & .791 & .464 & .677 & .655 & .380 & .463 & .482 & .228 & .421 \\

419:         $N$=2 & P & .441 & .448 & .458 & .301 & .211 & .314

420:         & .372 & .273 & .353 & .321 & .247 & .239 & .462 & .332 & .409 \\

421:         & F & .580 & .588 & .591 & .414 & .266 & .429

422:         & .506 & .343 & .464 & .431 & .300 & .316 & .472 & .270 & .415 \\

423:       \hline

424:         & R & .879 & .868 & .874 & .764 & .438 & .708

425:         & .827 & .495 & .718 & .718 & .392 & .604 & .637 & .289 & .527 \\

426:         $N$=3 & P & .410 & .405 & .401 & .269 & .163 & .252

427:         & .363 & .215 & .318 & .297 & .188 & .235 & .466 & .280 & .385 \\

428:         & F & .560 & .553 & .550 & .398 & .237 & .372

429:         & .505 & .300 & .441 & .420 & .254 & .338 & .538 & .285 & .445 \\

430:       \hline

431:     \end{tabular}

432:     \label{tab:results}

433:   \end{center}

434: \end{table*}

435:

436: To evaluate the accuracy of speech recognition, we used the word error

437: rate (WER), which is the ratio of the number of word errors (deletion,

438: insertion, and substitution) to the total number of words. We also

439: used test-set out-of-vocabulary rate (OOV) and trigram test-set

440: perplexity (PP) to evaluate the extent to which our language model

441: adapted to the target topics.  We used human transcriptions as test

442: set data.  For example, OOV is the ratio of the number of word tokens

443: not contained in the language model for speech recognition to the

444: total number of word tokens in the transcription. Note that smaller

445: values of OOV, PP, and WER are obtained with better methods.

446:
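
These measures can be written as follows, using the standard definitions. The symbols are introduced here, and details such as the treatment of out-of-vocabulary tokens in PP are not specified in the text. With $S$, $D$, and $I$ the numbers of substituted, deleted, and inserted words, $n$ the number of word tokens $w_{1} \cdots w_{n}$ in the human transcription, and $V$ the vocabulary of the language model,
\begin{align*}
  \mathrm{WER} &= \frac{S + D + I}{n}, \\
  \mathrm{OOV} &= \frac{|\{\, i : w_{i} \notin V \,\}|}{n}, \\
  \mathrm{PP}  &= \exp\Bigl( -\frac{1}{n} \sum_{i=1}^{n}
                  \ln P(w_{i} \mid w_{i-2}, w_{i-1}) \Bigr).
\end{align*}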

447: The final outputs (i.e., retrieved passages) were evaluated based on

448: recall and precision, averaged over all queries. Recall (R) is the

449: ratio of the number of correct speech units retrieved by our system to

450: the total number of correct speech units for the query in question.

451: Precision (P) is the ratio of the number of correct speech units

452: retrieved by our system to the total number of speech units retrieved

453: by our system. To summarize recall and precision into a single

454: measure, we used the F-measure (F).

455:
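
Assuming the usual balanced F-measure, which the text does not spell out but which is consistent with the values in Table~\ref{tab:results}, F is the harmonic mean of recall and precision:
\[
  F = \frac{2 \cdot R \cdot P}{R + P} .
\]
For example, for lecture~\#1 with human transcription and $N$=1, $R = .695$ and $P = .534$ give $F = 2 \cdot .695 \cdot .534 / 1.229 \approx .604$, matching the table.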

456: Table~\ref{tab:results} shows the accuracy of speech recognition (WER)

457: and passage retrieval (R, P, and F), for each lecture. In this table,

458: the columns ``HUM'' and ``ASR'' correspond to the results obtained

459: with human transcriptions and automatic speech recognition,

460: respectively. The column ``+LA'' denotes results for ASR combined with

461: language model adaptation.  The row ``Topic'' denotes the topics of

462: the five lectures.

463:

464: To adapt language models, we used the textbook corresponding to a

465: target lecture and searched the 10M Web page corpus for 2,000 relevant

466: pages, which were used as a source corpus. In the case where the

467: language model adaptation was not performed, all 10M Web pages were

468: used as a source corpus. In either case, 20,000 high frequency words

469: were selected from a source corpus to produce a word-based trigram

470: language model. We used the ChaSen morphological analyzer to extract

471: words (morphemes) from the source corpora, because Japanese sentences

472: lack lexical segmentation.

473:

474: In passage retrieval, we regarded the top $N$ passages as the final

475: outputs. In Table~\ref{tab:results}, the value of $N$ ranges from 1 to

476: 3. As the value of $N$ increases, recall improves, potentially at

477: the expense of precision.

478:

479: \subsection{Discussion}

480: \label{subsec:discussion}

481:

482: Comparing the results of ASR and +LA in Table~\ref{tab:results} shows

483: that in some cases OOV and PP increased when the language models were adapted.

484: However, WER decreased when the language models were adapted to target topics,

485: irrespective of the lecture.

486:

487: The values of OOV, PP, and WER for lecture~\#1 were generally smaller

488: than those for the other lectures. One possible reason is that the

489: lecturer of \#1 spoke more fluently and made fewer erroneous

490: utterances than the other lecturers.

491:

492: For lectures~\#2--5, adapting language models increased recall, precision, and F-measure

493: irrespective of the number of passages retrieved, with one exception: precision for lecture~\#4 at $N$=2 decreased slightly (.247 to .239).

494: For lecture~\#1, the retrieval accuracy did not significantly differ

495: whether or not we adapted the language model to the topic. One

496: possible reason is that the WER of lecture~\#1 without language model

497: adaptation (20.9\%) was sufficiently small to obtain a retrieval

498: accuracy comparable with that of text retrieval~\cite{jourlin:sc-2000}.

499: The difference between HUM and ASR was marginal in terms of the

500: retrieval accuracy. Therefore, the effect of the language model

501: adaptation method was overshadowed in passage retrieval.

502:

503: The retrieval accuracy for lecture~\#1 was higher than that for the

504: other lectures. Compared with the other lectures, lecture~\#1 was

505: organized more closely around its textbook. This suggests

506: that the performance of our LOD system depends on the

507: organization of target lectures.

508:

509: Surprisingly, for lectures~\#1 and \#2 with $N \le 2$, the F-measure of

510: +LA was higher than that of HUM, and for lecture~\#2 recall and precision were also higher. In these cases, the

511: automatic transcription was more effective than human transcription

512: for passage retrieval purposes.  One possible reason is the existence

513: of Japanese variants (i.e., more than one spelling form corresponding

514: to the same word), such as ``{\it girisha\/}/{\it

515:   girishia\/}~(Greece)''. Because the language model was adapted by

516: means of the textbook for a target lecture, the spelling in automatic

517: transcriptions systematically resembled that in the queries extracted

518: from the textbooks. In contrast, it is difficult to standardize the

519: spelling in human transcriptions. Therefore, relevant passages in

520: automatic transcriptions were more likely to be retrieved than

521: passages in the human transcriptions.

522:

523: We conclude that our language model adaptation method was effective

524: for both speech recognition and passage retrieval.

525:

526: \section{Conclusion}

527: \label{sec:conclusion}

528:

529: Reflecting the rapid growth in the use of multimedia contents,

530: information technologies appropriate to speech, image, and text

531: processing are crucial. Of the various content types, in this paper we

532: focused on the video data of lectures with their organization based on

533: textbooks, and proposed a system for cross-media on-demand lectures,

534: in which users can formulate text queries using the textbook for a

535: target lecture to retrieve specific video passages.

536:

537: To retrieve video passages in response to text queries, we extract the

538: audio track from a lecture video, generate a transcription by large

539: vocabulary continuous speech recognition, and produce a text index,

540: prior to system use.

541:

542: We evaluated the performance of our system experimentally, for which

543: five TV lecture programs on various topics were used. The experimental

544: results showed that the accuracy of speech recognition varied

545: depending on the topic and presentation style of the

546: lecturers. However, the accuracy of speech recognition and passage

547: retrieval was improved by adapting language models to the topic of the

548: target lecture. Even if the word error rate was approximately 40\%,

549: the accuracy of retrieval was comparable with that obtained by human

550: transcription.

551:

552: \bibliographystyle{IEEEtran}

553: \begin{thebibliography}{1}

554: \providecommand{\url}[1]{#1}

555: \def\UrlFont{\rmfamily}

556: \providecommand{\newblock}{\relax}

557: \providecommand{\bibinfo}[2]{#2}

558: \providecommand\BIBentrySTDinterwordspacing{\spaceskip=0pt\relax}

559: \providecommand\BIBentryALTinterwordstretchfactor{4}

560: \providecommand\BIBentryALTinterwordspacing{\spaceskip=\fontdimen2\font plus

561: \BIBentryALTinterwordstretchfactor\fontdimen3\font minus

562:   \fontdimen4\font\relax}

563: \providecommand\BIBforeignlanguage[2]{{%

564: \expandafter\ifx\csname l@#1\endcsname\relax

565: \typeout{** WARNING: IEEEtran.bst: No hyphenation pattern has been}%

566: \typeout{** loaded for the language `#1'. Using the pattern for}%

567: \typeout{** the default language instead.}%

568: \else

569: \language=\csname l@#1\endcsname

570: \fi

571: #2}}

572:

573: \bibitem{auzanne:riao-2000}

574: C.~Auzanne, J.~S. Garofolo, J.~G. Fiscus, and W.~M. Fisher, ``Automatic

575:   language model adaptation for spoken document retrieval,'' in

576:   \emph{Proceedings of RIAO 2000 Conference on Content-Based Multimedia

577:   Information Access}, 2000.

578:

579: \bibitem{seymore:eurospeech-97}

580: K.~Seymore and R.~Rosenfeld, ``Using story topics for language model

581:   adaptation,'' in \emph{Proceedings of Eurospeech97}, 1997, pp. 1987--1990.

582:

583: \bibitem{eguchi:sigir-2002}

584: K.~Eguchi, K.~Oyama, K.~Kuriyama, and N.~Kando, ``The {Web} retrieval task and

585:   its evaluation in the third {NTCIR} workshop,'' in \emph{Proceedings of the

586:   25th Annual International ACM SIGIR Conference on Research and Development in

587:   Information Retrieval}, 2002, pp. 375--376.

588:

589: \bibitem{chen:adaptation_ws-2001}

590: L.~Chen, J.-L. Gauvain, L.~Lamel, G.~Adda, and M.~Adda, ``Language model

591:   adaptation for broadcast news transcription,'' in \emph{Proceedings of ISCA

592:   Workshop on Adaptation Methods For Speech Recognition}, 2001.

593:

594: \bibitem{robertson:sigir-94}

595: S.~Robertson and S.~Walker, ``Some simple effective approximations to the

596:   2-poisson model for probabilistic weighted retrieval,'' in \emph{Proceedings

597:   of the 17th Annual International ACM SIGIR Conference on Research and

598:   Development in Information Retrieval}, 1994, pp. 232--241.

599:

600: \bibitem{garofolo:trec-97}

601: J.~S. Garofolo, E.~M. Voorhees, V.~M. Stanford, and K.~S. Jones, ``{TREC-6}

602:   1997 spoken document retrieval track overview and results,'' in

603:   \emph{Proceedings of the 6th Text REtrieval Conference}, 1997, pp. 83--91.

604:

605: \bibitem{jourlin:sc-2000}

606: P.~Jourlin, S.~E. Johnson, K.~S. Jones, and P.~C. Woodland, ``Spoken document

607:   representations for probabilistic retrieval,'' \emph{Speech Communication},

608:   vol.~32, pp. 21--36, 2000.

609:

610: \end{thebibliography}

611:

612: \end{document}

613: