1: \documentclass{article}
2: \usepackage{eurospeech_01,amssymb,amsmath,graphicx}
3: \setcounter{page}{1}
4: \sloppy % better line breaks
5: \ninept
6: %SM below a registered trademark definition
7: \def\reg{{\rm\ooalign{\hfil
8: \raise.07ex\hbox{\scriptsize R}\hfil\crcr\mathhexbox20D}}}
9:
10: \title{A Cross-media Retrieval System for Lecture Videos}
11:
12: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
13: %% If multiple authors, uncomment and edit the lines shown below. %%
14: %% Note that each line must be emphasized {\em } by itself. %%
15: %% (by Stephen Martucci, author of spconf.sty). %%
16: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
17: \makeatletter
18: \def\name#1{\gdef\@name{#1\\}}
19: \makeatother
23: \name{{\em Atsushi Fujii$^{\dagger,\dagger\dagger\dagger}$, Katunobu Itou$^{\dagger\dagger,\dagger\dagger\dagger}$, Tomoyosi Akiba$^{\dagger\dagger}$, Tetsuya Ishikawa$^{\dagger}$}}
24: %%%%%%%%%%%%%%% End of required multiple authors changes %%%%%%%%%%%%%%%%%
25:
30: \address{$^{\dagger}$ Institute of Library and Information Science,
31: University of Tsukuba \\
32: $^{\dagger\dagger}$ National Institute of Advanced Industrial Science
33: and Technology \\
34: $^{\dagger\dagger\dagger}$ CREST, Japan Science and Technology
35: Corporation}
36: %
37:
38: %% (re)newcommands
39:
40: \newcommand{\etal}{et~al.}
41: \newcommand{\etaleos}{et~al}
42: \newcommand{\eq}[1]{(\ref{#1})}
43:
44: \begin{document}
45: \maketitle
46: %
47: \begin{abstract}
48: We propose a cross-media lecture-on-demand system, in which users
49: can selectively view specific segments of lecture videos by
50: submitting text queries. Users can easily formulate queries by
51: using the textbook associated with a target lecture, even if they
52: cannot come up with effective keywords. Our system extracts the
53: audio track from a target lecture video, generates a transcription
54: by large vocabulary continuous speech recognition, and produces a
55: text index. Experimental results showed that by adapting speech
56: recognition to the topic of the lecture, the recognition accuracy
57: increased and the retrieval accuracy was comparable with that
58: obtained by human transcription.
59: \end{abstract}
60:
61: \section{Introduction}
62: \label{sec:introduction}
63:
64: The growing number of multimedia contents available via the World Wide
65: Web, CD-ROMs, and DVDs has made information technologies incorporating
66: speech, image, and text processing crucial. Of the various types of
contents, lectures (audio/video) are a typical and valuable multimedia
68: resource, in which speeches (i.e., oral presentations) are usually
69: organized based on text materials, such as resumes, slides, and
70: textbooks. In lecture videos, image information, such as flip charts,
71: is often also used. In other words, a single lecture consists of
72: different types of compatible multimedia contents.
73:
74: Because a single lecture often refers to several topics and takes a
75: long time, it is useful to obtain specific segments (passages)
76: selectively so that the audience can satisfy their information needs
77: at minimum cost. To resolve this problem, in this paper we propose a
78: lecture-on-demand system that retrieves relevant video/audio passages
79: in response to user queries. For this purpose, we utilize the benefits
80: of different media types to improve retrieval performance.
81:
82: On the one hand, text has the advantage that users can view/scan the
83: entire contents quickly and can easily identify relevant passages
84: using the layout information (e.g., text structures based on sections
85: and paragraphs). In other words, text contents can be used for
86: random-access purposes. On the other hand, speech is used mainly for
87: sequential-access purposes. Therefore, it is difficult to identify
88: relevant passages unless target video/audio data includes additional
89: annotation, such as indexes. Even if the target data are indexed,
90: users are not necessarily able to provide effective queries. To
91: resolve this problem, textbooks are desirable materials from which
92: users can extract effective keywords and phrases. However, while
93: textbooks are usually concise, speech has a high degree of redundancy
94: and therefore is easier to understand than textbooks, especially where
95: additional image information is provided.
96:
97: In view of the above, we model our lecture-on-demand (LOD) system as
98: follows. A user selects text segments (keywords, phrases, sentences,
99: and paragraphs) that are relevant to their information needs from a
100: textbook for a target lecture. By using selected segments, a text
101: query is generated automatically. That is, queries can be formulated
102: even if users cannot provide effective keywords. Users can also
103: submit additional keywords as queries, if necessary. Video passages
104: relevant to a given query are retrieved and presented to the user. To
105: retrieve the video passages in response to text queries, we extract
106: the audio track from a lecture video, generate a transcription by
107: means of large vocabulary continuous speech recognition, and produce a
108: text index, prior to system use. Our system is a cross-media
109: system in the sense that users can retrieve video and audio
110: information by means of text queries.
111:
112: \section{System Description}
113: \label{sec:system}
114:
115: \subsection{Overview}
116: \label{subsec:system_overview}
117:
118: Figure~\ref{fig:system} depicts the overall design of our
119: lecture-on-demand system, in which the left and right regions
120: correspond to the on-line and off-line processes, respectively.
121: Although our system is currently implemented for Japanese, our
122: methodology is fundamentally language independent. For the purpose of
123: research and development, we tentatively target lecture programs on TV
124: for which textbooks are published. We explain the basis of our system
125: using Figure~\ref{fig:system}.
126:
127: In the off-line process, given the video data of a target lecture,
128: audio data are extracted and segmented into a number of
129: passages. Then, a speech recognition system transcribes each
130: passage. Finally, the transcribed passages are indexed as in
131: conventional text retrieval systems, so that each passage can be
132: retrieved efficiently in response to text queries. To adapt speech
133: recognition to a specific lecturer, we perform unsupervised speaker
134: adaptation using an initial speech recognition result (i.e., a
135: transcription). To adapt speech recognition to a specific topic, we
136: perform language model adaptation, for which we search a general
137: corpus for documents relevant to the textbook related to a target
138: lecture. Then, retrieved documents (i.e., a topic-specific corpus) are
139: used to produce a word-based N-gram language model. We also perform
140: image analysis to extract text (e.g., keywords and phrases) from flip
141: charts. These contents are also used to improve our language model.
142:
143: In the on-line process, a user can view specific video passages by
144: submitting any text queries, i.e., keywords, phrases, sentences, and
145: paragraphs, extracted from the textbook. Any queries not in the
146: textbook can also be used. The current implementation is based on a
147: client-server system on the Web. Both the off-line and on-line
148: processes are performed on servers, but users can access our system
149: using Web browsers on their own PCs.
150:
151: Figure~\ref{fig:lodem} depicts a prototype interface of our LOD
152: system, in which a lecture associated with ``nonlinear multivariate
153: analysis'' is given. In this interface, an electronic version of a
154: textbook is displayed on the left side, and a lecture video is played
155: on the right side. In addition, users can submit any text queries in
156: the input box, which is not depicted in Figure~\ref{fig:lodem}. In
157: this scenario, a text paragraph related to ``discriminant analysis''
158: was copied and pasted into the query input box, and top-ranked
159: transcribed passages for the query were listed according to the degree
160: of relevance (in the lower part of Figure~\ref{fig:lodem}). Users can
161: select (click on) transcriptions to play the corresponding video
162: passage.
163:
164: It should be noted that unlike conventional keyword-based retrieval
165: systems, in which users usually submit a small number of keywords, in
166: our system users can easily submit longer queries using textbooks.
Where submitted keywords are misrecognized in transcriptions, the
retrieval accuracy decreases. However, longer queries are relatively
robust against speech recognition errors, because the effect of
misrecognized words is overshadowed by the large number of correctly
recognized words.
172:
173: \begin{figure}[htbp]
174: \begin{center}
175: \leavevmode
176: \includegraphics[height=2.9in]{system.eps}
177: \end{center}
178: \caption{An overview of our lecture-on-demand system.}
179: \label{fig:system}
180: \end{figure}
181:
182: \subsection{Passage Segmentation}
183: \label{subsec:passage}
184:
185: The basis of passage segmentation is to divide the entire video data
186: for a single lecture into more than one unit to be retrieved. We call
187: these smaller units ``passages''. For this purpose, both speech and
188: image data can provide promising clues. However, in lecture TV
189: programs, it is often the case that a lecturer sitting still is the
190: main focus and a small number of flip charts are used occasionally. In
such cases, image data are less informative. Therefore, we tentatively
192: use only speech data for the passage segmentation process. However,
193: segmentation can potentially vary depending on the user query. Thus,
194: it is difficult to predetermine a desirable segmentation in the
195: off-line process.
196:
197: \begin{figure}[htbp]
198: \begin{center}
199: \leavevmode
200: \includegraphics[height=2.3in]{lodem3.eps}
201: \end{center}
202: \caption{The interface of our LOD system over the Web.}
203: \label{fig:lodem}
204: \end{figure}
205:
206: Because of the above problems, we first extract the audio track from a
207: target video and use a simple pause-based segmentation method to
208: obtain minimal speech units, such as sentences and long phrases. In
209: other words, speech units are continuous audio segments that do not
210: include pauses longer than a certain threshold. Finally, we generate
211: variable-length passages from one or more speech units. To put it more
212: precisely, we combine $N$ speech units into a single passage, with $N$
213: ranging from 1 to 5 in the current implementation.
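
As a rough illustration of the number of passages this produces,
assume, as a simplification not stated above, that each passage
consists of $N$ consecutive speech units. Writing
$u_{1},\ldots,u_{M}$ for the speech units of a lecture, a passage
starting at unit $i$ is $p_{i,N}=(u_{i},\ldots,u_{i+N-1})$, and for
$M\geq 5$ the number of candidate passages is
\[
\sum_{N=1}^{5}(M-N+1) = 5M-10 .
\]
For example, a lecture with $M=300$ speech units would yield 1,490
overlapping candidate passages. The symbols $u_{i}$, $p_{i,N}$, and
$M$ are introduced here only for illustration.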
214:
215: \subsection{Speech Recognition}
216: \label{subsec:speech_recognition}
217:
218: The speech recognition module generates word sequence $W$, given phone
219: sequence $X$. In a stochastic framework, the task is to select the $W$
220: maximizing $P(W|X)$, which is transformed as in Equation~\eq{eq:bayes}
by Bayes' theorem.
222: \begin{equation}
223: \label{eq:bayes}
224: \arg\max_{W}P(W|X) = \arg\max_{W}P(X|W)\cdot P(W)
225: \end{equation}
226: $P(X|W)$ models the probability that the word sequence $W$ is
227: transformed into the phone sequence $X$, and $P(W)$ models the
228: probability that $W$ is linguistically acceptable. These factors are
229: called the acoustic and language models, respectively.
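
The step behind Equation~\eq{eq:bayes} can be written out as follows.
By Bayes' theorem,
\[
P(W|X) = \frac{P(X|W)\cdot P(W)}{P(X)} ,
\]
and because $P(X)$ does not depend on $W$, it can be dropped from the
maximization without changing the selected word sequence.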
230:
231: We use the Japanese dictation
232: toolkit\footnote{http://winnie.kuis.kyoto-u.ac.jp/dictation/}, which
233: includes the Julius decoder and acoustic/language models. Julius
234: performs a two-pass (forward-backward) search using word-based forward
235: bigrams and backward trigrams. The acoustic model was produced from
236: the ASJ speech database, which contains approximately 20,000 sentences
uttered by 132 speakers of both genders. A 16-mixture
238: Gaussian distribution triphone Hidden Markov Model, in which states
239: are clustered into 2,000 groups by a state-tying method, is used. We
240: adapt the provided acoustic model by means of an MLLR-based
241: unsupervised speaker adaptation method, for which in practice we use
242: the HTK toolkit\footnote{http://htk.eng.cam.ac.uk/}.
243:
244: Existing methods to adapt language models can be classified into two
245: fundamental categories. In the first category---the {\em
246: integration\/} approach---general and topic-specific corpora are
247: integrated to produce a topic-specific language
248: model~\cite{auzanne:riao-2000,seymore:eurospeech-97}. Because the
249: sizes of those corpora differ, N-gram statistics are calculated using
250: the weighted average of the statistics extracted independently from
251: those corpora. However, it is difficult to determine the optimal
252: weight depending on the topic. In the second category---the {\em
253: selection\/} approach---a topic-specific subset is selected from a
general corpus and is used to produce a language model. This approach
is effective if the general corpus contains documents associated with
the target topic, because the N-gram statistics of those documents would
otherwise be overshadowed by the other documents when a language model
is produced from the entire corpus.
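
One common way to realize the weighted average used in the integration
approach is linear interpolation; the following is a sketch under that
assumption rather than the exact formulation of the cited methods. Let
$P_{G}(w|h)$ and $P_{T}(w|h)$ denote the probability of word $w$ given
history $h$ estimated from the general and topic-specific corpora,
respectively, and let $\lambda$ ($0\leq\lambda\leq 1$) be the weight:
\[
P(w|h) = \lambda\cdot P_{T}(w|h) + (1-\lambda)\cdot P_{G}(w|h)
\]
The symbols $P_{G}$, $P_{T}$, and $\lambda$ are introduced here for
illustration. The difficulty of the integration approach corresponds
to choosing $\lambda$ for each topic.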
258:
259: We followed the selection approach, because the 10M Web page
260: corpus~\cite{eguchi:sigir-2002} containing mainly Japanese pages
261: associated with various topics was publicly available. The quality of
262: the selection approach depends on the method of selecting
263: topic-specific subsets. An existing
264: method~\cite{chen:adaptation_ws-2001} uses hypotheses in the initial
265: speech recognition phase as queries to retrieve topic-specific
266: documents from a general corpus. However, errors in the initial
267: hypotheses have the potential to decrease the retrieval
268: accuracy. Instead, we use textbooks related to target lectures as
269: queries to improve the retrieval accuracy and consequently the quality
270: of the language model adaptation.
271:
272: \subsection{Retrieval}
273: \label{subsec:retrieval}
274:
275: Given transcribed passages and text queries, the basis of the
276: retrieval module is the same as that for text retrieval. We use an
277: existing probabilistic text retrieval method~\cite{robertson:sigir-94}
278: to compute the relevance score between the query and each passage in
279: the database. The relevance score for passage $p$ is computed by
280: Equation~\eq{eq:okapi}.
281: \begin{equation}
282: \footnotesize
283: \label{eq:okapi}
284: \sum_{t} f_{t,q}\cdot\frac{\textstyle (K+1)\cdot
285: f_{t,p}}{K\cdot\{(1-b)+\textstyle\frac{\textstyle
286: dl_{p}}{\textstyle b\cdot avgdl}\} +
287: f_{t,p}}\cdot\log\frac{\textstyle N - n_{t} + 0.5}{\textstyle
288: n_{t} + 0.5}
289: \end{equation}
290: where $f_{t,q}$ and $f_{t,p}$ denote the frequency with which term $t$
291: appears in query $q$ and passage $p$, respectively. $N$ and $n_{t}$
292: denote the total number of passages in the database and the number of
293: passages containing term $t$, respectively. $dl_{p}$ denotes the
294: length of passage $p$, and $avgdl$ denotes the average length of
passages in the database. We empirically set $K=2.0$ and $b=0.8$.
We use content words, such as nouns, extracted from
297: transcribed passages as index terms, and perform word-based
298: indexing. We use the ChaSen morphological
299: analyzer\footnote{http://chasen.aist-nara.ac.jp/} to extract content
300: words. The same method is used to extract terms from queries.
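
To illustrate how the parameters in Equation~\eq{eq:okapi} act,
consider a hypothetical passage of average length ($dl_{p}=avgdl$) in
which term $t$ appears once ($f_{t,p}=1$). With $K=2.0$ and $b=0.8$,
the term-frequency factor of Equation~\eq{eq:okapi} as written
evaluates to
\[
\frac{(2.0+1)\cdot 1}{2.0\cdot\{(1-0.8)+\frac{1}{0.8}\}+1}
= \frac{3.0}{3.9} \approx 0.77 .
\]
For a passage twice as long ($dl_{p}=2\cdot avgdl$), the same factor
is $3.0/(2.0\cdot 2.7+1)\approx 0.47$, so longer passages need more
occurrences of a term to obtain the same weight. These values are
illustrative and are not taken from the experiments. For comparison,
the Okapi BM25 weighting is commonly written with the length
normalization $(1-b)+b\cdot dl_{p}/avgdl$, under which the first
example would evaluate to $3.0/3.0=1$.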
301:
However, retrieved passages are not disjoint, because top-ranked
passages often overlap with one another on the temporal axis. Simply
listing the top-ranked retrieved passages as they are would be
redundant. Therefore, we merge each group of overlapping passages into
a single passage. The relevance score for a group (a new passage) is
307: computed by averaging the scores of all passages belonging to the
308: group. New passages are sorted according to the degree of relevance
309: and are presented to users as the final result.
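
Written as a formula, with $s(p)$ denoting the relevance score of
passage $p$ computed by Equation~\eq{eq:okapi} and $G$ a group of
overlapping passages (both symbols introduced here for illustration),
the score of the merged passage is
\[
s(G) = \frac{1}{|G|}\sum_{p\in G} s(p) .
\]
For instance, if three overlapping passages had scores of 4.0, 3.0,
and 2.0 (hypothetical values), the merged passage would receive a
score of 3.0.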
310:
311: \section{Experimentation}
312: \label{sec:experimentation}
313:
314: \subsection{Methodology}
315: \label{subsec:ex_method}
316:
317: To evaluate the performance of our LOD system, we produced a test
318: collection (as a benchmark data set) and performed experiments
319: partially resembling a task performed in the TREC spoken document
320: retrieval (SDR) track~\cite{garofolo:trec-97}. Five lecture programs
321: on TV (each lecture was 45 minutes long), for which printed textbooks
322: were also published, were videotaped in DV and were used as target
323: lectures. Each lecture was manually transcribed and sentence
324: boundaries with temporal information (i.e., correct speech units) were
325: also identified manually.
326: Each paragraph in the corresponding textbook was used as a query
327: independently. For each query, a human assessor (a graduate student
328: not an author of this paper) identified one or more relevant sentences
329: in the human transcription.
330:
331: Using our test collection, we evaluated the accuracy of speech
332: recognition and passage retrieval.
333: For the five lectures, our system used the sentence boundaries in
334: human transcriptions to identify speech units, and performed speech
335: recognition. We also used human transcriptions as perfect speech
336: recognition results and investigated the extent to which speech
337: recognition errors affect the retrieval accuracy. Our system
338: retrieved top-ranked passages in response to each query. Note that
339: the passages here are those grouped based on the temporal axis, which
340: should not be confused with those obtained from the passage
341: segmentation method.
342:
343: \subsection{Results}
344: \label{subsec:results}
345:
346: \begin{table*}[htbp]
347: \begin{center}
348: \caption{Experimental results for speech recognition and passage
349: retrieval.}
350: \medskip
351: \leavevmode
352: \footnotesize
353: \begin{tabular}{llccccccccccccccc} \hline\hline
354: ID &
355: & \multicolumn{3}{c}{\#1}
356: & \multicolumn{3}{c}{\#2}
357: & \multicolumn{3}{c}{\#3}
358: & \multicolumn{3}{c}{\#4}
359: & \multicolumn{3}{c}{\#5}
360: \\
361: \hline
362: Topic & &
363: \multicolumn{3}{c}{Criminal law} &
364: \multicolumn{3}{c}{Greek history} &
365: \multicolumn{3}{c}{Domestic relations} &
366: \multicolumn{3}{c}{Food and body} &
367: \multicolumn{3}{c}{Solar system}
368: \\
369: \hline
370: &
371: & HUM &
372: {\hfill\centering ASR\hfill} &
373: {\hfill\centering +LA\hfill}
374: & HUM &
375: {\hfill\centering ASR\hfill} &
376: {\hfill\centering +LA\hfill}
377: & HUM &
378: {\hfill\centering ASR\hfill} &
379: {\hfill\centering +LA\hfill}
380: & HUM &
381: {\hfill\centering ASR\hfill} &
382: {\hfill\centering +LA\hfill}
383: & HUM &
384: {\hfill\centering ASR\hfill} &
385: {\hfill\centering +LA\hfill}
386: \\
387: \hline
388: \multicolumn{2}{l}{OOV}
389: & {\hfill\centering ---\hfill} & .044 & .020
390: & {\hfill\centering ---\hfill} & .073 & .082
391: & {\hfill\centering ---\hfill} & .039 & .049
392: & {\hfill\centering ---\hfill} & .053 & .041
393: & {\hfill\centering ---\hfill} & .051 & .053
394: \\
395: \multicolumn{2}{l}{PP}
396: & {\hfill\centering ---\hfill} & 48.9 & 43.2
397: & {\hfill\centering ---\hfill} & 122 & 96.7
398: & {\hfill\centering ---\hfill} & 136 & 132
399: & {\hfill\centering ---\hfill} & 89.3 & 108
400: & {\hfill\centering ---\hfill} & 163 & 130
401: \\
402: \multicolumn{2}{l}{WER}
403: & {\hfill\centering ---\hfill} & .209 & .133
404: & {\hfill\centering ---\hfill} & .516 & .423
405: & {\hfill\centering ---\hfill} & .604 & .543
406: & {\hfill\centering ---\hfill} & .488 & .416
407: & {\hfill\centering ---\hfill} & .637 & .482
408: \\
409: \hline
410: & R & .695 & .726 &.732 & .449 & .258 & .551
411: & .632 & .291 & .505 & .451 & .220 & .357 & .296 & .138 & .241 \\
412: $N$=1 & P & .534 & .548 & .519 & .377 & .319 & .386
413: & .479 & .362 & .464 & .414 & .277 & .337 & .529 & .358 & .436 \\
414: & F & .604 & .624 & .607 & .410 & .286 & .454
415: & .545 & .322 & .484 & .432 & .245 & .347 & .379 & .200 & .311 \\
416: \hline
417: & R & .847 & .858 & .832 & .663 & .360 & .674
418: & .791 & .464 & .677 & .655 & .380 & .463 & .482 & .228 & .421 \\
419: $N$=2 & P & .441 & .448 & .458 & .301 & .211 & .314
420: & .372 & .273 & .353 & .321 & .247 & .239 & .462 & .332 & .409 \\
421: & F & .580 & .588 & .591 & .414 & .266 & .429
422: & .506 & .343 & .464 & .431 & .300 & .316 & .472 & .270 & .415 \\
423: \hline
424: & R & .879 & .868 & .874 & .764 & .438 & .708
425: & .827 & .495 & .718 & .718 & .392 & .604 & .637 & .289 & .527 \\
426: $N$=3 & P & .410 & .405 & .401 & .269 & .163 & .252
427: & .363 & .215 & .318 & .297 & .188 & .235 & .466 & .280 & .385 \\
428: & F & .560 & .553 & .550 & .398 & .237 & .372
429: & .505 & .300 & .441 & .420 & .254 & .338 & .538 & .285 & .445 \\
430: \hline
431: \end{tabular}
432: \label{tab:results}
433: \end{center}
434: \end{table*}
435:
436: To evaluate the accuracy of speech recognition, we used the word error
437: rate (WER), which is the ratio of the number of word errors (deletion,
438: insertion, and substitution) to the total number of words. We also
439: used test-set out-of-vocabulary rate (OOV) and trigram test-set
440: perplexity (PP) to evaluate the extent to which our language model
441: adapted to the target topics. We used human transcriptions as test
442: set data. For example, OOV is the ratio of the number of word tokens
443: not contained in the language model for speech recognition to the
total number of word tokens in the transcription. Note that for OOV,
PP, and WER, smaller values are better.
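
These measures can be stated as formulas. The notation below is
introduced for illustration, and we assume the standard definitions,
in which WER is normalized by the number of words in the reference
(human) transcription. Let $S$, $D$, and $I$ denote the numbers of
substituted, deleted, and inserted words, $w_{1},\ldots,w_{n}$ the
word tokens in the reference transcription, and $n_{\mathrm{oov}}$ the
number of those tokens not contained in the language model vocabulary:
\begin{align*}
\mathrm{WER} &= \frac{S+D+I}{n} \\
\mathrm{OOV} &= \frac{n_{\mathrm{oov}}}{n} \\
\mathrm{PP} &= \exp\Bigl(-\frac{1}{n}\sum_{i=1}^{n}\log P(w_{i}|w_{i-2},w_{i-1})\Bigr)
\end{align*}
Because insertions are counted as errors, WER can in principle exceed
1. The treatment of out-of-vocabulary tokens in PP varies between
toolkits and is not specified here.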
446:
447: The final outputs (i.e., retrieved passages) were evaluated based on
448: recall and precision, averaged over all queries. Recall (R) is the
449: ratio of the number of correct speech units retrieved by our system to
450: the total number of correct speech units for the query in question.
451: Precision (P) is the ratio of the number of correct speech units
452: retrieved by our system to the total number of speech units retrieved
453: by our system. To summarize recall and precision into a single
454: measure, we used the F-measure (F).
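
Assuming the balanced F-measure (the harmonic mean of recall and
precision), which is consistent with the values in
Table~\ref{tab:results}, the measure is
\[
F = \frac{2\cdot R\cdot P}{R+P} .
\]
For example, for lecture~\#1 with human transcription and $N=1$,
$R=.695$ and $P=.534$ give
$F=(2\cdot .695\cdot .534)/(.695+.534)\approx .604$.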
455:
456: Table~\ref{tab:results} shows the accuracy of speech recognition (WER)
457: and passage retrieval (R, P, and F), for each lecture. In this table,
458: the columns ``HUM'' and ``ASR'' correspond to the results obtained
459: with human transcriptions and automatic speech recognition,
460: respectively. The column ``+LA'' denotes results for ASR combined with
language model adaptation. The row ``Topic'' gives the topic of
each of the five lectures.
463:
464: To adapt language models, we used the textbook corresponding to a
465: target lecture and searched the 10M Web page corpus for 2,000 relevant
466: pages, which were used as a source corpus. In the case where the
467: language model adaptation was not performed, all 10M Web pages were
468: used as a source corpus. In either case, 20,000 high frequency words
469: were selected from a source corpus to produce a word-based trigram
470: language model. We used the ChaSen morphological analyzer to extract
471: words (morphemes) from the source corpora, because Japanese sentences
472: lack lexical segmentation.
473:
474: In passage retrieval, we regarded the top $N$ passages as the final
475: outputs. In Table~\ref{tab:results}, the value of $N$ ranges from 1 to
3. As the value of $N$ increases, recall improves, potentially at the
cost of precision.
478:
479: \subsection{Discussion}
480: \label{subsec:discussion}
481:
Comparing the results of ASR and +LA in Table~\ref{tab:results},
OOV and PP increased in some cases when the language models were
adapted. However, adapting language models to target topics decreased
WER for every lecture.
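
To quantify this effect, the relative WER reduction can be computed
from Table~\ref{tab:results} as
$(\mathrm{WER}_{\mathrm{ASR}}-\mathrm{WER}_{\mathrm{+LA}})/\mathrm{WER}_{\mathrm{ASR}}$.
For lecture~\#1, for example,
\[
\frac{.209-.133}{.209} \approx .364 ,
\]
that is, a relative reduction of roughly 36\%. The same calculation
gives approximately 18\%, 10\%, 15\%, and 24\% for lectures~\#2 to
\#5, respectively.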
486:
487: The values of OOV, PP, and WER for lecture~\#1 were generally smaller
488: than those for the other lectures. One possible reason is that the
489: lecturer of \#1 spoke more fluently and made fewer erroneous
490: utterances than the other lecturers.
491:
For lectures~\#2--5, adapting language models increased recall,
precision, and F-measure for every number of passages retrieved, the
only exception being precision for lecture~\#4 with $N=2$ (.247 for ASR
versus .239 for +LA).
494: For lecture~\#1, the retrieval accuracy did not significantly differ
495: whether or not we adapted the language model to the topic. One
496: possible reason is that the WER of lecture~\#1 without language model
adaptation (20.9\%) was sufficiently small to obtain a retrieval
accuracy comparable with that of text retrieval~\cite{jourlin:sc-2000}.
Indeed, for this lecture the difference between HUM and ASR was
marginal in terms of the retrieval accuracy. Therefore, the effect of
the language model adaptation method was overshadowed in passage
retrieval.
502:
The retrieval accuracy for lecture~\#1 was higher than those for the
other lectures. Compared with the other lectures, lecture~\#1 followed
the organization of the textbook more closely. This suggests that the
performance of our LOD system depends on the organization of target
lectures.
508:
Surprisingly, for lectures~\#1 and \#2 with $N=1$ and $N=2$, the
F-measure of +LA was higher than that of HUM, and for lecture~\#2
recall and precision were also higher. This suggests that automatic
transcription can be more effective than human transcription
for passage retrieval purposes. One possible reason is the existence
513: of Japanese variants (i.e., more than one spelling form corresponding
514: to the same word), such as ``{\it girisha\/}/{\it
515: girishia\/}~(Greece)''. Because the language model was adapted by
516: means of the textbook for a target lecture, the spelling in automatic
517: transcriptions systematically resembled that in the queries extracted
518: from the textbooks. In contrast, it is difficult to standardize the
519: spelling in human transcriptions. Therefore, relevant passages in
520: automatic transcriptions were more likely to be retrieved than
521: passages in the human transcriptions.
522:
523: We conclude that our language model adaptation method was effective
524: for both speech recognition and passage retrieval.
525:
526: \section{Conclusion}
527: \label{sec:conclusion}
528:
529: Reflecting the rapid growth in the use of multimedia contents,
530: information technologies appropriate to speech, image, and text
processing are crucial. Of the various content types, in this paper we
532: focused on the video data of lectures with their organization based on
533: textbooks, and proposed a system for cross-media on-demand lectures,
534: in which users can formulate text queries using the textbook for a
535: target lecture to retrieve specific video passages.
536:
537: To retrieve video passages in response to text queries, we extract the
538: audio track from a lecture video, generate a transcription by large
539: vocabulary continuous speech recognition, and produce a text index,
540: prior to system use.
541:
We evaluated the performance of our system experimentally, using
five TV lecture programs on various topics. The experimental
544: results showed that the accuracy of speech recognition varied
545: depending on the topic and presentation style of the
546: lecturers. However, the accuracy of speech recognition and passage
547: retrieval was improved by adapting language models to the topic of the
target lecture. Even when the word error rate was approximately 40\%,
549: the accuracy of retrieval was comparable with that obtained by human
550: transcription.
551:
552: \bibliographystyle{IEEEtran}
553: \begin{thebibliography}{1}
554: \providecommand{\url}[1]{#1}
555: \def\UrlFont{\rmfamily}
556: \providecommand{\newblock}{\relax}
557: \providecommand{\bibinfo}[2]{#2}
558: \providecommand\BIBentrySTDinterwordspacing{\spaceskip=0pt\relax}
559: \providecommand\BIBentryALTinterwordstretchfactor{4}
560: \providecommand\BIBentryALTinterwordspacing{\spaceskip=\fontdimen2\font plus
561: \BIBentryALTinterwordstretchfactor\fontdimen3\font minus
562: \fontdimen4\font\relax}
563: \providecommand\BIBforeignlanguage[2]{{%
564: \expandafter\ifx\csname l@#1\endcsname\relax
565: \typeout{** WARNING: IEEEtran.bst: No hyphenation pattern has been}%
566: \typeout{** loaded for the language `#1'. Using the pattern for}%
567: \typeout{** the default language instead.}%
568: \else
569: \language=\csname l@#1\endcsname
570: \fi
571: #2}}
572:
573: \bibitem{auzanne:riao-2000}
574: C.~Auzanne, J.~S. Garofolo, J.~G. Fiscus, and W.~M. Fisher, ``Automatic
575: language model adaptation for spoken document retrieval,'' in
576: \emph{Proceedings of RIAO 2000 Conference on Content-Based Multimedia
577: Information Access}, 2000.
578:
579: \bibitem{seymore:eurospeech-97}
580: K.~Seymore and R.~Rosenfeld, ``Using story topics for language model
581: adaptation,'' in \emph{Proceedings of Eurospeech97}, 1997, pp. 1987--1990.
582:
583: \bibitem{eguchi:sigir-2002}
584: K.~Eguchi, K.~Oyama, K.~Kuriyama, and N.~Kando, ``The {Web} retrieval task and
585: its evaluation in the third {NTCIR} workshop,'' in \emph{Proceedings of the
586: 25th Annual International ACM SIGIR Conference on Research and Development in
587: Information Retrieval}, 2002, pp. 375--376.
588:
589: \bibitem{chen:adaptation_ws-2001}
590: L.~Chen, J.-L. Gauvain, L.~Lamel, G.~Adda, and M.~Adda, ``Language model
591: adaptation for broadcast news transcription,'' in \emph{Proceedings of ISCA
592: Workshop on Adaptation Methods For Speech Recognition}, 2001.
593:
594: \bibitem{robertson:sigir-94}
595: S.~Robertson and S.~Walker, ``Some simple effective approximations to the
596: 2-poisson model for probabilistic weighted retrieval,'' in \emph{Proceedings
597: of the 17th Annual International ACM SIGIR Conference on Research and
598: Development in Information Retrieval}, 1994, pp. 232--241.
599:
600: \bibitem{garofolo:trec-97}
601: J.~S. Garofolo, E.~M. Voorhees, V.~M. Stanford, and K.~S. Jones, ``{TREC-6}
602: 1997 spoken document retrieval track overview and results,'' in
603: \emph{Proceedings of the 6th Text REtrieval Conference}, 1997, pp. 83--91.
604:
605: \bibitem{jourlin:sc-2000}
606: P.~Jourlin, S.~E. Johnson, K.~S. Jones, and P.~C. Woodland, ``Spoken document
607: representations for probabilistic retrieval,'' \emph{Speech Communication},
608: vol.~32, pp. 21--36, 2000.
609:
610: \end{thebibliography}
611:
612: \end{document}
613: