1: \documentclass[11pt]{article}
2: %\usepackage{ijcai01}
3: %\usepackage{fullpage,palatino}
4: \usepackage{fullpage,url}
5: \setlength{\oddsidemargin}{-0.25in}
6: \setlength{\evensidemargin}{-0.25in}
7: \setlength{\topmargin}{0.5in}
8: \setlength{\headheight}{0pt}
9: \setlength{\headsep}{0pt}
10: \setlength{\footskip}{0.35in}
11: \setlength{\textheight}{8.75in}
12: \setlength{\textwidth}{7in}
13: \setlength{\itemindent}{-0.5cm}
14: \setlength{\marginparwidth}{0in}
15: \setlength{\marginparsep}{0in}
16: %\renewcommand{\baselinestretch}{1.62} % Double-space
17: \hyphenation{inform-ation-seeking inform-ation}
18: \newenvironment{descit}[1]{\begin{quote} \textit{#1}}{\end{quote}}
19:
20: \input{psfig-dvips}
21:
22: \newif\ifpdf
23: \ifx\pdfoutput\undefined
24: \pdffalse
25: \else
26: \pdfoutput=1
27: \pdftrue
28: \fi
29:
30: \ifpdf
31: \usepackage[pdftex]{graphicx}
32: \usepackage[pdftex]{color}
33: \DeclareGraphicsExtensions{.pdf,.png,.jpg}
34: \else
35: \usepackage[dvips]{graphicx}
36: \usepackage[dvips]{color}
37: \DeclareGraphicsExtensions{.eps,.epsi,.ps}
38: \fi
39:
40: \usepackage{times}
41: %\usepackage{fancyheadings}
42:
43: \pagestyle{plain}
44: %\thispagestyle{empty}
45: %\pagestyle{empty}
46:
47: \def\midv{\mathop{\,|\,}}
48: \newtheorem{defn}{Definition}
49: \long\def\cbk#1{{\color{red}[CBK: #1]}}
50: \newlength\colwidth \setlength\colwidth{3.25in}
51:
52: \title{Mixed-Initiative Interaction = Mixed Computation\footnote{This
53: work is supported in part by US National
54: Science Foundation grants DGE-9553458 and IIS-9876167.}}
55:
56: %\author{}
57: \author{Naren Ramakrishnan, Robert Capra, and Manuel A. P\'{e}rez-Qui\~{n}ones\\
58: Department of Computer Science\\
59: Virginia Tech, Blacksburg, VA 24061, USA\\
60: Contact Email: {\tt naren@cs.vt.edu}}
61:
62: \begin{document}
63:
64: \maketitle
65: %\thispagestyle{empty}
66: %\pagestyle{empty}
67:
68: \begin{abstract}
69: \noindent
70: We show that partial evaluation can be usefully viewed as
71: a programming model for realizing mixed-initiative
72: functionality in interactive applications.
73: Mixed-initiative interaction between two participants is one
74: where the parties can take turns at any time to change
75: and steer the flow of interaction. We concentrate on
76: the facet of mixed-initiative referred to as `unsolicited
77: reporting' and demonstrate how out-of-turn interactions
78: by users can be modeled by `jumping ahead' to nested
dialogs (via partial evaluation). Our approach lets us
view dialog management systems in terms of their
native support for staging and simplifying interactions;
we characterize three voice-based interaction
technologies from this viewpoint. In particular, we
84: show that the built-in form interpretation algorithm (FIA)
85: in the VoiceXML dialog management architecture is actually
86: a (well disguised) combination of an interpreter
87: and a partial evaluator.
88: \end{abstract}
89:
90: \newpage
91: \section{Introduction}
92: \label{intro}
93: Mixed-initiative interaction~\cite{computational-mixed} has been studied
94: for the past 30 years in the areas of artificial intelligence
95: planning~\cite{prodigy}, human-computer interaction~\cite{mixed-hci}, and
96: discourse analysis~\cite{coulthard}. As Novick and Sutton point
97: out~\cite{mixed-notkin},
98: it is `one of those things that people think that they can recognize
99: when they see it even if they can't define it.' It
100: can be broadly viewed as a flexible interaction strategy between
101: participants where the parties can take turns at any time to change
102: and steer the flow of interaction. Human conversations are typically
103: mixed-initiative and, interestingly, so are interactions with some modern
104: computer systems. Consider the following two dialogs with a
105: telephone pizza delivery service that has voice-recognition
106: capability (the line numbers are provided for ease of reference):
107:
108: \begin{descit}{Dialog 1}
109: \vspace{-0.1in}
110: \begin{tabbing}
111: [x] \= abcdefab \= thiscanactuallybeamuchlongersentenceokay \kill
112: 0 \> {\bf Caller:} \> $\prec$calls Joe's Pizza on the phone$\succ$ \\
113: 1 \> {\bf System:} \> Thank you for calling Joe's pizza ordering system.\\
114: 2 \> {\bf System:} \> What size pizza would you like?\\
115: 3 \> {\bf Caller:} \> I'd like a medium, please.\\
116: 4 \> {\bf System:} \> What topping would you like on your pizza?\\
117: 5 \> {\bf Caller:} \> Pepperoni.\\
118: 6 \> {\bf System:} \> What type of crust do you want?\\
119: 7 \> {\bf Caller:} \> Uh, deep-dish.\\
120: 8 \> {\bf System:} \> So that is a medium pepperoni pizza with deep-dish crust.
121: Is this correct?\\
122: 9 \> {\bf Caller:} \> Yes.\\
123: (conversation continues to get delivery and payment information)
124: \end{tabbing}
125: \end{descit}
126:
127: \begin{descit}{Dialog 2}
128: \vspace{-0.1in}
129: \begin{tabbing}
130: [x] \= abcdefab \= thiscanactuallybeamuchlongersentenceokay \kill
131: 0 \> {\bf Caller:} \> $\prec$calls Joe's Pizza on the phone$\succ$ \\
132: 1 \> {\bf System:} \> Thank you for calling Joe's pizza ordering system.\\
133: 2 \> {\bf System:} \> What size pizza would you like?\\
134: 3 \> {\bf Caller:} \> I'd like a sausage pizza, please.\\
135: 4 \> {\bf System:} \> Okay, sausage.\\
136: 5 \> {\bf System:} \> What size pizza would you like?\\
137: 6 \> {\bf Caller:} \> Medium.\\
138: 7 \> {\bf System:} \> What type of crust do you want?\\
139: 8 \> {\bf Caller:} \> Deep-dish.\\
140: 9 \> {\bf System:} \> So that is a medium sausage pizza with deep-dish crust.
141: Is this correct?\\
142: 10 \> {\bf Caller:} \> Yes.\\
143: (conversation continues to get delivery and payment information)
144: \end{tabbing}
145: \end{descit}
146:
147: \noindent
148: Both these conversations involve the specification of a (size,topping,crust)
149: tuple to complete the pizza ordering procedure but differ
150: in important ways. In the first dialog, the caller responds to the
151: questions in the order they are posed by the system. The system has the
152: initiative at all times (other than, perhaps, on line 0) and such an
153: interaction is thus
154: referred to as {\it system-initiated}. In the second dialog, when the
155: system prompts the caller about pizza size, he responds
156: with information about his choice of topping instead
157: (sausage; see line 3 of {\it Dialog 2}). Nevertheless, the conversation
158: is not stalled and the system continues with the other aspects of the
159: information gathering activity. In particular, the system registers that the
160: caller has specified a topping, skips its default question on this topic,
161: and repeats its question about the size (see line 5 of {\it Dialog 2}). The
162: caller
163: thus gained the initiative for a brief period during the conversation,
164: before returning it to the system. A conversation that `mixes' these modes
165: of interaction in such arbitrary ways is said to be {\it mixed-initiative}.
166:
167: \subsection{Tiers of Mixed-Initiative Interaction}
168: \label{tiers}
169: It is well accepted that mixed-initiative provides a more natural and
personalized mode of interaction. What remains a matter of debate, however, is
which qualities an interaction must possess to merit
classification as mixed-initiative~\cite{mixed-notkin}. In fact,
173: determining who has the initiative at a given point in an interaction can itself
174: be a contentious issue! The role of intention in
175: an interaction and the underlying task goals also affect the characterization
176: of initiative. We will not attempt to settle this debate here but
177: a few preliminary observations will be useful.
178:
179: One of the basic levels of mixed-initiative is referred to
180: as {\it unsolicited reporting} in~\cite{allen-intelligent} and is illustrated
181: in {\it Dialog 2} above. In this facet, a participant
182: provides information out-of-turn (in our case the caller, about his
183: choice of topping). Furthermore, the out-of-turn
184: interaction is not agreed upon in advance by the two participants.
185: Novick and Sutton~\cite{mixed-notkin} stress that the unanticipated
186: nature of out-of-turn interactions is important and that mere turn-taking
187: (perhaps in a hardwired order) does not constitute
188: mixed-initiative. Finally, notice that in {\it Dialog 2} there is a resumption
189: of the original questioning task once processing of the unsolicited response
190: is completed. In other applications, an unsolicited response might shift the
191: control to a new interaction sequence and/or abort the current interaction.
192:
Another level of mixed-initiative involves {\it subdialog invocation};
for instance, the computer system might not have understood the user's
response and might ask for clarification (which amounts to the system taking
the initiative). A final, more sophisticated form of mixed-initiative is one
197: where participants negotiate with each other to determine initiative
198: (as opposed to merely `taking the initiative')~\cite{allen-intelligent}:
199:
200: \vspace{-0.05in}
201: \begin{descit}{Dialog 3}
202: \vspace{-0.1in}
203: \begin{tabbing}
204: abcdefabyr \= thiscanactuallybeamuchlongersentenceokay \kill
205: (with apologies to O. Henry)\\
{\bf Husband:} \> Della, something interesting happened today that I want to
tell you.\\
208: {\bf Wife:} \> I too have something exciting to tell you, Jim.\\
209: {\bf Husband:} \> Do you want to go first or shall I tell you my story?
210: \end{tabbing}
211: \end{descit}
212:
213: In addition to models that characterize initiative, there are models
214: for designing dialog-based interaction systems.
215: Allen et al.~\cite{allen-ai} provide a taxonomy of such software models
216: --- finite-state machines, slot-and-filler
217: structures, frame-based methods, and more sophisticated models involving
218: planning, agent-based programming, and exploiting contextual information.
219: While mixed-initiative interaction can be studied in any of these models,
220: it is beyond the scope of this paper to address all or even a majority
221: of them.
222:
223: Instead, we concentrate on the view of (i) a dialog as a
224: task-oriented information assessment activity requiring the filling of a
225: set of slots,
226: (ii) where one of the participants in the dialog is a computer
227: system and the other
228: is a human, and (iii) where mixed-initiative arises from unsolicited
229: reporting (by the human), involving out-of-turn
230: interactions. This characterization includes many voice-based
231: interfaces to information (our pizza ordering dialog is an example) and
232: web sites modeling interaction by hyperlinks~\cite{pipe-tois}.
233: In Section~\ref{ourmodel}, we show that partial evaluation can be
234: usefully viewed as a programming model for such applications.
235: Section~\ref{voice-tech} presents three different voice-based interaction
236: technologies and analyzes them in terms of their native support for
237: mixing initiative. Finally, Section~\ref{future}
238: discusses other facets of mixed-initiative and mentions
239: other software models to which our approach can be extended.
240:
241: \vspace{-0.1in}
242: \section{Programming a Mixed-Initiative Application}
243: \vspace{-0.03in}
244: \label{ourmodel}
245: Before we outline the design of a system such as Joe's Pizza, we introduce
246: a notation~\cite{levinson,goffman} that captures basic elements
247: of initiative and response in an interaction sequence. The notation expresses
248: the local organization of a dialog~\cite{manuel-thesis,manuel-chi} as
249: adjacency pairs; for instance, {\it Dialog 1} is represented as:
250:
251: \vspace{-0.05in}
252: {\center
253: \begin{tabbing}
254: (Ic \= Rs) \= (Is \= Rc) \= (Is \= Rc) \= (Is \= Rc) \= (Is \= Rc) \kill
255: (Ic \> Rs) \> (Is \> Rc) \> (Is \> Rc) \> (Is \> Rc) \> (Is \> Rc) \\
256: \,\,0 \> \,\,1 \> \,\,2 \> \,\,3 \> \,\,4 \> \,\,5 \> \,\,6 \> \,\,7 \> \,\,8 \> \,\,9 \\
257: \end{tabbing}}
258:
259: \noindent
260: The line numbers given below the interaction sequence refer to the utterance
261: numbers in the dialog presented in Section~\ref{intro}.
The letter I denotes who has the initiative, the caller (c) or the system (s),
263: and the letter R denotes who provides the response. It is easy to see
264: from this notation that {\it Dialog 1} consists
265: of five turns and that the system has the initiative for the last
266: four turns. The initial turn is modeled as the caller having the
267: initiative because he or she chose to place the phone call in the first place.
268: The system quickly takes the initiative after playing a greeting to
269: the caller (which is modeled here as the response to the caller's call).
270: The subsequent four interactions then address three questions and a
271: confirmation, all involving the system retaining the initiative (Is) and
272: the caller in the responding mode (Rc). Likewise,
273: the mixed-initiative interaction in {\it Dialog 2} is
274: represented as:
275:
276: {\center
277: \begin{tabbing}
278: (Ic \= Rs) \= (Iso \= (Ic \= Rs) \= Rc) \= (Is \= Rc) \= (Is \= Rc) \kill
279: (Ic \> Rs) \> (Is \> (Ic \> Rs) \> Rc) \> (Is \> Rc) \> (Is \> Rc) \\
280: \,\,0 \> \,\,1 \> \,\,2,5 \> \,\,3 \> \,\,4 \> \,\,6 \> \,\,7 \> \,\,8 \> \,\,9 \,\,10 \\
281: \end{tabbing}}
282:
283: \noindent
284: In this case, the system takes the initiative in utterance 2 but instead
285: of responding to the question of size in utterance 3, the caller
286: takes the initiative, causing
287: an `insertion' to occur in the interaction sequence (dialog)~\cite{levinson}.
288: The system responds with an acknowledgement (`Okay, sausage.') in
289: utterance 4. This is represented as the nested pair (Ic Rs) above.
290: The system then re-focuses the dialog on the question of pizza size in
291: utterance 5 (thus retaking the initiative). In utterance 6 the
caller responds with the desired size (medium) and,
from this point on, the interaction proceeds as before.
294:
295: The notation is useful to describe the space of possible interactions that
296: are to be supported. For instance, utterances 0 and 1 have to proceed in order.
297: Utterances dealing with selection of (size,topping,crust) can then
298: be nested in any order and provide interesting
299: opportunities for mixing initiative.
300: For instance, if a user is a frequent
301: customer of Joe's Pizza, he might take the initiative and specify all three
302: pizza attributes on the first available prompt:
303:
304: \begin{descit}{Dialog 4}
305: \vspace{-0.1in}
306: \begin{tabbing}
307: [x] \= abcdefab \= thiscanactuallybeamuchlongersentenceokay \kill
308: 0 \> {\bf Caller:} \> $\prec$calls Joe's Pizza on the phone$\succ$ \\
309: 1 \> {\bf System:} \> Thank you for calling Joe's pizza ordering system.\\
310: 2 \> {\bf System:} \> What size pizza would you like?\\
311: 3 \> {\bf Caller:} \> I'd like a sausage pizza, medium, and deep-dish.\\
312: (conversation continues with confirmation of order)
313: \end{tabbing}
314: \end{descit}
315:
316: \noindent
317: Finally, the utterances dealing with confirmation of the user's request
318: can proceed only after choices of all three pizza attributes have been
319: made. There are 13 possible interaction sequences (discounting permutations
320: of attributes specified in a given utterance) --- 1 possibility
321: of specifying everything in one utterance, 6 possibilities of specification
322: in two utterances, and 6 possibilities of specification in three
323: utterances. If we include permutations, there are 24 possibilities (our
324: calculations do not consider situations where, for instance, the system doesn't
325: recognize the user's input and reprompts for information).
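
As a quick check of these counts, assume that each interaction sequence is an
ordered partition of the three attributes into non-empty utterances. The
notation $S(n,k)$ for the number of ways to split $n$ items into $k$ non-empty
groups (a Stirling number of the second kind) is introduced only for this
check and is not used elsewhere in the paper:
\[
\sum_{k=1}^{3} k!\,S(3,k) \;=\; 1!\cdot 1 \,+\, 2!\cdot 3 \,+\, 3!\cdot 1
\;=\; 1 + 6 + 6 \;=\; 13 .
\]
If the order in which attributes are mentioned within an utterance is also
counted, each sequence corresponds to one of the $3!$ orderings of the
attributes together with an independent choice of whether to end the utterance
at each of the two gaps between consecutive attributes:
\[
3! \cdot 2^{2} \;=\; 6 \cdot 4 \;=\; 24 .
\]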
326:
327: \begin{figure}
328: \centering
329: \begin{tabular}{cc}
330: \includegraphics[height=1.95in]{dubba1}
331: \hspace{0.1in}
332: \includegraphics[height=1.95in]{dubba3}
333: \end{tabular}
334: \caption{Designs of software systems for
335: mixed-initiative interaction. (left) Traditional system architecture,
336: distinguishing between responsive and unsolicited inputs.
337: (right) Using partial evaluation to handle inputs uniformly.}
338: \label{dialog-designs}
339: \end{figure}
340:
Many programming models
treat mixed-initiative sequences as special cases that
require dedicated handling. In particular, they rely on recognizing
when a user has provided unsolicited
input\footnote{We use the term `unsolicited
input' here to refer to expected but out-of-turn inputs, as opposed to
completely unexpected (or out-of-vocabulary) inputs.}
and treat a
shift in initiative as a `transfer of
control.'
351: This implies that the mechanisms that handle out-of-turn interactions
352: are often different
353: from those that realize purely system-directed interactions.
354: Fig.~\ref{dialog-designs} (left) describes a typical software design.
355: A dialog manager is in charge of
356: prompting the user for input, queueing messages onto an output
357: medium, event processing, and managing the overall flow of interaction.
358: One of its inputs is a dialog script that contains a specification of
359: interaction and a set of slots that are to be filled. In our pizza example,
360: slots correspond to placeholders for values of size, topping,
361: and crust. An interpreter determines the first unfilled
362: slot to be visited and presents any prompts
363: for soliciting user input.
364: A responsive input requires mere slot filling whereas unsolicited inputs
365: would require out-of-turn processing (involving a combination of slot
366: filling and simplification). In turn, this causes a revision of the
367: dialog script. The interpreter terminates when there is nothing left to
368: process in the script. While typical dialog managers
369: perform miscellaneous functions such as error control,
370: transferring to other scripts, and accessing scripts from a database, the
371: architecture in Fig.~\ref{dialog-designs}
372: (left) focuses on the aspects most relevant to our presentation.
373:
Our approach, on the other hand, is to think of a mixed-initiative
dialog as a program
whose arguments are all passed by reference and correspond to the
slots to be filled during information assessment. As usual, an interpreter
in the dialog manager
queues up any applicable prompts to an output medium.
380: Both responsive and
381: unsolicited inputs by a user now correspond (uniformly)
382: to values for arguments; they are processed by partially evaluating
383: the program with respect to these inputs. The result of partial evaluation
384: is another dialog (simplified as a result of user input) which is handed
385: back to the interpreter. This novel design
386: is depicted in Fig.~\ref{dialog-designs} (right) and a dialog script
387: represented in a programmatic notation is given in Fig.~\ref{pizza-script}.
388: Partial evaluation of Fig.~\ref{pizza-script} with respect to user
389: input will remove the conditionals for all slots that
390: are filled by the utterance (global variables are assumed to be
391: under the purview of the interpreter).
392: The reader can verify that a sequence of such partial evaluations
393: will indeed mimic the interaction sequence depicted in {\it Dialog 2}
394: (and any of the other mixed-initiative sequences).
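
To make this concrete, the following Python sketch mimics the loop of
Fig.~\ref{dialog-designs} (right) on the script of Fig.~\ref{pizza-script}.
It is only an illustration under simplifying assumptions: the script is
represented as a list of (slot, prompt) pairs rather than as program text,
slot values are assumed to have been extracted from each utterance already,
and the names \texttt{partially\_evaluate} and \texttt{run\_dialog} are
hypothetical helpers introduced here, not part of any system discussed in
this paper.

\begin{verbatim}
# Illustrative sketch only: the dialog script is modeled as a list of
# (slot, prompt) pairs; all names below are hypothetical.
PIZZA_SCRIPT = [
    ("size", "What size pizza would you like?"),
    ("topping", "What topping would you like on your pizza?"),
    ("crust", "What type of crust do you want?"),
]


def partially_evaluate(script, known):
    """Return the residual script: drop the conditional for every
    slot that already has a value in `known`."""
    return [(slot, prompt) for slot, prompt in script if slot not in known]


def run_dialog(script, utterances):
    """Interpret `script`, specializing it after every user utterance.

    `utterances` is a list of dicts, each holding the slot values
    extracted from one user turn (responsive or unsolicited).
    Returns the prompts issued and the filled slots.
    """
    filled, prompts = {}, []
    turns = iter(utterances)
    while script:
        prompts.append(script[0][1])   # prompt for the first unfilled slot
        filled.update(next(turns))     # any subset of slots may arrive
        script = partially_evaluate(script, filled)
    return prompts, filled


# Dialog 2: the caller answers the size prompt with a topping.
prompts, order = run_dialog(
    PIZZA_SCRIPT,
    [{"topping": "sausage"}, {"size": "medium"}, {"crust": "deep-dish"}],
)
\end{verbatim}

\noindent
In this run the topping prompt is never issued and the size prompt is
repeated, matching lines 2, 5, and 7 of {\it Dialog 2}. Supplying all three
values in a single utterance, as in {\it Dialog 4}, leaves an empty residual
script after one prompt. Note that the loop contains no branch that
distinguishes responsive from unsolicited input.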
395:
Partial evaluation serves two critical purposes in our design. The first is
obvious, namely the processing of out-of-turn interactions (and any
appropriate simplifications to the dialog script). The second, more subtle
purpose is its support for staging mixed-initiative
interactions. The
mix-equation~\cite{jones,jones-pe-book} holds for every possible way
of splitting inputs into two categories, so the designer need not enumerate and
`trap' the ways
in which the computations can be staged.
405: For instance, the nested pair in {\it Dialog 2} is supported as a natural
406: consequence of our design, not by anticipating and reacting to an
407: out-of-turn input.
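
For reference, the mix-equation can be stated as follows. This is the
standard formulation from the partial evaluation
literature~\cite{jones-pe-book}, restated in notation that we assume only for
this paragraph: $[\![p]\!]$ denotes the function computed by a program $p$,
$\mathit{mix}$ is the partial evaluator, and the inputs of $p$ are split into
a part $s$ that is available now and a part $d$ that arrives later.
\[
[\![p]\!]\,(s,d) \;=\; [\![\,[\![\mathit{mix}]\!]\,(p,s)\,]\!]\,(d)
\]
In our setting, $p$ plays the role of the dialog script, $s$ is whatever
subset of slot values the current utterance supplies, and $d$ comprises the
slots that remain to be filled. The equation places no restriction on which
subset of the inputs plays the role of $s$.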
408:
Another way to characterize the system designs in Fig.~\ref{dialog-designs} is
to say that Fig.~\ref{dialog-designs} (left) distinguishes between responsive
and unsolicited inputs, whereas Fig.~\ref{dialog-designs} (right) makes
a more fundamental
distinction between fixed-initiative (interpretation) and
mixed-initiative (partial
evaluation). In other words, Fig.~\ref{dialog-designs} (right) carves
416: up an interaction sequence into (i) turns that are
417: to be handled in the order they are modeled (by an interpreter),
418: and (ii) turns that can involve mixing of initiative (handled
419: by a partial evaluator).
420: In the latter case, the computations are actually used as
421: a {\it representation of interactions.} Since only mixed-initiative
422: interactions involve multiple staging options
423: and since these are handled by the partial
424: evaluator, our design requires the {\it least} amount of specification
425: (to support all interaction sequences). For instance, the script
426: in Fig.~\ref{pizza-script} models the parts that involve mixing of
427: initiative and helps realize all of the 13 possible interaction sequences.
428: At the same time it does not model the confirmation sequence of
429: {\it Dialog 2} because the caller cannot confirm his order before selecting
430: the three pizza attributes! This turn should be handled by
431: straightforward interpretation.
432:
433: \begin{figure}
434: \centering
435: \begin{tabular}{|l|} \hline
436: {\tt pizzaorder(size,topping,crust)}\\
437: \{\\
438: \,\,\,\, {\tt if (unfilled(size))\{}\\
439: \,\,\,\,\,\,\,\, {\tt /* prompt for size */}\\
440: \,\,\,\, {\tt \}}\\
441: \,\,\,\, {\tt if (unfilled(topping))\{}\\
442: \,\,\,\,\,\,\,\, {\tt /* prompt for topping */}\\
443: \,\,\,\, {\tt \}}\\
444: \,\,\,\, {\tt if (unfilled(crust))\{}\\
445: \,\,\,\,\,\,\,\, {\tt /* prompt for crust */}\\
446: \,\,\,\, {\tt \}}\\
447: \}\\
448: \hline
449: \end{tabular}
450: \caption{Modeling a dialog script as a program parameterized by slot variables
451: that are passed by reference.}
452: \label{pizza-script}
453: \end{figure}
454:
To the best of our knowledge, and following an extensive literature
search, such a model of partial evaluation
for mixed-initiative interaction
has not been proposed before.
459: While computational models for mixed-initiative
460: interaction remain an active area of research~\cite{computational-mixed},
461: such work is characterized by keywords such as `controlling mixed-initiative
462: interaction,' `knowledge representation and reasoning strategies,' and
463: `multi-agent co-ordination.' There are even projects that talk about
464: `integrating' mixed-initiative interaction and partial evaluation to realize
an architecture for planning and learning~\cite{prodigy}. We are optimistic
that our work will have the same historical significance as the relation
between explanation-based generalization (a learning technique in AI)
and partial evaluation established by van Harmelen and Bundy
in 1988~\cite{EBG_PE}.
470:
471: \vspace{-0.1in}
472: \section{Software Technologies for Voice-Based Mixed-Initiative Applications}
473: \label{voice-tech}
474: One of the main contributions of our model is that it characterizes the
475: minimum amount of information needed to program a mixed-initiative interaction
476: sequence.
Once a programmer supplies a script such
as that in Fig.~\ref{pizza-script}, mixed-initiative
interaction is obtained, quite literally, `for free.' This means that
480: we can use the design in Fig.~\ref{dialog-designs} (right) as a benchmark
481: to compare and contrast the amount of specification required in other
482: approaches.
483:
484: As indicated in Section~\ref{tiers}, our model is applicable to
485: voice-based interaction technologies as well as web access via hyperlinks.
486: We concentrate on voice-based applications since interaction with web
487: sites is addressed in a related paper~\cite{pipe-tois} and because
488: the design constraints in voice-based applications pose interesting
489: considerations for our model. In addition, a variety of commercial
490: technologies are available for voice-based applications (in contrast to
491: web sites) that will aid in comparative assessment.
492:
493: \begin{figure}
494: \centering
495: \begin{tabular}{cc}
496: \includegraphics[height=2in]{spreco}
497: \end{tabular}
498: \caption{Basic components of a spoken language processing system.}
499: \label{spreco}
500: \end{figure}
501:
502: \vspace{-0.1in}
503: \subsection{Basic Principles of Voice-Based Interaction}
504: Before we can study the programming of mixed-initiative in
505: a voice-based application, it will be helpful to understand
506: the basic architecture (see Fig.~\ref{spreco})
507: of a spoken language processing system. As a user speaks into the
508: system, the sounds produced are
509: captured by a microphone and converted into a digital signal by an
510: analog-to-digital converter. In telephone-based systems
511: (the VoiceXML architecture covered later in the paper is geared toward
512: this mode), the microphone is part of the
513: telephone handset and the analog-to-digital conversion is typically done by
514: equipment in the telephone network (in some cellular telephony models,
515: the conversion would be performed in the handset itself).
516:
517: The next stage (feature extraction) prepares the digital speech signal to be
518: processed by the speech recognizer. Features of the signal important for
519: speech recognition are extracted from the original signal, organized as an
520: attribute vector, and passed to the recognizer.
521:
522: Most modern speech recognizers use Hidden Markov Models (HMMs) and associated
523: algorithms to represent, train, and recognize speech. HMMs are
524: probabilistic models that must be trained on an input set of data. A common
525: technique is to create sets of acoustic HMMs that model phonetic units of
526: speech in context. These models are created from a training set of speech
527: data that is (hopefully) representative of the population of users who will
528: use the system. A language model is also created prior to performing
529: recognition. The language model is typically used to specify valid
530: combinations of the HMMs
531: at a word- or sentence-level. In this way, the
532: language model specifies the words, phrases, and sentences that the recognizer
533: can attempt to recognize. The process of recognizing a new input speech
534: signal is then accomplished using efficient search algorithms (such as Viterbi
535: decoding) to find the best matching HMMs, given the constraints of the language
536: model. The output of the speech recognizer can take several different forms,
537: but the basic result is a text string that is the recognizer's best guess of
538: what the user said. Many recognizers can provide additional information such
539: as a lattice of results, or an N-best ranked list of results (in case the later
540: stages of processing wish to reject the recognizer's top choice). A good
541: introduction to speech recognition is available in~\cite{martin-speech}.
542:
543: The stages after speech recognition vary depending on the application and the
544: types of processing required. Fig.~\ref{spreco} presents
545: two additional phases that are commonly included in spoken language processing
546: systems in one form or another. We will broadly refer to the first
547: post-recognition stage as natural language processing (NLP). NLP is a
548: large field in its own right and includes many sub-areas such as parsing,
549: semantic interpretation, knowledge representation, and speech acts; an
550: excellent introduction is available in Allen's classic~\cite{allen-nlp}. Our
551: presentation in this paper has assumed NLP support for slot-filling (i.e.,
552: determining values for slot variables from user input).
553:
This is commonly achieved by defining parts of a language model
and associating them with slots. The language model can take two
major forms: context-free grammars and statistical models (such
as n-grams). Here we
focus on the former: in this approach, slots can be specified within
the productions of a context-free grammar (akin to an attribute
grammar) or they can be associated with
the non-terminals in the grammar.
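
As an illustration of associating slots with non-terminals, the Python sketch
below attaches a slot name to each of three toy non-terminals and fills
whichever slots an utterance mentions. The vocabulary (beyond the values that
appear in our dialogs) and the function name \texttt{fill\_slots} are
assumptions made for this example, and simple phrase matching stands in for
the parsing that a deployed recognizer would perform.

\begin{verbatim}
import re

# Toy grammar: each non-terminal (slot) lists its alternative terminals.
# The alternatives are illustrative assumptions.
GRAMMAR = {
    "size": ["small", "medium", "large"],
    "topping": ["pepperoni", "sausage", "mushroom"],
    "crust": ["deep-dish", "thin"],
}


def fill_slots(utterance):
    """Return {slot: value} for every slot whose terminal occurs
    as a whole word or phrase in the utterance."""
    text = utterance.lower()
    found = {}
    for slot, alternatives in GRAMMAR.items():
        for value in alternatives:
            # Require that the match is not part of a longer word.
            pattern = r"(?<![\w-])" + re.escape(value) + r"(?![\w-])"
            if re.search(pattern, text):
                found[slot] = value
                break
    return found


slots = fill_slots("I'd like a sausage pizza, medium, and deep-dish.")
\end{verbatim}

\noindent
For the utterance on line 3 of {\it Dialog 4}, all three slots are filled at
once; for `Uh, deep-dish.' only the crust slot is filled. The function
returns the same kind of result whether or not the utterance answers the
prompt that was just played.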
562:
563: We will refer to the next phase of processing as simply `dialog management'
564: (see Fig.~\ref{spreco}). In this phase, augmented results from
565: the NLP stage are incorporated into the dialog and any associated logic
566: of the application is executed. The job of the dialog manager is to
567: track the proceedings of the dialog and to generate appropriate
568: responses. This is often done within some logical processing
569: framework and a dialog model (in our case, a dialog script)
570: is supplied as input that is specific to the particular application being
571: designed. The execution of the logic on the dialog model (script) results
572: in a response that can be presented back to the user. Sometimes
573: response generation is separated out into a subsequent stage.
574:
575: \begin{figure}
576: \centering
577: \begin{tabular}{cc}
578: \includegraphics[width=3in]{httparch}
579: \hspace{0.1in}
580: \includegraphics[width=3.9in]{vxmlarch}
581: \end{tabular}
\caption{(left) Accessing HTML documents via an HTTP web server.
(right) Accessing VoiceXML documents via an HTTP web server.}
584: \label{html-vxml}
585: \end{figure}
586:
587: \begin{figure}
588: \centering
589: \small
\begin{verbatim}
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- pizza.vxml
       A simple pizza ordering demo to illustrate some basic elements
       of VoiceXML. Several details have been omitted from this demo
       to help make the basic ideas stand out. -->
  <form id="welcome">
    <block name="block1">
      <prompt> Thank you for calling Joe's pizza ordering system. </prompt>
      <goto next="#place_order" />
    </block>
  </form>

  <form id="place_order">
    <field name="size">
      <prompt> What size pizza would you like? </prompt>
    </field>

    <field name="topping">
      <prompt> What topping would you like on your pizza? </prompt>
    </field>

    <field name="crust">
      <prompt> What type of crust do you want? </prompt>
    </field>

    <field name="verify">
      <prompt>
        So that is a <value expr="size"/> <value expr="topping"/> pizza
        with <value expr="crust"/> crust.
        Is this correct?
      </prompt>
      <grammar> yes | no </grammar>
    </field>

    <filled>
      <if cond="verify=='no'">
        <clear namelist="size topping verify crust"/>
        <prompt> Sorry. Your order has been canceled. </prompt>
      <else/>
        <prompt>Thank you for ordering from Joe's pizza.</prompt>
      </if>
    </filled>

  </form>
</vxml>
\end{verbatim}
638: \caption{Modeling the pizza ordering dialog in a VoiceXML document.}
639: \label{vpizza}
640: \end{figure}
641:
642: \vspace{-0.1in}
643: \subsection{The VoiceXML Dialog Management Architecture}
644: There are many technologies and delivery mechanisms available for
645: implementing Fig.~\ref{spreco}'s basic components. A popular
646: implementation can be seen in the VoiceXML dialog management architecture.
647: VoiceXML is a markup language designed to simplify the construction of
648: voice-response applications~\cite{voicexml}. Initiated by a
649: committee comprising AT\&T, IBM, Lucent Technologies, and Motorola,
650: it has emerged as a standard in telephone-based voice user interfaces
651: and in delivering web content via voice. We will hence cover this
652: architecture in detail.
653:
654: The basic idea is to describe interaction
655: sequences using a markup notation in a VoiceXML {\it document.} As the
656: VoiceXML specification~\cite{voicexml} indicates, a VoiceXML document
657: constitutes a conversational finite state machine and describes a
658: sequence of interactions (both fixed- and mixed-initiative are supported).
659: A web server can serve VoiceXML documents using the HTTP
660: protocol (Fig.~\ref{html-vxml} (right)), just as easily as HTML documents
661: are currently served over the Internet (Fig.~\ref{html-vxml} (left)).
662: In addition, voice-based applications require a suitable delivery
663: platform, illustrated by a telephone in Fig.~\ref{html-vxml} (right). The
664: voice-browser platform in Fig.~\ref{html-vxml} (right)
665: includes the VoiceXML interpreter which processes the
666: documents, monitors user inputs, streams messages,
667: and performs other functions expected of a dialog management system. Besides
668: the VoiceXML interpreter, the voice-browser platform includes speech
669: recognizers, speech synthesizers, and telephony interfaces to help
670: realize important aspects of voice-based interaction.
671:
672: Dialog specification in a VoiceXML document involves organizing
673: a sequence of {\it forms} and {\it menus}. Forms specify a set of
674: slots (called field item variables) that are to be filled by user input. Menus
675: are syntactic shorthands (much like a {\tt case} construct); since they
676: involve only one field item variable (argument), there are no opportunities
677: for mixing initiative. We do not discuss menus further in this paper.
678: An example VoiceXML document for our pizza application is given
679: in Fig.~\ref{vpizza}.
680:
681: \begin{figure}
682: \small
683: \centering
\begin{tabular}{|p{6.6in}|}\hline
685: \vspace{-0.2in}
686: \begin{verbatim}
687: #JSGF V1.0;
688:
689: grammar sizetoppingcrust;
690:
691: public <sizetoppingcrust> =
692: <size> {this.size=$} [<topping> {this.topping=$}] [<crust> {this.crust=$}] |
693: <size> {this.size=$} <crust> {this.crust=$} <topping> {this.topping=$} |
694: <topping> {this.topping=$} [<crust> {this.crust=$}] [<size> {this.size=$}] |
695: <topping> {this.topping=$} <size> {this.size=$} <crust> {this.crust=$} |
696: <crust> {this.crust=$} [<size> {this.size=$}] [<topping> {this.topping=$}] |
697: <crust> {this.crust=$} <topping> {this.topping=$} <size> {this.size=$};
698:
699: <size> = small | medium | large;
700: <topping> = sausage | pepperoni | onions | green peppers;
701: <crust> = regular | deep dish | thin;
702: \end{verbatim}
703: \\\hline
704: \end{tabular}
705: \caption{A form-level grammar to be used in conjunction with
706: the script in Fig.~\ref{vpizza} to realize mixed-initiative interaction.
707: The above productions for {\tt sizetoppingcrust} cover all possibilities
708: of filling slot variables from user input, including multiple slots filled
709: by a given utterance, and various permutations of specifying pizza attributes.}
710: \label{formgram}
711: \end{figure}
712:
713: As shown in Fig.~\ref{vpizza}, the pizza dialog consists of two forms. The
714: first form ({\tt welcome}) merely welcomes the user and transitions to the
715: second. The {\tt place\_order} form involves four {\tt field}s
716: (slot variables) --- the first three cover the pizza attributes and the
717: fourth models the confirmation variable (recall the dialogs in
718: Section~\ref{intro}). In particular, prompts for soliciting user input
719: in each of the fields are specified in Fig.~\ref{vpizza}.
720:
Interactions in a VoiceXML application proceed just as in a web application,
except that instead of clicking on a hyperlink (to go to a new state), the
user talks into a microphone. The VoiceXML interpreter then
724: determines the next state to transition to. Any appropriate responses
725: (to user input) and prompts are
726: delivered over a speaker. The
727: core of the interpreter is a so-called form interpretation algorithm
728: (FIA) that drives the interaction.
729: In Fig.~\ref{vpizza}, the fields provide for a fixed-initiative, system-directed
730: interaction. The FIA simply visits all fields in the order they are presented
731: in the document.
732: Once all fields are filled, a check is made to ensure that
733: the confirmation was successful; if not, the fields are cleared (notice
734: the {\tt clear namelist} tag) and the FIA will proceed to
735: {\tt prompt} for the inputs again,
736: starting from the first unfilled field --- {\tt size}.
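
The directed behavior described above can be mimicked in a few lines of
Python. The following is only an illustrative sketch under simplifying
assumptions: the name {\tt run\_directed\_form} and the use of a list of
answers to stand in for the caller are ours (they are not part of VoiceXML),
and recognition, prompts, and grammars are omitted.
\begin{verbatim}
FIELDS = ["size", "topping", "crust", "verify"]   # document order

def run_directed_form(answers):
    """Visit fields in document order; `answers` stands in for the caller."""
    answers = iter(answers)
    slots = {}
    transcript = []                  # (field prompted for, value heard)
    while True:
        # Select the first unfilled field, in document order.
        unfilled = [f for f in FIELDS if f not in slots]
        if not unfilled:
            # All fields are filled: mimic the form's <filled> action.
            if slots["verify"] == "no":
                slots.clear()        # plays the role of <clear namelist=.../>
                continue             # prompting restarts at `size'
            return slots, transcript
        field = unfilled[0]
        value = next(answers)        # exactly one answer per prompt
        slots[field] = value
        transcript.append((field, value))
\end{verbatim}
\noindent
In this sketch, a caller who answers `no' at the first confirmation and
`yes' at the second is prompted eight times in all, since the three pizza
attributes and the confirmation are each requested twice.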
737:
738: \begin{figure}
739: \centering
740: \small
\begin{tabular}{|p{6in}|}\hline
742: \vspace{-0.2in}
743: \begin{verbatim}
744: While (true)
745: {
746: // SELECT PHASE
747: Select the first form item with an unsatisfied guard condition
748: (e.g., unfilled)
749: If no such form item, exit
750:
751: // COLLECT PHASE
752: Queue up any prompts for the form item
753: Get an utterance from the user
754:
755: // PROCESS PHASE
756: foreach (slot in user's utterance)
757: {
758: if (slot corresponds to a field item) {
759: copy slot values into field item variables
760: set field item's `just_filled' flag
761: }
762: }
763: // some code for executing any `filled' actions triggered
764: }
765: \end{verbatim}
766: \\\hline
767: \end{tabular}
768: \caption{Outline of the form interpretation algorithm (FIA) in
769: the VoiceXML dialog management architecture. Adapted from~\cite{voicexml}.}
770: \label{fiaalgo}
771: \end{figure}
772:
773: \begin{figure}
774: \small
775: \centering
\begin{tabular}{|p{5in}|}\hline
777: \vspace{-0.2in}
778: \begin{verbatim}
779: #JSGF V1.0;
780:
781: grammar sizetoppingcrust;
782:
public <sizetoppingcrust> = <word>*;

<word> = <size> {this.size=$} |
         <crust> {this.crust=$} |
         <topping> {this.topping=$};
788:
789: <size> = small | medium | large;
790: <topping> = sausage | pepperoni | onions | green peppers;
791: <crust> = regular | deep dish | thin;
792: \end{verbatim}
793: \\\hline
794: \end{tabular}
\caption{An alternative form-level grammar to realize mixed-initiative
796: interaction with
797: the script in Fig.~\ref{vpizza}.}
798: \label{formgram2}
799: \end{figure}
800:
801: The form in Fig.~\ref{vpizza} is referred to as a directed one since the
802: computer has the initiative at all times and the {\tt field}s are filled
803: in a strictly sequential order. To make the interaction mixed-initiative
804: (with respect to {\tt size}, {\tt crust}, and {\tt topping}),
805: the programmer merely has to specify a so-called
806: {\it form-level grammar} that describes possibilities for slot-filling from
807: a user utterance. An example
808: form-level grammar file ({\tt sizetoppingcrust.gram}) that covers all
809: possibilities is given in
810: Fig.~\ref{formgram}. The grammar is associated with the dialog script
811: by including the line:
812: \begin{verbatim}
813: <grammar src="sizetoppingcrust.gram" type="application/x-jsgf"/>
814: \end{verbatim}
815: just before the definition of
816: the first {\tt field} (size) in Fig.~\ref{vpizza}.
817:
818: The form-level grammar contains productions for the various choices
819: available for size, topping, and crust and also qualifies
820: all possible parses for
821: a given utterance (modeled by the non-terminal {\tt sizetoppingcrust}). Any
822: valid combination of the three pizza aspects uttered by the user (in
823: any order) is recognized and the appropriate slot variables are instantiated.
824: To see why this also achieves mixed-initiative, let us consider the FIA
825: in more detail.
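
Before turning to the FIA, the coverage of the form-level grammar in
Fig.~\ref{formgram} can be made concrete with a small Python sketch. It is
not a JSGF parser: the functions {\tt match\_at} and {\tt parse\_utterance}
and the table {\tt VOCAB} are hypothetical names introduced for illustration,
and the sketch assumes that the recognizer delivers the utterance as
whitespace-separated words.
\begin{verbatim}
VOCAB = {
    "size": ["small", "medium", "large"],
    "topping": ["sausage", "pepperoni", "onions", "green peppers"],
    "crust": ["regular", "deep dish", "thin"],
}

def match_at(words, i):
    """Return (slot, phrase) for the phrase starting at words[i], or None."""
    for slot, phrases in VOCAB.items():
        for phrase in phrases:
            tokens = phrase.split()
            if words[i:i + len(tokens)] == tokens:
                return slot, phrase
    return None

def parse_utterance(utterance):
    """Return {slot: value} if the grammar covers the utterance, else None."""
    words = utterance.lower().split()
    slots = {}
    i = 0
    while i < len(words):
        hit = match_at(words, i)
        if hit is None:
            return None              # word outside the three vocabularies
        slot, phrase = hit
        if slot in slots:
            return None              # each attribute at most once
        slots[slot] = phrase
        i += len(phrase.split())
    return slots or None             # at least one attribute is required
\end{verbatim}
\noindent
Under these assumptions, \verb|parse_utterance("pepperoni medium")| yields
the {\tt topping} and {\tt size} slots, whereas `pepperoni sausage' is
rejected because the productions of Fig.~\ref{formgram} admit each attribute
at most once per utterance.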
826:
Fig.~\ref{fiaalgo} reproduces only those aspects of the FIA that are relevant
to our discussion. Compare the basic elements of the FIA to the stages in
829: Fig.~\ref{dialog-designs} (right). The Select phase corresponds to the
830: interpreter, the Collect phase gathers the user input, and
831: actions taken in the Process phase mimic the partial evaluator. Recall that
832: `programs'
(scripts) in VoiceXML can be modeled by finite-state machines; hence
834: the mechanics of partial evaluation are considerably simplified and just
835: amount to filling the slot and removing it from further consideration.
836: Since the FIA repeatedly executes till there are no remaining form items,
837: the processing phase (Process) is effectively parameterized by the form-level
838: grammar file in Fig.~\ref{formgram}. In other words, the form-level grammar
839: file not only enables slot filling, {\it it also implicitly directs the
840: staging of interactions for mixed-initiative.} When the user
specifies `pepperoni medium' in an utterance, not only does the grammar
842: file enable the recognition of the slots they correspond to (topping and size),
843: it also directs the FIA to simplify these slots (and remove them in
844: any subsequent interaction).
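
This effect can be traced with a small simulation of the three phases in
Fig.~\ref{fiaalgo}. The Python sketch below is an illustration under stated
assumptions rather than the VoiceXML algorithm itself: {\tt run\_fia},
{\tt SLOT\_OF}, and {\tt FIELDS} are names we introduce, the confirmation
field and the {\tt filled} actions are left out, and multi-word values such
as `deep dish' are omitted so that words can be looked up one at a time.
\begin{verbatim}
SLOT_OF = {"small": "size", "medium": "size", "large": "size",
           "sausage": "topping", "pepperoni": "topping",
           "onions": "topping",
           "regular": "crust", "thin": "crust"}
FIELDS = ["size", "topping", "crust"]    # document order

def run_fia(utterances):
    """Simulate select/collect/process; return (fields prompted, slots)."""
    utterances = iter(utterances)
    filled = {}
    prompted = []
    while True:
        # SELECT: first field whose guard is unsatisfied (here: unfilled).
        pending = [f for f in FIELDS if f not in filled]
        if not pending:
            return prompted, filled
        prompted.append(pending[0])
        # COLLECT: obtain one utterance from the caller.
        utterance = next(utterances)
        # PROCESS: copy every recognized slot, not only the prompted one.
        for word in utterance.split():
            if word in SLOT_OF:
                filled[SLOT_OF[word]] = word
\end{verbatim}
\noindent
With the utterances `pepperoni medium' and then `thin', the sketch prompts
only for {\tt size} and {\tt crust}; the question for {\tt topping} is never
asked because that slot was already filled by the first utterance.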
845:
846: The form-level grammar file shown in Fig.~\ref{formgram}
847: (which is also a specification of interaction staging) may make
848: VoiceXML's design appear overly complex. In reality, however,
849: we could have used the vanilla form-level
850: grammar file in Fig.~\ref{formgram2}. While helping to
851: realize mixed-initiative with Fig.~\ref{vpizza}, the new
852: form-level file (as does our model) also allows the possibility of
853: utterances such as `pepperoni pepperoni,' or even, `pepperoni sausage!'
854: Suitable semantics for such situations (including the role of
855: side-effects) can be defined and accommodated in both the VoiceXML
model and ours. VoiceXML's dialog management architecture can thus be
seen as implementing a mixed evaluation model (for conversational finite
state machines), comprising interpretation and partial evaluation.
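
The semantics of repeated or conflicting slot values mentioned above can be
made concrete in several ways; one possibility is sketched below in Python.
The three policies shown (keep the last value, keep the first value, or
reject the utterance) are our own illustrative choices, as are the names
{\tt fill\_slots} and {\tt SLOT\_OF}, and the sketch assumes single-word
values drawn from the grammar.
\begin{verbatim}
SLOT_OF = {"small": "size", "medium": "size", "large": "size",
           "sausage": "topping", "pepperoni": "topping",
           "onions": "topping",
           "regular": "crust", "thin": "crust"}

def fill_slots(utterance, policy="last"):
    """Fill slots from a word* utterance; `policy' settles repeated slots."""
    slots = {}
    for word in utterance.split():
        slot = SLOT_OF[word]         # KeyError for out-of-grammar words
        if slot in slots and slots[slot] != word:
            if policy == "reject":
                raise ValueError("conflicting values for " + slot)
            if policy == "first":
                continue             # keep the earlier value
        slots[slot] = word           # "last": the later value overrides
    return slots
\end{verbatim}
\noindent
Under the default policy, `pepperoni sausage' leaves {\tt topping} set to
{\tt sausage}, while `pepperoni pepperoni' is harmless under all three
policies because the repeated value does not conflict with the first.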
860:
861: The VoiceXML specification~\cite{voicexml} refers to the form-level file
862: as merely a `grammar file,' when it is actually also a specification of
863: staging. Even though the grammar file serves the role of a language
864: model in a voice application, we believe that
865: separating its two functionalities is important in understanding
866: mixed-initiative system design.
867: %If a statistical n-gram model served
868: %as the language model (instead of context-free grammars), such a distinction
869: %would be easy to make.
870: A case in point is our study of personalizing
871: interaction with web sites~\cite{pipe-tois}. There is no requirement for
872: a `grammar file,' as there is usually no ambiguity about user clicks and
873: typed-in keywords. In this context, the functionality provided by our model
874: is actually unmatched by any existing web-based interaction system (as
875: web interfaces are not typically designed for mixing initiative). A way to
876: incorporate mixed-initiative interaction into an existing interaction
877: at a web site is described in~\cite{pipe-tois}.
878:
879: \begin{table}
880: \centering
881: \begin{tabular}{|l|c|c|} \hline
882: \multicolumn{1}{|l|} {Software} &
883: \multicolumn{1}{c|} {Support for} &
884: \multicolumn{1}{c|} {Support for} \\
885: \multicolumn{1}{|l|} {Technology} &
886: \multicolumn{1}{c|} {Slot Simplification} &
887: \multicolumn{1}{c|} {Interaction Staging} \\ \hline
888: VoiceXML & $\surd$ & $\surd$ \\
889: Slot Filling Systems & $\surd$ & $\times$\\
890: Recognizer-Only APIs & $\times$ & $\times$ \\
891: \hline
892: \end{tabular}
893: \caption{Comparison of software technologies for voice-based
894: mixed-initiative applications.}
895: \label{compare-table}
896: \end{table}
897:
898: \vspace{-0.1in}
899: \subsection{Other Implementation Technologies}
900: VoiceXML's FIA thus includes native support for slot filling, slot
901: simplification, and interaction staging. All of these are functions
enabled by partial evaluation in our model. Table~\ref{compare-table}
contrasts VoiceXML with two other implementation approaches in terms of these aspects.
904: In a purely slot-filling system, native support
905: is provided for simplifying slots from user utterances but extra code
906: needs to be written to model the control logic (for instance,
907: `the user still didn't specify his choice of size, so the question for
908: size should be repeated.'). Several commercial speech recognition vendors
909: provide APIs that operate at this level. In addition, many vendors support
910: low-level APIs that provide basic access to recognition results (i.e.,
911: text strings) but do not perform any additional processing. We refer
912: to these as recognizer-only APIs.
913: They serve more as raw
914: speech recognition engines and require significant programming to first
915: implement a slot-filling engine and, later, control logic to mimic all
possible opportunities for staging. Examples of the latter two technologies
917: can be seen in the commercial spoken dialog systems
918: market (from companies such as Nuance, IBM, and AT\&T). The study presented
919: in this paper suggests a systematic way by which
920: their capabilities for mixed-initiative interaction can be assessed.
921:
922: \vspace{-0.1in}
923: \section{Discussion}
924: \label{future}
925: Our work makes contributions to both partial evaluation and
926: mixed-initiative interaction. For the partial evaluation community, we
927: have identified a novel application where the motivation is the staging
928: of interaction (rather than speedup). Since programs (dialogs) are
929: used as specifications of interaction, they are {\it written to be
930: partially evaluated}; partial evaluation is hence not an `afterthought'
931: or an optimization. A program can thus be thought of as a
932: compaction of all possible interaction sequences that involve mixing
933: initiative. An interesting research issue is:
934: Given (i) a set of interaction sequences, and (ii) addressable information
935: (such as arguments and slot variables), determine (iii) the smallest program so
936: that every interaction sequence can be staged in the model
937: of Fig.~\ref{dialog-designs} (right). This requires algorithms
938: to automatically decompose parts of interaction sequences into those
939: that are best addressed in the interpreter and those that can benefit from
940: representation and specialization by the partial evaluator.
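
To give a sense of what a program compacts, the Python sketch below
enumerates the ways in which a set of slots can be filled over one or more
utterances. It rests on assumptions of ours: each utterance fills a nonempty
group of the remaining slots, word order within an utterance is ignored, and
no slot is revisited. The generator name {\tt stagings} is hypothetical.
\begin{verbatim}
from itertools import combinations

def stagings(slots):
    """Yield every way to fill `slots' over one or more utterances."""
    slots = tuple(slots)
    if not slots:
        yield []                     # nothing left to fill
        return
    # Choose the nonempty group of slots filled by the next utterance ...
    for k in range(1, len(slots) + 1):
        for group in combinations(slots, k):
            rest = tuple(s for s in slots if s not in group)
            # ... then stage the remaining slots recursively.
            for tail in stagings(rest):
                yield [set(group)] + tail
\end{verbatim}
\noindent
For the three pizza slots this enumeration produces 13 stagings, ranging
from a single utterance that fills everything to three single-slot
utterances in any of six orders.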
941:
942: For mixed-initiative interaction, we have presented a
943: programming model that accommodates all possibilities of staging, without
explicit enumeration. The model makes a distinction between fixed-initiative
(which has to be explicitly programmed) and mixed-initiative
946: (specifications of which can be elegantly compressed for subsequent
947: partial evaluation). We have identified instantiations of this model in
948: VoiceXML and slot-filling APIs. We hope this observation will help
949: system designers gain additional insight into voice application design
950: strategies.
951:
952: It should be recalled that there are various facets of mixed-initiative
953: that are not addressed in this paper. Extending our programming model to cover
954: these facets is an immediate direction of future research. For example,
955: VoiceXML's design currently supports dialogs such as
956: the following:
957:
958: \begin{descit}{Dialog 5}
959: \vspace{-0.1in}
960: \begin{tabbing}
961: [x] \= abcdefab2 \= thiscanactuallybeamuchlongersentenceokay \kill
962: 1 \> {\bf System:} \> Thank you for calling Joe's pizza ordering system.\\
963: 2 \> {\bf System:} \> What size pizza would you like?\\
964: 3 \> {\bf Caller 1:} \> What sizes do you have?\\
965: 3 \> {\bf Caller 2:} \> Err.. Why don't you ask me the questions in
966: topping-crust-size order?\\
967: \end{tabbing}
968: \end{descit}
969: \vspace{-0.2in}
970:
971: \noindent
972: {\it Caller 1}'s request, while demonstrating initiative, implies a dialog
973: with an optional stage (which cannot be modeled by partial
974: evaluation). Such a situation has to be trapped by the interpreter, not
975: by partial evaluation. {\it Caller 2} does specify a staging, but his
976: staging poses constraints on the computer's initiative, not
977: his own. Such a `meta-dialog' facet~\cite{mixed-hci} requires the
978: ability to jump out
979: of the current dialog; VoiceXML provides many elements for describing
980: such transitions.
981:
982: VoiceXML also provides certain `impure' features and side-effects in
983: its programming model. For instance, after selecting a size (say, medium),
984: the caller could retake the initiative in a different part of the dialog
985: and select a size again (this time, large). This will cause the new
value to override any existing value in the {\tt size}
987: slot (see Fig.~\ref{fiaalgo}). In
988: our model, this implies the dynamic substitution of an earlier,
989: `evaluated out,' stage with a functional equivalent. Obviously, the dialog
990: manager has to maintain some state (across partial evaluations)
991: to accomplish this feature. We plan to investigate programming models suitable
992: for these aspects. In addition, we plan to extend our software model
beyond slot-and-filler structures, to include reasoning about and exploiting
context.
995:
996: Our long-term goal is to characterize mixed-initiative facets, not in
997: terms of initiative, interaction, or task models but in terms of the
998: opportunities for staging and the program transformation techniques that
999: can support such staging. This means that we can establish a
1000: taxonomy of mixed-initiative facets based on the transformation techniques
1001: (e.g., partial evaluation, slicing) needed to realize them.
1002: Such a taxonomy would also help connect the facets to design
1003: models for interactive software systems.
1004:
1005: \bibliographystyle{alpha}
1006: \bibliography{pepm}
1007:
1008: \end{document}
1009:
1010: