cs0110022/pepm.tex
1: \documentclass[11pt]{article} 
2: %\usepackage{ijcai01}
3: %\usepackage{fullpage,palatino}
4: \usepackage{fullpage,url}
5: \setlength{\oddsidemargin}{-0.25in}
6: \setlength{\evensidemargin}{-0.25in}
7: \setlength{\topmargin}{0.5in}
8: \setlength{\headheight}{0pt}
9: \setlength{\headsep}{0pt}
10: \setlength{\footskip}{0.35in}
11: \setlength{\textheight}{8.75in}
12: \setlength{\textwidth}{7in}
13: \setlength{\itemindent}{-0.5cm}
14: \setlength{\marginparwidth}{0in}
15: \setlength{\marginparsep}{0in}
16: %\renewcommand{\baselinestretch}{1.62}   % Double-space
17: \hyphenation{inform-ation-seeking inform-ation}
18: \newenvironment{descit}[1]{\begin{quote} \textit{#1}}{\end{quote}}
19: 
20: \input{psfig-dvips}
21: 
22: \newif\ifpdf
23: \ifx\pdfoutput\undefined
24:   \pdffalse
25: \else
26:   \pdfoutput=1
27:   \pdftrue
28: \fi
29: 
30: \ifpdf
31:   \usepackage[pdftex]{graphicx}
32:   \usepackage[pdftex]{color}
33:   \DeclareGraphicsExtensions{.pdf,.png,.jpg}
34: \else
35:   \usepackage[dvips]{graphicx}
36:   \usepackage[dvips]{color}
37:   \DeclareGraphicsExtensions{.eps,.epsi,.ps}
38: \fi
39: 
40: \usepackage{times}
41: %\usepackage{fancyheadings}
42: 
43: \pagestyle{plain}
44: %\thispagestyle{empty}
45: %\pagestyle{empty}
46: 
47: \def\midv{\mathop{\,|\,}}
48: \newtheorem{defn}{Definition}
49: \long\def\cbk#1{{\color{red}[CBK: #1]}}
50: \newlength\colwidth \setlength\colwidth{3.25in}
51: 
52: \title{Mixed-Initiative Interaction = Mixed Computation\footnote{This
53: work is supported in part by US National
54: Science Foundation grants DGE-9553458 and IIS-9876167.}}
55: 
56: %\author{}
57: \author{Naren Ramakrishnan, Robert Capra, and Manuel A. P\'{e}rez-Qui\~{n}ones\\ 
58: Department of Computer Science\\
59: Virginia Tech, Blacksburg, VA 24061, USA\\
60: Contact Email: {\tt naren@cs.vt.edu}}
61: 
62: \begin{document}
63: 
64: \maketitle
65: %\thispagestyle{empty}
66: %\pagestyle{empty}
67: 
68: \begin{abstract}
69: \noindent
70: We show that partial evaluation can be usefully viewed as 
71: a programming model for realizing mixed-initiative 
72: functionality in interactive applications. 
73: Mixed-initiative interaction between two participants is one 
74: where the parties can take turns at any time to change 
75: and steer the flow of interaction. We concentrate on 
76: the facet of mixed-initiative referred to as `unsolicited 
77: reporting' and demonstrate how out-of-turn interactions
78: by users can be modeled by `jumping ahead' to nested 
79: dialogs (via partial evaluation).  Our approach permits 
80: the view of dialog management systems in terms of their 
81: native support for staging and simplifying interactions; 
82: we characterize three different voice-based interaction 
83: technologies using this viewpoint. In particular, we 
84: show that the built-in form interpretation algorithm (FIA) 
85: in the VoiceXML dialog management architecture is actually 
86: a (well disguised) combination of an interpreter 
87: and a partial evaluator.
88: \end{abstract}
89: 
90: \newpage
91: \section{Introduction} 
92: \label{intro}
93: Mixed-initiative interaction~\cite{computational-mixed} has been studied 
94: for the past 30 years in the areas of artificial intelligence 
95: planning~\cite{prodigy}, human-computer interaction~\cite{mixed-hci}, and 
96: discourse analysis~\cite{coulthard}. As Novick and Sutton point 
97: out~\cite{mixed-notkin}, 
98: it is `one of those things that people think that they can recognize
99: when they see it even if they can't define it.' It
100: can be broadly viewed as a flexible interaction strategy between 
101: participants where the parties can take turns at any time to change 
102: and steer the flow of interaction. Human conversations are typically
103: mixed-initiative and, interestingly, so are interactions with some modern
104: computer systems. Consider the following two dialogs with a 
105: telephone pizza delivery service that has voice-recognition
106: capability (the line numbers are provided for ease of reference):
107: 
108: \begin{descit}{Dialog 1}
109: \vspace{-0.1in}
110: \begin{tabbing}
111: [x] \= abcdefab \= thiscanactuallybeamuchlongersentenceokay \kill
112: 0 \> {\bf Caller:} \> $\prec$calls Joe's Pizza on the phone$\succ$ \\
113: 1 \> {\bf System:} \> Thank you for calling Joe's pizza ordering system.\\
114: 2 \> {\bf System:} \> What size pizza would you like?\\
115: 3 \> {\bf Caller:} \> I'd like a medium, please.\\
116: 4 \> {\bf System:} \> What topping would you like on your pizza?\\
117: 5 \> {\bf Caller:} \> Pepperoni.\\
118: 6 \> {\bf System:} \> What type of crust do you want?\\
119: 7 \> {\bf Caller:} \> Uh, deep-dish.\\
120: 8 \> {\bf System:} \> So that is a medium pepperoni pizza with deep-dish crust.
121:              Is this correct?\\
122: 9 \> {\bf Caller:} \> Yes.\\
123: (conversation continues to get delivery and payment information)
124: \end{tabbing}
125: \end{descit}
126: 
127: \begin{descit}{Dialog 2}
128: \vspace{-0.1in}
129: \begin{tabbing}
130: [x] \= abcdefab \= thiscanactuallybeamuchlongersentenceokay \kill
131: 0 \> {\bf Caller:} \> $\prec$calls Joe's Pizza on the phone$\succ$ \\
132: 1 \> {\bf System:} \> Thank you for calling Joe's pizza ordering system.\\
133: 2 \> {\bf System:} \> What size pizza would you like?\\
134: 3 \> {\bf Caller:} \>  I'd like a sausage pizza, please.\\
135: 4 \> {\bf System:} \> Okay, sausage.\\
136: 5 \> {\bf System:} \> What size pizza would you like?\\
137: 6 \> {\bf Caller:} \> Medium.\\
138: 7 \> {\bf System:} \> What type of crust do you want?\\
139: 8 \> {\bf Caller:} \> Deep-dish.\\
140: 9 \> {\bf System:} \>  So that is a medium sausage pizza with deep-dish crust.
141:              Is this correct?\\
142: 10 \> {\bf Caller:} \> Yes.\\
143: (conversation continues to get delivery and payment information)
144: \end{tabbing}
145: \end{descit}
146: 
147: \noindent
148: Both these conversations involve the specification of a (size,topping,crust)
149: tuple to complete the pizza ordering procedure but differ 
150: in important ways. In the first dialog, the caller responds to the 
151: questions in the order they are posed by the system. The system has the
152: initiative at all times (other than, perhaps, on line 0) and such an 
153: interaction is thus
154: referred to as {\it system-initiated}. In the second dialog, when the
155: system prompts the caller about pizza size, he responds
156: with information about his choice of topping instead
157: (sausage; see line 3 of {\it Dialog 2}). Nevertheless, the conversation
158: is not stalled and the system continues with the other aspects of the
159: information gathering activity. In particular, the system registers that the
160: caller has specified a topping, skips its default question on this topic,
161: and repeats its question about the size (see line 5 of {\it Dialog 2}). The 
162: caller 
163: thus gained the initiative for a brief period during the conversation, 
164: before returning it to the system. A conversation that `mixes' these modes
165: of interaction in such arbitrary ways is said to be {\it mixed-initiative}.
166: 
167: \subsection{Tiers of Mixed-Initiative Interaction}
168: \label{tiers}
169: It is well accepted that mixed-initiative provides a more natural and
170: personalized mode of interaction. What remains a matter of debate, however,
171: is which qualities an interaction must possess to merit
172: classification as mixed-initiative~\cite{mixed-notkin}. In fact,
173: determining who has the initiative at a given point in an interaction can itself 
174: be a contentious issue! The role of intention in
175: an interaction and the underlying task goals also affect the characterization
176: of initiative. We will not attempt to settle this debate here but 
177: a few preliminary observations will be useful.
178: 
179: One of the basic levels of mixed-initiative is referred to
180: as {\it unsolicited reporting} in~\cite{allen-intelligent} and is illustrated
181: in {\it Dialog 2} above. In this facet, a participant 
182: provides information out-of-turn (in our case the caller, about his
183: choice of topping). Furthermore, the out-of-turn
184: interaction is not agreed upon in advance by the two participants. 
185: Novick and Sutton~\cite{mixed-notkin} stress that the unanticipated 
186: nature of out-of-turn interactions is important and that mere turn-taking 
187: (perhaps in a hardwired order) does not constitute 
188: mixed-initiative. Finally, notice that in {\it Dialog 2} there is a resumption
189: of the original questioning task once processing of the unsolicited response
190: is completed. In other applications, an unsolicited response might shift the
191: control to a new interaction sequence and/or abort the current interaction.
192: 
193: Another level of mixed-initiative involves {\it subdialog invocation};
194: for instance, the computer system might not have understood the user's
195: response and ask for clarifications (which amounts to it having 
196: the initiative). A final, sophisticated, form of mixed-initiative is one 
197: where participants negotiate with each other to determine initiative 
198: (as opposed to merely `taking the initiative')~\cite{allen-intelligent}:
199: 
200: \vspace{-0.05in}
201: \begin{descit}{Dialog 3}
202: \vspace{-0.1in}
203: \begin{tabbing}
204: abcdefabyr \= thiscanactuallybeamuchlongersentenceokay \kill
205: (with apologies to O. Henry)\\
206: {\bf Husband:} \> Della, something interesting happened today that I want to
207: tell you.\\
208: {\bf Wife:} \> I too have something exciting to tell you, Jim.\\
209: {\bf Husband:} \> Do you want to go first or shall I tell you my story?
210: \end{tabbing}
211: \end{descit}
212: 
213: In addition to models that characterize initiative, there are models
214: for designing dialog-based interaction systems.
215: Allen et al.~\cite{allen-ai} provide a taxonomy of such software models
216: --- finite-state machines, slot-and-filler
217: structures, frame-based methods, and more sophisticated models involving
218: planning, agent-based programming, and exploiting contextual information.
219: While mixed-initiative interaction can be studied in any of these models,
220: it is beyond the scope of this paper to address all or even a majority
221: of them. 
222: 
223: Instead, we concentrate on the view of (i) a dialog as a
224: task-oriented information assessment activity requiring the filling of a
225: set of slots,
226: (ii) where one of the participants in the dialog is a computer 
227: system and the other 
228: is a human, and (iii) where mixed-initiative arises from unsolicited 
229: reporting (by the human), involving out-of-turn 
230: interactions. This characterization includes many voice-based
231: interfaces to information (our pizza ordering dialog is an example) and
232: web sites modeling interaction by hyperlinks~\cite{pipe-tois}. 
233: In Section~\ref{ourmodel}, we show that partial evaluation can be 
234: usefully viewed as a programming model for such applications. 
235: Section~\ref{voice-tech} presents three different voice-based interaction
236: technologies and analyzes them in terms of their native support for
237: mixing initiative. Finally, Section~\ref{future}
238: discusses other facets of mixed-initiative and mentions
239: other software models to which our approach can be extended.
240: 
241: \vspace{-0.1in}
242: \section{Programming a Mixed-Initiative Application}
243: \vspace{-0.03in}
244: \label{ourmodel}
245: Before we outline the design of a system such as Joe's Pizza, we introduce
246: a notation~\cite{levinson,goffman} that captures basic elements 
247: of initiative and response in an interaction sequence. The notation expresses
248: the local organization of a dialog~\cite{manuel-thesis,manuel-chi} as 
249: adjacency pairs; for instance, {\it Dialog 1} is represented as:
250: 
251: \vspace{-0.05in}
252: {\center
253: \begin{tabbing}
254: (Ic \= Rs) \= (Is \= Rc) \= (Is \= Rc) \= (Is \= Rc) \= (Is \= Rc) \kill
255: (Ic \> Rs) \> (Is \> Rc) \> (Is \> Rc) \> (Is \> Rc) \> (Is \> Rc) \\
256: \,\,0 \> \,\,1 \> \,\,2 \> \,\,3 \> \,\,4 \> \,\,5 \> \,\,6 \> \,\,7 \> \,\,8 \> \,\,9 \\
257: \end{tabbing}}
258: 
259: \noindent
260: The line numbers given below the interaction sequence refer to the utterance
261: numbers in the dialog presented in Section~\ref{intro}. 
262: The letter I denotes who has the initiative --- caller (c) or the system (s) ---
263: and the letter R denotes who provides the response. It is easy to see
264: from this notation that {\it Dialog 1}  consists
265: of five turns and that the system has the initiative for the last
266: four turns.  The initial turn is modeled as the caller having the
267: initiative because he or she chose to place the phone call in the first place.
268: The system quickly takes the initiative after playing a greeting to
269: the caller (which is modeled here as the response to the caller's call). 
270: The subsequent four interactions then address three questions and a 
271: confirmation, all involving the system retaining the initiative (Is) and
272: the caller in the responding mode (Rc). Likewise,
273: the mixed-initiative interaction in {\it Dialog 2} is 
274: represented as:
275: 
276: {\center
277: \begin{tabbing}
278: (Ic \= Rs) \= (Iso \= (Ic \= Rs) \= Rc) \= (Is \= Rc) \= (Is \= Rc) \kill
279: (Ic \> Rs) \> (Is \> (Ic \> Rs) \> Rc) \> (Is \> Rc) \> (Is \> Rc) \\
280: \,\,0 \> \,\,1 \> \,\,2,5 \> \,\,3 \> \,\,4 \> \,\,6 \> \,\,7 \> \,\,8 \> \,\,9 \> \,\,10 \\
281: \end{tabbing}}
282: 
283: \noindent
284: In this case, the system takes the initiative in utterance 2 but instead
285: of responding to the question of size in utterance 3, the caller 
286: takes the initiative, causing
287: an `insertion' to occur in the interaction sequence (dialog)~\cite{levinson}. 
288: The system responds with an acknowledgement (`Okay, sausage.') in
289: utterance 4. This is represented as the nested pair (Ic Rs) above.
290: The system then re-focuses the dialog on the question of pizza size in
291: utterance 5 (thus retaking the initiative). In utterance 6 the
292: caller responds with the desired size (medium) and
293: the interaction proceeds as before, from this point. 
294: 
295: The notation is useful to describe the space of possible interactions that
296: are to be supported. For instance, utterances 0 and 1 have to proceed in order.
297: Utterances dealing with selection of (size,topping,crust) can then
298: be nested in any order and provide interesting
299: opportunities for mixing initiative. 
300: For instance, if a user is a frequent
301: customer of Joe's Pizza, he might take the initiative and specify all three
302: pizza attributes on the first available prompt:
303: 
304: \begin{descit}{Dialog 4}
305: \vspace{-0.1in}
306: \begin{tabbing}
307: [x] \= abcdefab \= thiscanactuallybeamuchlongersentenceokay \kill
308: 0 \> {\bf Caller:} \> $\prec$calls Joe's Pizza on the phone$\succ$ \\
309: 1 \> {\bf System:} \> Thank you for calling Joe's pizza ordering system.\\
310: 2 \> {\bf System:} \> What size pizza would you like?\\
311: 3 \> {\bf Caller:} \>  I'd like a sausage pizza, medium, and deep-dish.\\
312: (conversation continues with confirmation of order) 
313: \end{tabbing}
314: \end{descit}
315: 
316: \noindent
317: Finally, the utterances dealing with confirmation of the user's request 
318: can proceed only after choices of all three pizza attributes have been 
319: made. There are 13 possible interaction sequences (discounting permutations
320: of attributes specified in a given utterance) --- 1 possibility
321: of specifying everything in one utterance, 6 possibilities of specification
322: in two utterances, and 6 possibilities of specification in three
323: utterances. If we include permutations, there are 24 possibilities (our
324: calculations do not consider situations where, for instance, the system doesn't 
325: recognize the user's input and reprompts for information).
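
The counts above can be checked by brute force. The following Python sketch is an illustration only: names such as {\tt interaction\_sequences} are hypothetical and do not belong to any system described here. It assigns each slot to an utterance position and keeps only those assignments in which every utterance fills at least one slot.

\begin{verbatim}
```python
from collections import Counter
from itertools import product
from math import factorial

SLOTS = ("size", "topping", "crust")

def interaction_sequences(slots):
    """Enumerate ordered sequences of utterances, each filling a non-empty set of slots."""
    sequences = []
    for k in range(1, len(slots) + 1):              # k = number of utterances
        for assignment in product(range(k), repeat=len(slots)):
            if set(assignment) != set(range(k)):    # some utterance would fill nothing
                continue
            sequences.append(tuple(
                frozenset(s for s, u in zip(slots, assignment) if u == i)
                for i in range(k)
            ))
    return sequences

def count_with_permutations(sequences):
    """Also distinguish the order in which slots are named within one utterance."""
    total = 0
    for seq in sequences:
        ways = 1
        for utterance in seq:
            ways *= factorial(len(utterance))
        total += ways
    return total

sequences = interaction_sequences(SLOTS)
by_length = Counter(len(seq) for seq in sequences)
print(len(sequences), dict(by_length), count_with_permutations(sequences))
```
\end{verbatim}

Under these assumptions the script reports 13 sequences (1, 6, and 6 for one, two, and three utterances, respectively) and 24 once permutations within an utterance are counted.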
326: 
327: \begin{figure}
328: \centering
329: \begin{tabular}{cc}
330: \includegraphics[height=1.95in]{dubba1}
331: \hspace{0.1in}
332: \includegraphics[height=1.95in]{dubba3}
333: \end{tabular}
334: \caption{Designs of software systems for 
335: mixed-initiative interaction. (left) Traditional system architecture,
336: distinguishing between responsive and unsolicited inputs.
337: (right) Using partial evaluation to handle inputs uniformly.}
338: \label{dialog-designs}
339: \end{figure}
340: 
341: Many programming models 
342: view mixed-initiative sequences as requiring some
343: special attention to be accommodated. In particular, they rely on recognizing 
344: when a user has provided unsolicited 
345: input\footnote{We use the term `unsolicited
346: input' here to refer to expected but out-of-turn inputs as opposed to
347: completely unexpected (or out-of-vocabulary) inputs.}
348: and qualify a 
349: shift-in-initiative as a `transfer of 
350: control.'
351: This implies that the mechanisms that handle out-of-turn interactions 
352: are often different 
353: from those that realize purely system-directed interactions. 
354: Fig.~\ref{dialog-designs} (left) describes a typical software design. 
355: A dialog manager is in charge of
356: prompting the user for input, queueing messages onto an output 
357: medium, event processing, and managing the overall flow of interaction.
358: One of its inputs is a dialog script that contains a specification of
359: interaction and a set of slots that are to be filled. In our pizza example,
360: slots correspond to placeholders for values of size, topping, 
361: and crust. An interpreter determines the first unfilled
362: slot to be visited and presents any prompts 
363: for soliciting user input.
364: A responsive input requires mere slot filling whereas unsolicited inputs 
365: would require out-of-turn processing (involving a combination of slot 
366: filling and simplification). In turn, this causes a revision of the 
367: dialog script. The interpreter terminates when there is nothing left to
368: process in the script. While typical dialog managers
369: perform miscellaneous functions such as error control,
370: transferring to other scripts, and accessing scripts from a database, the 
371: architecture in Fig.~\ref{dialog-designs}
372: (left) focuses on the aspects most relevant to our presentation.
373: 
374: Our approach, on the other hand, is to think of a mixed-initiative
375: dialog as a program,
376: all of whose arguments are passed by reference and which correspond to the
377: slots comprising information assessment. As usual, an interpreter
378: in the dialog manager
379: queues up any applicable prompts to an output medium.
380: Both responsive and
381: unsolicited inputs by a user now correspond (uniformly)
382: to values for arguments; they are processed by partially evaluating 
383: the program with respect to these inputs. The result of partial evaluation
384: is another dialog (simplified as a result of user input) which is handed
385: back to the interpreter. This novel design
386: is depicted in Fig.~\ref{dialog-designs} (right) and a dialog script 
387: represented in a programmatic notation is given in Fig.~\ref{pizza-script}. 
388: Partial evaluation of Fig.~\ref{pizza-script} with respect to user
389: input will remove the conditionals for all slots that
390: are filled by the utterance (global variables are assumed to be
391: under the purview of the interpreter).
392: The reader can verify that a sequence of such partial evaluations
393: will indeed mimic the interaction sequence depicted in {\it Dialog 2} 
394: (and any of the other mixed-initiative sequences).
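
As a minimal sketch of this design (not the authors' implementation), the script of Fig.~\ref{pizza-script} can be modeled in Python as a list of slot/prompt pairs, one per conditional. The names {\tt partial\_evaluate} and {\tt run\_dialog} are hypothetical, and each caller utterance is assumed to have already been reduced to slot values by earlier processing stages.

\begin{verbatim}
```python
# Each entry stands for one "if (unfilled(slot)) { prompt }" conditional.
PIZZA_SCRIPT = [
    ("size", "What size pizza would you like?"),
    ("topping", "What topping would you like on your pizza?"),
    ("crust", "What type of crust do you want?"),
]

def partial_evaluate(script, known):
    """Specialize the script with respect to the slot values in `known`:
    the conditional of every slot that is now filled is removed."""
    return [(slot, prompt) for slot, prompt in script if slot not in known]

def run_dialog(script, utterances):
    """Interpreter loop: issue the prompt of the first remaining conditional,
    then partially evaluate the script with respect to the caller's input."""
    filled, prompts = {}, []
    utterances = iter(utterances)
    while script:
        prompts.append(script[0][1])
        answer = next(utterances)      # responsive or unsolicited, handled alike
        filled.update(answer)
        script = partial_evaluate(script, answer)
    return filled, prompts

# Caller utterances of Dialog 2, already reduced to slot values.
dialog2 = [{"topping": "sausage"}, {"size": "medium"}, {"crust": "deep-dish"}]
filled, prompts = run_dialog(PIZZA_SCRIPT, dialog2)
for prompt in prompts:
    print("System:", prompt)
```
\end{verbatim}

Feeding the three caller utterances of {\it Dialog 2} produces the prompt sequence size, size, crust, which matches utterances 2, 5, and 7 of that dialog; the acknowledgement in utterance 4 is omitted from the sketch. No branch of the code distinguishes a responsive answer from an unsolicited one.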
395: 
396: Partial evaluation serves two critical uses in our design. The first is
397: obvious, namely the processing of out-of-turn interactions (and any
398: appropriate simplifications to the dialog script). The more subtle advantage
399: of partial evaluation is its support for staging mixed-initiative
400: interactions. The 
401: mix-equation~\cite{jones,jones-pe-book} holds for every possible way
402: of splitting inputs into two categories, so the designer need not enumerate and
403: `trap' the ways
404: in which the computations can be staged. 
405: For instance, the nested pair in {\it Dialog 2} is supported as a natural 
406: consequence of our design, not by anticipating and reacting to an 
407: out-of-turn input. 
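
For reference, the mix equation can be written as follows, using the semantic-bracket notation that is customary in the partial evaluation literature. The symbols $p$ (a program), $s$ and $d$ (the inputs supplied now and later, respectively), and $\mathit{mix}$ (the partial evaluator) are introduced here only for illustration:
\[
  [\![p]\!]\,(s,d) \;=\; [\![\,[\![\mathit{mix}]\!]\,(p,s)\,]\!]\,(d).
\]
In {\it Dialog 2}, for example, $p$ is the script of Fig.~\ref{pizza-script}, $s$ is the caller's choice of topping, and $d$ comprises the size and crust supplied in later utterances; the residual program $[\![\mathit{mix}]\!]\,(p,s)$ is the simplified dialog that prompts only for size and crust.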
408: 
409: Another way to characterize the system designs in Fig.~\ref{dialog-designs} is
410: to say that Fig.~\ref{dialog-designs} (left) makes a distinction of responsive
411: versus unsolicited inputs, whereas Fig.~\ref{dialog-designs} (right) makes
412: a more fundamental
413: distinction of fixed-initiative (interpretation) versus
414: mixed-initiative (partial
415: evaluation). In other words, Fig.~\ref{dialog-designs} (right) carves
416: up an interaction sequence into (i) turns that are 
417: to be handled in the order they are modeled (by an interpreter),
418: and (ii) turns that can involve mixing of initiative (handled
419: by a partial evaluator). 
420: In the latter case, the computations are actually used as 
421: a {\it representation of interactions.} Since only mixed-initiative 
422: interactions involve multiple staging options
423: and since these are handled by the partial
424: evaluator, our design requires the {\it least} amount of specification 
425: (to support all interaction sequences). For instance, the script
426: in Fig.~\ref{pizza-script} models the parts that involve mixing of
427: initiative and helps realize all of the 13 possible interaction sequences.
428: At the same time it does not model the confirmation sequence of 
429: {\it Dialog 2} because the caller cannot confirm his order before selecting 
430: the three pizza attributes! This turn should be handled by 
431: straightforward interpretation. 
432: 
433: \begin{figure}
434: \centering
435: \begin{tabular}{|l|} \hline
436: {\tt pizzaorder(size,topping,crust)}\\
437: \{\\
438: \,\,\,\, {\tt if (unfilled(size))\{}\\
439: \,\,\,\,\,\,\,\, {\tt /* prompt for size */}\\
440: \,\,\,\, {\tt \}}\\
441: \,\,\,\, {\tt if (unfilled(topping))\{}\\
442: \,\,\,\,\,\,\,\, {\tt /* prompt for topping */}\\
443: \,\,\,\, {\tt \}}\\
444: \,\,\,\, {\tt if (unfilled(crust))\{}\\
445: \,\,\,\,\,\,\,\, {\tt /* prompt for crust */}\\
446: \,\,\,\, {\tt \}}\\
447: \}\\
448: \hline
449: \end{tabular}
450: \caption{Modeling a dialog script as a program parameterized by slot variables
451: that are passed by reference.}
452: \label{pizza-script}
453: \end{figure}
454: 
455: To the best of our knowledge, such a model of partial evaluation
456: for mixed-initiative interaction 
457: has not been proposed before. An extensive literature
458: search has revealed no related prior work. 
459: While computational models for mixed-initiative
460: interaction remain an active area of research~\cite{computational-mixed}, 
461: such work is characterized by keywords such as `controlling mixed-initiative
462: interaction,' `knowledge representation and reasoning strategies,' and
463: `multi-agent co-ordination.' There are even projects that talk about
464: `integrating' mixed-initiative interaction and partial evaluation to realize
465: an architecture for planning and learning~\cite{prodigy}. We are optimistic
466: that our work has the same historical significance as the relation
467: between explanation-based generalization (a learning technique in AI) 
468: and partial evaluation established by van Harmelen and Bundy 
469: in 1988~\cite{EBG_PE}.
470: 
471: \vspace{-0.1in}
472: \section{Software Technologies for Voice-Based Mixed-Initiative Applications}
473: \label{voice-tech}
474: One of the main contributions of our model is that it characterizes the
475: minimum amount of information needed to program a mixed-initiative interaction
476: sequence.
477: Once a programmer supplies a script such 
478: as that in Fig.~\ref{pizza-script}, mixed-initiative 
479: interaction is obtained, quite literally, `for free.' This means that 
480: we can use the design in Fig.~\ref{dialog-designs} (right) as a benchmark 
481: to compare and contrast the amount of specification required in other 
482: approaches. 
483: 
484: As indicated in Section~\ref{tiers}, our model is applicable to 
485: voice-based interaction technologies as well as web access via hyperlinks.
486: We concentrate on voice-based applications since interaction with web
487: sites is addressed in a related paper~\cite{pipe-tois} and because 
488: the design constraints in voice-based applications pose interesting
489: considerations for our model. In addition, a variety of commercial
490: technologies are available for voice-based applications (in contrast to
491: web sites) that will aid in comparative assessment.
492: 
493: \begin{figure}
494: \centering
495: \begin{tabular}{cc}
496: \includegraphics[height=2in]{spreco}
497: \end{tabular}
498: \caption{Basic components of a spoken language processing system.}
499: \label{spreco}
500: \end{figure}
501: 
502: \vspace{-0.1in}
503: \subsection{Basic Principles of Voice-Based Interaction}
504: Before we can study the programming of mixed-initiative in
505: a voice-based application, it will be helpful to understand
506: the basic architecture (see Fig.~\ref{spreco})
507: of a spoken language processing system. As a user speaks into the 
508: system, the sounds produced are 
509: captured by a microphone and converted into a digital signal by an 
510: analog-to-digital converter. In telephone-based systems
511: (the VoiceXML architecture covered later in the paper is geared toward
512: this mode), the microphone is part of the
513: telephone handset and the analog-to-digital conversion is typically done by
514: equipment in the telephone network (in some cellular telephony models,
515: the conversion would be performed in the handset itself).
516: 
517: The next stage (feature extraction) prepares the digital speech signal to be
518: processed by the speech recognizer. Features of the signal important for 
519: speech recognition are extracted from the original signal, organized as an
520: attribute vector, and passed to the recognizer.  
521: 
522: Most modern speech recognizers use Hidden Markov Models (HMMs) and associated
523: algorithms to represent, train, and recognize speech.  HMMs are
524: probabilistic models that must be trained on an input set of data.  A common
525: technique is to create sets of acoustic HMMs that model phonetic units of
526: speech in context. These models are created from a training set of speech
527: data that is (hopefully) representative of the population of users who will
528: use the system. A language model is also created prior to performing
529: recognition. The language model is typically used to specify valid
530: combinations of the HMMs
531: at a word- or sentence-level.  In this way, the
532: language model specifies the words, phrases, and sentences that the recognizer
533: can attempt to recognize.  The process of recognizing a new input speech
534: signal is then accomplished using efficient search algorithms (such as Viterbi
535: decoding) to find the best matching HMMs, given the constraints of the language
536: model.  The output of the speech recognizer can take several different forms,
537: but the basic result is a text string that is the recognizer's best guess of
538: what the user said.  Many recognizers can provide additional information such
539: as a lattice of results, or an N-best ranked list of results (in case the later
540: stages of processing wish to reject the recognizer's top choice).  A good
541: introduction to speech recognition is available in~\cite{martin-speech}.
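
The search step can be illustrated with a toy example. The Python sketch below runs Viterbi decoding over a two-state HMM; the states, observation symbols, and probabilities are invented for illustration and are far smaller than the acoustic models of a real recognizer, which would typically also work with log-probabilities and a pruned search.

\begin{verbatim}
```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path and its probability."""
    # best[t][s] = (probability of the best path ending in s at time t, predecessor)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            score, prev_state = max((prev[q][0] * trans_p[q][s], q) for q in states)
            column[s] = (score * emit_p[s][obs], prev_state)
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, prob

# Toy model (all values are assumptions made for this example).
STATES = ("S1", "S2")
START = {"S1": 0.6, "S2": 0.4}
TRANS = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
EMIT = {"S1": {"a": 0.5, "b": 0.4, "c": 0.1}, "S2": {"a": 0.1, "b": 0.3, "c": 0.6}}

path, prob = viterbi(["a", "b", "c"], STATES, START, TRANS, EMIT)
print(path, prob)
```
\end{verbatim}

For the observation sequence a, b, c this toy model yields the path S1, S1, S2 with probability of about 0.01512.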
542: 
543: The stages after speech recognition vary depending on the application and the 
544: types of processing required. Fig.~\ref{spreco} presents
545: two additional phases that are commonly included in spoken language processing 
546: systems in one form or another. We will broadly refer to the first 
547: post-recognition stage as natural language processing (NLP). NLP is a 
548: large field in its own right and includes many sub-areas such as parsing, 
549: semantic interpretation, knowledge representation, and speech acts; an 
550: excellent introduction is available in Allen's classic~\cite{allen-nlp}. Our 
551: presentation in this paper has assumed NLP support for slot-filling (i.e.,
552: determining values for slot variables from user input).
553: 
554: This is commonly achieved by defining parts of a language model
555: and associating them with slots. The language model could take two 
556: major forms --- context-free grammars and statistical models (such 
557: as n-grams). Here we
558: focus on the former: in this approach, slots can be specified within
559: the productions of a context-free grammar (akin to an attribute
560: grammar) or they can be associated with
561: the non-terminals in the grammar.
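
The association between grammar fragments and slots can be sketched as follows. This Python fragment is a keyword-spotting stand-in for a real context-free grammar: the vocabulary in {\tt GRAMMAR} and the function {\tt fill\_slots} are hypothetical, and each slot is simply tied to the terminals that may fill it.

\begin{verbatim}
```python
import re

# Hypothetical vocabulary: each slot is associated with the terminals that fill it.
GRAMMAR = {
    "size": ["small", "medium", "large"],
    "topping": ["pepperoni", "sausage", "mushroom"],
    "crust": ["deep-dish", "thin"],
}

def fill_slots(utterance):
    """Return the slot values mentioned in a recognized text string."""
    # Split into lowercase words, keeping hyphenated words such as "deep-dish" whole.
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", utterance.lower())
    slots = {}
    for slot, terminals in GRAMMAR.items():
        for word in words:
            if word in terminals:
                slots[slot] = word     # first matching terminal fills the slot
                break
    return slots

print(fill_slots("I'd like a sausage pizza, please."))
print(fill_slots("I'd like a sausage pizza, medium, and deep-dish."))
```
\end{verbatim}

The first call fills only the topping slot, as in {\it Dialog 2}; the second fills all three slots at once, as in {\it Dialog 4}.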
562: 
563: We will refer to the next phase of processing as simply `dialog management'
564: (see Fig.~\ref{spreco}). In this phase, augmented results from 
565: the NLP stage are incorporated into the dialog and any associated logic
566: of the application is executed. The job of the dialog manager is to 
567: track the proceedings of the dialog and to generate appropriate 
568: responses. This is often done within some logical processing 
569: framework and a dialog model (in our case, a dialog script)
570: is supplied as input that is specific to the particular application being 
571: designed. The execution of the logic on the dialog model (script) results 
572: in a response that can be presented back to the user. Sometimes
573: response generation is separated out into a subsequent stage.
574: 
575: \begin{figure}
576: \centering
577: \begin{tabular}{cc}
578: \includegraphics[width=3in]{httparch}
579: \hspace{0.1in}
580: \includegraphics[width=3.9in]{vxmlarch}
581: \end{tabular}
582: \caption{(left) Accessing HTML documents via an HTTP web server.
583: (right) Accessing VoiceXML documents via an HTTP web server.}
584: \label{html-vxml}
585: \end{figure}
586: 
587: \begin{figure}
588: \centering
589: \small
590: \begin{verbatim}
591: <?xml version="1.0"?>
592: <vxml version="1.0">
593: <!-- pizza.vxml
594:      A simple pizza ordering demo to illustrate some basic elements
595:      of VoiceXML.  Several details have been omitted from this demo
596:      to help make the basic ideas stand out. -->
597:   <form id="welcome">
598:     <block name="block1">
599:       <prompt> Thank you for calling Joe's pizza ordering system. </prompt>
600:       <goto next="#place_order" />
601:     </block>
602:   </form>
603: 
604:   <form id="place_order">
605:     <field name="size">
606:       <prompt> What size pizza would you like? </prompt>
607:     </field>
608: 
609:     <field name="topping">
610:       <prompt> What topping would you like on your pizza? </prompt>
611:     </field>
612: 
613:     <field name="crust">
614:       <prompt> What type of crust do you want? </prompt>
615:     </field>
616: 
617:     <field name="verify">
618:       <prompt>
619:         So that is a <value expr="size"/> <value expr="topping"/> pizza
620:         with <value expr="crust"/> crust.
621:         Is this correct?
622:       </prompt>
623:       <grammar> yes | no </grammar>
624:     </field>
625: 
626:     <filled>
627:       <if cond="verify=='no'">
628:          <clear namelist="size topping verify crust"/>
629:          <prompt> Sorry.  Your order has been canceled. </prompt>
630:       <else/>
631:          <prompt>Thank you for ordering from Joe's pizza.</prompt>
632:       </if>
633:     </filled>
634: 
635:   </form>
636: </vxml>
637: \end{verbatim}
638: \caption{Modeling the pizza ordering dialog in a VoiceXML document.}
639: \label{vpizza}
640: \end{figure}
641: 
642: \vspace{-0.1in}
643: \subsection{The VoiceXML Dialog Management Architecture}
644: There are many technologies and delivery mechanisms available for
645: implementing Fig.~\ref{spreco}'s basic components. A popular 
646: implementation can be seen in the VoiceXML dialog management architecture.
647: VoiceXML is a markup language designed to simplify the construction of
648: voice-response applications~\cite{voicexml}. Initiated by a 
649: committee comprising AT\&T, IBM, Lucent Technologies, and Motorola,
650: it has emerged as a standard in telephone-based voice user interfaces 
651: and in delivering web content via voice. We will hence cover this 
652: architecture in detail.
653: 
654: The basic idea is to describe interaction
655: sequences using a markup notation in a VoiceXML {\it document.} As the
656: VoiceXML specification~\cite{voicexml} indicates, a VoiceXML document
657: constitutes a conversational finite state machine and describes a
658: sequence of interactions (both fixed- and mixed-initiative are supported).
659: A web server can serve VoiceXML documents using the HTTP 
660: protocol (Fig.~\ref{html-vxml} (right)), just as easily as HTML documents 
661: are currently served over the Internet (Fig.~\ref{html-vxml} (left)).
662: In addition, voice-based applications require a suitable delivery
663: platform, illustrated by a telephone in Fig.~\ref{html-vxml} (right). The 
664: voice-browser platform in Fig.~\ref{html-vxml} (right)
665: includes the VoiceXML interpreter which processes the
666: documents, monitors user inputs, streams messages,
667: and performs other functions expected of a dialog management system. Besides
668: the VoiceXML interpreter, the voice-browser platform includes speech 
669: recognizers, speech synthesizers, and telephony interfaces to help 
670: realize important aspects of voice-based interaction.
671: 
672: Dialog specification in a VoiceXML document involves organizing
673: a sequence of {\it forms} and {\it menus}. Forms specify a set of
674: slots (called field item variables) that are to be filled by user input. Menus
675: are syntactic shorthands (much like a {\tt case} construct); since they
676: involve only one field item variable (argument), there are no opportunities
677: for mixing initiative. We do not discuss menus further in this paper.
678: An example VoiceXML document for our pizza application is given 
679: in Fig.~\ref{vpizza}.
680: 
681: \begin{figure}
682: \small
683: \centering
684: \begin {tabular}{|p{6.6in}|}\hline
685: \vspace{-0.2in}
686: \begin{verbatim}
#JSGF V1.0;

grammar sizetoppingcrust;

public <sizetoppingcrust> =
   <size> {this.size=$} [<topping> {this.topping=$}] [<crust> {this.crust=$}] |
   <size> {this.size=$} <crust> {this.crust=$} <topping> {this.topping=$} |
   <topping> {this.topping=$} [<crust> {this.crust=$}] [<size> {this.size=$}] |
   <topping> {this.topping=$} <size> {this.size=$} <crust> {this.crust=$} |
   <crust> {this.crust=$} [<size> {this.size=$}] [<topping> {this.topping=$}] |
   <crust> {this.crust=$} <topping> {this.topping=$} <size> {this.size=$};

<size> = small | medium | large;
<topping> =  sausage | pepperoni | onions | green peppers;
<crust> = regular | deep dish | thin;
702: \end{verbatim}
703: \\\hline
704: \end{tabular}
705: \caption{A form-level grammar to be used in conjunction with
706: the script in Fig.~\ref{vpizza} to realize mixed-initiative interaction.
707: The above productions for {\tt sizetoppingcrust} cover all possibilities
708: of filling slot variables from user input, including multiple slots filled
709: by a given utterance, and various permutations of specifying pizza attributes.}
710: \label{formgram}
711: \end{figure}
712: 
713: As shown in Fig.~\ref{vpizza}, the pizza dialog consists of two forms. The
714: first form ({\tt welcome}) merely welcomes the user and transitions to the 
715: second. The {\tt place\_order} form involves four {\tt field}s 
716: (slot variables) --- the first three cover the pizza attributes and the
717: fourth models the confirmation variable (recall the dialogs in
718: Section~\ref{intro}). In particular, prompts for soliciting user input
719: in each of the fields are specified in Fig.~\ref{vpizza}. 
720: 
Interactions in a VoiceXML application proceed just as in a web application,
except that instead of clicking on a hyperlink (to go to a new state), the
723: user talks into a microphone. The VoiceXML interpreter then
724: determines the next state to transition to. Any appropriate responses
725: (to user input) and prompts are
726: delivered over a speaker. The 
727: core of the interpreter is a so-called form interpretation algorithm 
728: (FIA) that drives the interaction. 
729: In Fig.~\ref{vpizza}, the fields provide for a fixed-initiative, system-directed
730: interaction. The FIA simply visits all fields in the order they are presented
731: in the document.
732: Once all fields are filled, a check is made to ensure that
733: the confirmation was successful; if not, the fields are cleared (notice
734: the {\tt clear namelist} tag) and the FIA will proceed to 
735: {\tt prompt} for the inputs again,
736: starting from the first unfilled field --- {\tt size}. 
737: 
738: \begin{figure}
739: \centering
740: \small
741: \begin {tabular}{|p{6in}|}\hline
742: \vspace{-0.2in}
743: \begin{verbatim}
While (true)
{
        // SELECT PHASE
        Select the first form item with an unsatisfied guard condition
             (e.g., unfilled)
          If no such form item, exit

        // COLLECT PHASE
        Queue up any prompts for the form item 
        Get an utterance from the user

        // PROCESS PHASE
        foreach (slot in user's utterance)
        {
               if (slot corresponds to a field item) {
                    copy slot values into field item variables
                    set field item's `just_filled' flag
               }
        }
        // some code for executing any `filled' actions triggered
}
765: \end{verbatim}
766: \\\hline
767: \end{tabular}
768: \caption{Outline of the form interpretation algorithm (FIA) in
769: the VoiceXML dialog management architecture. Adapted from~\cite{voicexml}.}
770: \label{fiaalgo}
771: \end{figure}
772: 
773: \begin{figure}
774: \small
775: \centering
776: \begin {tabular}{|p{5in}|}\hline
777: \vspace{-0.2in}
778: \begin{verbatim}
#JSGF V1.0;

grammar sizetoppingcrust;

public <sizetoppingcrust> = <word>*;

<word> = <size> {this.size=$} | 
         <crust> {this.crust=$} | 
         <topping> {this.topping=$};

<size> = small | medium | large;
<topping> =  sausage | pepperoni | onions | green peppers;
<crust> = regular | deep dish | thin;
792: \end{verbatim}
793: \\\hline
794: \end{tabular}
\caption{An alternative form-level grammar to realize mixed-initiative
interaction with
the script in Fig.~\ref{vpizza}.}
798: \label{formgram2}
799: \end{figure}
800: 
801: The form in Fig.~\ref{vpizza} is referred to as a directed one since the
802: computer has the initiative at all times and the {\tt field}s are filled
803: in a strictly sequential order. To make the interaction mixed-initiative
804: (with respect to {\tt size}, {\tt crust}, and {\tt topping}),
805: the programmer merely has to specify a so-called
806: {\it form-level grammar} that describes possibilities for slot-filling from
807: a user utterance. An example
808: form-level grammar file ({\tt sizetoppingcrust.gram}) that covers all 
809: possibilities is given in
810: Fig.~\ref{formgram}. The grammar is associated with the dialog script
811: by including the line:
812: \begin{verbatim}
813:     <grammar src="sizetoppingcrust.gram" type="application/x-jsgf"/>
814: \end{verbatim}
815: just before the definition of
816: the first {\tt field} (size) in Fig.~\ref{vpizza}.
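
As a check on the claim that Fig.~\ref{formgram} covers all possibilities,
assume that an utterance mentions each attribute at most once. An utterance
that names $k$ of the three attributes can order them in $3!/(3-k)!$ ways, so
the number of distinct slot sequences is
\[
\sum_{k=1}^{3} \frac{3!}{(3-k)!} \;=\; 3 + 6 + 6 \;=\; 15 .
\]
The grammar accounts for exactly these: each of the three alternatives with two
optional parts expands to four sequences, and each of the remaining three
alternatives contributes one, giving $3 \cdot 4 + 3 \cdot 1 = 15$ sequences
with no repetitions.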
817: 
818: The form-level grammar contains productions for the various choices 
819: available for size, topping, and crust and also qualifies 
820: all possible parses for
821: a given utterance (modeled by the non-terminal {\tt sizetoppingcrust}). Any
822: valid combination of the three pizza aspects uttered by the user (in
823: any order) is recognized and the appropriate slot variables are instantiated.
824: To see why this also achieves mixed-initiative, let us consider the FIA
825: in more detail.
826: 
827: Fig.~\ref{fiaalgo} only reproduces the salient aspects of the FIA relevant
828: for our discussion. Compare the basic elements of the FIA to the stages in
829: Fig.~\ref{dialog-designs} (right). The Select phase corresponds to the
830: interpreter, the Collect phase gathers the user input, and
831: actions taken in the Process phase mimic the partial evaluator. Recall that 
832: `programs'
833: (scripts) in VoiceXML can be modeled by finite-state machines, hence
834: the mechanics of partial evaluation are considerably simplified and just
835: amount to filling the slot and removing it from further consideration.
836: Since the FIA repeatedly executes till there are no remaining form items,
837: the processing phase (Process) is effectively parameterized by the form-level
838: grammar file in Fig.~\ref{formgram}. In other words, the form-level grammar 
839: file not only enables slot filling, {\it it also implicitly directs the 
840: staging of interactions for mixed-initiative.} When the user
specifies `pepperoni medium' in an utterance, not only does the grammar
file enable the recognition of the slots they correspond to (topping and size),
it also directs the FIA to simplify these slots (and remove them from
any subsequent interaction).
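
The following Python sketch illustrates how the three phases of
Fig.~\ref{fiaalgo} produce this behavior. It is a minimal illustration under
stated assumptions, not the FIA of the VoiceXML specification: the names
{\tt parse} and {\tt fia} are hypothetical, a scripted list of strings stands
in for the speech recognizer, and a simple phrase lookup stands in for the
form-level grammar.
\begin{verbatim}
# Hypothetical vocabulary, taken from the terminals of the pizza grammar.
VOCAB = {
    "size": ["small", "medium", "large"],
    "topping": ["sausage", "pepperoni", "onions", "green peppers"],
    "crust": ["regular", "deep dish", "thin"],
}
# Invert the vocabulary: phrase -> slot it fills.
PHRASES = {p: slot for slot, ps in VOCAB.items() for p in ps}


def parse(utterance):
    """Return (slot, value) pairs in spoken order; unknown words are skipped."""
    words = utterance.lower().split()
    found, i = [], 0
    while i < len(words):
        for n in (2, 1):  # prefer two-word phrases such as "deep dish"
            phrase = " ".join(words[i:i + n])
            if i + n <= len(words) and phrase in PHRASES:
                found.append((PHRASES[phrase], phrase))
                i += n
                break
        else:
            i += 1  # no known phrase starts at this word
    return found


def fia(fields, utterances):
    """Run the loop on scripted replies; return (fields prompted, slots)."""
    filled, asked = {}, []
    replies = iter(utterances)
    while True:
        # SELECT: the first field that is still unfilled
        pending = [f for f in fields if f not in filled]
        if not pending:
            break
        # COLLECT: prompt for that field and obtain an utterance
        asked.append(pending[0])
        reply = next(replies, None)
        if reply is None:
            break  # the script of replies is exhausted
        # PROCESS: copy every recognized slot into its field
        for slot, value in parse(reply):
            if slot in fields:
                filled[slot] = value
    return asked, filled
\end{verbatim}
With the fields {\tt size}, {\tt topping}, and {\tt crust} and the scripted
replies `pepperoni medium' and `thin', the sketch prompts only for {\tt size}
and {\tt crust}: the first reply fills two slots at once, so the prompt for
{\tt topping} is never issued.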
845: 
846: The form-level grammar file shown in Fig.~\ref{formgram} 
847: (which is also a specification of interaction staging) may make
848: VoiceXML's design appear overly complex. In reality, however,
849: we could have used the vanilla form-level 
850: grammar file in Fig.~\ref{formgram2}. While helping to 
realize mixed-initiative with Fig.~\ref{vpizza}, the new
form-level grammar file (like our model) also allows the possibility of 
utterances such as `pepperoni pepperoni,' or even, `pepperoni sausage!' 
854: Suitable semantics for such situations (including the role of 
855: side-effects) can be defined and accommodated in both the VoiceXML 
856: model and ours. It should thus be obvious to the reader that VoiceXML's 
857: dialog management architecture is actually implementing a mixed 
858: evaluation model (for conversational finite state machines), comprising 
859: interpretation and partial evaluation.
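
Two candidate semantics for repeated slots can be sketched in Python as
follows. Both policies are assumptions chosen for illustration; neither is
prescribed by the VoiceXML specification or by the model of this paper. The
hypothetical functions {\tt last\_wins} and {\tt reject\_conflicts} take the
(slot, value) pairs recognized in one utterance, in spoken order.
\begin{verbatim}
def last_wins(pairs):
    """Later mentions of a slot override earlier ones."""
    slots = {}
    for slot, value in pairs:
        slots[slot] = value
    return slots


def reject_conflicts(pairs):
    """Keep a slot only if all of its mentions agree; report the rest."""
    slots, conflicts = {}, set()
    for slot, value in pairs:
        if slot in slots and slots[slot] != value:
            conflicts.add(slot)  # two different values for one slot
        slots.setdefault(slot, value)
    accepted = {s: v for s, v in slots.items() if s not in conflicts}
    return accepted, conflicts
\end{verbatim}
Under the first policy `pepperoni sausage' leaves {\tt topping} set to
sausage; under the second the slot stays unfilled and is reported as a
conflict, so it would be prompted for again. Both policies treat
`pepperoni pepperoni' as a single mention of pepperoni.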
860: 
861: The VoiceXML specification~\cite{voicexml} refers to the form-level file 
862: as merely a `grammar file,' when it is actually also a specification of 
863: staging. Even though the grammar file serves the role of a language 
864: model in a voice application, we believe that
865: separating its two functionalities is important in understanding
866: mixed-initiative system design. 
867: %If a statistical n-gram model served
868: %as the language model (instead of context-free grammars), such a distinction
869: %would be easy to make.
870: A case in point is our study of personalizing 
871: interaction with web sites~\cite{pipe-tois}. There is no requirement for 
872: a `grammar file,' as there is usually no ambiguity about user clicks and 
873: typed-in keywords. In this context, the functionality provided by our model
874: is actually unmatched by any existing web-based interaction system (as
875: web interfaces are not typically designed for mixing initiative). A way to
876: incorporate mixed-initiative interaction into an existing interaction
877: at a web site is described in~\cite{pipe-tois}. 
878: 
879: \begin{table}
880: \centering
881: \begin{tabular}{|l|c|c|} \hline
882: \multicolumn{1}{|l|} {Software} &
883: \multicolumn{1}{c|} {Support for} &
884: \multicolumn{1}{c|} {Support for} \\ 
885: \multicolumn{1}{|l|} {Technology} &
886: \multicolumn{1}{c|} {Slot Simplification} &
887: \multicolumn{1}{c|} {Interaction Staging} \\ \hline
888: VoiceXML & $\surd$ & $\surd$ \\
889: Slot Filling Systems & $\surd$ & $\times$\\
890: Recognizer-Only APIs & $\times$ & $\times$ \\
891: \hline
892: \end{tabular}
893: \caption{Comparison of software technologies for voice-based
894: mixed-initiative applications.}
895: \label{compare-table}
896: \end{table}
897: 
898: \vspace{-0.1in}
899: \subsection{Other Implementation Technologies}
900: VoiceXML's FIA thus includes native support for slot filling, slot
901: simplification, and interaction staging. All of these are functions
902: enabled by partial evaluation in our model. Table~\ref{compare-table}
903: contrasts two other implementation approaches in terms of these aspects. 
904: In a purely slot-filling system, native support
905: is provided for simplifying slots from user utterances but extra code
906: needs to be written to model the control logic (for instance,
907: `the user still didn't specify his choice of size, so the question for
908: size should be repeated.'). Several commercial speech recognition vendors
909: provide APIs that operate at this level. In addition, many vendors support
910: low-level APIs that provide basic access to recognition results (i.e.,
911: text strings) but do not perform any additional processing. We refer
912: to these as recognizer-only APIs.
913: They serve more as raw 
914: speech recognition engines and require significant programming to first 
915: implement a slot-filling engine and, later, control logic to mimic all
possible opportunities for staging. Examples of the latter two technologies
917: can be seen in the commercial spoken dialog systems
918: market (from companies such as Nuance, IBM, and AT\&T). The study presented 
919: in this paper suggests a systematic way by which
920: their capabilities for mixed-initiative interaction can be assessed.
921: 
922: \vspace{-0.1in}
923: \section{Discussion}
924: \label{future}
925: Our work makes contributions to both partial evaluation and 
926: mixed-initiative interaction. For the partial evaluation community, we
927: have identified a novel application where the motivation is the staging 
928: of interaction (rather than speedup). Since programs (dialogs) are
929: used as specifications of interaction, they are {\it written to be
930: partially evaluated}; partial evaluation is hence not an `afterthought' 
931: or an optimization. A program can thus be thought of as a
932: compaction of all possible interaction sequences that involve mixing 
933: initiative. An interesting research issue is:
934: Given (i) a set of interaction sequences, and (ii) addressable information
935: (such as arguments and slot variables), determine (iii) the smallest program so
936: that every interaction sequence can be staged in the model 
937: of Fig.~\ref{dialog-designs} (right). This requires algorithms
938: to automatically decompose parts of interaction sequences into those
939: that are best addressed in the interpreter and those that can benefit from
940: representation and specialization by the partial evaluator. 
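
To make the notion of compaction concrete, the following Python sketch
enumerates interaction sequences for a set of slots. It assumes, for
illustration only, that an interaction sequence is an ordered grouping of the
slots into one or more utterances and that the order of words within an
utterance is immaterial; the function name {\tt stagings} is hypothetical.
\begin{verbatim}
from itertools import combinations


def stagings(slots):
    """Yield every way of supplying `slots` over one or more utterances.

    Each result is a tuple of utterances; each utterance is a tuple of
    the slots it fills.
    """
    slots = tuple(slots)
    if not slots:
        yield ()  # nothing left to ask
        return
    for k in range(1, len(slots) + 1):
        # choose the slots filled by the first utterance
        for first in combinations(slots, k):
            rest = [s for s in slots if s not in first]
            # stage the remaining slots in every possible way
            for tail in stagings(rest):
                yield (first,) + tail
\end{verbatim}
For the three pizza slots the sketch yields 13 sequences, ranging from a
single utterance that supplies all three attributes to the six fully
sequential orderings. Under this assumption, a single three-slot form stands
for all of them.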
941: 
942: For mixed-initiative interaction, we have presented a 
943: programming model that accommodates all possibilities of staging, without 
explicit enumeration. The model makes a distinction between fixed-initiative
(which has to be explicitly programmed) and mixed-initiative 
946: (specifications of which can be elegantly compressed for subsequent
947: partial evaluation). We have identified instantiations of this model in
948: VoiceXML and slot-filling APIs. We hope this observation will help
949: system designers gain additional insight into voice application design
950: strategies. 
951: 
952: It should be recalled that there are various facets of mixed-initiative 
953: that are not addressed in this paper. Extending our programming model to cover
954: these facets is an immediate direction of future research. For example, 
955: VoiceXML's design currently supports dialogs such as 
956: the following:
957: 
958: \begin{descit}{Dialog 5}
959: \vspace{-0.1in}
960: \begin{tabbing}
961: [x] \= abcdefab2 \= thiscanactuallybeamuchlongersentenceokay \kill
962: 1 \> {\bf System:} \> Thank you for calling Joe's pizza ordering system.\\
963: 2 \> {\bf System:} \> What size pizza would you like?\\
964: 3 \> {\bf Caller 1:} \> What sizes do you have?\\
965: 3 \> {\bf Caller 2:} \> Err.. Why don't you ask me the questions in
966: topping-crust-size order?\\
967: \end{tabbing}
968: \end{descit}
969: \vspace{-0.2in}
970: 
971: \noindent
972: {\it Caller 1}'s request, while demonstrating initiative, implies a dialog
973: with an optional stage (which cannot be modeled by partial
974: evaluation). Such a situation has to be trapped by the interpreter, not
975: by partial evaluation. {\it Caller 2} does specify a staging, but his 
976: staging poses constraints on the computer's initiative, not 
977: his own. Such a `meta-dialog' facet~\cite{mixed-hci} requires the 
978: ability to jump out 
979: of the current dialog; VoiceXML provides many elements for describing 
980: such transitions.
981: 
982: VoiceXML also provides certain `impure' features and side-effects in
983: its programming model. For instance, after selecting a size (say, medium),
984: the caller could retake the initiative in a different part of the dialog
and select a size again (this time, large). This will cause the new 
value to override any existing value in the {\tt size} 
slot (see Fig.~\ref{fiaalgo}). In
988: our model, this implies the dynamic substitution of an earlier,
989: `evaluated out,' stage with a functional equivalent. Obviously, the dialog
990: manager has to maintain some state (across partial evaluations)
991: to accomplish this feature. We plan to investigate programming models suitable 
992: for these aspects. In addition, we plan to extend our software model 
993: beyond slot-and-filler structures, to include reasoning and exploiting 
994: context. 
995: 
996: Our long-term goal is to characterize mixed-initiative facets, not in
997: terms of initiative, interaction, or task models but in terms of the
998: opportunities for staging and the program transformation techniques that
999: can support such staging. This means that we can establish a 
1000: taxonomy of mixed-initiative facets based on the transformation techniques
1001: (e.g., partial evaluation, slicing) needed to realize them.
1002: Such a taxonomy would also help connect the facets to design
1003: models for interactive software systems.
1004: 
1005: \bibliographystyle{alpha}
1006: \bibliography{pepm}
1007: 
1008: \end{document}
1009: 
1010: