
1: \documentclass[11pt]{article}

2: %\usepackage{ijcai01}

3: %\usepackage{fullpage,palatino}

4: \usepackage{fullpage,url}

5: \setlength{\oddsidemargin}{-0.25in}

6: \setlength{\evensidemargin}{-0.25in}

7: \setlength{\topmargin}{0.5in}

8: \setlength{\headheight}{0pt}

9: \setlength{\headsep}{0pt}

10: \setlength{\footskip}{0.35in}

11: \setlength{\textheight}{8.75in}

12: \setlength{\textwidth}{7in}

13: \setlength{\itemindent}{-0.5cm}

14: \setlength{\marginparwidth}{0in}

15: \setlength{\marginparsep}{0in}

16: %\renewcommand{\baselinestretch}{1.62}   % Double-space

17: \hyphenation{inform-ation-seeking inform-ation}

18: \newenvironment{descit}[1]{\begin{quote} \textit{#1}}{\end{quote}}

19:

20: \input{psfig-dvips}

21:

22: \newif\ifpdf

23: \ifx\pdfoutput\undefined

24:   \pdffalse

25: \else

26:   \pdfoutput=1

27:   \pdftrue

28: \fi

29:

30: \ifpdf

31:   \usepackage[pdftex]{graphicx}

32:   \usepackage[pdftex]{color}

33:   \DeclareGraphicsExtensions{.pdf,.png,.jpg}

34: \else

35:   \usepackage[dvips]{graphicx}

36:   \usepackage[dvips]{color}

37:   \DeclareGraphicsExtensions{.eps,.epsi,.ps}

38: \fi

39:

40: \usepackage{times}

41: %\usepackage{fancyheadings}

42:

43: \pagestyle{plain}

44: %\thispagestyle{empty}

45: %\pagestyle{empty}

46:

47: \def\midv{\mathop{\,|\,}}

48: \newtheorem{defn}{Definition}

49: \long\def\cbk#1{{\color{red}[CBK: #1]}}

50: \newlength\colwidth \setlength\colwidth{3.25in}

51:

52: \title{Mixed-Initiative Interaction = Mixed Computation\footnote{This

53: work is supported in part by US National

54: Science Foundation grants DGE-9553458 and IIS-9876167.}}

55:

56: %\author{}

57: \author{Naren Ramakrishnan, Robert Capra, and Manuel A. P\'{e}rez-Qui\~{n}ones\\

58: Department of Computer Science\\

59: Virginia Tech, Blacksburg, VA 24061, USA\\

60: Contact Email: {\tt naren@cs.vt.edu}}

61:

62: \begin{document}

63:

64: \maketitle

65: %\thispagestyle{empty}

66: %\pagestyle{empty}

67:

68: \begin{abstract}

69: \noindent

70: We show that partial evaluation can be usefully viewed as

71: a programming model for realizing mixed-initiative

72: functionality in interactive applications.

73: Mixed-initiative interaction between two participants is one

74: where the parties can take turns at any time to change

75: and steer the flow of interaction. We concentrate on

76: the facet of mixed-initiative referred to as `unsolicited

77: reporting' and demonstrate how out-of-turn interactions

78: by users can be modeled by `jumping ahead' to nested

79: dialogs (via partial evaluation).  Our approach permits

80: the view of dialog management systems in terms of their

81: native support for staging and simplifying interactions;

82: we characterize three different voice-based interaction

83: technologies using this viewpoint. In particular, we

84: show that the built-in form interpretation algorithm (FIA)

85: in the VoiceXML dialog management architecture is actually

86: a (well disguised) combination of an interpreter

87: and a partial evaluator.

88: \end{abstract}

89:

90: \newpage

91: \section{Introduction}

92: \label{intro}

93: Mixed-initiative interaction~\cite{computational-mixed} has been studied

94: for the past 30 years in the areas of artificial intelligence

95: planning~\cite{prodigy}, human-computer interaction~\cite{mixed-hci}, and

96: discourse analysis~\cite{coulthard}. As Novick and Sutton point

97: out~\cite{mixed-notkin},

98: it is `one of those things that people think that they can recognize

99: when they see it even if they can't define it.' It

100: can be broadly viewed as a flexible interaction strategy between

101: participants where the parties can take turns at any time to change

102: and steer the flow of interaction. Human conversations are typically

103: mixed-initiative and, interestingly, so are interactions with some modern

104: computer systems. Consider the following two dialogs with a

105: telephone pizza delivery service that has voice-recognition

106: capability (the line numbers are provided for ease of reference):

107:

108: \begin{descit}{Dialog 1}

109: \vspace{-0.1in}

110: \begin{tabbing}

111: [x] \= abcdefab \= thiscanactuallybeamuchlongersentenceokay \kill

112: 0 \> {\bf Caller:} \> $\prec$calls Joe's Pizza on the phone$\succ$ \\

113: 1 \> {\bf System:} \> Thank you for calling Joe's pizza ordering system.\\

114: 2 \> {\bf System:} \> What size pizza would you like?\\

115: 3 \> {\bf Caller:} \> I'd like a medium, please.\\

116: 4 \> {\bf System:} \> What topping would you like on your pizza?\\

117: 5 \> {\bf Caller:} \> Pepperoni.\\

118: 6 \> {\bf System:} \> What type of crust do you want?\\

119: 7 \> {\bf Caller:} \> Uh, deep-dish.\\

120: 8 \> {\bf System:} \> So that is a medium pepperoni pizza with deep-dish crust.

121:              Is this correct?\\

122: 9 \> {\bf Caller:} \> Yes.\\

123: (conversation continues to get delivery and payment information)

124: \end{tabbing}

125: \end{descit}

126:

127: \begin{descit}{Dialog 2}

128: \vspace{-0.1in}

129: \begin{tabbing}

130: [x] \= abcdefab \= thiscanactuallybeamuchlongersentenceokay \kill

131: 0 \> {\bf Caller:} \> $\prec$calls Joe's Pizza on the phone$\succ$ \\

132: 1 \> {\bf System:} \> Thank you for calling Joe's pizza ordering system.\\

133: 2 \> {\bf System:} \> What size pizza would you like?\\

134: 3 \> {\bf Caller:} \>  I'd like a sausage pizza, please.\\

135: 4 \> {\bf System:} \> Okay, sausage.\\

136: 5 \> {\bf System:} \> What size pizza would you like?\\

137: 6 \> {\bf Caller:} \> Medium.\\

138: 7 \> {\bf System:} \> What type of crust do you want?\\

139: 8 \> {\bf Caller:} \> Deep-dish.\\

140: 9 \> {\bf System:} \>  So that is a medium sausage pizza with deep-dish crust.

141:              Is this correct?\\

142: 10 \> {\bf Caller:} \> Yes.\\

143: (conversation continues to get delivery and payment information)

144: \end{tabbing}

145: \end{descit}

146:

147: \noindent

148: Both these conversations involve the specification of a (size,topping,crust)

149: tuple to complete the pizza ordering procedure but differ

150: in important ways. In the first dialog, the caller responds to the

151: questions in the order they are posed by the system. The system has the

152: initiative at all times (other than, perhaps, on line 0) and such an

153: interaction is thus

154: referred to as {\it system-initiated}. In the second dialog, when the

155: system prompts the caller about pizza size, he responds

156: with information about his choice of topping instead

157: (sausage; see line 3 of {\it Dialog 2}). Nevertheless, the conversation

158: is not stalled and the system continues with the other aspects of the

159: information gathering activity. In particular, the system registers that the

160: caller has specified a topping, skips its default question on this topic,

161: and repeats its question about the size (see line 5 of {\it Dialog 2}). The

162: caller

163: thus gained the initiative for a brief period during the conversation,

164: before returning it to the system. A conversation that `mixes' these modes

165: of interaction in such arbitrary ways is said to be {\it mixed-initiative}.

166:

167: \subsection{Tiers of Mixed-Initiative Interaction}

168: \label{tiers}

169: It is well accepted that mixed-initiative provides a more natural and

170: personalized mode of interaction. What remains a matter of debate, however, is

171: which qualities an interaction must possess to merit its

172: classification as mixed-initiative~\cite{mixed-notkin}. In fact,

173: determining who has the initiative at a given point in an interaction can itself

174: be a contentious issue! The role of intention in

175: an interaction and the underlying task goals also affect the characterization

176: of initiative. We will not attempt to settle this debate here but

177: a few preliminary observations will be useful.

178:

179: One of the basic levels of mixed-initiative is referred to

180: as {\it unsolicited reporting} in~\cite{allen-intelligent} and is illustrated

181: in {\it Dialog 2} above. In this facet, a participant

182: provides information out-of-turn (in our case the caller, about his

183: choice of topping). Furthermore, the out-of-turn

184: interaction is not agreed upon in advance by the two participants.

185: Novick and Sutton~\cite{mixed-notkin} stress that the unanticipated

186: nature of out-of-turn interactions is important and that mere turn-taking

187: (perhaps in a hardwired order) does not constitute

188: mixed-initiative. Finally, notice that in {\it Dialog 2} there is a resumption

189: of the original questioning task once processing of the unsolicited response

190: is completed. In other applications, an unsolicited response might shift the

191: control to a new interaction sequence and/or abort the current interaction.

192:

193: Another level of mixed-initiative involves {\it subdialog invocation};

194: for instance, the computer system might not have understood the user's

195: response and ask for clarifications (which amounts to it having

196: the initiative). A final, sophisticated, form of mixed-initiative is one

197: where participants negotiate with each other to determine initiative

198: (as opposed to merely `taking the initiative')~\cite{allen-intelligent}:

199:

200: \vspace{-0.05in}

201: \begin{descit}{Dialog 3}

202: \vspace{-0.1in}

203: \begin{tabbing}

204: abcdefabyr \= thiscanactuallybeamuchlongersentenceokay \kill

205: (with apologies to O. Henry)\\

206: {\bf Husband:} \> Della, something interesting happened today that I want to

207: tell you.\\

208: {\bf Wife:} \> I too have something exciting to tell you, Jim.\\

209: {\bf Husband:} \> Do you want to go first or shall I tell you my story?

210: \end{tabbing}

211: \end{descit}

212:

213: In addition to models that characterize initiative, there are models

214: for designing dialog-based interaction systems.

215: Allen et al.~\cite{allen-ai} provide a taxonomy of such software models

216: --- finite-state machines, slot-and-filler

217: structures, frame-based methods, and more sophisticated models involving

218: planning, agent-based programming, and exploiting contextual information.

219: While mixed-initiative interaction can be studied in any of these models,

220: it is beyond the scope of this paper to address all or even a majority

221: of them.

222:

223: Instead, we concentrate on the view of (i) a dialog as a

224: task-oriented information assessment activity requiring the filling of a

225: set of slots,

226: (ii) where one of the participants in the dialog is a computer

227: system and the other

228: is a human, and (iii) where mixed-initiative arises from unsolicited

229: reporting (by the human), involving out-of-turn

230: interactions. This characterization includes many voice-based

231: interfaces to information (our pizza ordering dialog is an example) and

232: web sites modeling interaction by hyperlinks~\cite{pipe-tois}.

233: In Section~\ref{ourmodel}, we show that partial evaluation can be

234: usefully viewed as a programming model for such applications.

235: Section~\ref{voice-tech} presents three different voice-based interaction

236: technologies and analyzes them in terms of their native support for

237: mixing initiative. Finally, Section~\ref{future}

238: discusses other facets of mixed-initiative and mentions

239: other software models to which our approach can be extended.

240:

241: \vspace{-0.1in}

242: \section{Programming a Mixed-Initiative Application}

243: \vspace{-0.03in}

244: \label{ourmodel}

245: Before we outline the design of a system such as Joe's Pizza, we introduce

246: a notation~\cite{levinson,goffman} that captures basic elements

247: of initiative and response in an interaction sequence. The notation expresses

248: the local organization of a dialog~\cite{manuel-thesis,manuel-chi} as

249: adjacency pairs; for instance, {\it Dialog 1} is represented as:

250:

251: \vspace{-0.05in}

252: {\center

253: \begin{tabbing}

254: (Ic \= Rs) \= (Is \= Rc) \= (Is \= Rc) \= (Is \= Rc) \= (Is \= Rc) \kill

255: (Ic \> Rs) \> (Is \> Rc) \> (Is \> Rc) \> (Is \> Rc) \> (Is \> Rc) \\

256: \,\,0 \> \,\,1 \> \,\,2 \> \,\,3 \> \,\,4 \> \,\,5 \> \,\,6 \> \,\,7 \> \,\,8 \> \,\,9 \\

257: \end{tabbing}}

258:

259: \noindent

260: The line numbers given below the interaction sequence refer to the utterance

261: numbers in the dialog presented in Section~\ref{intro}.

262: The letter I denotes who has the initiative --- caller (c) or the system (s) ---

263: and the letter R denotes who provides the response. It is easy to see

264: from this notation that {\it Dialog 1}  consists

265: of five turns and that the system has the initiative for the last

266: four turns.  The initial turn is modeled as the caller having the

267: initiative because he or she chose to place the phone call in the first place.

268: The system quickly takes the initiative after playing a greeting to

269: the caller (which is modeled here as the response to the caller's call).

270: The subsequent four interactions then address three questions and a

271: confirmation, all involving the system retaining the initiative (Is) and

272: the caller in the responding mode (Rc). Likewise,

273: the mixed-initiative interaction in {\it Dialog 2} is

274: represented as:

275:

276: {\center

277: \begin{tabbing}

278: (Ic \= Rs) \= (Iso \= (Ic \= Rs) \= Rc) \= (Is \= Rc) \= (Is \= Rc) \kill

279: (Ic \> Rs) \> (Is \> (Ic \> Rs) \> Rc) \> (Is \> Rc) \> (Is \> Rc) \\

280: \,\,0 \> \,\,1 \> \,\,2,5 \> \,\,3 \> \,\,4 \> \,\,6 \> \,\,7 \> \,\,8 \> \,\,9 \> \,\,10 \\

281: \end{tabbing}}

282:

283: \noindent

284: In this case, the system takes the initiative in utterance 2 but instead

285: of responding to the question of size in utterance 3, the caller

286: takes the initiative, causing

287: an `insertion' to occur in the interaction sequence (dialog)~\cite{levinson}.

288: The system responds with an acknowledgement (`Okay, sausage.') in

289: utterance 4. This is represented as the nested pair (Ic Rs) above.

290: The system then re-focuses the dialog on the question of pizza size in

291: utterance 5 (thus retaking the initiative). In utterance 6 the

292: caller responds with the desired size (medium) and

293: the interaction proceeds as before, from this point.

294:

295: The notation is useful to describe the space of possible interactions that

296: are to be supported. For instance, utterances 0 and 1 have to proceed in order.

297: Utterances dealing with selection of (size,topping,crust) can then

298: be nested in any order and provide interesting

299: opportunities for mixing initiative.

300: For instance, if a user is a frequent

301: customer of Joe's Pizza, he might take the initiative and specify all three

302: pizza attributes on the first available prompt:

303:

304: \begin{descit}{Dialog 4}

305: \vspace{-0.1in}

306: \begin{tabbing}

307: [x] \= abcdefab \= thiscanactuallybeamuchlongersentenceokay \kill

308: 0 \> {\bf Caller:} \> $\prec$calls Joe's Pizza on the phone$\succ$ \\

309: 1 \> {\bf System:} \> Thank you for calling Joe's pizza ordering system.\\

310: 2 \> {\bf System:} \> What size pizza would you like?\\

311: 3 \> {\bf Caller:} \>  I'd like a sausage pizza, medium, and deep-dish.\\

312: (conversation continues with confirmation of order)

313: \end{tabbing}

314: \end{descit}

315:

316: \noindent

317: Finally, the utterances dealing with confirmation of the user's request

318: can proceed only after choices of all three pizza attributes have been

319: made. There are 13 possible interaction sequences (discounting permutations

320: of attributes specified in a given utterance) --- 1 possibility

321: of specifying everything in one utterance, 6 possibilities of specification

322: in two utterances, and 6 possibilities of specification in three

323: utterances. If we include permutations, there are 24 possibilities (our

324: calculations do not consider situations where, for instance, the system doesn't

325: recognize the user's input and reprompts for information).

326:
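The counts above can be checked mechanically. The following Python sketch (ours, for illustration only; the name {\tt ordered\_partitions} is hypothetical) enumerates every way of distributing the three slots over a sequence of non-empty utterances and then accounts for the order in which attributes are mentioned within an utterance. As in the text, reprompts after recognition failures are ignored.

```python
from collections import Counter
from math import factorial, prod

SLOTS = ("size", "topping", "crust")


def ordered_partitions(items):
    """Yield every way to split items into a sequence of non-empty utterances."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in ordered_partitions(rest):
        # Mention `first` together with the slots of an existing utterance ...
        for i in range(len(partial)):
            yield partial[:i] + [partial[i] | {first}] + partial[i + 1:]
        # ... or give it an utterance of its own at any position.
        for i in range(len(partial) + 1):
            yield partial[:i] + [frozenset({first})] + partial[i:]


sequences = list(ordered_partitions(SLOTS))

# Number of sequences, grouped by how many utterances they take.
by_length = Counter(len(seq) for seq in sequences)

# Each utterance naming k slots can mention them in k! different orders.
with_permutations = sum(
    prod(factorial(len(utterance)) for utterance in seq) for seq in sequences
)

print(len(sequences), dict(by_length), with_permutations)
```

Running the sketch reproduces the tallies in the text: 13 sequences (1, 6, and 6 for one, two, and three utterances, respectively) and 24 once permutations within an utterance are included.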

327: \begin{figure}

328: \centering

329: \begin{tabular}{cc}

330: \includegraphics[height=1.95in]{dubba1}

331: \hspace{0.1in}

332: \includegraphics[height=1.95in]{dubba3}

333: \end{tabular}

334: \caption{Designs of software systems for

335: mixed-initiative interaction. (left) Traditional system architecture,

336: distinguishing between responsive and unsolicited inputs.

337: (right) Using partial evaluation to handle inputs uniformly.}

338: \label{dialog-designs}

339: \end{figure}

340:

341: Many programming models

342: view mixed-initiative sequences as requiring some

343: special attention to be accommodated. In particular, they rely on recognizing

344: when a user has provided unsolicited

345: input\footnote{We use the term `unsolicited

346: input' here to refer to expected but out-of-turn inputs as opposed to

347: completely unexpected (or out-of-vocabulary) inputs.}

348: and qualify a

349: shift-in-initiative as a `transfer of

350: control.'

351: This implies that the mechanisms that handle out-of-turn interactions

352: are often different

353: from those that realize purely system-directed interactions.

354: Fig.~\ref{dialog-designs} (left) describes a typical software design.

355: A dialog manager is in charge of

356: prompting the user for input, queueing messages onto an output

357: medium, event processing, and managing the overall flow of interaction.

358: One of its inputs is a dialog script that contains a specification of

359: interaction and a set of slots that are to be filled. In our pizza example,

360: slots correspond to placeholders for values of size, topping,

361: and crust. An interpreter determines the first unfilled

362: slot to be visited and presents any prompts

363: for soliciting user input.

364: A responsive input requires mere slot filling whereas unsolicited inputs

365: would require out-of-turn processing (involving a combination of slot

366: filling and simplification). In turn, this causes a revision of the

367: dialog script. The interpreter terminates when there is nothing left to

368: process in the script. While typical dialog managers

369: perform miscellaneous functions such as error control,

370: transferring to other scripts, and accessing scripts from a database, the

371: architecture in Fig.~\ref{dialog-designs}

372: (left) focuses on the aspects most relevant to our presentation.

373:

374: Our approach, on the other hand, is to think of a mixed-initiative

375: dialog as a program,

376: whose arguments are all passed by reference and correspond to the

377: slots comprising information assessment. As usual, an interpreter

378: in the dialog manager

379: queues up any applicable prompts to an output medium.

380: Both responsive and

381: unsolicited inputs by a user now correspond (uniformly)

382: to values for arguments; they are processed by partially evaluating

383: the program with respect to these inputs. The result of partial evaluation

384: is another dialog (simplified as a result of user input) which is handed

385: back to the interpreter. This novel design

386: is depicted in Fig.~\ref{dialog-designs} (right) and a dialog script

387: represented in a programmatic notation is given in Fig.~\ref{pizza-script}.

388: Partial evaluation of Fig.~\ref{pizza-script} with respect to user

389: input will remove the conditionals for all slots that

390: are filled by the utterance (global variables are assumed to be

391: under the purview of the interpreter).

392: The reader can verify that a sequence of such partial evaluations

393: will indeed mimic the interaction sequence depicted in {\it Dialog 2}

394: (and any of the other mixed-initiative sequences).

395:
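The following Python sketch (ours; a minimal illustration, not the implementation behind Fig.~\ref{dialog-designs}) makes this loop concrete. It assumes that the script of Fig.~\ref{pizza-script} is represented as an ordered list of (slot, prompt) pairs, one per conditional, and that each caller utterance has already been reduced to a dictionary of slot values. The names {\tt partially\_evaluate} and {\tt run\_dialog} are hypothetical.

```python
# Assumed representation of the script: one (slot, prompt) pair per
# `if (unfilled(slot))` block, in the order the blocks appear.
PIZZA_SCRIPT = [
    ("size", "What size pizza would you like?"),
    ("topping", "What topping would you like on your pizza?"),
    ("crust", "What type of crust do you want?"),
]


def partially_evaluate(script, utterance):
    """Specialize the script: drop the conditional of every slot the utterance fills."""
    return [(slot, prompt) for slot, prompt in script if slot not in utterance]


def run_dialog(script, utterances):
    """Alternate interpretation (prompting) with partial evaluation (simplifying)."""
    prompts, filled = [], {}
    for utterance in utterances:
        if not script:                  # nothing left to ask
            break
        prompts.append(script[0][1])    # interpreter: prompt for the first unfilled slot
        filled.update(utterance)        # slot values supplied by the caller, in or out of turn
        script = partially_evaluate(script, utterance)
    return prompts, filled


# Slot values of Dialog 2: topping is supplied out of turn.
dialog2 = [{"topping": "sausage"}, {"size": "medium"}, {"crust": "deep-dish"}]
dialog2_prompts, dialog2_filled = run_dialog(PIZZA_SCRIPT, dialog2)
```

With the slot values of {\it Dialog 2}, the sketch issues the size prompt twice and never issues the topping prompt; with the single utterance of {\it Dialog 4}, it stops after one prompt. Responsive and unsolicited inputs take the same code path. The acknowledgement in utterance 4 of {\it Dialog 2} is not modeled.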

396: Partial evaluation serves two critical uses in our design. The first is

397: obvious, namely the processing of out-of-turn interactions (and any

398: appropriate simplifications to the dialog script). The more subtle advantage

399: of partial evaluation is its support for staging mixed-initiative

400: interactions. The

401: mix-equation~\cite{jones,jones-pe-book} holds for every possible way

402: of splitting inputs into two categories, without enumerating and

403: `trapping' the ways

404: in which the computations can be staged.

405: For instance, the nested pair in {\it Dialog 2} is supported as a natural

406: consequence of our design, not by anticipating and reacting to an

407: out-of-turn input.

408:
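For reference, the mix-equation can be written in the semantic-bracket notation common in the partial evaluation literature; the mapping onto the pizza script given here is our informal reading rather than a formal result. For a program $p$ whose inputs are split into a static part $s$ and a dynamic part $d$,
\[
[\![p]\!](s,d) \;=\; [\![\,[\![\mathit{mix}]\!](p,s)\,]\!](d),
\]
where $[\![\cdot]\!]$ maps a program to the function it computes and $\mathit{mix}$ is the partial evaluator. In our setting, $p$ is the script of Fig.~\ref{pizza-script}, $s$ is the set of slot values supplied in the current utterance, and $d$ comprises the slots still to be filled. Because the equation holds for any choice of $s$, it covers $s = \{\mathit{topping}\}$ in {\it Dialog 2} just as it covers $s = \{\mathit{size}, \mathit{topping}, \mathit{crust}\}$ in {\it Dialog 4}.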

409: Another way to characterize the system designs in Fig.~\ref{dialog-designs} is

410: to say that Fig.~\ref{dialog-designs} (left) distinguishes between responsive

411: and unsolicited inputs, whereas Fig.~\ref{dialog-designs} (right) makes

412: a more fundamental

413: distinction between fixed-initiative (interpretation) and

414: mixed-initiative (partial

415: evaluation). In other words, Fig.~\ref{dialog-designs} (right) carves

416: up an interaction sequence into (i) turns that are

417: to be handled in the order they are modeled (by an interpreter),

418: and (ii) turns that can involve mixing of initiative (handled

419: by a partial evaluator).

420: In the latter case, the computations are actually used as

421: a {\it representation of interactions.} Since only mixed-initiative

422: interactions involve multiple staging options

423: and since these are handled by the partial

424: evaluator, our design requires the {\it least} amount of specification

425: (to support all interaction sequences). For instance, the script

426: in Fig.~\ref{pizza-script} models the parts that involve mixing of

427: initiative and helps realize all of the 13 possible interaction sequences.

428: At the same time it does not model the confirmation sequence of

429: {\it Dialog 2} because the caller cannot confirm his order before selecting

430: the three pizza attributes! This turn should be handled by

431: straightforward interpretation.

432:

433: \begin{figure}

434: \centering

435: \begin{tabular}{|l|} \hline

436: {\tt pizzaorder(size,topping,crust)}\\

437: \{\\

438: \,\,\,\, {\tt if (unfilled(size))\{}\\

439: \,\,\,\,\,\,\,\, {\tt /* prompt for size */}\\

440: \,\,\,\, {\tt \}}\\

441: \,\,\,\, {\tt if (unfilled(topping))\{}\\

442: \,\,\,\,\,\,\,\, {\tt /* prompt for topping */}\\

443: \,\,\,\, {\tt \}}\\

444: \,\,\,\, {\tt if (unfilled(crust))\{}\\

445: \,\,\,\,\,\,\,\, {\tt /* prompt for crust */}\\

446: \,\,\,\, {\tt \}}\\

447: \}\\

448: \hline

449: \end{tabular}

450: \caption{Modeling a dialog script as a program parameterized by slot variables

451: that are passed by reference.}

452: \label{pizza-script}

453: \end{figure}

454:

455: To the best of our knowledge, such a model of partial evaluation

456: for mixed-initiative interaction

457: has not been proposed before. An extensive literature

458: search has revealed no related prior work.

459: While computational models for mixed-initiative

460: interaction remain an active area of research~\cite{computational-mixed},

461: such work is characterized by keywords such as `controlling mixed-initiative

462: interaction,' `knowledge representation and reasoning strategies,' and

463: `multi-agent co-ordination.' There are even projects that talk about

464: `integrating' mixed-initiative interaction and partial evaluation to realize

465: an architecture for planning and learning~\cite{prodigy}. We are optimistic

466: that our work has the same historical significance as the relation

467: between explanation-based generalization (a learning technique in AI)

468: and partial evaluation established by van Harmelen and Bundy

469: in 1988~\cite{EBG_PE}.

470:

471: \vspace{-0.1in}

472: \section{Software Technologies for Voice-Based Mixed-Initiative Applications}

473: \label{voice-tech}

474: One of the main contributions of our model is that it characterizes the

475: minimum amount of information needed to program a mixed-initiative interaction

476: sequence.

477: Once a programmer supplies a script such

478: as Fig.~\ref{pizza-script}, mixed-initiative

479: interaction is obtained, quite literally, `for free.' This means that

480: we can use the design in Fig.~\ref{dialog-designs} (right) as a benchmark

481: to compare and contrast the amount of specification required in other

482: approaches.

483:

484: As indicated in Section~\ref{tiers}, our model is applicable to

485: voice-based interaction technologies as well as web access via hyperlinks.

486: We concentrate on voice-based applications since interaction with web

487: sites is addressed in a related paper~\cite{pipe-tois} and because

488: the design constraints in voice-based applications pose interesting

489: considerations for our model. In addition, a variety of commercial

490: technologies are available for voice-based applications (in contrast to

491: web sites) that will aid in comparative assessment.

492:

493: \begin{figure}

494: \centering

495: \begin{tabular}{cc}

496: \includegraphics[height=2in]{spreco}

497: \end{tabular}

498: \caption{Basic components of a spoken language processing system.}

499: \label{spreco}

500: \end{figure}

501:

502: \vspace{-0.1in}

503: \subsection{Basic Principles of Voice-Based Interaction}

504: Before we can study the programming of mixed-initiative in

505: a voice-based application, it will be helpful to understand

506: the basic architecture (see Fig.~\ref{spreco})

507: of a spoken language processing system. As a user speaks into the

508: system, the sounds produced are

509: captured by a microphone and converted into a digital signal by an

510: analog-to-digital converter. In telephone-based systems

511: (the VoiceXML architecture covered later in the paper is geared toward

512: this mode), the microphone is part of the

513: telephone handset and the analog-to-digital conversion is typically done by

514: equipment in the telephone network (in some cellular telephony models,

515: the conversion would be performed in the handset itself).

516:

517: The next stage (feature extraction) prepares the digital speech signal to be

518: processed by the speech recognizer. Features of the signal important for

519: speech recognition are extracted from the original signal, organized as an

520: attribute vector, and passed to the recognizer.

521:

522: Most modern speech recognizers use Hidden Markov Models (HMMs) and associated

523: algorithms to represent, train, and recognize speech.  HMMs are

524: probabilistic models that must be trained on an input set of data.  A common

525: technique is to create sets of acoustic HMMs that model phonetic units of

526: speech in context. These models are created from a training set of speech

527: data that is (hopefully) representative of the population of users who will

528: use the system. A language model is also created prior to performing

529: recognition. The language model is typically used to specify valid

530: combinations of the HMMs

531: at a word- or sentence-level.  In this way, the

532: language model specifies the words, phrases, and sentences that the recognizer

533: can attempt to recognize.  The process of recognizing a new input speech

534: signal is then accomplished using efficient search algorithms (such as Viterbi

535: decoding) to find the best matching HMMs, given the constraints of the language

536: model.  The output of the speech recognizer can take several different forms,

537: but the basic result is a text string that is the recognizer's best guess of

538: what the user said.  Many recognizers can provide additional information such

539: as a lattice of results, or an N-best ranked list of results (in case the later

540: stages of processing wish to reject the recognizer's top choice).  A good

541: introduction to speech recognition is available in~\cite{martin-speech}.

542:
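To make the search step concrete, the following Python sketch (ours) runs Viterbi decoding on a toy two-state HMM. The states, observation symbols, and probabilities are invented for illustration and are far simpler than the acoustic models of a real recognizer, which would also work with log-probabilities and pruning.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # best[s] = (probability, path) of the best path that ends in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        best = {
            s: max(
                (prob * trans_p[prev][s] * emit_p[s][obs], path + [s])
                for prev, (prob, path) in best.items()
            )
            for s in states
        }
    return max(best.values())


# Toy model (all numbers are assumptions made for this example).
STATES = ("silence", "speech")
START_P = {"silence": 0.6, "speech": 0.4}
TRANS_P = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.4, "speech": 0.6},
}
EMIT_P = {
    "silence": {"low": 0.5, "mid": 0.4, "high": 0.1},
    "speech": {"low": 0.1, "mid": 0.3, "high": 0.6},
}

probability, path = viterbi(("low", "mid", "high"), STATES, START_P, TRANS_P, EMIT_P)
```

For the observation sequence (low, mid, high), the best path under this toy model is silence, silence, speech. In a recognizer, the language model would further restrict which state sequences are admissible.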

543: The stages after speech recognition vary depending on the application and the

544: types of processing required. Fig.~\ref{spreco} presents

545: two additional phases that are commonly included in spoken language processing

546: systems in one form or another. We will broadly refer to the first

547: post-recognition stage as natural language processing (NLP). NLP is a

548: large field in its own right and includes many sub-areas such as parsing,

549: semantic interpretation, knowledge representation, and speech acts; an

550: excellent introduction is available in Allen's classic~\cite{allen-nlp}. Our

551: presentation in this paper has assumed NLP support for slot-filling (i.e.,

552: determining values for slot variables from user input).

553:

554: This is commonly achieved by defining parts of a language model

555: and associating them with slots. The language model could take two

556: major forms --- context-free grammars and statistical models (such

557: as n-grams). Here we

558: focus on the former: in this approach, slots can be specified within

559: the productions of a context-free grammar (akin to an attribute

560: grammar) or they can be associated with

561: the non-terminals in the grammar.

562:
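As a rough illustration of associating parts of a language model with slots, the following Python sketch (ours) uses simple phrase spotting as a stand-in for a context-free grammar. The vocabulary and the names {\tt SLOT\_GRAMMAR} and {\tt fill\_slots} are assumptions made for this example and do not correspond to any particular toolkit.

```python
import re

# Hypothetical mini-grammar: each slot is associated with the terminals
# (phrases) that can fill it. The vocabulary is invented for this example.
SLOT_GRAMMAR = {
    "size": ["small", "medium", "large"],
    "topping": ["pepperoni", "sausage", "mushroom"],
    "crust": ["deep-dish", "thin", "hand-tossed"],
}


def fill_slots(utterance):
    """Return a {slot: value} dict for every slot phrase found in the utterance."""
    text = utterance.lower()
    filled = {}
    for slot, phrases in SLOT_GRAMMAR.items():
        for phrase in phrases:
            # Match the phrase only as a whole word (hyphens count as word characters).
            pattern = r"(?<![\w-])" + re.escape(phrase) + r"(?![\w-])"
            if re.search(pattern, text):
                filled[slot] = phrase
                break
    return filled


dialog4_slots = fill_slots("I'd like a sausage pizza, medium, and deep-dish.")
```

Applied to utterance 3 of {\it Dialog 4}, the sketch fills all three slots at once; applied to utterance 3 of {\it Dialog 2}, it fills only the topping slot. A grammar-based recognizer would additionally reject word sequences that the grammar does not generate.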

563: We will refer to the next phase of processing as simply `dialog management'

564: (see Fig.~\ref{spreco}). In this phase, augmented results from

565: the NLP stage are incorporated into the dialog and any associated logic

566: of the application is executed. The job of the dialog manager is to

567: track the proceedings of the dialog and to generate appropriate

568: responses. This is often done within some logical processing

569: framework and a dialog model (in our case, a dialog script)

570: is supplied as input that is specific to the particular application being

571: designed. The execution of the logic on the dialog model (script) results

572: in a response that can be presented back to the user. Sometimes

573: response generation is separated out into a subsequent stage.

574:

575: \begin{figure}

576: \centering

577: \begin{tabular}{cc}

578: \includegraphics[width=3in]{httparch}

579: \hspace{0.1in}

580: \includegraphics[width=3.9in]{vxmlarch}

581: \end{tabular}

582: \caption{(left) Accessing HTML documents via an HTTP web server.

583: (right) Accessing VoiceXML documents via an HTTP web server.}

584: \label{html-vxml}

585: \end{figure}

586:

587: \begin{figure}

588: \centering

589: \small

590: \begin{verbatim}

591: <?xml version="1.0"?>

592: <vxml version="1.0">

593: <!-- pizza.vxml

594:      A simple pizza ordering demo to illustrate some basic elements

595:      of VoiceXML.  Several details have been omitted from this demo

596:      to help make the basic ideas stand out. -->

597:   <form id="welcome">

598:     <block name="block1">

599:       <prompt> Thank you for calling Joe's pizza ordering system. </prompt>

600:       <goto next="#place_order" />

601:     </block>

602:   </form>

603:

604:   <form id="place_order">

605:     <field name="size">

606:       <prompt> What size pizza would you like? </prompt>

607:     </field>

608:

609:     <field name="topping">

610:       <prompt> What topping would you like on your pizza? </prompt>

611:     </field>

612:

613:     <field name="crust">

614:       <prompt> What type of crust do you want? </prompt>

615:     </field>

616:

617:     <field name="verify">

618:       <prompt>

619:         So that is a <value expr="size"/> <value expr="topping"/> pizza

620:         with <value expr="crust"/> crust.

621:         Is this correct?

622:       </prompt>

623:       <grammar> yes | no </grammar>

624:     </field>

625:

626:     <filled>

627:       <if cond="verify=='no'">

628:          <clear namelist="size topping verify crust"/>

629:          <prompt> Sorry.  Your order has been canceled. </prompt>

630:       <else/>

631:          <prompt>Thank you for ordering from Joe's pizza.</prompt>

632:       </if>

633:     </filled>

634:

635:   </form>

636: </vxml>

637: \end{verbatim}

638: \caption{Modeling the pizza ordering dialog in a VoiceXML document.}

639: \label{vpizza}

640: \end{figure}

641:

642: \vspace{-0.1in}

643: \subsection{The VoiceXML Dialog Management Architecture}

644: There are many technologies and delivery mechanisms available for

645: implementing Fig.~\ref{spreco}'s basic components. A popular

646: implementation can be seen in the VoiceXML dialog management architecture.

647: VoiceXML is a markup language designed to simplify the construction of

648: voice-response applications~\cite{voicexml}. Initiated by a

649: committee comprising AT\&T, IBM, Lucent Technologies, and Motorola,

650: it has emerged as a standard in telephone-based voice user interfaces

651: and in delivering web content via voice. We will hence cover this

652: architecture in detail.

653:

654: The basic idea is to describe interaction

655: sequences using a markup notation in a VoiceXML {\it document.} As the

656: VoiceXML specification~\cite{voicexml} indicates, a VoiceXML document

657: constitutes a conversational finite state machine and describes a

658: sequence of interactions (both fixed- and mixed-initiative are supported).

659: A web server can serve VoiceXML documents using the HTTP

660: protocol (Fig.~\ref{html-vxml} (right)), just as easily as HTML documents

661: are currently served over the Internet (Fig.~\ref{html-vxml} (left)).

662: In addition, voice-based applications require a suitable delivery

663: platform, illustrated by a telephone in Fig.~\ref{html-vxml} (right). The

664: voice-browser platform in Fig.~\ref{html-vxml} (right)

665: includes the VoiceXML interpreter which processes the

666: documents, monitors user inputs, streams messages,

667: and performs other functions expected of a dialog management system. Besides

668: the VoiceXML interpreter, the voice-browser platform includes speech

669: recognizers, speech synthesizers, and telephony interfaces to help

670: realize important aspects of voice-based interaction.

671:

672: Dialog specification in a VoiceXML document involves organizing

673: a sequence of {\it forms} and {\it menus}. Forms specify a set of

674: slots (called field item variables) that are to be filled by user input. Menus

675: are syntactic shorthands (much like a {\tt case} construct); since they

676: involve only one field item variable (argument), there are no opportunities

677: for mixing initiative. We do not discuss menus further in this paper.

678: An example VoiceXML document for our pizza application is given

679: in Fig.~\ref{vpizza}.

680:

681: \begin{figure}

682: \small

683: \centering

684: \begin {tabular}{|p{6.6in}|}\hline

685: \vspace{-0.2in}

686: \begin{verbatim}

687: #JSGF V1.0;

688:

689: grammar sizetoppingcrust;

690:

691: public <sizetoppingcrust> =

692:    <size> {this.size=$} [<topping> {this.topping=$}] [<crust> {this.crust=$}] |

693:    <size> {this.size=$} <crust> {this.crust=$} <topping> {this.topping=$} |

694:    <topping> {this.topping=$} [<crust> {this.crust=$}] [<size> {this.size=$}] |

695:    <topping> {this.topping=$} <size> {this.size=$} <crust> {this.crust=$} |

696:    <crust> {this.crust=$} [<size> {this.size=$}] [<topping> {this.topping=$}] |

697:    <crust> {this.crust=$} <topping> {this.topping=$} <size> {this.size=$};

698:

699: <size> = small | medium | large;

700: <topping> =  sausage | pepperoni | onions | green peppers;

701: <crust> = regular | deep dish | thin;

702: \end{verbatim}

703: \\\hline

704: \end{tabular}

705: \caption{A form-level grammar to be used in conjunction with

706: the script in Fig.~\ref{vpizza} to realize mixed-initiative interaction.

707: The above productions for {\tt sizetoppingcrust} cover all possibilities

708: of filling slot variables from user input, including multiple slots filled

709: by a given utterance, and various permutations of specifying pizza attributes.}

710: \label{formgram}

711: \end{figure}

712:

713: As shown in Fig.~\ref{vpizza}, the pizza dialog consists of two forms. The

714: first form ({\tt welcome}) merely welcomes the user and transitions to the

715: second. The {\tt place\_order} form involves four {\tt field}s

716: (slot variables) --- the first three cover the pizza attributes and the

717: fourth models the confirmation variable (recall the dialogs in

718: Section~\ref{intro}). In particular, prompts for soliciting user input

719: in each of the fields are specified in Fig.~\ref{vpizza}.

720:

721: Interactions in a VoiceXML application proceed just like those in a web application

722: except that, instead of clicking on a hyperlink (to go to a new state), the

723: user talks into a microphone. The VoiceXML interpreter then

724: determines the next state to transition to. Any appropriate responses

725: (to user input) and prompts are

726: delivered over a speaker. The

727: core of the interpreter is a so-called form interpretation algorithm

728: (FIA) that drives the interaction.

729: In Fig.~\ref{vpizza}, the fields provide for a fixed-initiative, system-directed

730: interaction. The FIA simply visits all fields in the order they are presented

731: in the document.

732: Once all fields are filled, a check is made to ensure that

733: the confirmation was successful; if not, the fields are cleared (notice

734: the {\tt clear namelist} tag) and the FIA will proceed to

735: {\tt prompt} for the inputs again,

736: starting from the first unfilled field --- {\tt size}.
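
The directed behavior just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the VoiceXML interpreter: the function {\tt run\_directed} and its list-of-replies interface are hypothetical, each reply is assumed to fill exactly the field being prompted for, and no grammar checking is performed.

```python
# Field names follow the place_order form; everything else is illustrative.
FIELDS = ["size", "topping", "crust", "verify"]


def run_directed(replies):
    """Visit fields in document order; clear and restart when verify is 'no'.

    Returns (prompts, values): the fields prompted for, in order, and the
    final field values (None marks an unfilled field).
    """
    values = dict.fromkeys(FIELDS)
    prompts = []
    answers = iter(replies)
    while True:
        # Select the first unfilled field, or stop when none remain.
        unfilled = [name for name in FIELDS if values[name] is None]
        if not unfilled:
            return prompts, values
        current = unfilled[0]

        # Prompt for it and take the next reply (assumed to answer it).
        prompts.append(current)
        answer = next(answers, None)
        if answer is None:
            return prompts, values  # no more input, e.g. the caller hung up
        values[current] = answer

        # Form-level 'filled' action: runs once every field has a value.
        all_filled = all(v is not None for v in values.values())
        if all_filled and values["verify"] == "no":
            # Mirrors <clear namelist="size topping verify crust"/>.
            values = dict.fromkeys(FIELDS)
```

With the replies {\tt large, onions, thin, no, medium, onions, thin, yes}, the sketch prompts for all four fields twice: the first pass is cleared by the negative confirmation and prompting resumes from {\tt size}.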

737:

738: \begin{figure}

739: \centering

740: \small

741: \begin {tabular}{|p{6in}|}\hline

742: \vspace{-0.2in}

743: \begin{verbatim}

744: While (true)

745: {

746:         // SELECT PHASE

747:         Select the first form item with an unsatisfied guard condition

748:              (e.g., unfilled)

749:           If no such form item, exit

750:

751:         // COLLECT PHASE

752:         Queue up any prompts for the form item

753:         Get an utterance from the user

754:

755:         // PROCESS PHASE

756:         foreach (slot in user's utterance)

757:         {

758:                if (slot corresponds to a field item) {

759:                     copy slot values into field item variables

760:                     set field item's `just_filled' flag

761:                }

762:         }

763:         // some code for executing any `filled' actions triggered

764: }

765: \end{verbatim}

766: \\\hline

767: \end{tabular}

768: \caption{Outline of the form interpretation algorithm (FIA) in

769: the VoiceXML dialog management architecture. Adapted from~\cite{voicexml}.}

770: \label{fiaalgo}

771: \end{figure}

772:

773: \begin{figure}

774: \small

775: \centering

776: \begin {tabular}{|p{5in}|}\hline

777: \vspace{-0.2in}

778: \begin{verbatim}

779: #JSGF V1.0;

780:

781: grammar sizetoppingcrust;

782:

783: public <sizetoppingcrust> = <word>*;

784:

785: <word> = <size> {this.size=$} |

786:          <crust> {this.crust=$} |

787:          <topping> {this.topping=$};

788:

789: <size> = small | medium | large;

790: <topping> =  sausage | pepperoni | onions | green peppers;

791: <crust> = regular | deep dish | thin;

792: \end{verbatim}

793: \\\hline

794: \end{tabular}

795: \caption{An alternative form-level grammar to realize mixed-initiative

796: interaction with

797: the script in Fig.~\ref{vpizza}.}

798: \label{formgram2}

799: \end{figure}

800:

801: The form in Fig.~\ref{vpizza} is referred to as a directed one since the

802: computer has the initiative at all times and the {\tt field}s are filled

803: in a strictly sequential order. To make the interaction mixed-initiative

804: (with respect to {\tt size}, {\tt crust}, and {\tt topping}),

805: the programmer merely has to specify a so-called

806: {\it form-level grammar} that describes possibilities for slot-filling from

807: a user utterance. An example

808: form-level grammar file ({\tt sizetoppingcrust.gram}) that covers all

809: possibilities is given in

810: Fig.~\ref{formgram}. The grammar is associated with the dialog script

811: by including the line:

812: \begin{verbatim}

813:     <grammar src="sizetoppingcrust.gram" type="application/x-jsgf"/>

814: \end{verbatim}

815: just before the definition of

816: the first {\tt field} (size) in Fig.~\ref{vpizza}.

817:

818: The form-level grammar contains productions for the various choices

819: available for size, topping, and crust and also qualifies

820: all possible parses for

821: a given utterance (modeled by the non-terminal {\tt sizetoppingcrust}). Any

822: valid combination of the three pizza aspects uttered by the user (in

823: any order) is recognized and the appropriate slot variables are instantiated.

824: To see why this also achieves mixed-initiative, let us consider the FIA

825: in more detail.

826:

827: Fig.~\ref{fiaalgo} only reproduces the salient aspects of the FIA relevant

828: for our discussion. Compare the basic elements of the FIA to the stages in

829: Fig.~\ref{dialog-designs} (right). The Select phase corresponds to the

830: interpreter, the Collect phase gathers the user input, and

831: actions taken in the Process phase mimic the partial evaluator. Recall that

832: `programs'

833: (scripts) in VoiceXML can be modeled by finite-state machines, hence

834: the mechanics of partial evaluation are considerably simplified and just

835: amount to filling the slot and removing it from further consideration.

836: Since the FIA repeatedly executes until there are no remaining form items,

837: the processing phase (Process) is effectively parameterized by the form-level

838: grammar file in Fig.~\ref{formgram}. In other words, the form-level grammar

839: file not only enables slot filling, {\it it also implicitly directs the

840: staging of interactions for mixed-initiative.} When the user

841: specifies `pepperoni medium' in an utterance, not only does the grammar

842: file enable the recognition of the slots they correspond to (topping and size),

843: it also directs the FIA to simplify these slots (and remove them in

844: any subsequent interaction).
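
To make the role of the form-level grammar concrete, the following Python sketch mimics the select, collect, and process phases of Fig.~\ref{fiaalgo}. It is a simplified illustration, not the algorithm of the VoiceXML specification: {\tt parse\_slots} is a hypothetical substring-matching stand-in for the grammar of Fig.~\ref{formgram}, the {\tt verify} field and `filled' actions are omitted, and user input is supplied as a list of strings.

```python
# Vocabulary copied from the <size>, <topping>, and <crust> productions.
VOCAB = {
    "size": ["small", "medium", "large"],
    "topping": ["sausage", "pepperoni", "onions", "green peppers"],
    "crust": ["regular", "deep dish", "thin"],
}
FIELD_ORDER = ["size", "topping", "crust"]


def parse_slots(utterance):
    """Hypothetical stand-in for the form-level grammar.

    Returns a dict of every slot whose value is mentioned in the utterance,
    regardless of the order in which the values are spoken.
    """
    text = " " + utterance.lower() + " "
    slots = {}
    for field, choices in VOCAB.items():
        for choice in choices:
            if " " + choice + " " in text:
                slots[field] = choice
    return slots


def run_fia(utterances):
    """Simplified FIA loop; returns (prompts issued, final field values)."""
    fields = dict.fromkeys(FIELD_ORDER)  # None marks an unfilled field
    prompts = []
    replies = iter(utterances)
    while True:
        # SELECT: first form item that is still unfilled.
        pending = [name for name in FIELD_ORDER if fields[name] is None]
        if not pending:
            break

        # COLLECT: prompt for that item and get an utterance.
        prompts.append(pending[0])
        reply = next(replies, None)
        if reply is None:
            break  # no more input

        # PROCESS: fill every slot the utterance mentions, not only the
        # one prompted for; filled slots drop out of later SELECT phases.
        for name, value in parse_slots(reply).items():
            fields[name] = value
    return prompts, fields
```

Given the utterances `pepperoni medium' and then `thin', the sketch prompts only for {\tt size} and {\tt crust}: the first utterance fills both {\tt topping} and {\tt size}, so the topping question is never asked. The staging is thus decided by what the grammar recognizes, not by additional control code.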

845:

846: The form-level grammar file shown in Fig.~\ref{formgram}

847: (which is also a specification of interaction staging) may make

848: VoiceXML's design appear overly complex. In reality, however,

849: we could have used the vanilla form-level

850: grammar file in Fig.~\ref{formgram2}. While helping to

851: realize mixed-initiative with Fig.~\ref{vpizza}, the new

852: form-level file (as does our model) also allows the possibility of

853: utterances such as `pepperoni pepperoni,' or even, `pepperoni sausage!'

854: Suitable semantics for such situations (including the role of

855: side-effects) can be defined and accommodated in both the VoiceXML

856: model and ours. The upshot is that VoiceXML's

857: dialog management architecture is actually implementing a mixed

858: evaluation model (for conversational finite state machines), comprising

859: interpretation and partial evaluation.
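
As a rough illustration of what `suitable semantics' for repeated slots might look like, the Python sketch below parses an utterance against the word-level vocabulary of Fig.~\ref{formgram2} and applies two candidate policies. Both policies ({\tt last\_wins} and {\tt conflicts}) and the helper {\tt parse\_words} are our own hypothetical choices for the purpose of the example; neither is prescribed by the VoiceXML specification.

```python
# Vocabulary copied from the <size>, <topping>, and <crust> productions.
VOCAB = {
    "size": ["small", "medium", "large"],
    "topping": ["sausage", "pepperoni", "onions", "green peppers"],
    "crust": ["regular", "deep dish", "thin"],
}
# Reverse index: spoken phrase -> slot it fills.
PHRASES = {phrase: slot for slot, choices in VOCAB.items() for phrase in choices}


def parse_words(utterance):
    """Parse an utterance as a sequence of (slot, value) pairs.

    Tries two-word phrases ('deep dish') before single words, and raises
    ValueError on anything outside the vocabulary.
    """
    words = utterance.lower().split()
    pairs, i = [], 0
    while i < len(words):
        for width in (2, 1):
            chunk = words[i:i + width]
            phrase = " ".join(chunk)
            if len(chunk) == width and phrase in PHRASES:
                pairs.append((PHRASES[phrase], phrase))
                i += width
                break
        else:
            raise ValueError(f"not in grammar: {words[i]!r}")
    return pairs


def last_wins(pairs):
    """Policy 1 (assumed): a later mention overwrites an earlier one."""
    slots = {}
    for slot, value in pairs:
        slots[slot] = value
    return slots


def conflicts(pairs):
    """Policy 2 (assumed): report slots given two different values."""
    seen, clashes = {}, set()
    for slot, value in pairs:
        if slot in seen and seen[slot] != value:
            clashes.add(slot)
        seen[slot] = value
    return clashes
```

Under the first policy `pepperoni sausage' fills {\tt topping} with sausage; under the second it is flagged as a conflict on {\tt topping}, whereas `pepperoni pepperoni' is harmless under both.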

860:

861: The VoiceXML specification~\cite{voicexml} refers to the form-level file

862: as merely a `grammar file,' when it is actually also a specification of

863: staging. Even though the grammar file serves the role of a language

864: model in a voice application, we believe that

865: separating its two functionalities is important in understanding

866: mixed-initiative system design.

867: %If a statistical n-gram model served

868: %as the language model (instead of context-free grammars), such a distinction

869: %would be easy to make.

870: A case in point is our study of personalizing

871: interaction with web sites~\cite{pipe-tois}. There is no requirement for

872: a `grammar file,' as there is usually no ambiguity about user clicks and

873: typed-in keywords. In this context, the functionality provided by our model

874: is actually unmatched by any existing web-based interaction system (as

875: web interfaces are not typically designed for mixing initiative). A way to

876: incorporate mixed-initiative interaction into an existing interaction

877: at a web site is described in~\cite{pipe-tois}.

878:

879: \begin{table}

880: \centering

881: \begin{tabular}{|l|c|c|} \hline

882: \multicolumn{1}{|l|} {Software} &

883: \multicolumn{1}{c|} {Support for} &

884: \multicolumn{1}{c|} {Support for} \\

885: \multicolumn{1}{|l|} {Technology} &

886: \multicolumn{1}{c|} {Slot Simplification} &

887: \multicolumn{1}{c|} {Interaction Staging} \\ \hline

888: VoiceXML & $\surd$ & $\surd$ \\

889: Slot Filling Systems & $\surd$ & $\times$\\

890: Recognizer-Only APIs & $\times$ & $\times$ \\

891: \hline

892: \end{tabular}

893: \caption{Comparison of software technologies for voice-based

894: mixed-initiative applications.}

895: \label{compare-table}

896: \end{table}

897:

898: \vspace{-0.1in}

899: \subsection{Other Implementation Technologies}

900: VoiceXML's FIA thus includes native support for slot filling, slot

901: simplification, and interaction staging. All of these are functions

902: enabled by partial evaluation in our model. Table~\ref{compare-table}

903: contrasts two other implementation approaches in terms of these aspects.

904: In a purely slot-filling system, native support

905: is provided for simplifying slots from user utterances but extra code

906: needs to be written to model the control logic (for instance,

907: `the user still didn't specify his choice of size, so the question for

908: size should be repeated.'). Several commercial speech recognition vendors

909: provide APIs that operate at this level. In addition, many vendors support

910: low-level APIs that provide basic access to recognition results (i.e.,

911: text strings) but do not perform any additional processing. We refer

912: to these as recognizer-only APIs.

913: They serve more as raw

914: speech recognition engines and require significant programming to first

915: implement a slot-filling engine and, later, control logic to mimic all

916: possible opportunities for staging. Examples of the two latter technologies

917: can be seen in the commercial spoken dialog systems

918: market (from companies such as Nuance, IBM, and AT\&T). The study presented

919: in this paper suggests a systematic way by which

920: their capabilities for mixed-initiative interaction can be assessed.
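
One simple way to read Table~\ref{compare-table} is as a checklist of what the application programmer must still build by hand. The Python sketch below encodes the table as data; the function name {\tt hand\_written\_layers} is hypothetical, and the entries are taken directly from the table.

```python
# Rows and columns of the comparison table; True means native support.
SUPPORT = {
    "VoiceXML": {"slot simplification": True, "interaction staging": True},
    "Slot Filling Systems": {"slot simplification": True, "interaction staging": False},
    "Recognizer-Only APIs": {"slot simplification": False, "interaction staging": False},
}


def hand_written_layers(technology):
    """List the capabilities that are not native to the given technology."""
    return [cap for cap, native in SUPPORT[technology].items() if not native]
```

For instance, a recognizer-only API leaves both slot simplification and interaction staging to the programmer, while VoiceXML leaves neither.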

921:

922: \vspace{-0.1in}

923: \section{Discussion}

924: \label{future}

925: Our work makes contributions to both partial evaluation and

926: mixed-initiative interaction. For the partial evaluation community, we

927: have identified a novel application where the motivation is the staging

928: of interaction (rather than speedup). Since programs (dialogs) are

929: used as specifications of interaction, they are {\it written to be

930: partially evaluated}; partial evaluation is hence not an `afterthought'

931: or an optimization. A program can thus be thought of as a

932: compaction of all possible interaction sequences that involve mixing

933: initiative. An interesting research issue is:

934: Given (i) a set of interaction sequences, and (ii) addressable information

935: (such as arguments and slot variables), determine (iii) the smallest program so

936: that every interaction sequence can be staged in the model

937: of Fig.~\ref{dialog-designs} (right). This requires algorithms

938: to automatically decompose parts of interaction sequences into those

939: that are best addressed in the interpreter and those that can benefit from

940: representation and specialization by the partial evaluator.
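
To give a feel for the set of interaction sequences that a single program compacts, the Python sketch below enumerates every staging of a set of slots. It rests on assumptions of our own: a staging is a sequence of utterances, each filling a non-empty group of the remaining slots, and the order of slots within one utterance is ignored. The function name {\tt stagings} is hypothetical.

```python
from itertools import combinations


def stagings(slots):
    """Yield every way to fill `slots` over one or more utterances.

    Each staging is a tuple of groups; each group is a tuple of the slots
    filled by one utterance (order within a group is not significant).
    """
    slots = tuple(slots)
    if not slots:
        yield ()  # nothing left to fill: one (empty) continuation
        return
    for k in range(1, len(slots) + 1):
        # Choose which k slots the next utterance fills ...
        for first in combinations(slots, k):
            rest = tuple(s for s in slots if s not in first)
            # ... then stage the remaining slots recursively.
            for tail in stagings(rest):
                yield (first,) + tail
```

For the three pizza slots this enumeration yields 13 stagings, ranging from a single utterance that fills all three slots to the six orderings in which they are supplied one at a time; with four slots the count grows to 75. A program written to be partially evaluated represents all of them without listing any.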

941:

942: For mixed-initiative interaction, we have presented a

943: programming model that accommodates all possibilities of staging, without

944: explicit enumeration. The model makes a distinction between fixed-initiative

945: (which has to be explicitly programmed) and mixed-initiative

946: (specifications of which can be elegantly compressed for subsequent

947: partial evaluation). We have identified instantiations of this model in

948: VoiceXML and slot-filling APIs. We hope this observation will help

949: system designers gain additional insight into voice application design

950: strategies.

951:

952: It should be recalled that there are various facets of mixed-initiative

953: that are not addressed in this paper. Extending our programming model to cover

954: these facets is an immediate direction of future research. For example,

955: VoiceXML's design currently supports dialogs such as

956: the following:

957:

958: \begin{descit}{Dialog 5}

959: \vspace{-0.1in}

960: \begin{tabbing}

961: [x] \= abcdefab2 \= thiscanactuallybeamuchlongersentenceokay \kill

962: 1 \> {\bf System:} \> Thank you for calling Joe's pizza ordering system.\\

963: 2 \> {\bf System:} \> What size pizza would you like?\\

964: 3 \> {\bf Caller 1:} \> What sizes do you have?\\

965: 3 \> {\bf Caller 2:} \> Err.. Why don't you ask me the questions in

966: topping-crust-size order?\\

967: \end{tabbing}

968: \end{descit}

969: \vspace{-0.2in}

970:

971: \noindent

972: {\it Caller 1}'s request, while demonstrating initiative, implies a dialog

973: with an optional stage (which cannot be modeled by partial

974: evaluation). Such a situation has to be trapped by the interpreter, not

975: by partial evaluation. {\it Caller 2} does specify a staging, but his

976: staging poses constraints on the computer's initiative, not

977: his own. Such a `meta-dialog' facet~\cite{mixed-hci} requires the

978: ability to jump out

979: of the current dialog; VoiceXML provides many elements for describing

980: such transitions.

981:

982: VoiceXML also provides certain `impure' features and side-effects in

983: its programming model. For instance, after selecting a size (say, medium),

984: the caller could retake the initiative in a different part of the dialog

985: and select a size again (this time, large). This will cause the new

986: value to override any existing value in the {\tt size}

987: slot (see Fig.~\ref{fiaalgo}). In

988: our model, this implies the dynamic substitution of an earlier,

989: `evaluated out,' stage with a functional equivalent. Obviously, the dialog

990: manager has to maintain some state (across partial evaluations)

991: to accomplish this feature. We plan to investigate programming models suitable

992: for these aspects. In addition, we plan to extend our software model

993: beyond slot-and-filler structures, to include reasoning and exploiting

994: context.

995:

996: Our long-term goal is to characterize mixed-initiative facets, not in

997: terms of initiative, interaction, or task models but in terms of the

998: opportunities for staging and the program transformation techniques that

999: can support such staging. This means that we can establish a

1000: taxonomy of mixed-initiative facets based on the transformation techniques

1001: (e.g., partial evaluation, slicing) needed to realize them.

1002: Such a taxonomy would also help connect the facets to design

1003: models for interactive software systems.

1004:

1005: \bibliographystyle{alpha}

1006: \bibliography{pepm}

1007:

1008: \end{document}

1009:

1010: