cs0211017/paper.tex
1: \documentclass[11pt]{article} 
2: % \pagestyle{empty}
3: % \renewcommand{\thepage}{}
4: 
5: \usepackage{latexsym}
6: 
7: \title{Probabilistic Parsing Strategies}
8: \author{
9: \begin{tabular}[t]{c}
10: Mark-Jan Nederhof%
11:   \,\thanks{
12: Supported by the Royal Netherlands
13: Academy of Arts and Sciences.
14: Secondary affiliation is the
15: German Research Center for Artificial Intelligence (DFKI).
16: } \\
17: Faculty of Arts \\
18: University of Groningen \\
19: P.O.\ Box 716 \\
20: NL-9700 AS Groningen, The Netherlands \\
21: {\tt markjan@let.rug.nl}
22: \end{tabular}
23: \and
24: % \hspace{6pt}
25: \begin{tabular}[t]{c}
26: Giorgio Satta \\
27: %Dip. di Elettronica e Informatica \\
28: Department of Information Engineering \\
29: %Universit\`{a} di Padova \\
30: University of Padua \\
31: via Gradenigo, 6/A \\
32: I-35131 Padova, Italy \\
33: {\tt satta@dei.unipd.it}
34: \end{tabular}
35: }
36: 
37: % \keywords{parsing algorithms, probabilistic parsing, transduction,
38: % context-free grammars, push-down automata}
39: 
40: \date{}
41: 
42: \setlength{\textheight}{21.2cm}
43: \setlength{\textwidth}{13.5cm}
44: 
45: \setcounter{topnumber}{3}
46: \setcounter{totalnumber}{3}
47: \setcounter{dbltopnumber}{3}
48: % \renewcommand{\textfraction}{.01}
49: \renewcommand{\textfraction}{.2}
50: \renewcommand{\topfraction}{.99}
51: \renewcommand{\dbltopfraction}{.99}
52: \sloppy
53: 
54: \input{epsf}
55: 
56: \newcommand{\comment}[1]{} 
57: 
58: \newcommand{\order}[1]{{\cal O}({#1})}
59: 
60: \newcommand{\mygram}{{\cal G}}
61: \newcommand{\myaut}{{\cal A}}
62: \newcommand{\mystrat}{{\cal S}}
63: 
64: \newcommand{\mypartial}{{\cal T}_{\myaut}}
65: 
66: \newcommand{\myterm}{\mit\Sigma}
67: \newcommand{\mynontset}{\mit\Gamma}
68: \newcommand{\mynont}{N}
69: \newcommand{\myrule}{R}
70: 
71: \newcommand{\bul}{\mathrel{\bullet}} 
72: 
73: \newcommand{\mysym}{Q}
74: \newcommand{\Xinit}{X_{\it init}}
75: \newcommand{\Xfinal}{X_{\it final}}
76: \newcommand{\mytrans}{\mit\Delta}
77: 
78: \newcommand{\myep}[2]{{#1} \mapsto {#2}}
79: \newcommand{\myscan}[4]{{#1} \stackrel{#2,#3}{\mapsto} {#4}}
80: \newcommand{\myscanrec}[3]{{#1} \stackrel{#2}{\mapsto} {#3}}
81: 
82: \newcommand{\pdamoverel}{\vdash}
83: \newcommand{\pdamove}[1]{\stackrel{#1}{\vdash}}
84: \newcommand{\pdamoves}{\vdash^\ast}
85: \newcommand{\pdamovesname}[1]{\stackrel{#1}{\vdash^\ast}}
86: \newcommand{\outp}{{\it out}}
87: 
88: \newcommand{\pdagoto}{\leadsto}
89: 
90: \newcommand{\de}{\rightarrow}
91: 
92: \newcommand{\LC}{\angle}
93: \newcommand{\LCep}{\angle_{\epsilon}}
94: \newcommand{\LCstar}{\angle^\ast}
95: \newcommand{\LCepstar}{\angle_{\epsilon}^\ast}
96: 
97: \newcommand{\fepLC}{f_{\epsilon\mbox{\scriptsize\it -LC}}}
98: \newcommand{\fepTD}{f_{\epsilon\mbox{\scriptsize\it -TD}}}
99: \newcommand{\stratepLC}{\mystrat_{\epsilon\mbox{\scriptsize\it -LC}}}
100: 
101: \newtheorem{definition}{Definition}
102: \newtheorem{theorem}{Theorem}
103: \newtheorem{lemma}[theorem]{Lemma}
104: \newtheorem{prop}{Proposition}
105: \newcommand{\proof}{\noindent {\em Proof.\hspace{1em}}}
106: \newcommand{\closeproof}{\mbox{\hspace{1em}\rule{.45em}{.45em}}}
107: 
108: \newcommand{\tabrule}[4]{
109: \begin{eqnarray}
110: \label{#1}
111:         \frac{ \begin{array}{c} #2 \end{array} }
112:                         { \begin{array}{c} #3 \end{array} }
113:    \left\{ \begin{array}{l} #4 \end{array} \right.
114: \end{eqnarray} }
115: 
116: \newcommand{\tabruletwo}[3]{
117: \begin{eqnarray}
118: \label{#1}
119:         \frac{ \begin{array}{c}  #2 \end{array} }
120:                         { \begin{array}{c} #3 \end{array} }
121: \end{eqnarray} }
122: 
123: \newcommand{\forward}{{\it forward\/}}
124: \newcommand{\inner}{{\it inner\/}}
125: \newcommand{\tabel}{{\it tab\/}}
126: 
127: \begin{document}
128: 
129: \maketitle
130: 
131: \begin{abstract}
132: We present new results on the relation between 
133: purely symbolic context-free
134: parsing strategies and their probabilistic counterparts.
135: Such parsing strategies are seen as constructions
136: of push-down devices from grammars.
137: We show that preservation of probability distribution is
138: possible under two conditions, viz.\
139: the correct-prefix property and the
140: property of strong predictiveness.
141: These results generalize existing results in the literature 
142: that were obtained by considering parsing strategies in 
143: isolation.  From our general results we also derive negative
144: results on so-called generalized LR parsing. 
145: \end{abstract}
146: 
147: \section{Introduction}
148: \label{s:intro}
149: 
150: Context-free grammars and push-down automata are two 
151: equivalent formalisms to describe context-free languages.
152: While a context-free grammar can be thought of as a purely
153: declarative specification, a push-down automaton is considered to be
154: an operational specification that determines which steps are 
155: performed for a given string
156: in the process of deciding its membership of the language.
157: By a {\em parsing strategy\/} we mean a mapping from 
158: context-free grammars to equivalent push-down automata, 
159: such that some specific conditions are observed. 
160: 
161: This paper deals with the probabilistic extensions of 
162: context-free grammars and push-down automata, 
163: i.e., probabilistic context-free grammars \cite{SA72,BO73}
164: and probabilistic push-down automata \cite{SA72,SA76,TE95,AB99}.
165: These formalisms
166: are obtained by adding probabilities to the rules and transitions
167: of context-free grammars and push-down automata, respectively.
168: More specifically, we will investigate the problem of `extending'
169: parsing strategies to {\em probabilistic\/} parsing strategies.
170: These are mappings from probabilistic context-free grammars to 
171: probabilistic push-down automata
172: that preserve the induced probability distributions
173: on the generated/accepted languages.
174: Two of the main results presented in this paper can be stated as follows:
175: \begin{itemize}
176: \item 
177: No parsing strategy that lacks the
178: correct-prefix property (CPP) can be extended to become
179: a probabilistic parsing strategy.
180: \item
181: All parsing strategies that possess
182: the correct-prefix property and the 
183: strong predictiveness property (SPP) can be extended
184: to become probabilistic parsing strategies.
185: \end{itemize}
186: The above results generalize previous findings
187: reported in~\cite{TE95,TE97,AB99}, where only a few specific 
188: parsing strategies were considered in isolation. 
189: Our findings also have important
190: implications for well-known parsing strategies such as
191: generalized LR parsing, henceforth simply called `LR parsing'.%
192: \footnote{Generalized (or nondeterministic)
193: LR parsing allows for more than one action
194: for a given LR state and input symbol.} 
195: LR parsing has the CPP, but lacks the SPP, and as we
196: will show, LR parsing cannot be extended to become
197: a probabilistic parsing strategy.
198: 
199: In the last decade, widespread interest
200: in probabilistic parsing techniques has arisen
201: in the area of natural language processing \cite{CH93a,MA99,JU00}.
202: This is motivated by the fact that natural language sentences are
203: generally ambiguous,
204: and natural language software needs to be able to
205: distinguish the more probable derivations
206: of a sentence from the less probable ones.
207: This can be achieved by letting the parsing process
208: assign a probability to each parse, 
209: on the basis of a probabilistic grammar.
210: In a typical application, the software may 
211: select those derivations for further processing
212: that have been given the highest
213: probabilities, and discard the others. 
214: The success of this approach relies on the accuracy of
215: the probabilistic model expressed by the probabilistic
216: grammar, i.e., whether the probabilities assigned
217: to derivations accurately reflect
218: the `true' probabilities in the domain at hand.
219: 
220: Probabilities are often estimated on the basis of a corpus,
221: i.e., a collection of sentences. The sentences in a corpus 
222: may be annotated with various kinds of information. 
223: One kind of annotation that is relevant for our discussion is
224: the preferred derivation for each sentence. 
225: Given a corpus with derivations, one may
226: estimate probabilities of rules by their relative
227: frequencies in the corpus. If a corpus is unannotated, 
228: more general techniques of maximum-likelihood estimation
229: can be used to estimate the probabilities of rules. 
230: (See \cite{SA97,CH98,CH99} for some formal properties of types of
231: maximum-likelihood estimation.)
232: 
233: The motivation for studying probabilistic models other than
234: those obtained by attaching probabilities to 
235: given context-free grammars is the
236: observation that more accurate models can be obtained by 
237: conditioning probabilities on `context information' beyond single
238: nonterminals \cite{CH90,CH94}. Furthermore, it
239: has been observed that conditioning on certain types of
240: context information can be achieved by first translating
241: context-free grammars to push-down automata,
242: according to some parsing strategy,
243: and then attaching probabilities to the transitions thereof 
244: \cite{SO99,RO99}. 
245: More concretely, for some parsing strategies,
246: the set of models that can be obtained by attaching
247: probabilities to a push-down automaton constructed from
248: a context-free grammar may include models that cannot be obtained by
249: attaching probabilities to that grammar.
250: 
251: An implicit assumption of this methodology is that,
252: conversely, any probabilistic model that can be obtained from a grammar
253: can also be obtained from the associated push-down automaton,
254: or in other words, the push-down automaton is at least as
255: powerful as the grammar in terms of the set of potential models.
256: If a parsing strategy does not satisfy this property, and
257: if some potential models are lost in the mapping from
258: the grammar to the push-down automaton, then this means that
259: in some cases
260: the strategy may lead to less rather than more accurate models.
261: That LR parsing cannot be extended to become
262: a probabilistic parsing strategy, as we mentioned above,
263: means that the above property is not satisfied by this parsing strategy.
264: This is contrary to what is suggested by some
265: publications on probabilistic LR parsing, such as
266: \cite{BR93} and \cite{IN00}, which fail to observe that
267: LR parsers may sometimes lead to less accurate models
268: than the grammars from which they were constructed.
269: 
270: Some studies, such as \cite{CO97,CH98a,CH01}, propose
271: lexicalized probabilistic context-free grammars, i.e., 
272: probabilistic models based on 
273: context-free grammars in which probabilities
274: heavily rely on the terminal elements from input strings.
275: Even if the current paper does not specifically deal with
276: lexicalization, much of what we discuss pertains
277: to lexicalized probabilistic context-free grammars as well.
278: 
279: The paper is organized as follows. After giving standard
280: definitions in Section~\ref{s:prel}, we give our formal
281: definition of `parsing strategy' in Section~\ref{s:strategy}.
282: We also define what it means to extend a parsing strategy to
283: become a probabilistic parsing strategy.
284: The CPP and the SPP are defined in Sections~\ref{s:cpp}
285: and~\ref{s:pred}, where we also discuss how these properties relate to
286: the question of which strategies can be extended to become
287: probabilistic.
288: Sections~\ref{s:strong} 
289: and~\ref{s:nonstrong} provide examples of parsing strategies
290: with and without the SPP. The examples without the SPP, 
291: most notably LR parsing, are
292: shown not to be extendible to become probabilistic.
293: A wider notion of extending a strategy to become probabilistic
294: is provided by Section~\ref{s:wide}. We show that
295: even under this wider notion,
296: LR parsing cannot be extended to become probabilistic.
297: Section~\ref{s:prefix} presents an application
298: that concerns prefix probabilities.
299: We end this paper with conclusions.
300: 
301: Some results reported here have appeared before in an abbreviated form
302: in \cite{NE02d}.
303: 
304: \section{Preliminaries}
305: \label{s:prel}
306: 
307: A context-free grammar (CFG) $\mygram$ is a 4-tuple 
308: $(\myterm,$ $\mynont,$ $S,$ $\myrule)$,
309: where $\myterm$ is a finite set of {\em terminals}, 
310: called the {\em alphabet},
311: $\mynont$ is a finite set of {\em nonterminals},
312: including the {\em start symbol\/} $S$, and $\myrule$ is a finite set of
313: {\em rules},
314: each of the form $A\de\alpha$, where $A\in \mynont$ and
315: $\alpha\in (\myterm \cup \mynont)^\ast$.
316: Without loss of generality, we assume that there is only one
317: rule $S \de \sigma$ with the start symbol in the left-hand side,
318: and furthermore that $\sigma \neq \epsilon$, where $\epsilon$
319: denotes the empty string.
320: 
321: For a fixed CFG $\mygram$, we
322: define the relation $\Rightarrow$ on triples consisting of two strings
323: $\alpha,\beta\in (\myterm \cup \mynont)^\ast$ and a rule 
324: $\pi\in\myrule$ by:
325: $\alpha \stackrel{\pi}{\Rightarrow} \beta$ if and only if
326: $\alpha$ is of the form $wA\delta$ and $\beta$ is of the
327: form $w\gamma\delta$, for some $w\in\myterm^\ast$ and
328: $\delta\in (\myterm \cup \mynont)^\ast$, and $\pi=(A\de\gamma)$.
329: A {\em left-most derivation\/} 
330: is a string $d = \pi_1 \cdots \pi_m$, $m \geq 0$,
331: such that $S \stackrel{\pi_1}{\Rightarrow} \cdots
332: \stackrel{\pi_m}{\Rightarrow}\alpha$, 
333: for some $\alpha \in (\myterm \cup \mynont)^\ast$.
334: We will identify a left-most derivation
335: with the sequence of strings over 
336: $\myterm \cup \mynont$ that arise in that
337: derivation. 
338: In the remainder of this paper, we will let
339: the term `derivation' refer to
340: `left-most derivation', unless specified otherwise.
341: 
342: A derivation $d = \pi_1 \cdots \pi_m$, $m \geq 0$,
343: such that $S \stackrel{\pi_1}{\Rightarrow} \cdots
344: \stackrel{\pi_m}{\Rightarrow}w$ where $w\in\myterm^\ast$ 
345: will be called a {\em complete\/} derivation;
346: we also say that $d$ is a derivation of $w$.
347: By {\em subderivation\/} we mean a substring of a
348: complete derivation of the form
349: $d = \pi_1 \cdots \pi_m$, $m \geq 0$,
350: such that $A \stackrel{\pi_1}{\Rightarrow} \cdots
351: \stackrel{\pi_m}{\Rightarrow}w$ for some $A$ and $w$.
352: 
353: We write $\alpha \Rightarrow^\ast \beta$ or
354: $\alpha \Rightarrow^+ \beta$ to denote the
355: existence of a string $\pi_1 \cdots \pi_m$ such that
356: $\alpha \stackrel{\pi_1}{\Rightarrow} \cdots
357: \stackrel{\pi_m}{\Rightarrow}\beta$,
358: with $m \geq 0$ or $m > 0$, respectively.
359: We say a CFG is {\em acyclic\/} if
360: $A \Rightarrow^+ A$ does not hold for any $A\in\mynont$.
361: 
362: For a CFG $\mygram$ we define the language $L(\mygram)$
363: it generates as the set of strings $w$
364: such that there is at least one derivation of $w$.
365: We say a CFG is {\em reduced\/} if for each rule $\pi\in\myrule$
366: there is a complete derivation in which it occurs.
367: 
368: A {\em probabilistic\/} context-free grammar (PCFG) is a pair
369: $(\mygram, p)$ consisting of a CFG 
370: $\mygram=(\myterm,$ $\mynont,$ $S,$ $\myrule)$ and 
371: a probability function $p$ from $\myrule$ to
372: real numbers in the interval $[0,1]$. We say a PCFG is {\em proper\/}
373: if $\Sigma_{\pi=(A\de\gamma)\in\myrule}\ p(\pi) = 1$ for
374: each $A\in\mynont$.
375: 
376: For a PCFG $(\mygram,p)$,
377: we define
378: the probability $p(d)$ of a string 
379: $d = \pi_1 \cdots \pi_m \in \myrule^\ast$
380: as $\prod_{i=1}^m\  p(\pi_i)$; 
381: we will in particular consider the probabilities of
382: derivations $d$.
383: The probability $p(w)$ of a string $w\in \myterm^\ast$ as defined by $(\mygram,p)$
384: is the sum of the probabilities of
385: all derivations of that string.
386: We say a PCFG $(\mygram,p)$ is {\em consistent\/} if
387: $\Sigma_{w \in \myterm^\ast}\ p(w) = 1$.
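
As a small illustration of these two notions (the grammar is chosen here
purely as an example and plays no further role), consider the CFG with rules
$\pi_0 = (S \de A)$, $\pi_1 = (A \de aA)$ and $\pi_2 = (A \de b)$,
and let $p(\pi_0) = 1$, $p(\pi_1) = q$ and $p(\pi_2) = 1-q$,
for some $q$ with $0 \leq q \leq 1$.
The resulting PCFG is proper for every such $q$.
The only derivation of $a^n b$ is $\pi_0 \pi_1^n \pi_2$, so that
\[
p(a^n b) = q^n (1-q)
\quad\mbox{and}\quad
\sum_{n \geq 0} q^n (1-q) = 1 \quad \mbox{if } q < 1 .
\]
Hence the PCFG is consistent for $q < 1$. For $q = 1$ it is still proper,
but $p(w) = 0$ for every string $w$, so it is not consistent.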
388: 
389: In this paper we will mainly consider push-down transducers
390: rather than push-down automata. Push-down transducers not
391: only compute derivations of the grammar while processing
392: an input string, but they also explicitly produce 
393: output strings from which these derivations can be obtained. 
394: We use transducers for two reasons.
395: First, constraints on the output strings allow
396: us to restrict our attention to `reasonable' parsing strategies.
397: Those strategies that cannot be formalized within these constraints
398: are unlikely to be of practical interest.
399: Secondly, mappings from input strings to derivations, such as those realized 
400: by push-down devices, turn out to be a very powerful abstraction 
401: and allow direct proofs of several general results.  
402: 
403: Unlike in many textbooks, our push-down devices do not
404: possess states next to stack symbols. This is without loss
405: of generality, since states can be encoded into the stack symbols,
406: given the types of transition that we allow.
407: Thus,
408: a push-down transducer (PDT) $\myaut$ is a 6-tuple
409: $(\myterm_1,$ $\myterm_2,$ $\mysym,$ $\Xinit,$ $\Xfinal,$ $\mytrans)$,
410: where $\myterm_1$ is the input alphabet,
411: $\myterm_2$ is the output alphabet,
412: $\mysym$ is a finite set of {\em stack symbols}
413: including the {\em initial stack symbol\/} $\Xinit$ and the
414: {\em final stack symbol\/} $\Xfinal$, and $\mytrans$ is the set of
415: {\em transitions}.
416: Each transition can have
417: one of the following three forms:
418: $\myep{X}{X Y}$ (a push transition),
419: $\myep{\it Y X}{Z}$ (a pop transition),  or
420: $\myscan{X}{x}{y}{Y}$ (a swap transition);
421: here $X$, $Y$, $Z\in \mysym$,
422: $x\in \myterm_1 \cup \{\epsilon\}$ 
423: and $y\in \myterm_2^\ast$.
424: Note that
425: in our notation, stacks grow from left to right, i.e., the top-most
426: stack symbol will be found at the right end.
427: 
428: Without loss of generality, we assume that any PDT is such that
429: for a given stack symbol $X\neq \Xfinal$, there are either one or more
430: push transitions
431: $\myep{X}{X Y}$, or one or more pop transitions 
432: $\myep{\it Y X}{Z}$, or one or more swap transitions
433: $\myscan{X}{x}{y}{Y}$, but no combinations of different types of
434: transition. If a PDT does not satisfy this normal form, it can
435: easily be brought into this form by introducing for each stack symbol
436: $X$ three new stack symbols $X_{\it push}$, $X_{\it pop}$
437: and $X_{\it swap}$ and new swap transitions 
438: $\myscan{X}{\epsilon}{\epsilon}{X_{\it push}}$,
439: $\myscan{X}{\epsilon}{\epsilon}{X_{\it pop}}$ and
440: $\myscan{X}{\epsilon}{\epsilon}{X_{\it swap}}$.
441: In each existing transition that operates on top-of-stack $X$,
442: we then replace $X$ by one from $X_{\it push}$, $X_{\it pop}$
443: or $X_{\it swap}$, depending on the type of that transition.
444: We also assume that $\Xfinal$ does not occur in the left-hand side
445: of a transition, again without loss of generality.
446: 
447: A {\em configuration\/} of a PDT is a triple
448: $(\alpha, w, v)$, where $\alpha \in \mysym^\ast$
449: is a stack, $w\in\myterm_1^\ast$ is the remaining input, and
450: $v\in\myterm_2^\ast$ is the output generated so far.
451: For a fixed PDT $\myaut$, we define
452: the relation $\pdamoverel$ on triples consisting of two
453: configurations and a transition $\tau$ by:
454: $(\gamma\alpha, xw, v) \pdamove{\tau} (\gamma\beta, w, vy)$ if and only if
455: $\tau$ is of the form
456: $\myep{\alpha}{\beta}$, where $x=y=\epsilon$, or of the form
457: $\myscan{\alpha}{x}{y}{\beta}$.
458: A {\em computation\/} on an input string $w$ is a string
459: $c=\tau_1 \cdots \tau_m$, $m \geq 0$, such that
460: $(\Xinit, w, \epsilon) \pdamove{\tau_1} \cdots \pdamove{\tau_m}
461: (\alpha, w', v)$.
462: A {\em complete\/} computation on a string $w$ is a computation
463: with $w'=\epsilon$ and $\alpha=\Xfinal$. The string $v$ is called
464: the {\em output\/} of the computation $c$, 
465: and is denoted by $\outp(c)$.
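
The definitions above can be illustrated by a minimal PDT, constructed here
only as an example. Assume $\myterm_1 = \{a\}$, $\myterm_2 = \{\pi\}$,
$\mysym = \{\Xinit, Y, Y', \Xfinal\}$, and the three transitions
$\tau_1 = (\myep{\Xinit}{\Xinit\ Y})$,
$\tau_2 = (\myscan{Y}{a}{\pi}{Y'})$ and
$\tau_3 = (\myep{\Xinit\ Y'}{\Xfinal})$.
This PDT satisfies the normal form, since $\Xinit$ allows only a push,
$Y$ only a swap and $Y'$ only a pop.
On input $a$ we have
\[
(\Xinit, a, \epsilon) \pdamove{\tau_1}
(\Xinit\ Y, a, \epsilon) \pdamove{\tau_2}
(\Xinit\ Y', \epsilon, \pi) \pdamove{\tau_3}
(\Xfinal, \epsilon, \pi) ,
\]
so that $c = \tau_1 \tau_2 \tau_3$ is a complete computation on $a$,
with $\outp(c) = \pi$.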
466: 
467: We will identify a
468: computation with the sequence of configurations
469: that arise in that computation,
470: where the first configuration is determined by the context.
471: We also write
472: $(\alpha,w,v) \pdamoves (\beta,w',v')$ or 
473: $(\alpha,w,v) \pdamovesname{c} (\beta,w',v')$, 
474: for $\alpha,\beta \in \mysym^\ast$, 
475: $w,w'\in \myterm_1^\ast$ and $v,v'\in\myterm_2^\ast$,
476: to indicate that $(\beta,w',v')$ can be obtained
477: from $(\alpha,w,v)$ by applying a sequence $c$ of zero or more
478: transitions; we refer to such a sequence $c$ 
479: as a {\em subcomputation}. 
480: The function $\outp$ is
481: extended to subcomputations in a natural way.
482: 
483: For a PDT $\myaut$, we define the language $L(\myaut)$ it
484: accepts as the set of strings $w$ such that there is
485: at least one complete computation on $w$.
486: We say a PDT is {\em reduced\/} if 
487: each transition $\tau\in\mytrans$ occurs in some complete computation.
488: 
489: A {\em probabilistic\/} push-down transducer (PPDT) is a pair
490: $(\myaut, p)$ consisting of a PDT $\myaut$ and
491: a probability function $p$ from the set $\mytrans$ of
492: transitions of $\myaut$ to
493: real numbers in the interval $[0,1]$.
494: We say a PPDT $(\myaut, p)$ is {\em proper\/} if
495: \begin{itemize}
496: \item 
497: $\Sigma_{\tau=(\myep{X}{X Y})\in\mytrans}\ p(\tau) = 1$
498: for each $X\in\mysym$ such that there is at least one
499: transition $\myep{X}{X Y}$, $Y \in \mysym$;
500: \item
501: $\Sigma_{\tau=(\myscan{X}{x}{y}{Y})\in\mytrans}\ p(\tau) = 1$
502: for each $X\in\mysym$ such that there is at least one
503: transition $\myscan{X}{x}{y}{Y}$, 
504: $x\in\myterm_1\cup\{\epsilon\},y\in\myterm_2^\ast,Y\in\mysym$; 
505: and
506: \item
507: $\Sigma_{\tau=(\myep{Y X}{Z})\in\mytrans}\ p(\tau) = 1$,
508: for each $X,Y\in\mysym$ such that there is at least one
509: transition $\myep{Y X}{Z}$, $Z\in\mysym$.
510: \end{itemize}
511: 
512: For a PPDT $(\myaut,p)$,
513: we define the probability $p(c)$
514: of a (sub)computation $c=\tau_1 \cdots \tau_m$ 
515: as $\prod_{i=1}^m\  p(\tau_i)$.
516: The probability $p(w)$ of a string $w$ as defined by $(\myaut,p)$
517: is the sum of the probabilities of
518: all complete computations on that string.
519: We say a PPDT $(\myaut,p)$ is {\em consistent\/} if
520: $\Sigma_{w \in \myterm_1^\ast}\ p(w) = 1$.
521: 
522: We say a PCFG $(\mygram,p)$ is reduced if $\mygram$ is reduced,
523: and we say a PPDT $(\myaut,p)$ is reduced if $\myaut$ is reduced.
524: 
525: \section{Parsing strategies}
526: \label{s:strategy}
527: 
528: The term `parsing strategy' is often used informally to
529: refer to a class of parsing algorithms that behave similarly
530: in some way. In this paper, we assign a formal
531: meaning to this term, relying on the 
532: observation by \cite{LA74,BI89} that many
533: parsing algorithms for CFGs can be described in two steps.
534: The first is a construction of push-down devices 
535: from CFGs, and the second is
536: a method for handling nondeterminism 
537: (e.g.\ backtracking or dynamic programming).
538: Parsing algorithms that handle nondeterminism in
539: different ways but apply the same construction of
540: push-down devices from CFGs are seen as realizations of
541: the same parsing strategy.
542: 
543: Thus, we define a {\em parsing strategy\/} to be a function 
544: $\mystrat$ that maps 
545: a reduced CFG $\mygram
546: =(\myterm_1,$ $\mynont,$ $S,$ $\myrule)$ to
547: a pair $\mystrat(\mygram)=(\myaut,f)$ consisting of a 
548: reduced PDT $\myaut=(\myterm_1,$ $\myterm_2,$ $\mysym,$ 
549: $\Xinit,$ $\Xfinal,$ $\mytrans)$, and a function $f$ that maps a subset of 
550: $\myterm_2^\ast$ to a subset of $\myrule^\ast$,
551: with the following properties:
552: \begin{itemize}
553: \item $\myrule \subseteq \myterm_2$.
554: \item For each string $w\in\myterm_1^\ast$ and each
555: complete computation $c$ on $w$,
556: $f(\outp(c))=d$ is a derivation of $w$.
557: Furthermore, each symbol from $\myrule$
558: occurs as often in $\outp(c)$ as it occurs in $d$.
559: \item Conversely, for each string $w\in\myterm_1^\ast$ and 
560: each derivation $d$ of $w$,
561: there is precisely one complete computation $c$ on $w$ such that
562: $f(\outp(c)) = d$.
563: \end{itemize}
564: If $c$ is a complete computation, we will write
565: $f(c)$ to denote $f(\outp(c))$. The conditions
566: above then imply that $f$ is
567: a bijection from complete computations to complete derivations.
568: 
569: Note that output strings
570: of (complete) computations may contain symbols that are not in $\myrule$,
571: and the symbols that are in $\myrule$ may occur in a different
572: order in $v$ than in $f(v)=d$. The purpose of the symbols
573: in $\myterm_2 - \myrule$ is to help this process of reordering
574: of symbols in $\myrule$.
575: For a string $v \in \myterm_2^\ast$ we let $\overline{v}$ refer
576: to the maximal subsequence of symbols from $v$ that belong to $\myrule$,
577: or in other words, string $\overline{v}$ is obtained by erasing
578: from $v$ all occurrences of symbols from $\myterm_2 - \myrule$.
579: 
580: A {\em probabilistic parsing strategy\/} is defined to be a function
581: $\mystrat$ that maps a reduced, proper and consistent
582: PCFG $(\mygram, p_{\mygram})$ 
583: to a triple $\mystrat(\mygram,  p_{\mygram})=(\myaut, p_{\myaut}, f)$,
584: where $(\myaut, p_{\myaut})$ is a reduced, proper and consistent PPDT,
585: with the same properties as a
586: (non-probabilistic) parsing strategy, and in addition:
587: \begin{itemize}
588: \item
589: For each complete derivation $d$ and
590: each complete computation $c$ such that $f(c)=d$,
591: $p_{\mygram}(d)$ equals $p_{\myaut}(c)$.
592: \end{itemize}
593: In other words, a complete computation has the same probability
594: as the complete derivation that it is mapped to by
595: function $f$.
596: An implication of this property is that for each string $w\in\myterm_1^\ast$,
597: the probabilities assigned to that string
598: by $(\mygram, p_{\mygram})$ and $(\myaut,p_{\myaut})$ are equal.
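
This implication can be spelled out as follows.
For a string $w$, let $C(w)$ denote the set of complete computations on $w$
and $D(w)$ the set of derivations of $w$; this notation is introduced
only for the present remark.
By the properties of parsing strategies, $f$ restricted to $C(w)$ is a
bijection onto $D(w)$, and therefore
\[
p_{\myaut}(w) = \sum_{c \in C(w)} p_{\myaut}(c)
= \sum_{c \in C(w)} p_{\mygram}(f(c))
= \sum_{d \in D(w)} p_{\mygram}(d)
= p_{\mygram}(w) .
\]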
599: 
600: We say that probabilistic parsing strategy $\mystrat'$ 
601: is an {\em extension\/} of parsing strategy $\mystrat$ if
602: for each reduced CFG $\mygram$ and probability function $p_{\mygram}$
603: we have
604: $\mystrat(\mygram)=(\myaut, f)$ if and only if
605: $\mystrat'(\mygram, p_{\mygram})=(\myaut, p_{\myaut}, f)$
606: for some $p_{\myaut}$.
607: 
608: In the following sections we will investigate which
609: parsing strategies can be extended to become
610: probabilistic parsing strategies.
611: 
612: \section{Correct-prefix property}
613: \label{s:cpp}
614: 
615: For a given PDT,
616: we say a computation $c$ is {\em dead\/} if 
617: $(\Xinit, w_1, \epsilon)$ $\pdamovesname{c}$
618: $(\alpha, \epsilon, v_1)$, for some $\alpha\in\mysym^\ast$, 
619: $w_1\in \myterm_1^\ast$ and $v_1\in\myterm_2^\ast$,
620: and there are no
621: $w_2\in \myterm_1^\ast$ and $v_2\in\myterm_2^\ast$ such that
622: $(\alpha, w_2, \epsilon) \pdamoves (\Xfinal, \epsilon, v_2)$.
623: Informally, a dead computation is a computation that
624: cannot be continued to become a complete computation.
625: 
626: We say that a PDT has the {\em correct-prefix property\/} (CPP) if
627: it does not allow any dead computations.
628: We say that a parsing strategy has the CPP if it maps each
629: reduced CFG to a PDT that has the CPP.
630: 
631: In this section we show that the correct-prefix property is 
632: a necessary condition
633: for extending a parsing strategy to a probabilistic parsing strategy.
634: For this we need two lemmas.
635: 
636: \begin{lemma}
637: \label{l:pcfg}
638: For each reduced CFG $\mygram$, there is a probability function
639: $p_{\mygram}$ such that 
640: PCFG $(\mygram,p_{\mygram})$ is proper and consistent,
641: and $p_{\mygram}(d) > 0$ for all complete derivations~$d$.
642: \end{lemma}
643: 
644: \proof
645: Since $\mygram$ is reduced, there is a finite set $L$ consisting of
646: complete derivations $d$, such that for each rule $\pi$ in $\mygram$
647: there is at least
648: one $d\in L$ in which $\pi$ occurs.
649: Let $n_{\pi,d}$ be the number of occurrences of rule $\pi$ in
650: derivation $d\in L$, and let $n_{\pi}$ be
651: $\Sigma_{d\in L}\ n_{\pi,d}$, 
652: the total number of occurrences of $\pi$ in $L$.
653: Let $n_A$ be the sum of $n_{\pi}$ for all rules 
654: $\pi$ with $A$ in the left-hand side. A probability function
655: $p_{\mygram}$ can be defined through
656: `maximum-likelihood estimation' such that
657: $p_{\mygram}(\pi) = \frac{n_{\pi}}{n_A}$ for each rule
658: $\pi = A \de \alpha$. 
659: 
660: For all nonterminals $A$, 
661: $\Sigma_{\pi = A \de \alpha}\ p_{\mygram}(\pi)$ $=$
662: $\Sigma_{\pi = A \de \alpha}\ \frac{n_{\pi}}{n_A}$ $=$ $\frac{n_A}{n_A}$ $=$ $1$,
663: which means that the PCFG $(\mygram,p_{\mygram})$ is proper.
664: Furthermore, \cite{CH98} has shown that a PCFG $(\mygram,p_{\mygram})$ 
665: is consistent if
666: $p_{\mygram}$ was obtained by maximum-likelihood estimation using 
667: a set of derivations.
668: Finally, since $n_{\pi} > 0$ for each $\pi$, also
669: $p_{\mygram}(\pi) > 0$ for each $\pi$, and
670: $p_{\mygram}(d) > 0$ for all complete derivations $d$.~\closeproof
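
To make the construction concrete, consider as an example (not taken from
the remainder of the paper) the reduced CFG with rules
$\pi_0 = (S \de A)$, $\pi_1 = (A \de aA)$ and $\pi_2 = (A \de b)$,
and take $L = \{ \pi_0 \pi_1 \pi_1 \pi_2 \}$, which contains only the
derivation of $aab$. Every rule occurs in $L$, and
\[
n_{\pi_0} = 1, \quad n_{\pi_1} = 2, \quad n_{\pi_2} = 1, \quad
n_S = 1, \quad n_A = 3 ,
\]
which gives $p_{\mygram}(\pi_0) = 1$, $p_{\mygram}(\pi_1) = \frac{2}{3}$
and $p_{\mygram}(\pi_2) = \frac{1}{3}$.
The unique derivation of $a^n b$ then has probability
$(\frac{2}{3})^n \cdot \frac{1}{3} > 0$, and these probabilities sum to 1
over all $n \geq 0$, in agreement with the lemma.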
671: 
672: We say a computation is a {\em shortest\/} dead computation if
673: it is dead and none of its proper prefixes is dead.
674: Note that each dead computation has a unique prefix that is a
675: shortest dead computation.
676: For a PDT $\myaut$, let $\mypartial$ be the union of the set of
677: all complete computations and the set of all
678: shortest dead computations.
679: 
680: \begin{lemma}
681: \label{l:partial}
682: For each proper PPDT $(\myaut, p_{\myaut})$, 
683: $\Sigma_{c \in \mypartial}\  p_{\myaut}(c) \leq 1$.
684: \end{lemma}
685: 
686: \proof
687: The proof is a trivial variant of the proof
688: that for a proper PCFG $(\mygram, p_{\mygram})$,
689: the sum of $p_{\mygram}(d)$ for all derivations $d$ cannot
690: exceed 1, which is shown by \cite{BO73}.~\closeproof
691: 
692: {}From this, the main result of this section follows.
693: 
694: \begin{theorem}
695: \label{t:cpp}
696: A parsing strategy that lacks the CPP cannot be extended to 
697: become a probabilistic parsing strategy.
698: \end{theorem}
699: 
700: \proof
701: Take a parsing strategy $\mystrat$ that does not have the CPP. 
702: Then there is a reduced
703: CFG $\mygram= (\myterm_1,$ $\mynont,$ $S,$ $\myrule)$, 
704: with $\mystrat(\mygram) = (\myaut,f)$ for some
705: $\myaut$ and $f$, and a shortest dead computation $c$ allowed 
706: by $\myaut$.
707: 
708: It follows from Lemma~\ref{l:pcfg} that there is a probability function
709: $p_{\mygram}$ such that
710: $(\mygram, p_{\mygram})$ is a proper and consistent PCFG and
711: $p_{\mygram}(d) > 0$ for all complete derivations $d$.
712: Assume we also have a probability function $p_{\myaut}$ such that
713: $(\myaut, p_{\myaut})$ is a proper and consistent PPDT
714: that assigns the same probabilities to strings over $\myterm_1$ as
715: $(\mygram, p_{\mygram})$. Since $\myaut$ is reduced, each
716: transition $\tau$ must occur in some complete computation $c'$. Furthermore,
717: for each complete computation $c'$ there is a complete derivation $d$
718: such that $f(c') = d$, and $p_{\myaut}(c') = p_{\mygram}(d) > 0$. 
719: Therefore, $p_{\myaut}(\tau) > 0$ for each
720: transition $\tau$, and $p_{\myaut}(c) > 0$,
721: where $c$ is the above-mentioned dead computation.
722: 
723: Due to Lemma~\ref{l:partial},
724: $1 \geq \Sigma_{c' \in \mypartial}\  p_{\myaut}(c') \geq
725: \Sigma_{w \in \myterm_1^\ast}\ p_{\myaut}(w) + p_{\myaut}(c) >
726: \Sigma_{w \in \myterm_1^\ast}\ p_{\myaut}(w) = 
727: \Sigma_{w \in \myterm_1^\ast}\ p_{\mygram}(w)$.
728: This is in contradiction with the consistency of $(\mygram, p_{\mygram})$.
Hence, a probability function $p_{\myaut}$ with the properties we
730: required above cannot exist, and therefore $\mystrat$ cannot be extended 
731: to become
732: a probabilistic parsing strategy.~\closeproof
733: 
734: \section{Strong predictiveness}
735: \label{s:pred}
736: 
737: For a fixed PDT, we define the binary relation
738: $\pdagoto$ on stack symbols by:
739: $Y\pdagoto Y'$ if and only if 
740: $(Y, w, \epsilon) \pdamoves (Y',\epsilon, v)$ for some
741: $w \in \myterm_1^\ast$ and $v\in \myterm_2^\ast$. 
742: In other words,
743: some subcomputation may start with stack $Y$ and 
744: end with stack $Y'$. Note that all 
745: stacks that occur in such a subcomputation 
746: must have height of~1 or more.
747: 
748: We say that a PDT has the {\em strong predictiveness property\/} 
749: (SPP) if the existence of
750: three transitions $\myep{X}{X Y}$, $\myep{X Y_1}{Z_1}$ and
751: $\myep{X Y_2}{Z_2}$ such that $Y\pdagoto Y_1$ and
752: $Y\pdagoto Y_2$ implies $Z_1 = Z_2$. 
753: Informally, this means that
754: when a subcomputation starts with 
755: some stack $\alpha$ and some push transition $\tau$, 
756: then solely on the basis of $\tau$
757: we can uniquely determine
758: what stack symbol $Z_1 = Z_2$ will be on top of the stack in
759: the first configuration 
760: with stack height equal to $|\alpha|$. 
761: Another way of looking at
762: it is that no information may flow from higher stack elements
763: to lower stack elements that
764: was not already predicted before these higher stack elements
765: came into being, hence the term `strong predictiveness'.%
766: \footnote{There is a property of push-down devices called
767: {\em faiblement pr{\'e}dictif\/} (weakly predictive) \cite{VI93a}.
768: Contrary to what this name may suggest however, this property
769: is incomparable with the complement of our notion of SPP.}
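
As a simple illustration, consider a hypothetical PDT fragment,
not derived from any particular parsing strategy, that contains
the push transition $\myep{X}{X Y}$,
the swap transitions $\myscan{Y}{a}{\epsilon}{Y_1}$ and
$\myscan{Y}{b}{\epsilon}{Y_2}$, and
the pop transitions $\myep{X Y_1}{Z_1}$ and $\myep{X Y_2}{Z_2}$,
where $Z_1 \neq Z_2$.
We then have $Y\pdagoto Y_1$ and $Y\pdagoto Y_2$, so that these
transitions witness a violation of the SPP:
whether $Z_1$ or $Z_2$ replaces $X$ depends on whether $a$ or $b$ was
read while $Y$ was on top of $X$, which is not determined by the
push transition alone.
If we had $Z_1 = Z_2$ instead, this fragment by itself would be
compatible with the SPP.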
770: 
771: We say that a parsing strategy has the SPP if it maps each
772: reduced CFG to a PDT with the SPP.
773: 
774: In the previous section it was shown that we may restrict ourselves
775: to parsing strategies that have the CPP. Here we show that
776: if, in addition, a parsing strategy has the SPP, then it can
777: always be extended to become a probabilistic parsing strategy.
778: 
779: \begin{theorem}
780: \label{t:sp}
781: Any parsing strategy that has the CPP and the SPP
782: can be extended to become a probabilistic parsing strategy.
783: \end{theorem}
784: 
785: \proof
786: Take a parsing strategy $\mystrat$ that 
787: has the CPP and the SPP,
788: and take a reduced PCFG $(\mygram,p_{\mygram})$,
789: where $\mygram = (\myterm_1,$ $\mynont,$  $S,$ $\myrule)$,   
790: and let $\mystrat(\mygram) = (\myaut,f)$, for some PDT $\myaut$ and
791: function $f$.
792: We will show that there is a probability function 
793: $p_{\myaut}$ such that $(\myaut, p_{\myaut})$ is a PPDT and
794: $p_{\myaut}(c) = p_{\mygram}(f(c))$ for all complete computations $c$.
795: 
796: For each stack symbol $X$, consider the set of transitions that
797: are applicable with top-of-stack $X$. Remember that our normal form
798: ensures that all such transitions are of the same type.
799: Suppose this set consists of $m$ swap transitions
800: ${\tau_i} = \myscan{X}{x_i}{y_i}{Y_i}$, $1 \leq i \leq m$.
801: For each $i$,
802: consider all subcomputations of the form
803: $({\it X}, x_iw, \epsilon)$ $\pdamove{\tau_i}$ $({\it Y_i}, w, y_i)$
804: $\pdamoves$
805: $({\it Y'}, \epsilon, v)$ such that there is at least one
806: pop transition of the form $\myep{{\it Z Y'}}{Z'}$ or 
807: such that $Y' = \Xfinal$,
808: and define $L_{\tau_i}$ as the set of strings $v$
809: output by these subcomputations.
810: We also define $L_X = \cup_{j=1}^m\  L_{\tau_j}$, the set
811: of all strings output by subcomputations starting with top-of-stack 
812: $X$, and ending just before
813: a pop transition that leads to a stack with height smaller than that 
814: of the stack at the beginning, or ending with the final stack symbol $\Xfinal$.
815: 
816: Now define for each $i$ ($1 \leq i \leq m$):
817: \begin{eqnarray}
818: \label{e:normalized}
819: p_{\myaut}(\tau_i) &=&
820: \frac{  \Sigma_{v\in L_{\tau_i}}\ p_{\mygram}(\overline{v}) }{
821:         \Sigma_{v\in L_{X}}\ p_{\mygram}(\overline{v}) }
822: \end{eqnarray}
823: In other words, the probability of a transition is the normalized
824: probability of the set of subcomputations starting with that transition,
825: relating subcomputations with fragments of derivations of the PCFG.
826:  
These probabilities are well-defined. Since $\myaut$ is reduced and has
828: the CPP, the sets $L_{\tau_i}$ are non-empty and thereby
829: the denominator in the definition of $p_{\myaut}(\tau_i)$
830: is non-zero. Furthermore,
831: $\Sigma_{i=1}^m\ p_{\myaut}(\tau_i)$ is clearly $1$.
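
As a purely numerical illustration, with values that are hypothetical
and not derived from any grammar, let $m=2$ and suppose that
$L_{\tau_1}$ and $L_{\tau_2}$ are disjoint, with
$\Sigma_{v\in L_{\tau_1}}\ p_{\mygram}(\overline{v}) = 0.2$ and
$\Sigma_{v\in L_{\tau_2}}\ p_{\mygram}(\overline{v}) = 0.3$.
Then $\Sigma_{v\in L_{X}}\ p_{\mygram}(\overline{v}) = 0.5$, and
the definition above gives:
\begin{eqnarray*}
p_{\myaut}(\tau_1) &=& \frac{0.2}{0.5} \ =\ 0.4 \\
p_{\myaut}(\tau_2) &=& \frac{0.3}{0.5} \ =\ 0.6
\end{eqnarray*}
The two values sum to $1$, although the unnormalized sums
add up to only $0.5$.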
832:  
833: Now suppose the set of transitions for $X$ consists of $m$
834: push transitions
835: ${\tau_i} = \myep{X}{X Y_i}$, $1 \leq i \leq m$.
836: For each $i$,
837: consider all subcomputations of the form
838: $({\it X}, w, \epsilon)$ $\pdamove{\tau_i}$ $({\it XY_i}, w, \epsilon)$
839: $\pdamoves$
840: $({\it X'}, \epsilon, v)$ such that there is at least one
841: pop transition of the form $\myep{{\it Z X'}}{Z'}$ or $X' = \Xfinal$,
842: and define $L_{\tau_i}$, $L_X$ and $p_{\myaut}(\tau_i)$ as we have done
843: above for the swap transitions.
844:  
845: Suppose the set of transitions for $X$ consists of $m$
846: pop transitions 
847: ${\tau_i} = \myep{\it Y_iX}{Z_i}$, $1 \leq i \leq m$. 
848: Define
849: $L_X = \{\epsilon\}$, and $p_{\myaut}(\tau_i)=1$ for each $i$.
850: To see that this is compatible with the condition of properness
851: of PPDTs,
852: note the following.
853: Since we may assume $\myaut$ is reduced,
854: if $Y_i = Y_j$ for some $i$ and $j$ with $1 \leq i,j \leq m$, 
855: then there is at least one
856: transition $\myep{Y_i}{\it Y_i X'}$ for some $X'$ such that
857: $X'\pdagoto X$. Due to the SPP, $Z_i = Z_j$ and therefore $i=j$.
858: 
859: Finally, we define $L_{\Xfinal} = \{\epsilon\}$.
860: 
861: Take a subcomputation $({\it X}, w, \epsilon)$
862: $\pdamovesname{c}$
863: $({\it Y}, \epsilon, v)$ 
864: such that there is at least one
865: pop transition of the form $\myep{{\it Z Y}}{Y'}$ or $Y = \Xfinal$.
866: Below we will prove that:
867: \begin{eqnarray}
868: \label{e:partialp}
869: p_{\myaut}(c) &=& 
870: \frac{ p_{\mygram}(\overline{v}) }{
871: 	 \Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) }
872: \end{eqnarray}
873: Since a complete computation $c$ with output $v$ is of this form,
874: with $X = \Xinit$ and $Y= \Xfinal$, we
875: obtain the result we required to prove Theorem~\ref{t:sp},
876: where $D$ denotes the set of all 
877: complete derivations of CFG $\mygram$:
878: \begin{eqnarray} 
879: p_{\myaut}(c) &=& 
880: \frac{ p_{\mygram}(\overline{v}) }{ 
881: 	\Sigma_{v'\in L_{\Xinit}}\ p_{\mygram}(\overline{v'}) } \\
882: &=& \frac{ p_{\mygram}(f(c)) }{
883:         \Sigma_{v'\in L_{\Xinit}}\ p_{\mygram}(f(v')) } \\
884: &=& \frac{ p_{\mygram}(f(c)) }{
885:         \Sigma_{d\in D}\ p_{\mygram}(d) } \\
886: &=&
887: p_{\mygram}(f(c))
888: \end{eqnarray}
889: We have used two properties of $f$ here. The first is that it
890: preserves the frequencies of symbols from $\myrule$, if considered as a
891: mapping from output strings to derivations.
The second property is that it can be considered as a bijection from
893: complete computations to derivations. Lastly we have used
894: consistency of PCFG $(\mygram,  p_{\mygram})$, meaning that
895: $\Sigma_{d\in D}\ p_{\mygram}(d) = 1$.
896: 
897: For the proof of~(\ref{e:partialp}), we proceed by induction
898: on the length of $c$ and distinguish three cases.
899: 
900: Case~1: Consider a subcomputation $c$ consisting of zero transitions,
901: which naturally has output $v=\epsilon$,
902: with only configuration
903: $({\it X}, \epsilon, \epsilon)$, where there is at least one
904: pop transition of the form $\myep{{\it Z X}}{Z'}$ or $X = \Xfinal$.
905: We trivially have $p_{\myaut}(c)$ $=$ $1$
906: and
907: $\frac{ p_{\mygram}(\overline{v}) }{
908:          \Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'})}$ $=$
909: $\frac{ p_{\mygram}(\epsilon) }{
910:          \Sigma_{v'\in \{\epsilon\}}\ p_{\mygram}(\overline{v'})}$ $=$ $1$.
911: 
912: Case~2: Consider a subcomputation
913: $c=\tau_i c'$, where $({\it X}, x_iw, \epsilon)$
914: $\pdamove{\tau_i}$ $({\it Y_i}, w, y_i)$
915: $\pdamovesname{c'}$
916: $({\it Y'}, \epsilon, y_i v)$, such that there is at least one
917: pop transition of the form $\myep{{\it Z Y'}}{Z'}$ or $Y' = \Xfinal$.
918: The induction hypothesis states that:
919: \begin{eqnarray}
920: p_{\myaut}(c') &=&
921: \frac{ p_{\mygram}(\overline{v}) }{
922:          \Sigma_{v'\in L_{Y_i}}\ p_{\mygram}(\overline{v'}) }
923: \end{eqnarray}
924: If we combine this with the definition of $p_{\myaut}$, we obtain:
925: \begin{eqnarray} 
926: p_{\myaut}(c) &=& p_{\myaut}(\tau_i) \cdot  p_{\myaut}(c') \\
927: &=& 
928: \frac{  \Sigma_{v'\in L_{\tau_i}}\ p_{\mygram}(\overline{v'}) }{
929:         \Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) } \cdot
930: \frac{ p_{\mygram}(\overline{v}) }{
931:          \Sigma_{v'\in L_{Y_i}}\ p_{\mygram}(\overline{v'}) } \\
932: &=&
933: \frac{   p_{\mygram}(\overline{y_i}) \cdot \Sigma_{v'\in L_{Y_i}}\ 
934: 			p_{\mygram}(\overline{v'}) }{
935:         \Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) } \cdot
936: \frac{ p_{\mygram}(\overline{v}) }{
937:          \Sigma_{v'\in L_{Y_i}}\ p_{\mygram}(\overline{v'}) } \\
938: &=&  
939: \frac{	p_{\mygram}(\overline{y_i}) \cdot p_{\mygram}(\overline{v}) }{
940: 	 \Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) }  \\
941: &=& 
942: \frac{  p_{\mygram}(\overline{y_i v}) }{
943: 	\Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) } 
944: \end{eqnarray}
945: 
946: Case~3:  Consider a subcomputation
947: $c$ of the form $({\it X}, w, \epsilon)$ $\pdamove{\tau_i}$ 
948: $({\it XY_i}, w, \epsilon)$
949: $\pdamoves$
950: $({\it X''}, \epsilon, v)$ such that there is at least one
951: pop transition of the form $\myep{{\it Z X''}}{Z'}$ or $X'' = \Xfinal$.
952: Subcomputation $c$ can be decomposed in a unique way as 
953: $c=\tau_i c' \tau c''$,
954: consisting of an application of a push transition
955: $\tau_i = \myep{X}{X Y_i}$,
956: a subcomputation
957: $({\it Y_i}, w_1, \epsilon)$
958: $\pdamovesname{c'}$
959: $({\it Y'}, \epsilon, v_1)$,
960: an application of a pop transition
961: $\tau = \myep{XY'}{X_i'}$,
962: and a subcomputation 
963: $({\it X_i'}, w_2, \epsilon)$
964: $\pdamovesname{c''}$ 
965: $({\it X''}, \epsilon, v_2)$,
966: where $w=w_1w_2$ and $v=v_1 v_2$.
967: This is visualized in Figure~\ref{fig:stack}.
968: \begin{figure}
969: %Mag 100
970: \begin{center}
971: \epsfbox{stack.eps}
972: \end{center}
973: \caption{Development of the stack in the computation
974: $c=\tau_i c' \tau c''$.}
975: \label{fig:stack}
976: \end{figure}
977: 
978: We can now use the induction hypothesis twice, resulting in:
979: \begin{eqnarray}
980: p_{\myaut}(c') &=&
981: \frac{ p_{\mygram}(\overline{v_1}) }{
982:          \Sigma_{v'_1\in L_{Y_i}}\ p_{\mygram}(\overline{v'_1}) }
983: \end{eqnarray}
984: and
985: \begin{eqnarray}
986: p_{\myaut}(c'') &=&
987: \frac{ p_{\mygram}(\overline{v_2}) }{
988:          \Sigma_{v'_2\in L_{X_i'}}\ p_{\mygram}(\overline{v'_2}) }
989: \end{eqnarray}
990: 
991: If we combine this with the definition of $p_{\myaut}$, 
992: we obtain:
993: \begin{eqnarray}
994: p_{\myaut}(c) &=& p_{\myaut}(\tau_i) \cdot p_{\myaut}(c') 
995: 		\cdot p_{\myaut}(\tau)
996: 		\cdot p_{\myaut}(c'') \\
997: &=& 
998: \frac{	\Sigma_{v'\in L_{\tau_i}}\ p_{\mygram}(\overline{v'}) }{
999:          \Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) }
1000: \cdot 
1001: \frac{  p_{\mygram}(\overline{v_1}) }{ 
1002:          \Sigma_{v'_1\in L_{Y_i}}\ p_{\mygram}(\overline{v'_1}) } 
1003: \cdot 1
1004: \cdot  \frac{ p_{\mygram}(\overline{v_2}) }{
1005:          \Sigma_{v'_2\in L_{X_i'}}\ p_{\mygram}(\overline{v'_2}) }
1006: \end{eqnarray} 
1007: 
1008: Since $\myaut$ has the SPP, $X_i'$ is unique to $\tau_i$ and
1009: the output strings in $L_{\tau_i}$ are precisely those that
1010: can be obtained by concatenating
1011: an output string in $L_{Y_i}$ and
1012: an output string in $L_{X_i'}$.
1013: Therefore $\Sigma_{v'\in L_{\tau_i}}\ p_{\mygram}(\overline{v'})$
1014: $=$
1015: $\Sigma_{v'_1\in L_{Y_i}} \Sigma_{v'_2\in L_{X_i'}}\ 
1016: 		p_{\mygram}(\overline{v'_1} \overline{v'_2})$
1017: $=$
1018: $\Sigma_{v'_1\in L_{Y_i}}\ p_{\mygram}(\overline{v'_1})$ $\cdot$
1019: $\Sigma_{v'_2\in L_{X_i'}}\ p_{\mygram}(\overline{v'_2})$, 
1020: and
1021: \begin{eqnarray}
1022: p_{\myaut}(c) &=&
1023: \frac{  p_{\mygram}(\overline{v_1}) \cdot p_{\mygram}(\overline{v_2}) }{
1024: 	\Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) } \\
1025: &=& 
1026: \frac{ p_{\mygram}(\overline{v_1 v_2}) }{
1027: 	\Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) }  \\
1028: &=& 
1029: \frac{ p_{\mygram}(\overline{v}) }{
1030: 	\Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) } 
1031: \end{eqnarray} 
1032: This concludes the proof.~\closeproof
1033: 
1034: Note that the definition of $p_{\myaut}$ in the above proof relies on the
1035: strings output by $\myaut$. This is the main reason
1036: why we needed to consider push-down transducers rather
1037: than push-down automata (defined below). 
1038: Now assume an appropriate probability
1039: function $p_{\myaut}$ has been found such that 
1040: $(\myaut,p_{\myaut})$ is a PPDT that assigns the same
1041: probabilities to computations as
1042: the given PCFG assigns to the corresponding derivations, 
1043: following the construction from the proof above. 
1044: Then the probabilities assigned to strings over the input
1045: alphabet are also equal.
1046: We may subsequently ignore
1047: the output strings if the application at hand merely requires
1048: probabilistic recognition rather than probabilistic transduction,
1049: or in other words, we may simplify push-down
1050: transducers to push-down automata.
1051: 
1052: Formally, a {\em push-down automaton\/} (PDA) $\myaut$ is a 5-tuple
1053: $(\myterm,$ $\mysym,$ $\Xinit,$ $\Xfinal,$ $\mytrans)$,
1054: where $\myterm$ is the input alphabet, and
1055: $\mysym,$ $\Xinit,$ $\Xfinal$ and $\mytrans$ are 
1056: as in the definition of PDTs.
1057: Push and pop transitions are as before, but swap transitions are 
1058: simplified to the form
$\myscanrec{X}{x}{Y}$, where $x \in \{\epsilon\} \cup \myterm$.
1060: Computations are defined as in the case of PDTs, except that configurations
1061: are now pairs $(\alpha,w)$ whereas they were triples $(\alpha,w,v)$
1062: in the case of PDTs. A {\em probabilistic\/} push-down automaton (PPDA) is 
1063: a pair $(\myaut,p_{\myaut})$, where $\myaut$ is a PDA
1064: and $p_{\myaut}$ is a probability function
1065: subject to the same constraints as in the case of
1066: PPDTs.
1067: Since the definitions of CPP and SPP for PDTs did not refer to output strings,
1068: these notions carry over to PDAs in a straightforward way.
1069: 
1070: We define the size of a CFG as 
1071: $\sum_{(A \de \alpha) \in \myrule} |A\alpha|$,
1072: the total number of occurrences of
1073: terminals and nonterminals in the set of rules.
1074: Similarly,
1075: we define the size of a PDA as 
1076: $\sum_{(\myep{\alpha}{\beta})\in \mytrans} |\alpha\beta|+
1077: \sum_{(\myscanrec{X}{x}{Y})\in \mytrans} |{\it XxY}|$,
1078: the total number of occurrences of
1079: stack symbols and terminals in the set of transitions.
1080: 
1081: Let $\myaut$ $=$ 
1082: $(\myterm,$ $\mysym,$ $\Xinit,$ $\Xfinal,$ $\mytrans)$ be a 
1083: PDA with both CPP and SPP.
1084: We will now show that we can construct an equivalent CFG 
1085: $\mygram$ = $(\myterm,$ $\mysym,$ $\Xinit,$ $\myrule)$
1086: with size linear in the size of $\myaut$. 
1087: The rules of this grammar are the following.
1088: \begin{itemize}
1089: \item
1090: $X \de\it Y Z$
1091: for each transition $\myep{X}{X Y}$, where
1092: $Z$ is the unique stack symbol such that there is at
1093: least one transition $\myep{X Y'}{Z}$ with $Y \pdagoto Y'$;
1094: \item
1095: $X \de x Y$
1096: for each transition 
1097: $\myscanrec{X}{x}{Y}$;
1098: \item
1099: $Y \de \epsilon$ for each stack symbol $Y$ 
1100: such that there is at
1101: least one transition $\myep{X Y}{Z}$ or such that $Y=\Xfinal$.
1102: \end{itemize}
1103: It is easy to see that there exists 
1104: a bijection from complete computations of $\myaut$ to complete derivations
1105: of $\mygram$, preserving the recognized/derived strings. 
1106: Apart from an additional derivation step by rule
1107: $\Xfinal \de \epsilon$, the complete derivations also have the
1108: same length as the corresponding complete computations.
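
As a small worked example, consider a hypothetical PDA with
$\myterm = \{a\}$, $\mysym = \{\Xinit, Y, Y', \Xfinal\}$ and
the three transitions
$\myep{\Xinit}{\Xinit Y}$, $\myscanrec{Y}{a}{Y'}$ and
$\myep{\Xinit Y'}{\Xfinal}$.
One may check that this PDA has the CPP, and it has the SPP trivially,
as there is only one pop transition.
The construction yields the rules
$\Xinit \de Y\, \Xfinal$, $Y \de a\, Y'$, $Y' \de \epsilon$ and
$\Xfinal \de \epsilon$.
The only complete computation,
$(\Xinit, a) \pdamoves (\Xfinal, \epsilon)$,
consists of three transitions and corresponds to the derivation
\[
\Xinit \ \Rightarrow\ Y\, \Xfinal \ \Rightarrow\ a\, Y'\, \Xfinal
\ \Rightarrow\ a\, \Xfinal \ \Rightarrow\ a
\]
of four steps, the last of which applies $\Xfinal \de \epsilon$.
With the definitions of size given above, this PDA has size
$3+3+3 = 9$ and the constructed CFG has size $3+3+1+1 = 8$.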
1109: 
1110: The above construction can straightforwardly be extended 
1111: to probabilistic PDAs (PPDAs).
1112: Let $(\myaut, p_{\myaut})$ be a PPDA with both CPP and SPP. 
1113: Then we construct $\mygram$ as above, and further define
1114: $p_{\mygram}$ such that
1115: $p_{\mygram}(\pi) = p_{\myaut}(\tau)$ for rules
1116: $\pi = X \de \it Y Z$ or $\pi = X \de x Y$ that we construct
1117: out of transitions $\tau=\myep{X}{X Y}$ or $\tau=\myscanrec{X}{x}{Y}$,
1118: respectively, in the first two items above.
1119: We also define $p_{\mygram}(Y \de \epsilon) = 1$ 
1120: for rules $Y \de \epsilon$ obtained in the third item above.
1121: If $(\myaut, p_{\myaut})$ is reduced, proper and consistent then so is 
1122: $(\mygram, p_{\mygram})$.
1123: 
1124: This leads to the observation that
1125: parsing strategies with the
1126: CPP and the SPP as well as
1127: their probabilistic extensions can also be
1128: described as grammar transformations, as follows. A given (P)CFG is mapped
1129: to an equivalent (P)PDT by a (probabilistic)
1130: parsing strategy. By ignoring the output components of
1131: swap transitions we obtain a (P)PDA, which can be mapped to
1132: an equivalent (P)CFG as shown above.
1133: This observation gives rise to an extension with
1134: probabilities of
1135: the work on {\em covers\/} by \cite{NI80,LE89}.
1136: 
1137: It has been shown by \cite{GO82} that there is an infinite family of languages
1138: with the following property. 
1139: The sizes of the smallest CFGs generating those languages
1140: are at least quadratically larger than the sizes of the smallest
1141: equivalent PDAs. Note
1142: that this increase in size cannot occur if PDAs satisfy 
1143: both the CPP and the SPP, as we have shown above.
1144: 
1145: It is always possible to transform a PDA with the CPP but
1146: without the SPP to an equivalent PDA with both CPP and SPP, 
1147: by a construction that increases
1148: the size of the PDA considerably (at least quadratically,
1149: in the light of the above construction and \cite{GO82}).
1150: However, such transformations in general
1151: do not preserve parsing strategies and therefore are of minor interest
1152: to the issues discussed in this paper.
1153: 
1154: The simple relationship between PDAs with both CPP and SPP 
1155: on the one hand and CFGs on the other can 
1156: be used to carry over algorithms originally 
1157: designed for CFGs to PDAs or PDTs. One such application is the 
1158: evaluation of the right-hand side of
1159: equation~(\ref{e:normalized}) in the proof of
1160: Theorem~\ref{t:sp}. Both the numerator and the denominator 
1161: involve potentially infinite sets of subcomputations, and therefore
1162: it is not immediately clear that the proof is constructive.
1163: However, there are published algorithms to compute, for a
1164: given PCFG $(\mygram',p_{\mygram'})$
1165: that is not necessarily proper and a given
1166: nonterminal $A$, the expression
1167: $\Sigma_{w \in \Sigma^\ast}\ p_{\mygram'}(A \Rightarrow^\ast w)$,
1168: or rather, to approximate it with arbitrary precision;
1169: see \cite{BO73,ST95}. 
1170: This can be used to compute e.g.\ 
1171: $\Sigma_{v\in L_{X}}\ p_{\mygram}(\overline{v})$
1172: in equation~(\ref{e:normalized}), as follows.
1173: 
1174: The first step 
1175: is to map the PDT to a CFG $\mygram'$ as shown above.
1176: We then define a function
1177: $p_{\mygram'}$ that assigns probability
1178: 1 to all rules that we construct out of push and pop
1179: transitions. We also let $p_{\mygram'}$ assign probability 
1180: $p_{\mygram}(\overline{y})$ to a rule
$X \de x Y$ that we construct out of a swap transition
1182: $\myscan{X}{x}{y}{Y}$. 
1183: It is easy to see that, for any stack symbol $X$, we have
1184: $\Sigma_{v\in L_{X}}\ p_{\mygram}(\overline{v}) =
1185: \Sigma_{w \in \Sigma_1^\ast}\ p_{\mygram'}(X \Rightarrow^\ast w)$.
This allows the problem of computing the probabilities
1187: in the right-hand side of equation~(\ref{e:normalized})
1188: to be reduced to a problem on PCFGs, which can be solved 
1189: by existing algorithms as discussed above.  
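
To illustrate the kind of sum involved, consider the hypothetical case
in which the only rules of $\mygram'$ with left-hand sides $X$ and $Y$
are $X \de x X$ with probability $q$, where $0 \leq q < 1$,
$X \de x Y$ with probability $r$, and $Y \de \epsilon$ with probability $1$.
The values $q$ and $r$ are assumptions made for this example only,
and $q + r$ need not be $1$, as $p_{\mygram'}$ is not required to be proper.
Each string $x^{n+1}$, $n \geq 0$, has exactly one derivation from $X$,
which applies $X \de x X$ exactly $n$ times, and therefore:
\[
\Sigma_{w \in \Sigma^\ast}\ p_{\mygram'}(X \Rightarrow^\ast w)
\ =\ \Sigma_{n \geq 0}\ q^n \cdot r
\ =\ \frac{r}{1-q}
\]
In this linear case a closed form is available. Rules with two
nonterminals in the right-hand side lead to nonlinear equations in general,
for which one typically resorts to approximation.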
1190: 
1191: \section{Parsing strategies with SPP}
1192: \label{s:strong}
1193: 
1194: Many well-known parsing strategies with the CPP
1195: also have the SPP,
1196: such as top-down parsing \cite{HA78}, left-corner parsing \cite{RO70}
1197: and PLR parsing \cite{SO79}, the first two of which we will
1198: define explicitly here, whereas of the third we will merely 
1199: present a sketch. A fourth strategy that we will discuss
1200: is a combination of left-corner and top-down parsing, with special
1201: computational properties.
1202: 
1203: In order to simplify the presentation, we allow a new type of
1204: transition, without increasing the power of PDTs, viz.\
1205: a combined push/swap transition of the form
1206: $\myscan{X}{x}{y}{X Y}$. Such a transition can be seen as short-hand for
1207: two transitions, the first of the form $\myep{X}{X Y_{x,y}}$,
1208: where $Y_{x,y}$ is a new symbol not already in $\mysym$, and
1209: the second of the form $\myscan{Y_{x,y}}{x}{y}{Y}$.
1210: 
1211: The first strategy we discuss is top-down parsing.
For a fixed CFG $\mygram  = (\myterm,$ $\mynont,$ $S,$ $\myrule)$, 
1213: we define $\mystrat_{\it TD}(\mygram) =
1214: (\myaut,f)$. Here $\myaut$ $=$ $(\myterm,$ 
1215: $\myrule,$ $\mysym,$ $[S \de\ \bul\sigma],$ $[S \de \sigma\bul],$ $\mytrans)$,
1216: where $\mysym = \{ [A \de \alpha \bul \beta]\  |\ 
1217: (A \de\alpha\beta) \in \myrule \}$;
1218: these `dotted rules' are well-known from \cite{KN65,EA70}.
1219: The transitions in $\mytrans$ are:
1220: \begin{itemize}
1221: \item $\myscan{[A \de \alpha \bul a \beta]}{a}{\epsilon}%
1222: 		{[A \de \alpha a \bul \beta]}$
1223: for each rule $A \de \alpha a \beta$;
1224: \item $\myscan{[A \de \alpha \bul B \beta]}{\epsilon}{\pi}%
1225: 		{[A \de \alpha \bul B \beta]\ [B\de\ \bul\gamma]}$
1226: for each pair of rules 
1227: $A \de \alpha B \beta$ and $\pi = B \de \gamma$;
1228: \item $\myep{[A \de \alpha \bul B \beta]\ [B \de \gamma \bul]}%
1229: 		{[A \de \alpha B \bul \beta]}$.
1230: \end{itemize}
1231: The function $f$ is the identity function on strings over $\myrule$.
1232: If seen as a function on computations,
1233: then $f$ is a bijection from complete computations 
1234: of $\myaut$ to complete derivations of $\mygram$, 
1235: as required by the definition of `parsing strategy'. 
1236: 
1237: If $\mygram$ is reduced, then $\myaut$ clearly has the CPP.
1238: That it also has the SPP can be
1239: argued as follows. 
1240: Let us first remark that if 
1241: $[A \de \alpha \bul \beta]\pdagoto X$ 
1242: for some stack symbols $[A \de \alpha \bul \beta]$ and $X$, 
1243: then $X$ must be of the form
1244: $[A \de \alpha \gamma\bul \delta]$, for some $\gamma$ and $\delta$ 
1245: such that $\gamma\delta = \beta$.
1246: Now, if there are three transitions
1247: $\myep{X}{X Y}$, $\myep{X Y_1}{Z_1}$ and
1248: $\myep{X Y_2}{Z_2}$ such that $Y\pdagoto Y_1$ and
1249: $Y\pdagoto Y_2$, then 
1250: $X$ must be of the form $[A \de \alpha \bul B \beta]$ 
1251: and $Y$ of the form $[B\de\ \bul\gamma]$
1252: (strictly speaking $[B\de\ \bul\gamma]_{\epsilon,\pi}$), 
1253: $Y_1$ and $Y_2$ must both be $[B \de \gamma \bul]$,
1254: and $Z_1$ and $Z_2$ must both be
1255: $[A \de \alpha B \bul\beta]$.
1256: Hence the SPP is satisfied.
1257: 
1258: Since $\mystrat_{\it TD}$ has both CPP and SPP, we
1259: may apply Theorem~\ref{t:sp} to conclude that $\mystrat_{\it TD}$ can be 
1260: extended to become a probabilistic parsing strategy. 
1261: A direct construction of a top-down PPDT from a 
1262: PCFG $(\mygram, p_{\mygram})$
1263: is obtained by extending the above construction such that
1264: probability 1 is assigned to all transitions produced by the first
1265: and third items, and probability $p_{\mygram}(\pi)$ is assigned
1266: to transitions produced by the second item.
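
As an illustration with hypothetical numbers, let $\mygram$ have
the rules $S \de A$, $\pi_1 = A \de a$ and $\pi_2 = A \de b$, with
$p_{\mygram}(S \de A) = 1$, $p_{\mygram}(\pi_1) = 0.3$ and
$p_{\mygram}(\pi_2) = 0.7$.
The second item yields two transitions with top-of-stack
$[S \de\ \bul A]$, which output $\pi_1$ and $\pi_2$ and push
$[A \de\ \bul a]$ and $[A \de\ \bul b]$, respectively.
These receive probabilities $0.3$ and $0.7$, which sum to $1$.
All remaining transitions receive probability $1$.
The complete computation that recognizes input $a$ consists of the first
of these transitions, the transition that reads $a$, and a pop transition,
so that its probability is
\[
0.3 \cdot 1 \cdot 1 \ =\ 0.3 \ =\
p_{\mygram}(S \de A) \cdot p_{\mygram}(\pi_1),
\]
which is the probability of the derivation $S \Rightarrow A \Rightarrow a$.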
1267: 
1268: The second strategy we discuss is left-corner (LC) parsing \cite{RO70}.
1269: For a fixed CFG $\mygram= (\myterm,$ $\mynont,$ $S,$ $\myrule)$,
1270: we define the binary relation $\LC$ over
1271: $\myterm \cup \mynont$ by:
1272: $X \LC A$ if and only if there is
1273: an $\alpha \in (\myterm \cup \mynont)^\ast$
1274: such that $(A \de X\alpha)\in\myrule$,
1275: where $X \in \myterm \cup \mynont$.
1276: We define the binary relation $\LCstar$ to be the reflexive and 
1277: transitive closure of $\LC$. This implies that $a \LCstar a$ for all
1278: $a \in \myterm$.
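
For example, for a hypothetical CFG whose only rules are
$S \de {\it Ab}$ and $A \de a$, we have
$a \LC A$ and $A \LC S$, and no other pairs are related by $\LC$.
The relation $\LCstar$ then consists of these two pairs,
the pair $a \LCstar S$ obtained by transitivity,
and the reflexive pairs $X \LCstar X$ for each $X \in \{a, b, A, S\}$.
In particular, $b$ is related by $\LCstar$ only to itself.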
1279: 
1280: We now define $\mystrat_{\it LC}(\mygram) =
1281: (\myaut,f)$. Here $\myaut$ $=$ $(\myterm,$ 
1282: $\myrule\cup\{\dashv\},$ 
1283: $\mysym,$ $[S \de\ \bul\sigma],$ $[S \de \sigma\bul],$ $\mytrans)$,
1284: where $\mysym$ contains stack
1285: symbols of the form $[A \de \alpha \bul \beta]$
1286: where $(A \de\alpha\beta) \in \myrule$ such that 
1287: $\alpha\neq\epsilon\vee A =S$,
1288: and 
1289: stack symbols of the form
1290: $[A \de \alpha \bul Y\!\beta; X]$
1291: where $(A \de\alpha Y\!\beta) \in \myrule$ and
1292: $X,Y\in\myterm \cup \mynont$ such that $\alpha\neq\epsilon\vee A =S$
1293: and $X \LCstar Y$.
1294: The latter type of stack symbol indicates that
1295: left corner $X$ of goal $Y$ in the right-hand side of rule
1296: $A \de \alpha Y\! \beta$ has just been recognized.
1297: The transitions in $\mytrans$ are:
1298: \begin{itemize}
1299: \item $\myscan{[A \de \alpha \bul Y\! \beta]}{a}{\epsilon}%
1300: 		{[A \de \alpha \bul Y\! \beta; a]}$
1301: for each rule $A \de \alpha Y\! \beta$ and $a \in \myterm$
1302: such that $\alpha\neq\epsilon\vee A=S$ and $a \LCstar Y$;
1303: \item $\myscan{[A \de \alpha \bul B \beta]}{\epsilon}{\pi}%
1304: 		{[A \de \alpha \bul B \beta; C]}$
1305: for each pair of rules $A \de \alpha B \beta$ and
1306: $\pi = C \de \epsilon$ such that $\alpha\neq\epsilon\vee A=S$ and $C \LCstar B$;
1307: \item $\myscan{[A \de \alpha \bul B \beta; X]}{\epsilon}{\pi}%
1308:                 {[A \de \alpha \bul B \beta; X]\ [C\de X \bul\gamma]}$
1309: for each pair of rules 
1310: $A \de \alpha B \beta$ and $\pi = C \de X \gamma$ such that 
1311: $\alpha\neq\epsilon\vee A=S$ and $C \LCstar B$;
1312: \item $\myep{[A \de \alpha \bul B \beta; X]\ [C\de X\gamma \bul]}%
1313: 		{[A \de \alpha \bul B \beta; C]}$
1314: for each pair of rules
1315: $A \de \alpha B \beta$ and $C \de X\gamma$ such that 
1316: $\alpha\neq\epsilon\vee A=S$
1317: and 
1318: $C \LCstar B$;
1319: \item $\myscan{[A \de \alpha \bul Y\! \beta; Y]}%
1320: 		{\epsilon}{\dashv}
1321:                 {[A \de \alpha Y \bul \beta]}$
1322: for each rule $A \de \alpha Y\! \beta$ such that 
1323: $\alpha\neq\epsilon\vee A=S$.
1324: \end{itemize}
1325: The function $f$ has to rearrange an output string to obtain
1326: a complete derivation.
1327: To make this possible, the output alphabet contains the
1328: symbol $\dashv$ in addition to rules from $\myrule$. 
1329: This symbol is used to mark the end of
1330: an upward path of nodes in the parse tree 
1331: each of which, except the last, is
1332: the left-most daughter node of its mother node.
1333: As explained in \cite{NI80}, in the absence of such a symbol,
1334: it would be impossible to uniquely identify output strings with
1335: derivations of the input.\footnote{%
1336: In \cite[pp.~22--23]{NI80} a context-free grammar is considered
1337: that consists of the set of rules 
1338: $R = \{S \de {\it aS}, S \de {\it Sb}, S \de c\}$.
1339: It is shown that any left-corner push-down 
1340: transducer using only $R$ as output alphabet would
1341: output at most one string for each input string, 
1342: whereas there may be several 
1343: derivations of the input, as the grammar is ambiguous.}
1344: 
1345: The function $f$ for the strategy $\mystrat_{\it LC}$
1346: is defined by Figure~\ref{f:fLC}. Function $f$ is defined
1347: in terms of function $f_{\it LC}$, which has two arguments.
1348: The first argument, $d$, is either the empty string or
1349: a subderivation that has already been constructed.
1350: The second argument is a suffix of the output
1351: string originally supplied as argument to $f$. 
1352: Function $f_{\it LC}$ removes the first symbol $\pi$
1353: from the output string, which will be
1354: a rule $A \de X X_{1} \cdots X_l$ or $A \de \epsilon$.
1355: In the former case, $d$ must be $\epsilon$ if $X\in \myterm_1$
1356: and $d$ must be a subderivation from nonterminal $X$ otherwise.
1357: The function is then called recursively zero or more times,
1358: once for each nonterminal in $X_{1} \cdots X_l$,
1359: to obtain more subderivations $d_i$, $1 \leq i \leq l$,
1360: each of which is
1361: obtained by consuming a subsequent part of the output string.
1362: These subderivations are combined into a larger subderivation
1363: $d' = \pi d d_{1} \cdots d_l$. Depending on the
1364: question whether we encounter $\dashv$ as the immediately
1365: following symbol of the output string, we return the
1366: derivation $d'$ and the remainder $v'$ of the output string, or
call $f_{\it LC}$ recursively once more to
1368: obtain a larger subderivation.
1369: \begin{figure}[tp]
1370: \begin{eqnarray*}
1371: f(v) &=& d \\
1372: && {\rm where} \\
1373: && (d, \epsilon) = f_{\it LC}(\epsilon,v) \\
1374: f_{\it LC}(d,\pi v_0) &=& (d'',v'') \\
1375: && {\rm where} \\
1376: && l \mbox{\ is such that\ }
1377:         \pi = A \de X X_{1} \cdots X_l\ \mbox{\ or} \\
1378: && \hspace*{5ex} \pi = A \de \epsilon \wedge l = 0 \\
1379: && (d_{1}, v_{1}) =
1380:                 {\rm if\ } X_{1} \in \myterm_1
1381:                 {\rm \ then\ } (\epsilon, v_{0})
1382:                 {\rm \ else\ } f_{\it LC}(\epsilon, v_{0}) \\
1383: && \ldots \\
1384: && (d_{l}, v_{l}) =
1385:                 {\rm if\ } X_{l} \in \myterm_1
1386:                 {\rm \ then\ } (\epsilon, v_{l-1})
1387:                 {\rm \ else\ } f_{\it LC}(\epsilon, v_{l-1}) \\
1388: && d' = \pi d d_{1} \cdots d_l \\
1389: && (d'',v'') =
1390:                 {\rm if\ } {\dashv} v'  = v_{l}
1391:                 {\rm \ then\ } (d',v')
1392:                 {\rm \ else\ } f_{\it LC}(d',v_{l}) 
1393: \end{eqnarray*}
1394: \caption{Function $f$ for $\mystrat_{\it LC}$.}
1395: \label{f:fLC}
1396: \end{figure}
1397: 
1398: It can be easily shown that this strategy has the CPP.
1399: Regarding the SPP, note that if there are two transitions
1400: $\myscan{[A \de \alpha \bul B \beta; X]}{\epsilon}{\pi}%
1401:                 {[A \de \alpha \bul B \beta; X]\ [C\de X \bul\gamma]}$
1402: and $\myep{[A \de \alpha \bul B \beta; X]\ Y_1}{Z_1}$ such that
1403: $[C\de X \bul\gamma] \pdagoto Y_1$, then 
1404: $Y_1$ must be $[C\de X \gamma\bul]$
1405: and $Z_1$ must be $[A \de \alpha \bul B \beta; C]$, which means that
1406: $Z_1$ is uniquely determined by the first transition.
1407: 
1408: Since $\mystrat_{\it LC}$ has both CPP and SPP, 
1409: left-corner parsing can be extended to become a 
1410: probabilistic parsing strategy. A direct construction of
1411: probabilistic left-corner parsers from PCFGs has been presented
1412: by \cite{TE95}.
1413: 
1414: Since at most two rules occur in each of the items above,
1415: the size of a (probabilistic)
1416: left-corner parser is $\order{|\mygram|^2}$, where
1417: $|\mygram|$ denotes the size of $\mygram$.
1418: This is the same complexity as that of the direct
1419: construction by \cite{TE95}.
1420: This is in contrast to a construction of `shift-reduce' PPDAs
1421: out of PCFGs from \cite{AB99}, which were of size
1422: $\order{|\mygram|^5}$.\footnote{This construction consisted
1423: of a transformation to Chomsky normal form followed by 
1424: a transformation to Greibach normal form (GNF) \cite{HA78}.
Its worst-case time complexity, established in
personal communication with David McAllester, is reached for a family
1427: of CFGs $(\mygram_n)_{n \geq 2}$, defined by $\mygram_n =$
1428: $(\{a_1,\ldots,a_n\},$ $\{A_1,\ldots,A_n\},$ $A_1,$ $\myrule)$,
1429: where $\myrule$ contains the rules
1430: $A_i \de A_{i+1}$, for $1 \leq i \leq n-1$,
1431: $A_n \de A_1$,
1432: and $A_i \de A_{i}\ A_{i}$ and
1433: $A_i \de a_i$, for $1 \leq i \leq n$.
1434: After transformation to GNF, the grammar
1435: contains $n^5$ rules of the
1436: form 
1437: $A_{i_1}/A_{i_2} \de a_{i_3}\ A_{i_2}/A_{i_4}\ A_{i_1}/A_{i_5}$,
1438: with $1 \leq i_1,i_2,i_3,i_4,i_5 \leq n$.
1439: In \cite{BL99} a more economical transformation 
1440: to Greibach normal form is given; straightforward
1441: extension to probabilities leads to 
1442: probabilistic parsers of the type considered by \cite{AB99} 
1443: of size $\order{|\mygram|^4}$.}
1444: The ``conjecture that
1445: there is no {\em concise\/} translation of
1446: PCFGs into shift-reduce PPDAs'' from \cite{AB99}
1447: is made less significant by the earlier construction by \cite{TE95}
1448: and our construction above.
1449: It must be noted however that the `shift-reduce' model adhered to
1450: by \cite{AB99} is more restrictive than the PDT models adhered
1451: to by \cite{TE95} and by us.
1452: 
1453: When we look at upper bounds on the sizes of PPDAs (or PPDTs)
that describe the same probability distributions as given 
1455: PCFGs, and compare these with the upper bounds for
1456: (non-probabilistic) PDAs (or PDTs) for given CFGs,
1457: we can make the following observation.
1458: Theorem~\ref{t:cpp} states
1459: that parsing strategies without the CPP cannot be extended to
1460: become probabilistic. Furthermore, \cite{LE00} has shown that
1461: for certain fixed languages the smallest
1462: PDAs without the CPP are much smaller than
1463: the smallest PDAs with the CPP. 
1464: It may therefore appear that probabilistic PDAs
1465: are in general larger than non-probabilistic ones.
1466: However, the automata studied by
1467: \cite{LE00} pertain to very specific languages, and at this
1468: point there is little reason to believe that the demonstrated
1469: results for these languages carry over to
1470: any reasonable strategy for {\em general\/} CFGs.
1471: 
1472: The third parsing strategy that we discuss is PLR parsing \cite{SO79}.
1473: Since it is very similar to LC parsing, 
1474: we merely provide a sketch.
1475: The stack symbols for PLR parsing are like those for LC parsing, 
1476: except that the parts of rules following the dot are omitted.
1477: Thus, instead of symbols of the form
1478: $[A \de \alpha \bul \beta]$ and of the
1479: form $[A \de \alpha \bul \beta; X]$, a PLR parser
1480: manipulates stack symbols
1481: $[A \de \alpha ]$ and $[A \de \alpha ; X]$, respectively.
1482: That $\beta$ is omitted means that PLR parsers may postpone commitment
to one of two similar rules $A \de \alpha \beta$ and 
1484: $A \de \alpha \beta'$ until the point is reached where $\beta$ and
1485: $\beta'$ differ. In this sense PLR parsing
1486: is less predictive than LC parsing,
1487: although it still satisfies the 
1488: strong predictiveness property, so that it can be extended to
1489: become probabilistic.
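
To make the difference concrete, consider two hypothetical rules
$\pi_1 = A \de a\ b\ c$ and $\pi_2 = A \de a\ b\ d$, which are assumed
only for this sketch. After the left corner $a$ has been recognized,
progress through these rules is represented by the following stack symbols:
$$
\begin{array}{l@{\quad}l}
\mbox{LC:} & [A \de a \bul b\ c] \mbox{\ \ or\ \ } [A \de a \bul b\ d] \\
\mbox{PLR:} & [A \de a]
\end{array}
$$
The LC parser has to choose between $\pi_1$ and $\pi_2$ at this point,
whereas the PLR parser can proceed to $[A \de a\ b]$ and needs to choose only
when $c$ or $d$ is read.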
1490: 
1491: There are two minor differences between the transitions of LC
1492: parsers and those of PLR parsers. The first is the simplification of
1493: stack symbols as explained above. The second is that for PLR, 
1494: output of a rule is delayed until it is completely recognized.
1495: The resulting output strings are right-most
1496: derivations in reverse, which requires different functions $f$ than in
1497: the case of LC parsing.
1498: Note that right-most derivations can be effectively mapped
1499: to corresponding parse trees, and parse trees can be effectively 
1500: mapped to corresponding left-most derivations. 
1501: Hence the required functions $f$ clearly exist.
1502: 
1503: The last strategy to be discussed in this section is a combination
1504: of left-corner and top-down parsing. It has the special property
1505: that, provided the fixed CFG is acyclic, 
1506: the length of computations is bounded by a 
linear function of the length of the input, which
1508: means that the parser cannot `loop' on any input.
1509: Note that if the grammar is not acyclic, computations
1510: of unbounded length cannot be avoided by any parsing strategy.
1511: {}From this perspective, this parsing strategy, which we will
1512: call {\em $\epsilon$-LC\/} parsing, is optimal.
1513: It is based on \cite{NE93b}, and a
1514: related idea for LR parsing was described by \cite{NE96e}.
1515: The special termination properties of this strategy will be needed
1516: in Section~\ref{s:prefix}.
1517: 
1518: We first define the binary relation $\LCep$ over
1519: $\myterm \cup \mynont$ by:
1520: $X \LCep A$ if and only if there are
1521: $\alpha,\beta\in (\myterm \cup \mynont)^\ast$
1522: such that $(A \de \alpha X\beta)\in\myrule$
1523: and $\alpha \Rightarrow^\ast \epsilon$.
1524: Relation $\LCep$ differs from the relation $\LC$ defined earlier
1525: in that epsilon-generating
1526: nonterminals at the beginning of a rule may be ignored.
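
For example, take the hypothetical rules $A \de B\ C\ d$ and $B \de \epsilon$,
which are not part of any grammar discussed here. The definition then gives:
$$
\begin{array}{l@{\quad}l}
B \LCep A & \mbox{with\ } \alpha = \epsilon \\
C \LCep A & \mbox{with\ } \alpha = B, \mbox{\ since\ } B \Rightarrow^\ast \epsilon
\end{array}
$$
If $C$ does not derive $\epsilon$, then the first rule does not give $d \LCep A$, 
because $\alpha = B\ C$ would have to derive $\epsilon$.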
1527: 
1528: The stack symbols are now of the form
1529: $[A \de \alpha \bul \beta, \mu\bul\nu]$ or of the
1530: form $[A \de \alpha \bul Y\! \beta, \mu\bul\nu; X]$.
1531: Similar to the stack symbols for pure LC parsing, we
1532: have $\alpha\neq\epsilon\vee A =S$
and $X \LCepstar Y$. What differs is the additional dotted expression
1534: $\mu\bul\nu$, which is such that $\mu\nu$ is
1535: a string of epsilon-generating nonterminals, occurring at
1536: the beginning of the right-hand side of a rule 
1537: $A \de \mu\nu\alpha \beta$ or $A \de \mu\nu\alpha Y\!\beta$,
1538: respectively.
1539: The string $\mu\nu$ will be ignored in the
1540: part of the strategy that behaves like left-corner parsing,
1541: where $\mu=\epsilon$.
1542: However, when the dot of the first dotted expression is at the end,
1543: i.e., when we obtain a stack symbol of the form
1544: $[A \de \alpha \bul, \bul\nu]$, then
1545: top-down parsing will be activated to retrieve epsilon-generating
1546: subderivations for the nonterminals in $\nu$, 
1547: and the dot will move through $\nu$ from
1548: left to right.\footnote{%
1549: Although such subderivations can also be pre-compiled
1550: during construction of the PDT,
1551: we refrain from doing so since this could lead to
1552: a PDT of exponential size.}
1553: 
1554: We have $\Xinit = [S \de\ \bul \sigma, \bul]$
1555: and $\Xfinal = [S \de \sigma\bul, \bul]$, where for technical
1556: reasons, and without loss of generality, we assume that
1557: $\sigma$ does not contain any epsilon-generating nonterminals.
1558: Next to the symbols from $\myrule$ and the symbol $\dashv$,
1559: the output alphabet $\myterm_2$ also includes the set
1560: of integers $\{0,\ldots,l-1\}$, where $l= |\alpha|$ for
1561: a rule $(A \de \alpha) \in\myrule$ of maximal length;
1562: the purpose of such integers will become clear below.
1563: For the definition of the set of transitions, we will be less
1564: precise than for $\mystrat_{\it TD}$ and 
1565: $\mystrat_{\it LC}$, to prevent
1566: cluttering up the presentation with details. 
1567: We point out however that
1568: in order to produce a reduced PDT from a reduced CFG, further side
1569: conditions are needed for all items below:
1570: 
1571: \begin{itemize}
1572: \item $\myscan{[A \de \alpha \bul Y\! \beta, \bul\mu]}{a}{\epsilon}%
1573:                 {[A \de \alpha \bul Y\! \beta, \bul\mu; a]}$
1574: for $a \in \myterm$ such that $a \LCepstar Y$;
1575: \item $\myscan{[A \de \alpha \bul B \beta, \bul\mu]}{\epsilon}{\pi 0}%
1576: 		{[A \de \alpha \bul B \beta, \bul\mu; C]}$
1577: for $\pi = C \de \epsilon$ such that $C \LCstar B$;
1578: \item $\myscan{[A \de \alpha \bul B \beta, \bul\mu; X]}{\epsilon}{\pi m}%
1579:                 {[A \de \alpha \bul B \beta, \bul\mu; X]\ 
1580: 		[C\de X \bul\gamma, \bul\mu']}$
1581: for $\pi = C \de \mu' X \gamma$ such that
1582: $C \LCepstar B$ and
1583: $\mu'\Rightarrow^\ast\epsilon$, where $m= |\mu'|$;
1584: \item $\myep{[A \de \alpha \bul B \beta, \bul\mu; X]\ 
1585: 		[C\de X\gamma \bul, \mu'\bul]}%
1586:                 {[A \de \alpha \bul B \beta, \bul\mu; C]}$;
1587: \item $\myscan{[A \de \alpha \bul Y\! \beta, \bul\mu; Y]}%
1588:                 {\epsilon}{\dashv}
1589:                 {[A \de \alpha Y \bul \beta, \bul\mu]}$;
1590: \item $\myscan{[A \de \alpha \bul, \mu \bul B \nu]}{\epsilon}{\pi}%
1591:                 {[A \de \alpha \bul, \mu \bul B \nu]\ 
1592: 		[B\de\ \bul, \bul\mu']}$
1593: for $\pi = B \de \mu'$ such that $\mu'\Rightarrow^\ast\epsilon$;
1594: \item $\myep{[A \de \alpha \bul, \mu \bul B \nu]\ [B\de\ \bul, \mu'\bul]}%
1595:                 {[A \de \alpha \bul, \mu B \bul \nu]}$.
1596: \end{itemize}
1597: 
1598: The first five items are almost identical to the five
1599: items we presented for $\mystrat_{\it LC}$,
1600: except that strings $\mu$ of 
1601: epsilon-generating nonterminals at the beginning of rules
1602: are ignored. 
1603: The length $m$ of a string $\mu$ is output just after 
1604: the relevant grammar rule is output, in the second and third items.
1605: This length $m$ will be needed to define function $f$ below.
1606: 
1607: The last two items follow a top-down strategy, but only for
1608: epsilon-generating rules.
1609: The produced transitions do what
1610: was deferred by the left-corner part of the strategy:
1611: they construct subderivations for the 
1612: epsilon-generating nonterminals in strings $\mu$.
1613: 
1614: The function $f$, which produces a complete derivation
1615: from an output string, is defined through two
1616: auxiliary functions, viz.\
1617: $\fepLC$ for the left-corner part and 
1618: $\fepTD$ for the top-down
1619: part, as shown in Figure~\ref{f:fepLC}.
1620: 
1621: \begin{figure}[tp]
1622: \begin{eqnarray*}
1623: f(v) &=& d \\
1624: && {\rm where} \\
1625: && (d, \epsilon) = \fepLC(\epsilon,v) \\
1626: \fepLC(d,\pi m v_{0}) &=& (d'',v'') \\
1627: && {\rm where} \\
1628: && l \mbox{\ is such that\ } 
1629: 	\pi = A \de B_1 \cdots B_m X X_{1} \cdots X_l\ \mbox{\ or} \\
1630: && \hspace*{5ex} \pi = A \de \epsilon \wedge l = 0 \\
1631: && (d_{1}, v_{1}) = 
1632: 		{\rm if\ } X_{1} \in \myterm_1 
1633: 		{\rm \ then\ } (\epsilon, v_{0}) 
1634: 		{\rm \ else\ } \fepLC(\epsilon, v_{0}) \\
1635: && \ldots \\
1636: && (d_{l}, v_{l}) =
1637:                 {\rm if\ } X_{l} \in \myterm_1 
1638:                 {\rm \ then\ } (\epsilon, v_{l-1}) 
1639:                 {\rm \ else\ } \fepLC(\epsilon, v_{l-1}) \\
1640: && (d'_1,v_{l+1}) = \fepTD(v_{l}) \\
1641: && \ldots \\
1642: && (d'_m,v_{l+m}) = \fepTD(v_{l+m-1}) \\
1643: && d' = \pi d'_1 \cdots d'_m d d_{1} \cdots d_l \\
1644: && (d'',v'') =
1645: 		{\rm if\ } {\dashv} v'  = v_{l+m}
1646: 		{\rm \ then\ } (d',v') 
1647: 		{\rm \ else\ } \fepLC(d',v_{l+m}) \\
1648: \fepTD(v) &=& (\pi d_1 \cdots d_l,v_l) \\
1649: && {\rm where} \\
1650: && \pi v_{0} = v \\
1651: && l \mbox{\ is such that\ } \pi = A \de B_{1} \cdots B_l  \\
1652: && (d_1,v_1) = \fepTD(v_0) \\
1653: && \ldots \\
1654: && (d_l,v_l) = \fepTD(v_{l-1})
1655: \end{eqnarray*}
1656: 
1657: \caption{Function $f$ for $\stratepLC$.}
1658: \label{f:fepLC}
1659: \end{figure}
1660: 
1661: The function $\fepLC$ is similar to $f_{\it LC}$ defined in 
1662: Figure~\ref{f:fLC}. The main difference is that now 
1663: subderivations deriving $\epsilon$
1664: for the first $m$ nonterminals in the right-hand side
1665: of a rule are obtained by calls of the function $\fepTD$.
1666: For a suffix $v$ of an output string,
1667: $\fepTD(v)$ yields a pair $(\pi d_1 \cdots d_l,v_l)$
1668: such that $v= \pi d_1 d_2 \cdots d_lv_l$. In other words,
1669: $\fepTD$ does nothing more than split its argument into
1670: two parts. The length of the first part $\pi d_1 \cdots d_l$
1671: depends on the
1672: length $l$ of the right-hand side of rule $\pi$ and
1673: on the lengths of right-hand sides of rules that
1674: are visited recursively.
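
The following sketch traces one computation that the items above allow,
ignoring the side conditions that were omitted. The grammar is hypothetical and
introduced only for this illustration: it has the rules $S \de A$,
$\pi_A = A \de B\ a$ and $\pi_B = B \de \epsilon$, and the input is $a$.
$$
\begin{array}{l@{\quad}l}
\mbox{initially:} & [S \de\ \bul A, \bul] \\
\mbox{read\ } a\mbox{:} & [S \de\ \bul A, \bul; a] \\
\mbox{output\ } \pi_A\, 1\mbox{:} & [S \de\ \bul A, \bul; a]\ [A \de a \bul, \bul B] \\
\mbox{output\ } \pi_B\mbox{:} & [S \de\ \bul A, \bul; a]\ [A \de a \bul, \bul B]\ 
	[B \de\ \bul, \bul] \\
\mbox{pop:} & [S \de\ \bul A, \bul; a]\ [A \de a \bul, B \bul] \\
\mbox{pop:} & [S \de\ \bul A, \bul; A] \\
\mbox{output\ } {\dashv}\mbox{:} & [S \de A \bul, \bul]
\end{array}
$$
The output string is $v = \pi_A\, 1\, \pi_B\, {\dashv}$. 
Evaluating the definitions in Figure~\ref{f:fepLC} by hand, we have $m = 1$ and $l = 0$
for $\pi_A$, then $\fepTD(\pi_B {\dashv}) = (\pi_B, {\dashv})$, and therefore
$\fepLC(\epsilon, v) = (\pi_A \pi_B, \epsilon)$.
The result $\pi_A \pi_B$ corresponds to the left-most derivation
$A \Rightarrow B\ a \Rightarrow a$.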
1675: 
1676: It can be easily seen that $\stratepLC$
1677: has both CPP and SPP. The size of a produced
1678: PDT is now $\order{|\mygram|^3}$, rather than
1679: $\order{|\mygram|^2}$ as in the case of $\mystrat_{\it LC}$.
1680: 
1681: \comment{
1682: The second new type of transition
1683: is a combined swap/pop transition of the form
1684: $\myscan{X Y}{x}{y}{Z}$. Such a transition can be seen as short-hand for
1685: two transitions, the first of the form $\myscan{Y}{x}{y}{Y_X}$,
1686: where $Y_X$ is a new symbol not already in $\mysym$, and
1687: the second of the form $\myep{X Y_X}{Z}$.
1688: 
1689: For the left-corner strategy we have
1690: $\mystrat_{\it LC}(\mygram) =
1691: (\myaut,f)$, where
1692: $\myaut$ differs from the automaton above in the set $\mysym$
1693: of stack symbols
1694: and in the set $\mytrans$ of transitions.
1695: Next to stack symbols $[A \de \alpha \bul \beta]$,
1696: $\mysym$ now also contains
1697: stack symbols of the form
1698: $[A \de \alpha \bul Y \beta; X]$,
1699: where $X$ and $Y$ can be
1700: terminals or nonterminals, and $X \LCstar Y$.
1701: Such a stack symbol on top of the stack indicates that 
1702: left corner $X$ of goal $Y$ in the right-hand side of rule
1703: $A \de \alpha Y \beta$ has just been recognized. 
1704: The transitions in $\mytrans$ are:
1705: \begin{itemize}
1706: \item $\myscan{[A \de \alpha \bul Y \beta]}{a}{\epsilon}%
1707: 		{[A \de \alpha \bul Y \beta; a]}$
1708: for each rule $A \de \alpha Y \beta$ such that $a \LCstar Y$;
1709: \item $\myscan{[A \de \alpha \bul B \beta]}{\epsilon}{\pi}%
1710: 		{[A \de \alpha \bul B \beta; C]}$
1711: for each pair of rules $A \de \alpha B \beta$ and
1712: $\pi = C \de \epsilon$ such that $C \LCstar B$;
1713: \item $\myep{[A \de \alpha \bul B \beta; X]}%
1714:                 {[A \de \alpha \bul B \beta; C]\ [C\de X \bul\gamma]}$
1715: for each pair of rules 
1716: $A \de \alpha B \beta$ and $C \de X \gamma$ such that $C \LCstar B$;
1717: \item $\myscan{[A \de \alpha \bul B \beta; C]\ [C\de \gamma \bul]}{\epsilon}{\pi} 
1718: 		{[A \de \alpha \bul B \beta; C]}$
1719: for each pair of rules
1720: $A \de \alpha B \beta$ and $\pi = C \de \gamma$ such that 
1721: $C \LCstar B$ and $\gamma \neq \epsilon$;
1722: \item $\myep{[A \de \alpha \bul Y \beta; Y]}%
1723:                 {[A \de \alpha Y \bul \beta]}$
1724: for each rule $A \de \alpha Y \beta$.
1725: \end{itemize}
1726: Since the sequence of rules that such a PDT outputs is a right-most
1727: derivation in reverse, the function $f$ has to rearrange the
1728: output string to obtain a complete derivation. This problem is
1729: discussed in \cite{NI80}. 
1730: 
1731: The last parsing strategy we discuss is PLR parsing
1732: \cite{SO79}. This is very similar to LC parsing, with the main difference
that the dotted rules are simplified by omitting the part after the dot.
1734: This leads to a `more deterministic' behaviour, as explained by \cite{NE94a}.
1735: Thus, $\mystrat_{\it PLR}(\mygram) =
1736: (\myaut,f)$, where $\myaut$ $=$ $(\myterm,$ 
1737: $\myrule,$ $\mysym,$ $[S \de\epsilon ],$ $[S \de \sigma],$ $\mytrans)$
1738: and $\mysym$ contains
1739: stack symbols of the form $[A \de \alpha]$,
1740: where $(A\de\alpha\beta)\in \myrule$ for some $\beta$,
1741: or of the form
1742: $[A \de \alpha ; X]$, where
1743: $(A\de\alpha Y \beta)\in \myrule$ for some $Y$ and $\beta$
1744: such that $X \LCstar Y$.
1745: The transitions in $\mytrans$ are:
1746: \begin{itemize}
1747: \item $\myscan{[A \de \alpha ]}{a}{\epsilon}%
1748:                 {[A \de \alpha; a]}$
1749: for each rule $A \de \alpha Y \beta$ such that $a \LCstar Y$;
1750: \item $\myscan{[A \de \alpha ]}{\epsilon}{\pi}%
1751:                 {[A \de \alpha ; C]}$
1752: for each pair of rules $A \de \alpha B \beta$ and
1753: $\pi = C \de \epsilon$ such that $C \LCstar B$;
1754: \item $\myep{[A \de \alpha ; X]}%
1755:                 {[A \de \alpha ; C]\ [C\de X ]}$
1756: for each pair of rules
1757: $A \de \alpha B \beta$ and $C \de X \gamma$ such that $C \LCstar B$;
1758: \item $\myscan{[A \de \alpha ; C]\ [C\de \gamma ]}{\epsilon}{\pi}
1759:                 {[A \de \alpha ; C]}$,
1760: for each pair of rules
1761: $A \de \alpha B \beta$ and $\pi = C \de \gamma$ such that 
1762: $C \LCstar B$ and $\gamma \neq \epsilon$;
1763: \item $\myep{[A \de \alpha ; Y]}%
1764:                 {[A \de \alpha Y ]}$
1765: for each rule $A \de \alpha Y \beta$.
1766: \end{itemize}
1767: The function $f$ is the same as in the case of left-corner parsing.
1768: }
1769: 
1770: \section{Parsing strategies without SPP}
1771: \label{s:nonstrong}
1772: 
1773: In this section
1774: we show that the absence of the strong predictiveness property
1775: may mean that a parsing strategy with the CPP
1776: cannot be extended to become a
1777: probabilistic parsing strategy. We first illustrate this for
1778: LR(0) parsing, formalized as a
1779: parsing strategy $\mystrat_{\it LR}$,
1780: which has the CPP but not the SPP, 
1781: as we will see.
1782: We assume the reader is familiar with LR parsing; see \cite{SI90}.
1783: 
1784: We take a PCFG $(\mygram, p_{\mygram})$ defined by:
1785: $$
1786: \begin{array}{c@{\;=\;}ll}
1787: \pi_{S} & S \de {\it AB}, & p_{\mygram}(\pi_{S}) = 1 \\[.1ex]
1788: \pi_{A_1} & A \de {\it aC}, & p_{\mygram}(\pi_{A_1}) = \frac{1}{3} \\[.1ex]
1789: \pi_{A_2} & A \de {\it aD}, & p_{\mygram}(\pi_{A_2}) = \frac{2}{3} \\[.1ex]
1790: \pi_{B_1} & B \de {\it bC}, & p_{\mygram}(\pi_{B_1}) = \frac{2}{3} \\[.1ex]
1791: \pi_{B_2} & B \de {\it bD}, & p_{\mygram}(\pi_{B_2}) = \frac{1}{3} \\[.1ex]
1792: \pi_{C} & C \de {\it xc}, & p_{\mygram}(\pi_{C}) = 1 \\[.1ex]
1793: \pi_{D} & D \de {\it xd}, & p_{\mygram}(\pi_{D}) = 1
1794: \end{array}
1795: $$
1796: Note that this grammar generates a finite language.
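
For reference, the four strings of this language and their probabilities can be
listed explicitly. Since $\pi_{S}$, $\pi_{C}$ and $\pi_{D}$ all have probability 1, 
each string probability reduces to the product of the probabilities of one rule for $A$ and 
one rule for $B$:
$$
\begin{array}{l@{\;=\;}l@{\;=\;}l}
p_{\mygram}({\it axcbxc}) & p_{\mygram}(\pi_{A_1}) \cdot p_{\mygram}(\pi_{B_1}) &
	\frac{1}{3} \cdot \frac{2}{3} = \frac{2}{9} \\[.5ex]
p_{\mygram}({\it axcbxd}) & p_{\mygram}(\pi_{A_1}) \cdot p_{\mygram}(\pi_{B_2}) &
	\frac{1}{3} \cdot \frac{1}{3} = \frac{1}{9} \\[.5ex]
p_{\mygram}({\it axdbxc}) & p_{\mygram}(\pi_{A_2}) \cdot p_{\mygram}(\pi_{B_1}) &
	\frac{2}{3} \cdot \frac{2}{3} = \frac{4}{9} \\[.5ex]
p_{\mygram}({\it axdbxd}) & p_{\mygram}(\pi_{A_2}) \cdot p_{\mygram}(\pi_{B_2}) &
	\frac{2}{3} \cdot \frac{1}{3} = \frac{2}{9}
\end{array}
$$
These four values sum to 1.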
1797: 
1798: We will not present the entire LR automaton $\myaut$, 
1799: with $\mystrat_{\it LR}(\mygram) = (\myaut,f)$ for some $f$,
1800: but we merely mention two of its key transitions, which
1801: represent shift actions over $c$ and $d$:
1802: $$
1803: \begin{array}{c@{\;=\;}l}
1804: \tau_{c} & \myscan{\{C\de x\bul c, D\de x\bul d\}}%
1805: 		{c}{\epsilon}%
1806: 		{\{C\de x\bul c, D\de x\bul d\}\ \{C \de xc\bul\}} \\
1807: \tau_{d} & \myscan{\{C\de x\bul c, D\de x\bul d\}}%
1808:                 {d}{\epsilon}%
1809:                 {\{C\de x\bul c, D\de x\bul d\}\ \{D \de xd\bul\}}
1810: \end{array}
1811: $$
1812: (We denote LR states by their sets of kernel items, as usual.)
1813: 
1814: Take a probability function $p_{\myaut}$
1815: such that $(\myaut, p_{\myaut})$ is a proper PPDT.
1816: It can be easily seen 
1817: that $p_{\myaut}$ must assign 1 to all
1818: transitions except $\tau_{c}$ and $\tau_{d}$, since that is the only
1819: pair of distinct transitions that can be applied for one and the
1820: same top-of-stack symbol,
1821: viz.\ $\{C\de x\bul c,D\de x\bul d\}$.
1822:  
1823: However,
1824: $\frac{p_{\mygram}({\it axcbxd})}{p_{\mygram}({\it axdbxc})} =
1825: \frac{p_{\mygram}(\pi_{A_1}) \cdot  p_{\mygram}(\pi_{B_2})}%
1826: {p_{\mygram}(\pi_{A_2}) \cdot  p_{\mygram}(\pi_{B_1})} =
1827: \frac{\frac{1}{3}\cdot\frac{1}{3}}{\frac{2}{3}\cdot\frac{2}{3}} = 
1828: \frac{1}{4}$
1829: but
1830: $\frac{p_{\myaut}({\it axcbxd})}{p_{\myaut}({\it axdbxc})} =
1831: \frac{p_{\myaut}(\tau_{c}) \cdot  p_{\myaut}(\tau_{d})}%
1832: {p_{\myaut}(\tau_{d}) \cdot  p_{\myaut}(\tau_{c})} = 1 \neq \frac{1}{4}$.
1833: This shows that there is no $p_{\myaut}$ such that
1834: $(\myaut, p_{\myaut})$ assigns the same
1835: probabilities to strings over $\myterm$ as $(\mygram, p_{\mygram})$.
1836: It follows that
1837: the LR strategy cannot be extended to become a probabilistic
1838: parsing strategy.
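
The same conclusion can be spelled out for all four strings. 
As a sketch, write $q = p_{\myaut}(\tau_{c})$, a name introduced only here, so that
$p_{\myaut}(\tau_{d}) = 1 - q$ for a proper PPDT. Assuming, as argued above, that all
other transitions have probability 1, and noting that each string involves exactly two 
of the shift transitions $\tau_{c}$ and $\tau_{d}$, we obtain:
$$
\begin{array}{l@{\;=\;}l}
p_{\myaut}({\it axcbxc}) & q^2 \\
p_{\myaut}({\it axcbxd}) & q \cdot (1-q) \\
p_{\myaut}({\it axdbxc}) & (1-q) \cdot q \\
p_{\myaut}({\it axdbxd}) & (1-q)^2
\end{array}
$$
Agreement with $p_{\mygram}$ would require $q \cdot (1-q) = \frac{1}{9}$ and
$q \cdot (1-q) = \frac{4}{9}$ at the same time, which no value of $q$ satisfies.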
1839: 
1840: Note that for $\mygram$ as above, $p_{\mygram}(\pi_{A_1})$
1841: and $p_{\mygram}(\pi_{B_1})$ can be freely chosen, and this
1842: choice determines the other values of $p_{\mygram}$, so we have
1843: two free parameters. For $\myaut$ however, there is only one
1844: free parameter in the choice of $p_{\myaut}$.
1845: This is in conflict with an underlying assumption of existing work
1846: on probabilistic LR parsing, by e.g.\ \cite{BR93} and \cite{IN00},
1847: viz.\ that LR parsers would allow more fine-grained probability
1848: distributions than CFGs. However, for some practical grammars 
1849: from the area of natural language processing,
1850: \cite{SO99} has shown that LR parsers do allow
1851: more accurate probability distributions than the CFGs from which
1852: they were constructed, if probability functions are estimated from
1853: corpora.
1854: 
1855: By way of Theorem~\ref{t:sp}, it follows indirectly from
1856: the above that LR parsing lacks the SPP. 
1857: For the somewhat simpler ELR parsing strategy,
1858: to be discussed next,
1859: we will give a direct explanation of why it lacks the SPP.
1860: A direct explanation for LR parsing is much more involved and
1861: therefore is not reported here, although the argument is essentially
1862: of the same nature as the one we discuss for ELR parsing.
1863: 
1864: The ELR parsing strategy is not as well-known as LR parsing.
1865: It was originally 
1866: formulated as a parsing strategy for extended CFGs \cite{PU81,LE89},
1867: but its restriction to normal CFGs is interesting in its
1868: own right, as argued by \cite{NE94a}. 
1869: %MJ added:
1870: ELR parsing for CFGs is also related to the tabular algorithm
1871: from \cite{VO88}.
1872: 
1873: Concerning the representation of right-hand sides of rules,
1874: stack symbols
1875: for ELR parsing are similar to those for PLR parsing:
1876: only the part of a right-hand side is represented
1877: that consists of the grammar symbols that have been processed.
Unlike in LC and PLR parsing, however, a
stack symbol for ELR parsing contains
1880: a set consisting of one or more nonterminals from
1881: the left-hand sides of pairwise similar rules, 
1882: rather than a single such nonterminal. 
1883: This allows the commitment to certain rules,
1884: and in particular to their left-hand sides, to be
1885: postponed even longer than for LC and PLR parsing.
1886: 
1887: Thus, for a given CFG $\mygram  = (\myterm,$ $\mynont,$ $S,$ $\myrule)$,
1888: we construct a pair $\mystrat_{\it ELR}(\mygram) =
1889: (\myaut,f)$. Here $\myaut$ $=$ $(\myterm,$
1890: $\myrule,$ $\mysym,$ $[\{S\} \de\epsilon],$ $[\{S\} \de \sigma],$ $\mytrans)$,
1891: where $\mysym$ is a subset of $\{ [\mynontset \de \alpha ]\  |\
1892: \mynontset \subseteq \mynont \wedge 
1893: \forall A \in \mynontset \exists \beta[(A \de\alpha\beta) \in \myrule] \}$ $\cup$
1894: $\{ [\mynontset \de \alpha; B]\  |\
1895: \mynontset \subseteq\nobreak \mynont \wedge
1896: \forall A \in\nobreak\mynontset 
1897: \exists\beta[(A \de\alpha\beta) \in \myrule
1898: \wedge B \in \mynont] \}$.
1899: 
1900: We provide simultaneous inductive definitions of $\mysym$ and
1901: $\mytrans$:
1902: \begin{itemize}
1903: \item $[\{S\} \de\epsilon]\in \mysym$;
1904: \item For $[\mynontset \de \alpha ] \in \mysym$, 
1905: rule $A \de \alpha Y \beta$ and $a\in\myterm$ such that 
1906: $A \in \mynontset$ and $a \LCstar Y$, let
1907: $[\mynontset \de \alpha; a] \in \mysym$ and
1908: $\myscan{[\mynontset \de \alpha ]}{a}{\epsilon}%
1909:                 {[\mynontset \de \alpha; a]} \in \mytrans$;
1910: \item For $[\mynontset \de \alpha] \in \mysym$,
1911: rules $A \de \alpha B \beta$ and
1912: $\pi = C \de \epsilon$ such that $A\in\mynontset$ and
1913: $C \LCstar B$, let
1914: $[\mynontset \de \alpha ; C] \in \mysym$ and 
1915: $\myscan{[\mynontset \de \alpha ]}{\epsilon}{\pi}%
1916:                 {[\mynontset \de \alpha ; C]} \in \mytrans$;
1917: \item For $[\mynontset_1 \de \alpha ; X] \in \mysym$ and 
1918: $\mynontset_2 = \{ C\ |\ 
1919: \exists (A \de \alpha B \beta)\in\myrule[A\in \mynontset_1 \wedge
\exists\gamma[(C \de X \gamma)\in\myrule] \wedge C\LCstar B] \} \neq \emptyset$, let
1921: $[\mynontset_2\de X ]\in \mysym$ and 
1922: $\myep{[\mynontset_1 \de \alpha ; X]}%
1923:                 {[\mynontset_1 \de \alpha;X]\ [\mynontset_2\de X ]} \in \mytrans$;
1924: \item For $[\mynontset_1 \de \alpha;X], [\mynontset_2\de X\gamma ] \in \mysym$,
1925: rules $A \de \alpha B \beta$ and $\pi = C \de X\gamma$ such that
1926: $A \in \mynontset_1$, $C \in \mynontset_2$ and $C \LCstar B$, let
1927: $[\mynontset_1 \de \alpha; C] \in \mysym$ and 
1928: $\myscan{[\mynontset_1 \de \alpha;X]\ [\mynontset_2\de X\gamma ]}{\epsilon}{\pi}
1929:                 {[\mynontset_1 \de \alpha; C]} \in \mytrans$;
1930: \item For $[\mynontset_1 \de \alpha ; Y] \in \mysym$ and 
1931: $\mynontset_2 = \{ A\in\mynontset_1\ |\ 
1932: \exists \beta[(A \de \alpha Y \beta)\in\myrule] \}
1933: \neq \emptyset$, let
1934: $[\mynontset_2 \de \alpha Y ] \in \mysym$ and
1935: $\myep{[\mynontset_1 \de \alpha ; Y]}%
1936:                 {[\mynontset_2 \de \alpha Y ]} \in \mytrans$.
1937: \end{itemize}
1938: Note that the last five items are very similar to the five items 
for LC parsing. In the second-to-last item, we have assumed
1940: the availability of combined pop/swap transitions of the form
1941: $\myscan{X Y}{x}{y}{Z}$. Such a transition can be seen as short-hand for
1942: two transitions, the first of the form $\myep{X Y}{Z_{x,y}}$,
1943: where $Z_{x,y}$ is a new symbol not already in $\mysym$, and
1944: the second of the form $\myscan{Z_{x,y}}{x}{y}{Z}$.
1945: 
1946: The function $f$ is defined as in the case of PLR parsing, and
1947: turns a complete right-most derivation in
1948: reverse into a complete derivation.
1949: 
1950: ELR parsing has the CPP but, like LR parsing,
1951: it lacks the SPP. The problem is caused by
1952: transitions of the form
1953: $\myscan{[\mynontset_1 \de \alpha;X]\ [\mynontset_2\de X\gamma ]}{\epsilon}{\pi}
1954:                 {[\mynontset_1 \de \alpha; C]}$.
1955: Intuitively, a subcomputation that recognizes $\gamma$,
1956: directly after recognition of $X$, only commits to
1957: a choice of the left-hand side nonterminal $C$ from
1958: $\mynontset_2$ after $\gamma$ has been
1959: completely recognized, and this choice is communicated
1960: to lower areas of the stack through this pop transition.
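
As a sketch of this late commitment, consider the grammar from the beginning of this
section and the transitions listed in Figure~\ref{f:ELRtrans}. 
On the input prefix ${\it axc}$, the stack develops as follows:
$$
\begin{array}{l@{\quad}l}
\mbox{initially:} & [\{S\}\de\epsilon] \\
\mbox{read\ } a\mbox{:} & [\{S\}\de\epsilon;a] \\
\mbox{push:} & [\{S\}\de\epsilon;a]\ [\{A\}\de a] \\
\mbox{read\ } x\mbox{:} & [\{S\}\de\epsilon;a]\ [\{A\}\de a;x] \\
\mbox{push:} & [\{S\}\de\epsilon;a]\ [\{A\}\de a;x]\ [\{C,D\}\de x] \\
\mbox{read\ } c\mbox{:} & [\{S\}\de\epsilon;a]\ [\{A\}\de a;x]\ [\{C,D\}\de x;c] \\
\mbox{swap:} & [\{S\}\de\epsilon;a]\ [\{A\}\de a;x]\ [\{C\}\de xc] \\
\mbox{pop, output\ } \pi_{C}\mbox{:} & [\{S\}\de\epsilon;a]\ [\{A\}\de a;C] \\
\mbox{swap:} & [\{S\}\de\epsilon;a]\ [\{A\}\de aC] \\
\mbox{pop, output\ } \pi_{A_1}\mbox{:} & [\{S\}\de\epsilon;A]
\end{array}
$$
The symbol $[\{C,D\}\de x]$ is pushed before it is known whether $C$ or $D$ will be
recognized, and the choice between $\pi_{A_1}$ and $\pi_{A_2}$ becomes fixed only
through the pop transitions that follow the reading of $c$.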
1961: 
1962: \begin{figure}[tp]
1963: $$
1964: \begin{array}{cl}
1965: \comment{\tau_{a}} & \myscan{[\{S\}\de\epsilon]}{a}{\epsilon}{[\{S\}\de\epsilon;a]} \\
1966: \comment{\tau_{A}} & \myep{[\{S\}\de\epsilon;a]}{[\{S\}\de\epsilon;a]\ [\{A\}\de a]} \\
1967: \comment{\tau_{A/x}} & \myscan{[\{A\}\de a]}{x}{\epsilon}{[\{A\}\de a; x]} \\
1968: \comment{\tau'_{A/x}} & \myep{[\{A\}\de a;x]}{[\{A\}\de a;x]\ [\{C,D\}\de x]} \\
\tau_{c}\ = & \myscan{[\{C,D\}\de x]}{c}{\epsilon}{[\{C,D\}\de x; c]} \\
\tau_{d}\ = & \myscan{[\{C,D\}\de x]}{d}{\epsilon}{[\{C,D\}\de x; d]} \\
1971: \comment{\tau_{C}} & \myep{[\{C,D\}\de x;c]}{[\{C\}\de xc]} \\
1972: \comment{\tau_{A/C}} &  \myscan{[\{A\}\de a;x]\ [\{C\}\de xc]}%
1973: 		{\epsilon}{\pi_{C}}{[\{A\}\de a; C]} \\
\comment{\tau'_{A/C}} & \myep{[\{A\}\de a; C]}{[\{A\}\de a C]} \\
1975: \comment{\tau_{A_1}} & \myscan{[\{S\}\de\epsilon;a]\ [\{A\}\de aC]}%
1976: 		{\epsilon}{\pi_{A_1}}{[\{S\}\de \epsilon; A]} \\
1977: \comment{\tau_{D}} & \myep{[\{C,D\}\de x;d]}{[\{D\}\de xd]}  \\
1978: \comment{\tau_{A/D}} &  \myscan{[\{A\}\de a;x]\ [\{D\}\de xd]}%
1979: 		{\epsilon}{\pi_{D}}{[\{A\}\de a; D]} \\
\comment{\tau'_{A/D}} & \myep{[\{A\}\de a; D]}{[\{A\}\de a D]} \\
1981: \comment{\tau_{A_2}} & \myscan{[\{S\}\de\epsilon;a]\ [\{A\}\de aD]}%
1982: 		{\epsilon}{\pi_{A_2}}{[\{S\}\de \epsilon; A]} \\
1983: \comment{\tau_{S/A}} & \myep{[\{S\}\de \epsilon; A]}{[\{S\}\de A]} \\
1984: \comment{\tau_{b}} & \myscan{[\{S\}\de A]}{b}{\epsilon}{[\{S\}\de A; b]} \\
1985: \comment{\tau_{B}} & \myep{[\{S\}\de A;b]}{[\{S\}\de A;b]\ [\{B\}\de b]} \\
1986: \comment{\tau_{B/x}} & \myscan{[\{B\}\de b]}{x}{\epsilon}{[\{B\}\de b; x]} \\
1987: \comment{\tau'_{B/x}} & \myep{[\{B\}\de b;x]}{[\{B\}\de b;x]\ [\{C,D\}\de x]} \\
1988: \comment{\tau_{B/C}} &  \myscan{[\{B\}\de b;x]\ [\{C\}\de xc]}%
1989: 		{\epsilon}{\pi_{C}}{[\{B\}\de b; C]} \\
\comment{\tau'_{B/C}} & \myep{[\{B\}\de b; C]}{[\{B\}\de b C]} \\
1991: \comment{\tau_{B_1}} & \myscan{[\{S\}\de A;b]\ [\{B\}\de bC]}%
1992: 		{\epsilon}{\pi_{B_1}}{[\{S\}\de A;B]} \\
1993: \comment{\tau_{B/D}} &  \myscan{[\{B\}\de b;x]\ [\{D\}\de xd]}%
1994: 		{\epsilon}{\pi_{D}}{[\{B\}\de b; D]} \\
\comment{\tau'_{B/D}} & \myep{[\{B\}\de b; D]}{[\{B\}\de b D]} \\
1996: \comment{\tau_{B_2}} & \myscan{[\{S\}\de A;b]\ [\{B\}\de bD]}%
1997: 		{\epsilon}{\pi_{B_2}}{[\{S\}\de A;B]} \\
1998: \comment{\tau_{S/B}} & \myep{[\{S\}\de A; B]}{[\{S\}\de AB]} \\
1999: \end{array}
2000: $$
2001: \caption{Transitions for ELR parsing strategy.}
2002: \label{f:ELRtrans}
2003: \end{figure}
2004: 
That ELR parsing indeed cannot be extended to become a probabilistic
2006: parsing strategy can be shown by considering the same
2007: CFG as above. From the set of transitions, shown in 
2008: Figure~\ref{f:ELRtrans},
2009: we restrict our attention to the following two:
2010: $$
2011: \begin{array}{c@{\;=\;}l}
\tau_{c} & \myscan{[\{C,D\}\de x]}{c}{\epsilon}{[\{C,D\}\de x; c]} \\
\tau_{d} & \myscan{[\{C,D\}\de x]}{d}{\epsilon}{[\{C,D\}\de x; d]} 
2014: \end{array}
2015: $$
2016: This is the only pair of transitions that can be applied for
one and the same top-of-stack symbol, viz.\ $[\{C,D\}\de x]$.
2018: The rest of the proof is identical to that in the case of
2019: LR parsing.
2020: 
2021: Problems with the extension of ELR parsing to become
2022: a probabilistic parsing
2023: strategy have been pointed out before by \cite{TE97},
2024: who furthermore proposed an alternative type of probabilistic
2025: push-down automaton that is capable of computing multiple 
2026: probabilities for each subderivation. 
2027: However, since a transition of such an automaton may perform an
2028: unbounded number of elementary computations on probabilities, we 
2029: feel this automaton model cannot realistically express
2030: the behaviour of probabilistic parsers,
2031: and therefore it will not be considered further here.
2032: 
2033: \comment{
2034: We show here that the absence of strong predictiveness
2035: may mean that a parsing strategy cannot be extended to a 
2036: probabilistic parsing strategy. We illustrate this by two different
2037: non-strongly predictive parsing strategies $\mystrat=\mystrat_{\it ELR}$
2038: and $\mystrat=\mystrat_{\it LR}$. In each case, we present
2039: a PCFG $(\mygram, p_{\mygram})$ 
2040: such that
2041: $\mystrat(\mygram) = (\myaut, f)$ and no probability function $p_{\myaut}$
2042: for $\myaut$
2043: can be found such that $(\myaut, p_{\myaut})$ assigns the same
2044: probabilities to strings as $(\mygram, p_{\mygram})$
2045: 
2046: The reason we consider ELR parsing is that it is the
2047: first strategy in a family of parsing strategies, 
2048: following LC and PLR parsing, that is not strongly predictive.\footnote{%
2049: This family was discussed before in \cite{NE94a}.}
2050: In particular, the comparison
2051: between PLR and ELR parsing helps to clarify the problem that the absence
2052: of strong predictiveness poses for extending parsing
2053: strategies to the probabilistic case.
2054: However, we also
2055: treat the more complicated LR parsing strategy \cite{SI90} since that 
2056: is better known than ELR parsing.
2057: 
2058: The ELR strategy results in $\mystrat(\mygram)=(\myaut,f)$, where
2059: $\myaut$ $=$ $(\myterm,$
2060: $\myrule,$ $\mysym,$ $[\{S\} \de\epsilon],$ $[\{S\} \de AB],$ $\mytrans)$
2061: and $\mytrans$ contains:
2062: 
2063: and $\mysym$ is the set of the stack symbols that occur
2064: in the above transitions.
2065: 
2066: Take a probability function $p_{\myaut}$
2067: such that $(\myaut, p_{\myaut})$ is a proper PPDT.
2068: It can be shown that $p_{\myaut}$ must assign 1 to all 
2069: transitions except $\tau_{c}$ and $\tau_{d}$, since that is the only
2070: pair of distinct transitions that can be applied for one and the 
2071: same top-of-stack symbol,
2072: viz.\ $[\{C,D\}\de x]$.
2073: 
2074: However, 
2075: $\frac{p_{\mygram}({\it axcbxd})}{p_{\mygram}({\it axdbxc})} = 
2076: \frac{p_{\mygram}(\pi_{A_1}) \cdot  p_{\mygram}(\pi_{B_2})}%
2077: {p_{\mygram}(\pi_{A_2}) \cdot  p_{\mygram}(\pi_{B_1})} = 
2078: \frac{(\frac{4}{10})^2}{(\frac{6}{10})^2} = \frac{4}{9}$
2079: but
2080: $\frac{p_{\myaut}({\it axcbxd})}{p_{\myaut}({\it axdbxc})} =
2081: \frac{p_{\myaut}(\tau_{c}) \cdot  p_{\myaut}(\tau_{d})}%
2082: {p_{\myaut}(\tau_{d}) \cdot  p_{\myaut}(\tau_{c})} = 1 \neq \frac{4}{9}$.
2083: This shows that there is no $p_{\myaut}$ such that
2084: $(\myaut, p_{\myaut})$ assigns the same
2085: probabilities to strings over $\myterm$ as $(\mygram, p_{\mygram})$.
2086: It follows that
2087: the ELR strategy cannot be extended to be a probabilistic 
2088: parsing strategy.
2089: 
2090: The LR strategy can also be cast in a form that
2091: satisfies our normal form PDTs. We will not give a complete
2092: specification of LR parsing since much existing literature,
2093: such as \cite{SI90}, already contains such specifications.
2094: We will assume below that the reader is familiar with this literature.
2095: 
Applying the LR(0) strategy to the CFG above, we obtain the following
2098: PDT. 
2099: This is very similar to the PDT we obtained in the case of ELR
2100: parsing. 
2101: As usual, we denote LR states by a set of kernel items,
2102: which are `dotted' rules. 
2103: Since our type of pop transition only allows
2104: a pop of one symbol at a time, we have to split up a reduction of
2105: a rule $A \de X_1 \cdots X_m$ into
2106: a sequence of $m+1$ transitions, the first $m-1$ resulting in stack symbols
2107: $[A \de X_1 \cdots X_m \bul]$, $[A \de X_1 \cdots X_{m-1} \bul X_m]$, \ldots,
2108: $[A \de X_1 \bul X_2 \cdots X_m]$ on top of the stack, the next
2109: resulting in a top-of-stack $[W;A]$, where $W$ is a set of dotted rules,
2110: and lastly the usual `goto' set of $W$ and $A$ is pushed.
2111: 
2112: We have
$\Xinit = \{S \de\ \bul AB\}$ and
$\Xfinal = [S \de\ \bul AB]$ and the set $\mytrans$ of transitions is given
2115: in Figure~\ref{f:LRtrans}.
2116: %begin
2117: In order to simplify the presentation, we allow two new types of
2118: transition, without increasing the power of PDTs.
2119: The first is a combined swap/push transition of the form
2120: $\myscan{X}{x}{y}{Z Y}$. Such a transition can be seen as short-hand for
2121: two transitions, the first of the form $\myscan{X}{x}{y}{Z_Y}$,
2122: where $Z_Y$ is a new symbol not already in $\mysym$, and
2123: the second of the form $\myep{Z_Y}{Z_Y Y}$.
2124: We also assume the existence of a
2125: transition $\myep{Z_Y Y'}{X'}$
2126: for each transition $\myep{Z Y'}{X'}$ that is actually specified.
2127: The second new type of transition
2128: is a combined swap/pop transition of the form
2129: $\myscan{X Y}{x}{y}{Z}$. Such a transition can be seen as short-hand for
2130: two transitions, the first of the form $\myscan{Y}{x}{y}{Y_X}$,
2131: where $Y_X$ is a new symbol not already in $\mysym$, and
2132: the second of the form $\myep{X Y_X}{Z}$.
2133: 
2134: \begin{figure*}
2135: $$
2136: \begin{array}{c@{\;=\;}l}
2137: \tau_{a} & \myscan{\{S\de\ \bul AB\}}{a}{\epsilon}%
2138: 	{\{S\de\ \bul AB\}\ \{A\de a\bul C, A\de a\bul D\}} \\
2139: \tau_{A_1} & \myscan{\{S\de\ \bul AB\}\ [A\de a\bul C]}%
2140:                 {\epsilon}{\pi_{A_1}}{[\{S\de\ \bul A B\};A]} \\
2141: \tau_{A_2} & \myscan{\{S\de\ \bul AB\}\ [A\de a\bul D]}%
2142:                 {\epsilon}{\pi_{A_2}}{[\{S\de\ \bul A B\};A]} \\
2143: \tau_{S/A} & \myscan{[\{S\de\ \bul AB\}; A]}%
2144: 		{\epsilon}{\epsilon}{\{S\de\ \bul AB\}\ \{S\de A \bul B\}} \\
2145: \tau_{b} & \myscan{\{S\de A\bul B\}}{b}{\epsilon}%
2146: 		{\{S\de A\bul B\}\ \{B\de b\bul C, B\de b\bul D\}} \\
\tau_{B_1} & \myscan{\{S\de A\bul B\}\ [B\de b\bul C]}%
2148:                 {\epsilon}{\pi_{B_1}}{[\{S\de A \bul B\}; B]} \\ 
\tau_{B_2} & \myscan{\{S\de A\bul B\}\ [B\de b\bul D]}%
2150:                 {\epsilon}{\pi_{B_2}}{[\{S\de A \bul B\}; B]} \\ 
2151: \tau_{S/B} & \myscan{[\{S\de A\bul B\}; B]}%
2152: 		{\epsilon}{\epsilon}{\{S\de A \bul B\}\ \{S\de A B \bul\}} \\
2153: \tau_{S} & \myep{\{S\de A B \bul\}}{[S\de A B \bul]} \\
2154: \tau_{S'} & \myep{\{S\de A \bul B\}\ [S\de A B \bul]}%
2155: 		{[S\de A \bul B ]} \\
2156: \tau_{S''} & \myscan{\{S\de\ \bul A B\}\ [S\de A\bul B ]}%
2157: 		{\epsilon}{\pi_{S}}{[S\de\ \bul A B ]} \\
2158: \tau_{A/x} & \myscan{\{A\de a\bul C, A\de a\bul D\}}%
2159: 		{x}{\epsilon}%
2160: 		{\{A\de a\bul C, A\de a\bul D\}\ \{C\de x\bul c, D\de x\bul d\}} \\
2161: \tau_{A/C} & \myscan{\{A\de a\bul C, A\de a\bul D\}\ [C\de x\bul c]}%
2162: 		{\epsilon}{\pi_{C}}{[\{A\de a\bul C, A\de a\bul D\};C]} \\
2163: \tau'_{A_1} & \myscan{[\{A\de a\bul C, A\de a\bul D\};C]}%
2164: 		{\epsilon}{\epsilon}%
2165: 		{\{A\de a\bul C, A\de a\bul D\}\ \{A\de a C\bul\}} \\
2166: \tau''_{A_1} & \myep{\{A\de a C\bul\}}{[A\de a C\bul]} \\
2167: \tau'''_{A_1} & \myep{\{A\de a\bul C, A\de a\bul D\}\ [A\de a C\bul]}%
2168: 		{[A\de a \bul C]} \\
2169: \tau_{A/D} & \myscan{\{A\de a\bul C, A\de a\bul D\}\ [D\de x\bul d]}%
2170: 		{\epsilon}{\pi_{D}}{[\{A\de a\bul C, A\de a\bul D\};D]} \\
2171: \tau'_{A_2} & \myscan{[\{A\de a\bul C, A\de a\bul D\};D]}%
2172: 		{\epsilon}{\epsilon}%
2173: 		{\{A\de a\bul C, A\de a\bul D\}\ \{A\de a D\bul\}} \\
2174: \tau''_{A_2} & \myep{\{A\de a D\bul\}}{[A\de a D\bul]} \\
2175: \tau'''_{A_2} & \myep{\{A\de a\bul C, A\de a\bul D\}\ [A\de a D\bul]}%
2176: 		{[A\de a \bul D]} \\
2177: 
2178: \tau_{B/x} & \myscan{\{B\de b\bul C, B\de b\bul D\}}%
2179:                 {x}{\epsilon}%
2180:                 {\{B\de b\bul C, B\de b\bul D\}\ \{C\de x\bul c, D\de x\bul d\}} \\
2181: \tau_{B/C} & \myscan{\{B\de b\bul C, B\de b\bul D\}\ [C\de x\bul c]}%
2182:                 {\epsilon}{\pi_{C}}{[\{B\de b\bul C, B\de b\bul D\};C]} \\
2183: \tau'_{B_1} & \myscan{[\{B\de b\bul C, B\de b\bul D\};C]}%
2184:                 {\epsilon}{\epsilon}%
2185:                 {\{B\de b\bul C, B\de b\bul D\}\ \{B\de b C\bul\}} \\
2186: \tau''_{B_1} & \myep{\{B\de b C\bul\}}{[B\de b C\bul]} \\
2187: \tau'''_{B_1} & \myep{\{B\de b\bul C, B\de b\bul D\}\ [B\de b C\bul]}%
2188:                 {[B\de b \bul C]} \\
2189: \tau_{B/D} & \myscan{\{B\de b\bul C, B\de b\bul D\}\ [D\de x\bul d]}%
2190:                 {\epsilon}{\pi_{D}}{[\{B\de b\bul C, B\de b\bul D\};D]} \\
2191: \tau'_{B_2} & \myscan{[\{B\de b\bul C, B\de b\bul D\};D]}%
2192:                 {\epsilon}{\epsilon}%
2193:                 {\{B\de b\bul C, B\de b\bul D\}\ \{B\de b D\bul\}} \\
2194: \tau''_{B_2} & \myep{\{B\de b D\bul\}}{[B\de b D\bul]} \\
2195: \tau'''_{B_2} & \myep{\{B\de b\bul C, B\de b\bul D\}\ [B\de b D\bul]}%
2196:                 {[B\de b \bul D]} \\
2197: \tau_{c} & \myscan{\{C\de x\bul c, D\de x\bul d\}}%
2198: 		{c}{\epsilon}%
2199: 		{\{C\de x\bul c, D\de x\bul d\}\ \{C \de xc\bul\}} \\
2200: \tau_{C} & \myep{\{C \de xc\bul\}}{[C \de xc\bul]} \\
2201: \tau'_{C} & \myep{\{C\de x\bul c, D\de x\bul d\}\ [C \de xc\bul]}%
2202: 		{[C \de x\bul c]} \\
2203: \tau_{d} & \myscan{\{C\de x\bul c, D\de x\bul d\}}%
2204: 		{d}{\epsilon}%
2205: 		{\{C\de x\bul c, D\de x\bul d\}\ \{D \de xd\bul\}} \\
2206: \tau_{D} & \myep{\{D \de xd\bul\}}{[D \de xd\bul]} \\
2207: \tau'_{D} & \myep{\{C\de x\bul c, D\de x\bul d\}\ [D \de xd\bul]}%
2208: 		{[D \de x\bul d]} 
2209: \end{array}
2210: $$
2211: \caption{The set of transitions for the LR strategy.}
2212: \label{f:LRtrans}
2213: \end{figure*}
2214: 
2215: As in the case of ELR parsing, there are only two transitions,
2216: viz.\ $\tau_{c}$ and $\tau_{d}$, to which a probability function
2217: $p_{\myaut}$ can assign a value different from 1.
2218: Again, $\frac{p_{\myaut}({\it axcbxd})}{p_{\myaut}({\it axdbxc})} =
2219: \frac{p_{\myaut}(\tau_{c}) \cdot  p_{\myaut}(\tau_{d})}%
2220: {p_{\myaut}(\tau_{d}) \cdot  p_{\myaut}(\tau_{c})} = 1 \neq \frac{4}{9}$.
2221: This shows that also the LR strategy 
2222: cannot be extended to be a probabilistic parsing strategy.
2223: }
2224: 
2225: 
2226: \section{Extension in the wide sense}
2227: \label{s:wide}
2228: 
2229: The main result from the previous section is that,
2230: in general,
2231: there is no construction of probabilistic LR parsers 
2232: from PCFGs such that, 
2233: firstly, a probabilistic LR parser has the same set of
2234: transitions as the LR parser that would be constructed from the CFG in
2235: the non-probabilistic case and,
2236: secondly, the probabilistic LR parser
2237: has the same probability distribution as the given PCFG.
2238: 
2239: There is a construction proposed by \cite{WR91,WR91a,NG91}
2240: that operates under different assumptions. In particular, a
2241: probabilistic LR parser constructed from a certain PCFG
2242: may possess several `copies' of one and the same 
2243: LR state from the (non-probabilistic) LR parser constructed from
2244: the CFG, 
2245: each annotated with some additional information to
2246: distinguish it from other copies of the same LR state. 
2247: Each such copy behaves as the corresponding LR state from the
2248: LR parser if we neglect probabilities. 
2249: Transitions may however
2250: obtain different probabilities if they operate on different copies
2251: of identical LR states, based on the additional information
2252: attached to the LR states.
2253: 
2254: By this construction,
2255: there are many PCFGs for which one may obtain a
2256: probabilistic LR parser that describes the same
2257: probability distribution. This even holds
2258: for the PCFG we discussed in the previous section, although
2259: we have shown that a probabilistic LR parser {\em without\/}
2260: an extended LR state set could not describe the same
2261: probability distribution.
A serious problem with this approach, however, is that 
2263: the required number of copies of each LR state is potentially infinite.
2264: 
2265: In this section we formulate these observations in terms of
2266: general parsing strategies and a wider notion of
2267: extension to probabilistic parsing strategies. We also
2268: show that the above-mentioned problem with
2269: infinite numbers of states is inherent in LR parsing, rather
2270: than due to the particular construction of LR parsers from
2271: PCFGs by \cite{WR91,WR91a,NG91}.
2272: 
2273: We first introduce some auxiliary notation and terminology. 
2274: Let $\myaut$ and $\myaut'$ be two PDTs and
2275: let $g$ be a function mapping
2276: the stack symbols of $\myaut'$
2277: to the stack symbols of $\myaut$.
2278: If $\tau$ is a transition of the form $\myep{X}{X Y}$,
2279: $\myep{\it Y X}{Z}$ or $\myscan{X}{x}{y}{Y}$ from $\myaut'$,
2280: then we let $g(\tau)$ denote a transition of the form
2281: $\myep{g(X)}{g(X) g(Y)}$,
2282: $\myep{\it g(Y) g(X)}{g(Z)}$ or $\myscan{g(X)}{x}{y}{g(Y)}$, respectively.
2283: This effectively extends $g$ to a function from transitions to
2284: transitions. 
Note that a transition $g(\tau)$ may, but need not, be a
transition from $\myaut$.
In the same vein, we extend $g$ to
a function from computations of $\myaut'$ to
sequences of transitions (which may, but need not, be
computations of $\myaut$),
2291: by applying $g$ element-wise as a function on transitions.
2292: 
2293: For PDTs $\myaut$ $=$
2294: $(\myterm_1,$ $\myterm_2,$ $\mysym,$ $\Xinit,$ $\Xfinal,$ $\mytrans)$
2295: and $\myaut'$ $=$
2296: $(\myterm'_1,$ $\myterm'_2,$ $\mysym',$ $\Xinit',$ $\Xfinal',$ $\mytrans')$,
2297: we say
2298: $\myaut'$
2299: is an {\em expansion\/} of $\myaut$
2300: if $\myterm'_1=\myterm_1$, $\myterm'_2 = \myterm_2$ and there is a function
2301: $g$ such that:
2302: \begin{itemize}
2303: \item $g$ is a surjective function from $\mysym'$ to $\mysym$.
2304: \item Extended to transitions, 
2305: $g$ is a surjective function from $\mytrans'$ to $\mytrans$.
2306: \item Extended to computations,
2307: $g$ is a bijective function from the set of computations of
2308: $\myaut'$ to the set of computations of $\myaut$.
2309: \end{itemize}
2310: In other words, for each stack symbol from $\mysym$, 
2311: $\mysym'$ may contain one or more corresponding
stack symbols. The language that
is accepted and the output strings that are produced for given input
strings remain the same, however. Furthermore, the fact that $g$ is a bijection 
2315: on computations implies that the behaviour of the two
2316: automata is identical in terms of e.g.\ the length of
2317: computations and the amount of nondeterminism encountered within
2318: those computations.
2319: 
2320: To illustrate these definitions, assume we have an arbitrary
2321: PDT $\myaut$. We construct a second PDT $\myaut'$ that is an
2322: expansion of $\myaut$. It has the
2323: same input and output alphabets, and for each stack symbol
2324: $X$ from $\myaut$, $\myaut'$ has two stack symbols $(X,0)$ and
2325: $(X,1)$. A second component $0$ signifies that the distance 
2326: of the stack symbol to the bottom of the
2327: stack is even, and $1$ that it is odd.
2328: Naturally, if $\Xinit$ and $\Xfinal$ are the initial and final stack symbols
2329: of $\myaut$, we choose the initial and final stack symbols of $\myaut'$ to be
2330: $(\Xinit,0)$ and $(\Xfinal,0)$, as they have distance 0 to the
2331: bottom of the stack.
2332: For each transition of the form $\myep{X}{X Y}$, 
2333: $\myep{\it Y X}{Z}$ or $\myscan{X}{x}{y}{Y}$ from $\myaut$,
2334: we let $\myaut'$ have the transitions
2335: $\myep{(X,i)}{(X,i) (Y,1-i)}$,
2336: $\myep{(Y,i) (X,1-i)}{(Z,i)}$ or $\myscan{(X,i)}{x}{y}{(Y,i)}$, 
2337: respectively, for both $i=0$ and $i=1$. 
2338: Obviously, the function $g$ mapping stack symbols
2339: from $\myaut'$ to stack symbols from $\myaut$ is given
2340: by $g((X,i))=X$ for all $X$ and $i\in\{0,1\}$.
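
As a small illustration of this construction (the stack symbols $U$, $V$ and $W$ are hypothetical and serve only as an example), suppose a computation of $\myaut$ reaches the stack $\Xinit\ U\ V$, with $V$ on top. The corresponding computation of $\myaut'$ then reaches
$$
(\Xinit,0)\ (U,1)\ (V,0),
$$
since the three symbols have distances 0, 1 and 2 to the bottom of the stack. A push transition $\myep{V}{V W}$ of $\myaut$ applied at this point corresponds to $\myep{(V,0)}{(V,0) (W,1)}$ in $\myaut'$, and $g$ maps both $\myep{(V,0)}{(V,0) (W,1)}$ and $\myep{(V,1)}{(V,1) (W,0)}$ back to $\myep{V}{V W}$.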
2341: 
2342: We now come to the central definition of this section.
2343: We say that probabilistic parsing strategy $\mystrat'$
2344: is an {\em extension in the wide sense\/} of parsing strategy 
2345: $\mystrat$ if for each reduced CFG $\mygram$ and 
2346: probability function $p_{\mygram}$ we have
2347: $\mystrat(\mygram)=(\myaut, f)$ if and only if
2348: $\mystrat'(\mygram, p_{\mygram})=(\myaut', p_{\myaut'}, f)$
2349: for some $\myaut'$ that is an expansion of $\myaut$
2350: and some $p_{\myaut'}$. This definition allows more 
2351: probabilistic parsing strategies $\mystrat'$ to be related to a given
2352: strategy $\mystrat$ than the definition of extension from 
2353: Section~\ref{s:strategy}.
2354: 
LR parsing, however, which we know cannot be extended to a 
probabilistic strategy in the narrow sense of Section~\ref{s:strategy}, 
cannot be extended in the wide sense to a probabilistic parsing
strategy either.
2359: To prove this,
2360: consider the following PCFG $(\mygram,p_{\mygram})$, 
2361: taken from \cite{WR91} with minor modifications:
2362: $$
2363: \begin{array}{c@{\;=\;}ll}
2364: \pi_{S} & S \de A, & p_{\mygram}(\pi_{S}) = 1 \\[.1ex]
2365: \pi_{A_1} & A \de B, & p_{\mygram}(\pi_{A_1}) = \frac{1}{2} \\[.1ex]
2366: \pi_{A_2} & A \de C, & p_{\mygram}(\pi_{A_2}) = \frac{1}{2} \\[.1ex]
2367: \pi_{B_1} & B \de {\it aB}, & p_{\mygram}(\pi_{B_1}) = \frac{1}{3} \\[.1ex]
2368: \pi_{B_2} & B \de {\it b}, & p_{\mygram}(\pi_{B_2}) = \frac{2}{3} \\[.1ex]
2369: \pi_{C_1} & C \de {\it aC}, & p_{\mygram}(\pi_{C_1}) = \frac{2}{3} \\[.1ex]
2370: \pi_{C_2} & C \de {\it c}, & p_{\mygram}(\pi_{C_2}) = \frac{1}{3}
2371: \end{array}
2372: $$
2373: The CFG $\mygram$ generates strings of the form $a^n b$ and $a^n c$ for
2374: any $n \geq 0$. Observe that
2375: $\frac{p_{\mygram}(a^n b)}{p_{\mygram}(a^n c)}$ $=$
2376: $\frac{ \frac{1}{2} \cdot
2377: 		\left(\frac{1}{3}\right)^{n} \cdot \frac{2}{3} }{
2378: 	\frac{1}{2} \cdot
2379:                 \left(\frac{2}{3}\right)^{n} \cdot \frac{1}{3} }$ $=$
2380: $\left( \frac{1}{2} \right)^{n-1}$. 
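
For concreteness, the first few instances of this ratio can be worked out directly from the rule probabilities listed above:
$$
\begin{array}{l@{\quad}l@{\quad}l@{\quad}l}
n=0: & p_{\mygram}(b) = \frac{1}{2} \cdot \frac{2}{3} = \frac{1}{3}, &
       p_{\mygram}(c) = \frac{1}{2} \cdot \frac{1}{3} = \frac{1}{6}, &
       \mbox{ratio } 2 \\[.5ex]
n=1: & p_{\mygram}(ab) = \frac{1}{2} \cdot \frac{1}{3} \cdot \frac{2}{3} = \frac{1}{9}, &
       p_{\mygram}(ac) = \frac{1}{2} \cdot \frac{2}{3} \cdot \frac{1}{3} = \frac{1}{9}, &
       \mbox{ratio } 1 \\[.5ex]
n=2: & p_{\mygram}(aab) = \frac{1}{2} \cdot \left(\frac{1}{3}\right)^{2} \cdot \frac{2}{3} = \frac{1}{27}, &
       p_{\mygram}(aac) = \frac{1}{2} \cdot \left(\frac{2}{3}\right)^{2} \cdot \frac{1}{3} = \frac{2}{27}, &
       \mbox{ratio } \frac{1}{2}
\end{array}
$$
These values agree with $\left( \frac{1}{2} \right)^{n-1}$ for $n=0,1,2$.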
2381: 
2382: Let $\myaut$ be such that $\mystrat_{\it LR}(\mygram)= (\myaut,f)$ and
2383: consider input strings of the form $a^n b$ and $a^n c$, $n \geq 1$. 
2384: After scanning the first $n$ symbols, $\myaut$
2385: reaches a configuration where the top-of-stack $X$ is
2386: given by the set of (kernel) items:
2387: $$
2388: X=\{ B \de a \bul B, C \de a \bul C \}
2389: $$
2390: 
2391: There are three applicable transitions, representing shift
2392: actions over $a$, $b$ and $c$, given by:
2393: $$
2394: \begin{array}{c@{\;=\;}l}
2395: \tau_a & \myscan{X}{a}{\epsilon}{X\ X} \\
2396: \tau_b & \myscan{X}{b}{\epsilon}{X\ \{B \de b\bul\}} \\
2397: \tau_c & \myscan{X}{c}{\epsilon}{X\ \{C \de c\bul\}} 
2398: \end{array}
2399: $$
2400: After reading $b$ or $c$,
2401: the remaining transitions are fully deterministic.
2402: 
2403: For a PDT $\myaut'$ that is an expansion of $\myaut$, we may have
2404: different stack symbols that are all mapped to $X$ by function $g$. 
2405: These stack symbols can be referred to as 
2406: $X_n$, which occur as top-of-stack
2407: after scanning the first $n$ symbols of $a^n b$ or $a^n c$, $n \geq 1$.
2408: We refer to the applicable transitions with top-of-stack $X_n$ as:
2409: $$
2410: \begin{array}{c@{\;=\;}l}
2411: \tau_{a,n} & \myscan{X_n}{a}{\epsilon}{X_n\ X_{n+1}} \\
2412: \tau_{b,n} & \myscan{X_n}{b}{\epsilon}{X_n\ \{B \de b\bul\}_n} \\
2413: \tau_{c,n} & \myscan{X_n}{c}{\epsilon}{X_n\ \{C \de c\bul\}_n}
2414: \end{array}
2415: $$
2416: for certain stack symbols $\{B \de b\bul\}_n$ and
2417: $\{C \de c\bul\}_n$ that $g$ maps to $\{B \de b\bul\}$ and
2418: $\{C \de c\bul\}$, respectively.
2419: 
2420: Now let us assume we have a probability function $p_{\myaut'}$
2421: such that $(\myaut',p_{\myaut'})$ is a PPDT.
2422: Since the application of either $\tau_{b,n}$ or $\tau_{c,n}$ is
2423: the only nondeterministic step 
2424: that distinguishes recognition of $a^n b$ from
2425: recognition of $a^n c$, $n \geq 1$, it follows that
$\frac{p_{\myaut'}(a^n b)}{p_{\myaut'}(a^n c)}$ $=$
$\frac{p_{\myaut'}(\tau_{b,n})}{p_{\myaut'}(\tau_{c,n})}$.
If $(\myaut',p_{\myaut'})$ assigns the same probabilities
to strings over alphabet $\{a,b,c\}$ as $(\mygram,p_{\mygram})$,
then $\frac{p_{\myaut'}(\tau_{b,n})}{p_{\myaut'}(\tau_{c,n})}$
2431: must be equal to $\frac{p_{\mygram}(a^n b)}{p_{\mygram}(a^n c)}$ $=$
2432: $\left( \frac{1}{2} \right)^{n-1}$ for each 
2433: $n\geq 1$. Since $\left( \frac{1}{2} \right)^{n-1}$ is a different
2434: value for each $n$ however, this would require $\myaut'$ to possess
2435: infinitely many stack symbols, which is in conflict with the definition
2436: of push-down transducers.
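
The step from distinct ratios to distinct stack symbols can be spelled out as follows, under the assumption (suggested by the notation above) that each top-of-stack $X_n$ has a single applicable scan transition over $b$ and a single one over $c$. Suppose that $X_m = X_n$ for some $m \neq n$ with $m,n \geq 1$. Then $\tau_{b,m}=\tau_{b,n}$ and $\tau_{c,m}=\tau_{c,n}$, and therefore
$$
\left( \frac{1}{2} \right)^{m-1} =
\frac{p_{\myaut'}(\tau_{b,m})}{p_{\myaut'}(\tau_{c,m})} =
\frac{p_{\myaut'}(\tau_{b,n})}{p_{\myaut'}(\tau_{c,n})} =
\left( \frac{1}{2} \right)^{n-1},
$$
which is impossible for $m \neq n$. Hence the symbols $X_1, X_2, \ldots$ would have to be pairwise distinct.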
2437: 
2438: This shows that no probability function $p_{\myaut'}$ exists
2439: for any expansion $\myaut'$ of $\myaut$ such that
2440: $(\myaut',p_{\myaut'})$ assigns the same probabilities
2441: to strings over the alphabet as $(\mygram,p_{\mygram})$,
2442: and therefore LR parsing cannot be extended in the wide sense to
2443: become a probabilistic parsing strategy. With only minor changes
2444: to the proof, the same can be shown for ELR parsing.
2445: 
2446: \section{Prefix probabilities}
2447: \label{s:prefix}
2448: 
2449: In this section we show that the behaviour of PPDTs on input
2450: can be simulated by dynamic programming. 
2451: We also show how dynamic programming can be used for
2452: computing prefix probabilities.
2453: Prefix probabilities have important applications, e.g.\  
2454: in the area of speech recognition.
2455: 
2456: Our algorithm is a minor extension
2457: of an application of dynamic programming developed
2458: for non-probabilistic PDTs by~\cite{LA74,BI89}, and
2459: the treatment of probabilities is derived from~\cite{ST95}.
2460: 
2461: Assume a fixed PPDT $(\myaut,p_{\myaut})$ and a
2462: fixed input string $a_1 \cdots a_n$. Consider a
2463: computation of the form $c_1 \tau c_2$, where
2464: $(\Xinit, a_1 \cdots a_i, \epsilon)$ $\pdamovesname{c_1}$
2465: $(\alpha X, \epsilon, v_1)$,
2466: $\tau$ is of the form 
2467: $\myep{{\it X}}{{\it X Y'}}$, and
2468: $(Y', a_{i+1} \cdots a_j, \epsilon)$
2469: $\pdamovesname{c_2}$
2470: $(Y, \epsilon, v_2)$, for
2471: some stack symbols $X,Y',Y$,
2472: some input positions $i$ and $j$ ($0 \leq i \leq j \leq n$),
2473: and some output strings $v_1$ and $v_2$.
2474: In words, the computation 
2475: obtains top-of-stack $X$ after
2476: scanning of $a_i$ but before scanning of $a_{i+1}$,
2477: then applies a push transition, and then possibly
2478: further push, scan and pop transitions, which
2479: leads to $Y$ on top of $X$ after
2480: scanning of $a_j$ but before scanning of $a_{j+1}$.
2481: 
2482: We now abstract away from some details of such a computation 
2483: by just recording $X$, $Y$, $i$, $j$ and its probability 
2484: $p_1=p_{\myaut}(c_1\tau c_2)$.
2485: The probability $p_1$ is related to what is commonly called 
2486: a {\em forward\/} probability, 
2487: as it expresses the probability of the computation
2488: from the beginning onward.%
2489: \footnote{Forward probability as defined by \cite{ST95}
2490: refers to the sum of the probabilities of
2491: {\em all\/} computations from the
2492: beginning onward that lead to a certain rule occurrence, 
2493: whereas here we consider only one computation at a time.
2494: We will turn to forward probabilities later in this section.}
2495: The existence of the above computation is represented by an
2496: object that we will call a {\em table item\/},
2497: written as $p_1:\forward(X,Y,i,j)$.
2498: 
2499: Similarly, consider a subcomputation of the form
2500: $\tau c_2$, where as before
2501: $\tau$ is of the form 
2502: $\myep{{\it X}}{{\it X Y'}}$, and
2503: $(Y', a_{i+1} \cdots a_j, \epsilon)$
2504: $\pdamovesname{c_2}$
2505: $(Y, \epsilon, v_2)$, for
2506: some stack symbols $X,Y',Y$,
2507: some input positions $i$ and $j$ ($0 \leq i \leq j \leq n$),
2508: and some output string $v_2$.
2509: We express the existence of such a subcomputation
2510: by a different kind of table item, written as
2511: $p_2:\inner(X,Y,i,j)$, where
2512: $p_2=p_{\myaut}(\tau c_2)$. Here, $p_2$ is related to what is commonly
2513: called an {\em inner\/} probability, as
2514: it expresses only the probability internally in a
2515: subcomputation.%
2516: \footnote{We will turn to actual inner probabilities 
2517: later in this section.}
2518: 
2519: For technical reasons, we also need to consider
2520: computations $c$ where
2521: $(\Xinit, a_1 \cdots a_j, \epsilon)$ $\pdamovesname{c}$
2522: $(Y, \epsilon, v)$, for some $Y$, $j$ and $v$.
2523: These are represented by table items
2524: $p_1:\forward(\bot,Y,0,j)$,
2525: where $p_1=p_{\myaut}(c)$.
2526: The symbol $\bot$ can be seen as an imaginary stack symbol that 
2527: is located
2528: below the actual bottom-of-stack element.
2529: 
2530: All table items of the above forms, and only those table items,
2531: can be derived by the deduction system in 
2532: Figure~\ref{f:tabular}. Deduction systems for defining
2533: parsing algorithms have been described before by \cite{SH95};
2534: see also \cite{SI97,SI97a} for a very similar framework.
2535: A dynamic programming algorithm for such a deduction system
2536: incrementally fills a {\em parse table\/} with 
2537: table items, given a grammar and input.
2538: During execution of the algorithm,
items that are already
in the table are matched against antecedents of inference
rules. If a combination of items matches all
antecedents of an inference rule, then the item
that matches the consequent of that inference rule is
2544: added to the table. This process ends when no more
2545: new items can be added to the table.
2546: 
2547: The item in the consequent of inference rule~(\ref{e:init})
2548: represents the fact that 
2549: at the beginning of any computation, 
2550: $\Xinit$ lies on top of imaginary stack
2551: element $\bot$, no input has as yet been read, and
2552: the product of probabilities of all transitions used 
2553: in the represented computation is 1, since no transitions 
2554: have been used yet.
2555: 
2556: Inference rule~(\ref{e:pushfor}) derives a table item from
2557: an existing table item, if the second stack symbol of that 
2558: existing item indicates that a push transition can be applied.
2559: Naturally, the probability in the new item is the product
2560: of the probability in the old item and
2561: the probability of the applied transition.
2562: Inference rule~(\ref{e:scanfor}) is very similar.
2563: 
2564: Two subcomputations are combined through a 
2565: pop transition by inference rule~(\ref{e:popfor}),
2566: the intuition of which can be explained as follows. 
2567: If $W$ occurs as top-of-stack at position $i$ and 
2568: reading the input up to $j$ results in 
2569: $Y$ on top of $W$, and if subsequently reading the input from
2570: $j$ to $k$ results in $X$ on top of $Y$ and
2571: ${\it YX}$ may be replaced by $Z$ by a pop transition, then
2572: reading the input from $i$ to $k$ results in
2573: $Z$ on top of $W$.
2574: The probability of the newly derived subcomputation is the
2575: product of three probabilities. 
2576: The first is the probability of that subcomputation
2577: up to the point where $Y$ is top-of-stack,
2578: which is given by $p_1$; the second is the 
2579: probability from this point onward, up to the point where
2580: $X$ is top-of-stack,
2581: which is given by $p_2$;
2582: the third is the probability of the pop transition.
2583: The second of these
2584: probabilities, $p_2$, is defined by the inference rules for
2585: `inner' items to be discussed next.
2586: 
2587: Inference rule~(\ref{e:pushin}) starts the investigation of
2588: a new subcomputation that begins with a push transition.
2589: This rule does not have any antecedents, but we may 
2590: add an item $p_1:\forward(Z,X,i,j)$ as antecedent,
2591: since the resulting `inner' items can only be useful for
2592: the computation of `forward' items if at least
2593: one item of the form $p_1:\forward(Z,X,i,j)$
2594: exists. We will not do so
2595: however, since this would complicate the theoretical analysis.
2596: 
2597: The next two rules, (\ref{e:scanin}) and~(\ref{e:popin}), 
2598: are almost identical to (\ref{e:scanfor}) and~(\ref{e:popfor}).
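
To see the rules at work, consider a hypothetical PPDT (introduced here only for illustration) with stack symbols $\Xinit$, $U$, $U'$ and $\Xfinal$, and with three transitions $\tau_1 = \myep{\Xinit}{\Xinit U}$, $\tau_2 = \myscan{U}{a}{\epsilon}{U'}$ and $\tau_3 = \myep{\Xinit U'}{\Xfinal}$. For the input string $a$, so that $n=1$ and $a_1=a$, the deduction system in Figure~\ref{f:tabular} derives, among others, the following items:
$$
\begin{array}{l@{\quad}l}
1:\forward(\bot,\Xinit,0,0) & \mbox{by (\ref{e:init})} \\
p_{\myaut}(\tau_1):\forward(\Xinit,U,0,0) & \mbox{by (\ref{e:pushfor})} \\
p_{\myaut}(\tau_1):\inner(\Xinit,U,0,0) & \mbox{by (\ref{e:pushin})} \\
p_{\myaut}(\tau_1) \cdot p_{\myaut}(\tau_2):\inner(\Xinit,U',0,1) & \mbox{by (\ref{e:scanin})} \\
p_{\myaut}(\tau_1) \cdot p_{\myaut}(\tau_2) \cdot p_{\myaut}(\tau_3):\forward(\bot,\Xfinal,0,1) & \mbox{by (\ref{e:popfor})}
\end{array}
$$
The last item is obtained from the first and the fourth by instantiating rule~(\ref{e:popfor}) with $W=\bot$, $Y=\Xinit$, $X=U'$ and $Z=\Xfinal$. Its probability is that of the single complete computation $\tau_1\tau_2\tau_3$ on input $a$.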
2599: 
2600: \begin{figure}[t]
2601: Initialization: \\[-4ex]
2602: \tabruletwo{e:init}{
2603: }{
2604: 1:\forward(\bot,\Xinit,0,0)
2605: }
2606: 
2607: Push (forward): \\[-4ex]
2608: \tabrule{e:pushfor}{
2609: p_1:\forward(Z,X,i,j)
2610: }{
2611: p_1 \cdot p_{\myaut}(\tau):\forward(X,Y,j,j)
2612: }{
2613: \tau = \myep{X}{\it XY}
2614: }
2615: 
2616: Scan (forward): \\[-4ex]
2617: \tabrule{e:scanfor}{
2618: p_1:\forward(Z,X,i,j)
2619: }{
2620: p_1 \cdot p_{\myaut}(\tau):\forward(Z,Y,i,j')
2621: }{
2622: \tau = \myscan{X}{x}{y}{\it Y} \\
2623: (x = \epsilon \wedge j' = j)\ \vee  \\
2624: \ \ \ (x = a_{j+1} \wedge j' = j + 1)
2625: }
2626: 
2627: Pop (forward): \\[-4ex]
2628: \tabrule{e:popfor}{
2629: p_1:\forward(W,Y,i,j) \\
2630: p_2:\inner(Y,X,j,k)
2631: }{
2632: p_1 \cdot p_2\cdot p_{\myaut}(\tau):\forward(W,Z,i,k)
2633: }{
2634: \tau = \myep{{\it Y X}}{\it Z} 
2635: }
2636:  
2637: Push (inner): \\[-4ex]
2638: \tabrule{e:pushin}{
2639: % p_1:\forward(Z,X,i,j)
2640: }{
2641: p_{\myaut}(\tau):\inner(X,Y,j,j)
2642: }{
2643: \tau = \myep{X}{\it XY}
2644: }
2645: 
2646: Scan (inner): \\[-4ex]
2647: \tabrule{e:scanin}{
2648: p_2:\inner(Z,X,i,j)
2649: }{
2650: p_2 \cdot p_{\myaut}(\tau):\inner(Z,Y,i,j')
2651: }{
2652: \tau = \myscan{X}{x}{y}{\it Y} \\
2653: (x = \epsilon \wedge j' = j)\ \vee  \\
2654: \ \ \ (x = a_{j+1} \wedge j' = j + 1)
2655: }
2656: 
2657: Pop (inner): \\[-4ex]
2658: \tabrule{e:popin}{
2659: p_2:\inner(W,Y,i,j) \\
2660: p'_2:\inner(Y,X,j,k)
2661: }{
2662: p_2 \cdot p'_2\cdot p_{\myaut}(\tau):\inner(W,Z,i,k)
2663: }{
2664: \tau = \myep{{\it Y X}}{\it Z}
2665: }
2666: 
2667: \caption{Deduction system of table items.}
2668: \label{f:tabular}
2669: \end{figure}
2670: 
2671: It is not difficult to see that for each complete
2672: computation of the form
2673: $(\Xinit, a_1 \cdots a_n, \epsilon)$ $\pdamovesname{c}$
2674: $(\Xfinal, \epsilon, v)$, for some output string $v$,
2675: there is precisely one derivation by the deduction system
2676: of some table item 
2677: $p_1:\forward(\bot,\Xfinal,0,n)$, where $p_1=p_{\myaut}(c)$.
2678: Conversely, for each derivation of such a table item, there
2679: is a unique corresponding computation.
2680: Computations and derivations can be easily related to each other
2681: by looking at the transitions in the side conditions of the
2682: inference rules.
2683: 
It follows that if we take the sum of
2685: $p_1$ over all derivations of items
2686: $p_1:\forward(\bot,\Xfinal,0,n)$, then we obtain
2687: the probability assigned by $\myaut$ to the input
2688: $w=a_1 \cdots a_n$. 
2689: 
2690: Now assume that $\myaut$ is proper and consistent. 
2691: For a given string
2692: $w' \in \myterm_1^\ast$, where $\myterm_1$ is the input
2693: alphabet, we define the {\em prefix probability\/} of $w'$
2694: to be 
2695: $$ \sum_{w'' \in \myterm_1^\ast}\ p_{\myaut}(w' w'') $$
2696: In other words, we sum the probabilities of all strings
2697: $w=w'w''$ that start with prefix $w'$.
2698: We will now show that this probability can also be expressed
2699: in terms of the probabilities of `forward' items.
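
As a simple numerical illustration of this definition, take for the moment the string distribution of the PCFG from Section~\ref{s:wide} in place of $p_{\myaut}$ (an assumption made only for the sake of the example). The strings that start with prefix $w'=a$ are $a^n b$ and $a^n c$ with $n \geq 1$, so the prefix probability of $a$ is
$$
\sum_{n \geq 1} \frac{1}{2} \cdot \left(\frac{1}{3}\right)^{n} \cdot \frac{2}{3}
\ +\ 
\sum_{n \geq 1} \frac{1}{2} \cdot \left(\frac{2}{3}\right)^{n} \cdot \frac{1}{3}
\ =\ \frac{1}{6} + \frac{1}{3}\ =\ \frac{1}{2}.
$$
This equals $1 - p_{\mygram}(b) - p_{\mygram}(c) = 1 - \frac{1}{3} - \frac{1}{6}$, as expected, since the probabilities of all strings generated by that PCFG sum to 1.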
2700: 
2701: Assume that $w'= a_1 \cdots a_n$, for some $n \geq 0$.
2702: Any computation on a string $w=w'w''$ 
2703: that is the prefix of a complete computation 
2704: must be of one of two types.
2705: The first is
2706: $(\Xinit, a_1 \cdots a_n, \epsilon)$ $\pdamovesname{c}$
2707: $(\Xfinal, \epsilon, v)$, for some $v$, which means that
2708: $w''=\epsilon$, so that no input beyond position $n$ needs to be
2709: read.
2710: The second is
2711: $(\Xinit, a_1 \cdots a_n a_{n+1} \cdots a_m, \epsilon)$ $\pdamovesname{c_1}$
2712: $(\alpha X, a_{n+1} \cdots a_m, v_1)$ $\pdamove{\tau}$
2713: $(\alpha Y, a_{n+2} \cdots a_m, v_1y)$ $\pdamovesname{c_2}$
2714: $(\Xfinal, \epsilon, v_1yv_2)$,
2715: where $\tau$ is a scan transition
2716: $\myscan{X}{a}{y}{Y}$ such that $a=a_{n+1}$.
2717: 
2718: The sum of probabilities of computations of the first type equals the
2719: sum of $p_1$ over all derivations of items
2720: $p_1:\forward(\bot,\Xfinal,0,n)$, as we have explained above.
For the second type of computation, properness and
consistency
imply that for given $c_1$ and $\tau$ as above,
the sum of probabilities of different $c_2$ must be 1.
(If that sum, say $q$, is less than $1$, then 
the sum of the probabilities of all computations cannot be
more than $1 - (1-q) \cdot p_{\myaut}(c_1\tau) < 1$, which
is in conflict with the assumed consistency.)
2729: Furthermore, properness implies 
2730: that the sum of probabilities of different $\tau$ 
2731: that we can apply for top-of-stack $X$ must be 1.
2732: Therefore, we may conclude that
2733: the sum of probabilities of computations of the 
2734: second type equals the sum of $p_{\myaut}(c_1)$ over all computations
2735: $(\Xinit, a_1 \cdots a_n, \epsilon)$ $\pdamovesname{c_1}$
2736: $(\alpha X, \epsilon, v_1)$ such that there is at least
2737: one scan transition of the form $\myscan{X}{a}{y}{Y}$.
2738: This equals the sum of $p_1$ over all derivations of items
2739: $p_1:\forward(Z,X,0,n)$, for some $Z$, such that there is at least
2740: one scan transition of the form $\myscan{X}{a}{y}{Y}$.
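
Schematically, and assuming that the probability of a computation is the
product of the probabilities of its transitions, the argument for the
second type can be written out as
$$ \sum_{c_1} p_{\myaut}(c_1) \cdot \sum_{\tau} p_{\myaut}(\tau) \cdot
\sum_{c_2} p_{\myaut}(c_2)
\ =\ \sum_{c_1} p_{\myaut}(c_1) \cdot 1 \cdot 1
\ =\ \sum_{c_1} p_{\myaut}(c_1) , $$
where $c_1$ ranges over the computations that read exactly
$a_1 \cdots a_n$ and end with some $X$ on top of the stack,
$\tau$ ranges over the transitions applicable with that $X$ on top of
the stack, and $c_2$ ranges over the computations that complete
$c_1$ followed by $\tau$.
The sum over $\tau$ equals 1 by properness, and the sum over $c_2$
equals 1 by properness together with consistency.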
2741: 
We have thus shown how both the probability and the
2743: prefix probability of a string can be expressed in terms of 
2744: derivations of table items. However, the number of
2745: derivations of table items can be infinite. The obvious
2746: remedy lies in an alternative interpretation of the inference
2747: rules in Figure~\ref{f:tabular}, following \cite{GO99}:
2748: we regard objects of the form
2749: $\forward(X,Y,i,j)$ or $\inner(X,Y,i,j)$ as
2750: table items in their own right, and store each at most once
2751: in the parse table.
2752: The associated probabilities 
2753: are then no longer those for individual derivations, 
2754: but are the sums of probabilities
2755: over all derivations of table items
2756: $\forward(X,Y,i,j)$ or $\inner(X,Y,i,j)$.
2757: Such a sum of probabilities over all
2758: derivations of a table item is commonly
2759: called a {\em forward\/} or {\em inner\/} probability,
2760: respectively.
2761: 
2762: We will make this more concrete, under the assumption that
2763: there are no cyclic dependencies, i.e.,
2764: there is no item $\forward(X,Y,i,j)$ or $\inner(X,Y,i,j)$ that 
may occur as an ancestor of itself in some derivation.
2766: Let $T$ be the set of all items
2767: $\forward(X,Y,i,j)$ or $\inner(X,Y,i,j)$ that can be derived
2768: using the deduction system in Figure~\ref{f:tabular},
2769: ignoring the probabilities.
2770: We then define a function $p_{\tabel}$ from table items to
2771: probabilities, as shown in Figure~\ref{f:recursive}.
2772: We assume the function $\delta$ evaluates to 1 if its
2773: argument is true, and to 0 otherwise.
2774: 
2775: \begin{figure}[t]
2776: \begin{eqnarray}
2777: \label{e:forward}
2778: \lefteqn{p_{\tabel}(\forward(X,Y,i,j))\ =} \\
2779: && \delta(X = \bot \wedge Y = \Xinit \wedge 
2780:                 i = j = 0)\ + \nonumber\\
2781: && \delta(i=j) \cdot
2782: \sum_{
2783: Z, k,\tau: \atop
2784: { \forward(Z,X,k,i)\in T, \atop 
2785: \tau = \myep{X}{\it XY}}
2786: }
2787: \hspace{-3ex} 
2788: 	p_{\tabel}(\forward(Z,X,k,i)) \cdot p_{\myaut}(\tau)\ + \nonumber\\
2789: && \sum_{
2790: Z,j',x,y,\tau: \atop
2791: { \forward(X,Z,i,j')\in T, \atop
2792: { (x = \epsilon \wedge j' = j) \vee 
	(x = a_j \wedge j' = j - 1) , \atop
2794: \tau = \myscan{Z}{x}{y}{\it Y} }}
2795: }
2796: \hspace{-8ex} 
2797:         p_{\tabel}(\forward(X,Z,i,j')) \cdot p_{\myaut}(\tau)\ + \nonumber\\
2798: && \sum_{
2799: W,Z,k,\tau: \atop
2800: { \forward(X,W,i,k)\in T, \inner(W,Z,k,j)\in T, \atop
2801: \tau = \myep{\it WZ}{\it Y} }
2802: }
2803: \hspace{-11ex}
2804: 	p_{\tabel}(\forward(X,W,i,k)) \cdot 
2805: 	p_{\tabel}(\inner(W,Z,k,j)) \cdot p_{\myaut}(\tau) \nonumber
2806: \end{eqnarray}
2807: %
2808: \begin{eqnarray}
2809: \label{e:inner}
2810: \lefteqn{p_{\tabel}(\inner(X,Y,i,j))\ =} \\
2811: && \delta(i=j) \cdot
2812: \sum_{
2813: \tau: \atop
2814: \tau = \myep{X}{\it XY}
2815: }
2816: 	p_{\myaut}(\tau)\ + \nonumber\\
2817: && \sum_{
2818: Z,j',x,y,\tau: \atop
2819: { \inner(X,Z,i,j')\in T, \atop
2820: { (x = \epsilon \wedge j' = j) \vee
        (x = a_j \wedge j' = j - 1) , \atop
2822: \tau = \myscan{Z}{x}{y}{\it Y} }}
2823: }
2824: \hspace{-8ex}
2825:         p_{\tabel}(\inner(X,Z,i,j')) \cdot p_{\myaut}(\tau)\ + \nonumber\\
2826: && \sum_{
2827: W,Z,k,\tau: \atop
2828: { \inner(X,W,i,k)\in T, \inner(W,Z,k,j)\in T, \atop
2829: \tau = \myep{\it WZ}{\it Y} }
2830: }
2831: \hspace{-11ex}
2832:         p_{\tabel}(\inner(X,W,i,k)) \cdot 
2833:         p_{\tabel}(\inner(W,Z,k,j)) \cdot p_{\myaut}(\tau) \nonumber
2834: \end{eqnarray}
2835: \caption{Recursive functions to determine probabilities of
2836: table items.}
2837: \label{f:recursive}
2838: \end{figure}
2839: 
2840: Each line in the right-hand sides of the two equations 
2841: in Figure~\ref{f:recursive} can 
2842: be seen as the backward application
2843: of an inference rule from Figure~\ref{f:tabular}.
2844: In other words,
2845: for a given item, 
2846: we investigate 
2847: all possible ways of deriving that item as the 
2848: consequent of different inference rules with different antecedents.
For example, the second line in the right-hand side of 
equation~(\ref{e:forward})
can be seen as the backward application of inference rule (\ref{e:pushfor}).
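
To illustrate how the equations are evaluated, consider the item
$\forward(\bot,\Xinit,0,0)$, and assume (for the sake of this example only)
that no transition has $\Xinit$ in its right-hand side.
The three sums in equation~(\ref{e:forward}) are then empty, and only
the first line contributes:
$$ p_{\tabel}(\forward(\bot,\Xinit,0,0))\ =\ 
\delta(\bot = \bot \wedge \Xinit = \Xinit \wedge 0 = 0 = 0)\ =\ 1 . $$
If we assume in addition that $\tau = \myscan{\Xinit}{a_1}{y}{Y}$,
with $Y \neq \Xinit$, is the only transition
that has $Y$ in its right-hand side, then only the third line contributes
for the item $\forward(\bot,Y,0,1)$, with $Z = \Xinit$, $x = a_1$ and $j' = 0$:
$$ p_{\tabel}(\forward(\bot,Y,0,1))\ =\ 
p_{\tabel}(\forward(\bot,\Xinit,0,0)) \cdot p_{\myaut}(\tau)\ =\ 
p_{\myaut}(\tau) . $$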
2852: 
2853: That Figure~\ref{f:recursive} is indeed equivalent to
2854: Figure~\ref{f:tabular} follows from the fact that
2855: multiplication distributes over addition.
2856: If there are cyclic dependencies, then the set of equations
2857: in Figure~\ref{f:recursive} may no longer have a closed-form
2858: solution, but we may obtain probabilities by
2859: an iterative algorithm that approximates the lowest non-negative 
2860: solution to the equations \cite{ST95}.
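
As a toy illustration of why the lowest non-negative solution is the
relevant one (the equation below is chosen for the sake of the example
and is not derived from Figure~\ref{f:recursive}), consider the single
equation
$$ x\ =\ \frac{1}{3} + \frac{2}{3} \cdot x^2 , $$
which has the two non-negative solutions $x = 1/2$ and $x = 1$.
Iterating $x_{k+1} = 1/3 + (2/3) \cdot x_k^2$ from $x_0 = 0$ gives
$x_1 = 1/3$, $x_2 = 11/27$, and so on; the sequence is increasing,
never exceeds $1/2$, and converges to the lowest non-negative
solution $1/2$.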
2861: 
2862: Given the set of equations in Figure~\ref{f:recursive}
2863: we can now express the probability of a string of length $n$ as
2864: $p_{\tabel}(\forward(\bot,\Xfinal,0,n))$.
2865: The prefix probability of a string of length $n$ is given by:
2866: \begin{eqnarray}
&& p_{\tabel}(\forward(\bot,\Xfinal,0,n))\ + \nonumber \\
2868: && \sum_{ 
2869: X,Y,i: \atop
2870: { \forward(X,Y,i,n)\in T, \atop
2871: \exists \tau,a,y,Z[\tau = \myscan{Y}{a}{y}{\it Z}] }
2872: }
2873: \hspace{-3ex}
2874: p_{\tabel}(\forward(X,Y,i,n))
2875: \end{eqnarray}
2876: 
2877: To obtain a suitable PPDT from a given PCFG,
2878: we may apply
2879: the strategy $\stratepLC$
2880: from Section~\ref{s:strong}. Provided the (P)CFG is acyclic,
2881: this strategy ensures that there are no computations of
2882: infinite length for any given
2883: input, which implies there are no cyclic dependencies
2884: in the simulation of the automaton by the dynamic programming
2885: algorithm.
2886: 
We have thus presented a way to compute probabilities
and prefix probabilities of strings. Our approach is an alternative
to those of \cite{JE91,ST95}, and has the advantage that
it is parameterized by the parsing strategy:
2891: instead of $\stratepLC$
2892: we may apply any other parsing strategy with the same properties
2893: with regard to acyclic grammars.
2894: If our grammars are even more constrained,
2895: e.g.\ if they do not have epsilon rules, 
2896: we may apply even simpler parsing strategies.
2897: Different parsing strategies may differ in the efficiency
2898: of the computation.
2899: 
2900: \section{Conclusions}
2901: 
2902: We have formalized the notion of parsing strategy as a mapping from
2903: context-free grammars to push-down transducers, and have investigated 
2904: the extension to probabilities. 
We have shown that the question of which
strategies can be extended to become probabilistic depends
heavily on two properties, the correct-prefix property (CPP) and
the strong predictiveness property (SPP). 
2909: The CPP is a necessary condition for
2910: extending a strategy to become a probabilistic strategy.
2911: The CPP and SPP together form a sufficient condition.
2912: We have shown that there is at least one strategy 
2913: of practical interest with the CPP but
2914: without the SPP that cannot be extended to become a probabilistic
2915: strategy.
2916: Lastly, we have presented an application
2917: to prefix probabilities.
2918: 
2919: \section*{Acknowledgements}
2920: 
2921: We gratefully acknowledge correspondence with
2922: David McAllester,
2923: Giovanni Pighizzini,
2924: Detlef Prescher,
2925: Virach Sornlertlamvanich and
2926: Eric Villemonte de la Clergerie.
2927: 
2928: \bibliographystyle{plain}
2929: %\bibliography{refs}
2930: %\bibliography{/home/cl-home/nederhof/bib/refs}
2931: \bibliography{/home/markjan/bib/refs}
2932: 
2933: % \newpage
2934: 
2935: \end{document}
2936: 
2937: 
2938: