
1: \documentclass[11pt]{article}

2: % \pagestyle{empty}

3: % \renewcommand{\thepage}{}

4:

5: \usepackage{latexsym}

6:

7: \title{Probabilistic Parsing Strategies}

8: \author{

9: \begin{tabular}[t]{c}

10: Mark-Jan Nederhof%

11:   \,\thanks{

12: Supported by the Royal Netherlands

13: Academy of Arts and Sciences.

14: Secondary affiliation is the

15: German Research Center for Artificial Intelligence (DFKI).

16: } \\

17: Faculty of Arts \\

18: University of Groningen \\

19: P.O.\ Box 716 \\

20: NL-9700 AS Groningen, The Netherlands \\

21: {\tt markjan@let.rug.nl}

22: \end{tabular}

23: \and

24: % \hspace{6pt}

25: \begin{tabular}[t]{c}

26: Giorgio Satta \\

27: %Dip. di Elettronica e Informatica \\

28: Department of Information Engineering \\

29: %Universit\`{a} di Padova \\

30: University of Padua \\

31: via Gradenigo, 6/A \\

32: I-35131 Padova, Italy \\

33: {\tt satta@dei.unipd.it}

34: \end{tabular}

35: }

36:

37: % \keywords{parsing algorithms, probabilistic parsing, transduction,

38: % context-free grammars, push-down automata}

39:

40: \date{}

41:

42: \setlength{\textheight}{21.2cm}

43: \setlength{\textwidth}{13.5cm}

44:

45: \setcounter{topnumber}{3}

46: \setcounter{totalnumber}{3}

47: \setcounter{dbltopnumber}{3}

48: % \renewcommand{\textfraction}{.01}

49: \renewcommand{\textfraction}{.2}

50: \renewcommand{\topfraction}{.99}

51: \renewcommand{\dbltopfraction}{.99}

52: \sloppy

53:

54: \input{epsf}

55:

56: \newcommand{\comment}[1]{}

57:

58: \newcommand{\order}[1]{{\cal O}({#1})}

59:

60: \newcommand{\mygram}{{\cal G}}

61: \newcommand{\myaut}{{\cal A}}

62: \newcommand{\mystrat}{{\cal S}}

63:

64: \newcommand{\mypartial}{{\cal T}_{\myaut}}

65:

66: \newcommand{\myterm}{\mit\Sigma}

67: \newcommand{\mynontset}{\mit\Gamma}

68: \newcommand{\mynont}{N}

69: \newcommand{\myrule}{R}

70:

71: \newcommand{\bul}{\mathrel{\bullet}}

72:

73: \newcommand{\mysym}{Q}

74: \newcommand{\Xinit}{X_{\it init}}

75: \newcommand{\Xfinal}{X_{\it final}}

76: \newcommand{\mytrans}{\mit\Delta}

77:

78: \newcommand{\myep}[2]{{#1} \mapsto {#2}}

79: \newcommand{\myscan}[4]{{#1} \stackrel{#2,#3}{\mapsto} {#4}}

80: \newcommand{\myscanrec}[3]{{#1} \stackrel{#2}{\mapsto} {#3}}

81:

82: \newcommand{\pdamoverel}{\vdash}

83: \newcommand{\pdamove}[1]{\stackrel{#1}{\vdash}}

84: \newcommand{\pdamoves}{\vdash^\ast}

85: \newcommand{\pdamovesname}[1]{\stackrel{#1}{\vdash^\ast}}

86: \newcommand{\outp}{{\it out}}

87:

88: \newcommand{\pdagoto}{\leadsto}

89:

90: \newcommand{\de}{\rightarrow}

91:

92: \newcommand{\LC}{\angle}

93: \newcommand{\LCep}{\angle_{\epsilon}}

94: \newcommand{\LCstar}{\angle^\ast}

95: \newcommand{\LCepstar}{\angle_{\epsilon}^\ast}

96:

97: \newcommand{\fepLC}{f_{\epsilon\mbox{\scriptsize\it -LC}}}

98: \newcommand{\fepTD}{f_{\epsilon\mbox{\scriptsize\it -TD}}}

99: \newcommand{\stratepLC}{\mystrat_{\epsilon\mbox{\scriptsize\it -LC}}}

100:

101: \newtheorem{definition}{Definition}

102: \newtheorem{theorem}{Theorem}

103: \newtheorem{lemma}[theorem]{Lemma}

104: \newtheorem{prop}{Proposition}

105: \newcommand{\proof}{\noindent {\em Proof.\hspace{1em}}}

106: \newcommand{\closeproof}{\mbox{\hspace{1em}\rule{.45em}{.45em}}}

107:

108: \newcommand{\tabrule}[4]{

109: \begin{eqnarray}

110: \label{#1}

111:         \frac{ \begin{array}{c} #2 \end{array} }

112:                         { \begin{array}{c} #3 \end{array} }

113:    \left\{ \begin{array}{l} #4 \end{array} \right.

114: \end{eqnarray} }

115:

116: \newcommand{\tabruletwo}[3]{

117: \begin{eqnarray}

118: \label{#1}

119:         \frac{ \begin{array}{c}  #2 \end{array} }

120:                         { \begin{array}{c} #3 \end{array} }

121: \end{eqnarray} }

122:

123: \newcommand{\forward}{{\it forward\/}}

124: \newcommand{\inner}{{\it inner\/}}

125: \newcommand{\tabel}{{\it tab\/}}

126:

127: \begin{document}

128:

129: \maketitle

130:

131: \begin{abstract}

132: We present new results on the relation between

133: purely symbolic context-free

134: parsing strategies and their probabilistic counterparts.

135: Such parsing strategies are seen as constructions

136: of push-down devices from grammars.

137: We show that preservation of probability distribution is

138: possible under two conditions, viz.\

139: the correct-prefix property and the

140: property of strong predictiveness.

141: These results generalize existing results in the literature

142: that were obtained by considering parsing strategies in

143: isolation.  From our general results we also derive negative

144: results on so-called generalized LR parsing.

145: \end{abstract}

146:

147: \section{Introduction}

148: \label{s:intro}

149:

150: Context-free grammars and push-down automata are two

151: equivalent formalisms to describe context-free languages.

152: While a context-free grammar can be thought of as a purely

153: declarative specification, a push-down automaton is considered to be

154: an operational specification that determines which steps are

155: performed for a given string

156: in the process of deciding its membership of the language.

157: By a {\em parsing strategy\/} we mean a mapping from

158: context-free grammars to equivalent push-down automata,

159: such that some specific conditions are observed.

160:

161: This paper deals with the probabilistic extensions of

162: context-free grammars and push-down automata,

163: i.e., probabilistic context-free grammars \cite{SA72,BO73}

164: and probabilistic push-down automata \cite{SA72,SA76,TE95,AB99}.

165: These formalisms

166: are obtained by adding probabilities to the rules and transitions

167: of context-free grammars and push-down automata, respectively.

168: More specifically, we will investigate the problem of `extending'

169: parsing strategies to {\em probabilistic\/} parsing strategies.

170: These are mappings from probabilistic context-free grammars to

171: probabilistic push-down automata

172: that preserve the induced probability distributions

173: on the generated/accepted languages.

174: Two of the main results presented in this paper can be stated as follows:

175: \begin{itemize}

176: \item

177: No parsing strategy that lacks the

178: correct-prefix property (CPP) can be extended to become

179: a probabilistic parsing strategy.

180: \item

181: All parsing strategies that possess

182: the correct-prefix property and the

183: strong predictiveness property (SPP) can be extended

184: to become probabilistic parsing strategies.

185: \end{itemize}

186: The above results generalize previous findings

187: reported in~\cite{TE95,TE97,AB99}, where only a few specific

188: parsing strategies were considered in isolation.

189: Our findings also have important

190: implications for well-known parsing strategies such as

191: generalized LR parsing, henceforth simply called `LR parsing'.%

192: \footnote{Generalized (or nondeterministic)

193: LR parsing allows for more than one action

194: for a given LR state and input symbol.}

195: LR parsing has the CPP, but lacks the SPP, and as we

196: will show, LR parsing cannot be extended to become

197: a probabilistic parsing strategy.

198:

199: In the last decade, widespread interest

200: in probabilistic parsing techniques has arisen

201: in the area of natural language processing \cite{CH93a,MA99,JU00}.

202: This is motivated by the fact that natural language sentences are

203: generally ambiguous,

204: and natural language software needs to be able to

205: distinguish the more probable derivations

206: of a sentence from the less probable ones.

207: This can be achieved by letting the parsing process

208: assign a probability to each parse,

209: on the basis of a probabilistic grammar.

210: In a typical application, the software may

211: select those derivations for further processing

212: that have been given the highest

213: probabilities, and discard the others.

214: The success of this approach relies on the accuracy of

215: the probabilistic model expressed by the probabilistic

216: grammar, i.e., whether the probabilities assigned

217: to derivations accurately reflect

218: the `true' probabilities in the domain at hand.

219:

220: Probabilities are often estimated on the basis of a corpus,

221: i.e., a collection of sentences. The sentences in a corpus

222: may be annotated with various kinds of information.

223: One kind of annotation that is relevant for our discussion is

224: the preferred derivation for each sentence.

225: Given a corpus with derivations, one may

226: estimate probabilities of rules by their relative

227: frequencies in the corpus. If a corpus is unannotated,

228: more general techniques of maximum-likelihood estimation

229: can be used to estimate the probabilities of rules.

230: (See \cite{SA97,CH98,CH99} for some formal properties of types of

231: maximum-likelihood estimation.)

232:

233: The motivation for studying probabilistic models other than

234: those obtained by attaching probabilities to

235: given context-free grammars is the

236: observation that more accurate models can be obtained by

237: conditioning probabilities on `context information' beyond single

238: nonterminals \cite{CH90,CH94}. Furthermore, it

239: has been observed that conditioning on certain types of

240: context information can be achieved by first translating

241: context-free grammars to push-down automata,

242: according to some parsing strategy,

243: and then attaching probabilities to the transitions thereof

244: \cite{SO99,RO99}.

245: More concretely, for some parsing strategies,

246: the set of models that can be obtained by attaching

247: probabilities to a push-down automaton constructed from

248: a context-free grammar may include models that cannot be obtained by

249: attaching probabilities to that grammar.

250:

251: An implicit assumption of this methodology is that,

252: conversely, any probabilistic model that can be obtained from a grammar

253: can also be obtained from the associated push-down automaton,

254: or in other words, the push-down automaton is at least as

255: powerful as the grammar in terms of the set of potential models.

256: If a parsing strategy does not satisfy this property, and

257: if some potential models are lost in the mapping from

258: the grammar to the push-down automaton, then this means that

259: in some cases

260: the strategy may lead to less rather than more accurate models.

261: That LR parsing cannot be extended to become

262: a probabilistic parsing strategy, as we mentioned above,

263: means that the above property is not satisfied by this parsing strategy.

264: This is contrary to what is suggested by some

265: publications on probabilistic LR parsing, such as

266: \cite{BR93} and \cite{IN00}, which fail to observe that

267: LR parsers may sometimes lead to less accurate models

268: than the grammars from which they were constructed.

269:

270: Some studies, such as \cite{CO97,CH98a,CH01}, propose

271: lexicalized probabilistic context-free grammars, i.e.,

272: probabilistic models based on

273: context-free grammars in which probabilities

274: heavily rely on the terminal elements from input strings.

275: Even if the current paper does not specifically deal with

276: lexicalization, much of what we discuss pertains

277: to lexicalized probabilistic context-free grammars as well.

278:

279: The paper is organized as follows. After giving standard

280: definitions in Section~\ref{s:prel}, we give our formal

281: definition of `parsing strategy' in Section~\ref{s:strategy}.

282: We also define what it means to extend a parsing strategy to

283: become a probabilistic parsing strategy.

284: The CPP and the SPP are defined in Sections~\ref{s:cpp}

285: and~\ref{s:pred}, where we also discuss how these properties relate to

286: the question of which strategies can be extended to become

287: probabilistic.

288: Sections~\ref{s:strong}

289: and~\ref{s:nonstrong} provide examples of parsing strategies

290: with and without the SPP. The examples without the SPP,

291: most notably LR parsing, are

292: shown not to be extendible to become probabilistic.

293: A wider notion of extending a strategy to become probabilistic

294: is provided by Section~\ref{s:wide}. We show that

295: even under this wider notion,

296: LR parsing cannot be extended to become probabilistic.

297: Section~\ref{s:prefix} presents an application

298: that concerns prefix probabilities.

299: We end this paper with conclusions.

300:

301: Some results reported here have appeared before in an abbreviated form

302: in \cite{NE02d}.

303:

304: \section{Preliminaries}

305: \label{s:prel}

306:

307: A context-free grammar (CFG) $\mygram$ is a 4-tuple

308: $(\myterm,$ $\mynont,$ $S,$ $\myrule)$,

309: where $\myterm$ is a finite set of {\em terminals},

310: called the {\em alphabet},

311: $\mynont$ is a finite set of {\em nonterminals},

312: including the {\em start symbol\/} $S$, and $\myrule$ is a finite set of

313: {\em rules},

314: each of the form $A\de\alpha$, where $A\in \mynont$ and

315: $\alpha\in (\myterm \cup \mynont)^\ast$.

316: Without loss of generality, we assume that there is only one

317: rule $S \de \sigma$ with the start symbol in the left-hand side,

318: and furthermore that $\sigma \neq \epsilon$, where $\epsilon$

319: denotes the empty string.

320:

321: For a fixed CFG $\mygram$, we

322: define the relation $\Rightarrow$ on triples consisting of two strings

323: $\alpha,\beta\in (\myterm \cup \mynont)^\ast$ and a rule

324: $\pi\in\myrule$ by:

325: $\alpha \stackrel{\pi}{\Rightarrow} \beta$ if and only if

326: $\alpha$ is of the form $wA\delta$ and $\beta$ is of the

327: form $w\gamma\delta$, for some $w\in\myterm^\ast$ and

328: $\delta\in (\myterm \cup \mynont)^\ast$, and $\pi=(A\de\gamma)$.

329: A {\em left-most derivation\/}

330: is a string $d = \pi_1 \cdots \pi_m$, $m \geq 0$,

331: such that $S \stackrel{\pi_1}{\Rightarrow} \cdots

332: \stackrel{\pi_m}{\Rightarrow}\alpha$,

333: for some $\alpha \in (\myterm \cup \mynont)^\ast$.

334: We will identify a left-most derivation

335: with the sequence of strings over

336: $\myterm \cup \mynont$ that arise in that

337: derivation.

338: In the remainder of this paper, we will let

339: the term `derivation' refer to

340: `left-most derivation', unless specified otherwise.

341:

342: A derivation $d = \pi_1 \cdots \pi_m$, $m \geq 0$,

343: such that $S \stackrel{\pi_1}{\Rightarrow} \cdots

344: \stackrel{\pi_m}{\Rightarrow}w$ where $w\in\myterm^\ast$

345: will be called a {\em complete\/} derivation;

346: we also say that $d$ is a derivation of $w$.

347: By {\em subderivation\/} we mean a substring of a

348: complete derivation of the form

349: $d = \pi_1 \cdots \pi_m$, $m \geq 0$,

350: such that $A \stackrel{\pi_1}{\Rightarrow} \cdots

351: \stackrel{\pi_m}{\Rightarrow}w$ for some $A$ and $w$.
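
As a small illustration of these notions (the grammar below is a hypothetical example of ours, chosen only for this purpose), consider the CFG with $\myterm = \{a\}$, $\mynont = \{S, A\}$ and the three rules
\[
\pi_S = (S \de A), \qquad \pi_1 = (A \de a A), \qquad \pi_2 = (A \de a).
\]
Then $d = \pi_S \pi_1 \pi_2$ is a complete derivation of $aa$, since
\[
S \stackrel{\pi_S}{\Rightarrow} A \stackrel{\pi_1}{\Rightarrow} aA
\stackrel{\pi_2}{\Rightarrow} aa .
\]
The substrings $\pi_1 \pi_2$ and $\pi_2$ of $d$ are subderivations, witnessed by
$A \stackrel{\pi_1}{\Rightarrow} aA \stackrel{\pi_2}{\Rightarrow} aa$ and
$A \stackrel{\pi_2}{\Rightarrow} a$, respectively,
whereas $\pi_1$ alone is not, as it yields $aA \notin \myterm^\ast$.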

352:

353: We write $\alpha \Rightarrow^\ast \beta$ or

354: $\alpha \Rightarrow^+ \beta$ to denote the

355: existence of a string $\pi_1 \cdots \pi_m$ such that

356: $\alpha \stackrel{\pi_1}{\Rightarrow} \cdots

357: \stackrel{\pi_m}{\Rightarrow}\beta$,

358: with $m \geq 0$ or $m > 0$, respectively.

359: We say a CFG is {\em acyclic\/} if

360: $A \Rightarrow^+ A$ does not hold for any $A\in\mynont$.

361:

362: For a CFG $\mygram$ we define the language $L(\mygram)$

363: it generates as the set of strings $w$

364: such that there is at least one derivation of $w$.

365: We say a CFG is {\em reduced\/} if for each rule $\pi\in\myrule$

366: there is a complete derivation in which it occurs.

367:

368: A {\em probabilistic\/} context-free grammar (PCFG) is a pair

369: $(\mygram, p)$ consisting of a CFG

370: $\mygram=(\myterm,$ $\mynont,$ $S,$ $\myrule)$ and

371: a probability function $p$ from $\myrule$ to

372: real numbers in the interval $[0,1]$. We say a PCFG is {\em proper\/}

373: if $\Sigma_{\pi=(A\de\gamma)\in\myrule}\ p(\pi) = 1$ for

374: each $A\in\mynont$.

375:

376: For a PCFG $(\mygram,p)$,

377: we define

378: the probability $p(d)$ of a string

379: $d = \pi_1 \cdots \pi_m \in \myrule^\ast$

380: as $\prod_{i=1}^m\  p(\pi_i)$;

381: we will in particular consider the probabilities of

382: derivations $d$.

383: The probability $p(w)$ of a string $w\in \myterm^\ast$ as defined by $(\mygram,p)$

384: is the sum of the probabilities of

385: all derivations of that string.

386: We say a PCFG $(\mygram,p)$ is {\em consistent\/} if

387: $\Sigma_{w \in \myterm^\ast}\ p(w) = 1$.
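
To make these definitions concrete, consider a hypothetical PCFG (our own illustration) with rules
$\pi_S = (S \de A)$, $\pi_1 = (A \de a A)$ and $\pi_2 = (A \de a)$, and with
$p(\pi_S) = 1$ and $p(\pi_1) = p(\pi_2) = \frac{1}{2}$.
This PCFG is proper, since the probabilities of the rules for $S$ sum to $1$ and those of the rules for $A$ sum to $\frac{1}{2} + \frac{1}{2} = 1$.
The only derivation of $a^n$, $n \geq 1$, is $\pi_S \pi_1^{n-1} \pi_2$, so that
\[
p(a^n) = 1 \cdot \left(\frac{1}{2}\right)^{n-1} \cdot \frac{1}{2} = \left(\frac{1}{2}\right)^{n},
\qquad
\Sigma_{w \in \myterm^\ast}\ p(w) = \Sigma_{n \geq 1}\ \left(\frac{1}{2}\right)^{n} = 1,
\]
and the PCFG is consistent.
Properness alone does not guarantee consistency. If, again purely as an illustration, we replace $\pi_1$ by $A \de A A$ and take $p(A \de A A) = \frac{2}{3}$ and $p(A \de a) = \frac{1}{3}$, the PCFG is still proper. By the standard argument for branching processes, however, the total probability $x$ of all complete derivations is the least nonnegative solution of
\[
x = \frac{2}{3}\, x^2 + \frac{1}{3},
\]
that is, $x = \frac{1}{2}$, so this second PCFG is not consistent.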

388:

389: In this paper we will mainly consider push-down transducers

390: rather than push-down automata. Push-down transducers not

391: only compute derivations of the grammar while processing

392: an input string, but they also explicitly produce

393: output strings from which these derivations can be obtained.

394: We use transducers for two reasons.

395: First, constraints on the output strings allow

396: us to restrict our attention to `reasonable' parsing strategies.

397: Those strategies that cannot be formalized within these constraints

398: are unlikely to be of practical interest.

399: Secondly, mappings from input strings to derivations, such as those realized

400: by push-down devices, turn out to be a very powerful abstraction

401: and allow direct proofs of several general results.

402:

403: Unlike in many textbooks, our push-down devices do not

404: possess states next to stack symbols. This is without loss

405: of generality, since states can be encoded into the stack symbols,

406: given the types of transition that we allow.

407: Thus,

408: a push-down transducer (PDT) $\myaut$ is a 6-tuple

409: $(\myterm_1,$ $\myterm_2,$ $\mysym,$ $\Xinit,$ $\Xfinal,$ $\mytrans)$,

410: where $\myterm_1$ is the input alphabet,

411: $\myterm_2$ is the output alphabet,

412: $\mysym$ is a finite set of {\em stack symbols}

413: including the {\em initial stack symbol\/} $\Xinit$ and the

414: {\em final stack symbol\/} $\Xfinal$, and $\mytrans$ is the set of

415: {\em transitions}.

416: Each transition can have

417: one of the following three forms:

418: $\myep{X}{X Y}$ (a push transition),

419: $\myep{\it Y X}{Z}$ (a pop transition),  or

420: $\myscan{X}{x}{y}{Y}$ (a swap transition);

421: here $X$, $Y$, $Z\in \mysym$,

422: $x\in \myterm_1 \cup \{\epsilon\}$

423: and $y\in \myterm_2^\ast$.

424: Note that

425: in our notation, stacks grow from left to right, i.e., the top-most

426: stack symbol will be found at the right end.

427:

428: Without loss of generality, we assume that any PDT is such that

429: for a given stack symbol $X\neq \Xfinal$, there are either one or more

430: push transitions

431: $\myep{X}{X Y}$, or one or more pop transitions

432: $\myep{\it Y X}{Z}$, or one or more swap transitions

433: $\myscan{X}{x}{y}{Y}$, but no combinations of different types of

434: transition. If a PDT does not satisfy this normal form, it can

435: easily be brought into this form by introducing for each stack symbol

436: $X$ three new stack symbols $X_{\it push}$, $X_{\it pop}$

437: and $X_{\it swap}$ and new swap transitions

438: $\myscan{X}{\epsilon}{\epsilon}{X_{\it push}}$,

439: $\myscan{X}{\epsilon}{\epsilon}{X_{\it pop}}$ and

440: $\myscan{X}{\epsilon}{\epsilon}{X_{\it swap}}$.

441: In each existing transition that operates on top-of-stack $X$,

442: we then replace $X$ by one from $X_{\it push}$, $X_{\it pop}$

443: or $X_{\it swap}$, depending on the type of that transition.

444: We also assume that $\Xfinal$ does not occur in the left-hand side

445: of a transition, again without loss of generality.

446:

447: A {\em configuration\/} of a PDT is a triple

448: $(\alpha, w, v)$, where $\alpha \in \mysym^\ast$

449: is a stack, $w\in\myterm_1^\ast$ is the remaining input, and

450: $v\in\myterm_2^\ast$ is the output generated so far.

451: For a fixed PDT $\myaut$, we define

452: the relation $\pdamoverel$ on triples consisting of two

453: configurations and a transition $\tau$ by:

454: $(\gamma\alpha, xw, v) \pdamove{\tau} (\gamma\beta, w, vy)$ if and only if

455: $\tau$ is of the form

456: $\myep{\alpha}{\beta}$, where $x=y=\epsilon$, or of the form

457: $\myscan{\alpha}{x}{y}{\beta}$.

458: A {\em computation\/} on an input string $w$ is a string

459: $c=\tau_1 \cdots \tau_m$, $m \geq 0$, such that

460: $(\Xinit, w, \epsilon) \pdamove{\tau_1} \cdots \pdamove{\tau_m}

461: (\alpha, w', v)$.

462: A {\em complete\/} computation on a string $w$ is a computation

463: with $w'=\epsilon$ and $\alpha=\Xfinal$. The string $v$ is called

464: the {\em output\/} of the computation $c$,

465: and is denoted by $\outp(c)$.
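
As an illustration (the transducer below and the symbol names $Y$, $Y'$ and $\pi$ are assumptions made for this example only), let $\myterm_1 = \{a\}$ and $\myterm_2 = \{\pi\}$, and consider the PDT with stack symbols $\Xinit$, $Y$, $Y'$ and $\Xfinal$ and the three transitions
\[
\tau_1 = (\myep{\Xinit}{\Xinit Y}), \qquad
\tau_2 = (\myscan{Y}{a}{\pi}{Y'}), \qquad
\tau_3 = (\myep{\Xinit Y'}{\Xfinal}),
\]
which are a push, a swap and a pop transition, respectively.
The only complete computation on $a$ is $c = \tau_1 \tau_2 \tau_3$:
\begin{eqnarray*}
(\Xinit, a, \epsilon) & \pdamove{\tau_1} & (\Xinit Y, a, \epsilon) \\
 & \pdamove{\tau_2} & (\Xinit Y', \epsilon, \pi) \\
 & \pdamove{\tau_3} & (\Xfinal, \epsilon, \pi),
\end{eqnarray*}
so that $\outp(c) = \pi$. Each stack symbol other than $\Xfinal$ has transitions of a single type, as the normal form above requires.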

466:

467: We will identify a

468: computation with the sequence of configurations

469: that arise in that computation,

470: where the first configuration is determined by the context.

471: We also write

472: $(\alpha,w,v) \pdamoves (\beta,w',v')$ or

473: $(\alpha,w,v) \pdamovesname{c} (\beta,w',v')$,

474: for $\alpha,\beta \in \mysym^\ast$,

475: $w,w'\in \myterm_1^\ast$ and $v,v'\in\myterm_2^\ast$,

476: to indicate that $(\beta,w',v')$ can be obtained

477: from $(\alpha,w,v)$ by applying a sequence $c$ of zero or more

478: transitions; we refer to such a sequence $c$

479: as a {\em subcomputation}.

480: The function $\outp$ is

481: extended to subcomputations in a natural way.

482:

483: For a PDT $\myaut$, we define the language $L(\myaut)$ it

484: accepts as the set of strings $w$ such that there is

485: at least one complete computation on $w$.

486: We say a PDT is {\em reduced\/} if

487: each transition $\tau\in\mytrans$ occurs in some complete computation.

488:

489: A {\em probabilistic\/} push-down transducer (PPDT) is a pair

490: $(\myaut, p)$ consisting of a PDT $\myaut$ and

491: a probability function $p$ from the set $\mytrans$ of

492: transitions of $\myaut$ to

493: real numbers in the interval $[0,1]$.

494: We say a PPDT $(\myaut, p)$ is {\em proper\/} if

495: \begin{itemize}

496: \item

497: $\Sigma_{\tau=(\myep{X}{X Y})\in\mytrans}\ p(\tau) = 1$

498: for each $X\in\mysym$ such that there is at least one

499: transition $\myep{X}{X Y}$, $Y \in \mysym$;

500: \item

501: $\Sigma_{\tau=(\myscan{X}{x}{y}{Y})\in\mytrans}\ p(\tau) = 1$

502: for each $X\in\mysym$ such that there is at least one

503: transition $\myscan{X}{x}{y}{Y}$,

504: $x\in\myterm_1\cup\{\epsilon\},y\in\myterm_2^\ast,Y\in\mysym$;

505: and

506: \item

507: $\Sigma_{\tau=(\myep{Y X}{Z})\in\mytrans}\ p(\tau) = 1$

508: for each $X,Y\in\mysym$ such that there is at least one

509: transition $\myep{Y X}{Z}$, $Z\in\mysym$.

510: \end{itemize}

511:

512: For a PPDT $(\myaut,p)$,

513: we define the probability $p(c)$

514: of a (sub)computation $c=\tau_1 \cdots \tau_m$

515: as $\prod_{i=1}^m\  p(\tau_i)$.

516: The probability $p(w)$ of a string $w$ as defined by $(\myaut,p)$

517: is the sum of the probabilities of

518: all complete computations on that string.

519: We say a PPDT $(\myaut,p)$ is {\em consistent\/} if

520: $\Sigma_{w \in \myterm_1^\ast}\ p(w) = 1$.
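
For instance, consider a hypothetical PPDT (an illustration of ours) with input alphabet $\{a\}$, stack symbols $\Xinit$ and $\Xfinal$, and two swap transitions
\[
\tau_1 = (\myscan{\Xinit}{a}{\epsilon}{\Xinit}), \qquad
\tau_2 = (\myscan{\Xinit}{a}{\epsilon}{\Xfinal}),
\]
with $p(\tau_1) = p(\tau_2) = \frac{1}{2}$.
This PPDT is proper, since the probabilities of the swap transitions for $\Xinit$ sum to $1$.
The only complete computation on $a^n$, $n \geq 1$, is $\tau_1^{n-1} \tau_2$, so that
\[
p(a^n) = \left(\frac{1}{2}\right)^{n-1} \cdot \frac{1}{2} = \left(\frac{1}{2}\right)^{n},
\qquad
\Sigma_{n \geq 1}\ p(a^n) = 1,
\]
and the PPDT is consistent.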

521:

522: We say a PCFG $(\mygram,p)$ is reduced if $\mygram$ is reduced,

523: and we say a PPDT $(\myaut,p)$ is reduced if $\myaut$ is reduced.

524:

525: \section{Parsing strategies}

526: \label{s:strategy}

527:

528: The term `parsing strategy' is often used informally to

529: refer to a class of parsing algorithms that behave similarly

530: in some way. In this paper, we assign a formal

531: meaning to this term, relying on the

532: observation by \cite{LA74,BI89} that many

533: parsing algorithms for CFGs can be described in two steps.

534: The first is a construction of push-down devices

535: from CFGs, and the second is

536: a method for handling nondeterminism

537: (e.g.\ backtracking or dynamic programming).

538: Parsing algorithms that handle nondeterminism in

539: different ways but apply the same construction of

540: push-down devices from CFGs are seen as realizations of

541: the same parsing strategy.

542:

543: Thus, we define a {\em parsing strategy\/} to be a function

544: $\mystrat$ that maps

545: a reduced CFG $\mygram

546: =(\myterm_1,$ $\mynont,$ $S,$ $\myrule)$ to

547: a pair $\mystrat(\mygram)=(\myaut,f)$ consisting of a

548: reduced PDT $\myaut=(\myterm_1,$ $\myterm_2,$ $\mysym,$

549: $\Xinit,$ $\Xfinal,$ $\mytrans)$, and a function $f$ that maps a subset of

550: $\myterm_2^\ast$ to a subset of $\myrule^\ast$,

551: with the following properties:

552: \begin{itemize}

553: \item $\myrule \subseteq \myterm_2$.

554: \item For each string $w\in\myterm_1^\ast$ and each

555: complete computation $c$ on $w$,

556: $f(\outp(c))=d$ is a derivation of $w$.

557: Furthermore, each symbol from $\myrule$

558: occurs as often in $\outp(c)$ as it occurs in $d$.

559: \item Conversely, for each string $w\in\myterm_1^\ast$ and

560: each derivation $d$ of $w$,

561: there is precisely one complete computation $c$ on $w$ such that

562: $f(\outp(c)) = d$.

563: \end{itemize}

564: If $c$ is a complete computation, we will write

565: $f(c)$ to denote $f(\outp(c))$. The conditions

566: above then imply that $f$ is

567: a bijection from complete computations to complete derivations.

568:

569: Note that output strings

570: of (complete) computations may contain symbols that are not in $\myrule$,

571: and the symbols that are in $\myrule$ may occur in a different

572: order in $v$ than in $f(v)=d$. The purpose of the symbols

573: in $\myterm_2 - \myrule$ is to help this process of reordering

574: of symbols in $\myrule$.

575: For a string $v \in \myterm_2^\ast$ we let $\overline{v}$ refer

576: to the maximal subsequence of symbols from $v$ that belong to $\myrule$,

577: or in other words, string $\overline{v}$ is obtained by erasing

578: from $v$ all occurrences of symbols from $\myterm_2 - \myrule$.
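
For example, suppose (purely for illustration) that $\myrule = \{\pi_1, \pi_2\}$ and $\myterm_2 = \myrule \cup \{b\}$, where $b$ is a hypothetical auxiliary output symbol. For the output string $v = \pi_2\, b\, \pi_1\, b$ we then have
\[
\overline{v} = \pi_2 \pi_1 .
\]
A function $f$ might map this $v$ to the derivation $d = \pi_1 \pi_2$, in which each rule occurs as often as in $v$ but in a different order.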

579:

580: A {\em probabilistic parsing strategy\/} is defined to be a function

581: $\mystrat$ that maps a reduced, proper and consistent

582: PCFG $(\mygram, p_{\mygram})$

583: to a triple $\mystrat(\mygram,  p_{\mygram})=(\myaut, p_{\myaut}, f)$,

584: where $(\myaut, p_{\myaut})$ is a reduced, proper and consistent PPDT,

585: with the same properties as a

586: (non-probabilistic) parsing strategy, and in addition:

587: \begin{itemize}

588: \item

589: For each complete derivation $d$ and

590: each complete computation $c$ such that $f(c)=d$,

591: $p_{\mygram}(d)$ equals $p_{\myaut}(c)$.

592: \end{itemize}

593: In other words, a complete computation has the same probability

594: as the complete derivation that it is mapped to by

595: function $f$.

596: An implication of this property is that for each string $w\in\myterm_1^\ast$,

597: the probabilities assigned to that string

598: by $(\mygram, p_{\mygram})$ and $(\myaut,p_{\myaut})$ are equal.
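
The following sketch shows what such a mapping could return for one particular grammar. The construction is our own illustration and is not one of the strategies studied below. Let $(\mygram, p_{\mygram})$ have the rules
$\pi_S = (S \de A)$, $\pi_1 = (A \de a A)$ and $\pi_2 = (A \de a)$, with
$p_{\mygram}(\pi_S) = 1$ and $p_{\mygram}(\pi_1) = p_{\mygram}(\pi_2) = \frac{1}{2}$.
Let $\myaut$ have output alphabet $\myterm_2 = \myrule$, stack symbols $\Xinit$, $X$ and $\Xfinal$, and the swap transitions
\[
\tau_S = (\myscan{\Xinit}{\epsilon}{\pi_S}{X}), \qquad
\tau_1 = (\myscan{X}{a}{\pi_1}{X}), \qquad
\tau_2 = (\myscan{X}{a}{\pi_2}{\Xfinal}),
\]
with $p_{\myaut}(\tau_S) = 1$ and $p_{\myaut}(\tau_1) = p_{\myaut}(\tau_2) = \frac{1}{2}$, and let $f$ be the identity on output strings.
For each $n \geq 1$, the unique derivation $d_n = \pi_S \pi_1^{n-1} \pi_2$ of $a^n$ corresponds to the unique complete computation $c_n = \tau_S \tau_1^{n-1} \tau_2$ on $a^n$, with $\outp(c_n) = d_n$ and
\[
p_{\myaut}(c_n) = 1 \cdot \left(\frac{1}{2}\right)^{n-1} \cdot \frac{1}{2} = p_{\mygram}(d_n).
\]
Both $(\mygram, p_{\mygram})$ and $(\myaut, p_{\myaut})$ are reduced, proper and consistent, so all conditions of the definition hold for this grammar.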

599:

600: We say that probabilistic parsing strategy $\mystrat'$

601: is an {\em extension\/} of parsing strategy $\mystrat$ if

602: for each reduced CFG $\mygram$ and probability function $p_{\mygram}$

603: we have

604: $\mystrat(\mygram)=(\myaut, f)$ if and only if

605: $\mystrat'(\mygram, p_{\mygram})=(\myaut, p_{\myaut}, f)$

606: for some $p_{\myaut}$.

607:

608: In the following sections we will investigate which

609: parsing strategies can be extended to become

610: probabilistic parsing strategies.

611:

612: \section{Correct-prefix property}

613: \label{s:cpp}

614:

615: For a given PDT,

616: we say a computation $c$ is {\em dead\/} if

617: $(\Xinit, w_1, \epsilon)$ $\pdamovesname{c}$

618: $(\alpha, \epsilon, v_1)$, for some $\alpha\in\mysym^\ast$,

619: $w_1\in \myterm_1^\ast$ and $v_1\in\myterm_2^\ast$,

620: and there are no

621: $w_2\in \myterm_1^\ast$ and $v_2\in\myterm_2^\ast$ such that

622: $(\alpha, w_2, \epsilon) \pdamoves (\Xfinal, \epsilon, v_2)$.

623: Informally, a dead computation is a computation that

624: cannot be continued to become a complete computation.

625:

626: We say that a PDT has the {\em correct-prefix property\/} (CPP) if

627: it does not allow any dead computations.

628: We say that a parsing strategy has the CPP if it maps each

629: reduced CFG to a PDT that has the CPP.
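
To illustrate, consider a hypothetical PDT of our own (the stack symbol names $P$, $Q$, $Y$, $Y_a$ and $Y_b$ are assumptions of this example) with $\myterm_1 = \{a, b\}$, in which every transition produces empty output:
\begin{eqnarray*}
& \myscan{\Xinit}{a}{\epsilon}{P}, \qquad \myscan{\Xinit}{b}{\epsilon}{Q}, & \\
& \myep{P}{P Y}, \qquad \myep{Q}{Q Y}, & \\
& \myscan{Y}{a}{\epsilon}{Y_a}, \qquad \myscan{Y}{b}{\epsilon}{Y_b}, & \\
& \myep{P Y_a}{\Xfinal}, \qquad \myep{Q Y_b}{\Xfinal}. &
\end{eqnarray*}
This PDT is reduced and accepts the language $\{aa, bb\}$. On input $ab$ it allows the computation
\[
(\Xinit, ab, \epsilon) \pdamoverel (P, b, \epsilon) \pdamoverel (P Y, b, \epsilon)
\pdamoverel (P Y_b, \epsilon, \epsilon),
\]
after which no transition applies, since the only pop transition with $Y_b$ on top of the stack requires $Q$ below it. The computation is therefore dead, and the PDT lacks the CPP: it has read all of $ab$, which is not a prefix of any string in the language.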

630:

631: In this section we show that the correct-prefix property is

632: a necessary condition

633: for extending a parsing strategy to a probabilistic parsing strategy.

634: For this we need two lemmas.

635:

636: \begin{lemma}

637: \label{l:pcfg}

638: For each reduced CFG $\mygram$, there is a probability function

639: $p_{\mygram}$ such that

640: PCFG $(\mygram,p_{\mygram})$ is proper and consistent,

641: and $p_{\mygram}(d) > 0$ for all complete derivations~$d$.

642: \end{lemma}

643:

644: \proof

645: Since $\mygram$ is reduced, there is a finite set $L$ consisting of

646: complete derivations $d$, such that for each rule $\pi$ in $\mygram$

647: there is at least

648: one $d\in L$ in which $\pi$ occurs.

649: Let $n_{\pi,d}$ be the number of occurrences of rule $\pi$ in

650: derivation $d\in L$, and let $n_{\pi}$ be

651: $\Sigma_{d\in L}\ n_{\pi,d}$,

652: the total number of occurrences of $\pi$ in $L$.

653: Let $n_A$ be the sum of $n_{\pi}$ for all rules

654: $\pi$ with $A$ in the left-hand side. A probability function

655: $p_{\mygram}$ can be defined through

656: `maximum-likelihood estimation' such that

657: $p_{\mygram}(\pi) = \frac{n_{\pi}}{n_A}$ for each rule

658: $\pi = A \de \alpha$.

659:

660: For all nonterminals $A$,

661: $\Sigma_{\pi = A \de \alpha}\ p_{\mygram}(\pi)$ $=$

662: $\Sigma_{\pi = A \de \alpha}\ \frac{n_{\pi}}{n_A}$ $=$ $\frac{n_A}{n_A}$ $=$ 1,

663: which means that the PCFG $(\mygram,p_{\mygram})$ is proper.

664: Furthermore, \cite{CH98} has shown that a PCFG $(\mygram,p_{\mygram})$

665: is consistent if

666: $p_{\mygram}$ was obtained by maximum-likelihood estimation using

667: a set of derivations.

668: Finally, since $n_{\pi} > 0$ for each $\pi$, also

669: $p_{\mygram}(\pi) > 0$ for each $\pi$, and

670: $p_{\mygram}(d) > 0$ for all complete derivations $d$.~\closeproof
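
As a worked instance of this construction (the grammar is a hypothetical example of ours), take the rules
$\pi_S = (S \de A)$, $\pi_1 = (A \de a A)$ and $\pi_2 = (A \de a)$, and let
$L = \{ \pi_S \pi_1 \pi_2 \}$ consist of the single complete derivation of $aa$.
Every rule occurs in it, and we obtain
\[
n_{\pi_S} = n_{\pi_1} = n_{\pi_2} = 1, \qquad n_S = 1, \qquad n_A = 2,
\]
\[
p_{\mygram}(\pi_S) = \frac{1}{1} = 1, \qquad
p_{\mygram}(\pi_1) = p_{\mygram}(\pi_2) = \frac{1}{2}.
\]
The resulting PCFG is proper, and each complete derivation $\pi_S \pi_1^{n-1} \pi_2$, $n \geq 1$, has probability $(\frac{1}{2})^{n} > 0$. These probabilities sum to $1$ over all $n \geq 1$, in agreement with consistency.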

671:

672: We say a computation is a {\em shortest\/} dead computation if

673: it is dead and none of its proper prefixes is dead.

674: Note that each dead computation has a unique prefix that is a

675: shortest dead computation.

676: For a PDT $\myaut$, let $\mypartial$ be the union of the set of

677: all complete computations and the set of all

678: shortest dead computations.

679:

680: \begin{lemma}

681: \label{l:partial}

682: For each proper PPDT $(\myaut, p_{\myaut})$,

683: $\Sigma_{c \in \mypartial}\  p_{\myaut}(c) \leq 1$.

684: \end{lemma}

685:

686: \proof

687: The proof is a trivial variant of the proof

688: that for a proper PCFG $(\mygram, p_{\mygram})$,

689: the sum of $p_{\mygram}(d)$ for all derivations $d$ cannot

690: exceed 1, which is shown by \cite{BO73}.~\closeproof

691:

692: {}From this, the main result of this section follows.

693:

694: \begin{theorem}

695: \label{t:cpp}

696: A parsing strategy that lacks the CPP cannot be extended to

697: become a probabilistic parsing strategy.

698: \end{theorem}

699:

700: \proof

701: Take a parsing strategy $\mystrat$ that does not have the CPP.

702: Then there is a reduced

703: CFG $\mygram= (\myterm_1,$ $\mynont,$ $S,$ $\myrule)$,

704: with $\mystrat(\mygram) = (\myaut,f)$ for some

705: $\myaut$ and $f$, and a shortest dead computation $c$ allowed

706: by $\myaut$.

707:

708: It follows from Lemma~\ref{l:pcfg} that there is a probability function

709: $p_{\mygram}$ such that

710: $(\mygram, p_{\mygram})$ is a proper and consistent PCFG and

711: $p_{\mygram}(d) > 0$ for all complete derivations $d$.

712: Assume we also have a probability function $p_{\myaut}$ such that

713: $(\myaut, p_{\myaut})$ is a proper and consistent PPDT

714: that assigns the same probabilities to strings over $\myterm_1$ as

715: $(\mygram, p_{\mygram})$. Since $\myaut$ is reduced, each

716: transition $\tau$ must occur in some complete computation $c'$. Furthermore,

717: for each complete computation $c'$ there is a complete derivation $d$

718: such that $f(c') = d$, and $p_{\myaut}(c') = p_{\mygram}(d) > 0$.

719: Therefore, $p_{\myaut}(\tau) > 0$ for each

720: transition $\tau$, and $p_{\myaut}(c) > 0$,

721: where $c$ is the above-mentioned dead computation.

722:

723: Due to Lemma~\ref{l:partial},

724: $1 \geq \Sigma_{c' \in \mypartial}\  p_{\myaut}(c') \geq

725: \Sigma_{w \in \myterm_1^\ast}\ p_{\myaut}(w) + p_{\myaut}(c) >

726: \Sigma_{w \in \myterm_1^\ast}\ p_{\myaut}(w) =

727: \Sigma_{w \in \myterm_1^\ast}\ p_{\mygram}(w)$.

728: This is in contradiction with the consistency of $(\mygram, p_{\mygram})$.

729: Hence, a probability function $p_{\myaut}$ with the properties we

730: required above cannot exist, and therefore $\mystrat$ cannot be extended

731: to become

732: a probabilistic parsing strategy.~\closeproof

733:

734: \section{Strong predictiveness}

735: \label{s:pred}

736:

737: For a fixed PDT, we define the binary relation

738: $\pdagoto$ on stack symbols by:

739: $Y\pdagoto Y'$ if and only if

740: $(Y, w, \epsilon) \pdamoves (Y',\epsilon, v)$ for some

741: $w \in \myterm_1^\ast$ and $v\in \myterm_2^\ast$.

742: In other words,

743: some subcomputation may start with stack $Y$ and

744: end with stack $Y'$. Note that all

745: stacks that occur in such a subcomputation

746: must have height~1 or more.

747:

748: We say that a PDT has the {\em strong predictiveness property\/}

749: (SPP) if the existence of

750: three transitions $\myep{X}{X Y}$, $\myep{X Y_1}{Z_1}$ and

751: $\myep{X Y_2}{Z_2}$ such that $Y\pdagoto Y_1$ and

752: $Y\pdagoto Y_2$ implies $Z_1 = Z_2$.

753: Informally, this means that

754: when a subcomputation starts with

755: some stack $\alpha$ and some push transition $\tau$,

756: then solely on the basis of $\tau$

757: we can uniquely determine

758: what stack symbol $Z_1 = Z_2$ will be on top of the stack in

759: the first configuration

760: with stack height equal to $|\alpha|$.

761: Another way of looking at

762: it is that no information may flow from higher stack elements

763: to lower stack elements that

764: was not already predicted before these higher stack elements

765: came into being, hence the term `strong predictiveness'.%

766: \footnote{There is a property of push-down devices called

767: {\em faiblement pr{\'e}dictif\/} (weakly predictive) \cite{VI93a}.

768: Contrary to what this name may suggest however, this property

769: is incomparable with the complement of our notion of SPP.}

770:

771: We say that a parsing strategy has the SPP if it maps each

772: reduced CFG to a PDT with the SPP.
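
As a hedged illustration (the following fragment is hypothetical and is not
produced by any strategy discussed in this paper), suppose a PDT contains
the push transition $\myep{X}{X Y}$,
the swap transitions $\myscan{Y}{a}{\epsilon}{Y_1}$ and
$\myscan{Y}{b}{\epsilon}{Y_2}$,
and the pop transitions $\myep{X Y_1}{Z_1}$ and $\myep{X Y_2}{Z_2}$
with $Z_1 \neq Z_2$.
Then $Y \pdagoto Y_1$ and $Y \pdagoto Y_2$, so the SPP is violated:
the symbol that replaces $X$ after the pop depends on whether $a$ or $b$
was read while $Y$ was on top, which the push transition alone does not reveal.
If the single push transition is replaced by two push transitions
$\myep{X}{X Y^{a}}$ and $\myep{X}{X Y^{b}}$, with swap transitions
$\myscan{Y^{a}}{a}{\epsilon}{Y_1}$ and $\myscan{Y^{b}}{b}{\epsilon}{Y_2}$
(where $Y^{a}$ and $Y^{b}$ are assumed to be fresh stack symbols),
then $Y^{a} \pdagoto Y_1$ but not $Y^{a} \pdagoto Y_2$, and likewise
$Y^{b} \pdagoto Y_2$ but not $Y^{b} \pdagoto Y_1$,
so that this fragment satisfies the SPP.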

773:

774: In the previous section it was shown that we may restrict ourselves

775: to parsing strategies that have the CPP. Here we show that

776: if, in addition, a parsing strategy has the SPP, then it can

777: always be extended to become a probabilistic parsing strategy.

778:

779: \begin{theorem}

780: \label{t:sp}

781: Any parsing strategy that has the CPP and the SPP

782: can be extended to become a probabilistic parsing strategy.

783: \end{theorem}

784:

785: \proof

786: Take a parsing strategy $\mystrat$ that

787: has the CPP and the SPP,

788: and take a reduced PCFG $(\mygram,p_{\mygram})$,

789: where $\mygram = (\myterm_1,$ $\mynont,$  $S,$ $\myrule)$,

790: and let $\mystrat(\mygram) = (\myaut,f)$, for some PDT $\myaut$ and

791: function $f$.

792: We will show that there is a probability function

793: $p_{\myaut}$ such that $(\myaut, p_{\myaut})$ is a PPDT and

794: $p_{\myaut}(c) = p_{\mygram}(f(c))$ for all complete computations $c$.

795:

796: For each stack symbol $X$, consider the set of transitions that

797: are applicable with top-of-stack $X$. Remember that our normal form

798: ensures that all such transitions are of the same type.

799: Suppose this set consists of $m$ swap transitions

800: ${\tau_i} = \myscan{X}{x_i}{y_i}{Y_i}$, $1 \leq i \leq m$.

801: For each $i$,

802: consider all subcomputations of the form

803: $({\it X}, x_iw, \epsilon)$ $\pdamove{\tau_i}$ $({\it Y_i}, w, y_i)$

804: $\pdamoves$

805: $({\it Y'}, \epsilon, v)$ such that there is at least one

806: pop transition of the form $\myep{{\it Z Y'}}{Z'}$ or

807: such that $Y' = \Xfinal$,

808: and define $L_{\tau_i}$ as the set of strings $v$

809: output by these subcomputations.

810: We also define $L_X = \cup_{j=1}^m\  L_{\tau_j}$, the set

811: of all strings output by subcomputations starting with top-of-stack

812: $X$, and ending just before

813: a pop transition that leads to a stack with height smaller than that

814: of the stack at the beginning, or ending with the final stack symbol $\Xfinal$.

815:

816: Now define for each $i$ ($1 \leq i \leq m$):

817: \begin{eqnarray}

818: \label{e:normalized}

819: p_{\myaut}(\tau_i) &=&

820: \frac{  \Sigma_{v\in L_{\tau_i}}\ p_{\mygram}(\overline{v}) }{

821:         \Sigma_{v\in L_{X}}\ p_{\mygram}(\overline{v}) }

822: \end{eqnarray}

823: In other words, the probability of a transition is the normalized

824: probability of the set of subcomputations starting with that transition,

825: relating subcomputations with fragments of derivations of the PCFG.

826:

827: These probabilities are well-defined. Since $\myaut$ is reduced and has

828: the CPP, the sets $L_{\tau_i}$ are non-empty and thereby

829: the denominator in the definition of $p_{\myaut}(\tau_i)$

830: is non-zero. Furthermore,

831: $\Sigma_{i=1}^m\ p_{\myaut}(\tau_i)$ is clearly $1$.
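
To make the normalization in~(\ref{e:normalized}) concrete, consider a purely
illustrative case with $m = 2$, in which we assume that $L_{\tau_1}$ and
$L_{\tau_2}$ are disjoint and that the two numerators happen to be
$\Sigma_{v\in L_{\tau_1}}\ p_{\mygram}(\overline{v}) = 0.3$ and
$\Sigma_{v\in L_{\tau_2}}\ p_{\mygram}(\overline{v}) = 0.1$
(these numbers are invented for the example).
The denominator is then $0.3 + 0.1 = 0.4$, and
\begin{eqnarray*}
p_{\myaut}(\tau_1) &=& \frac{0.3}{0.4} \;=\; 0.75 \\
p_{\myaut}(\tau_2) &=& \frac{0.1}{0.4} \;=\; 0.25
\end{eqnarray*}
The numerators themselves need not sum to $1$; it is the division by the
common denominator that makes the two transition probabilities sum to $1$.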

832:

833: Now suppose the set of transitions for $X$ consists of $m$

834: push transitions

835: ${\tau_i} = \myep{X}{X Y_i}$, $1 \leq i \leq m$.

836: For each $i$,

837: consider all subcomputations of the form

838: $({\it X}, w, \epsilon)$ $\pdamove{\tau_i}$ $({\it XY_i}, w, \epsilon)$

839: $\pdamoves$

840: $({\it X'}, \epsilon, v)$ such that there is at least one

841: pop transition of the form $\myep{{\it Z X'}}{Z'}$ or $X' = \Xfinal$,

842: and define $L_{\tau_i}$, $L_X$ and $p_{\myaut}(\tau_i)$ as we have done

843: above for the swap transitions.

844:

845: Suppose the set of transitions for $X$ consists of $m$

846: pop transitions

847: ${\tau_i} = \myep{\it Y_iX}{Z_i}$, $1 \leq i \leq m$.

848: Define

849: $L_X = \{\epsilon\}$, and $p_{\myaut}(\tau_i)=1$ for each $i$.

850: To see that this is compatible with the condition of properness

851: of PPDTs,

852: note the following.

853: Since we may assume $\myaut$ is reduced,

854: if $Y_i = Y_j$ for some $i$ and $j$ with $1 \leq i,j \leq m$,

855: then there is at least one

856: transition $\myep{Y_i}{\it Y_i X'}$ for some $X'$ such that

857: $X'\pdagoto X$. Due to the SPP, $Z_i = Z_j$ and therefore $i=j$.

858:

859: Finally, we define $L_{\Xfinal} = \{\epsilon\}$.

860:

861: Take a subcomputation $({\it X}, w, \epsilon)$

862: $\pdamovesname{c}$

863: $({\it Y}, \epsilon, v)$

864: such that there is at least one

865: pop transition of the form $\myep{{\it Z Y}}{Y'}$ or $Y = \Xfinal$.

866: Below we will prove that:

867: \begin{eqnarray}

868: \label{e:partialp}

869: p_{\myaut}(c) &=&

870: \frac{ p_{\mygram}(\overline{v}) }{

871: 	 \Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) }

872: \end{eqnarray}

873: Since a complete computation $c$ with output $v$ is of this form,

874: with $X = \Xinit$ and $Y= \Xfinal$, we

875: obtain the result we required to prove Theorem~\ref{t:sp},

876: where $D$ denotes the set of all

877: complete derivations of CFG $\mygram$:

878: \begin{eqnarray}

879: p_{\myaut}(c) &=&

880: \frac{ p_{\mygram}(\overline{v}) }{

881: 	\Sigma_{v'\in L_{\Xinit}}\ p_{\mygram}(\overline{v'}) } \\

882: &=& \frac{ p_{\mygram}(f(c)) }{

883:         \Sigma_{v'\in L_{\Xinit}}\ p_{\mygram}(f(v')) } \\

884: &=& \frac{ p_{\mygram}(f(c)) }{

885:         \Sigma_{d\in D}\ p_{\mygram}(d) } \\

886: &=&

887: p_{\mygram}(f(c))

888: \end{eqnarray}

889: We have used two properties of $f$ here. The first is that it

890: preserves the frequencies of symbols from $\myrule$, if considered as a

891: mapping from output strings to derivations.

892: The second property is that it can be considered as a bijection from

893: complete computations to derivations. Lastly we have used

894: consistency of PCFG $(\mygram,  p_{\mygram})$, meaning that

895: $\Sigma_{d\in D}\ p_{\mygram}(d) = 1$.

896:

897: For the proof of~(\ref{e:partialp}), we proceed by induction

898: on the length of $c$ and distinguish three cases.

899:

900: Case~1: Consider a subcomputation $c$ consisting of zero transitions,

901: which naturally has output $v=\epsilon$,

902: with only configuration

903: $({\it X}, \epsilon, \epsilon)$, where there is at least one

904: pop transition of the form $\myep{{\it Z X}}{Z'}$ or $X = \Xfinal$.

905: We trivially have $p_{\myaut}(c)$ $=$ $1$

906: and

907: $\frac{ p_{\mygram}(\overline{v}) }{

908:          \Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'})}$ $=$

909: $\frac{ p_{\mygram}(\epsilon) }{

910:          \Sigma_{v'\in \{\epsilon\}}\ p_{\mygram}(\overline{v'})}$ $=$ $1$.

911:

912: Case~2: Consider a subcomputation

913: $c=\tau_i c'$, where $({\it X}, x_iw, \epsilon)$

914: $\pdamove{\tau_i}$ $({\it Y_i}, w, y_i)$

915: $\pdamovesname{c'}$

916: $({\it Y'}, \epsilon, y_i v)$, such that there is at least one

917: pop transition of the form $\myep{{\it Z Y'}}{Z'}$ or $Y' = \Xfinal$.

918: The induction hypothesis states that:

919: \begin{eqnarray}

920: p_{\myaut}(c') &=&

921: \frac{ p_{\mygram}(\overline{v}) }{

922:          \Sigma_{v'\in L_{Y_i}}\ p_{\mygram}(\overline{v'}) }

923: \end{eqnarray}

924: If we combine this with the definition of $p_{\myaut}$, we obtain:

925: \begin{eqnarray}

926: p_{\myaut}(c) &=& p_{\myaut}(\tau_i) \cdot  p_{\myaut}(c') \\

927: &=&

928: \frac{  \Sigma_{v'\in L_{\tau_i}}\ p_{\mygram}(\overline{v'}) }{

929:         \Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) } \cdot

930: \frac{ p_{\mygram}(\overline{v}) }{

931:          \Sigma_{v'\in L_{Y_i}}\ p_{\mygram}(\overline{v'}) } \\

932: &=&

933: \frac{   p_{\mygram}(\overline{y_i}) \cdot \Sigma_{v'\in L_{Y_i}}\

934: 			p_{\mygram}(\overline{v'}) }{

935:         \Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) } \cdot

936: \frac{ p_{\mygram}(\overline{v}) }{

937:          \Sigma_{v'\in L_{Y_i}}\ p_{\mygram}(\overline{v'}) } \\

938: &=&

939: \frac{	p_{\mygram}(\overline{y_i}) \cdot p_{\mygram}(\overline{v}) }{

940: 	 \Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) }  \\

941: &=&

942: \frac{  p_{\mygram}(\overline{y_i v}) }{

943: 	\Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) }

944: \end{eqnarray}

945:

946: Case~3:  Consider a subcomputation

947: $c$ of the form $({\it X}, w, \epsilon)$ $\pdamove{\tau_i}$

948: $({\it XY_i}, w, \epsilon)$

949: $\pdamoves$

950: $({\it X''}, \epsilon, v)$ such that there is at least one

951: pop transition of the form $\myep{{\it Z X''}}{Z'}$ or $X'' = \Xfinal$.

952: Subcomputation $c$ can be decomposed in a unique way as

953: $c=\tau_i c' \tau c''$,

954: consisting of an application of a push transition

955: $\tau_i = \myep{X}{X Y_i}$,

956: a subcomputation

957: $({\it Y_i}, w_1, \epsilon)$

958: $\pdamovesname{c'}$

959: $({\it Y'}, \epsilon, v_1)$,

960: an application of a pop transition

961: $\tau = \myep{XY'}{X_i'}$,

962: and a subcomputation

963: $({\it X_i'}, w_2, \epsilon)$

964: $\pdamovesname{c''}$

965: $({\it X''}, \epsilon, v_2)$,

966: where $w=w_1w_2$ and $v=v_1 v_2$.

967: This is visualized in Figure~\ref{fig:stack}.

968: \begin{figure}

969: %Mag 100

970: \begin{center}

971: \epsfbox{stack.eps}

972: \end{center}

973: \caption{Development of the stack in the computation

974: $c=\tau_i c' \tau c''$.}

975: \label{fig:stack}

976: \end{figure}

977:

978: We can now use the induction hypothesis twice, resulting in:

979: \begin{eqnarray}

980: p_{\myaut}(c') &=&

981: \frac{ p_{\mygram}(\overline{v_1}) }{

982:          \Sigma_{v'_1\in L_{Y_i}}\ p_{\mygram}(\overline{v'_1}) }

983: \end{eqnarray}

984: and

985: \begin{eqnarray}

986: p_{\myaut}(c'') &=&

987: \frac{ p_{\mygram}(\overline{v_2}) }{

988:          \Sigma_{v'_2\in L_{X_i'}}\ p_{\mygram}(\overline{v'_2}) }

989: \end{eqnarray}

990:

991: If we combine this with the definition of $p_{\myaut}$,

992: we obtain:

993: \begin{eqnarray}

994: p_{\myaut}(c) &=& p_{\myaut}(\tau_i) \cdot p_{\myaut}(c')

995: 		\cdot p_{\myaut}(\tau)

996: 		\cdot p_{\myaut}(c'') \\

997: &=&

998: \frac{	\Sigma_{v'\in L_{\tau_i}}\ p_{\mygram}(\overline{v'}) }{

999:          \Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) }

1000: \cdot

1001: \frac{  p_{\mygram}(\overline{v_1}) }{

1002:          \Sigma_{v'_1\in L_{Y_i}}\ p_{\mygram}(\overline{v'_1}) }

1003: \cdot 1

1004: \cdot  \frac{ p_{\mygram}(\overline{v_2}) }{

1005:          \Sigma_{v'_2\in L_{X_i'}}\ p_{\mygram}(\overline{v'_2}) }

1006: \end{eqnarray}

1007:

1008: Since $\myaut$ has the SPP, $X_i'$ is unique to $\tau_i$ and

1009: the output strings in $L_{\tau_i}$ are precisely those that

1010: can be obtained by concatenating

1011: an output string in $L_{Y_i}$ and

1012: an output string in $L_{X_i'}$.

1013: Therefore $\Sigma_{v'\in L_{\tau_i}}\ p_{\mygram}(\overline{v'})$

1014: $=$

1015: $\Sigma_{v'_1\in L_{Y_i}} \Sigma_{v'_2\in L_{X_i'}}\

1016: 		p_{\mygram}(\overline{v'_1} \overline{v'_2})$

1017: $=$

1018: $\Sigma_{v'_1\in L_{Y_i}}\ p_{\mygram}(\overline{v'_1})$ $\cdot$

1019: $\Sigma_{v'_2\in L_{X_i'}}\ p_{\mygram}(\overline{v'_2})$,

1020: and

1021: \begin{eqnarray}

1022: p_{\myaut}(c) &=&

1023: \frac{  p_{\mygram}(\overline{v_1}) \cdot p_{\mygram}(\overline{v_2}) }{

1024: 	\Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) } \\

1025: &=&

1026: \frac{ p_{\mygram}(\overline{v_1 v_2}) }{

1027: 	\Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) }  \\

1028: &=&

1029: \frac{ p_{\mygram}(\overline{v}) }{

1030: 	\Sigma_{v'\in L_{X}}\ p_{\mygram}(\overline{v'}) }

1031: \end{eqnarray}

1032: This concludes the proof.~\closeproof

1033:

1034: Note that the definition of $p_{\myaut}$ in the above proof relies on the

1035: strings output by $\myaut$. This is the main reason

1036: why we needed to consider push-down transducers rather

1037: than push-down automata (defined below).

1038: Now assume an appropriate probability

1039: function $p_{\myaut}$ has been found such that

1040: $(\myaut,p_{\myaut})$ is a PPDT that assigns the same

1041: probabilities to computations as

1042: the given PCFG assigns to the corresponding derivations,

1043: following the construction from the proof above.

1044: Then the probabilities assigned to strings over the input

1045: alphabet are also equal.

1046: We may subsequently ignore

1047: the output strings if the application at hand merely requires

1048: probabilistic recognition rather than probabilistic transduction,

1049: or in other words, we may simplify push-down

1050: transducers to push-down automata.

1051:

1052: Formally, a {\em push-down automaton\/} (PDA) $\myaut$ is a 5-tuple

1053: $(\myterm,$ $\mysym,$ $\Xinit,$ $\Xfinal,$ $\mytrans)$,

1054: where $\myterm$ is the input alphabet, and

1055: $\mysym,$ $\Xinit,$ $\Xfinal$ and $\mytrans$ are

1056: as in the definition of PDTs.

1057: Push and pop transitions are as before, but swap transitions are

1058: simplified to the form

1059: $\myscanrec{X}{x}{Y}$, where $x \in \{\epsilon\} \cup \myterm$.

1060: Computations are defined as in the case of PDTs, except that configurations

1061: are now pairs $(\alpha,w)$ whereas they were triples $(\alpha,w,v)$

1062: in the case of PDTs. A {\em probabilistic\/} push-down automaton (PPDA) is

1063: a pair $(\myaut,p_{\myaut})$, where $\myaut$ is a PDA

1064: and $p_{\myaut}$ is a probability function

1065: subject to the same constraints as in the case of

1066: PPDTs.

1067: Since the definitions of CPP and SPP for PDTs did not refer to output strings,

1068: these notions carry over to PDAs in a straightforward way.

1069:

1070: We define the size of a CFG as

1071: $\sum_{(A \de \alpha) \in \myrule} |A\alpha|$,

1072: the total number of occurrences of

1073: terminals and nonterminals in the set of rules.

1074: Similarly,

1075: we define the size of a PDA as

1076: $\sum_{(\myep{\alpha}{\beta})\in \mytrans} |\alpha\beta|+

1077: \sum_{(\myscanrec{X}{x}{Y})\in \mytrans} |{\it XxY}|$,

1078: the total number of occurrences of

1079: stack symbols and terminals in the set of transitions.
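
As a small worked instance of these two definitions (both objects are
hypothetical), a CFG whose set of rules is $\{S \de A b,\ A \de a\}$ has size
$|S A b| + |A a| = 3 + 2 = 5$, and a PDA whose set of transitions is
$\{\myep{X}{X Y},\ \myscanrec{Y}{a}{Y'},\ \myep{X Y'}{Z}\}$ has size
$|X X Y| + |X Y' Z| + |Y a Y'| = 3 + 3 + 3 = 9$.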

1080:

1081: Let $\myaut$ $=$

1082: $(\myterm,$ $\mysym,$ $\Xinit,$ $\Xfinal,$ $\mytrans)$ be a

1083: PDA with both CPP and SPP.

1084: We will now show that we can construct an equivalent CFG

1085: $\mygram$ = $(\myterm,$ $\mysym,$ $\Xinit,$ $\myrule)$

1086: with size linear in the size of $\myaut$.

1087: The rules of this grammar are the following.

1088: \begin{itemize}

1089: \item

1090: $X \de\it Y Z$

1091: for each transition $\myep{X}{X Y}$, where

1092: $Z$ is the unique stack symbol such that there is at

1093: least one transition $\myep{X Y'}{Z}$ with $Y \pdagoto Y'$;

1094: \item

1095: $X \de x Y$

1096: for each transition

1097: $\myscanrec{X}{x}{Y}$;

1098: \item

1099: $Y \de \epsilon$ for each stack symbol $Y$

1100: such that there is at

1101: least one transition $\myep{X Y}{Z}$ or such that $Y=\Xfinal$.

1102: \end{itemize}

1103: It is easy to see that there exists

1104: a bijection from complete computations of $\myaut$ to complete derivations

1105: of $\mygram$, preserving the recognized/derived strings.

1106: Apart from an additional derivation step by rule

1107: $\Xfinal \de \epsilon$, the complete derivations also have the

1108: same length as the corresponding complete computations.
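
The construction can be traced on a small example, which is a sketch under
the assumption that the following hypothetical PDA, with stack symbols
$\Xinit$, $Y$, $Y'$, $Z$ and $\Xfinal$, has both the CPP and the SPP.
Let its transitions be
$\myep{\Xinit}{\Xinit Y}$, $\myscanrec{Y}{a}{Y'}$,
$\myep{\Xinit Y'}{Z}$ and $\myscanrec{Z}{b}{\Xfinal}$.
Its only complete computation recognizes ${\it ab}$ in four steps:
\begin{eqnarray*}
(\Xinit, {\it ab}) \;\vdash\; (\Xinit Y, {\it ab}) \;\vdash\;
(\Xinit Y', b) \;\vdash\; (Z, b) \;\vdash\; (\Xfinal, \epsilon)
\end{eqnarray*}
The three items above yield the rules
$\Xinit \de Y Z$ (from the push transition, with $Z$ determined by the
pop transition since $Y \pdagoto Y'$),
$Y \de a Y'$ and $Z \de b \Xfinal$ (from the swap transitions), and
$Y' \de \epsilon$ and $\Xfinal \de \epsilon$ (from the third item).
The corresponding derivation is
\begin{eqnarray*}
\Xinit \;\Rightarrow\; Y Z \;\Rightarrow\; a Y' Z \;\Rightarrow\; a Z
\;\Rightarrow\; {\it ab} \Xfinal \;\Rightarrow\; {\it ab}
\end{eqnarray*}
which has five steps: one for each of the four transitions, plus the
additional step by rule $\Xfinal \de \epsilon$.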

1109:

1110: The above construction can straightforwardly be extended

1111: to probabilistic PDAs (PPDAs).

1112: Let $(\myaut, p_{\myaut})$ be a PPDA with both CPP and SPP.

1113: Then we construct $\mygram$ as above, and further define

1114: $p_{\mygram}$ such that

1115: $p_{\mygram}(\pi) = p_{\myaut}(\tau)$ for rules

1116: $\pi = X \de \it Y Z$ or $\pi = X \de x Y$ that we construct

1117: out of transitions $\tau=\myep{X}{X Y}$ or $\tau=\myscanrec{X}{x}{Y}$,

1118: respectively, in the first two items above.

1119: We also define $p_{\mygram}(Y \de \epsilon) = 1$

1120: for rules $Y \de \epsilon$ obtained in the third item above.

1121: If $(\myaut, p_{\myaut})$ is reduced, proper and consistent then so is

1122: $(\mygram, p_{\mygram})$.

1123:

1124: This leads to the observation that

1125: parsing strategies with the

1126: CPP and the SPP as well as

1127: their probabilistic extensions can also be

1128: described as grammar transformations, as follows. A given (P)CFG is mapped

1129: to an equivalent (P)PDT by a (probabilistic)

1130: parsing strategy. By ignoring the output components of

1131: swap transitions we obtain a (P)PDA, which can be mapped to

1132: an equivalent (P)CFG as shown above.

1133: This observation gives rise to an extension with

1134: probabilities of

1135: the work on {\em covers\/} by \cite{NI80,LE89}.

1136:

1137: It has been shown by \cite{GO82} that there is an infinite family of languages

1138: with the following property.

1139: The sizes of the smallest CFGs generating those languages

1140: are at least quadratically larger than the sizes of the smallest

1141: equivalent PDAs. Note

1142: that this increase in size cannot occur if PDAs satisfy

1143: both the CPP and the SPP, as we have shown above.

1144:

1145: It is always possible to transform a PDA with the CPP but

1146: without the SPP to an equivalent PDA with both CPP and SPP,

1147: by a construction that increases

1148: the size of the PDA considerably (at least quadratically,

1149: in the light of the above construction and \cite{GO82}).

1150: However, such transformations in general

1151: do not preserve parsing strategies and therefore are of minor interest

1152: to the issues discussed in this paper.

1153:

1154: The simple relationship between PDAs with both CPP and SPP

1155: on the one hand and CFGs on the other can

1156: be used to carry over algorithms originally

1157: designed for CFGs to PDAs or PDTs. One such application is the

1158: evaluation of the right-hand side of

1159: equation~(\ref{e:normalized}) in the proof of

1160: Theorem~\ref{t:sp}. Both the numerator and the denominator

1161: involve potentially infinite sets of subcomputations, and therefore

1162: it is not immediately clear that the proof is constructive.

1163: However, there are published algorithms to compute, for a

1164: given PCFG $(\mygram',p_{\mygram'})$

1165: that is not necessarily proper and a given

1166: nonterminal $A$, the expression

1167: $\Sigma_{w \in \Sigma^\ast}\ p_{\mygram'}(A \Rightarrow^\ast w)$,

1168: or rather, to approximate it with arbitrary precision;

1169: see \cite{BO73,ST95}.

1170: This can be used to compute e.g.\

1171: $\Sigma_{v\in L_{X}}\ p_{\mygram}(\overline{v})$

1172: in equation~(\ref{e:normalized}), as follows.

1173:

1174: The first step

1175: is to map the PDT to a CFG $\mygram'$ as shown above.

1176: We then define a function

1177: $p_{\mygram'}$ that assigns probability

1178: 1 to all rules that we construct out of push and pop

1179: transitions. We also let $p_{\mygram'}$ assign probability

1180: $p_{\mygram}(\overline{y})$ to a rule

1181: $X \de x Y$ that we construct out of a swap transition

1182: $\myscan{X}{x}{y}{Y}$.

1183: It is easy to see that, for any stack symbol $X$, we have

1184: $\Sigma_{v\in L_{X}}\ p_{\mygram}(\overline{v}) =

1185: \Sigma_{w \in \Sigma_1^\ast}\ p_{\mygram'}(X \Rightarrow^\ast w)$.

1186: This allows our problem on the computations of probabilities

1187: in the right-hand side of equation~(\ref{e:normalized})

1188: to be reduced to a problem on PCFGs, which can be solved

1189: by existing algorithms as discussed above.
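
The following sketch illustrates the reduction on a hypothetical PDT
fragment; the symbols $q_1$, $q_2$ and $s_X$ are notation introduced only
for this example.
Suppose the transitions with top-of-stack $X$ are the two swap transitions
$\tau_1 = \myscan{X}{x_1}{y_1}{X}$ and
$\tau_2 = \myscan{X}{x_2}{y_2}{\Xfinal}$,
and write $q_1 = p_{\mygram}(\overline{y_1})$ and
$q_2 = p_{\mygram}(\overline{y_2})$, assuming $q_1 < 1$.
The grammar $\mygram'$ then has the rules $X \de x_1 X$ with
probability $q_1$, $X \de x_2 \Xfinal$ with probability $q_2$,
and $\Xfinal \de \epsilon$ with probability $1$.
Writing $s_X$ for
$\Sigma_{w \in \Sigma_1^\ast}\ p_{\mygram'}(X \Rightarrow^\ast w)$,
we obtain the equations
\begin{eqnarray*}
s_X &=& q_1 \cdot s_X + q_2 \cdot s_{\Xfinal} \\
s_{\Xfinal} &=& 1
\end{eqnarray*}
with solution $s_X = q_2 / (1 - q_1)$.
This agrees with a direct evaluation: here
$L_X = \{ y_1^n y_2 \ |\ n \geq 0 \}$, and using
$p_{\mygram}(\overline{v_1 v_2}) =
p_{\mygram}(\overline{v_1}) \cdot p_{\mygram}(\overline{v_2})$
as in the proof of Theorem~\ref{t:sp}, we get
$\Sigma_{v\in L_{X}}\ p_{\mygram}(\overline{v}) =
\Sigma_{n \geq 0}\ q_1^n \cdot q_2 = q_2 / (1 - q_1)$.
Equation~(\ref{e:normalized}) then gives
$p_{\myaut}(\tau_1) = q_1 \cdot s_X / s_X = q_1$ and
$p_{\myaut}(\tau_2) = q_2 / s_X = 1 - q_1$.
In larger cases the equations are nonlinear and are typically solved only
approximately, for instance by iterating them starting from zero.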

1190:

1191: \section{Parsing strategies with SPP}

1192: \label{s:strong}

1193:

1194: Many well-known parsing strategies with the CPP

1195: also have the SPP,

1196: such as top-down parsing \cite{HA78}, left-corner parsing \cite{RO70}

1197: and PLR parsing \cite{SO79}, the first two of which we will

1198: define explicitly here, whereas of the third we will merely

1199: present a sketch. A fourth strategy that we will discuss

1200: is a combination of left-corner and top-down parsing, with special

1201: computational properties.

1202:

1203: In order to simplify the presentation, we allow a new type of

1204: transition, without increasing the power of PDTs, viz.\

1205: a combined push/swap transition of the form

1206: $\myscan{X}{x}{y}{X Y}$. Such a transition can be seen as short-hand for

1207: two transitions, the first of the form $\myep{X}{X Y_{x,y}}$,

1208: where $Y_{x,y}$ is a new symbol not already in $\mysym$, and

1209: the second of the form $\myscan{Y_{x,y}}{x}{y}{Y}$.

1210:

1211: The first strategy we discuss is top-down parsing.

1212: For a fixed CFG $\mygram  = (\myterm,$ $\mynont,$ $S,$ $\myrule)$,

1213: we define $\mystrat_{\it TD}(\mygram) =

1214: (\myaut,f)$. Here $\myaut$ $=$ $(\myterm,$

1215: $\myrule,$ $\mysym,$ $[S \de\ \bul\sigma],$ $[S \de \sigma\bul],$ $\mytrans)$,

1216: where $\mysym = \{ [A \de \alpha \bul \beta]\  |\

1217: (A \de\alpha\beta) \in \myrule \}$;

1218: these `dotted rules' are well-known from \cite{KN65,EA70}.

1219: The transitions in $\mytrans$ are:

1220: \begin{itemize}

1221: \item $\myscan{[A \de \alpha \bul a \beta]}{a}{\epsilon}%

1222: 		{[A \de \alpha a \bul \beta]}$

1223: for each rule $A \de \alpha a \beta$;

1224: \item $\myscan{[A \de \alpha \bul B \beta]}{\epsilon}{\pi}%

1225: 		{[A \de \alpha \bul B \beta]\ [B\de\ \bul\gamma]}$

1226: for each pair of rules

1227: $A \de \alpha B \beta$ and $\pi = B \de \gamma$;

1228: \item $\myep{[A \de \alpha \bul B \beta]\ [B \de \gamma \bul]}%

1229: 		{[A \de \alpha B \bul \beta]}$.

1230: \end{itemize}

1231: The function $f$ is the identity function on strings over $\myrule$.

1232: If seen as a function on computations,

1233: then $f$ is a bijection from complete computations

1234: of $\myaut$ to complete derivations of $\mygram$,

1235: as required by the definition of `parsing strategy'.
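
To illustrate the construction, take a hypothetical grammar in which
$\sigma = A b$ and the only other rule is $\pi = A \de a$.
The three items above then produce exactly the following transitions:
\begin{itemize}
\item $\myscan{[S \de\ \bul A b]}{\epsilon}{\pi}%
		{[S \de\ \bul A b]\ [A \de\ \bul a]}$
(second item);
\item $\myscan{[A \de\ \bul a]}{a}{\epsilon}{[A \de a \bul]}$
(first item);
\item $\myep{[S \de\ \bul A b]\ [A \de a \bul]}{[S \de A \bul b]}$
(third item);
\item $\myscan{[S \de A \bul b]}{b}{\epsilon}{[S \de A b \bul]}$
(first item).
\end{itemize}
Applied in this order, they form a computation that starts with stack
$[S \de\ \bul A b]$, reads the input ${\it ab}$, ends with stack
$[S \de A b \bul]$, and outputs the single symbol $\pi$.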

1236:

1237: If $\mygram$ is reduced, then $\myaut$ clearly has the CPP.

1238: That it also has the SPP can be

1239: argued as follows.

1240: Let us first remark that if

1241: $[A \de \alpha \bul \beta]\pdagoto X$

1242: for some stack symbols $[A \de \alpha \bul \beta]$ and $X$,

1243: then $X$ must be of the form

1244: $[A \de \alpha \gamma\bul \delta]$, for some $\gamma$ and $\delta$

1245: such that $\gamma\delta = \beta$.

1246: Now, if there are three transitions

1247: $\myep{X}{X Y}$, $\myep{X Y_1}{Z_1}$ and

1248: $\myep{X Y_2}{Z_2}$ such that $Y\pdagoto Y_1$ and

1249: $Y\pdagoto Y_2$, then

1250: $X$ must be of the form $[A \de \alpha \bul B \beta]$

1251: and $Y$ of the form $[B\de\ \bul\gamma]$

1252: (strictly speaking $[B\de\ \bul\gamma]_{\epsilon,\pi}$),

1253: $Y_1$ and $Y_2$ must both be $[B \de \gamma \bul]$,

1254: and $Z_1$ and $Z_2$ must both be

1255: $[A \de \alpha B \bul\beta]$.

1256: Hence the SPP is satisfied.

1257:

1258: Since $\mystrat_{\it TD}$ has both CPP and SPP, we

1259: may apply Theorem~\ref{t:sp} to conclude that $\mystrat_{\it TD}$ can be

1260: extended to become a probabilistic parsing strategy.

1261: A direct construction of a top-down PPDT from a

1262: PCFG $(\mygram, p_{\mygram})$

1263: is obtained by extending the above construction such that

1264: probability 1 is assigned to all transitions produced by the first

1265: and third items, and probability $p_{\mygram}(\pi)$ is assigned

1266: to transitions produced by the second item.
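
As a hedged numerical example, suppose a nonterminal $B$ has exactly two
rules, $\pi_1 = B \de b$ with $p_{\mygram}(\pi_1) = 0.4$ and
$\pi_2 = B \de c B$ with $p_{\mygram}(\pi_2) = 0.6$
(the rules and numbers are invented for illustration).
For a stack symbol $[A \de \alpha \bul B \beta]$, the second item then
produces two transitions, with output $\pi_1$ and $\pi_2$ and with
probabilities $0.4$ and $0.6$, respectively, which sum to $1$.
A subcomputation that recognizes ${\it ccb}$ from $B$ uses the transition for
$\pi_2$ twice and the transition for $\pi_1$ once, while all
other transitions it uses have probability $1$, so its probability is
\begin{eqnarray*}
0.6 \cdot 0.6 \cdot 0.4 &=& 0.144
\end{eqnarray*}
which equals
$p_{\mygram}(\pi_2) \cdot p_{\mygram}(\pi_2) \cdot p_{\mygram}(\pi_1)$,
the probability of the subderivation $\pi_2 \pi_2 \pi_1$.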

1267:

1268: The second strategy we discuss is left-corner (LC) parsing \cite{RO70}.

1269: For a fixed CFG $\mygram= (\myterm,$ $\mynont,$ $S,$ $\myrule)$,

1270: we define the binary relation $\LC$ over

1271: $\myterm \cup \mynont$ by:

1272: $X \LC A$ if and only if there is

1273: an $\alpha \in (\myterm \cup \mynont)^\ast$

1274: such that $(A \de X\alpha)\in\myrule$,

1275: where $X \in \myterm \cup \mynont$.

1276: We define the binary relation $\LCstar$ to be the reflexive and

1277: transitive closure of $\LC$. This implies that $a \LCstar a$ for all

1278: $a \in \myterm$.
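
For example, for a hypothetical grammar with rules $S \de A b$ and
$A \de a$, the relation $\LC$ consists of just the two pairs
$A \LC S$ and $a \LC A$, since only the left-most symbol of a right-hand
side is considered.
Its closure $\LCstar$ adds $a \LCstar S$ by transitivity, and
$S \LCstar S$, $A \LCstar A$, $a \LCstar a$ and $b \LCstar b$ by
reflexivity.
In particular, $b \LCstar S$ does not hold, although $b$ occurs in the
right-hand side of $S \de A b$.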

1279:

1280: We now define $\mystrat_{\it LC}(\mygram) =

1281: (\myaut,f)$. Here $\myaut$ $=$ $(\myterm,$

1282: $\myrule\cup\{\dashv\},$

1283: $\mysym,$ $[S \de\ \bul\sigma],$ $[S \de \sigma\bul],$ $\mytrans)$,

1284: where $\mysym$ contains stack

1285: symbols of the form $[A \de \alpha \bul \beta]$

1286: where $(A \de\alpha\beta) \in \myrule$ such that

1287: $\alpha\neq\epsilon\vee A =S$,

1288: and

1289: stack symbols of the form

1290: $[A \de \alpha \bul Y\!\beta; X]$

1291: where $(A \de\alpha Y\!\beta) \in \myrule$ and

1292: $X,Y\in\myterm \cup \mynont$ such that $\alpha\neq\epsilon\vee A =S$

1293: and $X \LCstar Y$.

1294: The latter type of stack symbol indicates that

1295: left corner $X$ of goal $Y$ in the right-hand side of rule

1296: $A \de \alpha Y\! \beta$ has just been recognized.

1297: The transitions in $\mytrans$ are:

1298: \begin{itemize}

1299: \item $\myscan{[A \de \alpha \bul Y\! \beta]}{a}{\epsilon}%

1300: 		{[A \de \alpha \bul Y\! \beta; a]}$

1301: for each rule $A \de \alpha Y\! \beta$ and $a \in \myterm$

1302: such that $\alpha\neq\epsilon\vee A=S$ and $a \LCstar Y$;

1303: \item $\myscan{[A \de \alpha \bul B \beta]}{\epsilon}{\pi}%

1304: 		{[A \de \alpha \bul B \beta; C]}$

1305: for each pair of rules $A \de \alpha B \beta$ and

1306: $\pi = C \de \epsilon$ such that $\alpha\neq\epsilon\vee A=S$ and $C \LCstar B$;

1307: \item $\myscan{[A \de \alpha \bul B \beta; X]}{\epsilon}{\pi}%

1308:                 {[A \de \alpha \bul B \beta; X]\ [C\de X \bul\gamma]}$

1309: for each pair of rules

1310: $A \de \alpha B \beta$ and $\pi = C \de X \gamma$ such that

1311: $\alpha\neq\epsilon\vee A=S$ and $C \LCstar B$;

1312: \item $\myep{[A \de \alpha \bul B \beta; X]\ [C\de X\gamma \bul]}%

1313: 		{[A \de \alpha \bul B \beta; C]}$

1314: for each pair of rules

1315: $A \de \alpha B \beta$ and $C \de X\gamma$ such that

1316: $\alpha\neq\epsilon\vee A=S$

1317: and

1318: $C \LCstar B$;

1319: \item $\myscan{[A \de \alpha \bul Y\! \beta; Y]}%

1320: 		{\epsilon}{\dashv}

1321:                 {[A \de \alpha Y \bul \beta]}$

1322: for each rule $A \de \alpha Y\! \beta$ such that

1323: $\alpha\neq\epsilon\vee A=S$.

1324: \end{itemize}

1325: The function $f$ has to rearrange an output string to obtain

1326: a complete derivation.

1327: To make this possible, the output alphabet contains the

1328: symbol $\dashv$ in addition to rules from $\myrule$.

1329: This symbol is used to mark the end of

1330: an upward path of nodes in the parse tree

1331: each of which, except the last, is

1332: the left-most daughter node of its mother node.

1333: As explained in \cite{NI80}, in the absence of such a symbol,

1334: it would be impossible to uniquely identify output strings with

1335: derivations of the input.\footnote{%

1336: In \cite[pp.~22--23]{NI80} a context-free grammar is considered

1337: that consists of the set of rules

1338: $R = \{S \de {\it aS}, S \de {\it Sb}, S \de c\}$.

1339: It is shown that any left-corner push-down

1340: transducer using only $R$ as output alphabet would

1341: output at most one string for each input string,

1342: whereas there may be several

1343: derivations of the input, as the grammar is ambiguous.}

1344:

1345: The function $f$ for the strategy $\mystrat_{\it LC}$

1346: is defined by Figure~\ref{f:fLC}. Function $f$ is defined

1347: in terms of function $f_{\it LC}$, which has two arguments.

1348: The first argument, $d$, is either the empty string or

1349: a subderivation that has already been constructed.

1350: The second argument is a suffix of the output

1351: string originally supplied as argument to $f$.

1352: Function $f_{\it LC}$ removes the first symbol $\pi$

1353: from the output string, which will be

1354: a rule $A \de X X_{1} \cdots X_l$ or $A \de \epsilon$.

1355: In the former case, $d$ must be $\epsilon$ if $X\in \myterm_1$

1356: and $d$ must be a subderivation from nonterminal $X$ otherwise.

1357: The function is then called recursively zero or more times,

1358: once for each nonterminal in $X_{1} \cdots X_l$,

1359: to obtain more subderivations $d_i$, $1 \leq i \leq l$,

1360: each of which is

1361: obtained by consuming a subsequent part of the output string.

1362: These subderivations are combined into a larger subderivation

1363: $d' = \pi d d_{1} \cdots d_l$. Depending on the

1364: question whether we encounter $\dashv$ as the immediately

1365: following symbol of the output string, we return the

1366: derivation $d'$ and the remainder $v'$ of the output string, or

1367: call $f_{\it LC}$ recursively once more to

1368: obtain a larger subderivation.

1369: \begin{figure}[tp]

1370: \begin{eqnarray*}

1371: f(v) &=& d \\

1372: && {\rm where} \\

1373: && (d, \epsilon) = f_{\it LC}(\epsilon,v) \\

1374: f_{\it LC}(d,\pi v_0) &=& (d'',v'') \\

1375: && {\rm where} \\

1376: && l \mbox{\ is such that\ }

1377:         \pi = A \de X X_{1} \cdots X_l\ \mbox{\ or} \\

1378: && \hspace*{5ex} \pi = A \de \epsilon \wedge l = 0 \\

1379: && (d_{1}, v_{1}) =

1380:                 {\rm if\ } X_{1} \in \myterm_1

1381:                 {\rm \ then\ } (\epsilon, v_{0})

1382:                 {\rm \ else\ } f_{\it LC}(\epsilon, v_{0}) \\

1383: && \ldots \\

1384: && (d_{l}, v_{l}) =

1385:                 {\rm if\ } X_{l} \in \myterm_1

1386:                 {\rm \ then\ } (\epsilon, v_{l-1})

1387:                 {\rm \ else\ } f_{\it LC}(\epsilon, v_{l-1}) \\

1388: && d' = \pi d d_{1} \cdots d_l \\

1389: && (d'',v'') =

1390:                 {\rm if\ } {\dashv} v'  = v_{l}

1391:                 {\rm \ then\ } (d',v')

1392:                 {\rm \ else\ } f_{\it LC}(d',v_{l})

1393: \end{eqnarray*}

1394: \caption{Function $f$ for $\mystrat_{\it LC}$.}

1395: \label{f:fLC}

1396: \end{figure}
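
As an illustration, the equations in Figure~\ref{f:fLC} can be applied
mechanically to a fragment of an output string.
The three rules $\pi_A = A \de a$, $\pi_B = B \de A\,C$ and
$\pi_C = C \de c$ are hypothetical and serve only this example,
and $w$ stands for an arbitrary remainder of the output string.
The first call has $\pi = \pi_A$ and $l=0$, so that $d' = \pi_A$;
since the next symbol is not $\dashv$, the function is called again
with $d = \pi_A$. That second call has $\pi = \pi_B$, $X = A$ and $l=1$,
and obtains $d_1$ by a nested call for the nonterminal $C$:
\begin{eqnarray*}
f_{\it LC}(\epsilon,\ \pi_C {\dashv} {\dashv} w) &=&
	(\pi_C,\ {\dashv} w) \\
f_{\it LC}(\pi_A,\ \pi_B \pi_C {\dashv} {\dashv} w) &=&
	(\pi_B \pi_A \pi_C,\ w) \\
f_{\it LC}(\epsilon,\ \pi_A \pi_B \pi_C {\dashv} {\dashv} w) &=&
	f_{\it LC}(\pi_A,\ \pi_B \pi_C {\dashv} {\dashv} w)
	\ =\ (\pi_B \pi_A \pi_C,\ w)
\end{eqnarray*}
The resulting subderivation $\pi_B \pi_A \pi_C$ lists the rules of the
left-most derivation
$B \Rightarrow A C \Rightarrow a C \Rightarrow a c$.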

1397:

1398: It can be easily shown that this strategy has the CPP.

1399: Regarding the SPP, note that if there are two transitions

1400: $\myscan{[A \de \alpha \bul B \beta; X]}{\epsilon}{\pi}%

1401:                 {[A \de \alpha \bul B \beta; X]\ [C\de X \bul\gamma]}$

1402: and $\myep{[A \de \alpha \bul B \beta; X]\ Y_1}{Z_1}$ such that

1403: $[C\de X \bul\gamma] \pdagoto Y_1$, then

1404: $Y_1$ must be $[C\de X \gamma\bul]$

1405: and $Z_1$ must be $[A \de \alpha \bul B \beta; C]$, which means that

1406: $Z_1$ is uniquely determined by the first transition.

1407:

1408: Since $\mystrat_{\it LC}$ has both CPP and SPP,

1409: left-corner parsing can be extended to become a

1410: probabilistic parsing strategy. A direct construction of

1411: probabilistic left-corner parsers from PCFGs has been presented

1412: by \cite{TE95}.

1413:

1414: Since at most two rules occur in each of the items above,

1415: the size of a (probabilistic)

1416: left-corner parser is $\order{|\mygram|^2}$, where

1417: $|\mygram|$ denotes the size of $\mygram$.

1418: This is the same complexity as that of the direct

1419: construction by \cite{TE95}.

1420: This is in contrast to a construction of `shift-reduce' PPDAs

1421: out of PCFGs from \cite{AB99}, which were of size

1422: $\order{|\mygram|^5}$.\footnote{This construction consisted

1423: of a transformation to Chomsky normal form followed by

1424: a transformation to Greibach normal form (GNF) \cite{HA78}.

1425: Its worst-case time complexity, established in

1426: p.c.\ with David McAllester, is reached for a family

1427: of CFGs $(\mygram_n)_{n \geq 2}$, defined by $\mygram_n =$

1428: $(\{a_1,\ldots,a_n\},$ $\{A_1,\ldots,A_n\},$ $A_1,$ $\myrule)$,

1429: where $\myrule$ contains the rules

1430: $A_i \de A_{i+1}$, for $1 \leq i \leq n-1$,

1431: $A_n \de A_1$,

1432: and $A_i \de A_{i}\ A_{i}$ and

1433: $A_i \de a_i$, for $1 \leq i \leq n$.

1434: After transformation to GNF, the grammar

1435: contains $n^5$ rules of the

1436: form

1437: $A_{i_1}/A_{i_2} \de a_{i_3}\ A_{i_2}/A_{i_4}\ A_{i_1}/A_{i_5}$,

1438: with $1 \leq i_1,i_2,i_3,i_4,i_5 \leq n$.

1439: In \cite{BL99} a more economical transformation

1440: to Greibach normal form is given; straightforward

1441: extension to probabilities leads to

1442: probabilistic parsers of the type considered by \cite{AB99}

1443: of size $\order{|\mygram|^4}$.}

1444: The ``conjecture that

1445: there is no {\em concise\/} translation of

1446: PCFGs into shift-reduce PPDAs'' from \cite{AB99}

1447: is made less significant by the earlier construction by \cite{TE95}

1448: and our construction above.

1449: It must be noted however that the `shift-reduce' model adhered to

1450: by \cite{AB99} is more restrictive than the PDT models adhered

1451: to by \cite{TE95} and by us.

1452:

1453: When we look at upper bounds on the sizes of PPDAs (or PPDTs)

1454: that describe the same probability distributions as given

1455: PCFGs, and compare these with the upper bounds for

1456: (non-probabilistic) PDAs (or PDTs) for given CFGs,

1457: we can make the following observation.

1458: Theorem~\ref{t:cpp} states

1459: that parsing strategies without the CPP cannot be extended to

1460: become probabilistic. Furthermore, \cite{LE00} has shown that

1461: for certain fixed languages the smallest

1462: PDAs without the CPP are much smaller than

1463: the smallest PDAs with the CPP.

1464: It may therefore appear that probabilistic PDAs

1465: are in general larger than non-probabilistic ones.

1466: However, the automata studied by

1467: \cite{LE00} pertain to very specific languages, and at this

1468: point there is little reason to believe that the demonstrated

1469: results for these languages carry over to

1470: any reasonable strategy for {\em general\/} CFGs.

1471:

1472: The third parsing strategy that we discuss is PLR parsing \cite{SO79}.

1473: Since it is very similar to LC parsing,

1474: we merely provide a sketch.

1475: The stack symbols for PLR parsing are like those for LC parsing,

1476: except that the parts of rules following the dot are omitted.

1477: Thus, instead of symbols of the form

1478: $[A \de \alpha \bul \beta]$ and of the

1479: form $[A \de \alpha \bul \beta; X]$, a PLR parser

1480: manipulates stack symbols

1481: $[A \de \alpha ]$ and $[A \de \alpha ; X]$, respectively.

1482: That $\beta$ is omitted means that PLR parsers may postpone commitment

1483: to one of two similar rules $A \de \alpha \beta$ and

1484: $A \de \alpha \beta'$ until the point is reached where $\beta$ and

1485: $\beta'$ differ. In this sense PLR parsing

1486: is less predictive than LC parsing,

1487: although it still satisfies the

1488: strong predictiveness property, so that it can be extended to

1489: become probabilistic.
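
As a small illustration, assume two hypothetical rules
$A \de a\,b\,c$ and $A \de a\,b\,d$, which are not taken from any
grammar discussed here.
After processing $a\,b$, the stack symbols of the two strategies
compare as follows:
$$
\begin{array}{ll}
\mbox{LC:} & [A \de a b \bul c] \mbox{\ \ or\ \ } [A \de a b \bul d] \\
\mbox{PLR:} & [A \de a b]
\end{array}
$$
The LC parser has thus already committed to one of the two rules,
whereas the PLR parser holds a single stack symbol for both and needs
to choose only when $c$ or $d$ is read.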

1490:

1491: There are two minor differences between the transitions of LC

1492: parsers and those of PLR parsers. The first is the simplification of

1493: stack symbols as explained above. The second is that for PLR,

1494: output of a rule is delayed until it is completely recognized.

1495: The resulting output strings are right-most

1496: derivations in reverse, which requires different functions $f$ than in

1497: the case of LC parsing.

1498: Note that right-most derivations can be effectively mapped

1499: to corresponding parse trees, and parse trees can be effectively

1500: mapped to corresponding left-most derivations.

1501: Hence the required functions $f$ clearly exist.

1502:

1503: The last strategy to be discussed in this section is a combination

1504: of left-corner and top-down parsing. It has the special property

1505: that, provided the fixed CFG is acyclic,

1506: the length of computations is bounded by a

1507: linear function of the length of the input, which

1508: means that the parser cannot `loop' on any input.

1509: Note that if the grammar is not acyclic, computations

1510: of unbounded length cannot be avoided by any parsing strategy.

1511: {}From this perspective, this parsing strategy, which we will

1512: call {\em $\epsilon$-LC\/} parsing, is optimal.

1513: It is based on \cite{NE93b}, and a

1514: related idea for LR parsing was described by \cite{NE96e}.

1515: The special termination properties of this strategy will be needed

1516: in Section~\ref{s:prefix}.

1517:

1518: We first define the binary relation $\LCep$ over

1519: $\myterm \cup \mynont$ by:

1520: $X \LCep A$ if and only if there are

1521: $\alpha,\beta\in (\myterm \cup \mynont)^\ast$

1522: such that $(A \de \alpha X\beta)\in\myrule$

1523: and $\alpha \Rightarrow^\ast \epsilon$.

1524: Relation $\LCep$ differs from the relation $\LC$ defined earlier

1525: in that epsilon-generating

1526: nonterminals at the beginning of a rule may be ignored.
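
For example, assume a hypothetical rule $A \de B\,C\,d$ in which
$B \Rightarrow^\ast \epsilon$ but $C$ is not epsilon-generating.
Taking $\alpha = \epsilon$ and $\alpha = B$, respectively, in the
definition above, we obtain
$$
B \LCep A, \qquad C \LCep A, \qquad \mbox{but not\ \ } d \LCep A,
$$
the last because $B\,C \Rightarrow^\ast \epsilon$ does not hold.
If, as we assume here, the earlier relation $\LC$ considers only the
first symbol of a right-hand side, then for this rule only $B \LC A$
would hold.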

1527:

1528: The stack symbols are now of the form

1529: $[A \de \alpha \bul \beta, \mu\bul\nu]$ or of the

1530: form $[A \de \alpha \bul Y\! \beta, \mu\bul\nu; X]$.

1531: Similar to the stack symbols for pure LC parsing, we

1532: have $\alpha\neq\epsilon\vee A =S$

1533: and $X \LCepstar Y$. Different is the additional dotted expression

1534: $\mu\bul\nu$, which is such that $\mu\nu$ is

1535: a string of epsilon-generating nonterminals, occurring at

1536: the beginning of the right-hand side of a rule

1537: $A \de \mu\nu\alpha \beta$ or $A \de \mu\nu\alpha Y\!\beta$,

1538: respectively.

1539: The string $\mu\nu$ will be ignored in the

1540: part of the strategy that behaves like left-corner parsing,

1541: where $\mu=\epsilon$.

1542: However, when the dot of the first dotted expression is at the end,

1543: i.e., when we obtain a stack symbol of the form

1544: $[A \de \alpha \bul, \bul\nu]$, then

1545: top-down parsing will be activated to retrieve epsilon-generating

1546: subderivations for the nonterminals in $\nu$,

1547: and the dot will move through $\nu$ from

1548: left to right.\footnote{%

1549: Although such subderivations can also be pre-compiled

1550: during construction of the PDT,

1551: we refrain from doing so since this could lead to

1552: a PDT of exponential size.}

1553:

1554: We have $\Xinit = [S \de\ \bul \sigma, \bul]$

1555: and $\Xfinal = [S \de \sigma\bul, \bul]$, where for technical

1556: reasons, and without loss of generality, we assume that

1557: $\sigma$ does not contain any epsilon-generating nonterminals.

1558: Next to the symbols from $\myrule$ and the symbol $\dashv$,

1559: the output alphabet $\myterm_2$ also includes the set

1560: of integers $\{0,\ldots,l-1\}$, where $l= |\alpha|$ for

1561: a rule $(A \de \alpha) \in\myrule$ of maximal length;

1562: the purpose of such integers will become clear below.

1563: For the definition of the set of transitions, we will be less

1564: precise than for $\mystrat_{\it TD}$ and

1565: $\mystrat_{\it LC}$, to prevent

1566: cluttering up the presentation with details.

1567: We point out however that

1568: in order to produce a reduced PDT from a reduced CFG, further side

1569: conditions are needed for all items below:

1570:

1571: \begin{itemize}

1572: \item $\myscan{[A \de \alpha \bul Y\! \beta, \bul\mu]}{a}{\epsilon}%

1573:                 {[A \de \alpha \bul Y\! \beta, \bul\mu; a]}$

1574: for $a \in \myterm$ such that $a \LCepstar Y$;

1575: \item $\myscan{[A \de \alpha \bul B \beta, \bul\mu]}{\epsilon}{\pi 0}%

1576: 		{[A \de \alpha \bul B \beta, \bul\mu; C]}$

1577: for $\pi = C \de \epsilon$ such that $C \LCstar B$;

1578: \item $\myscan{[A \de \alpha \bul B \beta, \bul\mu; X]}{\epsilon}{\pi m}%

1579:                 {[A \de \alpha \bul B \beta, \bul\mu; X]\

1580: 		[C\de X \bul\gamma, \bul\mu']}$

1581: for $\pi = C \de \mu' X \gamma$ such that

1582: $C \LCepstar B$ and

1583: $\mu'\Rightarrow^\ast\epsilon$, where $m= |\mu'|$;

1584: \item $\myep{[A \de \alpha \bul B \beta, \bul\mu; X]\

1585: 		[C\de X\gamma \bul, \mu'\bul]}%

1586:                 {[A \de \alpha \bul B \beta, \bul\mu; C]}$;

1587: \item $\myscan{[A \de \alpha \bul Y\! \beta, \bul\mu; Y]}%

1588:                 {\epsilon}{\dashv}

1589:                 {[A \de \alpha Y \bul \beta, \bul\mu]}$;

1590: \item $\myscan{[A \de \alpha \bul, \mu \bul B \nu]}{\epsilon}{\pi}%

1591:                 {[A \de \alpha \bul, \mu \bul B \nu]\

1592: 		[B\de\ \bul, \bul\mu']}$

1593: for $\pi = B \de \mu'$ such that $\mu'\Rightarrow^\ast\epsilon$;

1594: \item $\myep{[A \de \alpha \bul, \mu \bul B \nu]\ [B\de\ \bul, \mu'\bul]}%

1595:                 {[A \de \alpha \bul, \mu B \bul \nu]}$.

1596: \end{itemize}

1597:

1598: The first five items are almost identical to the five

1599: items we presented for $\mystrat_{\it LC}$,

1600: except that strings $\mu$ of

1601: epsilon-generating nonterminals at the beginning of rules

1602: are ignored.

1603: The length $m$ of a string $\mu$ is output just after

1604: the relevant grammar rule is output, in the second and third items.

1605: This length $m$ will be needed to define function $f$ below.

1606:

1607: The last two items follow a top-down strategy, but only for

1608: epsilon-generating rules.

1609: The produced transitions do what

1610: was deferred by the left-corner part of the strategy:

1611: they construct subderivations for the

1612: epsilon-generating nonterminals in strings $\mu$.

1613:

1614: The function $f$, which produces a complete derivation

1615: from an output string, is defined through two

1616: auxiliary functions, viz.\

1617: $\fepLC$ for the left-corner part and

1618: $\fepTD$ for the top-down

1619: part, as shown in Figure~\ref{f:fepLC}.

1620:

1621: \begin{figure}[tp]

1622: \begin{eqnarray*}

1623: f(v) &=& d \\

1624: && {\rm where} \\

1625: && (d, \epsilon) = \fepLC(\epsilon,v) \\

1626: \fepLC(d,\pi m v_{0}) &=& (d'',v'') \\

1627: && {\rm where} \\

1628: && l \mbox{\ is such that\ }

1629: 	\pi = A \de B_1 \cdots B_m X X_{1} \cdots X_l\ \mbox{\ or} \\

1630: && \hspace*{5ex} \pi = A \de \epsilon \wedge l = 0 \\

1631: && (d_{1}, v_{1}) =

1632: 		{\rm if\ } X_{1} \in \myterm_1

1633: 		{\rm \ then\ } (\epsilon, v_{0})

1634: 		{\rm \ else\ } \fepLC(\epsilon, v_{0}) \\

1635: && \ldots \\

1636: && (d_{l}, v_{l}) =

1637:                 {\rm if\ } X_{l} \in \myterm_1

1638:                 {\rm \ then\ } (\epsilon, v_{l-1})

1639:                 {\rm \ else\ } \fepLC(\epsilon, v_{l-1}) \\

1640: && (d'_1,v_{l+1}) = \fepTD(v_{l}) \\

1641: && \ldots \\

1642: && (d'_m,v_{l+m}) = \fepTD(v_{l+m-1}) \\

1643: && d' = \pi d'_1 \cdots d'_m d d_{1} \cdots d_l \\

1644: && (d'',v'') =

1645: 		{\rm if\ } {\dashv} v'  = v_{l+m}

1646: 		{\rm \ then\ } (d',v')

1647: 		{\rm \ else\ } \fepLC(d',v_{l+m}) \\

1648: \fepTD(v) &=& (\pi d_1 \cdots d_l,v_l) \\

1649: && {\rm where} \\

1650: && \pi v_{0} = v \\

1651: && l \mbox{\ is such that\ } \pi = A \de B_{1} \cdots B_l  \\

1652: && (d_1,v_1) = \fepTD(v_0) \\

1653: && \ldots \\

1654: && (d_l,v_l) = \fepTD(v_{l-1})

1655: \end{eqnarray*}

1656:

1657: \caption{Function $f$ for $\stratepLC$.}

1658: \label{f:fepLC}

1659: \end{figure}

1660:

1661: The function $\fepLC$ is similar to $f_{\it LC}$ defined in

1662: Figure~\ref{f:fLC}. The main difference is that now

1663: subderivations deriving $\epsilon$

1664: for the first $m$ nonterminals in the right-hand side

1665: of a rule are obtained by calls of the function $\fepTD$.

1666: For a suffix $v$ of an output string,

1667: $\fepTD(v)$ yields a pair $(\pi d_1 \cdots d_l,v_l)$

1668: such that $v= \pi d_1 d_2 \cdots d_lv_l$. In other words,

1669: $\fepTD$ does nothing more than split its argument into

1670: two parts. The length of the first part $\pi d_1 \cdots d_l$

1671: depends on the

1672: length $l$ of the right-hand side of rule $\pi$ and

1673: on the lengths of right-hand sides of rules that

1674: are visited recursively.
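
As a sketch of how $\fepTD$ splits its argument, consider three
hypothetical rules $\pi_1 = A \de B\,C$, $\pi_2 = B \de \epsilon$ and
$\pi_3 = C \de \epsilon$, and let $w$ be an arbitrary remainder of the
output string. Applying the equations in Figure~\ref{f:fepLC} gives:
\begin{eqnarray*}
\fepTD(\pi_2 \pi_3 w) &=& (\pi_2,\ \pi_3 w) \\
\fepTD(\pi_3 w) &=& (\pi_3,\ w) \\
\fepTD(\pi_1 \pi_2 \pi_3 w) &=& (\pi_1 \pi_2 \pi_3,\ w)
\end{eqnarray*}
The first two calls have $l = 0$ and consume a single rule each.
The third call has $l = 2$, and obtains $d_1 = \pi_2$ and $d_2 = \pi_3$
from the first two.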

1675:

1676: It can be easily seen that $\stratepLC$

1677: has both CPP and SPP. The size of a produced

1678: PDT is now $\order{|\mygram|^3}$, rather than

1679: $\order{|\mygram|^2}$ as in the case of $\mystrat_{\it LC}$.

1680:

1681: \comment{

1682: The second new type of transition

1683: is a combined swap/pop transition of the form

1684: $\myscan{X Y}{x}{y}{Z}$. Such a transition can be seen as short-hand for

1685: two transitions, the first of the form $\myscan{Y}{x}{y}{Y_X}$,

1686: where $Y_X$ is a new symbol not already in $\mysym$, and

1687: the second of the form $\myep{X Y_X}{Z}$.

1688:

1689: For the left-corner strategy we have

1690: $\mystrat_{\it LC}(\mygram) =

1691: (\myaut,f)$, where

1692: $\myaut$ differs from the automaton above in the set $\mysym$

1693: of stack symbols

1694: and in the set $\mytrans$ of transitions.

1695: Next to stack symbols $[A \de \alpha \bul \beta]$,

1696: $\mysym$ now also contains

1697: stack symbols of the form

1698: $[A \de \alpha \bul Y \beta; X]$,

1699: where $X$ and $Y$ can be

1700: terminals or nonterminals, and $X \LCstar Y$.

1701: Such a stack symbol on top of the stack indicates that

1702: left corner $X$ of goal $Y$ in the right-hand side of rule

1703: $A \de \alpha Y \beta$ has just been recognized.

1704: The transitions in $\mytrans$ are:

1705: \begin{itemize}

1706: \item $\myscan{[A \de \alpha \bul Y \beta]}{a}{\epsilon}%

1707: 		{[A \de \alpha \bul Y \beta; a]}$

1708: for each rule $A \de \alpha Y \beta$ such that $a \LCstar Y$;

1709: \item $\myscan{[A \de \alpha \bul B \beta]}{\epsilon}{\pi}%

1710: 		{[A \de \alpha \bul B \beta; C]}$

1711: for each pair of rules $A \de \alpha B \beta$ and

1712: $\pi = C \de \epsilon$ such that $C \LCstar B$;

1713: \item $\myep{[A \de \alpha \bul B \beta; X]}%

1714:                 {[A \de \alpha \bul B \beta; C]\ [C\de X \bul\gamma]}$

1715: for each pair of rules

1716: $A \de \alpha B \beta$ and $C \de X \gamma$ such that $C \LCstar B$;

1717: \item $\myscan{[A \de \alpha \bul B \beta; C]\ [C\de \gamma \bul]}{\epsilon}{\pi}

1718: 		{[A \de \alpha \bul B \beta; C]}$

1719: for each pair of rules

1720: $A \de \alpha B \beta$ and $\pi = C \de \gamma$ such that

1721: $C \LCstar B$ and $\gamma \neq \epsilon$;

1722: \item $\myep{[A \de \alpha \bul Y \beta; Y]}%

1723:                 {[A \de \alpha Y \bul \beta]}$

1724: for each rule $A \de \alpha Y \beta$.

1725: \end{itemize}

1726: Since the sequence of rules that such a PDT outputs is a right-most

1727: derivation in reverse, the function $f$ has to rearrange the

1728: output string to obtain a complete derivation. This problem is

1729: discussed in \cite{NI80}.

1730:

1731: The last parsing strategy we discuss is PLR parsing

1732: \cite{SO79}. This is very similar to LC parsing, with the main difference

1733: that the dotted rules are simplified by omitting the part after the dot.

1734: This leads to a `more deterministic' behaviour, as explained by \cite{NE94a}.

1735: Thus, $\mystrat_{\it PLR}(\mygram) =

1736: (\myaut,f)$, where $\myaut$ $=$ $(\myterm,$

1737: $\myrule,$ $\mysym,$ $[S \de\epsilon ],$ $[S \de \sigma],$ $\mytrans)$

1738: and $\mysym$ contains

1739: stack symbols of the form $[A \de \alpha]$,

1740: where $(A\de\alpha\beta)\in \myrule$ for some $\beta$,

1741: or of the form

1742: $[A \de \alpha ; X]$, where

1743: $(A\de\alpha Y \beta)\in \myrule$ for some $Y$ and $\beta$

1744: such that $X \LCstar Y$.

1745: The transitions in $\mytrans$ are:

1746: \begin{itemize}

1747: \item $\myscan{[A \de \alpha ]}{a}{\epsilon}%

1748:                 {[A \de \alpha; a]}$

1749: for each rule $A \de \alpha Y \beta$ such that $a \LCstar Y$;

1750: \item $\myscan{[A \de \alpha ]}{\epsilon}{\pi}%

1751:                 {[A \de \alpha ; C]}$

1752: for each pair of rules $A \de \alpha B \beta$ and

1753: $\pi = C \de \epsilon$ such that $C \LCstar B$;

1754: \item $\myep{[A \de \alpha ; X]}%

1755:                 {[A \de \alpha ; C]\ [C\de X ]}$

1756: for each pair of rules

1757: $A \de \alpha B \beta$ and $C \de X \gamma$ such that $C \LCstar B$;

1758: \item $\myscan{[A \de \alpha ; C]\ [C\de \gamma ]}{\epsilon}{\pi}

1759:                 {[A \de \alpha ; C]}$,

1760: for each pair of rules

1761: $A \de \alpha B \beta$ and $\pi = C \de \gamma$ such that

1762: $C \LCstar B$ and $\gamma \neq \epsilon$;

1763: \item $\myep{[A \de \alpha ; Y]}%

1764:                 {[A \de \alpha Y ]}$

1765: for each rule $A \de \alpha Y \beta$.

1766: \end{itemize}

1767: The function $f$ is the same as in the case of left-corner parsing.

1768: }

1769:

1770: \section{Parsing strategies without SPP}

1771: \label{s:nonstrong}

1772:

1773: In this section

1774: we show that the absence of the strong predictiveness property

1775: may mean that a parsing strategy with the CPP

1776: cannot be extended to become a

1777: probabilistic parsing strategy. We first illustrate this for

1778: LR(0) parsing, formalized as a

1779: parsing strategy $\mystrat_{\it LR}$,

1780: which has the CPP but not the SPP,

1781: as we will see.

1782: We assume the reader is familiar with LR parsing; see \cite{SI90}.

1783:

1784: We take a PCFG $(\mygram, p_{\mygram})$ defined by:

1785: $$

1786: \begin{array}{c@{\;=\;}ll}

1787: \pi_{S} & S \de {\it AB}, & p_{\mygram}(\pi_{S}) = 1 \\[.1ex]

1788: \pi_{A_1} & A \de {\it aC}, & p_{\mygram}(\pi_{A_1}) = \frac{1}{3} \\[.1ex]

1789: \pi_{A_2} & A \de {\it aD}, & p_{\mygram}(\pi_{A_2}) = \frac{2}{3} \\[.1ex]

1790: \pi_{B_1} & B \de {\it bC}, & p_{\mygram}(\pi_{B_1}) = \frac{2}{3} \\[.1ex]

1791: \pi_{B_2} & B \de {\it bD}, & p_{\mygram}(\pi_{B_2}) = \frac{1}{3} \\[.1ex]

1792: \pi_{C} & C \de {\it xc}, & p_{\mygram}(\pi_{C}) = 1 \\[.1ex]

1793: \pi_{D} & D \de {\it xd}, & p_{\mygram}(\pi_{D}) = 1

1794: \end{array}

1795: $$

1796: Note that this grammar generates a finite language.

1797:

1798: We will not present the entire LR automaton $\myaut$,

1799: with $\mystrat_{\it LR}(\mygram) = (\myaut,f)$ for some $f$,

1800: but we merely mention two of its key transitions, which

1801: represent shift actions over $c$ and $d$:

1802: $$

1803: \begin{array}{c@{\;=\;}l}

1804: \tau_{c} & \myscan{\{C\de x\bul c, D\de x\bul d\}}%

1805: 		{c}{\epsilon}%

1806: 		{\{C\de x\bul c, D\de x\bul d\}\ \{C \de xc\bul\}} \\

1807: \tau_{d} & \myscan{\{C\de x\bul c, D\de x\bul d\}}%

1808:                 {d}{\epsilon}%

1809:                 {\{C\de x\bul c, D\de x\bul d\}\ \{D \de xd\bul\}}

1810: \end{array}

1811: $$

1812: (We denote LR states by their sets of kernel items, as usual.)

1813:

1814: Take a probability function $p_{\myaut}$

1815: such that $(\myaut, p_{\myaut})$ is a proper PPDT.

1816: It can be easily seen

1817: that $p_{\myaut}$ must assign 1 to all

1818: transitions except $\tau_{c}$ and $\tau_{d}$, since that is the only

1819: pair of distinct transitions that can be applied for one and the

1820: same top-of-stack symbol,

1821: viz.\ $\{C\de x\bul c,D\de x\bul d\}$.

1822:

1823: However,

1824: $\frac{p_{\mygram}({\it axcbxd})}{p_{\mygram}({\it axdbxc})} =

1825: \frac{p_{\mygram}(\pi_{A_1}) \cdot  p_{\mygram}(\pi_{B_2})}%

1826: {p_{\mygram}(\pi_{A_2}) \cdot  p_{\mygram}(\pi_{B_1})} =

1827: \frac{\frac{1}{3}\cdot\frac{1}{3}}{\frac{2}{3}\cdot\frac{2}{3}} =

1828: \frac{1}{4}$

1829: but

1830: $\frac{p_{\myaut}({\it axcbxd})}{p_{\myaut}({\it axdbxc})} =

1831: \frac{p_{\myaut}(\tau_{c}) \cdot  p_{\myaut}(\tau_{d})}%

1832: {p_{\myaut}(\tau_{d}) \cdot  p_{\myaut}(\tau_{c})} = 1 \neq \frac{1}{4}$.

1833: This shows that there is no $p_{\myaut}$ such that

1834: $(\myaut, p_{\myaut})$ assigns the same

1835: probabilities to strings over $\myterm$ as $(\mygram, p_{\mygram})$.

1836: It follows that

1837: the LR strategy cannot be extended to become a probabilistic

1838: parsing strategy.
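
To make the mismatch concrete, the probabilities of all four strings
in the language can be tabulated. The symbol $q$ is shorthand, introduced
only for this illustration, for $p_{\myaut}(\tau_{c})$, so that
$p_{\myaut}(\tau_{d}) = 1-q$ by properness; all other transitions
have probability 1, as argued above.
$$
\begin{array}{l|c|c}
w & p_{\mygram}(w) & p_{\myaut}(w) \\ \hline
{\it axcbxc} & \frac{1}{3}\cdot\frac{2}{3} = \frac{2}{9} & q^2 \\[.3ex]
{\it axcbxd} & \frac{1}{3}\cdot\frac{1}{3} = \frac{1}{9} & q(1-q) \\[.3ex]
{\it axdbxc} & \frac{2}{3}\cdot\frac{2}{3} = \frac{4}{9} & (1-q)q \\[.3ex]
{\it axdbxd} & \frac{2}{3}\cdot\frac{1}{3} = \frac{2}{9} & (1-q)^2
\end{array}
$$
The two middle entries of the last column coincide for every choice
of $q$, whereas the grammar assigns $\frac{1}{9}$ and $\frac{4}{9}$
to the corresponding strings.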

1839:

1840: Note that for $\mygram$ as above, $p_{\mygram}(\pi_{A_1})$

1841: and $p_{\mygram}(\pi_{B_1})$ can be freely chosen, and this

1842: choice determines the other values of $p_{\mygram}$, so we have

1843: two free parameters. For $\myaut$ however, there is only one

1844: free parameter in the choice of $p_{\myaut}$.

1845: This is in conflict with an underlying assumption of existing work

1846: on probabilistic LR parsing, by e.g.\ \cite{BR93} and \cite{IN00},

1847: viz.\ that LR parsers would allow more fine-grained probability

1848: distributions than CFGs. However, for some practical grammars

1849: from the area of natural language processing,

1850: \cite{SO99} has shown that LR parsers do allow

1851: more accurate probability distributions than the CFGs from which

1852: they were constructed, if probability functions are estimated from

1853: corpora.
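
The parameter count can be spelled out as follows, where
$r_A = p_{\mygram}(\pi_{A_1})$, $r_B = p_{\mygram}(\pi_{B_1})$ and
$q = p_{\myaut}(\tau_{c})$ are abbreviations used only in this
illustration:
$$
\begin{array}{r@{\;=\;}l@{\qquad}r@{\;=\;}l}
p_{\mygram}({\it axcbxd}) & r_A (1-r_B), &
p_{\mygram}({\it axdbxc}) & (1-r_A)\, r_B, \\[.3ex]
p_{\myaut}({\it axcbxd}) & q (1-q), &
p_{\myaut}({\it axdbxc}) & (1-q)\, q.
\end{array}
$$
Agreement on these two strings requires
$r_A (1-r_B) = (1-r_A)\, r_B$, that is, $r_A = r_B$.
In that case the choice $q = r_A$ reproduces the probabilities of all
four strings in the language, so for this grammar the automaton can
match exactly those PCFGs with $r_A = r_B$.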

1854:

1855: By way of Theorem~\ref{t:sp}, it follows indirectly from

1856: the above that LR parsing lacks the SPP.

1857: For the somewhat simpler ELR parsing strategy,

1858: to be discussed next,

1859: we will give a direct explanation of why it lacks the SPP.

1860: A direct explanation for LR parsing is much more involved and

1861: therefore is not reported here, although the argument is essentially

1862: of the same nature as the one we discuss for ELR parsing.

1863:

1864: The ELR parsing strategy is not as well-known as LR parsing.

1865: It was originally

1866: formulated as a parsing strategy for extended CFGs \cite{PU81,LE89},

1867: but its restriction to normal CFGs is interesting in its

1868: own right, as argued by \cite{NE94a}.

1869: %MJ added:

1870: ELR parsing for CFGs is also related to the tabular algorithm

1871: from \cite{VO88}.

1872:

1873: Concerning the representation of right-hand sides of rules,

1874: stack symbols

1875: for ELR parsing are similar to those for PLR parsing:

1876: only the part of a right-hand side is represented

1877: that consists of the grammar symbols that have been processed.

1878: Different from LC and PLR parsing is however that a

1879: stack symbol for ELR parsing contains

1880: a set consisting of one or more nonterminals from

1881: the left-hand sides of pairwise similar rules,

1882: rather than a single such nonterminal.

1883: This allows the commitment to certain rules,

1884: and in particular to their left-hand sides, to be

1885: postponed even longer than for LC and PLR parsing.

1886:

1887: Thus, for a given CFG $\mygram  = (\myterm,$ $\mynont,$ $S,$ $\myrule)$,

1888: we construct a pair $\mystrat_{\it ELR}(\mygram) =

1889: (\myaut,f)$. Here $\myaut$ $=$ $(\myterm,$

1890: $\myrule,$ $\mysym,$ $[\{S\} \de\epsilon],$ $[\{S\} \de \sigma],$ $\mytrans)$,

1891: where $\mysym$ is a subset of $\{ [\mynontset \de \alpha ]\  |\

1892: \mynontset \subseteq \mynont \wedge

1893: \forall A \in \mynontset \exists \beta[(A \de\alpha\beta) \in \myrule] \}$ $\cup$

1894: $\{ [\mynontset \de \alpha; B]\  |\

1895: \mynontset \subseteq\nobreak \mynont \wedge

1896: \forall A \in\nobreak\mynontset

1897: \exists\beta[(A \de\alpha\beta) \in \myrule

1898: \wedge B \in \mynont] \}$.

1899:

1900: We provide simultaneous inductive definitions of $\mysym$ and

1901: $\mytrans$:

1902: \begin{itemize}

1903: \item $[\{S\} \de\epsilon]\in \mysym$;

1904: \item For $[\mynontset \de \alpha ] \in \mysym$,

1905: rule $A \de \alpha Y \beta$ and $a\in\myterm$ such that

1906: $A \in \mynontset$ and $a \LCstar Y$, let

1907: $[\mynontset \de \alpha; a] \in \mysym$ and

1908: $\myscan{[\mynontset \de \alpha ]}{a}{\epsilon}%

1909:                 {[\mynontset \de \alpha; a]} \in \mytrans$;

1910: \item For $[\mynontset \de \alpha] \in \mysym$,

1911: rules $A \de \alpha B \beta$ and

1912: $\pi = C \de \epsilon$ such that $A\in\mynontset$ and

1913: $C \LCstar B$, let

1914: $[\mynontset \de \alpha ; C] \in \mysym$ and

1915: $\myscan{[\mynontset \de \alpha ]}{\epsilon}{\pi}%

1916:                 {[\mynontset \de \alpha ; C]} \in \mytrans$;

1917: \item For $[\mynontset_1 \de \alpha ; X] \in \mysym$ and

1918: $\mynontset_2 = \{ C\ |\

1919: \exists (A \de \alpha B \beta)\in\myrule[A\in \mynontset_1 \wedge

1920: C \de X \gamma \wedge C\LCstar B] \} \neq \emptyset$, let

1921: $[\mynontset_2\de X ]\in \mysym$ and

1922: $\myep{[\mynontset_1 \de \alpha ; X]}%

1923:                 {[\mynontset_1 \de \alpha;X]\ [\mynontset_2\de X ]} \in \mytrans$;

1924: \item For $[\mynontset_1 \de \alpha;X], [\mynontset_2\de X\gamma ] \in \mysym$,

1925: rules $A \de \alpha B \beta$ and $\pi = C \de X\gamma$ such that

1926: $A \in \mynontset_1$, $C \in \mynontset_2$ and $C \LCstar B$, let

1927: $[\mynontset_1 \de \alpha; C] \in \mysym$ and

1928: $\myscan{[\mynontset_1 \de \alpha;X]\ [\mynontset_2\de X\gamma ]}{\epsilon}{\pi}

1929:                 {[\mynontset_1 \de \alpha; C]} \in \mytrans$;

1930: \item For $[\mynontset_1 \de \alpha ; Y] \in \mysym$ and

1931: $\mynontset_2 = \{ A\in\mynontset_1\ |\

1932: \exists \beta[(A \de \alpha Y \beta)\in\myrule] \}

1933: \neq \emptyset$, let

1934: $[\mynontset_2 \de \alpha Y ] \in \mysym$ and

1935: $\myep{[\mynontset_1 \de \alpha ; Y]}%

1936:                 {[\mynontset_2 \de \alpha Y ]} \in \mytrans$.

1937: \end{itemize}

1938: Note that the last five items are very similar to the five items

1939: for LC parsing. In the second-to-last item, we have assumed

1940: the availability of combined pop/swap transitions of the form

1941: $\myscan{X Y}{x}{y}{Z}$. Such a transition can be seen as short-hand for

1942: two transitions, the first of the form $\myep{X Y}{Z_{x,y}}$,

1943: where $Z_{x,y}$ is a new symbol not already in $\mysym$, and

1944: the second of the form $\myscan{Z_{x,y}}{x}{y}{Z}$.

1945:

1946: The function $f$ is defined as in the case of PLR parsing, and

1947: turns a complete right-most derivation in

1948: reverse into a complete derivation.

1949:

1950: ELR parsing has the CPP but, like LR parsing,

1951: it lacks the SPP. The problem is caused by

1952: transitions of the form

1953: $\myscan{[\mynontset_1 \de \alpha;X]\ [\mynontset_2\de X\gamma ]}{\epsilon}{\pi}

1954:                 {[\mynontset_1 \de \alpha; C]}$.

1955: Intuitively, a subcomputation that recognizes $\gamma$,

1956: directly after recognition of $X$, only commits to

1957: a choice of the left-hand side nonterminal $C$ from

1958: $\mynontset_2$ after $\gamma$ has been

1959: completely recognized, and this choice is communicated

1960: to lower areas of the stack through this pop transition.

1961:

1962: \begin{figure}[tp]

1963: $$

1964: \begin{array}{cl}

1965: \comment{\tau_{a}} & \myscan{[\{S\}\de\epsilon]}{a}{\epsilon}{[\{S\}\de\epsilon;a]} \\

1966: \comment{\tau_{A}} & \myep{[\{S\}\de\epsilon;a]}{[\{S\}\de\epsilon;a]\ [\{A\}\de a]} \\

1967: \comment{\tau_{A/x}} & \myscan{[\{A\}\de a]}{x}{\epsilon}{[\{A\}\de a; x]} \\

1968: \comment{\tau'_{A/x}} & \myep{[\{A\}\de a;x]}{[\{A\}\de a;x]\ [\{C,D\}\de x]} \\

1969: \tau_{c}\ = & \myscan{[\{C,D\}\de x]}{c}{\epsilon}{[\{C,D\}\de x; c]} \\

1970: \tau_{d}\ = & \myscan{[\{C,D\}\de x]}{d}{\epsilon}{[\{C,D\}\de x; d]} \\

1971: \comment{\tau_{C}} & \myep{[\{C,D\}\de x;c]}{[\{C\}\de xc]} \\

1972: \comment{\tau_{A/C}} &  \myscan{[\{A\}\de a;x]\ [\{C\}\de xc]}%

1973: 		{\epsilon}{\pi_{C}}{[\{A\}\de a; C]} \\

1974: \comment{\tau'_{A/C}} & \myep{[\{A\}\de a; C]}{[\{A\}\de a C]} \\

1975: \comment{\tau_{A_1}} & \myscan{[\{S\}\de\epsilon;a]\ [\{A\}\de aC]}%

1976: 		{\epsilon}{\pi_{A_1}}{[\{S\}\de \epsilon; A]} \\

1977: \comment{\tau_{D}} & \myep{[\{C,D\}\de x;d]}{[\{D\}\de xd]}  \\

1978: \comment{\tau_{A/D}} &  \myscan{[\{A\}\de a;x]\ [\{D\}\de xd]}%

1979: 		{\epsilon}{\pi_{D}}{[\{A\}\de a; D]} \\

1980: \comment{\tau'_{A/D}} & \myep{[\{A\}\de a; D]}{[\{A\}\de a D]} \\

1981: \comment{\tau_{A_2}} & \myscan{[\{S\}\de\epsilon;a]\ [\{A\}\de aD]}%

1982: 		{\epsilon}{\pi_{A_2}}{[\{S\}\de \epsilon; A]} \\

1983: \comment{\tau_{S/A}} & \myep{[\{S\}\de \epsilon; A]}{[\{S\}\de A]} \\

1984: \comment{\tau_{b}} & \myscan{[\{S\}\de A]}{b}{\epsilon}{[\{S\}\de A; b]} \\

1985: \comment{\tau_{B}} & \myep{[\{S\}\de A;b]}{[\{S\}\de A;b]\ [\{B\}\de b]} \\

1986: \comment{\tau_{B/x}} & \myscan{[\{B\}\de b]}{x}{\epsilon}{[\{B\}\de b; x]} \\

1987: \comment{\tau'_{B/x}} & \myep{[\{B\}\de b;x]}{[\{B\}\de b;x]\ [\{C,D\}\de x]} \\

1988: \comment{\tau_{B/C}} &  \myscan{[\{B\}\de b;x]\ [\{C\}\de xc]}%

1989: 		{\epsilon}{\pi_{C}}{[\{B\}\de b; C]} \\

1990: \comment{\tau'_{B/C}} & \myep{[\{B\}\de b; C]}{[\{B\}\de b C]} \\

1991: \comment{\tau_{B_1}} & \myscan{[\{S\}\de A;b]\ [\{B\}\de bC]}%

1992: 		{\epsilon}{\pi_{B_1}}{[\{S\}\de A;B]} \\

1993: \comment{\tau_{B/D}} &  \myscan{[\{B\}\de b;x]\ [\{D\}\de xd]}%

1994: 		{\epsilon}{\pi_{D}}{[\{B\}\de b; D]} \\

1995: \comment{\tau'_{B/D}} & \myep{[\{B\}\de b; D]}{[\{B\}\de b D]} \\

1996: \comment{\tau_{B_2}} & \myscan{[\{S\}\de A;b]\ [\{B\}\de bD]}%

1997: 		{\epsilon}{\pi_{B_2}}{[\{S\}\de A;B]} \\

1998: \comment{\tau_{S/B}} & \myep{[\{S\}\de A; B]}{[\{S\}\de AB]} \\

1999: \end{array}

2000: $$

2001: \caption{Transitions for ELR parsing strategy.}

2002: \label{f:ELRtrans}

2003: \end{figure}

2004:

2005: That ELR parsing indeed cannot be extended to a probabilistic

2006: parsing strategy can be shown by considering the same

2007: CFG as above. From the set of transitions, shown in

2008: Figure~\ref{f:ELRtrans},

2009: we restrict our attention to the following two:

2010: $$

2011: \begin{array}{c@{\;=\;}l}

2012: \tau_{c} & \myscan{[\{C,D\}\de x]}{c}{\epsilon}{[\{C,D\}\de x; c]} \\

2013: \tau_{d} & \myscan{[\{C,D\}\de x]}{d}{\epsilon}{[\{C,D\}\de x; d]}

2014: \end{array}

2015: $$

2016: This is the only pair of transitions that can be applied for

2017: one and the same top-of-stack symbol.

2018: The rest of the proof is identical to that in the case of

2019: LR parsing.

2020:

2021: Problems with the extension of ELR parsing to become

2022: a probabilistic parsing

2023: strategy have been pointed out before by \cite{TE97},

2024: who furthermore proposed an alternative type of probabilistic

2025: push-down automaton that is capable of computing multiple

2026: probabilities for each subderivation.

2027: However, since a transition of such an automaton may perform an

2028: unbounded number of elementary computations on probabilities, we

2029: feel this automaton model cannot realistically express

2030: the behaviour of probabilistic parsers,

2031: and therefore it will not be considered further here.

2032:

2033: \comment{

2034: We show here that the absence of strong predictiveness

2035: may mean that a parsing strategy cannot be extended to a

2036: probabilistic parsing strategy. We illustrate this by two different

2037: non-strongly predictive parsing strategies $\mystrat=\mystrat_{\it ELR}$

2038: and $\mystrat=\mystrat_{\it LR}$. In each case, we present

2039: a PCFG $(\mygram, p_{\mygram})$

2040: such that

2041: $\mystrat(\mygram) = (\myaut, f)$ and no probability function $p_{\myaut}$

2042: for $\myaut$

2043: can be found such that $(\myaut, p_{\myaut})$ assigns the same

2044: probabilities to strings as $(\mygram, p_{\mygram})$

2045:

2046: The reason we consider ELR parsing is that it is the

2047: first strategy in a family of parsing strategies,

2048: following LC and PLR parsing, that is not strongly predictive.\footnote{%

2049: This family was discussed before in \cite{NE94a}.}

2050: In particular, the comparison

2051: between PLR and ELR parsing helps to clarify the problem that the absence

2052: of strong predictiveness poses for extending parsing

2053: strategies to the probabilistic case.

2054: However, we also

2055: treat the more complicated LR parsing strategy \cite{SI90} since that

2056: is better known than ELR parsing.

2057:

2058: The ELR strategy results in $\mystrat(\mygram)=(\myaut,f)$, where

2059: $\myaut$ $=$ $(\myterm,$

2060: $\myrule,$ $\mysym,$ $[\{S\} \de\epsilon],$ $[\{S\} \de AB],$ $\mytrans)$

2061: and $\mytrans$ contains:

2062:

2063: and $\mysym$ is the set of the stack symbols that occur

2064: in the above transitions.

2065:

2066: Take a probability function $p_{\myaut}$

2067: such that $(\myaut, p_{\myaut})$ is a proper PPDT.

2068: It can be shown that $p_{\myaut}$ must assign 1 to all

2069: transitions except $\tau_{c}$ and $\tau_{d}$, since that is the only

2070: pair of distinct transitions that can be applied for one and the

2071: same top-of-stack symbol,

2072: viz.\ $[\{C,D\}\de x]$.

2073:

2074: However,

2075: $\frac{p_{\mygram}({\it axcbxd})}{p_{\mygram}({\it axdbxc})} =

2076: \frac{p_{\mygram}(\pi_{A_1}) \cdot  p_{\mygram}(\pi_{B_2})}%

2077: {p_{\mygram}(\pi_{A_2}) \cdot  p_{\mygram}(\pi_{B_1})} =

2078: \frac{(\frac{4}{10})^2}{(\frac{6}{10})^2} = \frac{4}{9}$

2079: but

2080: $\frac{p_{\myaut}({\it axcbxd})}{p_{\myaut}({\it axdbxc})} =

2081: \frac{p_{\myaut}(\tau_{c}) \cdot  p_{\myaut}(\tau_{d})}%

2082: {p_{\myaut}(\tau_{d}) \cdot  p_{\myaut}(\tau_{c})} = 1 \neq \frac{4}{9}$.

2083: This shows that there is no $p_{\myaut}$ such that

2084: $(\myaut, p_{\myaut})$ assigns the same

2085: probabilities to strings over $\myterm$ as $(\mygram, p_{\mygram})$.

2086: It follows that

2087: the ELR strategy cannot be extended to be a probabilistic

2088: parsing strategy.

2089:

2090: The LR strategy can also be cast in a form that

2091: satisfies our normal form PDTs. We will not give a complete

2092: specification of LR parsing since much existing literature,

2093: such as \cite{SI90}, already contains such specifications.

2094: We will assume below that the reader is familiar with this literature.

2095:

2096: We will apply the LR(0) strategy to this CFG.

2097: Applying the LR(0) strategy to the CFG above, we obtain the following

2098: PDT.

2099: This is very similar to the PDT we obtained in the case of ELR

2100: parsing.

2101: As usual, we denote LR states by a set of kernel items,

2102: which are `dotted' rules.

2103: Since our type of pop transition only allows

2104: a pop of one symbol at a time, we have to split up a reduction of

2105: a rule $A \de X_1 \cdots X_m$ into

2106: a sequence of $m+1$ transitions, the first $m-1$ resulting in stack symbols

2107: $[A \de X_1 \cdots X_m \bul]$, $[A \de X_1 \cdots X_{m-1} \bul X_m]$, \ldots,

2108: $[A \de X_1 \bul X_2 \cdots X_m]$ on top of the stack, the next

2109: resulting in a top-of-stack $[W;A]$, where $W$ is a set of dotted rules,

2110: and lastly the usual `goto' set of $W$ and $A$ is pushed.

2111:

2112: We have

2113: $\Xinit = \{S \de\ \bul AB\}$ and

2114: $\Xinit = [S \de\ \bul AB]$ and the set $\mytrans$ of transitions is given

2115: in Figure~\ref{f:LRtrans}.

2116: %begin

2117: In order to simplify the presentation, we allow two new types of

2118: transition, without increasing the power of PDTs.

2119: The first is a combined swap/push transition of the form

2120: $\myscan{X}{x}{y}{Z Y}$. Such a transition can be seen as short-hand for

2121: two transitions, the first of the form $\myscan{X}{x}{y}{Z_Y}$,

2122: where $Z_Y$ is a new symbol not already in $\mysym$, and

2123: the second of the form $\myep{Z_Y}{Z_Y Y}$.

2124: We also assume the existence of a

2125: transition $\myep{Z_Y Y'}{X'}$

2126: for each transition $\myep{Z Y'}{X'}$ that is actually specified.

2127: The second new type of transition

2128: is a combined swap/pop transition of the form

2129: $\myscan{X Y}{x}{y}{Z}$. Such a transition can be seen as short-hand for

2130: two transitions, the first of the form $\myscan{Y}{x}{y}{Y_X}$,

2131: where $Y_X$ is a new symbol not already in $\mysym$, and

2132: the second of the form $\myep{X Y_X}{Z}$.

2133:

2134: \begin{figure*}

2135: $$

2136: \begin{array}{c@{\;=\;}l}

2137: \tau_{a} & \myscan{\{S\de\ \bul AB\}}{a}{\epsilon}%

2138: 	{\{S\de\ \bul AB\}\ \{A\de a\bul C, A\de a\bul D\}} \\

2139: \tau_{A_1} & \myscan{\{S\de\ \bul AB\}\ [A\de a\bul C]}%

2140:                 {\epsilon}{\pi_{A_1}}{[\{S\de\ \bul A B\};A]} \\

2141: \tau_{A_2} & \myscan{\{S\de\ \bul AB\}\ [A\de a\bul D]}%

2142:                 {\epsilon}{\pi_{A_2}}{[\{S\de\ \bul A B\};A]} \\

2143: \tau_{S/A} & \myscan{[\{S\de\ \bul AB\}; A]}%

2144: 		{\epsilon}{\epsilon}{\{S\de\ \bul AB\}\ \{S\de A \bul B\}} \\

2145: \tau_{b} & \myscan{\{S\de A\bul B\}}{b}{\epsilon}%

2146: 		{\{S\de A\bul B\}\ \{B\de b\bul C, B\de b\bul D\}} \\

2147: \tau_{B_1} & \myscan{\{S\de A\bul B\}\ [B\de b\bul C\}}%

2148:                 {\epsilon}{\pi_{B_1}}{[\{S\de A \bul B\}; B]} \\

2149: \tau_{B_2} & \myscan{\{S\de A\bul B\}\ [B\de b\bul D\}}%

2150:                 {\epsilon}{\pi_{B_2}}{[\{S\de A \bul B\}; B]} \\

2151: \tau_{S/B} & \myscan{[\{S\de A\bul B\}; B]}%

2152: 		{\epsilon}{\epsilon}{\{S\de A \bul B\}\ \{S\de A B \bul\}} \\

2153: \tau_{S} & \myep{\{S\de A B \bul\}}{[S\de A B \bul]} \\

2154: \tau_{S'} & \myep{\{S\de A \bul B\}\ [S\de A B \bul]}%

2155: 		{[S\de A \bul B ]} \\

2156: \tau_{S''} & \myscan{\{S\de\ \bul A B\}\ [S\de A\bul B ]}%

2157: 		{\epsilon}{\pi_{S}}{[S\de\ \bul A B ]} \\

2158: \tau_{A/x} & \myscan{\{A\de a\bul C, A\de a\bul D\}}%

2159: 		{x}{\epsilon}%

2160: 		{\{A\de a\bul C, A\de a\bul D\}\ \{C\de x\bul c, D\de x\bul d\}} \\

2161: \tau_{A/C} & \myscan{\{A\de a\bul C, A\de a\bul D\}\ [C\de x\bul c]}%

2162: 		{\epsilon}{\pi_{C}}{[\{A\de a\bul C, A\de a\bul D\};C]} \\

2163: \tau'_{A_1} & \myscan{[\{A\de a\bul C, A\de a\bul D\};C]}%

2164: 		{\epsilon}{\epsilon}%

2165: 		{\{A\de a\bul C, A\de a\bul D\}\ \{A\de a C\bul\}} \\

2166: \tau''_{A_1} & \myep{\{A\de a C\bul\}}{[A\de a C\bul]} \\

2167: \tau'''_{A_1} & \myep{\{A\de a\bul C, A\de a\bul D\}\ [A\de a C\bul]}%

2168: 		{[A\de a \bul C]} \\

2169: \tau_{A/D} & \myscan{\{A\de a\bul C, A\de a\bul D\}\ [D\de x\bul d]}%

2170: 		{\epsilon}{\pi_{D}}{[\{A\de a\bul C, A\de a\bul D\};D]} \\

2171: \tau'_{A_2} & \myscan{[\{A\de a\bul C, A\de a\bul D\};D]}%

2172: 		{\epsilon}{\epsilon}%

2173: 		{\{A\de a\bul C, A\de a\bul D\}\ \{A\de a D\bul\}} \\

2174: \tau''_{A_2} & \myep{\{A\de a D\bul\}}{[A\de a D\bul]} \\

2175: \tau'''_{A_2} & \myep{\{A\de a\bul C, A\de a\bul D\}\ [A\de a D\bul]}%

2176: 		{[A\de a \bul D]} \\

2177:

2178: \tau_{B/x} & \myscan{\{B\de b\bul C, B\de b\bul D\}}%

2179:                 {x}{\epsilon}%

2180:                 {\{B\de b\bul C, B\de b\bul D\}\ \{C\de x\bul c, D\de x\bul d\}} \\

2181: \tau_{B/C} & \myscan{\{B\de b\bul C, B\de b\bul D\}\ [C\de x\bul c]}%

2182:                 {\epsilon}{\pi_{C}}{[\{B\de b\bul C, B\de b\bul D\};C]} \\

2183: \tau'_{B_1} & \myscan{[\{B\de b\bul C, B\de b\bul D\};C]}%

2184:                 {\epsilon}{\epsilon}%

2185:                 {\{B\de b\bul C, B\de b\bul D\}\ \{B\de b C\bul\}} \\

2186: \tau''_{B_1} & \myep{\{B\de b C\bul\}}{[B\de b C\bul]} \\

2187: \tau'''_{B_1} & \myep{\{B\de b\bul C, B\de b\bul D\}\ [B\de b C\bul]}%

2188:                 {[B\de b \bul C]} \\

2189: \tau_{B/D} & \myscan{\{B\de b\bul C, B\de b\bul D\}\ [D\de x\bul d]}%

2190:                 {\epsilon}{\pi_{D}}{[\{B\de b\bul C, B\de b\bul D\};D]} \\

2191: \tau'_{B_2} & \myscan{[\{B\de b\bul C, B\de b\bul D\};D]}%

2192:                 {\epsilon}{\epsilon}%

2193:                 {\{B\de b\bul C, B\de b\bul D\}\ \{B\de b D\bul\}} \\

2194: \tau''_{B_2} & \myep{\{B\de b D\bul\}}{[B\de b D\bul]} \\

2195: \tau'''_{B_2} & \myep{\{B\de b\bul C, B\de b\bul D\}\ [B\de b D\bul]}%

2196:                 {[B\de b \bul D]} \\

2197: \tau_{c} & \myscan{\{C\de x\bul c, D\de x\bul d\}}%

2198: 		{c}{\epsilon}%

2199: 		{\{C\de x\bul c, D\de x\bul d\}\ \{C \de xc\bul\}} \\

2200: \tau_{C} & \myep{\{C \de xc\bul\}}{[C \de xc\bul]} \\

2201: \tau'_{C} & \myep{\{C\de x\bul c, D\de x\bul d\}\ [C \de xc\bul]}%

2202: 		{[C \de x\bul c]} \\

2203: \tau_{d} & \myscan{\{C\de x\bul c, D\de x\bul d\}}%

2204: 		{d}{\epsilon}%

2205: 		{\{C\de x\bul c, D\de x\bul d\}\ \{D \de xd\bul\}} \\

2206: \tau_{D} & \myep{\{D \de xd\bul\}}{[D \de xd\bul]} \\

2207: \tau'_{D} & \myep{\{C\de x\bul c, D\de x\bul d\}\ [D \de xd\bul]}%

2208: 		{[D \de x\bul d]}

2209: \end{array}

2210: $$

2211: \caption{The set of transitions for the LR strategy.}

2212: \label{f:LRtrans}

2213: \end{figure*}

2214:

2215: As in the case of ELR parsing, there are only two transitions,

2216: viz.\ $\tau_{c}$ and $\tau_{d}$, to which a probability function

2217: $p_{\myaut}$ can assign a value different from 1.

2218: Again, $\frac{p_{\myaut}({\it axcbxd})}{p_{\myaut}({\it axdbxc})} =

2219: \frac{p_{\myaut}(\tau_{c}) \cdot  p_{\myaut}(\tau_{d})}%

2220: {p_{\myaut}(\tau_{d}) \cdot  p_{\myaut}(\tau_{c})} = 1 \neq \frac{4}{9}$.

2221: This shows that also the LR strategy

2222: cannot be extended to be a probabilistic parsing strategy.

2223: }

2224:

2225:

2226: \section{Extension in the wide sense}

2227: \label{s:wide}

2228:

2229: The main result from the previous section is that,

2230: in general,

2231: there is no construction of probabilistic LR parsers

2232: from PCFGs such that,

2233: firstly, a probabilistic LR parser has the same set of

2234: transitions as the LR parser that would be constructed from the CFG in

2235: the non-probabilistic case and,

2236: secondly, the probabilistic LR parser

2237: has the same probability distribution as the given PCFG.

2238:

2239: There is a construction proposed by \cite{WR91,WR91a,NG91}

2240: that operates under different assumptions. In particular, a

2241: probabilistic LR parser constructed from a certain PCFG

2242: may possess several `copies' of one and the same

2243: LR state from the (non-probabilistic) LR parser constructed from

2244: the CFG,

2245: each annotated with some additional information to

2246: distinguish it from other copies of the same LR state.

2247: Each such copy behaves as the corresponding LR state from the

2248: LR parser if we neglect probabilities.

2249: Transitions may however

2250: obtain different probabilities if they operate on different copies

2251: of identical LR states, based on the additional information

2252: attached to the LR states.

2253:

2254: By this construction,

2255: there are many PCFGs for which one may obtain a

2256: probabilistic LR parser that describes the same

2257: probability distribution. This even holds

2258: for the PCFG we discussed in the previous section, although

2259: we have shown that a probabilistic LR parser {\em without\/}

2260: an extended LR state set could not describe the same

2261: probability distribution.

2262: A serious problem with this approach is however that

2263: the required number of copies of each LR state is potentially infinite.

2264:

2265: In this section we formulate these observations in terms of

2266: general parsing strategies and a wider notion of

2267: extension to probabilistic parsing strategies. We also

2268: show that the above-mentioned problem with

2269: infinite numbers of states is inherent in LR parsing, rather

2270: than due to the particular construction of LR parsers from

2271: PCFGs by \cite{WR91,WR91a,NG91}.

2272:

2273: We first introduce some auxiliary notation and terminology.

2274: Let $\myaut$ and $\myaut'$ be two PDTs and

2275: let $g$ be a function mapping

2276: the stack symbols of $\myaut'$

2277: to the stack symbols of $\myaut$.

2278: If $\tau$ is a transition of the form $\myep{X}{X Y}$,

2279: $\myep{\it Y X}{Z}$ or $\myscan{X}{x}{y}{Y}$ from $\myaut'$,

2280: then we let $g(\tau)$ denote a transition of the form

2281: $\myep{g(X)}{g(X) g(Y)}$,

2282: $\myep{\it g(Y) g(X)}{g(Z)}$ or $\myscan{g(X)}{x}{y}{g(Y)}$, respectively.

2283: This effectively extends $g$ to a function from transitions to

2284: transitions.

2285: Note that a transition $g(\tau)$ may, but need not, be a

2286: transition from $\myaut$.

2287: In the same vein, we extend $g$ to

2288: a function from computations of $\myaut'$ to

2289: sequences of transitions (which may, but need not, be

2290: computations of $\myaut$),

2291: by applying $g$ element-wise as a function on transitions.

2292:

2293: For PDTs $\myaut$ $=$

2294: $(\myterm_1,$ $\myterm_2,$ $\mysym,$ $\Xinit,$ $\Xfinal,$ $\mytrans)$

2295: and $\myaut'$ $=$

2296: $(\myterm'_1,$ $\myterm'_2,$ $\mysym',$ $\Xinit',$ $\Xfinal',$ $\mytrans')$,

2297: we say

2298: $\myaut'$

2299: is an {\em expansion\/} of $\myaut$

2300: if $\myterm'_1=\myterm_1$, $\myterm'_2 = \myterm_2$ and there is a function

2301: $g$ such that:

2302: \begin{itemize}

2303: \item $g$ is a surjective function from $\mysym'$ to $\mysym$.

2304: \item Extended to transitions,

2305: $g$ is a surjective function from $\mytrans'$ to $\mytrans$.

2306: \item Extended to computations,

2307: $g$ is a bijective function from the set of computations of

2308: $\myaut'$ to the set of computations of $\myaut$.

2309: \end{itemize}

2310: In other words, for each stack symbol from $\mysym$,

2311: $\mysym'$ may contain one or more corresponding

2312: stack symbols. The language that

2313: is accepted and the output strings that are produced for given input

2314: strings remain the same however. Furthermore, that $g$ is a bijection

2315: on computations implies that the behaviour of the two

2316: automata is identical in terms of e.g.\ the length of

2317: computations and the amount of nondeterminism encountered within

2318: those computations.

2319:

2320: To illustrate these definitions, assume we have an arbitrary

2321: PDT $\myaut$. We construct a second PDT $\myaut'$ that is an

2322: expansion of $\myaut$. It has the

2323: same input and output alphabets, and for each stack symbol

2324: $X$ from $\myaut$, $\myaut'$ has two stack symbols $(X,0)$ and

2325: $(X,1)$. A second component $0$ signifies that the distance

2326: of the stack symbol to the bottom of the

2327: stack is even, and $1$ that it is odd.

2328: Naturally, if $\Xinit$ and $\Xfinal$ are the initial and final stack symbols

2329: of $\myaut$, we choose the initial and final stack symbols of $\myaut'$ to be

2330: $(\Xinit,0)$ and $(\Xfinal,0)$, as they have distance 0 to the

2331: bottom of the stack.

2332: For each transition of the form $\myep{X}{X Y}$,

2333: $\myep{\it Y X}{Z}$ or $\myscan{X}{x}{y}{Y}$ from $\myaut$,

2334: we let $\myaut'$ have the transitions

2335: $\myep{(X,i)}{(X,i) (Y,1-i)}$,

2336: $\myep{(Y,i) (X,1-i)}{(Z,i)}$ or $\myscan{(X,i)}{x}{y}{(Y,i)}$,

2337: respectively, for both $i=0$ and $i=1$.

2338: Obviously, the function $g$ mapping stack symbols

2339: from $\myaut'$ to stack symbols from $\myaut$ is given

2340: by $g((X,i))=X$ for all $X$ and $i\in\{0,1\}$.
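
As a small illustration of this construction (the stack symbols $Y$, $Z$
and $W$ below are hypothetical and serve only as an example), suppose
$\myaut$ has the transitions $\myep{\Xinit}{\Xinit\ Y}$,
$\myscan{Y}{a}{\epsilon}{Z}$ and $\myep{\Xinit\ Z}{W}$.
A computation of $\myaut'$ that starts with stack $(\Xinit,0)$ then uses
the transitions
$$
\begin{array}{l}
\myep{(\Xinit,0)}{(\Xinit,0)\ (Y,1)} \\
\myscan{(Y,1)}{a}{\epsilon}{(Z,1)} \\
\myep{(\Xinit,0)\ (Z,1)}{(W,0)}
\end{array}
$$
Applying $g$ to each of these, that is, erasing the second components,
gives back the three transitions of $\myaut$ in the same order.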

2341:

2342: We now come to the central definition of this section.

2343: We say that probabilistic parsing strategy $\mystrat'$

2344: is an {\em extension in the wide sense\/} of parsing strategy

2345: $\mystrat$ if for each reduced CFG $\mygram$ and

2346: probability function $p_{\mygram}$ we have

2347: $\mystrat(\mygram)=(\myaut, f)$ if and only if

2348: $\mystrat'(\mygram, p_{\mygram})=(\myaut', p_{\myaut'}, f)$

2349: for some $\myaut'$ that is an expansion of $\myaut$

2350: and some $p_{\myaut'}$. This definition allows more

2351: probabilistic parsing strategies $\mystrat'$ to be related to a given

2352: strategy $\mystrat$ than the definition of extension from

2353: Section~\ref{s:strategy}.

2354:

2355: However, LR parsing, which we know cannot be extended to a

2356: probabilistic strategy in the narrow sense from Section~\ref{s:strategy},

2357: cannot be

2358: extended in the wide sense to a probabilistic parsing strategy either.

2359: To prove this,

2360: consider the following PCFG $(\mygram,p_{\mygram})$,

2361: taken from \cite{WR91} with minor modifications:

2362: $$

2363: \begin{array}{c@{\;=\;}ll}

2364: \pi_{S} & S \de A, & p_{\mygram}(\pi_{S}) = 1 \\[.1ex]

2365: \pi_{A_1} & A \de B, & p_{\mygram}(\pi_{A_1}) = \frac{1}{2} \\[.1ex]

2366: \pi_{A_2} & A \de C, & p_{\mygram}(\pi_{A_2}) = \frac{1}{2} \\[.1ex]

2367: \pi_{B_1} & B \de {\it aB}, & p_{\mygram}(\pi_{B_1}) = \frac{1}{3} \\[.1ex]

2368: \pi_{B_2} & B \de {\it b}, & p_{\mygram}(\pi_{B_2}) = \frac{2}{3} \\[.1ex]

2369: \pi_{C_1} & C \de {\it aC}, & p_{\mygram}(\pi_{C_1}) = \frac{2}{3} \\[.1ex]

2370: \pi_{C_2} & C \de {\it c}, & p_{\mygram}(\pi_{C_2}) = \frac{1}{3}

2371: \end{array}

2372: $$

2373: The CFG $\mygram$ generates strings of the form $a^n b$ and $a^n c$ for

2374: any $n \geq 0$. Observe that

2375: $\frac{p_{\mygram}(a^n b)}{p_{\mygram}(a^n c)}$ $=$

2376: $\frac{ \frac{1}{2} \cdot

2377: 		\left(\frac{1}{3}\right)^{n} \cdot \frac{2}{3} }{

2378: 	\frac{1}{2} \cdot

2379:                 \left(\frac{2}{3}\right)^{n} \cdot \frac{1}{3} }$ $=$

2380: $\left( \frac{1}{2} \right)^{n-1}$.
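
For concreteness, the simplification can be written out as follows;
the numerical values given afterwards are merely instances of this formula.
$$
\frac{p_{\mygram}(a^n b)}{p_{\mygram}(a^n c)} =
\frac{\left(\frac{1}{3}\right)^{n} \cdot \frac{2}{3}}{
	\left(\frac{2}{3}\right)^{n} \cdot \frac{1}{3}} =
2 \cdot \left(\frac{1}{2}\right)^{n} =
\left(\frac{1}{2}\right)^{n-1}
$$
For instance, $p_{\mygram}({\it ab}) = p_{\mygram}({\it ac}) = \frac{1}{9}$,
with ratio $1$, whereas
$p_{\mygram}({\it aab}) = \frac{1}{27}$ and
$p_{\mygram}({\it aac}) = \frac{2}{27}$, with ratio $\frac{1}{2}$.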

2381:

2382: Let $\myaut$ be such that $\mystrat_{\it LR}(\mygram)= (\myaut,f)$ and

2383: consider input strings of the form $a^n b$ and $a^n c$, $n \geq 1$.

2384: After scanning the first $n$ symbols, $\myaut$

2385: reaches a configuration where the top-of-stack $X$ is

2386: given by the set of (kernel) items:

2387: $$

2388: X=\{ B \de a \bul B, C \de a \bul C \}

2389: $$

2390:

2391: There are three applicable transitions, representing shift

2392: actions over $a$, $b$ and $c$, given by:

2393: $$

2394: \begin{array}{c@{\;=\;}l}

2395: \tau_a & \myscan{X}{a}{\epsilon}{X\ X} \\

2396: \tau_b & \myscan{X}{b}{\epsilon}{X\ \{B \de b\bul\}} \\

2397: \tau_c & \myscan{X}{c}{\epsilon}{X\ \{C \de c\bul\}}

2398: \end{array}

2399: $$

2400: After reading $b$ or $c$,

2401: the remaining transitions are fully deterministic.

2402:

2403: For a PDT $\myaut'$ that is an expansion of $\myaut$, we may have

2404: different stack symbols that are all mapped to $X$ by function $g$.

2405: These stack symbols can be referred to as

2406: $X_n$, which occur as top-of-stack

2407: after scanning the first $n$ symbols of $a^n b$ or $a^n c$, $n \geq 1$.

2408: We refer to the applicable transitions with top-of-stack $X_n$ as:

2409: $$

2410: \begin{array}{c@{\;=\;}l}

2411: \tau_{a,n} & \myscan{X_n}{a}{\epsilon}{X_n\ X_{n+1}} \\

2412: \tau_{b,n} & \myscan{X_n}{b}{\epsilon}{X_n\ \{B \de b\bul\}_n} \\

2413: \tau_{c,n} & \myscan{X_n}{c}{\epsilon}{X_n\ \{C \de c\bul\}_n}

2414: \end{array}

2415: $$

2416: for certain stack symbols $\{B \de b\bul\}_n$ and

2417: $\{C \de c\bul\}_n$ that $g$ maps to $\{B \de b\bul\}$ and

2418: $\{C \de c\bul\}$, respectively.

2419:

2420: Now let us assume we have a probability function $p_{\myaut'}$

2421: such that $(\myaut',p_{\myaut'})$ is a PPDT.

2422: Since the application of either $\tau_{b,n}$ or $\tau_{c,n}$ is

2423: the only nondeterministic step

2424: that distinguishes recognition of $a^n b$ from

2425: recognition of $a^n c$, $n \geq 1$, it follows that

2426: $\frac{p_{\myaut'}(a^n b)}{p_{\myaut'}(a^n c)}$ $=$

2427: $\frac{p_{\myaut'}(\tau_{b,n})}{p_{\myaut'}(\tau_{c,n})}$.

2428: If $(\myaut',p_{\myaut'})$ assigns the same probabilities

2429: to strings over alphabet $\{a,b,c\}$ as $(\mygram,p_{\mygram})$,

2430: then $\frac{p_{\myaut'}(\tau_{b,n})}{p_{\myaut'}(\tau_{c,n})}$

2431: must be equal to $\frac{p_{\mygram}(a^n b)}{p_{\mygram}(a^n c)}$ $=$

2432: $\left( \frac{1}{2} \right)^{n-1}$ for each

2433: $n\geq 1$. Since $\left( \frac{1}{2} \right)^{n-1}$ is a different

2434: value for each $n$ however, this would require $\myaut'$ to possess

2435: infinitely many stack symbols, which is in conflict with the definition

2436: of push-down transducers.
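
The last step can be spelled out as a pigeonhole argument. This is only a
sketch, under the assumption (implicit above) that the transitions
$\tau_{b,n}$ and $\tau_{c,n}$ are determined by the stack symbol $X_n$.
If the set of stack symbols of $\myaut'$ is finite, then $X_m = X_n$ for
some $m < n$, so that $\tau_{b,m} = \tau_{b,n}$ and
$\tau_{c,m} = \tau_{c,n}$, and therefore
$$
\left( \frac{1}{2} \right)^{m-1} =
\frac{p_{\myaut'}(\tau_{b,m})}{p_{\myaut'}(\tau_{c,m})} =
\frac{p_{\myaut'}(\tau_{b,n})}{p_{\myaut'}(\tau_{c,n})} =
\left( \frac{1}{2} \right)^{n-1}
$$
which is impossible for $m \neq n$.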

2437:

2438: This shows that no probability function $p_{\myaut'}$ exists

2439: for any expansion $\myaut'$ of $\myaut$ such that

2440: $(\myaut',p_{\myaut'})$ assigns the same probabilities

2441: to strings over the alphabet as $(\mygram,p_{\mygram})$,

2442: and therefore LR parsing cannot be extended in the wide sense to

2443: become a probabilistic parsing strategy. With only minor changes

2444: to the proof, the same can be shown for ELR parsing.

2445:

2446: \section{Prefix probabilities}

2447: \label{s:prefix}

2448:

2449: In this section we show that the behaviour of PPDTs on input

2450: can be simulated by dynamic programming.

2451: We also show how dynamic programming can be used for

2452: computing prefix probabilities.

2453: Prefix probabilities have important applications, e.g.\

2454: in the area of speech recognition.

2455:

2456: Our algorithm is a minor extension

2457: of an application of dynamic programming developed

2458: for non-probabilistic PDTs by~\cite{LA74,BI89}, and

2459: the treatment of probabilities is derived from~\cite{ST95}.

2460:

2461: Assume a fixed PPDT $(\myaut,p_{\myaut})$ and a

2462: fixed input string $a_1 \cdots a_n$. Consider a

2463: computation of the form $c_1 \tau c_2$, where

2464: $(\Xinit, a_1 \cdots a_i, \epsilon)$ $\pdamovesname{c_1}$

2465: $(\alpha X, \epsilon, v_1)$,

2466: $\tau$ is of the form

2467: $\myep{{\it X}}{{\it X Y'}}$, and

2468: $(Y', a_{i+1} \cdots a_j, \epsilon)$

2469: $\pdamovesname{c_2}$

2470: $(Y, \epsilon, v_2)$, for

2471: some stack symbols $X,Y',Y$,

2472: some input positions $i$ and $j$ ($0 \leq i \leq j \leq n$),

2473: and some output strings $v_1$ and $v_2$.

2474: In words, the computation

2475: obtains top-of-stack $X$ after

2476: scanning of $a_i$ but before scanning of $a_{i+1}$,

2477: then applies a push transition, and then possibly

2478: further push, scan and pop transitions, which

2479: leads to $Y$ on top of $X$ after

2480: scanning of $a_j$ but before scanning of $a_{j+1}$.

2481:

2482: We now abstract away from some details of such a computation

2483: by just recording $X$, $Y$, $i$, $j$ and its probability

2484: $p_1=p_{\myaut}(c_1\tau c_2)$.

2485: The probability $p_1$ is related to what is commonly called

2486: a {\em forward\/} probability,

2487: as it expresses the probability of the computation

2488: from the beginning onward.%

2489: \footnote{Forward probability as defined by \cite{ST95}

2490: refers to the sum of the probabilities of

2491: {\em all\/} computations from the

2492: beginning onward that lead to a certain rule occurrence,

2493: whereas here we consider only one computation at a time.

2494: We will turn to forward probabilities later in this section.}

2495: The existence of the above computation is represented by an

2496: object that we will call a {\em table item\/},

2497: written as $p_1:\forward(X,Y,i,j)$.

2498:

2499: Similarly, consider a subcomputation of the form

2500: $\tau c_2$, where as before

2501: $\tau$ is of the form

2502: $\myep{{\it X}}{{\it X Y'}}$, and

2503: $(Y', a_{i+1} \cdots a_j, \epsilon)$

2504: $\pdamovesname{c_2}$

2505: $(Y, \epsilon, v_2)$, for

2506: some stack symbols $X,Y',Y$,

2507: some input positions $i$ and $j$ ($0 \leq i \leq j \leq n$),

2508: and some output string $v_2$.

2509: We express the existence of such a subcomputation

2510: by a different kind of table item, written as

2511: $p_2:\inner(X,Y,i,j)$, where

2512: $p_2=p_{\myaut}(\tau c_2)$. Here, $p_2$ is related to what is commonly

2513: called an {\em inner\/} probability, as

2514: it expresses only the probability internally in a

2515: subcomputation.%

2516: \footnote{We will turn to actual inner probabilities

2517: later in this section.}

2518:

2519: For technical reasons, we also need to consider

2520: computations $c$ where

2521: $(\Xinit, a_1 \cdots a_j, \epsilon)$ $\pdamovesname{c}$

2522: $(Y, \epsilon, v)$, for some $Y$, $j$ and $v$.

2523: These are represented by table items

2524: $p_1:\forward(\bot,Y,0,j)$,

2525: where $p_1=p_{\myaut}(c)$.

2526: The symbol $\bot$ can be seen as an imaginary stack symbol that

2527: is located

2528: below the actual bottom-of-stack element.

2529:

2530: All table items of the above forms, and only those table items,

2531: can be derived by the deduction system in

2532: Figure~\ref{f:tabular}. Deduction systems for defining

2533: parsing algorithms have been described before by \cite{SH95};

2534: see also \cite{SI97,SI97a} for a very similar framework.

2535: A dynamic programming algorithm for such a deduction system

2536: incrementally fills a {\em parse table\/} with

2537: table items, given a grammar and input.

2538: During execution of the algorithm,

2539: items that are already

2540: in the table are matched against antecedents of inference

2541: rules. If a combination of items matches all

2542: antecedents of an inference rule, then the item

2543: that matches the consequent of that inference rule is

2544: added to the table. This process ends when no more

2545: new items can be added to the table.

2546:

2547: The item in the consequent of inference rule~(\ref{e:init})

2548: represents the fact that

2549: at the beginning of any computation,

2550: $\Xinit$ lies on top of imaginary stack

2551: element $\bot$, no input has as yet been read, and

2552: the product of probabilities of all transitions used

2553: in the represented computation is 1, since no transitions

2554: have been used yet.

2555:

2556: Inference rule~(\ref{e:pushfor}) derives a table item from

2557: an existing table item, if the second stack symbol of that

2558: existing item indicates that a push transition can be applied.

2559: Naturally, the probability in the new item is the product

2560: of the probability in the old item and

2561: the probability of the applied transition.

2562: Inference rule~(\ref{e:scanfor}) is very similar.

2563:

2564: Two subcomputations are combined through a

2565: pop transition by inference rule~(\ref{e:popfor}),

2566: the intuition of which can be explained as follows.

2567: If $W$ occurs as top-of-stack at position $i$ and

2568: reading the input up to $j$ results in

2569: $Y$ on top of $W$, and if subsequently reading the input from

2570: $j$ to $k$ results in $X$ on top of $Y$ and

2571: ${\it YX}$ may be replaced by $Z$ by a pop transition, then

2572: reading the input from $i$ to $k$ results in

2573: $Z$ on top of $W$.

2574: The probability of the newly derived subcomputation is the

2575: product of three probabilities.

2576: The first is the probability of that subcomputation

2577: up to the point where $Y$ is top-of-stack,

2578: which is given by $p_1$; the second is the

2579: probability from this point onward, up to the point where

2580: $X$ is top-of-stack,

2581: which is given by $p_2$;

2582: the third is the probability of the pop transition.

2583: The second of these

2584: probabilities, $p_2$, is defined by the inference rules for

2585: `inner' items to be discussed next.

2586:

2587: Inference rule~(\ref{e:pushin}) starts the investigation of

2588: a new subcomputation that begins with a push transition.

2589: This rule does not have any antecedents, but we may

2590: add an item $p_1:\forward(Z,X,i,j)$ as antecedent,

2591: since the resulting `inner' items can only be useful for

2592: the computation of `forward' items if at least

2593: one item of the form $p_1:\forward(Z,X,i,j)$

2594: exists. We will not do so

2595: however, since this would complicate the theoretical analysis.

2596:

2597: The next two rules, (\ref{e:scanin}) and~(\ref{e:popin}),

2598: are almost identical to (\ref{e:scanfor}) and~(\ref{e:popfor}).

2599:

2600: \begin{figure}[t]

2601: Initialization: \\[-4ex]

2602: \tabruletwo{e:init}{

2603: }{

2604: 1:\forward(\bot,\Xinit,0,0)

2605: }

2606:

2607: Push (forward): \\[-4ex]

2608: \tabrule{e:pushfor}{

2609: p_1:\forward(Z,X,i,j)

2610: }{

2611: p_1 \cdot p_{\myaut}(\tau):\forward(X,Y,j,j)

2612: }{

2613: \tau = \myep{X}{\it XY}

2614: }

2615:

2616: Scan (forward): \\[-4ex]

2617: \tabrule{e:scanfor}{

2618: p_1:\forward(Z,X,i,j)

2619: }{

2620: p_1 \cdot p_{\myaut}(\tau):\forward(Z,Y,i,j')

2621: }{

2622: \tau = \myscan{X}{x}{y}{\it Y} \\

2623: (x = \epsilon \wedge j' = j)\ \vee  \\

2624: \ \ \ (x = a_{j+1} \wedge j' = j + 1)

2625: }

2626:

2627: Pop (forward): \\[-4ex]

2628: \tabrule{e:popfor}{

2629: p_1:\forward(W,Y,i,j) \\

2630: p_2:\inner(Y,X,j,k)

2631: }{

2632: p_1 \cdot p_2\cdot p_{\myaut}(\tau):\forward(W,Z,i,k)

2633: }{

2634: \tau = \myep{{\it Y X}}{\it Z}

2635: }

2636:

2637: Push (inner): \\[-4ex]

2638: \tabrule{e:pushin}{

2639: % p_1:\forward(Z,X,i,j)

2640: }{

2641: p_{\myaut}(\tau):\inner(X,Y,j,j)

2642: }{

2643: \tau = \myep{X}{\it XY}

2644: }

2645:

2646: Scan (inner): \\[-4ex]

2647: \tabrule{e:scanin}{

2648: p_2:\inner(Z,X,i,j)

2649: }{

2650: p_2 \cdot p_{\myaut}(\tau):\inner(Z,Y,i,j')

2651: }{

2652: \tau = \myscan{X}{x}{y}{\it Y} \\

2653: (x = \epsilon \wedge j' = j)\ \vee  \\

2654: \ \ \ (x = a_{j+1} \wedge j' = j + 1)

2655: }

2656:

2657: Pop (inner): \\[-4ex]

2658: \tabrule{e:popin}{

2659: p_2:\inner(W,Y,i,j) \\

2660: p'_2:\inner(Y,X,j,k)

2661: }{

2662: p_2 \cdot p'_2\cdot p_{\myaut}(\tau):\inner(W,Z,i,k)

2663: }{

2664: \tau = \myep{{\it Y X}}{\it Z}

2665: }

2666:

2667: \caption{Deduction system of table items.}

2668: \label{f:tabular}

2669: \end{figure}

2670:

2671: It is not difficult to see that for each complete

2672: computation of the form

2673: $(\Xinit, a_1 \cdots a_n, \epsilon)$ $\pdamovesname{c}$

2674: $(\Xfinal, \epsilon, v)$, for some output string $v$,

2675: there is precisely one derivation by the deduction system

2676: of some table item

2677: $p_1:\forward(\bot,\Xfinal,0,n)$, where $p_1=p_{\myaut}(c)$.

2678: Conversely, for each derivation of such a table item, there

2679: is a unique corresponding computation.

2680: Computations and derivations can be easily related to each other

2681: by looking at the transitions in the side conditions of the

2682: inference rules.
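
As a worked illustration, consider a hypothetical automaton (an example
only, not one constructed elsewhere in this paper) with the
three transitions
$\tau_1 = \myep{\Xinit}{\Xinit\ Y}$,
$\tau_2 = \myscan{Y}{a}{\epsilon}{Y'}$ and
$\tau_3 = \myep{\Xinit\ Y'}{\Xfinal}$,
and the input $a_1 = a$ with $n=1$.
The complete computation $\tau_1 \tau_2 \tau_3$ corresponds to the
following derivation:
$$
\begin{array}{ll}
1:\forward(\bot,\Xinit,0,0) &
	\mbox{by (\ref{e:init})} \\
p_{\myaut}(\tau_1):\inner(\Xinit,Y,0,0) &
	\mbox{by (\ref{e:pushin})} \\
p_{\myaut}(\tau_1) \cdot p_{\myaut}(\tau_2):\inner(\Xinit,Y',0,1) &
	\mbox{by (\ref{e:scanin})} \\
p_{\myaut}(\tau_1) \cdot p_{\myaut}(\tau_2) \cdot
	p_{\myaut}(\tau_3):\forward(\bot,\Xfinal,0,1) &
	\mbox{by (\ref{e:popfor})}
\end{array}
$$
The last step combines the first and third items, and the probability
in its consequent is $p_{\myaut}(\tau_1 \tau_2 \tau_3)$, as expected.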

2683:

2684: It follows that if we take the sum of

2685: $p_1$ over all derivations of items

2686: $p_1:\forward(\bot,\Xfinal,0,n)$, then we obtain

2687: the probability assigned by $\myaut$ to the input

2688: $w=a_1 \cdots a_n$.
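In symbols, and using shorthand that is introduced only for this remark,
let $D_n$ denote the set of all derivations of items of the form
$p_1:\forward(\bot,\Xfinal,0,n)$, and let $p_1(d)$ denote the probability
attached to the item derived by $d \in D_n$.
The statement above then reads:
$$ p_{\myaut}(a_1 \cdots a_n)\ = \sum_{d \in D_n} p_1(d) $$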

2689:

2690: Now assume that $\myaut$ is proper and consistent.

2691: For a given string

2692: $w' \in \myterm_1^\ast$, where $\myterm_1$ is the input

2693: alphabet, we define the {\em prefix probability\/} of $w'$

2694: to be

2695: $$ \sum_{w'' \in \myterm_1^\ast}\ p_{\myaut}(w' w'') $$

2696: In other words, we sum the probabilities of all strings

2697: $w=w'w''$ that start with prefix $w'$.

2698: We will now show that this probability can also be expressed

2699: in terms of the probabilities of `forward' items.

2700:

2701: Assume that $w'= a_1 \cdots a_n$, for some $n \geq 0$.

2702: Any complete computation on a string $w=w'w''$

2704: must be of one of two types.

2705: The first is

2706: $(\Xinit, a_1 \cdots a_n, \epsilon)$ $\pdamovesname{c}$

2707: $(\Xfinal, \epsilon, v)$, for some $v$, which means that

2708: $w''=\epsilon$, so that no input beyond position $n$ needs to be

2709: read.

2710: The second is

2711: $(\Xinit, a_1 \cdots a_n a_{n+1} \cdots a_m, \epsilon)$ $\pdamovesname{c_1}$

2712: $(\alpha X, a_{n+1} \cdots a_m, v_1)$ $\pdamove{\tau}$

2713: $(\alpha Y, a_{n+2} \cdots a_m, v_1y)$ $\pdamovesname{c_2}$

2714: $(\Xfinal, \epsilon, v_1yv_2)$,

2715: where $\tau$ is a scan transition

2716: $\myscan{X}{a}{y}{Y}$ such that $a=a_{n+1}$.

2717:

2718: The sum of probabilities of computations of the first type equals the

2719: sum of $p_1$ over all derivations of items

2720: $p_1:\forward(\bot,\Xfinal,0,n)$, as we have explained above.

2721: For the second type of computation, properness and

2722: consistency

2723: imply that for given $c_1$ and $\tau$ as above,

2724: the sum of probabilities of different $c_2$ must be 1.

2725: (If that sum, say $q$, is less than $1$, then

2726: the sum of the probabilities of all computations cannot be

2727: more than $1 - (1-q) \cdot p_{\myaut}(c_1) \cdot p_{\myaut}(\tau) < 1$, which

2728: is in conflict with the assumed consistency.)

2729: Furthermore, properness implies

2730: that the sum of probabilities of different $\tau$

2731: that we can apply for top-of-stack $X$ must be 1.

2732: Therefore, we may conclude that

2733: the sum of probabilities of computations of the

2734: second type equals the sum of $p_{\myaut}(c_1)$ over all computations

2735: $(\Xinit, a_1 \cdots a_n, \epsilon)$ $\pdamovesname{c_1}$

2736: $(\alpha X, \epsilon, v_1)$ such that there is at least

2737: one scan transition of the form $\myscan{X}{a}{y}{Y}$.

2738: This equals the sum of $p_1$ over all derivations of items

2739: $p_1:\forward(Z,X,0,n)$, for some $Z$, such that there is at least

2740: one scan transition of the form $\myscan{X}{a}{y}{Y}$.
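The argument for the second type can be summarized as a chain of
equalities. This is only a sketch of the reasoning above, under the
same assumptions of properness and consistency. The symbol $C_1$ is
shorthand introduced here for the set of computations $c_1$ of the
indicated form. For each $c_1$, the sum over $\tau$ ranges over the
transitions applicable with top-of-stack $X$, and the sum over $c_2$
ranges over the continuations that complete the computation.
\begin{eqnarray*}
\lefteqn{\sum_{c_1 \in C_1}\ \sum_{\tau}\ \sum_{c_2}\
p_{\myaut}(c_1) \cdot p_{\myaut}(\tau) \cdot p_{\myaut}(c_2)} \\
& = & \sum_{c_1 \in C_1} p_{\myaut}(c_1) \cdot
\Bigl( \sum_{\tau} p_{\myaut}(\tau) \cdot
\Bigl( \sum_{c_2} p_{\myaut}(c_2) \Bigr) \Bigr) \\
& = & \sum_{c_1 \in C_1} p_{\myaut}(c_1) \cdot
\Bigl( \sum_{\tau} p_{\myaut}(\tau) \Bigr) \\
& = & \sum_{c_1 \in C_1} p_{\myaut}(c_1)
\end{eqnarray*}
The second equality uses the fact that the sum over $c_2$ is 1, and
the third uses the fact that the sum over $\tau$ is 1.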

2741:

2742: Hereby we have shown how both the probability and the

2743: prefix probability of a string can be expressed in terms of

2744: derivations of table items. However, the number of

2745: derivations of table items can be infinite. The obvious

2746: remedy lies in an alternative interpretation of the inference

2747: rules in Figure~\ref{f:tabular}, following \cite{GO99}:

2748: we regard objects of the form

2749: $\forward(X,Y,i,j)$ or $\inner(X,Y,i,j)$ as

2750: table items in their own right, and store each at most once

2751: in the parse table.

2752: The associated probabilities

2753: are then no longer those for individual derivations,

2754: but are the sums of probabilities

2755: over all derivations of table items

2756: $\forward(X,Y,i,j)$ or $\inner(X,Y,i,j)$.

2757: Such a sum of probabilities over all

2758: derivations of a table item is commonly

2759: called a {\em forward\/} or {\em inner\/} probability,

2760: respectively.

2761:

2762: We will make this more concrete, under the assumption that

2763: there are no cyclic dependencies, i.e.,

2764: there is no item $\forward(X,Y,i,j)$ or $\inner(X,Y,i,j)$ that

2765: may occur as ancestor of itself in some derivation.

2766: Let $T$ be the set of all items

2767: $\forward(X,Y,i,j)$ or $\inner(X,Y,i,j)$ that can be derived

2768: using the deduction system in Figure~\ref{f:tabular},

2769: ignoring the probabilities.

2770: We then define a function $p_{\tabel}$ from table items to

2771: probabilities, as shown in Figure~\ref{f:recursive}.

2772: We assume the function $\delta$ evaluates to 1 if its

2773: argument is true, and to 0 otherwise.

2774:

2775: \begin{figure}[t]

2776: \begin{eqnarray}

2777: \label{e:forward}

2778: \lefteqn{p_{\tabel}(\forward(X,Y,i,j))\ =} \\

2779: && \delta(X = \bot \wedge Y = \Xinit \wedge

2780:                 i = j = 0)\ + \nonumber\\

2781: && \delta(i=j) \cdot

2782: \sum_{

2783: Z, k,\tau: \atop

2784: { \forward(Z,X,k,i)\in T, \atop

2785: \tau = \myep{X}{\it XY}}

2786: }

2787: \hspace{-3ex}

2788: 	p_{\tabel}(\forward(Z,X,k,i)) \cdot p_{\myaut}(\tau)\ + \nonumber\\

2789: && \sum_{

2790: Z,j',x,y,\tau: \atop

2791: { \forward(X,Z,i,j')\in T, \atop

2792: { (x = \epsilon \wedge j' = j) \vee

2793: 	(x = a_{\tiny j} \wedge j' = j - 1) , \atop

2794: \tau = \myscan{Z}{x}{y}{\it Y} }}

2795: }

2796: \hspace{-8ex}

2797:         p_{\tabel}(\forward(X,Z,i,j')) \cdot p_{\myaut}(\tau)\ + \nonumber\\

2798: && \sum_{

2799: W,Z,k,\tau: \atop

2800: { \forward(X,W,i,k)\in T, \inner(W,Z,k,j)\in T, \atop

2801: \tau = \myep{\it WZ}{\it Y} }

2802: }

2803: \hspace{-11ex}

2804: 	p_{\tabel}(\forward(X,W,i,k)) \cdot

2805: 	p_{\tabel}(\inner(W,Z,k,j)) \cdot p_{\myaut}(\tau) \nonumber

2806: \end{eqnarray}

2807: %

2808: \begin{eqnarray}

2809: \label{e:inner}

2810: \lefteqn{p_{\tabel}(\inner(X,Y,i,j))\ =} \\

2811: && \delta(i=j) \cdot

2812: \sum_{

2813: \tau: \atop

2814: \tau = \myep{X}{\it XY}

2815: }

2816: 	p_{\myaut}(\tau)\ + \nonumber\\

2817: && \sum_{

2818: Z,j',x,y,\tau: \atop

2819: { \inner(X,Z,i,j')\in T, \atop

2820: { (x = \epsilon \wedge j' = j) \vee

2821:         (x = a_{\tiny j} \wedge j' = j - 1) , \atop

2822: \tau = \myscan{Z}{x}{y}{\it Y} }}

2823: }

2824: \hspace{-8ex}

2825:         p_{\tabel}(\inner(X,Z,i,j')) \cdot p_{\myaut}(\tau)\ + \nonumber\\

2826: && \sum_{

2827: W,Z,k,\tau: \atop

2828: { \inner(X,W,i,k)\in T, \inner(W,Z,k,j)\in T, \atop

2829: \tau = \myep{\it WZ}{\it Y} }

2830: }

2831: \hspace{-11ex}

2832:         p_{\tabel}(\inner(X,W,i,k)) \cdot

2833:         p_{\tabel}(\inner(W,Z,k,j)) \cdot p_{\myaut}(\tau) \nonumber

2834: \end{eqnarray}

2835: \caption{Recursive functions to determine probabilities of

2836: table items.}

2837: \label{f:recursive}

2838: \end{figure}

2839:

2840: Each line in the right-hand sides of the two equations

2841: in Figure~\ref{f:recursive} can

2842: be seen as the backward application

2843: of an inference rule from Figure~\ref{f:tabular}.

2844: In other words,

2845: for a given item,

2846: we investigate

2847: all possible ways of deriving that item as the

2848: consequent of different inference rules with different antecedents.

2849: For example, the second line in the right-hand side of

2850: equation~(\ref{e:forward})

2851: can be seen as the backward application of inference rule (\ref{e:pushfor}).
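The fourth line of equation~(\ref{e:forward}) can be matched against
inference rule (\ref{e:popfor}) in the same way.
As a sketch of the correspondence, obtained by comparing the two figures,
simultaneously rename the variables $W$, $Y$, $X$, $Z$, $j$, $k$ of
rule (\ref{e:popfor}) to $X$, $W$, $Z$, $Y$, $k$, $j$, respectively,
and leave $i$ unchanged. The rule then derives
$$ p_1 \cdot p_2 \cdot p_{\myaut}(\tau):\forward(X,Y,i,j) $$
from $p_1:\forward(X,W,i,k)$ and $p_2:\inner(W,Z,k,j)$ under the side
condition $\tau = \myep{\it WZ}{\it Y}$, which has the same shape as the
summand in the fourth line.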

2852:

2853: That Figure~\ref{f:recursive} is indeed equivalent to

2854: Figure~\ref{f:tabular} follows from the fact that

2855: multiplication distributes over addition.

2856: If there are cyclic dependencies, then the set of equations

2857: in Figure~\ref{f:recursive} may no longer have a closed-form

2858: solution, but we may obtain probabilities by

2859: an iterative algorithm that approximates the lowest non-negative

2860: solution to the equations \cite{ST95}.
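One common way to set up such an iteration is sketched below. It is
given only as an illustration and is not necessarily the exact
procedure of \cite{ST95}. Collect the values $p_{\tabel}(I)$, for
$I \in T$, in a vector $\vec{p}$, and let $F$ be the map that evaluates
the right-hand sides of the equations in Figure~\ref{f:recursive} at a
given vector. The symbols $\vec{p}^{\,(t)}$ and $F$ are notation
introduced for this illustration only.
$$ \vec{p}^{\,(0)} = \vec{0}, \ \ \ \ \
\vec{p}^{\,(t+1)} = F(\vec{p}^{\,(t)}) \ \ \ (t \geq 0) $$
Because the right-hand sides contain only sums and products with
non-negative coefficients, the sequence is non-decreasing in every
component. Provided that a non-negative solution exists, the sequence
is bounded by it and converges to the lowest non-negative solution.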

2861:

2862: Given the set of equations in Figure~\ref{f:recursive}

2863: we can now express the probability of a string of length $n$ as

2864: $p_{\tabel}(\forward(\bot,\Xfinal,0,n))$.

2865: The prefix probability of a string of length $n$ is given by:

2866: \begin{eqnarray}

2867: && p_{\tabel}(\forward(\bot,\Xfinal,0,n))\ + \\

2868: && \sum_{

2869: X,Y,i: \atop

2870: { \forward(X,Y,i,n)\in T, \atop

2871: \exists \tau,a,y,Z[\tau = \myscan{Y}{a}{y}{\it Z}] }

2872: }

2873: \hspace{-3ex}

2874: p_{\tabel}(\forward(X,Y,i,n)) \nonumber

2875: \end{eqnarray}

2876:

2877: To obtain a suitable PPDT from a given PCFG,

2878: we may apply

2879: the strategy $\stratepLC$

2880: from Section~\ref{s:strong}. Provided the (P)CFG is acyclic,

2881: this strategy ensures that there are no computations of

2882: infinite length for any given

2883: input, which implies there are no cyclic dependencies

2884: in the simulation of the automaton by the dynamic programming

2885: algorithm.

2886:

2887: Hereby we have presented a way to compute probabilities

2888: and prefix probabilities of strings. Our approach is an alternative

2889: to the one from \cite{JE91,ST95}, and has the advantage that

2890: it is parameterized by the parsing strategy:

2891: instead of $\stratepLC$

2892: we may apply any other parsing strategy with the same properties

2893: with regard to acyclic grammars.

2894: If our grammars are even more constrained,

2895: e.g.\ if they do not have epsilon rules,

2896: we may apply even simpler parsing strategies.

2897: Different parsing strategies may differ in the efficiency

2898: of the computation.

2899:

2900: \section{Conclusions}

2901:

2902: We have formalized the notion of parsing strategy as a mapping from

2903: context-free grammars to push-down transducers, and have investigated

2904: the extension to probabilities.

2905: We have shown that the question of which

2906: strategies can be extended to become probabilistic relies

2907: heavily on two properties, the correct-prefix property (CPP) and

2908: the strong predictiveness property (SPP).

2909: The CPP is a necessary condition for

2910: extending a strategy to become a probabilistic strategy.

2911: The CPP and SPP together form a sufficient condition.

2912: We have shown that there is at least one strategy

2913: of practical interest with the CPP but

2914: without the SPP that cannot be extended to become a probabilistic

2915: strategy.

2916: Lastly, we have presented an application

2917: to prefix probabilities.

2918:

2919: \section*{Acknowledgements}

2920:

2921: We gratefully acknowledge correspondence with

2922: David McAllester,

2923: Giovanni Pighizzini,

2924: Detlef Prescher,

2925: Virach Sornlertlamvanich and

2926: Eric Villemonte de la Clergerie.

2927:

2928: \bibliographystyle{plain}

2929: %\bibliography{refs}

2930: %\bibliography{/home/cl-home/nederhof/bib/refs}

2931: \bibliography{/home/markjan/bib/refs}

2932:

2933: % \newpage

2934:

2935: \end{document}

2936:

2937:

2938: