0008:cs0008017/tdp.tex

1: \documentclass[11pt]{article}

2: \usepackage{colacl, epic, eepic, epsfig}

3: \title{Efficient probabilistic top-down and left-corner parsing${}^{\dag}$}

4: \author{{\bf Brian Roark and Mark Johnson} \\Cognitive and Linguistic

5: Sciences\\Box 1978, Brown University\\Providence, RI  02912, USA\\{\tt

6: brian-roark@brown.edu}\hspace*{.5in}{\tt mj@cs.brown.edu}}

7: \begin{document}

8: \renewcommand{\thefootnote}{\fnsymbol{footnote}}

9: \maketitle

10: \begin{abstract}

11: This paper examines efficient predictive broad-coverage parsing

12: without dynamic programming. In contrast to bottom-up methods,

13: depth-first top-down parsing produces partial parses that are fully connected

14: trees spanning the entire left context, from which any kind of

15: non-local dependency or partial semantic interpretation can in

16: principle be read. We contrast two predictive parsing approaches,

17: top-down and left-corner parsing, and

18: find both to be viable. In addition, we find that

19: enhancement with non-local information not only improves parser

20: accuracy, but also substantially improves the search efficiency.

21: \footnotetext{${}^{\dag}$This material is based on work supported by the National Science Foundation under Grant No. SBR-9720368.}

22: \end{abstract}

23: \bibliographystyle{acl}

24:

25: \section{Introduction}

26: \renewcommand{\thefootnote}{\arabic{footnote}}

27: Strong empirical evidence has been presented over the past 15 years

28: indicating that the human sentence processing mechanism makes {\it on-line\/}

29: use of contextual information in the preceding discourse

30: \cite{Crain85,Altmann88,Britt94} and in the

31: visual environment \cite{Tanen95}. These results lend

32: support to Mark Steedman's \shortcite{Steed89} ``intuition'' that sentence

33: interpretation takes place incrementally, and that partial

34: interpretations are being built while the sentence is being

35: perceived. This view is widely held among psycholinguists

36: today.

37:

38: Many possible models of human sentence processing can be made

39: consistent with the above view, but the general assumption that must

40: underlie them all is that explicit relationships between lexical items

41: in the sentence must be specified incrementally.  Such a processing

42: mechanism stands in marked contrast to

43: dynamic programming parsers, which delay construction of a constituent

44: until all of its sub-constituents have been completed, and whose

45: partial parses thus consist of disconnected tree fragments. For

46: example, such parsers do not integrate a main verb into the same tree

47: structure as its subject {\small NP} until the {\small VP} has been completely parsed,

48: and in many cases this is the final step of the entire parsing

49: process. Without explicit on-line integration, it would be difficult

50: (though not impossible) to produce partial interpretations

51: on-line. Similarly, it may be difficult to use non-local statistical

52: dependencies (e.g. between subject and main verb) to actively guide

53: such parsers.

54:

55: Our predictive parser does not use dynamic programming,

56: but rather maintains fully connected trees spanning the entire left

57: context, which make explicit the relationships between constituents

58: required for partial interpretation. The parser uses probabilistic

59: best-first parsing methods to pursue the most likely analyses first,

60: and a beam-search to avoid the non-termination problems typical of

61: non-statistical top-down predictive parsers.

62:

63: There are two main

64: results. First, this approach works and, with appropriate attention to

65: specific algorithmic details, is surprisingly efficient. Second, not

66: just accuracy but also efficiency improves as the language model is

67: made more accurate. This bodes well for future research into the use

68: of other non-local (e.g. lexical and semantic) information to guide

69: the parser.

70:

71: In addition, we show that the improvement in accuracy

72: associated with left-corner parsing over top-down is attributable to

73: the non-local information supplied by the

74: strategy, and can thus be obtained through other methods that utilize

75: that same information.

76:

77: \section{Parser architecture}

78:

79: The parser proceeds incrementally from left to right, with one item of

80: look-ahead. Nodes are expanded in a standard top-down, left-to-right

81: fashion. The parser utilizes: (i) a probabilistic context-free grammar

82: ({\small PCFG}), induced via standard relative frequency estimation from a

83: corpus of parse trees; and (ii) look-ahead probabilities as described

84: below. Multiple competing partial parses (or analyses) are held on a

85: priority queue, which we will call the {\it pending\/} heap. They are ranked

86: by a figure of merit ({\small FOM}), which will be discussed below. Each

87: analysis has its own stack of nodes to be expanded, as well as a

88: history, probability, and {\small FOM}. The highest ranked analysis is popped

89: from the pending heap, and the category at the top of its stack is

90: expanded. A category is expanded using every rule which could

91: eventually reach the look-ahead terminal. For every such rule

92: expansion, a new analysis is created\footnote{We count each of these

93: as a parser state (or rule expansion) {\it considered\/}, which can be

94: used as a measure of efficiency.} and pushed back onto the pending

95: heap.
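The relative frequency estimation mentioned above can be sketched as follows. This is a minimal illustration and not the code used for the experiments; the function name {\tt estimate\_pcfg} and the representation of a rule as a pair of a left-hand side and a tuple of children are assumptions made for the example.

\begin{verbatim}
```python
from collections import Counter

def estimate_pcfg(rule_occurrences):
    """Relative frequency estimate: P(A -> rhs) = count(A -> rhs) / count(A).

    rule_occurrences is a list of (lhs, rhs) pairs, one per rule occurrence
    read off the training trees (hypothetical input format).
    """
    rule_counts = Counter(rule_occurrences)
    lhs_counts = Counter(lhs for lhs, _ in rule_occurrences)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

# Toy counts: three NP -> DT NN, one NP -> DT JJ NN, one S -> NP VP.
occurrences = ([("NP", ("DT", "NN"))] * 3
               + [("NP", ("DT", "JJ", "NN"))]
               + [("S", ("NP", "VP"))])
probs = estimate_pcfg(occurrences)
```
\end{verbatim}

The probabilities of all rules sharing a left-hand side sum to one by construction.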

96:

97: The {\small FOM} for an analysis is the product of the probabilities of

98: all {\small PCFG} rules used in its derivation and what we call its look-ahead

99: probability ({\small LAP}). The {\small LAP} approximates the product of the

100: probabilities of the rules that will be required to link the analysis

101: in its current state with the look-ahead terminal\footnote{Since this

102: is a non-lexicalized grammar, we are taking pre-terminal POS markers

103: as our terminal items.}. That is, for a

104: grammar {\small G}, a stack state [{\small $C_{1} \dots C_{n}$}] and a

105: look-ahead terminal item $\omega$:

106:

107: \begin{center}(1) $LAP = P_{G}([C_{1} \dots

108: C_{n}] \stackrel{\star}{\rightarrow} \omega\alpha)$\end{center}

109:

110: We recursively estimate this with two empirically observed conditional

111: probabilities for every non-terminal $C_{i}$ on the stack:

112: $\widehat{P} (C_{i} \stackrel{\star}{\rightarrow} \omega)$

113: and $\widehat{P} (C_{i} \stackrel{\star}{\rightarrow} \epsilon)$.

114: The {\small LAP}

115: approximation for a given stack state and look-ahead terminal is:

116:

117: \begin{center}(2) $P_{G}([C_{i} \dots

118: C_{n}] \stackrel{\star}{\rightarrow} \omega\alpha) $\hspace*{.1in} $

119: \approx $\hspace*{.1in} $

120: \widehat{P} (C_{i} \stackrel{\star}{\rightarrow} \omega)$ +\\$

121: \widehat{P} (C_{i} \stackrel{\star}{\rightarrow} \epsilon) *

122: P_{G}([C_{i+1} \dots

123: C_{n}] \stackrel{\star}{\rightarrow} \omega\alpha)$

124: \end{center}
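The recursion in (2) can be written down directly. The sketch below assumes that the two empirical estimates are available as lookup tables, here called {\tt p\_reach} and {\tt p\_empty} (hypothetical names), and that an empty stack contributes zero; neither detail is specified in the text.

\begin{verbatim}
```python
def lap(stack, omega, p_reach, p_empty):
    """Recursive look-ahead probability estimate, following equation (2).

    stack   -- tuple of categories, topmost first
    omega   -- the look-ahead terminal (a POS tag)
    p_reach -- dict: (category, terminal) -> estimate of P(C =>* omega ...)
    p_empty -- dict: category -> estimate of P(C =>* epsilon)
    """
    if not stack:
        return 0.0  # assumption: an empty stack cannot reach the look-ahead
    top, rest = stack[0], stack[1:]
    return (p_reach.get((top, omega), 0.0)
            + p_empty.get(top, 0.0) * lap(rest, omega, p_reach, p_empty))

# Toy tables, invented for illustration only.
p_reach = {("NP-DT", "NN"): 0.5, ("VP", "NN"): 0.1}
p_empty = {"NP-DT": 0.2}
value = lap(("NP-DT", "VP"), "NN", p_reach, p_empty)  # 0.5 + 0.2 * 0.1
```
\end{verbatim}

A category that cannot rewrite as $\epsilon$ cuts the recursion off, since the second term is then zero.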

125:

126:

127: When the topmost stack category of an analysis matches the look-ahead

128: terminal, the terminal is popped from the stack and the analysis is

129: pushed onto a second priority queue, which we will call the {\it success\/}

130: heap. Once there are ``enough'' analyses on the success heap, all those

131: remaining on the pending heap are discarded. The success heap then

132: becomes the pending heap, and the look-ahead is moved forward to the

133: next item in the input string. When the end of the input string is

134: reached, the analysis with the highest probability and an empty stack

135: is returned as the parse. If no such parse is found, an error is

136: returned.
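The control loop described above can be sketched as a toy program. It is a simplification under stated assumptions, not the parser evaluated here: the {\small FOM} is taken to be the analysis probability alone (no {\small LAP}), every rule for a category is tried rather than only those that can reach the look-ahead, and the names {\tt parse} and {\tt TOY\_GRAMMAR} and the toy rule probabilities are invented for the example.

\begin{verbatim}
```python
import heapq
from itertools import count

TOY_GRAMMAR = {  # category -> list of (right-hand side, probability)
    "S": [(("NP", "VP"), 1.0)],
    "NP": [(("DT", "NN"), 0.7), (("DT", "JJ", "NN"), 0.3)],
    "VP": [(("VB",), 0.6), (("VB", "NP"), 0.4)],
}

def parse(tags, grammar, start="S", alpha=1e-4, max_pending=10000):
    """Toy two-heap predictive parser; returns (probability, rules) or None."""
    tie = count()  # unique tie-breaker so the heaps never compare stacks
    # Heap entries: (-score, tie, probability, stack, rules used so far).
    pending = [(-1.0, next(tie), 1.0, (start,), ())]
    for tag in tags:
        success, pushed = [], 0
        while pending:
            neg_fom, _, prob, stack, rules = heapq.heappop(pending)
            # Beam: stop once the best pending score drops below
            # alpha * beta * (probability of the best success analysis).
            if success and -neg_fom < alpha * len(success) * -success[0][0]:
                break
            if not stack:
                continue  # analysis finished before the input did
            top, rest = stack[0], stack[1:]
            if top == tag:  # matches the look-ahead: move to success heap
                heapq.heappush(success, (-prob, next(tie), prob, rest, rules))
            for rhs, p in grammar.get(top, ()):  # expand a non-terminal
                pushed += 1
                if pushed > max_pending:
                    return None  # bound exceeded (e.g. left recursion)
                heapq.heappush(pending, (-prob * p, next(tie), prob * p,
                                         rhs + rest, rules + ((top, rhs),)))
        pending = success  # the success heap becomes the pending heap
    complete = [(prob, rules) for _, _, prob, stack, rules in pending
                if not stack]
    return max(complete, key=lambda item: item[0], default=None)

result = parse(["DT", "JJ", "NN", "VB"], TOY_GRAMMAR)
```
\end{verbatim}

The {\tt max\_pending} argument plays the role of an upper bound on the number of analyses pushed onto the pending heap, so that a left-recursive grammar makes the parse fail instead of looping.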

137:

138:

139: \begin{figure*}

140: \begin{picture}(95,0)(0,-135)

141: \put(52,-8){(a)}

142:

143: \put(50,-30){\footnotesize NP}

144: \drawline(57,-34)(38,-44)

145: \put(16,-52){\footnotesize DT+JJ+JJ}

146: \drawline(38,-56)(19,-66)

147: \put(4,-74){\footnotesize DT+JJ}

148: \drawline(19,-78)(7,-88)

149: \put(-0,-96){\footnotesize DT}

150: \drawline(7,-100)(7,-110)

151: \put(1,-118){the}

152: \drawline(19,-78)(30,-88)

153: \put(26,-96){\footnotesize JJ}

154: \drawline(30,-100)(30,-110)

155: \put(24,-118){fat}

156: \drawline(38,-56)(57,-66)

157: \put(53,-74){\footnotesize JJ}

158: \drawline(57,-78)(57,-88)

159: \put(44,-96){happy}

160: \drawline(57,-34)(77,-44)

161: \put(69,-52){\footnotesize NN}

162: \drawline(77,-56)(77,-66)

163: \put(71,-74){cat}

164: \end{picture}

165: \begin{picture}(95,0)(0,-135)

166: \put(20,-8){(b)}

167:

168: \put(19,-30){\footnotesize NP}

169: \drawline(26,-34)(7,-44)

170: \put(-0,-52){\footnotesize DT}

171: \drawline(7,-56)(7,-66)

172: \put(1,-74){the}

173: \drawline(26,-34)(46,-44)

174: \put(30,-52){\footnotesize NP-DT}

175: \drawline(46,-56)(28,-66)

176: \put(24,-74){\footnotesize JJ}

177: \drawline(28,-78)(28,-88)

178: \put(22,-96){fat}

179: \drawline(46,-56)(63,-66)

180: \put(42,-74){\footnotesize NP-DT-JJ}

181: \drawline(63,-78)(49,-88)

182: \put(45,-96){\footnotesize JJ}

183: \drawline(49,-100)(49,-110)

184: \put(36,-118){happy}

185: \drawline(63,-78)(78,-88)

186: \put(70,-96){\footnotesize NN}

187: \drawline(78,-100)(78,-110)

188: \put(72,-118){cat}

189: \end{picture}

190: \begin{picture}(110,0)(0,-135)

191: \put(22,-8){(c)}

192:

193: \put(21,-30){\footnotesize NP}

194: \drawline(28,-34)(7,-44)

195: \put(-0,-52){\footnotesize DT}

196: \drawline(7,-56)(7,-66)

197: \put(1,-74){the}

198: \drawline(28,-34)(48,-44)

199: \put(32,-52){\footnotesize NP-DT}

200: \drawline(48,-56)(28,-66)

201: \put(24,-74){\footnotesize JJ}

202: \drawline(28,-78)(28,-88)

203: \put(22,-96){fat}

204: \drawline(48,-56)(68,-66)

205: \put(47,-74){\footnotesize NP-DT-JJ}

206: \drawline(68,-78)(48,-88)

207: \put(44,-96){\footnotesize JJ}

208: \drawline(48,-100)(48,-110)

209: \put(35,-118){happy}

210: \drawline(68,-78)(89,-88)

211: \put(62,-96){\footnotesize NP-DT-JJ-JJ}

212: \drawline(89,-100)(89,-110)

213: \put(81,-118){\footnotesize NN}

214: \drawline(89,-122)(89,-132)

215: \put(82,-140){cat}

216: \end{picture}

217: \begin{picture}(150,135)(0,-135)

218: \put(24,-8){(d)}

219: \put(23,-30){\footnotesize NP}

220: \drawline(30,-34)(7,-44)

221: \put(0,-52){\footnotesize DT}

222: \drawline(7,-56)(7,-66)

223: \put(1,-74){the}

224: \drawline(30,-34)(52,-44)

225: \put(36,-52){\footnotesize NP-DT}

226: \drawline(52,-56)(28,-66)

227: \put(24,-74){\footnotesize JJ}

228: \drawline(28,-78)(28,-88)

229: \put(22,-96){fat}

230: \drawline(52,-56)(77,-66)

231: \put(55,-74){\footnotesize NP-DT-JJ}

232: \drawline(77,-78)(48,-88)

233: \put(44,-96){\footnotesize JJ}

234: \drawline(48,-100)(48,-110)

235: \put(35,-118){happy}

236: \drawline(77,-78)(106,-88)

237: \put(79,-96){\footnotesize NP-DT-JJ-JJ}

238: \drawline(106,-100)(79,-110)

239: \put(71,-118){\footnotesize NN}

240: \drawline(79,-122)(79,-132)

241: \put(72,-140){cat}

242: \drawline(106,-100)(133,-110)

243: \put(96,-118){\footnotesize NP-DT-JJ-JJ-NN}

244: \drawline(133,-122)(133,-132)

245: \put(130,-140){$\epsilon$}

246: \end{picture}

247: \caption{Binarized trees:  (a) left binarized ({\small LB}); (b) right

248: binarized to binary ({\small RB2}); (c) right binarized to unary

249: ({\small RB1}); (d) right binarized to nullary ({\small RB0})}\label{fig:bin}

250: \end{figure*}

251:

252: The specifics of the beam-search dictate how many analyses

253: on the success heap constitute ``enough''. One approach is to set a

254: constant beam width, e.g. 10,000 analyses on the success heap, at

255: which point the parser

256: moves to the next item in the input. A problem with this approach is

257: that parses towards the bottom of the success heap may be so unlikely

258: relative to those at the top that they have little or no chance of

259: ultimately becoming the most likely parse, so expanding them wastes

260: effort. An alternative approach is to dynamically vary the beam width

261: by stipulating a factor, say $10^{-5}$, and proceed until the best analysis

262: on the pending heap has an {\small FOM} less than $10^{-5}$ times the probability of

263: the best analysis on the success heap. Sometimes, however, the number

264: of analyses that fall within such a range can be enormous, creating

265: nearly as large a processing burden as the first approach. As a

266: compromise between these two approaches, we stipulated a base beam

267: factor $\alpha$ (usually $10^{-4}$), and the actual beam factor used

268: was $\alpha \ast \beta$, where $\beta$ is the number of analyses on

269: the success heap. Thus, when

270: $\beta$ is small, the beam stays relatively wide, to include as many

271: analyses as possible; but as $\beta$ grows, the beam narrows. We found this

272: to be a simple and successful compromise.
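The compromise beam test can be stated in a few lines. This is a sketch of the stopping condition as we read it from the description above; the function name {\tt beam\_exhausted} and the choice to keep searching while the success heap is empty are assumptions.

\begin{verbatim}
```python
def beam_exhausted(best_pending_fom, best_success_prob, n_success,
                   alpha=1e-4):
    """True when the pending heap can be abandoned for this input item."""
    if n_success == 0:
        return False  # nothing on the success heap yet, keep searching
    beam_factor = alpha * n_success  # alpha * beta in the text
    return best_pending_fom < beam_factor * best_success_prob

# With one success analysis the threshold is 1e-4 * 1e-3 = 1e-7;
# with a hundred it rises to 1e-5, so the same pending analysis is cut.
wide = beam_exhausted(1e-6, 1e-3, n_success=1)
narrow = beam_exhausted(1e-6, 1e-3, n_success=100)
```
\end{verbatim}

Because the threshold scales with $\beta$, the test becomes stricter as analyses accumulate on the success heap.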

273:

274: Of course, with a

275: left-recursive grammar, such a top-down parser may never terminate. If {\it

276: no\/} analysis ever makes it to the success heap, then, however one defines

277: the beam-search, a top-down depth-first search with a left-recursive

278: grammar will

279: never terminate. To avoid this, one must place an upper bound on the

280: number of analyses allowed to be pushed onto the pending heap. If that

281: bound is exceeded, the parse fails. With a left-corner strategy, which

282: is not prey to left recursion, no such upper bound is necessary.

283:

284: \section{Grammar transforms}

285:

286: \newcite{Nijholt80} characterized parsing strategies in terms of {\it announce

287: points\/}: the point at which a parent category is announced

288: (identified) relative to its children, and the point at which the rule

289: expanding the parent is identified. In

290: pure top-down parsing, a parent category and the rule expanding it are

291: announced {\it before\/} any of its children. In pure bottom-up parsing, they

292: are identified {\it after\/} all of the children. Grammar transforms are one

293: method for changing the announce points. In top-down parsing with an

294: appropriately binarized grammar, the parent is identified {\it before\/}, but

295: the rule expanding the parent {\it after\/}, all of the children. Left-corner

296: parsers announce a parent category and its expanding rule {\it after\/} its

297: leftmost child has been completed, but {\it before\/} any of the other

298: children.

299:

300: \subsection{Delaying rule identification through binarization}

301: \begin{table*}

302: \begin{tabular} {|p{.8in}|p{.6in}|p{.65in}|p{.8in}|p{.9in}|p{.7in}|p{.9in}|}

303: \hline

304: {\small Binarization} &

305: {\small Rules in Grammar} &

306: {\small Percent of Sentences Parsed${}^{\ast}$} &

307: {\small Avg. States Considered} &

308: {\small Avg. Labelled Precision and Recall${}^{\dag}$} &

309: {\small Avg. MLP Labelled Prec/Rec${}^{\dag}$} &

310: {\small Ratio of Avg. Prob to Avg. MLP Prob${}^{\dag}$} \\\hline

311: {\small None} &

312: {\small 14962} &

313: {\small 34.16} &

314: {\small 19270} &

315: {\small .65521} &

316: {\small .76427} &

317: {\small .001721} \\\hline

318: {\small LB} &

319: {\small 37955} &

320: {\small 33.99} &

321: {\small 96813} &

322: {\small .65539} &

323: {\small .76095} &

324: {\small .001440} \\\hline

325: {\small RB1} &

326: {\small 29851} &

327: {\small 91.27} &

328: {\small 10140} &

329: {\small .71616} &

330: {\small .72712} &

331: {\small .340858} \\\hline

332: {\small RB0} &

333: {\small 41084} &

334: {\small 97.37} &

335: {\small 13868} &

336: {\small .73207} &

337: {\small .72327} &

338: {\small .443705} \\\hline

339: \end{tabular}

340: {\footnotesize Beam Factor = $10^{-4}$ \hspace*{.18in}

341: ${}^{\ast}$Length $\leq$ 40 (2245 sentences

342: in section 23; avg. length = 21.68) \hspace*{.18in}

343: ${}^{\dag}$Of those sentences parsed}

344: \caption{The effect of different approaches to

345: binarization}\label{tab:bin}

346: \end{table*}

347:

348: Suppose that the category on the top of the stack is an {\small $NP$} and there

349: is a determiner ({\small $DT$}) in the look-ahead. In such a situation, there

350: is no information to distinguish between the rules \begin{small}$NP

351: \rightarrow DT$\hspace*{.1in}$JJ$\hspace*{.1in}$NN$\end{small} and

352: \begin{small}$NP \rightarrow

353: DT$\hspace*{.1in}$JJ$\hspace*{.1in}$NNS$\end{small}.  If the decision

354: can be delayed, however, until such a time as the

355: relevant pre-terminal is in the look-ahead, the parser can make a more

356: informed decision. Grammar binarization is one way to do this, by

357: allowing the parser to use a rule like \begin{small}$NP \rightarrow

358: DT$\hspace*{.1in}$NP$-$DT$\end{small}, where the

359: new non-terminal {\small $NP$-$DT$} can expand into anything that

360: follows a {\small $DT$}

361: in an {\small $NP$}. The expansion of {\small $NP$-$DT$} occurs only

362: after the next pre-terminal is in the look-ahead. Such a delay is

363: essential for an efficient implementation of the kind of incremental

364: parser that we are proposing.

365:

366: There are actually

367: several ways to make a grammar binary, some of which are better than

368: others for our parser. The first distinction that can be drawn is

369: between what we will call {\it left\/} binarization ({\small LB}) versus {\it right\/}

370: binarization ({\small RB}, see figure \ref{fig:bin}). In the former, the leftmost items

371: on the righthand-side of each rule are grouped together; in the

372: latter, the rightmost items on the righthand-side of the rule are

373: grouped together. Notice that, for a top-down, left-to-right parser,

374: {\small RB} is the appropriate transform, because it underspecifies the right

375: siblings. With {\small LB}, a top-down parser must identify all of the

376: siblings before reaching the leftmost item, which does not aid our

377: purposes.

378:

379: Within {\small RB} transforms, however, there is some variation, with

380: respect to how long rule underspecification is maintained. One method

381: is to have the final underspecified category rewrite as a binary rule

382: (hereafter {\small RB2}, see figure \ref{fig:bin}b). Another is to

383: have the final underspecified category rewrite as a unary rule

384: ({\small RB1}, figure \ref{fig:bin}c). The last is to have the final

385: underspecified category rewrite as a nullary rule ({\small RB0},

386: figure \ref{fig:bin}d). Notice that the original motivation

387: for {\small RB}, to delay specification until the relevant items are present

388: in the look-ahead, is not served by {\small RB2}, because the second child

389: must be specified without being present in the look-ahead. {\small RB0} pushes

390: the look-ahead out to the first item in the string {\it after\/} the

391: constituent being expanded, which can be useful in deciding between

392: rules of unequal length, e.g. \begin{small}$NP \rightarrow

393: DT$\hspace*{.1in}$NN$\end{small} and

394: \begin{small}$NP \rightarrow

395: DT$\hspace*{.1in}$NN$\hspace*{.1in}$NN$\end{small}.
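The three right-binarization variants differ only in where the chain of new categories stops. The following sketch reproduces the rules read off figure \ref{fig:bin}(b)--(d) for a single rule; the function name {\tt right\_binarize} and the {\tt final\_arity} parameter are illustrative choices, and applying the same scheme to rules that are already unary is an assumption.

\begin{verbatim}
```python
def right_binarize(lhs, rhs, final_arity):
    """Right-binarize one rule; final_arity is 2 (RB2), 1 (RB1) or 0 (RB0)."""
    rules, parent, rest = [], lhs, list(rhs)
    while len(rest) > final_arity:
        head = rest.pop(0)               # leftmost remaining child
        new_cat = f"{parent}-{head}"     # e.g. NP-DT, NP-DT-JJ
        rules.append((parent, (head, new_cat)))
        parent = new_cat
    rules.append((parent, tuple(rest)))  # binary, unary or nullary final rule
    return rules

rb2 = right_binarize("NP", ["DT", "JJ", "JJ", "NN"], final_arity=2)
rb1 = right_binarize("NP", ["DT", "JJ", "JJ", "NN"], final_arity=1)
rb0 = right_binarize("NP", ["DT", "JJ", "JJ", "NN"], final_arity=0)
```
\end{verbatim}

For the rule shown, {\small RB2} yields three rules, {\small RB1} four and {\small RB0} five, the last of which is the nullary rule for {\small $NP$-$DT$-$JJ$-$JJ$-$NN$}.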

396:

397: Table \ref{tab:bin} summarizes some trials demonstrating the effect of

398: different

399: binarization approaches on parser performance. The grammars were

400: induced from sections 2-21 of the Penn Wall St. Journal Treebank

401: \cite{Marcus93}, and tested on section 23. For each transform

402: tested, every tree in the training corpus was transformed before

403: grammar induction, resulting in a transformed {\small PCFG} and look-ahead

404: probabilities estimated in the standard way. Each parse returned by

405: the parser was de-transformed for evaluation\footnote{See

406: \newcite{Johnson98b} for details of the transform/de-transform

407: paradigm.}. The parser used in each trial was identical, with a base

408: beam factor $\alpha = 10^{-4}$. The performance

409: is evaluated using these measures: (i) the percentage of candidate

410: sentences for which a parse was found (coverage); (ii) the average

411: number of states (i.e. rule expansions) considered per candidate

412: sentence (efficiency); and

413: (iii) the average labelled precision and recall of those sentences for

414: which a parse was found (accuracy). We also used the same grammars

415: with an exhaustive, bottom-up {\small CKY} parser, to ascertain both the

416: accuracy and probability of the maximum likelihood parse ({\small MLP}). We

417: can then additionally compare the parser's performance to the {\small MLP}'s

418: on those same sentences.

419:

420: As expected, {\it left\/} binarization conferred no

421: benefit to our parser. {\it Right\/} binarization, in contrast, improved

422: performance across the board. {\small RB0} provided a substantial improvement

423: in coverage and accuracy over {\small RB1}, with something of a decrease in

424: efficiency. This efficiency hit is partly attributable to the fact that

425: the same tree has more nodes with {\small RB0}. Indeed, the efficiency

426: improvement with right binarization over the standard grammar is even

427: more interesting in light of the great increase in the size of the

428: grammars.

429:

430: It is worth noting at this point that, with the {\small RB0} grammar,

431: this parser is now a viable

432: broad-coverage statistical parser, with good coverage, accuracy, and

433: efficiency\footnote{The very efficient bottom-up statistical parser

434: detailed in \newcite{Charniak98} measured efficiency in terms of total

435: edges {\it popped\/}.  An edge (or, in our case, a parser state) is

436: {\it considered\/} when a probability is calculated for it, and we

437: felt that this was a better efficiency measure than simply those

438: popped.  As a baseline, their parser {\it considered\/} an average of

439: 2216 edges per sentence in section 22 of the WSJ corpus (p.c.).}. Next we consider the left-corner parsing strategy.

440:

441: \subsection{Left-corner parsing}

442: \begin{table*}

443: \begin{tabular} {|p{1.05in}|p{.6in}|p{.6in}|p{.75in}|p{.85in}|p{.65in}|p{.85in}|}

444: \hline

445: {\small Transform} &

446: {\small Rules in Grammar} &

447: {\small Pct. of Sentences Parsed${}^{\ast}$} &

448: {\small Avg. States Considered} &

449: {\small Avg. Labelled Precision and Recall${}^{\dag}$} &

450: {\small Avg. MLP Labelled Prec/Rec${}^{\dag}$} &

451: {\small Ratio of Avg. Prob to Avg. MLP Prob${}^{\dag}$} \\\hline

452: {\small Left Corner (LC)} &

453: {\small 21797} &

454: {\small 91.75} &

455: {\small 9000} &

456: {\small .76399} &

457: {\small .78156} &

458: {\small .175928} \\\hline

459: {\small LB $\circ$ LC} &

460: {\small 53026} &

461: {\small 96.75} &

462: {\small 7865} &

463: {\small .77815} &

464: {\small .78056} &

465: {\small .359828} \\\hline

466: {\small LC $\circ$ RB} &

467: {\small 53494} &

468: {\small 96.7} &

469: {\small 8125} &

470: {\small .77830} &

471: {\small .78066} &

472: {\small .359439} \\\hline

473: {\small LC $\circ$ RB $\circ$ ANN} &

474: {\small 55094} &

475: {\small 96.21} &

476: {\small 7945} &

477: {\small .77854} &

478: {\small .78094} &

479: {\small .346778} \\\hline

480: {\small RB $\circ$ LC} &

481: {\small 86007} &

482: {\small 93.38} &

483: {\small 4675} &

484: {\small .76120} &

485: {\small .80529} &

486: {\small .267330} \\\hline

487: \end{tabular}

488: {\footnotesize Beam Factor = $10^{-4}$ \hspace*{.18in}

489: ${}^{\ast}$Length $\leq$ 40 (2245 sentences

490: in section 23; avg. length = 21.68) \hspace*{.18in}

491: ${}^{\dag}$Of those sentences parsed}

492: \caption{Left Corner Results}\label{tab:left}

493: \end{table*}

494:

495: Left-corner ({\small LC}) parsing \cite{Rosenkrantz70} is a

496: well-known strategy that uses both bottom-up evidence (from the left

497: corner of a rule) and top-down prediction (of the rest of the

498: rule). Rosenkrantz and Lewis showed how to transform a context-free

499: grammar into a grammar that, when used by a top-down parser, follows

500: the same search path as an {\small LC} parser. These {\small LC}

501: grammars allow us to use exactly the same predictive parser to

502: evaluate top-down versus {\small LC}

503: parsing. Naturally, an {\small LC} grammar performs best with our parser when

504: right binarized, for the same reasons outlined above. We use transform

505: composition to apply first one transform, then another to the output

506: of the first. We denote this {\small A} $\circ$ {\small B} where

507: ({\small A} $\circ$ {\small B})(t) = {\small B} ({\small A}

508: (t)). After applying the left-corner transform, we then binarize the

509: resulting grammar\footnote{Given that the LC transform involves

510: nullary productions, the use of RB0 is not needed, i.e. nullary

511: productions need only be introduced from one source.  Thus

512: binarization with left corner is always to unary (RB1).}, i.e. {\small LC} $\circ$ {\small RB}.
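Note that this composition notation applies the leftmost transform first, which is the reverse of the usual mathematical convention. A small sketch makes the order explicit; the helper {\tt compose} and the string-valued stand-in transforms are hypothetical and only illustrate the ordering.

\begin{verbatim}
```python
from functools import reduce

def compose(*transforms):
    """compose(A, B)(t) == B(A(t)): transforms are applied left to right."""
    return lambda t: reduce(lambda acc, f: f(acc), transforms, t)

# Stand-ins that only record the order in which they were applied.
lc = lambda t: t + " -> LC"
rb = lambda t: t + " -> RB"
lc_then_rb = compose(lc, rb)("tree")
```
\end{verbatim}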

513:

514: Another probabilistic {\small LC} parser that has been investigated \cite{Manning97},

515: which utilized an {\small LC} parsing architecture (not a transformed

516: grammar), also got a performance boost through right

517: binarization. This, however, is equivalent to {\small RB} $\circ$

518: {\small LC}, which is a very different grammar from {\small LC}

519: $\circ$ {\small RB}. Given our two binarization orientations ({\small

520: LB} and {\small RB}), there are four possible compositions of

521: binarization and {\small LC} transforms:

522: \begin{center}\begin{small}

523: (a) LB $\circ$ LC (b) RB $\circ$ LC

524: (c) LC $\circ$ LB  (d) LC $\circ$ RB

525: \end{small}\end{center}

526: Table \ref{tab:left} shows left-corner results over various

527: conditions\footnote{Option (c) is not the appropriate kind of

528: binarization for our parser, as argued in the previous section, and so

529: is omitted.}. Interestingly, options (a) and (d) encode the same

530: information, leading to nearly identical performance\footnote{The

531: difference is due to the introduction of vacuous unary rules with

532: RB.}. As stated before, right binarization moves the rule announce

533: point from before to after all of the children. The {\small LC} transform is

534: such that {\small LC} $\circ$ {\small RB}

535: also delays {\it parent\/} identification until after all of the

536: children. The transform {\small LC} $\circ$ {\small RB} $\circ$

537: {\small ANN} moves the parent announce

538: point back to the left corner by introducing unary rules at the left

539: corner that simply identify the parent of the binarized rule. This

540: allows us to test the effect of the position of the parent announce

541: point on the performance of the parser. As we can see, however, the

542: effect is slight, with similar performance on all measures.

543:

544: {\small RB} $\circ$ {\small LC} performs with higher accuracy than the others when used with

545: an exhaustive parser, but seems to require a massive beam in order to

546: even approach performance at the {\small MLP} level. \newcite{Manning97}

547: used a beam width of 40,000 parses on the success heap at each input

548: item, which

549: must have resulted in an order of magnitude more rule expansions

550: than what we have been considering up to now, and yet their average

551: labelled precision and recall (.7875) still fell well below what we

552: found to be the {\small MLP} accuracy (.7987) for the grammar. We are still

553: investigating why this grammar functions so poorly when used by an

554: incremental parser.

555:

556: \subsection{Non-local annotation}

557:

558: \newcite{Johnson98b} discusses the improvement of {\small PCFG} models via the

559: annotation of non-local information onto non-terminal nodes in the

560: trees of the training corpus. One simple example is to

561: copy the parent node onto every non-terminal, e.g. the rule

562: \begin{small}$S \rightarrow NP$\hspace*{.1in}$VP$\end{small} becomes

563: \begin{small}$S \rightarrow

564: NP^{\uparrow}S$\hspace*{.1in}$VP^{\uparrow}S$\end{small}.  The idea

565: here is that

566: the distribution of rules of expansion of a particular non-terminal

567: may differ depending on the non-terminal's parent. Indeed, it was

568: shown that this additional information improves the {\small MLP}

569: accuracy dramatically.
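Parent annotation is a simple tree-to-tree transform applied before grammar induction. The sketch below assumes trees are nested tuples whose first element is the label, and that {\small POS} tags (our terminals) are plain strings that are left unannotated; both choices are assumptions made for the example.

\begin{verbatim}
```python
def parent_annotate(tree, parent=None):
    """Copy each phrasal node's parent label onto it, e.g. NP under S -> NP^S."""
    if isinstance(tree, str):
        return tree  # POS tag leaf: left unannotated (assumption)
    label, *children = tree
    new_label = f"{label}^{parent}" if parent else label
    # Children see the original label, not the annotated one.
    return (new_label, *(parent_annotate(c, label) for c in children))

annotated = parent_annotate(("S", ("NP", "DT", "NN"), ("VP", "VB")))
```
\end{verbatim}

Inducing a {\small PCFG} from the annotated trees then gives separate rule distributions for, say, an {\small NP} under {\small S} and an {\small NP} under {\small VP}.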

570:

571: We looked at two kinds of

572: non-local information annotation: parent ({\small PA}) and left-corner

573: ({\small LCA}). Left-corner parsing gives improved accuracy over top-down or

574: bottom-up parsing with the same grammar. Why? One reason may be that

575: the left-corner ancestor category exerts the same kind of non-local influence

576: upon the parser that the parent category does in parent annotation. To

577: test this, we annotated the left-corner ancestor category onto every

578: leftmost non-terminal category. The results of our annotation trials

579: are shown in table \ref{tab:ann}.

580:

581: \begin{table*}

582: \begin{tabular} {|p{1.05in}|p{.6in}|p{.6in}|p{.75in}|p{.85in}|p{.65in}|p{.85in}|}

583: \hline

584: {\small Transform} &

585: {\small Rules in Grammar} &

586: {\small Pct. of Sentences Parsed${}^{\ast}$} &

587: {\small Avg. States Considered} &

588: {\small Avg. Labelled Precision and Recall${}^{\dag}$} &

589: {\small Avg. MLP Labelled Prec/Rec${}^{\dag}$} &

590: {\small Ratio of Avg. Prob to Avg. MLP Prob${}^{\dag}$} \\\hline

591: {\small RB0} &

592: {\small 41084} &

593: {\small 97.37} &

594: {\small 13868} &

595: {\small .73207} &

596: {\small .72327} &

597: {\small .443705} \\\hline

598: {\small PA $\circ$ RB0} &

599: {\small 63467} &

600: {\small 95.19} &

601: {\small 8596} &

602: {\small .79188} &

603: {\small .79759} &

604: {\small .486995} \\\hline

605: {\small LC $\circ$ RB} &

606: {\small 53494} &

607: {\small 96.7} &

608: {\small 8125} &

609: {\small .77830} &

610: {\small .78066} &

611: {\small .359439} \\\hline

612: {\small LCA $\circ$ RB0} &

613: {\small 58669} &

614: {\small 96.48} &

615: {\small 11158} &

616: {\small .77476} &

617: {\small .78058} &

618: {\small .495912} \\\hline

619: {\small PA $\circ$ LC $\circ$ RB} &

620: {\small 80245} &

621: {\small 93.52} &

622: {\small 4455} &

623: {\small .81144} &

624: {\small .81833} &

625: {\small .484428} \\\hline

626: \end{tabular}

627: {\footnotesize Beam Factor = $10^{-4}$ \hspace*{.18in}

628: ${}^{\ast}$Length $\leq$ 40 (2245 sentences

629: in section 23; avg. length = 21.68) \hspace*{.18in}

630: ${}^{\dag}$Of those sentences parsed}

631: \caption{Non-local annotation results}\label{tab:ann}

632: \end{table*}

633:

634: There are two important points to notice from

635: these results. First, with {\small PA} we get not only the previously reported

636: improvement in accuracy, but additionally a fairly dramatic decrease

637: in the number of parser states that must be visited to find a

638: parse. That is, the non-local information not only improves the final

639: product of the parse, but it guides the parser more quickly to the

640: final product. The annotated grammar has 1.5 times as many rules, and

641: would slow a bottom-up {\small CKY} parser proportionally. Yet our parser

642: actually considers far fewer states en route to the more accurate

643: parse.
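As a quick check against table \ref{tab:ann}, comparing {\small PA} $\circ$ {\small RB0} with {\small RB0}:

\begin{center}
$63467 / 41084 \approx 1.54$ (rules in grammar) \hspace*{.1in}
$8596 / 13868 \approx 0.62$ (avg. states considered)
\end{center}

That is, the grammar grows by roughly half while the average number of states considered falls by roughly 38\%.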

644:

645: Second, left-corner annotation ({\small LCA}) gives nearly all of the accuracy gain of

646: left-corner parsing\footnote{The rest could very well be within

647: noise.}, in support of the hypothesis that the ancestor

648: information was responsible for the observed accuracy

649: improvement. This result suggests that if we can determine the

650: information that is being annotated by the troublesome {\small RB} $\circ$ {\small LC}

651: transform, we may be able to get the accuracy improvement with a

652: relatively narrow beam. Parent-annotation before the {\small LC} transform gave

653: us the best performance of all, with very few states considered on

654: average, and excellent accuracy for a non-lexicalized grammar.

655:

656: \section{Accuracy/Efficiency tradeoff}

657: \begin{figure*}

658: \hspace*{1.1in}

659: \epsfig{file=graph1.eps, width=4.1in}\vspace*{.05in}\\

660: \hspace*{1.1in}

661: \epsfig{file=graph2.eps, width=4.1in}

662: \caption{Changes in performance with beam factor variation} \label{fig:ef1}

663: \end{figure*}

664:

665: \begin{figure*}

666: \hspace*{1.1in}

667: \epsfig{file=graph3.eps, width=4.1in}\vspace*{.05in}\\

668: \hspace*{1.1in}

669: \epsfig{file=graph4.eps, width=4.1in}

670: \caption{Changes in performance with beam factor variation} \label{fig:ef2}

671: \end{figure*}

672:

673: There is something of an

674: accuracy/efficiency tradeoff with regard to the base beam factor. The

675: results given so far were at $10^{-4}$, which works well for the

676: transforms we have investigated. Figures \ref{fig:ef1} and

677: \ref{fig:ef2} show four performance

678: measures for four of our transforms at base beam factors of $10^{-3}$,

679: $10^{-4}$, $10^{-5}$, and $10^{-6}$.  There is a dramatically increasing

680: efficiency burden as

681: the beam widens, with varying degrees of payoff. With the top-down

682: transforms ({\small RB0} and {\small PA} $\circ$ {\small RB0}), the ratio of the average probability

683: to the {\small MLP} probability does improve substantially as the beam grows,

684: yet with only marginal improvements in coverage and

685: accuracy. Increasing the beam seems to do less with the left-corner

686: transforms.

687:

688: \section{Conclusions and Future Research}

689:

690: We have examined several probabilistic predictive parser variations,

691: and have shown the approach in general to be a viable one, both in

692: terms of the quality of the parses, and the efficiency with which they

693: are found. We have shown that the improvement of the grammars with

694: non-local information not only results in better parses, but guides

695: the parser to them much more efficiently, in contrast to dynamic

696: programming methods. Finally, we have shown that the accuracy

697: improvement that has been demonstrated with left-corner approaches can

698: be attributed to the non-local information utilized by the

699: method.

700:

701: This is relevant to the study of the human sentence processing

702: mechanism insofar as it demonstrates that it is possible to have a

703: model which makes explicit the syntactic relationships between items

704: in the input incrementally, while still scaling up to broad-coverage.

705:

706: Future research will include:

707: \begin{list}{$\bullet$}{\setlength{\topsep}{.01in}\setlength{\itemsep}{0in}}

708: \item lexicalization of the parser

709: \item utilization of fully

710: connected trees for additional syntactic and semantic processing

711: \item the use of syntactic predictions in the beam for language modeling

712: \item an examination of predictive parsing with a left-branching language

713: (e.g. German)

714: \end{list}

715: In addition, it may be of interest to the psycholinguistic community

716: if we introduce a time variable into our model, and use it

717: to compare such competing sentence processing models as race-based

718: and competition-based parsing.

719: \bibliography{ber}

720: \end{document}

721: