cs0008017/tdp.tex
1: \documentclass[11pt]{article}
2: \usepackage{colacl, epic, eepic, epsfig}
3: \title{Efficient probabilistic top-down and left-corner parsing${}^{\dag}$}
4: \author{{\bf Brian Roark and Mark Johnson} \\Cognitive and Linguistic
5: Sciences\\Box 1978, Brown University\\Providence, RI  02912, USA\\{\tt
6: brian-roark@brown.edu}\hspace*{.5in}{\tt mj@cs.brown.edu}} 
7: \begin{document}
8: \renewcommand{\thefootnote}{\fnsymbol{footnote}}
9: \maketitle
10: \begin{abstract}
11: This paper examines efficient predictive broad-coverage parsing
12: without dynamic programming. In contrast to bottom-up methods,
13: depth-first top-down parsing produces partial parses that are fully connected
14: trees spanning the entire left context, from which any kind of
15: non-local dependency or partial semantic interpretation can in
16: principle be read. We contrast two predictive parsing approaches,
17: top-down and left-corner parsing, and 
18: find both to be viable. In addition, we find that
19: enhancement with non-local information not only improves parser
20: accuracy, but also substantially improves the search efficiency. 
21: \footnotetext{${}^{\dag}$This material is based on work supported by the National Science Foundation under Grant No. SBR-9720368.} 
22: \end{abstract}
23: \bibliographystyle{acl}
24: 
25: \section{Introduction}
26: \renewcommand{\thefootnote}{\arabic{footnote}}
27: Strong empirical evidence has been presented over the past 15 years
28: indicating that the human sentence processing mechanism makes {\it on-line\/}
29: use of contextual information in the preceding discourse
30: \cite{Crain85,Altmann88,Britt94} and in the
31: visual environment \cite{Tanen95}. These results lend
32: support to Mark Steedman's \shortcite{Steed89} ``intuition'' that sentence
33: interpretation takes place incrementally, and that partial
34: interpretations are being built while the sentence is being
35: perceived. This is a very commonly held view among psycholinguists
36: today. 
37: 
38: Many possible models of human sentence processing can be made
39: consistent with the above view, but the general assumption that must
40: underlie them all is that explicit relationships between lexical items 
41: in the sentence must be specified incrementally.  Such a processing
42: mechanism stands in marked contrast to 
43: dynamic programming parsers, which delay construction of a constituent
44: until all of its sub-constituents have been completed, and whose
45: partial parses thus consist of disconnected tree fragments. For
46: example, such parsers do not integrate a main verb into the same tree
47: structure as its subject {\small NP} until the {\small VP} has been completely parsed,
48: and in many cases this is the final step of the entire parsing
49: process. Without explicit on-line integration, it would be difficult
50: (though not impossible) to produce partial interpretations
51: on-line. Similarly, it may be difficult to use non-local statistical
52: dependencies (e.g. between subject and main verb) to actively guide
53: such parsers. 
54: 
55: Our predictive parser does not use dynamic programming,
56: but rather maintains fully connected trees spanning the entire left
57: context, which make explicit the relationships between constituents
58: required for partial interpretation. The parser uses probabilistic
59: best-first parsing methods to pursue the most likely analyses first,
60: and a beam-search to avoid the non-termination problems typical of
61: non-statistical top-down predictive parsers. 
62: 
63: There are two main
64: results. First, this approach works and, with appropriate attention to
65: specific algorithmic details, is surprisingly efficient. Second, not
66: just accuracy but also efficiency improves as the language model is
67: made more accurate. This bodes well for future research into the use
68: of other non-local (e.g. lexical and semantic) information to guide
69: the parser.   
70: 
71: In addition, we show that the improvement in accuracy
72: associated with left-corner parsing over top-down is attributable to
73: the non-local information supplied by the 
74: strategy, and can thus be obtained through other methods that utilize
75: that same information. 
76: 
77: \section{Parser architecture}
78: 
79: The parser proceeds incrementally from left to right, with one item of
80: look-ahead. Nodes are expanded in a standard top-down, left-to-right
81: fashion. The parser utilizes: (i) a probabilistic context-free grammar
82: ({\small PCFG}), induced via standard relative frequency estimation from a
83: corpus of parse trees; and (ii) look-ahead probabilities as described
84: below. Multiple competing partial parses (or analyses) are held on a
85: priority queue, which we will call the {\it pending\/} heap. They are ranked
86: by a figure of merit ({\small FOM}), which will be discussed below. Each
87: analysis has its own stack of nodes to be expanded, as well as a
88: history, probability, and {\small FOM}. The highest ranked analysis is popped
89: from the pending heap, and the category at the top of its stack is
90: expanded. A category is expanded using every rule which could
91: eventually reach the look-ahead terminal. For every such rule
92: expansion, a new analysis is created\footnote{We count each of these
93: as a parser state (or rule expansion) {\it considered\/}, which can be 
94: used as a measure of efficiency.} and pushed back onto the pending
95: heap. 
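
Relative frequency estimation sets the probability of each rule to the
count of that rule divided by the count of its left-hand side. The
following is a minimal sketch of that computation rather than the
implementation used in the experiments; the tree encoding (nested tuples
with POS tags as string leaves) and the names \texttt{rules\_of} and
\texttt{estimate\_pcfg} are assumptions made for this example.

\begin{verbatim}
```python
from collections import Counter


def rules_of(tree):
    """Yield (lhs, rhs) for every local tree.

    A tree is a tuple (label, child, ...); leaves are POS-tag strings.
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for c in children:
        if not isinstance(c, str):
            yield from rules_of(c)


def estimate_pcfg(trees):
    """Return {(lhs, rhs): count(lhs -> rhs) / count(lhs)}."""
    rule_counts = Counter(r for t in trees for r in rules_of(t))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


# Toy corpus of two trees (hypothetical data, for illustration only).
toy_trees = [
    ("S", ("NP", "DT", "NN"), ("VP", "VBD")),
    ("S", ("NP", "DT", "JJ", "NN"), ("VP", "VBD")),
]
pcfg = estimate_pcfg(toy_trees)
for rule, p in sorted(pcfg.items()):
    print(rule, p)
```
\end{verbatim}

In the toy corpus the two {\small NP} rules each occur once, so each
receives half of the probability mass for {\small NP}.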
96: 
97: The {\small FOM} for an analysis is the product of the probabilities of
98: all {\small PCFG} rules used in its derivation and what we call its look-ahead
99: probability ({\small LAP}). The {\small LAP} approximates the product of the
100: probabilities of the rules that will be required to link the analysis
101: in its current state with the look-ahead terminal\footnote{Since this
102: is a non-lexicalized grammar, we are taking pre-terminal POS markers
103: as our terminal items.}. That is, for a
104: grammar {\small G}, a stack state [{\small $C_{1} \dots C_{n}$}] and a
105: look-ahead terminal item $\omega$: 
106: 
107: \begin{center}(1) $LAP = P_{G}([C_{1} \dots
108: C_{n}] \stackrel{\star}{\rightarrow} \omega\alpha)$\end{center}
109: 
110: We recursively estimate this with two empirically observed conditional
111: probabilities for every non-terminal $C_{i}$ on the stack:
112: $\widehat{P} (C_{i} \stackrel{\star}{\rightarrow} \omega)$
113: and $\widehat{P} (C_{i} \stackrel{\star}{\rightarrow} \epsilon)$.
114: The {\small LAP}
115: approximation for a given stack state and look-ahead terminal is:  
116: 
117: \begin{center}(2) $P_{G}([C_{i} \dots
118: C_{n}] \stackrel{\star}{\rightarrow} \omega\alpha) $\hspace*{.1in} $
119: \approx $\hspace*{.1in} $
120: \widehat{P} (C_{i} \stackrel{\star}{\rightarrow} \omega)$ +\\$
121: \widehat{P} (C_{i} \stackrel{\star}{\rightarrow} \epsilon) *
122: P_{G}([C_{i+1} \dots
123: C_{n}] \stackrel{\star}{\rightarrow} \omega\alpha)$
124: \end{center}
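
The recursion in (2) can be unrolled into a single pass over the stack.
The sketch below is illustrative only: the dictionary names
\texttt{p\_reach} and \texttt{p\_empty} and the toy probabilities are
assumptions, and an exhausted stack is assumed to contribute zero, a
base case that (2) leaves implicit.

\begin{verbatim}
```python
def lap(stack, omega, p_reach, p_empty):
    """Approximate P([C_1 ... C_n] =>* omega alpha) following (2).

    p_reach[(C, omega)] estimates P(C =>* omega ...);
    p_empty[C] estimates P(C =>* epsilon).
    Assumption: an empty stack contributes 0.
    """
    total = 0.0
    skip = 1.0  # probability that all categories seen so far were empty
    for cat in stack:
        total += skip * p_reach.get((cat, omega), 0.0)
        skip *= p_empty.get(cat, 0.0)
        if skip == 0.0:
            break  # deeper categories cannot contribute
    return total


# Hypothetical estimates, for illustration only.
p_reach = {("NP-DT", "NN"): 0.6, ("VP", "VBD"): 0.5}
p_empty = {"NP-DT": 0.1}

print(lap(["NP-DT", "VP"], "NN", p_reach, p_empty))
print(lap(["NP-DT", "VP"], "VBD", p_reach, p_empty))
```
\end{verbatim}

In the second call the look-ahead can only be reached if the topmost
category rewrites as the empty string, so its contribution is scaled by
\texttt{p\_empty} of that category.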
125: 
126: 
127: When the topmost stack category of an analysis matches the look-ahead
128: terminal, the terminal is popped from the stack and the analysis is
129: pushed onto a second priority queue, which we will call the {\it success\/}
130: heap. Once there are ``enough'' analyses on the success heap, all those
131: remaining on the pending heap are discarded. The success heap then
132: becomes the pending heap, and the look-ahead is moved forward to the
133: next item in the input string. When the end of the input string is
134: reached, the analysis with the highest probability and an empty stack
135: is returned as the parse. If no such parse is found, an error is
136: returned. 
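
The control loop just described can be sketched as follows. This is a
simplified illustration, not the parser used in the experiments: the
{\small FOM} is taken to be the derivation probability alone (the
{\small LAP} is omitted), every rule for a category is tried rather than
only those that can reach the look-ahead, nullary rules are not expanded
after the last input item, and the grammar format, the function name
\texttt{parse} and the bound \texttt{max\_expansions} are assumptions.

\begin{verbatim}
```python
import heapq
from itertools import count


def parse(tags, grammar, start="S", alpha=1e-4, max_expansions=10000):
    """Best-first top-down parse of a POS-tag sequence.

    grammar maps a category to a list of (rhs_tuple, probability).
    Returns (probability, rules_used) or None if no parse is found.
    """
    tick = count()  # unique tie-breaker so heap entries never compare stacks
    # Entries are (-fom, tie, prob, stack, history); heapq is a min-heap.
    pending = [(-1.0, next(tick), 1.0, (start,), ())]
    expansions = 0
    for tag in tags:
        success = []
        while pending:
            neg_fom, _, prob, stack, hist = heapq.heappop(pending)
            # Beam test: stop when the best pending FOM drops below
            # alpha * beta * (best probability on the success heap).
            if success and -neg_fom < alpha * len(success) * -success[0][0]:
                break
            if not stack:
                continue  # nothing left to expand, but input remains
            top, rest = stack[0], stack[1:]
            if top == tag:  # matches the look-ahead: move to success heap
                heapq.heappush(success, (-prob, next(tick), prob, rest, hist))
                continue
            for rhs, p in grammar.get(top, ()):
                expansions += 1  # one parser state considered
                if expansions > max_expansions:
                    return None  # upper bound guards against left recursion
                heapq.heappush(pending, (-prob * p, next(tick), prob * p,
                                         rhs + rest, hist + ((top, rhs),)))
        pending = success  # the success heap becomes the pending heap
    complete = [(prob, hist) for _, _, prob, stack, hist in pending if not stack]
    return max(complete, key=lambda x: x[0]) if complete else None


# Toy grammar (hypothetical probabilities, for illustration only).
toy_grammar = {
    "S": [(("NP", "VP"), 1.0)],
    "NP": [(("DT", "NN"), 0.7), (("DT", "JJ", "NN"), 0.3)],
    "VP": [(("VBD",), 0.6), (("VBD", "NP"), 0.4)],
}
result = parse(["DT", "JJ", "NN", "VBD"], toy_grammar)
print(result)
```
\end{verbatim}

Each heap entry carries its own stack and history, so every analysis
that survives to the next input item is a connected tree over the left
context.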
137: 
138: 
139: \begin{figure*}
140: \begin{picture}(95,0)(0,-135)
141: \put(52,-8){(a)}
142: 
143: \put(50,-30){\footnotesize NP}
144: \drawline(57,-34)(38,-44)
145: \put(16,-52){\footnotesize DT+JJ+JJ}
146: \drawline(38,-56)(19,-66)
147: \put(4,-74){\footnotesize DT+JJ}
148: \drawline(19,-78)(7,-88)
149: \put(-0,-96){\footnotesize DT}
150: \drawline(7,-100)(7,-110)
151: \put(1,-118){the}
152: \drawline(19,-78)(30,-88)
153: \put(26,-96){\footnotesize JJ}
154: \drawline(30,-100)(30,-110)
155: \put(24,-118){fat}
156: \drawline(38,-56)(57,-66)
157: \put(53,-74){\footnotesize JJ}
158: \drawline(57,-78)(57,-88)
159: \put(44,-96){happy}
160: \drawline(57,-34)(77,-44)
161: \put(69,-52){\footnotesize NN}
162: \drawline(77,-56)(77,-66)
163: \put(71,-74){cat}
164: \end{picture}
165: \begin{picture}(95,0)(0,-135)
166: \put(20,-8){(b)}
167: 
168: \put(19,-30){\footnotesize NP}
169: \drawline(26,-34)(7,-44)
170: \put(-0,-52){\footnotesize DT}
171: \drawline(7,-56)(7,-66)
172: \put(1,-74){the}
173: \drawline(26,-34)(46,-44)
174: \put(30,-52){\footnotesize NP-DT}
175: \drawline(46,-56)(28,-66)
176: \put(24,-74){\footnotesize JJ}
177: \drawline(28,-78)(28,-88)
178: \put(22,-96){fat}
179: \drawline(46,-56)(63,-66)
180: \put(42,-74){\footnotesize NP-DT-JJ}
181: \drawline(63,-78)(49,-88)
182: \put(45,-96){\footnotesize JJ}
183: \drawline(49,-100)(49,-110)
184: \put(36,-118){happy}
185: \drawline(63,-78)(78,-88)
186: \put(70,-96){\footnotesize NN}
187: \drawline(78,-100)(78,-110)
188: \put(72,-118){cat}
189: \end{picture}
190: \begin{picture}(110,0)(0,-135)
191: \put(22,-8){(c)}
192: 
193: \put(21,-30){\footnotesize NP}
194: \drawline(28,-34)(7,-44)
195: \put(-0,-52){\footnotesize DT}
196: \drawline(7,-56)(7,-66)
197: \put(1,-74){the}
198: \drawline(28,-34)(48,-44)
199: \put(32,-52){\footnotesize NP-DT}
200: \drawline(48,-56)(28,-66)
201: \put(24,-74){\footnotesize JJ}
202: \drawline(28,-78)(28,-88)
203: \put(22,-96){fat}
204: \drawline(48,-56)(68,-66)
205: \put(47,-74){\footnotesize NP-DT-JJ}
206: \drawline(68,-78)(48,-88)
207: \put(44,-96){\footnotesize JJ}
208: \drawline(48,-100)(48,-110)
209: \put(35,-118){happy}
210: \drawline(68,-78)(89,-88)
211: \put(62,-96){\footnotesize NP-DT-JJ-JJ}
212: \drawline(89,-100)(89,-110)
213: \put(81,-118){\footnotesize NN}
214: \drawline(89,-122)(89,-132)
215: \put(82,-140){cat}
216: \end{picture}
217: \begin{picture}(150,135)(0,-135)
218: \put(24,-8){(d)}
219: \put(23,-30){\footnotesize NP}
220: \drawline(30,-34)(7,-44)
221: \put(0,-52){\footnotesize DT}
222: \drawline(7,-56)(7,-66)
223: \put(1,-74){the}
224: \drawline(30,-34)(52,-44)
225: \put(36,-52){\footnotesize NP-DT}
226: \drawline(52,-56)(28,-66)
227: \put(24,-74){\footnotesize JJ}
228: \drawline(28,-78)(28,-88)
229: \put(22,-96){fat}
230: \drawline(52,-56)(77,-66)
231: \put(55,-74){\footnotesize NP-DT-JJ}
232: \drawline(77,-78)(48,-88)
233: \put(44,-96){\footnotesize JJ}
234: \drawline(48,-100)(48,-110)
235: \put(35,-118){happy}
236: \drawline(77,-78)(106,-88)
237: \put(79,-96){\footnotesize NP-DT-JJ-JJ}
238: \drawline(106,-100)(79,-110)
239: \put(71,-118){\footnotesize NN}
240: \drawline(79,-122)(79,-132)
241: \put(72,-140){cat}
242: \drawline(106,-100)(133,-110)
243: \put(96,-118){\footnotesize NP-DT-JJ-JJ-NN}
244: \drawline(133,-122)(133,-132)
245: \put(130,-140){$\epsilon$}
246: \end{picture}
247: \caption{Binarized trees:  (a) left binarized ({\small LB}); (b) right
248: binarized to binary ({\small RB2}); (c) right binarized to unary
249: ({\small RB1}); (d) right binarized to nullary ({\small RB0})}\label{fig:bin}
250: \end{figure*}
251: 
252: The specifics of the beam-search dictate how many analyses
253: on the success heap constitute ``enough''. One approach is to set a
254: constant beam width, e.g. 10,000 analyses on the success heap, at
255: which point the parser 
256: moves to the next item in the input. A problem with this approach is
257: that parses towards the bottom of the success heap may be so unlikely
258: relative to those at the top that they have little or no chance of
259: becoming the most likely parse once the whole string has been processed, so that
260: pursuing them wastes effort. An alternative approach is to dynamically vary the beam width
261: by stipulating a factor, say $10^{-5}$, and proceed until the best analysis
262: on the pending heap has an {\small FOM} less than $10^{-5}$ times the probability of
263: the best analysis on the success heap. Sometimes, however, the number
264: of analyses that fall within such a range can be enormous, creating
265: nearly as large a processing burden as the first approach. As a
266: compromise between these two approaches, we stipulated a base beam
267: factor $\alpha$ (usually $10^{-4}$), and the actual beam factor used
268: was $\alpha \ast \beta$, where $\beta$ is the number of analyses on
269: the success heap. Thus, when 
270: $\beta$ is small, the beam stays relatively wide, to include as many
271: analyses as possible; but as $\beta$ grows, the beam narrows. We found this
272: to be a simple and successful compromise. 
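
To make the narrowing concrete, the stopping condition can be written as
follows, where $F$ (the best {\small FOM} on the pending heap) and $P$
(the probability of the best analysis on the success heap) are symbols
introduced here for illustration only:

\begin{center}
$F < \alpha \ast \beta \ast P$
\end{center}

With $\alpha = 10^{-4}$, the effective factor $\alpha \ast \beta$ is
$10^{-4}$ when $\beta = 1$, $10^{-2}$ when $\beta = 100$, and $1$ when
$\beta = 10^{4}$. In the last case the parser moves to the next input
item as soon as the best pending {\small FOM} falls below the best
probability on the success heap.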
273: 
274: Of course, with a
275: left-recursive grammar, such a top-down parser may never terminate. If {\it 
276: no\/} analysis ever makes it to the success heap, then, however one defines
277: the beam-search, a top-down depth-first search with a left-recursive
278: grammar will 
279: never terminate. To avoid this, one must place an upper bound on the
280: number of analyses allowed to be pushed onto the pending heap. If that
281: bound is exceeded, the parse fails. With a left-corner strategy, which
282: is not prey to left recursion, no such upper bound is necessary. 
283: 
284: \section{Grammar transforms}
285: 
286: \newcite{Nijholt80} characterized parsing strategies in terms of {\it announce
287: points\/}: the point at which a parent category is announced
288: (identified) relative to its children, and the point at which the rule
289: expanding the parent is identified. In 
290: pure top-down parsing, a parent category and the rule expanding it are
291: announced {\it before\/} any of its children. In pure bottom-up parsing, they
292: are identified {\it after\/} all of the children. Grammar transforms are one
293: method for changing the announce points. In top-down parsing with an
294: appropriately binarized grammar, the parent is identified {\it before\/}, but
295: the rule expanding the parent {\it after\/}, all of the children. Left-corner
296: parsers announce a parent category and its expanding rule {\it after\/} its
297: leftmost child has been completed, but {\it before\/} any of the other
298: children. 
299: 
300: \subsection{Delaying rule identification through binarization}
301: \begin{table*}
302: \begin{tabular} {|p{.8in}|p{.6in}|p{.65in}|p{.8in}|p{.9in}|p{.7in}|p{.9in}|}
303: \hline
304: {\small Binarization} &
305: {\small Rules in Grammar} &
306: {\small Percent of Sentences Parsed${}^{\ast}$} &
307: {\small Avg. States Considered} &
308: {\small Avg. Labelled Precision and Recall${}^{\dag}$} &
309: {\small Avg. MLP Labelled Prec/Rec${}^{\dag}$} &
310: {\small Ratio of Avg. Prob to Avg. MLP Prob${}^{\dag}$} \\\hline
311: {\small None} &
312: {\small 14962} &
313: {\small 34.16} &
314: {\small 19270} &
315: {\small .65521} &
316: {\small .76427} &
317: {\small .001721} \\\hline
318: {\small LB} &
319: {\small 37955} &
320: {\small 33.99} &
321: {\small 96813} &
322: {\small .65539} &
323: {\small .76095} &
324: {\small .001440} \\\hline
325: {\small RB1} &
326: {\small 29851} &
327: {\small 91.27} &
328: {\small 10140} &
329: {\small .71616} &
330: {\small .72712} &
331: {\small .340858} \\\hline
332: {\small RB0} &
333: {\small 41084} &
334: {\small 97.37} &
335: {\small 13868} &
336: {\small .73207} &
337: {\small .72327} &
338: {\small .443705} \\\hline
339: \end{tabular}
340: {\footnotesize Beam Factor = $10^{-4}$ \hspace*{.18in}
341: ${}^{\ast}$Length $\leq$ 40 (2245 sentences
342: in section 23, avg. length = 21.68) \hspace*{.18in}
343: ${}^{\dag}$Of those sentences parsed}
344: \caption{The effect of different approaches to
345: binarization}\label{tab:bin}
346: \end{table*}
347: 
348: Suppose that the category on the top of the stack is an {\small $NP$} and there
349: is a determiner ({\small $DT$}) in the look-ahead. In such a situation, there
350: is no information to distinguish between the rules \begin{small}$NP
351: \rightarrow DT$\hspace*{.1in}$JJ$\hspace*{.1in}$NN$\end{small} and
352: \begin{small}$NP \rightarrow
353: DT$\hspace*{.1in}$JJ$\hspace*{.1in}$NNS$\end{small}.  If the decision 
354: can be delayed, however, until such a time as the 
355: relevant pre-terminal is in the look-ahead, the parser can make a more
356: informed decision. Grammar binarization is one way to do this, by
357: allowing the parser to use a rule like \begin{small}$NP \rightarrow
358: DT$\hspace*{.1in}$NP$-$DT$\end{small}, where the
359: new non-terminal {\small $NP$-$DT$} can expand into anything that
360: follows a {\small $DT$}
361: in an {\small $NP$}. The expansion of {\small $NP$-$DT$} occurs only
362: after the next pre-terminal is in the look-ahead. Such a delay is
363: essential for an efficient implementation of the kind of incremental
364: parser that we are proposing.
365: 
366: There are actually
367: several ways to make a grammar binary, some of which are better than
368: others for our parser. The first distinction that can be drawn is
369: between what we will call {\it left\/} binarization ({\small LB}) versus {\it right\/}
370: binarization ({\small RB}, see figure \ref{fig:bin}). In the former, the leftmost items
371: on the righthand-side of each rule are grouped together; in the
372: latter, the rightmost items on the righthand-side of the rule are
373: grouped together. Notice that, for a top-down, left-to-right parser,
374: {\small RB} is the appropriate transform, because it underspecifies the right
375: siblings. With {\small LB}, a top-down parser must identify all of the
376: siblings before reaching the leftmost item, which does not aid our
377: purposes. 
378: 
379: Within {\small RB} transforms, however, there is some variation, with
380: respect to how long rule underspecification is maintained. One method
381: is to have the final underspecified category rewrite as a binary rule
382: (hereafter {\small RB2}, see figure \ref{fig:bin}b). Another is to
383: have the final underspecified category rewrite as a unary rule
384: ({\small RB1}, figure \ref{fig:bin}c). The last is to have the final
385: underspecified category rewrite as a nullary rule ({\small RB0},
386: figure \ref{fig:bin}d). Notice that the original motivation 
387: for {\small RB}, to delay specification until the relevant items are present
388: in the look-ahead, is not served by {\small RB2}, because the second child
389: must be specified without being present in the look-ahead. {\small RB0} pushes
390: the look-ahead out to the first item in the string {\it after\/} the
391: constituent being expanded, which can be useful in deciding between
392: rules of unequal length, e.g. \begin{small}$NP \rightarrow
393: DT$\hspace*{.1in}$NN$\end{small} and  
394: \begin{small}$NP \rightarrow
395: DT$\hspace*{.1in}$NN$\hspace*{.1in}$NN$\end{small}. 
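
The three right-binarization variants can be illustrated on a single
rule. The sketch below is an assumption-laden illustration (the function
name \texttt{right\_binarize} and its \texttt{mode} argument are
hypothetical, and it transforms one rule, whereas a full transform would
apply to whole trees); the new category names follow figure
\ref{fig:bin}.

\begin{verbatim}
```python
def right_binarize(lhs, rhs, mode="RB0"):
    """Return the binarized rules for lhs -> rhs as (lhs, rhs) pairs."""
    if mode not in ("RB2", "RB1", "RB0"):
        raise ValueError(f"unknown mode: {mode}")
    rhs = list(rhs)
    rules, parent = [], lhs
    for i, child in enumerate(rhs):
        remaining = len(rhs) - i
        # RB2 stops underspecifying at the last two children,
        # RB1 at the last child.
        if (mode == "RB2" and remaining <= 2) or \
           (mode == "RB1" and remaining == 1):
            rules.append((parent, tuple(rhs[i:])))
            return rules
        new_cat = lhs + "-" + "-".join(rhs[: i + 1])  # e.g. NP-DT-JJ
        rules.append((parent, (child, new_cat)))
        parent = new_cat
    # RB0: the last new category rewrites as the empty string.
    rules.append((parent, ()))
    return rules


rule = ("NP", ["DT", "JJ", "JJ", "NN"])
for mode in ("RB2", "RB1", "RB0"):
    print(mode, right_binarize(*rule, mode=mode))
```
\end{verbatim}

Only under {\small RB0} does the final decision (that the constituent
has ended) fall after the last child, which is what pushes the
look-ahead past the constituent.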
396: 
397: Table \ref{tab:bin} summarizes some trials demonstrating the effect of
398: different 
399: binarization approaches on parser performance. The grammars were
400: induced from sections 2--21 of the Penn Wall St. Journal Treebank
401: \cite{Marcus93}, and tested on section 23. For each transform
402: tested, every tree in the training corpus was transformed before
403: grammar induction, resulting in a transformed {\small PCFG} and look-ahead
404: probabilities estimated in the standard way. Each parse returned by
405: the parser was de-transformed for evaluation\footnote{See
406: \newcite{Johnson98b} for details of the transform/de-transform
407: paradigm.}. The parser used in each trial was identical, with a base
408: beam factor $\alpha = 10^{-4}$. The performance 
409: is evaluated using these measures: (i) the percentage of candidate
410: sentences for which a parse was found (coverage); (ii) the average
411: number of states (i.e. rule expansions) considered per candidate
412: sentence (efficiency); and 
413: (iii) the average labelled precision and recall of those sentences for
414: which a parse was found (accuracy). We also used the same grammars
415: with an exhaustive, bottom-up {\small CKY} parser, to ascertain both the
416: accuracy and probability of the maximum likelihood parse ({\small MLP}). We
417: can then additionally compare the parser's performance to the {\small MLP}'s
418: on those same sentences. 
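
For readers unfamiliar with the accuracy measure, labelled precision and
recall compare the labelled constituents of a candidate parse with those
of the treebank parse. The sketch below shows the usual per-sentence
computation; the constituent encoding as (label, start, end) triples and
the function name \texttt{labelled\_pr} are assumptions, and details
such as how scores are averaged across sentences are not covered here.

\begin{verbatim}
```python
from collections import Counter


def labelled_pr(gold, test):
    """Return (precision, recall) for two lists of (label, start, end)."""
    g, t = Counter(gold), Counter(test)
    matched = sum((g & t).values())  # multiset intersection
    precision = matched / sum(t.values()) if t else 0.0
    recall = matched / sum(g.values()) if g else 0.0
    return precision, recall


# Hypothetical constituents for a four-word sentence.
gold = [("S", 0, 4), ("NP", 0, 3), ("VP", 3, 4)]
test = [("S", 0, 4), ("NP", 0, 2), ("NP", 2, 3), ("VP", 3, 4)]
print(labelled_pr(gold, test))
```
\end{verbatim}

A constituent counts as correct only if both its label and its span
match a constituent in the treebank parse.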
419: 
420: As expected, {\it left\/} binarization conferred no
421: benefit to our parser. {\it Right\/} binarization, in contrast, improved
422: performance across the board. {\small RB0} provided a substantial improvement
423: in coverage and accuracy over {\small RB1}, with something of a decrease in
424: efficiency. This loss of efficiency is partly attributable to the fact that
425: the same tree has more nodes with {\small RB0}. Indeed, the efficiency
426: improvement with right binarization over the standard grammar is even
427: more interesting in light of the great increase in the size of the
428: grammars. 
429: 
430: It is worth noting at this point that, with the {\small RB0} grammar,
431: this parser is now a viable 
432: broad-coverage statistical parser, with good coverage, accuracy, and
433: efficiency\footnote{The very efficient bottom-up statistical parser
434: detailed in \newcite{Charniak98} measured efficiency in terms of total 
435: edges {\it popped\/}.  An edge (or, in our case, a parser state) is
436: {\it considered\/} when a probability is calculated for it, and we
437: felt that this was a better efficiency measure than simply those
438: popped.  As a baseline, their parser {\it considered\/} an average of
439: 2216 edges per sentence in section 22 of the WSJ corpus (p.c.).}. Next we considered the left-corner parsing strategy.
440: 
441: \subsection{Left-corner parsing}
442: \begin{table*}
443: \begin{tabular} {|p{1.05in}|p{.6in}|p{.6in}|p{.75in}|p{.85in}|p{.65in}|p{.85in}|}
444: \hline
445: {\small Transform} &
446: {\small Rules in Grammar} &
447: {\small Pct. of Sentences Parsed${}^{\ast}$} &
448: {\small Avg. States Considered} &
449: {\small Avg. Labelled Precision and Recall${}^{\dag}$} &
450: {\small Avg. MLP Labelled Prec/Rec${}^{\dag}$} &
451: {\small Ratio of Avg. Prob to Avg. MLP Prob${}^{\dag}$} \\\hline
452: {\small Left Corner (LC)} &
453: {\small 21797} &
454: {\small 91.75} &
455: {\small 9000} &
456: {\small .76399} &
457: {\small .78156} &
458: {\small .175928} \\\hline
459: {\small LB $\circ$ LC} &
460: {\small 53026} &
461: {\small 96.75} &
462: {\small 7865} &
463: {\small .77815} &
464: {\small .78056} &
465: {\small .359828} \\\hline
466: {\small LC $\circ$ RB} &
467: {\small 53494} &
468: {\small 96.7} &
469: {\small 8125} &
470: {\small .77830} &
471: {\small .78066} &
472: {\small .359439} \\\hline
473: {\small LC $\circ$ RB $\circ$ ANN} &
474: {\small 55094} &
475: {\small 96.21} &
476: {\small 7945} &
477: {\small .77854} &
478: {\small .78094} &
479: {\small .346778} \\\hline
480: {\small RB $\circ$ LC} &
481: {\small 86007} &
482: {\small 93.38} &
483: {\small 4675} &
484: {\small .76120} &
485: {\small .80529} &
486: {\small .267330} \\\hline
487: \end{tabular}
488: {\footnotesize Beam Factor = $10^{-4}$ \hspace*{.18in}
489: ${}^{\ast}$Length $\leq$ 40 (2245 sentences
490: in section 23, avg. length = 21.68) \hspace*{.18in}
491: ${}^{\dag}$Of those sentences parsed}
492: \caption{Left-corner results}\label{tab:left}
493: \end{table*}
494: 
495: Left-corner ({\small LC}) parsing \cite{Rosenkrantz70} is a
496: well-known strategy that uses both bottom-up evidence (from the left
497: corner of a rule) and top-down prediction (of the rest of the
498: rule). Rosenkrantz and Lewis showed how to transform a context-free
499: grammar into a grammar that, when used by a top-down parser, follows
500: the same search path as an {\small LC} parser. These {\small LC}
501: grammars allow us to use exactly the same predictive parser to
502: evaluate top-down versus {\small LC} 
503: parsing. Naturally, an {\small LC} grammar performs best with our parser when
504: right binarized, for the same reasons outlined above. We use transform
505: composition to apply first one transform, then another to the output
506: of the first. We denote this {\small A} $\circ$ {\small B} where
507: ({\small A} $\circ$ {\small B})(t) = {\small B} ({\small A}
508: (t)). After applying the left-corner transform, we then binarize the
509: resulting grammar\footnote{Given that the LC transform involves
510: nullary productions, the use of RB0 is not needed, i.e. nullary
511: productions need only be introduced from one source.  Thus
512: binarization with left corner is always to unary (RB1).}, i.e. {\small LC} $\circ$ {\small RB}. 
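
Note that this composition order runs left to right, the reverse of the
usual mathematical convention in which the rightmost function is applied
first. A small sketch makes the convention explicit; the helper
\texttt{compose} and the string-tagging stand-ins for the transforms are
hypothetical and serve only to show the order of application.

\begin{verbatim}
```python
def compose(*transforms):
    """Apply transforms left to right: compose(A, B)(t) == B(A(t))."""
    def composed(t):
        for f in transforms:
            t = f(t)
        return t
    return composed


# Stand-ins that merely record the order in which they were applied.
def lc(t):
    return f"LC({t})"


def rb(t):
    return f"RB({t})"


lc_then_rb = compose(lc, rb)  # corresponds to LC o RB in the text
rb_then_lc = compose(rb, lc)  # corresponds to RB o LC in the text
print(lc_then_rb("G"), rb_then_lc("G"))
```
\end{verbatim}

The two orders produce different results, which is why the order of
composition has to be stated.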
513: 
514: Another probabilistic {\small LC} parser that has been investigated \cite{Manning97},
515: which utilized an {\small LC} parsing architecture (not a transformed
516: grammar), also got a performance boost through right 
517: binarization. This, however, is equivalent to {\small RB} $\circ$
518: {\small LC}, which is a very different grammar from {\small LC}
519: $\circ$ {\small RB}. Given our two binarization orientations ({\small
520: LB} and {\small RB}), there are four possible compositions of 
521: binarization and {\small LC} transforms: 
522: \begin{center}\begin{small}
523: (a) LB $\circ$ LC (b) RB $\circ$ LC
524: (c) LC $\circ$ LB  (d) LC $\circ$ RB 
525: \end{small}\end{center}
526: Table \ref{tab:left} shows left-corner results over various
527: conditions\footnote{Option (c) is not the appropriate kind of
528: binarization for our parser, as argued in the previous section, and so 
529: is omitted.}. Interestingly, options (a) and (d) encode the same
530: information, leading to nearly identical performance\footnote{The
531: difference is due to the introduction of vacuous unary rules with
532: RB.}. As stated before, right binarization moves the rule announce
533: point from before to after all of the children. The {\small LC} transform is
534: such that {\small LC} $\circ$ {\small RB} 
535: also delays {\it parent\/} identification until after all of the
536: children. The transform {\small LC} $\circ$ {\small RB} $\circ$
537: {\small ANN} moves the parent announce
538: point back to the left corner by introducing unary rules at the left
539: corner that simply identify the parent of the binarized rule. This
540: allows us to test the effect of the position of the parent announce
541: point on the performance of the parser. As we can see, however, the
542: effect is slight, with similar performance on all measures. 
543: 
544: {\small RB} $\circ$ {\small LC} performs with higher accuracy than the others when used with
545: an exhaustive parser, but seems to require a massive beam in order to
546: even approach performance at the {\small MLP} level. \newcite{Manning97}
547: used a beam width of 40,000 parses on the success heap at each input
548: item, which 
549: must have resulted in an order of magnitude more rule expansions
550: than what we have been considering up to now, and yet their average
551: labelled precision and recall (.7875) still fell well below what we
552: found to be the {\small MLP} accuracy (.7987) for the grammar. We are still
553: investigating why this grammar functions so poorly when used by an
554: incremental parser. 
555: 
556: \subsection{Non-local annotation}
557: 
558: \newcite{Johnson98b} discusses the improvement of {\small PCFG} models via the
559: annotation of non-local information onto non-terminal nodes in the
560: trees of the training corpus. One simple example is to
561: copy the parent node onto every non-terminal, e.g. the rule
562: \begin{small}$S \rightarrow NP$\hspace*{.1in}$VP$\end{small} becomes 
563: \begin{small}$S \rightarrow
564: NP^{\uparrow}S$\hspace*{.1in}$VP^{\uparrow}S$\end{small}.  The idea
565: here is that 
566: the distribution of rules of expansion of a particular non-terminal
567: may differ depending on the non-terminal's parent. Indeed, it was
568: shown that this additional information improves the {\small MLP}
569: accuracy dramatically.
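
Parent annotation is easy to state as a tree transform. The sketch
below is illustrative only and makes several assumptions: trees are
nested tuples, POS-tag leaves are left unannotated (they serve as
terminals here), the root has no parent and is left unchanged, and the
annotated label is written with a caret in place of the arrow notation
used above. The function name \texttt{parent\_annotate} is hypothetical.

\begin{verbatim}
```python
def parent_annotate(tree, parent=None):
    """Return a copy of tree with each non-terminal labelled label^parent.

    A tree is a tuple (label, child, ...); leaves are POS-tag strings.
    """
    label, *children = tree
    new_label = f"{label}^{parent}" if parent else label
    return (new_label,
            *(c if isinstance(c, str) else parent_annotate(c, label)
              for c in children))


toy_tree = ("S", ("NP", "DT", "NN"), ("VP", "VBD", ("NP", "DT", "NN")))
annotated = parent_annotate(toy_tree)
print(annotated)
```
\end{verbatim}

In the toy tree the subject and object {\small NP} nodes receive
different labels after annotation, so rules expanding them are counted
separately when the grammar is induced.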
570: 
571: We looked at two kinds of
572: non-local information annotation: parent ({\small PA}) and left-corner
573: ({\small LCA}). Left-corner parsing gives improved accuracy over top-down or
574: bottom-up parsing with the same grammar. Why? One reason may be that
575: the ancestor category exerts the same kind of non-local influence
576: upon the parser that the parent category does in parent annotation. To
577: test this, we annotated the left-corner ancestor category onto every
578: leftmost non-terminal category. The results of our annotation trials
579: are shown in table \ref{tab:ann}. 
580: 
581: \begin{table*}
582: \begin{tabular} {|p{1.05in}|p{.6in}|p{.6in}|p{.75in}|p{.85in}|p{.65in}|p{.85in}|}
583: \hline
584: {\small Transform} &
585: {\small Rules in Grammar} &
586: {\small Pct. of Sentences Parsed${}^{\ast}$} &
587: {\small Avg. States Considered} &
588: {\small Avg. Labelled Precision and Recall${}^{\dag}$} &
589: {\small Avg. MLP Labelled Prec/Rec${}^{\dag}$} &
590: {\small Ratio of Avg. Prob to Avg. MLP Prob${}^{\dag}$} \\\hline
591: {\small RB0} &
592: {\small 41084} &
593: {\small 97.37} &
594: {\small 13868} &
595: {\small .73207} &
596: {\small .72327} &
597: {\small .443705} \\\hline
598: {\small PA $\circ$ RB0} &
599: {\small 63467} &
600: {\small 95.19} &
601: {\small 8596} &
602: {\small .79188} &
603: {\small .79759} &
604: {\small .486995} \\\hline
605: {\small LC $\circ$ RB} &
606: {\small 53494} &
607: {\small 96.7} &
608: {\small 8125} &
609: {\small .77830} &
610: {\small .78066} &
611: {\small .359439} \\\hline
612: {\small LCA $\circ$ RB0} &
613: {\small 58669} &
614: {\small 96.48} &
615: {\small 11158} &
616: {\small .77476} &
617: {\small .78058} &
618: {\small .495912} \\\hline
619: {\small PA $\circ$ LC $\circ$ RB} &
620: {\small 80245} &
621: {\small 93.52} &
622: {\small 4455} &
623: {\small .81144} &
624: {\small .81833} &
625: {\small .484428} \\\hline
626: \end{tabular}
627: {\footnotesize Beam Factor = $10^{-4}$ \hspace*{.18in}
628: ${}^{\ast}$Length $\leq$ 40 (2245 sentences
629: in section 23, avg. length = 21.68) \hspace*{.18in}
630: ${}^{\dag}$Of those sentences parsed}
631: \caption{Non-local annotation results}\label{tab:ann}
632: \end{table*}
633: 
634: There are two important points to notice from
635: these results. First, with {\small PA} we get not only the previously reported
636: improvement in accuracy, but additionally a fairly dramatic decrease
637: in the number of parser states that must be visited to find a
638: parse. That is, the non-local information not only improves the final
639: product of the parse, but it guides the parser more quickly to the
640: final product. The annotated grammar has 1.5 times as many rules, and
641: would slow a bottom-up {\small CKY} parser proportionally. Yet our parser
642: actually considers far fewer states en route to the more accurate
643: parse. 
644: 
645: Second, {\small LC}-annotation gives nearly all of the accuracy gain of
646: left-corner parsing\footnote{The rest could very well be within
647: noise.}, in support of the hypothesis that the ancestor 
648: information was responsible for the observed accuracy
649: improvement. This result suggests that if we can determine the
650: information that is being annotated by the troublesome {\small RB} $\circ$ {\small LC}
651: transform, we may be able to get the accuracy improvement with a
652: relatively narrow beam. Parent-annotation before the {\small LC} transform gave
653: us the best performance of all, with very few states considered on
654: average, and excellent accuracy for a non-lexicalized grammar. 
655: 
656: \section{Accuracy/Efficiency tradeoff}
657: \begin{figure*}
658: \hspace*{1.1in}
659: \epsfig{file=graph1.eps, width=4.1in}\vspace*{.05in}\\
660: \hspace*{1.1in}
661: \epsfig{file=graph2.eps, width=4.1in}
662: \caption{Changes in performance with beam factor variation} \label{fig:ef1}
663: \end{figure*}
664: 
665: \begin{figure*}
666: \hspace*{1.1in}
667: \epsfig{file=graph3.eps, width=4.1in}\vspace*{.05in}\\
668: \hspace*{1.1in}
669: \epsfig{file=graph4.eps, width=4.1in}
670: \caption{Changes in performance with beam factor variation} \label{fig:ef2}
671: \end{figure*}
672: 
673: One point that deserves to be made is that there is something of an
674: accuracy/efficiency tradeoff with regard to the base beam factor. The
675: results given so far were at $10^{-4}$, which works well for the
676: transforms we have investigated. Figures \ref{fig:ef1} and
677: \ref{fig:ef2} show four performance 
678: measures for four of our transforms at base beam factors of $10^{-3}$, 
679: $10^{-4}$, $10^{-5}$, and $10^{-6}$.  There is a dramatically increasing
680: efficiency burden as 
681: the beam widens, with varying degrees of payoff. With the top-down
682: transforms ({\small RB0} and {\small PA} $\circ$ {\small RB0}), the ratio of the average probability
683: to the {\small MLP} probability does improve substantially as the beam grows,
684: yet with only marginal improvements in coverage and
685: accuracy. Increasing the beam seems to do less with the left-corner
686: transforms. 
687: 
688: \section{Conclusions and Future Research}
689: 
690: We have examined several probabilistic predictive parser variations,
691: and have shown the approach in general to be a viable one, both in
692: terms of the quality of the parses, and the efficiency with which they
693: are found. We have shown that the improvement of the grammars with
694: non-local information not only results in better parses, but guides
695: the parser to them much more efficiently, in contrast to dynamic
696: programming methods. Finally, we have shown that the accuracy
697: improvement that has been demonstrated with left-corner approaches can
698: be attributed to the non-local information utilized by the
699: method. 
700: 
701: This is relevant to the study of the human sentence processing
702: mechanism insofar as it demonstrates that it is possible to have a
703: model which makes explicit the syntactic relationships between items
704: in the input incrementally, while still scaling up to broad-coverage. 
705: 
706: Future research will include:
707: \begin{list}{$\bullet$}{\setlength{\topsep}{.01in}\setlength{\itemsep}{0in}}
708: \item lexicalization of the parser
709: \item utilization of fully
710: connected trees for additional syntactic and semantic processing
711: \item the use of syntactic predictions in the beam for language modeling
712: \item an examination of predictive parsing with a left-branching language
713: (e.g. German)
714: \end{list}
715: In addition, it may be of interest to the psycholinguistic community
716: if we introduce a time variable into our model, and use it
717: to compare such competing sentence processing models as race-based
718: and competition-based parsing.
719: \bibliography{ber}
720: \end{document}
721: