1: \documentclass[twoside,11pt]{article}
2:
3: % Any additional packages needed should be included after jmlr2e.
4: % Note that jmlr2e.sty includes epsfig, amssymb, natbib and graphicx,
5: % and defines many common macros, such as 'proof' and 'example'.
6: %
7: % It also sets the bibliographystyle to plainnat; for more information on
8: % natbib citation styles, see the natbib documentation, a copy of which
9: % is archived at http://www.jmlr.org/format/natbib.pdf
10:
11: \usepackage{jmlr2e}
12:
13: % Definitions of handy macros can go here
14:
15: % Heading arguments are {volume}{year}{pages}{submitted}{published}{authors}
16:
17: \jmlrheading{2}{2002}{559-594}{9/01}{3/02}{Erik F. Tjong Kim Sang}
18: \ShortHeadings{Memory-Based Shallow Parsing}{Tjong Kim Sang}
19: \firstpageno{559}
20:
21: \begin{document}
22:
23: \title{Memory-Based Shallow Parsing}
24:
25: \author{\name Erik F. Tjong Kim Sang \email erikt@uia.ua.ac.be \\
26: \addr CNTS - Language Technology Group \\
27: University of Antwerp\\
28: Universiteitsplein 1\\
29: B-2610 Wilrijk, Belgium}
30:
31: \editor{James Hammerton, Miles Osborne, Susan Armstrong and
32: Walter Daelemans}
33:
34: \maketitle
35:
36: \begin{abstract}% <- trailing '%' for backward compatibility of .sty file
37: We present memory-based learning approaches to shallow parsing and
38: apply these to five tasks: base noun phrase identification, arbitrary
39: base phrase recognition, clause detection, noun phrase parsing and
40: full parsing.
41: We use feature selection techniques and system combination methods
42: for improving the performance of the memory-based learner.
Our approach is evaluated on standard data sets and the results are
compared with those of other systems.
45: This reveals that our approach works well for base phrase
46: identification while its application towards recognizing embedded
47: structures leaves some room for improvement.
48: \end{abstract}
49:
50: \begin{keywords}
51: shallow parsing,
52: memory-based learning,
53: feature selection,
54: system combination
55: \end{keywords}
56:
57: \section{Introduction}
58:
59: Memory-based learners classify data based on their similarity to
60: data that they have seen earlier.
61: They have been used for a variety of natural language
62: processing tasks with good results, for example for
63: grapheme-to-phoneme conversion \citep{hoste1999clin},
64: stress assignment \citep{daelemans1994cl} and
65: word class tagging \citep{hvh2001}.
66: These natural language processing tasks are classification tasks: they
67: require an assignment of a class to each character or to each word.
68: Shallow parsing is more complicated than that: it requires sequences
69: of words to be grouped together and be classified.
70:
71: We believe that all natural language tasks can be performed successfully
72: by memory-based learners.
73: Identifying and classifying sequences of words can be converted to a
74: classification task by using special tag sets, for example the IOB
75: tag set proposed by \cite{ramshaw95}.
76: Parsing requires different processing levels and these can be
77: simulated by cascading several memory-based learners which have
78: been trained on different subtasks \citep{daelemans95}.
79: The idea of using memory-based methods for processing natural language
80: has recently led to the emergence of a new paradigm: Memory-Based
81: Language Processing (MBLP) to which a special issue of the Journal of
82: Experimental \& Theoretical Artificial Intelligence was devoted
83: \citep{daelemans99b}.
84:
85: The goal of this paper is to test the theoretic ideas about
86: memory-based learning applied to natural language tasks, in
87: particular its application to shallow parsing.
88: We will implement the ideas of \cite{daelemans95}, show what problems
need to be solved, test memory-based shallow parsers and compare their
performance with that of other systems.
91: The tasks which we will examine are identification of base noun
92: phrases, recognition of phrases of arbitrary types, finding clauses,
93: discovering embedded noun phrases and full parsing.
94: Memory-based learning performs well on natural language tasks that
95: require output that has relatively little structure.
96: In this paper we will investigate whether we can obtain equally good
97: results when it is applied to tasks requiring more complex outputs.
98:
99: \section{Approach}
100:
101: In our approach we will use three techniques.
102: We will use memory-based learning as base classification method for
103: assigning linguistic classes to data.
We will attempt to solve a weakness of this approach, its inability to
disregard irrelevant features, by using an additional feature selection method.
106: Finally, we will examine the combination of several learners in order
107: to obtain an extra performance boost.
108: This section also contains information about evaluation and system
109: configuration for performing parameter tuning.
110:
111: \subsection{Memory-Based Learning}
112:
113: The basic idea behind memory-based learning is that concepts can be
114: classified by their similarity with previously seen concepts.
115: In a memory-based system, learning amounts to storing the training
116: data items.
117: The strength of such a system lies in its capability to compute the
118: similarity between a new data item and the training data items.
The simplest similarity metric is the overlap metric
\citep{timbl2000}.
It compares corresponding features of the data items and adds 1 to a
similarity rate when they are different.
The similarity between two data items is thus represented by a number
between zero and the number of features, $n$, in which value zero
corresponds with an exact match and $n$ corresponds with two items
which share no feature value; a smaller value means a closer match.
127: Here is an example:
128:
129: \begin{center}
130: \begin{tabular}{ccccc}
131: TRAIN1 & man & saw & the & V\\
132: TRAIN2 & the & saw & . & N\\
133: TEST1 & boy & saw & the & ?\\
134: \end{tabular}
135: \end{center}
136:
137: \noindent
138: It contains two training items of a part-of-speech (POS) tagger and
139: one test item for which we want to obtain a POS tag.
140: Each item contains three features: the word that needs to be tagged
141: ({\it saw}) and the preceding and the next word.
142: In order to find the best POS tag for the test item, we compare its
143: features with the features of the training data items.
144: The test item shares two features with the first training data item
145: and one with the second.
146: The similarity value for the first training data item (1) is smaller
147: than that of the second (2) and therefore the overlap metric will
148: prefer the first.
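
The comparison above can be reproduced with a few lines of Python.
This is a minimal sketch of the overlap metric for illustration only,
not the TiMBL implementation; the function name
\texttt{overlap\_distance} is our own choice.

```python
def overlap_distance(a, b):
    """Count the feature positions at which two items differ."""
    if len(a) != len(b):
        raise ValueError("items must have the same number of features")
    return sum(1 for x, y in zip(a, b) if x != y)


# Items from the example: (preceding word, focus word, next word) -> POS tag.
train = [
    (("man", "saw", "the"), "V"),  # TRAIN1
    (("the", "saw", "."), "N"),    # TRAIN2
]
test_item = ("boy", "saw", "the")  # TEST1

distances = [overlap_distance(features, test_item) for features, _ in train]
nearest_features, nearest_tag = min(
    train, key=lambda item: overlap_distance(item[0], test_item)
)
print(distances)    # [1, 2]
print(nearest_tag)  # V
```

The test item is therefore assigned the tag of the first training
item.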
149:
150: A weakness of the overlap metric is that it regards all features as
151: equally valuable for computing similarity values.
152: Generally some features are more important than others.
153: For example, when we add a line ``TRAIN3 boy and the C'' to our
154: training data, the overlap metric will regard this new item as equally
155: important as the first training item.
156: Both the first and the third training item share two feature values
157: with the test item but we would like the third to receive a lower
similarity value because it does not contain the word for which we
want to find a POS tag ({\it saw}).
160: In order to accomplish this, we assign weights to the features
161: in such a way that the second feature receives a higher weight than
162: the other two.
163:
164: The method which we use to assign weights to the features is called
165: Gain Ratio, a normalized variant of information gain
166: \citep{timbl2000}.
167: It estimates feature weights by examining the training data and
168: determines for each feature how much information it contributes
169: to the knowledge of the classes of the training data items.
170: The weights are normalized in order to account for features with
171: many different values.
172: The Gain Ratio computation of the weights is summarized in the
173: following formulas:
174:
175: \begin{equation}
176: w_i = \frac{H(C)-\sum_{v\in V_i}P(v)\times H(C\mid v)}{H(V_i)}
177: \end{equation}
178:
179: \begin{equation}
H(X) = - \sum_{x\in X} P(x)\log_2 P(x)
181: \end{equation}
182:
183: \noindent
184: Here $w_i$ is the weight of feature $i$, $C$ the set of class
185: values and $V_i$ the set of values that feature $i$ can take.
186: $H(C)$ and $H(V_i)$ are the entropy of the sets $C$ and $V_i$
187: respectively and $H(C\mid v)$ is the entropy of the subset of elements
188: of $C$ that are associated with value $v$ of feature $i$.
189: $P(v)$ is the probability that feature $i$ has value $v$.
The normalization factor $H(V_i)$ was introduced to prevent
features with low generalization capacities, like identification
codes, from obtaining large weights.
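
The two formulas can be sketched in Python as follows.
The code estimates all probabilities by relative frequencies in the
training data, which is an assumption on our part, and the toy data
set and the names \texttt{entropy} and \texttt{gain\_ratio} are
invented for this illustration; it is not the TiMBL implementation.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """H(X) = -sum_x P(x) log2 P(x), estimated from a list of observations."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(items, classes, i):
    """Weight w_i of feature i: information gain divided by H(V_i)."""
    by_value = defaultdict(list)
    for features, cls in zip(items, classes):
        by_value[features[i]].append(cls)
    total = len(items)
    # sum over v of P(v) * H(C | v)
    remainder = sum(len(c) / total * entropy(c) for c in by_value.values())
    split_info = entropy([features[i] for features in items])  # H(V_i)
    if split_info == 0.0:
        return 0.0  # a constant feature carries no information
    return (entropy(classes) - remainder) / split_info


# Invented toy data: feature 0 predicts the class, feature 1 is an
# identification code, feature 2 is constant.
items = [("a", "id1", "x"), ("a", "id2", "x"),
         ("b", "id3", "x"), ("b", "id4", "x")]
classes = ["N", "N", "V", "V"]
weights = [gain_ratio(items, classes, i) for i in range(3)]
print(weights)  # [1.0, 0.5, 0.0]
```

In this toy example the identification code has the same information
gain as the predictive feature, but the normalization by $H(V_i)$
halves its weight.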
193:
194: The memory-based learning software which we have used in our
195: experiments, TiMBL \citep{timbl2000}, contains several algorithms
196: with different parameters.
197: In this paper we have restricted ourselves to using a single algorithm
198: (k nearest neighbor classification) with a constant parameter setting.
199: It would be interesting to evaluate every algorithm with all of its
200: parameters but this would require a lot of extra work.
201: We have changed only one parameter of the nearest neighbor algorithm
202: from its default value: the size of the nearest neighborhood region.
203: The learning algorithm computes the distance between the test item and
204: the training items.
By default, the test item will receive the most frequent classification of
the nearest training items (nearest neighborhood size is 1).
207: \cite{daelemans99} show that using a larger neighborhood is harmful
208: for classification accuracy for three language tasks but not for
209: noun phrase chunking, a task which is central to this paper.
210: In our experiments we have found that using the three nearest
211: sets of data items leads to a better performance than using only
212: the nearest data items.
213: This increase of the neighborhood size used leads to a form of
214: smoothing which can get rid of the influence of some data
215: inconsistencies and exceptions.
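
The smoothing effect of a larger neighborhood can be illustrated with
a small Python sketch.
We assume here that the ``three nearest sets of data items'' are all
training items found at the three smallest distinct distances; the
feature weights and the training items below are invented, and ties
between classes are broken arbitrarily.

```python
from collections import Counter


def weighted_overlap(a, b, weights):
    """Sum the weights of the features on which two items differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def classify(test_item, train, weights, k=3):
    """Vote among all training items at the k smallest distinct distances."""
    scored = [(weighted_overlap(f, test_item, weights), cls)
              for f, cls in train]
    nearest = sorted({d for d, _ in scored})[:k]
    votes = Counter(cls for d, cls in scored if d in nearest)
    return votes.most_common(1)[0][0]


# Invented data: the exact match carries an exceptional (noisy) tag.
train = [
    (("boy", "saw", "the"), "N"),
    (("man", "saw", "the"), "V"),
    (("girl", "saw", "the"), "V"),
    (("boy", "saw", "a"), "V"),
    (("the", "saw", "."), "N"),
]
weights = (1, 2, 1)  # hypothetical weights, focus word weighted highest
test_item = ("boy", "saw", "the")
print(classify(test_item, train, weights, k=1))  # N
print(classify(test_item, train, weights, k=3))  # V
```

With a neighborhood of size 1 the exceptional item decides the class;
with size 3 it is outvoted by its neighbors.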
216:
217: % general
218: % example
219: % overlap
220: % ig+gr
221: % optimizations
222: % k
223:
224: \subsection{Feature Selection}
225: \label{sec-feat}
226:
227: A disadvantage of the Gain Ratio metric used in memory-based learning
228: is that it computes a weight for a feature without examining other
229: available features.
230: If features are dependent, this will generally not be reflected in
231: their weights.
232: A feature that contains some information about the classification
233: class on its own, but none when another more informative feature is
234: present will receive a non-zero weight.
235: Features which contain little information about the classification
236: class will receive a small weight but a large number of them might
237: still overrule more important features.
238: These two problems will have a negative influence on the
239: classification accuracy, in particular when there are many features
240: available.
241:
242: We have tested the capacity of Gain Ratio to deal with irrelevant
243: features by using it for a simple binary classification problem with
244: extra random features.
245: The problem which we chose is the XOR problem.
246: It contains two binary (0/1) features and a pair of these feature
247: values should be classified as 0 when the values are equal and as
248: 1 when the features are different.
We have created training and test data which contained 100
examples of the four possible patterns (0/0/0, 0/1/1, 1/0/1 and
1/1/0).
252: A memory-based learner which uses Gain Ratio was able to correctly
253: classify all 400 patterns in the training data.
254: After this we added ten random binary features to both the training
255: data and the test data and observed the performance.
256: The average results of 1000 runs can be found in Figure
257: \ref{fig-feats}.
258:
259: % 0 400.00
260: % 1 400.00
261: % 2 393.72
262: % 3 383.31
263: % 4 372.18
264: % 5 352.42
265: % 6 317.03
266: % 7 275.23
267: % 8 242.30
268: % 9 221.69
269: % 10 210.75
270:
271: \begin{figure}[t]
272: \begin{center}
273: \epsffile{jmlr.feats.eps}
274: \end{center}
275: \caption{Average number of correct patterns over 1000 runs of a
276: memory-based learner using the Gain Ratio metric for test data
277: containing 400 XOR patterns after adding 0 to 10 random binary
278: features.
The system still performs perfectly with one random feature, but the
performance degrades when two or more random features are added and
drops to about half of the patterns for 10 extra features.
282: }
283: \label{fig-feats}
284: \end{figure}
285:
286: Without extra features the memory-based learner performs perfectly.
287: Adding a random feature does not harm its performance but after adding
288: two the system only gets 394 of the 400 patterns correct on average.
289: The performance drops for every extra added feature to about 211
290: for 10 extra features which is not much better than randomly guessing
291: the classes.
292: This small experiment shows that Gain Ratio has difficulty with
293: feature sets that contain many irrelevant features.
294: We need an extra method for determining which features are not
295: necessary for obtaining a good performance.
296:
297: \cite{aha94} give a good introduction to methods for selecting
298: relevant features for machine learning tasks.
The methods can be divided into two groups: filters and wrappers.
300: A filter uses an evaluation function for determining which features
301: could be more relevant for a classifier than others.
302: A wrapper finds out if one feature is more important than another by
303: applying the classifier to data with either one of the features and
304: comparing the results.
This requires more time than the filter approach but it generates
306: better feature sets because it cannot suffer from a bias difference
307: which may exist between the evaluation function and the classifier
308: \citep{john94}.
309:
310: Both the filter and the wrapper method start with a set of features
311: and attempt to find a better set by adding or removing features and
312: evaluating the resulting sets.
313: There are two basic methods for moving through the feature space.
314: Forward sequential selection starts with an empty feature set and
315: evaluates all sets containing one feature.
316: After this it selects the one with the best performance and evaluates
317: all sets with two features of which one is the best single feature.
318: Backward sequential selection starts with all features and evaluates all
319: sets with one feature less.
It selects the one with the best performance and then examines
321: all feature sets which can be derived from this one by removing one
322: feature.
323: Both methods continue adding or removing a feature until they
324: cannot improve the performance.
325:
Forward and backward sequential selection are variants of
327: hill-climbing, a well-known search technique in artificial
328: intelligence.
329: As with hill-climbing, a disadvantage of these methods is that they
330: can get stuck in local optima, in this case a non-optimal feature set
331: which cannot be improved with the method used.
332: In order to minimize the influence of local optima, we use a combination
333: of the two methods when examining feature sets: bidirectional
334: hill-climbing \citep{caruana94}.
335: The idea here is to apply both adding a feature and removing a feature
336: at each point in the feature space.
This enables the feature selection method to backtrack from non-optimal
choices.
339: In order to keep processing times down we will start with an empty
340: feature list just like in forward sequential selection.
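
The search procedure can be sketched as follows.
This is a minimal illustration of bidirectional hill-climbing under
our own assumptions: the function \texttt{evaluate} stands for any
scoring function (in a wrapper it would train and test the classifier
on the given feature set), and the toy scoring function and feature
names are invented.

```python
def bidirectional_hill_climbing(all_features, evaluate):
    """Greedy search that may add or remove one feature at each step.

    `evaluate` maps a frozenset of features to a score (higher is better).
    """
    current = frozenset()            # start empty, as in forward selection
    best_score = evaluate(current)
    while True:
        # Neighbours: every set that differs from `current` in one feature
        # (the feature is added when absent and removed when present).
        neighbours = [current ^ {f} for f in all_features]
        scored = [(evaluate(candidate), candidate) for candidate in neighbours]
        if not scored:
            return current, best_score
        score, candidate = max(scored, key=lambda pair: pair[0])
        if score <= best_score:      # no neighbour improves: local optimum
            return current, best_score
        current, best_score = candidate, score


# Invented toy score: useful features help, any other feature hurts.
USEFUL = {"word", "pos"}


def toy_score(feature_set):
    return len(feature_set & USEFUL) - 0.5 * len(feature_set - USEFUL)


selected, score = bidirectional_hill_climbing(["word", "pos", "noise"], toy_score)
print(sorted(selected), score)  # ['pos', 'word'] 2.0
```

Because a step is only taken when the score strictly improves, the
search always terminates.
The toy score is too simple to require backtracking, but the same loop
will remove an earlier choice whenever that yields the best
neighboring set.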
341:
342: % devijver and kittler
343: % john, kohavi and pfleger
344:
345: \subsection{System Combination}
346: \label{sec-combi}
347:
348: When different machine learning systems are applied to the same task,
349: they will make different errors.
350: The combined results of these systems can be used for generating an
351: analysis for the task that is usually better than that of any of the
352: participating systems, for example by choosing pattern analyses
353: selected by the majority of the systems.
This approach will eliminate errors that are made by a minority of the
systems.
356: Here is a made-up example:
357: suppose we have five systems, c$_1$ - c$_5$, which assign binary
358: classes to patterns.
359: Their output for eight patterns, p$_1$ - p$_8$, is as follows:
360:
361: \begin{center}
362: \begin{tabular}{r|ccccc|l}
363: & c$_1$ & c$_2$ & c$_3$ & c$_4$ & c$_5$ & correct \\\hline
364: p$_1$ & 0 & 0 & 0 & 0 & 0 & 0 \\
365: p$_2$ & 1 & 1 & 1 & 1 & 1 & 1 \\
366: p$_3$ & 0 & 0 & 0 & 0 & 0 & 0 \\
367: p$_4$ & 1 & 0 & 1 & 1 & 1 & 1 \\
368: p$_5$ & 0 & 0 & 1 & 0 & 0 & 0 \\
369: p$_6$ & 1 & 1 & 1 & 1 & 0 & 1 \\
370: p$_7$ & 1 & 0 & 0 & 0 & 0 & 0 \\
371: p$_8$ & 1 & 1 & 1 & 0 & 1 & 1 \\
372: \end{tabular}
373: \end{center}
374:
375: \noindent
Each of the five systems makes exactly one error.
377: We can use a combination of the five by choosing the class
378: that has been predicted most frequently for each pattern.
379: For the first three patterns this will not make a difference because
380: all systems predict the same class.
381: For pattern 4 we will choose class 1, thereby eliminating an error of
382: classifier 2.
383: Pattern 5 will be associated with class 0, thus eliminating classifier
384: 3's only error.
385: Patterns 6, 7 and 8 will receive classes 1, 0 and 1 respectively,
386: thereby eliminating errors of classifiers 5, 1 and 4.
387: Thus the majority choice will generate a perfect analysis of the data.
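
The example can be checked mechanically.
The following Python sketch copies the table above and applies a
majority vote to each pattern; the variable and function names are
our own.

```python
from collections import Counter

# Rows: patterns p1..p8; columns: outputs of c1..c5 (copied from the table).
outputs = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1],
]
correct = [0, 1, 0, 1, 0, 1, 0, 1]


def majority_vote(row):
    """Return the class predicted by most systems for one pattern."""
    return Counter(row).most_common(1)[0][0]


combined = [majority_vote(row) for row in outputs]
errors_per_system = [
    sum(row[s] != gold for row, gold in zip(outputs, correct))
    for s in range(5)
]
print(combined == correct)  # True
print(errors_per_system)    # [1, 1, 1, 1, 1]
```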
388:
389: In this paper we will evaluate different techniques for combining
390: system output, most of which have been put forward by
391: \cite{hvh2001}.
392: We use four voting methods and three stacked classifiers.
393: Voting methods assign weights to the output of the individual systems
394: and for each pattern choose the class with the largest accumulated
395: score.
The simplest voting method is the one we have used in the preceding
example: Majority Voting.
398: It gives all systems the same weight.
399: A more elaborate method is accuracy voting (TotPrecision).
400: It assigns a weight to each system which is equal to the accuracy of
401: the system on some evaluation data.
402:
403: Some classes might be easier to predict than other classes and for
404: this reason we have also tested two voting methods which use weights
405: based on accuracies for particular class tags.
406: The first is TagPrecision.
407: For each output value $v$ of system $s$, it uses a weight which is
408: equal to the precision of that system $s$ obtained for this value $v$.
409: The second method is Precision-Recall.
410: It starts from the same weights as TagPrecision but adds to these the
411: probability that systems producing different output values would have
412: missed $v$.
413: For example, suppose that there are two systems $s_1$ and $s_2$, and
414: that for some data item $s_1$ predicts value $v_1$ while $s_2$
415: predicts something else.
416: In that case, the probability that $s_1$ is right is
417: $precision(s_1,v_1)$ while the probability that $s_2$ would have
418: missed $v_1$ is $1-recall(s_2,v_1)$.
419: Precision-Recall will assign the weight
420: $precision(s_1,v_1)+(1-recall(s_2,v_1))$ to the event of $s_1$
421: predicting $v_1$.
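
A possible reading of the Precision-Recall scheme is sketched below in
Python.
We assume that the score of a candidate value is the sum of the
precisions of the systems that predicted it plus, for every system
that predicted something else, one minus the recall of that system for
the value; for two systems this reduces to the weight given above.
The precision and recall rates in the example are invented.

```python
def precision_recall_vote(predictions, precision, recall):
    """Return (winning value, scores) for one data item.

    predictions: {system: predicted value}
    precision, recall: {(system, value): rate measured on tuning data}
    """
    scores = {}
    for value in sorted(set(predictions.values())):
        score = 0.0
        for system, predicted in predictions.items():
            if predicted == value:
                score += precision[(system, value)]     # TagPrecision part
            else:
                score += 1.0 - recall[(system, value)]  # chance it missed value
        scores[value] = score
    return max(scores, key=scores.get), scores


# Invented rates for two systems and two output values.
precision = {("s1", "I"): 0.9, ("s2", "O"): 0.7}
recall = {("s2", "I"): 0.8, ("s1", "O"): 0.95}
winner, scores = precision_recall_vote({"s1": "I", "s2": "O"}, precision, recall)
print(winner)  # I
```

Dropping the \texttt{else} branch gives TagPrecision, and replacing the
per-value precision by one overall accuracy per system gives
TotPrecision.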
422:
423: A stacked classifier is a classifier which processes the results of
424: other classifiers.
425: We have used three variants of stacked classifiers.
426: The first is called TagPair.
427: It examines pairs of values produced by two systems and estimates the
428: probability that a certain output value is associated with the pair.
429: In the case of the two systems $s_1$ and $s_2$ producing two distinct
430: values $v_1$ and $v_2$, TagPair will examine evaluation data and find
431: that the value pair is associated with, for example, $v_1$ in 20\% of
432: the cases, $v_2$ in 70\% and $v_3$ in 10\%.
433: These numbers will be used as weights for the three output values and
434: the one that has accumulated the largest value after examining all
435: value pairs in the pattern, will be selected.
436: Unlike the voting methods, TagPair has the opportunity to choose the
437: correct output tag even if all systems have made an incorrect
438: prediction (for example, $v_3$ in this example).
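
The following Python sketch shows one way in which such a pairwise
scheme could be implemented.
It is an illustration under our own assumptions rather than the
original TagPair implementation: probabilities are estimated by
relative frequencies, value pairs that were never seen in the
evaluation data contribute nothing, and the sketch falls back on a
majority vote when no pair has been seen.
The counts in the example are invented.

```python
from collections import Counter, defaultdict
from itertools import combinations


def train_tag_pair(system_outputs, gold):
    """Count how often each gold value occurs with each pair of outputs."""
    counts = defaultdict(Counter)
    for outputs, correct in zip(system_outputs, gold):
        for i, j in combinations(range(len(outputs)), 2):
            counts[(i, j, outputs[i], outputs[j])][correct] += 1
    return counts


def tag_pair_vote(outputs, counts):
    """Accumulate P(value | pair of outputs) over all system pairs."""
    scores = Counter()
    for i, j in combinations(range(len(outputs)), 2):
        seen = counts.get((i, j, outputs[i], outputs[j]))
        if not seen:
            continue  # unseen pair: contributes nothing in this sketch
        total = sum(seen.values())
        for value, n in seen.items():
            scores[value] += n / total
    if not scores:
        return Counter(outputs).most_common(1)[0][0]  # majority fallback
    return scores.most_common(1)[0][0]


# Invented evaluation data: when s1 says "I" and s2 says "O",
# the correct value is usually "B".
tuning_outputs = [("I", "O")] * 10
tuning_gold = ["I"] * 2 + ["O"] * 1 + ["B"] * 7
counts = train_tag_pair(tuning_outputs, tuning_gold)
print(tag_pair_vote(("I", "O"), counts))  # B
```

In this invented case the combinator selects a value that neither
system proposed.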
439:
440: The other stacked classifier which we have evaluated is the
441: memory-based learner itself.
442: We have tested it in two modes: one in which only the output of the
443: systems was included and one in which we included information about
444: the test item.
445: This extra information was the word that needed to be classified, its
446: part-of-speech (POS) tag and the context (words/POS tags) in which it
447: appeared.
448: The memory-based learner used the same settings as described earlier
449: in this section: it used the Gain Ratio metric and examined a nearest
450: neighborhood of size three.
451:
452: The weight assignment methods used by the voting methods and the
453: stacked classifiers suffer from the same problem as Gain Ratio:
454: they might fail to disregard irrelevant features.
455: For this reason we have often tested the combination methods both with
456: all available system results as well as with a subset of these, thus
457: mimicking the feature selection method described earlier.
458: Apart from Majority Voting, all voting methods and stacked classifiers
459: require training data.
460: This means that we need both training data for the individual systems
461: and training data for the combinators.
462: We will describe how we have selected the training data in the next
463: section.
464:
465: \subsection{Parameter Tuning}
466: \label{sec-partun}
467:
468: In this paper, we will compare different learner set-ups and
469: apply the best one to standard data sets.
470: For example, we will examine different data representations and
471: test different system combination techniques.
472: We should be careful not to tune the system to the test data and
473: therefore we will only use the available training data for finding the
474: best configuration for the learner.
475: This can be done by using 10-fold cross-validation \citep{weiss91}.
The training data will be divided into ten sections of similar size and
477: each section will be processed by a system which has been trained on
478: the other nine.
479: The overall performance on all ten sections will be regarded as the
480: performance of the system.
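
The procedure can be sketched in Python as follows.
This is a minimal illustration in which the sections are consecutive
parts of the data; the function names and the dummy learner
\texttt{leak\_check} are invented for the example.

```python
def split_sections(items, n_sections=10):
    """Divide items into consecutive sections of similar size."""
    size, extra = divmod(len(items), n_sections)
    sections, start = [], 0
    for s in range(n_sections):
        end = start + size + (1 if s < extra else 0)
        sections.append(items[start:end])
        start = end
    return sections


def cross_validate(items, train_and_classify, n_sections=10):
    """Classify each section with a system trained on the other sections."""
    sections = split_sections(items, n_sections)
    results = []
    for held_out, test in enumerate(sections):
        train = [x for s, sec in enumerate(sections) if s != held_out
                 for x in sec]
        results.extend(train_and_classify(train, test))
    return results


def leak_check(train, test):
    """Dummy learner: reports whether a test item occurs in its training data."""
    return [item in train for item in test]


items = list(range(25))
print([len(s) for s in split_sections(items)])  # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
print(any(cross_validate(items, leak_check)))   # False
```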
481:
482: In our experiments, we will process the data twice.
483: First we will let the learner generate a classification of the data.
484: After this the learner will process the data another time, this time
485: while including the classifications found earlier for the context of a
486: data item.
487: While working with n-fold cross-validation, we should be careful that
488: information from a test part is not accidentally used in its training
489: part.
490: In the first processing phase we will generate classes for the first
491: section while using the other nine sections.
492: Thus information about the classes in, for example, section two is
493: encoded in the classes produced in section one.
494: If in the second phase we use the classifications of the first section
495: while processing section two, we are analyzing a section while having
496: access to (indirect) information about the classes in the data.
497: Information about the classes in section two might leak to this process
498: via the training data, something which is undesired.
499:
There are two ways of preventing this form of information leaking.
501: Both concern being more strict when it comes to creating the training
502: data of the second system.
503: In a cascaded 10-fold cross-validation experiment, the second phase
504: training data for section x must be constructed without using this
505: section.
506: This means that instead of running one 10-fold cross-validation
507: experiment with the first system, we need to run ten 9-fold
508: cross-validation experiments in order to obtain correct training data
509: for the ten sections in the second system.
510: Section one will be trained with the 9-fold cross-validation results
511: from sections 2-10, section 2 with 1 and 3-10 and so on.
512: If at any time we need to add a third phase to the cascade of systems,
513: we need to run 8-fold cross-validation experiments with the first
514: system and 9-fold cross-validation experiments with the second.
515: For extra systems the number of extra runs increases and the amount of
516: available training data for the first system decreases.
517:
518: The second method for preventing training information from a
519: processing phase leaking to the classifications of a next phase
520: is by only using results from previous phases in the test data.
521: In the training data we use the perfect classes rather than
522: the output of the previous phase.
523: This has two disadvantages.
524: First, we cannot use a feature containing the class of the focus
525: word because this feature is the same as the output class.
526: This means that we can only use the classes of neighboring words.
527: Second, the opportunity to correct errors made in the first phase
528: will be restricted because the training data no longer contains
529: information about the errors made by this phase.
530: The advantage of this approach is that we can use all training data in
531: all training phases, so the problem of a diminishing quantity of
532: training data disappears.
533: This approach is especially useful with longer cascades of learners,
534: as for example is required in full parsing.
535:
536: Here is an example to illustrate the two methods: suppose a word in
537: the sixth section in the second phase of a ten-fold cross-validation
538: experiment in chunking is represented by the following eight features:
539:
540: \begin{quote}
541: $w_{i-1}$ $w_i$ $w_{i+1}$ $p_{i-1}$ $p_i$ $p_{i+1}$ $c_{i-1}$ $c_{i+1}$
542: \end{quote}
543:
544: \noindent
545: The goal is to find a chunk tag for word $w_i$.
The word features $w_i$, $w_{i-1}$ and $w_{i+1}$ represent the word
itself, the preceding word and the next word, respectively.
548: The POS tag features $p_i$, $p_{i-1}$ and $p_{i+1}$ contain the POS
549: tags of the three words.
550: The two chunk features $c_{i-1}$ and $c_{i+1}$ hold the chunk tag of
551: the preceding and the next word.
552: The word and POS tag information have been taken from the training
553: data.
554: In the first method, the two chunk features are computed by a
555: preceding phase.
556: If this item is part of the training data for section x, $c_{i-1}$ and
557: $c_{i+1}$ were generated by a nine-fold cross-validation experiment
558: which uses all sections except section x.
559: This means that the two chunk features have been generated by training
560: with all sections except 6 and x.
561: If the item is part of the test data, then the chunk features are
562: computed by a ten-fold cross-validation experiment (training with
563: sections 1-5 and 7-10).
564: The second method generates chunk features for the test data in the
565: same way but for training data it takes $c_{i-1}$ and $c_{i+1}$ from
the training data, thus preventing them from containing implicit
information about the test sections.\footnote{In case $c_{i-1}$ is part
of a previous section or $c_{i+1}$ is in a next section, they are left
empty.}
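
The bookkeeping of the first method can be made explicit with a short
Python sketch.
For every second-phase test section x it lists, for each training
section y, the sections on which the first-phase system that labels y
is trained; sections are numbered from 1 to 10 and the function names
are our own.

```python
def phase_one_training_sections(x, y, n_sections=10):
    """Sections used to train the phase-one system that labels section y
    when y serves as phase-two training data for test section x."""
    if x == y:
        raise ValueError("a section cannot be both test and training data")
    return [s for s in range(1, n_sections + 1) if s not in (x, y)]


def phase_two_plan(n_sections=10):
    """Map each test section x to {training section y: phase-one sections}."""
    sections = range(1, n_sections + 1)
    return {
        x: {y: phase_one_training_sections(x, y, n_sections)
            for y in sections if y != x}
        for x in sections
    }


plan = phase_two_plan()
print(plan[1][6])                                # [2, 3, 4, 5, 7, 8, 9, 10]
print(sum(len(runs) for runs in plan.values()))  # 90
```

The ten 9-fold cross-validation experiments thus amount to 90
first-phase training runs, each on eight sections.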
570:
571: \subsection{Evaluation}
572: \label{sec-stat}
573:
574: We will compare the results of a shallow parser with an available
575: hand-parsed corpus.
576: For this purpose we will use the precision and recall of the phrases
577: in the results.
578: Precision is the percentage of phrases found by the learner that are
579: correct according to the corpus.
580: Recall is the percentage of corpus phrases found by the learner.
581: It is easier to optimize a system configuration based on one
582: evaluation score and therefore we combine precision and recall
583: in the F$_{\beta}$ rate \citep{vanrijsbergen75}:
584:
585: \begin{equation}
F_{\beta} = \frac{(\beta^2+1)\times precision\times recall}{\beta^2\times precision+recall}
587: \end{equation}
588:
589: \noindent
$\beta$ can be used for giving recall a larger ($\beta>1$) or
smaller ($\beta<1$) weight than precision.
We do not have a preference for one or the other and therefore we use
$\beta=1$.
594: In previous work on shallow parsing, often a word-related accuracy
595: rate is used as evaluation criterion.
596: We do not believe that this is a good method for evaluating results of
597: phrase detection algorithms.
598: Accuracy rates assign positive values to correctly identified
599: non-phrase words and to partially identified phrases.
600: Furthermore they will produce different numbers for the same analysis
601: based on the data representation used.
602: For these reasons, the relation between accuracy rates and F$_{\beta}$
603: rates is poor and preference should be given to using the latter.
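
The phrase-based scores can be computed from three counts, as the
following Python sketch shows.
The counts in the example are invented, and returning zero when no
phrase is found or no phrase is correct is a convention we assume
here.

```python
def f_beta(precision, recall, beta=1.0):
    """Combine precision and recall in the F_beta rate."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # assumed convention: nothing correct gives score zero
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)


def evaluate(correct, found, corpus, beta=1.0):
    """Return (precision, recall, F_beta) from phrase counts.

    correct: phrases found by the learner that are also in the corpus
    found:   phrases found by the learner
    corpus:  phrases in the hand-parsed corpus
    """
    precision = correct / found if found else 0.0
    recall = correct / corpus if corpus else 0.0
    return precision, recall, f_beta(precision, recall, beta)


# Invented counts: 100 phrases found, 80 of them correct, 120 in the corpus.
precision, recall, f1 = evaluate(correct=80, found=100, corpus=120)
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.8 0.6667 0.7273
```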
604:
605: Accuracy rates have one advantage over F$_{\beta}$ rates: standard
606: statistical tests can be used for determining if the difference
607: between two accuracy rates is significant.
608: Accuracy is a relatively simple function $correct/processed$ where
609: $processed$ is the number of items that have been processed and
610: $correct$ is the number of items that received the correct class.
611: Unfortunately, F$_{\beta=1}$ is more complex: after some arithmetic
612: we get $2*correct/(found+corpus)$ where $found$ is the number of
613: phrases found by the learner, $correct$ the number of phrases found
614: that were correct and $corpus$ the number of phrases in the corpus
615: according to some gold standard.
616: The value of the $corpus$ variable is an upper bound on the variable
617: $correct$.
618: The complexity of the F$_{\beta=1}$ computation makes it hard to
619: apply standard statistical tests to F$_{\beta=1}$ rates.
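
The arithmetic behind this expression is short.
With $precision = correct/found$ and $recall = correct/corpus$, and
assuming $correct>0$, substitution in the definition of F$_{\beta}$
with $\beta=1$ gives:

\begin{eqnarray*}
F_{\beta=1} & = & \frac{2\times precision\times recall}{precision+recall} \\
 & = & \frac{2\times correct^2/(found\times corpus)}
            {correct\times (found+corpus)/(found\times corpus)} \\
 & = & \frac{2\times correct}{found+corpus}
\end{eqnarray*}

\noindent
Unlike the denominator of an accuracy rate, the denominator
$found+corpus$ depends on the output of the learner.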
620:
621: \cite{yeh2000} offers a method for computing significance values for
622: F$_{\beta=1}$ rate comparisons: by using computationally-intensive
623: randomization tests.
His approach requires test data classifications for all systems that
need to be compared.
Usually we only have access to the test data classifications of our
own system and therefore we have used a variant of these
randomization tests: bootstrap resampling
\citep{noreen89}.
630: The basic idea of this approach is to regard the test data
631: classifications as a population of cases.
632: A random sample of this population can be created by arbitrarily
633: choosing cases with replacement.
634: We can create many random samples of the same size as the test data
635: and compute an average F$_{\beta=1}$ rate over the samples and a
636: standard deviation for this average.
637: These statistical measures can be used for deciding if the performance
638: of another system is significantly different from our system.
639: Since we do not know if the performance of our system is distributed
640: according to a normal distribution, we will determine
641: significance boundaries in such a way that 5\% of the samples evaluate
642: worse (or better) than the chosen boundary.
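
A minimal Python sketch of this procedure is given below.
We assume that a case is one sentence, represented by its numbers of
correct, found and corpus phrases; the example counts, the number of
samples and the random seed are invented for the illustration.

```python
import random


def f1(cases):
    """F rate (beta=1) over a list of (correct, found, corpus) counts."""
    correct = sum(c for c, _, _ in cases)
    found = sum(f for _, f, _ in cases)
    corpus = sum(k for _, _, k in cases)
    return 2.0 * correct / (found + corpus) if found + corpus else 0.0


def bootstrap_f1(cases, n_samples=1000, seed=0):
    """Return (mean, lower, upper) of F over resampled test sets.

    Each sample has the size of the test data and is drawn with
    replacement; 5% of the samples score below `lower` and 5% above `upper`.
    """
    rng = random.Random(seed)
    rates = sorted(
        f1([rng.choice(cases) for _ in cases]) for _ in range(n_samples)
    )
    mean = sum(rates) / n_samples
    tail = n_samples // 20                 # 5% of the samples
    lower = rates[tail]
    upper = rates[n_samples - tail - 1]
    return mean, lower, upper


# Invented per-sentence counts: (correct, found, corpus).
cases = [(2, 2, 2), (1, 2, 3), (0, 1, 1), (3, 3, 4), (2, 3, 2)] * 20
mean, lower, upper = bootstrap_f1(cases)
print(round(f1(cases), 4))  # 0.6957
```

Under this sketch, the performance of another system would be regarded
as significantly different when its F$_{\beta=1}$ rate falls outside
the interval between the two boundaries.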
643:
644: \section{Chunking}
645:
646: In this section we will apply a memory-based learner to chunking,
647: identifying base phrases.
The section starts with some background information on this task.
649: After this we will present the results of our experiments with base
650: noun phrase identification and our work targeted at finding base
651: phrases of arbitrary types.
652:
653: \subsection{Task Overview}
654: \label{sec-chunkov}
655:
A text chunker divides sentences into phrases which consist of a
sequence of consecutive words which are syntactically related.
The phrases are nonoverlapping and nonrecursive.
In the beginning of the nineties, \cite{abney91} suggested using
chunking as a preprocessing step of a parser.
Ten years later, most statistical parsers contained a chunking phase
(for example \cite{ratnaparkhi98}).
In this study, we will divide chunking into two subtasks: finding only
noun phrases and identifying arbitrary chunks.
665:
666: Machine learning approaches towards noun phrase chunking started with
667: work by \cite{church88} who used bracket frequencies associated with
668: POS tags for finding noun phrase boundaries in text.
669: In an influential paper about chunking, \cite{ramshaw95} show that
670: chunking can be regarded as a tagging task.
Even more importantly, the authors propose training and test data
sets that are still being used for comparing different text chunking
methods.
674: These data sets were extracted from the Wall Street Journal part of
675: the Penn Treebank II corpus \citep{marcus93}.
676: Sections 15-18 are used as training data and section 20 as test
677: data.\footnote{The noun phrase identification data is available from
678: {\tt ftp://ftp.cis.upenn.edu/pub/chunker/}}
679: In principle, the noun phrase chunks present in the material are noun
680: phrases that do not include other noun phrases, with initial material
681: (determiners, adjectives, etc.) up to the head but without
682: postmodifying phrases (prepositional phrases or clauses)
683: \citep{ramshaw95}.
684:
685: The noun phrase chunking data produced by \cite{ramshaw95} contains a
686: couple of nontrivial features.
687: First, unlike in the Penn Treebank, possessives between two noun
688: phrases have been attached to the second noun phrase rather than the
689: first.
690: An example in which round brackets mark chunk boundaries: {\it ( Nigel
691: Lawson ) ('s restated commitment )}: the possessive {\it 's} has been
692: moved from {\it Nigel Lawson} to {\it restated commitment}.
Second, Treebank annotation may result in unexpected noun phrase
annotations: {\it British Chancellor of ( the Exchequer ) Nigel
Lawson} in which only one noun chunk has been marked.
The problem here is that neither {\it British Chancellor} nor {\it
Nigel Lawson} has been annotated as a separate noun phrase in the
Treebank.
699: Both {\it British ... Exchequer} and {\it British ... Lawson} are
700: annotated as noun phrases in the Treebank but these phrases could not
701: be used as noun chunks because they contain the smaller noun phrase
702: {\it the Exchequer}.
703:
\cite{ramshaw95} proposed encoding chunks with tags: I for words that
are inside a noun chunk and O for words that are outside a chunk.
706: In case one noun phrase immediately follows another one, they
707: use the tag B for the first word of the second phrase in order to show
708: that a new phrase starts there.
709: With the three tags I, O and B any chunk structure can be encoded.
710: This representation has two advantages.
711: First, it enables trainable POS taggers to be used as chunkers by
712: simply changing their training data.
713: Second, it minimizes consistency errors which appear with the bracket
714: representation where open and close brackets generated by the learner
715: may not be balanced.
716: Here is an example sentence first with noun phrases encoded by pairs of
717: brackets and then with the Ramshaw and Marcus IOB representation:
718:
719: \begin{quote}
720: In ( early trading ) in ( Hong Kong ) ( Monday ) , ( gold ) was quoted \\
721: at ( \$ 366.50 ) ( an ounce ) .
722:
723: In$_O$ early$_I$ trading$_I$ in$_O$ Hong$_I$ Kong$_I$ Monday$_B$ ,$_O$ gold$_I$ was$_O$ quoted$_O$ \\
724: at$_O$ \$$_I$ 366.50$_I$ an$_B$ ounce$_I$ .$_O$
725: \end{quote}
726:
727: \noindent
\cite{tksveenstra99eacl} present three variants on the Ramshaw and
Marcus representation and show that the bracket representation can
also be regarded as a tagging representation with two streams of
brackets.
They named the variants IOB2, IOE1 and IOE2 and used IOB1 as name for
the Ramshaw and Marcus representation.
IOB2 is the same as IOB1 except that every chunk-initial word receives tag B.
735: IOE1 differs from IOB1 in the fact that rather than the tag B, a tag E
736: is used for the final word of a noun chunk which is immediately
737: followed by another chunk.
738: IOE2 is a variant of IOE1 in which each final word of a noun phrase is
739: tagged with E.
740: The bracket representations use open brackets for phrase-initial
741: words, close brackets for phrase-final words and a period for all
742: other words.
Table \ref{tab-repr} contains the tag sequences for the example
sentence in all six tagging formats.
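
The six formats can be derived mechanically from a list of chunk
spans.
The Python sketch below is only an illustration: the function name
{\tt encode} and the representation of chunks as (start, end) index
pairs with inclusive ends are assumptions made for this example.

```python
def encode(n_tokens, chunks, scheme):
    """Encode sorted, non-overlapping (start, end) chunk spans as tags.

    scheme is one of IOB1, IOB2, IOE1, IOE2, O (open brackets) or
    C (close brackets). Span ends are inclusive token indices.
    """
    filler = "." if scheme in ("O", "C") else "O"
    tags = [filler] * n_tokens
    prev_end = None
    for k, (start, end) in enumerate(chunks):
        # Does this chunk directly follow, or directly precede, another chunk?
        follows = prev_end is not None and start == prev_end + 1
        precedes = k + 1 < len(chunks) and chunks[k + 1][0] == end + 1
        if scheme == "O":
            tags[start] = "["
        elif scheme == "C":
            tags[end] = "]"
        else:
            for i in range(start, end + 1):
                tags[i] = "I"
            if scheme == "IOB1" and follows:
                tags[start] = "B"
            elif scheme == "IOB2":
                tags[start] = "B"
            elif scheme == "IOE1" and precedes:
                tags[end] = "E"
            elif scheme == "IOE2":
                tags[end] = "E"
        prev_end = end
    return tags


# The example sentence has 17 tokens and six noun chunks.
EXAMPLE_CHUNKS = [(1, 2), (4, 5), (6, 6), (8, 8), (12, 13), (14, 15)]
```

Calling {\tt encode(17, EXAMPLE\_CHUNKS, "IOB1")} yields the tag
sequence of the IOB representation shown in the example above.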
745:
746: \begin{table}[t]
747: \begin{center}
748: \begin{tabular}{|c|ccccccccccccccccc|}\hline
749: IOB1 & O&I&I&O&I&I&B&O&I&O&O&O&I&I&B&I&O \\
750: IOB2 & O&B&I&O&B&I&B&O&B&O&O&O&B&I&B&I&O \\
751: IOE1 & O&I&I&O&I&E&I&O&I&O&O&O&I&E&I&I&O \\
752: IOE2 & O&I&E&O&I&E&E&O&E&O&O&O&I&E&I&E&O \\
753: O & .&$[$&.&.&$[$&.&$[$&.&$[$&.&.&.&$[$&.&$[$&.&. \\
754: C & .&.&$]$&.&.&$]$&$]$&.&$]$&.&.&.&.&$]$&.&$]$&. \\\hline
755: \end{tabular}
756: \end{center}
757: \caption{The chunk tag sequences for the example sentence
758: {\it In early trading in Hong Kong Monday , gold was quoted at \$
759: 366.50 an ounce . }
760: for six different tagging formats.
761: The {\tt I} tag has been used for words inside a chunk, {\tt O}
762: for words outside a chunk, {\tt B} and {\tt [} for
763: chunk-initial words and {\tt E}, {\tt ]} for chunk-final words and
764: periods for words that are neither chunk-initial nor chunk-final.
765: }
766: \label{tab-repr}
767: \end{table}
768:
769: The representation variants are interesting because a learner will
770: make different errors when trained with data encoded in a different
771: representation.
772: This means that we can train one learner with five\footnote{The
773: combination of open and close brackets, O+C, will be regarded as one
774: data representation.}
775: data representations and obtain five different analyses of the data
776: which we can combine with system combination techniques.
777: Thus the different data representations may enable us to improve the
778: performance of the chunker.
779: The data representations can be used both for noun phrase chunking
780: and for arbitrary chunking.
781: In the latter task, more than one chunk type exists so the tags need
782: to be expanded with type-specific suffixes.
783: For example: B-VP, I-VP, E-VP, $[$-VP and $]$-VP.
784:
785: The arbitrary chunking task was more difficult to design because many
786: interesting phrase types often contain parts which belong to other
787: phrases \citep{tksbuchholz2000conll}.
788: For example, verb phrases may contain noun phrases and prepositional
789: phrases often include a noun phrase.
790: Furthermore, noun phrases may contain quantitative or adjective phrases
791: which may prevent them from being identified as noun chunks.
792: The noun, verb and prepositional phrases should be included and
793: therefore the following measures have been taken when constructing the
794: data for the arbitrary chunking task:
795: First, a couple of phrase types, for example quantifier phrases and
796: adjective phrases, have been removed from places where they prevented
797: the identification of noun phrases.
This made it possible to annotate more phrases as noun chunks.
799: Second, some phrase types in the annotated data, for example verb
800: phrases and prepositional phrases, lack material that has already been
801: included in a phrase of another type.
802: Third, adjacent verb clusters have been put in one flat verb phrase
803: unlike in the Treebank where often each verb starts a new phrase.
804: And fourth, adverbial phrase boundaries have been removed from
805: adjective phrases and verb phrases to allow all material to be
806: included in the mother phrase.
807:
808: This chunk definition scheme will generate data in which most of the
809: tokens have been assigned to a chunk of some type.
The few tokens that remain outside any chunk are usually punctuation signs.
811: This chunk scheme has been used for generating training and test data
812: for the CoNLL-2000 shared task \citep{tksbuchholz2000conll}.
813: The data contains the same segments of the Wall Street Journal part of
814: the Penn Treebank as the noun phrase data of \cite{ramshaw95}: sections
815: 15-18 as training data and section 20 as test data.\footnote{The
816: CoNLL-2000 shared task data is available from
817: {\tt http://lcg-www.uia.ac.be/conll2000/chunking/}}
818: We will use these data sets in our arbitrary chunking experiments.
819:
820: The training and the test data contain two types of features: words
821: and POS tags.
822: The words have been taken from the Penn Treebank.
823: The POS tags of the Treebank have been manually checked and therefore
824: they should not be used in the chunking data.
825: In future applications, the chunking process will be applied to a text
826: with POS tags that have been generated automatically.
827: These POS tags will contain errors and therefore the performance of
828: the chunker will be worse than when applied to a Treebank text with
829: manually checked POS tags.
If we want to obtain realistic performance rates, we should work with
automatically generated POS tags in our shallow parsing experiments.
In line with earlier work like that of \cite{ramshaw95}, we have used
POS tags that were generated by the Brill tagger \citep{brill94}.
834:
835: \subsection{Noun Phrase Recognition}
836:
837: We will use a memory-based learner to find noun phrase chunks in text.
838: In order to determine the best configuration for the learner, we will
839: test different system configurations on the standard training data
840: sets put forward by \cite{ramshaw95}.
841: We will evaluate different feature sets for representing words.
842: Additionally, we will use the five data representations for generating
843: different system results and use system combination techniques for
844: combining these results.
845:
846: In our experiments we will represent words as sets of words and POS
847: tags.
848: These sets contain the word itself, its part-of-speech (POS) tag and a
849: left and right context of a maximum of four words and POS tags on each
850: side, 18 features in total.
851: We have explained in Section \ref{sec-feat} that memory-based learners
852: equipped with the Gain Ratio metric have difficulty in dealing with
853: irrelevant features.
854: Therefore we will use a feature selection method, bi-directional
855: hill-climbing starting with zero features, for finding the best
856: subset of the 18 features for each different data representation.
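
The search procedure can be sketched as follows.
The sketch is a simplified illustration and not the implementation
used in our experiments: it keeps only the single best feature set at
each step, and the function {\tt evaluate} is a hypothetical
placeholder for training and testing the learner with a given feature
subset (for example in a cross-validation run) and returning its
F$_{\beta=1}$ rate.

```python
def hill_climb(features, evaluate):
    """Bi-directional hill-climbing over feature subsets.

    Starts with zero features. In every step each feature is toggled
    (added if absent, removed if present) and the best-scoring
    neighbour is adopted as long as it improves on the current score.
    """
    selected = frozenset()
    best = evaluate(selected)
    improved = True
    while improved:
        improved = False
        best_set = selected
        for feature in features:
            # Symmetric difference adds or removes exactly one feature.
            candidate = selected ^ {feature}
            score = evaluate(candidate)
            if score > best:
                best, best_set, improved = score, candidate, True
        selected = best_set
    return selected, best
```

Because a neighbour is adopted only when it strictly improves the
score, the search terminates after a finite number of steps.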
857:
858: The memory-based learner will make two passes over the data.
859: First, it will attempt to predict the noun phrases in the data as well
860: as possible.
861: After this it will use the output of this first pass as information
862: about the noun phrases in the immediate context of the current word.
863: This means that the second pass has access to the 18 features of the
864: first pass plus the chunk tags of the two words immediately in front
865: of the current word and the chunk tags of the two words immediately
866: following the current word.
867: This cascaded approach was chosen because it was useful for improving
868: overall performance in our earlier work \citep{tksveenstra99eacl}.
869: We omitted the chunk tag for the current word because including it
870: gave a negative bias to the chunker performance.
871: Gain Ratio would correctly identify it as a feature which contained a
872: lot of information about the output class and the weight it assigned
873: to it would make it hard for the other features to influence the
874: output class at all \citep{tksveenstra99eacl}.\footnote{
875: The problem of using the predicted class of the current word was a
876: result of an earlier study in which we did not use feature selection.
877: The selection method used in this study would probably have disregarded
878: this feature automatically.
879: It would start out as the most informative feature but with the
880: feature on its own we would get a worse performance than with
881: combinations of other features (we perform feature selection while
882: keeping the five best combinations).
883: }
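
As an illustration, the full (unselected) feature vector of the second
pass could be assembled as in the following sketch.
The helper names and the padding symbol {\tt \_} used beyond sentence
boundaries are assumptions made for this example.

```python
def window(seq, i, left, right, pad="_"):
    """Items of seq from position i-left to i+right, padded outside the sentence."""
    return [seq[j] if 0 <= j < len(seq) else pad
            for j in range(i - left, i + right + 1)]


def second_pass_instance(words, pos, chunk, i):
    """18 word/POS features plus 4 context chunk tags for token i.

    The chunk tag of the current word itself is deliberately left out.
    """
    context_chunks = [chunk[j] if 0 <= j < len(chunk) else "_"
                      for j in (i - 2, i - 1, i + 1, i + 2)]
    return window(words, i, 4, 4) + window(pos, i, 4, 4) + context_chunks
```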
884:
885: We performed a cascaded feature search while using five different data
886: representations on the training data of \cite{ramshaw95} in a 10-fold
887: cross-validation approach.
We prevented information leaking in the second phase, as described in
Section \ref{sec-partun}, by using the estimated chunk tags for test
data and using the corpus tags in the training data.
891: In this way we made sure that when the test data consisted of section
892: x, no information about section x was available in the training data.
893: The results of the 10-fold cross-validation experiments can be found
894: in Table \ref{tab-np10a}.
895: In the best feature sets of the first pass most of the nine POS tag
896: features are used (almost eight on average) but interestingly enough
897: only a few of the word features (just over four on average).
The best sets for the second pass use fewer POS tag features (under
seven), fewer word features (just over two) and most of the chunk
features (about three).
901: The table shows that a wide context is more important for the POS
902: features than for the chunk features and less important for the word
903: features.
904:
905: \begin{table}[t]
906: \begin{center}
907: \begin{tabular}{|l|c|ll|c|lll|}\cline{2-8}
908: \multicolumn{1}{l|}{train} & \multicolumn{3}{c|}{Pass 1}
909: & \multicolumn{4}{c|}{Pass 2} \\\hline
910: Repr.& F$_{\beta=1}$ & \multicolumn{2}{c|}{features} &
911: F$_{\beta=1}$ & \multicolumn{3}{c|}{features} \\\hline
912: IOB1 & 91.88 & word$_{-4..0}$ & POS$_{-2..3}$
913: & 92.54 & word$_{-2..0}$ & POS$_{-4..3}$ & chunk$_{-2,-1,1,2}$\\
914: IOB2 & 91.78 & word$_{-1..0}$ & POS$_{-4..3}$
915: & 92.29 & word$_{-1..0}$ & POS$_{-4..2}$ & chunk$_{-1,1,2}$\\
916: IOE1 & 91.64 & word$_{0..1}$ & POS$_{-3..3}$
917: & 92.28 & word$_{0..1}$ & POS$_{-3..3}$ & chunk$_{-1,1,2}$\\
918: IOE2 & 92.19 & word$_{-3..4}$ & POS$_{-4..4}$
919: & 92.59 & word$_{0..1}$ & POS$_{-1..3}$ & chunk$_{-2,-1,1,2}$\\
920: %O+C & 92.78 & &
921: % & & & \\\hline
922: O & 96.04 & word$_{-2..0}$ & POS$_{-4..3}$
923: & 96.11 & word$_{-1,0}$ & POS$_{-4..1}$ & chunk$_{-1,2}$\\
924: C & 96.43 & word$_{0..4}$ & POS$_{-4..4}$
925: & 96.45 & word$_{0..2}$ & POS$_{-4..2}$ & chunk$_{-2,-1,1}$\\\hline
926: \end{tabular}
927: \caption{Best F$_{\beta=1}$ found for six data representations in two
928: passes while using a bi-directional hill-climbing feature search
929: algorithm in a 10-fold cross-validation process applied to the
930: training data for the noun phrase chunking task.
931: Note that the rates obtained for the O (open bracket) and C (close
932: bracket) representations are for phrase starts and phrase ends
933: respectively and thus higher than for the first four which evaluate
934: complete phrase identification.
935: %The results on O+C line were obtained by a combination of the two.
936: }
937: \label{tab-np10a}
938: \end{center}
939: \end{table}
940:
941: Our motive for processing six representations rather than one was to
942: obtain different results which we could combine in order to improve
943: performance.
944: System combination can be seen as a second cascade behind passes one
945: and two.
For reasons mentioned in Section \ref{sec-partun}, adding a second
cascade in a 10-fold cross-validation experiment requires taking extra
care to prevent information leaking from the training data at one level
to the training data of the next level.
950: We have taken care of this problem by preparing the training data of
951: the combination techniques with 9-fold cross-validation runs which
952: were independent of the 10-fold cross-validation experiments used
953: for generating the test data.
954: For example, the test data for the first section was generated by
955: training with sections 2-10 twice, first without information about
956: context chunk tags and then with the perfect information of the
957: context chunk tags.
958: The training data was generated with a 9-fold cross-validation process
959: on sections 2-10, also first without context chunk tags and then with
960: perfect context chunk tags.
961: By working this way it was impossible for information about the first
962: section to enter the training data of the combination processes.
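
The resulting data partitioning can be written down explicitly.
The sketch below only enumerates which sections are used where; the
function name {\tt nested\_folds} and the numbering of the ten parts
as sections 1 to 10 are assumptions made for this illustration.

```python
def nested_folds(n_sections=10):
    """List, for every test section, the outer and inner training splits.

    Each entry is (test, train, inner) where inner holds pairs
    (held_out, inner_train): held_out is tagged by a model trained on
    inner_train, which yields combinator training data that never
    contains information about the test section.
    """
    plan = []
    sections = range(1, n_sections + 1)
    for test in sections:
        train = [s for s in sections if s != test]
        inner = [(held, [s for s in train if s != held]) for held in train]
        plan.append((test, train, inner))
    return plan
```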
963:
964: Most system combination techniques require results that are in the
965: same format.
966: We have results in six different formats which means that we need to
967: convert them to one format.
968: Since we do not know which of the formats would suit the combination
969: process best, we have evaluated all formats.
970: The four IO formats can trivially be converted to each other and to
971: the O and the C format.
972: The conversion of the two bracket formats to the other four is
973: nontrivial.
974: The two data streams have been generated independently of each other
975: and this means that they may contain inconsistencies.
976: We have chosen to get rid of these by removing all brackets which
977: cannot be matched with the closest candidate.
978: For example, if we have a structure like {\it ( a ( b c ) d )} then
979: the first bracket will be removed because it cannot be matched with
980: the second bracket.
981: The second and third will be kept because they match.
982: Finally, the fourth will be removed because it cannot be matched with
983: the third.
984: We obtain the balanced structure {\it a ( b c ) d} which can trivially
985: be converted to the four IO formats.
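
One way to implement this clean-up is sketched below.
It reflects our reading of matching with the closest candidate: an
open bracket is kept only if the next bracket in the sentence is a
close bracket, and that close bracket is kept with it.
The function name {\tt balance} and the encoding of the two streams as
lists of booleans are assumptions made for this example.

```python
def balance(opens, closes):
    """Remove brackets that cannot be matched with their closest candidate.

    opens[i] is True if an open bracket was predicted before token i,
    closes[i] is True if a close bracket was predicted after token i.
    Returns the cleaned open and close streams.
    """
    # Linearise the brackets in sentence order.
    events = []
    for i in range(len(opens)):
        if opens[i]:
            events.append(("(", i))
        if closes[i]:
            events.append((")", i))
    kept_open = [False] * len(opens)
    kept_close = [False] * len(closes)
    # Keep a pair only when an open bracket is directly followed by a close bracket.
    for (kind, i), (next_kind, j) in zip(events, events[1:]):
        if kind == "(" and next_kind == ")":
            kept_open[i] = True
            kept_close[j] = True
    return kept_open, kept_close
```

For the example {\it ( a ( b c ) d )} this keeps only the open bracket
before {\it b} and the close bracket after {\it c}.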
986:
987: \begin{table}[t]
988: \begin{center}
989: \begin{tabular}{|l|c|c|c|c|c|}\cline{2-6}
990: \multicolumn{1}{l|}{train}
991: & IOB1 & IOB2 & IOE1 & IOE2 & O+C\\\hline
992: {\bf all systems} & & & & & \\
993: Majority & 93.06 & 93.06 & 93.14 & 93.12 & 93.35 \\
994: TotPrecision & 93.06 & 93.05 & 93.13 & 93.05 & 93.35 \\
995: TagPrecision & 93.06 & 93.10 & 93.11 & 93.11 & 93.35 \\
996: Precision-Recall & 93.06 & 93.10 & 93.11 & 93.08 & 93.35 \\
997: TagPair & 93.05 & 93.14 & 93.10 & 93.13 & 93.36 \\
998: MBL & 93.14 & 93.12 & 93.07 & 92.92 & 93.35 \\
999: MBL+ & 92.81 & 92.74 & 92.91 & 92.78 & 93.29 \\\hline
1000: {\bf some systems} & & & & & \\
1001: Majority & 93.02 & 93.12 & 93.08 & 92.99 & 93.37 \\
1002: TotPrecision & 93.02 & 93.12 & 93.08 & 92.99 & 93.37 \\
1003: TagPrecision & 93.04 & 93.13 & 93.10 & 92.99 & 93.37 \\
1004: Precision-Recall & 93.04 & 93.13 & 93.13 & 93.04 & 93.37 \\
1005: TagPair & 93.08 & 93.16 & 93.12 & 93.05 & 93.37 \\
1006: MBL & 93.12 & 93.18 & 93.18 & 93.03 & 93.38 \\\hline
1007: \end{tabular}
1008: \caption{
1009: F$_{\beta=1}$ rates obtained on 10-fold cross-validation experiments
1010: on the noun phrase chunking data while combining results obtained with
1011: five different data representations.
All five representations have been tested and best rates have been
obtained while using the combined bracket representation O+C.
All combination results are better than any result of the individual
systems (92.59, see Table \ref{tab-np10a}) and generally combining five
systems led to better results than when only three or four were used.
1017: The best results have been obtained with a stacked memory-based
1018: classifier that used all system results except those generated with
1019: IOE1.
1020: However, the performance differences are small.
1021: }
1022: \label{tab-np10b}
1023: \end{center}
1024: \end{table}
1025:
1026: % significance
1027:
1028: We have combined the five results of pass two of the 10-fold
1029: cross-validation experiments on the noun phrase chunking training
1030: data (O and C have now been regarded as one data stream O+C).
1031: We have used the system combination techniques described in Section
1032: \ref{sec-combi}: Majority Voting, TotPrecision, TagPrecision,
1033: Precision-Recall, TagPair and two variants of a stacked memory-based
1034: learner.
1035: The first stacked learner did not use any context information while
1036: the second one had access to a limited amount of context information:
1037: the current word, the current POS tag or pairs containing the current
1038: POS tag and one of the three current word, previous POS tag or next
1039: POS tag.
1040: We have performed combination experiments with all five data streams
1041: and with all subsets of three and four data streams.
1042: The results can be found in Table \ref{tab-np10b}.
1043: For the second stacked classifier we only included the best results
1044: (obtained with context feature current POS tag).
1045: System combination improved performance: the worst result of the
1046: combination techniques is still better than the best result of the
1047: individual systems.
1048: The differences between the combination techniques are small.
1049: Furthermore, system combination with the four IO data representations
1050: leads to similar results but the combined bracket representation
1051: consistently obtains higher F$_{\beta=1}$ rates.
1052: It should be noted though that while combination of the data with the
1053: IO representations leads to similar precision and recall figures, O+C
1054: obtains its higher F$_{\beta=1}$ rates with high precision rates and
1055: lower recall rates.
1056:
1057: Since the performance differences between the combination techniques
1058: displayed in Table \ref{tab-np10b} are small, we are relatively free
1059: in selecting a technique for further processing.
We chose Majority Voting because it is the simplest of the
combination techniques that were tested: unlike the other techniques,
it does not require extra combinator training data.
1063: It does seem reasonable to use the O+C representation during the
1064: combination process because the best results have been obtained with
1065: this representation.
We will restrict ourselves to a few systems rather than combining all
because Majority Voting in combination with the O+C representation
obtained a slightly higher F$_{\beta=1}$ rate that way.
1069: The best rate was obtained while using only the systems with data
1070: representations IOB1, IOE2 and O+C so we restrict ourselves to
1071: these three.
1072: This leaves us with the following processing scheme:
1073:
1074: \begin{enumerate}
1075: %\itemsep -6mm
1076: \item Process the test data with a memory-based model generated from
1077: the training data.
1078: Use the features shown in Table \ref{tab-np10a} (Pass 1) and
1079: generate output data streams while using the representations
1080: IOB1, IOE2, O and C.
1081: \item Perform a second pass over the test data with another
1082: memory-based model obtained from the training data.
1083: Again use the features shown in Table \ref{tab-np10a} (Pass 2).
1084: In the test data, use the estimated chunk tags from the previous
1085: run as chunk tag features and in the training data use the
1086: corpus chunk tags as chunk features.
1087: Perform these passes four times, once for each of the data
1088: representations IOB1, IOE2, O and C.
1089: \item Convert the output for the data representations IOB1 and IOE2 to
1090: the O and the C format.
1091: \item Combine the three O data streams (IOB1, IOE2 and O) with
1092: Majority Voting and do the same for the three C data streams
1093: (IOB1, IOE2 and C).
1094: \item Remove brackets from the resulting O and C data streams which
1095: cannot be matched with other brackets.
1096: The balanced bracket structure is the analysis of the test data
1097: that is the output of the complete system.
1098: \end{enumerate}
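
\noindent
The voting in step 4 can be illustrated with the following sketch, in
which every bracket stream is assumed to be a list of booleans stating
for each token whether a bracket is present; the function name
{\tt majority\_vote} is hypothetical.

```python
def majority_vote(streams):
    """Per-token majority vote over equal-length boolean bracket streams.

    A bracket is kept when more than half of the systems predict it.
    With three systems ties cannot occur.
    """
    n_systems = len(streams)
    return [2 * sum(votes) > n_systems for votes in zip(*streams)]
```

The same function is applied once to the three O streams and once to
the three C streams, after which unmatched brackets are removed.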
1099:
1100: \noindent
1101: We have applied this procedure to the data sets of \cite{ramshaw95}:
1102: sections 15-18 of the Wall Street Journal part of the Penn Treebank
1103: \citep{marcus93} as training data and section 20 of the same corpus
1104: as test data.
The system obtained an F$_{\beta=1}$ rate of 93.34 (precision 94.01\%
and recall 92.67\%).
This is a modest improvement over our earlier work \citep{tks2000naacl}
in which we did not use feature selection and where we obtained an
F$_{\beta=1}$ rate of 93.26.
1110: In order to estimate significance thresholds, we have applied a
1111: bootstrap resampling test to the output of our system.
1112: We created 10,000 populations by randomly drawing sentences with
1113: replacement from the system results.
1114: The number of sentences in each population was the same as in the
1115: test corpus.
1116: The average F$_{\beta=1}$ of the 10,000 populations was 93.33 with
1117: a standard deviation of 0.24.
1118: For 5 percent of the populations, the F$_{\beta=1}$ rate was equal to
1119: or lower than 92.93 and for another 5 percent it was equal to or higher
1120: than 93.73.
1121: Since 93.26 is between the two significance boundaries, our current
1122: system does not perform significantly better than the previous version
1123: without feature selection.
1124: % bootstrapping: 93.33 0.24 92.93 93.73
1125:
1126: \subsection{Arbitrary Phrase Identification}
1127: \label{sec-chuarb}
1128:
1129: Our work with chunks of arbitrary types\footnote{The results of our
1130: arbitrary phrase identification work have earlier been presented by
1131: \cite{tks2000conll}.}
1132: is similar to that with noun phrase chunks apart from two facts.
1133: First, we refrained from using feature selection methods.
1134: Applying these methods did not gain us much for noun phrase
1135: chunking but they required a lot of extra computational work.
1136: Therefore we went back to using a fixed set of features in these
1137: experiments.
1138: The context size we used here was four left and four right for words
1139: and POS tags in the first pass over the data, and three left and three
1140: right for words and POS tags, and two left and two right without the
1141: focus for chunk tags in the second pass.
1142: This means that both first and second pass use 18 features.
1143: The second pass has only been used for the four IO data
1144: representations.
1145: Table \ref{tab-np10a} shows that the second pass improved the
1146: performance of the first pass only by a small margin for the two
1147: bracket representations O and C.
1148:
1149: The second difference between this study and the one for noun phrase
1150: chunks originates from the fact that apart from chunk boundaries, we
1151: need to find chunk types as well.
1152: We can approach this task in two ways.
1153: First, we could train the learner to identify both chunk boundaries and
1154: chunk types at the same time.
1155: We have called this approach the Single-Phase Approach.
1156: Second, we could split the task and train a learner to identify all
1157: chunk boundaries and feed its output to another classifier which
1158: identifies the types of the chunks (Double-Phase Approach).
1159: A computationally-intensive approach would be to develop learners for
1160: each different chunk type.
1161: They could identify chunks independently of each other and words
1162: assigned to more than one chunk could be disambiguated by choosing the
1163: chunk type that occurs most frequently in the training data (N-Phase
1164: Approach).
1165: Since we did not know in advance which of these three processing
1166: strategies would generate the best results, we have evaluated all
1167: three.
1168:
1169: In order to find the best processing strategy and the best combination
1170: technique, we have performed several 10-fold cross-validation
1171: experiments on the training data.
1172: We have processed this data for each processing strategy and in each
1173: of the six data representations earlier used for noun phrase chunking.
1174: After this we have used the seven combination techniques presented in
1175: Section \ref{sec-combi} for combining these.
1176: The results can be found in Table \ref{tab-xp10}.
1177: Of the three processing strategies, the N-Phase Approach generally
1178: performed best with Double-Phase being second best and Single-Phase
1179: performing worst.
1180: Again, system combination improved all individual results.
1181: There were only small differences between the seven combination
1182: techniques when compared for the same processing approach.
The only exceptions were the two stacked MBL classifiers applied to the
Single-Phase Approach results.
Their F$_{\beta=1}$ rates were about 0.3 higher than those of most of
the other combination techniques.
1187:
1188: \begin{table}[t]
1189: \begin{center}
1190: \begin{tabular}{|l|c|c|c|}\cline{2-4}
1191: \multicolumn{1}{l|}{train} & SP & DP & NP \\\hline
1192: IOB1 & 90.68 & 91.59 & 92.02 \\
1193: IOB2 & 90.77 & 91.65 & 91.94 \\
1194: IOE1 & 90.94 & 91.60 & 91.90 \\
1195: IOE2 & 91.21 & 91.97 & 91.99 \\
1196: O+C & 91.57 & 91.97 & 91.51 \\\hline
1197: Majority & 91.96 & 92.34 & 92.62 \\
1198: TotPrecision & 91.97 & 92.34 & 92.62 \\
1199: TagPrecision & 91.98 & 92.34 & 92.62 \\
1200: Precision-Recall & 91.96 & 92.34 & 92.62 \\
1201: TagPair & 92.08 & 92.34 & 92.65 \\
1202: MBL & 92.32 & 92.35 & 92.75 \\
1203: MBL+ & 92.40 & 92.32 & 92.72 \\\hline
1204: \end{tabular}
1205: \end{center}
1206: \caption{F$_{\beta=1}$ rates obtained for the three processing
1207: strategies, Single-Phase Approach (SP), Double-Phase Approach (DP) and
N-Phase Approach (NP), when applied to the training data of the
1209: CoNLL-2000 shared task (arbitrary chunking) while using five different
1210: data representations and seven system combination techniques.
1211: In all cases, system combination led to performances that were better
1212: than the individual system results.
1213: The computationally-intensive N-Phase Approach does better than the
1214: other two.
1215: }
1216: \label{tab-xp10}
1217: \end{table}
1218:
1219: % significance test on best
1220:
1221: The best result was generated with the N-Phase Approach in combination
1222: with a stacked memory-based classifier (MBL, 92.76).
1223: A bootstrap resampling test with 8000 random populations generated the
1224: 90\% significance interval 92.60-92.90 which means that this result
1225: is significantly better than any Single-Phase or Double-Phase result.
1226: However, the N-Phase approach has a big computing overhead: the
1227: number of passes over the data is at least N times the number of
1228: representations.
1229: Therefore, we have chosen the Double-Phase Approach combined with
1230: Majority Voting for our further work.
1231: This approach combines a reasonable performance with computational
1232: efficiency.
1233: The Single-Phase Approach is potentially faster but its performance
1234: is worse unless we use a stacked classifier which requires extra
1235: combinator training data.
1236:
1237: When we applied the Double-Phase Approach combined with Majority
1238: Voting to the CoNLL-2000 data sets, we obtained an F$_{\beta=1}$ rate
1239: of 92.50 (precision 94.04\% and recall 91.00\%).
1240: An overview of the performance rates of the different chunk types can
1241: be found in Table \ref{tab-xp}.
1242: Our system does well for the three most frequently occurring chunk
1243: types, noun phrases, prepositional phrases and verb phrases, and less
1244: well for the other seven.
The chunk type UCP, which occurred in the training data, was not
present in the test data.
1247: With this result, our memory-based arbitrary chunker finished third of
1248: eleven participants in the CoNLL-2000 shared task.
1249: The two systems that performed better were Support Vector Machines
1250: \citep[][F$_{\beta=1}$=93.48]{kudoh2000} and Weighted Probability
1251: Distribution Voting \citep[][F$_{\beta=1}$=93.32]{hvh2000}.
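As a consistency check on these figures, and assuming the usual
definition of the F$_{\beta}$ rate as the weighted harmonic mean of
precision $P$ and recall $R$, the overall score can be recomputed from
the rounded precision and recall values:
\[
F_{\beta} = \frac{(\beta^2+1) P R}{\beta^2 P + R}, \qquad
F_{\beta=1} = \frac{2 \times 94.04 \times 91.00}{94.04 + 91.00}
 = \frac{17115.28}{185.04} \approx 92.50 .
\]
The same computation shows why a chunk type with high precision but
low recall, such as ADJP in Table \ref{tab-xp}, receives a modest
score: $2 \times 85.25 \times 59.36 / (85.25 + 59.36) \approx 69.99$.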
1252:
1253: \begin{table}[t]
1254: \begin{center}
1255: \begin{tabular}{|l|c|c|c|}\cline{2-4}
1256: \multicolumn{1}{l|}{test data}
1257: & precision & recall & F$_{\beta=1}$ \\\hline
1258: ADJP & 85.25\% & 59.36\% & 69.99 \\
1259: ADVP & 85.03\% & 71.48\% & 77.67 \\
1260: CONJP & 42.86\% & 33.33\% & 37.50 \\
1261: INTJ &100.00\% & 50.00\% & 66.67 \\
1262: LST & 0.00\% & 0.00\% & 0.00 \\
1263: NP & 94.14\% & 92.34\% & 93.23 \\
1264: PP & 96.45\% & 96.59\% & 96.52 \\
1265: PRT & 79.49\% & 58.49\% & 67.39 \\
1266: SBAR & 89.81\% & 72.52\% & 80.25 \\
1267: VP & 93.97\% & 91.35\% & 92.64 \\\hline
1268: all & 94.04\% & 91.00\% & 92.50 \\\hline
1269: \end{tabular}
1270: \end{center}
1271: \caption{
The results per chunk type of processing the test data with the
Double-Phase Approach and Majority Voting.
Although the data is formatted differently from the noun phrase
1275: chunking data, the NP F$_{\beta=1}$ rate here (93.23) is close to
1276: that of our NP chunking F$_{\beta=1}$ rate (93.34).
1277: }
1278: \label{tab-xp}
1279: \end{table}
1280:
1281: % \subsection{Discussion}
1282: % compare
1283: % why sign deviation so big? sets contain different data!
1284: % why no help fsearch?
1285: % error analysis
1286:
1287: \section{Parsing}
1288:
1289: In this section we will examine the application of memory-based
1290: shallow parsing to generating embedded structures.
1291: We will examine three tasks: clause identification, noun phrase
1292: parsing and full parsing.
1293: Whenever possible, we will use the methods that we have applied to
1294: chunking in the previous section.
1295:
1296: \subsection{Clause Identification}
1297: \label{sec-clauses}
1298:
In clause identification the goal is to divide sentences into clauses,
which typically contain a subject and a predicate.
We have used the clause data of the CoNLL-2001 shared task
\citep{tksdejean2001conll}, which was derived from the Wall Street
Journal part of the Penn Treebank \citep{marcus93}.
1304: Here is an example sentence from the Treebank, with all information
1305: but words and clause brackets omitted:
1306:
1307: \begin{quote}
1308: \noindent
1309: (S Coach them in\\
1310: \hspace*{0.25cm}(S--NOM handling complaints)\\
1311: \hspace*{0.25cm}(SBAR--PRP so that\\
1312: \hspace*{0.50cm}(S they can resolve problems immediately)\\
1313: \hspace*{0.25cm})\\
1314: \hspace*{0.25cm}.\\
1315: )
1316: \end{quote}
1317:
1318: \noindent
1319: This sentence contains four clauses.
1320: In the data that we have worked with, the function and type
1321: information has been removed.
1322: This means that the type tags NOM and PRP have been omitted and that
1323: the SBAR tag has been replaced by S.
1324: Like the chunking data, these data sets contained words and
1325: part-of-speech tags which were generated by the Brill tagger
1326: \citep{brill94}.
1327: Additionally they contained chunk tags which were computed by the
1328: arbitrary chunking method we discussed in the previous section.
1329:
1330: We have approached identifying clauses in the following
1331: way:\footnote{This approach and the results achieved with it have
1332: earlier been discussed by \cite{tks2001conll}.}
1333: first we evaluated different memory-based learners for predicting the
1334: positions of open clause brackets and close clause brackets,
1335: regardless of their level of embedding.
The two resulting bracket streams will generally be inconsistent and in
order to solve this we have developed a list of rules which change a
possibly inconsistent set of brackets into a balanced structure.
1339: The evaluation of the learners and the development of the balancing
1340: rules will be done with 10-fold cross-validation of the CoNLL-2001
1341: training data.
1342: Information leaking is prevented by using corpus clause tags as
1343: context features in the training data of cascaded learners rather than
1344: clause tags computed in a previous learning phase.
1345: The best learner configurations and balancing rules found will be
1346: applied to the data for the clause identification shared task.
1347:
As in our noun phrase chunking work, we have tested memory-based
1349: learners with different sets of features.
1350: At the time we performed these experiments, we did not have access to
1351: feature selection methods and therefore we have only evaluated a few
1352: fixed feature configurations:
1353:
1354: \begin{enumerate}
1355: \itemsep -0.1cm
1356: \item words only (w)
1357: \item POS tags only (p)
1358: \item chunk tags only (c)
1359: \item words and POS tags (wp)
1360: \item words and chunk tags (wc)
\item POS tags and chunk tags (pc)
1362: \item words, POS tags and chunk tags (wpc)
1363: \end{enumerate}
1364:
1365: \noindent
1366: All feature groups were tested with four context sizes: no context
1367: information or information about a symmetrical window of one, two or
1368: three words.
As in our chunking work, we want to check if an improved performance
can be obtained by using system combination.
However, since we attempt to predict brackets at all levels in one
step, we cannot use the five data representations here.
Instead we have evaluated combinations of some of the feature
configurations mentioned above: a majority vote of the three using a
1375: single type of information (1+2+3), a majority vote of the three using
1376: pairs of information (4+5+6) and a majority vote of the previous two
1377: and the one using three types of information (7+(1+2+3)+(4+5+6)).
1378: The last one is a combination of three results of which two themselves
1379: are combinations of three results.
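The nested vote can be written out as follows.
This is a sketch in notation introduced here for illustration: we
assume that each configuration $j$ makes a binary decision
$b_j \in \{0,1\}$ for every word position (1 meaning that a bracket is
predicted) and that a majority of three is defined as
\[
\mathrm{maj}(x,y,z) = \left\{ \begin{array}{ll}
1 & \mbox{if } x+y+z \geq 2 \\
0 & \mbox{otherwise.}
\end{array} \right.
\]
The three combinations are then
\[
v_8 = \mathrm{maj}(b_1,b_2,b_3), \qquad
v_9 = \mathrm{maj}(b_4,b_5,b_6), \qquad
v_{10} = \mathrm{maj}(b_7,v_8,v_9).
\]
The nested vote is not the same as a flat vote over all seven
configurations.
For example, with $(b_1,\ldots,b_7) = (1,1,1,1,0,0,0)$ four of the
seven configurations predict a bracket, but $v_8 = 1$, $v_9 = 0$ and
therefore $v_{10} = \mathrm{maj}(0,1,0) = 0$: configuration 7 carries
as much weight as each group of three.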
1380:
1381: Clauses may contain many words and it is possible that the maximal
1382: context used by the learner, three words left and right, is not enough
1383: for predicting clause boundaries accurately.
1384: However, we cannot make the context size much larger than three
1385: because that would make it harder for the learner to generalize.
1386: We have tried to deal with this problem by evaluating another set of
1387: features which contain summaries of sentences rather than every word.
1388: Since we have chunk information of the sentences available, we
1389: can compress them by removing all words from each chunk except the
1390: main one, the head word.
1391: The head words can be generated by a set of rules put forward by
1392: \cite{magerman95} and modified by \cite{collins99}.\footnote{Available
1393: on http://www.research.att.com/\~{ }mcollins/papers/heads}
1394: After removing the nonhead words from each chunk, we can replace the
1395: POS tag of the remaining word with the chunk tag and thus obtain data
1396: with words and chunk tags only (words outside of a chunk keep their
1397: POS tag).
1398: Again we have evaluated sets of features which hold a single type of
1399: information, words (w--) or chunk tags (c--), or pairs of information,
1400: words and chunk tags (wc--).
1401:
1402: \begin{table}[t]
1403: \begin{center}
1404: \begin{tabular}{ r|l|c|c|c|c|r|l|c|c|c|c|}\cline{3-6}\cline{9-12}
1405: \multicolumn{1}{l}{} & \multicolumn{1}{l|}{train} &
1406: 0 & 1 & 2 & 3 & \multicolumn{1}{l}{~~~~} & train &
1407: 0 & 1 & 2 & 3 \\\cline{2-6}\cline{8-12}
1408: 1 & w & 61.77 & 84.40 & 83.74 & 81.08 &
1409: 1 & w & 61.11 & 75.99 & 77.52 & 77.63 \\
1410: 2 & p & 30.44 & 80.40 & 80.47 & 76.85 &
1411: 2 & p & 61.71 & 77.52 & 78.74 & 77.95 \\
1412: 3 & c & 13.67 & 76.76 & 79.05 & 78.71 &
1413: 3 & c & 00.00 & 67.25 & 75.06 & 75.70 \\
1414: 4 & wp & 62.24 & 87.19 & 84.45 & 81.22 &
1415: 4 & wp & 61.25 & 76.52 & 77.92 & 78.12 \\
1416: 5 & wc & 67.95 & 87.31 & 85.74 & 82.97 &
1417: 5 & wc & 61.01 & 75.96 & 77.46 & 77.79 \\
1418: 6 & pc & 49.29 & 86.65 & 84.92 & 81.72 &
1419: 6 & pc & 61.74 & 77.44 & 78.40 & 77.93 \\
1420: 7 & wpc & 68.66 & 87.92 & 85.93 & 83.28 &
1421: 7 & wpc & 61.21 & 76.17 & 77.73 & 78.00 \\\cline{2-6}\cline{8-12}
1422: 8 & 1+2+3 & 38.32 & 85.24 & 86.92 & 85.38 &
1423: 8 & 1+2+3 & 61.67 & 75.93 & 79.60 & 79.94 \\
1424: 9 & 4+5+6 & 68.04 & 88.83 & 87.44 & 84.98 &
1425: 9 & 4+5+6 & 61.44 & 77.30 & 79.15 & 79.38 \\
1426: 10 & 7+8+9 & 68.03 & 88.75 & 87.72 & 85.45 &
1427: 10 & 7+8+9 & 61.44 & 77.20 & 79.25 & 79.60 \\\cline{2-6}\cline{8-12}
1428: 11 & w- & 54.05 & 83.70 & 83.48 & 81.25 &
1429: 11 & w- & 61.24 & 76.01 & 78.69 & 79.25 \\
1430: 12 & c- & 14.26 & 77.70 & 79.30 & 78.50 &
1431: 12 & c- & 61.73 & 76.82 & 78.34 & 80.90 \\
1432: 13 & wc- & 58.47 & 86.53 & 85.74 & 82.77 &
1433: 13 & wc- & 61.43 & 76.77 & 80.15 & 81.61 \\\cline{2-6}\cline{8-12}
1434: \end{tabular}
1435: \end{center}
1436: \caption{
1437: F$_{\beta=1}$ rates obtained in 10-fold cross-validation experiments
1438: with the training data while predicting open clause brackets (left)
1439: and close clause brackets (right).
1440: We used different combinations of information (w: words, p: POS tags
1441: and c: chunk tags) and different context sizes (0-3).
The best results for open brackets have been obtained with a majority
vote of three information pairs while using context size 1 (row 9).
For close clause brackets best results were obtained with words and
chunk tags after compressing the chunks and while using context size 3
(row 13).
1447: }
1448: \label{tab-cl10}
1449: \end{table}
1450:
We have evaluated the thirteen groups of feature sets while predicting
1452: the clause open and clause close brackets.
1453: The results can be found in Table \ref{tab-cl10}.
1454: The learner performed best while predicting open clause brackets with
1455: information about the words immediately next to the current word
1456: (column 1).
1457: When more information was available, its performance dropped slightly.
1458: Of the different feature sets tested, the majority vote of sets that
1459: used pairs of information performed best (column 1, row 9).
1460: The classifiers that generated close brackets improved whenever extra
1461: context information became available.
1462: The best performance was reached while using a pair of words and chunk
1463: tags in the summarized format (column 3, row 13).
1464: We have performed an extra experiment to test if the system improved
1465: when using four context words rather than three.
1466: With words and chunk tags in the summarized format the system obtained
1467: F$_{\beta=1}$=81.72 for context size four compared with 81.61 for
1468: context size three.
1469: This increase is small so we have chosen context size three for our
1470: further experiments.
1471:
1472: With the streams of open and close brackets, we attempted to generate
1473: balanced clause structures by modifying the data streams with a set of
1474: heuristic rules.
1475: In these rules we gave more confidence to the open bracket predictions
since, as can be seen in Table \ref{tab-cl10}, the system performs
1477: better in predicting open brackets than close brackets.
1478: After testing different rule sets created by hand and evaluating these
1479: on the available training data, we decided on using the following rule
1480: set:
1481:
1482: \begin{enumerate}
1483: \itemsep -0.1cm
1484: \item Assume that exactly one clause starts at each clause start
1485: position.
1486: \item Assume that exactly one clause ends at each clause end
1487: position but
1488: \item ignore all clause end positions when currently no clause is
1489: open, and
1490: \item ignore all clause ends at non-sentence-final positions
1491: which attempt to close a clause started at the first word of the
1492: sentence.
1493: \item If clauses are opened but not closed at the end of the sentence
1494: then close them at the penultimate word of the sentence.
1495: \end{enumerate}
1496:
1497: \noindent
1498: These rules were able to generate complete and consistent embedded
1499: clause structures for the output that the system generated for the
1500: training data of the CoNLL-2001 shared task.
1501: The rules have one main defect: they are incapable of predicting that
1502: two or more clauses start at the same position.
This makes it impossible for the system to detect such clause
starts.
Unfortunately, according to our rule set evaluation, adding
recognition facilities for such multiple clause starts would have a
negative influence on overall performance levels.
1507: This set of rules obtained a clause F$_{\beta=1}$ of 71.34 on the
1508: training data of this task when applied to the best results for open
1509: and close brackets.
1510: The rules did not change the open bracket positions and on average the
1511: changes they made to the close bracket positions were an improvement
1512: (F$_{\beta=1}$ = 84.11 compared to 81.61).
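To illustrate how the rule set operates, consider a constructed
six-token sentence $w_1 \ldots w_6$ (a hypothetical example, not taken
from the data) for which the classifiers predict open brackets at $w_1$
and $w_4$ and close brackets at $w_3$ and $w_5$.
We assume here that a close bracket is paired with the most recently
opened clause.
Applying the rules as stated above, both open brackets are accepted
(rule 1).
The close bracket at $w_3$ is ignored because it is not sentence-final
and would close the clause that started at the first word (rule 4).
The close bracket at $w_5$ closes the clause opened at $w_4$ (rule 2).
The clause opened at $w_1$ is still open at the end of the sentence and
is closed at the penultimate word $w_5$ (rule 5):

\begin{quote}
\noindent
predicted: ( $w_1$ $w_2$ $w_3$ ) ( $w_4$ $w_5$ ) $w_6$ \\
balanced: ( $w_1$ $w_2$ $w_3$ ( $w_4$ $w_5$ ) ) $w_6$
\end{quote}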
1513:
1514: An argument which could be made is that since open bracket prediction
1515: is more accurate than close bracket prediction, one could use the
1516: information of the open bracket positions when predicting clause
1517: close brackets.
1518: We have attempted to do this by repeating the experiment with the best
1519: configuration for close brackets (wc-- with context size 3) while
1520: adding a feature which stated at which clause level the current word
1521: was, according to earlier open and close brackets.
1522: This approach improved the F$_{\beta=1}$ rate of the close bracket
1523: predictor from 81.61 to 83.50.
1524: However, after applying the balancing rules to the open brackets and
1525: the improved close brackets, we only got a clause F$_{\beta=1}$ of
1526: 71.39, a minimal improvement over the previous 71.34.
1527: It seems that the extra performance gain obtained in the close bracket
1528: predictor was obtained by solving problems which could already be
1529: solved by the balancing rules.
1530:
1531: We applied the balancing rules together with an open bracket predictor
1532: using a combination of pairs of feature types (context size 1) and a
1533: close bracket predictor using summarized pairs of words and chunk tags
1534: (context size 3) to the data files of the CoNLL-2001 shared task.
1535: Our clause identification method obtained an F$_{\beta=1}$ rate of
1536: 67.79 for identifying complete clauses (precision 76.91\% and recall
1537: 60.61\%).
1538: In the CoNLL-2001 shared task, the system finished third of six
1539: participants.
1540: One system outperformed the others by a large margin: the boosted
1541: decision tree method by \cite{carreras2001}.
1542: Their system obtained an F$_{\beta=1}$ rate of 78.63 on this task.
1543: The main difference between their approach and ours is that they use a
1544: larger number of features, methods for predicting multiple
1545: co-occurring clause starts and a more advanced statistical model for
1546: combining brackets to clauses.
1547:
1548: In a post-conference study, we have attempted to estimate more
1549: precisely the cause of the performance difference between our method
and the boosted decision trees used by \cite{carreras2001}.
1551: Our hypothesis was that not only the choice of system made a
1552: difference, but also the choice of features.
For this purpose, Carreras and M\`arquez kindly repeated an experiment
1554: in predicting open brackets but this time while using our feature set:
1555: pairs of information using a window of one word left and one right,
1556: while results were combined with majority voting (Table
1557: \ref{tab-cl10}, left, row 9, column 1).
1558: The experiment was performed while testing on the CoNLL-2001
1559: development data set.
1560: Originally the memory-based learner obtained F$_{\beta=1}$ = 89.80 on
1561: this data set while their boosted decision tree approach reached
1562: 93.89.
1563: However, while using the memory-based feature set, the performance of
1564: the decision trees dropped to 91.32.
When both systems use the same features, the boosted decision trees
outperform the memory-based learner.
However, the boosted decision trees perform better still with their own
feature set.
Our hypothesis was correct: the performance difference between the two
approaches was caused both by the choice of the learner and by the
choice of the feature set.
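The three scores allow a rough decomposition of the gap on the
development data.
Assuming, as a simplification, that the effects of the learner and of
the feature set are additive:
\[
\underbrace{93.89 - 89.80}_{\mbox{total gap: } 4.09}
 = \underbrace{(93.89 - 91.32)}_{\mbox{feature set: } 2.57}
 + \underbrace{(91.32 - 89.80)}_{\mbox{learner: } 1.52}
\]
Under this assumption, roughly 63\% of the difference ($2.57/4.09$) is
associated with the feature set and 37\% with the learning algorithm.
The split is only indicative, since it measures the feature effect for
the boosted decision trees alone.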
1571:
1572: The next obvious question is whether the memory-based system would
1573: perform better with the feature set of the boosted decision trees.
1574: Providing an answer to this question was nontrivial.
1575: The feature set consisted of thousands of binary features which were
1576: more than the memory-based learner could handle.
1577: After converting the features from binary-valued to multi-valued,
1578: there were about 70 features left.
1579: At best, the system obtained F$_{\beta=1}$ = 90.52 with this feature
1580: set.
Since we feared that the number of features was still too large for
the system to handle, we performed a forward sequential selection
search process in the feature space starting with zero features.
1584: The memory-based learner reached an optimal performance with 13
1585: features at F$_{\beta=1}$ = 91.82.
These results show that there is still room for improvement for the
memory-based learner and that combining it with a feature selection
method is helpful.
1589:
1590: % results collins ?
1591:
1592: \subsection{Noun Phrase Parsing}
1593: \label{sec-npp}
1594:
1595: Noun phrase parsing is similar to noun phrase chunking but this time
1596: the goal is to find noun phrases at all levels.
1597: This means that just like in the clause identification task we need to
1598: be able to recognize embedded phrases.
1599: The following example sentence will illustrate this:
1600:
1601: \begin{quote}
1602: In ( early trading ) in ( Hong Kong ) ( Monday ) , ( gold ) was quoted \\
1603: at ( ( \$ 366.50 ) ( an ounce ) ) .
1604: \end{quote}
1605:
1606: \noindent
1607: This sentence contains seven noun phrases of which the one containing
1608: the final four words of the sentence consists of two embedded noun
1609: phrases.
1610: If we use the same approach as for clause identification, retrieving
1611: brackets of all phrase levels in one step and balancing these, we will
1612: probably not detect this noun phrase because it starts and ends
1613: together with other noun phrases.
1614: Therefore we will use a different approach here.
1615:
1616: We will recover noun phrases at different levels by performing
1617: repeated chunking \citep{tks2000naacl}.
1618: We will start with data containing words and part-of-speech tags and
1619: identify the base noun phrases in this data with techniques used in
1620: our noun phrase chunking work.
1621: After this we will replace the phrases that were found by the head
1622: words and their tags.
1623: This will create a summary of the sentences with words and a mixed
1624: data stream of POS tags and chunk tags.
1625: We can apply our noun phrase chunking techniques to this data one more
1626: time and find noun phrases one level above the base level.
1627: The compressing and chunking steps will be repeated in order to
1628: retrieve phrases at higher levels.
1629: The process will stop when no new phrases are found.
1630:
1631: The approach described here seems a trivial expansion of our noun
1632: phrase chunking work.
1633: However, there are some details left to discuss.
First, there is the selection of the head word during the phrase
1635: summarization process.
1636: At the time we performed these experiments, we did not have access to
1637: the Magerman/Collins set of rules for determining head words, and
1638: therefore we used a rule created by ourselves: the head word of a noun
1639: phrase is the final word of the first noun cluster in the phrase or
1640: the final word of the phrase if it does not contain a noun cluster.
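Applied to the example sentence above, and assuming that a noun cluster
is a maximal sequence of adjacent words with a noun POS tag, this rule
selects the following heads (an illustration constructed for this
section; the POS tags are assumed to be the usual Penn Treebank ones):

\begin{quote}
\noindent
( Hong Kong ) $\rightarrow$ Kong (final word of the noun cluster
\emph{Hong Kong}) \\
( an ounce ) $\rightarrow$ ounce (final word of the noun cluster
\emph{ounce}) \\
( \$ 366.50 ) $\rightarrow$ 366.50 (no noun cluster, so the final word
of the phrase)
\end{quote}

\noindent
After compression, the end of the sentence reads \emph{at 366.50 ounce .},
in which the two remaining head words form the noun phrase one level
above the base level.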
1641:
The second fact we should mention is that the data we used contains
a different format of noun phrase chunks than the data we have
previously worked with.
1645: In this task we use the data set which was developed for the noun
1646: phrase bracketing shared task of CoNLL-99 \citep{osborne99}.
1647: It was extracted from the Wall Street Journal part of the Penn
1648: Treebank \citep{marcus93} without extra modifications and this means,
1649: for example, that possessives between two noun phrases have been
1650: attached to the first one unlike in the noun phrase chunking data.
Because of this and other differences, we cannot be sure that the
techniques we developed for the other base noun phrase format will
work very well here.
1654: Indeed, there is a performance drop in the chunking part of our shallow
1655: parser when compared with the chunking work (F$_{\beta=1}$ of 92.77
1656: compared with 93.34).
1657: However, we decided not to put extra work in searching for a better
1658: configuration for our noun phrase chunker and have trained an existing
1659: chunker with the data available for this task.
1660:
1661: An unforeseen problem occurred when we attempted to use the chunker for
1662: identifying noun phrases above the base level.
1663: Our chunker output is a majority vote of five systems using different
1664: data representations.
1665: In our evaluation work with tuning data (WSJ section 21), we
1666: observed that the overall output of the chunker at nonbase levels was
1667: worse than the performance of the best individual system
1668: \citep{tks2000naacl}.
The reason for this is that the system that used the O+C data
representation outperformed the other four systems by a large margin.
1671: Because of this, and probably because the other four systems
1672: made similar errors, the errors of the four cancelled some of the
1673: correct analyses of the best system and caused the majority vote to be
1674: worse than the best individual system.
1675: For this reason we have decided to use only the bracket
1676: representations when processing noun phrases above base levels.
1677:
1678: The main open question in this study is what training data to use
1679: when processing the nonbase noun phrases.
1680: In order to find an answer to this question we have tested several
1681: configurations while processing tuning data, WSJ section 21, with
1682: the training data for the CoNLL-99 shared task.
1683: We have tested six training data configurations for predicting open
1684: and close bracket positions: using all bracket positions, those of
1685: base phrases only, those of all phrases except base phrases, those of
1686: phrases of the current level only, those of the current level and the
1687: previous, and those of the current level and the next.
At all levels, using the brackets of the current level only proved to
work best or close to best.
1690: At the sixth level no new noun phrases were detected.
1691: Therefore we decided to use only brackets of one phrase level in the
1692: training data for nonbase phrases and stop phrase identification after
1693: six levels.
1694:
1695: We have applied a noun phrase chunker with fixed symmetrical context
1696: sizes to the noun phrase data of the CoNLL-99 shared task
1697: \citep{tks2000naacl}.
1698: The chunker generated a majority vote of open and close brackets put
1699: forward by five systems, each of which used a different representation
of the base noun phrases (IOB1, IOB2, IOE1, IOE2 and O+C).
All systems used a window of four left and four right for words and
POS tags (18 features) and the four systems using IO representations
additionally performed an extra pass with a window of three left and
three right for words and POS tags, and a window of two left and two
right without the focus tag for chunk tags (also 18 features).
1706: The output of the chunker was presented to a cascade of six chunkers,
1707: each of which consisted of a pair of open and close bracket predictors
1708: which were trained with brackets from one of the levels 1 to 6.
1709: After each chunk phase the phrases found were replaced by the head
1710: word of the phrase and a fixed chunk tag.
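The feature counts mentioned above follow from the window sizes, if we
assume one feature per word position and information type and count
the focus position once per type:
\[
\underbrace{2 \times (4 + 1 + 4)}_{\mbox{words and POS tags}} = 18,
\qquad
\underbrace{2 \times (3 + 1 + 3)}_{\mbox{words and POS tags}}
 + \underbrace{(2 + 2)}_{\mbox{chunk tags, no focus}} = 14 + 4 = 18 .
\]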
1711:
1712: The system obtained an overall F$_{\beta=1}$ rate of 83.79 (precision
1713: 90.00\% and recall 78.38\%) for identifying arbitrary noun
1714: phrases.\footnote{This performance was already reported by
1715: \cite{tks2000naacl}.}
1716: It is slightly better than our performance at CoNLL-99 (82.98,
1717: obtained without system combination) which was the best of two entries
1718: submitted for the shared task at that workshop.
1719: The performance of our noun phrase chunker can be regarded as a
1720: baseline score for this data set.
1721: This score is already quite high: F$_{\beta=1}$ = 79.70, and it seems
1722: that the nonbase level chunkers have not been contributing much to the
1723: performance of this shallow parser.
1724: Out of curiosity we have also examined how well a full parser does on
1725: the task of identifying arbitrary noun phrases.
1726: For this purpose we looked at output data of a parser described by
1727: \cite{collins99} which was provided with the parser code (WSJ section
1728: 23, model 2).
1729: The parser obtained F$_{\beta=1}$ = 89.8 (precision 89.3\% and recall
1730: 90.4\%) for this task.
1731: This is a lot better than our shallow parser but we should note that
1732: compared with our application, the Collins parser has access to better
1733: part-of-speech tags and more training data with more sophisticated
1734: annotation rather than only noun phrase boundaries.
1735:
1736: \subsection{Full Parsing}
1737: \label{sec-par}
1738:
1739: The approach for parsing noun phrases outlined in the previous
section can be used for generating parse trees containing phrases of
arbitrary types as well.
In that case we would be using chunking techniques for performing full
parsing.
This is not a new idea.
1745: \cite{ejerhed83} present a Swedish grammar which includes noun phrase
1746: chunk rules.
1747: \cite{abney91} describes a chunk parser which consists of two parts:
1748: one that finds base chunks and another that attaches the chunks
1749: to each other in order to obtain parse trees.
\cite{daelemans95} suggested finding long-distance dependencies with a
1751: cascade of lazy learners among which were constituent identifiers.
1752: \cite{ratnaparkhi98} built a parser based on a chunker with an
1753: additional bottom-up process which determines at what position to
1754: start new phrases or to join constituents with earlier ones.
1755: With this approach he obtained state-of-the-art parsing results.
1756: \cite{brants99} applied a cascade of Markov model chunkers to the task
1757: of parsing German sentences.
1758: We have extended our noun phrase parsing techniques to parsing
1759: arbitrary phrases \citep{tks2001clin}.
1760: We will present the main findings of this study here as well.
1761:
1762: The standard data sets for testing statistical parsers are different
1763: than the ones we used for our earlier work on chunking and shallow
1764: parsing.
1765: The data sets have been extracted from the Wall Street Journal (WSJ)
1766: part of the Penn Treebank \citep{marcus93} as well but they contain
1767: different segments.
The training data consists of sections 02--21 (39,832 sentences) while
section 23 is used as test data (2,416 sentences).
The data sets consist of words and part-of-speech tags which have
1771: been generated by the part-of-speech tagger described by
1772: \cite{ratnaparkhi96}.
1773: In the data the phrase types ADVP and PRT have been collapsed into one
1774: category and during evaluation the positions of punctuation signs in
1775: the parse tree have been ignored.
1776: These adaptations have been done by different authors in order to make
1777: it possible to compare the results of their systems with the first
1778: study that used these data sets \citep{magerman95} and all follow-up
1779: work.
1780:
1781: In our work on arbitrary parsing, we were interested in finding an
1782: answer to four questions.
1783: In order to obtain these answers, we have performed tests with smaller
1784: data sets which were taken from the standard training data for this
1785: task: WSJ sections 15-18 as training data and section 20 as test data.
The first topic we were interested in was the influence of context
size and of the size of the examined nearest neighborhood (parameter $k$
of the memory-based learner) on the performance of the parser.
We took the noun phrase parser developed in the previous section,
lifted its restriction of generating noun phrases only and applied it
to this data set while using different context sizes and values for
parameter $k$ for the classifiers that identified phrases above the base
levels.
The different types of the chunks were derived by using the
Double-Phase Approach for chunking (see Section \ref{sec-chuarb}).
The best configuration we found was a context of two left and two
right words and POS tags with $k=1$.
1798: The nearest neighborhood size is smaller than used in our earlier work
1799: (3) and the best context size is smaller than in our noun phrase
1800: chunking work (4).
1801: However, the best context size we found for this task is exactly the
1802: same as reported by \cite{ratnaparkhi98}.
1803:
1804: The second topic we were interested in was the type of training data
1805: that should be used for finding phrases above the base level.
1806: In our noun phrase parsing work, we found that the best performance
1807: could be obtained by using only data of the current phrase level.
This will cause a problem for our parser, since the tree depth may
become as large as 31 in our corpus but there will be little training
material available for these high level phrases if we use the same
training configuration as in our noun phrase parsing work.
We have tested two different training configurations to see if we
could use more training data for this task without losing performance.
With the first of these, using the current, previous and next phrase level,
performance was about as good (F$_{\beta=1}$=77.13) as when using only the
current level (77.17).
1817: However, when we trained the cascade of chunkers while using brackets
1818: of all phrase levels, the performance dropped to 67.49.
1819: We have decided to keep on using the current phrase level only in the
1820: training data despite its problems with identifying higher level
1821: phrases.
1822:
1823: In the results that we have presented in this paper, the precision
1824: rates have always been higher than the recall rates.
In part, this is caused by the method we use for balancing open
1826: brackets and close brackets.
1827: It removes all brackets which cannot be matched with another one which
1828: is approximately the same as accepting clauses which are likely to be
1829: correct and throwing away all others.
1830: We wanted to test if we could obtain more balanced precision and
1831: recall rates because we hoped that these would lead to a better
1832: F$_{\beta=1}$ rate.
1833: Therefore we have tested two alternative methods for combining
1834: brackets.
1835: The first disregarded the type of the open brackets and allowed close
1836: brackets to be combined with open brackets of any type.
1837: The second method allowed open brackets to match with close brackets
1838: of any type.
1839: Unfortunately neither the first (F$_{\beta=1}$=72.33) nor the second
1840: method (76.06) managed to obtain the same F$_{\beta=1}$ rate as our
1841: standard method for combining brackets.
1842: Therefore we decided to stick with the latter.
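
The effect of type-sensitive bracket matching on precision and recall
can be illustrated with a small sketch.
It is not the implementation used in our experiments: the function
name, the data layout and the pairing rule (an open bracket is kept
only if the next predicted bracket is a close bracket) are assumptions
made for the example, and the \texttt{typed=False} setting only loosely
mimics the relaxed matching methods described above.

```python
def combine_brackets(opens, closes, typed=True):
    """Pair predicted open and close brackets at one level (illustrative only).

    opens, closes: lists of (token_index, phrase_type) predictions.
    Assumption: an open bracket is paired with a close bracket only when
    that close bracket is the next predicted bracket after it.  With
    typed=True both brackets must also carry the same phrase type.
    All brackets that cannot be paired are dropped.
    """
    # kind 0 = open, kind 1 = close; opens sort first at equal positions,
    # so a one-word chunk (open and close on the same token) is possible.
    events = sorted([(i, 0, t) for i, t in opens] +
                    [(j, 1, t) for j, t in closes])
    chunks = []
    pending = None  # most recent open bracket that is still unmatched
    for pos, kind, ptype in events:
        if kind == 0:
            pending = (pos, ptype)  # an older unmatched open bracket is dropped
        elif pending is not None:
            if not typed or pending[1] == ptype:
                chunks.append((pending[0], pos, pending[1]))
            pending = None  # the open bracket is paired or discarded
        # a close bracket without a pending open bracket is dropped
    return chunks


opens = [(0, "NP"), (3, "VP"), (5, "NP")]
closes = [(1, "NP"), (4, "NP"), (7, "NP")]
strict = combine_brackets(opens, closes)                # type-sensitive
relaxed = combine_brackets(opens, closes, typed=False)  # types ignored
```

In this toy input the type-sensitive variant discards the mismatched
VP/NP pair, while the relaxed variant keeps it as a VP chunk.
Discarding doubtful pairs in this way favors precision over recall.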
1843:
1844: The final issue which we wanted to examine is the performance
1845: progression of the parser at the different levels of the process.
1846: The recall of the parser should increase for every extra step in the
1847: cascade of chunkers but we would also like to know how precision and
1848: F$_{\beta=1}$ progressed.
1849: We have measured this for our small parameter tuning data set and
1850: found that indeed recall increased until level 30 of a maximum of 32
1851: and remained constant after that.
1852: Precision dropped until the same level, remaining at the same value
1853: afterwards while F$_{\beta=1}$ reached a top value at level 19 and
1854: dropped afterwards.
1855: The reason for the later drop in F$_{\beta=1}$ value is that while
1856: the recall is still rising, it cannot make up for the loss of
1857: precision at later levels.
1858: Since we want to optimize the F$_{\beta=1}$ rate, we have decided to
1859: restrict the number of cascaded chunkers in our parser to 19 levels.
1860: We have added an extra post-processing step which after the 19 levels
1861: of processing adds clause brackets (S) around sentences which have not
already been identified as clauses.
1863:
1864: We have applied the best parser configuration found to the standard
1865: parsing data.
1866: Our parser used an arbitrary chunker with the configuration described
1867: in Section \ref{sec-chuarb} (a Majority Vote of five systems using
1868: different data representations) but trained with the relevant data for
1869: this task.
1870: Higher level phrases were identified by a cascade of 19 chunkers, each
1871: of which had a pair of independent open and close bracket classifiers
which used a context of two words and POS tags on either side
1873: while being trained with brackets of the current level only.
1874: At each level, open and close brackets were combined to chunks by
1875: removing all brackets that could not be matched with a bracket of the
1876: same type.
The parser contained a post-processing step which added clause
1878: brackets around sentences which were not identified as a clause after
1879: the 19 processing stages.
1880: This chunk parser obtained an F$_{\beta=1}$ rate of 80.49 on WSJ
1881: section 23 (precision 82.34\% and recall 78.72\%).
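
As a check on these figures, and assuming the usual definition of the
F$_{\beta=1}$ rate as the harmonic mean of precision $P$ and recall
$R$, the reported rate follows directly from the two percentages:
\begin{displaymath}
F_{\beta=1} = \frac{2PR}{P+R}
            = \frac{2 \times 82.34 \times 78.72}{82.34 + 78.72}
            \approx 80.49 .
\end{displaymath}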
1882:
1883: The performance of our chunk parser is modest compared with
state-of-the-art statistical parsers, which obtain F$_{\beta=1}$ rates
of around 90 \citep{collins99,charniak2000}.
1886: However, we have a couple of suggestions for improving its
1887: performance.
1888: First, we could attempt giving the parser access to more information,
1889: for example about lower phrase levels.
1890: Currently, the parser only knows the head words and phrase types of
1891: daughters of phrases that are being built and this might not be
1892: enough.
1893: Second, we could try to find a better method for predicting bracket
1894: positions.
1895: For reasons explained in the previous section, we could not use a
1896: majority vote of systems using different representations.
1897: This might have helped to obtain a better performance.
1898: Finally, we would like to change the greedy approach of our parser.
1899: Currently it chooses the best segmentation of chunks at each level and
1900: builds on that but ideally it would be able to remember some
1901: next-to-best configurations as well and perform backtracking from the
1902: earlier choices whenever necessary.
1903: This approach would probably improve performance considerably
1904: (as shown by \cite{ratnaparkhi98}, Table 6.5).
A practical problem which needs to be solved here is that in nearest
1906: neighbor memory-based learning alternative classes do not receive
1907: confidence measures.
1908: Rather, sets of item-dependent distances are used to determine the
1909: usability of the classes.
1910: Comparing partial trees requires comparing sets of distances and
1911: it is not obvious how this should be done.
1912:
1913: These extra measures will probably improve the performance of the
1914: chunk parser.
1915: However, it is questionable whether it is worthwhile continuing with this
1916: approach.
1917: The present version of the parser already requires a lot of memory and
1918: processing time: more than a second per {\it word} for chunking only
1919: compared with a mere 0.14 seconds per {\it sentence} for a statistical
1920: parser which performed better \citep{ratnaparkhi98}.
1921: Extra extensions will probably slow down the parser even more so we
1922: are not sure if extending this approach is worth the trouble.
1923:
1924: \begin{table}[t]
1925: \begin{center}
1926: \begin{tabular}{|l|c|c|c|}\cline{2-4}
1927: \multicolumn{1}{l|}{section 20} &
1928: precision & recall & F$_{\beta=1}$ \\\hline
1929: \cite{kudoh2001} & 94.15\% & 94.29\% & 94.22 \\
1930: \cite{tks2000coling} & 94.18\% & 93.55\% & 93.86 \\
1931: MBL & 94.01\% & 92.67\% & 93.34 \\
1932: \cite{tks2000naacl} & 93.63\% & 92.89\% & 93.26 \\
1933: \cite{munoz99} & 92.4\% & 93.1\% & 92.8 \\
1934: \cite{ramshaw95} & 91.80\% & 92.27\% & 92.03 \\
1935: \cite{argamon99} & 91.6\% & 91.6\% & 91.6 \\\hline
1936: baseline & 78.20\% & 81.87\% & 79.99 \\\hline
1937: \end{tabular}
1938: \end{center}
1939: \caption{A selection of results that have been published for the
1940: Ramshaw and Marcus data sets for noun phrase chunking.
1941: Our chunker (MBL) is third-best.
1942: The baseline results have been produced by a system that selects the
1943: most frequent chunk tag (IOB1) for each part-of-speech tag.
1944: The best performance for this task has been obtained by a system using
1945: Support Vector Machines
1946: \citep{kudoh2001}.
1947: }
1948: \label{tab-resnp}
1949: \end{table}
1950:
1951: \section{Related Work}
1952:
1953: In this section we will compare our work with that of others that have
1954: applied machine learning techniques to the same data sets.
1955: First we will discuss the two chunking tasks and then the tasks that
1956: required output of embedded structures.
1957: Many systems have been applied to the five tasks.
Rather than giving a detailed description of each of them, we will
1959: list the best performing systems for each task and mention some
1960: differences between these systems and ours.
1961: This comparison of our memory-based shallow parsers with other work
1962: shows that they produce state-of-the-art results for the chunking
1963: tasks but not for the tasks which require identification of embedded
1964: structures.
1965:
1966: \subsection{Chunking}
1967:
1968: Table \ref{tab-resnp} shows a selection of the best results published
1969: for the noun phrase chunking task.\footnote{An elaborate overview of
1970: most of the systems that have been applied to this task can be found
1971: on http://lcg-www.uia.ac.be/\~{ }erikt/research/np-chunking.html}
1972: As far as we know, the results presented in this paper (line MBL) are
1973: the third-best results.
1974: We have participated in producing the second-best result
\citep{tks2000coling} which was produced by combining the results
1976: of five different learning techniques.
1977: The best results for this data set have been generated with Support
1978: Vector Machines \citep{kudoh2001}.\footnote{Although we do not wish to
1979: underestimate the power of Support Vector Machines, we should note
1980: that it seems that the optimal results presented by \cite{kudoh2001}
1981: have been obtained by tuning the system to the test data.}
1982: A statistical analysis of our current result revealed that all
1983: performances outside of the region 92.93-93.73 are significantly
1984: different from ours.
This means that all results in the table, except for the 93.26, are
1986: significantly different from ours.
1987:
1988: A topic to which we have paid little attention is the analysis of the
1989: errors that our approach makes.
1990: Such an analysis would provide insights into the weaknesses of the
1991: system and might provide clues to methods for improving the system.
1992: For noun phrase chunking we have performed a limited error analysis
1993: by manually evaluating the errors that were made in the first section
1994: of a 10-fold cross-validation experiment on the training data while
1995: using the chunker described by \cite{tks2000naacl}.
This analysis revealed that the largest group of errors was caused by
1997: errors in the part-of-speech tags (28\% of the false positives/29\% of
1998: the false negatives).
In order to acquire reasonable results, it is customary not to use the
2000: part-of-speech tags from the Treebank, but use tags that have been
2001: generated by a part-of-speech tagger.
2002: This prevents the system performance from reaching levels which would
2003: be unattainable for texts for which no perfect part-of-speech tags
2004: exist.
2005: Unfortunately the tagger makes errors and some of these errors cause
2006: the noun phrase segmentation to become incorrect.
2007:
2008: The second most frequently occurring error cause was related to
2009: conjunctions of noun phrases (16\%/18\%).
Deciding whether a phrase like {\it red dwarfs and giants} consists of
2011: one or two noun phrases requires semantic knowledge and might be too
2012: ambitious for present-day systems to solve.
2013: The other major causes of errors all relate to similar hard cases:
2014: attachment of punctuation signs (15\%/12\%; inconsistent in the
2015: Treebank), deciding whether ambiguous phrases without conjunctions
2016: should be one or two noun phrases (11\%/12\%), adverb attachment
2017: (5\%/4\%), noun phrases containing the word {\it to} (3\%/3\%),
2018: Treebank noun phrase segmentation errors (3\%/1\%) and noun phrases
2019: consisting of the word {\it that} (0\%/2\%).
Apart from these hard cases there were also quite a few errors for
2021: which we could not determine an obvious cause (19\%/19\%).
2022:
2023: The most obvious suggestion for improvement that came out of the error
2024: analysis was to use a better part-of-speech tagger.
2025: We are currently using the Brill tagger \citep{brill94}.
2026: Better taggers are available nowadays but using the Brill tags here
2027: was necessary in order to be able to compare our approach with earlier
2028: studies, which have used the Brill tags as well.
2029: The error analysis did not produce other immediate suggestions for
2030: improving our noun phrase chunking approach.
2031: We are relieved about this because it would have been an
2032: embarrassment if our chunker had produced systematic errors.
2033: However, there is a trivial way to improve the results of the noun
2034: phrase chunker: by using more training data.
2035: Different studies have shown that by increasing the training data size
by 300\%, the F$_{\beta=1}$ error might drop by as much as 25\%
2037: \citep{ramshaw95,tks2000naacl,kudoh2001}.
2038: Another study for a different problem, confusion set disambiguation,
2039: has shown that a further cut in the error rate is possible with even
2040: larger training data sets \citep{banko2001}.
2041: In order to test this for noun phrase chunking we need a hand-parsed
2042: corpus which is larger than anything that is presently available.
2043:
2044: \begin{table}[t]
2045: \begin{center}
2046: \begin{tabular}{|l|c|c|c|}\cline{2-4}
2047: \multicolumn{1}{l|}{section 20} &
2048: precision & recall & F$_{\beta=1}$ \\\hline
2049: \cite{zhang2001} & 94.29\% & 94.01\% & 94.13 \\
2050: \cite{kudoh2001} & 93.89\% & 93.92\% & 93.91 \\
2051: \cite{kudoh2000} & 93.45\% & 93.51\% & 93.48 \\
2052: \cite{hvh2000} & 93.13\% & 93.51\% & 93.32 \\
2053: \cite{tks2000conll} & 94.04\% & 91.00\% & 92.50 \\
2054: \cite{zhou2000} & 91.99\% & 92.25\% & 92.12 \\
2055: \cite{dejean2000} & 91.87\% & 92.31\% & 92.09 \\\hline
2056: baseline & 72.58\% & 82.14\% & 77.07 \\\hline
2057: \end{tabular}
2058: \end{center}
2059: \caption{A selection of results that have been published for the
2060: arbitrary chunking data set of the CoNLL-2000 shared task.
2061: Our chunker
2062: \citep{tks2000conll}
2063: is fifth-best.
2064: The baseline results have been produced by a system that selects the
2065: most frequent chunk tag (IOB1) for each part-of-speech tag.
2066: The best performance for this task has been obtained by a system using
2067: regularized Winnow
2068: \citep{zhang2001}.
2069: Systems that have been applied both to the arbitrary chunking task and
2070: the noun phrase chunking task performed approximately equally well for
2071: NP chunks in both tasks.
2072: }
2073: \label{tab-resxp}
2074: \end{table}
2075:
2076: Table \ref{tab-resxp} contains a selection of the best results published
2077: for the arbitrary chunking data used in the CoNLL-2000 shared
2078: task.\footnote{More results for the chunking task can be found on
2079: http://lcg-www.uia.ac.be/conll2000/chunking/}
2080: Our chunker \citep{tks2000conll} is the fifth-best on this list.
2081: Immediately obvious is the imbalance between precision and recall:
2082: the system identifies a small number of phrases with a high precision
2083: rate.
2084: We assume that this is primarily caused by our method for generating
2085: balanced structures from streams of open and close brackets.
2086: We have performed a bootstrap resampling test on the chunk tag
2087: sequence associated with this result.
2088: An evaluation of 10,000 pairs indicated that the significance interval
2089: for our system (F$_{\beta=1}$ = 92.50) is 92.18-92.81 which means that
2090: all systems that are ahead of ours perform significantly better and all
2091: systems that are behind perform significantly worse.
2092: We are not sure what is causing these large performance differences.
2093: At this moment we assume that our approach has difficulty with
2094: classification tasks when the number of different output classes
2095: increases.
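
A bootstrap resampling test of this kind can be sketched as follows.
The sketch rests on several assumptions that are not taken from our
experimental setup: sentences are used as the resampling unit, each
sentence is summarized by its counts of correct, predicted and gold
chunks, and the interval is read off at the 90\% level.
The names \texttt{f\_rate} and \texttt{bootstrap\_interval} are
hypothetical.

```python
import random


def f_rate(counts):
    """F(beta=1) in percent from (correct, predicted, gold) chunk counts."""
    correct = sum(c for c, _, _ in counts)
    predicted = sum(p for _, p, _ in counts)
    gold = sum(g for _, _, g in counts)
    if correct == 0:
        return 0.0
    precision = correct / predicted
    recall = correct / gold
    return 100 * 2 * precision * recall / (precision + recall)


def bootstrap_interval(per_sentence, samples=10000, level=0.90, seed=0):
    """Percentile interval for F(beta=1) under sentence-level resampling.

    per_sentence: one (correct, predicted, gold) triple per sentence
    (an assumed data layout).  Each bootstrap sample draws as many
    sentences as the original set, with replacement.
    """
    rng = random.Random(seed)
    n = len(per_sentence)
    scores = sorted(
        f_rate([per_sentence[rng.randrange(n)] for _ in range(n)])
        for _ in range(samples)
    )
    cut = int((1 - level) / 2 * samples)  # scores trimmed from each tail
    return scores[cut], scores[samples - 1 - cut]


# Toy data: four sentence types repeated 25 times each.
data = [(3, 4, 4), (1, 2, 3), (2, 2, 2), (0, 1, 1)] * 25
low, high = bootstrap_interval(data, samples=500)
```

A competing system whose F$_{\beta=1}$ rate falls outside the interval
between \texttt{low} and \texttt{high} would then be regarded as
significantly different.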
2096:
2097: \begin{table}[t]
2098: \begin{center}
2099: \begin{tabular}{|l|c|c|c|}\cline{2-4}
2100: \multicolumn{1}{l|}{section 21}
2101: & precision & recall & F$_{\beta=1}$ \\\hline
2102: \cite{carreras2001} & 84.82\% & 73.28\% & 78.63 \\
2103: \cite{molina2001} & 70.89\% & 65.57\% & 68.12 \\
2104: \cite{tks2001conll} & 76.91\% & 60.61\% & 67.79 \\
2105: \cite{patrick2001} & 73.75\% & 60.00\% & 66.17 \\
2106: \cite{dejean2001} & 72.56\% & 54.55\% & 62.77 \\
2107: \cite{hammerton2001b} & 55.81\% & 45.99\% & 50.42 \\\hline
2108: baseline & 98.44\% & 31.48\% & 47.71 \\\hline
2109: \end{tabular}
2110: \end{center}
2111: \caption{Results of the clause identification part of the CoNLL-2001
2112: shared task.
2113: Our clause identifier \citep{tks2001conll} is third-best.
2114: The baseline results have been produced by a system that only puts
2115: clause brackets around complete sentences.
2116: The best performance for this task has been obtained by a system using
2117: boosted decision trees \citep{carreras2001}.
2118: }
2119: \label{tab-rescl}
2120: \end{table}
2121:
2122: \subsection{Parsing}
2123:
2124: A complete overview of the clause identification results of the
2125: CoNLL-2001 shared task can be found in Table \ref{tab-rescl}
2126: \citep{tksdejean2001conll}.
2127: Our approach was third-best.
2128: A bootstrap resampling test with a population of 10,000 random samples
2129: generated from our results produced the 90\% significance interval
2130: 66.66-68.95 for our system which means that our result is not
2131: significantly different from the second result.
2132: The boosted decision trees used by \cite{carreras2001} did a lot
2133: better than the other systems.
2134: In Section \ref{sec-clauses}, we have made a comparison between the
2135: performance of this system and ours and concluded that the performance
differences were caused both by the choice of learning system and
by a difference in the features chosen for representing the task.
2138:
2139: The noun phrase parsing task has not received much attention in the
research community and there are only a few results to compare with.
2141: \cite{osborne99} used a grammar-extension method based on Minimal
2142: Description Length and applied it to a Definite Clause Grammar.
2143: His system used different training and test segments of the Penn
2144: Treebank than we did.
2145: At best, it obtained an F$_{\beta=1}$ rate of 60.0 on the test data
2146: (precision 53.2\% and recall 68.7\%).
2147: \cite{krymolowski2000} applied a memory-based learning technique
2148: specialized for learning sequences to a noun phrase parsing task.
2149: Their system obtained F$_{\beta=1}$=83.7 (precision 88.5\% and recall
2150: 79.3\%) on yet another segment of the Treebank.
2151: This performance is very close to that of our approach
2152: (F$_{\beta=1}$=83.79).
2153: The memory-based sequence learner used much more training data than
2154: ours (about four times as much) but unlike our method, it generated
2155: its output without using lexical information, which is impressive.
2156: The performance of the Collins parser on the subtask of noun phrase
2157: parsing which we mentioned in Section \ref{sec-npp}
2158: (F$_{\beta=1}$=89.8) shows that there is room for improvement left for
2159: all systems that were discussed here.\footnote{Our full parser, which
2160: was trained and tested on the same data as the Collins parser,
2161: obtained F$_{\beta=1}$=86.96 for recognizing NP phrases only.}
2162:
2163: \begin{table}[t]
2164: \begin{center}
2165: \begin{tabular}{|l|c|c|c|}\cline{2-4}
2166: \multicolumn{1}{l|}{section 23} &
2167: precision & recall & F$_{\beta=1}$ \\\hline
2168: \cite{collins2000} & 89.9\% & 89.6\% & 89.7 \\
2169: \cite{bod2001} & 89.7\% & 89.7\% & 89.7 \\
2170: \cite{charniak2000} & 89.5\% & 89.6\% & 89.5 \\
2171: \cite{collins99} & 88.3\% & 88.1\% & 88.2 \\
2172: \cite{ratnaparkhi98} & 87.5\% & 86.3\% & 86.9 \\
2173: \cite{charniak97} & 86.6\% & 86.7\% & 86.6 \\
2174: \cite{magerman95} & 84.3\% & 84.0\% & 84.1 \\
2175: \cite{tks2001clin} & 82.3\% & 78.7\% & 80.5 \\\hline
2176: \end{tabular}
2177: \end{center}
2178: \caption{A selection of results that have been published for
2179: parsing sentences shorter than 100 words of the Penn Treebank.
2180: The performance of our parser \citep{tks2001clin} is not quite
2181: state-of-the-art.
2182: The best performance for this task has been obtained by statistical
2183: parsers and data-oriented parsers
2184: \citep{collins2000,charniak2000,bod2000}.
2185: }
2186: \label{tab-respa}
2187: \end{table}
2188:
2189: A selection of results for parsing the Penn Treebank can be found in
2190: Table \ref{tab-respa}.
2191: The F$_{\beta=1}$ error rate of the best systems is about half of that
2192: of ours.
2193: A more detailed comparison of the output data of our memory-based
2194: parser and one of the versions of the Collins parser
\citep[][model 2]{collins99} has shown that the large performance difference
2196: is caused by the way nonbase phrases are processed
2197: \citep{tks2001clin}.
2198: Our chunker performs reasonably well compared with the first stage of
2199: the Collins parser (F$_{\beta=1}$ = 49.30 compared with 49.85).
2200: Especially at the first few levels after the base levels, our parser
loses F$_{\beta=1}$ points compared with the Collins parser.
2202: The initial difference of 0.65 at the base level grows to 2.92 after
2203: three more levels, 5.16 after six and 6.13 after nine levels with a
2204: final difference of 6.59 after 20 levels \citep{tks2001clin}.
2205: At the end of Section \ref{sec-par}, we have put forward some
2206: suggestions for improving our parser.
2207: However, we have also noted that further improvement might not
2208: be worthwhile because it will make our parser even slower than it
2209: already is.
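
The factor of about one half mentioned at the start of this discussion
can be checked against the rounded values in Table \ref{tab-respa},
assuming that the F$_{\beta=1}$ error rate stands for 100 minus the
F$_{\beta=1}$ rate:
\begin{displaymath}
\frac{100 - 89.7}{100 - 80.5} = \frac{10.3}{19.5} \approx 0.53 .
\end{displaymath}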
2210:
2211: \section{Concluding Remarks}
2212:
2213: We have presented memory-based approaches to shallow parsing and we
2214: have applied these to five tasks: noun phrase chunking, arbitrary
2215: chunking, clause identification, noun phrase parsing and full
2216: parsing.
2217: We have used two additional techniques for improving the performance
2218: of our shallow parsers: feature selection and system combination.
2219: The first was used to compensate for a problem of the memory-based
2220: learner: it has difficulty with ignoring features that are not
2221: immediately relevant.
2222: While feature selection worked well in one study (clause
2223: identification with large feature sets), it did not make much
2224: difference to the overall performance of our noun phrase chunker.
2225: We believe that other techniques that were incorporated in the chunker
2226: (cascading and system combination) have already stretched the
2227: performance of the system to its limits.
2228: Therefore there might not have been much left to gain by using feature
2229: selection.
2230: System combination has proved to be quite useful for generating base
2231: phrases.
2232: Unfortunately, we could not apply it for higher level chunks because
2233: our method for producing different system results, using different
2234: data representations, failed to produce results for higher level phrases
2235: that could be improved with the Majority Voting technique we used for
2236: chunking.
2237:
2238: A comparison of our work with other studies revealed that our
2239: approach works well for base phrase identification, but not for
2240: finding embedded structures.
2241: We have made a couple of suggestions for improving the performance on
2242: tasks that require generating embedded structures: provide different
2243: features to the learners, try to find a method which allows
2244: combination of different systems when working on higher level phrases
2245: and replace the greedy phrase selection approach currently used by one
2246: that allows backtracking from earlier choices.
2247: However, while further improvement is interesting from a scientific
2248: point of view, it might not be useful from a practical point of view.
2249: Our present method is already slower than state-of-the-art full
2250: parsers and it requires more memory.
2251: Extra improvements to this approach will probably slow it down even
2252: more without guaranteeing state-of-the-art performance.
2253:
2254: \acks{
2255: \hspace*{-0.3cm}
2256: We would like to thank
2257: our colleagues of CNTS - Language Technology Group, University of
2258: Antwerp, Belgium and
2259: ILK, University of Tilburg, The Netherlands,
2260: the members of the TMR-LCG network, in particular James Hammerton, and
2261: two anonymous reviewers for
2262: valuable discussions and comments.
2263: We are grateful to Xavier Carreras for his cooperation in the
2264: comparison study of his clause identification system with ours.
2265: This study was funded by the European Training and Mobility of
2266: Researchers (TMR) network Learning Computational
2267: Grammars.\footnote{http://lcg-www.uia.ac.be/}
2268: }
2269:
2270: \vskip 0.2in
2271: \bibliography{ref}
2272:
2273: \end{document}
2274: