
1: \documentclass[twoside,11pt]{article}

2:

3: % Any additional packages needed should be included after jmlr2e.

4: % Note that jmlr2e.sty includes epsfig, amssymb, natbib and graphicx,

5: % and defines many common macros, such as 'proof' and 'example'.

6: %

7: % It also sets the bibliographystyle to plainnat; for more information on

8: % natbib citation styles, see the natbib documentation, a copy of which

9: % is archived at http://www.jmlr.org/format/natbib.pdf

10:

11: \usepackage{jmlr2e}

12:

13: % Definitions of handy macros can go here

14:

15: % Heading arguments are {volume}{year}{pages}{submitted}{published}{authors}

16:

17: \jmlrheading{2}{2002}{559-594}{9/01}{3/02}{Erik F. Tjong Kim Sang}

18: \ShortHeadings{Memory-Based Shallow Parsing}{Tjong Kim Sang}

19: \firstpageno{559}

20:

21: \begin{document}

22:

23: \title{Memory-Based Shallow Parsing}

24:

25: \author{\name Erik F. Tjong Kim Sang \email erikt@uia.ua.ac.be \\

26:        \addr CNTS - Language Technology Group \\

27:              University of Antwerp\\

28:              Universiteitsplein 1\\

29:              B-2610 Wilrijk, Belgium}

30:

31: \editor{James Hammerton, Miles Osborne, Susan Armstrong and

32:         Walter Daelemans}

33:

34: \maketitle

35:

36: \begin{abstract}%   <- trailing '%' for backward compatibility of .sty file

37: We present memory-based learning approaches to shallow parsing and

38: apply these to five tasks: base noun phrase identification, arbitrary

39: base phrase recognition, clause detection, noun phrase parsing and

40: full parsing.

41: We use feature selection techniques and system combination methods

42: for improving the performance of the memory-based learner.

43: Our approach is evaluated on standard data sets and the results are

44: compared with those of other systems.

45: This reveals that our approach works well for base phrase

46: identification while its application towards recognizing embedded

47: structures leaves some room for improvement.

48: \end{abstract}

49:

50: \begin{keywords}

51:   shallow parsing,

52:   memory-based learning,

53:   feature selection,

54:   system combination

55: \end{keywords}

56:

57: \section{Introduction}

58:

59: Memory-based learners classify data based on their similarity to

60: data that they have seen earlier.

61: They have been used for a variety of natural language

62: processing tasks with good results, for example for

63: grapheme-to-phoneme conversion \citep{hoste1999clin},

64: stress assignment \citep{daelemans1994cl} and

65: word class tagging \citep{hvh2001}.

66: These natural language processing tasks are classification tasks: they

67: require an assignment of a class to each character or to each word.

68: Shallow parsing is more complicated than that: it requires sequences

69: of words to be grouped together and be classified.

70:

71: We believe that all natural language tasks can be performed successfully

72: by memory-based learners.

73: Identifying and classifying sequences of words can be converted to a

74: classification task by using special tag sets, for example the IOB

75: tag set proposed by \cite{ramshaw95}.

76: Parsing requires different processing levels and these can be

77: simulated by cascading several memory-based learners which have

78: been trained on different subtasks \citep{daelemans95}.

79: The idea of using memory-based methods for processing natural language

80: has recently led to the emergence of a new paradigm: Memory-Based

81: Language Processing (MBLP) to which a special issue of the Journal of

82: Experimental \& Theoretical Artificial Intelligence was devoted

83: \citep{daelemans99b}.

84:

85: The goal of this paper is to test the theoretical ideas about

86: memory-based learning applied to natural language tasks, in

87: particular its application to shallow parsing.

88: We will implement the ideas of \cite{daelemans95}, show what problems

89: need to be solved, test memory-based shallow parsers and compare their

90: performance with that of other systems.

91: The tasks which we will examine are identification of base noun

92: phrases, recognition of phrases of arbitrary types, finding clauses,

93: discovering embedded noun phrases and full parsing.

94: Memory-based learning performs well on natural language tasks that

95: require output that has relatively little structure.

96: In this paper we will investigate whether we can obtain equally good

97: results when it is applied to tasks requiring more complex outputs.

98:

99: \section{Approach}

100:

101: In our approach we will use three techniques.

102: We will use memory-based learning as base classification method for

103: assigning linguistic classes to data.

104: We will attempt to solve a weakness of this approach, its inability to

105: disregard irrelevant features, by using an additional feature selection method.

106: Finally, we will examine the combination of several learners in order

107: to obtain an extra performance boost.

108: This section also contains information about evaluation and system

109: configuration for performing parameter tuning.

110:

111: \subsection{Memory-Based Learning}

112:

113: The basic idea behind memory-based learning is that concepts can be

114: classified by their similarity with previously seen concepts.

115: In a memory-based system, learning amounts to storing the training

116: data items.

117: The strength of such a system lies in its capability to compute the

118: similarity between a new data item and the training data items.

119: The simplest similarity metric is the overlap metric

120: \citep{timbl2000}.

121: It compares corresponding features of the data items and adds 1 to a

122: similarity rate when they are different.

123: The similarity between two data items is represented by a number

124: between zero and the number of features, $n$, in which value zero

125: corresponds with an exact match and $n$ corresponds with two items

126: which share no feature value.

127: Here is an example:

128:

129: \begin{center}

130: \begin{tabular}{ccccc}

131: TRAIN1 & man & saw & the & V\\

132: TRAIN2 & the & saw & .   & N\\

133:  TEST1 & boy & saw & the & ?\\

134: \end{tabular}

135: \end{center}

136:

137: \noindent

138: It contains two training items of a part-of-speech (POS) tagger and

139: one test item for which we want to obtain a POS tag.

140: Each item contains three features: the word that needs to be tagged

141: ({\it saw}) and the preceding and the next word.

142: In order to find the best POS tag for the test item, we compare its

143: features with the features of the training data items.

144: The test item shares two features with the first training data item

145: and one with the second.

146: The similarity value for the first training data item (1) is smaller

147: than that of the second (2) and therefore the overlap metric will

148: prefer the first.

149:
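As a sketch, the overlap metric can be written as a sum over the $n$ features of two items $X$ and $Y$.
The symbols $\Delta$ and $\delta$ are our own notation for this illustration:

\[
\Delta(X,Y) = \sum_{i=1}^{n} \delta(x_i,y_i), \qquad
\delta(x_i,y_i) = \left\{ \begin{array}{ll}
0 & \mbox{if } x_i = y_i\\
1 & \mbox{otherwise}
\end{array} \right.
\]

\noindent
For the example this gives
$\Delta({\rm TEST1},{\rm TRAIN1}) = 1+0+0 = 1$ and
$\Delta({\rm TEST1},{\rm TRAIN2}) = 1+0+1 = 2$.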

150: A weakness of the overlap metric is that it regards all features as

151: equally valuable for computing similarity values.

152: Generally some features are more important than others.

153: For example, when we add a line ``TRAIN3 boy and the C'' to our

154: training data, the overlap metric will regard this new item as equally

155: important as the first training item.

156: Both the first and the third training item share two feature values

157: with the test item but we would like the third to receive a lower

158: similarity value because it does not contain the word for which we

159: want to find a POS tag ({\it saw}).

160: In order to accomplish this, we assign weights to the features

161: in such a way that the second feature receives a higher weight than

162: the other two.

163:
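A weighted version of the overlap metric can be sketched as follows, where $w_i$ is the weight of feature $i$ and $\delta(x_i,y_i)$ is 0 when the two feature values are equal and 1 otherwise (the notation is ours):

\[
\Delta(X,Y) = \sum_{i=1}^{n} w_i \, \delta(x_i,y_i)
\]

\noindent
With hypothetical weights $w_1=w_3=1$ and $w_2=2$, chosen only for illustration, the first training item differs from the test item in the first feature and obtains $\Delta = w_1 = 1$, while the third training item differs in the second feature and obtains $\Delta = w_2 = 2$.
The first training item is now preferred over the third.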

164: The method which we use to assign weights to the features is called

165: Gain Ratio, a normalized variant of information gain

166: \citep{timbl2000}.

167: It estimates feature weights by examining the training data and

168: determines for each feature how much information it contributes

169: to the knowledge of the classes of the training data items.

170: The weights are normalized in order to account for features with

171: many different values.

172: The Gain Ratio computation of the weights is summarized in the

173: following formulas:

174:

175: \begin{equation}

176: w_i = \frac{H(C)-\sum_{v\in V_i}P(v)\times H(C\mid v)}{H(V_i)}

177: \end{equation}

178:

179: \begin{equation}

180: H(X) = - \sum_{x\in X} P(x)\log_2 P(x)

181: \end{equation}

182:

183: \noindent

184: Here $w_i$ is the weight of feature $i$, $C$ the set of class

185: values and $V_i$ the set of values that feature $i$ can take.

186: $H(C)$ and $H(V_i)$ are the entropy of the sets $C$ and $V_i$

187: respectively and $H(C\mid v)$ is the entropy of the subset of elements

188: of $C$ that are associated with value $v$ of feature $i$.

189: $P(v)$ is the probability that feature $i$ has value $v$.

190: The normalization factor $H(V_i)$ was introduced to prevent

191: features with low generalization capacities, like identification

192: codes, from obtaining large weights.

193:
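The computation can be illustrated with a small hypothetical training set, which is not taken from our data.
Suppose there are four items with classes X, X, X and Y, and a binary feature $f$ that has value $a$ for two of the X items and value $b$ for the remaining X item and the Y item.
Then:

\[
H(C) = -\left(\frac{3}{4}\log_2\frac{3}{4} + \frac{1}{4}\log_2\frac{1}{4}\right) \approx 0.811,
\qquad H(C\mid a) = 0, \qquad H(C\mid b) = 1, \qquad H(V_f) = 1
\]

\[
w_f \approx \frac{0.811 - \left(\frac{1}{2}\times 0 + \frac{1}{2}\times 1\right)}{1} \approx 0.311
\]

\noindent
An identification code feature $d$ with a different value for each of the four items has $H(C\mid v)=0$ for every value $v$, so its unnormalized information gain is the full 0.811.
Dividing by $H(V_d) = \log_2 4 = 2$ halves this to a weight of about 0.406.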

194: The memory-based learning software which we have used in our

195: experiments, TiMBL \citep{timbl2000}, contains several algorithms

196: with different parameters.

197: In this paper we have restricted ourselves to using a single algorithm

198: (k nearest neighbor classification) with a constant parameter setting.

199: It would be interesting to evaluate every algorithm with all of its

200: parameters but this would require a lot of extra work.

201: We have changed only one parameter of the nearest neighbor algorithm

202: from its default value: the size of the nearest neighborhood region.

203: The learning algorithm computes the distance between the test item and

204: the training items.

205: The test item will receive the most frequent classification of the

206: nearest training items (nearest neighborhood size is 1).

207: \cite{daelemans99} show that using a larger neighborhood is harmful

208: for classification accuracy for three language tasks but not for

209: noun phrase chunking, a task which is central to this paper.

210: In our experiments we have found that using the three nearest

211: sets of data items leads to a better performance than using only

212: the nearest data items.

213: This increase of the neighborhood size used leads to a form of

214: smoothing which can get rid of the influence of some data

215: inconsistencies and exceptions.

216:

217: % general

218: % example

219: % overlap

220: % ig+gr

221: % optimizations

222: % k

223:

224: \subsection{Feature Selection}

225: \label{sec-feat}

226:

227: A disadvantage of the Gain Ratio metric used in memory-based learning

228: is that it computes a weight for a feature without examining other

229: available features.

230: If features are dependent, this will generally not be reflected in

231: their weights.

232: A feature that contains some information about the classification

233: class on its own, but none when another more informative feature is

234: present will receive a non-zero weight.

235: Features which contain little information about the classification

236: class will receive a small weight but a large number of them might

237: still overrule more important features.

238: These two problems will have a negative influence on the

239: classification accuracy, in particular when there are many features

240: available.

241:

242: We have tested the capacity of Gain Ratio to deal with irrelevant

243: features by using it for a simple binary classification problem with

244: extra random features.

245: The problem which we chose is the XOR problem.

246: It contains two binary (0/1) features and a pair of these feature

247: values should be classified as 0 when the values are equal and as

248: 1 when the features are different.

249: We have created training and test data which contained 100

250: examples of the four possible patterns (0/0/0, 0/1/1, 1/0/1 and

251: 1/1/0).

252: A memory-based learner which uses Gain Ratio was able to correctly

253: classify all 400 patterns in the training data.

254: After this we added ten random binary features to both the training

255: data and the test data and observed the performance.

256: The average results of 1000 runs can be found in Figure

257: \ref{fig-feats}.

258:

259: %  0 400.00

260: %  1 400.00

261: %  2 393.72

262: %  3 383.31

263: %  4 372.18

264: %  5 352.42

265: %  6 317.03

266: %  7 275.23

267: %  8 242.30

268: %  9 221.69

269: % 10 210.75

270:

271: \begin{figure}[t]

272: \begin{center}

273: \epsffile{jmlr.feats.eps}

274: \end{center}

275: \caption{Average number of correct patterns over 1000 runs of a

276: memory-based learner using the Gain Ratio metric for test data

277: containing 400 XOR patterns after adding 0 to 10 random binary

278: features.

279: The system performs perfectly with one random feature but when two or

280: more random features are added, the performance drops to about half

281: for 10 extra features.

282: }

283: \label{fig-feats}

284: \end{figure}

285:

286: Without extra features the memory-based learner performs perfectly.

287: Adding a random feature does not harm its performance but after adding

288: two the system only gets 394 of the 400 patterns correct on average.

289: The performance drops for every extra added feature to about 211

290: for 10 extra features which is not much better than randomly guessing

291: the classes.

292: This small experiment shows that Gain Ratio has difficulty with

293: feature sets that contain many irrelevant features.

294: We need an extra method for determining which features are not

295: necessary for obtaining a good performance.

296:

297: \cite{aha94} give a good introduction to methods for selecting

298: relevant features for machine learning tasks.

299: The methods can be divided into two groups: filters and wrappers.

300: A filter uses an evaluation function for determining which features

301: could be more relevant for a classifier than others.

302: A wrapper finds out if one feature is more important than another by

303: applying the classifier to data with either one of the features and

304: comparing the results.

305: This requires more time than the filter approach but it generates

306: better feature sets because it cannot suffer from a bias difference

307: which may exist between the evaluation function and the classifier

308: \citep{john94}.

309:

310: Both the filter and the wrapper method start with a set of features

311: and attempt to find a better set by adding or removing features and

312: evaluating the resulting sets.

313: There are two basic methods for moving through the feature space.

314: Forward sequential selection starts with an empty feature set and

315: evaluates all sets containing one feature.

316: After this it selects the one with the best performance and evaluates

317: all sets with two features of which one is the best single feature.

318: Backward sequential selection starts with all features and evaluates all

319: sets with one feature less.

320: It selects the one with the best performance and then examines

321: all feature sets which can be derived from this one by removing one

322: feature.

323: Both methods continue adding or removing a feature until they

324: cannot improve the performance.

325:

326: Forward and backward sequential selection are variants of

327: hill-climbing, a well-known search technique in artificial

328: intelligence.

329: As with hill-climbing, a disadvantage of these methods is that they

330: can get stuck in local optima, in this case a non-optimal feature set

331: which cannot be improved with the method used.

332: In order to minimize the influence of local optima, we use a combination

333: of the two methods when examining feature sets: bidirectional

334: hill-climbing \citep{caruana94}.

335: The idea here is to apply both adding a feature and removing a feature

336: at each point in the feature space.

337: This enables the feature selection method to backtrack from nonoptimal

338: choices.

339: In order to keep processing times down we will start with an empty

340: feature list just like in forward sequential selection.

341:
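One step of this search can be sketched as follows.
The symbols are our own notation: $F$ is the set of all available features, $S$ the current feature set and $e(S)$ the evaluation score of the learner when it uses only the features in $S$.

\[
N(S) = \{ S\cup\{f\} : f \in F\setminus S \} \;\cup\; \{ S\setminus\{f\} : f\in S \}
\]

\[
S' = \arg\max_{T\in N(S)} e(T)
\]

\noindent
Under this reading, the search starts with $S=\emptyset$, moves from $S$ to $S'$ as long as $e(S') > e(S)$ and stops otherwise.
Forward sequential selection corresponds to using only the first part of $N(S)$ and backward sequential selection to using only the second part.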

342: % devijver and kittler

343: % john, kohavi and pfleger

344:

345: \subsection{System Combination}

346: \label{sec-combi}

347:

348: When different machine learning systems are applied to the same task,

349: they will make different errors.

350: The combined results of these systems can be used for generating an

351: analysis for the task that is usually better than that of any of the

352: participating systems, for example by choosing pattern analyses

353: selected by the majority of the systems.

354: This approach will eliminate errors that are made by a minority of the

355: systems.

356: Here is a made-up example:

357: suppose we have five systems, c$_1$ - c$_5$, which assign binary

358: classes to patterns.

359: Their output for eight patterns, p$_1$ - p$_8$, is as follows:

360:

361: \begin{center}

362: \begin{tabular}{r|ccccc|l}

363:       & c$_1$ & c$_2$ & c$_3$ & c$_4$ & c$_5$ & correct \\\hline

364: p$_1$ & 0     & 0     & 0     & 0     & 0     & 0 \\

365: p$_2$ & 1     & 1     & 1     & 1     & 1     & 1 \\

366: p$_3$ & 0     & 0     & 0     & 0     & 0     & 0 \\

367: p$_4$ & 1     & 0     & 1     & 1     & 1     & 1 \\

368: p$_5$ & 0     & 0     & 1     & 0     & 0     & 0 \\

369: p$_6$ & 1     & 1     & 1     & 1     & 0     & 1 \\

370: p$_7$ & 1     & 0     & 0     & 0     & 0     & 0 \\

371: p$_8$ & 1     & 1     & 1     & 0     & 1     & 1 \\

372: \end{tabular}

373: \end{center}

374:

375: \noindent

376: Each of the five systems makes exactly one error.

377: We can use a combination of the five by choosing the class

378: that has been predicted most frequently for each pattern.

379: For the first three patterns this will not make a difference because

380: all systems predict the same class.

381: For pattern 4 we will choose class 1, thereby eliminating an error of

382: classifier 2.

383: Pattern 5 will be associated with class 0, thus eliminating classifier

384: 3's only error.

385: Patterns 6, 7 and 8 will receive classes 1, 0 and 1 respectively,

386: thereby eliminating errors of classifiers 5, 1 and 4.

387: Thus the majority choice will generate a perfect analysis of the data.

388:

389: In this paper we will evaluate different techniques for combining

390: system output, most of which have been put forward by

391: \cite{hvh2001}.

392: We use four voting methods and three stacked classifiers.

393: Voting methods assign weights to the output of the individual systems

394: and for each pattern choose the class with the largest accumulated

395: score.

396: The simplest voting method is the one we have used in the preceding

397: example: Majority Voting.

398: It gives all systems the same weight.

399: A more elaborate method is accuracy voting (TotPrecision).

400: It assigns a weight to each system which is equal to the accuracy of

401: the system on some evaluation data.

402:
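The voting scheme can be sketched in one formula.
The notation is ours: $c_s(p)$ is the class that system $s$ predicts for pattern $p$, $w_s$ is the weight of system $s$, and $[\cdot]$ is 1 when the enclosed condition holds and 0 otherwise.

\[
\hat{c}(p) = \arg\max_{c} \sum_{s} w_s \, [c_s(p) = c]
\]

\noindent
Majority Voting uses $w_s=1$ for all systems.
For pattern p$_4$ of the made-up example earlier in this section, class 1 accumulates a score of 4 and class 0 a score of 1, so class 1 is chosen.
With accuracy voting, $w_s$ is replaced by the accuracy of system $s$ on the evaluation data.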

403: Some classes might be easier to predict than other classes and for

404: this reason we have also tested two voting methods which use weights

405: based on accuracies for particular class tags.

406: The first is TagPrecision.

407: For each output value $v$ of system $s$, it uses a weight which is

408: equal to the precision of that system $s$ obtained for this value $v$.

409: The second method is Precision-Recall.

410: It starts from the same weights as TagPrecision but adds to these the

411: probability that systems producing different output values would have

412: missed $v$.

413: For example, suppose that there are two systems $s_1$ and $s_2$, and

414: that for some data item $s_1$ predicts value $v_1$ while $s_2$

415: predicts something else.

416: In that case, the probability that $s_1$ is right is

417: $precision(s_1,v_1)$ while the probability that $s_2$ would have

418: missed $v_1$ is $1-recall(s_2,v_1)$.

419: Precision-Recall will assign the weight

420: $precision(s_1,v_1)+(1-recall(s_2,v_1))$ to the event of $s_1$

421: predicting $v_1$.

422:
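A small numeric illustration, with values that are hypothetical and chosen only for this example: if $precision(s_1,v_1) = 0.90$ and $recall(s_2,v_1) = 0.80$, then

\[
precision(s_1,v_1) + (1-recall(s_2,v_1)) = 0.90 + (1 - 0.80) = 1.10
\]

\noindent
so Precision-Recall uses the weight 1.10 for the prediction of $v_1$ by $s_1$, where TagPrecision would use 0.90.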

423: A stacked classifier is a classifier which processes the results of

424: other classifiers.

425: We have used three variants of stacked classifiers.

426: The first is called TagPair.

427: It examines pairs of values produced by two systems and estimates the

428: probability that a certain output value is associated with the pair.

429: In the case of the two systems $s_1$ and $s_2$ producing two distinct

430: values $v_1$ and $v_2$, TagPair will examine evaluation data and find

431: that the value pair is associated with, for example, $v_1$ in 20\% of

432: the cases, $v_2$ in 70\% and $v_3$ in 10\%.

433: These numbers will be used as weights for the three output values and

434: the one that has accumulated the largest value after examining all

435: value pairs in the pattern, will be selected.

436: Unlike the voting methods, TagPair has the opportunity to choose the

437: correct output tag even if all systems have made an incorrect

438: prediction (for example, $v_3$ in the case above).

439:
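The accumulation of TagPair weights can be sketched as follows.
The notation is ours, and we assume that the probabilities are estimated from the evaluation data: $v_i$ is the output of system $s_i$ for the current pattern and $P(v\mid v_i,v_j)$ is the estimated probability that $v$ is the correct value when $s_i$ and $s_j$ produce $v_i$ and $v_j$.

\[
score(v) = \sum_{i<j} P(v\mid v_i,v_j), \qquad \hat{v} = \arg\max_{v} score(v)
\]

\noindent
With only the two systems of the example there is a single pair, so the scores are 0.2 for $v_1$, 0.7 for $v_2$ and 0.1 for $v_3$, and $v_2$ is selected.
With more systems, a value such as $v_3$ can be selected when the other pairs assign it enough weight.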

440: The other stacked classifier which we have evaluated is the

441: memory-based learner itself.

442: We have tested it in two modes: one in which only the output of the

443: systems was included and one in which we included information about

444: the test item.

445: This extra information was the word that needed to be classified, its

446: part-of-speech (POS) tag and the context (words/POS tags) in which it

447: appeared.

448: The memory-based learner used the same settings as described earlier

449: in this section: it used the Gain Ratio metric and examined a nearest

450: neighborhood of size three.

451:

452: The weight assignment methods used by the voting methods and the

453: stacked classifiers suffer from the same problem as Gain Ratio:

454: they might fail to disregard irrelevant features.

455: For this reason we have often tested the combination methods both with

456: all available system results as well as with a subset of these, thus

457: mimicking the feature selection method described earlier.

458: Apart from Majority Voting, all voting methods and stacked classifiers

459: require training data.

460: This means that we need both training data for the individual systems

461: and training data for the combinators.

462: We will describe how we have selected the training data in the next

463: section.

464:

465: \subsection{Parameter Tuning}

466: \label{sec-partun}

467:

468: In this paper, we will compare different learner set-ups and

469: apply the best one to standard data sets.

470: For example, we will examine different data representations and

471: test different system combination techniques.

472: We should be careful not to tune the system to the test data and

473: therefore we will only use the available training data for finding the

474: best configuration for the learner.

475: This can be done by using 10-fold cross-validation \citep{weiss91}.

476: The training data will be divided into ten sections of similar size and

477: each section will be processed by a system which has been trained on

478: the other nine.

479: The overall performance on all ten sections will be regarded as the

480: performance of the system.

481:

482: In our experiments, we will process the data twice.

483: First we will let the learner generate a classification of the data.

484: After this the learner will process the data another time, this time

485: while including the classifications found earlier for the context of a

486: data item.

487: While working with n-fold cross-validation, we should be careful that

488: information from a test part is not accidentally used in its training

489: part.

490: In the first processing phase we will generate classes for the first

491: section while using the other nine sections.

492: Thus information about the classes in, for example, section two is

493: encoded in the classes produced in section one.

494: If in the second phase we use the classifications of the first section

495: while processing section two, we are analyzing a section while having

496: access to (indirect) information about the classes in the data.

497: Information about the classes in section two might leak to this process

498: via the training data, something which is undesired.

499:

500: There are two ways for preventing this form of information leaking.

501: Both concern being more strict when it comes to creating the training

502: data of the second system.

503: In a cascaded 10-fold cross-validation experiment, the second phase

504: training data for section x must be constructed without using this

505: section.

506: This means that instead of running one 10-fold cross-validation

507: experiment with the first system, we need to run ten 9-fold

508: cross-validation experiments in order to obtain correct training data

509: for the ten sections in the second system.

510: Section one will be trained with the 9-fold cross-validation results

511: from sections 2-10, section 2 with 1 and 3-10 and so on.

512: If at any time we need to add a third phase to the cascade of systems,

513: we need to run 8-fold cross-validation experiments with the first

514: system and 9-fold cross-validation experiments with the second.

515: For extra systems the number of extra runs increases and the amount of

516: available training data for the first system decreases.

517:
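The loss of training data can be quantified with a small sketch that generalizes the two cases above.
The notation is ours: $n$ is the number of sections and $m$ the number of phases in the cascade.
Under this reading of the scheme, the system of phase $j$ is trained on a fraction

\[
\frac{n-m+j-1}{n}
\]

\noindent
of the available training data.
For $n=10$ and $m=2$ the first system uses 8/10 of the data and the second 9/10.
For $m=3$ the fractions are 7/10, 8/10 and 9/10.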

518: The second method for preventing training information from a

519: processing phase leaking to the classifications of a next phase

520: is by only using results from previous phases in the test data.

521: In the training data we use the perfect classes rather than

522: the output of the previous phase.

523: This has two disadvantages.

524: First, we cannot use a feature containing the class of the focus

525: word because this feature is the same as the output class.

526: This means that we can only use the classes of neighboring words.

527: Second, the opportunity to correct errors made in the first phase

528: will be restricted because the training data no longer contains

529: information about the errors made by this phase.

530: The advantage of this approach is that we can use all training data in

531: all training phases, so the problem of a diminishing quantity of

532: training data disappears.

533: This approach is especially useful with longer cascades of learners,

534: as for example is required in full parsing.

535:

536: Here is an example to illustrate the two methods: suppose a word in

537: the sixth section in the second phase of a ten-fold cross-validation

538: experiment in chunking is represented by the following eight features:

539:

540: \begin{quote}

541: $w_{i-1}$ $w_i$ $w_{i+1}$ $p_{i-1}$ $p_i$ $p_{i+1}$ $c_{i-1}$ $c_{i+1}$

542: \end{quote}

543:

544: \noindent

545: The goal is to find a chunk tag for word $w_i$.

546: The word features $w_i$, $w_{i-1}$ and $w_{i+1}$ represent the word

547: itself, the preceding word and the next word, respectively.

548: The POS tag features $p_i$, $p_{i-1}$ and $p_{i+1}$ contain the POS

549: tags of the three words.

550: The two chunk features $c_{i-1}$ and $c_{i+1}$ hold the chunk tag of

551: the preceding and the next word.

552: The word and POS tag information have been taken from the training

553: data.

554: In the first method, the two chunk features are computed by a

555: preceding phase.

556: If this item is part of the training data for section x, $c_{i-1}$ and

557: $c_{i+1}$ were generated by a nine-fold cross-validation experiment

558: which uses all sections except section x.

559: This means that the two chunk features have been generated by training

560: with all sections except 6 and x.

561: If the item is part of the test data, then the chunk features are

562: computed by a ten-fold cross-validation experiment (training with

563: sections 1-5 and 7-10).

564: The second method generates chunk features for the test data in the

565: same way but for training data it takes $c_{i-1}$ and $c_{i+1}$ from

566: the training data, thus preventing them from containing implicit

567: information about the test sections.\footnote{In case $c_{i-1}$ is part

568: of a previous section or $c_{i+1}$ is in a next section, they are left

569: empty.}

570:

571: \subsection{Evaluation}

572: \label{sec-stat}

573:

574: We will compare the results of a shallow parser with an available

575: hand-parsed corpus.

576: For this purpose we will use the precision and recall of the phrases

577: in the results.

578: Precision is the percentage of phrases found by the learner that are

579: correct according to the corpus.

580: Recall is the percentage of corpus phrases found by the learner.

581: It is easier to optimize a system configuration based on one

582: evaluation score and therefore we combine precision and recall

583: in the F$_{\beta}$ rate \citep{vanrijsbergen75}:

584:

585: \begin{equation}

586: F_{\beta} = \frac{(\beta^2+1)*precision*recall}{\beta^2*precision+recall}

587: \end{equation}

588:

589: \noindent

590: $\beta$ can be used for giving recall a larger ($\beta>$1) or

591: smaller ($\beta<$1) weight than precision.

592: We do not have a preference for one or the other and therefore we use

593: $\beta$=1.

594: In previous work on shallow parsing, often a word-related accuracy

595: rate is used as evaluation criterion.

596: We do not believe that this is a good method for evaluating results of

597: phrase detection algorithms.

598: Accuracy rates assign positive values to correctly identified

599: non-phrase words and to partially identified phrases.

600: Furthermore they will produce different numbers for the same analysis

601: based on the data representation used.

602: For these reasons, the relation between accuracy rates and F$_{\beta}$

603: rates is poor and preference should be given to using the latter.

604:

605: Accuracy rates have one advantage over F$_{\beta}$ rates: standard

606: statistical tests can be used for determining if the difference

607: between two accuracy rates is significant.

608: Accuracy is a relatively simple function $correct/processed$ where

609: $processed$ is the number of items that have been processed and

610: $correct$ is the number of items that received the correct class.

611: Unfortunately, F$_{\beta=1}$ is more complex: after some arithmetic

612: we get $2*correct/(found+corpus)$ where $found$ is the number of

613: phrases found by the learner, $correct$ the number of phrases found

614: that were correct and $corpus$ the number of phrases in the corpus

615: according to some gold standard.

616: The value of the $corpus$ variable is an upper bound on the variable

617: $correct$.

618: The complexity of the F$_{\beta=1}$ computation makes it hard to

619: apply standard statistical tests to F$_{\beta=1}$ rates.

620:
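The arithmetic can be spelled out as follows, using the variable names of the preceding paragraph:

\[
precision = \frac{correct}{found}, \qquad recall = \frac{correct}{corpus}
\]

\[
F_{\beta=1} = \frac{2\times precision\times recall}{precision+recall}
= \frac{2\times correct^2/(found\times corpus)}{correct\times(found+corpus)/(found\times corpus)}
= \frac{2\times correct}{found+corpus}
\]

\noindent
As a hypothetical example, a learner that finds 100 phrases of which 80 are correct, for a corpus that contains 90 phrases, obtains a precision of 0.80, a recall of about 0.889 and F$_{\beta=1} = 160/190 \approx 0.842$.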

621: \cite{yeh2000} offers a method for computing significance values for

622: F$_{\beta=1}$ rate comparisons: by using computationally-intensive

623: randomization tests.

624: His approach requires test data classifications for all systems that

625: need to be compared.

626: Usually we only have access to the test data classifications of our

627: own system and therefore we have used a variant of these

628: randomization tests: bootstrap resampling

629: \citep{noreen89}.

630: The basic idea of this approach is to regard the test data

631: classifications as a population of cases.

632: A random sample of this population can be created by arbitrarily

633: choosing cases with replacement.

634: We can create many random samples of the same size as the test data

635: and compute an average F$_{\beta=1}$ rate over the samples and a

636: standard deviation for this average.

637: These statistical measures can be used for deciding if the performance

638: of another system is significantly different from our system.

639: Since we do not know if the performance of our system is distributed

640: according to a normal distribution, we will determine

641: significance boundaries in such a way that 5\% of the samples evaluate

642: worse (or better) than the chosen boundary.

643:
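The resampling procedure described above can be sketched in a few lines
of Python.
This is an illustration under stated assumptions rather than the code
used for the experiments: each case is assumed to be a sentence
summarized as a triple (correct, found, corpus), and the function names
{\tt f\_beta1} and {\tt bootstrap}, the default of 1000 samples and the
fixed random seed are all hypothetical choices.

```python
import random
import statistics


def f_beta1(cases):
    """F(beta=1) = 2*correct / (found + corpus), pooled over all cases.

    Each case is an assumed (correct, found, corpus) triple for one sentence.
    """
    correct = sum(c[0] for c in cases)
    found = sum(c[1] for c in cases)
    corpus = sum(c[2] for c in cases)
    return 2.0 * correct / (found + corpus) if found + corpus else 0.0


def bootstrap(cases, n_samples=1000, tail=0.05, seed=0):
    """Resample the cases with replacement and summarize the F rates."""
    rng = random.Random(seed)
    n = len(cases)
    scores = sorted(
        f_beta1([cases[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_samples)
    )
    k = int(tail * n_samples)  # number of samples cut off in each tail
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
        "lower": scores[k],                  # about 5% of samples score worse
        "upper": scores[n_samples - 1 - k],  # about 5% of samples score better
    }
```

The boundaries are read directly from the sorted sample scores, so no
assumption about a normal distribution is needed.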

644: \section{Chunking}

645:

646: In this section we will apply a memory-based learner to chunking,

647: identifying base phrases.

648: The section starts with some background information on this task.

649: After this we will present the results of our experiments with base

650: noun phrase identification and our work targeted at finding base

651: phrases of arbitrary types.

652:

653: \subsection{Task Overview}

654: \label{sec-chunkov}

655:

656: A text chunker divides sentences into phrases which consist of a

657: sequence of consecutive words which are syntactically related.

658: The phrases are nonoverlapping and nonrecursive.

659: In the beginning of the nineties, \cite{abney91} suggested using

660: chunking as a preprocessing step of a parser.

661: Ten years later, most statistical parsers contained a chunking phase

662: (for example \cite{ratnaparkhi98}).

663: In this study, we will divide chunking into two subtasks: finding only

664: noun phrases and identifying arbitrary chunks.

665:

666: Machine learning approaches towards noun phrase chunking started with

667: work by \cite{church88} who used bracket frequencies associated with

668: POS tags for finding noun phrase boundaries in text.

669: In an influential paper about chunking, \cite{ramshaw95} show that

670: chunking can be regarded as a tagging task.

671: Even more importantly, the authors propose a training and test data

672: set that are still being used for comparing different text chunking

673: methods.

674: These data sets were extracted from the Wall Street Journal part of

675: the Penn Treebank II corpus \citep{marcus93}.

676: Sections 15-18 are used as training data and section 20 as test

677: data.\footnote{The noun phrase identification data is available from

678: {\tt ftp://ftp.cis.upenn.edu/pub/chunker/}}

679: In principle, the noun phrase chunks present in the material are noun

680: phrases that do not include other noun phrases, with initial material

681: (determiners, adjectives, etc.) up to the head but without

682: postmodifying phrases (prepositional phrases or clauses)

683: \citep{ramshaw95}.

684:

685: The noun phrase chunking data produced by \cite{ramshaw95} contains a

686: couple of nontrivial features.

687: First, unlike in the Penn Treebank, possessives between two noun

688: phrases have been attached to the second noun phrase rather than the

689: first.

690: An example in which round brackets mark chunk boundaries: {\it ( Nigel

691: Lawson ) ('s restated commitment )}: the possessive {\it 's} has been

692: moved from {\it Nigel Lawson} to {\it restated commitment}.

693: Second, Treebank annotation may result in unexpected noun phrase

694: annotations: {\it British Chancellor of ( the Exchequer ) Nigel

695: Lawson} in which only one noun chunk has been marked.

696: The problem here is that neither {\it British Chancellor} nor {\it

697: Nigel Lawson} has been annotated as a separate noun phrase in the

698: Treebank.

699: Both {\it British ... Exchequer} and {\it British ... Lawson} are

700: annotated as noun phrases in the Treebank but these phrases could not

701: be used as noun chunks because they contain the smaller noun phrase

702: {\it the Exchequer}.

703:

704: \cite{ramshaw95} proposed to encode chunks with tags: I for words that

705: are inside a noun chunk and O for words that are outside a chunk.

706: In case one noun phrase immediately follows another one, they

707: use the tag B for the first word of the second phrase in order to show

708: that a new phrase starts there.

709: With the three tags I, O and B any chunk structure can be encoded.

710: This representation has two advantages.

711: First, it enables trainable POS taggers to be used as chunkers by

712: simply changing their training data.

713: Second, it minimizes consistency errors which appear with the bracket

714: representation where open and close brackets generated by the learner

715: may not be balanced.

716: Here is an example sentence first with noun phrases encoded by pairs of

717: brackets and then with the Ramshaw and Marcus IOB representation:

718:

719: \begin{quote}

720: In ( early trading ) in ( Hong Kong ) ( Monday ) , ( gold ) was quoted \\

721: at ( \$ 366.50 ) ( an ounce ) .

722:

723: In$_O$ early$_I$ trading$_I$ in$_O$ Hong$_I$ Kong$_I$ Monday$_B$ ,$_O$ gold$_I$ was$_O$ quoted$_O$ \\

724: at$_O$ \$$_I$ 366.50$_I$ an$_B$ ounce$_I$ .$_O$

725: \end{quote}

726:

727: \noindent

728: \cite{tksveenstra99eacl} presents three variants on the Ramshaw and

729: Marcus representation and shows that the bracket representation can

730: also be regarded as a tagging representation with two streams of

731: brackets.

732: They named the variants IOB2, IOE1 and IOE2 and used IOB1 as the name for

733: the Ramshaw and Marcus representation.

734: IOB2 is the same as IOB1 except that every chunk-initial word receives tag B.

735: IOE1 differs from IOB1 in the fact that rather than the tag B, a tag E

736: is used for the final word of a noun chunk which is immediately

737: followed by another chunk.

738: IOE2 is a variant of IOE1 in which each final word of a noun phrase is

739: tagged with E.

740: The bracket representations use open brackets for phrase-initial

741: words, close brackets for phrase-final words and a period for all

742: other words.

743: Table \ref{tab-repr} contains the tag sequences of all six

744: representations for the example sentence.

745:

746: \begin{table}[t]

747: \begin{center}

748: \begin{tabular}{|c|ccccccccccccccccc|}\hline

749: IOB1 &  O&I&I&O&I&I&B&O&I&O&O&O&I&I&B&I&O \\

750: IOB2 &  O&B&I&O&B&I&B&O&B&O&O&O&B&I&B&I&O \\

751: IOE1 &  O&I&I&O&I&E&I&O&I&O&O&O&I&E&I&I&O \\

752: IOE2 &  O&I&E&O&I&E&E&O&E&O&O&O&I&E&I&E&O \\

753: O    &  .&$[$&.&.&$[$&.&$[$&.&$[$&.&.&.&$[$&.&$[$&.&. \\

754: C    &  .&.&$]$&.&.&$]$&$]$&.&$]$&.&.&.&.&$]$&.&$]$&. \\\hline

755: \end{tabular}

756: \end{center}

757: \caption{The chunk tag sequences for the example sentence

758: {\it In early trading in Hong Kong Monday , gold was quoted at \$

759: 366.50 an ounce . }

760: for six different tagging formats.

761: The {\tt I} tag has been used for words inside a chunk, {\tt O}

762: for words outside a chunk, {\tt B} and {\tt [} for

763: chunk-initial words and {\tt E}, {\tt ]} for chunk-final words and

764: periods for words that are neither chunk-initial nor chunk-final.

765: }

766: \label{tab-repr}

767: \end{table}

768:
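The six tagging formats can all be derived from the same list of chunk
spans.
The following Python sketch illustrates this; the function name
{\tt encode} and the use of inclusive (start, end) token spans are
assumptions made for the example, and the chunks are assumed to be
nonoverlapping.

```python
def encode(n_tokens, chunks):
    """Return the six tag streams for chunks given as inclusive (start, end) spans."""
    starts = {s for s, _ in chunks}
    ends = {e for _, e in chunks}
    inside = {i for s, e in chunks for i in range(s, e + 1)}
    names = ("IOB1", "IOB2", "IOE1", "IOE2")
    streams = {name: [] for name in names + ("O", "C")}
    for i in range(n_tokens):
        if i not in inside:
            for name in names:
                streams[name].append("O")
        else:
            # IOB1: B only if the previous token ends another chunk
            streams["IOB1"].append("B" if i in starts and (i - 1) in ends else "I")
            # IOB2: B on every chunk-initial token
            streams["IOB2"].append("B" if i in starts else "I")
            # IOE1: E only if the next token starts another chunk
            streams["IOE1"].append("E" if i in ends and (i + 1) in starts else "I")
            # IOE2: E on every chunk-final token
            streams["IOE2"].append("E" if i in ends else "I")
        # The two bracket streams mark chunk starts and chunk ends separately
        streams["O"].append("[" if i in starts else ".")
        streams["C"].append("]" if i in ends else ".")
    return streams


# The example sentence has 17 tokens and six noun chunks
example = encode(17, [(1, 2), (4, 5), (6, 6), (8, 8), (12, 13), (14, 15)])
```

For the example sentence, {\tt example} holds the same six rows as
Table \ref{tab-repr}.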

769: The representation variants are interesting because a learner will

770: make different errors when trained with data encoded in a different

771: representation.

772: This means that we can train one learner with five\footnote{The

773: combination of open and close brackets, O+C, will be regarded as one

774: data representation.}

775: data representations and obtain five different analyses of the data

776: which we can combine with system combination techniques.

777: Thus the different data representations may enable us to improve the

778: performance of the chunker.

779: The data representations can be used both for noun phrase chunking

780: and for arbitrary chunking.

781: In the latter task, more than one chunk type exists so the tags need

782: to be expanded with type-specific suffixes.

783: For example: B-VP, I-VP, E-VP, $[$-VP and $]$-VP.

784:

785: The arbitrary chunking task was more difficult to design because many

786: interesting phrase types often contain parts which belong to other

787: phrases \citep{tksbuchholz2000conll}.

788: For example, verb phrases may contain noun phrases and prepositional

789: phrases often include a noun phrase.

790: Furthermore, noun phrases may contain quantitative or adjective phrases

791: which may prevent them from being identified as noun chunks.

792: The noun, verb and prepositional phrases should be included and

793: therefore the following measures have been taken when constructing the

794: data for the arbitrary chunking task:

795: First, a couple of phrase types, for example quantifier phrases and

796: adjective phrases, have been removed from places where they prevented

797: the identification of noun phrases.

798: This made it possible to annotate more phrases as noun chunks.

799: Second, some phrase types in the annotated data, for example verb

800: phrases and prepositional phrases, lack material that has already been

801: included in a phrase of another type.

802: Third, adjacent verb clusters have been put in one flat verb phrase

803: unlike in the Treebank where often each verb starts a new phrase.

804: And fourth, adverbial phrase boundaries have been removed from

805: adjective phrases and verb phrases to allow all material to be

806: included in the mother phrase.

807:

808: This chunk definition scheme will generate data in which most of the

809: tokens have been assigned to a chunk of some type.

810: The odd tokens that fall out are usually punctuation signs.

811: This chunk scheme has been used for generating training and test data

812: for the CoNLL-2000 shared task \citep{tksbuchholz2000conll}.

813: The data contains the same segments of the Wall Street Journal part of

814: the Penn Treebank as the noun phrase data of \cite{ramshaw95}: sections

815: 15-18 as training data and section 20 as test data.\footnote{The

816: CoNLL-2000 shared task data is available from

817: {\tt http://lcg-www.uia.ac.be/conll2000/chunking/}}

818: We will use these data sets in our arbitrary chunking experiments.

819:

820: The training and the test data contain two types of features: words

821: and POS tags.

822: The words have been taken from the Penn Treebank.

823: The POS tags of the Treebank have been manually checked and therefore

824: they should not be used in the chunking data.

825: In future applications, the chunking process will be applied to a text

826: with POS tags that have been generated automatically.

827: These POS tags will contain errors and therefore the performance of

828: the chunker will be worse than when applied to a Treebank text with

829: manually checked POS tags.

830: If we want to obtain realistic performance rates, we should work with

831: automatically generated POS tags in our shallow parsing experiment.

832: In line with earlier work like that of \cite{ramshaw95}, we have used

833: POS tags that were generated by the Brill tagger \citep{brill94}.

834:

835: \subsection{Noun Phrase Recognition}

836:

837: We will use a memory-based learner to find noun phrase chunks in text.

838: In order to determine the best configuration for the learner, we will

839: test different system configurations on the standard training data

840: sets put forward by \cite{ramshaw95}.

841: We will evaluate different feature sets for representing words.

842: Additionally, we will use the five data representations for generating

843: different system results and use system combination techniques for

844: combining these results.

845:

846: In our experiments we will represent words as sets of words and POS

847: tags.

848: These sets contain the word itself, its part-of-speech (POS) tag and a

849: left and right context of a maximum of four words and POS tags on each

850: side, 18 features in total.

851: We have explained in Section \ref{sec-feat} that memory-based learners

852: equipped with the Gain Ratio metric have difficulty in dealing with

853: irrelevant features.

854: Therefore we will use a feature selection method, bi-directional

855: hill-climbing starting with zero features, for finding the best

856: subset of the 18 features for each different data representation.

857:
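A bi-directional hill-climbing search of this kind can be sketched as
follows.
This is an illustration and not the implementation used in the
experiments: the function {\tt hill\_climb} is hypothetical, the
{\tt evaluate} argument stands in for a cross-validated F$_{\beta=1}$
rate of the memory-based learner, and the beam of five candidate sets
is an assumption based on our remark that the five best combinations
are kept during the search.

```python
def hill_climb(features, evaluate, beam_width=5):
    """Search feature subsets by adding or removing one feature at a time.

    Starts from the empty set and stops when no new subset improves on
    the best score found so far.
    """
    start = frozenset()
    scores = {start: evaluate(start)}
    beam = [start]
    best = start
    while True:
        # Neighbours: toggle each feature (add it if absent, remove it if present)
        candidates = {subset ^ {f} for subset in beam for f in features}
        new = [c for c in candidates if c not in scores]
        if not new:
            break
        for c in new:
            scores[c] = evaluate(c)
        beam = sorted(new, key=scores.get, reverse=True)[:beam_width]
        if scores[beam[0]] <= scores[best]:
            break
        best = beam[0]
    return best, scores[best]
```

Because each step may also remove a feature, a feature that was useful
early in the search can be dropped again once better combinations have
been found.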

858: The memory-based learner will make two passes over the data.

859: First, it will attempt to predict the noun phrases in the data as well

860: as possible.

861: After this it will use the output of this first pass as information

862: about the noun phrases in the immediate context of the current word.

863: This means that the second pass has access to the 18 features of the

864: first pass plus the chunk tags of the two words immediately in front

865: of the current word and the chunk tags of the two words immediately

866: following the current word.

867: This cascaded approach was chosen because it was useful for improving

868: overall performance in our earlier work \citep{tksveenstra99eacl}.

869: We omitted the chunk tag for the current word because including it

870: gave a negative bias to the chunker performance.

871: Gain Ratio would correctly identify it as a feature which contained a

872: lot of information about the output class and the weight it assigned

873: to it would make it hard for the other features to influence the

874: output class at all \citep{tksveenstra99eacl}.\footnote{

875: The problem of using the predicted class of the current word was a

876: result of an earlier study in which we did not use feature selection.

877: The selection method used in this study would probably have disregarded

878: this feature automatically.

879: It would start out as the most informative feature but with the

880: feature on its own we would get a worse performance than with

881: combinations of other features (we perform feature selection while

882: keeping the five best combinations).

883: }

884:
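The feature vectors of the second pass can be built from those of the
first pass as in the following sketch.
The function name {\tt second\_pass\_features} and the padding symbol
for positions outside the sentence are assumptions made for this
example.

```python
def second_pass_features(first_pass_features, chunk_tags, pad="_"):
    """Append the chunk tags at offsets -2, -1, +1 and +2 to each token's features.

    The chunk tag of the current word (offset 0) is left out on purpose.
    """
    n = len(chunk_tags)
    rows = []
    for i, feats in enumerate(first_pass_features):
        context = [
            chunk_tags[j] if 0 <= j < n else pad
            for j in (i - 2, i - 1, i + 1, i + 2)
        ]
        rows.append(list(feats) + context)
    return rows
```

For training data {\tt chunk\_tags} would hold the corpus tags, for test
data the tags estimated in the first pass.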

885: We performed a cascaded feature search while using five different data

886: representations on the training data of \cite{ramshaw95} in a 10-fold

887: cross-validation approach.

888: We prevented information leaking in the second phase, in accordance with Section

889: \ref{sec-partun}, by using the estimated chunk tags for test data and

890: using the corpus tags in the training data.

891: In this way we made sure that when the test data consisted of section

892: x, no information about section x was available in the training data.

893: The results of the 10-fold cross-validation experiments can be found

894: in Table \ref{tab-np10a}.

895: In the best feature sets of the first pass most of the nine POS tag

896: features are used (almost eight on average) but interestingly enough

897: only a few of the word features (just over four on average).

898: The best sets for the second pass use fewer POS tag features (under

899: seven), fewer word features (just over two) and most of the chunk features

900: (about three).

901: The table shows that a wide context is more important for the POS

902: features than for the chunk features and less important for the word

903: features.

904:

905: \begin{table}[t]

906: \begin{center}

907: \begin{tabular}{|l|c|ll|c|lll|}\cline{2-8}

908: \multicolumn{1}{l|}{train} & \multicolumn{3}{c|}{Pass 1}

909:                            & \multicolumn{4}{c|}{Pass 2} \\\hline

910: Repr.& F$_{\beta=1}$ & \multicolumn{2}{c|}{features} &

911:        F$_{\beta=1}$ & \multicolumn{3}{c|}{features} \\\hline

912: IOB1 & 91.88 & word$_{-4..0}$ & POS$_{-2..3}$

913:      & 92.54 & word$_{-2..0}$ & POS$_{-4..3}$ & chunk$_{-2,-1,1,2}$\\

914: IOB2 & 91.78 & word$_{-1..0}$ & POS$_{-4..3}$

915:      & 92.29 & word$_{-1..0}$ & POS$_{-4..2}$ & chunk$_{-1,1,2}$\\

916: IOE1 & 91.64 & word$_{0..1}$  & POS$_{-3..3}$

917:      & 92.28 & word$_{0..1}$  & POS$_{-3..3}$ & chunk$_{-1,1,2}$\\

918: IOE2 & 92.19 & word$_{-3..4}$ & POS$_{-4..4}$

919:      & 92.59 & word$_{0..1}$  & POS$_{-1..3}$ & chunk$_{-2,-1,1,2}$\\

920: %O+C  & 92.78 &                &

921: %     &       &                & \\\hline

922: O    & 96.04 & word$_{-2..0}$ & POS$_{-4..3}$

923:      & 96.11 & word$_{-1,0}$  & POS$_{-4..1}$ & chunk$_{-1,2}$\\

924: C    & 96.43 & word$_{0..4}$  & POS$_{-4..4}$

925:      & 96.45 & word$_{0..2}$  & POS$_{-4..2}$ & chunk$_{-2,-1,1}$\\\hline

926: \end{tabular}

927: \caption{Best F$_{\beta=1}$ found for six data representations in two

928: passes while using a bi-directional hill-climbing feature search

929: algorithm in a 10-fold cross-validation process applied to the

930: training data for the noun phrase chunking task.

931: Note that the rates obtained for the O (open bracket) and C (close

932: bracket) representations are for phrase starts and phrase ends

933: respectively and thus higher than for the first four which evaluate

934: complete phrase identification.

935: %The results on O+C line were obtained by a combination of the two.

936: }

937: \label{tab-np10a}

938: \end{center}

939: \end{table}

940:

941: Our motive for processing six representations rather than one was to

942: obtain different results which we could combine in order to improve

943: performance.

944: System combination can be seen as a second cascade behind passes one

945: and two.

946: For reasons mentioned in Section \ref{sec-partun}, adding a second

947: cascade in a 10-fold cross-validation experiment requires taking extra

948: care to prevent information leaking from the training data at one level

949: to the training data of the next level.

950: We have taken care of this problem by preparing the training data of

951: the combination techniques with 9-fold cross-validation runs which

952: were independent of the 10-fold cross-validation experiments used

953: for generating the test data.

954: For example, the test data for the first section was generated by

955: training with sections 2-10 twice, first without information about

956: context chunk tags and then with the perfect information of the

957: context chunk tags.

958: The training data was generated with a 9-fold cross-validation process

959: on sections 2-10, also first without context chunk tags and then with

960: perfect context chunk tags.

961: By working this way it was impossible for information about the first

962: section to enter the training data of the combination processes.

963:
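The data split behind this scheme can be made explicit with a small
sketch.
The function {\tt stacking\_plan} is hypothetical and only enumerates
which sections are used where; it does not train any model.

```python
def stacking_plan(n_sections=10):
    """List, per test section, the sections used at each level of the cascade."""
    sections = list(range(1, n_sections + 1))
    plan = {}
    for test in sections:
        rest = [s for s in sections if s != test]
        plan[test] = {
            # Base systems that produce the test output are trained on all other sections
            "base_train": rest,
            # Combinator training data: 9-fold cross-validation inside those sections
            "combiner_folds": [
                {"held_out": h, "train": [s for s in rest if s != h]}
                for h in rest
            ],
        }
    return plan
```

In this plan the test section never occurs in any training list, at
either level.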

964: Most system combination techniques require results that are in the

965: same format.

966: We have results in six different formats which means that we need to

967: convert them to one format.

968: Since we do not know which of the formats would suit the combination

969: process best, we have evaluated all formats.

970: The four IO formats can trivially be converted to each other and to

971: the O and the C format.

972: The conversion of the two bracket formats to the other four is

973: nontrivial.

974: The two data streams have been generated independently of each other

975: and this means that they may contain inconsistencies.

976: We have chosen to get rid of these by removing all brackets which

977: cannot be matched with the closest candidate.

978: For example, if we have a structure like {\it ( a ( b c ) d )} then

979: the first bracket will be removed because it cannot be matched with

980: the second bracket.

981: The second and third will be kept because they match.

982: Finally, the fourth will be removed because it cannot be matched with

983: the third.

984: We obtain the balanced structure {\it a ( b c ) d} which can trivially

985: be converted to the four IO formats.

986:
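One way to implement this bracket filter is sketched below.
It reflects our reading of the rule, namely that an open bracket is
only matched with a close bracket when no other open bracket occurs in
between; the function name {\tt balance\_brackets} and the stream
encoding with {\tt [}, {\tt ]} and periods are assumptions made for
the example.

```python
def balance_brackets(open_stream, close_stream):
    """Pair each open bracket with the closest close bracket and drop the rest.

    Returns the surviving chunks as inclusive (start, end) token spans.
    """
    chunks = []
    pending = None  # position of the most recent unmatched open bracket
    for i, (o, c) in enumerate(zip(open_stream, close_stream)):
        if o == "[":
            # An earlier open bracket that is still unmatched is discarded
            pending = i
        if c == "]":
            if pending is not None:
                chunks.append((pending, i))
                pending = None
            # A close bracket without a pending open bracket is discarded
    return chunks
```

For the structure {\it ( a ( b c ) d )} the open stream is
{\tt [ [ . .} and the close stream is {\tt . . ] ]}; the only chunk
that survives covers {\it b c}.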

987: \begin{table}[t]

988: \begin{center}

989: \begin{tabular}{|l|c|c|c|c|c|}\cline{2-6}

990: \multicolumn{1}{l|}{train}

991:                  &  IOB1 &  IOB2 &  IOE1 &  IOE2 &  O+C\\\hline

992: {\bf all systems} &      &       &       &       &       \\

993: Majority         & 93.06 & 93.06 & 93.14 & 93.12 & 93.35 \\

994: TotPrecision     & 93.06 & 93.05 & 93.13 & 93.05 & 93.35 \\

995: TagPrecision     & 93.06 & 93.10 & 93.11 & 93.11 & 93.35 \\

996: Precision-Recall & 93.06 & 93.10 & 93.11 & 93.08 & 93.35 \\

997: TagPair          & 93.05 & 93.14 & 93.10 & 93.13 & 93.36 \\

998: MBL              & 93.14 & 93.12 & 93.07 & 92.92 & 93.35 \\

999: MBL+             & 92.81 & 92.74 & 92.91 & 92.78 & 93.29 \\\hline

1000: {\bf some systems} &     &       &       &       &       \\

1001: Majority         & 93.02 & 93.12 & 93.08 & 92.99 & 93.37 \\

1002: TotPrecision     & 93.02 & 93.12 & 93.08 & 92.99 & 93.37 \\

1003: TagPrecision     & 93.04 & 93.13 & 93.10 & 92.99 & 93.37 \\

1004: Precision-Recall & 93.04 & 93.13 & 93.13 & 93.04 & 93.37 \\

1005: TagPair          & 93.08 & 93.16 & 93.12 & 93.05 & 93.37 \\

1006: MBL              & 93.12 & 93.18 & 93.18 & 93.03 & 93.38 \\\hline

1007: \end{tabular}

1008: \caption{

1009: F$_{\beta=1}$ rates obtained on 10-fold cross-validation experiments

1010: on the noun phrase chunking data while combining results obtained with

1011: five different data representations.

1012: All five representations have been tested and best rates have been

1013: obtained while using the combined bracket representation O+C.

1014: All combination results are better than any result of the individual

1015: systems (92.59, see Table \ref{tab-np10a}) and generally combining five

1016: systems led to better results than when only three or four were used.

1017: The best results have been obtained with a stacked memory-based

1018: classifier that used all system results except those generated with

1019: IOE1.

1020: However, the performance differences are small.

1021: }

1022: \label{tab-np10b}

1023: \end{center}

1024: \end{table}

1025:

1026: % significance

1027:

1028: We have combined the five results of pass two of the 10-fold

1029: cross-validation experiments on the noun phrase chunking training

1030: data (O and C have now been regarded as one data stream O+C).

1031: We have used the system combination techniques described in Section

1032: \ref{sec-combi}: Majority Voting, TotPrecision, TagPrecision,

1033: Precision-Recall, TagPair and two variants of a stacked memory-based

1034: learner.

1035: The first stacked learner did not use any context information while

1036: the second one had access to a limited amount of context information:

1037: the current word, the current POS tag or pairs containing the current

1037: POS tag and one of the following three: current word, previous POS tag or next

1039: POS tag.

1040: We have performed combination experiments with all five data streams

1041: and with all subsets of three and four data streams.

1042: The results can be found in Table \ref{tab-np10b}.

1043: For the second stacked classifier we only included the best results

1044: (obtained with context feature current POS tag).

1045: System combination improved performance: the worst result of the

1046: combination techniques is still better than the best result of the

1047: individual systems.

1048: The differences between the combination techniques are small.

1049: Furthermore, system combination with the four IO data representations

1050: leads to similar results but the combined bracket representation

1051: consistently obtains higher F$_{\beta=1}$ rates.

1052: It should be noted though that while combination of the data with the

1053: IO representations leads to similar precision and recall figures, O+C

1054: obtains its higher F$_{\beta=1}$ rates with high precision rates and

1055: lower recall rates.

1056:

1057: Since the performance differences between the combination techniques

1058: displayed in Table \ref{tab-np10b} are small, we are relatively free

1059: in selecting a technique for further processing.

1060: We chose Majority Voting because it is the simplest of the

1061: combination techniques that were tested: unlike the other techniques,

1062: it does not require extra combinator training data.

1063: It does seem reasonable to use the O+C representation during the

1064: combination process because the best results have been obtained with

1065: this representation.

1066: We will restrict ourselves to a few systems rather than combining all

1067: because Majority Voting in combination with the O+C representation

1068: obtained a slightly higher F$_{\beta=1}$ rate that way.

1069: The best rate was obtained while using only the systems with data

1070: representations IOB1, IOE2 and O+C so we restrict ourselves to

1071: these three.

1072: This leaves us with the following processing scheme:

1073:

1074: \begin{enumerate}

1075: %\itemsep -6mm

1076: \item Process the test data with a memory-based model generated from

1077:       the training data.

1078:       Use the features shown in Table \ref{tab-np10a} (Pass 1) and

1079:       generate output data streams while using the representations

1080:       IOB1, IOE2, O and C.

1081: \item Perform a second pass over the test data with another

1082:       memory-based model obtained from the training data.

1083:       Again use the features shown in Table \ref{tab-np10a} (Pass 2).

1084:       In the test data, use the estimated chunk tags from the previous

1085:       run as chunk tag features and in the training data use the

1086:       corpus chunk tags as chunk features.

1087:       Perform these passes four times, once for each of the data

1088:       representations IOB1, IOE2, O and C.

1089: \item Convert the output for the data representations IOB1 and IOE2 to

1090:       the O and the C format.

1091: \item Combine the three O data streams (IOB1, IOE2 and O) with

1092:       Majority Voting and do the same for the three C data streams

1093:       (IOB1, IOE2 and C).

1094: \item Remove brackets from the resulting O and C data streams which

1095:       cannot be matched with other brackets.

1096:       The balanced bracket structure is the analysis of the test data

1097:       that is the output of the complete system.

1098: \end{enumerate}

1099:
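The voting step of this scheme can be sketched as follows.
The function names {\tt majority\_vote} and {\tt combine} are
hypothetical; the sketch assumes that all streams have the same length
and that an odd number of streams is combined, so that no ties occur
for the two-valued bracket streams.

```python
from collections import Counter


def majority_vote(streams):
    """Choose, for every token position, the tag proposed by most streams."""
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*streams)]


def combine(open_streams, close_streams):
    """Vote separately over the open-bracket and the close-bracket streams."""
    return majority_vote(open_streams), majority_vote(close_streams)
```

The two voted streams may still be inconsistent with each other, which
is why the final step of the scheme removes unmatched brackets.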

1100: \noindent

1101: We have applied this procedure to the data sets of \cite{ramshaw95}:

1102: sections 15-18 of the Wall Street Journal part of the Penn Treebank

1103: \citep{marcus93} as training data and section 20 of the same corpus

1104: as test data.

1105: The system obtained an F$_{\beta=1}$ rate of 93.34 (precision 94.01\%

1106: and recall 92.67\%).

1107: This is a modest improvement over our earlier work \citep{tks2000naacl}

1108: in which we did not use feature selection and where we obtained an

1109: F$_{\beta=1}$ rate of 93.26.

1110: In order to estimate significance thresholds, we have applied a

1111: bootstrap resampling test to the output of our system.

1112: We created 10,000 populations by randomly drawing sentences with

1113: replacement from the system results.

1114: The number of sentences in each population was the same as in the

1115: test corpus.

1116: The average F$_{\beta=1}$ of the 10,000 populations was 93.33 with

1117: a standard deviation of 0.24.

1118: For 5 percent of the populations, the F$_{\beta=1}$ rate was equal to

1119: or lower than 92.93 and for another 5 percent it was equal to or higher

1120: than 93.73.

1121: Since 93.26 is between the two significance boundaries, our current

1122: system does not perform significantly better than the previous version

1123: without feature selection.

1124: % bootstrapping: 93.33   0.24  92.93  93.73

1125:
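The reported figures can be checked with a short calculation, assuming
that F$_{\beta=1}$ is the harmonic mean of precision and recall:

\[
\mbox{F}_{\beta=1} = \frac{2 \cdot 94.01 \cdot 92.67}{94.01 + 92.67}
                   = \frac{17423.81}{186.68} \approx 93.34
\]

The earlier result lies inside the interval spanned by the two
significance boundaries:

\[
92.93 < 93.26 < 93.73
\]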

1126: \subsection{Arbitrary Phrase Identification}

1127: \label{sec-chuarb}

1128:

1129: Our work with chunks of arbitrary types\footnote{The results of our

1130: arbitrary phrase identification work have earlier been presented by

1131: \cite{tks2000conll}.}

1132: is similar to that with noun phrase chunks apart from two differences.

1133: First, we refrained from using feature selection methods.

1134: Applying these methods did not gain us much for noun phrase

1135: chunking but they required a lot of extra computational work.

1136: Therefore we went back to using a fixed set of features in these

1137: experiments.

1138: The context size we used here was four left and four right for words

1139: and POS tags in the first pass over the data, and three left and three

1140: right for words and POS tags, and two left and two right without the

1141: focus for chunk tags in the second pass.

1142: This means that both first and second pass use 18 features.

1143: The second pass has only been used for the four IO data

1144: representations.

1145: Table \ref{tab-np10a} shows that the second pass improved the

1146: performance of the first pass only by a small margin for the two

1147: bracket representations O and C.

1148:

1149: The second difference between this study and the one for noun phrase

1150: chunks originates from the fact that apart from chunk boundaries, we

1151: need to find chunk types as well.

1152: We can approach this task in two ways.

1153: First, we could train the learner to identify both chunk boundaries and

1154: chunk types at the same time.

1155: We have called this approach the Single-Phase Approach.

1156: Second, we could split the task and train a learner to identify all

1157: chunk boundaries and feed its output to another classifier which

1158: identifies the types of the chunks (Double-Phase Approach).

1159: A computationally-intensive approach would be to develop learners for

1160: each different chunk type.

1161: They could identify chunks independently of each other and words

1162: assigned to more than one chunk could be disambiguated by choosing the

1163: chunk type that occurs most frequently in the training data (N-Phase

1164: Approach).

1165: Since we did not know in advance which of these three processing

1166: strategies would generate the best results, we have evaluated all

1167: three.

1168:
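The disambiguation step of the N-Phase Approach can be illustrated with
a small sketch.
The function {\tt resolve\_types}, the span format and the frequency
counts in the usage example are assumptions made for illustration; the
sketch works at the word level and leaves words without any proposal
unlabeled.

```python
def resolve_types(n_tokens, proposals, type_frequency):
    """Assign one chunk type per word from independent per-type chunkers.

    proposals maps a chunk type to inclusive (start, end) spans; a word
    claimed by several types gets the type that is most frequent in the
    training data.
    """
    labels = [None] * n_tokens
    for chunk_type, spans in proposals.items():
        for start, end in spans:
            for i in range(start, end + 1):
                current = labels[i]
                if current is None or type_frequency[chunk_type] > type_frequency[current]:
                    labels[i] = chunk_type
    return labels


# Hypothetical counts: the second word is claimed by both chunkers
labels = resolve_types(
    3, {"NP": [(0, 1)], "VP": [(1, 2)]}, {"NP": 100, "VP": 40}
)
```

In this toy example the contested word is assigned to the noun phrase
because that type has the higher assumed frequency.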

1169: In order to find the best processing strategy and the best combination

1170: technique, we have performed several 10-fold cross-validation

1171: experiments on the training data.

1172: We have processed this data for each processing strategy and in each

1173: of the six data representations earlier used for noun phrase chunking.

1174: After this we have used the seven combination techniques presented in

1175: Section \ref{sec-combi} for combining these.

1176: The results can be found in Table \ref{tab-xp10}.

1177: Of the three processing strategies, the N-Phase Approach generally

1178: performed best with Double-Phase being second best and Single-Phase

1179: performing worst.

1180: Again, system combination improved all individual results.

1181: There were only small differences between the seven combination

1182: techniques when compared for the same processing approach.

1183: The only exceptions were the two stacked MBL classifiers applied to the

1184: Single-Phase Approach results.

1185: They scored about 0.3 F$_{\beta=1}$ points higher than most of the other

1186: combination techniques.

1187:

1188: \begin{table}[t]

1189: \begin{center}

1190: \begin{tabular}{|l|c|c|c|}\cline{2-4}

1191: \multicolumn{1}{l|}{train} & SP & DP & NP \\\hline

1192: IOB1             & 90.68 & 91.59 & 92.02 \\

1193: IOB2             & 90.77 & 91.65 & 91.94 \\

1194: IOE1             & 90.94 & 91.60 & 91.90 \\

1195: IOE2             & 91.21 & 91.97 & 91.99 \\

1196: O+C              & 91.57 & 91.97 & 91.51 \\\hline

1197: Majority         & 91.96 & 92.34 & 92.62 \\

1198: TotPrecision     & 91.97 & 92.34 & 92.62 \\

1199: TagPrecision     & 91.98 & 92.34 & 92.62 \\

1200: Precision-Recall & 91.96 & 92.34 & 92.62 \\

1201: TagPair          & 92.08 & 92.34 & 92.65 \\

1202: MBL              & 92.32 & 92.35 & 92.75 \\

1203: MBL+             & 92.40 & 92.32 & 92.72 \\\hline

1204: \end{tabular}

1205: \end{center}

1206: \caption{F$_{\beta=1}$ rates obtained for the three processing

1207: strategies, Single-Phase Approach (SP), Double-Phase Approach (DP) and

1208: N-Phase Approach (NP), when applied to the training data of the

1209: CoNLL-2000 shared task (arbitrary chunking) while using five different

1210: data representations and seven system combination techniques.

1211: In all cases, system combination led to performances that were better

1212: than the individual system results.

1213: The computationally-intensive N-Phase Approach does better than the

1214: other two.

1215: }

1216: \label{tab-xp10}

1217: \end{table}

1218:

1219: % significance test on best

1220:

1221: The best result was generated with the N-Phase Approach in combination

1222: with a stacked memory-based classifier (MBL, 92.75).

1223: A bootstrap resampling test with 8000 random populations generated the

1224: 90\% significance interval 92.60--92.90, which means that this result

1225: is significantly better than any Single-Phase or Double-Phase result.

1226: However, the N-Phase Approach has a large computational overhead: the

1227: number of passes over the data is at least N times the number of

1228: representations.

1229: Therefore, we have chosen the Double-Phase Approach combined with

1230: Majority Voting for our further work.

1231: This approach combines a reasonable performance with computational

1232: efficiency.

1233: The Single-Phase Approach is potentially faster, but its performance

1234: is worse unless we use a stacked classifier, which requires extra

1235: combinator training data.
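
A significance interval of the kind reported above can be estimated along the following lines. This is only a minimal sketch: it assumes that sentences are the resampling unit and that each sentence is summarized by its counts of correct, predicted and gold chunks. The names \texttt{f\_beta1} and \texttt{bootstrap\_interval} and the default parameter values are illustrative choices, not part of the experiments described here.

```python
import random


def f_beta1(correct, predicted, gold):
    """F(beta=1) computed from chunk counts."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def bootstrap_interval(sentence_counts, samples=1000, level=0.90, seed=0):
    """Estimate an interval for F(beta=1) by bootstrap resampling.

    sentence_counts: one (correct, predicted, gold) triple per sentence.
    Assumption: sentences are drawn with replacement and the interval is
    read off the sorted scores of the resampled populations.
    """
    rng = random.Random(seed)
    n = len(sentence_counts)
    scores = []
    for _ in range(samples):
        resample = [sentence_counts[rng.randrange(n)] for _ in range(n)]
        correct = sum(s[0] for s in resample)
        predicted = sum(s[1] for s in resample)
        gold = sum(s[2] for s in resample)
        scores.append(100 * f_beta1(correct, predicted, gold))
    scores.sort()
    tail = (1.0 - level) / 2.0
    low = scores[int(tail * samples)]
    high = scores[min(samples - 1, int((1.0 - tail) * samples))]
    return low, high
```

A result of another system can then be called significantly different if it falls outside the interval returned for the system under study.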

1236:

1237: When we applied the Double-Phase Approach combined with Majority

1238: Voting to the CoNLL-2000 data sets, we obtained an F$_{\beta=1}$ rate

1239: of 92.50 (precision 94.04\% and recall 91.00\%).

1240: An overview of the performance rates of the different chunk types can

1241: be found in Table \ref{tab-xp}.

1242: Our system does well for the three most frequently occurring chunk

1243: types, noun phrases, prepositional phrases and verb phrases, and less

1244: well for the other seven.

1245: The chunk type UCP, which occurred in the training data, was not

1246: present in the test data.

1247: With this result, our memory-based arbitrary chunker finished third of

1248: eleven participants in the CoNLL-2000 shared task.

1249: The two systems that performed better were Support Vector Machines

1250: \citep[][F$_{\beta=1}$=93.48]{kudoh2000} and Weighted Probability

1251: Distribution Voting \citep[][F$_{\beta=1}$=93.32]{hvh2000}.
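
The F$_{\beta=1}$ rates quoted in this paper are harmonic means of precision and recall, so they can be checked against the rounded percentages in the text. The helper below is a small sketch for that purpose; the name \texttt{f\_beta} is illustrative, and small deviations in the last digit are possible because the published precision and recall values are themselves rounded.

```python
def f_beta(precision, recall, beta=1.0):
    """F rate from precision and recall (both given as percentages)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Overall arbitrary chunking result: precision 94.04%, recall 91.00%.
chunking_f = f_beta(94.04, 91.00)
```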

1252:

1253: \begin{table}[t]

1254: \begin{center}

1255: \begin{tabular}{|l|c|c|c|}\cline{2-4}

1256: \multicolumn{1}{l|}{test data}

1257:                  & precision & recall & F$_{\beta=1}$ \\\hline

1258: ADJP  & 85.25\% & 59.36\% & 69.99 \\

1259: ADVP  & 85.03\% & 71.48\% & 77.67 \\

1260: CONJP & 42.86\% & 33.33\% & 37.50 \\

1261: INTJ  &100.00\% & 50.00\% & 66.67 \\

1262: LST   &  0.00\% &  0.00\% &  0.00 \\

1263: NP    & 94.14\% & 92.34\% & 93.23 \\

1264: PP    & 96.45\% & 96.59\% & 96.52 \\

1265: PRT   & 79.49\% & 58.49\% & 67.39 \\

1266: SBAR  & 89.81\% & 72.52\% & 80.25 \\

1267: VP    & 93.97\% & 91.35\% & 92.64 \\\hline

1268: all   & 94.04\% & 91.00\% & 92.50 \\\hline

1269: \end{tabular}

1270: \end{center}

1271: \caption{

1272: The results per chunk type of processing the test data with the

1273: Double-Phase Approach and Majority Voting.

1274: Although the data is formatted differently than the noun phrase

1275: chunking data, the NP F$_{\beta=1}$ rate here (93.23) is close to

1276: that of our NP chunking F$_{\beta=1}$ rate (93.34).

1277: }

1278: \label{tab-xp}

1279: \end{table}

1280:

1281: % \subsection{Discussion}

1282: % compare

1283: % why sign deviation so big? sets contain different data!

1284: % why no help fsearch?

1285: % error analysis

1286:

1287: \section{Parsing}

1288:

1289: In this section we will examine the application of memory-based

1290: shallow parsing to generating embedded structures.

1291: We will examine three tasks: clause identification, noun phrase

1292: parsing and full parsing.

1293: Whenever possible, we will use the methods that we have applied to

1294: chunking in the previous section.

1295:

1296: \subsection{Clause Identification}

1297: \label{sec-clauses}

1298:

1299: In clause identification the goal is to divide sentences into clauses

1300: which typically contain a subject and a predicate.

1301: We have used the clause data of the CoNLL-2001 shared task

1302: \citep{tksdejean2001conll} which was derived from the Wall Street

1303: Journal part of the Penn Treebank \citep{marcus93}.

1304: Here is an example sentence from the Treebank, with all information

1305: but words and clause brackets omitted:

1306:

1307: \begin{quote}

1308: \noindent

1309: (S Coach them in\\

1310: \hspace*{0.25cm}(S--NOM handling complaints)\\

1311: \hspace*{0.25cm}(SBAR--PRP so that\\

1312: \hspace*{0.50cm}(S they can resolve problems immediately)\\

1313: \hspace*{0.25cm})\\

1314: \hspace*{0.25cm}.\\

1315: )

1316: \end{quote}

1317:

1318: \noindent

1319: This sentence contains four clauses.

1320: In the data that we have worked with, the function and type

1321: information has been removed.

1322: This means that the type tags NOM and PRP have been omitted and that

1323: the SBAR tag has been replaced by S.

1324: Like the chunking data, these data sets contained words and

1325: part-of-speech tags which were generated by the Brill tagger

1326: \citep{brill94}.

1327: Additionally they contained chunk tags which were computed by the

1328: arbitrary chunking method we discussed in the previous section.

1329:

1330: We have approached identifying clauses in the following

1331: way:\footnote{This approach and the results achieved with it have

1332: earlier been discussed by \cite{tks2001conll}.}

1333: first we evaluated different memory-based learners for predicting the

1334: positions of open clause brackets and close clause brackets,

1335: regardless of their level of embedding.

1336: The two resulting bracket streams may be inconsistent. In order to

1337: solve this we have developed a list of rules which change a possibly

1338: inconsistent set of brackets into a balanced structure.

1339: The evaluation of the learners and the development of the balancing

1340: rules will be done with 10-fold cross-validation of the CoNLL-2001

1341: training data.

1342: Information leaking is prevented by using corpus clause tags as

1343: context features in the training data of cascaded learners rather than

1344: clause tags computed in a previous learning phase.

1345: The best learner configurations and balancing rules found will be

1346: applied to the data for the clause identification shared task.

1347:

1348: As in our noun phrase chunking work, we have tested memory-based

1349: learners with different sets of features.

1350: At the time we performed these experiments, we did not have access to

1351: feature selection methods and therefore we have only evaluated a few

1352: fixed feature configurations:

1353:

1354: \begin{enumerate}

1355: \itemsep -0.1cm

1356: \item words only (w)

1357: \item POS tags only (p)

1358: \item chunk tags only (c)

1359: \item words and POS tags (wp)

1360: \item words and chunk tags (wc)

1361: \item POS tags and chunk tags (pc)

1362: \item words, POS tags and chunk tags (wpc)

1363: \end{enumerate}

1364:

1365: \noindent

1366: All feature groups were tested with four context sizes: no context

1367: information or information about a symmetrical window of one, two or

1368: three words.

1369: As in our chunking work, we want to check if an improved performance

1370: can be obtained by using system combination.

1371: However, since we attempt to predict brackets at all levels in one

1372: step, we cannot use the five data representations here.

1373: Instead we have evaluated combinations of some of the feature

1374: configurations mentioned above: a majority vote of the three using a

1375: single type of information (1+2+3), a majority vote of the three using

1376: pairs of information (4+5+6) and a majority vote of the previous two

1377: and the one using three types of information (7+(1+2+3)+(4+5+6)).

1378: The last one is a combination of three results of which two themselves

1379: are combinations of three results.
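
The nested voting scheme described above can be sketched as follows. The sketch assumes that every classifier outputs one boolean per word (bracket or no bracket); the function names and the dictionary keys are illustrative and merely mirror the feature set labels used in the list above.

```python
from collections import Counter


def majority_vote(*systems):
    """Per-word majority vote over an odd number of boolean prediction lists."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*systems)]


def nested_vote(pred):
    """pred maps a feature set label (w, p, c, wp, wc, pc, wpc) to predictions.

    Returns the three combined outputs: 1+2+3, 4+5+6 and 7+(1+2+3)+(4+5+6).
    """
    singles = majority_vote(pred["w"], pred["p"], pred["c"])
    pairs = majority_vote(pred["wp"], pred["wc"], pred["pc"])
    combined = majority_vote(pred["wpc"], singles, pairs)
    return singles, pairs, combined
```

With three voters and binary decisions no ties can occur, which is why no tie-breaking rule is needed in this sketch.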

1380:

1381: Clauses may contain many words and it is possible that the maximal

1382: context used by the learner, three words left and right, is not enough

1383: for predicting clause boundaries accurately.

1384: However, we cannot make the context size much larger than three

1385: because that would make it harder for the learner to generalize.

1386: We have tried to deal with this problem by evaluating another set of

1387: features which contain summaries of sentences rather than every word.

1388: Since we have chunk information of the sentences available, we

1389: can compress them by removing all words from each chunk except the

1390: main one, the head word.

1391: The head words can be generated by a set of rules put forward by

1392: \cite{magerman95} and modified by \cite{collins99}.\footnote{Available

1393: on http://www.research.att.com/\~{ }mcollins/papers/heads}

1394: After removing the nonhead words from each chunk, we can replace the

1395: POS tag of the remaining word with the chunk tag and thus obtain data

1396: with words and chunk tags only (words outside of a chunk keep their

1397: POS tag).

1398: Again we have evaluated sets of features which hold a single type of

1399: information, words (w--) or chunk tags (c--), or pairs of information,

1400: words and chunk tags (wc--).
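
The compression step can be illustrated with a short sketch. It assumes that chunk tags are given in an IOB2-like format (\texttt{B-NP}, \texttt{I-NP}, \texttt{O}) and, as a placeholder for the Magerman/Collins head rules, it simply takes the last word of a chunk as its head. Both the function name \texttt{compress\_chunks} and this default head rule are assumptions made for illustration only.

```python
def compress_chunks(tokens, find_head=lambda words: words[-1]):
    """Replace every chunk by its head word paired with the chunk type.

    tokens: list of (word, pos, chunk) triples with IOB2-style chunk tags.
    find_head: placeholder head rule (last word); NOT the Magerman/Collins rules.
    Words outside of a chunk keep their POS tag.
    """
    summary = []
    i = 0
    while i < len(tokens):
        word, pos, chunk = tokens[i]
        if chunk == "O":
            summary.append((word, pos))
            i += 1
            continue
        chunk_type = chunk[2:]
        words = [word]
        i += 1
        # collect the remaining words of the current chunk
        while i < len(tokens) and tokens[i][2] == "I-" + chunk_type:
            words.append(tokens[i][0])
            i += 1
        summary.append((find_head(words), chunk_type))
    return summary
```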

1401:

1402: \begin{table}[t]

1403: \begin{center}

1404: \begin{tabular}{ r|l|c|c|c|c|r|l|c|c|c|c|}\cline{3-6}\cline{9-12}

1405: \multicolumn{1}{l}{} & \multicolumn{1}{l|}{train} &

1406:  0 & 1 & 2 & 3 & \multicolumn{1}{l}{~~~~} & train &

1407:  0 & 1 & 2 & 3 \\\cline{2-6}\cline{8-12}

1408:  1 & w         & 61.77 & 84.40 & 83.74 & 81.08 &

1409:  1 & w         & 61.11 & 75.99 & 77.52 & 77.63 \\

1410:  2 & p         & 30.44 & 80.40 & 80.47 & 76.85 &

1411:  2 & p         & 61.71 & 77.52 & 78.74 & 77.95 \\

1412:  3 & c         & 13.67 & 76.76 & 79.05 & 78.71 &

1413:  3 & c         & 00.00 & 67.25 & 75.06 & 75.70 \\

1414:  4 & wp        & 62.24 & 87.19 & 84.45 & 81.22 &

1415:  4 & wp        & 61.25 & 76.52 & 77.92 & 78.12 \\

1416:  5 & wc        & 67.95 & 87.31 & 85.74 & 82.97 &

1417:  5 & wc        & 61.01 & 75.96 & 77.46 & 77.79 \\

1418:  6 & pc        & 49.29 & 86.65 & 84.92 & 81.72 &

1419:  6 & pc        & 61.74 & 77.44 & 78.40 & 77.93 \\

1420:  7 & wpc       & 68.66 & 87.92 & 85.93 & 83.28 &

1421:  7 & wpc       & 61.21 & 76.17 & 77.73 & 78.00 \\\cline{2-6}\cline{8-12}

1422:  8 & 1+2+3     & 38.32 & 85.24 & 86.92 & 85.38 &

1423:  8 & 1+2+3     & 61.67 & 75.93 & 79.60 & 79.94 \\

1424:  9 & 4+5+6     & 68.04 & 88.83 & 87.44 & 84.98 &

1425:  9 & 4+5+6     & 61.44 & 77.30 & 79.15 & 79.38 \\

1426: 10 & 7+8+9     & 68.03 & 88.75 & 87.72 & 85.45 &

1427: 10 & 7+8+9     & 61.44 & 77.20 & 79.25 & 79.60 \\\cline{2-6}\cline{8-12}

1428: 11 & w-        & 54.05 & 83.70 & 83.48 & 81.25 &

1429: 11 & w-        & 61.24 & 76.01 & 78.69 & 79.25 \\

1430: 12 & c-        & 14.26 & 77.70 & 79.30 & 78.50 &

1431: 12 & c-        & 61.73 & 76.82 & 78.34 & 80.90 \\

1432: 13 & wc-       & 58.47 & 86.53 & 85.74 & 82.77 &

1433: 13 & wc-       & 61.43 & 76.77 & 80.15 & 81.61 \\\cline{2-6}\cline{8-12}

1434: \end{tabular}

1435: \end{center}

1436: \caption{

1437: F$_{\beta=1}$ rates obtained in 10-fold cross-validation experiments

1438: with the training data while predicting open clause brackets (left)

1439: and close clause brackets (right).

1440: We used different combinations of information (w: words, p: POS tags

1441: and c: chunk tags) and different context sizes (0-3).

1442: The best results for open brackets have been obtained with a majority

1443: vote of three information pairs while using context size 1 (row 9).

1444: For close clause brackets best results were obtained with words and

1445: chunk tags after compressing the chunks and while using context size 3

1446: (row 13).

1447: }

1448: \label{tab-cl10}

1449: \end{table}

1450:

1451: We have evaluated the thirteen groups of feature sets while predicting

1452: the clause open and clause close brackets.

1453: The results can be found in Table \ref{tab-cl10}.

1454: The learner performed best while predicting open clause brackets with

1455: information about the words immediately next to the current word

1456: (column 1).

1457: When more information was available, its performance dropped slightly.

1458: Of the different feature sets tested, the majority vote of sets that

1459: used pairs of information performed best (column 1, row 9).

1460: The classifiers that generated close brackets improved whenever extra

1461: context information became available.

1462: The best performance was reached while using a pair of words and chunk

1463: tags in the summarized format (column 3, row 13).

1464: We have performed an extra experiment to test if the system improved

1465: when using four context words rather than three.

1466: With words and chunk tags in the summarized format the system obtained

1467: F$_{\beta=1}$=81.72 for context size four compared with 81.61 for

1468: context size three.

1469: This increase is small so we have chosen context size three for our

1470: further experiments.

1471:

1472: With the streams of open and close brackets, we attempted to generate

1473: balanced clause structures by modifying the data streams with a set of

1474: heuristic rules.

1475: In these rules we gave more confidence to the open bracket predictions

1476: since, as can be seen in Table \ref{tab-cl10}, the system performs

1477: better in predicting open brackets than close brackets.

1478: After testing different rule sets created by hand and evaluating these

1479: on the available training data, we decided on using the following rule

1480: set:

1481:

1482: \begin{enumerate}

1483: \itemsep -0.1cm

1484: \item Assume that exactly one clause starts at each clause start

1485:       position.

1486: \item Assume that exactly one clause ends at each clause end

1487:       position but

1488: \item ignore all clause end positions when currently no clause is

1489:       open, and

1490: \item ignore all clause ends at non-sentence-final positions

1491:       which attempt to close a clause started at the first word of the

1492:       sentence.

1493: \item If clauses are opened but not closed at the end of the sentence

1494:       then close them at the penultimate word of the sentence.

1495: \end{enumerate}

1496:

1497: \noindent

1498: These rules were able to generate complete and consistent embedded

1499: clause structures for the output that the system generated for the

1500: training data of the CoNLL-2001 shared task.

1501: The rules have one main defect: they are incapable of predicting that

1502: two or more clauses start at the same position.

1503: This will make it impossible for the system to detect such clause

1504: starts but unfortunately, according to our rule set evaluation, adding

1505: recognition facilities for such multiple clause starts would have a

1506: negative influence on overall performance levels.

1507: This set of rules obtained a clause F$_{\beta=1}$ of 71.34 on the

1508: training data of this task when applied to the best results for open

1509: and close brackets.

1510: The rules did not change the open bracket positions and on average the

1511: changes they made to the close bracket positions were an improvement

1512: (F$_{\beta=1}$ = 84.11 compared to 81.61).
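
One possible reading of the five balancing rules is sketched below. The input is a pair of boolean lists with one entry per word; the output is a list of clauses as (start, end) word positions. The rules do not spell out how a close bracket at the sentence-final position interacts with rule 5, so the sketch assumes that such a bracket closes the outermost open clause while all other open clauses are closed at the penultimate word. The function name \texttt{balance\_brackets} and this tie-in between rules 2 and 5 are assumptions.

```python
def balance_brackets(opens, closes):
    """Turn open/close bracket predictions into a nested clause structure."""
    n = len(opens)
    if n == 0:
        return []
    last = n - 1
    penultimate = max(n - 2, 0)
    stack = []      # start positions of the clauses that are currently open
    clauses = []
    for i in range(last):
        if opens[i]:
            stack.append(i)                          # rule 1
        # rule 2, unless no clause is open (rule 3) or the clause to be
        # closed started at the first word of the sentence (rule 4)
        if closes[i] and stack and stack[-1] != 0:
            clauses.append((stack.pop(), i))
    if opens[last]:
        stack.append(last)
    if closes[last] and stack:
        # assumption: a sentence-final close belongs to the outermost clause
        clauses.append((stack.pop(0), last))
    while stack:                                     # rule 5
        start = stack.pop()
        clauses.append((start, max(start, penultimate)))
    return sorted(clauses)
```

Applied to correct bracket positions for the example sentence of this section (with at most one bracket per position), this reading recovers all four clauses.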

1513:

1514: An argument which could be made is that since open bracket prediction

1515: is more accurate than close bracket prediction, one could use the

1516: information of the open bracket positions when predicting clause

1517: close brackets.

1518: We have attempted to do this by repeating the experiment with the best

1519: configuration for close brackets (wc-- with context size 3) while

1520: adding a feature which stated at which clause level the current word

1521: was, according to earlier open and close brackets.

1522: This approach improved the F$_{\beta=1}$ rate of the close bracket

1523: predictor from 81.61 to 83.50.

1524: However, after applying the balancing rules to the open brackets and

1525: the improved close brackets, we only got a clause F$_{\beta=1}$ of

1526: 71.39, a minimal improvement over the previous 71.34.

1527: It seems that the extra performance gain obtained in the close bracket

1528: predictor was obtained by solving problems which could already be

1529: solved by the balancing rules.

1530:

1531: We applied the balancing rules together with an open bracket predictor

1532: using a combination of pairs of feature types (context size 1) and a

1533: close bracket predictor using summarized pairs of words and chunk tags

1534: (context size 3) to the data files of the CoNLL-2001 shared task.

1535: Our clause identification method obtained an F$_{\beta=1}$ rate of

1536: 67.79 for identifying complete clauses (precision 76.91\% and recall

1537: 60.61\%).

1538: In the CoNLL-2001 shared task, the system finished third of six

1539: participants.

1540: One system outperformed the others by a large margin: the boosted

1541: decision tree method by \cite{carreras2001}.

1542: Their system obtained an F$_{\beta=1}$ rate of 78.63 on this task.

1543: The main differences between their approach and ours are that they use a

1544: larger number of features, methods for predicting multiple

1545: co-occurring clause starts and a more advanced statistical model for

1546: combining brackets to clauses.

1547:

1548: In a post-conference study, we have attempted to estimate more

1549: precisely the cause of the performance difference between our method

1550: and the boosted decision trees used by \cite{carreras2001}.

1551: Our hypothesis was that not only the choice of system made a

1552: difference, but also the choice of features.

1553: For this purpose, Carreras and M\`arquez kindly repeated an experiment

1554: in predicting open brackets but this time while using our feature set:

1555: pairs of information using a window of one word left and one right,

1556: while results were combined with majority voting (Table

1557: \ref{tab-cl10}, left, row 9, column 1).

1558: The experiment was performed while testing on the CoNLL-2001

1559: development data set.

1560: Originally the memory-based learner obtained F$_{\beta=1}$ = 89.80 on

1561: this data set while their boosted decision tree approach reached

1562: 93.89.

1563: However, while using the memory-based feature set, the performance of

1564: the decision trees dropped to 91.32.

1565: When both systems use the same features, the boosted decision trees

1566: still outperform the memory-based learner (91.32 compared with 89.80).

1567: However, the decision trees perform better still with their own feature set (93.89).

1568: Our hypothesis was correct: the performance difference between the two

1569: approaches was caused by both the choice of the learner and the choice of

1570: the feature set.

1571:

1572: The next obvious question is whether the memory-based system would

1573: perform better with the feature set of the boosted decision trees.

1574: Providing an answer to this question was nontrivial.

1575: The feature set consisted of thousands of binary features which were

1576: more than the memory-based learner could handle.

1577: After converting the features from binary-valued to multi-valued,

1578: there were about 70 features left.

1579: At best, the system obtained F$_{\beta=1}$ = 90.52 with this feature

1580: set.

1581: Since we feared that the number of features was still too large for

1582: the system to handle, we performed a forward sequential selection

1583: search process in the feature space starting with zero features.

1584: The memory-based learner reached an optimal performance with 13

1585: features at F$_{\beta=1}$ = 91.82.

1586: These results show that there is still room for improvement for the

1587: memory-based learner but that cooperation with a feature selection

1588: method will be helpful.
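
A forward sequential selection search of the kind mentioned above can be sketched as a greedy loop. The sketch assumes a user-supplied function \texttt{evaluate} which returns a score, for example a cross-validated F$_{\beta=1}$ rate, for a given list of features; both this function and the name \texttt{forward\_selection} are hypothetical.

```python
def forward_selection(features, evaluate):
    """Greedy forward search starting with zero features.

    In every round the feature whose addition gives the highest score is
    added; the search stops when no addition improves the current score.
    """
    selected = []
    best_score = evaluate(selected)
    remaining = list(features)
    while remaining:
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score
```

Since every round evaluates all remaining features, the number of training runs grows quadratically with the number of features in the worst case.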

1589:

1590: % results collins ?

1591:

1592: \subsection{Noun Phrase Parsing}

1593: \label{sec-npp}

1594:

1595: Noun phrase parsing is similar to noun phrase chunking but this time

1596: the goal is to find noun phrases at all levels.

1597: This means that just like in the clause identification task we need to

1598: be able to recognize embedded phrases.

1599: The following example sentence will illustrate this:

1600:

1601: \begin{quote}

1602: In ( early trading ) in ( Hong Kong ) ( Monday ) , ( gold ) was quoted \\

1603: at ( ( \$ 366.50 ) ( an ounce ) ) .

1604: \end{quote}

1605:

1606: \noindent

1607: This sentence contains seven noun phrases of which the one containing

1608: the final four words of the sentence consists of two embedded noun

1609: phrases.

1610: If we use the same approach as for clause identification, retrieving

1611: brackets of all phrase levels in one step and balancing these, we will

1612: probably not detect this noun phrase because it starts and ends

1613: together with other noun phrases.

1614: Therefore we will use a different approach here.

1615:

1616: We will recover noun phrases at different levels by performing

1617: repeated chunking \citep{tks2000naacl}.

1618: We will start with data containing words and part-of-speech tags and

1619: identify the base noun phrases in this data with techniques used in

1620: our noun phrase chunking work.

1621: After this we will replace the phrases that were found by the head

1622: words and their tags.

1623: This will create a summary of the sentences with words and a mixed

1624: data stream of POS tags and chunk tags.

1625: We can apply our noun phrase chunking techniques to this data one more

1626: time and find noun phrases one level above the base level.

1627: The compressing and chunking steps will be repeated in order to

1628: retrieve phrases at higher levels.

1629: The process will stop when no new phrases are found.
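
The repeated chunking procedure can be summarized in a short sketch. It assumes that one chunker per level is available as a function mapping a list of (word, tag) pairs to a list of non-overlapping (start, end) index pairs, and that a head-finding function is supplied. The names \texttt{cascaded\_chunking} and \texttt{pair\_chunker} are illustrative, and \texttt{pair\_chunker} is only a toy stand-in for a trained memory-based chunker.

```python
def cascaded_chunking(tokens, chunkers, find_head):
    """Find phrases at several levels by repeated chunking and compression.

    tokens: list of (word, tag); returns phrases as (start, end) positions
    in the original sentence. Stops when a level yields no new phrases.
    """
    units = [(word, tag, i, i) for i, (word, tag) in enumerate(tokens)]
    found = []
    for chunker in chunkers:
        spans = dict(chunker([(w, t) for w, t, _, _ in units]))
        new_units, new_phrase, i = [], False, 0
        while i < len(units):
            if i in spans:
                group = units[i:spans[i] + 1]
                phrase = (group[0][2], group[-1][3])
                if phrase not in found:
                    found.append(phrase)
                    new_phrase = True
                # replace the phrase by its head word and a fixed chunk tag
                head = find_head([(w, t) for w, t, _, _ in group])
                new_units.append((head, "NP", phrase[0], phrase[1]))
                i = spans[i] + 1
            else:
                new_units.append(units[i])
                i += 1
        if not new_phrase:
            break
        units = new_units
    return found


def pair_chunker(tag_pairs):
    """Toy chunker: joins two adjacent units whose tag pair is in tag_pairs."""
    def chunker(units):
        spans, i = [], 0
        while i + 1 < len(units):
            if (units[i][1], units[i + 1][1]) in tag_pairs:
                spans.append((i, i + 1))
                i += 2
            else:
                i += 1
        return spans
    return chunker
```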

1630:

1631: The approach described here seems a trivial expansion of our noun

1632: phrase chunking work.

1633: However, there are some details left to discuss.

1634: First, there is the selection of the head word during the phrase

1635: summarization process.

1636: At the time we performed these experiments, we did not have access to

1637: the Magerman/Collins set of rules for determining head words, and

1638: therefore we used a rule created by ourselves: the head word of a noun

1639: phrase is the final word of the first noun cluster in the phrase or

1640: the final word of the phrase if it does not contain a noun cluster.
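
The head word rule stated above can be written down directly. The sketch assumes that a noun cluster is a maximal sequence of adjacent words whose Penn Treebank POS tag starts with \texttt{NN}; this definition of a noun cluster and the name \texttt{np\_head} are assumptions.

```python
def np_head(phrase):
    """Head word of a noun phrase given as a list of (word, pos) pairs.

    Returns the final word of the first noun cluster, or the final word
    of the phrase if it contains no noun cluster.
    """
    in_cluster = False
    for i, (word, pos) in enumerate(phrase):
        if pos.startswith("NN"):
            in_cluster = True
        elif in_cluster:
            # the first noun cluster ended at the previous word
            return phrase[i - 1][0]
    return phrase[-1][0]
```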

1641:

1642: The second fact we should mention is that the data we used contains

1643: a different format of noun phrase chunks than the data we have

1644: previously worked with.

1645: In this task we use the data set which was developed for the noun

1646: phrase bracketing shared task of CoNLL-99 \citep{osborne99}.

1647: It was extracted from the Wall Street Journal part of the Penn

1648: Treebank \citep{marcus93} without extra modifications and this means,

1649: for example, that possessives between two noun phrases have been

1650: attached to the first one unlike in the noun phrase chunking data.

1651: This and other differences mean that we cannot be sure that the

1652: techniques we developed for the other base noun phrase format will

1653: work very well here.

1654: Indeed, there is a performance drop in the chunking part of our shallow

1655: parser when compared with the chunking work (F$_{\beta=1}$ of 92.77

1656: compared with 93.34).

1657: However, we decided not to put extra work in searching for a better

1658: configuration for our noun phrase chunker and have trained an existing

1659: chunker with the data available for this task.

1660:

1661: An unforeseen problem occurred when we attempted to use the chunker for

1662: identifying noun phrases above the base level.

1663: Our chunker output is a majority vote of five systems using different

1664: data representations.

1665: In our evaluation work with tuning data (WSJ section 21), we

1666: observed that the overall output of the chunker at nonbase levels was

1667: worse than the performance of the best individual system

1668: \citep{tks2000naacl}.

1669: The reason for this is that the system that used the O+C data

1670: representation outperformed the other four systems by a large margin.

1671: Because of this, and probably because the other four systems

1672: made similar errors, the errors of the four cancelled some of the

1673: correct analyses of the best system and caused the majority vote to be

1674: worse than the best individual system.

1675: For this reason we have decided to use only the bracket

1676: representations when processing noun phrases above base levels.

1677:

1678: The main open question in this study is what training data to use

1679: when processing the nonbase noun phrases.

1680: In order to find an answer to this question we have tested several

1681: configurations while processing tuning data, WSJ section 21, with

1682: the training data for the CoNLL-99 shared task.

1683: We have tested six training data configurations for predicting open

1684: and close bracket positions: using all bracket positions, those of

1685: base phrases only, those of all phrases except base phrases, those of

1686: phrases of the current level only, those of the current level and the

1687: previous, and those of the current level and the next.

1688: At all levels, using the brackets of the current level only proved to

1689: be working best or close to best.

1690: At the sixth level no new noun phrases were detected.

1691: Therefore we decided to use only brackets of one phrase level in the

1692: training data for nonbase phrases and stop phrase identification after

1693: six levels.

1694:

1695: We have applied a noun phrase chunker with fixed symmetrical context

1696: sizes to the noun phrase data of the CoNLL-99 shared task

1697: \citep{tks2000naacl}.

1698: The chunker generated a majority vote of open and close brackets put

1699: forward by five systems, each of which used a different representation

1700: of the base noun phrases (IOB1, IOB2, IOE1, IOE2 and O+C).

1701: All systems used a window of four left and four right for words and

1702: POS tags (18 features) and the four systems using IO representations

1703: additionally performed an extra pass with a window of three left and

1704: three right for words and POS tags, and a window of two left and two

1705: right without the focus tag for chunk tags (also 18 features).

1706: The output of the chunker was presented to a cascade of six chunkers,

1707: each of which consisted of a pair of open and close bracket predictors

1708: which were trained with brackets from one of the levels 1 to 6.

1709: After each chunk phase the phrases found were replaced by the head

1710: word of the phrase and a fixed chunk tag.

1711:

1712: The system obtained an overall F$_{\beta=1}$ rate of 83.79 (precision

1713: 90.00\% and recall 78.38\%) for identifying arbitrary noun

1714: phrases.\footnote{This performance was already reported by

1715: \cite{tks2000naacl}.}

1716: It is slightly better than our performance at CoNLL-99 (82.98,

1717: obtained without system combination) which was the best of two entries

1718: submitted for the shared task at that workshop.

1719: The performance of our noun phrase chunker can be regarded as a

1720: baseline score for this data set.

1721: This score is already quite high: F$_{\beta=1}$ = 79.70, and it seems

1722: that the nonbase level chunkers have not been contributing much to the

1723: performance of this shallow parser.

1724: Out of curiosity we have also examined how well a full parser does on

1725: the task of identifying arbitrary noun phrases.

1726: For this purpose we looked at output data of a parser described by

1727: \cite{collins99} which was provided with the parser code (WSJ section

1728: 23, model 2).

1729: The parser obtained F$_{\beta=1}$ = 89.8 (precision 89.3\% and recall

1730: 90.4\%) for this task.

1731: This is a lot better than our shallow parser but we should note that

1732: compared with our application, the Collins parser has access to better

1733: part-of-speech tags and more training data with more sophisticated

1734: annotation rather than only noun phrase boundaries.

1735:

1736: \subsection{Full Parsing}

1737: \label{sec-par}

1738:

1739: The approach  for parsing noun phrases outlined in the previous

1740: section can be used for generating parse trees containing phrases of

1741: arbitrary types as well.

1742: In that case we would be using chunking techniques for performing full

1743: parsing.

1744: This is not a new idea.

1745: \cite{ejerhed83} present a Swedish grammar which includes noun phrase

1746: chunk rules.

1747: \cite{abney91} describes a chunk parser which consists of two parts:

1748: one that finds base chunks and another that attaches the chunks

1749: to each other in order to obtain parse trees.

1750: \cite{daelemans95} suggested finding long-distance dependencies with a

1751: cascade of lazy learners among which were constituent identifiers.

1752: \cite{ratnaparkhi98} built a parser based on a chunker with an

1753: additional bottom-up process which determines at what position to

1754: start new phrases or to join constituents with earlier ones.

1755: With this approach he obtained state-of-the-art parsing results.

1756: \cite{brants99} applied a cascade of Markov model chunkers to the task

1757: of parsing German sentences.

1758: We have extended our noun phrase parsing techniques to parsing

1759: arbitrary phrases \citep{tks2001clin}.

1760: We will present the main findings of this study here as well.

1761:

1762: The standard data sets for testing statistical parsers are different

1763: than the ones we used for our earlier work on chunking and shallow

1764: parsing.

1765: The data sets have been extracted from the Wall Street Journal (WSJ)

1766: part of the Penn Treebank \citep{marcus93} as well but they contain

1767: different segments.

1768: The training data consists of sections 02-21 (39,832 sentences) while

1769: section 23 is used as test data (2416 sentences).

1770: The data sets consist of words and part-of-speech tags, which have

1771: been generated by the part-of-speech tagger described by

1772: \cite{ratnaparkhi96}.

1773: In the data the phrase types ADVP and PRT have been collapsed into one

1774: category and during evaluation the positions of punctuation signs in

1775: the parse tree have been ignored.

1776: These adaptations have been done by different authors in order to make

1777: it possible to compare the results of their systems with the first

1778: study that used these data sets \citep{magerman95} and all follow-up

1779: work.

1780:

1781: In our work on arbitrary parsing, we were interested in finding an

1782: answer to four questions.

1783: In order to obtain these answers, we have performed tests with smaller

1784: data sets which were taken from the standard training data for this

1785: task: WSJ sections 15-18 as training data and section 20 as test data.

1786: The first topic we were interested in was the influence of context

1787: size and the size of the examined nearest neighborhood (parameter k

1788: of the memory-based learner) on the performance of the parser.

1789: We took the noun phrase parser developed in the previous section,

1790: lifted its restriction of generating noun phrases only and applied it

1791: to this data set while using different context sizes and values for

1792: parameter k for the classifiers that identified phrases above the base

1793: levels.

1794: The different types of the chunks were derived by using the

1795: Double-Phase Approach for chunking (see Section \ref{sec-chuarb}).

1796: The best configuration we found was a context of two left and two

1797: right words and POS tags with k equal to 1.

1798: The nearest neighborhood size is smaller than the one used in our earlier work

1799: (3) and the best context size is smaller than in our noun phrase

1800: chunking work (4).

1801: However, the best context size we found for this task is exactly the

1802: same as reported by \cite{ratnaparkhi98}.

1803:

1804: The second topic we were interested in was the type of training data

1805: that should be used for finding phrases above the base level.

1806: In our noun phrase parsing work, we found that the best performance

1807: could be obtained by using only data of the current phrase level.

1808: This will cause problems for our parser, since the tree depth may

1809: become as large as 31 in our corpus but there will be little training

1810: material available for these high level phrases if we use the same

1811: training configuration as in our noun phrase parsing work.

1812: We have tested two different training configurations to see if we

1813: could use more training data for this task without losing performance.

1814: With the first of these, using the current, previous and next phrase level,

1815: performance was about as good (F$_{\beta=1}$=77.13) as when using only the

1816: current level (77.17).

1817: However, when we trained the cascade of chunkers while using brackets

1818: of all phrase levels, the performance dropped to 67.49.

1819: We have decided to keep on using the current phrase level only in the

1820: training data despite its problems with identifying higher level

1821: phrases.

1822:

1823: In the results that we have presented in this paper, the precision

1824: rates have always been higher than the recall rates.

1825: In part, this is caused by the method we use for balancing open

1826: brackets and close brackets.

1827: It removes all brackets which cannot be matched with another one, which

1828: is approximately the same as accepting clauses which are likely to be

1829: correct and throwing away all others.

1830: We wanted to test if we could obtain more balanced precision and

1831: recall rates because we hoped that these would lead to a better

1832: F$_{\beta=1}$ rate.

1833: Therefore we have tested two alternative methods for combining

1834: brackets.

1835: The first disregarded the type of the open brackets and allowed close

1836: brackets to be combined with open brackets of any type.

1837: The second method allowed open brackets to match with close brackets

1838: of any type.

1839: Unfortunately neither the first (F$_{\beta=1}$=72.33) nor the second

1840: method (76.06) managed to obtain the same F$_{\beta=1}$ rate as our

1841: standard method for combining brackets.

1842: Therefore we decided to stick with the latter.

1843:

1844: The final issue which we wanted to examine is the performance

1845: progression of the parser at the different levels of the process.

1846: The recall of the parser should increase for every extra step in the

1847: cascade of chunkers but we would also like to know how precision and

1848: F$_{\beta=1}$ progressed.

1849: We have measured this for our small parameter tuning data set and

1850: found that indeed recall increased until level 30 of a maximum of 32

1851: and remained constant after that.

1852: Precision dropped until the same level, remaining at the same value

1853: afterwards while F$_{\beta=1}$ reached a top value at level 19 and

1854: dropped afterwards.

1855: The reason for the later drop in F$_{\beta=1}$ value is that while

1856: the recall is still rising, it cannot make up for the loss of

1857: precision at later levels.

1858: Since we want to optimize the F$_{\beta=1}$ rate, we have decided to

1859: restrict the number of cascaded chunkers in our parser to 19 levels.

1860: We have added an extra post-processing step which after the 19 levels

1861: of processing adds clause brackets (S) around sentences which have not

1862: already been identified as clauses.

1863:

1864: We have applied the best parser configuration found to the standard

1865: parsing data.

1866: Our parser used an arbitrary chunker with the configuration described

1867: in Section \ref{sec-chuarb} (a Majority Vote of five systems using

1868: different data representations) but trained with the relevant data for

1869: this task.

1870: Higher level phrases were identified by a cascade of 19 chunkers, each

1871: of which had a pair of independent open and close bracket classifiers

1872: which used a context of two left and two right words and POS tags

1873: while being trained with brackets of the current level only.

1874: At each level, open and close brackets were combined into chunks by

1875: removing all brackets that could not be matched with a bracket of the

1876: same type.

1877: The parser contained a post-processing step which added clause

1878: brackets around sentences which were not identified as a clause after

1879: the 19 processing stages.

1880: This chunk parser obtained an F$_{\beta=1}$ rate of 80.49 on WSJ

1881: section 23 (precision 82.34\% and recall 78.72\%).

1882:
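As a consistency check on these figures, and assuming the usual
definition of F$_{\beta=1}$ as the harmonic mean of precision $P$ and
recall $R$ (the symbols $P$ and $R$ are introduced here for
illustration only), the reported score follows directly from the two
rates:
\[
\mbox{F}_{\beta=1} = \frac{2PR}{P+R}
  = \frac{2 \times 82.34 \times 78.72}{82.34 + 78.72}
  \approx 80.49 .
\]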

1883: The performance of our chunk parser is modest compared with

1884: state-of-the-art statistical parsers, which obtain F$_{\beta=1}$ rates of

1885: around 90 \citep{collins99,charniak2000}.

1886: However, we have a couple of suggestions for improving its

1887: performance.

1888: First, we could attempt giving the parser access to more information,

1889: for example about lower phrase levels.

1890: Currently, the parser only knows the head words and phrase types of

1891: daughters of phrases that are being built and this might not be

1892: enough.

1893: Second, we could try to find a better method for predicting bracket

1894: positions.

1895: For reasons explained in the previous section, we could not use a

1896: majority vote of systems using different representations.

1897: This might have helped to obtain a better performance.

1898: Finally, we would like to change the greedy approach of our parser.

1899: Currently it chooses the best segmentation of chunks at each level and

1900: builds on that but ideally it would be able to remember some

1901: next-to-best configurations as well and perform backtracking from the

1902: earlier choices whenever necessary.

1903: This approach would probably improve performance considerably

1904: (as shown by \cite{ratnaparkhi98}, Table 6.5).

1905: A practical problem which needs to be solved here is that in nearest

1906: neighbor memory-based learning alternative classes do not receive

1907: confidence measures.

1908: Rather, sets of item-dependent distances are used to determine the

1909: usability of the classes.

1910: Comparing partial trees requires comparing sets of distances and

1911: it is not obvious how this should be done.

1912:

1913: These extra measures will probably improve the performance of the

1914: chunk parser.

1915: However, it is questionable whether it is worthwhile continuing with this

1916: approach.

1917: The present version of the parser already requires a lot of memory and

1918: processing time: more than a second per {\it word} for chunking only

1919: compared with a mere 0.14 seconds per {\it sentence} for a statistical

1920: parser which performed better \citep{ratnaparkhi98}.

1921: Extra extensions will probably slow down the parser even more so we

1922: are not sure if extending this approach is worth the trouble.

1923:

1924: \begin{table}[t]

1925: \begin{center}

1926: \begin{tabular}{|l|c|c|c|}\cline{2-4}

1927: \multicolumn{1}{l|}{section 20} &

1928:    precision & recall & F$_{\beta=1}$ \\\hline

1929: \cite{kudoh2001}     & 94.15\% & 94.29\% & 94.22 \\

1930: \cite{tks2000coling} & 94.18\% & 93.55\% & 93.86 \\

1931: MBL                  & 94.01\% & 92.67\% & 93.34 \\

1932: \cite{tks2000naacl}  & 93.63\% & 92.89\% & 93.26 \\

1933: \cite{munoz99}       & 92.4\%  & 93.1\%  & 92.8  \\

1934: \cite{ramshaw95}     & 91.80\% & 92.27\% & 92.03 \\

1935: \cite{argamon99}     & 91.6\%  & 91.6\%  & 91.6  \\\hline

1936: baseline             & 78.20\% & 81.87\% & 79.99 \\\hline

1937: \end{tabular}

1938: \end{center}

1939: \caption{A selection of results that have been published for the

1940: Ramshaw and Marcus data sets for noun phrase chunking.

1941: Our chunker (MBL) is third-best.

1942: The baseline results have been produced by a system that selects the

1943: most frequent chunk tag (IOB1) for each part-of-speech tag.

1944: The best performance for this task has been obtained by a system using

1945: Support Vector Machines

1946: \citep{kudoh2001}.

1947: }

1948: \label{tab-resnp}

1949: \end{table}

1950:

1951: \section{Related Work}

1952:

1953: In this section we will compare our work with that of others that have

1954: applied machine learning techniques to the same data sets.

1955: First we will discuss the two chunking tasks and then the tasks that

1956: required output of embedded structures.

1957: Many systems have been applied to the five tasks.

1958: Rather than giving a detailed description of each of them, we will

1959: list the best performing systems for each task and mention some

1960: differences between these systems and ours.

1961: This comparison of our memory-based shallow parsers with other work

1962: shows that they produce state-of-the-art results for the chunking

1963: tasks but not for the tasks which require identification of embedded

1964: structures.

1965:

1966: \subsection{Chunking}

1967:

1968: Table \ref{tab-resnp} shows a selection of the best results published

1969: for the noun phrase chunking task.\footnote{An elaborate overview of

1970: most of the systems that have been applied to this task can be found

1971: on http://lcg-www.uia.ac.be/\~{ }erikt/research/np-chunking.html}

1972: As far as we know, the results presented in this paper (line MBL) are

1973: the third-best results.

1974: We have participated in producing the second-best result

1975: \citep{tks2000coling} which was produced by combining the results

1976: of five different learning techniques.

1977: The best results for this data set have been generated with Support

1978: Vector Machines \citep{kudoh2001}.\footnote{Although we do not wish to

1979: underestimate the power of Support Vector Machines, we should note

1980: that it seems that the optimal results presented by \cite{kudoh2001}

1981: have been obtained by tuning the system to the test data.}

1982: A statistical analysis of our current result revealed that all

1983: performances outside of the region 92.93-93.73 are significantly

1984: different from ours.

1985: This means that all results in the table, except for the 93.26, are

1986: significantly different from ours.

1987:

1988: A topic to which we have paid little attention is the analysis of the

1989: errors that our approach makes.

1990: Such an analysis would provide insights into the weaknesses of the

1991: system and might provide clues to methods for improving the system.

1992: For noun phrase chunking we have performed a limited error analysis

1993: by manually evaluating the errors that were made in the first section

1994: of a 10-fold cross-validation experiment on the training data while

1995: using the chunker described by \cite{tks2000naacl}.

1996: This analysis revealed that the largest group of errors was caused by

1997: errors in the part-of-speech tags (28\% of the false positives/29\% of

1998: the false negatives).

1999: In order to obtain realistic results, it is customary not to use the

2000: part-of-speech tags from the Treebank, but to use tags that have been

2001: generated by a part-of-speech tagger.

2002: This prevents the system performance from reaching levels which would

2003: be unattainable for texts for which no perfect part-of-speech tags

2004: exist.

2005: Unfortunately the tagger makes errors and some of these errors cause

2006: the noun phrase segmentation to become incorrect.

2007:

2008: The second most frequently occurring error cause was related to

2009: conjunctions of noun phrases (16\%/18\%).

2010: Deciding whether a phrase like {\it red dwarfs and giants} consists of

2011: one or two noun phrases requires semantic knowledge and might be too

2012: ambitious for present-day systems to solve.

2013: The other major causes of errors all relate to similar hard cases:

2014: attachment of punctuation signs (15\%/12\%; inconsistent in the

2015: Treebank), deciding whether ambiguous phrases without conjunctions

2016: should be one or two noun phrases (11\%/12\%), adverb attachment

2017: (5\%/4\%), noun phrases containing the word {\it to} (3\%/3\%),

2018: Treebank noun phrase segmentation errors (3\%/1\%) and noun phrases

2019: consisting of the word {\it that} (0\%/2\%).

2020: Apart from these hard cases there also were quite a few errors for

2021: which we could not determine an obvious cause (19\%/19\%).

2022:

2023: The most obvious suggestion for improvement that came out of the error

2024: analysis was to use a better part-of-speech tagger.

2025: We are currently using the Brill tagger \citep{brill94}.

2026: Better taggers are available nowadays but using the Brill tags here

2027: was necessary in order to be able to compare our approach with earlier

2028: studies, which have used the Brill tags as well.

2029: The error analysis did not produce other immediate suggestions for

2030: improving our noun phrase chunking approach.

2031: We are relieved about this because it would have been an

2032: embarrassment if our chunker had produced systematic errors.

2033: However, there is a trivial way to improve the results of the noun

2034: phrase chunker: by using more training data.

2035: Different studies have shown that by increasing the training data size

2036: by 300\%, the F$_{\beta=1}$ error might drop by as much as 25\%

2037: \citep{ramshaw95,tks2000naacl,kudoh2001}.
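To give a feeling for the size of such a gain, consider a purely
hypothetical extrapolation (not a measured result) in which the same
25\% relative reduction is applied to the MBL score of
Table \ref{tab-resnp}, with the error taken to be
$100-\mbox{F}_{\beta=1}$:
\[
100 - 93.34 = 6.66, \qquad
0.75 \times 6.66 \approx 5.0, \qquad
\mbox{F}_{\beta=1} \approx 100 - 5.0 = 95.0 .
\]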

2038: Another study for a different problem, confusion set disambiguation,

2039: has shown that a further cut in the error rate is possible with even

2040: larger training data sets \citep{banko2001}.

2041: In order to test this for noun phrase chunking we need a hand-parsed

2042: corpus which is larger than anything that is presently available.

2043:

2044: \begin{table}[t]

2045: \begin{center}

2046: \begin{tabular}{|l|c|c|c|}\cline{2-4}

2047: \multicolumn{1}{l|}{section 20} &

2048:    precision & recall & F$_{\beta=1}$ \\\hline

2049: \cite{zhang2001}     & 94.29\% & 94.01\% & 94.13 \\

2050: \cite{kudoh2001}     & 93.89\% & 93.92\% & 93.91 \\

2051: \cite{kudoh2000}     & 93.45\% & 93.51\% & 93.48 \\

2052: \cite{hvh2000}       & 93.13\% & 93.51\% & 93.32 \\

2053: \cite{tks2000conll}  & 94.04\% & 91.00\% & 92.50 \\

2054: \cite{zhou2000}      & 91.99\% & 92.25\% & 92.12 \\

2055: \cite{dejean2000}    & 91.87\% & 92.31\% & 92.09 \\\hline

2056: baseline             & 72.58\% & 82.14\% & 77.07 \\\hline

2057: \end{tabular}

2058: \end{center}

2059: \caption{A selection of results that have been published for the

2060: arbitrary chunking data set of the CoNLL-2000 shared task.

2061: Our chunker

2062: \citep{tks2000conll}

2063: is fifth-best.

2064: The baseline results have been produced by a system that selects the

2065: most frequent chunk tag (IOB1) for each part-of-speech tag.

2066: The best performance for this task has been obtained by a system using

2067: regularized Winnow

2068: \citep{zhang2001}.

2069: Systems that have been applied both to the arbitrary chunking task and

2070: the noun phrase chunking task performed approximately equally well for

2071: NP chunks in both tasks.

2072: }

2073: \label{tab-resxp}

2074: \end{table}

2075:

2076: Table \ref{tab-resxp} contains a selection of the best results published

2077: for the arbitrary chunking data used in the CoNLL-2000 shared

2078: task.\footnote{More results for the chunking task can be found on

2079: http://lcg-www.uia.ac.be/conll2000/chunking/}

2080: Our chunker \citep{tks2000conll} is the fifth-best on this list.

2081: Immediately obvious is the imbalance between precision and recall:

2082: the system identifies a small number of phrases with a high precision

2083: rate.

2084: We assume that this is primarily caused by our method for generating

2085: balanced structures from streams of open and close brackets.

2086: We have performed a bootstrap resampling test on the chunk tag

2087: sequence associated with this result.

2088: An evaluation of 10,000 pairs indicated that the significance interval

2089: for our system (F$_{\beta=1}$ = 92.50) is 92.18-92.81 which means that

2090: all systems that are ahead of ours perform significantly better and all

2091: systems that are behind perform  significantly worse.

2092: We are not sure what is causing these large performance differences.

2093: At this moment we assume that our approach has difficulty with

2094: classification tasks when the number of different output classes

2095: increases.

2096:

2097: \begin{table}[t]

2098: \begin{center}

2099: \begin{tabular}{|l|c|c|c|}\cline{2-4}

2100: \multicolumn{1}{l|}{section 21}

2101:                       & precision & recall & F$_{\beta=1}$ \\\hline

2102: \cite{carreras2001}   & 84.82\% & 73.28\% & 78.63 \\

2103: \cite{molina2001}     & 70.89\% & 65.57\% & 68.12 \\

2104: \cite{tks2001conll}   & 76.91\% & 60.61\% & 67.79 \\

2105: \cite{patrick2001}    & 73.75\% & 60.00\% & 66.17 \\

2106: \cite{dejean2001}     & 72.56\% & 54.55\% & 62.77 \\

2107: \cite{hammerton2001b} & 55.81\% & 45.99\% & 50.42 \\\hline

2108: baseline              & 98.44\% & 31.48\% & 47.71 \\\hline

2109: \end{tabular}

2110: \end{center}

2111: \caption{Results of the clause identification part of the CoNLL-2001

2112: shared task.

2113: Our clause identifier \citep{tks2001conll} is third-best.

2114: The baseline results have been produced by a system that only puts

2115: clause brackets around complete sentences.

2116: The best performance for this task has been obtained by a system using

2117: boosted decision trees \citep{carreras2001}.

2118: }

2119: \label{tab-rescl}

2120: \end{table}

2121:

2122: \subsection{Parsing}

2123:

2124: A complete overview of the clause identification results of the

2125: CoNLL-2001 shared task can be found in Table \ref{tab-rescl}

2126: \citep{tksdejean2001conll}.

2127: Our approach was third-best.

2128: A bootstrap resampling test with a population of 10,000 random samples

2129: generated from our results produced the 90\% significance interval

2130: 66.66-68.95 for our system which means that our result is not

2131: significantly different from the second result.
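The way such an interval is read off from the resampled scores is not
spelled out in this section; one common choice, assumed here purely
for illustration, is the percentile method.
With $B=10{,}000$ resampled scores sorted as
$F_{(1)} \leq \ldots \leq F_{(B)}$ (notation introduced for this
sketch only), a 90\% interval would be
\[
\left[ F_{(0.05B)},\; F_{(0.95B)} \right]
  = \left[ F_{(500)},\; F_{(9500)} \right] ,
\]
and the second result in Table \ref{tab-rescl}, 68.12, indeed lies
inside the reported interval 66.66-68.95.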

2132: The boosted decision trees used by \cite{carreras2001} performed much

2133: better than the other systems.

2134: In Section \ref{sec-clauses}, we have made a comparison between the

2135: performance of this system and ours and concluded that the performance

2136: differences were caused both by the choice of learning system and

2137: by a difference in the features chosen for representing the task.

2138:

2139: The noun phrase parsing task has not received much attention in the

2140: research community and there are only a few results to compare with.

2141: \cite{osborne99} used a grammar-extension method based on Minimum

2142: Description Length and applied it to a Definite Clause Grammar.

2143: His system used different training and test segments of the Penn

2144: Treebank than we did.

2145: At best, it obtained an F$_{\beta=1}$ rate of 60.0 on the test data

2146: (precision 53.2\% and recall 68.7\%).

2147: \cite{krymolowski2000} applied a memory-based learning technique

2148: specialized for learning sequences to a noun phrase parsing task.

2149: Their system obtained F$_{\beta=1}$=83.7 (precision 88.5\% and recall

2150: 79.3\%) on yet another segment of the Treebank.

2151: This performance is very close to that of our approach

2152: (F$_{\beta=1}$=83.79).

2153: The memory-based sequence learner used much more training data than

2154: ours (about four times as much) but unlike our method, it generated

2155: its output without using lexical information, which is impressive.

2156: The performance of the Collins parser on the subtask of noun phrase

2157: parsing which we mentioned in Section \ref{sec-npp}

2158: (F$_{\beta=1}$=89.8) shows that there is room for improvement left for

2159: all systems that were discussed here.\footnote{Our full parser, which

2160: was trained and tested on the same data as the Collins parser,

2161: obtained F$_{\beta=1}$=86.96 for recognizing NP phrases only.}

2162:

2163: \begin{table}[t]

2164: \begin{center}

2165: \begin{tabular}{|l|c|c|c|}\cline{2-4}

2166: \multicolumn{1}{l|}{section 23} &

2167:    precision & recall & F$_{\beta=1}$ \\\hline

2168: \cite{collins2000}   & 89.9\% & 89.6\% & 89.7 \\

2169: \cite{bod2001}       & 89.7\% & 89.7\% & 89.7 \\

2170: \cite{charniak2000}  & 89.5\% & 89.6\% & 89.5 \\

2171: \cite{collins99}     & 88.3\% & 88.1\% & 88.2 \\

2172: \cite{ratnaparkhi98} & 87.5\% & 86.3\% & 86.9 \\

2173: \cite{charniak97}    & 86.6\% & 86.7\% & 86.6 \\

2174: \cite{magerman95}    & 84.3\% & 84.0\% & 84.1 \\

2175: \cite{tks2001clin}   & 82.3\% & 78.7\% & 80.5 \\\hline

2176: \end{tabular}

2177: \end{center}

2178: \caption{A selection of results that have been published for

2179: parsing sentences shorter than 100 words of the Penn Treebank.

2180: The performance of our parser \citep{tks2001clin} is not quite

2181: state-of-the-art.

2182: The best performance for this task has been obtained by statistical

2183: parsers and data-oriented parsers

2184: \citep{collins2000,charniak2000,bod2000}.

2185: }

2186: \label{tab-respa}

2187: \end{table}

2188:

2189: A selection of results for parsing the Penn Treebank can be found in

2190: Table \ref{tab-respa}.

2191: The F$_{\beta=1}$ error rate of the best systems is about half of that

2192: of ours.
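Taking the error rate to be $100-\mbox{F}_{\beta=1}$ (our reading of
the term, applied to the rounded values in Table \ref{tab-respa}), the
comparison works out as
\[
\frac{100 - 89.7}{100 - 80.5} = \frac{10.3}{19.5} \approx 0.53 .
\]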

2193: A more detailed comparison of the output data of our memory-based

2194: parser and one of the versions of the Collins parser

2195: \citep[][model 2]{collins99} has shown that the large performance difference

2196: is caused by the way nonbase phrases are processed

2197: \citep{tks2001clin}.

2198: Our chunker performs reasonably well compared with the first stage of

2199: the Collins parser (F$_{\beta=1}$ = 49.30 compared with 49.85).

2200: Especially at the first few levels after the base levels, our parser

2201: loses F$_{\beta=1}$ points compared with the Collins parser.

2202: The initial difference of 0.65 at the base level grows to 2.92 after

2203: three more levels, 5.16 after six and 6.13 after nine levels with a

2204: final difference of 6.59 after 20 levels \citep{tks2001clin}.

2205: At the end of Section \ref{sec-par}, we have put forward some

2206: suggestions for improving our parser.

2207: However, we have also noted that further improvement might not

2208: be worthwhile because it will make our parser even slower than it

2209: already is.

2210:

2211: \section{Concluding Remarks}

2212:

2213: We have presented memory-based approaches to shallow parsing and we

2214: have applied these to five tasks: noun phrase chunking, arbitrary

2215: chunking, clause identification, noun phrase parsing and full

2216: parsing.

2217: We have used two additional techniques for improving the performance

2218: of our shallow parsers: feature selection and system combination.

2219: The first was used to compensate for a problem of the memory-based

2220: learner: it has difficulty ignoring features that are not

2221: immediately relevant.

2222: While feature selection worked well in one study (clause

2223: identification with large feature sets), it did not make much

2224: difference to the overall performance of our noun phrase chunker.

2225: We believe that other techniques that were incorporated in the chunker

2226: (cascading and system combination) have already stretched the

2227: performance of the system to its limits.

2228: Therefore there might not have been much left to gain by using feature

2229: selection.

2230: System combination has proved to be quite useful for generating base

2231: phrases.

2232: Unfortunately, we could not apply it to higher level chunks because

2233: our method for producing different system results, using different

2234: data representations, failed to produce results for higher level phrases

2235: that could be improved with the Majority Voting technique we used for

2236: chunking.

2237:

2238: A comparison of our work with other studies revealed that our

2239: approach works well for base phrase identification, but not for

2240: finding embedded structures.

2241: We have made a couple of suggestions for improving the performance on

2242: tasks that require generating embedded structures: provide different

2243: features to the learners, try to find a method which allows

2244: combination of different systems when working on higher level phrases

2245: and replace the greedy phrase selection approach currently used by one

2246: that allows backtracking from earlier choices.

2247: However, while further improvement is interesting from a scientific

2248: point of view, it might not be useful from a practical point of view.

2249: Our present method is already slower than state-of-the-art full

2250: parsers and it requires more memory.

2251: Extra improvements to this approach will probably slow it down even

2252: more without guaranteeing state-of-the-art performance.

2253:

2254: \acks{

2255: \hspace*{-0.3cm}

2256: We would like to thank

2257: our colleagues of CNTS - Language Technology Group, University of

2258: Antwerp, Belgium and

2259: ILK, University of Tilburg, The Netherlands,

2260: the members of the TMR-LCG network, in particular James Hammerton, and

2261: two anonymous reviewers for

2262: valuable discussions and comments.

2263: We are grateful to Xavier Carreras for his cooperation in the

2264: comparison study of his clause identification system with ours.

2265: This study was funded by the European Training and Mobility of

2266: Researchers (TMR) network Learning Computational

2267: Grammars.\footnote{http://lcg-www.uia.ac.be/}

2268: }

2269:

2270: \vskip 0.2in

2271: \bibliography{ref}

2272:

2273: \end{document}

2274: