0105:cs0105002/acl99.tex

1: \documentclass[11pt]{article}

2: \usepackage{colacl}

3: \usepackage[dvips]{epsfig}

4: \author{ Eric Brill and Grace Ngai\\ Department of Computer Science\\ %

5: The Johns Hopkins University\\ Baltimore, MD 21218, USA\\ %

6: Email: {\tt \{brill,gyn\}@cs.jhu.edu}}

7: \title{\vspace{-65pt}{\normalsize \tt \hfill Appeared in {\em Proceedings of the 37th ACL}, 1999}\\ \mbox{} \\

8: Man\footnotemark[1] \space vs.\ Machine: A Case Study in Base %

9: Noun Phrase Learning}

10:

11: \begin{document}

12:

13: \maketitle

14:

15: \begin{abstract}

16: A great deal of work has been done demonstrating the ability of

17: machine learning algorithms to automatically extract linguistic

18: knowledge from annotated corpora.  Very little work has gone into

19: quantifying the difference in ability at this task between a person

20: and a machine.  This paper is a first step in that direction.

21: \end{abstract}

22:

23: \section{Introduction}

24: \renewcommand{\thefootnote}{\fnsymbol{footnote}}\

25: \footnotetext[1]{and Woman.}

26: \renewcommand{\thefootnote}{\arabic{footnote}}

27:

28:

29: Machine learning has been very successful at solving many problems in

30: the field of natural language processing.  It has been amply

31: demonstrated that a wide assortment of machine learning algorithms are

32: quite effective at extracting linguistic information from

33: manually annotated corpora.

34:

35: Among the machine learning algorithms studied, rule-based systems have

36: proven effective on many natural language processing tasks, including

37: part-of-speech tagging \cite{brill95:RBT,ramshaw94:tagging}, spelling

38: correction \cite{mangu97:cssc}, word-sense disambiguation

39: \cite{gale92:one_sense}, message understanding \cite{day97:alembic},

40: discourse tagging \cite{samuel98:discourse_tagging}, accent

41: restoration \cite{yarowsky94:decision_lists}, prepositional-phrase

42: attachment \cite{brill94:PPattach} and base noun phrase identification

43: \cite{ramshaw99:basenp,cardie98:basenp,veenstra98:basenp,argamon98:basenp}.

44: Many of these rule-based systems learn a short list of simple rules

45: (typically on the order of 50--300) that are easily understood by

46: humans.

47:

48:

49: The fact that these rule-based systems achieve good performance while learning

50: a small list of simple rules raises the question of whether people

51: could also derive an effective rule list manually from an annotated

52: corpus.  In this paper we explore how quickly and effectively

53: relatively untrained people can extract linguistic generalities from a

54: corpus as compared to a machine.  There are a number of reasons for

55: doing this.  We would like to understand the relative strengths and

56: weaknesses of humans versus machines in hopes of marrying their

57: complementary strengths to create even more accurate systems.  Also,

58: since people can use their metaknowledge to generalize from a small

59: number of examples, it is possible that a person could derive

60: effective linguistic knowledge from a much smaller training corpus

61: than that needed by a machine.  A person could also potentially learn

62: more powerful representations than a machine, thereby achieving higher

63: accuracy.

64:

65:

66: In this paper we describe experiments we performed to ascertain how

67: well humans, given an annotated training set, can generate rules for

68: base noun phrase chunking.  Much previous work has been done on this

69: problem and many different methods have been used: Church's PARTS

70: \shortcite{church88:PARTS} program uses a Markov model; Bourigault

71: \shortcite{bourigault92:basenp} uses heuristics along with a grammar;

72: Voutilainen's NPTool \shortcite{voutilainen:NPTool} uses a lexicon

73: combined with a constraint grammar; Justeson and Katz

74: \shortcite{juteson95:basenp} use repeated phrases; Veenstra

75: \shortcite{veenstra98:basenp}, Argamon, Dagan \&

76: Krymolowski \shortcite{argamon98:basenp} and Daelemans, van den Bosch

77: \& Zavrel \shortcite{daelemans99:exceptions} use memory-based systems;

78: Ramshaw \& Marcus \shortcite{ramshaw99:basenp} and Cardie \& Pierce

79: \shortcite{cardie98:basenp} use rule-based systems.

80:

81:

82:

83: \section{Learning Base Noun Phrases by Machine}

84:

85: We used the base noun phrase system of Ramshaw and Marcus (R\&M) as

86: the machine learning system with which to compare the human learners.

87: It is difficult to compare different machine learning approaches to

88: base NP annotation, since different definitions of base NP are used in

89: many of the papers, but the R\&M system is the best of those that have

90: been tested on the Penn Treebank.\footnote{We would like to thank

91: Lance Ramshaw for providing us with the base-NP-annotated training and

92: test corpora that were used in the R\&M system, as well as the rules

93: learned by this system.}

94:

95:

96: To train their system, R\&M used a 200k-word chunk of the Penn

97: Treebank Parsed Wall Street Journal \cite{marcus93:penn_treebank}

98: tagged using a transformation-based tagger \cite{brill95:RBT} and

99: extracted base noun phrases from its parses by selecting noun phrases

100: that contained no nested noun phrases and further processing the data

101: with some heuristics (like treating the possessive marker as the first

102: word of a new base noun phrase) to flatten the recursive structure of

103: the parse.  They cast the problem as a transformation-based tagging

104: problem, where each word is to be labelled with a chunk structure tag

105: from the set \{I, O, B\}, where words marked ``I'' are inside some

106: base NP chunk, those marked ``O'' are not part of any base NP, and

107: those marked ``B'' are the first word of a base NP that

108: immediately follows another base NP.  The training corpus is first

109: run through a part-of-speech tagger.  Then, as a baseline annotation,

110: each word is labelled with the most common chunk structure tag for its

111: part-of-speech tag.
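
The baseline step described above can be illustrated with a short Python sketch. This is not the R\&M implementation: the function names, the triple-based input format, and the toy data are assumptions made only for illustration.

```python
from collections import Counter, defaultdict


def learn_baseline(tagged_corpus):
    """Map each POS tag to its most frequent chunk tag (I, O, or B).

    tagged_corpus: iterable of (word, pos_tag, chunk_tag) triples
    (a hypothetical input format chosen for this sketch).
    """
    counts = defaultdict(Counter)
    for _word, pos, chunk in tagged_corpus:
        counts[pos][chunk] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


def apply_baseline(baseline, pos_tags, default="O"):
    """Label a POS-tagged sentence; unseen POS tags get `default`."""
    return [baseline.get(pos, default) for pos in pos_tags]


# Toy training data, invented for illustration (not from the Treebank).
train = [
    ("the", "DT", "I"), ("cat", "NN", "I"), ("sat", "VBD", "O"),
    ("on", "IN", "O"), ("the", "DT", "I"), ("mat", "NN", "I"),
    ("a", "DT", "B"),
]
baseline = learn_baseline(train)
print(apply_baseline(baseline, ["DT", "NN", "VBD"]))
```

The transformation rules learned afterwards then only need to correct the words for which this per-tag guess is wrong.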

112:

113: After the baseline is achieved, transformation rules fitting a set of

114: rule templates are then learned to improve the ``tagging accuracy'' of

115: the training set.  These templates take into consideration the word,

116: part-of-speech tag and chunk structure tag of the current word and all

117: words within a window of 3 to either side of it.  Applying a rule to a

118: word changes the chunk structure tag of a word and in effect alters

119: the boundaries of the base NP chunks in the sentence.

120:

121: An example of a rule learned by the R\&M system is: {\em change a

122: chunk structure tag of a word from I to B if the word is a determiner,

123: the next word is a noun, and the two previous words both have chunk

124: structure tags of I}.  In other words, a determiner in this context is

125: likely to begin a noun phrase.  The R\&M system learns a total of 500

126: rules.
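
As a rough illustration of how such a rule acts on a tagged sentence, the example rule can be sketched in Python. Two points are assumptions of this sketch rather than facts about the R\&M system: ``noun'' is taken to mean any of the tags NN, NNS, NNP and NNPS, and the rule's conditions are tested against the chunk tags as they stood before the rule fired.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # assumed reading of "noun"


def apply_example_rule(pos_tags, chunk_tags):
    """Change I -> B for a determiner that is followed by a noun and
    preceded by two words whose chunk tags are both I.

    Conditions are checked on the input tags; a new list is returned.
    """
    out = list(chunk_tags)
    for i in range(2, len(pos_tags) - 1):
        if (chunk_tags[i] == "I"
                and pos_tags[i] == "DT"
                and pos_tags[i + 1] in NOUN_TAGS
                and chunk_tags[i - 1] == "I"
                and chunk_tags[i - 2] == "I"):
            out[i] = "B"
    return out


# "gave the boy the book": the baseline lumps both NPs into one chunk.
pos = ["VBD", "DT", "NN", "DT", "NN"]
chunks = ["O", "I", "I", "I", "I"]
print(apply_example_rule(pos, chunks))
```

In the toy sentence the second determiner is relabelled B, which splits one over-long chunk into two base NPs.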

127:

128: \section{Manual Rule Acquisition}

129:

130: R\&M framed the base NP annotation problem as a word tagging problem.

131: We chose instead to use regular expressions on words and

132: part-of-speech tags to characterize the NPs, as well as the context

133: surrounding the NPs, because this is both a more powerful

134: representational language and more intuitive to a person.  A person

135: can more easily consider potential phrases as a sequence of words and

136: tags, rather than looking at each individual word and deciding whether

137: it is part of a phrase or not.  The rule actions we allow

138: are:\footnote{The rule types we have chosen are similar to those used

139: by Vilain and Day \shortcite{vilain96:parsing} in transformation-based

140: parsing, but are more powerful.}

141: \begin{flushleft}

142: \begin{tabular}{lp{2.2in}}

143: {\bf A}dd & Add a base NP (bracket a sequence of words as a base

144: NP) \\

145: {\bf K}ill & Delete a base NP (remove a pair of parentheses) \\

146: {\bf T}ransform & Transform a base NP (move one or both parentheses to

147: extend/contract a base NP) \\

148: {\bf M}erge & Merge two base NPs

149: \end{tabular}

150: \end{flushleft}

151:

152: As an example, we consider an actual rule from our experiments:

153: \begin{quote}

154: Bracket all sequences of words of: one determiner (DT), zero or more

155: adjectives (JJ, JJR, JJS), and one or more nouns (NN, NNP, NNS, NNPS),

156: if they are followed by a verb (VB, VBD, VBG, VBN, VBP, VBZ).

157: \end{quote}

158:

159: In our language, the rule is written thus:\footnote{A full description

160: of the rule language can be found at

161: {\tt http://nlp.cs.jhu.edu/$\sim$baseNP/manual}.}

162:

163: \begin{verbatim}

164: A

165: (* .)

166: ({1} t=DT) (* t=JJ[RS]?) (+ t=NNP?S?)

167: ({1} t=VB[DGNPZ]?)

168: \end{verbatim}

169:

170: The first line denotes the action, in this case, {\bf A}dd a

171: bracketing.  The second line defines the context preceding the

172: sequence we want to have bracketed \,---\, in this case, we do not

173: care what this sequence is.  The third line defines the sequence which

174: we want bracketed, and the last line defines the context following the

175: bracketed sequence.

176:

177: Internally, the software then translates this rule into the more

178: unwieldy Perl regular expression:

179: \begin{small}

180: \begin{verbatim}

181: s{(([^\s_]+__DT\s+)([^\s_]+__JJ[RS]?\s+)*

182: ([^\s_]+__NNP?S?\s+)+)([^\s_]+__VB[DGNPZ]?\s+)}

183: { ( $1 ) $5 }gx

184: \end{verbatim}

185: \end{small}
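
For readers more familiar with Python than Perl, the same Add rule can be sketched with the {\tt re} module. This is an illustrative re-expression, not the software's actual code. It assumes the {\tt word\_\_TAG} token format visible in the Perl expression, with every token followed by whitespace. The tag patterns follow the rule as written in the rule language, and the replacement spacing is simplified.

```python
import re

# Assumed token format: word__TAG, each token followed by whitespace.
TOKEN = r"[^\s_]+__"

ADD_RULE = re.compile(
    r"((?:" + TOKEN + r"DT\s+)"        # exactly one determiner
    r"(?:" + TOKEN + r"JJ[RS]?\s+)*"   # zero or more adjectives
    r"(?:" + TOKEN + r"NNP?S?\s+)+)"   # one or more nouns
    r"(" + TOKEN + r"VB[DGNPZ]?\s+)"   # right context: one verb
)


def add_brackets(sentence):
    """Bracket every DT JJ* NN+ sequence that is followed by a verb."""
    return ADD_RULE.sub(r"( \1) \2", sentence)


sentence = "the__DT big__JJ dog__NN barked__VBD loudly__RB "
print(add_brackets(sentence))
```

The verb is captured and written back unchanged, so it serves only as context and stays outside the new brackets.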

186:

187: The base NP annotation system created by the humans is essentially a

188: transformation-based system with hand-written rules.  The user

189: manually creates an ordered list of rules.  A rule list can be edited

190: by adding a rule at any position, deleting a rule, or modifying a

191: rule.  The user begins with an empty rule list.  Rules are derived by

192: studying the training corpus and NPs that the rules have not yet

193: bracketed, as well as NPs that the rules have incorrectly bracketed.

194: Whenever the rule list is edited, the efficacy of the changes can be

195: checked by running the new rule list on the training set and seeing

196: how the modified rule list compares to the unmodified list.  Based on

197: this feedback, the user decides whether to accept or reject the

198: changes that were made.  One nice property of transformation-based

199: learning is that in appending a rule to the end of a rule list, the

200: user need not be concerned about how that rule may interact with other

201: rules on the list.  This is much easier than writing a CFG, for

202: instance, where rules interact in a way that may not be readily

203: apparent to a human rule writer.

204:

205: To make it easy for people to study the training set, word sequences

206: are presented in one of four colors indicating that they:

207:

208: \begin{enumerate}

209: \item are not part of an NP either in the truth or in the output of the

210: person's rule set

211: \item consist of an NP both in the truth and in the output of the

212: person's rule set (i.e. they constitute a base NP that the person's

213: rules correctly annotated)

214: \item consist of an NP in the truth but not in the output of the

215: person's rule set (i.e. they constitute  a recall error)

216: \item consist of an NP  in the output of the person's rule set but not

217: in the truth (i.e. they constitute  a precision error)

218: \end{enumerate}
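
The three NP categories above, and the precision and recall figures derived from them, can be sketched in Python as follows. The representation of a base NP as a (start, end) word-index pair and the exact-match scoring are assumptions of this sketch, not a description of the actual system.

```python
def classify_spans(gold, predicted):
    """Split base NP spans into correct, recall-error and
    precision-error sets.

    gold, predicted: sets of (start, end) word-index pairs
    (a hypothetical representation; spans must match exactly).
    """
    return {
        "correct": gold & predicted,
        "recall_error": gold - predicted,
        "precision_error": predicted - gold,
    }


def precision_recall(gold, predicted):
    """Return (precision, recall) as fractions; 0.0 when undefined."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy spans, invented for illustration.
gold = {(0, 2), (4, 5), (7, 9)}
predicted = {(0, 2), (4, 6)}
print(classify_spans(gold, predicted))
print(precision_recall(gold, predicted))
```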

219:

220: The actual system is located at \\ {\tt

221: http://nlp.cs.jhu.edu/$\sim$basenp/chunking}.  A screenshot of this

222: system is shown in figure \ref{fig:screenshot}.  The correct base NPs

223: are enclosed in parentheses and those annotated by the human's rules

224: in brackets.

225:

226: \section{Experimental Set-Up and Results}

227:

228: The experiment of writing rule lists for base NP annotation was

229: assigned as a homework set to a group of 11 undergraduate and graduate

230: students in an introductory natural language processing

231: course.\footnote{These 11 students were a subset of the entire class.

232: Students were given an option of participating in this experiment or

233: doing a much more challenging final project.  Thus, as a population,

234: they tended to be the less motivated students.}

235:

236: The corpus that the students were given from which to derive and

237: validate rules is a 25k word subset of the R\&M training set,

238: approximately $\frac{1}{8}$ the size of the full R\&M training set.

239: The reason we used a downsized training set was that we believed

240: humans could generalize better from less data, and we thought that it

241: might be possible to meet or surpass R\&M's results with a much

242: smaller training set.

243:

244: \begin{figure*}

245: \begin{tabular}{|l|c|c|c|c||c|c|c|c|}

246: \hline

247: &\multicolumn{4}{|c||}{TRAINING SET (25K Words)}&\multicolumn{4}{|c|}{TEST SET}\\

248: \hline

249:  & Precision & Recall & F-Measure & $\frac{P+R}{2}$ & Precision & Recall &

250:  F-Measure & $\frac{P+R}{2}$ \\

251: \hline

252: Student 1  & 87.8\% & 88.6\% & 88.2 & 88.2 &

253: 	     88.0\% & 88.8\% & 88.4 & 88.4 \\

254: Student 2  & 88.1\% & 88.2\% & 88.2 & 88.2 &

255: 	     88.2\% & 87.9\% & 88.0 & 88.1  \\

256: Student 3  & 88.6\% & 87.6\% & 88.1 & 88.2 &

257: 	     88.3\% & 87.8\% & 88.0 & 88.1  \\

258: Student 4  & 88.0\% & 87.2\% & 87.6 & 87.6 &

259: 	     86.9\% & 85.9\% & 86.4 & 86.4 \\

260: Student 5  & 86.2\% & 86.8\% & 86.5 & 86.5 &

261: 	     85.8\% & 85.8\% & 85.8 & 85.8 \\

262: Student 6  & 86.0\% & 87.1\% & 86.6 & 86.6 &

263: 	     85.8\% & 87.1\% & 86.4 & 86.5 \\

264: Student 7  & 84.9\% & 86.7\% & 85.8 & 85.8 &

265: 	     85.3\% & 87.3\% & 86.3 & 86.3 \\

266: Student 8  & 83.6\% & 86.0\% & 84.8 & 84.8 &

267: 	     83.1\% & 85.7\% & 84.4 & 84.4 \\

268: Student 9  & 83.9\% & 85.0\% & 84.4 & 84.5 &

269: 	     83.5\% & 84.8\% & 84.1 & 84.2 \\

270: Student 10 & 82.8\% & 84.5\% & 83.6 & 83.7 &

271:              83.3\% & 84.4\% & 83.8 & 83.8 \\

272: Student 11 & 84.8\% & 78.8\% & 81.7 & 81.8 &

273: 	     84.0\% & 77.4\% & 80.6 & 80.7 \\

274: \hline

275: \end{tabular}

276: \caption{\label{fig:results_students} P/R results of test subjects on

277: training and test corpora}

278: \end{figure*}

279:

280: Figure \ref{fig:results_students} shows the final precision, recall,

281: F-measure and precision+recall numbers on the training and test

282: corpora for the students.  There was very little difference in

283: performance on the training set compared to the test set. This

284: indicates that people, unlike machines, seem immune to overtraining.

285: The time the students spent on the problem ranged from less than 3

286: hours to almost 10 hours, with an average of about 6 hours.  While it

287: was certainly the case that the students with the worst results spent

288: the least amount of time on the problem, it was not true that those

289: with the best results spent the most time \,---\, indeed, the average

290: amount of time spent by the top three students was a little less than

291: the overall average \,---\, slightly over 5 hours.  On average, people

292: achieved 90\% of their final performance after half of the total time

293: they spent in rule writing.
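
For reference, the two summary columns in figure \ref{fig:results_students} can be related as follows, assuming that F-Measure denotes the balanced harmonic mean of precision $P$ and recall $R$. The reported values are consistent with this reading, but the weighting is an assumption here.
\[
F = \frac{2PR}{P+R} \le \frac{P+R}{2}
\]
For Student 1 on the training set, $P = 87.8$ and $R = 88.6$ give $F = \frac{2 \times 87.8 \times 88.6}{176.4} \approx 88.2$ and $\frac{P+R}{2} = 88.2$. The two measures diverge only as $P$ and $R$ move apart.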

294:

295: The number of rules in the final rule lists also varied, from as

296: few as 16 rules to as many as 61 rules, with an average of 35.6

297: rules.  Again, the average number for the top three subjects was a

298: little under the average for everybody: 30.3 rules.

299:

300: In the beginning, we believed that the students would be able to match

301: or better the R\&M system's results, which are shown in figure

302: \ref{fig:results_ramshaw}.  It can be seen that when the same training

303: corpus is used, the best students do achieve performances which are

304: close to the R\&M system's \,---\, on average, the top 3 students'

305: performances come within 0.5\% precision and 1.1\% recall of the

306: machine's.  In the following section, we will examine the output of

307: both the manual and automatic systems for differences.

308:

309: \begin{figure*}

310: \begin{center}

311: \begin{tabular}{|c|c|c|c|c|}

312: \hline

313: Training set size(words)&Precision&Recall&F-Measure&$\frac{P+R}{2}$\\

314: \hline

315: 25k  & 88.7\% & 89.3\% & 89.0  & 89.0 \\

316: 200k & 91.8\% & 92.3\% & 92.0  & 92.1 \\

317: \hline

318: \end{tabular}

319: \end{center}

320: \caption{\label{fig:results_ramshaw} P/R results of the R\&M system

321: on test corpus}

322: \end{figure*}

323:

324: \section{Analysis}

325:

326: Before we started the analysis of the test set, we hypothesized that

327: the manually derived systems would have more difficulty with potential

328: rules that are effective, but fix only a very small number of mistakes

329: in the training set.

330:

331: The distribution of noun phrase types, identified by their

332: part-of-speech sequence, roughly obeys Zipf's Law \cite{Zipf35}: there is a

333: large tail of noun phrase types that occur very infrequently in the

334: corpus.  Assuming there is not a rule that can generalize across a

335: large number of these low-frequency noun phrases, the only way noun

336: phrases in the tail of the distribution can be learned is by learning

337: low-count rules: in other words, rules that will only positively

338: affect a small number of instances in the training corpus.

339:

340: Van den Bosch and Daelemans \shortcite{daelemans98:full_mem} show that

341: not ignoring the low count instances is often crucial to performance

342: in machine learning systems for natural language.  Do the

343: human-written rules suffer from failing to learn these infrequent

344: phrases?

345:

346: \begin{figure*}[htbp]

347: %\leavevmode

348: \begin{center}

349: \mbox{\epsfig{file=freqRecall2, angle=-90, width=5.5in} }

350: \caption{ \label{fig:freqRcll} Test Set Recall vs.\ Frequency of

351: Appearances in Training Set. }

352: \end{center}

353: \end{figure*}

354:

355: To explore the hypothesis that a primary difference between the

356: accuracy of human and machine is the machine's ability to capture the

357: low frequency noun phrases, we observed how the accuracy of noun

358: phrase annotation of both human and machine derived rules is affected

359: by the frequency of occurrence of the noun phrases in the training

360: corpus.  We reduced each base NP in the test set to its POS tag

361: sequence as assigned by the POS tagger. For each POS tag sequence, we

362: then counted the number of times it appeared in the training set and

363: the recall achieved on the test set.
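
The bookkeeping behind this analysis can be sketched in Python. The function name, the input format, and the toy data are assumptions made for illustration. The sketch groups test-set base NPs by how often their POS tag sequence occurred in the training set and reports recall per group.

```python
from collections import Counter, defaultdict


def recall_by_training_frequency(train_nps, test_nps):
    """Recall on test base NPs, grouped by training-set frequency.

    train_nps: iterable of POS-tag tuples, one per training base NP.
    test_nps: iterable of (pos_tag_tuple, was_found) pairs, where
        was_found says whether the system bracketed that NP correctly.
    Returns {training_frequency: recall}.
    """
    train_freq = Counter(train_nps)
    found = defaultdict(int)
    total = defaultdict(int)
    for tags, was_found in test_nps:
        freq = train_freq[tags]  # 0 for sequences unseen in training
        total[freq] += 1
        found[freq] += int(was_found)
    return {freq: found[freq] / total[freq] for freq in total}


# Toy data, invented for illustration.
train_nps = [("DT", "NN")] * 3 + [("CD", "CD")]
test_nps = [
    (("DT", "NN"), True), (("DT", "NN"), False),
    (("CD", "CD"), True), (("JJ", "NNS"), False),
]
print(recall_by_training_frequency(train_nps, test_nps))
```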

364:

365: The plot of the test set recall vs.\ the number of appearances in the

366: training set of each tag sequence for the machine and the mean of the

367: top 3 students is shown in figure \ref{fig:freqRcll}.  For instance,

368: for base NPs in the test set with tag sequences that appeared 5 times

369: in the training corpus, the students achieved an average recall of

370: 63.6\% while the machine achieved a recall of 83.5\%.  For base NPs

371: with tag sequences that appear fewer than 6 times in the training set,

372: the machine outperforms the students, with a recall of 62.8\% vs.\

373: 54.8\%.  However, for the rest of the base NPs \,---\, those that

374: appear 6 or more times \,---\, the performances of the machine and

375: students are almost identical: 93.7\% for the machine vs.\ 93.5\% for

376: the 3 students, a difference that is not statistically significant.

377:

378: The recall graph clearly shows that for the top 3 students,

379: performance is comparable to the machine's on all but the low

380: frequency constituents.  This can be explained by the human's

381: reluctance or inability to write a rule that will only capture a small

382: number of new base NPs in the training set.  Whereas a machine can

383: easily learn a few hundred rules, each of which makes a very small

384: improvement to accuracy, this is a tedious task for a person, and a

385: task which apparently none of our human subjects was willing or able

386: to take on.

387:

388: There is one anomalous point in figure \ref{fig:freqRcll}.  For base

389: NPs with POS tag sequences that appear 3 times in the training set,

390: there is a large decrease in recall for the machine, but a large

391: increase in recall for the students.  When we looked at the POS tag

392: sequences in question and their corresponding base NPs, we found that

393: this was caused by one single POS tag sequence \,---\, that of two

394: successive numbers (CD).  The test set happened to include many

395: sentences containing sequences of the type:

396: \begin{quote}

397: {\tt \ldots ( CD CD ) TO ( CD CD )\ldots }

398: \end{quote}

399: as in:

400: \begin{quote}

401: {\tt

402: ( International/NNP Paper/NNP ) fell/VBD ( 1/CD $\frac{3}{8}$/CD ) to/TO

403: ( 51/CD $\frac{1}{2}$/CD )\ldots

404: }

405: \end{quote}

406: while the training set had none.  The machine ended up bracketing

407: the entire sequence

408: \begin{quote}

409: {\tt 1/CD $\frac{3}{8}$/CD to/TO 51/CD $\frac{1}{2}$/CD }

410: \end{quote}

411: as a base NP. None of the students, however, made

412: this mistake.

413:

414:

415: \section{Conclusions and Future Work}

416:

417: In this paper we have described research we undertook in an attempt to

418: ascertain how well people can perform compared to a machine at learning

419: linguistic information from an annotated corpus, and more importantly

420: to begin to explore the differences in learning behavior between human

421: and machine.  Although people did not match the performance of the

422: machine-learned annotator, it is interesting that these ``language

423: novices'', with almost no training, were able to come fairly close,

424: learning a small number of powerful rules in a short amount of time on

425: a small training set.  This challenges the claim that machine learning

426: offers portability advantages over manual rule writing, seeing that

427: relatively unmotivated people can near-match the best machine

428: performance on this task in so little time at a labor cost of

429: approximately US\$40.

430:

431: We plan to take this work in a number of directions.  First, we will

432: further explore whether people can meet or beat the machine's accuracy

433: at this task.  We have identified one major weakness of human rule

434: writers: capturing information about low frequency events.  It is

435: possible that by providing the person with sufficiently powerful

436: corpus analysis tools to aid in rule writing, we could overcome this

437: problem.

438:

439: We ran all of our human experiments on a fixed training corpus size.

440: It would be interesting to compare how human performance varies as a

441: function of training corpus size with how machine performance varies.

442:

443: There are many ways to combine human corpus-based knowledge extraction

444: with machine learning.  One possibility would be to combine the human

445: and machine outputs.  Another would be to have the human start with

446: the output of the machine and then learn rules to correct the

447: machine's mistakes.  We could also have a hybrid system where the

448: person writes rules with the help of machine learning.  For instance,

449: the machine could propose a set of rules and the person could choose

450: the best one.  We hope that by further studying both human and machine

451: knowledge acquisition from corpora, we can devise learning strategies

452: that successfully combine the two approaches, and by doing so, further

453: improve our ability to extract useful linguistic information from

454: online resources.

455:

456: \begin{figure*}[htbp]

457: %\leavevmode

458: \begin{center}

459: \mbox{\epsfig{file=screenshot2, angle=-90, width=6.5in} }

460: \caption{ \label{fig:screenshot} Screenshot of base NP chunking system }

461: \end{center}

462: \end{figure*}

463:

464: \section*{Acknowledgements}

465:

466: The authors would like to thank Ryan Brown, Mike Harmon, John

467: Henderson and David Yarowsky for their valuable feedback regarding

468: this work.  This work was partly funded by NSF grant IRI-9502312.

469:

470: \bibliographystyle{acl}

471: %...

472: \bibliography{references}

473: \end{document}

474:

475:

476:

477: