1: \documentclass[11pt]{article}
2: \usepackage{colacl}
3: \usepackage[dvips]{epsfig}
4: \author{ Eric Brill and Grace Ngai\\ Department of Computer Science\\ %
5: The Johns Hopkins University\\ Baltimore, MD 21218, USA\\ %
6: Email: {\tt \{brill,gyn\}@cs.jhu.edu}}
\title{\vspace{-65pt}{\normalsize \tt \hfill Appeared in {\em Proceedings of the 37th ACL}, 1999}\\ \mbox{} \\
8: Man\footnotemark[1] \space vs.\ Machine: A Case Study in Base %
9: Noun Phrase Learning}
10:
11: \begin{document}
12:
13: \maketitle
14:
15: \begin{abstract}
16: A great deal of work has been done demonstrating the ability of
17: machine learning algorithms to automatically extract linguistic
18: knowledge from annotated corpora. Very little work has gone into
19: quantifying the difference in ability at this task between a person
20: and a machine. This paper is a first step in that direction.
21: \end{abstract}
22:
23: \section{Introduction}
24: \renewcommand{\thefootnote}{\fnsymbol{footnote}}\
25: \footnotetext[1]{and Woman.}
26: \renewcommand{\thefootnote}{\arabic{footnote}}
27:
28:
29: Machine learning has been very successful at solving many problems in
30: the field of natural language processing. It has been amply
31: demonstrated that a wide assortment of machine learning algorithms are
32: quite effective at extracting linguistic information from
33: manually annotated corpora.
34:
Among the machine learning algorithms studied, rule-based systems have
36: proven effective on many natural language processing tasks, including
37: part-of-speech tagging \cite{brill95:RBT,ramshaw94:tagging}, spelling
38: correction \cite{mangu97:cssc}, word-sense disambiguation
39: \cite{gale92:one_sense}, message understanding \cite{day97:alembic},
40: discourse tagging \cite{samuel98:discourse_tagging}, accent
41: restoration \cite{yarowsky94:decision_lists}, prepositional-phrase
42: attachment \cite{brill94:PPattach} and base noun phrase identification
43: \cite{ramshaw99:basenp,cardie98:basenp,veenstra98:basenp,argamon98:basenp}.
Many of these rule-based systems learn a short list of simple rules
(typically on the order of 50--300) which are easily understood by
humans.
47:
48:
That these rule-based systems achieve good performance while learning
a small list of simple rules raises the question of whether people
could also derive an effective rule list manually from an annotated
corpus. In this paper we explore how quickly and effectively
53: relatively untrained people can extract linguistic generalities from a
54: corpus as compared to a machine. There are a number of reasons for
55: doing this. We would like to understand the relative strengths and
56: weaknesses of humans versus machines in hopes of marrying their
57: complementary strengths to create even more accurate systems. Also,
58: since people can use their metaknowledge to generalize from a small
59: number of examples, it is possible that a person could derive
60: effective linguistic knowledge from a much smaller training corpus
61: than that needed by a machine. A person could also potentially learn
62: more powerful representations than a machine, thereby achieving higher
63: accuracy.
64:
65:
66: In this paper we describe experiments we performed to ascertain how
67: well humans, given an annotated training set, can generate rules for
68: base noun phrase chunking. Much previous work has been done on this
69: problem and many different methods have been used: Church's PARTS
70: \shortcite{church88:PARTS} program uses a Markov model; Bourigault
71: \shortcite{bourigault92:basenp} uses heuristics along with a grammar;
72: Voutilainen's NPTool \shortcite{voutilainen:NPTool} uses a lexicon
combined with a constraint grammar; Justeson and Katz
\shortcite{juteson95:basenp} use repeated phrases; Veenstra
\shortcite{veenstra98:basenp}, Argamon, Dagan \&
Krymolowski \shortcite{argamon98:basenp} and Daelemans, van den Bosch
77: \& Zavrel \shortcite{daelemans99:exceptions} use memory-based systems;
78: Ramshaw \& Marcus \shortcite{ramshaw99:basenp} and Cardie \& Pierce
79: \shortcite{cardie98:basenp} use rule-based systems.
80:
81:
82:
83: \section{Learning Base Noun Phrases by Machine}
84:
85: We used the base noun phrase system of Ramshaw and Marcus (R\&M) as
86: the machine learning system with which to compare the human learners.
87: It is difficult to compare different machine learning approaches to
88: base NP annotation, since different definitions of base NP are used in
89: many of the papers, but the R\&M system is the best of those that have
90: been tested on the Penn Treebank.\footnote{We would like to thank
91: Lance Ramshaw for providing us with the base-NP-annotated training and
92: test corpora that were used in the R\&M system, as well as the rules
93: learned by this system.}
94:
95:
To train their system, R\&M used a 200k-word chunk of the Penn
Treebank Parsed Wall Street Journal \cite{marcus93:penn_treebank},
tagged using a transformation-based tagger \cite{brill95:RBT}. They
extracted base noun phrases from its parses by selecting noun phrases
that contained no nested noun phrases and then further processing the
data with some heuristics (like treating the possessive marker as the
first word of a new base noun phrase) to flatten the recursive
structure of the parse. They cast the problem as a transformation-based tagging
104: problem, where each word is to be labelled with a chunk structure tag
105: from the set \{I, O, B\}, where words marked ``I'' are inside some
106: base NP chunk, those marked ``O'' are not part of any base NP, and
107: those marked ``B'' denote the first word of a base NP which
108: immediately succeeds another base NP. The training corpus is first
109: run through a part-of-speech tagger. Then, as a baseline annotation,
110: each word is labelled with the most common chunk structure tag for its
111: part-of-speech tag.
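
As a rough illustration of the \{I, O, B\} encoding and the baseline
annotation described above, the following Python sketch may help. It
is not the R\&M implementation: the function names
{\tt chunks\_to\_iob} and {\tt train\_baseline}, the span
representation and the toy sentence are all assumptions made for this
example.

\begin{verbatim}
```python
from collections import Counter, defaultdict

def chunks_to_iob(n_words, chunks):
    """Encode base NP spans (start, end), end exclusive, as I/O/B."""
    tags = ["O"] * n_words
    prev_end = None
    for start, end in sorted(chunks):
        for i in range(start, end):
            tags[i] = "I"
        # "B" marks a chunk that starts right after another chunk.
        if prev_end == start:
            tags[start] = "B"
        prev_end = end
    return tags

def train_baseline(tagged_corpus):
    """Map each POS tag to its most frequent chunk tag.

    tagged_corpus: iterable of (pos_tag, chunk_tag) pairs.
    """
    counts = defaultdict(Counter)
    for pos, chunk in tagged_corpus:
        counts[pos][chunk] += 1
    return {pos: c.most_common(1)[0][0]
            for pos, c in counts.items()}

# Toy sentence (invented): ( The dog ) ( 's bone ) fell
pos_tags = ["DT", "NN", "POS", "NN", "VBD"]
gold = chunks_to_iob(5, [(0, 2), (2, 4)])
baseline = train_baseline(zip(pos_tags, gold))
guess = [baseline[p] for p in pos_tags]
```
\end{verbatim}

In the toy sentence the possessive marker opens a second base NP
directly after the first, so it is the only word that receives the
``B'' tag.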
112:
After the baseline is assigned, transformation rules fitting a set of
rule templates are learned to improve the ``tagging accuracy'' on
the training set. These templates take into consideration the word,
part-of-speech tag and chunk structure tag of the current word and all
words within a window of 3 to either side of it. Applying a rule to a
word changes the chunk structure tag of that word and in effect alters
the boundaries of the base NP chunks in the sentence.
120:
121: An example of a rule learned by the R\&M system is: {\em change a
122: chunk structure tag of a word from I to B if the word is a determiner,
123: the next word is a noun, and the two previous words both have chunk
124: structure tags of I}. In other words, a determiner in this context is
125: likely to begin a noun phrase. The R\&M system learns a total of 500
126: rules.
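
The example rule can be sketched in Python as follows. This is only an
illustration under stated assumptions: the function name
{\tt apply\_rule} is hypothetical, ``noun'' is taken to mean the Penn
Treebank tags NN, NNS, NNP and NNPS, and the rule's conditions are
tested against the tags as they were before the rule fires, which may
differ from how the R\&M system applies its rules.

\begin{verbatim}
```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def apply_rule(pos_tags, chunk_tags):
    """Return a copy of chunk_tags with the example rule applied.

    Change I -> B when the word is a determiner, the next word is
    a noun, and the two preceding words are both tagged I.
    """
    out = list(chunk_tags)
    for i in range(2, len(pos_tags) - 1):
        if (chunk_tags[i] == "I"
                and pos_tags[i] == "DT"
                and pos_tags[i + 1] in NOUN_TAGS
                and chunk_tags[i - 1] == "I"
                and chunk_tags[i - 2] == "I"):
            out[i] = "B"
    return out

# Toy input (invented): gave the boy the book
pos = ["VBD", "DT", "NN", "DT", "NN"]
before = ["O", "I", "I", "I", "I"]
after = apply_rule(pos, before)
```
\end{verbatim}

Here the second determiner is retagged, which splits one long chunk
into two adjacent base NPs.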
127:
128: \section{Manual Rule Acquisition}
129:
130: R\&M framed the base NP annotation problem as a word tagging problem.
We chose instead to use regular expressions on words and
part-of-speech tags to characterize the NPs, as well as the context
133: surrounding the NPs, because this is both a more powerful
134: representational language and more intuitive to a person. A person
135: can more easily consider potential phrases as a sequence of words and
136: tags, rather than looking at each individual word and deciding whether
137: it is part of a phrase or not. The rule actions we allow
138: are:\footnote{The rule types we have chosen are similar to those used
139: by Vilain and Day \shortcite{vilain96:parsing} in transformation-based
140: parsing, but are more powerful.}
141: \begin{flushleft}
142: \begin{tabular}{lp{2.2in}}
{\bf A}dd & Add a base NP (bracket a sequence of words as a base
144: NP) \\
145: {\bf K}ill & Delete a base NP (remove a pair of parentheses) \\
146: {\bf T}ransform & Transform a base NP (move one or both parentheses to
147: extend/contract a base NP) \\
148: {\bf M}erge & Merge two base NPs
149: \end{tabular}
150: \end{flushleft}
151:
152: As an example, we consider an actual rule from our experiments:
153: \begin{quote}
154: Bracket all sequences of words of: one determiner (DT), zero or more
155: adjectives (JJ, JJR, JJS), and one or more nouns (NN, NNP, NNS, NNPS),
156: if they are followed by a verb (VB, VBD, VBG, VBN, VBP, VBZ).
157: \end{quote}
158:
159: In our language, the rule is written thus:\footnote{A full description
160: of the rule language can be found at
161: {\tt http://nlp.cs.jhu.edu/$\sim$baseNP/manual}.}
162:
163: \begin{verbatim}
164: A
165: (* .)
166: ({1} t=DT) (* t=JJ[RS]?) (+ t=NNP?S?)
167: ({1} t=VB[DGNPZ]?)
168: \end{verbatim}
169:
170: The first line denotes the action, in this case, {\bf A}dd a
171: bracketing. The second line defines the context preceding the
172: sequence we want to have bracketed \,---\, in this case, we do not
173: care what this sequence is. The third line defines the sequence which
174: we want bracketed, and the last line defines the context following the
175: bracketed sequence.
176:
177: Internally, the software then translates this rule into the more
178: unwieldy Perl regular expression:
179: \begin{small}
180: \begin{verbatim}
s{(([^\s_]+__DT\s+)([^\s_]+__JJ[RS]?\s+)*
([^\s_]+__NNP?S?\s+)+)([^\s_]+__VB[DGNPZ]?\s+)}
183: { ( $1 ) $5 }g
184: \end{verbatim}
185: \end{small}
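
For readers who prefer to experiment outside Perl, a Python port of
this substitution might look as follows. This is a sketch, not the
authors' software: the names {\tt ADD\_RULE} and {\tt add\_brackets}
and the example sentence are invented here, the input is assumed to be
a string of {\tt word\_\_TAG} tokens each followed by whitespace, and
the optional tag suffixes follow the rule as written in the rule
language ({\tt JJ[RS]?} and {\tt VB[DGNPZ]?}).

\begin{verbatim}
```python
import re

# Tokens look like "word__TAG"; every token, including the last,
# is followed by whitespace.
ADD_RULE = re.compile(
    r"(([^\s_]+__DT\s+)"         # exactly one determiner
    r"([^\s_]+__JJ[RS]?\s+)*"    # zero or more adjectives
    r"([^\s_]+__NNP?S?\s+)+)"    # one or more nouns
    r"([^\s_]+__VB[DGNPZ]?\s+)"  # right context: one verb
)

def add_brackets(tagged):
    """Bracket each DT JJ* NN+ sequence that precedes a verb."""
    return ADD_RULE.sub(r" ( \1 ) \5 ", tagged)

sentence = "the__DT big__JJ dog__NN barked__VBD loudly__RB "
bracketed = add_brackets(sentence)
```
\end{verbatim}

The verb is matched as right context and copied back unchanged, so
only the determiner, adjectives and nouns end up inside the
parentheses.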
186:
187: The base NP annotation system created by the humans is essentially a
188: transformation-based system with hand-written rules. The user
189: manually creates an ordered list of rules. A rule list can be edited
190: by adding a rule at any position, deleting a rule, or modifying a
191: rule. The user begins with an empty rule list. Rules are derived by
192: studying the training corpus and NPs that the rules have not yet
193: bracketed, as well as NPs that the rules have incorrectly bracketed.
194: Whenever the rule list is edited, the efficacy of the changes can be
195: checked by running the new rule list on the training set and seeing
196: how the modified rule list compares to the unmodified list. Based on
197: this feedback, the user decides whether to accept or reject the
198: changes that were made. One nice property of transformation-based
199: learning is that in appending a rule to the end of a rule list, the
200: user need not be concerned about how that rule may interact with other
201: rules on the list. This is much easier than writing a CFG, for
202: instance, where rules interact in a way that may not be readily
203: apparent to a human rule writer.
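
The edit-and-check cycle described above can be sketched in Python. In
the experiments the user decided whether to keep a change; the
automatic criterion below (keep the edit if the training-set F-measure
does not drop) is only one possible stand-in for that judgement, and
the rule representation and all names are assumptions made for
illustration.

\begin{verbatim}
```python
def apply_rules(rules, tokens):
    """Run an ordered rule list from an empty bracketing."""
    spans = set()
    for rule in rules:
        spans = rule(tokens, spans)
    return spans

def f_measure(truth, predicted):
    """Balanced F-measure over exact-match (start, end) spans."""
    correct = len(truth & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(truth)
    return 2 * precision * recall / (precision + recall)

def try_edit(rules, edited, tokens, truth):
    """Return the edited list unless it lowers training F."""
    before = f_measure(truth, apply_rules(rules, tokens))
    after = f_measure(truth, apply_rules(edited, tokens))
    return edited if after >= before else rules

def add_dt_nn(tokens, spans):
    """Toy Add rule: bracket a determiner followed by a noun."""
    found = {(i, i + 2) for i in range(len(tokens) - 1)
             if tokens[i][1] == "DT" and tokens[i + 1][1] == "NN"}
    return spans | found

def kill_all(tokens, spans):
    """Toy Kill rule: delete every bracketing."""
    return set()

tokens = [("the", "DT"), ("dog", "NN"), ("barked", "VBD")]
truth = {(0, 2)}
rules = try_edit([], [add_dt_nn], tokens, truth)
rules = try_edit(rules, rules + [kill_all], tokens, truth)
```
\end{verbatim}

The first edit is kept because it raises the training-set F-measure;
the second is rejected because appending the toy Kill rule removes the
only correct bracket.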
204:
205: To make it easy for people to study the training set, word sequences
206: are presented in one of four colors indicating that they:
207:
208: \begin{enumerate}
209: \item are not part of an NP either in the truth or in the output of the
210: person's rule set
211: \item consist of an NP both in the truth and in the output of the
212: person's rule set (i.e. they constitute a base NP that the person's
213: rules correctly annotated)
214: \item consist of an NP in the truth but not in the output of the
215: person's rule set (i.e. they constitute a recall error)
216: \item consist of an NP in the output of the person's rule set but not
217: in the truth (i.e. they constitute a precision error)
218: \end{enumerate}
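
The four display categories amount to a comparison between two sets of
bracketings. A minimal Python sketch, assuming that a base NP counts
as correct only when both of its boundaries match the truth exactly
(the text does not spell out the matching criterion) and using
hypothetical function names, is:

\begin{verbatim}
```python
def classify_spans(truth, predicted):
    """Sort base NP spans into the three bracketed categories.

    truth, predicted: sets of (start, end) spans, end exclusive.
    """
    return {
        "correct": truth & predicted,
        "recall_error": truth - predicted,
        "precision_error": predicted - truth,
    }

def outside_words(n_words, truth, predicted):
    """Indices of words that are in no NP in either bracketing."""
    covered = {i for start, end in truth | predicted
               for i in range(start, end)}
    return [i for i in range(n_words) if i not in covered]

# Toy six-word sentence (invented spans).
truth = {(0, 2), (3, 5)}
predicted = {(0, 2), (3, 4)}
groups = classify_spans(truth, predicted)
outside = outside_words(6, truth, predicted)
```
\end{verbatim}

Under exact matching, a rule that brackets only part of a true base NP
produces one precision error and one recall error at the same time.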
219:
220: The actual system is located at \\ {\tt
221: http://nlp.cs.jhu.edu/$\sim$basenp/chunking}. A screenshot of this
222: system is shown in figure \ref{fig:screenshot}. The correct base NPs
223: are enclosed in parentheses and those annotated by the human's rules
224: in brackets.
225:
226: \section{Experimental Set-Up and Results}
227:
228: The experiment of writing rule lists for base NP annotation was
229: assigned as a homework set to a group of 11 undergraduate and graduate
230: students in an introductory natural language processing
231: course.\footnote{These 11 students were a subset of the entire class.
232: Students were given an option of participating in this experiment or
233: doing a much more challenging final project. Thus, as a population,
234: they tended to be the less motivated students.}
235:
The corpus from which the students derived and validated their rules
was a 25k-word subset of the R\&M training set,
approximately $\frac{1}{8}$ the size of the full R\&M training set.
239: The reason we used a downsized training set was that we believed
240: humans could generalize better from less data, and we thought that it
241: might be possible to meet or surpass R\&M's results with a much
242: smaller training set.
243:
244: \begin{figure*}
245: \begin{tabular}{|l|c|c|c|c||c|c|c|c|}
246: \hline
247: &\multicolumn{4}{|c||}{TRAINING SET (25K Words)}&\multicolumn{4}{|c|}{TEST SET}\\
248: \hline
249: & Precision & Recall & F-Measure & $\frac{P+R}{2}$ & Precision & Recall &
250: F-Measure & $\frac{P+R}{2}$ \\
251: \hline
252: Student 1 & 87.8\% & 88.6\% & 88.2 & 88.2 &
253: 88.0\% & 88.8\% & 88.4 & 88.4 \\
254: Student 2 & 88.1\% & 88.2\% & 88.2 & 88.2 &
255: 88.2\% & 87.9\% & 88.0 & 88.1 \\
256: Student 3 & 88.6\% & 87.6\% & 88.1 & 88.2 &
257: 88.3\% & 87.8\% & 88.0 & 88.1 \\
258: Student 4 & 88.0\% & 87.2\% & 87.6 & 87.6 &
259: 86.9\% & 85.9\% & 86.4 & 86.4 \\
260: Student 5 & 86.2\% & 86.8\% & 86.5 & 86.5 &
261: 85.8\% & 85.8\% & 85.8 & 85.8 \\
262: Student 6 & 86.0\% & 87.1\% & 86.6 & 86.6 &
263: 85.8\% & 87.1\% & 86.4 & 86.5 \\
264: Student 7 & 84.9\% & 86.7\% & 85.8 & 85.8 &
265: 85.3\% & 87.3\% & 86.3 & 86.3 \\
266: Student 8 & 83.6\% & 86.0\% & 84.8 & 84.8 &
267: 83.1\% & 85.7\% & 84.4 & 84.4 \\
268: Student 9 & 83.9\% & 85.0\% & 84.4 & 84.5 &
269: 83.5\% & 84.8\% & 84.1 & 84.2 \\
270: Student 10 & 82.8\% & 84.5\% & 83.6 & 83.7 &
271: 83.3\% & 84.4\% & 83.8 & 83.8 \\
272: Student 11 & 84.8\% & 78.8\% & 81.7 & 81.8 &
273: 84.0\% & 77.4\% & 80.6 & 80.7 \\
274: \hline
275: \end{tabular}
276: \caption{\label{fig:results_students} P/R results of test subjects on
277: training and test corpora}
278: \end{figure*}
279:
Figure \ref{fig:results_students} shows the final precision, recall,
F-measure and mean of precision and recall ($\frac{P+R}{2}$) on the
training and test corpora for the students. There was very little
difference in performance on the training set compared to the test
set. This suggests that people, unlike machines, may be largely immune
to overtraining.
285: The time the students spent on the problem ranged from less than 3
286: hours to almost 10 hours, with an average of about 6 hours. While it
287: was certainly the case that the students with the worst results spent
288: the least amount of time on the problem, it was not true that those
289: with the best results spent the most time \,---\, indeed, the average
290: amount of time spent by the top three students was a little less than
291: the overall average \,---\, slightly over 5 hours. On average, people
292: achieved 90\% of their final performance after half of the total time
293: they spent in rule writing.
294:
295: The number of rules in the final rule lists also varied, from as
296: few as 16 rules to as many as 61 rules, with an average of 35.6
297: rules. Again, the average number for the top three subjects was a
298: little under the average for everybody: 30.3 rules.
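
The F-measure is not defined explicitly in the text. Assuming it is the
balanced harmonic mean of precision $P$ and recall $R$, the two summary
columns of Figure \ref{fig:results_students} can be checked by hand. For
Student 11 on the test set ($P = 84.0$, $R = 77.4$):
\[
F = \frac{2PR}{P+R} = \frac{2 \times 84.0 \times 77.4}{84.0 + 77.4}
\approx 80.6,
\qquad
\frac{P+R}{2} = \frac{84.0 + 77.4}{2} = 80.7 .
\]
Both values agree with the table, which is consistent with this
reading. The two measures differ noticeably only when precision and
recall are far apart, as they are for this student.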
299:
300: In the beginning, we believed that the students would be able to match
301: or better the R\&M system's results, which are shown in figure
302: \ref{fig:results_ramshaw}. It can be seen that when the same training
303: corpus is used, the best students do achieve performances which are
304: close to the R\&M system's \,---\, on average, the top 3 students'
305: performances come within 0.5\% precision and 1.1\% recall of the
306: machine's. In the following section, we will examine the output of
307: both the manual and automatic systems for differences.
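
The quoted gaps can be reproduced from the test-set columns of
Figure \ref{fig:results_students} and the 25k row of
Figure \ref{fig:results_ramshaw}, assuming that the top 3 students are
Students 1 to 3 (an assumption; the text does not identify them):
\[
\bar{P} = \frac{88.0 + 88.2 + 88.3}{3} \approx 88.17, \qquad
\bar{R} = \frac{88.8 + 87.9 + 87.8}{3} \approx 88.17,
\]
\[
88.7 - \bar{P} \approx 0.5, \qquad 89.3 - \bar{R} \approx 1.1 .
\]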
308:
309: \begin{figure*}
310: \begin{center}
311: \begin{tabular}{|c|c|c|c|c|}
312: \hline
Training set size (words)&Precision&Recall&F-Measure&$\frac{P+R}{2}$\\
314: \hline
315: 25k & 88.7\% & 89.3\% & 89.0 & 89.0 \\
316: 200k & 91.8\% & 92.3\% & 92.0 & 92.1 \\
317: \hline
318: \end{tabular}
319: \end{center}
320: \caption{\label{fig:results_ramshaw} P/R results of the R\&M system
321: on test corpus}
322: \end{figure*}
323:
324: \section{Analysis}
325:
326: Before we started the analysis of the test set, we hypothesized that
327: the manually derived systems would have more difficulty with potential
328: rules that are effective, but fix only a very small number of mistakes
329: in the training set.
330:
The distribution of noun phrase types, identified by their
part-of-speech sequence, roughly obeys Zipf's Law \cite{Zipf35}: there is a
333: large tail of noun phrase types that occur very infrequently in the
334: corpus. Assuming there is not a rule that can generalize across a
335: large number of these low-frequency noun phrases, the only way noun
336: phrases in the tail of the distribution can be learned is by learning
337: low-count rules: in other words, rules that will only positively
338: affect a small number of instances in the training corpus.
339:
Van den Bosch and Daelemans \shortcite{daelemans98:full_mem} show that
not ignoring the low-count instances is often crucial to performance
in machine learning systems for natural language. Do the
343: human-written rules suffer from failing to learn these infrequent
344: phrases?
345:
346: \begin{figure*}[htbp]
347: %\leavevmode
348: \begin{center}
349: \mbox{\epsfig{file=freqRecall2, angle=-90, width=5.5in} }
350: \caption{ \label{fig:freqRcll} Test Set Recall vs.\ Frequency of
351: Appearances in Training Set. }
352: \end{center}
353: \end{figure*}
354:
To explore the hypothesis that a primary difference between the
accuracy of human and machine is the machine's ability to capture the
low-frequency noun phrases, we observed how the accuracy of noun
phrase annotation of both human- and machine-derived rules is affected
by the frequency of occurrence of the noun phrases in the training
corpus. We reduced each base NP in the test set to its POS tag
sequence as assigned by the POS tagger. For each POS tag sequence, we
then counted the number of times it appeared in the training set and
measured the recall achieved on the test set.
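
The bookkeeping behind this analysis can be sketched in Python. The
function name {\tt recall\_by\_frequency}, the input format and the
toy data are assumptions made for illustration; the original analysis
scripts are not described in the text.

\begin{verbatim}
```python
from collections import Counter, defaultdict

def recall_by_frequency(train_seqs, test_items):
    """Group test-set recall by training-set frequency.

    train_seqs: POS tag sequences (tuples), one per base NP in
                the training set.
    test_items: (tag_sequence, found) pairs, one per true base NP
                in the test set; found is True if the system
                bracketed that NP correctly.
    Returns {training_count: recall}.
    """
    train_counts = Counter(train_seqs)
    hits = defaultdict(int)
    totals = defaultdict(int)
    for seq, found in test_items:
        freq = train_counts[seq]  # 0 if unseen in training
        totals[freq] += 1
        hits[freq] += int(found)
    return {freq: hits[freq] / totals[freq] for freq in totals}

# Toy data (invented).
train = [("DT", "NN")] * 3 + [("CD", "CD")]
test = [(("DT", "NN"), True), (("DT", "NN"), False),
        (("CD", "CD"), True), (("NNP",), False)]
recall = recall_by_frequency(train, test)
```
\end{verbatim}

Each key of the result is a training-set count, and the value is the
recall over all test-set base NPs whose tag sequence occurred that
many times in training.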
364:
365: The plot of the test set recall vs.\ the number of appearances in the
366: training set of each tag sequence for the machine and the mean of the
367: top 3 students is shown in figure \ref{fig:freqRcll}. For instance,
368: for base NPs in the test set with tag sequences that appeared 5 times
369: in the training corpus, the students achieved an average recall of
370: 63.6\% while the machine achieved a recall of 83.5\%. For base NPs
with tag sequences that appear fewer than 6 times in the training set,
the machine outperforms the students, with a recall of 62.8\% vs.\
373: 54.8\%. However, for the rest of the base NPs \,---\, those that
374: appear 6 or more times \,---\, the performances of the machine and
375: students are almost identical: 93.7\% for the machine vs.\ 93.5\% for
376: the 3 students, a difference that is not statistically significant.
377:
378: The recall graph clearly shows that for the top 3 students,
performance is comparable to the machine's on all but the
low-frequency constituents. This can be explained by the humans'
reluctance or inability to write a rule that will only capture a small
382: number of new base NPs in the training set. Whereas a machine can
383: easily learn a few hundred rules, each of which makes a very small
384: improvement to accuracy, this is a tedious task for a person, and a
385: task which apparently none of our human subjects was willing or able
386: to take on.
387:
388: There is one anomalous point in figure \ref{fig:freqRcll}. For base
389: NPs with POS tag sequences that appear 3 times in the training set,
390: there is a large decrease in recall for the machine, but a large
391: increase in recall for the students. When we looked at the POS tag
392: sequences in question and their corresponding base NPs, we found that
393: this was caused by one single POS tag sequence \,---\, that of two
394: successive numbers (CD). The test set happened to include many
395: sentences containing sequences of the type:
396: \begin{quote}
397: {\tt \ldots ( CD CD ) TO ( CD CD )\ldots }
398: \end{quote}
399: as in:
400: \begin{quote}
401: {\tt
402: ( International/NNP Paper/NNP ) fell/VBD ( 1/CD $\frac{3}{8}$/CD ) to/TO
403: ( 51/CD $\frac{1}{2}$/CD )\ldots
404: }
405: \end{quote}
406: while the training set had none. The machine ended up bracketing
407: the entire sequence
408: \begin{quote}
409: {\tt 1/CD $\frac{3}{8}$/CD to/TO 51/CD $\frac{1}{2}$/CD }
410: \end{quote}
411: as a base NP. None of the students, however, made
412: this mistake.
413:
414:
415: \section{Conclusions and Future Work}
416:
417: In this paper we have described research we undertook in an attempt to
418: ascertain how people can perform compared to a machine at learning
419: linguistic information from an annotated corpus, and more importantly
420: to begin to explore the differences in learning behavior between human
421: and machine. Although people did not match the performance of the
422: machine-learned annotator, it is interesting that these ``language
423: novices'', with almost no training, were able to come fairly close,
424: learning a small number of powerful rules in a short amount of time on
425: a small training set. This challenges the claim that machine learning
426: offers portability advantages over manual rule writing, seeing that
427: relatively unmotivated people can near-match the best machine
428: performance on this task in so little time at a labor cost of
429: approximately US\$40.
430:
431: We plan to take this work in a number of directions. First, we will
432: further explore whether people can meet or beat the machine's accuracy
433: at this task. We have identified one major weakness of human rule
writers: capturing information about low-frequency events. It is
possible that by providing the person with sufficiently powerful
corpus analysis tools to aid in rule writing, we could overcome this
437: problem.
438:
439: We ran all of our human experiments on a fixed training corpus size.
440: It would be interesting to compare how human performance varies as a
441: function of training corpus size with how machine performance varies.
442:
443: There are many ways to combine human corpus-based knowledge extraction
444: with machine learning. One possibility would be to combine the human
445: and machine outputs. Another would be to have the human start with
446: the output of the machine and then learn rules to correct the
447: machine's mistakes. We could also have a hybrid system where the
448: person writes rules with the help of machine learning. For instance,
449: the machine could propose a set of rules and the person could choose
450: the best one. We hope that by further studying both human and machine
451: knowledge acquisition from corpora, we can devise learning strategies
452: that successfully combine the two approaches, and by doing so, further
453: improve our ability to extract useful linguistic information from
454: online resources.
455:
456: \begin{figure*}[htbp]
457: %\leavevmode
458: \begin{center}
459: \mbox{\epsfig{file=screenshot2, angle=-90, width=6.5in} }
460: \caption{ \label{fig:screenshot} Screenshot of base NP chunking system }
461: \end{center}
462: \end{figure*}
463:
464: \section*{Acknowledgements}
465:
466: The authors would like to thank Ryan Brown, Mike Harmon, John
467: Henderson and David Yarowsky for their valuable feedback regarding
468: this work. This work was partly funded by NSF grant IRI-9502312.
469:
470: \bibliographystyle{acl}
471: %...
472: \bibliography{references}
473: \end{document}
474:
475:
476:
477: