cs0103026/naacl.tex
1: \documentclass[]{article}
2: \usepackage{naacl2001}
3: 
4: \title{A Decision Tree of Bigrams\\is an Accurate Predictor of Word Sense}
5: 
6: \author{Ted Pedersen\\
7: Department of Computer Science\\
8: University of Minnesota Duluth
9: \\Duluth, MN 55812 USA\\
10: {\tt tpederse@d.umn.edu}} 
11: 
12: %%%\tt{http://www.d.umn.edu/\~{}tpederse}}
13: 
14: 
15: \begin{document}
16: \maketitle      
17: 
18: \begin{abstract}
19: This paper presents a corpus-based approach to word sense
20: disambiguation where a decision tree assigns a sense to an ambiguous word
21: based on the bigrams that occur nearby. This approach is evaluated using
22: the sense-tagged corpora from the 1998 SENSEVAL word sense disambiguation  
23: exercise. It is more accurate than the average results reported for 30 of  
24: 36 words, and is more accurate than the best results for 19 of 36 words.
25: \end{abstract}
26: 
27: \section{Introduction}
28: 
29: Word sense disambiguation is the process of selecting the most appropriate
30: meaning for a word, based on the context in which it occurs. For our
31: purposes it is assumed that the set of possible meanings, i.e., the sense 
32: inventory, has already been determined. For example, suppose {\it bill} 
33: has the following set of possible meanings: a piece of currency, pending
34: legislation, or a  bird jaw. When used in the context of {\it The Senate 
35: bill is under consideration}, a human reader immediately understands that 
36: {\it bill} is being used in the legislative sense. However, a computer 
37: program  attempting to perform the same task faces a difficult problem 
38: since it does not have the benefit of innate common--sense or linguistic 
39: knowledge. 
40: 
41: Rather than attempting to provide computer programs with real--world 
42: knowledge comparable to that of humans, natural language processing has 
43: turned to {\it corpus--based} methods. These approaches use techniques 
44: from statistics and machine learning to induce models of language 
45: usage from large samples of text.  These models are trained to perform 
46: particular tasks, usually via supervised learning. This paper describes an
47: approach where a {\it decision tree} is learned from some number of 
48: sentences  where each instance of an  ambiguous word has been manually 
49: annotated with a sense--tag that denotes the most appropriate sense for 
50: that context. 
51: 
52: Prior to learning, the sense--tagged corpus must be  converted into a more 
53: regular form suitable for automatic processing. Each sense--tagged 
54: occurrence of an ambiguous  word is  converted into a feature vector, 
55: where each feature represents  some  property of the surrounding text that 
56: is considered to be relevant to  the disambiguation process.  Given the 
57: flexibility and complexity of human  language,  there is  potentially an 
58: infinite set of features that could  be utilized. However, in 
59: corpus--based approaches features usually consist of information that  can 
60: be readily identified in the text,  without relying on extensive external 
61: knowledge sources.  These  typically  include the  part--of--speech of 
62: surrounding words, the  presence of certain key words within some window 
63: of context, and various  syntactic properties  of the sentence and the 
64: ambiguous word. 
65: 
66: The approach in this paper relies upon a  feature set made up of {\it 
67: bigrams}, two--word sequences that occur in a text. The context in which 
68: an ambiguous word  occurs is  represented by some number of binary 
69: features that  indicate  whether or  not a  particular bigram has occurred 
70: within approximately 50 words to the left or right of the word being  
71: disambiguated. 
72: 
73: We take this approach since surface lexical features 
74: like bigrams, collocations, and co--occurrences often contribute a great 
75: deal to disambiguation accuracy. It is not clear how much
76: disambiguation accuracy is improved through the use of features 
77: that are identified by more complex pre--processing such as  
78: part--of--speech tagging, parsing, or anaphora resolution. One of our
79: objectives is to establish a clear upper bound on the accuracy of
80: disambiguation using feature sets that do not impose substantial 
81: pre--processing requirements. 
82: 
83: This paper continues with a discussion of our methods for identifying 
84: the bigrams that should be included in the feature set for learning. Then 
85: the decision tree learning algorithm is described, as are some benchmark 
86: learning algorithms that are included for purposes of comparison. The 
87: experimental data is discussed, and then the empirical  results are  
88: presented. We close with an analysis of our findings and a  discussion of 
89: related work. 
90: 
91: \section{Building a Feature Set of Bigrams}
92: 
93: We have developed an approach to word sense disambiguation that   
94: represents text entirely in terms of the occurrence of bigrams, which we 
95: define to be two consecutive words that occur in a text. The
96: distributional characteristics of bigrams are fairly consistent across
97: corpora; a majority of them only occur one time. Given the sparse and
98: skewed nature of this data, the statistical methods used to select
99: interesting bigrams must be carefully chosen. We explore two alternatives, 
100: the power  divergence family of goodness of fit statistics and the Dice 
101: Coefficient, a measure of association related to pointwise 
102: Mutual Information. 
103: 
104: Figure \ref{fig:bigram} summarizes the notation for word and bigram counts 
105: used in this paper by way of a $2 \times 2$ contingency table. The value 
106: of $n_{11}$ shows how many times the bigram {\it big cat} occurs in the 
107: corpus. The value  of $n_{12}$ shows how often bigrams occur where  {\it 
108: big} is the first word  and {\it cat} is not the second. The counts in 
109: $n_{1+}$ and $n_{+1}$ indicate how often the words {\it big} and {\it cat} 
110: occur as the first and second words, respectively, of any bigram in the corpus. The 
111: total number of bigrams in the corpus is represented by $n_{++}$. 
112: 
113: \begin{figure}
114: \begin{center}
115: \begin{tabular}[c]{@{}r|c|c|@{}l@{}} 
116: \multicolumn{4}{c}{ } \\
117: \multicolumn{1}{l}{ } &
118: \multicolumn{2}{c}{} \\ 
119: \multicolumn{2}{r}{cat} & 
120: \multicolumn{1}{c|}{$\neg${cat}} &
121: \multicolumn{1}{c}{{totals}} \\
122: \cline{2-4}
123: big & $n_{11}$=$\hfill$10& $n_{12}$=$\hfill$20& $n_{1+}$=$\hfill$30\\
124: \cline{2-3}
125: $\neg${big} & $n_{21}$=$\hfill$40&$n_{22}$=$\hfill$930&$n_{2+}$=$\hfill$970\\
126: \cline{1-4}
127: totals& 
128: \multicolumn{1}{r}{$n_{+1}$=50} & $n_{+2}$=950 & $n_{++}$=1000  \\
129: \multicolumn{4}{c}{ } \\
130: \end{tabular}
131: \caption{Representation of Bigram Counts}
132: \label{fig:bigram}
133: \end{center}
134: \end{figure}
135: 
136: \subsection{The Power Divergence Family}
137: 
138: \cite{CressieR84} introduce the power divergence family of goodness of fit 
139: statistics. A number of well known statistics belong to this family, 
140: including the  likelihood ratio statistic $G^2$ and Pearson's $X^2$ 
141: statistic. 
142: 
143: These measure the divergence of the observed ($n_{ij}$) and expected 
144: ($m_{ij}$) bigram counts, where $m_{ij}$ is estimated based on 
145: the assumption that the component words in the bigram occur together
146: strictly by chance:
147: \begin{eqnarray*}
148: m_{ij} = \frac{n_{i+} * n_{+j}}{n_{++}}
149: \end{eqnarray*}
150: 
151: Given this value, $G^2$ and $X^2$ are calculated as:
152: \begin{eqnarray*}
153: G^2 = 2 \sum_{i,j} n_{ij} * \log \frac{n_{ij}}{m_{ij}} \ \ \ \ 
154: \end{eqnarray*}
155: \begin{eqnarray*}
156: X^2 = \sum_{i,j} \frac{(n_{ij} - m_{ij})^2}{m_{ij}}
157: \label{eq:x2}
158: \end{eqnarray*}     
159: 
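As a worked illustration (not part of the experiments reported here), 
consider the counts in Figure \ref{fig:bigram}. Assuming natural 
logarithms in $G^2$, the expected counts are:
\begin{eqnarray*}
m_{11} & = & \frac{30 * 50}{1000} = 1.5 \\
m_{12} & = & \frac{30 * 950}{1000} = 28.5 \\
m_{21} & = & \frac{970 * 50}{1000} = 48.5 \\
m_{22} & = & \frac{970 * 950}{1000} = 921.5
\end{eqnarray*}
Each observed count differs from its expected count by 8.5 in absolute 
value, so that $(n_{ij} - m_{ij})^2 = 72.25$ in every cell and:
\begin{eqnarray*}
X^2 & = & 72.25 * \Bigl( \frac{1}{1.5} + \frac{1}{28.5} + \frac{1}{48.5} 
+ \frac{1}{921.5} \Bigr) \approx 52.3 \\
G^2 & = & 2 * \Bigl( 10 \log \frac{10}{1.5} + 20 \log \frac{20}{28.5} \\
 & & + \ 40 \log \frac{40}{48.5} + 930 \log \frac{930}{921.5} \Bigr) 
\approx 25.4
\end{eqnarray*}
Both statistics indicate that {\it big cat} occurs far more often than 
chance would predict, yet $X^2$ is roughly twice $G^2$ here. This is the 
kind of disagreement that can arise when an expected count is small, as 
with $m_{11} = 1.5$.
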
160: \cite{Dunning93} argues in favor of $G^2$ over $X^2$, especially when
161: dealing with very sparse and skewed data distributions.  However,  
162: \cite{CressieR84} suggest that there are cases where Pearson's statistic 
163: is more reliable than the likelihood ratio and that one test should not
164: always be preferred over the other. In light of this,  
165: \cite{Pedersen96}  presents Fisher's exact test as an alternative since 
166: it does not rely on the distributional assumptions that underlie both  
167: Pearson's test and the likelihood ratio.  
168: 
169: Unfortunately it is usually not clear which test is most appropriate
170: for a particular sample of data.  We take the following 
171: approach, based on the observation that all tests should assign
172: approximately the same  
173: measure of statistical significance when the bigram counts in the 
174: contingency table do not violate any of the distributional assumptions  
175: that underlie the goodness of fit statistics. We perform tests using
176: $X^2$, $G^2$, and Fisher's exact test for each bigram.  If the 
177: resulting measures of statistical significance differ, then the  
178: distribution of the bigram counts is causing at least one of the tests to 
179: become unreliable. When this occurs we rely upon the value from Fisher's  
180: exact test since it  makes fewer assumptions about the underlying 
181: distribution of data. 
182: %%Since Fisher's exact test can be computationally 
183: %%complex, a practical shortcut is to perform both the $X^2$ and $G^2$ 
184: %%tests. If they produce comparable  
185: %%results then they are reliable and Fisher's exact test need not be  
186: %%included.
187: 
188: For the experiments in this paper, we identified the top 100 ranked 
189: bigrams that occur more than 5 times in the training corpus associated
190: with a word. There were no  cases where rankings produced by $G^2$, $X^2$, 
191: and Fisher's exact test disagreed, which is not altogether surprising 
192: given that low frequency bigrams were excluded. Since all of these 
193: statistics produced the same rankings, hereafter we make no distinction
194: among them and simply  refer to them generically as the power divergence  
195: statistic. 
196: 
197: \subsection{Dice Coefficient}
198: 
199: The Dice Coefficient is a descriptive statistic that provides a
200: measure of association between two words in a corpus. It is similar to 
201: pointwise Mutual Information, a widely used measure that was first 
202: introduced for identifying lexical relationships in 
203: \cite{ChurchH90}. Pointwise Mutual Information can be defined as follows: 
204: \begin{eqnarray*}
205: MI(w_1,w_2) = \log_2 \frac{n_{11} * n_{++}}{n_{+1} * n_{1+}}
206: \end{eqnarray*}
207: where $w_1$ and $w_2$ represent the two words that make up the bigram. 
208: %$n_{11}$ represents the number of times the two words occur 
209: %together as a bigram, $n_{+1}$ and $n_{1+}$ are the 
210: %number of times the words occur as the first and second words 
211: %of a bigram, and $n_{++}$ represents the total number of bigrams in the  
212: %corpus. 
213: 
214: Pointwise Mutual Information quantifies how often two words occur
215: together in a bigram (the numerator) relative to how often they occur
216: overall in the corpus (the denominator). However, there is 
217: a curious limitation to pointwise Mutual Information. A bigram $w_1w_2$  
218: that occurs $n_{11}$ times in the corpus, and whose component words $w_1$  
219: and $w_2$ only occur as a part of that bigram, will result in  
220: increasingly strong  measures of association as the value of $n_{11}$  
221: decreases. 
222: Thus, the maximum pointwise Mutual Information in a given corpus
223: will be assigned to bigrams that occur one time, and whose component words 
224: never occur outside that bigram. These are usually not the bigrams that
225: prove most useful for disambiguation, yet they will dominate a ranked
226: list as determined by pointwise Mutual Information. 
227: 
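To make this limitation concrete, consider the special case just 
described, where $n_{11} = n_{1+} = n_{+1}$. Substituting into the 
definition of pointwise Mutual Information gives:
\begin{eqnarray*}
MI(w_1,w_2) = \log_2 \frac{n_{11} * n_{++}}{n_{11} * n_{11}} 
= \log_2 \frac{n_{++}}{n_{11}}
\end{eqnarray*}
For a fixed corpus size $n_{++}$ this value is largest when $n_{11} = 1$. 
As a purely illustrative case, in a corpus of 1000 bigrams such a bigram 
that occurs once scores $\log_2 1000 \approx 9.97$, while one that occurs 
10 times scores only $\log_2 100 \approx 6.64$.
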
228: The Dice Coefficient overcomes this limitation, and can be defined as 
229: follows: 
230: 
231: \begin{eqnarray*}
232: Dice(w_1,w_2) = \frac{2* n_{11}}{n_{+1} + n_{1+}}
233: \end{eqnarray*}
234: 
235: When $n_{11} = n_{1+} = n_{+1}$ the value of $Dice(w_1,w_2)$ will be 1 for 
236: all values of $n_{11}$.  When the value of $n_{11}$ is less than either of the 
237: marginal totals (the more typical case) the rankings produced by the Dice 
238: Coefficient are similar to those of Mutual Information. The relationship 
239: between pointwise Mutual Information and the Dice Coefficient is also 
240: discussed in \cite{SmadjaMH96}. 
241: 
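As a worked illustration (not taken from the experiments), the counts for 
{\it big cat} in Figure \ref{fig:bigram} yield the following values under 
the two definitions above:
\begin{eqnarray*}
Dice(big,cat) & = & \frac{2 * 10}{50 + 30} = 0.25 \\
MI(big,cat) & = & \log_2 \frac{10 * 1000}{50 * 30} \approx 2.74
\end{eqnarray*}
The Dice Coefficient is bounded between 0 and 1, whereas the upper limit 
of pointwise Mutual Information grows with the size of the corpus.
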
242: We have developed the Bigram Statistics Package to produce ranked lists of
243: bigrams using a range of tests. This software is written in Perl and 
244: is freely available from www.d.umn.edu/\~{}tpederse. 
245: 
246: \section{Learning Decision Trees}
247: 
248: Decision trees are among the most widely used machine learning algorithms.
249: They perform a general to specific search of a feature space, adding
250: the most informative features to a tree structure as the search proceeds.
251: The objective is to select a minimal set of features that efficiently 
252: partitions the feature space into classes of observations and assemble
253: them into a tree.  In our case, the observations are manually 
254: sense--tagged  examples of an  ambiguous word in context and the 
255: partitions correspond to the different possible senses. 
256: 
257: Each feature selected during the search process is represented by 
258: a node in the learned decision tree. Each node represents a choice
259: point between a number of different possible values for a feature. 
260: Learning continues until all the training examples are accounted for
261: by the decision tree. In general, such a tree will be overly specific
262: to the training data and not generalize well to new examples. Therefore 
263: learning is followed by a pruning step where some nodes are eliminated or
264: reorganized to produce a tree that can generalize to new circumstances. 
265: 
266: Test instances are disambiguated by finding a path through the learned
267: decision tree from the root to a leaf node that corresponds with the 
268: observed features. An instance of an ambiguous word is disambiguated by  
269: passing it through a series of tests, where each test asks if a  
270: particular bigram occurs in the available window of context. 
271: 
272: We also include three benchmark learning algorithms in this study: the 
273: majority classifier, the decision stump, and the Naive Bayesian
274: classifier. 
275: 
276: The {\it majority classifier} assigns the most common sense in the
277: training data to every instance in the test data. 
278: A {\it decision stump} is a one--node decision tree \cite{Holte93} that is
279: created by stopping the decision tree learner after the single most
280: informative feature is added to the tree. 
281: 
282: The {\it Naive  Bayesian classifier} \cite{DudaH73} is based on certain 
283: blanket  assumptions about the interactions among  features in a 
284: corpus. There is no search of the feature space performed to build a 
285: representative model as is the case with decision trees. Instead, all 
286: features are included in the classifier and assumed to be relevant to the  
287: task at hand. There is a further assumption that each feature is 
288: conditionally independent of all other features, given the sense of 
289: the ambiguous word. It is most often used with a {\it bag of words}  
290: feature set, where every word in  the training sample is represented by a  
291: binary feature that indicates  whether or  not it occurs  in the window of  
292: context surrounding the ambiguous  word. 
293: 
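Under these assumptions the classifier reduces to a simple decision rule. 
In notation introduced here only for illustration, let $S$ denote the 
sense of the ambiguous word and let $F_1, \ldots, F_n$ denote the binary 
features with observed values $f_1, \ldots, f_n$. The sense assigned to an 
instance is then:
\begin{eqnarray*}
\hat{s} = \mathop{\rm argmax}_{s} \ p(S = s) \prod_{i=1}^{n} 
p(F_i = f_i \mid S = s)
\end{eqnarray*}
where the probabilities are estimated from the sense--tagged training 
data. The details of that estimation depend on the implementation and are 
not specified here.
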
294: We use the Weka \cite{weka} implementations of the C4.5  
295: decision tree learner (known as J48), the  decision stump, and the Naive  
296: Bayesian classifier. Weka is written in Java and is freely available from 
297: www.cs.waikato.ac.nz/\~{}ml. 
298: 
299: \section{Experimental Data}
300: 
301: Our empirical study utilizes the training and test data from the 1998 
302: SENSEVAL evaluation of word sense disambiguation systems. Ten teams 
303: participated in the supervised learning portion of this event. 
304: Additional details about the exercise, including the data and results
305: referred to in this paper, can be found at the SENSEVAL web site 
306: (www.itri.bton.ac.uk/events/senseval/) and in \cite{KilgarriffP00}. 
307: 
308: We included all 36 tasks from SENSEVAL for which training and test data 
309: were provided. Each task requires that the occurrences of a particular 
310: word in the test data be disambiguated based on a model learned from
311: the sense--tagged instances in the training data. Some words were used in 
312: multiple tasks as different parts of speech. For example, there were two 
313: tasks associated  with {\it bet}, one for its use as a noun and the other 
314: as a verb. Thus, there are 36 tasks involving the disambiguation of 29 
315: different words. 
316: 
317: The words and part of speech associated with each task  are shown in Table 
318: \ref{tab:results} in column 1. Note that the parts of speech are 
319: encoded as {\it n} for noun, {\it a} for  adjective, {\it v} for verb, and 
320: {\it p} for words where the part of speech was not provided. The numbers of 
321: test and training instances for each task are shown in columns 2 and 
322: 4. Each instance consists of the sentence in which the ambiguous word 
323: occurs as well as one or two surrounding sentences.  In general 
324: the total context available  for each ambiguous word is less than 100 
325: surrounding words. The number of distinct senses in the test data for
326: each task is shown in column 3. 
327: 
328: \section{Experimental Method}
329: 
330: The following process is repeated for each task. Capitalization and
331: punctuation are removed from the training and test data. Two feature
332: sets are selected from the training data based on the top 100 ranked 
333: bigrams according to the power divergence statistic and the Dice 
334: Coefficient. The bigram must have occurred 5 or more times to be 
335: included as a feature. This step filters out a large number of possible
336: bigrams and allows the decision tree learner to focus on a small number of
337: candidate bigrams that are likely to be helpful in the disambiguation
338: process. 
339: 
340: The training and test data are converted to feature vectors where each 
341: feature represents the occurrence of one of the bigrams that belong in 
342: the feature set. This representation of the training data is the actual input
343: to the learning algorithms.  Decision tree and decision stump learning is
344: performed twice, once using the feature set determined by the power 
345: divergence statistic and again using the feature set identified by the 
346: Dice Coefficient. The majority classifier
347: simply determines the most frequent sense in the training data and
348: assigns that to all instances in the test data. The Naive Bayesian 
349: classifier is based on a feature set where every word that occurs 5 or 
350: more times in the training data is included as a feature.  
351: 
352: All of these learned models are used to disambiguate the test data. The
353: test data is kept separate until this stage. We employ a fine grained  
354: scoring method, where a word is  counted as  correctly disambiguated only  
355: when the assigned sense  tag  exactly matches the true sense tag. No  
356: partial credit is assigned for near misses.
357: 
358: \section{Experimental Results}
359: 
360: The accuracy attained by each of the learning algorithms is shown in Table 
361: \ref{tab:results}. 
362: Column 5 reports the accuracy of the majority classifier; columns 6 and 7 
363: show the best and average accuracy reported by the 10
364: participating SENSEVAL teams. The evaluation at SENSEVAL was
365: based on precision and recall, so we converted those scores to accuracy by 
366: taking their product.  However, the best precision and recall may have 
367: come from different teams,  so the best accuracy shown in column 6 may 
368: actually be higher than that of  any single participating SENSEVAL 
369: system. The average accuracy in column 7  is the product of the average 
370: precision and recall reported for the participating SENSEVAL teams.
371: Column 8 shows the accuracy of the 
372: decision tree using the J48  learning algorithm and the
373: features identified by a power divergence statistic. 
374: Column 10 shows the accuracy of the decision tree when the Dice 
375: Coefficient selects the features. Columns 9 and 11 show  the accuracy of 
376: the decision  stump based on the power
377: divergence statistic  and the Dice
378: Coefficient, respectively. Finally, column 12 shows the accuracy of the
379: Naive Bayesian classifier based on a bag of words feature set. 
380: 
381: The most accurate method is the decision tree based on a feature set
382: determined by the power divergence statistic.  The last line of Table 
383: \ref{tab:results} shows the win-tie-loss score of the decision tree/power 
384: divergence method relative to every other method. A win shows it was more 
385: accurate than the method in the column, a loss means it was less accurate, 
386: and a tie means it was equally accurate. The decision tree/power
387: divergence method was more accurate than the best reported SENSEVAL 
388: results for 19  of the 36 tasks, and more accurate for 30 of the 36 tasks 
389: when compared to the average reported accuracy. The decision stumps also 
390: fared well, proving to be more accurate than the best SENSEVAL results for 
391: 14 of the 36 tasks. 
392: 
393: In general the feature sets selected by the power divergence statistic
394: result in more accurate decision trees than those selected by 
395: the Dice Coefficient. The power divergence tests prove to be more reliable  
396: since they account for all possible events surrounding two words  
397: $w_1$ and $w_2$: when they occur as bigram $w_1w_2$, when $w_1$ or  
398: $w_2$ occurs in a bigram without the other, and when a bigram consists of 
399: neither. The Dice Coefficient is based strictly on the event where $w_1$ 
400: and $w_2$ occur together in a bigram.
401: 
402: There are 6 tasks where the decision tree / power divergence approach is
403: less accurate than the SENSEVAL average: promise-n, scrap-n, shirt-n, 
404: amaze-v, bitter-p, and sanction-p. The most dramatic difference 
405: occurred with amaze-v, where the SENSEVAL average was 92.4\% and the 
406: decision tree accuracy was 58.6\%. However, this was an unusual task 
407: where every instance in the  test data belonged to a single sense that 
408: was a minority sense in the training data. 
409: 
410: 
411: \begin{table*}
412: \caption{Experimental Results} 
413: \label{tab:results}
414: \begin{center}
415: \begin{tabular}{crrr|rrrrrrrrr}
416: \hline
417: \hline\rule{0pt}{12pt}
418: (1)  & (2)  & (3)& (4)   & (5)  & (6)  & (7) & (8) & 
419: (9) & (10) & (11) & (12) \\
420:  &   & senses      &    &  &  &  & j48 & stump & j48 & stump & naive \\
421: word-pos & test & in test & train   & maj  & best & avg & pow & pow
422: & dice & dice & bayes \\[2pt]
423: \hline
424: accident-n & 267    &8   & 227     & 75.3 & 87.1 & 79.6 & 85.0
425: &
426: 77.2 & 83.9 & 77.2 & 83.1 &\\
427: behaviour-n & 279    &3   & 994     & 94.3 & 92.9 & 90.2 & 95.7 &
428: 95.7 & 95.7 & 95.7 & 93.2 & \\
429: bet-n & 274    &15  & 106     & 18.2 & 50.7 & 39.6 & 41.8 &
430: 34.5 & 41.8 & 34.5 & 39.3 & \\
431: excess-n & 186    &8   & 251     & 1.1  & 75.9 & 63.7 & 65.1 &
432: 38.7 & 60.8 & 38.7 & 64.5 & \\
433: float-n & 75     &12  & 61      & 45.3  & 66.1 & 45.0 & 52.0 &
434: 50.7 & 52.0 & 50.7 & 56.0 & \\
435: giant-n & 118    &7   & 355     & 49.2  & 67.6 & 56.6 & 68.6 &
436: 59.3 & 66.1 & 59.3 & 70.3 & \\
437: knee-n & 251    &22  & 435     & 48.2 & 67.4 & 56.0 & 71.3 &
438: 60.2 & 70.5 &60.2 & 64.1 & \\
439: onion-n & 214    &4   & 26      & 82.7 & 84.8 & 75.7 & 82.7 &
440: 82.7 & 82.7 & 82.7 & 82.2 & \\
441: promise-n & 113    &8   & 845     & 62.8  & 75.2 & 56.9 & 48.7 &
442: 63.7 & 55.8 & 62.8 & 78.0 & \\
443: sack-n & 82     &7  & 97       & 50.0  & 77.1 & 59.3 & 80.5 &
444: 58.5 & 80.5 & 58.5 & 74.4 & \\
445: scrap-n & 156    &14  & 27      & 41.7  & 51.6 & 35.1 & 26.3 &
446: 16.7 & 26.3 & 16.7 & 26.7 & \\
447: shirt-n & 184    &8   & 533     & 43.5 & 77.4 & 59.8 & 46.7 &
448: 43.5 & 51.1 & 43.5 & 60.9 & \\
449: amaze-v & 70     &1   & 316     & 0.0  & 100.0& 92.4 & 58.6 &
450: 12.9 & 60.0 & 12.9 & 71.4 & \\ 
451: bet-v & 117    &9   & 60      & 43.2  & 60.5 & 44.0 & 50.8 &
452: 58.5 & 52.5 & 50.8 & 58.5 & \\
453: bother-v & 209    &8   & 294     & 75.0 & 59.2 & 50.7 & 69.9 & 
454: 55.0 & 64.6 & 55.0 & 62.2 & \\
455: bury-v & 201    &14  & 272     & 38.3 & 32.7 & 22.9 & 48.8 &
456: 38.3 & 44.8 & 38.3 & 42.3 & \\
457: calculate-v & 218    &5   & 249     & 83.9 & 85.0 & 75.5 & 90.8 &
458: 88.5 & 89.9 & 88.5 & 80.7 & \\
459: consume-v & 186    &6   & 67      & 39.8 & 25.2 & 20.2 & 36.0 &
460: 34.9 & 39.8 & 34.9 & 31.7 & \\
461: derive-v & 217    &6   & 259     & 47.9 & 44.1 & 36.0 & 82.5 &
462: 52.1 & 82.5 & 52.1 & 72.4 & \\
463: float-v & 229    &16  & 183     & 33.2 & 30.8 & 22.5 & 30.1 &
464: 22.7 & 30.1 & 22.7 & 56.3 & \\
465: invade-v & 207    &6   & 64      & 40.1 & 30.9 & 25.5 & 28.0 &
466: 40.1 & 28.0 & 40.1 & 31.0 & \\
467: promise-v & 224    &6   & 1160    & 85.7 & 82.1 & 74.6 & 85.7 &
468: 84.4 & 81.7 & 81.3 & 85.3 & \\
469: sack-v & 178    &3   & 185     & 97.8 & 95.6 & 95.6 & 97.8 &
470: 97.8 & 97.8 & 97.8 & 97.2 & \\
471: scrap-v & 186    &3   & 30      & 85.5 & 80.6 & 68.6 & 85.5 &
472: 85.5 & 85.5 & 85.5 & 82.3 & \\
473: seize-v & 259    &11  & 291     & 21.2 & 51.0 & 42.1 & 52.9 &
474: 25.1 & 49.4  & 25.1 & 51.7 & \\
475: brilliant-a & 229    &10  & 442     & 45.9 & 31.7 & 26.5 & 55.9 &
476: 45.9 & 51.1 & 45.9 & 58.1 & \\
477: floating-a & 47     &5   & 41      & 57.4 & 49.3 & 27.4 & 57.4 &
478: 57.4 & 57.4 & 57.4 & 55.3 & \\
479: generous-a & 227    &6   & 307     & 28.2 & 37.5 & 30.9 & 44.9 &
480: 32.6 & 46.3 & 32.6 & 48.9 & \\
481: giant-a & 97     &5   & 302     & 94.8 & 98.0 & 93.5 & 95.9 &
482: 95.9 & 94.8 & 94.8 & 94.8 &\\
483: modest-a & 270    &9   & 374     & 61.5 & 49.6 & 44.9 & 72.2 &
484: 64.4 & 73.0 & 64.4 & 68.1 & \\
485: slight-a & 218    &6   & 385     & 91.3 & 92.7 & 81.4 & 91.3 &
486: 91.3 & 91.3 & 91.3 & 91.3 & \\
487: wooden-a & 196    &4   & 362     & 93.9 & 81.7 & 71.3 & 96.9 &
488: 96.9 & 96.9 & 96.9 & 93.9 & \\
489: band-p & 302    &29  &1326     & 77.2 & 81.7 & 75.9 & 86.1 &
490: 84.4 & 79.8 & 77.2 & 83.1 & \\
491: bitter-p & 373    &14  &144      & 27.0 & 44.6 & 39.8 & 36.4 &
492: 31.3 & 36.4 & 31.3 & 32.6 & \\
493: sanction-p & 431    &7   &96       & 57.5 & 74.8 & 62.4 & 57.5 &
494: 57.5 & 57.1 & 57.5 & 56.8 & \\
495: shake-p & 356    &36  &963      & 23.6 & 56.7 & 47.1 & 52.2 &
496: 23.6 & 50.0 & 23.6 & 46.6 & \\[2pt]
497: \hline
498: \multicolumn{4}{c|} {win-tie-loss (j48-pow vs. X)}  &
499: \multicolumn{1}{c} {23-7-6} &
500: \multicolumn{1}{c} {19-0-17} & 
501: \multicolumn{1}{c} {30-0-6} & 
502: \multicolumn{1}{c} {}& 
503: \multicolumn{1}{c} {24-9-3} & 
504: \multicolumn{1}{c} {14-15-7} & 
505: \multicolumn{1}{c} {25-9-2} & 
506: \multicolumn{1}{c} {24-1-11} & \\[2pt]
507: \hline
508: \end{tabular}
509: \end{center}
510: \end{table*}
511: %
512: 
513: \begin{table*}
514: \caption{Decision Tree and Stump Characteristics} 
515: \label{tab:stump}
516: \begin{center}
517: \begin{tabular}{c|rrr|rrr}
518: \hline
519: \hline
520: \multicolumn{1}{c|}{ } & 
521: \multicolumn{3}{c}{power divergence} &
522: \multicolumn{3}{|c}{dice coefficient} \\
523: (1) & (2) & (3) & (4) & (5) & (6) & (7) \\
524: word-pos & stump node & leaf/total & features & stump node & leaf/total
525: &features \\[2pt] 
526: \hline 
527: accident-n & by accident & 8/15 & 101 & by accident & 12/23 & 112 \\ 
528: behaviour-n & best behaviour & 2/3 & 100 & best behaviour & 2/3 & 104 \\ 
529: bet-n & betting shop & 20/39 & 50 & betting shop & 20/39 & 50 \\ 
530: excess-n & in excess & 13/25 & 104 & in excess & 11/21 & 102\\ 
531: float-n & the float & 7/13 & 13 & the float & 7/13 & 13 \\ 
532: giant-n & the giants & 16/31 & 103 & the giants & 14/27 & 78 \\ 
533: knee-n & knee injury & 23/45 & 102 & knee injury & 20/39 & 104 \\ 
534: onion-n & in the & 1/1 & 7 & in the & 1/1 & 7\\ 
535: promise-n & promise of & 95/189 & 100 & a promising & 49/97 & 107 \\ 
536: sack-n & the sack & 5/9 & 31 & the sack & 5/9 & 31 \\ 
537: scrap-n & scrap of & 7/13 & 8 & scrap of & 7/13 & 8 \\ 
538: shirt-n & shirt and & 38/75 & 101 & shirt and & 55/109 & 101 \\ 
539: amaze-v & amazed at & 11/21 & 102 & amazed at  &11/21  & 102 \\ 
540: bet-v  & i bet & 4/7 & 10 & i bet & 4/7 & 10 \\ 
541: bother-v & be bothered & 19/37 & 101 & be bothered & 20/39 & 106 \\ 
542: bury-v & buried in & 28/55 & 103 & buried in & 32/63 & 103 \\ 
543: calculate-v & calculated to  & 5/9 & 103 & calculated to & 5/9 & 103 \\ 
544: consume-v & on the & 4/7 & 20 & on the & 4/7 & 20 \\ 
545: derive-v & derived from & 10/19 & 104 & derived from & 10/19 & 104 \\ 
546: float-v & floated on & 24/47 & 80 & floated on & 24/47 & 80 \\ 
547: invade-v & to invade & 55/109 & 107 & to invade & 66/127 & 108 \\ 
548: promise-v & promise to & 3/5 & 100 & promise you  & 5/9 & 106 \\ 
549: sack-v & return to & 1/1 & 91 & return to & 1/1 & 91 \\ 
550: scrap-v & of the & 1/1 & 7 & of the & 1/1 & 7 \\ 
551: seize-v & to seize & 26/51 & 104 & to seize & 57/113 & 104 \\ 
552: brilliant-a & a brilliant & 26/51 & 101 & a brilliant & 42/83 & 103 \\ 
553: floating-a & in the & 7/13 & 10 & in the & 7/13 & 10 \\ 
554: generous-a & a generous & 57/113 & 103 & a generous & 56/111 & 102 \\ 
555: giant-a & the giant & 2/3 & 102 & a giant & 1/1 & 101 \\ 
556: modest-a & a modest & 14/27 & 101 & a modest & 10/19 & 105 \\ 
557: slight-a & the slightest & 2/3 & 105 & the slightest & 2/3 & 105 \\ 
558: wooden-a & wooden spoon & 2/3 & 104 & wooden spoon & 2/3 & 101 \\ 
559: band-p & band of & 14/27 & 100 & the band & 21/41& 117\\ 
560: bitter-p & a bitter & 22/43 & 54 & a bitter & 22/43 & 54 \\ 
561: sanction-p & south africa & 12/23 & 52 & south africa & 12/23 & 52 \\ 
562: shake-p & his head & 90/179 & 100 & his head & 81/161 & 105 \\ 
563: \hline
564: \end{tabular} 
565: \end{center}
566: \end{table*} %
567: 
568: \section{Analysis of Experimental Results}
569: 
570: The characteristics of the decision trees and decision stumps learned for
571: each word are shown in Table \ref{tab:stump}. Column 1 shows the
572: word and part of speech. Columns 2, 3, and 4 are based on the
573: feature set selected by the power divergence statistic while
574: columns 5, 6, and 7 are based on the Dice Coefficient. Columns 2 and 5 
575: show the node selected to serve as the decision stump. Columns 3 and 6
576: show the number of leaf nodes in the learned decision tree relative to the
577: number of total nodes. Columns 4 and 7 show the number of bigram
578: features selected to represent the training data. 
579: 
580: This table shows that there is little difference in the decision stump 
581: nodes selected from feature sets determined by the power divergence 
582: statistics versus the Dice Coefficient. This is to be expected
583: since the top ranked bigrams for each measure are consistent, and the
584: decision stump node is generally chosen from among those. 
585: 
586: However, there are differences between the feature sets selected by the  
587: power divergence statistics and the Dice Coefficient. These are reflected 
588: in the different sized trees that are learned based on these feature sets. 
589: The number of leaf nodes and the total number of nodes for each learned 
590: tree are shown in columns 3 and 6. 
591: The number of internal nodes is simply the difference between the
592: total nodes and the leaf nodes. 
593: Each leaf node represents the end of
594: a path through the decision tree that makes a sense distinction. 
595: Since a bigram feature can only appear once in the 
596: decision tree, the number of internal nodes represents the number of 
597: bigram features selected by the decision tree learner. 
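
As a worked illustration of this arithmetic, let $\ell$ denote the number
of leaf nodes, $n$ the total number of nodes, and $i$ the number of internal
nodes (these symbols are introduced here only for illustration). Assuming
that each internal node tests a single binary bigram feature and therefore
has exactly two children, $n = 2\ell - 1$, which agrees with every entry
in columns 3 and 6. For the {\it modest-a} tree learned from the power
divergence feature set (14/27 in Table \ref{tab:stump}):
\[
i = n - \ell = 27 - 14 = 13 ,
\]
so 13 of the 101 candidate bigrams for that word appear in the learned tree.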
598: 
599: One of our original hypotheses was that accurate decision trees of 
600: bigrams will include a relatively small number of features. This
601: was motivated by the success of decision stumps in performing
602: disambiguation based on a single bigram feature. 
603: In these experiments, there were no decision trees that used all of the  
604: bigram features identified by the filtering step, and for many words the 
605: decision tree learner went on to eliminate most of the candidate 
606: features. This can be seen by comparing the number of internal nodes with 
the number of candidate features as shown in columns 4 or 7.\footnote{For 
most words the 100 top ranked bigrams form the set of candidate features 
presented to the decision tree learner. If 
there are ties in the top 100 rankings then there may be more than 100  
features, and if there are fewer than 100 bigrams that occur more  
than 5 times then all such bigrams are included in the feature set.} 
613: 
614: It is also noteworthy that the bigrams ultimately selected by the decision 
615: tree learner for inclusion in the tree do not always include those
616: bigrams ranked most highly by the power divergence statistic or the Dice 
617: Coefficient. This is to be expected, since the selection of the bigrams 
618: from raw text is only measuring the association between two words, while 
the decision tree seeks bigrams that partition instances of the ambiguous 
word into distinct senses. In particular, the decision tree learner 
decides which bigrams to include as nodes in the tree using 
the gain ratio, a measure based on the overall mutual information
between the bigram and a particular word sense.
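
For reference, the gain ratio can be sketched in its standard C4.5-style
formulation; the notation below is introduced only for illustration and is
assumed rather than taken from the experiments. Let $S$ be the set of
training instances of the ambiguous word, $B$ a binary bigram feature,
$S_v$ the subset of $S$ in which $B$ takes value $v \in \{0,1\}$, and
$H(\cdot)$ the entropy of the sense labels in a set of instances. Then
\[
\mathrm{Gain}(S,B) = H(S) - \sum_{v \in \{0,1\}} \frac{|S_v|}{|S|} \, H(S_v) ,
\]
\[
\mathrm{GainRatio}(S,B) = \frac{\mathrm{Gain}(S,B)}
{- \sum_{v \in \{0,1\}} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}} .
\]
The numerator is the mutual information between the bigram feature and the
sense label as estimated from the training data, while the denominator
normalizes by how the feature splits the instances. A bigram can therefore
score highly here without being strongly associated as a word pair, which
is what the filtering measures capture.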
624: 
625: Finally, note that the smallest decision trees are functionally equivalent  
626: to our benchmark methods. A decision tree with 1 leaf node and 
627: no internal nodes (1/1)  acts as a majority  classifier. A  decision tree 
628: with  2 leaf nodes and 1 internal node (2/3) has the structure of a 
629: decision stump. 
630: 
631: 
632: \section{Discussion}
633: 
634: One of our long-term objectives is to identify a core set of features 
635: that will be useful for disambiguating a wide class of words using both
636: supervised and unsupervised methodologies.
637: 
We have previously presented an ensemble approach to word sense 
disambiguation \cite{Pedersen00b} in which multiple Naive Bayesian 
classifiers, each based on co--occurrence features from varying sized 
windows of context, are shown to perform well on the widely studied nouns 
{\it interest} and
642: {\it line}. While the accuracy of this approach was as good as any
643: previously published results, the learned models were complex and 
644: difficult to interpret, in effect acting as very accurate black boxes. 
645: 
646: Our experience has been that variations in learning algorithms 
647: are far less significant contributors to disambiguation 
648: accuracy than are variations in the feature set.  In other words, an  
649: informative feature set will result in accurate disambiguation when used 
650: with a wide range of learning algorithms, but there is no  
651: learning algorithm that can perform well given an uninformative or 
652: misleading set of features.  Therefore, our focus is on developing and 
653: discovering feature sets that make distinctions among word senses. Our  
654: learning algorithms must not only produce accurate models, but they 
655: should also shed new light on the relationships among features and allow  
656: us to continue refining and understanding our feature sets. 
657: 
658: We believe that decision trees meet these criteria. A wide range of 
659: implementations are available, and they are known to be robust and 
660: accurate across a range of domains. Most important, their structure is 
661: easy to interpret and may provide insights into the relationships that 
662: exist among features and more general rules of disambiguation. 
663: 
664: \section{Related Work}
665: 
666: Bigrams have been used as features for word sense disambiguation,
667: particularly in the form of collocations where the ambiguous word is one
668: component of the bigram (e.g.,  \cite{BruceW94b}, \cite{NgL96}, 
669: \cite{Yarowsky95}). While some of the bigrams we identify are collocations 
670: that include the word being disambiguated, there is no requirement that 
671: this be the case. 
672: 
673: Decision trees have been used in supervised learning approaches to word
674: sense disambiguation, and have fared well in a number of comparative 
studies (e.g., \cite{Mooney96}, \cite{PedersenB97A}).  In the former they 
were used with bag-of-words feature sets and in the latter they were 
used with a mixed feature set that included the part-of-speech of 
neighboring words, three collocations, and the morphology of the ambiguous 
word. We believe that this paper is the first to employ
decision trees based strictly on bigram features. 
681: 
The decision list is a closely related approach 
that has also been applied to 
word sense disambiguation (e.g., \cite{Yarowsky94}, \cite{WilksS98},
\cite{Yarowsky00}). Rather than building and traversing a tree to perform
disambiguation, this approach employs a list. In the general case 
687: a decision list may suffer from less fragmentation during learning than 
688: decision trees; as a practical matter this means that the decision list
689: is less likely to be over--trained. However, we believe that fragmentation
690: also reflects on the feature set used for learning.  Ours consists of at 
691: most approximately 100 binary features. This  results in a relatively 
692: small feature space that is not as likely to suffer from fragmentation as 
693: are larger spaces. 
694: 
695: \section{Future Work}
696: 
697: There are a number of immediate extensions to this work. The first is to 
698: ease the requirement that bigrams be made up of two consecutive words. 
699: Rather, we will search for bigrams where the component words may be
700: separated by other words in the text. The second is to eliminate the
701: filtering step by which candidate bigrams are selected by a power
702: divergence statistic. Instead, the decision tree learner would consider 
703: all possible bigrams. Despite increasing the danger of fragmentation, 
704: this is an interesting issue since the bigrams judged most informative by 
705: the decision tree learner are not always ranked highly in the filtering
706: step. In particular, we will determine if the filtering process ever 
707: eliminates bigrams that could be significant sources of disambiguation 
708: information. 
709: 
710: In the longer term, we hope to adapt this approach to unsupervised 
711: learning, where disambiguation is performed without the benefit of sense 
712: tagged text. We are optimistic that this is viable, since bigram features 
713: are easy to identify in raw text. 
714: 
715: \section{Conclusion}
716: 
717: This paper shows that the combination of a simple feature set made 
718: up of bigrams and a standard decision tree learning algorithm 
719: results in accurate word sense disambiguation. The results of this 
720: approach are compared with those from the 1998 SENSEVAL word sense    
721: disambiguation exercise and show that the bigram based decision tree 
722: approach is more accurate than the best SENSEVAL results for 19 of 36 
723: words.
724: 
725: \section{Acknowledgments}
726: 
727: The Bigram Statistics Package has been implemented by Satanjeev Banerjee, 
728: who is supported by a Grant--in--Aid of Research, Artistry and Scholarship 
729: from the Office of the Vice President for Research and the Dean of the  
730: Graduate School of the University of Minnesota. We would like to thank  
731: the SENSEVAL organizers for making the data and results from the 1998  
732: event freely available. The comments of three anonymous reviewers were  
733: very helpful in preparing the final version of this paper. A preliminary 
734: version of this paper appears in \cite{Pedersen01a}. 
735: 
736: %
737: % ---- Bibliography ----
738: %
739: 
740: %\bibliography{/home/cs/tpederse/TeX/Papers/papers/bib/tdp}
741: %\bibliographystyle{/home/cs/tpederse/TeX/Papers/papers/sty/acl}
742: 
743: \begin{thebibliography}{}
744: 
745: \bibitem[\protect\citename{Bruce and Wiebe}1994]{BruceW94b}
746: R.~Bruce and J.~Wiebe.
747: \newblock 1994.
748: \newblock Word-sense disambiguation using decomposable models.
749: \newblock In {\em Proceedings of the 32nd Annual Meeting of the Association for
750:   Computational Linguistics}, pages 139--146.
751: 
752: \bibitem[\protect\citename{Church and Hanks}1990]{ChurchH90}
753: K.~Church and P.~Hanks.
754: \newblock 1990.
755: \newblock Word association norms, mutual information and lexicography.
756: \newblock In {\em Proceedings of the 28th Annual Meeting of the Association for
757:   Computational Linguistics}, pages 76--83.
758: 
759: \bibitem[\protect\citename{Cressie and Read}1984]{CressieR84}
760: N.~Cressie and T.~Read.
761: \newblock 1984.
762: \newblock Multinomial goodness of fit tests.
\newblock {\em Journal of the Royal Statistical Society, Series B}, 46:440--464.
764: 
765: \bibitem[\protect\citename{Duda and Hart}1973]{DudaH73}
766: R.~Duda and P.~Hart.
767: \newblock 1973.
768: \newblock {\em Pattern Classification and Scene Analysis}.
769: \newblock Wiley, New York, NY.
770: 
771: \bibitem[\protect\citename{Dunning}1993]{Dunning93}
772: T.~Dunning.
773: \newblock 1993.
774: \newblock Accurate methods for the statistics of surprise and coincidence.
775: \newblock {\em Computational Linguistics}, 19(1):61--74.
776: 
777: \bibitem[\protect\citename{Holte}1993]{Holte93}
778: R.~Holte.
779: \newblock 1993.
780: \newblock Very simple classification rules perform well on most commonly used
781:   datasets.
782: \newblock {\em Machine Learning}, 11:63--91.
783: 
784: \bibitem[\protect\citename{Kilgarriff and Palmer}2000]{KilgarriffP00}
785: A.~Kilgarriff and M.~Palmer.
786: \newblock 2000.
787: \newblock Special issue on {SENSEVAL}: Evaluating word sense disambiguation
788:   programs.
789: \newblock {\em Computers and the Humanities}, 34(1--2).
790: 
791: \bibitem[\protect\citename{Mooney}1996]{Mooney96}
792: R.~Mooney.
793: \newblock 1996.
794: \newblock Comparative experiments on disambiguating word senses: An
795:   illustration of the role of bias in machine learning.
796: \newblock In {\em Proceedings of the Conference on Empirical Methods in Natural
797:   Language Processing}, pages 82--91, May.
798: 
799: \bibitem[\protect\citename{Ng and Lee}1996]{NgL96}
800: H.T. Ng and H.B. Lee.
801: \newblock 1996.
802: \newblock Integrating multiple knowledge sources to disambiguate word sense: An
803:   exemplar-based approach.
804: \newblock In {\em Proceedings of the 34th Annual Meeting of the Association for
805:   Computational Linguistics}, pages 40--47.
806: 
807: \bibitem[\protect\citename{Pedersen and Bruce}1997]{PedersenB97A}
808: T.~Pedersen and R.~Bruce.
809: \newblock 1997.
810: \newblock A new supervised learning algorithm for word sense disambiguation.
811: \newblock In {\em Proceedings of the Fourteenth National Conference on
812:   Artificial Intelligence}, pages 604--609, Providence, RI, July.
813: 
814: \bibitem[\protect\citename{Pedersen}1996]{Pedersen96}
815: T.~Pedersen.
816: \newblock 1996.
817: \newblock Fishing for exactness.
818: \newblock In {\em Proceedings of the South Central SAS User's Group (SCSUG-96)
819:   Conference}, pages 188--200, Austin, TX, October.
820: 
821: \bibitem[\protect\citename{Pedersen}2000]{Pedersen00b}
822: T.~Pedersen.
823: \newblock 2000.
824: \newblock A simple approach to building ensembles of naive bayesian classifiers
825:   for word sense disambiguation.
826: \newblock In {\em Proceedings of the First Annual Meeting of the North American
827:   Chapter of the Association for Computational Linguistics}, pages 63--69,
828:   Seattle, WA, May.
829: 
830: \bibitem[\protect\citename{Pedersen}2001]{Pedersen01a}
831: T.~Pedersen.
832: \newblock 2001.
833: \newblock Lexical semantic ambiguity resolution with bigram--based decision
834:   trees.
835: \newblock In {\em Proceedings of the Second International Conference on
836:   Intelligent Text Processing and Computational Linguistics}, pages 157--168,
837:   Mexico City, February.
838: 
839: \bibitem[\protect\citename{Smadja \bgroup et al.\egroup }1996]{SmadjaMH96}
840: F.~Smadja, K.~McKeown, and V.~Hatzivassiloglou.
841: \newblock 1996.
842: \newblock Translating collocations for bilingual lexicons: A statistical
843:   approach.
844: \newblock {\em Computational Linguistics}, 22(1):1--38.
845: 
846: \bibitem[\protect\citename{Wilks and Stevenson}1998]{WilksS98}
847: Y.~Wilks and M.~Stevenson.
848: \newblock 1998.
849: \newblock Word sense disambiguation using optimised combinations of knowledge
850:   sources.
851: \newblock In {\em Proceedings of COLING/ACL-98}.
852: 
853: \bibitem[\protect\citename{Witten and Frank}2000]{weka}
854: I.~Witten and E.~Frank.
855: \newblock 2000.
856: \newblock {\em Data Mining - Practical Machine Learning Tools and Techniques
857:   with Java Implementations}.
858: \newblock Morgan--Kaufmann, San Francisco, CA.
859: 
860: \bibitem[\protect\citename{Yarowsky}1994]{Yarowsky94}
861: D.~Yarowsky.
862: \newblock 1994.
\newblock Decision lists for lexical ambiguity resolution: Application to
  accent restoration in {S}panish and {F}rench.
865: \newblock In {\em Proceedings of the 32nd Annual Meeting of the Association for
866:   Computational Linguistics}.
867: 
868: \bibitem[\protect\citename{Yarowsky}1995]{Yarowsky95}
869: D.~Yarowsky.
870: \newblock 1995.
871: \newblock Unsupervised word sense disambiguation rivaling supervised methods.
872: \newblock In {\em Proceedings of the 33rd Annual Meeting of the Association for
873:   Computational Linguistics}, pages 189--196, Cambridge, MA.
874: 
875: \bibitem[\protect\citename{Yarowsky}2000]{Yarowsky00}
876: D.~Yarowsky.
877: \newblock 2000.
878: \newblock Hierarchical decision lists for word sense disambiguation.
879: \newblock {\em Computers and the Humanities}, 34(1--2).
880: 
881: \end{thebibliography}
882: \end{document}
883: