1: \documentclass[]{article}
2: \usepackage{naacl2001}
3:
4: \title{A Decision Tree of Bigrams\\is an Accurate Predictor of Word Sense}
5:
6: \author{Ted Pedersen\\
7: Department of Computer Science\\
8: University of Minnesota Duluth
9: \\Duluth, MN 55812 USA\\
{\tt tpederse@d.umn.edu}}
11:
12: %%%\tt{http://www.d.umn.edu/\~{}tpederse}}
13:
14:
15: \begin{document}
16: \maketitle
17:
18: \begin{abstract}
19: This paper presents a corpus-based approach to word sense
20: disambiguation where a decision tree assigns a sense to an ambiguous word
21: based on the bigrams that occur nearby. This approach is evaluated using
22: the sense-tagged corpora from the 1998 SENSEVAL word sense disambiguation
23: exercise. It is more accurate than the average results reported for 30 of
24: 36 words, and is more accurate than the best results for 19 of 36 words.
25: \end{abstract}
26:
27: \section{Introduction}
28:
29: Word sense disambiguation is the process of selecting the most appropriate
30: meaning for a word, based on the context in which it occurs. For our
31: purposes it is assumed that the set of possible meanings, i.e., the sense
32: inventory, has already been determined. For example, suppose {\it bill}
33: has the following set of possible meanings: a piece of currency, pending
34: legislation, or a bird jaw. When used in the context of {\it The Senate
35: bill is under consideration}, a human reader immediately understands that
36: {\it bill} is being used in the legislative sense. However, a computer
37: program attempting to perform the same task faces a difficult problem
38: since it does not have the benefit of innate common--sense or linguistic
39: knowledge.
40:
41: Rather than attempting to provide computer programs with real--world
42: knowledge comparable to that of humans, natural language processing has
43: turned to {\it corpus--based} methods. These approaches use techniques
44: from statistics and machine learning to induce models of language
45: usage from large samples of text. These models are trained to perform
46: particular tasks, usually via supervised learning. This paper describes an
47: approach where a {\it decision tree} is learned from some number of
48: sentences where each instance of an ambiguous word has been manually
49: annotated with a sense--tag that denotes the most appropriate sense for
50: that context.
51:
52: Prior to learning, the sense--tagged corpus must be converted into a more
53: regular form suitable for automatic processing. Each sense--tagged
54: occurrence of an ambiguous word is converted into a feature vector,
55: where each feature represents some property of the surrounding text that
56: is considered to be relevant to the disambiguation process. Given the
57: flexibility and complexity of human language, there is potentially an
58: infinite set of features that could be utilized. However, in
59: corpus--based approaches features usually consist of information that can
60: be readily identified in the text, without relying on extensive external
61: knowledge sources. These typically include the part--of--speech of
62: surrounding words, the presence of certain key words within some window
63: of context, and various syntactic properties of the sentence and the
64: ambiguous word.
65:
66: The approach in this paper relies upon a feature set made up of {\it
67: bigrams}, two word sequences that occur in a text. The context in which
68: an ambiguous word occurs is represented by some number of binary
69: features that indicate whether or not a particular bigram has occurred
70: within approximately 50 words to the left or right of the word being
71: disambiguated.
72:
73: We take this approach since surface lexical features
74: like bigrams, collocations, and co--occurrences often contribute a great
75: deal to disambiguation accuracy. It is not clear how much
76: disambiguation accuracy is improved through the use of features
77: that are identified by more complex pre--processing such as
78: part--of--speech tagging, parsing, or anaphora resolution. One of our
objectives is to establish a clear upper bound on the accuracy of
80: disambiguation using feature sets that do not impose substantial
81: pre--processing requirements.
82:
83: This paper continues with a discussion of our methods for identifying
84: the bigrams that should be included in the feature set for learning. Then
85: the decision tree learning algorithm is described, as are some benchmark
86: learning algorithms that are included for purposes of comparison. The
87: experimental data is discussed, and then the empirical results are
88: presented. We close with an analysis of our findings and a discussion of
89: related work.
90:
91: \section{Building a Feature Set of Bigrams}
92:
93: We have developed an approach to word sense disambiguation that
94: represents text entirely in terms of the occurrence of bigrams, which we
95: define to be two consecutive words that occur in a text. The
96: distributional characteristics of bigrams are fairly consistent across
97: corpora; a majority of them only occur one time. Given the sparse and
98: skewed nature of this data, the statistical methods used to select
99: interesting bigrams must be carefully chosen. We explore two alternatives,
100: the power divergence family of goodness of fit statistics and the Dice
101: Coefficient, an information theoretic measure related to pointwise
102: Mutual Information.
103:
104: Figure \ref{fig:bigram} summarizes the notation for word and bigram counts
105: used in this paper by way of a $2 \times 2$ contingency table. The value
106: of $n_{11}$ shows how many times the bigram {\it big cat} occurs in the
107: corpus. The value of $n_{12}$ shows how often bigrams occur where {\it
108: big} is the first word and {\it cat} is not the second. The counts in
$n_{1+}$ and $n_{+1}$ indicate how often the words {\it big} and {\it cat}
occur as the first and second words, respectively, of any bigram in the corpus. The
111: total number of bigrams in the corpus is represented by $n_{++}$.
112:
113: \begin{figure}
114: \begin{center}
115: \begin{tabular}[c]{@{}r|c|c|@{}l@{}}
116: \multicolumn{4}{c}{ } \\
117: \multicolumn{1}{l}{ } &
118: \multicolumn{2}{c}{} \\
119: \multicolumn{2}{r}{cat} &
120: \multicolumn{1}{c|}{$\neg${cat}} &
121: \multicolumn{1}{c}{{totals}} \\
122: \cline{2-4}
123: big & $n_{11}$=$\hfill$10& $n_{12}$=$\hfill$20& $n_{1+}$=$\hfill$30\\
124: \cline{2-3}
125: $\neg${big} & $n_{21}$=$\hfill$40&$n_{22}$=$\hfill$930&$n_{2+}$=$\hfill$970\\
126: \cline{1-4}
127: totals&
128: \multicolumn{1}{r}{$n_{+1}$=50} & $n_{+2}$=950 & $n_{++}$=1000 \\
129: \multicolumn{4}{c}{ } \\
130: \end{tabular}
131: \caption{Representation of Bigram Counts}
132: \label{fig:bigram}
133: \end{center}
134: \end{figure}
135:
136: \subsection{The Power Divergence Family}
137:
138: \cite{CressieR84} introduce the power divergence family of goodness of fit
139: statistics. A number of well known statistics belong to this family,
140: including the likelihood ratio statistic $G^2$ and Pearson's $X^2$
141: statistic.
142:
143: These measure the divergence of the observed ($n_{ij}$) and expected
144: ($m_{ij}$) bigram counts, where $m_{ij}$ is estimated based on
145: the assumption that the component words in the bigram occur together
146: strictly by chance:
147: \begin{eqnarray*}
148: m_{ij} = \frac{n_{i+} * n_{+j}}{n_{++}}
149: \end{eqnarray*}
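
As a worked illustration, the counts shown in Figure \ref{fig:bigram}
yield the following expected counts for {\it big cat} under the
assumption of independence:
\begin{eqnarray*}
m_{11} & = & \frac{30 * 50}{1000} = 1.5 \\
m_{12} & = & \frac{30 * 950}{1000} = 28.5 \\
m_{21} & = & \frac{970 * 50}{1000} = 48.5 \\
m_{22} & = & \frac{970 * 950}{1000} = 921.5
\end{eqnarray*}
The observed count $n_{11} = 10$ is well above the expected count
$m_{11} = 1.5$, which is the kind of divergence that the goodness of fit
statistics are intended to detect.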
150:
151: Given this value, $G^2$ and $X^2$ are calculated as:
152: \begin{eqnarray*}
153: G^2 = 2 \sum_{i,j} n_{ij} * \log \frac{n_{ij}}{m_{ij}} \ \ \ \
154: \end{eqnarray*}
155: \begin{eqnarray*}
156: X^2 = \sum_{i,j} \frac{(n_{ij} - m_{ij})^2}{m_{ij}}
157: \label{eq:x2}
158: \end{eqnarray*}
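
For illustration, consider the counts in Figure \ref{fig:bigram}, whose
expected counts are $m_{11} = 1.5$, $m_{12} = 28.5$, $m_{21} = 48.5$, and
$m_{22} = 921.5$. Each observed count differs from its expected count by
$8.5$, so every squared difference is $72.25$. Assuming the natural
logarithm in $G^2$ (the base is an assumption made only for this
example), the statistics come out to approximately:
\begin{eqnarray*}
X^2 & = & \frac{72.25}{1.5} + \frac{72.25}{28.5} + \frac{72.25}{48.5}
          + \frac{72.25}{921.5} \approx 52.3 \\
G^2 & = & 2 \, ( 10 \log \frac{10}{1.5} + 20 \log \frac{20}{28.5} \\
    &   & + \; 40 \log \frac{40}{48.5} + 930 \log \frac{930}{921.5} )
          \approx 25.4
\end{eqnarray*}
Both values lie far above 3.84, the critical value of the chi--squared
distribution with one degree of freedom at the 0.05 level, so in this
example either statistic would mark {\it big cat} as a strongly
associated pair even though the two values differ in magnitude.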
159:
160: \cite{Dunning93} argues in favor of $G^2$ over $X^2$, especially when
161: dealing with very sparse and skewed data distributions. However,
162: \cite{CressieR84} suggest that there are cases where Pearson's statistic
163: is more reliable than the likelihood ratio and that one test should not
164: always be preferred over the other. In light of this,
165: \cite{Pedersen96} presents Fisher's exact test as an alternative since
it does not rely on the distributional assumptions that underlie both
167: Pearson's test and the likelihood ratio.
168:
169: Unfortunately it is usually not clear which test is most appropriate
170: for a particular sample of data. We take the following
171: approach, based on the observation that all tests should assign
172: approximately the same
173: measure of statistical significance when the bigram counts in the
174: contingency table do not violate any of the distributional assumptions
that underlie the goodness of fit statistics. We perform tests using
176: $X^2$, $G^2$, and Fisher's exact test for each bigram. If the
177: resulting measures of statistical significance differ, then the
178: distribution of the bigram counts is causing at least one of the tests to
179: become unreliable. When this occurs we rely upon the value from Fisher's
180: exact test since it makes fewer assumptions about the underlying
181: distribution of data.
182: %%Since Fisher's exact test can be computationally
183: %%complex, a practical shortcut is to perform both the $X^2$ and $G^2$
184: %%tests. If they produce comparable
185: %%results then they are reliable and Fisher's exact test need not be
186: %%included.
187:
188: For the experiments in this paper, we identified the top 100 ranked
189: bigrams that occur more than 5 times in the training corpus associated
190: with a word. There were no cases where rankings produced by $G^2$, $X^2$,
191: and Fisher's exact test disagreed, which is not altogether surprising
192: given that low frequency bigrams were excluded. Since all of these
193: statistics produced the same rankings, hereafter we make no distinction
194: among them and simply refer to them generically as the power divergence
195: statistic.
196:
197: \subsection{Dice Coefficient}
198:
199: The Dice Coefficient is a descriptive statistic that provides a
measure of association between two words in a corpus. It is similar to
201: pointwise Mutual Information, a widely used measure that was first
202: introduced for identifying lexical relationships in
203: \cite{ChurchH90}. Pointwise Mutual Information can be defined as follows:
204: \begin{eqnarray*}
MI(w_1,w_2) = \log_2 \frac{n_{11} * n_{++}}{n_{+1} * n_{1+}}
206: \end{eqnarray*}
207: where $w_1$ and $w_2$ represent the two words that make up the bigram.
208: %$n_{11}$ represents the number of times the two words occur
209: %together as a bigram, $n_{+1}$ and $n_{1+}$ are the
210: %number of times the words occur as the first and second words
211: %of a bigram, and $n_{++}$ represents the total number of bigrams in the
212: %corpus.
213:
214: Pointwise Mutual Information quantifies how often two words occur
215: together in a bigram (the numerator) relative to how often they occur
216: overall in the corpus (the denominator). However, there is
217: a curious limitation to pointwise Mutual Information. A bigram $w_1w_2$
218: that occurs $n_{11}$ times in the corpus, and whose component words $w_1$
219: and $w_2$ only occur as a part of that bigram, will result in
220: increasingly strong measures of association as the value of $n_{11}$
221: decreases.
222: Thus, the maximum pointwise Mutual Information in a given corpus
223: will be assigned to bigrams that occur one time, and whose component words
224: never occur outside that bigram. These are usually not the bigrams that
225: prove most useful for disambiguation, yet they will dominate a ranked
226: list as determined by pointwise Mutual Information.
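
To see why, consider the limiting case where $n_{11} = n_{1+} = n_{+1}$.
The definition of pointwise Mutual Information then reduces to:
\begin{eqnarray*}
MI(w_1,w_2) = \log_2 \frac{n_{11} * n_{++}}{n_{11} * n_{11}}
            = \log_2 \frac{n_{++}}{n_{11}}
\end{eqnarray*}
which grows as $n_{11}$ shrinks. As a hypothetical illustration, in a
corpus of $n_{++} = 1000$ bigrams (the size used in Figure
\ref{fig:bigram}) such a bigram would score approximately:
\begin{eqnarray*}
n_{11} = 100: & & \log_2 10 \approx 3.32 \\
n_{11} = 10:  & & \log_2 100 \approx 6.64 \\
n_{11} = 1:   & & \log_2 1000 \approx 9.97
\end{eqnarray*}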
227:
228: The Dice Coefficient overcomes this limitation, and can be defined as
229: follows:
230:
231: \begin{eqnarray*}
232: Dice(w_1,w_2) = \frac{2* n_{11}}{n_{+1} + n_{1+}}
233: \end{eqnarray*}
234:
235: When $n_{11} = n_{1+} = n_{+1}$ the value of $Dice(w_1,w_2)$ will be 1 for
all values of $n_{11}$. When the value of $n_{11}$ is less than either of the
237: marginal totals (the more typical case) the rankings produced by the Dice
238: Coefficient are similar to those of Mutual Information. The relationship
239: between pointwise Mutual Information and the Dice Coefficient is also
240: discussed in \cite{SmadjaMH96}.
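
As a brief worked illustration, the counts for {\it big cat} in Figure
\ref{fig:bigram} give approximately the following values:
\begin{eqnarray*}
Dice(big,cat) & = & \frac{2 * 10}{50 + 30} = 0.25 \\
MI(big,cat) & = & \log_2 \frac{10 * 1000}{50 * 30} \approx 2.74
\end{eqnarray*}
In the limiting case where both words occur only as part of the bigram,
the Dice Coefficient becomes $2 n_{11} / (n_{11} + n_{11}) = 1$ whatever
the value of $n_{11}$, so a bigram that occurs once is not ranked above
one that occurs many times.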
241:
242: We have developed the Bigram Statistics Package to produce ranked lists of
243: bigrams using a range of tests. This software is written in Perl and
244: is freely available from www.d.umn.edu/\~{}tpederse.
245:
246: \section{Learning Decision Trees}
247:
248: Decision trees are among the most widely used machine learning algorithms.
249: They perform a general to specific search of a feature space, adding
250: the most informative features to a tree structure as the search proceeds.
251: The objective is to select a minimal set of features that efficiently
252: partitions the feature space into classes of observations and assemble
253: them into a tree. In our case, the observations are manually
254: sense--tagged examples of an ambiguous word in context and the
255: partitions correspond to the different possible senses.
256:
257: Each feature selected during the search process is represented by
258: a node in the learned decision tree. Each node represents a choice
259: point between a number of different possible values for a feature.
260: Learning continues until all the training examples are accounted for
261: by the decision tree. In general, such a tree will be overly specific
262: to the training data and not generalize well to new examples. Therefore
263: learning is followed by a pruning step where some nodes are eliminated or
264: reorganized to produce a tree that can generalize to new circumstances.
265:
266: Test instances are disambiguated by finding a path through the learned
267: decision tree from the root to a leaf node that corresponds with the
268: observed features. An instance of an ambiguous word is disambiguated by
269: passing it through a series of tests, where each test asks if a
270: particular bigram occurs in the available window of context.
271:
272: We also include three benchmark learning algorithms in this study: the
273: majority classifier, the decision stump, and the Naive Bayesian
274: classifier.
275:
276: The {\it majority classifier} assigns the most common sense in the
277: training data to every instance in the test data.
A {\it decision stump} is a one node decision tree \cite{Holte93} that is
279: created by stopping the decision tree learner after the single most
280: informative feature is added to the tree.
281:
282: The {\it Naive Bayesian classifier} \cite{DudaH73} is based on certain
283: blanket assumptions about the interactions among features in a
284: corpus. There is no search of the feature space performed to build a
285: representative model as is the case with decision trees. Instead, all
286: features are included in the classifier and assumed to be relevant to the
287: task at hand. There is a further assumption that each feature is
288: conditionally independent of all other features, given the sense of
289: the ambiguous word. It is most often used with a {\it bag of words}
290: feature set, where every word in the training sample is represented by a
291: binary feature that indicates whether or not it occurs in the window of
292: context surrounding the ambiguous word.
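
The conditional independence assumption leads to the standard decision
rule sketched below. The notation is introduced only for this
illustration: $s$ ranges over the possible senses of the ambiguous word
and $f_1, \ldots, f_n$ are the observed values of the binary features.
\begin{eqnarray*}
\hat{s} = \mathop{\rm argmax}_{s} \ p(s) \prod_{i=1}^{n} p(f_i | s)
\end{eqnarray*}
The probabilities $p(s)$ and $p(f_i | s)$ are estimated from the
sense--tagged training data; the details of estimation, such as any
smoothing, are not specified here.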
293:
294: We use the Weka \cite{weka} implementations of the C4.5
295: decision tree learner (known as J48), the decision stump, and the Naive
296: Bayesian classifier. Weka is written in Java and is freely available from
297: www.cs.waikato.ac.nz/\~{}ml.
298:
299: \section{Experimental Data}
300:
301: Our empirical study utilizes the training and test data from the 1998
302: SENSEVAL evaluation of word sense disambiguation systems. Ten teams
303: participated in the supervised learning portion of this event.
304: Additional details about the exercise, including the data and results
305: referred to in this paper, can be found at the SENSEVAL web site
306: (www.itri.bton.ac.uk/events/senseval/) and in \cite{KilgarriffP00}.
307:
308: We included all 36 tasks from SENSEVAL for which training and test data
309: were provided. Each task requires that the occurrences of a particular
310: word in the test data be disambiguated based on a model learned from
311: the sense--tagged instances in the training data. Some words were used in
312: multiple tasks as different parts of speech. For example, there were two
313: tasks associated with {\it bet}, one for its use as a noun and the other
314: as a verb. Thus, there are 36 tasks involving the disambiguation of 29
315: different words.
316:
317: The words and part of speech associated with each task are shown in Table
318: \ref{tab:results} in column 1. Note that the parts of speech are
319: encoded as {\it n} for noun, {\it a} for adjective, {\it v} for verb, and
320: {\it p} for words where the part of speech was not provided. The number of
321: test and training instances for each task are shown in columns 2 and
322: 4. Each instance consists of the sentence in which the ambiguous word
323: occurs as well as one or two surrounding sentences. In general
324: the total context available for each ambiguous word is less than 100
325: surrounding words. The number of distinct senses in the test data for
326: each task is shown in column 3.
327:
328: \section{Experimental Method}
329:
330: The following process is repeated for each task. Capitalization and
331: punctuation are removed from the training and test data. Two feature
332: sets are selected from the training data based on the top 100 ranked
333: bigrams according to the power divergence statistic and the Dice
334: Coefficient. The bigram must have occurred 5 or more times to be
335: included as a feature. This step filters out a large number of possible
336: bigrams and allows the decision tree learner to focus on a small number of
337: candidate bigrams that are likely to be helpful in the disambiguation
338: process.
339:
340: The training and test data are converted to feature vectors where each
341: feature represents the occurrence of one of the bigrams that belong in
342: the feature set. This representation of the training data is the actual input
343: to the learning algorithms. Decision tree and decision stump learning is
344: performed twice, once using the feature set determined by the power
345: divergence statistic and again using the feature set identified by the
346: Dice Coefficient. The majority classifier
347: simply determines the most frequent sense in the training data and
348: assigns that to all instances in the test data. The Naive Bayesian
349: classifier is based on a feature set where every word that occurs 5 or
350: more times in the training data is included as a feature.
351:
352: All of these learned models are used to disambiguate the test data. The
353: test data is kept separate until this stage. We employ a fine grained
354: scoring method, where a word is counted as correctly disambiguated only
355: when the assigned sense tag exactly matches the true sense tag. No
356: partial credit is assigned for near misses.
357:
358: \section{Experimental Results}
359:
360: The accuracy attained by each of the learning algorithms is shown in Table
361: \ref{tab:results}.
Column 5 reports the accuracy of the majority classifier, while columns 6
and 7 show the best and average accuracy reported by the 10
participating SENSEVAL teams. The evaluation at SENSEVAL was
365: based on precision and recall, so we converted those scores to accuracy by
366: taking their product. However, the best precision and recall may have
367: come from different teams, so the best accuracy shown in column 6 may
368: actually be higher than that of any single participating SENSEVAL
369: system. The average accuracy in column 7 is the product of the average
370: precision and recall reported for the participating SENSEVAL teams.
371: Column 8 shows the accuracy of the
372: decision tree using the J48 learning algorithm and the
373: features identified by a power divergence statistic.
374: Column 10 shows the accuracy of the decision tree when the Dice
375: Coefficient selects the features. Columns 9 and 11 show the accuracy of
376: the decision stump based on the power
377: divergence statistic and the Dice
Coefficient, respectively. Finally, column 12 shows the accuracy of the
379: Naive Bayesian classifier based on a bag of words feature set.
380:
381: The most accurate method is the decision tree based on a feature set
382: determined by the power divergence statistic. The last line of Table
383: \ref{tab:results} shows the win-tie-loss score of the decision tree/power
384: divergence method relative to every other method. A win shows it was more
385: accurate than the method in the column, a loss means it was less accurate,
386: and a tie means it was equally accurate. The decision tree/power
387: divergence method was more accurate than the best reported SENSEVAL
388: results for 19 of the 36 tasks, and more accurate for 30 of the 36 tasks
389: when compared to the average reported accuracy. The decision stumps also
390: fared well, proving to be more accurate than the best SENSEVAL results for
391: 14 of the 36 tasks.
392:
393: In general the feature sets selected by the power divergence statistic
394: result in more accurate decision trees than those selected by
395: the Dice Coefficient. The power divergence tests prove to be more reliable
since they account for all possible events surrounding two words
$w_1$ and $w_2$: when they occur as the bigram $w_1w_2$, when $w_1$ or
$w_2$ occurs in a bigram without the other, and when a bigram consists of
399: neither. The Dice Coefficient is based strictly on the event where $w_1$
400: and $w_2$ occur together in a bigram.
401:
402: There are 6 tasks where the decision tree / power divergence approach is
less accurate than the SENSEVAL average: promise-n, scrap-n, shirt-n,
404: amaze-v, bitter-p, and sanction-p. The most dramatic difference
405: occurred with amaze-v, where the SENSEVAL average was 92.4\% and the
406: decision tree accuracy was 58.6\%. However, this was an unusual task
407: where every instance in the test data belonged to a single sense that
408: was a minority sense in the training data.
409:
410:
411: \begin{table*}
412: \caption{Experimental Results}
413: \label{tab:results}
414: \begin{center}
415: \begin{tabular}{crrr|rrrrrrrrr}
416: \hline
417: \hline\rule{0pt}{12pt}
418: (1) & (2) & (3)& (4) & (5) & (6) & (7) & (8) &
419: (9) & (10) & (11) & (12) \\
420: & & senses & & & & & j48 & stump & j48 & stump & naive \\
421: word-pos & test & in test & train & maj & best & avg & pow & pow
422: & dice & dice & bayes \\[2pt]
423: \hline
424: accident-n & 267 &8 & 227 & 75.3 & 87.1 & 79.6 & 85.0
425: &
426: 77.2 & 83.9 & 77.2 & 83.1 &\\
427: behaviour-n & 279 &3 & 994 & 94.3 & 92.9 & 90.2 & 95.7 &
428: 95.7 & 95.7 & 95.7 & 93.2 & \\
429: bet-n & 274 &15 & 106 & 18.2 & 50.7 & 39.6 & 41.8 &
430: 34.5 & 41.8 & 34.5 & 39.3 & \\
431: excess-n & 186 &8 & 251 & 1.1 & 75.9 & 63.7 & 65.1 &
432: 38.7 & 60.8 & 38.7 & 64.5 & \\
433: float-n & 75 &12 & 61 & 45.3 & 66.1 & 45.0 & 52.0 &
434: 50.7 & 52.0 & 50.7 & 56.0 & \\
435: giant-n & 118 &7 & 355 & 49.2 & 67.6 & 56.6 & 68.6 &
436: 59.3 & 66.1 & 59.3 & 70.3 & \\
437: knee-n & 251 &22 & 435 & 48.2 & 67.4 & 56.0 & 71.3 &
438: 60.2 & 70.5 &60.2 & 64.1 & \\
439: onion-n & 214 &4 & 26 & 82.7 & 84.8 & 75.7 & 82.7 &
440: 82.7 & 82.7 & 82.7 & 82.2 & \\
441: promise-n & 113 &8 & 845 & 62.8 & 75.2 & 56.9 & 48.7 &
442: 63.7 & 55.8 & 62.8 & 78.0 & \\
443: sack-n & 82 &7 & 97 & 50.0 & 77.1 & 59.3 & 80.5 &
444: 58.5 & 80.5 & 58.5 & 74.4 & \\
445: scrap-n & 156 &14 & 27 & 41.7 & 51.6 & 35.1 & 26.3 &
446: 16.7 & 26.3 & 16.7 & 26.7 & \\
447: shirt-n & 184 &8 & 533 & 43.5 & 77.4 & 59.8 & 46.7 &
448: 43.5 & 51.1 & 43.5 & 60.9 & \\
449: amaze-v & 70 &1 & 316 & 0.0 & 100.0& 92.4 & 58.6 &
450: 12.9 & 60.0 & 12.9 & 71.4 & \\
451: bet-v & 117 &9 & 60 & 43.2 & 60.5 & 44.0 & 50.8 &
452: 58.5 & 52.5 & 50.8 & 58.5 & \\
453: bother-v & 209 &8 & 294 & 75.0 & 59.2 & 50.7 & 69.9 &
454: 55.0 & 64.6 & 55.0 & 62.2 & \\
455: bury-v & 201 &14 & 272 & 38.3 & 32.7 & 22.9 & 48.8 &
456: 38.3 & 44.8 & 38.3 & 42.3 & \\
457: calculate-v & 218 &5 & 249 & 83.9 & 85.0 & 75.5 & 90.8 &
458: 88.5 & 89.9 & 88.5 & 80.7 & \\
459: consume-v & 186 &6 & 67 & 39.8 & 25.2 & 20.2 & 36.0 &
460: 34.9 & 39.8 & 34.9 & 31.7 & \\
461: derive-v & 217 &6 & 259 & 47.9 & 44.1 & 36.0 & 82.5 &
462: 52.1 & 82.5 & 52.1 & 72.4 & \\
463: float-v & 229 &16 & 183 & 33.2 & 30.8 & 22.5 & 30.1 &
464: 22.7 & 30.1 & 22.7 & 56.3 & \\
465: invade-v & 207 &6 & 64 & 40.1 & 30.9 & 25.5 & 28.0 &
466: 40.1 & 28.0 & 40.1 & 31.0 & \\
467: promise-v & 224 &6 & 1160 & 85.7 & 82.1 & 74.6 & 85.7 &
468: 84.4 & 81.7 & 81.3 & 85.3 & \\
469: sack-v & 178 &3 & 185 & 97.8 & 95.6 & 95.6 & 97.8 &
470: 97.8 & 97.8 & 97.8 & 97.2 & \\
471: scrap-v & 186 &3 & 30 & 85.5 & 80.6 & 68.6 & 85.5 &
472: 85.5 & 85.5 & 85.5 & 82.3 & \\
473: seize-v & 259 &11 & 291 & 21.2 & 51.0 & 42.1 & 52.9 &
474: 25.1 & 49.4 & 25.1 & 51.7 & \\
475: brilliant-a & 229 &10 & 442 & 45.9 & 31.7 & 26.5 & 55.9 &
476: 45.9 & 51.1 & 45.9 & 58.1 & \\
477: floating-a & 47 &5 & 41 & 57.4 & 49.3 & 27.4 & 57.4 &
478: 57.4 & 57.4 & 57.4 & 55.3 & \\
479: generous-a & 227 &6 & 307 & 28.2 & 37.5 & 30.9 & 44.9 &
480: 32.6 & 46.3 & 32.6 & 48.9 & \\
481: giant-a & 97 &5 & 302 & 94.8 & 98.0 & 93.5 & 95.9 &
482: 95.9 & 94.8 & 94.8 & 94.8 &\\
483: modest-a & 270 &9 & 374 & 61.5 & 49.6 & 44.9 & 72.2 &
484: 64.4 & 73.0 & 64.4 & 68.1 & \\
485: slight-a & 218 &6 & 385 & 91.3 & 92.7 & 81.4 & 91.3 &
486: 91.3 & 91.3 & 91.3 & 91.3 & \\
487: wooden-a & 196 &4 & 362 & 93.9 & 81.7 & 71.3 & 96.9 &
488: 96.9 & 96.9 & 96.9 & 93.9 & \\
489: band-p & 302 &29 &1326 & 77.2 & 81.7 & 75.9 & 86.1 &
490: 84.4 & 79.8 & 77.2 & 83.1 & \\
491: bitter-p & 373 &14 &144 & 27.0 & 44.6 & 39.8 & 36.4 &
492: 31.3 & 36.4 & 31.3 & 32.6 & \\
493: sanction-p & 431 &7 &96 & 57.5 & 74.8 & 62.4 & 57.5 &
494: 57.5 & 57.1 & 57.5 & 56.8 & \\
495: shake-p & 356 &36 &963 & 23.6 & 56.7 & 47.1 & 52.2 &
496: 23.6 & 50.0 & 23.6 & 46.6 & \\[2pt]
497: \hline
498: \multicolumn{4}{c|} {win-tie-loss (j48-pow vs. X)} &
499: \multicolumn{1}{c} {23-7-6} &
500: \multicolumn{1}{c} {19-0-17} &
501: \multicolumn{1}{c} {30-0-6} &
502: \multicolumn{1}{c} {}&
503: \multicolumn{1}{c} {28-9-3} &
504: \multicolumn{1}{c} {14-15-7} &
505: \multicolumn{1}{c} {28-9-3} &
506: \multicolumn{1}{c} {24-1-11} & \\[2pt]
507: \hline
508: \end{tabular}
509: \end{center}
510: \end{table*}
511: %
512:
513: \begin{table*}
514: \caption{Decision Tree and Stump Characteristics}
515: \label{tab:stump}
516: \begin{center}
517: \begin{tabular}{c|rrr|rrr}
518: \hline
519: \hline
520: \multicolumn{1}{c|}{ } &
521: \multicolumn{3}{c}{power divergence} &
522: \multicolumn{3}{|c}{dice coefficient} \\
523: (1) & (2) & (3) & (4) & (5) & (6) & (7) \\
524: word-pos & stump node & leaf/total & features & stump node & leaf/total
525: &features \\[2pt]
526: \hline
527: accident-n & by accident & 8/15 & 101 & by accident & 12/23 & 112 \\
528: behaviour-n & best behaviour & 2/3 & 100 & best behaviour & 2/3 & 104 \\
529: bet-n & betting shop & 20/39 & 50 & betting shop & 20/39 & 50 \\
530: excess-n & in excess & 13/25 & 104 & in excess & 11/21 & 102\\
531: float-n & the float & 7/13 & 13 & the float & 7/13 & 13 \\
532: giant-n & the giants & 16/31 & 103 & the giants & 14/27 & 78 \\
533: knee-n & knee injury & 23/45 & 102 & knee injury & 20/39 & 104 \\
534: onion-n & in the & 1/1 & 7 & in the & 1/1 & 7\\
535: promise-n & promise of & 95/189 & 100 & a promising & 49/97 & 107 \\
536: sack-n & the sack & 5/9 & 31 & the sack & 5/9 & 31 \\
537: scrap-n & scrap of & 7/13 & 8 & scrap of & 7/13 & 8 \\
538: shirt-n & shirt and & 38/75 & 101 & shirt and & 55/109 & 101 \\
539: amaze-v & amazed at & 11/21 & 102 & amazed at &11/21 & 102 \\
540: bet-v & i bet & 4/7 & 10 & i bet & 4/7 & 10 \\
541: bother-v & be bothered & 19/37 & 101 & be bothered & 20/39 & 106 \\
542: bury-v & buried in & 28/55 & 103 & buried in & 32/63 & 103 \\
543: calculate-v & calculated to & 5/9 & 103 & calculated to & 5/9 & 103 \\
544: consume-v & on the & 4/7 & 20 & on the & 4/7 & 20 \\
545: derive-v & derived from & 10/19 & 104 & derived from & 10/19 & 104 \\
546: float-v & floated on & 24/47 & 80 & floated on & 24/47 & 80 \\
547: invade-v & to invade & 55/109 & 107 & to invade & 66/127 & 108 \\
548: promise-v & promise to & 3/5 & 100 & promise you & 5/9 & 106 \\
549: sack-v & return to & 1/1 & 91 & return to & 1/1 & 91 \\
550: scrap-v & of the & 1/1 & 7 & of the & 1/1 & 7 \\
551: seize-v & to seize & 26/51 & 104 & to seize & 57/113 & 104 \\
552: brilliant-a & a brilliant & 26/51 & 101 & a brilliant & 42/83 & 103 \\
553: floating-a & in the & 7/13 & 10 & in the & 7/13 & 10 \\
554: generous-a & a generous & 57/113 & 103 & a generous & 56/111 & 102 \\
555: giant-a & the giant & 2/3 & 102 & a giant & 1/1 & 101 \\
556: modest-a & a modest & 14/27 & 101 & a modest & 10/19 & 105 \\
557: slight-a & the slightest & 2/3 & 105 & the slightest & 2/3 & 105 \\
558: wooden-a & wooden spoon & 2/3 & 104 & wooden spoon & 2/3 & 101 \\
559: band-p & band of & 14/27 & 100 & the band & 21/41& 117\\
560: bitter-p & a bitter & 22/43 & 54 & a bitter & 22/43 & 54 \\
561: sanction-p & south africa & 12/23 & 52 & south africa & 12/23 & 52 \\
562: shake-p & his head & 90/179 & 100 & his head & 81/161 & 105 \\
563: \hline
564: \end{tabular}
565: \end{center}
566: \end{table*} %
567:
568: \section{Analysis of Experimental Results}
569:
570: The characteristics of the decision trees and decision stumps learned for
571: each word are shown in Table \ref{tab:stump}. Column 1 shows the
572: word and part of speech. Columns 2, 3, and 4 are based on the
573: feature set selected by the power divergence statistic while
574: columns 5, 6, and 7 are based on the Dice Coefficient. Columns 2 and 5
575: show the node selected to serve as the decision stump. Columns 3 and 6
576: show the number of leaf nodes in the learned decision tree relative to the
577: number of total nodes. Columns 4 and 7 show the number of bigram
578: features selected to represent the training data.
579:
580: This table shows that there is little difference in the decision stump
581: nodes selected from feature sets determined by the power divergence
582: statistics versus the Dice Coefficient. This is to be expected
583: since the top ranked bigrams for each measure are consistent, and the
584: decision stump node is generally chosen from among those.
585:
586: However, there are differences between the feature sets selected by the
587: power divergence statistics and the Dice Coefficient. These are reflected
588: in the different sized trees that are learned based on these feature sets.
589: The number of leaf nodes and the total number of nodes for each learned
590: tree is shown in columns 3 and 6.
591: The number of internal nodes is simply the difference between the
592: total nodes and the leaf nodes.
593: Each leaf node represents the end of
594: a path through the decision tree that makes a sense distinction.
595: Since a bigram feature can only appear once in the
596: decision tree, the number of internal nodes represents the number of
597: bigram features selected by the decision tree learner.
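
As a concrete reading of Table \ref{tab:stump}, the power divergence
tree for accident-n has 8 leaf nodes out of 15 total nodes, and therefore
$15 - 8 = 7$ internal nodes, so it uses 7 of the 101 candidate bigram
features. At the other extreme, the power divergence tree for promise-n
has $189 - 95 = 94$ internal nodes and so uses 94 of its 100 candidate
features.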
598:
599: One of our original hypotheses was that accurate decision trees of
600: bigrams will include a relatively small number of features. This
601: was motivated by the success of decision stumps in performing
602: disambiguation based on a single bigram feature.
603: In these experiments, there were no decision trees that used all of the
604: bigram features identified by the filtering step, and for many words the
605: decision tree learner went on to eliminate most of the candidate
606: features. This can be seen by comparing the number of internal nodes with
607: the number of candidate features as shown in columns 4 or 7.\footnote{For
608: most words the 100 top ranked bigrams form the set of candidate features
609: presented to the decision tree learner. If
610: there are ties in the top 100 rankings then there may be more than 100
features, and if there were fewer than 100 bigrams that occurred more
612: than 5 times then all such bigrams are included in the feature set.}
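
One reading of the selection rule in the footnote can be stated compactly.
The notation is assumed here for illustration, as is the convention that
larger scores rank higher. Let $n(b)$ be the number of times bigram $b$
occurs in the training data, let $s(b)$ be its score under the power
divergence statistic or the Dice Coefficient, and let
\[ B = \{ b : n(b) > 5 \} . \]
If $|B| \leq 100$ then the candidate feature set is $F = B$. Otherwise
\[ F = \{ b \in B : s(b) \geq s_{(100)} \} , \]
where $s_{(100)}$ is the 100th largest score among the bigrams in $B$.
Under this reading $|F|$ exceeds 100 only when several bigrams tie at the
score $s_{(100)}$.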
613:
614: It is also noteworthy that the bigrams ultimately selected by the decision
615: tree learner for inclusion in the tree do not always include those
616: bigrams ranked most highly by the power divergence statistic or the Dice
Coefficient. This is to be expected, since the selection of bigrams
from raw text measures only the association between two words, while
the decision tree seeks bigrams that partition instances of the ambiguous
word into distinct senses. In particular, the decision tree learner
decides which bigrams to include as nodes in the tree using
the gain ratio, a measure based on the overall Mutual Information
between the bigram and a particular word sense.
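
For reference, a common formulation of the gain ratio is sketched here.
The notation is introduced only for this sketch, and we assume that it
matches the criterion applied by the learner. Let $S$ be the set of
training instances that reach a node, let $b$ be a binary bigram feature
that splits $S$ into $S_1$ (the bigram occurs) and $S_0$ (it does not), and
let
\[ H(S) = - \sum_{c} p(c) \log_2 p(c) \]
be the entropy of the sense distribution in $S$, where $p(c)$ is the
proportion of instances in $S$ tagged with sense $c$. Then
\[ \mbox{Gain}(S,b) = H(S) - \sum_{v \in \{0,1\}} \frac{|S_v|}{|S|} H(S_v) , \]
\[ \mbox{SplitInfo}(S,b) = - \sum_{v \in \{0,1\}} \frac{|S_v|}{|S|}
   \log_2 \frac{|S_v|}{|S|} , \]
\[ \mbox{GainRatio}(S,b) = \frac{\mbox{Gain}(S,b)}{\mbox{SplitInfo}(S,b)} . \]
The gain term is the Mutual Information between the feature and the sense
tag over $S$, and the denominator normalizes it by the entropy of the
split itself.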
624:
625: Finally, note that the smallest decision trees are functionally equivalent
626: to our benchmark methods. A decision tree with 1 leaf node and
627: no internal nodes (1/1) acts as a majority classifier. A decision tree
628: with 2 leaf nodes and 1 internal node (2/3) has the structure of a
629: decision stump.
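
These two degenerate cases can be written out explicitly. The notation is
assumed here for illustration, as is the convention that each leaf predicts
the most frequent sense among the training instances that reach it. Let
$n(c)$ be the number of training instances tagged with sense $c$, and let
$n(c,v)$ be the number of those in which the bigram $b$ at the root is
present ($v = 1$) or absent ($v = 0$). The 1/1 tree assigns every test
instance the sense
\[ \hat{c} = \arg\max_{c} \; n(c) , \]
while the 2/3 tree assigns a test instance in which $b$ takes the value $v$
the sense
\[ \hat{c}_v = \arg\max_{c} \; n(c,v) . \]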
630:
631:
632: \section{Discussion}
633:
634: One of our long-term objectives is to identify a core set of features
635: that will be useful for disambiguating a wide class of words using both
636: supervised and unsupervised methodologies.
637:
In earlier work we presented an ensemble approach to word sense disambiguation
\cite{Pedersen00b} in which multiple Naive Bayesian classifiers, each based
on co--occurrence features from varying sized windows of context,
were shown to perform well on the widely studied nouns {\it interest} and
642: {\it line}. While the accuracy of this approach was as good as any
643: previously published results, the learned models were complex and
644: difficult to interpret, in effect acting as very accurate black boxes.
645:
646: Our experience has been that variations in learning algorithms
647: are far less significant contributors to disambiguation
648: accuracy than are variations in the feature set. In other words, an
649: informative feature set will result in accurate disambiguation when used
650: with a wide range of learning algorithms, but there is no
651: learning algorithm that can perform well given an uninformative or
652: misleading set of features. Therefore, our focus is on developing and
653: discovering feature sets that make distinctions among word senses. Our
654: learning algorithms must not only produce accurate models, but they
655: should also shed new light on the relationships among features and allow
656: us to continue refining and understanding our feature sets.
657:
658: We believe that decision trees meet these criteria. A wide range of
659: implementations are available, and they are known to be robust and
660: accurate across a range of domains. Most important, their structure is
661: easy to interpret and may provide insights into the relationships that
662: exist among features and more general rules of disambiguation.
663:
664: \section{Related Work}
665:
666: Bigrams have been used as features for word sense disambiguation,
667: particularly in the form of collocations where the ambiguous word is one
668: component of the bigram (e.g., \cite{BruceW94b}, \cite{NgL96},
669: \cite{Yarowsky95}). While some of the bigrams we identify are collocations
670: that include the word being disambiguated, there is no requirement that
671: this be the case.
672:
673: Decision trees have been used in supervised learning approaches to word
674: sense disambiguation, and have fared well in a number of comparative
675: studies (e.g., \cite{Mooney96}, \cite{PedersenB97A}). In the former they
were used with bag of words feature sets and in the latter they were
677: used with a mixed feature set that included the part-of-speech of
678: neighboring words, three collocations, and the morphology of the ambiguous
679: word. We believe that the approach in this paper is the first time that
680: decision trees based strictly on bigram features have been employed.
681:
682: The decision list is a closely related approach
683: that has also been applied to
684: word sense disambiguation (e.g., \cite{Yarowsky94}, \cite{WilksS98},
685: \cite{Yarowsky00}). Rather than building and traversing a tree to perform
686: disambiguation, a list is employed. In the general case
687: a decision list may suffer from less fragmentation during learning than
688: decision trees; as a practical matter this means that the decision list
689: is less likely to be over--trained. However, we believe that fragmentation
690: also reflects on the feature set used for learning. Ours consists of at
691: most approximately 100 binary features. This results in a relatively
692: small feature space that is not as likely to suffer from fragmentation as
693: are larger spaces.
694:
695: \section{Future Work}
696:
697: There are a number of immediate extensions to this work. The first is to
698: ease the requirement that bigrams be made up of two consecutive words.
699: Rather, we will search for bigrams where the component words may be
700: separated by other words in the text. The second is to eliminate the
701: filtering step by which candidate bigrams are selected by a power
702: divergence statistic. Instead, the decision tree learner would consider
703: all possible bigrams. Despite increasing the danger of fragmentation,
704: this is an interesting issue since the bigrams judged most informative by
705: the decision tree learner are not always ranked highly in the filtering
706: step. In particular, we will determine if the filtering process ever
707: eliminates bigrams that could be significant sources of disambiguation
708: information.
709:
710: In the longer term, we hope to adapt this approach to unsupervised
711: learning, where disambiguation is performed without the benefit of sense
712: tagged text. We are optimistic that this is viable, since bigram features
713: are easy to identify in raw text.
714:
715: \section{Conclusion}
716:
717: This paper shows that the combination of a simple feature set made
718: up of bigrams and a standard decision tree learning algorithm
719: results in accurate word sense disambiguation. The results of this
720: approach are compared with those from the 1998 SENSEVAL word sense
721: disambiguation exercise and show that the bigram based decision tree
722: approach is more accurate than the best SENSEVAL results for 19 of 36
723: words.
724:
725: \section{Acknowledgments}
726:
727: The Bigram Statistics Package has been implemented by Satanjeev Banerjee,
728: who is supported by a Grant--in--Aid of Research, Artistry and Scholarship
729: from the Office of the Vice President for Research and the Dean of the
730: Graduate School of the University of Minnesota. We would like to thank
731: the SENSEVAL organizers for making the data and results from the 1998
732: event freely available. The comments of three anonymous reviewers were
733: very helpful in preparing the final version of this paper. A preliminary
734: version of this paper appears in \cite{Pedersen01a}.
735:
736: %
737: % ---- Bibliography ----
738: %
739:
740: %\bibliography{/home/cs/tpederse/TeX/Papers/papers/bib/tdp}
741: %\bibliographystyle{/home/cs/tpederse/TeX/Papers/papers/sty/acl}
742:
743: \begin{thebibliography}{}
744:
745: \bibitem[\protect\citename{Bruce and Wiebe}1994]{BruceW94b}
746: R.~Bruce and J.~Wiebe.
747: \newblock 1994.
748: \newblock Word-sense disambiguation using decomposable models.
749: \newblock In {\em Proceedings of the 32nd Annual Meeting of the Association for
750: Computational Linguistics}, pages 139--146.
751:
752: \bibitem[\protect\citename{Church and Hanks}1990]{ChurchH90}
753: K.~Church and P.~Hanks.
754: \newblock 1990.
755: \newblock Word association norms, mutual information and lexicography.
756: \newblock In {\em Proceedings of the 28th Annual Meeting of the Association for
757: Computational Linguistics}, pages 76--83.
758:
759: \bibitem[\protect\citename{Cressie and Read}1984]{CressieR84}
760: N.~Cressie and T.~Read.
761: \newblock 1984.
762: \newblock Multinomial goodness of fit tests.
\newblock {\em Journal of the Royal Statistical Society, Series B}, 46:440--464.
764:
765: \bibitem[\protect\citename{Duda and Hart}1973]{DudaH73}
766: R.~Duda and P.~Hart.
767: \newblock 1973.
768: \newblock {\em Pattern Classification and Scene Analysis}.
769: \newblock Wiley, New York, NY.
770:
771: \bibitem[\protect\citename{Dunning}1993]{Dunning93}
772: T.~Dunning.
773: \newblock 1993.
774: \newblock Accurate methods for the statistics of surprise and coincidence.
775: \newblock {\em Computational Linguistics}, 19(1):61--74.
776:
777: \bibitem[\protect\citename{Holte}1993]{Holte93}
778: R.~Holte.
779: \newblock 1993.
780: \newblock Very simple classification rules perform well on most commonly used
781: datasets.
782: \newblock {\em Machine Learning}, 11:63--91.
783:
784: \bibitem[\protect\citename{Kilgarriff and Palmer}2000]{KilgarriffP00}
785: A.~Kilgarriff and M.~Palmer.
786: \newblock 2000.
787: \newblock Special issue on {SENSEVAL}: Evaluating word sense disambiguation
788: programs.
789: \newblock {\em Computers and the Humanities}, 34(1--2).
790:
791: \bibitem[\protect\citename{Mooney}1996]{Mooney96}
792: R.~Mooney.
793: \newblock 1996.
794: \newblock Comparative experiments on disambiguating word senses: An
795: illustration of the role of bias in machine learning.
796: \newblock In {\em Proceedings of the Conference on Empirical Methods in Natural
797: Language Processing}, pages 82--91, May.
798:
799: \bibitem[\protect\citename{Ng and Lee}1996]{NgL96}
800: H.T. Ng and H.B. Lee.
801: \newblock 1996.
802: \newblock Integrating multiple knowledge sources to disambiguate word sense: An
803: exemplar-based approach.
804: \newblock In {\em Proceedings of the 34th Annual Meeting of the Association for
805: Computational Linguistics}, pages 40--47.
806:
807: \bibitem[\protect\citename{Pedersen and Bruce}1997]{PedersenB97A}
808: T.~Pedersen and R.~Bruce.
809: \newblock 1997.
810: \newblock A new supervised learning algorithm for word sense disambiguation.
811: \newblock In {\em Proceedings of the Fourteenth National Conference on
812: Artificial Intelligence}, pages 604--609, Providence, RI, July.
813:
814: \bibitem[\protect\citename{Pedersen}1996]{Pedersen96}
815: T.~Pedersen.
816: \newblock 1996.
817: \newblock Fishing for exactness.
818: \newblock In {\em Proceedings of the South Central SAS User's Group (SCSUG-96)
819: Conference}, pages 188--200, Austin, TX, October.
820:
821: \bibitem[\protect\citename{Pedersen}2000]{Pedersen00b}
822: T.~Pedersen.
823: \newblock 2000.
\newblock A simple approach to building ensembles of Naive Bayesian classifiers
825: for word sense disambiguation.
826: \newblock In {\em Proceedings of the First Annual Meeting of the North American
827: Chapter of the Association for Computational Linguistics}, pages 63--69,
828: Seattle, WA, May.
829:
830: \bibitem[\protect\citename{Pedersen}2001]{Pedersen01a}
831: T.~Pedersen.
832: \newblock 2001.
833: \newblock Lexical semantic ambiguity resolution with bigram--based decision
834: trees.
835: \newblock In {\em Proceedings of the Second International Conference on
836: Intelligent Text Processing and Computational Linguistics}, pages 157--168,
837: Mexico City, February.
838:
839: \bibitem[\protect\citename{Smadja \bgroup et al.\egroup }1996]{SmadjaMH96}
840: F.~Smadja, K.~McKeown, and V.~Hatzivassiloglou.
841: \newblock 1996.
842: \newblock Translating collocations for bilingual lexicons: A statistical
843: approach.
844: \newblock {\em Computational Linguistics}, 22(1):1--38.
845:
846: \bibitem[\protect\citename{Wilks and Stevenson}1998]{WilksS98}
847: Y.~Wilks and M.~Stevenson.
848: \newblock 1998.
849: \newblock Word sense disambiguation using optimised combinations of knowledge
850: sources.
851: \newblock In {\em Proceedings of COLING/ACL-98}.
852:
853: \bibitem[\protect\citename{Witten and Frank}2000]{weka}
854: I.~Witten and E.~Frank.
855: \newblock 2000.
\newblock {\em Data Mining: Practical Machine Learning Tools and Techniques
with Java Implementations}.
\newblock Morgan Kaufmann, San Francisco, CA.
859:
860: \bibitem[\protect\citename{Yarowsky}1994]{Yarowsky94}
861: D.~Yarowsky.
862: \newblock 1994.
\newblock Decision lists for lexical ambiguity resolution: Application to
accent restoration in {S}panish and {F}rench.
865: \newblock In {\em Proceedings of the 32nd Annual Meeting of the Association for
866: Computational Linguistics}.
867:
868: \bibitem[\protect\citename{Yarowsky}1995]{Yarowsky95}
869: D.~Yarowsky.
870: \newblock 1995.
871: \newblock Unsupervised word sense disambiguation rivaling supervised methods.
872: \newblock In {\em Proceedings of the 33rd Annual Meeting of the Association for
873: Computational Linguistics}, pages 189--196, Cambridge, MA.
874:
875: \bibitem[\protect\citename{Yarowsky}2000]{Yarowsky00}
876: D.~Yarowsky.
877: \newblock 2000.
878: \newblock Hierarchical decision lists for word sense disambiguation.
879: \newblock {\em Computers and the Humanities}, 34(1--2).
880:
881: \end{thebibliography}
882: \end{document}
883: