0103:cs0103026/naacl.tex

1: \documentclass[]{article}

2: \usepackage{naacl2001}

3:

4: \title{A Decision Tree of Bigrams\\is an Accurate Predictor of Word Sense}

5:

6: \author{Ted Pedersen\\

7: Department of Computer Science\\

8: University of Minnesota Duluth

9: \\Duluth, MN 55812 USA\\

10: \tt{tpederse@d.umn.edu}}

11:

12: %%%\tt{http://www.d.umn.edu/\~{}tpederse}}

13:

14:

15: \begin{document}

16: \maketitle

17:

18: \begin{abstract}

19: This paper presents a corpus-based approach to word sense

20: disambiguation where a decision tree assigns a sense to an ambiguous word

21: based on the bigrams that occur nearby. This approach is evaluated using

22: the sense-tagged corpora from the 1998 SENSEVAL word sense disambiguation

23: exercise. It is more accurate than the average results reported for 30 of

24: 36 words, and is more accurate than the best results for 19 of 36 words.

25: \end{abstract}

26:

27: \section{Introduction}

28:

29: Word sense disambiguation is the process of selecting the most appropriate

30: meaning for a word, based on the context in which it occurs. For our

31: purposes it is assumed that the set of possible meanings, i.e., the sense

32: inventory, has already been determined. For example, suppose {\it bill}

33: has the following set of possible meanings: a piece of currency, pending

34: legislation, or a  bird jaw. When used in the context of {\it The Senate

35: bill is under consideration}, a human reader immediately understands that

36: {\it bill} is being used in the legislative sense. However, a computer

37: program  attempting to perform the same task faces a difficult problem

38: since it does not have the benefit of innate common--sense or linguistic

39: knowledge.

40:

41: Rather than attempting to provide computer programs with real--world

42: knowledge comparable to that of humans, natural language processing has

43: turned to {\it corpus--based} methods. These approaches use techniques

44: from statistics and machine learning to induce models of language

45: usage from large samples of text.  These models are trained to perform

46: particular tasks, usually via supervised learning. This paper describes an

47: approach where a {\it decision tree} is learned from some number of

48: sentences  where each instance of an  ambiguous word has been manually

49: annotated with a sense--tag that denotes the most appropriate sense for

50: that context.

51:

52: Prior to learning, the sense--tagged corpus must be  converted into a more

53: regular form suitable for automatic processing. Each sense--tagged

54: occurrence of an ambiguous  word is  converted into a feature vector,

55: where each feature represents  some  property of the surrounding text that

56: is considered to be relevant to  the disambiguation process.  Given the

57: flexibility and complexity of human  language,  there is  potentially an

58: infinite set of features that could  be utilized. However, in

59: corpus--based approaches features usually consist of information that  can

60: be readily identified in the text,  without relying on extensive external

61: knowledge sources.  These  typically  include the  part--of--speech of

62: surrounding words, the  presence of certain key words within some window

63: of context, and various  syntactic properties  of the sentence and the

64: ambiguous word.

65:

66: The approach in this paper relies upon a  feature set made up of {\it

67: bigrams}, two word sequences that occur in a text. The context in which

68: an ambiguous word  occurs is  represented by some number of binary

69: features that  indicate  whether or  not a  particular bigram has occurred

70: within approximately 50 words to the left or right of the word being

71: disambiguated.

72:

73: We take this approach since surface lexical features

74: like bigrams, collocations, and co--occurrences often contribute a great

75: deal to disambiguation accuracy. It is not clear how much

76: disambiguation accuracy is improved through the use of features

77: that are identified by more complex pre--processing such as

78: part--of--speech tagging, parsing, or anaphora resolution. One of our

79: objectives is to establish a clear upper bound on the accuracy of

80: disambiguation using feature sets that do not impose substantial

81: pre--processing requirements.

82:

83: This paper continues with a discussion of our methods for identifying

84: the bigrams that should be included in the feature set for learning. Then

85: the decision tree learning algorithm is described, as are some benchmark

86: learning algorithms that are included for purposes of comparison. The

87: experimental data is discussed, and then the empirical  results are

88: presented. We close with an analysis of our findings and a  discussion of

89: related work.

90:

91: \section{Building a Feature Set of Bigrams}

92:

93: We have developed an approach to word sense disambiguation that

94: represents text entirely in terms of the occurrence of bigrams, which we

95: define to be two consecutive words that occur in a text. The

96: distributional characteristics of bigrams are fairly consistent across

97: corpora; a majority of them only occur one time. Given the sparse and

98: skewed nature of this data, the statistical methods used to select

99: interesting bigrams must be carefully chosen. We explore two alternatives,

100: the power  divergence family of goodness of fit statistics and the Dice

101: Coefficient, an information theoretic measure related to pointwise

102: Mutual Information.

103:

104: Figure \ref{fig:bigram} summarizes the notation for word and bigram counts

105: used in this paper by way of a $2 \times 2$ contingency table. The value

106: of $n_{11}$ shows how many times the bigram {\it big cat} occurs in the

107: corpus. The value  of $n_{12}$ shows how often bigrams occur where  {\it

108: big} is the first word  and {\it cat} is not the second. The counts in

109: $n_{1+}$ and $n_{+1}$ indicate how often the words {\it big} and {\it cat}

110: occur as the first and  second words of any bigram in the corpus. The

111: total number of bigrams in the corpus is represented by $n_{++}$.

112:

113: \begin{figure}

114: \begin{center}

115: \begin{tabular}[c]{@{}r|c|c|@{}l@{}}

116: \multicolumn{4}{c}{ } \\

117: \multicolumn{1}{l}{ } &

118: \multicolumn{2}{c}{} \\

119: \multicolumn{2}{r}{cat} &

120: \multicolumn{1}{c|}{$\neg${cat}} &

121: \multicolumn{1}{c}{{totals}} \\

122: \cline{2-4}

123: big & $n_{11}$=$\hfill$10& $n_{12}$=$\hfill$20& $n_{1+}$=$\hfill$30\\

124: \cline{2-3}

125: $\neg${big} & $n_{21}$=$\hfill$40&$n_{22}$=$\hfill$930&$n_{2+}$=$\hfill$970\\

126: \cline{1-4}

127: totals&

128: \multicolumn{1}{r}{$n_{+1}$=50} & $n_{+2}$=950 & $n_{++}$=1000  \\

129: \multicolumn{4}{c}{ } \\

130: \end{tabular}

131: \caption{Representation of Bigram Counts}

132: \label{fig:bigram}

133: \end{center}

134: \end{figure}

135:

136: \subsection{The Power Divergence Family}

137:

138: \cite{CressieR84} introduce the power divergence family of goodness of fit

139: statistics. A number of well known statistics belong to this family,

140: including the  likelihood ratio statistic $G^2$ and Pearson's $X^2$

141: statistic.

142:

143: These measure the divergence of the observed ($n_{ij}$) and expected

144: ($m_{ij}$) bigram counts, where $m_{ij}$ is estimated based on

145: the assumption that the component words in the bigram occur together

146: strictly by chance:

147: \begin{eqnarray*}

148: m_{ij} = \frac{n_{i+} * n_{+j}}{n_{++}}

149: \end{eqnarray*}
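
For the illustrative counts in Figure \ref{fig:bigram}, this estimate works out to:
\begin{eqnarray*}
m_{11} = \frac{30 * 50}{1000} = 1.5 & & m_{12} = \frac{30 * 950}{1000} = 28.5 \\
m_{21} = \frac{970 * 50}{1000} = 48.5 & & m_{22} = \frac{970 * 950}{1000} = 921.5
\end{eqnarray*}
The bigram {\it big cat} is thus observed 10 times in that example but would be expected only 1.5 times if the two words occurred together strictly by chance.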

150:

151: Given this value, $G^2$ and $X^2$ are calculated as:

152: \begin{eqnarray*}

153: G^2 = 2 \sum_{i,j} n_{ij} * \log \frac{n_{ij}}{m_{ij}} \ \ \ \

154: \end{eqnarray*}

155: \begin{eqnarray*}

156: X^2 = \sum_{i,j} \frac{(n_{ij} - m_{ij})^2}{m_{ij}}

157: \label{eq:x2}

158: \end{eqnarray*}
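
As a worked illustration, take the counts in Figure \ref{fig:bigram}, for which the expected values are $m_{11}=1.5$, $m_{12}=28.5$, $m_{21}=48.5$, and $m_{22}=921.5$, so that every cell has $|n_{ij} - m_{ij}| = 8.5$. Assuming the natural logarithm in $G^2$ (the base is not stated here), the two statistics come out approximately as:
\begin{eqnarray*}
X^2 & = & \frac{8.5^2}{1.5} + \frac{8.5^2}{28.5} + \frac{8.5^2}{48.5} + \frac{8.5^2}{921.5} \approx 52.3 \\
G^2 & = & 2 \left( 10 \log \frac{10}{1.5} + 20 \log \frac{20}{28.5} \right. \\
 & & \left. + \ 40 \log \frac{40}{48.5} + 930 \log \frac{930}{921.5} \right) \approx 25.4
\end{eqnarray*}
The two values differ by roughly a factor of two in this example, which is the kind of disagreement that can arise when an expected count such as $m_{11}$ is small.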

159:

160: \cite{Dunning93} argues in favor of $G^2$ over $X^2$, especially when

161: dealing with very sparse and skewed data distributions.  However,

162: \cite{CressieR84} suggest that there are cases where Pearson's statistic

163: is more reliable than the likelihood ratio and that one test should not

164: always be preferred over the other. In light of this,

165: \cite{Pedersen96}  presents Fisher's exact test as an alternative since

166: it does not rely on the distributional assumptions that underlie both

167: Pearson's test and the likelihood ratio.

168:

169: Unfortunately it is usually not clear which test is most appropriate

170: for a particular sample of data.  We take the following

171: approach, based on the observation that all tests should assign

172: approximately the same

173: measure of statistical significance when the bigram counts in the

174: contingency table do not violate any of the distributional assumptions

175: that underlie the goodness of fit statistics. We perform tests using

176: $X^2$, $G^2$, and Fisher's exact test for each bigram.  If the

177: resulting measures of statistical significance differ, then the

178: distribution of the bigram counts is causing at least one of the tests to

179: become unreliable. When this occurs we rely upon the value from Fisher's

180: exact test since it  makes fewer assumptions about the underlying

181: distribution of data.

182: %%Since Fisher's exact test can be computationally

183: %%complex, a practical shortcut is to perform both the $X^2$ and $G^2$

184: %%tests. If they produce comparable

185: %%results then they are reliable and Fisher's exact test need not be

186: %%included.

187:

188: For the experiments in this paper, we identified the top 100 ranked

189: bigrams that occur more than 5 times in the training corpus associated

190: with a word. There were no  cases where rankings produced by $G^2$, $X^2$,

191: and Fisher's exact test disagreed, which is not altogether surprising

192: given that low frequency bigrams were excluded. Since all of these

193: statistics produced the same rankings, hereafter we make no distinction

194: among them and simply  refer to them generically as the power divergence

195: statistic.

196:

197: \subsection{Dice Coefficient}

198:

199: The Dice Coefficient is a descriptive statistic that provides a

200: measure of association between two words in a corpus. It is similar to

201: pointwise Mutual Information, a widely used measure that was first

202: introduced for identifying lexical relationships in

203: \cite{ChurchH90}. Pointwise Mutual Information can be defined as follows:

204: \begin{eqnarray*}

205: MI(w_1,w_2) = \log_2 \frac{n_{11} * n_{++}}{n_{+1} * n_{1+}}

206: \end{eqnarray*}

207: where $w_1$ and $w_2$ represent the two words that make up the bigram.

208: %$n_{11}$ represents the number of times the two words occur

209: %together as a bigram, $n_{+1}$ and $n_{1+}$ are the

210: %number of times the words occur as the first and second words

211: %of a bigram, and $n_{++}$ represents the total number of bigrams in the

212: %corpus.

213:

214: Pointwise Mutual Information quantifies how often two words occur

215: together in a bigram (the numerator) relative to how often they occur

216: overall in the corpus (the denominator). However, there is

217: a curious limitation to pointwise Mutual Information. A bigram $w_1w_2$

218: that occurs $n_{11}$ times in the corpus, and whose component words $w_1$

219: and $w_2$ only occur as a part of that bigram, will result in

220: increasingly strong  measures of association as the value of $n_{11}$

221: decreases.

222: Thus, the maximum pointwise Mutual Information in a given corpus

223: will be assigned to bigrams that occur one time, and whose component words

224: never occur outside that bigram. These are usually not the bigrams that

225: prove most useful for disambiguation, yet they will dominate a ranked

226: list as determined by pointwise Mutual Information.
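
To make this concrete, consider the boundary case $n_{11} = n_{1+} = n_{+1}$ with the corpus size of Figure \ref{fig:bigram}, $n_{++} = 1000$. The specific values of $n_{11}$ below are chosen only for illustration:
\begin{eqnarray*}
MI(w_1,w_2) & = & \log_2 \frac{n_{11} * n_{++}}{n_{11} * n_{11}} = \log_2 \frac{n_{++}}{n_{11}} \\
n_{11} = 10: & & \log_2 \frac{1000}{10} \approx 6.64 \\
n_{11} = 1: & & \log_2 \frac{1000}{1} \approx 9.97
\end{eqnarray*}
By comparison, the bigram {\it big cat} in Figure \ref{fig:bigram} receives only $\log_2 \frac{10 * 1000}{50 * 30} \approx 2.74$.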

227:

228: The Dice Coefficient overcomes this limitation, and can be defined as

229: follows:

230:

231: \begin{eqnarray*}

232: Dice(w_1,w_2) = \frac{2* n_{11}}{n_{+1} + n_{1+}}

233: \end{eqnarray*}

234:

235: When $n_{11} = n_{1+} = n_{+1}$ the value of $Dice(w_1,w_2)$ will be 1 for

236: all values of $n_{11}$.  When the value of $n_{11}$ is less than either of the

237: marginal totals (the more typical case) the rankings produced by the Dice

238: Coefficient are similar to those of Mutual Information. The relationship

239: between pointwise Mutual Information and the Dice Coefficient is also

240: discussed in \cite{SmadjaMH96}.
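
For the counts in Figure \ref{fig:bigram}, for example, the Dice Coefficient of {\it big cat} is
\begin{eqnarray*}
Dice(\mbox{\it big},\mbox{\it cat}) = \frac{2 * 10}{50 + 30} = 0.25
\end{eqnarray*}
whereas in the boundary case $n_{11} = n_{1+} = n_{+1}$ the value is $\frac{2 * n_{11}}{n_{11} + n_{11}} = 1$ whether $n_{11}$ is 1 or 100, so a bigram seen only once is not ranked above a frequent one.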

241:

242: We have developed the Bigram Statistics Package to produce ranked lists of

243: bigrams using a range of tests. This software is written in Perl and

244: is freely available from www.d.umn.edu/\~{}tpederse.

245:

246: \section{Learning Decision Trees}

247:

248: Decision trees are among the most widely used machine learning algorithms.

249: They perform a general to specific search of a feature space, adding

250: the most informative features to a tree structure as the search proceeds.

251: The objective is to select a minimal set of features that efficiently

252: partitions the feature space into classes of observations and assemble

253: them into a tree.  In our case, the observations are manually

254: sense--tagged  examples of an  ambiguous word in context and the

255: partitions correspond to the different possible senses.
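
The description above does not spell out how informativeness is measured. As a sketch of one common choice, C4.5--style learners score a candidate feature $f$ on a set of training instances $T$ by the reduction in sense entropy, normalized by the entropy of the split itself. The symbols $T$, $T_v$, $H$, and $p(s|T)$ are our notation, introduced only for this illustration:
\begin{eqnarray*}
H(T) & = & - \sum_{s} p(s|T) \log_2 p(s|T) \\
Gain(T,f) & = & H(T) - \sum_{v \in \{0,1\}} \frac{|T_v|}{|T|} H(T_v) \\
GainRatio(T,f) & = & \frac{Gain(T,f)}{- \sum_{v \in \{0,1\}} \frac{|T_v|}{|T|} \log_2 \frac{|T_v|}{|T|}}
\end{eqnarray*}
Here $s$ ranges over the senses of the ambiguous word, and $T_v$ is the subset of $T$ in which the binary bigram feature $f$ takes the value $v$.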

256:

257: Each feature selected during the search process is represented by

258: a node in the learned decision tree. Each node represents a choice

259: point between a number of different possible values for a feature.

260: Learning continues until all the training examples are accounted for

261: by the decision tree. In general, such a tree will be overly specific

262: to the training data and not generalize well to new examples. Therefore

263: learning is followed by a pruning step where some nodes are eliminated or

264: reorganized to produce a tree that can generalize to new circumstances.

265:

266: Test instances are disambiguated by finding a path through the learned

267: decision tree from the root to a leaf node that corresponds with the

268: observed features. An instance of an ambiguous word is disambiguated by

269: passing it through a series of tests, where each test asks if a

270: particular bigram occurs in the available window of context.

271:

272: We also include three benchmark learning algorithms in this study: the

273: majority classifier, the decision stump, and the Naive Bayesian

274: classifier.

275:

276: The {\it majority classifier} assigns the most common sense in the

277: training data to every instance in the test data.

278: A {\it decision stump} is a one node decision tree \cite{Holte93} that is

279: created by stopping the decision tree learner after the single most

280: informative feature is added to the tree.

281:

282: The {\it Naive  Bayesian classifier} \cite{DudaH73} is based on certain

283: blanket  assumptions about the interactions among  features in a

284: corpus. There is no search of the feature space performed to build a

285: representative model as is the case with decision trees. Instead, all

286: features are included in the classifier and assumed to be relevant to the

287: task at hand. There is a further assumption that each feature is

288: conditionally independent of all other features, given the sense of

289: the ambiguous word. It is most often used with a {\it bag of words}

290: feature set, where every word in  the training sample is represented by a

291: binary feature that indicates  whether or  not it occurs  in the window of

292: context surrounding the ambiguous  word.
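
Under these assumptions the classifier can be written in its standard form. The symbols $S$, $F_i$, and $n$ are our notation for the sense, the $i$th feature, and the number of features:
\begin{eqnarray*}
\hat{s} = \mathop{\rm argmax}_{s} \ p(S=s) \prod_{i=1}^{n} p(F_i = f_i | S = s)
\end{eqnarray*}
where $f_i$ is the observed value of feature $F_i$ in the context being disambiguated, and the probabilities are estimated from the sense--tagged training data.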

293:

294: We use the Weka \cite{weka} implementations of the C4.5

295: decision tree learner (known as J48), the  decision stump, and the Naive

296: Bayesian classifier. Weka is written in Java and is freely available from

297: www.cs.waikato.ac.nz/\~{}ml.

298:

299: \section{Experimental Data}

300:

301: Our empirical study utilizes the training and test data from the 1998

302: SENSEVAL evaluation of word sense disambiguation systems. Ten teams

303: participated in the supervised learning portion of this event.

304: Additional details about the exercise, including the data and results

305: referred to in this paper, can be found at the SENSEVAL web site

306: (www.itri.bton.ac.uk/events/senseval/) and in \cite{KilgarriffP00}.

307:

308: We included all 36 tasks from SENSEVAL for which training and test data

309: were provided. Each task requires that the occurrences of a particular

310: word in the test data be disambiguated based on a model learned from

311: the sense--tagged instances in the training data. Some words were used in

312: multiple tasks as different parts of speech. For example, there were two

313: tasks associated  with {\it bet}, one for its use as a noun and the other

314: as a verb. Thus, there are 36 tasks involving the disambiguation of 29

315: different words.

316:

317: The words and part of speech associated with each task  are shown in Table

318: \ref{tab:results} in column 1. Note that the parts of speech are

319: encoded as {\it n} for noun, {\it a} for  adjective, {\it v} for verb, and

320: {\it p} for words where the part of speech was not provided. The number of

321: test and training instances for each task are shown in columns 2 and

322: 4. Each instance consists of the sentence in which the ambiguous word

323: occurs as well as one or two surrounding sentences.  In general

324: the total context available  for each ambiguous word is less than 100

325: surrounding words. The number of distinct senses in the test data for

326: each task is shown in column 3.

327:

328: \section{Experimental Method}

329:

330: The following process is repeated for each task. Capitalization and

331: punctuation are removed from the training and test data. Two feature

332: sets are selected from the training data based on the top 100 ranked

333: bigrams according to the power divergence statistic and the Dice

334: Coefficient. The bigram must have occurred 5 or more times to be

335: included as a feature. This step filters out a large number of possible

336: bigrams and allows the decision tree learner to focus on a small number of

337: candidate bigrams that are likely to be helpful in the disambiguation

338: process.

339:

340: The training and test data are converted to feature vectors where each

341: feature represents the occurrence of one of the bigrams that belong in

342: the feature set. This representation of the training data is the actual input

343: to the learning algorithms.  Decision tree and decision stump learning is

344: performed twice, once using the feature set determined by the power

345: divergence statistic and again using the feature set identified by the

346: Dice Coefficient. The majority classifier

347: simply determines the most frequent sense in the training data and

348: assigns that to all instances in the test data. The Naive Bayesian

349: classifier is based on a feature set where every word that occurs 5 or

350: more times in the training data is included as a feature.

351:

352: All of these learned models are used to disambiguate the test data. The

353: test data is kept separate until this stage. We employ a fine grained

354: scoring method, where a word is  counted as  correctly disambiguated only

355: when the assigned sense  tag  exactly matches the true sense tag. No

356: partial credit is assigned for near misses.

357:

358: \section{Experimental Results}

359:

360: The accuracy attained by each of the learning algorithms is shown in Table

361: \ref{tab:results}.

362: Column 5 reports the accuracy of the majority classifier, and columns 6 and 7

363: show the best and average accuracy reported by the 10

364: participating SENSEVAL teams. The evaluation at SENSEVAL was

365: based on precision and recall, so we converted those scores to accuracy by

366: taking their product.  However, the best precision and recall may have

367: come from different teams,  so the best accuracy shown in column 6 may

368: actually be higher than that of  any single participating SENSEVAL

369: system. The average accuracy in column 7  is the product of the average

370: precision and recall reported for the participating SENSEVAL teams.

371: Column 8 shows the accuracy of the

372: decision tree using the J48  learning algorithm and the

373: features identified by a power divergence statistic.

374: Column 10 shows the accuracy of the decision tree when the Dice

375: Coefficient selects the features. Columns 9 and 11 show  the accuracy of

376: the decision  stump based on the power

377: divergence statistic  and the Dice

378: Coefficient, respectively. Finally, column 12 shows the accuracy of the

379: Naive Bayesian classifier based on a bag of words feature set.

380:

381: The most accurate method is the decision tree based on a feature set

382: determined by the power divergence statistic.  The last line of Table

383: \ref{tab:results} shows the win-tie-loss score of the decision tree/power

384: divergence method relative to every other method. A win shows it was more

385: accurate than the method in the column, a loss means it was less accurate,

386: and a tie means it was equally accurate. The decision tree/power

387: divergence method was more accurate than the best reported SENSEVAL

388: results for 19  of the 36 tasks, and more accurate for 30 of the 36 tasks

389: when compared to the average reported accuracy. The decision stumps also

390: fared well, proving to be more accurate than the best SENSEVAL results for

391: 14 of the 36 tasks.

392:

393: In general the feature sets selected by the power divergence statistic

394: result in more accurate decision trees than those selected by

395: the Dice Coefficient. The power divergence tests prove to be more reliable

396: since they account for all possible events surrounding two words

397: $w_1$ and $w_2$: when they occur as bigram $w_1w_2$, when $w_1$ or

398: $w_2$ occurs in a bigram without the other, and when a bigram consists of

399: neither. The Dice Coefficient is based strictly on the event where $w_1$

400: and $w_2$ occur together in a bigram.

401:

402: There are 6 tasks where the decision tree / power divergence approach is

403: less accurate than the SENSEVAL average: promise-n, scrap-n, shirt-n,

404: amaze-v, bitter-p, and sanction-p. The most dramatic difference

405: occurred with amaze-v, where the SENSEVAL average was 92.4\% and the

406: decision tree accuracy was 58.6\%. However, this was an unusual task

407: where every instance in the  test data belonged to a single sense that

408: was a minority sense in the training data.

409:

410:

411: \begin{table*}

412: \caption{Experimental Results}

413: \label{tab:results}

414: \begin{center}

415: \begin{tabular}{crrr|rrrrrrrrr}

416: \hline

417: \hline\rule{0pt}{12pt}

418: (1)  & (2)  & (3)& (4)   & (5)  & (6)  & (7) & (8) &

419: (9) & (10) & (11) & (12) \\

420:  &   & senses      &    &  &  &  & j48 & stump & j48 & stump & naive \\

421: word-pos & test & in test & train   & maj  & best & avg & pow & pow

422: & dice & dice & bayes \\[2pt]

423: \hline

424: accident-n & 267    &8   & 227     & 75.3 & 87.1 & 79.6 & 85.0

425: &

426: 77.2 & 83.9 & 77.2 & 83.1 &\\

427: behaviour-n & 279    &3   & 994     & 94.3 & 92.9 & 90.2 & 95.7 &

428: 95.7 & 95.7 & 95.7 & 93.2 & \\

429: bet-n & 274    &15  & 106     & 18.2 & 50.7 & 39.6 & 41.8 &

430: 34.5 & 41.8 & 34.5 & 39.3 & \\

431: excess-n & 186    &8   & 251     & 1.1  & 75.9 & 63.7 & 65.1 &

432: 38.7 & 60.8 & 38.7 & 64.5 & \\

433: float-n & 75     &12  & 61      & 45.3  & 66.1 & 45.0 & 52.0 &

434: 50.7 & 52.0 & 50.7 & 56.0 & \\

435: giant-n & 118    &7   & 355     & 49.2  & 67.6 & 56.6 & 68.6 &

436: 59.3 & 66.1 & 59.3 & 70.3 & \\

437: knee-n & 251    &22  & 435     & 48.2 & 67.4 & 56.0 & 71.3 &

438: 60.2 & 70.5 &60.2 & 64.1 & \\

439: onion-n & 214    &4   & 26      & 82.7 & 84.8 & 75.7 & 82.7 &

440: 82.7 & 82.7 & 82.7 & 82.2 & \\

441: promise-n & 113    &8   & 845     & 62.8  & 75.2 & 56.9 & 48.7 &

442: 63.7 & 55.8 & 62.8 & 78.0 & \\

443: sack-n & 82     &7  & 97       & 50.0  & 77.1 & 59.3 & 80.5 &

444: 58.5 & 80.5 & 58.5 & 74.4 & \\

445: scrap-n & 156    &14  & 27      & 41.7  & 51.6 & 35.1 & 26.3 &

446: 16.7 & 26.3 & 16.7 & 26.7 & \\

447: shirt-n & 184    &8   & 533     & 43.5 & 77.4 & 59.8 & 46.7 &

448: 43.5 & 51.1 & 43.5 & 60.9 & \\

449: amaze-v & 70     &1   & 316     & 0.0  & 100.0& 92.4 & 58.6 &

450: 12.9 & 60.0 & 12.9 & 71.4 & \\

451: bet-v & 117    &9   & 60      & 43.2  & 60.5 & 44.0 & 50.8 &

452: 58.5 & 52.5 & 50.8 & 58.5 & \\

453: bother-v & 209    &8   & 294     & 75.0 & 59.2 & 50.7 & 69.9 &

454: 55.0 & 64.6 & 55.0 & 62.2 & \\

455: bury-v & 201    &14  & 272     & 38.3 & 32.7 & 22.9 & 48.8 &

456: 38.3 & 44.8 & 38.3 & 42.3 & \\

457: calculate-v & 218    &5   & 249     & 83.9 & 85.0 & 75.5 & 90.8 &

458: 88.5 & 89.9 & 88.5 & 80.7 & \\

459: consume-v & 186    &6   & 67      & 39.8 & 25.2 & 20.2 & 36.0 &

460: 34.9 & 39.8 & 34.9 & 31.7 & \\

461: derive-v & 217    &6   & 259     & 47.9 & 44.1 & 36.0 & 82.5 &

462: 52.1 & 82.5 & 52.1 & 72.4 & \\

463: float-v & 229    &16  & 183     & 33.2 & 30.8 & 22.5 & 30.1 &

464: 22.7 & 30.1 & 22.7 & 56.3 & \\

465: invade-v & 207    &6   & 64      & 40.1 & 30.9 & 25.5 & 28.0 &

466: 40.1 & 28.0 & 40.1 & 31.0 & \\

467: promise-v & 224    &6   & 1160    & 85.7 & 82.1 & 74.6 & 85.7 &

468: 84.4 & 81.7 & 81.3 & 85.3 & \\

469: sack-v & 178    &3   & 185     & 97.8 & 95.6 & 95.6 & 97.8 &

470: 97.8 & 97.8 & 97.8 & 97.2 & \\

471: scrap-v & 186    &3   & 30      & 85.5 & 80.6 & 68.6 & 85.5 &

472: 85.5 & 85.5 & 85.5 & 82.3 & \\

473: seize-v & 259    &11  & 291     & 21.2 & 51.0 & 42.1 & 52.9 &

474: 25.1 & 49.4  & 25.1 & 51.7 & \\

475: brilliant-a & 229    &10  & 442     & 45.9 & 31.7 & 26.5 & 55.9 &

476: 45.9 & 51.1 & 45.9 & 58.1 & \\

477: floating-a & 47     &5   & 41      & 57.4 & 49.3 & 27.4 & 57.4 &

478: 57.4 & 57.4 & 57.4 & 55.3 & \\

479: generous-a & 227    &6   & 307     & 28.2 & 37.5 & 30.9 & 44.9 &

480: 32.6 & 46.3 & 32.6 & 48.9 & \\

481: giant-a & 97     &5   & 302     & 94.8 & 98.0 & 93.5 & 95.9 &

482: 95.9 & 94.8 & 94.8 & 94.8 &\\

483: modest-a & 270    &9   & 374     & 61.5 & 49.6 & 44.9 & 72.2 &

484: 64.4 & 73.0 & 64.4 & 68.1 & \\

485: slight-a & 218    &6   & 385     & 91.3 & 92.7 & 81.4 & 91.3 &

486: 91.3 & 91.3 & 91.3 & 91.3 & \\

487: wooden-a & 196    &4   & 362     & 93.9 & 81.7 & 71.3 & 96.9 &

488: 96.9 & 96.9 & 96.9 & 93.9 & \\

489: band-p & 302    &29  &1326     & 77.2 & 81.7 & 75.9 & 86.1 &

490: 84.4 & 79.8 & 77.2 & 83.1 & \\

491: bitter-p & 373    &14  &144      & 27.0 & 44.6 & 39.8 & 36.4 &

492: 31.3 & 36.4 & 31.3 & 32.6 & \\

493: sanction-p & 431    &7   &96       & 57.5 & 74.8 & 62.4 & 57.5 &

494: 57.5 & 57.1 & 57.5 & 56.8 & \\

495: shake-p & 356    &36  &963      & 23.6 & 56.7 & 47.1 & 52.2 &

496: 23.6 & 50.0 & 23.6 & 46.6 & \\[2pt]

497: \hline

498: \multicolumn{4}{c|} {win-tie-loss (j48-pow vs. X)}  &

499: \multicolumn{1}{c} {23-7-6} &

500: \multicolumn{1}{c} {19-0-17} &

501: \multicolumn{1}{c} {30-0-6} &

502: \multicolumn{1}{c} {}&

503: \multicolumn{1}{c} {24-9-3} &

504: \multicolumn{1}{c} {14-15-7} &

505: \multicolumn{1}{c} {25-9-2} &

506: \multicolumn{1}{c} {24-1-11} & \\[2pt]

507: \hline

508: \end{tabular}

509: \end{center}

510: \end{table*}

511: %

512:

513: \begin{table*}

514: \caption{Decision Tree and Stump Characteristics}

515: \label{tab:stump}

516: \begin{center}

517: \begin{tabular}{c|rrr|rrr}

518: \hline

519: \hline

520: \multicolumn{1}{c|}{ } &

521: \multicolumn{3}{c}{power divergence} &

522: \multicolumn{3}{|c}{dice coefficient} \\

523: (1) & (2) & (3) & (4) & (5) & (6) & (7) \\

524: word-pos & stump node & leaf/total & features & stump node & leaf/total

525: &features \\[2pt]

526: \hline

527: accident-n & by accident & 8/15 & 101 & by accident & 12/23 & 112 \\

528: behaviour-n & best behaviour & 2/3 & 100 & best behaviour & 2/3 & 104 \\

529: bet-n & betting shop & 20/39 & 50 & betting shop & 20/39 & 50 \\

530: excess-n & in excess & 13/25 & 104 & in excess & 11/21 & 102\\

531: float-n & the float & 7/13 & 13 & the float & 7/13 & 13 \\

532: giant-n & the giants & 16/31 & 103 & the giants & 14/27 & 78 \\

533: knee-n & knee injury & 23/45 & 102 & knee injury & 20/39 & 104 \\

534: onion-n & in the & 1/1 & 7 & in the & 1/1 & 7\\

535: promise-n & promise of & 95/189 & 100 & a promising & 49/97 & 107 \\

536: sack-n & the sack & 5/9 & 31 & the sack & 5/9 & 31 \\

537: scrap-n & scrap of & 7/13 & 8 & scrap of & 7/13 & 8 \\

538: shirt-n & shirt and & 38/75 & 101 & shirt and & 55/109 & 101 \\

539: amaze-v & amazed at & 11/21 & 102 & amazed at  &11/21  & 102 \\

540: bet-v  & i bet & 4/7 & 10 & i bet & 4/7 & 10 \\

541: bother-v & be bothered & 19/37 & 101 & be bothered & 20/39 & 106 \\

542: bury-v & buried in & 28/55 & 103 & buried in & 32/63 & 103 \\

543: calculate-v & calculated to  & 5/9 & 103 & calculated to & 5/9 & 103 \\

544: consume-v & on the & 4/7 & 20 & on the & 4/7 & 20 \\

545: derive-v & derived from & 10/19 & 104 & derived from & 10/19 & 104 \\

546: float-v & floated on & 24/47 & 80 & floated on & 24/47 & 80 \\

547: invade-v & to invade & 55/109 & 107 & to invade & 66/127 & 108 \\

548: promise-v & promise to & 3/5 & 100 & promise you  & 5/9 & 106 \\

549: sack-v & return to & 1/1 & 91 & return to & 1/1 & 91 \\

550: scrap-v & of the & 1/1 & 7 & of the & 1/1 & 7 \\

551: seize-v & to seize & 26/51 & 104 & to seize & 57/113 & 104 \\

552: brilliant-a & a brilliant & 26/51 & 101 & a brilliant & 42/83 & 103 \\

553: floating-a & in the & 7/13 & 10 & in the & 7/13 & 10 \\

554: generous-a & a generous & 57/113 & 103 & a generous & 56/111 & 102 \\

555: giant-a & the giant & 2/3 & 102 & a giant & 1/1 & 101 \\

556: modest-a & a modest & 14/27 & 101 & a modest & 10/19 & 105 \\

557: slight-a & the slightest & 2/3 & 105 & the slightest & 2/3 & 105 \\

558: wooden-a & wooden spoon & 2/3 & 104 & wooden spoon & 2/3 & 101 \\

559: band-p & band of & 14/27 & 100 & the band & 21/41& 117\\

560: bitter-p & a bitter & 22/43 & 54 & a bitter & 22/43 & 54 \\

561: sanction-p & south africa & 12/23 & 52 & south africa & 12/23 & 52 \\

562: shake-p & his head & 90/179 & 100 & his head & 81/161 & 105 \\

563: \hline

564: \end{tabular}

565: \end{center}

566: \end{table*} %

567:

568: \section{Analysis of Experimental Results}

569:

570: The characteristics of the decision trees and decision stumps learned for

571: each word are shown in Table \ref{tab:stump}. Column 1 shows the

572: word and part of speech. Columns 2, 3, and 4 are based on the

573: feature set selected by the power divergence statistic while

574: columns 5, 6, and 7 are based on the Dice Coefficient. Columns 2 and 5

575: show the node selected to serve as the decision stump. Columns 3 and 6

576: show the number of leaf nodes in the learned decision tree relative to the

577: number of total nodes. Columns 4 and 7 show the number of bigram

578: features selected to represent the training data.

579:

580: This table shows that there is little difference in the decision stump

581: nodes selected from feature sets determined by the power divergence

582: statistics versus the Dice Coefficient. This is to be expected

583: since the top ranked bigrams for each measure are consistent, and the

584: decision stump node is generally chosen from among those.

However, there are differences between the feature sets selected by the
power divergence statistic and the Dice Coefficient. These are reflected
in the different sized trees that are learned based on these feature sets.
The number of leaf nodes and the total number of nodes for each learned
tree are shown in columns 3 and 6.
The number of internal nodes is simply the difference between the
total nodes and the leaf nodes.
Each leaf node represents the end of
a path through the decision tree that makes a sense distinction.
Since a bigram feature can only appear once in the
decision tree, the number of internal nodes represents the number of
bigram features selected by the decision tree learner.
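As a minimal illustration of this arithmetic, the following Python sketch parses a leaves/total entry of the kind shown in columns 3 and 6 and derives the number of internal nodes. The function name \texttt{tree\_shape} is ours and is not part of any package used in the experiments; the entry and the candidate count are those reported for generous-a under the power divergence statistic.

```python
def tree_shape(entry):
    """Parse a 'leaves/total' table entry and derive the internal node count."""
    leaves, total = (int(part) for part in entry.split("/"))
    internal = total - leaves  # each internal node tests one bigram feature
    return leaves, total, internal


# generous-a, power divergence statistic: entry 57/113 with 103 candidates.
leaves, total, internal = tree_shape("57/113")
candidates = 103
print(internal, "of", candidates, "candidate bigrams appear in the tree")
# prints: 56 of 103 candidate bigrams appear in the tree
```

Under this reading, the tree for generous-a uses 56 of the 103 candidate bigrams presented to the learner.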

One of our original hypotheses was that accurate decision trees of
bigrams will include a relatively small number of features. This
was motivated by the success of decision stumps in performing
disambiguation based on a single bigram feature.
In these experiments, there were no decision trees that used all of the
bigram features identified by the filtering step, and for many words the
decision tree learner went on to eliminate most of the candidate
features. This can be seen by comparing the number of internal nodes with
the number of candidate features as shown in columns 4 and 7.\footnote{For
most words the 100 top ranked bigrams form the set of candidate features
presented to the decision tree learner. If
there are ties in the top 100 rankings then there may be more than 100
features, and if there were fewer than 100 bigrams that occurred more
than 5 times then all such bigrams are included in the feature set.}
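The selection rule in the footnote can be sketched in Python as follows. This is only one possible reading and not the code used in the experiments: we assume that a higher score means a higher rank, and that a tie matters when it occurs at the score of the 100th ranked bigram. The function name \texttt{select\_candidates}, its parameters, and the example bigrams and scores are all hypothetical.

```python
def select_candidates(scored_bigrams, top_n=100, min_count=5):
    """Return candidate bigrams from a list of (bigram, count, score) tuples."""
    # Keep only bigrams that occur more than min_count times.
    frequent = [b for b in scored_bigrams if b[1] > min_count]
    # With no more than top_n frequent bigrams, all of them are candidates.
    if len(frequent) <= top_n:
        return [b[0] for b in frequent]
    # Assumption: a higher score means a higher rank.
    ranked = sorted(frequent, key=lambda b: b[2], reverse=True)
    cutoff = ranked[top_n - 1][2]  # score of the bigram ranked top_n
    # Bigrams tied with the cutoff score are kept, so the set may exceed top_n.
    return [b[0] for b in ranked if b[2] >= cutoff]


# Hypothetical (bigram, count, score) triples, for illustration only.
example = [
    ("a b", 10, 5.0),
    ("c d", 8, 3.0),
    ("e f", 7, 3.0),
    ("g h", 9, 1.0),
    ("i j", 2, 9.0),  # dropped: occurs fewer than 6 times
]
print(select_candidates(example, top_n=2))
```

With \texttt{top\_n=2} the example returns three bigrams, because the second and third ranked bigrams share a score.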

It is also noteworthy that the bigrams ultimately selected by the decision
tree learner for inclusion in the tree do not always include those
bigrams ranked most highly by the power divergence statistic or the Dice
Coefficient. This is to be expected, since the selection of the bigrams
from raw text is only measuring the association between two words, while
the decision tree seeks bigrams that partition instances of the ambiguous
word into distinct senses. In particular, the decision tree learner
decides which bigrams to include as nodes in the tree using
the gain ratio, a measure based on the overall Mutual Information
between the bigram and a particular word sense.
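For concreteness, the sketch below computes the gain ratio in its usual textbook form: information gain (the mutual information between feature and class) divided by the entropy of the feature itself. We assume, without having verified it against the learner used here, that the implementation follows this standard definition. The function names are ours.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(feature_values, senses):
    """Gain ratio of a feature (e.g., bigram present or absent) for the senses."""
    total = len(senses)
    # Partition the sense labels by feature value.
    partitions = {}
    for value, sense in zip(feature_values, senses):
        partitions.setdefault(value, []).append(sense)
    # Expected entropy of the senses after splitting on the feature.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(senses) - remainder
    split_info = entropy(feature_values)
    # A feature with a single value carries no split information.
    return info_gain / split_info if split_info > 0 else 0.0
```

A bigram that occurs in exactly the instances of one sense of a two-sense word with balanced senses receives a gain ratio of 1, while a bigram distributed evenly across both senses receives 0.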

Finally, note that the smallest decision trees are functionally equivalent
to our benchmark methods. A decision tree with 1 leaf node and
no internal nodes (1/1) acts as a majority classifier. A decision tree
with 2 leaf nodes and 1 internal node (2/3) has the structure of a
decision stump.
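The two degenerate tree shapes can be written out directly, as in the following Python sketch. It is an illustration only: the function names, the representation of an instance as a set of bigrams, and the sense labels in the example are hypothetical and are not taken from the SENSEVAL data.

```python
from collections import Counter


def majority_classifier(train_senses):
    """Tree with 1 leaf and no internal nodes (1/1): always the most frequent sense."""
    most_common = Counter(train_senses).most_common(1)[0][0]
    return lambda bigrams: most_common


def decision_stump(bigram, sense_if_present, sense_if_absent):
    """Tree with 2 leaves and 1 internal node (2/3): one test on a single bigram."""
    return lambda bigrams: sense_if_present if bigram in bigrams else sense_if_absent


# Hypothetical sense labels for illustration.
majority = majority_classifier(["sense1", "sense2", "sense1"])
stump = decision_stump("his head", "sense1", "sense2")
print(majority({"the ground"}), stump({"his head"}), stump({"the ground"}))
# prints: sense1 sense1 sense2
```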

\section{Discussion}

One of our long-term objectives is to identify a core set of features
that will be useful for disambiguating a wide class of words using both
supervised and unsupervised methodologies.

We have previously presented an ensemble approach to word sense
disambiguation \cite{Pedersen00b} in which multiple Naive Bayesian
classifiers, each based on co--occurrence features from varying sized
windows of context, are shown to perform well on the widely studied nouns
{\it interest} and {\it line}. While the accuracy of this approach was as
good as any previously published results, the learned models were complex
and difficult to interpret, in effect acting as very accurate black boxes.

Our experience has been that variations in learning algorithms
are far less significant contributors to disambiguation
accuracy than are variations in the feature set. In other words, an
informative feature set will result in accurate disambiguation when used
with a wide range of learning algorithms, but there is no
learning algorithm that can perform well given an uninformative or
misleading set of features. Therefore, our focus is on developing and
discovering feature sets that make distinctions among word senses. Our
learning algorithms must not only produce accurate models, but they
should also shed new light on the relationships among features and allow
us to continue refining and understanding our feature sets.

We believe that decision trees meet these criteria. A wide range of
implementations are available, and they are known to be robust and
accurate across a range of domains. Most important, their structure is
easy to interpret and may provide insights into the relationships that
exist among features and more general rules of disambiguation.

\section{Related Work}

Bigrams have been used as features for word sense disambiguation,
particularly in the form of collocations where the ambiguous word is one
component of the bigram (e.g., \cite{BruceW94b}, \cite{NgL96},
\cite{Yarowsky95}). While some of the bigrams we identify are collocations
that include the word being disambiguated, there is no requirement that
this be the case.

Decision trees have been used in supervised learning approaches to word
sense disambiguation, and have fared well in a number of comparative
studies (e.g., \cite{Mooney96}, \cite{PedersenB97A}). In the former they
were used with bag-of-words feature sets and in the latter they were
used with a mixed feature set that included the part-of-speech of
neighboring words, three collocations, and the morphology of the ambiguous
word. We believe that the approach in this paper is the first time that
decision trees based strictly on bigram features have been employed.

The decision list is a closely related approach
that has also been applied to
word sense disambiguation (e.g., \cite{Yarowsky94}, \cite{WilksS98},
\cite{Yarowsky00}). Rather than building and traversing a tree to perform
disambiguation, a list is employed. In the general case
a decision list may suffer from less fragmentation during learning than
a decision tree; as a practical matter this means that the decision list
is less likely to be over--trained. However, we believe that the degree of
fragmentation also depends on the feature set used for learning. Ours
consists of at most approximately 100 binary features. This results in a
relatively small feature space that is not as likely to suffer from
fragmentation as are larger spaces.

\section{Future Work}

There are a number of immediate extensions to this work. The first is to
relax the requirement that bigrams be made up of two consecutive words.
Rather, we will search for bigrams where the component words may be
separated by other words in the text. The second is to eliminate the
filtering step by which candidate bigrams are selected by a power
divergence statistic. Instead, the decision tree learner would consider
all possible bigrams. Despite increasing the danger of fragmentation,
this is an interesting issue since the bigrams judged most informative by
the decision tree learner are not always ranked highly in the filtering
step. In particular, we will determine if the filtering process ever
eliminates bigrams that could be significant sources of disambiguation
information.
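The first of these extensions might be sketched as follows in Python. This is a hypothetical illustration rather than a description of existing code: the function name \texttt{windowed\_bigrams} and the \texttt{max\_gap} parameter, which bounds the number of intervening words, are our own assumptions.

```python
def windowed_bigrams(tokens, max_gap=2):
    """Yield ordered word pairs separated by at most max_gap intervening words."""
    for i, first in enumerate(tokens):
        # j ranges over the next max_gap + 1 positions after i.
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            yield (first, tokens[j])


tokens = ["the", "senate", "bill"]
print(list(windowed_bigrams(tokens, max_gap=0)))
# prints: [('the', 'senate'), ('senate', 'bill')]
print(list(windowed_bigrams(tokens, max_gap=1)))
# prints: [('the', 'senate'), ('the', 'bill'), ('senate', 'bill')]
```

Setting \texttt{max\_gap=0} recovers the consecutive bigrams used in this paper, so the current feature set is a special case of the extended one.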

In the longer term, we hope to adapt this approach to unsupervised
learning, where disambiguation is performed without the benefit of
sense-tagged text. We are optimistic that this is viable, since bigram
features are easy to identify in raw text.

\section{Conclusion}

This paper shows that the combination of a simple feature set made
up of bigrams and a standard decision tree learning algorithm
results in accurate word sense disambiguation. The results of this
approach are compared with those from the 1998 SENSEVAL word sense
disambiguation exercise and show that the bigram--based decision tree
approach is more accurate than the best SENSEVAL results for 19 of 36
words.

\section{Acknowledgments}

The Bigram Statistics Package has been implemented by Satanjeev Banerjee,
who is supported by a Grant--in--Aid of Research, Artistry and Scholarship
from the Office of the Vice President for Research and the Dean of the
Graduate School of the University of Minnesota. We would like to thank
the SENSEVAL organizers for making the data and results from the 1998
event freely available. The comments of three anonymous reviewers were
very helpful in preparing the final version of this paper. A preliminary
version of this paper appears in \cite{Pedersen01a}.

%
% ---- Bibliography ----
%

%\bibliography{/home/cs/tpederse/TeX/Papers/papers/bib/tdp}
%\bibliographystyle{/home/cs/tpederse/TeX/Papers/papers/sty/acl}

\begin{thebibliography}{}

\bibitem[\protect\citename{Bruce and Wiebe}1994]{BruceW94b}
R.~Bruce and J.~Wiebe.
\newblock 1994.
\newblock Word-sense disambiguation using decomposable models.
\newblock In {\em Proceedings of the 32nd Annual Meeting of the Association for
  Computational Linguistics}, pages 139--146.

\bibitem[\protect\citename{Church and Hanks}1990]{ChurchH90}
K.~Church and P.~Hanks.
\newblock 1990.
\newblock Word association norms, mutual information and lexicography.
\newblock In {\em Proceedings of the 28th Annual Meeting of the Association for
  Computational Linguistics}, pages 76--83.

\bibitem[\protect\citename{Cressie and Read}1984]{CressieR84}
N.~Cressie and T.~Read.
\newblock 1984.
\newblock Multinomial goodness of fit tests.
\newblock {\em Journal of the Royal Statistical Society Series B}, 46:440--464.

\bibitem[\protect\citename{Duda and Hart}1973]{DudaH73}
R.~Duda and P.~Hart.
\newblock 1973.
\newblock {\em Pattern Classification and Scene Analysis}.
\newblock Wiley, New York, NY.

\bibitem[\protect\citename{Dunning}1993]{Dunning93}
T.~Dunning.
\newblock 1993.
\newblock Accurate methods for the statistics of surprise and coincidence.
\newblock {\em Computational Linguistics}, 19(1):61--74.

\bibitem[\protect\citename{Holte}1993]{Holte93}
R.~Holte.
\newblock 1993.
\newblock Very simple classification rules perform well on most commonly used
  datasets.
\newblock {\em Machine Learning}, 11:63--91.

\bibitem[\protect\citename{Kilgarriff and Palmer}2000]{KilgarriffP00}
A.~Kilgarriff and M.~Palmer.
\newblock 2000.
\newblock Special issue on {SENSEVAL}: Evaluating word sense disambiguation
  programs.
\newblock {\em Computers and the Humanities}, 34(1--2).

\bibitem[\protect\citename{Mooney}1996]{Mooney96}
R.~Mooney.
\newblock 1996.
\newblock Comparative experiments on disambiguating word senses: An
  illustration of the role of bias in machine learning.
\newblock In {\em Proceedings of the Conference on Empirical Methods in Natural
  Language Processing}, pages 82--91, May.

\bibitem[\protect\citename{Ng and Lee}1996]{NgL96}
H.T. Ng and H.B. Lee.
\newblock 1996.
\newblock Integrating multiple knowledge sources to disambiguate word sense: An
  exemplar-based approach.
\newblock In {\em Proceedings of the 34th Annual Meeting of the Association for
  Computational Linguistics}, pages 40--47.

\bibitem[\protect\citename{Pedersen and Bruce}1997]{PedersenB97A}
T.~Pedersen and R.~Bruce.
\newblock 1997.
\newblock A new supervised learning algorithm for word sense disambiguation.
\newblock In {\em Proceedings of the Fourteenth National Conference on
  Artificial Intelligence}, pages 604--609, Providence, RI, July.

\bibitem[\protect\citename{Pedersen}1996]{Pedersen96}
T.~Pedersen.
\newblock 1996.
\newblock Fishing for exactness.
\newblock In {\em Proceedings of the South Central SAS User's Group (SCSUG-96)
  Conference}, pages 188--200, Austin, TX, October.

\bibitem[\protect\citename{Pedersen}2000]{Pedersen00b}
T.~Pedersen.
\newblock 2000.
\newblock A simple approach to building ensembles of naive bayesian classifiers
  for word sense disambiguation.
\newblock In {\em Proceedings of the First Annual Meeting of the North American
  Chapter of the Association for Computational Linguistics}, pages 63--69,
  Seattle, WA, May.

\bibitem[\protect\citename{Pedersen}2001]{Pedersen01a}
T.~Pedersen.
\newblock 2001.
\newblock Lexical semantic ambiguity resolution with bigram--based decision
  trees.
\newblock In {\em Proceedings of the Second International Conference on
  Intelligent Text Processing and Computational Linguistics}, pages 157--168,
  Mexico City, February.

\bibitem[\protect\citename{Smadja \bgroup et al.\egroup }1996]{SmadjaMH96}
F.~Smadja, K.~McKeown, and V.~Hatzivassiloglou.
\newblock 1996.
\newblock Translating collocations for bilingual lexicons: A statistical
  approach.
\newblock {\em Computational Linguistics}, 22(1):1--38.

\bibitem[\protect\citename{Wilks and Stevenson}1998]{WilksS98}
Y.~Wilks and M.~Stevenson.
\newblock 1998.
\newblock Word sense disambiguation using optimised combinations of knowledge
  sources.
\newblock In {\em Proceedings of COLING/ACL-98}.

\bibitem[\protect\citename{Witten and Frank}2000]{weka}
I.~Witten and E.~Frank.
\newblock 2000.
\newblock {\em Data Mining - Practical Machine Learning Tools and Techniques
  with Java Implementations}.
\newblock Morgan--Kaufmann, San Francisco, CA.

\bibitem[\protect\citename{Yarowsky}1994]{Yarowsky94}
D.~Yarowsky.
\newblock 1994.
\newblock Decision lists for lexical ambiguity resolution: Application to
  accent restoration in {S}panish and {F}rench.
\newblock In {\em Proceedings of the 32nd Annual Meeting of the Association for
  Computational Linguistics}.

\bibitem[\protect\citename{Yarowsky}1995]{Yarowsky95}
D.~Yarowsky.
\newblock 1995.
\newblock Unsupervised word sense disambiguation rivaling supervised methods.
\newblock In {\em Proceedings of the 33rd Annual Meeting of the Association for
  Computational Linguistics}, pages 189--196, Cambridge, MA.

\bibitem[\protect\citename{Yarowsky}2000]{Yarowsky00}
D.~Yarowsky.
\newblock 2000.
\newblock Hierarchical decision lists for word sense disambiguation.
\newblock {\em Computers and the Humanities}, 34(1--2).

\end{thebibliography}
\end{document}