0506:cs0506075/eval.tex

\section{Evaluation}
\label{sec:eval}

\newcommand{\tabsvmova}{SVM one-vs-all (ova)}
\newcommand{\tabsvmposratio}{ova\plusidentity\APS}
\newcommand{\tabsvmregdisc}{SVM regression (reg)}
\newcommand{\tabngsvmregressionposratio}{reg\plusidentity\APS}

This section compares the accuracies of the approaches outlined in
Section \ref{sec:method} on the four corpora comprising our
\gradoc. (Results using $L_1$ error were qualitatively similar.)
Throughout, when we refer to something as ``significant'', we mean
statistically significant
according to the paired $t$-test, $p<.05$.
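
For reference, the paired $t$-test statistic has the standard form sketched below. The symbols are illustrative notation introduced only for this sketch and are not defined elsewhere in the paper: $d_i$ denotes the difference between the two methods' scores on the $i$-th of $n$ paired observations, and the sketch does not fix what constitutes a paired observation.
\[
t = \frac{\bar{d}}{s_d/\sqrt{n}}, \qquad
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2 .
\]
Under the null hypothesis of no difference between the methods, $t$ follows a $t$ distribution with $n-1$ degrees of freedom, and a difference counts as significant when the resulting $p$-value falls below $.05$.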

The results that follow
are based on ${\rm SVM}^{light}$'s default parameter settings for SVM regression
and OVA.
Preliminary analysis of the effect of varying the regression parameter
$\varepsilon$ in the \four-class case revealed that the default value
was often optimal.

The notation ``A{\plusidentity}B'' denotes
metric labeling
where method A provides the initial
label preference function $\pref$ and B
serves as the similarity measure.
To train, we first select the meta-parameters $\genericnumneighbors$
and $\coeff$ by running 9-fold cross-validation within the training
set.
Fixing $\genericnumneighbors$
and $\coeff$
to the values yielding the best performance, we then re-train A
(but with SVM parameters fixed, as described above) on the whole
training set. At test time,
the nearest neighbors of each
item are also taken from the full
training set.
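
The selection step just described can be summarized as follows. This is only a sketch under stated assumptions: $\mathcal{G}$ (the grid of candidate meta-parameter values) and $\mathrm{perf}_j$ (the performance measured on the $j$-th held-out fold) are illustrative symbols not defined elsewhere in the paper, and ``best performance'' is assumed to mean the highest average over the nine folds.
\[
(\genericnumneighbors^{*}, \coeff^{*}) \;=\;
\mathop{\mathrm{arg\,max}}_{(\genericnumneighbors,\,\coeff)\,\in\,\mathcal{G}}
\;\frac{1}{9}\sum_{j=1}^{9} \mathrm{perf}_j(\genericnumneighbors, \coeff)
\]
Method A is then re-trained on the whole training set with $(\genericnumneighbors^{*}, \coeff^{*})$ held fixed.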

\subsection{Main comparison}
\label{sec:acc}

\input{eval-accfig}

Figure \ref{fig:acc} summarizes our average 10-fold cross-validation
accuracy
results.
We first observe from the plots that all the algorithms described in Section
\ref{sec:method}
always
definitively outperform
the simple baseline of predicting the
majority class, although the improvements are smaller in the
\four-class case.
Incidentally, the data was distributed in such a way that the absolute performance of the baseline
itself does not change much between the \three- and \four-class cases
(which implies that the \three-class datasets were relatively more balanced);
and \rc's
datasets seem noticeably easier than the others.

We now examine the effect of implicitly using label and item similarity.
In the \four-class case, regression
performed better than OVA (\statsigly so for
two {\reviewer}s, as shown in
the right-hand table);
but for the \three-category task, OVA
\statsigly outperformed
regression
for all four
authors.
One might initially interpret this ``flip'' as showing that in the \four-class
scenario, item and label similarities provide a
richer source of information
relative to class-specific characteristics, especially since for the
non-majority classes there is less data available; whereas in the \three-class
setting the categories are better modeled as quite distinct entities.

However, the \three-class results for metric labeling on top of OVA and
regression (shown in Figure \ref{fig:acc} by black versions of the
corresponding icons) show that employing explicit similarities always
improves results, often to a \statsig degree, and
yields the best overall accuracies.  Thus,
we
{\em can} in fact effectively exploit similarities in the \three-class
case.
Additionally, in both the \three- and \four-class scenarios,
metric labeling often brings the performance of the weaker base method
up to that of the stronger one (as indicated by the
``disappearance'' of upward triangles in corresponding table rows), and
never hurts performance \statsigly.

In the \four-class case, metric labeling and regression seem roughly
equivalent.  One possible interpretation is that the relevant
structure of the problem is already
captured by linear
regression (and perhaps a different kernel for regression would have
improved its \three-class performance).  However, according to
additional experiments we ran
in the \four-class situation,
the test-set-optimal parameter settings for
metric labeling would have produced \statsig improvements,
indicating there may be greater potential for our
framework.  At any rate, we view the fact that
metric labeling
performed quite well for both rating scales
as a
definitely positive result.

\subsection{Further discussion}
\label{sec:disc}

\qanda{Metric labeling looks like it's
  just combining SVMs with nearest neighbors, and classifier
  combination often improves performance.  Couldn't we get the same
  kind of results by combining SVMs with any other
  reasonable method?}  { No.  For example, if we take the strongest
  base SVM method for initial label preferences, but replace
  \APSabbrev with the term-overlap-based cosine (\overlapabbrev),
  performance often drops \statsigly.  This
  result, which is in accordance with Section \ref{sec:mix}'s data, suggests
  that choosing an item similarity function that correlates well with label
  similarity is important.
(\statdiff{\ovaabbrev\plusidentity\APS}{\ovaabbrev\plusidentity\overlapabbrev}{\rowb\rowb\rowb\rowb}{3};
\statdiff{\regabbrev\plusidentity\APS}{\regabbrev\plusidentity\overlapabbrev}{\nodiff\nodiff\nodiff\rowb}{4})
}

\qanda{Could you explain that notation, please?}
{
  Triangles point toward the \statsigly better algorithm for some
  dataset.   For instance, ``\statdiff{M}{N}{\rowb\rowb\colb\nodiff}{3}'' means,
``In the 3-class task, method M is \statsigly better than N for two
  author datasets and \statsigly worse
for one dataset (so the algorithms were statistically
  indistinguishable on the remaining dataset)''.
When the algorithms being compared are statistically indistinguishable
on all four datasets (the ``no triangles'' case), we indicate this
with an equals sign (``='').
}

\qanda{Thanks.  Doesn't Figure \ref{fig:APS} show that the \aps would
  be a good classifier even in isolation, so metric labeling isn't necessary?}
{
No.
Predicting class labels directly from the \APS value via trained
  thresholds
isn't as effective
(\statdiff{\ovaabbrev\plusidentity\APS}{threshold \APS}{\rowb\rowb\rowb\rowb}{3};
\statdiff{\regabbrev\plusidentity\APS}{threshold \APS}{\rowb\nodiff\rowb\nodiff}{4}).

Alternatively, we could use only the \APSabbrev component of metric labeling by
setting the label preference function to the constant function 0,
but even with {\em test-set-optimal} parameter settings, doing so
underperforms the {\em trained} metric labeling algorithm with access
to an initial SVM classifier
(\statdiff{\ovaabbrev\plusidentity\APS}{\APSonly}{\rowb\rowb\rowb\rowb}{3};
\statdiff{\regabbrev\plusidentity\APS}{\APSonly}{\rowb\nodiff\rowb\nodiff}{4}).
}

\qanda{What about using \APSabbrev as one of the features for input to a standard
  classifier?}{Our focus is on investigating the
  utility of similarity information.  In our particular
  rating-inference setting, it so happens that the basis for our
  pairwise similarity measure can be
  incorporated as an item-specific feature, but we view this as a
  tangential issue.  That being said, preliminary experiments show
  that metric labeling can be better, barely (for test-set-optimal
  parameter settings for both algorithms:
 \statsigly better
  results for one author, \four-class
case; statistically
  indistinguishable otherwise),
although one needs to determine an appropriate weight for the \APSabbrev
feature
to get good performance.
}

\qanda{You
defined the ``metric transformation'' function $\distfn$ as the
identity function $\distfn(\distvar)=\distvar$,
imposing greater loss as the distance between labels assigned to two similar items increases.
Can you do just as well if you penalize all non-equal label
assignments by the same amount, or does the distance between labels really matter?
}
{
You're asking for a comparison to the {\em Potts model}, which
sets $\distfn$ to the function $\unif(\distvar)=1$ if $\distvar > 0$, $0$ otherwise.
In the one setting in which there is a \statsig difference between
the two, the Potts model does worse
(\statdiff{\ovaabbrev\plusidentity\APS}{\ovaabbrev\pluspotts\APS}{\rowb\nodiff\nodiff\nodiff}{3}).
Also, employing
the
Potts model generally leads to fewer \statsig
improvements over a chosen base method (compare Figure \ref{fig:acc}'s tables with: \statdiff{\regabbrev\pluspotts\APS}{\regabbrev}{\nodiff\nodiff\nodiff\rowb}{3};
\statdiff{\ovaabbrev\pluspotts\APS}{\ovaabbrev}{\rowb\rowb\nodiff\nodiff}{3};
\statdiff{\ovaabbrev\pluspotts\APS}{\ovaabbrev}{\nodiff\nodiff\nodiff\nodiff$=$}{4};
but note that
\statdiff{\regabbrev\pluspotts\APS}{\regabbrev}{\nodiff\nodiff\nodiff\rowb}{4}).
We note that optimizing the
Potts model in the multi-label case is NP-hard, whereas the
optimal metric labeling with the identity metric-transformation
function can be efficiently
obtained (see Section \ref{sec:aps}).
}
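
To make the contrast in the preceding answer concrete, the two metric-transformation functions can be written side by side. The worked values below are hypothetical cases chosen purely for illustration, and they assume that label distances $\distvar$ take integer values.
\[
\distfn(\distvar) = \distvar
\qquad\mbox{versus}\qquad
\unif(\distvar) = \left\{
\begin{array}{ll}
1 & \mbox{if } \distvar > 0,\\
0 & \mbox{otherwise.}
\end{array}
\right.
\]
For $\distvar = 2$, the identity function gives $\distfn(2) = 2$ whereas the Potts model gives $\unif(2) = 1$; for $\distvar = 1$ both give $1$. Under the integer-distance assumption, the two therefore differ only for labels more than one step apart, where the identity function imposes the greater penalty.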

\qanda{Your datasets had many labeled reviews and only one
  \reviewer each.  Is your work relevant to settings with many
  {\reviewer}s but very little data for each?
}{As discussed in Section \ref{sec:validate}, it can be quite
  difficult to properly calibrate different {\reviewer}s' scales, since
  the same number of ``stars'' even within what is ostensibly the same
  rating system can mean different things for different {\reviewer}s.
But since you ask: we temporarily turned a blind eye to this serious issue,
creating a collection of 5394 reviews by 496 {\reviewer}s with at most
  80 reviews per \reviewer, where we pretended that our rating
  conversions mapped correctly into a universal rating scheme.
Preliminary results on this
dataset were actually
comparable to the results reported above, although since we are not
confident in the class labels themselves, more work is needed to
derive a clear analysis of this setting.
(Abusing notation, since we're already playing fast and loose: [3c]:
baseline 52.4\%,  {\regabbrev} 61.4\%, {\regabbrev\plusidentity\APS}
61.5\%, {\ovaabbrev} (65.4\%) $\triangleright$
{\ovaabbrev\plusidentity\APS} (66.3\%);
[4c]: baseline 38.8\%, {\regabbrev} (51.9\%)  $\triangleright$
{\regabbrev\plusidentity\APS} (52.7\%), {\ovaabbrev} (53.8\%) $\triangleright$ {\ovaabbrev\plusidentity\APS} (54.6\%).)

In future work, it would be interesting to determine \reviewer-independent
characteristics that can be used on (or suitably adapted to) data for specific
{\reviewer}s.
}

\qanda{How about trying ---}{---Yes, there are many
  alternatives. A few that we tested are described in the Appendix, and
  we propose some others in the next section.  We should mention that
  we have not yet experimented with {\em all-vs.-all} (AVA), another standard
  binary-to-multi-category classifier conversion method, because we
  wished to
  focus on the effect of omitting pairwise information. In
  independent work on 3-category \problemGeneric for a different corpus, \newcite{Koppel+Schler:05a} found that
  regression outperformed AVA, and \newcite{Rifkin+Klautau:04a} argue
  that in principle OVA should do just as well as AVA. But we plan to
  try it out.

}
