cs0506075/eval.tex
1: \section{Evaluation}
2: \label{sec:eval}
3: 
4: \newcommand{\tabsvmova}{SVM one-vs-all (ova)}
5: \newcommand{\tabsvmposratio}{ova\plusidentity\APS}
6: \newcommand{\tabsvmregdisc}{SVM regression (reg)}
7: \newcommand{\tabngsvmregressionposratio}{reg\plusidentity\APS}
8: 
9: 
10: 
11: This section compares the accuracies of the  approaches outlined in
12: Section \ref{sec:method} on the four corpora comprising  our
13: \gradoc. (Results using $L_1$ error were qualitatively similar.)
14: Throughout, when we refer to something as ``significant'', we mean
15: statistically so
16: with respect to the paired $t$-test, $p<.05$.
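
As a reference sketch of what this criterion amounts to, suppose two
methods are compared on $n$ matched measurements with differences
$\delta_1,\ldots,\delta_n$ (the symbols $n$ and $\delta_i$ are introduced
only for this illustration, and the pairing unit is an assumption).  The
paired $t$ statistic is
\[
  t = \frac{\bar{\delta}}{s_{\delta}/\sqrt{n}}, \qquad
  \bar{\delta} = \frac{1}{n}\sum_{i=1}^{n}\delta_i, \qquad
  s_{\delta}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(\delta_i-\bar{\delta})^2,
\]
which is referred to a $t$ distribution with $n-1$ degrees of freedom.
If, for example, $n=10$ (one difference per fold of a 10-fold
cross-validation), a two-sided test at $p<.05$ requires $|t|>2.262$.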
17: 
18: The results that follow
19: are based on ${\rm SVM}^{light}$'s default parameter settings for SVM regression
20: and OVA.
21: Preliminary analysis of the effect of varying the regression parameter
22: $\varepsilon$ in the \four-class case revealed that the default value
23: was often optimal.
24: 
25: 
26: The notation ``A{\plusidentity}B''  denotes 
27: metric labeling 
28: where method A provides the initial
29: label preference function $\pref$ and B
30: serves as similarity measure.  
31: To train, we first select the meta-parameters $\genericnumneighbors$
32: and $\coeff$ by running 9-fold cross-validation within the training
33: set.  
34: Fixing $\genericnumneighbors$
35: and $\coeff$ 
36: to those values yielding the best performance, we then re-train A 
37: (but with SVM parameters fixed, as described above) on the whole
38: training set. At test time, 
39: the nearest neighbors of each 
40: item are also taken from the full
41: training set.
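
One way to write the selection step, assuming that ``best performance''
means the highest mean accuracy over the nine held-out folds (the
candidate set $G$ of parameter pairs and the per-fold accuracy
${\rm acc}_j$ are notation introduced only for this sketch), is
\[
  (\hat{\genericnumneighbors},\hat{\coeff}) \;=\;
  \mathop{\rm argmax}_{(\genericnumneighbors,\coeff)\in G}\;
  \frac{1}{9}\sum_{j=1}^{9} {\rm acc}_j(\genericnumneighbors,\coeff),
\]
where ${\rm acc}_j(\genericnumneighbors,\coeff)$ is the accuracy on the
$j$-th held-out fold when the remaining eight folds of the training set
are used for training.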
42: 
43:  \subsection{Main comparison}
44: \label{sec:acc}
45: 
46: \input{eval-accfig}
47: 
48: Figure \ref{fig:acc} summarizes our average 10-fold cross-validation
49: accuracy
50: results.
51: We first observe from the plots that all the algorithms described in Section
52: \ref{sec:method} 
53: always 
54: definitively outperform
55: the simple baseline of  predicting the
56: majority class, although the improvements are smaller in the
57: \four-class case.
58: Incidentally, the data was distributed in  such a way that the absolute performance of the  baseline
59: itself does not change much between the \three- and \four-class case
60: (which implies that the \three-class datasets were relatively more balanced);
61: and  \rc's 
62: datasets seem noticeably easier than the others.
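
The parenthetical inference about balance can be made explicit.  With
$n_c$ items in class $c$, $N$ items in total, and $C$ classes (notation
introduced here for illustration only), the majority-class baseline has
accuracy
\[
  {\rm acc}_{\rm maj} \;=\; \max_{c}\frac{n_c}{N} \;\ge\; \frac{1}{C},
\]
with equality exactly when the classes are perfectly balanced.  The floor
is therefore $1/3$ for the \three-class task but $1/4$ for the
\four-class task, so a baseline accuracy that is roughly the same in
both settings lies closer to its floor in the \three-class setting.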
63: 
64: 
65: We now examine the effect of implicitly using label and item similarity.
66: In the \four-class case, regression
67: performed better than OVA (\statsigly so for
68: two {\reviewer}s, as shown in
69: the right-hand table);
70: but for the \three-category task, OVA 
71: \statsigly outperforms 
72: regression
73: for all four
74: authors.
75: One might initially interpret this ``flip'' as showing that in the \four-class
76: scenario, item and label similarities provide a 
77: richer source of information
78: relative to class-specific characteristics, especially since for the
79: non-majority classes there is less data available; whereas in the \three-class
80: setting the categories are better modeled as quite distinct entities.
81: 
82: 
83: 
84: 
85: 
86: However, the \three-class results for metric labeling on top of OVA and
87: regression (shown in Figure \ref{fig:acc} by black versions of the
88: corresponding icons) show that employing explicit similarities always
89: improves results, often to a \statsig degree, and 
90: yields the best overall accuracies.  Thus,
91: we 
92: {\em can} in fact effectively exploit similarities in the \three-class
93: case.
94: Additionally, in both the \three- and \four-class scenarios,
95: metric labeling often brings the performance of the weaker base method
96: up to that of the stronger one (as indicated by the
97: ``disappearance'' of upward triangles in corresponding table rows), and 
98: never hurts performance \statsigly.
99: 
100: In the \four-class case, metric labeling and regression seem roughly
101: equivalent.  One possible interpretation is that the relevant
102: structure of the problem is already 
103: captured by linear
104: regression (and perhaps a different kernel for regression would have
105: improved its \three-class performance).  However, according to
106: additional experiments we ran
107: in the \four-class situation, 
108: the test-set-optimal parameter settings for
109: metric labeling would have produced \statsig improvements,
110: indicating there may be greater potential for our
111: framework.  At any rate, we view the fact that 
112: metric labeling
113: performed quite well for both rating scales 
114: as a 
115: definitely positive result.
116: 
117: \subsection{Further discussion}
118: \label{sec:disc}
119: 
120: 
121: \qanda{Metric labeling looks like it's
122:   just combining SVMs with nearest neighbors, and classifier
123:   combination often improves performance.  Couldn't we get the same
124:   kind of results by  combining SVMs with any other
125:   reasonable method?}  { No.  For example, if we take the strongest
126:   base SVM method for initial label preferences, but replace
127:   \APSabbrev with the term-overlap-based cosine (\overlapabbrev),
128:   performance often drops \statsigly.  This
129:   result, which is in accordance with Section \ref{sec:mix}'s data, suggests
130:   that choosing an item similarity function that correlates well with label
131:   similarity is important.
132: (\statdiff{\ovaabbrev\plusidentity\APS}{\ovaabbrev\plusidentity\overlapabbrev}{\rowb\rowb\rowb\rowb}{3}; 
133: \statdiff{\regabbrev\plusidentity\APS}{\regabbrev\plusidentity\overlapabbrev}{\nodiff\nodiff\nodiff\rowb}{4})
134: }
135: 
136: \qanda{Could you explain that  notation, please?}
137: {
138:   Triangles point toward the \statsigly better algorithm for some
139:   dataset.   For instance, ``\statdiff{M}{N}{\rowb\rowb\colb\nodiff}{3}'' means,
140: ``In the 3-class task, method M is \statsigly better than N for two
141:   author datasets and \statsigly worse
142: for one dataset (so the algorithms were statistically
143:   indistinguishable on the remaining dataset)''. 
144: When the algorithms being compared are statistically indistinguishable
145: on all four datasets (the ``no triangles'' case), we indicate this
146: with an equals sign (``='').
147: }
148: 
149: 
150: \qanda{Thanks.  Doesn't Figure \ref{fig:APS} show that the \aps would
151:   be a good classifier even in isolation, so metric labeling isn't necessary?}
152: {
153: No.
154: Predicting class labels directly from the \APS value via trained
155:   thresholds 
156: isn't as effective
157: (\statdiff{\ovaabbrev\plusidentity\APS}{threshold \APS}{\rowb\rowb\rowb\rowb}{3};
158: \statdiff{\regabbrev\plusidentity\APS}{threshold \APS}{\rowb\nodiff\rowb\nodiff}{4}).
159: 
160: Alternatively, we could use only the \APSabbrev component of metric labeling by
161: setting the label preference function to the constant function 0,
162: but even with {\em test-set-optimal} parameter settings, doing so
163: underperforms the {\em trained} metric labeling algorithm with access
164: to an initial SVM classifier
165: (\statdiff{\ovaabbrev\plusidentity\APS}{\APSonly}{\rowb\rowb\rowb\rowb}{3};
166: \statdiff{\regabbrev\plusidentity\APS}{\APSonly}{\rowb\nodiff\rowb\nodiff}{4}).
167: }
168: 
169: \qanda{What about using \APSabbrev as one of the features for input to a standard
170:   classifier?}{Our focus is on investigating the
171:   utility of similarity information.  In our particular
172:   rating-inference setting, it  so happens that the basis for our
173:   pairwise similarity measure can be
174:   incorporated as an item-specific feature, but we view this as a
175:   tangential issue.  That being said, preliminary experiments show
176:   that metric labeling can be marginally better (for test-set-optimal
177:   parameter settings for both algorithms: 
178:  \statsigly better
179:   results for one author, \four-class 
180: case; statistically
181:   indistinguishable otherwise),
182: although one needs to determine an appropriate weight for the \APSabbrev
183: feature 
184: to get good performance.
185: }
186: 
187: \qanda{You 
188: defined the ``metric transformation'' function $\distfn$ as the
189: identity function $\distfn(\distvar)=\distvar$,
190: imposing greater loss as the distance between labels assigned to two similar items increases.
191: Can you do just as well if you penalize all non-equal label
192: assignments by the same amount, or does the distance between labels really matter?
193: }
194: {
195: You're asking for a comparison to the {\em Potts model}, which 
196: sets $\distfn$ to the function $\unif(\distvar)=1$ if  $\distvar > 0$, $0$ otherwise.  
197: In the one setting in which there is a \statsig difference between
198: the two, the Potts model does worse
199: (\statdiff{\ovaabbrev\plusidentity\APS}{\ovaabbrev\pluspotts\APS}{\rowb\nodiff\nodiff\nodiff}{3}).
200: Also, employing 
201: the 
202: Potts model generally leads to fewer \statsig
203: improvements over a chosen base method (compare Figure \ref{fig:acc}'s tables with: \statdiff{\regabbrev\pluspotts\APS}{\regabbrev}{\nodiff\nodiff\nodiff\rowb}{3};
204: \statdiff{\ovaabbrev\pluspotts\APS}{\ovaabbrev}{\rowb\rowb\nodiff\nodiff}{3};
205: \statdiff{\ovaabbrev\pluspotts\APS}{\ovaabbrev}{\nodiff\nodiff\nodiff\nodiff$=$}{4};
206: but note that
207: \statdiff{\regabbrev\pluspotts\APS}{\regabbrev}{\nodiff\nodiff\nodiff\rowb}{4}).
208: We note that optimizing the
209: Potts model in the multi-label case is NP-hard, whereas the 
210: optimal metric labeling with the identity metric-transformation
211: function can be efficiently 
212: obtained (see Section \ref{sec:aps}).
213: }
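
To make the contrast between the two transformations in the preceding
answer concrete, they can be tabulated for the label distances that arise
with four classes, assuming (for this illustration only) that labels are
consecutive integers and that the distance $\distvar$ between two labels
is their absolute difference:
\[
  \begin{array}{l|cccc}
    \distvar                   & 0 & 1 & 2 & 3 \\ \hline
    \distfn(\distvar)=\distvar & 0 & 1 & 2 & 3 \\
    \unif(\distvar)            & 0 & 1 & 1 & 1
  \end{array}
\]
Under this assumption the two functions coincide for $\distvar\le 1$ and
differ only for labels two or more steps apart, where the identity
transformation imposes the larger penalty.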
214: 
215: \qanda{Your datasets had many labeled reviews and only one
216:   \reviewer each.  Is your work relevant to settings with many
217:   {\reviewer}s but very little data for each?
218: }{As discussed in Section \ref{sec:validate}, it can be quite
219:   difficult to properly calibrate different {\reviewer}s' scales, since
220:   the same number of ``stars'' even within what is ostensibly the same
221:   rating system can mean different things for different {\reviewer}s.
222: But since you ask: we temporarily turned a blind eye to this serious issue,
223: creating a collection of 5394 reviews by 496 {\reviewer}s with at most
224:   80 reviews per \reviewer, where we pretended that our rating
225:   conversions mapped correctly into a universal rating scheme.
226: Preliminary results on this 
227: dataset were actually
228: comparable to the results reported above, although since we are not
229: confident in the class labels themselves, more work is needed to
230: derive a clear analysis of this setting.
231: (Abusing notation, since we're already playing fast and loose: [3c]:
232: baseline 52.4\%,  {\regabbrev} 61.4\%, {\regabbrev\plusidentity\APS}
233: 61.5\%, {\ovaabbrev} (65.4\%) $\triangleright$
234: {\ovaabbrev\plusidentity\APS} (66.3\%);
235: [4c]: baseline 38.8\%, {\regabbrev} (51.9\%)  $\triangleright$
236: {\regabbrev\plusidentity\APS} (52.7\%), {\ovaabbrev} (53.8\%) $\triangleright$ {\ovaabbrev\plusidentity\APS} (54.6\%)).
237: 
238: In future work, it would be interesting to determine \reviewer-independent
239: characteristics that can be used on (or suitably adapted to) data for specific
240: {\reviewer}s. 
241: }
242: 
243: \qanda{How about trying ---}{---Yes, there are many 
244:   alternatives. A few that we tested are described in the Appendix, and
245:   we  propose some others in the next section.  We should mention that
246:    we have not yet experimented with {\em all-vs.-all} (AVA), another standard
247:   binary-to-multi-category classifier conversion method, because we
248:   wished to
249:   focus on the effect of omitting pairwise information. In
250:   independent work on 3-category \problemGeneric for a different corpus, \newcite{Koppel+Schler:05a}  found that
251:   regression outperformed  AVA, and \newcite{Rifkin+Klautau:04a} argue
252:   that in principle OVA should do just as well as AVA. But we plan to
253:   try it out.
254:  
255: }
256: 
257: 
258: 
259: 
260: