1: \section{Evaluation}
2: \label{sec:eval}
3:
4: \newcommand{\tabsvmova}{SVM one-vs-all (ova)}
5: \newcommand{\tabsvmposratio}{ova\plusidentity\APS}
6: \newcommand{\tabsvmregdisc}{SVM regression (reg)}
7: \newcommand{\tabngsvmregressionposratio}{reg\plusidentity\APS}
8:
9:
10:
11: This section compares the accuracies of the approaches outlined in
12: Section \ref{sec:method} on the four corpora comprising our
13: \gradoc. (Results using $L_1$ error were qualitatively similar.)
Throughout, when we call a difference ``significant'', we mean
statistically significant according to the paired $t$-test at $p<.05$.
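
For concreteness, the statistic can be sketched in its standard textbook
form. The sketch assumes that the pairing is over the $n$ cross-validation
folds and that the test is two-sided; neither detail is spelled out here,
and $\delta_i$ (the difference in accuracy between the two methods on
fold $i$) is notation introduced only for this illustration:
\[
t = \frac{\bar{\delta}}{s_{\delta}/\sqrt{n}}, \qquad
\bar{\delta} = \frac{1}{n}\sum_{i=1}^{n}\delta_i, \qquad
s_{\delta}^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\bigl(\delta_i-\bar{\delta}\bigr)^{2}.
\]
Under these assumptions, 10 folds give $n-1=9$ degrees of freedom, for
which the two-sided threshold at $p<.05$ is $|t|>2.262$.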
17:
18: The results that follow
19: are based on ${\rm SVM}^{light}$'s default parameter settings for SVM regression
20: and OVA.
21: Preliminary analysis of the effect of varying the regression parameter
22: $\varepsilon$ in the \four-class case revealed that the default value
23: was often optimal.
24:
25:
The notation ``A{\plusidentity}B'' denotes
metric labeling
where method A provides the initial
label preference function $\pref$ and B
serves as the similarity measure.
31: To train, we first select the meta-parameters $\genericnumneighbors$
32: and $\coeff$ by running 9-fold cross-validation within the training
33: set.
34: Fixing $\genericnumneighbors$
35: and $\coeff$
36: to those values yielding the best performance, we then re-train A
37: (but with SVM parameters fixed, as described above) on the whole
38: training set. At test time,
39: the nearest neighbors of each
40: item are also taken from the full
41: training set.
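
As a reminder of what is being tuned, the objective can be sketched as
follows, assuming the formulation of Section \ref{sec:method}. The symbols
$\ell_x$ (the label assigned to test item $x$),
$\mathit{nn}_{\genericnumneighbors}(x)$ (the $\genericnumneighbors$ nearest
training-set neighbors of $x$), $d$ (the distance between two labels), and
$\mathit{sim}$ (the similarity measure B) are illustrative notation and may
differ from that section's:
\[
\min_{\ell}\;\sum_{x \in \mathrm{test}}
\Bigl[\, -\pref(x,\ell_x)
 \;+\; \coeff \sum_{y \in \mathit{nn}_{\genericnumneighbors}(x)}
 \distfn\bigl(d(\ell_x,\ell_y)\bigr)\,\mathit{sim}(x,y) \Bigr].
\]
In this reading, method A determines $\pref$, cross-validation selects
$\genericnumneighbors$ and $\coeff$, and each neighbor $y$ contributes its
known training label $\ell_y$.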
42:
43: \subsection{Main comparison}
44: \label{sec:acc}
45:
46: \input{eval-accfig}
47:
48: Figure \ref{fig:acc} summarizes our average 10-fold cross-validation
49: accuracy
50: results.
We first observe from the plots that all the algorithms described in
Section \ref{sec:method} always clearly outperform
the simple baseline of predicting the
majority class, although the improvements are smaller in the
\four-class case.
Incidentally, the absolute performance of the baseline
itself does not change much between the \three- and \four-class cases
(which implies that the \three-class datasets were relatively more balanced),
and \rc's
datasets seem noticeably easier than the others.
63:
64:
65: We now examine the effect of implicitly using label and item similarity.
66: In the \four-class case, regression
67: performed better than OVA (\statsigly so for
68: two {\reviewer}s, as shown in
69: the righthand table);
70: but for the \three-category task, OVA
71: \statsigly outperforms
72: regression
73: for all four
74: authors.
One might initially interpret this ``flip'' as showing that in the \four-class
76: scenario, item and label similarities provide a
77: richer source of information
78: relative to class-specific characteristics, especially since for the
79: non-majority classes there is less data available; whereas in the \three-class
80: setting the categories are better modeled as quite distinct entities.
81:
82:
83:
84:
85:
86: However, the \three-class results for metric labeling on top of OVA and
87: regression (shown in Figure \ref{fig:acc} by black versions of the
88: corresponding icons) show that employing explicit similarities always
89: improves results, often to a \statsig degree, and
90: yields the best overall accuracies. Thus,
91: we
92: {\em can} in fact effectively exploit similarities in the \three-class
93: case.
Additionally, in both the \three- and \four-class scenarios,
95: metric labeling often brings the performance of the weaker base method
96: up to that of the stronger one (as indicated by the
97: ``disappearance'' of upward triangles in corresponding table rows), and
98: never hurts performance \statsigly.
99:
100: In the \four-class case, metric labeling and regression seem roughly
101: equivalent. One possible interpretation is that the relevant
102: structure of the problem is already
103: captured by linear
104: regression (and perhaps a different kernel for regression would have
105: improved its \three-class performance). However, according to
106: additional experiments we ran
107: in the \four-class situation,
108: the test-set-optimal parameter settings for
109: metric labeling would have produced \statsig improvements,
110: indicating there may be greater potential for our
111: framework. At any rate, we view the fact that
112: metric labeling
113: performed quite well for both rating scales
114: as a
115: definitely positive result.
116:
117: \subsection{Further discussion}
118: \label{sec:disc}
119:
120:
121: \qanda{Metric labeling looks like it's
122: just combining SVMs with nearest neighbors, and classifier
123: combination often improves performance. Couldn't we get the same
124: kind of results by combining SVMs with any other
125: reasonable method?} { No. For example, if we take the strongest
126: base SVM method for initial label preferences, but replace
127: \APSabbrev with the term-overlap-based cosine (\overlapabbrev),
128: performance often drops \statsigly. This
129: result, which is in accordance with Section \ref{sec:mix}'s data, suggests
130: that choosing an item similarity function that correlates well with label
131: similarity is important.
132: (\statdiff{\ovaabbrev\plusidentity\APS}{\ovaabbrev\plusidentity\overlapabbrev}{\rowb\rowb\rowb\rowb}{3};
133: \statdiff{\regabbrev\plusidentity\APS}{\regabbrev\plusidentity\overlapabbrev}{\nodiff\nodiff\nodiff\rowb}{4})
134: }
135:
136: \qanda{Could you explain that notation, please?}
137: {
138: Triangles point toward the \statsigly better algorithm for some
139: dataset. For instance, ``\statdiff{M}{N}{\rowb\rowb\colb\nodiff}{3}'' means,
140: ``In the 3-class task, method M is \statsigly better than N for two
141: author datasets and \statsigly worse
142: for one dataset (so the algorithms were statistically
143: indistinguishable on the remaining dataset)''.
144: When the algorithms being compared are statistically indistinguishable
145: on all four datasets (the ``no triangles'' case), we indicate this
146: with an equals sign (``='').
147: }
148:
149:
150: \qanda{Thanks. Doesn't Figure \ref{fig:APS} show that the \aps would
151: be a good classifier even in isolation, so metric labeling isn't necessary?}
152: {
153: No.
154: Predicting class labels directly from the \APS value via trained
155: thresholds
156: isn't as effective
157: (\statdiff{\ovaabbrev\plusidentity\APS}{threshold \APS}{\rowb\rowb\rowb\rowb}{3};
158: \statdiff{\regabbrev\plusidentity\APS}{threshold \APS}{\rowb\nodiff\rowb\nodiff}{4}).
159:
160: Alternatively, we could use only the \APSabbrev component of metric labeling by
161: setting the label preference function to the constant function 0,
162: but even with {\em test-set-optimal} parameter settings, doing so
163: underperforms the {\em trained} metric labeling algorithm with access
164: to an initial SVM classifier
165: (\statdiff{\ovaabbrev\plusidentity\APS}{\APSonly}{\rowb\rowb\rowb\rowb}{3};
166: \statdiff{\regabbrev\plusidentity\APS}{\APSonly}{\rowb\nodiff\rowb\nodiff}{4}).
167: }
168:
169: \qanda{What about using \APSabbrev as one of the features for input to a standard
170: classifier?}{Our focus is on investigating the
171: utility of similarity information. In our particular
172: rating-inference setting, it so happens that the basis for our
173: pairwise similarity measure can be
174: incorporated as an item-specific feature, but we view this as a
tangential issue. That being said, preliminary experiments show
that metric labeling can be marginally better (for test-set-optimal
177: parameter settings for both algorithms:
178: \statsigly better
179: results for one author, \four-class
180: case; statistically
181: indistinguishable otherwise),
182: although one needs to determine an appropriate weight for the \APSabbrev
183: feature
184: to get good performance.
185: }
186:
187: \qanda{You
188: defined the ``metric transformation'' function $\distfn$ as the
189: identity function $\distfn(\distvar)=\distvar$,
190: imposing greater loss as the distance between labels assigned to two similar items increases.
191: Can you do just as well if you penalize all non-equal label
192: assignments by the same amount, or does the distance between labels really matter?
193: }
194: {
195: You're asking for a comparison to the {\em Potts model}, which
196: sets $\distfn$ to the function $\unif(\distvar)=1$ if $\distvar > 0$, $0$ otherwise.
197: In the one setting in which there is a \statsig difference between
198: the two, the Potts model does worse
199: (\statdiff{\ovaabbrev\plusidentity\APS}{\ovaabbrev\pluspotts\APS}{\rowb\nodiff\nodiff\nodiff}{3}).
200: Also, employing
201: the
202: Potts model generally leads to fewer \statsig
203: improvements over a chosen base method (compare Figure \ref{fig:acc}'s tables with: \statdiff{\regabbrev\pluspotts\APS}{\regabbrev}{\nodiff\nodiff\nodiff\rowb}{3};
204: \statdiff{\ovaabbrev\pluspotts\APS}{\ovaabbrev}{\rowb\rowb\nodiff\nodiff}{3};
205: \statdiff{\ovaabbrev\pluspotts\APS}{\ovaabbrev}{\nodiff\nodiff\nodiff\nodiff$=$}{4};
206: but note that
207: \statdiff{\regabbrev\pluspotts\APS}{\regabbrev}{\nodiff\nodiff\nodiff\rowb}{4}).
208: We note that optimizing the
209: Potts model in the multi-label case is NP-hard, whereas the
210: optimal metric labeling with the identity metric-transformation
211: function can be efficiently
212: obtained (see Section \ref{sec:aps}).
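
To make the contrast between the two functions concrete, here is a small
worked comparison. It assumes, purely for illustration, that labels are
coded as consecutive integers, so that $\distvar$ is the absolute
difference between two labels:
\[
\begin{array}{c|cc}
\distvar & \distfn(\distvar)=\distvar & \unif(\distvar) \\ \hline
0 & 0 & 0 \\
1 & 1 & 1 \\
2 & 2 & 1
\end{array}
\]
Under that coding, the two functions differ in the \three-class task only
when two similar items receive the two extreme labels, a case the identity
function penalizes twice as heavily as the Potts model does; in the
\four-class task a distance of $3$ is also possible, and there the
identity penalty is three times the Potts penalty.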
213: }
214:
215: \qanda{Your datasets had many labeled reviews and only one
216: \reviewer each. Is your work relevant to settings with many
217: {\reviewer}s but very little data for each?
218: }{As discussed in Section \ref{sec:validate}, it can be quite
219: difficult to properly calibrate different {\reviewer}s' scales, since
220: the same number of ``stars'' even within what is ostensibly the same
221: rating system can mean different things for different {\reviewer}s.
222: But since you ask: we temporarily turned a blind eye to this serious issue,
223: creating a collection of 5394 reviews by 496 {\reviewer}s with at most
224: 80 reviews per \reviewer, where we pretended that our rating
225: conversions mapped correctly into a universal rating scheme.
226: Preliminary results on this
227: dataset were actually
228: comparable to the results reported above, although since we are not
229: confident in the class labels themselves, more work is needed to
230: derive a clear analysis of this setting.
231: (Abusing notation, since we're already playing fast and loose: [3c]:
232: baseline 52.4\%, {\regabbrev} 61.4\%, {\regabbrev\plusidentity\APS}
233: 61.5\%, {\ovaabbrev} (65.4\%) $\triangleright$
234: {\ovaabbrev\plusidentity\APS} (66.3\%);
235: [4c]: baseline 38.8\%, {\regabbrev} (51.9\%) $\triangleright$
{\regabbrev\plusidentity\APS} (52.7\%), {\ovaabbrev} (53.8\%) $\triangleright$ {\ovaabbrev\plusidentity\APS} (54.6\%).)
237:
238: In future work, it would be interesting to determine \reviewer-independent
239: characteristics that can be used on (or suitably adapted to) data for specific
240: {\reviewer}s.
241: }
242:
243: \qanda{How about trying ---}{---Yes, there are many
244: alternatives. A few that we tested are described in the Appendix, and
245: we propose some others in the next section. We should mention that
246: we have not yet experimented with {\em all-vs.-all} (AVA), another standard
247: binary-to-multi-category classifier conversion method, because we
248: wished to
249: focus on the effect of omitting pairwise information. In
250: independent work on 3-category \problemGeneric for a different corpus, \newcite{Koppel+Schler:05a} found that
251: regression outperformed AVA, and \newcite{Rifkin+Klautau:04a} argue
252: that in principle OVA should do just as well as AVA. But we plan to
253: try it out.
254:
255: }
256:
257:
258:
259:
260: