1: \documentclass[english]{article}
2: \pdfoutput=1
3: \usepackage{amssymb}
4: \usepackage{amsmath}
5: \usepackage{wrapfig}
6: \usepackage{graphicx}
7: \usepackage{subfigure}
8: \usepackage{verbatim}
9: \usepackage[multiple]{footmisc}
10: \usepackage[bf]{caption2}
11:
12: \renewcommand{\captionfont}{\small}
13:
14: \pagestyle{empty}
15:
16: \def\argmax{\operatornamewithlimits{arg\,max}}
17: \def\argmin{\operatornamewithlimits{arg\,min}}
18:
19: \setcounter{topnumber}{5}
20: \renewcommand{\topfraction}{1}
21: \setcounter{bottomnumber}{5}
22: \renewcommand{\bottomfraction}{1}
23: \setcounter{totalnumber}{10}
24: \renewcommand{\textfraction}{0}
25: \renewcommand{\floatpagefraction}{0}
26: \graphicspath{{images/}}
27: \newcommand{\eg}{e.g.\ }
28:
29:
30: \pagenumbering{arabic}
31:
32:
33: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
34: \begin{document}
35:
36:
37: \title{Comparison and Combination of State-of-the-art \mbox{\rule{-1ex}{0pt}Techniques for
38: Handwritten Character Recognition:} Topping the MNIST Benchmark}
39: \author{Daniel Keysers\\daniel.keysers@dfki.de\\
40: Image Understanding and Pattern Recognition (IUPR) Group\\
41: German Research Center for Artificial Intelligence (DFKI)}
42: \date{May 2006}
43: \maketitle
44:
45:
46: \pagestyle{plain}
47: \setcounter{page}{1}
48:
49:
50: \begin{abstract}
51: Although the recognition of isolated handwritten digits has been a research
52: topic for many years, it continues to be of interest for the research
53: community and for commercial applications. We show that despite the
54: maturity of the field, different approaches still deliver results that vary
55: enough to allow improvements by using their combination. We do so by
56: choosing four well-motivated state-of-the-art recognition systems for which
57: results on the standard MNIST benchmark are available. When comparing the
58: errors made, we observe that the errors made differ between all four
59: systems, suggesting the use of classifier combination. We then determine the
60: error rate of a hypothetical system that combines the output of the four
61: systems. The result obtained in this manner is an error rate of 0.35\% on
62: the MNIST data, the best result published so far. We furthermore discuss the
63: statistical significance of the combined result and of the results of the
64: individual classifiers.
65: \end{abstract}
66:
67:
68: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
69: \section{Introduction}
70:
71: The recognition of handwritten digits is a topic of practical importance
72: because of applications like automated form reading and handwritten zip-code
73: processing. It is also a subject that has continued to produce much research
74: effort over the last decades for several reasons:
75: \begin{itemize}
76: \addtolength{\itemsep}{-1.2ex}
77: \item The problem is prototypical for image processing and pattern recognition, with
78: a small number of classes.
79: \item Standard benchmark data sets exist that make it easy to obtain valid results
80: quickly.
81: \item Many publications and techniques are available that can be cited and
82: built on, respectively.
83: \item The practical applications motivate the research performed.
84: \item Improvements in classification accuracy over existing techniques
85: continue to be obtained using new approaches.
86: \end{itemize}
87:
The objective of this paper is to analyze four state-of-the-art methods
for the recognition of handwritten
digits~\cite{shapecontext_pami,sch02,icpr04_nlmatch,simardICDAR03} by
comparing the errors made on the standard MNIST benchmark data.
(A part of this work has been described in~\cite{diss}.)
We perform a
statistical analysis of the errors using a bootstrapping
technique~\cite{bisani_poi} that not only uses the error count but also takes
into account which errors were made. Using this technique we can determine
more accurate estimates of the statistical significance of improvements.
98:
99: When analyzing the errors made we observe that --- although the error rates
100: obtained are all very similar --- there are substantial differences in {\em
101: which} patterns are classified erroneously. This can be interpreted as an
102: indicator for using classifier combination. An experiment shows that indeed a
103: combination of the classifiers performs better than the single best
classifier. The statistical analysis shows that the probability that this
result constitutes a real improvement and is not based on chance alone is
94\%.
107:
108:
109: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
110: \section{Related work}
111:
112: This paper is of course only possible because the results
113: of the four chosen base
114: methods~\cite{shapecontext_pami,sch02,icpr04_nlmatch,simardICDAR03} were
115: available\footnote{We would like to thank Patrice Simard for providing the
116: recognition results to us and the authors of~\cite{shapecontext_pami,sch02}
117: for listing the errors in the respective papers.}. These approaches are
118: presented in more detail in Section~\ref{sec:mnist-sota}. We are aware that
119: there exist other methods that also achieve very good classification error
120: rates on the data used, e.g.~\cite{liu_benchmark}.
121: However, we feel that the
122: four methods chosen comprise a set of well-motivated and self-contained
123: approaches.
124: Furthermore, they represent the different classification methods
125: most commonly used (in the research literature), that is, the nearest neighbor
126: classifier, neural networks, and the support vector machine. All four methods
127: use the appearance-based paradigm in the broad sense and can thus be
128: considered as being sufficiently general as to be applied to other object
129: recognition tasks.
130:
131: There is a large amount of work available on the topic of classifier
132: combination as well (an introduction can be found e.g.~in~\cite{Kittler98})
133: and much work exists on applying classifier combination to handwriting
134: recognition
135: (e.g.~\cite{batthacharyya-iwfhr04,das06_kumar,mcs2001,bunke-iwfhr04}).
136: Note that we do not propose new algorithms for classification of handwritten
137: digits or for the combination of classifiers. Instead, our contribution is to
138: present a statistical analysis that compares different classifiers and to show
139: that their combination improves the performance even though the individual
140: classifiers all reach state-of-the-art error rates by themselves.
141:
142:
143:
144: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
145: \section{The MNIST task}\label{sec:mnist}
146:
147:
148: The modified NIST handwritten digit database (MNIST, \cite{lecun98}) contains
149: 60,000 images in the training set and 10,000 patterns in the test set, each of
150: size 28$\times$28 pixels with 256 graylevels. The data set is available
151: online\footnote{\tt http://www.research.att.com/$\sim$yann/ocr/mnist/} and
152: some examples from the MNIST corpus are shown in Figure~\ref{fig:nist_ex}.
153:
154: The preprocessing of the images is described as follows in \cite{lecun98}:
155: ``The original black and white (bilevel) images were size normalized to fit in
156: a 20$\times$20 pixel box while preserving their aspect ratio. The resulting
157: images contain gray levels as result of the antialiasing (image interpolation)
158: technique used by the normalization algorithm. [...] the images were centered
159: in a 28$\times$28 image by computing the center of mass of the pixels and
160: translating the image so as to position this point at the center of the
161: 28$\times$28 field.'' Note that some authors use a `deslanted' version of the
162: database.
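
The centering step in the quoted description can be made explicit as follows.
The notation and the 1-based pixel indexing are assumptions made here for
illustration and are not taken from~\cite{lecun98}. For a size-normalized
image with gray values $I(x,y)$, the center of mass is
\[
\bar{x} = \frac{\sum_{x,y} x\, I(x,y)}{\sum_{x,y} I(x,y)}, \qquad
\bar{y} = \frac{\sum_{x,y} y\, I(x,y)}{\sum_{x,y} I(x,y)},
\]
and the image is translated by $(x_c-\bar{x},\, y_c-\bar{y})$, where
$(x_c,y_c)=(14.5,14.5)$ would be the center of a 28$\times$28 field whose
pixel centers are indexed from 1 to 28.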
163:
164: \begin{figure}[tb]
165: \begin{center}
166: \includegraphics[width=\columnwidth]{images/NIST1}
167: \includegraphics[width=\columnwidth]{images/NIST2}
168: \includegraphics[width=\columnwidth]{images/NIST3}\\
169: \caption[Example images from the MNIST data set]%
170: {Example images from the MNIST data set.\label{fig:nist_ex}}
171: \end{center}
172: \end{figure}
173:
The task is generally not considered to be a `difficult' recognition task (in
the sense that absolute error rates are high) for two reasons. First, the
176: human error rate is estimated to be only about 0.2\%, although it has not been
177: determined for the whole test set \cite{sim93+}. Second, the large training
178: set allows machine learning algorithms to generalize well. With respect to the
179: connection between training set size and classification performance for OCR
180: tasks it is argued \cite{Smith94} that increasing the training set size by a
181: factor of ten cuts the error rate approximately to half the original figure.
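
One way to read this rule of thumb, which is an interpretation and not a
formula taken from~\cite{Smith94}, is as a power law. Let $E(N)$ denote the
error rate obtained with $N$ training samples. Halving the error rate for
every tenfold increase of $N$ corresponds to
\[
E(N) \approx E(N_0)\left(\frac{N}{N_0}\right)^{-\log_{10} 2}
     \approx E(N_0)\left(\frac{N}{N_0}\right)^{-0.30},
\]
so that, under this assumption, doubling the training set would reduce the
error rate only by a factor of about $2^{-0.30}\approx 0.81$.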
182:
183: Table~\ref{tab:mnist-er} gives a comprehensive overview of the error rates
184: reported for the MNIST data. One disadvantage of the MNIST corpus is that
185: there exists no development test set, which leads to effects known as
186: `training on the testing data'. This is not necessarily true for each of the
187: research groups performing experiments, but it cannot always be ruled out.
188: Note that in some publications (e.g.~\cite{simardICDAR03}) the authors
189: explicitly state that all parameters of the system were chosen by using a
190: subset of the training set for validation, which then rules out the
191: overadaptation to the test set.
192: However, the tendency exists to evaluate one method with
193: different parameters or different methods several times on the same data until
194: the best performance seems to have been reached. This procedure leads to an
195: overly optimistic estimation of the error rate of the classifier and the
196: number of tuned parameters should be considered when judging such error rates.
197: Ideally, a development test set would be used to determine the best parameters
198: for the classifiers and the results would be obtained from one run on the test
199: set itself. Nevertheless a comparison of `best performing' algorithms may
200: lead to valid conclusions, especially if these perform well on several
201: different tasks.
202: %
203: \def\pz{\phantom{0}}
204: \begin{table}[b]
205: \caption[MNIST error rates]{Error rates for the MNIST task in \%.
206: The systems marked with $^*$ are those we use for analysis and combination.
207: \label{tab:mnist-er}}
208: %\small
209: \centering
210: \begin{tabular}{@{\vline\hspace{0.7ex}}r@{\hspace{0.7ex}}l@{\hspace{0.7ex}\vline\hspace{0.7ex}}l@{\hspace{0.7ex}\vline\hspace{0.7ex}}r@{\hspace{0.7ex}\vline}}
211: \hline
212: \multicolumn{2}{@{\vline\hspace{0.7ex}}l}{reference} & method & ER\pz \\
213: \hline
214: \cite{sim93+} & AT\&T & human performance & 0.2\pz \\
215: & --- & Euclidean nearest neighbor & 3.5\pz \\
216: \hline
\cite{maree04} & U Li\`ege & decision trees + sub-windows & 2.63 \\
218: \cite{lecun98} & AT\&T & deslant, Euclidean 3-NN & 2.4\pz \\
219: \cite{icpr04_uchida} & Kyushu U & elastic matching & 2.10 \\
220: \cite{icpr00_td} & RWTH & one-sided tangent distance & 1.9\pz \\
221: \cite{bot94+} & AT\&T & neural net LeNet1 & 1.7\pz \\
222: \cite{mayraz} & UC London & products of experts & 1.7\pz \\
223: \cite{milgram_mnist_05} & U Qu\'ebec & hyperplanes + support vector m. & 1.5\pz \\
224: \cite{sch97} & TU Berlin & support vector machine & 1.4\pz \\
225: \cite{bot94+} & AT\&T & neural net LeNet4 & 1.1\pz \\
226: \cite{sim93+} & AT\&T & tangent distance & 1.1\pz \\
227: \cite{icpr00_td} & RWTH & two-sided tangent d., virt. data & 1.0\pz \\
228: \cite{dong02} & CENPARMI & local learning & 0.99 \\
229: \cite{sch98new+} & MPI, AT\&T & virtual SVM & 0.8\pz \\
230: \cite{lecun98} & AT\&T & distortions, neural net LeNet5 & 0.82 \\
231: \cite{lecun98} & AT\&T & distortions, boosted LeNet4 & 0.7\pz \\
232: \cite{teow00} & U Singapore & bio-inspired features + SVM & 0.72 \\
233: \cite{sch02} & Caltech,MPI & virtual SVM (jitter) & 0.68 \\
234: \cite{shapecontext_pami} & UC Berkeley & shape context matching & $^*$0.63 \\
235: \cite{dong04} & CENPARMI & support vector machine & 0.60 \\
236: \cite{teow02} & U Singapore & deslant, biology-inspired features & 0.59 \\
237: \cite{athi05} & Boston U & cascaded shape context & 0.58 \\
238: \cite{sch02} & Caltech,MPI & deslant, virtual SVM (jitter,shift) & $^*$0.56 \\
239: \cite{athi05} & Boston U & shape context matching & 0.54 \\
240: \cite{icpr04_nlmatch} & RWTH & deformation model (IDM) & $^*$0.54 \\
241: \cite{liu_benchmark} & Hitachi & preprocessing, support vector m. & 0.42\\
242: \cite{simardICDAR03} & Microsoft & neural net + virtual data & $^*$0.42 \\
243: & this work & hyp. comb. of 4 systems ($^*$)& 0.35 \\
244: \hline
245: \end{tabular}
246: \end{table}
247: %
Note that Dong reports error rates of 0.38 to 0.44 percent on his web page
(accessed February 2005), which are lower than those in \cite{dong04}, but it
remains somewhat unclear how these error rates were obtained and whether they
are due to the effect of `training on the testing data'.
252: Also, \cite{teow02} try a variety of SVMs and networks which yield error rates
253: ranging from 0.59 percent to 0.81 percent.
The IDM \cite{icpr04_nlmatch} as described in Section~\ref{sec:idm} was
255: not optimized for the MNIST task. Instead, all parameter settings were
256: determined using the smaller USPS data set and then the complete setup was
257: evaluated once on the MNIST data.
258:
259: \begin{figure}[p]
260: \newlength{\mndlength}
261: \setlength{\mndlength}{7mm}
262: \newcommand{\examplewithclass}[2]{\scriptsize%
263: #2\includegraphics[width=\mndlength]{images/MNIST-difficult#1}}
264: \centerline{
265: \begin{tabular}{*{8}{c@{\;}}}
266: \examplewithclass{0}{9}&
267: \examplewithclass{1}{6}&
268: \examplewithclass{2}{4}&
269: \examplewithclass{3}{7}&
270: \examplewithclass{4}{8}&
271: \examplewithclass{5}{2}&
272: \examplewithclass{6}{5}&
273: \examplewithclass{7}{8}\\
274: \examplewithclass{8}{1}&
275: \examplewithclass{9}{7}&
276: \fbox{\examplewithclass{10}{8}}&
277: \examplewithclass{11}{6}&
278: \examplewithclass{12}{8}&
279: \examplewithclass{13}{4}&
280: \examplewithclass{14}{7}&
281: \examplewithclass{15}{9}\\
282: \examplewithclass{16}{4}&
283: \examplewithclass{17}{9}&
284: \examplewithclass{18}{5}&
285: \examplewithclass{19}{8}&
286: \examplewithclass{20}{5}&
287: \examplewithclass{21}{8}&
288: \examplewithclass{22}{4}&
289: \examplewithclass{23}{3}\\
290: \examplewithclass{24}{9}&
291: \examplewithclass{25}{2}&
292: \examplewithclass{26}{8}&
293: \fbox{\examplewithclass{27}{9}}&
294: \examplewithclass{28}{5}&
295: \examplewithclass{29}{5}&
296: \examplewithclass{30}{7}&
297: \examplewithclass{31}{5}\\
298: \examplewithclass{32}{2}&
299: \examplewithclass{33}{3}&
300: \fbox{\examplewithclass{34}{4}}&
301: \examplewithclass{35}{6}&
302: \fbox{\examplewithclass{36}{1}}&
303: \examplewithclass{37}{5}&
304: \examplewithclass{38}{9}&
305: \examplewithclass{39}{1}\\
306: \examplewithclass{40}{4}&
307: \examplewithclass{41}{2}&
308: \examplewithclass{42}{2}&
309: \examplewithclass{43}{7}&
310: \examplewithclass{44}{9}&
311: \examplewithclass{45}{5}&
312: \examplewithclass{46}{9}&
313: \examplewithclass{47}{6}\\
314: \examplewithclass{48}{4}&
315: \examplewithclass{49}{3}&
316: \examplewithclass{50}{9}&
317: \examplewithclass{51}{3}&
318: \examplewithclass{52}{5}&
319: \examplewithclass{53}{6}&
320: \examplewithclass{54}{8}&
321: \examplewithclass{55}{1}\\
322: \examplewithclass{56}{7}&
323: \examplewithclass{57}{2}&
324: \examplewithclass{58}{4}&
325: \fbox{\examplewithclass{59}{6}}&
326: \examplewithclass{60}{7}&
327: \examplewithclass{61}{3}&
328: \examplewithclass{62}{6}&
329: \examplewithclass{63}{4}\\
330: \examplewithclass{64}{5}&
331: \examplewithclass{65}{1}&
332: \examplewithclass{66}{7}&
333: \examplewithclass{67}{6}&
334: \examplewithclass{68}{7}&
335: \examplewithclass{69}{7}&
336: \examplewithclass{70}{9}&
337: \examplewithclass{71}{9}\\
338: \examplewithclass{72}{9}&
339: \examplewithclass{73}{9}&
340: \examplewithclass{74}{7}&
341: \examplewithclass{75}{9}&
342: \examplewithclass{76}{9}&
343: \examplewithclass{77}{9}&
344: \examplewithclass{78}{2}&
345: \examplewithclass{79}{1}\\
346: \examplewithclass{80}{9}&
347: \examplewithclass{81}{2}&
348: \examplewithclass{82}{9}&
349: \examplewithclass{83}{8}&
350: \examplewithclass{84}{9}&
351: \examplewithclass{85}{9}&
352: \examplewithclass{86}{8}&
353: \examplewithclass{87}{3}\\
354: \examplewithclass{88}{9}&
355: \examplewithclass{89}{9}&
356: \examplewithclass{90}{6}&
357: \examplewithclass{91}{4}&
358: \examplewithclass{92}{7}&
359: \examplewithclass{93}{5}&
360: \fbox{\examplewithclass{94}{5}}&
361: \examplewithclass{95}{3}\\
362: \examplewithclass{96}{3}&
363: \examplewithclass{97}{9}&
364: \examplewithclass{98}{2}&
365: \examplewithclass{99}{9}&
366: \examplewithclass{100}{7}&
367: \examplewithclass{101}{0}&
368: \examplewithclass{102}{8}&
369: \examplewithclass{103}{1}\\
370: \examplewithclass{104}{1}&
371: \examplewithclass{105}{0}&
372: \examplewithclass{106}{8}&
373: \examplewithclass{107}{8}&
374: \examplewithclass{108}{7}&
375: \examplewithclass{109}{0}&
376: \examplewithclass{110}{1}&
377: \examplewithclass{111}{8}\\
378: \examplewithclass{112}{4}&
379: \examplewithclass{113}{7}&
380: \examplewithclass{114}{7}&
381: \examplewithclass{115}{9}&
382: \examplewithclass{116}{9}&
383: \examplewithclass{117}{2}&
384: \examplewithclass{118}{6}&
385: \examplewithclass{119}{9}\\
386: \examplewithclass{120}{6}&
387: \fbox{\examplewithclass{121}{5}}&
388: \examplewithclass{122}{5}&
389: \examplewithclass{123}{4}&
390: \examplewithclass{124}{2}&
391: \fbox{\examplewithclass{125}{0}}&
392: \examplewithclass{126}{4}&
393: \\
394: \end{tabular}
395: }
396:
397: \caption[Difficult MNIST test samples]{Difficult examples from
398: the MNIST test set along with their target labels. At least one of the four
399: state-of-the-art systems (cp.~Table~\ref{tab:mnist-er}) misclassifies these
400: images. The framed examples are misclassified by all four systems.
401: \label{fig:mnist-difficult}}
402: \end{figure}
403:
404: Figure~\ref{fig:mnist-difficult} shows the `difficult' examples from the MNIST
405: test set. At least one of the four state-of-the-art systems misclassifies each
406: sample. (These systems are marked with `$^*$' in Table~\ref{tab:mnist-er}.)
407: Those samples that are misclassified by all four systems are marked by a
408: surrounding frame. This presentation is possible because both in \cite{sch02}
409: and in \cite{shapecontext_pami} the authors present the set of samples
410: misclassified by their systems. Furthermore, Patrice Simard kindly provided
411: the classification results of his system as described in \cite{simardICDAR03}
412: for all test data. The availability of these results also makes it possible
413: to determine the error rate of a hypothetical system that combines these four
414: best systems as described in the following Section~\ref{sec:mnist-sota}.
415:
416: Some of the images in Figure~\ref{fig:mnist-difficult} are a good illustration
417: of the inherent class overlap that exists for this problem: some instances of
\eg `3'~vs.~`5', `4'~vs.~`9', and `8'~vs.~`9' are not distinguishable by taking
419: into account the observed image only. This suggests that we are dealing with
420: a problem with non-zero Bayes error rate. Further improvements in the error
421: rate on this data set might therefore be problematic. For example, consider a
422: classifier that classifies the second framed image as a `9': despite the fact
423: that this classifier would not make an error with this decision according to
424: the class labels, we might prefer a classifier that classifies the image as a
425: `4'. Note that recently~\cite{suen-prl05} has presented a more detailed
426: discussion of different types of errors made by state-of-the-art classifiers
427: for handwritten characters.
428:
429:
430: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
431: \section{The classifiers and their combination}
432: \label{sec:mnist-sota}
433:
434: We briefly describe the four systems for handwritten digit recognition that we
435: compare and combine. Then, we discuss the statistical significance of their
436: results and present a simple classifier combination of these four methods that
437: achieves a (hypothetical) error rate of 0.35\%.
438:
439: {\bf Shape context matching.} \cite{shapecontext_pami} presents the shape
440: context matching approach. The method proceeds by first extracting contour
441: points of the images. In the case of handwritten character images the
442: resulting contour points trace both sides of the pen strokes the character is
443: composed of. Then, at each contour point a local descriptor of the shape as
444: represented by the contour points is extracted. This local descriptor is
445: called a shape context and is a histogram of the contour points in the
446: surrounding of the central point. This histogram has a finer resolution at
points close to the central point and a coarser one for regions farther away,
448: which is achieved using a log-polar
449: representation.\\
450: The classification is then done by using a nearest neighbor classifier
451: (although the authors chose to use only one third of the training data for the
452: MNIST task). The distance within the classifier is determined using an
453: iterative matching based on the shape context descriptors and two-dimensional
454: deformation. The shape contexts of training and test image are assigned to
455: each other by using the Hungarian algorithm on a bipartite graph
456: representation with edge weights according to the similarity of the shape
457: context descriptors. This assignment is then used to estimate a
458: two-dimensional spline transformation best matching the two images. The
459: images are transformed accordingly and the whole process (including extraction
460: of shape contexts) is iterated until a stopping criterion is reached. The
461: resulting distance is used in the
462: classifier. \\
463: Recently, \cite{athi05} discuss a cascading technique to speed up the slow
464: nearest neighbor matching by ``two to three orders of magnitude''. While the
465: result that this discussion is based on only used the first 20,000 training
466: samples for reasons of efficiency and resulted in an error rate of 0.63\%
467: \cite{shape_context}, \cite{athi05} report an error rate of 0.54\% for the
468: full training set and 0.58\% for the cascaded classifier that uses only about
469: 300 distance calculations per test.
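
The descriptor and the matching cost described above are commonly written as
follows. This is a sketch in notation introduced here for illustration, and
details such as the number of bins $K$ are left open. For a contour point
$p_i$, the shape context is the histogram
\[
h_i(k) = \#\{\, q \neq p_i : (q - p_i) \in \mathrm{bin}(k) \,\},
\qquad k=1,\dots,K,
\]
where the bins are uniform in $\log r$ and in the angle $\theta$ of the polar
coordinates of $q-p_i$. A usual choice for the dissimilarity of two such
histograms, which then serves as the edge weight in the bipartite matching, is
the $\chi^2$ statistic
\[
C_{ij} = \frac{1}{2}\sum_{k=1}^{K}
\frac{\left[h_i(k)-h_j(k)\right]^2}{h_i(k)+h_j(k)},
\]
where terms with $h_i(k)+h_j(k)=0$ are omitted. The Hungarian algorithm then
determines the one-to-one assignment $\pi$ of contour points that minimizes
$\sum_i C_{i\pi(i)}$.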
470:
471: {\bf Invariant support vector machine.}
472: \cite{sch02} presents a support vector machine (SVM) that is especially suited
473: for handwritten digit recognition by incorporating prior knowledge about the
474: task. This is achieved by using virtual data or a special kernel function
475: within the SVM. The special kernel function applies several transformations to
476: the compared images that leave the class identity unchanged and return the
477: kernel function of the appropriate pair of transformed images. This method is
referred to as kernel jittering. The second approach uses so-called virtual
support vectors. It consists of first training a support vector machine.
The resulting set of support vectors contains
481: sufficient information about the recognition problem and can therefore be
482: considered a condensed representation of the training data for discrimination
483: purposes. The method proceeds to create transformed versions of the support
484: vectors, which are the virtual support vectors. In the experiments leading to
485: the error rate of 0.56\% the transformations used were image shifts within the
486: eight-neighborhood plus horizontal and vertical shifts of two pixels, thus
487: resulting in $9+4=13$ virtual support vectors for each original support
488: vector. (This experiment also used the deslanted version of the MNIST data
489: \cite{lecun98}.) On this new set of virtual support vectors, another support
490: vector machine was trained and evaluated on the test set.
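
The count of 13 can be checked by writing down the shift set explicitly. The
symbol $\mathcal{T}$ is introduced here only for illustration, and we read the
nine shifts of the first group as including the unshifted image:
\[
\mathcal{T} = \{(\delta_x,\delta_y) : \delta_x,\delta_y \in \{-1,0,1\}\}
\cup \{(\pm 2,0),(0,\pm 2)\},
\qquad |\mathcal{T}| = 3\cdot 3 + 4 = 13 .
\]
A first training stage that yields $n_{\mathrm{SV}}$ support vectors thus
provides $13\,n_{\mathrm{SV}}$ training vectors for the second support vector
machine.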
491:
492: {\bf Pixel-to-pixel image matching with local contexts.}
493: %
494: \cite{icpr04_nlmatch} presents deformable models for handwritten character
495: recognition. It is shown that a simple zero-order matching approach called
496: image distortion model (IDM) can lead to very competitive results if the local
497: context of each pixel is considered in the distortion. The local context is
498: represented by a $3\times3$ surrounding window of the horizontal and vertical
image gradient, resulting in an 18-dimensional descriptor. For each pixel of
the test image, the IDM chooses the best fitting counterpart in the
reference image within a suitable corresponding range. The distance as
determined by the best match between two images is then used within a
3-nearest-neighbor classifier. More elaborate models for image matching are
also discussed, but they yield only small improvements at a
much higher computational cost. The IDM can be seen as the best compromise
506: between high classification speed and high recognition accuracy while being
507: conceptually very simple and easy to implement.
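
Under the assumption that the test image $A$ and the reference image $B$ have
the same size $I\times J$, the IDM distance described above can be sketched as
\[
d_{\mathrm{IDM}}(A,B) = \sum_{i=1}^{I}\sum_{j=1}^{J}
\min_{\substack{|i'-i|\le w\\ |j'-j|\le w}}
\sum_{u=-1}^{1}\sum_{v=-1}^{1}\sum_{c\in\{h,v\}}
\left(A^{c}_{i+u,j+v}-B^{c}_{i'+u,j'+v}\right)^2 ,
\]
where $A^{h}$ and $A^{v}$ denote the horizontal and vertical gradient images,
the inner sums run over the $2\cdot 3\cdot 3=18$ components of the local
context, and $w$ is the warp range. The squared Euclidean distance between
contexts and the symbol $w$ are assumptions made for this illustration, and
the handling of the image boundary is left open. Because each pixel is matched
independently of its neighbors (zero-order model), the minimization decomposes
over the pixels and requires on the order of $IJ(2w+1)^2$ context comparisons
per image pair.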
508:
509: {\bf Convolutional neural net and virtual data.}
510: \label{sec:idm}
511: %
512: \cite{simardICDAR03} presents a large convolutional neural network of about
513: 3,000 nodes in five layers that is especially designed for handwritten
514: character classification. The new concept in the approach is to present a new
515: set of virtual training images to the learning algorithm of the neural net in
516: each iteration of the training. The virtual training set is constructed from
517: the given training data by applying a separate two-dimensional random
518: displacement field that is smoothed with a Gaussian filter to each of the
519: images. This makes it possible to generate a very large amount of virtual data
520: in the order of 1,000 virtual samples for each original element of the
521: training data set. The data is generated on the fly in each training iteration
522: and therefore does not have to be saved, which avoids the problems with data
523: handling. Apart from the generation of virtual examples there is another point
524: where prior knowledge about the task comes into play, namely the use of a {\em
525: convolutional} neural net. This architecture, which is described in greater
526: detail in \cite{lecun98}, contains prior knowledge in that it uses tying of
527: weights within the neural net to extract low-level features from the input
528: that are invariant with respect to the position within the image, and only in
529: later layers of the neural net the position information is used.
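
The generation of a virtual sample can be sketched as follows, in notation
that is introduced here for illustration. Let $U_x$ and $U_y$ be two
independent random fields of the size of the image (for example with entries
drawn uniformly from $[-1,1]$, which is an assumption of this sketch), let
$G_\sigma$ be a Gaussian kernel with standard deviation $\sigma$, and let
$\alpha$ be a scaling factor. Then
\[
\Delta_x = \alpha\,(G_\sigma * U_x), \qquad
\Delta_y = \alpha\,(G_\sigma * U_y), \qquad
\tilde{I}(x,y) = I\bigl(x+\Delta_x(x,y),\; y+\Delta_y(x,y)\bigr),
\]
where values of $I$ at non-integer positions are obtained by interpolation.
A small $\sigma$ yields nearly independent pixel displacements, whereas a
large $\sigma$ yields an almost global translation; intermediate values
produce smooth, elastic deformations.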
530:
531:
532:
533: {\bf Discussion and combination.}
534: %
535: We can observe that all four methods take special measures to deal
536: with the image variability present in the images, using virtual data
537: and image matching methods. At the same time the concrete
538: classification algorithm seems to play a somewhat smaller role in the
539: performance as nearest neighbor classifiers, support vector machines,
540: and neural networks all perform very well. Only a slight advantage of
541: the neural net can be seen in the possibility to use very large
542: amounts of virtual data in training because the training proceeds in
543: several iterations, which need not use the same data but can use
544: distorted samples of the images instead.
545:
546:
Figure~\ref{fig:mnist-difficult} shows all the errors made by at least one of
the four classifiers. It is remarkable that only eight samples are classified
549: incorrectly by all four systems. This observation naturally suggests the use
550: of classifier combination to further reduce the error rate. The availability
551: of the results of the other classifiers makes it possible to determine this
552: error rate of a simple hypothetical combined system.
553:
However, we are somewhat restricted in the choice of combination scheme,
because for two of the classifiers we only know if the result was correct or not.
556: We thus decided to use a simple majority vote combination based on the four
557: classifiers, where the neural net classifier is used for tie-breaking (because
558: it has the best single error rate). Note that the result is only an upper
559: bound of the error rate that a real combined system would have, because we do
560: not use the class labels the patterns were assigned to (but only the
561: information if the decision was correct or not). This means that in case of a
562: disagreement between the falsely assigned classes we could have a correct
563: assignment when using the class labels. Furthermore, it seems likely that the
564: use of the confidence values of the component classifiers in the combination
565: scheme could also improve the joint decision.
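
One way to formalize this hypothetical combination using only the correctness
information is the following; the notation is introduced here for
illustration. Let $c_k(n)\in\{0,1\}$ indicate whether classifier
$k\in\{\mathrm{SC},\mathrm{SVM},\mathrm{IDM},\mathrm{CNN}\}$ classifies test
sample $n$ correctly, and let $s(n)=\sum_k c_k(n)$. The combined decision is
then counted as correct according to
\[
c_{\mathrm{CC}}(n) =
\begin{cases}
1 & \text{if } s(n)\ge 3,\\
c_{\mathrm{CNN}}(n) & \text{if } s(n)=2,\\
0 & \text{if } s(n)\le 1,
\end{cases}
\]
and the error rate is $1-\frac{1}{N}\sum_{n=1}^{N} c_{\mathrm{CC}}(n)$ with
$N=10{,}000$. The cases $s(n)\le 2$ are pessimistic, since they assume that
all wrong votes agree on the same wrong class.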
566:
Using the described hypothetical combination, the resulting error rate is
0.35\%\label{mnistres}. In the following section we will show that, with a
probability of 94\%, this improvement over the best single classifier is real
and not based on chance alone.
571:
572:
573:
574: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
575: \section{Statistical analysis of results}
576:
577:
578: \begin{table}[tb]
579: \caption[Significance of improvements for best MNIST classifiers]%
580: {Probabilities of improvement for all pairs of the four used
581: classifiers and their combination according to a bootstrap analysis.
582: Probabilities in
583: {boldface} show {\bfseries significant} improvements with respect to the
584: 5\% level. This table can be read as follows: the classifier in
585: each row improves over the classifiers given in the columns with the
586: stated probability (\eg the probability of improvement for SVM over SC is
587: 0.60). The second table shows the difference in error rates for
588: comparison. SC: shape context matching; SVM: invariant support
589: vector machine; IDM: image distortion model; CNN: convolutional neural net with
distortions; CC: combination of the four classifiers.}
591: \label{tab:poi-mnist}
592: \small
593: \centering
594: \begin{tabular}{|l|c|c|c|c|c|}
595: \multicolumn{6}{c}{probability of improvement}\\
596: \hline
597: & SC & SVM & IDM & CNN & CC \\
598: \hline
599: SC & --- & & & &\\
600: \hline
601: SVM & 0.60 & --- & & & \\
602: \hline
603: IDM & 0.85 & 0.58 & --- & & \\
604: \hline
605: CNN & {\bf 0.99} & {\bf 0.96} & 0.92 & --- &\\
606: \hline
607: CC & {\bf 1.00} & {\bf 1.00} & {\bf 1.00} & 0.94 & --- \\
608: \hline
609: \end{tabular} \hspace{1cm}
610: \begin{tabular}{|l|c|c|c|c|c|}
611: \multicolumn{6}{c}{difference in error rate}\\
612: \hline
613: & SC & SVM & IDM & CNN & CC\\
614: \hline
615: SC & --- & & & & \\
616: \hline
617: SVM & 0.07 & --- & & & \\
618: \hline
619: IDM & 0.09 & 0.02 & --- & & \\
620: \hline
621: CNN & 0.21 & 0.14 & 0.12 & --- & \\
622: \hline
623: CC & 0.28 & 0.21 & 0.19 & 0.07 & --- \\
624: \hline
625: \end{tabular}
626: \end{table}
627:
628:
629: As mentioned above, we can perform a more detailed analysis of the results of
630: the four methods described in the previous section because we do not only
631: know the error rate of the classifiers but also the exact patterns for which
632: an error has occurred. Therefore, we do not have to assume that the
633: classifiers have been evaluated on independent data and are thus able to
634: derive tighter estimates of the level of confidence of an improvement.
635:
The analysis shown here estimates the probability that one classifier
generally performs better than a second classifier (the probability of
improvement) from the decisions of the two classifiers on the same test
samples. We estimate this probability by drawing a large number of
bootstrap samples from the test data set and observing the relative
performance of the two classifiers on these resampled test
sets~\cite{bisani_poi}. This estimate is more informative than a
comparison based on the individual error rates alone. For example, we
are intuitively more inclined to believe that the first classifier is
better if it leads to better classifications on 2\% of the test data and
to the same results on the remaining 98\% than if it performs better on
30\% of the test data but worse on 28\% of the data. (For an interesting
discussion of significance in the context of comparisons of machine
learning algorithms, see~\cite{salzberg97comparing}.)
Table~\ref{tab:poi-mnist} shows the probabilities of improvement obtained
with this technique for the four methods described above and for their
combination, along with the differences in error rate. LeCun et
al.~\cite{lecun98} state that improvements of more than 0.1\% in the
error rate may be considered significant. The analysis performed here
allows a more detailed assessment of the significance of improvements.
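
The bootstrap estimate described above can be sketched in a few lines of
Python. This is a minimal illustration, not the implementation used to
produce Table~\ref{tab:poi-mnist}: the function name
\texttt{probability\_of\_improvement}, the number of bootstrap replicates,
the fixed seed, and the convention of counting only strict improvements
(ties count as no improvement) are assumptions made here. The two error
lists are the neural net and combination entries from the appendix.

```python
import random

N_TEST = 10000  # size of the MNIST test set

# Misclassified test pattern numbers, copied from the appendix.
CNN_ERRORS = {
    583, 948, 1233, 1300, 1394, 1879, 1902, 2036, 2131, 2136, 2183, 2463,
    2583, 2598, 2655, 2928, 2971, 3289, 3423, 3763, 4202, 4741, 4839, 4861,
    5655, 5938, 5956, 5974, 6572, 6577, 6598, 6626, 8409, 8528, 9680, 9693,
    9699, 9730, 9793, 9840, 9851, 9923,
}
CC_ERRORS = {
    448, 583, 948, 1113, 1233, 1300, 1682, 1879, 1902, 2036, 2131, 2136,
    2183, 2463, 2583, 2598, 2655, 2928, 2940, 3423, 3559, 3763, 4202, 4762,
    5655, 5938, 6572, 6577, 6598, 8409, 8528, 9680, 9730, 9793, 9851,
}


def probability_of_improvement(errors_a, errors_b, n_test=N_TEST,
                               n_boot=1000, seed=0):
    """Bootstrap estimate of P(classifier B makes fewer errors than A).

    errors_a, errors_b: sets of misclassified test pattern numbers.
    Assumption (hypothetical convention): only strict improvements count,
    so a resampled test set with equal error counts is not an improvement.
    """
    rng = random.Random(seed)
    only_a = len(errors_a - errors_b)  # A wrong, B right
    only_b = len(errors_b - errors_a)  # B wrong, A right
    # Per-pattern error difference (A minus B); patterns on which both
    # classifiers agree contribute 0 and only dilute the resample.
    diffs = [1] * only_a + [-1] * only_b + [0] * (n_test - only_a - only_b)
    wins = 0
    for _ in range(n_boot):
        # Draw a bootstrap test set of the original size, with replacement.
        resample = rng.choices(diffs, k=n_test)
        if sum(resample) > 0:  # B has strictly fewer errors on this resample
            wins += 1
    return wins / n_boot


if __name__ == "__main__":
    poi = probability_of_improvement(CNN_ERRORS, CC_ERRORS)
    print(f"estimated probability of improvement (CC over CNN): {poi:.2f}")
```

With these inputs the estimate should come out in the vicinity of the value
0.94 given in the table; the exact number depends on the number of
replicates, the seed, and the treatment of ties.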
657:
We observe that the three classifiers based on shape context, virtual support
vectors, and the image distortion model do not differ from one another in a
statistically significant way (at the 5\% level). The neural net based
classifier, on the other hand, shows significant improvements over the
classifiers based on shape context and virtual support vectors, but not over
the classifier based on the image distortion model. Finally, the improvements
of the combined classifier over the single classifiers are highly significant,
except for the improvement with respect to the neural net, which has a
significance level of 6\%. This value is not below the commonly used 5\%
threshold, but it is sufficiently close to it to convince us that the
improvement is not based on chance alone.
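
As a rough cross-check of the last figure, one can count the discordant
patterns in the appendix: 13 patterns are misclassified by the neural net but
not by the combination, and 6 by the combination but not by the neural net.
Under a simple normal approximation to the sign test (an assumption made here
for illustration; it is not the bootstrap procedure used for
Table~\ref{tab:poi-mnist}, and the symbols $n_{+}$ and $n_{-}$ for the two
counts are introduced only for this example), we obtain
\[
  z \;=\; \frac{n_{+} - n_{-}}{\sqrt{n_{+} + n_{-}}}
    \;=\; \frac{13 - 6}{\sqrt{19}} \;\approx\; 1.6 ,
\]
which corresponds to a one-sided tail probability between 5\% and 6\%, in
line with the bootstrap estimate.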
669:
670: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
671: \section{Conclusion}
672:
673:
We presented a statistical analysis of the results of four state-of-the-art
systems for handwritten character recognition on the MNIST benchmark. By
using the fact that the systems were tested on the same data, we were able to
derive more specific results than would have been possible from the error
rates (and the number of test samples) alone. During the analysis, we observed
that the results of the four systems varied more than we had initially
expected. Specifically, only eight errors were common to all classifiers.
This observation motivated a combination of the classifiers, which resulted in
an error rate of 0.35\%, the lowest error rate reported on this data set so
far. The statistical analysis resulted in a probability of improvement of 94\%
for the combination with respect to the best single classifier.
685:
686:
In view of the low error rates achieved by current methods on the MNIST data,
we may have reached a point at which further improvements are largely due to
random effects and overadaptation to the (test) data. Some of the errors
observed also show that the Bayes error rate of the problem is larger than
zero. This underlines the need for all future publications using these data
to present statistical analyses of improvement claims and to describe the
measures taken to avoid training on the test data. These results may also be
viewed as a hint that it is necessary to promote benchmark data sets with an
impact similar to that of the MNIST data for new and more complex problems.
697:
698:
699:
700: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
701: \section*{Acknowledgements}
702: This work was partially funded by the BMBF (German Federal Ministry of
703: Education and Research), project IPeT (01~IW~D03).
704:
705:
706: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
707: \section*{Appendix}
708: For completeness, we list the numbers of the MNIST test patterns that
709: are misclassified by the four systems and their combination in this appendix.
710:
711: \begin{description}
712: \small
713: \addtolength{\itemsep}{-1ex}
714: \item[Shape Context \cite{shapecontext_pami}]
715: 210, 448, 583, 692, 717, 948, 1034, 1113, 1227, 1248, 1300, 1320, 1531, 1682, 1710, 1791, 1879, 1902, 2041, 2074, 2099, 2131, 2183, 2238, 2448, 2463, 2583, 2598, 2655, 2772, 2940, 3063, 3074, 3251, 3423, 3476, 3559, 3822, 3851, 4094, 4164, 4202, 4370, 4498, 4506, 4663, 4732, 4762, 5736, 5938, 6555, 6572, 6577, 6598, 6884, 8066, 8280, 8317, 8528, 9506, 9643, 9730, 9851
716: \item[SVM \cite{sch02}]
717: 448, 583, 660, 675, 727, 948, 1015, 1113, 1227, 1233, 1248, 1300, 1320, 1531, 1550, 1682, 1710, 1791, 1902, 2036, 2071, 2099, 2131, 2136, 2183, 2294, 2489, 2655, 2928, 2940, 2954, 3031, 3074, 3226, 3423, 3521, 3535, 3559, 3605, 3763, 3870, 3986, 4079, 4762, 4824, 5938, 6577, 6598, 6784, 8326, 8409, 9665, 9730, 9750, 9793, 9851
718: \item[IDM \cite{icpr04_nlmatch}]
719: 446, 448, 552, 717, 727, 948, 1015, 1113, 1243, 1682, 1879, 1902, 2110, 2131, 2183, 2344, 2463, 2524, 2598, 2649, 2940, 3226, 3423, 3442, 3559, 3602, 3768, 3809, 3986, 4054, 4164, 4177, 4202, 4285, 4290, 4762, 5655, 5736, 5938, 6167, 6884, 7217, 8317, 8377, 8409, 8528, 9010, 9506, 9531, 9643, 9680, 9730, 9793, 9851
720: \item[Neural Net \cite{simardICDAR03}]
721: 583, 948, 1233, 1300, 1394, 1879, 1902, 2036, 2131, 2136, 2183, 2463, 2583, 2598, 2655, 2928, 2971, 3289, 3423, 3763, 4202, 4741, 4839, 4861, 5655, 5938, 5956, 5974, 6572, 6577, 6598, 6626, 8409, 8528, 9680, 9693, 9699, 9730, 9793, 9840, 9851, 9923
722: \item[Combination]
723: 448, 583, 948, 1113, 1233, 1300, 1682, 1879, 1902, 2036, 2131, 2136, 2183, 2463, 2583, 2598, 2655, 2928, 2940, 3423, 3559, 3763, 4202, 4762, 5655, 5938, 6572, 6577, 6598, 8409, 8528, 9680, 9730, 9793, 9851
724: \end{description}
725:
726:
727: \begin{thebibliography}{10}\setlength{\itemsep}{-0.7ex}\small
728:
729: \bibitem{athi05}
V.~Athitsos, J.~Alon, and S.~Sclaroff.
731: \newblock Efficient Nearest Neighbor Classification Using a Cascade of
732: Approximate Similarity Measures.
733: \newblock In {\em CVPR 2005, Int. Conf. on Computer Vision and Pattern
734: Recognition}, volume~I, pages 486--493, San Diego, CA, June 2005.
735:
736: \bibitem{shape_context}
737: S.~Belongie, J.~Malik, and J.~Puzicha.
738: \newblock Shape Context: A New Descriptor for Shape Matching and Object
739: Recognition.
740: \newblock In T.~K. Leen, T.~G. Dietterich, and V.~Tresp, editors, {\em Advances
741: in Neural Information Processing Systems~13}, pages 831--837. MIT Press,
742: April 2001.
743:
744: \bibitem{shapecontext_pami}
745: S.~Belongie, J.~Malik, and J.~Puzicha.
746: \newblock Shape Matching and Object Recognition Using Shape Contexts.
747: \newblock {\em IEEE Trans. Pattern Analysis and Machine Intelligence},
748: 24(4):509--522, April 2002.
749:
750: \bibitem{batthacharyya-iwfhr04}
751: U.~Bhattacharya, S.~Vajda, A.~Mallick, B.~B. Chaudhuri, and A.~Belaid.
752: \newblock On the Choice of Training Set, Architecture and Combination Rule of
753: Multiple MLP Classifiers for Multiresolution Recognition of Handwritten
754: Characters.
755: \newblock In {\em International Workshop on Frontiers in Handwriting
756: Recognition (IWFHR'04)}, pages 419--424, Tokyo, Japan, October 2004.
757:
758: \bibitem{bisani_poi}
759: M.~Bisani and H.~Ney.
760: \newblock Bootstrap Estimates for Confidence Intervals in ASR Performance
761: Evaluation.
762: \newblock In {\em Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal
763: Processing}, volume~1, pages 409--412, Montreal, Canada, May 2004.
764:
765: \bibitem{bot94+}
766: L.~Bottou, C.~Cortes, J.~S. Denker, H.~Drucker, I.~Guyon, L.~Jackel,
767: Y.~{Le~Cun}, U.~M{\"u}ller, E.~S{\"a}ckinger, P.~Simard, and V.~N. Vapnik.
768: \newblock Comparison of Classifier Methods: {A} Case Study in Handwritten Digit
769: Recognition.
770: \newblock In {\em Proc. of the Int. Conf. on Pattern Recognition}, pages
771: 77--82, Jerusalem, Israel, October 1994.
772:
773: \bibitem{das06_kumar}
774: K.~Chellapilla, M.~Shilman, and P.~Simard.
775: \newblock Combining Multiple Classifiers for Faster Optical Character
776: Recognition.
777: \newblock In {\em DAS 2006, Int. Workshop Document Analysis Systems}, volume
778: 3872 of {\em LNCS}, pages 358--367, Nelson, New Zealand, February 2006.
779:
780: \bibitem{mcs2001}
781: J.~Dahmen, D.~Keysers, and H.~Ney.
\newblock Combined Classification of Handwritten Digits using the `Virtual Test
  Sample Method'.
784: \newblock In {\em MCS 2001, 2nd Int. Workshop on Multiple Classifier Systems},
785: volume 2096 of {\em Lecture Notes in Computer Science}, pages 109--118,
786: Cambridge, UK, May 2001. Springer.
787:
788: \bibitem{sch02}
789: D.~DeCoste and B.~Sch{\"o}lkopf.
790: \newblock Training Invariant Support Vector Machines.
791: \newblock {\em Machine Learning}, 46(1-3):161--190, 2002.
792:
793: \bibitem{dong02}
794: J.~X. Dong, A.~Krzyzak, and C.~Y. Suen.
795: \newblock Local learning framework for handwritten character recognition.
796: \newblock {\em Engineering Applications of Artificial Intelligence},
797: 15(2):151--159, April 2002.
798:
799: \bibitem{dong04}
800: J.-X. Dong, A.~Krzyzak, and C.~Y. Suen.
801: \newblock Fast SVM Training Algorithm with Decomposition on Very Large Data
802: Sets.
803: \newblock {\em IEEE Trans. Pattern Analysis and Machine Intelligence},
804: 27(4):603--618, April 2005.
805: \newblock Additional results at
806: http://www.cenparmi.concordia.ca/$\sim$people/jdong/ HeroSvm.html.
807:
808: \bibitem{bunke-iwfhr04}
809: S.~G{\"u}nter and H.~Bunke.
810: \newblock Combination of Three Classifiers with Different Architectures for
811: Handwritten Word Recognition.
812: \newblock In {\em International Workshop on Frontiers in Handwriting
813: Recognition (IWFHR'04)}, pages 63--68, Tokyo, Japan, October 2004.
814:
815: \bibitem{diss}
816: D.~Keysers.
817: \newblock {\em Modeling of Image Variability for Recognition}.
818: \newblock {PhD} thesis, RWTH Aachen University, Aachen, Germany, March 2006.
819:
820: \bibitem{icpr00_td}
821: D.~Keysers, J.~Dahmen, T.~Theiner, and H.~Ney.
822: \newblock Experiments with an Extended Tangent Distance.
823: \newblock In {\em Proc. 15th Int. Conf. on Pattern Recognition}, volume~2,
824: pages 38--42, Barcelona, Spain, September 2000.
825:
826: \bibitem{icpr04_nlmatch}
827: D.~Keysers, C.~Gollan, and H.~Ney.
828: \newblock Local Context in Non-linear Deformation Models for Handwritten
829: Character Recognition.
830: \newblock In {\em ICPR 2004, 17th Int. Conf. on Pattern Recognition},
831: volume~IV, pages 511--514, Cambridge, UK, August 2004.
832:
833: \bibitem{Kittler98}
834: J.~Kittler.
835: \newblock On Combining Classifiers.
836: \newblock {\em IEEE Trans. Pattern Analysis and Machine Intelligence},
837: 20(3):226--239, March 1998.
838:
839: \bibitem{lecun98}
840: Y.~LeCun, L.~Bottou, Y.~Bengio, and P.~Haffner.
841: \newblock Gradient-Based Learning Applied to Document Recognition.
842: \newblock {\em Proc. of the IEEE}, 86(11):2278--2324, November 1998.
843:
844: \bibitem{liu_benchmark}
845: C.-L. Liu, K.~Nakashima, H.~Sako, and H.~Fujisawa.
846: \newblock Handwritten Digit Recognition: Benchmarking of State-of-the-Art
847: Techniques.
848: \newblock {\em Pattern Recognition}, 36(10):2271--2285, October 2003.
849:
850: \bibitem{maree04}
R.~Mar{\'e}e, P.~Geurts, J.~Piater, and L.~Wehenkel.
\newblock A Generic Approach for Image Classification Based on Decision Tree
  Ensembles and Local Sub-Windows.
854: \newblock In K.-S. Hong and Z.~Zhang, editors, {\em Proc. of the 6th Asian
855: Conf. on Computer Vision}, volume~2, pages 860--865, Jeju Island, Korea,
856: January 2004.
857:
858: \bibitem{icpr04_uchida}
859: N.~Matsumoto, S.~Uchida, and H.~Sakoe.
860: \newblock Prototype Setting for Elastic Matching-based Image Pattern
861: Recognition.
862: \newblock In {\em ICPR 2004, 17th Int. Conf. on Pattern Recognition}, volume~I,
863: pages 224--227, Cambridge, UK, August 2004.
864:
865: \bibitem{mayraz}
866: G.~Mayraz and G.~Hinton.
867: \newblock Recognizing Handwritten Digits Using Hierarchical Products of
868: Experts.
869: \newblock {\em IEEE Trans. Pattern Analysis and Machine Intelligence},
870: 24(2):189--197, February 2002.
871:
872: \bibitem{milgram_mnist_05}
873: J.~Milgram, R.~Sabourin, and M.~Cheriet.
874: \newblock Combining Model-based and Discriminative Approaches in a Modular
875: Two-stage Classification System: Application to Isolated Handwritten Digit
876: Recognition.
877: \newblock {\em Electronic Letters on Computer Vision and Image Analysis},
878: 5(2):1--15, 2005.
879:
880: \bibitem{salzberg97comparing}
881: S.~L. Salzberg.
882: \newblock On Comparing Classifiers: Pitfalls to Avoid and a Recommended
883: Approach.
884: \newblock {\em Data Mining and Knowledge Discovery}, 1(3), 1997.
885:
886: \bibitem{sch97}
887: B.~Sch{\"o}lkopf.
888: \newblock {\em Support Vector Learning}.
889: \newblock Oldenbourg Verlag, Munich, 1997.
890:
891: \bibitem{sch98new+}
892: B.~Sch{\"o}lkopf, P.~Simard, A.~Smola, and V.~Vapnik.
893: \newblock Prior Knowledge in Support Vector Kernels.
894: \newblock In M.~I. Jordan, M.~J. Kearns, and S.~A. Solla, editors, {\em
895: Advances in Neural Information Processing Systems~10}, pages 640--646. {MIT}
896: Press, June 1998.
897:
898: \bibitem{simardICDAR03}
P.~Y. Simard, D.~Steinkraus, and J.~C. Platt.
900: \newblock Best Practices for Convolutional Neural Networks Applied to Visual
901: Document Analysis.
902: \newblock In {\em 7th Int. Conf. Document Analysis and Recognition}, pages
903: 958--962, Edinburgh, Scotland, August 2003.
904:
905: \bibitem{sim93+}
906: P.~Simard, Y.~{Le Cun}, and J.~Denker.
907: \newblock Efficient Pattern Recognition Using a New Transformation Distance.
908: \newblock In S.~Hanson, J.~Cowan, and C.~Giles, editors, {\em Advances in
909: Neural Information Processing Systems~5}, pages 50--58, San Mateo, CA, 1993.
910: Morgan Kaufmann.
911:
912: \bibitem{Smith94}
913: S.~J. Smith, M.~O. Bourgoin, K.~Sims, and H.~L. Voorhees.
914: \newblock Handwritten Character Classification Using Nearest Neighbor in Large
915: Databases.
916: \newblock {\em IEEE Trans. Pattern Analysis and Machine Intelligence},
917: 16(9):915--919, September 1994.
918:
919: \bibitem{suen-prl05}
920: C.~Y. Suen and J.~Tan.
921: \newblock Analysis of Errors of Handwritten Digits Made by a Multitude of
922: Classifiers.
923: \newblock {\em Pattern Recognition Letters}, 26(3):369--379, 2005.
924:
925: \bibitem{teow00}
926: L.-N. Teow and K.-F. Loe.
927: \newblock Handwritten Digit Recognition with a Novel Vision Model that Extracts
928: Linearly Separable Features.
929: \newblock In {\em Proc. CVPR 2000, Conf. on Computer Vision and Pattern
930: Recognition}, volume~2, pages 76--81, Hilton Head, SC, June 2000.
931:
932: \bibitem{teow02}
933: L.-N. Teow and K.-F. Loe.
934: \newblock Robust Vision-Based Features and Classification Schemes for Off-Line
935: Handwritten Digit Recognition.
936: \newblock {\em Pattern Recognition}, 35(11):2355--2364, November 2002.
937:
938: \end{thebibliography}
939:
940:
941: \end{document}
942: