
1: \documentclass[english]{article}

2: \pdfoutput=1

3: \usepackage{amssymb}

4: \usepackage{amsmath}

5: \usepackage{wrapfig}

6: \usepackage{graphicx}

7: \usepackage{subfigure}

8: \usepackage{verbatim}

9: \usepackage[multiple]{footmisc}

10: \usepackage[bf]{caption2}

11:

12: \renewcommand{\captionfont}{\small}

13:

14: \pagestyle{empty}

15:

16: \def\argmax{\operatornamewithlimits{arg\,max}}

17: \def\argmin{\operatornamewithlimits{arg\,min}}

18:

19: \setcounter{topnumber}{5}

20: \renewcommand{\topfraction}{1}

21: \setcounter{bottomnumber}{5}

22: \renewcommand{\bottomfraction}{1}

23: \setcounter{totalnumber}{10}

24: \renewcommand{\textfraction}{0}

25: \renewcommand{\floatpagefraction}{0}

26: \graphicspath{{images/}}

27: \newcommand{\eg}{e.g.\ }

28:

29:

30: \pagenumbering{arabic}

31:

32:

33: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

34: \begin{document}

35:

36:

37: \title{Comparison and Combination of State-of-the-art \mbox{\rule{-1ex}{0pt}Techniques for

38: Handwritten Character Recognition:} Topping the MNIST Benchmark}

39: \author{Daniel Keysers\\daniel.keysers@dfki.de\\

40: Image Understanding and Pattern Recognition (IUPR) Group\\

41: German Research Center for Artificial Intelligence (DFKI)}

42: \date{May 2006}

43: \maketitle

44:

45:

46: \pagestyle{plain}

47: \setcounter{page}{1}

48:

49:

50: \begin{abstract}

51:   Although the recognition of isolated handwritten digits has been a research

52:   topic for many years, it continues to be of interest for the research

53:   community and for commercial applications.  We show that despite the

54:   maturity of the field, different approaches still deliver results that vary

55:   enough to allow improvements by using their combination.  We do so by

56:   choosing four well-motivated state-of-the-art recognition systems for which

57:   results on the standard MNIST benchmark are available. When comparing the

58:   errors made, we observe that they differ between all four

59:   systems, suggesting the use of classifier combination. We then determine the

60:   error rate of a hypothetical system that combines the output of the four

61:   systems. The result obtained in this manner is an error rate of 0.35\% on

62:   the MNIST data, the best result published so far. We furthermore discuss the

63:   statistical significance of the combined result and of the results of the

64:   individual classifiers.

65: \end{abstract}

66:

67:

68: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

69: \section{Introduction}

70:

71: The recognition of handwritten digits is a topic of practical importance

72: because of applications like automated form reading and handwritten zip-code

73: processing. It is also a subject that has continued to produce much research

74: effort over the last decades for several reasons:

75: \begin{itemize}

76: \addtolength{\itemsep}{-1.2ex}

77: \item The problem is prototypical for image processing and pattern recognition, with

78: a small number of classes.

79: \item Standard benchmark data sets exist that make it easy to obtain valid results

80: quickly.

81: \item Many publications and techniques are available that can be cited and

82:   built on, respectively.

83: \item The practical applications motivate the research performed.

84: \item Improvements in classification accuracy over existing techniques

85:   continue to be obtained using new approaches.

86: \end{itemize}

87:

88: This paper aims to analyze four of the state-of-the-art methods

89: for the recognition of handwritten

90: digits~\cite{shapecontext_pami,sch02,icpr04_nlmatch,simardICDAR03} by

91: comparing the errors made on the standard MNIST benchmark data.

92: (A part of this work has been described in~\cite{diss}.)

93: We perform a

94: statistical analysis of the errors using a bootstrapping

95: technique~\cite{bisani_poi} that not only uses the error count but also takes

96: into account which errors were made. Using this technique we can determine

97: more accurate estimates of the statistical significance of improvements.

98:

99: When analyzing the errors made we observe that --- although the error rates

100: obtained are all very similar --- there are substantial differences in {\em

101:   which} patterns are classified erroneously.  This can be interpreted as an

102: indicator for using classifier combination. An experiment shows that indeed a

103: combination of the classifiers performs better than the single best

104: classifier. The statistical analysis shows that the probability that this

105: result constitutes a real improvement and is not based on chance alone is

106: 94\%.

107:

108:

109: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

110: \section{Related work}

111:

112: This paper is of course only possible because the results

113: of the four chosen base

114: methods~\cite{shapecontext_pami,sch02,icpr04_nlmatch,simardICDAR03} were

115: available\footnote{We would like to thank Patrice Simard for providing the

116:   recognition results to us and the authors of~\cite{shapecontext_pami,sch02}

117:   for listing the errors in the respective papers.}.  These approaches are

118: presented in more detail in Section~\ref{sec:mnist-sota}.  We are aware that

119: there exist other methods that also achieve very good classification error

120: rates on the data used, e.g.~\cite{liu_benchmark}.

121: However, we feel that the

122: four methods chosen comprise a set of well-motivated and self-contained

123: approaches.

124: Furthermore, they represent the different classification methods

125: most commonly used (in the research literature), that is, the nearest neighbor

126: classifier, neural networks, and the support vector machine.  All four methods

127: use the appearance-based paradigm in the broad sense and can thus be

128: considered as being sufficiently general as to be applied to other object

129: recognition tasks.

130:

131: There is a large amount of work available on the topic of classifier

132: combination as well (an introduction can be found e.g.~in~\cite{Kittler98})

133: and much work exists on applying classifier combination to handwriting

134: recognition

135: (e.g.~\cite{batthacharyya-iwfhr04,das06_kumar,mcs2001,bunke-iwfhr04}).

136: Note that we do not propose new algorithms for classification of handwritten

137: digits or for the combination of classifiers. Instead, our contribution is to

138: present a statistical analysis that compares different classifiers and to show

139: that their combination improves the performance even though the individual

140: classifiers all reach state-of-the-art error rates by themselves.

141:

142:

143:

144: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

145: \section{The MNIST task}\label{sec:mnist}

146:

147:

148: The modified NIST handwritten digit database (MNIST, \cite{lecun98}) contains

149: 60,000 images in the training set and 10,000 patterns in the test set, each of

150: size 28$\times$28 pixels with 256 graylevels.  The data set is available

151: online\footnote{\tt http://www.research.att.com/$\sim$yann/ocr/mnist/} and

152: some examples from the MNIST corpus are shown in Figure~\ref{fig:nist_ex}.

153:

154: The preprocessing of the images is described as follows in \cite{lecun98}:

155: ``The original black and white (bilevel) images were size normalized to fit in

156: a 20$\times$20 pixel box while preserving their aspect ratio. The resulting

157: images contain gray levels as result of the antialiasing (image interpolation)

158: technique used by the normalization algorithm. [...]  the images were centered

159: in a 28$\times$28 image by computing the center of mass of the pixels and

160: translating the image so as to position this point at the center of the

161: 28$\times$28 field.'' Note that some authors use a `deslanted' version of the

162: database.

163:

164: \begin{figure}[tb]

165: \begin{center}

166: \includegraphics[width=\columnwidth]{images/NIST1}

167: \includegraphics[width=\columnwidth]{images/NIST2}

168: \includegraphics[width=\columnwidth]{images/NIST3}\\

169: \caption[Example images from the MNIST data set]%

170: {Example images from the MNIST data set.\label{fig:nist_ex}}

171: \end{center}

172: \end{figure}

173:

174: The task is generally not considered to be a `difficult' (in the sense that

175: absolute error rates are high) recognition task for two reasons. First, the

176: human error rate is estimated to be only about 0.2\%, although it has not been

177: determined for the whole test set \cite{sim93+}.  Second, the large training

178: set allows machine learning algorithms to generalize well. With respect to the

179: connection between training set size and classification performance for OCR

180: tasks it is argued \cite{Smith94} that increasing the training set size by a

181: factor of ten cuts the error rate approximately to half the original figure.
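
As a rough illustration of this rule of thumb (the formula is our own reading
and is not taken from \cite{Smith94}), assume that the error rate $e(N)$ for a
training set of size $N$ is halved by every tenfold increase of $N$. This
corresponds to a power law
\[
  e(N) \;\approx\; e(N_0)\left(\frac{N}{N_0}\right)^{-\log_{10} 2}
       \;\approx\; e(N_0)\left(\frac{N}{N_0}\right)^{-0.30},
\]
where $N_0$ is a hypothetical reference training set size. Under this
assumption, tripling the training set (for example from 20,000 to 60,000
images) would multiply the error rate by about $3^{-0.30}\approx 0.72$.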

182:

183: Table~\ref{tab:mnist-er} gives a comprehensive overview of the error rates

184: reported for the MNIST data.  One disadvantage of the MNIST corpus is that

185: there exists no development test set, which leads to effects known as

186: `training on the testing data'. This is not necessarily true for each of the

187: research groups performing experiments, but it cannot always be ruled out.

188: Note that in some publications (e.g.~\cite{simardICDAR03}) the authors

189: explicitly state that all parameters of the system were chosen by using a

190: subset of the training set for validation, which then rules out the

191: overadaptation to the test set.

192: However, the tendency exists to evaluate one method with

193: different parameters or different methods several times on the same data until

194: the best performance seems to have been reached. This procedure leads to an

195: overly optimistic estimation of the error rate of the classifier and the

196: number of tuned parameters should be considered when judging such error rates.

197: Ideally, a development test set would be used to determine the best parameters

198: for the classifiers and the results would be obtained from one run on the test

199: set itself.  Nevertheless a comparison of `best performing' algorithms may

200: lead to valid conclusions, especially if these perform well on several

201: different tasks.

202: %

203: \def\pz{\phantom{0}}

204: \begin{table}[b]

205:  \caption[MNIST error rates]{Error rates for the MNIST task in \%.

206:   The systems marked with $^*$ are those we use for analysis and combination.

207:    \label{tab:mnist-er}}

208: %\small

209: \centering

210: \begin{tabular}{@{\vline\hspace{0.7ex}}r@{\hspace{0.7ex}}l@{\hspace{0.7ex}\vline\hspace{0.7ex}}l@{\hspace{0.7ex}\vline\hspace{0.7ex}}r@{\hspace{0.7ex}\vline}}

211: \hline

212: \multicolumn{2}{@{\vline\hspace{0.7ex}}l}{reference} & method & ER\pz \\

213: \hline

214: \cite{sim93+}             & AT\&T   & human performance & 0.2\pz \\

215:                           &  ---   & Euclidean nearest neighbor & 3.5\pz \\

216: \hline

217: \cite{maree04}            & U Li\`ege   & decision trees + sub-windows & 2.63 \\

218: \cite{lecun98}            & AT\&T   & deslant, Euclidean 3-NN  & 2.4\pz \\

219: \cite{icpr04_uchida}      & Kyushu U & elastic matching & 2.10 \\

220: \cite{icpr00_td}          & RWTH      & one-sided tangent distance & 1.9\pz \\

221: \cite{bot94+}             & AT\&T   & neural net LeNet1  & 1.7\pz \\

222: \cite{mayraz}             & UC London &   products of experts & 1.7\pz \\

223: \cite{milgram_mnist_05}   & U Qu\'ebec & hyperplanes + support vector m. & 1.5\pz \\

224: \cite{sch97}              & TU Berlin   & support vector machine & 1.4\pz \\

225: \cite{bot94+}             & AT\&T      & neural net LeNet4  & 1.1\pz \\

226: \cite{sim93+}             & AT\&T    & tangent distance  & 1.1\pz \\

227: \cite{icpr00_td}          & RWTH    & two-sided tangent d., virt. data & 1.0\pz \\

228: \cite{dong02}             & CENPARMI & local learning & 0.99 \\

229: \cite{sch98new+}          & MPI, AT\&T   & virtual SVM  & 0.8\pz \\

230: \cite{lecun98}            & AT\&T   & distortions, neural net LeNet5   & 0.82 \\

231: \cite{lecun98}            & AT\&T    & distortions, boosted LeNet4          & 0.7\pz \\

232: \cite{teow00}             & U Singapore  & bio-inspired features + SVM    & 0.72 \\

233: \cite{sch02}              & Caltech,MPI   & virtual SVM (jitter)         & 0.68 \\

234: \cite{shapecontext_pami}  & UC Berkeley & shape context matching   & $^*$0.63 \\

235: \cite{dong04}             & CENPARMI    & support vector machine  & 0.60 \\

236: \cite{teow02}             & U Singapore  & deslant, biology-inspired features & 0.59 \\

237: \cite{athi05}             & Boston U & cascaded shape context & 0.58 \\

238: \cite{sch02}              & Caltech,MPI   & deslant, virtual SVM (jitter,shift) & $^*$0.56 \\

239: \cite{athi05}             & Boston U & shape context matching & 0.54 \\

240: \cite{icpr04_nlmatch}     & RWTH  & deformation model (IDM) &  $^*$0.54 \\

241: \cite{liu_benchmark}      & Hitachi & preprocessing,  support vector m. & 0.42\\

242: \cite{simardICDAR03}      & Microsoft  & neural net + virtual data & $^*$0.42 \\

243:                           & this work &  hyp. comb. of 4 systems ($^*$)& 0.35 \\

244: \hline

245: \end{tabular}

246: \end{table}

247: %

248: Note that Dong gives lower error rates than in \cite{dong04} of 0.38 to 0.44

249: percent on his web page (accessed February 2005), but it remains somewhat

250: unclear how these error rates were obtained and if possibly these low error

251: rates are due to the effect of `training on the testing data'.

252: Also, \cite{teow02} try a variety of SVMs and networks which yield error rates

253: ranging from 0.59 percent to 0.81 percent.

254: The IDM \cite{icpr04_nlmatch} as described in Section~\ref{sec:idm} was

255: not optimized for the MNIST task. Instead, all parameter settings were

256: determined using the smaller USPS data set and then the complete setup was

257: evaluated once on the MNIST data.

258:

259: \begin{figure}[p]

260: \newlength{\mndlength}

261: \setlength{\mndlength}{7mm}

262: \newcommand{\examplewithclass}[2]{\scriptsize%

263: #2\includegraphics[width=\mndlength]{images/MNIST-difficult#1}}

264: \centerline{

265: \begin{tabular}{*{8}{c@{\;}}}

266: \examplewithclass{0}{9}&

267: \examplewithclass{1}{6}&

268: \examplewithclass{2}{4}&

269: \examplewithclass{3}{7}&

270: \examplewithclass{4}{8}&

271: \examplewithclass{5}{2}&

272: \examplewithclass{6}{5}&

273: \examplewithclass{7}{8}\\

274: \examplewithclass{8}{1}&

275: \examplewithclass{9}{7}&

276: \fbox{\examplewithclass{10}{8}}&

277: \examplewithclass{11}{6}&

278: \examplewithclass{12}{8}&

279: \examplewithclass{13}{4}&

280: \examplewithclass{14}{7}&

281: \examplewithclass{15}{9}\\

282: \examplewithclass{16}{4}&

283: \examplewithclass{17}{9}&

284: \examplewithclass{18}{5}&

285: \examplewithclass{19}{8}&

286: \examplewithclass{20}{5}&

287: \examplewithclass{21}{8}&

288: \examplewithclass{22}{4}&

289: \examplewithclass{23}{3}\\

290: \examplewithclass{24}{9}&

291: \examplewithclass{25}{2}&

292: \examplewithclass{26}{8}&

293: \fbox{\examplewithclass{27}{9}}&

294: \examplewithclass{28}{5}&

295: \examplewithclass{29}{5}&

296: \examplewithclass{30}{7}&

297: \examplewithclass{31}{5}\\

298: \examplewithclass{32}{2}&

299: \examplewithclass{33}{3}&

300: \fbox{\examplewithclass{34}{4}}&

301: \examplewithclass{35}{6}&

302: \fbox{\examplewithclass{36}{1}}&

303: \examplewithclass{37}{5}&

304: \examplewithclass{38}{9}&

305: \examplewithclass{39}{1}\\

306: \examplewithclass{40}{4}&

307: \examplewithclass{41}{2}&

308: \examplewithclass{42}{2}&

309: \examplewithclass{43}{7}&

310: \examplewithclass{44}{9}&

311: \examplewithclass{45}{5}&

312: \examplewithclass{46}{9}&

313: \examplewithclass{47}{6}\\

314: \examplewithclass{48}{4}&

315: \examplewithclass{49}{3}&

316: \examplewithclass{50}{9}&

317: \examplewithclass{51}{3}&

318: \examplewithclass{52}{5}&

319: \examplewithclass{53}{6}&

320: \examplewithclass{54}{8}&

321: \examplewithclass{55}{1}\\

322: \examplewithclass{56}{7}&

323: \examplewithclass{57}{2}&

324: \examplewithclass{58}{4}&

325: \fbox{\examplewithclass{59}{6}}&

326: \examplewithclass{60}{7}&

327: \examplewithclass{61}{3}&

328: \examplewithclass{62}{6}&

329: \examplewithclass{63}{4}\\

330: \examplewithclass{64}{5}&

331: \examplewithclass{65}{1}&

332: \examplewithclass{66}{7}&

333: \examplewithclass{67}{6}&

334: \examplewithclass{68}{7}&

335: \examplewithclass{69}{7}&

336: \examplewithclass{70}{9}&

337: \examplewithclass{71}{9}\\

338: \examplewithclass{72}{9}&

339: \examplewithclass{73}{9}&

340: \examplewithclass{74}{7}&

341: \examplewithclass{75}{9}&

342: \examplewithclass{76}{9}&

343: \examplewithclass{77}{9}&

344: \examplewithclass{78}{2}&

345: \examplewithclass{79}{1}\\

346: \examplewithclass{80}{9}&

347: \examplewithclass{81}{2}&

348: \examplewithclass{82}{9}&

349: \examplewithclass{83}{8}&

350: \examplewithclass{84}{9}&

351: \examplewithclass{85}{9}&

352: \examplewithclass{86}{8}&

353: \examplewithclass{87}{3}\\

354: \examplewithclass{88}{9}&

355: \examplewithclass{89}{9}&

356: \examplewithclass{90}{6}&

357: \examplewithclass{91}{4}&

358: \examplewithclass{92}{7}&

359: \examplewithclass{93}{5}&

360: \fbox{\examplewithclass{94}{5}}&

361: \examplewithclass{95}{3}\\

362: \examplewithclass{96}{3}&

363: \examplewithclass{97}{9}&

364: \examplewithclass{98}{2}&

365: \examplewithclass{99}{9}&

366: \examplewithclass{100}{7}&

367: \examplewithclass{101}{0}&

368: \examplewithclass{102}{8}&

369: \examplewithclass{103}{1}\\

370: \examplewithclass{104}{1}&

371: \examplewithclass{105}{0}&

372: \examplewithclass{106}{8}&

373: \examplewithclass{107}{8}&

374: \examplewithclass{108}{7}&

375: \examplewithclass{109}{0}&

376: \examplewithclass{110}{1}&

377: \examplewithclass{111}{8}\\

378: \examplewithclass{112}{4}&

379: \examplewithclass{113}{7}&

380: \examplewithclass{114}{7}&

381: \examplewithclass{115}{9}&

382: \examplewithclass{116}{9}&

383: \examplewithclass{117}{2}&

384: \examplewithclass{118}{6}&

385: \examplewithclass{119}{9}\\

386: \examplewithclass{120}{6}&

387: \fbox{\examplewithclass{121}{5}}&

388: \examplewithclass{122}{5}&

389: \examplewithclass{123}{4}&

390: \examplewithclass{124}{2}&

391: \fbox{\examplewithclass{125}{0}}&

392: \examplewithclass{126}{4}&

393: \\

394: \end{tabular}

395: }

396:

397: \caption[Difficult MNIST test samples]{Difficult examples from

398:   the MNIST test set along with their target labels. At least one of the four

399:   state-of-the-art systems (cp.~Table~\ref{tab:mnist-er}) misclassifies these

400:   images.  The framed examples are misclassified by all four systems.

401: \label{fig:mnist-difficult}}

402: \end{figure}

403:

404: Figure~\ref{fig:mnist-difficult} shows the `difficult' examples from the MNIST

405: test set. At least one of the four state-of-the-art systems misclassifies each

406: sample.  (These systems are marked with `$^*$' in Table~\ref{tab:mnist-er}.)

407: Those samples that are misclassified by all four systems are marked by a

408: surrounding frame. This presentation is possible because both in \cite{sch02}

409: and in \cite{shapecontext_pami} the authors present the set of samples

410: misclassified by their systems.  Furthermore, Patrice Simard kindly provided

411: the classification results of his system as described in \cite{simardICDAR03}

412: for all test data.  The availability of these results also makes it possible

413: to determine the error rate of a hypothetical system that combines these four

414: best systems as described in the following Section~\ref{sec:mnist-sota}.

415:

416: Some of the images in Figure~\ref{fig:mnist-difficult} are a good illustration

417: of the inherent class overlap that exists for this problem: some instances of

418: \eg `3'~vs.~`5', `4'~vs.~`9', and `8'~vs.~`9' are not distinguishable by taking

419: into account the observed image only.  This suggests that we are dealing with

420: a problem with non-zero Bayes error rate. Further improvements in the error

421: rate on this data set might therefore be problematic. For example, consider a

422: classifier that classifies the second framed image as a `9': despite the fact

423: that this classifier would not make an error with this decision according to

424: the class labels, we might prefer a classifier that classifies the image as a

425: `4'. Note that recently~\cite{suen-prl05} has presented a more detailed

426: discussion of different types of errors made by state-of-the-art classifiers

427: for handwritten characters.

428:

429:

430: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

431: \section{The classifiers and their combination}

432: \label{sec:mnist-sota}

433:

434: We briefly describe the four systems for handwritten digit recognition that we

435: compare and combine.  Then, we discuss the statistical significance of their

436: results and present a simple classifier combination of these four methods that

437: achieves a (hypothetical) error rate of 0.35\%.

438:

439: {\bf Shape context matching.}  \cite{shapecontext_pami} presents the shape

440: context matching approach.  The method proceeds by first extracting contour

441: points of the images. In the case of handwritten character images the

442: resulting contour points trace both sides of the pen strokes the character is

443: composed of.  Then, at each contour point a local descriptor of the shape as

444: represented by the contour points is extracted. This local descriptor is

445: called a shape context and is a histogram of the contour points in the

446: surrounding of the central point. This histogram has a finer resolution at

447: points close to the central point and a coarser one for regions farther away,

448: which is achieved using a log-polar

449: representation.\\

450: The classification is then done by using a nearest neighbor classifier

451: (although the authors chose to use only one third of the training data for the

452: MNIST task). The distance within the classifier is determined using an

453: iterative matching based on the shape context descriptors and two-dimensional

454: deformation. The shape contexts of training and test image are assigned to

455: each other by using the Hungarian algorithm on a bipartite graph

456: representation with edge weights according to the similarity of the shape

457: context descriptors. This assignment is then used to estimate a

458: two-dimensional spline transformation best matching the two images.  The

459: images are transformed accordingly and the whole process (including extraction

460: of shape contexts) is iterated until a stopping criterion is reached. The

461: resulting distance is used in the

462: classifier. \\

463: Recently, \cite{athi05} discuss a cascading technique to speed up the slow

464: nearest neighbor matching by ``two to three orders of magnitude''. While the

465: result that this discussion is based on only used the first 20,000 training

466: samples for reasons of efficiency and resulted in an error rate of 0.63\%

467: \cite{shape_context}, \cite{athi05} report an error rate of 0.54\% for the

468: full training set and 0.58\% for the cascaded classifier that uses only about

469: 300 distance calculations per test.

470:

471: {\bf Invariant support vector machine.}

472: \cite{sch02} presents a support vector machine (SVM) that is especially suited

473: for handwritten digit recognition by incorporating prior knowledge about the

474: task. This is achieved by using virtual data or a special kernel function

475: within the SVM. The special kernel function applies several transformations to

476: the compared images that leave the class identity unchanged and return the

477: kernel function of the appropriate pair of transformed images. This method is

478: referred to as kernel jittering. The second approach uses so-called virtual support

479: vectors. This approach consists of first training a support vector machine.

480: Now,  the set of support vectors contains

481: sufficient information about the recognition problem and can therefore be

482: considered a condensed representation of the training data for discrimination

483: purposes. The method proceeds to create transformed versions of the support

484: vectors, which are the virtual support vectors. In the experiments leading to

485: the error rate of 0.56\% the transformations used were image shifts within the

486: eight-neighborhood plus horizontal and vertical shifts of two pixels, thus

487: resulting in $9+4=13$ virtual support vectors for each original support

488: vector. (This experiment also used the deslanted version of the MNIST data

489: \cite{lecun98}.) On this new set of virtual support vectors, another support

490: vector machine was trained and evaluated on the test set.

491:

492: {\bf Pixel-to-pixel image matching with local contexts.}

493: %

494: \cite{icpr04_nlmatch} presents deformable models for handwritten character

495: recognition. It is shown that a simple zero-order matching approach called

496: image distortion model (IDM) can lead to very competitive results if the local

497: context of each pixel is considered in the distortion. The local context is

498: represented by a $3\times3$ surrounding window of the horizontal and vertical

499: image gradient, resulting in an 18-dimensional descriptor.  The IDM allows one to

500: choose for each pixel of the test image the best fitting counterpart of the

501: reference image within a suitable corresponding range.  The distance as

502: determined by the best match between two images is then used within a

503: 3-nearest-neighbor classifier.  More elaborate models for image matching are

504: also discussed, but only small improvements can be obtained at the price of

505: much higher computational costs.  The IDM can be seen as the best compromise

506: between high classification speed and high recognition accuracy while being

507: conceptually very simple and easy to implement.
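
The matching step can be sketched as follows. The notation is ours and only
illustrative: $a_{ij}$ and $b_{i'j'}$ denote the 18-dimensional local context
descriptors at the pixel positions of the test image $A$ and the reference
image $B$, $w$ is an assumed maximum displacement, and the squared Euclidean
distance is an assumed local cost. The mapping between pixel positions of
images of different sizes is omitted.
\[
  d_{\mathrm{IDM}}(A,B) \;=\; \sum_{i,j}\;
    \min_{\substack{|i'-i|\le w\\ |j'-j|\le w}}
    \bigl\| a_{ij} - b_{i'j'} \bigr\|^2
\]
In this sketch each pixel chooses its counterpart independently of its
neighbors, so there are no dependencies between neighboring displacements,
which we take to be the sense of `zero-order' here.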

508:

509: {\bf Convolutional neural net and virtual data.}

510: \label{sec:idm}

511: %

512: \cite{simardICDAR03} presents a large convolutional neural network of about

513: 3,000 nodes in five layers that is especially designed for handwritten

514: character classification. The new concept in the approach is to present a new

515: set of virtual training images to the learning algorithm of the neural net in

516: each iteration of the training. The virtual training set is constructed from

517: the given training data by applying a separate two-dimensional random

518: displacement field that is smoothed with a Gaussian filter to each of the

519: images. This makes it possible to generate a very large amount of virtual data

520: in the order of 1,000 virtual samples for each original element of the

521: training data set. The data is generated on the fly in each training iteration

522: and therefore does not have to be saved, which avoids the problems with data

523: handling. Apart from the generation of virtual examples there is another point

524: where prior knowledge about the task comes into play, namely the use of a {\em

525:   convolutional} neural net. This architecture, which is described in greater

526: detail in \cite{lecun98}, contains prior knowledge in that it uses tying of

527: weights within the neural net to extract low-level features from the input

528: that are invariant with respect to the position within the image, and the

529: position information is used only in later layers of the neural net.
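
The random displacement fields used to generate the virtual data can be
written compactly. The symbols are assumptions made for illustration: $U_x$
and $U_y$ are fields of independent random values drawn uniformly from
$[-1,1]$, $G_\sigma$ is a Gaussian filter with standard deviation $\sigma$,
$\ast$ denotes convolution, and $\alpha$ is a scaling factor that controls
the strength of the distortion.
\[
  \Delta x = \alpha\,(G_\sigma \ast U_x), \qquad
  \Delta y = \alpha\,(G_\sigma \ast U_y)
\]
A virtual image $I'$ can then be obtained from an original image $I$, for
instance, as $I'(x,y) = I\bigl(x+\Delta x(x,y),\, y+\Delta y(x,y)\bigr)$,
using interpolation for non-integer positions. Because new random fields are
drawn in every iteration, the number of distinct virtual samples is limited
only by the number of training iterations.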

530:

531:

532:

533: {\bf Discussion and combination.}

534: %

535: We can observe that all four methods take special measures to deal

536: with the image variability present in the images, using virtual data

537: and image matching methods. At the same time the concrete

538: classification algorithm seems to play a somewhat smaller role in the

539: performance as nearest neighbor classifiers, support vector machines,

540: and neural networks all perform very well. Only a slight advantage of

541: the neural net can be seen in the possibility to use very large

542: amounts of virtual data in training because the training proceeds in

543: several iterations, which need not use the same data but can use

544: distorted samples of the images instead.

545:

546:

547: Figure~\ref{fig:mnist-difficult} shows all the errors made by at least one of the four

548: classifiers. It is remarkable that only eight samples are classified

549: incorrectly by all four systems. This observation naturally suggests the use

550: of classifier combination to further reduce the error rate.  The availability

551: of the results of the other classifiers makes it possible to determine this

552: error rate of a simple hypothetical combined system.
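
The numbers behind this observation can be made explicit. With 10,000 test
samples, the error rates of 0.63\%, 0.56\%, 0.54\%, and 0.42\% correspond to
63, 56, 54, and 42 errors. Assuming that Figure~\ref{fig:mnist-difficult}
lists each misclassified sample exactly once, it contains 127 distinct
samples, so that on average
\[
  \frac{63+56+54+42}{127} \;=\; \frac{215}{127} \;\approx\; 1.7
\]
systems err on a difficult sample. A hypothetical oracle that always picks a
correct classifier whenever one exists would fail only on the eight framed
samples and thus reach an error rate of $8/10{,}000 = 0.08\%$. This value
is a lower bound for any combination that selects one of the four outputs.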

553:

554: However, we are somewhat restricted for the choice of combination scheme,

555: because for two of the classifiers we only know whether the result was correct or not.

556: We thus decided to use a simple majority vote combination based on the four

557: classifiers, where the neural net classifier is used for tie-breaking (because

558: it has the best single error rate).  Note that the result is only an upper

559: bound of the error rate that a real combined system would have, because we do

560: not use the class labels the patterns were assigned to (but only the

561: information whether the decision was correct or not). This means that in case of a

562: disagreement between the falsely assigned classes we could have a correct

563: assignment when using the class labels. Furthermore, it seems likely that the

564: use of the confidence values of the component classifiers in the combination

565: scheme could also improve the joint decision.
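
The decision rule can be stated formally. As an illustrative notation (not
taken from the cited works), let $c_k(x)\in\{0,1\}$ indicate whether
classifier $k \in \{\mathrm{SC},\mathrm{SVM},\mathrm{IDM},\mathrm{CNN}\}$
classifies the test sample $x$ correctly, and let $s(x)=\sum_k c_k(x)$. The
hypothetical combination is counted as correct on $x$ if
\[
  s(x)\ge 3 \quad\text{or}\quad
  \bigl(s(x)=2 \ \text{and}\ c_{\mathrm{CNN}}(x)=1\bigr),
\]
and as incorrect otherwise. In particular, a sample that only one system
classifies correctly is counted as an error, although the three wrong
decisions might disagree with each other in a real system. The same holds for
a sample with $s(x)=2$ on which the neural net errs: it is counted as an
error even though the two wrong labels might differ. These cases are the
reason why the resulting figure is an upper bound.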

566:

567: Using the described hypothetical combination, the resulting error rate is

568: 0.35\%\label{mnistres}.  In the following section we will show that, with a

569: probability of 94\%, this improvement is real and not based on chance

570: alone.

571:

572:

573:

574: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

575: \section{Statistical analysis of results}

576:

577:

578: \begin{table}[tb]

579:   \caption[Significance of improvements for best MNIST classifiers]%

580:   {Probabilities of improvement for all pairs of the four used

581:   classifiers and their combination according to a bootstrap analysis.

582:   Probabilities in

583:   {boldface} show {\bfseries significant} improvements with respect to the

584:   5\% level. This table can be read as follows: the classifier in

585:   each row improves over the classifiers given in the columns with the

586:   stated probability (\eg the probability of improvement for SVM over SC is

587:   0.60). The second table shows the difference in error rates for

588:   comparison.  SC: shape context matching; SVM: invariant support

589:   vector machine; IDM: image distortion model; CNN: convolutional neural net with

590:   distortions; CC: combination of the four classifiers.}

591:   \label{tab:poi-mnist}

592:   \small

593:   \centering

594:   \begin{tabular}{|l|c|c|c|c|c|}

595:     \multicolumn{6}{c}{probability of improvement}\\

596:     \hline

597:         & SC & SVM & IDM & CNN & CC \\

598:     \hline

599:     SC  & --- &   &   &  &\\

600:     \hline

601:     SVM & 0.60 & --- &   & & \\

602:     \hline

603:     IDM & 0.85 & 0.58 & --- & & \\

604:     \hline

605:     CNN & {\bf 0.99} & {\bf 0.96} & 0.92 & --- &\\

606:     \hline

607:     CC  & {\bf 1.00} & {\bf 1.00} & {\bf 1.00} & 0.94  & ---  \\

608:     \hline

609:   \end{tabular} \hspace{1cm}

610:   \begin{tabular}{|l|c|c|c|c|c|}

611:     \multicolumn{6}{c}{difference in error rate}\\

612:     \hline

613:         & SC & SVM & IDM & CNN & CC\\

614:     \hline

615:     SC  & ---  &     &   &  & \\

616:     \hline

617:     SVM & 0.07 & --- &   & &  \\

618:     \hline

619:     IDM & 0.09 & 0.02 & --- & &  \\

620:     \hline

621:     CNN & 0.21 & 0.14 & 0.12 & --- & \\

622:     \hline

623:     CC  & 0.28 & 0.21 & 0.19 & 0.07 & ---  \\

624:     \hline

625:   \end{tabular}

626: \end{table}

627:

628:

629: As mentioned above, we can perform a more detailed analysis of the results of

630: the four methods described in the previous section because we know not only

631: the error rates of the classifiers but also the exact patterns for which

632: an error has occurred. Therefore, we do not have to assume that the

633: classifiers have been evaluated on independent data and are thus able to

634: derive tighter estimates of the level of confidence of an improvement.

635:

636: The more detailed analysis shown here estimates the

637: probability that one classifier generally performs better than a second

638: classifier (the probability of improvement) from the decisions of the

639: two classifiers on the same test samples. We estimate this probability

640: by drawing a large number of bootstrap samples from the test data set

641: and observing the relative performance of the two classifiers on these

642: resampled test sets~\cite{bisani_poi}. This estimate tells us more

643: than a comparison based on the individual error rates

644: alone. For example, we will intuitively be more inclined to believe

645: that the first classifier is better if it leads to better

646: classifications on 2\% of the test data and to the same results on the

647: remaining 98\% than if the first classifier performs better on 30\% of

648: the test data but worse on 28\% of the data. (For an interesting

649: discussion of significance in the context of comparisons of machine

650: learning algorithms, see \cite{salzberg97comparing}.)

651: Table~\ref{tab:poi-mnist} shows the probabilities of improvement based

652: on this technique for the four methods described above along with the

653: differences in error rate. LeCun et al.~\cite{lecun98} state that improvements of

654: more than 0.1\% in the error rate may be considered significant. The

655: analysis performed here allows a more detailed assessment of the

656: significance of improvements.
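The estimate can be written down compactly. The following is a sketch in
notation introduced here for illustration only; it is an assumption and not
taken from~\cite{bisani_poi}. Let $d_n \in \{-1, 0, +1\}$ indicate whether, on
test sample $n$ of $N$, only the first classifier is correct ($+1$), only the
second classifier is correct ($-1$), or both are equally right or wrong ($0$).
For $B$ bootstrap samples, each consisting of $N$ indices
$n_{b,1}, \ldots, n_{b,N}$ drawn with replacement, the probability of
improvement may then be estimated as
\begin{equation*}
  \widehat{\mathrm{poi}} = \frac{1}{B} \sum_{b=1}^{B}
    \left[\, \sum_{i=1}^{N} d_{n_{b,i}} > 0 \,\right],
\end{equation*}
where $[\cdot]$ equals one if the condition holds and zero otherwise. How ties
are counted is a further assumption of this sketch; here they count against
the first classifier.

For the two cases in the example above, the mean of $d_n$ is $0.02$ in both,
but the variances differ:
\begin{align*}
  \sigma_1^2 &= 0.02 - 0.02^2 = 0.0196, \\
  \sigma_2^2 &= (0.30 + 0.28) - 0.02^2 = 0.5796.
\end{align*}
The standard error of the observed difference is therefore larger by a factor
of $\sqrt{0.5796 / 0.0196} \approx 5.4$ in the second case, which is why the
same net gain is less convincing there.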

657:

658: We observe that the three classifiers based on shape

659: context, virtual support vectors, and the image distortion model do not

660: differ significantly from one another (at the 5\% level). On the other hand, the

661: neural net based classifier shows significant improvements over the

662: classifiers based on shape context and virtual support vectors, but not over

663: the classifier based on the image distortion model. Finally, the improvements

664: of the combined classifier over the single classifiers are highly significant

665: except for the improvement with respect to the neural net, which

666: reaches a significance level of 6\%. This value is not below the

667: commonly used 5\% threshold, but it is sufficiently close to convince

668: us that the improvement is not based on chance alone.
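The last value can be cross-checked approximately from the lists in the
appendix. This is a rough sketch under a normal approximation, not the
bootstrap procedure itself. Of the test patterns, 13 are misclassified by the
neural net but not by the combination, and 6 by the combination but not by the
neural net. The net gain of $13 - 6 = 7$ patterns has a standard deviation of
approximately $\sqrt{13 + 6} \approx 4.36$, which gives
\begin{equation*}
  z \approx \frac{7}{4.36} \approx 1.61, \qquad \Phi(1.61) \approx 0.95,
\end{equation*}
where $\Phi$ denotes the standard normal distribution function. This is of the
same order as the bootstrap estimate of 0.94 in Table~\ref{tab:poi-mnist}.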

669:

670: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

671: \section{Conclusion}

672:

673:

674: We presented a statistical analysis of the results of four state-of-the-art

675: systems for handwritten character recognition on the MNIST benchmark.  By

676: using the fact that the systems were tested on the same data, we were able to

677: derive more specific results than would have been possible by using the

678: error rates (and the number of test samples) alone. During the analysis, we observed that

679: the four systems had a higher variability in the results than we initially

680: expected. Specifically, only eight errors were common among all classifiers.

681: This observation motivated a combination of the classifiers, which resulted in

682: an error rate of 0.35\%, the lowest error rate reported on this data set so

683: far. The statistical analysis resulted in a probability of improvement of 94\%

684: for the combination with respect to the best single classifier.

685:

686:

687: In view of the low error rates that are achieved by current methods on the

688: MNIST data, we may have reached a point at which further improvement is

689: largely due to random effects and overadaptation to the (test) data. Some of

690: the errors observed also show that the Bayes error rate of the problem is

691: larger than zero. This underlines the necessity to present statistical

692: analyses of improvement claims and the measures taken to avoid training on the

693: testing data within all publications using these data in the future.  These

694: results may also be viewed as a hint that it is necessary to promote benchmark

695: data sets with an impact similar to that of the MNIST data for new and more complex

696: problems.

697:

698:

699:

700: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

701: \section*{Acknowledgements}

702: This work was partially funded by the BMBF (German Federal Ministry of

703: Education and Research), project IPeT (01~IW~D03).

704:

705:

706: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

707: \section*{Appendix}

708: For completeness, we list the numbers of the MNIST test patterns that

709: are misclassified by the four systems and their combination in this appendix.

710:

711: \begin{description}

712: \small

713: \addtolength{\itemsep}{-1ex}

714: \item[Shape Context \cite{shapecontext_pami}]

715: 210, 448, 583, 692, 717, 948, 1034, 1113, 1227, 1248, 1300, 1320, 1531, 1682, 1710, 1791, 1879, 1902, 2041, 2074, 2099, 2131, 2183, 2238, 2448, 2463, 2583, 2598, 2655, 2772, 2940, 3063, 3074, 3251, 3423, 3476, 3559, 3822, 3851, 4094, 4164, 4202, 4370, 4498, 4506, 4663, 4732, 4762, 5736, 5938, 6555, 6572, 6577, 6598, 6884, 8066, 8280, 8317, 8528, 9506, 9643, 9730, 9851

716: \item[SVM \cite{sch02}]

717: 448, 583, 660, 675, 727, 948, 1015, 1113, 1227, 1233, 1248, 1300, 1320, 1531, 1550, 1682, 1710, 1791, 1902, 2036, 2071, 2099, 2131, 2136, 2183, 2294, 2489, 2655, 2928, 2940, 2954, 3031, 3074, 3226, 3423, 3521, 3535, 3559, 3605, 3763, 3870, 3986, 4079, 4762, 4824, 5938, 6577, 6598, 6784, 8326, 8409, 9665, 9730, 9750, 9793, 9851

718: \item[IDM \cite{icpr04_nlmatch}]

719: 446, 448, 552, 717, 727, 948, 1015, 1113, 1243, 1682, 1879, 1902, 2110, 2131, 2183, 2344, 2463, 2524, 2598, 2649, 2940, 3226, 3423, 3442, 3559, 3602, 3768, 3809, 3986, 4054, 4164, 4177, 4202, 4285, 4290, 4762, 5655, 5736, 5938, 6167, 6884, 7217, 8317, 8377, 8409, 8528, 9010, 9506, 9531, 9643, 9680, 9730, 9793, 9851

720: \item[Neural Net \cite{simardICDAR03}]

721: 583, 948, 1233, 1300, 1394, 1879, 1902, 2036, 2131, 2136, 2183, 2463, 2583, 2598, 2655, 2928, 2971, 3289, 3423, 3763, 4202, 4741, 4839, 4861, 5655, 5938, 5956, 5974, 6572, 6577, 6598, 6626, 8409, 8528, 9680, 9693, 9699, 9730, 9793, 9840, 9851, 9923

722: \item[Combination]

723: 448, 583, 948, 1113, 1233, 1300, 1682, 1879, 1902, 2036, 2131, 2136, 2183, 2463, 2583, 2598, 2655, 2928, 2940, 3423, 3559, 3763, 4202, 4762, 5655, 5938, 6572, 6577, 6598, 8409, 8528, 9680, 9730, 9793, 9851

724: \end{description}

725:

726:

727: \begin{thebibliography}{10}\setlength{\itemsep}{-0.7ex}\small

728:

729: \bibitem{athi05}

730: V.~Athitsos, J.~Alon, and S.~Sclaroff.

731: \newblock Efficient Nearest Neighbor Classification Using a Cascade of

732:   Approximate Similarity Measures.

733: \newblock In {\em CVPR 2005, Int. Conf. on Computer Vision and Pattern

734:   Recognition}, volume~I, pages 486--493, San Diego, CA, June 2005.

735:

736: \bibitem{shape_context}

737: S.~Belongie, J.~Malik, and J.~Puzicha.

738: \newblock Shape Context: A New Descriptor for Shape Matching and Object

739:   Recognition.

740: \newblock In T.~K. Leen, T.~G. Dietterich, and V.~Tresp, editors, {\em Advances

741:   in Neural Information Processing Systems~13}, pages 831--837. MIT Press,

742:   April 2001.

743:

744: \bibitem{shapecontext_pami}

745: S.~Belongie, J.~Malik, and J.~Puzicha.

746: \newblock Shape Matching and Object Recognition Using Shape Contexts.

747: \newblock {\em IEEE Trans. Pattern Analysis and Machine Intelligence},

748:   24(4):509--522, April 2002.

749:

750: \bibitem{batthacharyya-iwfhr04}

751: U.~Bhattacharya, S.~Vajda, A.~Mallick, B.~B. Chaudhuri, and A.~Belaid.

752: \newblock On the Choice of Training Set, Architecture and Combination Rule of

753:   Multiple MLP Classifiers for Multiresolution Recognition of Handwritten

754:   Characters.

755: \newblock In {\em International Workshop on Frontiers in Handwriting

756:   Recognition (IWFHR'04)}, pages 419--424, Tokyo, Japan, October 2004.

757:

758: \bibitem{bisani_poi}

759: M.~Bisani and H.~Ney.

760: \newblock Bootstrap Estimates for Confidence Intervals in ASR Performance

761:   Evaluation.

762: \newblock In {\em Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal

763:   Processing}, volume~1, pages 409--412, Montreal, Canada, May 2004.

764:

765: \bibitem{bot94+}

766: L.~Bottou, C.~Cortes, J.~S. Denker, H.~Drucker, I.~Guyon, L.~Jackel,

767:   Y.~{Le~Cun}, U.~M{\"u}ller, E.~S{\"a}ckinger, P.~Simard, and V.~N. Vapnik.

768: \newblock Comparison of Classifier Methods: {A} Case Study in Handwritten Digit

769:   Recognition.

770: \newblock In {\em Proc. of the Int. Conf. on Pattern Recognition}, pages

771:   77--82, Jerusalem, Israel, October 1994.

772:

773: \bibitem{das06_kumar}

774: K.~Chellapilla, M.~Shilman, and P.~Simard.

775: \newblock Combining Multiple Classifiers for Faster Optical Character

776:   Recognition.

777: \newblock In {\em DAS 2006, Int. Workshop Document Analysis Systems}, volume

778:   3872 of {\em LNCS}, pages 358--367, Nelson, New Zealand, February 2006.

779:

780: \bibitem{mcs2001}

781: J.~Dahmen, D.~Keysers, and H.~Ney.

782: \newblock Combined Classification of Handwritten Digits using the `Virtual Test

783:   Sample Method'.

784: \newblock In {\em MCS 2001, 2nd Int. Workshop on Multiple Classifier Systems},

785:   volume 2096 of {\em Lecture Notes in Computer Science}, pages 109--118,

786:   Cambridge, UK, May 2001. Springer.

787:

788: \bibitem{sch02}

789: D.~DeCoste and B.~Sch{\"o}lkopf.

790: \newblock Training Invariant Support Vector Machines.

791: \newblock {\em Machine Learning}, 46(1--3):161--190, 2002.

792:

793: \bibitem{dong02}

794: J.-X. Dong, A.~Krzyzak, and C.~Y. Suen.

795: \newblock Local learning framework for handwritten character recognition.

796: \newblock {\em Engineering Applications of Artificial Intelligence},

797:   15(2):151--159, April 2002.

798:

799: \bibitem{dong04}

800: J.-X. Dong, A.~Krzyzak, and C.~Y. Suen.

801: \newblock Fast SVM Training Algorithm with Decomposition on Very Large Data

802:   Sets.

803: \newblock {\em IEEE Trans. Pattern Analysis and Machine Intelligence},

804:   27(4):603--618, April 2005.

805: \newblock Additional results at

806:   http://www.cenparmi.concordia.ca/$\sim$people/jdong/ HeroSvm.html.

807:

808: \bibitem{bunke-iwfhr04}

809: S.~G{\"u}nter and H.~Bunke.

810: \newblock Combination of Three Classifiers with Different Architectures for

811:   Handwritten Word Recognition.

812: \newblock In {\em International Workshop on Frontiers in Handwriting

813:   Recognition (IWFHR'04)}, pages 63--68, Tokyo, Japan, October 2004.

814:

815: \bibitem{diss}

816: D.~Keysers.

817: \newblock {\em Modeling of Image Variability for Recognition}.

818: \newblock {PhD} thesis, RWTH Aachen University, Aachen, Germany, March 2006.

819:

820: \bibitem{icpr00_td}

821: D.~Keysers, J.~Dahmen, T.~Theiner, and H.~Ney.

822: \newblock Experiments with an Extended Tangent Distance.

823: \newblock In {\em Proc. 15th Int. Conf. on Pattern Recognition}, volume~2,

824:   pages 38--42, Barcelona, Spain, September 2000.

825:

826: \bibitem{icpr04_nlmatch}

827: D.~Keysers, C.~Gollan, and H.~Ney.

828: \newblock Local Context in Non-linear Deformation Models for Handwritten

829:   Character Recognition.

830: \newblock In {\em ICPR 2004, 17th Int. Conf. on Pattern Recognition},

831:   volume~IV, pages 511--514, Cambridge, UK, August 2004.

832:

833: \bibitem{Kittler98}

834: J.~Kittler, M.~Hatef, R.~P.~W. Duin, and J.~Matas.

835: \newblock On Combining Classifiers.

836: \newblock {\em IEEE Trans. Pattern Analysis and Machine Intelligence},

837:   20(3):226--239, March 1998.

838:

839: \bibitem{lecun98}

840: Y.~LeCun, L.~Bottou, Y.~Bengio, and P.~Haffner.

841: \newblock Gradient-Based Learning Applied to Document Recognition.

842: \newblock {\em Proc. of the IEEE}, 86(11):2278--2324, November 1998.

843:

844: \bibitem{liu_benchmark}

845: C.-L. Liu, K.~Nakashima, H.~Sako, and H.~Fujisawa.

846: \newblock Handwritten Digit Recognition: Benchmarking of State-of-the-Art

847:   Techniques.

848: \newblock {\em Pattern Recognition}, 36(10):2271--2285, October 2003.

849:

850: \bibitem{maree04}

851: R.~Mar{\'e}e, P.~Geurts, J.~Piater, and L.~Wehenkel.

852: \newblock A Generic Approach for Image Classification Based on Decision Tree

853:   Ensembles and Local Sub-Windows.

854: \newblock In K.-S. Hong and Z.~Zhang, editors, {\em Proc. of the 6th Asian

855:   Conf. on Computer Vision}, volume~2, pages 860--865, Jeju Island, Korea,

856:   January 2004.

857:

858: \bibitem{icpr04_uchida}

859: N.~Matsumoto, S.~Uchida, and H.~Sakoe.

860: \newblock Prototype Setting for Elastic Matching-based Image Pattern

861:   Recognition.

862: \newblock In {\em ICPR 2004, 17th Int. Conf. on Pattern Recognition}, volume~I,

863:   pages 224--227, Cambridge, UK, August 2004.

864:

865: \bibitem{mayraz}

866: G.~Mayraz and G.~Hinton.

867: \newblock Recognizing Handwritten Digits Using Hierarchical Products of

868:   Experts.

869: \newblock {\em IEEE Trans. Pattern Analysis and Machine Intelligence},

870:   24(2):189--197, February 2002.

871:

872: \bibitem{milgram_mnist_05}

873: J.~Milgram, R.~Sabourin, and M.~Cheriet.

874: \newblock Combining Model-based and Discriminative Approaches in a Modular

875:   Two-stage Classification System: Application to Isolated Handwritten Digit

876:   Recognition.

877: \newblock {\em Electronic Letters on Computer Vision and Image Analysis},

878:   5(2):1--15, 2005.

879:

880: \bibitem{salzberg97comparing}

881: S.~L. Salzberg.

882: \newblock On Comparing Classifiers: Pitfalls to Avoid and a Recommended

883:   Approach.

884: \newblock {\em Data Mining and Knowledge Discovery}, 1(3), 1997.

885:

886: \bibitem{sch97}

887: B.~Sch{\"o}lkopf.

888: \newblock {\em Support Vector Learning}.

889: \newblock Oldenbourg Verlag, Munich, 1997.

890:

891: \bibitem{sch98new+}

892: B.~Sch{\"o}lkopf, P.~Simard, A.~Smola, and V.~Vapnik.

893: \newblock Prior Knowledge in Support Vector Kernels.

894: \newblock In M.~I. Jordan, M.~J. Kearns, and S.~A. Solla, editors, {\em

895:   Advances in Neural Information Processing Systems~10}, pages 640--646. {MIT}

896:   Press, June 1998.

897:

898: \bibitem{simardICDAR03}

899: P.~Simard, D.~Steinkraus, and J.~C. Platt.

900: \newblock Best Practices for Convolutional Neural Networks Applied to Visual

901:   Document Analysis.

902: \newblock In {\em 7th Int. Conf. Document Analysis and Recognition}, pages

903:   958--962, Edinburgh, Scotland, August 2003.

904:

905: \bibitem{sim93+}

906: P.~Simard, Y.~{Le Cun}, and J.~Denker.

907: \newblock Efficient Pattern Recognition Using a New Transformation Distance.

908: \newblock In S.~Hanson, J.~Cowan, and C.~Giles, editors, {\em Advances in

909:   Neural Information Processing Systems~5}, pages 50--58, San Mateo, CA, 1993.

910:   Morgan Kaufmann.

911:

912: \bibitem{Smith94}

913: S.~J. Smith, M.~O. Bourgoin, K.~Sims, and H.~L. Voorhees.

914: \newblock Handwritten Character Classification Using Nearest Neighbor in Large

915:   Databases.

916: \newblock {\em IEEE Trans. Pattern Analysis and Machine Intelligence},

917:   16(9):915--919, September 1994.

918:

919: \bibitem{suen-prl05}

920: C.~Y. Suen and J.~Tan.

921: \newblock Analysis of Errors of Handwritten Digits Made by a Multitude of

922:   Classifiers.

923: \newblock {\em Pattern Recognition Letters}, 26(3):369--379, 2005.

924:

925: \bibitem{teow00}

926: L.-N. Teow and K.-F. Loe.

927: \newblock Handwritten Digit Recognition with a Novel Vision Model that Extracts

928:   Linearly Separable Features.

929: \newblock In {\em Proc. CVPR 2000, Conf. on Computer Vision and Pattern

930:   Recognition}, volume~2, pages 76--81, Hilton Head, SC, June 2000.

931:

932: \bibitem{teow02}

933: L.-N. Teow and K.-F. Loe.

934: \newblock Robust Vision-Based Features and Classification Schemes for Off-Line

935:   Handwritten Digit Recognition.

936: \newblock {\em Pattern Recognition}, 35(11):2355--2364, November 2002.

937:

938: \end{thebibliography}

939:

940:

941: \end{document}

942: