0701:math0701152/ang.tex

1: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

2: % Guidelines for authors for ASMDA-2005 Proceedings

3: % Applied Stochastic Models and Data Analysis Conference

4: % A Conference of the Quantitative methods in Business

5: % and Industry Society.

6: %

7: % Brest (France) May, 17, 18, 19, 20, 2005.

8: %

9: % Version 1.0.2005

10: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

11:

12:

13: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

14: % This is a sample input file for your contribution to the

15: % ASMDA-2005 proceedings.

16: %

17: % Please use it as a template for your own input, and please

18: % follow the instructions in this document.

19: %

20: % Please send the compiled version of your paper for

21: % submission in PDF or PS format.

22: %

23: % For ready-camera paper (after acceptance), please send the

24: % compiled version of your paper (PDF only) along with the Tex

25: % sources and figure files of your manuscript, with any

26: % additional style files to the editor.

27: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

28:

29:

30: %RECOMMENDED%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

31:

32: \documentclass[runningheads,french]{asmda2005}

33: \usepackage[T1]{fontenc}

34: \usepackage{babel}

35:

36: \usepackage{asmda2005References} % For References style

37:

38:

39: \usepackage{graphicx} % standard LaTeX graphics tool

40:                        % for including eps-figure files.

41:                        % Uncomment the \usepackage{graphicx}

42:                        % if you need to include an eps-figure

43:                        % file.

44:

45:

46: %AUTHOR_STYLES_AND_DEFINITIONS%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

47: %

48: % Please reduce your own definitions and macros to an absolute

49: % minimum. Use your own definitions and macros if it is

50: % absolutely necessary for typesetting your manuscript.

51: %

52: %

53: %

54: %END_AUTHOR_STYLES_AND_DEFINITIONS%%%%%%%%%%%%%%%%%%%%%%%%%%%

55:

56:

57: \begin{document}

58: %

59: % \title* lets you specify the title of your manuscript.

60: % Use \protect\newline to force a line break in your title.

61: \title*{Missing values : processing with the Kohonen algorithm

62: }

63: %

64: % \toctitle specifies the title as will be printed in the table of

65: % contents.

66: % Use \protect\newline to force a line break in your title.

67: \toctitle{Missing values : processing with the Kohonen algorithm}

68: %

69: % \titlerunning defines the title in the running head. Abbreviate

70: % your title, if the full title is too long to fit in the running

71: % head.

72: \titlerunning{Missing values}

73: %

74: % \authors specifies the authors. Please use initials. Authors are

75: % separated by the \and command. Use the \inst{1} and \inst{2} commands

76: % to define the reference mark to your affiliation if needed.

77: \author{

78:   Marie Cottrell

79:   \and

80:   Patrick Letr\'emy

81: }

82: %

83: % The following commands allow each of the authors to appear

84: % in the author index.

85: % \index{Author, A.}

86: \index{Cottrell, M.}

87: \index{Letr\'emy, P.}

88:

89: %

90: % \authorrunning specifies the author name(s) in the running head.

91: % If there are more than two authors, please abbreviate author list

92: % (e.g., Vaillant  et al.) for running head.

93: \authorrunning{Cottrell and Letr\'emy}

94: %

95: % The \institute command lets you specify  your affiliation and

96: % your address. Separate two or more different affiliations by the

97: % \and command.

98: \institute{

99:   SAMOS-MATISSE\\

100:   Universit\'e Paris 1\\

101:   90, rue de Tolbiac, 75634 Paris Cedex 13, France\\

102:   (e-mail: {\tt cottrell@univ-paris1.fr, pley@univ-paris1.fr})

103: }

104:

105: % Typeset the title

106: \maketitle

107:

108: \begin{abstract}

109: We show how it is possible to use the Kohonen self-organizing algorithm to deal with data with missing values and estimate them. After a methodological reminder, we illustrate our purpose with three applications to real-world data.

110: Nous montrons comment il est possible d'utiliser l'algorithme d'auto-organisation de Kohonen pour traiter des donn\'ees avec valeurs manquantes et estimer ces derni\`eres. Apr\`es un rappel m\'ethodologique, nous illustrons notre propos \`a partir de trois applications \`a des donn\'ees r\'eelles.

111: \keyword{Data Analysis}

112: \keyword{Kohonen maps}

113: \keyword{Missing Values}

114: \end{abstract}

115:

116:

117: \section{Introduction}

118:

119:

120: %\begin{figure}

121: %\caption{essai3}

122: %\includegraphics[width=.8\textwidth]{image6.eps}

123: %\end{figure}

124:

125: Processing data that contain missing values is a difficult and recurring problem when the data come from real-world contexts. In applications, we very often face observations for which some values are unavailable, and this can occur for many reasons: typing errors, fields left unanswered in surveys, etc.

126:

127: Most statistical software (SAS, for example) simply discards incomplete observations. This has no practical consequence when the data are very numerous. But if too few observations remain, the results can lose all significance.

128:

129: To avoid discarding data in this way, one can replace a missing value with the mean of the corresponding variable, but this approximation can be very poor when the variable has a large variance.

130:

131: It is therefore worth noting that the Kohonen algorithm (as well as the Forgy algorithm) handles data with missing values directly, without having to estimate them beforehand. We are particularly interested in the Kohonen algorithm for its visualization properties.

132:

133: Sma\"{\i}l Ibbou's PhD thesis contains a chapter on this question, but it has not been published yet. The examples are run with the software written by Patrick Letr\'emy in IML-SAS and available on the SAMOS web page (http://samos.univ-paris1.fr).

134:

135: \section{Adaptation of the Kohonen algorithm to data with missing values}

136:

137: We do not recall the definition of the Kohonen algorithm here; see for example Kohonen \cite{kohonen} or \cite{cottrell}.

138:

139: Let us assume that the observations are real-valued $p$-dimensional vectors, that we intend to cluster into $n$ classes.

140:

141: When the input is an incomplete vector $x$, we first define the set $M_x$ of the indices of its missing components; $M_x$ is a subset of $\{1, 2, \ldots, p\}$. If $C=(C_1, C_2, \ldots, C_n)$ is the set of code-vectors at this stage, the winning code-vector $C_{i_0(x,C)}$ related to $x$ is computed by setting

142: $$ i_0(x,C)=\arg \min_i \|x-C_i\|,$$

143:

144: where the squared distance $\|x - C_i\|^2 = \sum_{k \notin M_x}(x_k - C_{i,k})^2$

145: is computed using only the components present in vector $x$.
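
As a small illustration of this restricted distance (the numbers below are hypothetical and chosen only to make the arithmetic easy), take $p=3$, an observation $x=(1,\cdot\,,3)$ whose second component is missing, so that $M_x=\{2\}$, and two code-vectors $C_1=(0,5,3)$ and $C_2=(1,0,1)$. Only components 1 and 3 enter the sums:
$$\|x-C_1\|^2=(1-0)^2+(3-3)^2=1, \qquad \|x-C_2\|^2=(1-1)^2+(3-1)^2=4,$$
so that $i_0(x,C)=1$, whatever the values of $C_{1,2}$ and $C_{2,2}$.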

146:

147: One can use incomplete data in two ways:

148:

149: a) If we want to use them during the construction of the code-vectors, then at each stage the update of the code-vectors (the winning one and its neighbors) only concerns the components present in the observation. Let us denote by

150: $C^t=(C_1^t, C_2^t, \ldots, C_n^t)$ the code-vectors at time $t$. If an observation $x^{t+1}$ is randomly drawn, the code-vectors are updated by setting:

151:

152: $$C_{i,k}^{t+1}= C_{i,k}^t + \epsilon (t) (x_k^{t+1} - C_{i,k}^t)$$

153: for $k\notin M_{x^{t+1}}$ and $i$ a neighbor of $i_0(x^{t+1}, C^t)$ (the winning unit included). Otherwise,

154: $$C_{i,k}^{t+1}= C_{i,k}^t.$$

155:

156: The sequence $\epsilon(t)$ is $[0,1]$-valued, with $\epsilon(0)\simeq 0.5$, and decreases to 0 like $1/t$. After convergence, the classes are defined by the nearest neighbor method: each observation is assigned to the class of its nearest code-vector.
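
To make the update rule concrete, here is a hypothetical one-step example (all numerical values are illustrative assumptions, not taken from the applications below). Let $p=3$, $\epsilon(t)=0.5$, and let the drawn observation be $x^{t+1}=(1,\cdot\,,3)$, with its second component missing. If the winning code-vector is $C_i^t=(0,5,3)$, with $i=i_0(x^{t+1},C^t)$, its components become
$$C_{i,1}^{t+1}=0+0.5\,(1-0)=0.5,\qquad C_{i,2}^{t+1}=C_{i,2}^t=5,\qquad C_{i,3}^{t+1}=3+0.5\,(3-3)=3.$$
The missing component leaves $C_{i,2}$ untouched. A schedule such as $\epsilon(t)=\epsilon_0/(1+ct)$, with $\epsilon_0=0.5$ and a constant $c>0$, is one possible choice that decreases like $1/t$; this particular form is an assumption made here for illustration only.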

157:

158: b) If the data are numerous enough that the incomplete vectors are not needed to build the map, one can simply classify them after the map is built, as supplementary data, by allocating each of them to the class whose code-vector is the nearest for the distance restricted to the non-missing components.

159:

160: This method yields excellent results, provided that no variable is totally or almost totally missing, and provided that the variables are sufficiently correlated, which is the case for most real databases. Several examples can be found in Sma\"{\i}l Ibbou's PhD thesis \cite{ibbou} and in Gaubert, Ibbou and Tutin \cite{gaubert}.

161:

162:

163: \section{Estimation of missing values, computation of membership probabilities}

164:

165: Whatever the method used to deal with missing values, one of the most interesting properties of the algorithm is that it allows an a posteriori estimation of these missing values.

166:

167: Let us denote by $C=(C_1, C_2, \ldots, C_n)$ the code-vectors after building the Kohonen map. If $M_x$ is the set of indices of the missing components of the observation $x$, and if $x$ is classified in class $i$, then for each index $k$ in $M_x$ one estimates $x_k$ by:

168: $$\hat{x}_k=C_{i,k}.$$

169:

170: Because at the end of the learning the Kohonen algorithm no longer uses neighbors (0-neighbor algorithm), we know that the code-vectors are asymptotically close to the mean values of their classes. This estimation method therefore amounts to estimating the missing values of a variable by the mean value of that variable in the class of the observation.

171:

172: Clearly, this estimation is all the more precise as the classes built by the algorithm are homogeneous and well separated. Numerous simulations have shown, for artificial data as well as for real data, that when the variables are sufficiently correlated the precision of these estimations is remarkable \cite{ibbou}.

173:

174: It is also possible to use a probabilistic classification rule: the membership probabilities of the supplementary observations (be they complete or incomplete) are computed by setting:

175: $$Prob(x \in \mbox{Class }i) =

176: \frac{\exp (-\|x - C_i\|^2)}{\sum^n_{k=1}\exp (-\|x - C_k\|^2)}.$$

177:

178: These probabilities also confirm the quality of the organization of the Kohonen map, since significant probabilities have to correspond to neighboring classes.
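
As a hypothetical numerical illustration of the membership rule, suppose that there are only $n=2$ code-vectors and that the squared distances, restricted to the observed components of $x$, are $\|x-C_1\|^2=1$ and $\|x-C_2\|^2=4$ (illustrative values). Then
$$Prob(x \in \mbox{Class }1)=\frac{e^{-1}}{e^{-1}+e^{-4}}=\frac{1}{1+e^{-3}}\simeq 0.953,\qquad Prob(x \in \mbox{Class }2)\simeq 0.047.$$
The two probabilities sum to 1, and the nearest code-vector receives the largest one.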

179:

180: Moreover, to estimate the missing values, one can compute the weighted mean of the corresponding components of the code-vectors, the weights being the membership probabilities. If $x$ is an incomplete observation, then for each index $k$ in $M_x$ one estimates $x_k$ by:

181: $$\hat{x}_k=\sum_{i=1}^n Prob(x \in \mbox{Class }i) \; C_{i,k}.$$
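
For instance, with the purely illustrative values $n=2$, $Prob(x \in \mbox{Class }1)=0.95$, $Prob(x \in \mbox{Class }2)=0.05$, $C_{1,k}=5$ and $C_{2,k}=0$, the weighted estimate is
$$\hat{x}_k=0.95\times 5+0.05\times 0=4.75,$$
whereas assigning $x$ to its most probable class alone would give $\hat{x}_k=C_{1,k}=5$.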

182:

183: These probabilities can also be used to derive confidence intervals for the estimates. In the following sections, we present three examples extracted from real data.

184:

185: \section{Socio-economic data}

186:

187: The first example is classical. The database contains seven ratios measured in 1996 on the macroeconomic situation of 182 countries. This data set was first used by F. Blayo and P. Demartines \cite{blayo} in the context of data analysis by SOMs.

188:

189: The measured variables are: annual population growth (ANCRX), mortality rate (TXMORT), illiteracy rate (TXANAL), population proportion in high school (SCOL2), GDP per head (PNBH), unemployment rate (CHOMAG), inflation rate (INFLAT).

190:

191: Among the 182 countries, only 115 have no missing values, 51 have exactly one missing value, and 16 have two or more missing values.

192:

193: Therefore we use the 115 + 51 = 166 complete or almost complete countries to build the Kohonen map, and we then classify the 16 remaining countries. As usual, the data are centered and scaled to unit variance. We take a Kohonen map with 7 by 7 units, that is 49 classes. Figure 1 shows the contents of the classes. The 166 countries that were used for computing the code-vectors are in normal font, the 16 others in underlined italics.

194:

195: \begin{figure}

196: \caption{The 182 countries (166 + 16) on a 7 by 7 map, 1500 iterations}

197: \includegraphics[scale=0.5]{car-pays.eps}

198: \end{figure}

199:

200: We can see that rich countries are in the top left-hand corner, while very poor ones are in the top right-hand corner. Ex-socialist countries are not very far from the richest ones. The 16 countries classified after the learning as supplementary observations are placed consistently: Monaco and the Vatican are displayed with rich countries, Guinea with very poor countries, etc.

201:

202: From these computations, it is possible to calculate the membership probabilities of each supplementary observation for each of the 49 classes.

203:

204: For example, the probabilities that Cuba belongs to class $i$ are greater than 0.03 for classes $i= (1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (1,2), (2,2)$,

205: $ (3,2), (4,2), (5,2), (6,2), (7,2), (3,3), (4,3), (6,3), (7,3)$, the maximum (0.06) being reached for class $(5,2)$. One can check on Figure 1 that these are neighboring classes. From these probabilities, it is possible to estimate the distribution of the estimators of the missing values. For Cuba, the variables in question are GDP, unemployment and inflation.

206:

207: From these results, it is possible (as will be shown in the talk) to build super-classes by applying an ascending hierarchical classification to the code-vectors, and then to cross this classification with other exogenous classifications.

208:

209:

210: \section{Study of the property market in Ile-de-France}

211:

212: The second example is extracted from a study commissioned by the Housing Directorate of the Regional Directorate of Equipment in Ile-de-France (DHV/DREIF). It was carried out in 1993 by the METIS and SAMOS laboratories of Paris 1, by Gaubert, Tutin and Ibbou \cite{gaubert}.

213:

214: For 205 towns in Ile-de-France considered in 1988, we have property data (housing rents and prices, old and new, collective or individual, standard or luxurious, office rents and prices, old and new). Some of the data are structurally missing: for example, the office market may be nonexistent in some towns.

215:

216: This is a case where some data are structurally missing, and where the number of towns is dramatically reduced if one discards the incomplete ones: only 5 out of 205 would be kept! So for the learning, we use the 150 towns which have fewer than 12 missing values out of 15. After that, the 55 towns which have more than 12 missing values out of 15 are classified as supplementary observations.

217:

218: Figure 2 displays the 205 towns (with and without missing values) classified on a 7 by 7 Kohonen map. Note that about 63\% of the values in the data set are missing.
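
To give an order of magnitude, the data table has $205 \times 15 = 3075$ entries, so a missing rate of about 63\% corresponds to roughly $0.63 \times 3075 \simeq 1937$ missing entries (an approximate count, since the percentage itself is rounded).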

219:

220: \begin{figure}

221: \caption{The 205 towns in Ile-de-France, in underlined italics the 55 towns which have more than 12 missing values out of 15 variables}

222: \includegraphics[scale=0.5]{communes.eps}

223: \end{figure}

224:

225: In this example, which is practically impossible to handle with classical software, we see that the Kohonen algorithm nevertheless makes it possible to classify extremely sparse data without introducing any gross error.

226: The results are perfectly coherent, even though the data are seriously incomplete. The districts of Paris, Boulogne and Neuilly-sur-Seine are in the bottom left-hand corner. On a diagonal stripe, one finds the towns of the inner suburbs (petite couronne); further right are the towns of the outer suburbs (grande couronne). Arcueil is classified together with L'Ha\"y-les-Roses (class (2,3)), Villejuif with Kremlin-Bic\^etre (class (4,6)), etc.

227:

228: Of course, these good results can be explained by the fact that the 15 measured variables are well correlated, so that the present values carry information about the missing ones. The examination of the correlation matrix (which SAS computes even when some values are missing) shows that 76 coefficients out of 105 are greater than 0.8, none of them being less than 0.65.
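
The figure of 105 is simply the number of distinct pairs among the 15 variables:
$$\frac{15\times 14}{2}=105, \qquad \frac{76}{105}\simeq 0.72,$$
so about 72\% of the pairwise correlations exceed 0.8.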

229:

230:

231: \section{Structures of Government Spending from 1872 to 1971}

232:

233: The third example is a very classical one in data analysis, taken from the ``Que sais-je ?'' book by Bouroche and Saporta, ``L'analyse des donn\'ees'' \cite{saporta}. The problem is to study government spending, measured over 24 years between 1872 and 1971, by an 11-dimensional vector: Public Authorities (Pouvoirs publics), Agriculture (Agriculture), Trade and Industry (Commerce et industrie), Transport (Transports), Housing and Regional Development (Logement et am\'enagement du territoire), Education and Culture (Education et culture), Social Welfare (Action sociale), Veterans (Anciens combattants), Defense (D\'efense), Debt (Dette), Miscellaneous (Divers). It is a very small example, with 24 observations of dimension 11, without any missing values.

234:

235: A Principal Component Analysis provides an excellent two-dimensional representation, with 64\% of explained variance (see Figure 3).

236:

237: \begin{figure}

238: \caption{On the left, the projections on the first two principal axes; on the right, the Kohonen map with 9 classes and 3 super-classes}

239: \includegraphics[scale=0.5]{annees.eps}

240: \end{figure}

241:

242: On this projection, the years split up into three groups, which correspond to three clearly identified periods (before the First World War, between the two World Wars, after the Second World War). Only the year 1920, the first year in which an expenditure item for Veterans appears, is placed in the first group although it belongs to the second one. On the Kohonen map, the three super-classes (identical to the ones just defined) are identified by an ascending hierarchical classification of the code-vectors.

243:

244: In this example, we have artificially suppressed randomly chosen values which were present in the original data, from 1 value out of 11 to 8 values out of 11 per year, in order to study the clustering stability and to compute the accuracy of the estimations obtained by taking the corresponding values of the code-vectors.

245:

246: One can observe that the three super-classes remain perfectly stable as long as one does not suppress more than 3 values a year, that is 27\% of the values.

247:

248: Then we estimate the suppressed values in each case. The following table shows how the mean quadratic error evolves with the number of suppressed values.

249:

250: \begin{table}

251:   \begin{center}

252:     \begin{tabular}{|l|l|l|l|l|l|l|l|l|}

253:       \hline

254:       Number of missing values & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8\\

255: 		 \hline

256: Percentage of missing values & 9\% & 18\% & 27\% & 36\% & 45\% & 55\% & 64\% & 73\%\\

257:       \hline

258:       Mean quadratic error & 0.39 & 0.54 & 0.73 & 1.11 & 1.31 & 1.30 & 1.27 & 1.39\\

259:            \hline

260:     \end{tabular}

261:   \end{center}

262:   \caption{Mean Quadratic Error according to the number of suppressed values}

263:

264: \end{table}

265:

266: We notice that the error remains small as long as we do not suppress more than 3 values a year.
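
The percentages in the table are the number of suppressed values per year divided by the 11 variables, for instance $3/11\simeq 0.27$ and $8/11\simeq 0.73$. The error itself is not defined explicitly above; a natural reading, stated here as an assumption, is the average squared difference between the suppressed values and their estimates,
$$\frac{1}{|S|}\sum_{(j,k)\in S}\left(x_{j,k}-\hat{x}_{j,k}\right)^2,$$
where $S$ (a notation introduced only for this illustration) is the set of suppressed entries, $x_{j,k}$ is the original value of variable $k$ for year $j$, and $\hat{x}_{j,k}$ is the corresponding code-vector component.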

267:

268: \section{Conclusion}

269:

270: Through these three examples, we have shown that it is possible and desirable to use Kohonen maps when the available observations have missing values. Of course, the estimations and the classes obtained are all the more relevant as the variables are well correlated.

271:

272: Example 2 shows that it can be the only possible method when the data are extremely sparse. Example 3 shows that this method makes it possible to estimate the absent values with good accuracy. The completed data can then be processed with any classical method.

273:

274:

275: \bibliographystyle{asmda2005References}

276:

277: \bibliography{bibliography-cottrell}

278:

279: \end{document}

280: