0703:physics0703281/Paul0307_AdvancesDataAnalysis.tex

1: %\documentclass[smallextended,referee]{svjour3} %

2: \documentclass[smallextended]{svjour3}

3: %\documentclass[global,referee]{svjour3}

4: %\documentclass[global,twocolumn,referee]{svjour}

5:

6:

7: \smartqed  % flush right qed marks, e.g. at end of proof

8: %

9: %

10: % Remove any % below to load the required packages

11: %\usepackage{latexsym}

12: %\usepackage{graphics}

13: \usepackage{graphicx}

14: \usepackage{array}

15: \usepackage{cite}

16: \usepackage{float}

17: \usepackage{subfigure}

18: \floatplacement{figure}{t}

19: %\usepackage{epic}

20: %\usepackage{eepic}

21: \usepackage{amsmath,amssymb,mathrsfs}

22: %\usepackage{enumerate}

23: %\newtheorem{prop}{property}

24: \newtheorem{theo}{Theorem}

25:

26: % Insert the name of "your" journal with the command below:

27: \journalname{Advances in Data Analysis and Classification}

28: %

29: \begin{document}

30: %

31: \title{A Global Algorithm for Clustering Univariate Observations}

32:

33:

34: \author{Nicolas Paul \and

35:         Michel Terre \and

36:         Luc Fety %etc.

37: }

38:

39: \institute{N. Paul, M. Terre, L. Fety \at

40:               Conservatoire National des Arts et Metiers, Electronic and Communications, \\

41:               292 rue Saint-Martin, 75003 PARIS, FRANCE \\

42:               Tel.: 33 1 40 27 25 67\\

43:               Fax: 33 1 40 27 24 81\\

44:               \email{nicolas.paul@cnam.fr}   }

45:

46:

47: \date{Received: date / Accepted: date}

48: % The correct dates will be entered by the editor\maketitle

49:

50: \maketitle

51:

52:

53: \begin{abstract}

54: This paper deals with the clustering of univariate observations: given a set of observations coming from $K$ possible clusters, one has to estimate the cluster means. We propose an algorithm based on the minimization of the ``KP'' criterion we introduced in a previous work. In this paper, we show that the global minimum of this criterion can be reached by first solving a linear system and then calculating the roots of a polynomial of order $K$. The KP global minimum provides a first raw estimate of the cluster means, and a final clustering step makes it possible to recover the cluster means. The relevance of our method and its superiority to the Expectation-Maximization algorithm are illustrated through simulations of various Gaussian mixtures.

55: \keywords{unsupervised clustering \and non-iterative algorithm \and optimization criterion \and univariate observations}

56: % \PACS{PACS code1 \and PACS code2 \and more}

57: % \subclass{MSC code1 \and MSC code2 \and more}

58: \end{abstract}

59:

60:

61: %---------------------------------------------------------

62: % INTRODUCTION

63: %---------------------------------------------------------

64: \section{Introduction}

65: In this paper we focus on the clustering of univariate observations coming from $K$ possible clusters, when the number of clusters is known. One method consists in estimating the observation pdf (a mixture of $K$ pdfs) by associating a kernel with each observation and adding the contributions of all the kernels (Parzen 1962). A search for the pdf modes then leads to the cluster means. The drawback of such a method is that it requires the configuration of extra parameters (kernel design, intervals for the mode search). Alternatively, the Expectation-Maximization (EM) algorithm (Dempster et al. 1977) is the most commonly used method when the mixture densities belong to the same known parameterized family. It is an iterative algorithm that looks for the mixture parameters that maximize the likelihood of the observations. Each EM iteration consists of two steps. The Expectation step estimates the probability for each observation to come from each mixture component. Then, during the Maximization step, these estimated probabilities are used to update the estimation of the mixture parameters. One can show that this procedure converges to one maximum (local or global) of the likelihood (Dempster et al. 1977). If the mixture components do not belong to a common and known parameterized family, the EM algorithm does not directly apply. Yet, if the component densities do not overlap too much, some clustering methods can be used to cluster the data and calculate the cluster means: in (Fisher 1958) an algorithm is proposed to compute the $K$-partition of the $N$ sorted observations which minimizes the sum of the squares within clusters. Instead of testing the $\binom{N-1}{K-1}$ possible partitions, some relationships between $k$-partitions and $(k+1)$-partitions are used to recursively compute the optimal $K$-partition. The main drawbacks of this method are a high sensitivity to potential differences between the cluster variances and a complexity in $\text{O}(KN^2)$ (Fitzgibbon 2000). 
Among the clustering methods, the K-means algorithm (Hartigan 1977) is one of the most popular. It is an iterative algorithm which groups the data into $K$ clusters in order to minimize an objective function such as the sum of the squared Euclidean distances between each point and its cluster mean. The main drawback of K-means or EM is the potential convergence to some local extremum of the criterion they use. Some solutions consist, for instance, in using smart initializations (see (McLachlan and Peel 2000) and (Lindsay and Furman 1994) for EM, (Bradley and Fayyad 1998) for K-means) or stochastic optimization, to become less sensitive to the initialization (see (Celeux et al. 1995) and (Pernkopf and Bouchaffra 2005) for EM, (Krishna and Murty 1999) for K-means). Another drawback of these methods is the convergence speed, which can be very slow when the number of observations is high. A survey of the clustering techniques can be found in (Berkhin 2006) and (Xu and Wunsch 2005). In this contribution, we propose a non-iterative algorithm which mainly consists in calculating the minimum of the ``K-Product'' (KP) criterion we first introduced in (Paul et al. 2006): if $\{z_n\}_{n\in\{1 \cdots N\}}$ is a set of $N$ observations, $K$ the known number of clusters and $\{x_k\}_{k\in\{1 \cdots K\}}$ any vector of $\mathbb{R}^K$, we define the KP criterion as the sum, over the $N$ observations, of the $K$-term products $\prod_{k=1}^{K}(z_n-x_k)^2$. The main motivation for using such a criterion is that, though it provides a slightly biased estimation of the cluster means, its global minimum can be reached by first solving a linear system and then calculating the roots of a polynomial of order $K$. Once these $K$ roots have been obtained, a final clustering step assigns each observation to the closest root and calculates the resulting cluster means. Another advantage of the proposed method is that it does not require the configuration of any extra parameters. 
The rest of the paper is organized as follows: in section 2 the observation model is presented and the criterion is defined. In section 3 the global minimum of the criterion is calculated theoretically. In section 4 the cluster estimation algorithm is described. Section 5 presents simulation results which illustrate the performance of the algorithm on different Gaussian mixtures: mixtures of three, six and nine components have been simulated with various configurations (common/different mixing weights, common/different variances). Conclusions are finally given in section 6.

66:

67: %------------------------------------------------------------------------------------

68: % MODEL AND DEFINITION

69: %------------------------------------------------------------------------------------

70: \section{Observation model and criterion definition}

71: Let $\{ a_k \}_{k\in\{1 \cdots K\}}$ be a set of $K$ different values of $\mathbb{R}$, let $\textbf{a}$ be the vector of cluster means defined by $\textbf{a}\stackrel{\Delta}{=}(a_1,a_2,\cdots,a_K)^t$, let $\{ \pi_k \}_{k\in\{1 \cdots K\}}$ be a set of $K$ mixing weights (prior probabilities) that sum up to one and let $\{g_k\}_{k\in\{1 \cdots K\}}$ be a set of $K$ zero-mean densities. The probability density function of the multimodal observation $z$ is a finite mixture given by:

72: \begin{equation}

73: f(z)=\sum_{k=1}^{K}{ \pi_k g_k(z-a_k) } \notag

74: %\label{mixture2}

75: \end{equation}

76: \noindent Note that the form of the densities $g_k$ is usually not known by the estimator and that the $g_k$ do not necessarily belong to the same parameterized family. Now let $\{ z_n \}_{n\in\{1 \cdots N\}}$ be a set of $N$ observations in $\mathbb{R}$. In all the following we assume that $N$ is greater than $K$ and that the number of different observations is greater than $K-1$. The KP criterion $J(\textbf{x})$ is defined by:

77: \begin{equation}

78: J: \mathbb{R}^K\rightarrow \mathbb{R}^+: \

79: \textbf{x} \rightarrow \sum_{n=1}^{N}{ \prod_{k=1}^{K}{\left(z_n-x_k\right)^2} }

80: \label{definition_J} % label necessaire

81: \end{equation}

82: \noindent Note the difference with the K-means criterion which can be written (for the square Euclidean distance):

83: \begin{equation}

84: \text{K-means}: \mathbb{R}^K\rightarrow \mathbb{R}^+: \

85: \textbf{x} \rightarrow

86: \sum_{n=1}^{N}{ \underset{k\in\{1 \cdots K\}}{\text{min}} (z_n-x_k)^2 } \notag

87: \end{equation}

88: \noindent The KP criterion \eqref{definition_J} is clearly nonnegative for any vector $\textbf{x}$. The first intuitive motivation for defining this criterion is its behavior in the limiting case where all the $g_k$ variances are null. In this case, every observation is equal to one of the $a_k$ and therefore $J(\textbf{a})=0$. $J(\textbf{x})$ is then minimal when $\textbf{x}$ is equal to $\textbf{a}$ or any of its $K!$ permutations. The second motivation is that, in the general case, $J$ has $K!$ minima that are the $K!$ permutations of one single vector, which can be reached by solving a linear system and then finding the roots of a polynomial of order $K$. This is shown in section 3.

89:

90: % SECTION 3 KPRODUCT MINIMUM

91: %---------------------------

92: \section{KP global minimum}

93: We first give in section \ref{section3A} some useful definitions which are needed in section \ref{section3B} to reach the global minimum of $J$.

94: \subsection{Some useful definitions}

95: \label{section3A}

96: To any observation $z_n$ we associate the vector $\textbf{z}_n$ defined by:

97: \begin{equation}

98: \textbf{z}_n\stackrel{\Delta}{=}(z_n^{K-1}, z_n^{K-2} \cdots ,1)^t, \ \ \textbf{z}_n \in \mathbb{R}^K

99: \label{definition_zn} % label necessaire

100: \end{equation}

101: \noindent The vector $\textbf{z}$ and the Hankel matrix $\textbf{Z}$ are then respectively defined by:

102: \begin{equation}

103: \textbf{z}\stackrel{\Delta}{=}\sum_{n=1}^{N}{z_n^K \textbf{z}_n}, \ \ \textbf{z} \in \mathbb{R}^K

104: \label{definition_z} % label necessaire

105: \end{equation}

106: \begin{equation}

107: \textbf{Z}\stackrel{\Delta}{=}\sum_{n=1}^{N}{ \textbf{z}_n \textbf{z}_n^t }, \ \ \textbf{Z} \in \mathbb{R}^{K \times K}

108: \label{definition_Z} % label necessaire

109: \end{equation}

110: \noindent The matrix \textbf{Z} is regular if the number of different observations is greater than $K-1$ (one explanation is detailed in Appendix A).\\

111: \\

112: \noindent Now let $\textbf{y}=(y_1,\cdots,y_K)^t$ be a vector of $\mathbb{R}^K$. We define the polynomial of order $K$ $q_\textbf{y}(\alpha)$ as:

113: \begin{equation}

114: q_{\textbf{y}}(\alpha)\stackrel{\Delta}{=} \alpha^K-\sum_{k=1}^{K}{\alpha^{K-k}y_k}

115: \label{definition_qy} % label necessaire

116: \end{equation}

117: \noindent If $\textbf{r}=(r_1,\cdots,r_K)^t$ is a vector of $\mathbb{C}^K$ containing the $K$ roots of $q_{\textbf{y}}(\alpha)$, the factored form of $q_{\textbf{y}}(\alpha)$ is:

118: \begin{equation}

119: q_{\textbf{y}}(\alpha)=\prod_{k=1}^{K}(\alpha-r_k) \notag

120: %\label{qy_factorise}

121: \end{equation}

122: %\begin{multline} % multiline pour papier final (voir autre suggest ieee)

123: \begin{equation}

124: q_{\textbf{y}}(\alpha)=\alpha^K-(r_1+\cdots+r_K)\alpha^{K-1}+\cdots

125: +(-1)^K(r_1 r_2 \cdots r_K) \notag

126: %\label{qy_esp1}

127: \end{equation}

128: %\end{multline}

129: \begin{equation}

130: q_{\textbf{y}}(\alpha)=\alpha^K-\sum_{k=1}^{K}{\alpha^{K-k}w_k(\textbf{r})} \notag

131: %\label{qy_esp2}

132: \end{equation}

133: \noindent where $w_k(\textbf{r})$ is the Elementary Symmetric Polynomial (ESP) in the variables ${r_1,\cdots,r_K}$ defined by:

134: \begin{equation}

135: w_k(\textbf{r}) \stackrel{\Delta}{=} (-1)^{k+1} \sum_{ \substack{ \{j_1,\cdots,j_k\} \in \{1\cdots K\}^k \\ j_1<\cdots<j_k \leqslant K } }{r_{j_1}.r_{j_2}\cdots.r_{j_k}}

136: \label{definition_wk} % label necessaire

137: \end{equation}

138: \noindent For instance, for $K=3$, we have:

139: \begin{equation}

140: w_1(\textbf{r})=r_1+r_2+r_3 \notag

141: \end{equation}

142: \begin{equation}

143: w_2(\textbf{r})=-(r_1r_2+r_2r_3+r_1r_3) \notag

144: \end{equation}

145: \begin{equation}

146: w_3(\textbf{r})=r_1r_2r_3 \notag

147: \end{equation}

148: \noindent If we call $\textbf{w}(\textbf{r})$ the vector of ESP of $\textbf{r}$ defined by:

149: \begin{equation}

150: \textbf{w}(\textbf{r})\stackrel{\Delta}{=}(w_1(\textbf{r}),\cdots,w_K(\textbf{r}))^t

151: \label{definition_w} % label necessaire

152: \end{equation}

153: \noindent the relationship between the roots and coefficients of $q_{\textbf{y}}(\alpha)$ becomes:

154: \begin{equation}

155: \textbf{y}=\textbf{w}(\textbf{r}) \Leftrightarrow \forall k \in \{1\cdots K\} \ q_{\textbf{y}}(r_k)=0

156: \label{roots_coefficients} % label necessaire

157: \end{equation}

158:

159: % THEOREM AND PROOF

160: %--------------------------

161: \subsection{The KP minimum}

162: \label{section3B}

163: The global minimum of $J$ is given by theorem 1:

164: %The main idea is to express $J(\textbf{x})$ as a function of $\textbf{w}(\textbf{x})$: using definitions \eqref{definition_zn} and \eqref{definition_w}, the development of each term of the sum in $J$ leads to

165: %$J(\textbf{x})=\sum_{n=1}^{N}{ \left({z_n^K-\textbf{z}_n^t\textbf{w}(\textbf{x})}\right)^2}$. Therefore, the minimization of $J$ becomes a least square minimization in the variable $\textbf{w}(\textbf{x})$. The vector $\textbf{y}_{min}$ which minimizes $\sum_{n=1}^{N}{ \left({z_n^K-\textbf{z}_n^t\textbf{y}}\right)^2}$ can be easily obtained. Now if $\textbf{x}_{min}$ is a vector such as  $\textbf{y}_{min}=\textbf{w}(\textbf{x}_{min})$ and $\textbf{x}_{min} \in \mathbb{R}^K$, then $\textbf{x}_{min}$ is clearly a minimum of $J$. According to \eqref{roots_coefficients}, $\textbf{x}_{min}$ has to contain the $K$ roots of $q_{\textbf{y}_{min}}(\alpha)$ to have $\textbf{y}_{min}=\textbf{w}(\textbf{x}_{min})$. The difficult part is to show that these $K$ roots of $q_{\textbf{y}_{min}}(\alpha)$ are always real:

166: \begin{theo}

167: If $\textbf{y}_{min}$ is the solution of $\textbf{Z}.\textbf{y}_{min}=\textbf{z}$ (where $\textbf{z}$ and $\textbf{Z}$ have been defined in \eqref{definition_z} and \eqref{definition_Z}) and if $\textbf{x}_{min}$ is a vector containing, in any order, the $K$ roots of $q_{\textbf{y}_{min}}(\alpha)$ (defined in \eqref{definition_qy}), then $\textbf{x}_{min}$ belongs to $\mathbb{R}^K$ and $\textbf{x}_{min}$ is the global minimum of $J$.

168: \end{theo}

169: The proof is given in appendix B.
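\noindent The following sketch, which is not a substitute for the proof of appendix B, indicates where the linear system of theorem 1 comes from. Using \eqref{definition_zn}, \eqref{definition_qy} and \eqref{definition_w}, each product in \eqref{definition_J} can be written $\prod_{k=1}^{K}(z_n-x_k)=q_{\textbf{w}(\textbf{x})}(z_n)=z_n^K-\textbf{z}_n^t\textbf{w}(\textbf{x})$, so that $J$ is a least-squares cost in the variable $\textbf{w}(\textbf{x})$. The auxiliary function $L(\textbf{y})$ is a notation introduced only for this sketch:
\begin{equation}
J(\textbf{x})=\sum_{n=1}^{N}{\left(z_n^K-\textbf{z}_n^t\textbf{w}(\textbf{x})\right)^2}=L(\textbf{w}(\textbf{x})), \ \ L(\textbf{y})\stackrel{\Delta}{=}\sum_{n=1}^{N}{\left(z_n^K-\textbf{z}_n^t\textbf{y}\right)^2} \notag
\end{equation}
\begin{equation}
\nabla L(\textbf{y})=-2\sum_{n=1}^{N}{\textbf{z}_n\left(z_n^K-\textbf{z}_n^t\textbf{y}\right)}=-2\left(\textbf{z}-\textbf{Z}\textbf{y}\right) \notag
\end{equation}
\noindent Since $\textbf{Z}$ is a sum of outer products and is regular, it is positive definite, and $L$ has a unique minimizer, which is the solution $\textbf{y}_{min}$ of $\textbf{Z}.\textbf{y}_{min}=\textbf{z}$. If the roots of $q_{\textbf{y}_{min}}(\alpha)$ are real, the vector $\textbf{x}_{min}$ of these roots satisfies $\textbf{w}(\textbf{x}_{min})=\textbf{y}_{min}$ by \eqref{roots_coefficients}, hence $J(\textbf{x}_{min})=L(\textbf{y}_{min})\leqslant L(\textbf{w}(\textbf{x}))=J(\textbf{x})$ for any $\textbf{x}$ of $\mathbb{R}^K$. That the roots are always real is the part established in appendix B.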

170:

171:

172: % CLUSTERS ESTIMATION ALGORITHM

173: \section{Clusters estimation algorithm}

174: The cluster estimation algorithm consists of two steps. In the first step, the minimum of $J$, $\textbf{x}_{min}=(x_{1,min},...,x_{K,min})^t$, is calculated, giving a first raw estimation of the set of cluster means. This first estimate is slightly biased: for instance, for a Gaussian mixture with two balanced components centred on $-a$ and $a$ and a common standard deviation $\sigma$, the asymptotic solution of $\textbf{Z}.\textbf{y}_{min}=\textbf{z}$ is $\textbf{y}_{min}=(0,a^2+\sigma^2)^t$ and the roots of $q_{\textbf{y}_{min}}(\alpha)$ are:

175: \begin{equation}

176: \textbf{x}_{min}=\left(-a\sqrt{1+\frac{\sigma^2}{a^2}},a\sqrt{1+\frac{\sigma^2}{a^2}}\right) \notag

177: \end{equation}
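\noindent This two-component example can be checked with the moments of the mixture. The sketch below relies on the stated assumptions (balanced Gaussian components centred on $-a$ and $a$, common standard deviation $\sigma$, $N\rightarrow\infty$); $\text{E}[\cdot]$ denotes the expectation under the mixture, a notation introduced here. For $K=2$, $\textbf{z}_n=(z_n,1)^t$ and, by symmetry of the mixture, $\text{E}[z]=\text{E}[z^3]=0$ while $\text{E}[z^2]=a^2+\sigma^2$, so that:
\begin{equation}
\frac{1}{N}\textbf{Z}\rightarrow\begin{pmatrix} \text{E}[z^2] & \text{E}[z] \\ \text{E}[z] & 1 \end{pmatrix}=\begin{pmatrix} a^2+\sigma^2 & 0 \\ 0 & 1 \end{pmatrix}, \ \ \frac{1}{N}\textbf{z}\rightarrow\begin{pmatrix} \text{E}[z^3] \\ \text{E}[z^2] \end{pmatrix}=\begin{pmatrix} 0 \\ a^2+\sigma^2 \end{pmatrix} \notag
\end{equation}
\begin{equation}
\textbf{y}_{min}=(0,a^2+\sigma^2)^t, \ \ q_{\textbf{y}_{min}}(\alpha)=\alpha^2-(a^2+\sigma^2) \notag
\end{equation}
\noindent whose roots are $\pm\sqrt{a^2+\sigma^2}$. For instance, with $a=1$ and $\sigma=0.25$, the roots are approximately $\pm 1.031$, so each raw estimate is shifted away from the true mean by about $0.03$.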

178: \noindent Therefore, in a second step, each observation $z_n$ is assigned to the nearest $x_{k,min}$, $K$ clusters are formed, and the final estimated cluster means are calculated. The algorithm steps and their complexities are illustrated in table~\ref{algo}. The total complexity is in O$(NK+K^2)$, which is equivalent to O$(NK)$ since $N$ is greater than $K$.

179: \begin{table}

180: \renewcommand{\arraystretch}{1.3}

181: \caption{KP algorithm steps and complexities}

182: \label{algo}

183: \centering

184: \begin{tabular} {c}

185: \hline

186: \bfseries step 1: calculate a minimum of J \\

187: \hline

188: calculate $\textbf{Z}$ and $\textbf{z}$: O$(NK)$ \\

189: %calculate $\textbf{y}_{min}$ by solving (\ref{ymin_obtention}): o$(K^2)$ \\

190: calculate $\textbf{y}_{min}$ by solving $\textbf{Z}.\textbf{y}_{min}=\textbf{z}$: O$(K^2)$ \\

191: calculate the roots

192: $(x_{1,min}, \cdots ,x_{K,min})$

193: of $q_{\textbf{y}_{min}}(\alpha)$: O$(K^2)$ \\

194: \hline

195: \bfseries step 2: clustering and cluster means estimation \\

196: \hline

197: assign each $z_n$ to the closest $x_{k,min}$: O$(NK)$ \\

198: calculate the K means of the resulting clusters: O$(N)$ \\

199: \hline

200: \end{tabular}

201: \end{table}
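\noindent As an illustration, the steps of table~\ref{algo} can be followed by hand on a small data set. The four observations below are hypothetical values chosen for this sketch, with $K=2$ and $\textbf{z}_n=(z_n,1)^t$:
\begin{equation}
\{z_n\}=\{-1.1,\ -0.9,\ 0.9,\ 1.1\}, \ \ \sum_{n}{z_n}=0, \ \ \sum_{n}{z_n^2}=4.04, \ \ \sum_{n}{z_n^3}=0 \notag
\end{equation}
\begin{equation}
\textbf{Z}=\begin{pmatrix} 4.04 & 0 \\ 0 & 4 \end{pmatrix}, \ \ \textbf{z}=\begin{pmatrix} 0 \\ 4.04 \end{pmatrix}, \ \ \textbf{y}_{min}=\begin{pmatrix} 0 \\ 1.01 \end{pmatrix}, \ \ q_{\textbf{y}_{min}}(\alpha)=\alpha^2-1.01 \notag
\end{equation}
\noindent The roots are $\textbf{x}_{min}=(-\sqrt{1.01},\sqrt{1.01})^t\approx(-1.005,1.005)^t$. In the second step, the observations $-1.1$ and $-0.9$ are assigned to the first root and the observations $0.9$ and $1.1$ to the second one, which gives the final cluster means $-1$ and $1$. On this toy example the raw KP minimum is slightly shifted outward and the clustering step removes this shift.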

202:

203: % SIMULATION

204: \section{Simulations}

205: Several types of Gaussian mixtures have been considered. The number of components (clusters) is equal to three (scenario A), six (scenario B) and nine (scenario C). In scenario A, the distance between two successive cluster centers (component means) is equal to one. In scenario B and in scenario C, the distance between two successive centers is equal to one or two. For each scenario ``X'', four cases have been studied: common variance and common mixing weight (scenario X.1), different variances and common mixing weight (scenario X.2), common variance and different mixing weights (scenario X.3) and different variances and different mixing weights (scenario X.4). A summary of all the scenarios is given in Tables 2, 3 and 4. The number of observations ($N$) per simulation run is equal to 100 in scenario A, 200 in scenario B and 300 in scenario C.

206:

207: % TABLEAU DES SCENARIO A (3 CLUSTERS)

208: \begin{table}

209: \caption{simulation scenario A}

210: \begin{center}{

211: \begin{tabular}{|c|c c|c c|c c|c c|}

212: \cline{2-9}

213: \multicolumn{1}{c|}{} & \multicolumn{2}{c|}{scenario A.1} & \multicolumn{2}{c|}{scenario A.2} & \multicolumn{2}{c|}{scenario A.3} & \multicolumn{2}{c|}{scenario A.4} \\[2mm]

214: \hline

215: means & variances & prior & variances & prior & variances & prior & variances & prior \\[2mm]

216: \hline

217: $0$ & $\sigma^2$ & $\frac{1}{3}$ & $\sigma^2$ & $\frac{1}{3}$ & $\sigma^2$ & $0.4$ & $\sigma^2$ & $0.4$ \\[2mm]

218: \hline

219: $1$ & $\sigma^2$ & $\frac{1}{3}$ & $\frac{\sigma^2}{2}$ & $\frac{1}{3}$ & $\sigma^2$ & $0.4$ & $\frac{\sigma^2}{2}$ & $0.4$ \\[2mm]

220: \hline

221: $2$ & $\sigma^2$ & $\frac{1}{3}$ & $\sigma^2$ & $\frac{1}{3}$ & $\sigma^2$ & $0.2$ & $\sigma^2$ & $0.2$ \\[2mm]

222: \hline

223: %\multicolumn{9}{l}{Table 1: simulation scenario A} \\

224: \end{tabular}}

225: \end{center}

226: \end{table}

227: %{description of sub-scenarios A}

228:

229: % TABLEAU DES SCENARIO B (6 CLUSTERS)

230: \begin{table}

231: \caption{simulation scenario B}

232: \begin{center}{

233: %\begin{tabular}{|c|c|c|c|c|c|c|c|c|}

234: \begin{tabular}{|c|c c|c c|c c|c c|}

235: %\hline

236: %\cline{2-9}

237: %\multicolumn{1}{c|}{} & \multicolumn{8}{c|}{scenario B} \\

238: %\hline

239: \cline{2-9}

240: \multicolumn{1}{c|}{} & \multicolumn{2}{c|}{scenario B.1} & \multicolumn{2}{c|}{scenario B.2} & \multicolumn{2}{c|}{scenario B.3} & \multicolumn{2}{c|}{scenario B.4} \\[2mm]

241: \hline

242: means & variances & prior & variances & prior & variances & prior & variances & prior \\[2mm]

243: \hline

244: $0$ & $\sigma^2$ & $\frac{1}{6}$ & $\sigma^2$ & $\frac{1}{6}$ & $\sigma^2$ & $0.2$ & $\sigma^2$ & $0.2$ \\[2mm]

245: \hline

246: $1$ & $\sigma^2$ & $\frac{1}{6}$ & $\frac{\sigma^2}{2}$ & $\frac{1}{6}$ & $\sigma^2$ & $0.2$ & $\frac{\sigma^2}{2}$ & $0.2$ \\[2mm]

247: \hline

248: $2$ & $\sigma^2$ & $\frac{1}{6}$ & $\sigma^2$ & $\frac{1}{6}$ & $\sigma^2$ & $0.1$ & $\sigma^2$ & $0.1$ \\[2mm]

249: \hline

250: $4$ & $\sigma^2$ & $\frac{1}{6}$ & $\frac{\sigma^2}{2}$ & $\frac{1}{6}$ & $\sigma^2$ & $0.2$ & $\frac{\sigma^2}{2}$ & $0.2$ \\[2mm]

251: \hline

252: $5$ & $\sigma^2$ & $\frac{1}{6}$ & $\sigma^2$ & $\frac{1}{6}$ & $\sigma^2$ & $0.2$ & $\sigma^2$ & $0.2$ \\[2mm]

253: \hline

254: $6$ & $\sigma^2$ & $\frac{1}{6}$ & $\frac{\sigma^2}{2}$ & $\frac{1}{6}$ & $\sigma^2$ & $0.1$ & $\frac{\sigma^2}{2}$ & $0.1$ \\[2mm]

255: \hline

256: \end{tabular}}

257: \end{center}

258: \end{table}

259:

260: % TABLE FOR SCENARIO C (9 CLUSTERS)

261: \begin{table}

262: \caption{simulation scenario C}

263: \begin{center}{

264: %\begin{tabular}{|c|c|c|c|c|c|c|c|c|}

265: \begin{tabular}{|c|c c|c c|c c|c c|}

266: %\hline

267: %\cline{2-9}

268: %\multicolumn{1}{c|}{} & \multicolumn{8}{c|}{scenario C} \\

269: %\hline

270: \cline{2-9}

271: \multicolumn{1}{c|}{} & \multicolumn{2}{c|}{scenario C.1} & \multicolumn{2}{c|}{scenario C.2} & \multicolumn{2}{c|}{scenario C.3} & \multicolumn{2}{c|}{scenario C.4} \\[2mm]

272: \hline

273: means & variances & prior & variances & prior & variances & prior & variances & prior \\[2mm]

274: \hline

275: $0$ & $\sigma^2$ & $\frac{1}{9}$ & $\sigma^2$ & $\frac{1}{9}$ & $\sigma^2$ & $\frac{2}{15}$ & $\sigma^2$ & $\frac{2}{15}$ \\[2mm]

276: \hline

277: $1$ & $\sigma^2$ & $\frac{1}{9}$ & $\frac{\sigma^2}{2}$ & $\frac{1}{9}$ & $\sigma^2$ & $\frac{2}{15}$ & $\frac{\sigma^2}{2}$ & $\frac{2}{15}$ \\[2mm]

278: \hline

279: $2$ & $\sigma^2$ & $\frac{1}{9}$ & $\sigma^2$ & $\frac{1}{9}$ & $\sigma^2$ & $\frac{1}{15}$ & $\sigma^2$ & $\frac{1}{15}$ \\[2mm]

280: \hline

281: $4$ & $\sigma^2$ & $\frac{1}{9}$ & $\sigma^2$ & $\frac{1}{9}$ & $\sigma^2$ & $\frac{1}{15}$ & $\sigma^2$ & $\frac{1}{15}$ \\[2mm]

282: \hline

283: $5$ & $\sigma^2$ & $\frac{1}{9}$ & $\frac{\sigma^2}{2}$ & $\frac{1}{9}$ & $\sigma^2$ & $\frac{3}{15}$ & $\frac{\sigma^2}{2}$ & $\frac{3}{15}$ \\[2mm]

284: \hline

285: $6$ & $\sigma^2$ & $\frac{1}{9}$ & $\sigma^2$ & $\frac{1}{9}$ & $\sigma^2$ & $\frac{1}{15}$ & $\sigma^2$ & $\frac{1}{15}$ \\[2mm]

286: \hline

287: $8$ & $\sigma^2$ & $\frac{1}{9}$ & $\sigma^2$ & $\frac{1}{9}$ & $\sigma^2$ & $\frac{2}{15}$ & $\sigma^2$ & $\frac{2}{15}$ \\[2mm]

288: \hline

289: $9$ & $\sigma^2$ & $\frac{1}{9}$ & $\frac{\sigma^2}{2}$ & $\frac{1}{9}$ & $\sigma^2$ & $\frac{2}{15}$ & $\frac{\sigma^2}{2}$ & $\frac{2}{15}$ \\[2mm]

290: \hline

291: $10$ & $\sigma^2$ & $\frac{1}{9}$ & $\sigma^2$ & $\frac{1}{9}$ & $\sigma^2$ & $\frac{1}{15}$ & $\sigma^2$ & $\frac{1}{15}$ \\[2mm]

292: \hline

293: \end{tabular}}

294: \end{center}

295: \end{table}

296:

297: %For each simulation run, a set of $N=100$ observations is generated from the mixture described in  ~\eqref{mixture2} with $K=4$ unbalanced components, %$\{\pi_k\}=\{0.15 \quad 0.35 \quad 0.15 \quad 0.35\}$ and equispaced means $\textbf{a}=(0,d,2d,3d)^t$. $d$ is the distance between two successive %clusters. A high (resp. low) value of d indicates separated (resp. overlapping) component densities. The component densities $g_1$ and $g_3$ are %Laplace functions given by $g(v)=\dfrac{\lambda}{2}e^{-\lambda |v|}$ with a variance $\dfrac{2}{\lambda^2}=10^{-2}$ and the component densities $g_2$ %and $g_4$ are Gaussian function with variance $\sigma^2=10^{-2}$. An example of observation distribution is shown in figure 1 for $d=1$ (well separated %components) and for $d=0.5$ (overlapping components). The number of modes is supposed to be known in each method. 10000 runs have been performed.

298:

299: % EM ALGORITHM

300: \noindent To evaluate the performance of our proposal we compare it to the classical EM algorithm. To estimate the parameters of a Gaussian mixture, the EM algorithm proceeds as follows (Dempster et al. 1977): if $\hat{\beta}^{(ite)}_{n,k}$ is the estimated probability that $z_n$ comes from cluster $k$ and if $\hat{\pi}^{(ite)}_k$, $\hat{a}^{(ite)}_k$ and $\hat{\sigma}^{(ite)}_k$ are respectively the estimated prior, mean and standard deviation of cluster $k$ at iteration $ite$, then the estimations at iteration $ite+1$ are given by: \\

301: \noindent Expectation step:

302: \begin{equation}

303: \hat{\beta}^{(ite+1)}_{n,k}=

304: \frac{\frac{\hat{\pi}^{(ite)}_k}{\sqrt{2\pi}\hat{\sigma}^{(ite)}_k}

305: \text{exp}\left(-\frac{1}{2}\left(\frac{z_n-\hat{a}^{(ite)}_k}{\hat{\sigma}^{(ite)}_k}\right)^2\right)}

306: {\sum\limits_{j=1}^{K}{ \frac{\hat{\pi}^{(ite)}_j}{\sqrt{2\pi}\hat{\sigma}^{(ite)}_j} \text{exp}\left(-\frac{1}{2}\left(\frac{z_n-\hat{a}^{(ite)}_j}{\hat{\sigma}^{(ite)}_j}\right)^2\right)}} \notag

307: \end{equation}

308: \noindent Maximization step:

309: \begin{equation}

310: \hat{\pi}^{(ite+1)}_k=\frac{ \sum_{n=1}^{N}{ \hat{\beta}^{(ite+1)}_{n,k}} }{N} \notag

311: \end{equation}

312: \begin{equation}

313: \hat{a}^{(ite+1)}_k=\frac{ \sum_{n=1}^{N}{ \hat{\beta}^{(ite+1)}_{n,k} }z_n }{ \sum_{n=1}^{N}{ \hat{\beta}^{(ite+1)}_{n,k} }} \notag

314: \end{equation}

315: \begin{equation}

316: \left(\hat{\sigma}^{(ite+1)}_k\right)^2=\frac{ \sum_{n=1}^{N}{\hat{\beta}^{(ite+1)}_{n,k}}\left(z_n-\hat{a}^{(ite+1)}_k\right)^2 }{ \sum_{n=1}^{N}{\hat{\beta}^{(ite+1)}_{n,k}}} \notag

317: \end{equation}

318: This iterative procedure converges to one maximum of the likelihood function $ \prod\limits_{n=1}^{N}{ P(z_n | \left\{ \hat{a}_k, \hat{\sigma}_k, \hat{\pi}_k \right\}_{k\in \{1 \cdots K\}})}$. To initialize the EM algorithm in our simulations, $K$ cluster means $\hat{a}^{(0)}_k$ are randomly chosen with a uniform draw in the observation zone $[\text{min}(z_n) \quad \text{max}(z_n)]$. For each $n$, $\hat{\beta}^{(1)}_{n,k}$ is set to one if $\hat{a}^{(0)}_k$ is the closest cluster mean to the observation $z_n$ and $\hat{\beta}^{(1)}_{n,k}$ is set to zero otherwise. This initialization is repeated until each cluster contains at least one observation. Then the EM starts with a maximization step. The algorithm is stopped if none of the estimated parameters changes between two EM steps or if a maximal number of $100$ iterations is reached.

319:

320: % PERFORMANCE EVALUATION

321: The clustering performances are evaluated as follows: to get rid of the permutation ambiguity, for each simulation run $r$ and estimation $\hat{\textbf{a}}_r$, the performance criterion $e_r$ is defined as the maximal absolute distance between the true and estimated sorted vector of cluster means:

322: \begin{equation}

323: e_r\stackrel{\Delta}{=}\text{N} \left( \text{sort}(\textbf{a})-\text{sort}(\hat{\textbf{a}}_r)\right) \notag

324: \end{equation}

325: where $\text{N}(\textbf{x})\stackrel{\Delta}{=} \underset{k \in \{1\cdots K\}}{\text{max}} |x_k|$.
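\noindent As a numerical illustration of this criterion, consider scenario A, where $\textbf{a}=(0,1,2)^t$, together with a hypothetical estimate $\hat{\textbf{a}}_r$ (the values below are made up for this example and are not simulation results):
\begin{equation}
\hat{\textbf{a}}_r=(1.1,\ -0.05,\ 2.2)^t, \ \ \text{sort}(\hat{\textbf{a}}_r)=(-0.05,\ 1.1,\ 2.2)^t \notag
\end{equation}
\begin{equation}
e_r=\text{N}\left((0.05,\ -0.1,\ -0.2)^t\right)=0.2 \notag
\end{equation}
\noindent Sorting both vectors removes the dependence on the order in which the estimated means are returned, and $e_r$ is driven by the worst estimated mean, here the third one.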

326:

327: The distribution of $e_r$ is given in figure 1 for scenario A.1 with $\sigma=0.25$ and 10000 simulation runs. The KP minimum is a biased estimate: $e_r$ is greater than $0.1$ for $90\%$ of the runs. Yet, $e_r$ remains less than $0.2$ for $80\%$ of the runs. The ``full KP'' algorithm (calculation of the KP minimum followed by a clustering step) then provides an accurate set of estimates: $e_r$ remains less than $0.1$ (resp. $0.2$) for $80\%$ (resp. $100\%$) of the runs. With the EM algorithm, $e_r$ is less than $0.1$ (resp. $0.2$) for $45\%$ (resp. $65\%$) of the runs but $e_r$ is greater than $0.5$ for $30\%$ of the runs. In these runs, EM gets stuck at a local maximum of the likelihood. Typically one estimated cluster mean is located midway between two true cluster means (with an overestimated variance), while two other estimated cluster means are close to the same true cluster mean. In figures 2, 3 and 4 we present the EM and KP performances for all the scenarios with different values of $\sigma$. In scenario A (scenarios A.1 to A.4) the KP estimation is very accurate when $\sigma$ is less than $0.2$ ($e_r$ is less than $0.1$ for $95\%$ of the runs) and remains correct for $\sigma<0.3$ ($e_r$ is less than $0.2$ for $95\%$ of the runs). In contrast, EM can provide a wrong set of estimated clusters as soon as $\sigma$ is not null: for instance, when $\sigma=0.1$, $e_r$ is greater than $0.5$ for $25\%$ of the runs. When the mixture components strongly overlap ($\sigma>0.5$) the two methods lead to wrong estimations, with a slight superiority of EM when $\sigma>0.8$. The KP algorithm remains superior to EM in scenario B and in scenario C. In scenario B (6 clusters), KP is robust for any $\sigma$ less than $0.15$, while, for $\sigma=0.1$, EM converges to a wrong set of cluster means for $75\%$ of the runs. 
In scenario C (9 clusters), KP is robust for any $\sigma$ less than $0.05$, while, for $\sigma=0.02$, EM converges to a wrong set of cluster means for $70\%$ of the runs. In each scenario, the mixing weight configuration (balanced/unbalanced) has a slight influence on the KP algorithm: the performances on sub-scenarios X.3 and X.4 (different mixing weights) are weaker than the performances on sub-scenarios X.1 and X.2 (common mixing weights). Yet the KP performances on the unbalanced mixtures remain markedly better than the EM performances.

328:

329:

330: %for the well-separated case $d=1$. With the K-means algorithm, $e_r$ is less than $0.1$ for $60\%$ of the run. Yet, for $39\%$ of the run, $e_r$ is %greater than $0.5$, which corresponds to a poor estimation of the cluster means. In this case the K-means method has converged to a local minimum of its %cost function. Typically, one estimated cluster mean is located in the middle of two true cluster means while two other estimated cluster means are %closed to the same true cluster means. The KP minimum is a slightly biased estimator: $e_r$ is greater than $0.1$ for $60\%$ of the run, but remains %less than $0.2$ for $91\%$ of the run. The full KP algorithm (calculation of the KP minimum followed by a clustering) always provides an accurate set of %estimations: $e_r$ remains less than $0.1$ (resp. $0.2$) for $99.8\%$ (resp. $100\%$) of the run. We have then studied the performances of the full KP %algorithm for different inter-cluster distance $d$, keeping the same common variance ($10^{-2}$) for all the component densities. For each value of $d$, %10000 run of 100 observations have been performed. In figure 3 we give, for each value of $d$, the probability for $e_r$ to be less than $0.1$, to be %less than $0.2$ and to be greater than $0.5$. For $d\geq0.7$, the KP algorithm provides perfect estimation of the cluster means. When $d=0.5$, which %corresponds to overlapping component density, the performances remains correct: $e_r$ is less than 0.2 for $80\%$ of the run. For low value $d<0.2$ the %estimation is poor: $e_r$ is always greater than $0.5$.

331:

332: % ER DISTRIBUTION ON SCENARIO A.1

333: %\begin{figure}{

334: %%\begin{center}{

335: %\includegraphics[height=0.4\linewidth]{dessins/scenarioA1_erDist.eps}

336: %%}

337: %%\end{center}

338: %\caption{Estimation performances on scenario A.1 with $\sigma=0.25$. 1000 simulation run have been performed. For each simulation run, 100 observations %have been generated. $e_r$ is the maximal distance between the sorted vector of true cluster means and the sorted vector of estimated cluster means}}

339: %\end{figure}

340:

341: % For one-column wide figures use

342: \begin{figure}

343: \begin{center}{

344: % Use the relevant command for your figure-insertion program

345: % to insert the figure file.

346: % For example, with the option graphics use

347: \resizebox{0.75\textwidth}{!}{%

348:   \includegraphics{dessins/scenarioA1_erDist_10000.eps}

349: }}

350: \end{center}

351: % If not, use

352: %\vspace{5cm}       % Give the correct figure height in cm

353: \caption{Estimation performance on scenario A.1 with $\sigma=0.25$. 10000 simulation runs have been performed. For each simulation run, 100 observations have been generated. $e_r$ is the maximal distance between the sorted vector of true cluster means and the sorted vector of estimated cluster means.}

354: \label{fig:1}       % Give a unique label

355: \end{figure}

356:

357:

358: % PERFORMANCE FIGURE FOR SCENARIO A

359: \begin{figure}{

360: \begin{center}{

361: \begin{subfigure}{

362: \includegraphics[height=0.45\linewidth]{dessins/scenarioA_p01.eps}}

363: %\resizebox{0.6\textwidth}{!}{  \includegraphics{dessins/scenarioA_p01.eps}}

364: \end {subfigure}

365: \begin{subfigure}{

366: %\label{performances_A_01}

367: \includegraphics[height=0.45\linewidth]{dessins/scenarioA_p02.eps}}

368: %\resizebox{0.6\textwidth}{!}{\includegraphics{dessins/scenarioA_p02.eps}}}

369: %\caption{performances of the full KP algorithm for different inter-cluster distance d}

370: %\label{performances_A_02}

371: \end{subfigure}

372: \begin{subfigure}{

373: \includegraphics[height=0.45\linewidth]{dessins/scenarioA_p05.eps}}

374: %\resizebox{0.6\textwidth}{!}{\includegraphics{dessins/scenarioA_p05.eps}}}

375: %\caption{performances of the full KP algorithm for different inter-cluster distance d}

376: %\label{performances_A_05}}

377: \end{subfigure}}

378: \end{center}}

379: \caption{Performance of the EM and KP algorithms on scenario A for different values of $\sigma$. For each value of $\sigma$ and for each sub-scenario, 10000 simulation runs have been performed. For each simulation run, 100 observations have been generated. $e_r$ is the maximal distance between the sorted vector of true cluster means and the sorted vector of estimated cluster means. The performance criteria are the probabilities for $e_r$ to be smaller than 0.1 (top), smaller than 0.2 (middle) and greater than 0.5 (bottom).}

380: \end{figure}

381:

382: % PERFORMANCE FIGURE FOR SCENARIO B

383: \begin{figure}{

384: \begin{center}{

385: \begin{subfigure}{

386: \includegraphics[height=0.45\linewidth]{dessins/scenarioB_p01.eps}}

387: %\resizebox{0.75\textwidth}{!}{  \includegraphics{dessins/scenarioB_p01.eps}}}

388: \end {subfigure}

389: \begin{subfigure}{

390: \includegraphics[height=0.45\linewidth]{dessins/scenarioB_p02.eps}}

391: %\resizebox{0.75\textwidth}{!}{\includegraphics{dessins/scenarioB_p02.eps}}}

392: \end {subfigure}

393: \begin{subfigure}{

394: \includegraphics[height=0.45\linewidth]{dessins/scenarioB_p05.eps}}

395: %\resizebox{0.75\textwidth}{!}{\includegraphics{dessins/scenarioB_p05.eps}}}

396: \end {subfigure}}

397: \end{center}}

398: \caption{Performance of the EM and KP algorithms on scenario B for different values of $\sigma$. For each value of $\sigma$ and for each sub-scenario, 10000 simulation runs have been performed. For each simulation run, 200 observations have been generated. $e_r$ is the maximal distance between the sorted vector of true cluster means and the sorted vector of estimated cluster means. The performance criteria are the probabilities for $e_r$ to be smaller than 0.1 (top), smaller than 0.2 (middle) and greater than 0.5 (bottom).}

399: \end{figure}

400:

401: % PERFORMANCE FIGURE FOR SCENARIO C

402: \begin{figure}{

403: \begin{center}{

404: \begin{subfigure}{

405: \includegraphics[height=0.45\linewidth]{dessins/scenarioC_p01.eps}}

406: %\resizebox{0.75\textwidth}{!}{  \includegraphics{dessins/scenarioC_p01.eps}}}

407: \end {subfigure}

408: \begin{subfigure}{

409: \includegraphics[height=0.45\linewidth]{dessins/scenarioC_p02.eps}}

410: %\resizebox{0.75\textwidth}{!}{\includegraphics{dessins/scenarioC_p02.eps}}}

411: \end {subfigure}

412: \begin{subfigure}{

413: \includegraphics[height=0.45\linewidth]{dessins/scenarioC_p05.eps}}

414: %\resizebox{0.75\textwidth}{!}{\includegraphics{dessins/scenarioC_p05.eps}}}

415: \end {subfigure}}

416: \end{center}}

417: \caption{Performance of the EM and KP algorithms on scenario C for different values of $\sigma$. For each value of $\sigma$ and for each sub-scenario, 1000 simulation runs have been performed. For each simulation run, 300 observations have been generated. $e_r$ is the maximal distance between the sorted vector of true cluster means and the sorted vector of estimated cluster means. The performance criteria are the probabilities for $e_r$ to be smaller than 0.1 (top), smaller than 0.2 (middle) and greater than 0.5 (bottom).}

418: \end{figure}

419:

420: % CONCLUSION

421: %-----------

422: \section{Conclusion}

423: We have proposed a cluster estimation algorithm for univariate observations when the number of clusters is known. It is based on the minimization of the ``KP'' criterion we first introduced in (Paul et al. 2006). We have shown that the global minimum of this criterion can be reached with a linear least-squares minimization followed by a root-finding algorithm. This minimum is used to get a first raw estimate of the cluster means, and a final clustering step makes it possible to recover the cluster means. The proposed method is not iterative, its complexity is $\text{O}(NK+K^2)$, and it does not require the configuration of any extra parameter. Simulations have illustrated the performance of the KP algorithm and its superiority to the Expectation-Maximization algorithm, which can get stuck at a local maximum of the likelihood. We focused on the univariate case; our current research deals with the multivariate case. If the observations $\textbf{z}_n$ belong to $\mathbb{R}^d$ and $\{\textbf{x}_k\}_{k\in\{1, \cdots, K\}}$ is any set of $K$ vectors of $\mathbb{R}^d$, the KP criterion is defined as the sum, over the $N$ observations, of the $K$-term products $\prod_{k=1}^{K}||\textbf{z}_n-\textbf{x}_k||_{\mathbb{R}^d}^2$. The minima of this criterion and some algorithms to reach them are currently being studied.

424:

425: \begin{acknowledgements}

426: The authors thank B. Scherrer, M. Bellanger, P. Tortelier, G. Saporta and J.P. Nakache for their constructive comments, which helped to improve this manuscript.

427: \end{acknowledgements}

428:

429: \begin{thebibliography}{}

430:

431: %\begin{itemize}[]

432: \addtolength{\leftmargin}{0.2in}

433: \setlength{\itemindent}{-0.2in}

434: % survey on clustering

435:

436: \bibitem{berkhin}

437: Berkhin P (2006) A survey of clustering data mining techniques. Grouping Multidimensional Data: Recent Advances in Clustering, Eds. J. Kogan, C. Nicholas and M. Teboulle, Springer, pp. 25-71

438: % k-means

439: % kmeans init

440: \bibitem{bradley}

441: Bradley P S, Fayyad U M (1998) Refining initial points for K-means clustering. Proc. of the 15th Int. Conf. on Machine Learning, San Francisco, Morgan Kaufmann, pp. 91-99

442: % stochastic em

443: \bibitem{celeux}

444: Celeux G, Chauveau D, Diebolt J (1995) On stochastic versions of the EM algorithm. INRIA research report no 2514, available: http://www.inria.fr/rrrt/rr-2514.html

445: % the reference for EM:

446: \bibitem{dempster}

447: Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B. 39, pp. 1-38

448: % Fisher partition

449: \bibitem{fisher}

450: Fisher W D (1958) On grouping for maximum homogeneity. Journal of the American Statistical Association, Vol. 53, No. 284, pp. 789-798

451: % complexity of Fisher

452: \bibitem{fitzgibbon}

453: Fitzgibbon L J, Allison L, Dowe D L (2000) Minimum message length grouping of ordered data. Algorithmic Learning Theory, 11th International Conference, ALT 2000, Sydney, Australia

454: % the reference on k-means

455: \bibitem{hartigan}

456: Hartigan J, Wong M (1979) A k-means clustering algorithm. Applied Statistics, vol. 28, pp. 100-108

457: % stochastic kmeans

458: \bibitem{krishna}

459: Krishna K, Narasimha Murty M (1999) Genetic K-Means Algorithm, IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, Vol. 29, No. 3

460: % em initialisation with moment

461: \bibitem{lindsay2}

462: Lindsay B, Furman D (1994) Measuring the relative effectiveness of moment estimators as starting values in maximizing likelihoods. Computational Statistics and Data Analysis, Volume 17, Issue 5, pp. 493-507

463: % em init

464: \bibitem{mclachlan}

465: McLachlan G, Peel D (2000) Finite Mixture Models. Wiley Series in probability and statistics, John Wiley and Sons

466: % Parzen

467: \bibitem{parzen}

468: Parzen E (1962) On estimation of a probability density function and mode. Annals of Mathematical Statistics 33, pp. 1065-1076

469: % k product first article (norsig)

470: \bibitem{norsig}

471: Paul N, Terre M, Fety L (2006) The k-product criterion for gaussian mixture estimation. 7th Nordic Signal Processing Symposium,  Reykjavik, Iceland

472: % stochastic em 2

473: \bibitem{pernkopf}

474: Pernkopf F, Bouchaffra D (2005) Genetic-based EM algorithm for learning gaussian mixture models, IEEE Transactions On Pattern Analysis and Machine Intelligence, Vol. 27, No. 8

475: % survey of clustering

476: \bibitem{xu}

477: Xu R, Wunsch II D (2005) Survey of Clustering Algorithms. IEEE Transactions On Neural Networks, vol. 16, No. 3, pp. 645-676

478: \end{thebibliography}

479: %\end{itemize}

480:

481: % old

482: %\bibitem{kmeansInit2}	T. Su and J. Dy, "A Deterministic Method for Initializing K-means Clustering," in \textit{Proc. of the 16th %IEEE Int. Conf. on Tools with Artificial Intelligence}, 2004.

483: %\bibitem{michell}	T. Michell, "Machine learning", McGraw-Hill, New-York, NY, USA, 1997

484: %% moments

485: %\bibitem{lindsay1}	B. Lindsay, "Moment Matrices: Application in Mixture", \textit{The Annals of Statistics}, Vol. 17, No. 2 (June 1989) pp. 722-740

486: %% algebra (for the Elementary Symetric Polynomials)

487: %\bibitem{lang} S. Lang, "Algebra", Springer-Verlag, 2004

488: % problem of moment (for regularity of Z)

489: %\bibitem{shohat} J.A. Shohat and J.D Tamarkin "the problem of moments", \textit{american mathematical society}, New York 1943

490:

491:

492: \appendix{}

493: %\chapter{toto}

494: \section{: non-singularity of \textbf{Z}}

495: In appendix A we explain why the $K \times K$ matrix $\textbf{Z}$ defined in \eqref{definition_Z} is regular (non-singular) if the number of distinct observations is greater than $K-1$. $\textbf{Z}$ can be written as the following matrix product:

496: \begin{equation}

497: \textbf{Z}=\textbf{V}\textbf{V}^t \notag

498: \end{equation}

499: where \textbf{V} is a $K \times N$ Vandermonde matrix defined by:

500: \begin{equation}

501: \textbf{V}\stackrel{\Delta}{=} \left( \textbf{z}_1, \textbf{z}_2, \cdots \textbf{z}_N \right) \notag

502: \end{equation}

503: and $\textbf{z}_n$ has been defined in \eqref{definition_zn}. Since at least $K$ observations are distinct, we can assume, up to a reordering of the observations, that the first $K$ observations are distinct. The determinant of the $K \times K$ Vandermonde matrix $\left( \textbf{z}_1, \textbf{z}_2, \cdots, \textbf{z}_K \right)$ is equal, up to its sign, to $\prod\limits_{1 \leq i < j \leq K}{(z_j-z_i)}$, which is different from zero. The rank of $\textbf{V}$ is then equal to $K$. Since $\textbf{V}$ is real, $\textbf{Z}=\textbf{V}\textbf{V}^t$ has the same rank as $\textbf{V}$, so the rank of \textbf{Z} is equal to $K$ and \textbf{Z} is regular.
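
As a sanity check of this argument, consider the case $K=2$. We assume here, consistently with the expansion used in appendix B, that $\textbf{z}_n=(z_n,1)^t$; this small case is only an illustration and is not needed for the general result:
\begin{equation}
\textbf{Z}=\textbf{V}\textbf{V}^t=\begin{pmatrix} \sum_{n=1}^{N}{z_n^2} & \sum_{n=1}^{N}{z_n} \\ \sum_{n=1}^{N}{z_n} & N \end{pmatrix}, \ \ \ \det(\textbf{Z})=N\sum_{n=1}^{N}{z_n^2}-\Big(\sum_{n=1}^{N}{z_n}\Big)^2=\sum_{1 \leq i < j \leq N}{(z_j-z_i)^2} \notag
\end{equation}
\noindent The determinant vanishes if and only if all the observations are equal, that is, if and only if the number of distinct observations is not greater than $K-1=1$.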

504:

505: % Appendix B: proof of theorem 1

506: %\chapter

507: %\appendix

508: \section{: proof of theorem 1}

509: \noindent In appendix B we prove theorem 1. Let \textbf{$F$} be the function defined by:

510: \begin{equation}

511: F: \mathbb{C}^K\rightarrow \mathbb{R}^+:

512: \textbf{x} \rightarrow \sum_{n=1}^{N}{ \prod_{k=1}^{K}{||z_n-x_k||_{\mathbb{C}}^2} } \notag

513: %\label{definition_F}

514: \end{equation}

515: \noindent The restriction of $F$ to $\mathbb{R}^K$ is the function $J$ since the observations $z_n$ are real:

516: \begin{equation}

517: \forall \textbf{x} \in \mathbb{R}^K: \ \ \ F(\textbf{x})=J(\textbf{x})

518: \label{restrictionRK}

519: \end{equation}

520: \noindent Now let $H$ be the function defined by:

521: \begin{equation}

522: H: \mathbb{C}^K\rightarrow \mathbb{R}^+:

523: \textbf{y} \rightarrow \sum_{n=1}^{N}{\left\|z_n^K-\textbf{z}_n^t\textbf{y}\right\|_{\mathbb{C}}^2} \notag

524: %\label{definition_H}

525: \end{equation}

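526: \noindent We now show that the function $H$ applied to the ESP of a vector $\textbf{x}$ in $\mathbb{C}^K$ is equal to the function $F$ applied to $\textbf{x}$. Since the modulus of a product is the product of the moduli, $F$ can be written as: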

527: \begin{equation}

528: \forall \textbf{x} \in \mathbb{C}^K: \ \ \

529: F(\textbf{x})=\sum_{n=1}^{N}{ \left\|\prod_{k=1}^{K}{(z_n-x_k)}\right\|_{\mathbb{C}}^2 }

530: \label{F_SumNormProduct}

531: \end{equation}

532: \noindent Developing~\eqref{F_SumNormProduct} using definition~\eqref{definition_wk} leads to:

533: \begin{equation}

534: \forall \textbf{x} \in \mathbb{C}^K: \ \ \

535: F(\textbf{x})=\sum_{n=1}^{N}{ \left\|{z_n^K-\sum_{k=1}^{K}{z_n^{K-k}w_k(\textbf{x})}}\right\|_{\mathbb{C}}^2 } \notag

536: \end{equation}

537: \noindent Using definitions~\eqref{definition_zn} and~\eqref{definition_w}, and then the definition of $H$, we get:

538: \begin{equation}

539: \forall \textbf{x} \in \mathbb{C}^K: \ \ \

540: F(\textbf{x})=\sum_{n=1}^{N}{ \left\|{z_n^K-\textbf{z}_n^t\textbf{w}(\textbf{x})}\right\|_{\mathbb{C}}^2 } \notag

541: \end{equation}

542: \begin{equation}

543: \forall \textbf{x} \in \mathbb{C}^K: \ \ \ F(\textbf{x})=H( \textbf{w}(\textbf{x}) )

544: \label{FuIsHWu}

545: \end{equation}

546: \noindent The global minimum of $H$ is the linear least-squares solution $\textbf{y}_{min}$ given by:

547: \begin{equation}

548: \textbf{y}_{min}=\underset{\textbf{y} \in \mathbb{C}^K}{\text{argmin}}  \left\{\sum_{n=1}^{N}{\left\|z_n^K-\textbf{z}_n^t\textbf{y}\right\|_{\mathbb{C}}^2}\right\}

549: \label{argminH}

550: \end{equation}

551: \noindent Developing~\eqref{argminH} using definitions~\eqref{definition_z} and~\eqref{definition_Z}, dropping the term that does not depend on $\textbf{y}$, and remembering that the coefficients of $\textbf{Z}$ and $\textbf{z}$ are real:

552: \begin{equation}

553: \textbf{y}_{min}=\underset{\textbf{y} \in \mathbb{C}^K}{\text{argmin}}  \left\{\textbf{y}^H\textbf{Z}\textbf{y}-2\text{Re}\{\textbf{y}^H\}\textbf{z} \right\} \notag

554: \end{equation}

555: \begin{equation}

556: \textbf{Z}.\textbf{y}_{min}=\textbf{z}, \ \ \ \textbf{y}_{min} \in \mathbb{R}^K

557: \label{ymin_obtention}

558: \end{equation}

559: \noindent The Hankel matrix $\textbf{Z}$ is regular since the number of distinct observations is greater than $K-1$ (appendix A). System~\eqref{ymin_obtention} therefore has exactly one solution. Since $\textbf{Z}$ belongs to $\mathbb{R}^{K \times K}$ and $\textbf{z}$ belongs to $\mathbb{R}^K$, $\textbf{y}_{min}$ belongs to $\mathbb{R}^K$. Now let $\textbf{x}_{min}=(x_{1,min},\cdots ,x_{K,min})^t$ be a vector containing, in any order, the $K$ (potentially complex) roots of $q_{\textbf{y}_{min}}(\alpha)$. We show that the following properties hold: \\

560: \noindent \hspace*{1cm}(i) $\textbf{x}_{min}$ is a global minimum of $F$ \\

561: \hspace*{1cm}(ii) $\textbf{x}_{min} \in \mathbb{R}^K$ \\

562: \hspace*{1cm}(iii) $\textbf{x}_{min}$ is a global minimum of $J$  \\

563: \noindent Property (i) is a direct consequence of~\eqref{FuIsHWu}:

564: \begin{equation}

565: \forall \textbf{x} \in \mathbb{C}^K: \ \ \ F(\textbf{x})=H(\textbf{w}(\textbf{x})) \notag

566: \end{equation}

567: \begin{equation}

568: \forall \textbf{x} \in \mathbb{C}^K: \ \ \ F(\textbf{x}) \geq \text{min} \left\{H\right\} \notag

569: \end{equation}

570: \begin{equation}

571: \forall \textbf{x} \in \mathbb{C}^K: \ \ \ F(\textbf{x}) \geq H(\textbf{y}_{min}) \notag

572: \end{equation}

573: \noindent According to (\ref{roots_coefficients}), $\textbf{y}_{min}=\textbf{w}(\textbf{x}_{min})$ and we have:

574: \begin{equation}

575: \forall \textbf{x} \in \mathbb{C}^K: \ \ \ F(\textbf{x}) \geq H(\textbf{w}(\textbf{x}_{min})) \notag

576: \end{equation}

577: \begin{equation}

578: \forall \textbf{x} \in \mathbb{C}^K: \ \ \ F(\textbf{x}) \geq F(\textbf{x}_{min}) \notag

579: \end{equation}

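580: \noindent which proves (i). Property (ii) can be shown by contradiction: if $\textbf{x}_{min}$ does not belong to $\mathbb{R}^K$, then for one of the $x_{k,min}$ we have $x_{k,min} \neq \text{Re}\{x_{k,min}\}$. Since all the observations $z_n$ are real, $\left\|z_n-x_{k,min}\right\|_{\mathbb{C}}^2=(z_n-\text{Re}\{x_{k,min}\})^2+(\text{Im}\{x_{k,min}\})^2$ with $\text{Im}\{x_{k,min}\} \neq 0$, so that: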

581: \begin{equation}

582: \forall n \in \{1,\cdots,N\}: \ \ \

583: \left\|z_n-x_{k,min}\right\|_{\mathbb{C}} > \left\|z_n-\text{Re}\{x_{k,min}\}\right\|_{\mathbb{C}} \notag

584: \end{equation}

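585: \noindent None of the other factors of $F$ can increase when each root is replaced by its real part. Moreover, since there are at least $K$ distinct observations, at least one observation $z_n$ differs from the $K-1$ other roots, so the corresponding term of $F$ strictly decreases. This leads to: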

586: \begin{equation}

587: F(\textbf{x}_{min})>F(\text{Re}\{\textbf{x}_{min}\}) \notag

588: \end{equation}

589: \noindent This is impossible since $\textbf{x}_{min}$ is a global minimum of $F$. This proves property (ii). We finally have to prove (iii): since $\textbf{x}_{min} \in \mathbb{R}^K$ we have, using~\eqref{restrictionRK}:

590: \begin{equation}

591: F(\textbf{x}_{min})=J(\textbf{x}_{min})

592: \label{FuminIsJumin}

593: \end{equation}

594: \noindent Furthermore, according to~\eqref{restrictionRK}:

595: \begin{equation}

596: \forall \textbf{x} \in \mathbb{R}^K: \ \ \ J(\textbf{x})=F(\textbf{x}) \notag

597: \end{equation}

598: \begin{equation}

599: \forall \textbf{x} \in \mathbb{R}^K: \ \ \ J(\textbf{x}) \geq \text{min}\{F\} \notag

600: \end{equation}

601: \noindent then, according to property (i):

602: \begin{equation}

603: \forall \textbf{x} \in \mathbb{R}^K: \ \ \ J(\textbf{x}) \geq F(\textbf{x}_{min}) \notag

604: \end{equation}

605: \noindent using~\eqref{FuminIsJumin}:

606: \begin{equation}

607: \forall \textbf{x} \in \mathbb{R}^K: \ \ \ J(\textbf{x}) \geq J(\textbf{x}_{min}) \notag

608: \end{equation}

609: \noindent which proves (iii). Properties (ii) and (iii) directly lead to theorem 1.
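
As a small worked illustration of theorem 1 (it is not part of the reported simulations), take $K=2$ and the $N=4$ observations $z_1=0$, $z_2=1$, $z_3=3$, $z_4=4$. We assume here, consistently with the expansions used in this appendix, that $\textbf{z}_n=(z_n,1)^t$, $\textbf{Z}=\sum_{n}{\textbf{z}_n\textbf{z}_n^t}$, $\textbf{z}=\sum_{n}{z_n^K\textbf{z}_n}$ and $q_{\textbf{y}}(\alpha)=\alpha^K-\sum_{k=1}^{K}{y_k\alpha^{K-k}}$. With $\sum_n z_n=8$, $\sum_n z_n^2=26$ and $\sum_n z_n^3=92$, system~\eqref{ymin_obtention} reads:
\begin{equation}
\begin{pmatrix} 26 & 8 \\ 8 & 4 \end{pmatrix}\textbf{y}_{min}=\begin{pmatrix} 92 \\ 26 \end{pmatrix}, \ \ \ \textbf{y}_{min}=\begin{pmatrix} 4 \\ -1.5 \end{pmatrix} \notag
\end{equation}
\noindent so that, under the above assumption on $q_{\textbf{y}}$:
\begin{equation}
q_{\textbf{y}_{min}}(\alpha)=\alpha^2-4\alpha+1.5, \ \ \ x_{1,min}=2-\frac{\sqrt{10}}{2}\approx 0.42, \ \ \ x_{2,min}=2+\frac{\sqrt{10}}{2}\approx 3.58 \notag
\end{equation}
\noindent Both roots are real, in agreement with property (ii). Each of the four terms of $J(\textbf{x}_{min})$ is equal to $1.5^2$, so $J(\textbf{x}_{min})=9$, whereas the averages $(0.5, 3.5)$ of the two groups $\{0,1\}$ and $\{3,4\}$ give $J=9.25$. The KP minimum is therefore close to these averages without coinciding with them.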

610: \end{document}