0712:0712.4126/em.tex

1: \chapter{TRUST-TECH based Expectation Maximization for Learning Mixture Models}

2: \label{ch:trust-tech-em}

3:

4: In this chapter, we develop a TRUST-TECH based algorithm for solving the problem of mixture modeling. In the field of statistical pattern recognition, finite mixtures

5: allow a probabilistic model-based approach to unsupervised learning

6: \cite{McLachlan88}. One of the most popular methods used for fitting

7: mixture models to the observed data is the {\it

8: Expectation-Maximization} (EM) algorithm which converges to the

9: maximum likelihood estimate of the mixture parameters locally

10: \cite{Demspter77,Redner84}. The usual steepest descent, conjugate

11: gradient, or Newton-Raphson methods are too complicated for use in

12: solving this problem \cite{Xu96}. EM has become a popular method

13: since it takes advantage of problem specific properties. EM based

14: methods have been successfully applied to solve a wide range of

15: problems that arise in pattern recognition \cite {Baum70,Bilmes98},

16: clustering \cite{Banfield93}, information retrieval \cite {Nigam00},

17: computer vision \cite {Carson02}, data mining \cite{Shumway82}~etc.

18:

19: Without loss of generality, we will consider the problem of learning the parameters of

20: Gaussian Mixture Models (GMMs). Fig.~\ref{fig:gmm} shows data

21: generated by three Gaussian components with different mean and

22: variance. Note that every data point has a probabilistic (or soft)

23: membership that gives the probability with which it belongs to each

24: of the components. Points that belong to component 1 will have high

25: probability of membership for component 1. On the other hand, data

26: points belonging to components 2 and 3 are not well separated. The

27: problem of learning mixture models involves estimating the

28: parameters of these components and finding the probabilities

29: with which each data point belongs to these components. Given the

30: number of components and an initial set of parameters, the EM algorithm

31: computes locally optimal estimates of the parameters

32: that maximize the likelihood of the data given the estimates of

33: these components. However, the main problem with the EM algorithm is

34: that it is a `{\it greedy}' method which is very sensitive to the

35: given initial set of parameters. To overcome this problem, a novel

36: three-stage algorithm is proposed \cite{Reddy07}. The main research concerns that motivated the new algorithm

37: presented in this chapter are :

38: \begin {itemize}

39: \item{EM algorithm converges to a local maximum

40: of the likelihood function very quickly.}

41:

42: \item{There are several other promising local optimal solutions in

43: the vicinity of the solutions obtained from the methods that

44: provide good initial guesses of the solution.}

45:

46: \item{Model selection criteria usually assume that the global

47: optimal solution of the log-likelihood function can be obtained.

48: However, achieving this is computationally intractable.}

49:

50: \item{Some regions in the search space do not contain any promising solutions.

51: The promising and non-promising regions coexist and it becomes

52: challenging to avoid wasting computational resources to search in

53: non-promising regions.}

54:

55: \end{itemize}

56:

57:

58: Of all the concerns mentioned above, the fact that most of the local

59: maxima are not distributed uniformly \cite{Ueda98} makes it

60: important to develop algorithms that can avoid searching in the low-likelihood regions and focus on exploring promising subspaces more thoroughly. This

61: subspace search will also be useful for making the solution less

62: sensitive to the initial set of parameters. Here, we

63: propose a novel three-stage algorithm for estimating the parameters of

64: mixture models. Using the TRUST-TECH method and the EM algorithm

65: simultaneously to exploit the problem specific features of the

66: mixture models, the proposed three-stage algorithm obtains the optimal set of parameters

67: by searching for the global maximum in a

68: systematic manner.

69:

70:

71: \begin{figure}[htp]

72: \centerline{

73:   \epsfig{figure=Figures/gmm1.ps, width=3.5in}

74: } \caption{Data consisting of three Gaussian components with

75: different mean and variance values. Note that each data point does not have

76: a hard membership that it belongs to only one component. Most of the

77: points in the first component will have high probability with which

78: they belong to it. In this case, the other components do not have

79: much influence. Components 2 and 3 data points are not clearly separated. The

80: problem of learning mixture models involves estimating the

81: parameters of the Gaussian components and finding the

82: probabilities with which each data sample belongs to each component.}

83: \label{fig:gmm}

84: \end{figure}

85:

86:

87: \section{Relevant Background}

88: \label{sec:background}  Although EM and its variants have been extensively used for learning

89: mixture models, several researchers have approached the problem by

90: identifying new techniques that give good initial points. More

91: generic techniques like deterministic annealing \cite

92: {Rose98,Ueda98}, genetic algorithms \cite{Pernkopf05,Martínez00}

93: have been applied to obtain a good set of parameters. Though these

94: techniques have asymptotic guarantees, they are very time consuming

95: and hence may not be used for most of the practical applications.

96: Some problem specific algorithms like split and merge EM

97: \cite{Ueda00}, component-wise EM \cite{Figueiredo02}, greedy

98: learning \cite{Verbeek03}, incremental version for sparse

99: representations \cite{neal98}, parameter space grid \cite{Li99} are

100: also proposed in the literature. Some of these algorithms are either

101: computationally very expensive or infeasible when learning mixtures

102: in high dimensional spaces \cite{Li99}. In spite of all the expense

103: in these methods, very little effort has been made to explore

104: promising subspaces within the larger parameter space. Most of these

105: algorithms eventually apply the EM algorithm to move to a locally

106: maximal set of parameters on the likelihood surface. Simple approaches like running EM from several random

107: initializations, and then choosing the final estimate that leads to

108: the local maximum with the highest value of the likelihood, can be successful to a certain extent \cite{hastie96,Roberts98}.

109:

110:

111: Though some of these methods apply other additional mechanisms (like

112: perturbations \cite{Elidan02}) to escape out of the local optimal

113: solutions, systematic methods are yet to be developed for searching

114: the subspace. The dynamical system of the log-likelihood function

115: reveals more information about the topology of the nonlinear log-likelihood surface \cite{Chiang96}. Hence, the

116: difficulties of finding good solutions when the error surface is

117: very rugged can be overcome by understanding the geometric and dynamic characteristics of the log-likelihood surface. Though this method might introduce some additional cost, one

118: has to realize that existing approaches are much more expensive due

119: to their stochastic nature. Specifically, for a problem in this

120: context, where there is a non-uniform distribution of local maxima,

121: it is difficult for most of the methods to search neighboring

122: regions \cite{Zhang04}. For this reason, it is more desirable to

123: apply TRUST-TECH based Expectation Maximization (TT-EM) algorithm

124: after obtaining some point in a promising region. The main

125: advantages of the proposed algorithm are that it :

126:

127: \begin{itemize}

128: \item{Explores most of the neighboring local optimal solutions, unlike traditional stochastic algorithms.}

129: \item{Acts as a flexible interface between the EM algorithm and other global methods. Sometimes, a global method will optimize an approximation of the original function. Hence, it is important to provide an interface between the EM algorithm and the global method.}

130: \item{Allows the user to work with existing clusters obtained from traditional approaches and improves the quality of the solutions based on the maximum likelihood criterion.}

131: \item{Helps expensive global methods to terminate early.}

132: \item{Exploits the heuristic that the EM algorithm converges at a faster rate if the solution is promising.}

133: \end{itemize}

134:

135: \noindent While trying to obtain multiple optimal solutions, TRUST-TECH can dynamically change the threshold for the number of iterations. For example, while computing Tier-1 solutions, if a promising solution has been obtained within a few iterations, then the remaining Tier-1 solutions will use this value as their threshold.

136:

137: \section{Preliminaries}

138: \label{sec:problem} We will now introduce some necessary preliminaries on

139: mixture models, the EM algorithm, and the nonlinear transformation. Table~\ref{TB:datanot} gives the notation used in this chapter:

140:

141: \begin{table}[h]

142: \centering \caption{\protect Description of the Notations

143: used}

144: \begin{center}

145: \begin{tabular}{cl}

146: \hline  Notation &   Description \\

147:  \hline

148:

149:  $d$ & number of features\\

150:  $n$ & number of data points\\

151:  $k$ & number of components\\

152:  $s$ & total number of parameters\\

153:  $\Theta$ & parameter set\\

154:  $\theta_i$ & parameters of $i^{th}$ component\\

155:  $\alpha_i$ & mixing weights for $i^{th}$ component\\

156:  $\mathcal{X}$ & observed data\\

157:  $\mathcal{Z}$ & missing data\\

158:  $\mathcal{Y}$ & complete data\\

159:  $t$ &  time step for the estimates\\

160:  \hline

161: \end{tabular}

162: \end{center}

163: \label{TB:datanot}

164: \end{table}

165:

166: \subsection{Mixture Models}

167:

168: Let us assume that there are $k$ Gaussians in the mixture model. The

169: form of the probability density function is as follows :

170:

171: \begin{equation}\label{eq:gaussian1}

172:     p(x|\Theta) =\sum_{i=1}^{k}{\alpha_i p(x|\theta_i)}

173: \end{equation}

174:

175: \noindent where $x=[x_1,x_2,...,x_d]^T$ is the feature vector of $d$

176: dimensions. The $\alpha_i$'s represent the {\it mixing weights}. $\Theta$ represents the parameter set ($\alpha_1, \alpha_2,\ldots,

177: \alpha_k,\theta_1,\theta_2,\ldots,\theta_k$) and $p$ is a univariate

178: Gaussian density parameterized by $\theta_i$ (i.e., $\mu_i$ and

179: $\sigma_i$):

180:

181: \begin{equation}\label{eq:gaussiandensity}

182:     p(x|\theta_i) =\frac{1}{\sqrt{(2\pi)}\sigma_i}e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}

183: \end{equation}

184:

185: Also, note that, being probabilities, the $\alpha_i$ must

186: satisfy

187:

188: \begin{equation}\label{eq:probabi}

189:     0 \leq \alpha_i\leq 1 ~,~ \forall i=1,\ldots,k,~ \mbox{and} ~~ \sum_{i=1}^k

190:     \alpha_i=1

191: \end{equation}

192:

193: Given a set of $n$ i.i.d.\ samples

194: $\mathcal{X}=\{x^{(1)},x^{(2)},..,x^{(n)}\}$, the log-likelihood

195: corresponding to a mixture is

196:

197: \begin{equation}\label{eq:log}

198: \begin{split}

199:     \log p(\mathcal{X}|\Theta)&=\log \prod_{j=1}^n

200:     ~p(x^{(j)}|\Theta)\\

201:      &=\sum_{j=1}^n \log \sum_{i=1}^k \alpha_i

202:     ~p(x^{(j)}|\theta_i)

203: \end{split}

204: \end{equation}

205:

206: The goal of learning mixture models is to obtain the parameters

207: $\widehat{\Theta}$ from a set of $n$ data points, which are samples

208: of a distribution with density given by (\ref{eq:gaussian1}). The

209: {\it Maximum Likelihood Estimate } (MLE) is given by :

210: \begin{equation}\label{eq:MLE}

211: \widehat{\Theta}_{MLE} = \arg \max_{\Theta \in \tilde{\Theta}} ~\{~\log

212: ~p(\mathcal{X}|\Theta)~\}

213: \end{equation}

214:

215: where $\tilde{\Theta}$ indicates the entire parameter space. Since

216: this MLE cannot be found analytically for mixture models, one has to

217: rely on iterative procedures that search for a maximum of

218: $\log p(\mathcal{X}|\Theta)$. The EM algorithm described in the next

219: section has been used successfully to find a local maximum of such

220: a function \cite{McLachlan97}.

221:

222: \subsection{Expectation Maximization}

223:

224: The EM algorithm assumes $\mathcal{X}$ to be {\it observed} data. The

225: missing part, termed {\it hidden} data, is a set of $n$ labels

226: $\mathcal{Z}=\{\footnotesize{\bf z}^{(1)},\footnotesize{\bf

227: z}^{(2)},..,\footnotesize{\bf z}^{(n)}\}$ associated with $n$

228: samples, indicating which component produced each sample

229: \cite{McLachlan97}. Each label $\footnotesize{\bf

230: z}^{(j)}=[z_1^{(j)},z_2^{(j)},..,z_k^{(j)}]$ is a binary vector

231: where $z_i^{(j)}=1$ and $z_m^{(j)}=0$  $\forall m \neq i$, means the

232: sample $x^{(j)}$ was produced by the $i^{th}$ component. Now, the

233: complete log-likelihood, i.e., the one from which we would estimate

234: $\Theta$ if the {\it complete data}

235: $\mathcal{Y}=~\{~\mathcal{X},\mathcal{Z}~\}$ were observed, is

236:

237: \begin{equation*}\label{eq:beforecomplete}

238:     \log p(\mathcal{X},\mathcal{Z}|\Theta)=\sum_{j=1}^n \log \prod_{i=1}^k

239:     ~ [~\alpha_i~p(x^{(j)}|\theta_i)~]^{z_i^{(j)}}

240: \end{equation*}

241:

242:

243: \begin{equation}\label{eq:complete}

244:     \log p(\mathcal{Y}|\Theta)=\sum_{j=1}^n \sum_{i=1}^k

245:     z_i^{(j)}~\log [~\alpha_i~p(x^{(j)}|\theta_i)~]

246: \end{equation}

247:

248: The EM algorithm produces a sequence of estimates

249: $\{\widehat{\Theta}(t),t=0,1,2,...\}$ by alternately applying the

250: following two steps until convergence :

251:

252: \begin {itemize}

253: \item{{\bf E-Step : } Compute the conditional expectation of the

254: hidden data, given $\mathcal{X}$ and the current estimate

255: $\widehat{\Theta}(t)$. Since $log~p(\mathcal{X,Z}|\Theta)$ is linear

256: with respect to the missing data $\mathcal{Z}$, we simply have to

257: compute the conditional expectation $\mathcal{W} \equiv

258: E[\mathcal{Z|X},\widehat{\Theta}(t)]$, and plug it into $log ~p

259: (\mathcal{X,Z}|\Theta)$. This gives the $Q$-function as follows :

260:

261:

262:  \begin{equation}\label{eq:qfunc}

263: %\begin{split}

264:     Q(\Theta|\widehat{\Theta}(t))\equiv

265:    E_{\mathcal{Z}}[\log p(\mathcal{X},\mathcal{Z}|\Theta)~|~\mathcal{X},\widehat{\Theta}(t)]

266:  %\end{split}

267: \end{equation}

268:

269: Since each $z_i^{(j)}$ is binary, its conditional expectation

270: is given by :

271:

272: \begin{equation}\label{eq:conditionw}

273: \begin{split}

274:     w_i^{(j)} &\equiv E~[~z_i^{(j)}|\mathcal{X},\widehat{\Theta}(t)~] \\

275:  &= Pr~[~z_i^{(j)}=1|x^{(j)},\widehat{\Theta}(t)~]\\

276: &=

277:     \frac{\widehat{\alpha}_i(t) p(x^{(j)}|\widehat{\theta}_i(t))}{\sum_{m=1}^{k}{\widehat{\alpha}_m(t) p(x^{(j)}|\widehat{\theta}_m(t))}}

278: \end{split}

279: \end{equation}

280:

281: where the last equality is simply Bayes' rule ($\alpha_i$ is the a

282: priori probability that $z_i^{(j)}=1$), while $w_i^{(j)}$ is the a

283: posteriori probability that $z_i^{(j)}=1$ given the observation

284: $x^{(j)}$.}

285:

286: \item{{\bf M-Step : } The estimates of the new parameters are

287: updated using the following equation :

288: \begin{equation}\label{eq:update}

289:     \widehat{\Theta} (t+1) = \arg \max_{\Theta}\{Q(\Theta|\widehat{\Theta}(t))\}

290: \end{equation}

291: }

292: \end{itemize}
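
As a small numerical illustration of Eq.~(\ref{eq:conditionw}), consider the following values, which are chosen only for this example and are not taken from the experiments: $k=2$ univariate components with $\widehat{\alpha}_1=\widehat{\alpha}_2=0.5$, $\widehat{\mu}_1=0$, $\widehat{\mu}_2=2$ and $\widehat{\sigma}_1=\widehat{\sigma}_2=1$. For a sample $x^{(j)}=0$, the common factors cancel and
\begin{equation*}
w_1^{(j)}=\frac{e^{0}}{e^{0}+e^{-2}}\approx 0.881, \qquad
w_2^{(j)}=\frac{e^{-2}}{e^{0}+e^{-2}}\approx 0.119 ,
\end{equation*}
so the sample is softly assigned to both components, with $w_1^{(j)}+w_2^{(j)}=1$.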

293: \subsection{EM for GMMs}

294:

295: Several variants of the EM algorithm have been extensively used to

296: solve this problem. The convergence properties of the EM algorithm

297: for Gaussian mixtures are thoroughly discussed in \cite{Xu96}. The

298: $Q$-function for the GMM is given by:

299:

300:  \begin{equation}\label{eq:qfuncgmm}

301:  \begin{split}

302:  Q(\Theta|\widehat{\Theta}(t))= \sum_{j=1}^{n}\sum_{i=1}^{k}  w_i^{(j)}\Big[&\log\frac{1}{\sigma_i\sqrt{2\pi}} \\&-\frac{(x^{(j)}-\mu_i)^2}{2\sigma_i^2}+\log \alpha_i\Big]

303: \end{split}

304: \end{equation}

305:

306: where

307:  \begin{equation}\label{eq:expectz}

308: w_i^{(j)}=\frac{\frac{\alpha_i(t)}{\sigma_i(t)}e^{-\frac{1}{2\sigma_i(t)^2}(x^{(j)}-\mu_i(t))^2}}{\sum_{m=1}^k

309: \frac{\alpha_m(t)}{\sigma_m(t)}e^{-\frac{1}{2\sigma_m(t)^2}(x^{(j)}-\mu_m(t))^2}}

310:  \end{equation}

311:

312: The maximization step is given by the following equation :

313:  \begin{equation}\label{eq:max}

314:     \frac{\partial }{\partial \theta_i} Q(\Theta|\widehat{\Theta}(t))= 0

315: \end{equation}

316: where $\theta_i$ denotes the parameters of the $i^{th}$ component.

317: Because of the assumption made that each data point comes from a

318: single component, solving the above equation becomes trivial. The

319: updates for the maximization step in the case of GMMs are given as

320: follows :

321: \begin{eqnarray}

322: %\begin{split}

323: \mu_i(t+1) &=& \frac{\sum_{j=1}^{n}w_i^{(j)}x^{(j)}}{\sum_{j=1}^{n}w_i^{(j)}}\\

324:   \sigma_i^2(t+1) &=& \frac{\sum_{j=1}^{n}w_i^{(j)} (x^{(j)}-\mu_i(t+1))^2}{\sum_{j=1}^{n}w_i^{(j)}}\\

325: \alpha_i(t+1)&=&\frac{1}{n}\sum_{j=1}^{n}w_i^{(j)} \label{eq:update}

326: %\end{split}

327: \end{eqnarray}
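
The updates above can be recovered from Eq.~(\ref{eq:qfuncgmm}) by a standard argument, sketched here for the mean and the mixing weights. The Lagrange multiplier $\lambda$ is introduced only for this sketch and is not part of the notation used elsewhere in the chapter. Differentiating with respect to $\mu_i$ gives
\begin{equation*}
\frac{\partial Q}{\partial \mu_i}=\sum_{j=1}^{n} w_i^{(j)}\,\frac{x^{(j)}-\mu_i}{\sigma_i^2}=0
\quad\Longrightarrow\quad
\mu_i=\frac{\sum_{j=1}^{n} w_i^{(j)}x^{(j)}}{\sum_{j=1}^{n} w_i^{(j)}} .
\end{equation*}
For the mixing weights, the constraint in Eq.~(\ref{eq:probabi}) can be enforced by maximizing $Q+\lambda\,(1-\sum_{i=1}^{k}\alpha_i)$, which gives
\begin{equation*}
\frac{1}{\alpha_i}\sum_{j=1}^{n} w_i^{(j)}-\lambda=0
\quad\Longrightarrow\quad
\alpha_i=\frac{1}{\lambda}\sum_{j=1}^{n} w_i^{(j)} .
\end{equation*}
Summing over $i$ and using $\sum_{i=1}^{k} w_i^{(j)}=1$ for every $j$ yields $\lambda=n$, which is the stated update for $\alpha_i$.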

328:

329: \subsection{Nonlinear Transformation}

330: This section mainly deals with the transformation of the original

331: log-likelihood function into its corresponding nonlinear dynamical

332: system and introduces some terminology needed to understand our

333: algorithm. This transformation gives the correspondence between all

334: the critical points of the $s$-dimensional likelihood surface and

335: those of its dynamical system. For the case of spherical Gaussian

336: mixtures with $k$ components, we have the number of unknown

337: parameters $s=3k-1$. For convenience, the maximization problem is

338: transformed into a minimization problem defined by the following

339: objective function :

340: \begin{equation}

341: \begin{split}

342: \max_\Theta ~\{~\log

343: ~p(\mathcal{X}|\Theta)~\}&=~-\min_\Theta ~\{~-~\log

344: ~p(\mathcal{X}|\Theta)~\}\\&=~-\min_\Theta f(\Theta) \end{split}\label{eq:problem1}

345: \end{equation}

346:

347: %where $f(\Theta)$ is assumed to be in $C^2(\Re^s,\Re)$.

348:

349: \begin{lem1}\label{def:cont}

350: $f(\Theta)$ is $C^2(\Re^s,\Re)$.

351: \end{lem1}

352: \begin{proof}

353:

354: Note from Eq.(\ref{eq:log}), we have

355: \begin{equation}\label{eq:otherlog1}

356:     f(\Theta)=-\log p(\mathcal{X}|\Theta)

357:     =-\sum_{j=1}^n \log \sum_{i=1}^k \alpha_i

358:     ~p(\bm{x}^{(j)}|\bm{\theta}_i)

359: \end{equation}

360: Each of the simple functions that appear in Eq.~(\ref{eq:otherlog1}) is twice continuously differentiable in the interior of the domain over which $f(\Theta)$ is defined. Since $f(\Theta)$ is built from these simple functions by arithmetic operations and composition, basic results in analysis imply that $f(\Theta)$ is twice continuously differentiable.

361: \end{proof}

362:

363: Lemma \ref{def:cont} and the preceding arguments guarantee the existence of the gradient system associated with $f(\Theta)$ in the case of spherical Gaussians and allow us to construct the following negative gradient system:

364: \begin{equation}

365: \begin{split}

366: \textstyle{ \left[ \dot{\mu}_1(t)~..~\dot{\mu}_k(t)~\dot{\sigma}_1(t) ~..~\dot{\sigma}_k(t)~\dot{\alpha}_1(t) ~..~\dot{\alpha}_{k-1}(t)\right]^T}\\

367: =~-~\left[ \frac{\partial f}{\partial

368: \mu_1}~..~\frac{\partial f}{\partial \mu_k}~\frac{\partial

369: f}{\partial \sigma_1}~..~\frac{\partial f}{\partial

370: \sigma_k}~\frac{\partial f}{\partial \alpha_1}~..~\frac{\partial

371: f}{\partial \alpha_{k-1}} \right]^T  \label{def:loggrad}

372: \end{split}

373: \end{equation}

374:

375:

376: \begin{thm}\label{th:stabgrad}{\it (Stability):}

377: The gradient system~(\ref{def:loggrad}) is completely stable.

378: \end{thm}

379: {\it Proof: See Appendix-A.}\\

380:

381: Developing a gradient system is one of the simplest transformations possible. One can think of more complicated nonlinear transformations as well. We will now describe three main guidelines that must be satisfied by the transformation:

382:

383: \begin{itemize}

384: \item{The original log-likelihood function must be a Lyapunov function for the dynamical system.}

385: \item{The location of the critical points must be preserved under this transformation.}

386: \item{The system must be completely stable. In other words, every trajectory $\Phi(x,t)$ must be bounded.}

387: \end{itemize}

388:

389:

390: From the implementation point of view, it is not required to construct this gradient system. However, to understand the details of our method, it is necessary to obtain this gradient system. For simplicity, we show the construction of the gradient system for

391: the case of spherical Gaussians. It can be easily extended to the

392: full covariance Gaussian mixture case. It should be noted that only

393: $(k-1)$ $\alpha$ values are considered in the gradient system because

394: of the unity constraint. The dependent variable $\alpha_k$ is

395: written as follows :

396:

397: \begin{equation}

398:  \alpha_k=1-\sum_{j=1}^{k-1} \alpha_j\label{eq:alphak}

399: \end{equation}
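
For reference, the components of the vector field in Eq.~(\ref{def:loggrad}) can be written out explicitly. The following is a sketch obtained by differentiating Eq.~(\ref{eq:otherlog1}) for the univariate density of Eq.~(\ref{eq:gaussiandensity}), with $\alpha_k$ substituted from Eq.~(\ref{eq:alphak}). Here $w_i^{(j)}$ denotes the posterior of Eq.~(\ref{eq:conditionw}) evaluated at the current point $\Theta$ rather than at an EM iterate, which is an assumption of notation made only for this sketch:
\begin{equation*}
\begin{split}
\frac{\partial f}{\partial \mu_i} &= -\sum_{j=1}^{n} w_i^{(j)}\,\frac{x^{(j)}-\mu_i}{\sigma_i^2}, \\
\frac{\partial f}{\partial \sigma_i} &= -\sum_{j=1}^{n} w_i^{(j)}\left[\frac{(x^{(j)}-\mu_i)^2}{\sigma_i^3}-\frac{1}{\sigma_i}\right], \\
\frac{\partial f}{\partial \alpha_i} &= -\sum_{j=1}^{n}\left[\frac{w_i^{(j)}}{\alpha_i}-\frac{w_k^{(j)}}{\alpha_k}\right], \qquad i=1,\ldots,k-1 .
\end{split}
\end{equation*}
Setting these expressions to zero gives back the fixed-point conditions of the EM updates for $\mu_i$, $\sigma_i^2$ and $\alpha_i$, which is consistent with the requirement that the transformation preserve the location of the critical points.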

400:

401: This gradient system and the decomposition points on the practical stability boundary of the stable equilibrium points will enable us to define a {\it Tier-1 stable equilibrium point}.

402:

403: \begin{lem}\label{def:tier1}

404: For a given stable equilibrium point ($x_s$), a {\it Tier-1 stable equilibrium point} is defined as a stable equilibrium point whose stability boundary intersects with the stability boundary of $x_s$.

405: \end{lem}

406:

407:

408: \begin{figure}

409:    \centering

410:    \subfigure[Parameter Space]{\includegraphics[width = 2.75 in]{Figures/algo1.ps}}\qquad

411:    \subfigure[Function Space]{\includegraphics[width = 2.75 in]{Figures/algo2.ps}}\qquad

412:    \caption{\label{fig:algo_s}Various stages of our algorithm in (a) Parameter space - the solid lines indicate the practical stability boundary.

413: Points highlighted on the stability boundary are the decomposition

414: points. The dotted arrows indicate the convergence of the EM

415: algorithm. The dashed lines indicate the neighborhood-search stage.

416: $x_1$ and $x_2$ are the exit points on the practical stability

417: boundary (b) Different points in the function space and their corresponding log-likelihood function values.

418:    }

419:  \end{figure}

420:

421:

422: \section{TRUST-TECH based Expectation Maximization}

423: \label{sec:algorithm}

424:

425: Our framework consists of three stages, namely: (i) the global stage, (ii) the local stage, and (iii) the neighborhood-search stage.

426: The last two stages are repeated in the solution space to obtain promising solutions. The global method identifies promising subspaces of the solution space. The next stage is the local stage (or the EM stage), where the results from the global method are refined to the corresponding locally optimal parameter set. Then, during the neighborhood-search stage, the exit points are computed

427: and the neighborhood solutions are systematically explored through these exit points. Fig. \ref{fig:algo_s} shows the different steps of our algorithm both in (a) the parameter space and (b) the function space.

428:

429: It is beneficial to use the TRUST-TECH based algorithm in the promising subspaces. In this sense, the neighborhood-search stage can act as an interface between global methods for initialization and the EM algorithm, which gives the local maxima.

430: This approach differs from

431: traditional local methods by computing multiple local maxima in

432: the neighborhood region. This also enhances user flexibility in choosing between different sets of good

433: clusterings. Though global methods can identify promising subspaces, it is

434: important to explore these more thoroughly, especially in

435: problems like parameter estimation.

436:

437:

438: \begin{algorithm}

439: \caption{TRUST-TECH based EM Algorithm} \label{nexttieralg}

440: \begin{algorithmic}

441: \STATE \textbf{Input:} Parameters $\Theta$, Data  $\mathcal{X}$,

442: tolerance $\tau$, Step $S_p$  \STATE \textbf{Output:}

443: $\widehat{\Theta}_{MLE} $  \STATE \textbf{Algorithm:}

444:

445: \STATE Apply global method and store the q promising solutions $

446: \Theta_{init}=\{\Theta_1,\Theta_2,..,\Theta_q\}$ ~~~~~~~~ Initialize

447: E= $\emptyset$

448:

449: \WHILE{$\Theta_{init} \neq \emptyset$} \STATE Choose $\Theta_i \in

450: \Theta_{init}$, set $\Theta_{init}= \Theta_{init} \backslash

451: \{\Theta_i\}$ \STATE $LM_i=EM(\Theta_i,\mathcal{X},\tau)$~~~~~~~~~

452: $E=E \cup \{LM_i\}$ \STATE Generate promising direction vectors

453: $d_j$ from $LM_i$

454:

455: \FOR {each $d_j$} \STATE Compute Exit Point ($X_j$) along $d_j$

456: starting from $LM_i$ by evaluating the log-likelihood function given

457: by (\ref {eq:log})

458:

459: \STATE $New_j=EM(X_j+\epsilon \cdot d_j,\mathcal{X},\tau)$

460: \IF{$New_j \notin E$} \STATE $E=E \cup \{New_j\}$ \ENDIF

461:

462: \ENDFOR

463:

464:    \ENDWHILE

465: \STATE $\widehat{\Theta}_{MLE} =\arg\max_{E_i \in E}\{val(E_i)\}$

466:

467: %

468: %\FOR{$k=1$ to $size(Dir)$} \STATE $Params[k]= Pset~~~~~~ExtPt=OFF$

469: %\STATE $Prev\_Val=Val~~~~~~~~~Cnt=0$

470: %

471: %\WHILE{$(!~ExtPt)~ \&\& ~(Cnt<Eval\_MAX)$} \STATE

472: %$Params[k]=update(Params[k],Dir[k],Step)$ \STATE $Cnt~=~Cnt~+~1$

473: %\STATE $Next\_Val~=~eval(Params[k])$

474: %

475: %\IF{$(Next\_Val~>~Prev\_Val)$} \STATE $ExtPt=ON$ \ENDIF

476: %

477: %\STATE $Prev\_Val=Next\_Val$

478: %    \ENDWHILE

479: %\IF{$count<Eval\_MAX$}

480: %    \STATE $Params[k]=update(Params[k],Dir[k],ASC)$

481: %\STATE $Params[k]=EM(Params[k],Data,Tol)$ \ELSE \STATE

482: %$Params[k]=NULL$ \ENDIF \ENDFOR \STATE \bf Return $Params[~]$

483: \end{algorithmic}

484: \end{algorithm}

485:

486:

487: In order to escape out of a found local maximum, our method needs to

488: compute certain promising directions based on the local behaviour of

489: the function. One can realize that generating these promising

490: directions is one of the important aspects of our algorithm.

491: Surprisingly, choosing random directions to move out of the local

492: maximum works well for this problem. One might also use other

493: directions like eigenvectors of the Hessian or incorporate some

494: domain-specific knowledge (like information about priors,

495: approximate location of cluster means, user preferences on the final

496: clusters) depending on the application that they are working on and

497: the level of computational expense that they can afford. We used

498: random directions in our work because they are very cheap to

499: compute. Once the promising directions are generated, exit points

500: are computed along these directions. {\it Exit points} are points of

501: intersection between any given direction and the practical stability

502: boundary of that local maximum along that particular direction. If

503: the stability boundary is not encountered along a given direction,

504: then there is a guarantee that one will not be able to find any new local maximum in

505: that direction. With a new initial guess in the vicinity of the exit

506: points, the EM algorithm is applied again to obtain a new local maximum. Sometimes, this new point ($X_j+\epsilon \cdot d_j$) might have convergence problems. In such cases, TRUST-TECH can help the convergence by integrating the dynamical system and obtaining another point that is much closer to the local optimal solution. However, this is not done here because computing the gradient of the log-likelihood function is expensive.
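
The exit-point computation along a single direction can be stated compactly. As a sketch, assume a fixed step $S_p$ and write $g(\lambda)$ for the log-likelihood restricted to the ray from $LM_i$ along $d_j$; both $g$ and $\lambda$ are notation introduced only for this illustration:
\begin{equation*}
g(\lambda)=\log p(\mathcal{X}\,|\,LM_i+\lambda\, d_j), \qquad \lambda\geq 0 .
\end{equation*}
Since $LM_i$ is a local maximum, $g$ initially decreases as $\lambda$ grows. Up to the resolution of the step, the exit point $X_j$ is located at the first grid value $\lambda_m=m\,S_p$ for which the log-likelihood starts to rise again, that is, the smallest $m$ with $g(\lambda_{m+1})>g(\lambda_m)$. The EM algorithm is then restarted from $X_j+\epsilon\cdot d_j$, which lies just beyond this one-dimensional minimum.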

507:

508:

509: \begin{algorithm}

510: \caption{Params[~] $TT\_EM(Pset,Data,Tol,Step)$} \label{nexttier5}

511: \begin{algorithmic}

512: \STATE$Val=eval(Pset)$ \STATE $Dir[~]=Gen\_Dir (Pset)$ \STATE

513: $Eval\_MAX=500$

514:

515: \FOR{$k=1$ to $size(Dir)$} \STATE $Params[k]= Pset~~~~~~ExtPt=OFF$

516: \STATE $Prev\_Val=Val~~~~~~~~~Cnt=0$

517:

518: \WHILE{$(!~ExtPt)~ \&\& ~(Cnt<Eval\_MAX)$} \STATE

519: $Params[k]=update(Params[k],Dir[k],Step)$ \STATE $Cnt~=~Cnt~+~1$

520: \STATE $Next\_Val~=~eval(Params[k])$

521:

522: \IF{$(Next\_Val~>~Prev\_Val)$} \STATE $ExtPt=ON$ \ENDIF

523:

524: \STATE $Prev\_Val=Next\_Val$

525:     \ENDWHILE

526: \IF{$Cnt<Eval\_MAX$}

527:     \STATE $Params[k]=update(Params[k],Dir[k],ASC)$

528: \STATE $Params[k]=EM(Params[k],Data,Tol)$ \ELSE \STATE

529: $Params[k]=NULL$ \ENDIF \ENDFOR \STATE \bf Return

530: $max(eval(Params[~]))$

531: \end{algorithmic}

532: \end{algorithm}

533:

534: \section{Implementation Details}

535: \label{sec:implementation}

536:

537: Our program is implemented in MATLAB and runs on a Pentium IV 2.8 GHz

538: machine. The main procedure implemented is $TT\_EM$ described in

539: Algorithm ~\ref{nexttier5}. The algorithm takes the mixture data and

540: the initial set of parameters as input along with step size for

541: moving out and tolerance for convergence in the EM algorithm. It

542: returns the set of parameters that correspond to the Tier-1

543: neighboring local optimal solutions. The procedure $eval$ returns

544: the log-likelihood score given by Eq. (\ref{eq:log}). The $Gen\_Dir$

545: procedure generates promising directions from the local maximum. Exit

546: points are obtained along these generated directions. The procedure

547: $update$ moves the current parameter to the next parameter set along

548: a given $k^{th}$ direction $Dir[k]$. Some of the directions might

549: have one of the following two problems: (i) exit points might not be

550: obtained in these directions. (ii) even if the exit point is

551: obtained it might converge to a less promising solution. If the exit

552: points are not found along these directions, search will be

553: terminated after $Eval\_MAX$ number of evaluations. For all exit

554: points that are successfully found, $EM$ procedure is applied and

555: all the corresponding neighborhood set of parameters are stored in

556: the $Params[~]$. To ensure that the new initial points are

557: in a new convergence region of the EM algorithm, one should move (along that particular direction) `$\epsilon$' away from the exit points. Since

558: different parameters will be of different ranges, care must be taken

559: while multiplying with the step sizes. It is important to use the

560: current estimates to get an approximation of the step size with

561: which one should move out along each parameter in the search space.

562: Finally, the solution with the highest likelihood score amongst the

563: original set of parameters and the Tier-1 solutions is returned.

564:

565:

566: \begin{figure*}[htp]

567:    \centering

568:   % \subfigure[]{\includegraphics[width = 1.2 in]{Figures/actual3.ps}}\qquad

569:    \subfigure[]{\includegraphics[width = 2.45 in]{Figures/init3.ps}}\qquad

570:    \subfigure[]{\includegraphics[width = 2.45 in]{Figures/initial3.ps}}\qquad

571:    \subfigure[]{\includegraphics[width = 2.45 in]{Figures/extpt3.ps}}\qquad

572:    \subfigure[]{\includegraphics[width = 2.45 in]{Figures/final3.ps}}\qquad

573:   % \subfigure[]{\includegraphics[width = 1.5 in]{Figures/SREM1.ps}}

574:    \caption{\label{fig:diagcov}Parameter estimates at various

575:    stages of our algorithm on the three component Gaussian mixture

576:    model (a) Poor random initial guess (b) Local maximum

577:    obtained after applying EM algorithm with the poor initial

578:    guess (c) Exit point obtained by our algorithm (d) The final

579:    solution obtained by applying the EM algorithm using the exit point as the initial guess.

580:    }

581:  \end{figure*}

582: \section{Results and Discussion}

583: \label{sec:results}

584:

585: Our algorithm has been tested on both synthetic and real datasets.

586: The initial values for the centers and the covariances were chosen

587: uniformly at random. Uniform priors were chosen for initializing the

588: components. For real datasets, the centers were chosen randomly from

589: the sample points.

590:

591: \begin{figure}[htp]

592: \centerline{

593:   \epsfig{figure=Figures/SREM1.ps, width=4.0in}

594: } \caption{Log-likelihood versus number of evaluations. A corresponds to

595: the original local maximum (L=-3235.0). B corresponds to the exit

596: point (L=-3676.1). C corresponds to the new initial point (L=-3657.3) after moving out by

597: `$\epsilon$'. D corresponds to the new local maximum (L=-3078.7).}

598: \label{fig:eval}

599: \end{figure}

600:

601: \subsection{Synthetic Datasets}

602: A simple synthetic dataset with 40 samples and 5 spherical Gaussian

603: components was generated and tested with our algorithm. Priors were

604: uniform and the standard deviation was 0.01. The centers for the

605: five components are given as follows: $\mu_1=[0.3~0.3]^T$,

606: $\mu_2=[0.5~0.5]^T$, $\mu_3=[0.7~0.7]^T$, $\mu_4=[0.3~0.7]^T$ and

607: $\mu_5=[0.7~0.3]^T$.

608:

609:

610: %\begin{figure*}

611: %   \centering

612: %   \subfigure[Caption A]{\includegraphics[width = 2.75 in]{Figures/actual3.ps}}\qquad

613: %   \subfigure[Caption B]{\includegraphics[width = 2.75 in]{Figures/actual3.ps}}\qquad

614: %   \caption{\label{fig-sphergauss}Relative weight reduction for the

615: %   schedule trees produced on subsets of size (a) 10\% (b) 25\% (c)

616: %   50\% (d) 75\%. The baseline in this case is chosen as the

617: %   smaller of (i) a sort of the raw data set for each view or (ii)

618: %   computation of the full cube.}

619: % \end{figure*}

620:

621: %\begin{figure}[htp]

622: %\centerline{

623: %  \epsfig{figure=Figures/actual3.ps, width=2.8in}

624: %} \caption{True mixture of the three Gaussian components with 900

625: %samples.} \label{fig:true}

626: %\end{figure}

627:

628:

629: The second dataset is a diagonal covariance case containing

630: $n=900$ data points. The data were generated from a two-dimensional,

631: three-component Gaussian mixture distribution with mean vectors at

632: $[0~-2]^T$, $[0~0]^T$, $[0~2]^T$ and the same diagonal covariance matrix

633: with values 2 and 0.2 along the diagonal \cite{Ueda98}. All

634: three components have uniform priors. Fig. \ref{fig:diagcov} shows

635: various stages of our algorithm and demonstrates how the clusters

636: obtained from existing algorithms are improved using our algorithm.

637: The initial clusters obtained are of low quality because of the poor

638: initial set of parameters. Our algorithm takes these clusters and

639: applies the neighborhood-search stage and the EM stage simultaneously to

640: obtain the final result. Fig. \ref{fig:eval} shows the value of the

641: log-likelihood during the neighborhood-search stage and the EM

642: iterations.
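The four log-likelihood values given in the caption of Fig. \ref{fig:eval} can be combined into a short worked calculation. The subscripted symbols $L_A,\ldots,L_D$ are shorthand introduced here for the values at points A to D:
\begin{displaymath}
\begin{array}{lcl}
L_B - L_A &=& -3676.1 - (-3235.0) = -441.1 \\
L_C - L_B &=& -3657.3 - (-3676.1) = 18.8 \\
L_D - L_C &=& -3078.7 - (-3657.3) = 578.6 \\
L_D - L_A &=& -3078.7 - (-3235.0) = 156.3
\end{array}
\end{displaymath}
Reaching the exit point thus lowers the log-likelihood by 441.1, the `$\epsilon$' step recovers 18.8, and the subsequent EM iterations gain 578.6, for a net improvement of 156.3 over the original local maximum.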

643:

644:

645: In the third synthetic dataset, a more complicated mixture of overlapping

646: Gaussians is considered \cite{Figueiredo02}. The parameters

647: are as follows: $\mu_1=\mu_2=[-4~-4]^T$, $\mu_3=[2~2]^T$ and

648: $\mu_4=[-1~-6]^T$; $\alpha_1=\alpha_2=\alpha_3=0.3$ and

649: $\alpha_4=0.1$; and the covariance matrices are

650: \begin{displaymath}

651:      \Sigma_1=\left[ \begin{array}{cc} 1 &0.5\\  0.5 & 1 \end{array}

652:      \right]~~~~~~~~~~~~~~~\Sigma_2=\left[ \begin{array}{cc} 6 &-2\\  -2 & 6 \end{array}

653:      \right]

654:  \end{displaymath}

655: \begin{displaymath}

656:     \Sigma_3=\left[ \begin{array}{cc} 2 &-1\\  -1 & 2 \end{array}

657:      \right]~~~~~~~~~~~~~~~ \Sigma_4=\left[ \begin{array}{cc} 0.125 &0\\  0 &0.125  \end{array}

658:      \right]

659:  \end{displaymath}
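As a quick check that these parameters define a valid mixture (a worked verification added for illustration, not part of the cited setup), note that the priors sum to one, $0.3+0.3+0.3+0.1=1$, and that a symmetric matrix of the form $\left[ \begin{array}{cc} a & b\\ b & a \end{array} \right]$ has eigenvalues $a+b$ and $a-b$. Writing $\lambda(\cdot)$ for the set of eigenvalues (notation assumed here),
\begin{displaymath}
\lambda(\Sigma_1)=\{1.5,~0.5\},~~~\lambda(\Sigma_2)=\{4,~8\},~~~\lambda(\Sigma_3)=\{1,~3\},~~~\lambda(\Sigma_4)=\{0.125,~0.125\}.
\end{displaymath}
All eigenvalues are positive, so every $\Sigma_i$ is positive definite.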

660:

661:

662: %\begin{figure}[htp]

663: %\centerline{

664: %  \epsfig{figure=Figures/data2.ps, width=2.8in}

665: %} \caption{True mixtures of the more complicated overlapping

666: %Gaussian case with 1000 samples. This dataset was used to show the

667: %improvements in the performance by varying the number of data

668: %points.} \label{fig:third}

669: %\end{figure}

670:

671: \begin{table*}[htp]

672: \centering \caption{\protect Performance of the TRUST-TECH-EM algorithm,

673: averaged over 100 runs on various synthetic and real datasets, compared with the random-start EM algorithm. Entries are the mean $\pm$ standard deviation of the log-likelihood.}

674: \begin{center}

675: \begin{small}

676: \begin{tabular}{|c|c|c|c|c|c|}

677: \hline

678: Dataset &Samples & Clusters & Features &  EM(mean $\pm$ std) & TT-EM(mean $\pm$ std)\\

679: \hline Spherical & 40& 5 & 2  & 38.07$\pm$2.12 & 43.55$\pm$0.6 \\

680: \hline Elliptical & 900& 3 & 2  &  -3235$\pm$0.34 & -3078.7$\pm$0.03 \\

681: \hline FC1& 500& 4 & 2  & -2345.5 $\pm$175.13 & -2121.9$\pm$ 21.16 \\

682: \hline FC2& 2000& 4 & 2  & -9309.9 $\pm$694.74 & -8609.7 $\pm$37.02 \\

683: \hline Iris & 150& 3 & 4  & -198.13$\pm$27.25 &-173.63$\pm$11.72\\

684: \hline Wine & 178& 3 & 13  & -1652.7$\pm$1342.1& -1618.3$\pm$1349.9\\

685: \hline

686: \end{tabular}

687: \end{small}

688: \end{center}

689: \label{TB:results5}

690: \end{table*}

691:

692: \subsection{Real Datasets}

693: Two real datasets obtained from the UCI Machine Learning repository

694: \cite{Blake98} were also used for testing the performance of our

695: algorithm. The widely used Iris dataset, with 150 samples, 3 classes

696: and 4 features, was used. The Wine dataset, with 178 samples,

697: 3 classes and 13 features, was also used for testing. For these

698: real datasets, the class labels were removed, thus treating the task as an unsupervised learning problem. Table~\ref{TB:results5} summarizes our

699: results over 100 runs. The mean and the standard deviations of the

700: log-likelihood values are reported. The traditional EM algorithm

701: with random starts is compared against our algorithm on both

702: synthetic and real data sets. Our algorithm obtains a higher mean

703: log-likelihood on every dataset. On all datasets except Wine, the

704: standard deviation of our results is also lower, which indicates that the

705: improved solution is obtained consistently. In the case of the Wine data, the

706: improvement with our algorithm is less significant than for

707: the other datasets. This might be because the data are

708: not well described by Gaussian components; our method assumes that the

709: underlying distribution of the data is a mixture of Gaussians.

710: Table~\ref{TB:compresults5} gives the results of TRUST-TECH-EM

711: compared with other methods like split and merge EM and k-means+EM

712: proposed in the literature.
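The size of the improvement can be read directly from Table~\ref{TB:results5} by subtracting the mean log-likelihood of random-start EM from that of TRUST-TECH-EM. The differences below are simple arithmetic on the tabulated means and are added here only as an illustration:
\begin{displaymath}
\begin{array}{lrcl}
\mbox{Spherical:} & 43.55 - 38.07 &=& 5.48 \\
\mbox{Elliptical:} & -3078.7 - (-3235) &=& 156.3 \\
\mbox{FC1:} & -2121.9 - (-2345.5) &=& 223.6 \\
\mbox{FC2:} & -8609.7 - (-9309.9) &=& 700.2 \\
\mbox{Iris:} & -173.63 - (-198.13) &=& 24.50 \\
\mbox{Wine:} & -1618.3 - (-1652.7) &=& 34.4
\end{array}
\end{displaymath}
For Wine, the gain of 34.4 is under 3\% of the standard deviation of either method (1342.1 and 1349.9), whereas for FC1 the gain of 223.6 exceeds the random-start EM standard deviation of 175.13.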

713:

714:

715: %Enzyme Data \cite{Richardson97}

716:

717: %\begin{figure}[htp]

718: %\centerline{

719: %  \epsfig{figure=Figures/gmm.ps, width=2.5in}

720: %} \caption{Enzyme data being modeled with 4 components.}

721: %\label{fig:summary}

722: %\end{figure}

723:

724:

725: %Iris Data from the  UCI Machine Learning repository

726: %\cite{Blake98}.

727:

728: \begin{table}[htp]

729: \centering \caption{\protect Comparison of TRUST-TECH-EM with

730: other methods}

731: \begin{center}

732: \begin{tabular}{|c|c|c|}

733: \hline

734: Method & Elliptical & Iris\\

735: \hline RS+EM & -3235 $\pm$ 14.2 & -198 $\pm$ 27\\

736: \hline K-Means+EM & -3195 $\pm$ 54&-186 $\pm$ 10\\

737: \hline SMEM &-3123 $\pm$ 54&-178.5 $\pm$ 6\\

738: \hline TRUST-TECH-EM &-3079 $\pm$ 0.03 &-173.6 $\pm$ 11\\

739: \hline

740: \end{tabular}

741: \end{center}

742: \label{TB:compresults5}

743: \end{table}

744: \subsection{Discussion}

745: It will be effective to use TRUST-TECH-EM for those solutions that

746: appear to be promising. Due to the nature of the problem, it is very

747: likely that the nearby solutions surrounding the existing solution

748: will be more promising. One of the primary advantages of our method

749: is that it can be used along with other popular methods available

750: and improve the quality of the existing solutions. In clustering

751: problems, it is an added advantage to perform refinement of the

752: final clusters obtained. Most of the focus in the literature has been on

753: new methods for initialization or new clustering techniques which

754: often do not take advantage of the existing results and completely

755: start the clustering procedure ``{\it from scratch}''. Though shown

756: only for the case of multivariate Gaussian mixtures, our technique

757: can be effectively applied to any parametric finite mixture model.

758:

759: %As shown in fig. \ref{fig:diagcov}, our algorithm can help in

760: %improving the quality of the existing clusters. The initial

761: %algorithm identified only one of the three clusters correctly even

762: %in such a simple and clearly separated case. Our algorithm used

763: %existing solution and identified three distinct clusters that are

764: %identical to the true mixtures.

765:

766: \begin{table}[htp]

767: \centering \caption{\protect Number of iterations taken for

768: the convergence of the best solution. }

769: \begin{center}

770: \begin{tabular}{|c|c|c|}

771: \hline

772: Dataset & Avg. no. of  & No. of iterations \\

773:  &iterations & for the best solution\\

774: \hline Spherical &126& 73\\

775: \hline Elliptical &174& 86\\

776: \hline Full covariance &292&173\\

777: \hline

778: \end{tabular}

779: \end{center}

780: \label{TB:convresults}

781: \end{table}

782: Table \ref{TB:convresults} summarizes the average number of

783: iterations taken by the EM algorithm for the convergence to the

784: local optimal solution. We can see that the most promising solution

785: produced by our TRUST-TECH methodology converges much faster. In

786: other words, our method can effectively take advantage of the fact

787: that the convergence of the EM algorithm is much faster for high

788: quality solutions. We exploit this inherent property of the EM algorithm

789: to improve the efficiency of our algorithm. Hence, for obtaining the Tier-1 solutions using our algorithm, the

790: threshold for the number of iterations can be significantly lowered.
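The saving can be quantified from Table \ref{TB:convresults} as the ratio of the iterations needed for the best solution to the average number of iterations. This is a worked calculation on the tabulated values, added for illustration:
\begin{displaymath}
\frac{73}{126} \approx 0.58, \qquad \frac{86}{174} \approx 0.49, \qquad \frac{173}{292} \approx 0.59
\end{displaymath}
for the spherical, elliptical and full covariance cases respectively. On these three datasets the best solution therefore converges in roughly half to three fifths of the average number of EM iterations.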

791:

792: \newpage

793: \section*{APPENDIX-A: Proof of Theorem \ref{th:stabgrad}} \label{sec:appendix-A}

794: \begin{proof}

795: First, we will show that every bounded trajectory will converge to one of the equilibrium points. Second, we will show that every trajectory is bounded \cite{Chiang96}.

796: \begin{enumerate}

797: \item{Let $\Phi(x,t)$ denote the bounded trajectory starting at $x$. Computing the time derivative along the trajectory, we get

798: \begin{equation*}

799: \frac{d}{dt}f(\Phi(x,t))=(\nabla f(\Phi(x,t)))^T\frac{d\Phi(x,t)}{dt}=-(\nabla f(\Phi(x,t)))^T(\nabla f(\Phi(x,t)))~\leq~0

800: \end{equation*}

801: Also, we know that $\frac{d}{dt}f(\Phi(x,t))~=~0$ if, and only if, $\Phi(x,t) \in E$. Hence, $f(x)$ is a Lyapunov function of the gradient system (\ref{def:loggrad}) and the $\omega$-limit set of any bounded trajectory consists of equilibrium points only, i.e., any bounded trajectory will approach one of the equilibrium points.}

802: \item{

803: Following the proof of Proposition 1 presented in \cite{Chiang96}, we can show that every trajectory $\Phi(x,t)$ is bounded. To do so, we have to show that the magnitude of the gradient of the log-likelihood function for the Gaussian mixture model is bounded on the entire domain of the parameter space.

804:

805: \begin{equation}\label{eq:logeq}

806: \log p(\mathcal{Y}|\Theta) =-\sum_{j=1}^n \log \sum_{i=1}^k \alpha_i ~p({y}^{(j)}|{\theta}_i)

807: \end{equation}

808:

809: Now, the domain of the parameter space is given as follows:

810:

811: $-\infty <\mu_i<\infty$, $\Sigma_i$ is positive definite and $0\leq \alpha_i \leq 1$, where $\sum_{i=1}^{k}\alpha_i=1$.

812:

813: %We proved earlier that $log ~p(\mathcal{Y}|\Theta)$ is atleast twice differentiable. Since gradient vector is continuous we can safely conclude that it is bounded on compact subsets of the domain of the parameter space.

814: First, let us focus on $\alpha$ because it is a constrained variable.

815:

816: \textbf{Derivative with respect to $\alpha$}:

817:

818: \begin{equation}\label{eq:partalpha}

819:     \frac{\partial f}{\partial \alpha_r}= \sum_{j=1}^{n}\left[ \frac{p({y}^{(j)}|{\theta}_r)}{\sum_{i=1}^{k}\alpha_i p({y}^{(j)}|{\theta}_i)}\right]

820: \end{equation}

821: As $\alpha_r \rightarrow 1$, we have

822: \begin{equation*}\label{eq:alphatends1}

823:     \frac{\partial f}{\partial \alpha_r}= \sum_{j=1}^{n}\left[ \frac{p({y}^{(j)}|{\theta}_r)}{1 \cdot p({y}^{(j)}|{\theta}_r)}\right] = n <\infty

824: \end{equation*}

825: As $\alpha_r \rightarrow 0$, we have

826: \begin{equation*}\label{eq:alphatends0}

827:     \frac{\partial f}{\partial \alpha_r}=\sum_{j=1}^{n}\left[ \frac{p({y}^{(j)}|{\theta}_r)}{\sum_{i=1,i\neq r}^{k}\alpha_i p({y}^{(j)}|{\theta}_i)}\right] <\infty

828: \end{equation*}

829:

830: Hence, the derivatives with respect to $\alpha$ are bounded.

831:

832: \textbf{Derivative with respect to $\mu$}:

833: \begin{equation}\label{eq:partmu}

834:     \frac{\partial f}{\partial \mu_r}= \sum_{j=1}^{n}\left[ \frac{\alpha_r \frac{1}{\sqrt{2\pi}\sigma_r} e^{-\frac{(y^{(j)}-\mu_r)^2}{2\sigma_r^2}}\cdot\frac{1}{\sigma_r^2}(y^{(j)}-\mu_r)}{\sum_{i=1}^{k}\alpha_i p({y}^{(j)}|{\theta}_i)}\right]

835: \end{equation}

836: This is bounded for any $\mu_r \in \Re$, since the exponential factor decays faster than the linear factor $(y^{(j)}-\mu_r)$ grows.

837:

838: \textbf{Derivative with respect to $\sigma$}:

839: \begin{equation}\label{eq:partsigma}

840:     \frac{\partial f}{\partial \sigma_r}= \sum_{j=1}^{n}\left[ \frac{\frac{\alpha_r}{\sqrt{2\pi}}\left(\frac{1}{\sigma_r} e^{-\frac{(y^{(j)}-\mu_r)^2}{2\sigma_r^2}}\cdot \frac{(y^{(j)}-\mu_r)^2}{\sigma_r^3} -\frac{1}{\sigma_r^2} e^{-\frac{(y^{(j)}-\mu_r)^2}{2\sigma_r^2}}\right) }{\sum_{i=1}^{k}\alpha_i p({y}^{(j)}|{\theta}_i)}\right]

841: \end{equation}

842:

843: As $\sigma_r \rightarrow 0$, the exponential factor goes to zero faster than any power of $\frac{1}{\sigma_r}$ goes to infinity. Hence, the derivative is bounded. So, the gradient of the log-likelihood function is bounded in the entire domain of the parameter space.

844:

845: }

846: \end{enumerate}
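The limit invoked for the derivative with respect to $\sigma_r$ can be made explicit with a short calculation. As a sketch, assume a fixed sample with $d = |y^{(j)}-\mu_r| > 0$ and an integer power $m \geq 1$; both $d$ and $m$ are notation introduced here for illustration. Substituting $u = 1/\sigma_r^2$,
\begin{equation*}
\lim_{\sigma_r \rightarrow 0^+} \frac{1}{\sigma_r^{m}}\, e^{-\frac{d^2}{2\sigma_r^2}} ~=~ \lim_{u \rightarrow \infty} u^{m/2}\, e^{-\frac{d^2 u}{2}} ~=~ 0,
\end{equation*}
because exponential decay dominates any polynomial growth. Taking $m=4$ and $m=2$ covers the two terms in the numerator of the derivative with respect to $\sigma_r$.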

847:

848: \end{proof}

849: