1: \chapter{TRUST-TECH based Expectation Maximization for Learning Mixture Models}
2: \label{ch:trust-tech-em}
3:
4: In this chapter, we develop a TRUST-TECH based algorithm for solving the problem of mixture modeling. In the field of statistical pattern recognition, finite mixtures
5: allow a probabilistic model-based approach to unsupervised learning
\cite{McLachlan88}. One of the most popular methods for fitting
mixture models to observed data is the {\it
Expectation-Maximization} (EM) algorithm, which converges locally to a
maximum likelihood estimate of the mixture parameters
\cite{Demspter77,Redner84}. The usual steepest descent, conjugate
gradient, or Newton-Raphson methods are too complicated for use in
solving this problem \cite{Xu96}. EM has become a popular method
since it takes advantage of problem-specific properties. EM based
methods have been successfully applied to a wide range of
problems that arise in pattern recognition \cite{Baum70,Bilmes98},
clustering \cite{Banfield93}, information retrieval \cite{Nigam00},
computer vision \cite{Carson02}, and data mining \cite{Shumway82}.
18:
19: Without loss of generality, we will consider the problem of learning parameters of
Gaussian Mixture Models (GMMs). Fig.~\ref{fig:gmm} shows data
generated by three Gaussian components with different means and
variances. Note that every data point has a probabilistic (or soft)
23: membership that gives the probability with which it belongs to each
24: of the components. Points that belong to component 1 will have high
25: probability of membership for component 1. On the other hand, data
26: points belonging to components 2 and 3 are not well separated. The
27: problem of learning mixture models involves estimating the
28: parameters of these components and finding the probabilities
with which each data point belongs to these components. Given the
number of components and an initial set of parameters, the EM algorithm
computes parameter estimates that locally maximize the likelihood of
the data. However, the main problem with the EM algorithm is
that it is a `{\it greedy}' method which is very sensitive to the
given initial set of parameters. To overcome this problem, a novel
three-stage algorithm is proposed \cite{Reddy07}. The main research concerns that motivated the new algorithm
presented in this chapter are:
38: \begin {itemize}
\item{The EM algorithm converges to a local maximum
of the likelihood function very quickly.}
41:
42: \item{There are several other promising local optimal solutions in
43: the vicinity of the solutions obtained from the methods that
44: provide good initial guesses of the solution.}
45:
\item{Model selection criteria usually assume that the global
optimal solution of the log-likelihood function can be obtained.
However, achieving this is computationally intractable.}
49:
50: \item{Some regions in the search space do not contain any promising solutions.
51: The promising and non-promising regions coexist and it becomes
52: challenging to avoid wasting computational resources to search in
53: non-promising regions.}
54:
55: \end{itemize}
56:
57:
58: Of all the concerns mentioned above, the fact that most of the local
59: maxima are not distributed uniformly \cite{Ueda98} makes it
60: important to develop algorithms that can avoid searching in the low-likelihood regions and focus on exploring promising subspaces more thoroughly. This
61: subspace search will also be useful for making the solution less
sensitive to the initial set of parameters. Here, we
propose a novel three-stage algorithm for estimating the parameters of
mixture models. By combining the TRUST-TECH method with the EM algorithm
to exploit the problem-specific features of
mixture models, the proposed three-stage algorithm seeks the optimal set of parameters
by searching for the global maximum in a
systematic manner.
69:
70:
71: \begin{figure}[htp]
72: \centerline{
73: \epsfig{figure=Figures/gmm1.ps, width=3.5in}
} \caption{Data consisting of three Gaussian components with
different mean and variance values. Note that a data point does not have
a hard membership assigning it to only one component. Most of the
points in the first component belong to it with high probability;
in this case, the other components do not have
much influence. The data points of components 2 and 3 are not clearly separated. The
problem of learning mixture models involves estimating the
parameters of the Gaussian components and finding the
probabilities with which each data sample belongs to each component.}
83: \label{fig:gmm}
84: \end{figure}
85:
86:
87: \section{Relevant Background}
88: \label{sec:background} Although EM and its variants have been extensively used for learning
89: mixture models, several researchers have approached the problem by
identifying new techniques that give good initial points. More
generic techniques like deterministic annealing \cite
{Rose98,Ueda98} and genetic algorithms \cite{Pernkopf05,Martínez00}
have been applied to obtain a good set of parameters. Though these
techniques have asymptotic guarantees, they are very time consuming
and hence may not be usable in most practical applications.
96: Some problem specific algorithms like split and merge EM
97: \cite{Ueda00}, component-wise EM \cite{Figueiredo02}, greedy
98: learning \cite{Verbeek03}, incremental version for sparse
99: representations \cite{neal98}, parameter space grid \cite{Li99} are
also proposed in the literature. Some of these algorithms are either
computationally very expensive or infeasible when learning mixtures
in high dimensional spaces \cite{Li99}. In spite of all the expense
of these methods, very little effort has been made to explore
promising subspaces within the larger parameter space. Most of these
algorithms eventually apply the EM algorithm to move to a locally
maximal set of parameters on the likelihood surface. Simple approaches like running EM from several random
initializations, and then choosing the final estimate that leads to
the local maximum with the highest value of the likelihood, can be successful to a certain extent \cite{hastie96,Roberts98}.
109:
110:
111: Though some of these methods apply other additional mechanisms (like
112: perturbations \cite{Elidan02}) to escape out of the local optimal
113: solutions, systematic methods are yet to be developed for searching
114: the subspace. The dynamical system of the log-likelihood function
115: reveals more information about the topology of the nonlinear log-likelihood surface \cite{Chiang96}. Hence, the
116: difficulties of finding good solutions when the error surface is
117: very rugged can be overcome by understanding the geometric and dynamic characteristics of the log-likelihood surface. Though this method might introduce some additional cost, one
118: has to realize that existing approaches are much more expensive due
119: to their stochastic nature. Specifically, for a problem in this
120: context, where there is a non-uniform distribution of local maxima,
121: it is difficult for most of the methods to search neighboring
regions \cite{Zhang04}. For this reason, it is more desirable to
apply the TRUST-TECH based Expectation Maximization (TT-EM) algorithm
after obtaining some point in a promising region. The main
advantages of the proposed algorithm are that it:
126:
127: \begin{itemize}
\item{Explores most of the neighboring local optimal solutions, unlike the traditional stochastic algorithms.}
\item{Acts as a flexible interface between the EM algorithm and other global methods. Sometimes, a global method will optimize an approximation of the original function. Hence, it is important to provide an interface between the EM algorithm and the global method.}
\item{Allows the user to work with existing clusters obtained from the traditional approaches and improves the quality of the solutions based on the maximum likelihood criterion.}
\item{Helps the expensive global methods to terminate early.}
\item{Exploits the heuristic that the EM algorithm converges at a faster rate if the solutions are promising.}
133: \end{itemize}
134:
\noindent While trying to obtain multiple optimal solutions, TRUST-TECH can dynamically change the threshold for the number of iterations. For example, while computing Tier-1 solutions, if a promising solution has been obtained within a few iterations, then all the remaining Tier-1 solutions will use this iteration count as their threshold.
136:
137: \section{Preliminaries}
138: \label{sec:problem} We will now introduce some necessary preliminaries on
139: mixture models, EM algorithm and nonlinear transformation. Table~\ref{TB:datanot} gives the notations used in this chapter :
140:
141: \begin{table}[h]
142: \centering \caption{\protect Description of the Notations
143: used}
144: \begin{center}
145: \begin{tabular}{cl}
146: \hline Notation & Description \\
147: \hline
148:
$d$ & number of features\\
$n$ & number of data points\\
$k$ & number of components\\
$s$ & total number of parameters\\
153: $\Theta$ & parameter set\\
154: $\theta_i$ & parameters of $i^{th}$ component\\
155: $\alpha_i$ & mixing weights for $i^{th}$ component\\
156: $\mathcal{X}$ & observed data\\
157: $\mathcal{Z}$ & missing data\\
158: $\mathcal{Y}$ & complete data\\
$t$ & timestep for the estimates\\
160: \hline
161: \end{tabular}
162: \end{center}
163: \label{TB:datanot}
164: \end{table}
165:
166: \subsection{Mixture Models}
167:
Let us assume that there are $k$ Gaussians in the mixture model. The
form of the probability density function is as follows:
170:
171: \begin{equation}\label{eq:gaussian1}
172: p(x|\Theta) =\sum_{i=1}^{k}{\alpha_i p(x|\theta_i)}
173: \end{equation}
174:
\noindent where $x=[x_1,x_2,...,x_d]^T$ is the feature vector of $d$
dimensions. The $\alpha_i$'s represent the {\it mixing weights}. $\Theta$ represents the parameter set ($\alpha_1, \alpha_2,...,
\alpha_k,\theta_1,\theta_2,...,\theta_k$) and $p$ is a univariate
Gaussian density parameterized by $\theta_i$ (i.e., $\mu_i$ and
$\sigma_i$):
180:
181: \begin{equation}\label{eq:gaussiandensity}
182: p(x|\theta_i) =\frac{1}{\sqrt{(2\pi)}\sigma_i}e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}
183: \end{equation}
184:
Also, it should be noted that, being probabilities, the $\alpha_i$ must
satisfy

\begin{equation}\label{eq:probabi}
0 \leq \alpha_i\leq 1 ~,~ \forall i=1,..,k,~ \mbox{and} ~~ \sum_{i=1}^k
\alpha_i=1
\end{equation}
192:
Given a set of $n$ i.i.d.\ samples
$\mathcal{X}=\{x^{(1)},x^{(2)},..,x^{(n)}\}$, the log-likelihood
corresponding to a mixture is

\begin{equation}\label{eq:log}
\begin{split}
\log p(\mathcal{X}|\Theta) &= \log \prod_{j=1}^n
p(x^{(j)}|\Theta)\\
&= \sum_{j=1}^n \log \sum_{i=1}^k \alpha_i
\, p(x^{(j)}|\theta_i)
\end{split}
\end{equation}
205:
The goal of learning mixture models is to obtain the parameters
$\widehat{\Theta}$ from a set of $n$ data points which are samples
of a distribution with density given by (\ref{eq:gaussian1}). The
{\it Maximum Likelihood Estimate} (MLE) is given by:
\begin{equation}\label{eq:MLE}
\widehat{\Theta}_{MLE} = \arg \max_{\Theta \in \tilde{\Theta}} ~\{~\log
~p(\mathcal{X}|\Theta)~\}
\end{equation}

where $\tilde{\Theta}$ indicates the entire parameter space. Since
this MLE cannot be found analytically for mixture models, one has to
rely on iterative procedures to search for the global maximum of
$\log p(\mathcal{X}|\Theta)$. The EM algorithm described in the next
section has been used successfully to find a local maximum of such
a function \cite{McLachlan97}.
221:
222: \subsection{Expectation Maximization}
223:
The EM algorithm assumes $\mathcal{X}$ to be {\it observed} data. The
missing part, termed {\it hidden} data, is a set of $n$ labels
$\mathcal{Z}=\{\mathbf{z}^{(1)},\mathbf{z}^{(2)},..,\mathbf{z}^{(n)}\}$ associated with the $n$
samples, indicating which component produced each sample
\cite{McLachlan97}. Each label $\mathbf{z}^{(j)}=[z_1^{(j)},z_2^{(j)},..,z_k^{(j)}]$ is a binary vector
where $z_i^{(j)}=1$ and $z_m^{(j)}=0$ $\forall m \neq i$ means that the
sample $x^{(j)}$ was produced by the $i^{th}$ component. Now, the
complete log-likelihood, i.e., the one from which we would estimate
$\Theta$ if the {\it complete data}
$\mathcal{Y}=~\{~\mathcal{X},\mathcal{Z}~\}$ were observed, is
236:
\begin{equation*}\label{eq:beforecomplete}
\log p(\mathcal{X},\mathcal{Z}|\Theta)=\sum_{j=1}^n \log \prod_{i=1}^k
~ [~\alpha_i~p(x^{(j)}|\theta_i)~]^{z_i^{(j)}}
\end{equation*}


\begin{equation}\label{eq:complete}
\log p(\mathcal{Y}|\Theta)=\sum_{j=1}^n \sum_{i=1}^k
z_i^{(j)}\log [~\alpha_i~p(x^{(j)}|\theta_i)~]
\end{equation}
247:
248: The EM algorithm produces a sequence of estimates
249: $\{\widehat{\Theta}(t),t=0,1,2,...\}$ by alternately applying the
250: following two steps until convergence :
251:
252: \begin {itemize}
253: \item{{\bf E-Step : } Compute the conditional expectation of the
254: hidden data, given $\mathcal{X}$ and the current estimate
255: $\widehat{\Theta}(t)$. Since $log~p(\mathcal{X,Z}|\Theta)$ is linear
256: with respect to the missing data $\mathcal{Z}$, we simply have to
257: compute the conditional expectation $\mathcal{W} \equiv
258: E[\mathcal{Z|X},\widehat{\Theta}(t)]$, and plug it into $log ~p
259: (\mathcal{X,Z}|\Theta)$. This gives the $Q$-function as follows :
260:
261:
\begin{equation}\label{eq:qfunc}
Q(\Theta|\widehat{\Theta}(t))\equiv
E_{\mathcal{Z}}[\log p(\mathcal{X},\mathcal{Z}|\Theta)\,|\,\mathcal{X},\widehat{\Theta}(t)]
\end{equation}
268:
Since each $z_i^{(j)}$ is binary, its conditional expectation
is given by:

\begin{equation}\label{eq:conditionw}
\begin{split}
w_i^{(j)} &\equiv E~[~z_i^{(j)}|\mathcal{X},\widehat{\Theta}(t)~] \\
&= Pr~[~z_i^{(j)}=1|x^{(j)},\widehat{\Theta}(t)~]\\
&=
\frac{\widehat{\alpha}_i(t) p(x^{(j)}|\widehat{\theta}_i(t))}{\sum_{m=1}^{k}{\widehat{\alpha}_m(t) p(x^{(j)}|\widehat{\theta}_m(t))}}
\end{split}
\end{equation}

where the last equality is simply Bayes' rule ($\alpha_i$ is the a
priori probability that $z_i^{(j)}=1$), while $w_i^{(j)}$ is the a
posteriori probability that $z_i^{(j)}=1$ given the observation
$x^{(j)}$.}
285:
286: \item{{\bf M-Step : } The estimates of the new parameters are
287: updated using the following equation :
288: \begin{equation}\label{eq:update}
\widehat{\Theta} (t+1) = \arg \max_{\Theta}\{Q(\Theta|\widehat{\Theta}(t))\}
290: \end{equation}
291: }
292: \end{itemize}
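A standard argument, sketched here for completeness, shows why alternating the two steps above never decreases the log-likelihood. The symbol $H$ is introduced only for this sketch and is not part of the notation in Table~\ref{TB:datanot}. Since $p(\mathcal{X}|\Theta)=p(\mathcal{X},\mathcal{Z}|\Theta)/p(\mathcal{Z}|\mathcal{X},\Theta)$, taking the expectation over $\mathcal{Z}$ given $\mathcal{X}$ and $\widehat{\Theta}(t)$ gives
\begin{equation*}
\log p(\mathcal{X}|\Theta) = Q(\Theta|\widehat{\Theta}(t)) + H(\Theta|\widehat{\Theta}(t)), \qquad
H(\Theta|\widehat{\Theta}(t)) \equiv -E_{\mathcal{Z}}[\log p(\mathcal{Z}|\mathcal{X},\Theta)\,|\,\mathcal{X},\widehat{\Theta}(t)] .
\end{equation*}
By Gibbs' inequality, $H(\Theta|\widehat{\Theta}(t)) \geq H(\widehat{\Theta}(t)|\widehat{\Theta}(t))$ for every $\Theta$, and the M-step ensures $Q(\widehat{\Theta}(t+1)|\widehat{\Theta}(t)) \geq Q(\widehat{\Theta}(t)|\widehat{\Theta}(t))$. Adding the two inequalities yields
\begin{equation*}
\log p(\mathcal{X}|\widehat{\Theta}(t+1)) - \log p(\mathcal{X}|\widehat{\Theta}(t)) \geq 0 .
\end{equation*}
This monotonicity makes EM a local ascent method; by itself it says nothing about which local maximum is reached.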
293: \subsection{EM for GMMs}
294:
295: Several variants of the EM algorithm have been extensively used to
296: solve this problem. The convergence properties of the EM algorithm
for Gaussian mixtures are thoroughly discussed in \cite{Xu96}. The
$Q$-function for GMMs is given by:

\begin{equation}\label{eq:qfuncgmm}
Q(\Theta|\widehat{\Theta}(t))= \sum_{j=1}^{n}\sum_{i=1}^{k} w_i^{(j)}\left[\log\frac{1}{\sigma_i\sqrt{2\pi}} -\frac{(x^{(j)}-\mu_i)^2}{2\sigma_i^2}+\log \alpha_i\right]
\end{equation}
305:
306: where
307: \begin{equation}\label{eq:expectz}
w_i^{(j)}=\frac{\frac{\alpha_i(t)}{\sigma_i(t)}e^{-\frac{1}{2\sigma_i(t)^2}(x^{(j)}-\mu_i(t))^2}}{\sum_{m=1}^k
\frac{\alpha_m(t)}{\sigma_m(t)}e^{-\frac{1}{2\sigma_m(t)^2}(x^{(j)}-\mu_m(t))^2}}
310: \end{equation}
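As a quick sanity check of (\ref{eq:expectz}), consider a purely hypothetical two-component example (the numbers are chosen for illustration only and do not come from the experiments in this chapter): $k=2$, $\alpha_1(t)=\alpha_2(t)=0.5$, $\mu_1(t)=0$, $\mu_2(t)=2$, $\sigma_1(t)=\sigma_2(t)=1$, and a data point $x^{(j)}=0$. Then
\begin{equation*}
w_1^{(j)}=\frac{0.5\,e^{0}}{0.5\,e^{0}+0.5\,e^{-2}}=\frac{1}{1+e^{-2}}\approx 0.881, \qquad
w_2^{(j)}=1-w_1^{(j)}\approx 0.119 ,
\end{equation*}
so the point is softly assigned to both components, with most of its weight on the nearer one.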
311:
The maximization step is given by the following equation:
\begin{equation}\label{eq:max}
\frac{\partial }{\partial \theta_i} Q(\Theta|\widehat{\Theta}(t))= 0
\end{equation}
where $\theta_i$ denotes the parameters of the $i^{th}$ component.
Because of the assumption made that each data point comes from a
single component, solving the above equation becomes trivial. The
updates for the maximization step in the case of GMMs are given as
follows:
\begin{eqnarray}
\mu_i(t+1) &=& \frac{\sum_{j=1}^{n}w_i^{(j)}x^{(j)}}{\sum_{j=1}^{n}w_i^{(j)}}\\
\sigma_i^2(t+1) &=& \frac{\sum_{j=1}^{n}w_i^{(j)} (x^{(j)}-\mu_i(t+1))^2}{\sum_{j=1}^{n}w_i^{(j)}}\\
\alpha_i(t+1) &=& \frac{1}{n}\sum_{j=1}^{n}w_i^{(j)} \label{eq:update}
\end{eqnarray}
328:
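The GMM updates above can be recovered, in a brief sketch, by differentiating (\ref{eq:qfuncgmm}) with the weights $w_i^{(j)}$ held fixed. The multiplier $\lambda$ is introduced here only to handle the constraint (\ref{eq:probabi}) and is not part of the original notation.
\begin{align*}
\frac{\partial Q}{\partial \mu_i} &= \sum_{j=1}^{n} w_i^{(j)}\,\frac{x^{(j)}-\mu_i}{\sigma_i^2}=0,\\
\frac{\partial Q}{\partial \sigma_i} &= \sum_{j=1}^{n} w_i^{(j)}\left[-\frac{1}{\sigma_i}+\frac{(x^{(j)}-\mu_i)^2}{\sigma_i^3}\right]=0,\\
\frac{\partial}{\partial \alpha_i}\left[Q+\lambda\Big(1-\sum_{m=1}^{k}\alpha_m\Big)\right] &= \frac{1}{\alpha_i}\sum_{j=1}^{n} w_i^{(j)}-\lambda=0 .
\end{align*}
Solving the first two conditions gives the weighted mean and the weighted variance. For the third, multiplying by $\alpha_i$, summing over $i$, and using $\sum_{i=1}^{k} w_i^{(j)}=1$ together with $\sum_{i=1}^{k}\alpha_i=1$ gives $\lambda=n$, which yields the update for $\alpha_i$.
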
329: \subsection{Nonlinear Transformation}
This section mainly deals with the transformation of the original
log-likelihood function into its corresponding nonlinear dynamical
system and introduces some terminology needed to understand our
algorithm. This transformation gives the correspondence between all
the critical points of the $s$-dimensional likelihood surface and
those of its dynamical system. For the case of spherical Gaussian
mixtures with $k$ components, the number of unknown
parameters is $s=3k-1$. For convenience, the maximization problem is
transformed into a minimization problem defined by the following
objective function:
\begin{equation}
\begin{split}
\max_\Theta ~\{~\log
~p(\mathcal{X}|\Theta)~\} &= -\min_\Theta ~\{~-\log
~p(\mathcal{X}|\Theta)~\}\\
&= -\min_\Theta f(\Theta)
\end{split}
\label{eq:problem1}
\end{equation}
346:
347: %where $f(\Theta)$ is assumed to be in $C^2(\Re^s,\Re)$.
348:
349: \begin{lem1}\label{def:cont}
350: $f(\Theta)$ is $C^2(\Re^s,\Re)$.
351: \end{lem1}
352: \begin{proof}
353:
354: Note from Eq.(\ref{eq:log}), we have
355: \begin{equation}\label{eq:otherlog1}
356: f(\Theta)=-log ~p(\mathcal{X}|\Theta)
357: =-\sum_{j=1}^n log \sum_{i=1}^k \alpha_i
358: ~p(\bm{x}^{(j)}|\bm{\theta}_i)
359: \end{equation}
Each of the simple functions which appear in Eq. (\ref{eq:otherlog1}) is twice differentiable and continuous in the interior of the domain over which $f(\Theta)$ is defined. The function $f(\Theta)$ is composed of arithmetic operations on these simple functions and, from basic results in analysis, we can conclude that $f(\Theta)$ is twice continuously differentiable.
361: \end{proof}
362:
Lemma \ref{def:cont} and the preceding arguments guarantee the existence of the gradient system associated with $f(\Theta)$ for the log-likelihood function in the case of spherical Gaussians and allow us to construct the following negative gradient system:
364: \begin{equation}
365: \begin{split}
366: \textstyle{ \left[ \dot{\mu}_1(t)~..~\dot{\mu}_k(t)~\dot{\sigma}_1(t) ~..~\dot{\sigma}_k(t)~\dot{\alpha}_1(t) ~..~\dot{\alpha}_{k-1}(t)\right]^T}\\
367: =~-~\left[ \frac{\partial f}{\partial
368: \mu_1}~..~\frac{\partial f}{\partial \mu_k}~\frac{\partial
369: f}{\partial \sigma_1}~..~\frac{\partial f}{\partial
370: \sigma_k}~\frac{\partial f}{\partial \alpha_1}~..~\frac{\partial
371: f}{\partial \alpha_{k-1}} \right]^T \label{def:loggrad}
372: \end{split}
373: \end{equation}
374:
375:
\begin{thm}\label{th:stabgrad}{\it (Stability):}
The gradient system (\ref{def:loggrad}) is completely stable.
378: \end{thm}
379: {\it Proof: See Appendix-A.}\\
380:
Developing a gradient system is one of the simplest transformations possible. One can think of more complicated nonlinear transformations as well. We will now describe three main guidelines that must be satisfied by the transformation:
382:
383: \begin{itemize}
384: \item{The original log-likelihood function must be a Lyapunov function for the dynamical system.}
385: \item{The location of the critical points must be preserved under this transformation.}
386: \item{The system must be completely stable. In other words, every trajectory $\Phi(x,t)$ must be bounded.}
387: \end{itemize}
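
To see how a negative gradient system meets the first two guidelines, a short check using only the chain rule is sketched below; $\nabla f$ denotes the gradient of $f$ with respect to the free parameters, a shorthand introduced here.
\begin{equation*}
\frac{d}{dt} f(\Theta(t)) = \nabla f(\Theta(t))^T \dot{\Theta}(t) = -\|\nabla f(\Theta(t))\|^2 \leq 0 ,
\end{equation*}
with equality only where $\nabla f(\Theta)=0$. Thus $f$ is non-increasing along every trajectory of (\ref{def:loggrad}), and the equilibrium points of the system coincide with the critical points of $f$. Boundedness of the trajectories, the third guideline, does not follow from this identity alone; it is the subject of Theorem~\ref{th:stabgrad}.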
388:
389:
390: From the implementation point of view, it is not required to construct this gradient system. However, to understand the details of our method, it is necessary to obtain this gradient system. For simplicity, we show the construction of the gradient system for
391: the case of spherical Gaussians. It can be easily extended to the
392: full covariance Gaussian mixture case. It should be noted that only
393: (k-1) $\alpha$ values are considered in the gradient system because
394: of the unity constraint. The dependent variable $\alpha_k$ is
395: written as follows :
396:
397: \begin{equation}
398: \alpha_k=1-\sum_{j=1}^{k-1} \alpha_j\label{eq:alphak}
399: \end{equation}
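
For concreteness, the right-hand side of (\ref{def:loggrad}) can be written out for the univariate density (\ref{eq:gaussiandensity}). The following is a sketch obtained by direct differentiation of (\ref{eq:otherlog1}), with $w_i^{(j)}$ denoting the posterior probability of (\ref{eq:conditionw}) evaluated at the current $\Theta$ and $\alpha_k$ eliminated through (\ref{eq:alphak}):
\begin{align*}
\frac{\partial f}{\partial \mu_i} &= -\sum_{j=1}^{n} w_i^{(j)}\,\frac{x^{(j)}-\mu_i}{\sigma_i^2}, & i&=1,\ldots,k,\\
\frac{\partial f}{\partial \sigma_i} &= -\sum_{j=1}^{n} w_i^{(j)}\left[\frac{(x^{(j)}-\mu_i)^2}{\sigma_i^3}-\frac{1}{\sigma_i}\right], & i&=1,\ldots,k,\\
\frac{\partial f}{\partial \alpha_i} &= -\sum_{j=1}^{n} \left[\frac{w_i^{(j)}}{\alpha_i}-\frac{w_k^{(j)}}{\alpha_k}\right], & i&=1,\ldots,k-1 .
\end{align*}
Setting these expressions to zero reproduces the fixed-point conditions of the EM updates for GMMs, which is consistent with the requirement that the transformation preserve the location of the critical points.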
400:
This gradient system and the decomposition points on the practical stability boundary of the stable equilibrium points will enable us to define a {\it Tier-1 stable equilibrium point}.
402:
403: \begin{lem}\label{def:tier1}
404: For a given stable equilibrium point ($x_s$), a {\it Tier-1 stable equilibrium point} is defined as a stable equilibrium point whose stability boundary intersects with the stability boundary of $x_s$.
405: \end{lem}
406:
407:
408: \begin{figure}
409: \centering
410: \subfigure[Parameter Space]{\includegraphics[width = 2.75 in]{Figures/algo1.ps}}\qquad
411: \subfigure[Function Space]{\includegraphics[width = 2.75 in]{Figures/algo2.ps}}\qquad
412: \caption{\label{fig:algo_s}Various stages of our algorithm in (a) Parameter space - the solid lines indicate the practical stability boundary.
413: Points highlighted on the stability boundary are the decomposition
414: points. The dotted arrows indicate the convergence of the EM
415: algorithm. The dashed lines indicate the neighborhood-search stage.
416: $x_1$ and $x_2$ are the exit points on the practical stability
boundary. (b) Function space - different points and their corresponding log-likelihood function values.
418: }
419: \end{figure}
420:
421:
422: \section{TRUST-TECH based Expectation Maximization}
423: \label{sec:algorithm}
424:
Our framework consists of three stages, namely: (i) the global stage, (ii) the local stage and (iii) the neighborhood-search stage.
The last two stages are repeated in the solution space to obtain promising solutions. The global method obtains promising subspaces of the solution space. The next stage is the local stage (or the EM stage), where the results from the global method are refined to the corresponding locally optimal parameter set. Then, during the neighborhood-search stage, the exit points are computed
and the neighborhood solutions are systematically explored through these exit points. Fig.~\ref{fig:algo_s} shows the different steps of our algorithm both in (a) the parameter space and (b) the function space.
428:
It is beneficial to use the TRUST-TECH based algorithm in the promising subspaces. In this sense, the neighborhood-search stage can act as an interface between global methods for initialization and the EM algorithm, which gives the local maxima.
This approach differs from
traditional local methods by computing multiple local maxima in
the neighborhood region. This also enhances user flexibility in choosing between different sets of good
clusterings. Though global methods can identify promising subsets, it is
important to explore these subsets more thoroughly, especially in
problems like parameter estimation.
436:
437:
438: \begin{algorithm}
439: \caption{TRUST-TECH based EM Algorithm} \label{nexttieralg}
440: \begin{algorithmic}
441: \STATE \textbf{Input:} Parameters $\Theta$, Data $\mathcal{X}$,
442: tolerance $\tau$, Step $S_p$ \STATE \textbf{Output:}
443: $\widehat{\Theta}_{MLE} $ \STATE \textbf{Algorithm:}
444:
\STATE Apply global method and store the $q$ promising solutions $
\Theta_{init}=\{\Theta_1,\Theta_2,..,\Theta_q\}$ \STATE Initialize
$E=\emptyset$

\WHILE{$\Theta_{init} \neq \emptyset$} \STATE Choose $\Theta_i \in
\Theta_{init}$, set $\Theta_{init}= \Theta_{init} \backslash
\{\Theta_i\}$ \STATE $LM_i=EM(\Theta_i,\mathcal{X},\tau)$ \STATE
$E=E \cup \{LM_i\}$ \STATE Generate promising direction vectors
$d_j$ from $LM_i$

\FOR {each $d_j$} \STATE Compute Exit Point ($X_j$) along $d_j$
starting from $LM_i$ by evaluating the log-likelihood function given
by (\ref {eq:log})

\STATE $New_j=EM(X_j+\epsilon \cdot d_j,\mathcal{X},\tau)$
\IF{$New_j \notin E$} \STATE $E=E \cup \{New_j\}$ \ENDIF

\ENDFOR

\ENDWHILE
\STATE $\widehat{\Theta}_{MLE} =\arg\max_{\Theta \in E} \log p(\mathcal{X}|\Theta)$
466:
483: \end{algorithmic}
484: \end{algorithm}
485:
486:
487: In order to escape out of a found local maximum, our method needs to
488: compute certain promising directions based on the local behaviour of
489: the function. One can realize that generating these promising
490: directions is one of the important aspects of our algorithm.
491: Surprisingly, choosing random directions to move out of the local
492: maximum works well for this problem. One might also use other
493: directions like eigenvectors of the Hessian or incorporate some
494: domain-specific knowledge (like information about priors,
495: approximate location of cluster means, user preferences on the final
496: clusters) depending on the application that they are working on and
497: the level of computational expense that they can afford. We used
498: random directions in our work because they are very cheap to
499: compute. Once the promising directions are generated, exit points
500: are computed along these directions. {\it Exit points} are points of
501: intersection between any given direction and the practical stability
502: boundary of that local maximum along that particular direction. If
503: the stability boundary is not encountered along a given direction,
504: then there is a guarantee that one will not be able to find any new local maximum in
that direction. With a new initial guess in the vicinity of the exit
points, the EM algorithm is applied again to obtain a new local maximum. Sometimes, this new point ($X_j+\epsilon \cdot d_j$) might have convergence problems. In such cases, TRUST-TECH can help the convergence by integrating the dynamical system and obtaining another point that is much closer to the local optimal solution. However, this is not done here because computing the gradient of the log-likelihood function is expensive.
507:
508:
509: \begin{algorithm}
510: \caption{Params[~] $TT\_EM(Pset,Data,Tol,Step)$} \label{nexttier5}
511: \begin{algorithmic}
512: \STATE$Val=eval(Pset)$ \STATE $Dir[~]=Gen\_Dir (Pset)$ \STATE
513: $Eval\_MAX=500$
514:
515: \FOR{$k=1$ to $size(Dir)$} \STATE $Params[k]= Pset~~~~~~ExtPt=OFF$
516: \STATE $Prev\_Val=Val~~~~~~~~~Cnt=0$
517:
518: \WHILE{$(!~ExtPt)~ \&\& ~(Cnt<Eval\_MAX)$} \STATE
519: $Params[k]=update(Params[k],Dir[k],Step)$ \STATE $Cnt~=~Cnt~+~1$
520: \STATE $Next\_Val~=~eval(Params[k])$
521:
522: \IF{$(Next\_Val~>~Prev\_Val)$} \STATE $ExtPt=ON$ \ENDIF
523:
524: \STATE $Prev\_Val=Next\_Val$
525: \ENDWHILE
\IF{$Cnt<Eval\_MAX$}
527: \STATE $Params[k]=update(Params[k],Dir[k],ASC)$
528: \STATE $Params[k]=EM(Params[k],Data,Tol)$ \ELSE \STATE
529: $Params[k]=NULL$ \ENDIF \ENDFOR \STATE \bf Return
530: $max(eval(Params[~]))$
531: \end{algorithmic}
532: \end{algorithm}
533:
534: \section{Implementation Details}
535: \label{sec:implementation}
536:
Our program is implemented in MATLAB and runs on a Pentium IV 2.8 GHz
machine. The main procedure implemented is $TT\_EM$, described in
Algorithm~\ref{nexttier5}. The algorithm takes the mixture data and
the initial set of parameters as input, along with the step size for
moving out and the tolerance for convergence of the EM algorithm. It
returns the set of parameters that correspond to the Tier-1
neighboring local optimal solutions. The procedure $eval$ returns
the log-likelihood score given by Eq. (\ref{eq:log}). The $Gen\_Dir$
procedure generates promising directions from the local maximum. Exit
points are obtained along these generated directions. The procedure
$update$ moves the current parameter set to the next parameter set along
a given $k^{th}$ direction $Dir[k]$. Some of the directions might
have one of the following two problems: (i) an exit point might not be
obtained in the direction; (ii) even if the exit point is
obtained, it might lead to a less promising solution. If the exit
points are not found along these directions, the search is
terminated after $Eval\_MAX$ evaluations. For all exit
points that are successfully found, the $EM$ procedure is applied and
all the corresponding neighborhood sets of parameters are stored in
$Params[~]$. To ensure that the new initial points are
in a new convergence region of the EM algorithm, one should move (along that particular direction) `$\epsilon$' away from the exit points. Since
different parameters have different ranges, care must be taken
when multiplying by the step sizes. It is important to use the
current estimates to get an approximation of the step size with
which one should move out along each parameter in the search space.
Finally, the solution with the highest likelihood score amongst the
original set of parameters and the Tier-1 solutions is returned.
564:
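The exit-point search in Algorithm~\ref{nexttier5} can be summarized, as a sketch, by a one-dimensional sampling of the log-likelihood along each direction. The function $g$ and the index $m$ are notation introduced only for this summary, $S$ stands for the step size $Step$, and the per-parameter scaling of the step is omitted for brevity. For a local maximum $LM_i$ and a direction $d_j$, let
\begin{equation*}
g(\lambda) = \log p(\mathcal{X}\,|\,LM_i+\lambda\, d_j), \qquad \lambda_m = m\,S, \quad m=0,1,2,\ldots
\end{equation*}
Because $LM_i$ is a local maximum, $g$ is initially non-increasing. The procedure stops at the first index
\begin{equation*}
m^{*} = \min\{\, m \geq 1 ~:~ g(\lambda_m) > g(\lambda_{m-1}) \,\},
\end{equation*}
so that the exit point is bracketed by $\lambda_{m^{*}-1}$ and $\lambda_{m^{*}}$, and EM is restarted from a point slightly beyond it. If no such index is found within the $Eval\_MAX$ evaluation budget, the direction is abandoned.
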
565:
566: \begin{figure*}[htp]
567: \centering
568: % \subfigure[]{\includegraphics[width = 1.2 in]{Figures/actual3.ps}}\qquad
569: \subfigure[]{\includegraphics[width = 2.45 in]{Figures/init3.ps}}\qquad
570: \subfigure[]{\includegraphics[width = 2.45 in]{Figures/initial3.ps}}\qquad
571: \subfigure[]{\includegraphics[width = 2.45 in]{Figures/extpt3.ps}}\qquad
572: \subfigure[]{\includegraphics[width = 2.45 in]{Figures/final3.ps}}\qquad
573: % \subfigure[]{\includegraphics[width = 1.5 in]{Figures/SREM1.ps}}
574: \caption{\label{fig:diagcov}Parameter estimates at various
575: stages of our algorithm on the three component Gaussian mixture
576: model (a) Poor random initial guess (b) Local maximum
577: obtained after applying EM algorithm with the poor initial
578: guess (c) Exit point obtained by our algorithm (d) The final
579: solution obtained by applying the EM algorithm using the exit point as the initial guess.
580: }
581: \end{figure*}
582: \section{Results and Discussion}
583: \label{sec:results}
584:
Our algorithm has been tested on both synthetic and real datasets.
The initial values for the centers and the covariances were chosen
uniformly at random. Uniform priors were chosen for initializing the
components. For real datasets, the centers were chosen randomly from
the sample points.
590:
591: \begin{figure}[htp]
592: \centerline{
593: \epsfig{figure=Figures/SREM1.ps, width=4.0in}
594: } \caption{Graph showing likelihood vs Evaluations. A corresponds to
595: the original local maximum (L=-3235.0). B corresponds to the exit
596: point (L=-3676.1). C corresponds to the new initial point (L=-3657.3) after moving out by
597: `$\epsilon$'. D corresponds to the new local maximum (L=-3078.7).}
598: \label{fig:eval}
599: \end{figure}
600:
601: \subsection{Synthetic Datasets}
A simple synthetic dataset with 40 samples and 5 spherical Gaussian
603: components was generated and tested with our algorithm. Priors were
604: uniform and the standard deviation was 0.01. The centers for the
605: five components are given as follows: $\mu_1=[0.3~0.3]^T$,
606: $\mu_2=[0.5~0.5]^T$, $\mu_3=[0.7~0.7]^T$, $\mu_4=[0.3~0.7]^T$ and
607: $\mu_5=[0.7~0.3]^T$.
608:
609:
610: %\begin{figure*}
611: % \centering
612: % \subfigure[Caption A]{\includegraphics[width = 2.75 in]{Figures/actual3.ps}}\qquad
613: % \subfigure[Caption B]{\includegraphics[width = 2.75 in]{Figures/actual3.ps}}\qquad
614: % \caption{\label{fig-sphergauss}Relative weight reduction for the
615: % schedule trees produced on subsets of size (a) 10\% (b) 25\% (c)
616: % 50\% (d) 75\%. The baseline in this case is chosen as the
617: % smaller of (i) a sort of the raw data set for each view or (ii)
618: % computation of the full cube.}
619: % \end{figure*}
620:
621: %\begin{figure}[htp]
622: %\centerline{
623: % \epsfig{figure=Figures/actual3.ps, width=2.8in}
624: %} \caption{True mixture of the three Gaussian components with 900
625: %samples.} \label{fig:true}
626: %\end{figure}
627:
628:
The second dataset is a diagonal-covariance case containing
$n=900$ data points. The data were generated from a two-dimensional,
three-component Gaussian mixture distribution with mean vectors
$[0~-2]^T$, $[0~0]^T$, $[0~2]^T$ and a common diagonal covariance matrix
with values 2 and 0.2 along the diagonal \cite{Ueda98}. All
three components have uniform priors. Fig.~\ref{fig:diagcov} shows
various stages of our algorithm and demonstrates how the clusters
obtained from existing algorithms are improved using our algorithm.
The initial clusters obtained are of low quality because of the poor
initial set of parameters. Our algorithm takes these clusters and
applies the neighborhood-search stage and the EM stage simultaneously to
obtain the final result. Fig.~\ref{fig:eval} shows the value of the
log-likelihood during the neighborhood-search stage and the EM
iterations.
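
As a reading aid, the likelihood values quoted in the caption of Fig.~\ref{fig:eval} can be differenced to see how much each stage changes the log-likelihood. The symbols $\Delta_{AB}$, $\Delta_{BC}$, $\Delta_{CD}$ and $\Delta_{AD}$ are introduced here only for illustration and simply denote the difference between the values at the two labelled points.
\begin{align*}
\Delta_{AB} &= -3676.1-(-3235.0) = -441.1 && \mbox{(moving from the local maximum to the exit point)}\\
\Delta_{BC} &= -3657.3-(-3676.1) = 18.8 && \mbox{(moving out by $\epsilon$)}\\
\Delta_{CD} &= -3078.7-(-3657.3) = 578.6 && \mbox{(EM iterations from the new initial point)}\\
\Delta_{AD} &= \Delta_{AB}+\Delta_{BC}+\Delta_{CD} = 156.3 && \mbox{(net improvement)}
\end{align*}
The temporary decrease incurred while leaving the original local maximum is therefore more than recovered by the subsequent EM iterations.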
643:
644:
In the third synthetic dataset, a more complicated mixture of overlapping
Gaussians is considered \cite{Figueiredo02}. The parameters
are as follows: $\mu_1=\mu_2=[-4~-4]^T$, $\mu_3=[2~2]^T$ and
$\mu_4=[-1~-6]^T$; $\alpha_1=\alpha_2=\alpha_3=0.3$ and
$\alpha_4=0.1$.
\begin{displaymath}
\Sigma_1=\left[ \begin{array}{cc} 1 & 0.5\\ 0.5 & 1 \end{array}
\right]\qquad\qquad \Sigma_2=\left[ \begin{array}{cc} 6 & -2\\ -2 & 6 \end{array}
\right]
\end{displaymath}
\begin{displaymath}
\Sigma_3=\left[ \begin{array}{cc} 2 & -1\\ -1 & 2 \end{array}
\right]\qquad\qquad \Sigma_4=\left[ \begin{array}{cc} 0.125 & 0\\ 0 & 0.125 \end{array}
\right]
\end{displaymath}
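
A short check, included here only as a sketch, confirms that these parameters form a valid mixture: the mixing weights sum to one, and each $2\times 2$ symmetric covariance matrix has positive diagonal entries and a positive determinant, which is sufficient for positive definiteness.
\begin{align*}
\alpha_1+\alpha_2+\alpha_3+\alpha_4 &= 0.3+0.3+0.3+0.1 = 1,\\
\det\Sigma_1 &= (1)(1)-(0.5)^2 = 0.75 > 0,\\
\det\Sigma_2 &= (6)(6)-(-2)^2 = 32 > 0,\\
\det\Sigma_3 &= (2)(2)-(-1)^2 = 3 > 0,\\
\det\Sigma_4 &= (0.125)(0.125)-0 = 0.015625 > 0.
\end{align*}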
660:
661:
662: %\begin{figure}[htp]
663: %\centerline{
664: % \epsfig{figure=Figures/data2.ps, width=2.8in}
665: %} \caption{True mixtures of the more complicated overlapping
666: %Gaussian case with 1000 samples. This dataset was used to show the
667: %improvements in the performance by varying the number of data
668: %points.} \label{fig:third}
669: %\end{figure}
670:
671: \begin{table*}[htp]
\centering \caption{\protect Log-likelihood obtained by the TRUST-TECH-EM (TT-EM) algorithm, averaged
over 100 runs on various synthetic and real datasets, compared with the random-start EM algorithm}
674: \begin{center}
675: \begin{small}
676: \begin{tabular}{|c|c|c|c|c|c|}
677: \hline
678: Dataset &Samples & Clusters & Features & EM(mean $\pm$ std) & TT-EM(mean $\pm$ std)\\
679: \hline Spherical & 40& 5 & 2 & 38.07$\pm$2.12 & 43.55$\pm$0.6 \\
680: \hline Elliptical & 900& 3 & 2 & -3235$\pm$0.34 & -3078.7$\pm$0.03 \\
681: \hline FC1& 500& 4 & 2 & -2345.5 $\pm$175.13 & -2121.9$\pm$ 21.16 \\
682: \hline FC2& 2000& 4 & 2 & -9309.9 $\pm$694.74 & -8609.7 $\pm$37.02 \\
683: \hline Iris & 150& 3 & 4 & -198.13$\pm$27.25 &-173.63$\pm$11.72\\
684: \hline Wine & 178& 3 & 13 & -1652.7$\pm$1342.1& -1618.3$\pm$1349.9\\
685: \hline
686: \end{tabular}
687: \end{small}
688: \end{center}
689: \label{TB:results5}
690: \end{table*}
691:
692: \subsection{Real Datasets}
Two real datasets obtained from the UCI Machine Learning repository
\cite{Blake98} were also used for testing the performance of our
algorithm. The widely used Iris dataset has 150 samples, 3 classes
and 4 features. The Wine dataset has 178 samples, 3 classes and 13 features. For these
real datasets, the class labels were deleted, thus treating the task as an unsupervised learning problem. Table~\ref{TB:results5} summarizes our
results over 100 runs. The mean and the standard deviation of the
log-likelihood values are reported. The traditional EM algorithm
with random starts is compared against our algorithm on both
synthetic and real datasets. Our algorithm not only obtains a higher
mean log-likelihood on every dataset but, except on the Wine data, also
does so with a lower standard deviation. The low
standard deviation of our results indicates the robustness with which
the best solution is obtained. In the case of the Wine data, the
improvement obtained with our algorithm is less significant than on
the other datasets. This might be because the dataset
does not have Gaussian components, whereas our method assumes that the
underlying distribution of the data is a mixture of Gaussians.
Table~\ref{TB:compresults5} compares TRUST-TECH-EM
with other methods proposed in the literature, such as split and merge EM (SMEM) and k-means+EM.
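
To make the comparison in Table~\ref{TB:results5} concrete, the gain in mean log-likelihood can be computed for each dataset. The symbol $\Delta$ is introduced here only for illustration and denotes the TT-EM mean minus the EM mean, both taken directly from the table.
\begin{align*}
\Delta_{\mathrm{Spherical}} &= 43.55-38.07 = 5.48\\
\Delta_{\mathrm{Elliptical}} &= -3078.7-(-3235) = 156.3\\
\Delta_{\mathrm{FC1}} &= -2121.9-(-2345.5) = 223.6\\
\Delta_{\mathrm{FC2}} &= -8609.7-(-9309.9) = 700.2\\
\Delta_{\mathrm{Iris}} &= -173.63-(-198.13) = 24.50\\
\Delta_{\mathrm{Wine}} &= -1618.3-(-1652.7) = 34.4
\end{align*}
For the Wine data, the gain of $34.4$ is small relative to the reported standard deviations of $1342.1$ and $1349.9$, whereas for the other datasets the gain exceeds the standard deviation of the TT-EM results.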
713:
714:
715: %Enzyme Data \cite{Richardson97}
716:
717: %\begin{figure}[htp]
718: %\centerline{
719: % \epsfig{figure=Figures/gmm.ps, width=2.5in}
720: %} \caption{Enzyme data being modeled with 4 components.}
721: %\label{fig:summary}
722: %\end{figure}
723:
724:
725: %Iris Data from the UCI Machine Learning repository
726: %\cite{Blake98}.
727:
728: \begin{table}[htp]
729: \centering \caption{\protect Comparison of TRUST-TECH-EM with
730: other methods}
731: \begin{center}
732: \begin{tabular}{|c|c|c|}
733: \hline
734: Method & Elliptical & Iris\\
735: \hline RS+EM & -3235 $\pm$ 14.2 & -198 $\pm$ 27\\
736: \hline K-Means+EM & -3195 $\pm$ 54&-186 $\pm$ 10\\
737: \hline SMEM &-3123 $\pm$ 54&-178.5 $\pm$ 6\\
738: \hline TRUST-TECH-EM &-3079 $\pm$ 0.03 &-173.6 $\pm$ 11\\
739: \hline
740: \end{tabular}
741: \end{center}
742: \label{TB:compresults5}
743: \end{table}
744: \subsection{Discussion}
It is effective to use TRUST-TECH-EM on those solutions that
appear to be promising. Due to the nature of the problem, it is very
likely that the nearby solutions surrounding an existing solution
will be more promising. One of the primary advantages of our method
is that it can be used along with other popular methods
to improve the quality of their solutions. In clustering
problems, it is an added advantage to refine the
final clusters obtained. Most of the focus in the literature has been on
new methods for initialization or new clustering techniques, which
often do not take advantage of existing results and
start the clustering procedure ``{\it from scratch}''. Though shown
only for the case of multivariate Gaussian mixtures, our technique
can be effectively applied to any parametric finite mixture model.
758:
759: %As shown in fig. \ref{fig:diagcov}, our algorithm can help in
760: %improving the quality of the existing clusters. The initial
761: %algorithm identified only one of the three clusters correctly even
762: %in such a simple and clearly separated case. Our algorithm used
763: %existing solution and identified three distinct clusters that are
764: %identical to the true mixtures.
765:
766: \begin{table}[htp]
767: \centering \caption{\protect Number of iterations taken for
768: the convergence of the best solution. }
769: \begin{center}
770: \begin{tabular}{|c|c|c|}
771: \hline
772: Dataset & Avg. no. of & No. of iterations \\
773: &iterations & for the best solution\\
774: \hline Spherical &126& 73\\
775: \hline Elliptical &174& 86\\
776: \hline Full covariance &292&173\\
777: \hline
778: \end{tabular}
779: \end{center}
780: \label{TB:convresults}
781: \end{table}
Table~\ref{TB:convresults} summarizes the average number of
iterations taken by the EM algorithm to converge to a
local optimal solution, together with the number of iterations taken
for the best solution. We can see that the most promising solution
produced by our TRUST-TECH methodology converges much faster. In
786: other words, our method can effectively take advantage of the fact
787: that the convergence of the EM algorithm is much faster for high
788: quality solutions. We exploit this inherent property of the EM algorithm
789: to improve the efficiency of our algorithm. Hence, for obtaining the Tier-1 solutions using our algorithm, the
790: threshold for the number of iterations can be significantly lowered.
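
As a rough illustration, and assuming the two columns of Table~\ref{TB:convresults} are directly comparable, the ratio of the number of iterations for the best solution to the average number of iterations is
\begin{displaymath}
\frac{73}{126}\approx 0.58,\qquad \frac{86}{174}\approx 0.49,\qquad \frac{173}{292}\approx 0.59
\end{displaymath}
for the spherical, elliptical and full covariance cases respectively. On these datasets, the best solution therefore converges in roughly half to three-fifths of the average number of iterations.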
791:
792: \newpage
793: \section*{APPENDIX-A: Proof of Theorem \ref{th:stabgrad}} \label{sec:appendix-A}
794: \begin{proof}
795: First, we will show that every bounded trajectory will converge to one of the equilibrium points. Second, we will show that every trajectory is bounded \cite{Chiang96}.
796: \begin{enumerate}
797: \item{Let $\Phi(x,t)$ denote the bounded trajectory starting at $x$. Computing the time derivative along the trajectory, we get
798: \begin{equation*}
799: \frac{d}{dt}f(\Phi(x,t))=-(\nabla f(\Phi(x,t)))^T(\nabla f(\Phi(x,t)))~\leq~0
800: \end{equation*}
Also, we know that $\frac{d}{dt}f(\Phi(x,t))~=~0$ if, and only if, $\Phi(x,t) \in E$. Hence, $f(x)$ is a Lyapunov function of the gradient system (\ref{def:loggrad}) and the $\omega$-limit set of any bounded trajectory consists of equilibrium points only, i.e., any bounded trajectory will approach one of the equilibrium points.}
802: \item{
Following the proof of Proposition 1 presented in \cite{Chiang96}, we can show that every trajectory $\Phi(x,t)$ is bounded. However, we will have to show that the magnitude of the gradient of the log-likelihood function for the Gaussian mixture model is bounded on the entire domain of the parameter space.
804:
805: \begin{equation}\label{eq:logeq}
\log p(\mathcal{Y}|\Theta) = \sum_{j=1}^n \log \sum_{i=1}^k \alpha_i ~p({y}^{(j)}|{\theta}_i)
807: \end{equation}
808:
809: Now, the domain of the parameter space is given as follows:
810:
$-\infty <\mu_i<\infty$, $\Sigma_i$ is positive definite, and $0\leq \alpha_i \leq 1$ where $\sum_{i=1}^{k}\alpha_i=1$.
812:
813: %We proved earlier that $log ~p(\mathcal{Y}|\Theta)$ is atleast twice differentiable. Since gradient vector is continuous we can safely conclude that it is bounded on compact subsets of the domain of the parameter space.
814: First, let us focus on $\alpha$ because it is a constrained variable.
815:
\textbf{Derivative with respect to $\alpha$}:
817:
818: \begin{equation}\label{eq:partalpha}
819: \frac{\partial f}{\partial \alpha_r}= \sum_{j=1}^{n}\left[ \frac{p({y}^{(j)}|{\theta}_r)}{\sum_{i=1}^{k}\alpha_i p({y}^{(j)}|{\theta}_i)}\right]
820: \end{equation}
As $\alpha_r \rightarrow 1$ (so that $\alpha_i \rightarrow 0$ for all $i \neq r$), we have
\begin{equation*}\label{eq:alphatends1}
\frac{\partial f}{\partial \alpha_r}= \sum_{j=1}^{n}\left[ \frac{p({y}^{(j)}|{\theta}_r)}{1 \cdot p({y}^{(j)}|{\theta}_r)}\right] = n <\infty
\end{equation*}
As $\alpha_r \rightarrow 0$, we have
826: \begin{equation*}\label{eq:alphatends0}
827: \frac{\partial f}{\partial \alpha_r}=\sum_{j=1}^{n}\left[ \frac{p({y}^{(j)}|{\theta}_r)}{\sum_{i=1,i\neq r}^{k}\alpha_i p({y}^{(j)}|{\theta}_i)}\right] <\infty
828: \end{equation*}
829:
830: Hence, the derivatives with respect to $\alpha$ are bounded.
831:
\textbf{Derivative with respect to $\mu$}:
\begin{equation}\label{eq:partmu}
\frac{\partial f}{\partial \mu_r}= \sum_{j=1}^{n}\left[ \frac{\alpha_r \frac{1}{\sqrt{2\pi}\sigma_r} e^{-\frac{(y^{(j)}-\mu_r)^2}{2\sigma_r^2}}\cdot\frac{1}{\sigma_r^2}(y^{(j)}-\mu_r)}{\sum_{i=1}^{k}\alpha_i p({y}^{(j)}|{\theta}_i)}\right]
\end{equation}
This is clearly bounded for any $\mu_r \in \Re$.
837:
\textbf{Derivative with respect to $\sigma$}:
\begin{equation}\label{eq:partsigma}
\frac{\partial f}{\partial \sigma_r}= \sum_{j=1}^{n}\left[ \frac{\frac{\alpha_r}{\sqrt{2\pi}}\left(\frac{1}{\sigma_r} e^{-\frac{(y^{(j)}-\mu_r)^2}{2\sigma_r^2}}\cdot \frac{(y^{(j)}-\mu_r)^2}{\sigma_r^3} -\frac{1}{\sigma_r^2} e^{-\frac{(y^{(j)}-\mu_r)^2}{2\sigma_r^2}}\right) }{\sum_{i=1}^{k}\alpha_i p({y}^{(j)}|{\theta}_i)}\right]
\end{equation}
842:
As $\sigma_r \rightarrow 0$, the exponential factor goes to zero faster than any power of $\frac{1}{\sigma_r}$ goes to infinity. Hence, this derivative is bounded. So, the gradient of the log-likelihood function is bounded in the entire domain of the parameter space.
844:
845: }
846: \end{enumerate}
847:
848: \end{proof}
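
Two short calculations, offered only as a sketch of the reasoning steps used in the proof above, may help the reader check the bounds. First, multiplying (\ref{eq:partalpha}) by $\alpha_r$ and summing over $r$ gives
\begin{displaymath}
\sum_{r=1}^{k}\alpha_r \frac{\partial f}{\partial \alpha_r}
= \sum_{j=1}^{n}\frac{\sum_{r=1}^{k}\alpha_r\, p({y}^{(j)}|{\theta}_r)}{\sum_{i=1}^{k}\alpha_i\, p({y}^{(j)}|{\theta}_i)} = n,
\end{displaymath}
which is consistent with the limiting value $n$ obtained as $\alpha_r \rightarrow 1$. Second, for the limit $\sigma_r \rightarrow 0$, write $u = 1/\sigma_r$ and let $d \neq 0$ denote the distance between a data point and $\mu_r$; both $u$ and $d$ are notation introduced here for illustration, and the assumption $d \neq 0$ is needed for the argument. For any fixed $m>0$,
\begin{displaymath}
\max_{u>0}\; u^{m} e^{-\frac{d^2u^2}{2}} = \left(\frac{m}{d^2}\right)^{m/2} e^{-m/2},
\qquad
\lim_{u\rightarrow\infty} u^{m} e^{-\frac{d^2u^2}{2}} = 0,
\end{displaymath}
where the maximum is attained at $u^2 = m/d^2$. This bounds terms of the form $\sigma_r^{-m}$ multiplied by the exponential factor, which appear in the numerators of the derivatives with respect to $\mu_r$ and $\sigma_r$.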
849: