cs0501091/nldr.tex
1: \documentclass[10pt,letterpaper,conference]{IEEEtran}
2: %\bibliographystyle{IEEEtran.bst}
3: %\usepackage[dvipdfm]{hyperref}
4: \usepackage{maxim}
5: \usepackage{verbatim}
6: \mythmfalse
7: 
8: \def\cA{{\cal A}}
9: \def\cF{{\cal F}}
10: \def\cG{{\cal G}}
11: \def\cM{{\cal M}}
12: \def\cN{{\cal N}}
13: \def\cR{{\cal R}}
14: \def\cX{{\cal X}}
15: 
16: \begin{document}
17: 
18: \title{A Complexity-Regularized Quantization Approach to Nonlinear
19:   Dimensionality Reduction\vspace{-10pt}}
20: \author{\authorblockN{Maxim Raginsky}
21: \authorblockA{Beckman Institute and the University of Illinois\\
22: 405 N Mathews Ave, Urbana, IL 61801, USA \\
23: Email: maxim@uiuc.edu\vspace{-10pt}}
24: }
25: \maketitle
26: \begin{abstract}We consider the problem of nonlinear dimensionality
27: reduction: given a training set of high-dimensional
28: data whose ``intrinsic'' low dimension is
29: assumed known, find a feature extraction map to low-dimensional space, a
30: reconstruction map back to high-dimensional space, and a geometric
31: description of the dimension-reduced data as a smooth manifold. We introduce a
32: complexity-regularized quantization approach for fitting a Gaussian
33: mixture model to the training set via a Lloyd algorithm. Complexity regularization controls the trade-off between
34: adaptation to the local shape of the underlying manifold and global geometric consistency. The
35: resulting mixture model is used to design the feature extraction
36: and reconstruction maps and to define a Riemannian metric on the
37: low-dimensional data. We also sketch a proof of
38: consistency of our scheme for the purposes of estimating the unknown
39: underlying pdf of high-dimensional data.\end{abstract}
40: 
41: \section{Introduction}
42: \label{sec:intro}
43: 
44: When dealing with high volumes of vector-valued data of some
45: large dimension $n$, it is often assumed that the data possess some intrinsic geometric description in a space of unknown dimension
46: $k < n$ and that the high dimensionality arises
47: from an unknown stochastic mapping of $\R^k$ into $\R^n$. We can pose
48: the problem of {\em nonlinear dimensionality reduction} (NLDR)
49: \cite{TenSilLan00, RowSau00}
50: as follows:
51: given raw data with values in $\R^n$, we wish to obtain optimal estimates of the intrinsic
52: dimension $k$ and of the stochastic map with the purpose of
53: modeling the intrinsic geometry of
54: the data in $\R^k$.
55: 
56: One typically considers the following set-up: we are given a sample $X^N
57: \equiv (X_1,\ldots,X_N)$, where $X_i$ are i.i.d. according to an unknown
58: absolutely continuous distribution $P^*$. The corresponding pdf $f^*$ has to be estimated from the observation as $\hat{f}_N \equiv
59: \hat{f}_N(X^N)$. The intrinsic dimension $k$ of the data may not be known
60: in advance and would also have to be estimated as
61: $\hat{k}_N \equiv \hat{k}_N(X^N)$. Since the pdf $f^*$ is assumed to arise from a stochastic map of the low-dimensional space $\R^k$ into the high-dimensional space
62: $\R^n$, we can use our knowledge about $k$ and $f^*$ in order to make
63: inferences about the intrinsic geometry of the data. In the absence of
64: such knowledge, any such inference has to be
65: made based on the estimates $\hat{k}_N$ and $\hat{f}_N$. In this paper we introduce a complexity-regularized quantization
66: approach to NLDR, assuming that the intrinsic dimension $k$ of the data is given
67: (e.g., as a maximum-likelihood estimate \cite{LevBic05}). 
68: 
69: \section{Smooth manifolds and their noisy embeddings}
70: \label{sec:manifolds}
71: 
72: We begin with a quick sketch of some notions about smooth
73: manifolds \cite{Lan95}. A {\em smooth manifold} of dimension $k$ is a set $M$ together with a
74: collection $\cA = \{(U_l,\varphi_l) : l \in \Lambda\}$, where the sets
75: $U_l \subset M$ cover $M$ and each map $\varphi_l$ is a bijection of
76: $U_l$ onto an open set $\varphi_l(U_l)\subset\R^k$, such that for
77: all $l,l'$ with $U_l \cap U_{l'} \neq \varnothing$ the map
78: $\map{\varphi_{l'}\circ\varphi^{-1}_l}{\varphi_l(U_l \cap
79:   U_{l'})}{\varphi_{l'}(U_l \cap U_{l'})}$ is smooth. The pairs
80: $(U_l,\varphi_l)$ are called {\em charts} of $M$, and the entire
81: collection $\cA$ is referred to as an {\em atlas}. Intuitively, the
82: charts describe the points of $M$ by {\em local}
83: coordinates: given $p \in M$ and a chart $(U_l\ni p,\varphi_l)$,
84: $\varphi_l$ maps any point $q$ ``near $p$'' (i.e., $q \in U_l$) to an element of $\varphi_l(U_l) \subset \R^k$. Smoothness of the transition maps $\varphi_{l'}
85: \circ \varphi^{-1}_l$ ensures that local coordinates of a point transform
86: differentiably under a change of chart.
87: 
88: Assuming that $M$ is compact, we
89: can always choose the atlas $\cA$ in such a way that the indexing set
90: $\Lambda$ is finite and each $\varphi_l(U_l)$ is
91: an open ball of radius $r_l$ \cite[Thm.~3.3]{Lan95} (one can always
92: set $r_l \equiv 1$ for all $l \in \Lambda$, but we choose not to do
93: this for greater flexibility in modeling).
94: 
95: The next notion we need is that of a {\em tangent space} to $M$ at
96: point $p$, denoted by $T_pM$. Let $I \subset \R$ be an open interval
97: such that $0 \in I$. Consider the set of all curves
98: $\map{\xi}{I}{M}$ such that $\xi(0)=p$. Then for any chart $(U_l\ni p,\varphi_l)$ we have a
99: function $\map{\xi_l \deq \varphi_l \circ \xi}{I}{\R^k}$, such that
100: $\xi_l(t) \in \varphi_l(U_l)$ for all $t$ in a sufficiently small
101: neighborhood of $0$. We say that two such curves $\xi,\xi'$ are
102: equivalent iff $d\xi_{l,j}(t)/dt\big|_{t = 0} =
103: d\xi'_{l,j}(t)/dt\big|_{t=0}$, $j=1,\ldots,k$, for all $l \in \Lambda$ such that $U_l \ni p$, where $\xi_{l,j}(t)$
104: are the components of $\xi_l(t)$. The resulting set of equivalence classes
105: has the structure of a vector space of dimension $k$, and is precisely
106: the tangent space $T_pM$. Intuitively, $T_pM$ allows us to ``linearize'' $M$ around $p$. Note that, although all the
107: tangent spaces $T_pM,p\in M$ are isomorphic to each other and to
108: $\R^k$, there is no meaningful way to add elements of $T_pM$ and
109: $T_qM$ with $p,q$ distinct.
110: 
111: Next, we specify the class of stochastic embeddings dealt with in this
112: paper. Consider three random variables $L,Y,X$, where $L$ takes
113: values in the finite set $\Lambda$ with $w_l \deq \Pr(L=l)$, $Y$
114: takes values in $\R^k$, and $X$ takes values in $\R^n$. Conditional
115: distributions of $Y$ given $L$ and of $X$ given $Y,L$ are assumed to be
116: absolutely continuous and described by densities $f_{Y|L}$ and
117: $f_{X|YL}$, respectively. Since for a compact $M$ the images
118: $\varphi_l(U_l)$ of charts in $\cA$ are open balls of radii $r_l$, let
119: us suppose that the conditional mean $m_l(Y) \equiv \E[Y|L=l]$ is the center
120: of $\varphi_l(U_l)$ [we can therefore take $m_l(Y)=0$ for all $l \in
121: \Lambda$] and that the largest eigenvalue of the conditional covariance matrix $K_l(Y)
122: \equiv \E\big[YY^t\big| L=l\big]$ of $Y$ given $L=l$ is equal to $r^2_l$. It is convenient to think of the eigenvectors
123: $e^{(l)}_1,\ldots,e^{(l)}_k$ of $K_l(Y)$ as giving a basis of the
124: tangent space $T_{\varphi^{-1}_l(0)}M$. The unconditional density
125: $f_X$ of $X$ is the finite mixture
126: $f_X(x) = \sum_{l \in \Lambda}w_lf_l(x)$, where $f_l(x) \deq
127: \int_{\R^k}f_{X|YL}(x|y,l)f_{Y|L}(y|l)dy$. The resulting pdf follows the
128: local structure of the manifold $M$ and accounts for both
129: low- and high-dimensional noise.
130: 
131: As an example \cite{RowSauHin02}, let all $f_{Y|L}(y|l)$ be $k$-dimensional zero-mean Gaussians with unit covariance matrices,
132: $f_{Y|L}(y|l) = \cN(y;0,I) \equiv (2\pi)^{-k/2}\exp(-\frac{1}{2}y^ty)$, and
133: $f_{X|YL}(x|y,l) = \cN(x;\mu_l+A_ly,\Sigma_l)$, $\forall l \in
134: \Lambda$, for some means $\mu_l \in \R^n$, covariance matrices $\Sigma_l$, and
135: $n\times k$ matrices $A_l$, so that $f_X(x) = \sum_{l\in\Lambda}w_l\cN(x;\mu_l,A_lA^t_l + \Sigma_l)$.
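
To see where the component covariance in this example comes from, here is a short check under the model above. We write $X = \mu_l + A_lY + V$ given $L=l$, with $V \sim \cN(0,\Sigma_l)$ independent of $Y$; the auxiliary noise vector $V$ is our notation, introduced only for this sketch. Conditionally on $L=l$,
\begin{eqnarray*}
&& \E[X|L=l] = \mu_l + A_l\E[Y|L=l] = \mu_l, \\
&& \E\big[(X-\mu_l)(X-\mu_l)^t\big|L=l\big] \\
&& \qquad = A_l\E\big[YY^t\big|L=l\big]A^t_l + \Sigma_l = A_lA^t_l + \Sigma_l,
\end{eqnarray*}
where the cross terms vanish because $Y$ and $V$ are independent and zero-mean. Since $X$ is an affine function of the jointly Gaussian pair $(Y,V)$, it is conditionally Gaussian, which gives $f_l(x) = \cN(x;\mu_l,A_lA^t_l+\Sigma_l)$.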
136: 
137: \section{Complexity-regularized mixture models}
138: \label{sec:regularize}
139: 
140: Consider a random vector $X \in \R^n$ with an
141: absolutely continuous distribution $P_f$, described by a pdf $f$. We wish to find a
142: mixture model that would not only yield a good
143: ``local'' approximation to $f$, but also have low complexity, where
144: the precise notion of complexity depends on the application.
145: 
146: In order to set this up quantitatively, we use a complexity-regularized adaptation of the
147: quantizer mismatch approach of Gray and Linder \cite{GraLin03}. We
148: seek a finite
149: collection $\Gamma = \{g_m : m \in \cM \}$ of pdf's from a class $\cG$
150: of ``admissible'' models and a measurable
151: partition $\cR = \{R_m : m \in \cM\}$ of $\R^n$ that would minimize the
152: objective function%\vspace{-10pt}
153: \begin{equation}
154: \bar{I}_f(\cR,\Gamma) \deq \sum_{m \in \cM}P_f(R_m)\big[ D(f_m\|g_m) +
155:   \mu \Phi_\Gamma(g_m)\big],
156: \label{eq:ibar}%\vspace{-10pt}
157: \end{equation}
158: where $f_m$ is the pdf defined as $1_{\{x\in R_m\}}f(x)/P_f(R_m)$, $D(\cdot\|\cdot)$ is the relative
159: entropy, $\Phi_\Gamma(g_m)$ is a regularization functional that
160: quantifies the complexity of the $m$th model pdf relative to the entire
161: collection $\Gamma$, and $\mu \ge 0$ is the parameter that controls
162: the trade-off between the relative-entropy (mismatch) term and the complexity term.
163: 
164: This minimization problem can be posed as a {\em complexity-constrained quantization
165:   problem} with an encoder $\map{\alpha}{\R^n}{\cM}$ corresponding to
166: the partition $\cR = \{R_m\}$ through $\alpha(x) = m$ if $x \in R_m$, a decoder
167: $\map{\beta}{\cM}{\cG}$ defined by $\beta(m) = g_m$, and a length
168: function $\map{\ell}{\cM}{\{0,1,2,\ldots\}}$ satisfying the Kraft
169: inequality $\sum_{m \in \cM}e^{-\ell(m)} \le 1$. In order to
170: describe the encoder and to quantify the performance of the
171: quantization scheme, we need to choose a distortion measure between an input
172: vector and an encoder output in
173: such a way that minimizing average distortion would yield the
174: $\bar{I}$-functional (\ref{eq:ibar}) of the corresponding partition
175: and codebook.
176: 
177: Consider the
178: distortion $\rho(x,m) \deq \ln \big(f(x)/g_m(x)\big) + \ell(m) + \mu
179: \Phi_\Gamma(g_m)$ (this is not a distortion measure in the
180: strict sense since it can be negative, but its expectation with
181: respect to $f$ is nonnegative by the divergence inequality). For a given codebook $\Gamma$ and length function $\ell$, the optimal encoder is the
182: minimum-distortion encoder $\alpha(x) = \argmin_{m \in
183:   \cM}\rho(x,m)$ with ties broken arbitrarily. The resulting partition $\cR =
184: \{R_m\}$ yields the average distortion
185: \begin{eqnarray*}
186: && \E_f \rho\big(X,\alpha(X)\big) = \sum_{m \in \cM}p_m\Big[\ell(m)+\mu
187:   \Phi_\Gamma(g_m) \\
188: && \qquad \qquad + \int_{R_m}f_m(x)\ln\frac{p_mf_m(x)}{g_m(x)}dx\Big],
189: \end{eqnarray*}
190: where $p_m \deq P_f(R_m)$. Then
191: \begin{eqnarray*}
192: && \E_f \rho\big(X,\alpha(X)\big)  = \sum_{m\in\cM}p_m\Big[D(f_m\|g_m)\\
193: && \qquad \qquad +\ln \frac{p_m}{e^{-\ell(m)}} + \mu \Phi_\Gamma(g_m)\Big]\\
194: && \qquad \ge  \sum_{m \in \cM}p_m\big[D(f_m\|g_m) + \mu\Phi_\Gamma(g_m)\big],
195: \end{eqnarray*}
196: with equality if and only if $\ell(m) =
197: -\ln p_m$. Thus, the optimal decoder and length function
198: for a given partition are such that the average $\rho$-distortion is precisely
199: the $\bar{I}$-functional. We can therefore iterate the
200: optimality properties of the encoder, decoder and length function in a
201: Lloyd-type descent algorithm; this can only decrease average distortion and thus
202: the $\bar{I}$-functional. Note that the $\ln f(x)$ term in $\rho(x,m)$ does not affect the
203: minimum-distortion encoder. Thus, as far as the encoder is concerned, the distortion measure $\rho_0(x,m) \deq -\ln g_m(x)
204: + \ell(m) + \mu \Phi_\Gamma(g_m)$ is equivalent to $\rho$.
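
The lower bound on the average $\rho$-distortion derived above rests on the divergence inequality. As a brief sketch, write $q_m \deq e^{-\ell(m)}$ (our shorthand, not used elsewhere), so that $\sum_{m \in \cM}q_m \le 1$ by the Kraft inequality. Then, by the log-sum inequality and $\sum_{m \in \cM}p_m = 1$,
$$
\sum_{m \in \cM}p_m\ln\frac{p_m}{q_m} \ge \ln\frac{1}{\sum_{m \in \cM}q_m} \ge 0,
$$
and both inequalities are tight exactly when $q_m = p_m$ for all $m$, i.e., when $\ell(m) = -\ln p_m$.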
205: 
206: When the distribution of $X$ is unknown, we can take a sufficiently
207: large training sample $X^N =
208: (X_1,\ldots,X_N)$ and use a Lloyd
209: descent algorithm to empirically design a mixture model for the data:\smallskip
210: 
211: \noindent 1) {\bf Initialization:} begin with an initial codebook $\Gamma =
212:   \{g^{(0)}_m : m\in \cM\} \subset \cG$, where $\cG$ is the class of
213:   admissible models, and a length function
214:   $\map{\ell^{(0)}}{\cM}{\{0,1,2,\ldots\}}$. Set iteration number $r =
215:   1$, pick a convergence threshold $\epsilon$, 
216:   and let $D_0$ be the average $\rho_0$-distortion of the initial
217:   codebook.
218: 
219: \noindent 2) {\bf Minimum-distortion encoder:} encode each sample $X_i$
220:   into the index $\alpha^{(r)}(X_i) = \argmin_{m \in
221:     \cM}\rho_0(X_i,g^{(r-1)}_m)$.
222: 
223: \noindent 3) {\bf Centroid decoder:} update the codebook by
224:   minimizing over all $g \in \cG$ the empirical conditional expectation
225: $$
226: \E \big[ \rho_0(X,g) \big| \alpha^{(r)}(X)=m\big] \equiv \frac{1}{N^{(r)}_m} \sum_{i: \alpha^{(r)}(X_i) = m} \rho_0(X_i,g),
227: $$
228: where $N^{(r)}_m \deq \abs{\{ i : \alpha^{(r)}(X_i) = m\}}$, i.e., set $\beta^{(r)}(m) = g^{(r)}_m = \argmin_{g \in \cG}\E
229: \big[\rho_0(X,g) \big| \alpha^{(r)}(X)=m\big]$.
230: 
231: \noindent 4) {\bf Optimal length function:} if $N^{(r)}_m > 0$, let $\ell^{(r)}(m) = - \ln
232:   p^{(r)}_m$, where $p^{(r)}_m = N^{(r)}_m/N$. If $N^{(r)}_m = 0$, remove
233:   the corresponding cell from the code and decrease
234:   $\abs{\cM}$ by 1.
235: 
236: \noindent 5) {\bf Test:} compute the average $\rho_0$-distortion $D_r$ with
237:   the code $(\alpha^{(r)},\beta^{(r)},\ell^{(r)})$. If
238:   $(D_{r-1}-D_r)/D_{r-1} < \epsilon$, quit. Otherwise, increment $r$ by 1 and
239:   go to Step 2.\smallskip
240: 
241: With a judicious choice of the initial codebook and length function,
242: this algorithm yields a finite mixture model $\{(g_m,p_m) :
243: m \in \cM\}$ as a good ``fit'' to the empirical distribution
244: of the data in the sense of near-optimal trade-off between the local
245: mismatch and complexity.
246: 
247: \section{Application to NLDR}
248: \label{sec:stochembed}
249: 
250: Given a training sample $X^N = (X_1,\ldots,X_N)$ of ``raw''
251: $n$-dimensional data and assuming its intrinsic dimension $k < n$ is
252: known, our goal is to
253: determine two mappings, $\map{v}{\R^n}{\R^k}$ and $\map{w}{\R^k}{\R^n}$,
254: where $v$ maps high-dimensional vectors to their dimension-reduced
255: versions and $w$ maps back to the high-dimensional space. In general,
256: the dimension-reducing map $v$ entails loss of information, so $w(v(x)) \neq x$. Therefore we will be interested
257: in the average distortion incurred by our scheme, $\bar{d}(v,w) \deq
258: \E[d(X,w(v(X)))]$, where $\map{d}{\R^n\times\R^n}{[0,\infty)}$ is a suitable distortion
259: measure on pairs of $n$-vectors, e.g., the squared Euclidean distance,
260: and the expectation is w.r.t. the empirical distribution of the
261: sample.
262: 
263: \subsection{Mixture model of a stochastic embedding}
264: \label{ssec:mmstochemb}
265: 
266: The first step is to use the above quantization scheme to fit a complexity-regularized
267: Gaussian mixture model to the training sample. Our class
268: $\cG$ of admissible model pdf's will be the set of all $n$-dimensional
269: Gaussians with nonsingular covariance matrices, $\cG = \{ \cN(x;\mu,K)
270: : \mu \in \R^n, \det K > 0 \}$, and for each finite set $\Gamma
271: \subset \cG$ we shall define a regularization functional
272: $\map{\Phi_\Gamma}{\Gamma}{[0,\infty)}$ that penalizes those $g \in
273:   \Gamma$ that are ``geometrically complex'' relative to the rest of
274:   $\Gamma$.
275: 
276: The idea of ``geometric complexity'' can be motivated
277: \cite{RowSauHin02,Bra03} by the example of the Gaussian
278: mixture model from Sect.~\ref{sec:manifolds}. The covariance matrix of the
279: $l$th component, $A_lA^t_l + \Sigma_l$, is invariant under
280: the mapping $A_l \mapsto A_lR$, where $R$ is a $k \times k$ orthogonal
281: matrix, i.e., $RR^t = I$. In geometric terms, a copy of the orthogonal
282: group $O_k$ associated with the
283: $l$th component of the mixture is the
284: group of rotations and reflections in the tangent space to $M$ at
285: $\varphi^{-1}_l(0)$. Thus, the log-likelihood term in $\rho_0$ is not affected by assigning arbitrary and
286: independent orientations to the tangent
287: spaces associated with the components of the mixture. However, since
288: our goal is to model the intrinsic {\em global}
289: geometry of the data, it should be possible to smoothly glue together the local data provided by our model. We therefore
290: require that the orientations of the tangent spaces at
291: ``nearby'' points change smoothly as well. (In fact, one has to impose
292: certain continuity requirements on the orientation of the tangent
293: spaces in order to define measure and integration on the manifold
294: \cite[Ch.~XI]{Lan95}.)
295: 
296: Given a finite set $\Gamma \subset \cG$, we shall define the
297: regularization functional $\map{\Phi_\Gamma}{\Gamma}{[0,\infty)}$ as
298: \begin{equation}
299: \Phi_\Gamma(g) \deq \sum_{g' \in
300:   \Gamma\backslash\{g\}}\kappa(\mu_g,\mu_{g'})D(g'\|g),
301: \label{eq:gc}
302: \end{equation}
303: where $\map{\kappa}{\R^n\times\R^n}{\R^+}$ is a smooth positive
304: symmetric kernel such that $\kappa(x,x') \to 0$ as $\norm{x-x'} \to
305: \infty$, and
306: \begin{eqnarray*}
307: && D(g'\|g) = \frac{1}{2}\big(\ln\det(K^{-1}_{g'}K_g) +
308: \tr(K^{-1}_gK_{g'}) \\
309: && \qquad \qquad \qquad + (\mu_g-\mu_{g'})^tK^{-1}_g(\mu_g-\mu_{g'})-n\big)
310: \end{eqnarray*}
311: is the relative entropy between two Gaussians. Possible choices for
312: the kernel $\kappa$ are the inverse Euclidean distance $\kappa(x,x') =
313: \norm{x-x'}^{-1}$ \cite{Bra03a}, a Gaussian kernel $\kappa(x,x') =
314: \cN(x-x';0,\sigma^2I)$ for a suitable value of $\sigma$
315: \cite{Bra03,Bra03a} or a compactly supported ``bump'' $\kappa(x,x') =
316: \psi_{r_1,r_2}(x-x')$, where $\psi_{r_1,r_2}$ is an infinitely
317: differentiable reflection-symmetric function
318: that is identically zero everywhere outside a closed ball of radius
319: $r_2$ and one everywhere inside an open ball of radius $r_1 <
320: r_2$. The relative entropy serves as a measure of position and
321: orientation alignment of the tangent spaces, while the smoothing
322: kernel ensures that more weight is assigned to ``nearby''
323: components. This complexity functional is a generalization of the
324: ``global coordination'' prior of Brand \cite{Bra03} to mixtures
325: with unequal component weights.
326: 
327: With these definitions of $\cG$ and $\Phi_\Gamma$, the
328: $\rho_0$-distortion for a codebook $\Gamma = \{g_m : m \in \cM\}$ and a
329: length function $\ell$ is
330: \begin{eqnarray*}
331: && \rho_0(x,m) = \frac{1}{2}\ln\det K_m +
332: \frac{1}{2}(x-\mu_m)^tK^{-1}_m(x-\mu_m) \\
333: && \quad \quad \quad + \ell(m) + \mu\sum_{m' \in \cM\backslash\{m\}}
334: \kappa(\mu_m,\mu_{m'}) D(g_{m'}\|g_m),
335: \end{eqnarray*}
336: where we have also removed the $(n/2)\ln(2\pi)$ term as it does not
337: affect the encoder. The effect of the geometric complexity term is to
338: curve the boundaries of the partition cells according to
339: locally interpolated ``nonlocal information'' about the rest of the
340: codebook. Determining the Lloyd centroids for the decoder
341: will involve solving $\abs{\cM}$ simultaneous nonlinear equations for
342: the means and the same number of equations for the covariance
343: matrices. For computational efficiency we can use the kernel data from
344: the previous iteration, which would sacrifice optimality but avoid
345: nonlinear equations.
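
As a sanity check on the centroid step, consider the special case in which the regularization parameter $\mu$ is zero; we work it out here only as an illustration. Let $N_m$ denote the number of training samples encoded into cell $m$ (our shorthand for $N^{(r)}_m$), and assume that the sample covariance of each cell is nonsingular. Minimizing the empirical conditional expectation of $\rho_0$ over $g \in \cG$ is then ordinary maximum-likelihood fitting of a Gaussian to the samples in the cell, with the familiar solution
\begin{eqnarray*}
&& \hat{\mu}_m = \frac{1}{N_m}\sum_{i : \alpha(X_i)=m}X_i, \\
&& \hat{K}_m = \frac{1}{N_m}\sum_{i:\alpha(X_i)=m}(X_i-\hat{\mu}_m)(X_i-\hat{\mu}_m)^t.
\end{eqnarray*}
When the regularization parameter is positive, the penalty $\Phi_\Gamma$ couples each centroid to all the others through $\kappa$ and $D(\cdot\|\cdot)$, which is the source of the simultaneous nonlinear equations mentioned above.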
346: 
347: \subsection{Design of reduction and reconstruction maps}
348: \label{ssec:redrec}
349: 
350: The output of the previous step is a Gauss mixture model $\{ (g_m,p_m) : m \in
351: \cM\}$ and a partition $\cR = \{R_m\}$ of $\R^n$. Suppose that for
352: each $m \in \cM$ the eigenvectors
353: $e^{(m)}_1,\ldots,e^{(m)}_n$ of $K_m$ are
354: numbered in the order of decreasing eigenvalues, $\lambda^{(m)}_1 \ge
355: \ldots \ge \lambda^{(m)}_n$. The next step is to design the dimension-reducing map $v$ and the reconstruction
356: map $w$. One method, proposed by Brand \cite{Bra03}, is to use the
357: mixture model of the underlying pdf [obtained in his case by an EM algorithm with a prior
358: corresponding to the average of the complexity $\Phi_\Gamma(g)$ over the
359: entire codebook and with equiprobable components of the mixture] to
360: construct a mixture of local affine transforms, preceded by
361: local Karhunen-\Loeve\ transforms, as a solution to a weighted
362: least-squares problem.
363: 
364: However, we can use the encoder partition $\cR$ directly: for each $m
365: \in \cM$, let $v_m(x) \deq
366: \Pi_m(x-\mu_m)$, where $\Pi_m$ is the projection onto the first $k$
367: eigenvectors of $K_m$, and then define $v(x) = \sum_{m \in \cM}1_{\{x\in
368:   R_m\}}v_m(x)$. This approach is similar to local principal component
369: analysis of Kambhatla and Leen \cite{KamLee97}, except that their
370: quantizer was not complexity-regularized and therefore the shape of
371: the resulting Voronoi regions was determined only by local statistical
372: data. We can describe the operation of dimension reduction (feature
373: extraction) as an encoder $\map{\hat{v}}{\R^n}{\cM\times\R^k}$, so that $\hat{v}(x) =
374: (\alpha(x),v_{\alpha(x)}(x))$, where $\alpha$ is the
375: minimum-distortion encoder for the $\rho_0$-distortion.
376: 
377: The corresponding reconstruction operation can be
378: designed as a decoder $\map{\hat{w}}{\cM \times \R^k}{\R^n}$ which
379: receives a pair $(m,u)$, $m \in \cM,u \in \R^k$, and computes $w_m(u) =
380: \mu_m + \sum^k_{i=1}\ave{u,e^{(m)}_i}e^{(m)}_i$, where
381: $\ave{\cdot,\cdot}$ denotes the usual scalar product in $\R^k$.
382: 
383: This encoder-decoder pair is a composite Karhunen-\Loeve\ transform coder matched to the mixture source $g = \sum_m
384: p_mg_m$. If the data alphabet $\cX$ is compact, then the squared-error
385: distortion is bounded by some $A > 0$,
386: and the mismatch due to using this composite coder on the disjoint
387: mixture
388: source $f = \sum_m p_mf_m$ can be bounded from above by $A\|f-g\|_1$, where
389: $\|\cdot\|_1$ is the $L_1$ norm. Provided that the mixture $g$ is
390: optimal for $f$ in the sense of minimizing the $\rho$-distortion,
391: we can use Pinsker's inequality
392: \cite[Ch.~5]{DevLug01} $\|f-g\|_1 \le \sqrt{2D(f\|g)}$ and convexity of
393: the relative entropy to further bound the mismatch by $A\sqrt{2\big(\bar{I}_f(\cR,\Gamma) - \mu\sum_mp_m\Phi_\Gamma(g_m)\big)}$.
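
The last bound can be spelled out as follows; this is only a sketch of the two steps named above. Since $f = \sum_m p_mf_m$ and $g = \sum_m p_mg_m$ share the weights $p_m$, joint convexity of the relative entropy gives
\begin{eqnarray*}
&& D(f\|g) \le \sum_{m\in\cM}p_mD(f_m\|g_m) \\
&& \qquad = \bar{I}_f(\cR,\Gamma) - \mu\sum_{m\in\cM}p_m\Phi_\Gamma(g_m),
\end{eqnarray*}
where the equality is just the definition (\ref{eq:ibar}). Combining this with Pinsker's inequality yields
\begin{eqnarray*}
&& A\|f-g\|_1 \le A\sqrt{2D(f\|g)} \\
&& \qquad \le A\sqrt{2\Big(\bar{I}_f(\cR,\Gamma) - \mu\sum_{m \in \cM}p_m\Phi_\Gamma(g_m)\Big)}.
\end{eqnarray*}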
394: 
395: Note that the maps
396: $v$ and $w$ are not smooth, unlike the analogous maps of Brand \cite{Bra03,Bra03a}. This
397: is an artifact of the hard partitioning used in our scheme. However,
398: hard partitioning has certain advantages: it allows
399: for use of composite codes \cite{GraLin03} and nonlinear interpolative vector quantization
400: \cite{Ger90} if additional compression of dimension-reduced data is
401: required. Moreover, the lack of smoothness is not a problem in our
402: case because
403: we can use kernel interpolation techniques to model the geometry of
404: dimension-reduced data by a smooth manifold, as explained next.
405: 
406: \subsection{Manifold structure of dimension-reduced data}
407: \label{ssec:dimredman}
408: 
409: Our use of mixture models has been motivated by
410: certain assumptions about the structure of stochastic embeddings of
411: low-dimensional manifolds into high-dimensional spaces. In
412: particular, given an $n$-dimensional Gaussian mixture model $\{(g_m,p_m) : m \in \cM\}$, we can associate to each component of the
413: mixture a chart of the underlying manifold, such that the image of the
414: chart in $\R^k$ is an open ball of radius $r_m = (\lambda^{(m)}_1)^{1/2}$ centered at the origin, and we can take the
415: first $k$ eigenvectors of the covariance matrix of $g_m$ as coordinate
416: axes in the tangent space to the manifold at
417: the inverse image of $0 \in \R^k$ under the $m$th chart. Owing to
418: geometric complexity regularization, the orientations of tangent
419: spaces change smoothly as a function of position.
420: 
421: Ideally, one would like to construct a smooth manifold consistent with the
422: given descriptions of charts and tangent spaces. However, this is a
423: fairly difficult task since we not only have to define a smooth coordinate map
424: $\varphi_m$ for each chart, but also make sure that these maps satisfy
425: the chart compatibility condition. Instead, we can construct the
426: manifold {\em implicitly} by gluing the coordinate frames of the
427: tangent spaces into an object having a smooth inner product.
428: 
429: Specifically, let us fix a sufficiently small $\delta > 0$, and let
430: $\psi_m$ be an infinitely differentiable function
431: that is identically zero everywhere outside a closed ball of radius
432: $r_m$ and one everywhere inside an open ball of radius $r_m-\delta$, with
433: both balls centered at $\Pi_m\mu_m$. Let $\eta_m(u) \deq \frac{p_m\psi_m(u)}{\sum_{m' \in
434:     \cM}p_{m'}\psi_{m'}(u)}$. The inner product of two vectors $u,u' \in
435: \R^k$, treated as elements of the tangent space
436: $T_{\varphi^{-1}_m(0)}M$, is given by $\ave{u,u'}_m = \sum^k_{i=1} \ave{u,e^{(m)}_i}\ave{e^{(m)}_i,u'}$. Then for each $y \in \R^k$ the map $\map{g_y}{\R^k \times \R^k}{\R}$,
437: $$
438: g_y(u,u') \deq \sum_{m \in \cM}\eta_m(y+\Pi_m\mu_m)\ave{u,u'}_m,
439: $$
440: is a symmetric form, which is positive definite
441: whenever $\eta_m(y+\Pi_m\mu_m) \neq 0$ for at least one value of
442: $m$. In addition, the map $y \mapsto g_y(\cdot,\cdot)$ is smooth. In
443: this way, we have implicitly defined a {\em Riemannian metric}
444: \cite[Ch.~VII]{Lan95} on the underlying
445: manifold. The functions $\eta_m$ form a so-called {\em smooth
446:   partition of unity}, which is the only known
447: way of gluing together local geometric data to form smooth objects \cite[Ch.~II]{Lan95}.
448: 
449: In geometric terms, $\eta_m(y+\Pi_m\mu_m) = 0$ for all $m$ if and only
450: if $y \in \R^k$ is an image under the dimension-reduction map of a
451: point in $\R^n$ whose first $k$ principal components w.r.t. each
452: Gaussian in the mixture model fall outside the covariance ellipsoid of
453: that Gaussian. If the mixture model is close to
454: optimum, this will happen with negligible probability. A practical
455: advantage of this feature of our scheme is in rendering it robust to outliers.
456: 
457: \section{Consistency and codebook design}
458: \label{sec:consistency}
459: 
460: Our mixture modeling
461: scheme can also be used to estimate the ``true'' but unknown pdf $f^*$ of the
462: high-dimensional data, if we assume that $f^*$ belongs to some fixed
463: class $\cF$. Indeed, the empirically designed codebook
464: $\Gamma = \{g_m : m \in \cM\}$ of Gaussian pdf's, the corresponding
465: component weights $\{p_m\}$, and the mixture $g = \sum_{m \in
466:   \cM}p_mg_m$ are random variables since they depend on the training sample $X^N$. We are interested in the quality of
467: approximation of $f^*$ by the mixture $g \equiv g(X^N)$.
468: 
469: Following Moulin and Liu \cite{MouLiu00}, we use the relative-entropy loss function
470: $D(f^*\|g)$. We shall give an upper bound on the loss in terms of the {\em index of
471:   resolvability} \cite{MouLiu00}
472: $$
473: R_{\mu,N}(f^*) \deq \Min_{m \in \cM}\left[D(f^*\|g_m)+\frac{\mu
474:     L(g_m)}{N}\right],
475: $$
476: where $L(g_m) \deq \Phi_\Gamma(g_m) - \ln
477: p_m$, which quantifies how well $f^*$ can be
478: approximated, in the relative-entropy sense (and, by Pinsker's
479: inequality, in $L_1$ sense),
480: by a Gaussian of moderate geometric complexity relative to the rest of
481: the codebook. We have the following result:
482: 
483: \begin{theorem} Let the codebook $\Gamma = \{g_m : m \in \cM\}$ of
484:   Gaussian pdf's be such that the log-likelihood ratios $U_m \deq -\ln
485:   \big(f^*(X)/g_m(X)\big)$ uniformly satisfy the {\em Bernstein moment condition} \cite{DevLug01}, i.e.,
486:   there exists some $h > 0$ such that $\E\abs{U_m-\E U_m}^k \le
487:   (1/2)\var(U_m)k!h^{k-2}$ for all $k \ge 2$. Let $M(f^*)$ be the
488:   smallest number such that $\var(U_m) \le -M(f^*)\E U_m$ for all $m
489:   \in \cM$
490: (owing to the Bernstein condition, it is nonnegative and
491:   finite). Then, for any $\mu > h + M(f^*)/2$ and $\delta > 0$, 
492: \begin{equation}
493: \Pr\left\{D(f^*\|g) \le \frac{1+\alpha}{1-\alpha}R_{\mu,N}(f^*) + \frac{2\mu
494:   \ln\frac{\abs{\cM}}{\delta}}{(1-\alpha)N}\right\} \ge 1-2\delta,
495: \label{eq:lossbound1}
496: \end{equation}
497: where $\alpha = \frac{M(f^*)}{2(\mu - h)}$. The expected loss satisfies
498: \begin{equation}
499: \E[D(f^*\|g)] \le \frac{1+\alpha}{1-\alpha}R_{\mu,N}(f^*) +
500: \frac{4\abs{\cM}\mu}{(1-\alpha)N}.
501: \label{eq:lossbound2}
502: \end{equation}
503: The probabilities and expectations are all w.r.t. the pdf $f^*$.\end{theorem}
504: 
505: \begin{proof} Since $\Phi_\Gamma(g_m) \ge 0$ for all $m
506:   \in \cM$, the composite complexity $L(g_m)$ satisfies the Kraft inequality: $\sum_{m \in \cM}e^{-L(g_m)} = \sum_{m \in \cM}p_me^{-\Phi_\Gamma(g_m)} \le \sum_{m \in \cM}p_m = 1$. Then we
507: can use a strategy similar to that of Moulin and Liu \cite{MouLiu00} to prove that
508: $$
509: \Pr \left\{D(f^*\|g_m) \ge \frac{1+\alpha}{1-\alpha}R_{\mu,N}(f^*) +
510: \frac{2\mu\ln\frac{\abs{\cM}}{\delta}}{(1-\alpha)N}\right\} \le
511: \frac{2\delta}{\abs{\cM}}
512: $$
513: for each $m \in \cM$. Hence, by the union bound
514: $$
515: D(f^*\|g_m) \le \frac{1+\alpha}{1-\alpha}R_{\mu,N}(f^*) +
516: \frac{2\mu\ln\frac{\abs{\cM}}{\delta}}{(1-\alpha)N}
517: $$
518: for all $m \in \cM$, except for an event of probability at most
519: $2\delta$. By convexity of the relative entropy,
520: $D(f^*\|g_m) \le C$ for all $m \in \cM$ implies that $D(f^*\|g) \le C$
521: for $g = \sum_{m \in \cM}p_mg_m$. Therefore
522: $$
523: D(f^*\|g) \le \frac{1+\alpha}{1-\alpha}R_{\mu,N}(f^*) +
524: \frac{2\mu \ln \frac{\abs{\cM}}{\delta}}{(1-\alpha)N}
525: $$
526: with probability at least $1-2\delta$. To prove (\ref{eq:lossbound2}), we use the fact \cite{DevLug01} that
527: if $Z$ is a random variable with $\E\abs{Z} < \infty$, then $\E[Z] \le
528: \int^\infty_0 \Pr[Z\ge t]dt$. We let $Z =
529: D(f^*\|g)-\frac{1+\alpha}{1-\alpha}R_{\mu,N}(f^*)$ and choose $\delta
530: = \abs{\cM}e^{-\frac{Nt(1-\alpha)}{2\mu}}$. Then $\E[Z] \le \frac{4\abs{\cM}\mu}{(1-\alpha)N}$, which proves (\ref{eq:lossbound2}).
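For completeness, here is the integral behind this last step, under the choice of $\delta$ above. The high-probability bound gives $\Pr[Z \ge t] \le 2\delta = 2\abs{\cM}e^{-\frac{Nt(1-\alpha)}{2\mu}}$ for every $t \ge 0$, so
\begin{eqnarray*}
&& \E[Z] \le \int^\infty_0 2\abs{\cM}e^{-\frac{Nt(1-\alpha)}{2\mu}}dt \\
&& \qquad = 2\abs{\cM}\cdot\frac{2\mu}{(1-\alpha)N} = \frac{4\abs{\cM}\mu}{(1-\alpha)N}.
\end{eqnarray*}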
531: \end{proof}
532: 
533: To discuss consistency in the large-sample limit, consider a
534: sequence of empirically designed mixture models
535: $\{(g^{(N)}_m,p^{(N)}_m) : m \in \cM^{(N)}\}$. This is
536: different from the usual empirical quantizer design, where we
537: increase the training set size but keep the number of quantizer
538: levels fixed. The scheme is consistent
539: in the relative-entropy sense if $\E D(f^*\|g^{(N)}) \to 0$ as $N \to
540: \infty$, where $g^{(N)} = \sum_{m\in\cM^{(N)}}p^{(N)}_mg^{(N)}_m$ and the
541: expectation is with respect to $f^*$.
542: 
543: A sufficient condition for consistency can be determined by inspection
544: of the upper bound in Eq.~(\ref{eq:lossbound2}). Specifically, we
545: require that the codebooks $\Gamma^{(N)}$ satisfy: (a) $\max_{m \in
546:   \cM^{(N)}}L(g^{(N)}_m) = o(N)$, (b) $\min_{m \in
547:   \cM^{(N)}}D(f^*\|g^{(N)}_m) = o(1)$ for all $f^* \in \cF$, and (c)
548: $\abs{\cM^{(N)}} = o(N)$. Condition (c) can be satisfied by
549: initializing the Lloyd algorithm with a codebook of size much smaller than the training set size $N$, which is usually done in practice in order
550: to ensure good training performance. The first two conditions can also
551: be easily met in many practical settings.
552: 
553: Consider, for instance, the class $\cF$ of all pdf's supported on a
554: compact $\cX \subset \R^n$ and Lipschitz-continuous with Lipschitz constant $c$. Then, if we take as our class of
555: admissible Gaussians $\cG = \{\cN(x;\mu,K) : \mu \in \cX, c_1 \le \det K
556: \le c_2\}$ for suitably chosen constants $c_1,c_2 > 0$
557: independent of $N$, the relative entropy $D(g\|g')$ of any two $g,g'
558: \in \cG$ can be bounded independently of $N$, and
559: condition (a) will be met with proper choice of the component
560: weights. Condition (b) is likewise easy to meet since the maximum
561: value of any $f^* \in \cF$ depends only on the set $\cX$, the
562: Lipschitz constant $c$, and the dimension $n$.
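
To make the boundedness claim concrete, it may help to recall the standard closed form for the relative entropy (in nats) between two nondegenerate Gaussians on $\R^n$. This identity is not stated above; the primed symbols $\mu'$, $K'$ for the parameters of $g'$ and the eigenvalue bound $\lambda_0$ below are notation and assumptions introduced only for this illustration. For $g = \cN(x;\mu,K)$ and $g' = \cN(x;\mu',K')$,
$$
D(g\|g') = \frac{1}{2}\left[\mathrm{tr}\left((K')^{-1}K\right) +
(\mu'-\mu)^T (K')^{-1}(\mu'-\mu) - n + \ln\frac{\det K'}{\det K}\right].
$$
The constraint $c_1 \le \det K \le c_2$ bounds the logarithmic term by $\ln(c_2/c_1)$ in absolute value, and compactness of $\cX$ bounds $\|\mu'-\mu\|$ by the diameter of $\cX$. If, in addition, the smallest eigenvalue of every admissible covariance matrix is at least some $\lambda_0 > 0$ (an extra assumption made here), then the trace and quadratic terms are bounded as well, by constants that do not depend on $N$.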
563: 
564: In general, the issue of optimal codebook design is closely related to
565: the problem of universal vector quantization \cite{ChoEffGra96}:
566: we can consider, e.g., a class $\cF$ of pdf's with disjoint
567: supports contained in a compact $\cX \subset \R^n$. Then a sequence of
568: Gaussian codebooks that yields a consistent estimate of each $f^* \in \cF$
569: in the large-sample limit is weakly minimax universal
570: \cite{ChoEffGra96} for
571: $\cF$ and can also be used to quantize any source contained in the
572: $L_1$-closed convex hull of $\cF$.
573: 
574: \section{Discussion}
575: \label{sec:discuss}
576: 
577: We have introduced a complexity-regularized quantization approach to
578: NLDR. One advantage of this scheme over existing methods for NLDR
579: based on Gaussian mixtures, e.g., \cite{Bra03}, is that, instead of
580: fitting a Gaussian mixture to the entire sample, we design a codebook
581: of Gaussians that provides a good trade-off between local adaptation to
582: the data and global geometric coherence, which is key to robust
583: geometric modeling. Complexity regularization is based on a kernel
584: smoothing technique that allows for a meaningful geometric
585: description of dimension-reduced data by means of a Riemannian metric
586: and is also robust to outliers. Moreover, to our knowledge, the
587: consistency proof presented here is the first theoretical asymptotic
588: consistency result applied to NLDR.
589: 
590: Work is currently underway to implement the proposed scheme for applications to image processing and computer vision. Also
591: planned is future work on a quantization-based approach to
592: estimating the intrinsic dimension of the data and on assessing asymptotic
593: {\em geometric} consistency of our scheme in terms of the
594: Gromov--Hausdorff distance between compact metric spaces
595: \cite{Pet90}.\smallskip
596: 
597: \noindent{\bf Acknowledgment.} I would like to thank Svetlana Lazebnik and Prof.~Pierre Moulin for useful discussions. This research has
598: been supported by the Beckman Postdoctoral Fellowship.
599: 
600: \begin{thebibliography}{15}
601: 
602: \bibitem{TenSilLan00}
603: J.~Tenenbaum, V.~de~Silva, and J.~Langford, ``A global geometric framework for
604:   nonlinear dimensionality reduction,'' \emph{Science}, vol. 290, pp.
605:   2319--2323, December 2000.
606: 
607: \bibitem{RowSau00}
608: S.~Roweis and L.~Saul, ``Nonlinear dimensionality reduction by locally linear
609:   embedding,'' \emph{Science}, vol. 290, pp. 2323--2326, December 2000.
610: 
611: \bibitem{LevBic05}
612: E.~Levina and P.~Bickel, ``Maximum likelihood estimation of intrinsic
613:   dimension,'' in \emph{Adv. Neural Inform. Processing Systems}, L.~Saul,
614:   Y.~Weiss, and L.~Bottou, Eds., vol.~17.\hskip 1em plus 0.5em minus
615:   0.4em\relax Cambridge, MA: MIT Press, 2005.
616: 
617: \bibitem{Lan95}
618: S.~Lang, \emph{Differential and Riemannian Manifolds}.\hskip 1em plus 0.5em
619:   minus 0.4em\relax New York: Springer-Verlag, 1995.
620: 
621: \bibitem{RowSauHin02}
622: S.~Roweis, L.~Saul, and G.~Hinton, ``Global coordination of locally linear
623:   models,'' in \emph{Adv. Neural Inform. Processing Systems}, T.~Dietterich,
624:   S.~Becker, and Z.~Ghahramani, Eds., vol.~14.\hskip 1em plus 0.5em minus
625:   0.4em\relax Cambridge, MA: MIT Press, 2002, pp. 889--896.
626: 
627: \bibitem{GraLin03}
628: R.~Gray and T.~Linder, ``Mismatch in high-rate entropy-constrained vector
629:   quantization,'' \emph{IEEE Trans. Inform. Theory}, vol.~49, no.~5, pp.
630:   1204--1217, May 2003.
631: 
632: \bibitem{Bra03}
633: M.~Brand, ``Charting a manifold,'' in \emph{Adv. Neural Inform. Processing
634:   Systems}, S.~Becker, S.~Thrun, and K.~Obermayer, Eds., vol.~15.\hskip 1em
635:   plus 0.5em minus 0.4em\relax Cambridge, MA: MIT Press, 2003, pp. 977--984.
636: 
637: \bibitem{Bra03a}
638: ------, ``Continuous nonlinear dimensionality reduction by kernel eigenmaps,''
639:   in \emph{Int. Joint Conf. Artif. Intel.}, 2003.
640: 
641: \bibitem{KamLee97}
642: N.~Kambhatla and T.~Leen, ``Dimension reduction by local principal component
643:   analysis,'' \emph{Neural Comput.}, vol.~9, pp. 1493--1516, 1997.
644: 
645: \bibitem{DevLug01}
646: L.~Devroye and G.~Lugosi, \emph{Combinatorial Methods in Density
647:   Estimation}.\hskip 1em plus 0.5em minus 0.4em\relax New York:
648:   Springer-Verlag, 2001.
649: 
650: \bibitem{Ger90}
651: A.~Gersho, ``Optimal nonlinear interpolative vector quantization,'' \emph{IEEE
652:   Trans. Commun.}, vol.~38, no.~9, pp. 1285--1287, September 1990.
653: 
654: \bibitem{MouLiu00}
655: P.~Moulin and J.~Liu, ``Statistical imaging and complexity regularization,''
656:   \emph{IEEE Trans. Inform. Theory}, vol.~46, no.~5, pp. 1762--1777, August
657:   2000.
658: 
659: %\bibitem{Dob70}
660: %R.~Dobrushin, ``Unified methods for optimal quantization of messages,'' in
661: %  \emph{Problemy Kibernetiki}, A.~A. Lyapunov, Ed.\hskip 1em plus 0.5em minus
662: %  0.4em\relax Moscow: Nauka, 1970, vol.~22, pp. 107--156, (in Russian).
663: 
664: \bibitem{ChoEffGra96}
665: P.~Chou, M.~Effros, and R.~Gray, ``A vector quantization approach to universal
666:   noiseless coding and quantization,'' \emph{IEEE Trans. Inform. Theory},
667:   vol.~42, no.~4, pp. 1109--1138, July 1996.
668: 
669: \bibitem{Pet90}
670: P.~Petersen, ``Gromov-{H}ausdorff convergence of metric spaces,'' in
671:   \emph{Summer Inst. Diff. Geom.}, ser. Proc. Symposia Pure Math.\hskip 1em
672:   plus 0.5em minus 0.4em\relax Amer. Math. Soc., 1990, pp. 489--505.
673: 
674: \end{thebibliography}
675: 
676: 
677: 
678: \end{document}
679: