0501:cs0501091/nldr.tex

1: \documentclass[10pt,letterpaper,conference]{IEEEtran}

2: %\bibliographystyle{IEEEtran.bst}

3: %\usepackage[dvipdfm]{hyperref}

4: \usepackage{maxim}

5: \usepackage{verbatim}

6: \mythmfalse

7:

8: \def\cA{{\cal A}}

9: \def\cF{{\cal F}}

10: \def\cG{{\cal G}}

11: \def\cM{{\cal M}}

12: \def\cN{{\cal N}}

13: \def\cR{{\cal R}}

14: \def\cX{{\cal X}}

15:

16: \begin{document}

17:

18: \title{A Complexity-Regularized Quantization Approach to Nonlinear

19:   Dimensionality Reduction\vspace{-10pt}}

20: \author{\authorblockN{Maxim Raginsky}

21: \authorblockA{Beckman Institute and the University of Illinois\\

22: 405 N Mathews Ave, Urbana, IL 61801, USA \\

23: Email: maxim@uiuc.edu\vspace{-10pt}}

24: }

25: \maketitle

26: \begin{abstract}We consider the problem of nonlinear dimensionality

27: reduction: given a training set of high-dimensional

28: data whose ``intrinsic'' low dimension is

29: assumed known, find a feature extraction map to low-dimensional space, a

30: reconstruction map back to high-dimensional space, and a geometric

31: description of the dimension-reduced data as a smooth manifold. We introduce a

32: complexity-regularized quantization approach for fitting a Gaussian

33: mixture model to the training set via a Lloyd algorithm. Complexity regularization controls the trade-off between

34: adaptation to the local shape of the underlying manifold and global geometric consistency. The

35: resulting mixture model is used to design the feature extraction

36: and reconstruction maps and to define a Riemannian metric on the

37: low-dimensional data. We also sketch a proof of

38: consistency of our scheme for the purposes of estimating the unknown

39: underlying pdf of high-dimensional data.\end{abstract}

40:

41: \section{Introduction}

42: \label{sec:intro}

43:

44: When dealing with high volumes of vector-valued data of some

45: large dimension $n$, it is often assumed that the data possess some intrinsic geometric description in a space of unknown dimension

46: $k < n$ and that the high dimensionality arises

47: from an unknown stochastic mapping of $\R^k$ into $\R^n$. We can pose

48: the problem of {\em nonlinear dimensionality reduction} (NLDR)

49: \cite{TenSilLan00, RowSau00}

50: as follows:

51: given raw data with values in $\R^n$, we wish to obtain optimal estimates of the intrinsic

52: dimension $k$ and of the stochastic map with the purpose of

53: modeling the intrinsic geometry of

54: the data in $\R^k$.

55:

56: One typically considers the following set-up: we are given a sample $X^N

57: \equiv (X_1,\ldots,X_N)$, where $X_i$ are i.i.d. according to an unknown

58: absolutely continuous distribution $P^*$. The corresponding pdf $f^*$ has to be estimated from the observation as $\hat{f}_N \equiv

59: \hat{f}_N(X^N)$. The intrinsic dimension $k$ of the data may not be known

60: in advance and would also have to be estimated as

61: $\hat{k}_N \equiv \hat{k}_N(X^N)$. Since the pdf $f^*$ is assumed to arise from a stochastic map of the low-dimensional space $\R^k$ into the high-dimensional space

62: $\R^n$, we can use our knowledge about $k$ and $f^*$ in order to make

63: inferences about the intrinsic geometry of the data. In the absence of

64: such knowledge, any such inference has to be

65: made based on the estimates $\hat{k}_N$ and $\hat{f}_N$. In this paper we introduce a complexity-regularized quantization

66: approach to NLDR, assuming that the intrinsic dimension $k$ of the data is given

67: (e.g., as a maximum-likelihood estimate \cite{LevBic05}).

68:

69: \section{Smooth manifolds and their noisy embeddings}

70: \label{sec:manifolds}

71:

72: We begin with a quick sketch of some notions about smooth

73: manifolds \cite{Lan95}. A {\em smooth manifold} of dimension $k$ is a set $M$ together with a

74: collection $\cA = \{(U_l,\varphi_l) : l \in \Lambda\}$, where the sets

75: $U_l \subset M$ cover $M$ and each map $\varphi_l$ is a bijection of

76: $U_l$ onto an open set $\varphi_l(U_l)\subset\R^k$, such that for

77: all $l,l'$ with $U_l \cap U_{l'} \neq \varnothing$ the map

78: $\map{\varphi_{l'}\circ\varphi^{-1}_l}{\varphi_l(U_l \cap

79:   U_{l'})}{\varphi_{l'}(U_l \cap U_{l'})}$ is smooth. The pairs

80: $(U_l,\varphi_l)$ are called {\em charts} of $M$, and the entire

81: collection $\cA$ is referred to as an {\em atlas}. Intuitively, the

82: charts describe the points of $M$ by {\em local}

83: coordinates: given $p \in M$ and a chart $(U_l\ni p,\varphi_l)$,

84: $\varphi_l$ maps any point $q$ ``near $p$'' (i.e., $q \in U_l$) to an element of $\varphi_l(U_l) \subset \R^k$. Smoothness of the transition maps $\varphi_{l'}

85: \circ \varphi^{-1}_l$ ensures that local coordinates of a point transform

86: differentiably under a change of chart.

87:

88: Assuming that $M$ is compact, we

89: can always choose the atlas $\cA$ in such a way that the indexing set

90: $\Lambda$ is finite and each $\varphi_l(U_l)$ is

91: an open ball of radius $r_l$ \cite[Thm.~3.3]{Lan95} (one can always

92: set $r_l \equiv 1$ for all $l \in \Lambda$, but we choose not to do

93: this for greater flexibility in modeling).

94:

95: The next notion we need is that of a {\em tangent space} to $M$ at

96: point $p$, denoted by $T_pM$. Let $I \subset \R$ be an open interval

97: such that $0 \in I$. Consider the set of all curves

98: $\map{\xi}{I}{M}$ such that $\xi(0)=p$. Then for any chart $(U_l\ni p,\varphi_l)$ we have a

99: function $\map{\xi_l \deq \varphi_l \circ \xi}{I}{\R^k}$, such that

100: $\xi_l(t) \in \varphi_l(U_l)$ for all $t$ in a sufficiently small

101: neighborhood of $0$. We say that two such curves $\xi,\xi'$ are

102: equivalent iff $d\xi_{l,j}(t)/dt\big|_{t = 0} =

103: d\xi'_{l,j}(t)/dt\big|_{t=0}$, $j=1,\ldots,k$, for all $l \in \Lambda$ such that $U_l \ni p$, where $\xi_{l,j}(t)$

104: are the components of $\xi_l(t)$. The resulting set of equivalence classes

105: has the structure of a vector space of dimension $k$, and is precisely

106: the tangent space $T_pM$. Intuitively, $T_pM$ allows us to ``linearize'' $M$ around $p$. Note that, although all the

107: tangent spaces $T_pM,p\in M$ are isomorphic to each other and to

108: $\R^k$, there is no meaningful way to add elements of $T_pM$ and

109: $T_qM$ with $p,q$ distinct.

110:

111: Next, we specify the class of stochastic embeddings dealt with in this

112: paper. Consider three random variables $L,Y,X$, where $L$ takes

113: values in the finite set $\Lambda$ with $w_l \deq \Pr(L=l)$, $Y$

114: takes values in $\R^k$, and $X$ takes values in $\R^n$. Conditional

115: distributions of $Y$ given $L$ and of $X$ given $Y,L$ are assumed to be

116: absolutely continuous and described by densities $f_{Y|L}$ and

117: $f_{X|YL}$, respectively. Since for a compact $M$ the images

118: $\varphi_l(U_l)$ of charts in $\cA$ are open balls of radii $r_l$, let

119: us suppose that the conditional mean $m_l(Y) \equiv \E[Y|L=l]$ is the center

120: of $\varphi_l(U_l)$ [we can therefore take $m_l(Y)=0$ for all $l \in

121: \Lambda$] and that the largest eigenvalue of the conditional covariance matrix $K_l(Y)

122: \equiv \E\big[YY^t\big| L=l\big]$ of $Y$ given $L=l$ is equal to $r^2_l$. It is convenient to think of the eigenvectors

123: $e^{(l)}_1,\ldots,e^{(l)}_k$ of $K_l(Y)$ as giving a basis of the

124: tangent space $T_{\varphi^{-1}_l(0)}M$. The unconditional density

125: $f_X$ of $X$ is the finite mixture

126: $f_X(x) = \sum_{l \in \Lambda}w_lf_l(x)$, where $f_l(x) \deq

127: \int_{\R^k}f_{X|YL}(x|y,l)f_{Y|L}(y|l)dy$. The resulting pdf follows the

128: local structure of the manifold $M$ and accounts both for

129: low- and high-dimensional noise.

130:

131: As an example \cite{RowSauHin02}, let all $f_{Y|L}(y|l)$ be $k$-dimensional zero-mean Gaussians with unit covariance matrices,

132: $f_{Y|L}(y|l) = \cN(y;0,I) \equiv (2\pi)^{-k/2}\exp(-\frac{1}{2}y^ty)$, and

133: $f_{X|YL}(x|y,l) = \cN(x;\mu_l+A_ly,\Sigma_l)$, $\forall l \in

134: \Lambda$, for some means $\mu_l \in \R^n$, covariance matrices $\Sigma_l$, and

135: $n\times k$ matrices $A_l$, so that $f_X(x) = \sum_{l\in\Lambda}w_l\cN(x;\mu_l,A_lA^t_l + \Sigma_l)$.
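
To see where the component covariance comes from, one can write the $l$th component as a linear-Gaussian model. This is a short verification under the Gaussian assumptions just stated; the noise vector $W$ is an auxiliary symbol introduced only for this check. Conditionally on $L=l$, let $X = \mu_l + A_lY + W$, where $Y$ has density $\cN(y;0,I)$ and $W$ has density $\cN(w;0,\Sigma_l)$ and is independent of $Y$. Then $X$ is Gaussian given $L=l$, with
$$
\E[X|L=l] = \mu_l, \qquad \E\big[(X-\mu_l)(X-\mu_l)^t\big|L=l\big] = A_l\E[YY^t]A^t_l + \E[WW^t] = A_lA^t_l + \Sigma_l,
$$
since the cross terms vanish by independence. This gives $f_l(x) = \cN(x;\mu_l,A_lA^t_l+\Sigma_l)$.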

136:

137: \section{Complexity-regularized mixture models}

138: \label{sec:regularize}

139:

140: Consider a random vector $X \in \R^n$ with an

141: absolutely continuous distribution $P_f$, described by a pdf $f$. We wish to find a

142: mixture model that would not only yield a good

143: ``local'' approximation to $f$, but also have low complexity, where

144: the precise notion of complexity depends on application.

145:

146: In order to set this up quantitatively, we use a complexity-regularized adaptation of the

147: quantizer mismatch approach of Gray and Linder \cite{GraLin03}. We

148: seek a finite

149: collection $\Gamma = \{g_m : m \in \cM \}$ of pdf's from a class $\cG$

150: of ``admissible'' models and a measurable

151: partition $\cR = \{R_m : m \in \cM\}$ of $\R^n$ that would minimize the

152: objective function%\vspace{-10pt}

153: \begin{equation}

154: \bar{I}_f(\cR,\Gamma) \deq \sum_{m \in \cM}P_f(R_m)\big[ D(f_m\|g_m) +

155:   \mu \Phi_\Gamma(g_m)\big],

156: \label{eq:ibar}%\vspace{-10pt}

157: \end{equation}

158: where $f_m$ is the pdf defined as $1_{\{x\in R_m\}}f(x)/P_f(R_m)$, $D(\cdot\|\cdot)$ is the relative

159: entropy, $\Phi_\Gamma(g_m)$ is a regularization functional that

160: quantifies the complexity of the $m$th model pdf relative to the entire

161: collection $\Gamma$, and $\mu \ge 0$ is the parameter that controls

162: the trade-off between the relative-entropy (mismatch) term and the complexity term.

163:

164: This minimization problem can be posed as a {\em complexity-constrained quantization

165:   problem} with an encoder $\map{\alpha}{\R^n}{\cM}$ corresponding to

166: the partition $\cR = \{R_m\}$ through $\alpha(x) = m$ if $x \in R_m$, a decoder

167: $\map{\beta}{\cM}{\cG}$ defined by $\beta(m) = g_m$, and a length

168: function $\map{\ell}{\cM}{\{0,1,2,\ldots\}}$ satisfying the Kraft

169: inequality $\sum_{m \in \cM}e^{-\ell(m)} \le 1$. In order to

170: describe the encoder and to quantify the performance of the

171: quantization scheme, we need to choose a distortion measure between an input

172: vector and an encoder output in

173: such a way that minimizing average distortion would yield the

174: $\bar{I}$-functional (\ref{eq:ibar}) of the corresponding partition

175: and codebook.

176:

177: Consider the

178: distortion $\rho(x,m) \deq \ln \big(f(x)/g_m(x)\big) + \ell(m) + \mu

179: \Phi_\Gamma(g_m)$ (this is not a distortion measure in the

180: strict sense since it can be negative, but its expectation with

181: respect to $f$ is nonnegative by the divergence inequality). For a given codebook $\Gamma$ and length function $\ell$, the optimal encoder is the

182: minimum-distortion encoder $\alpha(x) = \argmin_{m \in

183:   \cM}\rho(x,m)$ with ties broken arbitrarily. The resulting partition $\cR =

184: \{R_m\}$ yields the average distortion

185: \begin{eqnarray*}

186: && \E_f \rho\big(X,\alpha(X)\big) = \sum_{m \in \cM}p_m\Big[\ell(m)+\mu

187:   \Phi_\Gamma(g_m) \\

188: && \qquad \qquad + \int_{R_m}f_m(x)\ln\frac{p_mf_m(x)}{g_m(x)}dx\Big],

189: \end{eqnarray*}

190: where $p_m \deq P_f(R_m)$. Then

191: \begin{eqnarray*}

192: && \E_f \rho\big(X,\alpha(X)\big)  = \sum_{m\in\cM}p_m\Big[D(f_m\|g_m)\\

193: && \qquad \qquad +\ln \frac{p_m}{e^{-\ell(m)}} + \mu \Phi_\Gamma(g_m)\Big]\\

194: && \qquad \ge  \sum_{m \in \cM}p_m\big[D(f_m\|g_m) + \mu\Phi_\Gamma(g_m)\big],

195: \end{eqnarray*}

196: with equality if and only if $\ell(m) =

197: -\ln p_m$. Thus, the optimal decoder and length function

198: for a given partition are such that the average $\rho$-distortion is precisely

199: the $\bar{I}$-functional. We can therefore iterate the

200: optimality properties of the encoder, decoder and length function in a

201: Lloyd-type descent algorithm; this can only decrease average distortion and thus

202: the $\bar{I}$-functional. Note that the $\ln f(x)$ term in $\rho(x,m)$ does not affect the

203: minimum-distortion encoder. Thus, as far as the encoder is concerned, the distortion measure $\rho_0(x,m) \deq -\ln g_m(x)

204: + \ell(m) + \mu \Phi_\Gamma(g_m)$ is equivalent to $\rho$.
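
The inequality above rests on the Kraft inequality, and the following short derivation makes that step explicit. The normalizing constant $c$ and the probability vector $q$ are auxiliary symbols introduced here. Let $c \deq \sum_{m \in \cM}e^{-\ell(m)} \le 1$ and $q_m \deq e^{-\ell(m)}/c$. Then
$$
\sum_{m\in\cM}p_m\ln\frac{p_m}{e^{-\ell(m)}} = \sum_{m \in \cM}p_m\ln\frac{p_m}{q_m} - \ln c = D(p\|q) + \ln\frac{1}{c} \ge 0,
$$
because both terms on the right are nonnegative. Equality requires $D(p\|q) = 0$ and $c = 1$, i.e., $e^{-\ell(m)} = p_m$ for all $m \in \cM$.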

205:

206: When the distribution of $X$ is unknown, we can take a sufficiently

207: large training sample $X^N =

208: (X_1,\ldots,X_N)$ and use a Lloyd

209: descent algorithm to empirically design a mixture model for the data:\smallskip

210:

211: \noindent 1) {\bf Initialization:} begin with an initial codebook $\Gamma =

212:   \{g^{(0)}_m : m\in \cM\} \subset \cG$, where $\cG$ is the class of

213:   admissible models, and a length function

214:   $\map{\ell^{(0)}}{\cM}{\{0,1,2,\ldots\}}$. Set iteration number $r =

215:   1$, pick a convergence threshold $\epsilon$,

216:   and let $D_0$ be the average $\rho_0$-distortion of the initial

217:   codebook.

218:

219: \noindent 2) {\bf Minimum-distortion encoder:} encode each sample $X_i$

220:   into the index $\alpha^{(r)}(X_i) = \argmin_{m \in

221:     \cM}\rho_0(X_i,g^{(r-1)}_m)$.

222:

223: \noindent 3) {\bf Centroid decoder:} update the codebook by

224:   minimizing over all $g \in \cG$ the empirical conditional expectation

225: $$

226: \E \big[ \rho_0(X,g) \big| \alpha^{(r)}(X)=m\big] \equiv \frac{1}{N^{(r)}_m} \sum_{i: \alpha^{(r)}(X_i) = m} \rho_0(X_i,g),

227: $$

228: where $N^{(r)}_m \deq \abs{\{ i : \alpha^{(r)}(X_i) = m\}}$, i.e., set $\beta^{(r)}(m) = g^{(r)}_m = \argmin_{g \in \cG}\E

229: \big[\rho_0(X,g) \big| \alpha^{(r)}(X)=m\big]$.

230:

231: \noindent 4) {\bf Optimal length function:} if $N^{(r)}_m > 0$, let $\ell^{(r)}(m) = - \ln

232:   p^{(r)}_m$, where $p^{(r)}_m = N^{(r)}_m/N$. If $N^{(r)}_m = 0$, remove

233:   the corresponding cell from the code and decrease

234:   $\abs{\cM}$ by 1.

235:

236: \noindent 5) {\bf Test:} compute the average $\rho_0$-distortion $D_r$ with

237:   the code $(\alpha^{(r)},\beta^{(r)},\ell^{(r)})$. If

238:   $(D_{r-1}-D_r)/D_{r-1} < \epsilon$, quit. Otherwise, go to Step 2 and

239:   continue.\smallskip

240:

241: With a judicious choice of the initial codebook and length function,

242: this algorithm yields a finite mixture model $\{(g_m,p_m) :

243: m \in \cM\}$ as a good ``fit'' to the empirical distribution

244: of the data in the sense of near-optimal trade-off between the local

245: mismatch and complexity.
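
As an illustration of Step~3, consider the simplest special case, assumed only for this example: the regularization parameter is set to $\mu = 0$ (no complexity penalty) and $\cG$ is the class of Gaussians $\cN(x;\mu_g,K_g)$ with nonsingular covariance matrices. The length term does not depend on $g$, so the centroid minimizes the empirical average of $-\ln g(X_i)$ over the cell, and the minimizer is the familiar maximum-likelihood pair
$$
\mu_{g^{(r)}_m} = \frac{1}{N^{(r)}_m}\sum_{i : \alpha^{(r)}(X_i)=m}X_i, \qquad K_{g^{(r)}_m} = \frac{1}{N^{(r)}_m}\sum_{i : \alpha^{(r)}(X_i)=m}\big(X_i-\mu_{g^{(r)}_m}\big)\big(X_i-\mu_{g^{(r)}_m}\big)^t,
$$
provided the sample covariance is nonsingular. When the complexity penalty is active, it couples the components of the codebook, and no such closed form should be expected in general.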

246:

247: \section{Application to NLDR}

248: \label{sec:stochembed}

249:

250: Given a training sample $X^N = (X_1,\ldots,X_N)$ of ``raw''

251: $n$-dimensional data and assuming its intrinsic dimension $k < n$ is

252: known, our goal is to

253: determine two mappings, $\map{v}{\R^n}{\R^k}$ and $\map{w}{\R^k}{\R^n}$,

254: where $v$ maps high-dimensional vectors to their dimension-reduced

255: versions and $w$ maps back to the high-dimensional space. In general,

256: the dimension-reducing map $v$ entails loss of information, so $w(v(x)) \neq x$. Therefore we will be interested

257: in the average distortion incurred by our scheme, $\bar{d}(v,w) \deq

258: \E[d(X,w(v(X)))]$, where $\map{d}{\R^n\times\R^n}{[0,\infty)}$ is a suitable distortion

259: measure on pairs of $n$-vectors, e.g., the squared Euclidean distance,

260: and the expectation is w.r.t. the empirical distribution of the

261: sample.

262:

263: \subsection{Mixture model of a stochastic embedding}

264: \label{ssec:mmstochemb}

265:

266: The first step is to use the above quantization scheme to fit a complexity-regularized

267: Gaussian mixture model to the training sample. Our class

268: $\cG$ of admissible model pdf's will be the set of all $n$-dimensional

269: Gaussians with nonsingular covariance matrices, $\cG = \{ \cN(x;\mu,K)

270: : \mu \in \R^n, \det K > 0 \}$, and for each finite set $\Gamma

271: \subset \cG$ we shall define a regularization functional

272: $\map{\Phi_\Gamma}{\Gamma}{[0,\infty)}$ that penalizes those $g \in

273:   \Gamma$ that are ``geometrically complex'' relative to the rest of

274:   $\Gamma$.

275:

276: The idea of ``geometric complexity'' can be motivated

277: \cite{RowSauHin02,Bra03} by the example of the Gaussian

278: mixture model from Sect.~\ref{sec:manifolds}. The covariance matrix of the

279: $l$th component, $A_lA^t_l + \Sigma_l$, is invariant under

280: the mapping $A_l \mapsto A_lR$, where $R$ is a $k \times k$ orthogonal

281: matrix, i.e., $RR^t = I$. In geometric terms, a copy of the orthogonal

282: group $O_k$ associated with the

283: $l$th component of the mixture is the

284: group of rotations and reflections in the tangent space to $M$ at

285: $\varphi^{-1}_l(0)$. Thus, the log-likelihood term in $\rho_0$ is not affected by assigning arbitrary and

286: independent orientations to the tangent

287: spaces associated with the components of the mixture. However, since

288: our goal is to model the intrinsic {\em global}

289: geometry of the data, it should be possible to smoothly glue together the local data provided by our model. We therefore

290: require that the orientations of the tangent spaces at

291: ``nearby'' points change smoothly as well. (In fact, one has to impose

292: certain continuity requirements on the orientation of the tangent

293: spaces in order to define measure and integration on the manifold

294: \cite[Ch.~XI]{Lan95}.)

295:

296: Given a finite set $\Gamma \subset \cG$, we shall define the

297: regularization functional $\map{\Phi_\Gamma}{\Gamma}{[0,\infty)}$ as

298: \begin{equation}

299: \Phi_\Gamma(g) \deq \sum_{g' \in

300:   \Gamma\backslash\{g\}}\kappa(\mu_g,\mu_{g'})D(g'\|g),

301: \label{eq:gc}

302: \end{equation}

303: where $\map{\kappa}{\R^n\times\R^n}{\R^+}$ is a smooth positive

304: symmetric kernel such that $\kappa(x,x') \to 0$ as $\norm{x-x'} \to

305: \infty$, and

306: \begin{eqnarray*}

307: && D(g'\|g) = \frac{1}{2}\big(\ln\det(K^{-1}_{g'}K_g) +

308: \tr(K^{-1}_gK_{g'}) \\

309: && \qquad \qquad \qquad + (\mu_g-\mu_{g'})^tK^{-1}_g(\mu_g-\mu_{g'})-n\big)

310: \end{eqnarray*}

311: is the relative entropy between two Gaussians. Possible choices for

312: the kernel $\kappa$ are the inverse Euclidean distance $\kappa(x,x') =

313: \norm{x-x'}^{-1}$ \cite{Bra03a}, a Gaussian kernel $\kappa(x,x') =

314: \cN(x-x';0,\sigma^2I)$ for a suitable value of $\sigma$

315: \cite{Bra03,Bra03a} or a compactly supported ``bump'' $\kappa(x,x') =

316: \psi_{r_1,r_2}(x-x')$, where $\psi_{r_1,r_2}$ is an infinitely

317: differentiable reflection-symmetric function

318: that is identically zero everywhere outside a closed ball of radius

319: $r_2$ and one everywhere inside an open ball of radius $r_1 <

320: r_2$. The relative entropy serves as a measure of position and

321: orientation alignment of the tangent spaces, while the smoothing

322: kernel ensures that more weight is assigned to ``nearby''

323: components. This complexity functional is a generalization of the

324: ``global coordination'' prior of Brand \cite{Bra03} to mixtures

325: with unequal component weights.
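
A small worked example, not taken from the cited references, may help to see how $D(g'\|g)$ reacts to orientation. Take $n=2$, equal means $\mu_g = \mu_{g'}$, and covariance matrices $K_g = \mathrm{diag}(a,b)$, $K_{g'} = \mathrm{diag}(b,a)$ with $a,b > 0$, so that $g'$ is $g$ rotated by $90^\circ$. Then $\det(K^{-1}_{g'}K_g) = 1$ and $\tr(K^{-1}_gK_{g'}) = b/a + a/b$, hence
$$
D(g'\|g) = \frac{1}{2}\Big(\frac{a}{b}+\frac{b}{a}-2\Big) = \frac{(a-b)^2}{2ab},
$$
which vanishes only in the isotropic case $a = b$ and grows as the components become more elongated. Likewise, for equal covariance matrices $K_g = K_{g'} = K$ the formula reduces to $\frac{1}{2}(\mu_g-\mu_{g'})^tK^{-1}(\mu_g-\mu_{g'})$, a pure position penalty.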

326:

327: With these definitions of $\cG$ and $\Phi_\Gamma$, the

328: $\rho_0$-distortion for a codebook $\Gamma = \{g_m : m \in \cM\}$ and a

329: length function $\ell$ is

330: \begin{eqnarray*}

331: && \rho_0(x,m) = \frac{1}{2}\ln\det K_m +

332: \frac{1}{2}(x-\mu_m)^tK^{-1}_m(x-\mu_m) \\

333: && \quad \quad \quad + \ell(m) + \sum_{m' \in \cM\backslash\{m\}}

334: \kappa(\mu_m,\mu_{m'}) D(g_{m'}\|g_m),

335: \end{eqnarray*}

336: where we have also removed the $(n/2)\ln(2\pi)$ term as it does not

337: affect the encoder. The effect of the geometric complexity term is to

338: curve the boundaries of the partition cells according to

339: locally interpolated ``nonlocal information'' about the rest of the

340: codebook. Determining the Lloyd centroids for the decoder

341: will involve solving $\abs{\cM}$ simultaneous nonlinear equations for

342: the means and the same number of equations for the covariance

343: matrices. For computational efficiency we can use the kernel data from

344: the previous iteration, which would sacrifice optimality but avoid

345: nonlinear equations.

346:

347: \subsection{Design of reduction and reconstruction maps}

348: \label{ssec:redrec}

349:

350: The output of the previous step is a Gauss mixture model $\{ (g_m,p_m) : m \in

351: \cM\}$ and a partition $\cR = \{R_m\}$ of $\R^n$. Suppose that for

352: each $m \in \cM$ the eigenvectors

353: $e^{(m)}_1,\ldots,e^{(m)}_n$ of $K_m$ are

354: numbered in the order of decreasing eigenvalues, $\lambda^{(m)}_1 \ge

355: \ldots \ge \lambda^{(m)}_n$. The next step is to design the dimension-reducing map $v$ and the reconstruction

356: map $w$. One method, proposed by Brand \cite{Bra03}, is to use the

357: mixture model of the underlying pdf [obtained in his case by an EM algorithm with a prior

358: corresponding to the average of the complexity $\Phi_\Gamma(g)$ over the

359: entire codebook and with equiprobable components of the mixture] to

360: construct a mixture of local affine transforms, preceded by

361: local Karhunen-\Loeve\ transforms, as a solution to a weighted

362: least-squares problem.

363:

364: However, we can use the encoder partition $\cR$ directly: for each $m

365: \in \cM$, let $v_m(x) \deq

366: \Pi_m(x-\mu_m)$, where $\Pi_m$ is the projection onto the first $k$

367: eigenvectors of $K_m$, and then define $v(x) = \sum_{m \in \cM}1_{\{x\in

368:   R_m\}}v_m(x)$. This approach is similar to local principal component

369: analysis of Kambhatla and Leen \cite{KamLee97}, except that their

370: quantizer was not complexity-regularized and therefore the shape of

371: the resulting Voronoi regions was determined only by local statistical

372: data. We can describe the operation of dimension reduction (feature

373: extraction) as an encoder $\map{\hat{v}}{\R^n}{\cM\times\R^k}$, so that $\hat{v}(x) =

374: (\alpha(x),v_{\alpha(x)}(x))$, where $\alpha$ is the

375: minimum-distortion encoder for the $\rho_0$-distortion.

376:

377: The corresponding reconstruction operation can be

378: designed as a decoder $\map{\hat{w}}{\cM \times \R^k}{\R^n}$ which

379: receives a pair $(m,u)$, $m \in \cM,u \in \R^k$, and computes $w_m(u) =

380: \mu_m + \sum^k_{i=1}\ave{u,e^{(m)}_i}e^{(m)}_i$, where

381: $\ave{\cdot,\cdot}$ denotes the usual scalar product in $\R^k$.

382:

383: This encoder-decoder pair is a composite Karhunen-\Loeve\ transform coder matched to the mixture source $g = \sum_m

384: p_mg_m$. If the data alphabet $\cX$ is compact, then the squared-error

385: distortion is bounded by some $A > 0$,

386: and the mismatch due to using this composite coder on the disjoint

387: mixture

388: source $f = \sum_m p_mf_m$ can be bounded from above by $A\|f-g\|_1$, where

389: $\|\cdot\|_1$ is the $L_1$ norm. Provided that the mixture $g$ is

390: optimal for $f$ in the sense of minimizing the $\rho$-distortion,

391: we can use Pinsker's inequality

392: \cite[Ch.~5]{DevLug01} $\|f-g\|_1 \le \sqrt{2D(f\|g)}$ and convexity of

393: the relative entropy to further bound the mismatch by $A\sqrt{2\big(\bar{I}_f(\cR,\Gamma) - \mu\sum_mp_m\Phi_\Gamma(g_m)\big)}$.
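
The chain of inequalities behind this bound can be written out as follows; this is only a sketch of the argument just described. Since $f = \sum_m p_mf_m$ and $g = \sum_m p_mg_m$ are mixtures with the same weights, joint convexity of the relative entropy gives
$$
D(f\|g) \le \sum_{m\in\cM}p_mD(f_m\|g_m) = \bar{I}_f(\cR,\Gamma) - \mu\sum_{m\in\cM}p_m\Phi_\Gamma(g_m),
$$
where the equality is the definition (\ref{eq:ibar}). Combining this with Pinsker's inequality yields
$$
A\|f-g\|_1 \le A\sqrt{2D(f\|g)} \le A\sqrt{2\Big(\bar{I}_f(\cR,\Gamma)-\mu\sum_{m\in\cM}p_m\Phi_\Gamma(g_m)\Big)}.
$$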

394:

395: Note that the maps

396: $v$ and $w$ are not smooth, unlike the analogous maps of Brand \cite{Bra03,Bra03a}. This

397: is an artifact of the hard partitioning used in our scheme. However,

398: hard partitioning has certain advantages: it allows

399: for use of composite codes \cite{GraLin03} and nonlinear interpolative vector quantization

400: \cite{Ger90} if additional compression of dimension-reduced data is

401: required. Moreover, the lack of smoothness is not a problem in our

402: case because

403: we can use kernel interpolation techniques to model the geometry of

404: dimension-reduced data by a smooth manifold, as explained next.

405:

406: \subsection{Manifold structure of dimension-reduced data}

407: \label{ssec:dimredman}

408:

409: Our use of mixture models has been motivated by

410: certain assumptions about the structure of stochastic embeddings of

411: low-dimensional manifolds into high-dimensional spaces. In

412: particular, given an $n$-dimensional Gaussian mixture model $\{(g_m,p_m) : m \in \cM\}$, we can associate to each component of the

413: mixture a chart of the underlying manifold, such that the image of the

414: chart in $\R^k$ is an open ball of radius $r_m = (\lambda^{(m)}_1)^{1/2}$ centered at the origin, and we can take the

415: first $k$ eigenvectors of the covariance matrix of $g_m$ as coordinate

416: axes in the tangent space to the manifold at

417: the inverse image of $0 \in \R^k$ under the $m$th chart. Owing to

418: geometric complexity regularization, the orientations of tangent

419: spaces change smoothly as a function of position.

420:

421: Ideally, one would like to construct a smooth manifold consistent with the

422: given descriptions of charts and tangent spaces. However, this is a

423: fairly difficult task since we not only have to define a smooth coordinate map

424: $\varphi_m$ for each chart, but also make sure that these maps satisfy

425: the chart compatibility condition. Instead, we can construct the

426: manifold {\em implicitly} by gluing the coordinate frames of the

427: tangent spaces into an object having a smooth inner product.

428:

429: Specifically, let us fix a sufficiently small $\delta > 0$, and let

430: $\psi_m$ be an infinitely differentiable function

431: that is identically zero everywhere outside a closed ball of radius

432: $r_m$ and one everywhere inside an open ball of radius $r_m-\delta$, with

433: both balls centered at $\Pi_m\mu_m$. Let $\eta_m(u) \deq \frac{p_m\psi_m(u)}{\sum_{m' \in

434:     \cM}p_{m'}\psi_{m'}(u)}$. The inner product of two vectors $u,u' \in

435: \R^k$, treated as elements of the tangent space

436: $T_{\varphi^{-1}_m(0)}M$, is given by $\ave{u,u'}_m = \sum^k_{i=1} \ave{u,e^{(m)}_i}\ave{e^{(m)}_i,u'}$. Then for each $y \in \R^k$ the map $\map{g_y}{\R^k \times \R^k}{\R}$,

437: $$

438: g_y(u,u') \deq \sum_{m \in \cM}\eta_m(y+\Pi_m\mu_m)\ave{u,u'}_m,

439: $$

440: is a symmetric form, which is positive definite

441: whenever $\eta_m(y+\Pi_m\mu_m) \neq 0$ for at least one value of

442: $m$. In addition, the map $y \mapsto g_y(\cdot,\cdot)$ is smooth. In

443: this way, we have implicitly defined a {\em Riemannian metric}

444: \cite[Ch.~VII]{Lan95} on the underlying

445: manifold. The functions $\eta_m$ form a so-called {\em smooth

446:   partition of unity}, which is the only known

447: way of gluing together local geometric data to form smooth objects \cite[Ch.~II]{Lan95}.

448:

449: In geometric terms, $\eta_m(y+\Pi_m\mu_m) = 0$ for all $m$ if and only

450: if $y \in \R^k$ is an image under the dimension-reduction map of a

451: point in $\R^n$ whose first $k$ principal components w.r.t. each

452: Gaussian in the mixture model fall outside the covariance ellipsoid of

453: that Gaussian. If the mixture model is close to

454: optimum, this will happen with negligible probability. A practical

455: advantage of this feature of our scheme is in rendering it robust to outliers.

456:

457: \section{Consistency and codebook design}

458: \label{sec:consistency}

459:

460: Our mixture modeling

461: scheme can also be used to estimate the ``true'' but unknown pdf $f^*$ of the

462: high-dimensional data, if we assume that $f^*$ belongs to some fixed

463: class $\cF$. Indeed, the empirically designed codebook

464: $\Gamma = \{g_m : m \in \cM\}$ of Gaussian pdf's, the corresponding

465: component weights $\{p_m\}$, and the mixture $g = \sum_{m \in

466:   \cM}p_mg_m$ are random variables since they depend on the training sample $X^N$. We are interested in the quality of

467: approximation of $f^*$ by the mixture $g \equiv g(X^N)$.

468:

469: Following Moulin and Liu \cite{MouLiu00}, we use the relative-entropy loss function

470: $D(f^*\|g)$. We shall give an upper bound on the loss in terms of the {\em index of

471:   resolvability} \cite{MouLiu00}

472: $$

473: R_{\mu,N}(f^*) \deq \Min_{m \in \cM}\left[D(f^*\|g_m)+\frac{\mu

474:     L(g_m)}{N}\right],

475: $$

476: where $L(g_m) \deq \Phi_\Gamma(g_m) - \ln

477: p_m$, which quantifies how well $f^*$ can be

478: approximated, in the relative-entropy sense (and, by Pinsker's

479: inequality, in $L_1$ sense),

480: by a Gaussian of moderate geometric complexity relative to the rest of

481: the codebook. We have the following result:

482:

483: \begin{theorem} Let the codebook $\Gamma = \{g_m : m \in \cM\}$ of

484:   Gaussian pdf's be such that the log-likelihood ratios $U_m \deq -\ln

485:   \big(f^*(X)/g_m(X)\big)$ uniformly satisfy the {\em Bernstein moment condition} \cite{DevLug01}, i.e.,

486:   there exists some $h > 0$ such that $\E\abs{U_m-\E U_m}^k \le

487:   (1/2)\var(U_m)k!h^{k-2}$ for all $k \ge 2$. Let $M(f^*)$ be the

488:   smallest number such that $\var(U_m) \le -M(f^*)\E U_m$ for all $m

489:   \in \cM$

490: (owing to the Bernstein condition, it is nonnegative and

491:   finite). Then, for any $\mu > h + M(f^*)/2$ and $\delta > 0$,

492: \begin{equation}

493: \Pr\left\{D(f^*\|g) \le \frac{1+\alpha}{1-\alpha}R_{\mu,N}(f^*) + \frac{2\mu

494:   \ln\frac{\abs{\cM}}{\delta}}{(1-\alpha)N}\right\} \ge 1-2\delta,

495: \label{eq:lossbound1}

496: \end{equation}

497: where $\alpha = \frac{M(f^*)}{2(\mu - h)}$. The expected loss satisfies

498: \begin{equation}

499: \E[D(f^*\|g)] \le \frac{1+\alpha}{1-\alpha}R_{\mu,N}(f^*) +

500: \frac{4\abs{\cM}\mu}{(1-\alpha)N}.

501: \label{eq:lossbound2}

502: \end{equation}

503: The probabilities and expectations are all w.r.t. the pdf $f^*$.\end{theorem}

504:

505: \begin{proof} Due to the fact that $\Phi_\Gamma(g_m) \ge 0$ for all $m

506:   \in \cM$, the composite complexity $L(g_m)$ satisfies the Kraft inequality. Then we

507: can use a strategy similar to that of Moulin and Liu \cite{MouLiu00} to prove that

508: $$

509: \Pr \left\{D(f^*\|g_m) \ge \frac{1+\alpha}{1-\alpha}R_{\mu,N}(f^*) +

510: \frac{2\mu\ln\frac{\abs{\cM}}{\delta}}{(1-\alpha)N}\right\} \le

511: \frac{2\delta}{\abs{\cM}}

512: $$

513: for each $m \in \cM$. Hence, by the union bound

514: $$

515: D(f^*\|g_m) \le \frac{1+\alpha}{1-\alpha}R_{\mu,N}(f^*) +

516: \frac{2\mu\ln\frac{\abs{\cM}}{\delta}}{(1-\alpha)N}

517: $$

518: for all $m \in \cM$, except for an event of probability at most

519: $2\delta$. By convexity of the relative entropy,

520: $D(f^*\|g_m) \le C$ for all $m \in \cM$ implies that $D(f^*\|g) \le C$

521: for $g = \sum_{m \in \cM}p_mg_m$. Therefore

522: $$

523: D(f^*\|g) \le \frac{1+\alpha}{1-\alpha}R_{\mu,N}(f^*) +

524: \frac{2\mu \ln \frac{\abs{\cM}}{\delta}}{(1-\alpha)N}

525: $$

526: with probability at least $1-2\delta$. To prove (\ref{eq:lossbound2}), we use the fact \cite{DevLug01} that

527: if $Z$ is a random variable with $\E\abs{Z} < \infty$, then $\E[Z] \le

528: \int^\infty_0 \Pr[Z\ge t]dt$. We let $Z =

529: D(f^*\|g)-\frac{1+\alpha}{1-\alpha}R_{\mu,N}(f^*)$ and choose $\delta

530: = \abs{\cM}e^{-\frac{Nt(1-\alpha)}{2\mu}}$. Then $\E[Z] \le \frac{4\abs{\cM}\mu}{(1-\alpha)N}$, which proves (\ref{eq:lossbound2}).

531: \end{proof}
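
For completeness, the integral behind the constant in (\ref{eq:lossbound2}) can be evaluated explicitly; this is a routine check of the last step of the proof. With $\delta = \abs{\cM}e^{-Nt(1-\alpha)/(2\mu)}$, the bound (\ref{eq:lossbound1}) implies the tail estimate $\Pr[Z > t] \le 2\abs{\cM}e^{-Nt(1-\alpha)/(2\mu)}$, and integrating the tail gives
$$
\E[Z] \le \int^\infty_0 2\abs{\cM}e^{-\frac{Nt(1-\alpha)}{2\mu}}dt = 2\abs{\cM}\cdot\frac{2\mu}{(1-\alpha)N} = \frac{4\abs{\cM}\mu}{(1-\alpha)N}.
$$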

532:

533: To discuss consistency in the large-sample limit, consider a

534: sequence of empirically designed mixture models

535: $\{(g^{(N)}_m,p^{(N)}_m) : m \in \cM^{(N)}\}$. This is

536: different from the usual empirical quantizer design, where we

537: increase the training set size but keep the number of quantizer

538: levels fixed. The scheme is consistent

539: in the relative-entropy sense if $\E D(f^*\|g^{(N)}) \to 0$ as $N \to

540: \infty$, where $g^{(N)} = \sum_{m\in\cM^{(N)}}p^{(N)}_mg^{(N)}_m$ and the

541: expectation is with respect to $f^*$.

A sufficient condition for consistency can be determined by inspection
of the upper bound in Eq.~(\ref{eq:lossbound2}). Specifically, we
require that the codebooks $\Gamma^{(N)}$ satisfy: (a) $\max_{m \in
  \cM^{(N)}}L(g^{(N)}_m) = o(N)$, (b) $\min_{m \in
  \cM^{(N)}}D(f^*\|g^{(N)}_m) = o(1)$ for all $f^* \in \cF$, and (c)
$\abs{\cM^{(N)}} = o(N)$. Condition (c) can be satisfied by
initializing the Lloyd algorithm with a codebook whose size is much smaller than the training set size $N$, which is usually done in practice in order
to ensure good training performance. The first two conditions can also
be easily met in many practical settings.
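
To see why conditions of this type suffice, the following sketch may help. It assumes, purely for illustration, that $R_{\mu,N}(f^*)$ is a resolvability-type quantity satisfying $R_{\mu,N}(f^*) \le \min_{m}\big[D(f^*\|g^{(N)}_m) + \frac{\mu}{N}L(g^{(N)}_m)\big]$; this form is an assumption of the sketch, not a restatement of the definition given with the theorem. Combined with the expected-loss bound $\E[Z] \le \frac{4\abs{\cM}\mu}{(1-\alpha)N}$ derived in the proof, it would give, for fixed $\alpha$ and $\mu$,
\begin{eqnarray*}
\E D(f^*\|g^{(N)}) &\le& \frac{1+\alpha}{1-\alpha}\Big[\min_{m}
D(f^*\|g^{(N)}_m) + \frac{\mu}{N}\max_{m}L(g^{(N)}_m)\Big] \\
&& {} + \frac{4\mu\abs{\cM^{(N)}}}{(1-\alpha)N},
\end{eqnarray*}
where $m$ ranges over $\cM^{(N)}$. Under this assumption the three terms on the right-hand side vanish as $N \to \infty$ by conditions (b), (a), and (c), respectively.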

Consider, for instance, the class $\cF$ of all pdf's supported on a
compact $\cX \subset \R^n$ and Lipschitz-continuous with Lipschitz constant $c$. Then, if we take as our class of
admissible Gaussians $\cG = \{\cN(x;\mu,K) : \mu \in \cX, c_1 \le \det K
\le c_2\}$ for suitably chosen constants $c_1,c_2 > 0$
independent of $N$, the relative entropy $D(g\|g')$ of any two $g,g'
\in \cG$ can be bounded independently of $N$, and
condition (a) will be met with proper choice of the component
weights. Condition (b) is likewise easy to meet since the maximum
value of any $f^* \in \cF$ depends only on the set $\cX$, the
Lipschitz constant $c$, and the dimension $n$.
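
For reference, the relative entropy between two nondegenerate Gaussians $g = \cN(x;\mu,K)$ and $g' = \cN(x;\mu',K')$ on $\R^n$ has the standard closed form (in nats; the superscript ${\rm T}$ denoting the transpose is notation introduced here)
\begin{eqnarray*}
D(g\|g') &=& \frac{1}{2}\Big[\ln\frac{\det K'}{\det K} + {\rm
tr}\big(K'^{-1}K\big) - n \\
&& {} + (\mu-\mu')^{\rm T}K'^{-1}(\mu-\mu')\Big].
\end{eqnarray*}
For $g,g' \in \cG$ the logarithmic term is at most $\ln(c_2/c_1)$ in absolute value. The trace and quadratic terms are controlled by the compactness of $\cX$ together with bounds on the eigenvalues of the covariance matrices; the latter is an additional assumption of this sketch and is not stated explicitly in the definition of $\cG$.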

In general, the issue of optimal codebook design is closely related to
the problem of universal vector quantization \cite{ChoEffGra96}:
we can consider, e.g., a class $\cF$ of pdf's with disjoint
supports contained in a compact $\cX \subset \R^n$. Then a sequence of
Gaussian codebooks that yields a consistent estimate of each $f^* \in \cF$
in the large-sample limit is weakly minimax universal
\cite{ChoEffGra96} for
$\cF$ and can also be used to quantize any source contained in the
$L_1$-closed convex hull of $\cF$.

\section{Discussion}
\label{sec:discuss}

We have introduced a complexity-regularized quantization approach to
NLDR. One advantage of this scheme over existing methods for NLDR
based on Gaussian mixtures, e.g., \cite{Bra03}, is that, instead of
fitting a Gaussian mixture to the entire sample, we design a codebook
of Gaussians that provides a good trade-off between local adaptation to
the data and global geometric coherence, which is key to robust
geometric modeling. Complexity regularization is based on a kernel
smoothing technique that allows for a meaningful geometric
description of dimension-reduced data by means of a Riemannian metric
and is also robust to outliers. Moreover, to our knowledge, the
consistency proof presented here is the first theoretical asymptotic
consistency result applied to NLDR.

Work is currently underway to implement the proposed scheme for applications to image processing and computer vision. Also
planned is future work on a quantization-based approach to
estimating the intrinsic dimension of the data and on assessing asymptotic
{\em geometric} consistency of our scheme in terms of the
Gromov--Hausdorff distance between compact metric spaces
\cite{Pet90}.\smallskip

\noindent{\bf Acknowledgment.} I would like to thank Svetlana Lazebnik and Prof.~Pierre Moulin for useful discussions. This research has
been supported by the Beckman Postdoctoral Fellowship.

\begin{thebibliography}{15}

\bibitem{TenSilLan00}
J.~Tenenbaum, V.~de~Silva, and J.~Langford, ``A global geometric framework for
  nonlinear dimensionality reduction,'' \emph{Science}, vol. 290, pp.
  2319--2323, December 2000.

\bibitem{RowSau00}
S.~Roweis and L.~Saul, ``Nonlinear dimensionality reduction by locally linear
  embedding,'' \emph{Science}, vol. 290, pp. 2323--2326, December 2000.

\bibitem{LevBic05}
E.~Levina and P.~Bickel, ``Maximum likelihood estimation of intrinsic
  dimension,'' in \emph{Adv. Neural Inform. Processing Systems}, L.~Saul,
  Y.~Weiss, and L.~Bottou, Eds., vol.~17.\hskip 1em plus 0.5em minus
  0.4em\relax Cambridge, MA: MIT Press, 2005.

\bibitem{Lan95}
S.~Lang, \emph{Differential and Riemannian Manifolds}.\hskip 1em plus 0.5em
  minus 0.4em\relax New York: Springer-Verlag, 1995.

\bibitem{RowSauHin02}
S.~Roweis, L.~Saul, and G.~Hinton, ``Global coordination of locally linear
  models,'' in \emph{Adv. Neural Inform. Processing Systems}, T.~Dietterich,
  S.~Becker, and Z.~Ghahramani, Eds., vol.~14.\hskip 1em plus 0.5em minus
  0.4em\relax Cambridge, MA: MIT Press, 2002, pp. 889--896.

\bibitem{GraLin03}
R.~Gray and T.~Linder, ``Mismatch in high-rate entropy-constrained vector
  quantization,'' \emph{IEEE Trans. Inform. Theory}, vol.~49, no.~5, pp.
  1204--1217, May 2003.

\bibitem{Bra03}
M.~Brand, ``Charting a manifold,'' in \emph{Adv. Neural Inform. Processing
  Systems}, S.~Becker, S.~Thrun, and K.~Obermayer, Eds., vol.~15.\hskip 1em
  plus 0.5em minus 0.4em\relax Cambridge, MA: MIT Press, 2003, pp. 977--984.

\bibitem{Bra03a}
------, ``Continuous nonlinear dimensionality reduction by kernel eigenmaps,''
  in \emph{Int. Joint Conf. Artif. Intel.}, 2003.

\bibitem{KamLee97}
N.~Kambhatla and T.~Leen, ``Dimension reduction by local principal component
  analysis,'' \emph{Neural Comput.}, vol.~9, pp. 1493--1516, 1997.

\bibitem{DevLug01}
L.~Devroye and G.~Lugosi, \emph{Combinatorial Methods in Density
  Estimation}.\hskip 1em plus 0.5em minus 0.4em\relax New York:
  Springer-Verlag, 2001.

\bibitem{Ger90}
A.~Gersho, ``Optimal nonlinear interpolative vector quantization,'' \emph{IEEE
  Trans. Commun.}, vol.~38, no.~9, pp. 1285--1287, September 1990.

\bibitem{MouLiu00}
P.~Moulin and J.~Liu, ``Statistical imaging and complexity regularization,''
  \emph{IEEE Trans. Inform. Theory}, vol.~46, no.~5, pp. 1762--1777, August
  2000.

%\bibitem{Dob70}
%R.~Dobrushin, ``Unified methods for optimal quantization of messages,'' in
%  \emph{Problemy Kibernetiki}, A.~A. Lyapunov, Ed.\hskip 1em plus 0.5em minus
%  0.4em\relax Moscow: Nauka, 1970, vol.~22, pp. 107--156, (in Russian).

\bibitem{ChoEffGra96}
P.~Chou, M.~Effros, and R.~Gray, ``A vector quantization approach to universal
  noiseless coding and quantization,'' \emph{IEEE Trans. Inform. Theory},
  vol.~42, no.~4, pp. 1109--1138, July 1996.

\bibitem{Pet90}
P.~Petersen, ``Gromov-{H}ausdorff convergence of metric spaces,'' in
  \emph{Summer Inst. Diff. Geom.}, ser. Proc. Symposia Pure Math.\hskip 1em
  plus 0.5em minus 0.4em\relax Amer. Math. Soc., 1990, pp. 489--505.

\end{thebibliography}

\end{document}