0202:cs0202009/nnsc.tex

1: % Non-negative Sparse Coding

2: % Patrik Hoyer, patrik.hoyer@hut.fi

3: % Feb 2002.

4: %

5:

6: \documentclass[a4paper]{article}

7:

8: % for extended summary

9: %\usepackage[summary]{nnsp2e}

10: % final paper

11: \usepackage{nnsp2e}

12: \usepackage{graphics}

13:

14: \newcommand{\A}{{\bf A}}

15: \newcommand{\K}{{\bf K}}

16: \newcommand{\ai}{{\bf a}_i}

17: \newcommand{\M}{{\bf M}}

18: \newcommand{\x}{{\bf x}}

19: \newcommand{\s}{{\bf s}}

20: \newcommand{\cvec}{{\bf c}}

21: \newcommand{\st}{{\bf s}^t}

22: \newcommand{\X}{{\bf X}}

23: \newcommand{\bS}{{\bf S}}

24: \newcommand{\Aorig}{{\bf A}_{\mbox{\footnotesize orig}}}

25: \newcommand{\Sorig}{{\bf S}_{\mbox{\footnotesize orig}}}

26:

27: \newcommand{\beq}{\begin{equation}}

28: \newcommand{\eeq}{\end{equation}}

29:

30: \newtheorem{theorem}{Theorem}

31: \newtheorem{definition}{Definition}

32:

33: \title{Non-negative Sparse Coding}

34: \author{Patrik O.\ Hoyer\\

35:         Neural Networks Research Centre\\

36:         Helsinki University of Technology\\

37: 	P.O. Box 9800, FIN-02015 HUT, Finland\\

38:         patrik.hoyer@hut.fi}

39:

40: \begin{document}

41: \maketitle

42:

43: \begin{abstract}

44: Non-negative sparse coding is a method for decomposing multivariate

45: data into non-negative sparse components. In this paper we briefly

46: describe the motivation behind this type of data representation

47: and its relation to standard sparse coding and non-negative

48: matrix factorization. We then give a simple yet efficient

49: multiplicative algorithm for finding the optimal values of the

50: hidden components. In addition, we show how the basis vectors can

51: be learned from the observed data. Simulations demonstrate the

52: effectiveness of the proposed method.

53: \end{abstract}

54:

55: \section{Introduction}

56:

57: Linear data representations are widely used in signal processing and

58: data analysis. A traditional method of choice for signal

59: representation is of course Fourier analysis, but wavelet

60: representations are also increasingly being used in a variety of

61: applications. Both of these methods have strong mathematical

62: foundations and fast implementations, but they share the important

63: drawback that they are not adapted to the particular data being

64: analyzed.

65:

66: Data-adaptive representations, on the other hand, are representations

67: that are tailored to the statistics of the data. Such representations

68: are learned directly from the observed data by optimizing some measure

69: that quantifies the desired properties of the representation. This

70: class of methods includes principal component analysis (PCA),

71: independent component analysis (ICA), sparse coding, and non-negative

72: matrix factorization (NMF). Some of these methods have their roots

73: in neural computation, but have since been shown to be widely

74: applicable for signal analysis.

75:

76: In this paper we propose to combine sparse coding and non-negative

77: matrix factorization into \emph{non-negative sparse coding} (NNSC). Again,

78: the motivation comes partly from modeling neural information processing.

79: We believe that, as with previous methods, this technique will be found

80: useful in a more general signal processing framework.

81:

82: \section{Non-negative sparse coding}

83:

84: Assume that we observe data in the form of a large number of

85: i.i.d.\ random vectors $\x_n$, where $n$ is the sample index.

86: Arranging these into the columns of a matrix $\X$, linear

87: decompositions describe this data as

88: $\X \approx \A\bS$. The matrix $\A$ is called the \emph{mixing matrix},

89: and contains as its columns the \emph{basis vectors} (features) of the

90: decomposition. The rows of $\bS$ contain the corresponding

91: \emph{hidden components} that give the contribution of each basis vector

92: in the input vectors. Although some decompositions provide an exact

93: reconstruction of the data (i.e.\ $\X = \A\bS$), the ones that

94: we shall consider here are approximate in nature.

95:

96: In linear sparse coding \cite{Harpur96,Olshausen96b}, the goal is

97: to find a decomposition in which the hidden components are \emph{sparse},

98: meaning that they have probability densities which are highly peaked at zero

99: and have heavy tails. This basically means that any given input vector

100: can be well represented using only a few significantly non-zero hidden

101: coefficients. Combining the goal of small reconstruction error

102: with that of sparseness, one can arrive at the following objective

103: function to be minimized \cite{Harpur96,Olshausen96b}:

104: \begin{equation} \label{eq:sc}

105: C(\A,\bS) = \frac{1}{2}\|\X - \A\bS\|^2 + \lambda\sum_{ij} f(S_{ij}),

106: \end{equation}

107: where the squared matrix norm is simply the summed

108: squared value of the elements, i.e. $\|\X-\A\bS\|^2 =

109: \sum_{ij}[X_{ij}-(\A\bS)_{ij}]^2$.

110: The tradeoff between sparseness and accurate reconstruction is controlled

111: by the parameter $\lambda$, whereas the form of $f$ defines how sparseness

112: is measured. To achieve a sparse code, the form of $f$ must be chosen

113: correctly: A typical choice is $f(s) = |s|$, although often similar

114: functions that exhibit smoother behaviour at zero are chosen for

115: numerical stability.

116:

117: There is one important problem with this objective: As $f$ typically

118: is a strictly increasing function of the absolute value of its argument,

119: the objective can always be decreased by simply scaling up $\A$ and

120: correspondingly scaling down $\bS$. The consequence of this

121: is that optimization of (\ref{eq:sc}) with respect to both $\A$ and $\bS$

122: leads to the elements of

123: $\A$ growing (in absolute value) without bound whereas $\bS$ tends

124: to zero. More importantly, the solution found does not depend

125: on the second term of the objective as it can always be eliminated

126: by this scaling trick. In other words, some constraint on the scales

127: of $\A$ or $\bS$ is needed. Olshausen and Field \cite{Olshausen96b}

128: used an adaptive method to ensure that the hidden components had unit

129: variance (effectively fixing the norm of the rows of $\bS$), whereas

130: Harpur \cite{HarpurPhD} fixed the norms of the columns of $\A$.
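
The scaling argument can be made concrete with a short calculation (a sketch; the scale factor $\alpha$ is notation introduced here for illustration). For any $\alpha > 1$ we have $(\alpha\A)(\alpha^{-1}\bS) = \A\bS$, so the reconstruction error is unchanged and
\[
C(\alpha\A,\alpha^{-1}\bS) = \frac{1}{2}\|\X - \A\bS\|^2 + \lambda\sum_{ij} f(\alpha^{-1}S_{ij}).
\]
With the typical choice $f(s)=|s|$, the second term equals $\alpha^{-1}\lambda\sum_{ij}|S_{ij}|$, which tends to zero as $\alpha\rightarrow\infty$.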

131:

132: With either of the above scale constraints the objective (\ref{eq:sc})

133: is well-behaved and its minimization can produce useful decompositions

134: of many types of data. For example, it was shown in \cite{Olshausen96b}

135: that applying this method to image data yielded features closely

136: resembling simple-cell receptive fields in the mammalian primary

137: visual cortex. The learned decomposition is also similar to

138: wavelet decompositions, implying that it could be useful in applications

139: where wavelets have been successfully applied.

140:

141: In standard sparse coding, described above, the data is described as a

142: combination of elementary features involving both additive and

143: subtractive interactions. The fact that features can `cancel each

144: other out' using subtraction is contrary to the intuitive notion of

145: combining parts to form a whole \cite{LeeDD99}. Thus, Lee and Seung

146: \cite{LeeDD99,LeeDD01} have recently forcefully argued for

147: non-negative representations \cite{Paatero94}. Other arguments for

148: non-negative representations come from biological modeling

149: \cite{Hoyer03CNS,Hoyer02VR,LeeDD99}, where such constraints

150: are related to the non-negativity of neural firing rates.

151: These non-negative representations assume that the input data

152: $\X$, the basis $\A$, and the hidden components $\bS$ are all non-negative.

153:

154: Non-negative matrix factorization\footnote{Note that error measures

155: other than the summed squared error were also considered

156: in \cite{LeeDD99,LeeDD01}.} (NMF) can be performed by the minimization

157: of the following objective function \cite{LeeDD01,Paatero94}:

158: \begin{equation}

159: C(\A,\bS) = \frac{1}{2}\|\X - \A\bS\|^2

160: \end{equation}

161: with the non-negativity constraints

162: $\forall ij: \; A_{ij}\geq 0, \; S_{ij}\geq 0$.

163: This objective requires no constraints on the scales of $\A$ or $\bS$.

164:

165: In \cite{LeeDD99}, the authors showed how non-negative matrix

166: factorization applied to face images yielded features that corresponded

167: to intuitive notions of face parts: lips, nose, eyes, etc. This was

168: contrasted with the holistic representations learned by PCA and

169: vector quantization.

170:

171: We suggest that both the non-negativity constraints and the sparseness

172: goal are important for learning parts-based representations. Thus,

173: we propose to combine these two methods into non-negative sparse coding:

174: \begin{definition}

175: Non-negative sparse coding (NNSC) of a non-negative data matrix $\X$

176: (i.e.\ $\forall ij: \; X_{ij}\geq 0$) is given by the minimization of

177: \begin{equation} \label{eq:nnsc}

178: C(\A,\bS) = \frac{1}{2}\|\X - \A\bS\|^2 + \lambda\sum_{ij} S_{ij}

179: \end{equation}

180: subject to the constraints $\forall ij: \; A_{ij}\geq 0, \; S_{ij}\geq 0$ and

181: $\forall i: \|\ai\| = 1$, where $\ai$ denotes the $i$-th column of $\A$.

182: It is also assumed that the constant $\lambda\geq 0$.

183: \end{definition}

184: Notice that we have here chosen to measure sparseness by a linear

185: activation penalty (i.e. $f(s) = s$).

186: This particular choice is primarily motivated by the fact that this

187: makes the objective function quadratic in $\bS$. This is useful

188: in the development and convergence proof of an efficient algorithm

189: for optimizing the hidden components $\bS$.

190:

191: \section{Estimating the hidden components}

192:

193: We will first consider optimizing $\bS$, for a given basis $\A$. As

194: the objective (\ref{eq:nnsc}) is quadratic with respect to $\bS$, and the

195: set of allowed $\bS$ (i.e. the set where $S_{ij}\geq 0$) is convex,

196: we are guaranteed that no suboptimal local minima exist. The global

197: minimum can be found using, for example, quadratic programming

198: or gradient descent. Gradient descent is quite simple to

199: implement, but convergence can be slow. On the other hand,

200: quadratic programming is much more complicated to implement.

201: To address these concerns, we have developed a multiplicative

202: algorithm based on the one introduced in \cite{LeeDD01} that is

203: extremely simple to implement and nonetheless seems to be quite

204: efficient. This is given by iterating the following update rule:

205:

206: \begin{theorem} \label{theorem:multupdate}

207: The objective (\ref{eq:nnsc}) is nonincreasing under the update rule:

208: \begin{equation} \label{eq:updateS}

209: \bS^{t+1} = \bS^{t} \hspace{1mm}.\hspace{-1mm}* (\A^T\X) \hspace{1mm}./\hspace{1mm} (\A^T\A\bS^{t} + \lambda)

210: \end{equation}

211: where $.*$ and $./$ denote elementwise multiplication and

212: division (respectively), and the addition of the scalar $\lambda$ is done

213: to every element of the matrix $\A^T\A\bS^{t}$.

214: \end{theorem}

215: This is proven in the Appendix. As each element of $\bS$ is updated

216: by simply multiplying with some non-negative factor, it is guaranteed

217: that the elements of $\bS$ stay non-negative under this update rule.

218: As long as the initial values of $\bS$ are all chosen strictly positive,

219: iteration of this update rule is in practice guaranteed to reach the

220: global minimum to any required precision.
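
As a simple illustration of the update rule (\ref{eq:updateS}), consider the special case of a single basis vector ${\bf a}$ with $\|{\bf a}\|=1$ and a single data vector $\x$ (a sketch; the scalars $s$ and $c = {\bf a}^T\x \geq 0$ are notation introduced here). The update then reduces to
\[
s^{t+1} = s^t \, \frac{c}{s^t + \lambda}.
\]
A strictly positive fixed point must satisfy $s + \lambda = c$, i.e.\ $s = c - \lambda$, which exists only when $c > \lambda$. When $c \leq \lambda$, the multiplicative factor is smaller than one for every $s^t > 0$, so the iterates shrink towards zero. In both cases the limit agrees with the minimizer $\max(c-\lambda,0)$ of $\frac{1}{2}\|\x - {\bf a}s\|^2 + \lambda s$ over $s \geq 0$.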

221:

222: \section{Learning the basis}

223:

224: In this section we consider optimizing the objective (\ref{eq:nnsc})

225: with respect to both the basis $\A$ and the hidden components $\bS$, under

226: the stated constraints. First, we consider the optimization of

227: $\A$ only, holding $\bS$ fixed.

228:

229: Minimizing (\ref{eq:nnsc}) with respect to $\A$

230: \emph{under the non-negativity constraint only} could be done exactly

231: as in \cite{LeeDD01}, with a simple multiplicative update rule.

232: However, the constraint of unit-norm columns of $\A$ complicates

233: things. We have not found any similarly efficient update rule that would

234: be guaranteed to decrease the objective while obeying the

235: required constraint. Thus, we here resort to projected gradient

236: descent. Each step is composed of three parts:

237: \begin{enumerate}

238: \item $\A' = \A^t - \mu (\A^t\bS-\X)\bS^T$

239: \item Any negative values in $\A'$ are set to zero

240: \item Rescale each column of $\A'$ to unit norm, and then set $\A^{t+1} = \A'$.

241: \end{enumerate}

242: This combined step consists of a gradient descent step (Step 1) followed by

243: projection onto the closest point satisfying both the non-negativity and

244: the unit-norm constraints (Steps 2 and 3). This projected gradient step

245: is guaranteed to decrease the objective if the stepsize $\mu>0$ is

246: small enough and we are not already at a local minimum. (In this case

247: there is no guarantee of reaching the \emph{global} minimum, due to the

248: non-convex constraints.)
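
For reference, Step 1 uses the gradient of (\ref{eq:nnsc}) with respect to $\A$. Since the sparseness term does not depend on $\A$,
\[
\nabla_{\A} C(\A,\bS) = (\A\bS - \X)\bS^T.
\]
Steps 2 and 3 can be written columnwise as follows (a sketch; the symbol ${\bf a}'_i$ for the $i$-th column of $\A'$ and the elementwise maximum are notation introduced here):
\[
{\bf a}^{t+1}_i = \frac{\max({\bf a}'_i, 0)}{\|\max({\bf a}'_i, 0)\|},
\]
which is well defined only as long as $\max({\bf a}'_i,0)$ is not the zero vector.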

249:

250: In the previous section, we gave an update step for $\bS$, holding $\A$

251: fixed. Above, we showed how to update $\A$, holding $\bS$ fixed. To

252: optimize the objective with respect to both, we can of course

253: take turns updating $\A$ and $\bS$. This yields the following

254: algorithm:\\[2mm]

255: \centerline{

256: \fbox{

257: \begin{minipage}{0.90\textwidth}

258: \vspace{2mm}

259: {\bf Algorithm for NNSC}

260: \begin{enumerate}

261: \item Initialize $\A^0$ and $\bS^0$ to random \emph{strictly positive}

262: matrices of the appropriate dimensions, and rescale each column of $\A^0$

263: to unit norm. Set $t=0$.

264: \item Iterate until convergence:

265: \begin{enumerate}

266: \item $\A' = \A^t - \mu (\A^t\bS^{t}-\X)(\bS^t)^T$

267: \item Any negative values in $\A'$ are set to zero

268: \item Rescale each column of $\A'$ to unit norm, and then set $\A^{t+1} = \A'$.

269: \item $\bS^{t+1} = \bS^{t} \hspace{1mm}.\hspace{-1mm}* ((\A^{t+1})^T\X) \hspace{1mm}./\hspace{1mm} ((\A^{t+1})^T(\A^{t+1})\bS^{t} + \lambda)$

270: \item Increment $t$.

271: \end{enumerate}

272: \end{enumerate}

273: \vspace{2mm}

274: \end{minipage}

275: }

276: }

277:

278: \section{Experiments}

279:

280: To demonstrate how sparseness can be essential for learning a

281: parts-based non-negative representation, we performed a simple

282: simulation where the generating features were known. The interested

283: reader can find the code to perform these experiments (as well

284: as the experiments reported in \cite{Hoyer03CNS}) on the web at:\\

285: \centerline{{\ttfamily http://www.cis.hut.fi/phoyer/code/}}\\

286:

287: In our simulations, the data vectors were $3 \times 3$-pixel images with

288: non-negative pixel values. We manually constructed $10$ original

289: features: the six possible horizontal and vertical bars, and the four

290: possible horizontal and vertical double bars. Each feature was

291: normalized to unit norm, and entered as a column in the matrix

292: $\Aorig$. The features are shown in the leftmost panel of

293: Figure~\ref{fig:experiments}. We then generated random sparse non-negative

294: data $\Sorig$, and obtained the data vectors as $\X = \Aorig\Sorig$.

295: A random sample of $12$ such data vectors is also shown

296: in Figure~\ref{fig:experiments}.

297:

298: We ran NNSC and NMF on this data $\X$. With $10$ hidden components

299: (rows of $\bS$),

300: NNSC can correctly identify all the features in the dataset. This result

301: is shown in Figure~\ref{fig:experiments} under {\bf \sffamily NNSC}.

302: However, NMF cannot find all the features with any hidden dimensionality.

303: With $6$ components, NMF finds all the single bar features. With

304: a dimensionality of $10$, not even all of the single bars are correctly

305: estimated. These results are illustrated in the two rightmost panels

306: of Figure~\ref{fig:experiments}.

307:

308: \begin{figure}

309: \hspace{1.2mm}

310: \large \bf \sffamily{Features}

311: \hspace{7mm}

312: \large \bf \sffamily{Data}

313: \hspace{10mm}

314: \large \bf \sffamily{NNSC}

315: \hspace{7mm}

316: \large \bf \sffamily{NMF (6)}

317: \hspace{2mm}

318: \large \bf \sffamily{NMF (10)} \\[1.3mm]

319: \resizebox{22mm}{!}{

320: \includegraphics{bars-origbasis.eps}}

321: \resizebox{22mm}{!}{

322: \includegraphics{bars-samples.eps}}

323: \resizebox{22mm}{!}{

324: \includegraphics{bars-nnsc-10.eps}}

325: \resizebox{22mm}{!}{

326: \includegraphics{bars-nmf-6.eps}}

327: \resizebox{22mm}{!}{

328: \includegraphics{bars-nmf-10.eps}}

329: \caption{Experiments on bars data. {\mdseries  Features:} The $10$ original

330: features that were used to construct the dataset. {\mdseries Data:} A random

331: sample of 12 data vectors. These constitute superpositions of the

332: original features. {\mdseries NNSC:} Features learned by NNSC, with

333: dimensionality of the hidden representation equal to 10, starting from

334: random initial values. {\mdseries NMF (6):} Features learned by NMF,

335: with dimensionality 6. {\mdseries NMF (10):} Features learned by NMF,

336: with dimensionality 10. See main text for discussion.\vspace{2mm}

337: \label{fig:experiments}}

338: \end{figure}

339:

340: It is not difficult to understand why NMF cannot learn all the

341: features.  The data $\X$ can be perfectly described as an additive

342: combination of the six single bars (because all double bars can be

343: described as two single bars). Thus, NMF essentially achieves the

344: optimum (zero reconstruction error) already with $6$ features, and

345: there is no way in which an \emph{overcomplete} representation could

346: improve that. However, when sparseness is considered as in NNSC, it is

347: clear that it is useful to have double bar features because these

348: allow a sparser description of such data patterns.
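
A short calculation illustrates this point (a sketch; the symbols ${\bf b}_1$, ${\bf b}_2$, ${\bf d}$ and $s$ are notation introduced here, and the bars are assumed to have uniform pixel intensity before normalization). Let ${\bf b}_1$ and ${\bf b}_2$ be two distinct parallel single bars and ${\bf d}$ the double bar that they form, all of unit norm. A single bar then has three nonzero pixels of value $1/\sqrt{3}$ and the double bar has six nonzero pixels of value $1/\sqrt{6}$, so that ${\bf d} = ({\bf b}_1 + {\bf b}_2)/\sqrt{2}$. A data vector $\x = s{\bf d}$ with $s>0$ can therefore be reconstructed exactly in two ways, with sparseness penalties
\[
\lambda s \quad \mbox{(using ${\bf d}$)} \qquad \mbox{and} \qquad
\lambda\left(\frac{s}{\sqrt{2}} + \frac{s}{\sqrt{2}}\right) = \sqrt{2}\,\lambda s \quad \mbox{(using ${\bf b}_1$ and ${\bf b}_2$)}.
\]
For $\lambda > 0$, the double bar feature thus gives the lower value of the objective (\ref{eq:nnsc}) among these two exact reconstructions.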

349:

350: In addition to these simulations, we have performed experiments with

351: natural image data, reported elsewhere \cite{Hoyer03CNS,Hoyer02VR}.

352: These confirm our belief that sparseness is important when learning

353: non-negative representations from data.

354:

355: \section{Relation to other work}

356:

357: In addition to the tight connection to linear sparse coding

358: \cite{Harpur96,Olshausen96b} and non-negative matrix factorization

359: \cite{LeeDD99,LeeDD01,Paatero94}, this method is intimately related

360: to independent component analysis \cite{Hyva01book}. In fact, when

361: the fixed-norm constraint is placed on the rows of $\bS$ instead

362: of the columns of $\A$, the objective (\ref{eq:nnsc}) could be

363: directly interpreted as the negative joint log-posterior of the

364: basis vectors and components, given the data $\X$, in the noisy

365: ICA model \cite{Hoyer02VR}. This connection is valid when the independent

366: components are assumed to have exponential distributions, and of course the

367: basis vectors are assumed to be non-negative as well.

368:

369: Other researchers have also recently considered the constraint of

370: non-negativity in the context of ICA. In particular, Plumbley

371: \cite{Plumbley02SPL} has considered estimation of the noiseless

372: ICA model (with equal dimensionality of components and observations)

373: in the case of non-negative components. On the other hand,

374: Parra et al.\ \cite{Parra00} considered estimation of the ICA

375: model where the basis (but not the components) was constrained to be

376: non-negative. The main novelty of the present work is the

377: application of the non-negativity constraints in the sparse coding

378: framework, and the simple yet efficient algorithm developed to estimate

379: the components.

380:

381: \section{Conclusions}

382:

383: In this paper, we have defined non-negative sparse coding as a combination

384: of sparse coding with the constraints of non-negative

385: matrix factorization. Although this is essentially a special case

386: of the general sparse coding framework, we believe that the proposed

387: constraints can be important for learning parts-based representations

388: from non-negative data. In addition, the constraints allow a very

389: simple yet efficient algorithm for estimating the hidden components.

390:

391: \section{Appendix}

392:

393: To prove Theorem~\ref{theorem:multupdate}, first note that

394: the objective (\ref{eq:nnsc}) is separable in the

395: columns of $\bS$ so that each column can be optimized without

396: considering the others. We may thus consider the problem for the case

397: of a single column, denoted $\s$. The corresponding column of $\X$ is

398: denoted $\x$, giving the objective

399: \begin{equation}

400: F(\s) = \frac{1}{2}\|\x - \A\s\|^2 + \lambda\sum_i s_i.

401: \end{equation}
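
For later reference, the gradient and Hessian of $F$ follow directly from this definition:
\[
\nabla F(\s) = \A^T(\A\s - \x) + \lambda\cvec, \qquad \nabla^2 F(\s) = \A^T\A,
\]
where $\cvec$ is a vector with all ones. Because $F$ is quadratic in $\s$, its second-order Taylor expansion around any point is exact.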

402:

403: The proof will follow closely the proof given

404: in \cite{LeeDD01} for the case $\lambda=0$.

405: (Note that in \cite{LeeDD01}, the notation $v=\x$, $W=\A$ and $h=\s$

406: was used.)

407: We define an auxiliary function $G(\s,\s^t)$

408: with the properties that $G(\s,\s) = F(\s)$ and $G(\s,\s^t)\geq F(\s)$.

409: We will then show that the multiplicative update rule corresponds to

410: setting, at each iteration, the new state vector to the values

411: that minimize the auxiliary function:

412: \beq

413: \s^{t+1} = \arg\min_{\s} G(\s,\s^t).

414: \eeq

415: This is guaranteed not to increase the objective function $F$, as

416: \beq

417: F(\s^{t+1}) \leq G(\s^{t+1},\s^t) \leq G(\s^t,\s^t) = F(\s^t).

418: \eeq

419:

420: Following \cite{LeeDD01}, we define the function $G$ as

421: \beq

422: \label{eq:auxdef}

423: G(\s,\s^t) = F(\s^t) + (\s-\s^t)^T\nabla F(\s^t) +

424: \frac{1}{2}(\s-\s^t)^T\K(\s^t)(\s-\s^t)

425: \eeq

426: where the diagonal matrix $\K(\s^t)$ is defined as

427: \beq

428: K_{ab}(\st) = \delta_{ab} \frac{(\A^T\A\s^t)_a + \lambda}{s^t_a}.

429: \eeq

430: It is important to note that the elements of our choice for $\K$ are

431: always greater than or equal to those of the $\K$ used in \cite{LeeDD01},

432: which is the case where $\lambda=0$. It is obvious that

433: $G(\s,\s) = F(\s)$. Writing out

434: \beq

435: F(\s) = F(\s^t) + (\s-\s^t)^T\nabla F(\s^t) +

436: \frac{1}{2}(\s-\s^t)^T(\A^T\A)(\s-\s^t),

437: \eeq

438: we see that the second property, $G(\s,\s^t)\geq F(\s)$,

439: is satisfied if

440: \beq

441: 0 \leq (\s-\s^t)^T[\K(\s^t)-\A^T\A](\s-\s^t).

442: \eeq

443: Lee and Seung proved this positive semidefiniteness

444: for the case of $\lambda=0$ \cite{LeeDD01}. In our case, with $\lambda>0$,

445: the matrix whose positive semidefiniteness is to be proved is the same

446: except that a non-negative diagonal matrix has been added

447: (see the above comment on the choice of $\K$). As a non-negative

448: diagonal matrix is positive semidefinite, and the sum

449: of two positive semidefinite matrices is also positive semidefinite,

450: the $\lambda=0$ proof in \cite{LeeDD01} also holds when $\lambda>0$.

451:

452: It remains to be shown that the update rule in (\ref{eq:updateS})

453: selects the minimum of $G$. This minimum is easily found by taking

454: the gradient and equating it to zero:

455: \beq

456: \nabla_{\s} G(\s,\s^t) = \A^T(\A\s^t-\x) + \lambda\cvec

457: + \K(\s^t)(\s-\s^t) = 0,

458: \eeq

459: where $\cvec$ is a vector with all ones. Solving for $\s$, this gives

460: \begin{eqnarray}

461: \s & = & \s^t - \K^{-1}(\s^t)(\A^T\A\s^t - \A^T\x + \lambda\cvec) \\

462:    & = & \s^t - (\s^t ./ (\A^T\A\s^t + \lambda\cvec))

463:          .\hspace{-1mm}* (\A^T\A\s^t - \A^T\x + \lambda\cvec) \\

464:    & = & \s^t .\hspace{-1mm}* (\A^T\x) ./ (\A^T\A\s^t + \lambda\cvec)

465: \end{eqnarray}

466: which is the desired update rule (\ref{eq:updateS}). This completes

467: the proof.
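
The final simplification in the derivation above can be checked by writing ${\bf u} = \A^T\A\s^t + \lambda\cvec$ (a shorthand introduced here only for this remark). The second line of the derivation then reads
\[
\s^t - (\s^t ./ {\bf u}) .\hspace{-1mm}* ({\bf u} - \A^T\x)
= \s^t .\hspace{-1mm}* \left(\cvec - ({\bf u} - \A^T\x) ./ {\bf u}\right)
= \s^t .\hspace{-1mm}* (\A^T\x) ./ {\bf u},
\]
which is the third line.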

468:

469: \bibliography{/home/info/phoyer/research/bib/collection,/home/info/phoyer/research/bib/others,/home/info/phoyer/research/bib/personal}

470: \bibliographystyle{nnsp}

471:

472: \section{Acknowledgements}

473:

474: I wish to acknowledge Aapo Hyv\"{a}rinen for useful discussions and

475: helpful comments on an earlier version of the manuscript.

476:

477: \end{document}

478: