cs0202009/nnsc.tex
1: % Non-negative Sparse Coding
2: % Patrik Hoyer, patrik.hoyer@hut.fi
3: % Feb 2002.
4: %
5: 
6: \documentclass[a4paper]{article}
7: 
8: % for extended summary
9: %\usepackage[summary]{nnsp2e}
10: % final paper
11: \usepackage{nnsp2e}
12: \usepackage{graphics}
13: 
14: \newcommand{\A}{{\bf A}}
15: \newcommand{\K}{{\bf K}}
16: \newcommand{\ai}{{\bf a}_i}
17: \newcommand{\M}{{\bf M}}
18: \newcommand{\x}{{\bf x}}
19: \newcommand{\s}{{\bf s}}
20: \newcommand{\cvec}{{\bf c}}
21: \newcommand{\st}{{\bf s}^t}
22: \newcommand{\X}{{\bf X}}
23: \newcommand{\bS}{{\bf S}}
24: \newcommand{\Aorig}{{\bf A}_{\mbox{\footnotesize orig}}}
25: \newcommand{\Sorig}{{\bf S}_{\mbox{\footnotesize orig}}}
26: 
27: \newcommand{\beq}{\begin{equation}}
28: \newcommand{\eeq}{\end{equation}}
29: 
30: \newtheorem{theorem}{Theorem}
31: \newtheorem{definition}{Definition}
32: 
33: \title{Non-negative Sparse Coding}
34: \author{Patrik O.\ Hoyer\\
35:         Neural Networks Research Centre\\
36:         Helsinki University of Technology\\
37: 	P.O. Box 9800, FIN-02015 HUT, Finland\\
38:         patrik.hoyer@hut.fi}
39: 
40: \begin{document}
41: \maketitle
42: 
43: \begin{abstract}
44: Non-negative sparse coding is a method for decomposing multivariate
45: data into non-negative sparse components. In this paper we briefly
46: describe the motivation behind this type of data representation
47: and its relation to standard sparse coding and non-negative 
48: matrix factorization. We then give a simple yet efficient 
49: multiplicative algorithm for finding the optimal values of the
50: hidden components. In addition, we show how the basis vectors can
51: be learned from the observed data. Simulations demonstrate the 
52: effectiveness of the proposed method.
53: \end{abstract}
54: 
55: \section{Introduction}
56: 
57: Linear data representations are widely used in signal processing and
58: data analysis. A traditional method of choice for signal 
59: representation is of course Fourier analysis, but also wavelet 
60: representations are increasingly being used in a variety of 
61: applications. Both of these methods have strong mathematical 
62: foundations and fast implementations, but they share the important 
63: drawback that they are not adapted to the particular data being 
64: analyzed.
65: 
66: Data-adaptive representations, on the other hand, are representations
67: that are tailored to the statistics of the data. Such representations 
68: are learned directly from the observed data by optimizing some measure 
69: that quantifies the desired properties of the representation. This 
70: class of methods includes principal component analysis (PCA), 
71: independent component analysis (ICA), sparse coding, and non-negative
72: matrix factorization (NMF). Some of these methods have their roots
73: in neural computation, but have since been shown to be widely
74: applicable for signal analysis.
75: 
76: In this paper we propose to combine sparse coding and non-negative 
77: matrix factorization into \emph{non-negative sparse coding} (NNSC). Again,
78: the motivation comes partly from modeling neural information processing.
79: We believe that, as with previous methods, this technique will be found
80: useful in a more general signal processing framework.
81: 
82: \section{Non-negative sparse coding}
83: 
84: Assume that we observe data in the form of a large number of 
85: i.i.d.\ random vectors $\x_n$, where $n$ is the sample index. 
86: Arranging these into the columns of a matrix $\X$, linear 
87: decompositions describe this data as
88: $\X \approx \A\bS$. The matrix $\A$ is called the \emph{mixing matrix},
89: and contains as its columns the \emph{basis vectors} (features) of the
90: decomposition. The rows of $\bS$ contain the corresponding 
91: \emph{hidden components} that give the contribution of each basis vector 
92: in the input vectors. Although some decompositions provide an exact
93: reconstruction of the data (i.e.\ $\X = \A\bS$), the ones that
94: we shall consider here are approximate in nature.
95: 
96: In linear sparse coding \cite{Harpur96,Olshausen96b}, the goal is 
97: to find a decomposition in which the hidden components are \emph{sparse},
98: meaning that they have probability densities which are highly peaked at zero 
99: and have heavy tails. This basically means that any given input vector 
100: can be well represented using only a few significantly non-zero hidden 
101: coefficients. Combining the goal of small reconstruction error
102: with that of sparseness, one can arrive at the following objective 
103: function to be minimized \cite{Harpur96,Olshausen96b}:
104: \begin{equation} \label{eq:sc}
105: C(\A,\bS) = \frac{1}{2}\|\X - \A\bS\|^2 + \lambda\sum_{ij} f(S_{ij}),
106: \end{equation}
107: where the squared matrix norm is simply the summed 
108: squared value of the elements, i.e.\ $\|\X-\A\bS\|^2 = 
109: \sum_{ij}[X_{ij}-(\A\bS)_{ij}]^2$.
110: The tradeoff between sparseness and accurate reconstruction is controlled
111: by the parameter $\lambda$, whereas the form of $f$ defines how sparseness
112: is measured. To achieve a sparse code, the form of $f$ must be chosen 
113: correctly: A typical choice is $f(s) = |s|$, although often similar
114: functions that exhibit smoother behaviour at zero are chosen for 
115: numerical stability.
116: 
117: There is one important problem with this objective: As $f$ typically
118: is a strictly increasing function of the absolute value of its argument,
119: the objective can always be decreased by simply scaling up $\A$ and 
120: correspondingly scaling down $\bS$. The consequence of this 
121: is that optimization of (\ref{eq:sc}) with respect to both $\A$ and $\bS$ 
122: leads to the elements of 
123: $\A$ growing (in absolute value) without bound whereas $\bS$ tends
124: to zero. More importantly, the solution found does not depend
125: on the second term of the objective as it can always be eliminated
126: by this scaling trick. In other words, some constraint on the scales
127: of $\A$ or $\bS$ is needed. Olshausen and Field \cite{Olshausen96b}
128: used an adaptive method to ensure that the hidden components had unit 
129: variance (effectively fixing the norm of the rows of $\bS$), whereas
130: Harpur \cite{HarpurPhD} fixed the norms of the columns of $\A$.
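As a brief illustration of this scaling problem (a sketch assuming the
typical choice $f(s)=|s|$, with $\alpha$ an arbitrary scale factor
introduced here), replace $\A$ by $\alpha\A$ and $\bS$ by $\bS/\alpha$
for some $\alpha>1$. The product is unchanged,
$(\alpha\A)(\bS/\alpha) = \A\bS$, so that
\[
C(\alpha\A,\bS/\alpha) = \frac{1}{2}\|\X - \A\bS\|^2
+ \frac{\lambda}{\alpha}\sum_{ij} |S_{ij}| \leq C(\A,\bS),
\]
with the penalty term vanishing as $\alpha\rightarrow\infty$.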
131: 
132: With either of the above scale constraints the objective (\ref{eq:sc}) 
133: is well-behaved and its minimization can produce useful decompositions
134: of many types of data. For example, it was shown in \cite{Olshausen96b}
135: that applying this method to image data yielded features closely 
136: resembling simple-cell receptive fields in the mammalian primary 
137: visual cortex. The learned decomposition is also similar to
138: wavelet decompositions, implying that it could be useful in applications
139: where wavelets have been successfully applied. 
140: 
141: In standard sparse coding, described above, the data is described as a
142: combination of elementary features involving both additive and
143: subtractive interactions. The fact that features can `cancel each
144: other out' using subtraction is contrary to the intuitive notion of
145: combining parts to form a whole \cite{LeeDD99}. Thus, Lee and Seung
146: \cite{LeeDD99,LeeDD01} have recently forcefully argued for
147: non-negative representations \cite{Paatero94}. Other arguments for
148: non-negative representations come from biological modeling
149: \cite{Hoyer03CNS,Hoyer02VR,LeeDD99}, where such constraints 
150: are related to the non-negativity of neural firing rates.
151: These non-negative representations assume that the input data
152: $\X$, the basis $\A$, and the hidden components $\bS$ are all non-negative.
153: 
154: Non-negative matrix factorization\footnote{Note that error measures 
155: other than the summed squared error were also considered 
156: in \cite{LeeDD99,LeeDD01}.} (NMF) can be performed by the minimization
157: of the following objective function \cite{LeeDD01,Paatero94}:
158: \begin{equation}
159: C(\A,\bS) = \frac{1}{2}\|\X - \A\bS\|^2
160: \end{equation}
161: with the non-negativity constraints 
162: $\forall ij: \; A_{ij}\geq 0, \; S_{ij}\geq 0$.
163: This objective requires no constraints on the scales of $\A$ or $\bS$. 
164: 
165: In \cite{LeeDD99}, the authors showed how non-negative matrix 
166: factorization applied to face images yielded features that corresponded
167: to intuitive notions of face parts: lips, nose, eyes, etc. This was
168: contrasted with the holistic representations learned by PCA and
169: vector quantization. 
170: 
171: We suggest that both the non-negativity constraints and the sparseness
172: goal are important for learning parts-based representations. Thus,
173: we propose to combine these two methods into non-negative sparse coding:
174: \begin{definition}
175: Non-negative sparse coding (NNSC) of a non-negative data matrix $\X$ 
176: (i.e.\ $\forall ij: \; X_{ij}\geq 0$) is given by the minimization of
177: \begin{equation} \label{eq:nnsc}
178: C(\A,\bS) = \frac{1}{2}\|\X - \A\bS\|^2 + \lambda\sum_{ij} S_{ij}
179: \end{equation}
180: subject to the constraints $\forall ij: \; A_{ij}\geq 0, \; S_{ij}\geq 0$ and
181: $\forall i: \|\ai\| = 1$, where $\ai$ denotes the $i$th column of $\A$.
182: It is also assumed that the constant $\lambda\geq 0$.
183: \end{definition}
184: Notice that we have here chosen to measure sparseness by a linear 
185: activation penalty (i.e. $f(s) = s$). 
186: This particular choice is primarily motivated by the fact that this 
187: makes the objective function quadratic in $\bS$. This is useful
188: in the development and convergence proof of an efficient algorithm
189: for optimizing the hidden components $\bS$.
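To make the quadratic structure explicit (a sketch for a single column
$\s$ of $\bS$ and the corresponding column $\x$ of $\X$, with $\cvec$
denoting a vector of all ones), the contribution of one column to
(\ref{eq:nnsc}) can be expanded as
\[
\frac{1}{2}\|\x-\A\s\|^2 + \lambda\sum_i s_i =
\frac{1}{2}\s^T\A^T\A\s - (\A^T\x - \lambda\cvec)^T\s
+ \frac{1}{2}\x^T\x,
\]
which is a quadratic function of $\s$ with positive semidefinite
Hessian $\A^T\A$. For non-negative $\s$ the penalty only modifies
the linear term.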
190: 
191: \section{Estimating the hidden components}
192: 
193: We will first consider optimizing $\bS$, for a given basis $\A$. As
194: the objective (\ref{eq:nnsc}) is quadratic with respect to $\bS$, and the 
195: set of allowed $\bS$ (i.e. the set where $S_{ij}\geq 0$) is convex, 
196: we are guaranteed that no suboptimal local minima exist. The global
197: minimum can be found using, for example, quadratic programming
198: or gradient descent. Gradient descent is quite simple to 
199: implement, but convergence can be slow. On the other hand, 
200: quadratic programming is much more complicated to implement. 
201: To address these concerns, we have developed a multiplicative 
202: algorithm based on the one introduced in \cite{LeeDD01} that is
203: extremely simple to implement and nonetheless seems to be quite
204: efficient. This is given by iterating the following update rule:
205: 
206: \begin{theorem} \label{theorem:multupdate}
207: The objective (\ref{eq:nnsc}) is nonincreasing under the update rule:
208: \begin{equation} \label{eq:updateS}
209: \bS^{t+1} = \bS^{t} \hspace{1mm}.\hspace{-1mm}* (\A^T\X) \hspace{1mm}./\hspace{1mm} (\A^T\A\bS^{t} + \lambda)
210: \end{equation}
211: where $.*$ and $./$ denote elementwise multiplication and 
212: division (respectively), and the addition of the scalar $\lambda$ is done 
213: to every element of the matrix $\A^T\A\bS^{t}$.
214: \end{theorem}
215: This is proven in the Appendix. As each element of $\bS$ is updated
216: by simply multiplying with some non-negative factor, it is guaranteed
217: that the elements of $\bS$ stay non-negative under this update rule.
218: As long as the initial values of $\bS$ are all chosen strictly positive,
219: iteration of this update rule is in practice guaranteed to reach the
220: global minimum to any required precision.
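As a simple check of the update rule (a sketch for the special case of
a single unit-norm basis vector, written here as ${\bf a}$, and a single
data vector $\x$), note that in this case $\A^T\A = 1$, so that
(\ref{eq:updateS}) reduces to the scalar iteration
\[
s^{t+1} = s^t\,\frac{{\bf a}^T\x}{s^t + \lambda}.
\]
A fixed point with $s>0$ must satisfy $s + \lambda = {\bf a}^T\x$,
i.e.\ $s = {\bf a}^T\x - \lambda$, which is precisely where the
derivative $s - {\bf a}^T\x + \lambda$ of the objective vanishes.
If instead ${\bf a}^T\x \leq \lambda$, the multiplicative factor is
smaller than one for every $s^t>0$ and the iterates shrink towards the
constrained minimum $s=0$.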
221: 
222: \section{Learning the basis}
223: 
224: In this section we consider optimizing the objective (\ref{eq:nnsc})
225: with respect to both the basis $\A$ and the hidden components $\bS$, under
226: the stated constraints. First, we consider the optimization of 
227: $\A$ only, holding $\bS$ fixed.
228: 
229: Minimizing (\ref{eq:nnsc}) with respect to $\A$ 
230: \emph{under the non-negativity constraint only} could be done exactly 
231: as in \cite{LeeDD01}, with a simple multiplicative update rule. 
232: However, the constraint of unit-norm columns of $\A$ complicates
233: things. We have not found any similarly efficient update rule that would
234: be guaranteed to decrease the objective while obeying the 
235: required constraint. Thus, we here resort to projected gradient
236: descent. Each step is composed of three parts:
237: \begin{enumerate}
238: \item $\A' = \A^t - \mu (\A^t\bS-\X)\bS^T$
239: \item Any negative values in $\A'$ are set to zero
240: \item Rescale each column of $\A'$ to unit norm, and then set $\A^{t+1} = \A'$.
241: \end{enumerate}
242: This combined step consists of a gradient descent step (Step 1) followed by
243: projection onto the closest point satisfying both the non-negativity and 
244: the unit-norm constraints (Steps 2 and 3). This projected gradient step 
245: is guaranteed to decrease the objective if the stepsize $\mu>0$ is 
246: small enough and we are not already at a local minimum. (In this case 
247: there is no guarantee of reaching the \emph{global} minimum, due to the 
248: non-convex constraints.)
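For reference, the gradient used in Step 1 can be written out
explicitly (a short derivation sketch; the index $n$ runs over the data
vectors as before). Since the sparseness penalty in (\ref{eq:nnsc})
does not depend on $\A$,
\[
\frac{\partial C}{\partial A_{ij}} =
\sum_n \left[(\A\bS)_{in} - X_{in}\right] S_{jn},
\qquad\mbox{i.e.}\qquad
\nabla_{\A} C = (\A\bS-\X)\bS^T .
\]
The rescaling in Step 3 amounts to
${\bf a}'_i \leftarrow {\bf a}'_i/\|{\bf a}'_i\|$ for each column
${\bf a}'_i$ of $\A'$, assuming that no column has been set identically
to zero in Step 2.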
249: 
250: In the previous section, we gave an update step for $\bS$, holding $\A$
251: fixed. Above, we showed how to update $\A$, holding $\bS$ fixed. To
252: optimize the objective with respect to both, we can of course 
253: take turns updating $\A$ and $\bS$. This yields the following
254: algorithm:\\[2mm]
255: \centerline{
256: \fbox{
257: \begin{minipage}{0.90\textwidth}
258: \vspace{2mm}
259: {\bf Algorithm for NNSC}
260: \begin{enumerate}
261: \item Initialize $\A^0$ and $\bS^0$ to random \emph{strictly positive} 
262: matrices of the appropriate dimensions, and rescale each column of $\A^0$ 
263: to unit norm. Set $t=0$.
264: \item Iterate until convergence:
265: \begin{enumerate}
266: \item $\A' = \A^t - \mu (\A^t\bS^{t}-\X)(\bS^t)^T$
267: \item Any negative values in $\A'$ are set to zero
268: \item Rescale each column of $\A'$ to unit norm, and then set $\A^{t+1} = \A'$.
269: \item $\bS^{t+1} = \bS^{t} \hspace{1mm}.\hspace{-1mm}* ((\A^{t+1})^T\X) \hspace{1mm}./\hspace{1mm} ((\A^{t+1})^T(\A^{t+1})\bS^{t} + \lambda)$
270: \item Increment $t$.
271: \end{enumerate}
272: \end{enumerate}
273: \vspace{2mm}
274: \end{minipage}
275: }
276: }
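
The algorithm above leaves the convergence test unspecified. One
possible stopping rule (an assumption for illustration, with $\epsilon$
a small user-chosen tolerance that is not taken from the text, and
assuming $C(\A^{t},\bS^{t})>0$) is to monitor the objective
(\ref{eq:nnsc}) and stop when its relative change falls below
$\epsilon$:
\[
\frac{|C(\A^{t+1},\bS^{t+1}) - C(\A^{t},\bS^{t})|}{C(\A^{t},\bS^{t})}
< \epsilon .
\]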
277: 
278: \section{Experiments}
279: 
280: To demonstrate how sparseness can be essential for learning a
281: parts-based non-negative representation, we performed a simple
282: simulation where the generating features were known. The interested
283: reader can find the code to perform these experiments (as well
284: as the experiments reported in \cite{Hoyer03CNS}) on the web at:\\
285: \centerline{{\ttfamily http://www.cis.hut.fi/phoyer/code/}}\\
286: 
287: In our simulations, the data vectors were $3 \times 3$-pixel images with
288: non-negative pixel values. We manually constructed $10$ original
289: features: the six possible horizontal and vertical bars, and the four
290: possible horizontal and vertical double bars. Each feature was
291: normalized to unit norm, and entered as a column in the matrix
292: $\Aorig$. The features are shown in the leftmost panel of 
293: Figure~\ref{fig:experiments}. We then generated random sparse non-negative 
294: data $\Sorig$, and obtained the data vectors as $\X = \Aorig\Sorig$. 
295: A random sample of $12$ such data vectors is also shown 
296: in Figure~\ref{fig:experiments}.
297: 
298: We ran NNSC and NMF on this data $\X$. With $10$ hidden components 
299: (rows of $\bS$),
300: NNSC can correctly identify all the features in the dataset. This result
301: is shown in Figure~\ref{fig:experiments} under {\bf \sffamily NNSC}. 
302: However, NMF cannot find all the features with any hidden dimensionality. 
303: With $6$ components, NMF finds all the single bar features. With 
304: a dimensionality of $10$, not even all of the single bars are correctly 
305: estimated. These results are illustrated in the two rightmost panels
306: of Figure~\ref{fig:experiments}.
307: 
308: \begin{figure}
309: \hspace{1.2mm}
310: \large \bf \sffamily{Features}
311: \hspace{7mm}
312: \large \bf \sffamily{Data}
313: \hspace{10mm}
314: \large \bf \sffamily{NNSC}
315: \hspace{7mm}
316: \large \bf \sffamily{NMF (6)}
317: \hspace{2mm}
318: \large \bf \sffamily{NMF (10)} \\[1.3mm]
319: \resizebox{22mm}{!}{
320: \includegraphics{bars-origbasis.eps}}
321: \resizebox{22mm}{!}{
322: \includegraphics{bars-samples.eps}}
323: \resizebox{22mm}{!}{
324: \includegraphics{bars-nnsc-10.eps}}
325: \resizebox{22mm}{!}{
326: \includegraphics{bars-nmf-6.eps}}
327: \resizebox{22mm}{!}{
328: \includegraphics{bars-nmf-10.eps}}
329: \caption{Experiments on bars data. {\mdseries  Features:} The $10$ original 
330: features that were used to construct the dataset. {\mdseries Data:} A random 
331: sample of 12 data vectors. These constitute superpositions of the 
332: original features. {\mdseries NNSC:} Features learned by NNSC, with 
333: dimensionality of the hidden representation equal to 10, starting from 
334: random initial values. {\mdseries NMF (6):} Features learned by NMF, 
335: with dimensionality 6. {\mdseries NMF (10):} Features learned by NMF, 
336: with dimensionality 10. See main text for discussion.\vspace{2mm}
337: \label{fig:experiments}}
338: \end{figure}
339: 
340: It is not difficult to understand why NMF cannot learn all the
341: features.  The data $\X$ can be perfectly described as an additive
342: combination of the six single bars (because all double bars can be
343: described as two single bars). Thus, NMF essentially achieves the
344: optimum (zero reconstruction error) already with $6$ features, and
345: there is no way in which an \emph{overcomplete} representation could
346: improve that. However, when sparseness is considered as in NNSC, it is 
347: clear that it is useful to have double bar features because these 
348: allow a sparser description of such data patterns.
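To make this concrete (a worked sketch with symbols introduced here:
${\bf b}_1$ and ${\bf b}_2$ are two parallel, non-overlapping unit-norm
single bars, ${\bf d}$ is the corresponding unit-norm double bar, and
$s>0$ is its activation), note that
${\bf d} = ({\bf b}_1 + {\bf b}_2)/\sqrt{2}$. A data vector
$\x = s\,{\bf d}$ can therefore be reconstructed exactly in two ways:
\[
\x = s\,{\bf d} \quad\mbox{(penalty $\lambda s$)},
\qquad
\x = \frac{s}{\sqrt{2}}\,{\bf b}_1 + \frac{s}{\sqrt{2}}\,{\bf b}_2
\quad\mbox{(penalty $\sqrt{2}\,\lambda s$)}.
\]
With equal reconstruction error, the double-bar code has the smaller
value of the objective (\ref{eq:nnsc}) whenever $\lambda>0$.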
349: 
350: In addition to these simulations, we have performed experiments with
351: natural image data, reported elsewhere \cite{Hoyer03CNS,Hoyer02VR}.
352: These confirm our belief that sparseness is important when learning
353: non-negative representations from data.
354: 
355: \section{Relation to other work}
356: 
357: In addition to the tight connection to linear sparse coding 
358: \cite{Harpur96,Olshausen96b} and non-negative matrix factorization 
359: \cite{LeeDD99,LeeDD01,Paatero94}, this method is intimately related
360: to independent component analysis \cite{Hyva01book}. In fact, when
361: the fixed-norm constraint is placed on the rows of $\bS$ instead
362: of the columns of $\A$, the objective (\ref{eq:nnsc}) could be
363: directly interpreted as the negative joint log-posterior of the 
364: basis vectors and components, given the data $\X$, in the noisy 
365: ICA model \cite{Hoyer02VR}. This connection is valid when the independent
366: components are assumed to have exponential distributions, and of course the 
367: basis vectors are assumed to be non-negative as well.
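This correspondence can be sketched as follows, under assumptions that
are stated here for illustration rather than taken from the text:
isotropic Gaussian noise of variance $\sigma^2$, independent exponential
priors $p(S_{ij}) = \beta e^{-\beta S_{ij}}$ for $S_{ij}\geq 0$, and a
flat prior on the admissible $\A$. Then
\[
-\log p(\A,\bS\,|\,\X) = \frac{1}{2\sigma^2}\|\X-\A\bS\|^2
+ \beta\sum_{ij}S_{ij} + \mbox{const},
\]
which, after multiplication by $\sigma^2$, has the form of
(\ref{eq:nnsc}) with $\lambda = \sigma^2\beta$.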
368: 
369: Other researchers have also recently considered the constraint of 
370: non-negativity in the context of ICA. In particular, Plumbley 
371: \cite{Plumbley02SPL} has considered estimation of the noiseless
372: ICA model (with equal dimensionality of components and observations) 
373: in the case of non-negative components. On the other hand, 
374: Parra et al.\ \cite{Parra00} considered estimation of the ICA
375: model where the basis (but not the components) was constrained to be 
376: non-negative. The main novelty of the present work is the
377: application of the non-negativity constraints in the sparse coding
378: framework, and the simple yet efficient algorithm developed to estimate
379: the components.
380: 
381: \section{Conclusions}
382: 
383: In this paper, we have defined non-negative sparse coding as a combination
384: of sparse coding with the constraints of non-negative
385: matrix factorization. Although this is essentially a special case
386: of the general sparse coding framework, we believe that the proposed
387: constraints can be important for learning parts-based representations
388: from non-negative data. In addition, the constraints allow a very
389: simple yet efficient algorithm for estimating the hidden components.
390: 
391: \section{Appendix}
392: 
393: To prove Theorem~\ref{theorem:multupdate}, first note that 
394: the objective (\ref{eq:nnsc}) is separable in the
395: columns of $\bS$ so that each column can be optimized without
396: considering the others. We may thus consider the problem for the case
397: of a single column, denoted $\s$. The corresponding column of $\X$ is
398: denoted $\x$, giving the objective
399: \begin{equation}
400: F(\s) = \frac{1}{2}\|\x - \A\s\|^2 + \lambda\sum_i s_i.
401: \end{equation}
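The separability can be seen by writing the squared norm and the
penalty as sums over the columns (a one-line sketch, with $\x_n$ and
$\s_n$ denoting the $n$th columns of $\X$ and $\bS$):
\[
C(\A,\bS) = \sum_n \left[\frac{1}{2}\|\x_n - \A\s_n\|^2
+ \lambda\sum_i S_{in}\right],
\]
where each term in the sum has the form of $F$ above with $\x=\x_n$
and $\s=\s_n$.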
402: 
403: The proof will follow closely the proof given 
404: in \cite{LeeDD01} for the case $\lambda=0$. 
405: (Note that in \cite{LeeDD01}, the notation $v=\x$, $W=\A$ and $h=\s$ 
406: was used.)
407: We define an auxiliary function $G(\s,\s^t)$
408: with the properties that $G(\s,\s) = F(\s)$ and $G(\s,\s^t)\geq F(\s)$.
409: We will then show that the multiplicative update rule corresponds to
410: setting, at each iteration, the new state vector to the values
411: that minimize the auxiliary function:
412: \beq
413: \s^{t+1} = \arg\min_{\s} G(\s,\s^t).
414: \eeq
415: This is guaranteed not to increase the objective function $F$, as
416: \beq
417: F(\s^{t+1}) \leq G(\s^{t+1},\s^t) \leq G(\s^t,\s^t) = F(\s^t).
418: \eeq 
419: 
420: Following \cite{LeeDD01}, we define the function $G$ as
421: \beq 
422: \label{eq:auxdef}
423: G(\s,\s^t) = F(\s^t) + (\s-\s^t)^T\nabla F(\s^t) + 
424: \frac{1}{2}(\s-\s^t)^T\K(\s^t)(\s-\s^t)
425: \eeq
426: where the diagonal matrix $\K(\s^t)$ is defined as
427: \beq
428: K_{ab}(\st) = \delta_{ab} \frac{(\A^T\A\s^t)_a + \lambda}{\s^t_a}.
429: \eeq
430: It is important to note that the elements of our choice for $\K$ are 
431: always greater than or equal to those of the $\K$ used in \cite{LeeDD01}, 
432: which is the case where $\lambda=0$. It is obvious that 
433: $G(\s,\s) = F(\s)$. Writing out
434: \beq
435: F(\s) = F(\s^t) + (\s-\s^t)^T\nabla F(\s^t) + 
436: \frac{1}{2}(\s-\s^t)^T(\A^T\A)(\s-\s^t),
437: \eeq
438: we see that the second property, $G(\s,\s^t)\geq F(\s)$, 
439: is satisfied if
440: \beq
441: 0 \leq (\s-\s^t)^T[\K(\s^t)-\A^T\A](\s-\s^t).
442: \eeq
443: Lee and Seung proved this positive semidefiniteness
444: for the case of $\lambda=0$ \cite{LeeDD01}. In our case, with $\lambda>0$,
445: the matrix whose positive semidefiniteness is to be proved is the same 
446: except that a non-negative diagonal matrix has been added
447: (see the above comment on the choice of $\K$). As a non-negative 
448: diagonal matrix is positive semidefinite, and the sum
449: of two positive semidefinite matrices is also positive semidefinite,
450: the $\lambda=0$ proof in \cite{LeeDD01} also holds when $\lambda>0$.
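Explicitly (a sketch in which $\K_0$ denotes the $\lambda=0$ matrix of
\cite{LeeDD01}, a symbol introduced here, and ${\bf v}=\s-\s^t$), we
have $K_{ab}(\s^t) = (\K_0)_{ab} + \delta_{ab}\lambda/s^t_a$, so that
\[
{\bf v}^T[\K(\s^t)-\A^T\A]{\bf v} =
{\bf v}^T[\K_0-\A^T\A]{\bf v} + \lambda\sum_a \frac{v_a^2}{s^t_a}
\geq 0,
\]
where the first term is non-negative by the result of \cite{LeeDD01}
and the second is non-negative because $\lambda\geq 0$ and $s^t_a>0$.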
451: 
452: It remains to be shown that the update rule in (\ref{eq:updateS})
453: selects the minimum of $G$. This minimum is easily found by taking 
454: the gradient and equating it to zero:
455: \beq
456: \nabla_{\s} G(\s,\s^t) = \A^T(\A\s^t-\x) + \lambda\cvec 
457: + \K(\s^t)(\s-\s^t) = 0,
458: \eeq
459: where $\cvec$ is a vector with all ones. Solving for $\s$, this gives
460: \begin{eqnarray}
461: \s & = & \s^t - \K^{-1}(\s^t)(\A^T\A\s^t - \A^T\x + \lambda\cvec) \\
462:    & = & \s^t - (\s^t ./ (\A^T\A\s^t + \lambda\cvec)) 
463:          .\hspace{-1mm}* (\A^T\A\s^t - \A^T\x + \lambda\cvec) \\
464:    & = & \s^t .\hspace{-1mm}* (\A^T\x) ./ (\A^T\A\s^t + \lambda\cvec)
465: \end{eqnarray}
466: which is the desired update rule (\ref{eq:updateS}). This completes
467: the proof.
468: 
469: \bibliography{/home/info/phoyer/research/bib/collection,/home/info/phoyer/research/bib/others,/home/info/phoyer/research/bib/personal}
470: \bibliographystyle{nnsp}
471: 
472: \section{Acknowledgements}
473: 
474: I wish to acknowledge Aapo Hyv\"{a}rinen for useful discussions and 
475: helpful comments on an earlier version of the manuscript.
476: 
477: \end{document}
478: