0807.2900/ms.tex
1: \documentclass[preprint]{aastex}
2: %\documentclass[manuscript]{aastex}
3: %\documentclass{emulateapj}
4: \shorttitle{Exploiting Low-Dimensional Structure}
5: \shortauthors{Richards, Freeman, Lee, Schafer}
6: %\usepackage{epsfig}
7: \usepackage{color}
8: \newcommand{\x}{{\bf x}}
9: \newcommand{\y}{{\bf y}}
10: \newcommand{\z}{{\bf z}}
11: \newcommand{\W}{{\bf W}}
12: \newcommand{\new}{red}
13: \renewcommand{\P}{{\bf P}}
14: 
15: \begin{document}
16: 
17: \title{Exploiting Low-Dimensional Structure in Astronomical Spectra}
18: \author{Joseph W. Richards, Peter E. Freeman, Ann B. Lee, Chad M. Schafer}
19: \email{jwrichar@stat.cmu.edu}
20: \affil{Department of Statistics, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213}
21: 
22: \begin{abstract}
23: Dimension-reduction techniques can greatly improve statistical inference in astronomy.
24: A standard approach is to use Principal Components Analysis (PCA).
25: In this work we apply a recently developed technique, diffusion maps, to astronomical
26: spectra for data parameterization and dimensionality reduction, and
27: develop a robust, eigenmode-based framework
28: for regression.
29: We show how our framework provides a computationally efficient means by which
30: to predict redshifts of galaxies, and thus could
31: inform more expensive redshift estimators
32: such as template cross-correlation.  It also provides a natural means
33: by which to identify outliers (e.g., misclassified spectra, spectra
34: with anomalous features).
35: We analyze 3835 SDSS spectra and show how our framework
36: yields a more than 95\% reduction in dimensionality.
37: Finally, we show that the prediction error
38: of the diffusion map-based regression approach is markedly smaller than that of a similar 
39: approach based on PCA, clearly demonstrating the superiority of diffusion
40: maps over PCA for this regression task.
41: \end{abstract}
42: 
43: \keywords{galaxies: distances and redshifts --- galaxies: fundamental parameters --- galaxies: statistics --- methods: statistical --- methods: data analysis}
44: 
45: \section{Introduction}
46: 
47: \label{sect:intro}
48: 
49: Galaxy spectra are classic examples of high-dimensional data, with
50: thousands of measured fluxes providing 
51: information about the physical conditions of the observed object.
52: To make computationally efficient inferences about these 
53: conditions, we need to first reduce the dimensionality of the data 
54: space while preserving relevant physical information. 
55: We then need to find simple relationships between the reduced data and physical parameters of
56: interest.
57: %, e.g., by introducing and estimating a regression function. 
58: Principal Components Analysis (PCA, or the Karhunen-Lo\`eve transform) is a standard method for the first step; its application to astronomical spectra is described in, e.g., \citet{BorosonGreen1992},
59: \citet{Connolly1995}, \citet{Ronen1999}, \citet{Folkes1999},
60: \citet{Madgwick2003}, \citet{Yip2004a},
61: \citet{Yip2004b}, \citet{Li2005}, \citet{Zhang2006}, 
62: \citet{VDB2006}, \citet{Rogers2007}, and \citet{ReFiorentin2007}.
63: In most cases, the authors do not proceed to the second step but only
64:  ascribe physical significance to the first few eigenfunctions from PCA
65: (such as the ``Eigenvector 1'' of \citeauthor{BorosonGreen1992}).
66: Notable exceptions are \citeauthor{Li2005}, \citeauthor{Zhang2006},
67: and \citeauthor{ReFiorentin2007}. However, 
68: as we discuss in {\S}\ref{sect:app}, these authors combine 
69: eigenfunctions in an ad hoc manner with no formal methods or
70: statistical criteria for regression and risk (i.e., error) estimation.
71: 
72: In this work we present a unified framework for regression and data parameterization of astronomical spectra. The main idea is to describe 
73: the important structure of a data set in terms of its 
74: {\em fundamental eigenmodes}.
75: The corresponding eigenfunctions are used both as coordinates for the data 
76: and as orthogonal basis functions for regression.  
77: We also introduce the {\em diffusion map} framework 
78: (see, e.g., \citealt{Coifman:Lafon:06}, \citealt{LafonLee2006}) 
79: to astronomy, comparing and contrasting it with PCA for regression analysis of SDSS galaxy spectra.  PCA is a global method that finds linear low-dimensional 
80: projections of the data; it attempts to preserve Euclidean distances between all data points and is often not robust to outliers. 
81: The diffusion map approach, on the other hand, is non-linear and instead retains distances that reflect the (local) connectivity of the data.  
82: This method is robust to outliers and is often able to unravel the intrinsic geometry and the natural (non-linear) coordinates of the data.
83: 
84: In {\S}\ref{sect:diff} we describe the diffusion map method for data
85: parameterization.
86: In {\S}\ref{sect:regress} we introduce the technique of {\em adaptive regression} using eigenmodes.
87: In {\S}\ref{sect:app} we demonstrate the effectiveness of our proposed PCA- and
88: diffusion-map-based regression techniques for 
89: predicting the redshifts of SDSS spectra.
90: % Text shifted to section 4
91: %Redshift prediction in SDSS DR6 is calculated by two methods: 
92: %first, via wavelet analyses of continuum-subtracted spectra, where the
93: %continuum is estimated using a fifth-order polynomial, and second,
94: %by cross-correlating templates and observed spectra.\footnote{
95: %See {\scriptsize \tt http://www.sdss.org/dr6/algorithms/redshift\_type.html}.}
96: %In both cases, confidence levels\footnote{
97: %SDSS ``confidence levels"
98: %are functions of the strengths of observed lines and thus should
99: %not be interpreted probabilistically.}
100: %are computed, with the higher-CL redshift estimate assigned to the galaxy.
101: % shift end
102: %Template matching, in particular, is slow (ARE WE SURE ABOUT THIS?)
103: %and prone to error because
104: %the basis functions are not orthogonal.
105: %(SDSS manually inspects 8\% of estimates and changes 1\% of them.)
106: Our PCA- and diffusion-map-based approaches provide a fast and
107: statistically rigorous means of identifying 
108: outliers in redshift data. The returned embeddings also provide an
109: informative visualization of the results.  In {\S}\ref{sect:summary} we summarize our results.
110: 
111: \section{Diffusion Maps and Data Parameterization}
112: 
113: \label{sect:diff}
114: The variations in a physical system can sometimes be described by
115: a few parameters, while measurements of the system are
116: necessarily of very high dimension; geometrically, the data are
117: points in the $p$-dimensional space $\mathbb{R}^p$, with $p$ large.
118: In our case, a data point is a galaxy spectrum, with the 
119: dimension $p$ given by the number of wavelength bins ($p \gtrsim 10^3$),
120: and a full data set could consist of hundreds of thousands of spectra.
121: To make inference and predictions tractable, 
122: one seeks to find a simpler parameterization of the system. The most 
123: common method for dimension reduction and data parameterization 
124: is PCA, where the data are projected 
125: onto a lower-dimensional hyperplane. For complex situations, 
126: however, the assumption of linearity may lead to sub-optimal 
127: predictions. A linear model pays very little attention to the 
128: natural geometry and variations of the system. The top plot in Figure
129: \ref{fig:spiral} 
130: illustrates this clearly by showing a data 
131: set that forms a one-dimensional noisy spiral in $\mathbb{R}^2$. 
132: Ideally, we would like to find a coordinate system that reflects 
133: variations along the spiral direction, which is indicated by the
134: dashed line. It is obvious that any
135: projection of the data onto a line would be unsatisfactory.  Results
136: of applying PCA to the noisy spiral are shown in the lower-right plot
137: in Figure \ref{fig:spiral}.
138: 
139: In this section, we will use diffusion maps 
140: (\citeauthor{Coifman:Lafon:06}, \citeauthor{LafonLee2006}) --- a non-linear technique --- 
141: %for data parameterization, i.e.  
142: to find a natural coordinate system for the data. 
143: When searching for a lower-dimensional description, one needs to decide 
144: what features to preserve and what aspects of the data one is 
145: willing to lose. The diffusion map framework attempts to retain 
146: the cumulative local interactions between its data points, or 
147: their ``connectivity'' in the context of a fictive diffusion process over the data. 
148: We demonstrate that this can be a better way to learn 
149: the intrinsic geometry of a data set than, e.g., PCA. 
150: %which simply projects all data points onto a lower-dimensional hyperplane.
151: 
152: Our strategy is to first define a distance metric $D(\x,\y)$ that reflects
153: the connectivity of two points $\x$ and $\y$, then find a map to a 
154: lower-dimensional space (i.e., a new data parameterization) that 
155: best preserves these distances. 
156: (As before, a ``point'' in $p$-dimensional space represents
157: a complete astronomical spectrum of $p$ wavelength bins.)
158: The general idea is that we call two data points ``close'' if there 
159: are many short paths between $\x$ and $\y$ in a jump diffusion process between data points.
160: In Figure \ref{fig:spiral}, the Euclidean distance
161: between two points is an inappropriate measure of
162: similarity. If, instead, one imagines a random walk starting at ``$\x$,'' and
163: only stepping to immediately adjacent points, it is clear that 
164: %it would take a long time for that walk to reach ``$\y$.'' 
165: the time it would take for that walk to reach ``$\y$''  would reflect
166: the length along the spiral direction.  This latter distance measure
167: is represented by the solid path from $\x$ to $\y$ in Figure \ref{fig:spiral}.
168: We will make this measure of connectivity formal in what follows.
169: 
170: The starting point is to construct a weighted graph where the
171: nodes are the observed data points. %(the spectra). repetitive
172: %(i.e., in our case each node is a spectrum).
173: The weight given to the edge connecting $\x$ and $\y$ is
174: \begin{equation}
175: w(\x,\y) = \exp\left(-\frac{s(\x,\y)^2}{\epsilon}\right),
176: \label{eqn:diffw}
177: \end{equation}
178: where $s(\x,\y)$ is a locally relevant similarity measure.
179: For instance, $s(\x,\y)$ could be chosen as 
180: the Euclidean distance between $\x$ and $\y$ (denoted here $\|\x-\y\|$)
181: when $\x$ and $\y$ are vectors.
182: However, the exact choice of $s(\x,\y)$ is not crucial, and this gets to the heart
183: of the appeal of this approach:
184: it is often simple to determine whether or not two data points are %very 
185: ``similar'', 
186: and many choices of $s(\x,\y)$ will suffice for measuring this
187: local similarity.
188: The tuning parameter $\epsilon$ is chosen small enough that
189: $w(\x,\y) \approx 0$ unless $\x$ and $\y$ are similar,
190: %only local similarities are computed, 
191: but large enough such that the constructed graph is fully connected.
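
As a minimal illustration of Equation~\ref{eqn:diffw} (a sketch in Python with NumPy, not the Matlab/R code distributed by the authors), the weights can be computed for all pairs of points at once. The function name, the toy points, and the value of $\epsilon$ below are hypothetical and chosen only to show that nearby points receive weights near 1 while distant points receive weights near 0.

```python
import numpy as np

def gaussian_weights(X, eps):
    """Edge weights w(x, y) = exp(-s(x, y)^2 / eps), with s the Euclidean distance.

    X has one data point per row; the result is an n-by-n symmetric matrix.
    """
    # Squared Euclidean distances between all pairs of rows of X.
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / eps)

# Toy data (hypothetical): two nearby points and one distant point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0]])
W = gaussian_weights(X, eps=0.5)
```

With this choice of $\epsilon$ the distant point is effectively disconnected from the other two, which illustrates why $\epsilon$ must also be large enough to keep the graph connected.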
192: 
193: The next step is to use these weights to build a Markov random walk on 
194: the graph. From node (data point) $\x$, the probability of stepping
195: directly to $\y$ is defined naturally as 
196: \begin{equation}
197: p_1(\x,\y) = \frac{w(\x,\y)}{\sum_{\z}w(\x,\z)}.
198: \label{eqn:diffp}
199: \end{equation}
200: %\begin{equation}
201: %p_1(x,y) = \frac{w(x,y)}{\sum_{z \in \Omega}w(x,z)} \,.
202: %\label{eqn:p}
203: %\end{equation}
204: This probability is close to zero unless $\x$ and $\y$ are similar. Hence, in
205: one step the random walk will move only to very similar nodes (with high
206: probability). These one-step transition probabilities are stored in the $n \times n$ 
207: matrix $\P$, where $n$ is the number of data points.
208: It follows from standard theory of Markov chains (\citealt{KemenySnell1983}) that, for a positive integer $t$, the element 
209: $p_t(\x,\y)$ of
210: the matrix power $\P^t$ gives the probability of
211: moving from $\x$ to $\y$ in $t$ steps.
212: Increasing $t$ moves the random walk
213: forward in time, propagating the local influence of a data point 
214: (as defined by the kernel $w$)
215: through its neighbors.
216: % so as eventually to form a global representation of the
217: %geometry of the data.
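
The row normalization of Equation~\ref{eqn:diffp} and the $t$-step probabilities can be sketched as follows. This is an illustrative Python example only; the three-node weight matrix is hypothetical and represents a chain in which the two end nodes are not directly connected.

```python
import numpy as np

def transition_matrix(W):
    """Row-normalize the weights: p_1(x, y) = w(x, y) / sum_z w(x, z)."""
    return W / W.sum(axis=1, keepdims=True)

# Hypothetical symmetric weights for a chain of three nodes: 0 -- 1 -- 2.
W = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.5],
              [0.0, 0.5, 1.0]])
P = transition_matrix(W)

# Two-step transition probabilities p_2(x, y) are entries of the matrix power P^2.
P2 = np.linalg.matrix_power(P, 2)
```

Here the walk cannot move from node 0 to node 2 in one step, but it can in two steps by passing through node 1, which is the sense in which increasing $t$ propagates local influence.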
218: 
219: For a fixed time (or scale) $t$, $p_t(\x,\cdot)$ is a vector representing
220: the distribution after $t$ steps of the random walk over the nodes of the
221: graph, conditional on the
222: walk starting at $\x$.
223: In what follows, the points $\x$ and $\y$ are
224: close if the conditional distributions 
225: $p_t(\x,\cdot)$ and $p_t(\y,\cdot)$ are similar. 
226: Formally, the diffusion distance at a scale $t$ is defined as
227: \begin{equation}
228: D_t^2(\x,\y) = \sum_{\z} \frac{\left(p_t(\x,\z) - p_t(\y,\z)\right)^2}{\phi_0(\z)} \,,
229: %D_t^2(x,y) = ||p_t(x,\cdot) - p_t(y,\cdot)||^2_2
230: \label{eqn:diffdist}
231: \end{equation}
232: where $\phi_0(\cdot)$ is the stationary distribution of the random walk, i.e.,
233: the long-run proportion of the time the walk spends at
234: node $\z$. 
235: Dividing by $\phi_0(\z)$ serves to reduce the influence of nodes
236: which are visited with high probability regardless of the starting point of the
237: walk.
238: %{\bf (Change the above; the 2 over 2 nomenclature is unclear.)}
239: The distance $D_t(\x,\y)$ will be small only if $\x$ and $\y$ are connected by
240: many short paths with large weights.  This construction of
241: a distance measure is robust to noise and outliers because it
242: simultaneously accounts for the cumulative effect of {\em all} paths between the
243: data points. 
244: Note that the geodesic distance (the shortest path in a graph), on the other hand, often takes shortcuts due to noise.
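
Equation~\ref{eqn:diffdist} can be evaluated directly from its definition, as in the following Python sketch. It assumes symmetric weights, in which case the stationary distribution is proportional to the row sums of the weight matrix; the four-node chain is a hypothetical example.

```python
import numpy as np

def diffusion_distance_sq(W, t):
    """Squared diffusion distances D_t^2(x, y) for all pairs of nodes.

    Assumes W is symmetric, so that the stationary distribution phi_0 is
    proportional to the row sums of W.
    """
    d = W.sum(axis=1)
    P = W / d[:, None]                      # one-step transition matrix
    phi0 = d / d.sum()                      # stationary distribution
    Pt = np.linalg.matrix_power(P, t)       # t-step transition probabilities
    # diff[x, y, z] = p_t(x, z) - p_t(y, z)
    diff = Pt[:, None, :] - Pt[None, :, :]
    return np.sum(diff**2 / phi0, axis=2)

# Hypothetical chain of four nodes: 0 -- 1 -- 2 -- 3.
W = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.5, 1.0, 0.5, 0.0],
              [0.0, 0.5, 1.0, 0.5],
              [0.0, 0.0, 0.5, 1.0]])
D2 = diffusion_distance_sq(W, t=2)
```

In this example adjacent nodes are closer in diffusion distance than the two ends of the chain, as expected. The direct computation stores an $n \times n \times n$ array and is practical only for small $n$.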
245: 
246: % Fig3a was here.
247: 
248: The final step is to find a low-dimensional embedding of the data where Euclidean distances reflect diffusion distances.
249: %In applying this technique for dimensionality reduction,
250: %the data set attribute
251: %we wish to preserve is the diffusion distance between all
252: %points.  
253: A biorthogonal spectral decomposition of the matrix $\P^t$ gives
254: %\begin{equation}
255: %p_t(x,y) = \sum_{j \ge 0} \lambda_j^t \psi_j(x) \phi_j(y) \,,
256: %\label{eqn:diffdecomp}
257: %\end{equation}
258: $p_t(\x,\y) = \sum_{j \ge 0} \lambda_j^t \psi_j(\x) \phi_j(\y)$,  
259: where $\phi_j$, $\psi_j$, and $\lambda_j$, respectively, represent left eigenvectors, right eigenvectors and eigenvalues 
260: of $\P$. It follows that
261: \begin{equation}
262: D^2_t(\x,\y)~= ~\sum_{j=1}^{\infty} \lambda_j^{2t}(\psi_j(\x)-\psi_j(\y))^2.\label{eq:Dt}
263: \end{equation}
264: %{\bf (ANN: How about putting a proof of Equation (4) as an Appendix?)}
265:  The proof of Equation~\ref{eq:Dt} and the details of the computation
266:  and normalization of the eigenvectors  $\phi_j$ and $\psi_j$ are given in
267:  \citeauthor{Coifman:Lafon:06} and
268:  \citeauthor{LafonLee2006}.\footnote{Sample code in Matlab and R for
269:    diffusion maps at {\tt  http://www.stat.cmu.edu/\~{}annlee/software.htm}}  By retaining the
270:  $m$ eigenmodes corresponding to the $m$ largest nontrivial
271:  eigenvalues and by introducing the diffusion map
272: \begin{equation}
273: \Psi_t: \x \mapsto [\lambda_1^t\psi_1(\x), \lambda_2^t\psi_2(\x), \cdots,\lambda_m^t\psi_m(\x)]
274: \label{eqn:diffusion_map}
275: \end{equation}
276: from $\mathbb{R}^p$ to $\mathbb{R}^m$, we have that %(see \citeauthor{Coifman:Lafon:06})
277: %\begin{eqnarray}
278: \begin{equation}
279: D^2_t(\x,\y)~\simeq ~\sum_{j=1}^m \lambda_j^{2t}(\psi_j(\x)-\psi_j(\y))^2 ~=~||\Psi_t(\x) - \Psi_t(\y)||^2 \,,
280: \label{eqn:diffpres}
281: \end{equation}
282: i.e., Euclidean distance in the $m$-dimensional embedding defined by Equation~\ref{eqn:diffusion_map}
283: %lower-dimensional space $\mathbb{R}^m$, 
284: approximates diffusion distance.
285: In contrast, Euclidean distances in PC maps approximate the original
286: Euclidean distances $\|\x-\y\|$.
287: Again, consider the example in Figure \ref{fig:spiral}.
288: The plot on the lower left shows that the first diffusion map coordinate is a monotonically increasing
289: function of the
290: arc length of the spiral; this is not the case in the
291: lower right plot, which shows the same relationship for the first PC coordinate. Indeed, the relationship
292: with the first PC coordinate is not even one-to-one.
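
The construction above can be checked numerically with a short Python sketch. It is not the authors' implementation: it assumes symmetric weights and obtains the eigenvectors of $\P$ from a symmetric matrix with the same eigenvalues, and the spiral data, $\epsilon = 2$, and $t = 3$ are arbitrary illustrative choices. When all $n-1$ nontrivial modes are kept, Euclidean distances in the embedding should reproduce the diffusion distances of Equation~\ref{eqn:diffdist}.

```python
import numpy as np

def diffusion_map(W, m, t=1):
    """Return the m diffusion map coordinates of each node and all eigenvalues of P."""
    d = W.sum(axis=1)
    phi0 = d / d.sum()                          # stationary distribution
    A = W / np.sqrt(np.outer(d, d))             # symmetric; same eigenvalues as P
    evals, evecs = np.linalg.eigh(A)            # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]  # descending; evals[0] = 1 is trivial
    # Right eigenvectors of P, normalized so that sum_x psi_j(x)^2 phi0(x) = 1.
    psi = evecs / np.sqrt(phi0)[:, None]
    coords = (evals[1:m + 1] ** t) * psi[:, 1:m + 1]
    return coords, evals

# Toy data (hypothetical): a noiseless spiral in R^2.
theta = np.linspace(0.5, 3.0 * np.pi, 40)
X = np.column_stack([theta * np.cos(theta), theta * np.sin(theta)])
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
W = np.exp(-sq_dists / 2.0)                     # eps = 2.0, chosen for illustration
n, t = len(X), 3

# Squared Euclidean distances in the full (n - 1)-dimensional embedding.
Psi_t, evals = diffusion_map(W, m=n - 1, t=t)
embed_sq = np.sum((Psi_t[:, None, :] - Psi_t[None, :, :]) ** 2, axis=2)

# Squared diffusion distances computed directly from the definition.
P = W / W.sum(axis=1, keepdims=True)
Pt = np.linalg.matrix_power(P, t)
phi0 = W.sum(axis=1) / W.sum()
direct_sq = np.sum((Pt[:, None, :] - Pt[None, :, :]) ** 2 / phi0, axis=2)
```

Comparing {\tt embed\_sq} with {\tt direct\_sq} confirms the identity; truncating to $m < n-1$ modes gives the approximation in Equation~\ref{eqn:diffpres}.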
293: 
294: The choice of the parameters $m$ and $t$ is determined by the fall-off of the eigenvalue spectrum as well 
295: as the problem at hand (e.g., clustering, classification, regression, 
296: or data visualization).  An objective measure
297: of performance should be defined and utilized to find data-driven best choices for these tuning parameters. 
298: In this work, the final goal
299: is regression and prediction of redshift. In the next section, we show how the number of coordinates, $m$, can 
300: be chosen by cross-validation, once one has defined an appropriate statistical ``risk'' function. The particular 
301: choice of $t$, on the other hand, will not matter in the regression framework, as it will only represent a 
302: rescaling of the $m$ selected basis vectors.
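
A short worked step shows why $t$ drops out. This is a sketch that assumes the retained eigenvalues are nonzero and that the coefficients are fit by a procedure, such as ordinary least squares, that depends only on the span of the basis functions. Using the diffusion map coordinates $\lambda_j^t\psi_j$ in place of $\psi_j$ as regressors gives
\[
\sum_{j=1}^{m} \beta_j \, \psi_j(\x) ~=~ \sum_{j=1}^{m} \left(\beta_j \lambda_j^{-t}\right) \left(\lambda_j^{t} \psi_j(\x)\right) ,
\]
so any fit expressed in one basis is matched exactly in the other by rescaling each coefficient by $\lambda_j^{-t}$. Under these assumptions the fitted values, and hence the estimated risk, are the same for every $t$.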
303: 
304: \section{Adaptive Regression Using Orthogonal Eigenfunctions}
305: \label{sect:regress}
306: Our next problem is how to predict, in a statistically rigorous way, a function $y=r(\mathbf{x})$ (e.g., redshift, age, or metallicity of galaxies) of data 
307: (e.g., spectrum $\mathbf{x}$) in very high dimensions using a sample
308: of known pairs ($\x,y$). As before, imagine that our data are points in $\mathbb{R}^p$, but that the
309: natural variations in the system are along a low dimensional space $\mathcal{X} \subset \mathbb{R}^p$.
310: %In other words, $p$ is very large but the intrinsic dimension of $\mathcal{X}$, which is determined by the natural variations of the system, is considerably 
311: %smaller. 
312: The set $\mathcal{X}$ could, for example, be a non-linear submanifold embedded in $\mathbb{R}^p$.
313: In our toy example in Figure \ref{fig:spiral}, $\mathcal{X}$ is the one-dimensional spiral, but the data are observed
314: in $p=2$ dimensions.
315: The key idea is that one may view the eigenfunctions from PCA or diffusion maps
316: (a) as {\em coordinates} of the data points, as shown in the previous section,
317: or (b) as forming a {\em Hilbert orthonormal basis} for any function (including the regression function $r(\mathbf{x})$) supported on the 
318: subset $\mathcal{X}$. Rather than applying an arbitrarily chosen prediction scheme in the computed diffusion or PC space (as in, e.g., \citeauthor{Li2005}, \citeauthor{Zhang2006}, and \citeauthor{ReFiorentin2007}), we utilize the latter insight to formulate a general regression and risk estimation framework. %for high-dimensional inference.
319: 
320: Any function $r$ satisfying $\int r(\x)^2 \, d\x < \infty$, where $\x \in \mathcal{X} $, can be written as
321: \begin{equation}
322: r(\x) = \sum_{j=1}^{\infty} \beta_j \psi_j(\x) \,, 
323: \label{eqn:orthonorm}
324: \end{equation}
325: where the sequence of functions $\{\psi_1,\psi_2,\cdots\}$ forms an
326: orthonormal basis.  The choice of basis functions is traditionally {\em not} adapted to the geometry of the data, or the set $\mathcal{X}$.
327: Standard choices are, for example, Fourier or wavelet bases for $\mathbf{L}^2(\mathbb{R}^p)$, which are constructed as tensor 
328: products of one-dimensional bases. The latter approach makes sense for low dimensions, for example for $p=2$, but quickly becomes
329: intractable as $p$ increases (see, e.g., \citealt{Bellman:61} for the ``curse of dimensionality''). In particular, note that if a wavelet basis
330: in one dimension consists of $q$ basis functions, and hence 
331: requires the estimation of $q$ parameters, the naive tensor basis in $p$ dimensions will have $q^p$ basis functions/parameters,
332: creating an impossible inference problem even for moderate $p$.
333: Because this basis is not adapted to $\mathcal{X}$, there is little hope of 
334: finding a subset of these basis functions which will
335: do an adequate job of modeling the response.
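
As a rough illustration of the scale involved (the numbers are chosen here only as an example), even the coarsest choice $q=2$, applied to spectra with $p=3500$ wavelength bins as in {\S}\ref{sect:app}, gives
\[
q^p ~=~ 2^{3500} ~\approx~ 4 \times 10^{1053}
\]
basis functions, vastly more than the few thousand spectra available to estimate their coefficients.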
336: %although for any particular problem 
337: %one strives to represent any sufficiently smooth function with
338: %as small a subset of basis functions as possible.
339: 
340: In this work, we propose a new adaptive framework where the basis functions reflect the intrinsic geometry of the data.  Furthermore, we use a formal statistical method to estimate the risk and the optimal parameters in the model. First, rather than using a generic tensor-product basis for the high-dimensional space $\mathbb{R}^p$, we 
341: construct a data-driven 
342: basis for the lower-dimensional, possibly non-linear set $\mathcal{X}$ where the data lie. 
343: Let $\{{\psi_1},{\psi_2},\cdots,{\psi_n}\}$ be the orthogonal eigenfunctions computed by PCA or diffusion maps. 
344: Our regression function estimate $\widehat{r}(\x)$ is then given by 
345: \begin{equation}
346: \widehat{r}(\x) = \sum_{j=1}^{m} \widehat{\beta_j} {\psi_j}(\x), 
347: \label{eqn:orthoreg}
348: \end{equation}
349: %equation~(\ref{eqn:orthoreg}),
350: where the different terms in the series expansion represent the
351: fundamental eigenmodes of the data, and $m \leq n$ is chosen to
352: minimize the prediction risk that we will now define rigorously.
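
Equation~\ref{eqn:orthoreg} can be sketched in Python as a linear fit on the first $m$ eigenfunction coordinates. Two assumptions are made here that the text does not specify: the coefficients are estimated by ordinary least squares, and a constant column is included as an intercept. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_eigen_regression(Psi, y, m):
    """Least-squares coefficients for y ~ beta_0 + sum_{j=1..m} beta_j psi_j.

    Psi holds one eigenfunction per column, evaluated at the n data points.
    The intercept beta_0 is an assumption of this sketch.
    """
    design = np.column_stack([np.ones(len(y)), Psi[:, :m]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def predict(Psi, beta):
    """Evaluate the fitted series at the rows of Psi."""
    m = len(beta) - 1
    return beta[0] + Psi[:, :m] @ beta[1:]

# Synthetic example (hypothetical): the response depends on the first two coordinates only.
rng = np.random.default_rng(0)
Psi = rng.normal(size=(50, 5))
y = 1.0 + 2.0 * Psi[:, 0] - 3.0 * Psi[:, 1]
beta = fit_eigen_regression(Psi, y, m=2)
```

In this noise-free example the fit with $m=2$ recovers the generating coefficients; with real data, $m$ would be chosen by the risk criterion defined next.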
353: 
354: \subsection{Risk: Theory and Estimation}
355: \label{sect:risk}
356: 
357: A key aspect of our approach is that the choice of the models is driven by the minimization of a well-justified, objective error criterion
358: which compensates for overfitting. This is critical, as any basis could be utilized to fit the observed data well; this does not provide,
359: however, any assurance that the model applies beyond these data.
360: To begin, we establish the standard stochastic framework within which regression models are assessed.
361: We are given $n$ pairs of observations $(X_1,Y_1), \ldots, (X_n, Y_n)$, with the task of predicting the 
362: response $Y=r(X)+\epsilon$ at a new data point $X=\x$, where $\epsilon$ represents random noise.  
363: (In {\S}\ref{sect:app}, the response $Y$ is the redshift, $z$, and $X$ is a complete spectrum.) 
364: In nonparametric regression by orthogonal functions, 
365: one assumes that $r(\x)$ is given 
366: according to equation~(\ref{eqn:orthonorm}), with its estimator given
367: by equation~(\ref{eqn:orthoreg}), with $m \leq n$ where $\{\psi_j\}$
368: is a fixed basis.
369: %An estimator of $r(\x)$ typically has the form 
370: %\begin{equation}
371: %\widehat{r}(\x)=\sum_{j=1}^{m} \widehat{\beta_j} \psi_j(\x),
372: %\label{eqn:orthoreg}
373: %\end{equation}
374: %where $m \leq n$ and $\{\psi_j\}$ is a fixed basis.
375: The primary goal is to minimize the
376: {\em prediction risk} (i.e., expected error), commonly quantified by
377: the mean-squared error (MSE)
378: \begin{equation}
379: R(m)=\mathbb{E}\left[\left(Y-\widehat{r}(X)\right)^2\right],
380: \label{eqn:MSE}
381: \end{equation} 
382: where the average is taken over all possible realizations of $(X,Y)$,
383: including the randomness in the evaluation points $X$, the
384: responses $Y$, and the estimates $\widehat{\beta_j}$.
385: Averaging over the randomness in the estimates $\widehat{\beta_j}$ is what
386: protects against overfitting: if a basis
387: function $\psi_j$ is unnecessarily included in the model, 
388: its coefficient $\widehat{\beta_j}$ will only add
389: variance to
390: $\widehat{r}(X)$ and not improve the fit, hence increasing $R(m)$.
391: (On the other hand, as $m$ decreases, 
392: the estimator becomes increasingly biased, also increasing $R(m)$.)
393: Thus, the ideal choice of $m$ is neither too large nor too small.
394: In nonparametric statistics, this is dubbed the ``bias-variance tradeoff''
395: (see, e.g., \citealt{Wasserman2006}).
396: A secondary goal is {\em sparsity}; more specifically, 
397: among the estimators with a small risk, 
398: we prefer representations with a smaller $m$.
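
The tradeoff can be written out explicitly. Under the additional assumptions, stated here only for illustration, that the noise $\epsilon$ has mean zero and variance $\sigma^2$ and is independent of $X$ and of the training sample, the prediction risk decomposes as
\[
R(m) ~=~ \sigma^2 ~+~ \mathbb{E}_X\left[\left(\mathbb{E}[\widehat{r}(X) \mid X] - r(X)\right)^2\right] ~+~ \mathbb{E}_X\left[{\rm Var}\left(\widehat{r}(X) \mid X\right)\right] ,
\]
where the inner expectation and variance are taken over training samples. The first term is irreducible noise, the second is the squared bias, which typically decreases as $m$ grows, and the third is the variance, which typically increases with $m$.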
399: 
400: Since $R(m)$ is a population quantity, one needs to appropriately estimate it from the data. 
401: An estimate that reuses the same data for fitting and for evaluation will underestimate the error and favor an overfit model with high variance. 
402: Here we will use the method of $K$-fold cross-validation 
403: (see, e.g., \citeauthor{Wasserman2006}) to achieve 
404: a better estimate of the prediction risk. The basic idea is to randomly split the data set into $K$ blocks 
405:  of approximately the same size; $K=10$ is a common choice. For $k=1$ to $K$, we delete block $k$ from the data. We then fit the model to the 
406: remaining $K-1$ blocks and compute the observed squared error $\widehat{R}_{(-k)}(m)$ on the $k$th block which was not included in the fit. The CV estimate of the risk is defined as $\widehat{R}_{CV}(m)=\frac{1}{K}\sum_{k=1}^{K} \widehat{R}_{(-k)}(m)$.
407: It can be shown that this quantity is an approximately unbiased estimate of the true error $R(m)$.
408: Thus, we choose the model parameters that minimize the CV estimate $\widehat{R}_{CV}(m)$ of the risk, i.e., 
409: we take $m_{\rm opt} = \arg \min \widehat{R}_{CV}(m)$.
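
The $K$-fold procedure can be sketched in Python as follows. Several simplifications are assumptions of this sketch rather than details from the text: the eigenfunction coordinates are treated as fixed and already available for the held-out points, the coefficients are fit by least squares with an intercept, and the data are synthetic with three truly relevant coordinates.

```python
import numpy as np

def cv_risk(Psi, y, m, K=10, seed=0):
    """K-fold cross-validation estimate of the prediction risk for a given m."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)   # K blocks of roughly equal size
    design = np.column_stack([np.ones(n), Psi[:, :m]])
    fold_errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        # Fit on the remaining K - 1 blocks.
        beta, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
        # Mean squared error on the block that was left out of the fit.
        resid = y[held_out] - design[held_out] @ beta
        fold_errors.append(np.mean(resid**2))
    return float(np.mean(fold_errors))

# Synthetic example (hypothetical): only the first three coordinates matter.
rng = np.random.default_rng(1)
Psi = rng.normal(size=(200, 10))
y = (1.0 + 2.0 * Psi[:, 0] - 3.0 * Psi[:, 1] + 1.5 * Psi[:, 2]
     + rng.normal(scale=0.1, size=200))

risks = {m: cv_risk(Psi, y, m) for m in range(1, 11)}
m_opt = min(risks, key=risks.get)   # arg min of the CV risk estimate
```

The estimated risk drops sharply once all three relevant coordinates are included and changes little afterwards, so the minimizer is at least 3. In a full analysis the same search would also run over $\epsilon$.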
410: 
411: Finally, we note that the ideas of CV introduced here generalize to cases where the model
412: parameters are of higher dimension. For example, in the diffusion
413: map case, the risk is minimized over both the bandwidth $\epsilon$ and the number of eigenfunctions $m$. The CV estimate of the
414: risk is implemented in the same fashion, but the search space for finding the minimum is larger.
415: In what follows, the notation will make it clear which
416: model parameters we are minimizing over by writing, for
417: example, $R(\epsilon, m)$.
418: 
419: To summarize, our claim is that the proposed regression framework will lead to efficient inference in high 
420: dimensions, as we are effectively performing regression in a lower-dimensional space $\mathcal{X}$ that 
421: captures the natural variations of the data, where the optimal
422: dimensionality is chosen to minimize prediction risk in our regression
423: task. Finally, the use of eigenfunctions in both the data parameterization 
424: and in the regression formulation provides an elegant, unifying framework for analysis and prediction. 
425:  
426: %Here, $J \leq m$ and is chosen by using an appropriate risk 
427: %estimator, such as cross-validation (see, e.g., \citet{Wasserman2007}),
428: %rather than in ad hoc manner of, e.g., \citeauthor{Li2005},
429: %\citeauthor{Zhang2006}, and \citeauthor{ReFiorentin2007}
430: %The smoother the true regression function $r$, the fewer basis terms
431: %$J$ will be needed to represent it.
432: %The estimated orthonormal basis $\{\hat{\psi}\}$ {\bf SHOULD} converge more
433: %quickly to the true underlying basis in (\ref{orthonorm}) than an
434: %arbitrarily chosen basis.  We {\bf SHOULD} thus be able to obtain better
435: %estimates of $r$ than by using PCA or diffusion mapping eigenfunctions
436: %that by using an arbitrary basis.
437: 
438: \section{Redshift Prediction Using SDSS Spectra}
439: 
440: \label{sect:app}
441: 
442: We apply the formalism presented in {\S}{\S}\ref{sect:diff}--\ref{sect:regress}
443: to the problem of predicting redshifts for a sample of SDSS spectra.
444: Physically similar objects residing at similar redshifts will have
445: similar continuum shapes as well as absorption lines occurring at
446: similar wavelengths.  Hence the %$\mathbf{L}^2$ 
447: Euclidean distances between their spectra will be small.
448:  The proposed regression framework with diffusion map or PC
449:  coordinates provides a natural means by which to predict
450: redshifts.  Furthermore, it is computationally efficient, making its
451: use appropriate for large databases such as the SDSS;
452: one can use these predictions to
453: inform more computationally expensive techniques by narrowing down
454: the relevant parameter space (e.g., the redshift range or the 
455: set of templates in cross-correlation techniques).  
456: Adaptive regression also provides a useful 
457: tool for quickly identifying anomalous data points (e.g., objects
458: misclassified as galaxies), galaxies that have relatively rare
459: features of interest, and 
460: galaxies whose SDSS redshift estimates may be incorrect.
461:  
462: \subsection{Data Preparation}
463: 
464: Our initial data sample consists of spectra that are classified as galaxies
465: from ten arbitrarily chosen spectroscopic plates of SDSS DR6
466: (0266$-$0274 inclusive, and 0286; \citealt{Adelman2008}).  
467: We remove spectra from this sample by applying three cuts.  The first
468: is motivated by aperture considerations: we analyze only those spectra
469: with SDSS redshift estimates $z_{\rm SDSS} \geq$ 0.05.  
470: To include spectra  with $z_{\rm SDSS} < 0.05$
471: would be to add an extra source of variation that would
472: adversely impact regression analysis.  The second cut is based on bin flags.
473: To avoid calibration issues observed at both the low and high
474: wavelength ends, we remove the first 100 and last 250 wavelength bins
475: from each spectrum;
476: then we determine what proportion of the remaining 3500 bins are flagged
477: as bad.  If this proportion exceeds 10\%, we remove the spectrum from the
478: sample; if not, we retain the reduced spectrum for further analysis.  
479: We provide details on the third cut below.
480: The application of these cuts reduces our sample size from 5057
481: to 3835 galaxies.
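
The bin-flag cut can be sketched in Python as follows. The function name and the Boolean flag array are hypothetical, and the total of 3850 bins per spectrum is inferred here from the 3500 retained bins plus the 350 removed; the sketch does not reproduce how SDSS encodes its flags.

```python
import numpy as np

def trim_and_check(flux, bad, n_lo=100, n_hi=250, max_bad_frac=0.10):
    """Drop the first n_lo and last n_hi bins, then test the fraction of bad bins.

    Returns the trimmed flux, the fraction of retained bins flagged as bad,
    and whether the spectrum passes the cut.
    """
    stop = len(flux) - n_hi
    trimmed_flux = flux[n_lo:stop]
    frac_bad = float(np.mean(bad[n_lo:stop]))
    return trimmed_flux, frac_bad, frac_bad <= max_bad_frac

# Hypothetical spectra with 3850 bins each.
n_bins = 3850
flux = np.ones(n_bins)

bad_a = np.zeros(n_bins, dtype=bool)
bad_a[1000:1200] = True      # 200 of 3500 retained bins flagged (about 5.7%)
trimmed_a, frac_a, keep_a = trim_and_check(flux, bad_a)

bad_b = np.zeros(n_bins, dtype=bool)
bad_b[1000:1400] = True      # 400 of 3500 retained bins flagged (about 11.4%)
trimmed_b, frac_b, keep_b = trim_and_check(flux, bad_b)
```

The first spectrum is retained and the second is removed, since only the second exceeds the 10\% threshold.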
482: 
483: %(reducing the wavelength range to 3940-8850\AA)
484: %These spectra span the wavelength range 3800-9200\AA, with
485: %uniform binning in $\log_{10}$-space ($\Delta \log_{10} \lambda$ = 10$^{-4}$).
486: 
487: We further process each spectrum in our sample as follows.
488: \begin{itemize}
489: \item We replace the flux values in the vicinity of
490: prominent atmospheric lines at 5577~\AA, 6300~\AA, and 6363~\AA~with
491: the sample mean of the nine closest bins on either side of each line.
492: The flux errors are estimated by averaging (in quadrature)
493: the standard errors of the fluxes for these bins.
494: \item We similarly replace the flux values in each bin flagged by SDSS as
495: part of an emission line, with flux and flux error estimates based
496: upon the closest 50 bins on either side of the line.  (Within this group
497: of 100 bins, we do not include those that are themselves flagged as
498: emission lines.)
499: We do this because highly variable emission line strengths 
500: can strongly bias distance calculations.
501: \item Last, after replacing flux values as necessary, we normalize
502: each spectrum to sum to 1 to mitigate variation due to differences in luminosity between 
503: similar galaxies at similar redshifts.
504: \end{itemize}
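
The replacement and normalization steps listed above can be sketched in Python as follows. The helper names are hypothetical, the width of the replaced region is left as an input because the text does not state it, and the sketch assumes the line lies at least {\tt n\_side} bins from either end of the spectrum. Flux errors are not propagated here.

```python
import numpy as np

def replace_line(flux, first_bin, last_bin, n_side=9):
    """Replace flux in bins first_bin..last_bin (inclusive) with the mean of
    the n_side closest bins on either side of that range."""
    neighbors = np.concatenate([flux[first_bin - n_side:first_bin],
                                flux[last_bin + 1:last_bin + 1 + n_side]])
    out = flux.copy()
    out[first_bin:last_bin + 1] = neighbors.mean()
    return out

def normalize(flux):
    """Scale a spectrum so that its fluxes sum to 1."""
    return flux / flux.sum()

# Hypothetical flat spectrum with a strong line occupying bins 50 to 52.
flux = np.ones(100)
flux[50:53] = 10.0
cleaned = replace_line(flux, first_bin=50, last_bin=52)
spectrum = normalize(cleaned)
```

For emission lines, the same helper could be used with {\tt n\_side=50} after excluding neighboring bins that are themselves flagged, which this sketch does not handle.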
505: 
506: In its data reduction pipeline, SDSS estimates spectroscopic redshifts,
507: $z_{\rm SDSS}$, standard errors, $\sigma_{z_{\rm SDSS}}$, and
508: ``confidence levels,'' CL, the latter of which are functions
509: of the strengths of observed lines (and thus should
510: not be interpreted probabilistically).\footnote{
511: See {\tt http://www.sdss.org/dr6/algorithms/redshift\_type.html}.}
512: Lacking knowledge of the true redshifts in our sample, we use 
513: $z_{\rm SDSS}$ and $\sigma_{z_{\rm SDSS}}$ to fit our regression model.
514: Since poorly estimated redshifts can bias the model,
515: we divide our data sample into two groups, fitting with only
516: those 2793 galaxies with CL $>$ 0.99.
517: We then use the fitted model to predict redshifts for the other 1042 galaxies.
518: (It is here that we make our third data cut: to avoid issues of extrapolation,
519: we remove 19 of 1061 spectra with CL $\leq$ 0.99 whose SDSS redshift estimates
520: lie outside the range of our training set, i.e., those with $z_{\rm
521:   SDSS} > 0.50$.)  As shown in Figure \ref{fig:zdesign}, the distributions of
522: redshifts in our high- and low-CL samples are similar, implying that 
523: predicted redshifts for low-CL galaxies from the model built on
524: high-CL galaxies should not be systematically biased.
525: 
526: \subsection{Analysis}
527: \label{sect:anal}
528: 
529: % Redundant with line immediately above.
530: %Then, using the regression model presented in 
531: %{\S}\ref{sect:regress} we can regress the SDSS redshift estimates on the
532: %diffusion map coordinates to find galaxies for which our
533: %predicted redshift values do not agree with the corresponding SDSS estimates.
534: 
535: %In its spectral reduction pipeline, SDSS estimates spectroscopic
536: %redshifts $z_{\rm SDSS}$ by (a) using a reference line list\footnote{
537: %\scriptsize \tt http://www.sdss.org/dr6/algorithms/linestable.html}
538: %to identify emission lines that they detect using a wavelet-based
539: %procedure, and (b)
540: %cross-correlating emission-line-masked, continuum-subtracted
541: %spectra with star, galaxy, and quasar templates.\footnote{
542: %See {\scriptsize \tt http://www.sdss.org/dr6/algorithms/redshift\_type.html}.}
543: 
544: In this section, we apply both PCA and diffusion maps to our sample
545: and predict redshift using the
546: regression model introduced in {\S}\ref{sect:regress}.  We provide
547: details on the PCA algorithm in Appendix \ref{sect:pca}.
548: 
549: In the diffusion map analysis, 
550: we begin by calculating Euclidean distances between spectra
551: \begin{equation}
552:    s(\x, \y)~=~ \sqrt{\sum_k (f_{\x,k}-f_{\y,k})^2} \,,
553: \end{equation}
554: where $f_{\x,k}$ and $f_{\y,k}$ are the normalized fluxes in bin $k$ of
555: spectra $\x$ and $\y$, respectively.  We use these distances and a
556: chosen value of
557: $\epsilon$ to construct both the weights for the graph (see equation
558: \ref{eqn:diffw}) and the transition 
559: matrix $\P$ (see equation \ref{eqn:diffp}), from which eigenmodes are
560: generated.  Below we
561: discuss how we select the optimal value of $\epsilon$.  
562: As stated in {\S}\ref{sect:diff}, the value of the parameter $t$ 
563: (see equation \ref{eqn:diffusion_map}) is unimportant in
564: the context of regression, as any change in $t$ would be met 
565: with a corresponding
566: rescaling of the coefficients $\widehat \beta_j$ in the regression model,
567: such that predictions are unchanged.
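
To see this explicitly, suppose for illustration that the prediction is a
linear combination of the first $m$ diffusion map coordinates (the
coefficients $\widehat \beta_j'$ and the alternative scale $t'$ below are
notation introduced only for this sketch).  Then for any two scales $t$ and $t'$,
\begin{eqnarray}
\widehat r(\x) ~=~ \sum_{j=1}^{m} \widehat \beta_j \, \lambda_j^{t} \psi_j(\x)
~=~ \sum_{j=1}^{m} \widehat \beta_j' \, \lambda_j^{t'} \psi_j(\x) \,,
\qquad \widehat \beta_j' ~=~ \lambda_j^{t-t'} \widehat \beta_j \,, \nonumber
\end{eqnarray}
which holds provided the retained eigenvalues $\lambda_j$ are nonzero.
Changing $t$ only rescales each column of the design matrix, so the
least-squares fit and its predictions are the same.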
568: 
569: In Figure \ref{fig:zmaps} we plot the embedding of
570: the 2793 galaxies with CL $>$ 0.99
571: in the first three PC and diffusion map 
572: coordinates (i.e., $\lambda_i^t\psi_i(\cdot)$ in equation \ref{eqn:diffusion_map}).
573: We observe that the structure of each of these reparameterizations of
574: the original data corresponds in a simple way to $\log_{10}(1+z_{\rm
575:   SDSS})$.  These embeddings are a useful way to visualize the data
576: and to qualitatively identify subgroups of data and peculiar data points.
577: 
578: % fig:zmaps was here
579: 
580: In the next stage of analysis we use the computed eigenfunctions to
581: predict $z$ for our sample of 3835 galaxies.
582: We regress $z_{\rm SDSS}$ upon the diffusion map (and PC) eigenmodes
583: (cf.~equation~\ref{eqn:orthoreg}, where $\widehat r$ represents
584: our redshift estimates), weighting each data point by the 
585: inverse variance of its $z_{\rm SDSS}$, 1/$\sigma_{z_{\rm SDSS}}^2$,
586: to account for the uncertainties in $z_{\rm SDSS}$ measurements.
587: We repeat this step for a sequence of 
588: $m$ (and $\epsilon$) values, determining the optimal values of each
589: by minimizing the prediction risk $R(\epsilon,m)$, 
590: estimated via ten-fold cross-validation (see equation~\ref{eqn:MSE}
591: and subsequent discussion).  It is in this regression step that
592: we clearly observe the advantage of using diffusion maps over
593: principal components.  In Figure \ref{fig:zrisk} we show that
594: diffusion map achieves significantly lower
595: CV prediction risk for most choices of model size $m$ and
596: obtains a much lower minimum $\widehat{R}_{\rm CV}$, i.e.,
597: the optimal low-dimensional diffusion map
598: representation of our data captures the trend in $z$ better than the
599: PC representation.  Note that the trend in $\widehat{R}_{\rm CV}$ for both
600: PC and diffusion map basis functions is to decrease with increasing
601: model size for small models and to increase with increasing model size
602: for larger models.  This is the ``bias-variance tradeoff'' that was
603: referred to in {\S}\ref{sect:risk}: as the size (complexity) of our model
604: increases, the bias of the model decreases while the variance of the
605: model increases.  Prediction risk is the sum of the squared bias and
606: variance of a model, explaining the behavior observed in Figure
607: \ref{fig:zrisk}: for small models, increasing the model size leads to
608: a decrease in bias that outweighs the
609: increase in variance, while for large models, increasing the model size
610: produces a minimal decrease in bias and a relatively large increase in variance.
611: %It is also a sparser representation, requiring
612: %less than half the number of eigenfunctions (42 vs.~93).
613: %Restating the previous two sentences, 
614: %{\em the diffusion map approach yields better redshift 
615: %predictions than PCA, with a model that is more parsimonious than the
616: %best-fitting PCA model}.
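
For reference, the decomposition invoked above can be written at a fixed
point $\x$ as follows.  Here $r$ denotes the true regression function and the
expectation and variance are taken over training samples; this is the
standard identity, stated in notation that may differ slightly from that of
{\S}\ref{sect:risk}:
\begin{eqnarray}
{\rm E}\left[ \left( \widehat r(\x) - r(\x) \right)^2 \right] ~=~
\left( {\rm E}\left[ \widehat r(\x) \right] - r(\x) \right)^2 +
{\rm Var}\left( \widehat r(\x) \right) \,. \nonumber
\end{eqnarray}
The first term (squared bias) typically shrinks as $m$ grows, while the second
term (variance) grows, which produces the minimum seen in
Figure \ref{fig:zrisk}.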
617: 
618: In Table \ref{tab:zreg}, we show the parameters for the
619: optimal (minimal $\widehat{R}_{\rm CV}$) diffusion map and PC regression models.
620: Note that since our original data were in 3500
621: dimensions, our optimal diffusion map model achieves 
622: a 96.4\% reduction in dimensionality.  If we were to choose
623: an arbitrary small model size as is often done in the literature, our
624: prediction risks would be substantially higher.  For example, for model
625: sizes $m = 10$ and 20, the CV prediction risks for regression on PC
626: basis functions are 0.305 and 0.209, respectively (compared to optimal
627: value 0.193), while regression on
628: diffusion map basis functions yields $\widehat{R}_{\rm CV}$ of 0.295
629: and 0.191, respectively (compared to optimal value 0.134).  The choice of
630: $\epsilon$ in the diffusion map model also has a significant impact on
631: results.  For values of $\epsilon$ that are too small, CV risks are
632: extremely large because the data points are no longer connected in the
633: diffusion process and consequently large outliers occur in the
634: diffusion map parameterization.  Likewise, large values of $\epsilon$
635: yield large prediction risks due to the large weights given to
636: connections between dissimilar data points.
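
As a quick check on the numbers quoted above, and assuming the reduction in
dimensionality is computed as $1 - m/p$ with $p = 3500$, a 96.4\% reduction
corresponds to
\begin{eqnarray}
m ~\approx~ (1 - 0.964) \times 3500 ~=~ 126 \nonumber
\end{eqnarray}
retained coordinates (the exact value is the one listed in
Table \ref{tab:zreg}).  Likewise, the risk inflation from fixing $m = 10$ or
$m = 20$ rather than using the optimal size is
\begin{eqnarray}
\frac{0.305}{0.193} \approx 1.58 \,, \quad \frac{0.209}{0.193} \approx 1.08
\quad {\rm (PC)}; \qquad
\frac{0.295}{0.134} \approx 2.20 \,, \quad \frac{0.191}{0.134} \approx 1.43
\quad {\rm (diffusion~map)}. \nonumber
\end{eqnarray}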
637: 
638: In Figure \ref{fig:zreg} we plot predictions and prediction
639: intervals for all galaxies in
640: our sample using our optimal diffusion map model.
641: (See Appendix \ref{sect:predint} for a discussion of prediction
642: intervals.)
643: Most of our predictions are in close correspondence with the SDSS
644: estimates.  We observe a positive correlation between $1-{\rm CL}$ and the
645: disparity between our redshift estimates and the SDSS estimates (Figure
646: \ref{fig:cl}), meaning that galaxies for which our estimates disagree
647: with SDSS estimates are more likely to be galaxies with low CL.
648: 
649: There are 54 outliers at the $4\sigma$ level. Visual inspection of
650: their spectra indicates that 39 appear to fit the template assigned by
651: SDSS.  Of these, 27 are well-described by the LRG template.  In
652: Figure \ref{fig:flux} we show that most of the outliers that are
653: well-fit by their SDSS templates are faint objects.  A plausible
654: explanation for their classification as outliers is low S/N in their
655: measured spectra.  Faint galaxies with strong emission lines will
656: generally have accurate SDSS redshifts but can be outliers in
657: the diffusion map because noisy spectra induce higher Euclidean
658: distances.  In a future paper we will introduce a method to account
659: for errors in the original measured data
660: that corrects both for errors in Euclidean distance
661: computations and random errors in the diffusion map coordinates.
662: 
663: 
664: The 15 other outliers show interesting and/or anomalous features.
665: Four spectra appear to be LRG type galaxies with abnormal emission
666: and/or absorption features, of which at least two are likely
667: attributable to calibration errors (see Figure \ref{fig:outliers}a,b).
668: One spectrum is clearly a QSO (Figure \ref{fig:outliers}c), one shows
669: only sky subtraction residuals (Figure \ref{fig:outliers}d), and two others are
670: obvious mismatches to their SDSS
671: templates due to absorption lines whose depths do not match their
672: assigned template.  Four outliers have abnormal bumps (possible
673: continuum jumps due to instrumental artifacts, see Figure
674: \ref{fig:outliers}e,f) that appear like wide emission features.
675: One outlying galaxy has a spectrum that looks like
676: a late-type galaxy with no emission lines, meaning it is likely a
677: K+A post-starburst galaxy.  Another outlier has an anomalous emission
678: feature around 6000~\AA\ in the rest frame (Figure \ref{fig:outliers}g).
679: This is a possible lens
680: galaxy, but was not selected by the Sloan Lens ACS Survey (SLACS;
681: \citeauthor{Bolton2006}) because
682: the feature in question
683: occurs in close proximity to strong sky lines at 8800~\AA.  The final
684: outlier has a strong, wide emission feature in the
685: vicinity of H$\alpha$ but has no emission lines anywhere else in the
686: SDSS spectrum (Figure \ref{fig:outliers}h). 
687: None of the outlying spectra show conclusive evidence of a wrong SDSS redshift
688: measurement (except for the aforementioned sky spectrum, which we
689: detect as a 30$\sigma$ outlier).
690: 
691: 
692: %Manual inspection of these spectra show that
693: %(a) 2 are obviously misclassified QSOs;
694: %%two have been to QSO spectra by the SDSS routines and
695: %%were mislabeled as galaxy spectra. 
696: %%{\bf (NOTE: 001 and 026)}  
697: %(b) 15 of these outliers have strong emission lines
698: %({\bf conclusion?}); and (c)
699: %%{\bf (NOTE: 000,002,003,006,007,016,017,021,022,031,037,042,047,048,053)}.  
700: %10 appear to have questionable
701: %$z_{\rm SDSS}$ values based on visual inspection.  
702: %%{\bf (NOTE: 007 (0.857), 008 (0.741), 011 (0.897), 023 (0.998), 024 (0.831), 025 (0.478), 043 (0.962), 046 (0.111), 050 (0.997), 055 (0.502) ....CL is in parentheses; a few of these have anomalous features but still might have correct z...Peter, can you take a look at these?)}
703: 
704: \subsection{Comparison With Other Methods}
705: 
706: As discussed in {\S}1, many authors have applied PCA to galaxy spectra
707: in an attempt to reduce the dimensionality of the data space, but few attempt
708: to find simple relationships between the reduced data and the physical
709: parameters of interest; these exceptions include 
710: \citeauthor{Li2005}, \citeauthor{Zhang2006}, and
711: \citeauthor{ReFiorentin2007}.
712: In all three cases, the authors use
713: PCA to estimate stellar and/or galactic parameters that are traditionally
714: estimated by laboriously measuring equivalent widths and fluxes
715: of individual lines, just as we have used diffusion map eigenfunctions
716: to estimate redshift, a physical parameter usually estimated through
717: computationally intensive cross-correlation methods.
718: We stress three advantages of our approach over those employed by the
719: above authors: 
720: 1) We achieve much lower prediction
721: error using diffusion map coordinates as compared to PCA,
722: 2) we have an objective way of selecting the parameters of
723:  the model, and 3) we use a theoretically well-motivated regression
724:  model which takes statistical variations of the data into account and
725:  which unifies the data parameterization and regression algorithms.
726: 
727: The aim of \citeauthor{Li2005}~is to estimate, e.g., the velocity 
728: dispersion and reddening of a set of approximately 1500 galaxies
729: observed by SDSS.
730: They use PCA in two successive applications.
731: They first apply PCA
732: to the STELIB library to reduce 204 stellar spectra to 24 stellar eigenspectra.
733: These in turn are fit to SDSS DR1 spectra to create a library of 1016 
734: galactic spectra, which are reduced to nine galactic eigenspectra.
735: The authors then regress observed equivalent widths (EW) and fluxes of
736: H$\alpha$ upon these nine eigenspectra.
737: They determine the number of eigenspectra to retain 
738: by estimating noise variance in the stellar case
739: and by using the $F$ test to compute the significance of each additional
740: eigenspectrum in spectral reconstruction in the galactic case.  The latter
741: criterion however is not well-suited to the task of parameter
742: estimation because
743: the appropriate number of components in the regression model depends
744: on the complexity of the dependence of those parameters as a function
745: of the basis elements, not on the complexity of the original spectra.
746: For example, the dependence of the EW of H$\alpha$ on the PC basis
747: functions may be a simple, smooth function while the flux dependence
748: may be a complex, bumpy relationship.  In this case, the optimal
749: regression model to predict EW would require fewer basis functions
750: than the optimal model for H$\alpha$ flux prediction.  Minimizing CV
751: risk would lead us to choose the correct number of basis functions for
752: each task, while the method of Li et al.\ would force us to use the same
753: (inappropriate) size for each model.
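
In symbols (a sketch in notation introduced here only for illustration),
CV-based selection chooses a separate model size for each response variable $y$,
\begin{eqnarray}
\widehat m_y ~=~ \arg\min_{m} \, \widehat{R}_{\rm CV}^{(y)}(m) \,, \nonumber
\end{eqnarray}
so that $\widehat m_{\rm EW}$ and $\widehat m_{\rm flux}$ need not coincide,
whereas a criterion based on spectral reconstruction returns a single $m$
regardless of which parameter is being predicted.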
754: 
755: \citeauthor{Zhang2006} attempt to predict stellar parameters by
756: regressing on PC coefficients using a kernel regression model with a
757:  variable window width. In their paper, they do not specify how to
758:  select the window 
759: width (they introduce an arbitrary parameter $\lambda$) or how to
760: choose the correct number of PC basis functions (they use 3).
761: Their choice of a small
762: model size is likely due to the computational and statistical
763: difficulties that characterize kernel regression in high dimensions
764: \citep{Wasserman2006}.
765: 
766: \citeauthor{ReFiorentin2007}~attempt to estimate
767: stellar atmospheric parameters (effective temperature, surface gravity,
768: and metallicity) from SDSS/SEGUE spectra.
769: They use PCA for dimension reduction, but set $m$ to an
770: arbitrary value (e.g., 50). 
771: They then use an iterative, non-linear regression model (utilizing the
772: hyperbolic tangent function; see \citealt{Bailer-Jones2000}),
773: with an error function based on the residual sum-of-squares plus
774: a regularization term (see their equation 2). Again, the
775: choice of the regularization parameter is not justified.
776: %This methodology is similar to that used in the
777: %neural network community ({\bf ANN: CONFIRM THIS}).
778: We find that when applied to the same data
779: set of galaxy spectra, their model does not achieve lower CV risk than
780: our model for different choices of regularization parameter and model size.
781: 
782: \section{Summary}
783: 
784: \label{sect:summary}
785: 
786: The purpose of this paper is two-fold.
787: First, we introduce the diffusion map method for data parametrization
788: and dimensionality reduction. We show
789: that for the types of high-dimensional and complex data sets
790: often analyzed in astronomy, diffusion map can yield
791: results far superior to those of commonly-used methods such as PCA.  Moreover,
792: the simple, intuitive formulation of diffusion map as a method that
793: preserves the local interactions of a high-dimensional data set makes the
794: technique easily accessible to scientists who are not well-versed in
795: statistics or machine learning.
796: 
797: Second, we present a fast and powerful eigenmode-based framework for
798: estimating physical parameters in databases of high-dimensional
799: astronomical data.  In most astrophysical applications, PCA is used as
800: a data-explorative tool for dimensionality reduction,
801: with no formal methods
802: and statistical criteria for regression, risk estimation and selection
803: of relevant eigenvectors. Here we propose a statistically rigorous,
804: unified framework for
805: regression and data parameterization.  Our proposed regression model
806: combines basis functions in a simple and statistically-motivated
807: manner while our clear objective of risk minimization drives the
808: estimation of the model parameters.  Again, the simplicity of the
809: proposed method will make it appealing to the non-specialist.
810: 
811:  We apply the proposed methodology to predict redshift for a sample of
812:  SDSS galaxy spectra, comparing the use of the proposed regression
813: model with PCA basis functions versus diffusion map basis functions. 
814: We find that the prediction error for the diffusion-map-based approach
815: is markedly smaller than that of a 
816: similar framework based on PCA. Our techniques are also more robust
817: than commonly used template matching
818: methods because they consider the structure of the entire
819: high-dimensional data set when reparametrizing the data.
820: Statistical inferences are based on this learned structure,
821: instead of considering each data point separately in an object-by-object
822: matching algorithm as is currently used by SDSS and commonly employed
823: throughout the astronomy literature.
824: Work in progress extends our approach to
825: photometric redshift estimation and to the estimation of the
826: intrinsic parameters (e.g., mean metallicities and ages) of galaxies.
827: 
828: \begin{acknowledgments}
829: The authors would like to thank Jeff Newman for helpful conversations.
830: This work was supported by NSF grant \#0707059 and ONR grant N00014-08-1-0673.
831: \end{acknowledgments}
832: 
833: \appendix
834: 
835: \section{Principal Components Analysis}
836: \label{sect:pca}
837: 
838: We first center our data (the normalized spectra with $p$ wavelength bins) so that $\frac{1}{n} \sum_{i=1}^{n} {\bf x}_i = 0$. The centered observations ${\bf x}_1, {\bf x}_2, \ldots, {\bf x}_n \in \mathbb{R}^p$ are then stacked into the rows of an $n \times p$ matrix ${\bf X}$. Note that the sample covariance matrix of $\bf x$ is given by the $p \times p$ matrix ${\bf S}= \frac{1}{n}{\bf X}^T{\bf X}$. In Principal Components Analysis (PCA), one computes the eigenvectors of the covariance matrix that correspond to the $m < p$ largest eigenvalues; denote these vectors by ${\bf v}_1, \ldots, {\bf v}_m \in \mathbb{R}^p$. In a PC map, the projections of the data onto these vectors are then used as new coordinates; i.e., the PC embedding of data point ${\bf x}_i$ is given by the map
839: $$ {\bf x}_i \mapsto \Psi_{\rm PCA}({\bf x}_i)=({\bf x}_i \cdot {\bf v}_1, \ldots, {\bf x}_i \cdot {\bf v}_m).$$ 
840: These projections are sometimes referred to as the principal components of ${\bf X}$.
841: 
842: Algorithmically, the PC embedding is easy to compute using a singular value decomposition (SVD) of ${\bf X}$:
843: $$ {\bf X=U D V}^T. $$
844: Here ${\bf U}$ is an $n \times p$ matrix with orthonormal columns, ${\bf V}$ is a $p \times p$ orthogonal matrix (whose columns are the eigenvectors ${\bf v}_1, \ldots, {\bf v}_p$ of ${\bf S}$), and ${\bf D}$ is a $p \times p$ diagonal matrix with diagonal elements $d_1 \geq d_2 \geq \ldots \geq d_p \geq 0$ known as the singular values of ${\bf X}$. Since ${\bf XV}={\bf UD}$, the PC embedding of the $i$th data point in $m$ dimensions is given by the first $m$ elements of the $i$th row of ${\bf UD}$.
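
The link between the SVD and the eigendecomposition of ${\bf S}$ can be
verified directly.  Using ${\bf U}^T{\bf U} = {\bf I}$ and the fact that
${\bf D}$ is diagonal,
$$ {\bf S} = \frac{1}{n}{\bf X}^T{\bf X} = \frac{1}{n}{\bf V D U}^T{\bf U D V}^T = {\bf V} \left( \frac{1}{n}{\bf D}^2 \right) {\bf V}^T, $$
so the columns of ${\bf V}$ are eigenvectors of ${\bf S}$ with eigenvalues
$d_j^2/n$, and ordering the components by singular value is the same as
ordering them by the variance they explain.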
845: 
846: \section{Prediction Intervals for Spectroscopic Redshift Estimates}
847: 
848: \label{sect:predint}
849: 
850: In any one fold of a ten-fold regression analysis, we fit to 90\% of the data,
851: generating predictions and prediction intervals 
852: for the 10\% of the data withheld from the analysis.  A prediction interval
853: is {\it not} a confidence interval; the former 
854: denotes a plausible range of values for a single observation, whereas the
855: latter denotes a plausible range of values for a parameter of the
856: probability distribution function from which that single observation is
857: sampled (e.g., the mean).
858: 
859: Let $\bf X$ and $\bf \tilde X$ represent the matrices of independent variables 
860: included in, and withheld from, regression analysis, respectively.  For
861: instance,
862: \begin{eqnarray}
863: {\bf \tilde X}~=~
864: \left(
865: \begin{array}{cccc}
866: \psi_1(x_1) & \cdots & \cdots & \psi_m(x_1) \\
867: \vdots      & \vdots & \vdots & \vdots \\
868: \psi_1(x_n) & \cdots & \cdots & \psi_m(x_n)
869: \end{array}
870: \right) \,, \nonumber
871: \end{eqnarray}
872: where $n$ is the number of withheld data and $m$ the number of
873: assumed basis functions.  (Here, we leave out factors of
874: $\lambda_j^t$, which are subsumed into the estimated
875: regression coefficients ${\widehat \beta}_j$.)  The vector of 
876: redshift predictions for the withheld data is thus
877: \begin{eqnarray}
878: {\widehat z}~=~{\bf \tilde X} {\widehat \beta} \,, \nonumber
879: \end{eqnarray}
880: where $\widehat \beta$ is estimated from ${\bf X}$,
881: while the vector of half-prediction intervals is given by
882: \begin{eqnarray}
883: t_{\alpha/2,N-n-2} \widehat{\sigma} \sqrt{ {\bf \tilde X} \left( {\bf X}^T {\bf X} \right)^{-1} {\bf \tilde X}^T + 1 + \frac{1}{N-n} } \,,
884: \label{eqn:predint}
885: \end{eqnarray}
886: where $\widehat{\sigma}$ is the estimated standard deviation of the
887: random noise $\epsilon$ in the relationship $Y = r({\bf X}) + \epsilon$,
888: estimated from the residuals of the regression of $Y$ upon ${\bf X}$,
889: $t_{\alpha/2,N-n-2}$ is the critical $t$-value for a two-sided
890: $100(1-\alpha)$\% prediction interval,
891: and $N$ is the total number of data points.  Equation (\ref{eqn:predint}) is 
892: a multi-dimensional generalization of, e.g., equation (2.26) of 
893: \citet{Weisberg2005}, taking into account that the mean of $\psi({\bf x})$ is
894: zero.
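
Since equation (\ref{eqn:predint}) places a matrix under the square root, it
may help to state how we read it for a single withheld point; the symbols
$\tilde{\bf x}_i$ and $h_i$ are introduced here only for illustration.  With
$\tilde{\bf x}_i^T$ denoting the $i$th row of ${\bf \tilde X}$, the
half-width for that point is
\begin{eqnarray}
h_i ~=~ t_{\alpha/2,N-n-2} \, \widehat{\sigma} \sqrt{ \tilde{\bf x}_i^T \left( {\bf X}^T {\bf X} \right)^{-1} \tilde{\bf x}_i + 1 + \frac{1}{N-n} } \,, \nonumber
\end{eqnarray}
i.e., only the diagonal elements of
${\bf \tilde X} ( {\bf X}^T {\bf X} )^{-1} {\bf \tilde X}^T$ enter, and the
prediction interval is ${\widehat z}_i \pm h_i$.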
895: 
896: \clearpage
897: 
898: \begin{thebibliography}{}
899: \bibitem[Adelman-McCarthy et al.(2008)]{Adelman2008} Adelman-McCarthy, J.~K., et al.~2008, \apjs, 175, 297
900: \bibitem[Bailer-Jones(2000)]{Bailer-Jones2000} Bailer-Jones, C.~A.~L.~2000, \aap, 357, 197
901: \bibitem[Bellman(1961)]{Bellman:61} Bellman, R.~E.~1961, Adaptive Control Processes (Princeton Univ. Press)
902: \bibitem[Boroson \& Green(1992)]{BorosonGreen1992} Boroson, T.~A., \& Green, R.~F.~1992, \apjs, 80, 109
903: \bibitem[Bolton et al.(2006)]{Bolton2006} Bolton, A.~S., et al.~2006, \apj, 638, 703
904: \bibitem[Coifman \& Lafon(2006)]{Coifman:Lafon:06} Coifman, R.~R., \& Lafon, S.~2006, Appl. Comput. Harmon. Anal., 21, 5
905: \bibitem[Connolly et al.(1995)]{Connolly1995} Connolly, A.~J., Szalay, A.~S., Bershady, M.~A., Kinney, A.~L., \& Calzetti, D.~1995, \aj, 110, 1071
906: \bibitem[Folkes et al.(1999)]{Folkes1999} Folkes, S., et al.~1999, \mnras, 308, 459
907: \bibitem[Kemeny \& Snell(1983)]{KemenySnell1983} Kemeny, J.~G., \& Snell, J.~L.~1983, Finite Markov Chains (Springer)
908: \bibitem[Lafon \& Lee(2006)]{LafonLee2006} Lafon, S., \& Lee, A.~2006, IEEE Trans. Pattern Anal. and Mach. Intel., 28, 1393
909: \bibitem[Li et al.(2005)]{Li2005} Li, C., Wang, T.-G., Zhou, H.-Y., Dong, X.-B., \& Cheng, F.-Z.~2005, \aj, 129, 669
910: \bibitem[Madgwick et al.(2003)]{Madgwick2003} Madgwick, D.~S., et al.~2003, \apj, 599, 997
911: \bibitem[Re Fiorentin et al.(2007)]{ReFiorentin2007} Re Fiorentin, P., et al.~2007, \aap, 467, 1373
912: \bibitem[Rogers et al.(2007)]{Rogers2007} Rogers, B., Ferreras, I., Lahav, O., Bernardi, M., Kaviraj, S., \& Yi, S.~K.~2007, \mnras, 382, 750
913: \bibitem[Ronen, Arag\'on-Salamanca, \& Lahav(1999)]{Ronen1999} Ronen, S., Arag\'on-Salamanca, A., \& Lahav, O.~1999, \mnras, 303, 284
914: \bibitem[Vanden Berk et al.(2006)]{VDB2006} Vanden Berk, D.~E., et al.~2006, \aj, 131, 84
915: \bibitem[Wasserman(2006)]{Wasserman2006} Wasserman, L.~W.~2006, All of Nonparametric Statistics (New York:Springer)
916: \bibitem[Weisberg(2005)]{Weisberg2005} Weisberg, S.~2005, Applied Linear Regression (Hoboken:Wiley)
917: \bibitem[Yip et al.(2004a)]{Yip2004a} Yip, C.~W., et al.~2004, \aj, 128, 585
918: \bibitem[Yip et al.(2004b)]{Yip2004b} Yip, C.~W., et al.~2004, \aj, 128, 2603
919: \bibitem[Zhang et al.(2006)]{Zhang2006} Zhang, J., Wu, F., Luo, A., \& Zhao, Y.~2006, ChJAA, 30, 176
920: \end{thebibliography}
921: 
922: 
923: % The figures
924: 
925: \begin{figure}
926: %\epsfig{figure=Fig3a.eps,height=2.3in} 
927: \epsscale{0.7}
928: \plotone{f1a.eps}
929: \vspace{0.7in}
930: \epsscale{0.9}
931: \plottwo{f1b.eps}{f1c.eps}
932: \caption{An example of a one-dimensional manifold (dashed line) with Gaussian noise embedded in
933: two or higher dimensions.  The path (solid line) from $\x$ to $\y$ reflects the natural geometry of
934: the data set which is captured by the
935: diffusion distance between $\x$ and $\y$. 
936: The plot on the lower left shows that the first diffusion map coordinate is a monotonically increasing
937: function of the
938: arc length of the spiral; this is not the case in the
939: lower right plot, which shows the same relationship for the first PC coordinate.}
940: \label{fig:spiral}
941: \end{figure}
942: 
943: \clearpage
944: \begin{figure}
945: \epsscale{0.75}
946: \plotone{f2.eps}
947: \caption{Distributions of SDSS redshift estimates in our
948: high-CL (top) and low-CL (bottom) samples.  We train our regression 
949: model using the 2793 high-CL galaxies only, then apply those
950: predictions to the 1042 low-CL galaxies.}
951: \label{fig:zdesign}
952: \end{figure}
953: 
954: \clearpage
955: \begin{figure}
956: %$\begin{array}{c}
957: %\epsfig{figure=zest_pcmap_ccode.ps,height=2.25in} \\
958: %\epsfig{figure=zest_dmap_ccode.ps,height=2.25in} \\
959: %\end{array}$\\
960: \epsscale{1}
961: \plottwo{f3a.eps}{f3b.eps}
962: \caption{Embedding of our sample of 2793 SDSS galaxy spectra with
963:   SDSS $z$ CL $> 0.99$ with
964: the first 3 PC and the first 3 diffusion map coordinates, respectively.
965: The color scale encodes $\log_{10}(1+z_{\rm SDSS})$.  Both
966: maps show a clear correspondence with redshift.}
967: \label{fig:zmaps}
968: \end{figure}
969: 
970: %\clearpage
971: %\begin{figure}
972: %%\epsfig{figure=outlier.eps,height=2.6in} 
973: %\epsscale{0.75}
974: %\plotone{f3.eps}
975: %\caption{SDSS galaxy spectrum (with {\tt OBJID}) identified as an outlier 
976: %($>$ 4$\sigma$) by the
977: %diffusion map-based regression, overlaid with SDSS template 29, which
978: %provided the highest CL $z_{\rm SDSS}$ estimate in template cross-correlation.
979: %The spectrum exhibits two anomalous features: a sharp, unexplained
980: %rise at low wavelengths and a broad emission feature at $\approx$ 4100 \AA.}
981: %\label{fig:out}
982: %\end{figure}
983: 
984: \clearpage
985: \begin{figure}
986: %\epsfig{figure=zpred_risk.eps,height=2.3in} \\
987: \epsscale{0.75}
988: \plotone{f4.eps}
989: \caption{Risk estimates ($\widehat{R}_{\rm CV}$) for regression of $z$ on diffusion 
990:   map coordinates and PCs. Diffusion map attains a lower 
991:   risk for almost every number of coordinates in the regression. It also 
992:   achieves a lower minimum risk as indicated by Table~\ref{tab:zreg}.
993: Risk estimates are based on 50 repetitions of 10-fold CV.  Thick lines
994: represent mean risk at that model size and thin dotted lines are $\pm 1$
995: standard deviation bands.}
996: \label{fig:zrisk}
997: \end{figure}
998: 
999: \clearpage
1000: \begin{figure}
1001: %\epsfig{figure=zpredictions.eps,height=4.6in} 
1002: \epsscale{0.6}
1003: \plotone{f5.eps}
1004: \caption{
1005:   Redshift predictions using diffusion map coordinates for galaxies
1006:   with SDSS  CL $\le$ 0.99 (top)
1007:   and CL $>$ 0.99 (bottom), each plotted against $z_{\rm SDSS}$.  
1008:   Error bars
1009:   represent 95\% prediction intervals.  Note that  CL $\le$ 0.99
1010:   redshift predictions are based on the model trained on CL $>$ 0.99
1011:   galaxies while CL $>$ 0.99 predictions are from 10-fold CV on CL
1012:   $>$ 0.99 galaxies.  For most galaxies, our
1013:   predictions are in close correspondence with SDSS estimates.}
1014: \label{fig:zreg}
1015: \end{figure}
1016: 
1017: \clearpage
1018: \begin{figure}
1019: \epsscale{0.6}
1020: \plotone{f6.eps}
1021: \caption{Discrepancy between our predicted redshift values and $z_{\rm
1022:     SDSS}$ estimates versus log(1-CL).  There is a 
1023:   correlation of 0.392 between the amount of discrepancy and 1-CL, meaning
1024:   that galaxies for which there are large differences between the two
1025:   redshift estimates tend to be objects whose SDSS redshift
1026:   confidences are low.  Horizontal lines denote 1, 3, and 5 $\sigma$
1027:   disparities.  Small random perturbations have been added to duplicate
1028:   log(1-CL) values to visualize galaxies with the same CL.  Galaxies with a
1029:   CL of 1.00 are assigned a mean log(1-CL) of $-4$.
1030: }
1031: \label{fig:cl}
1032: \end{figure}
1033: 
1034: \clearpage
1035: \begin{figure}
1036: \epsscale{0.6}
1037: \plotone{f7.eps}
1038: \caption{Discrepancy between our predicted redshift values and $z_{\rm
1039:     SDSS}$ versus log(flux) of the original spectra. There is a
1040:   correlation of $-0.327$ between the amount of discrepancy and galaxy
1041:   brightness. Galaxies can be detected as outliers even
1042:     if they match well to their SDSS template (in color).  Low S/N
1043:     can cause normal galaxies with correct SDSS redshifts to be labeled
1044:     as outliers.  We also detect several
1045:     physically interesting objects as outliers (see Figure \ref{fig:outliers}).
1046: }
1047: \label{fig:flux}
1048: \end{figure}
1049: 
1050: \clearpage
1051: \begin{figure}
1052: \epsscale{1}
1053: \plotone{f8.eps}
1054: \caption{Eight selected outliers with anomalous features.  Each
1055:   spectrum (solid blue) is plotted along with its SDSS template match
1056:   (dashed red).  Spectra are scaled to have the same sum of squared
1057:   (smoothed) fluxes over the same range of wavelengths.  For a
1058:   thorough discussion of
1059:   these outliers see {\S}\ref{sect:anal}.}
1060: \label{fig:outliers}
1061: \end{figure}
1062: 
1063: \clearpage
1064: 
1065: \input{tab1}
1066: 
1067: \end{document}
1068: 
1069: