1: \documentclass[preprint]{aastex}
2: %\documentclass[manuscript]{aastex}
3: %\documentclass{emulateapj}
4: \shorttitle{Exploiting Low-Dimensional Structure}
5: \shortauthors{Richards, Freeman, Lee, Schafer}
6: %\usepackage{epsfig}
7: \usepackage{color}
8: \newcommand{\x}{{\bf x}}
9: \newcommand{\y}{{\bf y}}
10: \newcommand{\z}{{\bf z}}
11: \newcommand{\W}{{\bf W}}
12: \newcommand{\new}{red}
13: \renewcommand{\P}{{\bf P}}
14:
15: \begin{document}
16:
17: \title{Exploiting Low-Dimensional Structure in Astronomical Spectra}
18: \author{Joseph W. Richards, Peter E. Freeman, Ann B. Lee, Chad M. Schafer}
19: \email{jwrichar@stat.cmu.edu}
20: \affil{Department of Statistics, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213}
21:
22: \begin{abstract}
23: Dimension-reduction techniques can greatly improve statistical inference in astronomy.
24: A standard approach is to use Principal Components Analysis (PCA).
In this work we apply a recently developed technique, diffusion maps, to astronomical
26: spectra for data parameterization and dimensionality reduction, and
27: develop a robust, eigenmode-based framework
28: for regression.
29: We show how our framework provides a computationally efficient means by which
30: to predict redshifts of galaxies, and thus could
31: inform more expensive redshift estimators
32: such as template cross-correlation. It also provides a natural means
33: by which to identify outliers (e.g., misclassified spectra, spectra
34: with anomalous features).
35: We analyze 3835 SDSS spectra and show how our framework
36: yields a more than 95\% reduction in dimensionality.
37: Finally, we show that the prediction error
38: of the diffusion map-based regression approach is markedly smaller than that of a similar
39: approach based on PCA, clearly demonstrating the superiority of diffusion
40: maps over PCA for this regression task.
41: \end{abstract}
42:
43: \keywords{galaxies: distances and redshifts --- galaxies: fundamental parameters --- galaxies: statistics --- methods: statistical --- methods: data analysis}
44:
45: \section{Introduction}
46:
47: \label{sect:intro}
48:
49: Galaxy spectra are classic examples of high-dimensional data, with
50: thousands of measured fluxes providing
51: information about the physical conditions of the observed object.
52: To make computationally efficient inferences about these
53: conditions, we need to first reduce the dimensionality of the data
54: space while preserving relevant physical information.
55: We then need to find simple relationships between the reduced data and physical parameters of
56: interest.
57: %, e.g., by introducing and estimating a regression function.
58: Principal Components Analysis (PCA, or the Karhunen-Lo\`eve transform) is a standard method for the first step; its application to astronomical spectra is described in, e.g., \citet{BorosonGreen1992},
59: \citet{Connolly1995}, \citet{Ronen1999}, \citet{Folkes1999},
60: \citet{Madgwick2003}, \citet{Yip2004a},
61: \citet{Yip2004b}, \citet{Li2005}, \citet{Zhang2006},
62: \citet{VDB2006}, \citet{Rogers2007}, and \citet{ReFiorentin2007}.
63: In most cases, the authors do not proceed to the second step but only
64: ascribe physical significance to the first few eigenfunctions from PCA
(such as the ``Eigenvector 1'' of \citeauthor{BorosonGreen1992}).
Notable exceptions are \citeauthor{Li2005}, \citeauthor{Zhang2006},
and \citeauthor{ReFiorentin2007}. However,
68: as we discuss in {\S}\ref{sect:app}, these authors combine
69: eigenfunctions in an ad hoc manner with no formal methods or
70: statistical criteria for regression and risk (i.e., error) estimation.
71:
72: In this work we present a unified framework for regression and data parameterization of astronomical spectra. The main idea is to describe
73: the important structure of a data set in terms of its
74: {\em fundamental eigenmodes}.
75: The corresponding eigenfunctions are used both as coordinates for the data
76: and as orthogonal basis functions for regression.
77: We also introduce the {\em diffusion map} framework
78: (see, e.g., \citealt{Coifman:Lafon:06}, \citealt{LafonLee2006})
79: to astronomy, comparing and contrasting it with PCA for regression analysis of SDSS galaxy spectra. PCA is a global method that finds linear low-dimensional
80: projections of the data; it attempts to preserve Euclidean distances between all data points and is often not robust to outliers.
81: The diffusion map approach, on the other hand, is non-linear and instead retains distances that reflect the (local) connectivity of the data.
82: This method is robust to outliers and is often able to unravel the intrinsic geometry and the natural (non-linear) coordinates of the data.
83:
84: In {\S}\ref{sect:diff} we describe the diffusion map method for data
85: parameterization.
86: In {\S}\ref{sect:regress} we introduce the technique of {\em adaptive regression} using eigenmodes.
87: In {\S}\ref{sect:app} we demonstrate the effectiveness of our proposed PCA- and
88: diffusion-map-based regression techniques for
89: predicting the redshifts of SDSS spectra.
90: % Text shifted to section 4
91: %Redshift prediction in SDSS DR6 is calculated by two methods:
92: %first, via wavelet analyses of continuum-subtracted spectra, where the
93: %continuum is estimated using a fifth-order polynomial, and second,
94: %by cross-correlating templates and observed spectra.\footnote{
95: %See {\scriptsize \tt http://www.sdss.org/dr6/algorithms/redshift\_type.html}.}
96: %In both cases, confidence levels\footnote{
97: %SDSS ``confidence levels"
98: %are functions of the strengths of observed lines and thus should
99: %not be interpreted probabilistically.}
100: %are computed, with the higher-CL redshift estimate assigned to the galaxy.
101: % shift end
102: %Template matching, in particular, is slow (ARE WE SURE ABOUT THIS?)
103: %and prone to error because
104: %the basis functions are not orthogonal.
105: %(SDSS manually inspects 8\% of estimates and changes 1\% of them.)
106: Our PCA- and diffusion-map-based approaches provide a fast and
107: statistically rigorous means of identifying
108: outliers in redshift data. The returned embeddings also provide an
109: informative visualization of the results. In {\S}\ref{sect:summary} we summarize our results.
110:
111: \section{Diffusion Maps and Data Parameterization}
112:
113: \label{sect:diff}
114: The variations in a physical system can sometimes be described by
115: a few parameters, while measurements of the system are
116: necessarily of very high dimension; geometrically, the data are
117: points in the $p$-dimensional space $\mathbb{R}^p$, with $p$ large.
118: In our case, a data point is a galaxy spectrum, with the
119: dimension $p$ given by the number of wavelength bins ($p \gtrsim 10^3$),
120: and a full data set could consist of hundreds of thousands of spectra.
121: To make inference and predictions tractable,
122: one seeks to find a simpler parameterization of the system. The most
123: common method for dimension reduction and data parameterization
is Principal Components Analysis (PCA), where the data are projected
125: onto a lower-dimensional hyperplane. For complex situations,
126: however, the assumption of linearity may lead to sub-optimal
127: predictions. A linear model pays very little attention to the
128: natural geometry and variations of the system. The top plot in Figure
129: \ref{fig:spiral}
130: illustrates this clearly by showing a data
131: set that forms a one-dimensional noisy spiral in $\mathbb{R}^2$.
132: Ideally, we would like to find a coordinate system that reflects
133: variations along the spiral direction, which is indicated by the
134: dashed line. It is obvious that any
135: projection of the data onto a line would be unsatisfactory. Results
of PCA applied to the noisy spiral are shown in the lower-right plot
137: in Figure \ref{fig:spiral}.
138:
139: In this section, we will use diffusion maps
140: (\citeauthor{Coifman:Lafon:06}, \citeauthor{LafonLee2006}) --- a non-linear technique ---
141: %for data parameterization, i.e.
142: to find a natural coordinate system for the data.
143: When searching for a lower-dimensional description, one needs to decide
144: what features to preserve and what aspects of the data one is
willing to lose. The diffusion map framework attempts to retain
the cumulative local interactions between data points, or
their ``connectivity'' in the context of a fictive diffusion process over the data.
We demonstrate that this approach can learn
the intrinsic geometry of a data set better than, e.g., PCA.
150: %which simply projects all data points onto a lower-dimensional hyperplane.
151:
152: Our strategy is to first define a distance metric $D(\x,\y)$ that reflects
153: the connectivity of two points $\x$ and $\y$, then find a map to a
154: lower-dimensional space (i.e., a new data parameterization) that
155: best preserves these distances.
156: (As before, a ``point'' in $p$-dimensional space represents
157: a complete astronomical spectrum of $p$ wavelength bins.)
158: The general idea is that we call two data points ``close'' if there
159: are many short paths between $\x$ and $\y$ in a jump diffusion process between data points.
160: In Figure \ref{fig:spiral}, the Euclidean distance
161: between two points is an inappropriate measure of
162: similarity. If, instead, one imagines a random walk starting at ``$\x$,'' and
163: only stepping to immediately adjacent points, it is clear that
164: %it would take a long time for that walk to reach ``$\y$.''
165: the time it would take for that walk to reach ``$\y$'' would reflect
166: the length along the spiral direction. This latter distance measure
167: is represented by the solid path from $\x$ to $\y$ in Figure \ref{fig:spiral}.
168: We will make this measure of connectivity formal in what follows.
169:
170: The starting point is to construct a weighted graph where the
171: nodes are the observed data points. %(the spectra). repetitive
172: %(i.e., in our case each node is a spectrum).
173: The weight given to the edge connecting $\x$ and $\y$ is
174: \begin{equation}
175: w(\x,\y) = \exp\left(-\frac{s(\x,\y)^2}{\epsilon}\right),
176: \label{eqn:diffw}
177: \end{equation}
178: where $s(\x,\y)$ is a locally relevant similarity measure.
179: For instance, $s(\x,\y)$ could be chosen as
180: the Euclidean distance between $\x$ and $\y$ (denoted here $\|\x-\y\|$)
181: when $\x$ and $\y$ are vectors.
However, the choice of $s(\x,\y)$ is not crucial, and this gets to the heart
183: of the appeal of this approach:
184: it is often simple to determine whether or not two data points are %very
185: ``similar'',
186: and many choices of $s(\x,\y)$ will suffice for measuring this
187: local similarity.
188: The tuning parameter $\epsilon$ is chosen small enough that
189: $w(\x,\y) \approx 0$ unless $\x$ and $\y$ are similar,
190: %only local similarities are computed,
191: but large enough such that the constructed graph is fully connected.
192:
193: The next step is to use these weights to build a Markov random walk on
194: the graph. From node (data point) $\x$, the probability of stepping
195: directly to $\y$ is defined naturally as
196: \begin{equation}
197: p_1(\x,\y) = \frac{w(\x,\y)}{\sum_{\z}w(\x,\z)}.
198: \label{eqn:diffp}
199: \end{equation}
200: %\begin{equation}
201: %p_1(x,y) = \frac{w(x,y)}{\sum_{z \in \Omega}w(x,z)} \,.
202: %\label{eqn:p}
203: %\end{equation}
204: This probability is close to zero unless $\x$ and $\y$ are similar. Hence, in
205: one step the random walk will move only to very similar nodes (with high
206: probability). These one-step transition probabilities are stored in the $n$ by $n$
207: matrix $\P$.
208: It follows from standard theory of Markov chains (\citealt{KemenySnell1983}) that, for a positive integer $t$, the element
209: $p_t(\x,\y)$ of
210: the matrix power $\P^t$ gives the probability of
211: moving from $\x$ to $\y$ in $t$ steps.
212: Increasing $t$ moves the random walk
213: forward in time, propagating the local influence of a data point
214: (as defined by the kernel $w$)
215: with its neighbors.
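
As a concrete illustration (a toy example constructed here, not part of the SDSS analysis), consider three points on a line, $\x_1 = 0$, $\x_2 = 1$, and $\x_3 = 2$, with $s$ taken to be the Euclidean distance and $\epsilon = 1/\ln 2$, a value assumed purely so that $w(\x,\y) = 2^{-s(\x,\y)^2}$ and the arithmetic is clean. The weight matrix $\W$ and the one-step transition matrix $\P$ are then
\begin{displaymath}
\W = \left(\begin{array}{ccc} 1 & 1/2 & 1/16 \\ 1/2 & 1 & 1/2 \\ 1/16 & 1/2 & 1 \end{array}\right), \qquad
\P = \left(\begin{array}{ccc} 16/25 & 8/25 & 1/25 \\ 1/4 & 1/2 & 1/4 \\ 1/25 & 8/25 & 16/25 \end{array}\right),
\end{displaymath}
where each row of $\P$ is the corresponding row of $\W$ divided by its sum ($25/16$, $2$, and $25/16$, respectively). A direct step from $\x_1$ to $\x_3$ has probability $p_1(\x_1,\x_3) = 1/25 = 0.04$, whereas the two-step probability is the $(1,3)$ element of $\P^2$,
\begin{displaymath}
p_2(\x_1,\x_3) = \frac{16}{25}\cdot\frac{1}{25} + \frac{8}{25}\cdot\frac{1}{4} + \frac{1}{25}\cdot\frac{16}{25} = \frac{82}{625} \approx 0.13,
\end{displaymath}
with the largest contribution ($0.08$) coming from the path through the intermediate point $\x_2$.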
216: % so as eventually to form a global representation of the
217: %geometry of the data.
218:
219: For a fixed time (or scale) $t$, $p_t(\x,\cdot)$ is a vector representing
220: the distribution after $t$ steps of the random walk over the nodes of the
221: graph, conditional on the
222: walk starting at $\x$.
223: In what follows, the points $\x$ and $\y$ are
224: close if the conditional distributions
$p_t(\x,\cdot)$ and $p_t(\y,\cdot)$ are similar.
226: Formally, the diffusion distance at a scale $t$ is defined as
227: \begin{equation}
228: D_t^2(\x,\y) = \sum_{\z} \frac{\left(p_t(\x,\z) - p_t(\y,\z)\right)^2}{\phi_0(\z)}
229: %D_t^2(x,y) = ||p_t(x,\cdot) - p_t(y,\cdot)||^2_2
230: \label{eqn:diffdist}
231: \end{equation}
where $\phi_0(\cdot)$ is the stationary distribution of the random walk, i.e.,
$\phi_0(\z)$ is the long-run proportion of the time the walk spends at
node $\z$.
235: Dividing by $\phi_0(\z)$ serves to reduce the influence of nodes
236: which are visited with high probability regardless of the starting point of the
237: walk.
238: %{\bf (Change the above; the 2 over 2 nomenclature is unclear.)}
239: The distance $D_t(\x,\y)$ will be small only if $\x$ and $\y$ are connected by
240: many short paths with large weights. This construction of
241: a distance measure is robust to noise and outliers because it
242: simultaneously accounts for the cumulative effect of {\em all} paths between the
243: data points.
244: Note that the geodesic distance (the shortest path in a graph), on the other hand, often takes shortcuts due to noise.
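
To make the definition concrete, consider a toy configuration constructed for illustration (it is not drawn from the SDSS analysis): three points $\x_1 = 0$, $\x_2 = 1$, $\x_3 = 2$ on a line, with Euclidean $s$ and the assumed value $\epsilon = 1/\ln 2$. The one-step distributions are $p_1(\x_1,\cdot) = (16/25,\, 8/25,\, 1/25)$, $p_1(\x_2,\cdot) = (1/4,\, 1/2,\, 1/4)$, and $p_1(\x_3,\cdot) = (1/25,\, 8/25,\, 16/25)$, and the stationary distribution is $\phi_0 = (25/82,\, 32/82,\, 25/82)$. Equation~(\ref{eqn:diffdist}) with $t=1$ then gives
\begin{eqnarray*}
D_1^2(\x_1,\x_3) & = & \frac{(3/5)^2}{25/82} + 0 + \frac{(3/5)^2}{25/82} = \frac{1476}{625} \approx 2.36, \\
D_1^2(\x_1,\x_2) & = & \frac{(0.39)^2}{25/82} + \frac{(0.18)^2}{32/82} + \frac{(0.21)^2}{25/82} \approx 0.73,
\end{eqnarray*}
so that $D_1(\x_1,\x_2) \approx 0.85$ and $D_1(\x_1,\x_3) \approx 1.54$: the two end points are farther apart in diffusion distance than either is from the middle point, as one would hope.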
245:
246: % Fig3a was here.
247:
248: The final step is to find a low-dimensional embedding of the data where Euclidean distances reflect diffusion distances.
249: %In applying this technique for dimensionality reduction,
250: %the data set attribute
251: %we wish to preserve is the diffusion distance between all
252: %points.
253: A biorthogonal spectral decomposition of the matrix $\P^t$ gives
254: %\begin{equation}
255: %p_t(x,y) = \sum_{j \ge 0} \lambda_j^t \psi_j(x) \phi_j(y) \,,
256: %\label{eqn:diffdecomp}
257: %\end{equation}
258: $p_t(\x,\y) = \sum_{j \ge 0} \lambda_j^t \psi_j(\x) \phi_j(\y)$,
259: where $\phi_j$, $\psi_j$, and $\lambda_j$, respectively, represent left eigenvectors, right eigenvectors and eigenvalues
260: of $\P$. It follows that
261: \begin{equation}
262: D^2_t(\x,\y)~= ~\sum_{j=1}^{\infty} \lambda_j^{2t}(\psi_j(\x)-\psi_j(\y))^2.\label{eq:Dt}
263: \end{equation}
264: %{\bf (ANN: How about putting a proof of Equation (4) as an Appendix?)}
265: The proof of Equation~\ref{eq:Dt} and the details of the computation
266: and normalization of the eigenvectors $\phi_j$ and $\psi_j$ are given in
267: \citeauthor{Coifman:Lafon:06} and
\citeauthor{LafonLee2006}.\footnote{Sample code in Matlab and R for
diffusion maps is available at {\tt http://www.stat.cmu.edu/\~{}annlee/software.htm}.} By retaining the
270: $m$ eigenmodes corresponding to the $m$ largest nontrivial
271: eigenvalues and by introducing the diffusion map
272: \begin{equation}
273: \Psi_t: \x \mapsto [\lambda_1^t\psi_1(\x), \lambda_2^t\psi_2(\x), \cdots,\lambda_m^t\psi_m(\x)]
274: \label{eqn:diffusion_map}
275: \end{equation}
276: from $\mathbb{R}^p$ to $\mathbb{R}^m$, we have that %(see \citeauthor{Coifman:Lafon:06})
277: %\begin{eqnarray}
278: \begin{equation}
279: D^2_t(\x,\y)~\simeq ~\sum_{j=1}^m \lambda_j^{2t}(\psi_j(\x)-\psi_j(\y))^2 ~=~||\Psi_t(\x) - \Psi_t(\y)||^2 \,,
280: \label{eqn:diffpres}
281: \end{equation}
282: i.e., Euclidean distance in the $m$-dimensional embedding defined by equation~\ref{eqn:diffusion_map}
283: %lower-dimensional space $\mathbb{R}^m$,
284: approximates diffusion distance.
285: In contrast, Euclidean distances in PC maps approximate the original
286: Euclidean distances $\|\x-\y\|$.
287: Again, consider the example in Figure \ref{fig:spiral}.
288: The plot on the lower left shows that the first diffusion map coordinate is a monotonically increasing
289: function of the
290: arc length of the spiral; this is not the case in the
291: lower right plot, which shows the same relationship for the first PC coordinate. Indeed, the relationship
292: with the first PC coordinate is not even one-to-one.
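
The same quantities can be checked by hand on a toy configuration (constructed here for illustration, with all numerical choices assumed): three points $\x_1 = 0$, $\x_2 = 1$, $\x_3 = 2$ on a line, Euclidean $s$, and $\epsilon = 1/\ln 2$, for which the rows of $\P$ are $(16/25,\, 8/25,\, 1/25)$, $(1/4,\, 1/2,\, 1/4)$, and $(1/25,\, 8/25,\, 16/25)$, and $\phi_0 = (25/82,\, 32/82,\, 25/82)$. With the right eigenvectors normalized so that $\sum_{\z} \psi_j(\z)^2 \phi_0(\z) = 1$ (the convention assumed here; see the references above for details), one finds
\begin{displaymath}
\lambda_0 = 1, \;\; \psi_0 = (1,\, 1,\, 1); \qquad
\lambda_1 = \frac{3}{5}, \;\; \psi_1 = \frac{\sqrt{41}}{5}\,(1,\, 0,\, -1); \qquad
\lambda_2 = \frac{9}{50}, \;\; \psi_2 = (0.8,\, -1.25,\, 0.8).
\end{displaymath}
Retaining only $m=1$ eigenmode with $t=1$, the diffusion map sends the three points to $\lambda_1\psi_1 \approx (0.77,\, 0,\, -0.77)$, which preserves their ordering along the line. The embedded squared distances are
\begin{eqnarray*}
||\Psi_1(\x_1) - \Psi_1(\x_3)||^2 & = & \frac{9}{25}\cdot 4 \cdot \frac{41}{25} = \frac{1476}{625} \approx 2.36, \\
||\Psi_1(\x_1) - \Psi_1(\x_2)||^2 & = & \frac{9}{25}\cdot\frac{41}{25} = \frac{369}{625} \approx 0.59.
\end{eqnarray*}
The first value equals the exact $D_1^2(\x_1,\x_3)$, because $\psi_2(\x_1) = \psi_2(\x_3)$. The second falls short of the exact $D_1^2(\x_1,\x_2) \approx 0.73$ by the neglected $j=2$ term, $(9/50)^2 (2.05)^2 \approx 0.14$.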
293:
294: The choice of the parameters $m$ and $t$ is determined by the fall-off of the eigenvalue spectrum as well
295: as the problem at hand (e.g., clustering, classification, regression,
296: or data visualization). An objective measure
297: of performance should be defined and utilized to find data-driven best choices for these tuning parameters.
298: In this work, the final goal
299: is regression and prediction of redshift. In the next section, we show how the number of coordinates, $m$, can
be chosen by cross-validation, once one has defined an appropriate statistical ``risk'' function. The particular
301: choice of $t$, on the other hand, will not matter in the regression framework, as it will only represent a
302: rescaling of the $m$ selected basis vectors.
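
To spell out this last point, suppose (as an assumption made here for illustration) that the regression is an unpenalized linear fit in the $m$ retained coordinates and that the corresponding eigenvalues are non-zero. Any linear combination of the rescaled coordinates can then be rewritten as a linear combination of the eigenvectors themselves,
\begin{displaymath}
\sum_{j=1}^{m} \gamma_j \left[\lambda_j^t \psi_j(\x)\right] = \sum_{j=1}^{m} \beta_j \psi_j(\x), \qquad \beta_j = \gamma_j \lambda_j^t,
\end{displaymath}
so that changing $t$ changes the fitted coefficients but leaves the set of attainable fitted functions, and hence the predictions, unchanged.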
303:
304: \section{Adaptive Regression Using Orthogonal Eigenfunctions}
305: \label{sect:regress}
Our next problem is how to predict, in a statistically rigorous way, a function $y=r(\mathbf{x})$ (e.g., redshift, age, or metallicity of galaxies) of data
307: (e.g., spectrum $\mathbf{x}$) in very high dimensions using a sample
308: of known pairs ($\x,y$). As before, imagine that our data are points in $\mathbb{R}^p$, but that the
309: natural variations in the system are along a low dimensional space $\mathcal{X} \subset \mathbb{R}^p$.
310: %In other words, $p$ is very large but the intrinsic dimension of $\mathcal{X}$, which is determined by the natural variations of the system, is considerably
311: %smaller.
312: The set $\mathcal{X}$ could, for example, be a non-linear submanifold embedded in $\mathbb{R}^p$.
313: In our toy example in Figure \ref{fig:spiral}, $\mathcal{X}$ is the one-dimensional spiral, but the data are observed
314: in $p=2$ dimensions.
315: The key idea is that one may view the eigenfunctions from PCA or diffusion maps
316: (a) as {\em coordinates} of the data points, as shown in the previous section,
317: or (b) as forming a {\em Hilbert orthonormal basis} for any function (including the regression function $r(\mathbf{x})$) supported on the
318: subset $\mathcal{X}$. Rather than applying an arbitrarily chosen prediction scheme in the computed diffusion or PC space (as in, e.g., \citeauthor{Li2005}, \citeauthor{Zhang2006}, and \citeauthor{ReFiorentin2007}), we utilize the latter insight to formulate a general regression and risk estimation framework. %for high-dimensional inference.
319:
Any function $r$ satisfying $\int r(\x)^2 \, d\x < \infty$, where $\x \in \mathcal{X}$, can be written as
321: \begin{equation}
322: r(\x) = \sum_{j=1}^{\infty} \beta_j \psi_j(\x) \,,
323: \label{eqn:orthonorm}
324: \end{equation}
325: where the sequence of functions $\{\psi_1,\psi_2,\cdots\}$ forms an
326: orthonormal basis. The choice of basis functions is traditionally {\em not} adapted to the geometry of the data, or the set $\mathcal{X}$.
327: Standard choices are, for example, Fourier or wavelet bases for $\mathbf{L}^2(\mathbb{R}^p)$, which are constructed as tensor
328: products of one-dimensional bases. The latter approach makes sense for low dimensions, for example for $p=2$, but quickly becomes
329: intractable as $p$ increases (see, e.g., \citealt{Bellman:61} for the ``curse of dimensionality''). In particular, note that if a wavelet basis
330: in one dimension consists of $q$ basis functions, and hence
331: requires the estimation of $q$ parameters, the naive tensor basis in $p$ dimensions will have $q^p$ basis functions/parameters,
332: creating an impossible inference problem even for moderate $p$.
333: Because this basis is not adapted to $\mathcal{X}$, there is little hope of
334: finding a subset of these basis functions which will
335: do an adequate job of modeling the response.
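
For a sense of scale, take the purely illustrative value $q = 8$ (this number is assumed here and does not come from the analysis). The tensor-product basis then has
\begin{displaymath}
8^2 = 64 \;\; (p=2), \qquad 8^{10} \approx 1.1 \times 10^{9} \;\; (p=10), \qquad 8^{3500} > 10^{3160} \;\; (p=3500)
\end{displaymath}
elements, where $p = 3500$ is the number of wavelength bins retained per spectrum in {\S}\ref{sect:app}.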
336: %although for any particular problem
337: %one strives to represent any sufficiently smooth function with
338: %as small a subset of basis functions as possible.
339:
340: In this work, we propose a new adaptive framework where the basis functions reflect the intrinsic geometry of the data. Furthermore, we use a formal statistical method to estimate the risk and the optimal parameters in the model. First, rather than using a generic tensor-product basis for the high-dimensional space $\mathbb{R}^p$, we
341: construct a data-driven
342: basis for the lower-dimensional, possibly non-linear set $\mathcal{X}$ where the data lie.
343: Let $\{{\psi_1},{\psi_2},\cdots,{\psi_n}\}$ be the orthogonal eigenfunctions computed by PCA or diffusion maps.
344: Our regression function estimate $\widehat{r}(\x)$ is then given by
345: \begin{equation}
346: \widehat{r}(\x) = \sum_{j=1}^{m} \widehat{\beta_j} {\psi_j}(\x),
347: \label{eqn:orthoreg}
348: \end{equation}
349: %equation~(\ref{eqn:orthoreg}),
350: where the different terms in the series expansion represent the
351: fundamental eigenmodes of the data, and $m \leq n$ is chosen to
352: minimize the prediction risk that we will now define rigorously.
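
One standard way to obtain the coefficients, assumed here for illustration since the estimator has not been specified at this point, is ordinary least squares. Let ${\bf B}$ be the $n \times m$ matrix with elements $B_{ij} = \psi_j(\x_i)$ and let ${\bf Y}$ be the vector of the $n$ observed responses (both symbols are introduced here). Then
\begin{displaymath}
\widehat{\beta} = \left({\bf B}^T {\bf B}\right)^{-1} {\bf B}^T {\bf Y}.
\end{displaymath}
If the basis vectors are orthonormal in the empirical sense, $\frac{1}{n}\sum_{i=1}^{n} \psi_j(\x_i)\psi_k(\x_i) = \delta_{jk}$, this reduces to $\widehat{\beta_j} = \frac{1}{n}\sum_{i=1}^{n} Y_i \psi_j(\x_i)$, so that adding a further basis function to the model leaves the previously estimated coefficients unchanged.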
353:
354: \subsection{Risk: Theory and Estimation}
355: \label{sect:risk}
356:
357: A key aspect of our approach is that the choice of the models is driven by the minimization of a well-justified, objective error criterion
358: which compensates for overfitting. This is critical, as any basis could be utilized to fit the observed data well; this does not provide,
359: however, any assurance that the model applies beyond these data.
360: To begin, we establish the standard stochastic framework within which regression models are assessed.
361: We are given $n$ pairs of observations $(X_1,Y_1), \ldots, (X_n, Y_n)$, with the task of predicting the
362: response $Y=r(X)+\epsilon$ at a new data point $X=\x$, where $\epsilon$ represents random noise.
363: (In {\S}\ref{sect:app}, the response $Y$ is the redshift, $z$, and $X$ is a complete spectrum.)
364: In nonparametric regression by orthogonal functions,
365: one assumes that $r(\x)$ is given
366: according to equation~(\ref{eqn:orthonorm}), with its estimator given
367: by equation~(\ref{eqn:orthoreg}), with $m \leq n$ where $\{\psi_j\}$
368: is a fixed basis.
369: %An estimator of $r(\x)$ typically has the form
370: %\begin{equation}
371: %\widehat{r}(\x)=\sum_{j=1}^{m} \widehat{\beta_j} \psi_j(\x),
372: %\label{eqn:orthoreg}
373: %\end{equation}
374: %where $m \leq n$ and $\{\psi_j\}$ is a fixed basis.
375: The primary goal is to minimize the
376: {\em prediction risk} (i.e., expected error), commonly quantified by
377: the mean-squared error (MSE)
378: \begin{equation}
R(m)=\mathbb{E}\left[\left(Y-\widehat{r}(X)\right)^2\right],
380: \label{eqn:MSE}
381: \end{equation}
where the expectation $\mathbb{E}[\cdot]$ averages over everything that is random:
the evaluation points $X$, the
responses $Y$, and the estimates $\widehat{\beta_j}$.
This leads to protection against overfitting: if a basis
function $\psi_j$ is unnecessarily included in the model,
its coefficient $\widehat{\beta_j}$ will only add variance to
$\widehat{r}(X)$ and not improve the fit, hence increasing $R(m)$.
391: (On the other hand, as $m$ becomes too small,
392: the estimator becomes increasingly biased, also increasing $R(m)$.)
393: Thus, the ideal choice of $m$ is neither too large, nor too small.
In nonparametric statistics, this is dubbed the ``bias-variance tradeoff''
395: (see, e.g., \citealt{Wasserman2006}).
396: A secondary goal is {\em sparsity}; more specifically,
397: among the estimators with a small risk,
398: we prefer representations with a smaller $m$.
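
The tradeoff can be written explicitly under two additional assumptions made here for illustration: the noise has mean zero and constant variance $\sigma^2_{\rm noise}$, and the new pair $(X,Y)$ is independent of the sample used to compute $\widehat{r}$. The mean-squared prediction error then decomposes as
\begin{displaymath}
R(m) = \sigma^2_{\rm noise} + \mathbb{E}_X\left[\left(\mathbb{E}[\widehat{r}(X) \mid X] - r(X)\right)^2\right] + \mathbb{E}_X\left[{\rm Var}\left(\widehat{r}(X) \mid X\right)\right],
\end{displaymath}
where the inner expectation and variance are taken over the training sample with the evaluation point $X$ held fixed. The first term is irreducible, the second (squared bias) typically decreases as $m$ grows, and the third (variance) typically increases with $m$.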
399:
400: Since $R(m)$ is a population quantity, one needs to appropriately estimate it from the data.
An estimate that evaluates the fit on the same data used to compute it will underestimate the error and favor overly complex models with high variance.
402: Here we will use the method of $K$-fold cross-validation
403: (see, e.g., \citeauthor{Wasserman2006}) to achieve
404: a better estimate of the prediction risk. The basic idea is to randomly split the data set into $K$ blocks
405: of approximately the same size; $K=10$ is a common choice. For $k=1$ to $K$, we delete block $k$ from the data. We then fit the model to the
406: remaining $K-1$ blocks and compute the observed squared error $\widehat{R}_{(-k)}(m)$ on the $k$th block which was not included in the fit. The CV estimate of the risk is defined as $\widehat{R}_{CV}(m)=\frac{1}{K}\sum_{k=1}^{K} \widehat{R}_{(-k)}(m)$.
407: It can be shown that this quantity is an approximately unbiased estimate of the true error $R(m)$.
408: Thus, we choose the model parameters that minimize the CV estimate $\widehat{R}_{CV}(m)$ of the risk, i.e.,
409: we take $m_{\rm opt} = \arg \min \widehat{R}_{CV}(m)$.
410:
411: Finally, we note that the ideas of CV introduced here generalize to cases where the model
412: parameters are of higher dimension. For example, in the diffusion
413: map case, the risk is minimized over both the bandwidth $\epsilon$ and the number of eigenfunctions $m$. The CV estimate of the
414: risk is implemented in the same fashion, but the search space for finding the minimum is larger.
415: In what follows, the notation will make it clear which
416: model parameters we are minimizing over by writing, for
417: example, $R(\epsilon, m)$.
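
In symbols, with $\mathcal{E} = \{\epsilon_1, \ldots, \epsilon_G\}$ a hypothetical grid of candidate bandwidths and $m_{\rm max}$ a hypothetical upper limit on the model size (both introduced here for illustration), the selected parameters are
\begin{displaymath}
(\epsilon_{\rm opt}, m_{\rm opt}) = \arg \min_{\epsilon \in \mathcal{E},\; 1 \leq m \leq m_{\rm max}} \widehat{R}_{CV}(\epsilon, m).
\end{displaymath}
In one plausible implementation, the eigenmodes are computed once for each candidate $\epsilon$, after which $m$ can be swept at little extra cost because the $m$-term model uses the first $m$ of them.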
418:
419: To summarize, our claim is that the proposed regression framework will lead to efficient inference in high
420: dimensions, as we are effectively performing regression in a lower-dimensional space $\mathcal{X}$ that
421: captures the natural variations of the data, where the optimal
422: dimensionality is chosen to minimize prediction risk in our regression
423: task. Finally, the use of eigenfunctions in both the data parameterization
424: and in the regression formulation provides an elegant, unifying framework for analysis and prediction.
425:
426: %Here, $J \leq m$ and is chosen by using an appropriate risk
427: %estimator, such as cross-validation (see, e.g., \citet{Wasserman2007}),
428: %rather than in ad hoc manner of, e.g., \citeauthor{Li2005},
429: %\citeauthor{Zhang2006}, and \citeauthor{ReFiorentin2007}
430: %The smoother the true regression function $r$, the fewer basis terms
431: %$J$ will be needed to represent it.
432: %The estimated orthonormal basis $\{\hat{\psi}\}$ {\bf SHOULD} converge more
433: %quickly to the true underlying basis in (\ref{orthonorm}) than an
434: %arbitrarily chosen basis. We {\bf SHOULD} thus be able to obtain better
435: %estimates of $r$ than by using PCA or diffusion mapping eigenfunctions
436: %that by using an arbitrary basis.
437:
438: \section{Redshift Prediction Using SDSS Spectra}
439:
440: \label{sect:app}
441:
442: We apply the formalism presented in {\S}{\S}\ref{sect:diff}-\ref{sect:regress}
443: to the problem of predicting redshifts for a sample of SDSS spectra.
444: Physically similar objects residing at similar redshifts will have
445: similar continuum shapes as well as absorption lines occurring at
446: similar wavelengths. Hence the %$\mathbf{L}^2$
447: Euclidean distances between their spectra will be small.
448: The proposed regression framework with diffusion map or PC
449: coordinates provides a natural means by which to predict
450: redshifts. Furthermore, it is computationally efficient, making its
451: use appropriate for large databases such as the SDSS;
452: one can use these predictions to
453: inform more computationally expensive techniques by narrowing down
454: the relevant parameter space (e.g., the redshift range or the
455: set of templates in cross-correlation techniques).
456: Adaptive regression also provides a useful
457: tool for quickly identifying anomalous data points (e.g., objects
458: misclassified as galaxies), galaxies that have relatively rare
459: features of interest, and
460: galaxies whose SDSS redshift estimates may be incorrect.
461:
462: \subsection{Data Preparation}
463:
464: Our initial data sample consists of spectra that are classified as galaxies
465: from ten arbitrarily chosen spectroscopic plates of SDSS DR6
466: (0266$-$0274 inclusive, and 0286; \citealt{Adelman2008}).
467: We remove spectra from this sample by applying three cuts. The first
468: is motivated by aperture considerations: we analyze only those spectra
469: with SDSS redshift estimates $z_{\rm SDSS} \geq$ 0.05.
470: To include spectra with $z_{\rm SDSS} < 0.05$
471: would be to add an extra source of variation that would
472: adversely impact regression analysis. The second cut is based on bin flags.
473: To avoid calibration issues observed at both the low and high
474: wavelength ends, we remove the first 100 and last 250 wavelength bins
475: from each spectrum;
476: then we determine what proportion of the remaining 3500 bins are flagged
477: as bad. If this proportion exceeds 10\%, we remove the spectrum from the
478: sample; if not, we retain the reduced spectrum for further analysis.
479: We provide details on the third cut below.
480: The application of these cuts reduces our sample size from 5057
481: to 3835 galaxies.
482:
483: %(reducing the wavelength range to 3940-8850\AA)
484: %These spectra span the wavelength range 3800-9200\AA, with
485: %uniform binning in $\log_{10}$-space ($\Delta \log_{10} \lambda$ = 10$^{-4}$).
486:
487: We further process each spectrum in our sample as follows.
488: \begin{itemize}
489: \item We replace the flux values in the vicinity of
490: prominent atmospheric lines at 5577~\AA, 6300~\AA, and 6363~\AA~with
491: the sample mean of the nine closest bins on either side of each line.
492: The flux errors are estimated by averaging (in quadrature)
493: the standard errors of the fluxes for these bins.
494: \item We similarly replace the flux values in each bin flagged by SDSS as
495: part of an emission line, with flux and flux error estimates based
496: upon the closest 50 bins on either side of the line. (Within this group
497: of 100 bins, we do not include those that are themselves flagged as
498: emission lines.)
499: We do this because highly variable emission line strengths
500: can strongly bias distance calculations.
501: \item Last, after replacing flux values as necessary, we normalize
502: each spectrum to sum to 1 to mitigate variation due to differences in luminosity between
503: similar galaxies at similar redshifts.
504: \end{itemize}
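
These steps can be summarized in symbols; the notation is introduced here and represents one reading of the procedure rather than the exact implementation. If $\mathcal{N}$ denotes the set of neighboring bins used for a replacement (e.g., the 18 bins formed by the nine nearest on either side of an atmospheric line), the replaced flux and one possible interpretation of the quadrature-averaged error are
\begin{displaymath}
\tilde{f} = \frac{1}{|\mathcal{N}|} \sum_{k \in \mathcal{N}} f_k, \qquad
\tilde{\sigma} = \sqrt{\frac{1}{|\mathcal{N}|} \sum_{k \in \mathcal{N}} \sigma_k^2},
\end{displaymath}
and the final normalization maps each flux to
\begin{displaymath}
f_k \mapsto \frac{f_k}{\sum_{j=1}^{p} f_j},
\end{displaymath}
where $p$ is the number of retained wavelength bins.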
505:
506: In its data reduction pipeline, SDSS estimates spectroscopic redshifts,
507: $z_{\rm SDSS}$, standard errors, $\sigma_{z_{\rm SDSS}}$, and
``confidence levels,'' CL, the latter of which are functions
509: of the strengths of observed lines (and thus should
510: not be interpreted probabilistically).\footnote{
511: See {\tt http://www.sdss.org/dr6/algorithms/redshift\_type.html}.}
512: Lacking knowledge of the true redshifts in our sample, we use
513: $z_{\rm SDSS}$ and $\sigma_{z_{\rm SDSS}}$ to fit our regression model.
514: Since poorly estimated redshifts can bias the model,
515: we divide our data sample into two groups, fitting with only
516: those 2793 galaxies with CL $>$ 0.99.
517: We then use the fitted model to predict redshifts for the other 1042 galaxies.
518: (It is here that we make our third data cut: to avoid issues of extrapolation,
we remove 19 of 1061 spectra with CL $\leq$ 0.99 whose SDSS redshift estimates
lie outside the range of our training set, i.e., those with $z_{\rm
SDSS} > 0.50$.) As shown in Figure \ref{fig:zdesign}, the distributions of
522: redshifts in our high- and low-CL samples are similar, implying that
523: predicted redshifts for low-CL galaxies from the model built on
524: high-CL galaxies should not be systematically biased.
525:
526: \subsection{Analysis}
527: \label{sect:anal}
528:
529: % Redundant with line immediately above.
530: %Then, using the regression model presented in
531: %{\S}\ref{sect:regress} we can regress the SDSS redshift estimates on the
532: %diffusion map coordinates to find galaxies for which our
533: %predicted redshift values do not agree with the corresponding SDSS estimates.
534:
535: %In its spectral reduction pipeline, SDSS estimates spectroscopic
536: %redshifts $z_{\rm SDSS}$ by (a) using a reference line list\footnote{
537: %\scriptsize \tt http://www.sdss.org/dr6/algorithms/linestable.html}
538: %to identify emission lines that they detect using a wavelet-based
539: %procedure, and (b)
540: %cross-correlating emission-line-masked, continuum-subtracted
541: %spectra with star, galaxy, and quasar templates.\footnote{
542: %See {\scriptsize \tt http://www.sdss.org/dr6/algorithms/redshift\_type.html}.}
543:
In this section, we apply both PCA and diffusion map to our sample
and predict redshift using the
546: regression model introduced in {\S}\ref{sect:regress}. We provide
547: details on the PCA algorithm in Appendix \ref{sect:pca}.
548:
549: In the diffusion map analysis,
550: we begin by calculating Euclidean distances between spectra
551: \begin{equation}
552: s(\x, \y)~=~ \sqrt{\sum_k (f_{\x,k}-f_{\y,k})^2} \,,
553: \end{equation}
554: where $f_{\x,k}$ and $f_{\y,k}$ are the normalized fluxes in bin $k$ of
555: spectra $\x$ and $\y$, respectively. We use these distances and a
556: chosen value of
557: $\epsilon$ to construct both the weights for the graph (see equation
558: \ref{eqn:diffw}) and the transition
559: matrix $\P$ (see equation \ref{eqn:diffp}), from which eigenmodes are
560: generated. Below we
561: discuss how we select the optimal value of $\epsilon$.
562: As stated in {\S}\ref{sect:diff}, the value of the parameter $t$
563: (see equation \ref{eqn:diffusion_map}) is unimportant in
564: the context of regression, as any change in $t$ would be met
565: with a corresponding
566: rescaling of the coefficients $\widehat \beta_j$ in the regression model,
567: such that predictions are unchanged.
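To make this explicit, a brief sketch (assuming, as in Appendix \ref{sect:predint}, that the regression function is linear in the $m$ retained basis functions, and writing $\beta_j^{(t)}$, a symbol introduced here only for illustration, for the coefficients obtained at a given $t$):
\begin{eqnarray}
\sum_{j=1}^{m} \beta_j^{(t)} \lambda_j^t \psi_j(\x)~=~\sum_{j=1}^{m} \beta_j \psi_j(\x) \,, \quad \beta_j~=~\lambda_j^t \beta_j^{(t)} \,. \nonumber
\end{eqnarray}
Provided that each retained $\lambda_j$ is nonzero, the two parameterizations span the same set of functions, so the fitted values do not depend on $t$.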
568:
569: In Figure \ref{fig:zmaps} we plot the embedding of
570: the 2793 galaxies with CL $>$ 0.99
in the first three PC and diffusion map
coordinates (i.e., $\lambda_i^t\psi_i(\cdot)$ in equation \ref{eqn:diffusion_map}).
573: We observe that the structure of each of these reparameterizations of
574: the original data corresponds in a simple way to $\log_{10}(1+z_{\rm
575: SDSS})$. These embeddings are a useful way to visualize the data
576: and to qualitatively identify subgroups of data and peculiar data points.
577:
578: % fig:zmaps was here
579:
580: In the next stage of analysis we use the computed eigenfunctions to
581: predict $z$ for our sample of 3835 galaxies.
582: We regress $z_{\rm SDSS}$ upon the diffusion map (and PC) eigenmodes
583: (cf.~equation~\ref{eqn:orthoreg}, where $\widehat r$ represents
our redshift estimates), weighting each data point by the
inverse variance of its $z_{\rm SDSS}$, $1/\sigma_{z_{\rm SDSS}}^2$,
to account for the uncertainties in the $z_{\rm SDSS}$ measurements.
587: We repeat this step for a sequence of
588: $m$ (and $\epsilon$) values, determining the optimal values of each
589: by minimizing the prediction risk $R(\epsilon,m)$,
590: estimated via ten-fold cross-validation (see equation~\ref{eqn:MSE}
591: and subsequent discussion). It is in this regression step that
592: we clearly observe the advantage of using diffusion maps over
593: principal components. In Figure \ref{fig:zrisk} we show that
594: diffusion map achieves significantly lower
595: CV prediction risk for most choices of model size $m$ and
596: obtains a much lower minimum $\widehat{R}_{\rm CV}$, i.e.,
597: the optimal low-dimensional diffusion map
598: representation of our data captures the trend in $z$ better than the
599: PC representation. Note that the trend in $\widehat{R}_{\rm CV}$ for both
600: PC and diffusion map basis functions is to decrease with increasing
601: model size for small models and to increase with increasing model size
for larger models. This is the ``bias-variance tradeoff'' that was
referred to in {\S}\ref{sect:risk}: as the size (complexity) of our model
increases, the bias of the model decreases while the variance of the
model increases. Prediction risk is the sum of the squared bias and the
variance of a model, which explains the behavior observed in Figure
\ref{fig:zrisk}: for small models, increasing the model size leads to
a decrease in bias that outweighs the
increase in variance, while for large models, an increase in model size
produces a minimal decrease in bias and a relatively large increase in variance.
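The weighted regression and the risk decomposition described above can be sketched as follows. The symbols $w_i$, ${\rm E}$, and ${\rm Var}$ are introduced here for illustration, and we assume that the regression estimate takes the form $\widehat r(\x) = \sum_{j=1}^{m} \widehat\beta_j \psi_j(\x)$, consistent with Appendix \ref{sect:predint}. The coefficients minimize the weighted residual sum of squares,
\begin{eqnarray}
\widehat\beta~=~\arg\min_{\beta} \sum_{i} w_i \left( z_{{\rm SDSS},i} - \sum_{j=1}^{m} \beta_j \psi_j(\x_i) \right)^2 \,, \quad w_i~=~\frac{1}{\sigma_{z_{{\rm SDSS}},i}^2} \,, \nonumber
\end{eqnarray}
and, at a fixed $\x$, the expected squared error of the estimate separates into a squared bias term and a variance term:
\begin{eqnarray}
{\rm E}\left[ \left( \widehat r(\x) - r(\x) \right)^2 \right]~=~\left( {\rm E}[\widehat r(\x)] - r(\x) \right)^2 + {\rm Var}\left( \widehat r(\x) \right) \,. \nonumber
\end{eqnarray}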
611: %It is also a sparser representation, requiring
612: %less than half the number of eigenfunctions (42 vs.~93).
613: %Restating the previous two sentences,
614: %{\em the diffusion map approach yields better redshift
615: %predictions than PCA, with a model that is more parsimonious than the
616: %best-fitting PCA model}.
617:
618: In Table \ref{tab:zreg}, we show the parameters for the
619: optimal (minimal $\widehat{R}_{\rm CV}$) diffusion map and PC regression models.
620: Note that since our original data were in 3500
621: dimensions, our optimal diffusion map model achieves
a 96.4\% reduction in dimensionality. If we were to choose
an arbitrary small model size, as is often done in the literature, our
estimated prediction risk would be considerably higher. For example, for model
625: sizes $m = 10$ and 20, the CV prediction risks for regression on PC
626: basis functions are 0.305 and 0.209, respectively (compared to optimal
627: value 0.193), while regression on
628: diffusion map basis functions yields $\widehat{R}_{\rm CV}$ of 0.295
629: and 0.191, respectively (compared to optimal value 0.134). The choice of
630: $\epsilon$ in the diffusion map model also has a significant impact on
631: results. For values of $\epsilon$ that are too small, CV risks are
632: extremely large because the data points are no longer connected in the
633: diffusion process and consequently large outliers occur in the
634: diffusion map parameterization. Likewise, large values of $\epsilon$
635: yield large prediction risks due to the large weights given to
636: connections between dissimilar data points.
637:
638: In Figure \ref{fig:zreg} we plot predictions and prediction
639: intervals for all galaxies in
640: our sample using our optimal diffusion map model.
641: (See Appendix \ref{sect:predint} for a discussion of prediction
642: intervals.)
643: Most of our predictions are in close correspondence with the SDSS
estimates. We observe a positive correlation between $1-{\rm CL}$ and the
discrepancy between our redshift estimates and the SDSS estimates (Figure
\ref{fig:cl}), meaning that galaxies for which our estimates disagree
with SDSS estimates are more likely to be galaxies with low CL.
648:
649: There are 54 outliers at the $4\sigma$ level. Visual inspection of
650: their spectra indicates that 39 appear to fit the template assigned by
651: SDSS. Of these, 27 are well-described by the LRG template. In
652: Figure \ref{fig:flux} we show that most of the outliers that are
653: well-fit by their SDSS templates are faint objects. A plausible
654: explanation for their classification as outliers is low S/N in their
655: measured spectra. Faint galaxies with strong emission lines will
656: generally have accurate SDSS redshifts but can be outliers in
657: the diffusion map because noisy spectra induce higher Euclidean
658: distances. In a future paper we will introduce a method to account
659: for errors in the original measured data
660: that corrects both for errors in Euclidean distance
661: computations and random errors in the diffusion map coordinates.
662:
663:
664: The 15 other outliers show interesting and/or anomalous features.
665: Four spectra appear to be LRG type galaxies with abnormal emission
and/or absorption features, of which at least two are likely
attributable to calibration errors (see Figure \ref{fig:outliers}a,b).
668: One spectrum is clearly a QSO (Figure \ref{fig:outliers}c), one shows
669: only sky subtraction residuals (Figure \ref{fig:outliers}d), and two others are
670: obvious mismatches to their SDSS
671: templates due to absorption lines whose depths do not match their
assigned template. Four outliers have abnormal bumps (possible
continuum jumps due to instrumental artifacts; see Figure
\ref{fig:outliers}e,f) that resemble wide emission features.
675: One outlying galaxy has a spectrum that looks like
676: a late-type galaxy with no emission lines, meaning it is likely a
677: K+A post-starburst galaxy. Another outlier has an anomalous emission
feature around 6000~\AA\ in the rest frame (Figure \ref{fig:outliers}g).
679: This is a possible lens
680: galaxy, but was not selected by the Sloan Lens ACS Survey (SLACS;
681: \citeauthor{Bolton2006}) because
682: the feature in question
occurs in close proximity to strong sky lines at 8800~\AA. The final
684: outlier has a strong, wide emission feature in the
685: vicinity of H$\alpha$ but has no emission lines anywhere else in the
686: SDSS spectrum (Figure \ref{fig:outliers}h).
687: None of the outlying spectra show conclusive evidence of a wrong SDSS redshift
measurement (except for the aforementioned sky spectrum, which we
detect as a 30$\sigma$ outlier).
690:
691:
692: %Manual inspection of these spectra show that
693: %(a) 2 are obviously misclassified QSOs;
694: %%two have been to QSO spectra by the SDSS routines and
695: %%were mislabeled as galaxy spectra.
696: %%{\bf (NOTE: 001 and 026)}
697: %(b) 15 of these outliers have strong emission lines
698: %({\bf conclusion?}); and (c)
699: %%{\bf (NOTE: 000,002,003,006,007,016,017,021,022,031,037,042,047,048,053)}.
700: %10 appear to have questionable
701: %$z_{\rm SDSS}$ values based on visual inspection.
702: %%{\bf (NOTE: 007 (0.857), 008 (0.741), 011 (0.897), 023 (0.998), 024 (0.831), 025 (0.478), 043 (0.962), 046 (0.111), 050 (0.997), 055 (0.502) ....CL is in parentheses; a few of these have anomalous features but still might have correct z...Peter, can you take a look at these?)}
703:
704: \subsection{Comparison With Other Methods}
705:
706: As discussed in {\S}1, many authors have applied PCA to galaxy spectra
707: in an attempt to reduce the dimensionality of the data space, but few attempt
708: to find simple relationships between the reduced data and the physical
parameters of interest; exceptions include
\citeauthor{Li2005}, \citeauthor{Zhang2006}, and
\citeauthor{ReFiorentin2007}.
712: In all three cases, the authors use
713: PCA to estimate stellar and/or galactic parameters that are traditionally
714: estimated by laboriously measuring equivalent widths and fluxes
715: of individual lines, just as we have used diffusion map eigenfunctions
716: to estimate redshift, a physical parameter usually estimated through
717: computationally intensive cross-correlation methods.
718: We stress three advantages of our approach over those employed by the
719: above authors:
(1) we achieve much lower prediction
error using diffusion map coordinates than using PCA,
(2) we have an objective way of selecting the parameters of
the model, and (3) we use a theoretically well-motivated regression
model that takes statistical variations of the data into account and
that unifies the data parameterization and regression algorithms.
726:
727: The aim of \citeauthor{Li2005}~is to estimate, e.g., the velocity
728: dispersion and reddening of a set of approximately 1500 galaxies
729: observed by SDSS.
730: They use PCA in two successive applications.
731: They first apply PCA
732: to the STELIB library to reduce 204 stellar spectra to 24 stellar eigenspectra.
733: These in turn are fit to SDSS DR1 spectra to create a library of 1016
734: galactic spectra, which are reduced to nine galactic eigenspectra.
735: The authors then regress observed equivalent widths (EW) and fluxes of
736: H$\alpha$ upon these nine eigenspectra.
737: They determine the number of eigenspectra to retain
738: by estimating noise variance in the stellar case
739: and by using the $F$ test to compute the significance of each additional
eigenspectrum in spectral reconstruction in the galactic case. The latter
criterion, however, is not well suited to the task of parameter
estimation because
the appropriate number of components in the regression model depends
on the complexity of the dependence of those parameters on
the basis elements, not on the complexity of the original spectra.
746: For example, the dependence of the EW of H$\alpha$ on the PC basis
747: functions may be a simple, smooth function while the flux dependence
may be a complex, bumpy relationship. In this case, the optimal
749: regression model to predict EW would require fewer basis functions
750: than the optimal model for H$\alpha$ flux prediction. Minimizing CV
751: risk would lead us to choose the correct number of basis functions for
each task, while the method of \citeauthor{Li2005} would force us to use the same
753: (inappropriate) size for each model.
754:
755: \citeauthor{Zhang2006} attempt to predict stellar parameters by
756: regressing on PC coefficients using a kernel regression model with a
757: variable window width. In their paper, they do not specify how to
758: select the window
759: width (they introduce an arbitrary parameter $\lambda$) or how to
760: choose the correct number of PC basis functions (they use 3).
761: Their choice of a small
762: model size is likely due to the computational and statistical
763: difficulties that characterize kernel regression in high dimensions
764: \citep{Wasserman2006}.
765:
766: \citeauthor{ReFiorentin2007}~attempt to estimate
767: stellar atmospheric parameters (effective temperature, surface gravity,
768: and metallicity) from SDSS/SEGUE spectra.
769: They use PCA for dimension reduction, but set $m$ to an
770: arbitrary value (e.g., 50).
771: They then use an iterative, non-linear regression model (utilizing the
772: hyperbolic tangent function; see \citealt{Bailer-Jones2000}),
773: with an error function based on the residual sum-of-squares plus
774: a regularization term (see their equation 2). Again, the
775: choice of the regularization parameter is not justified.
776: %This methodology is similar to that used in the
777: %neural network community ({\bf ANN: CONFIRM THIS}).
778: We find that when applied to the same data
779: set of galaxy spectra, their model does not achieve lower CV risk than
780: our model for different choices of regularization parameter and model size.
781:
782: \section{Summary}
783:
784: \label{sect:summary}
785:
786: The purpose of this paper is two-fold.
First, we introduce the diffusion map method for data parameterization
and dimensionality reduction. We show
that for the types of high-dimensional and complex data sets
often analyzed in astronomy, diffusion map can yield
results far superior to those of commonly used methods such as PCA. Moreover,
the simple, intuitive formulation of diffusion map as a method that
preserves the local interactions of a high-dimensional data set makes the
technique easily accessible to scientists who are not well versed in
statistics or machine learning.
796:
797: Second, we present a fast and powerful eigenmode-based framework for
798: estimating physical parameters in databases of high-dimensional
astronomical data. In most astrophysical applications, PCA is used as
an exploratory tool for dimensionality reduction,
with no formal methods
or statistical criteria for regression, risk estimation, and selection
of relevant eigenvectors. Here we propose a statistically rigorous,
804: unified framework for
805: regression and data parameterization. Our proposed regression model
806: combines basis functions in a simple and statistically-motivated
807: manner while our clear objective of risk minimization drives the
808: estimation of the model parameters. Again, the simplicity of the
809: proposed method will make it appealing to the non-specialist.
810:
811: We apply the proposed methodology to predict redshift for a sample of
812: SDSS galaxy spectra, comparing the use of the proposed regression
813: model with PCA basis functions versus diffusion map basis functions.
814: We find that the prediction error for the diffusion-map-based approach
815: is markedly smaller than that of a
816: similar framework based on PCA. Our techniques are also more robust
817: than commonly used template matching
818: methods because they consider the structure of the entire
high-dimensional data set when reparameterizing the data.
820: Statistical inferences are based on this learned structure,
821: instead of considering each data point separately in an object-by-object
822: matching algorithm as is currently used by SDSS and commonly employed
823: throughout the astronomy literature.
824: Work in progress extends our approach to
825: photometric redshift estimation and to the estimation of the
826: intrinsic parameters (e.g., mean metallicities and ages) of galaxies.
827:
828: \begin{acknowledgments}
829: The authors would like to thank Jeff Newman for helpful conversations.
830: This work was supported by NSF grant \#0707059 and ONR grant N00014-08-1-0673.
831: \end{acknowledgments}
832:
833: \appendix
834:
835: \section{Principal Components Analysis}
836: \label{sect:pca}
837:
We first center our data (the normalized spectra with $p$ wavelength bins) so that $\frac{1}{n} \sum_{i=1}^{n} {\bf x}_i = 0$. The centered observations ${\bf x}_1, {\bf x}_2, \ldots, {\bf x}_n \in \mathbb{R}^p$ are then stacked into the rows of an $n \times p$ matrix ${\bf X}$. Note that the sample covariance matrix of $\bf x$ is given by the $p \times p$ matrix ${\bf S}= \frac{1}{n}{\bf X}^T{\bf X}$. In Principal Components Analysis (PCA), one computes the eigenvectors of the covariance matrix that correspond to the $m < p$ largest eigenvalues; denote these vectors by ${\bf v}_1, \ldots, {\bf v}_m \in \mathbb{R}^p$. In a PC map, the projections of the data onto these vectors are then used as new coordinates; i.e., the PC embedding of data point ${\bf x}_i$ is given by the map
839: $$ {\bf x}_i \mapsto \Psi_{\rm PCA}({\bf x}_i)=({\bf x}_i \cdot {\bf v}_1, \ldots, {\bf x}_i \cdot {\bf v}_m).$$
840: These projections are sometimes referred to as the principal components of ${\bf X}$.
841:
842: Algorithmically, the PC embedding is easy to compute using a singular value decomposition (SVD) of ${\bf X}$:
843: $$ {\bf X=U D V}^T. $$
Here ${\bf U}$ is an $n \times p$ matrix with orthonormal columns, ${\bf V}$ is a $p \times p$ orthogonal matrix (whose columns are the eigenvectors ${\bf v}_1, \ldots, {\bf v}_p$ of ${\bf S}$), and ${\bf D}$ is a $p \times p$ diagonal matrix with diagonal elements $d_1 \geq d_2 \geq \ldots \geq d_p \geq 0$ known as the singular values of ${\bf X}$. Since ${\bf XV}={\bf UD}$, the PC embedding of the $i$th data point in $m$ dimensions is given by the first $m$ elements of the $i$th row of ${\bf UD}$.
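
The link between the SVD and the eigenvectors of ${\bf S}$ can be checked directly. Writing ${\bf I}$ for the $p \times p$ identity matrix (a symbol introduced here) and using ${\bf U}^T{\bf U} = {\bf I}$,
$$ {\bf S} = \frac{1}{n}{\bf X}^T{\bf X} = \frac{1}{n}{\bf V D U}^T{\bf U D V}^T = \frac{1}{n}{\bf V D}^2{\bf V}^T. $$
The columns of ${\bf V}$ are therefore eigenvectors of ${\bf S}$ with eigenvalues $d_j^2/n$, so ordering by singular value is the same as ordering by eigenvalue.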
845:
846: \section{Prediction Intervals for Spectroscopic Redshift Estimates}
847:
848: \label{sect:predint}
849:
850: In any one fold of a ten-fold regression analysis, we fit to 90\% of the data,
851: generating predictions and prediction intervals
852: for the 10\% of the data withheld from the analysis. A prediction interval
853: is {\it not} a confidence interval; the former
854: denotes a plausible range of values for a single observation, whereas the
855: latter denotes a plausible range of values for a parameter of the
856: probability distribution function from which that single observation is
857: sampled (e.g., the mean).
858:
859: Let $\bf X$ and $\bf \tilde X$ represent the matrices of independent variables
860: included in, and withheld from, regression analysis, respectively. For
861: instance,
862: \begin{eqnarray}
863: {\bf \tilde X}~=~
864: \left(
865: \begin{array}{cccc}
866: \psi_1(x_1) & \cdots & \cdots & \psi_m(x_1) \\
867: \vdots & \vdots & \vdots & \vdots \\
868: \psi_1(x_n) & \cdots & \cdots & \psi_m(x_n)
869: \end{array}
870: \right) \,, \nonumber
871: \end{eqnarray}
872: where $n$ is the number of withheld data and $m$ the number of
873: assumed basis functions. (Here, we leave out factors of
874: $\lambda_j^t$, which are subsumed into the estimated
875: regression coefficients ${\widehat \beta}_j$.) The vector of
876: redshift predictions for the withheld data is thus
877: \begin{eqnarray}
878: {\widehat z}~=~{\bf \tilde X} {\widehat \beta} \,, \nonumber
879: \end{eqnarray}
where $\widehat \beta$ is estimated from ${\bf X}$,
while the vector of prediction interval half-widths is given by
882: \begin{eqnarray}
883: t_{\alpha/2,N-n-2} \widehat{\sigma} \sqrt{ {\bf \tilde X} \left( {\bf X}^T {\bf X} \right)^{-1} {\bf \tilde X}^T + 1 + \frac{1}{N-n} } \,,
884: \label{eqn:predint}
885: \end{eqnarray}
886: where $\widehat{\sigma}$ is the estimated standard deviation of the
887: random noise $\epsilon$ in the relationship $Y = r({\bf X}) + \epsilon$,
888: estimated from the residuals of the regression of $Y$ upon ${\bf X}$,
$t_{\alpha/2,N-n-2}$ is the critical $t$-value for a two-sided
$100(1-\alpha)$\% prediction interval,
891: and $N$ is the total number of data points. Equation (\ref{eqn:predint}) is
892: a multi-dimensional generalization of, e.g., equation (2.26) of
893: \citet{Weisberg2005}, taking into account that the mean of $\psi({\bf x})$ is
894: zero.
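Read literally, the quantity under the square root in equation (\ref{eqn:predint}) is an $n \times n$ matrix. One natural reading, stated here as an assumption, is that the half-width for the $i$th withheld data point uses the $i$th diagonal element of that matrix. Writing $\tilde{\bf x}_i^T$ (notation introduced here) for the $i$th row of ${\bf \tilde X}$, this gives
\begin{eqnarray}
t_{\alpha/2,N-n-2} \, \widehat{\sigma} \sqrt{ \tilde{\bf x}_i^T \left( {\bf X}^T {\bf X} \right)^{-1} \tilde{\bf x}_i + 1 + \frac{1}{N-n} } \,. \nonumber
\end{eqnarray}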
895:
896: \clearpage
897:
898: \begin{thebibliography}{}
899: \bibitem[Adelman-McCarthy et al.(2008)]{Adelman2008} Adelman-McCarthy, J.~K., et al.~2008, \apjs, 175, 297
\bibitem[Bailer-Jones(2000)]{Bailer-Jones2000} Bailer-Jones, C.~A.~L.~2000, \aap, 357, 197
901: \bibitem[Bellman(1961)]{Bellman:61} Bellman, R.~E.~1961, Adaptive Control Processes (Princeton Univ. Press)
902: \bibitem[Boroson \& Green(1992)]{BorosonGreen1992} Boroson, T.~A., \& Green, R.~F.~1992, \apjs, 80, 109
903: \bibitem[Bolton et al.(2006)]{Bolton2006} Bolton, A.~S., et al.~2006, \apj, 638, 703
904: \bibitem[Coifman \& Lafon(2006)]{Coifman:Lafon:06} Coifman, R.~R., \& Lafon, S.~2006, Appl. Comput. Harmon. Anal., 21, 5
905: \bibitem[Connolly et al.(1995)]{Connolly1995} Connolly, A.~J., Szalay, A.~S., Bershady, M.~A., Kinney, A.~L., \& Calzetti, D.~1995, \aj, 110, 1071
906: \bibitem[Folkes et al.(1999)]{Folkes1999} Folkes, S., et al.~1999, \mnras, 308, 459
\bibitem[Kemeny \& Snell(1983)]{KemenySnell1983} Kemeny, J.~G., \& Snell, J.~L.~1983, Finite Markov Chains (Springer)
908: \bibitem[Lafon \& Lee(2006)]{LafonLee2006} Lafon, S., \& Lee, A.~2006, IEEE Trans. Pattern Anal. and Mach. Intel., 28, 1393
909: \bibitem[Li et al.(2005)]{Li2005} Li, C., Wang, T.-G., Zhou, H.-Y., Dong, X.-B., \& Cheng, F.-Z.~2005, \aj, 129, 669
910: \bibitem[Madgwick et al.(2003)]{Madgwick2003} Madgwick, D.~S., et al.~2003, \apj, 599, 997
911: \bibitem[Re Fiorentin et al.(2007)]{ReFiorentin2007} Re Fiorentin, P., et al.~2007, \aap, 467, 1373
912: \bibitem[Rogers et al.(2007)]{Rogers2007} Rogers, B., Ferreras, I., Lahav, O., Bernardi, M., Kaviraj, S., \& Yi, S.~K.~2007, \mnras, 382, 750
913: \bibitem[Ronen, Arag\'on-Salamanca, \& Lahav(1999)]{Ronen1999} Ronen, S., Arag\'on-Salamanca, A., \& Lahav, O.~1999, \mnras, 303, 284
914: \bibitem[Vanden Berk et al.(2006)]{VDB2006} Vanden Berk, D.~E., et al.~2006, \aj, 131, 84
915: \bibitem[Wasserman(2006)]{Wasserman2006} Wasserman, L.~W.~2006, All of Nonparametric Statistics (New York:Springer)
916: \bibitem[Weisberg(2005)]{Weisberg2005} Weisberg, S.~2005, Applied Linear Regression (Hoboken:Wiley)
917: \bibitem[Yip et al.(2004a)]{Yip2004a} Yip, C.~W., et al.~2004, \aj, 128, 585
918: \bibitem[Yip et al.(2004b)]{Yip2004b} Yip, C.~W., et al.~2004, \aj, 128, 2603
919: \bibitem[Zhang et al.(2006)]{Zhang2006} Zhang, J., Wu, F., Luo, A., \& Zhao, Y.~2006, ChJAA, 30, 176
920: \end{thebibliography}
921:
922:
923: % The figures
924:
925: \begin{figure}
926: %\epsfig{figure=Fig3a.eps,height=2.3in}
927: \epsscale{0.7}
928: \plotone{f1a.eps}
929: \vspace{0.7in}
930: \epsscale{0.9}
931: \plottwo{f1b.eps}{f1c.eps}
932: \caption{An example of a one-dimensional manifold (dashed line) with Gaussian noise embedded in
933: two or higher dimensions. The path (solid line) from $\x$ to $\y$ reflects the natural geometry of
934: the data set which is captured by the
935: diffusion distance between $\x$ and $\y$.
936: The plot on the lower left shows that the first diffusion map coordinate is a monotonically increasing
937: function of the
938: arc length of the spiral; this is not the case in the
939: lower right plot, which shows the same relationship for the first PC coordinate.}
940: \label{fig:spiral}
941: \end{figure}
942:
943: \clearpage
944: \begin{figure}
945: \epsscale{0.75}
946: \plotone{f2.eps}
947: \caption{Distributions of SDSS redshift estimates in our
high-CL (top) and low-CL (bottom) samples. We train our regression
model using the 2793 high-CL galaxies only, then use the fitted model
to predict redshifts for the 1042 low-CL galaxies.}
951: \label{fig:zdesign}
952: \end{figure}
953:
954: \clearpage
955: \begin{figure}
956: %$\begin{array}{c}
957: %\epsfig{figure=zest_pcmap_ccode.ps,height=2.25in} \\
958: %\epsfig{figure=zest_dmap_ccode.ps,height=2.25in} \\
959: %\end{array}$\\
960: \epsscale{1}
961: \plottwo{f3a.eps}{f3b.eps}
\caption{Embedding of our sample of 2793 SDSS galaxy spectra with
SDSS $z$ CL $> 0.99$ in
the first 3 PC and the first 3 diffusion map coordinates, respectively.
Color encodes the value of $\log_{10}(1+z_{\rm SDSS})$. Both
maps show a clear correspondence with redshift.}
967: \label{fig:zmaps}
968: \end{figure}
969:
970: %\clearpage
971: %\begin{figure}
972: %%\epsfig{figure=outlier.eps,height=2.6in}
973: %\epsscale{0.75}
974: %\plotone{f3.eps}
975: %\caption{SDSS galaxy spectrum (with {\tt OBJID}) identified as an outlier
976: %($>$ 4$\sigma$) by the
977: %diffusion map-based regression, overlaid with SDSS template 29, which
978: %provided the highest CL $z_{\rm SDSS}$ estimate in template cross-correlation.
979: %The spectrum exhibits two anomalous features: a sharp, unexplained
980: %rise at low wavelengths and a broad emission feature at $\approx$ 4100 \AA.}
981: %\label{fig:out}
982: %\end{figure}
983:
984: \clearpage
985: \begin{figure}
986: %\epsfig{figure=zpred_risk.eps,height=2.3in} \\
987: \epsscale{0.75}
988: \plotone{f4.eps}
\caption{Risk estimates ($\widehat{R}_{\rm CV}$) for regression of $z$ on diffusion
map coordinates and PCs. Diffusion map attains a lower
risk for almost every number of coordinates in the regression. It also
achieves a lower minimum risk, as indicated by Table~\ref{tab:zreg}.
Risk estimates are based on 50 repetitions of 10-fold CV. Thick lines
represent the mean risk at each model size and thin dotted lines are $\pm 1$
standard deviation bands.}
996: \label{fig:zrisk}
997: \end{figure}
998:
999: \clearpage
1000: \begin{figure}
1001: %\epsfig{figure=zpredictions.eps,height=4.6in}
1002: \epsscale{0.6}
1003: \plotone{f5.eps}
1004: \caption{
1005: Redshift predictions using diffusion map coordinates for galaxies
1006: with SDSS CL $\le$ 0.99 (top)
1007: and CL $>$ 0.99 (bottom), each plotted against $z_{\rm SDSS}$.
1008: Error bars
1009: represent 95\% prediction intervals. Note that CL $\le$ 0.99
1010: redshift predictions are based on the model trained on CL $>$ 0.99
1011: galaxies while CL $>$ 0.99 predictions are from 10-fold CV on CL
1012: $>$ 0.99 galaxies. For most galaxies, our
1013: predictions are in close correspondence with SDSS estimates.}
1014: \label{fig:zreg}
1015: \end{figure}
1016:
1017: \clearpage
1018: \begin{figure}
1019: \epsscale{0.6}
1020: \plotone{f6.eps}
\caption{Discrepancy between our predicted redshift values and $z_{\rm
SDSS}$ estimates versus $\log(1-{\rm CL})$. There is a
correlation of 0.392 between the amount of discrepancy and $1-{\rm CL}$, meaning
that galaxies for which there are large differences between the two
redshift estimates tend to be objects whose SDSS redshift
confidences are low. Horizontal lines denote 1, 3, and 5$\sigma$
disparities. Small random perturbations have been added to duplicate
$\log(1-{\rm CL})$ values to visualize galaxies with the same CL. Galaxies with a
CL of 1.00 are assigned a mean $\log(1-{\rm CL})$ of $-4$.
}
1031: \label{fig:cl}
1032: \end{figure}
1033:
1034: \clearpage
1035: \begin{figure}
1036: \epsscale{0.6}
1037: \plotone{f7.eps}
1038: \caption{Discrepancy between our predicted redshift values and $z_{\rm
1039: SDSS}$ versus log(flux) of the original spectra. There is a
correlation of $-0.327$ between the amount of discrepancy and galaxy
1041: brightness. Galaxies can be detected as outliers even
1042: if they match well to their SDSS template (in color). Low S/N
1043: can cause normal galaxies with correct SDSS redshifts to be labeled
1044: as outliers. We also detect several
1045: physically interesting objects as outliers (see Figure \ref{fig:outliers}).
1046: }
1047: \label{fig:flux}
1048: \end{figure}
1049:
1050: \clearpage
1051: \begin{figure}
1052: \epsscale{1}
1053: \plotone{f8.eps}
1054: \caption{Eight selected outliers with anomalous features. Each
1055: spectrum (solid blue) is plotted along with its SDSS template match
1056: (dashed red). Spectra are scaled to have the same sum of squared
1057: (smoothed) fluxes over the same range of wavelengths. For a
1058: thorough discussion of
these outliers, see {\S}\ref{sect:anal}.}
1060: \label{fig:outliers}
1061: \end{figure}
1062:
1063: \clearpage
1064:
1065: \input{tab1}
1066:
1067: \end{document}
1068:
1069: