0807:0807.2900/ms.tex

1: \documentclass[preprint]{aastex}

2: %\documentclass[manuscript]{aastex}

3: %\documentclass{emulateapj}

4: \shorttitle{Exploiting Low-Dimensional Structure}

5: \shortauthors{Richards, Freeman, Lee, Schafer}

6: %\usepackage{epsfig}

7: \usepackage{color}

8: \newcommand{\x}{{\bf x}}

9: \newcommand{\y}{{\bf y}}

10: \newcommand{\z}{{\bf z}}

11: \newcommand{\W}{{\bf W}}

12: \newcommand{\new}{red}

13: \renewcommand{\P}{{\bf P}}

14:

15: \begin{document}

16:

17: \title{Exploiting Low-Dimensional Structure in Astronomical Spectra}

18: \author{Joseph W. Richards, Peter E. Freeman, Ann B. Lee, Chad M. Schafer}

19: \email{jwrichar@stat.cmu.edu}

20: \affil{Department of Statistics, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213}

21:

22: \begin{abstract}

23: Dimension-reduction techniques can greatly improve statistical inference in astronomy.

24: A standard approach is to use Principal Components Analysis (PCA).

25: In this work we apply a recently developed technique, diffusion maps, to astronomical

26: spectra for data parameterization and dimensionality reduction, and

27: develop a robust, eigenmode-based framework

28: for regression.

29: We show how our framework provides a computationally efficient means by which

30: to predict redshifts of galaxies, and thus could

31: inform more expensive redshift estimators

32: such as template cross-correlation.  It also provides a natural means

33: by which to identify outliers (e.g., misclassified spectra, spectra

34: with anomalous features).

35: We analyze 3835 SDSS spectra and show how our framework

36: yields a more than 95\% reduction in dimensionality.

37: Finally, we show that the prediction error

38: of the diffusion map-based regression approach is markedly smaller than that of a similar

39: approach based on PCA, clearly demonstrating the superiority of diffusion

40: maps over PCA for this regression task.

41: \end{abstract}

42:

43: \keywords{galaxies: distances and redshifts --- galaxies: fundamental parameters --- galaxies: statistics --- methods: statistical --- methods: data analysis}

44:

45: \section{Introduction}

46:

47: \label{sect:intro}

48:

49: Galaxy spectra are classic examples of high-dimensional data, with

50: thousands of measured fluxes providing

51: information about the physical conditions of the observed object.

52: To make computationally efficient inferences about these

53: conditions, we need to first reduce the dimensionality of the data

54: space while preserving relevant physical information.

55: We then need to find simple relationships between the reduced data and physical parameters of

56: interest.

57: %, e.g., by introducing and estimating a regression function.

58: Principal Components Analysis (PCA, or the Karhunen-Lo\`eve transform) is a standard method for the first step; its application to astronomical spectra is described in, e.g., \citet{BorosonGreen1992},

59: \citet{Connolly1995}, \citet{Ronen1999}, \citet{Folkes1999},

60: \citet{Madgwick2003}, \citet{Yip2004a},

61: \citet{Yip2004b}, \citet{Li2005}, \citet{Zhang2006},

62: \citet{VDB2006}, \citet{Rogers2007}, and \citet{ReFiorentin2007}.

63: In most cases, the authors do not proceed to the second step but only

64:  ascribe physical significance to the first few eigenfunctions from PCA

65: (such as the ``Eigenvector 1'' of \citeauthor{BorosonGreen1992}).

66: Notable exceptions are \citeauthor{Li2005}, \citeauthor{Zhang2006},

67: and \citeauthor{ReFiorentin2007}. However,

68: as we discuss in {\S}\ref{sect:app}, these authors combine

69: eigenfunctions in an ad hoc manner with no formal methods or

70: statistical criteria for regression and risk (i.e., error) estimation.

71:

72: In this work we present a unified framework for regression and data parameterization of astronomical spectra. The main idea is to describe

73: the important structure of a data set in terms of its

74: {\em fundamental eigenmodes}.

75: The corresponding eigenfunctions are used both as coordinates for the data

76: and as orthogonal basis functions for regression.

77: We also introduce the {\em diffusion map} framework

78: (see, e.g., \citealt{Coifman:Lafon:06}, \citealt{LafonLee2006})

79: to astronomy, comparing and contrasting it with PCA for regression analysis of SDSS galaxy spectra.  PCA is a global method that finds linear low-dimensional

80: projections of the data; it attempts to preserve Euclidean distances between all data points and is often not robust to outliers.

81: The diffusion map approach, on the other hand, is non-linear and instead retains distances that reflect the (local) connectivity of the data.

82: This method is robust to outliers and is often able to unravel the intrinsic geometry and the natural (non-linear) coordinates of the data.

83:

84: In {\S}\ref{sect:diff} we describe the diffusion map method for data

85: parameterization.

86: In {\S}\ref{sect:regress} we introduce the technique of {\em adaptive regression} using eigenmodes.

87: In {\S}\ref{sect:app} we demonstrate the effectiveness of our proposed PCA- and

88: diffusion-map-based regression techniques for

89: predicting the redshifts of SDSS spectra.

90: % Text shifted to section 4

91: %Redshift prediction in SDSS DR6 is calculated by two methods:

92: %first, via wavelet analyses of continuum-subtracted spectra, where the

93: %continuum is estimated using a fifth-order polynomial, and second,

94: %by cross-correlating templates and observed spectra.\footnote{

95: %See {\scriptsize \tt http://www.sdss.org/dr6/algorithms/redshift\_type.html}.}

96: %In both cases, confidence levels\footnote{

97: %SDSS ``confidence levels"

98: %are functions of the strengths of observed lines and thus should

99: %not be interpreted probabilistically.}

100: %are computed, with the higher-CL redshift estimate assigned to the galaxy.

101: % shift end

102: %Template matching, in particular, is slow (ARE WE SURE ABOUT THIS?)

103: %and prone to error because

104: %the basis functions are not orthogonal.

105: %(SDSS manually inspects 8\% of estimates and changes 1\% of them.)

106: Our PCA- and diffusion-map-based approaches provide a fast and

107: statistically rigorous means of identifying

108: outliers in redshift data. The returned embeddings also provide an

109: informative visualization of the results.  In {\S}\ref{sect:summary} we summarize our results.

110:

111: \section{Diffusion Maps and Data Parameterization}

112:

113: \label{sect:diff}

114: The variations in a physical system can sometimes be described by

115: a few parameters, while measurements of the system are

116: necessarily of very high dimension; geometrically, the data are

117: points in the $p$-dimensional space $\mathbb{R}^p$, with $p$ large.

118: In our case, a data point is a galaxy spectrum, with the

119: dimension $p$ given by the number of wavelength bins ($p \gtrsim 10^3$),

120: and a full data set could consist of hundreds of thousands of spectra.

121: To make inference and predictions tractable,

122: one seeks to find a simpler parameterization of the system. The most

123: common method for dimension reduction and data parameterization

124: is PCA, in which the data are projected

125: onto a lower-dimensional hyperplane. For complex situations,

126: however, the assumption of linearity may lead to sub-optimal

127: predictions. A linear model pays very little attention to the

128: natural geometry and variations of the system. The top plot in Figure

129: \ref{fig:spiral}

130: illustrates this clearly by showing a data

131: set that forms a one-dimensional noisy spiral in $\mathbb{R}^2$.

132: Ideally, we would like to find a coordinate system that reflects

133: variations along the spiral direction, which is indicated by the

134: dashed line. It is obvious that any

135: projection of the data onto a line would be unsatisfactory.  Results

136: of a PCA analysis of the noisy spiral are shown in the lower-right plot

137: in Figure \ref{fig:spiral}.

138:

139: In this section, we will use diffusion maps

140: (\citeauthor{Coifman:Lafon:06}, \citeauthor{LafonLee2006}) --- a non-linear technique ---

141: %for data parameterization, i.e.

142: to find a natural coordinate system for the data.

143: When searching for a lower-dimensional description, one needs to decide

144: what features to preserve and what aspects of the data one is

145: willing to lose. The diffusion map framework attempts to retain

146: the cumulative local interactions between its data points, or

147: their ``connectivity'' in the context of a fictive diffusion process over the data.

148: We demonstrate how this can be a better way to learn

149: the intrinsic geometry of a data set than, e.g., PCA.

150: %which simply projects all data points onto a lower-dimensional hyperplane.

151:

152: Our strategy is to first define a distance metric $D(\x,\y)$ that reflects

153: the connectivity of two points $\x$ and $\y$, then find a map to a

154: lower-dimensional space (i.e., a new data parameterization) that

155: best preserves these distances.

156: (As before, a ``point'' in $p$-dimensional space represents

157: a complete astronomical spectrum of $p$ wavelength bins.)

158: The general idea is that we call two data points ``close'' if there

159: are many short paths between $\x$ and $\y$ in a jump diffusion process between data points.

160: In Figure \ref{fig:spiral}, the Euclidean distance

161: between two points is an inappropriate measure of

162: similarity. If, instead, one imagines a random walk starting at ``$\x$,'' and

163: only stepping to immediately adjacent points, it is clear that

164: %it would take a long time for that walk to reach ``$\y$.''

165: the time it would take for that walk to reach ``$\y$''  would reflect

166: the length along the spiral direction.  This latter distance measure

167: is represented by the solid path from $\x$ to $\y$ in Figure \ref{fig:spiral}.

168: We will make this measure of connectivity formal in what follows.

169:

170: The starting point is to construct a weighted graph where the

171: nodes are the observed data points. %(the spectra). repetitive

172: %(i.e., in our case each node is a spectrum).

173: The weight given to the edge connecting $\x$ and $\y$ is

174: \begin{equation}

175: w(\x,\y) = \exp\left(-\frac{s(\x,\y)^2}{\epsilon}\right),

176: \label{eqn:diffw}

177: \end{equation}

178: where $s(\x,\y)$ is a locally relevant measure of dissimilarity (small when $\x$ and $\y$ are alike).

179: For instance, $s(\x,\y)$ could be chosen as

180: the Euclidean distance between $\x$ and $\y$ (denoted here $\|\x-\y\|$)

181: when $\x$ and $\y$ are vectors.

182: But, the choice of $s(\x,\y)$ is not crucial, and this gets to the heart

183: of the appeal of this approach:

184: it is often simple to determine whether or not two data points are %very

185: ``similar'',

186: and many choices of $s(\x,\y)$ will suffice for measuring this

187: local similarity.

188: The tuning parameter $\epsilon$ is chosen small enough that

189: $w(\x,\y) \approx 0$ unless $\x$ and $\y$ are similar,

190: %only local similarities are computed,

191: but large enough such that the constructed graph is fully connected.
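As a minimal sketch of equation~(\ref{eqn:diffw}), assuming that $s(\x,\y)$ is the Euclidean distance and that the data points are stored as the rows of a NumPy array, the weights could be computed as follows. The function name {\tt gaussian\_weights} and the toy values are illustrative assumptions, not part of the analysis described in this paper.

```python
import numpy as np

def gaussian_weights(X, eps):
    """Edge weights w(x, y) = exp(-s(x, y)^2 / eps), with s the Euclidean distance.

    X is an (n, p) array holding n data points (e.g., spectra) as rows.
    """
    sq_norms = np.sum(X**2, axis=1)
    # Squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip small negative round-off
    return np.exp(-sq_dists / eps)

# Toy data (hypothetical): two nearby points and one distant point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0]])
W = gaussian_weights(X, eps=0.5)
```

In this toy example the two nearby points receive a weight close to 1, while the distant point is effectively disconnected from them. This is the situation that the lower bound on $\epsilon$ is meant to avoid.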

192:

193: The next step is to use these weights to build a Markov random walk on

194: the graph. From node (data point) $\x$, the probability of stepping

195: directly to $\y$ is defined naturally as

196: \begin{equation}

197: p_1(\x,\y) = \frac{w(\x,\y)}{\sum_{\z}w(\x,\z)}.

198: \label{eqn:diffp}

199: \end{equation}

200: %\begin{equation}

201: %p_1(x,y) = \frac{w(x,y)}{\sum_{z \in \Omega}w(x,z)} \,.

202: %\label{eqn:p}

203: %\end{equation}

204: This probability is close to zero unless $\x$ and $\y$ are similar. Hence, in

205: one step the random walk will move only to very similar nodes (with high

206: probability). These one-step transition probabilities are stored in the $n$ by $n$

207: matrix $\P$.

208: It follows from standard theory of Markov chains (\citealt{KemenySnell1983}) that, for a positive integer $t$, the element

209: $p_t(\x,\y)$ of

210: the matrix power $\P^t$ gives the probability of

211: moving from $\x$ to $\y$ in $t$ steps.

212: Increasing $t$ moves the random walk

213: forward in time, propagating the local influence of a data point

214: (as defined by the kernel $w$)

215: with its neighbors.

216: % so as eventually to form a global representation of the

217: %geometry of the data.
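The construction of $\P$ and of its powers can be sketched in a few lines. The weight matrix below is a hypothetical three-node example, and the function name {\tt transition\_matrix} is an assumption made for illustration.

```python
import numpy as np

def transition_matrix(W):
    """Row-normalize the weights: p_1(x, y) = w(x, y) / sum_z w(x, z)."""
    return W / W.sum(axis=1, keepdims=True)

# Hypothetical symmetric weight matrix for a graph with three nodes.
W = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.3],
              [0.1, 0.3, 1.0]])

P = transition_matrix(W)            # one-step transition probabilities
P3 = np.linalg.matrix_power(P, 3)   # t = 3: element (x, y) is p_3(x, y)
```

Each row of {\tt P} and of {\tt P3} is a probability distribution over the nodes, so every row sums to one.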

218:

219: For a fixed time (or scale) $t$, $p_t(\x,\cdot)$ is a vector representing

220: the distribution after $t$ steps of the random walk over the nodes of the

221: graph, conditional on the

222: walk starting at $\x$.

223: In what follows, the points $\x$ and $\y$ are

224: close if the conditional distributions

225: $p_t(\x,\cdot)$ and $p_t(\y,\cdot)$ are similar.

226: Formally, the diffusion distance at a scale $t$ is defined as

227: \begin{equation}

228: D_t^2(\x,\y) = \sum_{\z} \frac{\left(p_t(\x,\z) - p_t(\y,\z)\right)^2}{\phi_0(\z)}

229: %D_t^2(x,y) = ||p_t(x,\cdot) - p_t(y,\cdot)||^2_2

230: \label{eqn:diffdist}

231: \end{equation}

232: where $\phi_0(\cdot)$ is the stationary distribution of the random walk, i.e.,

233: the long-run proportion of the time the walk spends at

234: node $\z$.

235: Dividing by $\phi_0(\z)$ serves to reduce the influence of nodes

236: which are visited with high probability regardless of the starting point of the

237: walk.

238: %{\bf (Change the above; the 2 over 2 nomenclature is unclear.)}

239: The distance $D_t(\x,\y)$ will be small only if $\x$ and $\y$ are connected by

240: many short paths with large weights.  This construction of

241: a distance measure is robust to noise and outliers because it

242: simultaneously accounts for the cumulative effect of {\em all} paths between the

243: data points.

244: Note that the geodesic distance (the shortest path in a graph), on the other hand, often takes shortcuts due to noise.
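The definition in equation~(\ref{eqn:diffdist}) can be evaluated directly for a small data set. The sketch below assumes a symmetric weight matrix, for which the stationary distribution is $\phi_0(\z) = \sum_{\x} w(\z,\x) / \sum_{\x,\y} w(\x,\y)$. The function names and the one-dimensional toy data are hypothetical.

```python
import numpy as np

def stationary_distribution(W):
    """phi_0(z) = sum_x w(z, x) / sum_{x, y} w(x, y); valid for symmetric W."""
    return W.sum(axis=1) / W.sum()

def diffusion_distance_sq(W, t):
    """Squared diffusion distances D_t^2(x, y) for all pairs, as an (n, n) array."""
    P = W / W.sum(axis=1, keepdims=True)
    Pt = np.linalg.matrix_power(P, t)
    phi0 = stationary_distribution(W)
    # diff[i, j, :] = p_t(x_i, .) - p_t(x_j, .); needs O(n^3) memory, fine for toy sizes.
    diff = Pt[:, None, :] - Pt[None, :, :]
    return np.sum(diff**2 / phi0, axis=2)

# Toy 1-D data (hypothetical): a tight group of three points and a distant pair.
x = np.array([0.0, 0.1, 0.2, 3.0, 3.1])
W = np.exp(-(x[:, None] - x[None, :])**2 / 0.5)
D2 = diffusion_distance_sq(W, t=3)
```

Points within the same group end up at a much smaller diffusion distance than points in different groups, because very little probability flows between the two groups in three steps.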

245:

246: % Fig3a was here.

247:

248: The final step is to find a low-dimensional embedding of the data where Euclidean distances reflect diffusion distances.

249: %In applying this technique for dimensionality reduction,

250: %the data set attribute

251: %we wish to preserve is the diffusion distance between all

252: %points.

253: A biorthogonal spectral decomposition of the matrix $\P^t$ gives

254: %\begin{equation}

255: %p_t(x,y) = \sum_{j \ge 0} \lambda_j^t \psi_j(x) \phi_j(y) \,,

256: %\label{eqn:diffdecomp}

257: %\end{equation}

258: $p_t(\x,\y) = \sum_{j \ge 0} \lambda_j^t \psi_j(\x) \phi_j(\y)$,

259: where $\phi_j$, $\psi_j$, and $\lambda_j$, respectively, represent left eigenvectors, right eigenvectors and eigenvalues

260: of $\P$. It follows that

261: \begin{equation}

262: D^2_t(\x,\y)~= ~\sum_{j=1}^{\infty} \lambda_j^{2t}(\psi_j(\x)-\psi_j(\y))^2.\label{eq:Dt}

263: \end{equation}

264: %{\bf (ANN: How about putting a proof of Equation (4) as an Appendix?)}

265:  The proof of equation~(\ref{eq:Dt}) and the details of the computation

266:  and normalization of the eigenvectors  $\phi_j$ and $\psi_j$ are given in

267:  \citeauthor{Coifman:Lafon:06} and

268:  \citeauthor{LafonLee2006}.\footnote{Sample code in Matlab and R for

269:    diffusion maps at {\tt  http://www.stat.cmu.edu/\~{}annlee/software.htm}}  By retaining the

270:  $m$ eigenmodes corresponding to the $m$ largest nontrivial

271:  eigenvalues and by introducing the diffusion map

272: \begin{equation}

273: \Psi_t: \x \mapsto [\lambda_1^t\psi_1(\x), \lambda_2^t\psi_2(\x), \cdots,\lambda_m^t\psi_m(\x)]

274: \label{eqn:diffusion_map}

275: \end{equation}

276: from $\mathbb{R}^p$ to $\mathbb{R}^m$, we have that %(see \citeauthor{Coifman:Lafon:06})

277: %\begin{eqnarray}

278: \begin{equation}

279: D^2_t(\x,\y)~\simeq ~\sum_{j=1}^m \lambda_j^{2t}(\psi_j(\x)-\psi_j(\y))^2 ~=~||\Psi_t(\x) - \Psi_t(\y)||^2 \,,

280: \label{eqn:diffpres}

281: \end{equation}

282: i.e., Euclidean distance in the $m$-dimensional embedding defined by equation~(\ref{eqn:diffusion_map})

283: %lower-dimensional space $\mathbb{R}^m$,

284: approximates diffusion distance.

285: In contrast, Euclidean distances in PC maps approximate the original

286: Euclidean distances $\|\x-\y\|$.

287: Again, consider the example in Figure \ref{fig:spiral}.

288: The plot on the lower left shows that the first diffusion map coordinate is a monotonically increasing

289: function of the

290: arc length of the spiral; this is not the case in the

291: lower right plot, which shows the same relationship for the first PC coordinate. Indeed, the relationship

292: with the first PC coordinate is not even one-to-one.
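The relation between the diffusion map and the diffusion distance can be checked numerically. The sketch below is one possible implementation, not the code referenced in the footnote. It assumes a symmetric Gaussian weight matrix and obtains the right eigenvectors $\psi_j$ of $\P$ from the symmetric matrix $w(\x,\y)/\sqrt{d(\x)d(\y)}$, where $d(\x)=\sum_{\z} w(\x,\z)$. The eigenvectors are normalized so that $\sum_{\x} \psi_j(\x)^2 \phi_0(\x) = 1$. The function name {\tt diffusion\_map} and the random toy data are assumptions.

```python
import numpy as np

def diffusion_map(W, m, t=1):
    """Return the (n, m) diffusion-map coordinates and all eigenvalues of P."""
    d = W.sum(axis=1)
    phi0 = d / d.sum()                     # stationary distribution
    A = W / np.sqrt(np.outer(d, d))        # symmetric, same eigenvalues as P
    lam, V = np.linalg.eigh(A)
    order = np.argsort(lam)[::-1]          # largest eigenvalue first
    lam, V = lam[order], V[:, order]
    psi = V / np.sqrt(phi0)[:, None]       # right eigenvectors; psi[:, 0] is constant
    coords = psi[:, 1:m + 1] * lam[1:m + 1] ** t   # skip the trivial mode j = 0
    return coords, lam

# Hypothetical toy data: 30 random points in the plane.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
W = np.exp(-sq / 1.0)
t = 2

# Keep all n - 1 nontrivial modes, so the embedding distance is exact.
coords, lam = diffusion_map(W, m=len(X) - 1, t=t)
D2_embed = np.sum((coords[:, None, :] - coords[None, :, :])**2, axis=2)

# Direct evaluation of the diffusion distance from its definition.
P = W / W.sum(axis=1, keepdims=True)
Pt = np.linalg.matrix_power(P, t)
phi0 = W.sum(axis=1) / W.sum()
D2_direct = np.sum((Pt[:, None, :] - Pt[None, :, :])**2 / phi0, axis=2)
```

With all $n-1$ nontrivial modes retained, {\tt D2\_embed} and {\tt D2\_direct} agree to numerical precision. Truncating to a smaller $m$ gives the approximation in equation~(\ref{eqn:diffpres}).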

293:

294: The choice of the parameters $m$ and $t$ is determined by the fall-off of the eigenvalue spectrum as well

295: as the problem at hand (e.g., clustering, classification, regression,

296: or data visualization).  An objective measure

297: of performance should be defined and utilized to find data-driven best choices for these tuning parameters.

298: In this work, the final goal

299: is regression and prediction of redshift. In the next section, we show how the number of coordinates, $m$, can

300: be chosen by cross-validation, once one has defined an appropriate statistical ``risk'' function. The particular

301: choice of $t$, on the other hand, will not matter in the regression framework, as it will only represent a

302: rescaling of the $m$ selected basis vectors.
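The claim that $t$ only rescales the basis can be illustrated with a short numerical check. The sketch assumes that the coefficients are estimated by ordinary least squares, which is an assumption of this example. The basis vectors, eigenvalues, and responses are random stand-ins rather than real data.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 4
psi = rng.normal(size=(n, m))           # stand-in for psi_1..psi_m at n points
lam = np.array([0.9, 0.7, 0.5, 0.3])    # hypothetical eigenvalues
y = rng.normal(size=n)                  # hypothetical responses

def fitted_values(B, y):
    """Least-squares fit of y on the columns of B, returned as fitted values."""
    beta, *_ = np.linalg.lstsq(B, y, rcond=None)
    return B @ beta

# Columns lambda_j^t * psi_j span the same space for every t.
fit_t1 = fitted_values(psi * lam**1, y)
fit_t5 = fitted_values(psi * lam**5, y)
```

The two sets of fitted values coincide: rescaling column $j$ by $\lambda_j^t$ is absorbed by a compensating rescaling of $\widehat{\beta_j}$.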

303:

304: \section{Adaptive Regression Using Orthogonal Eigenfunctions}

305: \label{sect:regress}

306: Our next problem is how to, in a statistically rigorous way, predict a function $y=r(\mathbf{x})$ (e.g., redshift, age, or metallicity of galaxies) of data

307: (e.g., spectrum $\mathbf{x}$) in very high dimensions using a sample

308: of known pairs ($\x,y$). As before, imagine that our data are points in $\mathbb{R}^p$, but that the

309: natural variations in the system are along a low dimensional space $\mathcal{X} \subset \mathbb{R}^p$.

310: %In other words, $p$ is very large but the intrinsic dimension of $\mathcal{X}$, which is determined by the natural variations of the system, is considerably

311: %smaller.

312: The set $\mathcal{X}$ could, for example, be a non-linear submanifold embedded in $\mathbb{R}^p$.

313: In our toy example in Figure \ref{fig:spiral}, $\mathcal{X}$ is the one-dimensional spiral, but the data are observed

314: in $p=2$ dimensions.

315: The key idea is that one may view the eigenfunctions from PCA or diffusion maps

316: (a) as {\em coordinates} of the data points, as shown in the previous section,

317: or (b) as forming a {\em Hilbert orthonormal basis} for any function (including the regression function $r(\mathbf{x})$) supported on the

318: subset $\mathcal{X}$. Rather than applying an arbitrarily chosen prediction scheme in the computed diffusion or PC space (as in, e.g., \citeauthor{Li2005}, \citeauthor{Zhang2006}, and \citeauthor{ReFiorentin2007}), we utilize the latter insight to formulate a general regression and risk estimation framework. %for high-dimensional inference.

319:

320: Any function $r$ satisfying $\int r(\x)^2 \, d\x < \infty$, where $\x \in \mathcal{X}$, can be written as

321: \begin{equation}

322: r(\x) = \sum_{j=1}^{\infty} \beta_j \psi_j(\x) \,,

323: \label{eqn:orthonorm}

324: \end{equation}

325: where the sequence of functions $\{\psi_1,\psi_2,\cdots\}$ forms an

326: orthonormal basis.  The choice of basis functions is traditionally {\em not} adapted to the geometry of the data, or the set $\mathcal{X}$.

327: Standard choices are, for example, Fourier or wavelet bases for $\mathbf{L}^2(\mathbb{R}^p)$, which are constructed as tensor

328: products of one-dimensional bases. The latter approach makes sense for low dimensions, for example for $p=2$, but quickly becomes

329: intractable as $p$ increases (see, e.g., \citealt{Bellman:61} for the ``curse of dimensionality''). In particular, note that if a wavelet basis

330: in one dimension consists of $q$ basis functions, and hence

331: requires the estimation of $q$ parameters, the naive tensor basis in $p$ dimensions will have $q^p$ basis functions/parameters,

332: creating an impossible inference problem even for moderate $p$.

333: Because this basis is not adapted to $\mathcal{X}$, there is little hope of

334: finding a subset of these basis functions which will

335: do an adequate job of modeling the response.

336: %although for any particular problem

337: %one strives to represent any sufficiently smooth function with

338: %as small a subset of basis functions as possible.

339:

340: In this work, we propose a new adaptive framework where the basis functions reflect the intrinsic geometry of the data.  Furthermore, we use a formal statistical method to estimate the risk and the optimal parameters in the model. First, rather than using a generic tensor-product basis for the high-dimensional space $\mathbb{R}^p$, we

341: construct a data-driven

342: basis for the lower-dimensional, possibly non-linear set $\mathcal{X}$ where the data lie.

343: Let $\{{\psi_1},{\psi_2},\cdots,{\psi_n}\}$ be the orthogonal eigenfunctions computed by PCA or diffusion maps.

344: Our regression function estimate $\widehat{r}(\x)$ is then given by

345: \begin{equation}

346: \widehat{r}(\x) = \sum_{j=1}^{m} \widehat{\beta_j} {\psi_j}(\x),

347: \label{eqn:orthoreg}

348: \end{equation}

349: %equation~(\ref{eqn:orthoreg}),

350: where the different terms in the series expansion represent the

351: fundamental eigenmodes of the data, and $m \leq n$ is chosen to

352: minimize the prediction risk that we will now define rigorously.
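A minimal sketch of the estimator in equation~(\ref{eqn:orthoreg}) follows. It assumes that the basis functions evaluated at the $n$ data points are stored as the columns of an array {\tt Psi}, and that the coefficients are estimated by least squares. The function names and the synthetic response are hypothetical.

```python
import numpy as np

def fit_coefficients(Psi, y, m):
    """Least-squares estimates of beta_1..beta_m using the first m basis columns."""
    beta, *_ = np.linalg.lstsq(Psi[:, :m], y, rcond=None)
    return beta

def predict(Psi, beta):
    """Evaluate r_hat = sum_j beta_j * psi_j at the rows of Psi."""
    return Psi[:, :len(beta)] @ beta

# Synthetic example: five orthonormal basis vectors on 40 points,
# with a response built from the first and third of them.
rng = np.random.default_rng(2)
Psi, _ = np.linalg.qr(rng.normal(size=(40, 5)))
y = 2.0 * Psi[:, 0] - 1.0 * Psi[:, 2]

beta_hat = fit_coefficients(Psi, y, m=3)
y_hat = predict(Psi, beta_hat)
```

Because the columns of {\tt Psi} are orthonormal in this example, the least-squares coefficients reduce to the inner products of $y$ with each basis vector, and the fit recovers the coefficients $(2, 0, -1)$.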

353:

354: \subsection{Risk: Theory and Estimation}

355: \label{sect:risk}

356:

357: A key aspect of our approach is that the choice of the models is driven by the minimization of a well-justified, objective error criterion

358: which compensates for overfitting. This is critical, as any basis could be utilized to fit the observed data well; this does not provide,

359: however, any assurance that the model applies beyond these data.

360: To begin, we establish the standard stochastic framework within which regression models are assessed.

361: We are given $n$ pairs of observations $(X_1,Y_1), \ldots, (X_n, Y_n)$, with the task of predicting the

362: response $Y=r(X)+\epsilon$ at a new data point $X=\x$, where $\epsilon$ represents random noise.

363: (In {\S}\ref{sect:app}, the response $Y$ is the redshift, $z$, and $X$ is a complete spectrum.)

364: In nonparametric regression by orthogonal functions,

365: one assumes that $r(\x)$ is given

366: according to equation~(\ref{eqn:orthonorm}), with its estimator given

367: by equation~(\ref{eqn:orthoreg}), with $m \leq n$ where $\{\psi_j\}$

368: is a fixed basis.

369: %An estimator of $r(\x)$ typically has the form

370: %\begin{equation}

371: %\widehat{r}(\x)=\sum_{j=1}^{m} \widehat{\beta_j} \psi_j(\x),

372: %\label{eqn:orthoreg}

373: %\end{equation}

374: %where $m \leq n$ and $\{\psi_j\}$ is a fixed basis.

375: The primary goal is to minimize the

376: {\em prediction risk} (i.e., expected error), commonly quantified by

377: the mean-squared error (MSE)

378: \begin{equation}

379: R(m)=\mathbb{E}\left[\left(Y-\widehat{r}(X)\right)^2\right],

380: \label{eqn:MSE}

381: \end{equation}

382: where the expectation $\mathbb{E}[\cdot]$ averages over everything that is random,

383: including the evaluation points $X$, the

384: responses $Y$, and the estimates $\widehat{\beta_j}$.

386: This leads to protection against overfitting: if a basis

387: function $\psi_j$ is unnecessarily included in the model,

388: its coefficient $\widehat{\beta_j}$ will only add

389: variance to

390: $\widehat{r}(X)$ and not improve the fit, hence increasing $R(m)$.

391: (On the other hand, as $m$ becomes too small,

392: the estimator becomes increasingly biased, also increasing $R(m)$.)

393: Thus, the ideal choice of $m$ is neither too large, nor too small.

394: In nonparametric statistics, this is dubbed the ``bias-variance tradeoff''

395: (see, e.g., \citealt{Wasserman2006}).

396: A secondary goal is {\em sparsity}; more specifically,

397: among the estimators with a small risk,

398: we prefer representations with a smaller $m$.

399:

400: Since $R(m)$ is a population quantity, one needs to appropriately estimate it from the data.

401: An estimate computed on the same data used to fit the model will underestimate the error and favor overfitted models (too many terms, hence high variance).

402: Here we will use the method of $K$-fold cross-validation

403: (see, e.g., \citeauthor{Wasserman2006}) to achieve

404: a better estimate of the prediction risk. The basic idea is to randomly split the data set into $K$ blocks

405:  of approximately the same size; $K=10$ is a common choice. For $k=1$ to $K$, we delete block $k$ from the data. We then fit the model to the

406: remaining $K-1$ blocks and compute the observed squared error $\widehat{R}_{(-k)}(m)$ on the $k$th block which was not included in the fit. The CV estimate of the risk is defined as $\widehat{R}_{CV}(m)=\frac{1}{K}\sum_{k=1}^{K} \widehat{R}_{(-k)}(m)$.

407: It can be shown that this quantity is an approximately unbiased estimate of the true error $R(m)$.

408: Thus, we choose the model parameters that minimize the CV estimate $\widehat{R}_{CV}(m)$ of the risk, i.e.,

409: we take $m_{\rm opt} = \arg \min_m \widehat{R}_{CV}(m)$.
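The cross-validation procedure just described can be sketched as follows. The sketch assumes a least-squares fit on the first $m$ columns of a precomputed basis array {\tt Psi}, and it uses the mean squared error on each held-out block as $\widehat{R}_{(-k)}(m)$. The function names {\tt cv\_risk} and {\tt choose\_m} and the synthetic data are hypothetical.

```python
import numpy as np

def cv_risk(Psi, y, m, K=10, seed=0):
    """K-fold CV estimate of the prediction risk for a model with m basis functions."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)   # K blocks of nearly equal size
    block_risks = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        beta, *_ = np.linalg.lstsq(Psi[train, :m], y[train], rcond=None)
        resid = y[held_out] - Psi[held_out, :m] @ beta
        block_risks.append(np.mean(resid**2))       # error on the deleted block
    return np.mean(block_risks)

def choose_m(Psi, y, m_max, K=10, seed=0):
    """Return the m in 1..m_max that minimizes the CV risk, plus all risk estimates."""
    risks = [cv_risk(Psi, y, m, K, seed) for m in range(1, m_max + 1)]
    return int(np.argmin(risks)) + 1, risks

# Synthetic example: only the first two of 20 basis functions carry signal.
rng = np.random.default_rng(3)
Psi = rng.normal(size=(200, 20))
y = 3.0 * Psi[:, 0] - 2.0 * Psi[:, 1] + 0.5 * rng.normal(size=200)
m_opt, risks = choose_m(Psi, y, m_max=20)
```

Using the same seed for every $m$ keeps the blocks fixed across models, so differences in the estimated risk reflect the models rather than the random split. In this example the estimated risk drops sharply from $m=1$ to $m=2$, and additional terms beyond the second can only contribute variance.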

410:

411: Finally, we note that the ideas of CV introduced here generalize to cases where the model

412: parameters are of higher dimension. For example, in the diffusion

413: map case, the risk is minimized over both the bandwidth $\epsilon$ and the number of eigenfunctions $m$. The CV estimate of the

414: risk is implemented in the same fashion, but the search space for finding the minimum is larger.

415: In what follows, the notation will make it clear which

416: model parameters we are minimizing over by writing, for

417: example, $R(\epsilon, m)$.

418:

419: To summarize, our claim is that the proposed regression framework will lead to efficient inference in high

420: dimensions, as we are effectively performing regression in a lower-dimensional space $\mathcal{X}$ that

421: captures the natural variations of the data, where the optimal

422: dimensionality is chosen to minimize prediction risk in our regression

423: task. Finally, the use of eigenfunctions in both the data parameterization

424: and in the regression formulation provides an elegant, unifying framework for analysis and prediction.

425:

426: %Here, $J \leq m$ and is chosen by using an appropriate risk

427: %estimator, such as cross-validation (see, e.g., \citet{Wasserman2007}),

428: %rather than in ad hoc manner of, e.g., \citeauthor{Li2005},

429: %\citeauthor{Zhang2006}, and \citeauthor{ReFiorentin2007}

430: %The smoother the true regression function $r$, the fewer basis terms

431: %$J$ will be needed to represent it.

432: %The estimated orthonormal basis $\{\hat{\psi}\}$ {\bf SHOULD} converge more

433: %quickly to the true underlying basis in (\ref{orthonorm}) than an

434: %arbitrarily chosen basis.  We {\bf SHOULD} thus be able to obtain better

435: %estimates of $r$ than by using PCA or diffusion mapping eigenfunctions

436: %that by using an arbitrary basis.

437:

438: \section{Redshift Prediction Using SDSS Spectra}

439:

440: \label{sect:app}

441:

442: We apply the formalism presented in {\S}{\S}\ref{sect:diff}--\ref{sect:regress}

443: to the problem of predicting redshifts for a sample of SDSS spectra.

444: Physically similar objects residing at similar redshifts will have

445: similar continuum shapes as well as absorption lines occurring at

446: similar wavelengths.  Hence the %$\mathbf{L}^2$

447: Euclidean distances between their spectra will be small.

448:  The proposed regression framework with diffusion map or PC

449:  coordinates provides a natural means by which to predict

450: redshifts.  Furthermore, it is computationally efficient, making its

451: use appropriate for large databases such as the SDSS;

452: one can use these predictions to

453: inform more computationally expensive techniques by narrowing down

454: the relevant parameter space (e.g., the redshift range or the

455: set of templates in cross-correlation techniques).

456: Adaptive regression also provides a useful

457: tool for quickly identifying anomalous data points (e.g., objects

458: misclassified as galaxies), galaxies that have relatively rare

459: features of interest, and

460: galaxies whose SDSS redshift estimates may be incorrect.

461:

462: \subsection{Data Preparation}

463:

464: Our initial data sample consists of spectra that are classified as galaxies

465: from ten arbitrarily chosen spectroscopic plates of SDSS DR6

466: (0266$-$0274 inclusive, and 0286; \citealt{Adelman2008}).

467: We remove spectra from this sample by applying three cuts.  The first

468: is motivated by aperture considerations: we analyze only those spectra

469: with SDSS redshift estimates $z_{\rm SDSS} \geq$ 0.05.

470: To include spectra  with $z_{\rm SDSS} < 0.05$

471: would be to add an extra source of variation that would

472: adversely impact regression analysis.  The second cut is based on bin flags.

473: To avoid calibration issues observed at both the low and high

474: wavelength ends, we remove the first 100 and last 250 wavelength bins

475: from each spectrum;

476: then we determine what proportion of the remaining 3500 bins are flagged

477: as bad.  If this proportion exceeds 10\%, we remove the spectrum from the

478: sample; if not, we retain the reduced spectrum for further analysis.

479: We provide details on the third cut below.

480: The application of these cuts reduces our sample size from 5057

481: to 3835 galaxies.

482:

483: %(reducing the wavelength range to 3940-8850\AA)

484: %These spectra span the wavelength range 3800-9200\AA, with

485: %uniform binning in $\log_{10}$-space ($\Delta \log_{10} \lambda$ = 10$^{-4}$).

486:

487: We further process each spectrum in our sample as follows.

488: \begin{itemize}

489: \item We replace the flux values in the vicinity of

490: prominent atmospheric lines at 5577~\AA, 6300~\AA, and 6363~\AA~with

491: the sample mean of the nine closest bins on either side of each line.

492: The flux errors are estimated by averaging (in quadrature)

493: the standard errors of the fluxes for these bins.

494: \item We similarly replace the flux values in each bin flagged by SDSS as

495: part of an emission line, with flux and flux error estimates based

496: upon the closest 50 bins on either side of the line.  (Within this group

497: of 100 bins, we do not include those that are themselves flagged as

498: emission lines.)

499: We do this because highly variable emission line strengths

500: can strongly bias distance calculations.

501: \item Last, after replacing flux values as necessary, we normalize

502: each spectrum to sum to 1 to mitigate variation due to differences in luminosity between

503: similar galaxies at similar redshifts.

504: \end{itemize}
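The replacement and normalization steps listed above can be sketched as follows. This is a simplified illustration, not the processing code used for the sample. For each flagged bin it averages the unflagged bins within a window of {\tt n\_side} bins on either side of that bin, whereas the procedure above counts bins from the edges of each line. Flux errors are not handled here, and the function name and toy spectrum are hypothetical.

```python
import numpy as np

def replace_flagged_and_normalize(flux, flagged, n_side=50):
    """Replace flagged bins by the mean of nearby unflagged bins, then normalize to sum 1."""
    flux = np.asarray(flux, dtype=float)
    flagged = np.asarray(flagged, dtype=bool)
    out = flux.copy()
    for i in np.flatnonzero(flagged):
        lo, hi = max(0, i - n_side), min(len(flux), i + n_side + 1)
        window = np.arange(lo, hi)
        good = window[~flagged[window]]     # skip bins that are themselves flagged
        if good.size:
            out[i] = flux[good].mean()      # average of the original, unflagged fluxes
    return out / out.sum()

# Toy spectrum (hypothetical): flat continuum with one strong flagged line.
flux = np.ones(20)
flux[10] = 100.0
flagged = np.zeros(20, dtype=bool)
flagged[10] = True
cleaned = replace_flagged_and_normalize(flux, flagged, n_side=3)
```

After the flagged line is replaced by the local continuum, the toy spectrum is flat, and the normalization assigns each of the 20 bins a value of $1/20$.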

505:

506: In its data reduction pipeline, SDSS estimates spectroscopic redshifts,

507: $z_{\rm SDSS}$, standard errors, $\sigma_{z_{\rm SDSS}}$, and

508: ``confidence levels,'' CL, the latter of which are functions

509: of the strengths of observed lines (and thus should

510: not be interpreted probabilistically).\footnote{

511: See {\tt http://www.sdss.org/dr6/algorithms/redshift\_type.html}.}

512: Lacking knowledge of the true redshifts in our sample, we use

513: $z_{\rm SDSS}$ and $\sigma_{z_{\rm SDSS}}$ to fit our regression model.

514: Since poorly estimated redshifts can bias the model,

515: we divide our data sample into two groups, fitting with only

516: those 2793 galaxies with CL $>$ 0.99.

517: We then use the fitted model to predict redshifts for the other 1042 galaxies.

518: (It is here that we make our third data cut: to avoid issues of extrapolation,

519: we remove 19 of the 1061 spectra with CL $\leq$ 0.99 whose SDSS redshift estimates

520: lie outside the range of our training set, i.e., those with $z_{\rm

521:   SDSS} > 0.50$.)  As shown in Figure \ref{fig:zdesign}, the distributions of

522: redshifts in our high- and low-CL samples are similar, implying that

523: predicted redshifts for low-CL galaxies from the model built on

524: high-CL galaxies should not be systematically biased.
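As a bookkeeping check on the sample sizes quoted above (simple arithmetic on the stated counts, added here for convenience; how the removed spectra divide between the first two cuts is not stated in this section):
\begin{eqnarray}
2793 + 1042 &=& 3835 \,, \nonumber \\
1061 - 19 &=& 1042 \,, \nonumber \\
5057 - 3835 &=& 1222 \,. \nonumber
\end{eqnarray}
That is, the three cuts together remove 1222 spectra, 19 of them in the third cut, and the training set comprises $2793/3835 \approx 73\%$ of the final sample.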

525:

526: \subsection{Analysis}

527: \label{sect:anal}

528:

529: % Redundant with line immediately above.

530: %Then, using the regression model presented in

531: %{\S}\ref{sect:regress} we can regress the SDSS redshift estimates on the

532: %diffusion map coordinates to find galaxies for which our

533: %predicted redshift values do not agree with the corresponding SDSS estimates.

534:

535: %In its spectral reduction pipeline, SDSS estimates spectroscopic

536: %redshifts $z_{\rm SDSS}$ by (a) using a reference line list\footnote{

537: %\scriptsize \tt http://www.sdss.org/dr6/algorithms/linestable.html}

538: %to identify emission lines that they detect using a wavelet-based

539: %procedure, and (b)

540: %cross-correlating emission-line-masked, continuum-subtracted

541: %spectra with star, galaxy, and quasar templates.\footnote{

542: %See {\scriptsize \tt http://www.sdss.org/dr6/algorithms/redshift\_type.html}.}

543:

544: In this section, we apply both PCA and the diffusion map to our sample

545: and predict redshift using the

546: regression model introduced in {\S}\ref{sect:regress}.  We provide

547: details on the PCA algorithm in Appendix \ref{sect:pca}.

548:

549: In the diffusion map analysis,

550: we begin by calculating Euclidean distances between spectra

551: \begin{equation}

552:    s(\x, \y)~=~ \sqrt{\sum_k (f_{\x,k}-f_{\y,k})^2} \,,

553: \end{equation}

554: where $f_{\x,k}$ and $f_{\y,k}$ are the normalized fluxes in bin $k$ of

555: spectra $\x$ and $\y$, respectively.  We use these distances and a

556: chosen value of

557: $\epsilon$ to construct both the weights for the graph (see equation

558: \ref{eqn:diffw}) and the transition

559: matrix $\P$ (see equation \ref{eqn:diffp}), from which eigenmodes are

560: generated.  Below we

561: discuss how we select the optimal value of $\epsilon$.

562: As stated in {\S}\ref{sect:diff}, the value of the parameter $t$

563: (see equation \ref{eqn:diffusion_map}) is unimportant in

564: the context of regression, as any change in $t$ would be met

565: with a corresponding

566: rescaling of the coefficients $\widehat \beta_j$ in the regression model,

567: such that predictions are unchanged.
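To spell out this invariance, we give a brief sketch, assuming that the regression estimate is linear in the $m$ retained diffusion coordinates and that the retained eigenvalues are nonzero (the primed coefficients are notation introduced here):
\begin{eqnarray}
\widehat r(\x) ~=~ \sum_{j=1}^{m} \widehat \beta_j \, \lambda_j^t \psi_j(\x) ~=~ \sum_{j=1}^{m} \widehat \beta_j' \, \psi_j(\x) \,, \qquad \widehat \beta_j' ~=~ \lambda_j^t \, \widehat \beta_j \,. \nonumber
\end{eqnarray}
Replacing $t$ by $t'$ multiplies the $j$th column of the design matrix by the constant $\lambda_j^{t'-t}$. This leaves the column space unchanged, so the least-squares fitted values are identical and only the coefficients change, by the factor $\lambda_j^{t-t'}$.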

568:

569: In Figure \ref{fig:zmaps} we plot the embedding of

570: the 2793 galaxies with CL $>$ 0.99

571: in the first three PC and diffusion map

572: coordinates (e.g., $\lambda_i^t\psi_i(\cdot)$ in equation \ref{eqn:diffusion_map}).

573: We observe that the structure of each of these reparameterizations of

574: the original data corresponds in a simple way to $\log_{10}(1+z_{\rm

575:   SDSS})$.  These embeddings are a useful way to visualize the data

576: and to qualitatively identify subgroups of data and peculiar data points.

577:

578: % fig:zmaps was here

579:

580: In the next stage of analysis we use the computed eigenfunctions to

581: predict $z$ for our sample of 3835 galaxies.

582: We regress $z_{\rm SDSS}$ upon the diffusion map (and PC) eigenmodes

583: (cf.~equation~\ref{eqn:orthoreg}, where $\widehat r$ represents

584: our redshift estimates), weighting each data point by the

585: inverse variance of its $z_{\rm SDSS}$, 1/$\sigma_{z_{\rm SDSS}}^2$,

586: to account for the uncertainties in $z_{\rm SDSS}$ measurements.

587: We repeat this step for a sequence of

588: $m$ (and $\epsilon$) values, determining the optimal values of each

589: by minimizing the prediction risk $R(\epsilon,m)$,

590: estimated via ten-fold cross-validation (see equation~\ref{eqn:MSE}

591: and subsequent discussion).  It is in this regression step that

592: we clearly observe the advantage of using diffusion maps over

593: principal components.  In Figure \ref{fig:zrisk} we show that

594: diffusion map achieves significantly lower

595: CV prediction risk for most choices of model size $m$ and

596: obtains a much lower minimum $\widehat{R}_{\rm CV}$, i.e.,

597: the optimal low-dimensional diffusion map

598: representation of our data captures the trend in $z$ better than the

599: PC representation.  Note that the trend in $\widehat{R}_{\rm CV}$ for both

600: PC and diffusion map basis functions is to decrease with increasing

601: model size for small models and to increase with increasing model size

602: for larger models.  This is the ``bias-variance tradeoff" that was

603: referred to in {\S}\ref{sect:risk}: as the size (complexity) of our model

604: increases, the bias of the model decreases while the variance of the

605: model increases.  Prediction risk is the sum of the squared bias and

606: variance of a model, explaining the behavior observed in Figure

607: \ref{fig:zrisk}: for small models, increasing the model size leads to

608: a decrease in bias that overwhelms

609: the increase in variance, while for large models, increasing the model size

610: produces a minimal decrease in bias and a relatively large increase in variance.
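For reference, the weighted fit and the decomposition invoked above can be written as follows. This is a sketch in which we assume a regression function that is linear in the $m$ retained basis functions, suppress any intercept term, and write $\sigma_i$ for the SDSS standard error $\sigma_{z_{\rm SDSS}}$ of galaxy $i$; the index $i$ runs over the $n$ galaxies used in the fit:
\begin{eqnarray}
\widehat \beta ~=~ \arg\min_{\beta} \sum_{i=1}^{n} \frac{1}{\sigma_i^2} \left( z_{{\rm SDSS},i} - \sum_{j=1}^{m} \beta_j \psi_j(\x_i) \right)^2 \,, \nonumber
\end{eqnarray}
so that galaxies with precise SDSS redshifts constrain the fit more strongly. At a fixed point $\x$, the mean squared error of the estimate $\widehat r(\x)$ of the true regression function $r(\x)$ splits as
\begin{eqnarray}
E\left[ \left( \widehat r(\x) - r(\x) \right)^2 \right] ~=~ \left( E[\widehat r(\x)] - r(\x) \right)^2 + {\rm Var}\left[ \widehat r(\x) \right] \,, \nonumber
\end{eqnarray}
where the first term (squared bias) typically shrinks as $m$ grows and the second (variance) typically grows with $m$.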

611: %It is also a sparser representation, requiring

612: %less than half the number of eigenfunctions (42 vs.~93).

613: %Restating the previous two sentences,

614: %{\em the diffusion map approach yields better redshift

615: %predictions than PCA, with a model that is more parsimonious than the

616: %best-fitting PCA model}.

617:

618: In Table \ref{tab:zreg}, we show the parameters for the

619: optimal (minimal $\widehat{R}_{\rm CV}$) diffusion map and PC regression models.

620: Note that since our original data were in 3500

621: dimensions, our optimal diffusion map model achieves

622: a 96.4\% reduction in dimensionality.  If we were to choose

623: an arbitrary small model size, as is often done in the literature, our

624: estimated prediction risk would be markedly higher.  For example, for model

625: sizes $m = 10$ and 20, the CV prediction risks for regression on PC

626: basis functions are 0.305 and 0.209, respectively (compared to optimal

627: value 0.193), while regression on

628: diffusion map basis functions yields $\widehat{R}_{\rm CV}$ of 0.295

629: and 0.191, respectively (compared to optimal value 0.134).  The choice of

630: $\epsilon$ in the diffusion map model also has a significant impact on

631: results.  For values of $\epsilon$ that are too small, CV risks are

632: extremely large because the data points are no longer connected in the

633: diffusion process and consequently large outliers occur in the

634: diffusion map parameterization.  Likewise, large values of $\epsilon$

635: yield large prediction risks due to the large weights given to

636: connections between dissimilar data points.
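The figures quoted at the start of this paragraph can be cross-checked with simple arithmetic. The model size below is inferred from the rounded percentage and is not read from Table \ref{tab:zreg}, which remains the authoritative source:
\begin{eqnarray}
1 - \frac{m}{3500} ~\approx~ 0.964 ~~\Longrightarrow~~ m ~\approx~ 0.036 \times 3500 ~=~ 126 \,, \nonumber
\end{eqnarray}
with any $m$ from 125 to 127 rounding to the same percentage. Likewise, the ratio of the optimal risks is $0.134/0.193 \approx 0.69$, i.e., the minimum $\widehat{R}_{\rm CV}$ for diffusion map is roughly 31\% below that for PCA, while fixing $m = 20$ would raise the diffusion map risk by $0.191/0.134 - 1 \approx 43\%$ relative to its optimum.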

637:

638: In Figure \ref{fig:zreg} we plot predictions and prediction

639: intervals for all galaxies in

640: our sample using our optimal diffusion map model.

641: (See Appendix \ref{sect:predint} for a discussion of prediction

642: intervals.)

643: Most of our predictions are in close correspondence with the SDSS

644: estimates.  We observe that the disparity between

645: our redshift estimates and the SDSS estimates is positively correlated with $1-{\rm CL}$ (Figure

646: \ref{fig:cl}), meaning that galaxies for which our estimates disagree

647: with SDSS estimates are more likely to be galaxies with low CL.

648:

649: There are 54 outliers at the $4\sigma$ level. Visual inspection of

650: their spectra indicates that 39 appear to fit the template assigned by

651: SDSS.  Of these, 27 are well-described by the LRG template.  In

652: Figure \ref{fig:flux} we show that most of the outliers that are

653: well-fit by their SDSS templates are faint objects.  A plausible

654: explanation for their classification as outliers is low S/N in their

655: measured spectra.  Faint galaxies with strong emission lines will

656: generally have accurate SDSS redshifts but can be outliers in

657: the diffusion map because noisy spectra induce higher Euclidean

658: distances.  In a future paper we will introduce a method to account

659: for errors in the original measured data

660: that corrects both for errors in Euclidean distance

661: computations and random errors in the diffusion map coordinates.

662:

663:

664: The 15 other outliers show interesting and/or anomalous features.

665: Four spectra appear to be LRG-type galaxies with abnormal emission

666: and/or absorption features, at least two of which are likely

667: attributable to calibration errors (see Figure \ref{fig:outliers}a,b).

668: One spectrum is clearly a QSO (Figure \ref{fig:outliers}c), one shows

669: only sky subtraction residuals (Figure \ref{fig:outliers}d), and two others are

670: obvious mismatches to their SDSS

671: templates due to absorption lines whose depths do not match their

672: assigned template.  Four outliers have abnormal bumps (possibly

673: continuum jumps due to instrumental artifacts; see Figure

674: \ref{fig:outliers}e,f) that resemble wide emission features.

675: One outlying galaxy has a spectrum that looks like

676: a late-type galaxy with no emission lines, meaning it is likely a

677: K+A post-starburst galaxy.  Another outlier has an anomalous emission

678: feature around 6000~\AA\ in the rest frame (Figure \ref{fig:outliers}g).

679: This is a possible lens

680: galaxy, but was not selected by the Sloan Lens ACS Survey (SLACS;

681: \citeauthor{Bolton2006}) because

682: the feature in question

683: occurs in close proximity to strong sky lines at 8800~\AA.  The final

684: outlier has a strong, wide emission feature in the

685: vicinity of H$\alpha$ but has no emission lines anywhere else in the

686: SDSS spectrum (Figure \ref{fig:outliers}h).

687: None of the outlying spectra shows conclusive evidence of an incorrect SDSS redshift

688: measurement (except for the aforementioned sky spectrum, which we

689: detect as a $30\sigma$ outlier).

690:

691:

692: %Manual inspection of these spectra show that

693: %(a) 2 are obviously misclassified QSOs;

694: %%two have been to QSO spectra by the SDSS routines and

695: %%were mislabeled as galaxy spectra.

696: %%{\bf (NOTE: 001 and 026)}

697: %(b) 15 of these outliers have strong emission lines

698: %({\bf conclusion?}); and (c)

699: %%{\bf (NOTE: 000,002,003,006,007,016,017,021,022,031,037,042,047,048,053)}.

700: %10 appear to have questionable

701: %$z_{\rm SDSS}$ values based on visual inspection.

702: %%{\bf (NOTE: 007 (0.857), 008 (0.741), 011 (0.897), 023 (0.998), 024 (0.831), 025 (0.478), 043 (0.962), 046 (0.111), 050 (0.997), 055 (0.502) ....CL is in parentheses; a few of these have anomalous features but still might have correct z...Peter, can you take a look at these?)}

703:

704: \subsection{Comparison With Other Methods}

705:

706: As discussed in {\S}1, many authors have applied PCA to galaxy spectra

707: in an attempt to reduce the dimensionality of the data space, but few attempt

708: to find simple relationships between the reduced data and the physical

709: parameters of interest; these exceptions include

710: \citeauthor{Li2005}, \citeauthor{Zhang2006}, and

711: \citeauthor{ReFiorentin2007}.

712: In all three cases, the authors use

713: PCA to estimate stellar and/or galactic parameters that are traditionally

714: estimated by laboriously measuring equivalent widths and fluxes

715: of individual lines, just as we have used diffusion map eigenfunctions

716: to estimate redshift, a physical parameter usually estimated through

717: computationally intensive cross-correlation methods.

718: We stress three advantages of our approach over those employed by the

719: above authors:

720: (1) we achieve much lower prediction

721: error using diffusion map coordinates than using PCA,

722: (2) we have an objective way of selecting the parameters of

723:  the model, and (3) we use a theoretically well-motivated regression

724:  model that takes statistical variations of the data into account and

725:  that unifies the data parameterization and regression algorithms.

726:

727: The aim of \citeauthor{Li2005}~is to estimate, e.g., the velocity

728: dispersion and reddening of a set of approximately 1500 galaxies

729: observed by SDSS.

730: They use PCA in two successive applications.

731: They first apply PCA

732: to the STELIB library to reduce 204 stellar spectra to 24 stellar eigenspectra.

733: These in turn are fit to SDSS DR1 spectra to create a library of 1016

734: galactic spectra, which are reduced to nine galactic eigenspectra.

735: The authors then regress observed equivalent widths (EW) and fluxes of

736: H$\alpha$ upon these nine eigenspectra.

737: They determine the number of eigenspectra to retain

738: by estimating noise variance in the stellar case

739: and by using the $F$ test to compute the significance of each additional

740: eigenspectrum in spectral reconstruction in the galactic case.  The latter

741: criterion, however, is not well suited to the task of parameter

742: estimation because

743: the appropriate number of components in the regression model depends

744: on the complexity of the dependence of those parameters as a function

745: of the basis elements, not on the complexity of the original spectra.

746: For example, the dependence of the EW of H$\alpha$ on the PC basis

747: functions may be a simple, smooth function while the flux dependence

748: may be a complex, bumpy relationship.  In this case, the optimal

749: regression model to predict EW would require fewer basis functions

750: than the optimal model for H$\alpha$ flux prediction.  Minimizing CV

751: risk would lead us to choose the correct number of basis functions for

752: each task, while the method of Li et al.\ would force us to use the same

753: (inappropriate) size for each model.

754:

755: \citeauthor{Zhang2006} attempt to predict stellar parameters by

756: regressing on PC coefficients using a kernel regression model with a

757:  variable window width. In their paper, they do not specify how to

758:  select the window

759: width (they introduce an arbitrary parameter $\lambda$) or how to

760: choose the correct number of PC basis functions (they use 3).

761: Their choice of a small

762: model size is likely due to the computational and statistical

763: difficulties that characterize kernel regression in high dimensions

764: \citep{Wasserman2006}.

765:

766: \citeauthor{ReFiorentin2007}~attempt to estimate

767: stellar atmospheric parameters (effective temperature, surface gravity,

768: and metallicity) from SDSS/SEGUE spectra.

769: They use PCA for dimension reduction, but set $m$ to an

770: arbitrary value (e.g., 50).

771: They then use an iterative, non-linear regression model (utilizing the

772: hyperbolic tangent function; see \citealt{Bailer-Jones2000}),

773: with an error function based on the residual sum-of-squares plus

774: a regularization term (see their equation 2). Again, the

775: choice of the regularization parameter is not justified.

776: %This methodology is similar to that used in the

777: %neural network community ({\bf ANN: CONFIRM THIS}).

778: We find that when applied to the same data

779: set of galaxy spectra, their model does not achieve lower CV risk than

780: our model for different choices of regularization parameter and model size.

781:

782: \section{Summary}

783:

784: \label{sect:summary}

785:

786: The purpose of this paper is two-fold.

787: First, we introduce the diffusion map method for data parameterization

788: and dimensionality reduction. We show

789: that for the types of high-dimensional and complex data sets

790: often analyzed in astronomy, diffusion map can yield

791: results far superior to those of commonly used methods such as PCA.  Moreover,

792: the simple, intuitive formulation of diffusion map as a method that

793: preserves the local interactions of a high-dimensional data set makes the

794: technique easily accessible to scientists who are not well versed in

795: statistics or machine learning.

796:

797: Second, we present a fast and powerful eigenmode-based framework for

798: estimating physical parameters in databases of high-dimensional

799: astronomical data.  In most astrophysical applications, PCA is used as

800: a data-explorative tool for dimensionality reduction,

801: with no formal methods

802: and statistical criteria for regression, risk estimation and selection

803: of relevant eigenvectors. Here we propose a statistically rigorous,

804: unified framework for

805: regression and data parameterization.  Our proposed regression model

806: combines basis functions in a simple and statistically-motivated

807: manner while our clear objective of risk minimization drives the

808: estimation of the model parameters.  Again, the simplicity of the

809: proposed method will make it appealing to the non-specialist.

810:

811:  We apply the proposed methodology to predict redshift for a sample of

812:  SDSS galaxy spectra, comparing the use of the proposed regression

813: model with PCA basis functions versus diffusion map basis functions.

814: We find that the prediction error for the diffusion-map-based approach

815: is markedly smaller than that of a

816: similar framework based on PCA. Our techniques are also more robust

817: than commonly used template matching

818: methods because they consider the structure of the entire

819: high-dimensional data set when reparameterizing the data.

820: Statistical inferences are based on this learned structure,

821: instead of considering each data point separately in an object-by-object

822: matching algorithm as is currently used by SDSS and commonly employed

823: throughout the astronomy literature.

824: Work in progress extends our approach to

825: photometric redshift estimation and to the estimation of the

826: intrinsic parameters (e.g., mean metallicities and ages) of galaxies.

827:

828: \begin{acknowledgments}

829: The authors would like to thank Jeff Newman for helpful conversations.

830: This work was supported by NSF grant \#0707059 and ONR grant N00014-08-1-0673.

831: \end{acknowledgments}

832:

833: \appendix

834:

835: \section{Principal Components Analysis}

836: \label{sect:pca}

837:

838: We first center our data (the normalized spectra with $p$ wavelength bins) so that $\frac{1}{n} \sum_{i=1}^{n} {\bf x}_i = 0$. The centered observations ${\bf x}_1, {\bf x}_2, \ldots, {\bf x}_n \in \mathbb{R}^p$ are then stacked into the rows of an $n \times p$ matrix ${\bf X}$. Note that the sample covariance matrix of ${\bf x}$ is given by the $p \times p$ matrix ${\bf S}= \frac{1}{n}{\bf X}^T{\bf X}$. In Principal Components Analysis (PCA), one computes the eigenvectors of the covariance matrix that correspond to the $m < p$ largest eigenvalues; denote these vectors by ${\bf v}_1, \ldots, {\bf v}_m \in \mathbb{R}^p$. In a PC map, the projections of the data onto these vectors are then used as new coordinates; i.e., the PC embedding of data point ${\bf x}_i$ is given by the map

839: $$ {\bf x}_i \mapsto \Psi_{\rm PCA}({\bf x}_i)=({\bf x}_i \cdot {\bf v}_1, \ldots, {\bf x}_i \cdot {\bf v}_m).$$

840: These projections are sometimes referred to as the principal components of ${\bf X}$.

841:

842: Algorithmically, the PC embedding is easy to compute using a singular value decomposition (SVD) of ${\bf X}$:

843: $$ {\bf X=U D V}^T. $$

844: Here ${\bf U}$ is an $n \times p$ matrix with orthonormal columns, ${\bf V}$ is a $p \times p$ orthogonal matrix (whose columns are the eigenvectors ${\bf v}_1, \ldots, {\bf v}_p$ of ${\bf S}$), and ${\bf D}$ is a $p \times p$ diagonal matrix with diagonal elements $d_1 \geq d_2 \geq \ldots \geq d_p \geq 0$ known as the singular values of ${\bf X}$. Since ${\bf XV}={\bf UD}$, the PC embedding of the $i$th data point in $m$ dimensions is given by the first $m$ elements of the $i$th row of ${\bf UD}$.
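The link between the SVD and the eigendecomposition of ${\bf S}$ follows in one line from ${\bf U}^T{\bf U} = {\bf I}$; we include the step here as a short sketch that uses only the definitions above:
$$ {\bf S} = \frac{1}{n} {\bf X}^T {\bf X} = \frac{1}{n} {\bf V D U}^T {\bf U D V}^T = {\bf V} \left( \frac{1}{n} {\bf D}^2 \right) {\bf V}^T . $$
Hence the columns of ${\bf V}$ are eigenvectors of ${\bf S}$ with eigenvalues $d_j^2/n$, the $j$th PC coordinate has sample variance $d_j^2/n$, and the fraction of the total variance captured by the first $m$ coordinates is $\sum_{j=1}^{m} d_j^2 / \sum_{j=1}^{p} d_j^2$.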

845:

846: \section{Prediction Intervals for Spectroscopic Redshift Estimates}

847:

848: \label{sect:predint}

849:

850: In any one fold of a ten-fold regression analysis, we fit to 90\% of the data,

851: generating predictions and prediction intervals

852: for the 10\% of the data withheld from the analysis.  A prediction interval

853: is {\it not} a confidence interval; the former

854: denotes a plausible range of values for a single observation, whereas the

855: latter denotes a plausible range of values for a parameter of the

856: probability distribution function from which that single observation is

857: sampled (e.g., the mean).

858:

859: Let $\bf X$ and $\bf \tilde X$ represent the matrices of independent variables

860: included in, and withheld from, regression analysis, respectively.  For

861: instance,

862: \begin{eqnarray}

863: {\bf \tilde X}~=~

864: \left(

865: \begin{array}{cccc}

866: \psi_1(x_1) & \cdots & \cdots & \psi_m(x_1) \\

867: \vdots      & \vdots & \vdots & \vdots \\

868: \psi_1(x_n) & \cdots & \cdots & \psi_m(x_n)

869: \end{array}

870: \right) \,, \nonumber

871: \end{eqnarray}

872: where $n$ is the number of withheld data and $m$ the number of

873: assumed basis functions.  (Here, we leave out factors of

874: $\lambda_j^t$, which are subsumed into the estimated

875: regression coefficients ${\widehat \beta}_j$.)  The vector of

876: redshift predictions for the withheld data is thus

877: \begin{eqnarray}

878: {\widehat z}~=~{\bf \tilde X} {\widehat \beta} \,, \nonumber

879: \end{eqnarray}

880: where $\widehat \beta$ is estimated from ${\bf X}$,

881: while the vector of prediction-interval half-widths is given by

882: \begin{eqnarray}

883: t_{\alpha/2,N-n-2} \widehat{\sigma} \sqrt{ {\bf \tilde X} \left( {\bf X}^T {\bf X} \right)^{-1} {\bf \tilde X}^T + 1 + \frac{1}{N-n} } \,,

884: \label{eqn:predint}

885: \end{eqnarray}

886: where $\widehat{\sigma}$ is the estimated standard deviation of the

887: random noise $\epsilon$ in the relationship $Y = r({\bf X}) + \epsilon$,

888: estimated from the residuals of the regression of $Y$ upon ${\bf X}$,

889: $t_{\alpha/2,N-n-2}$ is the critical t-value for a two-sided

890: 100(1-$\alpha$)\% prediction interval,

891: and $N$ is the total number of data points.  Equation (\ref{eqn:predint}) is

892: a multi-dimensional generalization of, e.g., equation (2.26) of

893: \citet{Weisberg2005}, taking into account that the mean of $\psi({\bf x})$ is

894: zero.
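Read elementwise, equation (\ref{eqn:predint}) gives one half-width per withheld galaxy. As a sketch of that reading (our interpretation of the notation: $\tilde{\bf x}_i$ denotes the $i$th row of ${\bf \tilde X}$ written as a column vector, and $h_i$ the corresponding half-width, both symbols introduced here):
\begin{eqnarray}
h_i ~=~ t_{\alpha/2,N-n-2} \, \widehat{\sigma} \sqrt{ 1 + \frac{1}{N-n} + \tilde{\bf x}_i^T \left( {\bf X}^T {\bf X} \right)^{-1} \tilde{\bf x}_i } \,, \nonumber
\end{eqnarray}
so that the interval for galaxy $i$ is $\widehat z_i \pm h_i$. The three terms under the square root account, respectively, for the noise in a single new observation, the uncertainty in the fitted mean level, and the uncertainty in the fitted coefficients, the last of which grows as $\tilde{\bf x}_i$ moves away from the bulk of the training data. Because the square root is at least 1, $h_i \geq t_{\alpha/2,N-n-2} \, \widehat{\sigma}$; for $\alpha = 0.05$ and degrees of freedom in the thousands, the critical value is close to the normal quantile 1.96.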

895:

896: \clearpage

897:

898: \begin{thebibliography}{}

899: \bibitem[Adelman-McCarthy et al.(2008)]{Adelman2008} Adelman-McCarthy, J.~K., et al.~2008, \apjs, 175, 297

900: \bibitem[Bailer-Jones(2000)]{Bailer-Jones2000} Bailer-Jones, C.~A.~L.~2000, \aap, 357, 197

901: \bibitem[Bellman(1961)]{Bellman:61} Bellman, R.~E.~1961, Adaptive Control Processes (Princeton Univ. Press)

902: \bibitem[Boroson \& Green(1992)]{BorosonGreen1992} Boroson, T.~A., \& Green, R.~F.~1992, \apjs, 80, 109

903: \bibitem[Bolton et al.(2006)]{Bolton2006} Bolton, A.~S., et al.~2006, \apj, 638, 703

904: \bibitem[Coifman \& Lafon(2006)]{Coifman:Lafon:06} Coifman, R.~R., \& Lafon, S.~2006, Appl. Comput. Harmon. Anal., 21, 5

905: \bibitem[Connolly et al.(1995)]{Connolly1995} Connolly, A.~J., Szalay, A.~S., Bershady, M.~A., Kinney, A.~L., \& Calzetti, D.~1995, \aj, 110, 1071

906: \bibitem[Folkes et al.(1999)]{Folkes1999} Folkes, S., et al.~1999, \mnras, 308, 459

907: \bibitem[Kemeny \& Snell(1983)]{KemenySnell1983} Kemeny, J.~G., \& Snell, J.~L.~1983, Finite Markov Chains (Springer)

908: \bibitem[Lafon \& Lee(2006)]{LafonLee2006} Lafon, S., \& Lee, A.~2006, IEEE Trans. Pattern Anal. and Mach. Intel., 28, 1393

909: \bibitem[Li et al.(2005)]{Li2005} Li, C., Wang, T.-G., Zhou, H.-Y., Dong, X.-B., \& Cheng, F.-Z.~2005, \aj, 129, 669

910: \bibitem[Madgwick et al.(2003)]{Madgwick2003} Madgwick, D.~S., et al.~2003, \apj, 599, 997

911: \bibitem[Re Fiorentin et al.(2007)]{ReFiorentin2007} Re Fiorentin, P., et al.~2007, \aap, 467, 1373

912: \bibitem[Rogers et al.(2007)]{Rogers2007} Rogers, B., Ferreras, I., Lahav, O., Bernardi, M., Kaviraj, S., \& Yi, S.~K.~2007, \mnras, 382, 750

913: \bibitem[Ronen, Arag\'on-Salamanca, \& Lahav(1999)]{Ronen1999} Ronen, S., Arag\'on-Salamanca, A., \& Lahav, O.~1999, \mnras, 303, 284

914: \bibitem[Vanden Berk et al.(2006)]{VDB2006} Vanden Berk, D.~E., et al.~2006, \aj, 131, 84

915: \bibitem[Wasserman(2006)]{Wasserman2006} Wasserman, L.~W.~2006, All of Nonparametric Statistics (New York: Springer)

916: \bibitem[Weisberg(2005)]{Weisberg2005} Weisberg, S.~2005, Applied Linear Regression (Hoboken: Wiley)

917: \bibitem[Yip et al.(2004a)]{Yip2004a} Yip, C.~W., et al.~2004, \aj, 128, 585

918: \bibitem[Yip et al.(2004b)]{Yip2004b} Yip, C.~W., et al.~2004, \aj, 128, 2603

919: \bibitem[Zhang et al.(2006)]{Zhang2006} Zhang, J., Wu, F., Luo, A., \& Zhao, Y.~2006, ChJAA, 30, 176

920: \end{thebibliography}

921:

922:

923: % The figures

924:

925: \begin{figure}

926: %\epsfig{figure=Fig3a.eps,height=2.3in}

927: \epsscale{0.7}

928: \plotone{f1a.eps}

929: \vspace{0.7in}

930: \epsscale{0.9}

931: \plottwo{f1b.eps}{f1c.eps}

932: \caption{An example of a one-dimensional manifold (dashed line) with Gaussian noise embedded in

933: two or higher dimensions.  The path (solid line) from $\x$ to $\y$ reflects the natural geometry of

934: the data set which is captured by the

935: diffusion distance between $\x$ and $\y$.

936: The plot on the lower left shows that the first diffusion map coordinate is a monotonically increasing

937: function of the

938: arc length of the spiral; this is not the case in the

939: lower right plot, which shows the same relationship for the first PC coordinate.}

940: \label{fig:spiral}

941: \end{figure}

942:

943: \clearpage

944: \begin{figure}

945: \epsscale{0.75}

946: \plotone{f2.eps}

947: \caption{Distributions of SDSS redshift estimates in our

948: high-CL (top) and low-CL (bottom) samples.  We train our regression

949: model using the 2793 high-CL galaxies only, then apply the fitted

950: model to the 1042 low-CL galaxies.}

951: \label{fig:zdesign}

952: \end{figure}

953:

954: \clearpage

955: \begin{figure}

956: %$\begin{array}{c}

957: %\epsfig{figure=zest_pcmap_ccode.ps,height=2.25in} \\

958: %\epsfig{figure=zest_dmap_ccode.ps,height=2.25in} \\

959: %\end{array}$\\

960: \epsscale{1}

961: \plottwo{f3a.eps}{f3b.eps}

962: \caption{Embedding of our sample of 2793 SDSS galaxy spectra with

963:   SDSS $z$ CL $> 0.99$ in

964: the first three PC and the first three diffusion map coordinates, respectively.

965: Color encodes the $\log_{10}(1+z_{\rm SDSS})$ values.  Both

966: maps show a clear correspondence with redshift.}

967: \label{fig:zmaps}

968: \end{figure}

969:

970: %\clearpage

971: %\begin{figure}

972: %%\epsfig{figure=outlier.eps,height=2.6in}

973: %\epsscale{0.75}

974: %\plotone{f3.eps}

975: %\caption{SDSS galaxy spectrum (with {\tt OBJID}) identified as an outlier

976: %($>$ 4$\sigma$) by the

977: %diffusion map-based regression, overlaid with SDSS template 29, which

978: %provided the highest CL $z_{\rm SDSS}$ estimate in template cross-correlation.

979: %The spectrum exhibits two anomalous features: a sharp, unexplained

980: %rise at low wavelengths and a broad emission feature at $\approx$ 4100 \AA.}

981: %\label{fig:out}

982: %\end{figure}

983:

984: \clearpage

985: \begin{figure}

986: %\epsfig{figure=zpred_risk.eps,height=2.3in} \\

987: \epsscale{0.75}

988: \plotone{f4.eps}

989: \caption{Risk estimates ($\widehat{R}_{\rm CV}$) for regression of $z$ on diffusion

990:   map coordinates and PCs. Diffusion map attains a lower

991:   risk for almost every number of coordinates in the regression. It also

992:   achieves a lower minimum risk as indicated by Table~\ref{tab:zreg}.

993: Risk estimates are based on 50 repetitions of 10-fold CV.  Thick lines

994: represent the mean risk at each model size and thin dotted lines are $\pm 1$

995: standard deviation bands.}

996: \label{fig:zrisk}

997: \end{figure}

998:

999: \clearpage

1000: \begin{figure}

1001: %\epsfig{figure=zpredictions.eps,height=4.6in}

1002: \epsscale{0.6}

1003: \plotone{f5.eps}

1004: \caption{

1005:   Redshift predictions using diffusion map coordinates for galaxies

1006:   with SDSS  CL $\le$ 0.99 (top)

1007:   and CL $>$ 0.99 (bottom), each plotted against $z_{\rm SDSS}$.

1008:   Error bars

1009:   represent 95\% prediction intervals.  Note that  CL $\le$ 0.99

1010:   redshift predictions are based on the model trained on CL $>$ 0.99

1011:   galaxies while CL $>$ 0.99 predictions are from 10-fold CV on CL

1012:   $>$ 0.99 galaxies.  For most galaxies, our

1013:   predictions are in close correspondence with SDSS estimates.}

1014: \label{fig:zreg}

1015: \end{figure}

1016:

1017: \clearpage

1018: \begin{figure}

1019: \epsscale{0.6}

1020: \plotone{f6.eps}

1021: \caption{Discrepancy between our predicted redshift values and $z_{\rm

1022:     SDSS}$ estimates versus log(1-CL).  There is a

1023:   correlation of 0.392 between the amount of discrepancy and 1-CL, meaning

1024:   that galaxies for which there are large differences between the two

1025:   redshift estimates tend to be objects whose SDSS redshift

1026:   confidences are low.  Horizontal lines denote 1, 3, and 5 $\sigma$

1027:   disparities.  Small random perturbations have been added to duplicate

1028:   log(1-CL) values to visualize galaxies with the same CL.  Galaxies with a

1029:   CL of 1.00 are assigned a mean log(1-CL) of $-4$.

1030: }

1031: \label{fig:cl}

1032: \end{figure}

1033:

1034: \clearpage

1035: \begin{figure}

1036: \epsscale{0.6}

1037: \plotone{f7.eps}

1038: \caption{Discrepancy between our predicted redshift values and $z_{\rm

1039:     SDSS}$ versus log(flux) of the original spectra. There is a

1040:   correlation of $-0.327$ between the amount of discrepancy and galaxy

1041:   brightness. Galaxies can be detected as outliers even

1042:     if they match well to their SDSS template (in color).  Low S/N

1043:     can cause normal galaxies with correct SDSS redshifts to be labeled

1044:     as outliers.  We also detect several

1045:     physically interesting objects as outliers (see Figure \ref{fig:outliers}).

1046: }

1047: \label{fig:flux}

1048: \end{figure}

1049:

1050: \clearpage

1051: \begin{figure}

1052: \epsscale{1}

1053: \plotone{f8.eps}

1054: \caption{Eight selected outliers with anomalous features.  Each

1055:   spectrum (solid blue) is plotted along with its SDSS template match

1056:   (dashed red).  Spectra are scaled to have the same sum of squared

1057:   (smoothed) fluxes over the same range of wavelengths.  For a

1058:   thorough discussion of

1059:   these outliers, see {\S}\ref{sect:anal}.}

1060: \label{fig:outliers}

1061: \end{figure}

1062:

1063: \clearpage

1064:

1065: \input{tab1}

1066:

1067: \end{document}

1068:

1069: