1: %\documentclass{emulateapj} % 2-column preprint
2: \documentclass[12pt,preprint]{aastex}
3:
4: \usepackage{natbib}
5:
6:
7: \newcommand{\R}{\mathbb{R}}
8: \newcommand{\bP}{\mathcal{P}}
9: \newcommand{\bQ}{\mathcal{Q}}
10: \newcommand{\bK}{\mathcal{K}}
11: \newcommand{\bS}{\mathcal{S}}
12: \newcommand{\bD}{\mathcal{D}}
13: \newcommand{\bA}{\mathcal{A}}
14: \newcommand{\ith}{$^\mathrm{th}$\ }
15: \newcommand{\ird}{$^\mathrm{rd}$\ }
16: \newcommand{\ind}{$^\mathrm{nd}$\ }
17: \newcommand{\lmin}{{L_\mathrm{min}}}
18: \newcommand{\lmax}{{L_\mathrm{max}}}
19: \newcommand{\E}{\mathsf{E}}
20: \newcommand{\Cov}{\mathsf{Cov}}
21: \newcommand{\Var}{\mathsf{Var}}
22: \newcommand{\hmu}{\hat \mu}
23: \newcommand{\ba}{\begin{eqnarray*}}
24: \newcommand{\ea}{\end{eqnarray*}}
25: \newcommand{\mpc}{\frac{\mathrm{km/s}}{\mathrm{Mpc}}}
26:
27: \slugcomment{Submitted to ApJ, 11/09/06}
\shorttitle{Mapping the Cosmological Confidence Ball Surface}
29: \shortauthors{Bryan et al.}
30:
31: \citestyle{aa}
32:
33: \begin{document}
34: \title{Mapping the Cosmological Confidence Ball Surface}
35: \author{Brent Bryan and Jeff Schneider}
36: \affil{Department of Machine Learning, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213}
37: \email{\{bryanba, schneide\}@cs.cmu.edu}
38:
39: \author{Christopher J. Miller}
40: \affil{Cerro Tololo Interamerican Observatory, Casilla 603, La Serena, Chile}
41: \email{cmiller@noao.edu}
42:
43: \author{Robert C. Nichol}
44: \affil{Institute of Cosmology and Gravitation, University of Portsmouth, Portsmouth, PO1 2EG, UK}
45: \email{bob.nichol@port.ac.uk}
46:
47: \and
48: \author{Christopher Genovese and Larry Wasserman}
49: \affil{Department of Statistics, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213}
50: \email{\{genovese, larry\}@stat.cmu.edu}
51:
52: \begin{abstract}
53:
54: We present a new technique to compute simultaneously valid confidence
55: intervals for a set of model parameters. We apply our method to the
56: Wilkinson Microwave Anisotropy Probe's (WMAP) Cosmic Microwave
Background (CMB) data, exploring a seven-dimensional space
58: ($\tau, \Omega_\mathrm{DE}, \Omega_\mathrm{M}, \omega_{\mathrm{DM}},
59: \omega_{\mathrm{B}}, f_\nu, n_s$).
60: We find two distinct regions-of-interest: the standard
61: Concordance Model, and a region with large values of $\omega_\mathrm{DM}$,
62: $\omega_\mathrm{B}$ and $H_0$. This second peak in
63: parameter space can be rejected by applying a constraint (or a prior)
64: on the allowable values of the Hubble constant. Our new technique uses
65: a non-parametric fit to the data, along with a frequentist approach and a
66: smart search algorithm to map out a statistical confidence
67: surface. The result is a confidence ``ball'': a set of parameter values that
68: contains the true value with probability at least $1-\alpha$.
69: Our algorithm
70: performs a role similar to the often used Markov Chain Monte Carlo (MCMC),
71: which samples from the posterior probability function in order to provide
72: Bayesian credible intervals on the parameters. While the MCMC approach
73: samples densely around a peak in the posterior, our new technique allows
74: cosmologists to perform efficient analyses around any regions of interest:
75: e.g., the peak itself, or, possibly more importantly, the $1-\alpha$ confidence surface.
76:
77:
78: % We present a new technique to compute simultaneously valid confidence
79: % intervals for a set of model parameters, given a data set and a
80: % parametrized model of the data. This technique utilizes a
81: % non-parametric fit to the data, along with a frequentist approach and
82: % a smart search technique to compute joint confidence intervals of
83: % the parameters. The result is a $1 - \alpha$ confidence ball, which
84: % contains the true values of the unknown parameters with probability $1
85: % - \alpha$. In this paper we apply this method to the Wilkinson
86: % Microwave Anisotropy Probe (WMAP) Cosmic Microwave Background (CMB)
87: % data, exploring a seven dimensional space (optical depth, dark energy mass
88: % fraction, total mass fraction, dark matter density, baryon density,
89: % neutrino fraction, and scalar spectral index).
90: % Our technique performs a role similar to the
91: % often used Monte Carlo Markov Chains (MCMC), which maps out the
92: % posterior probability function. However, the significant difference
93: % between these two techniques is the use of Bayesian (used in MCMC)
94: % versus frequentist approaches, and the resulting implications that
95: % these approaches have on statistical inference.
96: % Using a frequentist approach, we are able to avoid the assumptions
97: % of which functions are to be fit, and on which ranges.
98: % Additionally, the inference is independent of the samples drawn, and
99: % therefore less susceptible to under sampling.
100: % We note that MCMC is not designed to be a search algorithm, and propose
101: % a new search algorithm to guide the evaluation of parameter
102: % settings, which is much more efficient. We present 2D
103: % projections through the $1\sigma$ and $2\sigma$ confidence balls, and
104: % compare the results with those obtained via other methods.
105: \end{abstract}
106:
107: \keywords{cosmology: cosmic microwave background --- cosmology:
108: cosmological parameters --- methods: statistical}
109:
110: \section{Introduction} \label{sec:introduction}
111: The Cosmic Microwave Background (CMB) angular temperature power spectrum
112: is the most widely utilized data set for constraining the cosmological
113: parameters \citep{tegmark2001, christensen2001, verde2003, spergel2003, tegmark2004}.
114: This power spectrum, which
115: statistically measures the distribution of temperature fluctuations as
a function of scale, contains at least two peaks thought to
117: have been formed by sound wave modes inherent in the primordial gas during
118: recombination. The locations, heights, and height-ratios
119: of the peaks and valleys in the power spectrum can provide direct
120: information about fundamental parameters of the universe, such as the space-time
121: geometry, the fraction of energy density contained in the baryonic
122: matter, and the cosmological constant \citep{miller2001}. However, it
123: is more common
124: for cosmologists to compare the observed CMB power spectrum to a suite
125: of cosmological models (e.g. CMBFast \citep{seljak1996} and CAMB
126: \citep{lewis2000}). These models require as input some minimal number
127: of cosmological parameters, $d$, --- typically $d=6$ or $d=7$.
128:
129: Most CMB power spectrum parameter estimations to date have been done via
130: Bayesian techniques (e.g., \cite{knox2001, gupta2002, spergel2003, jimenez2004, dunkley2005}).
131: For these techniques, the $d$-dimensional
132: likelihood function is parametrically estimated and prior probabilities are
133: assumed for each parameter. Then, a posterior probability distribution
134: can be computed, and credible intervals can be found. However, unless
the form of the prior is conjugate to the likelihood (which is atypical), computing the
136: posterior involves estimating an integral over the entire space
137: spanned by the prior. There are two basic approaches to solving this
138: problem in the literature. \cite{tegmark2001} approximates
139: this integral explicitly, using an adaptive grid, where grid
140: cells are more densely located in areas presumed to be important.
Secondly, and more popularly, many authors have used Markov Chain Monte Carlo
(MCMC) methods (e.g. \cite{gupta2002, lewis2002, jimenez2004, sandvik2004,
dunkley2005,chu2005, hajain2006}), which tend to be much more efficient than
grid-based techniques, but are notoriously difficult to tune and test for
convergence \citep{olivestatistics}.
146:
147: While Bayesian techniques are used in the majority of work on CMB
148: parameter estimation, there have also been undertakings to estimate
149: cosmological parameters using frequentist techniques, such as $\chi^2$ tests
150: \citep{gorski1993, white1995, padmanabhan2001,
151: griffiths2001, abroe2002} and Bayes risk analyses \citep{schafer2003}.
152: We present a novel frequentist method based upon
153: a non-parametric fit to the data to
154: estimate the smooth underlying power spectrum, as well as an
155: error ``ellipse'' following the technique used in \cite{miller2001}
156: and \cite{genovese2004}.
157: This confidence ball has a radius which is a function of the
158: probability with which the true power spectrum is contained within the
159: ball and the observed error estimates. The ball radius is independent
160: of both the models to be fit, as well as the parameter ranges to be queried. Thus, we
161: can take a vector of parameters, run it through our favorite CMB power
162: spectrum generating model, and determine whether or not the model (and
163: hence the parameter vector) lies within our confidence ball, without
164: fixing \textit{a priori} the model to be used, or the parameter ranges to
165: be searched. We are interested in finding the set of parameter
166: vectors which lie within the $1-\alpha$ confidence ball, for some
significance level (or probability of being incorrect), $\alpha$.
168:
169: This is a statistically different style of ``confidence'' than the
170: credible intervals or the ``degree of belief'' one obtains using
171: Bayesian techniques. In particular, the Bayesian method answers the
172: question ``assuming a given model and prior distribution over the
173: parameter space, what is the smallest range of a particular parameter from which I believe
174: the next sample will be drawn with probability $1-\alpha$?''
175: In contrast, the frequentist approach constructs a procedure for
deriving confidence intervals that, when applied to a series of
177: data sets, traps the true parameters for at least $100(1-\alpha)\%$
178: of the data sets. For parametric models with large sample
179: sizes, Bayesian and frequentist approaches are known to result in
180: similar inferences. However, for high dimensional and
181: non-parametric problems --- such as estimating cosmological parameters
182: from the CMB power spectrum --- Bayesian methods may not yield accurate
183: inferences \citep{olivestatistics}. In such cases, the Bayesian 95\%
184: credible interval may not contain the true value 95\% of the time in
185: a frequency sense.
186:
187: Additionally, mapping a region of high likelihood points in
188: parameter space is fundamentally a search problem. As MCMC methods are
189: designed to sample and/or integrate a distribution, they are not
necessarily good search algorithms in practice. In particular, an MCMC
191: method ``represents'' a high-likelihood region by heavily sampling
192: that region --- an expensive proposition when using CMBFast. In
193: contrast, a search algorithm that can directly observe the
194: (normalized) likelihood of a sample will have no reason to spend more
195: samples in the same location.
196: In addition to describing a frequentist approach to computing
197: confidence intervals for cosmological parameters,
198: another significant contribution of this paper is the
199: proposal of a new search algorithm for mapping confidence surfaces.
200:
201: In this work, we utilize the non-parametric basis
202: described by \cite{miller2001} and \cite{genovese2004} to constrain
203: the set of cosmological models which fit the WMAP observations.
204: At the same time, we must deal with the challenges posed in other
frameworks, namely: robustness of the algorithm, efficiency,
206: and issues of convergence. A schematic outline of our technique is
207: shown in Figure \ref{fig:outline}.
208: In \S \ref{sec:methodology}, we
209: briefly describe the data and cosmological models used, as well as the
210: non-parametric technique (the bottom row of Figure \ref{fig:outline}). We
211: then focus on a new algorithm to map the derived confidence ball into
212: parameter space in \S \ref{sec:algorithm}, sketched out on the
213: top line of Figure \ref{fig:outline}. In \S
214: \ref{sec:results}, we present results of our
215: algorithm, and discuss challenges to accurately determine confidence
216: intervals using any statistical approach.
217: Finally, in \S \ref{sec:comparison}, we compare our
218: method with commonly used inference techniques, and discuss the
219: advantages of using the proposed approach.
220:
221:
222: \begin{figure*}
223: \begin{center}
224: %\includegraphics[scale=0.85]{f1.eps}
225: \plotone{f1.eps}
226: \end{center}
\caption{Schematic outline of our technique to constrain confidence intervals.}
228: \label{fig:outline}
229: \end{figure*}
230:
231: \section{Methodology} \label{sec:methodology}
232: \subsection{Data \& Models} \label{sec:datamodels}
233: We examine the CMB power-spectrum ($\hat C_{\ell}$) as
234: measured by the Wilkinson Microwave Anisotropy Probe's first-year
235: data release \citep{bennett2003, hinshaw2003,
236: verde2003}\footnote{Available at
237: \url{http://lambda.gsfc.nasa.gov}}, shown in Figure
238: \ref{fig:wmapdata1}. Our approach is similar to that of other authors
239: (e.g. \cite{tegmark1999, tegmark2001,
240: spergel2003}), who fit the observed CMB power spectrum to a suite of
241: cosmological models. These models, while sophisticated and detailed, have numerous free
242: parameters, some of which are difficult to ascertain (e.g. ionization
243: depth, contribution of gravity waves). However, there are many codes
available to compute the CMB power spectrum, which trade off speed for
245: accuracy and robustness.
246:
247: Both CMBFast \citep{seljak1996} and the related CAMB \citep{lewis2000}
248: compute the CMB power spectrum by evolving the Boltzmann equation
249: using a line of sight integration technique. While an order of
250: magnitude faster than computing the full Boltzmann solution, this approach
251: is still rather slow.
252: One approach for reducing the computation time of CMBFast
253: is to split the Boltzmann computation into low and
254: high multipole moment portions, as the low and high multipoles are
255: mostly independent \citep{tegmark2001}. Using this method, ksplit,
256: \cite{tegmark2001} was able to reduce computation time by a factor of 10.
257: Additionally, several approximate programs have been
developed which are orders of magnitude faster than CMBFast,
259: including DASh \citep{kaplinghat2002},
260: CMBWarp \citep{jimenez2004}, and Pico \citep{fendt2006}.
261: In general, these programs gain great speedups
262: by approximating the power spectrum with a regression function
263: fit to predetermined sample points generated from simulators such as CMBFast. As a
264: result, generating a hypothesis spectrum for a new set of parameters
265: is a simple function evaluation, foregoing the computation of the
266: Boltzmann equation entirely.
267:
268: While using any one of these approximate methods or ksplit may seem
269: appealing due to their computational efficiency, they do not have the
270: desired accuracy and robustness \citep{seljak2003}. These codes are
271: only approximations. While fairly accurate around the concordance
272: peak, their accuracy drops off drastically when computing models for
273: parameter vectors slightly removed from
274: the ``accepted'' cosmological models.
275: Additionally, these codes are prone to failures when presented with
276: parameter vectors that are not within a narrowly defined region around
277: the concordance model \citep{fendt2006}.
278: %For instance, Pico
279: %uses a set of precomputed points to generate its interpolation; these
280: %points were all picked to be near the accepted concordance peak.
281: According to the Pico website: ``Since Pico's purpose is to be part of
282: parameter estimation codes, we are mainly concerned with having the regression
283: coefficients defined around the region of parameter space allowed by
284: the data (mainly the WMAP3 data). Pico will not be able to compute
285: accurate spectra and likelihoods away from this region, but it will
286: warn you about this.'' Similarly, in many instances ksplit will hang
287: on parameter vectors that are a short distance from the concordance peak.
288: Since we are interested in finding the tightest possible confidence
289: intervals for all regions of parameter space that can possibly fit the data,
290: we do not want to be artificially restricted by our CMB simulator.
291: Thus, we choose to compute the model CMB power spectra
using CMBFast; while not the fastest code available, CMBFast is
293: accurate and reliable.
294:
295: Next, multipole covariance is estimated by using the covariance derived for the
296: concordance model using code from \cite{verde2003}.
297: We find that the computed variances match well with
298: those found in the first-year data release, with only a slight
299: (roughly $1.15$) multiplicative offset. This constant factor offset
was hinted at by the sub-unity slope of the quantile-quantile plot of
the variance-weighted deviations between the data and the concordance
302: model prediction, using the variances given in the WMAP data.
303:
304:
305: \begin{table}[t]
306: \begin{center}
307: \begin{tabular}{c l r@{${\ }-{\ }$}l}
308: \hline
309: \textbf{Parameter} & \textbf{Description} &
310: \multicolumn{2}{c}{\textbf{Range}}\\
311: \hline
312: \hline
313: $\tau$ & optical depth & 0.0 & 1.2\\
314: $\Omega_\mathrm{DE}$ & dark energy mass fraction& 0.0 & 1.0\\
315: $\Omega_\mathrm{M}$ & total mass fraction & 0.1 & 1.0\\
316: $\omega_{\mathrm{DM}}$ & dark matter density& 0.01 & 1.2 \\
317: $\omega_{\mathrm{B}}$ & baryon density& 0.001 & 0.25\\
318: $f_\nu$ & neutrino fraction& 0.0 & 1.0\\
319: $n_s$ & spectral index& 0.5 & 1.7\\
320: \hline
321: \end{tabular}
322: \end{center}
323: \caption{Cosmological parameters and ranges searched.}
324: \label{paramtable}
325: \end{table}
326:
327: \cite{spergel2006} show that the WMAP third year data are well
described by a simple six-parameter model: $\tau, H_0,\Omega_\mathrm{M},
329: \Omega_\mathrm{B}, \sigma_8, n_s$. In this paper, we use
330: effectively the same model space as the simplified model in
331: \cite{spergel2006}, except that we include the neutrino fraction and
332: exclude $\sigma_8$. We made this change as we are not utilizing
333: large-scale structure data, which is sensitive to $\sigma_8$. The
334: resulting parameter vector
335: $\mathbf{p} = (\tau, \Omega_\mathrm{DE}, \Omega_\mathrm{M},
336: \omega_{\mathrm{DM}}, \omega_{\mathrm{B}}, f_\nu, n_s)$ is similar to
337: the model space searched by \cite{tegmark2001}.
338: A description and considered range for each of these variables
339: is presented in Table \ref{paramtable}; the parameter ranges
340: considered here are slightly larger than those searched by \cite{tegmark2001}, due
341: to our interest in mapping an observed secondary peak in parameter space.
342: Note that $\Omega_\mathrm{k} = 1 - \Omega_\mathrm{M} - \Omega_\mathrm{DE}$.
343: Moreover, the Hubble constant, $H_0$, is not an independent parameter,
344: but given by
345: \[
346: \frac{H_0}{100} = h =\sqrt{\frac{\omega_\mathrm{DM}+\omega_\mathrm{B}}{\Omega_\mathrm{M}}}
347: = \sqrt{\frac{\omega_\mathrm{DM}+\omega_\mathrm{B}}{1-\Omega_\mathrm{k} - \Omega_\mathrm{DE}}}.
348: \]
349: We denote the space spanned by $\mathbf{p}$ as
$\mathcal{P}$. $\mathcal{P}$ is a seven-dimensional hyper-rectangle
351: where the range of the $j^\mathrm{th}$ side corresponds to the range of the
352: $j^\mathrm{th}$ cosmological parameter of $\mathbf{p}$.
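
As an illustrative check of the relation between $H_0$ and the density
parameters (the numerical values below are round numbers chosen for
arithmetic convenience, not fitted results), consider
$\omega_\mathrm{DM} = 0.12$, $\omega_\mathrm{B} = 0.024$ and
$\Omega_\mathrm{M} = 0.29$:
\[
h = \sqrt{\frac{0.12 + 0.024}{0.29}} \approx \sqrt{0.4966} \approx 0.705,
\qquad H_0 \approx 70.5\ \mpc.
\]
By contrast, a hypothetical vector with $\omega_\mathrm{DM} = 1.0$,
$\omega_\mathrm{B} = 0.2$ and $\Omega_\mathrm{M} = 1.0$ gives
$h = \sqrt{1.2} \approx 1.10$, i.e. $H_0 \approx 110\ \mpc$. Large values
of $\omega_\mathrm{DM}$ and $\omega_\mathrm{B}$ at fixed $\Omega_\mathrm{M}$
therefore imply a large Hubble constant, which is why a constraint on $H_0$
can exclude such regions of $\mathcal{P}$.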
353:
354: \subsection{Nonparametric Analysis} \label{sec:nonparametric}
355: We now provide a brief sketch of nonparametric data
356: analysis, as it pertains to the CMB power spectrum. We follow the
357: derivations given in \cite{miller2001} and \cite{genovese2004}, and refer
358: interested readers to those works. Our technique is
359: designed to:
360: \begin{enumerate}
\item Compute a fit to the actual data which minimizes the sum of the
  squared bias and the variance of the fit,
  taking into account the full covariance discussed in
364: \S \ref{sec:datamodels}. Errors are assumed to be Gaussian.
365: This fit is effectively a smoothed version of the data.
366:
\item Determine a confidence ball (ellipse) around the best fit for a
368: given test level, $\alpha$.
369:
370: \item Find all such vectors $s \in \mathcal{P}$ such that the
371: power spectrum output by CMBFast for $s$ results in a
372: model which is contained within the $1-\alpha$ confidence ball
373: found in step 2.
374: \end{enumerate}
375: We now detail items 1 and 2, leaving the discussion of item 3 to \S \ref{sec:mapping}.
376:
377: %%%%%%%%%%%%%%%%%%%%%%%
378: % Intentionally Blank
379: %
380: %
381: %
382:
383: \subsubsection{The Non-Parametric Fit} \label{sec:fit}
384: Let $\ell \in [L_\mathrm{min}, \dots, L_\mathrm{max}]$ denote a
385: generic index of the CMB temperature power spectrum multipole, and $n
386: = L_\mathrm{max} - L_\mathrm{min}+1$ be the total number of observed
387: multipoles. We take $Y_{\ell} = \hat{C}_{\ell}$ to be the observations
388: of the CMB where $x_{\ell} = (\ell-\lmin)/(\lmax-\lmin)$ and let
389: $f(x_{\ell})\equiv C_{\ell}$ denote the true power spectrum
390: at multipole index $\ell$.
391: We then solve the nonparametric regression problem:
392: \begin{equation}\label{eq:regress2}
393: Y_{\ell} = f(x_{\ell}) + \epsilon_{\ell}, \qquad \ell = L_\mathrm{min}, \ldots, L_\mathrm{max},
394: \end{equation}
395: where $\epsilon =
396: (\epsilon_{L_\mathrm{min}},\ldots,\epsilon_{L_\mathrm{max}})$ are
397: assumed Gaussian with known covariance matrix $\Sigma$ as described earlier.
398: Henceforth, we will use $i=\ell -\lmin+1$ as an index.
399: Nonparametric analysis is based on the notion of estimating a function
400: without forcing it to fit some finite-dimensional parameter form
(e.g. a Normal distribution), by smoothing the data in such a way as to
402: balance the bias and variance. In this work, we use orthogonal series
regression to estimate $f$, expanding $f$ in a
404: cosine basis:
405: \[
406: f(x) = \sum\limits_{j=0}^\infty \mu_j \phi_j(x)
407: \]
408: where
409: %\begin{equation}\label{basis}
410: \[
411: \phi_j(x) =
412: \left\{
413: \begin{array}{l l}
414: 1 & \mathrm{for\ } j=0\\
415: \sqrt{2}\cos(\pi j x) & \mathrm{for\ } j = 1,2,3, \dots
416: \end{array}
417: \right.
418: %\end{equation}
419: \]
420: and the $\mu_j$'s are the coefficients for each basis component.
421: If $f$ is smooth, then $\mu_j$ will decay rapidly as $j$
increases. That is, if $f$ is smooth, then there are few or no
high-frequency fluctuations in $f$ and hence $\mu_j \simeq 0$ for large $j$.
424: Thus,
425: $\sum_{j=n+1}^\infty \mu^2_j$ will be negligible, and we can approximate the
426: infinite sum as $f(x) \approx \sum_{j=0}^n \mu_j \phi_j(x)$. Let
427: \[
Z_j = \frac{1}{n} \sum_{i=1}^n Y_i \phi_j(x_i)
429: \]
430: for $j=0, 1,\dots n$. Then
$Z$ is approximately normally distributed with mean $\mu$ and covariance
432: $B/\sqrt{n} = U \Sigma U^T/ \sqrt{n}$, where $U$ is the cosine basis transformation
433: matrix.
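The choice of $Z_j$ can be motivated as follows (a brief sketch, assuming
zero-mean errors and that sums over the equally spaced design points $x_i$
approximate integrals over $[0,1]$). The cosine basis is orthonormal,
\[
\int_0^1 \phi_j(x)\, \phi_k(x)\, dx = \delta_{jk},
\]
so that $\mu_j = \int_0^1 f(x)\, \phi_j(x)\, dx$. Replacing the integral by
an average over the $n$ design points gives
\[
\E[Z_j] = \frac{1}{n} \sum_{i=1}^n f(x_i)\, \phi_j(x_i) \approx \mu_j,
\]
that is, $Z_j$ is an approximately unbiased estimate of the $j^\mathrm{th}$
basis coefficient.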
434:
435: In order to obtain an even smoother
436: estimate of $f$, we damp out the higher frequencies using shrinkage
437: estimators. We let $\hat \mu_j = \lambda_j Z_j$ where $1 \ge
438: \lambda_0 \ge \lambda_1 \ge \cdots \ge \lambda_n \ge 0$ are shrinkage
439: coefficients. The estimate of $f$ is now
440: \[
441: \hat f(x) = \sum_{j=0}^n \hat \mu_j \phi_j(x) = \sum_{j=0}^n
442: \lambda_j Z_j \phi_j(x).
443: \]
444: Following \cite{genovese2004}, we use a special case of monotone
445: shrinkage in which
446: \[
447: \lambda_j = \left\{
448: \begin{array}{cc}
449: 1 & \mathrm{for\ } j\le J\\
450: 0 & \mathrm{for\ } j> J
451: \end{array}\right.
452: \]
453: for some integer $J \in [0,n]$. We will show how to find $J$ shortly.
454: Using the monotone shrinkage scheme described above, the estimate of $f$ becomes
455: \[
456: \hat f(x) = \sum_{j=0}^J Z_j \phi_j(x).
457: \]
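For instance, with $J=2$ (an arbitrary value used purely for illustration),
the estimate reduces to three terms,
\[
\hat f(x) = Z_0 + \sqrt{2}\, Z_1 \cos(\pi x) + \sqrt{2}\, Z_2 \cos(2 \pi x),
\]
a constant plus one half-period and one full-period cosine across the
multipole range. Larger values of $J$ admit progressively more oscillatory
components, so $J$ acts as the smoothing parameter of the fit.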
458:
459: The squared error loss as a function of $\hat \lambda = (\hat \lambda_0,\hat
460: \lambda_1, \dots, \hat \lambda_n)$ is
461: \[
462: L_n(\hat \lambda) =
463: \int_0^1 \left(\frac{\hat f(x) -
464: f(x)}{\sigma(x)}\right)^2 \, dx\approx \sum_{j=1}^n
465: \left(\frac{\mu_j - \hat \mu_j}{\sigma_j}\right)^2,
466: \]
467: where $\sigma^2(x)$ is the variance of $f$, and $\sigma_j^2$ are the
468: observed variances of the power spectrum (the elements on the diagonal
469: of $\Sigma$). Meanwhile, the risk is given by
470: \[
471: R(\lambda) = \E \left[\int_0^1 \left(\frac{\hat f(x) -
472: f(x)}{\sigma(x)}\right)^2 \, dx \right] \approx
\frac{J}{n} + \sum_{j=J+1}^n \frac{\mu_j^2}{\sigma_j^2}.
474: \]
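The form of this risk can be understood from a coefficient-wise
bias-variance decomposition (a heuristic sketch, assuming that the $Z_j$ are
approximately unbiased for $\mu_j$ with $\Var(Z_j) \approx \sigma_j^2/n$, and
neglecting off-diagonal covariance terms):
\[
\E\left[\left(\frac{\hat \mu_j - \mu_j}{\sigma_j}\right)^2\right]
= \frac{\lambda_j^2\, \Var(Z_j) + (1-\lambda_j)^2 \mu_j^2}{\sigma_j^2}
\approx \frac{\lambda_j^2}{n} + (1-\lambda_j)^2 \frac{\mu_j^2}{\sigma_j^2}.
\]
Under these assumptions, each retained coefficient ($\lambda_j = 1$)
contributes a variance term $1/n$, while each discarded coefficient
($\lambda_j = 0$, $j > J$) contributes a squared-bias term
$\mu_j^2/\sigma_j^2$. Increasing $J$ thus trades bias for variance.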
475:
We choose $J$ to minimize Stein's unbiased risk estimate
477: \begin{equation}\label{eqn:stein}
478: \hat R = Z^T \bar D W \bar D Z + \mathrm{trace}(DWDB) -
479: \mathrm{trace}(\bar D W \bar D B)
480: \end{equation}
481: where $D$ and $\bar D = 1 -D$ are diagonal matrices with 1's in the
482: first $J$ and last $n-J$ entries respectively, $B$ is the covariance
483: of $Z$, and $W_{jk} = \sum_{\ell} \Delta_{jk\ell}/\sigma_\ell$ and
484: \begin{eqnarray*}
485: \Delta_{jk\ell} &=& \int_0^1 \phi_j \phi_k \phi_\ell\\
486: &=& \left\{
487: \begin{array}{c c}
      1 & \mathrm{if\ \#}\{j,k,\ell = 0\} = 3\\
      0 & \mathrm{if\ \#}\{j,k,\ell = 0\} = 2\\
      \delta_{jk}\delta_{0\ell} + \delta_{j\ell}\delta_{0k} +
      \delta_{k\ell}\delta_{0j} & \mathrm{if\ \#}\{j,k,\ell = 0\} = 1\\
      \frac{1}{\sqrt{2}}(\delta_{\ell, j+k} + \delta_{\ell,|j-k|}) & \mathrm{if\ \#}\{j,k,\ell = 0\} = 0
493: \end{array}
494: \right..
495: \end{eqnarray*}
496: \cite{beran1998} showed that $\hat R(\lambda)$ is asymptotically,
497: uniformly close to $R(\lambda)$ when using monotone shrinkage
498: coefficients and $\sigma(x)=1$. \cite{genovese2004} extended this
499: result to the heteroskedastic case used here.
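
As a check of the last case of $\Delta_{jk\ell}$ (a worked example with
indices chosen for illustration), take $j=1$, $k=2$, $\ell=3$. Using
$\cos a \cos b = \frac{1}{2}[\cos(a+b) + \cos(a-b)]$,
\begin{eqnarray*}
\Delta_{123} &=& 2\sqrt{2} \int_0^1 \cos(\pi x) \cos(2\pi x) \cos(3\pi x)\, dx\\
&=& \sqrt{2} \int_0^1 \left[\cos^2(3\pi x) + \cos(\pi x)\cos(3\pi x)\right] dx\\
&=& \sqrt{2} \left(\frac{1}{2} + 0\right) = \frac{1}{\sqrt{2}},
\end{eqnarray*}
in agreement with
$\frac{1}{\sqrt{2}}(\delta_{3,1+2} + \delta_{3,|1-2|}) = \frac{1}{\sqrt{2}}$.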
500:
501: In Figure \ref{fig:wmapdata1}, we compare our non-parametric
502: fit to the WMAP data to a model-based fit from \cite{spergel2003}.
503: Points in the figure depict the first year WMAP data.
504: Error bars are omitted for clarity. The full estimated
505: covariance, $\Sigma$, is used in both the \cite{spergel2003} model fit
506: and the \cite{genovese2004} non-parametric fit.
507:
508: \begin{figure}
509: \begin{center}
510: \noindent
511: \plotone{f2.eps}
512: %\includegraphics[scale=1.0]{f2.eps}
513: \end{center}
514: \caption{Comparison of our nonparametric fit of the CMB power-spectrum
515: (solid) with \cite{spergel2003} parametric fit (dashed). First-year
516: WMAP data (dots) are shown without errors for clarity.}
517: \label{fig:wmapdata1}
518: \end{figure}
519:
520:
521: \subsubsection{The Confidence Ball} \label{sec:confball}
522: After we perform the non-parametric fit,
523: we need to quantify the uncertainty to make statistical inferences.
524: We use the Beran-D\"umbgen pivot method \citep{beran1998,beran2000} to
525: derive valid confidence intervals. This method relies
526: on the weak convergence of the ``pivot process'' --- $B_n(\hat \lambda) =
527: \sqrt{n} (L_n(\hat \lambda) - \hat R (\hat \lambda))$ --- to a Normal
528: $(0, \tau^2)$ distribution for some $\tau^2 >0$; a derivation of $\hat
529: \tau_n$ can be found in Appendix \ref{appendix}, taken from Appendix 3 of
530: \cite{genovese2004}. Using the convergence of the pivot process, we can
531: compute a confidence ellipse for the basis coefficients with a
532: ``radius'' given by:
533: \begin{equation} \label{conf0}
534: \mathcal{D}_n
535: = \left\{\mu : \sum_{i=1}^n \left(\frac{\hat{\mu}_i - \mu_i}{\sigma_i}\right)^2 \le
536: \frac{\hat\tau_n \, z_\alpha}{\sqrt{n}} + \hat{R}(\hat\lambda_n)\right\}
537: \end{equation}
538: where the best fit to the data is represented by
539: $\hat{\mu}_i$, the function being tested (whether it is within some
540: confidence ball) is $\mu_i$, and the level of the confidence ball is
541: determined by $z_\alpha$, the upper $\alpha$ quantile of a standard
542: Normal distribution.
543:
Therefore, using the central limit theorem, we have that
545: \begin{equation} \label{conf}
546: \mathcal{B}_n = \left\{f(x) = \sum_{j=0}^n \mu_j \phi_j(x): \mu \in
547: \mathcal{D}_n \right\}
548: \end{equation}
549: is an asymptotic $1-\alpha$ confidence set for $f$.
550:
551: Thus, to determine if any
552: given vector $s \in \mathcal{P}$ is within our confidence ball, we
553: merely have to run our cosmological model to compute the resulting
554: power spectrum,
555: $\hat f(s)$, and check to see if $\hat f(s) \in \mathcal {B}_n$. This can
be easily done by using Equation \ref{conf0} to check whether the
variance-weighted sum of squared differences between $\hat \mu$ and $\mu$ is
no greater than the constant given on the right-hand side of Equation \ref{conf0}.
559: As shown in Figure \ref{fig:distance_alpha}, as the radius increases,
560: so does the size of the confidence set (and $\alpha$ decreases). Thus,
561: a 95\% (or $\alpha = 0.05$) confidence region has a larger ``radius''
562: than does a 67\% (or $\alpha = 0.33$) confidence region. Moreover, a
563: $1-\alpha$ confidence ball strictly contains
564: all confidence balls with smaller values of $1-\alpha$.
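
To make the dependence on $\alpha$ concrete, write $r_\alpha^2$ for the
right-hand side of Equation \ref{conf0} (the symbol $r_\alpha$ is introduced
here only for illustration). Standard Normal quantiles give
$z_{0.05} \approx 1.645$ and $z_{0.33} \approx 0.440$, so that
\[
r_{0.05}^2 - r_{0.33}^2
= \frac{\hat\tau_n\, (z_{0.05} - z_{0.33})}{\sqrt{n}}
\approx \frac{1.205\, \hat\tau_n}{\sqrt{n}}.
\]
For a spectrum of roughly $n \approx 900$ multipoles, $\sqrt{n} \approx 30$,
and the difference is about $0.040\, \hat\tau_n$. Since
$\hat R(\hat\lambda_n)$ does not depend on $\alpha$, the nesting of the
confidence balls follows directly from the monotonicity of $z_\alpha$.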
565:
566: Since the dimensionality of our space is large, it is difficult to
567: visualize the confidence region that surrounds the non-parametric fit.
568: However, we can show examples of functions which live inside (or outside)
569: our confidence region by calculating their distance from the
570: nonparametric fit to the data.
571: In Figure \ref{fig:wmapdata2}, we show a ``ribbon'' plot for
572: $\omega_\mathrm{B}$ around the concordance model. This figure is generated by
573: setting all of the cosmological parameters to their concordance
574: values and then slowly evolving $\omega_\mathrm{B}$ from $0.012250$ to
575: $0.036750$ to depict the range of temperature spectra allowed due to
576: uncertainty of $\omega_\mathrm{B}$. The
577: black curves are cosmological models which live within the
578: $95\%$ confidence ball, while gray curves are models that do not.
579: As can be seen in this figure, the shape of the confidence region is
580: not simply a band of constant width surrounding the best fit. It is, in fact,
581: a very complicated, possibly disconnected surface in our high-dimensional
582: parameter space. {\it It is this confidence surface that we wish to map in detail.}
583:
584: \begin{figure}
585: \begin{center}
586: %\includegraphics[scale=1.0]{f3.eps}
587: \plotone{f3.eps}
588: \end{center}
589: \caption{A ``ribbon'' plot depicting the effect of varying
590: $\omega_\mathrm{B}$ while all other parameters remain fixed (at
591: concordance values). Black lines indicate those models which are
  contained within a 95\% confidence ball, while gray lines indicate
593: those models rejected by the hypothesis that the model and the
594: regressed fit are the same.}
595: \label{fig:wmapdata2}
596: \end{figure}
597:
598: \section{Mapping the Confidence Surfaces} \label{sec:mapping}
599: While theoretically Equation \ref{conf} exactly gives us the $1-\alpha$
600: confidence bound for any functional of the data, it is not trivial to
601: compute what these bounds are. While it is easy to use Equation
602: \ref{conf} to compute whether or not a given model is within the
603: confidence ball, the method outlined in \S \ref{sec:nonparametric}
does not provide a way to easily compute all those spectra that lie
605: within that ball.
606:
607: Concretely, when we test if a CMB power
608: spectrum lies within the ball, we compare the given spectrum with
609: the non-parametric fit found above, by computing a variance weighted
610: sum of squares between the given spectrum and the regressed model. We
611: call this weighted sum of squares the test spectrum's ``distance''. If
612: we are given a model which results in a test spectrum whose distance
613: is greater than the radius of our confidence ball, then we can reject
614: the test spectrum (and its associated parameter vector) at the
615: $1-\alpha$ level.
616: If not, then our test does not have the power to distinguish between
617: the regressed model and our test model. Note that we are taking a
618: $\sim900$ element spectrum and compressing it to a scalar. Thus, there
619: are many models --- possibly representing vastly different spectra ---
620: that may result in exactly the same distance value. For the hypothesis
test that the fitted function and regressed models are derived from the same
622: distribution, we will draw the same conclusion for all models with the
623: same distance values. Either all models with a particular distance
624: score can be rejected or none can. For a
625: given confidence ball radius, we could compute (possibly with some discrete
626: approximation) all of the possible CMB power spectra that have
627: distances equal to the confidence radius. However, we are unaware of
628: an easy way to determine the cosmological parameters of a power
629: spectrum given only the power spectrum itself. That is, we do not have a
630: method to easily invert CMBFast.
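
As a schematic illustration of this test (the symbols $\hat C_\ell$,
$C_\ell(s)$ and $\hat\sigma_\ell^2$ are introduced here only for
exposition and need not match the notation of Equations \ref{conf0} and
\ref{conf}), let $\hat C_\ell$ denote the non-parametric fit at
multipole $\ell$, $C_\ell(s)$ the spectrum computed for the parameter
vector $s$, and $\hat\sigma_\ell^2$ the variance used as the weight.
The distance then has the form
\[
d(s) = \sum_{\ell} \frac{\big(C_\ell(s) - \hat C_\ell\big)^2}{\hat\sigma_\ell^2},
\]
and $s$ is rejected at the $1-\alpha$ level when $d(s)$ exceeds the
confidence ball radius. Any two spectra with equal $d$ receive the same
verdict, however different their shapes.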
631:
632: % For instance,
633: % \cite{tegmark2001} to sampled a roughly
634: % $10^{7}$ grid of points and use a linear approximator between them to
635: % determine confidence bounds on the individual cosmological parameters.
636: % This approach benefits from explictly searching the entire space, but
637: % suffers from the fact that it cannot give tight bounds in areas where
638: % the grid is course. Moreover, computation of a $10^7$ grid is
639: % expensive, leading to \cite{tegmark2001} to use some approximations in CMBFast
640: % (See \S \ref{FIXME}).
641: %
642: % Another approach is the use of Monte Carlo
643: % Markov Chains (MCMC) to compute the posterior distribution of the
644: % models given the data, under
645: % a given prior \cite{FIXME}. This is done by sampling the input space
646: % in roughly in
647: % proportion to the expected probability of each location. After enough
648: % sampling the posterior distribution will converge to the true
649: % distribution, and confidence bands can be found by integrating the
650: % posterior. This method benefits from its ease of implementation, as
651: % well as the fact that the entire posterior is obtained (not just the
652: % $1-\alpha$ confidence intervals. However, in practice, there is no way
653: % to show that a MCMC has converged truly converged to the true solution
654: % (not just some local optium), and integration of the final posterior
655: % can be tricky.
656:
657: Of course, one solution would be to grid the parameter space, and
658: run a model for each grid cell. We could then use these models to
659: approximate the mapping between parameter vectors and confidence
660: level using, for instance, a simple linear approximator.
661: As noted in \S \ref{sec:introduction}, such an approach
662: is far too slow, explaining why \cite{tegmark2001} use both
663: adaptive grids and a modified version of CMBFast.
664: Instead, we suggest an adaptive approach, which allows us to determine
665: confidence intervals of our cosmological parameters more quickly and
666: accurately.
667: In particular, we are able to quickly refine our approximating surface
668: in the areas of interest -- those near the confidence ball's radius --
669: while ignoring the uninteresting regions. This allows us to obtain
670: estimates of the $1-\alpha$ confidence intervals of our
671: cosmological parameters much more efficiently.
672: % We now show use active learning approaches to improve upon the
673: % previous sampling strategies and show how this will allow us to
674: % map the $1-\alpha$ joint confidence intervals of our cosmological
675: % parameter input space.
676:
677: \subsection{Modeling Known Experiments} \label{model}
678: The combination of CMBFast and the confidence ball method gives us a scoring
679: function $f:\mathcal{P}\to \R$, which takes an input vector of
680: parameters ($s \in \mathcal{P}$) and returns a distance value. This is
681: accomplished by plugging the cosmological parameter values of $s$ into
682: CMBFast to compute a model power spectrum, and then comparing this
683: model spectrum with our non-parametric fit to the observed power spectrum
684: using Equations \ref{conf0} and \ref{conf}.
685: Given a particular $1-\alpha$ confidence ball radius, $t$,
we want to find the set of points, $\bS \subseteq \bP$, that have
distances to the regressed fit of the data less than or equal to the
confidence ball radius: $\bS = \{s \in \bP \mid f(s) \le t\}$.
Since we cannot easily invert $f$ (that is to say, CMBFast),
we must deduce $\bS$ by carefully sampling the points in $\bP$.
691:
692: For CMBFast, the cost to compute $f(s)$ given $s$ can be significant:
693: computing power spectra away from the concordance model can take
694: 5 to 15 minutes.
Care must be taken when choosing the
next experiment, as a good choice of points can reduce the run time of
the algorithm by orders of magnitude. It is therefore preferable to
698: analyze current knowledge about the underlying function and select experiments
699: which quickly refine the estimate of the distance function around the
700: confidence ball radius. There are several methods one could use to
701: create a model of the data, notably some form of parametric
702: regression. However, we chose to approximate $f(s)$ using
703: Gaussian process regression, as other forms of regression may
704: smooth the data, ignoring subtle features of the function that may
705: become pronounced with more data. A Gaussian process is
706: a non-parametric form of regression. Predictions for
707: unobserved points are computed by using a weighted combination of the
708: function values for those points which have already been observed,
709: where a distance-based kernel function is used to determine the
710: relative weights. These distance-based kernels generally weight nearby points
significantly more than distant points.
712: Thus, assuming the underlying function is continuous,
713: Gaussian processes will perfectly describe the function given an
714: infinite set of unique data points.
715:
716:
717: In this work, we use ordinary kriging, a form of Gaussian processes that
718: assumes that the semi-variance, $\mathcal{K}(\cdot, \cdot)$, between
719: two points is a linear function of their distance \citep{cressie1991};
720: for any two points $s_i, s_j \in \bP$,
721: \[
722: \mathcal{K}(s_i, s_j) = \frac{k}{2} \E\left[ \Big(f(s_i) - f(s_j)\Big)^2\right]
723: \]
724: where $k$ is a constant --- known as the kriging
725: parameter --- which is an estimate of the maximum magnitude of the
726: first derivative of the function. Therefore, the
727: expected semi-variance between two points, $s_i, s_j \in \bP$ is given
728: by
729: \ba
730: \gamma(s_i, s_j) &=& E(\bK(s_i, s_j)) = k \bD(s_i, s_j)+c\\
731: &=&k \left[\sum\limits_{\ell=1}^d \alpha_\ell^2(s_{i\ell} - s_{j\ell})^2\right]^{1/2}+c
732: \ea
733: where $\bD(\cdot, \cdot)$ is a distance function defined on the parameter
734: space $\bP$ and $c$ is the observed variance (e.g. experimental noise)
735: when repeatedly sampling the function $f$ at the same location.
736: We have found that using a simple weighted
737: distance function where each dimension is linearly scaled by the
738: parameter $\alpha_\ell$, as depicted in the previous equation,
reasonably ensures that parameters are given equal
consideration despite their disparate values and derivatives. For our
741: analysis, we adjusted the $\alpha_\ell$'s to ensure that the maximum derivative
742: along each dimension was approximately 1 during the sampling process.
743: Additionally, while the simulations computed by CMBFast are
744: deterministic, we shall see in \S \ref{sec:convergence} that there is
745: some inherent noise in the computations; thus we conservatively set $c
746: = 1 \times 10^{-5}$ in our analysis.
747:
748: For the Gaussian process framework, sampled data are assumed to be
749: Normally distributed with means equal to the true function and
750: variance given by the sampling noise. Moreover, a combination of any
751: subset of these points results in a Normal distribution. Thus, we can
752: use the observed set of data, $\bA\subset \bP$, to predict the value
of $f$ for any $s_q \in \bP$. The prediction at the query point $s_q$ will be Normally
distributed, $N(\mu_{s_q}, \sigma^2_{s_q})$, with mean and variance given by
755: \begin{eqnarray}
756: \mu_{s_q} &=& \bar f_\bA + \Sigma_{\bA q}^T \Sigma_{\bA\bA}^{-1} (f_\bA
757: - \bar f_\bA) \label{k_mean}\\
758: \sigma^2_{s_q} &=& \Sigma_{\bA q}^T \Sigma_{\bA\bA}^{-1} \Sigma_{\bA q} \label{k_var}
759: \end{eqnarray}
760: %where
761: %\[
762: %\Sigma_{\bA q} =
763: %\left[
764: %\begin{array}{c}
765: %\gamma(a_1, s_q)\\
766: %\gamma(a_2, s_q)\\
767: %\vdots\\
768: %\gamma(a_{|\bA|}, s_q)\\
769: %\end{array}
770: %\right]
771: %\quad
772: %\Sigma_{\bA\bA} =
773: %\left[
774: %\begin{array}{c c c c}
775: %c & \gamma(a_1, a_2) & \dots & \gamma(a_1, a_{|\bA|})\\
776: %\gamma(a_2, a_1) & c & \dots & \gamma(a_2, a_{|\bA|})\\
777: %\vdots & \vdots &\ddots & \vdots\\
778: %\gamma(a_{|\bA|}, a_1) & \gamma(a_{|\bA|}, a_2) & \dots &
779: %\gamma(a_{|\bA|}, a_{|\bA|})
780: %\end{array}
781: %\right]
782: %\quad
783: %(f_\bA - \bar f_\bA) =
784: %\left[
785: %\begin{array}{c}
786: %f(a_1) - \bar f_\bA\\
787: %f(a_2) - \bar f_\bA\\
788: %\dots \\
789: %f(a_{|\bA}}) - \bar f_\bA
790: %\end{array}
791: %\right]
792: %\]
793: where the elements of the matrix $\Sigma_{\bA\bA}$ and arrays
794: $\Sigma_{\bA q}$ and $f_\bA - \bar f_\bA$ are given by
795: \begin{eqnarray*}
796: \Sigma_{\bA \bA} [i,j] &=& \gamma(a_i, a_j)\\
797: \Sigma_{\bA q} [i] &=& \gamma(a_i, s_q)\\
(f_\bA - \bar f_\bA)[i] &=& f(a_i) - \bar f_\bA\\
799: \bar f_\bA &=& \frac{1}{|\bA|} \sum_{i=1}^{|\bA|} f(a_i)
800: \end{eqnarray*}
801: and the $a_i$'s and $a_j$'s are the observed data used to make an
inference: $a_i, a_j \in \bA$, $1\le i, j \le |\bA|$.
803:
804: % where $\Sigma_{\bA q}$ denotes the column vector with the $i$th entry
805: % equal to $\gamma(a_i, s_q)$, $\Sigma_{\bA\bA}$ denotes the semivariance
806: % matrix between the elements of $\ba$ (the $ij$ element of
807: % $\Sigma_{\bA\bA}$
808: % is $\mathcal{K}(s_i, s_j)$), $y_A$ denotes the column vector with
809: % the $i$th entry equal to $f(s_i)$, the true value of the function for
810: % each point in $A$, and $\mu_A$ is the mean of the $y_A$'s.
811:
812: As given, for a set of $n$ observed points ($|\bA| = n$), prediction
813: with a Gaussian process requires
814: $O(n^3)$ time, as an $n \times n$ linear system of equations must be solved.
However, for many Gaussian processes, and ordinary kriging in particular,
the correlation between two points decreases as a function of
distance. Thus, the full Gaussian process model can be approximated well by a local
Gaussian process, where only the $k$ nearest neighbors of the query point are used
to compute the prediction value (here $k$ denotes the number of
neighbors, not the kriging parameter introduced above); this reduces the computation time to
$O(k^3+k\log(n))$ per prediction, since $O(k\log(n))$ time is required to find the
$k$ nearest neighbors using spatial indexing structures such as balanced
kd-trees.
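
As a rough illustration of the savings, take the values that appear
elsewhere in this work, $n \approx 1.2 \times 10^{6}$ sampled models and
$k = 100$ neighbors, and ignore constant factors (the base-2 logarithm
is an assumption made here about the depth of the tree):
\begin{eqnarray*}
n^3 &\approx& 1.7 \times 10^{18}\\
k^3 + k\log_2(n) &\approx& 10^{6} + 100 \times 20.2 \approx 1.0 \times 10^{6}
\end{eqnarray*}
Under these assumptions, the local approximation needs roughly twelve
orders of magnitude fewer operations per prediction, and its cost is
dominated by the $k \times k$ solve rather than by the neighbor search.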
823:
824: \subsection{Algorithm} \label{sec:algorithm}
825: There are many well-known heuristics for computing where best
826: to perform the next experiment using a regression model, such as
827: that derived in \S \ref{model}. Sampling strategies include picking the
828: point with the largest variance \citep{mackay1992,guestrin2005},
829: entropy or information gain.
830:
831: Sampling points based solely on variance is common in active learning
832: methods whose goal is to map out an entire function, as this will
833: minimize the expected error for prediction. Moreover, the
834: model variance predicted by local ordinary kriging
835: is linear in the distance to the nearest neighbors.
836: As such, this strategy chooses points that are far from areas currently
837: searched, and thus will not get stuck in a specific location in
838: parameter space. However, this
strategy is known to oversample boundary regions \citep{mackay1992},
840: and ultimately samples the space evenly like a grid.
841: It is likely that large regions of the input space, $\mathcal{P}$,
842: fall well outside the confidence ball radius. In the
843: progression of the algorithm, points in these regions may have large
844: variances but still not be within 2 or more standard deviations of the
845: boundary; these points are very unlikely to be near the confidence
846: ball radius. Hence, a strategy that samples the entire space
847: evenly, using either a grid or a variance metric, can be extremely
848: inefficient for mapping function boundaries.
849:
850: Information gain heuristics are also popular in the machine learning
community. However, in a continuous parameter space, computing the effect of adding
852: a new point is prohibitively expensive. Specifically, calculating the
853: information gain of a proposed sample requires integrating the
854: difference between the current model and expected result of the
855: proposed sample over all space. Since our function approximator has
856: only local support for predictions, we can reduce this integral down
to the local region. However, on this local region, computing
the expected value of the model requires multiple matrix inversions
to account for differences in the 100 nearest neighbors over the local
region. Even approximating this integral with a (small) finite sum
was found to be prohibitively expensive.
862: Instead, we use a strategy that is a combination of entropy and
863: variance (both easy to compute), and is
864: related to information gain. For more discussion on sampling
865: strategies and their performance, we refer interested readers to
866: \cite{bryan2005}.
867:
868: The method we use here, named ``Straddle'', combines the desire to
869: search the entire input space with that of refining our estimate
870: around known interesting regions. We do this by picking points that
871: the model predicts are both close to the boundary and have large
872: variances using the following heuristic:
873: \[
874: \mathrm{straddle}(s_q) = 1.96 \sigma_{s_q} - \big|
875: \mu_{s_q} - t \big|.
876: \]
877: Note that the straddle heuristic chooses those points with large
878: variances which straddle the boundary. In particular, if a point is
879: near the boundary, then $\mu_{s_q} \simeq t$ and
880: this metric is equivalent to a variance-only metric, choosing
881: points that are distant from one another.
882: However, if the point is not on the boundary, then its score drops off
883: proportionally to the distance from the boundary. The straddle score
884: for a point may be negative, which indicates that we predict that the
probability that the point is on a boundary is less than five
886: percent. Note that the straddle algorithm scores points highest that
887: are both unknown and near the boundary, and thus gives scores that
888: intuitively are similar to that of information gain.
889:
890: Our sampling strategy then consists of four steps. First we model our
891: current knowledge using the Gaussian process described in \S
892: \ref{model}. We then choose a set of candidate points randomly from the input
893: space and compute their mean and variances using the Gaussian process model. Next,
894: we score these points using the Straddle heuristic, and
895: select the highest scoring point. Finally, we run the chosen point
through CMBFast and use the result to refine our Gaussian process model.
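
Written compactly (the symbols $\mathcal{C}$, for the random candidate
set, and $s^\ast$, for the selected point, are introduced here only for
illustration), one iteration of this loop is
\begin{eqnarray*}
s^\ast &=& \arg\max_{s_q \in \mathcal{C}} \left( 1.96\, \sigma_{s_q} - \big| \mu_{s_q} - t \big| \right)\\
\bA &\leftarrow& \bA \cup \{ s^\ast \}
\end{eqnarray*}
where $\mu_{s_q}$ and $\sigma_{s_q}$ are computed from Equations
\ref{k_mean} and \ref{k_var} using the current $\bA$, and $f(s^\ast)$ is
obtained from CMBFast before the next iteration begins.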
897:
898: Ideally, we would like to analyze the
899: entire input space, and pick experiments in such a manner that
900: minimizes the number of experiments necessary. However, as our
901: input space is infinite (the parameters are continuous), we need
902: a heuristic to quickly generate a large, but not unwieldy set of
903: candidate points.
904: \textit{A priori}, we have no information about the function we are trying to
905: model. Therefore, in order to ensure that all
906: boundary segments of the true function are found (assuming sufficient
907: experimentation), it is necessary that candidate points be chosen such
908: that all infinitesimal hyper-rectangles in the input space have
909: non-zero probabilities of being chosen.
We therefore choose candidate points uniformly at random from the
911: input space, as this satisfies the probability constraint and is
912: extremely quick. We note that bad candidate points will be discarded
913: when their straddle scores are computed, and pose no problem for the
914: algorithm.
915:
916:
917: %% Section 4
918: \section{Results} \label{sec:results}
919: Using the algorithm described in \S \ref{sec:algorithm}, we have
920: sampled just over 1.2 million CMBFast models creating a ``primary''
921: data set. Additionally, we sampled another 100 thousand models
922: uniformly at random throughout the parameter space.
923: From the randomly sampled data, we find that less than
924: 0.1\% of the parameter space searched is within the $2 \sigma$ confidence ball;
that is, our set of acceptable models (those within $2\sigma$) excludes
926: 99.97\% of all possible models defined in Table \ref{paramtable}.
927: However, the
928: method we use to generate parameter vectors results in only 54\% of
929: the points being rejected by the hypothesis that the model and the
930: regressed fit are the same. Thus, by actively searching through the
931: space, we are able to identify and efficiently map regions of interest, while
932: ignoring large areas of parameter space that result in models below
933: the $2\sigma$ level. In \S \ref{sec:mcmc} we will see that our method
934: is much more data efficient than typical Bayesian methods.
935:
936: \subsection{Confidence Interval Projections} \label{sec:intervals}
937:
938: \begin{figure*}[!th]
939: \begin{center}
940: \plotone{f4.eps}
941: \end{center}
942: \caption{Jointly valid confidence intervals for our cosmological
943: parameters for four values of $1-\alpha$, corresponding to
944: $\frac{1}{2} \sigma, \sigma, 1 \frac{1}{2} \sigma$ and $2\sigma$
945: confidence levels, respectively.
946: Areas of solid color indicate values for the given parameter
that contain the true value of the cosmological parameter with
948: probability $1-\alpha$, regardless of the values of the remaining 6
949: parameters.
950: See the electronic edition of the Journal for a color version of this figure.}
951: \label{fig:results1d}
952: \end{figure*}
953:
954: \begin{figure*}[p]
955: \begin{center}
956: \plotone{f5.eps}
957: \end{center}
958: \caption{Jointly valid confidence regions for pairs of cosmological
959: parameters, where the colors cyan, magenta, blue and red correspond to
$\frac{1}{2} \sigma, \sigma, 1 \frac{1}{2} \sigma$ and $2\sigma$
confidence levels, respectively.
Areas of solid color indicate values for the given
pair of fixed (plotted) parameters that contain the true values of the
cosmological parameters with probability $1-\alpha$, regardless of the
values of the remaining 5 parameters.
966: Note there are two disjoint regions in parameter space
967: which are above the $2\sigma$ confidence interval.
968: See the electronic edition of the Journal for a color version of this figure.}
969: \label{fig:results2d}
970: \end{figure*}
971:
972: \begin{figure*}[t]
973: \begin{center}
974: \plotone{f6.eps}
975: \end{center}
976: \caption{Jointly valid confidence intervals for our cosmological
parameters, where we assume that the value of $H_0$ is between 60
and $75 \mpc$. Areas of solid color
indicate values for the given parameter that contain the true value of
the cosmological parameter with probability $1-\alpha$, regardless of the
values of the remaining 6 parameters. See the electronic edition of
982: the Journal for a color version of this figure.}
983: \label{fig:results1d:h0}
984: \end{figure*}
985:
986: \begin{figure*}[p]
987: \begin{center}
988: \plotone{f7.eps}
989: \end{center}
990: \caption{Jointly valid confidence regions for pairs of cosmological
parameters, where we assume that the value of $H_0$ is between
60 and $75 \mpc$. The colors
cyan, magenta, blue and red correspond to
$\frac{1}{2} \sigma, \sigma, 1 \frac{1}{2} \sigma$ and $2\sigma$
confidence levels, respectively.
Areas of solid color indicate values for the given
pair of fixed (plotted) parameters that contain the true values of the
cosmological parameters with probability $1-\alpha$,
regardless of the values of the remaining 5 parameters.
1000: Note that the constraint on $H_0$ eliminates the secondary confidence
1001: region found in Figure \ref{fig:results2d}.
1002: See the electronic edition of the Journal for a color version of this figure.}
1003: \label{fig:results2d:h0}
1004: \end{figure*}
1005:
1006: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1007:
1008: The result of running the 1.2 million models contained in the primary
1009: data set is a set of
1010: disjoint, seven dimensional ``confidence regions'' in parameter space
1011: which contain all models that fall within our $1-\alpha$ confidence
1012: ball. In each of these regions, the confidence interval for a
1013: particular parameter is given by the range of values that parameter
1014: takes in that region. Thus, the confidence interval for a particular
1015: parameter will be a function of which sets of regions we consider.
1016:
1017: If we put no restrictions on the values of the other 6 parameters,
1018: then the confidence interval of a parameter will be the union of
1019: the confidence intervals for that parameter for all confidence
1020: regions. We plot these unrestricted confidence intervals in Figure
1021: \ref{fig:results1d} for four values of $1-\alpha$.
1022: Intuitively, Figure \ref{fig:results1d} can be interpreted as stating
1023: that for any value of a parameter that lies within the depicted
1024: $1-\alpha$ confidence interval, there exists at least one
1025: combination of the remaining six parameters such that the resulting
1026: parameter vector lies within one of the $1-\alpha$ confidence regions.
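
In set notation (the symbols $I_j$ and $s_{-j}$ are shorthand introduced
here only for illustration), write $s_j$ for the component of a
parameter vector $s$ corresponding to parameter $j$, and $s_{-j}$ for
the remaining six components. The unrestricted interval plotted for
parameter $j$ is then the projection
\[
I_j = \big\{ s_j : \exists\, s_{-j} \ \mathrm{such\ that}\ s = (s_j, s_{-j}) \in \bP \ \mathrm{and}\ f(s) \le t \big\},
\]
where $f$ and $t$ are the scoring function and confidence ball radius of
\S \ref{model}. Because it is a projection of possibly disjoint regions,
$I_j$ need not be a single connected interval.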
1027:
In Figure \ref{fig:results2d} we depict the effect of interactions between pairs
of parameters on the computed confidence regions. As with the 1D
projections in Figure \ref{fig:results1d}, points in Figure
\ref{fig:results2d} which are denoted to be within the $1-\alpha$
confidence ball are points where, given the particular values of the
two fixed cosmological parameters (those being explicitly plotted
on the $x$ and $y$ axes), there exist some values for the other 5
parameters such that the resulting parameter vector is within the
$1-\alpha$ confidence region. While some plots show that most
combinations of the fixed parameters are within the 95\% confidence
ball, providing minimal constraints on parameters describing the
Universe, others, such as $\omega_\mathrm{DM}$ versus
$\omega_\mathrm{B}$ (4\ith row, 4\ith column), show strong
constraints.
1042:
Areas in Figure \ref{fig:results2d} which are blank (white) are areas that are
1044: rejected at the 95\% confidence level; for these combinations of fixed
1045: parameters, there exists no combination of the other five parameters,
1046: such that the resulting vector is within any of our confidence regions.
1047: In particular, the plot of $\Omega_\mathrm{DE}$ versus
1048: $\Omega_\mathrm{M}$ (2\ind row, 3\ird column) illustrates that
1049: $\Omega_\mathrm{Total} \gtrsim 0.9$, while the plot of $\omega_\mathrm{DM}$
1050: versus $\omega_\mathrm{B}$ shows that there are at least two disjoint
1051: confidence regions in our seven dimensional space. These disjoint
1052: regions in Figure \ref{fig:results2d} correspond directly to the split
1053: confidence intervals observed in Figure \ref{fig:results1d}.
1054:
The disjoint regions observed in Figure \ref{fig:results2d}, such as
those in the plot of $\omega_\mathrm{DM}$ vs. $\omega_\mathrm{B}$,
can also be seen in the 1D projections of
$\omega_\mathrm{DM}$, $\omega_\mathrm{B}$, and $H_0$ shown in Figure \ref{fig:results1d}.
1061: We defer further discussion of the disjoint confidence regions
1062: to \S \ref{sec:connectivity}. Smaller splits in the confidence
1063: intervals observed in nearly
1064: every plot in Figure \ref{fig:results1d} are a result of the fact that
1065: CMBFast does not return models which are perfectly continuous in the
1066: parameter space. While one may expect the derived confidence level to
1067: be smooth in parameter space, this is not the case.
1068: We observe small discretizations and
1069: inconsistencies in the power spectrum model, which result in the
1070: confidence ball having a jagged, nebulous surface (as observed in
1071: Figure \ref{fig:results2d}), rather than a perfectly smooth one. We will
1072: elaborate on this observation in \S \ref{sec:convergence}.
1073:
1074: As illustrated in Figure \ref{fig:results1d}, the confidence intervals
1075: for most parameters are not well constrained by the WMAP data alone.
1076: In particular, the constraint
1077: on the Hubble constant, $H_0$, is so weak as to allow values between
1078: 15 and 300 at the two sigma level; even at the one sigma level, $H_0$
1079: ranges between $15$ and $150$ with additional fits at $H_0 \sim 250$.
1080: The confidence intervals derived here cover the Bayesian credible
1081: intervals found in the literature using a
1082: variety of techniques (e.g. \cite{tegmark2001, spergel2003,
1083: spergel2006}), as shown in Table \ref{tab:compare}.
1084: While the results in Table \ref{tab:compare} are
1085: approximately centered on the same values,
1086: we are not in any way attempting to argue that the allowed parameter
1087: ranges are better, or worse, than those derived from alternative methods,
1088: as the comparison of credible (Bayesian) vs. valid (frequentist)
1089: parameter ranges is non-trivial and outside the scope of this work.
A discussion of the differences between the Bayesian and frequentist
1091: interpretations is given in \S \ref{sec:bayesvsfreq}.
1092:
1093: \begin{table*}
1094: \begin{center}
1095: {\footnotesize
1096: \begin{tabular}{c r@{ - }l r@{ - }l r@{ - }l | r@{ - }l r@{ - }l}
1097: \hline
1098: &
1099: \multicolumn{2}{c}{No} &
1100: \multicolumn{2}{c}{} &
1101: \multicolumn{2}{c|}{$n_s < 1$} &
1102: \multicolumn{2}{c}{Spergel} &
1103: \multicolumn{2}{c}{Spergel}\\
1104: %
1105: Parameter &
1106: \multicolumn{2}{c}{Constraints} &
1107: \multicolumn{2}{c}{$ 60 \le H_0 \le 75$} &
1108: \multicolumn{2}{c|}{$ 60 \le H_0 \le 75$} &
1109: \multicolumn{2}{c}{et al. (2003)} &
1110: \multicolumn{2}{c}{et al. (2006)}\\
1111: \hline
1112: \hline
1113: $\tau$
1114: & 0 & 1.2 % none
1115: & \multicolumn{2}{c}{0 - 0.94, 1.17 - 1.2} % h0
1116: & 0 & 0.4 % h0,ns
1117: & 0.095 & 0.242 % spergel 2003
1118: & 0.058 & 0.117 % spergel 2006
1119: \\
1120: $\Omega_\mathrm{DE}$
1121: & 0 & 0.94 %none
1122: & 0 & 0.94 %h0
1123: & 0.39 & 0.9 %h0,ns
1124: & \multicolumn{2}{c}{} % OmegaDE Spergel 2003
1125: & \multicolumn{2}{c}{} % OmegaDE Spergel 2006
1126: \\
1127: $\Omega_\mathrm{M}$
1128: & 0 & 1.0 %none
1129: & 0.13 & 0.95 % h0
1130: & 0.13 & 0.59 % h0,ns
1131: & 0.22 & 0.36 % OmegaM Spergel 2003
1132: & 0.199 & 0.273 % OmegaM Spergel 2006
1133: \\
1134: $\omega_{\mathrm{DM}}$
1135: & \multicolumn{2}{c}{0 - 0.36, 0.62 - 0.70} % none
1136: & 0.0 & 0.36 % h0
1137: & 0.03 & 0.2 % h0,ns
1138: & \multicolumn{2}{c}{} % omegaDM Spergel 2003
1139: & \multicolumn{2}{c}{} % omegaDM Spergel 2006
1140: \\
1141: $100\omega_{\mathrm{B}}$
1142: & \multicolumn{2}{c}{0.5 - 6.2, 11.5 - 12.7} % none
1143: & 1.3 & 5.5 % h0
1144: & 1.3 & 3.2 % h0, ns
1145: & 2.26 & 2.51 % omegaB Spergel 2003
1146: & 2.15 & 2.31 % omegaB Spergel 2006
1147: \\
1148: $f_\nu$
1149: & 0 & 1 % none
1150: & 0 & 1 % h0
1151: & 0 & 1 % h0,ns
1152: & \multicolumn{2}{c}{} % f_nu Spergel2003
1153: & \multicolumn{2}{c}{} % f_nu Spergel 2006
1154: \\
1155: $n_s$
1156: & 0.73 & 1.7 % none
1157: & 0.8 & 1.7 % h0
1158: & 0.84 & \textit{1.0} % h0
1159: & 0.95 & 1.03 % n_s Spergel2003
1160: & 0.944 & 0.978 % n_s Spergel 2006
1161: \\
1162: $\sigma_8$
1163: & \multicolumn{2}{c}{} % no constraints
1164: & \multicolumn{2}{c}{} % h0
1165: & \multicolumn{2}{c|}{} % h0,ns
1166: & 0.82 & 1.02 % sigma8 Spergel2003
1167: & 0.71 & 0.81 % sigma8 Spergel 2006
1168: \\
1169: $H_0$
1170: & \multicolumn{2}{c}{17 - 135, 243 - 272} % no constraints
1171: & \textit{60} & \textit{75} % h0
1172: & \textit{60} & \textit{75} % h0
1173: & 67 & 77 % H_0 Spergel2003
1174: & 70.3 & 76.7 % H_0 Spergel 2006
1175: \\
1176: \hline
1177: \end{tabular}}
1178: \end{center}
1179: \caption{Derived 68\% confidence intervals. Those to the left of the solid
1180: line are derived from Figures \ref{fig:results1d},
1181: \ref{fig:results1d:h0} and \ref{fig:results1d:nsh0} respectively,
1182: while those to the right are quoted from referenced literature.}
1183: \label{tab:compare}
1184: \end{table*}
1185:
1186:
1187: While this assessment may appear bleak, there is
1188: underlying structure to the confidence regions, hinted at by the
1189: disjoint regions in Figure \ref{fig:results2d}. Suppose we restrict
1190: the range of a subset of our parameters and then compute the
1191: confidence intervals for the remaining parameters.
1192: Since our statistical model is independent of the ranges searched, we can
1193: compute these conditional confidence intervals without re-running any
1194: models. For any restriction of our parameter space,
1195: the confidence interval for a parameter of interest will
1196: be the union of the confidence intervals for that parameter over those
1197: confidence regions which obey our restriction. For example, in Figures
1198: \ref{fig:results1d:h0} and \ref{fig:results2d:h0} we show the effect
1199: on the confidence intervals and regions, respectively,
1200: of imposing the restriction that $H_0$ is between
1201: $60$ and $75\mpc$. Note that with this
1202: restriction on $H_0$, the confidence intervals agree much better with the
1203: current estimate of the cosmological matter/energy budget and strongly
1204: suggest that $\Omega_\mathrm{Total} = 1$.
1205:
1206: This analysis exhibits the power of our statistical
1207: inference technique: we can test constraints on one parameter,
1208: and see their effects on the remaining parameters without additional
1209: CMBFast computation or invalidation of statistical inferences. To
1210: this end, we have created a graphical interface that can be used to
1211: apply constraints and view the resulting effects in real time; this
1212: tool, along with the necessary data files, can be downloaded from
1213: \url{http://gs3636.sp.cs.cmu.edu/visualizer/}.
1214:
1215: In the Bayesian view, the tightening of the allowable regions between
1216: Figures \ref{fig:results1d} and \ref{fig:results1d:h0}
1217: and Figures \ref{fig:results2d} and \ref{fig:results2d:h0}
1218: is analogous to what would occur when priors
1219: (either informative or non-informative) are applied. Such
1220: priors are universally applied in CMB cosmological analyses.
1221: As an example of how we can use this technique to better
1222: understand the cosmological confidence surface, we focus
1223: in on one or two parameters and utilize the graphical interface
1224: described above.
1225:
The WMAP Three Year data show that a scale-invariant spectrum
($n_s = 1$) is not a good fit to those data alone.
1228: If we place both the constraint that $n_s < 1$ and that
1229: $ 60 \mpc \le H_0 \le 75\mpc$ on the WMAP One Year data, we see
1230: in Figure \ref{fig:results1d:nsh0} that $\tau, \omega_\mathrm{B}$, and
1231: $\omega_\mathrm{DM}$ are much better constrained. More importantly, we
1232: see that the allowable ranges on $\omega_\mathrm{DM}$ are forced into a single
confidence range, in agreement with previous studies \citep{spergel2003}.
1234:
1235: Exploring the high $\omega_\mathrm{DM}$ space shown in Figure
1236: \ref{fig:results1d}, we find that models consistent with
1237: high $\omega_\mathrm{DM}$ have large values of $\omega_\mathrm{B}$ ($> 0.05$),
1238: as well as large Hubble constants ($>100\mpc$). Both of these parameters are
1239: much better constrained in the WMAP Three Year data. This
1240: leads us to predict that the second confidence surface peak in
the WMAP Three Year data is less significant than in the
1242: WMAP One Year data (although this has yet to be shown).
1243:
1244: % Add this comment?
1245: %
1246: % Thus, in Figure \ref{fig:results1}, it should not be surprising to see
1247: % such a large range on the allowable cosmological parameters. In fact,
1248: % the only parameters that are reasonably constrained (at the 1$\sigma$
1249: % level) are the Hubble constant, $\Omega_{total}$,
1250: % $\Omega_{matter}$,and $\Omega_{baryon}$. And of course none of these
1251: % are constrained at the level given in Spergel et al. (2003) for WMAP
1252: % (see Table xx).
1253:
1254: \begin{figure*}[t]
1255: \begin{center}
1256: \plotone{f8.eps}
1257: \end{center}
1258: \caption{Jointly valid confidence intervals for our cosmological
1259: parameters, where we assume that $60
1260: \mpc \le H_0 \le 75 \mpc$ and $n_s < 1$. Areas of solid
1261: color indicate values for the given parameter that contain the true
value of the cosmological parameter with probability $1-\alpha$,
1263: regardless of the values of the remaining 6 parameters.
1264: See the electronic edition of the Journal for a color
1265: version of this figure.}
1266: \label{fig:results1d:nsh0}
1267: \end{figure*}
1268:
1269:
1270: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1271: % Intentionally blank.
1272: %
1273: %
1274: %
1275: %
1276: %
1277: %
1278:
1279: \subsection{Convergence} \label{sec:convergence}
1280: Ideally, one would like to prove that our mapping from confidence
1281: ball radius to parameter space has converged. This could be done, for
1282: instance, by proving that our approximating model of spectrum distance
as a function of cosmological parameters (that is, our Gaussian
process) has converged to the true values in those areas where the
true values are near the radius of the $1-\alpha$ confidence ball.
1286: However, this effort has been confounded by a lack of continuity in
1287: the results returned by CMBFast. The method presented in this paper
1288: is not more susceptible to discontinuities than other techniques.
1289: Indeed, the convergence of most, if not all, inference methods will be
adversely affected by the discontinuities of CMBFast models we observe
1291: in parameter space.
1292:
1293: %\subsubsection{Smoothness Assumptions of CMBFast Models} \label{sec:smoothness}
1294: \begin{figure}
1295: \begin{center}
1296: \noindent
1297: \plotone{f9.eps}
1298: \end{center}
1299: \caption{A plot of spectra distance as a function of $\tau$, with
1300: all other parameters fixed, showing the discretization of CMBFast.
1301: For these experiments
1302: $\vec x = \{\tau, \Omega_\mathrm{DE}, \Omega_\mathrm{M},
1303: \omega_\mathrm{DM}, \omega_\mathrm{B}, f_\nu, n_s\}$ $=
1304: \{\tau, 0.0, 0.2, 0.8, 0.003, 0.0, 1.2\}$.}
1305: \label{fig:smooth}
1306: \end{figure}
1307:
One standard assumption of function approximators is that of
smoothness; that is, the underlying function to be modeled is
continuous and differentiable. For Gaussian processes, this
assumption motivates the use of a covariance matrix in determining
the relative weights of known samples when estimating values for unknown points. In
this paper, we have also assumed that the covariance function is fixed over
the entire space; that is, the underlying covariance is
isotropic and homogeneous. These assumptions allow us to compute
1316: error bounds for each point in space, and enable us to determine when
1317: the model has converged to the underlying function.
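
In symbols, and as a sketch only (the notation is introduced here for
illustration and does not specify the functional form used in our
model), a homogeneous and isotropic covariance depends on two parameter
vectors $x$ and $x'$ only through their separation:
\[
\Cov\left(f(x), f(x')\right) = k\left(\|x - x'\|\right),
\]
for some fixed function $k$ and norm $\|\cdot\|$, so that the same
function $k$ applies everywhere in parameter space.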
1318:
1319: \begin{figure*}
1320: \begin{center}
1321: \noindent
1322: \plotone{f10.eps}
1323: \end{center}
1324: \caption{A plot of spectra distance as a function of
$\Omega_\mathrm{DE}$, with all other parameters fixed. The square
boxes in each of the left two plots denote the area enlarged in the
neighboring plot to the right. Note that while on global scales
(A) the mapping appears to be smooth, closer inspection (B), (C)
reveals numerical errors resulting from approximations used in CMBFast.}
1330: \label{fig:smooth2}
1331: \end{figure*}
1332:
1333: However, experimentation shows that the underlying CMBFast function
1334: does not fulfill the continuous and differentiable assumptions, as
1335: shown in Figures \ref{fig:smooth} and \ref{fig:smooth2}.
1336: Both figures were produced by plotting the resulting model distance
1337: as we varied one parameter and kept the other six parameters fixed. Figure
1338: \ref{fig:smooth} shows a discretization effect that we believe is a
1339: result of integral approximations done by CMBFast. Discretization
1340: effects are common in simulated environments and it is reasonable to
1341: assume that the true function varies smoothly. More startling are the
1342: discontinuities revealed in Figure \ref{fig:smooth2}. Figure \ref{fig:smooth2} shows
1343: that while on a broad scale the CMBFast function appears smooth, when
1344: one looks closer and closer, the function begins to act quite
1345: erratically. Of particular interest are the large discontinuity at
1346: $\Omega_\mathrm{DE} = 0.446516$ and the seemingly random deviations
1347: from a smooth function throughout the entire range. These fluctuations
1348: in distance are not caused by random noise from CMBFast;
1349: CMBFast's output is deterministic given an input parameter
1350: vector.
1351:
1352: There are two important implications of the results in Figures
1353: \ref{fig:smooth} and \ref{fig:smooth2}. First, we note that
1354: when parameter values result in spectra that are very close to the
confidence ball radius, it is impossible to predict which side of
the boundary a given point will be on, due to the erratic numerical
errors in CMBFast. For regions where many points are near the confidence ball
1358: radius, we will obtain spotty, jagged boundaries between those areas
1359: in the ball and those not.
1360: Secondly, the effects plotted in
1361: Figures \ref{fig:smooth} and \ref{fig:smooth2} do not appear on the
1362: same range scales. This makes it more difficult to determine the
1363: correct level of smoothing, and hence discover the true underlying
1364: function. Thus, while it is still possible to deduce approximate
1365: covariances among the variables, it becomes impossible to ensure the
1366: model has correctly converged to the true model.
1367:
We note that this lack of continuity will adversely affect the
1369: convergence of any model that relies on the smoothness of the
1370: underlying function, be it MCMC or Gaussian processes.
1371: %In the case of
1372: %MCMC the lack of smoothness requires more extensive sampling of the
1373: %posterior to ensure the integral is correctly computed.
1374: In the case of MCMC,
1375: the discontinuities in the variance weighted sum of squares between
1376: the models computed by CMBFast and the data require that comprehensive
1377: sampling of the posterior be performed to ensure that the peaks and
1378: valleys in any local region are correctly averaged out, ensuring that
1379: the integral over the posterior is correctly computed. While we
1380: can run both methods in a mode that smooths over these
1381: discontinuities (by effectively ignoring them), we must realize that
1382: the resulting algorithms will converge to a solution that is
1383: incorrect. Additionally, increasing the sampling of either
1384: algorithm would eventually turn up the existence of these
1385: discontinuities, and the system would jump from an apparent
1386: convergence in the smoothed case, to a new convergence where
1387: discontinuities are considered. We elaborate on this idea further in
1388: \S \ref{sec:mcmc}.
1389:
1390: % Intentionally blank.
1391: %
1392: %
1393: %
1394: %
1395: %
1396: %
1397:
1398: % \subsubsection{Sampling}
1399: % While we cannot prove convergence, we suggest that we may be near
1400: % convergence, noting that the addition of subsequent points does not
1401: % affect either the model prediction (in terms of spectrum classification
1402: % accuracy) or the visual appearance of the data.
1403: % In Figure \ref{fig:diff}
1404: % we show the visual difference between 1.1 million and 1.2 million cmbfast experiments
1405: % for the plot of \fixme and \fixme, while in Table \ref{tab:diff} we show
1406: % the accuracy of classifying parameter vectors as to whether or not
1407: % they result in spectra with distances less than the $1-\alpha$ ball radius for several
1408: % confidence levels. Both Figure \ref{fig:diff} and Table \ref{tab:diff} show
1409: % little difference with the addition of ~10\% more data, suggesting
1410: % that convergence has been obtained.
1411: %
1412: % \begin{figure*}
1413: % \includegraphics{figures/results2d_time/results2d_time.epsi}
1414: % \caption{Plots show ?? versus ?? with 1.1 million points (A) and with
1415: % 1.2 million points. Note that even with the addition of ~10\% more
1416: % data point, the figure remains relatively unchanged.}
1417: % \label{fig:diff}
1418: % \end{figure*}
1419: %
1420: % \begin{table}
1421: % \begin{center}
1422: % \begin{tabular}{c c c}
1423: % \hline
1424: % & \multicolumn{2}{c}{Classification Accuracy}\\
1425: % $1-\alpha$ & 1.1 million points & 1.2 million points \\
1426: % \hline
1427: % \hline
1428: % 0.95\\
1429: % 0.78\\
1430: % 0.65\\
1431: % 0.45\\
1432: % \hline
1433: % \end{tabular}
1434: % \end{center}
1435: % \caption{Classification accuracy of point as to whether or not they
1436: % had distances less than the radius of a $1-\alpha$ confidence ball
1437: % for various $1-\alpha$ levels, with both 1.1 million points and
1438: % 1.2 million points. Note that even with the addition of ~10\% more
1439: % data point, the classification accuracies remain relatively unchanged.}
1440: % \label{tab:diff}
1441: % \end{table}
1442: %
1443: %
1444: % However, we caution that both of these results may be misleading.
1445: % Note that since the plot in Figure \ref{fig:diff} is a projection through
1446: % 5 dimensions, it is
1447: % possible to get similar results for much smaller number of
1448: % experiments, because poorly defined areas can be ``hidden'' behind
1449: % other prominent features.
1450: % % For instance, consider a ball of radius $r$
1451: % % centered at the origin of an $x$, $y$, $z$ axes. In order for the ball
1452: % % to appear well defined when projecting this ball down to
1453: % % the $x,y$ axes, we need only sample around the edge of the ball
1454: % % $x^2+y^2 = r^2$. Even though areas where $x^2 + y^2 < r^2$ may be
1455: % % poorly constrained, it may not be apparent from the projection.
1456: % Additionally, the results of Table \ref{tab:diff} are bouyed by the fact
1457: % that only ~0.1\% of the points at random fall within a 95\% confidence
1458: % ball radius. Thus, a straw-man classification approach that merely
1459: % picked the most common classification would obtain a 99\%
1460: % classification accuracy.
1461: %
1462: % While these arguments suggest that Figure \ref{fig:diff} and Table
1463: % \ref{tab:diff} cannot prove that we have reached convergence, we note
1464: % that they do show that we have reached an approximate convergence.
1465: % That is, while we cannot exactly state where the spectra distance surface
1466: % equals the $1-\alpha$ confidence ball radius, since the deviation
1467: % between both Figure \ref{fig:diff} and Table \ref{tab:diff} are tiny,
1468: % we can reasonably predict the location of this boundary.
1469: % Additionally, new peaks in the distance surface that arise above the
1470: % $1-\alpha$ confidence ball radius (that is new peaks where the points
1471: % are within the confidence ball) are extremely unlikely, as detection
1472: % of such a peak in new sampling would have spurred our algorithm to
1473: % vigorously search that area, resulting in effects visible in both
1474: % visual and classification results.
1475:
1476: \subsection{Connectivity} \label{sec:connectivity}
1477: As Figure \ref{fig:results2d} shows, there are two main peaks that lie above
1478: the $1\sigma$ confidence ball radius. As a test of the
1479: function approximator's convergence, we conducted focused tests to see if these
1480: peaks were truly connected. In particular, we used the semi-variance
1481: matrix of the Gaussian process to compute the maximal influence
1482: distance from a given point one could travel before possibly
1483: encountering the $1-\alpha$ confidence ball radius. We then created
1484: clusters of points above the 68\% confidence ball radius using a
1485: friends-of-friends algorithm; that is, a point is added to an existing
1486: group if it is within the maximal influence distance of any point
1487: currently in the group. Starting with all points in their own groups,
1488: we first passed through the data, merging groups where possible.
1489: Then, additional points were sampled between existing groups, using an
1490: A$^*$ like algorithm \citep{hart1968}. For two groups $A$ and $B$, we
1491: found the point, $x$, in $A$ that was closest to any point in $B$. We
1492: then created a set of candidate points within the influence distance
of $x$, and added them to a queue, $\bQ$, sorted according to their
distances to $B$. We then took the point $p$ from $\bQ$ that was closest
to $B$, ran it through CMBFast, and compared the result to our confidence ball. If
$p$ was within our confidence radius, then we created
candidate points for $p$ (just as we did for $x$) and added them to
$\bQ$. Otherwise, we discarded $p$.
1499: This procedure is repeated until either $B$ is within the influence
1500: distance of $p$ or we exhaust $\bQ$.
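
Stated more formally (with symbols introduced here only for
illustration, and treating the maximal influence distance as a single
value $\delta$ for simplicity), call two sampled points $p$ and $q$
linked whenever $\|p - q\| \le \delta$. The friends-of-friends groups
are the connected components of the resulting graph: $p$ and $q$ share
a group if and only if there is a chain
\[
p = p_0, p_1, \ldots, p_m = q, \qquad
\|p_j - p_{j-1}\| \le \delta \quad (1 \le j \le m).
\]
The search between groups $A$ and $B$ then attempts to build such a
chain out of newly evaluated points that lie inside the confidence ball.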
1501:
1502: The primary data set contained roughly 2000 distinct groups, which were
1503: quickly merged using the friends-of-friends algorithm. This left us with
1504: 2 major clusters shown in Figure \ref{fig:results2d}.
1505: Using the algorithm noted above,
1506: we were unable to find connections between the main peak and the
1507: secondary peak, even after multiple attempts starting from
1508: different locations. We believe that there exists no
1509: smooth transition of variable parameters that leads from the
1510: concordance to the secondary peak. The second peak is not just an
extension of the concordance peak that appears disjoint due to
undersampling or projection effects.
1513:
1514: \section{Comparison to Alternative Methods of Statistical
1515: Inference} \label{sec:comparison}
1516:
1517: In \S \ref{sec:results}, we showed that the results of our
1518: technique are quite similar to other statistical inference methods
1519: currently employed in the literature. Let us now relate our method
1520: to other inference techniques, and point out a few subtle, but
1521: remarkable, distinctions between them.
1522:
1523: \subsection{$\chi^2$ Tests}
1524: The method presented in \S \ref{sec:nonparametric} can be
1525: succinctly described as a method which computes the weighted sum of
1526: squares of the regressed fit and the test spectrum at the data
1527: points and rejects the hypothesis that the test spectrum could be
1528: generated by the data if the weighted sum is greater than the constant
1529: given in Equation \ref{conf0}.
1530: Intuitively, this process is quite similar to using a $\chi^2$ test,
1531: with two important differences.
1532:
1533: First, our technique is centered
1534: around a nonparametric fit to the data, not the data themselves. As a
1535: result, our method is approximately centered on the true underlying
1536: function, $f$, as opposed to the noisy observations of $f$.
The
implication is that our method is less affected by noise in the data
than simple $\chi^2$ tests.
1540: In particular, we have observed that $\chi^2$ tests will reject all
1541: models in cases where there is a single outlier $4\sigma$ from the maximum
1542: likelihood estimate fit. By initially fitting a nonparametric
1543: function to the data and then using this function to compute
sum-of-squares distances, we are much less susceptible to errors
1545: caused by noisy outliers.
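
The contrast can be written schematically as follows. The notation is
illustrative only, and the exact weights and threshold are those
defined in \S \ref{sec:nonparametric}. With observations $y_i$ at
multipoles $\ell_i$, errors $\sigma_i$, nonparametric fit $\hat f$, and
test spectrum $f_\theta$,
\begin{eqnarray*}
\chi^2(\theta) &=& \sum_i \frac{\left(y_i - f_\theta(\ell_i)\right)^2}{\sigma_i^2}, \\
d^2(\theta) &=& \sum_i \frac{\left(\hat f(\ell_i) - f_\theta(\ell_i)\right)^2}{\sigma_i^2}.
\end{eqnarray*}
A single outlying $y_i$ enters $\chi^2$ directly, but affects $d^2$
only through its smoothed influence on $\hat f$.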
1546:
1547: Secondly, the radius computed using the pivot process is smaller than
1548: the $\chi^2$ radius, as we consider the Gaussian errors of all points
1549: as an ensemble, not individually as with $\chi^2$ tests. The smaller
1550: radius of the pivot process translates directly into smaller confidence regions
1551: as compared with those found using $\chi^2$ tests. This allows
1552: us to reject more of the hypothesis test models, and subsequently return tighter
1553: bounds on the parameters of interest. The confidence ball test has
1554: more statistical power than does the $\chi^2$ test. A comparison of
1555: the relative widths of the confidence and $\chi^2$ balls is shown in
1556: Figure \ref{fig:distance_alpha}.
1557:
1558: \begin{figure}
1559: \begin{center}
1560: \plotone{f11.eps}
1561: \end{center}
1562: \caption{Radius of our non-parametric confidence ball as a function of
1563: confidence level (solid). The reduced $\chi^2$ ball is shown for
1564: comparison (dashed). Arrows depict $\frac{1}{2}, 1, 1\frac{1}{2}$ and
1565: $2\sigma$ respectively.}
1566: \label{fig:distance_alpha}
1567: \end{figure}
1568:
1569:
1570: \subsection{Bayesian Techniques} \label{sec:mcmc}
1571: As noted in \S \ref{sec:introduction}, most CMB power spectrum
1572: parameter estimations to date have been done via
Bayesian techniques \citep[e.g.,][]{knox2001, gupta2002, spergel2003,
jimenez2004, dunkley2005}. Since the prior distribution is not
conjugate to the likelihood, computing the posterior involves
estimating an integral over the entire space spanned by the prior.
Perhaps the most straightforward way to compute this integral is
1578: with an evenly-spaced grid with $n$ points per parameter. For this
1579: approach, one pre-specifies a $d$-dimensional grid (where $d$ is the
1580: number of parameters of interest) and computes the posterior at the
1581: center of each grid cell. The integral is then (approximately) the
1582: sum of the posterior at each grid cell, and the $1 - \alpha$ credible
1583: intervals can be determined (usually by marginalization) to be the
1584: smallest range for a given parameter that contains $1-\alpha$ of the
posterior probability. While straightforward, this approach scales
1586: exponentially with dimension, and hence is infeasible for even moderate
1587: dimensions; we estimate that a grid based approach, using CMBFast and
1588: seven parameters (similar to our method), with just 10 grid
1589: spacings per parameter would take over 100 years on a single computer.
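
As a rough check of this estimate (the per-model time below is inferred
from the figures just quoted, not measured separately), a grid with
$n = 10$ points in each of $d = 7$ dimensions requires
\[
n^d = 10^7
\]
CMBFast evaluations. One hundred years is about $3.2 \times 10^9$
seconds, so the quoted total corresponds to at least
\[
\frac{3.2 \times 10^9\ \mathrm{s}}{10^7} \approx 320\ \mathrm{s},
\]
or roughly five minutes, per model. Doubling the resolution to
$n = 20$ would multiply the cost by a further factor of $2^7 = 128$.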
1590:
As a result of the dimensionality problem, Markov Chain Monte Carlo
(MCMC) has become an increasingly popular approach for
estimating posteriors due to its (perceived) computational
efficiency \citep[e.g.,][]{gupta2002, jimenez2004, sandvik2004,
dunkley2005,chu2005}.
1596: In the MCMC technique, new samples are often derived using the
1597: Metropolis-Hastings algorithm. The Metropolis-Hastings algorithm
1598: chooses a new sample $x$ from some arbitrary (pre-specified) proposal
1599: distribution defined over the $d$-dimensional parameter space based on the
1600: previous sample and then accepts or rejects $x$
1601: based on the ratio of the proposed and current posterior density (when
1602: the proposal distribution is symmetric, as is common).
1603: The algorithm samples the input space roughly in
1604: proportion to the expected probability of each location.
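
For reference, the standard form of the acceptance rule is as follows
(the symbols $\pi$ and $q$ are introduced here for illustration).
Writing $\pi$ for the posterior density, $x_t$ for the current sample,
and $q(x \mid x_t)$ for the proposal density, a proposed sample $x$ is
accepted with probability
\[
a(x_t, x) = \min\left(1,
  \frac{\pi(x)\, q(x_t \mid x)}{\pi(x_t)\, q(x \mid x_t)}\right),
\]
which reduces to $\min\left(1, \pi(x)/\pi(x_t)\right)$ when $q$ is
symmetric. If the proposal is rejected, the chain repeats $x_t$. Note
that $\pi$ need only be known up to a normalizing constant, since the
constant cancels in the ratio.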
1605:
1606: % When sampling reaches detailed
1607: % balance --- that is, the probability of being in state $i$ and
1608: % transitioning to state $j$ is equal to the probability of being in
1609: % state $j$ and transitioning to state $i$ --- then we are guaranteed a
1610: % stationary distribution.
1611:
Theoretically, MCMC using the Metropolis-Hastings algorithm
converges almost surely to the stationary distribution (the
posterior) in the limit of infinite sampling. However, it is quite
difficult to determine if convergence has been reached with a finite number
of samples. In particular, if a posterior is composed
of two narrow, spatially separated Gaussians, then the probability of
1618: transition from one Gaussian to the other will be vanishingly small.
1619: Thus, after the chain has rattled around in one of the peaks for a
1620: while, it will appear that the chain has converged; however, after
1621: some finite amount of time, the chain will suddenly jump to the other
1622: peak, revealing that the initial indications of convergence were
1623: incorrect. As this example illustrates, if the Markov chain is run
with too few samples, the resulting credible intervals will be too
1625: narrow, and thus will not truly contain $1-\alpha$ of the probability
1626: mass. Thus, the consequence of lack of true convergence is artificially
1627: small credible intervals. This problem is usually skirted by assuming
1628: that there are no small isolated peaks, computing multiple independent
1629: chains and comparing the results to illustrate convergence.
1630: Additionally, \cite{dunkley2005} and others have proposed alternative
1631: methods to detect convergence. However, none of these methods are able to
1632: prove convergence with a limited number of CMBFast runs.
1633:
1634: Moreover, as we noted in \S \ref{sec:introduction}, MCMC is designed
1635: to draw samples from an unknown distribution, not to search that distribution.
1636: As a result, MCMC algorithms explicitly spend a large number of samples
1637: on high-likelihood regions, and a minimal number on low-likelihood
1638: regions. However, when we are computing $1-\alpha$ confidence
1639: intervals, it is the low-likelihood regions (those around the
1640: $1-\alpha$ boundary) that we are interested in. In contrast, a search
1641: algorithm that can directly look up the likelihood of a sample
1642: has no reason to spend a large number of samples near the peak of the
1643: distribution, and can instead focus on the boundary in question.
1644:
1645: These differences are clearly shown in Figure \ref{fig:mcmc_straddle},
1646: which depicts (with black dots) samples chosen by typical single runs of MCMC and
1647: our algorithm when trying to compute the $95\%$ credible/confidence
1648: intervals for a standard normal distribution\footnote{For the Bayesian case,
1649: we assume that the observed data is a single point at the origin. As
1650: a result, the true posterior derived via sampling will be exactly
1651: the same as the true standard Normal distribution. This is done to
1652: ensure that both algorithms are sampling the same function, allowing
1653: us to compare the sampling patterns of the algorithms.}.
1654: Both algorithms were
constrained to samples chosen in $[-10, 10]$. The MCMC algorithm was
1656: started at a randomly selected point, with a uniform prior over the
1657: range. In this figure we use a standard normal proposal distribution,
1658: although the sampling pattern is similar for other distributions we
1659: tried. Credible intervals for MCMC and confidence intervals
1660: for our algorithm are depicted below the plots.
Several points are quite apparent. First,
1662: MCMC has failed to converge in 50 samples, while our algorithm has
1663: converged nicely. The credible intervals given by MCMC are not only
1664: underestimated, but are also not centered on the true distribution's
1665: center, revealing a potential liability for interpreting MCMC chains
1666: which have not converged.
1667:
1668: Secondly, notice that MCMC heavily samples the peak
of the distribution, while our algorithm focuses on those regions
1670: associated with the confidence interval boundaries. The
1671: MCMC chain results in a ragged collection of disjoint credible
1672: intervals, while our algorithm returns a single interval in
1673: which the endpoints have been well determined.
1674:
1675: Thirdly, note that our algorithm samples extreme points to ensure that
1676: it has not failed to observe additional peaks in the distribution
1677: which may contribute to the 95\% confidence interval, while MCMC has
1678: not. As noted before, since MCMC is not a search algorithm, it may
1679: spend a large number of samples in a single distribution peak
1680: before jumping to another peak in the distribution. This sampling
1681: pattern may cause MCMC to appear to have converged, when in reality
1682: it has just failed to transition to the second peak, as in the two
1683: Gaussian case described previously.
1684:
1685: Finally, we note that the MCMC algorithm is not data efficient. While
1686: Figure \ref{fig:mcmc_straddle} depicts those experiments run by MCMC,
1687: the final MCMC chain consists of only those points that were accepted
1688: (in this case by the Metropolis-Hastings algorithm). As such, some of the
1689: points that MCMC samples are discarded immediately, and never used to
1690: guide the chain in future steps, or to determine the $1-\alpha$
1691: credible intervals. In addition, many MCMC practitioners
1692: remove all but every $j$th sample point (for some integer $j$) to
1693: ensure that the points in the chain are truly independent. This
1694: significantly reduces data efficiency.
1695:
1696:
1697: \begin{figure*}
1698: \begin{center}
1699: \plotone{f12.eps}
1700: \end{center}
1701: \caption{Distribution of experiments run by MCMC (left) and our
1702: algorithm (right). Black dots denote 50 experiments run in
1703: order to determine
1704: the 95\% credible / confidence interval (shaded red area) for a
1705: standard normal
1706: distribution (solid red line). Shaded blue areas below the normal
1707: curves indicate the credible / confidence intervals derived for the
1708: 50 samples chosen. See the electronic edition of the Journal for a
1709: color version of this
1710: figure.}
1711: \label{fig:mcmc_straddle}
1712: \end{figure*}
1713:
\subsection{Advantages of Frequentist Inference} \label{sec:bayesvsfreq}
1716: %\subsection{Bayesian vs. Frequentist Inference}
1717:
1718: Often, non-statisticians are confused by differences between Bayesian
1719: and frequentist techniques, and the advantages and limitations that
1720: each maintains. Particularly appealing with the Bayesian approach is
1721: the fact that one is computing a posterior distribution over the
1722: parameter space. Thus, not only does one obtain $1-\alpha$ credible
intervals, but one gets a sense of where within the interval the
1724: true value is expected to be. Frequentist approaches do not allow for
1725: one to compute the probability that the true value is equal to some
1726: particular parameter value. While choosing one technique over the
1727: other is a matter of personal statistical philosophy, we believe that
1728: frequentist approaches hold important advantages over their Bayesian
1729: counterparts.
1730:
1731: First, any Bayesian technique requires that one assume a family of likelihood
1732: functions and a prior distribution over the parameter space in order
1733: to compute the posterior. The resulting posterior is only as valid as
1734: both the likelihood and the prior. In many cases, a prior
1735: distribution is unknown. In these cases, an ``uninformative prior,''
1736: equivalent to a uniform distribution on some bounded range, is often
1737: assumed. However, such a prior is not uninformative. In particular,
1738: a uniform prior indicates that the practitioner believes that the true
1739: distribution of the parameter is uniform, not unknown. Moreover ``uninformative''
1740: priors are parametrization dependent. If we reformulate our 7D CMB
problem by replacing $\Omega_\mathrm{M}$ with $H_0$, a uniform prior over the
original problem will not translate into a uniform prior over the
formulation including $H_0$, as $\Omega_\mathrm{M}$ is inversely related to
1744: $H_0$.
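
To illustrate, assume the usual relation
$\Omega_\mathrm{M} = \omega_\mathrm{M} / h^2$ with
$h = H_0 / (100 \mpc)$, and hold the physical matter density
$\omega_\mathrm{M}$ fixed (both are assumptions made here for the sake
of the example). A prior density $p(\Omega_\mathrm{M})$ that is
constant then induces, by the change of variables formula,
\[
p(H_0) = p(\Omega_\mathrm{M})
  \left| \frac{d \Omega_\mathrm{M}}{d H_0} \right|
  \propto H_0^{-3},
\]
which strongly favors small values of $H_0$ rather than being flat.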
1745:
1746: Secondly, any change to the prior invalidates the current results. In
1747: particular, even when one is using a uniform prior,
1748: merely changing parameter
1749: ranges will result in a different posterior with possibly different
$1-\alpha$ credible intervals. Thus, analyses like those we performed
in \S \ref{sec:intervals} would have required us to recompute the
1752: entire chain (or set of chains), an extremely expensive proposition,
1753: or somehow approximate the difference.
1754: Additionally, for Bayesian techniques, the prior should be independent of
1755: the data, and hence it should not be changed after observing the
1756: data. By recomputing the posterior using a new prior (based upon a
1757: previous posterior), we open ourselves to errors incurred due to
1758: multiple hypothesis testing. Moreover, it is a small step from such
1759: repeated Bayesian inferences to data-dependent priors, which are
incoherent, not Bayesian. Hence, data-dependent priors do not benefit
1761: from theoretical guarantees derived for Bayesian analyses, which
1762: assume priors are chosen before any data is observed.
1763:
1764: It is interesting to note that Table
1765: \ref{paramtable} denotes the final ranges of parameters
1766: searched. We initially started with the same parameter ranges as
\citet{tegmark2001}, but increased our ranges slightly to better
capture a secondary peak in confidence space (shown in Figure
\ref{fig:results2d}). Because of our frequentist-based technique, we
1770: can easily change the ranges being searched without re-running any of
1771: the CMBFast models, or recomputing any of our current inferences.
1772: This contrasts sharply with Bayesian techniques.
1773:
1774: Finally, recall from \S \ref{sec:introduction} that Bayesian approaches
1775: answer a fundamentally different question than do frequentist
1776: approaches. Frequentist approaches are concerned with deriving
1777: procedures which will return confidence intervals that trap the
1778: true value of a parameter in at least $1-\alpha$ of the cases in which the
1779: procedure is used.
1780: Bayesian methods are more interested in determining the
1781: probability that a particular value of a parameter is chosen for the
1782: given data set and prior.
1783: While we can compute ``credible'' intervals for Bayesian methods by
1784: choosing the minimum range of a parameter such that the enclosed
probability is equal to $1-\alpha$, these intervals do not necessarily
1786: correspond to those derived from using a frequentist approach. In
1787: particular, there is no guarantee that credible intervals will
1788: contain the true value of the parameter in at least $1-\alpha$
1789: fraction of the instances where the technique is applied.
1790: Specifically, when the likelihood function of the model goes awry,
1791: such as in cases of high-dimension, missing data, and/or
1792: non-parametric models, the inference made using Bayesian methods will
1793: be incorrect.
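In symbols, and as a sketch only (the notation $C(Y)$ for an interval
computed from data $Y$, $\theta$ for the parameter, and
$\pi(\theta \mid y)$ for the posterior is introduced here purely for
illustration), the frequentist guarantee reads
\[
\inf_{\theta} P_\theta\left(\theta \in C(Y)\right) \ge 1-\alpha,
\]
where the probability is taken over repeated draws of the data with
$\theta$ held fixed. A $1-\alpha$ credible interval instead satisfies
\[
\int_{C(y)} \pi(\theta \mid y)\, d\theta = 1-\alpha
\]
for the single observed data set $y$ and the chosen prior. Neither
statement implies the other in general.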
1794:
1795: This problem is particularly acute for high dimensions,
1796: where $1-\alpha$ credible intervals might trap the true value of the
1797: parameter close to zero percent of the time. That is, if Bayesian
1798: techniques are applied to a series of data sets, the
1799: fraction of the resulting $1-\alpha$ credible intervals that contain the true
1800: values of the parameter will be less than $1-\alpha$ and may be
significantly less than $1-\alpha$. While we find this fact
1802: disturbing, a Bayesian might be willing to trade off the fact that the
1803: credible intervals usually will not contain the truth
1804: for the ability to compute a posterior distribution of likelihood over
1805: parameter space (assuming some prior) and hence determine the
1806: probability of any given parameter setting.
As \cite{olivestatistics} notes: ``to construct procedures with
1808: guaranteed long run performance, such as confidence intervals, use
1809: frequentist methods.''
1810:
1811:
1812:
1813:
1821:
1822: \section{Conclusions} \label{sec:conclusion}
1823:
1824: In this paper, we present a new technique to map confidence surfaces, and
1825: show results on first-year WMAP data. This method, utilizing
1826: a non-parametric fit and confidence balls, allows for computing
1827: simultaneously valid confidence intervals.
1828: Our technique is similar in spirit to the Bayesian methods, but
1829: differs significantly in that it is a frequentist analysis with
\textit{simultaneously valid} coverage.
1831: Thus, the derived confidence intervals are valid
1832: regardless of the values of the remaining parameters. This is not the
1833: case when a maximization or marginalization technique is used.
1834: While the use of confidence balls requires a search over the entire
1835: parameter space akin to the integration required for Bayesian
1836: techniques, we present an algorithm to efficiently compute regions of
1837: parameter space which have confidence values above a specified
1838: $1-\alpha$ threshold. We present results of our algorithm and note
1839: that they are similar to those derived using alternative statistical
methods. While the WMAP power spectrum data alone is insufficient to
constrain any of the cosmological parameters, the addition of
a reasonable assumption on the Hubble constant provides useful
cosmological insights.
1844:
1845: We point out that the purpose of this paper is to present
1846: a new statistical and computational technique to provide
1847: frequentist confidence intervals on the cosmological parameters
1848: using the WMAP Year 1 data. We are not
1849: arguing that the allowed parameter ranges shown in Figures
1850: \ref{fig:results1d}, \ref{fig:results2d}, \ref{fig:results1d:h0} and
1851: \ref{fig:results2d:h0}
1852: are more accurate than those presented by the WMAP
1853: team. The reason for this is two-fold: (1) the comparison
1854: of credible (Bayesian) vs. valid (frequentist) parameter ranges
1855: is non-trivial and outside the scope of this work and (2) we
1856: use only the WMAP Year 1 data, while others have utilized
1857: non-WMAP data in various ways to provide additional
1858: constraints on the parameters.
1859:
Analysis of Figures \ref{fig:results1d} and \ref{fig:results2d} shows
that the one sigma confidence regions are similar to those found in
the literature using a variety of techniques
\citep[e.g.,][]{tegmark2001,spergel2003,spergel2006}. Figures
\ref{fig:results1d} and \ref{fig:results2d} also illustrate that
the WMAP data alone is not sufficient to strongly constrain the
matter/energy budget for the Universe. In particular, the constraint
on the Hubble constant, $H_0$, is so weak as to allow values between
15 and 300 $\mpc$ at the two sigma level.
1869:
1870: If we instead constrain $H_0$ to a more ``typical'' range of
1871: $[60:75]$, we get much tighter constraints on \textit{all} parameters,
1872: as shown in Figures \ref{fig:results1d:h0} and \ref{fig:results2d:h0}.
1873: Because we are using a frequentist confidence procedure, adding the
1874: restriction does not affect the validity of the inference. Moreover,
1875: no additional CMBFast models must be computed to test this constraint,
1876: illustrating the power of our statistical procedure.
1877: Note that both Figures \ref{fig:results1d:h0} and
1878: \ref{fig:results2d:h0} agree much better with the current estimates of the
1879: cosmological matter/energy budget and strongly suggest
1880: that $\Omega_\mathrm{Total} = 1$.
1881:
1882: Moreover, as we show in \S \ref{sec:convergence}, CMBFast creates
1883: temperature power spectra which are discontinuous in parameter space.
This discontinuity violates the smoothness assumption on the
underlying target function made by both our Gaussian
process technique and MCMC, which makes convergence
statements difficult. However, we believe that the 1.2
1888: million models run show reasonable convergence. We believe that with additional
1889: assumptions on CMBFast --- such as the maximum size of a discontinuity
1890: --- we will be able to prove that our method converges in a reasonable
1891: time frame.
1892:
Additionally, we show that comparing CMBFast models to the WMAP Year 1
temperature power spectrum data results in a multi-modal solution in
1895: confidence space. We have detected at least two distinct confidence
1896: regions in parameter space. However, by adding assumptions on $n_s$,
1897: we can eliminate the secondary peak, leading us to believe that the
1898: secondary peak may not be visible in the WMAP third year data.
1899:
1900: In summary, we believe the proposed approach of using a non-parametric
1901: fit to the data and confidence balls, coupled with a search algorithm
1902: to find models in parameter space which fit our regressed estimate,
1903: provides a robust and informative
1904: method for computing confidence intervals for cosmological
1905: parameters. In addition to merely computing intervals, our approach
1906: has the ability to test various constraints without computing new
1907: models or making assumptions about which models should be fit and
1908: what the ranges of the parameter space should be. We are working on
1909: techniques to prove convergence of the algorithm, as well as the
1910: incorporation of additional data sets to further constrain the
1911: mass/energy budget of the Universe.
1912:
1913: \acknowledgments
1914: The authors would like to thank the referee for his/her valuable
1915: suggestions and corrections.
1916:
1917: {\it Facilities:} \facility{WMAP}
1918: \appendix
1919: \section{Estimating $\tau$} \label{appendix}
1920:
1921: Recall from \S \ref{sec:fit} that the cosine basis is defined on
1922: $[0,1]$ by
1923: \[
1924: \phi_j(x) =
1925: \left\{
1926: \begin{array}{l l}
1927: 1 & \mathrm{for\ } j=0\\
1928: \sqrt{2}\cos(\pi j x) & \mathrm{for\ } j = 1,2,3, \dots
1929: \end{array}
1930: \right.
1931: \]
1932: If $j$ and $k$ are distinct, positive integers, then
1933: \begin{eqnarray*}
1934: \phi_j \phi_k &=& 2 \cos(\pi j x) \cos(\pi k x)\\ &=& \cos(\pi(j+k)x) + \cos(\pi(j-k)x)\\
1935: &=& \frac{1}{\sqrt{2}} (\phi_{j+k} + \phi_{|j-k|}).
1936: \end{eqnarray*}
1937: Moreover, if $j>0$, then $\phi_j^2 = 2 \cos^2(\pi j x) = \cos(2\pi j x)+1 = \frac{1}{\sqrt{2}}
1938: \phi_{2j} + \phi_0.$
1939: Therefore, as mentioned in \S \ref{sec:fit},
1940: \[
1941: \Delta_{jk\ell} = \left\{
1942: \begin{array}{c c}
1 & \mathrm{if\ \#}\{j,k,\ell = 0\} = 3\\
0 & \mathrm{if\ \#}\{j,k,\ell = 0\} = 2\\
\delta_{jk}\delta_{0\ell} + \delta_{j\ell}\delta_{0k} +
\delta_{k\ell}\delta_{0j} & \mathrm{if\ \#}\{j,k,\ell = 0\} = 1\\
\frac{1}{\sqrt{2}}(\delta_{\ell, j+k} + \delta_{\ell,|j-k|}) & \mathrm{if\ \#}\{j,k,\ell = 0\} = 0
1948: \end{array}
1949: \right..
1950: \]
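As a quick check of these cases (the index values are chosen only for
illustration), note that $\phi_1 \phi_2 = \frac{1}{\sqrt{2}}(\phi_3 + \phi_1)$,
so by orthonormality of the cosine basis on $[0,1]$,
\[
\Delta_{123} = \Delta_{121} = \frac{1}{\sqrt{2}}, \qquad \Delta_{122} = 0,
\]
while $\phi_1^2 = \frac{1}{\sqrt{2}}\phi_2 + \phi_0$ gives
$\Delta_{110} = 1$ and $\Delta_{112} = \frac{1}{\sqrt{2}}$.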
Let $w(x) = 1/\sigma^2(x)$, with cosine-basis expansion $w(x) = \sum_j w_j
\phi_j(x)$. As in \S \ref{sec:fit}, we let $\hat \mu_j = \lambda_j
1953: Z_j$, where
1954: \[
1955: Z_j = \frac{1}{n} \sum_{i=1}^n Y_i \phi_j(X_i)
1956: \]
1957: and $1 \ge \lambda_0 \ge \lambda_1 \ge \cdots \ge \lambda_n \ge 0$ are
1958: shrinkage coefficients. In this work, we use a special case of
1959: monotone shrinkage in which
1960: \[
1961: \lambda_j = \left\{
1962: \begin{array}{cc}
1963: 1 & \mathrm{for\ } j\le J\\
1964: 0 & \mathrm{for\ } j> J
1965: \end{array}\right.
1966: \]
for $J \in \{0,1,2,\dots,n\}$ such that $J$ minimizes Stein's unbiased
1968: risk estimate given in Equation \ref{eqn:stein}.
1969: With these definitions, the loss can be written as
1970: \begin{eqnarray*}
1971: L(f, \hat f)
1972: &=&
1973: \int_0^1 \left(\frac{\hat f(x) -
1974: f(x)}{\sigma(x)}\right)^2 \, dx\\
1975: &=&
1976: \sum_{j,k,\ell} (\mu_j - \hat \mu_j)(\mu_k - \hat \mu_k) w_\ell
1977: \int_0^1 \phi_j \phi_k \phi_\ell\\
1978: &=&
1979: \sum_{j,k} (\mu_j - \hat \mu_j)(\mu_k - \hat \mu_k)
1980: \sum_{\ell} w_\ell \Delta_{jk\ell}\\
1981: &=& (\mu - \hat \mu)^T W (\mu - \hat \mu),
1982: \end{eqnarray*}
1983: where $W_{jk} = \sum_\ell w_\ell \Delta_{jk\ell}$.
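For concreteness, substituting the cases of $\Delta_{jk\ell}$ given
above into this sum yields, as a sketch that follows only from the
definitions in this appendix,
\[
W_{00} = w_0, \qquad W_{0k} = W_{k0} = w_k \quad (k > 0),
\]
\[
W_{jk} = \delta_{jk}\, w_0 + \frac{1}{\sqrt{2}}\left(w_{j+k} +
(1-\delta_{jk})\, w_{|j-k|}\right) \quad (j,k > 0),
\]
so $W$ is symmetric and is determined entirely by the coefficients $w_\ell$.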
As in \S \ref{sec:fit}, let $D$ and $\bar D = I - D$ be diagonal matrices with 1's in the
1985: first $J$ and last $n-J$ entries respectively. Then $\hat \mu = DZ$,
1986: where $Z$ is again assumed to be Normal $(\mu, B)$. Thus,
1987: $\E [\hmu] = D \mu$, $\Cov(\hmu_j, \hmu_k) = \lambda_j\lambda_k B_{jk}$
1988: and $\Var(\hmu) = DBD$. The risk then becomes
1989: \ba
1990: R = \E [L] &=& \E \left[(\mu - \hmu)^T W (\mu - \hmu)\right]\\
1991: &=& \mathrm{trace}(DWDB)+ \mu^T\bar D W \bar D \mu\\
1992: &=& \mathrm{trace}(DWDB)+ \sum_{j,k} \mu_j \mu_k \bar \lambda_j \bar
1993: \lambda_k W_{jk}
1994: \ea
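The second equality can be checked directly. Writing
$\mu - \hmu = \bar D \mu - D(Z - \mu)$ and using $\E[Z - \mu] = 0$, the
cross terms vanish in expectation, leaving
\ba
\E \left[(\mu - \hmu)^T W (\mu - \hmu)\right]
&=& \mu^T \bar D W \bar D \mu + \E \left[(Z - \mu)^T D W D (Z - \mu)\right]\\
&=& \mu^T \bar D W \bar D \mu + \mathrm{trace}(DWDB),
\ea
where the last step uses $\E[(Z-\mu)(Z-\mu)^T] = B$ and the cyclic
property of the trace.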
1995: An unbiased estimate can be obtained by replacing $\mu_j \mu_k$ with
1996: $Z_j Z_k - B_{jk}$. The result is
1997: \[
1998: \hat R = Z^T \bar D W \bar D Z + \mathrm{trace}(DWDB) -
1999: \mathrm{trace}(\bar D W \bar D B)
2000: \]
2001: It follows that
2002: \[
\hat L - \hat R =
\mu^T W \mu -Z^TC + Z^T A Z - \mathrm{trace}(AB)
2005: \]
2006: where $A = DW+WD-W$ and $C = 2DW\mu$. Moreover,
2007: \ba
2008: \Var(\hat L - \hat R) &=& \Var(Z^T A Z - Z^TC)\\
2009: &=& \Var(Z^T A Z) + \Var(Z^T C) - 2\,\Cov(Z^TAZ, Z^TC)\\
2010: &=&2 \, \mathrm{trace}(ABAB) + \mu^T Q \mu
2011: \ea
where $Q = 4(ABA + WDBDW - 2ABDW)$. Plugging in unbiased estimates of
2013: the linear and quadratic forms involving $\mu$, we get the following
2014: estimate for the variance of the pivot process:
2015: \[
2016: \hat \tau^2 = 2\, \mathrm{trace}(ABAB)+ Z^TQZ - \mathrm{trace}(QB).
2017: \]
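Two identities are used implicitly in this derivation. First, since
$D + \bar D$ is the identity matrix,
\[
D W D - \bar D W \bar D = DW + WD - W = A,
\]
which is how $A$ arises when the quadratic terms in $Z$ are collected.
Second, for $Z$ Normal $(\mu, B)$ and any fixed matrix $M$ (the symbol
$M$ is introduced here only for illustration),
\[
\E \left[Z^T M Z\right] = \mu^T M \mu + \mathrm{trace}(MB),
\]
so $Z^T M Z - \mathrm{trace}(MB)$ is unbiased for $\mu^T M \mu$. This is
the plug-in step applied with $M = \bar D W \bar D$ in $\hat R$ and with
$M = Q$ in $\hat \tau^2$.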
2018:
2019:
2020: % \newpage
2021: \bibliographystyle{apj}
2022: \bibliography{ms}
2023: \end{document}