
1: %\documentclass{emulateapj}  % 2-column preprint

2: \documentclass[12pt,preprint]{aastex}

3:

4: \usepackage{natbib}

5:

6:

7: \newcommand{\R}{\mathbb{R}}

8: \newcommand{\bP}{\mathcal{P}}

9: \newcommand{\bQ}{\mathcal{Q}}

10: \newcommand{\bK}{\mathcal{K}}

11: \newcommand{\bS}{\mathcal{S}}

12: \newcommand{\bD}{\mathcal{D}}

13: \newcommand{\bA}{\mathcal{A}}

14: \newcommand{\ith}{$^\mathrm{th}$\ }

15: \newcommand{\ird}{$^\mathrm{rd}$\ }

16: \newcommand{\ind}{$^\mathrm{nd}$\ }

17: \newcommand{\lmin}{{L_\mathrm{min}}}

18: \newcommand{\lmax}{{L_\mathrm{max}}}

19: \newcommand{\E}{\mathsf{E}}

20: \newcommand{\Cov}{\mathsf{Cov}}

21: \newcommand{\Var}{\mathsf{Var}}

22: \newcommand{\hmu}{\hat \mu}

23: \newcommand{\ba}{\begin{eqnarray*}}

24: \newcommand{\ea}{\end{eqnarray*}}

25: \newcommand{\mpc}{\frac{\mathrm{km/s}}{\mathrm{Mpc}}}

26:

27: \slugcomment{Submitted to ApJ, 11/09/06}

28: \shorttitle{Mapping the Cosmological Confidence Ball Surface}

29: \shortauthors{Bryan et al.}

30:

31: \citestyle{aa}

32:

33: \begin{document}

34: \title{Mapping the Cosmological Confidence Ball Surface}

35: \author{Brent Bryan and Jeff Schneider}

36: \affil{Department of Machine Learning, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213}

37: \email{\{bryanba, schneide\}@cs.cmu.edu}

38:

39: \author{Christopher J. Miller}

40: \affil{Cerro Tololo Interamerican Observatory, Casilla 603, La Serena, Chile}

41: \email{cmiller@noao.edu}

42:

43: \author{Robert C. Nichol}

44: \affil{Institute of Cosmology and Gravitation, University of Portsmouth, Portsmouth, PO1 2EG, UK}

45: \email{bob.nichol@port.ac.uk}

46:

47: \and

48: \author{Christopher Genovese and Larry Wasserman}

49: \affil{Department of Statistics, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213}

50: \email{\{genovese, larry\}@stat.cmu.edu}

51:

52: \begin{abstract}

53:

54: We present a new technique to compute simultaneously valid confidence

55: intervals for a set of model parameters. We apply our method to the

56: Wilkinson Microwave Anisotropy Probe's (WMAP) Cosmic Microwave

57: Background (CMB) data, exploring a seven dimensional space

58: ($\tau, \Omega_\mathrm{DE}, \Omega_\mathrm{M}, \omega_{\mathrm{DM}},

59: \omega_{\mathrm{B}}, f_\nu, n_s$).

60: We find two distinct regions-of-interest: the standard

61: Concordance Model, and a region with large values of $\omega_\mathrm{DM}$,

62: $\omega_\mathrm{B}$ and $H_0$. This second peak in

63: parameter space can be rejected by applying a constraint (or a prior)

64: on the allowable values of the Hubble constant. Our new technique uses

65: a non-parametric fit to the data, along with a frequentist approach and a

66: smart search algorithm to map out a statistical confidence

67: surface. The result is a confidence ``ball'': a set of parameter values that

68: contains the true value with probability at least $1-\alpha$.

69: Our algorithm

70: performs a role similar to the often used Markov Chain Monte Carlo (MCMC),

71: which samples from the posterior probability function in order to provide

72: Bayesian credible intervals on the parameters. While the MCMC approach

73: samples densely around a peak in the posterior, our new technique allows

74: cosmologists to perform efficient analyses around any regions of interest:

75: e.g., the peak itself, or, possibly more importantly, the $1-\alpha$ confidence surface.

76:

77:

78: % We present a new technique to compute simultaneously valid confidence

79: % intervals for a set of model parameters, given a data set and a

80: % parametrized model of the data.   This technique utilizes a

81: % non-parametric fit to the data, along with a frequentist approach and

82: % a smart search technique to compute joint confidence intervals of

83: % the parameters. The result is a $1 - \alpha$ confidence ball, which

84: % contains the true values of the unknown parameters with probability $1

85: % - \alpha$.  In this paper we apply this method to the Wilkinson

86: % Microwave Anisotropy Probe (WMAP) Cosmic Microwave Background (CMB)

87: % data,  exploring a seven dimensional space (optical depth, dark energy mass

88: % fraction, total mass fraction, dark matter density, baryon density,

89: % neutrino fraction, and scalar spectral index).

90: % Our technique performs a role similar to the

91: % often used Monte Carlo Markov Chains (MCMC), which maps out the

92: % posterior probability function. However, the significant difference

93: % between these two techniques is the use of Bayesian (used in MCMC)

94: % versus frequentist approaches, and the resulting implications that

95: % these approaches have on statistical inference.

96: % Using a frequentist approach, we are able to avoid the assumptions

97: % of which functions are to be fit, and on which ranges.

98: % Additionally, the inference is independent of the samples drawn, and

99: % therefore less susceptible to under sampling.

100: % We  note that MCMC is not designed to be a search algorithm, and propose

101: % a new search algorithm to guide the evaluation of parameter

102: % settings, which is much more efficient.  We present 2D

103: % projections through the $1\sigma$ and $2\sigma$ confidence balls, and

104: % compare the results with those obtained via other methods.

105: \end{abstract}

106:

107: \keywords{cosmology: cosmic microwave background --- cosmology:

108:  cosmological parameters ---  methods: statistical}

109:

110: \section{Introduction} \label{sec:introduction}

111: The Cosmic Microwave Background (CMB) angular temperature power spectrum

112: is the most widely utilized data set for constraining the cosmological

113: parameters \citep{tegmark2001, christensen2001, verde2003, spergel2003, tegmark2004}.

114: This power spectrum, which

115: statistically measures the distribution of temperature fluctuations as

116: a function of scale, is comprised of at least two peaks thought to

117: have been formed by sound wave modes inherent in the primordial gas during

118: recombination.  The locations, heights, and height-ratios

119: of the peaks and valleys in the power spectrum can provide direct

120: information about fundamental parameters of the universe, such as the space-time

121: geometry, the  fraction of energy density contained in the baryonic

122: matter, and the cosmological constant \citep{miller2001}.   However, it

123: is more common

124: for cosmologists to compare the observed CMB power spectrum to a suite

125: of cosmological models (e.g. CMBFast \citep{seljak1996} and CAMB

126: \citep{lewis2000}). These models require as input some minimal number

127: of cosmological parameters, $d$, --- typically $d=6$ or $d=7$.

128:

129: Most CMB power spectrum parameter estimations to date have been done via

130: Bayesian techniques (e.g., \cite{knox2001, gupta2002, spergel2003, jimenez2004, dunkley2005}).

131: For these techniques, the $d$-dimensional

132: likelihood function is parametrically estimated and prior probabilities are

133: assumed for each parameter. Then, a posterior probability distribution

134: can be computed, and credible intervals can be found.  However, unless

135: the form of the prior is conjugate to the likelihood (which is atypical), computing the

136: posterior involves estimating an integral over the entire space

137: spanned by the prior.  There are two basic approaches to solving this

138: problem in the literature.  \cite{tegmark2001} approximates

139: this integral explicitly, using an adaptive grid, where grid

140: cells are more densely located in areas presumed to be important.

141: Secondly, and more popularly, many authors have used Markov Chain Monte Carlo

142:  (MCMC) methods (e.g. \cite{gupta2002, lewis2002, jimenez2004, sandvik2004,

143: dunkley2005,chu2005, hajain2006}), which tend to be much more efficient than grid

144: based techniques, but are notoriously difficult to tune and test for

145: convergence \citep{olivestatistics}.

146:

147: While Bayesian techniques are used in the majority of work on CMB

148: parameter estimation, there have also been undertakings to estimate

149: cosmological parameters using frequentist techniques, such as $\chi^2$ tests

150: \citep{gorski1993, white1995, padmanabhan2001,

151: griffiths2001, abroe2002} and Bayes risk analyses \citep{schafer2003}.

152: We present a novel frequentist method based upon

153: a non-parametric fit to the data to

154: estimate the smooth underlying power spectrum, as well as an

155: error ``ellipse'' following the technique used in \cite{miller2001}

156: and \cite{genovese2004}.

157: This confidence ball has a radius which is a function of the

158: probability with which the true power spectrum is contained within the

159: ball and the observed error estimates. The ball radius is independent

160: of both the models to be fit, as well as the parameter ranges to be queried.    Thus, we

161: can take a vector of parameters, run it through our favorite CMB power

162: spectrum generating model, and determine whether or not the model (and

163: hence the parameter vector) lies within our confidence ball, without

164: fixing \textit{a priori} the model to be used, or the parameter ranges to

165: be searched.  We are interested in finding the set of parameter

166: vectors which lie within the $1-\alpha$ confidence  ball, for some

167: significance level (or probability of being incorrect), $\alpha$.

168:

169: This is a statistically different style of ``confidence'' than the

170: credible intervals or the ``degree of belief'' one obtains using

171: Bayesian techniques. In particular, the Bayesian method answers the

172: question ``assuming a given model and prior distribution over the

173: parameter space, what is the smallest range of a particular parameter from which I believe

174: the next sample will be drawn  with probability $1-\alpha$?''

175: In contrast, the frequentist approach constructs a procedure for

176: deriving confidence intervals that, when applied to a series of

177: data sets, traps the true parameters for at least $100(1-\alpha)\%$

178: of the data sets.   For parametric models with large sample

179: sizes, Bayesian and frequentist approaches are known to result in

180: similar inferences.  However, for high dimensional and

181: non-parametric problems --- such as estimating cosmological parameters

182: from the CMB power spectrum --- Bayesian methods may not yield accurate

183: inferences \citep{olivestatistics}.  In such cases, the Bayesian 95\%

184: credible interval may not contain the true value 95\% of the time in

185: a frequency sense.

186:

187: Additionally, mapping a region of high likelihood points in

188: parameter space is fundamentally a search problem.  As MCMC methods are

189: designed to sample and/or integrate a distribution, they are not

190: necessarily good search algorithms in practice.  In particular, an MCMC

191: method ``represents'' a high-likelihood region by heavily sampling

192: that region --- an expensive proposition when using CMBFast.  In

193: contrast, a search algorithm that can directly observe the

194: (normalized) likelihood of a sample will have no reason to spend more

195: samples in the same location.

196: In addition to describing a frequentist approach to computing

197: confidence intervals for cosmological parameters,

198: another significant contribution of this paper is the

199: proposal of a new search algorithm for mapping confidence surfaces.

200:

201: In this work, we utilize the non-parametric basis

202: described by  \cite{miller2001} and \cite{genovese2004} to constrain

203: the set of cosmological models which fit the WMAP observations.

204: At the same time, we must deal with the challenges posed in other

205: frameworks, namely robustness of the algorithm, efficiency,

206: and issues of convergence.  A schematic outline of our technique is

207: shown in Figure \ref{fig:outline}.

208: In \S \ref{sec:methodology}, we

209: briefly describe the data and cosmological models used, as well as the

210: non-parametric technique (the bottom row of Figure \ref{fig:outline}).  We

211: then focus on a new algorithm to map the derived confidence ball into

212: parameter space in \S \ref{sec:algorithm}, sketched out on the

213: top line of Figure \ref{fig:outline}.   In \S

214: \ref{sec:results}, we present results of our

215: algorithm, and discuss the challenges of accurately determining confidence

216: intervals using any statistical approach.

217: Finally, in \S \ref{sec:comparison}, we compare our

218: method with commonly used inference techniques, and discuss the

219: advantages of using the proposed approach.

220:

221:

222: \begin{figure*}

223: \begin{center}

224: %\includegraphics[scale=0.85]{f1.eps}

225: \plotone{f1.eps}

226: \end{center}

227: \caption{Schematic outline of our technique to constrain confidence intervals.}

228: \label{fig:outline}

229: \end{figure*}

230:

231: \section{Methodology} \label{sec:methodology}

232: \subsection{Data \& Models} \label{sec:datamodels}

233: We examine the CMB power-spectrum ($\hat C_{\ell}$) as

234: measured by the Wilkinson Microwave Anisotropy Probe's first-year

235: data release \citep{bennett2003, hinshaw2003,

236: verde2003}\footnote{Available at

237: \url{http://lambda.gsfc.nasa.gov}}, shown in Figure

238: \ref{fig:wmapdata1}.  Our approach is similar to that of other authors

239: (e.g. \cite{tegmark1999, tegmark2001,

240: spergel2003}), who fit the observed CMB power spectrum to a suite of

241: cosmological models.   These models, while sophisticated and detailed, have numerous free

242: parameters, some of which are difficult to ascertain (e.g. ionization

243: depth, contribution of gravity waves).  However, there are many codes

244: available to compute the CMB power spectrum, which trade off speed for

245: accuracy and robustness.

246:

247: Both CMBFast \citep{seljak1996} and the related CAMB \citep{lewis2000}

248: compute the CMB power spectrum by evolving the Boltzmann equation

249: using a line of sight integration technique.  While an order of

250: magnitude faster than computing the full Boltzmann solution, this approach

251: is still rather slow.

252: One approach for reducing the computation time of CMBFast

253: is to split the Boltzmann computation into low and

254: high multipole moment portions, as the low and high multipoles are

255: mostly independent \citep{tegmark2001}.  Using this method, ksplit,

256: \cite{tegmark2001} was able to reduce computation time by a factor of 10.

257: Additionally, several approximate programs have been

258: developed which are orders of magnitude faster than CMBFast,

259: including DASh \citep{kaplinghat2002},

260: CMBWarp \citep{jimenez2004}, and Pico \citep{fendt2006}.

261: In general, these programs gain great speedups

262: by approximating the power spectrum with a regression function

263: fit to predetermined sample points generated from simulators such as CMBFast.  As a

264: result, generating a hypothesis spectrum for a new set of parameters

265: is a simple function evaluation, foregoing the computation of the

266: Boltzmann equation entirely.

267:

268: While using any one of these approximate methods or ksplit may seem

269: appealing due to their computational efficiency, they do not have the

270: desired accuracy and robustness \citep{seljak2003}.  These codes are

271: only approximations.  While fairly accurate around the concordance

272: peak, their accuracy drops off drastically when computing models for

273: parameter vectors slightly removed from

274: the ``accepted'' cosmological models.

275: Additionally, these codes are prone to failures when presented with

276: parameter vectors that are not within a narrowly defined region around

277: the concordance model \citep{fendt2006}.

278: %For instance, Pico

279: %uses a set  of precomputed points to generate its interpolation; these

280: %points were all picked to be near the accepted concordance peak.

281: According to the Pico website: ``Since Pico's purpose is to be part of

282: parameter estimation codes, we are mainly concerned with having the regression

283: coefficients defined around the region of parameter space allowed by

284: the data (mainly the WMAP3 data). Pico will not be able to compute

285: accurate spectra and likelihoods away from this region, but it will

286: warn you about this.''  Similarly, in many instances ksplit will hang

287: on parameter vectors that are a short distance from the concordance peak.

288: Since we are interested in finding the tightest possible confidence

289: intervals for all regions of parameter space that can possibly fit the data,

290: we do not want to be artificially restricted by our CMB simulator.

291: Thus, we choose to compute the model CMB power spectra

292: using CMBFast; while not the fastest code available, CMBFast is

293: accurate and reliable.

294:

295: Next, multipole covariance is estimated by using the covariance derived for the

296: concordance model using code from \cite{verde2003}.

297: We find that the computed variances match well with

298: those found in the first-year data release, with only a slight

299: (roughly $1.15$) multiplicative offset.  This constant factor offset

300: was hinted at by the sub-unity slope of the quantile-quantile plot of

301: the variance-weighted deviations between the data and the concordance

302: model prediction, using the variances given in the WMAP data.

303:

304:

305: \begin{table}[t]

306: \begin{center}

307: \begin{tabular}{c  l r@{${\ }-{\ }$}l}

308: \hline

309: \textbf{Parameter} & \textbf{Description} &

310: \multicolumn{2}{c}{\textbf{Range}}\\

311: \hline

312: \hline

313: $\tau$ & optical depth & 0.0 & 1.2\\

314: $\Omega_\mathrm{DE}$ & dark energy mass fraction& 0.0 & 1.0\\

315: $\Omega_\mathrm{M}$ & total mass fraction & 0.1 & 1.0\\

316: $\omega_{\mathrm{DM}}$ & dark matter density& 0.01 & 1.2 \\

317: $\omega_{\mathrm{B}}$ & baryon density& 0.001 & 0.25\\

318: $f_\nu$ & neutrino fraction& 0.0 & 1.0\\

319: $n_s$ & spectral index& 0.5 & 1.7\\

320: \hline

321: \end{tabular}

322: \end{center}

323: \caption{Cosmological parameters and ranges searched.}

324: \label{paramtable}

325: \end{table}

326:

327: \cite{spergel2006} show that the WMAP third year data are well

328: described by a simple 6 parameter model: $\tau, H_0,\Omega_\mathrm{M},

329: \Omega_\mathrm{B}, \sigma_8, n_s$.  In this paper, we use

330: effectively the same model space as the simplified model in

331: \cite{spergel2006}, except that we include the neutrino fraction and

332: exclude $\sigma_8$.  We made this change as we are not utilizing

333: large-scale structure data, which is sensitive to $\sigma_8$. The

334: resulting parameter vector

335: $\mathbf{p} = (\tau, \Omega_\mathrm{DE}, \Omega_\mathrm{M},

336: \omega_{\mathrm{DM}}, \omega_{\mathrm{B}}, f_\nu, n_s)$ is similar to

337: the model space searched by \cite{tegmark2001}.

338:  A description and considered range for each of these variables

339: is presented in Table \ref{paramtable}; the parameter ranges

340: considered here are slightly larger than those searched by \cite{tegmark2001}, due

341: to our interest in mapping an observed secondary peak in parameter space.

342: Note that $\Omega_\mathrm{k} = 1 - \Omega_\mathrm{M} - \Omega_\mathrm{DE}$.

343: Moreover, the Hubble constant, $H_0$, is not an independent parameter,

344: but given by

345: \[

346: \frac{H_0}{100} = h =\sqrt{\frac{\omega_\mathrm{DM}+\omega_\mathrm{B}}{\Omega_\mathrm{M}}}

347: = \sqrt{\frac{\omega_\mathrm{DM}+\omega_\mathrm{B}}{1-\Omega_\mathrm{k} - \Omega_\mathrm{DE}}}.

348: \]

349: We denote the space spanned by $\mathbf{p}$ as

350: $\mathcal{P}$. $\mathcal{P}$ is a seven dimensional hyper-rectangle

351: where the range of the $j^\mathrm{th}$ side corresponds to the range of the

352: $j^\mathrm{th}$ cosmological parameter of $\mathbf{p}$.
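
The derived quantities and the range check on $\mathcal{P}$ can be illustrated with a short Python sketch. The names \texttt{PARAM\_RANGES}, \texttt{in\_search\_space} and \texttt{derived\_quantities} are hypothetical and chosen only for illustration; the ranges are those of Table \ref{paramtable}, and the example vector holds illustrative values rather than fitted results.

\begin{verbatim}
```python
import math

# Search ranges (lower, upper) for each component of p, from the parameter table.
PARAM_RANGES = {
    "tau": (0.0, 1.2),
    "Omega_DE": (0.0, 1.0),
    "Omega_M": (0.1, 1.0),
    "omega_DM": (0.01, 1.2),
    "omega_B": (0.001, 0.25),
    "f_nu": (0.0, 1.0),
    "n_s": (0.5, 1.7),
}


def in_search_space(p):
    """Return True if every component of p lies within its searched range."""
    return all(lo <= p[name] <= hi for name, (lo, hi) in PARAM_RANGES.items())


def derived_quantities(p):
    """Return (Omega_k, h, H0) implied by a parameter vector p."""
    omega_k = 1.0 - p["Omega_M"] - p["Omega_DE"]
    h = math.sqrt((p["omega_DM"] + p["omega_B"]) / p["Omega_M"])
    return omega_k, h, 100.0 * h


# Illustrative values only (hypothetical, not a fitted result).
example = {"tau": 0.1, "Omega_DE": 0.7, "Omega_M": 0.3,
           "omega_DM": 0.12, "omega_B": 0.024, "f_nu": 0.0, "n_s": 1.0}
```
\end{verbatim}

Because $H_0$ is derived rather than sampled, a constraint on the Hubble constant translates into a joint constraint on $\omega_\mathrm{DM}$, $\omega_\mathrm{B}$ and $\Omega_\mathrm{M}$.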

353:

354: \subsection{Nonparametric Analysis} \label{sec:nonparametric}

355: We now provide a brief sketch of nonparametric data

356: analysis, as it pertains to the CMB power spectrum.  We follow the

357: derivations given in \cite{miller2001} and \cite{genovese2004}, and refer

358: interested readers to those works. Our technique is

359: designed to:

360: \begin{enumerate}

361: \item Compute a fit to the actual data which minimizes the sum of the

362:       bias and the variance between the fit

363:       and the data, taking into account the full covariance discussed in

364:       \S \ref{sec:datamodels}.  Errors are assumed to be Gaussian.

365:       This fit is effectively a smoothed version of the data.

366:

367: \item Determine a confidence ball around the best fit for a

368:       given test level, $\alpha$.

369:

370: \item Find all such vectors $s \in \mathcal{P}$ such that the

371:       power spectrum output by CMBFast for $s$ results in a

372:       model which is contained within the $1-\alpha$ confidence ball

373:       found in step 2.

374: \end{enumerate}

375: We now detail items 1 and 2, leaving the discussion of item 3 to \S \ref{sec:mapping}.

376:

377: %%%%%%%%%%%%%%%%%%%%%%%

378: % Intentionally Blank

379: %

380: %

381: %

382:

383: \subsubsection{The Non-Parametric Fit} \label{sec:fit}

384: Let $\ell \in [L_\mathrm{min}, \dots, L_\mathrm{max}]$ denote a

385: generic index of the CMB temperature power spectrum multipole, and $n

386: = L_\mathrm{max} - L_\mathrm{min}+1$ be the total number of observed

387: multipoles. We take $Y_{\ell} = \hat{C}_{\ell}$ to be the observations

388: of the CMB where  $x_{\ell} = (\ell-\lmin)/(\lmax-\lmin)$ and let

389: $f(x_{\ell})\equiv C_{\ell}$ denote the true power spectrum

390: at multipole index $\ell$.

391: We then solve the nonparametric regression problem:

392: \begin{equation}\label{eq:regress2}

393: Y_{\ell} = f(x_{\ell}) + \epsilon_{\ell}, \qquad \ell = L_\mathrm{min}, \ldots, L_\mathrm{max},

394: \end{equation}

395: where $\epsilon =

396: (\epsilon_{L_\mathrm{min}},\ldots,\epsilon_{L_\mathrm{max}})$ are

397: assumed Gaussian with known covariance matrix $\Sigma$ as described earlier.

398: Henceforth, we will use $i=\ell -\lmin+1$ as an index.

399: Nonparametric analysis is based on the notion of estimating a function

400: without forcing it to fit some finite-dimensional parametric form

401: (e.g. a Normal distribution), by smoothing the data in such a way as to

402: balance the bias and variance. In this work, we use orthogonal series

403: regression to estimate $f$, expanding $f$ in a

404: cosine basis:

405: \[

406: f(x) = \sum\limits_{j=0}^\infty \mu_j \phi_j(x)

407: \]

408: where

409: %\begin{equation}\label{basis}

410: \[

411: \phi_j(x) =

412: \left\{

413: \begin{array}{l l}

414: 1 & \mathrm{for\ } j=0\\

415: \sqrt{2}\cos(\pi j x) & \mathrm{for\ } j = 1,2,3, \dots

416: \end{array}

417: \right.

418: %\end{equation}

419: \]

420: and the $\mu_j$'s are the coefficients for each basis component.

421: If $f$ is smooth, then $\mu_j$ will decay rapidly as $j$

422: increases. That is, if $f$ is smooth, then there is little or no

423: high-frequency fluctuation in $f$ and hence $\mu_j \simeq 0$ for large $j$.

424: Thus,

425: $\sum_{j=n+1}^\infty \mu^2_j$ will be negligible, and we can approximate the

426: infinite sum as $f(x) \approx \sum_{j=0}^n \mu_j \phi_j(x)$.  Let

427: \[

428: Z_j = \frac{1}{n} \sum_{i=1}^n Y_i \phi_j(x_i)

429: \]

430: for $j=0, 1,\dots n$.  Then

431: $Z$ is approximately normally distributed with mean $\mu$ and covariance

432: $B/\sqrt{n} = U \Sigma U^T/ \sqrt{n}$, where $U$ is the cosine basis transformation

433: matrix.
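
As a brief sketch of why the mean of $Z_j$ is approximately $\mu_j$, assuming only that the $x_i$ are evenly spaced on $[0,1]$ (so that the average over $i$ approximates an integral) and that the cosine basis is orthonormal on $[0,1]$:
\[
\E[Z_j] = \frac{1}{n}\sum_{i=1}^n f(x_i)\,\phi_j(x_i)
\approx \int_0^1 f(x)\,\phi_j(x)\,dx
= \sum_{k=0}^\infty \mu_k \int_0^1 \phi_k(x)\,\phi_j(x)\,dx
= \mu_j ,
\]
where the last step uses $\int_0^1 \phi_k(x)\,\phi_j(x)\,dx = \delta_{jk}$.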

434:

435:  In order to obtain an even smoother

436: estimate of $f$, we damp out the higher frequencies using shrinkage

437: estimators.  We let $\hat \mu_j = \lambda_j Z_j$ where $1 \ge

438: \lambda_0 \ge \lambda_1 \ge \cdots \ge \lambda_n \ge 0$ are shrinkage

439: coefficients.  The estimate of $f$ is now

440: \[

441: \hat f(x) = \sum_{j=0}^n \hat \mu_j \phi_j(x) = \sum_{j=0}^n

442: \lambda_j Z_j \phi_j(x).

443: \]

444: Following \cite{genovese2004}, we use a special case of monotone

445: shrinkage in which

446: \[

447: \lambda_j = \left\{

448: \begin{array}{cc}

449: 1 & \mathrm{for\ } j\le J\\

450: 0 & \mathrm{for\ } j> J

451: \end{array}\right.

452: \]

453: for some integer $J \in [0,n]$.  We will show how to find $J$ shortly.

454: Using the monotone shrinkage scheme described above, the estimate of $f$ becomes

455: \[

456: \hat f(x) =  \sum_{j=0}^J  Z_j \phi_j(x).

457: \]
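
The construction above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names are hypothetical, the grid $x_i$ is taken to be evenly spaced on $[0,1]$, the coefficients are computed for $j = 0, \ldots, n-1$, and the cutoff $J$ is supplied by the caller.

\begin{verbatim}
```python
import math


def phi(j, x):
    """Cosine basis function phi_j evaluated at x in [0, 1]."""
    return 1.0 if j == 0 else math.sqrt(2.0) * math.cos(math.pi * j * x)


def basis_coefficients(y):
    """Return the grid x and empirical coefficients Z_j = (1/n) sum_i Y_i phi_j(x_i)."""
    n = len(y)
    # Evenly spaced grid on [0, 1], mirroring x = (ell - L_min) / (L_max - L_min).
    x = [i / (n - 1) for i in range(n)]
    # Coefficients for j = 0, ..., n-1 (an assumption of this sketch).
    z = [sum(y[i] * phi(j, x[i]) for i in range(n)) / n for j in range(n)]
    return x, z


def truncated_fit(z, J, x):
    """Evaluate f_hat(x) = sum_{j <= J} Z_j phi_j(x), i.e. lambda_j = 1 for j <= J."""
    return sum(z[j] * phi(j, x) for j in range(J + 1))
```
\end{verbatim}

For a constant input the only non-negligible coefficient is $Z_0$, so the truncated fit with $J=0$ already reproduces the data; rougher inputs need a larger $J$.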

458:

459: The squared error loss as a function of $\hat \lambda = (\hat \lambda_0,\hat

460: \lambda_1,  \dots, \hat \lambda_n)$ is

461: \[

462: L_n(\hat \lambda) =

463: \int_0^1 \left(\frac{\hat f(x) -

464:   f(x)}{\sigma(x)}\right)^2 \, dx\approx \sum_{j=1}^n

465: \left(\frac{\mu_j - \hat \mu_j}{\sigma_j}\right)^2,

466: \]

467: where $\sigma^2(x)$ is the variance of the observations at $x$, and $\sigma_j^2$ are the

468: observed variances of the power spectrum (the elements on the diagonal

469: of $\Sigma$).  Meanwhile, the risk is given by

470: \[

471: R(\lambda) = \E \left[\int_0^1 \left(\frac{\hat f(x) -

472:   f(x)}{\sigma(x)}\right)^2 \, dx \right] \approx

473: \frac{J}{n} + \sum_{j=J+1}^n \frac{\mu_j^2}{\sigma_j^2}

474: \]

475:

476: We choose $J$ to minimize Stein's unbiased risk estimate

477: \begin{equation}\label{eqn:stein}

478: \hat R = Z^T \bar D W \bar D Z + \mathrm{trace}(DWDB) -

479: \mathrm{trace}(\bar D W \bar D B)

480: \end{equation}

481: where $D$ and $\bar D = 1 -D$ are diagonal matrices with 1's in the

482: first $J$ and last $n-J$ entries respectively, $B$ is the covariance

483: of $Z$, and $W_{jk} = \sum_{\ell} \Delta_{jk\ell}/\sigma_\ell$ and

484: \begin{eqnarray*}

485: \Delta_{jk\ell} &=& \int_0^1 \phi_j \phi_k \phi_\ell\\

486: &=& \left\{

487: \begin{array}{c c}

488: 1 & \mathrm{if\ \#}\{j,k,l = 0\} = 3\\

489: 0 & \mathrm{if\ \#}\{j,k,l = 0\} = 2\\

490: \delta_{jk}\delta_{0\ell} + \delta_{j\ell}\delta_{0k} +

491: \delta_{k\ell}\delta_{0j} & \mathrm{if\ \#}\{j,k,l = 0\} = 1\\

492: \frac{1}{\sqrt{2}}(\delta_{\ell, j+k} + \delta_{\ell,|j-k|}) & \mathrm{if\ \#}\{j,k,l = 0\} = 0

493: \end{array}

494: \right..

495: \end{eqnarray*}

496: \cite{beran1998} showed that $\hat R(\lambda)$ is asymptotically,

497: uniformly close to $R(\lambda)$ when using monotone shrinkage

498: coefficients and $\sigma(x)=1$.  \cite{genovese2004} extended this

499: result to the heteroskedastic case used here.
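
To illustrate how the cutoff $J$ can be selected, the following Python sketch minimizes a simplified risk estimate. It is not the full estimate of Equation \ref{eqn:stein}: it assumes independent coefficients with a common variance, which formally corresponds to taking $W$ to be the identity and $B$ to be a multiple of the identity. The function name \texttt{choose\_cutoff} and the argument \texttt{noise\_var} are hypothetical.

\begin{verbatim}
```python
def choose_cutoff(z, noise_var):
    """Return (J, risk) minimizing a simplified unbiased risk estimate.

    z         -- empirical coefficients Z_0, ..., Z_m
    noise_var -- assumed common variance of each Z_j (a simplification)
    """
    # Estimated squared bias if every coefficient were dropped.
    tail = sum(zj * zj - noise_var for zj in z)
    best_J, best_risk = None, float("inf")
    kept_var = 0.0
    for J, zj in enumerate(z):
        kept_var += noise_var            # variance paid for keeping Z_J
        tail -= zj * zj - noise_var      # Z_J no longer contributes to the bias
        risk = kept_var + tail
        if risk < best_risk:
            best_J, best_risk = J, risk
    return best_J, best_risk
```
\end{verbatim}

Keeping a coefficient lowers the estimated risk only when $Z_j^2$ exceeds twice the assumed variance, so coefficients that are indistinguishable from noise are truncated.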

500:

501: In Figure \ref{fig:wmapdata1}, we compare our non-parametric

502: fit to the WMAP data to a model-based fit from \cite{spergel2003}.

503: Points in the figure depict the first year WMAP data.

504: Error bars are omitted for clarity.  The full estimated

505: covariance, $\Sigma$, is used in both the \cite{spergel2003} model fit

506: and the \cite{genovese2004} non-parametric fit.

507:

508: \begin{figure}

509: \begin{center}

510: \noindent

511: \plotone{f2.eps}

512: %\includegraphics[scale=1.0]{f2.eps}

513: \end{center}

514: \caption{Comparison of our nonparametric fit of the CMB power-spectrum

515: (solid) with \cite{spergel2003} parametric fit (dashed).  First-year

516:  WMAP data (dots) are shown without errors for clarity.}

517: \label{fig:wmapdata1}

518: \end{figure}

519:

520:

521: \subsubsection{The Confidence Ball} \label{sec:confball}

522: After we perform the non-parametric fit,

523: we need to quantify the uncertainty to make statistical inferences.

524: We use the Beran-D\"umbgen pivot method \citep{beran1998,beran2000} to

525: derive valid confidence intervals.  This method relies

526: on the weak convergence of the ``pivot process'' --- $B_n(\hat \lambda) =

527: \sqrt{n} (L_n(\hat \lambda) - \hat R (\hat \lambda))$ --- to a Normal

528: $(0, \tau^2)$ distribution for some $\tau^2 >0$; a derivation of $\hat

529: \tau_n$ can be found in Appendix \ref{appendix}, taken from Appendix 3 of

530: \cite{genovese2004}.  Using the convergence of the pivot process, we can

531: compute a confidence ellipse for the basis coefficients with a

532: ``radius'' given by:

533: \begin{equation} \label{conf0}

534: \mathcal{D}_n

535:   = \left\{\mu : \sum_{i=1}^n \left(\frac{\hat{\mu}_i - \mu_i}{\sigma_i}\right)^2 \le

536:  \frac{\hat\tau_n \, z_\alpha}{\sqrt{n}} + \hat{R}(\hat\lambda_n)\right\}

537: \end{equation}

538: where the best fit to the data is represented by

539: $\hat{\mu}_i$, the function being tested (whether it is within some

540: confidence ball) is $\mu_i$, and the level of the confidence ball is

541: determined by $z_\alpha$, the upper $\alpha$ quantile of a standard

542: Normal distribution.

543:

544: Therefore, using the central limit theorem, the set

545: \begin{equation} \label{conf}

546: \mathcal{B}_n = \left\{f(x) = \sum_{j=0}^n \mu_j \phi_j(x): \mu \in

547: \mathcal{D}_n \right\}

548: \end{equation}

549: is an asymptotic $1-\alpha$ confidence set for $f$.

550:

551: Thus, to determine if any

552: given vector $s \in \mathcal{P}$ is within our confidence ball, we

553: merely have to run our cosmological model to compute the resulting

554: power spectrum,

555: $\hat f(s)$, and check to see if $\hat f(s) \in \mathcal {B}_n$.  This can

556: be easily done by using Equation \ref{conf0} to check whether the

557: variance-weighted sum of squared differences between $\hat \mu$ and $\mu$ is

558: no greater than the constant on the right-hand side of Equation \ref{conf0}.

559: As shown in Figure \ref{fig:distance_alpha}, as the radius increases,

560: so does the size of the confidence set (and $\alpha$ decreases). Thus,

561: a 95\% (or $\alpha = 0.05$) confidence region has a larger ``radius''

562: than does a 67\% (or $\alpha = 0.33$) confidence region.  Moreover, a

563: $1-\alpha$ confidence ball strictly contains

564: all confidence balls with smaller values of $1-\alpha$.
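
The membership test of Equation \ref{conf0} can be sketched in Python as follows. This is a minimal illustration with hypothetical function names; it assumes that $\hat\tau_n$ and $\hat R(\hat\lambda_n)$ have already been computed and are passed in as \texttt{tau\_hat} and \texttt{risk\_hat}, and that the model spectrum has already been transformed into basis coefficients \texttt{mu}.

\begin{verbatim}
```python
import math
from statistics import NormalDist


def ball_radius(alpha, tau_hat, risk_hat, n):
    """Right-hand side of the confidence-set inequality for level 1 - alpha."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # upper alpha quantile
    return tau_hat * z_alpha / math.sqrt(n) + risk_hat


def distance(mu_hat, mu, sigma):
    """Variance-weighted sum of squared coefficient differences."""
    return sum(((a - b) / s) ** 2 for a, b, s in zip(mu_hat, mu, sigma))


def in_confidence_ball(mu_hat, mu, sigma, alpha, tau_hat, risk_hat):
    """Return True if the coefficients mu are not rejected at level alpha."""
    n = len(mu_hat)
    return distance(mu_hat, mu, sigma) <= ball_radius(alpha, tau_hat, risk_hat, n)
```
\end{verbatim}

Since $z_\alpha$ grows as $\alpha$ shrinks, \texttt{ball\_radius} returns a larger value for $\alpha = 0.05$ than for $\alpha = 0.33$, which is the nesting property described above.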

565:

566: Since the dimensionality of our space is large, it is difficult to

567: visualize the confidence region that surrounds the non-parametric fit.

568: However, we can show examples of functions which live inside (or outside)

569: our confidence region by calculating their distance from the

570: nonparametric fit to the data.

571: In Figure \ref{fig:wmapdata2}, we show a ``ribbon'' plot for

572: $\omega_\mathrm{B}$ around the concordance model.  This figure is generated by

573: setting all of the cosmological parameters to their concordance

574: values and then slowly evolving $\omega_\mathrm{B}$ from $0.012250$ to

575: $0.036750$ to depict the range of temperature spectra allowed due to

576: uncertainty of $\omega_\mathrm{B}$.  The

577: black curves are cosmological models which live within the

578: $95\%$ confidence ball, while gray curves are models that do not.

579: As can be seen in this figure, the shape of the confidence region is

580: not simply a band of constant width surrounding the best fit. It is, in fact,

581: a very complicated, possibly disconnected surface in our high-dimensional

582: parameter space. {\it It is this confidence surface that we wish to map in detail.}

583:

584: \begin{figure}

585: \begin{center}

586: %\includegraphics[scale=1.0]{f3.eps}

587: \plotone{f3.eps}

588: \end{center}

589: \caption{A ``ribbon'' plot depicting the effect of varying

590:   $\omega_\mathrm{B}$ while all other parameters remain fixed (at

591:   concordance values).  Black lines indicate those models which are

592:   contained within a 95\% confidence ball, while gray lines indicate

593:   those models rejected by the hypothesis that the model and the

594:   regressed fit are the same.}

595: \label{fig:wmapdata2}

596: \end{figure}

597:

598: \section{Mapping the Confidence Surfaces} \label{sec:mapping}

599: While theoretically Equation \ref{conf} exactly gives us the $1-\alpha$

600: confidence bound for any functional of the data, it is not trivial to

601: compute what these bounds are.  While it is easy to use Equation

602: \ref{conf} to compute whether or not a given model is within the

603: confidence ball, the method outlined in \S \ref{sec:nonparametric}

604: does not provide a way to easily compute all those spectra that lie

605: within that ball.

606:

607: Concretely, when we test if a CMB power

608: spectrum lies within the ball, we compare the given spectrum with

609: the non-parametric fit found above, by computing a variance weighted

610: sum of squares between the given spectrum and the regressed model.  We

611: call this weighted sum of squares the test spectrum's ``distance''.  If

612: we are given a model which results in a test spectrum whose distance

613: is greater than the radius of our confidence ball, then we can reject

614: the test spectrum (and its associated parameter vector) at the

615: $1-\alpha$ level.

616: If not, then our test does not have the power to distinguish between

617: the regressed model and our test model.  Note that we are taking a

618: $\sim900$ element spectrum and compressing it to a scalar.  Thus, there

619: are many models --- possibly representing vastly different spectra ---

620: that may result in exactly the same distance value. For the hypothesis

621: test that the fitted function and regressed models are derived from the same

622: distribution, we will draw the same conclusion for all models with the

623: same distance values.  Either all models with a particular distance

624: score can be rejected or none can.  For a

625: given confidence ball radius, we could compute (possibly with some discrete

626: approximation) all of the possible CMB power spectra that have

627: distances equal to the confidence radius.  However, we are unaware of

628: an easy way to determine the cosmological parameters of a power

629: spectrum given only the power spectrum itself.  That is, we do not have a

630: method to easily invert CMBFast.
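
Schematically, and using notation introduced here only for illustration (the precise definition is the one given by Equations \ref{conf0} and \ref{conf}), the test can be summarized as follows.  Writing $C_\ell(s)$ for the model spectrum produced by a parameter vector $s$, $\hat C_\ell$ for the non-parametric fit, and $\sigma_\ell^2$ for the variance at multipole $\ell$, the distance has the form
\[
d(s) = \sum_{\ell=\lmin}^{\lmax} \frac{\big(C_\ell(s) - \hat C_\ell\big)^2}{\sigma_\ell^2},
\]
and $s$ is rejected at the $1-\alpha$ level whenever $d(s)$ exceeds the radius of the $1-\alpha$ confidence ball.  Because $d(s)$ is a single number, two very different spectra with equal $d(s)$ are treated identically by the test.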

631:

632: % For instance,

633: % \cite{tegmark2001} to sampled a roughly

634: % $10^{7}$ grid of points and use a linear approximator between them to

635: % determine confidence bounds on the individual cosmological parameters.

636: % This approach benefits from explictly searching the entire space, but

637: % suffers from the fact that it cannot give tight bounds in areas where

638: % the grid is course.  Moreover, computation of a $10^7$ grid is

639: % expensive, leading to \cite{tegmark2001} to use some approximations in CMBFast

640: % (See \S \ref{FIXME}).

641: %

642: % Another approach is the use of  Monte Carlo

643: % Markov Chains (MCMC) to compute the posterior distribution of the

644: % models given the data, under

645: % a given prior \cite{FIXME}.  This is done by sampling the input space

646: % in roughly in

647: % proportion to the expected probability of each location.  After enough

648: % sampling the posterior distribution will converge to the true

649: % distribution, and confidence bands can be found by integrating the

650: % posterior.  This method benefits from its ease of implementation, as

651: % well as the fact that the entire posterior is obtained (not just the

652: % $1-\alpha$ confidence intervals.  However, in practice, there is no way

653: % to show that a MCMC has converged truly converged to the true solution

654: % (not just some local optium), and integration of the final posterior

655: % can be tricky.

656:

657: Of course, one solution would be to grid the parameter space, and

658: run a model for each grid cell.  We could then use these models to

659: approximate the mapping between parameter vectors and confidence

660: level using, for instance, a simple linear approximator.

661: As noted in \S \ref{sec:introduction}, such an approach

662: is far too slow, explaining why \cite{tegmark2001} use both

663: adaptive grids and a modified version of CMBFast.

664: Instead, we suggest an adaptive approach, which allows us to determine

665: confidence intervals of our cosmological parameters more quickly and

666: accurately.

667: In particular, we are able to quickly refine our approximating surface

668: in the areas of interest -- those near the confidence ball's radius --

669: while ignoring the uninteresting regions.  This allows us to obtain

670: estimates of the $1-\alpha$ confidence intervals of our

671: cosmological parameters much more efficiently.

672: % We now show use active learning approaches to improve upon the

673: % previous sampling strategies and show how this will allow us to

674: % map the $1-\alpha$ joint confidence intervals of our cosmological

675: % parameter input space.

676:

677: \subsection{Modeling Known Experiments} \label{model}

678: The combination of CMBFast and the confidence ball method gives us a scoring

679: function  $f:\mathcal{P}\to \R$, which takes an input vector of

680: parameters ($s \in \mathcal{P}$) and returns a distance value. This is

681: accomplished by plugging the cosmological parameter values of $s$ into

682: CMBFast to compute a model power spectrum, and then comparing this

683: model spectrum with our non-parametric fit to the observed power spectrum

684: using Equations \ref{conf0} and \ref{conf}.

685: Given a particular $1-\alpha$ confidence ball radius, $t$,

686: we want to find the set of points, $\bS$ ($\bS \subseteq \bP$), that have

687: distances to the regressed fit of the data less than or equal to the

688: confidence ball radius: $\bS = \{s \in \bP \mid f(s) \le t\}$.

689: Since we cannot easily invert $f$ --- that is to say CMBFast ---

690: we must deduce $\bS$ by carefully sampling the points in $\bP$.

691:

692: For CMBFast, the cost to compute $f(s)$ given $s$ can be significant:

693: computing power spectra away from the concordance model can take

694: 5 to 15 minutes.

695: Thus, care should be taken when choosing the

696: next experiment, as picking optimum points can reduce the run time of

697: the algorithm by orders of magnitude.  It is therefore preferable to

698: analyze current knowledge about the underlying function and select experiments

699: which quickly refine the estimate of the distance function around the

700: confidence ball radius.  There are several methods one could use to

701: create a model of the data, notably some form of parametric

702: regression.  However, we chose to approximate $f(s)$ using

703: Gaussian process regression, as other forms of regression may

704: smooth the data, ignoring subtle features of the function that may

705: become pronounced with more data.  A Gaussian process is

706: a non-parametric form of regression.  Predictions for

707: unobserved points are computed by using a weighted combination of the

708: function values for those points which have already been observed,

709: where a distance-based kernel function is used to determine the

710: relative weights.  These distance-based kernels generally weight nearby points

711: significantly more than distant points.

712: Thus, assuming the underlying function is continuous,

713: Gaussian processes will perfectly describe the function given an

714: infinite set of unique data points.

715:

716:

717: In this work, we use ordinary kriging, a form of Gaussian processes that

718: assumes that the semi-variance, $\mathcal{K}(\cdot, \cdot)$, between

719: two points is a linear function of their distance \citep{cressie1991};

720: for any two points $s_i, s_j \in \bP$,

721: \[

722: \mathcal{K}(s_i, s_j) = \frac{k}{2} \E\left[ \Big(f(s_i) - f(s_j)\Big)^2\right]

723: \]

724: where $k$ is a constant --- known as the kriging

725: parameter --- which is an estimate of the maximum magnitude of the

726: first derivative of the function.  Therefore, the

727: expected semi-variance between two points, $s_i, s_j \in \bP$ is given

728: by

729: \ba

730: \gamma(s_i, s_j) &=& \E\left[\bK(s_i, s_j)\right] = k \bD(s_i, s_j)+c\\

731: &=&k \left[\sum\limits_{\ell=1}^d \alpha_\ell^2(s_{i\ell} - s_{j\ell})^2\right]^{1/2}+c

732: \ea

733: where $\bD(\cdot, \cdot)$ is a distance function defined on the parameter

734: space $\bP$ and $c$ is the observed variance (e.g. experimental noise)

735: when repeatedly sampling the function $f$ at the same location.

736: We have found that using a simple weighted

737: distance function where each dimension is linearly scaled by the

738: parameter $\alpha_\ell$, as depicted in the previous equation,

739: reasonably ensures that parameters are given equal

740: consideration given their disparate values and derivatives.  For our

741: analysis, we adjusted the $\alpha_\ell$'s to ensure that the maximum derivative

742: along each dimension was approximately 1 during the sampling process.

743: Additionally, while the simulations computed by CMBFast are

744: deterministic, we shall see in \S \ref{sec:convergence} that there is

745: some inherent noise in the computations; thus we conservatively set $c

746: = 1 \times 10^{-5}$ in our analysis.
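
As a purely illustrative calculation (the numbers below are hypothetical and are not values from the analysis), take $d = 2$, $\alpha_1 = 1$, $\alpha_2 = 2$, $k = 2$, $s_i = (0, 0)$ and $s_j = (3, 2)$.  Then
\begin{eqnarray*}
\bD(s_i, s_j) &=& \left[1^2 (0-3)^2 + 2^2 (0-2)^2\right]^{1/2} = \left[9 + 16\right]^{1/2} = 5,\\
\gamma(s_i, s_j) &=& k \bD(s_i, s_j) + c = 10 + c.
\end{eqnarray*}
For coincident points, $\bD(s_i, s_i) = 0$ and the expected semi-variance reduces to the noise term, $\gamma(s_i, s_i) = c$.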

747:

748: For the Gaussian process framework, sampled data are assumed to be

749: Normally distributed with means equal to the true function and

750: variance given by the sampling noise.  Moreover, a combination of any

751: subset of these points results in a Normal distribution.  Thus, we can

752: use the observed set of data, $\bA\subset \bP$, to predict the value

753: of $f$ for any $s_q \in \bP$.  The predicted value at this query point, $f(s_q)$, will be Normally

754: distributed, $N(\mu_{s_q}, \sigma^2_{s_q})$, with mean and variance given by

755: \begin{eqnarray}

756: \mu_{s_q} &=& \bar f_\bA + \Sigma_{\bA q}^T \Sigma_{\bA\bA}^{-1} (f_\bA

757: - \bar f_\bA) \label{k_mean}\\

758: \sigma^2_{s_q} &=& \Sigma_{\bA q}^T \Sigma_{\bA\bA}^{-1} \Sigma_{\bA q} \label{k_var}

759: \end{eqnarray}

760: %where

761: %\[

762: %\Sigma_{\bA q} =

763: %\left[

764: %\begin{array}{c}

765: %\gamma(a_1, s_q)\\

766: %\gamma(a_2, s_q)\\

767: %\vdots\\

768: %\gamma(a_{|\bA|}, s_q)\\

769: %\end{array}

770: %\right]

771: %\quad

772: %\Sigma_{\bA\bA} =

773: %\left[

774: %\begin{array}{c c c c}

775: %c & \gamma(a_1, a_2) & \dots & \gamma(a_1, a_{|\bA|})\\

776: %\gamma(a_2, a_1) & c & \dots & \gamma(a_2, a_{|\bA|})\\

777: %\vdots & \vdots &\ddots & \vdots\\

778: %\gamma(a_{|\bA|}, a_1) & \gamma(a_{|\bA|}, a_2) & \dots &

779: %\gamma(a_{|\bA|}, a_{|\bA|})

780: %\end{array}

781: %\right]

782: %\quad

783: %(f_\bA - \bar f_\bA) =

784: %\left[

785: %\begin{array}{c}

786: %f(a_1) - \bar f_\bA\\

787: %f(a_2) - \bar f_\bA\\

788: %\dots \\

789: %f(a_{|\bA}}) - \bar f_\bA

790: %\end{array}

791: %\right]

792: %\]

793: where the elements of the matrix $\Sigma_{\bA\bA}$ and arrays

794: $\Sigma_{\bA q}$ and $f_\bA - \bar f_\bA$ are given by

795: \begin{eqnarray*}

796: \Sigma_{\bA \bA} [i,j] &=& \gamma(a_i, a_j)\\

797: \Sigma_{\bA q} [i] &=& \gamma(a_i, s_q)\\

798: (f_\bA - \bar f_\bA)[i] &=& f(a_i) - \bar f_\bA\\

799: \bar f_\bA &=& \frac{1}{|\bA|} \sum_{i=1}^{|\bA|} f(a_i)

800: \end{eqnarray*}

801: and the $a_i$'s and $a_j$'s are the observed data used to make an

802: inference: $a_i, a_j \in \bA$, $1\le i, j \le |\bA|$.

803:

804: % where $\Sigma_{\bA q}$ denotes the column vector with the $i$th entry

805: % equal to $\gamma(a_i, s_q)$, $\Sigma_{\bA\bA}$ denotes the semivariance

806: % matrix between the elements of $\ba$ (the $ij$ element of

807: % $\Sigma_{\bA\bA}$

808: % is $\mathcal{K}(s_i, s_j)$), $y_A$ denotes the column vector with

809: % the $i$th entry equal to $f(s_i)$, the true value of the function for

810: % each point in $A$, and $\mu_A$ is the mean of the $y_A$'s.

811:

812: As given, for a set of $n$ observed points ($|\bA| = n$), prediction

813: with a Gaussian process requires

814: $O(n^3)$ time, as an $n \times n$ linear system of equations must be solved.

815: However, for many Gaussian processes --- and ordinary kriging in particular

816: --- the correlation between two points decreases as a function of

817: distance.  Thus, the full Gaussian process model can be approximated well by a local

818: Gaussian process, where only the $k$ nearest neighbors of the query point are used

819: to compute the prediction value; this reduces the computation time to

820: $O(k^3+k\log(n))$ per prediction, since $O(k\log(n))$ time is required to find the

821: $k$ nearest neighbors using spatial indexing structures such as balanced

822: kd-trees.
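
To give a rough sense of scale (an illustrative estimate that ignores constant factors and assumes base-2 logarithms for the tree search), consider $n = 10^6$ observed points and $k = 100$ neighbors.  The full model costs on the order of $n^3 = 10^{18}$ operations per prediction, while the local model costs on the order of
\[
k^3 + k \log_2 n \approx 10^6 + 100 \times 20 \approx 10^6
\]
operations, a reduction of roughly twelve orders of magnitude.  The cost of the local model is dominated by the $k^3$ term, so it grows only logarithmically as more models are sampled.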

823:

824: \subsection{Algorithm} \label{sec:algorithm}

825: There are many well-known heuristics for computing where best

826: to perform the next experiment using a regression model, such as

827: that derived in \S \ref{model}.  Sampling strategies include picking the

828: point with the largest variance \citep{mackay1992,guestrin2005},

829: entropy or information gain.

830:

831: Sampling points based solely on variance is common in active learning

832: methods whose goal is to map out an entire function, as this will

833: minimize the expected error for prediction.  Moreover, the

834: model variance predicted by local ordinary kriging

835: is linear in the distance to the nearest neighbors.

836: As such, this strategy chooses points that are far from areas currently

837: searched, and thus will not get stuck in a specific location in

838: parameter space. However, this

839: strategy is known to oversample boundary regions \citep{mackay1992},

840: and ultimately samples the space evenly like a grid.

841: It is likely that large regions of the input space, $\mathcal{P}$,

842: fall well outside the confidence ball radius. In the

843: progression of the algorithm, points in these regions may have large

844: variances but still not be within 2 or more standard deviations of the

845: boundary; these points are very unlikely to be near the confidence

846: ball radius.  Hence, a strategy that samples the entire space

847: evenly, using either a grid or a variance metric, can be extremely

848: inefficient for mapping function boundaries.

849:

850: Information gain heuristics are also popular in the machine learning

851: community.  However, in a continuous parameter space, computing the effect of adding

852: a new point is prohibitively expensive. Specifically, calculating the

853: information gain of a proposed sample requires integrating the

854: difference between the current model and expected result of the

855: proposed sample over all space. Since our function approximator has

856: only local support for predictions, we can reduce this integral down

857: to the local region.  However, on this local region, computing

858: the expected value of the model requires multiple matrix inversions

859: to account for differences in the 100 nearest neighbors over the local

860: region.   Even approximating this integral with a (small) finite sum

861: was found to be prohibitively expensive.

862: Instead, we use a strategy that is a combination of entropy and

863: variance (both easy to compute), and is

864: related to information gain. For more discussion on sampling

865: strategies and their performance, we refer interested readers to

866: \cite{bryan2005}.

867:

868: The method we use here, named ``Straddle'', combines the desire to

869: search the entire input space with that of refining our estimate

870: around known interesting regions.  We do this by picking points that

871: the model predicts are both close to the boundary and have large

872: variances using the following heuristic:

873: \[

874: \mathrm{straddle}(s_q) = 1.96  \sigma_{s_q} - \big|

875: \mu_{s_q} - t \big|.

876: \]

877: Note that the straddle heuristic chooses those points with large

878: variances which straddle the boundary.  In particular, if a point is

879: near the boundary, then $\mu_{s_q} \simeq t$ and

880: this metric is equivalent to a variance-only metric, choosing

881: points that are distant from one another.

882: However, if the point is not on the boundary, then its score drops off

883: proportionally to the distance from the boundary.  The straddle score

884: for a point may be negative, which indicates that we predict that the

885: probability that the point is on a boundary is less than five

886: percent.  Note that the straddle algorithm scores points highest that

887: are both unknown and near the boundary, and thus gives scores that

888: intuitively are similar to that of information gain.
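
A small worked example may help fix ideas; the numbers are hypothetical and chosen only for illustration.  Suppose the confidence ball radius is $t = 10$ and the model returns the following predictions for three candidate points:
\begin{eqnarray*}
\mathrm{A:}\quad \mu = 10.5,\ \sigma = 1.0 &\Rightarrow& \mathrm{straddle} = 1.96 - 0.5 = 1.46,\\
\mathrm{B:}\quad \mu = 15.0,\ \sigma = 2.0 &\Rightarrow& \mathrm{straddle} = 3.92 - 5.0 = -1.08,\\
\mathrm{C:}\quad \mu = 10.0,\ \sigma = 0.1 &\Rightarrow& \mathrm{straddle} = 0.196 - 0 = 0.196.
\end{eqnarray*}
Point B has the largest uncertainty but is predicted to lie far from the boundary, and point C sits on the predicted boundary but is already well determined.  Point A, which is both uncertain and close to the boundary, receives the highest score and would be sampled next.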

889:

890: Our sampling strategy then consists of four steps.  First we model our

891: current knowledge using the Gaussian process described in \S

892: \ref{model}.  We then choose a set of candidate points randomly from the input

893: space and compute their mean and variances using the Gaussian process model.  Next,

894: we score these points using the Straddle heuristic, and

895: select the highest scoring point.  Finally, we run the chosen point

896: through CMBFast and use the result to refine our Gaussian process model.
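
In symbols, with $\mathcal{C}$ denoting the randomly drawn candidate set (notation introduced here for illustration), one iteration of this loop selects
\[
s^\ast = \arg\max_{s_q \in \mathcal{C}} \left( 1.96\, \sigma_{s_q} - \big| \mu_{s_q} - t \big| \right),
\]
evaluates $f(s^\ast)$ with CMBFast, and updates the observed set, $\bA \leftarrow \bA \cup \{s^\ast\}$, before the Gaussian process predictions are recomputed for the next iteration.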

897:

898: Ideally, we  would like to analyze the

899: entire input space, and pick experiments in such a manner that

900: minimizes the number of experiments necessary.  However, as our

901: input space is infinite (the parameters are continuous), we need

902: a heuristic to quickly generate a large, but not unwieldy set of

903: candidate points.

904: \textit{A priori}, we have no information about the function we are trying to

905: model.  Therefore, in order to ensure that all

906: boundary segments of the true function are found (assuming sufficient

907: experimentation), it is necessary that candidate points be chosen such

908: that all infinitesimal hyper-rectangles in the input space have

909: non-zero probabilities of being chosen.

910: We therefore choose candidate points uniformly at random from the

911: input space, as this satisfies the probability constraint and is

912: extremely quick.  We note that bad candidate points will be discarded

913: when their straddle scores are computed, and pose no problem for the

914: algorithm.

915:

916:

917: %% Section 4

918: \section{Results} \label{sec:results}

919: Using the algorithm described in \S \ref{sec:algorithm}, we have

920: sampled just over 1.2 million CMBFast models creating a ``primary''

921: data set.  Additionally, we sampled another 100 thousand models

922: uniformly at random throughout the parameter space.

923: From the randomly sampled data, we find that less than

924: 0.1\% of the parameter space searched is within the $2 \sigma$ confidence ball;

925: that is, our set of acceptable models (those within $2\sigma$) exclude

926: 99.97\% of all possible models defined in Table \ref{paramtable}.

927: However, the

928: method we use to generate parameter vectors results in only 54\% of

929: the points being rejected by the hypothesis that the model and the

930: regressed fit are the same.  Thus, by actively searching through the

931: space, we are able to identify and efficiently map regions of interest, while

932: ignoring large areas of parameter space that result in models below

933: the $2\sigma$ level.  In \S \ref{sec:mcmc} we will see that our method

934: is much more data efficient than typical Bayesian methods.

935:

936: \subsection{Confidence Interval Projections} \label{sec:intervals}

937:

938: \begin{figure*}[!th]

939:  \begin{center}

940:  \plotone{f4.eps}

941:  \end{center}

942:  \caption{Jointly valid confidence intervals for our cosmological

943:    parameters for four values of $1-\alpha$, corresponding to

944:    $\frac{1}{2} \sigma, \sigma, 1 \frac{1}{2} \sigma$ and $2\sigma$

945:    confidence levels, respectively.

946:    Areas of solid color indicate values for the given parameter

947:    that contain the true value of cosmological parameter with

948:    probability $1-\alpha$, regardless of the values of the remaining 6

949:    parameters.

950: See the electronic edition of the Journal for a color version of this figure.}

951: \label{fig:results1d}

952: \end{figure*}

953:

954: \begin{figure*}[p]

955: \begin{center}

956: \plotone{f5.eps}

957: \end{center}

958: \caption{Jointly valid confidence regions for pairs of cosmological

959: parameters, where the colors cyan, magenta, blue and red correspond to

960: $\frac{1}{2} \sigma, \sigma, 1 \frac{1}{2} \sigma$ and $2\sigma$

961: confidence levels, respectively.

962: Areas of solid color indicate values for the given

963: pair of fixed (plotted) parameters that contain the true values of

964: the cosmological parameters with probability $1-\alpha$, regardless of the

965: values of the remaining 5  parameters.

966: Note there are two disjoint regions in parameter space

967:    which are above the $2\sigma$ confidence interval.

968: See the electronic edition of the Journal for a color version of this figure.}

969: \label{fig:results2d}

970: \end{figure*}

971:

972: \begin{figure*}[t]

973: \begin{center}

974: \plotone{f6.eps}

975: \end{center}

976: \caption{Jointly valid confidence intervals for our cosmological

977: parameters, where we assume that the value of $H_0$ is between 60

978: and $75 \mpc$.  Areas of solid color

979: indicate values for the given parameter that contain the true value of

980: the cosmological parameter with probability $1-\alpha$, regardless of the

981: values of the remaining 6 parameters.  See the electronic edition of

982: the Journal for a color version of this figure.}

983: \label{fig:results1d:h0}

984: \end{figure*}

985:

986: \begin{figure*}[p]

987: \begin{center}

988: \plotone{f7.eps}

989: \end{center}

990: \caption{Jointly valid confidence regions for pairs of cosmological

991: parameters, where we assume that the value of $H_0$ is between

992: 60 and $75 \mpc$. The colors

993: cyan, magenta, blue and red correspond to

994: $\frac{1}{2} \sigma, \sigma, 1 \frac{1}{2} \sigma$ and $2\sigma$

995: confidence levels, respectively.

996: Areas of solid color indicate values for the given

997: pair of fixed (plotted) parameters that contain the true values of

998: the cosmological parameters with probability $1-\alpha$,

999: regardless of the values of the remaining 5  parameters.

1000: Note that the constraint on $H_0$ eliminates the secondary confidence

1001: region found in Figure \ref{fig:results2d}.

1002: See the electronic edition of the Journal for a color version of this figure.}

1003: \label{fig:results2d:h0}

1004: \end{figure*}

1005:

1006: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1007:

1008: The result of running the 1.2 million models contained in the primary

1009: data set is a set of

1010: disjoint, seven-dimensional ``confidence regions'' in parameter space

1011: which contain all models that fall within our $1-\alpha$ confidence

1012: ball.  In each of these regions, the confidence interval for a

1013: particular parameter is given by the range of values that parameter

1014: takes in that region. Thus, the confidence interval for a particular

1015: parameter will be a function of which sets of regions we consider.

1016:

1017: If we put no restrictions on the values of the other 6 parameters,

1018: then the confidence interval of a parameter will be the union of

1019: the confidence intervals for that parameter for all confidence

1020: regions. We plot these unrestricted confidence intervals in Figure

1021: \ref{fig:results1d} for four values of $1-\alpha$.

1022: Intuitively, Figure \ref{fig:results1d} can be interpreted as stating

1023: that for any value of a parameter that lies within the depicted

1024: $1-\alpha$ confidence interval, there exists at least one

1025: combination of the remaining six parameters such that the resulting

1026: parameter vector lies within one of the $1-\alpha$ confidence regions.
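
Stated more formally, in notation introduced here for illustration: let $\bS$ denote the set of parameter vectors within the $1-\alpha$ confidence ball, as in \S \ref{model}, and let $s_j$ be the $j$\ith component of $s$.  The unrestricted interval for parameter $j$ is then the projection
\[
I_j = \{ x \in \R \mid \exists\, s \in \bS \ \mathrm{with}\ s_j = x \}.
\]
Since $\bS$ is explored only through a finite set of sampled models, the plotted intervals should be read as an approximation to $I_j$.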

1027:

1028: In Figure \ref{fig:results2d} we depict results of interactions between pairs

1029: of parameters on the computed confidence regions.  As with the 1D

1030: projections in Figure \ref{fig:results1d}, points in Figure

1031: \ref{fig:results2d} which are denoted to be within the $1-\alpha$

1032: confidence ball, are points where given the particular values of the

1033: two fixed cosmological parameters --- those being explicitly plotted

1034: on the $x$ and $y$ axes --- there exist some values for the other 5

1035: parameters such that the resulting parameter vector is within the

1036: $1-\alpha$ confidence region.   While some plots show that most

1037: combinations of the fixed parameters are within the 95\% confidence

1038: ball providing minimal constraints on parameters describing the

1039: Universe, others, such as $\omega_\mathrm{DM}$ versus

1040: $\omega_\mathrm{B}$ (4\ith row, 4\ith column), show strong

1041: constraints.

1042:

1043: Areas in Figure \ref{fig:results2d} which are blank (white) are areas that are

1044: rejected at the 95\% confidence level; for these combinations of fixed

1045: parameters, there exists no combination of the other five parameters,

1046: such that the resulting vector is within any of our confidence regions.

1047: In particular, the plot of $\Omega_\mathrm{DE}$ versus

1048: $\Omega_\mathrm{M}$ (2\ind row, 3\ird column) illustrates that

1049: $\Omega_\mathrm{Total} \gtrsim 0.9$, while the plot of $\omega_\mathrm{DM}$

1050: versus $\omega_\mathrm{B}$  shows that there are at least two disjoint

1051: confidence regions in our seven-dimensional space.  These disjoint

1052: regions in Figure \ref{fig:results2d} correspond directly to the split

1053: confidence intervals observed in Figure \ref{fig:results1d}.

1054:

1055: The disjoint regions observed in Figure \ref{fig:results2d}, such as

1056: the plot of $\omega_\mathrm{DM}$ vs. $\omega_\mathrm{B}$, indicate

1057: that there are at least two disjoint confidence regions in the

1058: parameter space.

1059: These disjoint regions can also be seen in the 1D projections of

1060: $\omega_\mathrm{DM}$, $\omega_\mathrm{B}$, and $H_0$ shown in Figure \ref{fig:results1d}.

1061: We defer further discussion of the disjoint confidence regions

1062: to \S \ref{sec:connectivity}. Smaller splits in the confidence

1063: intervals observed in nearly

1064: every plot in Figure \ref{fig:results1d} are a result of the fact that

1065: CMBFast does not return models which are perfectly continuous in the

1066: parameter space.  While one may expect the derived confidence level to

1067: be smooth in parameter space, this is not the case.

1068: We observe small discretizations and

1069: inconsistencies in the power spectrum model, which result in the

1070: confidence ball having a jagged, nebulous surface (as observed in

1071: Figure \ref{fig:results2d}), rather than a perfectly smooth one.  We will

1072: elaborate on this observation in \S \ref{sec:convergence}.

1073:

1074: As illustrated in Figure \ref{fig:results1d}, the confidence intervals

1075: for most parameters are not well constrained by the WMAP data  alone.

1076: In particular, the constraint

1077: on the Hubble constant, $H_0$, is so weak as to allow values between

1078: 15 and 300 at the two sigma level; even at the one sigma level, $H_0$

1079: ranges between $15$ and $150$ with additional fits at $H_0 \sim 250$.

1080: The confidence intervals derived here cover the Bayesian credible

1081: intervals found in the literature using a

1082: variety of techniques \citep[e.g.][]{tegmark2001, spergel2003,

1083:   spergel2006}, as shown in Table \ref{tab:compare}.

1084: While the results in Table \ref{tab:compare} are

1085: approximately centered on the same values,

1086: we are not in any way attempting to argue that the allowed parameter

1087: ranges are better, or worse, than those derived from alternative methods,

1088: as the comparison of credible (Bayesian) vs. valid (frequentist)

1089: parameter ranges is non-trivial and outside the scope of this work.

1090: A discussion of the difference between the Bayesian and frequentist

1091: interpretations is given in \S \ref{sec:bayesvsfreq}.

1092:

1093: \begin{table*}

1094: \begin{center}

1095: {\footnotesize

1096: \begin{tabular}{c r@{ - }l r@{ - }l r@{ - }l | r@{ - }l r@{ - }l}

1097: \hline

1098: &

1099: \multicolumn{2}{c}{No} &

1100: \multicolumn{2}{c}{} &

1101: \multicolumn{2}{c|}{$n_s < 1$} &

1102: \multicolumn{2}{c}{Spergel} &

1103: \multicolumn{2}{c}{Spergel}\\

1104: %

1105: Parameter &

1106: \multicolumn{2}{c}{Constraints} &

1107: \multicolumn{2}{c}{$ 60 \le H_0 \le 75$} &

1108: \multicolumn{2}{c|}{$ 60 \le H_0 \le 75$} &

1109: \multicolumn{2}{c}{et al. (2003)} &

1110: \multicolumn{2}{c}{et al. (2006)}\\

1111: \hline

1112: \hline

1113: $\tau$

1114: & 0 & 1.2 % none

1115: & \multicolumn{2}{c}{0 - 0.94, 1.17 - 1.2} % h0

1116: & 0 & 0.4 % h0,ns

1117: & 0.095  & 0.242 % spergel 2003

1118: & 0.058 & 0.117 % spergel 2006

1119: \\

1120: $\Omega_\mathrm{DE}$

1121: & 0 & 0.94   %none

1122: & 0 & 0.94   %h0

1123: & 0.39 & 0.9 %h0,ns

1124: & \multicolumn{2}{c}{} % OmegaDE  Spergel 2003

1125: & \multicolumn{2}{c}{} % OmegaDE  Spergel 2006

1126: \\

1127: $\Omega_\mathrm{M}$

1128: & 0 & 1.0   %none

1129: & 0.13 & 0.95 % h0

1130: & 0.13 & 0.59 % h0,ns

1131: & 0.22  & 0.36  % OmegaM   Spergel 2003

1132: & 0.199 & 0.273 % OmegaM   Spergel 2006

1133: \\

1134: $\omega_{\mathrm{DM}}$

1135: & \multicolumn{2}{c}{0 - 0.36, 0.62 - 0.70} % none

1136: & 0.0 & 0.36 % h0

1137: & 0.03 & 0.2 % h0,ns

1138: & \multicolumn{2}{c}{} % omegaDM  Spergel 2003

1139: & \multicolumn{2}{c}{} % omegaDM  Spergel 2006

1140: \\

1141: $100\omega_{\mathrm{B}}$

1142: & \multicolumn{2}{c}{0.5 - 6.2, 11.5 - 12.7} % none

1143: & 1.3   & 5.5 %  h0

1144: & 1.3   & 3.2 % h0, ns

1145: & 2.26  & 2.51  % omegaB   Spergel 2003

1146: & 2.15  & 2.31  % omegaB   Spergel 2006

1147: \\

1148: $f_\nu$

1149: & 0 & 1 % none

1150: & 0 & 1 % h0

1151: & 0 & 1 % h0,ns

1152: & \multicolumn{2}{c}{} % f_nu    Spergel2003

1153: & \multicolumn{2}{c}{} % f_nu   Spergel 2006

1154: \\

1155: $n_s$

1156: & 0.73 & 1.7 % none

1157: & 0.8  & 1.7 % h0

1158: & 0.84 & \textit{1.0} % h0

1159: & 0.95  & 1.03   % n_s     Spergel2003

1160: & 0.944 & 0.978  % n_s    Spergel 2006

1161: \\

1162: $\sigma_8$

1163: & \multicolumn{2}{c}{} % no constraints

1164: & \multicolumn{2}{c}{} % h0

1165: & \multicolumn{2}{c|}{} % h0,ns

1166: & 0.82  & 1.02   % sigma8 Spergel2003

1167: & 0.71  & 0.81   % sigma8 Spergel 2006

1168: \\

1169: $H_0$

1170: & \multicolumn{2}{c}{17 - 135, 243 - 272} % no constraints

1171: & \textit{60} & \textit{75} % h0

1172: & \textit{60} & \textit{75} % h0

1173: & 67    & 77     % H_0     Spergel2003

1174: & 70.3  & 76.7   % H_0    Spergel 2006

1175: \\

1176: \hline

1177: \end{tabular}}

1178: \end{center}

1179: \caption{Derived 68\% confidence intervals.  Those to the left of the solid

1180: line are derived from Figures \ref{fig:results1d},

1181: \ref{fig:results1d:h0} and \ref{fig:results1d:nsh0} respectively,

1182: while those to the right are quoted from referenced literature.}

1183: \label{tab:compare}

1184: \end{table*}

1185:

1186:

1187: While this assessment may appear bleak, there is

1188: underlying structure to the confidence regions, hinted at by the

1189: disjoint regions in Figure \ref{fig:results2d}.   Suppose we restrict

1190: the range of a subset of our parameters and then compute the

1191: confidence intervals for the remaining parameters.

1192: Since our statistical model is independent of the ranges searched, we can

1193: compute these conditional confidence intervals without re-running any

1194: models.  For any restriction of our parameter space,

1195: the confidence interval for a parameter of interest will

1196: be the union of the confidence intervals for that parameter over those

1197: confidence regions which obey our restriction.  For example, in Figures

1198: \ref{fig:results1d:h0} and \ref{fig:results2d:h0} we show the effect

1199: on the confidence intervals and regions, respectively,

1200: of imposing the restriction that $H_0$  is between

1201: $60$ and $75\mpc$.   Note that with this

1202: restriction on $H_0$, the confidence intervals agree much better with the

1203: current estimate of the cosmological matter/energy budget and strongly

1204: suggest that $\Omega_\mathrm{Total} = 1$.
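As a rough illustration of the restriction step described above, the following Python sketch filters a set of already-evaluated models and reads off the intervals for one parameter. The model values, the helper name \texttt{conditional\_intervals}, and the gap tolerance used to merge neighbouring values into intervals are all hypothetical choices made for this example; it is not the code used to produce the figures.

```python
def conditional_intervals(models, param, restrictions, gap):
    """Union of intervals for `param` over in-ball models obeying `restrictions`.

    models       -- list of dicts holding parameter values and an 'in_ball' flag
    restrictions -- dict mapping a parameter name to an inclusive (low, high) range
    gap          -- values closer than this are merged into the same interval
    """
    kept = sorted(
        m[param]
        for m in models
        if m["in_ball"]
        and all(lo <= m[name] <= hi for name, (lo, hi) in restrictions.items())
    )
    intervals = []
    for value in kept:
        if intervals and value - intervals[-1][1] <= gap:
            intervals[-1][1] = value          # extend the current interval
        else:
            intervals.append([value, value])  # start a new disjoint interval
    return [tuple(iv) for iv in intervals]


# Hypothetical, already-computed models: no simulator call is made here.
models = [
    {"H0": 65.0, "omega_DM": 0.11, "in_ball": True},
    {"H0": 72.0, "omega_DM": 0.12, "in_ball": True},
    {"H0": 110.0, "omega_DM": 0.45, "in_ball": True},
    {"H0": 70.0, "omega_DM": 0.30, "in_ball": False},
]

unrestricted = conditional_intervals(models, "omega_DM", {}, gap=0.05)
restricted = conditional_intervals(models, "omega_DM", {"H0": (60.0, 75.0)}, gap=0.05)
```

In this toy data the unrestricted result contains two disjoint intervals, while restricting $H_0$ leaves a single one; only the stored models are consulted in either case.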

1205:

1206: This analysis exhibits the power of our statistical

1207: inference technique: we can test constraints on one parameter,

1208: and see their effects on the remaining parameters without additional

1209: CMBFast computation or invalidation of statistical inferences.  To

1210: this end, we have created a graphical  interface that can be used to

1211: apply constraints and view the resulting effects in real time; this

1212: tool, along with the necessary data files, can be downloaded from

1213: \url{http://gs3636.sp.cs.cmu.edu/visualizer/}.

1214:

1215: In the Bayesian view, the tightening of the allowable regions between

1216: Figures \ref{fig:results1d} and  \ref{fig:results1d:h0}

1217: and Figures \ref{fig:results2d} and \ref{fig:results2d:h0}

1218: is analogous to what would occur when priors

1219: (either informative or non-informative) are applied. Such

1220: priors are  universally applied in CMB cosmological analyses.

1221: As an example of how we can use this technique to better

1222: understand the cosmological confidence surface, we focus

1223: in on one or two parameters and utilize the graphical interface

1224: described above.

1225:

1226: The WMAP Three Year data show that a scale-invariant spectrum

1227: ($n_s = 1$) is not a good fit to those data alone.

1228: If we place both the constraint that $n_s < 1$ and that

1229: $ 60 \mpc \le H_0 \le 75\mpc$ on the WMAP One Year data, we see

1230: in Figure \ref{fig:results1d:nsh0} that $\tau, \omega_\mathrm{B}$, and

1231: $\omega_\mathrm{DM}$ are much better constrained. More importantly, we

1232: see that the allowable ranges on $\omega_\mathrm{DM}$ are forced into a single

1233: confidence range, in agreement with previous studies \citep{spergel2003}.

1234:

1235: Exploring the high $\omega_\mathrm{DM}$ space shown in Figure

1236: \ref{fig:results1d}, we find that models consistent with

1237: high $\omega_\mathrm{DM}$ have large values of $\omega_\mathrm{B}$ ($> 0.05$),

1238: as well as large Hubble constants ($>100\mpc$). Both of these parameters are

1239: much better constrained in the WMAP Three Year data. This

1240: leads us to predict that the second confidence surface peak in

1241: the WMAP Three Year data is less significant than in the

1242: WMAP One Year data (although this has yet to be shown).

1243:

1244: % Add this comment?

1245: %

1246: % Thus, in Figure \ref{fig:results1}, it should not be surprising to see

1247: % such a large range on the allowable cosmological parameters. In fact,

1248: % the only parameters that are reasonably constrained (at the 1$\sigma$

1249: % level) are the Hubble constant, $\Omega_{total}$,

1250: % $\Omega_{matter}$,and $\Omega_{baryon}$. And of course none of these

1251: % are constrained at the level given in Spergel et al. (2003) for WMAP

1252: % (see Table xx).

1253:

1254: \begin{figure*}[t]

1255: \begin{center}

1256: \plotone{f8.eps}

1257: \end{center}

1258: \caption{Jointly valid confidence intervals for our cosmological

1259: parameters, where we assume that $60

1260: \mpc \le H_0 \le 75 \mpc$ and $n_s < 1$.  Areas of solid

1261: color indicate values for the given parameter that contain the true

1262: value of cosmological parameter with probability  $1-\alpha$,

1263: regardless of the values of the remaining 6 parameters.

1264: See the electronic edition of the Journal for a color

1265:    version of this figure.}

1266: \label{fig:results1d:nsh0}

1267: \end{figure*}

1268:

1269:

1270: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1271: % Intentionally blank.

1272: %

1273: %

1274: %

1275: %

1276: %

1277: %

1278:

1279: \subsection{Convergence} \label{sec:convergence}

1280: Ideally, one would like to prove that our mapping from confidence

1281: ball radius to parameter space has converged.  This could be done, for

1282: instance, by proving that our approximating model of spectrum distance

1283: as a function of cosmological parameters -- that is our Gaussian

1284: process -- has converged to the true values in those areas where the

1285: true values are near the radius of the $1-\alpha$ confidence ball.

1286: However, this effort has been confounded by a lack of continuity in

1287: the results returned by CMBFast.  The method presented in this paper

1288: is not more susceptible to discontinuities than other techniques.

1289: Indeed, the convergence of most, if not all, inference methods will be

1290: adversely affected by the discontinuities of CMBFast models we observe

1291: in parameter space.

1292:

1293: %\subsubsection{Smoothness Assumptions of CMBFast Models} \label{sec:smoothness}

1294: \begin{figure}

1295: \begin{center}

1296: \noindent

1297: \plotone{f9.eps}

1298: \end{center}

1299: \caption{A plot of spectra distance as a function of $\tau$, with

1300:  all other parameters fixed, showing the discretization of CMBFast.

1301: For these experiments

1302: $\vec x = \{\tau, \Omega_\mathrm{DE}, \Omega_\mathrm{M},

1303:  \omega_\mathrm{DM}, \omega_\mathrm{B}, f_\nu, n_s\}$ $=

1304: \{\tau, 0.0, 0.2, 0.8, 0.003, 0.0, 1.2\}$.}

1305: \label{fig:smooth}

1306: \end{figure}

1307:

1308: One standard assumption of function approximators is that of

1309: smoothness; that is, that the underlying function to be modeled is

1310: continuous and differentiable.  For Gaussian processes, this

1311: assumption motivates the usage of a covariance matrix in determining

1312: the relative weights of known samples when estimating values for unknown points.  In

1313: this paper, we have also assumed that the covariance function is fixed over

1314: the entire space -- that is, that the underlying covariance is

1315: isotropic and homogeneous.  These assumptions allow us to compute

1316: error bounds for each point in space, and enable us to determine when

1317: the model has converged to the underlying function.
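For reference, the standard Gaussian process (kriging) predictor makes the role of the covariance explicit. The notation below is introduced here only as a sketch: $y$ is the vector of spectrum distances at the $n$ sampled parameter vectors $x_1, \ldots, x_n$, $K_{ij} = k(\|x_i - x_j\|)$ is the covariance matrix under a stationary, isotropic covariance function $k$, and $k_*$ is the vector with entries $k(\|x_* - x_i\|)$ for a new point $x_*$; a zero prior mean and noise-free observations are assumed.
\ba
\hat f(x_*) & = & k_*^T K^{-1} y, \\
\Var[f(x_*)] & = & k(0) - k_*^T K^{-1} k_* .
\ea
The weights $K^{-1} k_*$ placed on the known samples depend only on the distances between points, and the predictive variance supplies the pointwise error bound.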

1318:

1319: \begin{figure*}

1320: \begin{center}

1321: \noindent

1322: \plotone{f10.eps}

1323: \end{center}

1324: \caption{A plot of spectra distance as a function of

1325: $\Omega_\mathrm{DE}$, with all other parameters fixed.  The square

1326: boxes in each of the left two plots denote the area enlarged in the

1327: neighboring plot to the right. Note that while on global scales

1328: (A) the mapping appears to be smooth, closer inspection (B), (C)

1329: reveals numerical errors resulting from approximations used in CMBFast.}

1330: \label{fig:smooth2}

1331: \end{figure*}

1332:

1333: However, experimentation shows that the underlying CMBFast function

1334: does not fulfill the continuous and differentiable assumptions, as

1335: shown in Figures \ref{fig:smooth} and \ref{fig:smooth2}.

1336: Both figures were produced by plotting the resulting model distance

1337: as we varied one parameter and kept the other six parameters fixed.  Figure

1338: \ref{fig:smooth} shows a discretization effect that we believe is a

1339: result of integral approximations done by CMBFast.  Discretization

1340: effects are common in simulated environments and it is reasonable to

1341: assume that the true function varies smoothly.  More startling are the

1342: discontinuities revealed in Figure \ref{fig:smooth2}.  Figure \ref{fig:smooth2} shows

1343: that while on a broad scale the CMBFast function appears smooth, when

1344: one looks closer and closer, the function begins to act quite

1345: erratically.  Of particular interest are the large discontinuity at

1346: $\Omega_\mathrm{DE} = 0.446516$ and the seemingly random deviations

1347: from a smooth function throughout the entire range.  These fluctuations

1348: in distance are not caused by random noise from CMBFast;

1349: CMBFast's output is deterministic given an input parameter

1350: vector.

1351:

1352: There are two important implications of the results in Figures

1353: \ref{fig:smooth} and \ref{fig:smooth2}.  First, we note that

1354: when parameter values result in spectra that are very close to the

1355: confidence ball radius, it is impossible to predict which side of

1356: the boundary a given point will be on, due to the numerical noise in

1357: CMBFast.  For regions where many points are near the confidence ball

1358: radius, we will obtain spotty, jagged boundaries between those areas

1359: in the ball and those not.

1360: Secondly, the effects plotted in

1361: Figures \ref{fig:smooth} and \ref{fig:smooth2} do not appear on the

1362: same range scales. This makes it more difficult to determine the

1363: correct level of smoothing, and hence discover the true underlying

1364: function.  Thus, while it is still possible to deduce approximate

1365: covariances among the variables, it becomes impossible to ensure the

1366: model has correctly converged to the true model.

1367:

1368: We note that this lack of continuity will adversely affect the

1369: convergence of any model that relies on the smoothness of the

1370: underlying function, be it MCMC or Gaussian processes.

1371: %In the case of

1372: %MCMC the lack of smoothness requires more extensive sampling of the

1373: %posterior to ensure the integral is correctly computed.

1374: In the case of MCMC,

1375: the discontinuities in the variance weighted sum of squares between

1376: the models computed by CMBFast and the data require that comprehensive

1377: sampling of the posterior be performed to ensure that the peaks and

1378: valleys in any local region are correctly averaged out, ensuring that

1379: the integral over the posterior is correctly computed.  While we

1380: can run both methods in a mode that smooths over these

1381: discontinuities (by effectively ignoring them), we must realize that

1382: the resulting algorithms will converge to a solution that is

1383: incorrect.  Additionally, increasing the sampling of either

1384: algorithm would eventually turn up the existence of these

1385: discontinuities, and the system would jump from an apparent

1386: convergence in the smoothed case, to a new convergence where

1387: discontinuities are considered.  We elaborate on this idea further in

1388: \S \ref{sec:mcmc}.

1389:

1390: % Intentionally blank.

1391: %

1392: %

1393: %

1394: %

1395: %

1396: %

1397:

1398: % \subsubsection{Sampling}

1399: % While we cannot prove convergence, we suggest that we may be near

1400: % convergence, noting that the addition of subsequent points does not

1401: % affect either the model prediction (in terms of spectrum classification

1402: % accuracy) or the visual appearance of the data.

1403: % In Figure \ref{fig:diff}

1404: % we show the visual difference between 1.1 million and 1.2 million cmbfast experiments

1405: % for the plot of \fixme and \fixme, while in Table \ref{tab:diff} we show

1406: % the accuracy of classifying parameter vectors as to whether or not

1407: % they result in spectra with distances less than the $1-\alpha$ ball radius for several

1408: % confidence levels.  Both Figure \ref{fig:diff} and Table \ref{tab:diff} show

1409: % little difference with the addition of ~10\% more data, suggesting

1410: % that convergence has been obtained.

1411: %

1412: % \begin{figure*}

1413: % \includegraphics{figures/results2d_time/results2d_time.epsi}

1414: % \caption{Plots show ?? versus ?? with 1.1 million points (A) and with

1415: %   1.2 million points. Note that even with the addition of ~10\% more

1416: %   data point, the figure remains relatively unchanged.}

1417: % \label{fig:diff}

1418: % \end{figure*}

1419: %

1420: % \begin{table}

1421: % \begin{center}

1422: % \begin{tabular}{c c c}

1423: % \hline

1424: % & \multicolumn{2}{c}{Classification Accuracy}\\

1425: % $1-\alpha$ & 1.1 million points & 1.2 million points \\

1426: % \hline

1427: % \hline

1428: % 0.95\\

1429: % 0.78\\

1430: % 0.65\\

1431: % 0.45\\

1432: % \hline

1433: % \end{tabular}

1434: % \end{center}

1435: % \caption{Classification accuracy of point as to whether or not they

1436: %   had distances less than the radius of a  $1-\alpha$ confidence ball

1437: %   for various $1-\alpha$ levels, with both 1.1 million points and

1438: %   1.2 million points. Note that even with the addition of ~10\% more

1439: %   data point, the classification accuracies remain relatively unchanged.}

1440: % \label{tab:diff}

1441: % \end{table}

1442: %

1443: %

1444: % However, we caution that both of these results may be misleading.

1445: % Note that since the plot in Figure \ref{fig:diff} is a projection through

1446: % 5 dimensions, it is

1447: % possible to get similar results for much smaller number of

1448: % experiments, because poorly defined areas can be ``hidden'' behind

1449: % other prominent features.

1450: % % For instance, consider a ball of radius $r$

1451: % % centered at the origin of an $x$, $y$, $z$ axes.  In order for the ball

1452: % % to appear well defined when projecting this ball down to

1453: % % the $x,y$ axes, we need only sample around the edge of the ball

1454: % % $x^2+y^2 = r^2$.  Even though areas where $x^2 + y^2 < r^2$ may be

1455: % % poorly constrained, it may not be apparent from the projection.

1456: % Additionally, the results of Table \ref{tab:diff} are bouyed by the fact

1457: % that only ~0.1\% of the points at random fall within a 95\% confidence

1458: % ball radius.  Thus, a straw-man classification approach that merely

1459: % picked the most common classification would obtain a 99\%

1460: % classification accuracy.

1461: %

1462: % While these arguments suggest that Figure \ref{fig:diff} and Table

1463: % \ref{tab:diff} cannot prove that we have reached convergence, we note

1464: % that they do show that we have reached an approximate convergence.

1465: % That is, while we cannot exactly state where the spectra distance surface

1466: % equals the $1-\alpha$ confidence ball radius, since the deviation

1467: % between both Figure \ref{fig:diff} and Table \ref{tab:diff} are tiny,

1468: % we can reasonably predict the location of this boundary.

1469: % Additionally, new peaks in the distance surface that arise above the

1470: % $1-\alpha$ confidence ball radius (that is new peaks where the points

1471: % are within the confidence ball) are extremely unlikely, as detection

1472: % of such a peak in new sampling would have spurred our algorithm to

1473: % vigorously search that area, resulting in effects visible in both

1474: % visual and classification results.

1475:

1476: \subsection{Connectivity} \label{sec:connectivity}

1477: As Figure \ref{fig:results2d} shows, there are two main peaks that lie above

1478: the $1\sigma$ confidence ball radius.  As a test of the

1479: function approximator's convergence, we conducted focused tests to see if these

1480: peaks were truly connected.  In particular, we used the semi-variance

1481: matrix of the Gaussian process to compute the maximal influence

1482: distance from a given point one could travel before possibly

1483: encountering the $1-\alpha$ confidence ball radius.  We then created

1484: clusters of points above the 68\% confidence ball radius using a

1485: friends-of-friends algorithm; that is, a point is added to an existing

1486: group if it is within the maximal influence distance of any point

1487: currently in the group.  Starting with all points in their own groups,

1488: we first passed through the data, merging groups where possible.

1489: Then, additional points were sampled between existing groups, using an

1490: A$^*$-like algorithm \citep{hart1968}.  For two groups $A$ and $B$, we

1491: found the point, $x$, in $A$ that was closest to any point in $B$.  We

1492: then created a set of candidate points within the influence distance

1493: of $x$, and added them to a queue, $\bQ$, sorted according to their

1494: distances to $B$.  We then took the point $p$ from $\bQ$ that was closest

1495: to $B$, ran it through CMBFast, and compared the result to our confidence ball.  If

1496: $p$ was within our confidence radius, then we created

1497: candidate points for $p$ (just as we did for $x$) and added them to

1498: $\bQ$.  Otherwise, we removed $p$ from $\bQ$.

1499: This procedure was repeated until either $B$ was within the influence

1500: distance of $p$ or we exhausted $\bQ$.
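The friends-of-friends grouping step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used here: the linking length \texttt{link\_dist} stands in for the maximal influence distance, Euclidean distance is assumed, the toy points are invented, and the quadratic pairwise loop would need a spatial index for a data set of realistic size.

```python
import math


def friends_of_friends(points, link_dist):
    """Group points so that any two points within link_dist share a group."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Start with every point in its own group and merge friends pairwise.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= link_dist:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())


# Toy 2-D points: two chains separated by more than the linking length.
pts = [(0.0, 0.0), (0.4, 0.0), (0.8, 0.0), (5.0, 5.0), (5.3, 5.0)]
clusters = friends_of_friends(pts, link_dist=0.5)
```

The first and third points are farther apart than the linking length, yet they end up in the same group because each is a friend of the second point.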

1501:

1502: The primary data set contained roughly 2000 distinct groups, which were

1503: quickly merged using the friends-of-friends algorithm.  This left us with

1504: the two major clusters shown in Figure \ref{fig:results2d}.

1505: Using the algorithm noted above,

1506: we were unable to find connections between the main peak and the

1507: secondary peak, even after multiple attempts starting from

1508: different locations.  We believe that there exists no

1509: smooth transition of variable parameters that leads from the

1510: concordance to the secondary peak.  The second peak is not just an

1511: extension of the concordance peak that appears disjoint due to

1512: undersampling or projection effects.
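The bridging search used in this test can be sketched as a greedy best-first search over a priority queue. The Python below is a hedged illustration only: \texttt{in\_ball} is a stand-in for running the simulator and comparing against the confidence ball, \texttt{step} plays the role of the influence distance, candidates are generated on an axis-aligned lattice for simplicity, and the two toy regions are invented.

```python
import heapq
import math


def bridge(start, target_group, in_ball, step, max_evals=10000):
    """Best-first search for a chain of in-ball points from start to target_group.

    Returns True if some accepted point comes within `step` of the target group,
    and False if the queue is exhausted (or max_evals is reached) first.
    """
    def dist_to_target(p):
        return min(math.dist(p, q) for q in target_group)

    def candidates(p):
        # Axis-aligned neighbours at the influence distance (a simplification).
        for axis in range(len(p)):
            for sign in (-1.0, 1.0):
                q = list(p)
                q[axis] += sign * step
                yield tuple(round(v, 9) for v in q)

    seen = {start}
    queue = []
    for c in candidates(start):
        seen.add(c)
        heapq.heappush(queue, (dist_to_target(c), c))

    evals = 0
    while queue and evals < max_evals:
        d, p = heapq.heappop(queue)       # candidate closest to the target group
        evals += 1
        if not in_ball(p):
            continue                      # rejected: drop it and try the next one
        if d <= step:
            return True                   # target group is within reach of p
        for c in candidates(p):
            if c not in seen:
                seen.add(c)
                heapq.heappush(queue, (dist_to_target(c), c))
    return False


def strip(p):
    # Toy accepted region: a bounded horizontal strip.
    return -1.0 <= p[0] <= 4.0 and abs(p[1]) <= 0.5


def gapped(p):
    # The same strip with a slice removed, so the two ends are disconnected.
    return strip(p) and not (1.0 < p[0] < 2.0)


connected = bridge((0.0, 0.0), [(3.0, 0.0)], strip, step=0.5)
separated = bridge((0.0, 0.0), [(3.0, 0.0)], gapped, step=0.5)
```

A failed search, as in the second call, shows only that no path exists at the chosen candidate spacing; it does not by itself prove that the regions are disconnected.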

1513:

1514: \section{Comparison to Alternative Methods of Statistical

1515:   Inference} \label{sec:comparison}

1516:

1517: In \S \ref{sec:results}, we showed that the results of our

1518: technique are quite similar to other statistical inference methods

1519: currently employed in the literature.  Let us now relate our method

1520: to other inference techniques, and point out a few subtle, but

1521: remarkable, distinctions between them.

1522:

1523: \subsection{$\chi^2$ Tests}

1524: The method presented in \S \ref{sec:nonparametric} can be

1525: succinctly described as a method which computes the weighted sum of

1526: squares of the regressed fit and the test spectrum at the data

1527: points and rejects the hypothesis that the test spectrum could be

1528: generated by the data if the weighted sum is greater than the constant

1529: given in Equation \ref{conf0}.

1530: Intuitively, this process is quite similar to using a $\chi^2$ test,

1531: with two important differences.
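In symbols, and only as a sketch, the test can be written as follows. The notation is introduced here for illustration: $\hat f$ is the nonparametric fit, $f_\theta$ is the test spectrum with parameters $\theta$, $\ell_1, \ldots, \ell_n$ are the multipoles at which data exist, $w_i$ are the weights, and $c_\alpha$ stands for the constant of Equation \ref{conf0}; the exact weights and constant are those of \S \ref{sec:nonparametric}. The hypothesis $\theta$ is rejected when
\ba
\sum_{i=1}^{n} w_i \left( \hat f(\ell_i) - f_\theta(\ell_i) \right)^2 & > & c_\alpha .
\ea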

1532:

1533: First, our technique is centered

1534: around a nonparametric fit to the data, not the data themselves.  As a

1535: result, our method is approximately centered on the true underlying

1536: function, $f$, as opposed to the noisy observations of $f$.

1537: The

1538: implication is that our method is less affected by noise in the data

1539: than simple $\chi^2$ tests.

1540: In particular, we have observed that $\chi^2$ tests will reject all

1541: models in cases where there is a single outlier $4\sigma$ from the maximum

1542: likelihood estimate fit.  By initially fitting a nonparametric

1543: function to the data and then using this function to compute

1544: sum-of-squares distances, we are much less susceptible to errors

1545: caused by noisy outliers.

1546:

1547: Secondly, the radius computed using the pivot process is smaller than

1548: the $\chi^2$ radius, as we consider the Gaussian errors of all points

1549: as an ensemble, not individually as with $\chi^2$ tests.  The smaller

1550: radius of the pivot process translates directly into smaller confidence regions

1551: as compared with those found using $\chi^2$ tests.  This allows

1552: us to reject more of the hypothesis test models, and subsequently return tighter

1553: bounds on the parameters of interest. The confidence ball test has

1554: more statistical power than does the $\chi^2$ test.  A comparison of

1555: the relative widths of the confidence and $\chi^2$ balls is shown in

1556: Figure \ref{fig:distance_alpha}.

1557:

1558: \begin{figure}

1559: \begin{center}

1560: \plotone{f11.eps}

1561: \end{center}

1562: \caption{Radius of our non-parametric confidence ball as a function of

1563:   confidence level (solid).  The reduced $\chi^2$ ball is shown for

1564:   comparison (dashed). Arrows depict $\frac{1}{2}, 1, 1\frac{1}{2}$ and

1565:   $2\sigma$ respectively.}

1566: \label{fig:distance_alpha}

1567: \end{figure}

1568:

1569:

1570: \subsection{Bayesian Techniques} \label{sec:mcmc}

1571: As noted in \S \ref{sec:introduction}, most CMB power spectrum

1572: parameter estimations to date have been done via

1573: Bayesian techniques \citep[e.g.,][]{knox2001, gupta2002, spergel2003,

1574: jimenez2004, dunkley2005}.  Since the prior distribution is not

1575: conjugate to the likelihood, computing the posterior involves

1576: estimating an integral over the entire space spanned by the prior.

1577: Perhaps the most straightforward way to compute this integral is

1578: with an evenly-spaced grid with $n$ points per parameter.  For this

1579: approach, one pre-specifies a $d$-dimensional grid (where $d$ is the

1580: number of parameters of interest) and computes the posterior at the

1581: center of each grid cell.  The integral is then (approximately) the

1582: sum of the posterior at each grid cell, and the $1 - \alpha$ credible

1583: intervals can be determined (usually by marginalization) to be the

1584: smallest range for a given parameter that contains $1-\alpha$ of the

1585: posterior probability.   While straightforward, this approach scales

1586: exponentially with dimension, and hence is infeasible for even moderate

1587: dimensions; we estimate that a grid based approach, using CMBFast and

1588: seven parameters (similar to our method), with just 10 grid

1589: spacings per parameter would take over 100 years on a single computer.
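The grid approach can be sketched as follows. This Python example is illustrative only: the function name, the toy standard normal log-posterior, and the choice to keep the highest-posterior cells are assumptions made here, and the example is one-dimensional so that no marginalization is needed.

```python
import math


def grid_credible_cells(log_post, lows, highs, n, level):
    """Evaluate an unnormalised log-posterior at the centres of an n-per-axis grid
    and return the highest-posterior cells holding `level` of the probability."""
    d = len(lows)
    cells = []
    for index in range(n ** d):                 # n**d evaluations: exponential in d
        centre = []
        for axis in range(d):
            k = (index // n ** axis) % n
            width = (highs[axis] - lows[axis]) / n
            centre.append(lows[axis] + (k + 0.5) * width)
        cells.append((math.exp(log_post(centre)), tuple(centre)))

    total = sum(p for p, _ in cells)
    cells.sort(reverse=True)                    # highest posterior first
    kept, mass = [], 0.0
    for p, centre in cells:
        if mass >= level * total:
            break
        kept.append(centre)
        mass += p
    return kept


# Toy 1-D example: standard normal posterior on [-10, 10] with 2000 cells.
kept = grid_credible_cells(lambda x: -0.5 * x[0] ** 2, [-10.0], [10.0], 2000, 0.95)
interval = (min(c[0] for c in kept), max(c[0] for c in kept))
```

With $n = 10$ and $d = 7$ the outer loop would require $10^7$ posterior evaluations, each of which is a full simulator run.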

1590:

1591: As a result of the dimensionality problem, Markov Chain Monte Carlo

1592: (MCMC) has become an increasingly popular approach for

1593: estimating posteriors due to its (perceived) computational

1594: efficiency \citep[e.g.,][]{gupta2002, jimenez2004, sandvik2004,

1595: dunkley2005,chu2005}.

1596: In the MCMC technique, new samples are often derived using the

1597: Metropolis-Hastings algorithm.  The Metropolis-Hastings algorithm

1598: chooses a new sample $x$ from some arbitrary (pre-specified) proposal

1599: distribution defined over the  $d$-dimensional parameter space based on the

1600: previous sample and then accepts or rejects $x$

1601: based on the ratio of the proposed and current posterior density (when

1602: the proposal distribution is symmetric, as is common).

1603: The algorithm samples the input space roughly in

1604: proportion to the expected probability of each location.
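A minimal random-walk Metropolis sampler, written in Python, illustrates the accept/reject step for a symmetric proposal. It is a sketch under stated assumptions rather than any of the cited implementations: the target is a toy standard normal truncated to $[-10, 10]$, and the function names, proposal width, and seed are arbitrary choices.

```python
import math
import random


def metropolis_hastings(log_density, x0, n_steps, proposal_sd, rng):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal."""
    chain = [x0]
    accepted = 0
    x, logp = x0, log_density(x0)
    for _ in range(n_steps):
        x_new = x + rng.gauss(0.0, proposal_sd)   # propose from a symmetric kernel
        logp_new = log_density(x_new)
        # Accept with probability min(1, p(x_new) / p(x)).
        if math.log(rng.random() + 1e-300) < logp_new - logp:
            x, logp = x_new, logp_new
            accepted += 1
        chain.append(x)                           # a rejection repeats the old state
    return chain, accepted / n_steps


def log_std_normal(x):
    # Unnormalised log-density; -inf outside [-10, 10] mimics a bounded uniform prior.
    return -0.5 * x * x if -10.0 <= x <= 10.0 else float("-inf")


chain, acc_rate = metropolis_hastings(log_std_normal, 0.0, 20000, 1.0, random.Random(0))
mean = sum(chain) / len(chain)
```

Every proposal costs one density evaluation whether or not it is accepted, which is the relevant cost when each evaluation is a simulator run.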

1605:

1606: % When sampling reaches detailed

1607: % balance --- that is, the probability of being in state $i$ and

1608: % transitioning to state $j$ is equal to the probability of being in

1609: % state $j$ and transitioning to state $i$ ---  then we are guaranteed a

1610: % stationary distribution.

1611:

1612: Theoretically, MCMC using the Metropolis-Hastings algorithm

1613: converges almost surely to the stationary distribution (the

1614: posterior) in the limit of infinite sampling.  However, it is quite

1615: difficult to determine if convergence has been reached with a finite number of

1616: samples.  In particular, if a posterior is composed

1617: of two narrow, spatially separated Gaussians, then the probability of

1618: transition from one Gaussian to the other will be vanishingly small.

1619: Thus, after the chain has rattled around in one of the peaks for a

1620: while, it will appear that the chain has converged; however, after

1621: some finite amount of time, the chain will suddenly jump to the other

1622: peak, revealing that the initial indications of convergence were

1623: incorrect.  As this example illustrates, if the Markov chain is run

1624: with too few samples, the resulting credible intervals will be too

1625: narrow, and thus will not truly contain $1-\alpha$ of the probability

1626: mass. Thus, the consequence of lack of true convergence is artificially

1627: small credible intervals.  This problem is usually skirted by assuming

1628: that there are no small isolated peaks, computing multiple independent

1629: chains and comparing the results to illustrate convergence.

1630: Additionally, \citet{dunkley2005} and others have proposed alternative

1631: methods to detect convergence.  However, none of these methods are able to

1632: prove convergence with a limited number of CMBFast runs.
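The two-Gaussian failure mode is easy to reproduce. In the Python sketch below, the peak locations ($\pm 10$), the peak width ($0.1$), the proposal width ($0.5$), the chain length, and the seed are all arbitrary choices made for illustration; the chain is started inside the right-hand peak.

```python
import math
import random


def log_two_peaks(x, centre=10.0, sd=0.1):
    """Log-density (up to a constant) of an equal mixture of two narrow Gaussians
    centred at -centre and +centre."""
    a = -0.5 * ((x - centre) / sd) ** 2
    b = -0.5 * ((x + centre) / sd) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))   # stable log-sum-exp


rng = random.Random(1)
x = 10.0                       # start the chain inside the right-hand peak
logp = log_two_peaks(x)
chain = []
for _ in range(5000):
    x_new = x + rng.gauss(0.0, 0.5)
    logp_new = log_two_peaks(x_new)
    if math.log(rng.random() + 1e-300) < logp_new - logp:
        x, logp = x_new, logp_new
    chain.append(x)

visited_left_peak = any(v < 0.0 for v in chain)
```

The chain looks stationary throughout, yet it never visits the peak that holds half of the probability mass, so any interval derived from it is too narrow.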

1633:

1634: Moreover, as we noted in \S \ref{sec:introduction}, MCMC is designed

1635: to draw samples from an unknown distribution, not to search that distribution.

1636: As a result, MCMC algorithms explicitly spend a large number of samples

1637: on high-likelihood regions, and a minimal number on low-likelihood

1638: regions.  However, when we are computing $1-\alpha$ confidence

1639: intervals, it is the low-likelihood regions (those around the

1640: $1-\alpha$ boundary) that we are interested in.  In contrast, a search

1641: algorithm that can directly look up the likelihood of a sample

1642: has no reason to spend a large number of samples near the peak of the

1643: distribution, and can instead focus on the boundary in question.

1644:

1645: These differences are clearly shown in Figure \ref{fig:mcmc_straddle},

1646: which depicts (with black dots) samples chosen by typical single runs of MCMC and

1647: our algorithm when trying to compute the $95\%$ credible/confidence

1648: intervals for a standard normal distribution\footnote{For the Bayesian case,

1649: we assume that the observed data is a single point at the origin.  As

1650: a result, the true posterior derived via sampling will be exactly

1651: the same as the true standard Normal distribution.  This is done to

1652: ensure that both algorithms are sampling the same function, allowing

1653: us to compare the sampling patterns of the algorithms.}.

1654: Both algorithms were

1655: constrained to samples chosen in $[-10:10]$. The MCMC algorithm was

1656: started at a randomly selected point, with a uniform prior over the

1657: range. In this figure we use a standard normal proposal distribution,

1658: although the sampling pattern is similar for other distributions we

1659: tried.  Credible intervals for MCMC and confidence intervals

1660: for our algorithm are depicted below the plots.

1661: Several points are quite apparent.  First,

1662: MCMC has failed to converge in 50 samples, while our algorithm has

1663: converged nicely.  The credible intervals given by MCMC are not only

1664: underestimated, but are also not centered on the true distribution's

1665: center, revealing a potential liability for interpreting MCMC chains

1666: which have not converged.

1667:

1668: Secondly, notice that MCMC heavily samples the peak

1669: of the distribution, while our algorithm focuses on those regions

1670: associated with the confidence interval boundaries.  The

1671: MCMC chain results in a ragged collection of disjoint credible

1672: intervals, while our algorithm returns a single interval in

1673: which the endpoints have been well determined.

1674:

1675: Thirdly, note that our algorithm samples extreme points to ensure that

1676: it has not failed to observe additional peaks in the distribution

1677: which may contribute to the 95\% confidence interval, while MCMC has

1678: not. As noted before, since MCMC is not a search algorithm, it may

1679: spend a large number of samples in a single distribution peak

1680: before jumping to another peak in the distribution.  This sampling

1681: pattern may cause MCMC to appear to have converged, when in reality

1682: it has just failed to transition to the second peak, as in the two

1683: Gaussian case described previously.

1684:

1685: Finally, we note that the MCMC algorithm is not data efficient.  While

1686: Figure \ref{fig:mcmc_straddle} depicts those experiments run by MCMC,

1687: the final MCMC chain consists of only those points that were accepted

1688: (in this case by the Metropolis-Hastings algorithm). As such, some of the

1689: points that MCMC samples are discarded immediately, and never used to

1690: guide the chain in future steps, or to determine the $1-\alpha$

1691: credible intervals.  In addition, many MCMC practitioners

1692: remove all but every $j$th sample point (for some integer $j$) to

1693: reduce the correlation between the points that remain in the chain. This

1694: significantly reduces data efficiency.
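The cost of thinning is simple to quantify. In this short Python sketch the chain is a stand-in list of integers and the values of \texttt{evaluations} and $j$ are hypothetical.

```python
def thin(chain, j):
    """Keep every j-th state of a chain, discarding the rest."""
    return chain[j - 1::j]


evaluations = 1000                     # hypothetical number of simulator runs
chain = list(range(evaluations))       # stand-in for the states of a chain
kept = thin(chain, 10)
fraction_used = len(kept) / evaluations
```

With $j = 10$, only one tenth of the evaluated points contribute to the final credible intervals.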

1695:

1696:

1697: \begin{figure*}

1698: \begin{center}

1699: \plotone{f12.eps}

1700: \end{center}

1701: \caption{Distribution of experiments run by MCMC (left) and our

1702:   algorithm (right).  Black dots denote 50 experiments run in

1703:   order to determine

1704:   the 95\% credible / confidence interval (shaded red area) for a

1705:   standard normal

1706:   distribution (solid red line).  Shaded blue areas below the normal

1707:   curves indicate the credible / confidence intervals derived for the

1708:   50 samples chosen.  See the electronic edition of the Journal for a

1709:   color version of this

1710: figure.}

1711: \label{fig:mcmc_straddle}

1712: \end{figure*}

1713:

1714: \subsection{Advantages of Frequentist Inference}

1715: \label{sec:bayesvsfreq}

1716: %\subsection{Bayesian vs. Frequentist Inference}

1717:

1718: Often, non-statisticians are confused by differences between Bayesian

1719: and frequentist techniques, and the advantages and limitations that

1720: each maintains.  Particularly appealing with the Bayesian approach is

1721: the fact that one is computing a posterior distribution over the

1722: parameter space.  Thus, not only does one obtain $1-\alpha$ credible

1723: intervals, but one gets a sense of where within the interval the

1724: true value is expected to be.  Frequentist approaches do not allow

1725: one to compute the probability that the true value is equal to some

1726: particular parameter value.  While choosing one technique over the

1727: other is a matter of personal statistical philosophy, we believe that

1728: frequentist approaches hold  important advantages over their Bayesian

1729: counterparts.

1730:

1731: First, any Bayesian technique requires that one assume a family of likelihood

1732: functions and a prior distribution over the parameter space in order

1733: to compute the posterior.  The resulting posterior is only as valid as

1734: both the likelihood and the prior.  In many cases, a prior

1735: distribution is unknown. In these cases, an ``uninformative prior,''

1736: equivalent to a uniform distribution on some bounded range, is often

1737: assumed.  However, such a prior is not uninformative.  In particular,

1738: a uniform prior indicates that the practitioner believes that the true

1739: distribution of the parameter is uniform, not unknown.  Moreover ``uninformative''

1740: priors are parametrization dependent.  If we reformulate our 7D CMB

1741: problem by replacing $\Omega_M$ with $H_0$, a uniform prior over the

1742: original problem will not translate into a uniform prior over the

1743: formulation including $H_0$, as $\Omega_M$ is inversely related to

1744: $H_0$.
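To make the parametrization dependence concrete, consider a worked example that assumes the standard relation $\omega_M = \Omega_M h^2$, with $h = H_0 / (100\mpc)$, and holds the physical density $\omega_M$ fixed; both are simplifications made here for illustration. A uniform prior $p(\Omega_M) = 1/(b-a)$ on $[a, b]$ then induces the prior
\ba
p(h) & = & p(\Omega_M) \left| \frac{d \Omega_M}{d h} \right| \;=\; \frac{1}{b-a} \, \frac{2 \omega_M}{h^3}
\ea
on $h$ over the corresponding range. This density is proportional to $h^{-3}$ rather than flat, so low values of $H_0$ are favored before any data are seen.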

1745:

1746: Secondly, any change to the prior invalidates the current results.  In

1747: particular, even when one is using a uniform prior,

1748: merely changing parameter

1749: ranges will result in a different posterior with possibly different

1750: $1-\alpha$ credible intervals.  Thus analyses, like those we performed

1751: in \S \ref{sec:intervals} would have required us to recompute the

1752: entire chain (or set of chains), an extremely expensive proposition,

1753: or somehow approximate the difference.

1754: Additionally, for Bayesian techniques, the prior should be independent of

1755: the data, and hence it should not be changed after observing the

1756: data.  By recomputing the posterior using a new prior (based upon a

1757: previous posterior), we open ourselves to errors incurred due to

1758: multiple hypothesis testing.  Moreover, it is a small step from such

1759: repeated Bayesian inferences to data-dependent priors, which are

1760: incoherent, not Bayesian.  Hence, data-dependent priors do not benefit

1761: from theoretical guarantees derived for Bayesian analyses, which

1762: assume priors are chosen before any data is observed.

1763:

1764: It is interesting to note that Table

1765: \ref{paramtable} denotes the final ranges of parameters

1766: searched.  We initially started with the same parameter ranges as

1767: \citet{tegmark2001}, but increased our ranges slightly to better

1768: capture a secondary peak in confidence space (shown in Figure

1769: \ref{fig:results2d}). Because of our frequentist-based technique, we

1770: can easily change the ranges being searched without re-running any of

1771: the CMBFast models, or recomputing any of our current inferences.

1772: This contrasts sharply with Bayesian techniques.

1773:

1774: Finally, recall from \S \ref{sec:introduction} that Bayesian approaches

1775: answer a fundamentally different question than do frequentist

1776: approaches.  Frequentist approaches are concerned with deriving

1777: procedures which will return confidence intervals that trap the

1778: true value of a parameter in at least $1-\alpha$ of the cases in which the

1779: procedure is used.

1780: Bayesian methods are more interested in determining the

1781: probability that a particular value of a parameter is chosen for the

1782: given data set and prior.

1783: While we can compute ``credible'' intervals for Bayesian methods by

1784: choosing the minimum range of a parameter such that the enclosed

1785: probability is equal to $1-\alpha$, these intervals do not necessarily

1786: correspond to those derived from using a frequentist approach.  In

1787: particular, there is no guarantee that credible intervals will

1788: contain the true value of the parameter in at least $1-\alpha$

1789: fraction of the instances where the technique is applied.

1790: Specifically, when the likelihood function of the model goes awry,

1791: such as in cases of high dimension, missing data, and/or

1792: non-parametric models, the inference made using Bayesian methods will

1793: be incorrect.

1794:

1795: This problem is particularly acute for high dimensions,

1796: where $1-\alpha$ credible intervals might trap the true value of the

1797: parameter close to zero percent of the time.  That is, if Bayesian

1798: techniques are applied to a series of data sets, the

1799: fraction of the resulting $1-\alpha$ credible intervals that contain the true

1800: values of the parameter will be less than $1-\alpha$ and may be

1801: significantly less than $1-\alpha$.   While we find this fact

1802: disturbing, a Bayesian might be willing to trade off the fact that the

1803: credible intervals usually will not  contain the truth

1804: for the ability to compute a posterior distribution of likelihood over

1805: parameter space (assuming some prior) and hence determine the

1806: probability of any given parameter setting.

1807: As \cite{olivestatistics} notes: ``to construct procedures with

1808: guaranteed long run performance, such as confidence intervals, use

1809: frequentist methods.''
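
To make the distinction drawn above concrete, the two guarantees can be written side by side.  This is a sketch in notation introduced only for this remark: $Y$ denotes the data, $\theta$ the parameter, $C_\alpha(Y)$ the reported set, and $\pi(\theta \mid Y)$ the posterior.  A frequentist $1-\alpha$ confidence set satisfies
\[
\inf_{\theta}\, P_\theta\left(\theta \in C_\alpha(Y)\right) \ge 1-\alpha ,
\]
where the probability is over repeated draws of $Y$ with $\theta$ held fixed.  A $1-\alpha$ credible set instead satisfies
\[
P\left(\theta \in C_\alpha(Y) \mid Y\right)
= \int_{C_\alpha(Y)} \pi(\theta \mid Y)\, d\theta = 1-\alpha ,
\]
where the probability is over $\theta$ with $Y$ held fixed, and depends on the chosen prior through $\pi(\theta \mid Y)$.  Neither statement implies the other in general.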

1810:

1811:

1812:

1813:

1814: % Intentionally blank.

1815: %

1816: %

1817: %

1818: %

1819: %

1820: %

1821:

1822: \section{Conclusions} \label{sec:conclusion}

1823:

1824: In this paper, we present a new technique to map confidence surfaces, and

1825: show results on first-year WMAP data.  This method, utilizing

1826: a non-parametric fit and confidence balls, allows for computing

1827: simultaneously valid confidence intervals.

1828: Our technique is similar in spirit to the Bayesian methods, but

1829: differs significantly in that it is a frequentist analysis with

1830: \textit{simultaneously valid} coverage.

1831: Thus, the derived confidence intervals are valid

1832: regardless of the values of the remaining parameters.  This is not the

1833: case when a maximization or marginalization technique is used.

1834: While the use of confidence balls requires a search over the entire

1835: parameter space akin to the integration required for Bayesian

1836: techniques, we present an algorithm to efficiently compute regions of

1837: parameter space which have confidence values above a specified

1838: $1-\alpha$ threshold.  We present results of our algorithm and note

1839: that they are similar to those derived using alternative statistical

1840: methods.  While the WMAP power spectrum data alone is insufficient to

1841: constrain any of the cosmological parameters, the addition of

1842: a reasonable assumption on the Hubble constant provides useful

1843: cosmological insights.

1844:

1845: We point out that the purpose of this paper is to present

1846: a new statistical and computational technique to provide

1847: frequentist confidence intervals on the cosmological parameters

1848: using the WMAP Year 1 data. We are not

1849: arguing that the allowed parameter ranges shown in Figures

1850: \ref{fig:results1d}, \ref{fig:results2d}, \ref{fig:results1d:h0} and

1851: \ref{fig:results2d:h0}

1852:  are more accurate than those presented by the WMAP

1853: team. The reason for this is two-fold: (1) the comparison

1854: of credible (Bayesian) vs. valid (frequentist) parameter ranges

1855: is non-trivial and outside the scope of this work and (2) we

1856: use only the WMAP Year 1 data, while others have utilized

1857: non-WMAP data in various ways to provide additional

1858: constraints on the parameters.

1859:

1860: Analysis of Figures \ref{fig:results1d} and \ref{fig:results2d} shows

1861: that the one sigma confidence regions are similar to those found in

1862: the literature using a variety of techniques (e.g., \citealt{tegmark2001,

1863:   spergel2003, spergel2006}).  Figures

1864: \ref{fig:results1d} and \ref{fig:results2d}  illustrate that

1865: the WMAP data alone is not sufficient to strongly constrain the

1866: matter/energy budget for the Universe.  In particular, the constraint

1867: on the Hubble constant, $H_0$, is so weak as to allow values between

1868: 15 and 300 km s$^{-1}$ Mpc$^{-1}$ at the two sigma level.

1869:

1870: If we instead constrain $H_0$ to a more ``typical'' range of

1871: $[60:75]$, we get much tighter constraints on \textit{all} parameters,

1872: as shown in Figures \ref{fig:results1d:h0} and \ref{fig:results2d:h0}.

1873: Because we are using a frequentist confidence procedure, adding the

1874: restriction does not affect the validity of the inference.  Moreover,

1875: no additional CMBFast models must be computed to test this constraint,

1876: illustrating the power of our statistical procedure.

1877: Note that both Figures \ref{fig:results1d:h0} and

1878: \ref{fig:results2d:h0} agree much better with the current estimates of the

1879: cosmological matter/energy budget and strongly suggest

1880: that $\Omega_\mathrm{Total} = 1$.

1881:

1882: Moreover, as we show in \S \ref{sec:convergence}, CMBFast creates

1883: temperature power spectra which are discontinuous in parameter space.

1884: This discontinuity violates the smoothness assumption of the

1885: underlying target function used by both our Gaussian

1886: process technique and MCMC.   This makes convergence

1887: statements difficult to make.  However, we believe that the 1.2

1888: million models run show reasonable convergence.  We believe that with additional

1889: assumptions on CMBFast --- such as the maximum size of a discontinuity

1890: --- we will be able to prove that our method converges in a reasonable

1891: time frame.

1892:

1893: Additionally, we show that comparing CMBFast models to the WMAP year 1

1894: temperature power spectrum data results in a multi-modal solution in

1895: confidence space. We have detected at least two distinct confidence

1896: regions in parameter space.  However, by adding assumptions on $n_s$,

1897: we can eliminate the secondary peak, leading us to believe that the

1898: secondary peak may not be visible in the WMAP third year data.

1899:

1900: In summary, we believe the proposed approach of using a non-parametric

1901: fit to the data and confidence balls, coupled with a search algorithm

1902: to find models in parameter space which fit our regressed estimate,

1903: provides a robust and informative

1904: method for computing confidence intervals for cosmological

1905: parameters.  In addition to merely computing intervals, our approach

1906: has the ability to test various constraints without computing new

1907: models or making assumptions about which models should be fit and

1908: what the ranges of the parameter space should be.  We are working on

1909: techniques to prove convergence of the algorithm, as well as the

1910: incorporation of additional data sets to further constrain the

1911: mass/energy budget of the Universe.

1912:

1913: \acknowledgments

1914: The authors would like to thank the referee for his/her valuable

1915: suggestions and corrections.

1916:

1917: {\it Facilities:} \facility{WMAP}

1918: \appendix

1919: \section{Estimating $\tau$} \label{appendix}

1920:

1921: Recall from \S \ref{sec:fit} that the cosine basis is defined on

1922: $[0,1]$ by

1923: \[

1924: \phi_j(x) =

1925: \left\{

1926: \begin{array}{l l}

1927: 1 & \mathrm{for\ } j=0\\

1928: \sqrt{2}\cos(\pi j x) & \mathrm{for\ } j = 1,2,3, \dots

1929: \end{array}

1930: \right.

1931: \]

1932: If $j$ and $k$ are distinct, positive integers, then

1933: \begin{eqnarray*}

1934: \phi_j \phi_k &=& 2 \cos(\pi j x) \cos(\pi k x)\\ &=& \cos(\pi(j+k)x) + \cos(\pi(j-k)x)\\

1935: &=& \frac{1}{\sqrt{2}} (\phi_{j+k} + \phi_{|j-k|}).

1936: \end{eqnarray*}

1937: Moreover, if $j>0$, then $\phi_j^2 = 2 \cos^2(\pi j x) = \cos(2\pi j x)+1  = \frac{1}{\sqrt{2}}

1938: \phi_{2j} + \phi_0.$

1939: Therefore, as mentioned in \S \ref{sec:fit},

1940: \[

1941: \Delta_{jk\ell} = \left\{

1942: \begin{array}{c c}

1943: 1 & \mathrm{if\ \#}\{j,k,\ell = 0\} = 3\\

1944: 0 & \mathrm{if\ \#}\{j,k,\ell = 0\} = 2\\

1945: \delta_{jk}\delta_{0\ell} + \delta_{j\ell}\delta_{0k} +

1946: \delta_{k\ell}\delta_{0j} & \mathrm{if\ \#}\{j,k,\ell = 0\} = 1\\

1947: \frac{1}{\sqrt{2}}(\delta_{\ell, j+k} + \delta_{\ell,|j-k|}) & \mathrm{if\ \#}\{j,k,\ell = 0\} = 0

1948: \end{array}

1949: \right..

1950: \]
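
As a quick check of these cases (a worked instance added for illustration, using only the product formulas above, the orthonormality of the basis, and $\Delta_{jk\ell} = \int_0^1 \phi_j \phi_k \phi_\ell$), take $j=1$ and $k=2$.  Then
\[
\phi_1 \phi_2 = \frac{1}{\sqrt{2}}\left(\phi_3 + \phi_1\right)
\quad\Longrightarrow\quad
\Delta_{1,2,3} = \Delta_{1,2,1} = \frac{1}{\sqrt{2}},
\qquad
\Delta_{1,2,\ell} = 0 \mathrm{\ \ for\ } \ell \notin \{1,3\}.
\]
Similarly, $\phi_1^2 = \frac{1}{\sqrt{2}} \phi_2 + \phi_0$ gives $\Delta_{1,1,2} = 1/\sqrt{2}$ and $\Delta_{1,1,0} = 1$, in agreement with the last two rows of the definition.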

1951: Let $w(x) = 1/\sigma(x)$, such that $w^2(x) = \sum_j w_j

1952: \phi_j(x)$.  As in \S \ref{sec:fit}, we let $\hat \mu_j = \lambda_j

1953: Z_j$, where

1954: \[

1955: Z_j = \frac{1}{n} \sum_{i=1}^n Y_i \phi_j(X_i)

1956: \]

1957: and $1 \ge \lambda_0 \ge \lambda_1 \ge \cdots \ge \lambda_n \ge 0$ are

1958: shrinkage coefficients.  In this work, we use a special case of

1959: monotone shrinkage in which

1960: \[

1961: \lambda_j = \left\{

1962: \begin{array}{cc}

1963: 1 & \mathrm{for\ } j\le J\\

1964: 0 & \mathrm{for\ } j> J

1965: \end{array}\right.

1966: \]

1967: for $J \in \{0,1,2,\dots,n\}$ such that $J$ minimizes Stein's unbiased

1968: risk estimate given in Equation \ref{eqn:stein}.

1969: With these definitions, the loss can be written as

1970: \begin{eqnarray*}

1971: L(f, \hat f)

1972: &=&

1973: \int_0^1 \left(\frac{\hat f(x) -

1974:   f(x)}{\sigma(x)}\right)^2 \, dx\\

1975: &=&

1976: \sum_{j,k,\ell} (\mu_j - \hat \mu_j)(\mu_k - \hat \mu_k) w_\ell

1977: \int_0^1 \phi_j \phi_k \phi_\ell\\

1978: &=&

1979: \sum_{j,k} (\mu_j - \hat \mu_j)(\mu_k - \hat \mu_k)

1980: \sum_{\ell} w_\ell \Delta_{jk\ell}\\

1981: &=& (\mu - \hat \mu)^T W (\mu - \hat \mu),

1982: \end{eqnarray*}

1983: where $W_{jk} = \sum_\ell w_\ell \Delta_{jk\ell}$.

1984: As in \S \ref{sec:fit}, let $D$ and $\bar D = I - D$ be diagonal matrices with 1's in the

1985: first $J$ and last $n-J$ entries respectively.  Then $\hat \mu = DZ$,

1986: where $Z$ is again assumed to be Normal $(\mu, B)$.  Thus,

1987: $\E [\hmu] = D \mu$, $\Cov(\hmu_j, \hmu_k) = \lambda_j\lambda_k B_{jk}$

1988: and $\Var(\hmu) = DBD$.  The risk then becomes

1989: \ba

1990: R = \E [L] &=& \E \left[(\mu - \hmu)^T W (\mu - \hmu)\right]\\

1991: &=& \mathrm{trace}(DWDB)+ \mu^T\bar D W \bar D \mu\\

1992: &=& \mathrm{trace}(DWDB)+ \sum_{j,k} \mu_j \mu_k \bar \lambda_j \bar

1993: \lambda_k W_{jk}

1994: \ea

1995: An unbiased estimate can be obtained by replacing $\mu_j \mu_k$ with

1996: $Z_j Z_k - B_{jk}$.  The result is

1997: \[

1998: \hat R = Z^T \bar D W \bar D Z + \mathrm{trace}(DWDB) -

1999: \mathrm{trace}(\bar D W \bar D B)

2000: \]

2001: It follows that

2002: \[

2003: \hat L - \hat R =

2004: \mu^T W \mu -Z^TC + Z^T A Z - \mathrm{trace}(AB)

2005: \]

2006: where $A = DW+WD-W$ and $C = 2DW\mu$.  Moreover,

2007: \ba

2008: \Var(\hat L - \hat R)  &=& \Var(Z^T A Z - Z^TC)\\

2009: &=& \Var(Z^T A Z) + \Var(Z^T C) - 2\,\Cov(Z^TAZ, Z^TC)\\

2010: &=&2 \, \mathrm{trace}(ABAB) + \mu^T Q \mu

2011: \ea

2012: where $Q = 4(ABA + WDBDW - 2ABDW)$.  Plugging in unbiased estimates of

2013: the linear and quadratic forms involving $\mu$, we get the following

2014: estimate for the variance of the pivot process:

2015: \[

2016: \hat \tau^2 = 2\, \mathrm{trace}(ABAB)+ Z^TQZ - \mathrm{trace}(QB).

2017: \]
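
For reference, the steps above rest on standard moment identities for a Normal vector, collected here as a sketch.  The generic matrix $M$ and fixed vector $c$ are notation introduced only for this remark; $Z$ is Normal $(\mu, B)$ and $A$ is symmetric.
\ba
\E\left[Z^T M Z\right] &=& \mu^T M \mu + \mathrm{trace}(MB)\\
\Var\left(Z^T A Z\right) &=& 2\, \mathrm{trace}(ABAB) + 4\, \mu^T A B A \mu\\
\Var\left(Z^T c\right) &=& c^T B c\\
\Cov\left(Z^T A Z, Z^T c\right) &=& 2\, \mu^T A B c
\ea
The first identity is what justifies replacing $\mu_j \mu_k$ with $Z_j Z_k - B_{jk}$ when forming unbiased estimates of quadratic forms in $\mu$.  The remaining three, applied with $c = C$, are the ingredients of the variance of $Z^T A Z - Z^T C$.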

2018:

2019:

2020: % \newpage

2021: \bibliographystyle{apj}

2022: \bibliography{ms}

2023: \end{document}