gr-qc0608114/ms.tex
1: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2: %%                                                                  %%
3: %% TITLE: When is Enough Good Enough in Source Modeling?            %%
4: %% AUTHOR: Louis J. Rubbo                                           %%
5: %% DATE: August 28, 2006                                            %%
6: %%                                                                  %%
7: %% NOTES: This document requires the aipproc package to compile.    %%
8: %% The package is available at the AIP conference proceedings web   %%
9: %% page at                                                          %%
10: %%                                                                  %%
11: %%    http://proceedings.aip.org/proceedings/authors.jsp#latex      %%
12: %%                                                                  %%
13: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
14: 
15: 
16: \documentclass[final]{aipproc}
17: 
18: %#### PACKAGES ####################################
19: 
20: \usepackage{amsmath, amssymb, latexsym}
21: 
22: 
23: 
24: %#### LENGTHS #####################################
25: 
26: % Page size
27: \layoutstyle{6x9}
28: 
29: % This call sets the amount of space on either side of the equals sign
30: % in the equation array environment
31: \setlength{\arraycolsep}{2pt}
32: 
33: 
34: 
35: %#### NEW COMMANDS ################################
36: 
37: 
38: 
39: %#### INFORMATION #################################
40: 
41: \begin{document}
42: 
43: \title{When is Enough Good Enough\\in Gravitational Wave Source Modeling?}
44: 
45: \classification{02.50.Cw, 04.80.Nn, 05.45.Tp}
46: 
47: \keywords{gravitational waves --- methods: data analysis}
48: 
49: \author{Louis J. Rubbo}{
50:   address={Center for Gravitational Wave Physics, 104 Davey Lab,
51:   University Park, PA 16802} }
52: 
53: 
54: 
55: %#### MAIN DOCUMENT ###############################
56: 
57: %==== Abstract ====================================
58: 
59: \begin{abstract}
60: A typical approach to developing an algorithm for analyzing
61: gravitational wave data is to assume a particular waveform and use its
62: characteristics to formulate a detection criterion.  Once a detection
63: has been made, the algorithm uses those same characteristics to tease
64: out parameter estimates from a given data set.  While an obvious
65: starting point, such an approach is initiated by assuming a single,
66: correct model for the waveform regardless of the signal strength,
67: observation length, noise, etc.  This paper introduces the method of
68: Bayesian model selection as a way to select the most plausible
69: waveform model from a set of models given the data and prior
70: information.  The discussion is set in the scientific context of the
71: proposed Laser Interferometer Space Antenna.
72: \end{abstract}
73: 
74: \maketitle
75: 
76: 
77: 
78: %==== Main matter =================================
79: 
80: \section{INTRODUCTION} \label{sec:intro}
81: 
82: The anticipated data from the proposed Laser Interferometer Space
83: Antenna (LISA) introduces a number of exciting and original
84: challenges.  Central among these challenges is the development of data
85: analysis routines capable of coaxing out and characterizing individual
86: signals from the noisy time series LISA will return.  A great deal of
87: work has already been invested into the development of algorithms
88: applicable to the LISA data.  While a number of these algorithms have
89: demonstrated favorable capabilities on simulated data, each makes an
90: initial assumption about the functional form for the waveform under
91: consideration.  This paper introduces the use of Bayesian model
92: selection as a quantitative method for selecting the waveform model.
93: Using Bayes' theorem we show how the data and prior information pick
94: out the most plausible model from a set of proposed models.
95: 
96: Gravitational wave data analysis can be loosely described as a
97: three-step process, as depicted in figure~\ref{fig:flowchart}.  In the first
98: step, a signal is detected within a set of noisy time streams
99: retrieved from the detector.  In step two, the signal is characterized
100: by producing estimates for the parameterization variables.  Finally,
101: step three is to make physical interpretations based on the estimated
102: parameter values.  These steps are not necessarily mutually exclusive.
103: There are no obvious boundaries and areas of overlap do exist.
104: However, each step is necessary when analyzing a detected signal.
105: \begin{figure}
106:   \includegraphics[height=.27\textheight]{DataAnalysisBW.eps}
107:   \caption{Data analysis flow chart.}
108:   \label{fig:flowchart}
109: \end{figure}
110: 
111: In making the transition from detection to characterization (and quite
112: often in the detection process itself) a particular waveform is
113: assumed prior to the investigation.  While an obvious assumption to make
114: in the early developmental stages for an algorithm, it can lead to
115: needless complications and even misidentifications.  For example, if a
116: signal is characterized by a low signal-to-noise ratio, some of the
117: intricate waveform features can be lost in the noise and therefore a
118: simpler model would have sufficed in the analysis. In the Bayesian
119: model selection approach presented here, the data and prior
120: information justify the selection of a particular waveform model by
121: calculating the most plausible model from a proposed library of
122: models.
123: 
124: Bayesian model selection is not a new methodology, but it is one that
125: has not been fully adopted by the still infant gravitational wave
126: community.  The aim of this paper is to briefly summarize the theory
127: and to discuss possible applications for analyzing the LISA data.  To
128: this end, the paper first introduces the rules of probability theory,
129: including a derivation of Bayes' theorem.  It then outlines the
130: necessary calculations for performing a model selection procedure.
131: From here we give a simple, qualitative example of its use for the
132: LISA data.  We conclude by suggesting a few other applications
133: associated with LISA.
134: 
135: 
136: 
137: \section{BAYESIAN STATISTICS} \label{sec:bayes}
138: 
139: \subsection{Rules of Probability Theory} \label{sec:rules}
140: 
141: We begin by introducing a notation first used by
142: Jeffreys~\cite{Jeffreys:1961}.  We will denote the statement ``the
143: probability that proposition $A$ is true given proposition $B$'' as
144: $P(A|B)$.  Similarly, ``the joint probability that both $A$ and $B$
145: are true given $C$'' is denoted by $P(A,B|C)$.  The notation ``$|C)$''
146: is the conditional that proposition $C$ is assumed to be true.  In
147: Bayesian statistics probability statements such as $P(A)$ are not
148: clear because they do not explicitly state their dependencies.
149: Furthermore, \textit{all} probabilities are conditional.
150: 
151: Starting with the desiderata that degrees of plausibility are
152: represented by real numbers, that the rules for manipulating
153: plausibility statements agree with common sense, and that they are
154: consistent, it is possible to show that only two rules are required
155: for manipulating probabilities~\cite{Cox:1946}: the Sum Rule,
156: \begin{equation} \label{eq:sum_rule}
157:   P(A+B|C) = P(A|C) + P(B|C) - P(A, B|C)
158: \end{equation}
159: where the plus sign inside the probability argument means ``or'', and
160: the Product Rule,
161: \begin{equation} \label{eq:multi_rule}
162:   P(A,B|C) = P(A|C) P(B|A,C) \;.
163: \end{equation}
164: By standard Aristotelian logic it must be the case that $P(A,B|C) =
165: P(B,A|C)$.  Consequently, the Product Rule may be re-expressed as
166: \begin{equation}
167:   P(B,A|C) = P(B|C) P(A|B,C) \;.
168: \end{equation}
169: Equating the last two expressions results in Bayes' theorem,
170: \begin{equation} \label{eq:Bayes}
171:   P(A|B,C) = P(A|C) \; \frac{P(B|A,C)}{P(B|C)} \;.
172: \end{equation}
173: Although Bayes' theorem receives the accolades, it is simply a
174: consistency statement for the Product Rule.
175: 
176: In words, Bayes' theorem is often stated as
177: \begin{displaymath}
178:   \textrm{Posterior} = \textrm{Prior} \;
179:   \frac{\textrm{Marginal Likelihood}}{\textrm{Global Likelihood}} \;.
180: \end{displaymath}
181: In this form it is evident that Bayes' theorem quantitatively
182: describes a learning process.  We start with a prior state of
183: knowledge about proposition $A$ when $C$ is assumed true, $P(A|C)$.
184: We then gain new information $B$, which in turn updates our final
185: state of knowledge as given by the posterior probability, $P(A|B,C)$.
186: The proportionality factor between our prior and posterior states of
187: knowledge is a normalized statement about how likely proposition
188: $B$ is to occur given that both $A$ and $C$ are true.
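
As a purely illustrative numerical check of equation~\eqref{eq:Bayes}
(the numbers are assumed here and do not come from any LISA analysis),
suppose $P(A|C) = 0.1$, $P(B|A,C) = 0.8$, and $P(B|\bar{A},C) = 0.2$,
where $\bar{A}$ denotes the negation of $A$.  Because $A$ and $\bar{A}$
are mutually exclusive and exhaustive, the Sum and Product Rules give
the global likelihood,
\begin{displaymath}
  P(B|C) = P(A|C) P(B|A,C) + P(\bar{A}|C) P(B|\bar{A},C)
  = 0.1 \times 0.8 + 0.9 \times 0.2 = 0.26 \;,
\end{displaymath}
and the posterior probability follows as
\begin{displaymath}
  P(A|B,C) = 0.1 \times \frac{0.8}{0.26} \approx 0.31 \;.
\end{displaymath}
In this sketch the new information $B$ raises the plausibility of $A$
from $0.1$ to roughly $0.31$.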
189: 
190: While Bayes' theorem is a useful byproduct of the Product Rule, the
191: use of the Sum Rule is equally important.  It is through the Sum Rule
192: that we are able to take a joint probability of multiple propositions,
193: and reduce it to a distribution of a smaller subset of the larger
194: joint distribution.  For example, consider the joint distribution
195: between $A$ and a set of $n$ exhaustive $B_{i}$'s, given prior
196: information $I$.  From the Sum Rule we have
197: \begin{eqnarray}
198:   P(A, \sum_{i=1}^{n} B_{i} | I) &=& P(A|I) \nonumber\\
199:   &=& P(A, B_{1} | I) + P(A, \sum_{i=2}^{n} B_{i} | I) - P(A, B_{1},
200:   \sum_{i=2}^{n} B_{i} | I) \;,
201: \end{eqnarray}
202: where the first equality follows from the Product Rule and the fact
203: that the $B_{i}$'s are exhaustive.  If the $B_{i}$'s are mutually
204: exclusive, that is only one value can be realized at a time, then the
205: last term is zero.  Repeated application of the Sum Rule leads to
206: \begin{equation}
207:   P(A|I) = \sum_{i=1}^{n} P(A, B_{i} | I) \;.
208: \end{equation}
209: When the $B_{i}$'s take on continuous values the above goes over to an
210: integral,
211: \begin{equation}
212:   P(A | I) = \int P(A, B | I) \, dB \;.
213: \end{equation}
214: The process which we have just described is referred to as
215: \textit{marginalization}.  In it we have removed a \textit{nuisance
216: parameter}, $B$, from a joint distribution by a repeated application
217: of the Sum Rule.
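
Combining the discrete form of this result with the Product
Rule~\eqref{eq:multi_rule} gives an expression that is often more
convenient in practice.  As a short sketch for mutually exclusive and
exhaustive $B_{i}$'s,
\begin{displaymath}
  P(A|I) = \sum_{i=1}^{n} P(A, B_{i} | I)
  = \sum_{i=1}^{n} P(B_{i}|I) \, P(A|B_{i}, I) \;.
\end{displaymath}
For instance, with two alternatives and the assumed, purely
illustrative values $P(B_{1}|I) = 0.3$, $P(B_{2}|I) = 0.7$,
$P(A|B_{1},I) = 0.5$, and $P(A|B_{2},I) = 0.1$, one finds
\begin{displaymath}
  P(A|I) = 0.3 \times 0.5 + 0.7 \times 0.1 = 0.22 \;.
\end{displaymath}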
218: 
219: 
220: \subsection{Model Selection} \label{sec:ModelSel}
221: 
222: In model selection the central question that is being addressed is the
223: following: ``Given a particular set of data, and prior information,
224: which hypothesis from a library $\mathcal{L} \equiv \{H_{1}, \ldots,
225: H_{\ell}\}$ of hypotheses is the most plausible?''  Key to this
226: question are the ideas that all prior information is included and that
227: the most plausible hypothesis is based on the given data.  The
228: hypotheses within a library are either assumed to be exhaustive or, by
229: a careful choice in models, the space is made
230: so~\cite{Bretthorst:1996}.
231: 
232: A model itself consists of a functional form dependent on a vector of
233: parameters $\vec{\lambda}$, and two probability
234: distributions~\cite{MacKay:1992}.  The first distribution describes
235: the probability distribution for the parameter values given the model
236: prior to the new data, $P(\vec{\lambda}|H_{\alpha})$.  This is a key
237: point; two models are distinct even if they have the same
238: parameterization but different priors about how those parameters are
239: believed to be distributed.  The second distribution is the
240: probability of a data set given the model and a particular set of
241: parameter values, $P(D|\vec{\lambda}, H_{\alpha})$.
242: 
243: From Bayes' theorem~\eqref{eq:Bayes}, the posterior probability for a
244: particular model is given by
245: \begin{equation}
246:   P(H_{\alpha}|D, I) = P(H_{\alpha}|I) \; \frac{P(D|H_{\alpha}, I)}{
247:   P(D| I)} \;, 
248: \end{equation}
249: where $I$ symbolizes our unenumerated prior information.  The
250: denominator can be viewed as a normalization constant,
251: \begin{equation}
252:   P(D|I) = \sum_{\alpha = 1}^{\ell} P(H_{\alpha}|I) P(D|H_{\alpha}, I)
253:   \;.
254: \end{equation}
255: By investigating the \textit{odds ratio} between two competing models,
256: we can eliminate the need to calculate the normalization constant,
257: \begin{eqnarray} \label{eq:oddsratio}
258:   O_{12} &=& \frac{P(H_{1}|D,I)}{P(H_{2}|D,I)} =
259:   \frac{P(H_{1}|I) P(D|H_{1},I)}{P(H_{2}|I) P(D|H_{2},I)} \nonumber\\
260:   &=& \frac{P(D|H_{1},I)}{P(D|H_{2},I)} \;.
261: \end{eqnarray}
262: The second line arises by assuming that our prior information does not
263: favor one model over the other.  The odds ratio gives us a means to
264: directly compare competing models.  If our library contains more than
265: two models, one model may be used as a reference.  For example, the
266: reference model may be a constant (i.e. a no signal present model),
267: while the remaining library contains a spectrum of waveform models.
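
If the prior information does favor one model, the first line of
equation~\eqref{eq:oddsratio} can be kept in factored form.  As a short
sketch (the symbol $B_{12}$ is introduced here for illustration; the
name \textit{Bayes factor} for this ratio is standard terminology),
\begin{displaymath}
  O_{12} = \frac{P(H_{1}|I)}{P(H_{2}|I)} \; B_{12} \;, \qquad
  B_{12} \equiv \frac{P(D|H_{1},I)}{P(D|H_{2},I)} \;.
\end{displaymath}
When the library contains only these two models and they are
exhaustive, $P(H_{1}|D,I) + P(H_{2}|D,I) = 1$, so the odds ratio
converts directly into a posterior probability,
\begin{displaymath}
  P(H_{1}|D,I) = \frac{O_{12}}{1 + O_{12}} \;.
\end{displaymath}
For example, $O_{12} = 9$ corresponds to $P(H_{1}|D,I) = 0.9$.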
268: 
269: From the odds ratio it is apparent that to compare models in a library
270: only their marginal likelihoods need to be calculated.  The
271: likelihoods are found by marginalizing, over all model parameters, the
272: joint distribution for the data and the model parameters,
273: \begin{equation} \label{eq:evidence}
274:   P(D|H_{\alpha}, I) = \int P(D, \vec{\lambda}_{\alpha} | H_{\alpha}, I) \;
275:   d\vec{\lambda}_{\alpha} = \int P(\vec{\lambda}_{\alpha}|H_{\alpha},
276:   I) P(D|\vec{\lambda}_{\alpha}, H_{\alpha}, I) \;
277:   d\vec{\lambda}_{\alpha} \;,
278: \end{equation}
279: where the second equality follows from the Product Rule.
280: 
281: If the data is informative, i.e. we have learned something new, then
282: the parameter likelihood function, $P(D|\vec{\lambda}_{\alpha},
283: H_{\alpha}, I)$, will be more peaked than the parameter priors,
284: $P(\vec{\lambda}_{\alpha} | H_{\alpha}, I)$.  Figure~\ref{fig:occam}
285: illustrates this for a one dimensional model.
286: \begin{figure}
287:   \includegraphics[height=0.27\textheight]{OccamFactorBW.eps}
288:   \caption{A pictorial representation for the origins of Occam factors
289:     in Bayesian model comparisons.}
290:   \label{fig:occam}
291: \end{figure}
292: In this instance we can estimate the marginal likelihood as
293: \begin{equation}
294:   P(D|H_{\alpha},I) \approx P(D | \lambda_{ML}, H_{\alpha}, I) \left[
295:   P(\lambda_{ML}|H_{\alpha}, I) \; \delta\lambda \right] \;.
296: \end{equation}
297: Here $\lambda_{ML}$ is the parameter value at the maximum likelihood
298: and $\delta\lambda$ is the characteristic width for the parameter
299: likelihood function.  The term in square brackets is an \textit{Occam
300: factor}, a term that naturally penalizes complicated models.  To see
301: this consider a uniform prior, $P(\lambda|H_{\alpha}, I) = (\Delta\lambda)^{-1}$,
302: where $\Delta\lambda$ is the interval width for the range of expected
303: parameter values before the data is collected.  The marginal
304: likelihood is now
305: \begin{equation}
306:   P(D|H_{\alpha},I) \approx P(D | \lambda_{ML}, H_{\alpha}, I) \;
307:   \frac{\delta\lambda}{\Delta\lambda} \;.
308: \end{equation}
309: For informative data the Occam factor is always less than unity.
310: Consequently, for a complicated model to be favored over a simpler
311: one, the data must justify it by having a correspondingly larger value
312: for the parameter likelihood function.
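
To make the penalty concrete, consider a hypothetical comparison (the
models and numbers are assumptions chosen for illustration) between a
model $H_{0}$ with no free parameters and a model $H_{1}$ with a single
parameter $\lambda$ and a uniform prior of width $\Delta\lambda$.  With
equal model priors the odds ratio is approximately
\begin{displaymath}
  O_{10} \approx \frac{P(D|\lambda_{ML}, H_{1}, I)}{P(D|H_{0}, I)} \;
  \frac{\delta\lambda}{\Delta\lambda} \;.
\end{displaymath}
If the data constrain $\lambda$ to one percent of its prior range,
$\delta\lambda / \Delta\lambda = 0.01$, then the more complicated model
is favored ($O_{10} > 1$) only when its maximum likelihood exceeds the
likelihood of the simpler model by more than a factor of $100$.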
313: 
314: The preceding argument is readily extended to multiple dimensions.
315: If the model has more than one parameter, then there is a
316: corresponding Occam factor for each parameter,
317: \begin{equation} \label{eq:apprxpost}
318:   P(D|H_{\alpha},I) \approx P(D | \vec{\lambda}_{ML}, H_{\alpha}, I)
319:   \; \frac{\delta\lambda_{1}}{\Delta\lambda_{1}} \cdots
320:   \frac{\delta\lambda_{i}}{\Delta\lambda_{i}} \;,
321: \end{equation}
322: where $i$ is the number of parameters.
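
Because the Occam factors multiply, their combined effect is easiest to
see on a logarithmic scale.  Taking the logarithm of
equation~\eqref{eq:apprxpost} gives, as a sketch,
\begin{displaymath}
  \ln P(D|H_{\alpha},I) \approx \ln P(D | \vec{\lambda}_{ML},
  H_{\alpha}, I) + \sum_{k=1}^{i} \ln \left(
  \frac{\delta\lambda_{k}}{\Delta\lambda_{k}} \right) \;,
\end{displaymath}
where each term in the sum is negative for informative data.  Under the
illustrative assumption that every parameter is constrained to one
tenth of its prior range, the sum equals $-i \ln 10$.  Each additional
parameter then costs a factor of ten in the marginal likelihood unless
it is offset by a corresponding increase in the maximum likelihood.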
323: 
324: As a last point of emphasis, it is not enough to perform a parameter
325: estimation analysis and find that $\lambda_{i} = 0$, therefore ruling
326: out the model that includes $\lambda_{i}$.  Doing so would neglect the
327: Occam factors that arise in Bayesian model selection and are not
328: present in a parameter estimation analysis, even a Bayesian analysis.
329: 
330: 
331: 
332: \section{WHITE DWARF TRANSFORM} \label{sec:wdtrans}
333: 
334: As a conceptually trivial but applicable example of Bayesian model
335: selection for the LISA mission, consider the detection of a
336: supermassive black hole binary inspiral.  For black hole binaries with
337: component masses in the range of $10^{4}$--$10^{7}~\textrm{M}_{\odot}$, LISA
338: will observe the binary evolution as the binary sweeps through
339: frequencies from $\sim\!0.01$~mHz up to a few millihertz (depending on
340: the actual masses).  In this same range of frequencies is the
341: gravitational wave background formed from the $\sim\!10^{8}$
342: stellar-mass binaries in our own galaxy.  As the black holes inspiral, their
343: detected signal will overlap with the collective galactic background
344: signal.  Moreover, at any instant of time the black hole binary looks
345: like a monochromatic binary.  That is, as a supermassive black hole
346: binary with a time to coalescence of $t_{c}$ sweeps past a galactic
347: binary of period $T$, the two signals have a significant overlap for
348: an interval equal to the geometric mean of $t_{c}$ and
349: $T$~\cite{Cornish:2005}.  Consequently the black hole inspiral signal
350: may be decomposed into a population of monochromatic galactic
351: binaries.  Such a process is often referred to as a \textit{white
352: dwarf transform}.
353: 
354: For a gravitational wave data analyst the task is to select which of
355: two models is more plausible.  The models under consideration are
356: \begin{eqnarray*}
357:   H_{WD} &=& \left( \begin{array}{l} \text{the detected signal is from
358:   a population} \\ \text{of monochromatic galactic binaries}
359:   \end{array} \right) \\
360:   H_{BH} &=& \left( \begin{array}{l} \text{the detected signal is from
361:   a single} \\ \text{supermassive black hole binary} \end{array}
362:   \right) \;.
363: \end{eqnarray*}
364: Model $H_{WD}$ is parameterized by $7N$ variables, where $N$ is the
365: number of binaries required to describe the apparent inspiral signal.
366: For an inspiral signal between $0.01$ and 1~mHz, $N$ is on the order
367: of $10^{4}$ assuming a binary per frequency bin and for a one year
368: observation\footnote{A frequency bin $\Delta f$ is equal to one over the
369: observation time, $\Delta f = T_{\mathrm{obs}}^{-1}$.  For a one year observation,
370: which is used here, $\Delta f = 3.2 \times 10^{-8}$~Hz.}.  Conversely,
371: model $H_{BH}$ is characterized by only seventeen parameters.
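
The quoted order of magnitude for $N$ follows from counting frequency
bins.  As a rough sketch, assuming exactly one binary per bin across
the band and writing $f_{\min}$ and $f_{\max}$ (symbols introduced here
for illustration) for the band edges,
\begin{displaymath}
  N \approx \frac{f_{\max} - f_{\min}}{\Delta f}
  = \frac{(10^{-3} - 10^{-5})~\textrm{Hz}}{3.2 \times 10^{-8}~\textrm{Hz}}
  \approx 3 \times 10^{4} \;,
\end{displaymath}
so that $H_{WD}$ carries roughly $7N \approx 2 \times 10^{5}$
parameters, compared with seventeen for $H_{BH}$.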
372: 
373: Estimating the posterior probabilities using
374: equation~\eqref{eq:apprxpost} quickly leads to the conclusion that the
375: large parameter space associated with the white dwarf population model
376: has associated with it an overwhelming number of Occam factors.  These
377: Occam factors penalize the white dwarf population model and in turn
378: make the plausibility for the model extremely low.  The black hole
379: model, on the other hand, only has seventeen Occam factors and
380: therefore is not as severely penalized.  Consequently, although an
381: ensemble of galactic binaries could conspire to look like a
382: supermassive black hole binary inspiral, the relative probability for
383: such a model is many orders of magnitude less than a model that
384: contains a single black hole binary.
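
The size of the penalty can be sketched with
equation~\eqref{eq:apprxpost} under two simplifying assumptions made
here only for illustration: the two models reach comparable maximum
likelihoods, and every parameter contributes the same Occam factor,
$\delta\lambda / \Delta\lambda = 0.1$.  With equal model priors the odds
ratio is then approximately
\begin{displaymath}
  O_{WD,BH} \approx \frac{(0.1)^{7N}}{(0.1)^{17}} = 10^{-(7N - 17)} \;.
\end{displaymath}
For $N \sim 10^{4}$ the exponent is of order $-7 \times 10^{4}$.  The
precise value depends entirely on the assumed Occam factors, but any
per-parameter factor below unity leads to the same qualitative
conclusion.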
385: 
386: 
387: 
388: \section{CONCLUDING REMARKS} \label{sec:conclusions}
389: 
390: The white dwarf transform is an obvious application of Bayesian model
391: selection.  More informative and interesting examples include using
392: Bayesian model selection as a criterion for deciding when a signal is
393: present in the data; characterizing complicated but detected signals
394: that have low signal-to-noise ratios; and counting the number of
395: detectable galactic binaries within the larger population.  The first
396: application is simply answering the question, when does the data
397: justify declaring a detection for a particular waveform?  The second
398: application is concerned with deciding the information content from a
399: weak signal.  That is, what features of an emitting system are
400: actually measurable and what features are lost to the noise?  Counting
401: the number of detectable galactic binaries is one of the few Bayesian
402: model selection applications used in the LISA
403: literature~\cite{Umstatter:2005a, Stroeer:2006}.  Embedded within
404: Reversible Jump Markov Chain Monte Carlo techniques is the use of odds
405: ratios in deciding the number of galactic binaries that are
406: detectable.
407: 
408: In general, Bayesian model selection gives a logical and quantitative
409: approach to directly comparing competing models.  By using a model
410: selection procedure we are able to maximize the amount of information
411: we can extract from LISA's data.  The most plausible model is the one
412: that is most justified by the data and our state of knowledge
413: prior to the experiment.  As progress is made in the development of
414: LISA analysis routines it is conceivable that Bayesian approaches will
415: be a central tool.
416: 
417: 
418: 
419: %==== Acknowledgments =============================
420: 
421: \begin{theacknowledgments}
422: The author would like to thank Edward Cazalas, Matthew Francis, and
423: Deirdre Shoemaker for a number of helpful discussions.  The author also
424: thanks Lee Samuel Finn for introducing the author to the Bayesian
425: approach and for guidance on a number of its subtler points.  This work was
426: supported by the Center for Gravitational Wave Physics.  The Center
427: for Gravitational Wave Physics is funded by the National Science
428: Foundation under cooperative agreement PHY 01-14375.
429: \end{theacknowledgments}
430: 
431: 
432: 
433: %==== Bibliography ================================
434: 
435: \bibliographystyle{aipproc}
436: \bibliography{References}
437: 
438: 
439: \end{document}
440: