1: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2: %% %%
3: %% TITLE: When is Enough Good Enough in Source Modeling? %%
4: %% AUTHOR: Louis J. Rubbo %%
5: %% DATE: August 28, 2006 %%
6: %% %%
7: %% NOTES: This document requires the aipproc package to compile. %%
8: %% The package is available at the AIP conference proceedings web %%
9: %% page at %%
10: %% %%
11: %% http://proceedings.aip.org/proceedings/authors.jsp#latex %%
12: %% %%
13: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
14:
15:
16: \documentclass[final]{aipproc}
17:
18: %#### PACKAGES ####################################
19:
20: \usepackage{amsmath, amssymb, latexsym}
21:
22:
23:
24: %#### LENGTHS #####################################
25:
26: % Page size
27: \layoutstyle{6x9}
28:
29: % This call sets the amount of space on either side of the equals sign
30: % in the equation array environment
31: \setlength{\arraycolsep}{2pt}
32:
33:
34:
35: %#### NEW COMMANDS ################################
36:
37:
38:
39: %#### INFORMATION #################################
40:
41: \begin{document}
42:
43: \title{When is Enough Good Enough\\in Gravitational Wave Source Modeling?}
44:
45: \classification{02.50.Cw, 04.80.Nn, 05.45.Tp}
46:
47: \keywords{gravitational waves --- methods: data analysis}
48:
49: \author{Louis J. Rubbo}{
50: address={Center for Gravitational Wave Physics, 104 Davey Lab,
51: University Park, PA 16802} }
52:
53:
54:
55: %#### MAIN DOCUMENT ###############################
56:
57: %==== Abstract ====================================
58:
59: \begin{abstract}
60: A typical approach to developing an algorithm for analyzing
61: gravitational wave data is to assume a particular waveform and use its
62: characteristics to formulate a detection criterion. Once a detection
63: has been made, the algorithm uses those same characteristics to tease
64: out parameter estimates from a given data set. While an obvious
65: starting point, such an approach is initiated by assuming a single,
66: correct model for the waveform regardless of the signal strength,
67: observation length, noise, etc. This paper introduces the method of
68: Bayesian model selection as a way to select the most plausible
69: waveform model from a set of models given the data and prior
70: information. The discussion is set in the scientific context of the
71: proposed Laser Interferometer Space Antenna.
72: \end{abstract}
73:
74: \maketitle
75:
76:
77:
78: %==== Main matter =================================
79:
80: \section{INTRODUCTION} \label{sec:intro}
81:
82: The anticipated data from the proposed Laser Interferometer Space
83: Antenna (LISA) introduces a number of exciting and original
84: challenges. Central among these challenges is the development of data
85: analysis routines capable of coaxing out and characterizing individual
86: signals from the noisy time series LISA will return. A great deal of
87: work has already been invested into the development of algorithms
88: applicable to the LISA data. While a number of these algorithms have
89: demonstrated favorable capabilities on simulated data, each makes an
90: initial assumption about the functional form for the waveform under
91: consideration. This paper introduces the use of Bayesian model
92: selection as a quantitative method for selecting the waveform model.
93: Using Bayes' theorem we show how the data and prior information pick
94: out the most plausible model from a set of proposed models.
95:
96: Gravitational wave data analysis can be loosely described as a three
97: step process as depicted in figure~\ref{fig:flowchart}. In the first
98: step, a signal is detected within a set of noisy time streams
99: retrieved from the detector. In step two, the signal is characterized
100: by producing estimates for the parameterization variables. Finally,
101: step three is to make physical interpretations based on the estimated
102: parameter values. These steps are not necessarily mutually exclusive.
103: There are no obvious boundaries and areas of overlap do exist.
104: However, each step is necessary when analyzing a detected signal.
105: \begin{figure}
106: \includegraphics[height=.27\textheight]{DataAnalysisBW.eps}
107: \caption{Data analysis flow chart.}
108: \label{fig:flowchart}
109: \end{figure}
110:
111: In making the transition from detection to characterization (and quite
112: often in the detection process itself) a particular waveform is
113: assumed prior to the investigation. While an obvious assumption to make
114: in the early developmental stages for an algorithm, it can lead to
115: needless complications and even misidentifications. For example, if a
116: signal is characterized by a low signal-to-noise ratio, some of the
117: intricate waveform features can be lost in the noise and therefore a
118: simpler model would have sufficed in the analysis. In the Bayesian
119: model selection approach presented here, the data and prior
120: information justify the selection of a particular waveform model by
121: calculating the most plausible model from a proposed library of
122: models.
123:
124: Bayesian model selection is not a new methodology, but it is one that
125: has not been fully adopted by the still infant gravitational wave
126: community. The aim of this paper is to briefly summarize the theory
127: and to discuss possible applications for analyzing the LISA data. To
128: this end, the paper first introduces the rules of probability theory,
129: including a derivation of Bayes' theorem. It then outlines the
130: necessary calculations for performing a model selection procedure.
131: From here we give a simple, qualitative example of its use for the
132: LISA data. We conclude by suggesting a few other applications
133: associated with LISA.
134:
135:
136:
137: \section{BAYESIAN STATISTICS} \label{sec:bayes}
138:
139: \subsection{Rules of Probability Theory} \label{sec:rules}
140:
141: We begin by introducing a notation first used by
142: Jeffreys~\cite{Jeffreys:1961}. We will denote the statement ``the
143: probability that proposition $A$ is true given proposition $B$'' as
144: $P(A|B)$. Similarly, ``the joint probability that both $A$ and $B$
145: are true given $C$'' is denoted by $P(A,B|C)$. The notation ``$|C)$''
146: indicates the condition that proposition $C$ is assumed to be true. In
147: Bayesian statistics probability statements such as $P(A)$ are not
148: clear because they do not explicitly state their dependencies.
149: Furthermore, \textit{all} probabilities are conditional.
150:
151: Starting with the desiderata that degrees of plausibility are
152: represented by real numbers, that the rules for manipulating
153: plausibility statements agree with common sense, and that they be
154: consistent, it is possible to show that only two rules are
155: required for manipulating probabilities~\cite{Cox:1946}: the Sum Rule,
156: \begin{equation} \label{eq:sum_rule}
157: P(A+B|C) = P(A|C) + P(B|C) - P(A, B|C)
158: \end{equation}
159: where the plus sign inside the probability argument means ``or'', and
160: the Product Rule,
161: \begin{equation} \label{eq:multi_rule}
162: P(A,B|C) = P(A|C) P(B|A,C) \;.
163: \end{equation}
164: By standard Aristotelian logic it must be the case that $P(A,B|C) =
165: P(B,A|C)$. Consequently, the Product Rule may be re-expressed as
166: \begin{equation}
167: P(B,A|C) = P(B|C) P(A|B,C) \;.
168: \end{equation}
169: Equating the last two expressions results in Bayes' theorem,
170: \begin{equation} \label{eq:Bayes}
171: P(A|B,C) = P(A|C) \; \frac{P(B|A,C)}{P(B|C)} \;.
172: \end{equation}
173: Although Bayes' theorem receives the accolades, it is simply a
174: consistency statement for the Product Rule.
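
As a brief sketch of the intermediate step, the symmetry $P(A,B|C) =
P(B,A|C)$ allows the two forms of the Product Rule to be set equal,
\begin{displaymath}
  P(A|C) \, P(B|A,C) = P(B|C) \, P(A|B,C) \;,
\end{displaymath}
and dividing both sides by $P(B|C)$, which is assumed here to be
nonzero, recovers equation~\eqref{eq:Bayes}.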
175:
176: In words, Bayes' theorem is often stated as
177: \begin{displaymath}
178: \textrm{Posterior} = \textrm{Prior} \;
179: \frac{\textrm{Marginal Likelihood}}{\textrm{Global Likelihood}} \;.
180: \end{displaymath}
181: In this form it is evident that Bayes' theorem quantitatively
182: describes a learning process. We start with a prior state of
183: knowledge about proposition $A$ when $C$ is assumed true, $P(A|C)$.
184: We then gain new information $B$, which in turn updates our final
185: state of knowledge as given by the posterior probability, $P(A|B,C)$.
186: The proportionality factor between our prior and posterior states of
187: knowledge is a normalized statement about how likely it is that
188: proposition $B$ occurs given that both $A$ and $C$ are true.
189:
190: While Bayes' theorem is a useful byproduct of the Product Rule, the
191: use of the Sum Rule is equally important. It is through the Sum Rule
192: that we are able to take a joint probability of multiple propositions,
193: and reduce it to a distribution of a smaller subset of the larger
194: joint distribution. For example, consider the joint distribution
195: between $A$ and a set of $n$ exhaustive $B_{i}$'s, given prior
196: information $I$. From the Sum Rule we have
197: \begin{eqnarray}
198: P(A, \sum_{i=1}^{n} B_{i} | I) &=& P(A|I) \nonumber\\
199: &=& P(A, B_{1} | I) + P(A, \sum_{i=2}^{n} B_{i} | I) - P(A, B_{1},
200: \sum_{i=2}^{n} B_{i} | I) \;,
201: \end{eqnarray}
202: where the first equality follows from the Product Rule and the fact
203: that the $B_{i}$'s are exhaustive. If the $B_{i}$'s are mutually
204: exclusive, that is, only one value can be realized at a time, then the
205: last term is zero. Repeated application of the Sum Rule leads to
206: \begin{equation}
207: P(A|I) = \sum_{i=1}^{n} P(A, B_{i} | I) \;.
208: \end{equation}
209: When the $B_{i}$'s take on continuous values the above goes over to an
210: integral,
211: \begin{equation}
212: P(A | I) = \int P(A, B | I) \, dB \;.
213: \end{equation}
214: The process which we have just described is referred to as
215: \textit{marginalization}. In it we have removed a \textit{nuisance
216: parameter}, $B$, from a joint distribution by a repeated application
217: of the Sum Rule.
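
As an illustration with purely hypothetical numbers, suppose that
$n = 2$, that $B_{1}$ and $B_{2}$ are mutually exclusive and
exhaustive with $P(B_{1}|I) = P(B_{2}|I) = 0.5$, and that
$P(A|B_{1},I) = 0.8$ and $P(A|B_{2},I) = 0.2$. Expanding each joint
probability with the Product Rule, marginalization gives
\begin{displaymath}
  P(A|I) = P(B_{1}|I) \, P(A|B_{1},I) + P(B_{2}|I) \, P(A|B_{2},I)
  = (0.5)(0.8) + (0.5)(0.2) = 0.5 \;.
\end{displaymath}
With this normalization in hand, Bayes' theorem updates the
plausibility of $B_{1}$ once $A$ is known to be true,
\begin{displaymath}
  P(B_{1}|A,I) = P(B_{1}|I) \; \frac{P(A|B_{1},I)}{P(A|I)}
  = 0.5 \times \frac{0.8}{0.5} = 0.8 \;.
\end{displaymath}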
218:
219:
220: \subsection{Model Selection} \label{sec:ModelSel}
221:
222: In model selection the central question that is being addressed is the
223: following: ``Given a particular set of data, and prior information,
224: which hypothesis from a library $\mathcal{L} \equiv \{H_{1}, \ldots,
225: H_{\ell}\}$ of hypotheses is the most plausible?'' Key to this
226: question are the ideas that all prior information is included and that
227: the most plausible hypothesis is based on the given data. The
228: hypotheses within a library are either assumed to be exhaustive or, by
229: a careful choice in models, the space is made
230: so~\cite{Bretthorst:1996}.
231:
232: A model itself consists of a functional form dependent on a vector of
233: parameters $\vec{\lambda}$, and two probability
234: distributions~\cite{MacKay:1992}. The first distribution describes
235: the probability distribution for the parameter values given the model
236: prior to the new data, $P(\vec{\lambda}|H_{\alpha})$. This is a key
237: point; two models are distinct even if they have the same
238: parameterization but different priors about how those parameters are
239: believed to be distributed. The second distribution is the
240: probability of a data set given the model and a particular set of
241: parameter values, $P(D|\vec{\lambda}, H_{\alpha})$.
242:
243: From Bayes' theorem~\eqref{eq:Bayes}, the posterior probability for a
244: particular model is given by
245: \begin{equation}
246: P(H_{\alpha}|D, I) = P(H_{\alpha}|I) \; \frac{P(D|H_{\alpha}, I)}{
247: P(D| I)} \;,
248: \end{equation}
249: where $I$ symbolizes our unenumerated prior information. The
250: denominator can be viewed as a normalization constant,
251: \begin{equation}
252: P(D|I) = \sum_{\alpha = 1}^{\ell} P(H_{\alpha}|I) P(D|H_{\alpha}, I)
253: \;.
254: \end{equation}
255: By investigating the \textit{odds ratio} between two competing models,
256: we can eliminate the need to calculate the normalization constant,
257: \begin{eqnarray} \label{eq:oddsratio}
258: O_{12} &=& \frac{P(H_{1}|D,I)}{P(H_{2}|D,I)} =
259: \frac{P(H_{1}|I) P(D|H_{1},I)}{P(H_{2}|I) P(D|H_{2},I)} \nonumber\\
260: &=& \frac{P(D|H_{1},I)}{P(D|H_{2},I)} \;.
261: \end{eqnarray}
262: The second line arises by assuming that our prior information does not
263: favor one model over the other. The odds ratio gives us a means to
264: directly compare competing models. If our library contains more than
265: two models, one model may be used as a reference. For example, the
266: reference model may be a constant (i.e., a no-signal-present model),
267: while the remaining library contains a spectrum of waveform models.
268:
269: From the odds ratio it is apparent that to compare models in a library
270: only their marginal likelihoods need to be calculated. The
271: likelihoods are found by marginalizing, over all model parameters, the
272: joint distribution for the data and the model parameters,
273: \begin{equation} \label{eq:evidence}
274: P(D|H_{\alpha}, I) = \int P(D, \vec{\lambda}_{\alpha} | H_{\alpha}, I) \;
275: d\vec{\lambda}_{\alpha} = \int P(\vec{\lambda}_{\alpha}|H_{\alpha},
276: I) P(D|\vec{\lambda}_{\alpha}, H_{\alpha}, I) \;
277: d\vec{\lambda}_{\alpha} \;,
278: \end{equation}
279: where the second equality follows from the Product Rule.
280:
281: If the data is informative, i.e. we have learned something new, then
282: the parameter likelihood function, $P(D|\vec{\lambda}_{\alpha},
283: H_{\alpha}, I)$, will be more peaked than the parameter priors,
284: $P(\vec{\lambda}_{\alpha} | H_{\alpha}, I)$. Figure~\ref{fig:occam}
285: illustrates this for a one dimensional model.
286: \begin{figure}
287: \includegraphics[height=0.27\textheight]{OccamFactorBW.eps}
288: \caption{A pictorial representation for the origins of Occam factors
289: in Bayesian model comparisons.}
290: \label{fig:occam}
291: \end{figure}
292: In this instance we can estimate the marginal likelihood as
293: \begin{equation}
294: P(D|H_{\alpha},I) \approx P(D | \lambda_{ML}, H_{\alpha}, I) \left[
295: P(\lambda_{ML}|H_{\alpha}, I) \; \delta\lambda \right] \;.
296: \end{equation}
297: Here $\lambda_{ML}$ is the parameter value at the maximum likelihood
298: and $\delta\lambda$ is the characteristic width for the parameter
299: likelihood function. The term in square brackets is an \textit{Occam
300: factor}, a term that naturally penalizes complicated models. To see
301: this consider a uniform prior, $P(\lambda|H_{\alpha}, I) = (\Delta\lambda)^{-1}$,
302: where $\Delta\lambda$ is the interval width for the range of expected
303: parameter values before the data is collected. The marginal
304: likelihood is now
305: \begin{equation}
306: P(D|H_{\alpha},I) \approx P(D | \lambda_{ML}, H_{\alpha}, I) \;
307: \frac{\delta\lambda}{\Delta\lambda} \;.
308: \end{equation}
309: For informative data the Occam factor is always less than unity.
310: Consequently, for a complicated model to be favored over a simpler
311: one, the data must justify it by having a correspondingly larger value
312: for the parameter likelihood function.
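
To make the trade-off concrete, consider a hypothetical comparison
(not taken from any particular analysis) between a model $H_{1}$ with
no free parameters and a model $H_{2}$ with a single parameter
$\lambda$ and a uniform prior. With equal model priors, the odds ratio
in favor of the more complicated model is approximately
\begin{displaymath}
  O_{21} \approx \frac{P(D | \lambda_{ML}, H_{2}, I)}{P(D | H_{1}, I)}
  \; \frac{\delta\lambda}{\Delta\lambda} \;.
\end{displaymath}
If, for instance, $\delta\lambda / \Delta\lambda = 10^{-2}$, then
$H_{2}$ is preferred only when its maximum likelihood exceeds the
likelihood of $H_{1}$ by more than a factor of $100$.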
313:
314: The preceding argument is quickly extended to multiple dimensions.
315: If the model has more than one parameter, then there is a
316: corresponding Occam factor for each parameter,
317: \begin{equation} \label{eq:apprxpost}
318: P(D|H_{\alpha},I) \approx P(D | \vec{\lambda}_{ML}, H_{\alpha}, I)
319: \; \frac{\delta\lambda_{1}}{\Delta\lambda_{1}} \cdots
320: \frac{\delta\lambda_{i}}{\Delta\lambda_{i}} \;,
321: \end{equation}
322: where $i$ is the number of parameters.
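
Because the factors multiply, it is often convenient to work with
logarithms. Under the same uniform prior assumption, taking the
natural logarithm of equation~\eqref{eq:apprxpost} gives
\begin{displaymath}
  \ln P(D|H_{\alpha},I) \approx \ln P(D | \vec{\lambda}_{ML},
  H_{\alpha}, I) + \sum_{k=1}^{i} \ln
  \frac{\delta\lambda_{k}}{\Delta\lambda_{k}} \;,
\end{displaymath}
where $k$ is an index introduced here for illustration. Each term in
the sum is negative for informative data, so every additional
parameter lowers the log marginal likelihood unless the maximum
likelihood rises enough to compensate.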
323:
324: As a last point of emphasis, it is not enough to perform a parameter
325: estimation analysis, find that $\lambda_{i} = 0$, and thereby rule
326: out the model that includes $\lambda_{i}$. Doing so would neglect the
327: Occam factors that arise in Bayesian model selection and are not
328: present in a parameter estimation analysis, even a Bayesian analysis.
329:
330:
331:
332: \section{WHITE DWARF TRANSFORM} \label{sec:wdtrans}
333:
334: As a conceptually trivial but applicable example of Bayesian model
335: selection for the LISA mission, consider the detection of a
336: supermassive black hole binary inspiral. For black hole binaries with
337: component masses in the range of $10^{4}$--$10^{7}~\textrm{M}_{\odot}$, LISA
338: will observe the binary evolution as the binary sweeps through
339: frequencies from $\sim\!0.01$~mHz up to a few millihertz (depending on
340: the actual masses). In this same range of frequencies is the
341: gravitational wave background formed from the $\sim\!10^{8}$
342: solar-mass binaries in our own galaxy. As the black holes inspiral, their
343: detected signal will overlap with the collective galactic background
344: signal. Moreover, at any instant of time the black hole binary looks
345: like a monochromatic binary. That is, as a supermassive black hole
346: binary with a time to coalescence of $t_{c}$ sweeps past a galactic
347: binary of period $T$, the two signals have a significant overlap for
348: an interval equal to the geometric mean of $t_{c}$ and
349: $T$~\cite{Cornish:2005}. Consequently the black hole inspiral signal
350: may be decomposed into a population of monochromatic galactic
351: binaries. Such a process is often referred to as a \textit{white
352: dwarf transform}.
353:
354: For a gravitational wave data analyst the task is to select which of
355: two models is more plausible. The models under consideration are
356: \begin{eqnarray*}
357: H_{WD} &=& \left( \begin{array}{l} \text{the detected signal is from
358: a population} \\ \text{of monochromatic galactic binaries}
359: \end{array} \right) \\
360: H_{BH} &=& \left( \begin{array}{l} \text{the detected signal is from
361: a single} \\ \text{supermassive black hole binary} \end{array}
362: \right) \;.
363: \end{eqnarray*}
364: Model $H_{WD}$ is parameterized by $7N$ variables, where $N$ is the
365: number of binaries required to describe the apparent inspiral signal.
366: For an inspiral signal between $0.01$ and 1~mHz, $N$ is on the order
367: of $10^{4}$ assuming a binary per frequency bin and for a one year
368: observation\footnote{A frequency bin $\Delta f$ is equal to one over the
369: observation time, $\Delta f = T^{-1}$. For a one year observation,
370: which is used here, $\Delta f = 3.2 \times 10^{-8}$~Hz.}. Conversely,
371: model $H_{BH}$ is characterized by only seventeen parameters.
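
As a rough check on this estimate, dividing the bandwidth by the bin
width quoted in the footnote gives
\begin{displaymath}
  N \approx \frac{1~\textrm{mHz} - 0.01~\textrm{mHz}}{\Delta f}
  = \frac{9.9 \times 10^{-4}~\textrm{Hz}}{3.2 \times
    10^{-8}~\textrm{Hz}} \approx 3 \times 10^{4} \;,
\end{displaymath}
so that $H_{WD}$ carries roughly $7N \approx 2 \times 10^{5}$
parameters, compared with seventeen for $H_{BH}$.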
372:
373: Estimating the marginal likelihoods using
374: equation~\eqref{eq:apprxpost} quickly leads to the conclusion that the
375: large parameter space associated with the white dwarf population model
376: has associated with it an overwhelming number of Occam factors. These
377: Occam factors penalize the white dwarf population model and in turn
378: make the plausibility for the model extremely low. The black hole
379: model, on the other hand, only has seventeen Occam factors and
380: therefore is not as severely penalized. Consequently, although an
381: ensemble of galactic binaries could conspire to look like a
382: supermassive black hole binary inspiral, the relative probability for
383: such a model is many orders of magnitude less than that of a model that
384: contains a single black hole binary.
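
The size of this penalty can be sketched under two simplifying
assumptions that are made here purely for illustration: both models
achieve comparable maximum likelihoods, and every parameter
contributes the same Occam factor $f < 1$. With equal model priors,
equation~\eqref{eq:apprxpost} then gives
\begin{displaymath}
  O_{WD,BH} \approx \frac{f^{7N}}{f^{17}} = f^{7N - 17} \;.
\end{displaymath}
Taking $N = 10^{4}$ and a hypothetical $f = 0.1$ yields $O_{WD,BH}
\approx 10^{-69983}$. Even a very mild penalty of $f = 0.9$ per
parameter leaves an odds ratio of order $10^{-3200}$ against the
white dwarf population model.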
385:
386:
387:
388: \section{CONCLUDING REMARKS} \label{sec:conclusions}
389:
390: The white dwarf transform is an obvious application of Bayesian model
391: selection. More informative and interesting examples include using
392: Bayesian model selection as a criterion for deciding when a signal is
393: present in the data; characterizing complicated but detected signals
394: that have low signal-to-noise ratios; and counting the number of
395: detectable galactic binaries within the larger population. The first
396: application is simply answering the question, when does the data
397: justify declaring a detection for a particular waveform? The second
398: application is concerned with determining the information content of a
399: weak signal. That is, what features of an emitting system are
400: actually measurable and what features are lost to the noise? Counting
401: the number of detectable galactic binaries is one of the few Bayesian
402: model selection applications used in the LISA
403: literature~\cite{Umstatter:2005a, Stroeer:2006}. Embedded within
404: Reversible Jump Markov Chain Monte Carlo techniques is the use of odds
405: ratios in deciding the number of galactic binaries that are
406: detectable.
407:
408: In general, Bayesian model selection gives a logical and quantitative
409: approach to directly comparing competing models. By using a model
410: selection procedure we are able to maximize the amount of information
411: we can extract from LISA's data. The most plausible model is the one
412: that is most justified by the data and our state of knowledge
413: prior to the experiment. As progress is made in the development of
414: LISA analysis routines it is conceivable that Bayesian approaches will
415: be a central tool.
416:
417:
418:
419: %==== Acknowledgments =============================
420:
421: \begin{theacknowledgments}
422: The author would like to thank Edward Cazalas, Matthew Francis, and
423: Deirdre Shoemaker for a number of helpful discussions. The author also
424: thanks Lee Samuel Finn for an introduction to the Bayesian approach and
425: for guidance on a number of its subtler points. This work was
426: supported by the Center for Gravitational Wave Physics. The Center
427: for Gravitational Wave Physics is funded by the National Science
428: Foundation under cooperative agreement PHY 01-14375.
429: \end{theacknowledgments}
430:
431:
432:
433: %==== Bibliography ================================
434:
435: \bibliographystyle{aipproc}
436: \bibliography{References}
437:
438:
439: \end{document}
440: