1: \documentclass[letter,useAMS,usenatbib]{mn2e}
2: 
3: \usepackage[english]{babel} \usepackage{subfigure}
4: \usepackage{graphicx}
5: 
6: \usepackage[fleqn]{amsmath}
7: \usepackage{color}
8: 
9: \usepackage[varg]{txfonts}
10: 
11: \citestyle{aa}
12: 
13: \bibliographystyle{mn2e}
14: 
15: \topmargin -1.3cm 
16: 
17: %%%%%%%%%%%%%%% Author definitions %%%%%%%%%%%%%%%%%%%%%%
18: %%%%% 1. Journals
19: 
20: \newcommand{\aj}{AJ} % Astronomical Journal 
21: \newcommand{\aap}{A\&A} % Astronomy and Astrophysics 
22: \newcommand{\aaps}{A\&AS} % Astronomy and Astrophysics Supplement Series 
23: \newcommand{\apj}{ApJ} % Astrophysical Journal 
24: \newcommand{\apjs}{ApJS} % Astrophysical Journal Supplement Series 
25: \newcommand{\apjl}{ApJL} % Astrophysical Journal Letters
26: \newcommand{\araa}{ARAA} % Annual Reviews in Astronomy and Astrophysics 
27: \newcommand{\mnras}{MNRAS} % Monthly Notices of the Royal Astronomical Society
28: 
29: \newcommand{\noi}{\noindent}
30: 
31: %%%%%%%%%%%%%%% Title %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
32: 
33: \title[Strong lensing on adaptive grids]{Bayesian strong gravitational-lens modelling on adaptive
34: grids:\\ objective detection of mass substructure in galaxies}
35: 
36: \author[S. Vegetti \& L. V. E.  Koopmans.]{ Simona Vegetti\thanks{E-mail:
37:     vegetti@astro.rug.nl} \& L. V. E.  Koopmans\\ Kapteyn
38:     Astronomical Institute, University of Groningen, PO Box 800,
39:     9700\,AV Groningen, the Netherlands}
40: 
41: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
42: 
43: \begin{document}
44:   
45:   \date{Accepted for publication in MNRAS}
46:   
47:   \pagerange{\pageref{firstpage}--\pageref{lastpage}} \pubyear{2008}
48:   
49:   \maketitle
50:   
51:   \label{firstpage}
52:   
53:   \begin{abstract}
54:     
55:     We introduce a new adaptive and fully Bayesian grid-based method
56:     to model strong gravitational lenses with extended images. The
57:     primary goal of this method is to quantify the level of luminous
58:     and dark-mass substructure in massive galaxies, through their
59:     effect on highly-magnified arcs and Einstein rings. The method is
60:     adaptive on the source plane, where a Delaunay tessellation is
61:     defined according to the lens mapping of a regular grid onto the
62:     source plane. The Bayesian penalty function allows us to recover
63:     the best non-linear potential-model parameters and/or a grid-based
64:     potential correction and to objectively quantify the level of
65:     regularization for both the source and the potential. In addition,
66:     we implement a Nested-Sampling technique to quantify the
67:     errors on all non-linear mass model parameters -- marginalized
68:     over all source and regularization parameters -- and allow an
69:     objective ranking of different potential models in terms of the
70:     marginalized evidence. In particular, we are interested in
71:     comparing very smooth lens mass models with ones that
72:     contain mass-substructures. The algorithm has been tested on a range
73:     of simulated data sets, created from a model of a realistic
74:     lens system. One of the lens systems is characterized by a smooth
75:     potential with a power-law density profile, twelve 
76:     include a Navarro, Frenk and White (NFW) dark-matter substructure of different masses and at
77:     different positions and one contains two NFW dark substructures with the 
78:     same mass but with different positions. 
79:     Reconstruction of the source and of the lens
80:     potential for all of these systems shows the method is able, in a
81:     realistic scenario, to identify perturbations with masses $\ga 10^7\rm
82:     M_\odot$ when located \emph{on} the Einstein ring. For
83:     positions both inside and outside of the ring, masses of at least
84:     $10^9\rm M_\odot$ are required (i.e. roughly the Einstein ring of
85:     the perturber needs to overlap with that of the main lens). Our
86:     method provides a fully novel and objective test of mass
87:     substructure in massive galaxies.
88: 
89: \end{abstract}
90:   
91:   \begin{keywords}
92:     gravitational lensing --- dark matter --- galaxies: structure ---
93:     galaxies: haloes
94:   \end{keywords}
95:   
96:   
97:   \section{Introduction}
98:   
99:   At the present time, the most popular cosmological model for
100:   structure formation is the $\Lambda \text{CDM}$ paradigm. While this
101:   model has been very successful in describing the Universe on large
102:   scales and in reproducing numerous observational results
103:   \citep[e.g.,][]{Reiss98, Efstathiou02, Burles01,
104:   Philips01, Jaffe01, Percival01, deBernardis02, Hamilton02, Croft02,
105:   Tonry03, Spergel03, Komatsu08}, important discrepancies still
106:   persist on small scales. In particular, some of these involve the
107:   dark matter distribution within galactic haloes
108:   \citep[e.g.,][]{Moore94, Burkert95, McGaugh98,
109:  Binney01, Blok01, deBlok02, McGaugh03, Simon03,Rhee04,Kuzio06} 
110:  and the number of galaxy satellites, i.e.\ the
111:   \emph{Missing Satellite Problem}.
112:   
113:   \noi According to the standard scenario, structures form in a
114:   hierarchical fashion via merging and accretion of smaller objects
115:   \citep{Toomre77, Frenk88, White91, Barnes92, Cole00}. As shown by
116:   the latest numerical simulations, in which high mass and force
117:   resolution is achieved, the progenitor population is only weakly
118:   affected by virialization processes and a large number of sub-haloes
119:   is able to survive after merging. The number of substructures
120:   within the Local Group, however, is predicted to be 1--2 orders of
121:   magnitude higher than what is actually observed
122:   \citep[e.g.,][]{Kauffmann93, Moore99, Klypin99,
123:   Moore01,Diemand07b,Diemand07a}.
124:   
125:   \noi Two different classes of solutions have been suggested to
126:   alleviate this problem: cosmological and astrophysical. Cosmological
127:   solutions address the basis of the $\Lambda \text{CDM}$ paradigm
128:   itself and mostly concentrate on the properties of the dark matter,
129:   allowing, for example, for a warm \citep{Colin00}, decaying
130:   \citep{Cen01}, self-interacting \citep{Spergel00}, repulsive
131:   \citep{Goodman00}, or annihilating nature
132:   \citep{Riotto00}. Alternatively, the $\Lambda \text{CDM}$ picture can
133:   be modified by the introduction of a break in the power spectrum at
134:   small scales \citep[e.g.,][]{Kamionkowski00, Zentner03}.
135:   
136:   \noi From an astrophysical point of view, the number of visible
137:   satellites can be reduced by suppressing the gas collapse/cooling
138:   \citep[e.g.,][]{Bullock00, Kravtsov04, Moore06} via supernova
139:   feedback, photoionization or reionization. This would result in a
140:   high mass-to-light ratio ($M/L$) in the substructures.  If these
141:   high-$M/L$ substructures indeed exist, different methods
142:   for indirect detection are possible. The dark substructure may be
143:   detectable for example through its effects on stellar streams
144:   \citep[e.g.,][]{Ibata02, Mayer02}, via $\gamma$-rays from dark
145:   matter annihilation \citep{Bergstrom99, Calcaneo00, Stoehr03,
146:   Colafrancesco06} or through gravitational lensing \citep[e.g.,][]{Dalal02,
147:   Koopmans05}.
148:     
149:   \noi While the first two approaches are limited to the local
150:   Universe, gravitational lensing allows one to explore the mass
151:   distribution of galaxies outside the Local Group and at a relatively
152:   high redshift. Moreover, gravitational lensing is independent of the
153:   baryonic content, of the dynamical state of the system and of the
154:   nature of dark matter. For example, when a point source in a lens system lies close to a caustic fold or cusp, the image fluxes should sum to zero once the sign of the image parities
155:   is taken into account \citep{Blandford86,Zakharov95}. This relation is, however, violated by 
156:   many observed lensed quasars with cusp and
157:    fold images. 
158:   As first suggested by \citet{Mao98}, these flux ratio anomalies
159:   can be related to the presence of (dark matter) substructure around the
160:   lensing galaxy on scales smaller than the image
161:   separation \citep{Bradac02, Chiba02, Dalal02,
162:   Metcalf02, Keeton03, Kochanek04, Bradac04, Keeton05}.
163:   Nevertheless, subsequent studies of similar
164:   gravitationally lensed systems have shown that
165:   the required mass fraction in substructure is higher than what is
166:   obtained in numerical simulations \citep{Mao04, Maccio06,Diemand07b}. In
167:   addition, for a significant number of cases the observed flux ratio
168:   anomalies can be explained by taking into account the luminous dwarf
169:   satellite population \citep{Trotter00, Ros00,
170:   Koopmans02, Kochanek04, Chen07, McKean07, More08}. Whether the mass fraction
171:   of CDM substructures is quantifiable via flux ratio anomalies is
172:   therefore a question still open for debate. Alternatively,
173:   \citet{Koopmans05} showed that dark matter substructure in lensing
174:   galaxies can be detected by modelling of multiple images or Einstein
175:   rings from extended sources. \\
176:   
177:   \noi In this paper, we develop an adaptive grid-based modelling
178:   code for extended lensed sources and grid-based potentials, to fully
179:   quantify this procedure.  The method presented here is a significant
180:   improvement of the techniques introduced by \citet{Warren03},
181:   \citet{Dye05}, \citet{Koopmans05}, \citet{Suyu106},
182:   \citet{suyu206} and \citet{Brewer06}. In order to detect mass substructure in lens
183:   galaxies one needs to solve simultaneously for the source surface
184:   brightness distribution and the lens potential.  A semilinear
185:   technique for the reconstruction of grid-based sources, given a
186:   parametric lens potential, was first introduced by
187:   \citet{Warren03}. The method was subsequently extended by
188:   \citet{Koopmans05} and  \citet{Suyu106} in order to include a
189:   grid-based potential for the lens and by \citet{Barnabe07} to
190:   include galaxy dynamics. \citet{Dye05} introduced an
191:   adaptive gridding on the source plane, which minimizes the
192:   covariance between pixels and decreases the computational
193:   effort. However, that method still lacks an objective procedure
194:   to quantify the level of regularization. \citet{suyu206} and \citet{Brewer06} encoded the
195:   semi-linear method within the framework of Bayesian statistics
196:   \citep{MacKay92, MacKay03}. Although this is a vast improvement, the fixed
197:   grids do not allow one to take into account the correct number of
198:   degrees of freedom, and proper evidence comparison is difficult.  
199:   In the implementation described here, these issues have
200:   been solved:
201:   
202:   \smallskip
203:   
204:   \noi {\bf (i)} the procedure is fully Bayesian; this allows us to
205:   determine the best set of non-linear parameters for a given
206:   potential and the linear parameters of the source, to objectively
207:   set the level of regularization and to compare/rank different
208:   potential families;
209:   
210:   \smallskip
211:   
212:   \noi {\bf (ii)} using a Delaunay tessellation, the source grid
213:   automatically adapts in such a way that the computational effort
214:   is mostly concentrated in high magnification regions;
215:   
216:   
217:   \smallskip
218:   
219:   \noi {\bf (iii)} the source-grid triangles are re-computed at every
220:   step of the modelling so that the source and the image plane always
221:   perfectly map onto each other and the number of degrees of freedom
222:   remains constant during Bayesian evidence maximisation.
223:   
224:   \smallskip
225:   
226:   \noi For the first time in the framework of grid-based lensing
227:   modelling, we use the Nested-Sampling technique by
228:   \citet{Skilling04} to compute the full marginalized Bayesian
229:   evidence of the data \citep{MacKay92, MacKay03}.  This approach not
230:   only provides statistical errors on the lens parameters, but also
231:   consistently quantifies the relative evidence of a smooth potential
232:   against one containing substructures.  As such, our method
233:   provides a fully objective way to rank these two hypotheses given
234:   the data, which is the goal set out in this paper.
235:   
236:   \noi The paper is organized as follows. In Section 2 we give a
237:   general overview of the data model. In Section 3 we present in
238:   detail how the data model can be inverted and the source and lens
239:   potential reconstructed.  In Section 4 we review the basics of
240:   Bayesian statistics and of the Nested-Sampling technique for
241:   evidence computation.  In Section 5 we describe how the method has
242:   been tested and how its ability in detecting substructures,
243:   depending on the perturbation mass and position, has been
244:   studied. Finally in Section 6 conclusions are drawn and future
245:   applications are discussed.
246:   
247:   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
248: 
249:   \section{Construction of the lensing operators}
250:   
251:   In this section, we describe the data model which relates the
252:   unknown source brightness distribution and lens potential to the
253:   known data of the lensed images. The aim is to put this procedure in
254:   a fully self-consistent mathematical framework, excluding as much as
255:   possible any subjective intervention into the modelling.  The core
256:   of the method presented here is based on an Occam's razor argument.
257:   From a Bayesian evidence point of view, correlated features in the
258:   lensed images are most likely due to structure in the source, rather
259:   than being the result of small-scale perturbations of the lens
260:   potential in front of all the lensed images.  On the other hand,
261:   uncorrelated structure in the lensed images is most likely due to
262:   small-scale perturbations of the lens potential.
263:   
264:   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
265:   
266:   \subsection{The data, source and potential grids}\label{sec:grids}
267:   The main idea of grid-based lensing techniques is to use a
268:   grid-based reconstruction of the source and of the lens potential.
269:   Here we introduce the general geometry of the problem, explicitly
270:   shown in Fig. \ref{fig:grid}.  Consider a lensed image $\bmath d$
271:   of an unknown extended source $\bmath s$. Both $\bmath d$ and
272:   $\bmath s$ are vectors that describe the surface brightness
273:   distributions on a set of spatial points $\bmath x_i^d$ and $\bmath
274:   y_j^s$ in the lens and source plane, respectively
275:   \citep[e.g.,][]{Warren03,Koopmans05, suyu206}. In general, these are
276:   related through the lens equation ${\bmath y_i^d} = {\bmath x_i^d} -
277:   {\bmath \nabla} \psi({\bmath x_i^d})$, where ${\bmath x}_i^d$
278:   corresponds to the spatial position of the surface brightness in the
279:   $i$th element of the vector $\bmath d$, i.e.\ $d_i$, and $\psi({\bmath x_i^d})$
280:   is the lensing potential, which is described in more detail below. 
281:   We note that ${\bmath y}_i^d$ does not necessarily coincide with one of the
282:   positions $\bmath y_j^s$ at which the $j$th brightness value
283:   of the vector $\bmath s$ is defined. In our implementation, the grid on the
284:   source plane is fully adaptive and is directly constructed from a
285:   subset of the $N_d$ pixels in the image plane, with spatial
286:   boundaries of the image grid included.  In particular, as shown
287:   schematically in Fig. \ref{fig:grid}, $N_s$ pixels, each located
288:   at a position $\bmath x_j^s$ on the image grid, are cast back to the
289:   source plane, giving the positions $\bmath y_j^s$. 
290:   The set of positions $\{ \bmath y_j^s \}$ constitutes
291:   the vertices of a Delaunay triangulation. In this way, we define an
292:   irregular adaptive grid, where vertex positions in the source plane
293:   are related to positions on the image plane via the lens equation
294:   and every vertex value represents an unknown source surface
295:   brightness level.  
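
  \noi As a simple illustration of how the lens equation shapes the
  source grid, consider a singular isothermal sphere with Einstein
  radius $b$, i.e.\ $\psi(\bmath x)=b\,|\bmath x|$. This choice is an
  assumption made here purely for illustration and is not the lens
  model used in the tests of this paper. The lens equation and the
  (signed) magnification then read
  %
  \begin{equation}
    \bmath y = \bmath x\left(1-\frac{b}{|\bmath x|}\right)\,, \qquad
    \mu(\bmath x) = \left(1-\frac{b}{|\bmath x|}\right)^{-1}\,,
    \nonumber
  \end{equation}
  %
  so that image pixels close to the Einstein ring, $|\bmath x|\simeq
  b$, are cast to source-plane positions close to $\bmath y=0$, where
  $|\mu|$ is large. A regularly spaced set of image-plane positions
  therefore yields a dense set of Delaunay vertices in the highly
  magnified part of the source plane and a sparse set elsewhere.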
296:   
297:   \noi We assume the lens potential to be the
298:   superposition of a parametric smooth component with linear local
299:   perturbations related to the presence of e.g. CDM substructures or
300:   dwarf galaxies:
301:   %
302:   \begin{equation}
303:     \psi(\bmath x,\bmath \eta)=\psi_s(\bmath x,\bmath
304:     \eta)+\delta\psi(\bmath x).
305:   \end{equation}
306:   % 
307:   While $\psi_s(\bmath x,\bmath \eta)$ assumes a parametric form,
308:   with parameters $\bmath\eta$, $\delta \psi(\bmath x)$ is a function
309:   that is pixelized on a regular Cartesian grid of points $\bmath
310:   x_k^{\delta\psi}$ with values
311:   $\delta \psi_k$. The set $\{\delta \psi_k\}$ is written as a vector
312:   $\delta\bmath{\psi}$. Given the observational set of data $\bmath d$,
313:   we now wish to recover the source distribution $\bmath s$ and the
314:   lens potential $\psi({\bmath x}, \bmath\eta)$ simultaneously. To do
315:   this we need to mathematically relate the brightness values $\bmath
316:   d$ to the unknown brightness values $\bmath s$. As described in the
317:   next subsection, this can be done through a linear operation on
318:   $\bmath s$ and $\delta \bmath{\psi}$, where the operator itself is a
319:   function of an initial guess of the lens potential.
320: 
321: 
322:   
323:   \begin{figure} 
324:     \begin{center} 
325:       \includegraphics[width=\hsize]{fig1}
326:       \caption {A schematic overview of the non-linear source and
327: 	potential reconstruction method, as implemented in this
328: 	paper. On the left-hand side, on the image plane, two grids
329: 	are defined: one for the potential corrections and one for the
330: 	lensed image. A subset of $N_s$ of the $N_d$ image pixels
331: 	located at the positions $\bmath x^s_i$ on the image plane
332: 	(filled circles) is cast back to the source plane (on the
333: 	right) on $\bmath{y}^s_i$ through the lens equation. These
334: 	form the vertices of an adaptive grid on the source plane. The
335: 	remaining image pixels (open circles) are also cast to the
336: 	source plane to the positions $\bmath{y}_i^d$ (we note that
337: 	this set of points includes $\bmath{y}^s_i$). Because the
338: 	source brightness distribution is conserved, i.e.\ $S(\bmath
339: 	x^d_i)=S(\bmath y^d_i)$, the surface brightness at the open
340: 	circles is represented by a linear superposition of the
341: 	surface brightness at the three triangle vertices that enclose
342: 	it. Similarly the potential correction at a point
343: 	$\bmath{x}_i^{\delta\psi}$ is given by linear interpolation of
344: 	the potential corrections at the surrounding pixels (large
345: 	rectangular pixels on the image plane). }
346:       \label{fig:grid} 
347:     \end{center}
348:   \end{figure}
349:   
350:   
351:   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
352:   
353:   \subsection{The source and potential operator}
354: 
355:   We now derive the explicit relation between the unknown source
356:   distribution $\bmath s$, the potential correction $\delta
357:   \bmath{\psi}$, the smooth potential $\psi_s(\bmath x,\bmath\eta)$
358:   and the image brightness $\bmath d$. 
359: 
360:   \noi Consider a generic triangle $\widehat{\rm{ABC}}$ on the source
361:   plane (Fig. \ref{fig:single}). The source surface brightness
362:   ${s_{\rm P}}$ at a point P, located inside the triangle at the
363:   position ${\bmath y}_{\rm P}^d$, can be related to the surface brightness at
364:   the vertices A, B and C through a simple linear relation 
365:   %
366:   \begin{equation}
367:     {s_{\rm P}}=w_{\rm A}{s_{\rm A}}+w_{\rm B}{s_{\rm B}}+w_{\rm
368: 	C}{s_{\rm C}}\,.
369:   \end{equation} 
370:   %
371:   \noi An explicit expression for the bilinear interpolation weights
372:   $w_{\rm{A}}$, $w_{\rm B}$ and $w_{\rm C}$ can be obtained by
373:   considering the point $\rm P_1 $, at the intersection of the line
374:   $\overline{\rm {AP}}$ with the line $\overline{\rm{CB}}$. The source
375:   intensities at P and $\rm P_1$ are also related to each other
376:   through a linear interpolation.  On the other hand, the surface
377:   brightness in $\rm P_1$ is directly related to the values on the
378:   triangle vertices $\rm B$ and $\rm C$
379:   %
380:   \begin{equation}
381:     \left\{
382:     \begin{array}{l}
383:       s_{\rm P} = \frac{d_{\rm {PA} }
384:       }{d_{\rm{P_1A}}}(s_{\rm{P_1}}-s_{\rm A})+s_{\rm A}\\ s_{\rm
385:       {P_1}}= \frac{ d_{ \rm {P_1B} } }{ d_{\rm{CB}} }(s_{\rm
386:       C}-s_{\rm B})+s_{\rm B}
387:     \end{array}
388:     \right.\,
389:     \label{equ:arr}
390:   \end{equation}
391:   % 
392:   \noi where $d_{\rm {PA}}$ and $d_{\rm {P_1A}}$ are the absolute
393:   distances between the points P and A and the points $\rm P_1$ and A
394:   respectively; $d_{ \rm{P_1B}}$ and $d_{\rm {CB}}$ are the distances
395:   between the points $\rm {P_1}$ and B and the points C and B
396:   respectively. Solving (\ref{equ:arr}), we obtain the weights
397:   %
398:   \begin{equation}
399:     \left\{
400:     \begin{array}{l}
401:       w_{\rm A}= 1-\frac{d_{\rm {PA}}}{d_{\rm{P_1A}}}\\ w_{\rm B}=
402:       \frac{d_{\rm{PA}}}{d_{\rm{P_1A}}}
403:       \left(1-\frac{d_{\rm{P_1B}}}{d_{\rm{CB}}}\right)\\
404:       w_{\rm C}=
405:       \frac{d_{\rm{PA}}d_{\rm{P_1B}}}{d_{\rm{P_1A}}d_{\rm{CB}}}
406:     \end{array}
407:     \right.\,
408:   \end{equation}
409:   % 
410:   \noi with $\sum_{i=\rm A,\rm B,\rm C }{w_i}=1$. Because
411:   gravitational lensing conserves the surface brightness, i.e.\ $S(\bmath
412:   x_i^d) = S(\bmath y_i^d)$, the mapping between the two planes (when
413:   $\delta\bmath\psi=0$) can be expressed as a system of $N_d$ coupled
414:   linear equations in the $N_s$ unknown source values
415:   %
416:   \begin{equation}
417:     \mathbf{B\,L}(\bmath \eta)\bmath s =\bmath d + \bmath n\,,
418:     \label{equ: src_linear_blurred} 
419:   \end{equation}
420:   % 
421:   where $\mathbf L(\bmath \eta)$ and $\mathbf B$ are the lensing and
422:   the blurring operators respectively \citep[see e.g.][]{Warren03,
423:   Treu04, Koopmans05, Suyu106}. The blurring operator is a square
424:   sparse matrix which accounts for the effects of the PSF. Each row of
425:   the lensing operator (a sparse matrix) contains at most the three
426:   bilinear interpolation weights, $w_{\rm A ,B, C}$, placed at the columns that
427:   correspond to the three source vertices that enclose the associated
428:   source position. For a vertex point, there is only one weight equal
429:   to unity. If $N_s = N_d$ (i.e.\ all image positions are used to
430:   create the source grid), all weights are equal to unity. In this
431:   case, the system of equations is under-constrained and strong
432:   regularization is required.
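
  \noi As a concrete check of the interpolation weights (the
  coordinates below are chosen here purely for illustration), write
  $\alpha\equiv d_{\rm PA}/d_{\rm P_1A}$ and $\beta\equiv d_{\rm
  P_1B}/d_{\rm CB}$, so that $w_{\rm A}=1-\alpha$, $w_{\rm
  B}=\alpha(1-\beta)$ and $w_{\rm C}=\alpha\beta$, which sum to unity
  for any $\alpha$ and $\beta$. For ${\rm A}=(0,0)$, ${\rm B}=(1,0)$,
  ${\rm C}=(0,1)$ and ${\rm P}=(1/4,1/4)$, the line $\overline{\rm
  AP}$ meets $\overline{\rm CB}$ at ${\rm P_1}=(1/2,1/2)$, hence
  %
  \begin{eqnarray}
    \alpha & = & \frac{\sqrt{2}/4}{\sqrt{2}/2}=\frac{1}{2}\,, \qquad
    \beta = \frac{\sqrt{2}/2}{\sqrt{2}}=\frac{1}{2}\,, \nonumber\\
    w_{\rm A} & = & \frac{1}{2}\,, \qquad w_{\rm B} = \frac{1}{4}\,,
    \qquad w_{\rm C} = \frac{1}{4}\,. \nonumber
  \end{eqnarray}
  %
  The row of $\mathbf L$ associated with P then contains $1/2$, $1/4$
  and $1/4$ in the columns of the vertices A, B and C, and zeros
  elsewhere. The same values satisfy ${\rm P}=w_{\rm A}{\rm A}+w_{\rm
  B}{\rm B}+w_{\rm C}{\rm C}$, i.e.\ the weights coincide with the
  barycentric coordinates of P in the triangle.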
433: 
434:   \noi By pixelating $\delta \psi(\bmath x)$ on a regular Cartesian
435:   grid, a similar argument as for the source can be applied to the
436:   potential correction; all potential values, $\{\delta \psi_k\}$, and
437:   their derivatives on the image plane can be related to this limited
438:   set of points through bilinear interpolation
439:   \citep[see][]{Koopmans05, Suyu08}. It is then possible to derive from
440:   equation~(\ref{equ: src_linear_blurred}) a new set of linear
441:   equations,
442:   %
443:   \begin{equation}
444:     \mathbf{M_c}\left(\bmath{\eta},\bmath\psi\right)\,\bmath r = \bmath d +
445:     \bmath n,
446:     \label{equ: src_pot_linear_blurred}
447:   \end{equation} 
448:   %
449:   where
450:   %
451:   \begin{equation}
452:     \bmath r\equiv\left(
453:     \begin{array}{c}
454:       \bmath s\\ \delta\bmath \psi
455:     \end{array} 
456:     \right)\,.
457:   \end{equation}
458:   %
459:   \noi More specifically, $\bmath\psi$ is the sum of all the previous
460:   corrections $\delta\bmath\psi$ and the operator $\mathbf{M_c}$ is a
461:   block matrix reading
462:   \begin{equation}
463:     \mathbf{M_c}\equiv \mathbf B \left [\mathbf L(\bmath \eta, \bmath
464:       \psi)\, | -\mathbf{D_s}(\bmath s_{\rm MP})\mathbf {D_{\psi}}
465:       \right]\,.
466:     \label{equ:block_matrix}
467:   \end{equation} 
468:   % 
469:   \noi ${\mathbf L}({\bmath \eta}, {\bmath \psi})$ is the
470:   lensing operator introduced above, $\mathbf{D_s}(\bmath s_{\rm MP})$
471:   is a sparse matrix whose entries depend on the surface brightness
472:   gradient of the previously-best source model at $\bmath{y}^d_i$ and
473:   $\mathbf{D_\psi}$ is a matrix that determines the gradient of
474:   $\delta\bmath\psi$ at all corresponding points $\bmath{x}^d_i$
475:   \citep[see] [for details]{Koopmans05}. The generic structure of
476:   these matrices is given by
477:   %
478:   \begin{equation}
479:     \mathbf{D_{s}}= \left(
480:     \begin{array}{cccccc}
481:       \ddots &&&&& \\ \\ &\frac{\partial S({\bmath y}^d_i)}{\partial y_1}&
482:       \frac{\partial S({\bmath y}^d_i)}{\partial y_2} &&& \\ \\ &
483:       &&\frac{\partial S({\bmath y}^d_{i+1})}{\partial y_1}&
484:       \frac{\partial S({\bmath y}^d_{i+1})}{\partial y_2} & \\ \\ & & & & & \ddots\\
485:     \end{array}
486:     \right)
487:   \end{equation}
488:   %
489:   and
490:   %
491:   \begin{equation}
492:     \mathbf{D_{\psi}}= \left(
493:     \begin{array}{ccccc}
494:       ...&\\ \\ &\frac{\partial \delta\psi (\bmath{x}^d_i)}{\partial
495:       x_1}&\\ &\frac{\partial \delta\psi (\bmath{x}^d_i)}{\partial
496:       x_2} & \\ \\ &&\frac{\partial \delta\psi (\bmath{x}^d_{i+1})}{\partial
497:       x_1} \\ &&\frac{\partial \delta\psi (\bmath{x}^d_{i+1})}{\partial
498:       x_2} \\ & & & &...\\
499:     \end{array}
500:     \right)
501:   \end{equation}
502:   % 
503:   where the index $i$ runs over all the $\bmath{x}_i^d$ and $\bmath{y}_i^d$,
504:   i.e.\ triangle vertices included. The ``functions'' $S$ and $\delta
505:   \psi$ and their derivatives can be obtained through bilinear
506:   interpolation and finite differencing from $\bmath s$ and $\delta
507:   \bmath \psi$, respectively.
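
  \noi One simple way to realise this for the potential correction is
  sketched here. It is an illustrative assumption, and the exact
  stencil adopted by \citet{Koopmans05} may differ. Let a point
  $\bmath x=(x_1,x_2)$ lie in a rectangular pixel of size
  $\Delta_1\times\Delta_2$ whose four corner values are
  $\delta\psi_{00}$, $\delta\psi_{10}$, $\delta\psi_{01}$ and
  $\delta\psi_{11}$, and let $t$ and $u$ (both between 0 and 1) be
  the fractional offsets of $\bmath x$ from the corner $(0,0)$ along
  $x_1$ and $x_2$. Then
  %
  \begin{eqnarray}
    \delta\psi(\bmath x) & = & (1-t)(1-u)\,\delta\psi_{00} +
    t(1-u)\,\delta\psi_{10} \nonumber\\ & & +\,(1-t)u\,\delta\psi_{01} +
    tu\,\delta\psi_{11}\,, \nonumber\\
    \frac{\partial\delta\psi}{\partial x_1} & = &
    \frac{1}{\Delta_1}\left[(1-u)(\delta\psi_{10}-\delta\psi_{00}) +
    u(\delta\psi_{11}-\delta\psi_{01})\right]\,, \nonumber\\
    \frac{\partial\delta\psi}{\partial x_2} & = &
    \frac{1}{\Delta_2}\left[(1-t)(\delta\psi_{01}-\delta\psi_{00}) +
    t(\delta\psi_{11}-\delta\psi_{10})\right]\,. \nonumber
  \end{eqnarray}
  %
  Both derivatives are linear in the grid values, so in this sketch
  each row of the gradient matrix has at most four non-zero entries.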
508: 
509:   \noi It is clear from the structure of these matrices that the
510:   first-order correction to the model, as a result of $\delta \psi$,
511:   is equal to $\delta d_i= -\bmath {\nabla} S(\bmath{y}^d_i) \cdot
512:   \bmath{\nabla} \delta \psi(\bmath{x}^d_i)$ at every point
513:   $\bmath{x}^d_i$ \citep[see e.g.][for a derivation]{Koopmans05}.
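
  \noi The origin of this term can be sketched with a standard
  first-order expansion, given here for convenience and neglecting
  blurring and noise. With the perturbed lens equation $\bmath y =
  \bmath x - \bmath\nabla\psi_s(\bmath x) -
  \bmath\nabla\delta\psi(\bmath x)$, and writing $\bmath y_0\equiv
  \bmath x - \bmath\nabla\psi_s(\bmath x)$ for the unperturbed
  source-plane position, conservation of surface brightness gives
  %
  \begin{equation}
    d(\bmath x) = S\left(\bmath y_0 - \bmath\nabla\delta\psi(\bmath
    x)\right) \simeq S(\bmath y_0) - \bmath\nabla S(\bmath y_0)\cdot
    \bmath\nabla\delta\psi(\bmath x)\,, \nonumber
  \end{equation}
  %
  where the second term is the correction $\delta d$ quoted above.
  The expansion assumes that $|\bmath\nabla\delta\psi|$ is small
  compared with the scale over which $S$ varies.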
514: 
515:   \noi As for the surface brightness itself, the first derivatives at
516:   a generic point P on the source plane can be expressed as functions
517:   of the corresponding values at the triangle vertices A, B, C, yielding
518:   %
519:   \begin{eqnarray}
520:     \frac{\partial {s_{\rm P}}}{\partial y_{1}} & = &w_{\rm
521:       A}\frac{\partial {s_{\rm A}}}{\partial y_{1}}+w_{\rm
522:       B}\frac{\partial {s_{\rm B}}}{\partial y_{1}}+w_{\rm
523:       C}\frac{\partial {s_{\rm C}}}{\partial y_{1}}\nonumber\\
524:       \frac{\partial {s_{\rm P}}}{\partial y_{2}} & = &w_{\rm
525:       A}\frac{\partial {s_{\rm A}}}{\partial y_{2}}+w_{\rm
526:       B}\frac{\partial {s_{\rm B}}}{\partial y_{2}}+w_{\rm
527:       C}\frac{\partial {s_{\rm C}}}{\partial y_{2}}
528:   \end{eqnarray}
529:   % 
530:   For the generic vertex $j= \rm{A, B,C}$ these are given by
531:   $\frac{\partial {s_j}}{\partial y_{1}}=-\frac{n_0}{n_2}$
532:   and $\frac{\partial {s_j}}{\partial
533:   y_{2}}=-\frac{n_1}{n_2}$, where  $\bmath{N}\equiv(n_0,n_1,n_2)$ is the
534:   unit-length surface normal vector at the vertex $j$ and is defined
535:   as the average of the adjacent per-face normal vectors. For
536:   $\delta\bmath\psi$ and its gradients, on a rectangular grid with
537:   rectangular pixels, we follow \citet{Koopmans05}.
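
  \noi The relation between the gradient and the normal vector
  follows from viewing the source as a surface $z=s(y_1,y_2)$ in the
  space $(y_1,y_2,z)$. The short argument below is added for
  convenience. A non-normalised normal to this surface is
  %
  \begin{equation}
    \left(-\frac{\partial s}{\partial y_1},\,-\frac{\partial
    s}{\partial y_2},\,1\right)\,, \nonumber
  \end{equation}
  %
  so that any $\bmath N=(n_0,n_1,n_2)$ parallel to it satisfies
  $n_0/n_2=-\partial s/\partial y_1$ and $n_1/n_2=-\partial
  s/\partial y_2$, independently of the normalisation. As an
  example, for the plane $s=2y_1+3y_2$ the normal is proportional to
  $(-2,-3,1)$, which returns $\partial s/\partial y_1=2$ and
  $\partial s/\partial y_2=3$.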
538:   
539:   \begin{figure}
540:     \begin{center}
541:       \subfigure[]{\centering \includegraphics[width=4.5cm]{fig2a}
542: 	\label{fig:single}
543:       }
544:       \hspace{.5in}
545:       
546:       
547:       \subfigure[]{ \includegraphics[width=3cm]{fig2b}
548: 	\label{fig:double_x}
549:       }
550:       \hspace{.25in} \subfigure[]{
551:       \includegraphics[width=3cm]{fig2c}
552: 	\label{fig:double_y}
553:       }
554:       
555:       \caption{Generic triangles from the
556: 	source grid. Both the source surface brightness and its
557: 	derivatives at the points P, $\rm P_1$ and $\rm P_2$ are given
558: 	by linear superposition of the values at the vertices of the
559: 	surrounding triangles.}
560:       \label{fig:triangles} 
561:       
562:     \end{center} 
563:   \end{figure}
564:   
565:   
566:   \section{Inverting the data model}\label{sec:inverting}
567: 
568:   \noi As shown above, whether one solves for the source alone
569:   or for the source plus a potential correction, a {\sl linear
570:   data model} can be constructed. In this section, we give a
571:   general overview of how this set of linear equations can be
572:   (iteratively) solved. A more thorough Bayesian description and
573:   motivation can be found in Section~4.
574:   
575:   \subsection{The penalty function}
576:   Before we go into the details of the method, we first restate that
577:   for a given lens potential $\psi(\bmath x, {\bmath \eta})$ and
578:   potential correction $\bmath \psi_n = \sum^n_{i=1}
579:   \delta {\bmath \psi_i}$, on a grid, the source surface brightness vector
580:   $\bmath s$ and the data vector $\bmath d$ can be related through a
581:   linear (matrix) operator
582:   %
583:   \begin{equation}
584:     \mathbf {M_c}({\bmath \eta}, {\bmath \psi}_{n-1}, \bmath
585:     s_{n-1})\bmath r_n={\bmath d} + {\bmath n},
586:     \label{equ: src_linear} 
587:   \end{equation}
588:   now explicitly written with its dependencies on the source and
589:   potential and with
590:   \begin{equation}
591:     \bmath r_n= \left(\begin{array}{c}\bmath s_{n} \\ 
592:       \delta\bmath\psi_n \\
593:     \end{array}
594:     \right).
595:   \end{equation}
596:   %
597:   In this equation $\bmath s_n$ is a model of the source
598:   brightness distribution at a given iteration $n$ (we describe the
599:   iterative scheme below). We assume the noise $\bmath n$ to be
600:   Gaussian, which is a good approximation for the HST images to which the 
601:   method will be applied. Even in case of deviations from Gaussianity, 
602:   the central limit theorem, for many data points, ensures that the probability density 
603:   distribution is often well approximated by a Normal distribution. \\
604:   \noi Because of the ill-posed nature of this relation,
605:   equation (\ref{equ: src_linear}) cannot simply be inverted. Instead, a
606:   penalty function that expresses the mismatch between the data and
607:   the model is defined as
608:   \begin{equation}\label{eqn:penalty}
609:     P(\bmath s,\delta \bmath \psi \,|\, {\bmath \eta}, {\bmath \lambda},
610:     {\bmath s}_{n-1}, {\bmath
611:     \psi}_{n-1})=\chi^2+\lambda_s^2\|\mathbf{H_s} \bmath s\|^2_2
612:     +\lambda_{\delta\psi}^2 \|\mathbf{H_{\delta\psi}} \delta\bmath
613:     \psi\|^2_2\,,
614:   \end{equation}
615:   with
616:   \begin{equation}\label{eqn:chi2}
617:     \chi^2 = [\mathbf {M_c}({\bmath \eta}, \bmath \psi_{n-1}, \bmath
618:     s_{n-1})\, \bmath r - {\bmath d}]^{\rm T} \, {\mathbf {C_d^{-1}}} \,
619:     [\mathbf {M_c}({\bmath \eta}, \bmath \psi_{n-1}, \bmath
620:     s_{n-1})\,\bmath r - {\bmath d}].
621:   \end{equation}
622:   
623:   \noi The second and third terms in the penalty function contain prior
624:   information, or beliefs, about the smoothness of the source and of
625:   the potential, respectively, and $\mathbf{C_d}$ is the diagonal
626:   covariance matrix of the data. The level of regularization is set by
627:   the regularization parameters $\bmath \lambda$, one for the source and one
628:   for the potential \citep[see][for a more general
629:   discussion]{Koopmans05, suyu206}.  In a Bayesian framework, this
630:   penalty function is related to the posterior probability of the
631:   model given the data (see Section 4). In the following two sections
632:   we describe how to solve for the linear and non-linear parameters of
633:   the penalty function (except for $\bmath \lambda$, which is described
634:   in Section 4).
635:   
636:   \subsubsection{Solving for the linear parameters}
637:   \label{sec:solvelinear}
638:   The most probable solution, $\bmath{r_{\rm MP}}$, minimizing the
639:   penalty function is obtained by solving the set of linear equations
640:   \begin{equation}
641:     (\mathbf{M_c^T C_d^{-1}M_c+R^T R})\,\bmath
642:     r=\mathbf{M_c^TC_d^{-1}}\bmath d.
643:     \label{equ: src_pot_penalty} 
644:   \end{equation}
645:   The regularization matrix is given by
646:   \begin{equation}
647:     {\mathbf R^{\rm T}} {\mathbf R} = \left(
648:     \begin{array}{cc}
649:       \lambda_s^2\mathbf{H_s^{\rm T}} \mathbf{H_s} & \\ &
650:       \lambda^2_{\delta\psi}\mathbf{H_{\delta\psi}^{\rm T}}
651:       \mathbf{H_{\delta\psi}}
652:     \end{array} \right).
653:   \end{equation}
654:   
655:   \noi The solution of this symmetric positive definite set of
656:   equations can be found using e.g.\ a Cholesky decomposition
657:   technique. By solving equation (\ref{equ: src_pot_penalty}), adding
658:   the correction $\delta \bmath \psi_n$ to the previously-best
659:   potential $\bmath \psi_{n-1}$ and iterating this procedure, both the
660:   source and the potential should converge to the minimum of the
661:   penalty function $P(\bmath s_n,\delta \bmath \psi_{n} \,|\, {\bmath
662:   \eta}, {\bmath \lambda}, {\bmath s}_{n-1}, {\bmath \psi}_{n-1})$. At
663:   every step of this iterative procedure the matrices $\mathbf {M_c}$
664:   and $\mathbf R$ have to be recalculated for the new updated
665:   potential $\bmath \psi_n$ and source $\bmath s_n$. While the
666:   potential grid points are kept spatially fixed in the image plane,
667:   the Delaunay tessellation grid of the source is re-built at every
668:   iteration to ensure that the number of degrees of freedom is kept
669:   constant during the entire optimization process.
670:   
671:   \noi Note that because the source and the potential corrections are
672:   independent, they require their own form ($\mathbf H$) and level
673:   ($\lambda$) of regularization.  The most common forms of
674:   regularization are the zeroth-order, the gradient and the
675:   curvature. As shown by \citet{suyu206} the best form depends on the
676:   nature of the source distribution and can be assessed via Bayesian
677:   evidence maximisation. For the source, we chose the curvature
678:   regularization defined for the Delaunay tessellation of the source
679:   plane. 
680: 
681:   \noi Specifically, one can combine the gradient or curvature
682:   matrices in the $y_1$ and $y_2$ directions: $\mathbf{H_{s}^{\rm
683:   T}}\mathbf{H_{s}}=\mathbf{H_{s,y_1}^{\rm
684:   T}}\mathbf{H_{s,y_1}}+\mathbf{H_{s,y_2}^{\rm T}}\mathbf{H_{s,y_2}}$.
685:   Both $\mathbf{H_{s,y_1}}$ and $\mathbf{H_{s,y_2}}$ can be obtained
686:   by analogy by considering the pair of triangles in
687:   Fig.~\ref{fig:double_x} and Fig.~\ref{fig:double_y}
688:   respectively.
689: 
690:   \noi For every generic point C on the source plane we consider the
691:   pair of triangles $\widehat{\rm{ABC}}$ and $\widehat{\rm{DCE}}$ and
692:   define the curvature in C in the $y_1$ direction as:
693:   %
694:   \begin{equation}
695:     {s''_{C,y_1}}
696:     \equiv \frac{1}{d_{CP}}({s_P}-{s_C}) -\frac{1}{d_{CQ}}({s_C}-{s_Q})\,.
697:     \label{equ:curvature}
698:   \end{equation}
699:   This is not the second derivative, but we find that this alternative
700:   curvature definition gives much better results than using the second
701:   derivative directly. The reason is that it gives equal weight to all
702:   triangles, independently of their relative sizes (for identical
703:   rectangular pixels this problem does not arise since the above
704:   definition is equal to the second derivative up to a proportionality
705:   constant). With this definition a much smoother solution is obtained.
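
  \noi As a quick check of the statement in parentheses, consider a
  regular grid with spacing $h$ (a symbol introduced here only for
  illustration), so that $d_{CP}=d_{CQ}=h$. Equation
  (\ref{equ:curvature}) then reduces to
  \[
    {s''_{C,y_1}} = \frac{s_P-2\,s_C+s_Q}{h}
    = h\,\left.\frac{\partial^2 s}{\partial y_1^2}\right|_{\rm C} + O(h^3)\,,
  \]
  i.e.\ $h$ times the standard three-point finite-difference estimate
  of the second derivative, so the two definitions differ only by the
  constant factor $h$ in this special case.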
706:   
707:   \noi P and Q
708:    are given by intersecting the line
709:   $\overline{\rm{CP_1}}$ with the line $\overline{\rm{ED}}$ and the
710:   line $\overline{\rm{CP_2}}$ with the line $\overline{\rm{AB}}$
711:   respectively. Specifically, $\rm{P_1}$ and $\rm{P_2}$ are defined as
712:   very small displacements from the point C in the $y_1$ direction %
713:   \begin{eqnarray}
714:     y_{2}^{\rm{P_1}}      & = & y_{2}^{\rm{P_2}} =  y_{2}^{\rm C}\nonumber\\
715:     y_{1}^{\rm{P_{1,2}}}  & = & y_{1}^{\rm C}  \pm \delta y_1.
716:   \end{eqnarray}
717:   %
718:   The source surface brightness in P and Q can be obtained by
719:   linear interpolation between the source values in D with the value
720:   in E and the value in A with the value in B respectively
721:   %
722:   \begin{eqnarray}
723:     s_{\rm P}&=&\frac{d_{\rm{PD}}}{d_{\rm{ED}}}(s_{\rm E}-s_{\rm
724:       D})+s_{\rm D}\label{equ:s_p} \nonumber \\ s_{\rm
725:       Q}&=&\frac{d_{\rm{QA}}}{d_{\rm{AB}}}({s_{\rm B}}-s_{\rm
726:       A})+s_{\rm A}\label{equ:s_q}\,.
727:   \end{eqnarray}
728:   %
729:   \noi Substituting these two interpolations in equation
730:   (\ref{equ:curvature}) gives
731:   %
732:   \begin{multline}
733:     {s''_{C,y_1}}=-\left(\frac{1}{d_{\rm
734:       {CP}}}+\frac{1}{d_{\rm {CQ}}}\right){s_{\rm C}}+\frac{d_{\rm
735:       PD}}{d_{\rm CP}d_{\rm DE}}s_{\rm E}+\\ \frac{d_{\rm
736:       {QA}}}{d_{\rm{CQ}}d_{\rm{AB}}}s_{\rm B}+\frac{d_{\rm{PE}}}
737:       {d_{\rm{CP}}{d_{\rm{DE}} }}s_{\rm D}+\frac{d_{\rm
738:       {QB}}}{d_{\rm{CQ}}d_{\rm{AB}}}s_{\rm A}\,.
739:   \end{multline}
740:   % 
741:   \noi Each row of the regularization matrix $\mathbf{H_{s,y_1}}$ corresponds to one
742:   point C and contains the five interpolation weights, placed at the
743:   columns that correspond to the five vertices A, B, C, D and
744:   E. The curvature in the $y_2$ direction is derived in an analogous
745:   way using the pair of triangles in Fig.~\ref{fig:double_y}. We
746:   refer again to \citet{Koopmans05} for details on the
747:   potential regularization matrix $\mathbf{ H_{\delta \psi}}$.
748:   
749:   \subsubsection{Solving for the non-linear parameters}
750:   \label{sec:solvenonlinear}
751:   In order to recover the non-linear parameters $\bmath \eta$, we need
752:   to minimize the penalty function $P(\bmath s, {\bmath \eta}\,|\,
753:   {\bmath \lambda}, {\bmath \psi})$. We allow for a correction,
754:   $\bmath \psi$, to the parametric potential $\psi(\bmath \eta,\bmath
755:   x)$ (not necessarily zero), but do not allow it to be changed while
756:   optimising for $\bmath s$ and ${\bmath \eta}$. In all cases, we keep
757:   $\bmath \lambda$ fixed during the optimization. Given an
758:   initial guess for the non-linear parameters $\bmath \eta_0$, we then
759:   minimize the penalty function defined in Section
760:   \ref{sec:solvelinear}, under the conditions outlined above
761:   ($\bmath\psi$ is constant and $\delta\bmath\psi \equiv \bmath 0$).
762:   We use a non-linear optimizer \citep[in our case Downhill-Simplex
763:   with Simulated Annealing;][]{Press92}, to change $\bmath \eta$ at
764:   every step and to minimize the joint penalty function $P(\bmath s,
765:   {\bmath \eta}\,|\, {\bmath \lambda}, {\bmath \psi})$.  The
766:   optimization of $\bmath s$ is implicitly embedded in the
767:   optimization of $\bmath \eta$ by solving equation (\ref{equ:
768:   src_pot_penalty}) only for $\bmath s$, every time $\bmath \eta$ is
769:   modified.
770:   
771:   \subsection{The optimization strategy}\label{sec:strategy}
772:   
773:   We have implemented a multi-fold optimization scheme for solving the
774:   linear equation (\ref{equ: src_linear}). This scheme is not unique,
775:   but stabilises the numerical optimization of this rather complex set
776:   of equations. Solving all parameters simultaneously would be
777:   computationally prohibitive and usually shows poor convergence
778:   properties.
779: 
780:   \subsubsection{Optimization steps}
781:   
782:   Our optimization scheme is similar to a {\sl line-search}
783:   optimization, where consecutively different sets of unknown
784:   parameters are being kept fixed, while the others are optimized
785:   for. The sets $\{\delta \bmath \psi, \bmath s\}$, $\{\bmath \eta,
786:   \bmath s \}$ and $\{\bmath \lambda, \bmath s \}$ define the three
787:   different groups of parameters, of which only one is solved for at
788:   once. The individual steps, in no particular order, are then:
789:   
790:   \noi {\bf (i)} {We assume $\bmath \eta$
791:     and $\bmath \lambda$ to be constant vectors and iteratively solve
792:     for $\delta\bmath\psi$ and the source $\bmath s$. In this case, at
793:     every iteration we solve for $\bmath r$ and adjust $\bmath \psi$,
794:     using the linear correction to the potential $\delta \bmath
795:     \psi$. This was described in Section \ref{sec:solvelinear}.}
796:   
797:   \noi {\bf (ii)} {We assume $\bmath\psi$ and
798:     $\bmath \lambda$ to be constant vectors and
799:     $\delta\bmath\psi_i=\bmath 0$ at every iteration and only solve
800:     for the non-linear potential parameters $\bmath \eta$ and the
801:     source $\bmath s$. This was described in Section
802:     \ref{sec:solvenonlinear}. We note that part of step (i) is also
803:     implicitly carried out in step (ii) (i.e.\ solving for $\bmath s$).}
804:   
805:   \noi {\bf (iii)} {We assume both (i) and (ii), above, and solve for
806:     the regularization parameter $\lambda_s$ of the source and the source
807:     itself $\bmath s$. This requires a Bayesian approach and will be
808:     described in more detail in Section~4. We have not attempted to 
809:     optimize for $\lambda_{\delta \psi}$, but will study this
810:     in future publications.}
811: 
812:   \noi The overall goal, however, remains to solve for the \emph{full}
813:   set of unknown parameters $\{ {\bmath \eta}, {\bmath \psi}_n, \bmath
814:   s_n \}$ for $n\rightarrow \infty$ (or some large number).  In
815:   particular if an overall smooth (on scales of the image separations)
816:   potential model $\psi(\bmath \eta)$ does not allow a proper
817:   reconstruction of the lens system, we add an additional and more
818:   flexible potential correction $\delta{\bmath \psi}$,
819:   which can describe a more complex mass structure. 
820: 
821:   \subsubsection{Line-search optimization scheme}
822: 
823:   In practice, we find that the optimal strategy to minimize the
824:   penalty function is the following, in order:
825:   
826:   \noi {\bf (1)} {We set $\lambda_{\rm s}$ to a large constant value
827:     such that the source model remains relatively smooth throughout
828:     the optimization (i.e.\ the peak brightness of the model is a
829:     factor of a few below that of the data) and keep
830:     $\bmath\psi_n=\bmath 0$ \citep[see also][]{suyu206, Suyu08}.  We then
831:     solve for $\bmath \eta$ and $\bmath s$ that minimize the penalty
832:     function}.
833:   
834:   \noi {\bf (2)} {Once the best $\bmath \eta$ and $\bmath s$ are
835:     found, a Bayesian approach is used to find the best value of
836:     $\lambda_{\rm s}$ for the source only.  At this point
837:     $\bmath\psi$ is still kept equal to zero.}
838:   
839:   \noi {\bf (3)} {Given the new value of $\lambda_{\rm s}$, step (1) is repeated
840:     to find improved values of $\bmath \eta$ and $\bmath s$. Since the
841:     sensitivity of $\lambda_{\rm s}$ to changes in $\bmath \eta$ is
842:     rather weak, at this point the best values of $\bmath \eta$,
843:     $\bmath s$ and $\bmath \lambda$ have been found.}
844:   
845:   \noi {\bf (4)} {Next, all the above parameters are kept fixed and we
846:     solve for $\bmath r$, this time assuming a very large value for
847:     $\lambda_{\delta \psi}$ to keep the potential correction (and
848:     convergence) smooth. We adjust $\bmath \psi$ at every iteration
849:     until convergence is reached
850:     \citep[e.g.][]{Suyu08}. At this point we stop the optimization
851:     procedure.}
852:   
853:   \noi {\bf (5)} {The smooth model with $\bmath \psi = \bmath 0$ and
854:     the same model with $\bmath \psi \neq \bmath 0$ are then compared
855:     through their Bayesian evidence values and errors on the
856:     parameters are estimated through the Nested Sampling of
857:     \citet{Skilling04} (Section 4).}  
858: 
859:   \noi Fig. \ref{fig:flow} shows a complete flow diagram of our
860:     optimization scheme. In the next section we place
861:     equation (\ref{eqn:penalty}) and model ranking on a formal Bayesian
862:     footing. Those readers mostly interested in the application and
863:     tests of the method could continue reading in Section~5.
864:      
865:   \begin{figure*} 
866:     \begin{center} 
867:       \includegraphics[width=\hsize,clip=]{fig3}     
868:        \caption {A schematic overview of the non-linear source and
869: 	potential reconstruction method.}
870:       \label{fig:flow} 
871:     \end{center}
872:   \end{figure*}
873:  
874:   \section{A Bayesian approach to data fitting and model selection}
875:   \label{sec:bayes}
876:   
877:   When trying to constrain the physical properties of the lens galaxy,
878:   within the grid-based approach, three different problems are
879:   faced.  Given the linear relation in equation (\ref{equ:
880:   src_pot_linear_blurred}) we need to determine the linear parameters
881:   $\bmath r$ for a certain set of data $\bmath d$ and a form for the
882:   smooth potential $\psi_{s}(\bmath x,\bmath \eta)$. We then aim to
883:   find the best values for the parameters $\bmath \eta$ and $\bmath
884:   \lambda$ and finally, on a more general level, we wish to infer the
885:   best model for the overall potential and quantitatively rank
886:   different potential families. In particular, we want to compare smooth models with models
887:   that also include a potential grid for substructure (with more free
888:   parameters). These issues can all be quantitatively and objectively
889:   addressed within the framework of Bayesian statistics. In the
890:   context of data modelling three levels of inference can be
891:   distinguished \citep{MacKay92, suyu206}.
892:   
893:   \medskip
894:   
895:   \noi {\bf (1)} First level of inference: linear optimization.  We
896:   assume the model $\mathbf{M_c}$, which depends on a given potential
897:   and source model, to be true and for a fixed form $\mathbf R$ and
898:   level ($\bmath\lambda$) of regularization, we derive from Bayes'
899:   theorem the following expression:
900:   \begin{equation}
901:     P\left(\bmath r\,|\,\bmath d,\bmath\lambda,\bmath \eta,\mathbf
902:     {M_c},\mathbf R\right)=\frac{P(\bmath d \,|\,\bmath r,\bmath \eta,
903:     \mathbf{M_c})\, P(\bmath r\,|\,\bmath\lambda,\mathbf R)}{P(\bmath
904:     d \,|\,\bmath\lambda,\bmath \eta,\mathbf{M_c},\mathbf R)}\,.
905:   \end{equation}
906:   The likelihood term, in case of Gaussian noise, for a covariance
907:   matrix $\mathbf{C_d}$, is given by
908:   \begin{equation}
909:     P(\bmath d \,|\,\bmath r, \bmath\eta,\mathbf{M_c})=
910: 	\frac{1}{Z_d}\exp{[-E_d(\bmath d \,|\,\bmath
911: 	r,\bmath\eta,\mathbf{M_c})]}\,
912:   \end{equation}
913:   where
914:   \begin{equation}
915:     Z_d=(2\pi)^{N_d/2}(\det \ \mathbf{C_d})^{1/2}
916:   \end{equation}
917:   and (see equation \ref{eqn:chi2})
918:   \begin{equation}
919:     E_d(\bmath d \,|\,\bmath r,\bmath\eta,\mathbf{M_c})=
920:       \frac{1}{2}\,\chi^2=\frac{1}{2}\left(\mathbf{M_c} \bmath
921:       r-\bmath d\right)^{\rm T}\mathbf{C_d}^{-1}\left(\mathbf{M_c}
922:       \bmath r-\bmath d\right)\,.
923:   \end{equation}
924:   Because of the presence of noise and the frequent singularity of
925:   $\mathbf{M_c^{\rm T}} \mathbf{M_c}$, it is not possible to
926:   simply invert the linear relation in equation (\ref{equ:
927:   src_pot_linear_blurred}) but an additional penalty function must be
928:   defined through the introduction of a prior probability $P(\bmath r
929:   \,|\,\bmath\lambda,\mathbf R)$ on $\bmath s$ and on $\delta\bmath
930:   \psi$. In our implementation of the method, the prior assumes a
931:   quadratic form, with its minimum at $\bmath r=\bmath 0$, and sets the
932:   level of smoothness (specified in $\mathbf H$ and $\bmath\lambda$)
933:   for the solution
934:   \begin{equation}
935:     P(\bmath r\,|\,\bmath\lambda,\mathbf R)=
936:     \frac{1}{Z_r}\exp{\left[-\bmath\lambda E_r(\bmath r\,|\,\mathbf
937:     R)\right]}\,,
938:   \end{equation}
939:   with
940:   \begin{equation}
941:     Z_r(\bmath\lambda)=\int {d\bmath r e^{-\bmath\lambda E_r}}=
942:     e^{-\bmath\lambda
943:     E_r(0)}\left(\frac{2\pi}{\bmath\lambda}\right)^{N_r/2}(\det\mathbf
944:     C)^{-1/2}\,,
945:   \end{equation}
946:   \begin{equation}
947:     E_r=\frac{1}{2}\|\mathbf R\bmath r\|^2_2
948:   \end{equation}
949:   and
950:   \begin{equation}
951:     \mathbf C=\nabla \nabla E_r=\mathbf {R}^{\rm T}\,\mathbf R\,.
952:   \end{equation}
953:   The normalization constant $P(\bmath d\,|\,\bmath\lambda,\bmath
954:   \eta,\mathbf{M_c},\mathbf R)$ is called the evidence and plays an
955:   important role at higher levels of inference. In this specific case
956:   it reads
957:   \begin{equation}
958:     P(\bmath d\,|\,\bmath\lambda,\bmath \eta,\mathbf{M_c},\mathbf R)
959:     =\frac{\int{d\bmath r\exp{(-M(\bmath r))}}}{Z_d Z_r}\,,
960:   \end{equation}
961:   \noi where
962:   \begin{equation}
963:     M(\bmath r)=E_d+ E_r\,.
964:   \end{equation}
965:   The most probable solution for the linear parameters is found by
966:   maximizing the posterior probability
967:   \begin{equation}
968:     P(\bmath r\,|\,\bmath d,\bmath\lambda,\bmath
969:     \eta,\mathbf{M_c},\mathbf R)=\frac{\exp(-M(\bmath
970:     r))}{\int{d\bmath r\,\exp(-M(\bmath r))}}\,.
971:     \label{equ:posterior}
972:   \end{equation}
973:   The condition $\partial (E_d+ E_r)/\partial \bmath r=0$ now yields the
974:   set of linear equations already introduced in Section
975:   \ref{sec:solvelinear}:
976:   \begin{equation}
977:     \left(\mathbf{M_c^{\rm T}} \mathbf{C_d}^{-1} \mathbf{M_c}+\mathbf
978:     R^{\rm T} \mathbf R\right)\bmath r = \mathbf{M_c^{\rm T}}
979:     \mathbf{C_d}^{-1}\bmath d\,.
980:     \label{equ:src_pot_penalty_bayes}
981:   \end{equation}
982:   Equation (\ref{equ:src_pot_penalty_bayes}) is solved iteratively
983:   using a Cholesky decomposition technique.  
984:   
985:   \noi {\bf (2)} Second level of inference: non-linear optimization.
986:   At this level we want to infer the non-linear parameters $\bmath
987:   \eta$ and the hyper-parameter $\lambda_{\rm s}$ for the
988:   source. Since at this point we are interested only in the smooth
989:   component of the lens potential, we set $\delta\bmath \psi=0$ and
990:   for a fixed family $\psi_s(\bmath \eta)$, form of the regularization
991:   $\mathbf R$ and model $\mathbf{M_c}$, we maximize the posterior
992:   probability
993:   
994:   \begin{equation}\label{equ:posterior_2}
995:     P(\bmath\lambda,\bmath \eta\,|\,\bmath d,\mathbf{M_c},\mathbf
996:       R)=\frac{P(\bmath d\,|\,\bmath \lambda,\bmath \eta,\mathbf{M_c},\mathbf
997:       R)P(\bmath \lambda,\bmath \eta)}{P(\bmath d\,|\,\mathbf{M_c},\mathbf
998:       R)}\,.
999:   \end{equation}
1000:   
1001:   \noi Assuming a prior $P(\bmath \lambda,\bmath \eta)$ that is flat in
1002:   $\log(\lambda_s)$ and $\bmath\eta$, this reduces to maximizing the
1003:   evidence $P(\bmath d\,|\,\bmath\lambda,\bmath
1004:   \eta,\mathbf{M_c},\mathbf R)$ (which here plays the role of the
1005:   likelihood) for $\bmath \eta$ and $\bmath\lambda$. The evidence can
1006:   be computed by marginalizing over the linear parameters $\bmath r$
1007:   %
1008:   \begin{equation}
1009:     P(\bmath d\,|\,\bmath\lambda,\bmath \eta,\mathbf{M_c},\mathbf R)=\int{d\bmath
1010:       r\, P(\bmath d\,|\,\bmath r,\bmath
1011:       \eta,\mathbf{M_c})P(\bmath r\,|\,\bmath\lambda,\mathbf
1012:       R)}\,.
1013:     \label{equ:evidence}
1014:   \end{equation}
1015:   %
1016:   Because of the assumptions we made (Gaussian noise and quadratic
1017:   form of regularization), this integral can be solved analytically
1018:   and yields
1019:   %
1020:   \begin{equation}
1021:     P(\bmath d\,|\,\bmath\lambda,\bmath \eta,\mathbf{M_c},\mathbf R)=
1022:     \frac{Z_M(\bmath\lambda, \bmath \eta)}{Z_d Z_r(\bmath\lambda)}\,,
1023:   \end{equation}
1024:   %
1025:   where
1026:   %
1027:   \begin{equation}
1028:     Z_M(\bmath\lambda, \bmath \eta)=\exp{(-M(\bmath
1029:       r_{\rm MP}))}\left(2\pi\right)^{N_r/2}(\det \ \mathbf A)^{-1/2}\,,
1030:   \end{equation}
1031:   %
1032:  
1033:  \noi  with $\mathbf A=\nabla\nabla M(\bmath r).$ Again we proceed in an
1034:   iterative fashion: using a simulated annealing technique we maximize
1035:   the evidence (\ref{equ:evidence}) for the parameters $\bmath
1036:   \eta$. Every step of the maximisation generates a new model
1037:   $\mathbf{M_c}(\psi(\bmath \eta_i))$, for which the most probable
1038:   source $\bmath s_{\rm{MP}}$ is reconstructed as described in Section
1039:   \ref{sec:inverting}. At this starting step the level of the source
1040:   regularization is set to a relatively large initial value
1041:   $\lambda_{s,0}$; in this way we ensure the solution to be smooth (at
1042:   least at this first level) and the exploration of the $\bmath \eta$
1043:   space to be faster. Subsequently we fix the best model
1044:   $\mathbf{M_c}(\bmath \eta_0)$ found at the previous iteration and,
1045:   using the same technique, we maximize the evidence for the source
1046:   regularization level $\lambda_s$.  The procedure is repeated until
1047:   the total evidence has reached its maximum. In principle we should
1048:   have built a nested loop for $\lambda_s$ at every step of the
1049:   $\bmath \eta$ exploration, but in practice the regularization
1050:   constant only changes slightly with $\bmath \eta$ and the alternate
1051:   loop described above gives a faster way to reach the maximum
1052:   (line-search method).
1053:   
1054:   \noi {\bf (3)} At the third level of inference Bayesian statistics
1055:   provides an objective and quantitative procedure for model
1056:   comparison and ranking on the basis of the evidence,
1057:   \begin{equation}
1058:     P(\mathbf{M_c},\mathbf R\,|\,\bmath d) \propto P(\bmath
1059:     d\,|\,\mathbf{M_c},\mathbf R)P(\mathbf{M_c},\mathbf R)\,.
1060:   \end{equation}
1061:   For a flat prior $P(\mathbf{M_c},\mathbf R)$ (at this level of
1062:   inference we can make little to no assumptions) different models can
1063:   be compared according to their value of $P(\bmath
1064:   d\,|\,\mathbf{M_c},\mathbf R)$, which is related to the evidence of
1065:   the previous level by the following relation
1066:   \begin{equation}
1067:     P(\bmath d\,|\,\mathbf{M_c},\mathbf R)=\int{d\bmath\lambda\, d\bmath
1068:       \eta \,P(\bmath d\,|\,\bmath \lambda,\bmath \eta,\mathbf{M_c},\mathbf
1069:       R) P(\bmath\lambda,\bmath\eta)}\,.
1070:     \label{equ:evidence_integral}
1071:   \end{equation}
1072:   Being multidimensional and highly non-linear, the integral
1073:   (\ref{equ:evidence_integral}) is carried out numerically through a
1074:   Nested-Sampling technique \citep{Skilling04}, which is described in
1075:   more detail in the next section. A by-product of this method is an
1076:   exploration of the posterior probability (\ref{equ:posterior_2}),
1077:   allowing for error analysis of the non-linear parameters and of the
1078:   evidence itself.
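
  \noi Two steps that are only stated above can be written out
  explicitly. The following is a sketch that takes $E_r=\frac{1}{2}\|\mathbf
  R\bmath r\|^2_2$ with the regularization levels absorbed in $\mathbf R$,
  as in the definition of $\mathbf R^{\rm T}\mathbf R$ in Section
  \ref{sec:solvelinear}. First, at the first level of inference,
  differentiating $M(\bmath r)=E_d+E_r$ and using the symmetry of
  $\mathbf{C_d}^{-1}$ gives
  \[
    \frac{\partial M}{\partial \bmath r}=\mathbf{M_c^{\rm T}}
    \mathbf{C_d}^{-1}\left(\mathbf{M_c}\bmath r-\bmath d\right)
    +\mathbf R^{\rm T}\mathbf R\,\bmath r=\bmath 0\,,
  \]
  which rearranges to equation (\ref{equ:src_pot_penalty_bayes}).
  Second, because $M$ is quadratic in $\bmath r$, the Hessian entering
  $Z_M$ at the second level does not depend on $\bmath r$,
  \[
    \mathbf A=\nabla\nabla M=\mathbf{M_c^{\rm T}}\mathbf{C_d}^{-1}
    \mathbf{M_c}+\mathbf R^{\rm T}\mathbf R\,,
  \]
  i.e.\ it is the matrix on the left-hand side of the same equation. If
  a Cholesky factor $\mathbf L$ with $\mathbf A=\mathbf L\mathbf L^{\rm
  T}$ is available (the symbol $\mathbf L$ is introduced here for
  illustration only), one convenient, though not the only, way to
  obtain the determinant is $\log\det\mathbf A=2\sum_i\log L_{ii}$.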
1079:   
1080:   \subsection{Model selection: smooth versus clumpy models}\label{sec:nested sampling} 
1081:   
1082:   In the previous section we introduced the main structure of the
1083:   Bayesian inference for model fitting and model selection. While
1084:   parameter fitting simply determines how well a model matches the
1085:   data and can be easily attained with the relatively simple analytic
1086:   integrations of the first and second level of inference, model
1087:   selection itself requires the highly non-linear and multidimensional
1088:   integral (\ref{equ:evidence_integral}) to be solved.  This
1089:   marginalized evidence can be used to assign probabilities to models
1090:   and to reasonably establish whether the data require or allow
1091:   additional parameters. Given two competing models $\rm M_0$
1092:   and $\rm M_1$ with respective marginalized evidences ${\cal{E}}_0$ and
1093:   ${\cal{E}}_1$, the logarithmic Bayes factor, $\Delta {\cal{E}} \equiv
1094:   \log{\cal{E}}_0 - \log{\cal{E}}_1$, quantifies how well $\rm M_0$ is
1095:   supported by the data when compared with $\rm M_1$ and it
1096:   automatically includes Occam's razor. Typically the literature
1097:   suggests weighing the Bayes factor using Jeffreys' scale
1098:   \citep{Jeffreys61}, which however provides only a qualitative
1099:   indication: $\Delta {\cal{E}} < 1$ is not significant, $1 < \Delta
1100:   {\cal{E}}< 2.5$ is significant, $2.5 < \Delta {\cal{E}}< 5$ is
1101:   strong and $\Delta {\cal{E}} > 5$ is decisive.
1102:   
1103:   
1104:   \noi In order to evaluate this marginalized evidence with a high
1105:   enough accuracy we implemented the new evidence algorithm known as
1106:   Nested Sampling, proposed by \citet{Skilling04}. Specifically, we
1107:   would like to compare two different models: one in which the lens
1108:   potential is smooth and one in which substructures are present, with
1109:   e.g.\ an NFW profile. While the first is defined by the non-linear
1110:   parameters of the lens potential and of the source regularization
1111:   only, the second also allows for three extra parameters: the mass of
1112:   the substructure and its position on the lens plane (see
1113:   Section \ref{sec:test}).
1114:   
1115:   \subsection{Model ranking: nested sampling}
1116:   
1117:   Here, we provide a short description of how the Nested Sampling can
1118:   be used to compute the marginalized evidence and errors on the model
1119:   parameters; a more detailed one can be found in
1120:   \citet{Skilling04}. The Nested-Sampling algorithm integrates the
1121:   likelihood over the prior volume by moving through thin nested
1122:   likelihood surfaces. Introducing the fraction of total prior
1123:   mass $X$, within which the likelihood exceeds ${\cal L^*}$, i.e.
1124:   %
1125:   \begin{equation}
1126:     X=\int_{{\cal{L}}>{\cal{L^*}}}{dX}\,,
1127:   \end{equation}
1128:   %
1129:   with
1130:   %
1131:   \begin{equation}
1132:     dX=P\left(\bmath\lambda,\bmath\eta\right)d\bmath\lambda\,d\bmath\eta\,,
1133:   \end{equation}
1134:   %
1135:   the multi-dimensional integral (\ref{equ:evidence_integral})
1136:   relating the likelihood $\cal{L}$ and the marginalized evidence
1137:   $\cal{E}$ can be reduced to a one-dimensional integral with positive
1138:   and decreasing integrand
1139:   %
1140:   \begin{equation}
1141:     {\cal{E}}=\int_0^1{dX\,{\cal{L}}(X)}\,.
1142:   \end{equation}
1143:   
1144:   \noi Here ${\cal L}(X)$ is the likelihood of the (possibly disjoint)
1145:   iso-likelihood surface in parameter space which encloses a total prior
1146:   mass of $X$. If the likelihood ${\cal{L}}_j={\cal{L}}(X_j)$ can be
1147:   evaluated for each of a given set of decreasing points, $0 < X_j <
1148:   X_{j-1} < \ldots < 1$, then the total evidence ${\cal{E}}$ can be
1149:   obtained, for example, with the trapezoid rule,
1150:   ${\cal{E}}=\sum_{j=1}^m{\cal{E}}_j=\sum_{j=1}^m{\frac{{\cal{L}}_j}{2}}\left(X_{j-1}-X_{j+1}\right)$.
1151:   
1152:   \noi The power of the method is that the values of $X_j$ do not
1153:   have to be explicitly calculated, but can be statistically
1154:   estimated. Specifically, the marginalized evidence is obtained
1155:   through the following iterative scheme:
1156:   
1157:   \noi {\bf (1)} the likelihood ${\cal{L}}$ is computed for N
1158:   different points, called active points, which are randomly drawn
1159:   from the prior volume.
1160:   
1161:   
1162:   \noi {\bf (2)} the point $X_j$ with the lowest likelihood is found
1163:   and the corresponding prior volume is estimated statistically: after
1164:   $j$ iterations the average volume decreases as $ X_j/X_{j-1}=t $,
1165:   where $t$ is the expectation value of the largest of N numbers
1166:   uniformly distributed in the interval $\left(0,1\right)$.
1167:   
1168:   \noi {\bf (3)} the term
1169:   ${\cal{E}}_j=\frac{{\cal{L}}_j}{2}\left(X_{j-1}-X_{j+1}\right)$ is
1170:   added to the current value of the total evidence;
1171:   
1172:   \noi {\bf (4)} $X_j$ is replaced by a new point randomly
1173:   distributed within the remaining prior volume and satisfying the
1174:   condition ${\cal{L}} >  {\cal{L}}^* \equiv {\cal{L}}_j$;
1175:   
1176:   \noi {\bf (5)} the above steps are repeated until a stopping
1177:   criterion is satisfied.
1178:   
1179:   \noi By climbing up the iso-likelihood surfaces, the method, in
1180:   general, finds and quantifies the small region in which the bulk
1181:   of the evidence is located. 
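
  \noi The shrinkage factor used in step (2) can be made explicit. The
  largest of N numbers drawn uniformly from $\left(0,1\right)$ has
  probability density ${\rm N}\,t^{{\rm N}-1}$, so that
  \[
    \langle t\rangle=\int_0^1{t\,{\rm N}\,t^{{\rm N}-1}\,dt}=\frac{\rm
    N}{{\rm N}+1}\,,\qquad \langle\log t\rangle=-\frac{1}{\rm N}\,.
  \]
  After $j$ iterations the enclosed prior mass is therefore expected
  to be $\log X_j\thickapprox -j/{\rm N}$, i.e.\ $X_j\thickapprox
  e^{-j/{\rm N}}$, which is the estimate that replaces an explicit
  calculation of the prior volumes.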
1182: 
1183:   \noi Different stopping criteria can be chosen.  Following
1184:   \citet{Skilling04}, we stop the iteration when $j \gg \rm{N}H$,
1185:   where H is minus the logarithm of that fraction of prior mass which
1186:   contains the bulk of the posterior mass.  In practical terms this
1187:   means that the procedure should be stopped only when most of the
1188:   evidence has been included. Given the areas ${\cal{E}}_j$, in fact,
1189:   the likelihood initially increases faster than the widths decrease,
1190:   until its maximum is reached; across this maximum, located in the
1191:   region $X\thickapprox e^{-H}$, the likelihood flattens off
1192:   and the decreasing widths dominate the increasing
1193:   ${\cal{L}}_j$. Since $X_j\thickapprox e^{-j/\rm{N}}$, it
1194:   takes $\rm{N}H$ iterations to reach the dominating areas.  These
1195:   $\rm{N}H$ iterations are random and are subject to a standard
1196:   deviation uncertainty $\sqrt{\rm{N}H}$, corresponding to a
1197:   standard deviation on the logarithmic evidence of $\sqrt{\rm{N}
1198:   H}/ \rm{N}$:
1199:   
1200:   \begin{equation}
1201:     {\log \cal{E}}=
1202:     \log\left(\sum_j{{\cal{E}}_j}\right)\mathrm{~~~with~~~}
1203:     \sigma_{\log{\cal E}}=\sqrt{\frac{H}{\rm{N}}}\,.
1204:   \end{equation}
1205:    
1206:     \subsubsection{Posterior probability distributions}
1207: 
1208:   
1209:   \noi For the lens parameters, the substructure position and the
1210:   logarithm of the source regularization, priors are chosen to be
1211:   uniform on a symmetric interval around the best values which we have
1212:   determined at the second level of the Bayesian inference. The size
1213:   of the interval is at least one order of magnitude larger than
1214:   the errors on the parameters. In practice, we first carry out a fast
1215:   run of the Nested Sampling with few active points $\rm{N}$, which gives us
1216:   an estimate for the non-linear parameter errors. Using the product
1217:   $2\times N_{\rm dim}\times \sigma_\eta$, where $N_{\rm dim}$ is the
1218:   total number of parameters and $\sigma_\eta$ the corresponding
1219:   standard deviation, we can then roughly enclose the bulk of the
1220:   likelihood (note that this can be double-checked and corrected in
1221:   hindsight, if the posterior probability functions are truncated at
1222:   the prior boundaries). Priors on the parameters are taken in such a
1223:   way that this maximum is fully included in the total integral of the
1224:   marginalized evidence. For the main lens parameters and for the
1225:   regularization constant the same priors are used for models with and
1226:   without substructure. For the substructure mass a flat prior between
1227:   $M_{\rm min}=4.0\times 10^6M_\odot$ and $M_{\rm
1228:   max}=4.0\times 10^9M_\odot$ is adopted, with the two limits given by N-body
1229:   simulations \citep[e.g.][]{Diemand07b, Diemand07a}. In reality,
1230:   the method does not require the parameters to be well known a
1231:   priori, but limiting the exploration to the best-fit region
1232:   substantially reduces the computational effort without significantly
1233:   altering the evidence estimation. From Bayes' theorem we have that
1234:   the posterior probability density $p_j$ is given by
1235:   %
1236:   \begin{equation}
1237:     p_j(t)=
1238:     \frac{{\cal{L}}_j}{2}\left(X_{j-1}-X_{j+1}\right)/{\cal{E}}(t)=w_j/{\cal{E}}(t)\,.
1239:   \end{equation}
1240:   % 
1241:   The existing set of points $\left(\bmath\eta, \bmath\lambda
1242:   \right)_1$,..., $\left(\bmath\eta, \bmath\lambda \right)_{\rm N}$
1243:   then gives us a set of posterior values that can be used to
1244:   obtain mean values and standard deviations on the non-linear
1245:   parameters
1246:   %
1247:   \begin{equation}
1248:     \langle\bmath\eta\rangle=\sum_j{w_j\bmath\eta_j}/\sum_j{w_j}\,,
1249:   \end{equation}
1250:   % 
1251:   and similarly for $\bmath\lambda$. These samples also provide a
1252:   sampling of the full joint probability density
1253:   function. Marginalising over this function, the full marginalized
1254:   probability density distribution of each parameter can be determined
1255:   (see Section 5.5).
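
  \noi The standard deviations follow from the same weights. A minimal
  sketch, assuming the usual weighted second-moment estimator (the
  exact expression used in the implementation is not specified
  above), is
  %
  \begin{equation}
    \sigma^2_{\eta_k}=\sum_j{w_j\left(\eta_{k,j}-\langle\eta_k\rangle\right)^2}/\sum_j{w_j}\,,
  \end{equation}
  %
  where $\eta_{k,j}$ denotes the $k$-th component of $\bmath\eta_j$
  (notation introduced here for illustration). Under the same
  assumption, the marginalized probability that $\eta_k$ falls in a
  bin $B$ can be approximated by
  $\sum_{j:\,\eta_{k,j}\in B}{w_j}/\sum_j{w_j}$.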
1256:   
1257:   \section{Testing and calibrating the method}\label{sec:test}
1258:   
1259:   In this section we describe the procedure to test the method
1260:   introduced above and to assess its ability to detect dark matter
1261:   substructures in realistic data sets (e.g. from HST). A set of mock
1262:   data, mimicking a typical Einstein ring, is created. We generate
1263:   fourteen different lens models, of which $\rm L_0$ is purely
1264:   smooth, $\rm L_{1 \le i < 13}$ are given by the superposition
1265:   of the same smooth potential with a single NFW dark matter substructure of
1266:   varying mass and position and $\rm L_{13}$
1267:   contains two NFW dark matter substructures with 
1268:   the same mass but with different positions (see Table \ref{tab:lenses}).
1269:   A first approximate reconstruction of the source and of the lens potential
1270:   is performed by recovering the best non-linear lens parameters
1271:   $\bmath\eta$ and the level of source regularization
1272:   $\lambda_s$. These values are then used for the linear grid-based
1273:   optimization, which provides initial values of the substructure
1274:   position and mass. Three extra runs of the non-linear optimization are then
1275:   performed to recover the best set
1276:   $\left(\bmath\eta_b,\lambda_{s,b}\right)$ for the main lens and the
1277:   best mass and position of the substructure (solely modelled with a
1278:   NFW density profile). Finally, by means of the Nested-Sampling
1279:   technique described in Section \ref{sec:nested sampling} we
1280:   compute the marginalized evidence, equation (\ref{equ:evidence_integral}), for
1281:   every model twice, once under the hypothesis of a smooth lens and
1282:   once allowing for the presence of one or two extra mass
1283:   substructures. Comparison between these two models allows us to
1284:   assess whether the presence of substructure in the model improves
1285:   the evidence despite the larger number of free parameters.
1286:   
1287:   \subsection{Mock data realisations}
1288:   
1289:   A set of simulated data with realistic noise is generated from a
1290:   model based on the real lens SLACS J1627$-$0055
1291:   \citep{Koopmans06,Bolton06,Treu06}. We assume the lens to be well
1292:   described by a power-law (PL) profile \citep{Barkana98}. Using the
1293:   optimization technique described in Section (\ref{sec:bayes}) we find
1294:   the best set of non-linear parameters
1295:   $\left(\bmath\eta_b,\lambda_{s,b}\right)$. In particular
1296:   $\bmath\eta$ contains the lens strength $b$, and some of the
1297:   lens-geometry parameters: the position angle $\theta$, the
1298:   axis ratio $f$, the centre coordinates $\bmath x_0$ and the density
1299:   profile slope $q$, $\left(\rho \propto r^{-(2q+1)}\right)$. If
1300:   necessary, information about external shear can be included. The
1301:   best parameters are used to create fourteen different lenses and
1302:   their corresponding lensed images. One of the systems is given by a
1303:   smooth PL model while twelve include a dark matter
1304:   substructure with virial mass $\rm M_{vir}=10^7 \rm M_\odot, 10^8
1305:   \rm M_\odot,10^9 \rm M_\odot$ located either on the lowest surface
1306:   brightness point of the ring $P_0$, on a high surface brightness
1307:   point of the ring $P_1$, inside the ring $P_2$ or outside the ring
1308:   $P_3$ (see Table \ref{tab:lenses}). The fourteenth lens
1309:   contains two substructures each with a mass of $\rm M_{vir}=10^8  M_\odot$,
1310:   located respectively in $P_0$ and $P_1$. The substructures are assumed
1311:   to have a NFW profile
1312:   %
1313:   \begin{equation}
1314:     \rho\left(r\right)={\rho_s}{\left(r_s/r\right)\left[1+\left(r/r_s\right)\right]^{-2}}\,,
1315:   \end{equation}
1316:   %
1317:   where the concentration $c=r_{\mathrm {vir}}/r_s$ and the scaling radius $r_s$
1318:   are obtained from the virial mass using the empirical scaling laws
1319:   provided by \citet{Diemand07b, Diemand07a}. The source has an
1320:   elliptical Gaussian surface brightness profile centred on the origin
1321:   %
1322:   \begin{equation}
1323:     s\left(\bmath y\right) = s_0 \exp\left[ - (y_1/\delta y_1)^2 - (y_2/\delta y_2)^2 \right]\,.
1324:   \end{equation}
1325:   %
1326:  We assume $s_0=0.25$, $\delta y_1=0.01$ and $\delta y_2=0.04$. 
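
  \noi For reference, and assuming that $M_{\rm vir}$ denotes the mass
  enclosed within $r_{\rm vir}$ (a convention that is not stated
  explicitly above), integrating the NFW profile gives
  %
  \begin{equation}
    M_{\rm vir}=4\pi\rho_s r_s^3\left[\ln\left(1+c\right)-\frac{c}{1+c}\right]\,,
  \end{equation}
  %
  so that, once $c$ and $r_s$ are fixed by the scaling laws, the
  normalisation $\rho_s$ follows from the chosen virial mass.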
1327:   
1328:   \begin{table}
1329:     \begin{center}
1330:       \caption {Non-smooth (PL+NFW) lens models. At each of the $P_i$
1331: 	positions a NFW perturbation of virial mass $m_{sub}$ is superimposed
1332: 	on a smooth PL mass model distribution.}
1333:       \begin{tabular}{cccc} 
1334: 	\hline Lens&$\bmath x_{sub}$ $\left( \mathrm{arcsec}
1335: 	\right)$&$m_{sub}$ $\left( M_\odot \right)$\\ \hline $\rm
1336: 	L_1$&$P_0= (+0.90 ; +1.19)$&$10^7$\\ $\rm L_2$&&$10^8$\\ $\rm
1337: 	L_3$&&$10^9$\\ \\ $\rm L_4$&$P_1= (-0.50 ; -1.00)$&$10^7$\\ $\rm
1338: 	L_5$&&$10^8$\\ $\rm L_6$&&$10^9$\\ \\ $\rm L_7$&$P_2 = (-0.10 ;
1339: 	-0.60)$&$10^7$\\ $\rm L_8$&&$10^8$\\ $\rm L_9$&&$10^9$\\ \\
1340: 	$\rm L_{10}$&$P_3 = (-0.90 ; -1.40)$&$10^7$\\ $\rm
1341: 	L_{11}$&&$10^8$\\ $\rm L_{12}$&&$10^9$\\ \\
1342: 	$\rm L_{13}$&$P_0$ and $P_1 $&$10^8$\\\hline
1343:       \end{tabular} 
1344:       \label{tab:lenses}
1345:     \end{center}
1346:   \end{table}
1347:   
1348:   \subsection{Non-linear reconstruction of the main lens}
1349:   
1350:   We start by choosing an initial parameter set $\bmath\eta_{0}$ for
1351:   the main lens, which is offset from $\bmath\eta_{\rm true}$ that we
1352:   used to create the simulated data. Assuming the lens does not
1353:   contain any substructure we run the non-linear procedure described
1354:   in Section (\ref{sec:bayes}) and optimize $\{\bmath\eta,\lambda_{s}\}$
1355:   for each of the considered systems. At every step of the
1356:   optimization a new set $\{\bmath\eta_i,\lambda_{s,i}\}$ is obtained
1357:   and the corresponding lensing operator $\mathbf{M_c}(\bmath
1358:   \eta_{i},\lambda_{s,i})$ has to be re-computed. The images are
1359:   defined on an 81 by 81 pixel $\left(N_d= 6561\right)$ regular
1360:   Cartesian grid while the sources are reconstructed on a Delaunay
1361:   tessellation grid of $N_s= 441$ vertices. The number of image
1362:   points, used for the source grid construction, is effectively a form
1363:   of a prior and the marginalized evidence (equation \ref{equ:evidence_integral}) can be used to
1364:   test this choice. To check whether the number of image pixels used 
1365:   can affect the result of our modelling, we consider the smooth lens
1366:   $\rm L_0$ and perform the non-linear reconstruction using one pixel out of every sixteen, nine, four and
1367:   one. In each of these cases we find that the recovered lens parameters agree within their respective errors (see Table~3).
1368:  This suggests that, for this particular case, the number of pixels used does not influence the quality of the reconstruction.
1369:   In real systems, the dynamic range of the lensed images could be much
1370:   higher and a case-by-case choice based on the marginalized evidence has to be made.
1371:   In Fig. \ref{fig:best1_upr}, the  residuals relative to the system $\rm L_1$ are shown; the noise
1372:   level is in general reached and only small residuals are observed at 
1373:   the position of the substructure. 
1374:   Whether the level of such residuals and therefore the relative detection 
1375:   of the substructure are significant is an issue we will address later on in 
1376:   terms of the  total marginalized evidence.
1377:   
1378:   \subsection{Linear reconstruction: substructure detection}\label{sec:linear rec}
1379:   
1380:   The non-linear optimization provides us with a first good
1381:   approximate solution for the source and for the smooth component of
1382:   the lens potential. While this is a good description for the smooth
1383:   model $\rm L_0$ (see Fig. \ref{fig:best_smooth}), the residuals
1384:   (e.g. Fig. \ref{fig:best1_outside_01}) for
1385:   the perturbed models $\rm L_{i\ge1}$ indicate that the
1386:   \emph{no-substructure} hypothesis is improbable and that
1387:   perturbations to the main potential have to be considered. If the
1388:   perturbation is small, this can be done by temporarily assuming that
1389:   $\bmath{\eta}_{i=1}$ reflects the true mass model distribution for the
1390:   main lens and reconstructing the source and the potential correction by
1391:   means of equation (\ref{equ:src_pot_penalty_bayes}). In order to
1392:   keep the potential corrections in the linear regime, where the
1393:   approximation (\ref{equ:src_pot_penalty_bayes}) is valid, both the
1394:   source and the potential need to be initially over-regularised:
1395:   $\lambda_s=10\,\lambda_{s,1}$ and
1396:   $\lambda_{\delta\psi}=3.0\times10^5$ \citep{Koopmans05,
1397:   suyu206}. For each of the possible substructure positions we
1398:   identify the lowest-mass substructure we are able to recover. In the
1399:   two most favourable cases, $\rm L_1$ and $\rm L_4$, in which the
1400:   substructure sits on the Einstein ring a perturbation of $10^7 \rm
1401:   M_\odot$ is readily reconstructed. For these two positions higher
1402:   mass models, with the exception of $\rm L_2$, will not be further analysed. The systems $\rm
1403:   L_{7,8,9}$ and $\rm L_{10,11,12}$, in which the substructure is
1404:   located, respectively, inside and outside the ring, represent more
1405:   difficult scenarios. In the first case all perturbations below $10^9
1406:   \rm M_\odot$ can be mimicked by an increase of the mass of the main
1407:   lens within the ring, while in the second case these cannot be
1408:   easily distinguished from an external shear effect. For the models
1409:   $\rm L_{1,2,4,9,12}$ convergence is reached after 150 iterations and
1410:   the perturbations are recovered near their known position (Figs. 8 and 9). The grid-based
1411:   potential reconstruction indeed leads to a good first
1412:   estimation for the substructure position.
1413: 
1414:   
1415:   
1416:   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1417:   
1418:   \subsection{Non-linear reconstruction: main lens and substructure}\label{sec:non-linear rec}
1419:   
1420:   In order to compare with numerical simulations, the mass of the
1421:   substructure is required. Performing this evaluation with a grid-based
1422:   reconstruction is more complicated and requires some
1423:   assumptions (e.g.\ aperture). To alleviate this problem we assume a
1424:   parametric model, in which the substructures are described by a NFW
1425:   density profile, and we recover the corresponding non-linear
1426:   parameters, mass and position, using the non-linear Bayesian
1427:   optimization previously described.
1428: 
1429:   \noi To quantify the mass and position of the substructure and to
1430:   update the non-linear parameters when a substructure is added, we
1431:   adopt a multi-step non-linear procedure that converges relatively
1432:   quickly to a best PL+NFW mass model. At this stage, we neglect the
1433:   smooth lens $\rm L_0$, for which a satisfactory model has already
1434:   been obtained after the first non-linear optimization, and the
1435:   perturbed models $ \rm L_{7,8,10,11}$ for which the substructure
1436:   could not be recovered. We proceed as follows:
1437:   
1438:   \medskip
1439:   
1440:   \noi {\bf (i)} we fix the main lens parameters to the best values
1441:   found in Section (\ref{sec:linear rec}),
1442:   $\{\bmath\eta_1,\lambda_{\rm s,1}\}$. We set the substructure
1443:   mass to some guess value. We optimize for the substructure position
1444:   $\bmath x_{\rm sub,1}$.
1445:   
1446:   \noi {\bf (ii)} we fix $\{\bmath\eta_1,\lambda_{s,1}\}$ and
1447:   the substructure position $\bmath x_{\rm sub,1}$. We optimize for the
1448:   substructure mass $m_{\rm sub,1}$.
1449:   
1450:   \noi {\bf (iii)} we run the non-linear procedure described in
1451:   Section (\ref{sec:bayes}) by alternately optimising for the main
1452:   lens, source, and substructure parameters and for the level of source
1453:   regularization. 
1454: 
1455:   \medskip
1456:   
1457:   \noi This leads to a new set of parameters, $\{\bmath\eta_{\rm b},
1458:   \lambda_{\rm s,b}, m_{\rm sub,b}, \bmath x_{\rm sub,b}\}$. Final
1459:   results for the considered models are listed in
1460:   Table~3 and the
1461:   corresponding residuals are shown in Figs. \ref{fig:best1_upr}--\ref{fig:best1_outside_01}. For all the considered lenses the final
1462:   reconstruction converges to the noise level. 
1463:   
1464:   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1465: 
1466:    \subsection{Multiple substructures}
1467:    The lens system $\rm L_{13}$ represents a more complex case in which two substructures
1468:    are included. In particular we are interested in testing
1469:    whether both substructures are detectable and whether their effect may be hidden by the 
1470:    presence of external shear. As for the previously considered cases, we first perform a non-linear 
1471:    reconstruction of the main lens parameters assuming a single PL mass model.
1472:    For this particular system we also include the strength $\rm \Gamma_{sh}$ and the position
1473:    angle $\rm \theta_{sh}$ of the external shear as free parameters. Results for this first step of the reconstruction 
1474:    are shown in Fig. \ref{fig:linear_a}. We then run the linear potential
1475:    reconstruction. One of the two substructures is detected although a significant
1476:    level of image residuals is left (Fig. \ref{fig:best1_sub_double}). 
1477:    The combined effect of external shear ($\rm \Gamma_{sh}=-0.031$) and the substructure in $P_1$
1478:    is not sufficient to explain the perturbation generated by the second substructure at the lowest surface 
1479:    brightness point of the Einstein ring. We therefore include a NFW substructure in 
1480:    the recovered position and run a non-linear reconstruction for the new PL+NFW model
1481:    (Fig. \ref{fig:linear_b}). We are then also able to detect the second substructure (Fig. \ref{fig:best2_sub_double}).
1482:    Finally, we run a global non-linear reconstruction for the
1483:    PL+2NFW model (Fig. \ref{fig:linear_c}); the noise level is reached and the strength of the external shear is consistent with zero ($\rm \Gamma_{sh}=0.0001$).
1484:       
1485:   \subsection{Nested sampling: the evidence for substructure}
1486: 
1487:   When modelling systems such as $\rm L_{0}$ or $\rm L_{i\ge1}$ we assume
1488:   that the best recovered values, under the hypothesis of a single
1489:   power-law, provide a good description of the true mass distribution
1490:   and that any residual that remains could be an indication of
1491:   the presence of mass substructure. Model comparison within the
1492:   framework of Bayesian statistics gives us the possibility to test
1493:   this assumption. 
1494:   
1495:   \subsubsection{Marginalized Bayesian evidence}
1496:   
1497:   In order to statistically compare two models the
1498:   marginalized evidence (\ref{equ:evidence_integral}) has to be
1499:   computed. As described in Section (\ref{sec:nested sampling}) this
1500:   multi-dimensional and non-linear integral can be evaluated using the
1501:   Nested-Sampling technique by
1502:   \citet{Skilling04}. Specifically the two mass models we wish to
1503:   compare are a single PL, M$_0$, versus a PL+NWF
1504:   substructure, M$_1$. The first one is completely defined by the
1505:   non-linear parameters $\left(\bmath \eta, \lambda_s\right)$, while the
1506:   second needs three extra parameters, namely the substructure mass
1507:   and position. For all these parameters prior probabilities have to
1508:   be defined:
1509:   %
1510:   \begin{equation}
1511:     P\left(\eta_i\right)= \left\{
1512:     \begin{array}{ll}
1513:       \text{constant} & {\rm ~~~for~~~} |\eta_{\rm {b},i}-\eta_i|\leq\delta\eta_i \\ &\\ 
1514:       0 & {\rm ~~~for~~~} |\eta_{\rm {b},i}-\eta_i| > \delta\eta_i
1515:     \end{array} 
1516:     \right.
1517:   \end{equation}
1518:   and
1519:   \begin{equation}
1520:     P\left( x_{\rm {sub},i}\right)= \left\{
1521:     \begin{array}{ll}
1522:       \text{constant} & {\rm ~~~for~~~} | x_{\rm {sub,b},i}- x_{\rm {sub},i}|\leq\delta
1523:       x_{\rm {sub},i}\\ &\\ 0 & {\rm ~~~for~~~} | x_{\rm {sub,b},i}- x_{\rm {sub},i}| > \delta
1524:       x_{\rm {sub},i}
1525:     \end{array}
1526:     \right.
1527:   \end{equation}
1528:   
1529:   \noi where the elements of $\delta\eta_i$ and $\delta x_{\rm sub,i}$
1530:   are empirically assessed such that the bulk of the
1531:   likelihood is included \citep[see][]{Skilling04}. The prior on the
1532:   substructure mass is flat between the lower and upper mass limits
1533:   given by numerical simulations \citep[e.g.][]{Diemand07b,
1534:   Diemand07a}.  Given the lenses $\rm L_{0,1,2,4,9,12,13}$ we run the
1535:   Nested Sampling twice, once for the single PL model and
1536:   once for the PL+NFW (+NFW) one. The two marginalized evidences with
1537:   corresponding numerical errors are listed in Table~2. Although a number of authors suggest
1538:   the use of Jeffreys' scale \citep{Jeffreys61} for model comparison, we adopt here a
1539:   more conservative criterion. Here $\Delta{\cal {E}}$ denotes the logarithmic evidence of M$_0$ minus that of M$_1$. In particular, we note that the
1540:   perturbed model M$_1$ for the lens system $\rm L_0$ is basically
1541:   consistent with a single smooth PL model M$_0$, with $\Delta{\cal
1542:   {E}}\sim 7.85$. The Bayes factor for the system $\rm L_4$ is of
1543:   the order of $\Delta{\cal {E}} \sim 21.5$ in favour of the smooth
1544:   model M$_0$, indicating that the detection of such a low-mass
1545:   substructure cannot formally be claimed at a significant level. We
1546:   think that this substructure is nevertheless clearly visible in the
1547:   grid-based results because that particular solution is the
1548:   maximum-posterior (MP) solution, whereas the evidence gives the
1549:   integral over the entire parameter space. This implies that there
1550:   must be many solutions near the MP solution that do not show the
1551:   substructure. This indicates that our approach of quantifying the
1552:   evidence for substructure is very conservative.  On the other hand
1553:   the Bayes factor for the lens $\rm L_1$, $\Delta{\cal {E}} = -17.1
1554:   $, clearly shows that the detection of a $10^7 M_\odot$ substructure
1555:   can be significant when the latter is located at a different
1556:   position on the ring. Finally all higher mass perturbations are
1557:   easily detectable independently of their position relative to the
1558:   image ring; the Bayes factors for $\rm L_2$, $\rm L_9$, $\rm L_{12}$ and $\rm L_{13}$
1559:   are, in fact, respectively $\Delta{\cal {E}} = -213.0 $,
1560:   $\Delta{\cal {E}} = -2609.7$, $\Delta{\cal {E}} = -4603.4$ and $\Delta{\cal {E}} = -1835.7$.
1561:   Substructure properties for these systems are also confidently
1562:   recovered.
1563:   The main difference between Jeffreys' scale and our criterion for 
1564:   quantifying the significance level of the substructure detection is observed 
1565:   for the system $\rm L_1$. Indeed, if we adopted Jeffreys' scale, this detection
1566:   would be classified as decisive, whereas we consider it only significant.
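
  \noi As a worked example of how these numbers follow from Table
  \ref{tab:evidence}, consider $\rm L_1$. Taking $\Delta{\cal E}$ as
  the logarithmic evidence of the PL model minus that of the PL+NFW
  model, and assuming that the numerical errors of the two
  Nested-Sampling runs are independent (an assumption made here for
  illustration) so that they add in quadrature,
  %
  \begin{equation}
    \Delta{\cal E}=20366.86-20383.95=-17.09\,,\qquad
    \sigma_{\Delta{\cal E}}=\sqrt{0.34^2+0.30^2}\approx 0.45\,,
  \end{equation}
  %
  i.e.\ the difference in logarithmic evidence is much larger than its
  numerical uncertainty.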
1567:   
1568:   \begin{figure}
1569:     \begin{center} 
1570:       \includegraphics[width=8cm]{fig4}
1571:       \caption{Results of the non-linear optimization for the smooth
1572: 	lens $\rm {L_0}$. The top-right panel shows the original mock
1573: 	data, while the top-left one shows the final
1574: 	reconstruction. On the second row the source reconstruction
1575: 	(left) and the image residuals (right) are shown.}
1576:       \label{fig:best_smooth} 
1577:     \end{center}     
1578:   \end{figure}	
1579:  
1580: 
1581: \subsection{Posterior probabilities}
1582: 
1583:   As discussed in Section (\ref{sec:nested sampling}) an interesting
1584:   by-product of the Nested-Sampling procedure is an exploration of the
1585:   posterior probability (\ref{equ:posterior_2}) which provides us with
1586:   statistical errors on the model parameters, see Tables 3 and 4. The
1587:   relative posterior probabilities for $\rm L_0$, $\rm L_1$ and $\rm
1588:   L_2$ are plotted in Fig.~\ref{fig:smooth_weights},
1589:   Fig.~\ref{fig:pert0001_weights} and
1590:   Fig.~\ref{fig:pert001_weights} respectively.  Let us start by
1591:   considering the lens system $\rm L_0$ and the relative probability
1592:   distribution for the substructure mass. Although the model M$_1$, in
1593:   terms of marginalized evidence, is consistent with the single smooth
1594:   PL model M$_0$, there is a small probability for the presence of a
1595:   substructure with mass up to a few $10^8 M_\odot$ located as far as
1596:   possible from the ring.  The effect of such objects on the lensed
1597:   image would be very small and could be easily hidden by introducing
1598:   artificial features in the source structure, as suggested by the
1599:   posterior distributions for the source regularization constant.
1600:   This means that, from the image point of view, a smooth single PL
1601:   model and a perturbed PL+NFW model with a substructure of $10^8 M_\odot$,
1602:   located far from the ring, are not distinguishable from each other as
1603:   long as the effect of the perturber can be hidden in the structure
1604:   of the source. From a probabilistic point of view, however, the second
1605:   scenario is less likely.  A similar argument can be
1606:   applied to the lens $\rm L_1$ for which a strong degeneracy between
1607:   the mass and the position of the substructure is observed.  We
1608:   conclude therefore that, although this substructure can be detected
1609:   at a statistically significant level, its mass and position cannot
1610:   be confidently assessed yet.  In contrast, for systems such as $\rm
1611:   L_{2,9,12}$, the effect of the substructure is so strong that it cannot
1612:   be mimicked by the source structure or by a different
1613:   combination of the substructure parameters. For these cases not only
1614:   is the detection highly significant, but the properties of the
1615:   perturber can also be confidently constrained with minimal biases.
1616:    
1617:   \begin{table}
1618:     \begin{center}
1619:       \caption{Marginalized evidence and corresponding standard
1620: 	deviation as obtained via the Nested-Sampling
1621: 	integration. Results are shown for the hypothesis of a smooth
1622: 	lens (PL) and the hypothesis of a clumpy lens potential
1623: 	(PL+NFW).}
1624:       \begin{tabular}{cccc} 
1625:       	\hline Lens&Model& $\log {\cal E} \,$&$\sigma_{{\log {\cal E}
1626: 	}}\,$\\ \hline $\rm L_0$ & PL & 26332.70&0.33\\ &
1627: 	PL+NFW &26324.85&0.30\\ \\ $\rm L_1$ & PL
1628: 	&20366.86&0.34\\ &PL+NFW&20383.95&0.30\\ \\ $\rm L_4$
1629: 	& PL &20292.40&0.33\\ & PL+NFW &20270.87& 0.29\\ \\
1630: 	$\rm L_9$ & PL &17669.41&0.45\\ & PL+NFW
1631: 	&20279.13&0.36\\ \\ $\rm L_{12}$ & PL
1632: 	&15786.91&0.33\\ & PL+NFW
1633: 	&20390.35&0.37\\ \\ $\rm L_{13}$ & PL
1634: 	&18509.76&0.24\\ & PL+2 NFW
1635: 	&20346.48&0.49\\ \hline
1636:       \end{tabular} 
1637:       \label{tab:evidence}
1638:     \end{center}
1639:   \end{table}
1640:       
1641: %     \begin{table*}
1642:    % \vbox to220mm{\vfil Landscape table to go here.
1643:     %  \caption{} \vfil}
1644:    % \label{tab:results}
1645:  % \end{table*}
1646:  
1647:      %\begin{table*}
1648:     %\vbox to220mm{\vfil Landscape table to go here.
1649:       %\caption{} \vfil}
1650:     %\label{tab:results}
1651:  % \end{table*}  
1652:       
1653:   \begin{figure*}
1654:     \begin{center} 
1655:       \includegraphics[width=0.45\hsize]{fig5a}
1656:       \hfill
1657:       \includegraphics[width=0.45\hsize]{fig5b}
1658:       \caption{{\bf Left panel:} Results of the first non-linear
1659: 	reconstruction for the smooth component of the perturbed lens
1660: 	L$_1$. The top-right panel shows the original mock
1661: 	data, while the top-left one shows the final
1662: 	reconstruction. On the second row the source reconstruction
1663: 	(left) and the image residuals (right) are shown. {\bf Right
1664: 	panel:} Final results of the non-linear reconstruction for the
1665: 	perturbed lens L$_1$. The top-right panel shows the
1666: 	original mock data, while the top-left one shows the final
1667: 	model reconstruction obtained after a non-linear optimization
1668: 	involving the lens parameters and the substructure position
1669: 	and mass. The recovered source is plotted in the lower-left
1670: 	panel and the image
1671: 	residuals in the lower-right panel.}
1672:       \label{fig:best1_upr} 
1673:     \end{center}
1674:   \end{figure*}
1675:   
1676:   \begin{figure*}
1677:     \begin{center} 
1678:       \includegraphics[width=0.45\hsize]{fig6a}
1679:       \hfill
1680:       \includegraphics[width=0.45\hsize]{fig6b}
1681:       \caption{Same as Figure~\ref{fig:best1_upr}, but for L$_2$.}
1682:       \label{fig:best1_upr_001} 
1683:     \end{center}
1684:   \end{figure*}
1685: 
1686:   
1687:  
1688:   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1689:   
1690:   \section{Conclusions and future work}
1691:   
1692:   We have introduced a fully Bayesian adaptive method for objectively
1693:   detecting mass substructure in gravitational lens galaxies. The
1694:   implemented method has the following specific features:
1695: 
1696:   \begin{itemize}
1697: 
1698:   \item Arbitrary imaging data-sets defined on a regular grid can be
1699:     modelled, as long as only lensed structure is included. The code
1700:     is specifically tailored to high-resolution HST data-sets with a
1701:     compact PSF that can be sampled by a small number of pixels.
1702: 
1703:   \item Different parametric two-dimensional mass-models can be used,
1704:     with a set of free parameters $\bmath \eta$. Currently, we have
1705:     implemented the elliptical power-law density models from
1706:     \citet{Barkana98}, but other models can easily be included.
1707:     Multiple parametric mass models can be simultaneously optimized.
1708:     
1709:   \item A grid-based correction to the parametric potential can
1710:     iteratively be determined for any perturbation that cannot easily
1711:     be modelled within the chosen family of potential models (e.g.\
1712:     warps, twists, mass-substructures, etc.).
1713: 
1714:   \item The source surface-brightness structure is determined on a
1715:     fully adaptive Delaunay tessellation grid, which is updated with
1716:     every change of the lens potential.
1717: 
1718:   \item Both model-parameter optimization and model ranking are fully
1719:     embedded in a Bayesian framework. The method takes special care not
1720:     to change the number of degrees of freedom during the
1721:     optimization, which is ensured by the adaptive source grid. Methods
1722:     with a fixed source surface-brightness grid cannot do this.
1723:     
1724:   \item Both source and potential solutions are regularised, based on
1725:     a smoothness criterion. The choice of regularization can be
1726:     modified and the level of regularization is set by Bayesian
1727:     optimization of the evidence. The data themselves determine what
1728:     level of regularization is needed. Hence overly smooth or overly
1729:     irregular structure is automatically penalised.
1730: 
1731:   \item The maximum-posterior and the full marginalized probability
1732:     distribution function of {\sl all} linear and non-linear
1733:     parameters can be determined, marginalized over all other 
1734:     parameters (including regularization). Hence a full exploration
1735:     of {\sl all} uncertainties of the model is undertaken.
1736: 
1737:   \item The full marginalized evidence (i.e.\ the probability of the
1738:     data given the model) is calculated, which can be used to rank
1739:     {\sl any} set of model assumptions (e.g. pixel size, PSF) or model
1740:     families. In our case, we intend to compare smooth models with
1741:     models that include mass substructure. The marginalized evidence
1742:     implicitly includes Occam's razor and can be used to assess whether
1743:     substructure or any other assumption is justified, compared to a
1744:     null-hypothesis.
1745:   
1746:   \end{itemize}
1747: 
1748:   \noi The method has been tested and calibrated on a set of
1749:   artificial but realistic lens systems, based on the 
1750:   lens system SLACS J1627$-$0055. 
1751: 
1752:   \noi The ensemble of mock data consists of a smooth PL lens and
1753:   thirteen clumpy models including one or two NFW substructures.  Different values
1754:   for the mass and the substructure position have been considered.
1755:   Using the Bayesian optimization strategy developed in this paper we are
1756:   able to recover the smooth PL system and all perturbed models with a
1757:   substructure mass $ \ga 10^7 M_\odot$ when located at the lowest
1758:   surface brightness point on the Einstein ring and with a mass $\geq
1759:   10^9 M_\odot$ when located just inside or outside the ring (i.e.\
1760:   their Einstein rings need to overlap roughly).  For all these models
1761:   we have convincingly recovered the best set of non-linear parameters
1762:   describing the lens potential and objectively set the level of
1763:   regularization. 
1764: 
  \noi Furthermore, our implementation of the Nested-Sampling
  technique provides statistical errors for {\sl all} model parameters
  and allows us to objectively rank and compare different potential
  models in terms of Bayesian evidence, removing subjective choices
  as far as possible: any modelling choice can be ranked
  quantitatively. For each of the lens systems we compare a complete smooth PL
  mass model with a perturbed PL+NFW (+NFW) one.  The method developed here
  allows us to solve simultaneously for the lens potential and the
  lensed source. The latter, in particular, is reconstructed on an
  adaptive grid which is re-computed at every step of the
  optimization, so that the correct number of degrees of freedom
  is taken into account.
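
  \noi As a schematic illustration of this ranking (the notation below is
  assumed here for illustration and is not necessarily that of the
  preceding sections), let $\mathcal{E}_{\rm PL}$ and
  $\mathcal{E}_{\rm PL+NFW}$ denote the marginalized evidences of the
  smooth and of the perturbed model for the same data set. Assuming equal
  prior probabilities for the two models, their relative posterior
  probability is set by
  \begin{equation*}
    \Delta\ln\mathcal{E} \equiv \ln\mathcal{E}_{\rm PL+NFW} - \ln\mathcal{E}_{\rm PL}\,,
  \end{equation*}
  so that $\Delta\ln\mathcal{E}>0$ favours the model with substructure
  and $\Delta\ln\mathcal{E}<0$ favours the smooth one. Because each
  evidence is an integral of the likelihood over the prior volume, the
  additional substructure parameters are penalized automatically unless
  they improve the fit sufficiently.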
1777: 
1778: 
1779:  \begin{figure*}
1780:     \begin{center} 
1781:       \includegraphics[width=0.45\hsize]{fig7a}
1782:       \hfill
1783:       \includegraphics[width=0.45\hsize]{fig7b}
      \caption{Same as Figure~\ref{fig:best1_upr}, but for L$_{12}$.}
1785:        \label{fig:best1_outside_01} 
1786:     \end{center}
1787:   \end{figure*}
1788:   
1789:   \begin{figure*}
1790:     \begin{center} 
1791:       \includegraphics[width=\hsize]{fig8}
      \caption{Results of the linear source and potential
	reconstruction for the lens L$_1$. The first row shows
	the original model (left), the reconstructed model (middle)
	and the current-best source with the corresponding adaptive grid (right). 
	In the second row the image
	residuals (left), the total potential convergence (middle) and
	the substructure convergence (right) are shown. Note 
	that the substructure, although weak, is reconstructed at 
	the correct position.}
1801:       \label{fig:best1_sub_upr} 
1802:     \end{center}
1803:   \end{figure*}
1804:   
1805:    \begin{figure*}
1806:     \begin{center} 
1807:       \includegraphics[width=\hsize]{fig9}
      \caption{Same as Figure~\ref{fig:best1_sub_upr}, but for L$_2$. We note 
	that the substructure is extremely 
	well reconstructed, both in position and in mass.}
1811:       \label{fig:best1_sub_upr_001} 
1812:     \end{center}
1813:   \end{figure*}
1814:   
1815:   
1816:    \begin{figure*}
1817:     \begin{center} 
1818:       \subfigure[]{ \includegraphics[width=0.45\hsize]{fig10a}
1819: 	\label{fig:linear_a}
1820:       }
1821:       \hfill
1822:       \subfigure[]{ \includegraphics[width=0.45\hsize]{fig10b}
1823: 	\label{fig:linear_b}
1824:       }
1825: 
1826:      \subfigure[]{\centering \includegraphics[width=0.45\hsize]{fig10c}
1827: 	\label{fig:linear_c}
1828:       }
1829: 
      \caption{Non-linear reconstruction of the lens $\rm L_{13}$ for a single PL model, a PL+NFW model and
      a PL+2NFW model.}
1832:        \label{fig:best_double} 
1833:     \end{center}
1834:   \end{figure*}
1835:   
1836:   \begin{figure*}
1837:     \begin{center} 
1838:       \includegraphics[width=\hsize]{fig11}
      \caption{Results of the first linear source and potential
	reconstruction for the lens L$_{13}$. The first row shows
	the original model (left), the reconstructed model (middle)
	and the image residuals (right). In the second row the current-best source (left), the total potential convergence (middle) and
	the substructure convergence (right) are shown. Note 
	that the substructure, although weak, is reconstructed at 
	the correct position.}
1846:       \label{fig:best1_sub_double} 
1847:     \end{center}
1848:   \end{figure*}
1849:   
1850:  \begin{figure*}
1851:     \begin{center} 
1852:       \includegraphics[width=\hsize]{fig12}
1853:       \caption{ Results of the second linear source and potential
1854: 	reconstruction for the lens L$_{13}$.  }
1855:       \label{fig:best2_sub_double} 
1856:     \end{center}
1857:   \end{figure*}
1858: 
  \noi In this paper we have considered systems which contain at most two CDM substructures. Although this may appear to be a very small 
  number when compared with predictions from N-body simulations within the virial radius, it represents a realistic scenario. 
  As we have shown, our method, with current HST data, is mostly sensitive 
  to perturbations with mass $\ga 10^7\rm M_\odot$ and located on the Einstein ring ($\Delta\theta\sim\mu\theta_{\rm ER}$). 
  The projected volume that we are able to probe is therefore small compared to the projected volume within the virial radius. 
  The probability that more than two substructures have this right combination of mass and position is relatively low and we expect most 
  real systems to be dominated by one or at most two perturbers.

  \noi Despite these new results, further improvements can still be
  made. We think, for example, that an adaptive source grid based on surface
  brightness, rather than magnification, or on a combination of the two, could be
  more suitable for the scientific problem considered here. 
1870:   
  \noi The method will soon be applied to real systems, for example
  to lenses from the \emph{Sloan Lens ACS Survey} sample of massive early-type galaxies
  \citep{Koopmans06,Bolton06,Treu06}. This will lead to powerful new
  constraints or limits on the fraction and mass distribution of
  substructure. The results will be compared with CDM simulations.
1876: 
  \section*{Acknowledgements} The authors would like to thank Matteo
  Barnab\`e, Oliver Czoske, Antonaldo Diaferio, Phil Marshall, Sherry Suyu and the anonymous referee for useful 
  discussions. They also thank the Kavli Institute for Theoretical Physics 
  for hosting the gravitational lensing workshop in the fall of 2006, during which 
  important parts of this work were carried out. SV and LVEK are supported (in part) through an
  NWO-VIDI program subsidy (project number 639.042.505).
1883: 
1884:   
1885:   \bibliography{ms}
1886: 
1887:   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1888: 
1889:   \begin{figure*} 
1890:     \begin{center} 
1891:       \includegraphics[width=\hsize]{fig13}
      \caption{Posterior probability distributions for the non-linear
	parameters of the smooth lens model $\rm L_0$ as obtained from
	the Nested-Sampling evidence exploration. Results
	for two different models are shown: a smooth PL
	potential (blue histograms) and a perturbed PL+NFW lens
	(black histograms). From the top left, the lens strength, the
	position angle, the axis ratio, the slope, the logarithm of
	the source regularization constant, the substructure mass and
	position are plotted.}
1901:       \label{fig:smooth_weights} 
1902:     \end{center}
1903:   \end{figure*}
1904:   
1905:   \begin{figure*} 
1906:     \begin{center} 
1907:       \includegraphics[width=\hsize]{fig14}
      \caption{Same as Figure~\ref{fig:smooth_weights}, but for L$_1$.}
1909:         \label{fig:pert0001_weights} \end{center}
1910: 	\end{figure*} 
1911: 	
1912: 	\begin{figure*} 
1913:     \begin{center} 
1914:       \includegraphics[width=\hsize]{fig15}
      \caption{Same as Figure~\ref{fig:smooth_weights}, but for L$_2$.}
1916:       \label{fig:pert001_weights} 
1917:     \end{center}
1918:   \end{figure*}
1919:   
1920:  
1921:  \begin{figure*} 
1922:     \begin{center} 
1923:       \includegraphics[width=\hsize]{table_3.ps}
1924:          \end{center}
1925:   \end{figure*}
1926:   
1927:   \begin{figure*} 
1928:     \begin{center} 
1929:       \includegraphics[width=\hsize]{table_4.ps}
1930:     \end{center}
1931:   \end{figure*}
1932:   
1933:  
1934: \clearpage
1935: 
1936: \newpage \label{lastpage} 
1937: 
1938: 
1939: 
1940: \end{document}
1941: