0805:0805.0201/ms.tex

1: \documentclass[letter,useAMS,usenatbib]{mn2e}

2:

3: \usepackage[english]{babel} \usepackage{subfigure}

4: \usepackage{graphicx}

5:

6: \usepackage[fleqn]{amsmath}

7: \usepackage{color}

8:

9: \usepackage[varg]{txfonts}

10:

11: \citestyle{aa}

12:

13: \bibliographystyle{mn2e}

14:

15: \topmargin -1.3cm

16:

17: %%%%%%%%%%%%%%% Author definitions %%%%%%%%%%%%%%%%%%%%%%

18: %%%%% 1. Journals

19:

20: \newcommand{\aj}{AJ} % Astronomical Journal

21: \newcommand{\aap}{A\&A} % Astronomy and Astrophysics

22: \newcommand{\aaps}{A\&AS} % Astronomy and Astrophysics Supplement Series

23: \newcommand{\apj}{ApJ} % Astrophysical Journal

24: \newcommand{\apjs}{ApJS} % Astrophysical Journal Supplement Series

25: \newcommand{\apjl}{ApJL} % Astrophysical Journal Letters

26: \newcommand{\araa}{ARAA} % Annual Reviews in Astronomy and Astrophysics

27: \newcommand{\mnras}{MNRAS} % Monthly Notices of the Royal Astronomical Society

28:

29: \newcommand{\noi}{\noindent}

30:

31: %%%%%%%%%%%%%%% Title %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

32:

33: \title[Strong lensing on adaptive grids]{Bayesian strong gravitational-lens modelling on adaptive

34: grids:\\ objective detection of mass substructure in galaxies}

35:

36: \author[S. Vegetti \& L. V. E.  Koopmans.]{ Simona Vegetti\thanks{E-mail:

37:     vegetti@astro.rug.nl} \& L. V. E.  Koopmans\\ Kapteyn

38:     Astronomical Institute, University of Groningen, PO Box 800,

39:     9700\,AV Groningen, the Netherlands}

40:

41: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

42:

43: \begin{document}

44:

45:   \date{Accepted for publication in MNRAS}

46:

47:   \pagerange{\pageref{firstpage}--\pageref{lastpage}} \pubyear{2008}

48:

49:   \maketitle

50:

51:   \label{firstpage}

52:

53:   \begin{abstract}

54:

55:     We introduce a new adaptive and fully Bayesian grid-based method

56:     to model strong gravitational lenses with extended images. The

57:     primary goal of this method is to quantify the level of luminous

58:     and dark-mass substructure in massive galaxies, through their

59:     effect on highly-magnified arcs and Einstein rings. The method is

60:     adaptive on the source plane, where a Delaunay tessellation is

61:     defined according to the lens mapping of a regular grid onto the

62:     source plane. The Bayesian penalty function allows us to recover

63:     the best non-linear potential-model parameters and/or a grid-based

64:     potential correction and to objectively quantify the level of

65:     regularization for both the source and the potential. In addition,

66:     we implement a Nested-Sampling technique to quantify the

67:     errors on all non-linear mass model parameters -- marginalized

68:     over all source and regularization parameters -- and allow an

69:     objective ranking of different potential models in terms of the

70:     marginalized evidence. In particular, we are interested in

71:     comparing very smooth lens mass models with ones that

72:     contain mass-substructures. The algorithm has been tested on a range

73:     of simulated data sets, created from a model of a realistic

74:     lens system. One of the lens systems is characterized by a smooth

75:     potential with a power-law density profile, twelve

76:     include a Navarro, Frenk and White (NFW) dark-matter substructure of different masses and at

77:     different positions and one contains two NFW dark substructures with the

78:     same mass but with different positions.

79:     Reconstruction of the source and of the lens

80:     potential for all of these systems shows the method is able, in a

81:     realistic scenario, to identify perturbations with masses $\ga 10^7\rm

82:     M_\odot$ when located \emph{on} the Einstein ring. For

83:     positions both inside and outside of the ring, masses of at least

84:     $10^9\rm M_\odot$ are required (i.e. roughly the Einstein ring of

85:     the perturber needs to overlap with that of the main lens). Our

86:     method provides a fully novel and objective test of mass

87:     substructure in massive galaxies.

88:

89: \end{abstract}

90:

91:   \begin{keywords}

92:     gravitational lensing --- dark matter --- galaxies: structure ---

93:     galaxies: haloes

94:   \end{keywords}

95:

96:

97:   \section{Introduction}

98:

99:   At the present time, the most popular cosmological model for

100:   structure formation is the $\Lambda \text{CDM}$ paradigm. While this

101:   model has been very successful in describing the Universe on large

102:   scales and in reproducing numerous observational results

103:   \citep[e.g.,][]{Reiss98, Efstathiou02, Burles01,

104:   Philips01, Jaffe01, Percival01, deBernardis02, Hamilton02, Croft02,

105:   Tonry03, Spergel03, Komatsu08}, important discrepancies still

106:   persist on small scales. In particular, some of these involve the

107:   dark matter distribution within galactic haloes

108:   \citep[e.g.,][]{Moore94, Burkert95, McGaugh98,

109:  Binney01, Blok01, deBlok02, McGaugh03, Simon03,Rhee04,Kuzio06}

110:  and the number of galaxy satellites, i.e.\ the

111:   \emph{Missing Satellite Problem}.

112:

113:   \noi According to the standard scenario, structures form in a

114:   hierarchical fashion via merging and accretion of smaller objects

115:   \citep{Toomre77, Frenk88, White91, Barnes92, Cole00}. As shown by

116:   the latest numerical simulations, in which high mass and force

117:   resolution is achieved, the progenitor population is only weakly

118:   affected by virialization processes and a large number of sub-haloes

119:   is able to survive after merging. The number of substructures

120:   within the Local Group, however, is predicted to be 1--2 orders of

121:   magnitude higher than what is actually observed

122:   \citep[e.g.,][]{Kauffmann93, Moore99, Klypin99,

123:   Moore01,Diemand07b,Diemand07a}.

124:

125:   \noi Two different classes of solutions have been suggested to

126:   alleviate this problem, cosmological and astrophysical. Cosmological

127:   solutions address the basis of the $\Lambda \text{CDM}$ paradigm

128:   itself and mostly concentrate on the properties of the dark matter,

129:   allowing, for example, for a warm \citep{Colin00}, decaying

130:   \citep{Cen01}, self-interacting \citep{Spergel00}, repulsive

131:   \citep{Goodman00}, or annihilating nature

132:   \citep{Riotto00}. Alternatively, the $\Lambda \text{CDM}$ picture can

133:   be modified by the introduction of a break in the power spectrum at

134:   small scales \citep[e.g.,][]{Kamionkowski00, Zentner03}.

135:

136:   \noi From an astrophysical point of view, the number of visible

137:   satellites can be reduced by suppressing the gas collapse/cooling

138:   \citep[e.g.,][]{Bullock00, Kravtsov04, Moore06} via supernova

139:   feedback, photoionization or reionization. This would result in a

140:   high mass-to-light ratio ($M/L$) in the substructures.  If these

141:   high-$M/L$ substructures indeed exist, different methods

142:   for indirect detection are possible. The dark substructure may be

143:   detectable for example through its effects on stellar streams

144:   \citep[e.g.,][]{Ibata02, Mayer02}, via $\gamma$-rays from dark

145:   matter annihilation \citep{Bergstrom99, Calcaneo00, Stoehr03,

146:   Colafrancesco06} or through gravitational lensing \citep[e.g.,][]{Dalal02,

147:   Koopmans05}.

148:

149:   \noi While the first two approaches are limited to the local

150:   Universe, gravitational lensing allows one to explore the mass

151:   distribution of galaxies outside the Local Group and at a relatively

152:   high redshift. Moreover, gravitational lensing is independent of the

153:   baryonic content, of the dynamical state of the system and of the

154:   nature of dark matter. For example, when a point source in a lens system lies close to a fold or cusp caustic, the image fluxes should sum to zero if the sign of the image parities

155:   is taken into account \citep{Blandford86,Zakharov95}. This relation is, however, violated by

156:   many observed lensed quasars with cusp and

157:    fold images.

158:   As first suggested by \citet{Mao98}, these flux ratio anomalies

159:   can be related to the presence of (dark matter) substructure around the

160:   lensing galaxy on scales smaller than the image

161:   separation \citep{Bradac02, Chiba02, Dalal02,

162:   Metcalf02, Keeton03, Kochanek04, Bradac04, Keeton05}.

163:   Nevertheless, subsequent studies of similar

164:   gravitationally lensed systems have shown that

165:   the required mass fraction in substructure is higher than what is

166:   obtained in numerical simulations \citep{Mao04, Maccio06,Diemand07b}. In

167:   addition, for a significant number of cases the observed flux ratio

168:   anomalies can be explained by taking into account the luminous dwarf

169:   satellite population \citep{Trotter00, Ros00,

170:   Koopmans02, Kochanek04, Chen07, McKean07, More08}. Whether the mass fraction

171:   of CDM substructures is quantifiable via flux ratio anomalies is

172:   therefore a question still open for debate. Alternatively,

173:   \citet{Koopmans05} showed that dark matter substructure in lensing

174:   galaxies can be detected by modelling of multiple images or Einstein

175:   rings from extended sources. \\

176:

177:   \noi In this paper, we develop an adaptive grid-based modelling

178:   code for extended lensed sources and grid-based potentials, to fully

179:   quantify this procedure.  The method presented here is a significant

180:   improvement of the techniques introduced by \citet{Warren03},

181:   \citet{Dye05}, \citet{Koopmans05}, \citet{Suyu106},

182:   \citet{suyu206} and \citet{Brewer06}. In order to detect mass substructure in lens

183:   galaxies one needs to solve simultaneously for the source surface

184:   brightness distribution and the lens potential.  A semilinear

185:   technique for the reconstruction of grid-based sources, given a

186:   parametric lens potential, was first introduced by

187:   \citet{Warren03}. The method was subsequently extended by

188:   \citet{Koopmans05} and  \citet{Suyu106} in order to include a

189:   grid-based potential for the lens and by \citet{Barnabe07} to

190:   include galaxy dynamics. \citet{Dye05} introduced an

191:   adaptive gridding on the source plane; this would minimize the

192:   covariance between pixels and decrease the computational

193:   effort. However, the method still lacks an objective procedure

194:   to quantify the level of regularization. \citet{suyu206} and \citet{Brewer06} encoded the

195:   semi-linear method within the framework of Bayesian statistics

196:   \citep{MacKay92, MacKay03}. Although a vast improvement, the fixed

197:   grids do not allow one to take into account the correct number of

198:   degrees of freedom and proper evidence comparison is difficult.

199:   In the implementation described here, these issues have

200:   been solved:

201:

202:   \smallskip

203:

204:   \noi {\bf (i)} the procedure is fully Bayesian; this allows us to

205:   determine the best set of non-linear parameters for a given

206:   potential and the linear parameters of the source, to objectively

207:   set the level of regularization and to compare/rank different

208:   potential families;

209:

210:   \smallskip

211:

212:   \noi {\bf (ii)} using a Delaunay tessellation, the source grid

213:   automatically adapts in such a way that the computational effort

214:   is mostly concentrated in high magnification regions;

215:

216:

217:   \smallskip

218:

219:   \noi {\bf (iii)} the source-grid triangles are re-computed at every

220:   step of the modelling so that the source and the image plane always

221:   perfectly map onto each other and the number of degrees of freedom

222:   remains constant during Bayesian evidence maximisation.

223:

224:   \smallskip

225:

226:   \noi For the first time in the framework of grid-based lensing

227:   modelling, we use the Nested-Sampling technique by

228:   \citet{Skilling04} to compute the full marginalized Bayesian

229:   evidence of the data \citep{MacKay92, MacKay03}.  This approach not

230:   only provides statistical errors on the lens parameters, but also

231:   consistently quantifies the relative evidence of a smooth potential

232:   against one containing substructures.  As such, our method

233:   provides a fully objective way to rank these two hypotheses given

234:   the data, which is the goal set out in this paper.

235:

236:   \noi The paper is organized as follows. In Section 2 we give a

237:   general overview of the data model. In Section 3 we present in

238:   detail how the data model can be inverted and the source and lens

239:   potential reconstructed.  In Section 4 we review the basics of

240:   Bayesian statistics and of the Nested-Sampling technique for

241:   evidence computation.  In Section 5 we describe how the method has

242:   been tested and how its ability to detect substructures,

243:   depending on the perturbation mass and position, has been

244:   studied. Finally in Section 6 conclusions are drawn and future

245:   applications are discussed.

246:

247:   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

248:

249:   \section{Construction of the lensing operators}

250:

251:   In this section, we describe the data model which relates the

252:   unknown source brightness distribution and lens potential to the

253:   known data of the lensed images. The aim is to put this procedure in

254:   a fully self-consistent mathematical framework, excluding as much as

255:   possible any subjective intervention into the modelling.  The core

256:   of the method presented here is based on an Occam's razor argument.

257:   From a Bayesian evidence point of view, correlated features in the

258:   lensed images are most likely due to structure in the source, rather

259:   than being the result of small-scale perturbations of the lens

260:   potential in front of all the lensed images.  On the other hand,

261:   uncorrelated structure in the lensed images is most likely due to

262:   small-scale perturbations of the lens potential.

263:

264:   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

265:

266:   \subsection{The data, source and potential grids}\label{sec:grids}

267:   The main idea of grid-based lensing techniques is to use a

268:   grid-based reconstruction of the source and of the lens potential.

269:   Here we introduce the general geometry of the problem, explicitly

270:   shown in Fig. \ref{fig:grid}.  Consider a lensed image $\bmath d$

271:   of an unknown extended source $\bmath s$. Both $\bmath d$ and

272:   $\bmath s$ are vectors that describe the surface brightness

273:   distributions on a set of spatial points $\bmath x_i^d$ and $\bmath

274:   y_j^s$ in the lens and source plane, respectively

275:   \citep[e.g.,][]{Warren03,Koopmans05, suyu206}. In general, these are

276:   related through the lens equation ${\bmath y_i^d} = {\bmath x_i^d} -

277:   {\bmath \nabla} \psi({\bmath x_i^d})$, where ${\bmath x}_i^d$

278:   corresponds to the spatial position of the surface brightness in the

279:   $i$th element of the vector $\bmath d$, i.e.\ $d_i$, and $\psi({\bmath x_i^d})$

280:   is the lensing potential, which is described in more detail below.

281:   We note that ${\bmath y}_i^d$ does not necessarily coincide with one of the

282:   positions $\bmath y_j^s$ associated with the $j$th brightness value

283:   of the vector $\bmath s$. In our implementation, the grid on the

284:   source plane is fully adaptive and is directly constructed from a

285:   subset of the $N_d$ pixels in the image plane, with spatial

286:   boundaries of the image grid included.  In particular, as shown

287:   schematically in Fig. \ref{fig:grid}, $N_s$ pixels, each located

288:   at a position $\bmath x_j^s$ on the image grid, are cast back to the

289:   source plane, giving the positions $\bmath y_j^s$.

290:   The set of positions $\{ \bmath y_j^s \}$ constitutes

291:   the vertices of a Delaunay triangulation. In this way, we define an

292:   irregular adaptive grid, where vertex positions in the source plane

293:   are related to positions on the image plane via the lens equation

294:   and every vertex value represents an unknown source surface

295:   brightness level.

296:

297:   \noi We assume the lens potential to be the

298:   superposition of a parametric smooth component with linear local

299:   perturbations related to the presence of e.g. CDM substructures or

300:   dwarf galaxies:

301:   %

302:   \begin{equation}

303:     \psi(\bmath x,\bmath \eta)=\psi_s(\bmath x,\bmath

304:     \eta)+\delta\psi(\bmath x).

305:   \end{equation}

306:   %

307:   While $\psi_s(\bmath x,\bmath \eta)$ assumes a parametric form,

308:   with parameters $\bmath\eta$, $\delta \psi(\bmath x)$ is a function

309:   that is pixelized on a regular Cartesian grid of points $\bmath

310:   x_k^{\delta\psi}$ with values

311:   $\delta \psi_k$. The set $\{\delta \psi_k\}$ is written as a vector

312:   $\delta\bmath{\psi}$. Given the observational set of data $\bmath d$,

313:   we now wish to recover the source distribution $\bmath s$ and the

314:   lens potential $\psi({\bmath x}, \bmath\eta)$ simultaneously. To do

315:   this we need to mathematically relate the brightness values $\bmath

316:   d$ to the unknown brightness values $\bmath s$. As described in the

317:   next subsection, this can be done through a linear operation on

318:   $\bmath s$ and $\delta \bmath{\psi}$, where the operator itself is a

319:   function of an initial guess of the lens potential.

320:

321:

322:

323:   \begin{figure}

324:     \begin{center}

325:       \includegraphics[width=\hsize]{fig1}

326:       \caption {A schematic overview of the non-linear source and

327: 	potential reconstruction method, as implemented in this

328: 	paper. On the left-hand side, on the image plane, two grids

329: 	are defined: one for the potential corrections and one for the

330: 	lensed image. A subset of $N_s$ of the $N_d$ image pixels

331: 	located at the positions $\bmath x^s_i$ on the image plane

332: 	(filled circles) is cast back to the source plane (on the

333: 	right) on $\bmath{y}^s_i$ through the lens equation. These

334: 	form the vertices of an adaptive grid on the source plane. The

335: 	remaining image pixels (open circles) are also cast to the

336: 	source plane to the positions $\bmath{y}_i^d$ (we note that

337: 	this set of points includes $\bmath{y}^s_i$). Because the

338: 	surface brightness is conserved, i.e.\ $S(\bmath

339: 	x^d_i)=S(\bmath y^d_i)$, the surface brightness at the open

340: 	circles is represented by a linear superposition of the

341: 	surface brightness at the three triangle vertices that enclose

342: 	it. Similarly the potential correction at a point

343: 	$\bmath{x}_i^{\delta\psi}$ is given by linear interpolation of

344: 	the potential corrections at the surrounding pixels (large

345: 	rectangular pixels on the image plane). }

346:       \label{fig:grid}

347:     \end{center}

348:   \end{figure}

349:

350:

351:   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

352:

353:   \subsection{The source and potential operator}

354:

355:   We now derive the explicit relation between the unknown source

356:   distribution $\bmath s$, the potential correction $\delta

357:   \bmath{\psi}$, the smooth potential $\psi_s(\bmath x,\bmath\eta)$

358:   and the image brightness $\bmath d$.

359:

360:   \noi Consider a generic triangle $\widehat{\rm{ABC}}$ on the source

361:   plane (Fig. \ref{fig:single}), then the source surface brightness

362:   ${s_{\rm P}}$ on a point P, located inside the triangle at the

363:   position ${\bmath y}_{\rm P}^d$, can be related to the surface brightness on

364:   the vertices A, B and C through a simple linear relation

365:   %

366:   \begin{equation}

367:     {s_{\rm P}}=w_{\rm A}{s_{\rm A}}+w_{\rm B}{s_{\rm B}}+w_{\rm

368: 	C}{s_{\rm C}}\,.

369:   \end{equation}

370:   %

371:   \noi An explicit expression for the linear interpolation weights

372:   $w_{\rm{A}}$, $w_{\rm B}$ and $w_{\rm C}$ can be obtained by

373:   considering the point $\rm P_1 $, at the intersection of the line

374:   $\overline{\rm {AP}}$ with the line $\overline{\rm{CB}}$. The source

375:   intensities at P and $\rm P_1$ are also related to each other

376:   through a linear interpolation.  On the other hand, the surface

377:   brightness in $\rm P_1$ is directly related to the values on the

378:   triangle vertices $\rm B$ and $\rm C$

379:   %

380:   \begin{equation}

381:     \left\{

382:     \begin{array}{l}

383:       s_{\rm P} = \frac{d_{\rm {PA} }

384:       }{d_{\rm{P_1A}}}(s_{\rm{P_1}}-s_{\rm A})+s_{\rm A}\\ s_{\rm

385:       {P_1}}= \frac{ d_{ \rm {P_1B} } }{ d_{\rm{CB}} }(s_{\rm

386:       C}-s_{\rm B})+s_{\rm B}

387:     \end{array}

388:     \right.\,

389:     \label{equ:arr}

390:   \end{equation}

391:   %

392:   \noi where $d_{\rm {PA}}$ and $d_{\rm {P_1A}}$ are the absolute

393:   distances between the points P and A and the points $\rm P_1$ and A

394:   respectively; $d_{ \rm{P_1B}}$ and $d_{\rm {CB}}$ are the distances

395:   between the points $\rm {P_1}$ and B and the points C and B

396:   respectively. Solving (\ref{equ:arr}), we obtain the weights

397:   %

398:   \begin{equation}

399:     \left\{

400:     \begin{array}{l}

401:       w_{\rm A}= 1-\frac{d_{\rm {PA}}}{d_{\rm{P_1A}}}\\ w_{\rm B}=

402:       \frac{d_{\rm{PA}}}{d_{\rm{P_1A}}}

403:       \left(1-\frac{d_{\rm{P_1B}}}{d_{\rm{CB}}}\right)\\

404:       w_{\rm C}=

405:       \frac{d_{\rm{PA}}d_{\rm{P_1B}}}{d_{\rm{P_1A}}d_{\rm{CB}}}

406:     \end{array}

407:     \right.\,

408:   \end{equation}

409:   %

410:   \noi with $\sum_{i=\rm A,\rm B,\rm C }{w_i}=1$. Because

411:   gravitational lensing conserves the surface brightness, i.e.\ $S(\bmath

412:   x_i^d) = S(\bmath y_i^d)$, the mapping between the two planes (when

413:   $\delta\bmath\psi=0$) can be expressed as a system of $N_d$ coupled

414:   linear equations

415:   %

416:   \begin{equation}

417:     \mathbf{B\,L}(\bmath \eta)\bmath s =\bmath d + \bmath n\,,

418:     \label{equ: src_linear_blurred}

419:   \end{equation}

420:   %

421:   where $\mathbf L(\bmath \eta)$ and $\mathbf B$ are the lensing and

422:   the blurring operators respectively \citep[see e.g.][]{Warren03,

423:   Treu04, Koopmans05, Suyu106}. The blurring operator is a square

424:   sparse matrix which accounts for the effects of the PSF. Each row of

425:   the lensing operator (a sparse matrix) contains at most the three

426:   linear interpolation weights, $w_{\rm A, B, C}$, placed at the columns that

427:   correspond to the three source vertices that enclose the associated

428:   source position. For a vertex point, there is only one weight equal

429:   to unity. In case $N_s = N_d$ (i.e.\ all image positions are used to

430:   create the source grid), all weights are equal to unity. In this

431:   case, the system of equations is under-constrained and strong

432:   regularization is required.
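
  \noi As a simple illustrative check of the interpolation weights
  derived above (the coordinates are chosen here purely for
  illustration and do not correspond to any system modelled in this
  paper), consider a triangle with vertices ${\rm A}=(0,0)$,
  ${\rm B}=(1,0)$ and ${\rm C}=(0,1)$, and a point ${\rm P}=(1/4,1/4)$.
  The line $\overline{\rm AP}$ intersects $\overline{\rm CB}$ at
  ${\rm P_1}=(1/2,1/2)$, so that
  \[
    \frac{d_{\rm PA}}{d_{\rm P_1A}}=\frac{\sqrt{2}/4}{\sqrt{2}/2}=\frac{1}{2}\,,
    \qquad
    \frac{d_{\rm P_1B}}{d_{\rm CB}}=\frac{\sqrt{2}/2}{\sqrt{2}}=\frac{1}{2}\,,
  \]
  and hence $w_{\rm A}=1/2$ and $w_{\rm B}=w_{\rm C}=1/4$. These weights
  sum to unity and reproduce the position of P, since
  $w_{\rm A}{\rm A}+w_{\rm B}{\rm B}+w_{\rm C}{\rm C}=(1/4,1/4)$; in this
  example they therefore coincide with the barycentric coordinates of
  P. The row of the lensing operator associated with such a point
  would contain the values $1/2$, $1/4$ and $1/4$ in the columns of the
  vertices A, B and C, and zeros elsewhere.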

433:

434:   \noi By pixelating $\delta \psi(\bmath x)$ on a regular Cartesian

435:   grid, a similar argument as for the source can be applied to the

436:   potential correction; all potential values, $\{\delta \psi_k\}$, and

437:   their derivatives on the image plane can be related to this limited

438:   set of points through bilinear interpolation

439:   \citep[see][]{Koopmans05, Suyu08}. It is then possible to derive from

440:   equation~(\ref{equ: src_linear_blurred}) a new set of linear

441:   equations,

442:   %

443:   \begin{equation}

444:     \mathbf{M_c}\left(\bmath{\eta},\bmath\psi\right)\,\bmath r = \bmath d +

445:     \bmath n,

446:     \label{equ: src_pot_linear_blurred}

447:   \end{equation}

448:   %

449:   where

450:   %

451:   \begin{equation}

452:     \bmath r\equiv\left(

453:     \begin{array}{c}

454:       \bmath s\\ \delta\bmath \psi

455:     \end{array}

456:     \right)\,.

457:   \end{equation}

458:   %

459:   \noi More specifically, $\bmath\psi$ is the sum of all the previous

460:   corrections $\delta\bmath\psi$ and the operator $\mathbf{M_c}$ is a

461:   block matrix reading

462:   \begin{equation}

463:     \mathbf{M_c}\equiv \mathbf B \left [\mathbf L(\bmath \eta, \bmath

464:       \psi)\, | -\mathbf{D_s}(\bmath s_{\rm MP})\mathbf {D_{\psi}}

465:       \right]\,.

466:     \label{equ:block_matrix}

467:   \end{equation}

468:   %

469:   \noi ${\mathbf L}({\bmath \eta}, {\bmath \psi})$ is the

470:   lensing operator introduced above, $\mathbf{D_s}(\bmath s_{\rm MP})$

471:   is a sparse matrix whose entries depend on the surface brightness

472:   gradient of the previously-best source model at $\bmath{y}^d_i$ and

473:   $\mathbf{D_\psi}$ is a matrix that determines the gradient of

474:   $\delta\bmath\psi$ at all corresponding points $\bmath{x}^d_i$

475:   \citep[see] [for details]{Koopmans05}. The generic structure of

476:   these matrices is given by

477:   %

478:   \begin{equation}

479:     \mathbf{D_{s}}= \left(

480:     \begin{array}{ccccc}

481:       ...&&& \\ \\ &\frac{\partial S({\bmath y}^d_i)}{\partial y_1}&

482:       \frac{\partial S({\bmath y}^d_i)}{\partial y_2} & \\ \\ &

483:       &\frac{\partial S({\bmath y}^d_{i+1})}{\partial y_1}&

484:       \frac{\partial S({\bmath y}^d_{i+1})}{\partial y_2} \\ \\ & & & & ...\\

485:     \end{array}

486:     \right)

487:   \end{equation}

488:   %

489:   and

490:   %

491:   \begin{equation}

492:     \mathbf{D_{\psi}}= \left(

493:     \begin{array}{ccccc}

494:       ...&\\ \\ &\frac{\partial \delta\psi (\bmath{x}^d_i)}{\partial

495:       x_1}&\\ &\frac{\partial \delta\psi (\bmath{x}^d_i)}{\partial

496:       x_2} & \\ \\ &&\frac{\partial \delta\psi (\bmath{x}^d_{i+1})}{\partial

497:       x_1} \\ &&\frac{\partial \delta\psi (\bmath{x}^d_{i+1})}{\partial

498:       x_2} \\ & & & &...\\

499:     \end{array}

500:     \right)

501:   \end{equation}

502:   %

503:   where the index $i$ runs along all the $\bmath{x}_i^d$ and $\bmath{y}_i^d$,

504:   i.e. triangle vertices included. The ``functions'' $S$ and $\delta

505:   \psi$ and their derivatives can be obtained through bilinear

506:   interpolation and finite differencing from $\bmath s$ and $\delta

507:   \bmath \psi$, respectively.

508:

509:   \noi It is clear from the structure of these matrices that the

510:   first-order correction to the model, as a result of $\delta \psi$,

511:   is equal to $\delta d_i= -\bmath {\nabla} S(\bmath{y}^d_i) \cdot

512:   \bmath{\nabla} \delta \psi(\bmath{x}^d_i)$ at every point

513:   $\bmath{x}^d_i$ \citep[see e.g.][for a derivation]{Koopmans05}.
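
  \noi The origin of this expression can be sketched as follows,
  assuming that $\delta\psi$ is a small perturbation (the symbol
  $\bmath{y}_0$ is introduced here only for this illustration). Let
  $\bmath{y}_0=\bmath{x}-\bmath{\nabla}\psi(\bmath{x})$ be the source
  position to which an image position $\bmath{x}$ maps under the
  current potential. Adding a correction $\delta\psi$ shifts this
  position to
  $\bmath{y}=\bmath{y}_0-\bmath{\nabla}\delta\psi(\bmath{x})$. Because
  surface brightness is conserved, a first-order Taylor expansion of
  the source around $\bmath{y}_0$ gives
  \[
    S(\bmath{y})\simeq S(\bmath{y}_0)-\bmath{\nabla}S(\bmath{y}_0)\cdot
    \bmath{\nabla}\delta\psi(\bmath{x})\,,
  \]
  so that the change in the model image is the second term on the
  right-hand side, which is linear in $\delta\psi$. Higher-order terms
  are neglected in this sketch, which is why the correction is applied
  iteratively.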

514:

515:   \noi As for the surface brightness itself, the first derivatives at

516:   a generic point P on the source plane can also be expressed as functions

517:   of the corresponding values at the triangle vertices A, B and C, yielding

518:   %

519:   \begin{eqnarray}

520:     \frac{\partial {s_{\rm P}}}{\partial y_{1}} & = &w_{\rm

521:       A}\frac{\partial {s_{\rm A}}}{\partial y_{1}}+w_{\rm

522:       B}\frac{\partial {s_{\rm B}}}{\partial y_{1}}+w_{\rm

523:       C}\frac{\partial {s_{\rm C}}}{\partial y_{1}}\nonumber\\

524:       \frac{\partial {s_{\rm P}}}{\partial y_{2}} & = &w_{\rm

525:       A}\frac{\partial {s_{\rm A}}}{\partial y_{2}}+w_{\rm

526:       B}\frac{\partial {s_{\rm B}}}{\partial y_{2}}+w_{\rm

527:       C}\frac{\partial {s_{\rm C}}}{\partial y_{2}}

528:   \end{eqnarray}

529:   %

530:   For the generic vertex $j= \rm{A, B,C}$ these are given by

531:   $\frac{\partial s_j}{\partial y_{1}}=-\frac{n_0}{n_2}$

532:   and $\frac{\partial s_j}{\partial

533:   y_{2}}=-\frac{n_1}{n_2}$, where  $\bmath{N}\equiv(n_0,n_1,n_2)$ is the

534:   unit-length surface normal vector at the vertex $j$ and is defined

535:   as the average of the adjacent per-face normal vectors. For

536:   $\delta\bmath\psi$ and its gradients, on a rectangular grid with

537:   rectangular pixels, we follow \cite{Koopmans05}.
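
  \noi The relation between the vertex gradients and the normal vector
  can be understood with a short sketch, in which $\bmath{N}$ is
  interpreted as the normal of the tangent plane to the surface
  $s(y_1,y_2)$ at the vertex $j$ (this interpretation, and the
  coordinates $(y_{1,j},y_{2,j})$ of the vertex, are assumptions
  introduced here for illustration). The plane through the vertex that
  is orthogonal to $\bmath{N}$ satisfies
  \[
    n_0\,(y_1-y_{1,j})+n_1\,(y_2-y_{2,j})+n_2\,(s-s_j)=0\,,
  \]
  and, for $n_2\neq 0$, solving for $s$ and differentiating gives
  $\partial s/\partial y_1=-n_0/n_2$ and
  $\partial s/\partial y_2=-n_1/n_2$. As a check, a locally planar
  source $s=a\,y_1+b\,y_2+c$ has a normal proportional to $(-a,-b,1)$,
  which returns the expected gradients $a$ and $b$.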

538:

539:   \begin{figure}

540:     \begin{center}

541:       \subfigure[]{\centering \includegraphics[width=4.5cm]{fig2a}

542: 	\label{fig:single}

543:       }

544:       \hspace{.5in}

545:

546:

547:       \subfigure[]{ \includegraphics[width=3cm]{fig2b}

548: 	\label{fig:double_x}

549:       }

550:       \hspace{.25in} \subfigure[]{

551:       \includegraphics[width=3cm]{fig2c}

552: 	\label{fig:double_y}

553:       }

554:

555:       \caption{Generic triangles from the

556: 	source grid. Both the source surface brightness and its

557: 	derivatives at the points P, $\rm P_1$ and $\rm P_2$ are given

558: 	by linear superposition of the values at the vertices of the

559: 	surrounding triangles.}

560:       \label{fig:triangles}

561:

562:     \end{center}

563:   \end{figure}

564:

565:

566:   \section{Inverting the data model}\label{sec:inverting}

567:

568:   \noi As shown above, in both cases of solving for the source alone,

569:   or solving for the source plus a potential correction, a {\sl linear

570:   data model} can be constructed. In this section, we give a

571:   general overview of how this set of linear equations can be

572:   (iteratively) solved. A more thorough Bayesian description and

573:   motivation can be found in Section~4.

574:

575:   \subsection{The penalty function}

576:   Before we go into the details of the method, we first restate that

577:   for a given lens potential $\psi(\bmath x, {\bmath \eta})$ and

578:   potential correction $\bmath \psi_n = \sum^n_{i=1}

579:   \delta {\bmath \psi_i}$, on a grid, the source surface brightness vector

580:   $\bmath s$ and the data vector $\bmath d$ can be related through a

581:   linear (matrix) operator

582:   %

583:   \begin{equation}

584:     \mathbf {M_c}({\bmath \eta}, {\bmath \psi}_{n-1}, \bmath

585:     s_{n-1})\bmath r_n={\bmath d} + {\bmath n},

586:     \label{equ: src_linear}

587:   \end{equation}

588:   now explicitly written with its dependencies on the source and

589:   potential and with

590:   \begin{equation}

591:     \bmath r_n= \left(\begin{array}{c}\bmath s_{n} \\

592:       \delta\bmath\psi_n \\

593:     \end{array}

594:     \right).

595:   \end{equation}

596:   %

597:   In this equation $\bmath s_n$ is a model of the source

598:   brightness distribution at a given iteration $n$ (we describe the

599:   iterative scheme shortly). We assume the noise $\bmath n$ to be

600:   Gaussian, which is a good approximation for the HST images the

601:   method will be applied to. Even in case of deviations from Gaussianity,

602:   the central limit theorem, for many data points, ensures that the probability density

603:   distribution is often well approximated by a Normal distribution. \\

604:   \noi Because of the ill-posed nature of this relation,

605:   equation (\ref{equ: src_linear}) cannot simply be inverted. Instead a

606:   penalty function which expresses the mismatch between the data and

607:   the model has to be defined by

608:   \begin{equation}\label{eqn:penalty}

609:     P(\bmath s,\delta \bmath \psi \,|\, {\bmath \eta}, {\bmath \lambda},

610:     {\bmath s}_{n-1}, {\bmath

611:     \psi}_{n-1})=\chi^2+\lambda_s^2\|\mathbf{H_s} \bmath s\|^2_2

612:     +\lambda_{\delta\psi}^2 \|\mathbf{H_{\delta\psi}} \delta\bmath

613:     \psi\|^2_2\,,

614:   \end{equation}

615:   with

616:   \begin{equation}\label{eqn:chi2}

617:     \chi^2 = [\mathbf {M_c}({\bmath \eta}, \bmath \psi_{n-1}, \bmath

618:     s_{n-1})\, \bmath r - {\bmath d}]^{\rm T} \, {\mathbf {C_d^{-1}}} \,

619:     [\mathbf {M_c}({\bmath \eta}, \bmath \psi_{n-1}, \bmath

620:     s_{n-1})\,\bmath r - {\bmath d}].

621:   \end{equation}

622:

623:   \noi The second and third terms in the penalty function contain prior

624:   information, or beliefs, about the smoothness of the source and of

625:   the potential, respectively, and $\mathbf{C_d}$ is the diagonal

626:   covariance matrix of the data. The level of regularization is set by

627:   the regularization parameters $\bmath \lambda$, one for the source and one

628:   for the potential \citep[see][for a more general

629:   discussion]{Koopmans05, suyu206}.  In a Bayesian framework, this

630:   penalty function is related to the posterior probability of the

631:   model given the data (see Section 4). In the following two sections

632:   we describe how to solve for the linear and non-linear parameters of

633:   the penalty function (except for $\bmath \lambda$, which is described

634:   in Section 4).

635:

636:   \subsubsection{Solving for the linear parameters}

637:   \label{sec:solvelinear}

638:   The most probable solution, $\bmath{r_{\rm MP}}$, minimizing the

639:   penalty function is obtained by solving the set of linear equations

640:   \begin{equation}

641:     (\mathbf{M_c^T C_d^{-1}M_c+R^T R})\,\bmath

642:     r=\mathbf{M_c^TC_d^{-1}}\bmath d.

643:     \label{equ: src_pot_penalty}

644:   \end{equation}

645:   The regularization matrix is given by

646:   \begin{equation}

647:     {\mathbf R^{\rm T}} {\mathbf R} = \left(

648:     \begin{array}{cc}

649:       \lambda_s^2\mathbf{H_s^{\rm T}} \mathbf{H_s} & \\ &

650:       \lambda^2_{\delta\psi}\mathbf{H_{\delta\psi}^{\rm T}}

651:       \mathbf{H_{\delta\psi}}

652:     \end{array} \right).

653:   \end{equation}

654:

655:   \noi The solution of this symmetric positive definite set of

656:   equations can be found using e.g.\ a Cholesky decomposition

657:   technique. By solving equation (\ref{equ: src_pot_penalty}), adding

658:   the correction $\delta \bmath \psi_n$ to the previously-best

659:   potential $\bmath \psi_{n-1}$ and iterating this procedure, both the

660:   source and the potential should converge to the minimum of the

661:   penalty function $P(\bmath s_n,\delta \bmath \psi_{n} \,|\, {\bmath

662:   \eta}, {\bmath \lambda}, {\bmath s}_{n-1}, {\bmath \psi}_{n-1})$. At

663:   every step of this iterative procedure the matrices $\mathbf {M_c}$

664:   and $\mathbf R$ have to be recalculated for the new updated

665:   potential $\bmath \psi_n$ and source $\bmath s_n$. While the

666:   potential grid points are kept spatially fixed in the image plane,

667:   the Delaunay tessellation grid of the source is re-built at every

668:   iteration to ensure that the number of degrees of freedom is kept

669:   constant during the entire optimization process.
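
\noi As an illustration only, the linear step of equation (\ref{equ: src_pot_penalty}) can be sketched in Python with NumPy. This is a minimal toy example and not the code used for this work: the random operator \texttt{M}, the first-difference matrix \texttt{H}, the noise level, the value of \texttt{lam} and the function name \texttt{solve\_regularized} are all assumptions made for the sketch, and a single regularization block stands in for the block-diagonal $\mathbf{R^{\rm T}R}$.

```python
import numpy as np

def solve_regularized(M, C_inv, RtR, d):
    """Solve (M^T C^-1 M + R^T R) r = M^T C^-1 d with a Cholesky factorization."""
    A = M.T @ C_inv @ M + RtR          # symmetric positive definite by assumption
    b = M.T @ C_inv @ d
    L = np.linalg.cholesky(A)          # A = L L^T, L lower triangular
    y = np.linalg.solve(L, b)          # solve L y = b
    return np.linalg.solve(L.T, y)     # solve L^T r = y

# Toy problem (all values are illustrative assumptions).
rng = np.random.default_rng(1)
n_data, n_par = 40, 10
M = rng.normal(size=(n_data, n_par))             # stand-in for M_c
r_true = np.sin(np.linspace(0.0, np.pi, n_par))  # smooth "true" parameters
sigma = 0.05
d = M @ r_true + sigma * rng.normal(size=n_data)
C_inv = np.eye(n_data) / sigma**2                # inverse of the diagonal covariance
H = np.diff(np.eye(n_par), axis=0)               # first-difference (gradient) operator
lam = 1.0
RtR = lam**2 * H.T @ H

r_mp = solve_regularized(M, C_inv, RtR, d)
```

\noi The dense factorization is adequate for this toy size; for large operators with many zero entries a sparse factorization would presumably be the more natural choice.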

670:

671:   \noi Note that because the source and the potential corrections are

672:   independent, they require their own form ($\mathbf H$) and level

673:   ($\lambda$) of regularization.  The most common forms of

674:   regularization are the zeroth-order, the gradient and the

675:   curvature. As shown by \citet{suyu206} the best form depends on the

676:   nature of the source distribution and can be assessed via Bayesian

677:   evidence maximisation. For the source, we chose the curvature

678:   regularization defined for the Delaunay tessellation of the source

679:   plane.

680:

681:   \noi Specifically, one can combine the gradient and curvature

682:   matrices in the $y_1$ and $y_2$ directions: $\mathbf{H_{s}^{\rm

683:   T}}\mathbf{H_{s}}=\mathbf{H_{s,y_1}^{\rm

684:   T}}\mathbf{H_{s,y_1}}+\mathbf{H_{s,y_2}^{\rm T}}\mathbf{H_{s,y_2}}$.

685:   Both $\mathbf{H_{s,y_1}}$ and $\mathbf{H_{s,y_2}}$ can be obtained

686:   by analogy by considering the pair of triangles in

687:   Fig.~\ref{fig:double_x} and Fig.~\ref{fig:double_y}

688:   respectively.

689:

690:   \noi For every generic point C on the source plane we consider the

691:   pair of triangles $\widehat{\rm{ABC}}$ and $\widehat{\rm{DCE}}$ and

692:   define the curvature in C in the $y_1$ direction as:

693:   %

694:   \begin{equation}

695:     {s''_{C,y_1}}

696:     \equiv \frac{1}{d_{CP}}({s_P}-{s_C}) -\frac{1}{d_{CQ}}({s_C}-{s_Q})\,.

697:     \label{equ:curvature}

698:   \end{equation}

699:   This is not the second derivative, but we find that this alternative

700:   curvature definition gives much better results than using the second

701:   derivative directly. The reason is that it gives equal weight to all

702:   triangles, independently of their relative sizes (for identical

703:   rectangular pixels this problem does not arise since the above

704:   definition is equal to the second derivative up to a proportionality

705:   constant). A much smoother solution in that case is obtained.

706:

707:   \noi P and Q

708:    are given by intersecting the line

709:   $\overline{\rm{CP_1}}$ with the line $\overline{\rm{ED}}$ and the

710:   line $\overline{\rm{CP_2}}$ with the line $\overline{\rm{AB}}$

711:   respectively. Specifically, $\rm{P_1}$ and $\rm{P_2}$ are defined as

712:   very small displacements from the point C in the $y_1$ direction %

713:   \begin{eqnarray}

714:     y_{2}^{\rm{P_1}}      & = & y_{2}^{\rm{P_2}} =  y_{2}^{\rm C}\nonumber\\

715:     y_{1}^{\rm{P_{1,2}}}  & = & y_{1}^{\rm C}  \pm \delta y_1.

716:   \end{eqnarray}

717:   %

718:   The source surface brightness at P and Q can be obtained by

719:   linear interpolation between the source values at D and E,

720:   and at A and B, respectively

721:   %

722:   \begin{eqnarray}

723:     s_{\rm P}&=&\frac{d_{\rm{PD}}}{d_{\rm{ED}}}(s_{\rm E}-s_{\rm

724:       D})+s_{\rm D}\label{equ:s_p} \\ s_{\rm

725:       Q}&=&\frac{d_{\rm{QA}}}{d_{\rm{AB}}}({s_{\rm B}}-s_{\rm

726:       A})+s_{\rm A}\label{equ:s_q}\,.

727:   \end{eqnarray}

728:   %

729:   \noi Substituting equations (\ref{equ:s_p}) and (\ref{equ:s_q}) into

730:   equation (\ref{equ:curvature}) gives

731:   %

732:   \begin{multline}

733:     {s''_{C,y_1}}=-\left(\frac{1}{d_{\rm

734:       {CP}}}+\frac{1}{d_{\rm {CQ}}}\right){s_{\rm C}}+\frac{d_{\rm

735:       PD}}{d_{\rm CP}d_{\rm DE}}s_{\rm E}+\\ \frac{d_{\rm

736:       {QA}}}{d_{\rm{CQ}}d_{\rm{AB}}}s_{\rm B}+\frac{d_{\rm{PE}}}

737:       {d_{\rm{CP}}{d_{\rm{DE}} }}s_{\rm D}+\frac{d_{\rm

738:       {QB}}}{d_{\rm{CQ}}d_{\rm{AB}}}s_{\rm A}\,.

739:   \end{multline}

740:   %

741:   \noi Each row of the regularization matrix $\mathbf{H_{s,y_1}}$, corresponding to every

742:   point C, contains the five interpolation weights, placed at the

743:   columns that correspond to the five vertices A, B, C, D and

744:   E. The curvature in the $y_2$ direction is derived in an analogous

745:   way using the pair of triangles in Fig. \ref{fig:double_y}. We

746:   refer again to \citet{Koopmans05} for details on the

747:   potential regularization matrix $\mathbf{ H_{\delta \psi}}$.

748:
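
\noi The construction of one row of $\mathbf{H_{s,y_1}}$ can be illustrated with the short Python sketch below. It is a hedged example rather than the implementation used here: the function names \texttt{curvature\_row\_y1} and \texttt{\_intersect\_horizontal} and the vertex coordinates are hypothetical, and the sketch assumes that neither $\overline{\rm{AB}}$ nor $\overline{\rm{ED}}$ is parallel to the $y_1$ axis. The five weights are those multiplying $s_{\rm A}$, $s_{\rm B}$, $s_{\rm C}$, $s_{\rm D}$ and $s_{\rm E}$ in the expanded form of equation (\ref{equ:curvature}).

```python
import numpy as np

def _intersect_horizontal(c, u, v):
    """Intersect the line y2 = c[1] (through C, along y1) with the line through u and v."""
    t = (c[1] - u[1]) / (v[1] - u[1])   # assumes u and v differ in y2
    return u + t * (v - u)

def curvature_row_y1(a, b, c, d, e):
    """Weights of (s_A, s_B, s_C, s_D, s_E) in the y1 'curvature' at vertex C."""
    a, b, c, d, e = (np.asarray(p, dtype=float) for p in (a, b, c, d, e))
    p = _intersect_horizontal(c, d, e)   # P lies on the line ED
    q = _intersect_horizontal(c, a, b)   # Q lies on the line AB
    dist = lambda u, v: float(np.hypot(*(u - v)))
    d_cp, d_cq = dist(c, p), dist(c, q)
    d_de, d_ab = dist(d, e), dist(a, b)
    w_a = dist(q, b) / (d_cq * d_ab)
    w_b = dist(q, a) / (d_cq * d_ab)
    w_c = -(1.0 / d_cp + 1.0 / d_cq)
    w_d = dist(p, e) / (d_cp * d_de)
    w_e = dist(p, d) / (d_cp * d_de)
    return np.array([w_a, w_b, w_c, w_d, w_e])

# Hypothetical vertex positions (y1, y2) around C = (0, 0).
weights = curvature_row_y1(a=(-1, -1), b=(-1, 1), c=(0, 0), d=(2, -1), e=(2, 3))
```

\noi When P and Q fall between their respective vertices the five weights sum to zero, so a constant source carries no regularization penalty; in this particular geometry a source that is linear in $y_1$ is also unpenalized.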

749:   \subsubsection{Solving for the non-linear parameters}

750:   \label{sec:solvenonlinear}

751:   In order to recover the non-linear parameters $\bmath \eta$, we need

752:   to minimize the penalty function $P(\bmath s, {\bmath \eta}\,|\,

753:   {\bmath \lambda}, {\bmath \psi})$. We allow for a correction,

754:   $\bmath \psi$, to the parametric potential $\psi(\bmath \eta,\bmath

755:   x)$ (not necessarily zero), but do not allow it to be changed while

756:   optimising for $\bmath s$ and ${\bmath \eta}$. In all cases, we keep

757:   $\bmath \lambda$ fixed during the optimization. Given an

758:   initial guess for the non-linear parameters $\bmath \eta_0$, we then

759:   minimize the penalty function defined in Section

760:   \ref{sec:solvelinear}, under the conditions outlined above

761:   ($\bmath\psi$ is constant and $\delta\bmath\psi \equiv \bmath 0$).

762:   We use a non-linear optimizer \citep[in our case Downhill-Simplex

763:   with Simulated Annealing;][]{Press92}, to change $\bmath \eta$ at

764:   every step and to minimize the joint penalty function $P(\bmath s,

765:   {\bmath \eta}\,|\, {\bmath \lambda}, {\bmath \psi})$.  The

766:   optimization of $\bmath s$ is implicitly embedded in the

767:   optimization of $\bmath \eta$ by solving equation (\ref{equ:

768:   src_pot_penalty}) only for $\bmath s$, every time $\bmath \eta$ is

769:   modified.
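
\noi The nesting of the linear solve inside the non-linear search can be sketched as follows. This toy example is only a sketch under stated assumptions: a brute-force scan over a single parameter replaces the Downhill-Simplex with Simulated Annealing optimizer, and the operator \texttt{lens\_operator} (two ``images'' of a one-dimensional source, the second shifted by \texttt{eta}), the grids and all numerical values are hypothetical. What it does show is that the source is re-solved through the linear system every time the non-linear parameter changes.

```python
import numpy as np

def lens_operator(eta, x, y, width=0.15):
    """Toy operator: two 'images' of one source, the second shifted by eta."""
    def kernel(shift):
        return np.exp(-0.5 * ((x[:, None] - y[None, :] - shift) / width) ** 2)
    return np.vstack([kernel(0.0), kernel(eta)])

def penalty(eta, d, x, y, sigma, lam, HtH):
    """Solve for the source at fixed eta and return (penalty value, source)."""
    M = lens_operator(eta, x, y)
    A = M.T @ M / sigma**2 + lam**2 * HtH
    s = np.linalg.solve(A, M.T @ d / sigma**2)     # inner, linear step
    resid = M @ s - d
    return resid @ resid / sigma**2 + lam**2 * (s @ HtH @ s), s

# Simulated data (illustrative values only).
rng = np.random.default_rng(2)
x = np.linspace(-2.0, 2.0, 60)                      # image-plane pixels
y = np.linspace(-1.0, 1.0, 15)                      # source-plane pixels
s_true = np.exp(-0.5 * (y / 0.3) ** 2)
eta_true, sigma, lam = 0.4, 0.01, 1.0
d = lens_operator(eta_true, x, y) @ s_true + sigma * rng.normal(size=2 * x.size)
H = np.diff(np.eye(y.size), n=2, axis=0)            # second-difference operator
HtH = H.T @ H

# Outer, non-linear step: scan eta and keep the value with the lowest penalty.
eta_grid = np.linspace(0.0, 0.8, 81)
values = [penalty(eta, d, x, y, sigma, lam, HtH)[0] for eta in eta_grid]
eta_best = eta_grid[int(np.argmin(values))]
```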

770:

771:   \subsection{The optimization strategy}\label{sec:strategy}

772:

773:   We have implemented a multi-fold optimization scheme for solving the

774:   linear equation (\ref{equ: src_linear}). This scheme is not unique,

775:   but stabilises the numerical optimization of this rather complex set

776:   of equations. Solving all parameters simultaneously would be

777:   computationally prohibitive and usually shows poor convergence

778:   properties.

779:

780:   \subsubsection{Optimization steps}

781:

782:   Our optimization scheme is similar to a {\sl line-search}

783:   optimization, where consecutively different sets of unknown

784:   parameters are being kept fixed, while the others are optimized

785:   for. The sets $\{\delta \bmath \psi, \bmath s\}$, $\{\bmath \eta,

786:   \bmath s \}$ and $\{\bmath \lambda, \bmath s \}$ define the three

787:   different groups of parameters, of which only one is solved for at

788:   once. The individual steps, in no particular order, are then:

789:

790:   \noi {\bf (i)} {We assume $\bmath \eta$

791:     and $\bmath \lambda$ to be constant vectors and iteratively solve

792:     for $\delta\bmath\psi$ and the source $\bmath s$. In this case, at

793:     every iteration we solve for $\bmath r$ and adjust $\bmath \psi$,

794:     using the linear correction to the potential $\delta \bmath

795:     \psi$. This was described in Section \ref{sec:solvelinear}.}

796:

797:   \noi {\bf (ii)} {We assume $\bmath\psi$ and

798:     $\bmath \lambda$ to be constant vectors and

799:     $\delta\bmath\psi_i=\bmath 0$ at every iteration and only solve

800:     for the non-linear potential parameters $\bmath \eta$ and the

801:     source $\bmath s$. This was described in Section

802:     \ref{sec:solvenonlinear}. We note that part of step (i) is also

803:     implicitly carried out in step (ii) (i.e.\ solving for $\bmath s$).}

804:

805:   \noi {\bf (iii)} {We assume both (i) and (ii), above, and solve for

806:     the regularization parameter $\lambda_s$ of the source and the source

807:     itself $\bmath s$. This requires a Bayesian approach and will be

808:     described in more detail in Section~4. We have not attempted to

809:     optimize for $\lambda_{\delta \psi}$, but will study this

810:     in future publications.}

811:

812:   \noi The overall goal, however, remains to solve for the \emph{full}

813:   set of unknown parameters $\{ {\bmath \eta}, {\bmath \psi}_n, \bmath

814:   s_n \}$ for $n\rightarrow \infty$ (or some large number).  In

815:   particular if an overall smooth (on scales of the image separations)

816:   potential model $\psi(\bmath \eta)$ does not allow a proper

817:   reconstruction of the lens system, we add an additional and more

818:   flexible potential correction $\delta{\bmath \psi}$,

819:   which can describe a more complex mass structure.

820:

821:   \subsubsection{Line-search optimization scheme}

822:

823:   In practice, we find that the optimal strategy to minimize the

824:   penalty function is the following, in order:

825:

826:   \noi {\bf (1)} {We set $\lambda_{\rm s}$ to a large constant value

827:     such that the source model remains relatively smooth throughout

828:     the optimization (i.e.\ the peak brightness of the model is a

829:     factor of a few below that of the data) and keep

830:     $\bmath\psi_n=\bmath 0$ \citep[see also][]{suyu206, Suyu08}.  We then

831:     solve for $\bmath \eta$ and $\bmath s$ that minimize the penalty

832:     function}.

833:

834:   \noi {\bf (2)} {Once the best $\bmath \eta$ and $\bmath s$ are

835:     found, a Bayesian approach is used to find the best value of

836:     $\lambda_{\rm s}$ for the source only.  At this point

837:     $\bmath\psi$ is still kept equal to zero.}

838:

839:   \noi {\bf (3)} {Given the new value of $\lambda_{\rm s}$, step (1) is repeated

840:     to find improved values of $\bmath \eta$ and $\bmath s$. Since the

841:     sensitivity of $\lambda_{\rm s}$ to changes in $\bmath \eta$ is

842:     rather weak, at this point the best values of $\bmath \eta$,

843:     $\bmath s$ and $\bmath \lambda$ have been found.}

844:

845:   \noi {\bf (4)} {Next, all the above parameters are kept fixed and we

846:     solve for $\bmath r$, this time assuming a very large value for

847:     $\lambda_{\delta \psi}$ to keep the potential correction (and

848:     convergence) smooth. We adjust $\bmath \psi$ at every iteration

849:     until convergence is reached

850:     \citep[e.g.][]{Suyu08}. At this point we stop the optimization

851:     procedure.}

852:

853:   \noi {\bf (5)} {The smooth model with $\bmath \psi = \bmath 0$ and

854:     the same model with $\bmath \psi \neq \bmath 0$ are then compared

855:     through their Bayesian evidence values and errors on the

856:     parameters are estimated through the Nested Sampling of

857:     \citet{Skilling04} (Section 4).}

858:

859:   \noi Fig. \ref{fig:flow} shows a complete flow diagram of our

860:     optimization scheme. In the next section we place

861:     equation (\ref{eqn:penalty}) and model ranking on a formal Bayesian

862:     footing. Those readers mostly interested in the application and

863:     tests of the method could continue reading in Section~5.

864:

865:   \begin{figure*}

866:     \begin{center}

867:       \includegraphics[width=\hsize,clip=]{fig3}

868:        \caption {A schematic overview of the non-linear source and

869: 	potential reconstruction method.}

870:       \label{fig:flow}

871:     \end{center}

872:   \end{figure*}

873:

874:   \section{A Bayesian approach to data fitting and model selection}

875:   \label{sec:bayes}

876:

877:   When trying to constrain the physical properties of the lens galaxy,

878:   within the grid-based approach, three different problems are

879:   faced.  Given the linear relation in equation (\ref{equ:

880:   src_pot_linear_blurred}) we need to determine the linear parameters

881:   $\bmath r$ for a certain set of data $\bmath d$ and a form for the

882:   smooth potential $\psi_{s}(\bmath x,\bmath \eta)$. We then aim to

883:   find the best values for the parameters $\bmath \eta$ and $\bmath

884:   \lambda$ and finally, on a more general level, we wish to infer the

885:   best model for the overall potential and quantitatively rank

886:   different potential families. In particular, we want to compare smooth models with models

887:   that also include a potential grid for substructure (with more free

888:   parameters). These issues can all be quantitatively and objectively

889:   addressed within the framework of Bayesian statistics. In the

890:   context of data modelling three levels of inference can be

891:   distinguished \citep{MacKay92, suyu206}.

892:

893:   \medskip

894:

895:   \noi {\bf (1)} First level of inference: linear optimization.  We

896:   assume the model $\mathbf{M_c}$, which depends on a given potential

897:   and source model, to be true and for a fixed form $\mathbf R$ and

898:   level ($\bmath\lambda$) of regularization, we derive from Bayes'

899:   theorem the following expression:

900:   \begin{equation}

901:     P\left(\bmath r\,|\,\bmath d,\bmath\lambda,\bmath \eta,\mathbf

902:     {M_c},\mathbf R\right)=\frac{P(\bmath d \,|\,\bmath r,\bmath \eta,

903:     \mathbf{M_c})\, P(\bmath r\,|\,\bmath\lambda,\mathbf R)}{P(\bmath

904:     d \,|\,\bmath\lambda,\bmath \eta,\mathbf{M_c},\mathbf R)}\,.

905:   \end{equation}

906:   The likelihood term, in case of Gaussian noise, for a covariance

907:   matrix $\mathbf{C_d}$, is given by

908:   \begin{equation}

909:     P(\bmath d \,|\,\bmath r, \bmath\eta,\mathbf{M_c})=

910: 	\frac{1}{Z_d}\exp{[-E_d(\bmath d \,|\,\bmath

911: 	r,\bmath\eta,\mathbf{M_c})]}\,

912:   \end{equation}

913:   where

914:   \begin{equation}

915:     Z_d=(2\pi)^{N_d/2}(\det \ \mathbf{C_d})^{1/2}

916:   \end{equation}

917:   and (see equation \ref{eqn:chi2})

918:   \begin{equation}

919:     E_d(\bmath d \,|\,\bmath r,\bmath\eta,\mathbf{M_c})=

920:       \frac{1}{2}\,\chi^2=\frac{1}{2}\left(\mathbf{M_c} \bmath

921:       r-\bmath d\right)^{\rm T}\mathbf{C_d}^{-1}\left(\mathbf{M_c}

922:       \bmath r-\bmath d\right)\,.

923:   \end{equation}

924:   Because of the presence of noise and because the matrix

925:   $\mathbf{M_c^{\rm T}} \mathbf{M_c}$ is often singular, it is not possible to

926:   simply invert the linear relation in equation (\ref{equ:

927:   src_pot_linear_blurred}) but an additional penalty function must be

928:   defined through the introduction of a prior probability $P(\bmath r

929:   \,|\,\bmath\lambda,\mathbf R)$ on $\bmath s$ and on $\delta\bmath

930:   \psi$. In our implementation of the method, the prior assumes a

931:   quadratic form, with its minimum at $\bmath r=\bmath 0$, and sets the

932:   level of smoothness (specified in $\mathbf H$ and $\bmath\lambda$)

933:   for the solution

934:   \begin{equation}

935:     P(\bmath r\,|\,\bmath\lambda,\mathbf R)=

936:     \frac{1}{Z_r}\exp{\left[-\bmath\lambda E_r(\bmath r\,|\,\mathbf

937:     R)\right]}\,,

938:   \end{equation}

939:   with

940:   \begin{equation}

941:     Z_r(\bmath\lambda)=\int {d\bmath r e^{-\bmath\lambda E_r}}=

942:     e^{-\bmath\lambda

943:     E_r(0)}\left(\frac{2\pi}{\bmath\lambda}\right)^{N_r/2}(\det\mathbf

944:     C)^{-1/2}\,,

945:   \end{equation}

946:   \begin{equation}

947:     E_r=\frac{1}{2}\|\mathbf R\bmath r\|^2_2

948:   \end{equation}

949:   and

950:   \begin{equation}

951:     \mathbf C=\nabla \nabla E_r=\mathbf {R}^{\rm T}\,\mathbf R\,.

952:   \end{equation}

953:   The normalization constant $P(\bmath d\,|\,\bmath\lambda,\bmath

954:   \eta,\mathbf{M_c},\mathbf R)$ is called the evidence and plays an

955:   important role at higher levels of inference. In this specific case

956:   it reads

957:   \begin{equation}

958:     P(\bmath d\,|\,\bmath\lambda,\bmath \eta,\mathbf{M_c},\mathbf R)

959:     =\frac{\int{d\bmath r\exp{(-M(\bmath r))}}}{Z_d Z_r}\,,

960:   \end{equation}

961:   \noi where

962:   \begin{equation}

963:     M(\bmath r)=E_d+ E_r\,.

964:   \end{equation}

965:   The most probable solution for the linear parameters is found by

966:   maximizing the posterior probability

967:   \begin{equation}

968:     P(\bmath r\,|\,\bmath d,\bmath\lambda,\bmath

969:     \eta,\mathbf{M_c},\mathbf R)=\frac{\exp(-M(\bmath

970:     r))}{\int{d\bmath r\,\exp(-M(\bmath r))}}\,.

971:     \label{equ:posterior}

972:   \end{equation}

973:   The condition $\partial (E_d+ E_r)/\partial \bmath r=0$ now yields the

974:   set of linear equations already introduced in Section

975:   \ref{sec:solvelinear}:

976:   \begin{equation}

977:     \left(\mathbf{M_c^{\rm T}} \mathbf{C_d}^{-1} \mathbf{M_c}+\mathbf

978:     R^{\rm T} \mathbf R\right)\bmath r = \mathbf{M_c^{\rm T}}

979:     \mathbf{C_d}^{-1}\bmath d\,.

980:     \label{equ:src_pot_penalty_bayes}

981:   \end{equation}

982:   Equation (\ref{equ:src_pot_penalty_bayes}) is solved iteratively

983:   using a Cholesky decomposition technique.

984:

985:   \noi {\bf (2)} Second level of inference: non-linear optimization.

986:   At this level we want to infer the non-linear parameters $\bmath

987:   \eta$ and the hyper-parameter $\lambda_{\rm s}$ for the

988:   source. Since at this point we are interested only in the smooth

989:   component of the lens potential, we set $\delta\bmath \psi=0$ and

990:   for a fixed family $\psi_s(\bmath \eta)$, form of the regularization

991:   $\mathbf R$ and model $\mathbf{M_c}$, we maximize the posterior

992:   probability

993:

994:   \begin{equation}\label{equ:posterior_2}

995:     P(\bmath\lambda,\bmath \eta\,|\,\bmath d,\mathbf{M_c},\mathbf

996:       R)=\frac{P(\bmath d\,|\,\bmath \lambda,\bmath \eta,\mathbf{M_c},\mathbf

997:       R)P(\bmath \lambda,\bmath \eta)}{P(\bmath d\,|\,\mathbf{M_c},\mathbf

998:       R)}\,.

999:   \end{equation}

1000:

1001:   \noi For a prior $P(\bmath \lambda,\bmath \eta)$ that is flat in

1002:   $\log(\lambda_s)$ and $\bmath\eta$, this reduces to maximizing the

1003:   evidence $P(\bmath d\,|\,\bmath\lambda,\bmath

1004:   \eta,\mathbf{M_c},\mathbf R)$ (which here plays the role of the

1005:   likelihood) for $\bmath \eta$ and $\bmath\lambda$. The evidence can

1006:   be computed by marginalizing over the linear parameters $\bmath r$,

1007:   %

1008:   \begin{equation}

1009:     P(\bmath d\,|\,\bmath\lambda,\bmath \eta,\mathbf{M_c},\mathbf R)=\int{d\bmath

1010:       r\, P(\bmath d\,|\,\bmath r,\bmath

1011:       \eta,\mathbf{M_c})P(\bmath r\,|\,\bmath\lambda,\mathbf

1012:       R)}\,.

1013:     \label{equ:evidence}

1014:   \end{equation}

1015:   %

1016:   Because of the assumptions we made (Gaussian noise and quadratic

1017:   form of regularization), this integral can be solved analytically

1018:   and yields

1019:   %

1020:   \begin{equation}

1021:     P(\bmath d\,|\,\bmath\lambda,\bmath \eta,\mathbf{M_c},\mathbf R)=

1022:     \frac{Z_M(\bmath\lambda, \bmath \eta)}{Z_d Z_r(\bmath\lambda)}\,,

1023:   \end{equation}

1024:   %

1025:   where

1026:   %

1027:   \begin{equation}

1028:     Z_M(\bmath\lambda, \bmath \eta)=\exp{(-M(\bmath

1029:       r_{\rm MP}))}\left(2\pi\right)^{N_r/2}(\det \ \mathbf A)^{-1/2}\,,

1030:   \end{equation}

1031:   %

1032:

1033:  \noi  with $\mathbf A=\nabla\nabla M(\bmath r)$. Again we proceed in an

1034:   iterative fashion: using a simulated annealing technique we maximize

1035:   the evidence (\ref{equ:evidence}) for the parameters $\bmath

1036:   \eta$. Every step of the maximisation generates a new model

1037:   $\mathbf{M_c}(\psi(\bmath \eta_i))$, for which the most probable

1038:   source $\bmath s_{\rm{MP}}$ is reconstructed as described in Section

1039:   \ref{sec:inverting}. At this starting step the level of the source

1040:   regularization is set to a relatively large initial value

1041:   $\lambda_{s,0}$; in this way we ensure the solution to be smooth (at

1042:   least at this first level) and the exploration of the $\bmath \eta$

1043:   space to be faster. Subsequently we fix the best model

1044:   $\mathbf{M_c}(\bmath \eta_0)$ found at the previous iteration and,

1045:   using the same technique, we maximize the evidence for the source

1046:   regularization level $\lambda_s$.  The procedure is repeated until

1047:   the total evidence has reached its maximum. In principle we should

1048:   have built a nested loop for $\lambda_s$ at every step of the

1049:   $\bmath \eta$ exploration, but in practice the regularization

1050:   constant only changes slightly with $\bmath \eta$ and the alternate

1051:   loop described above gives a faster way to reach the maximum

1052:   (line-search method).
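
\noi The analytic evidence and its use to set the regularization level can be illustrated with the Python sketch below. It is a toy example under explicit assumptions and not the implementation used for this work: the regularization level is absorbed into $\mathbf{R^{\rm T}R}$ (so that $E_r=\frac{1}{2}\|\mathbf R\bmath r\|^2_2$ with $E_r(0)=0$), a zeroth-order regularization is used so that $\det(\mathbf{R^{\rm T}R})$ is finite, a brute-force grid in $\lambda$ replaces the simulated annealing search, and the function name \texttt{log\_evidence} and all numerical values are hypothetical.

```python
import numpy as np

def log_evidence(M, C_d, RtR, d):
    """log P(d | lambda, eta) = log Z_M - log Z_d - log Z_r for a Gaussian linear model."""
    n_d, n_r = M.shape
    C_inv = np.linalg.inv(C_d)
    A = M.T @ C_inv @ M + RtR                       # A = grad grad M(r)
    r_mp = np.linalg.solve(A, M.T @ C_inv @ d)      # most probable solution
    resid = M @ r_mp - d
    m_mp = 0.5 * resid @ C_inv @ resid + 0.5 * r_mp @ RtR @ r_mp   # M(r_MP) = E_d + E_r
    log_2pi = np.log(2.0 * np.pi)
    log_z_m = -m_mp + 0.5 * n_r * log_2pi - 0.5 * np.linalg.slogdet(A)[1]
    log_z_d = 0.5 * n_d * log_2pi + 0.5 * np.linalg.slogdet(C_d)[1]
    log_z_r = 0.5 * n_r * log_2pi - 0.5 * np.linalg.slogdet(RtR)[1]
    return log_z_m - log_z_d - log_z_r

# Toy data (illustrative values only).
rng = np.random.default_rng(3)
M = rng.normal(size=(30, 8))
C_d = 0.1**2 * np.eye(30)
d = M @ rng.normal(scale=0.5, size=8) + 0.1 * rng.normal(size=30)

# Scan the regularization level and keep the value that maximizes the evidence.
lam_grid = np.logspace(-2, 2, 41)
log_ev = np.array([log_evidence(M, C_d, lam**2 * np.eye(8), d) for lam in lam_grid])
lam_best = lam_grid[int(np.argmax(log_ev))]
```

\noi For this Gaussian case the result can be cross-checked against the marginal distribution of the data, which is a zero-mean Gaussian with covariance $\mathbf{C_d}+\mathbf{M}(\mathbf{R^{\rm T}R})^{-1}\mathbf{M^{\rm T}}$.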

1053:

1054:   \noi {\bf (3)} At the third level of inference Bayesian statistics

1055:   provides an objective and quantitative procedure for model

1056:   comparison and ranking on the basis of the evidence,

1057:   \begin{equation}

1058:     P(\mathbf{M_c},\mathbf R\,|\,\bmath d) \propto P(\bmath

1059:     d\,|\,\mathbf{M_c},\mathbf R)P(\mathbf{M_c},\mathbf R)\,.

1060:   \end{equation}

1061:   For a flat prior $P(\mathbf{M_c},\mathbf R)$ (at this level of

1062:   inference we can make little to no assumptions) different models can

1063:   be compared according to their value of $P(\bmath

1064:   d\,|\,\mathbf{M_c},\mathbf R)$, which is related to the evidence of

1065:   the previous level by the following relation

1066:   \begin{equation}

1067:     P(\bmath d\,|\,\mathbf{M_c},\mathbf R)=\int{d\bmath\lambda\, d\bmath

1068:       \eta \,P(\bmath d\,|\,\bmath \lambda,\bmath \eta,\mathbf{M_c},\mathbf

1069:       R) P(\bmath\lambda,\bmath\eta)}\,.

1070:     \label{equ:evidence_integral}

1071:   \end{equation}

1072:   Being multidimensional and highly non-linear, the integral

1073:   (\ref{equ:evidence_integral}) is carried out numerically through a

1074:   Nested-Sampling technique \citep{Skilling04}, which is described in

1075:   more detail in the next section. A by-product of this method is an

1076:   exploration of the posterior probability (\ref{equ:posterior_2}),

1077:   allowing for error analysis of the non-linear parameters and of the

1078:   evidence itself.

1079:

1080:   \subsection{Model selection: smooth versus clumpy models}\label{sec:nested sampling}

1081:

1082:   In the previous section we introduced the main structure of the

1083:   Bayesian inference for model fitting and model selection. While

1084:   parameter fitting simply determines how well a model matches the

1085:   data and can be easily attained with the relatively simple analytic

1086:   integrations of the first and second level of inference, model

1087:   selection itself requires the highly non-linear and multidimensional

1088:   integral (\ref{equ:evidence_integral}) to be solved.  This

1089:   marginalized evidence can be used to assign probabilities to models

1090:   and to reasonably establish whether the data require or allow

1091:   additional parameters. Given two competing models $\rm M_0$

1092:   and $\rm M_1$ with respective marginalized evidences ${\cal{E}}_0$ and

1093:   ${\cal{E}}_1$, the logarithmic Bayes factor, $\Delta {\cal{E}} \equiv

1094:   \log{\cal{E}}_0 - \log{\cal{E}}_1$, quantifies how well $\rm M_0$ is

1095:   supported by the data when compared with $\rm M_1$, and it

1096:   automatically includes Occam's razor. Typically the literature

1097:   suggests weighing the Bayes factor using Jeffreys' scale

1098:   \citep{Jeffreys61}, which however provides only a qualitative

1099:   indication: $\Delta {\cal{E}} < 1$ is not significant, $1 < \Delta

1100:   {\cal{E}}< 2.5$ is significant, $2.5 < \Delta {\cal{E}}< 5$ is

1101:   strong and $\Delta {\cal{E}} > 5$ is decisive.
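
\noi As a small illustration, the thresholds quoted above can be written as a Python helper. The function name \texttt{jeffreys\_category} is hypothetical, and two conventions are assumptions of this sketch rather than statements of the text: a negative $\Delta{\cal{E}}$ is read as support for $\rm M_1$ with the same thresholds applied to its absolute value, and values exactly on a boundary are assigned to the higher category.

```python
def jeffreys_category(delta_log_evidence):
    """Return (favoured model, qualitative strength) for Delta E = log E_0 - log E_1."""
    favoured = "M0" if delta_log_evidence > 0 else "M1"
    strength = abs(delta_log_evidence)
    if strength < 1.0:
        label = "not significant"
    elif strength < 2.5:
        label = "significant"
    elif strength < 5.0:
        label = "strong"
    else:
        label = "decisive"
    return favoured, label

example = jeffreys_category(3.0)
```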

1102:

1103:

1104:   \noi In order to evaluate this marginalized evidence with a high

1105:   enough accuracy we implemented the new evidence algorithm known as

1106:   Nested Sampling, proposed by \citet{Skilling04}. Specifically, we

1107:   would like to compare two different models: one in which the lens

1108:   potential is smooth and one in which substructures are present, with

1109:   e.g.\ an NFW profile. While the first is defined by the non-linear

1110:   parameters of the lens potential and of the source regularization

1111:   only, the second also allows for three extra parameters: the mass of

1112:   the substructure and its position on the lens plane (see

1113:   Section \ref{sec:test}).

1114:

1115:   \subsection{Model ranking: nested sampling}

1116:

1117:   Here, we provide a short description of how the Nested Sampling can

1118:   be used to compute the marginalized evidence and errors on the model

1119:   parameters; a more detailed one can be found in

1120:   \citet{Skilling04}. The Nested-Sampling algorithm integrates the

1121:   likelihood over the prior volume by moving through thin nested

1122:   likelihood surfaces. Introducing the fraction of total prior

1123:   mass $X$, within which the likelihood exceeds ${\cal L^*}$, hence

1124:   %

1125:   \begin{equation}

1126:     X=\int_{{\cal{L}}>{\cal{L^*}}}{dX}\,,

1127:   \end{equation}

1128:   %

1129:   with

1130:   %

1131:   \begin{equation}

1132:     dX=P\left(\bmath\lambda,\bmath\eta\right)d\bmath\lambda\,d\bmath\eta\,,

1133:   \end{equation}

1134:   %

1135:   the multi-dimensional integral (\ref{equ:evidence_integral})

1136:   relating the likelihood $\cal{L}$ and the marginalized evidence

1137:   $\cal{E}$ can be reduced to a one-dimensional integral with positive

1138:   and decreasing integrand

1139:   %

1140:   \begin{equation}

1141:     {\cal{E}}=\int_0^1{dX\,{\cal{L}}(X)}\,,

1142:   \end{equation}

1143:

1144:   \noi where ${\cal L}(X)$ is the likelihood of the (possibly disjoint)

1145:   iso-likelihood surface in parameter space which encloses a total prior

1146:   mass of $X$. If the likelihood ${\cal{L}}_j={\cal{L}}(X_j)$ can be

1147:   evaluated for each of a given set of decreasing points, $0 < X_j <

1148:   X_{j-1} <\dots< 1$, then the total evidence ${\cal{E}}$ can be

1149:   obtained, for example, with the trapezoid rule,

1150:   ${\cal{E}}=\sum_{j=1}^m{\cal{E}}_j=\sum_{j=1}^m{\frac{{\cal{L}}_j}{2}}\left(X_{j-1}-X_{j+1}\right)$.

1151:

1152:   \noi The power of the method is that the values of $X_j$ do not

1153:   have to be explicitly calculated, but can be statistically

1154:   estimated. Specifically, the marginalized evidence is obtained

1155:   through the following iterative scheme:

1156:

1157:   \noi {\bf (1)} the likelihood ${\cal{L}}$ is computed for N

1158:   different points, called active points, which are randomly drawn

1159:   from the prior volume;

1160:

1161:

1162:   \noi {\bf (2)} the point $X_j$ with the lowest likelihood is found

1163:   and the corresponding prior volume is estimated statistically: at

1164:   each iteration the volume shrinks on average by $ X_j/X_{j-1}=t $,

1165:   where $t$ is the expectation value of the largest of N numbers

1166:   uniformly distributed in the interval $\left(0,1\right)$;

1167:

1168:   \noi {\bf (3)} the term

1169:   ${\cal{E}}_j=\frac{{\cal{L}}_j}{2}\left(X_{j-1}-X_{j+1}\right)$ is

1170:   added to the current value of the total evidence;

1171:

1172:   \noi {\bf (4)} $X_j$ is replaced by a new point randomly

1173:   distributed within the remaining prior volume and satisfying the

1174:   condition ${\cal{L}} >  {\cal{L}}^* \equiv {\cal{L}}_j$;

1175:

1176:   \noi {\bf (5)} the above steps are repeated until a stopping

1177:   criterion is satisfied.

1178:

1179:   \noi By climbing up the iso-likelihood surfaces, the method, in

1180:   general, finds and quantifies the small region in which the bulk

1181:   of the evidence is located.

1182:

1183:   \noi Different stopping criteria can be chosen.  Following

1184:   \citet{Skilling04}, we stop the iteration when $j \gg \rm{N}H$,

1185:   where H is minus the logarithm of that fraction of prior mass which

1186:   contains the bulk of the posterior mass.  In practical terms this

1187:   means that the procedure should be stopped only when most of the

1188:   evidence has been included. Given the areas ${\cal{E}}_j$, in fact,

1189:   the likelihood initially increases faster than the widths decrease,

1190:   until its maximum is reached; across this maximum, located in the

1191:   region $X\thickapprox e^{-H}$, the likelihood flattens off

1192:   and the decreasing widths dominate the increasing

1193:   ${\cal{L}}_j$. Since $X_j\thickapprox e^{-j/\rm{N}}$, it

1194:   takes $\rm{N}H$ iterations to reach the dominating areas.  These

1195:   $\rm{N}H$ iterations are random and are subject to a standard

1196:   deviation uncertainty $\sqrt{\rm{N}H}$, corresponding to a

1197:   standard deviation on the logarithmic evidence of $\sqrt{\rm{N}

1198:   H}/ \rm{N}$:

1199:

1200:   \begin{equation}

1201:     {\log \cal{E}}=

1202:     \log\left(\sum_j{{\cal{E}}_j}\right)\mathrm{~~~with~~~}

1203:     \sigma_{\log{\cal E}}=\sqrt{\frac{H}{\rm{N}}}\,.

1204:   \end{equation}
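
\noi The iterative scheme and the uncertainty estimate above can be sketched on a toy problem in Python. This is a minimal illustration and not the implementation used for this work. The assumptions are: a single parameter $\theta$ with a uniform prior on $[-1,1]$; a Gaussian likelihood of width \texttt{sigma} with unit peak; prior volumes approximated by $X_j\thickapprox e^{-j/\rm{N}}$; a fixed number of iterations ($8\rm{N}$, comfortably larger than $\rm{N}H$ here) as the stopping rule; and no contribution from the remaining active points. Because the likelihood is symmetric and unimodal, a new point with ${\cal{L}}>{\cal{L}}^*$ can be drawn exactly from $|\theta|<|\theta^*|$; a realistic problem needs a dedicated constrained sampler instead. The function name \texttt{nested\_sampling\_toy} is hypothetical.

```python
import math
import numpy as np

def nested_sampling_toy(sigma=0.05, n_live=400, n_iter=3200, seed=0):
    """Return (log evidence, information H, sigma_logE) for the 1-D toy problem."""
    rng = np.random.default_rng(seed)
    log_like = lambda theta: -0.5 * (theta / sigma) ** 2
    theta = rng.uniform(-1.0, 1.0, n_live)            # (1) active points from the prior
    log_l = log_like(theta)
    x = np.exp(-np.arange(n_iter + 2) / n_live)       # X_0 = 1, X_j ~ exp(-j/N)
    weights = np.empty(n_iter)
    dead_log_l = np.empty(n_iter)
    for j in range(1, n_iter + 1):
        worst = int(np.argmin(log_l))                 # (2) lowest-likelihood point
        dead_log_l[j - 1] = log_l[worst]
        # (3) trapezoid-rule contribution E_j = L_j (X_{j-1} - X_{j+1}) / 2
        weights[j - 1] = 0.5 * math.exp(log_l[worst]) * (x[j - 1] - x[j + 1])
        # (4) replace it by a prior draw with higher likelihood (exact for this toy)
        bound = abs(theta[worst])
        theta[worst] = rng.uniform(-bound, bound)
        log_l[worst] = log_like(theta[worst])
    evidence = weights.sum()
    # H = sum_j p_j log(L_j / E), with posterior weights p_j = E_j / E
    info = float(np.sum(weights / evidence * dead_log_l) - math.log(evidence))
    return math.log(evidence), info, math.sqrt(info / n_live)

log_e, info, sigma_log_e = nested_sampling_toy()
# Analytic evidence of the toy problem, for comparison.
log_e_true = math.log(0.5 * 0.05 * math.sqrt(2.0 * math.pi)
                      * math.erf(1.0 / (0.05 * math.sqrt(2.0))))
```

\noi For this toy problem the estimate \texttt{log\_e} is expected to agree with \texttt{log\_e\_true} to within a few times \texttt{sigma\_log\_e}.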

1205:

1206:     \subsubsection{Posterior probability distributions}

1207:

1208:

1209:   \noi For the lens parameters, the substructure position and the

1210:   logarithm of the source regularization, priors are chosen to be

1211:   uniform on a symmetric interval around the best values which we have

1212:   determined at the second level of the Bayesian inference. The size

1213:   of the interval is at least one order of magnitude larger than

1214:   the errors on the parameters. In practice, we first carry out a fast

1215:   run of the Nested Sampling with few active points $\rm{N}$; this gives us

1216:   an estimate for the non-linear parameter errors. Using the product

1217:   $2\times N_{\rm dim}\times \sigma_\eta$, where $N_{\rm dim}$ is the

1218:   total number of parameters and $\sigma_\eta$ the corresponding

1219:   standard deviation, we can then roughly enclose the bulk of the

1220:   likelihood (note that this can be double-checked and corrected in

1221:   hindsight, if the posterior probability functions are truncated at

1222:   the prior boundaries). Priors on the parameters are taken in such a

1223:   way that this maximum is fully included in the total integral of the

1224:   marginalized evidence. For the main lens parameters and for the

1225:   regularization constant the same priors are used for models with and

1226:   without substructure. For the substructure mass  a flat prior between

1227:   $M_{\rm min}=4.0\times 10^6M_\odot$ and $M_{\rm

1228:   max}=4.0\times 10^9M_\odot$ is adopted, with the two limits given by N-body

1229:   simulations \citep[e.g.][]{Diemand07b, Diemand07a}. In reality,

1230:   the method does not require the parameters to be well known a

1231:   priori, but limiting the exploration to the best fit region

1232:   appreciably reduces the computational effort without significantly

1233:   altering the evidence estimation. From Bayes' theorem we have that

1234:   the posterior probability density $p_j$ is given by

1235:   %

1236:   \begin{equation}

1237:     p_j(t)=

1238:     \frac{{\cal{L}}_j}{2}\left(X_{j-1}-X_{j+1}\right)/{\cal{E}}(t)=w_j/{\cal{E}}(t)\,.

1239:   \end{equation}

1240:   %

1241:   The existing set of points $\left(\bmath\eta, \bmath\lambda

1242:   \right)_1$,..., $\left(\bmath\eta, \bmath\lambda \right)_{\rm N}$

1243:   then gives us a set of posterior values that can be used to

1244:   obtain mean values and standard deviations on the non-linear

1245:   parameters

1246:   %

1247:   \begin{equation}

1248:     \langle\bmath\eta\rangle=\sum_j{w_j\bmath\eta_j}/\sum_j{w_j}\,,

1249:   \end{equation}

1250:   %

1251:   and similarly for $\bmath\lambda$. These samples also provide a

1252:   sampling of the full joint probability density

1253:   function. Marginalising over this function, the full marginalized

1254:   probability density distribution of each parameter can be determined

1255:   (see Section 5.5).
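
  \noi The standard deviations mentioned above follow from the same
  weights. As a sketch, assuming the usual weighted second moment
  (a standard expression that is not spelled out above), the variance
  of the $i$-th non-linear parameter can be estimated from the samples
  $\eta_{i,j}$ as
  %
  \begin{equation*}
    \sigma^2_{\eta_i}=
    \sum_j{w_j\left(\eta_{i,j}-\langle\eta_i\rangle\right)^2}/\sum_j{w_j}\,,
  \end{equation*}
  %
  with $w_j$ the weights defined above. Under the same assumption, the
  one-dimensional marginalized distribution of $\eta_i$ is
  approximated by a histogram of the samples $\eta_{i,j}$ in which
  each sample is counted with weight $w_j$.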

1256:

1257:   \section{Testing and calibrating the method}\label{sec:test}

1258:

1259:   In this section we describe the procedure to test the method

1260:   introduced above and to assess its ability to detect dark matter

1261:   substructures in realistic data sets (e.g. from HST). A set of mock

1262:   data, mimicking a typical Einstein ring, is created. We generate

1263:   fourteen different lens models, of which $\rm L_0$ is purely

1264:   smooth, $\rm L_{1 \le i < 13}$ are given by the superposition

1265:   of the same smooth potential with a single NFW dark matter substructure of

1266:   varying mass and position and $\rm L_{13}$

1267:   contains two NFW dark matter substructures with

1268:   the same mass but with different positions (see Table \ref{tab:lenses}).

1269:   A first approximate reconstruction of the source and of the lens potential

1270:   is performed by recovering the best non-linear lens parameters

1271:   $\bmath\eta$ and the level of source regularization

1272:   $\lambda_s$. These values are then used for the linear grid-based

1273:   optimization, which provides initial values of the substructure

1274:   position and mass. Three extra runs of the non-linear optimization are then

1275:   performed to recover the best set

1276:   $\left(\bmath\eta_b,\lambda_{s,b}\right)$ for the main lens and the

1277:   best mass and position of the substructure (solely modelled with a

1278:   NFW density profile). Finally, by means of the Nested-Sampling

1279:   technique described in Section \ref{sec:nested sampling} we

1280:   compute the marginalized evidence, equation (\ref{equ:evidence_integral}), for

1281:   every model twice, once under the hypothesis of a smooth lens and

1282:   once allowing for the presence of one or two extra mass

1283:   substructures. Comparison between these two models allows us to

1284:   assess whether the presence of substructure in the model improves

1285:   the evidence despite the larger number of free parameters.

1286:

1287:   \subsection{Mock data realisations}

1288:

1289:   A set of simulated data with realistic noise is generated from a

1290:   model based on the real lens SLACS J1627$-$0055

1291:   \citep{Koopmans06,Bolton06,Treu06}. We assume the lens to be well

1292:   described by a power-law (PL) profile \citep{Barkana98}. Using the

1293:   optimization technique described in Section (\ref{sec:bayes}) we find

1294:   the best set of non-linear parameters

1295:   $\left(\bmath\eta_b,\lambda_{s,b}\right)$. In particular

1296:   $\bmath\eta$ contains the lens strength $b$, and some of the

1297:   lens-geometry parameters: the position angle $\theta$, the

1298:   axis ratio $f$, the centre coordinates $\bmath x_0$ and the density

1299:   profile slope $q$, $\left(\rho \propto r^{-(2q+1)}\right)$. If

1300:   necessary, information about external shear can be included. The

1301:   best parameters are used to create fourteen different lenses and

1302:   their corresponding lensed images. One of the systems is given by a

1303:   smooth PL model while twelve include a dark matter

1304:   substructure with virial mass $\rm M_{vir}=10^7 \rm M_\odot, 10^8

1305:   \rm M_\odot,10^9 \rm M_\odot$ located either on the lowest surface

1306:   brightness point of the ring $P_0$, on a high surface brightness

1307:   point of the ring $P_1$, inside the ring $P_2$ or outside the ring

1308:   $P_3$ (see Table \ref{tab:lenses}). The fourteenth lens

1309:   contains two substructures each with a mass of $\rm M_{vir}=10^8  M_\odot$,

1310:   located respectively in $P_0$ and $P_1$. The substructures are assumed

1311:   to have a NFW profile

1312:   %

1313:   \begin{equation}

1314:     \rho\left(r\right)={\rho_s}{\left(r_s/r\right)\left[1+\left(r/r_s\right)\right]^{-2}}\,,

1315:   \end{equation}

1316:   %

1317:   where the concentration $c=r_{\mathrm {vir}}/r_s$ and the scaling radius $r_s$

1318:   are obtained from the virial mass using the empirical scaling laws

1319:   provided by \citet{Diemand07b, Diemand07a}. The source has an

1320:   elliptical Gaussian surface brightness profile centred on the origin

1321:   %

1322:   \begin{equation}

1323:     s\left(\bmath y\right) = s_0 \exp\left[ - (y_1/\delta y_1)^2 - (y_2/\delta y_2)^2 \right]\,.

1324:   \end{equation}

1325:   %

1326:  We assume $s_0=0.25$, $\delta y_1=0.01$ and $\delta y_2=0.04$.
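
  \noi For reference, the link between the virial mass and the NFW
  parameters can be made explicit. Integrating the NFW profile given
  above out to $r_{\rm vir}=c\,r_s$ yields the standard relation
  (quoted here as a sketch; the concentration--mass scaling itself is
  taken from the cited simulations and is not reproduced)
  %
  \begin{equation*}
    M_{\rm vir}=4\pi\int_0^{r_{\rm vir}}\rho\left(r\right)r^2\,{\rm d}r
    =4\pi\rho_s r_s^3\left[\ln\left(1+c\right)-\frac{c}{1+c}\right]\,,
  \end{equation*}
  %
  so that, once $c$ and $r_s$ are fixed by the scaling laws, the
  normalisation $\rho_s$ follows from $M_{\rm vir}$.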

1327:

1328:   \begin{table}

1329:     \begin{center}

1330:       \caption {Non-smooth (PL+NFW) lens models. At each of the $P_i$

1331: 	positions a NFW perturbation of virial mass $m_{sub}$ is superimposed

1332: 	on a smooth PL mass model distribution.}

1333:       \begin{tabular}{ccc}

1334: 	\hline Lens&$\bmath x_{sub}$ $\left( \mathrm{arcsec}

1335: 	\right)$&$m_{sub}$ $\left( M_\odot \right)$\\ \hline $\rm

1336: 	L_1$&$P_0= (+0.90 ; +1.19)$&$10^7$\\ $\rm L_2$&&$10^8$\\ $\rm

1337: 	L_3$&&$10^9$\\ \\ $\rm L_4$&$P_1= (-0.50 ; -1.00)$&$10^7$\\ $\rm

1338: 	L_5$&&$10^8$\\ $\rm L_6$&&$10^9$\\ \\ $\rm L_7$&$P_2 = (-0.10 ;

1339: 	-0.60)$&$10^7$\\ $\rm L_8$&&$10^8$\\ $\rm L_9$&&$10^9$\\ \\

1340: 	$\rm L_{10}$&$P_3 = (-0.90 ; -1.40)$&$10^7$\\ $\rm

1341: 	L_{11}$&&$10^8$\\ $\rm L_{12}$&&$10^9$\\ \\

1342: 	$\rm L_{13}$&$P_0$ and $P_1 $&$10^8$\\\hline

1343:       \end{tabular}

1344:       \label{tab:lenses}

1345:     \end{center}

1346:   \end{table}

1347:

1348:   \subsection{Non-linear reconstruction of the main lens}

1349:

1350:   We start by choosing an initial parameter set $\bmath\eta_{0}$ for

1351:   the main lens, which is offset from $\bmath\eta_{\rm true}$ that we

1352:   used to create the simulated data. Assuming the lens does not

1353:   contain any substructure we run the non-linear procedure described

1354:   in Section (\ref{sec:bayes}) and optimize $\{\bmath\eta,\lambda_{s}\}$

1355:   for each of the considered systems. At every step of the

1356:   optimization a new set $\{\bmath\eta_i,\lambda_{s,i}\}$ is obtained

1357:   and the corresponding lensing operator $\mathbf{M_c}(\bmath

1358:   \eta_{i},\lambda_{s,i})$ has to be re-computed. The images are

1359:   defined on an 81 by 81 pixel $\left(N_d= 6561\right)$ regular

1360:   Cartesian grid while the sources are reconstructed on a Delaunay

1361:   tessellation grid of $N_s= 441$ vertices. The number of image

1362:   points, used for the source grid construction, is effectively a form

1363:   of a prior and the marginalized evidence (equation \ref{equ:evidence_integral}) can be used to

1364:   test this choice. To check whether the number of image pixels used

1365:   can affect the result of our modelling, we consider the smooth lens

1366:   $\rm L_0$ and perform the non-linear reconstruction using one pixel out of every sixteen, nine, four and

1367:   one. In each of the considered cases we find that the lens parameters agree within their errors (see Table~3).

1368:  This suggests that, for this particular case, the number of pixels used does not influence the quality of the reconstruction.

1369:   In real systems, the dynamic range of the lensed images could be much

1370:   higher and a case by case choice based on the marginalized evidence has to be considered.

1371:   In Fig. \ref{fig:best1_upr}, the  residuals relative to the system $\rm L_1$ are shown; the noise

1372:   level is in general reached and only small residuals are observed at

1373:   the position of the substructure.

1374:   Whether the level of such residuals and therefore the relative detection

1375:   of the substructure are significant is an issue we will address later on in

1376:   terms of the  total marginalized evidence.
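
  \noi The pixel counts quoted above can be checked with a short
  calculation. As an illustration only, assume that using one pixel
  out of every $k^2$ means keeping every $k$-th pixel along each axis,
  starting from the first pixel of the $81\times81$ grid (this
  subsampling convention is an assumption made for the arithmetic).
  The number of image points used to build the source grid is then
  %
  \begin{equation*}
    N_s\left(k\right)=\left(\left\lfloor\frac{80}{k}\right\rfloor+1\right)^2
    =441,~729,~1681,~6561
    \mathrm{~~~for~~~}k=4,~3,~2,~1\,.
  \end{equation*}
  %
  Under this assumption the case $k=4$ (one pixel in sixteen)
  reproduces the $N_s=441$ vertices quoted above, while $k=1$ uses all
  $N_d=6561$ image pixels.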

1377:

1378:   \subsection{Linear reconstruction: substructure detection}\label{sec:linear rec}

1379:

1380:   The non-linear optimization provides us with a first good

1381:   approximate solution for the source and for the smooth component of

1382:   the lens potential. While this is a good description for the smooth

1383:   model $\rm L_0$ (see Fig. \ref{fig:best_smooth}), the residuals

1384:   (e.g. Fig. \ref{fig:best1_outside_01}) for

1385:   the perturbed models $\rm L_{i\ge1}$ indicate that the

1386:   \emph{no-substructure} hypothesis is improbable and that

1387:   perturbations to the main potential have to be considered. If the

1388:   perturbation is small, this can be done by temporarily assuming that

1389:   $\bmath{\eta}_{i=1}$ reflects the true mass model distribution for the

1390:   main lens and reconstructing the source and the potential correction by

1391:   means of equation (\ref{equ:src_pot_penalty_bayes}). In order to

1392:   keep the potential corrections in the linear regime, where the

1393:   approximation (\ref{equ:src_pot_penalty_bayes}) is valid, both the

1394:   source and the potential need to be initially over-regularised:

1395:   $\lambda_s=10\,\lambda_{s,1}$ and

1396:   $\lambda_{\delta\psi}=3.0\times10^5$ \citep{Koopmans05,

1397:   suyu206}. For each of the possible substructure positions we

1398:   identify the lowest-mass substructure we are able to recover. In the

1399:   two most favourable cases, $\rm L_1$ and $\rm L_4$, in which the

1400:   substructure sits on the Einstein ring, a perturbation of $10^7 \rm

1401:   M_\odot$ is readily reconstructed. For these two positions higher

1402:   mass models, with the exception of $\rm L_2$, will not be further analysed. The systems $\rm

1403:   L_{7,8,9}$ and $\rm L_{10,11,12}$, in which the substructure is

1404:   located, respectively, inside and outside the ring, represent more

1405:   difficult scenarios. In the first case all perturbations below $10^9

1406:   \rm M_\odot$ can be mimicked by an increase of the mass of the main

1407:   lens within the ring, while in the second case these cannot be

1408:   easily distinguished from an external shear effect. For the models

1409:   $\rm L_{1,2,4,9,12}$ convergence is reached after 150 iterations and

1410:   the perturbations are recovered near their known positions (Figs. 8 and 9). The

1411:   grid-based potential reconstruction indeed leads to a good first

1412:   estimation for the substructure position.

1413:

1414:

1415:

1416:   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1417:

1418:   \subsection{Non-linear reconstruction: main lens and substructure}\label{sec:non-linear rec}

1419:

1420:   In order to compare with numerical simulations, the mass of the

1421:   substructure is required. Performing this evaluation with a

1422:   grid-based reconstruction is more complicated and requires some

1423:   assumptions (e.g.\ aperture). To alleviate this problem we assume a

1424:   parametric model, in which the substructures are described by a NFW

1425:   density profile, and we recover the corresponding non-linear

1426:   parameters, mass and position, using the non-linear Bayesian

1427:   optimization previously described.

1428:

1429:   \noi To quantify the mass and position of the substructure and to

1430:   update the non-linear parameters when a substructure is added, we

1431:   adopt a multi-step non-linear procedure that converges relatively

1432:   quickly to a best PL+NFW mass model. At this level, we neglect the

1433:   smooth lens $\rm L_0$, for which a satisfactory model has already

1434:   been obtained after the first non-linear optimization, and the

1435:   perturbed models $ \rm L_{7,8,10,11}$ for which the substructure

1436:   could not be recovered. We proceed as follows:

1437:

1438:   \medskip

1439:

1440:   \noi {\bf (i)} we fix the main lens parameters to the best values

1441:   found in Section (\ref{sec:linear rec}),

1442:   $\{\bmath\eta_1,\lambda_{\rm s,1}\}$. We set the substructure

1443:   mass to some guess value. We optimize for the substructure position

1444:   $\bmath x_{\rm sub,1}$.

1445:

1446:   \noi {\bf (ii)} we fix $\{\bmath\eta_1,\lambda_{s,1}\}$ and

1447:   the substructure position $\bmath x_{\rm sub,1}$. We optimize for the

1448:   substructure mass $m_{\rm sub,1}$.

1449:

1450:   \noi {\bf (iii)} we run the non-linear procedure described in

1451:   Section (\ref{sec:bayes}) by alternately optimising for the main

1452:   lens, source, and substructure parameters and for the level of source

1453:   regularization.

1454:

1455:   \medskip

1456:

1457:   \noi This leads to a new set of parameters, $\{\bmath\eta_{\rm b},

1458:   \lambda_{\rm s,b}, m_{\rm sub,b}, \bmath x_{\rm sub,b}\}$. Final

1459:   results for the considered models are listed in

1460:   Table 3 and the

1461:   corresponding residuals are shown in Figs. \ref{fig:best1_upr}--\ref{fig:best1_outside_01}. For all the considered lenses the final

1462:   reconstruction converges to the noise level.

1463:

1464:   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1465:

1466:    \subsection{Multiple substructures}

1467:    The lens system $\rm L_{13}$ represents a more complex case in which two substructures

1468:    are included. In particular we are interested in testing

1469:    whether both substructures are detectable and whether their effect may be hidden by the

1470:    presence of external shear. As for the previously considered cases, we first perform a non-linear

1471:    reconstruction of the main lens parameters assuming a single PL mass model.

1472:    For this particular system we also include the strength $\rm \Gamma_{sh}$ and the position

1473:    angle $\rm \theta_{sh}$ of the external shear as free parameters. Results for this first step of the reconstruction

1474:    are shown in Fig. \ref{fig:linear_a}. We then run the linear potential

1475:    reconstruction. One of the two substructures is detected although a significant

1476:    level of image residuals is left (Fig. \ref{fig:best1_sub_double}).

1477:    The combined effect of external shear ($\rm \Gamma_{sh}=-0.031$) and the substructure in $P_1$

1478:    is not sufficient to explain the perturbation generated by the second substructure at the lowest surface

1479:    brightness point of the Einstein ring. We therefore include a NFW substructure in

1480:    the recovered position and run a non-linear reconstruction for the new PL+NFW model,

1481:    Fig. \ref{fig:linear_b}. We are then also able to detect the second substructure (Fig. \ref{fig:best2_sub_double}).

1482:    Finally we run a global non-linear reconstruction for the

1483:    PL+2NFW model (Fig. \ref{fig:linear_c}); the noise level is reached and the strength of the external shear is consistent with zero ($\rm \Gamma_{sh}=0.0001$).

1484:

1485:   \subsection{Nested sampling: the evidence for substructure}

1486:

1487:   When modelling systems such as $\rm L_{0}$ or $\rm L_{i\ge1}$ we assume

1488:   that the best recovered values, under the hypothesis of a single

1489:   power-law, provide a good description of the true mass distribution

1490:   and that any residual that may be observed could be an indication of

1491:   the presence of mass substructure. Model comparison within the

1492:   framework of Bayesian statistics gives us the possibility to test

1493:   this assumption.

1494:

1495:   \subsubsection{Marginalized Bayesian evidence}

1496:

1497:   In order to statistically compare two models the

1498:   marginalized evidence (\ref{equ:evidence_integral}) has to be

1499:   computed. As described in Section (\ref{sec:nested sampling}) this

1500:   multi-dimensional and non-linear integral can be evaluated using the

1501:   Nested-Sampling technique by

1502:   \citet{Skilling04}. Specifically the two mass models we wish to

1503:   compare are a single PL, M$_0$, versus a PL+NFW

1504:   substructure, M$_1$. The first one is completely defined by the

1505:   non-linear parameters $\left(\bmath \eta, \lambda_s\right)$, while the

1506:   second needs three extra parameters, namely the substructure mass

1507:   and position. For all these parameters prior probabilities have to

1508:   be defined:

1509:   %

1510:   \begin{equation}

1511:     P\left(\eta_i\right)= \left\{

1512:     \begin{array}{ll}

1513:       \text{constant} & {\rm ~~~for~~~} |\eta_{\rm {b},i}-\eta_i|\leq\delta\eta_i \\ &\\

1514:       0 & {\rm ~~~for~~~} |\eta_{\rm {b},i}-\eta_i| > \delta\eta_i

1515:     \end{array}

1516:     \right.

1517:   \end{equation}

1518:   and

1519:   \begin{equation}

1520:     P\left( x_{\rm {sub},i}\right)= \left\{

1521:     \begin{array}{ll}

1522:       \text{constant} & {\rm ~~~for~~~} | x_{\rm {sub,b},i}- x_{\rm {sub},i}|\leq\delta

1523:       x_{\rm {sub},i}\\ &\\ 0 & {\rm ~~~for~~~} | x_{\rm {sub,b},i}- x_{\rm {sub},i}| > \delta

1524:       x_{\rm {sub},i}

1525:     \end{array}

1526:     \right.

1527:   \end{equation}

1528:

1529:   \noi where the elements of $\delta\eta_i$ and $\delta x_{\rm sub,i}$

1530:   are empirically assessed such that the bulk of the evidence

1531:   is included \citep[see][]{Skilling04}. The prior on the

1532:   substructure mass is flat between the lower and upper mass limits

1533:   given by numerical simulations \citep[e.g.][]{Diemand07b,

1534:   Diemand07a}.  Given the lenses $\rm L_{0,1,2,4,9,12,13}$ we run the

1535:   Nested Sampling twice, once for the single PL model and

1536:   once for the PL+NFW (+NFW) one. The two marginalized evidences with

1537:   corresponding numerical errors can be compared in Table~2. Although a number of authors suggest

1538:   the use of Jeffreys' scale \citep{Jeffreys61} for model comparison, we adopt here a

1539:   more conservative criterion. In particular, we note that the

1540:   perturbed model M$_1$ for the lens system $\rm L_0$ is basically

1541:   consistent with a single smooth PL model M$_0$, with $\Delta{\cal

1542:   {E}}\sim 7.85$. The Bayes factor for the system $\rm L_4$ is of

1543:   the order of $\Delta{\cal {E}} \sim 21.5$ in favour of the smooth

1544:   model M$_0$, indicating that the detection of such a low-mass

1545:   substructure cannot formally be claimed at a significant level. The

1546:   reason why we think this substructure is clearly visible in the

1547:   grid-based results is that this particular solution is the

1548:   maximum-posterior (MP) solution, whereas the evidence gives the

1549:   integral over the entire parameter space. This implies that there

1550:   must be many solutions near the MP solution that do not show the

1551:   substructure. This indicates that our approach of quantifying the

1552:   evidence for substructure is very conservative.  On the other hand

1553:   the Bayes factor for the lens $\rm L_1$, $\Delta{\cal {E}} = -17.1

1554:   $, clearly shows that the detection of a $10^7 M_\odot$ substructure

1555:   can be significant when the latter is located at a different

1556:   position on the ring. Finally all higher mass perturbations are

1557:   easily detectable independently of their position relative to the

1558:   image ring; the Bayes factors for $\rm L_2$, $\rm L_9$, $\rm L_{12}$ and $\rm L_{13}$

1559:   are, in fact, respectively $\Delta{\cal {E}} = -213.0 $,

1560:   $\Delta{\cal {E}} = -2609.7$, $\Delta{\cal {E}} = -4603.4$ and $\Delta{\cal {E}} = -1835.7$.

1561:   Substructure properties for these systems are also confidently

1562:   recovered.

1563:   The main difference between Jeffreys' scale and our criterion for

1564:   quantifying the significance level of the substructure detection is observed

1565:   for the system  $\rm L_1$.  If we were to adopt Jeffreys' scale, such a detection

1566:   would have to be claimed as decisive, whereas we consider it only significant.
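
  \noi The values quoted above for $\rm L_0$, $\rm L_1$ and $\rm L_4$
  can be reproduced from the entries of Table \ref{tab:evidence}. As a
  worked check, assume that $\Delta{\cal E}$ denotes the difference
  $\log{\cal E}_{\rm M_0}-\log{\cal E}_{\rm M_1}$ and that the
  numerical errors of the two Nested-Sampling runs are independent
  (both are assumptions made here for the arithmetic). Then
  %
  \begin{align*}
    {\rm L_0}: &\quad \Delta{\cal E}=26332.70-26324.85=+7.85\,,
    &\sigma_{\Delta{\cal E}} &= \sqrt{0.33^2+0.30^2}\approx 0.45\,,\\
    {\rm L_1}: &\quad \Delta{\cal E}=20366.86-20383.95=-17.09\,,
    &\sigma_{\Delta{\cal E}} &= \sqrt{0.34^2+0.30^2}\approx 0.45\,,\\
    {\rm L_4}: &\quad \Delta{\cal E}=20292.40-20270.87=+21.53\,,
    &\sigma_{\Delta{\cal E}} &= \sqrt{0.33^2+0.29^2}\approx 0.44\,.
  \end{align*}
  %
  In all three cases $|\Delta{\cal E}|$ exceeds the combined numerical
  error by more than an order of magnitude, so the sign of each
  comparison is not affected by the sampling noise of the integration.
  For comparison, Jeffreys' scale labels a Bayes factor above $10^2$
  as decisive, which corresponds to $|\Delta{\cal E}|\ga 4.6$ if
  natural logarithms are assumed.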

1567:

1568:   \begin{figure}

1569:     \begin{center}

1570:       \includegraphics[width=8cm]{fig4}

1571:       \caption{Results of the non-linear optimization for the smooth

1572: 	lens $\rm {L_0}$. The top-right panel shows the original mock

1573: 	data, while the top-left one shows the final

1574: 	reconstruction. On the second row the source reconstruction

1575: 	(left) and the image residuals (right) are shown.}

1576:       \label{fig:best_smooth}

1577:     \end{center}

1578:   \end{figure}

1579:

1580:

1581: \subsection{Posterior probabilities}

1582:

1583:   As discussed in Section (\ref{sec:nested sampling}) an interesting

1584:   by-product of the Nested-Sampling procedure is an exploration of the

1585:   posterior probability (\ref{equ:posterior_2}) which provides us with

1586:   statistical errors on the model parameters, see Tables 3 and 4. The

1587:   relative posterior probabilities for $\rm L_0$, $\rm L_1$ and $\rm

1588:   L_2$ are plotted in Fig.~\ref{fig:smooth_weights},

1589:   Fig.~\ref{fig:pert0001_weights} and

1590:   Fig.~\ref{fig:pert001_weights} respectively.  Let us start by

1591:   considering the lens system $\rm L_0$ and the relative probability

1592:   distribution for the substructure mass. Although the model M$_1$, in

1593:   terms of marginalized evidence, is consistent with the single smooth

1594:   PL model M$_0$, there is a small probability for the presence of a

1595:   substructure with mass up to a few $10^8 M_\odot$ located as far as

1596:   possible from the ring.  The effect of such objects on the lensed

1597:   image would be very small and could be easily hidden by introducing

1598:   artificial features in the source structure, as suggested by the

1599:   posterior distributions for the source regularization constant.

1600:   This means that, from the image point of view, a smooth single PL

1601:   model and a perturbed PL+NFW with a substructure of $10^8 M_\odot$,

1602:   located far from the ring, are not distinguishable from each other as

1603:   long as the effect of the perturber can be hidden in the structure

1604:   of the source. From a probabilistic point of view, however, the second

1605:   scenario is less likely.  A similar argument can be

1606:   applied to the lens $\rm L_1$ for which a strong degeneracy between

1607:   the mass and the position of the substructure is observed.  We

1608:   conclude therefore that, although this substructure can be detected

1609:   at a statistically significant level, its mass and position cannot

1610:   be confidently assessed yet.  In contrast, for systems such as $\rm

1611:   L_{2,9,12}$, the effect of the substructure is so strong that it

1612:   cannot be mimicked by the source structure or by a different

1613:   combination of the substructure parameters. For these cases not only

1614:   is the detection highly significant, but the properties of the

1615:   perturber can be confidently constrained with minimal biases.

1616:

1617:   \begin{table}

1618:     \begin{center}

1619:       \caption{Marginalized evidence and corresponding standard

1620: 	deviation as obtained via the Nested-Sampling

1621: 	integration. Results are shown for the hypothesis of a smooth

1622: 	lens (PL) and the hypothesis of a clumpy lens potential

1623: 	(PL+NFW).}

1624:       \begin{tabular}{cccc}

1625:       	\hline Lens&Model& $\log {\cal E} \,$&$\sigma_{{\log {\cal E}

1626: 	}}\,$\\ \hline $\rm L_0$ & PL & 26332.70&0.33\\ &

1627: 	PL+NFW &26324.85&0.30\\ \\ $\rm L_1$ & PL

1628: 	&20366.86&0.34\\ &PL+NFW&20383.95&0.30\\ \\ $\rm L_4$

1629: 	& PL &20292.40&0.33\\ & PL+NFW &20270.87& 0.29\\ \\

1630: 	$\rm L_9$ & PL &17669.41&0.45\\ & PL+NFW

1631: 	&20279.13&0.36\\ \\ $\rm L_{12}$ & PL

1632: 	&15786.91&0.33\\ & PL+NFW

1633: 	&20390.35&0.37\\ \\ $\rm L_{13}$ & PL

1634: 	&18509.76&0.24\\ & PL+2 NFW

1635: 	&20346.48&0.49\\ \hline

1636:       \end{tabular}

1637:       \label{tab:evidence}

1638:     \end{center}

1639:   \end{table}

1640:

1652:

1653:   \begin{figure*}

1654:     \begin{center}

1655:       \includegraphics[width=0.45\hsize]{fig5a}

1656:       \hfill

1657:       \includegraphics[width=0.45\hsize]{fig5b}

1658:       \caption{{\bf Left panel:} Results of the first non-linear

1659: 	reconstruction for the smooth component of the perturbed lens

1660: 	L$_1$. The top-right panel shows the original mock

1661: 	data, while the top-left one shows the final

1662: 	reconstruction. On the second row the source reconstruction

1663: 	(left) and the image residuals (right) are shown. {\bf Right

1664: 	panel:} Final results of the non-linear reconstruction for the

1665: 	perturbed lens L$_1$. The top-right panel shows the

1666: 	original mock data, while the top-left one shows the final

1667: 	model reconstruction obtained after a non-linear optimization

1668: 	involving the lens parameters and the substructure position

1669: 	and mass. The recovered source is plotted in the lower-left

1670: 	panel and the image

1671: 	residuals in the lower-right panel.}

1672:       \label{fig:best1_upr}

1673:     \end{center}

1674:   \end{figure*}

1675:

1676:   \begin{figure*}

1677:     \begin{center}

1678:       \includegraphics[width=0.45\hsize]{fig6a}

1679:       \hfill

1680:       \includegraphics[width=0.45\hsize]{fig6b}

1681:       \caption{Same as Figure~\ref{fig:best1_upr}, but for L$_2$.}

1682:       \label{fig:best1_upr_001}

1683:     \end{center}

1684:   \end{figure*}

1685:

1686:

1687:

1688:   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1689:

1690:   \section{Conclusions and future work}

1691:

1692:   We have introduced a fully Bayesian adaptive method for objectively

1693:   detecting mass substructure in gravitational lens galaxies. The

1694:   implemented method has the following specific features:

1695:

1696:   \begin{itemize}

1697:

1698:   \item Arbitrary imaging data-sets defined on a regular grid can be

1699:     modelled, as long as only lensed structure is included. The code

1700:     is specifically tailored to high-resolution HST data-sets with a

1701:     compact PSF that can be sampled by a small number of pixels.

1702:

1703:   \item Different parametric two-dimensional mass-models can be used,

1704:     with a set of free parameters $\bmath \eta$. Currently, we have

1705:     implemented the elliptical power-law density models from

1706:     \citet{Barkana98}, but other models can easily be included.

1707:     Multiple parametric mass models can be simultaneously optimized.

1708:

1709:   \item A grid-based correction to the parametric potential can

1710:     iteratively be determined for any perturbation that cannot easily

1711:     be modelled within the chosen family of potential models (e.g.\

1712:     warps, twists, mass-substructures, etc.).

1713:

1714:   \item The source surface-brightness structure is determined on a

1715:     fully adaptive Delaunay tessellation grid, which is updated with

1716:     every change of the lens potential.

1717:

1718:   \item Both model-parameter optimization and model ranking are fully

1719:     embedded in a Bayesian framework. The method takes special care not

1720:     to change the number of degrees of freedom during the

1721:     optimization, which is ensured by the adaptive source grid. Methods

1722:     with a fixed source surface-brightness grid cannot do this.

1723:

1724:   \item Both source and potential solutions are regularised, based on

1725:     a smoothness criterion. The choice of regularization can be

1726:     modified and the level of regularization is set by Bayesian

1727:     optimization of the evidence. The data themselves determine what

1728:     level of regularization is needed. Hence overly smooth or overly

1729:     irregular structure is automatically penalised.

1730:

1731:   \item The maximum-posterior and the full marginalized probability

1732:     distribution function of {\sl all} linear and non-linear

1733:     parameters can be determined, marginalized over all other

1734:     parameters (including regularization). Hence a full exploration

1735:     of {\sl all} uncertainties of the model is undertaken.

1736:

1737:   \item The full marginalized evidence (i.e.\ the probability of the

1738:     data given the model) is calculated, which can be used to rank

1739:     {\sl any} set of model assumptions (e.g. pixel size, PSF) or model

1740:     families. In our case, we intend to compare smooth models with

1741:     models that include mass substructure. The marginalized evidence

1742:     implicitly includes Occam's razor and can be used to assess whether

1743:     substructure or any other assumption is justified, compared to a

1744:     null-hypothesis.

1745:

1746:   \end{itemize}

1747:

1748:   \noi The method has been tested and calibrated on a set of

1749:   artificial but realistic lens systems, based on the

1750:   lens system SLACS J1627$-$0055.

1751:

1752:   \noi The ensemble of mock data consists of a smooth PL lens and

1753:   thirteen clumpy models including one or two NFW substructures.  Different values

1754:   for the mass and the substructure position have been considered.

1755:   Using the Bayesian optimization strategy developed in this paper we are

1756:   able to recover the smooth PL system and all perturbed models with a

1757:   substructure mass $ \ga 10^7 M_\odot$ when located at the lowest

1758:   surface brightness point on the Einstein ring and with a mass $\geq

1759:   10^9 M_\odot$ when located just inside or outside the ring (i.e.\

1760:   their Einstein rings need to overlap roughly).  For all these models

1761:   we have convincingly recovered the best set of non-linear parameters

1762:   describing the lens potential and objectively set the level of

1763:   regularization.

1764:

1765:   \noi Furthermore, our implementation of the Nested-Sampling

1766:   technique provides statistical errors for {\sl all} model parameters

1767:   and allows us to objectively rank and compare different potential

1768:   models in terms of Bayesian evidence, removing subjective choices

1769:   as far as possible. Any choice can be ranked

1770:   quantitatively. For each of the lens systems we compare a complete smooth PL

1771:   mass model with a perturbed PL+NFW (+NFW) one.  The method developed here

1772:   allows us to solve simultaneously for the lens potential and the

1773:   lensed source. The latter, in particular, is reconstructed on an

1774:   adaptive grid which is re-computed at every step of the

1775:   optimization, allowing us to take into account the correct number

1776:   of degrees of freedom.

1777:

1778:

1779:  \begin{figure*}

1780:     \begin{center}

1781:       \includegraphics[width=0.45\hsize]{fig7a}

1782:       \hfill

1783:       \includegraphics[width=0.45\hsize]{fig7b}

1784:       \caption{ Similar to Figure~\ref{fig:best1_upr} for L$_{12}$.}

1785:        \label{fig:best1_outside_01}

1786:     \end{center}

1787:   \end{figure*}

1788:

1789:   \begin{figure*}

1790:     \begin{center}

1791:       \includegraphics[width=\hsize]{fig8}

1792:       \caption{Results of the linear source and potential

1793: 	reconstruction for the lens L$_1$. The first row shows

1794: 	the original model (left), the reconstructed model (middle)

1795: 	and the current-best source, as well as the corresponding adaptive grid (right).

1796: 	On the second row the image

1797: 	residuals (left), the total potential convergence (middle) and

1798: 	the substructure convergence (right) are shown. Note

1799: 	that the substructure, although weak, is reconstructed at

1800: 	the correct position.}

1801:       \label{fig:best1_sub_upr}

1802:     \end{center}

1803:   \end{figure*}

1804:

1805:    \begin{figure*}

1806:     \begin{center}

1807:       \includegraphics[width=\hsize]{fig9}

1808:       \caption{Similar to Figure~\ref{fig:best1_sub_upr} for L$_2$. We note

1809: 	that the substructure is extremely

1810: 	well reconstructed, both in position and in mass.}

1811:       \label{fig:best1_sub_upr_001}

1812:     \end{center}

1813:   \end{figure*}

1814:

1815:

1816:    \begin{figure*}

1817:     \begin{center}

1818:       \subfigure[]{ \includegraphics[width=0.45\hsize]{fig10a}

1819: 	\label{fig:linear_a}

1820:       }

1821:       \hfill

1822:       \subfigure[]{ \includegraphics[width=0.45\hsize]{fig10b}

1823: 	\label{fig:linear_b}

1824:       }

1825:

1826:      \subfigure[]{\centering \includegraphics[width=0.45\hsize]{fig10c}

1827: 	\label{fig:linear_c}

1828:       }

1829:

1830:       \caption{Non-linear reconstruction of the lens $\rm L_{13}$ for a single PL model, a PL+NFW and

1831:       a PL+2NFW one.}

1832:        \label{fig:best_double}

1833:     \end{center}

1834:   \end{figure*}

1835:

1836:   \begin{figure*}

1837:     \begin{center}

1838:       \includegraphics[width=\hsize]{fig11}

1839:       \caption{Results of the first linear source and potential

1840: 	reconstruction for the lens L$_{13}$. The first row shows

1841: 	the original model (left), the reconstructed model (middle)

1842: 	and the image residuals (right). On the second row the current-best source (left), the total potential convergence (middle) and

1843: 	the substructure convergence (right) are shown. Note

1844: 	that the substructure, although weak, is reconstructed at

1845: 	the correct position.}

1846:       \label{fig:best1_sub_double}

1847:     \end{center}

1848:   \end{figure*}

1849:

1850:  \begin{figure*}

1851:     \begin{center}

1852:       \includegraphics[width=\hsize]{fig12}

1853:       \caption{ Results of the second linear source and potential

1854: 	reconstruction for the lens L$_{13}$.  }

1855:       \label{fig:best2_sub_double}

1856:     \end{center}

1857:   \end{figure*}

1858:

1859:   \noi In this paper we have considered systems that contain at most two CDM substructures. Although this may appear to be a very small

1860:   number when compared with predictions from N-body simulations within the virial radius, it represents a realistic scenario.

1861:   As we have shown, our method, with current HST data, is mostly sensitive

1862:   to perturbations with mass $\ga 10^7\rm M_\odot$ and located on the Einstein ring ($\Delta\theta\sim\mu\theta_{\rm ER}$).

1863:   The projected volume that we are able to probe is therefore small compared to the projected volume within the virial radius.

1864:   The probability that more than two substructures have this right combination of mass and position is relatively low and we expect most of the

1865:   real systems to be dominated by one or at most two perturbers.

1866:   \noi Despite these new results, further improvements can still be

1867:   made. We think, for example, that an adaptive source grid based on surface

1868:   brightness, rather than magnification, or a combination, could be

1869:   more suitable for the scientific problem considered here.

1870:

1871:   \noi The method will soon be applied to real systems, such as those

1872:   from the \emph{Sloan Lens ACS Survey} sample of massive early-type galaxies

1873:   \citep{Koopmans06,Bolton06,Treu06}. This will lead to powerful new

1874:   constraints or limits on the fraction and mass distribution of

1875:   substructure. Results will be compared with CDM simulations.

1876:

1877:   \section*{Acknowledgements} The authors would like to thank Matteo

1878:   Barnab\`e, Oliver Czoske, Antonaldo Diaferio, Phil Marshall, Sherry Suyu and the anonymous referee  for useful

1879:   discussions. They also thank the Kavli Institute for Theoretical Physics

1880:   for hosting the gravitational lensing workshop in fall 2006, during which

1881:  important parts of this work were carried out. SV and LVEK are supported (in part) through an

1882:   NWO-VIDI program subsidy (project number 639.042.505).

1883:

1884:

1885:   \bibliography{ms}

1886:

1887:   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1888:

1889:   \begin{figure*}

1890:     \begin{center}

1891:       \includegraphics[width=\hsize]{fig13}

1892:       \caption{Posterior probability distributions for the non-linear

1893: 	parameters of the smooth lens model $\rm L_0$ as obtained from

1894: 	the Nested-Sampling evidence exploration. In particular

1895: 	results for two different models are shown, a smooth PL

1896: 	potential (blue histograms) and a perturbed PL+NFW lens

1897: 	(black histograms). From top left, the lens strength, the

1898: 	position angle, the axis ratio, the slope, the logarithm of

1899: 	the source regularization constant, the substructure mass and

1900: 	position are plotted.}

1901:       \label{fig:smooth_weights}

1902:     \end{center}

1903:   \end{figure*}

1904:

1905:   \begin{figure*}

1906:     \begin{center}

1907:       \includegraphics[width=\hsize]{fig14}

1908:       \caption{Similar to Figure~\ref{fig:smooth_weights} for L$_1$.}

1909:         \label{fig:pert0001_weights} \end{center}

1910: 	\end{figure*}

1911:

1912: 	\begin{figure*}

1913:     \begin{center}

1914:       \includegraphics[width=\hsize]{fig15}

1915:       \caption{Similar to Figure~\ref{fig:smooth_weights} for L$_2$.}

1916:       \label{fig:pert001_weights}

1917:     \end{center}

1918:   \end{figure*}

1919:

1920:

1921:  \begin{figure*}

1922:     \begin{center}

1923:       \includegraphics[width=\hsize]{table_3.ps}

1924:          \end{center}

1925:   \end{figure*}

1926:

1927:   \begin{figure*}

1928:     \begin{center}

1929:       \includegraphics[width=\hsize]{table_4.ps}

1930:     \end{center}

1931:   \end{figure*}

1932:

1933:

1934: \clearpage

1935:

1936: \newpage \label{lastpage}

1937:

1938:

1939:

1940: \end{document}

1941: