
1: %\documentclass[referee]{aa} % for a referee version

2: %\documentclass[onecolumn]{aa} % for a paper on 1 column

3: %\documentclass[longauth]{aa} % for the long lists of affiliations

4: %\documentclass[rnote]{aa} % for the research notes

5: %

6: \documentclass{aa}

7: %

8: \usepackage{graphicx,supertabular}

9: \usepackage{txfonts}

10: \usepackage{natbib}

11: \citestyle{aa}

12:

13: \begin{document}

14: %

15:    \title{Towards a library of synthetic galaxy spectra and preliminary results of classification and parametrization of unresolved galaxies for Gaia}

16:

17:    \author{P. Tsalmantza\inst{1}

18:           \and M. Kontizas\inst{1}

19:           \and C. A. L. Bailer-Jones\inst{2}

20:           \and B. Rocca-Volmerange\inst{3,4}

21:           \and R. Korakitis\inst{5}

22:           \and E. Kontizas\inst{6}

23:           \and E. Livanou\inst{1}

24:           \and A. Dapergolas\inst{6}

25:           \and I. Bellas-Velidis\inst{6}

26:           \and A. Vallenari\inst{7}

27:           \and M. Fioc\inst{3,8}}

28:

29:     \offprints{P. Tsalmantza\\

30:     \email{vivitsal@phys.uoa.gr}}

31:

32:     \institute{Department of Astrophysics, Astronomy \& Mechanics, Faculty

33:                of Physics, University of Athens, GR-15783 Athens, Greece

34:          \and

35:               Max-Planck-Institut f\"ur Astronomie, K\"onigstuhl 17, 69117 Heidelberg, Germany

36:          \and

37:               Institut d'Astrophysique de Paris, 98bis Bd Arago, 75014 Paris, France

38:          \and

39:               Universit\'e de Paris-Sud XI, I.A.S., 91405 Orsay Cedex, France

40:          \and

41:               Dionysos Satellite Observatory, National Technical University of Athens, 15780 Athens, Greece

42:          \and

43:               IAA, National Observatory of Athens, P.O. Box 20048, GR-118 10 Athens, Greece

44:          \and

45:               INAF, Padova Observatory, Vicolo dell'Osservatorio 5, 35122 Padova, Italy

46:          \and

47:               Universit\'e Pierre et Marie Curie, 4 place Jussieu, 75005 Paris, France}

48:

49:

50: \date{Received date / accepted}

51:

52:

53: % \abstract{}{}{}{}{}

54: % 5 {} token are mandatory

55:

56:   \abstract

57:   % context heading (optional)

58:    {} %leave it empty if necessary

59:   % aims heading (mandatory)

60:    {The Gaia astrometric survey mission will, as a consequence of its scanning

61:      law, obtain low resolution optical (330--1000\,nm) spectrophotometry of

62:      several million unresolved galaxies brighter than V=22. We present the

63:      first steps in a project to design and implement a classification system

64:      for these data. The goal is both to determine morphological classes and

65:      to estimate intrinsic astrophysical parameters via synthetic templates.

66:      Here we describe (1) a new library of synthetic galaxy spectra, and (2)

67:      first results of classification and parametrization experiments using

68:      simulated Gaia spectrophotometry of this library.}

69:   % methods heading (mandatory)

70:    {We have created a large grid of synthetic galaxy spectra using the

71:      P\'EGASE.2 code, which is based on galaxy evolution models that take into

72:      account metallicity evolution, extinction correction and emission lines

73:      (with stellar spectra based on the BaSeL library). Our classification and

74:      regression models are Support Vector Machines (SVMs), which are kernel-based

75:      nonlinear estimators.}

76:   % results heading (mandatory)

77:    {We produce a basic library of about 3600 zero redshift galaxy spectra covering

78:      the main Hubble types over the wavelength range 250 to 1050\,nm at a sampling

79:      of 1\,nm or less. It is computed on a regular grid of four key

80:      astrophysical parameters for each type and for intermediate random values

81:      of the same parameters. An extended library reproduces this at a series

82:      of redshifts. Initial results from the SVM classifiers and parametrizers are

83:      promising, indicating that Hubble types can be reliably predicted and

84:      several parameters estimated with low bias and variance. Comparing the colours

85:      of our synthetic library with Sloan Digital Sky Survey (SDSS) spectra we

86:      find good agreement over the full range of Hubble types and parameters.}

87:   % conclusions heading (optional), leave it empty if necessary

88:    {}

89:

90:    \keywords{-- Galaxies: fundamental parameters -- Techniques: photometric --

91:      Techniques: spectroscopic}

92:

93: \maketitle

94:

95: \section{Introduction}

96: Large surveys of galaxies provide information on their global spatial

97: distribution and the physical properties of individual galaxies. Such a survey

98: will be obtained for the whole sky by the ESA mission, Gaia, from 2011--2016.

99: During its five-year mission Gaia will observe several million unresolved

100: galaxies over the whole sky. Although the survey's main goal is the

101: stellar content and the structure of our Galaxy, there remains a lot of

102: important science to be extracted from the extragalactic component.

103:

104: There currently exist several surveys of galaxies, but even SDSS -- one of the

105: most extended galaxy photometric and spectroscopic surveys in the optical

106: and near IR (roughly the spectral range of Gaia) -- covers only a fifth of

107: the sky. Gaia extends this in several ways: i) It will be able to detect about

108: 10$^7$ unresolved galaxies down to G=20 (V=20--22); ii) Gaia will be the first

109: homogeneous survey of galaxies covering the whole sky since photographic ones

110: (UK, ESO, Palomar Schmidt surveys, 3500 to 6500\,\AA) of 30 years ago; iii) The

111: spectrophotometry covers a larger spectral range (3300 to 10\,000\,\AA\

112: sampled in about 100 bins) than earlier surveys; iv) Gaia observes each source

113: an average of 80 times over the mission. With this we can investigate many

114: different types of galaxy, QSO and AGN variability; v) The sample will have a

115: well-defined selection function, important for estimating the galaxy density

116: in the local universe.

117:

118: Our long-term objective is to classify and to determine the astrophysical

119: parameters of all unresolved galaxies which Gaia will observe. In order to

120: proceed with this we first need to acquire or build an appropriate library of

121: galaxy spectra. This library must show sufficient variation in those

122: intrinsic astrophysical parameters (APs) to which the Gaia observations will

123: be sensitive. To determine APs on a homogeneous system we ultimately need to

124: build or calibrate our classifiers using synthesis models and synthetic

125: spectra. Existing observed or synthetic libraries are too small or do not

126: cover the required wavelength range. For this reason we set out in this paper

127: to start building a new library.

128:

129: We use the galaxy evolution model P\'EGASE (Projet d'\'Etude des Galaxies

130: par Synth\`ese \'Evolutive) \citep{fioc2,fioc5}, to synthesize galaxy spectra. The

131: P\'EGASE.2 code\footnote{http://www2.iap.fr/users/fioc/PEGASE.html} is aimed

132: principally at modelling the spectral evolution of galaxies by types: the

133: active and passive evolution of stellar populations as well as interstellar gas

134: and dust are coherently evolved in time. No galaxy number density evolution is

135: considered, although the results of our models are compatible with occasional

136: rare galaxy merging. The code is based on the stellar evolutionary tracks from

137: the Padova group, extended to the thermally pulsating asymptotic giant branch

138: (AGB) and post-AGB phases \citep{groenewegen}. These tracks cover all the

139: masses, metallicities and phases of interest for galaxy spectral synthesis.

140: P\'EGASE.2 uses the BaSeL 2.2 library of stellar spectra and can synthesize low

141: resolution (R=200) ultraviolet to near-infrared spectra of Hubble sequence

142: galaxies, as well as of starbursts. For a given evolutionary scenario

143: (typically characterized by a star formation law, an initial mass function and,

144: possibly, infall or galactic winds), the code consistently gives the spectral

145: energy distribution (SED) and computes the star formation rate and the

146: metallicity at any time. The nebular component (continuum and lines) due to HII

147: regions is calculated and added to the stellar component. Depending on the

148: geometry of the galaxy (disk or spheroidal), the attenuation of the spectrum by

149: dust is then computed using a radiative transfer code (which takes account of

150: the scattering).

151:

152: By adopting a star formation rate proportional to the mass of the gas, the IMF of

153: \citet{rana} and the presence of infall and galactic winds, eight synthetic

154: spectra corresponding to different typical types of Hubble sequence galaxies

155: (E, S0, Sa, Sb, Sbc, Sc, Sd and Im) have already been produced using P\'EGASE.2

156: \citep{fioc3,fioc1,le2}. For each type, the values of the parameter set have

157: been fitted to the observed spectral energy distribution (SED) of nearby (z=0)

158: galaxies. For illustration a comparison with data is shown in \citet{fioc6}. At

159: higher redshifts, the evolution scenarios have been tested against most

160: existing faint galaxy samples, including the deepest surveys \citep[B=29 Hubble

161: Deep Field-N,][]{williams}. One unique model of galaxy fractions by type

162: simultaneously predicts the multi-wavelength (UV to near-IR) galaxy counts,

163: dominated by young stellar populations in the UV and old evolved galaxies in

164: the near-IR respectively.  The faint blue galaxy population, in excess in the

165: far-UV, has also been analysed \citep{fioc4}. An episodic star formation rate

166: of low level is proposed to fit the far-UV counts \citep{FOCA2000,buat}. In the

167: near-IR, the evolution scenario of elliptical galaxies predicts the puzzling

168: $K$-$z$ relation of radio galaxy hosts between z=0 and z=4. \citet{rocca} use

169: P\'EGASE.2 scenarios to interpret the galaxy distribution in the K-band Hubble

170: diagram. The same models are used to interpret the mid-IR galaxy counts \citep{rocca07},

171: although here a supplementary ultra-luminous infrared galaxy population is

172: required. Finally, the robustness of our evolution scenarios is confirmed by

173: the significant predictions of photometric redshifts as compared to

174: spectroscopic redshifts of the HDF-N sample \citep{le2}. Using a much larger sample

175: from the SDSS, we make an additional comparison. This comparison, made using

176: simulated photometry and colour--colour diagrams, is the subject of the second

177: section of this paper.  In section 3 we describe the production of our library based on

178: these eight typical synthetic spectra of galaxies and in section 4 we explain

179: how these are used to simulate Gaia data. In section 5 we present our

180: classification and parametrization models and give preliminary results on their

181: performance.  A brief discussion follows in section 6.

182:

183:

184: \section{P\'EGASE synthetic spectra and comparison with the SDSS spectra}

185:

186: In order to determine the parameter ranges over which we should generate the

187: library, we first make a comparison of colours synthesized from the eight

188: typical P\'EGASE spectra with SDSS data. To avoid small discrepancies that

189: occur between synthesized and published SDSS

190: photometry\footnote{http://www.sdss.org/dr4/products/spectra/spectrophotometry.html}

191: and to treat both types of spectral data in the same way, we decided to

192: synthesize SDSS photometry from the SDSS spectra in the same way as we do with

193: the synthetic spectra (and using the same ``calib'' and ``colors'' programs in

194: the P\'EGASE.2 code for both). For this we use the whole set of spectroscopic

195: data for the 565\,715 galaxies that are available in data release 4 (DR4) of

196: SDSS. The properties of the SDSS filters are given in Table~\ref{t1}.

197:

198: \begin{table}

199:  \centering

200:  \caption {Characteristics of the five SDSS filters}

201:  \begin{tabular}{c c c c c}

202:  \hline\hline

203:   Name & Average    & Starting   & Ending     & Magnitude \\

204:        & wavelength & wavelength & wavelength & limit in  \\

205:        & (\AA)      & (\AA)      & (\AA)      & survey    \\

206:  \hline

207: u      & 3551       & 2980       & 4130       & 22.0      \\

208: g      & 4686       & 3630       & 5830       & 22.2      \\

209: r      & 6165       & 5380       & 7230       & 22.2      \\

210: i      & 7481       & 6430       & 8630       & 21.3      \\

211: z      & 8931       & 7730       & 11230      & 20.5      \\

212: \hline

213: \end{tabular}

214: \label{t1}

215: \end{table}

216:

217: Typical synthetic spectra corresponding to each of the eight Hubble types are

218: shown in Fig. \ref{f1}, with the location of the SDSS filters superimposed.

219: Each of these ``typical spectra'' corresponds to a specific combination of

220: values of the astrophysical parameters (see section~\ref{most_signif}). The

221: SEDs produced by P\'EGASE have been normalized to the flux of a 50\,\AA{}

222: wavelength interval centered on 5500\,\AA. The elliptical and S0 galaxies have

223: very small differences, apparent at the two extremes of the wavelength range.

224: This implies small differences in colours but not necessarily in magnitudes

225: (which depend on their masses).

226:

227: From Fig.~\ref{f1} it is obvious that the u filter is very important for the

228: comparison with real data since it is the one containing the discontinuity

229: around 4000\,\AA. However, the SDSS spectra do not cover the u band, so

230: photometry in this band cannot be synthesized. We refrain from using the SDSS

231: photometry for the u band because of the red leak in this

232: filter\footnote{http://www.sdss.org/dr4/products/images/index.html$\#$redleak},

233: which would render comparisons with synthetic data unreliable. This leak

234: produces erroneous magnitudes, especially for E and S0 types on

235: account of their large numbers of red stars.

236:

237: In addition we avoid using the z filter in our comparison since

238: its photometry also cannot be synthesized from the SDSS spectra, which terminate

239: at shorter wavelengths than the z passband.

240:

241: \begin{figure}

242: \centering

243: \includegraphics[width=6cm,angle=-90]{f1.ps}

244: \caption{

245:   Synthetic spectra for the eight typical galaxy types from P\'EGASE.2. The

246:   vertical lines denote the limits of the five SDSS filters (transmission

247:   below 1e-4 of the peak). (Emission lines are not included). The legend at

248:   the right defines colour used to plot each type of galaxy (top) and SDSS

249:   filter (bottom).}

250: \label{f1}

251: \end{figure}

252:

253: We therefore decided to base our comparison between the SDSS and P\'EGASE.2

254: data on the g, r and i filters only and, more specifically, on the g--r and r--i

255: colours. However, the wavelength range of the SDSS spectra does not quite

256: extend to the bluest side of the g filter. For this reason, we cut the blue

257: end of this filter and created a new g filter starting at 3830\,\AA\ instead of

258: 3630\,\AA\ (Table~\ref{t1}). This change is in practice very small

259: since the transmission of the g filter is only 3\% of the peak transmission at

260: 3830\,\AA\ and drops very rapidly below that (e.g.\ it is only 0.5\% at just

261: 10\,\AA\ lower). Furthermore, simulated photometry from the synthetic spectra

262: showed virtually no difference for the original and ``trimmed'' g band. The

263: published transmission curves of the SDSS filters depend on airmass and

264: whether a point or extended source is being observed.

265: We use those for extended sources and zero airmass. The

266: photometry is calibrated on the AB system, as used by SDSS \citep{fukugita}.
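
For reference, the AB magnitude in a band with response $S(\nu)$ follows the standard broadband definition of \citet{fukugita},
\begin{displaymath}
m_{\mathrm{AB}} = -2.5\,\log_{10}\left[\frac{\int f_{\nu}\,S(\nu)\,\mathrm{d}\nu/\nu}{\int S(\nu)\,\mathrm{d}\nu/\nu}\right] - 48.60 ,
\end{displaymath}
with $f_{\nu}$ in erg\,s$^{-1}$\,cm$^{-2}$\,Hz$^{-1}$. A colour such as g--r is then $m_{\mathrm{AB}}(g) - m_{\mathrm{AB}}(r)$, so that any overall scaling of the spectrum cancels. This is given only as a sketch of the quantity being synthesized: the symbol $S(\nu)$ is our notation for the filter curves adopted above, and the numerical integration actually performed by the ``calib'' and ``colors'' programs is not described here.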

267:

268: We synthesize photometry using the one-dimensional spectra from DR4, which are

269: supplied with additional analysis information, such as redshift and emission

270: line parameters. In order to select data suitable for our purposes, we

271: applied the following criteria: the galaxies should not be near a CCD edge nor

272: saturated, and they should not have a very low SNR (the photometric error in all

273: bands should be less than 0.1 mag). Only spectra with redshifts below 0.01 are

274: retained, since the synthetic spectra of P\'EGASE.2 were produced at zero

275: redshift. These criteria resulted in a sample of 1292 galaxies. Their

276: synthesized photometry plus that for the eight typical galaxy types from

277: P\'EGASE.2 is shown in Fig.~\ref{f2}. This figure clearly shows that the

278: colours of the Im, Sd, Sc, Sbc, Sb and Sa types are generally in good

279: agreement with the colours of the observed spectra, although in the case of

280: S0 and E types the synthetic spectra seem to be slightly redder in g$-$r than

281: the SDSS spectra.

282:

283:

284: \begin{figure}

285: \centering

286: \includegraphics[width=6cm,angle=-90]{f2.ps}

287: \caption{Colour--colour (g$-$r vs.\ r$-$i) diagram of synthesized photometry of SDSS galaxy spectra (black) and synthetic photometry of the eight typical galaxy types generated from the

288:   P\'EGASE models (red points).}

289: \label{f2}

290: \end{figure}

291:

292:

293:

294: \section{The library of synthetic spectra}\label{library}

295:

296: \subsection{The most significant parameters}\label{most_signif}

297:

298: Each spectrum in our library is uniquely defined by a set of 17 astrophysical

299: parameters, plus the morphological type (E, S0, Sa, Sb, Sbc, Sc, Sd or Im).

300: The four most significant APs are: p1 and p2 of the star formation scenario

301: ($M_{\rm gas}^{p_1}/p_2$); the infall timescale; the age of the galactic winds. The

302: age of the galactic winds is non-zero only for E and S0 galaxies. Note that

303: the Hubble type is not an independent parameter, as only certain ranges of the

304: APs are available for each type (as will be detailed later).
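
As a sketch of how p1 and p2 enter, the star formation scenario quoted above can be written as
\begin{displaymath}
\mathrm{SFR}(t) = \frac{M_{\mathrm{gas}}(t)^{\,p_1}}{p_2} ,
\end{displaymath}
where $M_{\mathrm{gas}}(t)$ is the gas mass at time $t$. The explicit time dependence and the notation $\mathrm{SFR}(t)$ are our additions for illustration, and the normalization of $M_{\mathrm{gas}}$ is left implicit here. For p1=1 this reduces to a star formation rate proportional to the gas mass, with p2 playing the role of a star formation timescale: at fixed gas mass, a larger p2 gives a lower star formation rate.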

305:

306: In order to investigate the influence of each of the parameters p1, p2 and

307: infall timescale on the integrated galaxy spectrum (SED), we modified the

308: parameters of the Sbc model (an intermediate type) over a range of values. In

309: the typical model for the Sbc type the values were 1, 5714 Myr/$M_{\odot}$ and 6000 Myr for

310: p1, p2 and infall timescale, respectively. In the modified models we vary p1

311: between 0.4 and 2, p2 from 100 to 20000 Myr/$M_{\odot}$ and infall from 100 to 10000 Myr. The

312: results are shown in Figs.~\ref{f3}--\ref{f5}.

313:

314: To investigate the effect of the age of the galactic winds parameter we followed

315: the same procedure but now with the elliptical model. In the typical model for

316: the E type the age is 1~Gyr and we vary it between 0.1 and 7.5 Gyr

317: (Fig.~\ref{f6}).

318:

319: From the figures we see that these four parameters have a major effect on the

320: colours. We performed similar analyses for other APs and concluded that they

321: had a much smaller impact on the data (in particular once the spectra are

322: reduced to the Gaia resolution). Therefore, the spectra in the present library

323: show variance only in these four APs.

324:

325: \begin{figure}

326: \centering

327: \includegraphics[width=6cm,angle=-90]{f3.ps}

328: \caption{Colour--colour (g--r vs r--i) diagram of synthesized photometry of SDSS galaxy spectra (black) and of synthetic P\'EGASE

329:   spectra of the typical Sbc model (yellow) and the models of Sbc with

330:   different values of p1 (red). The largest g--r corresponds to p1=1 and the

331:   smallest g--r to p1=2.}

332: \label{f3}

333: \end{figure}

334:

335: \begin{figure}

336: \centering

337: \includegraphics[width=6cm,angle=-90]{f4.ps}

338: \caption{Colour--colour (g--r vs r--i) diagram of synthesized photometry of SDSS galaxy spectra (black) and of synthetic P\'EGASE

339:   spectra of the typical Sbc model (yellow) and the models of Sbc with

340:   different values of p2 (red). The largest g--r corresponds to p2=2000 Myr/$M_{\odot}$ and the

341:   smallest g--r to p2=20000 Myr/$M_{\odot}$.}

342: \label{f4}

343: \end{figure}

344:

345: \begin{figure}

346: \centering

347: \includegraphics[width=6cm,angle=-90]{f5.ps}

348: \caption{Colour--colour (g--r vs r--i) diagram of synthesized photometry of SDSS galaxy spectra (black) and of synthetic P\'EGASE

349: spectra of the typical Sbc model (yellow) and the models of Sbc with different values of infall timescale (red). The largest

350: g--r corresponds to infall timescale=100\,Myr and the smallest g--r to infall timescale=10\,Gyr.}

351: \label{f5}

352: \end{figure}

353:

354: \begin{figure}

355: \centering

356: \includegraphics[width=6cm,angle=-90]{f6.ps}

357: \caption{Colour--colour (g--r vs r--i) diagram of synthesized photometry of SDSS galaxy spectra (black) and of synthetic P\'EGASE

358: spectra of the typical E model (yellow) and the models of E with different values of age of galactic winds (red). The

359: largest g--r corresponds to age of galactic winds=7.5\,Gyr and the smallest g--r to 0.1\,Gyr.}

360: \label{f6}

361: \end{figure}

362:

363: By co-varying these four parameters and using all their combinations in each of

364: the eight typical models we are able to cover most of the variance we see in

365: the SDSS data in the colour--colour diagram. Generally, there is no clear

366: distinction between the colours of neighbouring Hubble types. In order to

367: assign a known type to each spectrum in our library we decided (as a first working approximation) to

368: retain only those models which lie within a circle (in the colour--colour diagram)

369: centered on one of the eight typical types and with a radius equal to half of

370: the distance to the nearest neighbouring typical model. This is reasonable

371: since the models lie mostly along a one-dimensional locus (a line) in the

372: colour--colour diagram. In this way upper and lower limits of the values of

373: the parameters were established for each type, although

374: an overlap in APs (if not in colours) remains, as can be seen in Table~\ref{t2}.

375: This leaves a set of 888 synthetic spectra of known types of

376: galaxies (see section~\ref{regular_grid}).
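
The circle criterion described above can be written compactly as follows; the notation is ours and is introduced only for illustration. Let $\mathbf{c} = (g-r,\, r-i)$ be the colours of a model and $\mathbf{c}_k$ those of the $k$-th typical model. The radius of the circle around type $k$ is
\begin{displaymath}
R_k = \frac{1}{2}\,\min_{j \neq k} \left\| \mathbf{c}_k - \mathbf{c}_j \right\| ,
\end{displaymath}
and a model is retained, and assigned type $k$, if
\begin{displaymath}
\left\| \mathbf{c} - \mathbf{c}_k \right\| \le R_k ,
\end{displaymath}
where $\|\cdot\|$ is the Euclidean distance in the colour--colour plane. Since $R_k + R_j \le \| \mathbf{c}_k - \mathbf{c}_j \|$ for any pair of types, two circles can at most touch, so a retained model is assigned to a single type. Whether the boundary itself is included is an assumption on our part.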

377:

378: The galaxy type can be considered as a 5th AP, although it is of a different

379: nature than the others, since it is needed to fully specify the spectrum and

380: constrain the range of values of the other four APs. In addition, when one

381: redshifts the spectrum to non-zero values of z, this quantity also becomes a

382: parameter (albeit not intrinsic to the source).

383:

384: \subsection{Library of galaxy spectra over a regular grid of parameters}\label{regular_grid}

385:

386: Applying the above procedures, we produced a library of 888 synthetic spectra

387: covering seven separate Hubble types (because we consider E and S0 as a single

388: type). The values of the four parameters of each type are given in table

389: \ref{t2}, while the values of the other input parameters of P\'EGASE.2 (kept

390: constant in all models) are given in table \ref{t3}. The models are plotted

391: in Fig.~\ref{f7}, where the simulated colours of the 888 synthetic spectra and

392: the 1292 SDSS spectra are compared. This first set of 888 synthetic spectra

393: was then calculated at five values of redshift: 0, 0.05, 0.1, 0.15, 0.2,

394: resulting in a total of 4440 spectra.

395:

396:

397: \begin{table*}

398:  \centering

399:  \caption {The four astrophysical parameter (AP) ranges for each Hubble type in

400: the regular library of P\'EGASE synthetic spectra. Note that the AP ranges for

401: each Hubble type partially overlap. The morphological type can be considered as

402: an additional (but non-independent) parameter, required to fully explain the

403: variance in the library. The final column (N) gives the number of spectra for

404: each type (which sum to 888). See the regular library grid in \citet{le2} for

405: comparison.}

406:  \begin{tabular}{c c c c c c}

407:  \hline\hline

408:

409: Type & p1      & p2          & infall     & galactic winds & N  \\

410:      &         &(Myr/$M_{\odot}$) &(Myr)       &(Gyr)           &    \\

411:  \hline

412: E-S0 & 0.6-1.5 & 100-1500    & 100-2500   & 0.1-7.5        & 327\\

413: Sa   & 0.8-1.5 & 500-2500    & 2500-3500  & none           & 10 \\

414: Sb   & 0.6-1.5 & 1500-6000   & 2000-4500  & none           & 25 \\

415: Sbc  & 0.4-1.5 & 2000-10000  & 4000-7000  & none           & 148\\

416: Sc   & 0.6-1.5 & 6000-14000  & 7000-10000 & none           & 68 \\

417: Sd   & 0.4-1.5 & 10000-18000 & 7000-10000 & none           & 65 \\

418: Im   & 1.0-2.0 & 14000-20000 & 7000-10000 & none           & 245\\

419: \hline

420: \end{tabular}

421: \label{t2}

422: \end{table*}

423: \begin{table*}

424:  \centering

425:  \caption {The values of the parameters of the P\'EGASE models which are kept

426: constant in the library \citep{fioc2}.}

427:  \begin{tabular}{c c}

428:  \hline\hline

429: Parameters & Values \\

430: \hline

431: SNII Ejecta of massive stars                     & model B of \citet{woosley}\\

432: Stellar winds                                    & yes\\

433: Initial mass function                            & \citet{rana}\\

434: Lower mass                                       & 0.09 solar masses \\

435: Upper mass                                       & 120.00 solar masses\\

436: Fraction of close binary systems                 & 0.05\\

437: Initial metallicity                              & 0.00 \\

438: Metallicity of the infalling gas                 & 0.00\\

439: Consistent evolution of the stellar metallicity  & yes \\

440: Mass fraction of substellar objects              & 0.00\\

441: Nebular emission                                 & yes \\

442: Extinction                                       & disk geometry: inclination-averaged \\

443:                                                  & for Sa, Sb, Sbc, Sc, Sd and Im  \\

444:                                                  & spheroidal geometry for E-S0 \\

445: Age                                              & 13 Gyr for E-S0, Sa, Sb, Sbc, Sc \& Sd \\

446:                                                  & 9 Gyr for Im \\

447: \hline

448: \end{tabular}

449: \label{t3}

450: \end{table*}

451:

452:

453: \subsection{Extension of the library to random values of parameters}

454:

455: After producing the regular synthetic spectral grid (Table~\ref{t2}), we

456: proceeded to produce synthetic spectra of galaxies with parameters selected

457: from a random distribution, in order to achieve a more continuous coverage in

458: colour space. Such grids permit more robust tests of parameter estimation

459: algorithms than do regular grids. Each parameter is selected independently

460: from a uniform distribution over the parameter ranges in the regular grid. We

461: used this approach to generate 5500 models. In doing this we keep

462: approximately the same ratios between the Hubble types as in the regular grid.

463: Because the parameter ranges for each galaxy type in Table \ref{t2} show some overlap,

464: a random draw may produce a set of parameters which fits into more than one

465: Hubble type category. To remove this ``degeneracy'' we again apply the circle

466: removal method we used in section~\ref{most_signif}. This results in a

467: ``non-degenerate'' sample of 2709 spectra.

468:

469: \begin{figure}

470: \centering

471: \includegraphics[width=6cm,angle=-90]{f7.ps}

472: \caption{ Colour--colour (g--r vs.\ r--i) diagram of synthesized photometry of

473: SDSS galaxy spectra (black) and of synthetic P\'EGASE spectra of the 8 typical

474: models of P\'EGASE.2 (yellow). Moving from the lower left to the upper right

475: part of the diagram we encounter types from Im to E. The red dots along both

476: sides of the typical models represent the spectra of both the regular and

477: random library.}

478: \label{f7}

479: \end{figure}

480:

481: A comparison of the simulated colours of the synthetic spectra (888 regular

482: grid plus 2709 random grid, at zero redshift) with the colours of SDSS spectra

483: is shown in Fig. \ref{f7}. One sees that the new set of spectra is in very

484: good agreement with the SDSS data, except for the small differences in the E

485: and S0 galaxies.

486:

487: In summary, we have produced a library of 7149 synthetic galaxy spectra (888

488: spectra of the regular grid for 5 values of redshift and 2709 of the random

489: grid at zero redshift) which can be used as an initial library of unresolved

490: galaxy spectra for assessing the possibilities of galaxy classification and

491: parametrization with Gaia. This library was created at the resolution of the

492: BaSeL 2.2 stellar library

493: (gradually changing from 8\,\AA\ at 2500\,\AA\ to 50\,\AA\ at 10\,500\,\AA),

494: which is not quite high enough for the Gaia simulation software (which

495: requires 10\,\AA). Therefore, we linearly interpolated our spectra in order

496: to resample the spectra to 10\,\AA\ over the wavelength range of

497: 2500--10\,500\,\AA. Higher resolution spectra will be produced in future work

498: using the High-spectral Resolution code P\'EGASE-HR \citep{le1}.
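
The bookkeeping of the library sizes quoted above is
\begin{displaymath}
888 \times 5 = 4440, \qquad 4440 + 2709 = 7149, \qquad 888 + 2709 = 3597 ,
\end{displaymath}
the last number being the zero-redshift subset. For the resampling, linear interpolation between two adjacent library wavelengths $\lambda_i \le \lambda \le \lambda_{i+1}$ with fluxes $F_i$ and $F_{i+1}$ takes the usual form
\begin{displaymath}
F(\lambda) = F_i + \left(F_{i+1}-F_i\right)\frac{\lambda-\lambda_i}{\lambda_{i+1}-\lambda_i} .
\end{displaymath}
The symbols $\lambda_i$ and $F_i$ are introduced here for illustration only. If both end points of the 2500--10\,500\,\AA\ range are included (an assumption), a 10\,\AA\ step gives $(10\,500-2500)/10 + 1 = 801$ samples per spectrum.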

499:

500: \section{Simulated Gaia spectra}

501:

502: The Gaia spectrophotometer is a slitless prism spectrograph comprising blue

503: and red channels (called BP and RP respectively) which operate over the

504: wavelength ranges 3300--6800\,\AA\ and 6400--10\,500\,\AA\ respectively. BP

505: and RP spectra were simulated for all 7149 library spectra using the simulator

506: developed by \citet{brown}. Each of BP and RP is simulated with 48 pixels,

507: with the dispersion varying over 30--290\,\AA/pix and 60--150\,\AA/pix

508: respectively. We artificially reddened each spectrum with a standard

509: interstellar extinction law with R=3.1, for regular values of $A_{V}$ from 0

510: to 10 for the regular library, and for 10 random values of $A_{V}$ uniformly

511: distributed in $\log(1+A_{V})$ for the random library. Noise was added to

512: all spectra, which includes the source Poisson noise, background Poisson noise

513: and CCD readout noise. This is done for five different source G-band

514: magnitudes (15, 17, 18, 19 and 20). For the following classification tests we

515: use only the sample at G=18. In Fig. \ref{f8} we present the simulated BP and RP spectra

516: for the eight typical synthetic spectra of galaxies.
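
The reddening and the sampling of $A_V$ described above can be sketched as follows, under stated assumptions. Writing $a(\lambda) = A_{\lambda}/A_{V}$ for the normalized extinction curve of the adopted law (our notation), a spectrum $F(\lambda)$ is attenuated to
\begin{displaymath}
F_{\mathrm{red}}(\lambda) = F(\lambda)\,10^{-0.4\,A_{V}\,a(\lambda)} .
\end{displaymath}
A value uniformly distributed in $\log(1+A_{V})$ can be drawn as
\begin{displaymath}
A_{V} = 10^{u} - 1, \qquad u \sim U\left[0,\ \log_{10}\left(1+A_{V}^{\mathrm{max}}\right)\right] ,
\end{displaymath}
where we assume base-10 logarithms and $A_{V}^{\mathrm{max}} = 10$ (the upper limit of the regular values); neither choice is stated explicitly for the random library. With these assumptions half of the draws fall below $A_{V} = \sqrt{11} - 1 \simeq 2.3$, i.e.\ the sampling favours low extinctions.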

517:

518: \begin{figure}

519: \centering

520: \includegraphics[width=6cm,angle=-90]{f8.ps}

521: \caption{ The simulated BP and RP spectra of the synthetic spectra for the

522: eight typical galaxy types from P\'EGASE.2. Black, green, blue, yellow,

523: magenta, light blue and red denote galaxies of type E, Sa, Sb, Sbc, Sc, Sd and

524: Im respectively.}

525: \label{f8}

526: \end{figure}

527:

528:

529: \section{Classification \& Parametrization}

530:

531: In the present work we use classification Support Vector Machines (SVMs)

532: (C-classification) to determine morphological types and regression SVMs

533: ($\epsilon$-regression) to estimate the various astrophysical parameters. We

534: use the libsvm library of \citet{libsvm} implemented in the \verb+e1071+

535: package of the R statistical environment.\footnote{{\tt http://www.r-project.org}} A

536: brief description of the SVMs is given in the Appendix of this paper. An

537: accessible introduction to SVMs can be found in \citet{bennett00}. For a more

538: technical introduction, the tutorial by \citet{burges98} is recommended.

539:

540:

541: \subsection{Galaxies at zero redshift}

542:

543: \subsubsection{Classification of the morphological type}\label{classify}

544:

545: We now try to classify the set of Gaia-simulated galaxy spectra, at G=18 with

546: zero redshift, into the seven Hubble types. This subset of the library

547: includes characteristic noise and a wide range of interstellar extinction

(0--10\,mag in $A_{V}$). It comprises 9691 spectra, which we divide at
random into two subsets: 4846 for training the SVM classifiers and 4845 for
evaluating their performance. As is advisable with many machine learning

551: methods, we first normalized the data by scaling each input (pixel) to have

552: zero mean and unit standard deviation.
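
For concreteness, this scaling can be written as follows; the symbols $x_{nk}$, $\mu_k$ and $s_k$ are introduced here for illustration, and the $1/(N-1)$ normalization of the variance is an assumption. With $x_{nk}$ the flux of spectrum $n$ in pixel $k$,
\begin{equation}
x'_{nk} = \frac{x_{nk} - \mu_k}{s_k}, \qquad
\mu_k = \frac{1}{N}\sum_{n=1}^{N} x_{nk}, \qquad
s_k^2 = \frac{1}{N-1}\sum_{n=1}^{N} (x_{nk}-\mu_k)^2 ,
\end{equation}
so that each of the 96 inputs has zero mean and unit standard deviation over the $N$ spectra used to compute $\mu_k$ and $s_k$. Whether these statistics are taken from the training subset alone or from the full sample is not specified here; applying the training-set values to the test set is a common choice.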

553:

554: For the purpose of visualizing the data set only, we perform a Principal

555: Components Analysis (PCA) on the set of 9691 96-dimensional Gaia spectra.

556: The first three Principal Components describe 78.25\%, 20.44\% and 1.02\% of

557: the data variance respectively (i.e.\ 99.71\% together).\footnote{Note that,

558: because each input dimension has already been normalized to have zero mean

559: and unit variance, a considerable fraction of the total variance is already

560: accounted for.} In Fig. \ref{f9} we plot the data in projection onto the

561: first three PCs. This diagram, plus the fact that the first three PCs explain

562: almost all of the variance in the data, suggest that a good classification

563: should be possible (the data have an intrinsic low dimensionality).
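
The quoted fractions follow from the eigenvalues of the data covariance matrix. As a sketch in notation introduced here, if $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_{96}$ are these eigenvalues, the fraction of the variance described by the $k$-th PC is
\begin{equation}
v_k = \frac{\lambda_k}{\sum_{j=1}^{96} \lambda_j}, \qquad
v_1+v_2+v_3 = 0.7825 + 0.2044 + 0.0102 = 0.9971 ,
\end{equation}
leaving only about 0.3\% of the variance in the remaining 93 components.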

564:

565: \begin{figure}

566: \centering

567: \includegraphics[width=6cm,angle=-90]{f9.ps}

568: \caption{ The 9691 simulated Gaia galaxy spectra with z=0 plotted as their

569: projections onto the first three Principal Components. Black, green, blue,

570: light blue, magenta, yellow and red denote galaxies of type E, Sa, Sb, Sbc, Sc,

571: Sd and Im respectively.}

572: \label{f9}

573: \end{figure}

574:

575: \begin{table}

576:

577:  \centering

578:  \caption {Galaxy classification with the SVM. The confusion matrix for the

training set for galaxies at $z=0$. Columns indicate the true classes, rows the
predicted ones.}

581:  \begin{tabular}{l | c c c c c c c}

582:  \hline\hline

583:

584: Type & E-S0 & Sa & Sb & Sbc & Sc & Sd & Im \\

585:  \hline

586: E-S0 & 1799  & 0    & 0  & 0   & 0   & 0   & 0   \\

587: Sa   & 0     & 1366 & 0  & 0   & 0   & 0   & 0   \\

588: Sb   & 0     & 0    & 53 & 5   & 0   & 0   & 0   \\

589: Sbc  & 0     & 0    & 0  & 134 & 0   & 0   & 0   \\

590: Sc   & 0     & 0    & 0  & 0   & 830 & 0   & 0   \\

591: Sd   & 0     & 0    & 0  & 0   & 0   & 347 & 1   \\

592: Im   & 0     & 0    & 0  & 0   & 0   & 0   & 311 \\

593:

594: \hline

595: \end{tabular}

596: \label{t4}

597:  \centering

598:  \caption {As Table~\ref{t4} but for the test set.}

599:  \begin{tabular}{l | c c c c c c c}

600:  \hline\hline

601:

602: Type & E-S0 & Sa & Sb & Sbc & Sc & Sd & Im \\

603:  \hline

604: E-S0 & 1798 & 0    & 0  & 0   & 0   & 0   & 0   \\

605: Sa   & 0    & 1329 & 0  & 0   & 0   & 0   & 0   \\

606: Sb   & 0    & 0    & 44 & 0   & 0   & 0   & 0   \\

607: Sbc  & 0    & 0    & 4  & 137 & 0   & 0   & 0   \\

608: Sc   & 0    & 0    & 0  & 1   & 797 & 0   & 0   \\

609: Sd   & 0    & 0    & 0  & 0   & 0   & 394 & 6   \\

610: Im   & 0    & 0    & 0  & 0   & 0   & 3   & 324 \\

611:

612: \hline

613: \end{tabular}

614: \label{t5}

615: \end{table}

616:

617: The results of training and testing the SVM classifier on the full 96-pixel

618: spectra are shown in Tables \ref{t4} and \ref{t5}. We see that there are very

few misclassifications: only 6 and 14 in the training and test sets,
corresponding to errors of 0.12\% and 0.29\% respectively. While these

621: results are very promising, it must be recalled that the way the library has been

622: constructed avoids class overlap in the SDSS g$-$r, r$-$i colour space,

623: which surely eases separation in the 96-dimensional BP/RP colour space.
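
To illustrate how the quoted error rate follows from a confusion matrix, the short Python sketch below recomputes it from the entries of Table~\ref{t4}. The function names are illustrative and the script is not part of the analysis carried out for this paper (which used R); it only reproduces the arithmetic.

```python
# Confusion matrix of Table t4 (training set, z = 0).
# Rows are predicted classes, columns are true classes.
TYPES = ["E-S0", "Sa", "Sb", "Sbc", "Sc", "Sd", "Im"]
TRAIN_Z0 = [
    [1799, 0, 0, 0, 0, 0, 0],
    [0, 1366, 0, 0, 0, 0, 0],
    [0, 0, 53, 5, 0, 0, 0],
    [0, 0, 0, 134, 0, 0, 0],
    [0, 0, 0, 0, 830, 0, 0],
    [0, 0, 0, 0, 0, 347, 1],
    [0, 0, 0, 0, 0, 0, 311],
]


def misclassification(matrix):
    """Return (number of errors, total count, error fraction)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    errors = total - correct
    return errors, total, errors / total


def per_class_completeness(matrix):
    """Fraction of each true class (column) that was predicted correctly."""
    n = len(matrix)
    completeness = {}
    for j in range(n):
        column_total = sum(matrix[i][j] for i in range(n))
        completeness[TYPES[j]] = matrix[j][j] / column_total
    return completeness


errors, total, rate = misclassification(TRAIN_Z0)
print(errors, total, round(100 * rate, 2))  # prints: 6 4846 0.12
```

The same two functions can be applied unchanged to the other confusion matrices in this section; the per-class completeness shows which types absorb the few errors (here Sbc and Im).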

624:

625: \subsubsection{Regression of astrophysical parameters}

626:

627: In addition to simulating an output spectrum, P\'EGASE.2 also derives 18

628: output astrophysical parameters for each galaxy. Of course, by construction we know that our

629: synthetic spectra are uniquely defined by five parameters (p1, p2, infall

630: timescale, age of the galactic winds and the Hubble type), so there can only

631: be five equivalent independent parameters amongst these 18. Nonetheless, it

632: would be useful to predict them directly. Here we build SVM regression models

633: to separately predict the nine most significant ones (listed in Table

634: \ref{t6}). For each model we train on a randomly selected set of 4846 spectra

635: and evaluate performance on the remaining 4845. In Fig.~\ref{f10} we present

636: the true and the SVM-predicted values of each parameter on the test set. Table

637: \ref{t6} summarizes this by giving the mean of the difference between the true

638: and predicted values for each parameter (which measures the systematic error)

639: as well as the RMS residual (which measures the total scatter). The plots and

640: table indicate that we can predict the parameters to good accuracy and

641: precision, i.e.\ the systematics are very small and the RMS error is a small

642: fraction of the typical values.
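
In notation introduced here for clarity, let $p_n$ be the true value and $\hat{p}_n$ the SVM prediction of a parameter for test spectrum $n$, and let $\delta_n = p_n - \hat{p}_n$. The two normalized quantities tabulated in Table~\ref{t6} are then
\begin{equation}
b = \frac{\langle \delta \rangle}{\langle p \rangle}, \qquad
s = \frac{{\rm sd}(\delta)}{\langle p \rangle} ,
\end{equation}
where angle brackets denote the mean over the test set and sd the standard deviation. Since $\langle\delta^2\rangle = \langle\delta\rangle^2 + {\rm var}(\delta)$ (taking the population variance), $s$ is close to the normalized RMS residual whenever $|b|$ is small compared with $s$. In Table~\ref{t6} the largest ratio $|b|/s$ is about 0.27, for the mass to light ratio.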

643:

644:

645: \begin{figure*}[t]

646:   \setlength{\unitlength}{1cm}

647: \begin{picture}(18,15)

648: \put(0,15){\special{psfile=f10a.ps hoffset=0 voffset=0 hscale=20

649: vscale=20 angle=-90}}

650: \put(0,10){\special{psfile=f10b.ps hoffset=0 voffset=0 hscale=20

651: vscale=20 angle=-90}}

652: \put(0,5){\special{psfile=f10c.ps hoffset=0 voffset=0 hscale=20

653: vscale=20 angle=-90}}

654: \put(6,15){\special{psfile=f10d.ps hoffset=0 voffset=0 hscale=20

655: vscale=20 angle=-90}}

656: \put(6,10){\special{psfile=f10e.ps hoffset=0 voffset=0 hscale=20

657: vscale=20 angle=-90}}

658: \put(6,5){\special{psfile=f10f.ps hoffset=0 voffset=0 hscale=20

659: vscale=20 angle=-90}}

660: \put(12,15){\special{psfile=f10g.ps hoffset=0 voffset=0 hscale=20

661: vscale=20 angle=-90}}

662: \put(12,10){\special{psfile=f10h.ps hoffset=0 voffset=0 hscale=20

663: vscale=20 angle=-90}}

664: \put(12,5){\special{psfile=f10i.ps hoffset=0 voffset=0 hscale=20

665: vscale=20 angle=-90}}

666: \end{picture}

667: \caption{

668:   Galaxy parameter estimation performance. For each of the nine APs we plot

669:   the predicted vs.\ true AP values for the test set. The red line indicates

670:   the line of perfect estimation. The summary errors are given in

671:   Table~\ref{t6}.}

672: \label{f10}

673: \end{figure*}

674:

675: \begin{table*}

676:  \centering

677:  \caption {Summary of the performance of the SVM regression models for

678: predicting the nine APs listed. The sample is for zero redshift but for

interstellar extinction ($A_{V}$) varying from 0 to 10\,mag. The second and

680: third columns list the mean and RMS errors respectively. The final column gives

681: the number of support vectors in the SVM model.}

682:  \begin{tabular}{l c c c}

683:  \hline\hline

684:

685: Astrophysical Parameter                                  & mean(real-predicted)/mean(real) & sd(real-predicted)/mean(real) & SVs   \\

686:  \hline

687: mass to light ratio (M/L)                                & -1.03e-2             & 3.78e-2            & 97    \\

688: normalized star formation rate (SFR)                     & -3.35e-3             & 3.97e-2            & 2285  \\

689: metallicity of interstellar medium (Mim)                 & -2.85e-3             & 8.77e-2            & 345   \\

690: metallicity of stars averaged on mass (Msm)              & -3.64e-4             & 2.17e-2            & 3544  \\

691: normalized mass of gas (Mgas)                            &  4.52e-3             & 4.29e-2            & 190   \\

692: normalized mass in stars (Ms)                            &  3.22e-4             & 5.48e-2            & 1639  \\

693: mean age of stars averaged on bolometric luminosity (Al) &  1.45e-3             & 3.22e-2            & 3566  \\

694: normalized SNIa rate (SNIa)                              &  9.69e-4             & 3.43e-2            & 376   \\

695: normalized SNII rate (SNII)                              & -6.04e-4             & 3.81e-2            & 2247  \\

696: \hline

697: \end{tabular}

698: \label{t6}

699: \end{table*}

700:

701: \subsection{Galaxies with redshift}

702:

703: \subsubsection{Regression of redshift and classification of morphological type}

704:

705: We now enlarge the subset of the library we used in the previous tests by

706: adding the same galaxies at four nonzero values of redshift, specifically

707: 0.05, 0.1, 0.15, 0.2. The library for z=0 includes 9691 galaxies as described above.

For each nonzero redshift there are 9757 galaxies,
giving a total sample of 48\,719 galaxies. (Recall that this includes each
galaxy simulated at 11 regular values of $A_{V}$.) We now build another
morphological type classification model as in Section~\ref{classify}, now
with 6719 galaxies in the training set and 42\,000 in the test set.

713:

714: We again applied a PCA to the data. This time the first three Principal

715: Components describe 76.01\%, 21.63\% and 1.02\% of the data variance

respectively (i.e.\ 98.66\% together), very similar to before. The
corresponding PCA projection is shown in Fig.~\ref{f11}. Compared with Fig.~\ref{f9}, we
can see how the redshift spreads out the previous loci of types.

719: The performance of the SVM classifier is summarized in Tables \ref{t7} and

720: \ref{t8}. The performance is good considering the added complexity introduced

721: by the redshift variations (and the corresponding increase in the sample

722: size). The misclassification errors are 0.13\% and 0.98\% corresponding to 9

723: and 411 galaxies for the training and the testing data respectively.
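
The sample sizes and error rates quoted above can be checked directly from the numbers given in the text and in Tables~\ref{t7} and \ref{t8}:
\begin{eqnarray}
9691 + 4 \times 9757 &=& 48\,719 = 6719 + 42\,000 , \\
9/6719 &\simeq& 0.13\%, \qquad 411/42\,000 \simeq 0.98\% .
\end{eqnarray}
Of the 411 test-set errors in Table~\ref{t8}, 196 are confusions between Sb and Sbc ($113+83$) and 120 between Sd and Im ($50+70$), i.e.\ most errors occur between adjacent types.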

724:

725:

726: \begin{figure}

727:   \centering \includegraphics[width=6cm,angle=-90]{f11.ps}

 \caption{The 48\,719 simulated Gaia galaxy spectra with redshifts from 0 to 0.2

729: plotted as their projections onto the first three Principal Components. Black,

730: green, blue, light blue, magenta, yellow and red denote galaxies of type E, Sa,

731: Sb, Sbc, Sc, Sd and Im respectively.}

732: \label{f11}

733: \end{figure}

734:

735: \begin{table}

736:  \centering

737:  \caption {Galaxy classification with the SVM. The confusion matrix for the

training set for galaxies at $z=0.0$, 0.05, 0.1, 0.15, 0.2. Columns indicate the
true classes, rows the predicted ones.}

740:  \begin{tabular}{l | c c c c c c c}

741:  \hline\hline

742:

743: Type & E-S0 & Sa & Sb & Sbc & Sc & Sd & Im \\

744:  \hline

745: E-S0 & 2512 & 0    & 0  & 0   & 0    & 0   & 0   \\

746: Sa   & 0    & 1828 & 0  & 0   & 0    & 0   & 0   \\

747: Sb   & 0    & 0    & 74 & 2   & 0    & 0   & 0   \\

748: Sbc  & 0    & 0    & 1  & 183 & 1    & 0   & 0   \\

749: Sc   & 0    & 0    & 0  & 0   & 1115 & 0   & 0   \\

750: Sd   & 0    & 0    & 0  & 0   & 0    & 536 & 4   \\

751: Im   & 0    & 0    & 0  & 0   & 0    & 1   & 462 \\

752:

753: \hline

754: \end{tabular}

755: \label{t7}

756:  \centering

757:  \caption {As Table~\ref{t7} but for the test set.}

758:  \begin{tabular}{l | c c c c c c c}

759:  \hline\hline

760:

761: Type & E-S0 & Sa & Sb & Sbc & Sc & Sd & Im \\

762:  \hline

763: E-S0 & 15473 & 0     & 0   & 0    & 0    & 0    & 0    \\

764: Sa   & 0     & 11647 & 0   & 0    & 0    & 0    & 0    \\

765: Sb   & 17    & 0     & 344 & 113  & 0    & 0    & 0    \\

766: Sbc  & 0     & 0     & 83  & 1084 & 23   & 0    & 0    \\

767: Sc   & 0     & 0     & 8   & 39   & 6971 & 7    & 0    \\

768: Sd   & 0     & 0     & 0   & 0    & 1    & 3149 & 50   \\

769: Im   & 0     & 0     & 0   & 0    & 0    & 70   & 2921 \\

770:

771: \hline

772: \end{tabular}

773: \label{t8}

774: \end{table}

775:

776: In practice we may want to first reduce spectra to the rest frame, for which

we require an estimate of the redshift. Therefore, we also set up an SVM

778: regression model to predict redshift, using the same training and test sets.

779: The predicted values of redshift for each of the five true redshift values are

780: presented in Fig. \ref{f12}. We do not expect very good performance here,

because the SVM has to learn the effect of redshift from just five

782: different values.

783:

784: \begin{figure*}[t]

785: \setlength{\unitlength}{1cm}

786: \begin{picture}(18,10)

787: \put(0,10){\special{psfile=f12a.ps hoffset=0 voffset=0 hscale=20

788: vscale=20 angle=-90}}

789: \put(6,10){\special{psfile=f12b.ps hoffset=0 voffset=0 hscale=20

790: vscale=20 angle=-90}}

791: \put(12,10){\special{psfile=f12c.ps hoffset=0 voffset=0 hscale=20

792: vscale=20 angle=-90}}

793: \put(4,5){\special{psfile=f12d.ps hoffset=0 voffset=0 hscale=20

794: vscale=20 angle=-90}}

795: \put(10,5){\special{psfile=f12e.ps hoffset=0 voffset=0 hscale=20

796: vscale=20 angle=-90}}

797: \end{picture}

\caption{Distribution of the predicted redshift, shown separately for each of the five true values of redshift ($z=0$, 0.05, 0.1, 0.15 and 0.2).}

799: \label{f12}

800: \end{figure*}

801:

802:

803:

804: \section{Discussion and conclusion}

805:

806: We have used the P\'EGASE.2 galaxy evolution model and the observational data

807: from SDSS to create an extended grid of synthetic galaxy spectra. Using these

we have identified the relevant astrophysical parameters and the ranges
over which they provide realistic galaxy spectra of known morphological type.

810: This was done specifically by comparing the colours of our library spectra

811: with those synthesized from SDSS spectra. We found small deviations between

812: the two colour loci for redder galaxies -- where the ellipticals are found --

813: which might be due to the fact that SDSS spectra are obtained in a small aperture

814: (fibre diameter) while P\'EGASE spectra are representative of the whole

815: galaxy. We also see that the observed sample has a considerably larger spread

816: in the colour--colour diagram than the library spectra, which probably has

817: observational reasons (photometric errors) as well as theoretical ones

818: (insufficient cosmic variance in the galaxy models). That is, it may

819: partially reflect the complicated nature of galaxy formation and evolution,

820: although the overall agreement between the two is good.

821:

822: To achieve a better agreement between the observational and

823: synthesized libraries we will further investigate the influence of the

824: various P\'EGASE.2 parameters, especially those that were kept constant for

825: this release of the library. On the other hand, due to the narrow redshift

826: range ($z < 0.2$) explored here, evolution factors are minimized. At higher

827: redshifts, synthetic spectra will be computed by simultaneously applying

828: cosmological k-corrections and evolution e-corrections to z=0 templates.

829:

830: Among the existing libraries of observed spectra, the most complete

831: and homogeneous is the SDSS, since it covers a significant part of the whole

832: sky and it goes fainter than the expected detection limit of Gaia. We

833: therefore aim to produce a suitable set of synthetic spectra covering as

834: much as possible of the SDSS colour range and we plan further comparisons in

835: our future work.

836:

Adding phenomena such as galaxy mergers is a challenging
prospect, but we believe that, at the low redshifts Gaia will observe, this

839: is not such an important or frequent mechanism of galaxy evolution. On the

840: other hand, starburst galaxies are more frequent at small redshifts and we

841: intend to enrich our library with this type of galaxy.

842:

843: First results of SVM for classification and parametrization of the library are

844: quite promising. In particular, the first indications are that Gaia will be

845: able to produce a wealth of information for a large statistical sample of

846: galaxies. After constructing a more complete library of spectra we will be

847: able to perform more tests and construct a classifier able to treat more

848: realistic and complete simulations of galaxy spectra.

849:

850: \section{Acknowledgments}

851: The authors (the Greek team) would like to thank the Greek General Secretariat

852: of Research and Technology (GSRT) for financial support.

853:

854: P. Tsalmantza would also like to thank the Max-Planck-Institut f\"ur

855: Astronomie (MPIA) and Institut d'Astrophysique de Paris (IAP) for their

856: support and hospitality.

857:

858: Funding for the Sloan Digital Sky Survey (SDSS) has been provided by the Alfred

859: P. Sloan Foundation, the Participating Institutions, the National Aeronautics

860: and Space Administration, the National Science Foundation, the U.S. Department

861: of Energy, the Japanese Monbukagakusho, and the Max Planck Society. The SDSS

862: Web site is http://www.sdss.org/.

863:

864: The SDSS is managed by the Astrophysical Research Consortium (ARC) for the

865: Participating Institutions. The Participating Institutions are The University

866: of Chicago, Fermilab, the Institute for Advanced Study, the Japan Participation

867: Group, The Johns Hopkins University, the Korean Scientist Group, Los Alamos

868: National Laboratory, the Max-Planck-Institute for Astronomy (MPIA), the

869: Max-Planck-Institute for Astrophysics (MPA), New Mexico State University,

870: University of Pittsburgh, University of Portsmouth, Princeton University, the

871: United States Naval Observatory, and the University of Washington.

872:

873: \begin{thebibliography}{22}

874: \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi

875:

876: \bibitem[{{Armand} \& {Milliard}(1994)}]{FOCA2000}

877: {Armand}, C. \& {Milliard}, B. 1994, \aap, 282, 1

878:

879: \bibitem[{Bennett \& Campbell(2000)}]{bennett00}

880: Bennett, K.~P. \& Campbell, C. 2000, SIGKDD Explor. Newsl., 2, 1

881:

882: \bibitem[{{Brown}(2006)}]{brown}

883: {Brown}, A. G.~A. 2006, Gaia Technical Report GAIA-C8-SP-LEI-AB-006-1

884:

885: \bibitem[{{Buat} {et~al.}(1999){Buat}, {Donas}, {Milliard}, \& {Xu}}]{buat}

886: {Buat}, V., {Donas}, J., {Milliard}, B., \& {Xu}, C. 1999, \aap, 352, 371

887:

888: \bibitem[{Burges(1998)}]{burges98}

889: Burges, C. J.~C. 1998, Data Mining and Knowledge Discovery, 2, 121

890:

891: \bibitem[{Chang \& Lin(2001)}]{libsvm}

892: Chang, C.-C. \& Lin, C.-J. 2001, {LIBSVM}: a library for support vector

  machines, software available at http://www.csie.ntu.edu.tw/\~{}cjlin/libsvm

894:

895: \bibitem[{{Fioc}(1997)}]{fioc3}

896: {Fioc}, M. 1997, PhD thesis, Universit{\'e} Paris XI,

897:   http://www.iap.fr/users/fioc.html

898:

899: \bibitem[{{Fioc}(1999)}]{fioc6}

900: {Fioc}, M. 1999, in Astronomical Society of the Pacific Conference Series, Vol.

901:   192, Spectrophotometric Dating of Stars and Galaxies, ed. I.~{Hubeny},

902:   S.~{Heap}, \& R.~{Cornett}, 299--+

903:

904: \bibitem[{{Fioc} \& {Rocca-Volmerange}(1997)}]{fioc2}

905: {Fioc}, M. \& {Rocca-Volmerange}, B. 1997, \aap, 326, 950

906:

907: \bibitem[{{Fioc} \& {Rocca-Volmerange}(1999{\natexlab{a}})}]{fioc1}

908: {Fioc}, M. \& {Rocca-Volmerange}, B. 1999{\natexlab{a}}, \aap, 351, 869

909:

910: \bibitem[{{Fioc} \& {Rocca-Volmerange}(1999{\natexlab{b}})}]{fioc4}

911: {Fioc}, M. \& {Rocca-Volmerange}, B. 1999{\natexlab{b}}, \aap, 344, 393

912:

913: \bibitem[{{Fioc} \& {Rocca-Volmerange}(1999{\natexlab{c}})}]{fioc5}

914: {Fioc}, M. \& {Rocca-Volmerange}, B. 1999{\natexlab{c}}, arXiv:astro-ph/9912179

915:

916: \bibitem[{{Fukugita} {et~al.}(1996){Fukugita}, {Ichikawa}, {Gunn}, {Doi},

917:   {Shimasaku}, \& {Schneider}}]{fukugita}

918: {Fukugita}, M., {Ichikawa}, T., {Gunn}, J.~E., {et~al.} 1996, \aj, 111, 1748

919:

920: \bibitem[{{Groenewegen} \& {de Jong}(1993)}]{groenewegen}

921: {Groenewegen}, M.~A.~T. \& {de Jong}, T. 1993, \aap, 267, 410

922:

923: \bibitem[{{Le Borgne} \& {Rocca-Volmerange}(2002)}]{le2}

924: {Le Borgne}, D. \& {Rocca-Volmerange}, B. 2002, \aap, 386, 446

925:

926: \bibitem[{{Le Borgne} {et~al.}(2004){Le Borgne}, {Rocca-Volmerange},

927:   {Prugniel}, {Lan{\c c}on}, {Fioc}, \& {Soubiran}}]{le1}

928: {Le Borgne}, D., {Rocca-Volmerange}, B., {Prugniel}, P., {et~al.} 2004, \aap,

929:   425, 881

930:

931: \bibitem[{{Rana} \& {Basu}(1992)}]{rana}

932: {Rana}, N.~C. \& {Basu}, S. 1992, \aap, 265, 499

933:

934: \bibitem[{{Rocca-Volmerange} {et~al.}(2007){Rocca-Volmerange}, {de Lapparent},

935:   {Seymour}, \& {Fioc}}]{rocca07}

936: {Rocca-Volmerange}, B., {de Lapparent}, V., {Seymour}, N., \& {Fioc}, M. 2007,

937:   arXiv:0705.2031

938:

939: \bibitem[{{Rocca-Volmerange} {et~al.}(2004){Rocca-Volmerange}, {Le Borgne}, {De

940:   Breuck}, {Fioc}, \& {Moy}}]{rocca}

941: {Rocca-Volmerange}, B., {Le Borgne}, D., {De Breuck}, C., {Fioc}, M., \& {Moy},

942:   E. 2004, \aap, 415, 931

943:

944: \bibitem[{Vapnik(1995)}]{vapnik95}

945: Vapnik, V.~N. 1995, The nature of statistical learning theory (Springer)

946:

947: \bibitem[{{Williams} {et~al.}(1996){Williams}, {Blacker}, {Dickinson}, {Dixon},

948:   {Ferguson}, {Fruchter}, {Giavalisco}, {Gilliland}, {Heyer}, {Katsanis},

949:   {Levay}, {Lucas}, {McElroy}, {Petro}, {Postman}, {Adorf}, \&

950:   {Hook}}]{williams}

951: {Williams}, R.~E., {Blacker}, B., {Dickinson}, M., {et~al.} 1996, \aj, 112,

952:   1335

953:

954: \bibitem[{{Woosley} \& {Weaver}(1995)}]{woosley}

955: {Woosley}, S.~E. \& {Weaver}, T.~A. 1995, \apjs, 101, 181

956:

957: \end{thebibliography}

958:

959: \clearpage

960: \appendix

961: \section{Support Vector Machines}

962:

963: Support Vector Machines (SVMs) \citep{vapnik95} are supervised machine

964: learning methods for data classification. In their basic form they achieve a

965: linear classification between two classes by defining an optimal hyperplane which

966: separates members of the two classes. If the classes are separable then there

967: generally exists an infinite number of hyperplanes which achieve this.

968: The SVM optimal plane is defined as that plane which maximises the margin

969: between the opposing class members nearest to the boundary. That is, unlike

970: many other classifiers which use all of the data to define the boundary, SVMs

971: take the (arguably more reasonable) approach of using just those points nearest

972: to the boundary. It has been demonstrated that this gives rise to a

973: more robust and more accurate classifier under general conditions.
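
In the standard formulation (the symbols $w$ and $b$ are introduced here and are not used elsewhere in this paper), the separating hyperplane is $w \cdot x + b = 0$ and, for separable classes labelled $y_i = \pm 1$, the maximum-margin plane is the solution of
\begin{equation}
\min_{w,b} \; \frac{1}{2}\|w\|^2 \quad \mbox{subject to} \quad
y_i\,(w \cdot x_i + b) \geq 1 \;\; \mbox{for all } i ,
\end{equation}
for which the margin between the two classes has width $2/\|w\|$. Only the points for which the constraint holds with equality determine the solution.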

974:

975: In most non-trivial problems, however, the classes are not linearly separable.

976: In these cases, just those points which lie on the wrong side of the

977: hyperplane -- the so-called support vectors -- enter into the total

978: classification error. By minimizing this error -- which also measures the

979: distance of the support vectors from the plane -- we define the optimal

980: separating plane, i.e.\ with the fewest misclassifications (and preferentially

981: of those which lie closer to the plane).

982:

983: In the general case, the classes are not even marginally linearly separable

984: (consider the XOR problem) so a linear classifier, no matter how optimal, is

985: useless. SVMs address this issue by using kernels to project the data into a

986: higher dimensional space. For example, with a polynomial kernel we take

987: square, cubic etc.\ combinations of the original data to form additional

988: dimensions and then apply the (linear) SVM classifier in this higher

989: dimensional space. With many other kernels, however, this projection is only

990: carried out implicitly. This approach can be thought of as nonlinearity by

991: preprocessing, with the kernel overcoming the well known ``curse of

992: dimensionality''. In the present work we use the radial basis kernel

993: \begin{equation}

K(x_i - x_j)=\exp\left(-\gamma \|x_i-x_j\|^{2}\right)

995: \label{SVM_kernel}

996: \end{equation}

997: where $x_i$ and $x_j$ are two input vectors (e.g.\ spectra). The

classification of a new vector $x_j$ is then given by a function
\begin{equation}
f(x_j) = \sum_{i=1}^{N} y_i \alpha_i K(x_i - x_j)
\label{SVM_model}
\end{equation}
where $y_i \in \{-1,+1\}$ denotes the two classes, and a classification is made

1004: by applying a threshold, e.g.\ $f(x_j) > 0.0 \Rightarrow$ class 1. The

1005: ${\alpha_i}$ are the parameters of the model which are determined by the model

1006: training ($i$ counts over the $N$ support vectors). SVMs have a very important

1007: property, namely that the error function is strictly convex, so it

1008: has a unique global solution which can be found in polynomial time with

1009: standard optimizers (it is a linearly constrained quadratic programming

1010: problem).
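
As a minimal sketch of how the two equations above combine, the following Python fragment evaluates the radial basis kernel and the resulting decision function for a new vector. The support vectors, labels, $\alpha_i$ values and $\gamma$ are made-up toy numbers, and the function names are illustrative. The sketch follows equation~\ref{SVM_model} as written; practical implementations typically also add a constant offset to $f$.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """Radial basis kernel exp(-gamma * ||x_i - x_j||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


def decision_value(x_new, support_vectors, labels, alphas, gamma):
    """Sum over support vectors of y_i * alpha_i * K(x_i - x_new)."""
    return sum(
        y * alpha * rbf_kernel(sv, x_new, gamma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )


def classify(x_new, support_vectors, labels, alphas, gamma):
    """Threshold the decision value at zero: class +1 or class -1."""
    value = decision_value(x_new, support_vectors, labels, alphas, gamma)
    return 1 if value > 0.0 else -1


# Toy two-dimensional example with made-up support vectors and weights.
SUPPORT_VECTORS = [(0.0, 0.0), (2.0, 2.0)]
LABELS = [+1, -1]
ALPHAS = [1.0, 1.0]
GAMMA = 0.5

print(classify((0.1, 0.2), SUPPORT_VECTORS, LABELS, ALPHAS, GAMMA))  # 1
print(classify((1.9, 2.1), SUPPORT_VECTORS, LABELS, ALPHAS, GAMMA))  # -1
```

Each new vector is assigned to the class of the support vector it lies closest to, because the kernel value falls off rapidly with distance; a larger $\gamma$ makes this fall-off steeper.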

1011:

1012: This is in marked contrast to neural networks, for example, in which the optimizers

1013: converge on a local minimum and we can only be guaranteed to find

1014: the global minimum via an exhaustive search. Furthermore, with a sigmoidal

1015: kernel SVMs are equivalent to neural networks but with the additional

1016: advantage that the SVM automatically determines the neural network

architecture (number of weights).

1018:

The SVM model incorporates regularization via the specification of a
hyperparameter, $C$, which sets the penalty on data vectors that fall inside
the margin around the separating hyperplane or on the wrong side of it. A
larger $C$ leads to a narrower margin, into which fewer data vectors
fall. Those that do are all considered support vectors and so all enter the

1023: error equation. Thus with a larger $C$ there is a higher penalty attached to

1024: errors, i.e.\ less regularization.\footnote{$C$ is actually the upper bound on

1025: $\alpha_i$, specifically $0 \leq \alpha_i \leq C$ and $\sum_i \alpha_i y_i =

1026: 0$ (two of the constraints in the error minimization). Thus a small $C$

1027: implies smaller $\alpha_i$ in equation~\ref{SVM_model} which in turn implies

1028: smoother functions equivalent to more regularization.}
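
For reference, the role of $C$ is explicit in the standard soft-margin formulation. In notation introduced here ($w$, $b$, the slack variables $\xi_i$ and the mapping $\phi$ do not appear elsewhere in this paper), the classifier is found from
\begin{equation}
\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \mbox{subject to} \quad
y_i\,(w \cdot \phi(x_i) + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0 ,
\end{equation}
where $\phi$ is the (implicit) projection into the higher dimensional space defined by the kernel. The first term favours a wide margin and the second penalizes vectors which violate it, so $C$ sets the relative weight of the two. In the dual problem this appears as the box constraint $0 \leq \alpha_i \leq C$ given in the footnote.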

1029:

1030: The other hyperparameter in the model is $\gamma$ (equation~\ref{SVM_kernel}).

1031: Both $\gamma$ and $C$ must be determined by the user. Prior information may

1032: help, but in practice one carries out a rigorous search over a two-dimensional

1033: grid to ``tune'' the SVM. We did this using 4-fold cross validation,

1034: iterating over grids of increasing density.
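
The tuning procedure can be sketched in Python as follows. This is a schematic outline under stated assumptions, not the code used for this paper: the base-2 grid spacing, the grid ranges, and all function names are hypothetical, and \verb+toy_error+ is a stand-in for actually training and testing an SVM on each fold.

```python
import math
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k shuffled, nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def log_grid(low_exp, high_exp, step):
    """Values 2**e for e = low_exp, low_exp + step, ..., up to high_exp."""
    values = []
    e = low_exp
    while e <= high_exp + 1e-9:
        values.append(2.0 ** e)
        e += step
    return values


def grid_search(cv_error, c_values, gamma_values, folds):
    """Return (error, C, gamma) with the lowest mean cross-validation error.

    cv_error(C, gamma, train_idx, test_idx) is a user-supplied (hypothetical)
    function that trains on train_idx and returns the error on test_idx.
    """
    best = None
    for c in c_values:
        for gamma in gamma_values:
            fold_errors = []
            for i, test_idx in enumerate(folds):
                train_idx = [j for m, fold in enumerate(folds) if m != i
                             for j in fold]
                fold_errors.append(cv_error(c, gamma, train_idx, test_idx))
            mean_error = sum(fold_errors) / len(fold_errors)
            if best is None or mean_error < best[0]:
                best = (mean_error, c, gamma)
    return best


def toy_error(c, gamma, train_idx, test_idx):
    """Stand-in for an SVM fit: smallest at C = 8, gamma = 0.125."""
    return (math.log2(c) - 3) ** 2 + (math.log2(gamma) + 3) ** 2


folds = k_fold_indices(100, 4)  # 4-fold cross validation
# Coarse grid first, then a denser grid around the coarse optimum.
coarse = grid_search(toy_error, log_grid(-5, 15, 2), log_grid(-15, 3, 2), folds)
fine = grid_search(toy_error, log_grid(2, 4, 0.5), log_grid(-4, -2, 0.5), folds)
print(coarse[1:], fine[1:])
```

With the toy error surface both passes should recover $C=8$ and $\gamma=0.125$. In a real application the second grid would be centred on whatever optimum the first pass returns, and the process repeated until the cross-validation error no longer improves.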

1035:

1036: SVMs can also be used for regression. Instead of a hyperplane and a margin

1037: about it, regression SVMs fit a line with a tube of radius $\epsilon$

1038: encompassing it. Data vectors which are less than a distance $\epsilon$ from

1039: the line are considered to be correctly fit, that is, the support vectors are

1040: only those points outside of the tube. Thus the $\epsilon$

1041: hyperparameter controls the degree of regularization. The specific error

1042: function we use is the mean squared error on the predictions, with

1043: the regularization again being introduced via the constraints in the

1044: optimization (with Lagrangian multipliers). All of the kernel and optimization

1045: machinery applies equally to these models, so that nonlinear regression can

1046: also be achieved.
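
For reference, in the usual $\epsilon$-regression formulation, such as that implemented in libsvm, residuals inside the tube carry no penalty. In notation introduced here, with $t_i$ the target value for input $x_i$, the loss for a single point is
\begin{equation}
L_\epsilon(t_i, f(x_i)) = \max\left(0, \; |t_i - f(x_i)| - \epsilon\right) ,
\end{equation}
which is zero for $|t_i - f(x_i)| \leq \epsilon$ and grows linearly outside the tube. This is given as a sketch of the standard approach; it is not a statement of the exact error function used for the results in this paper.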

1047:

1048: \end{document}

1049: