1: %\documentclass[referee]{aa} % for a referee version
2: %\documentclass[onecolumn]{aa} % for a paper on 1 column
3: %\documentclass[longauth]{aa} % for the long lists of affiliations
4: %\documentclass[rnote]{aa} % for the research notes
5: %
6: \documentclass{aa}
7: %
8: \usepackage{graphicx,supertabular}
9: \usepackage{txfonts}
10: \usepackage{natbib}
11: \citestyle{aa}
12:
13: \begin{document}
14: %
15: \title{Towards a library of synthetic galaxy spectra and preliminary results of classification and parametrization of unresolved galaxies for Gaia}
16:
17: \author{P. Tsalmantza\inst{1}
18: \and M. Kontizas\inst{1}
19: \and C. A. L. Bailer-Jones\inst{2}
20: \and B. Rocca-Volmerange\inst{3,4}
21: \and R. Korakitis\inst{5}
22: \and E. Kontizas\inst{6}
23: \and E. Livanou\inst{1}
24: \and A. Dapergolas\inst{6}
25: \and I. Bellas-Velidis\inst{6}
26: \and A. Vallenari\inst{7}
27: \and M. Fioc\inst{3,8}}
28:
29: \offprints{P. Tsalmantza\\
30: \email{vivitsal@phys.uoa.gr}}
31:
\institute{Department of Astrophysics, Astronomy \& Mechanics, Faculty
of Physics, University of Athens, GR-15783 Athens, Greece
34: \and
35: Max-Planck-Institut f\"ur Astronomie, K\"onigstuhl 17, 69117 Heidelberg, Germany
36: \and
37: Institut d'Astrophysique de Paris, 98bis Bd Arago, 75014 Paris, France
38: \and
39: Universit\'e de Paris-Sud XI, I.A.S., 91405 Orsay Cedex, France
40: \and
41: Dionysos Satellite Observatory, National Technical University of Athens, 15780 Athens, Greece
42: \and
43: IAA, National Observatory of Athens, P.O. Box 20048, GR-118 10 Athens, Greece
44: \and
45: INAF, Padova Observatory, Vicolo dell'Osservatorio 5, 35122 Padova, Italy
46: \and
47: Universit\'e Pierre et Marie Curie, 4 place Jussieu, 75005 Paris, France}
48:
49:
50: \date{Received date / accepted}
51:
52:
53: % \abstract{}{}{}{}{}
54: % 5 {} token are mandatory
55:
56: \abstract
57: % context heading (optional)
58: {} %leave it empty if necessary
59: % aims heading (mandatory)
60: {The Gaia astrometric survey mission will, as a consequence of its scanning
61: law, obtain low resolution optical (330--1000\,nm) spectrophotometry of
62: several million unresolved galaxies brighter than V=22. We present the
63: first steps in a project to design and implement a classification system
64: for these data. The goal is both to determine morphological classes and
65: to estimate intrinsic astrophysical parameters via synthetic templates.
66: Here we describe (1) a new library of synthetic galaxy spectra, and (2)
67: first results of classification and parametrization experiments using
68: simulated Gaia spectrophotometry of this library.}
69: % methods heading (mandatory)
70: {We have created a large grid of synthetic galaxy spectra using the
71: P\'EGASE.2 code, which is based on galaxy evolution models that take into
account metallicity evolution, extinction correction, and emission lines
73: (with stellar spectra based on the BaSeL library). Our classification and
74: regression models are Support Vector Machines (SVMs), which are kernel-based
75: nonlinear estimators.}
76: % results heading (mandatory)
{We produce a basic library of about 3600 zero-redshift galaxy spectra covering
the main Hubble types over the wavelength range 250 to 1050\,nm at a sampling
79: of 1\,nm or less. It is computed on a regular grid of four key
80: astrophysical parameters for each type and for intermediate random values
81: of the same parameters. An extended library reproduces this at a series
82: of redshifts. Initial results from the SVM classifiers and parametrizers are
83: promising, indicating that Hubble types can be reliably predicted and
84: several parameters estimated with low bias and variance. Comparing the colours
85: of our synthetic library with Sloan Digital Sky Survey (SDSS) spectra we
86: find good agreement over the full range of Hubble types and parameters.}
87: % conclusions heading (optional), leave it empty if necessary
88: {}
89:
90: \keywords{-- Galaxies: fundamental parameters -- Techniques: photometric --
91: Techniques: spectroscopic}
92:
93: \maketitle
94:
95: \section{Introduction}
96: Large surveys of galaxies provide information on their global spatial
97: distribution and the physical properties of individual galaxies. Such a survey
will be obtained for the whole sky by the ESA mission Gaia, from 2011--2016.
During its five-year mission Gaia will observe several million unresolved
galaxies over the whole sky. Although the survey's main goal is the
stellar content and the structure of our Galaxy, there remains a lot of
important science to be extracted from the galactic component.
103:
There currently exist several surveys of galaxies, but even SDSS -- one of the
most extensive galaxy photometric and spectroscopic surveys in the optical
and near-IR (roughly the spectral range of Gaia) -- covers only a fifth of
the sky. Gaia extends this in several ways: i) It will be able to detect about
108: 10$^7$ unresolved galaxies down to G=20 (V=20--22); ii) Gaia will be the first
109: homogeneous survey of galaxies covering the whole sky since photographic ones
110: (UK, ESO, Palomar Schmidt surveys, 3500 to 6500\,\AA) of 30 years ago; iii) The
111: spectrophotometry covers a larger spectral range (3300 to 10\,000\,\AA\
112: sampled in about 100 bins) than earlier surveys; iv) Gaia observes each source
113: an average of 80 times over the mission. With this we can investigate many
114: different types of galaxy, QSO and AGN variability; v) The sample will have a
115: well-defined selection function, important for estimating the galaxy density
116: in the local universe.
117:
118: Our long-term objective is to classify and to determine the astrophysical
119: parameters of all unresolved galaxies which Gaia will observe. In order to
120: proceed with this we first need to acquire or build an appropriate library of
121: galaxy spectra. This library must show sufficient variation in those
122: intrinsic astrophysical parameters (APs) to which the Gaia observations will
123: be sensitive. To determine APs on a homogeneous system we ultimately need to
124: build or calibrate our classifiers using synthesis models and synthetic
spectra. Existing observed or synthetic libraries are too small or do not
cover the required wavelength range. For this reason we set out in this paper
to start building a new library.
128:
We use the galaxy evolution model P\'EGASE (Projet d'\'Etude des GAlaxies
par Synth\`ese \'Evolutive) \citep{fioc2,fioc5} to synthesize galaxy spectra. The
131: P\'EGASE.2 code\footnote{http://www2.iap.fr/users/fioc/PEGASE.html} is aimed
132: principally at modelling the spectral evolution of galaxies by types: the
133: active and passive evolution of stellar populations as well as interstellar gas
134: and dust are coherently evolved in time. No galaxy number density evolution is
135: considered, although the results of our models are compatible with occasional
136: rare galaxy merging. The code is based on the stellar evolutionary tracks from
137: the Padova group, extended to the thermally pulsating asymptotic giant branch
138: (AGB) and post-AGB phases \citep{groenewegen}. These tracks cover all the
masses, metallicities and phases of interest for galaxy spectral synthesis.
140: P\'EGASE.2 uses the BaSeL 2.2 library of stellar spectra and can synthesize low
141: resolution (R=200) ultraviolet to near-infrared spectra of Hubble sequence
142: galaxies, as well as of starbursts. For a given evolutionary scenario
143: (typically characterized by a star formation law, an initial mass function and,
144: possibly, infall or galactic winds), the code consistently gives the spectral
145: energy distribution (SED) and computes the star formation rate and the
146: metallicity at any time. The nebular component (continuum and lines) due to HII
147: regions is calculated and added to the stellar component. Depending on the
148: geometry of the galaxy (disk or spheroidal), the attenuation of the spectrum by
149: dust is then computed using a radiative transfer code (which takes account of
150: the scattering).
151:
By adopting a star formation rate proportional to the mass of the gas, the IMF of
153: \citet{rana} and the presence of infall and galactic winds, eight synthetic
154: spectra corresponding to different typical types of Hubble sequence galaxies
155: (E, S0, Sa, Sb, Sbc, Sc, Sd and Im) have already been produced using P\'EGASE.2
156: \citep{fioc3,fioc1,le2}. For each type, the values of the parameter set have
157: been fitted to the observed spectral energy distribution (SED) of nearby (z=0)
158: galaxies. For illustration a comparison with data is shown in \citet{fioc6}. At
159: higher redshifts, the evolution scenarios have been tested against most
160: existing faint galaxy samples, including the deepest surveys \citep[B=29 Hubble
161: Deep Field-N,][]{williams}. One unique model of galaxy fractions by type
162: simultaneously predicts the multi-wavelength (UV to near-IR) galaxy counts,
163: dominated by young stellar populations in the UV and old evolved galaxies in
164: the near-IR respectively. The faint blue galaxy population, in excess in the
165: far-UV, has also been analysed \citep{fioc4}. An episodic star formation rate
166: of low level is proposed to fit the far-UV counts \citep{FOCA2000,buat}. In the
167: near-IR, the evolution scenario of elliptical galaxies predicts the puzzling
168: $K$-$z$ relation of radio galaxy hosts between z=0 and z=4. \citet{rocca} use
169: P\'EGASE.2 scenarios to interpret the galaxy distribution in the K-band Hubble
170: diagram. The same models are used to interpret the mid-IR galaxy counts \citep{rocca07},
171: although here a supplementary ultra-luminous infrared galaxy population is
172: required. Finally, the robustness of our evolution scenarios is confirmed by
173: the significant predictions of photometric redshifts as compared to
spectroscopic redshifts of the HDF-N sample \citep{le2}. Using a much larger sample
175: from the SDSS, we make an additional comparison. This is the subject of the
176: second section of this paper, made using simulated photometry and colour-colour
177: diagrams. In section 3 we describe the production of our library based on
178: these eight typical synthetic spectra of galaxies and in section 4 we explain
179: how these are used to simulate Gaia data. In section 5 we present our
180: classification and parametrization models and give preliminary results on their
181: performance. A brief discussion follows in section 6.
182:
183:
184: \section{P\'EGASE synthetic spectra and comparison with the SDSS spectra}
185:
186: In order to determine the parameter ranges over which we should generate the
187: library, we first make a comparison of colours synthesized from the eight
188: typical P\'EGASE spectra with SDSS data. To avoid small discrepancies that
189: occur between synthesized and published SDSS
190: photometry\footnote{http://www.sdss.org/dr4/products/spectra/spectrophotometry.html}
191: and to treat both types of spectral data in the same way, we decided to
192: synthesize SDSS photometry from the SDSS spectra in the same way as we do with
193: the synthetic spectra (and using the same ``calib'' and ``colors'' programs in
194: the P\'EGASE.2 code for both). For this we use the whole set of spectroscopic
195: data for the 565\,715 galaxies that are available in data release 4 (DR4) of
196: SDSS. The properties of the SDSS filters are given in Table~\ref{t1}.
197:
198: \begin{table}
199: \centering
200: \caption {Characteristics of the five SDSS filters}
201: \begin{tabular}{c c c c c}
202: \hline\hline
Name & Average & Starting & Ending & Magnitude \\
 & wavelength & wavelength & wavelength & limit in \\
 & (\AA) & (\AA) & (\AA) & survey \\
206: \hline
207: u & 3551 & 2980 & 4130 & 22.0 \\
208: g & 4686 & 3630 & 5830 & 22.2 \\
209: r & 6165 & 5380 & 7230 & 22.2 \\
210: i & 7481 & 6430 & 8630 & 21.3 \\
211: z & 8931 & 7730 & 11230 & 20.5 \\
212: \hline
213: \end{tabular}
214: \label{t1}
215: \end{table}
216:
217: Typical synthetic spectra corresponding to each of the eight Hubble types are
218: shown in Fig. \ref{f1}, with the location of the SDSS filters superimposed.
Each of these ``typical spectra'' corresponds to a specific combination of
values of the astrophysical parameters (see section~\ref{most_signif}). The
SEDs produced by P\'EGASE have been normalized to the flux in a 50\,\AA\
wavelength interval centered on 5500\,\AA. The elliptical and S0 galaxies have
very small differences, apparent at the two extremes of the wavelength range.
224: This implies small differences in colours but not necessarily in magnitudes
225: (which depend on their masses).
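
As an illustration of this normalization, the following minimal Python sketch
rescales an SED by its mean flux in a 50\,\AA\ window centered on 5500\,\AA.
The function name \verb+normalize_sed+, the toy spectrum and the use of the
mean (rather than the integrated) flux in the window are assumptions made here
for illustration only; they are not taken from the P\'EGASE.2 code.

```python
def normalize_sed(wavelengths, fluxes, centre=5500.0, width=50.0):
    """Scale an SED so that its mean flux inside the window equals 1.

    Assumption: the reference is the mean flux of the samples that fall
    inside [centre - width/2, centre + width/2] (wavelengths in Angstrom).
    """
    lo, hi = centre - width / 2.0, centre + width / 2.0
    window = [f for w, f in zip(wavelengths, fluxes) if lo <= w <= hi]
    if not window:
        raise ValueError("no samples fall inside the normalization window")
    reference = sum(window) / len(window)
    return [f / reference for f in fluxes]


# Toy SED (hypothetical values) sampled every 10 Angstrom from 5450 to 5550.
wavelengths = [5450.0 + 10.0 * i for i in range(11)]
fluxes = [2.0 + 0.01 * (w - 5500.0) for w in wavelengths]
normalized = normalize_sed(wavelengths, fluxes)
```

Because every spectrum is divided by its own reference flux, only the shape of
the SED (and hence the colours) is compared, not the absolute flux level.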
226:
227: From Fig.~\ref{f1} it is obvious that the u filter is very important for the
228: comparison with real data since it is the one containing the discontinuity
around 4000\,\AA. However, the SDSS spectra do not cover the u band, so
230: photometry in this band cannot be synthesized. We refrain from using the SDSS
231: photometry for the u band because of the red leak in this
232: filter\footnote{http://www.sdss.org/dr4/products/images/index.html$\#$redleak},
233: which would render comparisons with synthetic data unreliable. This leak
234: produces erroneous magnitudes, especially for E and S0 types on
235: account of their large numbers of red stars.
236:
237: In addition we avoid using the z filter in our comparison since
238: its photometry also cannot be synthesized from the SDSS spectra, which terminate
239: at shorter wavelengths than the z passband.
240:
241: \begin{figure}
242: \centering
243: \includegraphics[width=6cm,angle=-90]{f1.ps}
244: \caption{
245: Synthetic spectra for the eight typical galaxy types from P\'EGASE.2. The
vertical lines denote the limits of the five SDSS filters (transmission
below $10^{-4}$ of the peak). Emission lines are not included. The legend at
the right defines the colour used to plot each type of galaxy (top) and SDSS
filter (bottom).}
250: \label{f1}
251: \end{figure}
252:
We therefore decided to base our comparison between the SDSS and P\'EGASE.2
data on the g, r, i filters only and, more specifically, on the g--r and r--i
colours. However, the wavelength range of the SDSS spectra does not quite
extend to the blue edge of the g filter. For this reason, we cut the blue
end of this filter and created a new g filter starting at 3830\,\AA\ instead of
3630\,\AA\ (Table~\ref{t1}). In practice this change is very small,
since the transmission of the g filter is only 3\% of the peak transmission at
3830\,\AA\ and drops very rapidly below that (e.g.\ it is only 0.5\% at just
10\,\AA\ lower). Furthermore, simulated photometry from the synthetic spectra
262: showed virtually no difference for the original and ``trimmed'' g band. The
263: published transmission curves of the SDSS filters depend on airmass and
264: whether a point or extended source is being observed.
265: We use those for extended sources and zero airmass. The
266: photometry is calibrated on the AB system, as used by SDSS \citep{fukugita}.
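
For reference, a magnitude on the AB system is conventionally defined from the
flux density $f_\nu$ (in erg\,s$^{-1}$\,cm$^{-2}$\,Hz$^{-1}$) and the filter
response $T(\nu)$ as
\[
m_{\rm AB} = -2.5\log_{10}\left[\frac{\int f_\nu\,T(\nu)\,{\rm d}\nu/\nu}
{\int T(\nu)\,{\rm d}\nu/\nu}\right] - 48.60 ,
\]
so that a colour such as g--r is simply the difference of two such magnitudes.
The symbols $f_\nu$ and $T(\nu)$ are introduced here for illustration only; the
numerical implementation in the P\'EGASE.2 programs may differ in detail.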
267:
268: We synthesize photometry using the one-dimensional spectra from DR4, which are
269: supplied with additional analysis information, such as redshift and emission
270: line parameters. In order to select data suitable for our purposes, we
271: applied the following criteria: the galaxies should not be near a CCD edge nor
saturated, and they should not have a very low signal-to-noise ratio (the photometric error in all
273: bands should be less than 0.1 mag). Only spectra with redshifts below 0.01 are
274: retained, since the synthetic spectra of P\'EGASE.2 were produced at zero
275: redshift. These criteria resulted in a sample of 1292 galaxies. Their
276: synthesized photometry plus that for the eight typical galaxy types from
277: P\'EGASE.2 is shown in Fig.~\ref{f2}. This figure clearly shows that the
278: colours of the Im, Sd, Sc, Sbc, Sb and Sa types are generally in good
279: agreement with the colours of the observed spectra, although in the case of
280: S0 and E types the synthetic spectra seem to be slightly redder in g$-$r than
281: the SDSS spectra.
282:
283:
284: \begin{figure}
285: \centering
286: \includegraphics[width=6cm,angle=-90]{f2.ps}
287: \caption{Colour--colour (g$-$r vs.\ r$-$i) diagram of synthesized photometry of SDSS galaxy spectra (black) and synthetic photometry of the eight typical galaxy types generated from the
288: P\'EGASE models (red points).}
289: \label{f2}
290: \end{figure}
291:
292:
293:
294: \section{The library of synthetic spectra}\label{library}
295:
296: \subsection{The most significant parameters}\label{most_signif}
297:
298: Each spectrum in our library is uniquely defined by a set of 17 astrophysical
299: parameters, plus the morphological type (E, S0, Sa, Sb, Sbc, Sc, Sd or Im).
The four most significant APs are: p1 and p2 of the star formation law
($M_{\rm gas}^{p_1}/p_2$); the infall timescale; and the age of the galactic winds. The
302: age of the galactic winds is non-zero only for E and S0 galaxies. Note that
303: the Hubble type is not an independent parameter, as only certain ranges of the
304: APs are available for each type (as will be detailed later).
305:
In order to investigate the influence of each of the parameters p1, p2 and
infall timescale on the integrated galaxy spectrum (SED), we modified the
308: parameters of the Sbc model (an intermediate type) over a range of values. In
309: the typical model for the Sbc type the values were 1, 5714 Myr/$M_{\odot}$ and 6000 Myr for
310: p1, p2 and infall timescale, respectively. In the modified models we vary p1
311: between 0.4 and 2, p2 from 100 to 20000 Myr/$M_{\odot}$ and infall from 100 to 10000 Myr. The
312: results are shown in Figs.~\ref{f3}--\ref{f5}.
313:
314: To investigate the effect of the age of the galactic winds parameter we followed
315: the same procedure but now with the elliptical model. In the typical model for
316: the E type the age is 1~Gyr and we vary it between 0.1 and 7.5 Gyr
317: (Fig.~\ref{f6}).
318:
319: From the figures we see that these four parameters have a major effect on the
320: colours. We performed similar analyses for other APs and concluded that they
321: had a much smaller impact on the data (in particular once the spectra are
322: reduced to the Gaia resolution). Therefore, the spectra in the present library
323: show variance only in these four APs.
324:
325: \begin{figure}
326: \centering
327: \includegraphics[width=6cm,angle=-90]{f3.ps}
328: \caption{Colour--colour (g--r vs r--i) diagram of synthesized photometry of SDSS galaxy spectra (black) and of synthetic P\'EGASE
329: spectra of the typical Sbc model (yellow) and the models of Sbc with
330: different values of p1 (red). The largest g--r corresponds to p1=1 and the
331: smallest g--r to p1=2.}
332: \label{f3}
333: \end{figure}
334:
335: \begin{figure}
336: \centering
337: \includegraphics[width=6cm,angle=-90]{f4.ps}
\caption{Colour--colour (g--r vs r--i) diagram of synthesized photometry of SDSS galaxy spectra (black) and of synthetic P\'EGASE
339: spectra of the typical Sbc model (yellow) and the models of Sbc with
340: different values of p2 (red). The largest g--r corresponds to p2=2000 Myr/$M_{\odot}$ and the
341: smallest g--r to p2=20000 Myr/$M_{\odot}$.}
342: \label{f4}
343: \end{figure}
344:
345: \begin{figure}
346: \centering
347: \includegraphics[width=6cm,angle=-90]{f5.ps}
348: \caption{Colour--colour (g--r vs r--i) diagram of synthesized photometry of SDSS galaxy spectra (black) and of synthetic P\'EGASE
349: spectra of the typical Sbc model (yellow) and the models of Sbc with different values of infall timescale (red). The largest
g--r corresponds to infall timescale=100 Myr and the smallest g--r to infall timescale=10 Gyr.}
351: \label{f5}
352: \end{figure}
353:
354: \begin{figure}
355: \centering
356: \includegraphics[width=6cm,angle=-90]{f6.ps}
357: \caption{Colour--colour (g--r vs r--i) diagram of synthesized photometry of SDSS galaxy spectra (black) and of synthetic P\'EGASE
358: spectra of the typical E model (yellow) and the models of E with different values of age of galactic winds (red). The
largest g--r corresponds to age of galactic winds=7.5 Gyr and the smallest g--r to 0.1 Gyr.}
360: \label{f6}
361: \end{figure}
362:
363: By co-varying these four parameters and using all their combinations in each of
364: the eight typical models we are able to cover most of the variance we see in
365: the SDSS data in the colour--colour diagram. Generally, there is no clear
distinction between the colours of neighbouring Hubble types. In order to
assign a known type to each spectrum in our library we decided (as a first working approximation) to retain only
those models which lie within a circle (in the colour--colour diagram)
centered on one of the eight typical types and with a radius equal to half of
the distance to the nearest neighbouring typical model. This is reasonable
since the models lie mostly along a one-dimensional locus (a line) in the
colour--colour diagram. In this way upper and lower limits of the values of
the parameters were established for each type, although
an overlap in APs (if not in colours) remains, as can be seen in Table~\ref{t2}.
375: This leaves a set of 888 synthetic spectra of known types of
376: galaxies (see section~\ref{regular_grid}).
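
The circle criterion described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function name
\verb+assign_types+ and the colour values are hypothetical, and the sketch is
not the code used to build the library.

```python
import math


def assign_types(typical, models):
    """Assign a Hubble type to each model colour pair, or None.

    typical: dict mapping type name -> (g_r, r_i) of the typical model.
    models:  list of (g_r, r_i) colours of candidate models.
    A model is retained for a type only if it lies strictly inside the
    circle centred on that typical model whose radius is half the
    distance to the nearest neighbouring typical model.
    """
    radii = {}
    for name, (x, y) in typical.items():
        nearest = min(
            math.hypot(x - x2, y - y2)
            for other, (x2, y2) in typical.items()
            if other != name
        )
        radii[name] = 0.5 * nearest

    assigned = []
    for x, y in models:
        label = None
        for name, (xt, yt) in typical.items():
            if math.hypot(x - xt, y - yt) < radii[name]:
                label = name
                break
        assigned.append(label)
    return assigned


# Hypothetical colours, chosen only to exercise the function.
typical = {"Sa": (0.70, 0.35), "Sb": (0.60, 0.30), "Sbc": (0.50, 0.25)}
models = [(0.61, 0.31), (0.65, 0.40), (0.51, 0.26)]
labels = assign_types(typical, models)
```

Since each radius is at most half the distance between two typical models, the
circles cannot overlap, so a retained model receives exactly one type; models
falling between circles are discarded (\verb+None+ in the sketch).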
377:
The galaxy type can be considered as a fifth AP, although it is of a different
nature from the others, since it is needed to fully specify the spectrum and
380: constrain the range of values of the other four APs. In addition, when one
381: redshifts the spectrum to non-zero values of z, this quantity also becomes a
382: parameter (albeit not intrinsic to the source).
383:
384: \subsection{Library of galaxy spectra over a regular grid of parameters}\label{regular_grid}
385:
386: Applying the above procedures, we produced a library of 888 synthetic spectra
387: covering seven separate Hubble types (because we consider E and S0 as a single
type). The values of the four parameters of each type are given in
Table~\ref{t2}, while the values of the other input parameters of P\'EGASE.2 (kept
constant in all models) are given in Table~\ref{t3}. The models are plotted
391: in Fig.~\ref{f7}, where the simulated colours of the 888 synthetic spectra and
392: the 1292 SDSS spectra are compared. This first set of 888 synthetic spectra
393: was then calculated at five values of redshift: 0, 0.05, 0.1, 0.15, 0.2,
394: resulting in a total of 4440 spectra.
395:
396:
397: \begin{table*}
398: \centering
399: \caption {The four astrophysical parameter (AP) ranges for each Hubble type in
400: the regular library of P\'EGASE synthetic spectra. Note that the AP ranges for
401: each Hubble type partially overlap. The morphological type can be considered as
402: an additional (but non-independent) parameter, required to fully explain the
403: variance in the library. The final column (N) gives the number of spectra for
404: each type (which sum to 888). See the regular library grid in \citet{le2} for
405: comparison.}
406: \begin{tabular}{c c c c c c}
407: \hline\hline
408:
Type & p1 & p2 & infall & galactic winds & N \\
 & & (Myr/$M_{\odot}$) & (Myr) & (Gyr) & \\
\hline
E-S0 & 0.6--1.5 & 100--1500 & 100--2500 & 0.1--7.5 & 327\\
Sa & 0.8--1.5 & 500--2500 & 2500--3500 & none & 10 \\
Sb & 0.6--1.5 & 1500--6000 & 2000--4500 & none & 25 \\
Sbc & 0.4--1.5 & 2000--10000 & 4000--7000 & none & 148\\
Sc & 0.6--1.5 & 6000--14000 & 7000--10000 & none & 68 \\
Sd & 0.4--1.5 & 10000--18000 & 7000--10000 & none & 65 \\
Im & 1.0--2.0 & 14000--20000 & 7000--10000 & none & 245\\
419: \hline
420: \end{tabular}
421: \label{t2}
422: \end{table*}
423: \begin{table*}
424: \centering
425: \caption {The values of the parameters of the P\'EGASE models which are kept
426: constant in the library \citep{fioc2}.}
427: \begin{tabular}{c c}
428: \hline\hline
429: Parameters & Values \\
430: \hline
431: SNII Ejecta of massive stars & model B of \citet{woosley}\\
432: Stellar winds & yes\\
433: Initial mass function & \citet{rana}\\
434: Lower mass & 0.09 solar masses \\
435: Upper mass & 120.00 solar masses\\
436: Fraction of close binary systems & 0.05\\
437: Initial metallicity & 0.00 \\
438: Metallicity of the infalling gas & 0.00\\
439: Consistent evolution of the stellar metallicity & yes \\
440: Mass fraction of substellar objects & 0.00\\
441: Nebular emission & yes \\
442: Extinction & disk geometry: inclination-averaged \\
443: & for Sa, Sb, Sbc, Sc, Sd and Im \\
444: & spheroidal geometry for E-S0 \\
445: Age & 13 Gyr for E-S0,Sa, Sb, Sbc, Sc \& Sd \\
446: & 9 Gyr for Im \\
447: \hline
448: \end{tabular}
449: \label{t3}
450: \end{table*}
451:
452:
453: \subsection{Extension of the library to random values of parameters}
454:
After producing the regular synthetic spectral grid (Table~\ref{t2}), we
456: proceed to produce synthetic spectra of galaxies with parameters selected
457: from a random distribution, in order to achieve a more continuous coverage in
458: colour space. Such grids permit more robust tests of parameter estimation
459: algorithms than do regular grids. Each parameter is selected independently
460: from a uniform distribution over the parameter ranges in the regular grid. We
used this approach to generate 5500 models. In doing this we keep
approximately the same ratios between the Hubble types as in the regular grid.
Because the parameter ranges for each galaxy type in Table~\ref{t2} show some overlap,
464: a random draw may produce a set of parameters which fits into more than one
465: Hubble type category. To remove this ``degeneracy'' we again apply the circle
466: removal method we used in section~\ref{most_signif}. This results in a
467: ``non-degenerate'' sample of 2709 spectra.
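
A minimal Python sketch of the random draw is given below. It assumes, as
stated above, independent uniform draws within the ranges of the regular grid;
the dictionary \verb+AP_RANGES+ (filled here for two types only), the function
name \verb+draw_parameters+ and the random seed are illustrative choices and
not part of the original procedure.

```python
import random

# Ranges copied from the regular-grid table for two of the seven types
# (p2 in Myr per solar mass, infall timescale in Myr).
AP_RANGES = {
    "Sbc": {"p1": (0.4, 1.5), "p2": (2000.0, 10000.0), "infall": (4000.0, 7000.0)},
    "Im": {"p1": (1.0, 2.0), "p2": (14000.0, 20000.0), "infall": (7000.0, 10000.0)},
}


def draw_parameters(galaxy_type, rng):
    """Draw each parameter independently from a uniform distribution."""
    ranges = AP_RANGES[galaxy_type]
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


rng = random.Random(42)  # fixed seed so the sketch is reproducible
sample = [draw_parameters("Sbc", rng) for _ in range(5)]
```

Each drawn parameter set would then be passed through the same circle
criterion in the colour--colour diagram before being accepted into the library.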
468:
469: \begin{figure}
470: \centering
471: \includegraphics[width=6cm,angle=-90]{f7.ps}
472: \caption{ Colour--colour (g--r vs.\ r--i) diagram of synthesized photometry of
473: SDSS galaxy spectra (black) and of synthetic P\'EGASE spectra of the 8 typical
474: models of P\'EGASE.2 (yellow). Moving from the lower left to the upper right
475: part of the diagram we encounter types from Im to E. The red dots along both
476: sides of the typical models represent the spectra of both the regular and
477: random library.}
478: \label{f7}
479: \end{figure}
480:
481: A comparison of the simulated colours of the synthetic spectra (888 regular
482: grid plus 2709 random grid, at zero redshift) with the colours of SDSS spectra
483: is shown in Fig. \ref{f7}. One sees that the new set of spectra is in very
484: good agreement with the SDSS data, except for the small differences in the E
485: and S0 galaxies.
486:
487: In summary, we have produced a library of 7149 synthetic galaxy spectra (888
488: spectra of the regular grid for 5 values of redshift and 2709 of the random
489: grid at zero redshift) which can be used as an initial library of unresolved
490: galaxy spectra for assessing the possibilities of galaxy classification and
491: parametrization with Gaia. This library was created at the resolution of the
492: BaSeL 2.2 stellar library
493: (gradually changing from 8\,\AA\ at 2500\,\AA\ to 50\,\AA\ at 10\,500\,\AA),
494: which is not quite high enough for the Gaia simulation software (which
requires 10\,\AA). Therefore, we resampled our spectra by linear
interpolation to a sampling of 10\,\AA\ over the wavelength range
2500--10\,500\,\AA. Higher resolution spectra will be produced in future work
498: using the High-spectral Resolution code P\'EGASE-HR \citep{le1}.
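
The resampling step can be illustrated with the following Python sketch of
piecewise-linear interpolation onto a uniform grid. The function name
\verb+resample_linear+ and the toy spectrum are hypothetical; the sketch only
shows the technique and is not the code used for the library.

```python
from bisect import bisect_right


def resample_linear(wavelengths, fluxes, new_wavelengths):
    """Linearly interpolate a spectrum onto a new wavelength grid.

    wavelengths must be strictly increasing, and every new wavelength
    must lie inside [wavelengths[0], wavelengths[-1]].
    """
    resampled = []
    for w in new_wavelengths:
        if w < wavelengths[0] or w > wavelengths[-1]:
            raise ValueError("new wavelength outside the original range")
        # Index of the right-hand neighbour, clipped to a valid interval.
        j = min(max(bisect_right(wavelengths, w), 1), len(wavelengths) - 1)
        w0, w1 = wavelengths[j - 1], wavelengths[j]
        f0, f1 = fluxes[j - 1], fluxes[j]
        t = (w - w0) / (w1 - w0)
        resampled.append(f0 + t * (f1 - f0))
    return resampled


# Toy spectrum on an irregular grid (hypothetical values, in Angstrom).
wavelengths = [2500.0, 2508.0, 2520.0, 2540.0]
fluxes = [1.0, 1.8, 3.0, 5.0]
grid = [2500.0 + 10.0 * i for i in range(5)]  # uniform 10 Angstrom sampling
resampled = resample_linear(wavelengths, fluxes, grid)
```

Note that interpolation changes only the sampling; it does not add spectral
information beyond the resolution of the underlying stellar library.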
499:
500: \section{Simulated Gaia spectra}
501:
502: The Gaia spectrophotometer is a slitless prism spectrograph comprising blue
503: and red channels (called BP and RP respectively) which operate over the
wavelength ranges 3300--6800\,\AA\ and 6400--10\,500\,\AA\ respectively. BP
505: and RP spectra were simulated for all 7149 library spectra using the simulator
506: developed by \citet{brown}. Each of BP and RP is simulated with 48 pixels,
507: whereby the dispersion varies from 30--290\,\AA/pix and 60--150\,\AA/pix
508: respectively. We artificially reddened each spectrum with a standard
509: interstellar extinction law with R=3.1, for regular values of $A_{V}$ from 0
to 10 for the regular library, and for 10 random values of $A_{V}$ uniformly
distributed in $\log(1+A_{V})$ for the random library. Noise was added to
512: all spectra, which includes the source Poisson noise, background Poisson noise
513: and CCD readout noise. This is done for five different source G-band
514: magnitudes (15, 17, 18, 19 and 20). For the following classification tests we
515: use only the sample at G=18. In Fig. \ref{f8} we present the simulated BP and RP spectra
516: for the eight typical synthetic spectra of galaxies.
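
Drawing extinction values that are uniform in $\log(1+A_{V})$ can be sketched
as follows. The upper limit of $A_{V}=10$ for the random library is an
assumption made here (taken over from the regular library), and the function
name \verb+draw_extinctions+ is hypothetical.

```python
import math
import random


def draw_extinctions(n, a_v_max, rng):
    """Draw n values of A_V uniformly distributed in log(1 + A_V).

    Assumption: A_V ranges from 0 to a_v_max. The base of the logarithm
    does not matter for a uniform distribution, so natural logs are used.
    """
    upper = math.log(1.0 + a_v_max)
    return [math.exp(rng.uniform(0.0, upper)) - 1.0 for _ in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
a_v_values = draw_extinctions(10, 10.0, rng)
```

Compared with a draw that is uniform in $A_{V}$ itself, this scheme places
relatively more spectra at low extinction.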
517:
518: \begin{figure}
519: \centering
520: \includegraphics[width=6cm,angle=-90]{f8.ps}
521: \caption{ The simulated BP and RP spectra of the synthetic spectra for the
522: eight typical galaxy types from P\'EGASE.2. Black, green, blue, yellow,
523: magenta, light blue and red denote galaxies of type E, Sa, Sb, Sbc, Sc, Sd and
524: Im respectively.}
525: \label{f8}
526: \end{figure}
527:
528:
529: \section{Classification \& Parametrization}
530:
531: In the present work we use classification Support Vector Machines (SVMs)
532: (C-classification) to determine morphological types and regression SVMs
533: ($\epsilon$-regression) to estimate the various astrophysical parameters. We
use the libsvm library of \citet{libsvm} as implemented in the \verb+e1071+
package for the R statistical environment.\footnote{{\tt http://www.r-project.org}} A
536: brief description of the SVMs is given in the Appendix of this paper. An
537: accessible introduction to SVMs can be found in \citet{bennett00}. For a more
538: technical introduction, the tutorial by \citet{burges98} is recommended.
539:
540:
541: \subsection{Galaxies at zero redshift}
542:
543: \subsubsection{Classification of the morphological type}\label{classify}
544:
545: We now try to classify the set of Gaia-simulated galaxy spectra, at G=18 with
546: zero redshift, into the seven Hubble types. This subset of the library
547: includes characteristic noise and a wide range of interstellar extinction
(from 0--10 mag in $A_{V}$). It comprises 9691 spectra. This we divide at
random into two subsets: 4846 for training the SVM classifiers and 4845 for
evaluating their performance. As is advisable with many machine learning
551: methods, we first normalized the data by scaling each input (pixel) to have
552: zero mean and unit standard deviation.
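
The random split and the per-pixel standardization can be sketched in Python
as below. The function names are hypothetical, and two details are assumptions
not stated in the text: the population standard deviation is used, and the
scaling is computed over whatever set of spectra is passed in.

```python
import random
import statistics


def split_at_random(n_items, n_train, rng):
    """Return (train, test) index lists from a random permutation."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def standardize_columns(spectra):
    """Scale each input (pixel) to zero mean and unit standard deviation."""
    n_pixels = len(spectra[0])
    columns = [[row[j] for row in spectra] for j in range(n_pixels)]
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    if any(s == 0.0 for s in stds):
        raise ValueError("a pixel with zero variance cannot be standardized")
    return [
        [(row[j] - means[j]) / stds[j] for j in range(n_pixels)]
        for row in spectra
    ]


train, test = split_at_random(9691, 4846, random.Random(1))

# Toy data: three "spectra" with two pixels each (hypothetical values).
scaled = standardize_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

Standardization prevents pixels with large flux values from dominating the
kernel distances used by the SVM.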
553:
554: For the purpose of visualizing the data set only, we perform a Principal
555: Components Analysis (PCA) on the set of 9691 96-dimensional Gaia spectra.
556: The first three Principal Components describe 78.25\%, 20.44\% and 1.02\% of
557: the data variance respectively (i.e.\ 99.71\% together).\footnote{Note that,
558: because each input dimension has already been normalized to have zero mean
559: and unit variance, a considerable fraction of the total variance is already
560: accounted for.} In Fig. \ref{f9} we plot the data in projection onto the
561: first three PCs. This diagram, plus the fact that the first three PCs explain
almost all of the variance in the data, suggests that a good classification
should be possible (the data have an intrinsically low dimensionality).
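
For reference, in the standard PCA formulation the fraction of the data
variance described by the $k$th Principal Component is
\[
f_k = \frac{\lambda_k}{\sum_{j=1}^{96}\lambda_j} ,
\]
where $\lambda_j$ are the eigenvalues of the covariance matrix of the 96
standardized inputs (the symbols $f_k$ and $\lambda_j$ are introduced here for
illustration). With the values quoted above,
$f_1+f_2+f_3 = 0.7825+0.2044+0.0102 = 0.9971$.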
564:
565: \begin{figure}
566: \centering
567: \includegraphics[width=6cm,angle=-90]{f9.ps}
568: \caption{ The 9691 simulated Gaia galaxy spectra with z=0 plotted as their
569: projections onto the first three Principal Components. Black, green, blue,
570: light blue, magenta, yellow and red denote galaxies of type E, Sa, Sb, Sbc, Sc,
571: Sd and Im respectively.}
572: \label{f9}
573: \end{figure}
574:
575: \begin{table}
576:
577: \centering
\caption {Galaxy classification with the SVM. The confusion matrix for the
training set for galaxies at z=0. Columns indicate the true class, rows the
predicted class.}
581: \begin{tabular}{l | c c c c c c c}
582: \hline\hline
583:
584: Type & E-S0 & Sa & Sb & Sbc & Sc & Sd & Im \\
585: \hline
586: E-S0 & 1799 & 0 & 0 & 0 & 0 & 0 & 0 \\
587: Sa & 0 & 1366 & 0 & 0 & 0 & 0 & 0 \\
588: Sb & 0 & 0 & 53 & 5 & 0 & 0 & 0 \\
589: Sbc & 0 & 0 & 0 & 134 & 0 & 0 & 0 \\
590: Sc & 0 & 0 & 0 & 0 & 830 & 0 & 0 \\
591: Sd & 0 & 0 & 0 & 0 & 0 & 347 & 1 \\
592: Im & 0 & 0 & 0 & 0 & 0 & 0 & 311 \\
593:
594: \hline
595: \end{tabular}
596: \label{t4}
597: \centering
598: \caption {As Table~\ref{t4} but for the test set.}
599: \begin{tabular}{l | c c c c c c c}
600: \hline\hline
601:
602: Type & E-S0 & Sa & Sb & Sbc & Sc & Sd & Im \\
603: \hline
604: E-S0 & 1798 & 0 & 0 & 0 & 0 & 0 & 0 \\
605: Sa & 0 & 1329 & 0 & 0 & 0 & 0 & 0 \\
606: Sb & 0 & 0 & 44 & 0 & 0 & 0 & 0 \\
607: Sbc & 0 & 0 & 4 & 137 & 0 & 0 & 0 \\
608: Sc & 0 & 0 & 0 & 1 & 797 & 0 & 0 \\
609: Sd & 0 & 0 & 0 & 0 & 0 & 394 & 6 \\
610: Im & 0 & 0 & 0 & 0 & 0 & 3 & 324 \\
611:
612: \hline
613: \end{tabular}
614: \label{t5}
615: \end{table}
616:
The results of training and testing the SVM classifier on the full 96-pixel
spectra are shown in Tables \ref{t4} and \ref{t5}. We see that there are very
few misclassifications: only 6 in the training set and 14 in the test set,
corresponding to error rates of 0.12\% and 0.29\% respectively. While these
results are very promising, it must be recalled that the way the library has been
constructed avoids class overlap in the SDSS g$-$r, r$-$i colour space,
which surely eases separation in the 96-dimensional BP/RP colour space.
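
For reference, these error rates can be reproduced from the off-diagonal
entries of the two confusion matrices and the subset sizes given above (a
simple worked check, not an additional result):
\[
\frac{5+1}{4846} = \frac{6}{4846} \simeq 0.12\%, \qquad
\frac{4+1+6+3}{4845} = \frac{14}{4845} \simeq 0.29\% .
\]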
624:
625: \subsubsection{Regression of astrophysical parameters}
626:
627: In addition to simulating an output spectrum, P\'EGASE.2 also derives 18
628: output astrophysical parameters for each galaxy. Of course, by construction we know that our
629: synthetic spectra are uniquely defined by five parameters (p1, p2, infall
630: timescale, age of the galactic winds and the Hubble type), so there can only
631: be five equivalent independent parameters amongst these 18. Nonetheless, it
632: would be useful to predict them directly. Here we build SVM regression models
633: to separately predict the nine most significant ones (listed in Table
634: \ref{t6}). For each model we train on a randomly selected set of 4846 spectra
635: and evaluate performance on the remaining 4845. In Fig.~\ref{f10} we present
636: the true and the SVM-predicted values of each parameter on the test set. Table
637: \ref{t6} summarizes this by giving the mean of the difference between the true
638: and predicted values for each parameter (which measures the systematic error)
639: as well as the RMS residual (which measures the total scatter). The plots and
640: table indicate that we can predict the parameters to good accuracy and
641: precision, i.e.\ the systematics are very small and the RMS error is a small
642: fraction of the typical values.
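
To make the two summary statistics explicit, the following sketch follows the
column headings of Table~\ref{t6} (the symbols $r_i$, $p_i$, $\bar{r}$,
$\delta$ and $s$ are introduced here for illustration). Writing $r_i$ and
$p_i$ for the real and predicted values of a parameter for test spectrum $i$,
and $\bar{r}$ for the mean of the real values,
\[
\delta = \frac{\mathrm{mean}(r_i - p_i)}{\bar{r}}, \qquad
s = \frac{\mathrm{sd}(r_i - p_i)}{\bar{r}} .
\]
For the mass to light ratio, for example, Table~\ref{t6} lists
$\delta = -1.03\times10^{-2}$ and $s = 3.78\times10^{-2}$, i.e.\ a systematic
offset of about 1\% and a scatter of about 4\% of the mean value.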
643:
644:
645: \begin{figure*}[t]
646: \setlength{\unitlength}{1cm}
647: \begin{picture}(18,15)
648: \put(0,15){\special{psfile=f10a.ps hoffset=0 voffset=0 hscale=20
649: vscale=20 angle=-90}}
650: \put(0,10){\special{psfile=f10b.ps hoffset=0 voffset=0 hscale=20
651: vscale=20 angle=-90}}
652: \put(0,5){\special{psfile=f10c.ps hoffset=0 voffset=0 hscale=20
653: vscale=20 angle=-90}}
654: \put(6,15){\special{psfile=f10d.ps hoffset=0 voffset=0 hscale=20
655: vscale=20 angle=-90}}
656: \put(6,10){\special{psfile=f10e.ps hoffset=0 voffset=0 hscale=20
657: vscale=20 angle=-90}}
658: \put(6,5){\special{psfile=f10f.ps hoffset=0 voffset=0 hscale=20
659: vscale=20 angle=-90}}
660: \put(12,15){\special{psfile=f10g.ps hoffset=0 voffset=0 hscale=20
661: vscale=20 angle=-90}}
662: \put(12,10){\special{psfile=f10h.ps hoffset=0 voffset=0 hscale=20
663: vscale=20 angle=-90}}
664: \put(12,5){\special{psfile=f10i.ps hoffset=0 voffset=0 hscale=20
665: vscale=20 angle=-90}}
666: \end{picture}
667: \caption{
668: Galaxy parameter estimation performance. For each of the nine APs we plot
669: the predicted vs.\ true AP values for the test set. The red line indicates
670: the line of perfect estimation. The summary errors are given in
671: Table~\ref{t6}.}
672: \label{f10}
673: \end{figure*}
674:
675: \begin{table*}
676: \centering
\caption {Summary of the performance of the SVM regression models for
predicting the nine APs listed. The sample is for zero redshift but for
interstellar extinction ($A_{v}$) varying from 0 to 10\,mag. The second and
third columns list the mean and the standard deviation of the residuals (real
minus predicted), each divided by the mean of the real values. The final
column gives the number of support vectors in the SVM model.}
682: \begin{tabular}{l c c c}
683: \hline\hline
684:
685: Astrophysical Parameter & mean(real-predicted)/mean(real) & sd(real-predicted)/mean(real) & SVs \\
686: \hline
687: mass to light ratio (M/L) & -1.03e-2 & 3.78e-2 & 97 \\
688: normalized star formation rate (SFR) & -3.35e-3 & 3.97e-2 & 2285 \\
689: metallicity of interstellar medium (Mim) & -2.85e-3 & 8.77e-2 & 345 \\
690: metallicity of stars averaged on mass (Msm) & -3.64e-4 & 2.17e-2 & 3544 \\
691: normalized mass of gas (Mgas) & 4.52e-3 & 4.29e-2 & 190 \\
692: normalized mass in stars (Ms) & 3.22e-4 & 5.48e-2 & 1639 \\
693: mean age of stars averaged on bolometric luminosity (Al) & 1.45e-3 & 3.22e-2 & 3566 \\
694: normalized SNIa rate (SNIa) & 9.69e-4 & 3.43e-2 & 376 \\
695: normalized SNII rate (SNII) & -6.04e-4 & 3.81e-2 & 2247 \\
696: \hline
697: \end{tabular}
698: \label{t6}
699: \end{table*}
700:
701: \subsection{Galaxies with redshift}
702:
703: \subsubsection{Regression of redshift and classification of morphological type}
704:
We now enlarge the subset of the library used in the previous tests by
adding the same galaxies at four nonzero values of redshift, specifically
0.05, 0.1, 0.15 and 0.2. The library for z=0 includes 9691 galaxies as described above.
For each nonzero redshift there are 9757,
giving a total sample of 48\,719 galaxies. (Recall that this includes each
galaxy simulated at 11 regular values of $A_{v}$.) We now build another
morphological type classification model as in section~\ref{classify}, this time
with 6719 galaxies in the training set and 42\,000 galaxies in the test set.
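
As a quick consistency check of these numbers (worked arithmetic only):
\[
9691 + 4 \times 9757 = 48\,719 = 6719 + 42\,000 ,
\]
so that the training set makes up $6719/48\,719 \simeq 13.8\%$ of the
enlarged sample, compared with about half of the sample in the zero-redshift
test.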
713:
We again applied a PCA to the data. This time the first three Principal
Components describe 76.01\%, 21.63\% and 1.02\% of the data variance
respectively (i.e.\ 98.66\% together), very similar to before. The
corresponding PCA projection plot is Fig.~\ref{f11}. Comparing to Fig.~\ref{f9} we
can see how the redshift spreads out the previous loci of types.
719: The performance of the SVM classifier is summarized in Tables \ref{t7} and
720: \ref{t8}. The performance is good considering the added complexity introduced
721: by the redshift variations (and the corresponding increase in the sample
722: size). The misclassification errors are 0.13\% and 0.98\% corresponding to 9
723: and 411 galaxies for the training and the testing data respectively.
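
These rates follow from the off-diagonal entries of Tables \ref{t7} and
\ref{t8}:
\[
\frac{9}{6719} \simeq 0.13\%, \qquad \frac{411}{42\,000} \simeq 0.98\% .
\]
The overall rate averages over types of very different abundance. As a worked
example derived here from the table entries (not a separately quoted result),
reading Table~\ref{t8} by column gives $344 + 83 + 8 = 435$ test galaxies of
true type Sb, of which 344 are assigned to Sb:
\[
\frac{344}{435} \simeq 79\% .
\]
The confusion is thus concentrated in the sparsely populated intermediate
spiral types rather than spread evenly over the sample.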
724:
725:
726: \begin{figure}
727: \centering \includegraphics[width=6cm,angle=-90]{f11.ps}
\caption{ The 48\,719 simulated Gaia galaxy spectra at z=0, 0.05, 0.1, 0.15
and 0.2 plotted as their projections onto the first three Principal
Components. Black, green, blue, light blue, magenta, yellow and red denote
galaxies of type E, Sa, Sb, Sbc, Sc, Sd and Im respectively.}
732: \label{f11}
733: \end{figure}
734:
735: \begin{table}
736: \centering
\caption {Galaxy classification with the SVM. The confusion matrix for the
training set for galaxies at z=0.0, 0.05, 0.1, 0.15, 0.2. Columns indicate the
true class, rows the predicted class.}
740: \begin{tabular}{l | c c c c c c c}
741: \hline\hline
742:
743: Type & E-S0 & Sa & Sb & Sbc & Sc & Sd & Im \\
744: \hline
745: E-S0 & 2512 & 0 & 0 & 0 & 0 & 0 & 0 \\
746: Sa & 0 & 1828 & 0 & 0 & 0 & 0 & 0 \\
747: Sb & 0 & 0 & 74 & 2 & 0 & 0 & 0 \\
748: Sbc & 0 & 0 & 1 & 183 & 1 & 0 & 0 \\
749: Sc & 0 & 0 & 0 & 0 & 1115 & 0 & 0 \\
750: Sd & 0 & 0 & 0 & 0 & 0 & 536 & 4 \\
751: Im & 0 & 0 & 0 & 0 & 0 & 1 & 462 \\
752:
753: \hline
754: \end{tabular}
755: \label{t7}
756: \centering
757: \caption {As Table~\ref{t7} but for the test set.}
758: \begin{tabular}{l | c c c c c c c}
759: \hline\hline
760:
761: Type & E-S0 & Sa & Sb & Sbc & Sc & Sd & Im \\
762: \hline
763: E-S0 & 15473 & 0 & 0 & 0 & 0 & 0 & 0 \\
764: Sa & 0 & 11647 & 0 & 0 & 0 & 0 & 0 \\
765: Sb & 17 & 0 & 344 & 113 & 0 & 0 & 0 \\
766: Sbc & 0 & 0 & 83 & 1084 & 23 & 0 & 0 \\
767: Sc & 0 & 0 & 8 & 39 & 6971 & 7 & 0 \\
768: Sd & 0 & 0 & 0 & 0 & 1 & 3149 & 50 \\
769: Im & 0 & 0 & 0 & 0 & 0 & 70 & 2921 \\
770:
771: \hline
772: \end{tabular}
773: \label{t8}
774: \end{table}
775:
In practice we may want to first reduce spectra to the rest frame, for which
we require an estimate of the redshift. Therefore, we also set up an SVM
regression model to predict redshift, using the same training and test sets.
The predicted values of redshift for each of the five true redshift values are
presented in Fig. \ref{f12}. We do not expect very good performance here,
because the SVM has to learn the effect of redshift from just five
different values.
783:
784: \begin{figure*}[t]
785: \setlength{\unitlength}{1cm}
786: \begin{picture}(18,10)
787: \put(0,10){\special{psfile=f12a.ps hoffset=0 voffset=0 hscale=20
788: vscale=20 angle=-90}}
789: \put(6,10){\special{psfile=f12b.ps hoffset=0 voffset=0 hscale=20
790: vscale=20 angle=-90}}
791: \put(12,10){\special{psfile=f12c.ps hoffset=0 voffset=0 hscale=20
792: vscale=20 angle=-90}}
793: \put(4,5){\special{psfile=f12d.ps hoffset=0 voffset=0 hscale=20
794: vscale=20 angle=-90}}
795: \put(10,5){\special{psfile=f12e.ps hoffset=0 voffset=0 hscale=20
796: vscale=20 angle=-90}}
797: \end{picture}
\caption{Distribution of the predicted values of redshift, shown separately for the five true values of redshift (z=0, 0.05, 0.1, 0.15 and 0.2).}
799: \label{f12}
800: \end{figure*}
801:
802:
803:
804: \section{Discussion and conclusion}
805:
806: We have used the P\'EGASE.2 galaxy evolution model and the observational data
807: from SDSS to create an extended grid of synthetic galaxy spectra. Using these
we have identified the relevant astrophysical parameters and their
ranges which provide realistic galaxy spectra of known morphological type.
810: This was done specifically by comparing the colours of our library spectra
811: with those synthesized from SDSS spectra. We found small deviations between
812: the two colour loci for redder galaxies -- where the ellipticals are found --
813: which might be due to the fact that SDSS spectra are obtained in a small aperture
814: (fibre diameter) while P\'EGASE spectra are representative of the whole
815: galaxy. We also see that the observed sample has a considerably larger spread
816: in the colour--colour diagram than the library spectra, which probably has
817: observational reasons (photometric errors) as well as theoretical ones
818: (insufficient cosmic variance in the galaxy models). That is, it may
819: partially reflect the complicated nature of galaxy formation and evolution,
820: although the overall agreement between the two is good.
821:
822: To achieve a better agreement between the observational and
823: synthesized libraries we will further investigate the influence of the
824: various P\'EGASE.2 parameters, especially those that were kept constant for
825: this release of the library. On the other hand, due to the narrow redshift
826: range ($z < 0.2$) explored here, evolution factors are minimized. At higher
827: redshifts, synthetic spectra will be computed by simultaneously applying
828: cosmological k-corrections and evolution e-corrections to z=0 templates.
829:
830: Among the existing libraries of observed spectra, the most complete
831: and homogeneous is the SDSS, since it covers a significant part of the whole
832: sky and it goes fainter than the expected detection limit of Gaia. We
833: therefore aim to produce a suitable set of synthetic spectra covering as
834: much as possible of the SDSS colour range and we plan further comparisons in
835: our future work.
836:
Adding phenomena such as galaxy mergers is a challenging
task, but we believe that at the low redshifts Gaia will observe, this
is not such an important or frequent mechanism of galaxy evolution. On the
840: other hand, starburst galaxies are more frequent at small redshifts and we
841: intend to enrich our library with this type of galaxy.
842:
The first results of applying SVMs to the classification and parametrization
of the library are
quite promising. In particular, the first indications are that Gaia will be
845: able to produce a wealth of information for a large statistical sample of
846: galaxies. After constructing a more complete library of spectra we will be
847: able to perform more tests and construct a classifier able to treat more
848: realistic and complete simulations of galaxy spectra.
849:
850: \section{Acknowledgments}
851: The authors (the Greek team) would like to thank the Greek General Secretariat
852: of Research and Technology (GSRT) for financial support.
853:
854: P. Tsalmantza would also like to thank the Max-Planck-Institut f\"ur
855: Astronomie (MPIA) and Institut d'Astrophysique de Paris (IAP) for their
856: support and hospitality.
857:
858: Funding for the Sloan Digital Sky Survey (SDSS) has been provided by the Alfred
859: P. Sloan Foundation, the Participating Institutions, the National Aeronautics
860: and Space Administration, the National Science Foundation, the U.S. Department
861: of Energy, the Japanese Monbukagakusho, and the Max Planck Society. The SDSS
862: Web site is http://www.sdss.org/.
863:
864: The SDSS is managed by the Astrophysical Research Consortium (ARC) for the
865: Participating Institutions. The Participating Institutions are The University
866: of Chicago, Fermilab, the Institute for Advanced Study, the Japan Participation
867: Group, The Johns Hopkins University, the Korean Scientist Group, Los Alamos
868: National Laboratory, the Max-Planck-Institute for Astronomy (MPIA), the
869: Max-Planck-Institute for Astrophysics (MPA), New Mexico State University,
870: University of Pittsburgh, University of Portsmouth, Princeton University, the
871: United States Naval Observatory, and the University of Washington.
872:
873: \begin{thebibliography}{22}
874: \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi
875:
876: \bibitem[{{Armand} \& {Milliard}(1994)}]{FOCA2000}
877: {Armand}, C. \& {Milliard}, B. 1994, \aap, 282, 1
878:
879: \bibitem[{Bennett \& Campbell(2000)}]{bennett00}
880: Bennett, K.~P. \& Campbell, C. 2000, SIGKDD Explor. Newsl., 2, 1
881:
882: \bibitem[{{Brown}(2006)}]{brown}
883: {Brown}, A. G.~A. 2006, Gaia Technical Report GAIA-C8-SP-LEI-AB-006-1
884:
885: \bibitem[{{Buat} {et~al.}(1999){Buat}, {Donas}, {Milliard}, \& {Xu}}]{buat}
886: {Buat}, V., {Donas}, J., {Milliard}, B., \& {Xu}, C. 1999, \aap, 352, 371
887:
888: \bibitem[{Burges(1998)}]{burges98}
889: Burges, C. J.~C. 1998, Data Mining and Knowledge Discovery, 2, 121
890:
891: \bibitem[{Chang \& Lin(2001)}]{libsvm}
Chang, C.-C. \& Lin, C.-J. 2001, {LIBSVM}: a library for support vector
machines, software available at http://www.csie.ntu.edu.tw/\textasciitilde{}cjlin/libsvm
894:
895: \bibitem[{{Fioc}(1997)}]{fioc3}
896: {Fioc}, M. 1997, PhD thesis, Universit{\'e} Paris XI,
897: http://www.iap.fr/users/fioc.html
898:
899: \bibitem[{{Fioc}(1999)}]{fioc6}
{Fioc}, M. 1999, in Astronomical Society of the Pacific Conference Series, Vol.
192, Spectrophotometric Dating of Stars and Galaxies, ed. I.~{Hubeny},
S.~{Heap}, \& R.~{Cornett}, 299
903:
904: \bibitem[{{Fioc} \& {Rocca-Volmerange}(1997)}]{fioc2}
905: {Fioc}, M. \& {Rocca-Volmerange}, B. 1997, \aap, 326, 950
906:
907: \bibitem[{{Fioc} \& {Rocca-Volmerange}(1999{\natexlab{a}})}]{fioc1}
908: {Fioc}, M. \& {Rocca-Volmerange}, B. 1999{\natexlab{a}}, \aap, 351, 869
909:
910: \bibitem[{{Fioc} \& {Rocca-Volmerange}(1999{\natexlab{b}})}]{fioc4}
911: {Fioc}, M. \& {Rocca-Volmerange}, B. 1999{\natexlab{b}}, \aap, 344, 393
912:
913: \bibitem[{{Fioc} \& {Rocca-Volmerange}(1999{\natexlab{c}})}]{fioc5}
914: {Fioc}, M. \& {Rocca-Volmerange}, B. 1999{\natexlab{c}}, arXiv:astro-ph/9912179
915:
916: \bibitem[{{Fukugita} {et~al.}(1996){Fukugita}, {Ichikawa}, {Gunn}, {Doi},
917: {Shimasaku}, \& {Schneider}}]{fukugita}
918: {Fukugita}, M., {Ichikawa}, T., {Gunn}, J.~E., {et~al.} 1996, \aj, 111, 1748
919:
920: \bibitem[{{Groenewegen} \& {de Jong}(1993)}]{groenewegen}
921: {Groenewegen}, M.~A.~T. \& {de Jong}, T. 1993, \aap, 267, 410
922:
923: \bibitem[{{Le Borgne} \& {Rocca-Volmerange}(2002)}]{le2}
924: {Le Borgne}, D. \& {Rocca-Volmerange}, B. 2002, \aap, 386, 446
925:
926: \bibitem[{{Le Borgne} {et~al.}(2004){Le Borgne}, {Rocca-Volmerange},
927: {Prugniel}, {Lan{\c c}on}, {Fioc}, \& {Soubiran}}]{le1}
928: {Le Borgne}, D., {Rocca-Volmerange}, B., {Prugniel}, P., {et~al.} 2004, \aap,
929: 425, 881
930:
931: \bibitem[{{Rana} \& {Basu}(1992)}]{rana}
932: {Rana}, N.~C. \& {Basu}, S. 1992, \aap, 265, 499
933:
934: \bibitem[{{Rocca-Volmerange} {et~al.}(2007){Rocca-Volmerange}, {de Lapparent},
935: {Seymour}, \& {Fioc}}]{rocca07}
936: {Rocca-Volmerange}, B., {de Lapparent}, V., {Seymour}, N., \& {Fioc}, M. 2007,
937: arXiv:0705.2031
938:
939: \bibitem[{{Rocca-Volmerange} {et~al.}(2004){Rocca-Volmerange}, {Le Borgne}, {De
940: Breuck}, {Fioc}, \& {Moy}}]{rocca}
941: {Rocca-Volmerange}, B., {Le Borgne}, D., {De Breuck}, C., {Fioc}, M., \& {Moy},
942: E. 2004, \aap, 415, 931
943:
944: \bibitem[{Vapnik(1995)}]{vapnik95}
945: Vapnik, V.~N. 1995, The nature of statistical learning theory (Springer)
946:
947: \bibitem[{{Williams} {et~al.}(1996){Williams}, {Blacker}, {Dickinson}, {Dixon},
948: {Ferguson}, {Fruchter}, {Giavalisco}, {Gilliland}, {Heyer}, {Katsanis},
949: {Levay}, {Lucas}, {McElroy}, {Petro}, {Postman}, {Adorf}, \&
950: {Hook}}]{williams}
951: {Williams}, R.~E., {Blacker}, B., {Dickinson}, M., {et~al.} 1996, \aj, 112,
952: 1335
953:
954: \bibitem[{{Woosley} \& {Weaver}(1995)}]{woosley}
955: {Woosley}, S.~E. \& {Weaver}, T.~A. 1995, \apjs, 101, 181
956:
957: \end{thebibliography}
958:
959: \clearpage
960: \appendix
961: \section{Support Vector Machines}
962:
963: Support Vector Machines (SVMs) \citep{vapnik95} are supervised machine
964: learning methods for data classification. In their basic form they achieve a
965: linear classification between two classes by defining an optimal hyperplane which
966: separates members of the two classes. If the classes are separable then there
967: generally exists an infinite number of hyperplanes which achieve this.
968: The SVM optimal plane is defined as that plane which maximises the margin
969: between the opposing class members nearest to the boundary. That is, unlike
970: many other classifiers which use all of the data to define the boundary, SVMs
971: take the (arguably more reasonable) approach of using just those points nearest
972: to the boundary. It has been demonstrated that this gives rise to a
973: more robust and more accurate classifier under general conditions.
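
As a sketch of this margin maximization in the standard textbook formulation
(the symbols $w$ and $b$ are not used elsewhere in this paper and are
introduced here for illustration): for training vectors $x_i$ with labels
$y_i = \pm 1$ and a hyperplane $w \cdot x + b = 0$ scaled such that
$y_i (w \cdot x_i + b) \geq 1$ for all $i$, the width of the margin is
$2/\|w\|$, so the optimal plane is the solution of
\[
\min_{w,b} \; \frac{1}{2}\|w\|^{2} \quad \mathrm{subject\ to} \quad
y_i (w \cdot x_i + b) \geq 1 .
\]
Only the vectors for which the constraint holds with equality, i.e.\ those
nearest to the boundary, determine the solution.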
974:
975: In most non-trivial problems, however, the classes are not linearly separable.
976: In these cases, just those points which lie on the wrong side of the
977: hyperplane -- the so-called support vectors -- enter into the total
978: classification error. By minimizing this error -- which also measures the
979: distance of the support vectors from the plane -- we define the optimal
980: separating plane, i.e.\ with the fewest misclassifications (and preferentially
981: of those which lie closer to the plane).
982:
983: In the general case, the classes are not even marginally linearly separable
984: (consider the XOR problem) so a linear classifier, no matter how optimal, is
985: useless. SVMs address this issue by using kernels to project the data into a
986: higher dimensional space. For example, with a polynomial kernel we take
987: square, cubic etc.\ combinations of the original data to form additional
988: dimensions and then apply the (linear) SVM classifier in this higher
989: dimensional space. With many other kernels, however, this projection is only
990: carried out implicitly. This approach can be thought of as nonlinearity by
991: preprocessing, with the kernel overcoming the well known ``curse of
992: dimensionality''. In the present work we use the radial basis kernel
\begin{equation}
K(x_i - x_j)=\exp(-\gamma\|x_i-x_j\|^{2})
\label{SVM_kernel}
\end{equation}
where $x_i$ and $x_j$ are two input vectors (e.g.\ spectra). The
classification of a new vector $x_j$ is then given by a function
\begin{equation}
f(x_j) = \sum_{i=1}^{N} y_i \alpha_i K(x_i - x_j)
\label{SVM_model}
\end{equation}
where $y_i \in \{-1,+1\}$ denotes the two classes, and a classification is made
1004: by applying a threshold, e.g.\ $f(x_j) > 0.0 \Rightarrow$ class 1. The
1005: ${\alpha_i}$ are the parameters of the model which are determined by the model
1006: training ($i$ counts over the $N$ support vectors). SVMs have a very important
1007: property, namely that the error function is strictly convex, so it
1008: has a unique global solution which can be found in polynomial time with
1009: standard optimizers (it is a linearly constrained quadratic programming
1010: problem).
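
As a worked illustration of the kernel idea (a textbook example, not specific
to the present work; the vectors $u$, $v$ and the map $\phi$ are introduced
here for illustration), consider two-dimensional inputs $u=(u_1,u_2)$ and
$v=(v_1,v_2)$ and the quadratic kernel $K(u,v) = (u \cdot v)^{2}$. Expanding
the square gives
\[
(u \cdot v)^{2} = u_1^{2} v_1^{2} + 2\,u_1 u_2 v_1 v_2 + u_2^{2} v_2^{2}
= \phi(u) \cdot \phi(v),
\]
\[
\phi(u) = (u_1^{2},\; \sqrt{2}\,u_1 u_2,\; u_2^{2}) ,
\]
so that evaluating the kernel in the original two-dimensional space is
equivalent to taking a scalar product in a three-dimensional space, without
$\phi$ ever being computed explicitly.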
1011:
1012: This is in marked contrast to neural networks, for example, in which the optimizers
1013: converge on a local minimum and we can only be guaranteed to find
1014: the global minimum via an exhaustive search. Furthermore, with a sigmoidal
1015: kernel SVMs are equivalent to neural networks but with the additional
advantage that the SVM automatically determines the neural network
architecture (number of weights).
1018:
The SVM model incorporates regularization via the specification of a
hyperparameter, $C$, which sets the penalty attached to data vectors that fall
inside the margin around the separating hyperplane or on the wrong side of it.
These are all considered support vectors and so all enter the error equation.
With a larger $C$ there is a higher penalty attached to such errors, so the
optimization tolerates fewer of them and the margin generally becomes
narrower, i.e.\ there is less regularization.\footnote{$C$ is actually the upper bound on
$\alpha_i$, specifically $0 \leq \alpha_i \leq C$ and $\sum_i \alpha_i y_i =
0$ (two of the constraints in the error minimization). Thus a small $C$
implies smaller $\alpha_i$ in equation~\ref{SVM_model} which in turn implies
smoother functions equivalent to more regularization.}
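
For orientation, the constraints mentioned in the footnote belong to the
standard dual form of the SVM training problem, sketched here in the notation
of equation~\ref{SVM_model} (this is the textbook formulation, given as an
illustration rather than as a description of the specific implementation
used):
\[
\max_{\alpha} \; \sum_{i} \alpha_i
- \frac{1}{2} \sum_{i} \sum_{k} \alpha_i \alpha_k \, y_i y_k \, K(x_i - x_k)
\]
subject to $0 \leq \alpha_i \leq C$ and $\sum_i \alpha_i y_i = 0$. The
objective is quadratic in the $\alpha_i$ and the constraints are linear,
which is what makes this a linearly constrained quadratic programming problem.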
1029:
1030: The other hyperparameter in the model is $\gamma$ (equation~\ref{SVM_kernel}).
1031: Both $\gamma$ and $C$ must be determined by the user. Prior information may
1032: help, but in practice one carries out a rigorous search over a two-dimensional
1033: grid to ``tune'' the SVM. We did this using 4-fold cross validation,
1034: iterating over grids of increasing density.
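
The tuning procedure can be summarized schematically as follows (a sketch;
the symbols $E_k$ and $E_{\rm CV}$ are introduced here for illustration, and
the grid values themselves are not specified in this paper). The training set
is split into four parts; for each grid point $(C,\gamma)$ the model is
trained on three of the parts and its error $E_k(C,\gamma)$ is measured on the
remaining part $k$. The cross validation error and the selected
hyperparameters are then
\[
E_{\rm CV}(C,\gamma) = \frac{1}{4} \sum_{k=1}^{4} E_k(C,\gamma), \qquad
(C^{\ast},\gamma^{\ast}) = \arg\min_{(C,\gamma)} E_{\rm CV}(C,\gamma) .
\]
A denser grid can then be placed around $(C^{\ast},\gamma^{\ast})$ and the
procedure repeated.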
1035:
1036: SVMs can also be used for regression. Instead of a hyperplane and a margin
1037: about it, regression SVMs fit a line with a tube of radius $\epsilon$
1038: encompassing it. Data vectors which are less than a distance $\epsilon$ from
1039: the line are considered to be correctly fit, that is, the support vectors are
1040: only those points outside of the tube. Thus the $\epsilon$
1041: hyperparameter controls the degree of regularization. The specific error
1042: function we use is the mean squared error on the predictions, with
1043: the regularization again being introduced via the constraints in the
optimization (with Lagrange multipliers). All of the kernel and optimization
1045: machinery applies equally to these models, so that nonlinear regression can
1046: also be achieved.
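
In symbols (a sketch in standard notation; here $y_i$ denotes the real-valued
target of training vector $x_i$ rather than a class label, and $\xi_i$ is
introduced for illustration), a training vector is considered correctly fit if
$|y_i - f(x_i)| \leq \epsilon$, and its deviation beyond the tube is
\[
\xi_i = \max\left(0,\; |y_i - f(x_i)| - \epsilon\right) ,
\]
which vanishes for all points inside the tube. How the $\xi_i$ enter the error
function (linearly or squared) depends on the SVM variant.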
1047:
1048: \end{document}
1049: