1: %\documentclass[referee]{aa} % for a referee version
2: %\documentclass[onecolumn]{aa} % for a paper on 1 column
3: %\documentclass[longauth]{aa} % for the long lists of affiliations
4: %\documentclass[rnote]{aa} % for the research notes
5: %
6: \documentclass{aa}
7: %
8: \usepackage{graphicx,supertabular}
9: \usepackage{txfonts}
10: \usepackage{natbib}
11: \citestyle{aa}
12: 
13: \begin{document}
14: %
15:    \title{Towards a library of synthetic galaxy spectra and preliminary results of classification and parametrization of unresolved galaxies for Gaia}
16: 
17:    \author{P. Tsalmantza\inst{1}
18:           \and M. Kontizas\inst{1}
19:           \and C. A. L. Bailer-Jones\inst{2}
20:           \and B. Rocca-Volmerange\inst{3,4}
21:           \and R. Korakitis\inst{5}
22:           \and E. Kontizas\inst{6}
23:           \and E. Livanou\inst{1}
24:           \and A. Dapergolas\inst{6}
25:           \and I. Bellas-Velidis\inst{6}
26:           \and A. Vallenari\inst{7}
27:           \and M. Fioc\inst{3,8}}
28: 
29:     \offprints{P. Tsalmantza\\
30:     \email{vivitsal@phys.uoa.gr}}
31: 
    \institute{Department of Astrophysics, Astronomy \& Mechanics, Faculty
33:                of Physics, University of Athens, GR-15783 Athens, Greece                 
34:          \and
35:               Max-Planck-Institut f\"ur Astronomie, K\"onigstuhl 17, 69117 Heidelberg, Germany
36:          \and
37:               Institut d'Astrophysique de Paris, 98bis Bd Arago, 75014 Paris, France
38:          \and
39:               Universit\'e de Paris-Sud XI, I.A.S., 91405 Orsay Cedex, France              
40:          \and
41:               Dionysos Satellite Observatory, National Technical University of Athens, 15780 Athens, Greece
42:          \and
43:               IAA, National Observatory of Athens, P.O. Box 20048, GR-118 10 Athens, Greece
44:          \and 
45:               INAF, Padova Observatory, Vicolo dell'Osservatorio 5, 35122 Padova, Italy
46:          \and
47:               Universit\'e Pierre et Marie Curie, 4 place Jussieu, 75005 Paris, France}
48: 
49: 
50: \date{Received date / accepted}
51: 
52: 
53: % \abstract{}{}{}{}{}
54: % 5 {} token are mandatory
55: 
56:   \abstract
57:   % context heading (optional)
58:    {} %leave it empty if necessary
59:   % aims heading (mandatory)
60:    {The Gaia astrometric survey mission will, as a consequence of its scanning
61:      law, obtain low resolution optical (330--1000\,nm) spectrophotometry of
62:      several million unresolved galaxies brighter than V=22. We present the
63:      first steps in a project to design and implement a classification system
64:      for these data. The goal is both to determine morphological classes and
65:      to estimate intrinsic astrophysical parameters via synthetic templates.
66:      Here we describe (1) a new library of synthetic galaxy spectra, and (2)
67:      first results of classification and parametrization experiments using
68:      simulated Gaia spectrophotometry of this library.}
69:   % methods heading (mandatory)
70:    {We have created a large grid of synthetic galaxy spectra using the
     P\'EGASE.2 code, which is based on galaxy evolution models that take into
     account metallicity evolution, extinction correction, and emission lines
     (with stellar spectra based on the BaSeL library). Our classification and
74:      regression models are Support Vector Machines (SVMs), which are kernel-based
75:      nonlinear estimators.}
76:   % results heading (mandatory)
   {We produce a basic library of about 3600 zero-redshift galaxy spectra covering
     the main Hubble types over the wavelength range 250 to 1050\,nm at a sampling
79:      of 1\,nm or less. It is computed on a regular grid of four key
80:      astrophysical parameters for each type and for intermediate random values
81:      of the same parameters. An extended library reproduces this at a series
82:      of redshifts. Initial results from the SVM classifiers and parametrizers are
83:      promising, indicating that Hubble types can be reliably predicted and
84:      several parameters estimated with low bias and variance. Comparing the colours 
85:      of our synthetic library with Sloan Digital Sky Survey (SDSS) spectra we
86:      find good agreement over the full range of Hubble types and parameters.}
87:   % conclusions heading (optional), leave it empty if necessary
88:    {}
89:    
90:    \keywords{-- Galaxies: fundamental parameters -- Techniques: photometric --
91:      Techniques: spectroscopic}
92: 
93: \maketitle
94: 
95: \section{Introduction}
96: Large surveys of galaxies provide information on their global spatial
97: distribution and the physical properties of individual galaxies. Such a survey
will be obtained for the whole sky by the ESA mission Gaia from 2011 to 2016.
During its five-year mission Gaia will observe several million unresolved
galaxies over the whole sky. Although the survey's main goal is to study the
stellar content and the structure of our Galaxy, there remains a lot of
important science to be extracted from the extragalactic component.
103:                                             
104: There currently exist several surveys of galaxies, but even SDSS -- one of the
most extended galaxy photometric and spectroscopic surveys in the optical
and near IR (roughly the spectral range of Gaia) -- covers only a fifth of
107: the sky. Gaia extends this in several ways: i) It will be able to detect about
108: 10$^7$ unresolved galaxies down to G=20 (V=20--22); ii) Gaia will be the first
109: homogeneous survey of galaxies covering the whole sky since photographic ones
110: (UK, ESO, Palomar Schmidt surveys, 3500 to 6500\,\AA) of 30 years ago; iii) The
111: spectrophotometry covers a larger spectral range (3300 to 10\,000\,\AA\ 
112: sampled in about 100 bins) than earlier surveys; iv) Gaia observes each source
113: an average of 80 times over the mission. With this we can investigate many
114: different types of galaxy, QSO and AGN variability; v) The sample will have a
115: well-defined selection function, important for estimating the galaxy density
116: in the local universe.
117: 
118: Our long-term objective is to classify and to determine the astrophysical
119: parameters of all unresolved galaxies which Gaia will observe. In order to
120: proceed with this we first need to acquire or build an appropriate library of
121: galaxy spectra. This library must show sufficient variation in those
122: intrinsic astrophysical parameters (APs) to which the Gaia observations will
123: be sensitive. To determine APs on a homogeneous system we ultimately need to
124: build or calibrate our classifiers using synthesis models and synthetic
spectra. Existing observed or synthetic libraries are too small or do not
cover the required wavelength range. For this reason we set out in this paper
to start building a new library.
128: 
We use the galaxy evolution model P\'EGASE (Projet d'\'Etude des GAlaxies
par Synth\`ese \'Evolutive) \citep{fioc2,fioc5} to synthesize galaxy spectra. The
131: P\'EGASE.2 code\footnote{http://www2.iap.fr/users/fioc/PEGASE.html} is aimed
132: principally at modelling the spectral evolution of galaxies by types: the
133: active and passive evolution of stellar populations as well as interstellar gas
134: and dust are coherently evolved in time. No galaxy number density evolution is
135: considered, although the results of our models are compatible with occasional
136: rare galaxy merging. The code is based on the stellar evolutionary tracks from
137: the Padova group, extended to the thermally pulsating asymptotic giant branch
138: (AGB) and post-AGB phases \citep{groenewegen}. These tracks cover all the
masses, metallicities and phases of interest for galaxy spectral synthesis.
140: P\'EGASE.2 uses the BaSeL 2.2 library of stellar spectra and can synthesize low
141: resolution (R=200) ultraviolet to near-infrared spectra of Hubble sequence
142: galaxies, as well as of starbursts. For a given evolutionary scenario
143: (typically characterized by a star formation law, an initial mass function and,
144: possibly, infall or galactic winds), the code consistently gives the spectral
145: energy distribution (SED) and computes the star formation rate and the
146: metallicity at any time. The nebular component (continuum and lines) due to HII
147: regions is calculated and added to the stellar component. Depending on the
148: geometry of the galaxy (disk or spheroidal), the attenuation of the spectrum by
149: dust is then computed using a radiative transfer code (which takes account of
150: the scattering).
151: 
By adopting a star formation rate proportional to the mass of the gas, the IMF of
153: \citet{rana} and the presence of infall and galactic winds, eight synthetic
154: spectra corresponding to different typical types of Hubble sequence galaxies
155: (E, S0, Sa, Sb, Sbc, Sc, Sd and Im) have already been produced using P\'EGASE.2
156: \citep{fioc3,fioc1,le2}. For each type, the values of the parameter set have
157: been fitted to the observed spectral energy distribution (SED) of nearby (z=0)
158: galaxies. For illustration a comparison with data is shown in \citet{fioc6}. At
159: higher redshifts, the evolution scenarios have been tested against most
160: existing faint galaxy samples, including the deepest surveys \citep[B=29 Hubble
161: Deep Field-N,][]{williams}. One unique model of galaxy fractions by type
162: simultaneously predicts the multi-wavelength (UV to near-IR) galaxy counts,
163: dominated by young stellar populations in the UV and old evolved galaxies in
164: the near-IR respectively.  The faint blue galaxy population, in excess in the
165: far-UV, has also been analysed \citep{fioc4}. An episodic star formation rate
166: of low level is proposed to fit the far-UV counts \citep{FOCA2000,buat}. In the
167: near-IR, the evolution scenario of elliptical galaxies predicts the puzzling
168: $K$-$z$ relation of radio galaxy hosts between z=0 and z=4. \citet{rocca} use
169: P\'EGASE.2 scenarios to interpret the galaxy distribution in the K-band Hubble
170: diagram. The same models are used to interpret the mid-IR galaxy counts \citep{rocca07},
171: although here a supplementary ultra-luminous infrared galaxy population is
172: required. Finally, the robustness of our evolution scenarios is confirmed by
173: the significant predictions of photometric redshifts as compared to
spectroscopic redshifts of the HDF-N sample \citep{le2}. Using a much larger sample
175: from the SDSS, we make an additional comparison. This is the subject of the
176: second section of this paper, made using simulated photometry and colour-colour
177: diagrams.  In section 3 we describe the production of our library based on
178: these eight typical synthetic spectra of galaxies and in section 4 we explain
179: how these are used to simulate Gaia data. In section 5 we present our
180: classification and parametrization models and give preliminary results on their
181: performance.  A brief discussion follows in section 6.
182: 
183: 
184: \section{P\'EGASE synthetic spectra and comparison with the SDSS spectra}
185: 
186: In order to determine the parameter ranges over which we should generate the
187: library, we first make a comparison of colours synthesized from the eight
188: typical P\'EGASE spectra with SDSS data. To avoid small discrepancies that
189: occur between synthesized and published SDSS
190: photometry\footnote{http://www.sdss.org/dr4/products/spectra/spectrophotometry.html}
191: and to treat both types of spectral data in the same way, we decided to
192: synthesize SDSS photometry from the SDSS spectra in the same way as we do with
193: the synthetic spectra (and using the same ``calib'' and ``colors'' programs in
194: the P\'EGASE.2 code for both). For this we use the whole set of spectroscopic
195: data for the 565\,715 galaxies that are available in data release 4 (DR4) of
196: SDSS. The properties of the SDSS filters are given in Table~\ref{t1}.
197: 
198: \begin{table}
199:  \centering
200:  \caption {Characteristics of the five SDSS filters}                                      
201:  \begin{tabular}{c c c c c}          
202:  \hline\hline                        
  Name & Average    & Starting   & Ending     & Magnitude \\
       & wavelength & wavelength & wavelength & limit in  \\
       & (\AA)      & (\AA)      & (\AA)      & survey    \\
206:  \hline
207: u      & 3551       & 2980       & 4130       & 22.0      \\
208: g      & 4686       & 3630       & 5830       & 22.2      \\
209: r      & 6165       & 5380       & 7230       & 22.2      \\
210: i      & 7481       & 6430       & 8630       & 21.3      \\
211: z      & 8931       & 7730       & 11230      & 20.5      \\
212: \hline
213: \end{tabular}
214: \label{t1}
215: \end{table}
216: 
217: Typical synthetic spectra corresponding to each of the eight Hubble types are
shown in Fig.~\ref{f1}, with the location of the SDSS filters superimposed.
Each of these ``typical spectra'' corresponds to a specific combination of
values of the astrophysical parameters (see section~\ref{most_signif}). The
SEDs produced by P\'EGASE have been normalized to the flux in a 50\,\AA\ wavelength
interval centered on 5500\,\AA. The elliptical and S0 galaxies have
223: very small differences, apparent at the two extremes of the wavelength range.
224: This implies small differences in colours but not necessarily in magnitudes
225: (which depend on their masses).
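
The normalization described above can be illustrated with a minimal sketch. It assumes that ``normalized to the flux'' means division by the mean flux inside the 50\,\AA\ window; the function name, the toy power-law SED and the 10\,\AA\ sampling are illustrative choices and are not taken from P\'EGASE.2.

```python
import numpy as np

def normalise_sed(wavelength, flux, centre=5500.0, width=50.0):
    """Divide an SED by its mean flux in a window of `width` Angstrom
    centred on `centre` (assumed reading of the normalization)."""
    wavelength = np.asarray(wavelength, dtype=float)
    flux = np.asarray(flux, dtype=float)
    in_window = np.abs(wavelength - centre) <= width / 2.0
    if not in_window.any():
        raise ValueError("no samples fall inside the normalisation window")
    return flux / flux[in_window].mean()

# Toy example: a power-law SED sampled every 10 Angstrom (hypothetical data).
wl = np.arange(2500.0, 10501.0, 10.0)
sed = normalise_sed(wl, 3.0e-5 * (wl / 5500.0) ** -1.5)
```

After this step spectra of different absolute luminosity can be compared directly in shape, which is why only colours, and not magnitudes, are constrained by the comparison.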
226: 
227: From Fig.~\ref{f1} it is obvious that the u filter is very important for the
228: comparison with real data since it is the one containing the discontinuity
around 4000\,\AA. However, the SDSS spectra do not cover the u band, so
230: photometry in this band cannot be synthesized. We refrain from using the SDSS
231: photometry for the u band because of the red leak in this
232: filter\footnote{http://www.sdss.org/dr4/products/images/index.html$\#$redleak},
233: which would render comparisons with synthetic data unreliable. This leak
234: produces erroneous magnitudes, especially for E and S0 types on
235: account of their large numbers of red stars.
236: 
237: In addition we avoid using the z filter in our comparison since 
238: its photometry also cannot be synthesized from the SDSS spectra, which terminate 
239: at shorter wavelengths than the z passband.
240: 
241: \begin{figure}
242: \centering
243: \includegraphics[width=6cm,angle=-90]{f1.ps}
244: \caption{
245:   Synthetic spectra for the eight typical galaxy types from P\'EGASE.2. The
  vertical lines denote the limits of the five SDSS filters (transmission
  below $10^{-4}$ of the peak). Emission lines are not included. The legend at
  the right defines the colour used to plot each type of galaxy (top) and SDSS
249:   filter (bottom).}
250: \label{f1}
251: \end{figure}
252: 
We therefore decided to base our comparison between the SDSS and P\'EGASE.2
data on the g, r and i filters only and, more specifically, on the g--r and r--i
colours. However, the wavelength range of the SDSS spectra does not quite
extend to the bluest side of the g filter. For this reason, we cut the blue
end of this filter and created a new g filter starting at 3830\,\AA\ instead of
3630\,\AA\ (Table~\ref{t1}). In practice this change is very small,
since the transmission of the g filter is only 3\% of the peak transmission at
3830\,\AA\ and drops very rapidly below that (e.g.\ it is only 0.5\% at just
10\,\AA\ lower). Furthermore, simulated photometry from the synthetic spectra
262: showed virtually no difference for the original and ``trimmed'' g band. The
263: published transmission curves of the SDSS filters depend on airmass and
264: whether a point or extended source is being observed.
265: We use those for extended sources and zero airmass. The
266: photometry is calibrated on the AB system, as used by SDSS \citep{fukugita}.  
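
As an illustration of how a colour can be synthesized from a spectrum on the AB system, the following is a minimal sketch and not the ``calib'' and ``colors'' programs of P\'EGASE.2. It assumes a photon-counting detector, fluxes in erg\,s$^{-1}$\,cm$^{-2}$\,\AA$^{-1}$, and hypothetical top-hat passbands in place of the real SDSS transmission curves.

```python
import numpy as np

C_ANGSTROM_PER_S = 2.99792458e18  # speed of light in Angstrom per second

def _trapezoid(y, x):
    """Plain trapezoidal integration of y(x)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def ab_magnitude(wavelength, f_lambda, transmission):
    """AB magnitude of a spectrum f_lambda (erg/s/cm^2/Angstrom) through a
    passband, for a photon-counting detector."""
    wl = np.asarray(wavelength, dtype=float)
    fl = np.asarray(f_lambda, dtype=float)
    tr = np.asarray(transmission, dtype=float)
    mean_f_nu = _trapezoid(fl * tr * wl, wl) / (
        C_ANGSTROM_PER_S * _trapezoid(tr / wl, wl))
    return -2.5 * np.log10(mean_f_nu) - 48.60

# Hypothetical top-hat passbands standing in for the real g and r curves;
# the g band is "trimmed" to start at 3830 Angstrom as described in the text.
wl = np.arange(3800.0, 9201.0, 10.0)
g_band = ((wl >= 3830.0) & (wl <= 5830.0)).astype(float)
r_band = ((wl >= 5380.0) & (wl <= 7230.0)).astype(float)

# A source with constant f_nu = 3631 Jy has AB magnitude 0 in every band.
f_nu = 3.631e-20                        # erg/s/cm^2/Hz
f_lambda = f_nu * C_ANGSTROM_PER_S / wl ** 2
g_mag = ab_magnitude(wl, f_lambda, g_band)
g_minus_r = g_mag - ab_magnitude(wl, f_lambda, r_band)
```

Applying the same routine to both observed and synthetic spectra is what keeps the two sets of colours on a common footing.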
267: 
268: We synthesize photometry using the one-dimensional spectra from DR4, which are
269: supplied with additional analysis information, such as redshift and emission
270: line parameters. In order to select data suitable for our purposes, we
271: applied the following criteria: the galaxies should not be near a CCD edge nor
saturated, and they should not have a very low SNR (the photometric error in all
273: bands should be less than 0.1 mag). Only spectra with redshifts below 0.01 are
274: retained, since the synthetic spectra of P\'EGASE.2 were produced at zero
275: redshift. These criteria resulted in a sample of 1292 galaxies. Their
276: synthesized photometry plus that for the eight typical galaxy types from
277: P\'EGASE.2 is shown in Fig.~\ref{f2}. This figure clearly shows that the
278: colours of the Im, Sd, Sc, Sbc, Sb and Sa types are generally in good
279: agreement with the colours of the observed spectra, although in the case of
280: S0 and E types the synthetic spectra seem to be slightly redder in g$-$r than
281: the SDSS spectra.
282: 
283: 
284: \begin{figure}
285: \centering
286: \includegraphics[width=6cm,angle=-90]{f2.ps}
287: \caption{Colour--colour (g$-$r vs.\ r$-$i) diagram of synthesized photometry of SDSS galaxy spectra (black) and synthetic photometry of the eight typical galaxy types generated from the 
288:   P\'EGASE models (red points).}
289: \label{f2}
290: \end{figure}
291: 
292: 
293: 
294: \section{The library of synthetic spectra}\label{library}
295: 
296: \subsection{The most significant parameters}\label{most_signif}
297: 
298: Each spectrum in our library is uniquely defined by a set of 17 astrophysical
299: parameters, plus the morphological type (E, S0, Sa, Sb, Sbc, Sc, Sd or Im).
The four most significant APs are: p1 and p2 of the star formation scenario
(star formation rate $M_{\rm gas}^{p_1}/p_2$); the infall timescale; the age of the galactic winds. The
302: age of the galactic winds is non-zero only for E and S0 galaxies. Note that
303: the Hubble type is not an independent parameter, as only certain ranges of the
304: APs are available for each type (as will be detailed later).
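
To give a feeling for the roles of p1, p2 and the infall timescale, the following toy sketch integrates a gas budget with a star formation rate $M_{\rm gas}^{p_1}/p_2$. It is not the P\'EGASE.2 code: it assumes masses normalised to a total of unity, an exponential infall law, and no mass return from stars or galactic winds. The function name and the Euler time step are illustrative.

```python
import numpy as np

def toy_history(p1, p2, t_infall, t_end=13000.0, dt=1.0):
    """Euler integration of a toy gas budget (times in Myr, masses
    normalised so that the total accreted mass tends to 1).

    Assumptions (not from the paper): exponential infall, no stellar
    mass return, no galactic winds."""
    n_steps = int(round(t_end / dt))
    gas, stars = 0.0, 0.0
    sfr = np.empty(n_steps)
    for k in range(n_steps):
        t = k * dt
        infall = np.exp(-t / t_infall) / t_infall  # assumed infall law
        sfr[k] = gas ** p1 / p2                    # law quoted in the text
        consumed = min(sfr[k] * dt, gas)           # cannot use more gas than exists
        gas += infall * dt - consumed
        stars += consumed
    return gas, stars, sfr

# Values of the typical Sbc model quoted in the text.
gas, stars, sfr = toy_history(p1=1.0, p2=5714.0, t_infall=6000.0)
```

In this toy picture a smaller p2 or a shorter infall timescale converts gas into stars earlier, which is qualitatively the direction of redder present-day colours.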
305: 
306: In order to investigate the influence of each of the parameters p1, p2 and
infall timescale on the integrated galaxy spectrum (SED), we modified the
308: parameters of the Sbc model (an intermediate type) over a range of values. In
309: the typical model for the Sbc type the values were 1, 5714 Myr/$M_{\odot}$ and 6000 Myr for
310: p1, p2 and infall timescale, respectively. In the modified models we vary p1
311: between 0.4 and 2, p2 from 100 to 20000 Myr/$M_{\odot}$ and infall from 100 to 10000 Myr. The
312: results are shown in Figs.~\ref{f3}--\ref{f5}.
313: 
314: To investigate the effect of the age of the galactic winds parameter we followed
315: the same procedure but now with the elliptical model. In the typical model for
316: the E type the age is 1~Gyr and we vary it between 0.1 and 7.5 Gyr
317: (Fig.~\ref{f6}).
318: 
319: From the figures we see that these four parameters have a major effect on the
320: colours. We performed similar analyses for other APs and concluded that they
321: had a much smaller impact on the data (in particular once the spectra are
322: reduced to the Gaia resolution). Therefore, the spectra in the present library
323: show variance only in these four APs.
324: 
325: \begin{figure}
326: \centering
327: \includegraphics[width=6cm,angle=-90]{f3.ps}
328: \caption{Colour--colour (g--r vs r--i) diagram of synthesized photometry of SDSS galaxy spectra (black) and of synthetic P\'EGASE 
329:   spectra of the typical Sbc model (yellow) and the models of Sbc with
330:   different values of p1 (red). The largest g--r corresponds to p1=1 and the
331:   smallest g--r to p1=2.}
332: \label{f3}
333: \end{figure}
334: 
335: \begin{figure}
336: \centering
337: \includegraphics[width=6cm,angle=-90]{f4.ps}
\caption{Colour--colour (g--r vs r--i) diagram of synthesized photometry of SDSS galaxy spectra (black) and of synthetic P\'EGASE 
339:   spectra of the typical Sbc model (yellow) and the models of Sbc with
340:   different values of p2 (red). The largest g--r corresponds to p2=2000 Myr/$M_{\odot}$ and the
341:   smallest g--r to p2=20000 Myr/$M_{\odot}$.}
342: \label{f4}
343: \end{figure}
344: 
345: \begin{figure}
346: \centering
347: \includegraphics[width=6cm,angle=-90]{f5.ps}
348: \caption{Colour--colour (g--r vs r--i) diagram of synthesized photometry of SDSS galaxy spectra (black) and of synthetic P\'EGASE 
349: spectra of the typical Sbc model (yellow) and the models of Sbc with different values of infall timescale (red). The largest 
g--r corresponds to infall timescale=100\,Myr and the smallest g--r to infall timescale=10\,Gyr.}
351: \label{f5}
352: \end{figure}
353: 
354: \begin{figure}
355: \centering
356: \includegraphics[width=6cm,angle=-90]{f6.ps}
357: \caption{Colour--colour (g--r vs r--i) diagram of synthesized photometry of SDSS galaxy spectra (black) and of synthetic P\'EGASE 
358: spectra of the typical E model (yellow) and the models of E with different values of age of galactic winds (red). The 
largest g--r corresponds to age of galactic winds=7.5\,Gyr and the smallest g--r to 0.1\,Gyr.}
360: \label{f6}
361: \end{figure}
362: 
363: By co-varying these four parameters and using all their combinations in each of
364: the eight typical models we are able to cover most of the variance we see in
365: the SDSS data in the colour--colour diagram. Generally, there is no clear
distinction between the colours of neighbouring Hubble types. In order to
assign a definite type to each model in our library, we decided (as a first working approximation) to
retain only those models which lie within a circle (in the colour--colour diagram)
centered on one of the eight typical models and with a radius equal to half of
the distance to the nearest neighbouring typical model. This is reasonable
since the models lie mostly along a one-dimensional locus (a line) in the
colour--colour diagram. In this way upper and lower limits of the values of
the parameters were established for each type, although an overlap
in APs (if not in colours) remains, as can be seen in Table~\ref{t2}.
375: This leaves a set of 888 synthetic spectra of known types of
376: galaxies (see section~\ref{regular_grid}).
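
The circle selection described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code used for the library: colours are treated as points in the (g--r, r--i) plane with Euclidean distances, and the function name and the toy coordinates are hypothetical.

```python
import numpy as np

def assign_types(model_colours, typical_colours):
    """Label each model with the index of the typical model whose circle
    contains it, or -1 if it lies in no circle.

    The radius of each circle is half the distance from its typical model
    to the nearest other typical model, so circles never overlap."""
    models = np.asarray(model_colours, dtype=float)
    typical = np.asarray(typical_colours, dtype=float)
    # Pairwise distances between the typical models.
    d_tt = np.linalg.norm(typical[:, None, :] - typical[None, :, :], axis=2)
    np.fill_diagonal(d_tt, np.inf)
    radii = 0.5 * d_tt.min(axis=1)
    # Distances from every model to every typical model.
    d_mt = np.linalg.norm(models[:, None, :] - typical[None, :, :], axis=2)
    inside = d_mt <= radii[None, :]
    labels = np.where(inside.any(axis=1), inside.argmax(axis=1), -1)
    return labels, radii

# Toy colours for three typical models and four candidate models.
typical = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
candidates = [[0.1, 0.0], [0.9, 0.1], [1.7, 0.0], [2.5, 0.0]]
labels, radii = assign_types(candidates, typical)
```

Models labelled $-1$ fall between two typical models and are the ones discarded by the selection.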
377: 
378: The galaxy type can be considered as a 5th AP, although it is of a different
379: nature than the others, since it is needed to fully specify the spectrum and
380: constrain the range of values of the other four APs. In addition, when one
381: redshifts the spectrum to non-zero values of z, this quantity also becomes a
382: parameter (albeit not intrinsic to the source).
383: 
384: \subsection{Library of galaxy spectra over a regular grid of parameters}\label{regular_grid}
385: 
386: Applying the above procedures, we produced a library of 888 synthetic spectra
387: covering seven separate Hubble types (because we consider E and S0 as a single
type). The values of the four parameters of each type are given in Table~\ref{t2},
while the values of the other input parameters of P\'EGASE.2 (kept
constant in all models) are given in Table~\ref{t3}. The models are plotted
391: in Fig.~\ref{f7}, where the simulated colours of the 888 synthetic spectra and
392: the 1292 SDSS spectra are compared. This first set of 888 synthetic spectra
393: was then calculated at five values of redshift: 0, 0.05, 0.1, 0.15, 0.2,
394: resulting in a total of 4440 spectra.
395: 
396: 
397: \begin{table*}
398:  \centering
399:  \caption {The four astrophysical parameter (AP) ranges for each Hubble type in
400: the regular library of P\'EGASE synthetic spectra. Note that the AP ranges for
401: each Hubble type partially overlap. The morphological type can be considered as
402: an additional (but non-independent) parameter, required to fully explain the
403: variance in the library. The final column (N) gives the number of spectra for
404: each type (which sum to 888). See the regular library grid in \citet{le2} for
405: comparison.}                                      
406:  \begin{tabular}{c c c c c c}          
407:  \hline\hline                        
408:   
409: Type & p1      & p2          & infall     & galactic winds & N  \\
     &         &(Myr/$M_{\odot}$) &(Myr)  &(Gyr)           &    \\
411:  \hline
412: E-S0 & 0.6-1.5 & 100-1500    & 100-2500   & 0.1-7.5        & 327\\
413: Sa   & 0.8-1.5 & 500-2500    & 2500-3500  & none           & 10 \\
414: Sb   & 0.6-1.5 & 1500-6000   & 2000-4500  & none           & 25 \\
415: Sbc  & 0.4-1.5 & 2000-10000  & 4000-7000  & none           & 148\\
416: Sc   & 0.6-1.5 & 6000-14000  & 7000-10000 & none           & 68 \\
417: Sd   & 0.4-1.5 & 10000-18000 & 7000-10000 & none           & 65 \\
418: Im   & 1.0-2.0 & 14000-20000 & 7000-10000 & none           & 245\\
419: \hline
420: \end{tabular}
421: \label{t2}
422: \end{table*}
423: \begin{table*}
424:  \centering
425:  \caption {The values of the parameters of the P\'EGASE models which are kept
426: constant in the library \citep{fioc2}.}                                      
427:  \begin{tabular}{c c}          
428:  \hline\hline                        
429: Parameters & Values \\
430: \hline  
431: SNII Ejecta of massive stars                     & model B of \citet{woosley}\\
432: Stellar winds                                    & yes\\
433: Initial mass function                            & \citet{rana}\\
434: Lower mass                                       & 0.09 solar masses \\
435: Upper mass                                       & 120.00 solar masses\\
436: Fraction of close binary systems                 & 0.05\\ 
437: Initial metallicity                              & 0.00 \\
438: Metallicity of the infalling gas                 & 0.00\\
439: Consistent evolution of the stellar metallicity  & yes \\
440: Mass fraction of substellar objects              & 0.00\\
441: Nebular emission                                 & yes \\
442: Extinction                                       & disk geometry: inclination-averaged \\
443:                                                  & for Sa, Sb, Sbc, Sc, Sd and Im  \\
444:                                                  & spheroidal geometry for E-S0 \\
445: Age                                              & 13 Gyr  for E-S0,Sa, Sb, Sbc, Sc \& Sd \\
446:                                                  & 9 Gyr for Im \\
447: \hline
448: \end{tabular}
449: \label{t3}
450: \end{table*}
451: 
452: 
453: \subsection{Extension of the library to random values of parameters}
454: 
After producing the regular synthetic spectral grid (Table~\ref{t2}), we
456: proceed to produce synthetic spectra of galaxies with parameters selected
457: from a random distribution, in order to achieve a more continuous coverage in
458: colour space. Such grids permit more robust tests of parameter estimation
459: algorithms than do regular grids. Each parameter is selected independently
460: from a uniform distribution over the parameter ranges in the regular grid. We
used this approach to generate 5500 models. In doing this we keep
approximately the same ratios between the Hubble types as in the regular grid.
463: Because the parameter ranges for each galaxy type in Table \ref{t2} show some overlap,
464: a random draw may produce a set of parameters which fits into more than one
465: Hubble type category. To remove this ``degeneracy'' we again apply the circle
466: removal method we used in section~\ref{most_signif}. This results in a
467: ``non-degenerate'' sample of 2709 spectra.
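
A minimal sketch of the random draw is given below. The parameter ranges are copied from Table~\ref{t2}; the dictionary layout, the function name and the random seed are illustrative assumptions, and the subsequent circle selection is not repeated here.

```python
import numpy as np

# Ranges from Table 2: p2 in Myr/Msun, infall in Myr, winds in Gyr.
AP_RANGES = {
    "E-S0": {"p1": (0.6, 1.5), "p2": (100, 1500),
             "infall": (100, 2500), "winds": (0.1, 7.5)},
    "Sa":   {"p1": (0.8, 1.5), "p2": (500, 2500), "infall": (2500, 3500)},
    "Sb":   {"p1": (0.6, 1.5), "p2": (1500, 6000), "infall": (2000, 4500)},
    "Sbc":  {"p1": (0.4, 1.5), "p2": (2000, 10000), "infall": (4000, 7000)},
    "Sc":   {"p1": (0.6, 1.5), "p2": (6000, 14000), "infall": (7000, 10000)},
    "Sd":   {"p1": (0.4, 1.5), "p2": (10000, 18000), "infall": (7000, 10000)},
    "Im":   {"p1": (1.0, 2.0), "p2": (14000, 20000), "infall": (7000, 10000)},
}

def draw_parameters(galaxy_type, n, rng):
    """Draw n parameter sets for one type, each AP independently and
    uniformly within its range."""
    ranges = AP_RANGES[galaxy_type]
    return {name: rng.uniform(lo, hi, size=n)
            for name, (lo, hi) in ranges.items()}

rng = np.random.default_rng(0)
sbc_sample = draw_parameters("Sbc", 1000, rng)
e_s0_sample = draw_parameters("E-S0", 10, rng)
```

Because the ranges of neighbouring types overlap, some draws made for one type reproduce colours closer to another, which is the degeneracy that the circle selection removes.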
468: 
469: \begin{figure}
470: \centering                    
471: \includegraphics[width=6cm,angle=-90]{f7.ps}
472: \caption{ Colour--colour (g--r vs.\ r--i) diagram of synthesized photometry of
473: SDSS galaxy spectra (black) and of synthetic P\'EGASE spectra of the 8 typical
474: models of P\'EGASE.2 (yellow). Moving from the lower left to the upper right
475: part of the diagram we encounter types from Im to E. The red dots along both
476: sides of the typical models represent the spectra of both the regular and
477: random library.}
478: \label{f7}
479: \end{figure}
480: 
481: A comparison of the simulated colours of the synthetic spectra (888 regular
482: grid plus 2709 random grid, at zero redshift) with the colours of SDSS spectra
483: is shown in Fig. \ref{f7}. One sees that the new set of spectra is in very
484: good agreement with the SDSS data, except for the small differences in the E
485: and S0 galaxies.
486: 
487: In summary, we have produced a library of 7149 synthetic galaxy spectra (888
488: spectra of the regular grid for 5 values of redshift and 2709 of the random
489: grid at zero redshift) which can be used as an initial library of unresolved
490: galaxy spectra for assessing the possibilities of galaxy classification and
491: parametrization with Gaia. This library was created at the resolution of the
492: BaSeL 2.2 stellar library 
493: (gradually changing from 8\,\AA\ at 2500\,\AA\ to 50\,\AA\ at 10\,500\,\AA),
494: which is not quite high enough for the Gaia simulation software (which
requires 10\,\AA). Therefore, we linearly interpolated our spectra onto a
uniform 10\,\AA\ grid over the wavelength range
497: 2500--10\,500\,\AA. Higher resolution spectra will be produced in future work
498: using the High-spectral Resolution code P\'EGASE-HR \citep{le1}.
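
The resampling step can be sketched with a plain linear interpolation. The function name and the toy input spectrum are illustrative; only the target grid (10\,\AA\ steps from 2500 to 10\,500\,\AA) is taken from the text.

```python
import numpy as np

def resample_linear(wavelength, flux, start=2500.0, stop=10500.0, step=10.0):
    """Linearly interpolate a spectrum onto a uniform wavelength grid.
    `wavelength` must be strictly increasing."""
    new_wavelength = np.arange(start, stop + 0.5 * step, step)
    return new_wavelength, np.interp(new_wavelength, wavelength, flux)

# Toy spectrum on an irregular grid; a straight line is reproduced exactly.
coarse_wl = np.array([2500.0, 2508.0, 2530.0, 4000.0, 7000.0, 10500.0])
coarse_flux = 2.0 * coarse_wl + 1.0
new_wl, new_flux = resample_linear(coarse_wl, coarse_flux)
```

Linear interpolation adds no information: the resampled spectra are more finely sampled, but their effective resolution remains that of the BaSeL 2.2 library.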
499: 
500: \section{Simulated Gaia spectra}
501: 
502: The Gaia spectrophotometer is a slitless prism spectrograph comprising blue
503: and red channels (called BP and RP respectively) which operate over the
wavelength ranges 3300--6800\,\AA\ and 6400--10\,500\,\AA\ respectively. BP
505: and RP spectra were simulated for all 7149 library spectra using the simulator
506: developed by \citet{brown}. Each of BP and RP is simulated with 48 pixels,
507: whereby the dispersion varies from 30--290\,\AA/pix and 60--150\,\AA/pix
508: respectively. We artificially reddened each spectrum with a standard
509: interstellar extinction law with R=3.1, for regular values of $A_{V}$ from 0
510: to 10 for the regular library, and for 10 random values of $A_{V}$ uniformly
distributed in $\log(1+A_{V})$ for the random library. Noise was added to
512: all spectra, which includes the source Poisson noise, background Poisson noise
513: and CCD readout noise. This is done for five different source G-band
514: magnitudes (15, 17, 18, 19 and 20). For the following classification tests we
515: use only the sample at G=18. In Fig. \ref{f8} we present the simulated BP and RP spectra
516: for the eight typical synthetic spectra of galaxies. 
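
The reddening and noise steps can be illustrated with the sketch below. It is not the simulator of \citet{brown}: the extinction curve $A_\lambda/A_V$ is supplied by the caller (the inverse-wavelength curve in the example is a placeholder, not the law used in the paper), counts are assumed to be in electrons per pixel, and all function names and numerical values in the example are hypothetical.

```python
import numpy as np

def draw_av(n, rng, av_max=10.0):
    """Draw A_V uniformly in log10(1 + A_V) between A_V = 0 and av_max.
    (The base of the logarithm does not change the distribution.)"""
    u = rng.uniform(0.0, np.log10(1.0 + av_max), size=n)
    return 10.0 ** u - 1.0

def redden(flux, a_lambda_over_av, a_v):
    """Attenuate a spectrum given a caller-supplied curve A_lambda / A_V."""
    curve = np.asarray(a_lambda_over_av, dtype=float)
    return np.asarray(flux, dtype=float) * 10.0 ** (-0.4 * a_v * curve)

def add_noise(source_counts, background_counts, read_noise, rng):
    """Add source and background Poisson noise plus Gaussian readout noise,
    then subtract the expected background level."""
    shape = np.shape(source_counts)
    observed = (rng.poisson(source_counts)
                + rng.poisson(background_counts, size=shape)
                + rng.normal(0.0, read_noise, size=shape))
    return observed - background_counts

rng = np.random.default_rng(1)
pixel_wl = np.linspace(3300.0, 6800.0, 48)       # 48 BP pixels (toy grid)
toy_curve = 5500.0 / pixel_wl                    # placeholder extinction curve
a_v_values = draw_av(10, rng)
clean = np.full(48, 2000.0)                      # hypothetical counts per pixel
reddened = redden(clean, toy_curve, a_v_values[0])
noisy = add_noise(reddened, background_counts=50.0, read_noise=5.0, rng=rng)
```

Drawing $A_V$ uniformly in $\log(1+A_V)$ places more samples at low extinction while still covering the full range up to 10 mag.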
517: 
518: \begin{figure}
519: \centering                    
520: \includegraphics[width=6cm,angle=-90]{f8.ps}
521: \caption{ The simulated BP and RP spectra of the synthetic spectra for the
522: eight typical galaxy types from P\'EGASE.2. Black, green, blue, yellow,
523: magenta, light blue and red denote galaxies of type E, Sa, Sb, Sbc, Sc, Sd and
524: Im respectively.}
525: \label{f8}
526: \end{figure}
527: 
528: 
529: \section{Classification \& Parametrization}
530: 
531: In the present work we use classification Support Vector Machines (SVMs)
532: (C-classification) to determine morphological types and regression SVMs
533: ($\epsilon$-regression) to estimate the various astrophysical parameters. We
534: use the libsvm library of \citet{libsvm} implemented in the \verb+e1071+
535: package in the R statistics package.\footnote{{\tt http://www.r-project.org}} A
536: brief description of the SVMs is given in the Appendix of this paper. An
537: accessible introduction to SVMs can be found in \citet{bennett00}. For a more
538: technical introduction, the tutorial by \citet{burges98} is recommended.
539: 
540: 
541: \subsection{Galaxies at zero redshift}
542: 
543: \subsubsection{Classification of the morphological type}\label{classify}
544: 
We now classify the set of Gaia-simulated galaxy spectra, at G=18 with
zero redshift, into the seven Hubble types. This subset of the library
includes characteristic noise and a wide range of interstellar extinction
(0--10 mag in $A_{V}$). It comprises 9691 spectra, which we divide at
random into two subsets: 4846 for training the SVM classifiers and 4845 for
evaluating their performance. As is advisable with many machine learning
methods, we first normalized the data by scaling each input (pixel) to have
zero mean and unit standard deviation.
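
As a sketch of this normalization, the flux $x_{ij}$ in pixel $j$ of spectrum $i$ is replaced by
\begin{displaymath}
\tilde{x}_{ij} = \frac{x_{ij}-\mu_j}{\sigma_j}, \qquad j=1,\ldots,96,
\end{displaymath}
where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of pixel $j$ over the spectra. The symbols $x_{ij}$, $\mu_j$ and $\sigma_j$ are introduced here for illustration only, and the text does not specify whether the statistics are computed over the full sample or over the training set alone.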
553: 
554: For the purpose of visualizing the data set only, we perform a Principal
555: Components Analysis (PCA) on the set of 9691 96-dimensional Gaia spectra.
556: The first three Principal Components describe 78.25\%, 20.44\% and 1.02\% of
557: the data variance respectively (i.e.\ 99.71\% together).\footnote{Note that,
558: because each input dimension has already been normalized to have zero mean
559: and unit variance, a considerable fraction of the total variance is already
560: accounted for.} In Fig. \ref{f9} we plot the data in projection onto the
first three PCs. This diagram, together with the fact that the first three PCs explain
almost all of the variance in the data, suggests that a good classification
should be possible (the data have an intrinsically low dimensionality).
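
For reference, the fraction of variance quoted for each Principal Component follows the usual definition, written here with $\lambda_k$ (our notation) for the eigenvalues of the covariance matrix of the 96-dimensional data:
\begin{displaymath}
f_k = \frac{\lambda_k}{\sum_{m=1}^{96}\lambda_m}, \qquad
f_1+f_2+f_3 = 0.7825+0.2044+0.0102 = 0.9971 .
\end{displaymath}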
564: 
565: \begin{figure}
566: \centering                    
567: \includegraphics[width=6cm,angle=-90]{f9.ps}
568: \caption{ The 9691 simulated Gaia galaxy spectra with z=0 plotted as their
569: projections onto the first three Principal Components. Black, green, blue,
570: light blue, magenta, yellow and red denote galaxies of type E, Sa, Sb, Sbc, Sc,
571: Sd and Im respectively.}
572: \label{f9}
573: \end{figure}
574: 
575: \begin{table}
576:       
577:  \centering 
 \caption{Galaxy classification with the SVM. The confusion matrix for the
training set for galaxies at $z=0$. Columns indicate the true class, rows the
predicted class.}
581:  \begin{tabular}{l | c c c c c c c}          
582:  \hline\hline                        
583:   
584: Type & E-S0 & Sa & Sb & Sbc & Sc & Sd & Im \\
585:  \hline
586: E-S0 & 1799  & 0    & 0  & 0   & 0   & 0   & 0   \\
587: Sa   & 0     & 1366 & 0  & 0   & 0   & 0   & 0   \\
588: Sb   & 0     & 0    & 53 & 5   & 0   & 0   & 0   \\ 
589: Sbc  & 0     & 0    & 0  & 134 & 0   & 0   & 0   \\
590: Sc   & 0     & 0    & 0  & 0   & 830 & 0   & 0   \\ 
591: Sd   & 0     & 0    & 0  & 0   & 0   & 347 & 1   \\
592: Im   & 0     & 0    & 0  & 0   & 0   & 0   & 311 \\
593: 
594: \hline
595: \end{tabular}
596: \label{t4}
597:  \centering
598:  \caption {As Table~\ref{t4} but for the test set.}
599:  \begin{tabular}{l | c c c c c c c}          
600:  \hline\hline                        
601:   
602: Type & E-S0 & Sa & Sb & Sbc & Sc & Sd & Im \\
603:  \hline
604: E-S0 & 1798 & 0    & 0  & 0   & 0   & 0   & 0   \\
605: Sa   & 0    & 1329 & 0  & 0   & 0   & 0   & 0   \\
606: Sb   & 0    & 0    & 44 & 0   & 0   & 0   & 0   \\ 
607: Sbc  & 0    & 0    & 4  & 137 & 0   & 0   & 0   \\
608: Sc   & 0    & 0    & 0  & 1   & 797 & 0   & 0   \\ 
609: Sd   & 0    & 0    & 0  & 0   & 0   & 394 & 6   \\
610: Im   & 0    & 0    & 0  & 0   & 0   & 3   & 324 \\
611: 
612: \hline
613: \end{tabular}
614: \label{t5}
615: \end{table}
616:                         
The results of training and testing the SVM classifier on the full 96-pixel
spectra are shown in Tables \ref{t4} and \ref{t5}. There are very
few misclassifications: only 6 and 14 in the training and test sets,
corresponding to error rates of 0.12\% and 0.29\% respectively. While these
results are very promising, it must be recalled that the library was
constructed so as to avoid class overlap in the SDSS g$-$r, r$-$i colour space,
which surely eases separation in the 96-dimensional BP/RP colour space.
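
The quoted error rates can be checked from the off-diagonal entries of Tables \ref{t4} and \ref{t5}, taking the stated sizes of the training and test sets as denominators (the symbols $e_{\rm train}$ and $e_{\rm test}$ are ours):
\begin{displaymath}
e_{\rm train} = \frac{5+1}{4846} \simeq 0.12\%, \qquad
e_{\rm test} = \frac{4+1+6+3}{4845} \simeq 0.29\% .
\end{displaymath}
All of these confusions occur between neighbouring types in the sequence Sb, Sbc, Sc and Sd, Im.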
624: 
625: \subsubsection{Regression of astrophysical parameters}
626: 
627: In addition to simulating an output spectrum, P\'EGASE.2 also derives 18
628: output astrophysical parameters for each galaxy. Of course, by construction we know that our
629: synthetic spectra are uniquely defined by five parameters (p1, p2, infall
630: timescale, age of the galactic winds and the Hubble type), so there can only
631: be five equivalent independent parameters amongst these 18. Nonetheless, it
632: would be useful to predict them directly. Here we build SVM regression models
633: to separately predict the nine most significant ones (listed in Table
634: \ref{t6}). For each model we train on a randomly selected set of 4846 spectra
635: and evaluate performance on the remaining 4845. In Fig.~\ref{f10} we present
the true and the SVM-predicted values of each parameter on the test set. Table
\ref{t6} summarizes this by giving the mean of the difference between the true
and predicted values for each parameter (which measures the systematic error)
as well as the scatter (standard deviation) of this difference, both expressed
as a fraction of the mean true value. The plots and
table indicate that we can predict the parameters to good accuracy and
precision: the systematic errors are at most about 1\% and the scatter is
2--9\% of the mean true value.
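
Reading the column headings of Table~\ref{t6} literally, the two summary statistics for a parameter with true values $p_i$ and predicted values $\hat{p}_i$ over the $N$ test spectra can be sketched as below. This interpretation, and the symbols $p_i$, $\hat{p}_i$, $b$ and $s$, are ours.
\begin{displaymath}
b = \frac{1}{\bar{p}}\,\frac{1}{N}\sum_{i=1}^{N}(p_i-\hat{p}_i), \qquad
s = \frac{{\rm sd}(p_i-\hat{p}_i)}{\bar{p}}, \qquad
\bar{p}=\frac{1}{N}\sum_{i=1}^{N}p_i .
\end{displaymath}
For example, the mass to light ratio has $b=-1.03\times10^{-2}$ and $s=3.78\times10^{-2}$, i.e.\ a bias of about $-1\%$ and a scatter of about 4\% of the mean value.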
643: 
644: 
645: \begin{figure*}[t]
646:   \setlength{\unitlength}{1cm}
647: \begin{picture}(18,15)
648: \put(0,15){\special{psfile=f10a.ps hoffset=0 voffset=0 hscale=20  
649: vscale=20 angle=-90}}
650: \put(0,10){\special{psfile=f10b.ps hoffset=0 voffset=0 hscale=20  
651: vscale=20 angle=-90}}
652: \put(0,5){\special{psfile=f10c.ps hoffset=0 voffset=0 hscale=20  
653: vscale=20 angle=-90}}
654: \put(6,15){\special{psfile=f10d.ps hoffset=0 voffset=0 hscale=20  
655: vscale=20 angle=-90}}
656: \put(6,10){\special{psfile=f10e.ps hoffset=0 voffset=0 hscale=20  
657: vscale=20 angle=-90}}
658: \put(6,5){\special{psfile=f10f.ps hoffset=0 voffset=0 hscale=20  
659: vscale=20 angle=-90}}
660: \put(12,15){\special{psfile=f10g.ps hoffset=0 voffset=0 hscale=20  
661: vscale=20 angle=-90}}
662: \put(12,10){\special{psfile=f10h.ps hoffset=0 voffset=0 hscale=20  
663: vscale=20 angle=-90}}
664: \put(12,5){\special{psfile=f10i.ps hoffset=0 voffset=0 hscale=20  
665: vscale=20 angle=-90}}
666: \end{picture}
667: \caption{
668:   Galaxy parameter estimation performance. For each of the nine APs we plot
669:   the predicted vs.\ true AP values for the test set. The red line indicates
670:   the line of perfect estimation. The summary errors are given in
671:   Table~\ref{t6}.}
672: \label{f10}
673: \end{figure*}
674: 
675: \begin{table*}
676:  \centering
 \caption{Summary of the performance of the SVM regression models for
predicting the nine APs listed. The sample is for zero redshift but for
interstellar extinction ($A_{V}$) varying from 0 to 10\,mag. The second and
third columns list the mean and RMS errors respectively, each divided by the
mean true value. The final column gives
the number of support vectors in the SVM model.}
682:  \begin{tabular}{l c c c}          
683:  \hline\hline                        
684:   
Astrophysical Parameter                                  & mean(real$-$predicted)/mean(real) & sd(real$-$predicted)/mean(real) & SVs   \\
 \hline
mass to light ratio (M/L)                                & $-1.03\times10^{-2}$ & $3.78\times10^{-2}$ & 97    \\
normalized star formation rate (SFR)                     & $-3.35\times10^{-3}$ & $3.97\times10^{-2}$ & 2285  \\
metallicity of interstellar medium (Mim)                 & $-2.85\times10^{-3}$ & $8.77\times10^{-2}$ & 345   \\
metallicity of stars averaged on mass (Msm)              & $-3.64\times10^{-4}$ & $2.17\times10^{-2}$ & 3544  \\
normalized mass of gas (Mgas)                            & $4.52\times10^{-3}$  & $4.29\times10^{-2}$ & 190   \\
normalized mass in stars (Ms)                            & $3.22\times10^{-4}$  & $5.48\times10^{-2}$ & 1639  \\
mean age of stars averaged on bolometric luminosity (Al) & $1.45\times10^{-3}$  & $3.22\times10^{-2}$ & 3566  \\
normalized SNIa rate (SNIa)                              & $9.69\times10^{-4}$  & $3.43\times10^{-2}$ & 376   \\
normalized SNII rate (SNII)                              & $-6.04\times10^{-4}$ & $3.81\times10^{-2}$ & 2247  \\
696: \hline
697: \end{tabular}
698: \label{t6}
699: \end{table*}
700: 
701: \subsection{Galaxies with redshift}
702: 
703: \subsubsection{Regression of redshift and classification of morphological type}
704: 
We now enlarge the subset of the library used in the previous tests by
adding the same galaxies at four nonzero values of redshift, specifically
$z=0.05$, 0.1, 0.15 and 0.2. The library for $z=0$ includes 9691 galaxies as described above.
For each nonzero redshift there are 9757,
giving a total sample of 48\,719 galaxies. (Recall that this includes each
galaxy simulated at 11 regular values of $A_{V}$.) We now build another
morphological type classification model as in Sect.~\ref{classify}, this time
with 6719 galaxies in the training set and 42\,000 galaxies in the test set.
713: 
We again applied a PCA to the data. This time the first three Principal
Components describe 76.01\%, 21.63\% and 1.02\% of the data variance
respectively (i.e.\ 98.66\% together), very similar to before. The
corresponding PCA projection plot is Fig.~\ref{f11}. Comparing with Fig.~\ref{f9} we
can see how the redshift spreads out the previous loci of types.
The performance of the SVM classifier is summarized in Tables \ref{t7} and
\ref{t8}. The performance is good considering the added complexity introduced
by the redshift variations (and the corresponding increase in the sample
size). The misclassification errors are 0.13\% and 0.98\%, corresponding to 9
and 411 galaxies for the training and test data respectively.
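
The sample sizes and error rates of this experiment can be verified with simple arithmetic (the symbols $N_{\rm tot}$, $e_{\rm train}$ and $e_{\rm test}$ are ours):
\begin{eqnarray*}
N_{\rm tot} & = & 9691 + 4\times 9757 = 48\,719 = 6719 + 42\,000, \\
e_{\rm train} & = & 9/6719 \simeq 0.13\%, \qquad e_{\rm test} = 411/42\,000 \simeq 0.98\% .
\end{eqnarray*}
The test set is therefore more than six times larger than the training set, in contrast to the roughly equal split used at zero redshift.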
724: 
725: 
726: \begin{figure}
727:   \centering \includegraphics[width=6cm,angle=-90]{f11.ps}
 \caption{The 48\,719 simulated Gaia galaxy spectra at redshifts from $z=0$ to
$z=0.2$ plotted as their projections onto the first three Principal Components. Black,
green, blue, light blue, magenta, yellow and red denote galaxies of type E, Sa,
Sb, Sbc, Sc, Sd and Im respectively.}
732: \label{f11}
733: \end{figure}
734: 
735: \begin{table}
736:  \centering
 \caption{Galaxy classification with the SVM. The confusion matrix for the
training set for galaxies at $z=0.0$, 0.05, 0.1, 0.15 and 0.2. Columns indicate the
true class, rows the predicted class.}
740:  \begin{tabular}{l | c c c c c c c}          
741:  \hline\hline                        
742:   
743: Type & E-S0 & Sa & Sb & Sbc & Sc & Sd & Im \\
744:  \hline
745: E-S0 & 2512 & 0    & 0  & 0   & 0    & 0   & 0   \\
746: Sa   & 0    & 1828 & 0  & 0   & 0    & 0   & 0   \\
747: Sb   & 0    & 0    & 74 & 2   & 0    & 0   & 0   \\ 
748: Sbc  & 0    & 0    & 1  & 183 & 1    & 0   & 0   \\
749: Sc   & 0    & 0    & 0  & 0   & 1115 & 0   & 0   \\ 
750: Sd   & 0    & 0    & 0  & 0   & 0    & 536 & 4   \\
751: Im   & 0    & 0    & 0  & 0   & 0    & 1   & 462 \\
752: 
753: \hline
754: \end{tabular}
755: \label{t7}
756:  \centering                                      
757:  \caption {As Table~\ref{t7} but for the test set.}
758:  \begin{tabular}{l | c c c c c c c}          
759:  \hline\hline                        
760:   
761: Type & E-S0 & Sa & Sb & Sbc & Sc & Sd & Im \\
762:  \hline
763: E-S0 & 15473 & 0     & 0   & 0    & 0    & 0    & 0    \\
764: Sa   & 0     & 11647 & 0   & 0    & 0    & 0    & 0    \\
765: Sb   & 17    & 0     & 344 & 113  & 0    & 0    & 0    \\ 
766: Sbc  & 0     & 0     & 83  & 1084 & 23   & 0    & 0    \\
767: Sc   & 0     & 0     & 8   & 39   & 6971 & 7    & 0    \\ 
768: Sd   & 0     & 0     & 0   & 0    & 1    & 3149 & 50   \\
769: Im   & 0     & 0     & 0   & 0    & 0    & 70   & 2921 \\
770: 
771: \hline
772: \end{tabular}
773: \label{t8} 
774: \end{table}
775: 
In practice we may want to first reduce spectra to the rest frame, for which
we require an estimate of the redshift. Therefore, we also set up an SVM
regression model to predict redshift, using the same training and test sets.
The predicted values of redshift for each of the five true redshift values are
presented in Fig.~\ref{f12}. We do not expect very good performance here,
because the SVM must learn the effect of redshift from just five
different values.
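
The effect that the regression model has to learn is the stretching of the spectrum with redshift. As a simple illustration (the 4000\,\AA\ example wavelength is chosen by us and is not taken from the simulations),
\begin{displaymath}
\lambda_{\rm obs} = (1+z)\,\lambda_{\rm rest},
\end{displaymath}
so a rest-frame feature at 4000\,\AA\ is observed at 4200\,\AA\ for $z=0.05$ and at 4800\,\AA\ for $z=0.2$. With only five redshift values in the training set, the model sees such a feature at only five positions.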
783: 
784: \begin{figure*}[t]
785: \setlength{\unitlength}{1cm}
786: \begin{picture}(18,10)
787: \put(0,10){\special{psfile=f12a.ps hoffset=0 voffset=0 hscale=20  
788: vscale=20 angle=-90}}
789: \put(6,10){\special{psfile=f12b.ps hoffset=0 voffset=0 hscale=20  
790: vscale=20 angle=-90}}
791: \put(12,10){\special{psfile=f12c.ps hoffset=0 voffset=0 hscale=20  
792: vscale=20 angle=-90}}
793: \put(4,5){\special{psfile=f12d.ps hoffset=0 voffset=0 hscale=20  
794: vscale=20 angle=-90}}
795: \put(10,5){\special{psfile=f12e.ps hoffset=0 voffset=0 hscale=20  
796: vscale=20 angle=-90}}
797: \end{picture}
\caption{Distribution of predicted values of redshift, shown separately for the five true values of redshift ($z=0$, 0.05, 0.1, 0.15 and 0.2).}
799: \label{f12}
800: \end{figure*}
801: 
802: 
803: 
804: \section{Discussion and conclusion}
805: 
806: We have used the P\'EGASE.2 galaxy evolution model and the observational data
from SDSS to create an extended grid of synthetic galaxy spectra. Using these
we have identified the relevant astrophysical parameters, and their
ranges, which yield realistic galaxy spectra of known morphological type.
810: This was done specifically by comparing the colours of our library spectra
811: with those synthesized from SDSS spectra. We found small deviations between
812: the two colour loci for redder galaxies -- where the ellipticals are found --
813: which might be due to the fact that SDSS spectra are obtained in a small aperture
814: (fibre diameter) while P\'EGASE spectra are representative of the whole
815: galaxy. We also see that the observed sample has a considerably larger spread
816: in the colour--colour diagram than the library spectra, which probably has
817: observational reasons (photometric errors) as well as theoretical ones
818: (insufficient cosmic variance in the galaxy models). That is, it may
819: partially reflect the complicated nature of galaxy formation and evolution,
820: although the overall agreement between the two is good.
821: 
822: To achieve a better agreement between the observational and
823: synthesized libraries we will further investigate the influence of the
824: various P\'EGASE.2 parameters, especially those that were kept constant for
825: this release of the library. On the other hand, due to the narrow redshift
826: range ($z < 0.2$) explored here, evolution factors are minimized. At higher
827: redshifts, synthetic spectra will be computed by simultaneously applying
828: cosmological k-corrections and evolution e-corrections to z=0 templates.
829: 
830: Among the existing libraries of observed spectra, the most complete
831: and homogeneous is the SDSS, since it covers a significant part of the whole
832: sky and it goes fainter than the expected detection limit of Gaia. We
833: therefore aim to produce a suitable set of synthetic spectra covering as
834: much as possible of the SDSS colour range and we plan further comparisons in
835: our future work.
836: 
Including phenomena such as galaxy mergers would be challenging,
but we believe that at the low redshifts Gaia will observe, merging
is not such an important or frequent mechanism of galaxy evolution. On the
other hand, starburst galaxies are more frequent at small redshifts and we
intend to enrich our library with this type of galaxy.
842: 
The first results of applying SVMs to the classification and parametrization of the library are
quite promising. In particular, the first indications are that Gaia will be
845: able to produce a wealth of information for a large statistical sample of
846: galaxies. After constructing a more complete library of spectra we will be
847: able to perform more tests and construct a classifier able to treat more
848: realistic and complete simulations of galaxy spectra.
849: 
850: \section{Acknowledgments}
851: The authors (the Greek team) would like to thank the Greek General Secretariat
852: of Research and Technology (GSRT) for financial support.
853: 
854: P. Tsalmantza would also like to thank the Max-Planck-Institut f\"ur
855: Astronomie (MPIA) and Institut d'Astrophysique de Paris (IAP) for their
856: support and hospitality.
857: 
858: Funding for the Sloan Digital Sky Survey (SDSS) has been provided by the Alfred
859: P. Sloan Foundation, the Participating Institutions, the National Aeronautics
860: and Space Administration, the National Science Foundation, the U.S. Department
861: of Energy, the Japanese Monbukagakusho, and the Max Planck Society. The SDSS
862: Web site is http://www.sdss.org/.
863: 
864: The SDSS is managed by the Astrophysical Research Consortium (ARC) for the
865: Participating Institutions. The Participating Institutions are The University
866: of Chicago, Fermilab, the Institute for Advanced Study, the Japan Participation
867: Group, The Johns Hopkins University, the Korean Scientist Group, Los Alamos
868: National Laboratory, the Max-Planck-Institute for Astronomy (MPIA), the
869: Max-Planck-Institute for Astrophysics (MPA), New Mexico State University,
870: University of Pittsburgh, University of Portsmouth, Princeton University, the
871: United States Naval Observatory, and the University of Washington.
872: 
873: \begin{thebibliography}{22}
874: \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi
875: 
876: \bibitem[{{Armand} \& {Milliard}(1994)}]{FOCA2000}
877: {Armand}, C. \& {Milliard}, B. 1994, \aap, 282, 1
878: 
879: \bibitem[{Bennett \& Campbell(2000)}]{bennett00}
880: Bennett, K.~P. \& Campbell, C. 2000, SIGKDD Explor. Newsl., 2, 1
881: 
882: \bibitem[{{Brown}(2006)}]{brown}
883: {Brown}, A. G.~A. 2006, Gaia Technical Report GAIA-C8-SP-LEI-AB-006-1
884: 
885: \bibitem[{{Buat} {et~al.}(1999){Buat}, {Donas}, {Milliard}, \& {Xu}}]{buat}
886: {Buat}, V., {Donas}, J., {Milliard}, B., \& {Xu}, C. 1999, \aap, 352, 371
887: 
888: \bibitem[{Burges(1998)}]{burges98}
889: Burges, C. J.~C. 1998, Data Mining and Knowledge Discovery, 2, 121
890: 
891: \bibitem[{Chang \& Lin(2001)}]{libsvm}
892: Chang, C.-C. \& Lin, C.-J. 2001, {LIBSVM}: a library for support vector
  machines, software available at {\tt http://www.csie.ntu.edu.tw/\~{}cjlin/libsvm}
894: 
895: \bibitem[{{Fioc}(1997)}]{fioc3}
896: {Fioc}, M. 1997, PhD thesis, Universit{\'e} Paris XI,
897:   http://www.iap.fr/users/fioc.html
898: 
899: \bibitem[{{Fioc}(1999)}]{fioc6}
900: {Fioc}, M. 1999, in Astronomical Society of the Pacific Conference Series, Vol.
901:   192, Spectrophotometric Dating of Stars and Galaxies, ed. I.~{Hubeny},
902:   S.~{Heap}, \& R.~{Cornett}, 299--+
903: 
904: \bibitem[{{Fioc} \& {Rocca-Volmerange}(1997)}]{fioc2}
905: {Fioc}, M. \& {Rocca-Volmerange}, B. 1997, \aap, 326, 950
906: 
907: \bibitem[{{Fioc} \& {Rocca-Volmerange}(1999{\natexlab{a}})}]{fioc1}
908: {Fioc}, M. \& {Rocca-Volmerange}, B. 1999{\natexlab{a}}, \aap, 351, 869
909: 
910: \bibitem[{{Fioc} \& {Rocca-Volmerange}(1999{\natexlab{b}})}]{fioc4}
911: {Fioc}, M. \& {Rocca-Volmerange}, B. 1999{\natexlab{b}}, \aap, 344, 393
912: 
913: \bibitem[{{Fioc} \& {Rocca-Volmerange}(1999{\natexlab{c}})}]{fioc5}
914: {Fioc}, M. \& {Rocca-Volmerange}, B. 1999{\natexlab{c}}, arXiv:astro-ph/9912179
915: 
916: \bibitem[{{Fukugita} {et~al.}(1996){Fukugita}, {Ichikawa}, {Gunn}, {Doi},
917:   {Shimasaku}, \& {Schneider}}]{fukugita}
918: {Fukugita}, M., {Ichikawa}, T., {Gunn}, J.~E., {et~al.} 1996, \aj, 111, 1748
919: 
920: \bibitem[{{Groenewegen} \& {de Jong}(1993)}]{groenewegen}
921: {Groenewegen}, M.~A.~T. \& {de Jong}, T. 1993, \aap, 267, 410
922: 
923: \bibitem[{{Le Borgne} \& {Rocca-Volmerange}(2002)}]{le2}
924: {Le Borgne}, D. \& {Rocca-Volmerange}, B. 2002, \aap, 386, 446
925: 
926: \bibitem[{{Le Borgne} {et~al.}(2004){Le Borgne}, {Rocca-Volmerange},
927:   {Prugniel}, {Lan{\c c}on}, {Fioc}, \& {Soubiran}}]{le1}
928: {Le Borgne}, D., {Rocca-Volmerange}, B., {Prugniel}, P., {et~al.} 2004, \aap,
929:   425, 881
930: 
931: \bibitem[{{Rana} \& {Basu}(1992)}]{rana}
932: {Rana}, N.~C. \& {Basu}, S. 1992, \aap, 265, 499
933: 
934: \bibitem[{{Rocca-Volmerange} {et~al.}(2007){Rocca-Volmerange}, {de Lapparent},
935:   {Seymour}, \& {Fioc}}]{rocca07}
936: {Rocca-Volmerange}, B., {de Lapparent}, V., {Seymour}, N., \& {Fioc}, M. 2007,
937:   arXiv:0705.2031
938: 
939: \bibitem[{{Rocca-Volmerange} {et~al.}(2004){Rocca-Volmerange}, {Le Borgne}, {De
940:   Breuck}, {Fioc}, \& {Moy}}]{rocca}
941: {Rocca-Volmerange}, B., {Le Borgne}, D., {De Breuck}, C., {Fioc}, M., \& {Moy},
942:   E. 2004, \aap, 415, 931
943: 
944: \bibitem[{Vapnik(1995)}]{vapnik95}
945: Vapnik, V.~N. 1995, The nature of statistical learning theory (Springer)
946: 
947: \bibitem[{{Williams} {et~al.}(1996){Williams}, {Blacker}, {Dickinson}, {Dixon},
948:   {Ferguson}, {Fruchter}, {Giavalisco}, {Gilliland}, {Heyer}, {Katsanis},
949:   {Levay}, {Lucas}, {McElroy}, {Petro}, {Postman}, {Adorf}, \&
950:   {Hook}}]{williams}
951: {Williams}, R.~E., {Blacker}, B., {Dickinson}, M., {et~al.} 1996, \aj, 112,
952:   1335
953: 
954: \bibitem[{{Woosley} \& {Weaver}(1995)}]{woosley}
955: {Woosley}, S.~E. \& {Weaver}, T.~A. 1995, \apjs, 101, 181
956: 
957: \end{thebibliography}
958: 
959: \clearpage
960: \appendix
961: \section{Support Vector Machines}
962: 
963: Support Vector Machines (SVMs) \citep{vapnik95} are supervised machine
964: learning methods for data classification. In their basic form they achieve a
965: linear classification between two classes by defining an optimal hyperplane which
966: separates members of the two classes. If the classes are separable then there
967: generally exists an infinite number of hyperplanes which achieve this.
968: The SVM optimal plane is defined as that plane which maximises the margin
969: between the opposing class members nearest to the boundary. That is, unlike
970: many other classifiers which use all of the data to define the boundary, SVMs
971: take the (arguably more reasonable) approach of using just those points nearest
972: to the boundary. It has been demonstrated that this gives rise to a
973: more robust and more accurate classifier under general conditions.
974: 
975: In most non-trivial problems, however, the classes are not linearly separable.
976: In these cases, just those points which lie on the wrong side of the
977: hyperplane -- the so-called support vectors -- enter into the total
978: classification error. By minimizing this error -- which also measures the
979: distance of the support vectors from the plane -- we define the optimal
980: separating plane, i.e.\ with the fewest misclassifications (and preferentially
981: of those which lie closer to the plane).
982: 
983: In the general case, the classes are not even marginally linearly separable
984: (consider the XOR problem) so a linear classifier, no matter how optimal, is
985: useless. SVMs address this issue by using kernels to project the data into a
986: higher dimensional space. For example, with a polynomial kernel we take
987: square, cubic etc.\ combinations of the original data to form additional
988: dimensions and then apply the (linear) SVM classifier in this higher
989: dimensional space. With many other kernels, however, this projection is only
990: carried out implicitly. This approach can be thought of as nonlinearity by
991: preprocessing, with the kernel overcoming the well known ``curse of
992: dimensionality''. In the present work we use the radial basis kernel
993: \begin{equation}
K(x_i - x_j)=\exp\left(-\gamma\|x_i-x_j\|^{2}\right)
995: \label{SVM_kernel}
996: \end{equation}
where $x_i$ and $x_j$ are two input vectors (e.g.\ spectra). The
classification of a new vector $x_j$ is then given by a function
\begin{equation}
f(x_j) = \sum_{i=1}^{N} y_i \alpha_i K(x_i - x_j)
\label{SVM_model}
\end{equation}
where $y_i \in \{-1,+1\}$ denotes the two classes, and a classification is made
by applying a threshold, e.g.\ $f(x_j) > 0 \Rightarrow$ class $+1$. The
1005: ${\alpha_i}$ are the parameters of the model which are determined by the model
1006: training ($i$ counts over the $N$ support vectors). SVMs have a very important
1007: property, namely that the error function is strictly convex, so it
1008: has a unique global solution which can be found in polynomial time with
1009: standard optimizers (it is a linearly constrained quadratic programming
1010: problem).
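
For reference, the quadratic programme in question can be written in its standard textbook (dual) form. This is the general soft-margin formulation, quoted here as a sketch rather than as the exact expression implemented in the software we used; the symbol $W$ is ours and the sums run over all training vectors, of which only those with $\alpha_i>0$ are support vectors.
\begin{eqnarray*}
\max_{\alpha}\; W(\alpha) & = & \sum_{i}\alpha_i
 - \frac{1}{2}\sum_{i}\sum_{j}\alpha_i\alpha_j\,y_i y_j\,K(x_i - x_j), \\
{\rm subject\ to} & & 0 \leq \alpha_i \leq C, \qquad \sum_{i}\alpha_i y_i = 0 .
\end{eqnarray*}
Because the objective is quadratic in the $\alpha_i$ and the constraints are linear, any optimum found by the optimizer is the global one.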
1011: 
This is in marked contrast to neural networks, for example, in which the optimizers
generally converge on a local minimum and we can only be guaranteed to find
the global minimum via an exhaustive search. Furthermore, with a sigmoidal
kernel SVMs are equivalent to neural networks but with the additional
advantage that the SVM automatically determines the neural network
architecture (number of weights).
1018: 
The SVM model incorporates regularization via the specification of a
hyperparameter, $C$, which sets the penalty attached to data vectors that
fall inside the margin around the separating hyperplane or on the wrong side
of it. These vectors are all considered support vectors and so all enter the
error equation. With a smaller $C$ such violations are cheaper, so the margin
is wider and more data vectors fall into it, i.e.\ more regularization. With
a larger $C$ there is a higher penalty attached to
errors, i.e.\ less regularization.\footnote{$C$ is actually the upper bound on
$\alpha_i$, specifically $0 \leq \alpha_i \leq C$ and $\sum_i \alpha_i y_i =
0$ (two of the constraints in the error minimization). Thus a small $C$
implies smaller $\alpha_i$ in equation~\ref{SVM_model} which in turn implies
smoother functions equivalent to more regularization.}
1029: 
1030: The other hyperparameter in the model is $\gamma$ (equation~\ref{SVM_kernel}).
1031: Both $\gamma$ and $C$ must be determined by the user. Prior information may
1032: help, but in practice one carries out a rigorous search over a two-dimensional
1033: grid to ``tune'' the SVM. We did this using 4-fold cross validation,
1034: iterating over grids of increasing density.
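
Schematically, for each grid point $(C,\gamma)$ the training set is split into four parts, the model is trained on three of them and its error $E_k$ is measured on the fourth, and the pair minimizing the averaged error is retained:
\begin{displaymath}
E_{\rm CV}(C,\gamma) = \frac{1}{4}\sum_{k=1}^{4} E_k(C,\gamma), \qquad
(C,\gamma)_{\rm best} = \arg\min_{(C,\gamma)} E_{\rm CV}(C,\gamma) .
\end{displaymath}
The notation is ours. A common choice for the coarsest grid, assumed here purely for illustration and not a detail of our procedure, is logarithmic spacing such as $C \in \{2^{a}\}$ and $\gamma \in \{2^{b}\}$ for integer $a$ and $b$, refined around the best point at each iteration.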
1035: 
1036: SVMs can also be used for regression. Instead of a hyperplane and a margin
1037: about it, regression SVMs fit a line with a tube of radius $\epsilon$
1038: encompassing it. Data vectors which are less than a distance $\epsilon$ from
1039: the line are considered to be correctly fit, that is, the support vectors are
1040: only those points outside of the tube. Thus the $\epsilon$
1041: hyperparameter controls the degree of regularization. The specific error
1042: function we use is the mean squared error on the predictions, with
1043: the regularization again being introduced via the constraints in the
optimization (with Lagrange multipliers). All of the kernel and optimization
1045: machinery applies equally to these models, so that nonlinear regression can
1046: also be achieved.
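
For reference, the textbook $\epsilon$-insensitive loss that gives the tube its meaning is
\begin{displaymath}
L_\epsilon\bigl(y, f(x)\bigr) = \max\bigl(0,\; |y-f(x)|-\epsilon\bigr),
\end{displaymath}
where $y$ denotes the true value of the parameter and $f(x)$ the model prediction. It vanishes for points inside the tube and grows linearly with distance outside it. This is the standard form, quoted as a sketch and not as a statement of the exact objective used in this work.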
1047: 
1048: \end{document}
1049: