0709:0709.2358/ms.tex

1: \documentclass[12pt,preprint]{aastex}

2: \usepackage{natbib}

3: \usepackage{ifthen}

4: \newcounter{address}

5: \newcommand{\latin}[1]{\textit{#1}}

6: \newcommand{\ie}{\latin{ie}}

7: \newcommand{\eg}{\latin{eg}}

8: \newcommand{\cf}{\latin{cf}}

9: \newcommand{\etc}{\latin{etc}}

10: \newcommand{\etal}{\latin{et~al}}

11: \newlength{\threewidth}

12: \setlength{\threewidth}{0.333\textwidth}

13: \newlength{\twowidth}

14: \setlength{\twowidth}{0.499\textwidth}

15: \newlength{\twothreewidth}

16: \setlength{\twothreewidth}{0.666\textwidth}

17: \newlength{\onewidth}

18: \setlength{\onewidth}{1.0\textwidth}

19: \newcommand{\Nside}{N_{\mathrm{side}}}

20: \newcommand{\unit}[1]{\mathrm{#1}}

21: \renewcommand{\mag}{\unit{mag}}

22: \newcommand{\rad}{\unit{rad}}

23: \renewcommand{\arcsec}{\unit{arcsec}}

24: \newcommand{\ster}{\unit{ster}}

25: \newcommand{\percent}{\unit{percent}}

26: \newcommand{\Tycho}{Tycho-2}

27: \newcommand{\USNOB}{USNO-B Catalog}

28: \newcommand{\TWOMASS}{2MASS PSC Catalog}

29: \newcommand{\merged}{merged \Tycho\ and \USNOB s}

30: \newcommand{\an}{\textsl{Astrometry.net}}

31: \newcommand{\numAllStars}{1,045,175,762}

32: \newcommand{\numUSNOBStars}{1,042,618,261}

33: \newcommand{\numSpikes}{24,148,382}

34: \newcommand{\numHalos}{196,133}

35: \newcommand{\percentSpikes}{$2.3\,\percent$}  % 2.316128818

36: \newcommand{\percentHalos}{$0.02\,\percent$}  % 0.01876555

37:

38: \begin{document}

39: \title{

40:   Cleaning the USNO-B Catalog

41:   through automatic detection of optical artifacts

42: }

43: \author{

44:   Jonathan~T.~Barron\altaffilmark{\ref{Toronto}},

45:   Christopher~Stumm\altaffilmark{\ref{Toronto}},

46:   David~W.~Hogg\altaffilmark{\ref{NYU},\ref{email}},

47:   Dustin~Lang\altaffilmark{\ref{Toronto}},

48:   Sam~Roweis\altaffilmark{\ref{Toronto},\ref{Google}}

49: }

50:

51: \setcounter{address}{1}

52: \altaffiltext{\theaddress}{\stepcounter{address}\label{Toronto}

53: Department of Computer Science, University of Toronto,

54: 6 King's College Road, Toronto, Ontario, M5S~3G4 Canada}

55: \altaffiltext{\theaddress}{\stepcounter{address}\label{NYU}

56: Center for Cosmology and Particle Physics, Department of Physics, New

57: York University, 4 Washington Place, New York, NY 10003}

58: \altaffiltext{\theaddress}{\stepcounter{address}\label{email}

59: To whom correspondence should be addressed: \texttt{david.hogg@nyu.edu}}

60: \altaffiltext{\theaddress}{\stepcounter{address}\label{Google}

61: Google, Mountain View, CA}

62:

63: \begin{abstract}

64: The USNO-B Catalog contains spurious entries that are caused by

65: diffraction spikes and circular reflection halos around bright stars

66: in the original imaging data. These spurious entries appear in the

67: Catalog as if they were real stars; they are confusing for some

68: scientific tasks.  The spurious entries can be identified by simple

69: computer vision techniques because they produce repeatable patterns on

70: the sky. Some techniques employed here are variants of the Hough

71: transform, one of which is sensitive to (two-dimensional)

72: overdensities of faint stars in thin right-angle cross patterns

73: centered on bright ($<13\,\mag$) stars, and one of which is sensitive

74: to thin annular overdensities centered on very bright ($<7\,\mag$)

75: stars.  After enforcing conservative statistical requirements on

76: spurious-entry identifications, we find that of the 1,042,618,261

77: entries in the USNO-B Catalog, 24,148,382 of them ($2.3\,\percent$)

78: are identified as spurious by diffraction-spike criteria and 196,133

79: ($0.02\,\percent$) are identified as spurious by reflection-halo

80: criteria.  The spurious entries are often detected in more than 2

81: bands and are not overwhelmingly outliers in any photometric

82: properties; they therefore cannot be rejected easily on other grounds,

83: \ie, without the use of computer vision techniques.  We demonstrate our

84: method, and return to the community in electronic form a table of spurious

85: entries in the Catalog.

86: \end{abstract}

87:

88: \keywords{

89:     astrometry ---

90:     catalogs ---

91:     methods:~statistical ---

92:     standards ---

93:     techniques:~image~processing

94: }

95:

96: \section{Introduction}

97:

98: The \USNOB\ \citep{monet03a} is an astrometric catalog containing

99: information on $\sim10^{9}$ stars.  The original imaging data taken

100: for this catalog come exclusively from photographic plates, taken from

101: several different surveys operating over many decades.  These plates

102: were uniformly scanned and automated source detection was performed on

103: the scans.  From the sources detected in the scans, the Catalog was

104: constructed in a relatively ``inclusive'' way.  The sources were

105: required to be compact, and to show detections in more than one band

106: of the five bands ($O,E,J,F,N$) from which the Catalog was constructed.

107: However, the original plate images contained many artifacts, defects,

108: trailed satellites, and large, resolved sources such as nearby

109: galaxies, nebulae, and star clusters.  Some of the entries in the

110: \USNOB\ do not correspond to real, independent, astronomical sources

111: but rather to arbitrary parts of extended sources, or fortuitously

112: coincident (across bands) data defects or artificial features.  Though

113: compact galaxies can be used along with stars for astrometric science,

114: the artificial features recorded as stars are at best useless---and at

115: worst damaging---to scientific projects undertaken with the \USNOB.

116:

117: That said, the \USNOB\ is a tremendously important and productive tool

118: as the largest visual ($BRI$) all-sky catalog for astrometric science

119: available at the present day. Users of the Catalog benefit from its

120: careful construction, its connection to the absolute astrometric

121: reference frame, and the long time baseline of its originating data.

122:

123: Our group is using the Catalog for the ambitious \an\ project

124: \citep{lang07a} in which we locate ``blind'' the position,

125: orientation, and scale of images with little, no, or corrupted

126: astrometric meta-data.  For the \an\ project, we need the input

127: astrometric catalog to have as few spurious entries as possible.

128: Indeed, in our early work, most of the

129: ``false positive'' results from our blind astrometry system involved

130: spurious alignments of linear defects in submitted images with

131: lines of spurious entries in the \USNOB\ coming from diffraction

132: spikes near bright stars.  For this reason, we found it necessary to

133: ``clean'' the Catalog of as many spurious entries as we can identify

134: by their configurations on the two-dimensional plane of the sky. In

135: what follows we describe how we identified two large classes of

136: spurious entries, thereby greatly improving the value of the Catalog

137: for our needs.

138:

139: The most analogous prior work in the astronomical literature is a

140: cleaning of the SuperCOSMOS Sky Survey using sophisticated computer

141: vision and machine learning techniques \citep{storkey04a}. Our work

142: is less general because we have specialized our detection algorithms

143: to the specific morphologies of the features we know to be present in

144: the \USNOB. This specialization is possible because the vetting

145: procedure employed in the construction of the \USNOB\ has eliminated

146: most of the defects (satellite trails, dirt, and scratches) that have

147: unpredictable morphologies. This specialization has the great

148: advantage that it permits us to detect image defects composed of small

149: numbers of catalog entries, which would not be statistically

150: identifiable if we did not have a strong \emph{a priori} model for their

151: morphologies.

152:

153: In what follows, we will treat the \USNOB\ as a collection of catalog

154: ``entries'', which are rows in a (large) table.  Most of these entries

155: correspond to ``stars'', which are hot balls of hydrogen in space, or

156: compact galaxies, which are extremely distant collections of stars,

157: but which will also be referred to as ``stars'' because from the point

158: of view of astrometric calibration they behave the same as stars.

159: Catalog entries that do not correspond to stars or individual compact

160: galaxies are considered by us to be ``spurious''.  We identify

161: some fraction of the spurious entries in the \USNOB\ by exploiting the

162: repeatable configurations they show around bright stars.

163:

164: \section{Spurious catalog entries}

165:

166: The \USNOB\ was constructed from imaging in five bands ($O$, $E$, $J$,

167: $F$, $N$) at two broad epochs ($O$, $E$ at first epoch, $J$, $F$, $N$

168: at second), taken with plate centers on a (fairly) regular grid of the

169: sky.  The plate imaging comprising the original data for the Catalog

170: is heterogeneous (in camera or survey origin and in data quality); in

171: order to guard against spurious entries, the construction of the

172: Catalog required detection of sources in multiple bands. However, some

173: spurious catalog entries survived this requirement.

174:

175: \subsection{Diffraction spikes}

176:

177: The diffraction-limited point-spread function of a physical telescope

178: is related to the Fourier transform of the entrance aperture. In this

179: transform, the thin cross-like support structure holding the secondary

180: mirror in the entrance aperture produces a large cross-like pattern in

181: the stellar point-spread function (PSF). The sources automatically

182: extracted from the scans of the photographic plate images include many

183: spurious features that are in fact just detections of these diffraction

184: spikes (Figures~\ref{fig:skyPatchSource} and \ref{fig:skyPatch}).

185:

186: The survey cameras that took the imaging data used to construct the

187: \USNOB\ are on equatorial mounts and have no capability for rotation

188: of the support structure relative to the sky once the pointing of the

189: telescope is set.  The diffraction spikes for any two images

190: taken by the same camera at the same pointing are therefore always aligned. For

191: this reason, spurious stars detected as part of one of these spikes in

192: one image in one band often line up with spurious stars detected in

193: the corresponding spike in some other band. Some spurious ``spike''

194: catalog entries thereby satisfy the \USNOB\ vetting requirement that

195: catalog entries have cospatial counterparts in multiple bands.

196:

197: Fortunately, spurious spike entries can be identified on the basis of

198: morphological regularities in the two-dimensional distribution on the

199: sky of the spurious catalog entries they generate. These regularities

200: include the following: \textsl{(1)}~Diffraction spikes are centered on

201: bright ($<13\,\mag$) stars. In what follows, the central star for a

202: diffraction spike will be referred to as the ``generating star''.

203: \textsl{(2)}~Because telescope supports are usually four perpendicular

204: rods, each diffraction spike generated by a bright star has four lines

205: at right angles to one another.  \textsl{(3)}~The diffraction spike

206: brightness is proportional to the brightness of the generating star,

207: but each spike becomes fainter with angular distance from the

208: generating star. Given that sources extracted from the scanned plates

209: are detected to some limiting brightness, the angular length of a

210: diffraction spike is closely related to the magnitude of the

211: generating star (Figure~\ref{fig:spikeProperties}).  \textsl{(4)}~The

212: angular width of a diffraction spike is narrow, so the two-dimensional

213: density of spurious spike entries can be very large. The angular width

214: is set by physical optics and is therefore roughly independent of the

215: magnitude of the generating star.  \textsl{(5)}~The orientation of the

216: diffraction spike pattern is roughly common to all spikes taken by the

217: same camera at the same pointing.

218: We can use the regularities among diffraction spikes to guide a

219: sensitive, automated search.

220:

221: Each \USNOB\ entry is tagged with a survey identifier and one or more

222: field numbers corresponding to the plates in that survey in which it

223: was detected.  Because all diffraction spikes in one field will share

224: the same orientation and properties, we analyze the \USNOB\ entries

225: one field at a time.  In this context, we consider an entry to belong

226: to a particular field if any of its photometric measurements has been

227: given that field number.

228:

229: \subsection{Reflection halos}

230:

231: The brightest stars in the \USNOB\ are surrounded not just by

232: diffraction spikes but by a thin circular ring or ``halo''.  This halo

233: is caused by internal reflections in the camera. Because this has a

234: geometric-optics rather than a physical-optics origin, the halo radius

235: is not a function of the wavelength of the imaging bandpass. This

236: means that spurious ``halo'' catalog entries can easily be present and

237: cospatial in multiple bands and thereby pass the \USNOB\ vetting

238: process (Figures~\ref{fig:skyPatchSource} and \ref{fig:skyPatch}).

239:

240: Again, the spurious catalog entries can be identified by the patterns

241: they make on the sky.  Regularities include the following:

242: \textsl{(1)}~Halos are centered on extremely bright ($<7\,\mag$)

243: generating stars.  \textsl{(2)}~Halos have a circular or near-circular

244: shape.  \textsl{(3)}~Because they are very thin in the radial

245: direction, spurious halo entries have high two-dimensional density on

246: the sky.  \textsl{(4)}~The spurious halo entries are usually close to

247: making up full circles, and only rarely appear in just a fragment of a

248: circle.  These regularities permit a sensitive search.

249:

250: \subsection{Other spurious entries}

251:

252: In addition to the spikes and halos we address above, there are other

253: categories of spurious catalog entries with other origins, including

254: but not limited to the following: \textsl{(1)}~There are some lines of

255: entries from fortuitously coincident features

256: (scratches, trails, handwriting, and

257: other artifacts) on overlapping plates.  \textsl{(2)}~There are some

258: duplicate entries for individual stars in sky regions where two fields

259: overlap. These are cases in which individual stars detected in multiple

260: fields have not been correctly identified as identical.

261: \textsl{(3)}~There are quasi-spurious clusters of entries in and

262: around extended objects such as galaxies, nebulae, and globular

263: clusters.

264:

265: We are doing nothing about any of these spurious features, in part because they

266: do not have regularities that lend themselves to the computer-vision

267: techniques we employ in finding the previously mentioned defects.

268: They also represent a much smaller fraction of the \USNOB\

269: entries than the spurious entries from diffraction spikes and

270: reflection halos.

271:

272: Of course the \USNOB\ also contains many entries that are in fact

273: compact galaxies rather than stars.  However, these entries are

274: \emph{not} spurious from our perspective, since compact

275: galaxies are as good as---or better than---stars for our \an\

276: astrometric calibration efforts, and most other astrometric calibration

277: tasks.

278:

279: \section{Methods}

280:

281: The Catalog we begin with is not the unmodified \USNOB, but rather the

282: \USNOB\ with the \Tycho\ Catalog \citep{hog00a} stars

283: re-inserted by us from the official \Tycho\ Catalog release.  We were

284: forced to perform this operation because in the official \USNOB\

285: release, the \Tycho\ Catalog stars were added in an undocumented

286: binary format.

287:

288: \subsection{Diffraction Spikes}

289:

290: We begin by dividing the Catalog into a fine healpix \citep{gorski05a}

291: grid, and projecting the entries in each healpixel onto planes tangent

292: to each healpixel's center. For each entry we calculate the average

293: $m$ of all magnitudes of all bands in which the entry has been

294: detected, and we find the union of all fields in which the entry is

295: present.

296:

297: For each field present, we construct a ``profile'' of the field's

298: largest diffraction spikes, by overlaying the local neighborhoods of

299: the ten brightest stars in the field.  Given the regularities

300: discussed above, we can expect all spikes in each field to have the

301: same orientation. Therefore, each composite profile has one dominant

302: orientation, which is more apparent than in any single star's

303: neighborhood. To find each field's orientation, we first convert the

304: composite profile into polar coordinates, collapse the angles of each

305: point into a ${\pi\over 2}\,\rad$ range, and calculate a rough

306: histogram of the resulting angles.  The angle with the most densely

307: populated bin is used as an initial guess of the field's orientation,

308: which is then re-estimated using an iteratively reweighted least

309: squares (IRLS) fitting algorithm for robust M-estimation

310: \citep{hampel86}. The M-estimation is guaranteed to converge to an

311: estimate of the orientation that locally minimizes a total cost

312: $\sum_k \rho(e_k)$ where $e_k$ is the angular distance of entry $k$

313: from the estimated orientation.  We employ a Geman-McClure (GM) cost

314: function $ \rho(e_k) = e_k^2/(\sigma^2 + e_k^2) $, where $\sigma$ is

315: the initial guess of the root-variance of the angular width of a

316: spike.  This GM cost function replaces the standard least-squares cost

317: function $\rho(e_k) = e_k^2/\sigma^2$ and thereby downweights

318: outliers.  The resulting angle is a very robust and precise estimate

319: of the average orientation of all diffraction spikes present in the

320: field.
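
The orientation estimate described above can be illustrated with a short Python sketch. It is not the code used to produce the results in this paper: the function names, the bin count, the iteration limit, and the default $\sigma = 0.02\,\rad$ are all hypothetical choices made for illustration. The IRLS weight used is the standard one for the GM cost, $w_k = \sigma^2/(\sigma^2 + e_k^2)^2$, which is $\rho'(e_k)/(2\,e_k)$.

```python
import numpy as np

def fold_angle(theta, period=np.pi / 2):
    """Map angles into [-period/2, period/2) so that the four arms coincide."""
    return (theta + period / 2) % period - period / 2

def estimate_orientation(x, y, sigma=0.02, n_bins=90, n_iter=50):
    """Histogram initialization followed by IRLS with a Geman-McClure cost.

    x, y are tangent-plane offsets of entries from their bright star.
    sigma, n_bins and n_iter are illustrative defaults (assumptions).
    """
    period = np.pi / 2
    theta = np.arctan2(y, x) % period              # collapse angles into [0, pi/2)
    counts, edges = np.histogram(theta, bins=n_bins, range=(0.0, period))
    k = np.argmax(counts)
    phi = 0.5 * (edges[k] + edges[k + 1])          # initial guess: densest bin
    for _ in range(n_iter):
        e = fold_angle(theta - phi)                # signed angular residuals
        w = sigma**2 / (sigma**2 + e**2) ** 2      # GM weight, rho'(e) / (2 e)
        step = np.sum(w * e) / np.sum(w)           # weighted least-squares update
        phi = (phi + step) % period
        if abs(step) < 1e-12:
            break
    return phi
```

Because the weights fall off as $e_k^{-4}$ for $|e_k| \gg \sigma$, entries far from the current estimate (real stars, for the most part) contribute almost nothing to the update, which is what makes the estimate robust.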

321:

322: We iterate over fields, using our estimation of each field's dominant

323: orientation to rotate the entries present in each field such that the

324: diffraction spikes present become axis-aligned on average, making

325: their detection much easier. Because there is sometimes some

326: discrepancy between the position of the diffraction spike's generating

327: star and the center of the diffraction spike, we perform a robust

328: estimation of the centerpoint of the spike, just as we did in

329: estimating the orientations of the field profiles. With the

330: diffraction spike axis-aligned and zero-centered, we collapse all of

331: the entries in the neighborhood of the diffraction spike into a single

332: composite of all four ``corners'' (as if we were to convert the

333: neighborhood to polar coordinates, and collapse their angles into a

334: ${\pi\over 2}\,\rad$ range), thereby reducing the four-part

335: diffraction spike to a single dense cluster of points.
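
As an illustration of this folding step, the following Python sketch rotates the entries by the field orientation, re-centers them, and collapses the four arms onto a single axis. The function and argument names are hypothetical, and the choice to fold onto the positive $x$ axis is an assumption made for convenience.

```python
import numpy as np

def collapse_spike(x, y, cx, cy, phi):
    """Rotate by -phi about (cx, cy) and fold the four arms onto the +x axis.

    Returns (along, across): distance along the folded arm and the signed
    offset perpendicular to it, in the same units as x and y.
    """
    dx, dy = np.asarray(x) - cx, np.asarray(y) - cy
    r = np.hypot(dx, dy)
    theta = np.arctan2(dy, dx) - phi                         # axis-align the spike
    quarter = np.pi / 2
    theta = (theta + quarter / 2) % quarter - quarter / 2    # fold into [-pi/4, pi/4)
    along = r * np.cos(theta)
    across = r * np.sin(theta)
    return along, across
```

After this transformation a four-armed spike appears as one narrow cluster near `across = 0`, while unrelated entries remain spread over the whole folded wedge.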

336:

337: We found a power-law approximation to the relationship between the

338: magnitude of the generating star and the angular extent of the

339: diffraction spike it generates among spurious entries.  This was found

340: by initially hand-labeling a small subset of the data, making a crude

341: fit to the hand-labeled data, then later refining the estimate using

342: the results of our algorithm.  Given the magnitude of a generating

343: star, we are able to use this relationship to estimate the angular

344: extent of the spike we would expect that generating star to produce.

345: As previously mentioned, the width of each spike is roughly

346: independent of the magnitude of the generating star, and is therefore

347: initialized to a constant value.  This estimate of the center and

348: extent of the spike is used to initialize a two-dimensional Gaussian,

349: which is then fit to the entries belonging to the diffraction spike

350: using iterated variance clipping at 2.5\,sigma.  What we construct is

351: not a traditional multivariate normal distribution, which would assume

352: that the data lies in an elliptical distribution, but is instead a

353: ``rectangular'' distribution. That is, we consider an entry to be

354: within the Gaussian distribution if it is simultaneously within

355: 2.5\,sigma of the width and 2.5\,sigma of the length of the

356: distribution.  When the Gaussian converges to its final parameters, we

357: take the rectangular area within 2.5\,sigma, and extend its range

358: towards the generating star to cover all entries between the area and

359: the generating star at the center of the spike.  If this area's

360: angular width, length, and position all pass a set of thresholds,

361: detailed later, we flag all of the entries within it (excluding the

362: generating star and any \Tycho\ stars, which we assume are not

363: spurious) as potential spurious entries.  If these entries pass a set

364: of thresholds (detailed below) they are marked as spurious.
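
A minimal Python sketch of the ``rectangular'' variance clipping and the inward extension of the final region is given here. It assumes the entries have already been folded into (along, across) coordinates with the generating star at along $=0$; the function names, the convergence test, and the iteration limit are hypothetical and are not taken from the code used for this paper.

```python
import numpy as np

def rectangular_clip(along, across, mu0, sd0, n_sigma=2.5, max_iter=100):
    """Refit per-axis mean and standard deviation on entries inside a rectangle.

    An entry is kept only if it lies within n_sigma of the mean on BOTH axes,
    so the accepted region is a rectangle, not an ellipse.
    """
    pts = np.column_stack([along, across]).astype(float)
    mu, sd = np.asarray(mu0, float), np.asarray(sd0, float)
    for _ in range(max_iter):
        inside = np.all(np.abs(pts - mu) <= n_sigma * sd, axis=1)
        if inside.sum() < 2:
            break                                   # too few entries to refit
        new_mu, new_sd = pts[inside].mean(axis=0), pts[inside].std(axis=0)
        if np.allclose(new_mu, mu) and np.allclose(new_sd, sd):
            break                                   # converged
        mu, sd = new_mu, new_sd
    inside = np.all(np.abs(pts - mu) <= n_sigma * sd, axis=1)
    return mu, sd, inside

def spike_region(along, across, mu, sd, n_sigma=2.5):
    """Final rectangle, extended inward to the generating star at along = 0."""
    along, across = np.asarray(along), np.asarray(across)
    return ((along >= 0.0) & (along <= mu[0] + n_sigma * sd[0])
            & (np.abs(across - mu[1]) <= n_sigma * sd[1]))
```

The width and length thresholds described in the text would then be applied to `sd` and `mu` before any entry in `spike_region` is flagged.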

365:

366: The algorithm is depicted in Figure~\ref{fig:demoRun}.

367:

368: \subsection{Reflection halos}

369:

370: Once all spurious catalog entries attributed to spikes are found and

371: temporarily removed (such entries disturb the results of the halo

372: detection algorithm), we search the remaining catalog for halos.  This

373: process is similar to the process of searching for diffraction spikes:

374: We divide the Catalog into a fine healpix grid and process each grid

375: cell independently.  We project the entries in each grid cell onto a

376: plane tangent to the cell's center.  Next, we examine each star

377: brighter than $7\,\mag$, and attempt to find and eliminate halos that

378: it has generated.  Since the radius of each halo is not dependent on

379: the magnitude of its generating star, the size of the neighborhood we

380: search is constant.

381:

382: We convert each neighborhood into polar coordinates centered at the

383: generating star, and calculate a histogram of the radii of all entries

384: in the neighborhood. This simple count of the number of stars present

385: at different radii is used to generate a more informative histogram of

386: the densities of stars at each radius. Our initial guess of the radius

387: of the halo is whichever coarse bin is the most dense.
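
The conversion from counts to densities can be sketched in Python as follows. The radial limits are the $240$ to $410\,\arcsec$ range quoted among the halo parameters; the $5\,\arcsec$ bin width and the function name are hypothetical choices made only for illustration.

```python
import numpy as np

def densest_radius(x, y, r_min=240.0, r_max=410.0, bin_width=5.0):
    """Return the centre of the radial bin with the highest surface density.

    x, y are tangent-plane offsets (arcsec) from the generating star.
    Counts per bin are divided by the annulus area so that the larger
    area of outer bins does not bias the peak outward.
    """
    r = np.hypot(x, y)
    edges = np.arange(r_min, r_max + bin_width, bin_width)
    counts, edges = np.histogram(r, bins=edges)
    area = np.pi * (edges[1:] ** 2 - edges[:-1] ** 2)    # annulus areas
    density = counts / area
    k = np.argmax(density)
    return 0.5 * (edges[k] + edges[k + 1]), density
```

Dividing by annulus area matters because, for a uniform background, the raw count in a bin of fixed width grows linearly with radius.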

388:

389: With this estimate of the radius of our halo, and with a constant as

390: our initial estimate of the radial width of the halo, we construct a

391: one-dimensional Gaussian and again robustly fit the position and width

392: of the Gaussian using variance clipping at 2.5\,sigma.  Once the

393: re-estimation has converged, we check that our resulting values for

394: the variance of the width are reasonable ($<3\,\arcsec$), and if so,

395: we label all entries within 2.5\,sigma of the Gaussian as potentially

396: spurious. Again, if these entries pass another set of thresholds, they

397: are marked as spurious.

398:

399: Because one generating star may produce multiple halos, we search each

400: generating star, and remove each salient halo we find, until we fail to

401: detect any new halo that passes our thresholds.

402:

403: \subsection{Parameters of the Algorithms}

404:

405: By necessity the algorithms have a number of free parameters.  Some of

406: these are measurements of diffraction-spike and reflection-halo

407: configurations, derived from quantitative analyses of the properties

408: of the spurious entries, while others are additional conservative

409: constraints, applied to ensure that the spurious entries appear to be

410: correctly identified on visual inspection of the results.

411:

412: In addition to the parameters that specifically apply to the spike and

413: halo identification algorithms, we somewhat arbitrarily chose to work

414: in a $\Nside=9$ healpix grid; there are $12\times 9\times 9=972$

415: healpixels.  We set all variance-clipping thresholds to 2.5\,sigma,

416: and when we define regions by variance clipping we make them

417: 2.5\,sigma in half-width.
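
For scale, the standard healpix relation $N_{\mathrm{pix}} = 12\,\Nside^{2}$ together with the $4\pi\,\ster$ area of the full sky gives the solid angle of each healpixel in this grid:
\begin{equation}
\Omega_{\mathrm{pix}} = \frac{4\pi}{12\,\Nside^{2}}\,\ster
  = \frac{4\pi}{972}\,\ster
  \approx 1.29\times 10^{-2}\,\ster
  \approx 42.4\,\unit{deg}^{2} ,
\end{equation}
so each healpixel is roughly $6.5\,\unit{deg}$ on a side.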

418:

419: \subsubsection{Measured Spike Parameters}

420:

421: \begin{itemize}

422: \item We search for diffraction spikes generated by stars brighter

423: than $13\,\mag$.  Bright stars tend to produce large diffraction

424: spikes containing many spurious entries, while dimmer stars produce

425: small diffraction spikes containing few, and potentially ambiguous,

426: spurious entries.  When we extended our search to stars brighter than

427: $15\,\mag$, we found that the proportion of falsely labeled spurious

428: entries increased dramatically.  Our decision to restrict to

429: $<13\,\mag$ is further supported by the second panel of

430: Figure~\ref{fig:spuriousStats}, which shows that generating stars at

431: $>13\,\mag$ have a mean number of entries per spike of less than four,

432: which means that most will contain too few entries to be accepted.

433:

434: \item Our initial estimate of the angular length $\ell$ of a

435: diffraction spike given the magnitude $m$ of its generating star is

436: $\ell = 3500 \times 1.53^{-m}\,\arcsec$; see

437: Figure~\ref{fig:spikeProperties}.  This estimate initializes a

438: refinement by iterated variance clipping and therefore does not

439: strongly affect our results.  In detail this relationship between

440: length and magnitude depends on band, exposure time, and data quality,

441: and is therefore different for every plate; but since we use it

442: only as an initialization, those details do not substantially affect

443: our results.

444:

445: \item Our initial estimate of the angular width of a spike is

446: $1\,\arcsec$.  This also initializes a refinement by iterated variance

447: clipping and also has little effect on our results.

448:

449: \item We define the ``reasonable'' width of a diffraction spike to be

450: three times the initial estimate of $1\,\arcsec$.  If the adaptive

451: fitting process produces a width larger than this, the candidate spike

452: is rejected.

453: \end{itemize}
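
The measured spike parameters above can be collected into a few lines of Python. This is only a sketch: the constant and function names are hypothetical, and the numerical values are the ones quoted in the list. The power law gives an initial length of roughly $14\,\arcsec$ at $m=13\,\mag$ and roughly $180\,\arcsec$ at $m=7\,\mag$.

```python
MAG_LIMIT = 13.0                 # search only generating stars brighter than this
INITIAL_WIDTH_ARCSEC = 1.0       # initial estimate of the spike width
MAX_WIDTH_ARCSEC = 3.0 * INITIAL_WIDTH_ARCSEC   # "reasonable" width limit

def spike_length_arcsec(mag, scale=3500.0, base=1.53):
    """Initial spike length (arcsec) from the power law ell = 3500 * 1.53**(-m)."""
    return scale * base ** (-mag)

def is_candidate_generator(mag):
    """True if the star is bright enough to be searched for spikes."""
    return mag < MAG_LIMIT

def width_is_reasonable(fitted_width_arcsec):
    """Reject candidate spikes whose fitted width exceeds the limit."""
    return fitted_width_arcsec <= MAX_WIDTH_ARCSEC
```

Since the length only initializes the variance clipping, modest errors in `scale` or `base` should matter little, as noted in the list.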

454:

455: \subsubsection{Additional Spike Constraints}

456:

457: \begin{itemize}

458: \item The size of the local neighborhood constructed around each spike

459: is $2.5$ times the initial estimate of the spike's size.  This limits

460: the catalog entries considered in the subsequent analysis, though the

461: effect on our results is minimal.

462:

463: \item We required each spike to have entries in at least $2$

464: of the $4$ spike regions.

465:

466: \item We required the total area within the four spike regions

467: to be at least as dense in Catalog entries as the surrounding area.

468: \end{itemize}

469:

470: \subsubsection{Measured Halo Parameters}

471:

472: \begin{itemize}

473: \item We search for halos around generating stars brighter than

474: $7\,\mag$.  Our experiments have shown that halos do not appear around

475: stars dimmer than this.

476:

477: \item We discard any halo whose radius is outside the range of $240$ to

478: $410\,\arcsec$.  Direct inspection of the catalog shows that

479: reflection halos rarely appear outside of this range.

480:

481: \item Our initial estimate of the standard deviation of the radial

482: width of a halo is $1.8\,\arcsec$.  This is approximately the average

483: value to which our variance-clipping fitting algorithm converges.

484:

485: \item We discard any halo for which our variance-clipping fitting

486: algorithm computes a radial width larger than $4.5$ times the initial

487: estimate.

488: \end{itemize}

489:

490: \subsubsection{Additional Halo Constraints}

491:

492: \begin{itemize}

493: \item Each halo must contain at least $25$ catalog entries.

494:

495: \item The density of catalog entries in each halo annulus

496: must be at least $1.25$ times the density of the area near the halo.

497:

498: \item There must be entries present in the halo annulus every

499: ${\pi\over 4}\,\rad$. This forces all detected halos to be fully

500: circular, rather than just fragments of circles. More importantly,

501: this requirement prevents the false detection of halos near the edges

502: of healpixels, which would otherwise happen very often. Unfortunately,

503: this requirement prevents us from detecting any halo near the edges of

504: a healpixel.

505: \end{itemize}
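
The three additional halo constraints can be sketched in Python as below. Several details are assumptions made for illustration and are not specified in the text: the ``area near the halo'' is taken to be a pair of flanking annuli of width \texttt{outer\_pad}, and ``entries every ${\pi\over 4}\,\rad$'' is interpreted as requiring at least one entry in each of eight equal angular sectors. The function name and signature are hypothetical.

```python
import numpy as np

def halo_passes(r, theta, r0, half_width, outer_pad=20.0,
                min_count=25, min_density_ratio=1.25, n_sectors=8):
    """Apply count, density-contrast and angular-coverage tests to a candidate halo.

    r, theta: polar coordinates of entries about the generating star.
    r0, half_width: fitted halo radius and half-width of the accepted annulus.
    """
    r, theta = np.asarray(r), np.asarray(theta)
    in_ring = np.abs(r - r0) <= half_width
    if in_ring.sum() < min_count:
        return False                                   # too few entries
    ring_area = np.pi * ((r0 + half_width) ** 2 - (r0 - half_width) ** 2)
    # Assumed comparison region: flanking annuli just inside and outside the halo.
    near = (np.abs(r - r0) <= half_width + outer_pad) & ~in_ring
    near_area = np.pi * ((r0 + half_width + outer_pad) ** 2
                         - (r0 - half_width - outer_pad) ** 2) - ring_area
    if in_ring.sum() / ring_area < min_density_ratio * near.sum() / near_area:
        return False                                   # not dense enough
    sector = np.floor((theta[in_ring] % (2 * np.pi)) / (2 * np.pi / n_sectors))
    sector = np.minimum(sector.astype(int), n_sectors - 1)
    return len(np.unique(sector)) == n_sectors         # full angular coverage
```

The coverage test is what rejects partial arcs, including halos truncated by a healpixel boundary.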

506:

507: \subsection{Limitations}

508:

509: Limitations of our procedures include the following.

510: \begin{itemize}

511: \item The algorithm assigns hard labels to indicate that an entry is

512: spurious.  A future version of the algorithm could assign an

513: assessment of our \emph{confidence} that an entry is spurious.

514:

515: \item The algorithm processes each healpixel independently, and we

516: have not included a buffer region around the edges of the healpixels,

517: so there are minor edge effects: the algorithm is less likely to

518: detect spurious entries near the healpixel boundaries.  We expect this

519: to affect roughly $0.4\,\percent$ of the diffraction spikes and

520: $3.5\,\percent$ of the reflection halos.

521:

522: \item The algorithms are highly specialized to the typical data in the

523: \USNOB. If a small fraction of the data in the Catalog come from some

524: telescope with, for example, three rather than four supports for the

525: secondary, or very different internal reflections, the algorithms we

526: use would not detect the spurious features in those data.

527:

528: \item There are many hard settings of parameters, as discussed above.

529: Most of these are either just initializations for iterative procedures

530: or else set manually after an analysis of the data, but more

531: experimentation could have been performed if we had a substantial data

532: set in which the spurious entries had been reliably identified in

533: advance.

534:

535: \item Sometimes a diffraction spike that exists in multiple fields is

536: detected in a field whose orientation does not match the spike's orientation

537: as well as some other field. This is because the order in which we search

538: each field is arbitrary; we flag a detected diffraction spike upon

539: its first successful detection. This usually results in a detected

540: diffraction spike with an unusually wide angular width. Though this happens

541: frequently, its overall effect on the fidelity of our results is small.

542: A better solution would be to remove spikes in non-increasing order of

543: their resemblance to our model of a diffraction spike.

544:

545: \item We ought never consider as a generating star any star that was

546: marked spurious in the analysis of a brighter generating star. We

547: do not currently enforce this, which may produce some incorrect

548: identification of spurious entries.

549:

550: \item Many of these limitations could be overcome if we constructed

551: a complete generative model of

552: diffraction spikes and halos.  This would allow us to ``score'' potential spurious

553: detections with something approaching a \emph{probability} that they are spurious,

554: rather than simply cut at hard thresholds. This could also improve the fidelity of our results,

555: by allowing us to increase our statistical requirements of some parameters

556: of our generative model when a detected spike or halo fails to fit other

557: parameters. For example, if a possible halo appears at an uncommon radius,

558: a proper generative model would effectively put a stronger constraint on

559: other properties (such as the density of entries in the halo annulus)

560: in order for the entries to be marked as spurious with high probability.

561: Done well, this approach could also allow us to reduce the number of

562: individual requirements we impose on each detected spike and halo.

563: This would be aided by a set of hand-labeled spikes and non-spikes, with

564: which we could tune the generative model --- or which we could use as

565: input to some kind of discriminator which would tune the model automatically.

566:

567: \end{itemize}
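
One way to make the ``score'' in the last item concrete is sketched here.
This is an illustration only, not something we have implemented; the
symbols $S$, $\bar{S}$, and $d_k$ are introduced purely for this sketch,
and the measured properties $d_k$ of a candidate (for example halo radius
and the density of entries in the halo annulus) are assumed to be
conditionally independent.  Writing $S$ for ``spurious'' and $\bar{S}$ for
``real,'' Bayes's theorem gives the posterior log-odds
\begin{equation}
\ln\frac{p(S\,|\,d)}{p(\bar{S}\,|\,d)}
 = \ln\frac{p(S)}{p(\bar{S})}
 + \sum_k \ln\frac{p(d_k\,|\,S)}{p(d_k\,|\,\bar{S})} \quad .
\end{equation}
Under these assumptions, a property that fits the model poorly (such as an
uncommon halo radius) contributes a negative term to the sum, so the
remaining terms must be correspondingly larger for the candidate to be
marked as spurious with high probability.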

568:

569: \section{Results}

570:

571: The number of entries flagged as spurious on diffraction-spike grounds

572: is \numSpikes\ (\percentSpikes\ of the \USNOB) and on halo

573: grounds is \numHalos\ (\percentHalos).  Our grounds for declaring an

574: entry spurious are conservative in the sense that a spike or halo is

575: only treated as being detected if it passes a set of statistical

576: thresholds.

577:

578: The method works by marking as spurious all \USNOB\ entries in a set

579: of finite regions of the sky, with those sky regions adaptively fit to

580: the observed diffraction spike and reflection halo features.  Because

581: the total solid angle removed is non-zero, we expect some of the

582: entries we mark as spurious to in fact correspond to real sources. We

583: can estimate this in a representative healpixel: Healpixel 0 contains

584: $299573$ \USNOB\ entries; we flag as spurious $7924$ entries within a

585: set of regions comprising $1.5\times 10^{-5}\,\ster$ ($0.12\,\percent$

586: of the healpixel); we expect therefore some 300 of these to correspond

587: to real stars.  We tested this hypothesis with the

588: \TWOMASS\footnote{http://www.ipac.caltech.edu/2mass/}.  In this

589: healpixel there are $81089$ entries, of which we would expect

590: $\sim100$ to lie in the spurious area we've removed.  We find that

591: $82$ \TWOMASS\ entries match to a spurious \USNOB\ entry and no

592: non-spurious \USNOB\ entry, consistent with what we would expect

593: assuming a uniform distribution of \TWOMASS\ entries over the

594: healpixel. This count is probably an overestimate, because there are

595: some diffraction artifacts in the \TWOMASS\ that are similar to those

596: in \USNOB.  Our marking of spurious entries is aggressive in this

597: sense; as we noted in the Introduction, this is because for our

598: scientific purposes we require a catalog as clean of spurious entries

599: as possible.

600:
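As a rough check of the arithmetic above, the expected number of real
sources falling in the flagged regions can be written as follows.  This
is a sketch that assumes real sources are uniformly distributed over the
healpixel and uses only the rounded values quoted in the text; the
symbols $f$, $N_{\mathrm{tot}}$, and $N_{\mathrm{exp}}$ are introduced
here for illustration.
\begin{equation}
N_{\mathrm{exp}} \approx f\,N_{\mathrm{tot}} \quad , \qquad
f = 0.12\,\percent = 1.2\times 10^{-3} \quad ,
\end{equation}
which gives $N_{\mathrm{exp}}\approx 1.2\times 10^{-3}\times 299573\approx 360$
for the \USNOB\ and $N_{\mathrm{exp}}\approx 1.2\times 10^{-3}\times 81089\approx 97$
for the \TWOMASS, in line with the ``some 300'' and $\sim100$ quoted above.
If the counts are further assumed to be Poisson distributed, the expected
scatter on the latter is $\sqrt{97}\approx 10$, so the $82$ observed
matches lie within about $1.5$ standard deviations of the expectation.
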

601: Properties of the spurious entries we have identified are shown in

602: Figures~\ref{fig:spuriousStats}, \ref{fig:spikeProperties}, and

603: \ref{fig:haloProperties}, including the numbers and fractions of

604: spurious entries as a function of generating star magnitude, and

605: distributions of spikes and halos in size and on the sky.  These

606: figures show a number of important regularities, for example that

607: brighter stars have larger diffraction spikes (as expected), that the

608: widths of the spikes are not a function of generating star magnitude

609: (also as expected), and that both the number of spurious entries and

610: our ability to robustly detect them are functions of sky position

611: (mainly because of the Galactic Plane).

612: Figure~\ref{fig:haloProperties} shows that there are two different

613: dominant halo radii, one for the North and one for the South;

614: presumably this indicates differences in the hardware used for each

615: hemisphere.

616:

617: At the outset, we imagined that we could remove these spurious entries

618: trivially using the photometric properties listed in the Catalog.  For

619: example, there is no reason in principle that a spurious entry would

620: obtain a reasonable color or pass star--galaxy separation.  In

621: Figure~\ref{fig:spuriousProperties}, we show the distribution of the

622: spurious entries in photometric properties such as magnitude, color,

623: and star--galaxy separation.  This Figure shows that it would not have

624: been possible to identify the spurious entries on photometric grounds,

625: including even the \emph{number} of images with detections.

626: Presumably the reasonable colors and large numbers of overlapping

627: images in which the stars are detected result from the great stability

628: of the hardware and software employed in the construction of the

629: \USNOB.  It would have been extremely difficult to reliably identify

630: the spurious entries without automatic computer-vision techniques like

631: those employed in this project.

632:

633: Associated with this paper is a small amount of computer code, the

634: information required to clean the \USNOB\ of the spurious entries we

635: identified, and some methods for accessing our cleaned version of the

636: \USNOB.  All of these are available at the \an\ web

637: site\footnote{http://astrometry.net/cleanusnob/}.

638:

639: \acknowledgments We are very grateful to Dave Monet and the team that

640: created the \USNOB, which is one of astronomy's most productive and

641: useful resources.  We benefitted from useful discussions with Mike

642: Blanton, Keir Mierle, and David Warde-Farley, and from the

643: constructive comments of our anonymous referee.  DWH was partially

644: supported by the National Aeronautics and Space Administration (NASA;

645: grant NAG5-11669) and the National Science Foundation (NSF; grant

646: AST-0428465).  This research made use of the NASA Astrophysics Data

647: System, and the US Naval Observatory Precision Measuring Machine Data

648: Archive.

649:

650: \begin{thebibliography}{70}

651: \bibitem[G{\'o}rski \etal(2005)]{gorski05a}

652: G{\'o}rski,~K.~M., Hivon,~E., Banday,~A.~J., Wandelt,~B.~D.,

653: Hansen,~F.~K., Reinecke,~M., \& Bartelmann,~M.,

654: 2005, \apj, 622, 759

655: \bibitem[Hampel \etal(1986)]{hampel86}

656: Hampel,~F.~R., Ronchetti,~E.~M., Rousseeuw,~P.~J., \& Stahel,~W.~A.,

657: 1986, \textit{Robust Statistics:\ The Approach Based on Influence Functions,}

658: Wiley, New York

659: \bibitem[H{\o}g \etal(2000)]{hog00a}

660: H{\o}g,~E., \etal,

661: 2000, \aap, 355, L27

662: \bibitem[Lang \etal(2007)]{lang07a}

663: Lang,~D., Hogg,~D.~W., Mierle,~K., Blanton,~M., \& Roweis,~S.,

664: 2007, Science, submitted

665: \bibitem[Monet \etal(2003)]{monet03a}

666: Monet,~D.~G., \etal,

667: 2003, \aj, 125, 984

668: \bibitem[Storkey \etal(2004)]{storkey04a}

669: Storkey,~A.~J., Hambly,~N.~C., Williams,~C.~K.~I., \& Mann,~R.~G.,

670: 2004, \mnras, 347, 36

671: \end{thebibliography}

672:

673: \clearpage

674: \begin{figure}

675: 	\hbox{

676: 		\hbox{\resizebox{\threewidth}{!}{\includegraphics{f1a.eps}}}

677: 		\hbox{\resizebox{\threewidth}{!}{\includegraphics{f1b.eps}}}

678: 		\hbox{\resizebox{\threewidth}{!}{\includegraphics{f1c.eps}}}

679: 	}

680: \caption{Subimages of three of the nine scanned plates that overlap a

681: small patch of sky centered around (RA,Dec)=(341.8, -81.4)~deg (J2000)

682: from which part of \USNOB\ was created, retrieved from the US Naval

683: Observatory Precision Measuring Machine Data Archive. Note the

684: different orientations of the diffraction spikes generated by brighter

685: stars, and the multiple halos surrounding the brightest star.

686: \label{fig:skyPatchSource}}

687: \end{figure}

688:

689: \begin{figure}

690: \hbox{

691: 	\hbox{

692: 		\vbox{

693: 			\hbox{\resizebox{\threewidth}{!}{\includegraphics{f2a.eps}}}

694: 			\hbox{\resizebox{\threewidth}{!}{\includegraphics{f2b.eps}}}

695: 			}

696: 		}

697: 	\hbox{

698: 		\resizebox{\twothreewidth}{!}{\includegraphics{f2c.eps}}

699: 	}

700: }

701: \caption{The same small patch of sky as pictured in

702: Figure~\ref{fig:skyPatchSource}, taken from the \USNOB, in tangent-plane

703: coordinates relative to a tangent point at the center of the

704: containing healpixel, in units of $\arcsec$.  The bright stars in this

705: patch have multiple sets of diffraction spikes because they lie in a

706: sky region where plates taken at different orientations overlap.

707: \textsl{Upper left panel:} All \USNOB\ entries in this patch.

708: \textsl{Right panel:} The same patch with dark points showing catalog

709: entries marked as spurious by either the diffraction spike or

710: reflection halo criteria described in the text.  \textsl{Lower left

711: panel:} The same patch, with only non-spurious entries shown.

712: \label{fig:skyPatch}}

713: \end{figure}

714:

715: \begin{figure}

716: \resizebox{\threewidth}{!}{\includegraphics{f3a.eps}}%

717: \resizebox{\threewidth}{!}{\includegraphics{f3b.eps}}%

718: \resizebox{\threewidth}{!}{\includegraphics{f3c.eps}}

719: \resizebox{\threewidth}{!}{\includegraphics{f3d.eps}}%

720: \resizebox{\threewidth}{!}{\includegraphics{f3e.eps}}%

721: \resizebox{\threewidth}{!}{\includegraphics{f3f.eps}}

722: \caption{A narrative demo of finding a single spike within the patch of sky shown in Figure~\ref{fig:skyPatch}.

723: \textsl{Upper left panel:} The composite profile of the current

724: field, with the field's estimated orientation highlighted.

725: The profile is dominated by the spike

726: of the current generating star, which is the largest in the field.

727: \textsl{Upper center panel:} The neighborhood surrounding the

728: current generating star, with all entries in all fields shown.

729: \textsl{Upper right panel:} The same neighborhood, with only entries

730: in the current field shown. The dominant orientation from this field's

731: composite profile is highlighted.

732: \textsl{Lower left panel:} All four directions of the spike, collapsed

733: into one ${\pi\over 2}\,\rad$ profile.

734: The dashed lines outline the areas encompassed at 1, 2, and 3 times the

735: root-variance of the Gaussian used to initialize the variance clipping.

736: The solid rectangle is the area encompassed at

737: 2.5\,sigma, which is the threshold for flagging entries. The solid rectangle

738: is extended all of the way to the bottom, as it is assumed that all of the

739: spike profile between the spike cluster and the generating star is also

740: spurious.

741: \textsl{Lower center panel:} The same profile, after the Gaussian has

742: been fit using variance clipping. The solid rectangle shown here

743: is the final area we use for determining if entries are flagged.

744: \textsl{Lower right panel:} The neighborhood surrounding the generating

745: star with all entries shown, and with the newly-flagged spike entries darkened.

746: The diffraction spikes in the other orientations come from different fields and

747: are flagged by later passes of the algorithm.

748: \label{fig:demoRun}}

749: \end{figure}

750:

751: \clearpage

752: \begin{figure}

753: \resizebox{\twowidth}{!}{\includegraphics{f4a.eps}}%%

754: \resizebox{\twowidth}{!}{\includegraphics{f4b.eps}}\\

755: % dstn asks: can you make the legend show the dark thin bars as a dark thin patch?

756: % (right now it only shows the different in darkness, not of thickness)

757: % jon replies: Yeah, no, I can't, or not easily, at least. It's on my queue.

758: \caption{Statistics of spurious entries.

759: \textsl{Left panel:} The fraction of all entries marked as spurious

760: as a function of generating star magnitude.

761: \textsl{Right panel:} The mean number of entries marked as spurious

762: per generating star as a function of generating star magnitude.%

763: \label{fig:spuriousStats}}

764: \end{figure}

765:

766: \clearpage

767: \begin{figure}

768: \resizebox{\twowidth}{!}{\includegraphics{f5a.eps}}%

769: \resizebox{\twowidth}{!}{\includegraphics{f5b.eps}}\\

770: \resizebox{\twowidth}{!}{\includegraphics{f5c.eps}}%

771: \resizebox{\twowidth}{!}{\includegraphics{f5d.eps}}

772: \caption{Regularities of spurious catalog entries in the \USNOB\

773: identified as caused by diffraction spikes.  \textsl{Top-left

774: panel:}~Two-dimensional histogram showing the adaptively fit radial

775: lengths of the spikes, found by iterative variance-clipping, as a function

776: of generating star magnitude.  Each vertical column in the histogram

777: is independently normalized.  The solid line shows the

778: value used to initialize the adaptive fitting.  \textsl{Top-right

779: panel:}~Similar two-dimensional histogram but showing the adaptively

780: fit widths of the spikes, as a function of generating star magnitude.

781: The solid line shows the initial value. \textsl{Bottom-left

782: panel:}~The two-dimensional solid-angular density on the sky of

783: spurious entries identified as parts of diffraction spikes as a

784: function of sky position (shown as an ``unwrapped'' $\Nside=9$ healpixel

785: grid). The darker a healpixel is, the more spurious ``spike'' entries it contains.

786: \textsl{Bottom-right panel:}~The same, but shown relative to

787: the number of catalog entries in that healpixel. In the two sky

788: density plots, a North--South asymmetry is visible, as well as the

789: Galactic plane.

790: \label{fig:spikeProperties}}

791: \end{figure}

792:

793: \clearpage

794: \begin{figure}

795: \resizebox{\twowidth}{!}{\includegraphics{f6a.eps}}%

796: \resizebox{\twowidth}{!}{\includegraphics{f6b.eps}}\\

797: \resizebox{\twowidth}{!}{\includegraphics{f6c.eps}}%

798: \resizebox{\twowidth}{!}{\includegraphics{f6d.eps}}

799: \centerline{\resizebox{\onewidth}{!}{\includegraphics{f6e.eps}}}

800: \caption{Regularities of spurious catalog entries identified as caused

801: by reflection halos.  \textsl{Top-left panel:}~Two-dimensional

802: histogram showing adaptively fit reflection-halo radii as a function

803: of generating star magnitude.  Each vertical column of the histogram

804: has been independently normalized.  \textsl{Top-right panel:}~Similar

805: two-dimensional histogram but showing the adaptively fit widths of the

806: halo annuli.

807: \textsl{Middle-left panel:}~Two-dimensional solid-angular density on

808: the sky of spurious entries identified as parts of reflection halos.

809: The darker a healpixel is, the more spurious ``halo'' entries it contains.

810: \textsl{Middle-right panel:}~The same but shown relative to the number

811: of catalog entries in that healpixel.

812: \textsl{Wide bottom panel:}~Two-dimensional histogram showing that

813: each of the two principal halo radii is in one hemisphere of the sky.

814: \label{fig:haloProperties}}

815: \end{figure}

816:

817: \clearpage

818: \begin{figure}

819: \resizebox{\twowidth}{!}{\includegraphics{f7a.eps}}%%

820: \resizebox{\twowidth}{!}{\includegraphics{f7b.eps}}\\

821: \resizebox{\twowidth}{!}{\includegraphics{f7c.eps}}%%

822: \resizebox{\twowidth}{!}{\includegraphics{f7d.eps}}

823: \caption{Spurious catalog entries are not obvious from their basic

824: photometric properties.

825: \textsl{Top-left panel:} Number of the five

826: bands in the \merged\ in which entries show detections, for all and

827: spurious entries.

828: \textsl{Top-right panel:} Magnitude distribution,

829: for all and spurious entries with detections in the $F$ band.

830: \textsl{Bottom-left panel:} $J-F$ color distribution, for all and spurious

831: entries with detections in the $J$ and $F$ bands.

832: \textsl{Bottom-right panel:} \USNOB\ $F$-band star--galaxy separator

833: quantity, for all and spurious entries with detections in the $F$ band.%

834: \label{fig:spuriousProperties}}

835: \end{figure}

836:

837: \end{document}

838: