1: \documentclass[12pt,preprint]{aastex}
2: \usepackage{natbib}
3: \usepackage{ifthen}
4: \newcounter{address}
5: \newcommand{\latin}[1]{\textit{#1}}
6: \newcommand{\ie}{\latin{ie}}
7: \newcommand{\eg}{\latin{eg}}
8: \newcommand{\cf}{\latin{cf}}
9: \newcommand{\etc}{\latin{etc}}
10: \newcommand{\etal}{\latin{et~al}}
11: \newlength{\threewidth}
12: \setlength{\threewidth}{0.333\textwidth}
13: \newlength{\twowidth}
14: \setlength{\twowidth}{0.499\textwidth}
15: \newlength{\twothreewidth}
16: \setlength{\twothreewidth}{0.666\textwidth}
17: \newlength{\onewidth}
18: \setlength{\onewidth}{1.0\textwidth}
19: \newcommand{\Nside}{N_{\mathrm{side}}}
20: \newcommand{\unit}[1]{\mathrm{#1}}
21: \renewcommand{\mag}{\unit{mag}}
22: \newcommand{\rad}{\unit{rad}}
23: \renewcommand{\arcsec}{\unit{arcsec}}
24: \newcommand{\ster}{\unit{ster}}
25: \newcommand{\percent}{\unit{percent}}
26: \newcommand{\Tycho}{Tycho-2}
27: \newcommand{\USNOB}{USNO-B Catalog}
28: \newcommand{\TWOMASS}{2MASS PSC Catalog}
29: \newcommand{\merged}{merged \Tycho\ and \USNOB s}
30: \newcommand{\an}{\textsl{Astrometry.net}}
31: \newcommand{\numAllStars}{1,045,175,762}
32: \newcommand{\numUSNOBStars}{1,042,618,261}
33: \newcommand{\numSpikes}{24,148,382}
34: \newcommand{\numHalos}{196,133}
35: \newcommand{\percentSpikes}{$2.3\,\percent$} % 2.316128818
36: \newcommand{\percentHalos}{$0.02\,\percent$} % 0.01876555
37:
38: \begin{document}
39: \title{
40: Cleaning the USNO-B Catalog
41: through automatic detection of optical artifacts
42: }
43: \author{
44: Jonathan~T.~Barron\altaffilmark{\ref{Toronto}},
45: Christopher~Stumm\altaffilmark{\ref{Toronto}},
46: David~W.~Hogg\altaffilmark{\ref{NYU},\ref{email}},
47: Dustin~Lang\altaffilmark{\ref{Toronto}},
48: Sam~Roweis\altaffilmark{\ref{Toronto},\ref{Google}}
49: }
50:
51: \setcounter{address}{1}
52: \altaffiltext{\theaddress}{\stepcounter{address}\label{Toronto}
53: Department of Computer Science, University of Toronto,
54: 6 King's College Road, Toronto, Ontario, M5S~3G4 Canada}
55: \altaffiltext{\theaddress}{\stepcounter{address}\label{NYU}
56: Center for Cosmology and Particle Physics, Department of Physics, New
57: York University, 4 Washington Place, New York, NY 10003}
58: \altaffiltext{\theaddress}{\stepcounter{address}\label{email}
59: To whom correspondence should be addressed: \texttt{david.hogg@nyu.edu}}
60: \altaffiltext{\theaddress}{\stepcounter{address}\label{Google}
61: Google, Mountain View, CA}
62:
63: \begin{abstract}
64: The USNO-B Catalog contains spurious entries that are caused by
65: diffraction spikes and circular reflection halos around bright stars
66: in the original imaging data. These spurious entries appear in the
67: Catalog as if they were real stars; they are confusing for some
68: scientific tasks. The spurious entries can be identified by simple
69: computer vision techniques because they produce repeatable patterns on
70: the sky. Some techniques employed here are variants of the Hough
71: transform, one of which is sensitive to (two-dimensional)
72: overdensities of faint stars in thin right-angle cross patterns
73: centered on bright ($<13\,\mag$) stars, and one of which is sensitive
74: to thin annular overdensities centered on very bright ($<7\,\mag$)
75: stars. After enforcing conservative statistical requirements on
spurious-entry identifications, we find that, of the 1,042,618,261
entries in the USNO-B Catalog, 24,148,382 ($2.3\,\percent$)
78: are identified as spurious by diffraction-spike criteria and 196,133
79: ($0.02\,\percent$) are identified as spurious by reflection-halo
80: criteria. The spurious entries are often detected in more than 2
81: bands and are not overwhelmingly outliers in any photometric
82: properties; they therefore cannot be rejected easily on other grounds,
83: \ie, without the use of computer vision techniques. We demonstrate our
84: method, and return to the community in electronic form a table of spurious
85: entries in the Catalog.
86: \end{abstract}
87:
88: \keywords{
89: astrometry ---
90: catalogs ---
91: methods:~statistical ---
92: standards ---
93: techniques:~image~processing
94: }
95:
96: \section{Introduction}
97:
98: The \USNOB\ \citep{monet03a} is an astrometric catalog containing
99: information on $\sim10^{9}$ stars. The original imaging data taken
100: for this catalog come exclusively from photographic plates, taken from
101: several different surveys operating over many decades. These plates
102: were uniformly scanned and automated source detection was performed on
103: the scans. From the sources detected in the scans, the Catalog was
104: constructed in a relatively ``inclusive'' way. The sources were
105: required to be compact, and to show detections in more than one band
106: of the five bands ($O,E,J,F,N$) from which the Catalog was constructed.
107: However, the original plate images contained many artifacts, defects,
108: trailed satellites, and large, resolved sources such as nearby
109: galaxies, nebulae, and star clusters. Some of the entries in the
110: \USNOB\ do not correspond to real, independent, astronomical sources
111: but rather to arbitrary parts of extended sources, or fortuitously
112: coincident (across bands) data defects or artificial features. Though
113: compact galaxies can be used along with stars for astrometric science,
114: the artificial features recorded as stars are at best useless---and at
115: worst damaging---to scientific projects undertaken with the \USNOB.
116:
117: That said, the \USNOB\ is a tremendously important and productive tool
118: as the largest visual ($BRI$) all-sky catalog for astrometric science
119: available at the present day. Users of the Catalog benefit from its
120: careful construction, its connection to the absolute astrometric
121: reference frame, and the long time baseline of its originating data.
122:
123: Our group is using the Catalog for the ambitious \an\ project
124: \citep{lang07a} in which we locate ``blind'' the position,
125: orientation, and scale of images with little, no, or corrupted
126: astrometric meta-data. For the \an\ project, we need the input
127: astrometric catalog to have as few spurious entries as possible.
128: Indeed, in our early work, most of the
129: ``false positive'' results from our blind astrometry system involved
130: spurious alignments of linear defects in submitted images with
131: lines of spurious entries in the \USNOB\ coming from diffraction
132: spikes near bright stars. For this reason, we found it necessary to
133: ``clean'' the Catalog of as many spurious entries as we can identify
134: by their configurations on the two-dimensional plane of the sky. In
135: what follows we describe how we identified two large classes of
136: spurious entries, thereby greatly improving the value of the Catalog
137: for our needs.
138:
139: The most analogous prior work in the astronomical literature is a
140: cleaning of the SuperCOSMOS Sky Survey using sophisticated computer
141: vision and machine learning techniques \citep{storkey04a}. Our work
142: is less general because we have specialized our detection algorithms
143: to the specific morphologies of the features we know to be present in
144: the \USNOB. This specialization is possible because the vetting
145: procedure employed in the construction of the \USNOB\ has eliminated
146: most of the defects (satellite trails, dirt, and scratches) that have
147: unpredictable morphologies. This specialization has the great
148: advantage that it permits us to detect image defects composed of small
149: numbers of catalog entries, which would not be statistically
150: identifiable if we did not have a strong \emph{a priori} model for their
151: morphologies.
152:
153: In what follows, we will treat the \USNOB\ as a collection of catalog
154: ``entries'', which are rows in a (large) table. Most of these entries
155: correspond to ``stars'', which are hot balls of hydrogen in space, or
156: compact galaxies, which are extremely distant collections of stars,
157: but which will also be referred to as ``stars'' because from the point
158: of view of astrometric calibration they behave the same as stars.
159: Catalog entries that do not correspond to stars or individual compact
160: galaxies are considered by us to be ``spurious''. We identify
161: some fraction of the spurious entries in the \USNOB\ by exploiting the
162: repeatable configurations they show around bright stars.
163:
164: \section{Spurious catalog entries}
165:
166: The \USNOB\ was constructed from imaging in five bands ($O$, $E$, $J$,
167: $F$, $N$) at two broad epochs ($O$, $E$ at first epoch, $J$, $F$, $N$
168: at second), taken with plate centers on a (fairly) regular grid of the
169: sky. The plate imaging comprising the original data for the Catalog
170: is heterogeneous (in camera or survey origin and in data quality); in
171: order to guard against spurious entries, the construction of the
172: Catalog required detection of sources in multiple bands. However, some
173: spurious catalog entries survived this requirement.
174:
175: \subsection{Diffraction spikes}
176:
177: The diffraction-limited point-spread function of a physical telescope
178: is related to the Fourier transform of the entrance aperture. In this
179: transform, the thin cross-like support structure holding the secondary
180: mirror in the entrance aperture produces a large cross-like pattern in
181: the stellar point-spread function (PSF). The sources automatically
182: extracted from the scans of the photographic plate images include many
183: spurious features that are in fact just detections of these diffraction
184: spikes (Figures~\ref{fig:skyPatchSource} and \ref{fig:skyPatch}).
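
For reference, the Fraunhofer relation behind this statement can be
written schematically as follows. The notation is introduced only for
this illustration: $A(\vec{x})$ is the transmission of the entrance
aperture at pupil-plane position $\vec{x}$, $\vec{\theta}$ is the
angular offset on the sky from the star, and $\lambda$ is the
wavelength. In the small-angle, monochromatic limit,
\begin{equation}
I(\vec{\theta}) \propto \left| \int A(\vec{x})\,
  \exp\left(-\frac{2\pi i}{\lambda}\,\vec{\theta}\cdot\vec{x}\right)
  \,d^2x \right|^2 .
\end{equation}
In this idealized picture a straight support vane of width $w$ and
length $L$ scatters light into a streak perpendicular to the vane, with
angular extent of order $\lambda/w$ and angular thickness of order
$\lambda/L$; two perpendicular vanes crossing the aperture would then
give the four-armed cross.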
185:
186: The survey cameras that took the imaging data used to construct the
187: \USNOB\ are on equatorial mounts and have no capability for rotation
188: of the support structure relative to the sky once the pointing of the
189: telescope is set. The diffraction spikes for any two images
190: taken by the same camera at the same pointing are therefore always aligned. For
191: this reason, spurious stars detected as part of one of these spikes in
192: one image in one band often line up with spurious stars detected in
193: the corresponding spike in some other band. Some spurious ``spike''
194: catalog entries thereby satisfy the \USNOB\ vetting requirement that
195: catalog entries have cospatial counterparts in multiple bands.
196:
197: Fortunately, spurious spike entries can be identified on the basis of
198: morphological regularities in the two-dimensional distribution on the
199: sky of the spurious catalog entries they generate. These regularities
200: include the following: \textsl{(1)}~Diffraction spikes are centered on
201: bright ($<13\,\mag$) stars. In what follows, the central star for a
202: diffraction spike will be referred to as the ``generating star''.
203: \textsl{(2)}~Because telescope supports are usually four perpendicular
204: rods, each diffraction spike generated by a bright star has four lines
205: at right angles to one another. \textsl{(3)}~The diffraction spike
206: brightness is proportional to the brightness of the generating star,
207: but each spike becomes fainter with angular distance from the
208: generating star. Given that sources extracted from the scanned plates
209: are detected to some limiting brightness, the angular length of a
210: diffraction spike is closely related to the magnitude of the
211: generating star (Figure~\ref{fig:spikeProperties}). \textsl{(4)}~The
212: angular width of a diffraction spike is narrow, so the two-dimensional
213: density of spurious spike entries can be very large. The angular width
214: is set by physical optics and is therefore roughly independent of the
215: magnitude of the generating star. \textsl{(5)}~The orientation of the
diffraction spike pattern is roughly common to all spikes in images
taken by the same camera at the same pointing.
218: We can use the regularities among diffraction spikes to guide a
219: sensitive, automated search.
220:
221: Each \USNOB\ entry is tagged with a survey identifier and one or more
222: field numbers corresponding to the plates in that survey in which it
223: was detected. Because all diffraction spikes in one field will share
224: the same orientation and properties, we analyze the \USNOB\ entries
225: one field at a time. In this context, we consider an entry to belong
226: to a particular field if any of its photometric measurements has been
227: given that field number.
228:
229: \subsection{Reflection halos}
230:
231: The brightest stars in the \USNOB\ are surrounded not just by
232: diffraction spikes but by a thin circular ring or ``halo''. This halo
233: is caused by internal reflections in the camera. Because this has a
234: geometric-optics rather than a physical-optics origin, the halo radius
235: is not a function of the wavelength of the imaging bandpass. This
236: means that spurious ``halo'' catalog entries can easily be present and
237: cospatial in multiple bands and thereby pass the \USNOB\ vetting
238: process (Figures~\ref{fig:skyPatchSource} and \ref{fig:skyPatch}).
239:
240: Again, the spurious catalog entries can be identified by the patterns
241: they make on the sky. Regularities include the following:
242: \textsl{(1)}~Halos are centered on extremely bright ($<7\,\mag$)
243: generating stars. \textsl{(2)}~Halos have a circular or near-circular
244: shape. \textsl{(3)}~Because they are very thin in the radial
245: direction, spurious halo entries have high two-dimensional density on
246: the sky. \textsl{(4)}~The spurious halo entries are usually close to
247: making up full circles, and only rarely appear in just a fragment of a
248: circle. These regularities permit a sensitive search.
249:
250: \subsection{Other spurious entries}
251:
252: In addition to the spikes and halos we address above, there are other
253: categories of spurious catalog entries with other origins, including
254: but not limited to the following: \textsl{(1)}~There are some lines of
255: entries from fortuitously coincident features
256: (scratches, trails, handwriting, and
257: other artifacts) on overlapping plates. \textsl{(2)}~There are some
258: duplicate entries for individual stars in sky regions where two fields
259: overlap. These are cases in which individual stars detected in multiple
260: fields have not been correctly identified as identical.
261: \textsl{(3)}~There are quasi-spurious clusters of entries in and
262: around extended objects such as galaxies, nebulae, and globular
263: clusters.
264:
We do not address any of these spurious features, in part because they
do not have regularities that lend themselves to the computer-vision
techniques we employ in finding the previously mentioned defects.
268: They also represent a much smaller fraction of the \USNOB\
269: entries than the spurious entries from diffraction spikes and
270: reflection halos.
271:
Of course the \USNOB\ also contains many entries that are in fact
273: compact galaxies rather than stars. However, these entries are
274: \emph{not} spurious from our perspective, since compact
275: galaxies are as good as---or better than---stars for our \an\
276: astrometric calibration efforts, and most other astrometric calibration
277: tasks.
278:
279: \section{Methods}
280:
281: The Catalog we begin with is not the unmodified \USNOB, but rather the
282: \USNOB\ with the \Tycho\ Catalog \citep{hog00a} stars
283: re-inserted by us from the official \Tycho\ Catalog release. We were
284: forced to perform this operation because in the official \USNOB\
285: release, the \Tycho\ Catalog stars were added in an undocumented
286: binary format.
287:
288: \subsection{Diffraction Spikes}
289:
290: We begin by dividing the Catalog into a fine healpix \citep{gorski05a}
291: grid, and projecting the entries in each healpixel onto planes tangent
292: to each healpixel's center. For each entry we calculate the average
293: $m$ of all magnitudes of all bands in which the entry has been
294: detected, and we find the union of all fields in which the entry is
295: present.
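
The projection is not specified further here. Assuming the standard
gnomonic (tangent-plane) projection about a healpixel center at
$(\alpha_0,\delta_0)$, an entry at right ascension and declination
$(\alpha,\delta)$ would map to tangent-plane coordinates $(\xi,\eta)$,
symbols introduced only for this illustration, given by
\begin{eqnarray}
\xi & = & \frac{\cos\delta\,\sin(\alpha-\alpha_0)}
  {\sin\delta_0\,\sin\delta+\cos\delta_0\,\cos\delta\,\cos(\alpha-\alpha_0)} , \\
\eta & = & \frac{\cos\delta_0\,\sin\delta-\sin\delta_0\,\cos\delta\,\cos(\alpha-\alpha_0)}
  {\sin\delta_0\,\sin\delta+\cos\delta_0\,\cos\delta\,\cos(\alpha-\alpha_0)} ,
\end{eqnarray}
with $\xi$ and $\eta$ in radians. Under this projection great circles
map to straight lines, which is convenient when searching for linear
features.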
296:
297: For each field present, we construct a ``profile'' of the field's
298: largest diffraction spikes, by overlaying the local neighborhoods of
299: the ten brightest stars in the field. Given the regularities
300: discussed above, we can expect all spikes in each field to have the
301: same orientation. Therefore, each composite profile has one dominant
302: orientation, which is more apparent than in any single star's
303: neighborhood. To find each field's orientation, we first convert the
304: composite profile into polar coordinates, collapse the angles of each
305: point into a ${\pi\over 2}\,\rad$ range, and calculate a rough
306: histogram of the resulting angles. The angle with the most densely
307: populated bin is used as an initial guess of the field's orientation,
308: which is then re-estimated using an iteratively reweighted least
309: squares (IRLS) fitting algorithm for robust M-estimation
310: \citep{hampel86}. The M-estimation is guaranteed to converge to an
311: estimate of the orientation that locally minimizes a total cost
312: $\sum_k \rho(e_k)$ where $e_k$ is the angular distance of entry $k$
from the estimated orientation. We employ a Geman-McClure (GM) cost
314: function $ \rho(e_k) = e_k^2/(\sigma^2 + e_k^2) $, where $\sigma$ is
315: the initial guess of the root-variance of the angular width of a
316: spike. This GM cost function replaces the standard least-squares cost
317: function $\rho(e_k) = e_k^2/\sigma^2$ and thereby downweights
318: outliers. The resulting angle is a very robust and precise estimate
319: of the average orientation of all diffraction spikes present in the
320: field.
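
The orientation estimate described above can be sketched in a few
lines of Python. This is a minimal illustration under stated
assumptions rather than the implementation used for this work: the
function names \texttt{fold\_residual} and
\texttt{estimate\_orientation}, the bin count, the iteration limit, and
the default $\sigma=0.02\,\rad$ are hypothetical choices made for the
example. For the GM cost the IRLS weight is
$w_k\propto\rho'(e_k)/e_k\propto\sigma^2/(\sigma^2+e_k^2)^2$, and each
iteration shifts the orientation by the weighted mean of the residuals
$e_k$.

\begin{verbatim}
```python
import math

HALF_PI = math.pi / 2

def fold_residual(angle, theta):
    """Signed angular distance from orientation theta, modulo pi/2."""
    return (angle - theta + math.pi / 4) % HALF_PI - math.pi / 4

def estimate_orientation(points, sigma=0.02, nbins=45, niter=100):
    """Coarse histogram guess, then IRLS with Geman-McClure weights.

    points: iterable of (x, y) offsets from the stacked bright stars.
    sigma:  assumed angular width of a spike in radians (illustrative).
    """
    # Collapse all position angles into a pi/2 range.
    angles = [math.atan2(y, x) % HALF_PI for x, y in points]
    counts = [0] * nbins
    for a in angles:
        counts[min(int(a / HALF_PI * nbins), nbins - 1)] += 1
    peak = counts.index(max(counts))
    theta = (peak + 0.5) * HALF_PI / nbins   # initial guess: densest bin
    for _ in range(niter):
        res = [fold_residual(a, theta) for a in angles]
        # IRLS weight rho'(e)/e for rho = e^2/(sigma^2 + e^2),
        # dropping a constant factor of 2.
        wts = [sigma ** 2 / (sigma ** 2 + e * e) ** 2 for e in res]
        step = sum(w * e for w, e in zip(wts, res)) / sum(wts)
        theta = (theta + step) % HALF_PI
        if abs(step) < 1e-12:
            break
    return theta
```
\end{verbatim}

Calling \texttt{estimate\_orientation} on a list of $(x,y)$ offsets
returns an angle in the range $[0,{\pi\over 2})\,\rad$; entries far
from the dominant orientation receive weights that fall off as
$e_k^{-4}$ and so have little influence on the result.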
321:
322: We iterate over fields, using our estimation of each field's dominant
323: orientation to rotate the entries present in each field such that the
324: diffraction spikes present become axis-aligned on average, making
325: their detection much easier. Because there is sometimes some
326: discrepancy between the position of the diffraction spike's generating
327: star and the center of the diffraction spike, we perform a robust
328: estimation of the centerpoint of the spike, just as we did in
329: estimating the orientations of the field profiles. With the
330: diffraction spike axis-aligned and zero-centered, we collapse all of
331: the entries in the neighborhood of the diffraction spike into a single
332: composite of all four ``corners'' (as if we were to convert the
333: neighborhood to polar coordinates, and collapse their angles into a
334: ${\pi\over 2}\,\rad$ range), thereby reducing the four-part
335: diffraction spike to a single dense cluster of points.
336:
337: We found a power-law approximation to the relationship between the
338: magnitude of the generating star and the angular extent of the
339: diffraction spike it generates among spurious entries. This was found
340: by initially hand-labeling a small subset of the data, making a crude
341: fit to the hand-labeled data, then later refining the estimate using
342: the results of our algorithm. Given the magnitude of a generating
343: star, we are able to use this relationship to estimate the angular
344: extent of the spike we would expect that generating star to produce.
345: As previously mentioned, the width of each spike is roughly
346: independent of the magnitude of the generating star, and is therefore
347: initialized to a constant value. This estimate of the center and
extent of the spike is used to initialize a two-dimensional Gaussian,
349: which is then fit to the entries belonging to the diffraction spike
350: using iterated variance clipping at 2.5\,sigma. What we construct is
351: not a traditional multivariate normal distribution, which would assume
352: that the data lies in an elliptical distribution, but is instead a
353: ``rectangular'' distribution. That is, we consider an entry to be
354: within the Gaussian distribution if it is simultaneously within
355: 2.5\,sigma of the width and 2.5\,sigma of the length of the
356: distribution. When the Gaussian converges to its final parameters, we
357: take the rectangular area within 2.5\,sigma, and extend its range
358: towards the generating star to cover all entries between the area and
359: the generating star at the center of the spike. If this area's
360: angular width, length, and position all pass a set of thresholds,
361: detailed later, we flag all of the entries within it (excluding the
362: generating star and any \Tycho\ stars, which we assume are not
363: spurious) as potential spurious entries. If these entries pass a set
364: of thresholds (detailed below) they are marked as spurious.
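
One way to write this acceptance region, in notation introduced only
for this illustration, is the following. Let $(u_k,v_k)$ be the
along-spike and across-spike coordinates of entry $k$ in the folded,
axis-aligned frame, and let $(\mu_u,\mu_v)$ and $(\sigma_u,\sigma_v)$
be the current center and widths. The ``rectangular'' membership set
is
\begin{equation}
\mathcal{S} = \left\{ k : |u_k-\mu_u| \leq 2.5\,\sigma_u
  \;\mathrm{and}\; |v_k-\mu_v| \leq 2.5\,\sigma_v \right\} ,
\end{equation}
in contrast to the elliptical region
$(u_k-\mu_u)^2/\sigma_u^2+(v_k-\mu_v)^2/\sigma_v^2 \leq 2.5^2$
that a conventional axis-aligned bivariate normal would imply.
Assuming the re-estimation uses the sample mean and variance of the
current members (the update rule is not spelled out above), each
iteration sets
\begin{eqnarray}
\mu_u & \leftarrow & \frac{1}{|\mathcal{S}|}\sum_{k\in\mathcal{S}} u_k , \\
\sigma_u^2 & \leftarrow & \frac{1}{|\mathcal{S}|}\sum_{k\in\mathcal{S}} (u_k-\mu_u)^2 ,
\end{eqnarray}
and likewise for $\mu_v$ and $\sigma_v$, then recomputes
$\mathcal{S}$, stopping when $\mathcal{S}$ no longer changes.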
365:
366: The algorithm is depicted in Figure~\ref{fig:demoRun}.
367:
368: \subsection{Reflection halos}
369:
370: Once all spurious catalog entries attributed to spikes are found and
371: temporarily removed (such entries disturb the results of the halo
372: detection algorithm), we search the remaining catalog for halos. This
373: process is similar to the process of searching for diffraction spikes:
374: We divide the Catalog into a fine healpix grid and process each grid
375: cell independently. We project the entries in each grid cell onto a
376: plane tangent to the cell's center. Next, we examine each star
377: brighter than $7\,\mag$, and attempt to find and eliminate halos that
378: it has generated. Since the radius of each halo is not dependent on
379: the magnitude of its generating star, the size of the neighborhood we
380: search is constant.
381:
382: We convert each neighborhood into polar coordinates centered at the
383: generating star, and calculate a histogram of the radii of all entries
384: in the neighborhood. This simple count of the number of stars present
385: at different radii is used to generate a more informative histogram of
386: the densities of stars at each radius. Our initial guess of the radius
387: of the halo is whichever coarse bin is the most dense.
388:
389: With this estimate of the radius of our halo, and with a constant as
390: our initial estimate of the radial width of the halo, we construct a
391: one-dimensional Gaussian and again robustly fit the position and width
392: of the Gaussian using variance clipping at 2.5\,sigma. Once the
393: re-estimation has converged, we check that our resulting values for
394: the variance of the width are reasonable ($<3\,\arcsec$), and if so,
395: we label all entries within 2.5\,sigma of the Gaussian as potentially
396: spurious. Again, if these entries pass another set of thresholds, they
397: are marked as spurious.
398:
Because one generating star may produce multiple halos, we repeat the
search around each generating star, removing each salient halo we
find, until we fail to detect any new halo that passes our thresholds.
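
The radial search described in this subsection can be sketched as
follows. This is a simplified illustration, not the implementation
used for this work: the function names \texttt{densest\_radius} and
\texttt{clip\_fit}, the bin layout, and the use of the sample mean and
standard deviation in the re-estimation step are assumptions made for
the example. Counts are converted to densities by dividing by the
area of each annular bin.

\begin{verbatim}
```python
import math

def densest_radius(radii, r_min, r_max, nbins):
    """Return the center of the radial bin with the highest surface density."""
    width = (r_max - r_min) / nbins
    counts = [0] * nbins
    for r in radii:
        if r_min <= r < r_max:
            counts[min(int((r - r_min) / width), nbins - 1)] += 1
    best_center, best_density = None, -1.0
    for i, n in enumerate(counts):
        lo = r_min + i * width
        hi = lo + width
        density = n / (math.pi * (hi * hi - lo * lo))  # entries per unit area
        if density > best_density:
            best_center, best_density = 0.5 * (lo + hi), density
    return best_center

def clip_fit(radii, center, sigma, nsigma=2.5, max_iter=100):
    """Iterated variance clipping of a one-dimensional Gaussian in radius."""
    for _ in range(max_iter):
        inside = [r for r in radii if abs(r - center) <= nsigma * sigma]
        if not inside:
            break
        new_center = sum(inside) / len(inside)
        new_sigma = math.sqrt(
            sum((r - new_center) ** 2 for r in inside) / len(inside))
        if (new_center, new_sigma) == (center, sigma):
            break  # membership has stopped changing
        center, sigma = new_center, new_sigma
    return center, sigma
```
\end{verbatim}

In use, \texttt{densest\_radius} supplies the initial radius and
\texttt{clip\_fit} refines it starting from a constant initial width;
the fitted width can then be compared against the acceptance
thresholds before any entries are labeled.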
402:
403: \subsection{Parameters of the Algorithms}
404:
405: By necessity the algorithms have a number of free parameters. Some of
406: these are measurements of diffraction-spike and reflection-halo
407: configurations, derived from quantitative analyses of the properties
408: of the spurious entries, while others are additional conservative
409: constraints, applied to ensure that the spurious entries appear to be
410: correctly identified on visual inspection of the results.
411:
412: In addition to the parameters that specifically apply to the spike and
413: halo identification algorithms, we somewhat arbitrarily chose to work
414: in a $\Nside=9$ healpix grid; there are $12\times 9\times 9=972$
415: healpixels. We set all variance-clipping thresholds to 2.5\,sigma,
416: and when we define regions by variance clipping we make them
417: 2.5\,sigma in half-width.
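
As a quick check of the scale this implies (healpix pixels have equal
solid angle by construction), each healpixel covers
\begin{equation}
\Omega_{\mathrm{pix}} = \frac{4\pi}{12\,\Nside^2}\,\ster
  = \frac{4\pi}{972}\,\ster
  \approx 1.29\times 10^{-2}\,\ster
  \approx 42.4\,\unit{deg}^2 ,
\end{equation}
or roughly $6.5\,\unit{deg}$ on a side; the symbol
$\Omega_{\mathrm{pix}}$ is introduced only for this illustration.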
418:
419: \subsubsection{Measured Spike Parameters}
420:
421: \begin{itemize}
422: \item We search for diffraction spikes generated by stars brighter
423: than $13\,\mag$. Bright stars tend to produce large diffraction
424: spikes containing many spurious entries, while dimmer stars produce
425: small diffraction spikes containing few, and potentially ambiguous,
426: spurious entries. When we extended our search to stars brighter than
427: $15\,\mag$, we found that the proportion of falsely labeled spurious
428: entries increased dramatically. Our decision to restrict to
429: $<13\,\mag$ is further supported by the second panel of
430: Figure~\ref{fig:spuriousStats}, which shows that generating stars at
$>13\,\mag$ have a mean number of entries per spike less than four,
432: which means that most will contain too few entries to be accepted.
433:
434: \item Our initial estimate of the angular length $\ell$ of a
435: diffraction spike given the magnitude $m$ of its generating star is
436: $\ell = 3500 \times 1.53^{-m}\,\arcsec$; see
437: Figure~\ref{fig:spikeProperties}. This estimate initializes a
438: refinement by iterated variance clipping and therefore does not
439: strongly affect our results. In detail this relationship between
440: length and magnitude depends on band, exposure time, and data quality,
and is therefore different for every plate; but since we use it
442: only as an initialization, those details do not substantially affect
443: our results.
444:
445: \item Our initial estimate of the angular width of a spike is
446: $1\,\arcsec$. This also initializes a refinement by iterated variance
447: clipping and also has little effect on our results.
448:
449: \item We define the ``reasonable'' width of a diffraction spike to be
450: three times the initial estimate of $1\,\arcsec$. If the adaptive
451: fitting process produces a width larger than this, the candidate spike
452: is rejected.
453: \end{itemize}
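
To give a sense of scale, the initial length relation can be evaluated
at a few magnitudes; this is a worked illustration of the formula
above, not an additional measurement. Since
$\log_{10}1.53\approx 0.185$,
\begin{equation}
\ell = 3500\times 1.53^{-m}\,\arcsec
  \approx 3500\times 10^{-0.185\,m}\,\arcsec ,
\end{equation}
which gives $\ell\approx 178\,\arcsec$ at $m=7\,\mag$,
$\ell\approx 50\,\arcsec$ at $m=10\,\mag$, and
$\ell\approx 14\,\arcsec$ at $m=13\,\mag$. With the usual definition
$m=-2.5\log_{10}F+\mathrm{constant}$ for a flux $F$ (a symbol
introduced here only for illustration), this is equivalent to
$\ell\propto F^{0.46}$, which is the power-law form referred to
earlier in this section.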
454:
455: \subsubsection{Additional Spike Constraints}
456:
457: \begin{itemize}
458: \item The size of the local neighborhood constructed around each spike
459: is $2.5$ times the initial estimate of the spike's size. This limits
460: the catalog entries considered in the subsequent analysis, though the
461: effect on our results is minimal.
462:
\item We require each spike to have entries in at least $2$
of the $4$ spike regions.

\item We require the total area within the four spike regions
to be at least as dense in Catalog entries as the surrounding area.
468: \end{itemize}
469:
470: \subsubsection{Measured Halo Parameters}
471:
472: \begin{itemize}
473: \item We search for halos around generating stars brighter than
474: $7\,\mag$. Our experiments have shown that halos do not appear around
475: stars dimmer than this.
476:
477: \item We discard any halo whose radius is outside the range of $240$ to
478: $410\,\arcsec$. Direct inspection of the catalog shows that
479: reflection halos rarely appear outside of this range.
480:
481: \item Our initial estimate of the standard deviation of the radial
482: width of a halo is $1.8\,\arcsec$. This is approximately the average
483: value to which our variance-clipping fitting algorithm converges.
484:
485: \item We discard any halo for which our variance-clipping fitting
486: algorithm computes a radial width larger than $4.5$ times the initial
487: estimate.
488: \end{itemize}
489:
490: \subsubsection{Additional Halo Constraints}
491:
492: \begin{itemize}
493: \item Each halo must contain at least $25$ catalog entries.
494:
495: \item The density of catalog entries in each halo annulus
496: must be at least $1.25$ times the density of the area near the halo.
497:
498: \item There must be entries present in the halo annulus every
499: ${\pi\over 4}\,\rad$. This forces all detected halos to be fully
500: circular, rather than just fragments of circles. More importantly,
501: this requirement prevents the false detection of halos near the edges
502: of healpixels, which would otherwise happen very often. Unfortunately,
503: this requirement prevents us from detecting any halo near the edges of
504: a healpixel.
505: \end{itemize}
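
The three halo constraints above can be combined into a single
acceptance test, sketched below. This is a hedged illustration: the
function name \texttt{passes\_halo\_constraints} and its arguments are
hypothetical, the comparison region ``near the halo'' is left to the
caller because it is not defined precisely above, and the requirement
of entries every ${\pi\over 4}\,\rad$ is interpreted here as requiring
all eight contiguous sectors of that width to be occupied.

\begin{verbatim}
```python
import math

def passes_halo_constraints(offsets, r_lo, r_hi, n_nearby, area_nearby,
                            min_entries=25, min_density_ratio=1.25,
                            n_sectors=8):
    """Apply the three acceptance tests to one candidate halo.

    offsets:     (dx, dy) of each candidate entry relative to the
                 generating star
    r_lo, r_hi:  inner and outer radius of the fitted annulus
    n_nearby:    number of entries in a comparison region near the halo
    area_nearby: area of that comparison region (same units, squared)
    """
    # Constraint 1: enough entries in the annulus.
    if len(offsets) < min_entries:
        return False
    # Constraint 2: annulus denser than its surroundings.
    annulus_area = math.pi * (r_hi * r_hi - r_lo * r_lo)
    halo_density = len(offsets) / annulus_area
    nearby_density = n_nearby / area_nearby
    if halo_density < min_density_ratio * nearby_density:
        return False
    # Constraint 3: every angular sector contains at least one entry.
    sector_width = 2.0 * math.pi / n_sectors   # pi/4 for eight sectors
    occupied = set()
    for dx, dy in offsets:
        phi = math.atan2(dy, dx) % (2.0 * math.pi)
        occupied.add(min(int(phi / sector_width), n_sectors - 1))
    return len(occupied) == n_sectors
```
\end{verbatim}

A candidate is accepted only if all three tests pass, so a dense but
incomplete arc, such as one truncated by a healpixel edge, is
rejected by the sector test.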
506:
507: \subsection{Limitations}
508:
509: Limitations of our procedures include the following.
510: \begin{itemize}
511: \item The algorithm assigns hard labels to indicate that an entry is
512: spurious. A future version of the algorithm could assign an
513: assessment of our \emph{confidence} that an entry is spurious.
514:
515: \item The algorithm processes each healpixel independently, and we
516: have not included a buffer region around the edges of the healpixels,
517: so there are minor edge effects: the algorithm is less likely to
518: detect spurious entries near the healpixel boundaries. We expect this
519: to affect roughly $0.4\,\percent$ of the diffraction spikes and
520: $3.5\,\percent$ of the reflection halos.
521:
522: \item The algorithms are highly specialized to the typical data in the
523: \USNOB. If a small fraction of the data in the Catalog come from some
524: telescope with, for example, three rather than four supports for the
525: secondary, or very different internal reflections, the algorithms we
526: use would not detect the spurious features in those data.
527:
528: \item There are many hard settings of parameters, as discussed above.
529: Most of these are either just initializations for iterative procedures
530: or else set manually after an analysis of the data, but more
531: experimentation could have been performed if we had a substantial data
532: set in which the spurious entries had been reliably identified in
533: advance.
534:
535: \item Sometimes a diffraction spike that exists in multiple fields is
536: detected in a field whose orientation does not match the spike's orientation
as well as some other field. This is because the order in which we search
each field is arbitrary; we flag a detected diffraction spike upon
its first successful detection. This usually results in a detected
540: diffraction spike with an unusually wide angular width. Though this happens
541: frequently, its overall effect on the fidelity of our results is small.
542: A better solution would be to remove spikes in non-increasing order of
543: their resemblance to our model of a diffraction spike.
544:
545: \item We ought never consider as a generating star any star that was
546: marked spurious in the analysis of a brighter generating star. We
547: don't currently enforce this, and it may produce some incorrect
548: identification of spurious entries.
549:
550: \item Many of these limitations could be overcome if we constructed
551: a complete generative model of
552: diffraction spikes and halos. This would allow us to ``score'' potential spurious
553: detections with something approaching a \emph{probability} that they are spurious,
554: rather than simply cut at hard thresholds. This could also improve the fidelity of our results,
555: by allowing us to increase our statistical requirements of some parameters
556: of our generative model when a detected spike or halo fails to fit other
557: parameters. For example, if a possible halo appears at an uncommon radius,
558: a proper generative model would effectively put a stronger constraint on
559: other properties (such as the density of entries in the halo annulus)
560: in order for the entries to be marked as spurious with high probability.
Done well, this approach could also allow us to reduce the number of
individual requirements we impose on each detected spike and halo.
563: This would be aided by a set of hand-labeled spikes and non-spikes, with
564: which we could tune the generative model --- or which we could use as
565: input to some kind of discriminator which would tune the model automatically.
566:
567: \end{itemize}
568:
569: \section{Results}
570:
571: The number of entries flagged as spurious on diffraction-spike grounds
572: is \numSpikes\ (\percentSpikes\ of the \USNOB) and on halo
573: grounds is \numHalos\ (\percentHalos). Our grounds for declaring an
574: entry spurious are conservative in the sense that a spike or halo is
575: only treated as being detected if it passes a set of statistical
576: thresholds.
577:
578: The method works by marking as spurious all \USNOB\ entries in a set
579: of finite regions of the sky, with those sky regions adaptively fit to
580: the observed diffraction spike and reflection halo features. Because
581: the total solid angle removed is non-zero, we expect some of the
582: entries we mark as spurious to in fact correspond to real sources. We
583: can estimate this in a representative healpixel: Healpixel 0 contains
584: $299573$ \USNOB\ entries; we flag as spurious $7924$ entries within a
585: set of regions comprising $1.5\times 10^{-5}\,\ster$ ($0.12\,\percent$
586: of the healpixel); we expect therefore some 300 of these to correspond
587: to real stars. We tested this hypothesis with the
588: \TWOMASS\footnote{http://www.ipac.caltech.edu/2mass/}. In this
healpixel there are $81089$ \TWOMASS\ entries, of which we would expect
$\sim100$ to lie in the spurious area we have removed. We find that
591: $82$ \TWOMASS\ entries match to a spurious \USNOB\ entry and no
592: non-spurious \USNOB\ entry, consistent with what we would expect
593: assuming a uniform distribution of \TWOMASS\ entries over the
594: healpixel. This count is probably an overestimate, because there are
595: some diffraction artifacts in the \TWOMASS\ that are similar to those
596: in \USNOB. Our marking of spurious entries is aggressive in this
597: sense; as we noted in the Introduction, this is because for our
598: scientific purposes we require a catalog as clean of spurious entries
599: as possible.
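
As a rough check of these expectations, we can write out the arithmetic.
We assume here that the healpixels are the equal-area pixels of the
$\Nside=9$ grid shown in Figure~\ref{fig:spikeProperties}, so that there
are $12\,\Nside^2=972$ of them, and that real sources are distributed
uniformly over the healpixel. The flagged fraction $f$ of the healpixel and
the expected numbers of real \USNOB\ and \TWOMASS\ sources in the flagged
regions are then approximately
\begin{eqnarray*}
f & \approx & \frac{1.5\times 10^{-5}\,\ster}{4\pi\,\ster/972}
  \approx 1.2\times 10^{-3} \\
f\times 299573 & \approx & 3.5\times 10^{2} \\
f\times 81089 & \approx & 94 \quad .
\end{eqnarray*}
These values are of the same order as the figures of some 300 and
$\sim100$ quoted above.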
600:
601: Properties of the spurious entries we have identified are shown in
602: Figures~\ref{fig:spuriousStats}, \ref{fig:spikeProperties}, and
603: \ref{fig:haloProperties}, including the numbers and fractions of
604: spurious entries as a function of generating star magnitude, and
605: distributions of spikes and halos in size and on the sky. These
606: figures show a number of important regularities, for example that
607: brighter stars have larger diffraction spikes (as expected), that the
widths of the spikes are not a function of generating star magnitude
609: (also as expected), and that both the number of spurious entries and
610: our ability to robustly detect them are functions of sky position
611: (mainly because of the Galactic Plane).
612: Figure~\ref{fig:haloProperties} shows that there are two different
613: dominant halo radii, one for the North and one for the South;
614: presumably this indicates differences in the hardware used for each
615: hemisphere.
616:
617: At the outset, we imagined that we could remove these spurious entries
618: trivially using the photometric properties listed in the Catalog. For
619: example, there is no reason in principle that a spurious entry would
620: obtain a reasonable color or pass star--galaxy separation. In
621: Figure~\ref{fig:spuriousProperties}, we show the distribution of the
622: spurious entries in photometric properties such as magnitude, color,
and star--galaxy separation. This figure shows that it would not have
been possible to identify the spurious entries on photometric grounds,
625: including even the \emph{number} of images with detections.
626: Presumably the reasonable colors and large numbers of overlapping
627: images in which the stars are detected result from the great stability
628: of the hardware and software employed in the construction of the
629: \USNOB. It would have been extremely difficult to reliably identify
630: the spurious entries without automatic computer-vision techniques like
631: those employed in this project.
632:
633: Associated with this paper is a small amount of computer code, the
634: information required to clean the \USNOB\ of the spurious entries we
635: identified, and some methods for accessing our cleaned version of the
636: \USNOB. All of these are available at the \an\ web
637: site\footnote{http://astrometry.net/cleanusnob/}.
638:
639: \acknowledgments We are very grateful to Dave Monet and the team that
640: created the \USNOB, which is one of astronomy's most productive and
641: useful resources. We benefitted from useful discussions with Mike
642: Blanton, Keir Mierle, and David Warde-Farley, and from the
643: constructive comments of our anonymous referee. DWH was partially
644: supported by the National Aeronautics and Space Administration (NASA;
645: grant NAG5-11669) and the National Science Foundation (NSF; grant
646: AST-0428465). This research made use of the NASA Astrophysics Data
647: System, and the US Naval Observatory Precision Measuring Machine Data
648: Archive.
649:
650: \begin{thebibliography}{70}
651: \bibitem[G{\'o}rski \etal(2005)]{gorski05a}
652: G{\'o}rski,~K.~M., Hivon,~E., Banday,~A.~J., Wandelt,~B.~D.,
653: Hansen,~F.~K., Reinecke,~M., \& Bartelmann,~M.,
654: 2005, \apj, 622, 759
655: \bibitem[Hampel \etal(1986)]{hampel86}
656: Hampel,~F.~R., Ronchetti,~E.~M., Rousseeuw,~P.~J., \& Stahel,~W.~A.,
657: 1986, \textit{Robust Statistics:\ The Approach Based on Influence Functions,}
658: Wiley, New York
659: \bibitem[H{\o}g \etal(2000)]{hog00a}
660: H{\o}g,~E., \etal,
661: 2000, \aap, 355, L27
662: \bibitem[Lang \etal(2007)]{lang07a}
663: Lang,~D., Hogg,~D.~W., Mierle,~K., Blanton,~M., \& Roweis,~S.,
664: 2007, Science, submitted
665: \bibitem[Monet \etal(2003)]{monet03a}
666: Monet,~D.~G., \etal,
667: 2003, \aj, 125, 984
668: \bibitem[Storkey \etal(2004)]{storkey04a}
669: Storkey,~A.~J., Hambly,~N.~C., Williams,~C.~K.~I., \& Mann,~R.~G.,
670: 2004, \mnras, 347, 36
671: \end{thebibliography}
672:
673: \clearpage
674: \begin{figure}
675: \hbox{
676: \hbox{\resizebox{\threewidth}{!}{\includegraphics{f1a.eps}}}
677: \hbox{\resizebox{\threewidth}{!}{\includegraphics{f1b.eps}}}
678: \hbox{\resizebox{\threewidth}{!}{\includegraphics{f1c.eps}}}
679: }
680: \caption{Subimages of three of the nine scanned plates that overlap a
small patch of sky centered around $(\mathrm{RA},\mathrm{Dec})=(341.8,-81.4)$~deg (J2000)
from which part of the \USNOB\ was created, retrieved from the US Naval
683: Observatory Precision Measuring Machine Data Archive. Note the
684: different orientations of the diffraction spikes generated by brighter
685: stars, and the multiple halos surrounding the brightest star.
686: \label{fig:skyPatchSource}}
687: \end{figure}
688:
689: \begin{figure}
690: \hbox{
691: \hbox{
692: \vbox{
693: \hbox{\resizebox{\threewidth}{!}{\includegraphics{f2a.eps}}}
694: \hbox{\resizebox{\threewidth}{!}{\includegraphics{f2b.eps}}}
695: }
696: }
697: \hbox{
698: \resizebox{\twothreewidth}{!}{\includegraphics{f2c.eps}}
699: }
700: }
701: \caption{The same small patch of sky as pictured in
Figure~\ref{fig:skyPatchSource}, taken from the \USNOB, in tangent-plane
703: coordinates relative to a tangent point at the center of the
704: containing healpixel, in units of $\arcsec$. The bright stars in this
705: patch have multiple sets of diffraction spikes because they lie in a
706: sky region where plates taken at different orientations overlap.
707: \textsl{Upper left panel:} All \USNOB\ entries in this patch.
708: \textsl{Right panel:} The same patch with dark points showing catalog
709: entries marked as spurious by either the diffraction spike or
710: reflection halo criteria described in the text. \textsl{Lower left
711: panel:} The same patch, with only non-spurious entries shown.
712: \label{fig:skyPatch}}
713: \end{figure}
714:
715: \begin{figure}
716: \resizebox{\threewidth}{!}{\includegraphics{f3a.eps}}%
717: \resizebox{\threewidth}{!}{\includegraphics{f3b.eps}}%
718: \resizebox{\threewidth}{!}{\includegraphics{f3c.eps}}
719: \resizebox{\threewidth}{!}{\includegraphics{f3d.eps}}%
720: \resizebox{\threewidth}{!}{\includegraphics{f3e.eps}}%
721: \resizebox{\threewidth}{!}{\includegraphics{f3f.eps}}
722: \caption{A narrative demo of finding a single spike within the patch of sky shown in Figure~\ref{fig:skyPatch}.
723: \textsl{Upper left panel:} The composite profile of the current
724: field, with the field's estimated orientation highlighted.
725: The profile is dominated by the spike
726: of the current generating star, which is the largest in the field.
727: \textsl{Upper center panel:} The neighborhood surrounding the
728: current generating star, with all entries in all fields shown.
729: \textsl{Upper right panel:} The same neighborhood, with only entries
730: in the current field shown. The dominant orientation from this field's
731: composite profile is highlighted.
732: \textsl{Lower left panel:} All four directions of the spike, collapsed
733: into one ${\pi\over 2}\,\rad$ profile.
734: The dashed lines outline the areas encompassed at 1, 2, and 3 times the
735: root-variance of the Gaussian used to initialize the variance clipping.
The solid rectangle is the area encompassed at
$2.5\,\sigma$, which is the threshold for flagging entries. The solid rectangle
is extended all the way to the bottom, as it is assumed that all of the
739: spike profile between the spike cluster and the generating star is also
740: spurious.
741: \textsl{Lower center panel:} The same profile, after the Gaussian has
742: been fit using variance clipping. The solid rectangle shown here
743: is the final area we use for determining if entries are flagged.
744: \textsl{Lower right panel:} The neighborhood surrounding the generating
745: star with all entries shown, and with the newly-flagged spike entries darkened.
746: The diffraction spikes in the other orientations come from different fields and
747: are flagged by later passes of the algorithm.
748: \label{fig:demoRun}}
749: \end{figure}
750:
751: \clearpage
752: \begin{figure}
753: \resizebox{\twowidth}{!}{\includegraphics{f4a.eps}}%%
754: \resizebox{\twowidth}{!}{\includegraphics{f4b.eps}}\\
758: \caption{Statistics of spurious entries.
759: \textsl{Left panel:} The fraction of all entries marked as spurious
760: as a function of generating star magnitude.
761: \textsl{Right panel:} The mean number of entries marked as spurious
762: per generating star as a function of generating star magnitude.%
763: \label{fig:spuriousStats}}
764: \end{figure}
765:
766: \clearpage
767: \begin{figure}
768: \resizebox{\twowidth}{!}{\includegraphics{f5a.eps}}%
769: \resizebox{\twowidth}{!}{\includegraphics{f5b.eps}}\\
770: \resizebox{\twowidth}{!}{\includegraphics{f5c.eps}}%
771: \resizebox{\twowidth}{!}{\includegraphics{f5d.eps}}
772: \caption{Regularities of spurious catalog entries in the \USNOB\
773: identified as caused by diffraction spikes. \textsl{Top-left
774: panel:}~Two-dimensional histogram showing the adaptively fit radial
775: lengths of the spikes, found by iterative variance-clipping, as a function
776: of generating star magnitude. Each vertical column in the histogram
777: is independently normalized. The solid line shows the
778: value used to initialize the adaptive fitting. \textsl{Top-right
779: panel:}~Similar two-dimensional histogram but showing the adaptively
780: fit widths of the spikes, as a function of generating star magnitude.
781: The solid line shows the initial value. \textsl{Bottom-left
782: panel:}~The two-dimensional solid-angular density on the sky of
783: spurious entries identified as parts of diffraction spikes as a
784: function of sky position (shown as an ``unwrapped'' $\Nside=9$ healpixel
785: grid). The darker a healpixel is, the more spurious ``spike'' entries it contains.
786: \textsl{Bottom-right panel:}~The same, but shown relative to
787: the number of catalog entries in that healpixel. In the two sky
788: density plots, a North--South asymmetry is visible, as well as the
789: Galactic plane.
790: \label{fig:spikeProperties}}
791: \end{figure}
792:
793: \clearpage
794: \begin{figure}
795: \resizebox{\twowidth}{!}{\includegraphics{f6a.eps}}%
796: \resizebox{\twowidth}{!}{\includegraphics{f6b.eps}}\\
797: \resizebox{\twowidth}{!}{\includegraphics{f6c.eps}}%
798: \resizebox{\twowidth}{!}{\includegraphics{f6d.eps}}
799: \centerline{\resizebox{\onewidth}{!}{\includegraphics{f6e.eps}}}
800: \caption{Regularities of spurious catalog entries identified as caused
801: by reflection halos. \textsl{Top-left panel:}~Two-dimensional
802: histogram showing adaptively fit reflection-halo radii as a function
803: of generating star magnitude. Each vertical column of the histogram
804: has been independently normalized. \textsl{Top-right panel:}~Similar
805: two-dimensional histogram but showing the adaptively fit widths of the
806: halo annuli.
807: \textsl{Middle-left panel:}~Two-dimensional solid-angular density on
808: the sky of spurious entries identified as parts of reflection halos.
809: The darker a healpixel is, the more spurious ``halo'' entries it contains.
810: \textsl{Middle-right panel:}~The same but shown relative to the number
811: of catalog entries in that healpixel.
812: \textsl{Wide bottom panel:}~Two-dimensional histogram showing that
813: each of the two principal halo radii is in one hemisphere of the sky.
814: \label{fig:haloProperties}}
815: \end{figure}
816:
817: \clearpage
818: \begin{figure}
819: \resizebox{\twowidth}{!}{\includegraphics{f7a.eps}}%%
820: \resizebox{\twowidth}{!}{\includegraphics{f7b.eps}}\\
821: \resizebox{\twowidth}{!}{\includegraphics{f7c.eps}}%%
822: \resizebox{\twowidth}{!}{\includegraphics{f7d.eps}}
823: \caption{Spurious catalog entries are not obvious from their basic
824: photometric properties.
825: \textsl{Top-left panel:} Number of the five
826: bands in the \merged\ in which entries show detections, for all and
827: spurious entries.
828: \textsl{Top-right panel:} Magnitude distribution,
829: for all and spurious entries with detections in the $F$ band.
\textsl{Bottom-left panel:} $J-F$ color distribution, for all and spurious
entries with detections in the $J$ and $F$ bands.
\textsl{Bottom-right panel:} \USNOB\ $F$-band star--galaxy separator
833: quantity, for all and spurious entries with detections in the $F$ band.%
834: \label{fig:spuriousProperties}}
835: \end{figure}
836:
837: \end{document}
838: