1: \documentstyle[11pt,psfig,fullpage]{article}
2:
3: \newcommand{\discCoverFig}[2]{%
4: \psfig{figure=#1,width=#2,bbllx=100bp,bblly=175bp,bburx=425bp,bbury=610bp,angle=270,clip=}
5: }
6:
7: \begin{document}
8:
9: \title{%
10: \Large Data Collection for the Sloan Digital Sky Survey
11: --- A Network-Flow Heuristic
12: }
13: \author{%
14: Robert Lupton\thanks{%
15: Astrophysics Department,
16: Princeton University, Princeton, NJ 08540.
17: E-mail: {\tt rhl@astro.princeton.edu}.
18: } \\
19: \and
20: F. Miller Maley\thanks{%
21: Mathematics Department,
22: Princeton University, Princeton, NJ 08540.
23: E-mail: {\tt fmaley@haverford.edu}.
24: } \\
25: \and
26: Neal Young\thanks{%
27: Computer Science Department, Dartmouth College, Hanover, NH 03755.
28: Parts of this research were done at:
29: AT\&T Bell Laboratories, Murray Hill, NJ 07974;
30: the School of ORIE, Cornell University, Ithaca NY 14853
31: on \'Eva Tardos' NSF PYI grant DDM-9157199; and
32: the Dept's of Astrophysics and Computer Science, Princeton University.
33: Corresponding author. E-mail: {\tt ney@cs.dartmouth.edu}.
34: }
35: }
36: \date{}
37:
38: \maketitle
39:
40: \pagestyle{myheadings}
41: \markboth{Robert Lupton, F. Miller Maley and Neal Young}{%
42: Data Collection for the Sloan Digital Sky Survey
43: --- A Network-Flow Heuristic}
44:
45: %\pagenumbering{arabic}
46: %\setcounter{page}{1}%Leave this line commented out.
47:
48: \begin{abstract}
49: This paper describes an NP-hard combinatorial optimization problem
50: arising in the Sloan Digital Sky Survey
51: and a practical approximation algorithm
52: that has been implemented and will be used in the Survey.
53: The algorithm is based on network flow theory
54: and Lagrangian relaxation.
55: \end{abstract}
56:
57: \section{The Sloan Digital Sky Survey}
58: \begin{quote}
59: \small
60:
61: ``The Sloan Digital Sky Survey
62: [is] a joint project of the Astrophysical Research Consortium.
63: ...
64: The goal of the project, which is scheduled to begin in 1997
65: and take five years, is to make a much better map of the universe
66: than is currently available.
67: The volume of the universe to be surveyed will be 100 times larger
68: than the volume of previous surveys.
69: The number of galaxies with known distances is expected to increase
70: by a factor of 100 to 1,000,000 galaxies
71: and the number of quasars to increase to 100,000.
72:
73: ``The Sloan Foundation ... has contributed \$8 million
74: to the \$18 million capital costs of the project. ...
75:
76: ``In order to do the survey, ARC is designing and building a special purpose
77: 2.5 meter (100-inch) telescope at its Apache Point Observatory. ...
78:
79: ``[The Sky Survey will proceed in two phases.
80: In the first phase, a two-dimensional map of the sky will be made.
For the second phase, the] million brightest galaxies
and the one hundred thousand brightest quasars will be selected
for spectroscopic analysis from the two-dimensional map...''
84:
85: \hfill \em \cite[Savani, 1994]{Savani}
86: \end{quote}
87: To gather the spectroscopic data in the second phase,
88: the telescope will be pointed repeatedly at the sky
89: to take a series of ``snapshots''.
90: Each snapshot will capture data for up to $660$
91: galaxies and quasars in the circular portion of the sky
92: visible through the telescope.
93: For each captured galaxy, light from that galaxy will enter
94: the telescope and travel through an optical fiber to a spectral analyzer.
95: The optical fibers (one for each galaxy) will be held in place
96: by a ``plug plate'' drilled to hold the up to $660$ fibers,
97: each aligned to accept the light of its respective galaxy
98: \cite{Crease}.
99:
100: \subsection{A Capacitated Covering Problem. }
101: \label{problemsec}
102: The second phase of the survey is expected to cost
103: on the order of \$4-5 million.
104: This cost will depend primarily on the number of snapshots taken.
105: This paper concerns the following problem:
106: given the ``2-dimensional'' locations of the desired galaxies,
107: determine a minimum-size set of snapshots that capture them.
108: Formally:
109: \begin{quote}
\underline{Euclidean Capacitated Covering by Discs} (ECCD)
111:
112: \em
113: Given a collection of points on the unit sphere,
114: a radius $r$, and a capacity $c$, find a small set of discs of radius $r$
115: (located on the sphere) such that each given point
116: can be assigned to a disc containing it,
117: with no disc being assigned more than $c$ points.
118: \end{quote}
119: The sphere corresponds to the view-sphere centered at the telescope.
120: The points correspond to the images of the galaxies
121: projected on the view-sphere.
122: Each disc represents one snapshot to be taken through the telescope;
123: the points assigned to that disc correspond
124: to those galaxies for which data will be collected in that snapshot.
125: The capacity $c$ is the maximum number of galaxies for which
126: spectral data can be gathered in a single snapshot
127: (due to limitations in packing the optical fibers).
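To make the definition concrete, the following is a minimal sketch of a checker for a proposed ECCD solution. It assumes that points and disc centers are given as unit vectors and that $r$ is an angular radius in radians; the function names are hypothetical and are not taken from the implementation described in this paper.

\begin{verbatim}
```python
import math
from collections import Counter

def angular_distance(p, q):
    """Great-circle angle (radians) between unit vectors p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    return math.acos(max(-1.0, min(1.0, dot)))

def is_legal_cover(points, centers, assignment, r, c):
    """assignment[i] is the index of the disc that captures points[i]."""
    if len(assignment) != len(points):
        return False
    load = Counter(assignment)
    if any(n > c for n in load.values()):
        return False  # some disc is assigned more than c points
    # every point must lie inside the disc it is assigned to
    return all(angular_distance(points[i], centers[j]) <= r
               for i, j in enumerate(assignment))

def capacity_lower_bound(n_points, c):
    """Any legal cover needs at least ceil(n_points / c) discs."""
    return -(-n_points // c)
```
\end{verbatim}

The capacity bound $\lceil n/c \rceil$ ignores geometry entirely, so it is only a lower bound on the number of discs.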
128:
129: \begin{figure}[t]
130: \centerline{\discCoverFig{sample.ps}{2in}
131: \discCoverFig{uniform.ps}{2in}
132: \discCoverFig{final.ps}{2in}}
133: \caption{Sample instance (points are dark); near-uniform cover; better cover.
134: This near-uniform cover is from an earlier implementation
135: not using Hardin et al.'s covers, which are more uniform.
136: }
137: \label{fig:sample}
138: \label{fig:uniform}
139: \label{fig:final}
140: \end{figure}
141:
142: The ECCD problem is NP-hard \cite{MegiddoS84}.
The instances arising in the Survey have hundreds of thousands of points.
Luckily, as Figure~\ref{fig:sample} illustrates,
these instances are nicely structured.
146:
147: In this paper we describe a heuristic algorithm for the problem.
148: The algorithm is effective for instances arising in the Survey
149: and will be used for it.
150: The basic idea behind the algorithm is to start
151: with a near-uniform cover of the sphere by discs
152: \cite{HardinSS94} and then to iteratively improve the cover.
153: The key observation is that a given cover can be improved
154: by first solving a relaxation of the problem in which the ``point-in-disc''
155: constraints are replaced by penalties for assigning points
156: to discs not containing them,
157: and then moving the discs to minimize the cost of the assignment found.
158: The relaxed problem reduces to the minimum-cost flow problem.
159: In our tests, the algorithm runs in nearly linear time
160: and finds covers that are roughly 20\% better
161: than comparable near-uniform covers.
162:
163: \section{Related Work}
164: The NP-completeness of the variant when the points lie in the plane
165: was proven by Megiddo and Supowit \cite{MegiddoS84}.
166: The proof adapts easily to our problem.
167: The NP-completeness of the planar problem when the discs are required
168: to be centered on the given points was proven by
169: Marchetti-Spaccamela \cite{Marchettispaccamela81}.
170: When the covering regions are rings, instead of discs,
171: Maass \cite{Maass86} showed the problem NP-complete
172: even if the points all lie on a single line.
173:
174: Papadimitriou \cite{Papadimitriou81}
175: (improving results by Fisher and Hochbaum \cite{FisherH80}),
176: considered the related {\em $p$-medians} problem in the plane,
177: which is that of covering the given points with
178: $p$ discs (of arbitrary radii, but centered at $p$ of the given points)
179: so as to minimize the {\em sum} of the disc radii.
180: He showed the problem to be NP-complete
181: and presented average-case analyses of several algorithms.
182: One of the heuristics is a uniform (``honeycomb'') covering
183: of the points by discs,
184: which he shows gives a near-optimal solution with high probability
185: when $p$ is $\omega(\log n)$ and $o(n/\log n)$
186: and the points are randomly distributed in the unit square.
187:
188: The problem can be modeled as a capacitated set-cover problem.
189: The well-known greedy algorithm of Johnson \cite{Johnson74}
190: and Lov\'asz \cite{Lovasz75},
191: as modified for the capacitated case
192: by Bar-Ilan, Kortsarz, and Peleg \cite{BarilanKP93},
193: would yield a $\ln n$-approximate solution,
194: where $n$ is the number of galaxies.
195: This algorithm is not good enough in practice.
196: For this particular set-cover problem
197: the dual of the set system has bounded VC-dimension;
198: in this case an improved approximation algorithm
199: is known for the uncapacitated case \cite{BronnimanG95},
200: but, judging from a few small trials,
201: this algorithm does not appear to take sufficient advantage
202: of the structure of our problem instances to perform well in practice.
203:
204: Numerous generalizations of our problem have been considered under various
205: names, including ``(un)capacitated facility (or plant) location,''
206: ``$p$-centers'', and ``minimax facility location''.
207: These problems have been studied under various metrics
208: and also in general graphs.
209: In general, polynomial-time exact algorithms are known
210: only when the number of covering regions (in our case, discs) is small
211: (e.g., \cite{AgarwalS94})
212: or when the underlying metric space (or network) is tree-like
213: (e.g.,
214: \cite{MegiddoTZC81,FredericksonJ83,MegiddoT83,GurevichSV84,HeY90,ErkutFT92}).
215: Generally, these algorithms are for uncapacitated problems.
216:
217: There is a large literature on these problems in Operations Research.
218: Relevant books include
219: \cite{LoveMW88,NemhauserW88,FrancisMW91,Francis90}.
220: Much of this research has concentrated
221: on adapting integer-programming techniques
222: to fairly general formulations of the problem.
223: For example, recent works on the Capacitated Facility Location Problem
224: (a generalization of our problem to arbitrary networks)
225: include \cite{CornuejolsST91,Sridharan93}.
226: Quoting from the conclusion of ``Approximate Solutions to Large Scale
227: Capacitated Facility Location Problems'' (1990) \cite{Shetty90}:
228: \begin{quote} \small
229: The problem of locating facilities has inspired a rich body of literature
230: which spans well over two decades. Numerous algorithms have been devised
231: and successfully applied to problems with as many as 200 customers
232: and 100 facilities. The computational experience on larger problems,
233: however, has been virtually non-existent... In the work leading to this
234: paper, the objective was to develop a heuristic algorithm that can be used
235: to generate effective solutions for large scale facility locations problems.
236: The computational results obtained so far seem to indicate that this
237: requirement can be met for problems with as many as 1000 customers
238: and 100 facilities.
239: \end{quote}
240:
241: \section{The Algorithm}\label{sec:alg}
242:
243: The instances arising in the Sky Survey exhibit particular structure.
244: Within any given region,
245: the galaxies are distributed densely throughout the region,
246: somewhat uniformly but with clustering tendencies
247: and variation in density.
248: The density of the galaxies means that
249: virtually the entire region must be covered by discs.
250: The variation in density means that
251: more discs must be concentrated within densely populated regions.
252: As a reference point, consider the sparsest possible covering
253: of the area by discs (resembling a ``honeycomb'').
254: This cover provides roughly the right {\em total}\/ capacity
255: and does well in sparse areas,
256: but in dense areas does not provide sufficient capacity.
257: Any good solution will have to maintain a honeycomb-like structure
258: in sparse areas while bunching discs more densely in dense areas.
259:
260: The outer loop of the algorithm does a binary search for the smallest value of
261: a density parameter $\delta$ that leads to success in the inner loop. The
262: inner loop begins with a near-uniform cover of normalized density $1+\delta$
263: and iteratively improves it (see Figure~\ref{fig:uniform} for ``before'' and
264: ``after'' covers). Each iteration of the loop perturbs the discs, as described
265: below, in an attempt to improve the cover (Figure~\ref{fig:move} shows the
266: results of such a series of improvement steps). If the desired coverage is
267: obtained, the inner loop stops (successfully). If the perturbation ceases to
268: improve the cover, the inner loop stops (unsuccessfully).
269:
270: \begin{figure}[t]
271: \begin{center}
272: \leavevmode
273: \discCoverFig{move.ps}{0.5\textwidth}
274: \end{center}
275: \caption{Composite of intermediate covers}
276: \label{fig:move}
277: \end{figure}
278:
279: Next we describe how the algorithm perturbs a given cover in order to improve
280: it. We start with the observation that for a {\em given}\/ set of discs (with
281: known locations), the problem of finding the maximum number of galaxies that
282: can be assigned reduces to a generalized maximum bipartite matching problem
283: in a graph $G=(U,V,E)$, where the vertices in $U$ correspond to galaxies, the
284: vertices in $V$ correspond to the discs, and edge $(u,v)$ is present if $u$'s
285: galaxy is in $v$'s disc. A maximum legal assignment of galaxies to discs then
286: corresponds to a maximum size set $S$ of edges such that each $u \in U$ is
287: incident to at most one edge in $S$ while each $v\in V$ is incident to at most
288: $c$ edges in $S$.
289:
290: Since the latter problem reduces in a standard way
291: to the maximum flow problem \cite{PapadimitriouS82},
292: which can be efficiently solved, it follows that
293: for a {\em given} set of discs, one can efficiently find a
294: maximum legal assignment of galaxies to discs.\iffalse\footnote{%
295: The standard reduction can be improved by the following heuristic: say that
296: two galaxies are equivalent if they are contained in the same set of discs.
297: Replace the vertex set $U$ by a set $U'$, where each $u'$ represents a
298: resulting equivalence class. Finally, alter the matching constraint so that
299: the number of matching edges incident to $u'$ is constrained to be at most
300: the size of $u'$'s equivalence class. Although in a typical problem the
301: galaxies will be fairly dense, each disc will intersect only $O(1)$ other
302: discs, so the number of equivalence classes will be proportional to the
303: number of discs.}\fi
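For small instances the capacitated matching can also be computed directly with augmenting paths, which is equivalent to the flow formulation. The sketch below is illustrative only: the function name and the adjacency-list input are assumptions, and the recursion is not meant for instances of the size arising in the Survey.

\begin{verbatim}
```python
def max_legal_assignment(adj, num_discs, c):
    """adj[u] lists the discs containing galaxy u.
    Returns, per galaxy, the index of its disc or None if unassigned."""
    assigned = [None] * len(adj)
    members = [[] for _ in range(num_discs)]  # galaxies held by each disc

    def try_assign(u, visited):
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            if len(members[v]) < c:           # disc v has spare capacity
                members[v].append(u)
                assigned[u] = v
                return True
            # disc v is full: try to move one of its galaxies elsewhere
            for idx, w in enumerate(members[v]):
                if try_assign(w, visited):
                    members[v][idx] = u       # u takes the freed slot
                    assigned[u] = v
                    return True
        return False

    for u in range(len(adj)):
        try_assign(u, set())
    return assigned
```
\end{verbatim}

With galaxies contained in discs as \verb|[[0], [0, 1], [1]]| and $c=1$, only two of the three galaxies can be legally assigned.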
304:
305: Of course, the maximum legal assignment may still leave many galaxies unassigned,
306: even though many discs are not used to capacity.
307: In this case, how can discs be moved to improve the coverage?
308: Consider the following relaxation of the problem:
309: \begin{quote}{
310: \underline{Relaxed Problem}
311:
312: \em
313: Given a set of discs, a set of galaxies, and a capacity $c$,
314: find a {\bf minimum-penalty} assignment of the galaxies to discs
315: such that no disc is assigned more than $c$ galaxies.}
316: \end{quote}
317: Here a galaxy can be assigned to a disc not containing it,
318: but there is a penalty for doing so that encourages assignments
319: of galaxies to nearby discs (details of the penalty function are
320: in \S~\ref{sec:impl}).
321:
322: The relaxed problem can be solved efficiently (even for arbitrary penalties) by
323: reducing it to the assignment problem or to minimum-cost maximum flow. We
324: reduced it to the latter, more general, problem in anticipation of having to
325: incorporate more complex constraints on the assignment (that no sufficiently
326: close pairs of galaxies should be assigned to the same disc) at a later point.
327: As described below, even the more general problem can be solved quickly enough
328: for our purposes.
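As an illustration of the reduction, the sketch below solves a tiny relaxed problem by successive shortest augmenting paths, using Bellman-Ford because penalties and residual costs may be negative. This is a textbook method chosen for brevity, not the scaling algorithm used in the implementation; the function name and the dense penalty matrix are assumptions.

\begin{verbatim}
```python
def min_penalty_assignment(penalty, c):
    """penalty[u][v]: cost of assigning galaxy u to disc v.
    Returns (total penalty, list giving the disc of each galaxy)."""
    n, m = len(penalty), len(penalty[0])
    assert n <= m * c, "total capacity must cover all galaxies"
    size, src, sink = n + m + 2, n + m, n + m + 1
    graph = [[] for _ in range(size)]   # node -> indices of outgoing edges
    to, cap, cost = [], [], []

    def add_edge(a, b, capacity, w):
        graph[a].append(len(to)); to.append(b); cap.append(capacity); cost.append(w)
        graph[b].append(len(to)); to.append(a); cap.append(0); cost.append(-w)

    fwd = [[0] * m for _ in range(n)]   # index of edge galaxy u -> disc v
    for u in range(n):
        add_edge(src, u, 1, 0)
        for v in range(m):
            fwd[u][v] = len(to)
            add_edge(u, n + v, 1, penalty[u][v])
    for v in range(m):
        add_edge(n + v, sink, c, 0)

    total = 0
    for _ in range(n):                  # one augmentation per galaxy
        dist = [float("inf")] * size
        prev_edge = [-1] * size
        dist[src] = 0
        changed = True
        while changed:                  # Bellman-Ford relaxation
            changed = False
            for a in range(size):
                if dist[a] == float("inf"):
                    continue
                for e in graph[a]:
                    if cap[e] > 0 and dist[a] + cost[e] < dist[to[e]]:
                        dist[to[e]] = dist[a] + cost[e]
                        prev_edge[to[e]] = e
                        changed = True
        node = sink
        while node != src:              # push one unit along the path
            e = prev_edge[node]
            cap[e] -= 1
            cap[e ^ 1] += 1
            node = to[e ^ 1]
        total += dist[sink]

    assignment = [next(v for v in range(m) if cap[fwd[u][v]] == 0)
                  for u in range(n)]
    return total, assignment
```
\end{verbatim}

For example, with penalties \verb|[[0, 5], [1, 5], [2, 3]]| and $c=2$, the third galaxy is pushed to the second disc because the first disc is full.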
329:
330: A solution to the relaxed problem will assign all galaxies to discs, but a
331: given disc may be assigned galaxies outside of it. {\em The advantage of the
332: relaxed problem is that a solution to it can give information about how to
333: improve a given set of discs.} The intuition is that if excess demand (i.e.\
334: a high density of galaxies relative to discs) exists in one area, and excess
335: capacity exists in another, then a disc between the two areas will tend to be
336: assigned galaxies that are outside of the disc and that lie towards the area of
337: excess demand. Figure~\ref{fig:relax} illustrates this.
338: \begin{figure}[t]
339: \begin{center}
340: \leavevmode
341: \discCoverFig{relax.ps}{5in}
342: \end{center}
343: \caption{Relaxed assignment}
344: \label{fig:relax}
345: \end{figure}
346:
347: Once a minimum-penalty solution to the relaxed problem has been found, the
348: algorithm moves the discs to minimize the cost of the particular assignment of
349: galaxies to discs specified by the minimum-penalty solution. This problem can
350: be solved {\em independently} for each disc. For a given disc, for a fixed set
351: of galaxies assigned to it, the sum of the penalties for those assignments is a
352: function $f(x,y)$ of the coordinates $(x,y)$ of the center of the disc. As
353: long as the penalty function is convex and reasonably smooth, $f$ will be also.
Starting with the current location $(x_0,y_0)$ of the disc, a simple
gradient-based method (described in \S~\ref{sec:impl}) is used to find $(x,y)$
minimizing $f(x,y)$.
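
To see why each disc can be treated separately, it may help to write the objective out. The following is only an illustrative sketch: it assumes a flat (planar) approximation near the disc, takes the penalty $p$ of \S~\ref{sec:impl} with proportionality constant 1, and writes $(x_i,y_i)$ for the galaxies assigned to the disc and $d_i$ for their distances to the center $(x,y)$ (notation introduced here for illustration). Then
$$f(x,y) = \sum_i p(d_i) = \sum_i w_i\,(d_i^2 - r^2), \qquad d_i^2 = (x-x_i)^2+(y-y_i)^2,$$
where $w_i = 1$ if $d_i \le r$ and $w_i = 100$ otherwise. Away from the circles $d_i = r$ the gradient is
$$\nabla f(x,y) = 2\sum_i w_i\,(x-x_i,\; y-y_i).$$
The sum involves only the galaxies assigned to this one disc, so the discs decouple. With the weights held fixed, the gradient vanishes at the weighted centroid $\sum_i w_i (x_i,y_i)/\sum_i w_i$; since the weights themselves change as the disc moves, this is only a fixed-point condition, which is one reason an iterative method is natural here.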
357:
358: \newenvironment{tabAlgorithm}{
359: \setcounter{algorithmLine}{1}
360: \samepage
361: \begin{tabbing}
362: 999\=\kill
363: }{
364: \end{tabbing}
365: }
366: \newcounter{algorithmLine}
367: \newcommand{\algline}{\\\thealgorithmLine\hfil\>\stepcounter{algorithmLine}}
368:
369: \begin{figure}[tb]
370: \begin{center}
371: \framebox[.95\textwidth][c]{\parbox{.9\textwidth}{
372: \underline{Inner Loop} ($\epsilon$ is fixed, $\delta$ is determined by
373: the outer loop)
374: \begin{enumerate}
375: \item Compute a minimum-size near-uniform cover $C$ of the region by
376: discs so that the total capacity of the discs in $C$ is at least
377: $1+\delta$ times the number of galaxies.
378:
\item {\bf Repeat} until convergence (after ``polishing'') or
$1-\epsilon$ of the galaxies are legally assigned.
381:
382: \begin{enumerate}
383:
384: \item Compute a minimum-penalty assignment $A$ of the galaxies to the
385: discs in $C$. Do this by solving (an approximation of) the
386: corresponding minimum-cost flow problem.
387:
388: \item Move the discs in $C$ to minimize the penalty associated with
389: the assignment $A$. Move each disc independently to minimize
390: its associated penalty by a simple gradient-descent method.
391:
392: \end{enumerate}
393: \item Find a maximum legal assignment of the galaxies to $C$. Do
394: this by solving a corresponding maximum-flow problem.
395:
396: \item {\bf If} at least $1-\epsilon$ of the galaxies are assigned,
397: {\bf succeed}, else {\bf fail}.
398: \end{enumerate}
399: }}
400: \end{center}
401: \caption{
402: Given a desired coverage $1-\epsilon$, where $\epsilon \ge 0$, the outer
403: loop of the algorithm does a binary search for the smallest value of
404: $\delta \ge 0$ such that the above inner loop succeeds. Further details,
405: including the ``polishing'' step, the ``approximation'' of the flow
406: problem, and the criteria for convergence, are described in
407: \S~\protect\ref{sec:impl}.}
408: \label{fig:alg}
409: \end{figure}
410: This gives us the essentials of the inner loop of the algorithm. It starts
411: with a near-uniform cover of some specified (normalized) density $1+\delta$.
412: It improves the cover by finding a minimum-penalty assignment of the galaxies
413: to the discs and then moving the discs to their optimal locations given that
414: assignment. It continues, alternately improving the assignment and then moving
415: the discs, until the net penalty ceases to decrease appreciably. At the end of
416: the inner loop, the algorithm finds a legal (not relaxed) assignment of
417: galaxies to discs maximizing the number of assigned galaxies.
418: Figure~\ref{fig:move} shows a sequence of covers generated by a single run of
419: the inner loop.
420:
The outer loop performs a binary search to find the smallest $\delta$ that causes
the inner loop to successfully cover the galaxies. The presentation here is a
slight simplification of the actual algorithm, in that the actual algorithm
uses a ``polishing'' heuristic before terminating the inner loop, and a
heuristic is applied to reduce the size of the network-flow problem before solving
it. These heuristics and other details about convergence of the inner and outer
427: loops, and starting conditions for the outer loop, are described in
428: \S~\ref{sec:impl}.
429:
430: \subsection{Example Run of Inner Loop. }
431: The sample instance in Figure~\ref{fig:sample} contains 12642 points ---
432: a random $10\%$ of the points in a subregion of the sky previously scanned.
433: The size of this subregion is about $10\%$ of that of the region that will be
434: mapped by the Survey.
435: A uniform cover of 218 discs of capacity 60 (total capacity 13080)
436: allows 81\% of the galaxies to be assigned.
437: After 16 iterations of the inner loop of the algorithm,
438: the improved cover captures 97.8\% of the points.
439: Figure~\ref{fig:sample} shows the initial near-uniform cover
440: and the final cover;
441: Figure~\ref{fig:move} shows a composite of the successive covers.
Section~\ref{sec:perf} describes more comprehensive testing of the quality of the solutions
given by the algorithm and of its running time.
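
As a rough check of these figures (the percentages are rounded, so the galaxy counts below are approximate):
$$\lceil 12642/60 \rceil = 211, \qquad 218 \cdot 60 = 13080, \qquad 13080/12642 \approx 1.035,$$
so the uniform cover has about 3.5\% more capacity than there are galaxies and uses 7 discs more than the capacity lower bound of 211. A coverage of 81\% corresponds to about $0.81 \cdot 12642 \approx 10240$ assigned galaxies (about 2400 unassigned), and 97.8\% to about $0.978 \cdot 12642 \approx 12364$ assigned (about 280 unassigned).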
444:
445: \subsection{Implementation Details. }\label{sec:impl}
446: For the initial near-uniform covers, we use
447: Hardin, Sloane, and Smith's catalogue of packings
448: of points on the sphere \cite{HardinSS94}.
449: These packings give covers of the entire sphere,
450: but we need a cover of only a (usually rectangular) subregion of the sky.
451: To prune a ``global'' cover $C$ the algorithm
452: first finds a maximum legal assignment of galaxies to discs in $C$,
453: then discards all discs having at most a few assigned galaxies.
454: (The cutoff for discarding a disc is chosen
455: so that the resulting number of discs is as desired.)
456:
457: The inner loop of the algorithm is implemented in C++ using LEDA \cite{MehlhornN}
458: for basic data structures.
459: We use a scaling algorithm by Andrew Goldberg to solve
460: the minimum-cost flow problems \cite{Goldberg1997}.
461: We use TCL for the outer loop of the algorithm and to collect performance data.
462:
463: \begin{figure}[tb]
464: \centerline{
465: \psfig{figure=cost2.ps,width=3.3in,angle=270}
466: \psfig{figure=cost1.ps,width=3.3in,angle=270}
467: }
468: \caption{
469: The assignment penalty as a function of the distance $d$ (in disc
470: radii) between the disc center and the galaxy. The plot on the left is
471: for $d \le 1$; the plot on the right is for $d \ge 1$.}
472: \label{fig:cost}
473: \end{figure}
474: \paragraph{Penalty function:}
475: The penalty for assigning a galaxy to a disc
476: whose center is distance $d$ away is proportional to
477: $$p(d) = \cases{d^2-r^2 & if $d\le r$ \cr 100(d^2-r^2) & if $d\ge r$.}$$
478: Recall that $r$ is the disc radius.
479: When solving the relaxed problem,
480: the algorithm first {\em rounds} the penalties.
481: Rounding so that there are few distinct penalties
482: allows a heuristic reduction in the size of the resulting flow problem.
483: (This heuristic is discussed further below.)
484: Figure~\ref{fig:cost} shows plots of $p$ and the rounded penalties.
485: The rounding is chosen to preserve the distance
486: between the galaxy and the {\em edge} of the disc
487: within roughly a factor of 2.
488: The edge of the disc is important
489: because the penalty function is least smooth
490: for points near the edge.
The factor of 2 is somewhat arbitrary;
it was chosen to balance
the advantages of rounding against the resulting loss of accuracy.
494: After rounding, only 14 or so distinct penalties (each an integer power of 2) arise.
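The penalty itself is easy to state in code. In the sketch below, \verb|penalty| follows the formula above with proportionality constant 1. The rounding function is purely hypothetical: the text does not specify the rounding rule beyond the result being a power of 2, so the sketch simply snaps the magnitude of the penalty to the nearest power of 2 on a logarithmic scale.

\begin{verbatim}
```python
import math

def penalty(d, r=1.0):
    """p(d) from the text: d^2 - r^2 inside the disc, 100 (d^2 - r^2) outside."""
    base = d * d - r * r
    return base if d <= r else 100.0 * base

def round_to_power_of_two(x):
    """Hypothetical rounding: keep the sign of x and snap |x| to the
    nearest power of 2 (nearest in log scale). Zero stays zero."""
    if x == 0:
        return 0.0
    return math.copysign(2.0 ** round(math.log2(abs(x))), x)
```
\end{verbatim}

Under this assumed rule a rounded penalty is within a factor of $\sqrt{2}$ of the original; the rule actually used is chosen in terms of the distance to the disc edge, which this sketch does not attempt to reproduce.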
495:
496: \paragraph{Reducing the size of the flow problem:}
497: We expected the bottleneck in the algorithm
498: to be solving the minimum-cost flow problems.
499: To minimize this time, the algorithm uses a heuristic
500: to reduce the minimum-cost flow problem to a smaller,
501: approximately equivalent, problem.
502: This is the ``approximation'' of the flow problem
503: mentioned in the high-level description of the algorithm.
504: First, the algorithm only considers assigning each galaxy
505: to discs whose centers are within a distance of 2 disc radii,
506: and of these at most the three closest discs.
507: Second, it rounds the penalties as described above
508: to reduce the number of distinct penalties.
509: Finally, instead of having vertices for individual galaxies,
510: it has vertices for equivalence classes of galaxies,
511: where two galaxies are equivalent
512: if they have the same assignable discs
513: with the same rounded penalties.
514: With these heuristics,
515: even for very dense sets of galaxies,
516: the number of equivalence classes will
517: be proportional to the number of discs
518: as long as each disc intersects $O(1)$ other discs.
519: This is true in our case.
520:
521: The precaution of using equivalence classes
522: turned out to be unnecessary for two reasons.
523: First, the average number of galaxies per equivalence class was
524: typically no more than $3$.
525: More fundamentally, solving the flow problems
526: was not in fact a substantial bottleneck
527: (see the data in \S~\ref{newsec}
528: and the subsequent discussion).\footnote{%
529: It is conceivable that the rounding of the penalties decreased the time
530: used by the minimum-cost flow algorithm, as the latter works by scaling.}
531:
532: \paragraph{Constructing the flow problem:}
533: The algorithm stores all the discs in a two-dimensional array
534: so that discs near any given point can be found rapidly.
535: To construct the flow problem, the algorithm iterates through the galaxies.
536: For each galaxy, it finds the discs whose centers are within 2 disc radii.
537: It selects the three nearest of these discs
538: and computes the rounded distances to each.
539: These discs and their rounded distances determine
540: the equivalence class of the galaxy.
541: The equivalence class is found (or created if necessary).
542: From the equivalence classes, the flow network is constructed.
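The construction just described can be sketched as follows. This is a simplified illustration, assuming planar coordinates, a grid with cells of side $2r$ held in a dictionary, and a caller-supplied rounding function \verb|round_penalty| (a hypothetical parameter); it is not the data layout of the actual implementation.

\begin{verbatim}
```python
import math
from collections import defaultdict

def build_equivalence_classes(galaxies, discs, r, round_penalty):
    """Group galaxies by their candidate discs and rounded distances."""
    cell = 2.0 * r
    grid = defaultdict(list)           # grid cell -> discs centred in it
    for j, (x, y) in enumerate(discs):
        grid[(math.floor(x / cell), math.floor(y / cell))].append(j)

    classes = defaultdict(list)        # class key -> galaxy indices
    for i, (gx, gy) in enumerate(galaxies):
        cx, cy = math.floor(gx / cell), math.floor(gy / cell)
        near = []
        # any disc centre within 2r lies in one of the 9 surrounding cells
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    d = math.hypot(gx - discs[j][0], gy - discs[j][1])
                    if d <= 2.0 * r:
                        near.append((d, j))
        near.sort()                    # closest discs first
        # key: up to three nearest discs with their rounded distances
        key = tuple(sorted((j, round_penalty(d)) for d, j in near[:3]))
        classes[key].append(i)
    return dict(classes)
```
\end{verbatim}

Each key then becomes one vertex of the flow network, with supply equal to the number of galaxies in its class.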
543:
544: So that the equivalence classes can be found quickly,
545: each equivalence class is stored in a hash table
546: maintained at its nearest disc.
547: The hash table for a disc $D$ contains those equivalence
548: classes whose nearest disc is $D$.
549: This method preserves locality of reference.
550: In an earlier implementation, a single large hash table
551: held all the equivalence classes.
552: For large problems, this table was too large to fit in main memory.
553: This slowed the algorithm by a factor of roughly 50.
554:
555: \paragraph{Moving the discs:}
556: After the minimum-penalty relaxed assignment is found,
557: recall that each disc is moved individually to minimize the penalty
558: associated with that disc.
559: The ``simple gradient-descent method'' used to do this is as follows.
560: To minimize $f(x,y)$, starting at a point $(x_0,y_0)$,
561: compute the gradient (direction of maximum rate of increase),
562: then move $(x,y)$ in steps of $\alpha$
563: (approximately $16/1000$ disc radii,
564: chosen to balance speed and accuracy)
565: in the direction opposite the gradient
until such steps cease to decrease the value of $f(x,y)$.
567: Recompute the gradient at the new location
568: and repeat the process with $\alpha$ halved.
569: Continue in this fashion, halving $\alpha$ each time,
570: until $\alpha$ is decreased to approximately $2/1000$ disc radii.
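A minimal sketch of this procedure follows, with distances measured in disc radii. The function name and the callables \verb|f| and \verb|grad| are assumptions, and whether a final pass is made at exactly $2/1000$ is a guess on our part about the stopping rule.

\begin{verbatim}
```python
import math

def move_disc(f, grad, x0, y0, alpha=0.016, alpha_min=0.002):
    """Step against the gradient with fixed step length alpha while f
    decreases, then recompute the gradient and halve alpha."""
    x, y = x0, y0
    while alpha >= alpha_min:
        gx, gy = grad(x, y)
        norm = math.hypot(gx, gy)
        if norm > 0:
            ux, uy = -gx / norm, -gy / norm   # unit descent direction
            while f(x + alpha * ux, y + alpha * uy) < f(x, y):
                x, y = x + alpha * ux, y + alpha * uy
        alpha /= 2.0
    return x, y
```
\end{verbatim}

For a smooth convex $f$ each inner loop stops after finitely many steps, and the final position is accurate to roughly the last step length used.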
571:
572: \paragraph{Convergence and ``Polishing'':}
573: The outermost loop of the algorithm
574: does a binary search on the size of the uniform starting cover.
575: Within this loop, the inner loop
576: iteratively improves the given cover.
577:
578: We describe convergence of the inner loop first.
579: Recall that the inner loop starts with a given cover and improves it
580: until the desired number of galaxies are legally covered,
581: or until ``convergence'' occurs. Convergence is determined as follows:
582: after each iteration, if the gap between the actual number of
583: galaxies covered and the desired number did not decrease by at least 5\%,
584: then the algorithm considers the process ``stuck''.
585: At this point it changes the basic improvement step
586: (this is the ``polishing'' heuristic mentioned in the high-level descriptions
587: of the algorithm) as follows:
588: it solves the relaxed problem {\em as if} the disc radius were 2\% smaller.
589: It continues with this heuristic until it also becomes stuck.
590: Every time the process becomes stuck, the algorithm alternates
591: between the standard improvement step and the modified one.
592: If the process is ever stuck for at least two sequential rounds,
593: it is considered to have converged.
594: The purpose of the polishing heuristic
595: is that in the original relaxed problem,
596: a disc may be assigned galaxies
597: that are just barely outside of it at little penalty.
598: These galaxies cannot be legally assigned,
599: yet may ``hold'' discs in place in the subsequent disc-moving step.
600: ``Shrinking'' the effective radius of the disc for a few rounds
601: encourages these galaxies to be assigned elsewhere.
602:
603: Next we describe initial conditions and the convergence criterion for the outer loop.
604: The outer loop maintains a lower bound $L$ and an upper bound $U$
605: on the minimum sufficient cover size.
606: It also maintains covers $C_L$ and $C_U$ obtained by starting with a uniform
607: cover of size $L$ or $U$ (respectively)
608: and applying the basic algorithm to improve the cover until the desired
609: coverage is obtained or convergence occurs.
610: Initially $L$ and $U$ are taken to be $1.05$ and $1.15$, respectively,
611: times the number of galaxies divided by the capacity per disc.
612: The binary search maintains the invariant that $C_L$ and $C_U$ are,
613: respectively, insufficient and sufficient to achieve the desired coverage.
614: If this invariant does not hold initially, $L$ and/or $U$ are adjusted in
615: increments of 5\% to achieve the invariant.
616: The algorithm halts the binary search as soon as
617: the following condition ceases to be met:
618: $C_U$ has more than one more disc than $C_L$,
619: $C_U$ is at least 0.5\% bigger than $C_L$,
620: and $C_U$ legally covers at least 0.5\% more galaxies than $C_L$.
621: Once the search halts, the algorithm returns $C_U$.
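The outer loop can be sketched as below. The sketch is a simplification under stated assumptions: \verb|coverage_of(size)| is a hypothetical callback that runs the inner loop from a uniform cover of the given size and returns the fraction of galaxies legally covered, cover size is identified with the starting size, and ``0.5\% more galaxies'' is read as a difference of 0.005 in that fraction.

\begin{verbatim}
```python
import math

def outer_loop(n_galaxies, capacity, coverage_of, target=0.98):
    """Binary search for a small starting cover size that reaches target."""
    base = n_galaxies / capacity            # capacity lower bound
    step = max(1, round(0.05 * base))       # 5% adjustment increment
    lo, hi = math.ceil(1.05 * base), math.ceil(1.15 * base)
    cov_lo, cov_hi = coverage_of(lo), coverage_of(hi)
    while cov_hi < target:                  # raise U until it is sufficient
        lo, cov_lo = hi, cov_hi
        hi += step
        cov_hi = coverage_of(hi)
    while cov_lo >= target and lo > 1:      # lower L until it is insufficient
        hi, cov_hi = lo, cov_lo
        lo = max(1, lo - step)
        cov_lo = coverage_of(lo)
    # continue only while all three conditions from the text hold
    while hi - lo > 1 and hi >= 1.005 * lo and cov_hi - cov_lo >= 0.005:
        mid = (lo + hi) // 2
        cov_mid = coverage_of(mid)
        if cov_mid >= target:
            hi, cov_hi = mid, cov_mid
        else:
            lo, cov_lo = mid, cov_mid
    return hi
```
\end{verbatim}

Each call to \verb|coverage_of| stands for a full run of the inner loop, so the sketch keeps the coverage of the current bounds rather than recomputing it.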
622:
623:
624: \section{Performance of the Algorithm}\label{sec:perf}
625: We tested the running time and the quality of the solutions
626: found by the algorithm on sample instances.
627: In this section we describe the results.
628:
629: The Survey will map roughly 25\% of the sky
630: --- the region having right ascension zero through $360$ degrees
631: and declination $30$ degrees through $90$ degrees.
632: Roughly one million galaxies will be mapped.
633: Because the two phases of the Survey will be pipelined
634: (the second will be started before the first is done),
635: the second phase will be done in pieces.
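
The 25\% figure can be checked directly. A band of the unit sphere between declinations $\theta_1$ and $\theta_2$, over all right ascensions, has area $2\pi(\sin\theta_2 - \sin\theta_1)$, so its share of the whole sphere (area $4\pi$) is
$$\frac{\sin\theta_2-\sin\theta_1}{2} = \frac{\sin 90^\circ - \sin 30^\circ}{2} = \frac{1-1/2}{2} = \frac14 .$$
(Here $\theta$ denotes declination; the symbol is chosen only to avoid a clash with the density parameter $\delta$.)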
636:
637: \begin{figure}[tb]
638: \begin{tabular}{cc}
639: \begin{tabular}[b]{c}
640: \begin{tabular}{|c||r|r|r|} \hline
641: name & r.\ ascens. & declination & galaxies \\ \hline\hline
642: b & 35 to 55 & -55 to -35 & 29933 \\\hline
643: c & 32 to 57 & -57 to -32 & 45344 \\\hline
644: d & 30 to 59 & -55 to -30 & 52520 \\\hline
645: e & 28 to 62 & -57 to -28 & 70339 \\\hline
646: f & 25 to 65 & -60 to -20 & 109681 \\\hline
647: g & 20 to 70 & -70 to -18 & 157126 \\ \hline
648: \end{tabular}
649: ~\\
650: ~\\
651: ~\\
652: ~\\
653: \end{tabular}
654: &
655: \psfig{figure=cover_size.ps,width=3in,angle=270}
656: \end{tabular}
657: \caption{%
658: Regions from which sample instances were generated;
659: number of discs needed to achieve a $98\%$ coverage
660: (normalized by capacity lower bound).
661: }
662: \label{fig:regions}
663: \label{fig:performance}
664: \end{figure}
We generated the problem instances from data for a region of the sky
that had previously been scanned for a different purpose.
We selected 6 subregions and, for each subregion,
generated 4 problem instances by randomly sampling
30, 50, 70, or 100\% of the galaxies.
This gave us 24 sample problems.
We took the disc radius to be 1.5 degrees
673: and the capacity to be 600
674: times 0.3, 0.5, 0.7, or 1 corresponding to the sampling percentage above.
675: (The base capacity is 600 instead of 660 because approximately 60 points in
676: each disc will be reserved for quasars not in the sample.)
677: The largest region has an area roughly 4\% of the entire sky.
678: For each subregion, the right ascension and declination ranges
679: and the number of galaxies are shown in Figure~\ref{fig:regions}.
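
To make the setup concrete, the four scaled capacities are
\[
600 \times 0.3 = 180, \qquad
600 \times 0.5 = 300, \qquad
600 \times 0.7 = 420, \qquad
600 \times 1 = 600 .
\]
The 4\% figure for the largest region can be checked in the same spirit.
Assuming the ranges in the table are in degrees, a region spanning
$\Delta\alpha$ radians of right ascension between declinations
$\delta_1$ and $\delta_2$ subtends a solid angle of
$\Delta\alpha\,(\sin\delta_2 - \sin\delta_1)$, out of $4\pi$ for the whole sky
(this notation is ours, used only for this check).
For region g, which spans 50 degrees of right ascension
and declinations $-70$ to $-18$ degrees,
\[
\frac{(50\pi/180)\,(\sin 70^\circ - \sin 18^\circ)}{4\pi}
\approx \frac{0.873 \times 0.631}{12.57}
\approx 0.044 .
\]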
680:
681: \subsection{Quality of solutions. }
682: Figure~\ref{fig:performance} illustrates the quality of the solutions
683: returned by the algorithm on the 24 problem instances.
684: The figure plots the size of the cover needed
685: to assign 98\% of the galaxies in each region,
686: normalized by dividing by the number of discs needed just to provide
687: enough capacity to hold 98\% of the galaxies.
The plot also shows the same information for covering by near-uniform covers.
689: The algorithm (very roughly) requires 5\% to 15\% extra capacity,
690: whereas using uniform covers requires 25\% to 35\% extra capacity.
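
In symbols, with $n$ galaxies, capacity $k$ per disc, and a cover $C$
(notation introduced here only for illustration),
the plotted quantity is
\[
\rho = \frac{|C|}{0.98\,n/k} ,
\]
and the ``extra capacity'' is $\rho - 1$.
As a purely illustrative conversion, not a measured result:
for $n = 157126$ and $k = 600$ the denominator is
$0.98 \times 157126 / 600 \approx 256.6$ discs,
so $\rho$ between $1.05$ and $1.15$ would correspond to
roughly 270 to 295 discs,
and $\rho$ between $1.25$ and $1.35$ to roughly 320 to 345 discs.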
691:
692: \subsection{Running time. }
693: \label{newsec}
694: \begin{figure}[tb]
695: \centerline{\psfig{figure=time.ps,width=3.3in,angle=270}
696: \psfig{figure=time1.ps,width=3.3in,angle=270}}
697: \caption{Net time per galaxy and main components;
698: time per galaxy per iteration.
699: Each vertical bar represents a group of points with close $x$-coordinates:
700: the center of the bar is the average; the endpoints are one standard
701: deviation away.
702: }
703: \label{fig:time}
704: \label{fig:time_normalized}
705: \end{figure}
706:
707: Plots of the time per galaxy to solve each problem instance
708: as a function of the number of galaxies
709: appear in Figure~\ref{fig:time}.
710: This net time includes all of the iterations needed to find
711: the final cover for the given problem instance, including
712: the binary search ``outer loop''.
713:
714: The three main components of the running time are
715: the time building the graphs
716: (including finding the equivalence classes of galaxies),
717: the time solving the flow problems,
718: and the time moving the discs.
These plots show that the net running time is on the order of $0.1$
CPU seconds per galaxy (roughly $850000$ galaxies per day),
with the three main components each taking a substantial fraction of the time.
These tests were carried out on a Silicon Graphics machine
with six 150 MHz processors,
a 16 Kbyte data cache,
a 1 Mbyte secondary cache,
and 256 Mbytes of main memory.
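
The daily throughput follows directly from the per-galaxy time:
\[
\frac{86400\ \mbox{s/day}}{0.1\ \mbox{s/galaxy}}
= 864000\ \mbox{galaxies/day} ,
\]
which is consistent with the figure of about $850000$ quoted above.
If the same rate held for one million galaxies
(an extrapolation on our part, not something we measured),
the total would be about $10^6 \times 0.1 = 10^5$ CPU seconds,
or roughly 28 hours.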
727:
728: Most of the variation in the time per galaxy is due to the number of
729: iterations, which varied from 30 to 100 per problem instance,
730: and which increases (in our implementation) with the problem size.
731: Figure~\ref{fig:time_normalized} plots the average time per galaxy per
732: iteration versus the problem size.
733: Each iteration represents the solution of one
734: relaxed problem and one perturbation of one set of discs.
735: The time per iteration grows linearly with the number of galaxies.
736: This is as expected, except for the
737: surprising speed of the flow computations.
738:
739: We note that the binary search is fairly naive,
740: given that in principle a fairly precise guess about the correct size
741: of the starting cover could be made.
742: Similarly, we feel that a more careful
743: and less conservative estimate of convergence,
744: possibly interleaving the two loops in some fashion,
745: might be warranted.
746: These improvements might reduce the total number of iterations substantially.
747:
748: \paragraph{Time to solve flow problems. }
749: The heuristics for keeping the flow problems small appear to be effective.
750: Figure~\ref{fig:flowsize} plots the average number of edges per galaxy
751: in each flow problem as a function of the sampling density of the instance.
752:
753: Figure~\ref{fig:flowtime} plots the average time per edge
754: to solve the individual flow problems.
755: The time appears to grow only near-linearly with the number of edges.
Better-than-worst-case behavior on particular classes of problems
is not uncommon;
furthermore, the flow problems arising here are not particularly hard ones.
759: See \cite{DIMACS93} for computational studies related to this issue.
760: \begin{figure}[tb]
761: \centerline{\psfig{figure=flow_size.ps,width=3.3in,angle=270}
762: \psfig{figure=flow_time.ps,width=3.3in,angle=270}}
763: \caption{Number of edges per galaxy vs.~density;
764: time per edge vs.\ number of edges}
765: \label{fig:flowsize}
766: \label{fig:flowtime}
767: \end{figure}
768:
769: \section{Retrospective}
770: The Euclidean capacitated covering problem arising here is very natural.
771: Looking in the Operations Research literature, we found numerous
772: capacitated covering algorithms based on integer linear programming,
but these were not fast enough for problems of the desired size.
774: The Computer Science literature had a number of efficient approximation
775: algorithms for covering that had provable worst-case performance guarantees,
776: yet these algorithms would not produce good enough solutions in practice.
777:
778: Nonetheless, our final solution rests on theoretical foundations.
779: Our algorithm works in the spirit of Lagrangian relaxation.
780: We decompose the problem into two parts:
781: finding the cover $C$ and finding the assignment $A$.
782: We relax the constraints on $A$
783: by replacing the ``disc-containment'' constraint by a penalty function.
784: Then, for any given cover, finding the minimum-penalty assignment
785: is a tractable problem.
786: Likewise, for any given assignment,
787: finding the minimum-penalty cover is tractable.
788: Thus, the relaxation yields a scheme
789: that iteratively reduces the minimum penalty
790: and so drives the pair $(C,A)$ closer to feasibility.
791: Lagrangian relaxation is a common technique
792: in both the Operations Research \cite{Sridharan93}
793: and the Computer Science \cite{PlotkinST91} literature.
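
The alternating scheme can be written schematically as follows.
Here $P(C,A)$ denotes the penalty of a cover $C$ with assignment $A$,
and $t$ indexes the iterations; both are notation introduced only for this
sketch, and we assume that $P \ge 0$ and that each subproblem is solved exactly.
One round of the scheme is
\[
A_{t+1} = \arg\min_{A} P(C_t, A), \qquad
C_{t+1} = \arg\min_{C} P(C, A_{t+1}),
\]
and under these assumptions the penalty cannot increase:
\[
P(C_{t+1}, A_{t+1}) \le P(C_t, A_{t+1}) \le P(C_t, A_t) .
\]
The first inequality holds because $C_{t+1}$ minimizes the penalty for the
assignment $A_{t+1}$, and the second because $A_{t+1}$ minimizes it for the
cover $C_t$.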
794:
795: Finding the decomposition required an understanding of network flow theory.
796: The ability to solve large problems hinges
797: on a fast network flow algorithm.
Classic augmenting-path algorithms
799: are far too slow for our purpose.
800: Goldberg's algorithm incorporates both
801: recent research within the worst-case model
802: and heuristics discovered by empirical studies
803: (in the spirit of \cite{DIMACS93}).
804:
805: Also useful were Hardin, Sloane, and Smith's sphere covers
806: \cite{HardinSS94}. These enabled us to start with better
807: uniform covers than we might have otherwise.
808: Finally, in prototyping and testing ideas, it helped
809: to have a pre-existing library of relevant high-level data types
810: and algorithms. For this we used LEDA \cite{MehlhornN}.
811:
Worst-case analysis did side-track us slightly.
Although it suggested that network flow would be the
bottleneck for large problems, in practice it was not.
815: Ironically, as described in \S~\ref{sec:impl},
816: our first attempt to keep the flow problems small
817: by using equivalence classes backfired:
818: our original implementation used a single hashing data structure
819: to hold all the equivalence classes;
820: although the standard worst-case model suggests hashing is quite fast,
821: its incautious use slowed the solutions of large problems by a factor of 50
822: due to the lack of locality of reference.
823:
In conclusion, our experience suggests that a successful approach
rests on theoretical understanding, but requires that this understanding
be creatively adapted
to take advantage of the particular structure of the problem instances.
827:
828: \section{Acknowledgements}
829: Thanks to Ken Steiglitz for introducing two of the coauthors
830: and to an anonymous referee for helpful suggestions.
831:
832: \bibliographystyle{plain}
833: \bibliography{full,names,you,theory}
834:
835: \end{document}
836:
837: