
1: \documentstyle[11pt,psfig,fullpage]{article}

2:

3: \newcommand{\discCoverFig}[2]{%

4:   \psfig{figure=#1,width=#2,bbllx=100bp,bblly=175bp,bburx=425bp,bbury=610bp,angle=270,clip=}

5:   }

6:

7: \begin{document}

8:

9: \title{%

10:   \Large Data Collection for the Sloan Digital Sky Survey

11:   --- A Network-Flow Heuristic

12:   }

13: \author{%

14:   Robert Lupton\thanks{%

15:     Astrophysics Department,

16:     Princeton University, Princeton, NJ 08540.

17:     E-mail: {\tt rhl@astro.princeton.edu}.

18:     } \\

19:   \and

20:   F. Miller Maley\thanks{%

21:     Mathematics Department,

22:     Princeton University, Princeton, NJ 08540.

23:     E-mail: {\tt fmaley@haverford.edu}.

24:     } \\

25:   \and

26:   Neal Young\thanks{%

27:     Computer Science Department, Dartmouth College, Hanover, NH 03755.

28:     Parts of this research were done at:

29:     AT\&T Bell Laboratories, Murray Hill, NJ 07974;

30:     the School of ORIE, Cornell University, Ithaca NY 14853

31:     on \'Eva Tardos' NSF PYI grant DDM-9157199; and

32:     the Dept's of Astrophysics and Computer Science, Princeton University.

33:     Corresponding author.    E-mail: {\tt ney@cs.dartmouth.edu}.

34:     }

35:   }

36: \date{}

37:

38: \maketitle

39:

40: \pagestyle{myheadings}

41: \markboth{Robert Lupton, F. Miller Maley and Neal Young}{%

42:   Data Collection for the Sloan Digital Sky Survey

43:   --- A Network-Flow Heuristic}

44:

45: %\pagenumbering{arabic}

46: %\setcounter{page}{1}%Leave this line commented out.

47:

48: \begin{abstract}

49:   This paper describes an NP-hard combinatorial optimization problem

50:   arising in the Sloan Digital Sky Survey

51:   and a practical approximation algorithm

52:   that has been implemented and will be used in the Survey.

53:   The algorithm is based on network flow theory

54:   and Lagrangian relaxation.

55: \end{abstract}

56:

57: \section{The Sloan Digital Sky Survey}

58: \begin{quote}

59:   \small

60:

61:   ``The Sloan Digital Sky Survey

62:   [is] a joint project of the Astrophysical Research Consortium.

63:   ...

64:   The goal of the project, which is scheduled to begin in 1997

65:   and take five years, is to make a much better map of the universe

66:   than is currently available.

67:   The volume of the universe to be surveyed will be 100 times larger

68:   than the volume of previous surveys.

69:   The number of galaxies with known distances is expected to increase

70:   by a factor of 100 to 1,000,000 galaxies

71:   and the number of quasars to increase to 100,000.

72:

73:   ``The Sloan Foundation ... has contributed \$8 million

74:   to the \$18 million capital costs of the project. ...

75:

76:   ``In order to do the survey, ARC is designing and building a special purpose

77:   2.5 meter (100-inch) telescope at its Apache Point Observatory. ...

78:

79:   ``[The Sky Survey will proceed in two phases.

80:   In the first phase, a two-dimensional map of the sky will be made.

81:   For the second phase, the] million brightest galaxies

82:   and the one hundred thousand brightest quasars will be selected

83:   for spectroscopic analysis from the two-dimensional map...

84:

85:   \hfill \em \cite[Savani, 1994]{Savani}

86: \end{quote}

87: To gather the spectroscopic data in the second phase,

88: the telescope will be pointed repeatedly at the sky

89: to take a series of ``snapshots''.

90: Each snapshot will capture data for up to $660$

91: galaxies and quasars in the circular portion of the sky

92: visible through the telescope.

93: For each captured galaxy, light from that galaxy will enter

94: the telescope and travel through an optical fiber to a spectral analyzer.

95: The optical fibers (one for each galaxy) will be held in place

96: by a ``plug plate'' drilled to hold the up to $660$ fibers,

97: each aligned to accept the light of its respective galaxy

98: \cite{Crease}.

99:

100: \subsection{A Capacitated Covering Problem. }

101: \label{problemsec}

102: The second phase of the survey is expected to cost

103: on the order of \$4--5 million.

104: This cost will depend primarily on the number of snapshots taken.

105: This paper concerns the following problem:

106: given the ``2-dimensional'' locations of the desired galaxies,

107: determine a minimum-size set of snapshots that capture them.

108: Formally:

109: \begin{quote}

110:   \underline{Euclidean Capacitated Covering by Disks} (ECCD)

111:

112:   \em

113:   Given a collection of points on the unit sphere,

114:   a radius $r$, and a capacity $c$, find a small set of discs of radius $r$

115:   (located on the sphere) such that each given point

116:   can be assigned to a disc containing it,

117:   with no disc being assigned more than $c$ points.

118: \end{quote}

119: The sphere corresponds to the view-sphere centered at the telescope.

120: The points correspond to the images of the galaxies

121: projected on the view-sphere.

122: Each disc represents one snapshot to be taken through the telescope;

123: the points assigned to that disc correspond

124: to those galaxies for which data will be collected in that snapshot.

125: The capacity $c$ is the maximum number of galaxies for which

126: spectral data can be gathered in a single snapshot

127: (due to limitations in packing the optical fibers).
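The two conditions in the ECCD definition (containment and capacity) can be checked mechanically. The following Python fragment is a minimal sketch under stated assumptions, not part of the Survey software: it assumes positions are given as (right ascension, declination) pairs in degrees and that the disc radius is an angular radius in degrees; both function names are hypothetical.

```python
import math


def angular_distance_deg(ra1, dec1, ra2, dec2):
    """Great-circle angle in degrees between two (RA, Dec) positions in degrees."""
    def unit(ra, dec):
        ra, dec = math.radians(ra), math.radians(dec)
        return (math.cos(dec) * math.cos(ra),
                math.cos(dec) * math.sin(ra),
                math.sin(dec))

    a, b = unit(ra1, dec1), unit(ra2, dec2)
    dot = sum(p * q for p, q in zip(a, b))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))


def is_legal(points, discs, owner, r_deg, c):
    """Check an assignment: owner[i] is the disc index given to point i.

    Legal means every point lies in its disc and no disc holds more than c points.
    """
    load = [0] * len(discs)
    for (ra, dec), v in zip(points, owner):
        load[v] += 1
        if angular_distance_deg(ra, dec, *discs[v]) > r_deg:
            return False
    return max(load, default=0) <= c
```

The ECCD objective is then to find as few discs as possible for which some assignment passes this check.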

128:

129: \begin{figure}[t]

130:   \centerline{\discCoverFig{sample.ps}{2in}

131:     \discCoverFig{uniform.ps}{2in}

132:     \discCoverFig{final.ps}{2in}}

133:   \caption{Sample instance (points are dark); near-uniform cover; better cover.

134:     This near-uniform cover is from an earlier implementation

135:     not using Hardin et al.'s covers, which are more uniform.

136:     }

137:   \label{fig:sample}

138:   \label{fig:uniform}

139:   \label{fig:final}

140: \end{figure}

141:

142: The ECCD problem is NP-hard \cite{MegiddoS84}.

143: The instances we need to solve will have hundreds of thousands of points.

144: Luckily, as Figure~\ref{fig:sample} illustrates,

145: the instances we need to solve are nicely structured.

146:

147: In this paper we describe a heuristic algorithm for the problem.

148: The algorithm is effective for instances arising in the Survey

149: and will be used for it.

150: The basic idea behind the algorithm is to start

151: with a near-uniform cover of the sphere by discs

152: \cite{HardinSS94} and then to iteratively improve the cover.

153: The key observation is that a given cover can be improved

154: by first solving a relaxation of the problem in which the ``point-in-disc''

155: constraints are replaced by penalties for assigning points

156: to discs not containing them,

157: and then moving the discs to minimize the cost of the assignment found.

158: The relaxed problem reduces to the minimum-cost flow problem.

159: In our tests, the algorithm runs in nearly linear time

160: and finds covers that are roughly 20\% better

161: than comparable near-uniform covers.

162:

163: \section{Related Work}

164: The NP-completeness of the variant when the points lie in the plane

165: was proven by Megiddo and Supowit \cite{MegiddoS84}.

166: The proof adapts easily to our problem.

167: The NP-completeness of the planar problem when the discs are required

168: to be centered on the given points was proven by

169: Marchetti-Spaccamela \cite{Marchettispaccamela81}.

170: When the covering regions are rings, instead of discs,

171: Maass \cite{Maass86} showed the problem NP-complete

172: even if the points all lie on a single line.

173:

174: Papadimitriou \cite{Papadimitriou81}

175: (improving results by Fisher and Hochbaum \cite{FisherH80}),

176: considered the related {\em $p$-medians} problem in the plane,

177: which is that of covering the given points with

178: $p$ discs (of arbitrary radii, but centered at $p$ of the given points)

179: so as to minimize the {\em sum} of the disc radii.

180: He showed the problem to be NP-complete

181: and presented average-case analyses of several algorithms.

182: One of the heuristics is a uniform (``honeycomb'') covering

183: of the points by discs,

184: which he shows gives a near-optimal solution with high probability

185: when $p$ is $\omega(\log n)$ and $o(n/\log n)$

186: and the points are randomly distributed in the unit square.

187:

188: The problem can be modeled as a capacitated set-cover problem.

189: The well-known greedy algorithm of Johnson \cite{Johnson74}

190: and Lov\'asz \cite{Lovasz75},

191: as modified for the capacitated case

192: by Bar-Ilan, Kortsarz, and Peleg \cite{BarilanKP93},

193: would yield a $\ln n$-approximate solution,

194: where $n$ is the number of galaxies.

195: This algorithm is not good enough in practice.

196: For this particular set-cover problem

197: the dual of the set system has bounded VC-dimension;

198: in this case an improved approximation algorithm

199: is known for the uncapacitated case \cite{BronnimanG95},

200: but, judging from a few small trials,

201: this algorithm does not appear to take sufficient advantage

202: of the structure of our problem instances to perform well in practice.

203:

204: Numerous generalizations of our problem have been considered under various

205: names, including ``(un)capacitated facility (or plant) location,''

206: ``$p$-centers'', and ``minimax facility location''.

207: These problems have been studied under various metrics

208: and also in general graphs.

209: In general, polynomial-time exact algorithms are known

210: only when the number of covering regions (in our case, discs) is small

211: (e.g., \cite{AgarwalS94})

212: or when the underlying metric space (or network) is tree-like

213: (e.g.,

214: \cite{MegiddoTZC81,FredericksonJ83,MegiddoT83,GurevichSV84,HeY90,ErkutFT92}).

215: Generally, these algorithms are for uncapacitated problems.

216:

217: There is a large literature on these problems in Operations Research.

218: Relevant books include

219: \cite{LoveMW88,NemhauserW88,FrancisMW91,Francis90}.

220: Much of this research has concentrated

221: on adapting integer-programming techniques

222: to fairly general formulations of the problem.

223: For example, recent works on the Capacitated Facility Location Problem

224: (a generalization of our problem to arbitrary networks)

225: include \cite{CornuejolsST91,Sridharan93}.

226: Quoting from the conclusion of ``Approximate Solutions to Large Scale

227: Capacitated Facility Location Problems'' (1990) \cite{Shetty90}:

228: \begin{quote} \small

229:   The problem of locating facilities has inspired a rich body of literature

230:   which spans well over two decades.  Numerous algorithms have been devised

231:   and successfully applied to problems with as many as 200 customers

232:   and 100 facilities.  The computational experience on larger problems,

233:   however, has been virtually non-existent... In the work leading to this

234:   paper, the objective was to develop a heuristic algorithm that can be used

235:   to generate effective solutions for large scale facility locations problems.

236:   The computational results obtained so far seem to indicate that this

237:   requirement can be met for problems with as many as 1000 customers

238:   and 100 facilities.

239: \end{quote}

240:

241: \section{The Algorithm}\label{sec:alg}

242:

243: The instances arising in the Sky Survey exhibit particular structure.

244: Within any given region,

245: the galaxies are distributed densely throughout the region,

246: somewhat uniformly but with clustering tendencies

247: and variation in density.

248: The density of the galaxies means that

249: virtually the entire region must be covered by discs.

250: The variation in density means that

251: more discs must be concentrated within densely populated regions.

252: As a reference point, consider the sparsest possible covering

253: of the area by discs (resembling a ``honeycomb'').

254: This cover provides roughly the right {\em total}\/ capacity

255: and does well in sparse areas,

256: but in dense areas does not provide sufficient capacity.

257: Any good solution will have to maintain a honeycomb-like structure

258: in sparse areas while bunching discs more densely in dense areas.

259:

260: The outer loop of the algorithm does a binary search for the smallest value of

261: a density parameter $\delta$ that leads to success in the inner loop.  The

262: inner loop begins with a near-uniform cover of normalized density $1+\delta$

263: and iteratively improves it (see Figure~\ref{fig:uniform} for ``before'' and

264: ``after'' covers).  Each iteration of the loop perturbs the discs, as described

265: below, in an attempt to improve the cover (Figure~\ref{fig:move} shows the

266: results of such a series of improvement steps).  If the desired coverage is

267: obtained, the inner loop stops (successfully).  If the perturbation ceases to

268: improve the cover, the inner loop stops (unsuccessfully).

269:

270: \begin{figure}[t]

271:   \begin{center}

272:     \leavevmode

273:     \discCoverFig{move.ps}{0.5\textwidth}

274:   \end{center}

275:   \caption{Composite of intermediate covers}

276:   \label{fig:move}

277: \end{figure}

278:

279: Next we describe how the algorithm perturbs a given cover in order to improve

280: it.  We start with the observation that for a {\em given}\/ set of discs (with

281: known locations), the problem of finding the maximum number of galaxies that

282: can be assigned reduces to a generalized maximum bipartite matching problem

283: in a graph $G=(U,V,E)$, where the vertices in $U$ correspond to galaxies, the

284: vertices in $V$ correspond to the discs, and edge $(u,v)$ is present if $u$'s

285: galaxy is in $v$'s disc.  A maximum legal assignment of galaxies to discs then

286: corresponds to a maximum size set $S$ of edges such that each $u \in U$ is

287: incident to at most one edge in $S$ while each $v\in V$ is incident to at most

288: $c$ edges in $S$.

289:

290: Since the latter problem reduces in a standard way

291: to the maximum flow problem \cite{PapadimitriouS82},

292: which can be efficiently solved, it follows that

293: for a {\em given} set of discs, one can efficiently find a

294: maximum legal assignment of galaxies to discs.\iffalse\footnote{%

295:   The standard reduction can be improved by the following heuristic: say that

296:   two galaxies are equivalent if they are contained in the same set of discs.

297:   Replace the vertex set $U$ by a set $U'$, where each $u'$ represents a

298:   resulting equivalence class.  Finally, alter the matching constraint so that

299:   the number of matching edges incident to $u'$ is constrained to be at most

300:   the size of $u'$'s equivalence class.  Although in a typical problem the

301:   galaxies will be fairly dense, each disc will intersect only $O(1)$ other

302:   discs, so the number of equivalence classes will be proportional to the

303:   number of discs.}\fi
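The maximum legal assignment for a fixed set of discs can be illustrated without a full maximum-flow code. The Python sketch below uses augmenting paths directly on the bipartite graph $G$, which is equivalent to the flow formulation for this special case; it is an illustration only, and the function and variable names are hypothetical.

```python
def max_legal_assignment(galaxy_discs, num_discs, capacity):
    """Return owner[u] = disc assigned to galaxy u, or None if u is left out.

    galaxy_discs[u] lists the discs that contain galaxy u (the edges of G).
    """
    held = [[] for _ in range(num_discs)]   # galaxies currently assigned to each disc
    owner = [None] * len(galaxy_discs)

    def place(u, visited):
        # Try to assign u, re-routing earlier galaxies along an augmenting path.
        for v in galaxy_discs[u]:
            if v in visited:
                continue
            visited.add(v)
            if len(held[v]) < capacity:      # disc v still has room
                held[v].append(u)
                owner[u] = v
                return True
            for w in list(held[v]):          # disc v is full: try to move someone out
                if place(w, visited):        # w found room in another disc
                    held[v].remove(w)
                    held[v].append(u)
                    owner[u] = v
                    return True
        return False

    for u in range(len(galaxy_discs)):
        place(u, set())
    return owner
```

For example, with two discs of capacity 1, a galaxy lying in both discs and a second galaxy lying only in disc 0, the first galaxy is moved to disc 1 so that both can be assigned. The recursion is adequate for small examples; an instance of Survey size would call for an iterative flow algorithm.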

304:

305: Of course, the maximum legal assignment may still leave many galaxies unassigned,

306: even though many discs are not used to capacity.

307: In this case, how can discs be moved to improve the coverage?

308: Consider the following relaxation of the problem:

309: \begin{quote}{

310:     \underline{Relaxed Problem}

311:

312:     \em

313:     Given a set of discs, a set of galaxies, and a capacity $c$,

314:     find a {\bf minimum-penalty} assignment of the galaxies to discs

315:     such that no disc is assigned more than $c$ galaxies.}

316: \end{quote}

317: Here a galaxy can be assigned to a disc not containing it,

318: but there is a penalty for doing so that encourages assignments

319: of galaxies to nearby discs (details of the penalty function are

320: in \S~\ref{sec:impl}).

321:

322: The relaxed problem can be solved efficiently (even for arbitrary penalties) by

323: reducing it to the assignment problem or to minimum-cost maximum flow.  We

324: reduced it to the latter, more general, problem in anticipation of having to

325: incorporate more complex constraints on the assignment (that no sufficiently

326: close pairs of galaxies should be assigned to the same disc) at a later point.

327: As described below, even the more general problem can be solved quickly enough

328: for our purposes.
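To make the reduction concrete, the following Python sketch solves a small relaxed problem as a minimum-cost flow. It is a minimal illustration and makes several assumptions: it uses successive shortest paths with Bellman-Ford rather than the scaling algorithm used in the actual implementation, it takes a dense penalty matrix as input, and all names are hypothetical.

```python
def min_penalty_assignment(penalty, capacity):
    """Assign galaxies to discs (at most `capacity` per disc) at minimum total penalty.

    penalty[u][v] is the penalty for assigning galaxy u to disc v.
    Returns (total_penalty, owner), where owner[u] is the disc chosen for u.
    """
    n_g, n_d = len(penalty), len(penalty[0])
    source, sink = 0, n_g + n_d + 1
    n = sink + 1
    head, cap, cost = [], [], []     # edge e and its reverse e ^ 1 are stored in pairs
    adj = [[] for _ in range(n)]

    def add_edge(a, b, c, w):
        adj[a].append(len(head)); head.append(b); cap.append(c); cost.append(w)
        adj[b].append(len(head)); head.append(a); cap.append(0); cost.append(-w)

    choice = [[0] * n_d for _ in range(n_g)]   # index of each galaxy->disc edge
    for u in range(n_g):
        add_edge(source, 1 + u, 1, 0)
        for v in range(n_d):
            choice[u][v] = len(head)
            add_edge(1 + u, 1 + n_g + v, 1, penalty[u][v])
    for v in range(n_d):
        add_edge(1 + n_g + v, sink, capacity, 0)

    inf = float("inf")
    total = 0
    while True:
        # Bellman-Ford: cheapest residual path from source to sink.
        dist, via = [inf] * n, [-1] * n
        dist[source] = 0
        for _ in range(n - 1):
            changed = False
            for a in range(n):
                if dist[a] == inf:
                    continue
                for e in adj[a]:
                    if cap[e] > 0 and dist[a] + cost[e] < dist[head[e]]:
                        dist[head[e]] = dist[a] + cost[e]
                        via[head[e]] = e
                        changed = True
            if not changed:
                break
        if dist[sink] == inf:
            break                     # no more galaxies can be routed
        node = sink
        while node != source:         # push one unit of flow along the path
            e = via[node]
            cap[e] -= 1
            cap[e ^ 1] += 1
            node = head[e ^ 1]
        total += dist[sink]

    owner = [next((v for v in range(n_d) if cap[choice[u][v]] == 0), None)
             for u in range(n_g)]
    return total, owner
```

With penalties {\tt [[1, 4], [2, 8]]} and capacity 1, assigning each galaxy to its cheapest disc in turn would cost $1+8=9$, whereas the flow solution reroutes the first galaxy and costs $4+2=6$.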

329:

330: A solution to the relaxed problem will assign all galaxies to discs, but a

331: given disc may be assigned galaxies outside of it.  {\em The advantage of the

332:   relaxed problem is that a solution to it can give information about how to

333:   improve a given set of discs.} The intuition is that if excess demand (i.e.\

334: a high density of galaxies relative to discs) exists in one area, and excess

335: capacity exists in another, then a disc between the two areas will tend to be

336: assigned galaxies that are outside of the disc and that lie towards the area of

337: excess demand.  Figure~\ref{fig:relax} illustrates this.

338: \begin{figure}[t]

339:   \begin{center}

340:     \leavevmode

341:     \discCoverFig{relax.ps}{5in}

342:   \end{center}

343:   \caption{Relaxed assignment}

344:   \label{fig:relax}

345: \end{figure}

346:

347: Once a minimum-penalty solution to the relaxed problem has been found, the

348: algorithm moves the discs to minimize the cost of the particular assignment of

349: galaxies to discs specified by the minimum-penalty solution.  This problem can

350: be solved {\em independently} for each disc.  For a given disc, for a fixed set

351: of galaxies assigned to it, the sum of the penalties for those assignments is a

352: function $f(x,y)$ of the coordinates $(x,y)$ of the center of the disc.  As

353: long as the penalty function is convex and reasonably smooth, $f$ will be also.

354: Starting with the current location $(x_0,y_0)$ of the disc, a simple

355: gradient-based method (described in \S~\ref{sec:impl}) is used to find $(x,y)$

356: minimizing $f(x,y)$.

357:

358: \newenvironment{tabAlgorithm}{

359: \setcounter{algorithmLine}{1}

360: \samepage

361: \begin{tabbing}

362: 999\=\kill

363: }{

364: \end{tabbing}

365: }

366: \newcounter{algorithmLine}

367: \newcommand{\algline}{\\\thealgorithmLine\hfil\>\stepcounter{algorithmLine}}

368:

369: \begin{figure}[tb]

370:   \begin{center}

371:     \framebox[.95\textwidth][c]{\parbox{.9\textwidth}{

372:         \underline{Inner Loop} ($\epsilon$ is fixed, $\delta$ is determined by

373:         the outer loop)

374:         \begin{enumerate}

375:         \item Compute a minimum-size near-uniform cover $C$ of the region by

376:           discs so that the total capacity of the discs in $C$ is at least

377:           $1+\delta$ times the number of galaxies.

378:

379:         \item {\bf Repeat} until convergence (after ``polishing'') or

380:           $1-\epsilon$ of the galaxies are legally assigned.

381:

382:           \begin{enumerate}

383:

384:           \item Compute a minimum-penalty assignment $A$ of the galaxies to the

385:             discs in $C$.  Do this by solving (an approximation of) the

386:             corresponding minimum-cost flow problem.

387:

388:           \item Move the discs in $C$ to minimize the penalty associated with

389:             the assignment $A$.  Move each disc independently to minimize

390:             its associated penalty by a simple gradient-descent method.

391:

392:           \end{enumerate}

393:         \item Find a maximum legal assignment of the galaxies to $C$.  Do

394:           this by solving a corresponding maximum-flow problem.

395:

396:         \item {\bf If} at least $1-\epsilon$ of the galaxies are assigned,

397:           {\bf succeed}, else {\bf fail}.

398:         \end{enumerate}

399:         }}

400:   \end{center}

401:   \caption{

402:     Given a desired coverage $1-\epsilon$, where $\epsilon \ge 0$, the outer

403:     loop of the algorithm does a binary search for the smallest value of

404:     $\delta \ge 0$ such that the above inner loop succeeds.  Further details,

405:     including the ``polishing'' step, the ``approximation'' of the flow

406:     problem, and the criteria for convergence, are described in

407:     \S~\protect\ref{sec:impl}.}

408:   \label{fig:alg}

409: \end{figure}

410: This gives us the essentials of the inner loop of the algorithm.  It starts

411: with a near-uniform cover of some specified (normalized) density $1+\delta$.

412: It improves the cover by finding a minimum-penalty assignment of the galaxies

413: to the discs and then moving the discs to their optimal locations given that

414: assignment.  It continues, alternately improving the assignment and then moving

415: the discs, until the net penalty ceases to decrease appreciably.  At the end of

416: the inner loop, the algorithm finds a legal (not relaxed) assignment of

417: galaxies to discs maximizing the number of assigned galaxies.

418: Figure~\ref{fig:move} shows a sequence of covers generated by a single run of

419: the inner loop.

420:

421: The outer loop performs a binary search to find the smallest $\delta$ that causes

422: the inner loop to successfully cover the galaxies.  The presentation here is a

423: slight simplification of the actual algorithm, in that the actual algorithm

424: uses a ``polishing'' heuristic before terminating the inner loop, and a

425: heuristic is applied to reduce the size of the network-flow problem before solving

426: it.  These heuristics and other details about convergence of the inner and outer

427: loops, and starting conditions for the outer loop, are described in

428: \S~\ref{sec:impl}.

429:

430: \subsection{Example Run of Inner Loop. }

431: The sample instance in Figure~\ref{fig:sample} contains 12642 points ---

432: a random $10\%$ of the points in a subregion of the sky previously scanned.

433: The size of this subregion is about $10\%$ of that of the region that will be

434: mapped by the Survey.

435: A uniform cover of 218 discs of capacity 60 (total capacity 13080)

436: allows 81\% of the galaxies to be assigned.

437: After 16 iterations of the inner loop of the algorithm,

438: the improved cover captures 97.8\% of the points.

439: Figure~\ref{fig:sample} shows the initial near-uniform cover

440: and the final cover;

441: Figure~\ref{fig:move} shows a composite of the successive covers.

442: Section~\ref{sec:perf} describes comprehensive testing of the quality of the solutions

443: given by the algorithm and its running time.
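As a quick check of the figures in this example, the total capacity of the initial cover and its ratio to the number of points are
$$218 \times 60 = 13080, \qquad \frac{13080}{12642} \approx 1.035 .$$
If the normalized density is read as total capacity divided by the number of galaxies (as in step 1 of the inner loop), this starting cover corresponds to roughly $\delta \approx 0.035$.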

444:

445: \subsection{Implementation Details. }\label{sec:impl}

446: For the initial near-uniform covers, we use

447: Hardin, Sloane, and Smith's catalogue of packings

448: of points on the sphere \cite{HardinSS94}.

449: These packings give covers of the entire sphere,

450: but we need a cover of only a (usually rectangular) subregion of the sky.

451: To prune a ``global'' cover $C$ the algorithm

452: first finds a maximum legal assignment of galaxies to discs in $C$,

453: then discards all discs having at most a few assigned galaxies.

454: (The cutoff for discarding a disc is chosen

455: so that the resulting number of discs is as desired.)

456:

457: The inner loop of the algorithm is implemented in C++ using LEDA \cite{MehlhornN}

458: for basic data structures.

459: We use a scaling algorithm by Andrew Goldberg to solve

460: the minimum-cost flow problems \cite{Goldberg1997}.

461: We use TCL for the outer loop of the algorithm and to collect performance data.

462:

463: \begin{figure}[tb]

464:   \centerline{

465:     \psfig{figure=cost2.ps,width=3.3in,angle=270}

466:     \psfig{figure=cost1.ps,width=3.3in,angle=270}

467:     }

468:   \caption{

469:     The assignment penalty as a function of the distance $d$ (in disc

470:     radii) between the disc center and the galaxy.  The plot on the left is

471:     for $d \le 1$; the plot on the right is for $d \ge 1$.}

472:   \label{fig:cost}

473: \end{figure}

474: \paragraph{Penalty function:}

475: The penalty for assigning a galaxy to a disc

476: whose center is distance $d$ away is proportional to

477: $$p(d) = \cases{d^2-r^2 & if $d\le r$ \cr 100(d^2-r^2) & if $d\ge r$.}$$

478: Recall that $r$ is the disc radius.

479: When solving the relaxed problem,

480: the algorithm first {\em rounds} the penalties.

481: Rounding so that there are few distinct penalties

482: allows a heuristic reduction in the size of the resulting flow problem.

483: (This heuristic is discussed further below.)

484: Figure~\ref{fig:cost} shows plots of $p$ and the rounded penalties.

485: The rounding is chosen to preserve the distance

486: between the galaxy and the {\em edge} of the disc

487: within roughly a factor of 2.

488: The edge of the disc is important

489: because the penalty function is least smooth

490: for points near the edge.

491: The factor of 2 is somewhat arbitrary;

492: it was chosen to balance between

493: the advantages of rounding and the resulting loss of accuracy.

494: After rounding, only 14 or so distinct penalties (each an integer power of 2) arise.
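The penalty $p(d)$ is simple to state in code. The Python sketch below implements $p$ as given above and, separately, one possible way of rounding the distance to the disc edge to within a factor of 2. The rounding rule is an assumption made for illustration only (the exact rule used in the implementation is not specified here), and the function names and the {\tt floor} parameter are hypothetical.

```python
import math


def penalty(d, r=1.0):
    """p(d) as defined in the text: negative inside the disc, steep outside."""
    gap = d * d - r * r
    return gap if d <= r else 100.0 * gap


def round_edge_distance(d, r=1.0, floor=2.0 ** -6):
    """Hypothetical rounding: snap |d - r| up to the next power of two.

    The distance to the disc edge is preserved within a factor of 2, and
    distances smaller than `floor` are treated as `floor`.
    """
    gap = max(abs(d - r), floor)
    rounded = 2.0 ** math.ceil(math.log2(gap))
    return r + rounded if d > r else r - rounded
```

Under this particular rule, a galaxy at $d = 1.3r$ is treated as if it were at $1.5r$, and one at $d = 0.9r$ as if it were at $0.875r$; the rounded penalty would then be $p$ evaluated at the rounded distance.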

495:

496: \paragraph{Reducing the size of the flow problem:}

497: We expected the bottleneck in the algorithm

498: to be solving the minimum-cost flow problems.

499: To minimize this time, the algorithm uses a heuristic

500: to reduce the minimum-cost flow problem to a smaller,

501: approximately equivalent, problem.

502: This is the ``approximation'' of the flow problem

503: mentioned in the high-level description of the algorithm.

504: First, the algorithm only considers assigning each galaxy

505: to discs whose centers are within a distance of 2 disc radii,

506: and of these at most the three closest discs.

507: Second, it rounds the penalties as described above

508: to reduce the number of distinct penalties.

509: Finally, instead of having vertices for individual galaxies,

510: it has vertices for equivalence classes of galaxies,

511: where two galaxies are equivalent

512: if they have the same assignable discs

513: with the same rounded penalties.

514: With these heuristics,

515: even for very dense sets of galaxies,

516: the number of equivalence classes will

517: be proportional to the number of discs

518: as long as each disc intersects $O(1)$ other discs.

519: This is true in our case.

520:

521: The precaution of using equivalence classes

522: turned out to be unnecessary for two reasons.

523: First, the average number of galaxies per equivalence class was

524: typically no more than $3$.

525: More fundamentally, solving the flow problems

526: was not in fact a substantial bottleneck

527: (see the data in \S~\ref{newsec}

528: and the subsequent discussion).\footnote{%

529:   It is conceivable that the rounding of the penalties decreased the time

530:   used by the minimum-cost flow algorithm, as the latter works by scaling.}

531:

532: \paragraph{Constructing the flow problem:}

533: The algorithm stores all the discs in a two-dimensional array

534: so that discs near any given point can be found rapidly.

535: To construct the flow problem, the algorithm iterates through the galaxies.

536: For each galaxy, it finds the discs whose centers are within 2 disc radii.

537: It selects the three nearest of these discs

538: and computes the rounded distances to each.

539: These discs and their rounded distances determine

540: the equivalence class of the galaxy.

541: The equivalence class is found (or created if necessary).

542: From the equivalence classes, the flow network is constructed.
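The construction above can be sketched in Python as follows. This is a simplified illustration under several assumptions: coordinates are planar, the grid of discs is a dictionary keyed by cell, the rounding function is supplied by the caller, and a single dictionary holds all equivalence classes (rather than one hash table per disc, as described next). All names are hypothetical.

```python
import math
from collections import defaultdict


def build_classes(galaxies, discs, r=1.0, round_dist=lambda d: round(d, 1)):
    """Group galaxies by (up to three nearest discs within 2r, rounded distances).

    Returns a dict mapping each equivalence-class key to its number of galaxies.
    """
    cell = 2.0 * r                       # any disc within 2r lies in a neighbouring cell
    grid = defaultdict(list)
    for k, (x, y) in enumerate(discs):
        grid[(math.floor(x / cell), math.floor(y / cell))].append(k)

    classes = defaultdict(int)
    for gx, gy in galaxies:
        cx, cy = math.floor(gx / cell), math.floor(gy / cell)
        near = []
        for i in range(cx - 1, cx + 2):
            for j in range(cy - 1, cy + 2):
                for k in grid.get((i, j), ()):
                    d = math.hypot(gx - discs[k][0], gy - discs[k][1])
                    if d <= 2.0 * r:
                        near.append((d, k))
        near.sort()                      # nearest discs first
        key = tuple(sorted((k, round_dist(d)) for d, k in near[:3]))
        classes[key] += 1
    return dict(classes)
```

Each key then becomes one vertex of the flow network, with a supply equal to the number of galaxies in the class.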

543:

544: So that the equivalence classes can be found quickly,

545: each equivalence class is stored in a hash table

546: maintained at its nearest disc.

547: The hash table for a disc $D$ contains those equivalence

548: classes whose nearest disc is $D$.

549: This method preserves locality of reference.

550: In an earlier implementation, a single large hash table

551: held all the equivalence classes.

552: For large problems, this table was too large to fit in main memory.

553: This slowed the algorithm by a factor of roughly 50.

554:

555: \paragraph{Moving the discs:}

556: After the minimum-penalty relaxed assignment is found,

557: recall that each disc is moved individually to minimize the penalty

558: associated with that disc.

559: The ``simple gradient-descent method'' used to do this is as follows.

560: To minimize $f(x,y)$, starting at a point $(x_0,y_0)$,

561: compute the gradient (direction of maximum rate of increase),

562: then move $(x,y)$ in steps of $\alpha$

563: (approximately $16/1000$ disc radii,

564: chosen to balance speed and accuracy)

565: in the direction opposite the gradient

566: until such steps cease to decrease the value of $f(x,y)$.

567: Recompute the gradient at the new location

568: and repeat the process with $\alpha$ halved.

569: Continue in this fashion, halving $\alpha$ each time,

570: until $\alpha$ is decreased to approximately $2/1000$ disc radii.
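The disc-moving step can be sketched in Python as below. This is a minimal illustration rather than the actual implementation: it assumes planar coordinates measured in disc radii, estimates the gradient by central differences (how the gradient is computed is not specified above), and uses the penalty $p(d)$ from the penalty-function paragraph. All names are hypothetical.

```python
import math


def penalty(d, r=1.0):
    """p(d) from the text: d^2 - r^2 inside the disc, 100 times that outside."""
    gap = d * d - r * r
    return gap if d <= r else 100.0 * gap


def move_disc(center, galaxies, r=1.0, alpha=0.016, alpha_min=0.002):
    """Move one disc to reduce the summed penalty of its assigned galaxies."""
    def f(x, y):
        return sum(penalty(math.hypot(x - gx, y - gy), r) for gx, gy in galaxies)

    def grad(x, y, h=1e-6):
        # Central-difference estimate of the gradient (an assumption of this sketch).
        return ((f(x + h, y) - f(x - h, y)) / (2 * h),
                (f(x, y + h) - f(x, y - h)) / (2 * h))

    x, y = center
    while alpha >= 0.99 * alpha_min:     # step sizes 16, 8, 4, 2 thousandths of a radius
        gx, gy = grad(x, y)
        norm = math.hypot(gx, gy)
        if norm == 0.0:
            break
        dx, dy = -alpha * gx / norm, -alpha * gy / norm
        while f(x + dx, y + dy) < f(x, y):   # keep stepping while f decreases
            x, y = x + dx, y + dy
        alpha /= 2.0                     # recompute the gradient with a halved step
    return x, y
```

When all assigned galaxies lie inside the disc, $f$ is a sum of squared distances minus a constant, so the procedure moves the disc to approximately the centroid of its galaxies.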

571:

572: \paragraph{Convergence and ``Polishing'':}

573: The outermost loop of the algorithm

574: does a binary search on the size of the uniform starting cover.

575: Within this loop, the inner loop

576: iteratively improves the given cover.

577:

578: We describe convergence of the inner loop first.

579: Recall that the inner loop starts with a given cover and improves it

580: until the desired number of galaxies are legally covered,

581: or until ``convergence'' occurs.  Convergence is determined as follows:

582: after each iteration, if the gap between the actual number of

583: galaxies covered and the desired number did not decrease by at least 5\%,

584: then the algorithm considers the process ``stuck''.

585: At this point it changes the basic improvement step

586: (this is the ``polishing'' heuristic mentioned in the high-level descriptions

587: of the algorithm) as follows:

588: it solves the relaxed problem {\em as if} the disc radius were 2\% smaller.

589: It continues with this heuristic until it also becomes stuck.

590: Every time the process becomes stuck, the algorithm alternates

591: between the standard improvement step and the modified one.

592: If the process is ever stuck for at least two consecutive rounds,

593: it is considered to have converged.

594: The purpose of the polishing heuristic

595: is that in the original relaxed problem,

596: a disc may be assigned galaxies

597: that are just barely outside of it at little penalty.

598: These galaxies cannot be legally assigned,

599: yet may ``hold'' discs in place in the subsequent disc-moving step.

600: ``Shrinking'' the effective radius of the disc for a few rounds

601: encourages these galaxies to be assigned elsewhere.
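The control logic of the inner loop, as we read the description above, can be sketched in Python as follows. The function {\tt improve} is a hypothetical caller-supplied routine that performs one improvement step (with the 2\% radius shrink when its argument is true) and returns the remaining gap between the desired and the actual number of legally covered galaxies; the iteration limit is an added safeguard, not part of the description.

```python
def run_until_converged(improve, initial_gap, min_gain=0.05, max_iters=1000):
    """Alternate standard and 'polishing' steps until covered or converged.

    Returns (status, history), where history lists (polish_flag, gap) per iteration.
    """
    gap, polish, stuck_streak = initial_gap, False, 0
    history = []
    for _ in range(max_iters):
        new_gap = improve(polish)
        history.append((polish, new_gap))
        if new_gap <= 0:
            return "covered", history
        if new_gap > (1.0 - min_gain) * gap:   # gap shrank by less than 5%: stuck
            stuck_streak += 1
            if stuck_streak >= 2:              # stuck in two consecutive rounds
                return "converged", history
            polish = not polish                # switch between the two step types
        else:
            stuck_streak = 0
        gap = new_gap
    return "iteration limit", history
```

For instance, if successive gaps are 80, 79, 60, 59, 59 starting from 100, the loop switches to polishing after the second round, back to the standard step after the fourth, and reports convergence after the fifth.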

602:

603: Next we describe initial conditions and the convergence criterion for the outer loop.

604: The outer loop maintains a lower bound $L$ and an upper bound $U$

605: on the minimum sufficient cover size.

606: It also maintains covers $C_L$ and $C_U$ obtained by starting with a uniform

607: cover of size $L$ or $U$ (respectively)

608: and applying the basic algorithm to improve the cover until the desired

609: coverage is obtained or convergence occurs.

610: Initially $L$ and $U$ are taken to be $1.05$ and $1.15$, respectively,

611: times the number of galaxies divided by the capacity per disc.

612: The binary search maintains the invariant that $C_L$ and $C_U$ are,

613: respectively, insufficient and sufficient to achieve the desired coverage.

614: If this invariant does not hold initially, $L$ and/or $U$ are adjusted in

615: increments of 5\% to achieve the invariant.

616: The algorithm halts the binary search as soon as

617: the following condition ceases to be met:

618: $C_U$ has more than one more disc than $C_L$,

619: $C_U$ is at least 0.5\% bigger than $C_L$,

620: and $C_U$ legally covers at least 0.5\% more galaxies than $C_L$.

621: Once the search halts, the algorithm returns $C_U$.

622:
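As a concrete illustration of the starting bounds, write $n$ for the number
of galaxies and $k$ for the capacity per disc (symbols introduced here only
for convenience; rounding to whole discs is our assumption).
The initial values are then
\[
  L_0 = 1.05\,\frac{n}{k}, \qquad U_0 = 1.15\,\frac{n}{k}.
\]
For example, with $n = 157126$ and $k = 600$ we have $n/k \approx 261.9$,
so $L_0 \approx 275$ and $U_0 \approx 301$ discs.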

623:

624: \section{Performance of the Algorithm}\label{sec:perf}

625: We tested the running time and the quality of the solutions

626: found by the algorithm on sample instances.

627: In this section we describe the results.

628:

629: The Survey will map roughly 25\% of the sky

630: --- the region having right ascension zero through $360$ degrees

631: and declination $30$ degrees through $90$ degrees.

632: Roughly one million galaxies will be mapped.

633: Because the two phases of the Survey will be pipelined

634: (the second will be started before the first is done),

635: the second phase will be done in pieces.

636:
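The 25\% figure can be checked directly.
The fraction of the sphere spanning a right-ascension range of
$\Delta\alpha$ degrees between declinations $\delta_1$ and $\delta_2$
(symbols introduced here for illustration) is
\[
  f = \frac{\Delta\alpha}{360}\cdot\frac{\sin\delta_2 - \sin\delta_1}{2},
\]
so for $\Delta\alpha = 360$, $\delta_1 = 30^\circ$ and $\delta_2 = 90^\circ$
we get $f = (1 - 1/2)/2 = 1/4$.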

637: \begin{figure}[tb]

638:   \begin{tabular}{cc}

639:     \begin{tabular}[b]{c}

640:       \begin{tabular}{|c||r|r|r|} \hline

641:         name & r.\ ascens. & declination & galaxies \\ \hline\hline

642:         b & 35 to 55 & -55 to -35 & 29933 \\\hline

643:         c & 32 to 57 & -57 to -32 & 45344 \\\hline

644:         d & 30 to 59 & -55 to -30 & 52520 \\\hline

645:         e & 28 to 62 & -57 to -28 & 70339 \\\hline

646:         f & 25 to 65 & -60 to -20 & 109681 \\\hline

647:         g & 20 to 70 & -70 to -18 & 157126 \\ \hline

648:       \end{tabular}

649:       ~\\

650:       ~\\

651:       ~\\

652:       ~\\

653:     \end{tabular}

654:     &

655:     \psfig{figure=cover_size.ps,width=3in,angle=270}

656:   \end{tabular}

657:   \caption{%

658:     Regions from which sample instances were generated;

659:     number of discs needed to achieve a $98\%$ coverage

660:     (normalized by capacity lower bound).

661:     }

662:   \label{fig:regions}

663:   \label{fig:performance}

664: \end{figure}

665: We generated the problem instances from data from a region of the sky

666: that had been previously scanned for a different purpose.

667: We selected 6 subregions, and for each subregion

668: we generated 4 problem instances by randomly sampling

669: 30, 50, 70 or 100\% of the galaxies.

670: This gave us 24 sample problems.

671: We took the disc radius to be 1.5 degrees

672: %\marginpar{check arc-seconds}

673: and the capacity to be 600

674: times 0.3, 0.5, 0.7, or 1 corresponding to the sampling percentage above.

675: (The base capacity is 600 instead of 660 because approximately 60 points in

676: each disc will be reserved for quasars not in the sample.)

677: The largest region has an area roughly 4\% of the entire sky.

678: For each subregion, the right ascension and declination ranges

679: and the number of galaxies are shown in Figure~\ref{fig:regions}.

680:
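As a rough check on these figures, assume the ranges in the table are in
degrees. A region spanning $\Delta\alpha$ degrees of right ascension between
declinations $\delta_1$ and $\delta_2$ covers a fraction
$(\Delta\alpha/360)(\sin\delta_2 - \sin\delta_1)/2$ of the sphere
(notation introduced here for illustration).
For region g this gives
\[
  \frac{50}{360}\cdot\frac{\sin(-18^\circ) - \sin(-70^\circ)}{2}
  \approx 0.139 \times 0.315 \approx 0.044,
\]
in line with the figure of roughly 4\%.
Likewise, the four capacities used are
$600 \times 0.3 = 180$, $600 \times 0.5 = 300$, $600 \times 0.7 = 420$,
and $600 \times 1 = 600$.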

681: \subsection{Quality of solutions. }

682: Figure~\ref{fig:performance} illustrates the quality of the solutions

683: returned by the algorithm on the 24 problem instances.

684: The figure plots the size of the cover needed

685: to assign 98\% of the galaxies in each region,

686: normalized by dividing by the number of discs needed just to provide

687: enough capacity to hold 98\% of the galaxies.

688: The plot also shows the same information for near-uniform covers.

689: The algorithm (very roughly) requires 5\% to 15\% extra capacity,

690: whereas using uniform covers requires 25\% to 35\% extra capacity.

691:
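To make the normalization explicit, let $|C|$ be the number of discs in the
cover found, $n$ the number of galaxies in the instance, and $k$ the capacity
per disc (notation ours, and ignoring any rounding to a whole number of discs).
The plotted quantity is then
\[
  \rho = \frac{|C|}{0.98\,n/k},
\]
so that $\rho = 1$ would mean that no capacity is wasted.
The ranges quoted above correspond, very roughly, to
$1.05 \le \rho \le 1.15$ for the algorithm
and $1.25 \le \rho \le 1.35$ for uniform covers.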

692: \subsection{Running time.  }

693: \label{newsec}

694: \begin{figure}[tb]

695:   \centerline{\psfig{figure=time.ps,width=3.3in,angle=270}

696:     \psfig{figure=time1.ps,width=3.3in,angle=270}}

697:   \caption{Net time per galaxy and main components;

698:     time per galaxy per iteration.

699:     Each vertical bar represents a group of points with close $x$-coordinates:

700:     the center of the bar is the average; the endpoints are one standard

701:     deviation away.

702:     }

703:   \label{fig:time}

704:   \label{fig:time_normalized}

705: \end{figure}

706:

707: Plots of the time per galaxy to solve each problem instance

708: as a function of the number of galaxies

709: appear in Figure~\ref{fig:time}.

710: This net time includes all of the iterations needed to find

711: the final cover for the given problem instance, including

712: the binary search  ``outer loop''.

713:

714: The three main components of the running time are

715: the time building the graphs

716: (including finding the equivalence classes of galaxies),

717: the time solving the flow problems,

718: and the time moving the discs.

719: These plots show that the net running time is on the order of $0.1$

720: CPU seconds per galaxy (roughly $850000$ galaxies per day),

721: with the three main components each taking a substantial fraction of the time.

722: These tests were carried out on a Silicon Graphics machine

723: with six 150~MHz processors,

724: a 16 Kbyte data cache,

725: a 1 Mbyte secondary cache,

726: and 256 Mbytes of main memory.

727:
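The throughput figure follows from simple arithmetic:
\[
  \frac{86400\ \mbox{s/day}}{0.1\ \mbox{s/galaxy}} = 864000\ \mbox{galaxies/day},
\]
which is consistent with the rounded value of $850000$.
If the same rate held for one million galaxies, the total would be about
$10^5$ CPU seconds, or roughly 28 CPU hours.
This is only an extrapolation: it assumes that the time per galaxy does not
grow with the problem size.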

728: Most of the variation in the time per galaxy is due to the number of

729: iterations, which varied from 30 to 100 per problem instance,

730: and which increases (in our implementation) with the problem size.

731: Figure~\ref{fig:time_normalized} plots the average time per galaxy per

732: iteration versus the problem size.

733: Each iteration represents the solution of one

734: relaxed problem and one perturbation of one set of discs.

735: The time per iteration grows linearly with the number of galaxies.

736: This is as expected, except for the

737: surprising speed of the flow computations.

738:

739: We note that the binary search is fairly naive,

740: given that in principle a fairly precise guess about the correct size

741: of the starting cover could be made.

742: Similarly, we feel that a more careful

743: and less conservative estimate of convergence,

744: possibly interleaving the two loops in some fashion,

745: might be warranted.

746: These improvements might reduce the total number of iterations substantially.

747:

748: \paragraph{Time to solve flow problems. }

749: The heuristics for keeping the flow problems small appear to be effective.

750: Figure~\ref{fig:flowsize} plots the average number of edges per galaxy

751: in each flow problem as a function of the sampling density of the instance.

752:

753: Figure~\ref{fig:flowtime} plots the average time per edge

754: to solve the individual flow problems.

755: The time appears to grow only near-linearly with the number of edges.

756: Better than worst-case behavior on certain classes of problems

757: is not uncommon;

758: further, the flow problems arising here are not particularly hard ones.

759: See \cite{DIMACS93} for computational studies related to this issue.

760: \begin{figure}[tb]

761:   \centerline{\psfig{figure=flow_size.ps,width=3.3in,angle=270}

762:     \psfig{figure=flow_time.ps,width=3.3in,angle=270}}

763:   \caption{Number of edges per galaxy vs.~density;

764:     time per edge vs.\ number of edges}

765:   \label{fig:flowsize}

766:   \label{fig:flowtime}

767: \end{figure}

768:

769: \section{Retrospective}

770: The Euclidean capacitated covering problem arising here is very natural.

771: Looking in the Operations Research literature, we found numerous

772: capacitated covering algorithms based on integer linear programming,

773: but these were not fast enough for problems of the desired size.

774: The Computer Science literature had a number of efficient approximation

775: algorithms for covering that had provable worst-case performance guarantees,

776: yet these algorithms would not produce good enough solutions in practice.

777:

778: Nonetheless, our final solution rests on theoretical foundations.

779: Our algorithm works in the spirit of Lagrangian relaxation.

780: We decompose the problem into two parts:

781: finding the cover $C$ and finding the assignment $A$.

782: We relax the constraints on $A$

783: by replacing the ``disc-containment'' constraint by a penalty function.

784: Then, for any given cover, finding the minimum-penalty assignment

785: is a tractable problem.

786: Likewise, for any given assignment,

787: finding the minimum-penalty cover is tractable.

788: Thus, the relaxation yields a scheme

789: that iteratively reduces the minimum penalty

790: and so drives the pair $(C,A)$ closer to feasibility.

791: Lagrangian relaxation is a common technique

792: in both the Operations Research \cite{Sridharan93}

793: and the Computer Science \cite{PlotkinST91} literature.

794:
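The monotone decrease of the penalty can be spelled out as follows.
Write $P(C,A)$ for the total penalty of a cover and assignment pair and $t$
for the iteration index (notation introduced here only for illustration).
One round then computes
\[
  A_{t+1} = \arg\min_A P(C_t, A), \qquad
  C_{t+1} = \arg\min_C P(C, A_{t+1}),
\]
and hence
\[
  P(C_{t+1}, A_{t+1}) \;\le\; P(C_t, A_{t+1}) \;\le\; P(C_t, A_t).
\]
This is an idealized sketch: it assumes that each subproblem is solved exactly,
and it says nothing about how quickly the penalty decreases
or whether it ever reaches zero.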

795: Finding the decomposition required an understanding of network flow theory.

796: The ability to solve large problems hinges

797: on a fast network flow algorithm.

798: Classic augmenting-path algorithms

799: are far too slow for our purpose.

800: Goldberg's algorithm incorporates both

801: recent research within the worst-case model

802: and heuristics discovered by empirical studies

803: (in the spirit of \cite{DIMACS93}).

804:

805: Also useful were Hardin, Sloane, and Smith's sphere covers

806: \cite{HardinSS94}.  These enabled us to start with better

807: uniform covers than we might have otherwise.

808: Finally, in prototyping and testing ideas, it helped

809: to have a pre-existing library of relevant high-level data types

810: and algorithms.  For this we used LEDA \cite{MehlhornN}.

811:

812: Worst-case analysis did side-track us slightly.

813: Although worst-case analysis suggested that network flow would be the

814: bottleneck for large problems, it was not at all.

815: Ironically, as described in \S~\ref{sec:impl},

816: our first attempt to keep the flow problems small

817: by using equivalence classes backfired:

818: our original implementation used a single hashing data structure

819: to hold all the equivalence classes;

820: although the standard worst-case model suggests hashing is quite fast,

821: its incautious use slowed the solutions of large problems by a factor of 50

822: due to the lack of locality of reference.

823:

824: In conclusion, our experience suggests that a successful approach

825: rests on theoretical understanding, but requires that this understanding be creatively adapted

826: to take advantage of the particular structure of our problem instances.

827:

828: \section{Acknowledgements}

829: Thanks to Ken Steiglitz for introducing two of the coauthors

830: and to an anonymous referee for helpful suggestions.

831:

832: \bibliographystyle{plain}

833: \bibliography{full,names,you,theory}

834:

835: \end{document}

836:

837: