
1: \documentclass[natbib]{svmult}

2:

3: \usepackage{makeidx}

4: \usepackage{graphicx}

5: \usepackage{multicol}

6:

7: \newcommand{\Paramesh}{{\sc{Paramesh}}~}

8: \newcommand{\FLASH}{{\sc{Flash}}~}

9:

10: \makeindex

11:

12:

13: \title*{Efficiency Gains from Time Refinement on

14:        AMR Meshes and Explicit Timestepping}

15: \titlerunning{Efficiency Gains from Time Refinement}

16:

17: \author{L.~J.~Dursi\inst{1} \and

18: M.~Zingale\inst{2}}

19:

20: \institute{Dept.\ of Astronomy \& Astrophysics, The University of Chicago, Chicago, IL  60637 ({\tt ljdursi@flash.uchicago.edu}) \and

21: Dept.\ of Astronomy \& Astrophysics, The University of California, Santa Cruz, Santa Cruz, CA 95064 ({\tt zingale@ucolick.org})}

22:

23: \begin{document}

24: \maketitle

25:

26: \begin{abstract}

27: Block-structured AMR meshes are often used in astrophysical fluid

28: simulations, where the geometry of the domain is simple.

29: We consider potential efficiency gains for time sub-cycling, or

30: time refinement (TR), on Berger-Colella and oct-tree AMR meshes for

31: explicit or local physics (such as explicit hydrodynamics), where the

32: work per block is roughly constant with level of refinement.   We note

33: that there are generally many more fine zones than there are

34: coarse zones.  We then quantify the natural result that any overall

35: efficiency gains from reducing the amount of work on the relatively few

36: coarse zones must necessarily be fairly small.  Potential efficiency

37: benefits from TR on these meshes are seen to be quite limited except

38: in the case of refining a small number of points on a large mesh ---

39: in this case, the benefit can be made arbitrarily large, albeit at the

40: expense of spatial refinement efficiency.

41: \end{abstract}

42:

43: % ============================================================================

44: %  Introduction

45: % ============================================================================

46:

47: \section{Introduction}

48:

49: \subsection{Block-Structured AMR}

50: Adaptive mesh refinement on rectangular grids (henceforth AMR) was

51: introduced in \cite{bergeroliger}, and improved for conservation laws in

52: \cite{bergercollela}, henceforth BC89.  In the patch-based meshes of the

53: sort described in BC89, the patches increase in resolution by a fixed even

54: integer factor $N$.  One can place a finer patch anywhere in the domain

55: of a `parent' patch of one fewer level of refinement.  A patch is not

56: required to have only a single parent, but must be completely contained

57: within patches of the next lowest level of refinement.  Note that these

58: meshes are non-conforming; the face of a zone in a parent patch will

59: abut $N$ faces in the child patch.  A final restriction in the nesting

60: of the meshes is that there must be at least one zone of the next lower

61: level refinement about the perimeter of a patch.

62:

63: Another mesh we will consider here is an oct-tree mesh (quad-tree in

64: 2d, binary tree in 1d), such as that implemented in the \Paramesh package

65: \cite{paramesh} used in the \FLASH code \cite{flashcode}.  This oct-tree

66: mesh is a more restrictive version of an $N=2$ patch-based mesh as

67: described in BC89.  If a block needs additional resolution somewhere

68: in its domain, the entire block is halved in each coordinate direction,

69: creating $2^d$ children, where $d$ is the dimensionality.  Leaf blocks

70: are defined to be those blocks with no children, and are thus at the

71: bottom of the tree --- they are the finest-resolved blocks in their

72: region of the domain.  Frequently, only leaf blocks are evolved to

73: compute the solution to the equations, since a refined parent block's

74: domain is completely spanned by its children.

75:

76: The only difference between the two meshing approaches of immediate

77: interest is the resulting different refinement patterns.  We will use

78: `patch' and `block' interchangeably in this paper.

79:

80: \subsection{Time Refinement}

81:

82: In BC89, the timestep set by the data on

83: the finest mesh is used to evolve that data, and data on the coarser

84: meshes is evolved at a multiple thereof so that there is a constant ratio at each

85: level $l$ of $\Delta t_l$ to $\Delta x_l$.   The assumption here is that there is

86: one roughly spatially constant characteristic speed throughout the entire domain,

87: so that the maximum allowable timestep at any given resolution is

88: directly proportional to the size of the mesh for any given block or

89: patch.   When coupled with

90: the assumption in structured AMR of some fixed jump in

91: refinement between levels, this makes for a very natural time evolution

92: algorithm, pictured in Figure~\ref{fig:tmrwcurve} for a mesh with

93: three different levels of refinement, with resolution jumps by constant

94: factors of $N$; shown is $N=2$.

95:

96: \begin{figure}[hH]

97: \begin{center}

98: \includegraphics[width=.9\textwidth]{TMR.eps}

99: \end{center}

100: \caption{A structured AMR mesh containing blocks at three

101:          different levels of refinement, showing the order of operations

102:          (far right) of an explicit time evolution algorithm.  The largest

103:          block is evolved at the system timestep, and smaller blocks are

104:          subcycled at smaller timesteps.  Between evolution at different

105:          levels of the mesh, time averaging and flux corrections must

106:          be done --- these are not shown here.}

107: \label{fig:tmrwcurve}

108: \end{figure}

109:

110: Here the largest blocks are evolved at some system timestep $dt$,

111: and smaller blocks are `subcycled' at proportionally smaller timesteps.

112: This defines a `work function' for each block; the finest blocks must be

113: evolved every sub-timestep so we take their work value to be 1 times the

114: number of zones in the block or patch;  the blocks one level of refinement

115: `up' need only be evolved every $N$ sub-timesteps, so that their work

116: value is $1/N$ times the number of zones, etc.  The work function for

117: an entire mesh is the sum of the work values of each block or patch in

118: the mesh.

119:

120: There are costs associated with this time refinement (hereafter TR).

121: Memory is needed to store information at multiple timesteps.

122: There are overheads from extra copies and time-centering of fluxes.

123: The modified time-structure of work leads to load-balance issues in

124: parallel jobs.  Further complicating parallel performance is increased

125: communication complexity (although, it should be noted, not

126: necessarily increased communication).

127:

128: Nonetheless, one might hope that these costs are outweighed by the time

129: savings of not evolving large blocks at unnecessarily small timesteps;

130: in the example of Figure~\ref{fig:tmrwcurve}, of evolving the larger

131: blocks at timesteps of $dt$ or $dt/2$ instead of $dt/4$.

132: As a first step to quantify the possible benefits, we

133: estimate the reduction in computational cost in simple cases in

134: \S\ref{sec:analysis}.  We then use the same approach to examine

135: meshes from simulations performed with a

136: tree-based mesh in \S{\ref{sec:data}}.  In

137: our final section we summarize our results.

138:

139:

140: \section{Simple Mesh Configurations}

141: \label{sec:analysis}

142:

143: \newcommand{\nblocks}{N_{\mathrm{blocks}}}

144: \newcommand{\Wtmr}{W_{\mathrm{TR}}}

145: \newcommand{\Wnontmr}{W_{\mathrm{noTR}}}

146:

147: Here we calculate both the number of evolved blocks in a simple mesh, and a

148: weighted sum representing the ideal amount of work done by a TR method,

149: using the work function described in the previous section.

150: We then calculate a work ratio, $R$ --- the amount of work that

151: would be done by the idealized TR divided by that done with no

152: time refinement.  With no time refinement, each block must be

153: stepped through each sub-timestep, so that the amount of work done

154: is simply the number of blocks; thus, the work ratio is simply

155: (TR~work~function)/(number~of~blocks).  For $R = 1$, there is

156: no reduction in work; for $R < 1$, TR reduces the amount of computational work.
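
As a compact restatement of the definitions above, one may write the work function and work ratio as follows.  The symbols $n_l$ (the number of evolved blocks, or zones for a patch-based mesh, at level $l$) and $L$ (the finest level, with $l=1$ the coarsest) are notation introduced here for illustration; $N$ is the refinement factor between levels.
\begin{equation}
\Wtmr = \sum_{l=1}^{L} n_l \, N^{-(L-l)}, \qquad
\Wnontmr = \sum_{l=1}^{L} n_l, \qquad
R = \frac{\Wtmr}{\Wnontmr} .
\end{equation}
For instance, a purely hypothetical $N=2$ mesh with $(n_1, n_2, n_3) = (1, 2, 4)$ would give $\Wtmr = 1/4 + 2/2 + 4 = 5.25$ and $\Wnontmr = 7$, so that $R = 0.75$.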

157:

158: One can interpret the work ratios as performance metrics for the TR,

159: assuming that: all physics benefits from the

160: time subcycling in proportion to the reduction in number of blocks evolved

161: each step; the memory overhead from TR is unimportant; all larger blocks

162: actually {\emph{can}} be evolved at timesteps of larger size in proportion

163: to their physical size; there is no single-processor overhead from TR

164: from memory copies or flux averaging; there is no parallel overhead from

165: increased complexity in communications; and there is no parallel overhead from

166: increased load-balancing issues.

167:

168: \subsection{Point refinement}

169: \label{subsec:pointrefine}

170:

171: The best case for efficiency gains for spatial

172: refinement is clearly one isolated point of refinement.   For a

173: patch-based mesh, we imagine refinements as shown on the left of

174: Figure~\ref{fig:meshes-point}.

175:

176: \begin{figure}[ht]

177: \begin{center}

178: \includegraphics[width=.2\textwidth]{point-bo.eps}

179: \includegraphics[width=.2\textwidth]{point.eps}

180: \hskip .3 in

181: \includegraphics[width=.2\textwidth]{line-bo.eps}

182: \includegraphics[width=.2\textwidth]{line.eps}

183: \end{center}

184: \caption{Fully refining a zero-thickness point

185:          with an idealized patch-based type mesh (far left) and an oct-tree

186:          mesh (left);  fully refining an interface with a patch-based mesh

187:          (right) and oct-tree mesh (far right).  For the patch-based mesh, it is assumed that

188:          a patch can be placed anywhere on existing patches, with some

189:          fixed integral increase in resolution (shown here is $N=4$, $L=3$).

190:          For the oct-tree mesh, $N$ is fixed at 2, and shown is $L = 5$.}

191: \label{fig:meshes-point}

192: \end{figure}

193:

194: We begin with domains of length one in all directions.  The

195: completely unrefined domain is defined to be at level $l=1$ of

196: refinement.  Consider placing increasingly fine patches

197: at the corner, until we resolve the finest scale $\Delta x$

198: desired.  If this requires $L-1$ more levels of refinement, each

199: decreasing the zone size by an integer factor $N$, then we have $\Delta

200: x \sim (1/N)^{L-1}$.  We will assume $\Delta x \ll 1/2$.

201:

202: We consider the mesh in terms of the smallest uniform unit --- for the

203: oct-tree mesh, this is a single block, which will be of size $n_x \times

204: n_y \times n_z$ zones.  For the patch-based mesh, since the patches

205: can be of arbitrary size (and shape), we consider zones individually.

206: (Because we are not modelling guardcell filling, we can safely ignore

207: the fact that these zones are actually components of patches).  Thus, in

208: the results given below, an oct-tree mesh with (say) $8 \times 8$-zone

209: blocks at a maximum refinement $L = 5$ has the same resolution as an

210: $N=2$ patch-based mesh with $L = 8$.
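
The correspondence can be written out explicitly.  As a sketch, assume the patch-based mesh has $N=2$ and let $n$ denote the number of zones per block edge in the oct-tree mesh (both $n$ and the subscripted level symbols are notation introduced here).  Equating the finest zone sizes,
\begin{equation}
\frac{1}{n} \left( \frac{1}{2} \right)^{L_{\mathrm{oct}} - 1} = \left( \frac{1}{2} \right)^{L_{\mathrm{patch}} - 1}
\quad \Longrightarrow \quad
L_{\mathrm{patch}} = L_{\mathrm{oct}} + \log_2 n ,
\end{equation}
so that $n = 8$ and $L_{\mathrm{oct}} = 5$ give $L_{\mathrm{patch}} = 8$.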

211:

212: The amount of work required by a non-TR code with only explicit or

213: local solves will, by assumption, be the same for each block, so that

214: $\Wnontmr = \nblocks$.  The amount of work with time

215: refinement, $\Wtmr$, will be a weighted sum of blocks.

216: For the pointwise-refined patch mesh, the number of blocks will simply

217: be $\nblocks = L$, as there is only one block per level.  The amount of

218: TR work is

219:

220: \begin{equation}

221: \Wtmr  =  \sum_{l=1}^{L} 1 \cdot \left ( \frac{1}{N} \right )^{L-l}  \sim  \frac{N}{N-1}.

222: \end{equation}

223:

224: Thus the work ratio will be

225: \begin{equation}

226: R = \frac{\Wtmr}{\Wnontmr} = \frac{\Wtmr}{\nblocks} =  \frac{N}{L(N-1)}.

227: \end{equation}
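
The asymptotic forms above follow from a finite geometric series.  As a sketch, keeping the $L$-dependent term that the $\sim$ above drops:
\begin{equation}
\Wtmr = \sum_{l=1}^{L} N^{-(L-l)} = \frac{N}{N-1}\left(1 - N^{-L}\right), \qquad
R = \frac{N}{L(N-1)}\left(1 - N^{-L}\right).
\end{equation}
For example, with $N=2$ and $L=5$ this gives $\Wtmr = 31/16 \approx 1.94$ and $R \approx 0.39$, compared with the asymptotic value $2/L = 0.4$.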

228:

229: For ideal spatial AMR, where one can do all the refinement with

230: only one jump, $L = 2$, and so the amount of work done by a TR

231: algorithm is bounded from below at $1/2$ of the non-TR work.   At the

232: other limit, for a much less aggressive AMR with $N=2$, then the work

233: can be made an arbitrarily small fraction of the non-TR algorithm,

234: with $R = 2/L$ --- but note that this work ratio is achieved only by

235: operating on $L/2$ times as many blocks as in the best case for spatial

236: AMR.

237:

238: The oct-tree mesh refining

239: on a point is shown second from the left in Figure~\ref{fig:meshes-point}.

240: In this case, there are $2^d$ blocks at the highest level of refinement in the corner,

241: surrounded by $2^d-1$ blocks at the next coarser

242: level, which are in turn surrounded by $2^d-1$ blocks at the next

243: coarser level, and so on.

244:

245: Thus the total number of leaf blocks is

246: \begin{equation}

247: \nblocks = (2^d) + \sum_{l=2}^{L-1} {\left ( 2^d-1 \right )} =  2^d (L - 1) - L + 2

248: \end{equation}

249:

250: Weighting them by the amount of work,

251: \begin{equation}

252: \Wtmr = (2^d) + \sum_{l=2}^{L-1} {\left ( \frac{(2^d-1)}{2^{(L-l)}} \right )}  \sim  2^{(d+1)}-1

253: \end{equation}

254: making the work ratio

255: \begin{equation}

256: R  =  \left \{ \begin{array}{cl} 3/L &  1d \\

257:                                    7/(3L-2) & 2d \\

258:                                    15/(7L-6) & 3d

259:                   \end{array}

260:         \right .

261: \end{equation}
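
The $L$ dependence dropped in the asymptotic work above can be kept explicitly.  A sketch, assuming (consistent with the closed-form block count above) that the evolved leaf blocks sit at levels $2$ through $L$:
\begin{equation}
\Wtmr = 2^d + \left(2^d - 1\right) \sum_{k=1}^{L-2} 2^{-k}
      = 2^{d+1} - 1 - \left(2^d - 1\right) 2^{-(L-2)} .
\end{equation}
In 2d with $L=3$, for example, $\Wtmr = 7 - 3/2 = 5.5$ and $\nblocks = 3L - 2 = 7$, so $R \approx 0.79$, somewhat below the asymptotic estimate $7/(3L-2) = 1$ at the same $L$.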

262:

263: As with the patch-based result, this ratio goes to zero for arbitrarily

264: large $L$.  These results are similar to the $N=2$ patch-based result, but

265: TR performs better here, and the spatial refinement worse ---  both  of

266: these are due to the fact that the oct-tree mesh generates more intermediate-level

267: blocks.

268:

269: \subsection{Planar Interface Refinement}

270: \label{subsec:planerefine}

271:

272: The refinement of an interface is shown on the right of

273: Figure~\ref{fig:meshes-point}.  In the patch-based case, we continually

274: place a grid of $N$-by-1 (in 2d) or $N^2$-by-1 (in 3d) patches along

275: the interface, until the required resolution is achieved.

276:

277: In this case, performing the same calculation as in the previous section, one

278: obtains

279: \begin{equation}

280: R   \approx 1 - \frac{N-1}{N^d - 1} .

281: \end{equation}
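
One way to arrive at this estimate, offered as a sketch: assume each level $l$ contributes a layer one zone thick along the interface, so that the number of zones at level $l$ is $n_l = N^{(d-1)(l-1)}$ with $d \ge 2$ (the symbol $n_l$ and this counting are assumptions made here, chosen to be consistent with the patches described above).  Then, for large $L$,
\begin{equation}
\Wtmr = \sum_{l=1}^{L} n_l N^{-(L-l)} \approx \frac{n_L}{1 - N^{-d}}, \qquad
\Wnontmr = \sum_{l=1}^{L} n_l \approx \frac{n_L}{1 - N^{-(d-1)}},
\end{equation}
\begin{equation}
R \approx \frac{1 - N^{-(d-1)}}{1 - N^{-d}} = \frac{N^d - N}{N^d - 1} = 1 - \frac{N-1}{N^d - 1} .
\end{equation}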

282:

283: Here, there is a fixed lower bound for the amount of work the TR can

284: achieve.  In the spatially-optimal large-$N$ limit, no work is saved

285: at all: $R \rightarrow 1$.   At the other limit, for $N=2$, in 2d,

286: $R \rightarrow 2/3$; in 3d, $R \rightarrow 6/7$.

287:

288: In the oct-tree mesh we begin with one block at

289: the coarsest level.  It must be divided into 4 in this 2D

290: example, or, in general, $2^d$.   Half of these blocks will be

291: further refined.  This continues until we reach the maximum

292: level of refinement.  The work ratio one finds is

293: \begin{equation}

294: R  =  \left \{ \begin{array}{cl} 7/9 & 2d \\ 45/49 & 3d \end{array} \right .

295: \end{equation}
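
A sketch of how these values arise, under the assumption stated above that half of the blocks at each level are refined further.  Level $l$ then holds $2^{(d-1)(l-1)+1}$ blocks for $l \ge 2$, of which half are leaves when $l < L$ and all are leaves when $l = L$.  Writing $n_L$ for the number of finest blocks (notation introduced here) and summing the geometric series for large $L$,
\begin{equation}
\nblocks \approx n_L \left( 1 + \frac{1}{2\,(2^{d-1} - 1)} \right), \qquad
\Wtmr \approx n_L \left( 1 + \frac{1}{2\,(2^{d} - 1)} \right).
\end{equation}
In 2d this gives $R \approx (7/6)/(3/2) = 7/9$, and in 3d $R \approx (15/14)/(7/6) = 45/49$.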

296:

297: In the point-refinement case of the previous subsection, a point of zero volume

298: needed to be refined; as a result, there were the same number of blocks at each

299: level, and thus a significant time savings could be obtained by doing less work

300: at the coarser blocks.  However, as we begin to see here, as soon as a non-trivial

301: volume of the mesh needs to be refined, there is significantly less savings to

302: be had.

303:

304: \subsection{Circular Region Refinement}

305:

306: \begin{figure}[ht]

307: \begin{center}

308: \includegraphics[width=.2\textwidth]{curve-region-bo.eps}

309: \hskip .3 in

310: \includegraphics[width=.2\textwidth]{curve-region-oct.eps}

311: \end{center}

312: \caption{Fully refining the interior of a circle, shown here with radius

313:          of $0.49$ of the

314:          box size, with an idealized patch-based type mesh (left) and

315:          an oct-tree mesh (right).   The patch-based mesh shown has $L

316:          = 3$ and $N = 4$.  For the oct-tree mesh, $N$ is fixed at 2,

317:          and shown is $L = 6$.}

318: \label{fig:meshes-curve-region}

319: \end{figure}

320:

321: The loss of efficiency gains when a non-zero fraction of the mesh must

322: be refined is even clearer when a region, rather than an interface,

323: is fully refined.   In Fig.~\ref{fig:meshes-curve-region} we see

324: the results of fully refining the interior of a quarter-circle with the center

325: at one of the corners of the domain.  Clearly, the finest

326: blocks greatly outnumber the intermediate or large blocks, so one might

327: guess that there is very little efficiency gain that can be had from

328: reducing work on the larger blocks.

329:

330: \begin{table}[ht]

331: \begin{center}

332: \begin{tabular}{lrrrrrrrr}

333: {} & L=2 & 3 & 4 & 5 & 6 & 7 & 8 \\

334: \hline

335: r = 0.0 &  0.786 & 0.625 & 0.510 & 0.426 & 0.363 & 0.316 & 0.279 \\

336:     0.1 &  0.786 & 0.625 & 0.510 & 0.510 & 0.638 & 0.765 & 0.879 \\

337:     0.2 &  0.786 & 0.625 & 0.625 & 0.714 & 0.806 & 0.895 & 0.940 \\

338:     0.5 &  0.962 & 0.843 & 0.851 & 0.888 & 0.931 & 0.963 & 0.981 \\

339: %    0.7 &  0.962 & 0.9   & 0.908 & 0.927 & 0.954 & 0.973 & 0.985 \\

340:     0.9 &  1.    & 0.973 & 0.962 & 0.962 & 0.973 & 0.982 & 0.989

341: \end{tabular}

342: \end{center}

343: \caption{Work ratio for a 2d oct-tree mesh with a circular region

344: of radius $r$ (in units of the domain) completely refined.}

345: \label{table:2d-oct-circregion}

346: \end{table}

347:

348: Because in this case the refinement pattern is complicated enough that

349: the process must be iterated to check that each zone's neighbors are

350: no further than one level of refinement apart, we do not provide

351: analytic work ratios.  Tables~\ref{table:2d-oct-circregion} and

352: \ref{table:2d-bo-circregion} show the work ratios for an oct-tree mesh

353: and an $N=2$ patch-based mesh in refining a circular region of radius $r$.

354: Again, the $r=0$ results reproduce the expected point refinement, but as

355: soon as a non-zero radius must be refined,  the efficiency gains drop

356: significantly further than in the case of only refining an interface,

357: as more small blocks are needed to refine a region than an interface.

358: In Table~\ref{table:2d-bo-circregion} we also show results for the patch

359: based mesh with $N=4$; we see as in previous sections that for the

360: same resolution, increasing $N$ (which increases the spatial efficiency

361: of AMR) decreases the possible gains from time subcycling.

362:

363: \begin{table}[ht]

364: \begin{center}

365: \begin{tabular}{lrrrrrrr||rrr}

366: {} & N=2, L= 2 & 3 & 4 & 5 & 6 & 7 & 8 & N=4, L=2 & 3 & 4 \\

367: \hline

368: r = 0.0 &      0.583 & 0.468& 0.387& 0.328& 0.283& 0.249& 0.221 & 0.438 & 0.332 & 0.510 \\

369:     0.1 &      0.583 & 0.468& 0.444& 0.552& 0.658& 0.754& 0.802 & 0.719 & 0.891 & 0.510\\

370:     0.2 &      0.583 & 0.548& 0.618& 0.694& 0.768& 0.806& 0.833 & 0.812 & 0.914 & 0.625\\

371:     0.5 &      0.75  & 0.737& 0.763& 0.798& 0.825& 0.840& 0.848 & 0.896 & 0.938 & 0.851\\

372: %    0.7 &      0.788 & 0.783& 0.797& 0.818& 0.835& 0.844& 0.851 & 0.926 & 0.942 & 0.908\\

373:     0.9 &      0.847 & 0.827& 0.826& 0.833& 0.842& 0.848& 0.852 & 0.938 & 0.947 & 0.962\\

374: \end{tabular}

375: \end{center}

376: \caption{Work ratio for a 2d patch-based mesh, $N=2$ and $N=4$, with a circular region

377: of radius $r$ (in units of the domain) completely refined.}

378: \label{table:2d-bo-circregion}

379: \end{table}

380:

381: \section{Meshes from simulations}

382: \label{sec:data}

383:

384: \newcommand{\ramr}{R_{\mathrm{AMR}}}

385:

386: The calculations of the previous section are for very simple refinement

387: geometries.  In this section, we apply the same work function used in

388: \S\ref{sec:analysis} to the output of previous actual AMR simulations

389: which use oct-tree based meshes for AMR.   We continue to assume the

390: same idealized performance results of the previous section.

391:

392: We begin with examining results from a standard test problem,

393: a Sedov explosion \cite{sedov}, as included with the \FLASH code and described

394: in \cite{flashcode}.  In this simulation, a high pressure at a point

395: causes a spherical shock wave to expand outwards; this is analogous

396: to the circular region analysis of the previous section.  The adaptive meshes for

397: different stages of this simulation in 2d are shown in Figure~\ref{fig:sedov}.

398:

399: \begin{figure}[hHt]

400: \begin{center}

401: \includegraphics[width=.6 \textwidth]{sedov8lev-high.eps}

402: \end{center}

403: \caption{The mesh of a Sedov explosion, from the \FLASH setup test described in

404: \cite{flashcode}, with a maximum of 8 levels of refinement.   Each

405: block shown contains $8 \times 8$ zones.}

406: \label{fig:sedov}

407: \end{figure}

408:

409: Results from the meshes shown are tabulated in

410: Table~\ref{tab:sedovresults}.   The number of blocks listed in the table

411: is the number of `leaf' blocks, {\emph{i.e.}}, the blocks that are

412: actually evolved.    Also given in the table is the work ratio ($R$)

413: and the work ratio of spatial AMR to a uniform mesh at the highest

414: resolution ($\ramr = \Wnontmr / W_{\mathrm{uniform}}$).  We include

415: $\ramr$ to compare the relative importance of performance gains for

416: the spatial refinement and the time subcycling.
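
As a worked illustration of $\ramr$, assume (this is an inference from the stated maximum of 8 levels of refinement, not something given explicitly) that the Sedov mesh starts from a single top-level block, so that a uniform 2d mesh at the finest resolution would contain $(2^{7})^2 = 16384$ blocks.  With the $256$ leaf blocks listed for the initial time in Table~\ref{tab:sedovresults},
\begin{equation}
\ramr = \frac{\Wnontmr}{W_{\mathrm{uniform}}} = \frac{256}{16384} \approx 0.0156 ,
\end{equation}
which agrees with the tabulated value.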

417:

418: TR provides a large performance gain initially, when there is only

419: one point that is refined.  However, consistent with previous results,

420: as soon as the point becomes a region of non-zero measure, idealized

421: performance gains drop to between roughly $13\%$ and $20\%$.  Regardless

422: of the refinement, the TR provides a very small performance enhancement

423: compared to that of the spatial refinement.

424:

425: \begin{table}[hHt]

426: \begin{center}

427: \begin{tabular}{c|rll}

428: time & $\nblocks$ & $R$ & $\ramr$ \\

429: \hline

430: 0.00 & 256 & 0.426 & 0.0156 \\

431: 0.01 & 892 & 0.805 & 0.0544 \\

432: 0.02 & 1552 & 0.835 & 0.0947 \\

433: 0.03 & 2092 & 0.874 & 0.127 \\

434: \end{tabular}

435: \end{center}

436: \caption{Results from simulations of a Sedov explosion.  Listed

437: at different evolution times are the number of leaf blocks in the mesh,

438: the work ratio, and the work ratio for spatial AMR to uniform grid.}

439: \label{tab:sedovresults}

440: \end{table}

441:

442: The reason for the small predicted efficiency gains, consistent with

443: the discussion of the previous section, is that there quickly become

444: more fine blocks than coarse blocks in the simulation.   By the last

445: frame shown in Figure~\ref{fig:sedov}, there are no blocks being evolved

446: at the coarsest level of refinement, and indeed 80\% of the blocks

447: are at the highest level of refinement.  Thus, even if all other blocks

448: required zero work to evolve, we could only achieve a $20\%$ speedup.
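
This bound can be stated generally.  Writing $f$ for the fraction of evolved blocks at the finest level (a symbol introduced here), and noting that every coarser block contributes a non-negative amount of work,
\begin{equation}
R = \frac{\Wtmr}{\nblocks} \ge \frac{f \, \nblocks}{\nblocks} = f .
\end{equation}
With $f \approx 0.8$, $R$ can be no smaller than about $0.8$, consistent with the value $R = 0.874$ in the last row of Table~\ref{tab:sedovresults}.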

449:

450: Next we consider an interface problem -- a 2d detonation that will

451: eventually undergo a cellular instability.  These simulations are from

452: results published in \cite{celldet2d}.   A mesh is shown in

453: Figure~\ref{fig:celldet}.  This

454: corresponds almost exactly to the idealized interface problem of the

455: previous section, but here the domain is very

456: long in one direction, increasing the number of low-cost coarsest

457: blocks in the domain.  This change in distribution of blocks means that

458: this problem can benefit more from TR.   The numerical results are

459: shown in Table~\ref{tab:celldet}.

460:

461: \begin{figure}[hHt]

462: \begin{center}

463: \includegraphics[angle=90,width=.7\textwidth]{celldet.eps}

464: \end{center}

465: \caption{Half of the domain for the initial condition of a detonation, where the

466:          long domain is refined nowhere except at a sharp interface.

467:          The domain originally consists of a top-level mesh of $1 \times

468:          20$ blocks.  This mesh is then refined at an interface.  Shown

469:          is the mesh with a maximum refinement level of 6, zoomed in near

470:          the interface.  Not shown are 10 coarsest blocks to the right.}

471: \label{fig:celldet}

472: \end{figure}

473:

474: \begin{table}[hHt]

475: \begin{center}

476: \begin{tabular}{c|rlll}

477: Max refinement & $\nblocks$  & $R$ & $\ramr$ \\

478: \hline

479: 4 & 62 & 0.633 & 0.0484 \\

480: 5 & 110 & 0.688 & 0.0215 \\

481: 6 & 206 & 0.727 & 0.0101 \\

482: 7 & 398 & 0.751 & 0.00486

483: \end{tabular}

484: \end{center}

485: \caption{Results from initial conditions for a 2-d detonation problem,

486:          as in Figure~\ref{fig:celldet}.  $R$ is less than

487:          the $7/9$ calculated in the previous section, because of the large number

488:          of extra coarsest  blocks added to the domain.}

489: \label{tab:celldet}

490: \end{table}

491:

492: Here we see TR's efficiency gains actually decrease with increasing

493: resolution, and also see a familiar pattern of TR's efficiency gains

494: going in the opposite direction of spatial AMR efficiency gains.

495: Even at the resolution where TR's efficiency gains are largest, they are

496: much smaller than the improvement from using spatial AMR.

497:

498:

499: \begin{figure}[hHt]

500: \begin{center}

501: \includegraphics[height=.625\textwidth,angle=90]{rt8_early-high.eps}

502: \includegraphics[height=.625\textwidth,angle=90]{rt8_mid-high.eps}

503: \includegraphics[height=.625\textwidth,angle=90]{rt8_late-high.eps}

504: \end{center}

505: \caption{Development of Rayleigh-Taylor instability at 3 epochs, from

506:   simulations presented in \cite{vandv}.   These are fairly high-resolution

507:   simulations, with a maximum of 8 levels of refinement on a top-level

508:   mesh with $6 \times 1$ coarsest blocks.}

509: \label{fig:rt}

510: \end{figure}

511:

512: We next consider the development of the Rayleigh-Taylor instability

513: (Figure~\ref{fig:rt}).  This is an interface problem, but in this

514: set of simulations, the center region of the box is fully refined to ensure

515: resolution of the velocity perturbations in the region near the interface.

516: Because this region is fully refined, many `full cost' finest

517: blocks are added.   This decreases the scope of improvement from TR,

518: as seen in Table~\ref{tab:rtresults}.

519:

520: \begin{table}[hHt]

521: \begin{center}

522: \begin{tabular}{c|rlll}

523: time & $\nblocks$ & $R$ & $\ramr$ \\

524: \hline

525: 0.0 & 33150 & 0.993 & 0.337 \\

526: 1.8 & 33150 & 0.993 & 0.337 \\

527: 3.6 & 60816 & 0.987 & 0.619

528: \end{tabular}

529: \end{center}

530: \caption{Numerical results from simulations of a Rayleigh-Taylor instability,

531:          shown in Figure~\ref{fig:rt}.}

532: \label{tab:rtresults}

533: \end{table}

534:

535: \section{Conclusion}

536:

537: We have considered efficiency gains for time subcycling for explicit or

538: local physics.  In these cases the work per block is roughly constant.

539: Further, in most cases there are many more fine blocks than coarse

540: blocks --- this is due to simple geometry, as a mesh that refines a

541: significant fraction of its domain will be strongly weighted in favour

542: of small blocks, which must be evolved at a small timestep.  Thus, any

543: attempt to improve performance by focusing on the relatively few larger

544: blocks can only reduce a small fraction of the work that needs to be

545: done to evolve the system one timestep.  On the other hand, in studies

546: where only a small number of points in a large domain must be fully

547: resolved, there may be significant efficiency gains from TR methods.

548: Some cosmological hydrodynamical simulations \cite{normanextreme} are

549: examples of this situation.

550:

551: We have not considered accuracy here; taking fewer timesteps may

552: increase accuracy with some solvers, although this is not clear for

553: moderately time-accurate algorithms having errors of $O({\Delta t}^p)$,

554: $p > 1$; further, the coarsely refined regions which would benefit from

555: the fewer timesteps are presumably coarsely refined because the overall

556: solution quality is less sensitive to the error in those regions than

557: it is to that of the highly refined parts of the domain.

558: We also do not consider global or implicit solves, where the

559: timestepping algorithm in Fig.~\ref{fig:tmrwcurve} must be modified.

560: Global or implicit solves will, depending on the methods used, change

561: the amount of work done per block at different levels of refinement,

562: which can change the results given here considerably.

563:

564: We have modelled only computational cost in this work.    Most of the

565: other costs, cf.~\S\ref{sec:analysis}, work to decrease the efficiency

566: gains of TR.   One unmodelled effect that could increase the gains is

567: the reduction of guardcell fills on large blocks.  For the oct-tree

568: mesh, where the number of zones per block is fixed, the

569: guardcell filling work is reduced in the same way as the computational

570: work, so that our conclusions are unchanged.   For the patch-based mesh,

571: the effect on the guardcell filling will be dependent on the shape of

572: the refined region and the algorithm used for merging patches of the

573: same refinement level, so that it is difficult to say anything in general.

574:

575: Thus, block-structured TR significantly enhances performance of local

576: or explicit physics solvers only under fairly narrow circumstances.

577: In circumstances where TR is unlikely to produce much performance

578: enhancement, the added code complexity, memory overhead, and parallel

579: load-balancing issues may make the costs of the technique exceed its

580: benefits.

581:

582:

583: The authors thank B. Fryxell for useful discussions about this paper,

584: and K. Olson for his help with \Paramesh over the past years.  We thank

585: A. Calder for data from RT simulations, and F. X.  Timmes for data

586: from cellular detonation simulations.  We thank T. Plewa, G. Weirs,

587: R. Kirby, and R. Loy for suggesting this work.  Support for this work

588: was provided by the Scientific Discovery through Advanced Computing

589: (SciDAC) program of the DOE, grant number DE-FC02-01ER41176 to the

590: Supernova Science Center/UCSC.   LJD was supported by the Department of

591: Energy Computational Science Graduate Fellowship Program of the Office

592: of Scientific Computing and Office of Defense Programs in the Department

593: of Energy under contract DE-FG02-97ER25308.

594:

595:

596: The \FLASH code is freely available at http://flash.uchicago.edu/.

597:

598:

599: \bibliographystyle{plain}

600: \bibliography{mesh}

601:

602:

603: \end{document}

604:

605: