astro-ph0310891/ms.tex
1: \documentclass[natbib]{svmult}
2: 
3: \usepackage{makeidx}
4: \usepackage{graphicx}
5: \usepackage{multicol}
6: 
7: \newcommand{\Paramesh}{{\sc{Paramesh}}~}
8: \newcommand{\FLASH}{{\sc{Flash}}~}
9: 
10: \makeindex 
11: 
12: 
13: \title*{Efficiency Gains from Time Refinement on 
14:        AMR Meshes and Explicit Timestepping}
15: \titlerunning{Efficiency Gains from Time Refinement}
16: 
17: \author{L.~J.~Dursi\inst{1} \and 
18: M.~Zingale\inst{2}}
19: 
20: \institute{Dept.\ of Astronomy \& Astrophysics, The University of Chicago, Chicago, IL  60637 ({\tt ljdursi@flash.uchicago.edu}) \and 
21: Dept.\ of Astronomy \& Astrophysics, The University of California, Santa Cruz, Santa Cruz, CA 95064 ({\tt zingale@ucolick.org})}
22: 
23: \begin{document}
24: \maketitle
25: 
26: \begin{abstract}
27: Block-structured AMR meshes are often used in astrophysical fluid
28: simulations, where the geometry of the domain is simple.
29: We consider potential efficiency gains for time sub-cycling, or
30: time refinement (TR), on Berger-Colella and oct-tree AMR meshes for
31: explicit or local physics (such as explicit hydrodynamics), where the
32: work per block is roughly constant with level of refinement.   We note
33: that there are generally many more fine zones than there are
34: coarse zones.  We then quantify the natural result that any overall
35: efficiency gains from reducing the amount of work on the relatively few
36: coarse zones must necessarily be fairly small.  Potential efficiency
37: benefits from TR on these meshes are seen to be quite limited except
38: in the case of refining a small number of points on a large mesh ---
39: in this case, the benefit can be made arbitrarily large, albeit at the
40: expense of spatial refinement efficiency.  
41: \end{abstract}
42: 
43: % ============================================================================
44: %  Introduction
45: % ============================================================================
46: 
47: \section{Introduction}
48: 
49: \subsection{Block-Structured AMR}
50: Adaptive mesh refinement on rectangular grids (henceforth AMR) was
51: introduced in \cite{bergeroliger}, and improved for conservation laws in
52: \cite{bergercollela}, henceforth BC89.  In the patch-based meshes of the
53: sort described in BC89, the patches increase in resolution by a fixed even
54: integer factor $N$.  One can place a finer patch anywhere in the domain
55: of a `parent' patch of one fewer level of refinement.  A patch is not
56: required to have only a single parent, but must be completely contained
57: within patches of the next lowest level of refinement.  Note that these
58: meshes are non-conforming; the face of a zone in a parent patch will
59: abut $N$ faces in the child patch.  A final restriction in the nesting
60: of the meshes is that there must be at least one zone of the next lower
61: level of refinement around the perimeter of a patch.
62: 
63: Another mesh we will consider here is an oct-tree mesh (quad-tree in
64: 2d, binary tree in 1d), such as is implemented in the \Paramesh package
65: \cite{paramesh} used in the \FLASH code \cite{flashcode}.  This oct-tree
66: mesh is a more restrictive version of an $N=2$ patch-based mesh as
67: described in BC89.  If a block needs additional resolution somewhere
68: in its domain, the entire block is halved in each coordinate direction,
69: creating $2^d$ children, where $d$ is the dimensionality.  Leaf blocks
70: are defined to be those blocks with no children, and are thus at the
71: bottom of the tree --- they are the finest-resolved blocks in their
72: region of the domain.  Frequently, only leaf blocks are evolved to
73: compute the solution to the equations, since a refined parent block's
74: domain is completely spanned by its children.
75: 
76: The only difference between the two meshing approaches of immediate
77: interest is the resulting different refinement patterns.  We will use
78: `patch' and `block' interchangeably in this paper.
79: 
80: \subsection{Time Refinement}
81: 
82: In BC89, the timestep set by the data on
83: the finest mesh is used to evolve that data, and data on the coarser
84: meshes is evolved at a multiple thereof so that there is a constant ratio at each
85: level $l$ of $\Delta t_l$ to $\Delta x_l$.   The assumption here is that there is
86: one roughly spatially constant characteristic speed throughout the entire domain,
87: so that the maximum allowable timestep at any given resolution is
88: directly proportional to the size of the mesh for any given block or
89: patch.   When coupled with
90: the assumption in structured AMR of some fixed jump in
91: refinement between levels, this makes for a very natural time evolution
92: algorithm, illustrated in Figure~\ref{fig:tmrwcurve} for a mesh with
93: three different levels of refinement, with resolution jumps by constant
94: factors of $N$; shown is $N=2$.
95: 
96: \begin{figure}[hH]
97: \begin{center}
98: \includegraphics[width=.9\textwidth]{TMR.eps}
99: \end{center}
100: \caption{A structured AMR mesh containing blocks at three
101:          different levels of refinement, showing the order of operations
102:          (far right) of an explicit time evolution algorithm.  The largest
103:          block is evolved at the system timestep, and smaller blocks are
104:          subcycled at smaller timesteps.  Between evolution at different
105:          levels of the mesh, time averaging and flux corrections must
106:          be done --- these are not shown here.}
107: \label{fig:tmrwcurve}
108: \end{figure}
109: 
110: Here the largest blocks are evolved at some system timestep $dt$,
111: and smaller blocks are `subcycled' at proportionally smaller timesteps.
112: This defines a `work function' for each block; the finest blocks must be
113: evolved every sub-timestep so we take their work value to be 1 times the
114: number of zones in the block or patch;  the blocks one level of refinement
115: `up' need only be evolved every $N$ sub-timesteps, so that their work
116: value is $1/N$ times the number of zones, etc.  The work function for
117: an entire mesh is the sum of the work values of each block or patch in
118: the mesh.
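
Written out, and assuming that block (or patch) $b$ sits at level $l_b$ and
contains $n_b$ zones, with $L$ the finest level present (symbols introduced
here for illustration only), the work function described above can be
sketched as
\begin{equation}
W_{\mathrm{TR}} = \sum_{b} n_b \left( \frac{1}{N} \right)^{L - l_b} .
\end{equation}
For instance, if a three-level mesh with $N=2$ is taken to have one block of
equal zone count $n$ at each level (an assumed block count, chosen only to
make the arithmetic simple), then $W_{\mathrm{TR}} = n\,(1 + 1/2 + 1/4) =
7n/4$, compared with $3n$ if every block were advanced at the finest
timestep.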
119: 
120: There are costs associated with this time refinement (hereafter TR).
121: Memory is needed to store information at multiple timesteps.
122: There are overheads from extra copies and time-centering of fluxes.
123: The modified time-structure of work leads to load-balance issues in
124: parallel jobs.  Further complicating parallel performance is increased
125: communication complexity (although not
126: necessarily increased communication).
127: 
128: Nonetheless, one might hope that these costs are outweighed by the time
129: savings of not evolving large blocks at unnecessarily small timesteps;
130: in the example of Figure~\ref{fig:tmrwcurve}, of evolving the larger
131: blocks at timesteps of $dt$ or $dt/2$ instead of $dt/4$.  
132: As a first step to quantify the possible benefits, we
133: estimate the reduction in computational cost in simple cases in
134: \S\ref{sec:analysis}.  We then use the same approach to examine
135: meshes from simulations performed with a
136: tree-based mesh in \S{\ref{sec:data}}.  In
137: our final section we summarize our results.
138: 
139: 
140: \section{Simple Mesh Configurations}
141: \label{sec:analysis}
142: 
143: \newcommand{\nblocks}{N_{\mathrm{blocks}}}
144: \newcommand{\Wtmr}{W_{\mathrm{TR}}}
145: \newcommand{\Wnontmr}{W_{\mathrm{noTR}}}
146: 
147: Here we calculate both the number of evolved blocks in a simple mesh, and a
148: weighted sum representing the ideal amount of work done by a TR method,
149: using the work function described in the previous section.
150: We then calculate a work ratio, $R$ --- the amount of work that
151: would be done by the idealized TR divided by that done with no
152: time refinement.  With no time refinement, each block must be
153: stepped through each sub-timestep, so that the amount of work done
154: is simply the number of blocks; thus, the work ratio is simply
155: (TR~work~function)/(number~of~blocks).  For $R = 1$, there is
156: no reduction in work; for $R < 1$, TR reduces the amount of computational work.
157: 
158: One can interpret the work ratios as performance metrics for TR,
159: assuming that: all physics benefits from the
160: time subcycling in proportion to the reduction in number of blocks evolved
161: each step; the memory overhead from TR is unimportant; all larger blocks
162: actually {\emph{can}} be evolved at timesteps of larger size in proportion
163: to their physical size; there is no single-processor overhead from TR
164: from memory copies or flux averaging; there is no parallel overhead from
165: increased complexity in communications; and there is no parallel overhead from
166: load imbalance.
167: 
168: \subsection{Point refinement}
169: \label{subsec:pointrefine}
170: 
171: The best case for efficiency gains from spatial
172: refinement is clearly one isolated point of refinement.   For a
173: patch-based mesh, we imagine refinements as shown on the far left of
174: Figure~\ref{fig:meshes-point}.  
175: 
176: \begin{figure}[ht]
177: \begin{center}
178: \includegraphics[width=.2\textwidth]{point-bo.eps}
179: \includegraphics[width=.2\textwidth]{point.eps}
180: \hskip .3 in
181: \includegraphics[width=.2\textwidth]{line-bo.eps}
182: \includegraphics[width=.2\textwidth]{line.eps}
183: \end{center}
184: \caption{Fully refining a zero-thickness point
185:          with an idealized patch-based mesh (far left) and an oct-tree
186:          mesh (left); fully refining an interface with a patch-based mesh
187:          (right) and oct-tree mesh (far right).  For the patch-based mesh, it is assumed that
188:          a patch can be placed anywhere on existing patches, with some
189:          fixed integral increase in resolution (shown here is $N=4$, $L=3$).
190:          For the oct-tree mesh, $N$ is fixed at 2, and shown is $L = 5$.}
191: \label{fig:meshes-point} 
192: \end{figure}
193: 
194: We begin with domains of length one in all directions.  The
195: completely unrefined domain is defined to be at level $l=1$ of
196: refinement.  Consider placing increasingly fine patches
197: at the corner until we resolve the desired finest scale $\Delta x$.
198: If this requires $L-1$ more levels of refinement, each
199: decreasing the zone size by an integer factor $N$, then we have $\Delta
200: x \sim (1/N)^{L-1}$.  We will assume $\Delta x \ll 1/2$.
201: 
202: We consider the mesh in terms of the smallest uniform unit --- for the
203: oct-tree mesh, this is a single block, which will be of size $n_x \times
204: n_y \times n_z$ zones.  For the patch-based mesh, since the patches
205: can be of arbitrary size (and shape), we consider zones individually.
206: (Because we are not modelling guardcell filling, we can safely ignore
207: the fact that these zones are actually components of patches.)  Thus, in
208: the results given below, an oct-tree mesh with (say) $8 \times 8$-zone
209: blocks at a maximum refinement $L = 5$ has the same resolution as an
210: $N=2$ patch-based mesh with $L = 8$.
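
As a check on this correspondence, and assuming the patch-based mesh in
question has $N=2$: a block at level $L$ of the oct-tree mesh has width
$2^{-(L-1)}$, so with $8$ zones across a block the zone size is
\begin{equation}
\Delta x_{\mathrm{oct}} = \frac{1}{8} \left( \frac{1}{2} \right)^{L-1}
                        = \left( \frac{1}{2} \right)^{L+2} ,
\end{equation}
which matches the zone size $(1/2)^{L'-1}$ of an $N=2$ mesh counted in
individual zones when $L' = L + 3$.  Here $L'$ is a label used only for this
comparison; $L = 5$ then gives $L' = 8$.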
211: 
212: The amount of work required by a non-TR code with only explicit or
213: local solves will, by assumption, be the same for each block, so that
214: $\Wnontmr = \nblocks$.  The amount of work with time
215: refinement, $\Wtmr$, will be a weighted sum of blocks.
216: For the pointwise-refined patch mesh, the number of blocks will simply
217: be $\nblocks = L$, as there is only one block per level.  The amount of
218: TR work is
219: 
220: \begin{equation}
221: \Wtmr  =  \sum_{l=1}^{L} 1 \cdot \left ( \frac{1}{N} \right )^{L-l}  \sim  \frac{N}{N-1}.
222: \end{equation}
223: 
224: Thus the work ratio will be
225: \begin{equation}
226: R = \frac{\Wtmr}{\Wnontmr} = \frac{\Wtmr}{\nblocks} \sim \frac{N}{L(N-1)}.
227: \end{equation}
228: 
229: For ideal spatial AMR, where one can do all the refinement with
230: only one jump, $L = 2$, and so the amount of work done by a TR
231: algorithm is bounded from below at $1/2$ of the non-TR work.   At the
232: other limit, for a much less aggressive AMR with $N=2$, the work
233: can be made an arbitrarily small fraction of that of the non-TR algorithm,
234: with $R = 2/L$; note, however, that this work ratio is achieved only by
235: operating on $L/2$ times as many blocks as in the best case for spatial
236: AMR.  
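
The asymptotic forms for the patch-based mesh follow from a finite geometric
series.  As a sketch, keeping the finite-$L$ correction that the asymptotic
estimate drops,
\begin{equation}
\sum_{l=1}^{L} \left( \frac{1}{N} \right)^{L-l}
   = \frac{N}{N-1} \left( 1 - N^{-L} \right) ,
\qquad
R = \frac{N}{L(N-1)} \left( 1 - N^{-L} \right) .
\end{equation}
For $N=2$ and $L=8$, for example, this gives $R = (2/8)(1 - 2^{-8}) \approx
0.249$, so the correction to $2/L = 0.25$ is already negligible.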
237: 
238: The oct-tree mesh refined
239: on a point is shown second from the left in Figure~\ref{fig:meshes-point}.
240: In this case, there are $2^d$ blocks at the highest refinement in the corner,
241: surrounded by $2^d-1$ blocks at the next highest
242: level of refinement, which are in turn surrounded by $2^d-1$ blocks at the
243: level above that, and so on.
244: 
245: Thus the total number of leaf blocks is 
246: \begin{equation}
247: \nblocks = (2^d) + \sum_{l=2}^{L-1} {\left ( 2^d-1 \right )} =  2^d (L - 1) - L + 2 .
248: \end{equation}
249: 
250: Weighting them by the amount of work,
251: \begin{equation}
252: \Wtmr = (2^d) + \sum_{l=2}^{L-1} {\left ( \frac{(2^d-1)}{2^{(L-l)}} \right )}  \sim  2^{(d+1)}-1
253: \end{equation}
254: making the work ratio
255: \begin{equation}
256: R  =  \left \{ \begin{array}{cl} 3/L &  1d \\
257:                                    7/(3L-2) & 2d \\
258:                                    15/(7L-6) & 3d 
259:                   \end{array}
260:         \right .
261: \end{equation}
262: 
263: As with the patch-based result, this ratio goes to zero for arbitrarily
264: large $L$.  These results are similar to the $N=2$ patch-based result, but
265: TR performs better here, and the spatial refinement worse ---  both  of
266: these are due to the fact that the oct-tree mesh generates more intermediate-level
267: blocks.
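
The oct-tree ratios for point refinement can be recovered by counting leaf
blocks level by level.  A sketch, assuming $L \ge 2$ with $2^d$ leaf blocks
at level $L$ and $2^d - 1$ at each of levels $2, \ldots, L-1$ (the single
level-1 block is refined and so is not a leaf):
\begin{equation}
\nblocks = 2^d + (L-2)(2^d - 1), \qquad
\Wtmr = 2^d + (2^d - 1) \left( 1 - 2^{-(L-2)} \right) .
\end{equation}
In 2d this gives $\nblocks = 3L - 2$ and $\Wtmr \rightarrow 7$ for large $L$;
at $L=5$, for example, $\Wtmr = 4 + 3\,(7/8) = 6.625$ and $R = 6.625/13
\approx 0.51$, against the asymptotic estimate $7/13 \approx 0.54$.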
268: 
269: \subsection{Planar Interface Refinement}
270: \label{subsec:planerefine}
271: 
272: The refinement of an interface is shown on the right of 
273: Figure~\ref{fig:meshes-point}.  In the patch-based case, we continually
274: place a grid of $N$-by-1 (in 2d) or $N^2$-by-1 (in 3d) patches along
275: the interface, until the required resolution is achieved.  
276: 
277: In this case, performing the same calculation as in the previous section, one
278: obtains
279: \begin{equation}
280: R   \approx 1 - \frac{N-1}{N^d - 1} .
281: \end{equation}
282: 
283: Here, there is a fixed lower bound for the amount of work the TR can
284: achieve.  In the spatially-optimal large-$N$ limit, no work is saved
285: at all: $R \rightarrow 1$.   At the other limit, for $N=2$, in 2d,
286: $R \rightarrow 2/3$; in 3d, $R \rightarrow 6/7$.
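
One way to see where this expression comes from, assuming $d \ge 2$ and that
the number of zones placed along the interface grows by a factor $N^{d-1}$
with each level while the work weight falls by $1/N$ (an idealization of the
construction above, with $k = L - l$ counting levels up from the finest):
both sums are geometric, so for large $L$
\begin{equation}
R \approx \frac{\sum_{k \ge 0} N^{-dk}}{\sum_{k \ge 0} N^{-(d-1)k}}
  = \frac{1 - N^{-(d-1)}}{1 - N^{-d}}
  = \frac{N^d - N}{N^d - 1}
  = 1 - \frac{N-1}{N^d - 1} .
\end{equation}
Setting $N=2$ recovers $2/3$ in 2d and $6/7$ in 3d.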
287: 
288: In the oct-tree mesh we begin with one block at
289: the coarsest level.  It must be divided into 4 in this 2d
290: example, or, in general, $2^d$.   Half of these blocks will be 
291: further refined.  This continues until we reach the maximum
292: level of refinement.  In the limit of large $L$, the work ratio one finds is
293: \begin{equation}
294: R  =  \left \{ \begin{array}{cl} 7/9 & 2d \\ 45/49 & 3d \end{array} \right .
295: \end{equation}
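
A sketch of the counting behind these values, assuming that of the $2^d$
children of each refined block half are refined again and half remain leaves.
Per refined block at level $L-1$ there are then $2^d$ leaves at level $L$
and $2^{d-1}\,2^{-(d-1)k}$ leaves at level $L-k$ for $k \ge 1$, each of the
latter carrying weight $2^{-k}$.  Summing the two geometric series in the
large-$L$ limit,
\begin{equation}
R \rightarrow \frac{2^d + 2^{d-1}/(2^d - 1)}{2^d + 2^{d-1}/(2^{d-1} - 1)} ,
\end{equation}
which evaluates to $(14/3)/6 = 7/9$ in 2d and $(60/7)/(28/3) = 45/49$ in 3d.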
296: 
297: In the point-refinement case of the previous subsection, a point of zero volume
298: needed to be refined; as a result, there were the same number of blocks at each
299: level, and thus a significant time savings could be obtained by doing less work
300: at the coarser blocks.  However, as we begin to see here, as soon as a non-trivial
301: volume of the mesh needs to be refined, there are significantly smaller savings to
302: be had.
303: 
304: \subsection{Circular Region Refinement}
305: 
306: \begin{figure}[ht]
307: \begin{center}
308: \includegraphics[width=.2\textwidth]{curve-region-bo.eps}
309: \hskip .3 in
310: \includegraphics[width=.2\textwidth]{curve-region-oct.eps}
311: \end{center}
312: \caption{Fully refining the interior of a circle, shown here with radius
313:          of $0.49$ of the
314:          box size, with an idealized patch-based mesh (left) and
315:          an oct-tree mesh (right).   The patch-based mesh shown has $L
316:          = 3$ and $N = 4$.  For the oct-tree mesh, $N$ is fixed at 2,
317:          and shown is $L = 6$.}
318: \label{fig:meshes-curve-region}
319: \end{figure}
320: 
321: The loss of efficiency gains when a non-zero fraction of the mesh must
322: be refined is even clearer when a region, rather than an interface,
323: is fully refined.   In Fig.~\ref{fig:meshes-curve-region} we see
324: the results of fully refining the interior of a quarter-circle with the center
325: at one of the corners of the domain.  Clearly, the finest
326: blocks greatly outnumber intermediate or large blocks, so one might
327: guess that there is very little efficiency gain that can be had from
328: reducing work on the larger blocks.
329: 
330: \begin{table}[ht]
331: \begin{center}
332: \begin{tabular}{lrrrrrrrr}
333: {} & $L=2$ & 3 & 4 & 5 & 6 & 7 & 8 \\
334: \hline
335: $r = 0.0$ &  0.786 & 0.625 & 0.510 & 0.426 & 0.363 & 0.316 & 0.279 \\
336:     0.1 &  0.786 & 0.625 & 0.510 & 0.510 & 0.638 & 0.765 & 0.879 \\
337:     0.2 &  0.786 & 0.625 & 0.625 & 0.714 & 0.806 & 0.895 & 0.940 \\
338:     0.5 &  0.962 & 0.843 & 0.851 & 0.888 & 0.931 & 0.963 & 0.981 \\ 
339: %    0.7 &  0.962 & 0.9   & 0.908 & 0.927 & 0.954 & 0.973 & 0.985 \\
340:     0.9 &  1.    & 0.973 & 0.962 & 0.962 & 0.973 & 0.982 & 0.989 
341: \end{tabular}
342: \end{center}
343: \caption{Work ratio for a 2d oct-tree mesh with a circular region
344: of radius $r$ (in units of the domain) completely refined.}
345: \label{table:2d-oct-circregion}
346: \end{table}
347: 
348: Because in this case the refinement pattern is complicated enough that
349: the process must be iterated to check that each zone's neighbors are
350: no further than one level of refinement apart, we do not provide
351: analytic work ratios.  Tables~\ref{table:2d-oct-circregion} and
352: \ref{table:2d-bo-circregion} show the work ratios for an oct-tree mesh
353: and an $N=2$ patch-based mesh in refining a circular region of radius $r$.
354: Again, the $r=0$ results reproduce the expected point refinement, but as
355: soon as a non-zero radius must be refined,  the efficiency gains drop
356: significantly further than in the case of only refining an interface,
357: as more small blocks are needed to refine a region than an interface.
358: In Table~\ref{table:2d-bo-circregion} we also show results for the
359: patch-based mesh with $N=4$; we see as in previous sections that for the
360: same resolution, increasing $N$ (which increases the spatial efficiency
361: of AMR) decreases the possible gains from time subcycling.
362: 
363: \begin{table}[ht]
364: \begin{center}
365: \begin{tabular}{lrrrrrrr||rrr}
366: {} & $N=2$, $L=2$ & 3 & 4 & 5 & 6 & 7 & 8 & $N=4$, $L=2$ & 3 & 4 \\
367: \hline
368: $r = 0.0$ &      0.583 & 0.468& 0.387& 0.328& 0.283& 0.249& 0.221 & 0.438 & 0.332 & 0.510 \\
369:     0.1 &      0.583 & 0.468& 0.444& 0.552& 0.658& 0.754& 0.802 & 0.719 & 0.891 & 0.510\\
370:     0.2 &      0.583 & 0.548& 0.618& 0.694& 0.768& 0.806& 0.833 & 0.812 & 0.914 & 0.625\\
371:     0.5 &      0.75  & 0.737& 0.763& 0.798& 0.825& 0.840& 0.848 & 0.896 & 0.938 & 0.851\\
372: %    0.7 &      0.788 & 0.783& 0.797& 0.818& 0.835& 0.844& 0.851 & 0.926 & 0.942 & 0.908\\
373:     0.9 &      0.847 & 0.827& 0.826& 0.833& 0.842& 0.848& 0.852 & 0.938 & 0.947 & 0.962\\
374: \end{tabular}
375: \end{center}
376: \caption{Work ratio for a 2d patch-based mesh, $N=2$ and $N=4$, with a circular region
377: of radius $r$ (in units of the domain) completely refined.}
378: \label{table:2d-bo-circregion}
379: \end{table}
380: 
381: \section{Meshes from simulations}
382: \label{sec:data}
383: 
384: \newcommand{\ramr}{R_{\mathrm{AMR}}}
385: 
386: The calculations of the previous section are for very simple refinement
387: geometries.  In this section, we apply the same work function used in
388: \S\ref{sec:analysis} to meshes output by actual AMR simulations
389: that use oct-tree meshes.   We continue to make the
390: same idealized performance assumptions as in the previous section.
391: 
392: We begin by examining results from a standard test problem,
393: a Sedov explosion \cite{sedov}, as included with the \FLASH code and described
394: in \cite{flashcode}.  In this simulation, a high pressure at a point
395: causes a spherical shock wave to expand outwards; this is analogous
396: to the circular region analysis of the previous section.  The adaptive mesh at
397: different stages of this simulation in 2d is shown in Figure~\ref{fig:sedov}.
398: 
399: \begin{figure}[hHt]
400: \begin{center}
401: \includegraphics[width=.6 \textwidth]{sedov8lev-high.eps}
402: \end{center}
403: \caption{The mesh of a Sedov explosion, from the \FLASH setup test described in
404: \cite{flashcode}, with a maximum of 8 levels of refinement.   Each
405: block shown contains $8 \times 8$ zones.}  
406: \label{fig:sedov}
407: \end{figure}
408: 
409: Results from the meshes shown are tabulated in
410: Table~\ref{tab:sedovresults}.   The number of blocks listed in the table
411: is the number of `leaf' blocks, {\emph {i.e.}}, the blocks that are
412: actually evolved.    Also given in the table is the work ratio ($R$)
413: and the work ratio of spatial AMR to a uniform mesh at the highest
414: resolution ($\ramr = \Wnontmr / W_{\mathrm{uniform}}$).  We include
415: $\ramr$ to compare the relative importance of performance gains for
416: the spatial refinement and the time subcycling.
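
For reference, the tabulated values of $\ramr$ are consistent with taking the
uniform mesh to be every top-level block refined to the maximum level.  As a
sketch, with $n_{\mathrm{top}}$ the number of top-level blocks and $L$ the
maximum level (symbols assumed here for illustration; this normalization is
inferred from the tabulated numbers rather than stated explicitly),
\begin{equation}
W_{\mathrm{uniform}} = n_{\mathrm{top}} \, 2^{d(L-1)} , \qquad
\ramr = \frac{\nblocks}{n_{\mathrm{top}} \, 2^{d(L-1)}} .
\end{equation}
For a 2d mesh with 8 levels and a single top-level block (an assumption that
reproduces the Sedov entries), $W_{\mathrm{uniform}} = 4^7 = 16384$, and 256
leaf blocks give $\ramr = 256/16384 \approx 0.0156$.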
417: 
418: TR provides a large performance gain initially, when there is only
419: one point that is refined.  However, consistent with previous results,
420: as soon as the point becomes a region of non-zero measure, idealized
421: performance gains drop to $30\%$--$10\%$.  Regardless
422: of the refinement, TR provides a very small performance enhancement
423: compared to that of the spatial refinement.
424: 
425: \begin{table}[hHt]
426: \begin{center}
427: \begin{tabular}{c|rll}
428: time & $\nblocks$ & $R$ & $\ramr$ \\
429: \hline
430: 0.00 & 256 & 0.426 & 0.0156 \\
431: 0.01 & 892 & 0.805 & 0.0544 \\
432: 0.02 & 1552 & 0.835 & 0.0947 \\
433: 0.03 & 2092 & 0.874 & 0.127 \\
434: \end{tabular}
435: \end{center}
436: \caption{Results from simulations of a Sedov explosion.  Listed
437: at different evolution times are the number of leaf blocks in the mesh,
438: the work ratio, and the work ratio for spatial AMR to uniform grid.}
439: \label{tab:sedovresults}
440: \end{table}
441: 
442: The reason for the small predicted efficiency gains, consistent with
443: the discussion of the previous section, is that fine blocks quickly come to
444: outnumber coarse blocks in the simulation.   By the last
445: frame shown in Figure~\ref{fig:sedov}, there are no blocks being evolved
446: at the coarsest level of refinement, and indeed 80\% of the blocks
447: are at the highest level of refinement.  Thus, even if all other blocks
448: required zero work to evolve, TR could reduce the work by at most $20\%$.
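
This bound can be stated compactly.  As a sketch, with $f_L$ denoting the
fraction of leaf blocks at the finest level (a symbol introduced here for
illustration), the finest blocks carry full weight and all other weights are
non-negative, so
\begin{equation}
R = \frac{\Wtmr}{\nblocks} \ge f_L .
\end{equation}
With $f_L = 0.8$ this gives $R \ge 0.8$, consistent with the values of $R$ in
Table~\ref{tab:sedovresults}, the largest of which is $0.874$.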
449: 
450: Next we consider an interface problem -- a 2d detonation that will
451: eventually undergo a cellular instability.  These simulations are from
452: results published in \cite{celldet2d}.   A mesh is shown in 
453: Figure~\ref{fig:celldet}.  This
454: corresponds almost exactly to the idealized interface problem of the
455: previous section, but here the domain is very
456: long in one direction, increasing the number of low-cost coarsest
457: blocks in the domain.  This change in distribution of blocks means that
458: this problem can benefit more from TR.   The numerical results are
459: shown in Table~\ref{tab:celldet}.
460: 
461: \begin{figure}[hHt]
462: \begin{center}
463: \includegraphics[angle=90,width=.7\textwidth]{celldet.eps}
464: \end{center}
465: \caption{Half of the domain for the initial condition of a detonation, where the
466:          long domain is refined nowhere except at a sharp interface.
467:          The domain originally consists of a top-level mesh of $1 \times
468:          20$ blocks.  This mesh is then refined at an interface.  Shown
469:          is the mesh with a maximum refinement of 6, zoomed in near
470:          the interface.  Not shown are 10 coarsest blocks to the right.}
471: \label{fig:celldet} 
472: \end{figure}
473: 
474: \begin{table}[hHt]
475: \begin{center}
476: \begin{tabular}{c|rlll}
477: Max refinement & $\nblocks$  & $R$ & $\ramr$ \\
478: \hline
479: 4 & 62 & 0.633 & 0.0484 \\
480: 5 & 110 & 0.688 & 0.0215 \\
481: 6 & 206 & 0.727 & 0.0101 \\
482: 7 & 398 & 0.751 & 0.00486
483: \end{tabular}
484: \end{center}
485: \caption{Results from initial conditions for a 2-d detonation problem,
486:          as in Figure~\ref{fig:celldet}.  $R$ is less than
487:          the $7/9$ calculated in the previous section, because of the large number
488:          of extra coarsest  blocks added to the domain.}
489: \label{tab:celldet}
490: \end{table}
491: 
492: Here we see TR's efficiency gains actually decrease with increasing
493: resolution, and also see a familiar pattern of TR's efficiency gains
494: going in the opposite direction of spatial AMR efficiency gains. 
495: Even at the resolution where TR's efficiency gains are largest, they are
496: much smaller than the improvement from using spatial AMR.
497: 
498: 
499: \begin{figure}[hHt]
500: \begin{center}
501: \includegraphics[height=.625\textwidth,angle=90]{rt8_early-high.eps}
502: \includegraphics[height=.625\textwidth,angle=90]{rt8_mid-high.eps}
503: \includegraphics[height=.625\textwidth,angle=90]{rt8_late-high.eps}
504: \end{center}
505: \caption{Development of Rayleigh-Taylor instability at 3 epochs, from
506:   simulations presented in \cite{vandv}.   These are fairly high-resolution
507:   simulations, with a maximum of 8 levels of refinement on a top-level
508:   mesh with $6 \times 1$ coarsest blocks.}
509: \label{fig:rt}
510: \end{figure}
511: 
512: We next consider the development of the Rayleigh-Taylor instability
513: (Figure~\ref{fig:rt}).  This is an interface problem, but in this
514: set of simulations, the center region of the box is resolved to ensure
515: resolution of the velocity perturbations in the region near the interface.
516: Because this region is fully refined, many `full cost' finest
517: blocks are added.   This decreases the scope of improvement from TR,
518: as seen in Table~\ref{tab:rtresults}.
519: 
520: \begin{table}[hHt]
521: \begin{center}
522: \begin{tabular}{c|rlll}
523: time & $\nblocks$ & $R$ & $\ramr$ \\
524: \hline
525: 0.0 & 33150 & 0.993 & 0.337 \\
526: 1.8 & 33150 & 0.993 & 0.337 \\
527: 3.6 & 60816 & 0.987 & 0.619 
528: \end{tabular}
529: \end{center}
530: \caption{Numerical results from simulations of a Rayleigh-Taylor instability,
531:          shown in Figure~\ref{fig:rt}.}
532: \label{tab:rtresults}
533: \end{table}
534: 
535: \section{Conclusion}
536: 
537: We have considered efficiency gains for time subcycling for explicit or
538: local physics.  In these cases the work per block is roughly constant.
539: Further, in most cases there are many more fine blocks than coarse
540: blocks --- this is due to simple geometry, as a mesh that refines a
541: significant fraction of its domain will be strongly weighted in favour
542: of small blocks, which must be evolved at a small timestep.  Thus, any
543: attempt to improve performance by focusing on the relatively few larger
544: blocks can only reduce a small fraction of the work that needs to be
545: done to evolve the system one timestep.  On the other hand, in studies
546: where only a small number of points in a large domain must be fully
547: resolved, there may be significant efficiency gains from TR methods.
548: Some cosmological hydrodynamical simulations \cite{normanextreme} are
549: examples of this situation.
550: 
551: We have not considered accuracy here; taking fewer timesteps may
552: increase accuracy with some solvers, although this isn't clear for
553: moderately time-accurate algorithms having errors of $O({\Delta t}^p)$,
554: $p > 1$; further, the coarsely refined regions which would benefit from
555: the fewer timesteps are presumably coarsely refined because the overall
556: solution quality is less sensitive to the error in those regions than
557: it is to that of the highly refined parts of the domain.
558: We also do not consider global or implicit solves, where the
559: timestepping algorithm in Fig.~\ref{fig:tmrwcurve} must be modified.
560: Global or implicit solves will, depending on the methods used, change
561: the amount of work done per block at different levels of refinement,
562: which can change the results given here considerably.
563: 
564: We have modelled only computational cost in this work.    Most of the
565: other costs, cf.~\S\ref{sec:analysis}, work to decrease the efficiency
566: gains of TR.   One unmodelled effect that could increase the gains is
567: the reduction of guardcell fills on large blocks.  For the oct-tree
568: mesh, where the number of zones per block is fixed, the
569: guardcell filling work is reduced in the same way as the computational
570: work, so that our conclusions are unchanged.   For the patch-based mesh,
571: the effect on the guardcell filling will be dependent on the shape of
572: the refined region and the algorithm used for merging patches of the
573: same refinement level, so that it is difficult to say anything in general.
574: 
575: Thus, block-structured TR significantly enhances performance of local
576: or explicit physics solvers only under fairly narrow circumstances.
577: In circumstances where TR is unlikely to produce much performance
578: enhancement, the added code complexity, memory overhead, and parallel
579: load-balancing issues may make the costs of the technique exceed its
580: benefits.
581: 
582: 
583: The authors thank B. Fryxell for useful discussions about this paper,
584: and K. Olson for his help with \Paramesh over the past years.  We thank
585: A. Calder for data from RT simulations, and F. X.  Timmes for data
586: from cellular detonation simulations.  We thank T. Plewa, G. Weirs,
587: R. Kirby, and R. Loy for suggesting this work.  Support for this work
588: was provided by the Scientific Discovery through Advanced Computing
589: (SciDAC) program of the DOE, grant number DE-FC02-01ER41176 to the
590: Supernova Science Center/UCSC.   LJD was supported by the Department of
591: Energy Computational Science Graduate Fellowship Program of the Office
592: of Scientific Computing and Office of Defense Programs in the Department
593: of Energy under contract DE-FG02-97ER25308.
594: 
595:  
596: The \FLASH code is freely available at http://flash.uchicago.edu/.
597: 
598: 
599: \bibliographystyle{plain}
600: \bibliography{mesh}
601: 
602: 
603: \end{document}
604: 
605: