1: \documentclass[natbib]{svmult}
2:
3: \usepackage{makeidx}
4: \usepackage{graphicx}
5: \usepackage{multicol}
6:
7: \newcommand{\Paramesh}{{\sc{Paramesh}}~}
8: \newcommand{\FLASH}{{\sc{Flash}}~}
9:
10: \makeindex
11:
12:
13: \title*{Efficiency Gains from Time Refinement on
14: AMR Meshes and Explicit Timestepping}
15: \titlerunning{Efficiency Gains from Time Refinement}
16:
17: \author{L.~J.~Dursi\inst{1} \and
18: M.~Zingale\inst{2}}
19:
20: \institute{Dept.\ of Astronomy \& Astrophysics, The University of Chicago, Chicago, IL 60637 ({\tt ljdursi@flash.uchicago.edu}) \and
21: Dept.\ of Astronomy \& Astrophysics, The University of California, Santa Cruz, Santa Cruz, CA 95064 ({\tt zingale@ucolick.org})}
22:
23: \begin{document}
24: \maketitle
25:
26: \begin{abstract}
27: Block-structured AMR meshes are often used in astrophysical fluid
28: simulations, where the geometry of the domain is simple.
29: We consider potential efficiency gains for time sub-cycling, or
time refinement (TR), on Berger-Colella and oct-tree AMR meshes for
explicit or local physics (such as explicit hydrodynamics), where the
32: work per block is roughly constant with level of refinement. We note
33: that there are generally many more fine zones than there are
34: coarse zones. We then quantify the natural result that any overall
35: efficiency gains from reducing the amount of work on the relatively few
36: coarse zones must necessarily be fairly small. Potential efficiency
37: benefits from TR on these meshes are seen to be quite limited except
38: in the case of refining a small number of points on a large mesh ---
39: in this case, the benefit can be made arbitrarily large, albeit at the
40: expense of spatial refinement efficiency.
41: \end{abstract}
42:
43: % ============================================================================
44: % Introduction
45: % ============================================================================
46:
47: \section{Introduction}
48:
49: \subsection{Block-Structured AMR}
50: Adaptive mesh refinement on rectangular grids (henceforth AMR) was
51: introduced in \cite{bergeroliger}, and improved for conservation laws in
52: \cite{bergercollela}, henceforth BC89. In the patch-based meshes of the
53: sort described in BC89, the patches increase in resolution by a fixed even
54: integer factor $N$. One can place a finer patch anywhere in the domain
55: of a `parent' patch of one fewer level of refinement. A patch is not
56: required to have only a single parent, but must be completely contained
57: within patches of the next lowest level of refinement. Note that these
58: meshes are non-conforming; the face of a zone in a parent patch will
59: abut $N$ faces in the child patch. A final restriction in the nesting
60: of the meshes is that there must be at least one zone of the next lower
61: level refinement about the perimeter of a patch.
62:
63: Another mesh we will consider here is an oct-tree mesh (quad-tree in
64: 2-d, binary tree in 1d), such as is implemented in the \Paramesh package
65: \cite{paramesh} used in the \FLASH code \cite{flashcode}. This oct-tree
66: mesh is a more restrictive version of an $N=2$ patch-based mesh as
67: described in BC89. If a block needs additional resolution somewhere
68: in its domain, the entire block is halved in each coordinate direction,
69: creating $2^d$ children, where $d$ is the dimensionality. Leaf blocks
70: are defined to be those blocks with no children, and are thus at the
71: bottom of the tree --- they are the finest-resolved blocks in their
72: region of the domain. Frequently, only leaf blocks are evolved to
73: compute the solution to the equations, since a refined parent block's
74: domain is completely spanned by its children.
75:
76: The only difference between the two meshing approaches of immediate
77: interest is the resulting different refinement patterns. We will use
78: `patch' and `block' interchangeably in this paper.
79:
80: \subsection{Time Refinement}
81:
82: In BC89, the timestep set by the data on
83: the finest mesh is used to evolve that data, and data on the coarser
84: meshes is evolved at a multiple thereof so that there is a constant ratio at each
85: level $l$ of $\Delta t_l$ to $\Delta x_l$. The assumption here is that there is
86: one roughly spatially constant characteristic speed throughout the entire domain,
87: so that the maximum allowable timestep at any given resolution is
88: directly proportional to the size of the mesh for any given block or
89: patch. When coupled with
90: the assumption in structured AMR of some fixed jump in
91: refinement between levels, this makes for a very natural time evolution
algorithm, pictured in Figure~\ref{fig:tmrwcurve} for a mesh with
93: three different levels of refinement, with resolution jumps by constant
94: factors of $N$; shown is $N=2$.
95:
96: \begin{figure}[hH]
97: \begin{center}
98: \includegraphics[width=.9\textwidth]{TMR.eps}
99: \end{center}
100: \caption{A structured AMR mesh containing blocks at three
101: different levels of refinement, showing the order of operations
102: (far right) of an explicit time evolution algorithm. The largest
103: block is evolved at the system timestep, and smaller blocks are
104: subcycled at smaller timesteps. Between evolution at different
105: levels of the mesh, time averaging and flux corrections must
106: be done --- these are not shown here.}
107: \label{fig:tmrwcurve}
108: \end{figure}
109:
110: Here the largest blocks are evolved at some system timestep $dt$,
111: and smaller blocks are `subcycled' at proportionally smaller timesteps.
112: This defines a `work function' for each block; the finest blocks must be
113: evolved every sub-timestep so we take their work value to be 1 times the
114: number of zones in the block or patch; the blocks one level of refinement
115: `up' need only be evolved every $N$ sub-timesteps, so that their work
116: value is $1/N$ times the number of zones, etc. The work function for
117: an entire mesh is the sum of the work values of each block or patch in
118: the mesh.
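
The work function just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name {\tt work\_function}, its argument layout, and the example block counts are hypothetical and are not taken from Figure~\ref{fig:tmrwcurve}; all blocks are assumed to hold the same number of zones, so work is counted in units of blocks.

```python
from fractions import Fraction


def work_function(blocks_per_level, n=2):
    """Return (W_TR, W_noTR, R) for a mesh described by blocks per level.

    blocks_per_level[0] is the coarsest evolved level and
    blocks_per_level[-1] the finest; n is the refinement factor N.
    All blocks are assumed to hold the same number of zones, so
    work is counted in units of blocks.
    """
    if n < 2:
        raise ValueError("refinement factor must be at least 2")
    if sum(blocks_per_level) == 0:
        raise ValueError("mesh must contain at least one evolved block")
    finest = len(blocks_per_level) - 1
    # A block k levels above the finest is evolved once every n**k sub-steps.
    w_tr = sum(Fraction(count, n ** (finest - i))
               for i, count in enumerate(blocks_per_level))
    w_no_tr = sum(blocks_per_level)  # every block evolved every sub-step
    return w_tr, w_no_tr, w_tr / w_no_tr


# Hypothetical mesh: 1 coarse, 2 intermediate and 4 fine blocks, N = 2.
w_tr, w_no_tr, ratio = work_function([1, 2, 4], n=2)
print(w_tr, w_no_tr, float(ratio))  # 21/4 7 0.75
```

Exact rational arithmetic is used so that small meshes can be checked by hand; floating point would serve equally well for large block counts.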
119:
120: There are costs associated with this time refinement (hereafter TR).
121: Memory is needed to store information at multiple timesteps.
122: There are overheads from extra copies and time-centering of fluxes.
123: The modified time-structure of work leads to load-balance issues in
124: parallel jobs. Further complicating parallel performance is increased
125: communication complexity (although, it is to be pointed out, not
126: necessarily increased communication).
127:
128: Nonetheless, one might hope that these costs are outweighed by the time
129: savings of not evolving large blocks at unnecessarily small timesteps;
130: in the example of Figure~\ref{fig:tmrwcurve}, of evolving the larger
131: blocks at timesteps of $dt$ or $dt/2$ instead of $dt/4$.
132: As a first step to quantify the possible benefits, we
estimate the reduction in computational cost in simple cases in
\S\ref{sec:analysis}. We then use the same approach to examine
135: meshes from simulations performed with a
136: tree-based mesh in \S{\ref{sec:data}}. In
137: our final section we summarize our results.
138:
139:
140: \section{Simple Mesh Configurations}
141: \label{sec:analysis}
142:
143: \newcommand{\nblocks}{N_{\mathrm{blocks}}}
144: \newcommand{\Wtmr}{W_{\mathrm{TR}}}
145: \newcommand{\Wnontmr}{W_{\mathrm{noTR}}}
146:
147: Here we calculate both the number of evolved blocks in a simple mesh, and a
148: weighted sum representing the ideal amount of work done by a TR method,
149: using the work function described in the previous section.
150: We then calculate a work ratio, $R$ --- the amount of work that
151: would be done by the idealized TR divided by that done with no
152: time refinement. With no time refinement, each block must be
153: stepped through each sub-timestep, so that the amount of work done
154: is simply the number of blocks; thus, the work ratio is simply
155: (TR~work~function)/(number~of~blocks). For $R = 1$, there is
156: no reduction in work; for $R < 1$, TR reduces the amount of computational work.
157:
One can interpret the work ratios as performance metrics for TR,
assuming that: all physics benefits from the
time subcycling in proportion to the reduction in number of blocks evolved
each step; the memory overhead from TR is unimportant; all larger blocks
actually {\emph{can}} be evolved at timesteps of larger size in proportion
to their physical size; there is no single-processor overhead from TR
from memory copies or flux averaging; there is no parallel overhead from
increased complexity in communications; and there is no parallel overhead from
load-balancing issues.
167:
168: \subsection{Point refinement}
169: \label{subsec:pointrefine}
170:
The best case for efficiency gains from spatial
refinement is clearly one isolated point of refinement. For a
patch-based mesh, we imagine refinements as shown on the far left of
Figure~\ref{fig:meshes-point}.
175:
176: \begin{figure}[ht]
177: \begin{center}
178: \includegraphics[width=.2\textwidth]{point-bo.eps}
179: \includegraphics[width=.2\textwidth]{point.eps}
180: \hskip .3 in
181: \includegraphics[width=.2\textwidth]{line-bo.eps}
182: \includegraphics[width=.2\textwidth]{line.eps}
183: \end{center}
184: \caption{Fully refining a zero-thickness point
185: with an idealized patch-based type mesh (far left) and an oct-tree
mesh (left); fully refining an interface with a patch-based mesh
187: (right) and oct-tree mesh (far right). For the patch-based mesh, it is assumed that
188: a patch can be placed anywhere on existing patches, with some
189: fixed integral increase in resolution (shown here is $N=4$, $L=3$).
190: For the oct-tree mesh, $N$ is fixed at 2, and shown is $L = 5$.}
191: \label{fig:meshes-point}
192: \end{figure}
193:
194: We begin with domains of length one in all directions. The
195: completely unrefined domain is defined to be at level $l=1$ of
refinement. Consider placing increasingly fine patches
at the corner until we resolve the finest desired scale $\Delta x$.
If this requires $L-1$ more levels of refinement, each
199: decreasing the zone size by an integer factor $N$, then we have $\Delta
200: x \sim (1/N)^{L-1}$. We will assume $\Delta x \ll 1/2$.
201:
202: We consider the mesh in terms of the smallest uniform unit --- for the
203: oct-tree mesh, this is a single block, which will be of size $n_x \times
204: n_y \times n_z$ zones. For the patch-based mesh, since the patches
205: can be of arbitrary size (and shape), we consider zones individually.
206: (Because we are not modelling guardcell filling, we can safely ignore
207: the fact that these zones are actually components of patches). Thus, in
208: the results given below, an oct-tree mesh with (say) $8 \times 8$-zone
209: blocks at a maximum refinement $L = 5$ has the same resolution as a
210: patch-based mesh with $L = 8$.
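
As a worked check of this equivalence (a sketch, assuming $N=2$ for both meshes and 8 zones per block in each direction; the subscripted symbols are introduced here only for illustration), a leaf block at oct-tree level $L_{\mathrm{oct}}$ has width $2^{-(L_{\mathrm{oct}}-1)}$, so
\[
\Delta x_{\mathrm{oct}} = \frac{2^{-(L_{\mathrm{oct}}-1)}}{8} = 2^{-(L_{\mathrm{oct}}+2)},
\qquad
\Delta x_{\mathrm{patch}} = 2^{-(L_{\mathrm{patch}}-1)} .
\]
Equating the two gives $L_{\mathrm{patch}} = L_{\mathrm{oct}} + 3$, that is, $L_{\mathrm{patch}} = 8$ for $L_{\mathrm{oct}} = 5$.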
211:
212: The amount of work required by a non-TR code with only explicit or
213: local solves will, by assumption, be the same for each block, so that
214: $\Wnontmr = \nblocks$. The amount of work with time
215: refinement, $\Wtmr$, will be a weighted sum of blocks.
216: For the pointwise-refined patch mesh, the number of blocks will simply
217: be $\nblocks = L$, as there is only one block per level. The amount of
218: TR work is
219:
220: \begin{equation}
221: \Wtmr = \sum_{l=1}^{L} 1 \cdot \left ( \frac{1}{N} \right )^{L-l} \sim \frac{N}{N-1}.
222: \end{equation}
223:
224: Thus the work ratio will be
225: \begin{equation}
226: R = \frac{\Wtmr}{\Wnontmr} = \frac{\Wtmr}{\nblocks} = \frac{N}{L(N-1)}.
227: \end{equation}
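For finite $L$ the geometric sum can be evaluated exactly, which shows how quickly the limit $N/(N-1)$ is approached:
\[
\Wtmr = \sum_{l=1}^{L} N^{-(L-l)} = \frac{N}{N-1} \left ( 1 - N^{-L} \right ),
\qquad
R = \frac{N \left ( 1 - N^{-L} \right )}{L (N-1)} .
\]
As a worked example, $N=2$ and $L=3$ give $\Wtmr = 1/4 + 1/2 + 1 = 7/4$ and $R = 7/12 \approx 0.583$, compared with the asymptotic estimate $2/L = 2/3$.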
228:
229: For ideal spatial AMR, where one can do all the refinement with
230: only one jump, $L = 2$, and so the amount of work done by a TR
231: algorithm is bounded from below at $1/2$ of the non-TR work. At the
other limit, for a much less aggressive AMR with $N=2$, the work
233: can be made an arbitrarily small fraction of the non-TR algorithm,
234: with $R = 2/L$ --- but note that this work ratio is achieved only by
235: operating on $L/2$ times as many blocks as in the best case for spatial
236: AMR.
237:
The oct-tree mesh refining
on a point is shown second from the left in Figure~\ref{fig:meshes-point}.
In this case, there are $2^d$ blocks at the highest level of refinement in the corner,
surrounded by $2^d-1$ blocks at the next-highest level, which are in turn
surrounded by $2^d-1$ blocks at the level below that, and so on.
244:
Since the root block at $l=1$ is itself refined, leaf blocks exist only on
levels $2$ through $L$. Thus the total number of leaf blocks is
\begin{equation}
\nblocks = 2^d + \sum_{l=2}^{L-1} \left ( 2^d-1 \right ) = 2^d (L - 1) - L + 2 .
\end{equation}

Weighting them by the amount of work,
\begin{equation}
\Wtmr = 2^d + \sum_{l=2}^{L-1} \frac{2^d-1}{2^{L-l}} \sim 2^{d+1}-1 ,
\end{equation}
254: making the work ratio
255: \begin{equation}
256: R = \left \{ \begin{array}{cl} 3/L & 1d \\
257: 7/(3L-2) & 2d \\
258: 15/(7L-6) & 3d
259: \end{array}
260: \right .
261: \end{equation}
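
The finite-$L$ values behind these asymptotic ratios can be tabulated with a short Python sketch. It assumes $2^d$ leaf blocks on the finest level and $2^d-1$ leaf blocks on each of the levels $2$ through $L-1$, which is the counting that reproduces the closed form $2^d(L-1)-L+2$; the function name {\tt octree\_point\_work} is hypothetical.

```python
from fractions import Fraction


def octree_point_work(L, d):
    """Return (N_blocks, W_TR, R) for an oct-tree refined at one corner point.

    L is the number of levels (root = level 1, L >= 2) and d the
    dimensionality. Leaf blocks are counted as 2**d on level L and
    2**d - 1 on each of levels 2 .. L-1 (an assumption matching the
    closed form 2**d * (L - 1) - L + 2).
    """
    if L < 2 or d < 1:
        raise ValueError("need L >= 2 and d >= 1")
    finest = 2 ** d
    per_level = 2 ** d - 1
    n_blocks = finest + (L - 2) * per_level
    # A leaf on level l is evolved once every 2**(L - l) sub-steps.
    w_tr = finest + sum(Fraction(per_level, 2 ** (L - l))
                        for l in range(2, L))
    return n_blocks, w_tr, w_tr / n_blocks


for L in (3, 4, 5, 6):
    n_blocks, w_tr, ratio = octree_point_work(L, d=2)
    print(L, n_blocks, round(float(ratio), 3))
```

For $d=2$ and $L = 3, 4, 5, 6$ this gives $R \approx 0.786$, $0.625$, $0.510$, $0.426$. These coincide with the $r=0$ row of Table~\ref{table:2d-oct-circregion} if the $L$ in that table is read as one less than the $L$ used here; that reading is an observation from the numbers, not something stated in the text.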
262:
263: As with the patch-based result, this ratio goes to zero for arbitrarily
264: large $L$. These results are similar to the $N=2$ patch-based result, but
265: TR performs better here, and the spatial refinement worse --- both of
266: these are due to the fact that the oct-tree mesh generates more intermediate-level
267: blocks.
268:
269: \subsection{Planar Interface Refinement}
270: \label{subsec:planerefine}
271:
272: The refinement of an interface is shown on the right of
273: Figure~\ref{fig:meshes-point}. In the patch-based case, we continually
274: place a grid of $N$-by-1 (in 2d) or $N^2$-by-1 (in 3d) patches along
275: the interface, until the required resolution is achieved.
276:
277: In this case, performing the same calculation as in the previous section, one
278: obtains
279: \begin{equation}
280: R \approx 1 - \frac{N-1}{N^d - 1} .
281: \end{equation}
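One way to arrive at this estimate is sketched here, under two assumptions: that $d \ge 2$ and the number of zones on level $l$ scales as $a^{l-1}$ with $a = N^{d-1}$ (each refinement multiplies the zones along the $(d-1)$-dimensional interface by $N^{d-1}$), and that $L$ is large enough for the geometric sums to be dominated by the finest levels. Then
\[
\Wnontmr \propto \sum_{l=1}^{L} a^{l-1} \approx \frac{a^{L-1}}{1 - a^{-1}},
\qquad
\Wtmr \propto \sum_{l=1}^{L} a^{l-1} N^{-(L-l)} \approx \frac{a^{L-1}}{1 - (aN)^{-1}},
\]
so that, using $aN = N^d$,
\[
R \approx \frac{1 - N^{-(d-1)}}{1 - N^{-d}} = \frac{N^d - N}{N^d - 1} = 1 - \frac{N-1}{N^d-1} .
\]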
282:
283: Here, there is a fixed lower bound for the amount of work the TR can
284: achieve. In the spatially-optimal large-$N$ limit, no work is saved
285: at all: $R \rightarrow 1$. At the other limit, for $N=2$, in 2d,
286: $R \rightarrow 2/3$; in 3d, $R \rightarrow 6/7$.
287:
288: In the oct-tree mesh we begin with one block at
289: the coarsest level. It must be divided into 4 in this 2D
290: example, or, in general, $2^d$. Half of these blocks will be
291: further refined. This continues until we reach the maximum
292: level of refinement. The work ratio one finds is
293: \begin{equation}
294: R = \left \{ \begin{array}{cl} 7/9 & 2d \\ 45/49 & 3d \end{array} \right .
295: \end{equation}
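
The refinement rule just described (every refined block produces $2^d$ children, half of which are refined again) can be iterated directly to see how the ratios $7/9$ and $45/49$ arise. The following Python sketch does so; the function name {\tt octree\_interface\_work} is hypothetical, and only leaf blocks are counted as evolved.

```python
from fractions import Fraction


def octree_interface_work(L, d):
    """Return (N_blocks, W_TR, R) for an oct-tree refined along an interface.

    Starts from one root block on level 1. Every refined block has
    2**d children; half of the children on each level below L are
    refined again, and the rest are leaves.
    """
    if L < 2 or d < 2:
        raise ValueError("need L >= 2 and d >= 2")
    refined = 1          # blocks on the previous level that were subdivided
    n_blocks = 0
    w_tr = Fraction(0)
    for level in range(2, L + 1):
        children = refined * 2 ** d
        if level < L:
            refined = children // 2      # the half that is refined further
            leaves = children - refined
        else:
            leaves = children            # finest level: all are leaves
        n_blocks += leaves
        # A leaf on this level is evolved once every 2**(L - level) sub-steps.
        w_tr += Fraction(leaves, 2 ** (L - level))
    return n_blocks, w_tr, w_tr / n_blocks


for d in (2, 3):
    print(d, float(octree_interface_work(20, d)[2]))
```

For small $L$ the ratio is noticeably higher than the limiting value (for example $R = 9/10$ at $L=3$ in 2d), and it decreases toward $7/9$ or $45/49$ as levels are added.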
296:
297: In the point-refinement case of the previous subsection, a point of zero volume
298: needed to be refined; as a result, there were the same number of blocks at each
299: level, and thus a significant time savings could be obtained by doing less work
300: at the coarser blocks. However, as we begin to see here, as soon as a non-trivial
301: volume of the mesh needs to be refined, there is significantly less savings to
302: be had.
303:
304: \subsection{Circular Region Refinement}
305:
306: \begin{figure}[ht]
307: \begin{center}
308: \includegraphics[width=.2\textwidth]{curve-region-bo.eps}
309: \hskip .3 in
310: \includegraphics[width=.2\textwidth]{curve-region-oct.eps}
311: \end{center}
312: \caption{Fully refining the interior of a circle, shown here with radius
313: of $0.49$ of the
314: box size, with an idealized patch-based type mesh (left) and
315: an oct-tree mesh (right). The patch-based mesh shown has $L
316: = 3$ and $N = 4$. For the oct-tree mesh, $N$ is fixed at 2,
317: and shown is $L = 6$.}
318: \label{fig:meshes-curve-region}
319: \end{figure}
320:
321: The loss of efficiency gains when a non-zero fraction of the mesh must
322: be refined is even clearer when a region, rather than an interface,
323: is fully refined. In Fig.~\ref{fig:meshes-curve-region} we see
324: the results of fully refining the interior of a quarter-circle with the center
at one of the corners of the domain. Clearly, the finest
blocks greatly outnumber intermediate or large blocks, so one might
327: guess that there is very little efficiency gain that can be had from
328: reducing work on the larger blocks.
329:
330: \begin{table}[ht]
331: \begin{center}
332: \begin{tabular}{lrrrrrrrr}
333: {} & L=2 & 3 & 4 & 5 & 6 & 7 & 8 \\
334: \hline
335: r = 0.0 & 0.786 & 0.625 & 0.510 & 0.426 & 0.363 & 0.316 & 0.279 \\
336: 0.1 & 0.786 & 0.625 & 0.510 & 0.510 & 0.638 & 0.765 & 0.879 \\
337: 0.2 & 0.786 & 0.625 & 0.625 & 0.714 & 0.806 & 0.895 & 0.940 \\
338: 0.5 & 0.962 & 0.843 & 0.851 & 0.888 & 0.931 & 0.963 & 0.981 \\
339: % 0.7 & 0.962 & 0.9 & 0.908 & 0.927 & 0.954 & 0.973 & 0.985 \\
340: 0.9 & 1. & 0.973 & 0.962 & 0.962 & 0.973 & 0.982 & 0.989
341: \end{tabular}
342: \end{center}
343: \caption{Work ratio for a 2d Oct-tree mesh with a circular region
344: of radius $r$ (in units of the domain) completely refined.}
345: \label{table:2d-oct-circregion}
346: \end{table}
347:
348: Because in this case the refinement pattern is complicated enough that
the process must be iterated to check that each zone's neighbors are
no further than one level of refinement apart, we do not provide
351: analytic work ratios. Tables~\ref{table:2d-oct-circregion} and
352: \ref{table:2d-bo-circregion} show the work ratios for an Oct-Tree mesh
353: and an $N=2$ patch-based mesh in refining a circular region of radius $r$.
354: Again, the $r=0$ results reproduce the expected point refinement, but as
355: soon as a non-zero radius must be refined, the efficiency gains drop
356: significantly further than in the case of only refining an interface,
357: as more small blocks are needed to refine a region than the interface.
In Table~\ref{table:2d-bo-circregion} we also show results for the
patch-based mesh with $N=4$; we see as in previous sections that for the
360: same resolution, increasing $N$ (which increases the spatial efficiency
361: of AMR) decreases the possible gains from time subcycling.
362:
363: \begin{table}[ht]
364: \begin{center}
365: \begin{tabular}{lrrrrrrr||rrr}
366: {} & N=2, L= 2 & 3 & 4 & 5 & 6 & 7 & 8 & N=4, L=2 & 3 & 4 \\
367: \hline
368: r = 0.0 & 0.583 & 0.468& 0.387& 0.328& 0.283& 0.249& 0.221 & 0.438 & 0.332 & 0.510 \\
369: 0.1 & 0.583 & 0.468& 0.444& 0.552& 0.658& 0.754& 0.802 & 0.719 & 0.891 & 0.510\\
370: 0.2 & 0.583 & 0.548& 0.618& 0.694& 0.768& 0.806& 0.833 & 0.812 & 0.914 & 0.625\\
371: 0.5 & 0.75 & 0.737& 0.763& 0.798& 0.825& 0.840& 0.848 & 0.896 & 0.938 & 0.851\\
372: % 0.7 & 0.788 & 0.783& 0.797& 0.818& 0.835& 0.844& 0.851 & 0.926 & 0.942 & 0.908\\
373: 0.9 & 0.847 & 0.827& 0.826& 0.833& 0.842& 0.848& 0.852 & 0.938 & 0.947 & 0.962\\
374: \end{tabular}
375: \end{center}
376: \caption{Work ratio for a 2d patch-based mesh, $N=2$ and $N=4$, with a circular region
377: of radius $r$ (in units of the domain) completely refined.}
378: \label{table:2d-bo-circregion}
379: \end{table}
380:
381: \section{Meshes from simulations}
382: \label{sec:data}
383:
384: \newcommand{\ramr}{R_{\mathrm{AMR}}}
385:
386: The calculations of the previous section are for very simple refinement
387: geometries. In this section, we apply the same work function used in
388: \S\ref{sec:analysis} to the output of previous actual AMR simulations
389: which use oct-tree based meshes for AMR. We continue to assume the
390: same idealized performance results of the previous section.
391:
392: We begin with examining results from a standard test problem,
393: a Sedov explosion \cite{sedov}, as included with the \FLASH code and described
394: in \cite{flashcode}. In this simulation, a high pressure at a point
395: causes a spherical shock wave to expand outwards; this is analogous
to the circular region analysis of the previous section. The adaptive meshes at
different stages of this simulation in 2d are shown in Figure~\ref{fig:sedov}.
398:
399: \begin{figure}[hHt]
400: \begin{center}
401: \includegraphics[width=.6 \textwidth]{sedov8lev-high.eps}
402: \end{center}
403: \caption{The mesh of a Sedov explosion, from the \FLASH setup test described in
404: \cite{flashcode}, with a maximum of 8 levels of refinement. Each
405: block shown contains $8 \times 8$ zones.}
406: \label{fig:sedov}
407: \end{figure}
408:
409: Results from the meshes shown are tabulated in
410: Table~\ref{tab:sedovresults}. The number of blocks listed in the table
is the number of `leaf' blocks -- {\emph {i.e.}}, the blocks that are
412: actually evolved. Also given in the table is the work ratio ($R$)
413: and the work ratio of spatial AMR to a uniform mesh at the highest
414: resolution ($\ramr = \Wnontmr / W_{\mathrm{uniform}}$). We include
415: $\ramr$ to compare the relative importance of performance gains for
416: the spatial refinement and the time subcycling.
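
The quantity $\ramr$ can be sketched as a simple block count ratio. The Python function below is an illustration only: the name {\tt amr\_ratio} is hypothetical, blocks are assumed to hold the same number of zones at every level, and for the Sedov mesh a single top-level block is assumed (this is not stated explicitly, but it reproduces the tabulated value).

```python
def amr_ratio(n_leaf_blocks, n_top_blocks, max_level, d=2):
    """Return R_AMR = W_noTR / W_uniform, both counted in blocks.

    W_uniform is the number of blocks a uniform mesh would need if the
    whole domain were refined to max_level (top level = level 1), with
    each refinement splitting a block into 2**d children.
    """
    if max_level < 1 or n_top_blocks < 1:
        raise ValueError("need max_level >= 1 and n_top_blocks >= 1")
    uniform_blocks = n_top_blocks * (2 ** d) ** (max_level - 1)
    return n_leaf_blocks / uniform_blocks


# Sedov mesh at time 0: 256 leaf blocks, 8 levels, one top-level block assumed.
print(amr_ratio(256, 1, 8))   # 0.015625
# Detonation mesh: 62 leaf blocks, 1 x 20 top-level blocks, 4 levels.
print(amr_ratio(62, 20, 4))   # 0.0484375
```

With these assumptions the same function also matches the $\ramr$ columns of Tables~\ref{tab:celldet} and \ref{tab:rtresults} to the precision given there.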
417:
418: TR provides a large performance gain initially, when there is only
one point that is refined. However, consistent with previous results,
420: immediately as the point becomes a region of non-zero measure, idealized
421: performance gains drop to $30\%$--$10\%$. Regardless
422: of the refinement, the TR provides a very small performance enhancement
423: compared to that of the spatial refinement.
424:
425: \begin{table}[hHt]
426: \begin{center}
427: \begin{tabular}{c|rll}
428: time & $\nblocks$ & $R$ & $\ramr$ \\
429: \hline
430: 0.00 & 256 & 0.426 & 0.0156 \\
431: 0.01 & 892 & 0.805 & 0.0544 \\
432: 0.02 & 1552 & 0.835 & 0.0947 \\
433: 0.03 & 2092 & 0.874 & 0.127 \\
434: \end{tabular}
435: \end{center}
436: \caption{Results from simulations of a Sedov explosion. Listed
437: at different evolution times are the number of leaf blocks in the mesh,
438: the work ratio, and the work ratio for spatial AMR to uniform grid.}
439: \label{tab:sedovresults}
440: \end{table}
441:
442: The reason for the small predicted efficiency gains, consistent with
443: the discussion of the previous section, is that there quickly become
444: more fine blocks than coarse blocks in the simulation. By the last
445: frame shown in Figure~\ref{fig:sedov}, there are no blocks being evolved
at the coarsest level of refinement, and indeed 80\% of the blocks
447: are at the highest level of refinement. Thus, even if all other blocks
448: required zero work to evolve, we could only achieve a $20\%$ speedup.
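This bound follows directly from the work function. Writing $f_l$ for the fraction of evolved blocks on level $l$ (a symbol introduced here only for illustration), with $\sum_l f_l = 1$,
\[
R = \sum_{l=1}^{L} f_l \, N^{-(L-l)} \ge f_L ,
\]
so with $f_L \approx 0.8$ the work ratio cannot fall below $0.8$, consistent with $R = 0.874$ at the final time in Table~\ref{tab:sedovresults}.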
449:
450: Next we consider an interface problem -- a 2d detonation that will
451: eventually undergo a cellular instability. These simulations are from
452: results published in \cite{celldet2d}. A mesh is shown in
453: Figure~\ref{fig:celldet}. This
454: corresponds almost exactly to the idealized interface problem of the
455: previous section, but here the domain is very
456: long in one direction, increasing the number of low-cost coarsest
457: blocks in the domain. This change in distribution of blocks means that
458: this problem can benefit more from TR. The numerical results are
459: shown in Table~\ref{tab:celldet}.
460:
461: \begin{figure}[hHt]
462: \begin{center}
463: \includegraphics[angle=90,width=.7\textwidth]{celldet.eps}
464: \end{center}
465: \caption{Half of the domain for the initial condition of a detonation, where the
466: long domain is refined nowhere except at a sharp interface.
467: The domain originally consists of a top-level mesh of $1 \times
20$ blocks. This mesh is then refined at an interface. Shown
is the mesh with a maximum refinement level of 6, zoomed in near
the interface. Not shown are 10 coarsest blocks to the right.}
471: \label{fig:celldet}
472: \end{figure}
473:
474: \begin{table}[hHt]
475: \begin{center}
476: \begin{tabular}{c|rlll}
477: Max refinement & $\nblocks$ & $R$ & $\ramr$ \\
478: \hline
479: 4 & 62 & 0.633 & 0.0484 \\
480: 5 & 110 & 0.688 & 0.0215 \\
481: 6 & 206 & 0.727 & 0.0101 \\
482: 7 & 398 & 0.751 & 0.00486
483: \end{tabular}
484: \end{center}
485: \caption{Results from initial conditions for a 2-d detonation problem,
486: as in Figure~\ref{fig:celldet}. $R$ is less than
487: the $7/9$ calculated in the previous section, because of the large number
488: of extra coarsest blocks added to the domain.}
489: \label{tab:celldet}
490: \end{table}
491:
492: Here we see TR's efficiency gains actually decrease with increasing
resolution, and also see a familiar pattern of TR's efficiency gains
going in the opposite direction of spatial AMR efficiency gains.
Even at the resolution where TR's efficiency gains are largest, they are
496: much smaller than the improvement from using spatial AMR.
497:
498:
499: \begin{figure}[hHt]
500: \begin{center}
501: \includegraphics[height=.625\textwidth,angle=90]{rt8_early-high.eps}
502: \includegraphics[height=.625\textwidth,angle=90]{rt8_mid-high.eps}
503: \includegraphics[height=.625\textwidth,angle=90]{rt8_late-high.eps}
504: \end{center}
505: \caption{Development of Rayleigh-Taylor instability at 3 epochs, from
506: simulations presented in \cite{vandv}. These are fairly high-resolution
507: simulations, with a maximum of 8 levels of refinement on a top-level
508: mesh with $6 \times 1$ coarsest blocks.}
509: \label{fig:rt}
510: \end{figure}
511:
We next consider the development of the Rayleigh-Taylor instability
(Figure~\ref{fig:rt}). This is an interface problem, but in this
514: set of simulations, the center region of the box is resolved to ensure
515: resolution of the velocity perturbations in the region near the interface.
516: Because this region is fully refined, many `full cost' finest
517: blocks are added. This decreases the scope of improvement from TR,
518: as seen in Table~\ref{tab:rtresults}.
519:
520: \begin{table}[hHt]
521: \begin{center}
522: \begin{tabular}{c|rlll}
523: time & $\nblocks$ & $R$ & $\ramr$ \\
524: \hline
525: 0.0 & 33150 & 0.993 & 0.337 \\
526: 1.8 & 33150 & 0.993 & 0.337 \\
527: 3.6 & 60816 & 0.987 & 0.619
528: \end{tabular}
529: \end{center}
530: \caption{Numerical results from simulations of a Rayleigh-Taylor instability,
531: shown in Figure~\ref{fig:rt}.}
532: \label{tab:rtresults}
533: \end{table}
534:
535: \section{Conclusion}
536:
We have considered efficiency gains for time subcycling for explicit or
538: local physics. In these cases the work per block is roughly constant.
539: Further, in most cases there are many more fine blocks than coarse
540: blocks --- this is due to simple geometry, as a mesh that refines a
541: significant fraction of its domain will be strongly weighted in favour
of small blocks, which must be evolved at a small timestep. Thus, any
543: attempt to improve performance by focusing on the relatively few larger
544: blocks can only reduce a small fraction of the work that needs to be
545: done to evolve the system one timestep. On the other hand, in studies
546: where only a small number of points in a large domain must be fully
547: resolved, there may be significant efficiency gains from TR methods.
548: Some cosmological hydrodynamical simulations \cite{normanextreme} are
549: examples of this situation.
550:
551: We have not considered here accuracy; taking fewer timesteps may
552: increase accuracy with some solvers, although this isn't clear for
553: moderately time-accurate algorithms having errors of $O({\Delta t}^p)$,
554: $p > 1$; further, the coarsely refined regions which would benefit from
555: the fewer timesteps are presumably coarsely refined because the overall
556: solution quality is less sensitive to the error in those regions than
557: it is to that of the highly refined parts of the domain.
558: We also do not consider global or implicit solves, where the
559: timestepping algorithm in Fig.~\ref{fig:tmrwcurve} must be modified.
560: Global or implicit solves will, depending on the methods used, change
561: the amount of work done per block at different levels of refinement,
562: which can change the results given here considerably.
563:
564: We have modelled only computational cost in this work. Most of the
565: other costs, cf.~\S\ref{sec:analysis}, work to decrease the efficiency
566: gains of TR. One unmodelled effect that could increase the gains is
567: the reduction of guardcell fills on large blocks. For the oct-tree
mesh, where the number of zones per block is fixed, the
guardcell filling work is reduced in the same way as the computational
work, so that our conclusions are unchanged. For the patch-based mesh,
the effect on the guardcell filling will be dependent on the shape of
572: the refined region and the algorithm used for merging patches of the
573: same refinement level, so that it is difficult to say anything in general.
574:
575: Thus, block-structured TR significantly enhances performance of local
576: or explicit physics solvers only under fairly narrow circumstances.
577: In circumstances where TR is unlikely to produce much performance
578: enhancement, the added code complexity, memory overhead, and parallel
579: load-balancing issues may make the costs of the technique exceed its
580: benefits.
581:
582:
The authors thank B. Fryxell for useful discussions about this paper,
and K. Olson for his help with \Paramesh over the past years. We thank
585: A. Calder for data from RT simulations, and F. X. Timmes for data
586: from cellular detonation simulations. We thank T. Plewa, G. Weirs,
587: R. Kirby, and R. Loy for suggesting this work. Support for this work
588: was provided by the Scientific Discovery through Advanced Computing
589: (SciDAC) program of the DOE, grant number DE-FC02-01ER41176 to the
590: Supernova Science Center/UCSC. LJD was supported by the Department of
591: Energy Computational Science Graduate Fellowship Program of the Office
592: of Scientific Computing and Office of Defense Programs in the Department
593: of Energy under contract DE-FG02-97ER25308.
594:
595:
596: The \FLASH code is freely available at http://flash.uchicago.edu/.
597:
598:
599: \bibliographystyle{plain}
600: \bibliography{mesh}
601:
602:
603: \end{document}
604:
605: