1: \documentclass[]{article}
2: \usepackage{epsf,a4}
3: \addtolength{\oddsidemargin}{-0.75in}
4: \addtolength{\textwidth}{1.5in}
5: \addtolength{\topmargin}{-0.35in}
6: \addtolength{\textheight}{0.8in}
7: \renewcommand{\baselinestretch}{1}
8:
9: \begin{document}
10: \begin{center}
11: \LARGE
12: Parallel Sparse Matrix Multiplication for \\
13: Linear Scaling Electronic Structure Calculations
14: \bigskip \\
15: \large
16: D. R. Bowler$^\star$, T. Miyazaki$^\dagger$$^\star$ and
17: M. J. Gillan$^\star$ \smallskip \\
18: $^\star$Department of Physics and Astronomy, University College London \\
19: Gower Street, London, WC1E 6BT, U.K. \smallskip \\
20: $^\dagger$National Research Institute for Metals \\
21: 1-2-1 Sengen, Tsukuba, Ibaraki 305-0047, Japan
22: \normalsize
23: \end{center}
24:
25: \begin{abstract}
26: Linear-scaling electronic-structure techniques, also called $O(N)$
27: techniques, rely heavily on the multiplication of sparse matrices,
28: where the sparsity arises from spatial cut-offs. In order to treat
29: very large systems, the calculations must be run on parallel
30: computers. We analyse the problem of parallelising the multiplication
31: of sparse matrices with the sparsity pattern required by
32: linear-scaling techniques. We show that the management of inter-node
33: communications and the effective use of on-node cache are helped by
34: organising the atoms into compact groups. We also discuss how to
35: identify a key part of the code called the `multiplication kernel',
36: which is repeatedly invoked to build up the matrix product, and
37: explicit code is presented for this kernel. Numerical tests of the
38: resulting multiplication code are reported for cases of practical
39: interest, and it is shown that their scaling properties for systems
40: containing up to 20,000 atoms on machines having up to 512 processors
41: are excellent. The tests also show that the cpu efficiency of our code
42: is satisfactory.
43: \end{abstract}
44:
45: \section{Introduction}
46: \label{sec:intro}
47:
48: There is rapidly growing interest in linear-scaling methods for
49: electronic structure calculations, in which the memory and number of
50: cpu cycles are proportional to the number of atoms\cite{Goedecker99}.
51: Several practical codes exist for performing
52: tight-binding~\cite{Pettifor89,Yang91,%
53: Galli92,Li93,Daw93,Ordejon93,Aoki93,Mauri93,Goedecker94,%
54: Kress95,Kohn96,Horsfield96,Bowler97},
55: first-principles~\cite{Stechel94,Hernandez95,Haynes97} or
56: Hartree-Fock~\cite{Schweg96,Burant96,Ochsen98,Challa00} calculations
57: using linear-scaling techniques -- often called order-$N$ or $O(N)$
58: techniques. For very large systems containing thousands or tens of
59: thousands of atoms, these codes need to run efficiently on parallel
60: computers. Most of the existing linear-scaling methods rely heavily on
61: the multiplication of sparse matrices, and this means that
62: sparse-matrix multiplication on parallel computers is a key issue for
63: the future of linear-scaling electronic structure work. To achieve the
64: best efficiency, parallel codes to perform this multiplication must be
65: designed to treat the special patterns of sparsity that arise in
66: electronic structure, which we shall refer to as `local sparsity'.
67: The aims of this paper are to analyse the design of parallel code for
68: multiplying matrices having local sparsity, to present the algorithms
69: we have developed for the {\sc Conquest} electronic structure
70: code~\cite{Goringe97,Bowler00}, and to report the results of practical
71: tests on the efficiency and linear-scaling properties of the code.
72:
73: In the tight-binding approach to electronic structure, we are
74: concerned with the Hamiltonian and overlap matrices $H_{i \alpha , j
75: \beta}$ and $S_{i \alpha , j \beta}$. (Here, $i$, $j$ label atoms,
76: while $\alpha$, $\beta$ label the basis functions on each atom.)
77: Linear-scaling methods generally require also the density matrix $K_{i
78: \alpha , j \beta}$ and sometimes also an `auxiliary' density matrix
79: $L_{i \alpha , j \beta}$ -- we follow the notation of
80: Ref.~\cite{Hernandez96}. Linear-scaling methods based on
81: density-functional theory (DFT) or Hartree-Fock theory work with
82: similar quantities, and the DFT-pseudopotential approach also needs
83: scalar products $P_{i \alpha , j \lambda}$ between localised orbitals
84: on atom $i$ and angular-momentum projectors for non-local
85: pseudopotentials on atom $j$. The elements of all these matrices tend
86: to zero as the distance $R_{i j}$ between atoms $i$ and $j$ tends to
87: infinity, and this is what makes linear-scaling eletronic structure
88: theory possible. In the approach used in the {\sc Conquest}
89: code~\cite{Goringe97,Bowler00}, and other related approaches (see
90: e.g. Refs~\cite{Ordejon93,Mauri93, Mauri94,Ordejon95}), the
91: approximation is made that all the matrix elements vanish exactly when
92: $R_{i j}$ exceeds a specified cut-off radius, which generally differs
93: for different matrices. Once this approximation has been made,
94: linear-scaling behaviour for both memory and cpu cycles is guaranteed
95: -- provided, of course, appropriate algorithms are used. All the
96: matrices are sparse, since their only non-vanishing elements are those
97: for which $R_{i j}$ is less than the appropriate cut-off. In
98: everything that follows, the particular pattern of sparsity that
99: arises from the spatial cut-offs is crucial, and it is this pattern
100: that we call `local sparsity'.
101:
102: The techniques widely used to ensure the idempotency of the density
103: matrix $K$~\cite{Li93,Hernandez96,Bowler99} rely on the calculation of
104: matrix products such as $(LSL)_{i \alpha , j \beta}$, $(LSLSL)_{i
105: \alpha , j \beta}$ and other similar products involving $H$ and
106: $P$. Sometimes, these products are needed only for interatomic
107: distances $R_{i j}$ that are smaller than the full distances for which
108: they have non-vanishing elements, so that we do not need to calculate
109: all their elements. The general problem is therefore to perform the
110: product:
111: \begin{equation}
112: C_{i \mu , j \nu} = \sum_{k \xi} A_{i \mu , k \xi}
113: B_{k \xi , j \nu} \; ,
114: \label{eqn:mult}
115: \end{equation}
116: where the elements of $A$ and $B$ are non-vanishing only
117: for $R_{i k} < R_A$ and $R_{k j} < R_B$ respectively, and elements
118: of $C$ are required only for $R_{i j} < R_C$, where $R_A$,
119: $R_B$, $R_C$ are specified cut-off distances. The indices
120: $\mu$, $\nu$, $\xi$ may correspond to localised orbitals or
121: angular momentum projectors
122: on each atom, but in any case they run only over a small set
123: of values. In general, the number of values might depend on
124: the atom, but we ignore this possible complication here and assume that
125: $\mu$ runs from 1 to $n_1$, $\nu$ runs from 1 to $n_2$, and $\xi$
126: runs from 1 to $n_3$. Unless the distinction is important,
127: we shall refer to $n_1$, $n_2$ or $n_3$ simply as $n$.
128:
129: The {\sc Conquest} DFT code was written from
130: the start as a parallel code, and the methods used to parallelise
131: sparse-matrix multiplications and all the other operations
132: were reported earlier~\cite{Goringe97}, together with
133: tests of the scaling behaviour. Although the scaling
134: was excellent, the major problem
135: with the matrix multiplication methods that we used earlier
136: was that they required elaborate indexing, which consumed
137: excessive amounts of memory. This is why we have carried out
138: the much more thorough analysis of locally sparse matrix
139: multiplication reported here.
140:
141: There has also been other work on parallelising the multiplication
142: of locally sparse matrices for linear-scaling electronic structure
143: work, using both tight-binding~\cite{Itoh95,Canning96,Wang96}
144: and Hartree-Fock
145: techniques~\cite{Challa00}. Itoh {\em et al.}~\cite{Itoh95} and
146: Canning {\em et al.}~\cite{Canning96} describe the parallelisation
147: of the two closely related linear-scaling methods reported
148: in Refs.~\cite{Ordejon93,Ordejon95} and~\cite{Mauri93,Mauri94},
149: and find satisfactory
150: speed-ups of up to 7.3 for an eight-fold increase of
151: processor number. Challacombe~\cite{Challa00} presented a very general
152: parallelisation scheme for linear-scaling Hartree-Fock
153: calculations using Gaussian basis sets, but found speed-ups
154: of only 75~\% for an eight-fold increase of processor number.
155: Wang {\it et al.} made passing reference to the parallelisation
156: of the tight-binding density-matrix method~\cite{Wang96}, but no details
157: appear to have been published. The popular tight-binding
158: linear-scaling scheme known as the Fermi-operator
159: method~\cite{Goedecker94}
160: has also been parallelised, but this relies on matrix-vector
161: rather than matrix-matrix operations, and is less relevant here.
162:
163: In considering the efficiency of a multiplication code, we note that
164: there is a certain irreducible minimum number of multiply-and-adds
165: that must be done to accomplish the calculation of
166: eqn~(\ref{eqn:mult}). For a parallel machine whose processors have a
167: given peak speed, this sets a well defined lower bound to the
168: execution time. The ratio between this lower bound and the
169: practical execution time gives a measure of the cpu efficiency of the
170: code. The superscalar processors of main interest to us can sometimes
171: achieve 50~\% of peak speed in the multiplication of large matrices,
172: but it would be unrealistic to expect this kind of efficiency for
173: sparse matrices. On the other hand, it would be cause for concern if
174: the cpu efficiency fell much below 10~\%. We shall show that
175: efficiencies of around 10~\% are indeed achievable. We shall show
176: that the methods we propose achieve scaling very close to linear, both
177: as the numbers of atoms and processors are increased for a fixed
178: number of atoms per processor, and as the number of processors is
179: increased for a fixed number of atoms.
180:
181: We shall see that achievement of high efficiency and good scalability
182: relies heavily on two things: first, good design of storage patterns
183: of matrix elements; second, the use of different schemes for
184: labelling atoms in different parts of the calculation, together with
185: efficient transcription between schemes. For both storage and labelling,
186: we shall show the advantages of grouping atoms into small spatially
187: compact groups (here called `partitions'), and assignment of groups
188: of these partitions (here called partition `bundles') to individual
189: processors.
190:
191: The next section of the paper presents our analysis of the problem,
192: and the set of algorithms that the analysis leads us to. We also
193: present the piece of code that lies at the heart of our multiplication
194: program, which we refer to as the `multiplication kernel'. Our
195: practical code is written in Fortran~90, with communications handled
196: by MPI1. In Sec.~3, we outline the main principles used in
197: programming our algorithms. Sec.~4 reports the range of tests on
198: a Cray~T3E, an SGI Origin2000 and a beowulf cluster
199: that we have performed to probe the efficiency and scaling
200: behaviour of the multiplication code. The final Section discusses our
201: results and draws conclusions. In particular, we shall argue that
202: other possible approaches to the problem of multiplying locally sparse
203: matrices on parallel machines are unlikely to achieve significantly
204: better efficiency or linear scaling behaviour.
205:
206: \section{Development of techniques}
207: \label{sec:techniques}
208:
209: In order to achieve high efficiency, we need careful design
210: of a small set of operations that are repeatedly performed
211: to build up the product matrix: this set of
212: operations is the `multiplication kernel'. Some of the
213: data that serve as input to the kernel need to be communicated
214: from remote nodes, and linear scaling will only be achieved
215: if the communications strategy is properly designed. Both the
216: multiplication kernel and the communications need indexing,
217: and it is important that both the memory
218: and the cpu time needed to manipulate index arrays are kept to a
219: minimum. Before discussing the multiplication kernel, the
220: communications strategy and the indexing, we outline first
221: the starting assumptions and the spatial organisation of
222: atoms into partitions and bundles.
223:
224: For ease of presentation, we confine ourselves in much of the
225: following to the case $R_C \ge R_A + R_B$, which we
226: refer to as the `maximal' case; it is the case where every
227: element of the product matrix $C = A \cdot B$ must be calculated.
228: At the end of this Section, we shall
229: show how the case of a general relation between $R_A$, $R_B$ and
230: $R_C$ can be handled by exactly the same principles that will
231: be described for the maximal case. We also avoid initially any
232: discussion of periodic boundary conditions. The {\sc Conquest}
233: code is able to handle periodic boundary conditions, though
234: this may not always be needed, so that the matrix multiplication
235: code must be capable of including periodicity.
236: But to simplify the discussion,
237: we defer discussion of periodicity till Sec.~\ref{sec:pbc}.
238:
239: \subsection{Assumptions}
240: \label{sec:assumptions}
241:
242: Following Ref.~\cite{Goringe97}, we assume that each node is
243: responsible for a group of atoms: we call this the node's `primary
244: set' of atoms. We also assume that matrix elements are stored by
245: rows, so that for any matrix $X$, the elements $X_{i \mu , j \nu}$ for
246: all $\mu$, $j$, $\nu$ are held by the node whose primary set contains
247: $i$. Finally, we assume that the multiplication operations needed to
248: calculate the elements $C_{i \mu , j \nu}$ of the product matrix are
249: performed by the node whose primary set contains $i$. This means that
250: only matrix elements $B_{k \xi , j \nu}$ need to be communicated
251: between nodes. We discuss in Sec.~\ref{sec:disc} whether anything
252: could be gained by changing these assumptions.
253:
254: With these assumptions, we already encounter a key question, concerning
255: the interleaving of communications and calculations. The matrix
256: multiplications performed by each node can be broken into
257: contributions, with each contribution associated with
258: a particular set of atoms $k$. For each set, we fetch the
259: $B_{k \xi , j \nu}$ data for $k$ in the set, and then perform
260: the multiply-and-adds of eqn~(\ref{eqn:mult}) for those $k$, accumulating the
261: results onto the array of the product matrix; then we move to the next
262: set of $k$ atoms and repeat the communications and calculations.
263: Two extremes can be envisaged. At one extreme, we start by
264: bringing {\em all} the $B_{k \xi , j \nu}$ for all the $k$
265: entering eqn~(\ref{eqn:mult}) from the remote nodes onto the local node;
266: after this has been done, we then perform {\em all} the multiplications.
267: This is the coarsest possible interleaving. At the other extreme, we
268: fetch the $B_{k \xi , j \nu}$ for just a single atom
269: $k$ at a time, and perform the multiplications for that $k$, before
270: moving on to the next $k$. This is the finest possible interleaving.
271: Neither extreme will be satisfactory. The coarse extreme requires
272: an unnecessarily large amount of local memory to store all the
273: $B_{k \xi , j \nu}$ elements; the finest extreme requires
274: the transmission of unnecessarily small packets of
275: $B_{k \xi , j \nu}$ data, so that latency will slow the
276: communications. Somewhere between the extremes, there is a good
277: compromise between memory and latency.
278:
279: In fact, there is no point in communicating {\em all}
280: $B_{k \xi , j \nu}$ data before multiplying: the coarsest interleaving
281: that we should ever consider is where we communicate all the
282: $B_{k \xi , j \nu}$ needed from a particular remote node before
283: performing the multiplications; then we move to the next node.
284: This interleaving already yields the lowest possible latency. It follows
285: that the outermost loop in matrix multiplication must be a loop
286: over remote nodes. However, there is no reason why this coarsest
287: interleaving should give an acceptable compromise, and we shall
288: generally want to reduce the memory requirement further by
289: breaking the atoms $k$ into smaller sets. Within the outer loop
290: over nodes, we shall therefore want a loop over {\em sets} of
291: atoms $k$ in each particular node.
292:
293: Our basic assumptions about the distribution of matrix elements
294: over nodes therefore require the general interleaving of communications
295: and calculations shown in Fig.~\ref{fig:interleave}.
296:
297: \subsection{Partitions, bundles, haloes and the outline
298: multiplication scheme}
299: \label{sec:partitions}
300:
301: The primary sets of atoms should be chosen to be spatially
302: compact: the atoms should all be near each other. The reason for
303: this is that compactness of the primary set helps to
304: reduce the number of atoms $k$ entering equation~(\ref{eqn:mult}) for any
305: given node, so that fewer $B_{k \xi , j \nu}$ elements have
306: to be communicated. Put another way, primary set compactness means that
307: each $B_{k \xi , j \nu}$ communicated is used more times.
308: Since the time spent in communications limits the parallel
309: scalability, compactness help scalability.
310:
311: However, it would limit flexibility if we took the primary sets
312: to be the smallest organisational unit of atoms. The size
313: of the primary sets depends on the number of processors used, and this
314: may depend on unpredictable circumstances. So we need a smaller and
315: more stable organisational unit. We call this a `partition'. A partition
316: is a small, spatially compact set of atoms that does not depend on the
317: number of processors. Each primary set consists of a spatially
318: compact bundle of partitions. Clearly, in assembling partitions
319: to make primary sets, a key consideration will be load balancing,
320: so we shall need each primary set to contain roughly the same
321: number of atoms. Load balancing will be discussed in more detail
322: in Sec.~\ref{sec:load}.
323:
324: Partitions could in principle be formed in many ways. At present,
325: the simulation cell used in the {\sc Conquest} code is required
326: to be orthorhombic (this requirement will, of course, be
327: removed in the future). We divide each of the three edges
328: of this cell using a uniform grid, and the orthorhombic subcells thus
329: formed are used as partitions. For some arrangements of atoms -- for
330: example, slabs of crystal surrounded by vacuum -- there may be
331: partitions that contain no atoms. This creates no problem,
332: though for good load balancing we shall require that no primary sets
333: should consist only of empty partitions.
334:
335: To discuss the outline multiplication scheme in terms of partitions,
336: we now introduce some terminology, which is illustrated schematically
337: in Fig.~\ref{fig:halo}. For each atom $i$ in the
338: primary set, there are atoms $k$ in the system
339: whose distance from $i$ is less
340: than $R_A$; we call these the $A$-neighbours of $i$. The atoms
341: $k$ which are $A$-neighbours of at least one atom in the {\em primary
342: set} form a set which we call the $A$-halo. We refer to atoms in this
343: set as the $A$-halo atoms. The set of {\em partitions} containing at least one
344: $A$-halo {\em atom} are called the $A$-halo {\em partitions}. Finally,
345: the set of nodes responsible for at least one $A$-halo partition are
346: called the $A$-halo {\em nodes}.
347:
348: The multiplication scheme can now be expressed thus: We loop over
349: $A$-halo nodes. For each such node, we loop over groups of atoms $k$,
350: communicate the $B_{k \xi , j \nu}$ elements for atoms in the group,
351: perform the corresponding multiplications, and accumulate onto
352: the $C$-matrix array. The groups of atoms $k$ will now be identified as
353: $A$-halo partitions, so that the inner
354: loop in Fig.~\ref{fig:interleave} is a loop over the
355: $A$-halo partitions on the given $A$-halo node.
356: Partitions therefore serve two purposes: first, they aid in the
357: flexible construction of primary sets; second, they provide
358: the sub-units needed in the communication of $B$-matrix data.
359:
360: A number of issues are raised by the partition scheme. We must
361: consider the size of $B$-data packets to be communicated, and how to avoid
362: fetching unwanted $B$-data. This question is intimately
363: related to the choice and structure of the multiplication
364: kernel. It is also related to the question
365: of the indexing needed so that the multiplication kernel
366: knows which atom
367: pairs are associated with each piece of $B$-data. Finally, we must
368: consider the indexing needed by the multiplication kernel in order to
369: accumulate $C$. Since the nature of the multiplication kernel lies at the
370: heart of these issues, we address this next.
371:
372: \subsection{Choice and structure of the multiplication kernel}
373: \label{sec:multkern}
374:
375: As explained above, the multiplication kernel is a small set
376: of operations repeatedly performed on each node, to build
377: up the rows of the product matrix $C_{i \mu , j \nu}$ for
378: which the node is responsible. More precisely, it is the
379: set of operations on which detailed coding effort will
380: be concentrated in order to maximise the efficiency. The choice
381: of kernel is therefore crucial.
382:
383: Of the choices that might be made, we can envisage two
384: extremes, which resemble those mentioned when we
385: discussed the interleaving of communications and
386: calculations (Sec.~\ref{sec:assumptions}). At one
387: extreme, we could choose the kernel to
388: consist of the multiplication of the $n \times n$ matrices associated
389: with each triplet of atoms $(i,k,j)$. (Recall that $n_1$, $n_2$,
390: $n_3$ are the numbers of indices on the atoms $i$, $j$, $k$; to
391: simplify the discussion, we refer to them without distinction as $n$.)
392: If we made this choice,
393: the kernel would sit inside a triple loop over $k$, $i$ and $j$. We refer
394: to this as the `fine' choice. At the other extreme is the `coarse' choice,
395: where the kernel consists of the entire triple loop over $k$,
396: $i$ and $j$, and everything inside it.
397: Neither extreme is likely to be satisfactory. We note that the fine choice
398: will ensure good reuse of data in registers, as discussed in
399: Ref.~\cite{Canning96}.
400: In linear-scaling calculations, the dimension $n$ will be rather
401: small; in treating $s$-$p$ semiconductors, for example, it will
402: often have the value 4, so that the data for a given triplet
403: will generally fit into registers and primary cache. However,
404: each data item is used only $n$ times, and the use of
405: secondary cache will be poor. On the other hand, the coarse extreme
406: is too undifferentiated. Since the full set of $A$, $B$ and $C$ data for all
407: $i$, $j$, $k$ will not fit into secondary cache in any practical
408: situation, there is no point in making this choice of kernel. Clearly,
409: the operations should be broken into small groups, but these
410: should not be as small as $(i,k,j)$ triplets.
411:
412: An obvious intermediate choice of kernel is to take it as the set
413: of operations associated with a given $k$. For a given $k$, the label $i$
414: goes over all $A$-neighbours of $k$ in the primary set, and $j$
415: goes over all $B$-neighbours of $k$ in the system. This will be
416: excellent for cache reuse of $B_{k \xi , j \nu}$, since
417: $j$ will run over $B$-neighbours of $k$ in the sequence in which they are
418: stored. The same thing will happen for $A$, provided the
419: $A_{i \mu , k \xi}$ elements are stored by blocks associated with
420: a given $k$ -- effectively this means that we store the
421: `local transpose' of $A$. If we denote the number of $B$-neighbours
422: of $k$ by $N_k^B$ and the number of primary $A$-neighbours of $k$
423: by $M_k^A$, then potentially every element of $A$ in cache can be
424: used $n N_k^B$ times, and every element of $B$ in cache can
425: be used $n M_k^A$ times.
426:
427: However, this still disregards cache reuse of $C$. In order
428: to help this, we should take the kernel to be the set of operations
429: associated with a small group of atoms $k$, and this group
430: must clearly be spatially compact. We identify this group as the
431: partition. A great advantage of this choice of kernel is that it fits
432: naturally with our proposed interleaving scheme for communications
433: (Sec.~\ref{sec:assumptions}). The
434: overall scheme is now that each node loops over
435: its $A$-halo partitions. For each partition, $B$-data is communicated
436: for the entire partition. The multiplication kernel then
437: performs all the multiply-and-adds and the accumulation of $C$ for
438: $A$-halo atoms $k$ in the current partition.
439:
440: We note in passing how the partition scheme helps efficient
441: data transfer from memory. The $j$ atoms that we loop over are
442: grouped into the spatially compact partitions used to organise the
443: storage of $B$. Since the $C$-matrix is also stored with the $j$-index
444: grouped into these partitions, and since superscalar processors
445: generally transfer data from main memory into cache in contiguous
446: sets (cache-lines)
447: this should considerably enhance the chances of finding
448: $C$-matrix elements in cache when they are needed. The underlying
449: thought here is that the partition scheme helps to improve
450: `data locality', as discussed by Goedecker~\cite{Goedecker00}.
451:
452: We implied just now that, when $B$-data is fetched from the
453: remote node, we communicate the $B_{k \xi , j \nu}$ for {\em all}
454: $k$ in the current $A$-halo partition. Since these $k$ are not
455: necessarily all in the $A$-halo, this means that in general not
456: all $B_{k \xi , j \nu}$ communicated are actually used, so that
457: there is a waste of communications. We do this, since we believe
458: that it is better to tolerate the waste than to limit
459: communications strictly to the $B_{k \xi , j \nu}$ needed. There
460: are two reasons for this: first, effort would need to be put into
461: generation of the indexing needed; second, the length of the
462: packets communicated would be reduced, so that latency would
463: slow the calculations.
464:
465: In order to implement these ideas, we
466: must now consider the two kinds of indexing
467: needed: first, the indexing needed to identify the atom pairs
468: associated with the $B$-data; second, the indexing needed to accumulate
469: the $C$-matrix. In order to develop an overview of the indexing
470: question, we first discuss schemes for labelling atoms.
471:
472: \subsection{The labelling of atoms}
473: \label{sec:label}
474:
475: \subsubsection{General ideas about labelling}
476: \label{sec:genlabel}
477:
478: With the partition scheme outlined above, matrix multiplication
479: in a parallel linear-scaling code will need at least
480: the following five ways of labelling atoms:
481:
482: \begin{enumerate}
483: \item
484: {\bf Global labelling}. As the atoms move around, and responsibilities
485: for atoms are passed from one node to another,
486: every atom in the entire system needs its own
487: unique label. We call this scheme `global labelling'. The natural way
488: is to label the atoms from 1 to $N_{\rm tot}$, where
489: $N_{\rm tot}$ is the total number of atoms.
490:
491: \item
492: {\bf Partition labelling}. Since atoms are organised into partitions,
493: we can refer to them by saying `the $n$th atom in the $p$th partition'.
494: To ensure that all nodes use the same scheme, we adopt a standard
495: order for enumerating the partitions of the entire system, and a
496: standard order for the atoms in each partition -- the latter can be
497: in increasing order of global label, for example. We refer to this
498: scheme as `partition labelling'.
499:
500: \item
501: {\bf Local labelling}. Each node needs to know only about a subset of
502: atoms: the atoms in its primary set, plus the atoms on other
503: nodes that are in range of the matrices it has to deal with.
504: Strictly speaking, it needs to know only about atoms in the
505: halo associated with the largest cut-off radius. But in order
506: to construct the haloes, it will need to find out about a somewhat
507: larger set, which we call the `grand covering set' (GCS); the number of atoms
508: in this set is denoted by $N_{\rm GCS}$. We label the atoms in the
509: GCS sequentially from 1 to $N_{\rm GCS}$, and call
510: this scheme `local labelling'.
511:
512: \item
513: {\bf Halo labelling}. When looping over atoms in a particular
514: halo -- for example the $A$-halo when we do matrix multiplication --
515: it will be convenient to label the halo atoms sequentially. We refer
516: to this as `halo labelling'.
517:
518: \item
519: {\bf Neighbour labelling}. In order to store matrix elements
520: economically, we shall want to store only those elements
521: $X_{i \mu , j \nu}$ etc. for which the distance between
522: atoms $i$ and $j$ is less than the cut-off $R_X$, since other elements
523: vanish by definition. For each primary atom $i$, we shall run sequentially
524: through $X$-neighbours $j$. This scheme for referring to neighbours
525: $j$ is called `neighbour labelling'. Of course, the label given
526: to an atom $j$ in this scheme depends on the primary atom $i$.
527: \end{enumerate}
528:
529: Note that in these general schemes there is much freedom in the
530: order of the labelling. For example, in partition labelling, we can
531: list the partitions in different orders. Similarly, in local, halo
532: and neighbour labelling, we can list the atoms in different orders.
533: We shall want to design the storage patterns so as to optimise
534: the efficiency of the multiply-and-add operations, and so as
535: to facilitate transcription between the different labellings. To see
536: how this works in practice, we now study how to pass information
537: between nodes and how to determine the storage locations
538: in the $C$-array when accumulating the results of multiply-and-adds.
539:
540: \subsubsection{Passing information between nodes}
541: \label{sec:passing}
542:
543: When the matrix elements $B_{k \xi , j \nu}$ are passed
544: between nodes, information about the identity of the $B$-neighbours $j$
545: of each atom $k$ must also be passed, so that the receiving node can
546: determine where to accumulate the elements of the product
547: $C_{i \mu , j \nu}$. We have to ask what kind
548: of labelling should be used to send the identity of $j$.
549:
550: Global labelling would clearly express the identity of $j$ unambiguously,
551: but there is an objection to using this. The receiving node will need
552: to transcribe to the labelling it uses to refer to the storage
553: locations of $C$, which is essentially $C$-neighbour labelling. While
554: each node can transcribe from its own local, halo or neighbour
555: labelling to global labelling using an amount of storage that is
556: independent of the size of the system, transcription from global labelling
557: to the other kinds demands a storage proportional to the size of
558: the sytem, and this violates the requirement of linear-scaling
559: behaviour. This may not always be a problem in practice,
560: but it is certainly objectionable in principle.
561:
562: Partition labelling offers a simple way to overcome this. For each
563: $B$-neighbour $j$, we specify its partition and its
564: sequence number in the partition, with the partition identified
565: by giving its offset in the three Cartesian directions relative
566: to the partition containing $k$. The receiving node then uses
567: the partition offset and sequence number to transcribe to its own
568: local label for each $j$. The separate indexing needed by the receiving
569: node to determine the storage location of $C$-elements is described
570: next.
571:
572: \subsubsection{Determining storage locations of $C$}
573: \label{sec:Clocations}
574:
575: To achieve the best speed, one would use a pre-calculated index array
576: to determine the storage location of each matrix element $C_{i \mu , j
577: \nu}$. As we run sequentially through the $B$-neighbours $j$ of each
578: atom $k$ for a given primary atom $i$, we would look up the storage
579: location where this contribution has to be accumulated onto the
580: $C$-array. This is what was done in an earlier version of the
581: {\sc Conquest} matrix-multiplication scheme~\cite{Goringe97}. The major
582: objection to this is that the atom $j$ in $B_{k \xi , j \nu}$ is most
583: easily specified in neighbour labelling, in which case the
584: pre-calculated index array will depend on a triplet of labels $(i, k,
585: j)$, so that the storage requirements of the index become exorbitant.
586:
587: Our answer to this is to abandon the use of the pre-calculated index,
588: and instead to determine the storage locations of $C$-elements at run
589: time. As explained in Sec.~\ref{sec:passing}, it is straightforward to
590: calculate the local label of $j$ for each element $B_{k \xi , j \nu}$,
591: provided we pass the offset of the partition containing $j$ relative
592: to the partition containing $k$ in three Cartesian directions. Our
593: procedure is then to transcribe from this local label to the $C$-halo
594: label of $j$ at run-time in the multiplication kernel, and then to
595: look up the address of each $C_{i \mu , j \nu}$ in a pre-calculated
596: index array which takes as input the primary label $i$ and the
597: $C$-halo label $j$. Since this index array depends on only two labels,
598: its storage requirements are moderate.
599:
600: \subsection{Detailed coding of the multiplication kernel}
601: \label{sec:coding}
602:
603: The principles outlined above are embodied in the F90 coding of the
604: multiplication kernel displayed in Fig.~\ref{fig:kernel}. This
605: calculates and accumulates the contributions to the $C$-matrix from
606: $A$-halo atoms $k$ in a given $A$-halo partition $K$, the latter being
607: specified in the code by the variable {\tt kpart}. Note that the outer
608: loop in the kernel goes only over $A$-halo atoms $k$ in $K$ (the
609: number of these being {\tt ahalo\%nh\_part(kpart)}), and not over {\em
610: all} atoms in $K$. The code inside the outer loop consists of three
611: sections: the three lines of the first section set up useful variables
612: associated with atom $k$; the purpose of the second section,
613: consisting of a single loop over atoms $j$, is to construct the table
614: {\tt jbnab2ch(j)} which transcribes from $B$-neighbour labelling to
615: $C$-halo labelling of the $B$-neighbour atom $j$; the third section is
616: a double loop over primary $A$-neighbours $i$ and $B$-neighbours $j$
617: of the atom $k$, within which the multiplication and accumulation are
618: performed. We now comment on the three sections.
619:
620: In the first section, {\tt k\_in\_halo} is the $A$-halo label of
621: atom $k$; the array {\tt ahalo\%j\_beg(kpart)} gives us the
622: $A$-halo label of the first $A$-halo atom in the partition $K$.
623: The variable {\tt k\_in\_part} is the sequence number of atom $k$ in $K$;
624: this means the number in the sequence when {\em all} atoms in $K$ are
625: counted, and not just $A$-halo atoms. Finally, {\tt nbkbeg} is the address
626: where $B$-neighbours of atom $k$ start in the array {\tt b()}
627: holding rows of the $B$-matrix for the current partition $K$.
628:
629: In the second section of the kernel, we loop over all $B$-neighbours
630: of the current atom $k$. The variables {\tt jpart}
631: and {\tt jseq} in this loop are the partition number and the
632: sequence number within its partition of neighbour $j$. (The
633: variable {\tt k\_off} is connected with periodic boundary conditions,
634: and will be explained in Sec.~\ref{sec:pbc}).
635: The two quantities {\tt jpart}
636: and {\tt jseq} together consistute the `local' label of atom $j$
637: in the partition labelling scheme. The array {\tt jbnab2ch(j)}
638: transcribes from the $B$-neighbour label $j$ to the $C$-halo
639: sequence number, and plays a key role in what follows.
640:
641: The outer loop in the third section goes over the
642: {\tt at\%n\_hnab(k\_in\_halo)} primary-set $A$-neighbours
643: $i$ of $k$, {\tt nabeg} is the address in the array {\tt a()}
644: where the corresponding element of $A$ is stored, {\tt i\_in\_prim}
645: is the sequence number of $i$ in the primary set, and {\tt icad}
646: is a starting address needed later. The inner loop goes over all the
647: {\tt nbnab(k\_in\_part)} $B$-neighbours of $k$, and {\tt nbbeg}
648: is the address in the array {\tt b()} holding elements of $B$.
649: We then use the table {\tt jbnab2ch(j)} constructed in the second
650: section to transcribe from $j$ to the $C$-halo label. The
651: pre-calculated table {\tt chalo\%i\_h2d(icad+j\_in\_halo)}
652: is then used to look up the address in {\tt c()} where the product
653: will be stored. The {\tt if} statements that test the values
654: of {\tt j\_in\_halo} and {\tt ncbeg} are redundant in the maximal
655: case $R_C \ge R_A + R_B$, since they can never be satisfied,
656: but they are needed for the general case, as explained later.
657: The triple loop over {\tt n3}, {\tt n1} and {\tt n2}
658: that performs the multiplication of $n \times n$ matrices
659: for each atom triplet $(i,k,j)$ is self-explanatory.
660:
661: \subsection{Periodic boundary conditions}
662: \label{sec:pbc}
663:
664: With periodic boundary conditions (pbc), the system we study is an
665: infinite Bravais lattice with a (generally large) basis. The
666: system is invariant under translation by an Bravais lattice vector.
667: For book-keeping purposes, we regard one of the primitive unit
668: cells of the Bravais lattice as the `fundamental' cell. The
669: elements of every matrix $X_{i \mu , j \nu}$ are stored for all
670: $i$ in the fundamental cell. Now the cut-off radius $R_X$ of any
671: matrix is not to be constrained in any way by the dimensions
672: of the repeating cell, so that the atoms $j$ for which
673: $X_{i \mu , j \nu}$ is non-zero can be in the fundamental
674: cell or in any image of this cell. For given $i$, there can be
675: different atoms $j$, $j^\prime$ having non-zero values of
676: $X_{i \alpha , j \beta}$ which are periodic images of each other.
677:
678: The use of pbc does not disturb the code structure already
679: established, provided one adopts the right viewpoint. Every node
680: knows about atoms in its primary set, and also about atoms in
681: its grand covering set (GCS), which contains all the haloes that it needs.
682: All atoms $i$ in its primary set can be regarded as belonging
683: to the fundamental cell, so that none is a periodic image of any
684: other. But atoms in its GCS may be periodic images of each
685: other. For most purposes, all the atoms in the GCS and
686: in all the haloes are regarded as completely distinct, irrespective
687: of whether they are periodic images or not.
688:
689: To ensure efficient and correct operation of the code with pbc, two
690: steps must be taken. First, as we loop over $A$-halo partitions $K$,
691: we note that for two such partitions that are periodically
692: equivalent, the sets of $B_{k \xi , j \nu}$ data for the
693: two partitions are identical, and we clearly do not
694: wish to communicate the same data more than once. We avoid this
695: by insisting that the order in which $A$-halo partitions
696: are passed through is such that any partition and its periodic
697: images in the halo are grouped together contiguously. Then
698: for each $A$-halo partition, we check whether it is periodically
699: equivalent to the previous one; if it is, then nothing is
700: communicated from the remote node, and we use the $B$-data most
701: recently communicated. The loop over $A$-halo partitions is outside
702: the multiplication kernel, so this does not affect the code shown
703: in Fig.~\ref{fig:kernel}.
704:
705: The second step needed is to take account of the periodic offsets when
706: calculating the storage locations where elements of the $C$-matrix are
707: accumulated in the multiplication kernel. The $B$-matrix data supplied
708: to the kernel include a label identifying the partition to which each
709: $B$-neighbour atom $j$ belongs: this is given in the array {\tt
710: ibpart()} -- see Fig.~\ref{fig:kernel}. But this label must be
711: modified to account for the periodic offset of partition $K$.
712: Provided we use a suitable labelling of partitions in each node's GCS,
713: the modification consists of adding a single offset parameter {\tt
714: k\_off} to the partition label, this parameter being passed as an
715: argument to the multiplication kernel. This addition to the first
716: statement in the $j$-loop (see Fig.~\ref{fig:kernel}) is the only
717: place in the kernel that pbc appear explicitly.
718:
719: \subsection{Practical construction of indexing arrays}
720: \label{sec:indexing}
721:
722: The multiplication kernel needs certain indexing arrays and
723: transcription tables, and we outline briefly the principles
724: used in constructing these.
725:
726: In making all the labelling, a key role is played by the grand
727: covering set (GCS) mentioned above. In general,
728: a `covering set' is a set of partitions that contains all the
729: neighbours of an atom, or the primary-set halo, associated with a
730: given cut-off radius; it is used in constructing lists of neighbours
731: and halo atoms. The GCS is the largest such covering set, so that it can
732: be used for making all the neighbour and halo lists. Each node
733: has its own GCS. It is convenient to choose the GCS to be orthorhombic
734: in shape, because this simplifies the labelling. For any given
735: cut-off radius $R_X$, the node constructs its neighbour lists
736: by passing sequentially through all atoms in its primary set
737: and all atoms in its GCS, calculating the interatomic distances,
738: comparing with $R_X$, and making a record of the neighbours. This record
739: is then used to make the corresponding halo list.
740:
741: The ordering of the lists is important, so we need to consider the
742: order for enumerating GCS partitions. In fact, we need to consider
743: two different orderings. In the first, we label GCS partitions
744: using the obvious Cartesian system. The GCS is orthorhombic and
745: the numbers of partitions along its three edges are called
746: $L_1$, $L_2$, $L_3$, so that any partition can be labelled by the
747: triplet $(l_1, l_2, l_3)$, with $1 \le l_s \le L_s$ $(s = 1,2,3)$.
748: Then we define the label $\lambda \equiv ( l_1 - 1 ) L_2 L_3 +
749: ( l_2 - 1 ) L_3 + l_3$, which we call the `Cartesian composite' (CC)
750: label.
751:
752: CC ordering is not appropriate for enumerating $A$-halo partitions
753: when fetching $B$-data from remote nodes, because for this we should
754: group partitions together by their home node and (if relevant)
755: according to periodic images (see Sec.~\ref{sec:pbc}). To make a
756: suitable ordering for this purpose, we adopt a standard order for
757: enumerating nodes, and a standard order for primary partitions on each
758: node. Then a node enumerates the partitions of its GCS by ordering
759: them with respect to: (i) periodic images; (ii) partitions on each
760: node; (iii) node. It passes most rapidly through periodic images, so
761: that all images of a given partition are listed contiguously; next
762: most rapidly through partitions on each node, so that all partitions
763: on each node are contiguous; and least rapidly through nodes. It is
764: convenient to start with the local node itself, then to go to the next
765: node occurring in the GCS in increasing order of the standard node
766: enumeration, wrapping round in this enumeration as necessary. We call
767: this ordering of GCS partitions node-order periodic-grouped (NOPG).
768: This is the GCS ordering used for forming the $A$-halo, so that when
769: we pass through $A$-halo partitions we are implicitly performing a
770: triple loop over halo-nodes, halo-partitions on each node, and
771: periodic images of each halo-partition -- the latter being needed to
772: avoid unnecessary communication of $B$-data, as explained in
773: Sec.~\ref{sec:pbc}.
774:
775: \subsection{The minimal case}
776: \label{sec:minimal}
777:
778: So far, we have restricted outselves to the `maximal' case
779: $R_C \ge R_A + R_B$.
780: But, of course, we need to handle general cut-off radii.
781: As a first step towards doing this, we examine now the
782: case $R_C \le \mid R_A - R_B \mid$, which we call the `minimal'
783: case. We shall show that a trivial modification of the code already
784: presented for the maximal case gives an efficient code
785: for the minimal case. To see this, we must first outline
786: an important general principle.
787:
788: For the superscalar machines that are our main interest, the speed is
789: limited mainly by transfers between memory and registers {\em via} the
790: cache: the operations performed on the data once it is in registers
791: have little effect on the speed\cite{Goedecker00}. So let us focus on
792: storage and transfer patterns, and ignore the register operations. In
793: the maximal case, one set of data (the $A_{i \mu , k \xi}$ matrix
794: elements) are in local memory and transfer occurs in a regular and
795: predictable way; data that behave like this will be called $X$. A
796: second set of data (the $B_{k \xi , j \nu}$ elements) is fetched from
797: remote nodes, but again is transferred in a regular way; this kind of
798: data will be called $Y$. Finally, a third set of data (the $C_{i \mu
799: , j \nu}$ elements) is in local memory, but its transfer occurs in an
800: irregular way and addressing information has to be calculated for it
801: at run time. Data like this will be called $Z$. The appropriate data
802: storage and transfer patterns for the minimal case are obtained by
803: identifying $C$ as $X$, $\tilde{B}$ as $Y$ and $A$ as $Z$, where
804: $\tilde{B}$ denotes the transpose of $B$ -- whereas in the maximal
805: case $A$ was $X$, $B$ was $Y$ and $C$ was $Z$. In the minimal case the
806: multiply-and-adds needed are $X = Z \cdot Y$ -- whereas in the maximal
807: case we did $Z = X \cdot Y$. As far as the multiplication kernel is
808: concerned, it is transferring certain patterns of data from memory and
809: doing certain things with it in registers. Provided the data passed to
810: the kernel and the operations performed with it in registers are
811: changed appropriately, the maximal and minimal cases are exactly
812: equivalent.
813:
814: It follows from this that the multiplication kernel for the minimal
815: case is coded exactly as in Fig.~\ref{fig:kernel}, except that the
816: statement within the loops over {\tt n2}, {\tt n1} and {\tt n3} must
817: be replaced by the following statement:
818: \begin{verbatim}
819: a(n3,n1,nabeg)=a(n3,n1,nabeg)+&
820: c(n1,n2,ncbeg)*b(n3,n2,nbbeg)
821: \end{verbatim}
822: The structure of all the indexing arrays remains exactly as before.
823: Note that in the minimal case, as in the maximal case, no
824: tests of cut-off distances are needed, and the {\tt if} statements
825: in the kernel are redundant.
826:
827: \subsection{General cut-off radii}
828: \label{sec:cutoff}
829:
830: To treat general cut-off radii, we consider first the case $R_C < R_A
831: + R_B$, where $R_C$ is only a little smaller than the sum of $R_A$ and
832: $R_B$. This case is efficiently handled by using the code for the
833: maximal case (see Sec.~\ref{sec:coding}), with the $j$-loop going over
834: all $B$-neighbours of $k$, but with an {\tt if}-statement inserted so
835: that the multiply-and-adds are not done if $j$ is not a $C$-neighbour
836: of $i$. Needless to say, there must not be any question of calculating
837: interatomic distances in the multiplication kernel itself. However,
838: using the arrays which transcribe from local labelling to $C$-halo
839: labelling, and which give the storage address in the $C$-array for
840: each ($i$,$j$) pair, we can determine easily whether $j$ is a
841: $C$-neighbour of $i$. The convention we use is that the appropriate
842: array elements have the value zero if the atom $j$ is not in the
843: $C$-halo or is not a $C$-neighbour; the test is then applied
844: by the two {\tt if} statements in
845: Fig.~\ref{fig:kernel}. We call this way of treating the case $R_C <
846: R_A + R_B$ `weak reduction', the adjective `weak' referring to the
847: fact that it will work well only if $R_C$ does not fall too much below
848: $R_A + R_B$. This loss of efficiency caused by the wasted passes
849: through the {\tt do}-loop over $j$ is estimated in the Appendix.
850:
851: If $R_C$ is very small, it will be more efficient to perform the
852: multiplication by `weak extension', i.e. by using the kernel for the
853: minimal case, but with two {\tt if}-statements inserted {\it exactly}
854: as for the maximal kernel (using the symmetry between the two cases
855: identified above in Sec.~\ref{sec:minimal}). The effect of wasted
856: passes through the $j$-loop is estimated in the Appendix.
857:
858: Once the {\tt if}-statements have been put into the maximal or minimal
859: code, any case can be treated with either kernel. We will obtain correct
860: results even if we treat the minimal case using the maximal kernel,
861: or {\em vice versa}. Nevertheless, it is clearly more efficient
862: to treat weak reduction using the maximal kernel and weak extension
863: using the minimal kernel. But how, in general, can we tell whether
864: it is more efficient to use the maximal or minimal kernel? This
865: question is also addressed in the Appendix, where we show that the maximal
866: kernel should be used when $R_C > R_A$ and the minimal kernel
867: when $R_C < R_A$.
868:
869: \subsection{Load balancing}
870: \label{sec:load}
871:
872: However efficient the on-node operations and the communications may
873: be, the overall performance will be poor unless the work load is
874: balanced between processors, so that no processors are idle while
875: others are working. The partition scheme gives us a natural way to
876: achieve load balancing: our aim must be to group the partitions into
877: bundles, with each bundle containing the primary set given to a node,
878: in such a way that all nodes take roughly the same
879: computation-plus-communication time, with this time being as small as
880: possible. We plan to report in more detail elsewhere on load balancing
881: in the {\sc Conquest} code, and here we give only a brief summary of
882: our methods.
883:
884: Our approach is to derive an approximate formula for the `cost'
885: of the calculation, and to use simulated annealing to search for
886: a way of organising partitions into bundles which comes close
887: to minimising the cost. This search is done by a separate code
888: before the matrix multiplication code is run. Let $t_i^{\rm comp}$
889: and $t_i^{\rm comm}$ be the times taken by node $i$ to perform
890: on-node computation and inter-node communication, respectively,
891: and let $t_i \equiv t_i^{\rm comp} + t_i^{\rm comm}$
892: be the sum of the two. (We assume that computation and communication
893: cannot be overlapped, and that the former has to wait for the latter.)
894: Then we have to minimise the `cost function' $T$
895: given by $T = \max_i ( t_i )$; this quantity $T$ is the
896: largest of all the $t_i$ values.
897:
898: We can associate a computation time $\tau_p$ with each partition $p$.
899: This time is proportional to the number of $i , k , j$ triplets for which
900: atom $i$ is in partition $p$. The proportionality constant can be
901: estimated from the compute rate of the processors and the cpu efficiency,
902: whose determination for any processor is described in Sec.~\ref{sec:tests}.
903: The value of $t_i^{\rm comp}$ is then the sum of $\tau_p$ over
904: the partitions in the bundle assigned to processor $i$. The value of
905: $t_i^{\rm comm}$ is found by dividing the number of
906: $B_{k \xi , j \nu}$ elements that node $i$ fetches by the
907: practical inter-node transfer rate, with an allowance for latency.
908:
909: We follow the usual methods of simulated annealing~\cite{NR}.
910: Let $\Gamma$
911: label a given grouping of partitions into bundles -- this consists
912: of a list of the partitions in every bundle -- and let $T_\Gamma$
913: be the value of $T$ for this grouping. Then we sample the groupings
914: $\Gamma$ in such a way that the probability of finding $\Gamma$
915: is proportional to a Boltzmann function
916: $\exp ( - T_\Gamma / \theta )$, where $\theta$ is a fictitious
917: temperature. The value of $\theta$ is systematically reduced towards
918: zero during the simulation, so that the only surviving $\Gamma$'s
919: are those whose $T_\Gamma$'s are close to the minimum possible.
920: To sample the space of groupings $\Gamma$, we use an algorithm
921: to step from one $\Gamma$ to the next. As usual in simulated
922: annealing, the new grouping is accepted without question
923: if $T_\Gamma$ decreases; it is accepted with probability
924: $\exp ( - \Delta T_\Gamma / \theta )$ if $T_\Gamma$ increases,
925: where $\Delta T_\Gamma$ is the increase.
926:
927: The choice of algorithm for stepping from one $\Gamma$ to the
928: next is not trivial. A significant problem arises from the
929: rather curious definition of the cost function as the {\em maximum}
930: of $t_i$ values. As we search through groupings $\Gamma$,
931: the time $T_\Gamma$ changes only if we make reassignments of
932: partitions affecting the particular bundle having the maximum
933: $t_i$. For large numbers of processors, this can make equilibration
934: of the `thermal' ensemble very slow. The measures taken
935: to overcome this will be described elsewhere.
936:
937: \section{Coding considerations}
938: \label{sec:codcon}
939:
940: One of the key aims when coding \textsc{Conquest} was to ensure portability
941: as well as performance. Accordingly, the code was written in F90 with
942: all communications written using MPI1 calls (though, as explained later,
943: consideration was also given to rewriting critical communications with
944: high speed, system specific calls, such as {\tt shmem} on the Cray T3E).
945: F90 improves the readability and structure of the code through use of
946: modules and derived types, while retaining enough similarity to
947: \textsc{Fortran77} to be widely understood by the scientific community.
948:
949: The communications are key to the efficiency of the code. There are two
950: types of object to be communicated: the matrix elements themselves and the
951: accompanying indexing. Efficient communication of the matrix elements
952: presents no problems, since these form contiguous arrays, and are passed
953: in relatively large portions (all elements for all the
954: atoms in a partition $K$). The communication of the indexing, however,
955: raises other problems.
956:
957: The indexing for a matrix is all enclosed in a single derived type,
958: {\tt type(matrix)}, of which only part needs to be communicated. The
959: derived type is defined so that it contains all the data for all the atoms
960: in a partition $I$ (and therefore an array of these types is defined to
961: hold the data for a node's primary set). One possibility for the
962: communication of the indexing would be to use the MPI1 facilities for
963: registering a new type and to pass individual partitions in a single call
964: for that new type. However, the implementation of communication of defined
965: types within MPI1 leaves room for loss of efficiency, and another way was
966: found. Essentially, those indexing arrays which need to be communicated
967: are defined within the type as pointers, and point to different parts of
968: an integer array. Then the communication of all relevant indices for
969: a partition can be made via a single MPI call, passing a single, large
970: integer array. Pointers are then put into place at the receiving node
971: to access the indices. Both the matrix element arrays and this large
972: integer array are declared as global variables so that on the Cray
973: they will be symmetric, and {\tt shmem} calls can be used on them.
974:
975: While the code has been written within MPI1, the communication pattern
976: is naturally a one-sided one (since individual nodes require data
977: for partitions at disparate times). We have developed a
978: quasi-one-sided scheme within MPI1, as well as using truly one-sided
979: schemes where available (e.g. under MPI2, which is not yet
980: sufficiently widely supported to be truly portable, or under {\tt
981: shmem} on the Cray T3E or SGI Origin2000). At the start of the matrix
982: multiplication, each node issues a series of non-buffered, non-blocking
983: sends (assuming, quite reasonably, that no implementation of MPI1 would
984: buffer the {\tt MPI\_issend} call). As partition data is required, nodes
985: issue {\tt MPI\_recv} or {\tt MPI\_irecv} calls (as appropriate) to obtain
986: the data. This scheme works extremely well, and shows good performance
987: and scaling. It is easy to overlap communication and calculation by
988: issuing {\tt MPI\_irecv} calls ahead of time.
989:
990: \section{Practical tests}
991: \label{sec:tests}
992:
993: We want to test the scaling properties and the efficiency of the code.
994: As emphasised in our earlier tests on the {\sc Conquest}
995: code~\cite{Goringe97}, there are different types of scaling. For given
996: cut-offs $R_A$, $R_B$, $R_C$, the cpu and memory needed to do the
997: multiplication should be proportional to the number of atoms: this is
998: called `intrinsic scaling'. For a given calculation on a given number
999: of atoms, one hopes to find cpu and memory inversely proportional to
1000: the number of processors: `strong parallel scaling'. Finally, we can
1001: examine the cpu and memory as the numbers of atoms and processors are
1002: scaled up with the number of atoms per processor held fixed: `weak
1003: parallel scaling'. There is a relation between the three scaling
1004: types, so that we can assess the scaling behaviour completely by
1005: studying weak and strong parallel scaling, and we shall present
1006: results on this. Good scaling does not necessarily mean good
1007: efficiency. For example, if the on-node operations were very
1008: inefficient, communications could be rendered negligible, so that
1009: parallel scaling would appear very good, even though the underlying
1010: performance was poor. We shall therefore pay special attention to cpu
1011: efficiency.
1012:
1013: Our main tests are on completely random arrangements of atoms
1014: with periodic boundary conditions; the position
1015: of each atom in the cell is created by using a random-number generator
1016: to draw its $x$, $y$ and $z$ coordinates from a uniform
1017: probability distribution. We choose to use random positions
1018: rather than, for example, perfect-crystal positions, because the
1019: randomness should provide more of a challenge to load balancing,
1020: especially for small numbers of atoms per node. In order to make
1021: contact with the real world, we take a mean density of atoms
1022: equal to that of diamond-structure silicon, and we
1023: use values of $R_A$, $R_B$ and $R_C$ typical of those needed in real
1024: first-principles calculations with the the {\sc Conquest} code.
1025: The values of the matrix elements are irrelevant for these tests,
1026: since we are only concerned with the time taken to perform
1027: the operations; we have actually taken their values to be a simple
1028: radial function of interatomic separation. The number of indices
1029: on each atom is taken to be $n = 4$.
1030:
1031: We start by presenting results obtained on the Cray T3E-1200E
1032: using {\tt shmem} communications. We examine the maximal
1033: case ($R_C = R_A + R_B$) with $R_A = 8.46$~\AA, $R_B = 4.23$~\AA.
1034: This corresponds roughly to performing the multiplication
1035: $L \cdot S$ in {\sc Conquest} ($L$ is the auxiliary density
1036: matrix, $S$ the overlap matrix). Fig.~\ref{fig:Weak} shows results from our
1037: tests of weak parallel scaling with 80 atoms per processor
1038: and processor numbers going from 2 to 250 (numbers of
1039: atoms from 160 to 20,000). The simulation cell is always a cube
1040: in these tests, and the partitions are also taken to be cubic.
1041: The partitions always have the same size, and the average number of
1042: atoms per partition is 20. To obtain the timings, we put a
1043: timer in the code running on each node, and we report the maximum
1044: and minimum times, together with the average time. (The times
1045: include all communication, but they
1046: deliberately exclude the construction of the indexing
1047: arrays needed as input to the multiplication kernel; this is
1048: justified, since in practice these arrays are reconstructed
1049: only when the atoms move, whereas many matrix multiplications
1050: are done for each set of atomic positions.)
1051: The following points
1052: are noteworthy. First, the spread between minimum and maximum times
1053: is small, so that the load balancing is good. In practice,
1054: the time taken is governed by the time of the slowest node. Since
1055: the maximum time is only $\sim$~6.4~\% greater than the average,
1056: little would be gained by working harder on load balancing, at
1057: least for random atomic arrangements. Second, the times are essentially
1058: constant, the time on 250 nodes being only $\sim 4$~\% greater
1059: than that on 16 nodes. This means that the size of system
1060: that could be treated is limited only by the number of nodes
1061: available: there is essentially no limitation from the
1062: parallel behaviour of the code itself.
1063:
1064: Tests of strong parallel scaling performed for the same $R_A$ and
1065: $R_B$ values (maximal case) with a total of 4096 atoms in the whole
1066: cell are shown in Fig.~\ref{fig:Strong}, where we show the total time
1067: summed over the processors as function of processor
1068: number (which would be completely flat if the scaling were
1069: ideal). A useful way to assess the scaling is by comparing the times
1070: with Amdahl's law, which states that the time taken is $t = t_0 ( ( 1
1071: - x ) + x / N_{\rm proc} )$, where $N_{\rm proc}$ is the number of
1072: processors, with $x$ and $( 1 - x )$ the fractions of the calculation
1073: performed in parallel and serially. This formula fits the maximum
1074: times very well, and we find the value $x = 0.9986$, so that the
1075: serial fraction is 0.0014. This means that strong scaling is
1076: extremely good up to $\sim 200$ processors, or $\sim 20$ atoms
1077: per processor. Since it would be wasteful of memory to run {\sc
1078: Conquest} with fewer atoms per processor than this, the implication is
1079: that strong scaling will be very well satisfied in practical
1080: calculations.
1081:
1082: We now turn to the minimal case ($R_C \le \mid R_A - R_B \mid$). We
1083: argued in Sec.~\ref{sec:minimal} that the speed will be limited mainly
1084: by transfers to and from memory, rather than by operations in
1085: registers. This implies that, provided we use the minimal, rather than
1086: the maximal, multiplication kernel, the timings for the multiplication
1087: with $R_A = 12.69$~\AA, $R_B = 4.23$~\AA, $R_C = 8.46$~\AA\ should be
1088: almost the same as for $R_A = 8.46$~\AA, $R_B = 4.23$~\AA, $R_C =
1089: 12.69$~\AA. We have tested this explicitly by performing
1090: multiplication with $R_A = 12.69$~\AA, $R_B = 4.23$~\AA, $R_C =
1091: 8.46$~\AA\ on 4096 atoms with different numbers of processors. The
1092: results are also reported in Fig.~\ref{fig:Strong}. We see that the
1093: timings are very close to our results for the maximal case, though
1094: higher by $\sim 16$~\%. This means that all our conclusions about weak
1095: and strong scaling will be the same for the minimal case.
1096:
1097: We now turn to cpu efficiency. Here, we are solely concerned
1098: with the execution speed of the multiplication kernel itself,
1099: so that communication time is excluded. As explained in the
1100: Introduction, efficiency is defined in terms of the minimum
1101: number of operations required to perform the multiplications
1102: of eqn~(\ref{eqn:mult}). For a given node, this is the
1103: number of $(i,k,j)$ triplets for $i$ in that node's primary
1104: set, multiplied by $n_1 \times n_3 \times n_2$, multiplied
1105: by a factor of 2 to allow for the fact that we do a multiply
1106: and add for each $( i \mu , k \xi , j \nu )$. This minimum
1107: number of operations divided by the time spent in the multiplication
1108: kernel gives the `useful' Mflop rate, which is then divided
1109: by the peak Mflop rate of the processor to give the
1110: cpu efficiency.
1111:
1112: We have done efficiency tests on both perfect-crystal and random
1113: atomic positions. As an illustration, we report results on
1114: perfect-crystal silicon (lattice parameter~= 5.46~\AA) with cut-offs
1115: $R_A = 6$~\AA, $R_B = 10$~\AA, $R_C = 16$~\AA\ with 1728 atoms in the
1116: total system. The tests were done with 64 cubic partitions and 16
1117: processors on the Cray T3E-1200E, the partitions being distributed
1118: unevenly over the processors. The `useful' Mflop rate was found to be
1119: almost identical on all processors, the average being 101.4~Mflops,
1120: and minimum and maximum being 101.1 and 103.6~Mflops. The theoretical
1121: peak speed of the processor is 1.2~Gflops, so that the efficiency is
1122: 8.4~\%. This result is reasonably satisfactory in two ways. First, it
1123: shows that all the operations associated with transcription between
1124: labellings, the determination of storage addresses, etc cannot be
1125: taking a large amount of time (recall that these are not included in
1126: the number of `useful' operations); second, they show that fairly good
1127: use is being made of the cache.
1128:
1129: We have also performed tests on a number of other machines: a Beowulf
1130: linked by fast ethernet; a Beowulf linked by SCI interconnect; and a SGI
1131: Origin2000. We have tested weak scaling on
1132: all these systems to investigate the portability of the code, and
1133: the importance of efficient communications. The {\em increase} in
1134: time per node as the system size is increased is shown in
1135: Fig.~\ref{fig:systems}. We remark that, given perfect communication,
1136: the line ought to be flat at 1.0, and that to all intents and purposes
1137: this is true for the Origin2000 and T3E. The SCI beowulf system performs
1138: well and should be adequate for {\sc Conquest} calculations. The fast
1139: ethernet beowulf, however, shows far too large an increase.
1140:
1141: Finally, we assess briefly the implications of our new methods
1142: for practical calculations with the full {\sc Conquest} code.
1143: The present matrix multiplication code has been incorporated
1144: into {\sc Conquest}, and we have run tests in which the energy of
1145: the self-consistent ground state is calculated for
1146: silicon systems. In searching for the ground state, the
1147: basic step is the `self-consistency iteration', in which an
1148: input charge density is used to calculate the Kohn-Sham
1149: Hamiltonian matrix, and the techniques of Ref.~\cite{Bowler99} are
1150: used to determine the $L$-matrix that yields the non-self-consistent
1151: ground state for the current Hamiltonian, and hence the output
1152: charge density. We search for self-consistency using the
1153: GR-Pulay technique\cite{Bowler00b}.
1154:
1155: We report in Fig.~\ref{fig:Conquest} timings for a single
1156: self-consistency iteration obtained for crystalline silicon containing
1157: numbers of atoms going from 4096 to 12288 when the calculations are
1158: run on 512 processors of the T3E-1200E. For these tests, we used a
1159: norm-conserving non-local pseudopotential\cite{Kleinman82}. The
1160: support-region radius $R_{\rm reg}$ was equal to 2.715~\AA\ and the
1161: $L$-matrix cut-off, $R_L$ equal to 6.9~\AA\ (for definitions of these
1162: cut-offs, see Ref.~\cite{Hernandez96}); these values are typical of
1163: what would be used in practice. The Figure shows that the scaling with
1164: system size is so good that the time is actually a slightly sub-linear
1165: function of atom number. We believe that this is because, as the
1166: primary sets become larger, the number of nodes with which
1167: communication is required actually decreases, making the time required
1168: decrease. To put these results in context, we note that, starting
1169: from scratch, typically 10 self-consistency iterations are needed to
1170: reach the self-consistent ground state with acceptable accuracy for
1171: fixed support functions, and that a further factor of 10 should be
1172: allowed for variation of the support functions. This means that, with
1173: the methods presented here, the attainment of the full ground state
1174: with an accuracy comparable to that expected in a conventional
1175: plane-wave calculation can be accomplished for a 12,000-atom system in
1176: a few hours on a 512-processor T3E.
1177:
1178: \section{Discussion and conclusions}
1179: \label{sec:disc}
1180:
1181: We have shown that our code for parallel multiplication of
1182: locally sparse matrices achieves two important things:
1183: first, it displays almost perfect linear scaling with respect
1184: to numbers of atoms and processors; second, it achieves good
1185: cpu efficiency.
1186:
1187: We remark that we have every right to expect the linear scaling
1188: to be perfect when the numbers of processors and atoms
1189: are increased together, with a constant number of atoms per
1190: processor (weak scaling): for this type of scaling, the amounts
1191: of computation and communication should, indeed, remain
1192: constant. What is less trivial is the type of scaling in which
1193: the number of processors is increased with the total number of
1194: atoms held fixed (strong scaling). In contrast to
1195: some previous schemes, the present code shows excellent strong
1196: scaling, at least on machines having a fast communications
1197: network. Good cpu efficiency is also non-trivial.
1198:
1199: A key idea in the present scheme is the management of the atoms
1200: in small compact groups, referred to as `partitions'. Although
1201: this certainly makes the code more complex, it has the enormous
1202: advantage of giving an internal structure to the calculation,
1203: which can be used in a number of ways: it gives a flexible means
1204: of constructing the primary sets that are assigned to processors;
1205: it enables us to manage the communications by breaking them into
1206: packets of convenient size; it does the same for the on-node
1207: computations, so that we work with a convenient `multiplication kernel';
1208: it provides a labelling scheme which allows nodes to
1209: identify to each other the atoms for which data is communicated;
1210: finally, it aids efficient cache use by improving data locality.
1211: It is important to note that the present scheme of partitions
1212: and bundles for managing atoms
1213: strongly resembles the scheme of blocks and domains for managing
1214: integration-grid points in {\sc Conquest} described in our
1215: earlier work~\cite{Goringe97}. A further key concept in the present
1216: work is that of transcription between different atom-labelling
1217: schemes, which is crucial in reducing the computational
1218: overheads in the multiplication kernel.
1219:
1220: At the start of our analysis (Sec.~\ref{sec:assumptions}), we made
1221: certain assumptions about the storage of matrix elements
1222: and the way the matrix multiplication should be done, and we
1223: now ask whether anything could be gained by changing these
1224: assumptions. Our answer is that little is likely to be gained.
1225: With the scheme we have presented, the scaling properties are already
1226: close to ideal, and there is little room for improvement. It is
1227: true that our cpu efficiency is only $\sim 10$~\%, and there
1228: is clearly scope for improvement here. However, one should
1229: remember that efficiencies as high as 50~\% on superscalar
1230: processors are achievable only in the multiplication
1231: of large non-sparse matrices, and that serious sparsity is
1232: bound to incur a serious loss of efficiency. Our suspicion
1233: is that an improvement of the cpu efficiency by as much as a factor
1234: of two is unlikely, and that any such improvements are more likely
1235: to come from work on the multiplication kernel than by changing
1236: the basic assumptions. Time will tell.
1237:
1238: \section*{Acknowledgments}
1239: \label{sec:ack}
1240:
1241: We would like to thank Dr.~S.~Goedecker for allowing us to read an
1242: early copy of Ref.~\cite{Goedecker00}.
1243: The work of DRB was supported by EPSRC grant GR/M01753 and by an EPSRC
1244: Postdoctoral Fellowship in Theoretical Physics.
1245: The position of MJG is partially
1246: supported by GEC and CLRC Daresbury Laboratory. Calculations on the
1247: Cray T3E at CSAR and on the Origin 2000 at the UCL HiPerSPACE Centre
1248: were supported by grants GR/M01753 and JR98UCGI. We are indebted to
1249: Dr. S.~Pickles and colleagues at CSAR for their help in optimising the
1250: code. Useful discussions with Dr.~I.~Bush in the early stages of the
1251: work are acknowledged. Assistance from Dr.~B.~Searle, Mr.~R.~Harker
1252: and Mr.~A.~Keller with tests on beowulf systems is also acknowledged.
1253: \bigskip
1254:
1255: \begin{thebibliography}{99}
1256: \bibitem{Goedecker99}S.Goedecker, Rev. Mod. Phys. {\bf 71}, 1085 (1999).
1257: \bibitem{Pettifor89}D.G.~Pettifor, Phys. Rev. Lett. {\bf 63}, 2480
1258: (1989).
1259: \bibitem{Yang91}W.~Yang, Phys. Rev. Lett. {\bf 66}, 1438 (1991).
1260: \bibitem{Galli92}G.~Galli and M.~Parrinello, Phys. Rev. Lett. {\bf 69},
1261: 3547 (1992).
1262: \bibitem{Li93}X.-P.~Li, R.W.~Nunes and D.~Vanderbilt, Phys. Rev. B {\bf 47},
1263: 10891 (1993).
1264: \bibitem{Daw93}M.S.~Daw, Phys. Rev. B {\bf 47}, 10895 (1993).
1265: \bibitem{Ordejon93}P.~Ordej\'on, D.~Drabold, M.~Grumbach
1266: and R.M.~Martin, Phys. Rev. B {\bf 48}, 14646 (1993).
1267: \bibitem{Aoki93}M.~Aoki, Phys. Rev. Lett. {\bf 71}, 3842 (1993).
1268: \bibitem{Mauri93}F.~Mauri, G.~Galli and R.~Car, Phys. Rev. B {\bf 47},
1269: 9973 (1993).
1270: \bibitem{Goedecker94} S. Goedecker and L. Colombo, Phys. Rev. Lett.
1271: {\bf 73}, 122 (1994).
1272: \bibitem{Kress95}J.D.~Kress and A.F.~Voter,
1273: Phys. Rev. B {\bf 53}, 12733 (1996).
1274: \bibitem{Kohn96}W.~Kohn, Phys. Rev. Lett. {\bf 76}, 3168 (1996)
1275: \bibitem{Horsfield96}A.P.~Horsfield, Mat. Sci. Engin. B {\bf 37},
1276: 219 (1996).
1277: \bibitem{Bowler97}D.R.~Bowler, M.~Aoki, C.M.~Goringe, A.P.~Horsfield
1278: and D.G.~Pettifor, Modell. Simul. Mater. Sci. Eng. {\bf 5}, 199 (1997).
1279: \bibitem{Stechel94}E.B.~Stechel, A.R.~Williams
1280: and P.J.~Feibelman, Phys. Rev. B {\bf 49}, 10088 (1994).
1281: \bibitem{Hernandez95}E.H.~Hern\'{a}ndez and M.J.~Gillan, Phys. Rev. B
1282: {\bf 51}, 10157 (1995).
1283: \bibitem{Haynes97}
1284: P.D.~Haynes and M.C.~Payne, Comput. Phys. Commun. {\bf 102},
1285: 17 (1997).
1286: \bibitem{Schweg96}E.Schwegler and M.Challacombe, J.Chem. Phys. {\bf 105}, 2726
1287: (1996).
1288: \bibitem{Burant96}J.C.Burant, G.E.Scuseria and M.J.Frisch, J. Chem. Phys. {\bf
1289: 105}, 8689 (1996).
1290: \bibitem{Ochsen98}C.Ochsenfeld, C.A.White and M.Head-Gordon, J. Chem. Phys.
1291: {\bf 109}, 1663 (1998).
1292: \bibitem{Challa00}M.~Challacombe, Comput. Phys. Commun. {\bf 128}, 93 (2000).
1293: \bibitem{Goringe97}C.M.~Goringe, E.H.~Hern\'andez, M.J.~Gillan and
1294: I.J.~Bush, Comput. Phys. Commun. {\bf 102}, 1 (1997).
1295: \bibitem{Bowler00}D.R.~Bowler, I.J.~Bush and M.J.~Gillan, Int. J. Quant.
1296: Chem. {\bf 77}, 831 (2000).
1297: \bibitem{Hernandez96}E.~Hern\'{a}ndez, M.J.~Gillan and C.M.~Goringe,
1298: Phys. Rev. B {\bf 53}, 7147 (1996).
1299: \bibitem{Mauri94}F.~Mauri and G.~Galli, Phys. Rev. B {\bf 50}, 4316 (1994).
1300: \bibitem{Ordejon95}P.~Ord\'ejon, D.A.~Drabold, R.M.~Martin and M.P.~Grumbach,
1301: Phys. Rev. B {\bf 51}, 1456 (1995).
1302: \bibitem{Itoh95}S.~Itoh, P.~Ordej\'on and R.M.~Martin, Comput. Phys. Commun.
1303: {\bf 88}, 173 (1995).
1304: \bibitem{Canning96}A.~Canning, G.~Galli, F.~Mauri,A.~De~Vita and R.~Car,
1305: Comput. Phys. Commun. {\bf 94}, 89 (1996).
1306: \bibitem{Wang96}C.Z.~Wang, S.Y.~Qiu and K.M.~Ho, Comput. Mat. Sci. {\bf 7},
1307: 315 (1996).
1308: \bibitem{Kress98}J.D.~Kress, S.~Goedecker, A.~Hoisie, H.~Wassermann,
1309: O.~Lubeck, L.~Collins and B.~Holian, J. Comput. Aided Mat. Des. {\bf 5},
1310: 295 (1998).
1311: \bibitem{NR}W.H.Press, S.A.Teukolsky, W.T.Vetterling and B.P.Flannery,
1312: Numerical Recipes (Cambridge University Press, Cambridge, 1992).
1313: \bibitem{Bowler99}D.R.~Bowler and M.J.~Gillan, Comput. Phys. Commun.
1314: {\bf 120}, 95 (1999).
1315: \bibitem{Goedecker00}S.Goedecker and A.Hoisie, Achieving High
1316: Performance in Numerical Computations on Modern Computer Architecures
1317: (SIAM, in press, 2000).
1318: \bibitem{Bowler00b}D.R.~Bowler and M.J.~Gillan, Chem. Phys. Lett. {\bf 325},
1319: 473 (2000).
1320: \bibitem{Kleinman82} L.Kleinman and D.M.Bylander, Phys. Rev. Lett.
1321: {\bf 44}, 1425 (1982).
1322: \end{thebibliography}
1323:
1324: \section*{Appendix: Reduction and extension, maximal
1325: and minimal kernels}
1326:
1327: As explained in the text, different multiplication kernels
1328: should be used for large and small values of the cut-off
1329: radius $R_C$ of the product matrix $C$: for large and small $R_C$, we
1330: should use the maximal and minimal kernels respectively.
1331: In this Appendix, we discuss how to decide the cross-over
1332: point from the maximal to the minimal case
1333: as the radii $R_A$, $R_B$ and $R_C$ are varied.
1334:
1335: If we have particular values of $R_A$, $R_B$ and $R_C$,
1336: then for each atom $i$ in a node's primary set, there is a certain
1337: number of pairs $(k,j)$ for which multiply-and-adds have to
1338: be done. Our algorithm for deciding the cross-over point will
1339: be based on estimates for this number of pairs, which
1340: we denote by $\nu_{\rm p}$. If we treat the problem as weak reduction,
1341: then we ignore the constraint supplied by $R_C$, and instead of
1342: the true number of pairs $\nu_{\rm p}$ we actually visit
1343: a larger number of pairs, which we call $\nu_{\rm p}^{\rm max}$.
1344: The merit of doing it like this can be gauged by the ratio
1345: $\eta^{\rm max} \equiv \nu_{\rm p} / \nu_{\rm p}^{\rm max}$.
1346: If this ratio becomes
1347: too small, then it is wasteful to treat it as weak reduction. Similarly,
1348: if we treat as weak extension, we visit a number of pairs
1349: $\nu_{\rm p}^{\rm min}$ which will generally be greater than
1350: $\nu_{\rm p}$, so we can gauge the merit of this method by the
1351: ratio $\eta^{\rm min} \equiv \nu_{\rm p} / \nu_{\rm p}^{\rm min}$.
1352: We should go for the method that gives the largest $\eta$ value,
1353: so the policy will be that for given values of $R_A$,
1354: $R_B$ and $R_C$ we treat it as weak reduction if $\nu_{\rm p}^{\rm max} <
1355: \nu_{\rm p}^{\rm min}$ and as weak extension if $\nu_{\rm p}^{\rm min} <
1356: \nu_{\rm p}^{\rm max}$.
1357:
1358: The values of $\nu_{\rm p}$, $\nu_{\rm p}^{\rm max}$ and
1359: $\nu_{\rm p}^{\rm min}$ in practice will depend on the details
1360: of the atomic positions. We need a simple approximate way of
1361: estimating the numbers of pairs. To do this, we assume
1362: that the density of atoms is uniform, with
1363: $\rho_0$ atoms per unit volume. Then our estimate for the number of
1364: pairs is:
1365: \begin{equation}
1366: \nu_{\rm p} = \rho_0^2
1367: \int_{r_1 < R_A , \, r_2 < R_C , \, \mid {\bf r}_2 - {\bf r}_1 \mid < R_B}
1368: d {\bf r}_1 d {\bf r}_2
1369: \end{equation}
1370: It is simple geometry to evaluate the integral. Let $\Omega ( R_A ,
1371: R_B ; r )$ be the overlap volume of two spheres
1372: of radius $R_A$, $R_B$ when the distance between their
1373: centres is $r$. Then:
1374: \begin{equation}
1375: \nu_{\rm p} ( R_A , R_B , R_C ) = 4 \pi \rho_0^2
1376: \int_0^{R_C} dr \, r^2 \Omega ( R_A , R_B ; r ) \; .
1377: \label{eqn:volume}
1378: \end{equation}
1379:
1380: It will be useful to note some symmetry properties of
1381: $\nu_{\rm p} ( R_A , R_B , R_C )$. Since $\Omega ( R_A , R_B ; r )$
1382: is invariant under interchange of $R_A$ and $R_B$, the
1383: same is true of $\nu_{\rm p}$. In fact, it is easy to
1384: show that $\nu_{\rm p}$ is completely invariant with respect
1385: to all permutations of $R_A$, $R_B$ and $R_C$:
1386: \begin{eqnarray}
1387: \nu_{\rm p} ( R_A , R_B , R_C ) & = & \nu_{\rm p} ( R_C , R_A , R_B ) =
1388: \nu_{\rm p} ( R_B , R_C , R_A ) \nonumber \\
1389: & = & \nu_{\rm p} ( R_B , R_A , R_C ) = \nu_{\rm p} ( R_A , R_C , R_B ) =
1390: \nu_{\rm p} ( R_C , R_B , R_A ) \; .
1391: \end{eqnarray}
1392:
1393: These symmetries very much simplify the calculation of $\nu_{\rm p}$
1394: for general values of $R_A$, $R_B$ and $R_C$. Let $R_1$, $R_2$ and
1395: $R_3$ denote the smallest, the middle and the largest of $R_A$,
1396: $R_B$ and $R_C$.
1397: Then for arbitrary values of $R_A$, $R_B$ and $R_C$, we can
1398: write:
1399: \begin{equation}
1400: \nu_{\rm p} ( R_A , R_B , R_C ) = \nu_{\rm p} ( R_1 , R_2 , R_3 ) \; .
1401: \end{equation}
1402: There are two separate cases that must be considered, which we
1403: refer to as `triangle' and `non-triangle'. In the first
1404: case, we have $R_3 \le R_1 + R_2$, so that the three lengths
1405: can form the sides of a triangle. In the second, we have
1406: $R_3 > R_1 + R_2$, and they cannot form the sides of a triangle.
1407: The non-triangle case is trivial, because the constraint
1408: supplied by $R_3$ is of no effect, and we just have:
1409: \begin{equation}
1410: \nu_{\rm p} ( R_1 , R_2 , R_3 ) = \rho_0^2
1411: \int_{r_1 < R_1 , \, r_2 < R_2}
1412: d {\bf r}_1 d {\bf r}_2 \, = \frac{16}{9} \pi^2 \rho_0^2
1413: R_1^3 R_2^3 \; .
1414: \label{eq:pair_nont}
1415: \end{equation}
1416: In the `triangle' case, we return to eqn.~(\ref{eqn:volume})
1417: and note that the overlap volume of two spheres is:
1418: \begin{eqnarray}
1419: \Omega ( R_1 , R_2 ; r ) & = & \frac{4}{3} \pi R_1^3 \; \; \; \;
1420: {\rm if} \; \; \; r < R_2 - R_1 \nonumber \\
1421: & = & - \frac{\pi}{4 r} ( R_2^2 - R_1^2 )^2 +
1422: \frac{2}{3} \pi ( R_1^3 - R_2^3 ) -
1423: \frac{1}{2} \pi r ( R_1^2 + R_2^2 ) +
1424: \frac{\pi}{12} r^3 \; \; \; \; \; {\rm if} \; \; \; r > R_2 - R_1 \; .
1425: \end{eqnarray}
1426: The integral over $r$ is then straightforward, and we get:
1427: \begin{eqnarray}
1428: \nu_{\rm p} & = & \pi^2 \rho_0^2 \left(
1429: \frac{16}{9} R_1^3 ( R_2 - R_1 )^3
1430: - \frac{1}{2} ( R_2^2 - R_1^2 )^2 \left[ R_3^2 - ( R_2 - R_1 )^2 \right]
1431: + \frac{8}{9} ( R_1^3 + R_2^3 ) \left[ R_3^3 - ( R_2 - R_1 )^3 \right]
1432: \right.
1433: \nonumber \\
1434: & & \left. \mbox{} - \frac{1}{2} ( R_1^4 + R_2^4 )
1435: \left[ R_3^4 - ( R_2 - R_1 )^4 \right]
1436: + \frac{1}{18} \left[ R_3^6 - ( R_2 - R_1 )^6 \right] \right) \; .
1437: \label{eq:pair_t}
1438: \end{eqnarray}
1439:
1440: It is now simple to find the cross-over point and the $\eta$ value
1441: at this point. The strategy for weak reduction is based
1442: on ignoring the constraint supplied by $R_C$. Then $\nu_{\rm p}^{\rm max}$
1443: is just $\nu_{\rm p}$ for the non-triangle case $R_C > R_A + R_B$,
1444: so that from eqn.~(\ref{eq:pair_nont}) we have:
1445: \begin{equation}
1446: \nu_{\rm p}^{\rm max} = \frac{16}{9} \pi^2 \rho_0^2 R_A^3 R_B^3 \; .
1447: \end{equation}
1448: In weak extension, we ignore the constraint supplied by $R_A$. Then
1449: $\nu_{\rm p}^{\rm min}$ is $\nu_{\rm p}$ for the
1450: non-triangle case $R_A > R_B + R_C$, so we have:
1451: \begin{equation}
1452: \nu_{\rm p}^{\rm min} = \frac{16}{9} \pi^2 \rho_0^2 R_B^3 R_C^3 \; .
1453: \end{equation}
1454: The cross-over point is the point at which $\nu_{\rm p}^{\rm max} =
1455: \nu_{\rm p}^{\rm min}$, which requires that $R_A = R_C$.
1456: In terms of the dimensionless ratio $\alpha = R_C / ( R_A + R_B )$,
1457: we can express the cross-over point as:
1458: \begin{equation}
1459: \alpha_{\rm X} = 1 / ( 1 + \xi ) \; ,
1460: \end{equation}
1461: where $\xi = R_B / R_A$.
1462:
1463:
1464: The efficiency $\eta_{\rm X} = \nu_{\rm p}^{\rm max} / \nu_{\rm p} =
1465: \nu_{\rm p}^{\rm min} / \nu_{\rm p}$ at the cross-over point
1466: can now be found from eqn.~(\ref{eq:pair_t}). The derivation of the
1467: resulting formula:
1468: \begin{equation}
1469: \eta_{\rm X} = 1 - \frac{9}{16} \xi + \frac{1}{32} \xi^3
1470: \end{equation}
1471: is straightforward.
1472:
1473: In practice, we should always try to ensure that $R_A \ge R_B$,
1474: so that $0 < \xi < 1$. It is readily shown that $\eta_{\rm X}$
1475: increases montonically from $15/32$ to 1 as $\xi$ decreases
1476: from 1 to 0. This means that the efficiency cannot drop
1477: below $15/32$, or, in round numbers, 50~\%.
1478:
1479: \begin{figure}
1480: \begin{center}
1481: \leavevmode
1482: \epsfxsize=100mm
1483: \epsfbox{CalcsComms.eps}
1484: \end{center}
1485: \caption{Interleaving of communication and calculation in
1486: parallel multiplication of sparse matrices.}
1487: \label{fig:interleave}
1488: \end{figure}
1489:
1490: \begin{figure}
1491: \begin{center}
1492: \leavevmode
1493: \epsfxsize=100mm
1494: \epsfbox{PartHalo.eps}
1495: \end{center}
1496: \caption{Schematic illustration of partitions and haloes. Atoms are
1497: indicated by small, black circles, and partitions by the dashed lines.
1498: A particular atom, $i$, is patterned. Its A-neighbours are those atoms
1499: that are within or touched by the circle drawn centred on $i$.
1500: If the primary set is chosen to be the partition containing $i$, then the
1501: A-halo partitions are those enclosed by the heavy line.}
1502: \label{fig:halo}
1503: \end{figure}
1504:
1505: \begin{figure}
1506: \begin{center}
1507: \begin{verbatim}
1508: do k=1,ahalo%nh_part(kpart) ! Loop over atoms k in current A-halo partn
1509: ! --- useful variables associated with atom k ---------------------
1510: k_in_halo=ahalo%j_beg(kpart)+k-1
1511: k_in_part=ahalo%j_seq(k_in_halo)
1512: nbkbeg=ibaddr(k_in_part)
1513: ! --- transcription of j from B-neighbour to C-halo labelling -----
1514: do j=1,nbnab(k_in_part)
1515: jpart=ibpart(nbkbeg+j-1)+k_off
1516: jseq=ibseq(nbkbeg+j-1)
1517: jbnab2ch(j)=chalo%i_halo(chalo%i_hbeg(jpart)+jseq-1)
1518: enddo
1519: ! --- perform multiply and adds -----------------------------------
1520: do i=1,at%n_hnab(k_in_halo) ! Loop over primary-set A-neighbours of k
1521: nabeg=at%i_beg(k_in_halo)+i-1
1522: i_in_prim=at%i_prim(at%i_beg(k_in_halo)+i-1)
1523: icad=(i_in_prim-1)*chalo%ni_in_halo
1524: do j=1,nbnab(k_in_part) ! Loop over B-neighbours of atom k
1525: nbbeg=nbkbeg+j-1
1526: j_in_halo=jbnab2ch(j)
1527: if(j_in_halo/=0) then
1528: ncbeg=chalo%i_h2d(icad+j_in_halo)
1529: if(ncbeg/=0) then ! multiplication of ndim x ndim blocks
1530: do n2=1,ndim3
1531: do n1=1,ndim1
1532: do n3=1,ndim2
1533: c(n1,n2,ncbeg)=c(n1,n2,ncbeg)+ &
1534: a(n3,n1,nabeg)*b(n3,n2,nbbeg)
1535: enddo
1536: enddo
1537: enddo
1538: endif
1539: endif ! End of if(j_in_halo.ne.0)
1540: enddo ! End of 1,nbnab
1541: enddo ! End of 1,at%n_hnab
1542: enddo ! End of k=1,ahalo%nh_part(kpart)
1543: return
1544: \end{verbatim}
1545: \end{center}
1546: \caption{The matrix multiplication kernel.}
1547: \label{fig:kernel}
1548: \end{figure}
1549:
1550: \begin{figure}
1551: \begin{center}
1552: \leavevmode
1553: \epsfxsize=100mm
1554: \epsfbox{WeakScaling.eps}
1555: \end{center}
1556: \caption{Time recorded on each
1557: processor for the maximal
1558: case of sparse-matrix multiplication. Numbers of processors
1559: and atoms are varied with average number of atoms per processor
1560: equal to 80 in all cases.}
1561: \label{fig:Weak}
1562: \end{figure}
1563:
1564: \begin{figure}
1565: \begin{center}
1566: \leavevmode
1567: \epsfxsize=100mm
1568: %\epsfbox{Strong.eps}
1569: \epsfbox{MinMax.eps}
1570: \end{center}
1571: \caption{Times t recorded on each node for maximal (solid line, circles)
1572: and minimal (dashed line, squares) cases of sparse matrix multiplication
1573: when number of processors $N_{\rm proc}$ is varied with total number of
1574: atoms in the system equal to 4096 in all cases. Quantity plotted is
1575: the sum of the times on all processors (t.$N_{\rm proc}$), which would be
1576: exactly constant for perfect strong parallel scaling.}
1577: \label{fig:Strong}
1578: \end{figure}
1579:
1580: \begin{figure}
1581: \begin{center}
1582: \leavevmode
1583: \epsfxsize=100mm
1584: \epsfbox{SystemPerf.eps}
1585: \end{center}
1586: \caption{Average time per processor as a multiple of time on two processors
1587: for constant number of atoms per processor (equal to 32). Maximal case for
1588: crystalline silicon with cut-off radii $R_A$ = 12\AA, $R_B$ = 8\AA and
1589: $R_C$ = 20\AA.}
1590: \label{fig:systems}
1591: \end{figure}
1592:
1593: \begin{figure}
1594: \begin{center}
1595: \leavevmode
1596: \epsfxsize=100mm
1597: \epsfbox{ConquestRhoTime.eps}
1598: \end{center}
1599: \caption{Time for one self-consistency iteration (see text) of
1600: the {\sc Conquest} code as a function of number of atoms of
1601: perfect-crystal silicon. Calculations were run on 512 processors
1602: of a Cray T3E-1200E. Points connected by solid line show elapsed
1603: time in seconds; dashed line
1604: shows time scaled linearly from the actual time for 4096 atoms.}
1605: \label{fig:Conquest}
1606: \end{figure}
1607:
1608: \end{document}
1609: