
1: \documentclass[]{article}

2: \usepackage{epsf,a4}

3: \addtolength{\oddsidemargin}{-0.75in}

4: \addtolength{\textwidth}{1.5in}

5: \addtolength{\topmargin}{-0.35in}

6: \addtolength{\textheight}{0.8in}

7: \renewcommand{\baselinestretch}{1}

8:

9: \begin{document}

10: \begin{center}

11: \LARGE

12: Parallel Sparse Matrix Multiplication for \\

13: Linear Scaling Electronic Structure Calculations

14: \bigskip \\

15: \large

16: D. R. Bowler$^\star$, T. Miyazaki$^\dagger$$^\star$ and

17: M. J. Gillan$^\star$ \smallskip \\

18: $^\star$Department of Physics and Astronomy, University College London \\

19: Gower Street, London, WC1E 6BT, U.K. \smallskip \\

20: $^\dagger$National Research Institute for Metals \\

21: 1-2-1 Sengen, Tsukuba, Ibaraki 305-0047, Japan

22: \normalsize

23: \end{center}

24:

25: \begin{abstract}

26: Linear-scaling electronic-structure techniques, also called $O(N)$

27: techniques, rely heavily on the multiplication of sparse matrices,

28: where the sparsity arises from spatial cut-offs.  In order to treat

29: very large systems, the calculations must be run on parallel

30: computers. We analyse the problem of parallelising the multiplication

31: of sparse matrices with the sparsity pattern required by

32: linear-scaling techniques. We show that the management of inter-node

33: communications and the effective use of on-node cache are helped by

34: organising the atoms into compact groups. We also discuss how to

35: identify a key part of the code called the `multiplication kernel',

36: which is repeatedly invoked to build up the matrix product, and

37: explicit code is presented for this kernel. Numerical tests of the

38: resulting multiplication code are reported for cases of practical

39: interest, and it is shown that their scaling properties for systems

40: containing up to 20,000 atoms on machines having up to 512 processors

41: are excellent. The tests also show that the cpu efficiency of our code

42: is satisfactory.

43: \end{abstract}

44:

45: \section{Introduction}

46: \label{sec:intro}

47:

48: There is rapidly growing interest in linear-scaling methods for

49: electronic structure calculations, in which the memory and number of

50: cpu cycles are proportional to the number of atoms~\cite{Goedecker99}.

51: Several practical codes exist for performing

52: tight-binding~\cite{Pettifor89,Yang91,%

53: Galli92,Li93,Daw93,Ordejon93,Aoki93,Mauri93,Goedecker94,%

54: Kress95,Kohn96,Horsfield96,Bowler97},

55: first-principles~\cite{Stechel94,Hernandez95,Haynes97} or

56: Hartree-Fock~\cite{Schweg96,Burant96,Ochsen98,Challa00} calculations

57: using linear-scaling techniques -- often called order-$N$ or $O(N)$

58: techniques. For very large systems containing thousands or tens of

59: thousands of atoms, these codes need to run efficiently on parallel

60: computers. Most of the existing linear-scaling methods rely heavily on

61: the multiplication of sparse matrices, and this means that

62: sparse-matrix multiplication on parallel computers is a key issue for

63: the future of linear-scaling electronic structure work. To achieve the

64: best efficiency, parallel codes to perform this multiplication must be

65: designed to treat the special patterns of sparsity that arise in

66: electronic structure, which we shall refer to as `local sparsity'.

67: The aims of this paper are to analyse the design of parallel code for

68: multiplying matrices having local sparsity, to present the algorithms

69: we have developed for the {\sc Conquest} electronic structure

70: code~\cite{Goringe97,Bowler00}, and to report the results of practical

71: tests on the efficiency and linear-scaling properties of the code.

72:

73: In the tight-binding approach to electronic structure, we are

74: concerned with the Hamiltonian and overlap matrices $H_{i \alpha , j

75: \beta}$ and $S_{i \alpha , j \beta}$. (Here, $i$, $j$ label atoms,

76: while $\alpha$, $\beta$ label the basis functions on each atom.)

77: Linear-scaling methods generally require also the density matrix $K_{i

78: \alpha , j \beta}$ and sometimes also an `auxiliary' density matrix

79: $L_{i \alpha , j \beta}$ -- we follow the notation of

80: Ref.~\cite{Hernandez96}. Linear-scaling methods based on

81: density-functional theory (DFT) or Hartree-Fock theory work with

82: similar quantities, and the DFT-pseudopotential approach also needs

83: scalar products $P_{i \alpha , j \lambda}$ between localised orbitals

84: on atom $i$ and angular-momentum projectors for non-local

85: pseudopotentials on atom $j$.  The elements of all these matrices tend

86: to zero as the distance $R_{i j}$ between atoms $i$ and $j$ tends to

87: infinity, and this is what makes linear-scaling electronic structure

88: theory possible.  In the approach used in the {\sc Conquest}

89: code~\cite{Goringe97,Bowler00}, and other related approaches (see

90: e.g. Refs.~\cite{Ordejon93,Mauri93,Mauri94,Ordejon95}), the

91: approximation is made that all the matrix elements vanish exactly when

92: $R_{i j}$ exceeds a specified cut-off radius, which generally differs

93: for different matrices. Once this approximation has been made,

94: linear-scaling behaviour for both memory and cpu cycles is guaranteed

95: -- provided, of course, appropriate algorithms are used. All the

96: matrices are sparse, since their only non-vanishing elements are those

97: for which $R_{i j}$ is less than the appropriate cut-off. In

98: everything that follows, the particular pattern of sparsity that

99: arises from the spatial cut-offs is crucial, and it is this pattern

100: that we call `local sparsity'.

101:

102: The techniques widely used to ensure the idempotency of the density

103: matrix $K$~\cite{Li93,Hernandez96,Bowler99} rely on the calculation of

104: matrix products such as $(LSL)_{i \alpha , j \beta}$, $(LSLSL)_{i

105: \alpha , j \beta}$ and other similar products involving $H$ and

106: $P$. Sometimes, these products are needed only for interatomic

107: distances $R_{i j}$ that are smaller than the full distances for which

108: they have non-vanishing elements, so that we do not need to calculate

109: all their elements.  The general problem is therefore to perform the

110: product:

111: \begin{equation}

112: C_{i \mu , j \nu} = \sum_{k \xi} A_{i \mu , k \xi}

113: B_{k \xi , j \nu} \; ,

114: \label{eqn:mult}

115: \end{equation}

116: where the elements of $A$ and $B$ are non-vanishing only

117: for $R_{i k} < R_A$ and $R_{k j} < R_B$ respectively, and elements

118: of $C$ are required only for $R_{i j} < R_C$, where $R_A$,

119: $R_B$, $R_C$ are specified cut-off distances. The indices

120: $\mu$, $\nu$, $\xi$ may correspond to localised orbitals or

121: angular momentum projectors

122: on each atom, but in any case they run only over a small set

123: of values. In general, the number of values might depend on

124: the atom, but we ignore this possible complication here and assume that

125: $\mu$ runs from 1 to $n_1$, $\nu$ runs from 1 to $n_2$, and $\xi$

126: runs from 1 to $n_3$. Unless the distinction is important,

127: we shall refer to $n_1$, $n_2$ or $n_3$ simply as $n$.
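
As a concrete reading of eqn~(\ref{eqn:mult}), the following is a minimal brute-force sketch in Python, not the {\sc Conquest} implementation. It assumes that the atom blocks are held in dictionaries keyed by atom pair, that there are no periodic images, and it uses hypothetical names ({\tt local\_sparse\_product}, {\tt pos}, {\tt r\_a}, {\tt r\_b}, {\tt r\_c}) that do not appear elsewhere in this paper. Its cost grows as the cube of the number of atoms, so it is useful only as a reference against which a linear-scaling code can be checked.

```python
import numpy as np


def local_sparse_product(pos, A, B, r_a, r_b, r_c):
    """Brute-force reference for C = A.B with spatial cut-offs.

    pos  : (N, 3) array of atomic positions (no periodic images).
    A, B : dicts mapping an atom pair to its block of matrix elements;
           A[(i, k)] exists when R_ik < r_a, B[(k, j)] when R_kj < r_b.
    Returns a dict of blocks C[(i, j)], restricted to R_ij < r_c.
    """
    n_atoms = len(pos)
    # All interatomic distances (O(N^2) storage: reference use only).
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    C = {}
    for i in range(n_atoms):
        for j in range(n_atoms):
            if dist[i, j] >= r_c:
                continue  # this element of C is not required
            for k in range(n_atoms):
                if dist[i, k] < r_a and dist[k, j] < r_b:
                    # Small dense product over the index xi on atom k.
                    block = A[(i, k)] @ B[(k, j)]
                    C[(i, j)] = C.get((i, j), 0) + block
    return C
```

With $R_C \ge R_A + R_B$ the result contains every non-vanishing block of the full product; with a smaller $R_C$ the same routine simply omits the blocks that are not wanted.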

128:

129: The {\sc Conquest} DFT code was written from

130: the start as a parallel code, and the methods used to parallelise

131: sparse-matrix multiplications and all the other operations

132: were reported earlier~\cite{Goringe97}, together with

133: tests of the scaling behaviour. Although the scaling

134: was excellent, the major problem

135: with the matrix multiplication methods that we used earlier

136: was that they required elaborate indexing, which consumed

137: excessive amounts of memory. This is why we have carried out

138: the much more thorough analysis of locally sparse matrix

139: multiplication reported here.

140:

141: There has also been other work on parallelising the multiplication

142: of locally sparse matrices for linear-scaling electronic structure

143: work, using both tight-binding~\cite{Itoh95,Canning96,Wang96}

144: and Hartree-Fock

145: techniques~\cite{Challa00}. Itoh {\em et al.}~\cite{Itoh95} and

146: Canning {\em et al.}~\cite{Canning96} describe the parallelisation

147: of the two closely related linear-scaling methods reported

148: in Refs.~\cite{Ordejon93,Ordejon95} and~\cite{Mauri93,Mauri94},

149: and find satisfactory

150: speed-ups of up to 7.3 for an eight-fold increase of

151: processor number. Challacombe~\cite{Challa00} presented a very general

152: parallelisation scheme for linear-scaling Hartree-Fock

153: calculations using Gaussian basis sets, but found speed-ups

154: of only 75~\% for an eight-fold increase of processor number.

155: Wang {\em et al.} made passing reference to the parallelisation

156: of the tight-binding density-matrix method~\cite{Wang96}, but no details

157: appear to have been published. The popular tight-binding

158: linear-scaling scheme known as the Fermi-operator

159: method~\cite{Goedecker94}

160: has also been parallelised, but this relies on matrix-vector

161: rather than matrix-matrix operations, and is less relevant here.

162:

163: In considering the efficiency of a multiplication code, we note that

164: there is a certain irreducible minimum number of multiply-and-adds

165: that must be done to accomplish the calculation of

166: eqn~(\ref{eqn:mult}).  For a parallel machine whose processors have a

167: given peak speed, this sets a well defined lower bound to the

168: execution time.  The ratio between this lower bound and the

169: practical execution time gives a measure of the cpu efficiency of the

170: code. The superscalar processors of main interest to us can sometimes

171: achieve 50~\% of peak speed in the multiplication of large matrices,

172: but it would be unrealistic to expect this kind of efficiency for

173: sparse matrices. On the other hand, it would be cause for concern if

174: the cpu efficiency fell much below 10~\%. We shall show that

175: efficiencies of around 10~\% are indeed achievable.  We shall also show

176: that the methods we propose achieve scaling very close to linear, both

177: as the numbers of atoms and processors are increased for a fixed

178: number of atoms per processor, and as the number of processors is

179: increased for a fixed number of atoms.
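
The efficiency measure described in this paragraph can be written compactly. The symbols below ($N_{\rm trip}$, $N_{\rm ma}$, $P$, $r_{\rm peak}$, $t_{\rm min}$, $t_{\rm run}$ and $\eta$) are introduced only for this sketch and are not used elsewhere in the paper; perfect load balance is assumed.
\[
N_{\rm ma} = n_1 n_2 n_3 \, N_{\rm trip} , \qquad
t_{\rm min} = \frac{N_{\rm ma}}{P \, r_{\rm peak}} , \qquad
\eta = \frac{t_{\rm min}}{t_{\rm run}} \; .
\]
Here $N_{\rm trip}$ is the number of atom triplets $(i,k,j)$ that satisfy $R_{i k} < R_A$, $R_{k j} < R_B$ and $R_{i j} < R_C$, $N_{\rm ma}$ is the resulting number of multiply-and-adds, $P$ is the number of processors, $r_{\rm peak}$ is the peak multiply-and-add rate of one processor, and $t_{\rm run}$ is the measured execution time. A cpu efficiency of 10~\% corresponds to $\eta = 0.1$.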

180:

181: We shall see that achievement of high efficiency and good scalability

182: relies heavily on two things: first, good design of storage patterns

183: of matrix elements; second, the use of different schemes for

184: labelling atoms in different parts of the calculation, together with

185: efficient transcription between schemes. For both storage and labelling,

186: we shall show the advantages of grouping atoms into small spatially

187: compact groups (here called `partitions'), and assignment of groups

188: of these partitions (here called partition `bundles') to individual

189: processors.

190:

191: The next section of the paper presents our analysis of the problem,

192: and the set of algorithms that the analysis leads us to. We also

193: present the piece of code that lies at the heart of our multiplication

194: program, which we refer to as the `multiplication kernel'.  Our

195: practical code is written in Fortran~90, with communications handled

196: by MPI-1. In Sec.~3, we outline the main principles used in

197: programming our algorithms. Sec.~4 reports the range of tests on

198: a Cray~T3E, an SGI Origin2000 and a Beowulf cluster

199: that we have performed to probe the efficiency and scaling

200: behaviour of the multiplication code.  The final Section discusses our

201: results and draws conclusions. In particular, we shall argue that

202: other possible approaches to the problem of multiplying locally sparse

203: matrices on parallel machines are unlikely to achieve significantly

204: better efficiency or linear scaling behaviour.

205:

206: \section{Development of techniques}

207: \label{sec:techniques}

208:

209: In order to achieve high efficiency, we need careful design

210: of a small set of operations that are repeatedly performed

211: to build up the product matrix: this set of

212: operations is the `multiplication kernel'. Some of the

213: data that serve as input to the kernel need to be communicated

214: from remote nodes, and linear scaling will only be achieved

215: if the communications strategy is properly designed. Both the

216: multiplication kernel and the communications need indexing,

217: and it is important that both the memory

218: and the cpu time needed to manipulate index arrays are kept to a

219: minimum. Before discussing the multiplication kernel, the

220: communications strategy and the indexing, we outline first

221: the starting assumptions and the spatial organisation of

222: atoms into partitions and bundles.

223:

224: For ease of presentation, we confine ourselves in much of the

225: following to the case $R_C \ge R_A + R_B$, which we

226: refer to as the `maximal' case; it is the case where every

227: element of the product matrix $C = A \cdot B$ must be calculated.

228: At the end of this Section, we shall

229: show how the case of a general relation between $R_A$, $R_B$ and

230: $R_C$ can be handled by exactly the same principles that will

231: be described for the maximal case. We also avoid initially any

232: discussion of periodic boundary conditions. The {\sc Conquest}

233: code is able to handle periodic boundary conditions, though

234: this may not always be needed, so that the matrix multiplication

235: code must be capable of including periodicity.

236: But to simplify the discussion,

237: we defer discussion of periodicity till Sec.~\ref{sec:pbc}.

238:

239: \subsection{Assumptions}

240: \label{sec:assumptions}

241:

242: Following Ref.~\cite{Goringe97}, we assume that each node is

243: responsible for a group of atoms: we call this the node's `primary

244: set' of atoms.  We also assume that matrix elements are stored by

245: rows, so that for any matrix $X$, the elements $X_{i \mu , j \nu}$ for

246: all $\mu$, $j$, $\nu$ are held by the node whose primary set contains

247: $i$.  Finally, we assume that the multiplication operations needed to

248: calculate the elements $C_{i \mu , j \nu}$ of the product matrix are

249: performed by the node whose primary set contains $i$.  This means that

250: only matrix elements $B_{k \xi , j \nu}$ need to be communicated

251: between nodes. We discuss in Sec.~\ref{sec:disc} whether anything

252: could be gained by changing these assumptions.

253:

254: With these assumptions, we already encounter a key question, concerning

255: the interleaving of communications and calculations. The matrix

256: multiplications performed by each node can be broken into

257: contributions, with each contribution associated with

258: a particular set of atoms $k$. For each set, we fetch the

259: $B_{k \xi , j \nu}$ data for $k$ in the set, and then perform

260: the multiply-and-adds of eqn~(\ref{eqn:mult}) for those $k$, accumulating the

261: results onto the array of the product matrix; then we move to the next

262: set of $k$ atoms and repeat the communications and calculations.

263: Two extremes can be envisaged. At one extreme, we start by

264: bringing {\em all} the $B_{k \xi , j \nu}$ for all the $k$

265: entering eqn~(\ref{eqn:mult}) from the remote nodes onto the local node;

266: after this has been done, we then perform {\em all} the multiplications.

267: This is the coarsest possible interleaving. At the other extreme, we

268: fetch the $B_{k \xi , j \nu}$ for just a single atom

269: $k$ at a time, and perform the multiplications for that $k$, before

270: moving on to the next $k$. This is the finest possible interleaving.

271: Neither extreme will be satisfactory. The coarse extreme requires

272: an unnecessarily large amount of local memory to store all the

273: $B_{k \xi , j \nu}$ elements; the finest extreme requires

274: the transmission of unnecessarily small packets of

275: $B_{k \xi , j \nu}$ data, so that latency will slow the

276: communications. Somewhere between the extremes, there is a good

277: compromise between memory and latency.

278:

279: In fact, there is no point in communicating {\em all}

280: $B_{k \xi , j \nu}$ data before multiplying: the coarsest interleaving

281: that we should ever consider is where we communicate all the

282: $B_{k \xi , j \nu}$ needed from a particular remote node before

283: performing the multiplications; then we move to the next node.

284: This interleaving already yields the lowest possible latency. It follows

285: that the outermost loop in matrix multiplication must be a loop

286: over remote nodes. However, there is no reason why this coarsest

287: interleaving should give an acceptable compromise, and we shall

288: generally want to reduce the memory requirement further by

289: breaking the atoms $k$ into smaller sets. Within the outer loop

290: over nodes, we shall therefore want a loop over {\em sets} of

291: atoms $k$ in each particular node.
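
The trade-off between memory and latency in this nested loop can be illustrated with a small counting sketch in Python. It is a toy model rather than part of our code: the function name {\tt interleave\_cost} and its arguments are hypothetical, the amount of $B$-data per atom $k$ is assumed to be uniform, and the multiply-and-adds themselves are only indicated by a comment.

```python
def interleave_cost(atoms_per_node, set_size, words_per_atom):
    """Count messages and peak receive-buffer size for one local node.

    atoms_per_node : list giving the number of atoms k needed from each
                     remote node.
    set_size       : number of atoms k fetched per message (a positive
                     integer), or None to fetch everything needed from a
                     remote node in a single message.
    words_per_atom : B-matrix words communicated per atom k (assumed uniform).
    Returns (number of messages, largest buffer in words).
    """
    messages = 0
    peak_buffer = 0
    for n_k in atoms_per_node:              # outer loop: remote nodes
        size = n_k if set_size is None else set_size
        start = 0
        while start < n_k:                  # inner loop: sets of atoms k
            count = min(size, n_k - start)
            messages += 1                   # one fetch per set of atoms
            peak_buffer = max(peak_buffer, count * words_per_atom)
            # ... multiply-and-adds for this set would be done here ...
            start += count
    return messages, peak_buffer
```

For two remote nodes supplying eight atoms each, {\tt set\_size=None} gives two messages with the largest buffer, {\tt set\_size=1} gives sixteen messages with the smallest buffer, and intermediate set sizes fall between these limits.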

292:

293: Our basic assumptions about the distribution of matrix elements

294: over nodes therefore require the general interleaving of communications

295: and calculations shown in Fig.~\ref{fig:interleave}.

296:

297: \subsection{Partitions, bundles, haloes and the outline

298: multiplication scheme}

299: \label{sec:partitions}

300:

301: The primary sets of atoms should be chosen to be spatially

302: compact: the atoms should all be near each other. The reason for

303: this is that compactness of the primary set helps to

304: reduce the number of atoms $k$ entering equation~(\ref{eqn:mult}) for any

305: given node, so that fewer $B_{k \xi , j \nu}$ elements have

306: to be communicated. Put another way, primary set compactness means that

307: each $B_{k \xi , j \nu}$ communicated is used more times.

308: Since the time spent in communications limits the parallel

309: scalability, compactness helps scalability.

310:

311: However, it would limit flexibility if we took the primary sets

312: to be the smallest organisational unit of atoms. The size

313: of the primary sets depends on the number of processors used, and this

314: may depend on unpredictable circumstances. So we need a smaller and

315: more stable organisational unit. We call this a `partition'. A partition

316: is a small, spatially compact set of atoms that does not depend on the

317: number of processors. Each primary set consists of a spatially

318: compact bundle of partitions. Clearly, in assembling partitions

319: to make primary sets, a key consideration will be load balancing,

320: so we shall need each primary set to contain roughly the same

321: number of atoms. Load balancing will be discussed in more detail

322: in Sec.~\ref{sec:load}.

323:

324: Partitions could in principle be formed in many ways. At present,

325: the simulation cell used in the {\sc Conquest} code is required

326: to be orthorhombic (this requirement will, of course, be

327: removed in the future). We divide each of the three edges

328: of this cell using a uniform grid, and the orthorhombic subcells thus

329: formed are used as partitions. For some arrangements of atoms -- for

330: example, slabs of crystal surrounded by vacuum -- there may be

331: partitions that contain no atoms. This creates no problem,

332: though for good load balancing we shall require that no primary sets

333: should consist only of empty partitions.
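
The division of an orthorhombic cell into partitions can be sketched as follows. This Python fragment is illustrative only: the name {\tt assign\_partitions} and its arguments are hypothetical, and positions are assumed to have been wrapped into the cell already.

```python
import numpy as np


def assign_partitions(pos, cell, grid):
    """Map atoms to the partitions of a uniform grid on an orthorhombic cell.

    pos  : (N, 3) Cartesian positions, assumed to lie in [0, cell) on each axis.
    cell : the three edge lengths of the cell.
    grid : the number of partitions along each edge.
    Returns the (N, 3) integer partition coordinates of the atoms and a dict
    mapping each occupied partition to the list of atoms it contains.
    """
    pos = np.asarray(pos, dtype=float)
    cell = np.asarray(cell, dtype=float)
    grid = np.asarray(grid, dtype=int)
    coords = np.floor(pos / cell * grid).astype(int)
    # Guard against round-off for atoms lying on the upper cell face.
    coords = np.minimum(coords, grid - 1)
    members = {}
    for atom, c in enumerate(coords):
        members.setdefault(tuple(int(x) for x in c), []).append(atom)
    return coords, members
```

Empty partitions are simply absent from the returned dictionary, which is one way of representing the vacuum regions mentioned above.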

334:

335: To discuss the outline multiplication scheme in terms of partitions,

336: we now introduce some terminology, which is illustrated schematically

337: in Fig.~\ref{fig:halo}. For each atom $i$ in the

338: primary set, there are atoms $k$ in the system

339: whose distance from $i$ is less

340: than $R_A$; we call these the $A$-neighbours of $i$. The atoms

341: $k$ which are $A$-neighbours of at least one atom in the {\em primary

342: set} form a set which we call the $A$-halo. We refer to atoms in this

343: set as the $A$-halo atoms. The {\em partitions} containing at least one

344: $A$-halo {\em atom} are called the $A$-halo {\em partitions}. Finally,

345: the nodes responsible for at least one $A$-halo partition are

346: called the $A$-halo {\em nodes}.
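
These definitions translate directly into code. The Python sketch below builds the three halo sets for one primary set by a direct distance search; the names {\tt build\_halo}, {\tt partition\_of} and {\tt node\_of} are hypothetical, periodic images are ignored, and a practical code would search only nearby partitions rather than every atom in the system.

```python
import numpy as np


def build_halo(pos, primary, partition_of, node_of, r_a):
    """Return the A-halo atoms, partitions and nodes of one primary set.

    pos          : (N, 3) atomic positions (no periodic images).
    primary      : indices of the atoms in the node's primary set.
    partition_of : partition_of[k] is the partition containing atom k.
    node_of      : dict mapping each partition to its responsible node.
    r_a          : the cut-off radius R_A.
    """
    pos = np.asarray(pos, dtype=float)
    halo_atoms = set()
    for i in primary:
        d = np.linalg.norm(pos - pos[i], axis=1)
        # A-neighbours of i: every atom k with R_ik < R_A (including i itself).
        halo_atoms.update(int(k) for k in np.nonzero(d < r_a)[0])
    halo_partitions = {partition_of[k] for k in halo_atoms}
    halo_nodes = {node_of[p] for p in halo_partitions}
    return sorted(halo_atoms), sorted(halo_partitions), sorted(halo_nodes)
```

Note that a halo partition may contain atoms that are not themselves halo atoms; only one of its atoms needs to lie within $R_A$ of the primary set.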

347:

348: The multiplication scheme can now be expressed thus: We loop over

349: $A$-halo nodes. For each such node, we loop over groups of atoms $k$,

350: communicate the $B_{k \xi , j \nu}$ elements for atoms in the group,

351: perform the corresponding multiplications, and accumulate onto

352: the $C$-matrix array. The groups of atoms $k$ will now be identified as

353: $A$-halo partitions, so that the inner

354: loop in Fig.~\ref{fig:interleave} is a loop over the

355: $A$-halo partitions on the given $A$-halo node.

356: Partitions therefore serve two purposes: first, they aid in the

357: flexible construction of primary sets; second, they provide

358: the sub-units needed in the communication of $B$-matrix data.

359:

360: A number of issues are raised by the partition scheme. We must

361: consider the size of $B$-data packets to be communicated, and how to avoid

362: fetching unwanted $B$-data. This question is intimately

363: related to the choice and structure of the multiplication

364: kernel. It is also related to the question

365: of the indexing needed so that the multiplication kernel

366: knows which atom

367: pairs are associated with each piece of $B$-data. Finally, we must

368: consider the indexing needed by the multiplication kernel in order to

369: accumulate $C$. Since the nature of the multiplication kernel lies at the

370: heart of these issues, we address this next.

371:

372: \subsection{Choice and structure of the multiplication kernel}

373: \label{sec:multkern}

374:

375: As explained above, the multiplication kernel is a small set

376: of operations repeatedly performed on each node, to build

377: up the rows of the product matrix $C_{i \mu , j \nu}$ for

378: which the node is responsible. More precisely, it is the

379: set of operations on which detailed coding effort will

380: be concentrated in order to maximise the efficiency. The choice

381: of kernel is therefore crucial.

382:

383: Of the choices that might be made, we can envisage two

384: extremes, which resemble those mentioned when we

385: discussed the interleaving of communications and

386: calculations (Sec.~\ref{sec:assumptions}). At one

387: extreme, we could choose the kernel to

388: consist of the multiplication of the $n \times n$ matrices associated

389: with each triplet of atoms $(i,k,j)$. (Recall that $n_1$, $n_2$,

390: $n_3$ are the numbers of indices on the atoms $i$, $j$, $k$; to

391: simplify the discussion, we refer to them without distinction as $n$.)

392: If we made this choice,

393: the kernel would sit inside a triple loop over $k$, $i$ and $j$. We refer

394: to this as the `fine' choice. At the other extreme is the `coarse' choice,

395: where the kernel consists of the entire triple loop over $k$,

396: $i$ and $j$, and everything inside it.

397: Neither extreme is likely to be satisfactory. We note that the fine choice

398: will ensure good reuse of data in registers, as discussed in

399: Ref.~\cite{Canning96}.

400: In linear-scaling calculations, the dimension $n$ will be rather

401: small; in treating $s$-$p$ semiconductors, for example, it will

402: often have the value 4, so that the data for a given triplet

403: will generally fit into registers and primary cache. However,

404: each data item is used only $n$ times, and the use of

405: secondary cache will be poor. On the other hand, the coarse extreme

406: is too undifferentiated. Since the full set of $A$, $B$ and $C$ data for all

407: $i$, $j$, $k$ will not fit into secondary cache in any practical

408: situation, there is no point in making this choice of kernel. Clearly,

409: the operations should be broken into small groups, but these

410: should not be as small as $(i,k,j)$ triplets.

411:

412: An obvious intermediate choice of kernel is to take it as the set

413: of operations associated with a given $k$. For a given $k$, the label $i$

414: goes over all $A$-neighbours of $k$ in the primary set, and $j$

415: goes over all $B$-neighbours of $k$ in the system. This will be

416: excellent for cache reuse of $B_{k \xi , j \nu}$, since

417: $j$ will run over $B$-neighbours of $k$ in the sequence in which they are

418: stored. The same thing will happen for $A$, provided the

419: $A_{i \mu , k \xi}$ elements are stored by blocks associated with

420: a given $k$ -- effectively this means that we store the

421: `local transpose' of $A$. If we denote the number of $B$-neighbours

422: of $k$ by $N_k^B$ and the number of primary $A$-neighbours of $k$

423: by $M_k^A$, then potentially every element of $A$ in cache can be

424: used $n N_k^B$ times, and every element of $B$ in cache can

425: be used $n M_k^A$ times.

426:

427: However, this still disregards cache reuse of $C$. In order

428: to help this, we should take the kernel to be the set of operations

429: associated with a small group of atoms $k$, and this group

430: must clearly be spatially compact. We identify this group as the

431: partition. A great advantage of this choice of kernel is that it fits

432: naturally with our proposed interleaving scheme for communications

433: (Sec.~\ref{sec:assumptions}). The

434: overall scheme is now that each node loops over

435: its $A$-halo partitions. For each partition, $B$-data is communicated

436: for the entire partition. The multiplication kernel then

437: performs all the multiply-and-adds and the accumulation of $C$ for

438: $A$-halo atoms $k$ in the current partition.
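
The structure of this partition-based kernel can be sketched in Python as follows. This is an illustration of the loop ordering for the maximal case only, not the Fortran~90 kernel of our code: the names {\tt kernel}, {\tt a\_by\_k} and {\tt b\_by\_k} are hypothetical, and dictionaries stand in for the contiguous arrays and index tables that a real implementation would use.

```python
import numpy as np


def kernel(partition_atoms, a_by_k, b_by_k, C):
    """Accumulate into C all contributions from the atoms k of one partition.

    partition_atoms : the atoms k of the current A-halo partition.
    a_by_k : dict k -> list of (i, block A_ik) over primary atoms i; this is
             the 'local transpose' of A. Atoms k that are in the partition
             but not in the A-halo have no entry.
    b_by_k : dict k -> list of (j, block B_kj) over the B-neighbours j of k,
             as received for the whole partition.
    C      : dict (i, j) -> block, updated in place.
    """
    for k in partition_atoms:
        # B-data for a non-halo atom k was communicated but is not used.
        for i, a_blk in a_by_k.get(k, ()):      # primary A-neighbours of k
            for j, b_blk in b_by_k[k]:          # B-neighbours of k
                if (i, j) in C:
                    C[(i, j)] += a_blk @ b_blk
                else:
                    C[(i, j)] = a_blk @ b_blk
```

Each node would call such a routine once per $A$-halo partition, after the $B$-data for that partition has arrived.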

439:

440: We note in passing how the partition scheme helps efficient

441: data transfer from memory. The $j$ atoms that we loop over are

442: grouped into the spatially compact partitions used to organise the

443: storage of $B$. Since the $C$-matrix is also stored with the $j$-index

444: grouped into these partitions, and since superscalar processors

445: generally transfer data from main memory into cache in contiguous

446: sets (cache-lines),

447: this should considerably enhance the chances of finding

448: $C$-matrix elements in cache when they are needed. The underlying

449: thought here is that the partition scheme helps to improve

450: `data locality', as discussed by Goedecker~\cite{Goedecker00}.

451:

452: We implied just now that, when $B$-data is fetched from the

453: remote node, we communicate the $B_{k \xi , j \nu}$ for {\em all}

454: $k$ in the current $A$-halo partition. Since these $k$ are not

455: necessarily all in the $A$-halo, this means that in general not

456: all $B_{k \xi , j \nu}$ communicated are actually used, so that

457: there is a waste of communications. We do this, since we believe

458: that it is better to tolerate the waste than to limit

459: communications strictly to the $B_{k \xi , j \nu}$ needed. There

460: are two reasons for this: first, effort would need to be put into

461: generation of the indexing needed; second, the length of the

462: packets communicated would be reduced, so that latency would

463: slow the calculations.

464:

465: In order to implement these ideas, we

466: must now consider the two kinds of indexing

467: needed: first, the indexing needed to identify the atom pairs

468: associated with the $B$-data; second, the indexing needed to accumulate

469: the $C$-matrix. In order to develop an overview of the indexing

470: question, we first discuss schemes for labelling atoms.

471:

472: \subsection{The labelling of atoms}

473: \label{sec:label}

474:

475: \subsubsection{General ideas about labelling}

476: \label{sec:genlabel}

477:

478: With the partition scheme outlined above, matrix multiplication

479: in a parallel linear-scaling code will need at least

480: the following five ways of labelling atoms:

481:

482: \begin{enumerate}

483: \item

484: {\bf Global labelling}. As the atoms move around, and responsibilities

485: for atoms are passed from one node to another,

486: every atom in the entire system needs its own

487: unique label. We call this scheme `global labelling'. The natural way

488: is to label the atoms from 1 to $N_{\rm tot}$, where

489: $N_{\rm tot}$ is the total number of atoms.

490:

491: \item

492: {\bf Partition labelling}. Since atoms are organised into partitions,

493: we can refer to them by saying `the $n$th atom in the $p$th partition'.

494: To ensure that all nodes use the same scheme, we adopt a standard

495: order for enumerating the partitions of the entire system, and a

496: standard order for the atoms in each partition -- the latter can be

497: in increasing order of global label, for example. We refer to this

498: scheme as `partition labelling'.

499:

500: \item

501: {\bf Local labelling}. Each node needs to know only about a subset of

502: atoms: the atoms in its primary set, plus the atoms on other

503: nodes that are in range of the matrices it has to deal with.

504: Strictly speaking, it needs to know only about atoms in the

505: halo associated with the largest cut-off radius. But in order

506: to construct the haloes, it will need to find out about a somewhat

507: larger set, which we call the `grand covering set' (GCS); the number of atoms

508: in this set is denoted by $N_{\rm GCS}$. We label the atoms in the

509: GCS sequentially from 1 to $N_{\rm GCS}$, and call

510: this scheme `local labelling'.

511:

512: \item

513: {\bf Halo labelling}. When looping over atoms in a particular

514: halo -- for example the $A$-halo when we do matrix multiplication --

515: it will be convenient to label the halo atoms sequentially. We refer

516: to this as `halo labelling'.

517:

518: \item

519: {\bf Neighbour labelling}. In order to store matrix elements

520: economically, we shall want to store only those elements

521: $X_{i \mu , j \nu}$ etc. for which the distance between

522: atoms $i$ and $j$ is less than the cut-off $R_X$, since other elements

523: vanish by definition. For each primary atom $i$, we shall run sequentially

524: through $X$-neighbours $j$. This scheme for referring to neighbours

525: $j$ is called `neighbour labelling'. Of course, the label given

526: to an atom $j$ in this scheme depends on the primary atom $i$.

527: \end{enumerate}

528:

529: Note that in these general schemes there is much freedom in the

530: order of the labelling. For example, in partition labelling, we can

531: list the partitions in different orders. Similarly, in local, halo

532: and neighbour labelling, we can list the atoms in different orders.

533: We shall want to design the storage patterns so as to optimise

534: the efficiency of the multiply-and-add operations, and so as

535: to facilitate transcription between the different labellings. To see

536: how this works in practice, we now study how to pass information

537: between nodes and how to determine the storage locations

538: in the $C$-array when accumulating the results of multiply-and-adds.

539:

540: \subsubsection{Passing information between nodes}

541: \label{sec:passing}

542:

543: When the matrix elements $B_{k \xi , j \nu}$ are passed

544: between nodes, information about the identity of the $B$-neighbours $j$

545: of each atom $k$ must also be passed, so that the receiving node can

546: determine where to accumulate the elements of the product

547: $C_{i \mu , j \nu}$. We have to ask what kind

548: of labelling should be used to send the identity of $j$.

549:

550: Global labelling would clearly express the identity of $j$ unambiguously,

551: but there is an objection to using this. The receiving node will need

552: to transcribe to the labelling it uses to refer to the storage

553: locations of $C$, which is essentially $C$-neighbour labelling. While

554: each node can transcribe from its own local, halo or neighbour

555: labelling to global labelling using an amount of storage that is

556: independent of the size of the system, transcription from global labelling

557: to the other kinds demands a storage proportional to the size of

558: the system, and this violates the requirement of linear-scaling

559: behaviour. This may not always be a problem in practice,

560: but it is certainly objectionable in principle.

561:

562: Partition labelling offers a simple way to overcome this. For each

563: $B$-neighbour $j$, we specify its partition and its

564: sequence number in the partition, with the partition identified

565: by giving its offset in the three Cartesian directions relative

566: to the partition containing $k$. The receiving node then uses

567: the partition offset and sequence number to transcribe to its own

568: local label for each $j$. The separate indexing needed by the receiving

569: node to determine the storage location of $C$-elements is described

570: next.

571:
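
The transcription from partition labelling to a local label can be
sketched as follows. This is an illustration only, not the {\sc Conquest}
routine: the table {\tt first\_in\_part}, the toy occupations and the
variable names are all hypothetical, and we assume that the receiving
node numbers the partitions it knows about on a small Cartesian grid.
The point is that the table has one entry per partition known to the
node, so its size does not grow with the size of the whole system.
\begin{verbatim}
program partition_to_local
  implicit none
  ! Assumed toy grid of 3 x 3 x 3 partitions known to this node
  integer, parameter :: L1 = 3, L2 = 3, L3 = 3
  integer :: first_in_part(L1*L2*L3)   ! local label of first atom
  integer :: natoms_in_part(L1*L2*L3)  ! atoms in each partition
  integer :: lam, jseq, j_local
  integer :: k_triplet(3), j_off(3), j_triplet(3)

  ! Toy occupation: partition lam holds 1 + mod(lam,3) atoms
  do lam = 1, L1*L2*L3
     natoms_in_part(lam) = 1 + mod(lam, 3)
  end do
  first_in_part(1) = 1
  do lam = 2, L1*L2*L3
     first_in_part(lam) = first_in_part(lam-1) + natoms_in_part(lam-1)
  end do

  k_triplet = (/ 2, 2, 2 /)    ! partition containing atom k
  j_off     = (/ 1, 0, -1 /)   ! received: offset of j's partition
  jseq      = 2                ! received: sequence number of j

  ! Partition of j, then a single composite partition label
  j_triplet = k_triplet + j_off
  lam = (j_triplet(1)-1)*L2*L3 + (j_triplet(2)-1)*L3 + j_triplet(3)
  ! Local label = first atom of that partition + position within it
  j_local = first_in_part(lam) + jseq - 1
  print *, 'partition label', lam, ' local label of j', j_local
end program partition_to_local
\end{verbatim}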

572: \subsubsection{Determining storage locations of $C$}

573: \label{sec:Clocations}

574:

575: To achieve the best speed, one would use a pre-calculated index array

576: to determine the storage location of each matrix element $C_{i \mu , j

577: \nu}$. As we run sequentially through the $B$-neighbours $j$ of each

578: atom $k$ for a given primary atom $i$, we would look up the storage

579: location where this contribution has to be accumulated onto the

580: $C$-array.  This is what was done in an earlier version of the

581: {\sc Conquest} matrix-multiplication scheme~\cite{Goringe97}. The major

582: objection to this is that the atom $j$ in $B_{k \xi , j \nu}$ is most

583: easily specified in neighbour labelling, in which case the

584: pre-calculated index array will depend on a triplet of labels $(i, k,

585: j)$, so that the storage requirements of the index become exorbitant.

586:

587: Our answer to this is to abandon the use of the pre-calculated index,

588: and instead to determine the storage locations of $C$-elements at run

589: time. As explained in Sec.~\ref{sec:passing}, it is straightforward to

590: calculate the local label of $j$ for each element $B_{k \xi , j \nu}$,

591: provided we pass the offset of the partition containing $j$ relative

592: to the partition containing $k$ in three Cartesian directions.  Our

593: procedure is then to transcribe from this local label to the $C$-halo

594: label of $j$ at run-time in the multiplication kernel, and then to

595: look up the address of each $C_{i \mu , j \nu}$ in a pre-calculated

596: index array which takes as input the primary label $i$ and the

597: $C$-halo label $j$. Since this index array depends on only two labels,

598: its storage requirements are moderate.

599:

600: \subsection{Detailed coding of the multiplication kernel}

601: \label{sec:coding}

602:

603: The principles outlined above are embodied in the F90 coding of the

604: multiplication kernel displayed in Fig.~\ref{fig:kernel}. This

605: calculates and accumulates the contributions to the $C$-matrix from

606: $A$-halo atoms $k$ in a given $A$-halo partition $K$, the latter being

607: specified in the code by the variable {\tt kpart}. Note that the outer

608: loop in the kernel goes only over $A$-halo atoms $k$ in $K$ (the

609: number of these being {\tt ahalo\%nh\_part(kpart)}), and not over {\em

610: all} atoms in $K$. The code inside the outer loop consists of three

611: sections: the three lines of the first section set up useful variables

612: associated with atom $k$; the purpose of the second section,

613: consisting of a single loop over atoms $j$, is to construct the table

614: {\tt jbnab2ch(j)} which transcribes from $B$-neighbour labelling to

615: $C$-halo labelling of the $B$-neighbour atom $j$; the third section is

616: a double loop over primary $A$-neighbours $i$ and $B$-neighbours $j$

617: of the atom $k$, within which the multiplication and accumulation are

618: performed. We now comment on the three sections.

619:

620: In the first section, {\tt k\_in\_halo} is the $A$-halo label of

621: atom $k$; the array {\tt ahalo\%j\_beg(kpart)} gives us the

622: $A$-halo label of the first $A$-halo atom in the partition $K$.

623: The variable {\tt k\_in\_part} is the sequence number of atom $k$ in $K$;

624: this means the number in the sequence when {\em all} atoms in $K$ are

625: counted, and not just $A$-halo atoms. Finally, {\tt nbkbeg} is the address

626: where $B$-neighbours of atom $k$ start in the array {\tt b()}

627: holding rows of the $B$-matrix for the current partition $K$.

628:

629: In the second section of the kernel, we loop over all $B$-neighbours

630: of the current atom $k$. The variables {\tt jpart}

631: and {\tt jseq} in this loop are the partition number and the

632: sequence number within its partition of neighbour $j$. (The

633: variable {\tt k\_off} is connected with periodic boundary conditions,

634: and will be explained in Sec.~\ref{sec:pbc}.)

635: The two quantities {\tt jpart}

636: and {\tt jseq} together constitute the `local' label of atom $j$

637: in the partition labelling scheme. The array {\tt jbnab2ch(j)}

638: transcribes from the $B$-neighbour label $j$ to the $C$-halo

639: sequence number, and plays a key role in what follows.

640:

641: The outer loop in the third section goes over the

642: {\tt at\%n\_hnab(k\_in\_halo)} primary-set $A$-neighbours

643: $i$ of $k$, {\tt nabeg} is the address in the array {\tt a()}

644: where the corresponding element of $A$ is stored, {\tt i\_in\_prim}

645: is the sequence number of $i$ in the primary set, and {\tt icad}

646: is a starting address needed later. The inner loop goes over all the

647: {\tt nbnab(k\_in\_part)} $B$-neighbours of $k$, and {\tt nbbeg}

648: is the address in the array {\tt b()} holding elements of $B$.

649: We then use the table {\tt jbnab2ch(j)} constructed in the second

650: section to transcribe from $j$ to the $C$-halo label. The

651: pre-calculated table {\tt chalo\%i\_h2d(icad+j\_in\_halo)}

652: is then used to look up the address in {\tt c()} where the product

653: will be stored. The {\tt if} statements that test the values

654: of {\tt j\_in\_halo} and {\tt ncbeg} are redundant in the maximal

655: case $R_C \ge R_A + R_B$, since they can never be satisfied,

656: but they are needed for the general case, as explained later.

657: The triple loop over {\tt n3}, {\tt n1} and {\tt n2}

658: that performs the multiplication of $n \times n$ matrices

659: for each atom triplet $(i,k,j)$ is self-explanatory.

660:

661: \subsection{Periodic boundary conditions}

662: \label{sec:pbc}

663:

664: With periodic boundary conditions (pbc), the system we study is an

665: infinite Bravais lattice with a (generally large) basis. The

666: system is invariant under translation by any Bravais lattice vector.

667: For book-keeping purposes, we regard one of the primitive unit

668: cells of the Bravais lattice as the `fundamental' cell. The

669: elements of every matrix $X_{i \mu , j \nu}$ are stored for all

670: $i$ in the fundamental cell. Now the cut-off radius $R_X$ of any

671: matrix is not to be constrained in any way by the dimensions

672: of the repeating cell, so that the atoms $j$ for which

673: $X_{i \mu , j \nu}$ is non-zero can be in the fundamental

674: cell or in any image of this cell. For given $i$, there can be

675: different atoms $j$, $j^\prime$ having non-zero values of

676: $X_{i \mu , j \nu}$ which are periodic images of each other.

677:

678: The use of pbc does not disturb the code structure already

679: established, provided one adopts the right viewpoint. Every node

680: knows about atoms in its primary set, and also about atoms in

681: its grand covering set (GCS), which contains all the haloes that it needs.

682: All atoms $i$ in its primary set can be regarded as belonging

683: to the fundamental cell, so that none is a periodic image of any

684: other. But atoms in its GCS may be periodic images of each

685: other. For most purposes, all the atoms in the GCS and

686: in all the haloes are regarded as completely distinct, irrespective

687: of whether they are periodic images or not.

688:

689: To ensure efficient and correct operation of the code with pbc, two

690: steps must be taken. First, as we loop over $A$-halo partitions $K$,

691: we note that for two such partitions that are periodically

692: equivalent, the sets of $B_{k \xi , j \nu}$ data for the

693: two partitions are identical, and we clearly do not

694: wish to communicate the same data more than once. We avoid this

695: by insisting that the order in which $A$-halo partitions

696: are passed through is such that any partition and its periodic

697: images in the halo are grouped together contiguously. Then

698: for each $A$-halo partition, we check whether it is periodically

699: equivalent to the previous one; if it is, then nothing is

700: communicated from the remote node, and we use the $B$-data most

701: recently communicated. The loop over $A$-halo partitions is outside

702: the multiplication kernel, so this does not affect the code shown

703: in Fig.~\ref{fig:kernel}.

704:
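
The first step can be illustrated by a short sketch of the loop over
$A$-halo partitions. The array {\tt home}, which records the
fundamental-cell partition of which each halo partition is an image,
and the counters are hypothetical names introduced for this example;
the fetch and the kernel call are represented only by comments.
\begin{verbatim}
program skip_repeated_fetch
  implicit none
  integer, parameter :: nhalo = 7
  ! Hypothetical data: halo partitions ordered so that periodic
  ! images of the same partition are contiguous
  integer :: home(nhalo) = (/ 4, 4, 9, 2, 2, 2, 7 /)
  integer :: kpart, nfetch, previous

  nfetch = 0
  previous = -1                ! nothing fetched yet
  do kpart = 1, nhalo
     if (home(kpart) /= previous) then
        ! stand-in for fetching the B-data of partition home(kpart)
        nfetch = nfetch + 1
        previous = home(kpart)
     end if
     ! stand-in for calling the kernel with the current B-data
  end do
  print *, 'halo partitions:', nhalo, ' fetches:', nfetch
end program skip_repeated_fetch
\end{verbatim}
With these toy data, seven halo partitions require only four fetches.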

705: The second step needed is to take account of the periodic offsets when

706: calculating the storage locations where elements of the $C$-matrix are

707: accumulated in the multiplication kernel. The $B$-matrix data supplied

708: to the kernel include a label identifying the partition to which each

709: $B$-neighbour atom $j$ belongs: this is given in the array {\tt

710: ibpart()} -- see Fig.~\ref{fig:kernel}. But this label must be

711: modified to account for the periodic offset of partition $K$.

712: Provided we use a suitable labelling of partitions in each node's GCS,

713: the modification consists of adding a single offset parameter {\tt

714: k\_off} to the partition label, this parameter being passed as an

715: argument to the multiplication kernel.  This addition to the first

716: statement in the $j$-loop (see Fig.~\ref{fig:kernel}) is the only

717: place in the kernel where pbc appear explicitly.

718:

719: \subsection{Practical construction of indexing arrays}

720: \label{sec:indexing}

721:

722: The multiplication kernel needs certain indexing arrays and

723: transcription tables, and we outline briefly the principles

724: used in constructing these.

725:

726: In constructing all the labellings, a key role is played by the grand

727: covering set (GCS) mentioned above. In general,

728: a `covering set' is a set of partitions that contains all the

729: neighbours of an atom, or the primary-set halo, associated with a

730: given cut-off radius; it is used in constructing lists of neighbours

731: and halo atoms. The GCS is the largest such covering set, so that it can

732: be used for making all the neighbour and halo lists. Each node

733: has its own GCS. It is convenient to choose the GCS to be orthorhombic

734: in shape, because this simplifies the labelling. For any given

735: cut-off radius $R_X$, the node constructs its neighbour lists

736: by passing sequentially through all atoms in its primary set

737: and all atoms in its GCS, calculating the interatomic distances,

738: comparing with $R_X$, and making a record of the neighbours. This record

739: is then used to make the corresponding halo list.

740:

741: The ordering of the lists is important, so we need to consider the

742: order for enumerating GCS partitions. In fact, we need to consider

743: two different orderings. In the first, we label GCS partitions

744: using the obvious Cartesian system. The GCS is orthorhombic and

745: the numbers of partitions along its three edges are called

746: $L_1$, $L_2$, $L_3$, so that any partition can be labelled by the

747: triplet $(l_1, l_2, l_3)$, with $1 \le l_s \le L_s$ $(s = 1,2,3)$.

748: Then we define the label $\lambda \equiv ( l_1 - 1 ) L_2 L_3 +

749: ( l_2 - 1 ) L_3 + l_3$, which we call the `Cartesian composite' (CC)

750: label.

751:
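
As a worked illustration of the CC label (the numbers are chosen
arbitrarily and are not taken from any {\sc Conquest} run), take
$L_1 = 3$, $L_2 = 4$, $L_3 = 5$. The partition
$(l_1, l_2, l_3) = (2, 3, 4)$ then has
\[
\lambda = ( 2 - 1 ) \times 4 \times 5 + ( 3 - 1 ) \times 5 + 4 = 34 ,
\]
while the corner partitions $(1, 1, 1)$ and $(3, 4, 5)$ have
$\lambda = 1$ and $\lambda = 60 = L_1 L_2 L_3$, so that $\lambda$ runs
over $1, \ldots , L_1 L_2 L_3$ with $l_3$ varying most rapidly. The
triplet can be recovered from $\lambda$ by integer arithmetic:
\[
l_3 = ( ( \lambda - 1 ) \bmod L_3 ) + 1 , \;\;\;
l_2 = ( \lfloor ( \lambda - 1 ) / L_3 \rfloor \bmod L_2 ) + 1 , \;\;\;
l_1 = \lfloor ( \lambda - 1 ) / ( L_2 L_3 ) \rfloor + 1 ,
\]
which for $\lambda = 34$ gives back $(2, 3, 4)$.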

752: CC ordering is not appropriate for enumerating $A$-halo partitions

753: when fetching $B$-data from remote nodes, because for this we should

754: group partitions together by their home node and (if relevant)

755: according to periodic images (see Sec.~\ref{sec:pbc}). To make a

756: suitable ordering for this purpose, we adopt a standard order for

757: enumerating nodes, and a standard order for primary partitions on each

758: node.  Then a node enumerates the partitions of its GCS by ordering

759: them with respect to: (i) periodic images; (ii) partitions on each

760: node; (iii) node. It passes most rapidly through periodic images, so

761: that all images of a given partition are listed contiguously; next

762: most rapidly through partitions on each node, so that all partitions

763: on each node are contiguous; and least rapidly through nodes. It is

764: convenient to start with the local node itself, then to go to the next

765: node occurring in the GCS in increasing order of the standard node

766: enumeration, wrapping round in this enumeration as necessary. We call

767: this ordering of GCS partitions node-order periodic-grouped (NOPG).

768: This is the GCS ordering used for forming the $A$-halo, so that when

769: we pass through $A$-halo partitions we are implicitly performing a

770: triple loop over halo-nodes, halo-partitions on each node, and

771: periodic images of each halo-partition -- the latter being needed to

772: avoid unnecessary communication of $B$-data, as explained in

773: Sec.~\ref{sec:pbc}.

774:
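
One simple way to realise an ordering of this kind is to sort the GCS
partitions on a composite key. The sketch below is an assumption about
how this could be done, not a description of the {\sc Conquest}
implementation: the arrays {\tt node}, {\tt pnum} and {\tt image}, the
bounds {\tt maxp} and {\tt maximg}, and the toy data are all
hypothetical.
\begin{verbatim}
program nopg_order
  implicit none
  integer, parameter :: npart = 6, nnode = 4, me = 2
  integer, parameter :: maxp = 10, maximg = 10  ! assumed upper bounds
  ! Hypothetical GCS partitions: home node (numbered from 0),
  ! partition number on that node, and periodic-image number
  integer :: node(npart)  = (/ 3, 2, 0, 2, 3, 2 /)
  integer :: pnum(npart)  = (/ 1, 2, 1, 1, 1, 1 /)
  integer :: image(npart) = (/ 2, 1, 1, 1, 1, 2 /)
  integer :: key(npart), order(npart)
  integer :: i, j, tmp

  do i = 1, npart
     ! Nodes are counted from the local node, wrapping round;
     ! images vary fastest, then partitions on a node, then nodes
     key(i) = (modulo(node(i) - me, nnode)*maxp + pnum(i))*maximg &
              + image(i)
     order(i) = i
  end do

  ! Insertion sort of the index list by increasing key
  do i = 2, npart
     tmp = order(i)
     j = i - 1
     do while (j >= 1)
        if (key(order(j)) <= key(tmp)) exit
        order(j+1) = order(j)
        j = j - 1
     end do
     order(j+1) = tmp
  end do

  do i = 1, npart
     print *, node(order(i)), pnum(order(i)), image(order(i))
  end do
end program nopg_order
\end{verbatim}
In this toy example the local node is node 2, so its partitions are
listed first, followed by those of node 3 and then, after wrapping
round, those of node 0.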

775: \subsection{The minimal case}

776: \label{sec:minimal}

777:

778: So far, we have restricted ourselves to the `maximal' case

779: $R_C \ge R_A + R_B$.

780: But, of course, we need to handle general cut-off radii.

781: As a first step towards doing this, we examine now the

782: case $R_C \le | R_A - R_B |$, which we call the `minimal'

783: case. We shall show that a trivial modification of the code already

784: presented for the maximal case gives an efficient code

785: for the minimal case. To see this, we must first outline

786: an important general principle.

787:

788: For the superscalar machines that are our main interest, the speed is

789: limited mainly by transfers between memory and registers {\em via} the

790: cache: the operations performed on the data once it is in registers

791: have little effect on the speed~\cite{Goedecker00}. So let us focus on

792: storage and transfer patterns, and ignore the register operations. In

793: the maximal case, one set of data (the $A_{i \mu , k \xi}$ matrix

794: elements) is in local memory and transfer occurs in a regular and

795: predictable way; data that behave like this will be called $X$. A

796: second set of data (the $B_{k \xi , j \nu}$ elements) is fetched from

797: remote nodes, but again is transferred in a regular way; this kind of

798: data will be called $Y$.  Finally, a third set of data (the $C_{i \mu

799: , j \nu}$ elements) is in local memory, but its transfer occurs in an

800: irregular way and addressing information has to be calculated for it

801: at run time. Data like this will be called $Z$.  The appropriate data

802: storage and transfer patterns for the minimal case are obtained by

803: identifying $C$ as $X$, $\tilde{B}$ as $Y$ and $A$ as $Z$, where

804: $\tilde{B}$ denotes the transpose of $B$ -- whereas in the maximal

805: case $A$ was $X$, $B$ was $Y$ and $C$ was $Z$. In the minimal case the

806: multiply-and-adds needed are $X = Z \cdot Y$ -- whereas in the maximal

807: case we did $Z = X \cdot Y$.  As far as the multiplication kernel is

808: concerned, it is transferring certain patterns of data from memory and

809: doing certain things with it in registers. Provided the data passed to

810: the kernel and the operations performed with it in registers are

811: changed appropriately, the maximal and minimal cases are exactly

812: equivalent.

813:

814: It follows from this that the multiplication kernel for the minimal

815: case is coded exactly as in Fig.~\ref{fig:kernel}, except that the

816: statement within the loops over {\tt n2}, {\tt n1} and {\tt n3} must

817: be replaced by the following statement:

818: \begin{verbatim}

819:                     a(n3,n1,nabeg)=a(n3,n1,nabeg)+&

820:                       c(n1,n2,ncbeg)*b(n3,n2,nbbeg)

821: \end{verbatim}

822: The structure of all the indexing arrays remains exactly as before.

823: Note that in the minimal case, as in the maximal case, no

824: tests of cut-off distances are needed, and the {\tt if} statements

825: in the kernel are redundant.

826:
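
The relation between the two updates can be checked on a single
$n \times n$ block. In the sketch below the third (address) index of
{\tt a()}, {\tt b()} and {\tt c()} is dropped for brevity. The
minimal-case statement is the one quoted above; the maximal-case
statement is our assumption of the corresponding form, inferred from
the index pattern, and should be compared with Fig.~\ref{fig:kernel}.
The loop order is likewise only one possible choice.
\begin{verbatim}
program block_updates
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 4
  real(dp) :: a(n,n), b(n,n), c(n,n)
  real(dp) :: a_ref(n,n), c_ref(n,n)
  integer :: n1, n2, n3

  call random_number(a)
  call random_number(b)

  ! Maximal case (assumed form): C accumulates the product A.B,
  ! with a() and b() both stored with the k-index first
  c = 0.0_dp
  do n2 = 1, n
     do n1 = 1, n
        do n3 = 1, n
           c(n1,n2) = c(n1,n2) + a(n3,n1)*b(n3,n2)
        end do
     end do
  end do
  c_ref = matmul(transpose(a), b)
  print *, 'maximal: max deviation', maxval(abs(c - c_ref))

  ! Minimal case (statement from the text): A accumulates the
  ! product of C with the transpose of B
  call random_number(c)
  a = 0.0_dp
  do n2 = 1, n
     do n1 = 1, n
        do n3 = 1, n
           a(n3,n1) = a(n3,n1) + c(n1,n2)*b(n3,n2)
        end do
     end do
  end do
  a_ref = matmul(b, transpose(c))
  print *, 'minimal: max deviation', maxval(abs(a - a_ref))
end program block_updates
\end{verbatim}
Both deviations should vanish to within rounding error. The two loops
read and write the same three arrays with the same index pattern, and
differ only in which array is accumulated.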

827: \subsection{General cut-off radii}

828: \label{sec:cutoff}

829:

830: To treat general cut-off radii, we consider first the case $R_C < R_A

831: + R_B$, where $R_C$ is only a little smaller than the sum of $R_A$ and

832: $R_B$. This case is efficiently handled by using the code for the

833: maximal case (see Sec.~\ref{sec:coding}), with the $j$-loop going over

834: all $B$-neighbours of $k$, but with an {\tt if}-statement inserted so

835: that the multiply-and-adds are not done if $j$ is not a $C$-neighbour

836: of $i$. Needless to say, there must not be any question of calculating

837: interatomic distances in the multiplication kernel itself.  However,

838: using the arrays which transcribe from local labelling to $C$-halo

839: labelling, and which give the storage address in the $C$-array for

840: each ($i$,$j$) pair, we can determine easily whether $j$ is a

841: $C$-neighbour of $i$. The convention we use is that the appropriate

842: array elements have the value zero if the atom $j$ is not in the

843: $C$-halo or is not a $C$-neighbour; the test is then applied

844: by the two {\tt if} statements in

845: Fig.~\ref{fig:kernel}. We call this way of treating the case $R_C <

846: R_A + R_B$ `weak reduction', the adjective `weak' referring to the

847: fact that it will work well only if $R_C$ does not fall too much below

848: $R_A + R_B$. The loss of efficiency caused by the wasted passes

849: through the {\tt do}-loop over $j$ is estimated in the Appendix.

850:
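
The zero-value convention can be illustrated for a single primary atom
$i$ and a single atom $k$. This is a sketch with hypothetical table
contents; the names {\tt jbnab2ch}, {\tt i\_h2d}, {\tt j\_in\_halo} and
{\tt ncbeg} follow the text, but the starting address {\tt icad} is
taken to be zero and the multiply-and-add is replaced by a counter.
\begin{verbatim}
program weak_reduction
  implicit none
  integer, parameter :: nb = 5       ! B-neighbours of atom k
  integer, parameter :: nchalo = 4   ! size of the toy C-halo
  ! Hypothetical tables; zero means 'not in the C-halo' or
  ! 'not a C-neighbour of i'
  integer :: jbnab2ch(nb)  = (/ 3, 0, 1, 4, 2 /)
  integer :: i_h2d(nchalo) = (/ 17, 0, 33, 49 /)
  integer :: j, j_in_halo, ncbeg, ndone, nskip

  ndone = 0
  nskip = 0
  do j = 1, nb
     j_in_halo = jbnab2ch(j)
     if (j_in_halo == 0) then       ! j is not in the C-halo
        nskip = nskip + 1
        cycle
     end if
     ncbeg = i_h2d(j_in_halo)
     if (ncbeg == 0) then           ! j is not a C-neighbour of i
        nskip = nskip + 1
        cycle
     end if
     ! stand-in for the n x n multiply-and-add at address ncbeg
     ndone = ndone + 1
  end do
  print *, 'multiply-and-adds done:', ndone, ' skipped:', nskip
end program weak_reduction
\end{verbatim}
With these toy tables, three of the five passes through the $j$-loop
perform work and two are wasted.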

851: If $R_C$ is very small, it will be more efficient to perform the

852: multiplication by `weak extension', i.e. by using the kernel for the

853: minimal case, but with two {\tt if}-statements inserted {\it exactly}

854: as for the maximal kernel (using the symmetry between the two cases

855: identified above in Sec.~\ref{sec:minimal}).  The effect of wasted

856: passes through the $j$-loop is estimated in the Appendix.

857:

858: Once the {\tt if}-statements have been put into the maximal or minimal

859: code, any case can be treated with either kernel. We will obtain correct

860: results even if we treat the minimal case using the maximal kernel,

861: or {\em vice versa}. Nevertheless, it is clearly more efficient

862: to treat weak reduction using the maximal kernel and weak extension

863: using the minimal kernel. But how, in general, can we tell whether

864: it is more efficient to use the maximal or minimal kernel? This

865: question is also addressed in the Appendix, where we show that the maximal

866: kernel should be used when $R_C > R_A$ and the minimal kernel

867: when $R_C < R_A$.

868:

869: \subsection{Load balancing}

870: \label{sec:load}

871:

872: However efficient the on-node operations and the communications may

873: be, the overall performance will be poor unless the work load is

874: balanced between processors, so that no processors are idle while

875: others are working. The partition scheme gives us a natural way to

876: achieve load balancing: our aim must be to group the partitions into

877: bundles, with each bundle containing the primary set given to a node,

878: in such a way that all nodes take roughly the same

879: computation-plus-communication time, with this time being as small as

880: possible. We plan to report in more detail elsewhere on load balancing

881: in the {\sc Conquest} code, and here we give only a brief summary of

882: our methods.

883:

884: Our approach is to derive an approximate formula for the `cost'

885: of the calculation, and to use simulated annealing to search for

886: a way of organising partitions into bundles which comes close

887: to minimising the cost. This search is done by a separate code

888: before the matrix multiplication code is run. Let $t_i^{\rm comp}$

889: and $t_i^{\rm comm}$ be the times taken by node $i$ to perform

890: on-node computation and inter-node communication, respectively,

891: and let $t_i \equiv t_i^{\rm comp} + t_i^{\rm comm}$

892: be the sum of the two. (We assume that computation and communication

893: cannot be overlapped, and that the former has to wait for the latter.)

894: Then we have to minimise the `cost function' $T$

895: given by $T = \max_i ( t_i )$; this quantity $T$ is the

896: largest of all the $t_i$ values.

897:

898: We can associate a computation time $\tau_p$ with each partition $p$.

899: This time is proportional to the number of $i , k , j$ triplets for which

900: atom $i$ is in partition $p$. The proportionality constant can be

901: estimated from the compute rate of the processors and the cpu efficiency,

902: whose determination for any processor is described in Sec.~\ref{sec:tests}.

903: The value of $t_i^{\rm comp}$ is then the sum of $\tau_p$ over

904: the partitions in the bundle assigned to processor $i$. The value of

905: $t_i^{\rm comm}$ is found by dividing the number of

906: $B_{k \xi , j \nu}$ elements that node $i$ fetches by the

907: practical inter-node transfer rate, with an allowance for latency.

908:
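
The cost model just described can be summarised compactly. The symbols
$c$ (time per triplet), $N_p^{\rm trip}$ (number of triplets with atom
$i$ in partition $p$), $\beta_i$ (the bundle assigned to node $i$),
$N_i^B$ (number of $B$ elements fetched by node $i$), $r$ (transfer
rate) and $t_i^{\rm lat}$ (latency allowance) are introduced here only
for illustration, and the additive form of the latency term is an
assumption:
\begin{eqnarray*}
\tau_p & = & c \, N_p^{\rm trip} , \\
t_i^{\rm comp} & = & \sum_{p \in \beta_i} \tau_p , \\
t_i^{\rm comm} & \simeq & N_i^B / r + t_i^{\rm lat} , \\
T & = & \max_i \left( t_i^{\rm comp} + t_i^{\rm comm} \right) .
\end{eqnarray*}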

909: We follow the usual methods of simulated annealing~\cite{NR}.

910: Let $\Gamma$

911: label a given grouping of partitions into bundles -- this consists

912: of a list of the partitions in every bundle -- and let $T_\Gamma$

913: be the value of $T$ for this grouping. Then we sample the groupings

914: $\Gamma$ in such a way that the probability of finding $\Gamma$

915: is proportional to a Boltzmann function

916: $\exp ( - T_\Gamma / \theta )$, where $\theta$ is a fictitious

917: temperature. The value of $\theta$ is systematically reduced towards

918: zero during the simulation, so that the only surviving $\Gamma$'s

919: are those whose $T_\Gamma$'s are close to the minimum possible.

920: To sample the space of groupings $\Gamma$, we use an algorithm

921: to step from one $\Gamma$ to the next. As usual in simulated

922: annealing, the new grouping is accepted without question

923: if $T_\Gamma$ decreases; it is accepted with probability

924: $\exp ( - \Delta T_\Gamma / \theta )$ if $T_\Gamma$ increases,

925: where $\Delta T_\Gamma$ is the increase.

926:
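
The acceptance rule can be sketched on a toy problem. The example below
is not the load-balancing code referred to in the text: it ignores
communication time, draws the partition times $\tau_p$ at random, uses
the simplest possible move (reassigning one partition to a randomly
chosen bundle) and an arbitrary geometric cooling schedule. All names
and parameter values are hypothetical.
\begin{verbatim}
program anneal_bundles
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: npart = 32, nbund = 4, nsteps = 20000
  real(dp) :: tau(npart), theta, t_old, t_new, r(3)
  integer :: bundle(npart), p, b_old, step
  logical :: accept

  call random_number(tau)              ! toy partition times
  do p = 1, npart
     bundle(p) = 1 + mod(p, nbund)     ! initial grouping
  end do
  theta = 1.0_dp                       ! fictitious temperature
  t_old = cost(bundle)

  do step = 1, nsteps
     call random_number(r)
     p = 1 + int(r(1)*npart)           ! partition to move
     b_old = bundle(p)
     bundle(p) = 1 + int(r(2)*nbund)   ! trial bundle
     t_new = cost(bundle)
     ! Accept if the cost does not rise; otherwise accept with
     ! probability exp(-increase/theta)
     accept = (t_new <= t_old)
     if (.not. accept) accept = (r(3) < exp(-(t_new - t_old)/theta))
     if (accept) then
        t_old = t_new
     else
        bundle(p) = b_old              ! reject: undo the move
     end if
     theta = theta*0.9995_dp           ! assumed cooling schedule
  end do
  print *, 'final cost', t_old, ' lower bound', sum(tau)/nbund

contains

  ! Cost = largest total time over all bundles
  function cost(bun) result(t)
    integer, intent(in) :: bun(:)
    real(dp) :: t
    integer :: b
    t = 0.0_dp
    do b = 1, nbund
       t = max(t, sum(tau, mask=(bun == b)))
    end do
  end function cost

end program anneal_bundles
\end{verbatim}
The printed lower bound is the cost that perfect balance would give;
the annealed cost can approach it but cannot fall below it.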

927: The choice of algorithm for stepping from one $\Gamma$ to the

928: next is not trivial. A significant problem arises from the

929: rather curious definition of the cost function as the {\em maximum}

930: of $t_i$ values. As we search through groupings $\Gamma$,

931: the time $T_\Gamma$ changes only if we make reassignments of

932: partitions affecting the particular bundle having the maximum

933: $t_i$. For large numbers of processors, this can make equilibration

934: of the `thermal' ensemble very slow. The measures taken

935: to overcome this will be described elsewhere.

936:

937: \section{Coding considerations}

938: \label{sec:codcon}

939:

940: One of the key aims when coding \textsc{Conquest} was to ensure portability

941: as well as performance.  Accordingly, the code was written in F90 with

942: all communications written using MPI1 calls (though, as explained later,

943: consideration was also given to rewriting critical communications with

944: high speed, system specific calls, such as {\tt shmem} on the Cray T3E).

945: F90 improves the readability and structure of the code through use of

946: modules and derived types, while retaining enough similarity to

947: \textsc{Fortran77} to be widely understood by the scientific community.

948:

949: The communications are key to the efficiency of the code. There are two

950: types of object to be communicated: the matrix elements themselves and the

951: accompanying indexing.  Efficient communication of the matrix elements

952: presents no problems, since these form contiguous arrays, and are passed

953: in relatively large portions (all elements for all the

954: atoms in a partition $K$).  The communication of the indexing, however,

955: raises other problems.

956:

957: The indexing for a matrix is all enclosed in a single derived type,

958: {\tt type(matrix)}, of which only part needs to be communicated.  The

959: derived type is defined so that it contains all the data for all the atoms

960: in a partition $I$ (and therefore an array of these types is defined to

961: hold the data for a node's primary set).  One possibility for the

962: communication of the indexing would be to use the MPI1 facilities for

963: registering a new type and to pass individual partitions in a single call

964: for that new type.  However, the implementation of communication of defined

965: types within MPI1 leaves room for loss of efficiency, and another way was

966: found.  Essentially, those indexing arrays which need to be communicated

967: are defined within the type as pointers, and point to different parts of

968: an integer array.  Then the communication of all relevant indices for

969: a partition can be made via a single MPI call, passing a single, large

970: integer array.  Pointers are then put into place at the receiving node

971: to access the indices.  Both the matrix element arrays and this large

972: integer array are declared as global variables so that on the Cray

973: they will be symmetric, and {\tt shmem} calls can be used on them.

974:
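
The pointer arrangement can be illustrated as follows. This is a
minimal sketch rather than the actual definition of {\tt type(matrix)}:
the type name {\tt index\_view}, the component {\tt ibseq} and the toy
sizes are hypothetical, and the MPI call itself is only indicated by a
comment.
\begin{verbatim}
program pack_indices
  implicit none
  type index_view
     integer, pointer :: nbnab(:)   ! neighbours of each atom
     integer, pointer :: ibpart(:)  ! partition of each neighbour
     integer, pointer :: ibseq(:)   ! sequence number of each one
  end type index_view
  integer, parameter :: natom = 3, nnab = 5
  integer, target :: buffer(natom + 2*nnab)
  type(index_view) :: idx

  ! Each component points at its own section of the one buffer
  idx%nbnab  => buffer(1:natom)
  idx%ibpart => buffer(natom+1:natom+nnab)
  idx%ibseq  => buffer(natom+nnab+1:natom+2*nnab)

  ! Filling the components fills the buffer
  idx%nbnab  = (/ 2, 1, 2 /)
  idx%ibpart = (/ 4, 4, 7, 2, 2 /)
  idx%ibseq  = (/ 1, 3, 1, 2, 5 /)

  ! The whole buffer could now be sent with one MPI call; the
  ! receiver would set up the same three pointers on its copy
  print *, buffer
end program pack_indices
\end{verbatim}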

975: While the code has been written within MPI1, the communication pattern

976: is naturally a one-sided one (since individual nodes require data

977: for partitions at disparate times).  We have developed a

978: quasi-one-sided scheme within MPI1, as well as using truly one-sided

979: schemes where available (e.g. under MPI2, which is not yet

980: sufficiently widely supported to be truly portable, or under {\tt

981: shmem} on the Cray T3E or SGI Origin2000).  At the start of the matrix

982: multiplication, each node issues a series of non-buffered, non-blocking

983: sends (assuming, quite reasonably, that no implementation of MPI1 would

984: buffer the {\tt MPI\_issend} call).  As partition data is required, nodes

985: issue {\tt MPI\_recv} or {\tt MPI\_irecv} calls (as appropriate) to obtain

986: the data.  This scheme works extremely well, and shows good performance

987: and scaling.  It is easy to overlap communication and calculation by

988: issuing {\tt MPI\_irecv} calls ahead of time.

989:
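
The pattern of calls can be sketched as follows. This is an
illustration of the scheme as described, not an extract from
{\sc Conquest}: every node sends the same small toy buffer to every
other node, the message tag is taken to be the rank of the sender, and
all variable names are hypothetical. The same buffer is read by several
pending sends, which works in practice and is explicitly permitted by
later versions of the MPI standard.
\begin{verbatim}
program quasi_one_sided
  implicit none
  include 'mpif.h'
  integer, parameter :: nlen = 4
  integer :: ierr, me, nproc, p, src, nreq
  integer :: status(MPI_STATUS_SIZE)
  integer, allocatable :: sendbuf(:), recvbuf(:), req(:), stats(:,:)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  allocate(sendbuf(nlen), recvbuf(nlen), req(nproc))
  allocate(stats(MPI_STATUS_SIZE, nproc))
  sendbuf = me                     ! toy 'partition data'

  ! At the start: post synchronous, non-blocking sends
  nreq = 0
  do p = 0, nproc-1
     if (p == me) cycle
     nreq = nreq + 1
     call MPI_ISSEND(sendbuf, nlen, MPI_INTEGER, p, me, &
                     MPI_COMM_WORLD, req(nreq), ierr)
  end do

  ! Later: each node collects data when it needs them, starting
  ! with the next node and wrapping round
  do p = 1, nproc-1
     src = modulo(me + p, nproc)
     call MPI_RECV(recvbuf, nlen, MPI_INTEGER, src, src, &
                   MPI_COMM_WORLD, status, ierr)
     ! stand-in for using the received data in the kernel
  end do

  ! The sends complete once the matching receives have started
  call MPI_WAITALL(nreq, req, stats, ierr)
  call MPI_FINALIZE(ierr)
end program quasi_one_sided
\end{verbatim}
Replacing {\tt MPI\_RECV} by {\tt MPI\_IRECV} issued one step ahead,
with a wait before the data are used, would give the overlap of
communication and calculation mentioned above.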

990: \section{Practical tests}

991: \label{sec:tests}

992:

993: We want to test the scaling properties and the efficiency of the code.

994: As emphasised in our earlier tests on the {\sc Conquest}

995: code~\cite{Goringe97}, there are different types of scaling. For given

996: cut-offs $R_A$, $R_B$, $R_C$, the cpu and memory needed to do the

997: multiplication should be proportional to the number of atoms: this is

998: called `intrinsic scaling'. For a given calculation on a given number

999: of atoms, one hopes to find cpu and memory inversely proportional to

1000: the number of processors: `strong parallel scaling'. Finally, we can

1001: examine the cpu and memory as the numbers of atoms and processors are

1002: scaled up with the number of atoms per processor held fixed: `weak

1003: parallel scaling'. There is a relation between the three scaling

1004: types, so that we can assess the scaling behaviour completely by

1005: studying weak and strong parallel scaling, and we shall present

1006: results on this. Good scaling does not necessarily mean good

1007: efficiency. For example, if the on-node operations were very

1008: inefficient, communications could be rendered negligible, so that

1009: parallel scaling would appear very good, even though the underlying

1010: performance was poor. We shall therefore pay special attention to cpu

1011: efficiency.

1012:

1013: Our main tests are on completely random arrangements of atoms

1014: with periodic boundary conditions; the position

1015: of each atom in the cell is created by using a random-number generator

1016: to draw its $x$, $y$ and $z$ coordinates from a uniform

1017: probability distribution. We choose to use random positions

1018: rather than, for example, perfect-crystal positions, because the

1019: randomness should provide more of a challenge to load balancing,

1020: especially for small numbers of atoms per node. In order to make

1021: contact with the real world, we take a mean density of atoms

1022: equal to that of diamond-structure silicon, and we

1023: use values of $R_A$, $R_B$ and $R_C$ typical of those needed in real

1024: first-principles calculations with the {\sc Conquest} code.

1025: The values of the matrix elements are irrelevant for these tests,

1026: since we are only concerned with the time taken to perform

1027: the operations; we have actually taken their values to be a simple

1028: radial function of interatomic separation. The number of indices

1029: on each atom is taken to be $n = 4$.

1030:

1031: We start by presenting results obtained on the Cray T3E-1200E

1032: using {\tt shmem} communications. We examine the maximal

1033: case ($R_C = R_A + R_B$) with $R_A = 8.46$~\AA, $R_B = 4.23$~\AA.

1034: This corresponds roughly to performing the multiplication

1035: $L \cdot S$ in {\sc Conquest} ($L$ is the auxiliary density

1036: matrix, $S$ the overlap matrix). Fig.~\ref{fig:Weak} shows results from our

1037: tests of weak parallel scaling with 80 atoms per processor

1038: and processor numbers going from 2 to 250 (numbers of

1039: atoms from 160 to 20,000). The simulation cell is always a cube

1040: in these tests, and the partitions are also taken to be cubic.

1041: The partitions always have the same size, and the average number of

1042: atoms per partition is 20. To obtain the timings, we put a

1043: timer in the code running on each node, and we report the maximum

1044: and minimum times, together with the average time. (The times

1045: include all communication, but they

1046: deliberately exclude the construction of the indexing

1047: arrays needed as input to the multiplication kernel; this is

1048: justified, since in practice these arrays are reconstructed

1049: only when the atoms move, whereas many matrix multiplications

1050: are done for each set of atomic positions.)

1051: The following points

1052: are noteworthy. First, the spread between minimum and maximum times

1053: is small, so that the load balancing is good. In practice,

1054: the time taken is governed by the time of the slowest node. Since

1055: the maximum time is only $\sim 6.4$~\% greater than the average,

1056: little would be gained by working harder on load balancing, at

1057: least for random atomic arrangements. Second, the times are essentially

1058: constant, the time on 250 nodes being only $\sim 4$~\% greater

1059: than that on 16 nodes. This means that the size of system

1060: that could be treated is limited only by the number of nodes

1061: available: there is essentially no limitation from the

1062: parallel behaviour of the code itself.

1063:

1064: Tests of strong parallel scaling performed for the same $R_A$ and

1065: $R_B$ values (maximal case) with a total of 4096 atoms in the whole

1066: cell are shown in Fig.~\ref{fig:Strong}, where we show the total time

1067: summed over the processors as a function of processor

1068: number (which would be completely flat if the scaling were

1069: ideal).  A useful way to assess the scaling is by comparing the times

1070: with Amdahl's law, which states that the time taken is $t = t_0 ( ( 1

1071: - x ) + x / N_{\rm proc} )$, where $N_{\rm proc}$ is the number of

1072: processors, with $x$ and $( 1 - x )$ the fractions of the calculation

1073: performed in parallel and serially. This formula fits the maximum

1074: times very well, and we find the value $x = 0.9986$, so that the

1075: serial fraction is 0.0014. This means that strong scaling is

1076: extremely good up to $\sim 200$ processors, or $\sim 20$ atoms

1077: per processor. Since it would be wasteful of memory to run {\sc

1078: Conquest} with fewer atoms per processor than this, the implication is

1079: that strong scaling will be very well satisfied in practical

1080: calculations.
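
To make the size of the effect concrete, the following worked numbers are
our own arithmetic from Amdahl's formula with the fitted value of $x$. They
are an illustration which assumes that the fit holds over the whole range of
$N_{\rm proc}$, and are not additional measurements. Multiplying the formula
by $N_{\rm proc}$ gives the summed time relative to its ideal value $t_0$:
\begin{equation}
\frac{N_{\rm proc} \, t}{t_0} = x + ( 1 - x ) N_{\rm proc} \; .
\end{equation}
With $x = 0.9986$, this ratio is $\simeq 1.09$ for $N_{\rm proc} = 64$ and
$\simeq 1.28$ for $N_{\rm proc} = 200$, so that the departure from ideal
scaling grows linearly with $N_{\rm proc}$ at a rate set by the serial
fraction $1 - x$.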

1081:

1082: We now turn to the minimal case ($R_C \le \mid R_A - R_B \mid$).  We

1083: argued in Sec.~\ref{sec:minimal} that the speed will be limited mainly

1084: by transfers to and from memory, rather than by operations in

1085: registers. This implies that, provided we use the minimal, rather than

1086: the maximal, multiplication kernel, the timings for the multiplication

1087: with $R_A = 12.69$~\AA, $R_B = 4.23$~\AA, $R_C = 8.46$~\AA\ should be

1088: almost the same as for $R_A = 8.46$~\AA, $R_B = 4.23$~\AA, $R_C =

1089: 12.69$~\AA. We have tested this explicitly by performing

1090: multiplication with $R_A = 12.69$~\AA, $R_B = 4.23$~\AA, $R_C =

1091: 8.46$~\AA\ on 4096 atoms with different numbers of processors. The

1092: results are also reported in Fig.~\ref{fig:Strong}. We see that the

1093: timings are very close to our results for the maximal case, though

1094: higher by $\sim 16$~\%. This means that all our conclusions about weak

1095: and strong scaling will be the same for the minimal case.

1096:

1097: We now turn to cpu efficiency. Here, we are solely concerned

1098: with the execution speed of the multiplication kernel itself,

1099: so that communication time is excluded. As explained in the

1100: Introduction, efficiency is defined in terms of the minimum

1101: number of operations required to perform the multiplications

1102: of eqn~(\ref{eqn:mult}). For a given node, this is the

1103: number of $(i,k,j)$ triplets for $i$ in that node's primary

1104: set, multiplied by $n_1 \times n_3 \times n_2$, multiplied

1105: by a factor of 2 to allow for the fact that we do a multiply

1106: and add for each $( i \mu , k \xi , j \nu )$. This  minimum

1107: number of operations divided by the time spent in the multiplication

1108: kernel gives the `useful' Mflop rate, which is then divided

1109: by the peak Mflop rate of the processor to give the

1110: cpu efficiency.

1111:

1112: We have done efficiency tests on both perfect-crystal and random

1113: atomic positions. As an illustration, we report results on

1114: perfect-crystal silicon (lattice parameter~= 5.46~\AA) with cut-offs

1115: $R_A = 6$~\AA, $R_B = 10$~\AA, $R_C = 16$~\AA\ with 1728 atoms in the

1116: total system. The tests were done with 64 cubic partitions and 16

1117: processors on the Cray T3E-1200E, the partitions being distributed

1118: unevenly over the processors. The `useful' Mflop rate was found to be

1119: almost identical on all processors, the average being 101.4~Mflops,

1120: and minimum and maximum being 101.1 and 103.6~Mflops. The theoretical

1121: peak speed of the processor is 1.2~Gflops, so that the efficiency is

1122: 8.4~\%. This result is reasonably satisfactory in two ways. First, it

1123: shows that all the operations associated with transcription between

1124: labellings, the determination of storage addresses, etc.\ cannot be

1125: taking a large amount of time (recall that these are not included in

1126: the number of `useful' operations); second, it shows that fairly good

1127: use is being made of the cache.
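
Written out explicitly, the definition of cpu efficiency used above reads as
follows. The symbols $N_{\rm trip}$, $t_{\rm ker}$, $F_{\rm useful}$ and
$F_{\rm peak}$ are shorthand introduced here purely for illustration and are
not notation used elsewhere in this paper. If $N_{\rm trip}$ is the number of
$(i,k,j)$ triplets for a given node and $t_{\rm ker}$ is the time that node
spends in the multiplication kernel, then:
\begin{equation}
F_{\rm useful} = \frac{2 \, n_1 n_2 n_3 \, N_{\rm trip}}{t_{\rm ker}} \; ,
\; \; \; \; \; {\rm efficiency} = \frac{F_{\rm useful}}{F_{\rm peak}} \; .
\end{equation}
For the silicon test just described, $F_{\rm useful} = 101.4$~Mflops and
$F_{\rm peak} = 1200$~Mflops, so that the ratio is
$101.4 / 1200 \simeq 0.0845$, which is the figure of about 8.4~\% quoted above.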

1128:

1129: We have also performed tests on a number of other machines: a Beowulf

1130: linked by fast ethernet; a Beowulf linked by SCI interconnect; and an SGI

1131: Origin2000.  We have tested weak scaling on

1132: all these systems to investigate the portability of the code, and

1133: the importance of efficient communications.  The {\em increase} in

1134: time per node as the system size is increased is shown in

1135: Fig.~\ref{fig:systems}.  We remark that, given perfect communication,

1136: the line ought to be flat at 1.0, and that to all intents and purposes

1137: this is true for the Origin2000 and T3E.  The SCI Beowulf system performs

1138: well and should be adequate for {\sc Conquest} calculations.  The fast

1139: ethernet Beowulf, however, shows far too large an increase.

1140:

1141: Finally, we assess briefly the implications of our new methods

1142: for practical calculations with the full {\sc Conquest} code.

1143: The present matrix multiplication code has been incorporated

1144: into {\sc Conquest}, and we have run tests in which the energy of

1145: the self-consistent ground state is calculated for

1146: silicon systems. In searching for the ground state, the

1147: basic step is the `self-consistency iteration', in which an

1148: input charge density is used to calculate the Kohn-Sham

1149: Hamiltonian matrix, and the techniques of Ref.~\cite{Bowler99} are

1150: used to determine the $L$-matrix that yields the non-self-consistent

1151: ground state for the current Hamiltonian, and hence the output

1152: charge density.  We search for self-consistency using the

1153: GR-Pulay technique\cite{Bowler00b}.

1154:

1155: We report in Fig.~\ref{fig:Conquest} timings for a single

1156: self-consistency iteration obtained for crystalline silicon containing

1157: numbers of atoms going from 4096 to 12288 when the calculations are

1158: run on 512 processors of the T3E-1200E.  For these tests, we used a

1159: norm-conserving non-local pseudopotential\cite{Kleinman82}.  The

1160: support-region radius $R_{\rm reg}$ was equal to 2.715~\AA\ and the

1161: $L$-matrix cut-off, $R_L$ equal to 6.9~\AA\ (for definitions of these

1162: cut-offs, see Ref.~\cite{Hernandez96}); these values are typical of

1163: what would be used in practice. The Figure shows that the scaling with

1164: system size is so good that the time is actually a slightly sub-linear

1165: function of atom number. We believe that this is because, as the

1166: primary sets become larger, the number of nodes with which

1167: communication is required actually decreases, making the time required

1168: decrease.  To put these results in context, we note that, starting

1169: from scratch, typically 10 self-consistency iterations are needed to

1170: reach the self-consistent ground state with acceptable accuracy for

1171: fixed support functions, and that a further factor of 10 should be

1172: allowed for variation of the support functions. This means that, with

1173: the methods presented here, the attainment of the full ground state

1174: with an accuracy comparable to that expected in a conventional

1175: plane-wave calculation can be accomplished for a 12,000-atom system in

1176: a few hours on a 512-processor T3E.

1177:

1178: \section{Discussion and conclusions}

1179: \label{sec:disc}

1180:

1181: We have shown that our code for parallel multiplication of

1182: locally sparse matrices achieves two important things:

1183: first, it displays almost perfect linear scaling with respect

1184: to numbers of atoms and processors; second, it achieves good

1185: cpu efficiency.

1186:

1187: We remark that we have every right to expect the linear scaling

1188: to be perfect when the numbers of processors and atoms

1189: are increased together, with a constant number of atoms per

1190: processor (weak scaling): for this type of scaling, the amounts

1191: of computation and communication should, indeed, remain

1192: constant. What is less trivial is the type of scaling in which

1193: the number of processors is increased with the total number of

1194: atoms held fixed (strong scaling). In contrast to

1195: some previous schemes, the present code shows excellent strong

1196: scaling, at least on machines having a fast communications

1197: network. Good cpu efficiency is also non-trivial.

1198:

1199: A key idea in the present scheme is the management of the atoms

1200: in small compact groups, referred to as `partitions'. Although

1201: this certainly makes the code more complex, it has the enormous

1202: advantage of giving an internal structure to the calculation,

1203: which can be used in a number of ways: it gives a flexible means

1204: of constructing the primary sets that are assigned to processors;

1205: it enables us to manage the communications by breaking them into

1206: packets of convenient size; it does the same for the on-node

1207: computations, so that we work with a convenient `multiplication kernel';

1208: it provides a labelling scheme which allows nodes to

1209: identify to each other the atoms for which data is communicated;

1210: finally, it aids efficient cache use by improving data locality.

1211: It is important to note that the present scheme of partitions

1212: and bundles for managing atoms

1213: strongly resembles the scheme of blocks and domains for managing

1214: integration-grid points in {\sc Conquest} described in our

1215: earlier work~\cite{Goringe97}. A further key concept in the present

1216: work is that of transcription between different atom-labelling

1217: schemes, which is crucial in reducing the computational

1218: overheads in the multiplication kernel.

1219:

1220: At the start of our analysis (Sec.~\ref{sec:assumptions}), we made

1221: certain assumptions about the storage of matrix elements

1222: and the way the matrix multiplication should be done, and we

1223: now ask whether anything could be gained by changing these

1224: assumptions. Our answer is that little is likely to be gained.

1225: With the scheme we have presented, the scaling properties are already

1226: close to ideal, and there is little room for improvement. It is

1227: true that our cpu efficiency is only $\sim 10$~\%, and there

1228: is clearly scope for improvement here. However, one should

1229: remember that efficiencies as high as 50~\% on superscalar

1230: processors are achievable only in the multiplication

1231: of large non-sparse matrices, and that serious sparsity is

1232: bound to incur a serious loss of efficiency. Our suspicion

1233: is that an improvement of the cpu efficiency by as much as a factor

1234: of two is unlikely, and that any such improvements are more likely

1235: to come from work on the multiplication kernel than by changing

1236: the basic assumptions. Time will tell.

1237:

1238: \section*{Acknowledgments}

1239: \label{sec:ack}

1240:

1241: We would like to thank Dr.~S.~Goedecker for allowing us to read an

1242: early copy of Ref.~\cite{Goedecker00}.

1243: The work of DRB was supported by EPSRC grant GR/M01753 and by an EPSRC

1244: Postdoctoral Fellowship in Theoretical Physics.

1245: The position of MJG is partially

1246: supported by GEC and CLRC Daresbury Laboratory. Calculations on the

1247: Cray T3E at CSAR and on the Origin 2000 at the UCL HiPerSPACE Centre

1248: were supported by grants GR/M01753 and JR98UCGI. We are indebted to

1249: Dr.~S.~Pickles and colleagues at CSAR for their help in optimising the

1250: code. Useful discussions with Dr.~I.~Bush in the early stages of the

1251: work are acknowledged.  Assistance from Dr.~B.~Searle, Mr.~R.~Harker

1252: and Mr.~A.~Keller with tests on Beowulf systems is also acknowledged.

1253: \bigskip

1254:

1255: \begin{thebibliography}{99}

1256: \bibitem{Goedecker99}S.Goedecker, Rev. Mod. Phys. {\bf 71}, 1085 (1999).

1257: \bibitem{Pettifor89}D.G.~Pettifor, Phys. Rev. Lett. {\bf 63}, 2480

1258: (1989).

1259: \bibitem{Yang91}W.~Yang, Phys. Rev. Lett. {\bf 66}, 1438 (1991).

1260: \bibitem{Galli92}G.~Galli and M.~Parrinello, Phys. Rev. Lett. {\bf 69},

1261: 3547 (1992).

1262: \bibitem{Li93}X.-P.~Li, R.W.~Nunes and D.~Vanderbilt, Phys. Rev. B {\bf 47},

1263: 10891 (1993).

1264: \bibitem{Daw93}M.S.~Daw, Phys. Rev. B {\bf 47}, 10895 (1993).

1265: \bibitem{Ordejon93}P.~Ordej\'on, D.~Drabold, M.~Grumbach

1266: and R.M.~Martin, Phys. Rev. B {\bf 48}, 14646 (1993).

1267: \bibitem{Aoki93}M.~Aoki, Phys. Rev. Lett. {\bf 71}, 3842 (1993).

1268: \bibitem{Mauri93}F.~Mauri, G.~Galli and R.~Car, Phys. Rev. B {\bf 47},

1269: 9973 (1993).

1270: \bibitem{Goedecker94} S. Goedecker and L. Colombo, Phys. Rev. Lett.

1271: {\bf 73}, 122 (1994).

1272: \bibitem{Kress95}J.D.~Kress and A.F.~Voter,

1273: Phys. Rev. B {\bf 53}, 12733 (1996).

1274: \bibitem{Kohn96}W.~Kohn, Phys. Rev. Lett. {\bf 76}, 3168 (1996).

1275: \bibitem{Horsfield96}A.P.~Horsfield, Mat. Sci. Engin. B {\bf 37},

1276: 219 (1996).

1277: \bibitem{Bowler97}D.R.~Bowler, M.~Aoki, C.M.~Goringe, A.P.~Horsfield

1278: and D.G.~Pettifor, Modell. Simul. Mater. Sci. Eng. {\bf 5}, 199 (1997).

1279: \bibitem{Stechel94}E.B.~Stechel, A.R.~Williams

1280: and P.J.~Feibelman, Phys. Rev. B {\bf 49}, 10088 (1994).

1281: \bibitem{Hernandez95}E.H.~Hern\'{a}ndez and M.J.~Gillan, Phys. Rev. B

1282: {\bf 51}, 10157 (1995).

1283: \bibitem{Haynes97}

1284: P.D.~Haynes and M.C.~Payne, Comput. Phys. Commun. {\bf 102},

1285: 17 (1997).

1286: \bibitem{Schweg96}E.Schwegler and M.Challacombe, J.Chem. Phys. {\bf 105}, 2726

1287: (1996).

1288: \bibitem{Burant96}J.C.Burant, G.E.Scuseria and M.J.Frisch, J. Chem. Phys. {\bf

1289: 105}, 8689 (1996).

1290: \bibitem{Ochsen98}C.Ochsenfeld, C.A.White and M.Head-Gordon, J. Chem. Phys.

1291: {\bf 109}, 1663  (1998).

1292: \bibitem{Challa00}M.~Challacombe, Comput. Phys. Commun. {\bf 128}, 93 (2000).

1293: \bibitem{Goringe97}C.M.~Goringe, E.H.~Hern\'andez, M.J.~Gillan and

1294: I.J.~Bush, Comput. Phys. Commun. {\bf 102}, 1 (1997).

1295: \bibitem{Bowler00}D.R.~Bowler, I.J.~Bush and M.J.~Gillan, Int. J. Quant.

1296: Chem. {\bf 77}, 831 (2000).

1297: \bibitem{Hernandez96}E.~Hern\'{a}ndez, M.J.~Gillan and C.M.~Goringe,

1298: Phys. Rev. B {\bf 53}, 7147 (1996).

1299: \bibitem{Mauri94}F.~Mauri and G.~Galli, Phys. Rev. B {\bf 50}, 4316 (1994).

1300: \bibitem{Ordejon95}P.~Ordej\'on, D.A.~Drabold, R.M.~Martin and M.P.~Grumbach,

1301: Phys. Rev. B {\bf 51}, 1456 (1995).

1302: \bibitem{Itoh95}S.~Itoh, P.~Ordej\'on and R.M.~Martin, Comput. Phys. Commun.

1303: {\bf 88}, 173 (1995).

1304: \bibitem{Canning96}A.~Canning, G.~Galli, F.~Mauri, A.~De~Vita and R.~Car,

1305: Comput. Phys. Commun. {\bf 94}, 89 (1996).

1306: \bibitem{Wang96}C.Z.~Wang, S.Y.~Qiu and K.M.~Ho, Comput. Mat. Sci. {\bf 7},

1307: 315 (1996).

1308: \bibitem{Kress98}J.D.~Kress, S.~Goedecker, A.~Hoisie, H.~Wassermann,

1309: O.~Lubeck, L.~Collins and B.~Holian, J. Comput. Aided Mat. Des. {\bf 5},

1310: 295 (1998).

1311: \bibitem{NR}W.H.Press, S.A.Teukolsky, W.T.Vetterling and B.P.Flannery,

1312: Numerical Recipes (Cambridge University Press, Cambridge, 1992).

1313: \bibitem{Bowler99}D.R.~Bowler and M.J.~Gillan, Comput. Phys. Commun.

1314: {\bf 120}, 95 (1999).

1315: \bibitem{Goedecker00}S.Goedecker and A.Hoisie, Achieving High

1316: Performance in Numerical Computations on Modern Computer Architectures

1317: (SIAM, in press, 2000).

1318: \bibitem{Bowler00b}D.R.~Bowler and M.J.~Gillan, Chem. Phys. Lett. {\bf 325},

1319: 473 (2000).

1320: \bibitem{Kleinman82} L.Kleinman and D.M.Bylander, Phys. Rev. Lett.

1321: {\bf 48}, 1425 (1982).

1322: \end{thebibliography}

1323:

1324: \section*{Appendix: Reduction and extension, maximal

1325: and minimal kernels}

1326:

1327: As explained in the text, different multiplication kernels

1328: should be used for large and small values of the cut-off

1329: radius $R_C$ of the product matrix $C$: for large and small $R_C$, we

1330: should use the maximal and minimal kernels respectively.

1331: In this Appendix, we discuss how to decide the cross-over

1332: point from the maximal to the minimal case

1333: as the radii $R_A$, $R_B$ and $R_C$ are varied.

1334:

1335: If we have particular values of $R_A$, $R_B$ and $R_C$,

1336: then for each atom $i$ in a node's primary set, there is a certain

1337: number of pairs $(k,j)$ for which multiply-and-adds have to

1338: be done. Our algorithm for deciding the cross-over point will

1339: be based on estimates for this number of pairs, which

1340: we denote by $\nu_{\rm p}$. If we treat the problem as weak reduction,

1341: then we ignore the constraint supplied by $R_C$, and instead of

1342: the true number of pairs $\nu_{\rm p}$ we actually visit

1343: a larger number of pairs, which we call $\nu_{\rm p}^{\rm max}$.

1344: The merit of doing it like this can be gauged by the ratio

1345: $\eta^{\rm max} \equiv \nu_{\rm p} / \nu_{\rm p}^{\rm max}$.

1346: If this ratio becomes

1347: too small, then it is wasteful to treat it as weak reduction. Similarly,

1348: if we treat it as weak extension, we visit a number of pairs

1349: $\nu_{\rm p}^{\rm min}$ which will generally be greater than

1350: $\nu_{\rm p}$, so we can gauge the merit of this method by the

1351: ratio $\eta^{\rm min} \equiv \nu_{\rm p} / \nu_{\rm p}^{\rm min}$.

1352: We should go for the method that gives the largest $\eta$ value,

1353: so the policy will be that for given values of $R_A$,

1354: $R_B$ and $R_C$ we treat it as weak reduction if $\nu_{\rm p}^{\rm max} <

1355: \nu_{\rm p}^{\rm min}$ and as weak extension if $\nu_{\rm p}^{\rm min} <

1356: \nu_{\rm p}^{\rm max}$.

1357:

1358: The values of $\nu_{\rm p}$, $\nu_{\rm p}^{\rm max}$ and

1359: $\nu_{\rm p}^{\rm min}$ in practice will depend on the details

1360: of the atomic positions. We need a simple approximate way of

1361: estimating the numbers of pairs. To do this, we assume

1362: that the density of atoms is uniform, with

1363: $\rho_0$ atoms per unit volume. Then our estimate for the number of

1364: pairs is:

1365: \begin{equation}

1366: \nu_{\rm p} = \rho_0^2

1367: \int_{r_1 < R_A , \, r_2 < R_C , \, \mid {\bf r}_2 - {\bf r}_1 \mid < R_B}

1368: d {\bf r}_1 d {\bf r}_2

1369: \end{equation}

1370: It is simple geometry to evaluate the integral. Let $\Omega ( R_A ,

1371: R_B ; r )$ be the overlap volume of two spheres

1372: of radius $R_A$, $R_B$ when the distance between their

1373: centres is $r$. Then:

1374: \begin{equation}

1375: \nu_{\rm p} ( R_A , R_B , R_C ) = 4 \pi \rho_0^2

1376: \int_0^{R_C} dr \, r^2 \Omega ( R_A , R_B ; r ) \; .

1377: \label{eqn:volume}

1378: \end{equation}

1379:

1380: It will be useful to note some symmetry properties of

1381: $\nu_{\rm p} ( R_A , R_B , R_C )$. Since $\Omega ( R_A , R_B ; r )$

1382: is invariant under interchange of $R_A$ and $R_B$, the

1383: same is true of $\nu_{\rm p}$. In fact, it is easy to

1384: show that $\nu_{\rm p}$ is completely invariant with respect

1385: to all permutations of $R_A$, $R_B$ and $R_C$:

1386: \begin{eqnarray}

1387: \nu_{\rm p} ( R_A , R_B , R_C ) & = & \nu_{\rm p} ( R_C , R_A , R_B ) =

1388: \nu_{\rm p} ( R_B , R_C , R_A ) \nonumber \\

1389: & = & \nu_{\rm p} ( R_B , R_A , R_C ) = \nu_{\rm p} ( R_A , R_C , R_B ) =

1390: \nu_{\rm p} ( R_C , R_B , R_A ) \; .

1391: \end{eqnarray}

1392:

1393: These symmetries very much simplify the calculation of $\nu_{\rm p}$

1394: for general values of $R_A$, $R_B$ and $R_C$. Let $R_1$, $R_2$ and

1395: $R_3$ denote the smallest, the middle and the largest of $R_A$,

1396: $R_B$ and $R_C$.

1397: Then for arbitrary values of $R_A$, $R_B$ and $R_C$, we can

1398: write:

1399: \begin{equation}

1400: \nu_{\rm p} ( R_A , R_B , R_C ) = \nu_{\rm p} ( R_1 , R_2 , R_3 ) \; .

1401: \end{equation}

1402: There are two separate cases that must be considered, which we

1403: refer to as `triangle' and `non-triangle'. In the first

1404: case, we have $R_3 \le R_1 + R_2$, so that the three lengths

1405: can form the sides of a triangle. In the second, we have

1406: $R_3 > R_1 + R_2$, and they cannot form the sides of a triangle.

1407: The non-triangle case is trivial, because the constraint

1408: supplied by $R_3$ is of no effect, and we just have:

1409: \begin{equation}

1410: \nu_{\rm p} ( R_1 , R_2 , R_3 ) = \rho_0^2

1411: \int_{r_1 < R_1 , \, r_2 < R_2}

1412: d {\bf r}_1 d {\bf r}_2 \,    = \frac{16}{9} \pi^2 \rho_0^2

1413: R_1^3 R_2^3 \; .

1414: \label{eq:pair_nont}

1415: \end{equation}

1416: In the `triangle' case, we return to eqn.~(\ref{eqn:volume})

1417: and note that the overlap volume of two spheres is:

1418: \begin{eqnarray}

1419: \Omega ( R_1 , R_2 ; r ) & = & \frac{4}{3} \pi R_1^3 \; \; \; \;

1420: {\rm if} \; \; \; r < R_2 - R_1 \nonumber \\

1421: & = & - \frac{\pi}{4 r} ( R_2^2 - R_1^2 )^2 +

1422: \frac{2}{3} \pi ( R_1^3 + R_2^3 ) -

1423: \frac{1}{2} \pi r ( R_1^2 + R_2^2 ) +

1424: \frac{\pi}{12} r^3 \; \; \; \; \; {\rm if} \; \; \; r > R_2 - R_1 \; .

1425: \end{eqnarray}

1426: The integral over $r$ is then straightforward, and we get:

1427: \begin{eqnarray}

1428: \nu_{\rm p} & = & \pi^2 \rho_0^2 \left(

1429: \frac{16}{9} R_1^3 ( R_2 - R_1 )^3

1430: - \frac{1}{2} ( R_2^2 - R_1^2 )^2 \left[ R_3^2 - ( R_2 - R_1 )^2 \right]

1431: + \frac{8}{9} ( R_1^3 + R_2^3 ) \left[ R_3^3 - ( R_2 - R_1 )^3 \right]

1432: \right.

1433: \nonumber \\

1434: & & \left. \mbox{} - \frac{1}{2} ( R_1^2 + R_2^2 )

1435: \left[ R_3^4 - ( R_2 - R_1 )^4 \right]

1436: + \frac{1}{18} \left[ R_3^6 - ( R_2 - R_1 )^6 \right] \right) \; .

1437: \label{eq:pair_t}

1438: \end{eqnarray}
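
A simple check on the pair counts, added here as an illustration (the
special case and the intermediate steps are our own), is the boundary between
the two cases for equal radii, $R_1 = R_2 = R$ and $R_3 = 2 R$. The overlap
volume of two equal spheres is given by the standard lens formula
$\Omega ( R , R ; r ) = \frac{\pi}{12} ( 4 R + r ) ( 2 R - r )^2$,
so that eqn.~(\ref{eqn:volume}) gives:
\begin{eqnarray}
\nu_{\rm p} & = & \frac{\pi^2 \rho_0^2}{3} \int_0^{2R} dr \,
( 16 R^3 r^2 - 12 R^2 r^3 + r^5 ) \nonumber \\
& = & \frac{\pi^2 \rho_0^2}{3} \left( \frac{128}{3} - 48 + \frac{32}{3}
\right) R^6 = \frac{16}{9} \pi^2 \rho_0^2 R^6 \; .
\end{eqnarray}
This coincides with eqn.~(\ref{eq:pair_nont}), as it must, since at
$R_3 = R_1 + R_2$ the constraint supplied by $R_3$ just ceases to have
any effect.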

1439:

1440: It is now simple to find the cross-over point and the $\eta$ value

1441: at this point. The strategy for weak reduction is based

1442: on ignoring the constraint supplied by $R_C$. Then $\nu_{\rm p}^{\rm max}$

1443: is just $\nu_{\rm p}$ for the non-triangle case $R_C > R_A + R_B$,

1444: so that from eqn.~(\ref{eq:pair_nont}) we have:

1445: \begin{equation}

1446: \nu_{\rm p}^{\rm max} = \frac{16}{9} \pi^2 \rho_0^2 R_A^3 R_B^3 \; .

1447: \end{equation}

1448: In weak extension, we ignore the constraint supplied by $R_A$. Then

1449: $\nu_{\rm p}^{\rm min}$ is $\nu_{\rm p}$ for the

1450: non-triangle case $R_A > R_B + R_C$, so we have:

1451: \begin{equation}

1452: \nu_{\rm p}^{\rm min} = \frac{16}{9} \pi^2 \rho_0^2 R_B^3 R_C^3 \; .

1453: \end{equation}

1454: The cross-over point is the point at which $\nu_{\rm p}^{\rm max} =

1455: \nu_{\rm p}^{\rm min}$, which requires that $R_A = R_C$.

1456: In terms of the dimensionless ratio $\alpha = R_C / ( R_A + R_B )$,

1457: we can express the cross-over point as:

1458: \begin{equation}

1459: \alpha_{\rm X} = 1 / ( 1 + \xi ) \; ,

1460: \end{equation}

1461: where $\xi = R_B / R_A$.
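
As a worked illustration of this criterion (our own arithmetic, using the
cut-offs of the scaling tests reported in the main text), note that the two
estimates above give:
\begin{equation}
\frac{\nu_{\rm p}^{\rm max}}{\nu_{\rm p}^{\rm min}} =
\left( \frac{R_A}{R_C} \right)^3 \; .
\end{equation}
For $R_A = 8.46$~\AA, $R_B = 4.23$~\AA, $R_C = 12.69$~\AA, we have
$\xi = 1/2$ and $\alpha = 1 > \alpha_{\rm X} = 2/3$; the ratio is
$( 2/3 )^3 = 8/27 \simeq 0.30$, so that the problem should be treated as
weak reduction. For $R_A = 12.69$~\AA, $R_B = 4.23$~\AA, $R_C = 8.46$~\AA,
we have $\xi = 1/3$ and $\alpha = 1/2 < \alpha_{\rm X} = 3/4$; the ratio is
$( 3/2 )^3 = 27/8 \simeq 3.4$, so that it should be treated as weak
extension. This agrees with the choice of the maximal and minimal kernels
respectively for these two cases in the tests.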

1462:

1463:

1464: The efficiency $\eta_{\rm X} = \nu_{\rm p} / \nu_{\rm p}^{\rm max} =

1465: \nu_{\rm p} / \nu_{\rm p}^{\rm min}$ at the cross-over point

1466: can now be found from eqn.~(\ref{eq:pair_t}). The derivation of the

1467: resulting formula:

1468: \begin{equation}

1469: \eta_{\rm X} = 1 - \frac{9}{16} \xi + \frac{1}{32} \xi^3

1470: \end{equation}

1471: is straightforward.

1472:

1473: In practice, we should always try to ensure that $R_A \ge R_B$,

1474: so that $0 < \xi < 1$. It is readily shown that $\eta_{\rm X}$

1475: increases monotonically from $15/32$ to 1 as $\xi$ decreases

1476: from 1 to 0. This means that the efficiency cannot drop

1477: below $15/32$, or, in round numbers, 50~\%.
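
The monotonic behaviour follows in one line from the formula for
$\eta_{\rm X}$. The step below and the sample values are added for
illustration and are our own arithmetic:
\begin{equation}
\frac{d \eta_{\rm X}}{d \xi} = - \frac{9}{16} + \frac{3}{32} \xi^2
\le - \frac{15}{32} < 0 \; \; \; \; {\rm for} \; \; \; 0 \le \xi \le 1 \; .
\end{equation}
Sample values are $\eta_{\rm X} = 703/864 \simeq 0.81$ for $\xi = 1/3$,
$\eta_{\rm X} = 185/256 \simeq 0.72$ for $\xi = 1/2$ and
$\eta_{\rm X} = 15/32 \simeq 0.47$ for $\xi = 1$.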

1478:

1479: \begin{figure}

1480: \begin{center}

1481: \leavevmode

1482: \epsfxsize=100mm

1483: \epsfbox{CalcsComms.eps}

1484: \end{center}

1485: \caption{Interleaving of communication and calculation in

1486: parallel multiplication of sparse matrices.}

1487: \label{fig:interleave}

1488: \end{figure}

1489:

1490: \begin{figure}

1491: \begin{center}

1492: \leavevmode

1493: \epsfxsize=100mm

1494: \epsfbox{PartHalo.eps}

1495: \end{center}

1496: \caption{Schematic illustration of partitions and haloes.  Atoms are

1497: indicated by small, black circles, and partitions by the dashed lines.

1498: A particular atom, $i$, is patterned.  Its A-neighbours are those atoms

1499: that are within or touched by the circle drawn centred on $i$.

1500: If the primary set is chosen to be the partition containing $i$, then the

1501: A-halo partitions are those enclosed by the heavy line.}

1502: \label{fig:halo}

1503: \end{figure}

1504:

1505: \begin{figure}

1506: \begin{center}

1507: \begin{verbatim}

1508: do k=1,ahalo%nh_part(kpart)  ! Loop over atoms k in current A-halo partn

1509:   ! --- useful variables associated with atom k ---------------------

1510:   k_in_halo=ahalo%j_beg(kpart)+k-1

1511:   k_in_part=ahalo%j_seq(k_in_halo)

1512:   nbkbeg=ibaddr(k_in_part)

1513:   ! --- transcription of j from B-neighbour to C-halo labelling -----

1514:   do j=1,nbnab(k_in_part)

1515:     jpart=ibpart(nbkbeg+j-1)+k_off

1516:     jseq=ibseq(nbkbeg+j-1)

1517:     jbnab2ch(j)=chalo%i_halo(chalo%i_hbeg(jpart)+jseq-1)

1518:   enddo

1519:   ! --- perform multiply and adds -----------------------------------

1520:   do i=1,at%n_hnab(k_in_halo)  ! Loop over primary-set A-neighbours of k

1521:     nabeg=at%i_beg(k_in_halo)+i-1

1522:     i_in_prim=at%i_prim(at%i_beg(k_in_halo)+i-1)

1523:     icad=(i_in_prim-1)*chalo%ni_in_halo

1524:     do j=1,nbnab(k_in_part) ! Loop over B-neighbours of atom k

1525:       nbbeg=nbkbeg+j-1

1526:       j_in_halo=jbnab2ch(j)

1527:       if(j_in_halo/=0) then

1528:         ncbeg=chalo%i_h2d(icad+j_in_halo)

1529:         if(ncbeg/=0) then  ! multiplication of ndim x ndim blocks

1530:           do n2=1,ndim2

1531:             do n1=1,ndim1

1532:               do n3=1,ndim3

1533:                 c(n1,n2,ncbeg)=c(n1,n2,ncbeg)+ &

1534:                   a(n3,n1,nabeg)*b(n3,n2,nbbeg)

1535:               enddo

1536:             enddo

1537:           enddo

1538:         endif

1539:       endif ! End of if(j_in_halo.ne.0)

1540:     enddo ! End of 1,nbnab

1541:   enddo ! End of 1,at%n_hnab

1542: enddo ! End of k=1,ahalo%nh_part(kpart)

1543: return

1544: \end{verbatim}

1545: \end{center}

1546: \caption{The matrix multiplication kernel.}

1547: \label{fig:kernel}

1548: \end{figure}

1549:

1550: \begin{figure}

1551: \begin{center}

1552: \leavevmode

1553: \epsfxsize=100mm

1554: \epsfbox{WeakScaling.eps}

1555: \end{center}

1556: \caption{Time recorded on each

1557: processor for the maximal

1558: case of sparse-matrix multiplication. Numbers of processors

1559: and atoms are varied with average number of atoms per processor

1560: equal to 80 in all cases.}

1561: \label{fig:Weak}

1562: \end{figure}

1563:

1564: \begin{figure}

1565: \begin{center}

1566: \leavevmode

1567: \epsfxsize=100mm

1568: %\epsfbox{Strong.eps}

1569: \epsfbox{MinMax.eps}

1570: \end{center}

1571: \caption{Times $t$ recorded on each node for maximal (solid line, circles)

1572: and minimal (dashed line, squares) cases of sparse matrix multiplication

1573: when number of processors $N_{\rm proc}$ is varied with total number of

1574: atoms in the system equal to 4096 in all cases.  Quantity plotted is

1575: the sum of the times on all processors ($t \cdot N_{\rm proc}$), which would be

1576: exactly constant for perfect strong parallel scaling.}

1577: \label{fig:Strong}

1578: \end{figure}

1579:

1580: \begin{figure}

1581: \begin{center}

1582: \leavevmode

1583: \epsfxsize=100mm

1584: \epsfbox{SystemPerf.eps}

1585: \end{center}

1586: \caption{Average time per processor as a multiple of time on two processors

1587: for constant number of atoms per processor (equal to 32).  Maximal case for

1588: crystalline silicon with cut-off radii $R_A = 12$~\AA, $R_B = 8$~\AA\ and

1589: $R_C = 20$~\AA.}

1590: \label{fig:systems}

1591: \end{figure}

1592:

1593: \begin{figure}

1594: \begin{center}

1595: \leavevmode

1596: \epsfxsize=100mm

1597: \epsfbox{ConquestRhoTime.eps}

1598: \end{center}

1599: \caption{Time for one self-consistency iteration (see text) of

1600: the {\sc Conquest} code as a function of number of atoms of

1601: perfect-crystal silicon. Calculations were run on 512 processors

1602: of a Cray T3E-1200E. Points connected by solid line show elapsed

1603: time in seconds; dashed line

1604: shows time scaled linearly from the actual time for 4096 atoms.}

1605: \label{fig:Conquest}

1606: \end{figure}

1607:

1608: \end{document}

1609: