
1: \documentclass[12pt]{article}

2: \usepackage{graphicx}

3: \usepackage{amsfonts}

4: \usepackage{latexsym}

5: \usepackage{amsmath}

6: \newtheorem{theorem}{theorem}

7: %\usepackage{fancyhdr}

8:

9: % ----------------------------------------------------------------

10: \voffset = -50pt

11: \textwidth 6.5in

12: \textheight 8.8in

13: \topmargin 0.25in

14: \oddsidemargin -0.1in

15: \evensidemargin 0in

16:

17: % ----------------------------------------------------------------

18: \makeatletter

19: \@addtoreset{figure}{section}

20: \def\thefigure{\thesection.\@arabic\c@figure}

21: \@addtoreset{table}{section}

22: \def\thetable{\thesection.\@arabic\c@table}

23:

24: \def\@sect#1#2#3#4#5#6[#7]#8{\ifnum #2>\c@secnumdepth

25:      \def\@svsec{}\else

26:      \refstepcounter{#1}\edef\@svsec{\csname the#1\endcsname.\hskip .75em

27: }\fi

28:      \@tempskipa #5\relax

29:       \ifdim \@tempskipa>\z@

30:         \begingroup #6\relax

31:           \@hangfrom{\hskip #3\relax\@svsec}{\interlinepenalty \@M #8\par}%

32:         \endgroup

33:        \csname #1mark\endcsname{#7}\addcontentsline

34:          {toc}{#1}{\ifnum #2>\c@secnumdepth \else

35:                       \protect\numberline{\csname the#1\endcsname}\fi

36:                     #7}\else

37:         \def\@svsechd{#6\hskip #3\@svsec #8\csname #1mark\endcsname

38:                       {#7}\addcontentsline

39:                            {toc}{#1}{\ifnum #2>\c@secnumdepth \else

40:                              \protect\numberline{\csname the#1\endcsname}\fi

41:                        #7}}\fi

42:      \@xsect{#5}}

43: % put a period after theorem and theorem-like numbers

44: \def\@begintheorem#1#2{\it \trivlist \item[\hskip \labelsep{\bf #1\ #2.}]}

45: \def\section{\@startsection {section}{1}{\z@}{-3.5ex plus -1ex minus

46:  -.2ex}{2.3ex plus .2ex}{\normalsize\bf}}

47:

48: %\pagestyle{myheadings}

49: %\thispagestyle{empty}

50:

51: %\markright{\sc the electronic journal of combinatorics

52: %(2000),\#Rxx\hfill} \thispagestyle{empty}

53: % ----------------------------------------------------------------

54: \begin{document}

55:

56: \title{Efficient Parallel Simulations of \\

57: Asynchronous Cellular Arrays}

58: %uncomment to remove date

59: \date{}

60: \maketitle

61:

62: \begin{center}

63: %\small

64: \author{Boris D. Lubachevsky\\

65: {\em bdl@bell-labs.com}\\

66: Bell Laboratories\\

67: 600 Mountain Avenue\\

68: Murray Hill, New Jersey}

69: \end{center}

70:

71: \setlength{\baselineskip}{0.995\baselineskip}

72: \normalsize

73: \vspace{0.5\baselineskip}

74: \vspace{1.5\baselineskip}

75: %\end{center}

76:

77: % ----------------------------------------------------------------

78: \begin{abstract}

79: A definition for a class of asynchronous cellular arrays

80: is proposed.

81: An example of such asynchrony would be

82: independent Poisson arrivals of cell iterations.

83: The Ising model in the continuous time formulation of Glauber

84: falls into this class.

85: Also proposed are efficient parallel algorithms for

86: simulating these asynchronous cellular arrays.

87: In the algorithms, one or several cells are assigned to a processing

88: element (PE),

89: and local times for different PEs can be different.

90: Although the standard serial algorithm by

91: Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller

92: can simulate such arrays,

93: it is usually

94: believed to be without an efficient parallel counterpart.

95: However, the proposed parallel algorithms

96: contradict this belief,

97: proving to be

98: both efficient

99: and able to perform the same task

100: as the standard algorithm.

101: The results of experiments with the new algorithms

102: are encouraging:

103: the speed-up is greater than 16

104: using 25 PEs on a shared memory MIMD

105: bus computer,

106: and greater than 1900

107: using $2^{14}$ PEs on a

108: SIMD computer.

109: The algorithm by

110: Bortz, Kalos, and Lebowitz

111: can be incorporated

112: in the proposed parallel algorithms,

113: further contributing to speed-up.

114: \end{abstract}

115: % ----------------------------------------------------------------

116: \section{Introduction}\label{sec:intro}

117: \hspace*{\parindent}

118: Simulation is inevitable

119: in studying the evolution

120: of complex cellular systems.

121: Large cellular array simulations might require long runs

122: on a serial computer.

123: Parallel processing,

124: wherein each cell or a group of cells

125: is hosted by a separate processing element (PE),

126: is a feasible method to speed up the runs.

127: The strategy of a parallel simulation

128: should depend on whether the

129: simulated system is synchronous

130: or asynchronous.

131:

132: A {\em synchronous} system

133: evolves in discrete time $t=0,1,2,\ldots$.

134: The state of a cell at $t+1$

135: is determined by the state of the cell and its neighbors

136: at $t$

137: and may explicitly depend

138: on $t$ and the result of a random experiment.

139:

140: An obvious and correct way to simulate

141: the system synchrony using a parallel processor

142: is simply to mimic it by the executional synchrony.

143: The simulation is arranged in rounds

144: with

145: one round corresponding to one time step

146: and with

147: no PE processing state changes of its cells for time $t+1$

148: before all PEs have processed state changes of their cells

149: for time $t$.

150:

151: An {\em asynchronous} system evolves in continuous time.

152: State changes at different cells occur

153: asynchronously at unpredictable random times.

154: Here two questions should be answered:

155: (A) How to specify the asynchrony precisely?

156: and (B) How to carry out the parallel simulations

157: for the specified asynchrony?

158:

159: Unlike the synchronous case,

160: simple mimicry does not work well

161: in the asynchronous case.

162: When Geman and Geman \cite{GG}, for example,

163: employ executional {\em physical} asynchrony

164: (introduced by different speeds of different PEs)

165: to mimic the model asynchrony,

166: the simulation becomes irreproducible

167: with its results depending on executional timing.

168: Such dependence may be tolerable in tasks

169: other than simulation

170: (\cite{GG} describes one such task,

171: another example is given in \cite{LM}).

172: In the task of simulation, however, it is

173: a serious shortcoming as seen in the following example.

174:

175: Suppose  a simulationist,

176: after observing the results of a program run,

177: wishes to look closer at a certain phenomenon

178: and inserts an additional `print' statement

179: into the code.

180: As a result of the insertion,

181: the executional timing changes

182: and the phenomenon under investigation vanishes.

183:

184: Ingerson and Buvel \cite{INBUV} and

185: Hofmann \cite{HOF}

186: propose various reproducible

187: computational

188: procedures to simulate asynchronies

189: in cellular arrays.

190: However, no uniform principle has been proposed,

191: and no special attention to developing

192: parallel algorithms has been paid.

193: It has been observed that the

194: resulting cellular patterns

195: may depend on the computational

196: procedure \cite{INBUV}.

197:

198: Two main results of this paper are:

199: (I) a definition

200: of a natural  class of asynchronies

201: that can be associated with

202: cellular arrays

203: and

204: (II) efficient parallel algorithms to simulate

205: systems in this class.

206: The following properties specify

207: the {\em Poisson asynchrony},

208: a most common

209: member in the introduced class:

210: \\

211: \\

212: $~~~$Arrivals

213: for a particular cell

214: form a Poisson point process.

215: \\

216: $~~~$Arrival processes for different cells are independent.

217: \\

218: $~~~$The arrival rate

219: is the same, say $\lambda$,

220: for each cell.

221: \\

222: $~~~$When there is an arrival,

223: the state of the cell

224: instantaneously changes;

225: the new state is computed

226: based on the states of the cell and its neighbors

227: just before the change

228: (in the same manner as in the synchronous model).

229: The new state may be equal to the old one.

230: \\

231: $~~~$The time of arrival

232: and a random experiment may be involved in the computation.

233: \\

234:

235: A familiar example of a cellular system with the Poisson asynchrony

236: is the Ising model \cite{ISING}

237: in the continuous time formulation of Glauber \cite{GL}.

238: In this model

239: a cell configuration is defined

240: by the spin variables $s(c)=\pm 1$

241: specified at the cells $c$ of a two or three dimensional

242: array.

243: When there is an arrival at a cell $c$,

244: the spin $s(c)$ is changed to $-s(c)$ with probability $p$.

245: With probability $1~-~p$,

246: the spin $s(c)$ remains unchanged.

247: The probability $p$

248: is determined

249: using the values of $s(c)$ and neighbors $s(c')$ just before

250: the update time.

251:

252: It is instructive to review the

253: computational procedures for Ising simulations.

254: First, the Ising simulationists realized that the standard procedure by

255: Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller \cite{MRRTT}

256: could be applied.

257: In this procedure, the evolution of the configuration

258: is simulated as a sequence of one-spin updates:

259: Given a configuration,

260: define the next configuration by choosing a cell $c$

261: uniformly at random and changing or not changing the spin

262: $s(c)$ to $-s(c)$ as required.

263: In the original standard procedure time is discrete.

264: Time continuity could have been simply introduced

265: by letting

266: the consecutive arrivals form

267: the Poisson process with rate $\lambda N$,

268: where $N$ is the total number

269: of spins (cells) in the system.
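
A brief check of why $\lambda N$ is the natural rate here
(a sketch, assuming $N$ independent Poisson arrival streams of rate $\lambda$ each,
as in the Poisson asynchrony described above;
the symbols $\Delta$ and $x$ are introduced only for this illustration).
The time $\Delta$ until the first arrival anywhere in the array exceeds $x \geq 0$
exactly when none of the $N$ cells has an arrival within $x$, so
\[
\Pr \{ \Delta > x \} = \left( e^{-\lambda x} \right)^N = e^{-\lambda N x} ,
\]
which is the exponential law with rate $\lambda N$.
By symmetry, this first arrival belongs to any particular cell with probability $1/N$,
which matches choosing the cell uniformly at random in the standard procedure.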

270:

271: The problem of long simulation runs became immediately apparent.

272: Bortz, Kalos, and Lebowitz \cite{BKL}

273: developed a serial algorithm (the BKL algorithm)

274: which avoids processing unsuccessful

275: state change attempts,

276: and reported up to a 10-fold speed-up over the

277: straight-forward implementation of the

278: standard model.

279: Ogielski \cite{OGI} built special purpose hardware

280: for speeding up the processing.

281:

282: The BKL algorithm is serial.

283: Attempts were made

284: to speed up the Ising simulation by parallel

285: computations

286: (Friedberg and Cameron \cite{FC}, Creutz \cite{CR}).

287: However, in these computations the original Markov chain

288: of the continuous time Ising model

289: was modified to satisfy the computational procedure.

290: The modifications do not affect the equilibrium

291: behavior of the chain,

292: and as such are acceptable

293: if one studies only the equilibrium.

294: In the cellular models, however,

295: the transient behavior is also of interest,

296: and no model revision should be done.

297:

298: This paper presents

299: efficient methods for parallel simulation

300: of the continuous time asynchronous cellular arrays

301: without changing the model or type of asynchrony in favor

302: of the computational procedure.

303: The methods

304: promise unlimited

305: speed-up when the array and the parallel

306: computer are sufficiently large.

307: For the Poisson asynchrony case,

308: it is also shown how

309: the BKL algorithm can be incorporated,

310: further contributing to speed-up.

311:

312: For the Ising model,

313: presented algorithms can be viewed

314: as exact parallel counterparts

315: to the standard algorithm by Metropolis et al.

316: The latter has been known and

317: believed to be inherently serial since 1953.

318: Yet, the presented algorithms are parallel, efficient, and fairly simple.

319: The ``conceptual level'' codes are rather short

320: (see Figures~\ref{fig:a1c1pe},

321: \ref{fig:s1c1pe},

322: \ref{fig:amcgen},

323: \ref{fig:amcpoi},

324: %$AL$.4, $AL$.6 and $AL$.7

325: and

326: \ref{fig:genout}).

328: An implementation in a real programming language

329: given in the Appendix

330: is longer, of course,

331: but still rather simple.

332:

333: This paper is organized as follows:

334: Section \ref{sec:model} presents

335: a class of asynchronies

336: and a comparison with other published proposals.

337: Then Section \ref{sec:algo} describes the new algorithms

338: on the conceptual level.

339: While the presented algorithms are simple,

340: there is no simple theory which predicts

341: speed-up of these algorithms for

342: cellular arrays and parallel processors

343: of large sizes.

344: Section \ref{sec:perf} contains a simplified computational

345: procedure which predicts speed-ups faster than it takes

346: to run an actual parallel program.

347: The predictions made by this

348: procedure are compared with actual runs

349: and appear to be rather accurate.

350: The procedure predicts speed-up of more than 8000

351: for the simulation of $10^5 \times 10^5$

352: Poisson asynchronous cellular array in parallel

353: by $10^4$ PEs.

354: Actual speed-ups obtained thus far were:

355: more than 16 on 25 PEs of the

356: Balance (TM)

357: computer and more than 1900

358: on $2^{14}$ PEs of the Connection Machine (R).

359: \footnotetext{

360: Connection Machine is a registered trademark of Thinking Machines Corporation

361: \\

362: Balance is a trademark of Sequent Computer Systems, Inc.}

363:

364: \section{Model}\label{sec:model}

365: \hspace*{\parindent}

366: Time $t$ is continuous.

367: Each cell $c$ has a state $s=s(c)$.

368: At random times, a cell is granted a chance

369: to change the state.

370: The changes, if they occur,

371: are instantaneous events.

372: Random attempts to change the state of a cell

373: are independent of

374: similar attempts for other cells.

375:

376: The general model consists

377: of two functions:

378: {\em time\_of\_next\_arrival ()}

379: and {\em next\_state ()}.

380: They are defined as follows:

381: given the old state of the cell

382: and the states of the neighbors just before time $t$,

383: $s_{t-0} (neighbors (c))$,

384: the next\_state

385: $s(c)=s_t (c)$ is

386: \begin{equation}

387: \label{newst}

388: s_t (c) = next\_state~(c,~s_{t-0} (neighbors(c)),~ \omega ,~t ),

389: \end{equation}

390: where the possibility

391: $s_t (c)=s_{t-0} (c)$ is not excluded;

392: and

393: the time $next\_t$ of the next arrival

394: is

395: \begin{equation}

396: \label{newti}

397: next\_t = time\_of\_next\_arrival~(c, s_{t-0} (neighbors(c)),~ \omega ,~t),

398: \end{equation}

399: where always $next\_t ~ > ~ t$.

400:

401: In \eqref{newst} and \eqref{newti},

402: $\omega$ denotes the result of a random experiment,

403: e.g., coin tossing,

404: $s (neighbors(c))$ denotes the indexed set of states

405: of all the neighbors of $c$ including $c$ itself.

406: Thus,

407: if $neighbors(c)=\{ c, c_1, c_2, c_3, c_4\}$,

408: then

409: $s(neighbors(c)) =

410: (s(c), s(c_1 ), s(c_2 ), s(c_3 ), s(c_4 ))$.

411: Subscript $t-0$ expresses the idea of `just before $t$',

412: e.g.,

413: $a_{t-0} = \lim_{\tau \rightarrow t,~\tau < t} ~ a( \tau )$.

414: According to \eqref{newst}, the value of $s(c)$

415: instantaneously changes at time $t$

416: from $s_{t-0} (c)$ to $s_t (c)$.

417: At time $t$, the value of $s(c)$ is already new.

418: The `just before' feature resolves

419: a possible ambiguity

420: if two neighbors attempt to change their states

421: at the same simulated time.

422:

423: Compare now the class of asynchronies

424: defined by \eqref{newti} with the ones proposed in the literature:

425:

426: \ \ \ (A) Model 1 in \cite{INBUV} reads:

427: ``...the cells iterate randomly, one at a time.''

428: Let $p_c$ be the probability that cell $c$ is chosen.

429: Then the following choice of law \eqref{newti} yields this model

430: \[

431: time\_of\_next\_arrival~(c,~\omega ,~t)= t~-~ \frac {1} {p_c}   \ln  r(c,t, \omega ),

432: \]

433: where $r(c, t, \omega )$ is a random number uniformly distributed on (0,1),

434: and $\ln$ is the natural logarithm,

435: $\ln (x) = {\log}_e (x)$.

436: For $p_{c_1} = p_{c_2} = \cdots = \lambda$,

437: the asynchrony was called the {\em Poisson asynchrony} in Section~\ref{sec:intro};

438: it coincides with the one defined

439: by the standard model \cite{MRRTT},

440: and by Glauber's model \cite{GL} for the Ising spin simulations.
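
The formula in (A) can be checked directly
(a short verification, assuming successive values of $r$ are independent;
$x \geq 0$ is an auxiliary variable used only here).
Since $r$ is uniformly distributed on $(0,1)$,
\[
\Pr \{ next\_t - t > x \}
= \Pr \left\{ -\frac{1}{p_c} \ln r > x \right\}
= \Pr \{ r < e^{-p_c x} \}
= e^{-p_c x} ,
\]
so the intervals between consecutive arrivals at cell $c$
are exponentially distributed with mean $1/p_c$,
as they should be for a Poisson stream of rate $p_c$.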

441:

442: \ \ \ (B) Model 2 in \cite{INBUV} assigns

443: ``each cell a period according to a Gaussian distribution...

444: The cells iterate one at a time each having its own definite

445: period.''

446: While it is not quite clear from \cite{INBUV}

447: what is meant by a ``definite period''

448: (is it fixed for a cell over a simulation run?),

449: the following choice of law \eqref{newti} yields this model

450: in a liberal interpretation:

451: \[

452: time\_of\_next\_arrival~(c,~\omega ,~t)= t~+~ {P_c}^{-1} (r( \omega )),

453: \]

454: where $P^{-1} (y)=x$ if $P(x)=y$,

455: and $P_c (x)$ is the cumulative function for the

456: Gaussian probability distribution

457: with mean $m_c~>~0$

458: and variance ${\sigma_c}^2$.

459: The probability of

460: $next\_t < t$ is small when $\sigma_c \ll m_c$

461: and is ignored in \cite{INBUV}

462: if this interpretation is meant.

463: In a less liberal interpretation,

464: $\sigma_c \equiv 0$ for all $c$,

465: and $m_c$ is itself

466: random and distributed according to the Gaussian law.

467: This case is even easier to represent in terms of

468: model \eqref{newti} than the previous one:

469: $time\_of\_next\_arrival~(c,~ \omega ,~t)= t + m_c ( \omega )$.

470:

471: \ \ \ (C) Model \eqref{newti} trivially extends to a synchronous simulation,

472: where the initial state changes arrive at time 0 and

473: then always $next\_t - t$ is identical to 1.

474: The first model in \cite{HOF} is

475: ``to choose a number of cells at random and change

476: only their values before continuing.''

477: This is a variant of synchronous simulation;

478: it is substantially different from both models (A) and (B) above.

479: In (A) and (B),

480: the probability is 1 that

481: no two neighbors attempt to change their states at the same time.

482: In contrast, in this model many neighboring cells

483: are simultaneously changing their values.

484: How the cells are chosen for update

485: is not precisely specified in \cite{HOF}.

486: One way to choose the cells is to assign a probability weight

487: $p_c$ for cell $c$, $c=1,2,...,N$,

488: and to attempt to update cell $c$

489: at each iteration,

490: with probability $p_c$,

491: independent of any other decision.

492: Such a method

493: conforms with the law \eqref{newti}

494: because the method is local:

495: a cell does not need to know

496: what is happening at distant cells.

497: The second model in \cite{HOF}

498: changes states of a

499: fixed number $A$ of randomly chosen cells

500: at each iteration.

501: If $A > 1$,

502: this method is not local

503: and does not conform with the law \eqref{newti}.

504:

505: \section{Algorithms}\label{sec:algo}

506: \hspace*{\parindent}

507: {\bf Elimination of $\omega$}.

508: Deterministic computers

509: represent randomness by using

510: pseudo-random number generators.

511: Thus, equations \eqref{newst} and \eqref{newti} are substituted

512: in the computation by equations

513: \begin{equation}

514: \label{news0t}

515: s_t (c) = next\_state~(c,~s_{t-0} (neighbors(c)),~t ),

516: \end{equation}

517: and

518: \begin{equation}

519: \label{newt0i}

520: next\_t = time\_of\_next\_arrival~(c, s_{t-0} (neighbors(c)),~t),

521: \end{equation}

522: respectively,

523: which do not contain the parameter of randomness $\omega$.

524:

525: This elimination of $\omega$ symbolizes

526: an obvious but important

527: difference between the simulated system and the simulator:

528: In the simulated system,

529: the observer, being a part of the system,

530: does not know in advance

531: the time of the next arrival.

532: In contrast, the simulationist who is,

533: of course, not a part of the simulated system,

534: can know the time of the next arrival

535: before the next arrival is processed.

536:

537: For example,

538: it is not known in advance when the next event

539: from a Poisson stream arrives.

540: However, in the simulation,

541: the time $next\_t$ of the next arrival

542: is obtained in a deterministic manner,

543: given the time $t$ of the previous arrival:

544: \begin{equation}

545: \label{newt1i}

546: next\_t = t ~-~ \frac{1}{\lambda} {\log}_e ( r(n(t))),

547: \end{equation}

548: where $\lambda$ is the rate,

549: $r(n)$ is the $n$-th pseudo-random number in the sequence

550: uniformly distributed on $(0,1)$,

551: and $n(t)$ is the invocation counter.

552: Thus, after the previous arrival is processed,

553: the time of the next arrival is already known.

554: If needed, the entire sequence of arrivals

555: can be precomputed and stored in a table for later

556: use in the simulation,

557: so that all future arrival times

558: would be known in advance.
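
As a small numerical illustration of \eqref{newt1i}
(the rate $\lambda = 2$ and the values $r(1)=0.5$, $r(2)=0.1$ are hypothetical,
chosen only to make the arithmetic easy to follow;
the subscripts on $next\_t$ are likewise introduced only here),
starting from $t=0$ one obtains
\begin{align*}
next\_t_1 &= 0 - \frac{1}{2} \ln 0.5 ~\approx~ 0.347 , \\
next\_t_2 &= 0.347 - \frac{1}{2} \ln 0.1 ~\approx~ 1.498 .
\end{align*}
Both arrival times are available as soon as the two pseudo-random numbers are drawn,
before either arrival is processed.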

559: \\

560:

561: {\bf Asynchronous one-cell-per-one-PE algorithm}.

562: The algorithm in Figure~\ref{fig:a1c1pe}

563: is the shortest of those presented in this paper.

564:

565: To understand this code,

566: imagine a parallel computer which consists

567: of a number of PEs running concurrently.

568: One PE is assigned to simulate one cell.

569: The PE which is assigned to simulate cell $c_0$,

570: PE$c_0$, executes the code in Figure~\ref{fig:a1c1pe} with $c=c_0$.

571: The PEs are interconnected by the network

572: which matches the topology of the cellular array.

573: A PE can receive information from its neighbors.

574: PE$c$ maintains state $s(c)$

575: and local simulated time $t(c)$.

576: Variables $t(c)$ and $s(c)$ are visible

577: (accessible for reading only) by the neighbors of $c$.

578: Time $t(c)$ has no connection with the physical

579: time in which the parallel computer runs the program

580: except that $t(c)$ may not decrease

581: when the physical time increases.

582: At a given physical instant of simulation,

583: different cells $c$ may have different values of $t(c)$.

584: Value $end\_time$ is a constant which is known to all PEs.

585:

586: The algorithm in Figure~\ref{fig:a1c1pe}

587: is very asynchronous:

588: different PEs can

589: execute different steps

590: concurrently

591: and can run

592: at different speeds.

593: A statement `wait\_until~~{\em condition}',

594: like the one at Step 2 in Figure~\ref{fig:a1c1pe},

595: does not imply

596: that the {\em condition} must be detected immediately after

597: it occurs.

598: To detect the {\em condition}

599: at Step 2

600: involving local times

601: of neighbors

602: a PE can poll

603: its neighbors

604: one at a time,

605: in any order,

606: with arbitrary delays,

607: and

608: without any regard to

609: what these PEs are doing meanwhile.

610: \\

611: \begin{figure}

612: \centering

613: \fbox{

614: \begin{minipage} {12.8cm}

615: \begin{enumerate}

616: \item while $t(c)~<~end\_time$\\

617: \hspace*{0.2in}

618: \{

619: \item~~~~~~wait\_until $t(c)~\leq~ \min_{c'~\in~neighbors(c)} t(c')$ ;

620: \item~~~~~~$s(c)~\leftarrow~ next\_state~(c,~ s (neighbors (c)),~t(c))$ ;

621: \item~~~~~~$t(c)~\leftarrow~time\_of\_next\_arrival~(c,~s (neighbors (c)),~t(c))$\\

622: \hspace*{0.2in}

623: \}

624: \end{enumerate}

625:

626: \end{minipage}}

627: \caption{Asynchronous one-cell-per-one-PE algorithm}

628: \label{fig:a1c1pe}

629: \end{figure}

630: Despite being seemingly almost chaotic,

631: the algorithm in Figure~\ref{fig:a1c1pe}

632: is free from deadlock.

633: Moreover, it

634: produces a unique simulated trajectory

635: which is independent of executional

636: timing,

637: provided that:

638: \\

639:

640: (i)

641: for the same cell, the pseudo-random sequence is always the same,

642: \\

643:

644: (ii) no two neighboring arrival times are equal.

645: \\

646:

647: Freedom from deadlock follows from the fact that the cell,

648: whose local time is minimal over the entire array,

649: is always able to make progress.

650: (This guaranteed worst-case performance

651: is substantially exceeded

652: in an average case.

653: See Section~\ref{sec:perf}.)

654:

655: The uniqueness of the trajectory can be seen as follows.

656: By (ii),

657: a cell $c$ passes the test at Step 2 only if its local time $t(c)$

658: is smaller than the local time $t(c')$

659: of each of its neighbors $c' \neq c$.

660: If this is the case, then no neighbor $c'$

661: is able to pass the test at Step 2 before

662: $c$ changes its time at Step 4.

663: This means that processing of the update by $c$ is safe:

664: no neighbor changes its state or time before $c$ completes

665: the processing.

666: By (i), functions $next\_state( )$ and $time\_of\_next\_arrival( )$

667: are independent of the run.

668: Therefore,

669: in each program run,

670: no matter what the neighbors of $c$

671: are doing or trying to do,

672: the next arrival time and state for $c$ are always the same.
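
A small trace may help to see this argument at work
(a sketch with hypothetical numbers:
three cells $c_1$, $c_2$, $c_3$ in a row,
$c_2$ being the only neighbor of $c_1$ and of $c_3$;
the arrival times are invented for illustration).
\begin{center}
\begin{tabular}{l|ccc|l}
 & $t(c_1)$ & $t(c_2)$ & $t(c_3)$ & cells passing the test at Step 2 \\ \hline
initially & 0.3 & 0.1 & 0.4 & $c_2$ only \\
after $c_2$ completes Step 4 & 0.3 & 0.7 & 0.4 & $c_1$ and $c_3$ \\
\end{tabular}
\end{center}
Initially $c_1$ and $c_3$ must wait,
because $t(c_2)$ is smaller than their local times.
Once $t(c_2)$ has advanced to $0.7$, both $c_1$ and $c_3$ pass the test.
They are not neighbors of each other,
so they may be updated concurrently or in either order with the same result.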

673:

674: It is now clear why assumption (ii) is needed.

675: If (ii) is violated by two cells $c$ and $c'$ which are neighbors,

676: then the algorithm in Figure~\ref{fig:a1c1pe}

677: does not exclude concurrent updating by $c$ and $c'$.

678: Such concurrent updating

679: introduces an indeterminism

680: and inconsistency.

681: A scenario

682: of the inconsistency

683: can be as follows:

684: at Step 3

685: the {\em old} value of $s(c')$ is used

686: to update state $s(c)$,

687: but

688: the immediately following Step 4 uses the

689: {\em new} value of $s(c')$

690: to update time $t(c)$.

691:

692: In practice, the algorithm in Figure~\ref{fig:a1c1pe} is safe,

693: when $next\_t(c)-t(c)$ for different $c$ are independent

694: random samples from a distribution with a continuous density,

695: like an exponential distribution.

696: In this case, (ii) holds with probability 1.

697: Unless the pseudo-random number generators are faulty,

698: one may imagine only one reason for violating (ii):

699: finite precision of computer representation of real numbers.

700: \\

701:

702: {\bf Synchronous one-cell-per-one-PE algorithm}.

703: If (ii) can be violated with a positive probability

704: (if $t$ takes on only integer values,

705: for example),

706: then the errors might not be tolerable.

707: In this case the synchronous algorithm in

708: Figure~\ref{fig:s1c1pe} should be used.

709:

710: Observe that while the algorithm in Figure~\ref{fig:s1c1pe}

711: is synchronous,

712: it is able to simulate correctly

713: both synchronous and asynchronous systems.

714: Two main additions

715: in the algorithm in Figure~\ref{fig:s1c1pe}

716: are:

717: private variables $new\_s$ and $new\_t$

718: for temporary storage of updated $s$ and $t$,

719: and synchronization barriers `synchronize'.

720: When a PE hits a `synchronize' statement it must wait until

721: all the other PEs hit a `synchronize' statement;

722: then it may resume.

723: Two dummy synchronizations at Steps 9 and 10 are executed

724: by idling PEs in order to match synchronizations

725: at Steps 5 and 8 executed by non-idling PEs.

726:

727: When (ii) is violated,

728: the synchronous algorithm avoids the ambiguity and indeterminism

729: (which in this case are possible in the asynchronous algorithm)

730: as follows:

731: in processing concurrent updates of two neighbors $c$ and $c'$

732: for the same simulated time $t=t(c)=t(c')$,

733: first, $c$ and $c'$ read states $s_{t-0}$ and times $t$ of each other

734: and compute their private $new\_s$ and $new\_t$

735: (Steps 3 and 4 in Figure~\ref{fig:s1c1pe});

736: then, after the synchronization barrier at Step 5,

737: $c$ and $c'$ write their states and times at Steps 6 and 7,

738: thus making sure that no write

739: interferes with a read.
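
For instance (a hypothetical two-cell illustration with integer times,
in which $c$ and $c'$ have no neighbors other than each other;
the state values $a$ and $b$ are placeholders),
let $t(c)=t(c')=5$, $s(c)=a$, and $s(c')=b$.
Both cells pass the test at Step 2, and before barrier 1 they compute
\[
new\_s(c) = next\_state~(c,~(a,b),~5) , ~~~~
new\_s(c') = next\_state~(c',~(b,a),~5) .
\]
The values $a$ and $b$ are overwritten only after barrier 1,
so neither computation can observe a partly updated pair,
whatever the relative speeds of the two PEs.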

740: \\

741: \begin{figure}

742: \centering

743: \fbox{

744: \begin{minipage} {14.8cm}

745: \begin{enumerate}

746: \item while $t(c)~<~end\_time$\\

747: \hspace*{0.2in}

748: \{

749: \item~~~~~~if $t(c)~\leq~ \min_{~c'\in neighbors(c)} ~t(c')$ then\\

750: \hspace*{0.2in}

751: ~~~~~~~~\{

752: \item~~~~~~~~~~~~~~$new\_s ~\leftarrow~ ~next\_state~(c,~s (neighbors (c)),~t(c))$ ;

753: \item~~~~~~~~~~~~~~$new\_t~\leftarrow~ ~time\_of\_next\_arrival~(c,~s (neighbors (c)),~t(c))$ ;

754: \item~~~~~~~~~~~~~~synchronize;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~/* barrier 1 */

755: \item~~~~~~~~~~~~~~$s(c)~~\leftarrow~~new\_s$;

756: \item~~~~~~~~~~~~~~$t(c)~~\leftarrow~~new\_t$;

757: \item~~~~~~~~~~~~~~synchronize~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~/* barrier 2 */\\

758: \hspace*{0.2in}

759: ~~~~~~~~\}\\

760: \hspace*{0.2in}

761: \ \ else~\{

762: \item~~~~~~~~~~~~~~synchronize;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~/* barrier 1 */

763: \item~~~~~~~~~~~~~~synchronize~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~/* barrier 2 */\\

764: \hspace*{0.2in}

765: ~~~~~~~~\}

766: \\

767: \hspace*{0.1in}

768: \ \ \}

769: \end{enumerate}

770:

771: \end{minipage}}

772: \caption{Synchronous one-cell-per-one-PE algorithm}

773: \label{fig:s1c1pe}

774: \end{figure}

775: \\

776:

777: {\bf Aggregation.}

778: In the two algorithms presented above,

779: one PE hosts only one cell.

780: Such an arrangement may be wasteful

781: if the communication between PEs dominates

782: the computation internal to a PE.

783: A more efficient arrangement is

784: to assign several cells to one PE.

785: For concreteness,

786: consider a two-dimensional $n \times n$

787: array with periodic boundary conditions.

788: Let $n$ be a multiple of $m$ and $(n/m)^2$

789: PEs be available.

790: PE$C$ carries $m \times m$ subarray $C$,

791: where $C=1,2,...,(n/m)^2$.

792: (Capital $C$ will be used without confusion to represent

793: both the subarray index and the set of cells $c$ the subarray

794: comprises, e.g., as in $c\in C$.)

795: A fragment of a square cellular array

796: in an example of such an aggregation

797: is represented in Figure~\ref{fig:aggr}$a$,

798: wherein $m=4$.

799:

800: The neighbors of a cell carried by PE1 are cells carried

801: by PE2, PE3, PE4, or PE5.

802: PE1 has direct connections with these four PEs (Figure~\ref{fig:aggr}$b$).

803: Given cell $c$ in the subarray hosted by PE1,

804: one can determine with which neighboring PEs

805: communication is required

806: in order to learn the states of the neighboring cells.

807: Let $W(c)$ be the set of these PEs.

808: Examples in Figure~\ref{fig:aggr}$a$ : $W(u)$ is empty,

809: $W(v)=$\{PE5\}, $W(w)=$\{PE3, PE4\}.
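
The benefit of aggregation can be gauged by a simple count
(a sketch, assuming the four-nearest-neighbor topology suggested by
Figure~\ref{fig:aggr}$b$ and $m \geq 2$).
In an $m \times m$ subarray,
\[
\underbrace{(m-2)^2}_{W(c)~{\rm empty}} ~+~
\underbrace{4(m-2)}_{{\rm one~PE~in}~W(c)} ~+~
\underbrace{4}_{{\rm two~PEs~in}~W(c)} ~=~ m^2 ,
\]
where the three terms count the interior cells,
the edge cells that are not corners, and the corner cells, respectively.
For $m=4$ this gives 4, 8, and 4 cells out of 16.
The fraction of cells whose update requires no communication,
$(1 - 2/m)^2$, approaches 1 as $m$ grows.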

810:

811: \begin{figure}

812: \centering

813: \includegraphics*[width=5.8in]{AGGR.PS}

814: \caption{Aggregation:

815: $a$) mapping of cells to PEs,

816: $b$) the interconnection among the PEs which supports the neighborhood

817: topology among the cells

818: }

819: \label{fig:aggr}

820: \end{figure}

821:

822: \begin{figure}

823: \centering

824: \fbox{

825: \begin{minipage} {13.8cm}

826: \begin{enumerate}

827: \item while $T(C) ~<~ end\_time$

828: \\

829: \hspace*{0.2in}

830: \{

831: \item~~~~~~select a cell $c$ in the subarray $C$ such that\\

832: \hspace*{0.2in}

833: ~$t(c)= \min_{~c' \in C} ~t(c')$ and assign $T(C)\leftarrow t(c)$;

834: \item~~~~~~wait\_until $T(C)~\leq~ \min_{~C' \in W(c)} ~T(C')$ ;

835: \item~~~~~~$s(c) \leftarrow next\_state~(c,~ s (neighbors (c)),~t(c))$ ;

836: \item~~~~~~$t(c) \leftarrow time\_of\_next\_arrival~(c,~s (neighbors (c)),~t(c))$\\

837: \hspace*{0.2in}

838: \}

839: \end{enumerate}

840: \end{minipage}}

841: \caption{Asynchronous many-cells-per-one-PE algorithm. General asynchrony}

842: \label{fig:amcgen}

843: \end{figure}

844: %

845: Figure~\ref{fig:amcgen} presents an aggregated variant of the algorithm

846: in Figure~\ref{fig:a1c1pe}.

847: PE$C$, which hosts

848: subarray $C$,

849: maintains the local time register $T(C)$.

850: PE$C_0$ simulates the evolution of its subarray

851: using the algorithm in Figure~\ref{fig:amcgen}

852: with $C=C_0$.

853: Each cell $c~\in~C$

854: is represented in the memory of PE$C$

855: by its current state $s(c)$ and its

856: next arrival time $t(c)$.

857: Note that unlike the one-cell-per-one-PE algorithm,

858: the $t(c)$ does not represent the current local time for cell $c$.

859: Instead, local times of all cells within subarray $C$ are the same, $T(C)$.

860:

861: $T(C)$ moves from one $t(c)$ to another

862: in the order of increasing value.

863: Three successive iterations of this algorithm

864: are shown in Figure~\ref{fig:timlin}, where the subarray $C$

865: consists of four cells: $C=\{1, 2, 3, 4 \}$.

866: Circles in Figure~\ref{fig:timlin} represent

867: arrival points in the simulated time.

868: A crossed-out circle represents an arrival which

869: has just been processed,

870: i.e., Steps 3, 4, and 5 of Figure~\ref{fig:amcgen} have just been executed,

871: so that $T(C)$ has just taken on the value of the processed

872: old arrival time $t(c)$,

873: while the $t(c)$ has taken on a new larger value.

874: This new value is pointed to by an arrow from $T(C)$ in Figure~\ref{fig:timlin}.

875: It is obvious that

876: always $t(c)~\geq~T(C)$ if $c~\in~C$.

877:

878: Local times $T(C)$

879: maintained by different PE$C$ might be different.

880: A wait at Step 3 cannot deadlock

881: the execution

882: since the PE$C$ whose $T(C)$ is the minimum over

883: the entire cellular array is always able to make progress.

884:

885: \begin{figure}

886: \centering

887: \includegraphics*[width=6.2in]{TIMLIN.PS}

888: \caption{$T(C)$ slides along a sequence of $t(c)$'s

889: in successive iterations of the aggregated algorithm}

890: \label{fig:timlin}

891: \end{figure}

892:

893: Assuming property (ii) as above,

894: the algorithm

895: correctly simulates the history

896: of updates.

897: The following example may serve as an informal proof of this statement.

898: Suppose PE1 is currently updating the state of cell $v$

899: (see Figure~\ref{fig:aggr}$a$)

900: and its local time is

901: $T_1$.

902: Since $W(v)=\{ PE5\}$,

903: this update is possible

904: because the local time of PE5, $T_5$, is currently

905: larger than $T_1$.

906: At present,

907: PE1 receives the state

908: of $x$ from PE5

909: in order to perform the update.

910: This state

911: refers to time $T_5$,

912: i.e., in the future with respect to local time $T_1$.

913: However, the update is correct,

914: since

915: the state of $x$ was the same at time $T_1$,

916: as it is at time $T_5$.

917:

918: Indeed, suppose

919: the state of $x$ were to be changed

920: at simulated local time $T$,

921: $T_1 <  T <  T_5$.

922: At the moment when this change would have been processed by PE5,

923: the local time of PE1 would have been larger than $T$,

924: and $T$ would have been the local time of PE5.

925: After this processing has supposedly taken place,

926: the local time of PE1 should not decrease.

927: Yet at present it is $T_1$,

928: which is smaller than $T$.

929: This contradiction proves that the state

930: of $x$ cannot in fact change

931: in the interval ($T_1, T_5$).

932:

933: In the example in Figure~\ref{fig:timlin},

934: only one $t(c)$ supplies $\min_{~c' \in C} t(c')$.

935: However, the algorithm in Figure~\ref{fig:amcgen}

936: at Step 2 says to select {\em a} cell,

937: not {\em the} cell.

938: This covers the unlikely situation of several cells having the same

939: minimum time.

940: If $next\_t(c) - t(c)$ for different $c$ are independent

941: random samples from a distribution with a continuous density,

942: this case occurs with probability zero.

943: On the other hand,

944: if several cells can, with positive probability,

945: update simultaneously,

946: a synchronous version of the aggregated algorithm should be used instead.

947: To eliminate indeterminism and inconsistency,

948: the latter would use

949: synchronization and intermediate storage

950: techniques.

951: These techniques were demonstrated in the algorithm in Figure~\ref{fig:s1c1pe}

952: and their discussion is not repeated here.

953:

954: \begin{figure}

955: \centering

956: \fbox{

957: \begin{minipage} {12.8cm}

958: \begin{enumerate}

959: \item while $T(C)~<~end\_time$ \\

960: \hspace*{0.2in}

961: \{

962: \item~~~~~~select a cell $c$ in the subarray $C$ uniformly at random;

963: \item~~~~~~wait\_until $T(C) \leq \min_{~C' \in W(c)} ~T(C')$ ;

964: \item~~~~~~$s(c) \leftarrow next\_state~(c,~ s (neighbors (c)),~T(C))$ ;

965: \item~~~~~~$T(C) \leftarrow T(C)~-~ \frac{1}{\lambda \times number\_of\_cells\_in\_C} \ln r(C,n(T(C)))$\\

966: \hspace*{0.2in}

967: \}

968: \end{enumerate}

969:

970: \end{minipage}}

971: \caption{Asynchronous many-cells-per-one-PE algorithm. Poisson asynchrony}

972: \label{fig:amcpoi}

973: \end{figure}

974:

975:

976: For an important special case of

977: {\bf Poisson asynchrony in the aggregated algorithm},

978: the algorithm of Figure~\ref{fig:amcgen}

979: is rewritten in Figure~\ref{fig:amcpoi}.

980: This specialization capitalizes on the

981: additive property of Poisson streams,

982: specifically, on the fact

983: that the sum of $k$ independent Poisson streams

984: with rate $\lambda$ each

985: is a Poisson stream with rate $\lambda k$.

986: In the algorithm,

987: $k=number\_of\_cells\_in\_C$;

988: this $k$ is equal to $m^2$ in the special case of partitioning

989: into $m \times m$ subarrays.

990: Unlike the general algorithm of Figure~\ref{fig:amcgen},

991: in the specialization in Figure~\ref{fig:amcpoi}

992: neither individual streams

993: for different cells

994: are maintained,

995: nor future arrivals $t(c)$ for cells are

996: individually computed.

997: Instead, a single cumulative stream is simulated

998: and cells are delegated randomly

999: to meet these arrivals.
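
The step size at Step 5 of Figure~\ref{fig:amcpoi} can be motivated by a short derivation.
It is given here only as a sketch, under the assumption that the $k$ cells of the subarray
have independent exponentially distributed waiting times with rate $\lambda$;
the symbols $\tau_i$ and $\tau$ are introduced for this illustration and are not used elsewhere.
\[
\Pr \{ \min_{1 \leq i \leq k} \tau_i > x \}
= \prod_{i=1}^{k} \Pr \{ \tau_i > x \}
= ( e^{- \lambda x} )^k
= e^{- \lambda k x} , \qquad x \geq 0 .
\]
Hence the time $\tau$ to the next arrival in the merged stream is exponential with rate $\lambda k$.
Setting $e^{- \lambda k \tau} = r$ with $r$ uniform in (0,1) gives
\[
\tau = - \frac{1}{\lambda k} \ln r ,
\]
which is the increment of $T(C)$ at Step 5 with $k=number\_of\_cells\_in\_C$.
By symmetry, each of the $k$ cells is equally likely to own the next arrival,
which is consistent with the uniform selection at Step 2.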

1000:

1001: At Step 5 in Figure~\ref{fig:amcpoi},

1002: $r(C, n(T(C)))$ is the $n(T(C))$-th pseudo-random

1003: number in a sequence uniformly distributed in (0,1).

1004: It follows from the notation

1005: that each PE has its own sequence.

1006: If this sequence is independent of the

1007: run (which is condition (i) above) and

1008: if updates for neighboring cells never coincide in time

1009: (which is condition (ii) above), then this algorithm produces

1010: a unique reproducible trajectory.

1011: The same statement is also true for the algorithm in Figure~\ref{fig:amcgen}.

1012: However,

1013: uniqueness provided by the algorithm in Figure~\ref{fig:amcpoi}

1014: is weaker than the one provided by the algorithm in Figure~\ref{fig:amcgen}:

1015: if the same array is partitioned differently and/or executed

1016: with a different number of PEs,

1017: a trajectory produced by the algorithm in Figure~\ref{fig:amcpoi}

1018: may change;

1019: however, a trajectory produced by the algorithm in Figure~\ref{fig:amcgen}

1020: is invariant under

1021: such changes, provided that each cell $c$ uses its own

1022: fixed

1023: pseudo-random sequence.

1024: \\

1025:

1026: {\bf Efficiency of aggregated algorithms}.

1027: Both many-cells-per-one-PE algorithms

1028: in Figure~\ref{fig:amcgen} and Figure~\ref{fig:amcpoi}

1029: are more efficient than the

1030: one-cell-per-one-PE counterparts

1031: in Figure~\ref{fig:a1c1pe} and Figure~\ref{fig:s1c1pe}.

1032: This additional efficiency

1033: can be explained in the example

1034: of the square array, as follows:

1035: In the algorithms

1036: in Figure~\ref{fig:a1c1pe} and Figure~\ref{fig:s1c1pe},

1037: a PE may wait for its four neighbors.

1038: However,

1039: in the algorithms in Figure~\ref{fig:amcgen} and Figure~\ref{fig:amcpoi},

1040: a PE waits for at most two neighbors.

1041: For example, when the state of cell $w$ in Figure~\ref{fig:aggr}$a$ is updated,

1042: PE1 might wait for PE3 and PE4.

1043: Moreover,

1044: for at least

1045: $(m-2)^2$ cells $c$ out of $m^2$,

1046: PE1 does not wait at all,

1047: because $W(c)=\emptyset$.

1048: The cells $c$ such that $W(c)=\emptyset$

1049: form the dashed square in Figure~\ref{fig:aggr}$a$.
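
As a quick illustration of how the no-wait fraction grows with the subarray size
(the values of $m$ other than $m=4$ are chosen arbitrarily for this sketch):
\[
\frac{(m-2)^2}{m^2} =
\frac{4}{16} = 0.25 ~~ (m=4), \qquad
\frac{100}{144} \approx 0.69 ~~ (m=12), \qquad
\frac{9604}{10000} \approx 0.96 ~~ (m=100).
\]
The remaining $m^2 - (m-2)^2 = 4(m-1)$ cells lie on the boundary of the subarray,
and a wait is possible only when one of them is selected.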

1050:

1051: This additional efficiency becomes especially large if,

1052: instead of set $neighbors (c)$ in the original formulation

1053: of the model,

1054: one uses sets

1055: \begin{equation}

1056: \label{nei2}

1057: neighbors^2 (c)~ \stackrel{\rm def}{=} ~next\_to\_nearest\_neighbors (c)

1058: \end{equation}

1059: or, more generally, $q$-th degree neighborhood,

1060: $neighbors^q (c)$.

1061: The latter is

1062: defined for $q~>~1$ inductively

1063: \begin{equation}

1064: \label{neiq}

1065: neighbors^q (c) \stackrel{\rm def}{=} neighbors ( neighbors^{q-1} (c))

1066: \end{equation}

1067: where $neighbors (S)$ for a set $S$ of cells

1068: is defined as

1069: $neighbors (S) \stackrel{\rm def}{=}  \bigcup_{~c \in S} neighbors (c)$.

1070:

1071: It is easy to rewrite

1072: the algorithms in Figure~\ref{fig:a1c1pe} and Figure~\ref{fig:s1c1pe}

1073: for the case $q~>~1$.

1074: The resulting codes have low efficiency, however.

1075: For example,

1076: in the square array case,

1077: one has

1078: $| neighbors^q (c) | - 1=2q(q+1)$.

1079: Thus, if $q=2$,

1080: a cell might have to wait

1081: for 12 cells

1082: in order to update.

1083: In the same example,

1084: if one PE carries an $m \times m$ subarray,

1085: and $m~>~q$, then the PE waits for at most three other PEs

1086: no matter how large $q$ is.

1087: Moreover,

1088: if $m > 2q$ then in $(m-2q)^2$ cases out of $m^2$

1089: the PE does not wait at all.
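
The count $2q(q+1)$ can be checked directly.
The sketch below assumes, as the formula suggests, that $neighbors (c)$ consists of $c$
and its four nearest cells, and that the array is large enough
for the neighborhood not to wrap around onto itself.
Then $neighbors^q (c)$ is the set of cells whose horizontal and vertical offsets
from $c$ sum to at most $q$, and there are $4j$ cells at offset sum exactly $j \geq 1$, so
\[
| neighbors^q (c) | - 1 = \sum_{j=1}^{q} 4j = 2q(q+1) .
\]
This gives 4 for $q=1$, 12 for $q=2$, and 24 for $q=3$.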

1090: \\

1091:

1092: {\bf The BKL algorithm} \cite{BKL}

1093: was originally proposed for Ising spin simulations.

1094: It was noticed that the probability $p$ to flip

1095: $s(c)$ takes on only a finite (and small) number

1096: $d$ of values $p_1 ,..., p_d$,

1097: each corresponding to one or several

1098: combinations of old values of $s(c)$ and neighboring spins $s(c')$.

1099: Thus the algorithm

1100: splits the cells into $d$

1101: pairwise disjoint classes $\Gamma_1$, $\Gamma_2$, \ldots, $\Gamma_d$.

1102: The rates $\lambda p_k$ of changes

1103: (not just of the attempts to change)

1104: for all $c \in \Gamma_k$

1105: are the same.

1106: At each iteration, the BKL algorithm does the following:

1107: \\

1108: \begin{quotation}

1109: (a) Selects $\Gamma_{k_0}$ at random according to the

1110: weights $| \Gamma_k | p_k$, $k=1,2,\ldots,d$,

1111: and selects a cell $c \in \Gamma_{k_0}$ uniformly at random.

1112: \\

1113:

1114: (b) Flips the state of the selected cell,

1115: $s(c) \leftarrow -s(c)$.

1116: \\

1117:

1118: (c) Increases the time by

1119: $- {\log}_e (r) /( \lambda ( \sum_{1 \leq k \leq d} | \Gamma_k |  p_k ))$,

1120: where $r$ is a pseudo-random number uniformly distributed in (0,1).

1121: \\

1122:

1123: (d) Updates the membership in the classes.

1124: \end{quotation}

1125: If the asynchrony law is Poisson,

1126: the idea of the BKL algorithm

1127: can be applied also to a

1128: deterministic update.

1129: Here the probability $p$

1130: of change takes on just two values:

1131: \\

1132: $p_1 =0$ if $next\_s(c)=s(c)$,

1133: and $p_2 =1$ if $next\_s(c)~ \neq ~s(c)$.

1134: \\

1135: Accordingly, there are two classes:

1136: $\Gamma_0$, the cells which are not going to change

1137: and

1138: $\Gamma_1$, the cells which are going to change.

1139: As with the original BKL algorithm,

1140: a substantial overhead is required for maintaining an account

1141: of the membership in the classes (Step (d)).

1142: The BKL algorithm is justified only if a large number of cells

1143: are not going to change their states.

1144: The latter is often the case.

1145: For example,

1146: in Conway's synchronous {\em Game of Life}

1147: (Gardner \cite{GAR})

1148: large regions of

1149: white cells ($s(c)=0$) remain

1150: unchanged for many iterations

1151: with very few black cells ($s(c)=1$).

1152: One would expect similar behavior

1153: for an asynchronous version of the

1154: Game of Life.
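
To illustrate the potential gain in the deterministic-update case described above,
consider the following sketch; the symbols $M$ and $N$ and the numbers are hypothetical.
Cells that are not going to change carry zero weight at step (a),
so only cells that are going to change are ever selected.
If $M$ denotes the current number of such cells,
the time increment at step (c) reduces to
\[
- \frac{ {\log}_e (r) }{ \lambda M } , \qquad \mbox{with mean} ~~ \frac{1}{\lambda M} .
\]
In the standard model with $N$ cells, attempts arrive at the total rate $\lambda N$
and only a fraction $M/N$ of them changes a state,
so on average $N/M$ attempts are processed per actual change.
For instance, with $N=10^4$ and $M=100$,
the standard model processes about 100 attempts per change,
whereas the BKL procedure processes one selection per change,
at the cost of the bookkeeping of step (d).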

1155:

1156: The basic BKL algorithm is serial.

1157: To use it on a parallel computer,

1158: an obvious idea is to run a copy of the serial BKL algorithm

1159: in each subarray carried by a PE.

1160: Such a procedure,

1161: however,

1162: causes roll-backs,

1163: as seen in the following example:

1164:

1165: Suppose PE1 is currently updating the state

1166: of cell $v$ (Figure~\ref{fig:aggr}$a$) and its

1167: local time is $T_1$,

1168: while the local time of PE5, $T_5$,

1169: is larger than $T_1$.

1170: Since $x$ is a nearest neighbor to $v$,

1171: $x$'s membership might change because of $v$'s changed state.

1172: Suppose $x$'s membership were to indeed change.

1173: Although this change would have been in effect since time $T_1$,

1174: PE5, which is responsible for $x$,

1175: would learn about the change

1176: only at time $T_5 ~>~T_1$.

1177: As the past of PE5 is not, therefore,

1178: what PE5 has believed it to be,

1179: interval [$T_1 , T_5$] must have

1180: been simulated by PE5 incorrectly,

1181: and must be played again.

1182: This original roll-back might cause a cascade

1183: of secondary roll-backs, third-generation roll-backs, etc.

1184: \\

1185:

1186: {\bf A modified BKL algorithm}

1187: applies the original BKL procedure

1188: only to a subset of the cells,

1189: whereas

1190: the procedure of the standard model is applied

1191: to the remaining cells.

1192: More specifically:

1193: An additional separate class $\Gamma_0$ is defined.

1194: Unlike other $\Gamma_k$, $k~>~0$,

1195: class $\Gamma_0$

1196: always contains the same cells.

1197: Steps (a) - (d) are performed as above

1198: with the following modifications:

1199: \newpage

1200: \begin{quotation}

1201: 1) The weight of $\Gamma_0$ at step (a) is taken to be

1202: $| \Gamma_0 |$.

1203: \\

1204:

1205: 2) If the selected $c$ belongs to $\Gamma_0$,

1206: then at step (b) the state of $c$ may or may not change.

1207: The probability $p$ of change

1208: is determined as in the standard model.

1209: \\

1210:

1211: 3) The time at step (c) should be increased by

1212: $- {\log}_e (r) /( \lambda ( | \Gamma_0 |~+~\sum_{1 \leq k \leq d} | \Gamma_k |p_k ))$,

1213: where $r=r(c,n(t))$ is a pseudo-random number uniformly distributed in $(0,1)$.\\

1214: \end{quotation}

1215:

1216: Now consider again the subarray

1217: carried by PE1 in Figure~\ref{fig:aggr}$a$.

1218: The subarray can be subdivided

1219: into the $(m-2) \times (m-2)$ ``kernel'' square

1220: and the remaining boundary layer.

1221: If first degree neighborhood, $neighbors~(c)$,

1222: is replaced with the $q$-th degree neighborhood,

1223: $neighbors^q (c)$,

1224: then the kernel is the central $(m-2q) \times (m-2q)$ square,

1225: and the boundary layer has width $q$.

1226: In Figure~\ref{fig:aggr}$a$, the cells in the dashed square

1227: constitute the kernel with $q=1$.

1228: To apply the modified BKL procedure to

1229: the subarray carried by PE1,

1230: the boundary layer is declared to be

1231: the special fixed class $\Gamma_0$.

1232: Similar identification is done in the other subarrays.

1233: As a result,

1234: the fast concurrent BKL procedures

1235: on the kernels

1236: are shielded from each other

1237: by slower procedures on the layers.
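
For a concrete sense of the weights involved,
the following sketch takes $q=1$ and the hypothetical subarray size $m=12$.
The fixed class then consists of the boundary cells,
\[
| \Gamma_0 | = m^2 - (m-2)^2 = 4(m-1) = 44 ,
\]
and the kernel contains $(m-2)^2 = 100$ cells distributed among
$\Gamma_1 , \ldots , \Gamma_d$.
At step (a) the class $\Gamma_0$ is selected with probability
\[
\frac{ | \Gamma_0 | }{ | \Gamma_0 | + \sum_{1 \leq k \leq d} | \Gamma_k | p_k } ,
\]
so when few kernel cells are likely to change,
most iterations are spent on the boundary layer,
which is processed as in the standard model.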

1238:

1239: The roll-back is avoided,

1240: since a state change of a cell

1241: in one subarray does not entail a state

1242: or membership change of a cell

1243: in another subarray.

1244: Unless the performance of PE1 is taken into account,

1245: the neighbors of PE1

1246: cannot even tell whether PE1

1247: uses the standard or the BKL

1248: algorithm to update its kernel.

1249: As the size of the subarray increases,

1250: so do both the relative weight of the kernel

1251: and the fraction of the fast BKL processing.

1252: \\

1253:

1254: {\bf Generating the output}.

1255: Consider the task of generating cellular patterns

1256: for specified simulated times.

1257: A method for performing this task in a serial

1258: simulation or a parallel simulation of a synchronous

1259: cellular array is obvious:

1260: as the global time reaches a specified value,

1261: the computer outputs the states of all cells.

1262: In an asynchronous simulation,

1263: the task becomes more complicated

1264: because

1265: there is no global time:

1266: different PEs may have different local times

1267: at each physical instant of simulation.

1268:

1269: Suppose for example,

1270: one wants to see the cellular patterns

1271: at regular time intervals

1272: $K_0 \Delta t,~(K_0 +1)  \Delta t,~(K_0 +2) \Delta t,...$

1273: on a screen of a monitor attached to the computer.

1274: Without getting too involved

1275: in the details of performing I/O

1276: operations and the architecture of the parallel computer,

1277: it would be enough to assume that a separate process

1278: or processes are associated with the output;

1279: these processes scan an output buffer memory space

1280: allocated in one or several PEs or in the shared memory;

1281: the buffer space consists of $B$ frames,

1282: numbered 0,1,...,$B-1$,

1283: each capable of storing a complete image of

1284: the cellular array for one time instant.

1285: The output processes draw

1286: the image for time $K \Delta t$

1287: on the screen

1288: as soon as

1289: the frame number $rem (K/B)$

1290: (the remainder of the integer

1291: division of $K$ by $B$)

1292: is full and the previous

1293: images have been shown.

1294: Then the frame is flushed for

1295: the next round when it will be filled

1296: with the image for time $(K+B) \Delta t$

1297: and so on.

1298: \\

1299: \begin{figure}

1300: \centering

1301: \fbox{

1302: \begin{minipage} {12.8cm}

1303: /* Initially $K=K_0$, $T(C)~<~K_0 \Delta t$ */\\

1304: \begin{enumerate}

1305: \item while $T(C)~<~end\_time$\\

1306: \hspace*{0.15in}

1307: \{

1308: \item~~~~~~select a cell $c$ in the subarray $C$ such that\\

1309: \hspace*{0.2in}

1310: ~$t(c)=\min_{~c' \in C} t(c')$ and assign $new\_T \leftarrow t(c)$;

1311: \item~~~~~~while $new\_T > K \Delta t$ \\

1312: \hspace*{0.2in}

1313: ~~~~\{

1314: \item~~~~~~~~~~~~~wait\_until frame $rem (K/B)$ is available;

1315: \item~~~~~~~~~~~~~store image $s(C)$ into frame $rem (K/B)$;

1316: \item~~~~~~~~~~~~~$K \leftarrow  K+1$\\

1317: \hspace*{0.2in}

1318: ~~~~~\};

1319: \item~~~~~~$T(C)  \leftarrow  new\_T$;

1320: \item~~~~~~wait\_until $T(C) \leq \min_{~C' \in W(c)} ~T(C')$ ;

1321: \item~~~~~~$s(c) \leftarrow next\_state (c,~ s (neighbors (c)),~t(c))$ ;

1322: \item~~~~~~$t(c) \leftarrow  time\_of\_next\_arrival~(c,~s (neighbors (c)),~ t(c))$\\

1323: \hspace*{0.15in}

1324: \}

1325: \end{enumerate}

1326: \end{minipage}}

1327: \caption{Generating the output in the aggregated asynchronous algorithm}

1328: \label{fig:genout}

1329: \end{figure}

1330:

1331: The algorithm must fill the appropriate frame

1332: with the appropriate data as soon as

1333: both data and the frame become available.

1334: The modifications that enable the asynchronous algorithm

1335: in Figure~\ref{fig:amcgen} to perform this task

1336: are presented in Figure~\ref{fig:genout}.

1337: In this algorithm,

1338: variables $new\_T$ and $K$ are private (i.e., local to PE)

1339: and

1340: $\Delta t$ and $K_0$ are constants

1341: whose values are the same for all the PEs.

1342: Note that different PEs may

1343: fill different frames concurrently.

1344: If the slowest PE is presently filling an image for time $K \Delta t$,

1345: then the fastest PE is allowed to fill the image for

1346: time no later than $(K + B - 1) \Delta t$.

1347: An attempt by the fastest PE to

1348: fill the image for time $(K + B) \Delta t$

1349: will be blocked at Step 4,

1350: until the frame

1351: number $rem(K/B)=rem((K + B)/B)$

1352: becomes available.
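
As an illustration of the frame indexing,
with the hypothetical values $B=4$ and $K_0=10$
the images for successive output times are stored as follows:
\[
\begin{array}{c|ccccccc}
K & 10 & 11 & 12 & 13 & 14 & 15 & 16 \\ \hline
rem (K/B) & 2 & 3 & 0 & 1 & 2 & 3 & 0
\end{array}
\]
The image for $K=14$ reuses frame 2,
so a PE that reaches $K=14$ waits at Step 4 of Figure~\ref{fig:genout}
until the image for $K=10$ has been shown and the frame has become available again.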

1353:

1354: Thus, the finiteness of the output buffer introduces

1355: a restriction which is not present in the original algorithm

1356: in Figure~\ref{fig:amcgen}.

1357: According to this restriction,

1358: the lag between concurrently processed local times

1359: cannot exceed

1360: a certain constant.

1361: The exact value of the constant in each particular instance

1362: depends on the relative

1363: positions of the update times within the $\Delta t$-slots.

1364: In any case,

1365: the constant is not smaller than

1366: $(B-1) \Delta t$

1367: and

1368: not larger than

1369: $B \Delta t$.

1370:

1371: However,

1372: even with a single output buffer segment, $B=1$,

1373: the simulation does not become time-driven.

1374: In this case,

1375: the concurrently processed local times might be

1376: within a distance of

1377: up to $\Delta t$

1378: of each other,

1379: whereas $\Delta t$ might be relatively large.

1380: No precision of update time representation is lost,

1381: although efficiency might degrade

1382: when both $\Delta t$ and $B$ become too small,

1383: see Section~\ref{sec:perf}.

1384:

1385: \section{Performance assessment: experiments and simulations}\label{sec:perf}

1386: \hspace*{\parindent}

1387: Modeling and analysis of asynchronous

1388: algorithms is a difficult theoretical problem.

1389: Strictly speaking,

1390: the following discussion is applicable

1391: only to synchronous algorithms.

1392: However, one may argue informally

1393: that the performance of an asynchronous

1394: algorithm is not worse than that of its synchronous counterpart,

1395: since expensive synchronizations are eliminated.

1396:

1397: First, consider the synchronous algorithm in Figure~\ref{fig:s1c1pe}.

1398: Let $N$ be the size of the array and $N_0$ be the number of

1399: cells which passed

1400: the test at Step 2, Figure~\ref{fig:s1c1pe}.

1401: The ratio of useful work performed,

1402: to the total work expended at

1403: the iteration is $N_0 /N$.

1404: This ratio yields the {\em efficiency}

1405: (or {\em utilization}) at the given iteration.

1406: Assuming that in the serial algorithm all the work is useful,

1407: and that the algorithm performs the same computation as its parallel

1408: counterpart,

1409: the speed-up of the parallel computation

1410: is the average efficiency times the number of PEs involved.

1411: Here the averaging is done

1412: with equal weights

1413: over all the iterations.
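
In symbols, and only as a restatement of the definitions above
(the iteration index $i$, the number of iterations $I$,
and the notation $N_0^{(i)}$ are introduced here for illustration):
\[
\mbox{average efficiency} = \frac{1}{I} \sum_{i=1}^{I} \frac{N_0^{(i)}}{N} ,
\qquad
\mbox{speed-up} = N \times \mbox{average efficiency}
= \frac{1}{I} \sum_{i=1}^{I} N_0^{(i)} .
\]
That is, in the one-cell-per-one-PE setting, where the number of PEs is $N$,
the speed-up under the stated assumptions equals the average number of cells
updated per iteration.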

1414:

1415: In the general algorithms,

1416: $next\_t(c)$ is determined using the

1417: states of the neighbors of $c$.

1418: However, in important applications,

1419: such as an Ising model,

1420: $next\_t(c)$ is independent of states.

1421: The following assessment is valid only for

1422: this special case of independence.

1423: Here

1424: the configuration is irrelevant

1425: and

1426: whether the test succeeds or not

1427: can be determined knowing only the times at each iteration.

1428: This leads to a simplified model in which

1429: only local times are taken into account:

1430: at an iteration,

1431: the local time of

1432: a cell is incremented

1433: if the time does not exceed

1434: the minimum of the local times of its neighbors.

1435:

1436: A simple (serial) algorithm

1437: which updates only local times of cells $t(c)$

1438: according to the rules formulated above

1439: was exercised for different array sizes $n$

1440: and three different dimensions:

1441: for an $n$-element circular array,

1442: an $n \times n$ toroidal array,

1443: and for $n \times n \times n$ array with periodic boundary conditions.

1444: Two types of asynchrony were tried:

1445: the Poisson asynchrony

1446: for which

1447: $next\_t~-~t$ is distributed exponentially,

1448: and the asynchrony

1449: for which $next\_t~-~t$ is

1450: uniformly distributed in (0,1).

1451: In both cases,

1452: random time increments

1453: for different cells are independent.

1454:

1455: \begin{figure}

1456: \centering

1457: \includegraphics*[width=5.8in]{PERF1T1.PS}

1458: \caption{Performance of the Ising model simulation. One-cell-per-one-PE case}

1459: \label{fig:perf1t1}

1460: \end{figure}

1461:

1462: The results of these six experiments

1463: are given in Figure~\ref{fig:perf1t1}.

1464: Each solid line in Figure~\ref{fig:perf1t1} is enclosed between two dashed

1465: lines.

1466: The latter represent

1467: 99.99\% Student's confidence intervals constructed

1468: using several simulation runs,

1469: that are parametrically the same

1470: but fed with different pseudo-random sequences.

1471: In Figure~\ref{fig:perf1t1}, for each array topology

1472: there are two solid lines.

1473: The Poisson asynchrony

1474: always corresponds to the lower line.

1475: The corresponding limiting values of performances

1476: (when $n$ is large)

1477: are also shown near the right end of each curve.

1478: For example, the efficiency in the

1479: simulation of a large $n \times n$ array

1480: with the Poisson asynchrony is about 0.121,

1481: with the other asynchrony, it is about 0.132.

1482:

1483: No analytical theory is available

1484: for predicting these values

1485: or even proving their separation from zero

1486: when $n \rightarrow +\infty$.

1487: It follows from Figure~\ref{fig:perf1t1} that replacing

1488: exponential distribution of $next\_t - t$ with

1489: the uniform distribution results in efficiency

1490: increase

1491: from 0.247 to 0.271 for a large $n$-circle

1492: ($n \rightarrow +\infty$).

1493: The efficiency can be raised even more.

1494: If $next\_t - t = r^{1/8}$,

1495: where $r$ is distributed uniformly in (0,1),

1496: then in the limit $n \rightarrow +\infty$,

1497: with the Student's confidence 99.99\%,

1498: the efficiency is $0.3388 \pm 0.0012$.

1499: It is not known how high the efficiency

1500: can be raised this way

1501: (degenerated cases, like a synchronous one,

1502: in which the efficiency is 1, are not counted).

1503:

1504: An efficiency of 0.12 means a speed-up

1505: of $0.12 \times N$;

1506: for $N=2^{14}$ this comes to more than 1900.

1507: This assessment is confirmed in an actual full scale simulation experiment

1508: performed on $2^{14}=128 \times 128$ PEs of

1509: a Connection~Machine~(R)

1510: (a quarter of the full computer).

1512: This SIMD computer

1513: appears well-suited for the synchronous execution

1514: of the one-cell-per-one-PE algorithm in Figure~\ref{fig:s1c1pe}

1515: on a toroidal array,

1516: Poisson asynchrony law.

1517: Since an individual PE is rather slow

1518: (it executes several thousand

1519: instructions per second),

1520: the absolute speed is not very impressive:

1521: It took roughly 1 sec. of real time

1522: to update

1523: all $128 \times 128$ spins

1524: when the traffic generated by other tasks running

1525: on the computer was small

1526: (more precise measurement

1527: was not available).

1528: This includes about

1529: $8.3~\approx~(0.12)^{-1}$

1530: rounds of the algorithm,

1531: several hundred instructions of one PE per round.

1532:

1533: The 12\% efficiency in the one-cell-per-one-PE

1534: experiments could be greatly increased

1535: by aggregation.

1536: The many-cells-per-one-PE

1537: algorithm in Figure~\ref{fig:amcpoi} was implemented

1538: as a $C$ language parallel program for a

1539: Balance~(TM) computer,

1540: which is a shared memory MIMD bus machine.

1541: The $n \times n$ array was split into

1542: $m \times m$ subarrays,

1543: as shown in Figure~\ref{fig:aggr},

1544: where $n$ is a multiple of $m$.

1545: Because the computer has 30 PEs,

1546: the experiments could be performed only with

1547: $(n/m)^2=1, 4, 9, 16$, and 25 PEs

1548: for different $n$ and $m$.

1549:

1550: Along with these experiments,

1551: a simplified model, similar to

1552: the one-cell-per-one-PE case,

1553: was run on a serial computer.

1554: In this model,

1555: quantity

1556: $h(C) \stackrel{\rm def}{=} \lambda T(C)$ is maintained for each PE,

1557: $C=1,...,(n/m)^2$.

1558: The update of $h(C)$ is arranged in rounds,

1559: wherein each $h(C)$ is updated as follows:

1560: \\

1561: ~~~~~~(i) with probability $p_0 =(m-2)^2 /m^2$,

1562: PE$C$ updates $h(C)$:

1563: \begin{equation}

1564: \label{hC}

1565: h(C) ~\leftarrow~ h(C) ~-~ \ln r (C, n(h(C))),

1566: \end{equation}

1567: where $r$ and $\ln$ are the same as in Step 5 in

1568: Figure~\ref{fig:amcpoi}.

1569: Here $p_0$ is the probability

1570: that the PE chooses a cell $c$

1571: so that $|W(c)|=0$;

1572: \\

1573: ~~~~~~(ii) with probability $p_1 =4(m-2)/m^2$,

1574: the PE must check the $h(C')$ of one of its four neighbors $C'$

1575: before making the update.

1576: The $C'$ is chosen uniformly at random among the four possibilities.

1577: If $h(C') ~\geq~ h(C)$,

1578: then $h(C)$ gets an increment according to \eqref{hC};

1579: otherwise, $h(C)$ is not updated.

1580: Here $p_1$ is the probability that PE will choose

1581: a cell $c$ in an edge but not in a corner, so that $|W(c)|=1$;

1582: \\

1583: ~~~~~~(iii) with the remaining probability $p_2=4/m^2$,

1584: the PE checks $h(C')$ and $h(C'')$

1585: of two of its adjacent neighbors

1586: (for example in Figure~\ref{fig:aggr}, neighbors PE2 and PE3

1587: can be involved in the computation for PE1).

1588: The two neighbors are chosen uniformly at random

1589: from the four possibilities.

1590: Again, if both

1591: $h(C')  \geq  h(C)$

1592: and

1593: $h (C'') \geq  h(C)$,

1594: then $h(C)$ gets an increment according to \eqref{hC};

1595: otherwise, $h(C)$ is not updated.

1596: Here $p_2$ is the probability

1597: to choose a cell $c$ in a corner,

1598: so that $|W(c)|=2$.
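
The three probabilities exhaust all cases, since
\[
p_0 + p_1 + p_2 = \frac{(m-2)^2 + 4(m-2) + 4}{m^2}
= \frac{((m-2)+2)^2}{m^2} = 1 .
\]
For example, for $m=12$ one gets
$p_0 = 100/144 \approx 0.694$,
$p_1 = 40/144 \approx 0.278$,
and $p_2 = 4/144 \approx 0.028$,
so that in roughly 69\% of the cases the update requires no check of a neighbor.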

1599:

1600: As in the previous case,

1601: this simplified model simulates

1602: a possible but not obligatory synchronous timing arrangement for

1603: executing the real asynchronous algorithm.

1604: Figure~\ref{fig:perfmt1} shows excellent agreement between

1605: actual and predicted performances

1606: for the aggregated Ising model.

1607: The efficiency presented in Figure~\ref{fig:perfmt1} is computed as

1608:

1609: \begin{equation}

1610: \label{effi}

1611: {\rm efficiency}=

1612: \frac{\rm serial~execution~time} {{\rm number~of~PEs} \times {\rm parallel~execution~time}}

1613: \end{equation}

1614:

1615: The parallel speed-up can be found as

1616: efficiency$~\times~$number~of~PEs.

1617: For 25 PEs simulating a 120$\times$120 Ising model,

1618: efficiency is 0.66;

1619: hence, the speed-up is greater than 16.

1620: For the currently unavailable sizes,

1621: when $10^4$ PEs

1622: simulate a $10^4 \times 10^4$ array,

1623: the simplified model predicts

1624: an efficiency of about 0.8 and a speed-up of about 8000.
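
The arithmetic behind these figures, spelled out as a check:
\[
0.66 \times 25 = 16.5 , \qquad 0.8 \times 10^4 = 8000 .
\]
In the first case, $(n/m)^2 = 25$ with $n=120$ gives $m=24$;
in the second case, $(n/m)^2 = 10^4$ with $n=10^4$ gives $m=100$.
By item (i) of the simplified model,
a PE then avoids any check of a neighbor with probability
$p_0 = (22/24)^2 \approx 0.84$ and $p_0 = (98/100)^2 \approx 0.96$, respectively.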

1625:

1626: \begin{figure}

1627: \centering

1628: \includegraphics*[width=5.8in]{PERFMT1.PS}

1629: \caption{Performance of the Ising model simulation. Many-cells-per-one-PE case}

1630: \label{fig:perfmt1}

1631: \end{figure}

1632:

1633: In the experiments reported above,

1634: the lag between the local times of any two PEs

1635: was not restricted.

1636: As discussed in Section~\ref{sec:algo},

1637: an upper bound on the lag

1638: might result from the necessity

1639: to produce the output.

1640: To see how the bound

1641: affects the efficiency,

1642: one experiment reported in Figure~\ref{fig:perfmt1}

1643: was repeated with various finite

1644: values of the lag bound.

1645: In this experiment,

1646: an $n \times n$ array is simulated

1647: and

1648: one PE carries an $m \times m$ subarray,

1649: where $n=384$ and $m=12$.

1650: The results are presented in Figure~\ref{fig:perfbl}.

1651:

1652: In Figure~\ref{fig:perfbl},

1653: the unit of measure for a lag is the expectation of

1654: time intervals between consecutive arrivals for a cell.

1655: For lag bounds greater than 16,

degradation of efficiency is almost unnoticeable
compared with the base experiment, where the lag is unbounded (lag$= \infty$).

Substantial degradation starts at a lag bound of about 8;

1660: for the unity lag bound,

1661: the efficiency is about half that of the base experiment.

1662: However, even for lag bound 0.3, the simulation remains practical,

1663: with an efficiency of about 0.1;

1664: since 1024 PEs execute the task,

1665: this efficiency means a speed-up of more than 100.
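
The PE count used here follows from the array sizes given above, assuming the $n \times n$ array is tiled exactly by $m \times m$ subarrays with one subarray per PE:
\[
\left(\frac{n}{m}\right)^2 = \left(\frac{384}{12}\right)^2 = 32^2 = 1024,
\qquad
0.1 \times 1024 \approx 102 > 100 .
\]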

1666:

1667: \begin{figure}

1668: \centering

1669: \includegraphics*[width=5.8in]{PERFBL.PS}

1670: \caption{Efficiency degradation caused by bounded lag}

1671: \label{fig:perfbl}

1672: \end{figure}

1673:

1674: \section{Conclusion}\label{sec:concl}

1675: \hspace*{\parindent}

1676: This paper demonstrates an efficient parallel method

1677: for simulating asynchronous cellular arrays.

1678: The algorithms are quite simple and easily implementable

1679: on appropriate hardware.

1680: In particular, each algorithm

1681: presented in the paper

1682: can be implemented on a general purpose

1683: asynchronous parallel computer,

1684: such as the currently available bus machines with shared memory.

The speed of such an implementation

1686: depends on the speed of PEs

1687: and the efficiency of the communication system.

A crucial condition for the success of such an implementation

1689: is the availability of a good parallel generator

1690: of pseudo-random numbers.

1691: To assure reproducibility,

1692: each PE should have its own reproducible

1693: pseudo-random sequence.

1694:

1695: The proposed algorithms present

1696: a number of challenging mathematical problems,

1697: for example, the problem of

1698: proving that efficiency tends

1699: to a positive limit when the number of PEs increases

1700: to infinity.

1701: \\

1702:

1703: {\bf Acknowledgments}.

1704: \\

I thank the personnel
of Thinking Machines Corp. for their kind invitation
and for their help in debugging and running the parallel *LISP

1708: program on one of their computers.

1709: Particularly, the help of

1710: Mr. Gary Rancourt and Mr. Bernie Murray was invaluable.

1711: Also, I thank Andrew T. Ogielski and Malvin H. Kalos

1712: for stimulating discussions,

1713: Debasis Mitra for a helpful explanation of a topic in Markov chains,

1714: and Brigid Moynahan for carefully reading the text.

1715:

1716: \newpage

1717: \begin{thebibliography}{MMMM}

1718: \bibitem[1]{GG}

1719: S. Geman and D. Geman,

1720: Stochastic relaxation, Gibbs distributions,

1721: and the Bayesian restoration of images,

{\em IEEE Transactions on Pattern Analysis and Machine Intelligence},
{\bf PAMI-6}, 6 (Nov. 1984), 721--741.

1724:

1725: \bibitem[2]{LM}

1726: B.~D. Lubachevsky and D. Mitra,

1727: A chaotic asynchronous algorithm for computing

1728: the fixed point of a nonnegative matrix of unit spectral radius,

1729: {\em Journal of the ACM}, {\bf 33}, 1 (1986), 130--150.

1730:

1731: \bibitem[3]{INBUV}

1732: T.~E. Ingerson and R.~L. Buvel,

1733: Structure in asynchronous cellular automata,

1734: {\em Physica}, {\bf 10D} (1984), 59--68.

1735:

1736: \bibitem[4]{HOF}

1737: M.~I. Hoffman,

A cellular automaton model based on cortical physiology,

1739: {\em Complex Systems}, {\bf 1}, (1987), 187--202.

1740:

1741: \bibitem[5]{ISING}

E. Ising,
Beitrag zur Theorie des Ferromagnetismus,

1744: {\em Z. Physik}, {\bf 31} (1925), 253--258.

1745:

1746: \bibitem[6]{GL}

1747: R.~J. Glauber,

1748: Time-dependent statistics of the Ising model,

1749: {\em Journ. Math. Physics}, {\bf 4}, no.2 (1963), 294--307.

1750:

1751: \bibitem[7]{MRRTT}

N. Metropolis, A.~W. Rosenbluth, M.~N. Rosenbluth, A.~H. Teller, and E. Teller,

1753: Equation of state calculations by fast computing machines,

1754: {\em Journ. Chem. Physics}, {\bf 21}, no.6 (1953), 1087--1092.

1755:

1756: \bibitem[8]{BKL}

1757: A.~B. Bortz, M.~H. Kalos, and J.~T. Lebowitz,

1758: A new algorithm for Monte Carlo simulation of Ising spin systems,

1759: {\em J. Comp. Physics}, {\bf 17} (1975), 10--18.

1760:

1761: \bibitem[9]{OGI}

1762: A.~T. Ogielski,

1763: Dynamics of three-dimensional Ising spin glasses in thermal equilibrium,

1764: {\em Physical Review B}, {\bf 32}, no.11 (1985), 7384--7398.

1765:

1766: \bibitem[10]{FC}

R. Friedberg and J.~E. Cameron,

1768: Test of the Monte Carlo method: fast simulation of a small Ising lattice,

1769: {\em Journ. Chem. Physics}, {\bf 52}, no.12 (1970), 6049--6058.

1770:

1771: \bibitem[11]{CR}

1772: M. Creutz,

1773: Deterministic Ising dynamics,

{\em Ann. Phys.}, {\bf 167} (1986), 62--72.

1775:

1776: \bibitem[12]{GAR}

1777: M. Gardner,

1778: Mathematical games.

The fantastic combinations of John Conway's new solitaire game ``life'',

1780: {\em Scientific American}, October 1970, 120--124.

1781:

1782: \end{thebibliography}

1783: % ----------------------------------------------------------------

1784:

1785: \newpage

{\bf APPENDIX: working code for the Ising simulation}

1787: \\

1788: \\

A C language program for the BALANCE parallel computer.
The code is used for timing only and contains no I/O.
The code of the pseudo-random number generator
is not included.

1793: \begin{verbatim}

1794: #include <pp.h>

1795: #include <math.h>

1796: #include <sys/tmp_ctl.h>

1797:

1798: #define SHARED_MEM_SIZE (sizeof(double)*10000)

1799: #define END_TIME 1000.

1800: #define A 24       /* side of small square a PE takes care of*/

1801: #define M 5        /* number of PEs along a side of the big square*/

1802:

1803: shared int nPEs = M*M, spin[M*A][M*A];

1804: shared float time[M][M];  /*local times on subarrays*/

1805: shared float prob[10];  /* probabilities of state change */

1806: shared float J = 1., H = 0.;     /* Energy= -J sum spin spin' - H sum spin */

1807: shared float T = 1.;                /* Temperature */

1808: shared int ato2 = A*A;

1809: shared int am = A*M;

1810:

1811: main()

1812: {

1813:     int i,j,child_id, my_spin, sum_nei, index, bit;

1814:     float d_E, x;

1815:     double frand();

1816:

1817: /* compute flip probabilities */

1818:     for (i = 0; i < 5 ; i++)

1819:         for (j = 0; j < 2; j++)

1820:           {index = i + 5*j;    /* index = 0,1,...,9 */

1821:            my_spin = 2*j - 1;

1822:            sum_nei = 2*i - 4;

1823:            d_E = 2.*(J * my_spin * sum_nei + H * my_spin);

1824:            x = exp(-d_E/T);

1825:            prob[index] = x/(1.+x);

1826:    /*      printf("prob[%d]=%f\n",index,prob[index]);  */

1827:           };

1828:

1829: /* initialize local times */

1830:     for (i = 0; i < M; i++)

1831:         for (j = 0; j < M; j++)

1832:             time[i][j]=0.;

1833:

1834: /* initialize spins at random, in seedran(seed,b), b is dummy*/

1835:     seedran(31234,1);

1836:     for (i = 0; i < M*A; i++)

1837:         for (j = 0; j < M*A; j++) {

1838:             bit = 2*frand(1);                /* bit becomes 0 or 1 */

1839:             spin[i][j] = 2*bit - 1;          /* spin becomes -1 or 1 */

1840:    /*       printf("spin[%d][%d]=%d\n",i,j,spin[i][j]);      */

1841:     };

1842:

1843:     /* in the following loop single PE spawns nPEs other PEs for concurrent

1844:        execution. Each child PE would execute subroutine work(my_id) with its

1845:        own argument my_id. */

1846:

1847:     for (child_id = 0; child_id < nPEs; child_id++)

1848:           if (fork() == 0) {

1849:               tmp_affinity(child_id);     /* fixing a PE for process child_id */

1850:               work(child_id);              /* starting a child PE process */

1851:               exit(0);

1852:           }

1853:

1854:

1855:     /* in the following loop the parent PE awaits termination of each child PE

1856:        then terminates itself */

1857:     for (child_id = 0; child_id < nPEs; child_id++) wait(0);

1858:     exit(0);

1859: }

1860:

1861: work(my_id)

1862: int my_id;

1863: {

1864:   int i,j;

1865:   int coord, var;

1866:   int x,y,my_i,my_j,sum_nei, nei_i,nei_j;

1867:   int  up_i, down_i, left_j, right_j;

1868:   int i_base, j_base;

1869:   int index;

1870:   double frand();

1871:   double r;

1872:   double end_time;

1873:

1874:   end_time = END_TIME*A*A;

1875:                                /*normalizing time scale for multiprocessor execution*/

1876:

1877:   my_i = my_id%M;      /*PE my_id carries small square (my_i,my_j)*/

1878:   i_base = my_i*A;

1879:   up_i = (my_i + 1)%M;

1880:   down_i = (my_i + M - 1)%M;

1881:

1882:   my_j = (my_id-my_i)/M;

1883:   j_base = my_j*A;

1884:   left_j = (my_j + M - 1)%M;

1885:   right_j = (my_j + 1)%M;

1886:

1887:   seedran(my_id*my_id*my_id,my_id);

1888:  /*PE my_id has its own copy of pseudo-random number generator and initializes it

1889:    using seedran(seed,my_id) with unique seed=my_id*my_id*my_id */

1890:

1891:   while(time[my_i][my_j] < end_time)

1892:   {

1893:     r = frand(my_id);

1894:       /*PE my_id obtains next pseudo-random number from its own sequence*/

1895:     x = r*A;

1896:     y = (r*A-x)*A;

1897:        /*pick a random cell with internal address (x,y) within the A*A square*/

1898:

1899: /*compute sum of neighboring spins*/

1900:     sum_nei = 0;

1901:     for (coord = 0;  coord < 2; coord += 1)

1902:         for (var = -1;  var < 2; var += 2)

1903:     {

1904:           nei_i = x;

1905:           nei_j = y;

1906:           if(coord == 0) nei_i += var;

1907:           if(coord == 1) nei_j += var;

1908:

1909:           if(0 <= nei_i && nei_i < A && 0 <= nei_j && nei_j < A)

1910:           {

1911:              nei_i += i_base;

1912:              nei_j += j_base;

1913:           }

1914:           else

1915:           {

1916:        /* 4 possible reasons to wait for a neighboring PE */

1917:             if(-1 == nei_i) while (time[down_i][my_j]  < time[my_i][my_j]) ;

1918:             if(-1 == nei_j) while (time[my_i][left_j]  < time[my_i][my_j]) ;

1919:             if(nei_i == A)  while (time[up_i][my_j]    < time[my_i][my_j]) ;

1920:             if(nei_j == A)  while (time[my_i][right_j] < time[my_i][my_j]) ;

1921:

1922:             nei_i = (nei_i+i_base+am)%am;

1923:             nei_j = (nei_j+j_base+am)%am;

1924:           };

1925:           sum_nei += spin[nei_i][nei_j];

1926:     };

1927:

1928: /*recover index*/

1929:     index = (sum_nei + 4)/2 + 5*(spin[x+i_base][y+j_base] + 1)/2;

1930:

1931:     r = frand(my_id);

1932:

1933:     if(r < prob[index])

1934:       spin[x+i_base][y+j_base] *= -1;

1935:     else /* printf(": NO flip\n") */ ;

1936:

1937:     r = frand(my_id);

1938:     time[my_i][my_j] += -log(r);

1939:   };

1940: }

1941: \end{verbatim}
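
The table {\tt prob} built at the start of {\tt main} can be read as follows. This is only a restatement of the code above, in notation introduced here for illustration: $s$ stands for {\tt my\_spin}, $\sigma$ for {\tt sum\_nei}, and $T$ is assumed to be measured in units where the Boltzmann constant equals 1. Flipping a spin $s \in \{-1,+1\}$ whose four neighbors sum to $\sigma \in \{-4,-2,0,2,4\}$ changes the energy by
\[
\Delta E = 2 s \,( J \sigma + H ),
\]
and the code flips the spin with probability
\[
p = \frac{e^{-\Delta E/T}}{1+e^{-\Delta E/T}} = \frac{1}{1+e^{\Delta E/T}},
\]
which is stored at position
\[
\mbox{\tt index} = \frac{\sigma+4}{2} + 5\,\frac{s+1}{2} \in \{0,1,\ldots,9\}.
\]
For example, with the values $J=1$, $H=0$, $T=1$ set in the code, a spin aligned with all four neighbors ($s\sigma = 4$) has $\Delta E = 8$ and flips with probability $1/(1+e^{8}) \approx 3.4 \times 10^{-4}$.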

1942: \end{document}

1943: % ----------------------------------------------------------------\\

1944: