cs0502039/cell.tex
1: \documentclass[12pt]{article}
2: \usepackage{graphicx}
3: \usepackage{amsfonts}
4: \usepackage{latexsym}
5: \usepackage{amsmath}
6: \newtheorem{theorem}{theorem}
7: %\usepackage{fancyhdr}
8: 
9: % ----------------------------------------------------------------
10: \voffset = -50pt
11: \textwidth 6.5in
12: \textheight 8.8in
13: \topmargin 0.25in
14: \oddsidemargin -0.1in
15: \evensidemargin 0in
16: 
17: % ----------------------------------------------------------------
18: \makeatletter
19: \@addtoreset{figure}{section}
20: \def\thefigure{\thesection.\@arabic\c@figure}
21: \@addtoreset{table}{section}
22: \def\thetable{\thesection.\@arabic\c@table}
23: 
24: \def\@sect#1#2#3#4#5#6[#7]#8{\ifnum #2>\c@secnumdepth
25:      \def\@svsec{}\else
26:      \refstepcounter{#1}\edef\@svsec{\csname the#1\endcsname.\hskip .75em
27: }\fi
28:      \@tempskipa #5\relax
29:       \ifdim \@tempskipa>\z@
30:         \begingroup #6\relax
31:           \@hangfrom{\hskip #3\relax\@svsec}{\interlinepenalty \@M #8\par}%
32:         \endgroup
33:        \csname #1mark\endcsname{#7}\addcontentsline
34:          {toc}{#1}{\ifnum #2>\c@secnumdepth \else
35:                       \protect\numberline{\csname the#1\endcsname}\fi
36:                     #7}\else
37:         \def\@svsechd{#6\hskip #3\@svsec #8\csname #1mark\endcsname
38:                       {#7}\addcontentsline
39:                            {toc}{#1}{\ifnum #2>\c@secnumdepth \else
40:                              \protect\numberline{\csname the#1\endcsname}\fi
41:                        #7}}\fi
42:      \@xsect{#5}}
43: % put a period after theorem and theorem-like numbers
44: \def\@begintheorem#1#2{\it \trivlist \item[\hskip \labelsep{\bf #1\ #2.}]}
45: \def\section{\@startsection {section}{1}{\z@}{-3.5ex plus -1ex minus
46:  -.2ex}{2.3ex plus .2ex}{\normalsize\bf}}
47: 
48: %\pagestyle{myheadings}
49: %\thispagestyle{empty}
50: 
51: %\markright{\sc the electronic journal of combinatorics
52: %(2000),\#Rxx\hfill} \thispagestyle{empty}
53: % ----------------------------------------------------------------
54: \begin{document}
55: 
56: \title{Efficient Parallel Simulations of \\
57: Asynchronous Cellular Arrays}
58: %uncomment to remove date
59: \date{} 
60: \maketitle
61: 
62: \begin{center}
63: %\small
64: \author{Boris D. Lubachevsky\\
65: {\em bdl@bell-labs.com}\\
66: Bell Laboratories\\
67: 600 Mountain Avenue\\
68: Murray Hill, New Jersey}
69: \end{center}
70: 
71: \setlength{\baselineskip}{0.995\baselineskip}
72: \normalsize
73: \vspace{0.5\baselineskip}
74: \vspace{1.5\baselineskip}
75: %\end{center}
76: 
77: % ----------------------------------------------------------------
78: \begin{abstract}
79: A definition for a class of asynchronous cellular arrays
80: is proposed.
81: An example of such asynchrony would be
82: independent Poisson arrivals of cell iterations.
83: The Ising model in the continuous time formulation of Glauber
84: falls into this class.
85: Also proposed are efficient parallel algorithms for
86: simulating these asynchronous cellular arrays.
87: In the algorithms, one or several cells are assigned to a processing
88: element (PE);
89: local times for different PEs can be different.
90: Although the standard serial algorithm by
91: Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller
92: can simulate such arrays,
93: it is usually
94: believed to be without an efficient parallel counterpart.
95: However, the proposed parallel algorithms
96: contradict this belief,
97: proving to be
98: both efficient
99: and able to perform the same task
100: as the standard algorithm.
101: The results of experiments with the new algorithms
102: are encouraging:
103: the speed-up is greater than 16
104: using 25 PEs on a shared memory MIMD
105: bus computer,
106: and greater than 1900
107: using $2^{14}$ PEs on a
108: SIMD computer.
109: The algorithm by
110: Bortz, Kalos, and Lebowitz
111: can be incorporated
112: in the proposed parallel algorithms,
113: further contributing to speed-up.
114: \end{abstract}
115: % ----------------------------------------------------------------
116: \section{Introduction}\label{sec:intro}
117: \hspace*{\parindent} 
118: Simulation is inevitable
119: in studying the evolution
120: of complex cellular systems.
121: Large cellular array simulations might require long runs
122: on a serial computer.
123: Parallel processing,
124: wherein each cell or a group of cells
125: is hosted by a separate processing element (PE),
126: is a feasible method to speed up the runs.
127: The strategy of a parallel simulation
128: should depend on whether the
129: simulated system is synchronous
130: or asynchronous.
131: 
132: A {\em synchronous} system
133: evolves in discrete time $t=0,1,2,\ldots$.
134: The state of a cell at $t+1$
135: is determined by the state of the cell and its neighbors
136: at $t$
137: and may explicitly depend
138: on $t$ and the result of a random experiment.
139:   
140: An obvious and correct way to simulate
141: the system synchrony using a parallel processor
142: is simply to mimic it by the executional synchrony.
143: The simulation is arranged in rounds
144: with
145: one round corresponding to one time step
146: and with
147: no PE processing state changes of its cells for time $t+1$
148: before all PEs have processed state changes of their cells
149: for time $t$.
150:   
151: An {\em asynchronous} system evolves in continuous time.
152: State changes at different cells occur
153: asynchronously at unpredictable random times.
154: Here two questions should be answered:
155: (A) How to specify the asynchrony precisely?
156: and (B) How to carry out the parallel simulations
157: for the specified asynchrony?
158:   
159: Unlike the synchronous case,
160: simple mimicry does not work well
161: in the asynchronous case.
162: When Geman and Geman \cite{GG}, for example,
163: employ executional {\em physical} asynchrony
164: (introduced by different speeds of different PEs)
165: to mimic the model asynchrony,
166: the simulation becomes irreproducible
167: with its results depending on executional timing.
168: Such dependence may be tolerable in tasks
169: other than simulation
170: (\cite{GG} describes one such task,
171: another example is given in \cite{LM}).
172: In the task of simulation, however, it is
173: a serious shortcoming as seen in the following example.
174: 
175: Suppose  a simulationist,
176: after observing the results of a program run,
177: wishes to look closer at a certain phenomenon
178: and inserts an additional `print' statement
179: into the code.
180: As a result of the insertion,
181: the executional timing changes
182: and the phenomenon under investigation vanishes.
183: 
184: Ingerson and Buvel \cite{INBUV} and
185: Hofmann \cite{HOF}
186: propose various reproducible
187: computational
188: procedures to simulate asynchronies
189: in cellular arrays.
190: However, no uniform principle has been proposed,
191: and no special attention to developing
192: parallel algorithms has been paid.
193: It has been observed that the
194: resulting cellular patterns
195: may depend on the computational
196: procedure \cite{INBUV}.
197: 
198: The two main results of this paper are:
199: (I) a definition
200: of a natural  class of asynchronies
201: that can be associated with
202: cellular arrays
203: and
204: (II) efficient parallel algorithms to simulate
205: systems in this class.
206: The following properties specify
207: the {\em Poisson asynchrony},
208: a most common
209: member in the introduced class:
210: \\
211: \\   
212: $~~~$Arrivals
213: for a particular cell
214: form a Poisson point process.
215: \\  
216: $~~~$Arrival processes for different cells are independent.
217: \\   
218: $~~~$The arrival rate
219: is the same, say $\lambda$,
220: for each cell.
221: \\
222: $~~~$When there is an arrival,
223: the state of the cell
224: instantaneously changes;
225: the new state is computed
226: based on the states of the cell and its neighbors
227: just before the change
228: (in the same manner as in the synchronous model).
229: The new state may be equal to the old one.
230: \\
231: $~~~$The time of arrival
232: and a random experiment may be involved in the computation.
233: \\
234: 
235: A familiar example of a cellular system with the Poisson asynchrony
236: is the Ising model \cite{ISING}
237: in the continuous time formulation of Glauber \cite{GL}.
238: In this model
239: a cell configuration is defined
240: by the spin variables $s(c)=\pm 1$
241: specified at the cells $c$ of a two or three dimensional
242: array.
243: When there is an arrival at a cell $c$,
244: the spin $s(c)$ is changed to $-s(c)$ with probability $p$.
245: With probability $1~-~p$,
246: the spin $s(c)$ remains unchanged.
247: The probability $p$
248: is determined
249: using the values of $s(c)$ and neighbors $s(c')$ just before
250: the update time.
251: 
252: It is instructive to review the
253: computational procedures for Ising simulations.
254: First, the Ising simulationists realized that the standard procedure by
255: Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller \cite{MRRTT}
256: could be applied.
257: In this procedure, the evolution of the configuration
258: is simulated as a sequence of one-spin updates:
259: Given a configuration,
260: define the next configuration by choosing a cell $c$
261: uniformly at random and changing or not changing the spin
262: $s(c)$ to $-s(c)$ as required.
263: In the original standard procedure time is discrete.
264: Time continuity could have been simply introduced
265: by letting
266: the consecutive arrivals form
267: a Poisson process with rate $\lambda N$,
268: where $N$ is the total number
269: of spins (cells) in the system.
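
The equivalence between this serial procedure with rate $\lambda N$
and independent per-cell arrivals with rate $\lambda$
can be checked in a few lines.
The following is only a sketch;
the waiting times $\tau_1 , \ldots , \tau_N$ are notation introduced here for illustration.
Let $\tau_i$ be the waiting time until the next arrival at cell $i$,
where the $\tau_i$ are independent and exponentially distributed with rate $\lambda$.
Then
\[
P \left( \min_{1 \leq i \leq N} \tau_i > x \right)
= \prod_{i=1}^{N} P( \tau_i > x ) = e^{- \lambda N x},
\qquad
P \left( \tau_j = \min_{1 \leq i \leq N} \tau_i \right) = \frac{1}{N} ,
\]
so the next arrival in the whole array comes after an exponential time with rate $\lambda N$
and falls on a uniformly chosen cell.
Because the exponential distribution is memoryless,
the same argument applies again after each arrival.
For instance, with $\lambda = 1$ and $N = 100$,
consecutive arrivals in the array are $1/( \lambda N ) = 0.01$ time units apart on average.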
270:   
271: The problem of long simulation runs became immediately apparent.
272: Bortz, Kalos, and Lebowitz \cite{BKL}
273: developed a serial algorithm (the BKL algorithm)
274: which avoids processing unsuccessful
275: state change attempts,
276: and reported up to a 10-fold speed-up over the 
277: straightforward implementation of the
278: standard model.
279: Ogielski \cite{OGI} built special purpose hardware
280: for speeding up the processing.
281: 
282: The BKL algorithm is serial.
283: Attempts were made 
284: to speed up the Ising simulation by parallel
285: computations 
286: (Friedberg and Cameron \cite{FC}, Creutz \cite{CR}).
287: However, in these computations the original Markov chain
288: of the continuous time Ising model
289: was modified to satisfy the computational procedure.
290: The modifications do not affect the equilibrium
291: behavior of the chain,
292: and as such are acceptable
293: if one studies only the equilibrium.
294: In the cellular models, however,
295: the transient behavior is also of interest,
296: and no model revision should be done.
297: 
298: This paper presents
299: efficient methods for parallel simulation
300: of the continuous time asynchronous cellular arrays
301: without changing the model or type of asynchrony in favor
302: of the computational procedure.
303: The methods
304: promise unlimited
305: speed-up when the array and the parallel
306: computer are sufficiently large.
307: For the Poisson asynchrony case, 
308: it is also shown how 
309: the BKL algorithm can be incorporated,
310: further contributing to speed-up.
311: 
312: For the Ising model,
313: the presented algorithms can be viewed
314: as exact parallel counterparts
315: to the standard algorithm by Metropolis et al. 
316: The latter has been known and
317: believed to be inherently serial since 1953.
318: Yet, the presented algorithms are parallel, efficient, and fairly simple.
319: The ``conceptual level'' codes are rather short
320: (see Figures~\ref{fig:a1c1pe}, 
321: \ref{fig:s1c1pe}, 
322: \ref{fig:amcgen}, 
323: \ref{fig:amcpoi}, 
324: %$AL$.4, $AL$.6 and $AL$.7
325: and
326: \ref{fig:genout}).
328: An implementation in a real programming language 
329: given in the Appendix
330: is longer, of course,
331: but still rather simple.
332: 
333: This paper is organized as follows:
334: Section \ref{sec:model} presents 
335: a class of asynchronies
336: and a comparison with other published proposals.
337: Then Section \ref{sec:algo} describes the new algorithms 
338: on the conceptual level.
339: While the presented algorithms are simple,
340: there is no simple theory which predicts
341: speed-up of these algorithms for 
342: cellular arrays and parallel processors
343: of large sizes.
344: Section \ref{sec:perf} contains a simplified computational
345: procedure which predicts speed-ups faster than it takes
346: to run an actual parallel program.
347: The predictions made by this
348: procedure are compared with actual runs
349: and appear to be rather accurate.
350: The procedure predicts speed-up of more than 8000
351: for the simulation of a $10^5 \times 10^5$ 
352: Poisson asynchronous cellular array in parallel
353: by $10^4$ PEs.
354: Actual speed-ups obtained thus far were:
355: more than 16 on 25 PEs of the
356: Balance (TM)
357: computer and more than 1900
358: on $2^{14}$ PEs of the Connection Machine (R).
359: \footnotetext{ 
360: Connection Machine is a registered trademark of Thinking Machines Corporation
361: \\
362: Balance is a trademark of Sequent Computer Systems, Inc.}
363: 
364: \section{Model}\label{sec:model}
365: \hspace*{\parindent} 
366: Time $t$ is continuous.
367: Each cell $c$ has a state $s=s(c)$.
368: At random times, a cell is granted a chance
369: to change the state.
370: The changes, if they occur,
371: are instantaneous events.
372: Random attempts to change the state of a cell
373: are independent of
374: similar attempts for other cells.
375: 
376: The general model consists
377: of two functions: 
378: {\em time\_of\_next\_arrival ()}
379: and {\em next\_state ()}.
380: They are defined as follows:
381: given the old state of the cell
382: and the states of the neighbors just before time $t$,
383: $s_{t-0} (neighbors (c))$,
384: the next\_state 
385: $s(c)=s_t (c)$ is
386: \begin{equation}
387: \label{newst}
388: s_t (c) = next\_state~(c,~s_{t-0} (neighbors(c)),~ \omega ,~t ),
389: \end{equation}
390: where the possibility 
391: $s_t (c)=s_{t-0} (c)$ is not excluded;
392: and
393: the time $next\_t$ of the next arrival 
394: is 
395: \begin{equation}
396: \label{newti}
397: next\_t = time\_of\_next\_arrival~(c, s_{t-0} (neighbors(c)),~ \omega ,~t),
398: \end{equation}
399: where always $next\_t ~ > ~ t$.
400: 
401: In \eqref{newst} and \eqref{newti}, 
402: $\omega$ denotes the result of a random experiment, 
403: e.g., coin tossing,
404: $s (neighbors(c))$ denotes the indexed set of states 
405: of all the neighbors of $c$ including $c$ itself.
406: Thus,
407: if $neighbors(c)=\{ c, c_1, c_2, c_3, c_4\}$,
408: then 
409: $s(neighbors(c)) = 
410: (s(c), s(c_1 ), s(c_2 ), s(c_3 ), s(c_4 ))$.
411: Subscript $t-0$ expresses the idea of `just before $t$',
412: e.g.,
413: $a_{t-0} = \lim_{\tau \rightarrow t,~\tau < t} ~ a( \tau )$.
414: According to \eqref{newst}, the value of $s(c)$
415: instantaneously changes at time $t$ 
416: from $s_{t-0} (c)$ to $s_t (c)$.
417: At time $t$, the value of $s(c)$ is already new.
418: The `just before' feature resolves
419: a possible ambiguity
420: if two neighbors attempt to change their states
421: at the same simulated time.
422: 
423: Compare now the class of asynchronies 
424: defined by \eqref{newti} with the ones proposed in the literature:
425:     
426: \ \ \ (A) Model 1 in \cite{INBUV} reads: 
427: ``...the cells iterate randomly, one at a time.'' 
428: Let $p_c$ be the probability that cell $c$ is chosen.
429: Then the following choice of law \eqref{newti} yields this model
430: \[
431: time\_of\_next\_arrival~(c,~\omega ,~t)= t~-~ \frac {1} {p_c}   \ln  r(c,t, \omega ),
432: \]
433: where $r(c, t, \omega )$ is a random number uniformly distributed on (0,1),
434: and $\ln$ is the natural logarithm,
435: $\ln (x) = {\log}_e (x)$.
436: For $p_{c_1} = p_{c_2} = \ldots = \lambda$,
437: the asynchrony was called the {\em Poisson asynchrony} in Section~\ref{sec:intro};
438: it coincides with the one defined
439: by the standard model \cite{MRRTT},
440: and by Glauber's model \cite{GL} for the Ising spin simulations.
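
As a quick check of the law in (A)
(a sketch that uses only the stated distribution of $r$):
since $r = r(c,t, \omega )$ is uniformly distributed on $(0,1)$, for any $x \geq 0$
\[
P( next\_t - t > x )
= P \left( - \frac{1}{p_c} \ln r > x \right)
= P \left( r < e^{- p_c x} \right)
= e^{- p_c x} ,
\]
so the time between consecutive arrivals at cell $c$
is exponentially distributed with rate $p_c$ and mean $1/p_c$.
With the hypothetical values $p_c = 2$ and $r = 0.5$, for example,
$next\_t - t = ( \ln 2 )/2 \approx 0.347$.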
441:    
442: \ \ \ (B) Model 2 in \cite{INBUV} assigns
443: ``each cell a period according to a Gaussian distribution...
444: The cells iterate one at a time each having its own definite
445: period.''
446: While it is not quite clear from \cite{INBUV} 
447: what is meant by a ``definite period''
448: (is it fixed for a cell over a simulation run?),
449: the following choice of law \eqref{newti} yields this model
450: in a liberal interpretation:
451: \[
452: time\_of\_next\_arrival~(c,~\omega ,~t)= t~+~ {P_c}^{-1} (r( \omega )),
453: \]
454: where $P^{-1} (y)=x$ if $P(x)=y$,
455: and $P_c (x)$ is the cumulative function for the
456: Gaussian probability distribution
457: with mean $m_c~>~0$ 
458: and variance ${\sigma_c}^2$.
459: The probability of
460: $next\_t < t$ is small when $\sigma_c \ll m_c$
461: and is ignored in \cite{INBUV}
462: if this interpretation is meant.
463: In a less liberal interpretation,
464: $\sigma_c \equiv 0$ for all $c$,
465: and $m_c$ is itself 
466: random and distributed according to the Gaussian law.
467: This case is even easier to represent in terms of 
468: model \eqref{newti} than the previous one:
469: $time\_of\_next\_arrival~(c,~ \omega ,~t)= t + m_c ( \omega )$.
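
Returning to the first, liberal interpretation,
the probability that is ignored there can be quantified.
Writing $\Phi$ for the standard normal cumulative distribution function
(notation introduced here, not used in \cite{INBUV}),
\[
P( next\_t \leq t ) = P_c (0) = \Phi \left( - \frac{m_c}{\sigma_c} \right) .
\]
For a hypothetical ratio $m_c / \sigma_c = 5$ this is about $2.9 \times 10^{-7}$ per arrival,
whereas for $m_c / \sigma_c = 3$ it is about $1.3 \times 10^{-3}$,
which may no longer be negligible in a long run.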
470:    
471: \ \ \ (C) Model \eqref{newti} trivially extends to a synchronous simulation,
472: where the initial state changes arrive at time 0 and 
473: then always $next\_t - t$ is identical to 1.
474: The first model in \cite{HOF} is
475: ``to choose a number of cells at random and change
476: only their values before continuing.''
477: This is a variant of synchronous simulation;
478: it is substantially different from both models (A) and (B) above.
479: In (A) and (B),
480: the probability is 1 that
481: no two neighbors attempt to change their states at the same time.
482: In contrast, in this model many neighboring cells 
483: are simultaneously changing their values.
484: How the cells are chosen for update
485: is not precisely specified in \cite{HOF}.
486: One way to choose the cells is to assign a probability weight
487: $p_c$ for cell $c$, $c=1,2,\ldots,N$,
488: and to attempt to update cell $c$ 
489: at each iteration, 
490: with probability $p_c$,
491: independent of any other decision.
492: Such a method 
493: conforms with the law \eqref{newti}
494: because the method is local:
495: a cell does not need to know 
496: what is happening at distant cells.
497: The second model in \cite{HOF}
498: changes states of a
499: fixed number $A$ of randomly chosen cells
500: at each iteration.
501: If $A > 1$,
502: this method is not local
503: and does not conform with the law \eqref{newti}.
504: 
505: \section{Algorithms}\label{sec:algo}
506: \hspace*{\parindent} 
507: {\bf Elimination of $\omega$}.
508: Deterministic computers 
509: represent randomness by using
510: pseudo-random number generators.
511: Thus, equations \eqref{newst} and \eqref{newti} are substituted
512: in the computation by equations
513: \begin{equation}
514: \label{news0t}
515: s_t (c) = next\_state~(c,~s_{t-0} (neighbors(c)),~t ),
516: \end{equation}
517: and
518: \begin{equation}
519: \label{newt0i}
520: next\_t = time\_of\_next\_arrival~(c, s_{t-0} (neighbors(c)),~t),
521: \end{equation}
522: respectively,
523: which do not contain the parameter of randomness $\omega$.
524: 
525: This elimination of $\omega$ symbolizes
526: an obvious but important 
527: difference between the simulated system and the simulator:
528: In the simulated system, 
529: the observer, being a part of the system,
530: does not know in advance
531: the time of the next arrival.
532: In contrast, the simulationist who is,
533: of course, not a part of the simulated system,  
534: can know the time of the next arrival
535: before the next arrival is processed.
536: 
537: For example,
538: it is not known in advance when the next event
539: from a Poisson stream arrives.
540: However, in the simulation,
541: the time $next\_t$ of the next arrival
542: is obtained in a deterministic manner, 
543: given the time $t$ of the previous arrival:
544: \begin{equation}
545: \label{newt1i}
546: next\_t = t ~-~ \frac{1}{\lambda} {\log}_e ( r(n(t))),
547: \end{equation}
548: where $\lambda$ is the rate,
549: $r(n)$ is the $n$-th pseudo-random number in the sequence
550: uniformly distributed on $(0,1)$,
551: and $n(t)$ is the invocation counter. 
552: Thus, after the previous arrival is processed,
553: the time of the next arrival is already known.
554: If needed, the entire sequence of arrivals
555: can be precomputed and stored in a table for later
556: use in the simulation,
557: so that all future arrival times 
558: would be known in advance.
559: \\
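
A short numerical illustration of \eqref{newt1i} may be useful.
The rate and the two pseudo-random values below are hypothetical,
chosen only to make the arithmetic easy to follow.
Take $\lambda = 1$, $r(1) = 0.5$, $r(2) = 0.1$, and let the first interval start at $t = 0$.
Then the first two arrival times are
\[
0 - \ln 0.5 \approx 0.693 ,
\qquad
0.693 - \ln 0.1 \approx 0.693 + 2.303 = 2.996 .
\]
Both values are fixed as soon as the generator and its seed are fixed,
before any state change is simulated,
which is what allows the table of future arrivals mentioned above to be precomputed.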
560: 
561: {\bf Asynchronous one-cell-per-one-PE algorithm}.
562: The algorithm in Figure~\ref{fig:a1c1pe}
563: is the shortest of those presented in this paper.
564: 
565: To understand this code, 
566: imagine a parallel computer which consists
567: of a number of PEs running concurrently.
568: One PE is assigned to simulate one cell.
569: The PE which is assigned to simulate cell $c_0$,
570: PE$c_0$, executes the code in Figure~\ref{fig:a1c1pe} with $c=c_0$.
571: The PEs are interconnected by the network
572: which matches the topology of the cellular array.
573: A PE can receive information from its neighbors.
574: PE$c$ maintains state $s(c)$
575: and local simulated time $t(c)$.
576: Variables $t(c)$ and $s(c)$ are visible
577: (accessible for reading only) by the neighbors of $c$.
578: Time $t(c)$ has no connection with the physical
579: time in which the parallel computer runs the program
580: except that $t(c)$ may not decrease 
581: when the physical time increases.
582: At a given physical instant of simulation, 
583: different cells $c$ may have different values of $t(c)$.
584: Value $end\_time$ is a constant which is known to all PEs.
585: 
586: The algorithm in Figure~\ref{fig:a1c1pe}
587: is very asynchronous:
588: different PEs can
589: execute different steps
590: concurrently
591: and can run
592: at different speeds.
593: A statement `wait\_until~~{\em condition}', 
594: like the one at Step 2 in Figure~\ref{fig:a1c1pe},
595: does not imply
596: that the {\em condition} must be detected immediately after
597: it occurs.
598: To detect the {\em condition} 
599: at Step 2
600: involving local times 
601: of neighbors
602: a PE can poll 
603: its neighbors
604: one at a time,
605: in any order, 
606: with arbitrary delays,
607: and
608: without any respect to 
609: what these PEs are doing meanwhile.
610: \\
611: \begin{figure}
612: \centering
613: \fbox{
614: \begin{minipage} {12.8cm}
615: \begin{enumerate}
616: \item while $t(c)~<~end\_time$\\
617: \hspace*{0.2in}
618: \{
619: \item~~~~~~wait\_until $t(c)~\leq~ \min_{c'~\in~neighbors(c)} t(c')$ ;
620: \item~~~~~~$s(c)~\leftarrow~ next\_state~(c,~ s (neighbors (c)),~t(c))$ ;
621: \item~~~~~~$t(c)~\leftarrow~time\_of\_next\_arrival~(c,~s (neighbors (c)),~t(c))$\\
622: \hspace*{0.2in}
623: \}
624: \end{enumerate}
625: 
626: \end{minipage}}
627: \caption{Asynchronous one-cell-per-one-PE algorithm}
628: \label{fig:a1c1pe}
629: \end{figure}
630: Despite being seemingly almost chaotic,
631: the algorithm in Figure~\ref{fig:a1c1pe}
632: is free from deadlock.
633: Moreover, it
634: produces a unique simulated trajectory
635: which is independent of executional
636: timing, 
637: provided that:
638: \\
639: 
640: (i) 
641: for the same cell, the pseudo-random sequence is always the same,
642: \\
643: 
644: (ii) no two neighboring arrival times are equal.
645: \\
646: 
647: Freedom from deadlock follows from the fact that the cell, 
648: whose local time is minimal over the entire array,
649: is always able to make progress.
650: (This guaranteed worst-case performance
651: is substantially exceeded
652: in an average case.
653: See Section~\ref{sec:perf}.)
654: 
655: The uniqueness of the trajectory can be seen as follows.
656: By (ii), 
657: a cell $c$ passes the test at Step 2 only if its local time $t(c)$
658: is smaller than the local time $t(c')$
659: of any of its neighbors $c'$.
660: If this is the case, then no neighbor $c'$
661: is able to pass the test at Step 2 before
662: $c$ changes its time at Step 4.
663: This means that processing of the update by $c$ is safe:
664: no neighbor changes its state or time before $c$ completes
665: the processing.
666: By (i), functions $next\_state( )$ and $time\_of\_next\_arrival( )$
667: are independent of the run.
668: Therefore, 
669: in each program run,
670: no matter what the neighbors of $c$
671: are doing or trying to do,
672: the next arrival time and state for $c$ are always the same.
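
A small hand trace may make the argument concrete.
All numbers are hypothetical.
Consider three cells in a row, $c_1$, $c_2$, $c_3$,
where $c_2$ is a neighbor of both $c_1$ and $c_3$,
while $c_1$ and $c_3$ are not neighbors of each other.
\begin{center}
\begin{tabular}{lccc}
 & $t(c_1 )$ & $t(c_2 )$ & $t(c_3 )$ \\
initially & 0.7 & 0.3 & 0.5 \\
after $c_2$ is processed & 0.7 & 1.1 & 0.5 \\
after $c_1$ and $c_3$ are processed & 1.6 & 1.1 & 0.9 \\
\end{tabular}
\end{center}
Initially only $c_2$ passes the test at Step 2;
$c_1$ and $c_3$ wait because $0.7 > 0.3$ and $0.5 > 0.3$.
Once $t(c_2 )$ has advanced to 1.1,
both $c_1$ and $c_3$ pass the test.
They are not neighbors,
so their PEs may process them concurrently or in either order;
in every case each of them reads the state $s(c_2 )$ produced by the arrival at time 0.3,
and the resulting states and times are the same.
In the last row $c_3$ can proceed again, while $c_2$ must wait for it.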
673: 
674: It is now clear why assumption (ii) is needed.
675: If (ii) is violated by two cells $c$ and $c'$ which are neighbors,
676: then the algorithm in Figure~\ref{fig:a1c1pe}
677: does not exclude concurrent updating by $c$ and $c'$.
678: Such concurrent updating
679: introduces an indeterminism
680: and inconsistency.
681: A scenario
682: of the inconsistency
683: can be as follows:
684: at Step 3
685: the {\em old} value of $s(c')$ is used
686: to update state $s(c)$,
687: but 
688: the immediately following Step 4 uses the
689: {\em new} value of $s(c')$ 
690: to update time $t(c)$. 
691: 
692: In practice, the algorithm in Figure~\ref{fig:a1c1pe} is safe,
693: when $next\_t(c)-t(c)$ for different $c$ are independent 
694: random samples from a distribution with a continuous density,
695: like an exponential distribution.
696: In this case, (ii) holds with probability 1.
697: Unless the pseudo-random number generators are faulty,
698: one may imagine only one reason for violating (ii):
699: finite precision of computer representation of real numbers.
700: \\
701: 
702: {\bf Synchronous one-cell-per-one-PE algorithm}.
703: If (ii) can be violated with a positive probability
704: (if $t$ takes on only integer values,
705: for example),
706: then the errors might not be tolerable.
707: In this case the synchronous algorithm in
708: Figure~\ref{fig:s1c1pe} should be used.
709: 
710: Observe that while the algorithm in Figure~\ref{fig:s1c1pe} 
711: is synchronous,
712: it is able to simulate correctly
713: both synchronous and asynchronous systems.
714: Two main additions 
715: in the algorithm in Figure~\ref{fig:s1c1pe}
716: are:
717: private variables $new\_s$ and $new\_t$ 
718: for temporary storage of updated $s$ and $t$,
719: and synchronization barriers `synchronize'.
720: When a PE hits a `synchronize' statement it must wait until
721: all the other PEs hit a `synchronize' statement;
722: then it may resume.
723: Two dummy synchronizations at Steps 9 and 10 are executed
724: by idling PEs in order to match synchronizations
725: at Steps 5 and 8 executed by non-idling PEs.
726: 
727: When (ii) is violated,
728: the synchronous algorithm avoids the ambiguity and indeterminism
729: (which in this case are possible in the asynchronous algorithm)
730: as follows:
731: in processing concurrent updates of two neighbors $c$ and $c'$ 
732: for the same simulated time $t=t(c)=t(c')$,
733: first, $c$ and $c'$ read states $s_{t-0}$ and times $t$ of each other
734: and compute their private $new\_s$'s and $new\_t$'s 
735: (Steps 3 and 4 in Figure~\ref{fig:s1c1pe});
736: then, after the synchronization barrier at Step 5,
737: $c$ and $c'$ write their states and times at Steps 6 and 7,
738: thus making sure that no write 
739: interferes with a read.
740: \\
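
What the first barrier buys can be seen in a deliberately simple example.
The rule used here is hypothetical and serves only as an illustration:
a cell adopts the state its single neighbor had just before the update.
Let two neighbors $c$ and $c'$ have the same arrival time $t(c)=t(c')=3$.
\[
\begin{array}{lcc}
 & s(c) & s(c') \\
\mbox{just before $t=3$} & +1 & -1 \\
\mbox{synchronous algorithm} & -1 & +1 \\
\mbox{no barrier, $c$ writes first} & -1 & -1 \\
\mbox{no barrier, $c'$ writes first} & +1 & +1
\end{array}
\]
With the barrier, both cells read the old states before either writes,
so the states are exchanged in every run.
Without it, the second cell to execute reads a value that the first one has already overwritten,
and the outcome depends on executional timing.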
741: \begin{figure}
742: \centering
743: \fbox{
744: \begin{minipage} {14.8cm}
745: \begin{enumerate}
746: \item while $t(c)~<~end\_time$\\
747: \hspace*{0.2in}
748: \{
749: \item~~~~~~if $t(c)~\leq~ \min_{~c'\in neighbors(c)} ~t(c')$ then\\
750: \hspace*{0.2in}
751: ~~~~~~~~\{
752: \item~~~~~~~~~~~~~~$new\_s ~\leftarrow~ ~next\_state~(c,~s (neighbors (c)),~t(c))$ ;
753: \item~~~~~~~~~~~~~~$new\_t~\leftarrow~ ~time\_of\_next\_arrival~(c,~s (neighbors (c)),~t(c))$ ;
754: \item~~~~~~~~~~~~~~synchronize;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~/* barrier 1 */
755: \item~~~~~~~~~~~~~~$s(c)~~\leftarrow~~new\_s$;
756: \item~~~~~~~~~~~~~~$t(c)~~\leftarrow~~new\_t$;
757: \item~~~~~~~~~~~~~~synchronize~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~/* barrier 2 */\\
758: \hspace*{0.2in}
759: ~~~~~~~~\}\\
760: \hspace*{0.2in}
761: \ \ else~\{
762: \item~~~~~~~~~~~~~~synchronize;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~/* barrier 1 */
763: \item~~~~~~~~~~~~~~synchronize~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~/* barrier 2 */\\
764: \hspace*{0.2in}
765: ~~~~~~~~\}
766: \\
767: \hspace*{0.1in}
768: \ \ \}
769: \end{enumerate}
770: 
771: \end{minipage}}
772: \caption{Synchronous one-cell-per-one-PE algorithm}
773: \label{fig:s1c1pe}
774: \end{figure}
775: \\
776: 
777: {\bf Aggregation.}
778: In the two algorithms presented above,
779: one PE hosts only one cell.
780: Such an arrangement may be wasteful
781: if the communication between PEs dominates
782: the computation internal to a PE.
783: A more efficient arrangement is
784: to assign several cells to one PE.
785: For concreteness, 
786: consider a two-dimensional $n \times n$
787: array with periodic boundary conditions.
788: Let $n$ be a multiple of $m$ and $(n/m)^2$
789: PEs be available.
790: PE$C$ carries $m \times m$ subarray $C$,
791: where $C=1,2,\ldots,(n/m)^2$.
792: (Capital $C$ will be used without confusion to represent
793: both the subarray index and the set of cells $c$ the subarray
794: comprises, e.g., as in $c\in C$.)
795: A fragment of a square cellular array
796: in an example of such an aggregation
797: is represented in Figure~\ref{fig:aggr}$a$,
798: wherein $m=4$.
799: 
800: The neighbors of a cell carried by PE1 are cells carried
801: by PE2, PE3, PE4, or PE5.
802: PE1 has direct connections with these four PEs (Figure~\ref{fig:aggr}$b$).
803: Given cell $c$ in the subarray hosted by PE1,
804: one can determine with which neighboring PEs
805: communication is required 
806: in order to learn the states of the neighboring cells.
807: Let $W(c)$ be the set of these PEs.
808: Examples in Figure~\ref{fig:aggr}$a$ : $W(u)$ is empty,
809: $W(v)=$\{PE5\}, $W(w)=$\{PE3, PE4\}.
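
The size of $W(c)$ can be written down explicitly under one assumption,
namely the four-nearest-neighbor topology that Figure~\ref{fig:aggr} suggests;
other neighborhoods would change the counts.
Let $(i,j)$, $1 \leq i,j \leq m$, be the position of cell $c$ within its subarray
(local coordinates introduced here for illustration), with $m \geq 2$.
Then
\[
| W(c) | = [ \, i \in \{ 1,m \} \, ] + [ \, j \in \{ 1,m \} \, ] ,
\]
where $[ \cdot ]$ equals 1 if the condition holds and 0 otherwise.
Hence $(m-2)^2$ interior cells have an empty $W(c)$,
$4(m-2)$ edge cells have one PE in $W(c)$,
and the 4 corner cells have two.
For $m=4$ these counts are 4, 8, and 4;
cells $u$, $v$, and $w$ above are examples of the three kinds.
The fraction of interior cells, $(m-2)^2 / m^2$,
is $1/4$ for $m=4$ and about $0.88$ for $m=32$,
so larger subarrays need the states held by other PEs for a smaller share of their cells.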
810: 
811: \begin{figure}
812: \centering
813: \includegraphics*[width=5.8in]{AGGR.PS}
814: \caption{Aggregation:
815: $a$) mapping of cells to PEs,
816: $b$) the interconnection among the PEs which supports the neighborhood
817: topology among the cells
818: }
819: \label{fig:aggr}
820: \end{figure}
821: 
822: \begin{figure}
823: \centering
824: \fbox{
825: \begin{minipage} {13.8cm}
826: \begin{enumerate}
827: \item while $T(C) ~<~ end\_time$ 
828: \\
829: \hspace*{0.2in}
830: \{
831: \item~~~~~~select a cell $c$ in the subarray $C$ such that\\
832: \hspace*{0.2in}
833: ~$t(c)= \min_{~c' \in C} ~t(c')$ and assign $T(C)\leftarrow t(c)$;
834: \item~~~~~~wait\_until $T(C)~\leq~ \min_{~C' \in W(c)} ~T(C')$ ;
835: \item~~~~~~$s(c) \leftarrow next\_state~(c,~ s (neighbors (c)),~t(c))$ ;
836: \item~~~~~~$t(c) \leftarrow time\_of\_next\_arrival~(c,~s (neighbors (c)),~t(c))$\\
837: \hspace*{0.2in}
838: \}
839: \end{enumerate}
840: \end{minipage}}
841: \caption{Asynchronous many-cells-per-one-PE algorithm. General asynchrony}
842: \label{fig:amcgen}
843: \end{figure}
844: %
845: Figure~\ref{fig:amcgen} presents an aggregated variant of the algorithm
846: in Figure~\ref{fig:a1c1pe}.
847: PE$C$, which hosts
848: subarray $C$,
849: maintains the local time register $T(C)$.
850: PE$C_0$ simulates the evolution of its subarray
851: using the algorithm in Figure~\ref{fig:amcgen}
852: with $C=C_0$.
853: Each cell $c~\in~C$
854: is represented in the memory of PE$C$
855: by its current state $s(c)$ and its
856: next arrival time $t(c)$.
857: Note that, unlike in the one-cell-per-one-PE algorithm,
858: $t(c)$ does not represent the current local time of cell $c$.
859: Instead, the local times of all cells within subarray $C$ are the same, namely $T(C)$.
860: 
861: $T(C)$ moves from one $t(c)$ to another 
862: in the order of increasing value.
863: Three successive iterations of this algorithm
864: are shown in Figure~\ref{fig:timlin}, where the subarray $C$
865: consists of four cells: $C=\{1, 2, 3, 4 \}$.
866: Circles in Figure~\ref{fig:timlin} represent 
867: arrival points in the simulated time.
868: A crossed-out circle represents an arrival which
869: has just been processed,
870: i.e., Steps 3, 4, and 5 of Figure~\ref{fig:amcgen} have just been executed,
871: so that $T(C)$ has just taken on the value of the processed
872: old arrival time $t(c)$, 
873: while the $t(c)$ has taken on a new larger value.
874: This new value is pointed to by an arrow from $T(C)$ in Figure~\ref{fig:timlin}.
875: Clearly,
876: $t(c)~\geq~T(C)$ always holds if $c~\in~C$.
877: 
878: Local times $T(C)$ 
879: maintained by different PE$C$ might be different.
880: A wait at Step 3 cannot deadlock
881: the execution
882: since the PE$C$ whose $T(C)$ is the minimum over
883: the entire cellular array is always able to make progress.
884: 
885: \begin{figure}
886: \centering
887: \includegraphics*[width=6.2in]{TIMLIN.PS}
888: \caption{$T(C)$ slides along a sequence of $t(c)$'s 
889: in successive iterations of the aggregated algorithm}
890: \label{fig:timlin}
891: \end{figure}
892: 
893: Assuming property (ii) as above,
894: the algorithm 
895: correctly simulates the history
896: of updates.
897: The following example may serve as an informal proof of this statement.
898: Suppose PE1 is currently updating the state of cell $v$ 
899: (see Figure~\ref{fig:aggr}$a$)
900: and its local time is
901: $T_1$.
902: Since $W(v)=\{ PE5\}$,
903: this update is possible
904: because the local time of PE5, $T_5$, is currently
905: larger than $T_1$.
906: At present,
907: PE1 receives the state
908: of $x$ from PE5
909: in order to perform the update.
910: This state
911: is the state at time $T_5$,
912: i.e., in the future with respect to local time $T_1$.
913: However, the update is correct,
914: since 
915: the state of $x$ was the same at time $T_1$, 
916: as it is at time $T_5$.
917: 
918: Indeed, suppose  
919: the state of $x$ were to be changed 
920: at simulated local time $T$, 
921: $T_1 <  T <  T_5$.
922: At the moment when this change would have been processed by PE5,
923: the local time of PE1 would have been larger than $T$,
924: and $T$ would have been the local time of PE5.
925: After this processing has supposedly taken place,
926: the local time of PE1 should not decrease.
927: Yet at the present it is $T_1$,
928: which is smaller than $T$.
929: This contradiction proves that the state
930: of $x$ cannot in fact change
931: in the interval ($T_1, T_5$).
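
The informal argument above can be checked numerically.
The following Python fragment is a minimal sketch, not the implementation used in the experiments.
It serially emulates the algorithm in Figure~\ref{fig:amcgen} under assumptions that are not part of the original model:
a ring of cells, an XOR update rule, exponential inter-arrival times,
one private pseudo-random stream per cell,
and a loop that stops before processing any arrival at or after the end time.
Each PE is represented by the time of its pending arrival,
which plays the role of $T(C)$ after Step 2.

```python
import random

def simulate(n, block, end_time, seed=0):
    """Serially emulate the aggregated algorithm on a ring of n cells.

    Hypothetical set-up: PE p hosts cells p*block .. (p+1)*block - 1.
    """
    assert n % block == 0
    n_pe = n // block
    init = random.Random(seed)
    s = [init.randint(0, 1) for _ in range(n)]            # states s(c)
    # Assumption: every cell owns a fixed pseudo-random stream.
    rng = [random.Random(1000 * seed + c + 1) for c in range(n)]
    t = [rng[c].expovariate(1.0) for c in range(n)]       # next arrivals t(c)
    cells = [range(p * block, (p + 1) * block) for p in range(n_pe)]

    def T(p):
        # Time of the arrival pending in PE p (value of T(C) after Step 2).
        return min(t[c] for c in cells[p])

    active = True
    while active:
        active = False
        for p in range(n_pe):
            c = min(cells[p], key=t.__getitem__)          # Step 2
            if t[c] >= end_time:
                continue                                  # PE p has finished
            active = True
            left, right = (c - 1) % n, (c + 1) % n
            W = {q // block for q in (left, right)} - {p}  # the set W(c)
            if any(T(q) < t[c] for q in W):
                continue                                  # Step 3: must wait
            s[c] = s[left] ^ s[right]                     # Step 4 (XOR rule assumed)
            t[c] += rng[c].expovariate(1.0)               # Step 5
    return s
```

Under these assumptions, calling \verb|simulate(16, b, 5.0)| with block sizes $b=1,2,4,8,16$
should return the same final states,
because updates at neighboring cells are processed in the same order for every partition.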
932: 
933: In the example in Figure~\ref{fig:timlin},
934: only one $t(c)$ supplies $\min_{~c' \in C} t(c')$.
935: However, the algorithm in Figure~\ref{fig:amcgen}
936: at Step 2 says to select {\em a} cell,
937: not {\em the} cell.
938: This covers the unlikely situation of several cells having the same
939: minimum time.
940: If $next\_t(c) - t(c)$ for different $c$ are independent 
941: random samples from a distribution with a continuous density,
942: this case occurs with probability zero.
943: On the other hand, 
944: if several cells can, with positive probability,
945: update simultaneously,
946: a synchronous version of the aggregated algorithm should be used instead.
947: To eliminate indeterminism and inconsistency,
948: the latter would use
949: synchronization and intermediate storage
950: techniques.
951: These techniques were demonstrated in the algorithm in Figure~\ref{fig:s1c1pe}
952: and their discussion is not repeated here.
953: 
954: \begin{figure}
955: \centering
956: \fbox{
957: \begin{minipage} {12.8cm}
958: \begin{enumerate}
959: \item while $T(C)~<~end\_time$ \\
960: \hspace*{0.2in}
961: \{
962: \item~~~~~~select a cell $c$ in the subarray $C$ uniformly at random;
963: \item~~~~~~wait\_until $T(C) \leq \min_{~C' \in W(c)} ~T(C')$ ;
964: \item~~~~~~$s(c) \leftarrow next\_state~(c,~ s (neighbors (c)),~t(c))$ ;
965: \item~~~~~~$T(C) \leftarrow T(C)~-~ \frac{1}{\lambda \times number\_of\_cells\_in\_C} \ln r(C,n(T(C)))$\\
966: \hspace*{0.2in}
967: \}
968: \end{enumerate}
969: 
970: \end{minipage}}
971: \caption{Asynchronous many-cells-per-one-PE algorithm. Poisson asynchrony}
972: \label{fig:amcpoi}
973: \end{figure}
974: 
975: 
976: For an important special case of 
977: {\bf Poisson asynchrony in the aggregated algorithm},
978: the algorithm of Figure~\ref{fig:amcgen}
979: is rewritten in Figure~\ref{fig:amcpoi}.
980: This specialization capitalizes on the 
981: additive property of Poisson streams,
982: specifically, on the fact 
983: that the sum of $k$ independent Poisson streams
984: with rate $\lambda$ each
985: is a Poisson stream with rate $\lambda k$.
986: In the algorithm,
987: $k=number\_of\_cells\_in\_C$;
988: this $k$ is equal to $m^2$ in the special case of partitioning
989: into $m \times m$ subarrays.
990: Unlike the general algorithm of Figure~\ref{fig:amcgen},
991: in the specialization in Figure~\ref{fig:amcpoi}
992: individual streams
993: for different cells
994: are not maintained,
995: and future arrivals $t(c)$ for cells are not
996: individually computed.
997: Instead, a single cumulative stream is simulated
998: and cells are delegated randomly
999: to meet these arrivals.
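
The additive property can be illustrated for a single PE taken in isolation, so that Step 3 never waits.
The Python sketch below is only an illustration under that assumption.
The function name, its parameters, and the use of Python's standard generator are hypothetical choices.
For brevity the sketch draws the time of the next arrival first and then delegates a cell to it.

```python
import math
import random

def merged_poisson_arrivals(k, lam, end_time, seed=0):
    """Simulate one cumulative Poisson stream of rate lam*k for k cells.

    Returns the number of arrivals delegated to each cell before end_time.
    """
    rng = random.Random(seed)
    T = 0.0                      # local time T(C) of the PE
    hits = [0] * k
    while True:
        r = 1.0 - rng.random()   # uniform in (0, 1], avoids log(0)
        T -= math.log(r) / (lam * k)
        if T >= end_time:
            break
        hits[rng.randrange(k)] += 1   # a cell chosen uniformly at random
    return hits
```

With $k=16$, $\lambda=1$ and an end time of 200, the total number of arrivals should be close to $\lambda k \times 200 = 3200$,
and each cell should receive roughly 200 of them.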
1000: 
1001: At Step 5 in Figure~\ref{fig:amcpoi},
1002: $r(C, n(T(C)))$ is the $n(T(C))$-th pseudo-random
1003: number in the sequence uniformly distributed in (0,1).
1004: It follows from the notation
1005: that each PE has its own sequence.
1006: If this sequence is independent of the 
1007: run (which is condition (i) above) and
1008: if updates for neighboring cells never coincide in time
1009: (which is condition (ii) above), then this algorithm produces
1010: a unique reproducible trajectory.
1011: The same statement is also true for the algorithm in Figure~\ref{fig:amcgen}.
1012: However, 
1013: uniqueness provided by the algorithm in Figure~\ref{fig:amcpoi}
1014: is weaker than the one provided by the algorithm in Figure~\ref{fig:amcgen}: 
1015: if the same array is partitioned differently and/or executed
1016: with a different number of PEs,
1017: a trajectory produced by the algorithm in Figure~\ref{fig:amcpoi}
1018: may change;
1019: however, a trajectory produced by the algorithm in Figure~\ref{fig:amcgen}
1020: is invariant under
1021: such changes, provided that each cell $c$ uses its own
1022: fixed
1023: pseudo-random sequence.
1024: \\
1025: 
1026: {\bf Efficiency of aggregated algorithms}.
1027: Both many-cells-per-one-PE algorithms 
1028: in Figure~\ref{fig:amcgen} and Figure~\ref{fig:amcpoi}
1029: are more efficient than the
1030: one-cell-per-one-PE counterparts
1031: in Figure~\ref{fig:a1c1pe} and Figure~\ref{fig:s1c1pe}.
1032: This additional efficiency
1033: can be explained in the example
1034: of the square array, as follows:
1035: In the algorithms
1036: in Figure~\ref{fig:a1c1pe} and Figure~\ref{fig:s1c1pe},
1037: a PE may wait for its four neighbors.
1038: However,
1039: in the algorithms in Figure~\ref{fig:amcgen} and Figure~\ref{fig:amcpoi}, 
1040: a PE waits for at most two neighbors.
1041: For example, when the state of cell $w$ in Figure~\ref{fig:aggr}$a$ is updated,
1042: PE1 might wait for PE3 and PE4.
1043: Moreover,
1044: for at least
1045: $(m-2)^2$ cells $c$ out of $m^2$,
1046: PE1 does not wait at all,
1047: because $W(c)=\emptyset$.
1048: The cells $c$ such that $W(c)=\emptyset$
1049: form the dashed square in Figure~\ref{fig:aggr}$a$.
1050: 
1051: This additional efficiency becomes especially large if,
1052: instead of set $neighbors (c)$ in the original formulation
1053: of the model,
1054: one uses sets 
1055: \begin{equation}
1056: \label{nei2}
1057: neighbors^2 (c)~ \stackrel{\rm def}{=} ~next\_to\_nearest\_neighbors (c)
1058: \end{equation}
1059: or, more generally, the $q$-th degree neighborhood,
1060: $neighbors^q (c)$.
1061: The latter is
1062: defined for $q~>~1$ inductively
1063: \begin{equation}
1064: \label{neiq}
1065: neighbors^q (c) \stackrel{\rm def}{=} neighbors ( neighbors^{q-1} (c))
1066: \end{equation}
1067: where $neighbors (S)$ for a set $S$ of cells
1068: is defined as
1069: $neighbors (S) \stackrel{\rm def}{=}  \bigcup_{~c \in S} neighbors (c)$.
1070: 
1071: It is easy to rewrite 
1072: the algorithms in Figure~\ref{fig:a1c1pe} and Figure~\ref{fig:s1c1pe}
1073: for the case $q~>~1$.
1074: The resulting codes have low efficiency, however.
1075: For example,
1076: in the square array case,
1077: one has
1078: $| neighbors^q (c) | - 1=2q(q+1)$.
1079: Thus, if $q=2$,
1080: a cell might have to wait
1081: for 12 cells
1082: in order to update.
1083: In the same example, 
1084: if one PE carries an $m \times m$ subarray,
1085: and $m~>~q$, then the PE waits for at most three other PEs
1086: no matter how large $q$ is.
1087: Moreover,
1088: if $m > 2q$ then in $(m-2q)^2$ cases out of $m^2$ 
1089: the PE does not wait at all.
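
The count $2q(q+1)$ can be verified directly from the inductive definition.
The Python sketch below is an illustration only.
It assumes cells are integer pairs on an unbounded square grid
and that $neighbors (c)$ contains $c$ itself together with its four nearest neighbors,
which is the reading consistent with the formula above.

```python
def neighbors(c):
    """Cell c plus its four nearest neighbors (assumed to include c)."""
    x, y = c
    return {(x, y), (x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)}

def neighbors_q(c, q):
    """q-th degree neighborhood: neighbors applied q times, as a set."""
    cells = neighbors(c)
    for _ in range(q - 1):
        cells = set().union(*(neighbors(d) for d in cells))
    return cells
```

For instance, \verb|len(neighbors_q((0, 0), 2)) - 1| evaluates to 12, the number of cells quoted above for $q=2$.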
1090: \\
1091: 
1092: {\bf The BKL algorithm} \cite{BKL}
1093: was originally proposed for Ising spin simulations.
1094: It was noticed that the probability $p$ to flip
1095: $s(c)$ takes on only a finite (and small) number
1096: $d$ of values $p_1, \ldots, p_d$,
1097: each corresponding to one or several
1098: combinations of old values of $s(c)$ and neighboring spins $s(c')$.
1099: Thus the algorithm
1100: splits the cells into $d$
1101: pairwise disjoint classes $\Gamma_1, \Gamma_2, \ldots, \Gamma_d$.
1102: The rates $\lambda p_k$ of changes
1103: (not just of the attempts to change)  
1104: for all $c \in \Gamma_k$
1105: are the same.
1106: At each iteration, the BKL algorithm does the following: 
1107: \\
1108: \begin{quotation}
1109: (a) Selects $\Gamma_{k_0}$ at random according to the 
1110: weights $| \Gamma_k | p_k$, $k=1,2,\ldots,d$,
1111: and selects a cell $c \in \Gamma_{k_0}$ uniformly at random.
1112: \\
1113:     
1114: (b) Flips the state of the selected cell, 
1115: $s(c) \leftarrow -s(c)$.
1116: \\
1117:     
1118: (c) Increases the time by 
1119: $- {\log}_e (r) /( \lambda ( \sum_{1 \leq k \leq d} | \Gamma_k |  p_k ))$,
1120: where $r$ is a pseudo-random number uniformly distributed in (0,1).
1121: \\
1122:     
1123: (d) Updates the membership in the classes.
1124: \end{quotation}
1125: If the asynchrony law is Poisson,
1126: the idea of the BKL algorithm 
1127: can be applied also to a 
1128: deterministic update.
1129: Here the probability $p$
1130: of change takes on just two values:
1131: \\
1132: $p_1 =0$ if $next\_s(c)=s(c)$,
1133: and $p_2 =1$ if $next\_s(c)~ \neq ~s(c)$.
1134: \\
1135: Accordingly, there are two classes:
1136: $\Gamma_0$, the cells which are not going to change
1137: and
1138: $\Gamma_1$, the cells which are going to change.
1139: As with the original BKL algorithm,
1140: a substantial overhead is required for maintaining an account 
1141: of the membership in the classes (Step (d)).
1142: The BKL algorithm is justified only if a large number of cells
1143: are not going to change their states.
1144: The latter is often the case.
1145: For example,
1146: in Conway's synchronous {\em Game of Life}
1147: (Gardner \cite{GAR})
1148: large regions of
1149: white cells ($s(c)=0$) remain
1150: unchanged for many iterations
1151: with very few black cells ($s(c)=1$).
1152: One would expect similar behavior
1153: for an asynchronous version of the 
1154: Game of Life.
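
Steps (a) and (c) of the BKL iteration described above can be sketched in Python as follows.
This is a hedged illustration, not the code of \cite{BKL}:
the function name and the representation of classes as lists of cell identifiers are assumptions,
and the model-dependent Steps (b) and (d) are left to the caller.

```python
import math
import random

def bkl_step(classes, p, lam, rng):
    """One BKL selection: returns (class index k0, cell c, time increment).

    classes[k] lists the cells of class k; p[k] is its change probability.
    """
    weights = [len(g) * pk for g, pk in zip(classes, p)]
    # Step (a): pick a class by weight, then a cell uniformly within it.
    k0 = rng.choices(range(len(classes)), weights=weights)[0]
    c = rng.choice(classes[k0])
    # Step (c): exponential time increment with rate lam * sum of weights.
    r = 1.0 - rng.random()       # uniform in (0, 1], avoids log(0)
    dt = -math.log(r) / (lam * sum(weights))
    return k0, c, dt
```

In the deterministic two-class case, a call such as
\verb|bkl_step([[0, 1, 2], [3]], [0.0, 1.0], 2.0, random.Random(1))|
always selects cell 3, the only cell that is going to change.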
1155: 
1156: The basic BKL algorithm is serial.
1157: To use it on a parallel computer,
1158: an obvious idea is to run a copy of the serial BKL algorithm
1159: in each subarray carried by a PE.
1160: Such a procedure,
1161: however,
1162: causes roll-backs,
1163: as seen in the following example:
1164:    
1165: Suppose PE1 is currently updating the state
1166: of cell $v$ (Figure~\ref{fig:aggr}$a$) and its 
1167: local time is $T_1$,
1168: while the local time of PE5, $T_5$,
1169: is larger than $T_1$.
1170: Since $x$ is a nearest neighbor of $v$,
1171: $x$'s membership might change because of $v$'s changed state.
1172: Suppose $x$'s membership were to indeed change.
1173: Although this change would have been in effect since time $T_1$,
1174: PE5, which is responsible for $x$,
1175: would learn about the change 
1176: only at time $T_5 ~>~T_1$.
1177: As the past of PE5 is not, therefore,
1178: what PE5 has believed it to be,
1179: interval [$T_1 , T_5$] must have
1180: been simulated by PE5 incorrectly,
1181: and must be played again.
1182: This original roll-back might cause a cascade
1183: of secondary roll-backs, third-generation roll-backs, etc.
1184: \\
1185: 
1186: {\bf A modified BKL algorithm}
1187: applies the original BKL procedure
1188: only to a subset of the cells,
1189: whereas
1190: the procedure of the standard model is applied
1191: to the remaining cells.
1192: More specifically:
1193: An additional separate class $\Gamma_0$ is defined.
1194: Unlike other $\Gamma_k$, $k~>~0$, 
1195: class $\Gamma_0$
1196: always contains the same cells.
1197: Steps (a) - (d) are performed as above
1198: with the following modifications:
1199: \newpage
1200: \begin{quotation}
1201: 1) The weight of $\Gamma_0$ at step (a) is taken to be
1202: $| \Gamma_0 |$.
1203: \\
1204: 
1205: 2) If the selected $c$ belongs to $\Gamma_0$,
1206: then at step (b) the state of $c$ may or may not change.
1207: The probability $p$ of change 
1208: is determined as in the standard model.
1209: \\
1210: 
1211: 3) The time at step (c) should be increased by
1212: $- {\log}_e (r) /( \lambda ( | \Gamma_0 |~+~\sum_{1 \leq k \leq d} | \Gamma_k |p_k ))$,
1213: where $r=r(c,n(t))$ is a pseudo-random number uniformly distributed in $(0,1)$.\\
1214: \end{quotation}
1215: 
1216: Now consider again the subarray
1217: carried by PE1 in Figure~\ref{fig:aggr}$a$.
1218: The subarray can be subdivided
1219: into the $(m-2) \times (m-2)$ ``kernel'' square 
1220: and the remaining boundary layer.
1221: If the first-degree neighborhood, $neighbors~(c)$,
1222: is replaced with the $q$-th degree neighborhood,
1223: $neighbors^q (c)$,
1224: then the kernel is the central $(m-2q) \times (m-2q)$ square,
1225: and the boundary layer has width $q$.
1226: In Figure~\ref{fig:aggr}$a$, the cells in the dashed square 
1227: constitute the kernel with $q=1$.
1228: To apply the modified BKL procedure to 
1229: the subarray carried by PE1,
1230: the boundary layer is declared to be
1231: the special fixed class $\Gamma_0$.
1232: Similar identification is done in the other subarrays.
1233: As a result,
1234: the fast concurrent BKL procedures 
1235: on the kernels
1236: are shielded from each other
1237: by slower procedures on the layers.
1238: 
1239: The roll-back is avoided,
1240: since a state change of a cell
1241: in one subarray does not entail a state
1242: or membership change of a cell
1243: in another subarray.
1244: Unless the performance of PE1 is taken into account,
1245: the neighbors of PE1
1246: cannot even tell whether PE1
1247: uses the standard or the BKL 
1248: algorithm to update its kernel.
1249: As the size of the subarray increases,
1250: so do both the relative weight of the kernel
1251: and the fraction of the fast BKL processing.
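
As a rough illustration that counts cells only, and not the rate weights used at step (a),
the kernel occupies the fraction
\[
\frac{(m-2q)^2}{m^2} ~=~ \left( 1 - \frac{2q}{m} \right)^2
\]
of an $m \times m$ subarray with $m~>~2q$.
For $q=1$ this fraction is $100/144 \approx 0.69$ when $m=12$
and $9604/10000 \approx 0.96$ when $m=100$,
so the share of the boundary layer shrinks as the subarray grows.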
1252: \\
1253: 
1254: {\bf Generating the output}.
1255: Consider the task of generating cellular patterns
1256: for specified simulated times.
1257: A method for performing this task in a serial
1258: simulation or a parallel simulation of a synchronous
1259: cellular array is obvious:
1260: as the global time reaches a specified value,
1261: the computer outputs the states of all cells.
1262: In an asynchronous simulation, 
1263: the task becomes more complicated
1264: because
1265: there is no global time:
1266: different PEs may have different local times
1267: at each physical instant of the simulation.
1268: 
1269: Suppose for example,
1270: one wants to see the cellular patterns
1271: at regular time intervals
1272: $K_0 \Delta t,~(K_0 +1)  \Delta t,~(K_0 +2) \Delta t, \ldots$
1273: on a screen of a monitor attached to the computer.
1274: Without getting too involved 
1275: in the details of performing I/O
1276: operations and the architecture of the parallel computer,
1277: it would be enough to assume that a separate process
1278: or processes are associated with the output;
1279: these processes scan an output buffer memory space
1280: allocated in one or several PEs or in the shared memory;
1281: the buffer space consists of $B$ frames,
1282: numbered $0,1,\ldots,B-1$,
1283: each capable of storing a complete image of
1284: the cellular array for one time instant.
1285: The output processes draw
1286: the image for time $K \Delta t$ 
1287: on the screen
1288: as soon as
1289: the frame number $rem (K/B)$
1290: (the remainder of the integer
1291: division of $K$ by $B$)
1292: is full and the previous 
1293: images have been shown.
1294: Then the frame is flushed for
1295: the next round when it will be filled
1296: with the image for time $(K+B) \Delta t$
1297: and so on.
1298: \\
1299: \begin{figure}
1300: \centering
1301: \fbox{
1302: \begin{minipage} {12.8cm}
1303: /* Initially $K=K_0$, $T(C)~<~K_0 \Delta t$ */\\
1304: \begin{enumerate}
1305: \item while $T(C)~<~end\_time$\\
1306: \hspace*{0.15in}
1307: \{
1308: \item~~~~~~select a cell $c$ in the subarray $C$ such that\\
1309: \hspace*{0.2in}
1310: ~$t(c)=\min_{~c' \in C} t(c')$ and assign $new\_T \leftarrow t(c)$;
1311: \item~~~~~~while $new\_T > K \Delta t$ \\
1312: \hspace*{0.2in}
1313: ~~~~\{
1314: \item~~~~~~~~~~~~~wait\_until frame $rem (K/B)$ is available;
1315: \item~~~~~~~~~~~~~store image $s(C)$ into frame $rem (K/B)$;
1316: \item~~~~~~~~~~~~~$K \leftarrow  K+1$\\
1317: \hspace*{0.2in}
1318: ~~~~~\};
1319: \item~~~~~~$T(C)  \leftarrow  new\_T$;
1320: \item~~~~~~wait\_until $T(C) \leq \min_{~C' \in W(c)} ~T(C')$ ;
1321: \item~~~~~~$s(c) \leftarrow next\_state (c,~ s (neighbors (c)),~t(c))$ ;
1322: \item~~~~~~$t(c) \leftarrow  time\_of\_next\_arrival~(c,~s (neighbors (c)),~ t(c))$\\
1323: \hspace*{0.15in}
1324: \}
1325: \end{enumerate}
1326: \end{minipage}}
1327: \caption{Generating the output in the aggregated asynchronous algorithm}
1328: \label{fig:genout}
1329: \end{figure}
1330: 
1331: The algorithm must fill the appropriate frame
1332: with the appropriate data as soon as 
1333: both data and the frame become available.
1334: The modifications that enable the asynchronous algorithm 
1335: in Figure~\ref{fig:amcgen} to perform this task
1336: are presented in Figure~\ref{fig:genout}.
1337: In this algorithm,
1338: variables $new\_T$ and $K$ are private (i.e., local to PE)
1339: and
1340: $\Delta t$ and $K_0$ are constants
1341: whose values are the same for all the PEs.
1342: Note that different PEs may 
1343: fill different frames concurrently.
1344: If the slowest PE is presently filling an image for time $K \Delta t$,
1345: then the fastest PE is allowed to fill the image for
1346: time no later than $(K + B - 1) \Delta t$.
1347: An attempt by the fastest PE to 
1348: fill the image for time $(K + B) \Delta t$
1349: will be blocked at Step 4, 
1350: until the frame 
1351: number $rem(K/B)=rem((K + B)/B)$
1352: becomes available.
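
The inner loop of Figure~\ref{fig:genout} (Steps 3 to 6) can be isolated as the small Python sketch below.
It is an illustration only: the function name is hypothetical,
and the wait for a free frame at Step 4 is omitted,
so the function merely reports which images a PE would store, and in which frames,
before advancing its local time to $new\_T$.

```python
def frames_due(new_T, K, dt, B):
    """List the (K, frame number) pairs a PE stores before moving to new_T.

    Returns the list together with the updated private counter K.
    """
    stored = []
    while new_T > K * dt:            # Step 3
        stored.append((K, K % B))    # Steps 4-5: frame rem(K/B)
        K += 1                       # Step 6
    return stored, K
```

For example, with $\Delta t = 1$, $B=3$, and $K=1$,
a jump of the local time to $new\_T = 2.5$ stores the images for times $1 \Delta t$ and $2 \Delta t$
in frames 1 and 2 and leaves $K=3$.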
1353: 
1354: Thus, the finiteness of the output buffer introduces
1355: a restriction which is not present in the original algorithm
1356: in Figure~\ref{fig:amcgen}.
1357: According to this restriction,
1358: the lag between concurrently processed local times
1359: cannot exceed
1360: a certain constant.
1361: The exact value of the constant in each particular instance
1362: depends on the relative
1363: positions of the update times within the $\Delta t$-slots.
1364: In any case, 
1365: the constant is not smaller than
1366: $(B-1) \Delta t$
1367: and
1368: not larger than
1369: $B \Delta t$.
1370: 
1371: However, 
1372: even with a single output buffer segment, $B=1$,
1373: the simulation does not become time-driven.
1374: In this case,
1375: the concurrently processed local times might be
1376: within a distance of
1377: up to $\Delta t$
1378: of each other,
1379: whereas $\Delta t$ might be relatively large.
1380: No precision of update time representation is lost,
1381: although efficiency might degrade 
1382: when both $\Delta t$ and $B$ become too small,
1383: see Section~\ref{sec:perf}.
1384: 
1385: \section{Performance assessment: experiments and simulations}\label{sec:perf}
1386: \hspace*{\parindent} 
1387: Modeling and analysis of asynchronous
1388: algorithms is a difficult theoretical problem.
1389: Strictly speaking,
1390: the following discussion is applicable
1391: only to synchronous algorithms.
1392: However, one may argue informally
1393: that the performance of an asynchronous
1394: algorithm is not worse than that of its synchronous counterpart, 
1395: since expensive synchronizations are eliminated.
1396: 
1397: First, consider the synchronous algorithm in Figure~\ref{fig:s1c1pe}.
1398: Let $N$ be the size of the array and $N_0$ be the number of
1399: cells which passed
1400: the test at Step 2, Figure~\ref{fig:s1c1pe}.
1401: The ratio of useful work performed, 
1402: to the total work expended at
1403: the iteration is $N_0 /N$.
1404: This ratio yields the {\em efficiency}
1405: (or {\em utilization}) at the given iteration.
1406: Assuming that in the serial algorithm all the work is useful,
1407: and that the algorithm performs the same computation as its parallel
1408: counterpart,
1409: the speed-up of the parallel computation
1410: is the average efficiency times the number of PEs involved.
1411: Here the averaging is done 
1412: with equal weights
1413: over all the iterations.
1414: 
1415: In the general algorithms, 
1416: $next\_t(c)$ is determined using the
1417: states of the neighbors of $c$.
1418: However, in important applications,
1419: such as the Ising model,
1420: $next\_t(c)$ is independent of states.
1421: The following assessment is valid only for
1422: this special case of independence.
1423: Here
1424: the configuration is irrelevant
1425: and
1426: whether the test succeeds or not 
1427: can be determined knowing only the times at each iteration.
1428: This leads to a simplified model in which 
1429: only local times are taken into account:
1430: at an iteration,
1431: the local time of
1432: a cell is incremented
1433: if the time does not exceed
1434: the minimum of the local times of its neighbors.
1435: 
1436: A simple (serial) algorithm 
1437: which updates only local times of cells $t(c)$
1438: according to the rules formulated above
1439: was exercised for different array sizes $n$
1440: and three different dimensions:
1441: for an $n$-element circular array,
1442: an $n \times n$ toroidal array,
1443: and an $n \times n \times n$ array with periodic boundary conditions.
1444: Two types of asynchrony were tried:
1445: the Poisson asynchrony
1446: for which
1447: $next\_t~-~t$ is distributed exponentially,
1448: and the asynchrony
1449: for which $next\_t~-~t$ is
1450: uniformly distributed in (0,1).
1451: In both cases,
1452: random time increments
1453: for different cells are independent.
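
The simplified model for the circular array with the Poisson asynchrony can be sketched in Python as follows.
This is a minimal illustration, not the program used for Figure~\ref{fig:perf1t1}:
the function name, the warm-up parameter, and the use of a single shared pseudo-random stream are assumptions.
In each round the test is evaluated for all cells using the times at the start of the round.

```python
import random

def ring_efficiency(n, rounds, warmup, seed=0):
    """Average fraction of cells on an n-ring that pass the test per round.

    Rounds before `warmup` are executed but not counted (an assumption
    made here to reduce the influence of the initial all-zero times).
    """
    rng = random.Random(seed)
    t = [0.0] * n                    # local times t(c)
    useful = 0
    for it in range(rounds):
        # A cell may advance if its time does not exceed its neighbors' times.
        ok = [t[i] <= min(t[i - 1], t[(i + 1) % n]) for i in range(n)]
        for i in range(n):
            if ok[i]:
                t[i] += rng.expovariate(1.0)   # exponential increment
        if it >= warmup:
            useful += sum(ok)
    return useful / (n * (rounds - warmup))
```

A call such as \verb|ring_efficiency(500, 600, 200)| should give an estimate in the neighborhood of the limiting value of about 0.247
reported for the large $n$-circle with the Poisson asynchrony.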
1454: 
1455: \begin{figure}
1456: \centering
1457: \includegraphics*[width=5.8in]{PERF1T1.PS}
1458: \caption{Performance of the Ising model simulation. One-cell-per-one-PE case}
1459: \label{fig:perf1t1}
1460: \end{figure}
1461: 
1462: The results of these six experiments
1463: are given in Figure~\ref{fig:perf1t1}.
1464: Each solid line in Figure~\ref{fig:perf1t1} is enclosed between two dashed
1465: lines. 
1466: The latter represent
1467: 99.99\% Student's confidence intervals constructed
1468: using several simulation runs
1469: that are parametrically the same 
1470: but fed with different pseudo-random sequences.
1471: In Figure~\ref{fig:perf1t1}, for each array topology
1472: there are two solid lines.
1473: The Poisson asynchrony
1474: always corresponds to the lower line.
1475: The corresponding limiting values of performances
1476: (when $n$ is large)
1477: are also shown near the right end of each curve.
1478: For example, the efficiency in the 
1479: simulation of a large $n \times n$ array
1480: with the Poisson asynchrony is about 0.121;
1481: with the other asynchrony, it is about 0.132.
1482: 
1483: No analytical theory is available
1484: for predicting these values
1485: or even proving their separation from zero
1486: when $n \rightarrow +\infty$.
1487: It follows from Figure~\ref{fig:perf1t1} that replacing
1488: exponential distribution of $next\_t - t$ with
1489: the uniform distribution results in efficiency
1490: increase
1491: from 0.247 to 0.271 for a large $n$-circle
1492: ($n \rightarrow +\infty$).
1493: The efficiency can be raised even more.
1494: If $next\_t - t = r^{1/8}$,
1495: where $r$ is distributed uniformly in (0,1),
1496: then in the limit $n \rightarrow +\infty$,
1497: with the Student's confidence 99.99\%,
1498: the efficiency is $0.3388 \pm 0.0012$.
1499: It is not known how high the efficiency
1500: can be raised this way
1501: (degenerate cases, like a synchronous one,
1502: in which the efficiency is 1, are not counted).
1503: 
1504: An efficiency of 0.12 means the speed-up
1505: of $0.12 \times N$;
1506: for $N=2^{14}$ this comes to more than 1900.
1507: This assessment is confirmed in an actual full scale simulation experiment
1508: performed on $2^{14}=128 \times 128$ PEs of
1509: a Connection~Machine~(R)
1510: (a quarter of the full computer).
1511: This SIMD computer
1512: appears well-suited for the synchronous execution
1513: of the one-cell-per-one-PE algorithm in Figure~\ref{fig:s1c1pe}
1514: on a toroidal array
1515: with the Poisson asynchrony law.
1516: An individual PE is rather slow
1517: (it executes several thousand
1518: instructions per second),
1519: so the absolute speed
1520: is not very impressive:
1521: it took roughly 1 sec. of real time
1522: to update
1523: all $128 \times 128$ spins
1524: when the traffic generated by other tasks running
1525: on the computer was small
1526: (more precise measurement
1527: was not available).
1528: This includes about 
1529: $8.3~\approx~(0.12)^{-1}$
1530: rounds of the algorithm,
1531: several hundred instructions of one PE per round.
1532: 
1533: The 12\% efficiency in the one-cell-per-one-PE
1534: experiments could be greatly increased
1535: by aggregation.
1536: The many-cells-per-one-PE
1537: algorithm in Figure~\ref{fig:amcpoi} was implemented
1538: as a C language parallel program for a
1539: Balance~(TM) computer,
1540: which is a shared memory MIMD bus machine.
1541: The $n \times n$ array was split into 
1542: $m \times m$ subarrays, 
1543: as shown in Figure~\ref{fig:aggr},
1544: where $n$ is a multiple of $m$.
1545: Because the computer has 30 PEs,
1546: the experiments could be performed only with
1547: $(n/m)^2=1, 4, 9, 16$, and 25 PEs
1548: for different $n$ and $m$.
1549: 
1550: Along with these experiments,
1551: a simplified model, similar to
1552: the one-cell-per-one-PE case,
1553: was run on a serial computer.
1554: In this model,
1555: quantity 
1556: $h(C) \stackrel{\rm def}{=} \lambda T(C)$ is maintained for each PE,
1557: $C=1,\ldots,(n/m)^2$.
1558: The update of $h(C)$ is arranged in rounds,
1559: wherein each $h(C)$ is updated as follows:
1560: \\
1561: ~~~~~~(i) with probability $p_0 =(m-2)^2 /m^2$,
1562: PE$C$ updates $h(C)$:
1563: \begin{equation}
1564: \label{hC}
1565: h(C) ~\leftarrow~ h(C) ~-~ \ln r (C, n(h(C))),
1566: \end{equation}
1567: where $r$ and $\ln$ are the same as in Step 5 in 
1568: Figure~\ref{fig:amcpoi}.
1569: Here $p_0$ is the probability 
1570: that the PE chooses a cell $c$ 
1571: so that $|W(c)|=0$;
1572: \\
1573: ~~~~~~(ii) with probability $p_1 =4(m-2)/m^2$,
1574: the PE must check the $h(C')$ of one of its four neighbors $C'$
1575: before making the update.
1576: The $C'$ is chosen uniformly at random among the four possibilities.
1577: If $h(C') ~\geq~ h(C)$,
1578: then $h(C)$ gets an increment according to \eqref{hC};
1579: otherwise, $h(C)$ is not updated.
1580: Here $p_1$ is the probability that the PE chooses
1581: a cell $c$ on an edge but not in a corner, so that $|W(c)|=1$;
1582: \\
1583: ~~~~~~(iii) with the remaining probability $p_2=4/m^2$,
1584: the PE checks $h(C')$ and $h(C'')$
1585: of two of its adjacent neighbors
1586: (for example in Figure~\ref{fig:aggr}, neighbors PE2 and PE3
1587: can be involved in the computation for PE1).
1588: The two neighbors are chosen uniformly at random
1589: from the four possibilities.
1590: Again, if both
1591: $h(C')  \geq  h(C)$
1592: and
1593: $h (C'') \geq  h(C)$,
1594: then $h(C)$ gets an increment according to \eqref{hC};
1595: otherwise, $h(C)$ is not updated.
1596: Here $p_2$ is the probability
1597: of choosing a cell $c$ in a corner,
1598: so that $|W(c)|=2$.
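
Rules (i) to (iii) can be sketched in Python as follows.
This is a hedged illustration rather than the program behind Figure~\ref{fig:perfmt1}:
the function name, the arrangement of the PEs as a $P \times P$ torus,
the single shared pseudo-random stream, and the requirement $m \geq 2$ are assumptions.
Note that $p_0 + p_1 + p_2 = ((m-2)+2)^2/m^2 = 1$.
In each round all tests use the values of $h$ at the start of the round.

```python
import math
import random

def aggregated_efficiency(P, m, rounds, seed=0):
    """Fraction of PE updates that succeed in the simplified h(C) model.

    P*P PEs on a torus (assumed), each carrying an m-by-m subarray, m >= 2.
    """
    rng = random.Random(seed)
    h = [[0.0] * P for _ in range(P)]
    p0 = (m - 2) ** 2 / m ** 2       # interior cell, |W(c)| = 0
    p1 = 4 * (m - 2) / m ** 2        # edge cell,     |W(c)| = 1
    sides = [(0, 1), (1, 0), (0, -1), (-1, 0)]   # offsets of the four neighbors
    updated = 0
    for _ in range(rounds):
        new_h = [row[:] for row in h]
        for i in range(P):
            for j in range(P):
                u = rng.random()
                k = rng.randrange(4)
                if u < p0:
                    check = []                               # rule (i)
                elif u < p0 + p1:
                    check = [sides[k]]                       # rule (ii)
                else:
                    check = [sides[k], sides[(k + 1) % 4]]   # rule (iii)
                if all(h[(i + di) % P][(j + dj) % P] >= h[i][j]
                       for di, dj in check):
                    # Increment -ln r with r uniform in (0, 1].
                    new_h[i][j] = h[i][j] - math.log(1.0 - rng.random())
                    updated += 1
        h = new_h
    return updated / (rounds * P * P)
```

In this sketch the efficiency is taken to be the fraction of successful updates,
and it should grow with $m$, since $p_0$ approaches 1 for large subarrays.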
1599: 
1600: As in the previous case,
1601: this simplified model simulates
1602: a possible but not obligatory synchronous timing arrangement for
1603: executing the real asynchronous algorithm.
1604: Figure~\ref{fig:perfmt1} shows excellent agreement between
1605: actual and predicted performances
1606: for the aggregated Ising model.
1607: The efficiency presented in Figure~\ref{fig:perfmt1} is computed as
1608: 
1609: \begin{equation}
1610: \label{effi}
1611: {\rm efficiency}=
1612: \frac{\rm serial~execution~time} {{\rm number~of~PEs} \times {\rm parallel~execution~time}}
1613: \end{equation}
1614: 
1615: The parallel speed-up can be found as 
1616: efficiency$~\times~$number~of~PEs.
1617: For 25 PEs simulating a 120$\times$120 Ising model,
1618: efficiency is 0.66;
1619: hence, the speed-up is greater than 16.
1620: For the currently unavailable sizes,
1621: when $10^4$ PEs
1622: simulate a $10^4 \times 10^4$ array,
1623: the simplified model predicts
1624: an efficiency of about 0.8 and a speed-up of about 8000.
1625: 
1626: \begin{figure}
1627: \centering
1628: \includegraphics*[width=5.8in]{PERFMT1.PS}
1629: \caption{Performance of the Ising model simulation. Many-cells-per-one-PE case}
1630: \label{fig:perfmt1}
1631: \end{figure}
1632: 
1633: In the experiments reported above,
1634: the lag between the local times of any two PEs
1635: was not restricted.
1636: As discussed in Section~\ref{sec:algo},
1637: an upper bound on the lag 
1638: might result from the necessity
1639: to produce the output.
1640: To see how the bound
1641: affects the efficiency,
1642: one experiment reported in Figure~\ref{fig:perfmt1}
1643: was repeated with various finite
1644: values of the lag bound.
1645: In this experiment, 
1646: an $n \times n$ array is simulated
1647: and
1648: one PE carries an $m \times m$ subarray,
1649: where $n=384$ and $m=12$.
1650: The results are presented in Figure~\ref{fig:perfbl}.
1651: 
1652: In Figure~\ref{fig:perfbl},
1653: the unit of measure for a lag is the expectation of
1654: time intervals between consecutive arrivals for a cell.
1655: For lag bounds greater than 16, 
1656: degradation of efficiency is almost unnoticeable,
1657: when
1658: compared with the base experiment where lag$= \infty$.
1659: Substantial degradation starts at about 8;
1660: for the unity lag bound,
1661: the efficiency is about half that of the base experiment.
1662: However, even for lag bound 0.3, the simulation remains practical,
1663: with an efficiency of about 0.1;
1664: since 1024 PEs execute the task, 
1665: this efficiency means a speed-up of more than 100.
1666: 
1667: \begin{figure}
1668: \centering
1669: \includegraphics*[width=5.8in]{PERFBL.PS}
1670: \caption{Efficiency degradation caused by bounded lag}
1671: \label{fig:perfbl}
1672: \end{figure}
1673: 
1674: \section{Conclusion}\label{sec:concl}
1675: \hspace*{\parindent} 
1676: This paper demonstrates an efficient parallel method
1677: for simulating asynchronous cellular arrays.
1678: The algorithms are quite simple and easily implementable
1679: on appropriate hardware.
1680: In particular, each algorithm
1681: presented in the paper
1682: can be implemented on a general purpose
1683: asynchronous parallel computer,
1684: such as the currently available bus machines with shared memory.
The speed of such an implementation
depends on the speed of the PEs
and the efficiency of the communication system.
A crucial condition for the success of such an implementation
is the availability of a good parallel generator
of pseudo-random numbers.
To ensure reproducibility,
1692: each PE should have its own reproducible
1693: pseudo-random sequence.
1694: 
The proposed algorithms raise
a number of challenging mathematical problems,
for example, that of
proving that the efficiency tends
to a positive limit as the number of PEs increases
to infinity.
1701: \\
1702: 
1703: {\bf Acknowledgments}.
1704: \\
I thank the personnel
of Thinking Machines Corp. for their kind invitation
and for their help in debugging and running the parallel *LISP
program on one of their computers.
In particular, the help of
Mr. Gary Rancourt and Mr. Bernie Murray was invaluable.
1711: Also, I thank Andrew T. Ogielski and Malvin H. Kalos 
1712: for stimulating discussions, 
1713: Debasis Mitra for a helpful explanation of a topic in Markov chains,
1714: and Brigid Moynahan for carefully reading the text.
1715: 
1716: \newpage
1717: \begin{thebibliography}{MMMM}
1718: \bibitem[1]{GG}
1719: S. Geman and D. Geman,
1720: Stochastic relaxation, Gibbs distributions,
1721: and the Bayesian restoration of images,
{\em IEEE Transactions on Pattern Analysis and Machine Intelligence},
{\bf PAMI-6}, no.~6 (Nov. 1984), 721--741.
1724: 
1725: \bibitem[2]{LM}
1726: B.~D. Lubachevsky and D. Mitra,
1727: A chaotic asynchronous algorithm for computing
1728: the fixed point of a nonnegative matrix of unit spectral radius,
1729: {\em Journal of the ACM}, {\bf 33}, 1 (1986), 130--150.
1730: 
1731: \bibitem[3]{INBUV}
1732: T.~E. Ingerson and R.~L. Buvel,
1733: Structure in asynchronous cellular automata,
1734: {\em Physica}, {\bf 10D} (1984), 59--68.
1735: 
1736: \bibitem[4]{HOF}
1737: M.~I. Hoffman,
A cellular automaton model based on cortical physiology,
1739: {\em Complex Systems}, {\bf 1}, (1987), 187--202.
1740: 
1741: \bibitem[5]{ISING}
E. Ising,
Beitrag zur Theorie des Ferromagnetismus,
1744: {\em Z. Physik}, {\bf 31} (1925), 253--258.
1745: 
1746: \bibitem[6]{GL}
1747: R.~J. Glauber,
1748: Time-dependent statistics of the Ising model,
1749: {\em Journ. Math. Physics}, {\bf 4}, no.2 (1963), 294--307.
1750: 
1751: \bibitem[7]{MRRTT}
N. Metropolis, A.~W. Rosenbluth, M.~N. Rosenbluth, A.~H. Teller, and E. Teller,
1753: Equation of state calculations by fast computing machines,
1754: {\em Journ. Chem. Physics}, {\bf 21}, no.6 (1953), 1087--1092.
1755: 
1756: \bibitem[8]{BKL}
A.~B. Bortz, M.~H. Kalos, and J.~L. Lebowitz,
1758: A new algorithm for Monte Carlo simulation of Ising spin systems,
1759: {\em J. Comp. Physics}, {\bf 17} (1975), 10--18.
1760: 
1761: \bibitem[9]{OGI}
1762: A.~T. Ogielski,
1763: Dynamics of three-dimensional Ising spin glasses in thermal equilibrium,
1764: {\em Physical Review B}, {\bf 32}, no.11 (1985), 7384--7398.
1765: 
1766: \bibitem[10]{FC}
R. Friedberg and J.~E. Cameron,
1768: Test of the Monte Carlo method: fast simulation of a small Ising lattice,
1769: {\em Journ. Chem. Physics}, {\bf 52}, no.12 (1970), 6049--6058.
1770: 
1771: \bibitem[11]{CR}
1772: M. Creutz,
1773: Deterministic Ising dynamics,
{\em Ann. Phys.}, {\bf 167} (1986), 62--72.
1775: 
1776: \bibitem[12]{GAR}
1777: M. Gardner,
1778: Mathematical games.
The fantastic combinations of John Conway's new solitaire game ``life'',
1780: {\em Scientific American}, October 1970, 120--124.
1781: 
1782: \end{thebibliography}
1783: % ----------------------------------------------------------------
1784: 
1785: \newpage
{\bf APPENDIX: working code for the Ising simulation}
1787: \\
1788: \\
A C language program for the BALANCE parallel computer.
The code is used for timing only and contains no I/O;
the code of the pseudo-random number generator
is not included.
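
For orientation, the quantities computed by the program can be read off
from the listing as follows
(the symbols $s$, $\sigma$, $p$, and $r$ are introduced here only for this summary).
A cell with spin $s \in \{-1, 1\}$ whose four neighbors have spins
summing to $\sigma \in \{-4,-2,0,2,4\}$
would change the energy by $\Delta E$ if flipped,
and the flip is accepted with probability $p$:
\[
\Delta E = 2 s \, (J \sigma + H),
\qquad
p = \frac{e^{-\Delta E / T}}{1 + e^{-\Delta E / T}} .
\]
The ten possible values of $p$ are tabulated in the array {\tt prob}.
After each attempted flip, the local time of the PE is advanced by $-\ln r$,
where $r$ is a pseudo-random number from the unit interval;
if $r$ is uniformly distributed, this increment is exponentially distributed with unit mean.
Since each PE serves $A^2$ cells and runs until its local time reaches
${\tt END\_TIME} \cdot A^2$,
each cell receives about {\tt END\_TIME} attempted updates on average.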
1793: \begin{verbatim}
1794: #include <pp.h>
1795: #include <math.h>
1796: #include <sys/tmp_ctl.h>
1797: 
1798: #define SHARED_MEM_SIZE (sizeof(double)*10000)
1799: #define END_TIME 1000.
1800: #define A 24       /* side of small square a PE takes care of*/
1801: #define M 5        /* number of PEs along a side of the big square*/
1802: 
1803: shared int nPEs = M*M, spin[M*A][M*A];
1804: shared float time[M][M];  /*local times on subarrays*/
1805: shared float prob[10];  /* probabilities of state change */
1806: shared float J = 1., H = 0.;     /* Energy= -J sum spin spin' - H sum spin */
1807: shared float T = 1.;                /* Temperature */
1808: shared int ato2 = A*A;
1809: shared int am = A*M;
1810: 
1811: main()
1812: {
1813:     int i,j,child_id, my_spin, sum_nei, index, bit;
1814:     float d_E, x;
1815:     double frand();
1816: 
1817: /* compute flip probabilities */
1818:     for (i = 0; i < 5 ; i++)
1819:         for (j = 0; j < 2; j++)
1820:           {index = i + 5*j;    /* index = 0,1,...,9 */ 
1821:            my_spin = 2*j - 1;
1822:            sum_nei = 2*i - 4;
1823:            d_E = 2.*(J * my_spin * sum_nei + H * my_spin);
1824:            x = exp(-d_E/T);
1825:            prob[index] = x/(1.+x);
1826:    /*      printf("prob[%d]=%f\n",index,prob[index]);  */
1827:           };
1828: 
1829: /* initialize local times */
1830:     for (i = 0; i < M; i++)
1831:         for (j = 0; j < M; j++)
1832:             time[i][j]=0.;
1833: 
1834: /* initialize spins at random, in seedran(seed,b), b is dummy*/
1835:     seedran(31234,1);
1836:     for (i = 0; i < M*A; i++)
1837:         for (j = 0; j < M*A; j++) {
1838:             bit = 2*frand(1);                /* bit becomes 0 or 1 */
1839:             spin[i][j] = 2*bit - 1;          /* spin becomes -1 or 1 */
1840:    /*       printf("spin[%d][%d]=%d\n",i,j,spin[i][j]);      */
1841:     };
1842: 
1843:     /* in the following loop single PE spawns nPEs other PEs for concurrent
1844:        execution. Each child PE would execute subroutine work(my_id) with its
1845:        own argument my_id. */
1846: 
1847:     for (child_id = 0; child_id < nPEs; child_id++)
1848:           if (fork() == 0) {
1849:               tmp_affinity(child_id);     /* fixing a PE for process child_id */
1850:               work(child_id);              /* starting a child PE process */
1851:               exit(0);
1852:           }
1853: 
1854: 
1855:     /* in the following loop the parent PE awaits termination of each child PE
1856:        then terminates itself */
1857:     for (child_id = 0; child_id < nPEs; child_id++) wait(0);
1858:     exit(0);
1859: }
1860: 
1861: work(my_id)
1862: int my_id;
1863: {
1864:   int i,j;
1865:   int coord, var;
1866:   int x,y,my_i,my_j,sum_nei, nei_i,nei_j;
1867:   int  up_i, down_i, left_j, right_j;
1868:   int i_base, j_base;
1869:   int index;
1870:   double frand(); 
1871:   double r;
1872:   double end_time;
1873: 
1874:   end_time = END_TIME*A*A;
1875:                                /*normalizing time scale for multiprocessor execution*/
1876: 
1877:   my_i = my_id%M;      /*PE my_id carries small square (my_i,my_j)*/
1878:   i_base = my_i*A;     
1879:   up_i = (my_i + 1)%M; 
1880:   down_i = (my_i + M - 1)%M;
1881: 
1882:   my_j = (my_id-my_i)/M;
1883:   j_base = my_j*A;        
1884:   left_j = (my_j + M - 1)%M;
1885:   right_j = (my_j + 1)%M;
1886: 
1887:   seedran(my_id*my_id*my_id,my_id);  
1888:  /*PE my_id has its own copy of pseudo-random number generator and initializes it 
1889:    using seedran(seed,my_id) with unique seed=my_id*my_id*my_id */
1890: 
1891:   while(time[my_i][my_j] < end_time) 
1892:   {
1893:     r = frand(my_id); 
1894:       /*PE my_id obtains next pseudo-random number from its own sequence*/
1895:     x = r*A;               
1896:     y = (r*A-x)*A;  
1897:        /*pick a random cell with internal address (x,y) within the A*A square*/
1898: 
1899: /*compute sum of neighboring spins*/
1900:     sum_nei = 0;          
1901:     for (coord = 0;  coord < 2; coord += 1)
1902:         for (var = -1;  var < 2; var += 2)
1903:     {
1904:           nei_i = x;
1905:           nei_j = y;
1906:           if(coord == 0) nei_i += var;
1907:           if(coord == 1) nei_j += var;
1908: 
1909:           if(0 <= nei_i && nei_i < A && 0 <= nei_j && nei_j < A) 
1910:           {
1911:              nei_i += i_base;
1912:              nei_j += j_base;
1913:           }
1914:           else 
1915:           {
1916:        /* 4 possible reasons to wait for a neighboring PE */
1917:             if(-1 == nei_i) while (time[down_i][my_j]  < time[my_i][my_j]) ;
1918:             if(-1 == nei_j) while (time[my_i][left_j]  < time[my_i][my_j]) ;
1919:             if(nei_i == A)  while (time[up_i][my_j]    < time[my_i][my_j]) ;
1920:             if(nei_j == A)  while (time[my_i][right_j] < time[my_i][my_j]) ;
1921: 
1922:             nei_i = (nei_i+i_base+am)%am;
1923:             nei_j = (nei_j+j_base+am)%am;   
1924:           };
1925:           sum_nei += spin[nei_i][nei_j];
1926:     };
1927: 
1928: /*recover index*/
1929:     index = (sum_nei + 4)/2 + 5*(spin[x+i_base][y+j_base] + 1)/2;
1930: 
1931:     r = frand(my_id); 
1932: 
1933:     if(r < prob[index]) 
1934:       spin[x+i_base][y+j_base] *= -1;
1935:     else /* printf(": NO flip\n") */ ;
1936: 
1937:     r = frand(my_id); 
1938:     time[my_i][my_j] += -log(r);
1939:   };
1940: }
1941: \end{verbatim}
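
The generator routines called in the listing above are not reproduced in this appendix.
Purely as an illustration of the requirement that
each PE have its own reproducible pseudo-random sequence,
the following sketch keeps one generator state per PE.
It is not the generator used in the experiments:
the SplitMix64 mixing function is substituted as a stand-in,
the constant {\tt MAX\_STREAMS} and the array {\tt stream\_state} are hypothetical,
and only the names and argument order of {\tt seedran} and {\tt frand}
are taken from the calls in the listing.
\begin{verbatim}
#include <stdint.h>

#define MAX_STREAMS 1024      /* assumed upper limit on the number of PEs */

static uint64_t stream_state[MAX_STREAMS];   /* one private state per PE */

/* One SplitMix64 step: advance the state and return 64 mixed bits. */
static uint64_t splitmix64(uint64_t *s)
{
    uint64_t z = (*s += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Seed stream `id`; the id is mixed in so that equal seeds on
   different PEs still start from different states.
   Assumes 0 <= id < MAX_STREAMS. */
void seedran(long seed, int id)
{
    stream_state[id] = ((uint64_t)seed << 32) ^ (uint64_t)id;
}

/* Next number from stream `id`, strictly inside (0,1),
   so that -log(frand(id)) is always finite. */
double frand(int id)
{
    uint64_t k = splitmix64(&stream_state[id]) >> 12;   /* 52 random bits */
    return ((double)k + 0.5) * (1.0 / 4503599627370496.0);  /* divide by 2^52 */
}
\end{verbatim}
Because each stream touches only its own entry of {\tt stream\_state},
the numbers seen by one PE do not depend on how its calls interleave
with those of the other PEs,
which is the property needed for reproducible runs.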
1942: \end{document}
1943: % ----------------------------------------------------------------\\
1944: