1: \documentclass[9pt,twocolumn,letterpaper]{IEEEtran}
2: 
3: \usepackage{mathptmx}
4: \usepackage[scaled=.90]{helvet}
5: \usepackage{courier}
6: 
7: \usepackage{amsmath}
8: \usepackage{amsfonts,amssymb}
9: \usepackage{graphicx}
10: 
11: \usepackage{subfigure}
12: 
13: \newtheorem{theorem}{Theorem}[section]
14: \newtheorem{conjecture}[theorem]{Conjecture}
15: \newtheorem{corollary}[theorem]{Corollary}
16: \newtheorem{proposition}[theorem]{Proposition}
17: \newtheorem{lemma}[theorem]{Lemma}
18: \newtheorem{definition}[theorem]{Definition}
19: \newtheorem{remark}[theorem]{Remark}
20: \newtheorem{fact}[theorem]{Fact}
21: 
22: \newcommand{\goodgap}{%
23: \hspace{\subfigtopskip}%
24: \hspace{\subfigbottomskip}} 
25: 
26: \newcommand{\POPS}{{\mathrm{POPS}}(d,g)}
27: \newcommand{\POPSg}{{\mathrm{POPS}}(g,g)}
28: \newcommand{\PR}{{\mathbf{Pr}}}
29: \newcommand{\E}{{\mathbf{E}}}
30: 
31: \begin{document}
32: 
33: \title{On-Line Permutation Routing\\ in Partitioned Optical Passive Star Networks}
34: 
35: \author{Alessandro Mei\thanks{Alessandro Mei is with the
36: Department of Computer Science, University of Rome ``La Sapienza'', Italy
37: (e-mail: mei@di.uniroma1.it).} and Romeo Rizzi\thanks{Romeo Rizzi is with the
38: Department of Information and Communication Technology,
39: University of Trento, Italy (e-mail: romeo.rizzi@unitn.it).}}
40: 
41: \maketitle
42: 
43: \begin{abstract}
44: This paper establishes the state of the art in both deterministic and randomized online
45: permutation routing in the POPS network.
46: Indeed, we show that any permutation can be routed online on a $\POPS$ network
47: either with $O(\frac{d}{g}\log g)$ deterministic slots, or, with high probability,
48: with $5c\lceil d/g\rceil+o(d/g)+O(\log\log g)$ randomized slots,
49: where constant~$c=\exp (1+e^{-1})\approx 3.927$.
50: When $d=\Theta(g)$, which we claim to be the ``interesting'' case, the
51: randomized algorithm
52: is exponentially faster than any other algorithm in the literature, whether deterministic or
53: randomized. This is true in practice as well. Indeed, experiments show that it
54: outperforms its rivals even starting
55: from as small a network as a ${\mathrm{POPS}}(2,2)$, and the gap grows exponentially with the
56: size of the network. We can also show that, under proper hypotheses,
57: no deterministic algorithm can asymptotically match its performance.
58: \end{abstract}
59: 
60: \begin{keywords}
61: Optical interconnections, partitioned optical passive star network, permutation routing.
62: \end{keywords}
63: 
65: 
66: \section{Introduction}
67: 
68: The ever-growing demand for fast interconnections in multiprocessor systems
69: has fostered a large interest in optical technology. All-optical communication
70: benefits from a number of good characteristics such as no opto-electronic
71: conversion, high noise immunity, and low latency. Optical technology can
72: provide an enormous amount of bandwidth and, most probably, will
73: have an important role in the future of distributed and parallel computing
74: systems.
75: 
76: The Partitioned Optical Passive Stars (POPS)
77: network~\cite{clmtg94,gmclt95,gm98,mgcl98} is a SIMD
78: parallel architecture that uses a fast optical network composed of
79: multiple Optical Passive Star (OPS) couplers. 
80: A $d\times d$ OPS coupler is
81: an all-optical passive device which is capable of
82: receiving an optical signal from one of its $d$ sources and broadcasting it to
83: all of its $d$ destinations.
84: The number of processors of the network is denoted by $n$, and each processor
85: has a distinct index in $\{0,\dotsc, n-1\}$.
86: The $n$ processors are partitioned into $g$
87: groups of $d$ processors, $n=dg$, in such a way that processor $i$ belongs to
88: group~${\mathrm{group}}(i):=\lfloor i/d\rfloor$ (see Figure~\ref{fig:pops}).
89: \begin{figure}
90: \begin{center}
91: \includegraphics[scale=.7]{pops}
92: \end{center}
93: \caption{A ${\mathrm{POPS}}(3,3)$. Processors are shown as circles, while optical passive stars
94: are shown as boxes. Optical signals flow from the left to the right.
95: The processors on the left and the processors on the right are the same
96: objects shown twice for the sake of clarity.}
97: \label{fig:pops}
98: \end{figure}
99: For each pair of groups $a,b\in\{0,\dotsc,g-1\}$, a coupler~$c(b,a)$ is
100: introduced which has all the $d$ processors of group~$a$ as sources and all
101: the $d$ processors of group~$b$ as destinations.
102: During a computational step (also referred to as a \emph{slot}), each
103: processor~$i$ receives a single message from one of the $g$ couplers
104: $c({\mathrm{group}}(i), a)$, $a\in\{0,\dotsc,g-1\}$, performs some
105: local computations, and sends a single message to a subset of the $g$ couplers
106: $c(b, {\mathrm{group}}(i))$, $b\in\{0,\dotsc,g-1\}$. The couplers are
107: broadcast devices, so this message can be received by more than one processor
108: in the destination groups.
109: In agreement with the literature, in the case when multiple
110: messages are sent to the same coupler, we assume that no message is delivered.
111: This architecture is denoted by $\POPS$.
112: 
113: One of the advantages of a $\mathrm{POPS}(d,g)$ network is
114: that its diameter is one. A packet can
115: be sent from processor~$i$ to processor~$j$, $i\neq j$, in one slot
116: by using coupler $c(\mathrm{group}(j),\mathrm{group}(i))$. However,
117: its bandwidth varies according to $g$. In a $\mathrm{POPS}(n,1)$ network,
118: only one packet can be sent through the single coupler per slot.
119: On the other extreme, a $\mathrm{POPS}(1,n)$ network is a highly expensive,
120: fully interconnected optical network using $n^2$ OPS couplers.
121: A one-to-all communication pattern can also be performed in only one slot in
122: the following way: Processor~$i$ (the speaker) sends the packet to
123: all the couplers~$c(a,\mathrm{group}(i))$, $a\in\{0,\dotsc,g-1\}$;
124: during the same slot, all the processors~$j$, $j\in\{0,\dotsc,n-1\}$,
125: can receive the packet through coupler
126: $c(\mathrm{group}(j),\mathrm{group}(i))$.
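
As a concrete illustration (ours, with indices chosen only for the sake of the example),
consider the ${\mathrm{POPS}}(3,3)$ of Figure~\ref{fig:pops}, where $n=9$.
Processor~$1$ belongs to group $\lfloor 1/3\rfloor=0$ and processor~$7$ belongs to
group $\lfloor 7/3\rfloor=2$, so a packet travels from processor~$1$ to processor~$7$
in one slot through coupler
\[
c(\mathrm{group}(7),\mathrm{group}(1))=c(2,0).
\]
If processor~$2$, which is also in group~$0$, tried to reach processor~$8$
(group $\lfloor 8/3\rfloor=2$) during the same slot, it would need the same coupler $c(2,0)$,
and under the conflict rule stated above neither packet would be delivered.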
127: 
128: The POPS network has been shown to support a number of nontrivial
129: algorithms. Several common communication patterns are realized
130: in~\cite{gm98}. Simulation algorithms for the ring, the mesh, and the hypercube interconnection
131: networks can be found in~\cite{gm-MPPUOI95} and~\cite{s00a}. Some reliability issues
132: are analyzed in~\cite{c-LCN97}. Algorithms for data sum, prefix
133: sums, consecutive sum, adjacent sum, and several data movement operations
134: are also described in~\cite{s00a} and~\cite{ds-IEEETPDS03}. Later, both the algorithms
135: for hypercube simulation and prefix sums have been improved in~\cite{mr-HIPC03}.
136: An algorithm for matrix
137: multiplication is provided in~\cite{s00b}. 
138: Moreover, \cite{bf96} shows that POPS networks can be modeled by directed
139: and complete stack graphs with loops, and uses this formalization to
140: obtain optimal embeddings of rings and de Bruijn graphs into POPS
141: networks.
142: 
143: In~\cite{ds-IEEETPDS03}, Datta and Soundaralakshmi claim that in most practical
144: $\POPS$ networks it is likely that $d>g$. We believe that they are only partly
145: right. While it is true that
146: systems with $d\ll g$ are too expensive, it is also true that systems with $d\gg g$ offer
147: too little parallelism to be worth building. We illustrate our point with an example.
148: Consider the problem of summing $16n$ data values on a $\POPS$ network,
149: $d=g=\sqrt{n}$. This network has $n$ processors. Therefore, the algorithm can work as follows:
150: we input 16 data values per processor, let each processor sum up its 16 data values, and
151: finally we use the algorithm in~\cite{ds-IEEETPDS03} to get the overall sum. This algorithm
152: requires 16 steps to input the data values and compute the local sums, plus
153: $2\log\sqrt{n}=\log n$ slots for computing the final result, for a total of $16+\log n$ slots.
154: With the idea of upgrading our system,
155: we buy $15n$ additional processors and build a $16n$-processor ${\mathrm{POPS}}(d',g')$ network
156: with $d'=16d=16\sqrt{n}$ and $g'=g=\sqrt{n}$.
157: Now, we can use just one step to input the data values, one per processor, and then
158: use the same algorithm in~\cite{ds-IEEETPDS03} to get the overall sum. Unfortunately,
159: this algorithm still requires $16+\log n$ slots, even though we are solving a problem of the
160: same size using a system with 16 times more processors!
161: 
162: The problem is not in the data sum algorithm in~\cite{ds-IEEETPDS03}. Essentially the same thing
163: happens with the prefix sums algorithm in~\cite{ds-IEEETPDS03}, the simulations in~\cite{s00a},
164: and all the other algorithms in the literature for the POPS network we know of, including the ones
165: presented in this paper. The point is that a $\POPS$
166: network can exchange $g^2$ messages at most in a slot. This is an unavoidable bottleneck
167: for networks where $d$ is much larger than $g$, resulting in the poor parallelism of
168: these systems.
169: Also, experience says that the case $d=g$ is the most interesting from a
170: ``mathematical'' point of view. In the past literature, the case $d>g$ and symmetrically the case $d<g$
171: are always dealt with by reducing them to the case $d=g$, which usually contains the
172: ``core'' of the problem in its purest form. This work is not an exception to this empirical
173: yet general rule.
174: So, it is probably more reasonable to assume that practical POPS networks
175: will have $d=\Theta(g)$, that is, $d/g$,
176: and similarly $g/d$, bounded by a constant.
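
The bottleneck can be quantified with a simple counting argument, sketched here under the
assumptions that each of the $n=dg$ processors has a message to deliver to another processor
and that each coupler carries at most one message per slot. Since a $\POPS$ network has
$g^2$ couplers, delivering all the messages requires at least
\[
\left\lceil\frac{dg}{g^2}\right\rceil=\left\lceil\frac{d}{g}\right\rceil
\]
slots. For instance, with $d=16\sqrt{n}$ and $g=\sqrt{n}$ as in the example above,
at least $16$ slots are needed regardless of the algorithm.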
177: 
178: In any case, finding good algorithms for the case $d\neq g$, both $d<g$ and
179: $d>g$, is of absolute
180: importance, since it is not yet clear what the optimal tradeoff between $d$, $g$, and the cost
181: of the network is. Furthermore, an optimal tradeoff may not exist in general,
182: since it probably depends on the specific problem being solved.
183: Moreover, such algorithms are often nontrivial, as, for example,
184: in~\cite{ds-IEEETPDS03}. Therefore, we partly accept the claim in~\cite{ds-IEEETPDS03}
185: that the number of groups cannot substantially exceed the number of processors per
186: group. So, throughout the whole paper, we will discuss our asymptotic results assuming
187: that $g$ grows and that $d=\Omega(g)$. Nonetheless, we will
188: keep in mind that the ``important'' case is likely to be $d=\Theta(g)$.
189: 
190: Here, we consider the \emph{permutation routing problem}: Each of the $n$ processors of the POPS
191: network has a packet that is to be sent to another node, and each processor is the destination
192: of exactly one packet. This is a fundamental problem in parallel computing and
193: interconnection networks, and the literature on this topic is vast. As an excellent starting point,
194: the reader can see~\cite{l92}. On the POPS network, this problem has been studied in two
195: different versions: the \emph{offline} and the \emph{online} permutation routing problem.
196: In the former, the permutation to be routed is globally known in the network. Therefore,
197: every processor can pre-compute the route for its packet taking advantage of this information.
198: This version of the problem has been implicitly studied, for particular permutations, in
199: all the simulation algorithms we reviewed above. Later, most of these results
200: have been unified by proving that any permutation can optimally be routed off-line
201: in one slot, when $d=1$, and $2\lceil d/g\rceil$ slots, when $d>1$~\cite{mr-JPDC03}.
202: 
203: In the online version, every processor knows only the destination of the packet it stores.
204: This problem has been attacked in~\cite{ds-IEEETPDS03}. The solution 
205: iteratively makes use of a sub-routine that sorts $g^2$ items in ${\mathrm{POPS}}(g,g)$
206: subnetworks of the larger $\POPS$ network. The sub-routine is built by hypercube simulation
207: starting from either Cypher and Plaxton's $O(\log n\log\log n)$ sorting algorithm for the
208: $n$-processor hypercube or from Leighton's implementation~\cite{l92} on the
209: $n$-processor hypercube of Batcher's odd-even merge sort algorithm~\cite{b-AFIPS68}. 
210: In the first case, Datta and Soundaralakshmi get the asymptotically fastest algorithm for
211: routing in the POPS network, running in $O(\frac{d}{g}\log g\log\log g)$ slots. In the second,
212: they get an algorithm that turns out to be the fastest in practice, running in
213: $\frac{8d}{g}\log^2 g+\frac{21d}{g}+3\log g+7$ slots. Recently, and independently
214: of this work, Rajasekaran and Davila have presented a randomized algorithm for online
215: permutation routing that runs in $O(\frac{d}{g}+\log g)$ slots~\cite{rd-ICPADS04}.
216: 
217: Our contribution is both theoretical and practical. 
218: We show that any permutation can be routed on a $\POPS$ network
219: either with $O(\frac{d}{g}\log g)$ deterministic slots, or, with high probability,
220: with $5c\lceil d/g\rceil+o(d/g)+O(\log\log g)$ randomized slots,
221: where constant~$c=\exp (1+e^{-1})\approx 3.927$. The deterministic algorithm
222: is based on a direct simulation of the AKS network, and it is the first that requires
223: only $O(\frac{d}{g}\log g)$ slots.
224: When $d=\Theta(g)$, which we claim to be the ``interesting'' case, the
225: randomized algorithm
226: is exponentially faster than any other algorithm in the literature, whether deterministic or
227: randomized. This is true in practice as well. Indeed, our experiments show that it
228: outperforms its rivals even starting
229: from as small a network as a ${\mathrm{POPS}}(2,2)$, and the gap grows exponentially with the
230: size of the network. We can also show that, under proper hypotheses,
231: no deterministic algorithm can asymptotically match its performance.
232: 
233: This paper also presents a strong separation theorem between determinism and randomization.
234: We build a meaningful and natural problem inspired by permutation routing in the POPS
235: network such that there exists an $O(\log\log g)$-slot randomized solution, and such that
236: no deterministic solution can do better than $\Omega(\log g)$ slots, which is exponentially slower.
237: To the best of our knowledge, this is the first strong separation result from $\log g$ to
238: $\log\log g$, and, quite interestingly, it does not make use of the notion of oblivious routing,
239: which we show to be essentially off target in the context of routing in the POPS network.
240: 
241: \section{A Deterministic Algorithm}
242: 
243: Let ${\mathbb{N}}_m:=\{0,1,\dotsc,m-1\}$ denote the set of the first $m$ natural numbers.
244: In the \emph{on-line permutation routing problem} we are given $n$ packets, one per
245: processor. Packet $p_i$, $i\in{\mathbb{N}}_{n}$, originates at processor~$i$, the \emph{source
246: processor}, and has
247: processor~$\pi(i)$ as \emph{destination}, where $\pi$ is a permutation of ${\mathbb{N}}_n$.
248: 
249: The problem is to route all the
250: packets to destination with as few slots as possible.
251: Crucially, permutation $\pi$ is not known in advance---at the beginning of
252: the computation, each processor knows only the destination of the packet it stores.
253: 
254: \subsection{The Upper Bound}
255: 
256: So far, the best deterministic algorithm for online permutation routing on the $\POPS$
257: network is presented in~\cite{ds-IEEETPDS03}. The algorithm runs in $O(\frac{d}{g}\log^2 g)$ slots.
258: The computational bottleneck is
259: an $O(\log^2 g)$ sorting sub-routine that sorts $g^2$ data values $\lceil d/g\rceil$ times, each
260: on one of the $\lceil d/g\rceil$ ${\mathrm{POPS}}(g,g)$
261: sub-networks into which the larger $\POPS$ network is partitioned. The idea in~\cite{ds-IEEETPDS03}
262: is to make each ${\mathrm{POPS}}(g,g)$
263: network simulate Leighton's $O(\log^2 n)$ sorting algorithm for the $n$-processor hypercube~\cite{l92},
264: which is, in turn, an implementation of Batcher's odd-even merge sort. This is
265: carried out by using a general result due to Sahni~\cite{s00a},
266: showing that every move of a \emph{normal}
267: algorithm for the hypercube (where only one dimension is used for communication at each
268: step) can be simulated with $2\lceil d/g\rceil$ slots on a POPS network of the same size. Since
269: Leighton's algorithm is normal, and since
270: the sub-routine is always used on ${\mathrm{POPS}}(g,g)$ sub-networks, we get a constant
271: factor slow-down.
272: 
273: The algorithm in~\cite{ds-IEEETPDS03} is fairly good in practice, since hidden constants are small.
274: However, we are interested in the best asymptotic result. So, as suggested
275: in~\cite{ds-IEEETPDS03}, we can replace the Leighton implementation of Batcher's odd-even
276: merge sort with Cypher and Plaxton's sorting algorithm for the hypercube,
277: which is asymptotically faster (though slower for networks of practical size),
278: since it runs in $O(\log n\log\log n)$ time~\cite{cp-TR90}. This yields
279: an $O(\frac{d}{g}\log g\log\log g)$-slot algorithm for permutation routing on the POPS network,
280: which is a good improvement.
281: Nonetheless, here we do even better. Our simple key idea
282: is to simulate a fast sorting network directly on the POPS, instead of going through
283: hypercube simulation. By giving an improved $O(\log g)$ upper bound for sorting on the POPS network,
284: we also get an asymptotically faster algorithm for online permutation routing.
285:  
286: A \emph{comparator} $[i:j]$, $i,j\in{\mathbb{N}}_n$, sorts the $i$-th and $j$-th elements of a data
287: sequence into non-decreasing order. A \emph{comparator stage} is a composition of comparators
288: $[i_1:j_1]\circ\dotsb\circ [i_k:j_k]$ such that all $i_r$ and $j_s$ are distinct, and a
289: \emph{sorting network} is a sequence of comparator stages such that any input sequence
290: of $n$ data elements is sorted into non-decreasing order.
291: An introduction to sorting networks can be found
292: in~\cite{clr92}. Crucially, we can show that a $\POPS$ network can efficiently simulate
293: any comparator stage.
294: \begin{theorem}[\cite{mr-JPDC03}]
295: \label{thm:permutationrouting}
296: A $\POPS$ network can route off-line any permutation among the $n=dg$
297: processors using one slot when $d=1$ and $2\lceil d/g\rceil$ slots when $d>1$.
298: \end{theorem}
299: \begin{lemma}
300: \label{lem:comparatorstage}
301: A  $\POPS$ network, $n=dg$, can simulate a comparator stage in one slot, when $d=1$,
302: and in $2\lceil d/g\rceil$ slots, when $d>1$.
303: \end{lemma}
304: \begin{proof}
305: Let $[i_1:j_1]\circ\dotsb\circ [i_k:j_k]$ be a comparator stage. We define a function $\pi$ such
306: that $\pi(i_r)=j_r$ and $\pi(j_r)=i_r$ for all $r$.
307: Since all $i_r$ are distinct, and so are all $j_s$,
308: $\pi$ can be extended arbitrarily
309: to a permutation of ${\mathbb{N}}_n$. By
310: Theorem~\ref{thm:permutationrouting}, $\pi$ can be routed in one slot when $d=1$, and
311: $2\lceil d/g\rceil$ slots when $d>1$.
312: During this routing,
313: for every $r$, processor~$i_r$ sends
314: its data value to processor~$j_r$ and vice-versa. Then, processor~$i_r$ discards the maximum
315: of the two data values, while processor~$j_r$ discards the minimum.
316: \end{proof}
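To see the simulation at work, here is a small example of ours on a ${\mathrm{POPS}}(2,2)$
network, $n=4$, with the comparator stage $[0:3]\circ[1:2]$ and input values chosen arbitrarily.
The associated permutation is $\pi(0)=3$, $\pi(3)=0$, $\pi(1)=2$, $\pi(2)=1$, which
by Theorem~\ref{thm:permutationrouting} is routed in $2\lceil 2/2\rceil=2$ slots.
If processors $0,1,2,3$ initially hold the values $5,2,7,1$, then the stage computes
\[
(5,2,7,1)\;\longmapsto\;(\min\{5,1\},\,\min\{2,7\},\,\max\{2,7\},\,\max\{5,1\})=(1,2,7,5),
\]
since processors~$0$ and~$1$ discard the maximum of their pair, while processors~$3$ and~$2$
discard the minimum.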
317: In~\cite{aks-FOCS83}, the AKS sorting network is presented. This network is able to sort any
318: data sequence with only $O(\log n)$ comparator stages, which is optimal.
319: By simulating the AKS network
320: on a POPS network using Lemma~\ref{lem:comparatorstage}, we easily get the following theorem.
321: \begin{theorem}
322: \label{thm:deterministico}
323: A $\POPSg$ network can sort $g^2$ data values in $O(\log g)$ slots.
324: \end{theorem}
325: The above result is the key to improving on the best deterministic algorithm for online permutation
326: routing in the literature.
327: \begin{corollary}
328: \label{cor:deterministico}
329: A $\POPS$ network can route on-line any permutation in $O(\frac{d}{g}\log g)$ slots.
330: \end{corollary}
331: \begin{proof}
332: To get the claim, it is enough to plug the sorting algorithm of Theorem~\ref{thm:deterministico}
333: into Stage~1 of the deterministic routing algorithm proposed in~\cite{ds-IEEETPDS03}.
334: \end{proof}
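
The slot count behind Theorem~\ref{thm:deterministico} can be made explicit.
Assume, for illustration only, that the AKS network on $n$ inputs uses at most $K\log n$
comparator stages, where $K$ is a constant that we do not attempt to estimate here.
On a $\POPSg$ network with $g>1$ we have $n=g^2$, and each stage costs
$2\lceil g/g\rceil=2$ slots by Lemma~\ref{lem:comparatorstage}, so the total is at most
\[
2\cdot K\log (g^2)=4K\log g=O(\log g)
\]
slots.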
335: 
336: This algorithm is not very practical. Indeed, it is based on the AKS network
337: that, in spite of being optimal, is not efficient when $n$ is small due to very large hidden
338: constants. However, the result is important from a theoretical point of view because of two facts:
339: it establishes that,
340: in principle, $O(\frac{d}{g}\log g)$ slots are enough to solve the online permutation
341: routing problem deterministically; and, when $d=O(g)$ and under proper hypotheses, it matches one of the lower
342: bounds for deterministic algorithms in the next section.
343: 
344: \subsection{A Few Lower Bounds}
345: 
346: Borodin et al.~\cite{brsu-JACM97} study the extent to which both complex hardware and
347: randomization can speed up routing in interconnection networks.
348: One of the questions they address is how \emph{oblivious routing algorithms}
349: (in which the possible paths followed by a packet depend only on its own source and destination)
350: compare with \emph{adaptive routing algorithms}. Since oblivious routing 
351: can usually be implemented by using limited hardware resources on each node,
352: it is important to understand whether
353: it is worth using the more complex hardware required by adaptive routing. Here, we address similar
354: questions. In the following, our discussion will be limited to the case $d=\Theta(g)$. 
355:   
356: Unfortunately, the concept of oblivious routing does not seem
357: to be useful for POPS networks.
358: Indeed, by adapting the ideas first used in~\cite{bh-JCSS85},
359: we can prove that any oblivious deterministic routing algorithm
360: needs $\Omega(\sqrt{g})$ slots to deliver correctly every permutation.
361: Moreover, by customizing and slightly adapting the approach
362: developed in~\cite{brsu-JACM97} (that makes use of  Yao's minimax principle~\cite{y-FOCS77}),
363: it is also possible
364: to show that any oblivious randomized routing algorithm must use $\Omega(\log g/\log\log g)$ slots
365: on average.
366: \begin{theorem}   \label{the:obliviousDet}
367: For any $\POPS$ network, $d=\Theta(g)$, and any oblivious deterministic routing algorithm,
368: there is a permutation for which the routing time is $\Omega(\sqrt{g})$ slots. 
369: \end{theorem}
370: \begin{proof}
371:    We essentially customize the proof in~\cite{bh-JCSS85}
372:    to POPS networks,
373:    though some minor modifications are in order
374:    to allow for passive devices and a few different assumptions.
375: 
376:    We assume $d=g$;
377:    the extension to $d=\Theta(g)$ or wider
378:    involves no further ideas, only more technical fuss.
379:    Consider the bipartite digraph $D=(V,A)$
380:    having the set $P$ of processors
381:    and the set $C$ of couplers as color classes
382:    and having as arcs in $A$ those pairs $(p,c)$
383:    such that processor $p$ can send to coupler $c$
384:    plus those pairs $(c,p)$
385:    such that processor $p$ can listen from coupler $c$.
386:    We have $|P|=n=dg=g^2$ processors and $|C|=g^2=n$ couplers,
387:    $|V|=|P|+|C| = 2n$;
388:    all nodes have in-degree and out-degree both equal to $g$.
389: 
390:    Every oblivious algorithm defines a directed $a,b$-path,
391:    denoted with $(a,b]$, for every pair $(a,b)\in P^2$,
392:    namely, the directed path of $D$
393:    followed by a packet with destination in $b$
394:    and origin in $a$.
395:    The characteristic vector $\chi_{(a,b]}$
396:    of a path $(a,b]$ is defined
397:    by regarding the path as the set of its nodes
398:    including $b$ but not $a$.   
399:    The {\em congestion} of a family $\Pi$ of directed paths
400:    is defined as $c(\Pi):=\max_{v\in V} \sum_{(a,b] \in \Pi} \chi_{(a,b]}(v)$.
401:    It is clear that the congestion of $\Pi$ gives a lower bound
402:    on the number of steps required to move a packet
403:    along each path in $\Pi$ since no processor in $P$ and no coupler in $C$
404:    can receive more than one different packet within a single slot.
405:    To prove the theorem we do the following:
406:    with reference to the path family $\{(a,b] \, | (a,b)\in P^2\}$
407:    determined by the oblivious algorithm under consideration,
408:    we show how to construct a permutation
409:    $\pi:P\to P$ such that
410:    $c(\{(a,\pi(a)] \; | a\in P\}) \geq \sqrt{g}/2$.
411:    This will imply the stated lower bound regardless
412:    of the queueing discipline,
413:    however omniscient, employed by the algorithm.
414:    For every $b\in P$,
415:    let
416:    $S_b := \{v\in V\; |
417:            \sum_{a\in P\setminus \{b\}} \chi_{(a,b]}(v) \geq \sqrt{g}/2 \}$.
418:    Clearly, every path $(a,b]$, $a\notin S_b$,
419:    must have a last node not in $S_b$.
420:    Moreover, since $b\in S_b$,
421:    the next node on the path $(a,b]$ must be in $S_b$.
422:    Let $X_b$ be the set of these last nodes
423:    when $a$ ranges in $P\setminus S_b$.
424:    By definition of $S_b$,
425:    no node in $X_b$ can be the last node outside $S_b$
426:    for more than $\sqrt{g}/2$ such paths,
427:    hence $|P\setminus S_b| \leq |X_b|(\sqrt{g}/2)$,
428:    which implies $|S_b|\geq \sqrt{g}$
429:    in case $|X_b| < g\sqrt{g}$.
430:    Moreover, $|X_b| \leq g|S_b|$ since the in-degree of the network
431:    is bounded by $g$.
432:    This implies $|S_b|\geq \sqrt{g}$
433:    in the complementary case that $|X_b| \geq g\sqrt{g}$.
434:    In conclusion, $|S_b|\geq \sqrt{g}$ holds for every $b\in P$.
435:    Therefore, by an averaging argument,
436:    there must exist a $v\in V$
437:    which belongs to at least $\frac{|P| \sqrt{g}}{|V|}=\frac{\sqrt{g}}{2}$
438:    of these sets $S_b$, $b\in P$.
439:    Let $B=\{b\in P \,| v\in S_b\}$.
440:    Let $b_1, b_2, \ldots, b_{\sqrt{g}/2}$ be distinct processors in $B$
441:    and run the following greedy algorithm where for all processors
442:    $p$ in $P$ the value $\pi(p)$ is initially undefined.
443: 
444: \begin{quote}
445:    For $i:=1$ to ${\sqrt{g}/2}$,
446:    let $a$ be any processor in $S_{b_i}$
447:    such that $\pi(a)$ is undefined and define $\pi(a):=b_i$.
448: \end{quote}
449: 
450:    Notice that such an $a$ can be found at each step $i\leq {\sqrt{g}/2}$
451:    since at step $i$ at most $i$ values of $\pi$ have been defined,
452:    while $|S_{b_i}| \geq \sqrt{g}$.
453:    Moreover, $\pi$ can be clearly extended to a full permutation,
454:    while already
455:    $c(\{(a,\pi(a)] \; | \mbox{$\pi(a)$ is defined}\}) \geq
456:    |\{a\, | \mbox{$\pi(a)$ is defined}\}| = \sqrt{g}/2$
457:    since node $v$ belongs to each path $(a,\pi(a)]$ by construction.
458: \end{proof}
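The averaging step used in the proof can be spelled out as follows.
Counting the pairs $(v,b)$ with $v\in S_b$ in two ways, and using
$|S_b|\geq\sqrt{g}$ for every $b\in P$ together with $|P|=n$ and $|V|=2n$, we get
\[
\max_{v\in V}\bigl|\{b\in P \,|\, v\in S_b\}\bigr|
\;\geq\;\frac{1}{|V|}\sum_{b\in P}|S_b|
\;\geq\;\frac{n\sqrt{g}}{2n}=\frac{\sqrt{g}}{2}.
\]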
459: 
460: \begin{theorem}   \label{the:averageInput}
461: For any $\POPS$ network, $d=\Theta(g)$, and any oblivious deterministic routing algorithm,
462: the expected routing time for a random permutation (with each permutation chosen with uniform probability) is $\Omega(\log g/\log\log g)$. 
463: \end{theorem}
464: \begin{proof}
465:    The proofs to be customized and adapted here come
466:    from~\cite{brsu-JACM97}.
467:    The customization starts again by considering
468:    the bipartite digraph $D=(V,A)$
469:    on color classes $P$ and $C$
470:    introduced in the proof of Theorem~\ref{the:obliviousDet}.
471:    Also, the various small adjustments
472:    are analogous to those detailed
473:    in the proof of Theorem~\ref{the:obliviousDet}.
474: \end{proof}
475: 
476: \begin{corollary}
477: For any $\POPS$ network, $d=\Theta(g)$, and any oblivious randomized routing algorithm,
478: there is a permutation for which the expected routing time is $\Omega(\log g/\log\log g)$. 
479: \end{corollary}
480: \begin{proof}
481:    To get this corollary of Theorem~\ref{the:averageInput},
482:    use Yao's minimax principle~\cite{y-FOCS77}
483:    in perfect analogy to what is done in~\cite{brsu-JACM97}.
484: \end{proof}
485: 
486: These complexities are not satisfactory. Indeed, in this paper we show a non-oblivious
487: deterministic algorithm
488: that runs in $O(\log g)$ slots and a non-oblivious randomized one that runs in $O(\log\log g)$ slots
489: with high probability.
490: So, by restricting to oblivious algorithms, it may be true that we get a (somewhat) simpler processor,
491: but we also lose
492: an exponential factor in running time, both with and without randomization. This is not a
493: good deal.
494: Therefore, we will not discuss oblivious routing any more, and will focus only on
495: adaptive routing.
496: 
497: Finding good lower bounds for adaptive deterministic routing is not trivial.
498: In~\cite{brsu-JACM97}, the authors explicitly say that they were not able to
499: provide any result for this case in their context. Here, we give partial answers.
500: First, we prove a tight $\Omega(\log g)$ lower bound for a special case of adaptive deterministic routing
501: that applies both to the hypercube simulation routing algorithm in~\cite{ds-IEEETPDS03}
502: and to our deterministic algorithm (which is, in this context, optimal). Second, we prove
503: a strong separation theorem between determinism and randomization. Indeed, we can show
504: both an $\Omega(\log g)$ lower bound for a class of adaptive deterministic routing algorithms,
505: and an $O(\log\log g)$ upper bound for the same class where processors are allowed to
506: generate and use randomization.
507: To the best of our knowledge, this is the first separation theorem showing a gap between
508: $\log n$ and $\log\log n$.
509: 
510: Consider our deterministic routing algorithm, proposed in the previous section. It is based
511: on a simulation of the AKS sorting network. At every slot, each processor sends its packet
512: to a pre-determined other processor, according to
513: the comparator it is going to simulate in the slot. So, the communication patterns are fixed for
514: the whole computation, and do not depend on the input permutation. We can prove
515: a lower bound for all algorithms that have the same property. More formally,
516: a routing algorithm is called \emph{rigid} if, at every slot~$t$, each processor~$i$ sends one of the
517: packets it currently stores to the set of groups~$C_{\mathrm{out}}(i,t)$, and listens to
518: group~$c_{\mathrm{in}}(i,t)$, where functions $C_{\mathrm{out}}$ and $c_{\mathrm{in}}$ depend
519: solely on $t$ and on the processor index.
520: Here, we can assume that the processors have enough local memory to store a copy of all the
521: packets they have seen so far and that
522: they choose the packet to send according to any strategy or algorithm.
523: This is enough to get the following lower bound.
524: \begin{theorem}
525: Any deterministic and rigid algorithm for online permutation routing
526: on the ${\mathrm{POPS}}(d,g)$ network, $d=\Theta(g)$, must use $\Omega(\log n)$ slots.
527: \end{theorem}
528: \begin{proof}
529: Consider a processor~$i$. Let $P(i,t)$ be the set of all packets that are potentially stored
530: by processor~$i$ at slot~$t$, according to the routing algorithm.
531: At the beginning, $P(i,0)=\{p_i\}$. During slot~$t$, processor~$i$
532: can receive at most one packet from group~$c_{\mathrm{in}}(i,t)$. Assume this packet
533: comes from processor~$j$. Index~$j$ is statically determined and is independent
534: of the initial permutation, since the algorithm is rigid.
535: So, either $P(i,t)=P(i,t-1)\cup P(j,t-1)$ or $P(i,t)=P(i,t-1)$, if no packet is sent to
536: group~$c_{\mathrm{in}}(i,t)$ (because there is no such processor~$j$, or a conflict
537: occurred). Therefore, $|P(i,t)|\le 2^t$ for all $t\ge 0$.
538: 
539: Now, assume that the algorithm stops after $t<\log n$
540: slots. Then, $|P(i,t)|<n$, and there exists $h$ such that $p_h\notin  P(i,t)$. As a consequence,
541: the routing algorithm must fail for all input permutations such that the destination of
542: $p_h$ is processor~$i$. We conclude that $t=\Omega(\log n)$.
543: \end{proof}
This bound applies both to the $O(\log^2 g)$ algorithm in~\cite{ds-IEEETPDS03}
and to our deterministic algorithm in the previous section.
546: Therefore, within the class of rigid algorithms, our proposed routing scheme is optimal.
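
As a rough illustration of the counting argument in the proof, and assuming that logarithms are taken in base~$2$, consider a hypothetical ${\mathrm{POPS}}(32,32)$ network, so that $n=dg=1024$. The sizes are chosen only as an example and do not come from the analysis above. In this case,
\begin{equation*}
|P(i,t)|\le 2^t<n=2^{10}\quad\textrm{for all $t<10$,}
\end{equation*}
so after fewer than $10$ slots every processor has potentially seen fewer than $n$ packets, and a rigid algorithm on this network needs at least $\log n=10$ slots.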
547: 
548: Now, we prove a strong separation theorem. Under restricted hypotheses, we can show that
549: randomization can give an exponential speed-up over determinism. Here, we address a class
550: of routing algorithms we call \emph{two-hops algorithms}. A two-hops algorithm has the
551: following properties: 
552: \begin{enumerate}
553: \item
554: Every processor has two buffers, an $A$-buffer and a $B$-buffer;
555: \item
556: at the beginning, the packets are stored in the $A$-buffer of each processor;
557: \item
558: at every odd slot~$2t+1$, $t=0,1,\dotsc$, every processor~$i$ with a packet in the $A$-buffer
559: sends the packet to group~$c_{\textrm{out}}(i,2t+1)$ (two-hops algorithms can only use unicast),
560: listens to incoming packets from
group~$c_{\textrm{in}}(i,2t+1)$, and stores the incoming packet (if any) in the $B$-buffer;
562: \item
at every even slot~$2t$, $t=1,\dotsc$, every processor~$i$ sends the packet in the $B$-buffer to
its destination, resets the $B$-buffer, and listens to incoming packets from coupler~$c_{\textrm{in}}(i,2t)$.
565: \end{enumerate}
566: Also, we will make the following assumptions:
567: \begin{enumerate}
568: \addtocounter{enumi}{4}
569: \item
when multiple packets use the same coupler (multiple packets from a group sent to the
same group), no packet is delivered;
\item
when a packet arrives at any processor in the destination group, it is considered to be
successfully routed, and disappears from the network (from the original $A$-buffer as well).
575: \end{enumerate}
576: The last hypothesis simplifies the job of routing all the packets to destination---we don't
577: have to take care of acks when packets reach their destination. However,
578: since we are proving a lower bound, we don't lose generality. Now, our goal is to show that
579: for every deterministic choice of functions $c_{\textrm{in}}$ and $c_{\textrm{out}}$, there exists
an input permutation such that the routing requires $\Omega(\log g)$ slots. On the other
581: hand, our randomized algorithm shows that there exists a deterministic $c_{\textrm{in}}$ and
582: a randomized $c_{\textrm{out}}$ such that all the packets are routed to destination in
583: $O(\log\log g)$ slots with high probability.
584: 
585: Consider a deterministic two-hops algorithm. Assume that the algorithm stops
586: after $T<\frac{1}{2}\min\{\log d, \log g\}$ slots, $T$ even. We will say that processor~$i$
587: \emph{shoots} on group~$a$ in the first $T$ slots if there exists an odd $t<T$
588: such that $c_{\textrm{out}}(i,t)=a$.
589: \begin{lemma}
590: There exists a group $a_0$ such that at most $dT$ processors shoot on $a_0$
591: in the first $T$ slots.
592: \end{lemma}
593: \begin{proof}
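By counting. Each of the $n=dg$ processors shoots at most once in each of the $T/2$ odd slots
before $T$, so there are at most $dgT/2$ shots overall. Averaging over the $g$ groups, some
group~$a_0$ receives at most $dT/2\le dT$ shots, and hence at most $dT$ processors shoot on it.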
595: \end{proof}
596: \begin{corollary}
597: \label{cor:separation}
598: There are at least $n-dT=dg-dT>dg/2$ processors~$i$ such that
599: processor~$i$ does not shoot on $a_0$ in the first $T$ slots.
600: \end{corollary}
601: Let $P(a_0)$ be the set of processors~$i$ such that processor~$i$ does not shoot
602: on $a_0$ in the first $T$ slots. By Corollary~\ref{cor:separation}, $|P(a_0)|>dg/2$.
603: A subset $A\subset P(a_0)$ is \emph{$\sqrt{g}$-robust} if for every $i\in A$ and
604: for every $t<T$ there are at least $\sqrt{g}$ processors~$j$ in $A$ such that
605: $c_{\textrm{out}}(i,t)= c_{\textrm{out}}(j,t)$.
606: \begin{lemma}
607: There exists a $\sqrt{g}$-robust subset $P'(a_0)\subset P(a_0)$ such that
608: $|P'(a_0)|\ge \frac{dg}{2}-Tg\sqrt{g}$.
609: \end{lemma}
610: \begin{proof}
If $P(a_0)$ is not $\sqrt{g}$-robust, then there must be a processor~$i\in P(a_0)$
and a $t<T$ such that $c_{\textrm{out}}(i,t)=c_{\textrm{out}}(j,t)$ for fewer than $\sqrt{g}$
processors~$j\in P(a_0)$. This means that all the processors~$j$ such that $c_{\textrm{out}}(i,t)=c_{\textrm{out}}(j,t)$
(including $i$) must be removed from $P(a_0)$ to get a $\sqrt{g}$-robust subset. So,
let $P_1(a_0)$ be obtained from $P(a_0)$ by removing all these processors and mark
the pair $(t,c_{\textrm{out}}(i,t))$. Start now from $P_1(a_0)$ in place of $P(a_0)$ and keep iterating.
Notice that no pair can be marked twice in the process. The number of pairs is at most
$Tg$, and each time we mark a pair we drop at most $\sqrt{g}$ processors.
Hence, at most $Tg\sqrt{g}$ processors are removed overall, and the claim follows
from $|P(a_0)|>dg/2$.
619: \end{proof}
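To get a feeling for the sizes involved, the following back-of-the-envelope computation instantiates the two lemmas above. The values $d=g=2^{16}$ are an assumption made only for illustration, with logarithms in base~$2$. Since $T<\frac{1}{2}\min\{\log d,\log g\}=8$ and $T$ is even, we have $T\le 6$, and
\begin{equation*}
|P'(a_0)|\ge \frac{dg}{2}-Tg\sqrt{g}\ge 2^{31}-6\cdot 2^{24}>2^{31}-2^{27}>0.
\end{equation*}
In this instance, fewer than $2^{27}$ of the more than $2^{31}$ processors in $P(a_0)$ are discarded when extracting the robust subset.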
620: \begin{theorem}
621: Any deterministic and two-hops algorithm for online permutation routing
622: on the ${\mathrm{POPS}}(d,g)$ network, $d=\Theta(g)$, must use $\Omega(\log n)$ slots.
623: \end{theorem}
624: \begin{proof}
625: We will show that for every processor~$i$ in $P'(a_0)$ there exists an input permutation
626: such that $p_i$ will not reach destination. The idea of the proof is as follows: we can build
627: an input permutation such that $p_i$ has to perform two hops to get to destination,
628: and that has a conflict at every even slot. Take a packet~$p_i$ such that $i\in P'(a_0)$
629: and mark the packet. Now, for $t:=T-1$ downto $1$, $t$ odd, do the following:
630: \begin{quote}
631: for every marked packet~$p_j$,
632: \begin{enumerate}
633: \item
take an unmarked packet~$p_h$ such that $c_{\textrm{out}}(h,t)=c_{\textrm{out}}(j,t)$;
635: \item
636: mark packet~$p_h$.
637: \end{enumerate}
638: \end{quote}
639: Then, set the destination of all marked packets to processors in group~$a_0$, so that
640: no marked packet can get to destination in one hop (they are chosen from $P'(a_0) \subseteq P(a_0)$).
The number
of packets that are marked in the above process exceeds neither $d$ nor $\sqrt{g}$,
643: since $T<\frac{1}{2}\min\{\log d, \log g\}$. The important property guaranteed by the above
644: process is that any packet~$p_j$ marked at time~$t$ will experience a conflict during all
645: even slots from the beginning of the routing to time~$t$. In particular, packet~$p_i$ does
646: not reach destination within $T=\Omega(\log n)$ slots.
647: \end{proof}
648: 
649: We believe that the $\Omega(\log g)$ lower bound for deterministic routing holds in
650: a much wider setting. This is described in the following two conjectures.
651: \begin{conjecture}
652: There exists a deterministic algorithm for online permutation routing
653: on the ${\mathrm{POPS}}(d,g)$ network, $d=\Theta(g)$, that is optimal and conflict-free.
654: \end{conjecture}
655: \begin{conjecture}
656: Any deterministic and conflict-free algorithm for online permutation routing
657: on the ${\mathrm{POPS}}(d,g)$ network, $d=\Theta(g)$, must use $\Omega(\log n)$ slots.
658: \end{conjecture}
659: 
660: \section{A Randomized Algorithm}
661: 
662: Here we present our randomized algorithm. In the following, we will make use
of the so-called \emph{union bound}, a simple bound on the probability of a union of events.
664: \begin{fact}[Union Bound]
665: Let $E_1,\dotsc,E_m$ be $m$ events. Then,
666: \begin{equation*}
667: \Pr\left[\bigcup_{i=1}^m E_i\right]\le \sum_{i=1}^m \Pr\left[E_i\right].
668: \end{equation*}
669: \end{fact}
We will use the function $\Delta(x):=x \bmod g$.
Moreover, we will say that some event happens \emph{with high probability}
meaning that the probability of the event is at least $1-1/g^k$ for some positive constant~$k$.
673: 
674: \subsection{The Case $d=g$}
675: 
676: Given a packet $p_i$, $i\in{\mathbb{N}}_{n}$, its \emph{temporary destination group} is
group~$\Delta(\pi(i))=\pi(i)\bmod g$.
678: Note that there are exactly $d$ packets with temporary destination group~$a$,
679: for all $a\in{\mathbb{N}}_{g}$.
The idea of the routing algorithm is as follows: each packet is first routed to
a \emph{random intermediate group}, chosen uniformly and independently at random, then to its
682: temporary destination group, and lastly to its final destination.
683: So, we iterate the following \emph{step}, composed of five slots:
684: \begin{figure*}
685: \centering\includegraphics[scale=.7]{algorithm-1}
686: \caption{Example of randomized routing in a ${\mathrm{POPS}}(3,3)$ network. Packet~$p_5$ has destination $\pi(5)=1$ in group~$0$.
Its temporary destination group is group $\pi(5)\bmod g=1$. In this step, the
688: random intermediate group chosen by packet~$p_5$ is group~$2$.}
689: \end{figure*}
690: \begin{enumerate}
691: \item
692: each processor containing a packet~$p$ to be routed chooses a random intermediate group~$r$
693: (uniformly and independently at random over ${\mathbb{N}}_{g}$)
694: and sends a copy of packet~$p$ to group~$r$;
695: \item
every copy that arrived at the random intermediate group is sent to its temporary destination group;
\item
for each copy that arrived at the temporary destination group, an ack is sent back
to the random intermediate group;
\item
for each ack that arrived at the random intermediate group, an ack is sent back to the source processor, which, in turn, deletes the original packet;
\item
every copy that arrived at its temporary destination group is sent to its destination.
704: \end{enumerate}
705: During the step, there are at most two replicas of the same packet. One is the \emph{original
706: packet}, stored in the source processor; the other is the \emph{copy}, that tries to go from
707: the source processor to a random intermediate group, then to its temporary destination
708: group, and finally to its destination. In slot~4, if the source processor receives an ack, it
can be sure that the copy will be successfully delivered in slot~5, as proved in
710: Proposition~\ref{pro:conflictless}, and can safely delete the original packet.
711: In fact, the original packet gets deleted in slot~4 if and only if, within the step, the copy
712: gets to destination in slot~5.
713: 
714: In slots~1, 2, and~5, for every group~$a$, every processor~$i$ in group~$a$ is responsible for
715: listening to coupler~$c(a, \Delta(i))$
716: for the message possibly coming from
717: group~$\Delta(i)$.
718: This way, every
719: conflict-less communication successfully completes and no packet is lost. Indeed,
720: during slots~1 and~2, in every group $a$, $a\in{\mathbb{N}}_{g}$, the processor with index~$b$
721: within the group, $b\in{\mathbb{N}}_{g}$, receives the packet that is possibly coming
722: from group~$b$. In slot~5, every processor~$\pi(i)$ that still has to receive packet~$p_{i}$
723: hopefully receives its packet from group~$\Delta(\pi(i))$, the temporary destination group of
724: packet~$p_i$.
725: Slots~3 and~4 behave differently. Indeed, each ack sent during slot~3 is received by the
726: same processor that sent the packet in slot~2. Similarly, each ack sent during slot~4 is received by the same processor that sent the packet in slot~1.
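
The slot-by-slot behaviour can be traced on the ${\mathrm{POPS}}(3,3)$ example of the figure, as a sketch under two assumptions that are not stated explicitly in this section: processor~$i$ belongs to group~$\lfloor i/d\rfloor$ with index $i \bmod d$ within the group, and $c(b,a)$ denotes the coupler carrying messages from group~$a$ to group~$b$. Packet~$p_5$ starts in group~$1$, has destination $\pi(5)=1$ in group~$0$, temporary destination group~$\Delta(1)=1$, and random intermediate group~$2$. If no conflict occurs in slots~1 and~2, the step unfolds as follows:
\begin{enumerate}
\item
processor~$5$ sends a copy through coupler~$c(2,1)$, received by processor~$2\cdot 3+1=7$;
\item
processor~$7$ forwards the copy through coupler~$c(1,2)$, received by processor~$1\cdot 3+2=5$;
\item
processor~$5$ sends an ack to processor~$7$ through coupler~$c(2,1)$;
\item
processor~$7$ sends an ack to processor~$5$ through coupler~$c(1,2)$, and the original packet is deleted;
\item
processor~$5$ sends the copy through coupler~$c(0,1)$ to processor~$\pi(5)=1$, which listens to coupler~$c(0,\Delta(1))=c(0,1)$.
\end{enumerate}
In this particular instance the temporary destination group coincides with the source group, so the copy happens to pass through processor~$5$ again in slot~2.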
727: 
728: Clearly, during slots~1 and~2, multiple conflicts on the couplers should be expected,
729: and many of the communications may not complete. For example, two packets in the same
730: group can choose the same random intermediate group during slot~1, or two packets willing
731: to go to the same temporary destination group are currently in the same random intermediate
732: group during slot~2.
733: On the contrary, slots 3, 4, and 5 do not generate any conflict,
734: as shown in the following proposition.
735: \begin{proposition}
736: \label{pro:conflictless}
737: At all steps, slots 3, 4, and 5 of the routing algorithm do not generate any conflict.
738: \end{proposition}
739: \begin{proof}
740: Consider packet~$p_i$ stored at processor~$i$ in group~$a$. Assume that, during an arbitrary
741: step, its random intermediate group is $r(i)$, chosen
742: uniformly
743: at random.
744: In the case when packet~$p_i$ survives slot~1 and arrives to its random intermediate group
745: $r(i)$, we know that coupler $c(r(i),a)$ has been used to send
746: packet~$p_i$ only, otherwise a conflict would have stopped the packet.
747: Moreover, since there is only one processor in group~$r(i)$ that is responsible
748: for receiving packet~$p_i$, namely processor~$r(i)d+a$,
749: there will be only one ack message corresponding to packet~$p_i$ to be sent in slot~4,
750: and this ack message is the only one that uses the symmetric coupler
751: $c(a,r(i))$ during slot~4.
752: In conclusion, slot~4 is conflict-free.
753: A similar argument shows that slot~3 is conflict-free as well.
754: 
Consider now slot~5. Assume that, after slot~4, packet $p_j$ has arrived at the
same temporary destination group as packet~$p_i$.
This means that $\Delta(\pi(i))=\Delta(\pi(j))$. That is,
$\pi(i)\equiv \pi(j)\pmod g$. In this case, it is not possible that $\pi(i)$ and $\pi(j)$ are
in the same group; otherwise we would have
$\pi(i)=\pi(j)$, in contrast with the fact that $\pi$ is a permutation.
Therefore, packets~$p_i$ and $p_j$ go to different groups from their temporary
destination group. In other words, slot~5 is conflict-free as well.
763: \end{proof}
764: 
765: By Proposition~\ref{pro:conflictless}, if packet~$p_i$ survives the first two slots of a step,
766: then, in the very same step, it will be routed to its destination, and an ack will be successfully
767: returned to source processor~$i$. When the ack arrives, the source processor can delete
768: the packet, since it knows it will be safely stored by the destination processor.
769: Conversely, if no ack arrives, the packet is not deleted, and the processor
770: tries again to deliver it in the next step, choosing again a possibly different
random intermediate group.
772: 
773: By the above discussion, we can safely concentrate on slots~1 and~2. 
774: A useful way to visualize the conflicts in slots~1 and~2
775: of an arbitrary step is
776: shown in Figure~\ref{fig:bipartite-1}.
777: At any given step of the routing algorithm, let $\pi$ be the
778: restriction of the input permutation to those packets that have not been successfully routed yet (during previous steps).
779: We build the \emph{graph of conflicts}, a bipartite multi-graph $G_{\pi}$ on node classes
780: $S:={\mathbb{N}}_g$ and $D:={\mathbb{N}}_g$. For every group~$a$ and for each
781: packet~$p_i$ in group~$a$ and yet to be
782: routed, we introduce an edge with one endpoint in $a\in S$ and the other endpoint in
783: the temporary destination group~$\Delta(\pi(i))\in D$.
784: During slot~1 of the step,
785: every edge (packet yet to be routed) randomly and uniformly chooses a
786: \emph{color} in ${\mathbb{N}}_g$ (the random intermediate
787: group).
Clearly, the same packet can choose different colors
789: in different steps of the routing algorithm.
790: Now we can exactly characterize the conflicts in the first two slots of the routing algorithm during
791: step~$s$.
792: Packet~$p_i$ in group~$a$ (represented by an edge from $a\in S$ to $\Delta(\pi(i))\in D$)
793: has a conflict during slot~1
794: if and only if there is another edge incident to $a\in S$ with
795: the same random color. Moreover, if we remove all edges relative to packets that have a conflict
796: in slot~1 (see Figure~\ref{fig:bipartite-2}), every remaining packet $p_i$ has a
797: conflict during slot~2
798: if and only if there is another remaining edge incident to $\Delta(\pi(i))\in D$ with the same random color.
799: Figure~\ref{fig:bipartite-3} shows which packets of Figure~\ref{fig:bipartite-1} survive both
800: slots and are hence delivered to destination by Proposition~\ref{pro:conflictless}. 
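
As a small sanity check on the construction, the edge multiset of $G_{\pi}$ can be read off the data in the caption of Figure~\ref{fig:bipartite}, assuming that packet~$p_i$ initially resides in group~$\lfloor i/d\rfloor$ (an indexing convention not restated here). Writing $(a,b)^k$ for $k$ parallel edges from $a\in S$ to $b\in D$, we get
\begin{equation*}
\begin{split}
E=\{&(0,1)^3,(0,0),(1,3)^2,(1,2)^2,\\
&(2,3)^2,(2,1),(2,0),(3,2)^2,(3,0)^2\},
\end{split}
\end{equation*}
so every node of $S$ and of $D$ has degree~$4$. As a purely hypothetical choice of colors, if $p_0$ and $p_1$ (both in group~$0$) pick the same color, both are stopped in slot~1; if $p_2$ and $p_{10}$ (edges $(0,0)$ and $(2,0)$) pick the same color and both survive slot~1, they collide in slot~2.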
801: 
802: \begin{figure*}
803: \centering
804: \subfigure[Conflict graph~$G_{\pi}$;]{\label{fig:bipartite-1}
805: \includegraphics[scale=.78]{bipartite-1}}\goodgap
806: \subfigure[conflict graph~$G_{\pi}$, where only packets surviving slot~1 are shown;]{\label{fig:bipartite-2}
807: \includegraphics[scale=.78]{bipartite-2}}\goodgap
808: \subfigure[conflict graph~$G_{\pi}$, where only packets surviving both slot~1 and slot~2 are shown.]{\label{fig:bipartite-3}
809: \includegraphics[scale=.78]{bipartite-3}}
810: \caption{Conflict graph~$G_{\pi}$, where permutation~$\pi=[1,5,8,9,3,10,11,14,15,13,0,7,2,6,12,4]$
811: (consequently, $\Delta(\pi(\cdot))=[1,1,0,1,3,2,3,2,3,1,0,3,2,2,0,0]$), in a ${\mathrm{POPS}}(4,4)$
812: network.}
813: \label{fig:bipartite}
814: \end{figure*}
815: 
816: Our first result shows that, in case the packets are ``sparse'' in the network,
817: then all the packets can be delivered in a constant number of slots with high
818: probability.
819: \begin{lemma}
820: \label{lem:phase3}
If the maximum degree of the conflict graph is at most $g^{\alpha}$
822: for some constant $\alpha<1$, then the routing algorithm delivers all the packets to destination in a
823: constant number of slots with high probability.
824: \end{lemma}
825: \begin{proof}
Since the maximum degree of the conflict graph is at most $g^{\alpha}$, in every group of the POPS network
827: there are at most $g^{\alpha}$ packets left to be routed, and every group of the POPS network is
828: the temporary destination group of at most $g^{\alpha}$ packets.
829: Let $\beta=1-\alpha$.
830: We show that
831: the probability 
832: that all packets get routed to destination
833: within $3/\beta$ steps is at least $1-c_\beta/g$,
834: where $c_\beta := 2^{3/\beta}$ is a constant
835: depending only on (the constant) $\beta$.
836: Consider a generic packet $p_i$ in group~$a$.
837: The probability that packet~$p_i$ has a conflict in one step is at most equal to the
838: probability that either one of the packets in group~$a$ or one of the packets
839: with temporary destination group~$\Delta(\pi(i))$ chooses the same random intermediate group as
840: packet~$p_i$.
841: Since at most $g^\alpha-1$ other packets are in group~$a$, and similarly at
842: most $g^\alpha-1$ have temporary destination group~$\Delta(\pi(i))$, this probability cannot be larger
843: than $2g^\alpha/g=2g^{-\beta}$.
Therefore, the probability that the packet is not routed
in any of the $3/\beta$ steps is at most
846: \begin{equation*}
847: \left(\frac{2}{g^\beta}\right)^{\frac{3}{\beta}}=\frac{2^{3/\beta}}{g^3}=\frac{c_\beta}{g^3}.
848: \end{equation*}
849: By the union bound, the probability that any of the $g^{1+\alpha}<g^2$ packets in the network
850: has not been routed in $3/\beta$ steps is at most $c_\beta/g$.
851: \end{proof}
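For concreteness, the constants in the proof can be instantiated with $\alpha=5/6$, the value used later in the proof of Theorem~\ref{thm:fondamentale}; the numbers below are only a worked instance of the bound above. Then $\beta=1/6$, the lemma uses $3/\beta=18$ steps, that is, $5\cdot 18=90$ slots, and
\begin{equation*}
c_\beta=2^{3/\beta}=2^{18}=262144,
\end{equation*}
so the failure probability is at most $2^{18}/g$, which is meaningful only when $g>2^{18}$. The bound is thus asymptotic in nature.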
852: As a matter of fact, the hard part of the job is to reduce the initial number of $g$ packets in each
853: group in such a way to get a ``sparse'' set of remaining packets.
854: We can prove that this is done quickly
855: by our randomized algorithm by
856: providing sharp bounds on the number $X$ of packets that are
857: successfully delivered in a step.
858: We define $X$ as a sum of indicator random variables $Z_i$, where
859: $Z_i$ is equal to $1$ if the $i$-th packet is delivered in this step, and $0$ otherwise.
860: It is important
861: to realize that these random variables are not independent: the event that one packet has
862: a conflict influences the probability that another packet has a conflict as well.
863: As a consequence, we cannot use the well-known Chernoff bound
864: to get sharp estimates of the value of $X$ since there does not seem to be any
way to describe the process as a sum of independent random variables.
866: So, we need a more sophisticated 
867: mathematical tool.
868: Specifically, we will see that slots~1 and~2 of one step
869: of the routing algorithm can be modeled by a set of martingales. Martingale theory is 
870: useful to get sharp bounds when the process is described in terms of not necessarily
871: independent random variables.
872: 
873: For an introduction to martingales, the reader is 
874: referred to~\cite{mr95}.
875: Also~\cite{ds65}, \cite{gs88}, \cite{as00}, and~\cite{dp04} give a description of martingale
876: theory. Here, we give a brief review of the main definitions and theorems
877: we will be using in the following.
878: \begin{definition}[\cite{mr95}]
879: Given the $\sigma$-field $(\Omega, {\mathbb{F}})$ with ${\mathbb{F}}=2^{\Omega}$, a
880: \emph{filter} is a nested sequence
881: ${\mathbb{F}}_0\subseteq {\mathbb{F}}_1\subseteq \dotsb \subseteq {\mathbb{F}}_m$ of
882: subsets of $2^{\Omega}$ such that
883: \begin{enumerate}
884: \item
885: ${\mathbb{F}}_0=\{\emptyset, \Omega\}$;
886: \item
887: ${\mathbb{F}}_m=2^{\Omega}$;
888: \item
889: for $0\le h\le m$, $(\Omega, {\mathbb{F}}_h)$ is a $\sigma$-field.
890: \end{enumerate}
891: \end{definition}
892: \begin{definition}[\cite{mr95}]
893: Let $(\Omega, {\mathbb{F}}, \PR)$ be a probability space with a filter
${\mathbb{F}}_0,\dotsc, {\mathbb{F}}_m$. Suppose that $Z_0, \dotsc, Z_m$ are random variables
such that for all $h\ge 0$, $Z_h$ is ${\mathbb{F}}_h$-measurable. The sequence $Z_0,\dotsc,Z_m$
896: is a \emph{martingale} provided that, for all $h\ge 0$,
897: \begin{equation*}
898: \E[Z_{h+1}|{\mathbb{F}}_h]=Z_h.
899: \end{equation*}
900: \end{definition}
901: The next tail bound for martingales is similar to the Chernoff bound for the sum of Poisson
902: trials.
903: \begin{theorem}[Azuma's Inequality~\cite{mr95}]
904: Let $Z_0,\dotsc,Z_m$ be a martingale such that for each $h$,
905: \begin{equation*}
906: |Z_h-Z_{h-1}|\le c_h,
907: \end{equation*}
908: where $c_h$ may depend on $h$. Then, for all $t\ge 0$ and any $\lambda>0$,
909: \begin{equation*}
910: \PR\left[ |Z_t-Z_0|\ge\lambda\right]\le 2e^{-\frac{\lambda^2}{2\sum_{k=1}^t c_k^2}}.
911: \end{equation*}
912: \end{theorem}
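In the sequel the inequality is applied to martingales whose increments are all bounded by the same constant. As a sketch of that special case (the symbol $c$ is introduced here only for illustration), taking $c_h=c$ for every $h$ and $t=m$ gives
\begin{equation*}
\PR\left[|Z_m-Z_0|\ge\lambda\right]\le 2e^{-\frac{\lambda^2}{2mc^2}},
\end{equation*}
and in particular, for $c=2$, the exponent becomes $-\lambda^2/(8m)$.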
913: \begin{theorem}
914: \label{thm:fondamentale}
915: A $\POPSg$ network can route any permutation in $O(\log\log g)$ slots
916: with high probability.
917: \end{theorem}
918: \begin{proof}
919: Let $G_{\pi}=(S,D; E)$ be the conflict graph at step~$s$ of the routing
920: algorithm, where $\pi$ is the input permutation restricted to those packets
921: that still have to be routed at the beginning of step~$s$.
922: Let $d_s$ be the maximum degree of $G_{\pi}$.
923: So, at step~$s$ there are at most $d_s$ packets left to
924: be routed in every group, and at most $d_s$ packets are willing to go to
925: the same temporary destination group.
926: Clearly, $d_1\leq d$. We will show that after
927: $O(\log\log g)$ steps the conflict graph has maximum degree at most $g^{5/6}$.
928: This is enough to prove this theorem by Lemma~\ref{lem:phase3}.
929: 
Assume we are at step~$s$. If $d_s\le g^{5/6}$, then we are done.
931: So, we can assume that $d_s> g^{5/6}$.
932: Let $S_a$, $a\in S$, be the set of indices
933: of the packets of group~$a$ that still have to be delivered at the beginning of
934: step~$s$. Similarly, let $D_b$, $b\in D$,
935: be the set of indices
936: of the packets in the whole network that still have to be delivered and that have group~$b$ as
937: temporary destination group.
938: Clearly, $|S_a|$ and $|D_b|$ are the degrees of nodes $a\in S$ and $b\in D$ in the conflict graph of step~$s$.
939: Therefore, $|S_a|\le d_s$ and $|D_b|\le d_s$ for every $a\in S$ and $b\in D$.
940: For every packet~$p_i$ still to be routed, we define the following indicator random variable,
941: \begin{equation*}
942: Z_i^1=\begin{cases} 1 & \textrm{if packet~$p_i$ survives slot~1 in step~$s$,}\\
943: 0 & \textrm{otherwise.} \end{cases}
944: \end{equation*}
945: Random
946: variable~$X_a^1=\sum_{i\in S_a} Z_i^1$ tells the number of packets from group~$a$ that
947: survive slot~1; random variable~$Y_b^1=\sum_{j\in D_b} Z_j^1$ tells the number of packets with temporary destination
948: group~$b$ that survive slot~1.
949: Moreover, let random
950: variable~$C_i$ be equal to the color chosen by packet~$p_i$ in step~$s$.
951: 
952: Clearly, we have nothing to show about the nodes in $G_{\pi}$ that have degree smaller
953: than or equal to $g^{5/6}$. So, we define sets $S^+\subseteq S$ and
954: $D^+\subseteq D$, which collect the nodes with degree
larger than $g^{5/6}$, and focus on the nodes in these sets.
956: Consider an arbitrary node $a\in S^+$.
957: The expectation of $Z_i^1$, $i\in S_a$, can be bounded as follows:
958: \begin{equation}
959: \begin{split}
960: \E[Z_i^1] = \PR[\forall \; h\in S_a\setminus \{i\}, \; C_h\neq C_i]=
961: \prod_{h\in S_a\setminus \{i\}} \PR[C_h\neq C_i]\\
962: = \left( 1-\frac{1}{g}\right)^{|S_a|-1}
963: \ge e^{-|S_a|/g}.
964: \end{split}
965: \label{eqn:lowerEXi}
966: \end{equation}
967: So, the expected number of packets in group~$a$ that survive slot~1 can be bounded accordingly,
968: \begin{equation}
969: \label{eqn:lowerX}
970: \E[X_a^1]=\E\left[\sum_{i\in S_a} Z_i^1\right]=\sum_{i\in S_a} \E[Z_i^1]\ge |S_a| e^{-|S_a|/g}.
971: \end{equation}
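
The last inequality in Equation~\ref{eqn:lowerEXi} is stated without proof. One way to verify it, sketched here under the assumptions $g\ge 2$ and $|S_a|\le g$ (the latter holds since $d=g$), uses the elementary bound $\ln(1-x)\ge -x/(1-x)$ for $0\le x<1$. With $x=1/g$ and $k=|S_a|$,
\begin{equation*}
(k-1)\ln\left(1-\frac{1}{g}\right)\ge -\frac{k-1}{g-1}\ge -\frac{k}{g},
\end{equation*}
where the second step is equivalent to $k\le g$. For example, when $|S_a|=g$, Equation~\ref{eqn:lowerX} gives $\E[X_a^1]\ge g/e\approx 0.37\,g$.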
972: 
973: In order to show that random variable $X_a^1$ is not far from its expectation with high probability,
974: we now define random variables $W_h=\E[X_a^1|{\mathbb{F}}_h]$, $h=0,\dotsc,|S_a|$,
975: where ${\mathbb{F}}_h$ is the $\sigma$-field generated by the random color chosen by
976: the first $h$ packets in $S_a$.
977: Filter ${\mathbb{F}}_h$, $h=0,\dotsc,|S_a|$, is such that $W_0, \dotsc, W_{|S_a|}$ is a martingale and that
978: $|W_{h}-W_{h-1}|\le 2$,
979: since fixing the random color chosen by the $h$-th packet in $S_a$
980: can only affect the expected value of the sum $X_a^1$ at most by two.
By Azuma's inequality, for every $\delta>0$
982: \begin{equation}
983: \begin{split}
984: \PR\left[\left|X_a^1-\E[X_a^1]\right|\ge \delta\E[X_a^1]\right]=\PR\left[\left|W_{|S_a|}-W_0\right|
985: \ge \delta\E[X_a^1]\right]\\
986: \le 2e^{-\frac{\delta^2 \E[X_a^1]^2}{2\sum(2)^2}}
987: \le 2e^{-\frac{\delta^2 |S_a|^2e^{-2d_s/g}}{8|S_a|}}\le 
988: 2e^{-\frac{\delta^2 g^{5/6}}{8e^2}}.
989: \end{split}
990: \label{eqn:azumaX}
991: \end{equation}
992: 
993: To prove a similar result for $Y_b^1$, $b\in D^+$,
994: we must recast the above general martingale arguments
995: into a more structured approach.
996: This is because $Y_b^1$ may depend on the random colors chosen by all the packets
997: in the network, and not only on those chosen by the packets in $D_b$.
998: 
999: Consider an arbitrary node $b\in D^+$.
1000: In the following analysis of the expectation and concentration
1001: of $Y_b^1$ we can clearly pretend that the random colors
are first chosen for the packets outside $D_b$ and later for the packets in $D_b$.
1003: This will not invalidate our conclusions about the whole of the $Y_b^1$'s, $b\in D^+$,
1004: since these will be derived from the solid claims about any single $Y_b^1$ by the union bound.
For every $a\in S$, we define set~$\overline{C}_{a,\overline{b}}$
1006: as ${\mathbb{N}}_g\setminus C_{a,\overline{b}}$, where $C_{a,\overline{b}}$ is the set of colors that are chosen in
1007: step~$s$ by a packet in group~$a$ that has temporary destination group different from $b$,
1008: \begin{equation*}
1009: \overline{C}_{a,\overline{b}}={\mathbb{N}}_g\setminus\left(\bigcup_{i\in S_a\setminus D_b} \left\{ C_i \right\}\right).
1010: \end{equation*}
1011: The average size of $\overline{C}_{a,\overline{b}}$ is
1012: \begin{equation*}
\E\left[\left|\overline{C}_{a,\overline{b}}\right|\right]=
1014: g\left(1-\frac{1}{g}\right)^{|S_a\setminus D_b|}.
1015: \end{equation*}
Since this is just a classical balls-and-bins problem~\cite{mr95}, we know that random variable~$|\overline{C}_{a,\overline{b}}|$
is not far from its expectation. Specifically,
\begin{equation*}
\PR\left[|\overline{C}_{a,\overline{b}}|<(1-\delta)\E[|\overline{C}_{a,\overline{b}}|]\right]\le e^{-\frac{\delta^2\E[|\overline{C}_{a,\overline{b}}|]^2}{2g}}
1020: \le e^{-\frac{\delta^2g}{2e^2}},
1021: \end{equation*}
1022: for every $\delta>0$. By the union bound over the $g$ nodes in $S$, for every $\delta>0$,
1023: we know that for every node $a\in S$
1024: \begin{equation}
1025: \label{eqn:lowerP}
1026: |\overline{C}_{a,\overline{b}}| \ge (1-\delta)g\left(1-\frac{1}{g}\right)^{|S_a\setminus D_b|}
1027: \end{equation}
1028: with probability
1029: \begin{equation}
1030: \label{eqn:azumaP}
1031: 1-ge^{-\frac{\delta^2g}{2e^2}}.
1032: \end{equation}
1033: 
1034: Under the hypothesis that Equation~\ref{eqn:lowerP} holds for every $a\in S$, we can bound the expectation of $Z_j^1$, $j\in D_b$, as follows:
1035: \begin{equation*}
1036: \E[Z_j^1]  = \PR\left[\left(\forall \; h\in D_b\cap S_{a_j}^{1}\setminus \{j\}, \; C_h\neq C_j\right) \wedge
(C_j\in \overline{C}_{a_j,\overline{b}})\right],
1038: \end{equation*}
1039: where $a_j$ is the group of packet~$p_j$. So,
1040: \begin{align*}
1041: \E[Z_j^1]  & \ge \left(1-\frac{1}{g}\right)^{|D_b\cap S_{a_j}^{1}\setminus \{j\}|}
1042: (1-\delta) \left(1-\frac{1}{g}\right)^{|S_{a_j}^{1}\setminus D_b|}\\
1043: & = (1-\delta)\left(1-\frac{1}{g}\right)^{|S_{a_j}^{1}\setminus \{j\}|}
1044: \ge (1-\delta) e^{-|S_{a_j}^{1}|/g}.
1045: \end{align*}
1046: The expectation of $Y_b^1$ can be bounded accordingly,
1047: \begin{equation}
1048: \label{eqn:lowerY}
1049: \E[Y_b^1]=\E\left[\sum_{j\in D_b} Z_j^1\right]=\sum_{j\in D_b} \E[Z_j^1]
1050: \ge (1-\delta)|D_b| e^{-|D_b|/g}.
1051: \end{equation}
1052: 
1053: In order to show that random variable $Y_b^1$ is not far from its expectation with high probability,
1054: we now define random variables $W_k=\E[Y_b^1|{\mathbb{F}}_k]$, $k=0,\dotsc,|D_b|$,
1055: where ${\mathbb{F}}_k$ is the $\sigma$-field generated by the random color
1056: chosen by the first $k$ packets in $D_b$.
1057: Filter ${\mathbb{F}}_k$, $k=0,\dotsc,|D_b|$, is such that $W_0, \dotsc, W_{|D_b|}$
1058: is a martingale and that $|W_{k}-W_{k-1}|\le 2$,
1059: since fixing the random color chosen by the $k$-th packet in $D_b$
1060: can only affect the expected value of the sum $Y_b^1$ at most by two.
By Azuma's inequality, for every $\delta>0$
1062: \begin{equation}
1063: \begin{split}
1064: \PR\left[\left|Y_b^1-\E[Y_b^1]\right|\ge \delta\E[Y_b^1]\right]=\PR\left[\left|W_{|D_b|}-W_0\right|
1065: \ge \delta\E[Y_b^1]\right]\le\\
1066: \le 2e^{-\frac{\delta^2 \E[Y_b^1]^2}{2\sum(2)^2}}
1067: \le 2e^{-\frac{\delta^2 (1-\delta)^4|D_b|^2e^{-2d_s/g}}{8|D_b|}}\le 
1068: 2e^{-\frac{\delta^2 (1-\delta)^4 g^{5/6}}{8e^2}}.
1069: \end{split}
1070: \label{eqn:azumaY}
1071: \end{equation}
1072: 
1073: Let $G_{\pi'}=(S,D; E')$ be the conflict graph at step~$s$, where
1074: $\pi'$ is the input permutation restricted to those packets that survive slot~1 in step~$s$.
1075: Hence, $E'\subseteq E$.
1076: Our goal is to bound the number of packets that survive slot~2
1077: as well, and are thus delivered to destination during this step.
1078: Let $Z_j^2$
1079: be equal to one if packet~$p_j$ survives both slots~1 and~2,
1080: and zero otherwise.
1081: Also, let $S_a^1$, $a\in S$, be the set of indices
1082: of the packets of group~$a$ that have survived slot~1. Similarly, let $D_b^1$, $b\in D$,
1083: be the set of indices
1084: of the packets in the whole network that have survived slot~1 and have group~$b$ as
1085: temporary destination group. Clearly, for every $a\in S$,
1086: $|S_a^1|$ is equal to $X_a^1$ and is the degree of node~$a$ in $G_{\pi'}$;
1087: while for every $b\in D$, $|D_b^1|$ is equal to $Y_b^1$ and is the degree of node~$b$
1088: in $G_{\pi'}$. 
1089: Random variables
1090: \begin{equation*}
X_a^2=\sum_{i\in S_a^1} Z_i^2,
1092: \end{equation*}
1093: $a\in S$, tell the number of packets in group~$a$ that
1094: are delivered during step~$s$; similarly, random variables
1095: \begin{equation*}
Y_b^2=\sum_{j\in D_b^1} Z_j^2,
1097: \end{equation*}
1098: $b\in D$, tell the number of packets willing to go to temporary destination group~$b$ that
1099: are delivered during step~$s$.
1100: 
1101: Consider an arbitrary node~$b\in D^+$.
1102: The expected value of $Y_b^2$ depends on permutation
1103: $\pi'$. Since we are computing a lower bound to $Y_b^2$, the worst case is
1104: when all packets in $D_b^1$ originate at different groups. Indeed, if two packets
1105: in $D_b^1$ belong to the same $S_a^1$, we already know that they have
1106: chosen two different colors during step~$s$, and the expectation of $Y_b^2$ is larger.
A formal proof of this intuitive claim can be
given, but it is omitted for the sake of brevity.
1109: Assuming that random variable $Y_b^1$ is
1110: not far from expectation as in Equation~\ref{eqn:azumaY},
1111: we can bound the expectation of $Y_b^2$,
1112: \begin{align}
1113: \nonumber
1114: \E[Y_b^2] & = |D_b^1|\left(1-\frac{1}{g}\right)^{|D_b^1|-1}\ge\\
1115: \nonumber & \ge (1-\delta)^2|D_b|e^{-|D_b|/g}\left(1-\frac{1}{g}\right)^{|D_b^1|-1}\ge\\
1116: & \ge (1-\delta)^2|D_b| e^{-|D_b|/g}e^{-|D_b^1|/g}\ge (1-\delta)^2|D_b| e^{-2d_s/g}.
1117: \label{eqn:lowerYtilde}
1118: \end{align}
Just as before, $Y_b^2$ is also not far from its expectation.
1120: Martingale theory can be used again to show that
1121: \begin{equation}
1122: \PR\left[\left|Y_b^2-\E[Y_b^2]\right|\ge \delta\E[Y_b^2]\right]
1123: \le 2e^{-\frac{\delta^2 \E[Y_b^2]^2}{2\sum(2)^2}}\le
1124: 2e^{-\frac{\delta^2 (1-\delta)^4 g^{5/6}}{8e^2}}.
1125: \label{eqn:azumaYtilde}
1126: \end{equation}
1127: Similarly, by using the same technique that has been used to bound random variable $Y_b^1$,
1128: for every node $a\in S^+$ we can show that
1129: \begin{align}
1130: \nonumber
1131: \E[X_a^2] & \ge (1-\delta)|S_a^1|\left(1-\frac{1}{g}\right)^{|S_a^1|-1}\ge (1-\delta)|S_a^1| e^{-|S_a^1|/g} \ge \\
1132: \nonumber & \ge (1-\delta)^2|S_a|e^{-|S_a|/g} e^{-|S_a^1|/g} \ge\\
1133: & \ge (1-\delta)^2|S_a| e^{-2d_s/g},
1134: \label{eqn:lowerXtilde}
1135: \end{align}
1136: and that $X_a^2$ is not far from its expectation
1137: \begin{equation}
1138: \PR\left[\left|X_a^2-\E[X_a^2]\right|\ge \delta\E[X_a^2]\right]
1139: \le 2e^{-\frac{\delta^2 \E[X_a^2]^2}{2\sum(2)^2}}\le
1140: 2e^{-\frac{\delta^2 (1-\delta)^4 g^{5/6}}{8e^2}}.
1141: \label{eqn:azumaXtilde}
1142: \end{equation}
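
For completeness, the elementary inequality behind the passages from $(1-1/g)^{k-1}$ to $e^{-k/g}$
in the bounds above can be checked as follows. Here $k$ is a symbol introduced only for this
illustration; it stands for $|S_a^1|$ or $|D_b^1|$, and we assume $1\le k\le g$ and $g\ge 2$.
Since $(1-1/g)^{g-1}\ge e^{-1}$,
\begin{equation*}
\left(1-\frac{1}{g}\right)^{k-1}=\left[\left(1-\frac{1}{g}\right)^{g-1}\right]^{\frac{k-1}{g-1}}
\ge e^{-\frac{k-1}{g-1}}\ge e^{-k/g},
\end{equation*}
where the last step uses $\frac{k-1}{g-1}\le\frac{k}{g}$, which is equivalent to $k\le g$.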
1143: 
1144: By Equations~\ref{eqn:lowerX}, \ref{eqn:azumaX}, \ref{eqn:lowerP}, \ref{eqn:azumaP}, \ref{eqn:lowerY}, \ref{eqn:azumaY}, \ref{eqn:lowerYtilde}, \ref{eqn:azumaYtilde}, \ref{eqn:lowerXtilde}, \ref{eqn:azumaXtilde}, and
1145: by the union bound,
1146: the number of packets successfully delivered
1147: in step~$s$ can be bounded as follows: For every $\delta>0$,
1148: \begin{align}
1149: \label{eqn:lowerXfinale}
1150: X_a^2 & \ge (1-\delta)^3|S_a| e^{-2d_s/g}\\
1151: \label{eqn:lowerYfinale}
1152: Y_b^2 & \ge (1-\delta)^3|D_b| e^{-2d_s/g}
1153: \end{align}
1154: for every $a\in S^+$ and $b\in D^+$, with probability at least
1155: \begin{equation}
1156: 1-9ge^{-\frac{\delta^2 (1-\delta)^4 g^{5/6}}{8e^2}}.
1157: \label{eqn:azumafinale}
1158: \end{equation}
1159: 
1160: Now, we divide our analysis into two phases. Phase~1 is composed of a constant
1161: number of steps and, with high probability, reduces the maximum degree of the conflict
1162: graph from $d_1$ to $gx$ or less,
where $0<x<1$ is any fixed constant.
1164: Phase~2 follows and reduces the maximum degree of the conflict graph to $g^{5/6}$ or less
in $O(\log\log g)$ steps with high probability.
1166: 
1167: Let us start from Phase~1. For every step~$s$ during Phase~1,
1168: $gx\le d_s\le g$. We show that a constant number of
1169: steps is enough to make $d_s$ fall below $gx$ with high probability.
1170: For all $a\in S^+$, let us refer to a step such that
1171: \begin{equation}
1172: X_a^2\ge\frac{|S_a| e^{-2}}{2}
1173: \end{equation}
1174: as a \emph{lucky} step for group~$a$.
By Equations~\ref{eqn:lowerXfinale} and~\ref{eqn:azumafinale}, where we fix
1176: $\delta$ such that $(1-\delta)^3=1/2$, step~$s$ is lucky for every
1177: group~$a\in S^+$ with probability at least
1178: \begin{equation*}
1179: 1-9ge^{-\alpha |S_a|}\ge 1-9ge^{-\alpha g^{5/6}},
1180: \label{eqn:azuma}
1181: \end{equation*}
1182: where $\alpha$ is a positive constant.
1183: Therefore, the number of packets that remain after step~$s$ in group~$a\in S^+$ is
1184: \begin{equation}
1185: |S_a|-X_a^2\le |S_a|-\frac{|S_a| e^{-2}}{2}
1186: \le d_s\left(1-\frac{e^{-2}}{2}\right)	
1187: \end{equation}
1188: with high probability.
Note that the same bound can be shown for the sets~$D_b$, $b\in D^+$, with
exactly the same analysis (where an analogous notion of lucky step refers to a step such that the degree of group~$b\in D^+$ reduces by at least $|D_b|e^{-2}/2$). Therefore,
1191: after
1192: \begin{equation*}
1193: y:=\left\lceil\frac{\log x}{\log (1-e^{-2}/2)}\right\rceil
1194: \end{equation*}
1195: lucky steps for all the groups the maximum degree of the conflict graph reduces
1196: to $gx$ or less.
1197: By the union bound,
1198: this happens
1199: within the very first $y$ steps
1200: with probability at least
1201: \begin{equation*}
1-9yge^{-\alpha g^{5/6}}.
1203: \end{equation*}
1204: That is, Phase~1 completes in a constant number of steps
1205: with high probability.
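
As a numerical illustration (the value $x=1/4$ is our own choice, made only for concreteness),
we have $1-e^{-2}/2\approx 0.9323$ and
\begin{equation*}
y=\left\lceil\frac{\ln (1/4)}{\ln 0.9323}\right\rceil
\approx\left\lceil\frac{-1.3863}{-0.0701}\right\rceil=\lceil 19.8\rceil=20,
\end{equation*}
so twenty lucky steps suffice to bring the maximum degree from $g$ down to $g/4$.
Indeed, $0.9323^{20}\approx 0.246\le 1/4$, while $0.9323^{19}\approx 0.264>1/4$.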
1206: 
1207: We are now at a generic step~$s$ in Phase~2.
1208: Our goal is to reduce the degree of the graph of conflicts to $g^{5/6}$.
1209: Let $\lambda_s=d_s/g$. We can assume that $g^{-1/6}\le\lambda_s<x$,
1210: and when $\lambda_s$ falls below $g^{-1/6}$ we are done.
This time, let us refer to a step during which at least
1212: $(1-\lambda_{s})|S_a| e^{-2\lambda_s}$ packets in group~$a\in S^+$ are delivered as a
1213: \emph{lucky} step for group~$a$.
By Equations~\ref{eqn:lowerXfinale} and~\ref{eqn:azumafinale}, where we take
1215: $\delta_s=\lambda_s/3$ (in such a way that $(1-\delta_s)^3\ge (1-\lambda_s)$),
1216: step~$s$ is lucky for every group~$a\in S^+$ with probability at least
1217: \begin{equation*}
1218: 1-9yge^{-\beta g^{1/2}},
1219: \end{equation*}
1220: where $\beta$ is a positive constant, 
1221: since $|S_a|\lambda_{s}^2\ge g^{5/6}(g^{-1/6})^2=g^{1/2}$.
1222: So, the number of packets that remain in group~$a\in S^+$ after step~$s$ is
1223: \begin{equation*}
1224: |S_a|-X_a^2\le |S_a|-(1-\lambda_{s})|S_a| e^{-2\lambda_s}\le
1225: d_s\left[1-(1-\lambda_{s}) e^{-2\lambda_s}\right]
1226: \end{equation*}
1227: with high probability. A similar result can be shown
1228: for any group~$b\in D$ such that $|D_b|>g^{5/6}$ with exactly the same analysis.
1229: By the union bound, at the end of step~$s$ the degree of the conflict graph is at most
1230: \begin{equation*}
1231: d_s\left[1-(1-\lambda_{s}) e^{-2\lambda_s}\right]
1232: \end{equation*}
1233: with high probability.
1234: Now, assuming a sequence of lucky steps, we can set up the following recurrence,
1235: \begin{align*}
1236: \lambda_{s+1} & \le \lambda_s \left[1-(1-\lambda_s) e^{-2\lambda_s}\right]\le
1237: \lambda_s\left[1-(1-\lambda_s)(1-2\lambda_s)\right]=\\
1238: & = \lambda_s\left[1-1+3\lambda_s-2\lambda_s^2\right]\le 3\lambda_s^2.
1239: \end{align*}
1240: Therefore,
1241: \begin{equation*}
1242: \lambda_{s}\le 3\lambda_{s-1}^2\le 3\left(3\lambda_{s-2}^2\right)^2\le \dotsb \le
1243: 3^{2^{s-y-1}}\lambda_{y+1}^{2^{s-y-1}}.
1244: \end{equation*}
1245: That is,
1246: \begin{equation*}
1247: \log_3\lambda_{s}\le \log_3\left(3^{2^{s-y-1}}\lambda_{y+1}^{2^{s-y-1}}\right)=
1248: 2^{s-y-1}\left( 1+\log_3 \lambda_{y+1} \right).
1249: \end{equation*}
1250: Since our first goal is to have $\lambda_s\le g^{-1/6}$, we should find $\bar{s}$ such that
1251: \begin{equation*}
1252: \log_3 \lambda_{\bar{s}}\le -\frac{\log_3 g}{6}.
1253: \end{equation*}
1254: We can get this by taking $\bar{s}$ such that
1255: \begin{equation*}
1256: 2^{\bar{s}-y-1}\left( 1+\log_3 \lambda_{y+1} \right)\le -\frac{\log_3 g}{6}.
1257: \end{equation*}
1258: If we choose the arbitrary constant $x$ of Phase~1 to be strictly smaller  than $1/3$, we obtain
1259: that $1+\log_3 \lambda_{y+1}$ is negative, and the above equation comes down to
1260: $\bar{s}=O(\log\log g)$.
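
To make the bound on $\bar{s}$ explicit, one can argue as follows (a sketch that uses only
$\lambda_{y+1}\le x<1/3$). Since
$1+\log_3\lambda_{y+1}\le 1+\log_3 x=-\log_3\frac{1}{3x}<0$,
the condition above is satisfied as soon as
\begin{equation*}
2^{\bar{s}-y-1}\ge\frac{\log_3 g}{6\log_3\frac{1}{3x}},
\qquad\text{that is,}\qquad
\bar{s}\ge y+1+\log_2\frac{\log_3 g}{6\log_3\frac{1}{3x}}.
\end{equation*}
For instance, with the illustrative values $x=1/4$ and $g=4096$, the right-hand side of the first
inequality is $\ln 4096/(6\ln (4/3))\approx 8.318/1.726\approx 4.82$, so
$\bar{s}-y-1=3$ lucky steps of Phase~2 are enough.
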
1261: Therefore, by the union bound over the $\bar{s}-y-1$ steps of
1262: Phase~2, the whole Phase~2 is made of lucky steps for all the groups in $S^+$ and $D^+$ with
1263: probability at least
1264: \begin{equation*}
1265: 1-9(\bar{s}-y-1)ge^{-(\alpha+\beta)g^{\frac{1}{2}}} 
1266: =1-O\left(ge^{-(\alpha+\beta)g^{\frac{1}{2}}}\log\log g
1267: \right).
1268: \end{equation*}
1269: 
We have shown that, after $\bar{s}=O(\log\log g)$ steps, the maximum degree of
1271: the conflict graph~$G_{\pi}$ is at most $g^{5/6}$ with high probability.
1272: This is enough to get the claim of our theorem
1273: by combining Phase~1 and Phase~2, and then using Lemma~\ref{lem:phase3}.
1274: \end{proof}
1275: 
1276: 
We remark that all transmissions occurring during slots~3 and~4
are just acks, that is, ``empty'' messages that carry a header but no payload.
When packets are very long, it may be more efficient to divide the 5 slots into 2 ``short'' slots
and only 3 ``long'' slots, hence profiting from the homogeneity of the operations
within the same slot in our routing algorithm.
1282: 
1283: Note an important property of our algorithm:
1284: processor~$i$ requires enough memory to store at most three packets: one is the original packet~$p_i$,
1285: the second is the
1286: packet whose destination is processor~$i$, and the third
1287: is a copy of another packet as received from group~$\Delta(i)$.
However, if we can assume that packet $p_i$ exits
the network in the slot after it reaches its destination $\pi(i)$,
1290: then the requirement on the internal capacity of processors drops
1291: to only $2$ packets.
1292: Similarly, if we can assume that the input packets are stored on an external feeding line,
1293: then the internal storage requirement drops to $1$.
1294: 
1295: 
1296: \subsection{The General Case}
1297: 
Let us start from the case when $d>g$. A natural approach to solve the problem is to perform
1299: two stages: Stage~1 routes the packets until the degree of the conflict graph
1300: is at most $g$; then Stage~2 uses the randomized algorithm described in the previous
1301: section to route the remaining packets in $O(\log\log g)$ slots. Since at most
$g$ packets can be moved without conflicts from each group in each slot, $(d-g)/g$ is a simple
lower bound on the number of slots used in the first of the two stages above.
In the following, we will show that we are within a constant factor
of this lower bound, and that we can precisely indicate this factor.
1306: 
1307: Consider a group~$a\in{\mathbb{N}}_g$.
1308: From this group, there are $d>g$ packets
1309: willing to go to destination. If we let every packet choose
1310: a random destination group and try to reach that group, when $d$ is large (it is
1311: enough that $d=\Omega(g\log g)$) every coupler will have a conflict with high probability
and no packet is delivered. Clearly, this is not what we want. So, the idea for the
first stage of the algorithm is a small modification of the randomized algorithm:
before participating in the step, every processor with a packet tosses a coin
that says `yes' with probability~$p$. Only those processors that get a `yes' are allowed to
1316: participate and send their packet.
1317: 
1318: In the first step, it is best to choose $p$ equal to $g/d$, in such a way
that $g$ packets are sent in expectation.
1320: This value maximizes the expected number of conflict-less
1321: communications, and thus the number of packets that survive slot~1 and slot~2.
1322: Later on, $p$ has to
1323: be iteratively reduced using a fixed law according to the expected reduction of the number
1324: of packets left in each group.
1325: When at most $g$ packets are left in each group with high
1326: probability, then
1327: we can set $p$ to one, and so proceed with the same algorithm we propose for the case when
1328: $d=g$.
1329: 
To understand which law is the most efficient, it is important
to understand the expected number of packets that are delivered in each step
of the algorithm. Informally speaking, our hope is that exactly $g$ packets from each
group participate in every step of the first stage of the algorithm.
1334: Under this assumption, 
1335: we know that approximately $ge^{-1}$ packets of each group will survive the first slot. At the beginning
1336: of the second slot, these packets are somewhat randomly scattered in the network (not
1337: uniformly at random, unfortunately, as we know from the previous section). If everything
1338: goes just like in the first slot, and this is far from being obvious since the destination is
1339: \emph{not random} now and the packets are \emph{not} distributed
1340: uniformly
1341: at random,
1342: we can hope that $g\exp \{-(1+e^{-1})\}$ packets from each group
1343: survive the second slot
1344: as well, and are thus safely delivered. If this is the case, $\exp \{1+e^{-1}\}((d-g)/g)$ steps
are enough to reduce the number of packets from $d$ to $g$ in expectation.
1346: The following theorem shows that, eventually, what happens is exactly
1347: what we can best hope for. Now, we proceed formally. 
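
For reference, the constants appearing in the informal argument above evaluate to
(rounded values)
\begin{equation*}
e^{-1}\approx 0.3679,\qquad
\exp\{-(1+e^{-1})\}\approx 0.2546,\qquad
\exp\{1+e^{-1}\}\approx 3.927.
\end{equation*}
In other words, in this idealized picture roughly one quarter of the $g$ packets sent by a group
in a step are delivered. As an example of our own, a network with $d=16g$ would need about
$3.927\cdot 15\approx 59$ steps to bring the number of packets per group down to $g$.
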
1348: \begin{theorem}
1349: \label{thm:generalcase}
1350: Let $c=\exp (1+e^{-1})\approx 3.927$. A $\POPS$ network can
1351: route any permutation in $5c\lceil d/g\rceil+o(d/g)+O(\log\log g)$ slots
1352: with high probability.
1353: \end{theorem}
1354: \begin{proof}
1355: The idea of the algorithm
1356: is to use $\lceil(c+\epsilon(g))(\frac{d}{g}-1)\rceil$ steps, where $\epsilon(g)=o(1)$,
1357: to reduce the maximum degree of the conflict graph to at most $g$
1358: with high probability. Since every step consists of 5 slots, we then get the claim by
1359: Theorem~\ref{thm:fondamentale}.
1360: 
1361: Every step~$s$, $s=1,\dotsc, \lceil(c+\epsilon(g))(\frac{d}{g}-1)\rceil$, is similar to the standard step
1362: of the randomized routing
1363: algorithm, with the difference that, before choosing its random color during slot~1, every packet
independently tosses a coin and participates in the step with probability
1365: \begin{equation*}
1366: \frac{g}{d-\frac{g(s-1)}{c+\epsilon(g)}}.
1367: \end{equation*}
1368: Our claim is that, at the beginning of step~$s$, $s=1,\dotsc, \lceil(c+\epsilon(g))(\frac{d}{g}-1)\rceil+1$,
1369: the degree of the conflict graph is at
1370: most $d_s:=d-\frac{g(s-1)}{c+\epsilon(g)}$ with high probability.
1371: As a consequence, when
1372: $s=\lceil(c+\epsilon(g))(\frac{d}{g}-1)\rceil+1$, we get $d_s\le g$ as desired.
1373: The claim is certainly true when
1374: $s=1$. Assume it is true at the beginning of
1375: step~$s\le\lceil(c+\epsilon(g))(\frac{d}{g}-1)\rceil$.
1376: We show that it is true at the beginning of step~$s+1$ as well.
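
As an illustration of the schedule, take $d=4g$ and assume, only to keep the arithmetic clean,
that $c+\epsilon(g)=4$. The first stage then consists of $\lceil 4(4-1)\rceil=12$ steps, and
\begin{equation*}
d_s=4g-\frac{g(s-1)}{4}=\frac{g(17-s)}{4},
\qquad
\frac{g}{d_s}=\frac{4}{17-s}.
\end{equation*}
The participation probability is thus $1/4$ at step~1, $1/3$ at step~5, $1/2$ at step~9, and
$4/5$ at step~12; at the beginning of step~13 we have $d_{13}=g$, and every packet participates
with probability one.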
1377: 
1378: Let $S_a$, $a\in S$, be the set of indices of
1379: the packets in group~$a$ that still have to be delivered at the beginning of
1380: step~$s$. Similarly, let $D_b$, $b\in D$, be the set of indices of the
1381: packets in the whole network that still have to be delivered at the beginning of step~$s$ and
1382: that have group~$b$  as temporary destination group. By hypothesis, $|S_a|\le d_s$ and
1383: $|D_b|\le d_s$ for all $a\in S$ and $b\in D$.
1384: Our first goal is to prove that at the beginning of step $s+1$ the degree of the conflict
1385: graph is at most $d_{s+1}$ with high probability.
1386: 
For every packet~$p_i$ yet to be routed, let random variable $P_i$ be equal to 1 if packet~$p_i$
participates in step~$s$, and 0 otherwise. Random variable $P_a=\sum_{i\in S_a} P_i$
counts the number of packets in group~$a$ that participate in step~$s$. The expectation of $P_a$
1390: can be computed as follows:
1391: \begin{equation*}
1392: \E[P_a]=\sum_{i\in S_a} \E[P_i]=\frac{|S_a|g}{d_s}.
1393: \end{equation*}
1394: And, clearly, $\E[P_a] \leq g$.
1395: Since random variables $P_i$ are independent, the Chernoff bound~\cite{mr95,as00}
1396: (note that in~\cite{mr95} this bound appears in a different yet stronger form) 
1397: is enough to claim that for every $\delta>0$
1398: \begin{equation*}
1399: \Pr\left[ P_a<(1-\delta)\frac{|S_a|g}{d_s}\right]\le e^{-\frac{\delta^2|S_a|g}{2d_s}}\le
1400: e^{-\frac{\delta^2d_{s+1}g}{2d_s}}\le e^{-\frac{\delta^2g}{4}}
1401: \end{equation*}
1402: and
1403: \begin{equation*}
1404: \Pr\left[ P_a>(1+\delta)g\right]\le e^{-\frac{\delta^2|S_a|g}{2d_s}}\le
1405: e^{-\frac{\delta^2d_{s+1}g}{2d_s}}\le e^{-\frac{\delta^2g}{4}}.
1406: \end{equation*}
1407: Let $S'_a$, $a\in S$, be the set of indices of
the packets in group~$a$ that participate in step~$s$.
1409: Random variable $P_a$ is thus equal to $|S'_a|$.
1410: Therefore, for every $\delta>0$
1411: \begin{equation}
(1-\delta)\frac{|S_a|g}{d_s}\le |S'_a| \le (1+\delta)g
1413: \end{equation}
1414: with probability at least $1-2e^{-\delta^2g/4}$.
1415: Since a similar result holds for every $a\in S$ and
1416: $b\in D$, we also know that for every $\delta>0$
1417: \begin{align}
1418: \label{eqn:S0lower}
& (1-\delta)\frac{|S_a|g}{d_s}\le |S'_a| \le (1+\delta)g,\\
1420: \label{eqn:D0lower}
& (1-\delta)\frac{|D_b|g}{d_s}\le |D'_b| \le (1+\delta)g,
1422: \end{align}
1423: hold for every $a\in S$ and $b\in D$, with probability at least
1424: \begin{equation}
1425: \label{eqn:azumaS0}
1426: 1-4ge^{-\delta^2g/4},
1427: \end{equation}
1428: by the union bound over the $2g$ nodes of the conflict graph.
1429: 
1430: Clearly, we have nothing to show about the nodes in the conflict graph that have degree smaller
1431: than or equal to $d_{s+1}$. So, we define sets $S^+\subseteq S$ and
1432: $D^+\subseteq D$, which collect the nodes with degree
larger than $d_{s+1}$, and focus on the nodes in these sets.
Consider an arbitrary group~$a\in S^+$, and assume that the bounds in
Equations~\ref{eqn:S0lower} and~\ref{eqn:D0lower} hold for every $a\in S$ and $b\in D$.
1436: Now, we can perform the same analysis as in the proof of Theorem~\ref{thm:fondamentale}.
1437: Similarly to Equation~\ref{eqn:lowerXtilde}, we know that
1438: \begin{equation*}
1439: \E[X_a^2]  \ge (1-\delta)|S_a^1|\left(1-\frac{1}{g}\right)^{|S_a^1|-1}\ge
1440: (1-\delta)|S_a^1| e^{-|S_a^1|/g},
1441: \end{equation*}
1442: with high probability.
In the next equation, we will use the following two facts: $xe^{-x/g}\le ye^{-y/g}$
whenever $x\le y\le g$, and $xe^{-x/g}$ attains its maximum when $x=g$.
1445: Clearly, $|S_a^1|\le g$ (there
1446: are only $g$ couplers from group~$a$).
1447: So, we get
1448: \begin{align*}
1449: \E[X_a^2] & \ge (1-\delta)|S_a^1| e^{-|S_a^1|/g}\ge\\
1450: & \ge (1-\delta)^2|S'_a| e^{-|S'_a|/g} e^{-|S'_a| e^{-|S'_a|/g}/g}\ge\\
1451: & \ge (1-\delta)^3  \frac{|S_a|g}{d_s}e^{-1}e^{-e^{-1}}.
1452: \end{align*}
1453: with high probability.
1454: By setting $\delta=g^{-1/3}$ in the above equation, with high probability we get
1455: \begin{equation*}
1456: X_a^2 \ge \frac{|S_a|}{d_s}\frac{g}{c+\epsilon(g)},
1457: \end{equation*}
1458: where $c=e^{1+e^{-1}}$ and $\epsilon(g)=o(1)$.
1459: Since $X_a^2$ is the number of packets in group~$a$ that are delivered
to destination during step~$s$, the degree of group~$a$ in the conflict graph at the
1461: beginning of step~$s+1$ is
1462: \begin{equation*}
1463: |S_a|- X_a^2 \le |S_a|- \frac{|S_a|}{d_s}\frac{g}{c+\epsilon(g)} \le d_s-\frac{g}{c+\epsilon(g)}
1464: = d_{s+1}.
1465: \end{equation*}
1466: The same result can be shown for every $a\in S^+$ and $b\in D^+$. By the union bound over the
1467: $\lceil(c+\epsilon(g))(\frac{d}{g}-1)\rceil$ steps required, and over the $2g$ nodes in the conflict graph,
1468: and by Equation~\ref{eqn:azumaS0} and a corresponding version of Equation~\ref{eqn:azumafinale},
1469: the degree of the conflict graph is reduced below $g$ with probability at least
1470: \begin{equation*}
1471: 1-\left(9ge^{-\delta^2 (1-\delta)^4 g^{5/6}/8e^2}+4ge^{-\delta^2g/4}\right).
1472: \end{equation*}
1473: Note that this is $1-o(1)$ as $g$ grows.
1474: \end{proof}
1475: 
1476: To get a feeling of the performance of our randomized algorithm, we can set
1477: $\epsilon(g)\approx 0.073$ in the proof of the above theorem, in such a way that
1478: $c+\epsilon(g)=4$. The result is claimed in the following corollary.
1479: \begin{corollary}
1480: \label{cor:general}
1481: A $\POPS$ network can route any permutation in $\frac{20d}{g}+O(\log\log g)$ slots with high probability.
1482: \end{corollary}
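
The slot count in the corollary follows from a short computation, sketched here under the choice
$c+\epsilon(g)=4$ made above. The first stage takes
\begin{equation*}
\left\lceil 4\left(\frac{d}{g}-1\right)\right\rceil\le \frac{4d}{g}-3
\end{equation*}
steps of 5 slots each, that is, at most $\frac{20d}{g}-15\le\frac{20d}{g}$ slots, and
Theorem~\ref{thm:fondamentale} accounts for the remaining $O(\log\log g)$ slots.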
1483: 
1484: \section{Experiments}
1485: \label{sect:exp}
1486: 
1487: Our results in Theorems~\ref{thm:fondamentale} and~\ref{thm:generalcase} are
1488: asymptotic. In principle, it could thus be possible that the
1489: randomized algorithm does not perform well in practice. This is not the case.
1490: Experiments show that it outperforms the algorithm
1491: in~\cite{ds-IEEETPDS03} even on networks as small as a ${\mathrm{POPS}}(2,2)$,
1492: and proves to be exponentially faster when $d$ and $g$ grow.
1493: 
1494: The algorithm in~\cite{ds-IEEETPDS03} is claimed to run in $\frac{8d}{g}\log^2 g+
1495: \frac{21d}{g}+3\log g+7$ slots. However, the authors make a small mistake when saying
1496: that Leighton's implementation of the odd-even merge sort algorithm is composed of
1497: $\log^2 n$ steps. The actual complexity is only $\frac{\log n(1+\log n)}{2}\approx 2\log^2 g$ steps.
1498: So, the running time of the routing algorithm in~\cite{ds-IEEETPDS03} is
$\frac{4d}{g}\log^2 g+\frac{2d}{g}\log g+\frac{21d}{g}+3\log g+7$ slots, which is smaller,
and this is what we will use in the following.
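
As a sanity check of this corrected formula, write $T(d,g)$ for the number of slots
(the name $T$ is introduced here only for this illustration) and take logarithms in base~2.
For the smallest and the largest networks with $d=g$ we get
\begin{align*}
T(2,2) & = 4\cdot 1+2\cdot 1+21+3\cdot 1+7=37,\\
T(4096,4096) & = 4\cdot 144+2\cdot 12+21+3\cdot 12+7=664,
\end{align*}
and, for $d=4g$ with $g=2048$,
$T(8192,2048)=16\cdot 121+8\cdot 11+84+33+7=2148$.
These values coincide with the corresponding entries of column~B in Table~\ref{tab:esperimenti}.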
1501: 
1502: To perform the experiments, we built a simulator for the POPS network. It is written in C++
1503: and simulates the network at a message level. That is, for every message in the real network,
1504: there is a message in the simulator.
1505: Processors (implemented as instances of a class \texttt{Processor}) locally take decisions about the next step to perform, and couplers (implemented as instances of a class \texttt{Coupler}) locally
1506: propagate messages or stop them in case of conflicts.
1507: 
1508: Then, we implemented our randomized algorithm in the
simulator, slot by slot. We have been conservative: no theoretical result is taken for granted and
1510: the randomized algorithm is just simulated message by message.
1511: Not surprisingly, slots~3, 4, and 5 prove to be conflict-less, supporting what is proven
1512: in Proposition~\ref{pro:conflictless}. So, whenever a copy survives slots~1 and~2
1513: it reaches its final destination,
1514: and the associated ack successfully gets to the source processor.
1515: Moreover, three buffers in every processor~$i$ (one for packet~$p_i$, one for packet~$p_{\pi^{-1}(i)}$,
1516: and the third for floating copies of other packets) are enough.
1517: 
Figure~\ref{fig:esperimento-1}
shows the average over a large number of experiments in
1520: the case when $d=g$. The number of processors $n=dg$ goes from 4 to 16,777,216. The
1521: permutation in input is chosen
1522: uniformly
1523: at random from the class of all possible permutations.
1524: It is clear, from the results shown in the figure,
1525: that our algorithm is much faster than the algorithm
1526: in~\cite{ds-IEEETPDS03} even in practice.
Actually, our algorithm outperforms its competitor for all network sizes,
hence putting aside any possible concern about the hidden constants.
1529: The performance of our algorithm is so good
1530: that it is actually hard to appreciate it from Figure~\ref{fig:esperimento-1}.
1531: Hence, Table~\ref{tab:esperimenti} shows the exact numerical
1532: results.
1533: \begin{figure*}
1534: \centering\includegraphics{esperimento-1}
1535: \caption{Performance of our randomized routing algorithm against the routing
1536: algorithm proposed in~\protect\cite{ds-IEEETPDS03}.
1537: Case when $d=g$. The number of
1538: processors goes from 4 to 16,777,216 (note that axis $x$ is in logscale).}
1539: \label{fig:esperimento-1}
1540: \end{figure*}
1541: \begin{table}
1542: \centering\begin{tabular}{|r||r|r||r|r||r|r|}
1543: \hline
1544: \multicolumn{1}{|c||}{$n$} & \multicolumn{2}{|c||}{$d=g$} &
1545: \multicolumn{2}{|c||}{$d=4g$} & \multicolumn{2}{|c|}{$d=16g$}\\
1546: \hline
1547: & \multicolumn{1}{|c|}{A} & 
1548: \multicolumn{1}{|c||}{B} & \multicolumn{1}{|c|}{A} & 
1549: \multicolumn{1}{|c||}{B} & \multicolumn{1}{|c|}{A} & 
1550: \multicolumn{1}{|c|}{B}\\
1551: \hline
1552: 4 & 14.75 & 37 & - & - & - & - \\
1553: \hline
1554: 16 & 20.90 & 54 & 71.40 & 118 & - & -\\
1555: \hline
1556: 64 & 27.35 & 79 & 82.80 & 177 & 317.90 & 442 \\
1557: \hline
1558: 256 & 30.10 & 112 & 87.15 & 268 & 322.45 & 669 \\
1559: \hline
1560: 1,024 & 32.50 & 153 & 92.60 & 391 & 343.10 & 1,024 \\
1561: \hline
1562: 4,096 & 34.50 & 202 & 94.00 & 546 & 345.60 & 1,507 \\
1563: \hline
1564: 16,384 & 35.20 & 259 & 94.95 & 733 & 339.25 & 2,118 \\
1565: \hline
1566: 65,536 & 35.55 & 324 & 95.15 & 952 & 336.45 & 2,857 \\
1567: \hline
1568: 262,144 & 36.55 & 397 & 95.35 & 1,203 & 334.30 & 3,724 \\
1569: \hline
1570: 1,048,576 & 38.25 & 478 & 95.65 & 1,486 & 333.55 & 4,719 \\
1571: \hline
1572: 4,194,304 & 39.70 & 567 & 96.25 & 1,801 & 333.05 & 5,842 \\
1573: \hline
1574: 16,777,216 & 40.05 & 664 & 97.05 & 2,148 & 333.60 & 7,093 \\
1575: \hline
1576: \end{tabular}
1577: \caption{Number of slots to route a randomly chosen permutation by our randomized algorithm (A) and by the algorithm in
1578: \protect\cite{ds-IEEETPDS03} (B).} 
1579: \label{tab:esperimenti}
1580: \end{table}
1581: 
1582: Then, we tested our algorithm on POPS networks with $d$ larger than $g$. We performed
1583: two sets of experiments, one in which $d=4g$ and another in which $d=16g$. In both cases,
1584: the number of processors goes from 4 to 16,777,216. We used
1585: the algorithm as implemented in Corollary~\ref{cor:general}. Therefore, we expect
the routing to take $20\frac{d}{g}+O(\log\log g)$ slots, according to our theoretical results.
1587: In fact, the results that are shown in Table~\ref{tab:esperimenti},
1588: Figure~\ref{fig:esperimento-2}, and Figure~\ref{fig:esperimento-3}
1589: show that the hidden constants are
1590: very small, and that
1591: our algorithm dramatically outperforms the best deterministic algorithm known in the literature for all
1592: network sizes we tested. Finally, Table~\ref{tab:scartp} shows some more details: for each
1593: experiment, we report the average number of steps, the standard deviation, and the worst case
over one hundred runs. Note that, for the larger networks, the standard deviation is extremely small (smaller than one);
therefore, the performance of our algorithm is almost always very close to expectation.
1596: \begin{figure*}
1597: \centering\includegraphics{esperimento-2}
1598: \caption{Performance of our randomized routing algorithm against the routing
1599: algorithm proposed in~\protect\cite{ds-IEEETPDS03}.
1600: Case when $d=4g$. The number of
1601: processors goes from 16 to 16,777,216 (note that axis $x$ is in logscale).}
1602: \label{fig:esperimento-2}
1603: \end{figure*}
1604: \begin{figure*}
1605: \centering\includegraphics{esperimento-3}
1606: \caption{Performance of our randomized routing algorithm against the routing
1607: algorithm proposed in~\protect\cite{ds-IEEETPDS03}.
1608: Case when $d=16g$. The number of
1609: processors goes from 64 to 16,777,216 (note that axis $x$ is in logscale).}
1610: \label{fig:esperimento-3}
1611: \end{figure*}
1612: \begin{table*}
1613: \centering\begin{tabular}{|r||r|r|c||r|r|c||r|r|c|}
1614: \hline
1615: \multicolumn{1}{|c||}{$n$} & \multicolumn{3}{|c||}{$d=g$} &
1616: \multicolumn{3}{|c||}{$d=4g$} & \multicolumn{3}{|c|}{$d=16g$}\\
1617: \hline
1618: & \multicolumn{1}{|c|}{$\mu$} & 
1619: \multicolumn{1}{|c|}{$\sigma$} & \multicolumn{1}{|c||}{max} & \multicolumn{1}{|c|}{$\mu$} & 
1620: \multicolumn{1}{|c|}{$\sigma$} & \multicolumn{1}{|c||}{max} & \multicolumn{1}{|c|}{$\mu$} & 
1621: \multicolumn{1}{|c|}{$\sigma$} & \multicolumn{1}{|c|}{max}\\
1622: \hline
1623: 4 & 3.15 & 1.94 & 12 & - & - & - & - & - & -\\
1624: \hline
1625: 16 & 4.43 & 1.03 & 8 & 14.33 & 4.22 & 35 & - & - & -\\
1626: \hline
1627: 64 & 5.39 & 0.79  & 7 & 16.13 & 2.81 & 27 & 56.88 & 4.52 & 82\\
1628: \hline
1629: 256 & 6.10 & 0.57  & 8 & 18.06 & 1.54 & 23 & 62.58 & 3.86 & 81\\
1630: \hline
1631: 1,024 & 6.50 & 0.53 & 8 & 18.45 & 0.86 & 20 & 66.26 & 5.16 & 94\\
1632: \hline
1633: 4,096 & 6.82 & 0.46 & 8 & 18.81 & 0.64 & 21 & 68.21 & 3.94 & 86\\
1634: \hline
1635: 16,384 & 7.04 & 0.20 & 8 & 18.95 & 0.46 & 20 & 67.65 & 1.76 & 73\\
1636: \hline
1637: 65,536 & 7.16 & 0.37 & 8 & 19.06 & 0.34 & 20 & 67.12 & 0.89 & 71\\
1638: \hline
1639: 262,144 & 7.30 & 0.46 & 8 & 19.09 & 0.29 & 20 & 66.88 & 0.59 & 69\\
1640: \hline
1641: 1,048,576 & 7.59 & 0.49 & 8 & 19.15 & 0.36 & 20 & 66.70 & 0.50 & 68\\
1642: \hline
1643: 4,194,304 & 7.92 & 0.27 & 8 & 19.21 & 0.41 & 20 & 66.59 & 0.49 & 67\\
1644: \hline
1645: 16,777,216 & 8.00 & 0.00 & 8 & 19.41 & 0.49 & 20 & 66.79 & 0.41 & 67\\
1646: \hline
1647: \end{tabular}
1648: \caption{Number of iterations (mean, standard deviation, and worst case over one hundred
1649: runs) to route a randomly chosen permutation by our randomized algorithm.} 
1650: \label{tab:scartp}
1651: \end{table*}
1652: 
1653: \section{Conclusion}
1654: 
1655: In this paper, we introduced the fastest algorithms for both deterministic and randomized
1656: on-line permutation routing. Indeed, we have shown that any permutation can be routed on
1657: a $\POPS$ network either with $O(\frac{d}{g}\log g)$ deterministic slots, or, with high probability,
1658: with $5c\lceil d/g\rceil+o(d/g)+O(\log\log g)$ randomized slots, where
1659: $c=\exp (1+e^{-1})\approx 3.927$. The randomized algorithm shows that the POPS network
1660: is one of the fastest permutation networks ever. This can be of practical relevance, since
1661: fast switching is one of the key technologies to deliver the ever-growing amount of bandwidth
1662: needed by modern network applications.
1663: 
1664: \section*{Acknowledgments}
1665: 
1666: We are grateful to Alessandro Panconesi for helpful suggestions.
1667: 
1668: \bibliographystyle{IEEEtran}
1669: \bibliography{r}
1670: 
1671: \end{document}
1672: