0502:cs0502093/r.tex

1: \documentclass[9pt,twocolumn,letterpaper]{IEEEtran}

2:

3: \usepackage{mathptmx}

4: \usepackage[scaled=.90]{helvet}

5: \usepackage{courier}

6:

7: \usepackage{amsmath}

8: \usepackage{amsfonts,amssymb}

9: \usepackage{graphicx}

10:

11: \usepackage{subfigure}

12:

13: \newtheorem{theorem}{Theorem}[section]

14: \newtheorem{conjecture}[theorem]{Conjecture}

15: \newtheorem{corollary}[theorem]{Corollary}

16: \newtheorem{proposition}[theorem]{Proposition}

17: \newtheorem{lemma}[theorem]{Lemma}

18: \newtheorem{definition}[theorem]{Definition}

19: \newtheorem{remark}[theorem]{Remark}

20: \newtheorem{fact}[theorem]{Fact}

21:

22: \newcommand{\goodgap}{%

23: \hspace{\subfigtopskip}%

24: \hspace{\subfigbottomskip}}

25:

26: \newcommand{\POPS}{{\mathrm{POPS}}(d,g)}

27: \newcommand{\POPSg}{{\mathrm{POPS}}(g,g)}

28: \newcommand{\PR}{{\mathbf{Pr}}}

29: \newcommand{\E}{{\mathbf{E}}}

30:

31: \begin{document}

32:

33: \title{On-Line Permutation Routing\\ in Partitioned Optical Passive Star Networks}

34:

35: \author{Alessandro Mei\thanks{Alessandro Mei is with the

36: Department of Computer Science, University of Rome ``La Sapienza'', Italy

37: (e-mail: mei@di.uniroma1.it).} and Romeo Rizzi\thanks{Romeo Rizzi is with the

38: Department of Information and Communication Technology,

39: University of Trento, Italy (e-mail: romeo.rizzi@unitn.it).}}

40:

41: \maketitle

42:

43: \begin{abstract}

44: This paper establishes the state of the art in both deterministic and randomized online

45: permutation routing in the POPS network.

46: Indeed, we show that any permutation can be routed online on a $\POPS$ network

47: either with $O(\frac{d}{g}\log g)$ deterministic slots, or, with high probability,

48: with $5c\lceil d/g\rceil+o(d/g)+O(\log\log g)$ randomized slots,

49: where constant~$c=\exp (1+e^{-1})\approx 3.927$.

50: When $d=\Theta(g)$, which we claim to be the ``interesting'' case, the

51: randomized algorithm

52: is exponentially faster than any other algorithm in the literature, both deterministic and randomized

53: ones. This is true in practice as well. Indeed, experiments show that it

54: outperforms its rivals even starting

55: from as small a network as a ${\mathrm{POPS}}(2,2)$, and the gap grows exponentially with the

56: size of the network. We can also show that, under proper hypotheses,

57: no deterministic algorithm can asymptotically match its performance.

58: \end{abstract}

59:

60: \begin{keywords}

61: Optical interconnections, partitioned optical passive star network, permutation routing.

62: \end{keywords}

63:

64:

65:

66: \section{Introduction}

67:

68: The ever-growing demand for fast interconnections in multiprocessor systems

69: has fostered great interest in optical technology. All-optical communication

70: benefits from a number of good characteristics such as no opto-electronic

71: conversion, high noise immunity, and low latency. Optical technology can

72: provide an enormous amount of bandwidth and, most probably, will

73: have an important role in the future of distributed and parallel computing

74: systems.

75:

76: The Partitioned Optical Passive Stars (POPS)

77: network~\cite{clmtg94,gmclt95,gm98,mgcl98} is a SIMD

78: parallel architecture that uses a fast optical network composed of

79: multiple Optical Passive Star (OPS) couplers.

80: A $d\times d$ OPS coupler is

81: an all-optical passive device which is capable of

82: receiving an optical signal from one of its $d$ sources and broadcasting it to

83: all of its $d$ destinations.

84: The number of processors of the network is denoted by $n$, and each processor

85: has a distinct index in $\{0,\dotsc, n-1\}$.

86: The $n$ processors are partitioned into $g$

87: groups of $d$ processors, $n=dg$, in such a way that processor $i$ belongs to

88: group~${\mathrm{group}}(i):=\lfloor i/d\rfloor$ (see Figure~\ref{fig:pops}).

89: \begin{figure}

90: \begin{center}

91: \includegraphics[scale=.7]{pops}

92: \end{center}

93: \caption{A ${\mathrm{POPS}}(3,3)$. Processors are shown as circles, while optical passive stars

94: are shown as boxes. Optical signals flow from the left to the right.

95: The processors on the left and the processors on the right are the same

96: objects shown twice for the sake of clarity.}

97: \label{fig:pops}

98: \end{figure}

99: For each pair of groups $a,b\in\{0,\dotsc,g-1\}$, a coupler~$c(b,a)$ is

100: introduced which has all the $d$ processors of group~$a$ as sources and all

101: the $d$ processors of group~$b$ as destinations.

102: During a computational step (also referred to as a \emph{slot}), each

103: processor~$i$ receives a single message from one of the $g$ couplers

104: $c({\mathrm{group}}(i), a)$, $a\in\{0,\dotsc,g-1\}$, performs some

105: local computations, and sends a single message to a subset of the $g$ couplers

106: $c(b, {\mathrm{group}}(i))$, $b\in\{0,\dotsc,g-1\}$. The couplers are

107: broadcast devices, so this message can be received by more than one processor

108: in the destination groups.

109: In agreement with the literature, in the case when multiple

110: messages are sent to the same coupler, we assume that no message is delivered.

111: This architecture is denoted by $\POPS$.
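
As a concrete illustration (the instance is ours, chosen to match the ${\mathrm{POPS}}(3,3)$ of
Figure~\ref{fig:pops}), consider processors~$1$ and~$7$. From the definition of
${\mathrm{group}}(\cdot)$ with $d=3$,
\[
{\mathrm{group}}(1)=\lfloor 1/3\rfloor=0,\qquad {\mathrm{group}}(7)=\lfloor 7/3\rfloor=2,
\]
so a message from processor~$1$ to processor~$7$ travels through coupler~$c(2,0)$, whose sources are
processors~$0,1,2$ and whose destinations are processors~$6,7,8$.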

112:

113: One of the advantages of a $\mathrm{POPS}(d,g)$ network is

114: that its diameter is one. A packet can

115: be sent from processor~$i$ to processor~$j$, $i\neq j$, in one slot

116: by using coupler $c(\mathrm{group}(j),\mathrm{group}(i))$. However,

117: its bandwidth varies according to $g$. In a $\mathrm{POPS}(n,1)$ network,

118: only one packet can be sent through the single coupler per slot.

119: On the other extreme, a $\mathrm{POPS}(1,n)$ network is a highly expensive,

120: fully interconnected optical network using $n^2$ OPS couplers.

121: A one-to-all communication pattern can also be performed in only one slot in

122: the following way: Processor~$i$ (the speaker) sends the packet to

123: all the couplers~$c(a,\mathrm{group}(i))$, $a\in\{0,\dotsc,g-1\}$;

124: during the same slot all the processors~$j$, $j\in\{0,\dotsc,n-1\}$,

125: can receive the packet through coupler

126: $c(\mathrm{group}(j),\mathrm{group}(i))$.
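
For instance (an instance of our choosing, in a ${\mathrm{POPS}}(3,3)$ network), if the speaker is
processor~$4$, then ${\mathrm{group}}(4)=\lfloor 4/3\rfloor=1$ and the packet is sent to the three couplers
\[
c(0,1),\qquad c(1,1),\qquad c(2,1),
\]
which together reach all nine processors: processors~$0,1,2$ listen to $c(0,1)$,
processors~$3,4,5$ to $c(1,1)$, and processors~$6,7,8$ to $c(2,1)$.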

127:

128: The POPS network has been shown to support a number of nontrivial

129: algorithms. Several common communication patterns are realized

130: in~\cite{gm98}. Simulation algorithms for the ring, the mesh, and the hypercube interconnection

131: networks can be found in~\cite{gm-MPPUOI95} and~\cite{s00a}. Some reliability issues

132: are analyzed in~\cite{c-LCN97}. Algorithms for data sum, prefix

133: sums, consecutive sum, adjacent sum, and several data movement operations

134: are also described in~\cite{s00a} and~\cite{ds-IEEETPDS03}. Later, both the algorithms

135: for hypercube simulation and prefix sums have been improved in~\cite{mr-HIPC03}.

136: An algorithm for matrix

137: multiplication is provided in~\cite{s00b}.

138: Moreover, \cite{bf96} shows that POPS networks can be modeled by directed

139: and complete stack graphs with loops, and uses this formalization to

140: obtain optimal embeddings of rings and de Bruijn graphs into POPS

141: networks.

142:

143: In~\cite{ds-IEEETPDS03}, Datta and Soundaralakshmi claim that in most practical

144: $\POPS$ networks it is likely that $d>g$. We believe that they are only partly

145: right. While it is true that

146: systems with $d\ll g$ are too expensive, it is also true that systems with $d\gg g$ give

147: too low parallelism to be worth building. We illustrate our point with an example.

148: Consider the problem of summing $16n$ data values on a $\POPS$ network,

149: $d=g=\sqrt{n}$. This network has $n$ processors. Therefore, the algorithm can work as follows:

150: we input 16 data values per processor, let each processor sum up its 16 data values, and

151: finally we use the algorithm in~\cite{ds-IEEETPDS03} to get the overall sum. This algorithm

152: requires 16 steps to input the data values and compute the local sums, plus

153: $2\log\sqrt{n}=\log n$ slots for computing the final result. A total of $16+\log n$ slots.

154: With the idea of upgrading our system,

155: we buy additional $15n$ processors and build a $16n$ processor ${\mathrm{POPS}}(d',g')$ network

156: with $d'=16d=16\sqrt{n}$ and $g'=g=\sqrt{n}$.

157: Now, we can use just one step to input the data values, one per processor, and then

158: use the same algorithm in~\cite{ds-IEEETPDS03} to get the overall sum. Unfortunately,

159: this algorithm still requires $16+\log n$ slots, even though we are solving a problem of the

160: same size using a system with 16 times more processors!

161:

162: The problem is not with the data sum algorithm in~\cite{ds-IEEETPDS03}. Essentially the same thing

163: happens with the prefix sums algorithm in~\cite{ds-IEEETPDS03}, the simulations in~\cite{s00a},

164: and all the other algorithms in the literature for the POPS network we know of, including the ones

165: presented in this paper. The point is that a $\POPS$

166: network can exchange $g^2$ messages at most in a slot. This is an unavoidable bottleneck

167: for networks where $d$ is much larger than $g$, resulting in the poor parallelism of

168: these systems.

169: Also, experience says that the case $d=g$ is the most interesting from a

170: ``mathematical'' point of view. In the past literature, the case $d>g$ and symmetrically the case $d<g$

171: are always dealt with by reducing them to the case $d=g$, which usually contains the

172: ``core'' of the problem in its purest form. This work is not an exception to this empirical

173: yet general rule.

174: So, it is probably more reasonable to assume that practical POPS networks

175: will have $d=\Theta(g)$, that is $d/g$,

176: and similarly $g/d$, bounded by a constant.
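
The bottleneck mentioned above can be quantified by a simple counting argument, sketched here under
two assumptions of ours: each message carries a single packet, and no processor is the destination of
its own packet. A permutation then moves $n=dg$ packets, each of which must cross at least one coupler,
while the $g^2$ couplers deliver at most one message each per slot. Hence the number $T$ of slots satisfies
\[
T \;\ge\; \frac{dg}{g^2} \;=\; \frac{d}{g}.
\]
For example, on a network with $d'=16\sqrt{n}$ and $g'=\sqrt{n}$ we get $d'/g'=16$, so routing any such
permutation takes at least 16 slots, no matter how many processors are available.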

177:

178: In any case, finding good algorithms for the case $d\neq g$, both $d<g$ and

179: $d>g$, is of absolute

180: importance, since it is not yet clear what the optimal tradeoff between $d$, $g$, and the cost

181: of the network is. Furthermore, an optimal tradeoff may not exist in general,

182: since it probably depends on the specific problem being solved.

183: Moreover, such algorithms are often nontrivial, as, for example,

184: in~\cite{ds-IEEETPDS03}. Therefore, we partly accept the claim in~\cite{ds-IEEETPDS03}

185: that the number of groups cannot substantially exceed the number of processors per

186: group. So, throughout the whole paper, we will discuss our asymptotical results assuming

187: that $g$ grows and that $d=\Omega(g)$. Nonetheless, we will

188: keep in mind that the ``important'' case is likely to be $d=\Theta(g)$.

189:

190: Here, we consider the \emph{permutation routing problem}: Each of the $n$ processors of the POPS

191: network has a packet that is to be sent to another node, and each processor is the destination

192: of exactly one packet. This is a fundamental problem in parallel computing and

193: interconnection networks, and the literature on this topic is vast. As an excellent starting point,

194: the reader can see~\cite{l92}. On the POPS network, this problem has been studied in two

195: different versions: the \emph{offline} and the \emph{online} permutation routing problem.

196: In the former, the permutation to be routed is globally known in the network. Therefore,

197: every processor can pre-compute the route for its packet taking advantage of this information.

198: This version of the problem has been implicitly studied, for particular permutations, in

199: all the simulation algorithms we reviewed above. Later, most of these results

200: have been unified by proving that any permutation can optimally be routed off-line

201: in one slot, when $d=1$, and $2\lceil d/g\rceil$ slots, when $d>1$~\cite{mr-JPDC03}.

202:

203: In the online version, every processor knows only the destination of the packet it stores.

204: This problem has been attacked in~\cite{ds-IEEETPDS03}. The solution

205: iteratively makes use of a sub-routine that sorts $g^2$ items in ${\mathrm{POPS}}(g,g)$

206: subnetworks of the larger $\POPS$ network. The sub-routine is built by hypercube simulation

207: starting from either Cypher and Plaxton's $O(\log n\log\log n)$ sorting algorithm for the

208: $n$-processor hypercube or from Leighton's implementation~\cite{l92} on the

209: $n$-processor hypercube of Batcher's odd-even merge sort algorithm~\cite{b-AFIPS68}.

210: In the first case, Datta and Soundaralakshmi get the asymptotically fastest algorithm for

211: routing in the POPS network, running in $O(\frac{d}{g}\log g\log\log g)$ slots. In the second,

212: they get an algorithm that turns out to be the fastest in practice, running in

213: $\frac{8d}{g}\log^2 g+\frac{21d}{g}+3\log g+7$ slots. Recently, and independently

214: of this work, Rajasekaran and Davila have presented a randomized algorithm for online

215: permutation routing that runs in $O(\frac{d}{g}+\log g)$ slots~\cite{rd-ICPADS04}.
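
To give a feeling for the constants involved, the second of the two bounds
from~\cite{ds-IEEETPDS03} can be evaluated for $d=g$. We assume base-2 logarithms here, which is our
reading and not stated explicitly above:
\[
8\log^2 g+21+3\log g+7=
\begin{cases}
8\cdot 16+21+12+7=168, & g=16,\\
8\cdot 100+21+30+7=858, & g=1024.
\end{cases}
\]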

216:

217: Our contribution is both theoretical and practical.

218: We show that any permutation can be routed on a $\POPS$ network

219: either with $O(\frac{d}{g}\log g)$ deterministic slots, or, with high probability,

220: with $5c\lceil d/g\rceil+o(d/g)+O(\log\log g)$ randomized slots,

221: where constant~$c=\exp (1+e^{-1})\approx 3.927$. The deterministic algorithm

222: is based on a direct simulation of the AKS network, and it is the first that requires

223: only $O(\frac{d}{g}\log g)$ slots.

224: When $d=\Theta(g)$, which we claim to be the ``interesting'' case, the

225: randomized algorithm

226: is exponentially faster than any other algorithm in the literature, both deterministic and randomized

227: ones. This is true in practice as well. Indeed, our experiments show that it

228: outperforms its rivals even starting

229: from as small a network as a ${\mathrm{POPS}}(2,2)$, and the gap grows exponentially with the

230: size of the network. We can also show that, under proper hypotheses,

231: no deterministic algorithm can asymptotically match its performance.
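
The constant in the randomized bound can be checked numerically:
\[
c=\exp(1+e^{-1})\approx\exp(1.3679)\approx 3.927,\qquad 5c\approx 19.64 .
\]
So, for $d=g$, the leading term $5c\lceil d/g\rceil$ amounts to roughly 20 slots, to which the
lower-order terms $o(d/g)+O(\log\log g)$ must be added.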

232:

233: This paper also presents a strong separation theorem between determinism and randomization.

234: We build a meaningful and natural problem inspired by permutation routing in the POPS

235: network such that there exists an $O(\log\log g)$-slot randomized solution, and such that

236: no deterministic solution can use fewer than $\Omega(\log g)$ slots, which is exponentially slower.

237: To the best of our knowledge, this is the first strong separation result from $\log g$ to

238: $\log\log g$, and, quite interestingly, it does not make use of the notion of oblivious routing,

239: which we show to be essentially off target in the context of routing in the POPS network.

240:

241: \section{A Deterministic Algorithm}

242:

243: Let ${\mathbb{N}}_m:=\{0,1,\dotsc,m-1\}$ denote the set of the first $m$ natural numbers.

244: In the \emph{on-line permutation routing problem} we are given $n$ packets, one per

245: processor. Packet $p_i$, $i\in{\mathbb{N}}_{n}$, originates at processor~$i$, the \emph{source

246: processor}, and has

247: processor~$\pi(i)$ as \emph{destination}, where $\pi$ is a permutation of ${\mathbb{N}}_n$.

248:

249: The problem is to route all the

250: packets to destination with as few slots as possible.

251: Crucially, permutation $\pi$ is not known in advance---at the beginning of

252: the computation, each processor knows only the destination of the packet it stores.

253:

254: \subsection{The Upper Bound}

255:

256: So far, the best deterministic algorithm for online permutation routing on the $\POPS$

257: network is presented in~\cite{ds-IEEETPDS03}. The algorithm runs in $O(\frac{d}{g}\log^2 g)$ slots.

258: The computational bottleneck is

259: an $O(\log^2 g)$ sorting sub-routine that sorts $g^2$ data values $\lceil d/g\rceil$ times, each

260: on one of the $\lceil d/g\rceil$ ${\mathrm{POPS}}(g,g)$

261: sub-networks into which the larger $\POPS$ network is partitioned. The idea in~\cite{ds-IEEETPDS03}

262: is to make each ${\mathrm{POPS}}(g,g)$

263: network simulate Leighton's $O(\log^2 n)$ sorting algorithm for the $n$-processor hypercube~\cite{l92},

264: that is, in turn, an implementation of Batcher's odd-even merge sort. This is

265: carried out by using a general result due to Sahni~\cite{s00a},

266: showing that every move of a \emph{normal}

267: algorithm for the hypercube (where only one dimension is used for communication at each

268: step) can be simulated with $2\lceil d/g\rceil$ slots on a POPS network of the same size. Since

269: Leighton's algorithm is normal, and since

270: the sub-routine is always used on ${\mathrm{POPS}}(g,g)$ sub-networks, we get a constant

271: factor slow-down.

272:

273: The algorithm in~\cite{ds-IEEETPDS03} is fairly good in practice, since hidden constants are small.

274: However, we are interested in the best asymptotical result. So, as suggested

275: in~\cite{ds-IEEETPDS03}, we can replace the Leighton implementation of Batcher's odd-even

276: merge sort with Cypher and Plaxton's sorting algorithm for the hypercube,

277: which is asymptotically faster (though slower for networks of practical size),

278: since it runs in $O(\log n\log\log n)$ time~\cite{cp-TR90}. This yields

279: an $O(\frac{d}{g}\log g\log\log g)$-slot algorithm for permutation routing on the POPS network,

280: which is a good improvement.

281: Nonetheless, here we do even better. Our simple key idea

282: is to simulate a fast sorting network directly on the POPS, instead of going through

283: hypercube simulation. By giving an improved $O(\log g)$ upper bound for sorting on the POPS network,

284: we also get an asymptotically faster algorithm for online permutation routing.

285:

286: A \emph{comparator} $[i:j]$, $i,j\in{\mathbb{N}}_n$, sorts the $i$-th and $j$-th elements of a data

287: sequence into non-decreasing order. A \emph{comparator stage} is a composition of comparators

288: $[i_1:j_1]\circ\dotsb\circ [i_k:j_k]$ such that all $i_r$ and $j_s$ are distinct, and a

289: \emph{sorting network} is a sequence of comparator stages such that any input sequence

290: of $n$ data elements is sorted into non-decreasing order.

291: An introduction to sorting networks can be found

292: in~\cite{clr92}. Crucially, we can show that a $\POPS$ network can efficiently simulate

293: any comparator stage.

294: \begin{theorem}[\cite{mr-JPDC03}]

295: \label{thm:permutationrouting}

296: A $\POPS$ network can route off-line any permutation among the $n=dg$

297: processors using one slot when $d=1$ and $2\lceil d/g\rceil$ slots when $d>1$.

298: \end{theorem}

299: \begin{lemma}

300: \label{lem:comparatorstage}

301: A  $\POPS$ network, $n=dg$, can simulate a comparator stage in one slot, when $d=1$,

302: and in $2\lceil d/g\rceil$ slots, when $d>1$.

303: \end{lemma}

304: \begin{proof}

305: Let $[i_1:j_1]\circ\dotsb\circ [i_k:j_k]$ be a comparator stage. We define a function $\pi$ such

306: that $\pi(i_r)=j_r$ and $\pi(j_r)=i_r$ for all $r$.

307: Since all $i_r$ are distinct, and so are all $j_s$,

308: $\pi$ can arbitrarily

309: be extended in such a way to be a permutation. By

310: Theorem~\ref{thm:permutationrouting}, $\pi$ can be routed in one slot when $d=1$, and

311: $2\lceil d/g\rceil$ slots when $d>1$.

312: During this routing,

313: for every $r$, processor~$i_r$ sends

314: its data value to processor~$j_r$ and vice versa. Then, processor~$i_r$ discards the maximum

315: of the two data values, while processor~$j_r$ discards the minimum.

316: \end{proof}
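
As a small illustration of the lemma (the instance is ours), take $n=4$ and the comparator stage
$[0:2]\circ[1:3]$. The construction in the proof gives
\[
\pi(0)=2,\qquad \pi(2)=0,\qquad \pi(1)=3,\qquad \pi(3)=1,
\]
which is already a permutation of ${\mathbb{N}}_4$. If the processors hold the values $(5,1,3,4)$, then
after routing $\pi$ processor~$0$ keeps $\min\{5,3\}=3$, processor~$2$ keeps $\max\{5,3\}=5$,
processor~$1$ keeps $\min\{1,4\}=1$, and processor~$3$ keeps $\max\{1,4\}=4$, yielding $(3,1,5,4)$.
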

317: In~\cite{aks-FOCS83}, the AKS sorting network is presented. This network is able to sort any

318: data sequence with only $O(\log n)$ comparator stages, which is optimal.

319: By simulating the AKS network

320: on a POPS network using Lemma~\ref{lem:comparatorstage}, we easily get the following theorem.

321: \begin{theorem}

322: \label{thm:deterministico}

323: A $\POPSg$ network can sort $g^2$ data values in $O(\log g)$ slots.

324: \end{theorem}

325: The above result is the key to improving on the best deterministic algorithm for online permutation

326: routing in the literature.

327: \begin{corollary}

328: \label{cor:deterministico}

329: A $\POPS$ network can route on-line any permutation in $O(\frac{d}{g}\log g)$ slots.

330: \end{corollary}

331: \begin{proof}

332: To get the claim, it is enough to plug the sorting algorithm of Theorem~\ref{thm:deterministico}

333: into Stage~1 of the deterministic routing algorithm proposed in~\cite{ds-IEEETPDS03}.

334: \end{proof}
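
The slot counts behind Theorem~\ref{thm:deterministico} and Corollary~\ref{cor:deterministico} can be
traced as follows. The constant $K$ is a name we introduce for the (large) constant hidden in the depth
of the AKS network, and we assume, as stated earlier in this section, that the sorting sub-routine is the
bottleneck and is invoked $\lceil d/g\rceil$ times. On a $\POPSg$ network, $g>1$, sorting $n=g^2$ values
takes at most $K\log n$ comparator stages, each costing $2\lceil g/g\rceil=2$ slots by
Lemma~\ref{lem:comparatorstage}, so that
\[
2\cdot K\log (g^2)=4K\log g=O(\log g),
\qquad
\left\lceil \frac{d}{g}\right\rceil\cdot O(\log g)=O\!\left(\frac{d}{g}\log g\right),
\]
where the last equality uses $d=\Omega(g)$, and hence $\lceil d/g\rceil\le 2d/g$ once $d\ge g$.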

335:

336: This algorithm is not very practical. Indeed, it is based on the AKS network

337: that, in spite of being optimal, is not efficient when $n$ is small due to very large hidden

338: constants. However, the result is important from a theoretical point of view because of two facts:

339: it establishes that,

340: in principle, $O(\frac{d}{g}\log g)$ slots are enough to solve deterministically the online permutation

341: routing problem; and, when $d=O(g)$ and under proper hypotheses, it matches one of the lower

342: bounds for deterministic algorithms in the next section.

343:

344: \subsection{A Few Lower Bounds}

345:

346: Borodin et al.~\cite{brsu-JACM97} study the extent to which both complex hardware and

347: randomization can speed up routing in interconnection networks.

348: One of the questions they address is how \emph{oblivious routing algorithms}

349: (in which the possible paths followed by a packet depend only on its own source and destination)

350: compare with \emph{adaptive routing algorithms}. Since oblivious routing

351: can usually be implemented by using limited hardware resources on each node,

352: it is important to understand whether

353: it is worth using the more complex hardware required by adaptive routing. Here, we address similar

354: questions. In the following, our discussion will be limited to the case $d=\Theta(g)$.

355:

356: Unfortunately, the concept of oblivious routing does not seem

357: to be useful for POPS networks.

358: Indeed, by adapting the ideas first used in~\cite{bh-JCSS85},

359: we can prove that any oblivious deterministic routing algorithm

360: needs $\Omega(\sqrt{g})$ slots to deliver correctly every permutation.

361: Moreover, by customizing and slightly adapting the approach

362: developed in~\cite{brsu-JACM97} (that makes use of  Yao's minimax principle~\cite{y-FOCS77}),

363: it is also possible

364: to show that any oblivious randomized routing algorithm must use $\Omega(\log g/\log\log g)$ slots on

365: the average.

366: \begin{theorem}   \label{the:obliviousDet}

367: For any $\POPS$ network, $d=\Theta(g)$, and any oblivious deterministic routing algorithm,

368: there is a permutation for which the routing time is $\Omega(\sqrt{g})$ slots.

369: \end{theorem}

370: \begin{proof}

371:    We essentially customize the proof in~\cite{bh-JCSS85}

372:    to POPS networks,

373:    although some minor modifications are in order

374:    to allow for passive devices and a few different assumptions.

375:

376:    We assume $d=g$,

377:    the extension to $d=\Theta(g)$ or wider

378:    involving no further ideas, only more technical fuss.

379:    Consider the bipartite digraph $D=(V,A)$

380:    having the set $P$ of processors

381:    and the set $C$ of couplers as color classes

382:    and having as arcs in $A$ those pairs $(p,c)$

383:    such that processor $p$ can send to coupler $c$

384:    plus those pairs $(c,p)$

385:    such that processor $p$ can listen from coupler $c$.

386:    We have $|P|=n=dg=g^2$ processors and $|C|=g^2=n$ couplers,

387:    $|V|=|P|+|C| = 2n$;

388:    all nodes have in-degree and out-degree both equal to $g$.

389:

390:    Every oblivious algorithm defines a directed $a,b$-path,

391:    denoted with $(a,b]$, for every pair $(a,b)\in P^2$,

392:    namely, the directed path of $D$

393:    followed by a packet with destination in $b$

394:    and origin in $a$.

395:    The characteristic vector $\chi_{(a,b]}$

396:    of a path $(a,b]$ is defined

397:    by regarding the path as the set of its nodes

398:    including $b$ but not $a$.

399:    The {\em congestion} of a family $\Pi$ of directed paths

400:    is defined as $c(\Pi):=\max_{v\in V} \sum_{(a,b] \in \Pi} \chi_{(a,b]}(v)$.

401:    It is clear that the congestion of $\Pi$ gives a lower bound

402:    on the number of steps required to move a packet

403:    along each path in $\Pi$ since no processor in $P$ and no coupler in $C$

404:    can receive more than one different packet within a single slot.

405:    To prove the theorem we do the following:

406:    with reference to the path family $\{(a,b] \, | (a,b)\in P^2\}$

407:    determined by the oblivious algorithm under consideration,

408:    we show how to construct a permutation

409:    $\pi:P\to P$ such that

410:    $c(\{(a,\pi(a)] \; | a\in P\}) \geq \sqrt{g}/2$.

411:    This will imply the stated lower bound regardless

412:    of the queueing discipline,

413:    however omniscient, employed by the algorithm.

414:    For every $b\in P$,

415:    let

416:    $S_b := \{v\in V\; |

417:            \sum_{a\in P\setminus \{b\}} \chi_{(a,b]}(v) \geq \sqrt{g}/2 \}$.

418:    Clearly, every path $(a,b]$, $a\notin S_b$,

419:    must have a last node not in $S_b$.

420:    Moreover, since $b\in S_b$,

421:    the next node on the path $(a,b]$ must be in $S_b$.

422:    Let $X_b$ be the set of these last nodes

423:    when $a$ ranges in $P\setminus S_b$.

424:    By definition of $S_b$,

425:    no node in $X_b$ can be the last node outside $S_b$

426:    for more than $\sqrt{g}/2$ such paths,

427:    hence $|P\setminus S_b| \leq |X_b|(\sqrt{g}/2)$,

428:    which implies $|S_b|\geq \sqrt{g}$

429:    in case $|X_b| < g\sqrt{g}$.

430:    Moreover, $|X_b| \leq g|S_b|$ since the in-degree of the network

431:    is bounded by $g$.

432:    This implies $|S_b|\geq \sqrt{g}$

433:    in the complementary case that $|X_b| \geq g\sqrt{g}$.

434:    In conclusion, $|S_b|\geq \sqrt{g}$ holds for every $b\in P$.

435:    Therefore, by an averaging argument,

436:    there must exist a $v\in V$

437:    which belongs to at least $\frac{|P| \sqrt{g}}{|V|}=\frac{\sqrt{g}}{2}$

438:    of these sets $S_b$, $b\in P$.

439:    Let $B=\{b\in P \,| v\in S_b\}$.

440:    Let $b_1, b_2, \ldots, b_{\sqrt{g}/2}$ be distinct processors in $B$

441:    and run the following greedy algorithm where for all processors

442:    $p$ in $P$ the value $\pi(p)$ is initially undefined.

443:

444: \begin{quote}

445:    For $i:=1$ to ${\sqrt{g}/2}$,

446:    let $a$ be any processor in $S_{b_i}$

447:    such that $\pi(a)$ is undefined and define $\pi(a):=b_i$.

448: \end{quote}

449:

450:    Notice that such an $a$ can be found at each step $i\leq {\sqrt{g}/2}$

451:    since at step $i$ at most $i$ values of $\pi$ have been defined,

452:    while $|S_{b_i}| \geq \sqrt{g}$.

453:    Moreover, $\pi$ can be clearly extended to a full permutation,

454:    while already

455:    $c(\{(a,\pi(a)] \; | \mbox{$\pi(a)$ is defined}\}) \geq

456:    |\{a\, | \mbox{$\pi(a)$ is defined}\}| = \sqrt{g}/2$

457:    since node $v$ belongs to each path $(a,\pi(a)]$ by construction.

458: \end{proof}
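
The averaging step used in the proof of Theorem~\ref{the:obliviousDet} can be spelled out as follows
(for the case $d=g$ treated there). Counting the pairs $(v,b)$ with $v\in S_b$ in two ways,
\[
\sum_{v\in V}\bigl|\{b\in P \,|\, v\in S_b\}\bigr| \;=\; \sum_{b\in P}|S_b| \;\ge\; |P|\sqrt{g},
\]
so some node $v\in V$ belongs to at least
$|P|\sqrt{g}/|V| = g^2\sqrt{g}/(2g^2)=\sqrt{g}/2$ of the sets $S_b$.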

459:

460: \begin{theorem}   \label{the:averageInput}

461: For any $\POPS$ network, $d=\Theta(g)$, and any oblivious deterministic routing algorithm,

462: the expected routing time for a random permutation (with each permutation chosen with uniform probability) is $\Omega(\log g/\log\log g)$.

463: \end{theorem}

464: \begin{proof}

465:    The proofs to be customized and adapted here come

466:    from~\cite{brsu-JACM97}.

467:    The customization starts again by considering

468:    the bipartite digraph $D=(V,A)$

469:    on color classes $P$ and $C$

470:    introduced in the proof of Theorem~\ref{the:obliviousDet}.

471:    Also the various small adjustments

472:    are in analogy with those detailed

473:    in the proof of Theorem~\ref{the:obliviousDet}.

474: \end{proof}

475:

476: \begin{corollary}

477: For any $\POPS$ network, $d=\Theta(g)$, and any oblivious randomized routing algorithm,

478: there is a permutation for which the expected routing time is $\Omega(\log g/\log\log g)$.

479: \end{corollary}

480: \begin{proof}

481:    To get this corollary of Theorem~\ref{the:averageInput},

482:    use Yao's minimax principle~\cite{y-FOCS77}

483:    in perfect analogy to what is done in~\cite{brsu-JACM97}.

484: \end{proof}

485:

486: These complexities are not satisfactory. Indeed, in this paper we show a non-oblivious

487: deterministic algorithm

488: that runs in $O(\log g)$ slots and a non-oblivious randomized one that runs in $O(\log\log g)$ slots

489: with high probability.

490: So, by restricting to oblivious algorithms, it may be true that we get a (somewhat) simpler processor,

491: but we also lose

492: an exponential factor in running time, both with and without randomization. This is not a

493: good deal.

494: Therefore, we will not discuss oblivious routing any more, and will focus only on

495: adaptive routing.

496:

497: Finding good lower bounds for adaptive deterministic routing is not trivial.

498: In~\cite{brsu-JACM97}, the authors explicitly say that they were not able to

499: provide any result for this case in their context. Here, we give partial answers.

500: First, we prove a tight $\Omega(\log g)$ lower bound for a special case of adaptive deterministic routing

501: that applies both to the hypercube simulation routing algorithm in~\cite{ds-IEEETPDS03}

502: and to our deterministic algorithm (that is, in this context, optimal). Second, we prove

503: a strong separation theorem between determinism and randomization. Indeed, we can show

504: both a $\Omega(\log g)$ lower bound for a class of adaptive deterministic routing algorithms,

505: and a $O(\log\log g)$ upper bound for the same class where processors are allowed to

506: generate and use randomization.

507: To the best of our knowledge, this is the first separation theorem showing a gap between

508: $\log n$ and $\log\log n$.

509:

510: Consider our deterministic routing algorithm, proposed in the previous section. It is based

511: on a simulation of the AKS sorting network. At every slot, each processor sends its packet

512: to a pre-determined other processor, according to

513: the comparator it is going to simulate in the slot. So, the communication patterns are fixed for

514: the whole computation, and do not depend on the input permutation. We can prove

515: a lower bound for all algorithms that have the same property. More formally,

516: a routing algorithm is called \emph{rigid} if, at every slot~$t$, each processor~$i$ sends one of the

517: packets it currently stores to the set of groups~$C_{\mathrm{out}}(i,t)$, and listens to

518: group~$c_{\mathrm{in}}(i,t)$, where functions $C_{\mathrm{out}}$ and $c_{\mathrm{in}}$ depend

519: solely on $t$ and on the processor index.

520: Here, we can assume that the processors have enough local memory to store a copy of all the

521: packets they have seen so far and that

522: they choose the packet to send according to any strategy or algorithm.

523: This is enough to get the following lower bound.

524: \begin{theorem}

525: Any deterministic and rigid algorithm for online permutation routing

526: on the ${\mathrm{POPS}}(d,g)$ network, $d=\Theta(g)$, must use $\Omega(\log n)$ slots.

527: \end{theorem}

528: \begin{proof}

529: Consider a processor~$i$. Let $P(i,t)$ be the set of all packets that are potentially stored

530: by processor~$i$ at slot~$t$, according to the routing algorithm.

531: At the beginning, $P(i,0)=\{p_i\}$. During slot~$t$, processor~$i$

532: can receive at most one packet from group~$c_{\mathrm{in}}(i,t)$. Assume this packet

533: comes from processor~$j$. Index~$j$ is statically determined and is independent

534: of the initial permutation, since the algorithm is rigid.

535: So, either $P(i,t)=P(i,t-1)\cup P(j,t-1)$ or $P(i,t)=P(i,t-1)$, if no packet is sent to

536: group~$c_{\mathrm{in}}(i,t)$ (because there is no such processor~$j$, or a conflict

537: occurred). Therefore, $|P(i,t)|\le 2^t$ for all $t\ge 0$.

538:

539: Now, assume that the algorithm stops after $t<\log n$

540: slots. Then, $|P(i,t)|<n$, and there exists $h$ such that $p_h\notin  P(i,t)$. As a consequence,

541: the routing algorithm must fail for all input permutations such that the destination of

542: $p_h$ is processor~$i$. We conclude that $t=\Omega(\log n)$.

543: \end{proof}

This bound applies both to the $O(\log^2 g)$ algorithm in~\cite{ds-IEEETPDS03}
and to our deterministic algorithm in the previous section.

546: Therefore, within the class of rigid algorithms, our proposed routing scheme is optimal.
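As a concrete illustration of the counting argument in the proof (a worked instance under the assumptions $d=g$ and base-$2$ logarithms, neither of which is fixed by the statement of the theorem), recall that $|P(i,t)|\le 2^t$ and $n=dg$, so that
\begin{equation*}
2^t<n=g^2 \quad\Longleftrightarrow\quad t<2\log_2 g .
\end{equation*}
For example, in a ${\mathrm{POPS}}(4,4)$ network we have $n=16$: after $t=3$ slots each processor can have seen at most $2^3=8$ of the $16$ packets, so no rigid algorithm can complete in fewer than $4$ slots on every input permutation.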

547:

548: Now, we prove a strong separation theorem. Under restricted hypotheses, we can show that

549: randomization can give an exponential speed-up over determinism. Here, we address a class

550: of routing algorithms we call \emph{two-hops algorithms}. A two-hops algorithm has the

551: following properties:

552: \begin{enumerate}

553: \item

554: Every processor has two buffers, an $A$-buffer and a $B$-buffer;

555: \item

556: at the beginning, the packets are stored in the $A$-buffer of each processor;

557: \item

at every odd slot~$2t+1$, $t=0,1,\dotsc$, every processor~$i$ with a packet in the $A$-buffer
sends the packet to group~$c_{\textrm{out}}(i,2t+1)$ (two-hops algorithms can only use unicast),
listens to incoming packets from
group~$c_{\textrm{in}}(i,2t+1)$, and stores the incoming packet (if any) into the $B$-buffer;
\item
at every even slot~$2t$, $t=1,2,\dotsc$, every processor~$i$ sends the packet in the $B$-buffer to
its destination, resets the $B$-buffer, and listens to incoming packets from coupler~$c_{\textrm{in}}(i,2t)$.

565: \end{enumerate}

566: Also, we will make the following assumptions:

567: \begin{enumerate}

568: \addtocounter{enumi}{4}

569: \item

when multiple packets use the same coupler (multiple packets from a group sent to the
same group), no packet is delivered;
\item
when a packet arrives at any processor in the destination group, it is considered to be
successfully routed, and disappears from the network (from the original $A$-buffer as well).

575: \end{enumerate}

The last hypothesis simplifies the job of routing all the packets to destination: we do not
have to take care of acks when packets reach their destination. Since this assumption can only
help the algorithm and we are proving a lower bound, we do not lose generality. Now, our goal is to show that
for every deterministic choice of functions $c_{\textrm{in}}$ and $c_{\textrm{out}}$, there exists
an input permutation such that the routing requires $\Omega(\log g)$ slots. On the other

581: hand, our randomized algorithm shows that there exists a deterministic $c_{\textrm{in}}$ and

582: a randomized $c_{\textrm{out}}$ such that all the packets are routed to destination in

583: $O(\log\log g)$ slots with high probability.

584:

585: Consider a deterministic two-hops algorithm. Assume that the algorithm stops

586: after $T<\frac{1}{2}\min\{\log d, \log g\}$ slots, $T$ even. We will say that processor~$i$

587: \emph{shoots} on group~$a$ in the first $T$ slots if there exists an odd $t<T$

588: such that $c_{\textrm{out}}(i,t)=a$.

589: \begin{lemma}

590: There exists a group $a_0$ such that at most $dT$ processors shoot on $a_0$

591: in the first $T$ slots.

592: \end{lemma}

593: \begin{proof}

594: By counting.

595: \end{proof}
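One way to carry out the counting, spelled out here as a sketch: processors shoot only at odd slots, and there are $T/2$ odd slots $t<T$ (recall that $T$ is even), so each processor shoots on at most $T/2$ distinct groups. Summing over the $n=dg$ processors, the number of pairs $(i,a)$ such that processor~$i$ shoots on group~$a$ is at most $dgT/2$, and by averaging over the $g$ groups some group~$a_0$ satisfies
\begin{equation*}
\bigl|\{\, i : \textrm{processor~$i$ shoots on $a_0$} \,\}\bigr| \le \frac{1}{g}\cdot\frac{dgT}{2}=\frac{dT}{2}\le dT .
\end{equation*}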

596: \begin{corollary}

597: \label{cor:separation}

598: There are at least $n-dT=dg-dT>dg/2$ processors~$i$ such that

599: processor~$i$ does not shoot on $a_0$ in the first $T$ slots.

600: \end{corollary}

601: Let $P(a_0)$ be the set of processors~$i$ such that processor~$i$ does not shoot

602: on $a_0$ in the first $T$ slots. By Corollary~\ref{cor:separation}, $|P(a_0)|>dg/2$.

603: A subset $A\subset P(a_0)$ is \emph{$\sqrt{g}$-robust} if for every $i\in A$ and

604: for every $t<T$ there are at least $\sqrt{g}$ processors~$j$ in $A$ such that

605: $c_{\textrm{out}}(i,t)= c_{\textrm{out}}(j,t)$.

606: \begin{lemma}

607: There exists a $\sqrt{g}$-robust subset $P'(a_0)\subset P(a_0)$ such that

608: $|P'(a_0)|\ge \frac{dg}{2}-Tg\sqrt{g}$.

609: \end{lemma}

610: \begin{proof}

If $P(a_0)$ is not $\sqrt{g}$-robust, then there must be a processor~$i\in P(a_0)$
and a $t<T$ such that $c_{\textrm{out}}(i,t)=c_{\textrm{out}}(j,t)$ for fewer than $\sqrt{g}$
processors~$j\in P(a_0)$. This means that all the processors~$j$ such that $c_{\textrm{out}}(i,t)=c_{\textrm{out}}(j,t)$
(including $i$) must be removed from $P(a_0)$ to get a $\sqrt{g}$-robust subset. So,
let $P_1(a_0)$ be obtained from $P(a_0)$ by removing all these processors and mark
the pair $(t,c_{\textrm{out}}(i,t))$. Start now from $P_1(a_0)$ in place of $P(a_0)$ and keep iterating.
Notice that no pair can be marked twice in the process. The number of pairs is at most
$Tg$, and each time we mark a pair we drop at most $\sqrt{g}$ processors.
Hence at most $Tg\sqrt{g}$ processors are removed overall, and the bound follows
from $|P(a_0)|>dg/2$.

619: \end{proof}
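To see that the bound of the lemma is not vacuous, consider the following worked instance, under the assumptions (ours, for illustration) that $d=g$ and that logarithms are taken in base~$2$. Since $T<\frac{1}{2}\log_2 g$,
\begin{equation*}
\frac{dg}{2}-Tg\sqrt{g} > \frac{g^2}{2}-\frac{g^{3/2}\log_2 g}{2}=\frac{g^{3/2}}{2}\left(\sqrt{g}-\log_2 g\right),
\end{equation*}
which is positive whenever $\sqrt{g}>\log_2 g$, that is, for every $g>16$. For instance, with $g=256$ the right-hand side equals $\frac{4096}{2}(16-8)=16384$, so the robust subset contains more than a quarter of the $n=65536$ processors.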

620: \begin{theorem}

621: Any deterministic and two-hops algorithm for online permutation routing

622: on the ${\mathrm{POPS}}(d,g)$ network, $d=\Theta(g)$, must use $\Omega(\log n)$ slots.

623: \end{theorem}

624: \begin{proof}

625: We will show that for every processor~$i$ in $P'(a_0)$ there exists an input permutation

626: such that $p_i$ will not reach destination. The idea of the proof is as follows: we can build

627: an input permutation such that $p_i$ has to perform two hops to get to destination,

628: and that has a conflict at every even slot. Take a packet~$p_i$ such that $i\in P'(a_0)$

629: and mark the packet. Now, for $t:=T-1$ downto $1$, $t$ odd, do the following:

630: \begin{quote}

631: for every marked packet~$p_j$,

632: \begin{enumerate}

633: \item

take an unmarked packet~$p_h$, $h\in P'(a_0)$, such that $c_{\textrm{out}}(h,t)=c_{\textrm{out}}(j,t)$;

635: \item

636: mark packet~$p_h$.

637: \end{enumerate}

638: \end{quote}

639: Then, set the destination of all marked packets to processors in group~$a_0$, so that

640: no marked packet can get to destination in one hop (they are chosen from $P'(a_0) \subseteq P(a_0)$).

The number
of packets that are marked in the above process exceeds neither $d$ nor $\sqrt{g}$,
since $T<\frac{1}{2}\min\{\log d, \log g\}$. The important property guaranteed by the above

644: process is that any packet~$p_j$ marked at time~$t$ will experience a conflict during all

645: even slots from the beginning of the routing to time~$t$. In particular, packet~$p_i$ does

646: not reach destination within $T=\Omega(\log n)$ slots.

647: \end{proof}
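The bound on the number of marked packets used in the proof can be made explicit as follows (a sketch, assuming base-$2$ logarithms). The marking loop runs over the $T/2$ odd values $t=T-1,T-3,\dotsc,1$, and each round at most doubles the number of marked packets, starting from the single packet~$p_i$. Hence the number of marked packets is at most
\begin{equation*}
2^{T/2} < 2^{\frac{1}{4}\min\{\log_2 d,\,\log_2 g\}}=\min\{d,g\}^{1/4}\le \min\{d,\sqrt{g}\}.
\end{equation*}
So the $\sqrt{g}$-robustness of $P'(a_0)$ always leaves an unmarked packet to pick, and group~$a_0$ has enough processors to host the destinations of all the marked packets.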

648:

649: We believe that the $\Omega(\log g)$ lower bound for deterministic routing holds in

650: a much wider setting. This is described in the following two conjectures.

651: \begin{conjecture}

652: There exists a deterministic algorithm for online permutation routing

653: on the ${\mathrm{POPS}}(d,g)$ network, $d=\Theta(g)$, that is optimal and conflict-free.

654: \end{conjecture}

655: \begin{conjecture}

656: Any deterministic and conflict-free algorithm for online permutation routing

657: on the ${\mathrm{POPS}}(d,g)$ network, $d=\Theta(g)$, must use $\Omega(\log n)$ slots.

658: \end{conjecture}

659:

660: \section{A Randomized Algorithm}

661:

662: Here we present our randomized algorithm. In the following, we will make use

of the so-called \emph{union bound}, a simple bound on the probability of a union of events.

664: \begin{fact}[Union Bound]

665: Let $E_1,\dotsc,E_m$ be $m$ events. Then,

666: \begin{equation*}

667: \Pr\left[\bigcup_{i=1}^m E_i\right]\le \sum_{i=1}^m \Pr\left[E_i\right].

668: \end{equation*}

669: \end{fact}

We will use a function $\Delta(x):=x \bmod g$.
Moreover, we will say that some event happens \emph{with high probability}
meaning that the probability of the event is at least $1-1/g^k$ for some positive constant~$k$.

673:

674: \subsection{The Case $d=g$}

675:

676: Given a packet $p_i$, $i\in{\mathbb{N}}_{n}$, its \emph{temporary destination group} is

group~$\Delta(\pi(i))=\pi(i)\bmod g$.

678: Note that there are exactly $d$ packets with temporary destination group~$a$,

679: for all $a\in{\mathbb{N}}_{g}$.

680: The idea of the routing algorithm is as follows: Each packet is first routed to

681: a randomly and independently chosen \emph{random intermediate group}, then to its

682: temporary destination group, and lastly to its final destination.

683: So, we iterate the following \emph{step}, composed of five slots:

684: \begin{figure*}

685: \centering\includegraphics[scale=.7]{algorithm-1}

686: \caption{Example of randomized routing in a ${\mathrm{POPS}}(3,3)$ network. Packet~$p_5$ has destination $\pi(5)=1$ in group~$0$.

Its temporary destination group is group $\pi(5)\bmod g=1$. In this step, the

688: random intermediate group chosen by packet~$p_5$ is group~$2$.}

689: \end{figure*}

690: \begin{enumerate}

691: \item

692: each processor containing a packet~$p$ to be routed chooses a random intermediate group~$r$

693: (uniformly and independently at random over ${\mathbb{N}}_{g}$)

694: and sends a copy of packet~$p$ to group~$r$;

695: \item

every copy that arrived at the random intermediate group is sent to its temporary destination group;
\item
for each copy that arrived at the temporary destination group, an ack is sent back
to the random intermediate group;
\item
for each ack that arrived at the random intermediate group, an ack is sent back to the source processor which, in turn, deletes the original packet;
\item
every copy that arrived at its temporary destination group is sent to its destination.

704: \end{enumerate}

705: During the step, there are at most two replicas of the same packet. One is the \emph{original

706: packet}, stored in the source processor; the other is the \emph{copy}, that tries to go from

707: the source processor to a random intermediate group, then to its temporary destination

708: group, and finally to its destination. In slot~4, if the source processor receives an ack, it

709: can be sure that the copy has been successfully delivered, as proved in

710: Proposition~\ref{pro:conflictless}, and can safely delete the original packet.

711: In fact, the original packet gets deleted in slot~4 if and only if, within the step, the copy

712: gets to destination in slot~5.

713:

714: In slots~1, 2, and~5, for every group~$a$, every processor~$i$ in group~$a$ is responsible for

715: listening to coupler~$c(a, \Delta(i))$

716: for the message possibly coming from

717: group~$\Delta(i)$.

718: This way, every

719: conflict-less communication successfully completes and no packet is lost. Indeed,

720: during slots~1 and~2, in every group $a$, $a\in{\mathbb{N}}_{g}$, the processor with index~$b$

721: within the group, $b\in{\mathbb{N}}_{g}$, receives the packet that is possibly coming

722: from group~$b$. In slot~5, every processor~$\pi(i)$ that still has to receive packet~$p_{i}$

723: hopefully receives its packet from group~$\Delta(\pi(i))$, the temporary destination group of

724: packet~$p_i$.

725: Slots~3 and~4 behave differently. Indeed, each ack sent during slot~3 is received by the

726: same processor that sent the packet in slot~2. Similarly, each ack sent during slot~4 is received by the same processor that sent the packet in slot~1.
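As a worked example of these conventions, consider packet~$p_5$ in the ${\mathrm{POPS}}(3,3)$ example given earlier. The trace below is a sketch under two assumptions inferred from the text rather than stated in it: processor~$i$ belongs to group~$\lfloor i/d\rfloor$, and coupler~$c(a,b)$ carries messages from group~$b$ to group~$a$. Packet~$p_5$ starts at processor~$5$ in group~$1$, has destination $\pi(5)=1$ in group~$0$, temporary destination group $\Delta(1)=1$, and random intermediate group~$2$. If no conflict occurs, the step proceeds as follows:
\begin{itemize}
\item
slot~1: the copy goes from group~$1$ to group~$2$ through coupler~$c(2,1)$ and is received by processor~$2\cdot 3+1=7$, which listens to $c(2,\Delta(7))=c(2,1)$;
\item
slot~2: the copy goes from group~$2$ to group~$1$ through coupler~$c(1,2)$ and is received by processor~$1\cdot 3+2=5$, which listens to $c(1,\Delta(5))=c(1,2)$ (in this small example it coincides with the source processor);
\item
slot~3: an ack goes back from group~$1$ to processor~$7$ in group~$2$ through coupler~$c(2,1)$;
\item
slot~4: an ack goes back from group~$2$ to source processor~$5$ through coupler~$c(1,2)$, and the original packet is deleted;
\item
slot~5: the copy goes from group~$1$ to destination processor~$1$ in group~$0$ through coupler~$c(0,1)$, to which processor~$1$ listens since $c(0,\Delta(1))=c(0,1)$.
\end{itemize}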

727:

728: Clearly, during slots~1 and~2, multiple conflicts on the couplers should be expected,

729: and many of the communications may not complete. For example, two packets in the same

730: group can choose the same random intermediate group during slot~1, or two packets willing

731: to go to the same temporary destination group are currently in the same random intermediate

732: group during slot~2.

733: On the contrary, slots 3, 4, and 5 do not generate any conflict,

734: as shown in the following proposition.

735: \begin{proposition}

736: \label{pro:conflictless}

737: At all steps, slots 3, 4, and 5 of the routing algorithm do not generate any conflict.

738: \end{proposition}

739: \begin{proof}

740: Consider packet~$p_i$ stored at processor~$i$ in group~$a$. Assume that, during an arbitrary

741: step, its random intermediate group is $r(i)$, chosen

742: uniformly

743: at random.

744: In the case when packet~$p_i$ survives slot~1 and arrives to its random intermediate group

745: $r(i)$, we know that coupler $c(r(i),a)$ has been used to send

746: packet~$p_i$ only, otherwise a conflict would have stopped the packet.

747: Moreover, since there is only one processor in group~$r(i)$ that is responsible

748: for receiving packet~$p_i$, namely processor~$r(i)d+a$,

749: there will be only one ack message corresponding to packet~$p_i$ to be sent in slot~4,

750: and this ack message is the only one that uses the symmetric coupler

751: $c(a,r(i))$ during slot~4.

752: In conclusion, slot~4 is conflict-free.

753: A similar argument shows that slot~3 is conflict-free as well.

754:

Consider now slot~5. Assume that, after slot~4, packet $p_j$ has arrived at the
same temporary destination group as packet~$p_i$.
This means that $\Delta(\pi(i))=\Delta(\pi(j))$, that is,
$\pi(i)\equiv \pi(j)\pmod g$. In this case, it is not possible that $\pi(i)$ and $\pi(j)$ are
in the same group; otherwise we would have
$\pi(i)=\pi(j)$, contradicting the fact that $\pi$ is a permutation.
Therefore, packets~$p_i$ and $p_j$ go to different groups from their temporary
destination group. In other words, slot~5 is conflict-free as well.

763: \end{proof}
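The argument for slot~5 can also be checked directly (a sketch, assuming that processor~$x$ lies in group~$\lfloor x/g\rfloor$ when $d=g$). The destinations with temporary destination group~$b$ are exactly
\begin{equation*}
\{x\in{\mathbb{N}}_n : x \bmod g = b\}=\{b,\; b+g,\; b+2g,\;\dotsc,\; b+(g-1)g\},
\end{equation*}
and destination $b+kg$ lies in group~$k$. So, in slot~5, group~$b$ forwards at most one packet to each group. For example, in a ${\mathrm{POPS}}(3,3)$ network the destinations $1$, $4$, and $7$ share temporary destination group~$1$ and lie in groups $0$, $1$, and $2$, respectively.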

764:

765: By Proposition~\ref{pro:conflictless}, if packet~$p_i$ survives the first two slots of a step,

766: then, in the very same step, it will be routed to its destination, and an ack will be successfully

767: returned to source processor~$i$. When the ack arrives, the source processor can delete

768: the packet, since it knows it will be safely stored by the destination processor.

769: Conversely, if no ack arrives, the packet is not deleted, and the processor

tries again to deliver it in the next step, choosing again a possibly different
random intermediate group.

772:

773: By the above discussion, we can safely concentrate on slots~1 and~2.

774: A useful way to visualize the conflicts in slots~1 and~2

775: of an arbitrary step is

776: shown in Figure~\ref{fig:bipartite-1}.

777: At any given step of the routing algorithm, let $\pi$ be the

778: restriction of the input permutation to those packets that have not been successfully routed yet (during previous steps).

779: We build the \emph{graph of conflicts}, a bipartite multi-graph $G_{\pi}$ on node classes

780: $S:={\mathbb{N}}_g$ and $D:={\mathbb{N}}_g$. For every group~$a$ and for each

781: packet~$p_i$ in group~$a$ and yet to be

782: routed, we introduce an edge with one endpoint in $a\in S$ and the other endpoint in

783: the temporary destination group~$\Delta(\pi(i))\in D$.

784: During slot~1 of the step,

785: every edge (packet yet to be routed) randomly and uniformly chooses a

786: \emph{color} in ${\mathbb{N}}_g$ (the random intermediate

787: group).

Clearly, the same packet can choose different colors

789: in different steps of the routing algorithm.

790: Now we can exactly characterize the conflicts in the first two slots of the routing algorithm during

791: step~$s$.

792: Packet~$p_i$ in group~$a$ (represented by an edge from $a\in S$ to $\Delta(\pi(i))\in D$)

793: has a conflict during slot~1

794: if and only if there is another edge incident to $a\in S$ with

795: the same random color. Moreover, if we remove all edges relative to packets that have a conflict

796: in slot~1 (see Figure~\ref{fig:bipartite-2}), every remaining packet $p_i$ has a

797: conflict during slot~2

798: if and only if there is another remaining edge incident to $\Delta(\pi(i))\in D$ with the same random color.

799: Figure~\ref{fig:bipartite-3} shows which packets of Figure~\ref{fig:bipartite-1} survive both

800: slots and are hence delivered to destination by Proposition~\ref{pro:conflictless}.
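For the permutation of Figure~\ref{fig:bipartite}, the edges of $G_{\pi}$ can be tabulated directly from $\Delta(\pi(\cdot))$. The following is a worked example, assuming that group~$a$ of the ${\mathrm{POPS}}(4,4)$ network holds packets $p_{4a},\dotsc,p_{4a+3}$, and writing $m_{a,b}$ (notation introduced only for this example) for the number of edges between $a\in S$ and $b\in D$:
\begin{equation*}
(m_{a,b})_{a,b\in{\mathbb{N}}_4}=
\begin{pmatrix}
1 & 3 & 0 & 0\\
0 & 0 & 2 & 2\\
1 & 1 & 0 & 2\\
2 & 0 & 2 & 0
\end{pmatrix},
\end{equation*}
where rows are indexed by $a$ and columns by $b$. Every row and every column sums to $d=4$, as expected before any packet is delivered. For instance, packets $p_0$, $p_1$, and $p_3$ of group~$0$ all have temporary destination group~$1$, so any two of them that pick the same color already conflict in slot~1.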

801:

802: \begin{figure*}

803: \centering

804: \subfigure[Conflict graph~$G_{\pi}$;]{\label{fig:bipartite-1}

805: \includegraphics[scale=.78]{bipartite-1}}\goodgap

806: \subfigure[conflict graph~$G_{\pi}$, where only packets surviving slot~1 are shown;]{\label{fig:bipartite-2}

807: \includegraphics[scale=.78]{bipartite-2}}\goodgap

808: \subfigure[conflict graph~$G_{\pi}$, where only packets surviving both slot~1 and slot~2 are shown.]{\label{fig:bipartite-3}

809: \includegraphics[scale=.78]{bipartite-3}}

810: \caption{Conflict graph~$G_{\pi}$, where permutation~$\pi=[1,5,8,9,3,10,11,14,15,13,0,7,2,6,12,4]$

811: (consequently, $\Delta(\pi(\cdot))=[1,1,0,1,3,2,3,2,3,1,0,3,2,2,0,0]$), in a ${\mathrm{POPS}}(4,4)$

812: network.}

813: \label{fig:bipartite}

814: \end{figure*}

815:

816: Our first result shows that, in case the packets are ``sparse'' in the network,

817: then all the packets can be delivered in a constant number of slots with high

818: probability.

819: \begin{lemma}

820: \label{lem:phase3}

If the maximum degree of the conflict graph is at most $g^{\alpha}$

822: for some constant $\alpha<1$, then the routing algorithm delivers all the packets to destination in a

823: constant number of slots with high probability.

824: \end{lemma}

825: \begin{proof}

Since the maximum degree of the conflict graph is at most $g^{\alpha}$, in every group of the POPS network

827: there are at most $g^{\alpha}$ packets left to be routed, and every group of the POPS network is

828: the temporary destination group of at most $g^{\alpha}$ packets.

829: Let $\beta=1-\alpha$.

830: We show that

831: the probability

832: that all packets get routed to destination

833: within $3/\beta$ steps is at least $1-c_\beta/g$,

834: where $c_\beta := 2^{3/\beta}$ is a constant

835: depending only on (the constant) $\beta$.

836: Consider a generic packet $p_i$ in group~$a$.

837: The probability that packet~$p_i$ has a conflict in one step is at most equal to the

838: probability that either one of the packets in group~$a$ or one of the packets

839: with temporary destination group~$\Delta(\pi(i))$ chooses the same random intermediate group as

840: packet~$p_i$.

841: Since at most $g^\alpha-1$ other packets are in group~$a$, and similarly at

842: most $g^\alpha-1$ have temporary destination group~$\Delta(\pi(i))$, this probability cannot be larger

843: than $2g^\alpha/g=2g^{-\beta}$.

Therefore, the probability that the packet fails to be routed
in all of the $3/\beta$ steps is at most

846: \begin{equation*}

847: \left(\frac{2}{g^\beta}\right)^{\frac{3}{\beta}}=\frac{2^{3/\beta}}{g^3}=\frac{c_\beta}{g^3}.

848: \end{equation*}

849: By the union bound, the probability that any of the $g^{1+\alpha}<g^2$ packets in the network

850: has not been routed in $3/\beta$ steps is at most $c_\beta/g$.

851: \end{proof}
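For instance, with $\alpha=5/6$ (the value used in the analysis that follows) we have $\beta=1/6$, and the quantities in the proof become
\begin{equation*}
\frac{3}{\beta}=18 \textrm{ steps},\qquad c_\beta=2^{18}=262144,\qquad
\PR[\textrm{some packet is not routed}]\le \frac{2^{18}}{g}.
\end{equation*}
Since each step consists of five slots, this corresponds to $90$ slots. This is only a worked instance of the bound, not a tuned constant: the estimate is meaningful only for $g>2^{18}$, which is consistent with its asymptotic nature.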

As a matter of fact, the hard part of the job is to reduce the initial number of $g$ packets in each
group so as to get a ``sparse'' set of remaining packets.

854: We can prove that this is done quickly

855: by our randomized algorithm by

856: providing sharp bounds on the number $X$ of packets that are

857: successfully delivered in a step.

858: We define $X$ as a sum of indicator random variables $Z_i$, where

859: $Z_i$ is equal to $1$ if the $i$-th packet is delivered in this step, and $0$ otherwise.

860: It is important

861: to realize that these random variables are not independent: the event that one packet has

862: a conflict influences the probability that another packet has a conflict as well.

863: As a consequence, we cannot use the well-known Chernoff bound

864: to get sharp estimates of the value of $X$ since there does not seem to be any

way to describe the process as a sum of independent random variables.

866: So, we need a more sophisticated

867: mathematical tool.

868: Specifically, we will see that slots~1 and~2 of one step

869: of the routing algorithm can be modeled by a set of martingales. Martingale theory is

870: useful to get sharp bounds when the process is described in terms of not necessarily

871: independent random variables.

872:

873: For an introduction to martingales, the reader is

874: referred to~\cite{mr95}.

875: Also~\cite{ds65}, \cite{gs88}, \cite{as00}, and~\cite{dp04} give a description of martingale

876: theory. Here, we give a brief review of the main definitions and theorems

877: we will be using in the following.

878: \begin{definition}[\cite{mr95}]

879: Given the $\sigma$-field $(\Omega, {\mathbb{F}})$ with ${\mathbb{F}}=2^{\Omega}$, a

880: \emph{filter} is a nested sequence

881: ${\mathbb{F}}_0\subseteq {\mathbb{F}}_1\subseteq \dotsb \subseteq {\mathbb{F}}_m$ of

882: subsets of $2^{\Omega}$ such that

883: \begin{enumerate}

884: \item

885: ${\mathbb{F}}_0=\{\emptyset, \Omega\}$;

886: \item

887: ${\mathbb{F}}_m=2^{\Omega}$;

888: \item

889: for $0\le h\le m$, $(\Omega, {\mathbb{F}}_h)$ is a $\sigma$-field.

890: \end{enumerate}

891: \end{definition}

892: \begin{definition}[\cite{mr95}]

893: Let $(\Omega, {\mathbb{F}}, \PR)$ be a probability space with a filter

${\mathbb{F}}_0,\dotsc, {\mathbb{F}}_m$. Suppose that $Z_0, \dotsc, Z_m$ are random variables
such that for all $h\ge 0$, $Z_h$ is ${\mathbb{F}}_h$-measurable. The sequence $Z_0,\dotsc,Z_m$

896: is a \emph{martingale} provided that, for all $h\ge 0$,

897: \begin{equation*}

898: \E[Z_{h+1}|{\mathbb{F}}_h]=Z_h.

899: \end{equation*}

900: \end{definition}

901: The next tail bound for martingales is similar to the Chernoff bound for the sum of Poisson

902: trials.

903: \begin{theorem}[Azuma's Inequality~\cite{mr95}]

904: Let $Z_0,\dotsc,Z_m$ be a martingale such that for each $h$,

905: \begin{equation*}

906: |Z_h-Z_{h-1}|\le c_h,

907: \end{equation*}

908: where $c_h$ may depend on $h$. Then, for all $t\ge 0$ and any $\lambda>0$,

909: \begin{equation*}

910: \PR\left[ |Z_t-Z_0|\ge\lambda\right]\le 2e^{-\frac{\lambda^2}{2\sum_{k=1}^t c_k^2}}.

911: \end{equation*}

912: \end{theorem}
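In the applications below all the differences are bounded by the same constant. As a quick specialization (a direct substitution, stated here for convenience), if $c_h=c$ for every $h$, then Azuma's inequality with $t=m$ reads
\begin{equation*}
\PR\left[ |Z_m-Z_0|\ge\lambda\right]\le 2e^{-\frac{\lambda^2}{2mc^2}},
\end{equation*}
and for $c=2$ the exponent becomes $-\lambda^2/(8m)$.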

913: \begin{theorem}

914: \label{thm:fondamentale}

915: A $\POPSg$ network can route any permutation in $O(\log\log g)$ slots

916: with high probability.

917: \end{theorem}

918: \begin{proof}

919: Let $G_{\pi}=(S,D; E)$ be the conflict graph at step~$s$ of the routing

920: algorithm, where $\pi$ is the input permutation restricted to those packets

921: that still have to be routed at the beginning of step~$s$.

922: Let $d_s$ be the maximum degree of $G_{\pi}$.

923: So, at step~$s$ there are at most $d_s$ packets left to

924: be routed in every group, and at most $d_s$ packets are willing to go to

925: the same temporary destination group.

926: Clearly, $d_1\leq d$. We will show that after

927: $O(\log\log g)$ steps the conflict graph has maximum degree at most $g^{5/6}$.

928: This is enough to prove this theorem by Lemma~\ref{lem:phase3}.

929:

930: Assume to be at step~$s$. If $d_s\le g^{5/6}$, then we are done.

931: So, we can assume that $d_s> g^{5/6}$.

932: Let $S_a$, $a\in S$, be the set of indices

933: of the packets of group~$a$ that still have to be delivered at the beginning of

934: step~$s$. Similarly, let $D_b$, $b\in D$,

935: be the set of indices

936: of the packets in the whole network that still have to be delivered and that have group~$b$ as

937: temporary destination group.

938: Clearly, $|S_a|$ and $|D_b|$ are the degrees of nodes $a\in S$ and $b\in D$ in the conflict graph of step~$s$.

939: Therefore, $|S_a|\le d_s$ and $|D_b|\le d_s$ for every $a\in S$ and $b\in D$.

940: For every packet~$p_i$ still to be routed, we define the following indicator random variable,

941: \begin{equation*}

942: Z_i^1=\begin{cases} 1 & \textrm{if packet~$p_i$ survives slot~1 in step~$s$,}\\

943: 0 & \textrm{otherwise.} \end{cases}

944: \end{equation*}

945: Random

946: variable~$X_a^1=\sum_{i\in S_a} Z_i^1$ tells the number of packets from group~$a$ that

947: survive slot~1; random variable~$Y_b^1=\sum_{j\in D_b} Z_j^1$ tells the number of packets with temporary destination

948: group~$b$ that survive slot~1.

949: Moreover, let random

950: variable~$C_i$ be equal to the color chosen by packet~$p_i$ in step~$s$.

951:

952: Clearly, we have nothing to show about the nodes in $G_{\pi}$ that have degree smaller

953: than or equal to $g^{5/6}$. So, we define sets $S^+\subseteq S$ and

954: $D^+\subseteq D$, which collect the nodes with degree

larger than $g^{5/6}$, and focus on the nodes in these sets.

956: Consider an arbitrary node $a\in S^+$.

957: The expectation of $Z_i^1$, $i\in S_a$, can be bounded as follows:

958: \begin{equation}

959: \begin{split}

960: \E[Z_i^1] = \PR[\forall \; h\in S_a\setminus \{i\}, \; C_h\neq C_i]=

961: \prod_{h\in S_a\setminus \{i\}} \PR[C_h\neq C_i]\\

962: = \left( 1-\frac{1}{g}\right)^{|S_a|-1}

963: \ge e^{-|S_a|/g}.

964: \end{split}

965: \label{eqn:lowerEXi}

966: \end{equation}

967: So, the expected number of packets in group~$a$ that survive slot~1 can be bounded accordingly,

968: \begin{equation}

969: \label{eqn:lowerX}

970: \E[X_a^1]=\E\left[\sum_{i\in S_a} Z_i^1\right]=\sum_{i\in S_a} \E[Z_i^1]\ge |S_a| e^{-|S_a|/g}.

971: \end{equation}
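As a sanity check on~(\ref{eqn:lowerX}), consider the first step in the illustrative case where group~$a$ is full, that is, $|S_a|=g$. Each packet of the group then survives slot~1 with probability $(1-1/g)^{g-1}\ge e^{-1}\approx 0.368$, so that $\E[X_a^1]\ge g/e$. For example, with $g=4$,
\begin{equation*}
\E[X_a^1]=4\left(\frac{3}{4}\right)^{3}=\frac{27}{16}\approx 1.69\ge \frac{4}{e}\approx 1.47 .
\end{equation*}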

972:

973: In order to show that random variable $X_a^1$ is not far from its expectation with high probability,

974: we now define random variables $W_h=\E[X_a^1|{\mathbb{F}}_h]$, $h=0,\dotsc,|S_a|$,

975: where ${\mathbb{F}}_h$ is the $\sigma$-field generated by the random color chosen by

976: the first $h$ packets in $S_a$.

977: Filter ${\mathbb{F}}_h$, $h=0,\dotsc,|S_a|$, is such that $W_0, \dotsc, W_{|S_a|}$ is a martingale and that

978: $|W_{h}-W_{h-1}|\le 2$,

979: since fixing the random color chosen by the $h$-th packet in $S_a$

980: can only affect the expected value of the sum $X_a^1$ at most by two.

By Azuma's inequality, for every $\delta>0$

982: \begin{equation}

983: \begin{split}

984: \PR\left[\left|X_a^1-\E[X_a^1]\right|\ge \delta\E[X_a^1]\right]=\PR\left[\left|W_{|S_a|}-W_0\right|

985: \ge \delta\E[X_a^1]\right]\\

986: \le 2e^{-\frac{\delta^2 \E[X_a^1]^2}{2\sum(2)^2}}

987: \le 2e^{-\frac{\delta^2 |S_a|^2e^{-2d_s/g}}{8|S_a|}}\le

988: 2e^{-\frac{\delta^2 g^{5/6}}{8e^2}}.

989: \end{split}

990: \label{eqn:azumaX}

991: \end{equation}
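The chain of inequalities in~(\ref{eqn:azumaX}) can be unpacked as follows. Azuma's inequality is applied with $\lambda=\delta\E[X_a^1]$ and $c_h=2$ for $h=1,\dotsc,|S_a|$, so that $2\sum_{h} c_h^2=8|S_a|$. Then~(\ref{eqn:lowerX}) together with $|S_a|\le d_s$ gives $\E[X_a^1]^2\ge |S_a|^2e^{-2d_s/g}$, and finally $|S_a|>g^{5/6}$ (since $a\in S^+$) and $d_s\le g$ give
\begin{equation*}
\frac{\delta^2 |S_a|^2e^{-2d_s/g}}{8|S_a|}=\frac{\delta^2 |S_a|e^{-2d_s/g}}{8}\ge \frac{\delta^2 g^{5/6}}{8e^2}.
\end{equation*}
For illustration only, choosing $\delta=g^{-1/6}$ (our choice here, not necessarily the one made in the rest of the proof) gives $\delta^2g^{5/6}=\sqrt{g}$, that is, a failure probability of at most $2e^{-\sqrt{g}/(8e^2)}$ for a single node.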

992:

993: To prove a similar result for $Y_b^1$, $b\in D^+$,

994: we must recast the above general martingale arguments

995: into a more structured approach.

996: This is because $Y_b^1$ may depend on the random colors chosen by all the packets

997: in the network, and not only on those chosen by the packets in $D_b$.

998:

999: Consider an arbitrary node $b\in D^+$.

1000: In the following analysis of the expectation and concentration

1001: of $Y_b^1$ we can clearly pretend that the random colors

are first chosen for the packets outside $D_b$ and later for the packets in $D_b$.

1003: This will not invalidate our conclusions about the whole of the $Y_b^1$'s, $b\in D^+$,

1004: since these will be derived from the solid claims about any single $Y_b^1$ by the union bound.

For every $a\in S$, we define set~$\overline{C}_{a,\overline{b}}$

1006: as ${\mathbb{N}}_g\setminus C_{a,\overline{b}}$, where $C_{a,\overline{b}}$ is the set of colors that are chosen in

1007: step~$s$ by a packet in group~$a$ that has temporary destination group different from $b$,

1008: \begin{equation*}

1009: \overline{C}_{a,\overline{b}}={\mathbb{N}}_g\setminus\left(\bigcup_{i\in S_a\setminus D_b} \left\{ C_i \right\}\right).

1010: \end{equation*}

1011: The average size of $\overline{C}_{a,\overline{b}}$ is

1012: \begin{equation*}

\E\left[\left| \overline{C}_{a,\overline{b}}\right|\right]=

1014: g\left(1-\frac{1}{g}\right)^{|S_a\setminus D_b|}.

1015: \end{equation*}
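This expectation follows from linearity. As a one-line derivation, write $m=|S_a\setminus D_b|$ (shorthand introduced only here): a fixed color $c\in{\mathbb{N}}_g$ is avoided by all $m$ independent uniform choices with probability $(1-1/g)^m$, and hence
\begin{equation*}
\E\left[\left|\overline{C}_{a,\overline{b}}\right|\right]
=\sum_{c\in{\mathbb{N}}_g}\PR\left[\forall\; i\in S_a\setminus D_b,\; C_i\neq c\right]
=g\left(1-\frac{1}{g}\right)^{m}.
\end{equation*}
For example, with $g=4$ and $m=2$ the expected number of unused colors is $4\cdot(3/4)^2=9/4$.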

Since this is just a classical balls-and-bins problem~\cite{mr95}, we know that random variable~$|\overline{C}_{a,\overline{b}}|$
falls significantly below its expectation only with small probability:
\begin{equation*}
\PR\left[|\overline{C}_{a,\overline{b}}|<(1-\delta)\E[|\overline{C}_{a,\overline{b}}|]\right]\le e^{-\frac{\delta^2\E[|\overline{C}_{a,\overline{b}}|]^2}{2g}}
\le e^{-\frac{\delta^2g}{2e^2}},
\end{equation*}
for every $\delta>0$. By the union bound over the $g$ nodes in $S$,
we know that, for every node $a\in S$ simultaneously,

1024: \begin{equation}

1025: \label{eqn:lowerP}

1026: |\overline{C}_{a,\overline{b}}| \ge (1-\delta)g\left(1-\frac{1}{g}\right)^{|S_a\setminus D_b|}

1027: \end{equation}

with probability at least

1029: \begin{equation}

1030: \label{eqn:azumaP}

1031: 1-ge^{-\frac{\delta^2g}{2e^2}}.

1032: \end{equation}

1033:

1034: Under the hypothesis that Equation~\ref{eqn:lowerP} holds for every $a\in S$, we can bound the expectation of $Z_j^1$, $j\in D_b$, as follows:

1035: \begin{equation*}

1036: \E[Z_j^1]  = \PR\left[\left(\forall \; h\in D_b\cap S_{a_j}^{1}\setminus \{j\}, \; C_h\neq C_j\right) \wedge

(C_j\in \overline{C}_{a_j,\overline{b}})\right],

1038: \end{equation*}

1039: where $a_j$ is the group of packet~$p_j$. So,

1040: \begin{align*}

1041: \E[Z_j^1]  & \ge \left(1-\frac{1}{g}\right)^{|D_b\cap S_{a_j}^{1}\setminus \{j\}|}

1042: (1-\delta) \left(1-\frac{1}{g}\right)^{|S_{a_j}^{1}\setminus D_b|}\\

1043: & = (1-\delta)\left(1-\frac{1}{g}\right)^{|S_{a_j}^{1}\setminus \{j\}|}

1044: \ge (1-\delta) e^{-|S_{a_j}^{1}|/g}.

1045: \end{align*}

1046: The expectation of $Y_b^1$ can be bounded accordingly,

1047: \begin{equation}

1048: \label{eqn:lowerY}

1049: \E[Y_b^1]=\E\left[\sum_{j\in D_b} Z_j^1\right]=\sum_{j\in D_b} \E[Z_j^1]

1050: \ge (1-\delta)|D_b| e^{-|D_b|/g}.

1051: \end{equation}

1052:

1053: In order to show that random variable $Y_b^1$ is not far from its expectation with high probability,

1054: we now define random variables $W_k=\E[Y_b^1|{\mathbb{F}}_k]$, $k=0,\dotsc,|D_b|$,

1055: where ${\mathbb{F}}_k$ is the $\sigma$-field generated by the random color

1056: chosen by the first $k$ packets in $D_b$.

1057: Filter ${\mathbb{F}}_k$, $k=0,\dotsc,|D_b|$, is such that $W_0, \dotsc, W_{|D_b|}$

1058: is a martingale and that $|W_{k}-W_{k-1}|\le 2$,

1059: since fixing the random color chosen by the $k$-th packet in $D_b$

1060: can only affect the expected value of the sum $Y_b^1$ at most by two.

By Azuma's inequality, for every $\delta>0$

1062: \begin{equation}

1063: \begin{split}

1064: \PR\left[\left|Y_b^1-\E[Y_b^1]\right|\ge \delta\E[Y_b^1]\right]=\PR\left[\left|W_{|D_b|}-W_0\right|

1065: \ge \delta\E[Y_b^1]\right]\le\\

1066: \le 2e^{-\frac{\delta^2 \E[Y_b^1]^2}{2\sum(2)^2}}

1067: \le 2e^{-\frac{\delta^2 (1-\delta)^4|D_b|^2e^{-2d_s/g}}{8|D_b|}}\le

1068: 2e^{-\frac{\delta^2 (1-\delta)^4 g^{5/6}}{8e^2}}.

1069: \end{split}

1070: \label{eqn:azumaY}

1071: \end{equation}

1072:

1073: Let $G_{\pi'}=(S,D; E')$ be the conflict graph at step~$s$, where

1074: $\pi'$ is the input permutation restricted to those packets that survive slot~1 in step~$s$.

1075: Hence, $E'\subseteq E$.

1076: Our goal is to bound the number of packets that survive slot~2

1077: as well, and are thus delivered to destination during this step.

1078: Let $Z_j^2$

1079: be equal to one if packet~$p_j$ survives both slots~1 and~2,

1080: and zero otherwise.

1081: Also, let $S_a^1$, $a\in S$, be the set of indices

1082: of the packets of group~$a$ that have survived slot~1. Similarly, let $D_b^1$, $b\in D$,

1083: be the set of indices

1084: of the packets in the whole network that have survived slot~1 and have group~$b$ as

1085: temporary destination group. Clearly, for every $a\in S$,

1086: $|S_a^1|$ is equal to $X_a^1$ and is the degree of node~$a$ in $G_{\pi'}$;

1087: while for every $b\in D$, $|D_b^1|$ is equal to $Y_b^1$ and is the degree of node~$b$

1088: in $G_{\pi'}$.

1089: Random variables

1090: \begin{equation*}

X_a^2=\sum_{i\in S_a^1} Z_i^2,

1092: \end{equation*}

1093: $a\in S$, tell the number of packets in group~$a$ that

1094: are delivered during step~$s$; similarly, random variables

1095: \begin{equation*}

1096: Y_b^2=\sum_{j\in D_b^1} Z_j^2

1097: \end{equation*}

1098: $b\in D$, tell the number of packets willing to go to temporary destination group~$b$ that

1099: are delivered during step~$s$.

1100:

1101: Consider an arbitrary node~$b\in D^+$.

1102: The expected value of $Y_b^2$ depends on permutation

1103: $\pi'$. Since we are computing a lower bound to $Y_b^2$, the worst case is

1104: when all packets in $D_b^1$ originate at different groups. Indeed, if two packets

1105: in $D_b^1$ belong to the same $S_a^1$, we already know that they have

1106: chosen two different colors during step~$s$, and the expectation of $Y_b^2$ is larger.

1107: A formal proof of this intuitive claim can be

given, though it is omitted for the sake of brevity.

1109: Assuming that random variable $Y_b^1$ is

1110: not far from expectation as in Equation~\ref{eqn:azumaY},

1111: we can bound the expectation of $Y_b^2$,

1112: \begin{align}

1113: \nonumber

1114: \E[Y_b^2] & = |D_b^1|\left(1-\frac{1}{g}\right)^{|D_b^1|-1}\ge\\

1115: \nonumber & \ge (1-\delta)^2|D_b|e^{-|D_b|/g}\left(1-\frac{1}{g}\right)^{|D_b^1|-1}\ge\\

1116: & \ge (1-\delta)^2|D_b| e^{-|D_b|/g}e^{-|D_b^1|/g}\ge (1-\delta)^2|D_b| e^{-2d_s/g}.

1117: \label{eqn:lowerYtilde}

1118: \end{align}

Just as before, $Y_b^2$ is also not far from its expectation.

1120: Martingale theory can be used again to show that

1121: \begin{equation}

1122: \PR\left[\left|Y_b^2-\E[Y_b^2]\right|\ge \delta\E[Y_b^2]\right]

1123: \le 2e^{-\frac{\delta^2 \E[Y_b^2]^2}{2\sum(2)^2}}\le

1124: 2e^{-\frac{\delta^2 (1-\delta)^4 g^{5/6}}{8e^2}}.

1125: \label{eqn:azumaYtilde}

1126: \end{equation}

1127: Similarly, by using the same technique that has been used to bound random variable $Y_b^1$,

1128: for every node $a\in S^+$ we can show that

1129: \begin{align}

1130: \nonumber

1131: \E[X_a^2] & \ge (1-\delta)|S_a^1|\left(1-\frac{1}{g}\right)^{|S_a^1|-1}\ge (1-\delta)|S_a^1| e^{-|S_a^1|/g} \ge \\

1132: \nonumber & \ge (1-\delta)^2|S_a|e^{-|S_a|/g} e^{-|S_a^1|/g} \ge\\

1133: & \ge (1-\delta)^2|S_a| e^{-2d_s/g},

1134: \label{eqn:lowerXtilde}

1135: \end{align}

1136: and that $X_a^2$ is not far from its expectation

1137: \begin{equation}

1138: \PR\left[\left|X_a^2-\E[X_a^2]\right|\ge \delta\E[X_a^2]\right]

1139: \le 2e^{-\frac{\delta^2 \E[X_a^2]^2}{2\sum(2)^2}}\le

1140: 2e^{-\frac{\delta^2 (1-\delta)^4 g^{5/6}}{8e^2}}.

1141: \label{eqn:azumaXtilde}

1142: \end{equation}
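The bounds above repeatedly replace a factor of the form $(1-1/g)^{m-1}$ by $e^{-m/g}$.
A short justification can be sketched as follows, under the assumptions $g\ge 2$ and $1\le m\le g$
(the symbol $m$ is introduced here only for illustration and stands for the relevant set size).
Since $(1-1/g)^{g-1}\ge e^{-1}$,
\begin{equation*}
\left(1-\frac{1}{g}\right)^{m-1}
=\left[\left(1-\frac{1}{g}\right)^{g-1}\right]^{\frac{m-1}{g-1}}
\ge e^{-\frac{m-1}{g-1}}\ge e^{-m/g},
\end{equation*}
where the last step uses $\frac{m-1}{g-1}\le\frac{m}{g}$, which is equivalent to $m\le g$.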

1143:

1144: By Equations~\ref{eqn:lowerX}, \ref{eqn:azumaX}, \ref{eqn:lowerP}, \ref{eqn:azumaP}, \ref{eqn:lowerY}, \ref{eqn:azumaY}, \ref{eqn:lowerYtilde}, \ref{eqn:azumaYtilde}, \ref{eqn:lowerXtilde}, \ref{eqn:azumaXtilde}, and

1145: by the union bound,

1146: the number of packets successfully delivered

1147: in step~$s$ can be bounded as follows: For every $\delta>0$,

1148: \begin{align}

1149: \label{eqn:lowerXfinale}

1150: X_a^2 & \ge (1-\delta)^3|S_a| e^{-2d_s/g}\\

1151: \label{eqn:lowerYfinale}

1152: Y_b^2 & \ge (1-\delta)^3|D_b| e^{-2d_s/g}

1153: \end{align}

1154: for every $a\in S^+$ and $b\in D^+$, with probability at least

1155: \begin{equation}

1156: 1-9ge^{-\frac{\delta^2 (1-\delta)^4 g^{5/6}}{8e^2}}.

1157: \label{eqn:azumafinale}

1158: \end{equation}

1159:

1160: Now, we divide our analysis into two phases. Phase~1 is composed of a constant

1161: number of steps and, with high probability, reduces the maximum degree of the conflict

1162: graph from $d_1$ to $gx$ or less,

where $0<x<1$ is any fixed constant.
Phase~2 follows and reduces the maximum degree of the conflict graph to $g^{5/6}$ or less
in $O(\log\log g)$ steps with high probability.

1166:

1167: Let us start from Phase~1. For every step~$s$ during Phase~1,

1168: $gx\le d_s\le g$. We show that a constant number of

1169: steps is enough to make $d_s$ fall below $gx$ with high probability.

1170: For all $a\in S^+$, let us refer to a step such that

1171: \begin{equation}

1172: X_a^2\ge\frac{|S_a| e^{-2}}{2}

1173: \end{equation}

1174: as a \emph{lucky} step for group~$a$.

By Equations~\ref{eqn:lowerXfinale} and~\ref{eqn:azumafinale}, where we fix

1176: $\delta$ such that $(1-\delta)^3=1/2$, step~$s$ is lucky for every

1177: group~$a\in S^+$ with probability at least

1178: \begin{equation*}

1179: 1-9ge^{-\alpha |S_a|}\ge 1-9ge^{-\alpha g^{5/6}},

1180: \label{eqn:azuma}

1181: \end{equation*}

1182: where $\alpha$ is a positive constant.

1183: Therefore, the number of packets that remain after step~$s$ in group~$a\in S^+$ is

1184: \begin{equation}

1185: |S_a|-X_a^2\le |S_a|-\frac{|S_a| e^{-2}}{2}

1186: \le d_s\left(1-\frac{e^{-2}}{2}\right)

1187: \end{equation}

1188: with high probability.

Note that the same bound can be shown for the sets~$D_{b}$, $b\in D^+$, with
exactly the same analysis (where an analogous notion of lucky step refers to a step in which the degree of group~$b\in D^+$ is reduced by at least $|D_{b}|e^{-2}/2$). Therefore,

1191: after

1192: \begin{equation*}

1193: y:=\left\lceil\frac{\log x}{\log (1-e^{-2}/2)}\right\rceil

1194: \end{equation*}

1195: lucky steps for all the groups the maximum degree of the conflict graph reduces

1196: to $gx$ or less.

1197: By the union bound,

1198: this happens

1199: within the very first $y$ steps

1200: with probability at least

1201: \begin{equation*}

1-9yge^{-\alpha g^{5/6}}.

1203: \end{equation*}

1204: That is, Phase~1 completes in a constant number of steps

1205: with high probability.
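
As an illustration of the constant involved, consider the sample value $x=1/4$
(chosen here only for concreteness; the argument does not depend on it).
Using natural logarithms and rounding to four digits,
\begin{equation*}
1-\frac{e^{-2}}{2}\approx 0.9323,
\qquad
y=\left\lceil\frac{\log (1/4)}{\log (0.9323)}\right\rceil
=\left\lceil\frac{-1.3863}{-0.0701}\right\rceil=\lceil 19.8\rceil=20 .
\end{equation*}
Under this choice, twenty lucky steps are enough to bring the maximum degree from at most $g$ down to $g/4$ or less.
The ratio of the two logarithms does not depend on the base, so the same value is obtained with any other base.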

1206:

1207: We are now at a generic step~$s$ in Phase~2.

1208: Our goal is to reduce the degree of the graph of conflicts to $g^{5/6}$.

1209: Let $\lambda_s=d_s/g$. We can assume that $g^{-1/6}\le\lambda_s<x$,

1210: and when $\lambda_s$ falls below $g^{-1/6}$ we are done.

This time, let us refer to a step during which at least

1212: $(1-\lambda_{s})|S_a| e^{-2\lambda_s}$ packets in group~$a\in S^+$ are delivered as a

1213: \emph{lucky} step for group~$a$.

By Equations~\ref{eqn:lowerXfinale} and~\ref{eqn:azumafinale}, where we take

1215: $\delta_s=\lambda_s/3$ (in such a way that $(1-\delta_s)^3\ge (1-\lambda_s)$),

1216: step~$s$ is lucky for every group~$a\in S^+$ with probability at least

1217: \begin{equation*}

1218: 1-9yge^{-\beta g^{1/2}},

1219: \end{equation*}

1220: where $\beta$ is a positive constant,

1221: since $|S_a|\lambda_{s}^2\ge g^{5/6}(g^{-1/6})^2=g^{1/2}$.

1222: So, the number of packets that remain in group~$a\in S^+$ after step~$s$ is

1223: \begin{equation*}

1224: |S_a|-X_a^2\le |S_a|-(1-\lambda_{s})|S_a| e^{-2\lambda_s}\le

1225: d_s\left[1-(1-\lambda_{s}) e^{-2\lambda_s}\right]

1226: \end{equation*}

1227: with high probability. A similar result can be shown

1228: for any group~$b\in D$ such that $|D_b|>g^{5/6}$ with exactly the same analysis.

1229: By the union bound, at the end of step~$s$ the degree of the conflict graph is at most

1230: \begin{equation*}

1231: d_s\left[1-(1-\lambda_{s}) e^{-2\lambda_s}\right]

1232: \end{equation*}

1233: with high probability.

1234: Now, assuming a sequence of lucky steps, we can set up the following recurrence,

1235: \begin{align*}

1236: \lambda_{s+1} & \le \lambda_s \left[1-(1-\lambda_s) e^{-2\lambda_s}\right]\le

1237: \lambda_s\left[1-(1-\lambda_s)(1-2\lambda_s)\right]=\\

1238: & = \lambda_s\left[1-1+3\lambda_s-2\lambda_s^2\right]\le 3\lambda_s^2.

1239: \end{align*}

1240: Therefore,

1241: \begin{equation*}

1242: \lambda_{s}\le 3\lambda_{s-1}^2\le 3\left(3\lambda_{s-2}^2\right)^2\le \dotsb \le

1243: 3^{2^{s-y-1}}\lambda_{y+1}^{2^{s-y-1}}.

1244: \end{equation*}

1245: That is,

1246: \begin{equation*}

1247: \log_3\lambda_{s}\le \log_3\left(3^{2^{s-y-1}}\lambda_{y+1}^{2^{s-y-1}}\right)=

1248: 2^{s-y-1}\left( 1+\log_3 \lambda_{y+1} \right).

1249: \end{equation*}

1250: Since our first goal is to have $\lambda_s\le g^{-1/6}$, we should find $\bar{s}$ such that

1251: \begin{equation*}

1252: \log_3 \lambda_{\bar{s}}\le -\frac{\log_3 g}{6}.

1253: \end{equation*}

1254: We can get this by taking $\bar{s}$ such that

1255: \begin{equation*}

1256: 2^{\bar{s}-y-1}\left( 1+\log_3 \lambda_{y+1} \right)\le -\frac{\log_3 g}{6}.

1257: \end{equation*}

If we choose the arbitrary constant $x$ of Phase~1 to be strictly smaller than $1/3$, we obtain
that $1+\log_3 \lambda_{y+1}$ is negative, and the above inequality is satisfied by some
$\bar{s}=O(\log\log g)$.

1261: Therefore, by the union bound over the $\bar{s}-y-1$ steps of

1262: Phase~2, the whole Phase~2 is made of lucky steps for all the groups in $S^+$ and $D^+$ with

1263: probability at least

1264: \begin{equation*}

1265: 1-9(\bar{s}-y-1)ge^{-(\alpha+\beta)g^{\frac{1}{2}}}

1266: =1-O\left(ge^{-(\alpha+\beta)g^{\frac{1}{2}}}\log\log g

1267: \right).

1268: \end{equation*}
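The dependence of $\bar{s}$ on $g$ can be made explicit as follows; this is only a sketch, and the shorthand
$\kappa:=-(1+\log_3\lambda_{y+1})$ is introduced here for convenience.
If $x<1/3$, then $\lambda_{y+1}\le x$ gives $\kappa\ge -(1+\log_3 x)>0$, and the condition
$2^{\bar{s}-y-1}\left(1+\log_3\lambda_{y+1}\right)\le-\frac{\log_3 g}{6}$ reads
\begin{equation*}
2^{\bar{s}-y-1}\ge\frac{\log_3 g}{6\kappa},
\qquad\text{that is,}\qquad
\bar{s}\ge y+1+\log_2\frac{\log_3 g}{6\kappa}.
\end{equation*}
Since $y$ and the lower bound on $\kappa$ are constants, the smallest such $\bar{s}$ is $O(\log\log g)$.
As a numerical illustration with the sample values $x=1/4$ and $g=2^{24}$ (chosen here only as an example),
we have $g^{-1/6}=2^{-4}=0.0625$, and iterating $\lambda\mapsto 3\lambda^2$ directly from $\lambda=1/4$ gives
approximately $0.1875$, $0.1055$, and $0.0334$, so that three lucky steps already bring $\lambda_s$ below $g^{-1/6}$.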

1269:

We have shown that, after $\bar{s}=O(\log\log g)$ steps, the maximum degree of

1271: the conflict graph~$G_{\pi}$ is at most $g^{5/6}$ with high probability.

1272: This is enough to get the claim of our theorem

1273: by combining Phase~1 and Phase~2, and then using Lemma~\ref{lem:phase3}.

1274: \end{proof}

1275:

1276:

We remark that all transmissions occurring during slots~3 and~4
are just acks, that is, ``empty'' messages consisting of a header with no payload.
When packets are very long, it may be more efficient to divide the 5 slots into 2 ``short'' slots
and only 3 ``long'' slots, hence profiting from the homogeneity of the operations
within the same slot in our routing algorithm.

1282:

1283: Note an important property of our algorithm:

1284: processor~$i$ requires enough memory to store at most three packets: one is the original packet~$p_i$,

1285: the second is the

1286: packet whose destination is processor~$i$, and the third

1287: is a copy of another packet as received from group~$\Delta(i)$.

1288: However, if we can assume that packet $p_i$ exits

1289: the network the slot after $p_i$ got to its destination $\pi(i)$,

1290: then the requirement on the internal capacity of processors drops

1291: to only $2$ packets.

1292: Similarly, if we can assume that the input packets are stored on an external feeding line,

1293: then the internal storage requirement drops to $1$.

1294:

1295:

1296: \subsection{The General Case}

1297:

Let us start from the case when $d>g$. A natural approach to solve the problem is to perform
two stages: Stage~1 routes the packets until the degree of the conflict graph
is at most $g$; then Stage~2 uses the randomized algorithm described in the previous
section to route the remaining packets in $O(\log\log g)$ slots. Since at most
$g$ packets can be moved without conflicts from each group in each slot, $(d-g)/g$ is a simple
lower bound on the number of slots used in Stage~1.
In the following, we show that we are only a constant factor
away from this lower bound, and we determine this factor precisely.

1306:

1307: Consider a group~$a\in{\mathbb{N}}_g$.

1308: From this group, there are $d>g$ packets

1309: willing to go to destination. If we let every packet choose

1310: a random destination group and try to reach that group, when $d$ is large (it is

1311: enough that $d=\Omega(g\log g)$) every coupler will have a conflict with high probability

and no packet is delivered. Clearly, this is not what we want to happen. So, the idea for the
first stage of the algorithm is a small modification of the randomized algorithm:
Before participating in the step, every processor with a packet tosses a coin
that says ``yes'' with probability~$p$. Only those processors that get a ``yes'' are allowed to
participate and send their packet.

1317:

1318: In the first step, it is best to choose $p$ equal to $g/d$, in such a way

that $g$ packets are sent in expectation.

1320: This value maximizes the expected number of conflict-less

1321: communications, and thus the number of packets that survive slot~1 and slot~2.

Later on, $p$ has to
be iteratively increased using a fixed law according to the expected reduction of the number
of packets left in each group.

1325: When at most $g$ packets are left in each group with high

1326: probability, then

1327: we can set $p$ to one, and so proceed with the same algorithm we propose for the case when

1328: $d=g$.

1329:

To understand which law is the most efficient, it is important
to understand the expected number of packets that are delivered in each step
of the algorithm. Informally speaking, our hope is that exactly $g$ packets from each
group participate in every step of the first stage of the algorithm.

1334: Under this assumption,

1335: we know that approximately $ge^{-1}$ packets of each group will survive the first slot. At the beginning

1336: of the second slot, these packets are somewhat randomly scattered in the network (not

1337: uniformly at random, unfortunately, as we know from the previous section). If everything

1338: goes just like in the first slot, and this is far from being obvious since the destination is

1339: \emph{not random} now and the packets are \emph{not} distributed

1340: uniformly

1341: at random,

1342: we can hope that $g\exp \{-(1+e^{-1})\}$ packets from each group

1343: survive the second slot

1344: as well, and are thus safely delivered. If this is the case, $\exp \{1+e^{-1}\}((d-g)/g)$ steps

are enough to reduce the number of packets from $d$ to $g$ in expectation.
The following theorem shows that what actually happens is exactly
what we can best hope for. Now, we proceed formally.

1348: \begin{theorem}

1349: \label{thm:generalcase}

1350: Let $c=\exp (1+e^{-1})\approx 3.927$. A $\POPS$ network can

1351: route any permutation in $5c\lceil d/g\rceil+o(d/g)+O(\log\log g)$ slots

1352: with high probability.

1353: \end{theorem}

1354: \begin{proof}

1355: The idea of the algorithm

1356: is to use $\lceil(c+\epsilon(g))(\frac{d}{g}-1)\rceil$ steps, where $\epsilon(g)=o(1)$,

1357: to reduce the maximum degree of the conflict graph to at most $g$

1358: with high probability. Since every step consists of 5 slots, we then get the claim by

1359: Theorem~\ref{thm:fondamentale}.

1360:

1361: Every step~$s$, $s=1,\dotsc, \lceil(c+\epsilon(g))(\frac{d}{g}-1)\rceil$, is similar to the standard step

1362: of the randomized routing

1363: algorithm, with the difference that, before choosing its random color during slot~1, every packet

independently tosses a coin and participates in the step with probability

1365: \begin{equation*}

1366: \frac{g}{d-\frac{g(s-1)}{c+\epsilon(g)}}.

1367: \end{equation*}

1368: Our claim is that, at the beginning of step~$s$, $s=1,\dotsc, \lceil(c+\epsilon(g))(\frac{d}{g}-1)\rceil+1$,

1369: the degree of the conflict graph is at

1370: most $d_s:=d-\frac{g(s-1)}{c+\epsilon(g)}$ with high probability.

1371: As a consequence, when

1372: $s=\lceil(c+\epsilon(g))(\frac{d}{g}-1)\rceil+1$, we get $d_s\le g$ as desired.

1373: The claim is certainly true when

1374: $s=1$. Assume it is true at the beginning of

1375: step~$s\le\lceil(c+\epsilon(g))(\frac{d}{g}-1)\rceil$.

1376: We show that it is true at the beginning of step~$s+1$ as well.
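
For illustration, consider the sample instance $d=4g$ with $c+\epsilon(g)=4$
(the instance is chosen here only as an example, and $p_s$ is our shorthand for the participation probability at step~$s$).
The number of steps is $\lceil 4(4-1)\rceil=12$, and
\begin{equation*}
d_s=4g-\frac{g(s-1)}{4}=\frac{(17-s)g}{4},
\qquad
p_s=\frac{g}{d_s}=\frac{4}{17-s}.
\end{equation*}
Hence $p_1=1/4$, $p_5=1/3$, $p_9=1/2$, and $p_{12}=4/5$, while $d_{13}=g$:
the participation probability grows from $g/d$ towards one as the degree bound $d_s$ decreases towards $g$.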

1377:

1378: Let $S_a$, $a\in S$, be the set of indices of

1379: the packets in group~$a$ that still have to be delivered at the beginning of

1380: step~$s$. Similarly, let $D_b$, $b\in D$, be the set of indices of the

1381: packets in the whole network that still have to be delivered at the beginning of step~$s$ and

1382: that have group~$b$  as temporary destination group. By hypothesis, $|S_a|\le d_s$ and

1383: $|D_b|\le d_s$ for all $a\in S$ and $b\in D$.

1384: Our first goal is to prove that at the beginning of step $s+1$ the degree of the conflict

1385: graph is at most $d_{s+1}$ with high probability.

1386:

For every packet~$p_i$ yet to be routed, let random variable $P_i$ be equal to 1 if packet~$p_i$
participates in step~$s$, and 0 otherwise. Random variable $P_a=\sum_{i\in S_a} P_i$
counts the number of packets in group~$a$ that participate in step~$s$. The expectation of $P_a$
can be computed as follows:

1391: \begin{equation*}

1392: \E[P_a]=\sum_{i\in S_a} \E[P_i]=\frac{|S_a|g}{d_s}.

1393: \end{equation*}

1394: And, clearly, $\E[P_a] \leq g$.

1395: Since random variables $P_i$ are independent, the Chernoff bound~\cite{mr95,as00}

1396: (note that in~\cite{mr95} this bound appears in a different yet stronger form)

1397: is enough to claim that for every $\delta>0$

1398: \begin{equation*}

\PR\left[ P_a<(1-\delta)\frac{|S_a|g}{d_s}\right]\le e^{-\frac{\delta^2|S_a|g}{2d_s}}\le

1400: e^{-\frac{\delta^2d_{s+1}g}{2d_s}}\le e^{-\frac{\delta^2g}{4}}

1401: \end{equation*}

1402: and

1403: \begin{equation*}

\PR\left[ P_a>(1+\delta)g\right]\le e^{-\frac{\delta^2|S_a|g}{2d_s}}\le

1405: e^{-\frac{\delta^2d_{s+1}g}{2d_s}}\le e^{-\frac{\delta^2g}{4}}.

1406: \end{equation*}

1407: Let $S'_a$, $a\in S$, be the set of indices of

the packets in group~$a$ that participate in step~$s$.

1409: Random variable $P_a$ is thus equal to $|S'_a|$.

1410: Therefore, for every $\delta>0$

1411: \begin{equation}

(1-\delta)\frac{|S_a|g}{d_s}\le |S'_a| \le (1+\delta)g

1413: \end{equation}

1414: with probability at least $1-2e^{-\delta^2g/4}$.

1415: Since a similar result holds for every $a\in S$ and

1416: $b\in D$, we also know that for every $\delta>0$

1417: \begin{align}

1418: \label{eqn:S0lower}

& (1-\delta)\frac{|S_a|g}{d_s}\le |S'_a| \le (1+\delta)g,\\
\label{eqn:D0lower}
& (1-\delta)\frac{|D_b|g}{d_s}\le |D'_b| \le (1+\delta)g,

1422: \end{align}

1423: hold for every $a\in S$ and $b\in D$, with probability at least

1424: \begin{equation}

1425: \label{eqn:azumaS0}

1426: 1-4ge^{-\delta^2g/4},

1427: \end{equation}

1428: by the union bound over the $2g$ nodes of the conflict graph.

1429:

1430: Clearly, we have nothing to show about the nodes in the conflict graph that have degree smaller

1431: than or equal to $d_{s+1}$. So, we define sets $S^+\subseteq S$ and

1432: $D^+\subseteq D$, which collect the nodes with degree

larger than $d_{s+1}$, and focus on the nodes in these sets.
Consider an arbitrary group~$a\in S^+$, and assume that the bounds in
Equations~\ref{eqn:S0lower} and~\ref{eqn:D0lower} hold for every $a\in S$ and $b\in D$.

1436: Now, we can perform the same analysis as in the proof of Theorem~\ref{thm:fondamentale}.

1437: Similarly to Equation~\ref{eqn:lowerXtilde}, we know that

1438: \begin{equation*}

1439: \E[X_a^2]  \ge (1-\delta)|S_a^1|\left(1-\frac{1}{g}\right)^{|S_a^1|-1}\ge

1440: (1-\delta)|S_a^1| e^{-|S_a^1|/g},

1441: \end{equation*}

1442: with high probability.

In the next equation, we will use the following two facts: $xe^{-x/g}\le ye^{-y/g}$
whenever $x\le y\le g$, and $xe^{-x/g}$ attains its maximum at $x=g$.

1445: Clearly, $|S_a^1|\le g$ (there

1446: are only $g$ couplers from group~$a$).

1447: So, we get

1448: \begin{align*}

1449: \E[X_a^2] & \ge (1-\delta)|S_a^1| e^{-|S_a^1|/g}\ge\\

1450: & \ge (1-\delta)^2|S'_a| e^{-|S'_a|/g} e^{-|S'_a| e^{-|S'_a|/g}/g}\ge\\

& \ge (1-\delta)^3  \frac{|S_a|g}{d_s}e^{-1}e^{-e^{-1}},

1452: \end{align*}

1453: with high probability.

1454: By setting $\delta=g^{-1/3}$ in the above equation, with high probability we get

1455: \begin{equation*}

1456: X_a^2 \ge \frac{|S_a|}{d_s}\frac{g}{c+\epsilon(g)},

1457: \end{equation*}

1458: where $c=e^{1+e^{-1}}$ and $\epsilon(g)=o(1)$.

Since $X_a^2$ is the number of packets in group~$a$ that are delivered
to destination during step~$s$, the degree of group~$a$ in the conflict graph at the
beginning of step~$s+1$ is

1462: \begin{equation*}

1463: |S_a|- X_a^2 \le |S_a|- \frac{|S_a|}{d_s}\frac{g}{c+\epsilon(g)} \le d_s-\frac{g}{c+\epsilon(g)}

1464: = d_{s+1}.

1465: \end{equation*}

1466: The same result can be shown for every $a\in S^+$ and $b\in D^+$. By the union bound over the

1467: $\lceil(c+\epsilon(g))(\frac{d}{g}-1)\rceil$ steps required, and over the $2g$ nodes in the conflict graph,

1468: and by Equation~\ref{eqn:azumaS0} and a corresponding version of Equation~\ref{eqn:azumafinale},

1469: the degree of the conflict graph is reduced below $g$ with probability at least

1470: \begin{equation*}

1471: 1-\left(9ge^{-\delta^2 (1-\delta)^4 g^{5/6}/8e^2}+4ge^{-\delta^2g/4}\right).

1472: \end{equation*}

1473: Note that this is $1-o(1)$ as $g$ grows.

1474: \end{proof}
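
The constant $c$ can be checked numerically against the informal argument given before the theorem; the figures below are rounded to four digits and are meant only as an illustration.
If about $g$ packets of a group participate in a step, a fraction of about $e^{-1}\approx 0.3679$ of them survives slot~1 and,
of those, a fraction of about $e^{-e^{-1}}\approx 0.6922$ survives slot~2, so that
\begin{equation*}
e^{-1}\cdot e^{-e^{-1}}=e^{-(1+e^{-1})}\approx 0.3679\times 0.6922\approx 0.2546\approx\frac{1}{3.927}.
\end{equation*}
In other words, roughly one quarter of the $g$ participating packets of each group is delivered in each step,
which is where the value $c=\exp(1+e^{-1})\approx 3.927$ comes from.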

1475:

1476: To get a feeling of the performance of our randomized algorithm, we can set

1477: $\epsilon(g)\approx 0.073$ in the proof of the above theorem, in such a way that

1478: $c+\epsilon(g)=4$. The result is claimed in the following corollary.

1479: \begin{corollary}

1480: \label{cor:general}

1481: A $\POPS$ network can route any permutation in $\frac{20d}{g}+O(\log\log g)$ slots with high probability.

1482: \end{corollary}

1483:

1484: \section{Experiments}

1485: \label{sect:exp}

1486:

1487: Our results in Theorems~\ref{thm:fondamentale} and~\ref{thm:generalcase} are

1488: asymptotic. In principle, it could thus be possible that the

1489: randomized algorithm does not perform well in practice. This is not the case.

1490: Experiments show that it outperforms the algorithm

1491: in~\cite{ds-IEEETPDS03} even on networks as small as a ${\mathrm{POPS}}(2,2)$,

1492: and proves to be exponentially faster when $d$ and $g$ grow.

1493:

1494: The algorithm in~\cite{ds-IEEETPDS03} is claimed to run in $\frac{8d}{g}\log^2 g+

1495: \frac{21d}{g}+3\log g+7$ slots. However, the authors make a small mistake when saying

1496: that Leighton's implementation of the odd-even merge sort algorithm is composed of

1497: $\log^2 n$ steps. The actual complexity is only $\frac{\log n(1+\log n)}{2}\approx 2\log^2 g$ steps.

1498: So, the running time of the routing algorithm in~\cite{ds-IEEETPDS03} is

$\frac{4d}{g}\log^2 g+\frac{2d}{g}\log g+\frac{21d}{g}+3\log g+7$ slots, which is smaller,
and this is what we will use in the following.
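
As a sanity check of the corrected expression, it can be evaluated on the smallest and the largest network with $d=g$
considered in the experiments, assuming base-2 logarithms (an assumption made here, which is consistent with
the values reported in column~B of Table~\ref{tab:esperimenti}).
For $n=4$, that is $d=g=2$ and $\log g=1$,
\begin{equation*}
4\cdot 1+2\cdot 1+21+3\cdot 1+7=37,
\end{equation*}
and for $n=16{,}777{,}216$, that is $d=g=4096$ and $\log g=12$,
\begin{equation*}
4\cdot 144+2\cdot 12+21+3\cdot 12+7=576+24+21+36+7=664 .
\end{equation*}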

1501:

1502: To perform the experiments, we built a simulator for the POPS network. It is written in C++

1503: and simulates the network at a message level. That is, for every message in the real network,

1504: there is a message in the simulator.

1505: Processors (implemented as instances of a class \texttt{Processor}) locally take decisions about the next step to perform, and couplers (implemented as instances of a class \texttt{Coupler}) locally

1506: propagate messages or stop them in case of conflicts.

1507:

1508: Then, we implemented our randomized algorithm in the

simulator, slot by slot. We have been conservative: no theoretical result is taken for granted, and
the randomized algorithm is simulated message by message.

1511: Not surprisingly, slots~3, 4, and 5 prove to be conflict-less, supporting what is proven

1512: in Proposition~\ref{pro:conflictless}. So, whenever a copy survives slots~1 and~2

1513: it reaches its final destination,

1514: and the associated ack successfully gets to the source processor.

1515: Moreover, three buffers in every processor~$i$ (one for packet~$p_i$, one for packet~$p_{\pi^{-1}(i)}$,

1516: and the third for floating copies of other packets) are enough.

1517:

Figure~\ref{fig:esperimento-1} shows
the average over a large number of experiments in

1520: the case when $d=g$. The number of processors $n=dg$ goes from 4 to 16,777,216. The

1521: permutation in input is chosen

1522: uniformly

1523: at random from the class of all possible permutations.

1524: It is clear, from the results shown in the figure,

1525: that our algorithm is much faster than the algorithm

1526: in~\cite{ds-IEEETPDS03} even in practice.

Actually, our algorithm outperforms its competitor for all network sizes,
hence putting aside any possible concern about the hidden constants.

1529: The performance of our algorithm is so good

1530: that it is actually hard to appreciate it from Figure~\ref{fig:esperimento-1}.

1531: Hence, Table~\ref{tab:esperimenti} shows the exact numerical

1532: results.

1533: \begin{figure*}

1534: \centering\includegraphics{esperimento-1}

1535: \caption{Performance of our randomized routing algorithm against the routing

1536: algorithm proposed in~\protect\cite{ds-IEEETPDS03}.

1537: Case when $d=g$. The number of

1538: processors goes from 4 to 16,777,216 (note that axis $x$ is in logscale).}

1539: \label{fig:esperimento-1}

1540: \end{figure*}

1541: \begin{table}

1542: \centering\begin{tabular}{|r||r|r||r|r||r|r|}

1543: \hline

1544: \multicolumn{1}{|c||}{$n$} & \multicolumn{2}{|c||}{$d=g$} &

1545: \multicolumn{2}{|c||}{$d=4g$} & \multicolumn{2}{|c|}{$d=16g$}\\

1546: \hline

1547: & \multicolumn{1}{|c|}{A} &

1548: \multicolumn{1}{|c||}{B} & \multicolumn{1}{|c|}{A} &

1549: \multicolumn{1}{|c||}{B} & \multicolumn{1}{|c|}{A} &

1550: \multicolumn{1}{|c|}{B}\\

1551: \hline

1552: 4 & 14.75 & 37 & - & - & - & - \\

1553: \hline

1554: 16 & 20.90 & 54 & 71.40 & 118 & - & -\\

1555: \hline

1556: 64 & 27.35 & 79 & 82.80 & 177 & 317.90 & 442 \\

1557: \hline

1558: 256 & 30.10 & 112 & 87.15 & 268 & 322.45 & 669 \\

1559: \hline

1560: 1,024 & 32.50 & 153 & 92.60 & 391 & 343.10 & 1,024 \\

1561: \hline

1562: 4,096 & 34.50 & 202 & 94.00 & 546 & 345.60 & 1,507 \\

1563: \hline

1564: 16,384 & 35.20 & 259 & 94.95 & 733 & 339.25 & 2,118 \\

1565: \hline

1566: 65,536 & 35.55 & 324 & 95.15 & 952 & 336.45 & 2,857 \\

1567: \hline

1568: 262,144 & 36.55 & 397 & 95.35 & 1,203 & 334.30 & 3,724 \\

1569: \hline

1570: 1,048,576 & 38.25 & 478 & 95.65 & 1,486 & 333.55 & 4,719 \\

1571: \hline

1572: 4,194,304 & 39.70 & 567 & 96.25 & 1,801 & 333.05 & 5,842 \\

1573: \hline

1574: 16,777,216 & 40.05 & 664 & 97.05 & 2,148 & 333.60 & 7,093 \\

1575: \hline

1576: \end{tabular}

1577: \caption{Number of slots to route a randomly chosen permutation by our randomized algorithm (A) and by the algorithm in

1578: \protect\cite{ds-IEEETPDS03} (B).}

1579: \label{tab:esperimenti}

1580: \end{table}

1581:

1582: Then, we tested our algorithm on POPS networks with $d$ larger than $g$. We performed

1583: two sets of experiments, one in which $d=4g$ and another in which $d=16g$. In both cases,

the number of processors goes up to 16,777,216, starting from 16 when $d=4g$ and from 64 when $d=16g$. We used
the algorithm as described in Corollary~\ref{cor:general}. Therefore, we expect
the routing to take $20\frac{d}{g}+O(\log\log g)$ slots, according to our theoretical results.

1587: In fact, the results that are shown in Table~\ref{tab:esperimenti},

1588: Figure~\ref{fig:esperimento-2}, and Figure~\ref{fig:esperimento-3}

1589: show that the hidden constants are

1590: very small, and that

1591: our algorithm dramatically outperforms the best deterministic algorithm known in the literature for all

1592: network sizes we tested. Finally, Table~\ref{tab:scartp} shows some more details: for each

1593: experiment, we report the average number of steps, the standard deviation, and the worst case

over one hundred runs. Note that, for sufficiently large networks, the standard deviation is very small (smaller than one);
therefore, the performance of our algorithm is almost always very close to expectation.
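
A rough reading of Table~\ref{tab:scartp} against the analysis is possible under the assumption
(made here; the text does not state it explicitly) that the simulated algorithm spends exactly
$\lceil 4(\frac{d}{g}-1)\rceil$ steps in the first stage:
\begin{equation*}
d=4g:\quad 4\cdot 3=12 \text{ steps},
\qquad
d=16g:\quad 4\cdot 15=60 \text{ steps}.
\end{equation*}
For $n=16{,}777{,}216$ the measured means are $19.41$ and $66.79$ steps, which leaves about $7.4$ and $6.8$ steps
for the remaining part of the routing. Both figures are comparable with the $8.00$ steps measured on the network of the same size with $d=g$.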

1596: \begin{figure*}

1597: \centering\includegraphics{esperimento-2}

1598: \caption{Performance of our randomized routing algorithm against the routing

1599: algorithm proposed in~\protect\cite{ds-IEEETPDS03}.

1600: Case when $d=4g$. The number of

1601: processors goes from 16 to 16,777,216 (note that axis $x$ is in logscale).}

1602: \label{fig:esperimento-2}

1603: \end{figure*}

1604: \begin{figure*}

1605: \centering\includegraphics{esperimento-3}

1606: \caption{Performance of our randomized routing algorithm against the routing

1607: algorithm proposed in~\protect\cite{ds-IEEETPDS03}.

1608: Case when $d=16g$. The number of

1609: processors goes from 64 to 16,777,216 (note that axis $x$ is in logscale).}

1610: \label{fig:esperimento-3}

1611: \end{figure*}

1612: \begin{table*}

1613: \centering\begin{tabular}{|r||r|r|c||r|r|c||r|r|c|}

1614: \hline

1615: \multicolumn{1}{|c||}{$n$} & \multicolumn{3}{|c||}{$d=g$} &

1616: \multicolumn{3}{|c||}{$d=4g$} & \multicolumn{3}{|c|}{$d=16g$}\\

1617: \hline

1618: & \multicolumn{1}{|c|}{$\mu$} &

1619: \multicolumn{1}{|c|}{$\sigma$} & \multicolumn{1}{|c||}{max} & \multicolumn{1}{|c|}{$\mu$} &

1620: \multicolumn{1}{|c|}{$\sigma$} & \multicolumn{1}{|c||}{max} & \multicolumn{1}{|c|}{$\mu$} &

1621: \multicolumn{1}{|c|}{$\sigma$} & \multicolumn{1}{|c|}{max}\\

1622: \hline

1623: 4 & 3.15 & 1.94 & 12 & - & - & - & - & - & -\\

1624: \hline

1625: 16 & 4.43 & 1.03 & 8 & 14.33 & 4.22 & 35 & - & - & -\\

1626: \hline

1627: 64 & 5.39 & 0.79  & 7 & 16.13 & 2.81 & 27 & 56.88 & 4.52 & 82\\

1628: \hline

1629: 256 & 6.10 & 0.57  & 8 & 18.06 & 1.54 & 23 & 62.58 & 3.86 & 81\\

1630: \hline

1631: 1,024 & 6.50 & 0.53 & 8 & 18.45 & 0.86 & 20 & 66.26 & 5.16 & 94\\

1632: \hline

1633: 4,096 & 6.82 & 0.46 & 8 & 18.81 & 0.64 & 21 & 68.21 & 3.94 & 86\\

1634: \hline

1635: 16,384 & 7.04 & 0.20 & 8 & 18.95 & 0.46 & 20 & 67.65 & 1.76 & 73\\

1636: \hline

1637: 65,536 & 7.16 & 0.37 & 8 & 19.06 & 0.34 & 20 & 67.12 & 0.89 & 71\\

1638: \hline

1639: 262,144 & 7.30 & 0.46 & 8 & 19.09 & 0.29 & 20 & 66.88 & 0.59 & 69\\

1640: \hline

1641: 1,048,576 & 7.59 & 0.49 & 8 & 19.15 & 0.36 & 20 & 66.70 & 0.50 & 68\\

1642: \hline

1643: 4,194,304 & 7.92 & 0.27 & 8 & 19.21 & 0.41 & 20 & 66.59 & 0.49 & 67\\

1644: \hline

1645: 16,777,216 & 8.00 & 0.00 & 8 & 19.41 & 0.49 & 20 & 66.79 & 0.41 & 67\\

1646: \hline

1647: \end{tabular}

1648: \caption{Number of iterations (mean, standard deviation, and worst case over one hundred

1649: runs) to route a randomly chosen permutation by our randomized algorithm.}

1650: \label{tab:scartp}

1651: \end{table*}

1652:

1653: \section{Conclusion}

1654:

1655: In this paper, we introduced the fastest algorithms for both deterministic and randomized

1656: on-line permutation routing. Indeed, we have shown that any permutation can be routed on

1657: a $\POPS$ network either with $O(\frac{d}{g}\log g)$ deterministic slots, or, with high probability,

1658: with $5c\lceil d/g\rceil+o(d/g)+O(\log\log g)$ randomized slots, where

1659: $c=\exp (1+e^{-1})\approx 3.927$. The randomized algorithm shows that the POPS network

1660: is one of the fastest permutation networks ever. This can be of practical relevance, since

1661: fast switching is one of the key technologies to deliver the ever-growing amount of bandwidth

1662: needed by modern network applications.

1663:

1664: \section*{Acknowledgments}

1665:

1666: We are grateful to Alessandro Panconesi for helpful suggestions.

1667:

1668: \bibliographystyle{IEEEtran}

1669: \bibliography{r}

1670:

1671: \end{document}

1672: