0612:cs0612035/RR.tex

1: \documentclass{article}

2: \usepackage{RR}

3:

4: \usepackage{epsf}

5: \usepackage{amsmath}

6: \usepackage{latexsym}

7: \usepackage{amssymb}

8: \usepackage{amsthm}

9: \usepackage{times}

10: \usepackage{rotating}

11: \usepackage{setspace}

12: \usepackage{graphicx}

13: \usepackage{multicol}

14: \usepackage{latexsym}

15: \usepackage{amsfonts}

16: \usepackage{subfigure}

17:

18: \newtheorem{theorem}{Theorem}[section]

19: \newtheorem{lemma}[theorem]{Lemma}

20: \newtheorem{invariant}[theorem]{Invariant}

21: \newtheorem{definition}[theorem]{Definition}

22: \newtheorem{corollary}[theorem]{Corollary}

23:

24: %%%

25: % michel algorithm style

26: \newcounter{linecounter}

27: \newcommand{\linenumbering}{(\arabic{linecounter})}

28: \renewcommand{\line}[1]{\refstepcounter{linecounter}

29: \label{#1}

30: \linenumbering}

31: \newcommand{\resetline}{\setcounter{linecounter}{0}}

32: %%%

33:

34: \newif\ifcode

35: \codefalse

36:

37: %%%

38: % comment this if you do not have

39: % algpseudocode package for algorithms.

40: %\codetrue

41: %%%

42: \usepackage{macros}

43: \ifcode

44: \usepackage[ruled]{algorithm}

45: \usepackage{ioa_code}

46: \fi

47:

48: \newcommand{\remove}[1]{}

49:

50: \begin{document}

51:

52: \RRdate{D\'ecembre 2006}

53: %\RRNo{ }

54:

55: \title{{\bf Distributed Slicing in Dynamic Systems}}

56:

57: \RRauthor{{\bf Antonio Fern\'andez\thanks{Universidad Rey Juan Carlos, 28933 M\'ostoles, Spain. anto@gsyc.escet.urjc.es}\and Vincent Gramoli\thanks[2]{IRISA, INRIA Universit\'e Rennes 1 (ASAP Research Group) 35042 Rennes, France. \{vgramoli,akermarr,raynal\}@irisa.fr}\and Ernesto Jim\'enez\thanks{Universidad Polit\'ecnica de Madrid, 28031 Madrid, Spain. ernes@eui.upm.es}\and \\

58: Anne-Marie Kermarrec\thanksref{2}\and

59: Michel Raynal\thanksref{2}}

60: }

61: \authorhead{Fern\'andez \& Gramoli \& Jimenez \& Kermarrec \& Raynal} %% Ceci apparait sur chaque page paire.

62: \RRetitle{Distributed Slicing in Dynamic Systems} %% english title

63: \titlehead{Distributed Slicing in Dynamic Systems} %%titre court, sur chaque page impaire.

64: \RRtitle{Morcellement distribu\'e dans les syst\`emes dynamiques}

65: \RRnote{Contact author: Vincent Gramoli \texttt{vgramoli@irisa.fr}}

66:

67: \RRabstract{Peer to peer (P2P) systems are moving from application specific architectures to

68: a generic service oriented design philosophy.

69: This raises interesting problems in connection with providing useful

70: P2P middleware services that are capable of dealing with resource assignment

71: and management in a large-scale, heterogeneous and unreliable environment.

72: One such service, the slicing service, has been proposed to allow for an

73: automatic partitioning of P2P networks into groups (slices) that

74: represent a controllable amount of some resource and that are also relatively

75: homogeneous with respect to that resource, in the face of churn and

76: other failures.

77: In this report we propose two algorithms to solve the distributed slicing problem.

78: The first algorithm improves upon an existing algorithm that is based on

79: gossip-based sorting of a set of uniform random numbers.

80: We speed up convergence via a heuristic for gossip peer selection.

81: The second algorithm is based on a different approach: statistical approximation

82: of the rank of nodes in the ordering.

83: The scalability, efficiency  and resilience to dynamics of both algorithms

84: relies on their gossip-based models.

85: We present theoretical and experimental results to prove the viability

86: of these algorithms.}

87:

88: \RRresume{Un service de \emph{morcellement} d'un r\'eseau pair-\`a-pair permet de partitionner les n\oe uds du syst\`eme en plusieurs groupes appel\'es \emph{morceaux}.  Ce rapport pr\'esente deux algorithmes pour r\'esoudre le probl\`eme du morcellement r\'eparti.  Le premier algorithme am\'eliore un algorithme existant en acc\'erant son temps de convergence.

89: Le second algorithme utilise une approche diff\'erente d'appoximation statistique. Des r\'esultats th\'eoriques et experimentaux

90: montrent la viabilit\'e de nos algorithmes.}

91:

92:

93: \RRmotcle{Morcellement, Bavardage, Morceau, Va-et-vient, Pair-\`a-pair, Aggr\'egation, Grande \'echelle, Allocation de ressources.}

94: \RRkeyword{Slicing, Gossip, Slice, Churn, Peer-to-Peer, Aggregation, Large Scale, Resource Allocation.}

95: %% \RRprojet{Bibli} % cas d'un seul projet

96:

97: %\RRprojets{Bibli et Ami} %% cas de 2 projets.

98: \RRprojet{Asap}

99: %% \RRtheme{\THBio} % cas d'un seul theme

100: \RRtheme{\THCom} %% cas de 2 themes

101:

102: \URRennes

103:

104: \makeRR

105: %

106: %\author{

107: %\authorblockN{Antonio Fern\'andez\authorrefmark{1},

108: %Vincent Gramoli\authorrefmark{2},

109: %%Mark Jelasity\authorrefmark{3},

110: %Ernesto Jimenez\authorrefmark{4},  \\

111: %Anne-Marie Kermarrec\authorrefmark{2},

112: %Michel Raynal\authorrefmark{2}

113: %} \\

114: %\authorblockA{

115: %\hspace{-0.5cm}\begin{tabular}{ccccccc}

116: %\authorrefmark{1}{\small Universidad Rey} &  & \authorrefmark{2}{\small IRISA, INRIA}  & & \authorrefmark{4}{\small Universidad Polit\'ecnica} \\

117: %{\small Juan Carlos,} & & {\small et Universit\'e Rennes 1,}    & & {\small de Madrid,} \\

118: %{\small 28933 M\'ostoles, Spain.} & & {\small 35042 Rennes, France.}    & & {\small 28031 Madrid, Spain.} \\

119: %{\small anto@gsyc.escet.urjc.es} & &  {\small \{vgramoli,akermarr,raynal\}@irisa.fr} & & {\small ernes@eui.upm.es}

120: %\end{tabular}}}

121:

122:

123: %\newcommand{\pb}{\emph Self-Organizing Locality-based Algorithm for Distributed Slicing}

124:

125: \maketitle

126:

127: %\doublespace

128:

129: %\begin{abstract}

130: %Peer to peer (P2P) systems are moving from application specific architectures to

131: %a generic service oriented design philosophy.

132: %This raises interesting problems in connection with providing useful

133: %P2P middleware services that are capable of dealing with resource assignment

134: %and management in a large-scale, heterogeneous and unreliable environment.

135: %One such service, the slicing service, has been proposed to allow for an

136: %automatic partitioning of P2P networks into groups (slices) that

137: %represent a controllable amount of some resource and that are also relatively

138: %homogeneous with respect to that resource, in the face of churn and

139: %other failures.

140: %In this report we propose two algorithms to solve the distributed slicing problem.

141: %The first algorithm improves upon an existing algorithm that is based on

142: %gossip-based sorting of a set of uniform random numbers.

143: %We speed up convergence via a heuristic for gossip peer selection.

144: %The second algorithm is based on a different approach: statistical approximation

145: %of the rank of nodes in the ordering.

146: %The scalability, efficiency  and resilience to dynamics of both algorithms

147: %relies on their gossip-based models.

148: %We present theoretical and experimental results to prove the viability

149: %of these algorithms. \\

150: %\end{abstract}

151: %\vspace{-0.3cm}

152: %\noindent

153: %Keywords: Slicing, Gossip, Slice, Churn, Peer-to-Peer, Aggregation, Large Scale, Resource Allocation. \\

154: %Technical Areas: Operating Systems and Middleware, Internet Computing and Applications, Peer-to-Peer. \\

155: %Contact Author: Vincent Gramoli, vgramoli@irisa.fr

156:

157: \section{Introduction}

158:

159: \subsection{Context and Motivations}

160:

161: The peer to peer (P2P) communication paradigm has now become the prevalent

162: model to build large-scale distributed applications, able to cope  with both

163: scalability and system dynamics.  This is now a mature technology:

164: peer to peer systems are slowly moving from application-specific

165:  architectures to a generic-service oriented design philosophy.

166: More specifically, peer to peer protocols hold the promise to integrate into

167: platforms on top of which several applications with various requirements

168:  may cohabit. This leads to the interesting issue of resource

169: assignment or how to allocate a set of nodes for a given application.

170: Examples of  targeted platforms for such a service are telecommunication

171: platforms, where some set of peers may be automatically assigned to a

172: specific task depending on their capabilities,

173: testbed platform such as Planetlab~\cite{BBC+04},

174: or desktop-grid-like applications~\cite{A04}.

175:

176: In this context, the ordered slicing service has been

177: recently proposed as a building block to allocate resources,

178: i.e a set of nodes sharing some characteristics with respect to  a

179: given metric or attribute,  in a large-scale peer to peer

180: system. This service acknowledges the fact that peers

181: potentially offer heterogeneous capabilities as

182: revealed by many recent works describing heavy-tailed distributions

183:  of storage space, bandwidth, and uptime of peers~\cite{SGG02,BSV03,SR06}.

184:  The slicing  service \cite{JK06} enables peers in a large-scale

185:  unstructured network to self-organize into a partitioning,

186:  where partitions (slices) are connected overlay networks that represent

187:  a given percentage of some resource.

188: Such slices

189: can be allocated to specific applications later on.

190: The slicing is ordered in the sense that peers get ranked

191:  according to their capabilities expressed by an attribute value.

192:

193: Large scale dynamic distributed systems consist of many

194: participants that can join and leave

195: at will. Identifying peers in such systems that have a similar level

196:  of power or capability (for instance, in terms of bandwidth,

197: processing power, storage space, or uptime) in a completely

198: decentralized manner is a difficult task. It is even harder

199: to maintain this information in the presence of churn.

200:  Due to the intrinsic dynamics of contemporary peer to peer

201:  systems it is impossible to obtain accurate information about

202: the capabilities (or even the identity) of the system participants.

203: Consequently, no node is able to maintain accurate information about

204: all the nodes. This  disqualifies  centralized approaches.

205:

206: Taking this into account, we can summarize the ordered slicing problem

207: we tackle in this report:

208: we need to rank nodes depending on their capability, slice the network

209: depending on these capabilities and, most importantly, readapting

210: the slices continuously to cope with system dynamism.

211:

212: Building  upon the work on ordered distributed slicing

213: %%MJ I removed "seminal"; felt it's too self-promoting

214: proposed in \cite{JK06}, here we focus on the issue of \textit{accurate}

215:  slicing.

216: That is, we focus on improving the quality and stability of the slices, both

217: aspects being crucial for potential applications.

218:

219: %A slice is

220: %composed of a subset of nodes which are attribute values close

221: %to each other. Additionally we can consider slices ordered,

222: %so that the first slice contains the nodes with the

223: % smallest attribute values while the last one contains

224: %the nodes with the largest attribute values.

225: %A node determines the slice it belongs to by comparing

226: % its attribute value with the values of other nodes.

227: %Roughly speaking, depending on the attribute values of the other nodes

228: %it encounters, a node $i$  determines if its attribute value

229: %is relatively high or low and where it lies in the

230: %set of nodes. Thus, node $i$ tries to estimate

231: %its rank (dictated by its capability) to determine

232: %the slice it belongs to.

233: %%

234: %%MJ I felt the paragraph was reduntant and too detailed for an intro.

235: %%besides, we don't need it since nothing is built on this info here

236:

237:

238: \subsection{Contributions}

239:

240: %%MJ moved para below here from before the section heading

241: %%MJ also commented it out, becuase it is redundant

242: %% note that the intro is very long, and this section in particular

243:

244: %This report presents two main contributions. The first contribution is

245: %the speedup of the convergence of the \cite{JK06} algorithm. We then

246: %identified a few issues of~\cite{JK06} related to slice

247: %consistency under churn correlated to attribute values.

248: %We fix these issues

249: % by proposing a second algorithm where each node

250: %continuously approximates its rank relatively to the other peers.

251: % Both algorithms rely on  a gossip-based communication model,

252: %which has proved to be both lightweight and highly resilient to

253: %system dynamics.

254:

255:

256: The report presents two gossip-based solutions to slice the nodes

257:  according to their capability (reflected by an attribute value)

258:  in a distributed manner with high probability.

259: The first contribution of the report builds upon the distributed

260: slicing algorithm proposed in \cite{JK06} that we call the JK algorithm in the

261: sequel of this report.

262: The second algorithm is a different approach based on rank approximation

263: through statistical sampling.

264:

265: In JK, each node $i$

266: maintains a random number $r_i$, picked up uniformly at random (between 0 and 1),

267: and an attribute value $a_i$, expressing its capability according to a given

268: metric.

269: Each peer periodically gossips with another peer $j$,

270: randomly chosen among the peers it knows about. If the

271: order between $r_j$ and $r_i$ is different than

272: the order between $a_j$ and $a_i$, random values

273: are swapped between nodes. The algorithm ensures that

274: eventually the order on the

275: random values matches the order of the attribute ones.

276: The quality of the ranking can then be measured by using a

277:  global disorder measure expressing

278: the difference between the exact rank and the actual rank of each peer along the attribute value.

279:

280:  The first contribution of this report is to propose a local disorder

281:  measure so that a peer chooses the neighbor to communicate with in order to  maximize

282: the chance of decreasing the global disorder measure.

283:  The interest of this approach is to speed the convergence up.

284: We provide the analysis and experimental results of this improvement.

285:

286: Once peers are ordered along the attribute values, the slicing in JK

287: takes place as follows. Random values are used

288:  to calculate which slice a node belongs to. For example,

289: a slice containing 20\% of the  best nodes according to a given attribute,

290: will be composed of the nodes that end up holding random values greater than 0.8.

291: The accuracy of the slicing (independent from the accuracy of the ranking)

292:  fully depends on the uniformity of the random value

293:  spread between 0 and 1 and the fact that the proportion of

294: random values between 0.8 and 1 is approximately (but usually not exactly)

295: 20\% of the nodes. Another contribution of this report is to precisely

296: characterize the potential inaccuracy resulting from this effect.

297:

298: This observation means that the problem of ordering nodes

299: based on uniform random values is not fully sufficient for

300: determining slices.

301: This motivates us to find an alternative

302: approach to this algorithm and JK

303: in order to determine more precisely the slice each node belongs to.

304:

305: Another motivation for an alternative approach is related to churn and dynamism.

306: It may well happen that the

307:  churn is actually correlated

308: to the attribute value. For example, if the peers are

309: sorted according to their connectivity potential,

310: a portion of the attribute space (and therefore the random value space) might be

311: suddenly affected.

312: New nodes will then pick up new random values and eventually the distribution of random values will be skewed towards high values.

313:

314: The second contribution is an alternative algorithm solving these issues

315: by approximating the rank of the nodes in the ordering locally, without

316: the application of random values.

317:   The basic idea is that each node periodically estimates its rank

318: along the attribute axis depending of the attributes it has seen so far.

319: This algorithm is robust and lightweight due to its

320: gossip-based communication pattern: each node communicates

321: periodically with a

322: restricted dynamic neighborhood that guarantees connectivity and provides

323: a continuous stream of new samples.

324: Based on continuously aggregated information,

325: the node can determine the

326: slice it belongs to with a decreasing error margin.

327:  We show that this algorithm provides accurate estimation at the price of a slower convergence.

328:

329: \subsection{Outline}

330: The rest of the report is organized as follows: Section \ref{related} surveys some related work.

331: The system model is presented in Section \ref{model}.

332: The first contribution of an improved ordered slicing algorithm based

333: on random values is presented  in Section~\ref{JK+} and the second algorithm

334: based on dynamic ranking in Section~\ref{sec:ranking}.

335: %Section \ref{sec:dicussion} provide some material for discussion before concluding in Section \ref{conclusion}.

336: Section~\ref{conclusion} concludes the report.

337:

338:

339: \section{Related Work}

340: \label{related}

341: %A more general problem than the one investigated here

342: %is the sorting problem, where each node is ordered among others.

343: %A version of this problem has been referred as the

344: %\emph{external sorting problem}~\cite{DNS91}.  What caught our attention is that

345: %this provides a distributed sorting

346: %algorithm where the memory space of each processor does not necessarily

347: %depend on the input and it outputs a sorted sequence of values

348: %distributed among processors.

349: %The \emph{percentile finding} problem was defined in~\cite{IRV89} as

350: %dividing a set of values into equally sized sets.

351:

352: Most of the solutions proposed so far for ordering nodes come

353: from the context of databases,

354: where parallelizing query executions is used to improve efficiency.

355: A large majority of the solutions in this area rely on centralized gathering

356: or all-to-all exchange, which makes them unsuitable

357: for large-scale networks.

358: %

359: For instance, the \emph{external sorting problem}~\cite{DNS91} consists in

360: providing

361: a distributed sorting algorithm where the memory space of each

362: processor does not necessarily depend on the input. This algorithm

363: must output a sorted sequence of values distributed among

364: processors.

365: %

366: The solution proposed in~\cite{DNS91} needs a global merge of

367: the whole information, and

368: thus it implies a centralization of information.

369: Similarly, the \emph{percentile finding} problem~\cite{IRV89}, which

370: aims at dividing a set of values into equally sized sets,

371: requires a logarithmic number of all-to-all message exchanges.

372: %Despite the unaffordable number of messages it requires, it would

373: %be worthy to investigate an adaptation of this algorithm using

374: %gossip-based algorithm for the slice partitioning problem.

375:

376: Other related problems are the selection problem and the $\phi$-quantile search.

377: The selection problem~\cite{FR75,BFPRT72} aims at determining the $i^{th}$

378: smallest element with as few comparisons as possible.

379: The $\phi$-\emph{quantile} search (with

380: $\phi \in (0,1]$) is the problem to find among $n$ elements the $(\phi n)^{th}$ element.

381: Even though these problems look similar to our problem,

382: %in the sense

383: %that they all focus on finding the ordering of a node among the system, it remains

384: %different in the following sense:

385: they aim

386: at finding a specific node among all, while the distributed slicing problem aims

387: at solving a global problem where each node maintains a piece of

388: information.

389: Additionally, solutions to the quantile search problem like the one presented

390: in~\cite{KDG03} use an approximation of the system size. The same holds

391: for the algorithm in~\cite{SDCM06}, which uses similar ideas to determine the distribution

392: of a utility in order to isolate peers with high capability---i.e., super-peers.

393:

394: %As far as we know, the most similar work appeared in

395: As far as we know, the distributed slicing problem was studied in a P2P system

396: for the first time in~\cite{JK06}. In this report, a node with the $k^{th}$

397: smallest attribute value, among those in a system of size $n$, tries to estimate

398: its normalized index $k/n$.

399: %where

400: %each node $i$ is provided with a random value which is an estimate of the

401: %rank of node $i$.

402: %periodically exchange random value to reflect the rank

403: %of its attribute value among all attribute values of the system.

404: %

405: The \emph{JK algorithm} proposed in~\cite{JK06} works as follows.

406: Initially, each node draws independently and uniformly a random value in the

407: interval $(0,1]$ which serves as its first estimate of its normalized index.

408: Then, the nodes use a variant of Newscast~\cite{JMB05} to gossip among each

409: other to exchange random values when they

410: find that the relative order of their random values and that of their attribute

411: values do not match.

412: %The algorithm aims at exchanging random

413: %values among peers to reflect the order given by their attribute values.

414: %When for any $k$ the node with the $k^{th}$ attribute value has also

415: %the $k^{th}$ random value, the system is ordered and random value.

416: %%

417: This algorithm is robust in face of frequent

418: dynamics and guarantees a fast convergence to the same sequence of peers with

419: respect to the random and the attribute values. At every point in time the

420: current random value of a node serves to estimate the slice to which it

421: belongs (its slice).

422: %Consequently this solution

423: %is well-suited for ordering nodes in a large-scale dynamic systems.

424: %The first algorithm we propose speeds up this decrease.

425:

426: %The more similar work has been proposed in~\cite{JK06}, the authors investigate

427: %the slice partitioning problem in a deterministic way.

428: %However, their approach requires that each node generates a random

429: %value.  More precisely, each node maintains a view containing neighbors

430: %information such as their id, their attribute value, and their random value.

431: %The algorithm presented in~\cite{JK06} makes a node to select an arbitrary

432: %misplaced neighbor to swap its random value with.  A node estimates the slice

433: %it belongs to according to the rank given by the current random value it owns.

434: %This solves the slice partitioning problem if the random values are uniformly

435: %spread among all possible values.

436:

437: \section{Model}

438: \label{model}

439:

440: \subsection{System model}

441:

442: We  consider a  system $\Sigma$ containing  a set

443:  of $n$ uniquely identified nodes. (Value $n$ may vary over time, dynamics is explained below).

444: %%MJ don't know what this means, but not needed anyway: commented out

445: % we should use no footnotes at all in any case, or only if it is absolutely

446: %essential

447:  %\footnote{The value $n$ is observed instantaneously but may vary over time.}

448: The set of identifiers is denoted by $I$.

449: %and the number of nodes  by $n$.

450: Each node can leave and new nodes can join the system at any

451: time, thus the number of nodes is a function of time.

452: Nodes may also crash. In this report,

453: we do not differentiate between a crash and a voluntary node departure.

454:

455: Each node $i$ maintains an attribute value $a_i$, reflecting the node

456: capability according to a specific metric.

457: These attribute values over the network might have an arbitrary

458: skewed distribution.

459: Initially, a node

460: has no global information neither about the structure or size of the system nor

461: about

462: the attribute values of the other nodes.

463:

464: We can define a total ordering over the nodes based on their attribute value,

465: with the node identifier used to break ties.

466: Formally, we let $i$ precede $j$ if and only if $a_i < a_j$, or

467: $a_i = a_j$ and $i < j$. We refer to this totally ordered sequence as the

468: \emph{attribute-based sequence}, denoted by $A.\ms{sequence}$. The attribute-based

469: rank of a node $i$, denoted by $\alpha_i \in \{1,...,n\}$, is defined as the index

470: of $a_i$ in $A.\ms{sequence}$.

471: %

472: For instance, let us consider three nodes: 1, 2, and 3, with three

473: different attribute values $a_1 = 50$, $a_2 = 120$, and $a_3 = 25$.

474: In this case, the attribute-based rank of node $1$ would be

475: $\alpha_1 = 2$.

476: %

477: In the rest of the report, we assume that nodes are sorted according to a single

478: attribute and that each node belongs to a unique slice.

479: The sorting along several attributes is out of the scope of this report.

480:

481:

482: %Initially, node $i$ does not know the value of $\alpha_i$.

483: %%MJ Nor does it learn it later, at least in the ordering approach

484: %AMK: I don;t think we should mention this indeed.

485:

486:

487: \subsection{Distributed Slicing Problem}\label{ssec:pb}

488:

489: Let ${\cal S}_{l,u}$ denote the \emph{slice} containing every node $i$ whose

490: normalized rank, namely $\frac{\alpha_{i}}{n}$, satisfies $l < \frac{\alpha_{i}}{n}

491: \leq u$

492: where $l\in [0,1)$ is the slice lower boundary and $u\in (0,1]$ is

493: the slice upper boundary so that all slices represent adjacent intervals $(l_1, u_1], (l_2, u_2]$...

494: Let us assume that we partition the interval $(0,1]$ using a set of slices,

495: and this partitioning is known by all nodes.

496: The distributed slicing problem requires each node

497: to determine the slice it currently belongs to.

498: Note that the problem stated this way is similar to the

499: ordering problem, where each node has to determine its own index in

500: $A.\ms{sequence}$.

501: However, the reference to slices introduces special requirements related to

502: stability and fault tolerance, besides, it allows for future generalizations

503: when one considers different types of categorizations.

504:

505: %\subsection{Example of Slicing a Population}

506:  Figure~\ref{fig:size} illustrates an example

507: of  a population of 10 persons, to be sorted

508: against their height.

509: A partition of this population could be defined by two slices of the same

510: size: the group of short persons, and the

511: group of tall persons. This is clearly an example

512: where the distribution of attribute values is skewed towards 2 meters.

513:  The rank of each person in

514: the population and the two slices are represented on the bottom axis. Each

515: person is represented as a small cross on these axes.\footnote{Note that the shortest (resp. largest) rank is represented by a cross at the

516: extreme left (resp. right) of the bottom axis.}

517: Each slice is represented as an oval.  The slice $S_1 = {\cal

518: S}_{0,\frac{1}{2}}$ contains the five shortest persons and

519: the slice $S_2 = {\cal S}_{\frac{1}{2},1}$ contains the five tallest

520: persons.

521:

522: \begin{figure}

523: \centering\includegraphics[scale=0.6]{size_new.eps}

524: \caption{Slicing of a population based on a height attribute.}

525: \label{fig:size}

526: \end{figure}

527:

528: Observe that another way of partitioning the population could be to define the

529: group of short persons as that containing all the persons shorter than a

530: predefined measure (e.g., $1.65m$) and the group of tall persons as that containing

531: the persons taller than this measure.

532: %The groups differs from the slices because the slices rely tightly on

533: %the distribution of attribute values:  while one of these three

534: %groups could be empty even though the population is large, slices have a minimal size

535: %depending on the population size.

536: However, this way of partitioning would most certainly lead to an unbalanced

537: distribution of persons, in which, for instance a group might be empty (while a

538: slice is almost surely non-empty). Since the distribution of attribute values is

539: unknown and hard to predict, defining relevant groups is a difficult task.  For

540: example, if the distribution of the human heights were unknown, then the persons

541: taller than $1m$ could be considered as tall and the persons shorter than $1m$

542: could be considered as short.  Conversely, slices partition the population

543: into subsets representing a predefined portion of this population.

544: Therefore, in the rest of the report, we consider slices as defined as a proportion of the

545: network.

546:

547: \subsection{Facing Churn}

548:

549: %%MJ rewrote it from skretch. Still, I'm a bit hesitant, becase we do not

550: % actually solve these problems, do we?

551: %AMK: we do somehow in the second case

552:

553: Node churn, that is, the continuous arrival and departure of nodes is an intrinsic characteristic

554: of peer to peer systems and may significantly impact  the outcome, and more specifically the accuracy

555: of the slicing algorithm.

556: The easier case is when  the distribution of the attribute values of the departing and

557: arriving nodes are identical.

558: In this case, in principle, the arriving nodes must find their slices, but

559: the nodes that stay in the system  are mostly able to keep their slice assignment.

560: Even in this case however, nodes that are close to the border of a slice

561: may expect frequent changes in their slice due to the variance of the

562: attribute values, which is non-zero for any non-constant distribution.

563: If the arriving and departing nodes have different attribute distributions,

564:  so that the distribution in the actual network of live nodes keeps changing,

565: then this effect is amplified. However, we believe that this is a realistic assumption

566: to consider that the churn may be correlated to some specific values (for example

567: if the considered attribute is uptime or connectivity).

568:

569:

570: \section{Dynamic Ordering by Exchange of Random Values}\label{sec:ordering}

571: \label{JK+}

572:

573: This section proposes an  algorithm for the distributed slicing

574: problem improving upon the original JK algorithm \cite{JK06},

575:  by considering a local measure of the global disorder function.

576: In this section we present the algorithm along with the corresponding analysis and

577: simulation results.

578:

579: \subsection{On Using Random Numbers to Sort Nodes}

580:

581: This Section presents the algorithm built upon JK.

582: We refer to this algorithm as \emph{mod-JK} (standing for modified JK).

583: In JK,

584: %We build upon  the JK algorithm, where

585: each node $i$

586:  generates a number $r_i\in (0,1]$ independently and uniformly at random.

587: The key idea is to sort these random numbers with

588: respect to the attribute values by swapping these random numbers between nodes,

589: so that if $a_{i} < a_{j}$ then $r_{i} < r_{j}$.

590: Eventually,  the attribute values (that are fixed) and the

591: random values (that are exchanged)  should be sorted in the same order.

592: That is, each node would like to obtain the $x^{th}$ largest random number if

593: it owns the $x^{th}$ largest attribute value.

594: Let $R.\ms{sequence}$ denote the \emph{random sequence} obtained

595: by ordering all nodes according to their random number.

596: Let $\rho_i(t)$ denote the index of node $i$ in $R.\ms{sequence}$ at time

597: $t$. When not required, the time parameter is omitted.

598:

599: To illustrate the above ideas, consider that nodes 1, 2, and 3

600: %from the example above

601: from the previous example

602: have three

603: distinct random values: $r_1 = 0.85$, $r_2 = 0.1$, and $r_3 = 0.35$.

604: In this case, the index $\rho_1$ of node $1$ would be $3$. Since the

605: attribute values are $a_{i}=50$, $a_{2}=120$, and $a_{3}=25$, the algorithm

606: must achieve the following final assignment of random numbers: $r_{1}=0.35$,

607: $r_{2}=0.85$, and $r_{3}=0.1$.

608:

609: Once sorted, the random values are used to determine the portion of the network a peer belongs to.

610:

611: \subsection{Definitions}

612:

613:

614: \paragraph{View.}

615: Every node $i$ keeps track of some neighbors and their age.

616: The \emph{age} of neighbor $j$ is a timestamp, $t_j$, set to 0 when $j$ becomes a

617: neighbor of $i$.

618: Thus, node $i$ maintains an array

619: containing the id, the age,

620: the attribute value, and the

621: random value of its neighbors.  This array, denoted ${\mathcal N}_i$, is

622: called the \emph{view} of node $i$. The views of all nodes have the same size,

623: denoted by $c$.

624: %Neighborhoods are updated regularly following a simple protocol called

625: %newscast~\cite{JMB05}.

626:

627: \paragraph{Misplacement.}

628: A node participates in the algorithm by exchanging its rank with a misplaced

629: neighbor in its view.  Neighbor $j$ is misplaced

630: if and only if

631: \begin{itemize}

632: \item $a_i >  a_j$ and  $r_i <  r_j$, or

633: \item $a_i <  a_j$ and  $r_i >  r_j$.

634: \end{itemize}

635: We can characterize these two cases by the predicate $(a_j - a_i)(r_j - r_i)<0$.

636: %We call the \emph{misplaced measurement} of node $i$

637: %the difference $|\frac{a_i}{n} - r_i|$, and the larger this value is,

638: %the more $i$ is misplaced  (if this value is large, then $i$ estimate is far

639: %from reflecting its real rank).

640: %%MJ you never actually use this definition

641:

642: \paragraph{Global Disorder Measure.}

643:

644: In~\cite{JK06}, a measure of the relative

645: disorder of sequence $R.\ms{sequence}$ with respect to sequence

646: $A.\ms{sequence}$ was introduced, called the

647: \emph{global disorder measure (GDM)} and defined, for any time $t$, as

648: $$\ms{GDM}(t) = \frac{1}{n}\sum_{i}(\alpha_i - \rho(t)_i)^2.$$

649:

650: The minimal value of GDM is 0, which is obtained

651: when $\rho(t)_i = \alpha_i$ for all nodes $i$.  In this case the

652: attribute-based index of a node is equal to its random value index, indicating

653: that random values are ordered.

654:

655: \subsection{Improved Ordering Algorithm}

656:

657: In this algorithm, each node $i$ searches its own view ${\mathcal N}_i$

658:  for misplaced neighbors. Then,

659: one of them is chosen to swap  random value with. This process is repeated

660: until there is no global disorder.

661: In this version of the algorithm, we provide each node with  the capability

662: of measuring locally the disorder. This leads to  a new

663: heuristic for each node to determine

664: the neighbor to exchange with which decreases most the disorder.

665:

666: The proposed technique attempts to decrease the global

667: disorder in each exchange as much as possible via selecting the neighbor

668: from the view that minimizes the local disorder (or, equivalently,

669: maximizes the order \emph{gain}) as defined below.

670:

671: For a node $i$ to evaluate the gain of exchanging with a node $j$

672: of its current view ${\mathcal N}_i$,

673: we define its \emph{local disorder measure} (abbreviated \emph{LDM}$_i$).  Let $LA.\ms{sequence}_i$ and

674: $LR.\ms{sequence}_i$ be the local attribute sequence and the local

675: random sequence of node $i$, respectively.   These sequences are computed locally by $i$

676: using the information ${\mathcal N}_i \cup \{i\}$.  Similarly to $A.\ms{sequence}$

677: and $R.\ms{sequence}$, these are the sequences of neighbors where each node

678: is ordered according to its attribute value and random number, respectively.

679: Let, for any $j\in {\cal N}_i \cup \{i\}$, $\ell\rho_j(t)$ and $\ell\alpha_j(t)$ be the

680: indices of $r_j$ and $a_j$ in sequences $LR.\ms{sequence}_i$ and

681: $LA.\ms{sequence}_i$, respectively, at time $(t)$.

682: At any time $t$, the local disorder measure of node $i$ is defined as:

683: %

684: $$\ms{LDM}_i(t) = \frac{1}{c+1}\sum_{j\in {\mathcal N}_i(t)\cup \{i\}}(\ell\alpha_j(t) - \ell\rho_j(t))^2.$$

685: We denote by $G_{i,j}(t+1)$ the reduction on this measure that $i$ obtains after

686: exchanging its random value with node $j$ between

687: time $t$ and $t+1$. We define it as:

688: %{\small

689: \begin{eqnarray}

690: G_{i,j}(t+1) &=& \ms{LDM}_i(t) - \ms{LDM}_i(t+1), \notag \\

691: G_{i,j}(t+1) &=& \frac{(\ell\alpha_i(t) - \ell\rho_i(t))^2 + (\ell\alpha_j(t) -

692: \ell\rho_j(t))^2

693: - (\ell\alpha_i(t) - \ell\rho_j(t))^2 - (\ell\alpha_j(t) - \ell\rho_i(t))^2}{c+1}.

694: \label{eq:gain}

695: \end{eqnarray}

696: %}

697:

698: The heuristic used chooses for node $i$ the misplaced neighbor $j$ that

699: maximizes $G_{i,j}(t+1)$.

700: %\begin{eqnarray}

701: %$G_{i,k}(t+1) &=& \tau_i(t) - \tau_i(t+1), \notag \\

702: % %The gain that $i$ obtains after swapping with $k$ can be expressed as:

703: % G_{i,k}(t+1) &=& (\alpha_i - \rho_i)^2 + (\alpha_k - \rho_k)^2 - (\alpha_i - \rho_k)^2 - (\alpha_k - \rho_i)^2. \notag

704: % \end{eqnarray}

705: % That is, the node $k$ that maximizes the gain verifies, for any $j\in V_i$:

706: % \begin{eqnarray}

707: % G_{i,j}(t+1) &\leq& G_{i,k}(t+1), \notag \\

708: % % (p_j - r_j(t))^2 - (p_i - r_j(t))^2 - (p_j - r_i(t))^2 &\leq& (p_k - r_k(t))^2 - (p_i - r_k(t))^2 - (p_k - r_i(t))^2, \notag \\

709: % \alpha_i \rho_j(t) + \alpha_j \rho_i(t) - \alpha_j \rho_j(t)  &\leq& \alpha_i \rho_k(t) + \alpha_k \rho_i(t) - \alpha_k \rho_k(t). \Box\notag

710: %\end{eqnarray}

711:

712: \subsubsection{Sampling Uniformly at Random}

713:

714: The algorithm relies on the fact that potential misplaced nodes are found

715: so that they can swap their

716: random numbers thereby increasing order.

717: If the global disorder is high,  it is very likely that

718: any given node has misplaced neighbors in its view to exchange with.

719: %

720: %Nevertheless, the ordering slows down drastically when only few nodes are misplaced.

721: %To ensure continuous ordering, the underlying membership protocol must provide

722: %overlying layer with samples as uniformly drawn as possible.

723: Nevertheless, as the system gets ordered, it becomes more unlikely for a node

724: $i$ to have misplaced neighbors.

725: In this stage the way the view is composed plays a crucial role: if fresh

726: samples from the network are not available, convergence can be slower than

727: optimal.

728: %This problem is exacerbated if the neighbors

729: %of node $i$ are not drawn uniformly, since then some nodes, misplaced with

730: %respect to $i$, tend to

731: %remain misplaced since they never show up in the view of $i$.

732: %%MJ This is not true in this way, as non-uniform can even be better than

733: % uniform. I said something softer.

734:

735: Several protocols may be used to provide a random and dynamic sampling

736: in a peer to peer system such as Newscast~\cite{JMB05}, Cyclon~\cite{VGS05} or Lpbcast~\cite{JGKS04}.

737: They differ mainly by their \textit{closeness}

738: to the uniform random sampling of the neighbors and the way they handle churn.

739: In this report,

740: we chose to  use a variant of the Cyclon

741: protocol to construct and update the views~\cite{I05}, as it is reportedly

742: the best approach to achieve a uniform random neighbor set for all nodes.

743: %%%%% show that Cyclon creates more random views whereas Newscast

744: %tends to clusterize the communication graph.  Technically,

745: %the Cyclon graph (in which links are defined by the views) has a clustering

746: %coefficient similar to the

747: %clustering coefficient of a random graph~\cite{ER59}.

748: %More precisely, the coefficient provided by Cyclon is more than 1800 time closer

749: %to the coefficient of a random graph, than the coefficient provided by Newscast.

750: %

751: %A direct consequence is that with Cyclon a node has a higher chance of

752: %discovering new nodes than with Newscast so the algorithm described in

753: %Figure~\ref{alg:rand}

754: %can be expected to offer better convergence speed.

755:

756: %AMK: I actually removed almost evrything here and certainly did not claim

757: %that was a  contribution.

758:

759:

760: \subsubsection{Description of the Algorithm}

761: \label{sec:modifiedjk}

762:

763: \begin{table}

764:   \centering

765:   \begin{tabular}{|c|l|}

766: \hline

767: {\small {\bf Variable}} & {\small {\bf Description}} \\ \hline \hline

768: $j$ & {\small the identifier of the neighbor} \\ \hline

769: $t_j$ & {\small the age of the neighbor} \\ \hline

770: $a_j$ & {\small the attribute value of the neighbor} \\ \hline

771: $r_j$ & {\small the random value of the neighbor} \\ \hline

772: \end{tabular}

773: \caption{The array corresponding to the view entry of the neighbor $j$ ($j\in {\cal N}_i$).}

774: \label{table:entry}

775: \end{table}

776:

777:

778:

779: The algorithm is presented in Figure~\ref{alg:rand}.  The active thread at node $i$

780: runs the membership (gossiping) procedure ($\lit{recompute-view}()_i$)

781: and the

782: exchange of random values periodically.

783: As motivated above, the membership procedure, specified in

784: Figure~\ref{alg:rcyclon}, is similar to the Cyclon algorithm:

785: each node $i$ maintains a view ${\mathcal N_i}$ containing one

786: entry per neighbor.  The entry of a neighbor $j$ corresponds to a tuple

787: presented in Table~\ref{table:entry}.  Node $i$ copies its view, selects the

788: oldest neighbor $j$ of its view, removes the entry $e_{j}$ of $j$ from the copy

789: of its

790: view, and finally sends the resulting copy to $j$.  When $j$ receives the

791: view, $j$ sends its own view back to $i$ discarding possible pointers to

792: $i$, and $i$ and $j$ update their view with the one they

793: receive. This variant of Cyclon, as opposed to the original version,

794: exchanges all entries of the view at each step.

795:

796:

797: %\input{rand.alg}

798: \begin{figure}

799: \centering{

800: \fbox{

801: \begin{minipage}[ht!]{150mm}

802: \footnotesize

803: \renewcommand{\baselinestretch}{1.5}

804: \resetline

805: \begin{tabbing}

806: aaaaA\=aaaaaA\=aaaaaaA\kill

807: {\bf Initial state of node $i$} \\

808: \line{L00} \> $\ms{period}_{i}$, initially set to a constant; \\

809: $r_i$, a random value chosen in $(0,1]$;

810: $a_i$, the attribute value;   \\

811: $\ms{slice}_i \gets \bot$, the slice $i$ belongs to;

812: ${\mathcal N_i}$, the view; \\

813: $\ms{gain}_{j'}$, a real value indicating the gain achieved by exchanging with

814: $j'$; \\

815: $\ms{gain-max} = 0$, a real.

816: \\ ~ \\

817:

818:

819: %{\bf Active thread at node $i$} \\

820: %\line{L02} \> $\act{wait}(\ms{period_i})$ \\

821: %~(\ref{L02}a) \> $\act{recompute-view}()_i$ \\

822: %\line{L03} \> {\bf for} $j' \in {\mathcal N}_i$ \\

823: %~(\ref{L03}a) \> \T {\bf if} $\ms{gain}_{j'} \geq \ms{gain-max}$ {\bf then} \\

824: %~(\ref{L03}b) \> \T \T $\ms{gain-max} \gets \ms{gain}_{j'}$ \\

825: %~(\ref{L03}c) \> \T \T $j \gets j'$ \\

826: %~(\ref{L03}d) \> {\bf end for} \\

827: %\line{L04} \> $\act{send}(\lit{REQ}, r_i, a_i)$ to $j$ \\

828: %~(\ref{L04}a) \> $\act{recv}(\lit{ACK}, r_j')$ from $j$ \\

829: %\line{L05} \> $r_j \gets r_j'$ \\

830: %~(\ref{L05}a) \> {\bf if} $(a_j - a_i)(r_j - r_i) < 0$ {\bf then} \\

831: %~(\ref{L05}b) \> \T $r_i \gets r_j$ \\

832: %~(\ref{L05}c) \> \T $\ms{slice}_i \gets {\cal S}_{l,u}$ such that $l < r_i \leq u$ \\

833: % ~ \\

834:

835:

836: {\bf Active thread at node $i$} \\

837: \line{M01} \> $\act{wait}(\ms{period_i})$ \\

838: \line{M02} \> $\act{recompute-view}()_i$ \\

839: \line{M03} \> {\bf for} $j' \in {\mathcal N}_i$ \\

840: \line{M04} \> \T {\bf if} $\ms{gain}_{j'} \geq \ms{gain-max}$ {\bf then} \\

841: \line{M05} \> \T \T $\ms{gain-max} \gets \ms{gain}_{j'}$ \\

842: \line{M06} \> \T \T $j \gets j'$ \\

843: \line{M07} \> {\bf end for} \\

844: \line{M08} \> $\act{send}(\lit{REQ}, r_i, a_i)$ to $j$ \\

845: \line{M09} \> $\act{recv}(\lit{ACK}, r_j')$ from $j$ \\

846: \line{M10} \> $r_j \gets r_j'$ \\

847: \line{M11} \> {\bf if} $(a_j - a_i)(r_j - r_i) < 0$ {\bf then} \\

848: \line{M12} \> \T $r_i \gets r_j$ \\

849: \line{M13} \> \T $\ms{slice}_i \gets {\cal S}_{l,u}$ such that $l < r_i \leq u$ \\

850:  ~ \\

851:

852:

853: %% with period, the other nodes must send more frequently

854: %~(\ref{L01})  \> $\act{wait}(\ms{period_i})$ \\

855: %~(\ref{L02})  \> ${\mathcal N}_i \gets \act{recompute-view}()_i$ \\

856: %~(\ref{L03})  \> {\bf for} $j' \in {\mathcal N}_i$ \\

857: %~(\ref{L04}') \> \T {\bf if} $\lit{dist}(a_{j'},b) < dmin$ {\bf then} \\

858: %~(\ref{L05}') \> \T \T $\ms{dist-min} \gets \lit{dist}(a_{j'},b)$ \\

859: %~(\ref{L06}')  \> \T \T $j \gets j'$ \\

860: %\line{L11}    \> \T {\bf if} $a_{j'} \leq a_i$ {\bf then} $\ell_i \gets \ell_i + 1$ \\

861: %\line{L12}    \> \T $g_i \gets g_i + 1$ \\

862: %~(\ref{L07})  \> $\act{send}(\lit{REQ},r_i)$ to $j$ \\

863: %~(\ref{L08})  \> $\act{recv}(\lit{ACK},r_j)$ from $j$ \\

864: %\line{L13}   \> {\bf if} $a_{j'} \leq a_i$ {\bf then} $\ell_i \gets \ell_i + 1$ \\

865: %\line{L14}   \> $g_i \gets g_i + 1$ \\

866: %\line{L15}   \> $r_i \gets \ell_i / g_i$ \\

867: %\line{L16}   \> $\ms{slice} \gets {\cal S}_{l,u}$ such that $l < r_i \leq u$\\

868: % ~ \\

869:

870: {\bf Passive thread at node $i$ activated upon reception} \\

871: \line{R01} \> $\act{recv}(\lit{REQ}, r_j, a_j)$ from $j$ \\

872: \line{R02} \> $\act{send}(\lit{ACK}, r_i)$ to $j$ \\

873: \line{R03} \> {\bf if} $(a_j - a_i)(r_j - r_i) < 0$ {\bf then} \\

874: \line{R04} \> \T $r_i \gets r_j$ \\

875: \line{R05} \> \T $\ms{slice}_i \gets {\cal S}_{l,u}$ such that $l < r_i \leq u$

876:

877: \end{tabbing}

878: \normalsize

879: \end{minipage}

880: }

881: \caption{Dynamic ordering by exchange of random values.}

882: \label{alg:rand}

883: }

884: \end{figure}

885:

886: The algorithm for exchanging random values from node $i$ starts by measuring

887: the ordering that can be gained by swapping with each neighbor (Lines~\ref{M03}--\ref{M07}).

888: Then, $i$ chooses the neighbor $j \in {\mathcal N}_i$ that maximizes gain

889: $G_{i,k}$ for any of its neighbor $k$. Formally, $i$ finds

890: $j\in {\mathcal N}_i$ such that for any $k\in{\mathcal N}_i$, we have

891: \begin{eqnarray}

892: G_{i,j}(t+1) &\geq& G_{i,k}(t+1).  \label{eq:gainmax}

893: \end{eqnarray}

894: Using the definition of $G_{i,j}$ in Equation (\ref{eq:gain}), Equation (\ref{eq:gainmax}) is

895: equivalent to

896: \begin{eqnarray}

897: \ell\alpha_i(t) \ell\rho_j(t) + \ell\alpha_j(t) \ell\rho_i(t) - \ell\alpha_j(t)

898: \ell\rho_j(t)  &\geq& \ell\alpha_i(t) \ell\rho_k(t) + \ell\alpha_k(t) \ell\rho_i(t)

899: - \ell\alpha_k(t) \ell\rho_k(t). \notag

900: \end{eqnarray}

901: \noindent In Figure~\ref{alg:rand} of node $i$, we refer to $\ms{gain}_j$ as the value of

902: $\ell\alpha_i(t) \ell\rho_j(t) + \ell\alpha_j(t) \ell\rho_i(t) - \ell\alpha_j(t)

903: \ell\rho_j(t)$.

904: %\begin{eqnarray}

905: %G_{i,j}(t+1) &\geq& G_{i,k}(t+1), \notag \\

906: %\ell\alpha_i \ell\rho_j(t) + \ell\alpha_j \ell\rho_i(t) - \ell\alpha_j \ell\rho_j(t)  &\geq& \ell\alpha_i \ell\rho_k(t) + \ell\alpha_k \ell\rho_i(t) - \ell\alpha_k \ell\rho_k(t). \Box\notag

907: %\end{eqnarray}

908:

909: %%this $\ms{gain}$ (i.e.,

910: %neighbor $j$ whose swapping minimizes the local disorder measure of $i$).

911: From this point on, $i$ exchanges its random value $r_i$ with the random value $r_j$ of

912: node $j$ (Line~\ref{M10}). The passive threads are executed upon reception of a message.

913: In Figure~\ref{alg:rand}, when $j$ receives the random value $r_i$ of node

914: $i$, it sends back its own random value $r_j$ for the exchange to occur (Lines~\ref{R01}--\ref{R02}).

915: %

916: Observe that the attribute value of $i$ is also sent to $j$, so that $j$ can

917: check if it is correct to exchange before updating its own random number (Lines~\ref{R03}--\ref{R04}). Node

918: $i$ does not need to receive attribute value $a_{j}$ of $j$, since $i$ already has this

919: information in its view and the attribute value of a node never changes over time.

920:

921: \begin{figure}

922: \centering{

923: \fbox{

924: \begin{minipage}[b]{150mm}

925: \footnotesize

926: \renewcommand{\baselinestretch}{1.5}

927: \resetline

928: \begin{tabbing}

929: aaaaA\=aaaaaA\=aaaaaaA\kill

930: %{\bf Initial state of node $i$} \\

931: %\line{L00} \> $r_i$, a random value chosen in $(0,1]$.

932: %$a_i$, the attribute\\ value.

933: %${\mathcal N_i}$, the view.

934: %\\ ~ \\

935: %

936: {\bf Active thread at node $i$} \\

937: \line{V01} \> {\bf for} $j' \in {\mathcal N}_{i}$ {\bf do} $t_{j'} \gets t_{j'} + 1$ {\bf end for} \\

938: \line{V02} \> $j \gets j'': t_{j''} = \lit{max}_{j'\in {\mathcal N}_i}(t_{j'})$ \\

939: \line{V03} \> $\act{send}(\lit{REQ'}, {\mathcal N}_i \setminus \{e_j\} \cup \{\tup{i, 0, a_i, r_i}\})$ to $j$ \\

940: \line{V04} \> $\act{recv}(\lit{ACK'}, {\mathcal N}_{j})$ from $j$ \\

941: \line{V05} \> $\ms{duplicated-entries} = \{e: e.id \in {\mathcal N}_j \cap {\mathcal N}_i\}$ \\

942: \line{V06} \> ${\mathcal N}_i \gets {\mathcal N}_i \cup ({\mathcal N}_j \setminus \ms{duplicated-entries} \setminus \{e_i\})$ \\

943: ~ \\

944:

945: {\bf Passive thread at node $i$ activated upon reception} \\

946: \line{RV01} \> $\act{recv}(\lit{REQ'}, {\mathcal N}_{j})$ from $j$ \\

947: \line{RV03} \> $\act{send}(\lit{ACK'}, {\mathcal N}_{i})$ to $j$ \\

948: \line{RV04} \> $\ms{duplicated-entries} = \{e \in {\mathcal N}_j: e.id \in {\mathcal N}_j \cap {\mathcal N}_i\}$ \\

949: \line{RV05} \> ${\mathcal N}_i \gets {\mathcal N}_i \cup ({\mathcal N}_j \setminus \ms{duplicated-entries})$ \\

950: %{\bf last line says that we updates views and keep timestamp of i'e entries, even if entries are duplicated in Ni and Nj}

951: %\line{RV05} \> ${\mathcal N}_i \gets {\mathcal N}_i \cup ({\mathcal N}_j \setminus ({\mathcal N}_j \cap {\mathcal N}_i))$

952:

953: \end{tabbing}

954: \normalsize

955: \end{minipage}

956: }

957: \caption{$\lit{recompute-view()}$: procedure used to update the view based on a simple variant of the Cyclon algorithm.}

958: \label{alg:rcyclon}

959: }

960: \end{figure}

961:

962:

963: \subsection{Analysis of Slice Misplacement}

964: \label{sec:anal}

965:

966: In mod-JK, as in JK, the current random number

967: $r_{i}$ of a node $i$ determines the slice $s_{i}$ of the node. The

968: objective of both algorithms is to reduce the global disorder as quickly as

969: possible.  Algorithm mod-JK consists in choosing one neighbor among the possible neighbors

970: that would have been chosen in JK, plus the GDM of JK has been shown to fit an exponential

971: decrease~\cite{JK06}. Consequently mod-JK experiences also an exponential decrease of the global disorder.

972: Eventually, JK and mod-JK

973: ensure that the disorder has fully disappeared.

974: However, the accuracy of the slices heavily depends on the uniformity of the random value

975: spread between 0 and 1.  It may happen, that the distribution of the random values

976: is such that some peers decide upon a wrong slice. Even more problematic is the fact that this situation

977: is unrecoverable unless a new random value is drawn for all nodes.

978: This may be considered as an inherent limitation of

979: the approach.

980: For example, consider a system of size 2, where nodes 1 and 2 have the

981: random values $r_1=0.1$, $r_2=0.4$. If we are interested in creating

982: two slices of equal size, the first slice will be of size 2 and the second

983: of size zero, even after perfect ordering of the random values.

984:

985: Therefore, an important step is to characterize the inaccuracy of the

986: uniform distribution  to access the potential impact on

987: the  slice assignment

988: resulting from the fact that uniformly generated random numbers are

989: not distributed perfectly evenly throughout the domain.

990: First of all, consider a slice $S_p$ of length $p$.

991: In a network of $n$ nodes,

992: the number of nodes that will fall into this slice is a random variable $X$

993: with a binomial distribution with parameters $n$ and $p$.

994: The standard deviation of $X$ is therefore $\sqrt{np(1-p)}$.

995: This means that the relative proportional expected difference from the

996: mean (i.e., $np$) can be approximated as $\sqrt{(1-p)/(np)}$, which is very large

997: if $p$ is small, in fact, goes to infinity as $p$ tends to zero,

998: although a very large $n$ compensates for this effect.

999: For a ``normal'' value of $p$, and a reasonably large network, the variance

1000: is very low however.

1001:

1002: To stay with this random variable,

1003: the following result bounds, with high probability, its deviation from its

1004: mean.

1005: %Consequently, this bounds (w.h.p.) the number of nodes that can be misplaced in the system.

1006:

1007: \begin{lemma}

1008: For any $\beta \in (0,1]$, a slice $S_p$ of length  $p \in (0,1]$ has a

1009: number of peers

1010: $X \in [(1-\beta)np, (1+\beta)np]$ with probability at least

1011: $1-\epsilon$ as long as $p  \geq \frac{3}{\beta^2 n} \ln (2/\epsilon)$.

1012: \end{lemma}

1013: \begin{proof}

1014: The way nodes choose their random number is like drawing $n$ times, with replacement and

1015: independently uniformly at random, a value in the interval $(0,1]$.

1016: Let $X_1, ..., X_n$ be the $n$ corresponding independent identically distributed random

1017: variables such that:

1018: \begin{equation}

1019: \left\{\begin{array}{ll}

1020: X_i &= 1 \text{~if the value drawn by node $i$ belongs to~}S_p\text{~and~}\notag

1021: \\

1022: X_i &= 0 \text{~otherwise.}

1023: \end{array}\right.

1024: \end{equation}

1025:

1026: We denote $X= \sum_{i=1}^n X_i$ the number of elements of interval $S_p$ drawn

1027: among the $n$ drawings.

1028: The expectation of $X$ is  $np$.

1029: From now on we compute the probability that a bounded portion of the expected

1030: elements are misplaced.

1031: Two Chernoff bounds~\cite{MR95} give:

1032: \begin{equation}

1033: \left.\begin{array}{cc}

1034: \Pr[X\geq (1+\beta)np] &\leq e^{-\frac{\beta^2 np}{3}} \notag \\

1035: \Pr[X\leq (1-\beta)np] &\leq e^{-\frac{\beta^2 np}{2}} \notag

1036: \end{array}\right\} \\

1037: \Rightarrow \Pr[|X-np| \geq \beta np] \leq 2e^{-\frac{\beta^2 np}{3}},

1038: \end{equation}

1039:

1040: \noindent with $0 < \beta \leq 1$.

1041: That is, the probability that more than ($\beta$ time the number expected) elements are

1042: misplaced regarding to interval $S_p$ is bounded by $2e^{-\frac{\beta^2 n

1043: p}{3}}$. We want this to be at most $\epsilon$. This yields the result.

1044: \end{proof}

1045: %

1046: %The previous result shows that the number of nodes allocated to a given slice

1047: %is not far from the amount is should be.

1048: %

1049: %%MJ Note that you draw the wrong conclusion (commented out), as the simple fact I included

1050: % shows that if p is small then we are in trouble, and also if n is

1051: % not that large. The Chernoff shows the same, but in a more obscure way.

1052: % I repeat that these bounds are useful only if one wants to prove some

1053: % complexity results, and these bounds are not that tight anyway to be

1054: % very useful directly.

1055: % Besides, the "theorem" is a simple application of the Chernoff bound, and

1056: % is rather basic. I renamed it to "lemma", and I also feel the proof can

1057: % be significantly shortened. To something like: "simple application of

1058: % the Chernoff bound".

1059: % I left the derivation in though, with these notes...

1060:

1061: To measure the effect discussed above during the simulation experiments,

1062: we introduce the slice

1063: disorder measure (SDM) as the sum over all nodes $i$ of the

1064: distance between the slice $i$ actually belongs to and the slice $i$ believes

1065: it belongs to.

1066: %

1067: For example (in the case where all slices have the same size),

1068: if node $i$ belongs to the $1^{st}$ slice (according to its attribute

1069: value) while it thinks it belongs to the $3^{rd}$ slice (according to its rank

1070: estimate) then the distance for node $i$ is $|1-3| = 2$.

1071: %

1072: Formally, for any node $i$, let $S_{u_i, l_i}$ be the actual correct slice

1073: of node $i$ and let $S_{\hat{u_i}, \hat{l_i}}(t)$ be the slice $i$ estimates

1074: as its slice at time $t$.  The slice disorder measure is defined as:

1075: %

1076: $$\ms{SDM}(t) = \sum_{i}\frac{1}{u_i - l_i}\left|\frac{u_i+l_i}{2}-\frac{\hat{u_i}+\hat{l_i}}{2}\right|.$$

1077: %

1078: $\ms{SDM}(t)$ is minimal (equals 0) if for all nodes $i$,

1079: we have $S_{\hat{u_i}, \hat{l_i}}(t) = S_{u_i, l_i}$.

1080: %

1081: %

1082: %This

1083: %interval

1084: %is divided in $k$ subintervals $I_1, ..., I_k$ of length $\Delta_1, ..., \Delta_k \in (0,1]$, respectively.

1085: %Let $\Delta$ be the minimal length of these intervals, formally $\Delta = min_{\forall j} \{p\}$.

1086: %Without loss of generality fix $j$ such that $p$ is one of these subintervals.

1087: %

1088: %

1089: %

1090: %

1091: %We then bound (using additive bounds), with high probability, the number of misplaced

1092: %elements with respect to all subintervals.

1093: %\begin{eqnarray}

1094: %P &=& \sum_{j=1}^k \Pr[|X-np| > \beta np] \leq 2k e^{-\frac{\beta^2n\Delta}{3}}. \notag

1095: %\end{eqnarray}

1096: %

1097: %We want $P \leq \epsilon$:

1098: %\begin{eqnarray}

1099: %& 2k e^{-\frac{\beta^2 n \Delta}{3}} \leq \epsilon, \notag \\

1100: %\Rightarrow & -\frac{\beta^2n\Delta}{3} \leq \ln{\frac{\epsilon}{2k}}, \notag \\

1101: %\Rightarrow & \Delta \geq \frac{3}{\beta^2 n}\ln{\frac{2k}{\epsilon}}.\Box \notag

1102: %\end{eqnarray}

1103: %\end{proof}

1104: %

1105: %\begin{corollary}

1106: %For any $\beta \in (0,1]$, any slice of length $\Delta_i$ has a number of peers $n_i \in [(1-\beta)n\Delta_i, (1 +\beta)n\Delta_i]$ with high probability (of at least $1-1/n$) as long as  $\ms{min}_j p \geq \frac{3}{\beta^2 n} \ln{(2nk)}$.

1107: %\end{corollary}

1108:

1109: In fact, it is simple to show that, in general, the probability of  dividing $n$ peers into two slices of the same size is less than $\sqrt{2/n\pi}$. This value is very small even for moderate values of  $n$. Hence, it is highly possible that the random number distribution  does not lead to a perfect division into slices.

1110:

1111: \subsection{Simulation Results}\label{sec:simu2}

1112:

1113: We present simulation results using PeerSim~\cite{JMB04}, using a simplified

1114: cycle-based simulation model, where all messages exchanges are atomic,

1115: so messages never overlap.

1116: First, we compare the performance of the two algorithms: JK and mod-JK.

1117: Second, we study the impact of concurrency that is ignored by the cycle-based

1118: simulations.

1119:

1120: \subsubsection{Performance Comparison}

1121: %To compare the performance of our algorithm,

1122: %mod-JK, with JK,

1123: %we implemented both approaches in a cycle-based

1124: %simulator using PeerSim~\cite{JMB04}.

1125: We compare the time taken by these algorithms

1126: to sort the random values according to the attribute values (i.e., the node with

1127: the $j^{th}$ largest attribute value of the system value obtains the $j^{th}$ random value).

1128: In order to evaluate the convergence speed of each algorithm, we use the slice

1129: disorder measure as defined in Section~\ref{sec:anal}.

1130:

1131: We simulated $10^4$ participants in 100 equally sized slices (when unspecified), each with

1132: a view size $c=20$.  Figure~\ref{fig:convergence2} illustrates the difference between the global disorder

1133: measure and the slice disorder measure while Figure~\ref{fig:convergence1} presents the evolution of

1134: the slice disorder measure over time for JK, and mod-JK.

1135:

1136: \begin{figure*}

1137:   \begin{center}

1138:     \subfigure[]

1139:     {

1140:       \label{fig:convergence2}

1141:       \resizebox{2.6in}{!}{\includegraphics[scale=0.5,angle=270,clip=true]{sdm_gdm.ps}}

1142:     }

1143:     \subfigure[]

1144:     {

1145:     \label{fig:convergence1}

1146:           \resizebox{2.6in}{!}{\includegraphics[scale=0.5,angle=270,clip=true]{newscast_cyclon_sdm.ps}}

1147:     }

1148:     \subfigure[]

1149:     {

1150:       \label{fig:swap}

1151:       \resizebox{2.6in}{!}{\includegraphics[scale=0.5,angle=0,clip=true,bb=40 10 260 200,viewport=0 0 260 210

1152:       ]{swp.ps}}

1153:     }

1154:     \subfigure[]

1155:     {

1156:       \label{fig:sdm_count}

1157:       \resizebox{2.6in}{!}{\includegraphics[scale=0.5,angle=270,origin=c,clip=true]{sdm_cont.ps}}

1158:     }

1159:     \caption{

1160:        (a) Evolution of the global disorder measure over time.

1161:        (b) Slice disorder measure over time.

1162:        (c) Percentage of unsuccessful swaps in the ordering algorithms.

1163:        (d) Convergence speed under high concurrency.}

1164:   \end{center}

1165: \end{figure*}

1166:

1167: Figure~\ref{fig:convergence2} shows the different speed at which

1168: the global disorder measure and the slice disorder measure converge.

1169: When values are sufficiently large, the GDM and SDM seem tightly related: if

1170: GDM increases then SDM increases too.  Conversely, there is a significant difference between

1171: the GDM and SDM when the values are relatively low: the GDM reaches 0 while the SDM is lower bounded by

1172: a positive value.  This is because the algorithm does lead to a totally

1173: ordered set of nodes, while it still does not associate each node with its

1174: correct slice. Consequently the GDM is not sufficient to rightly estimate the

1175: performance of our algorithms.

1176:

1177: Figure~\ref{fig:convergence1} shows the slice disorder measure to compare

1178: the convergence speed of our algorithm to that of JK with 10 equally sized slices.

1179: Our algorithm converges significantly faster than JK.

1180: Note that none of the algorithm reaches zero SDM, since they are both based on

1181: the same idea of sorting randomly generated values.

1182: Besides, since they both used an identical set of randomly generated values,

1183: both converge to the same SDM.

1184:

1185: \subsubsection{Concurrency}

1186:

1187: The simulations are cycle-based and at each cycle an

1188: algorithm step is done atomically so that no other execution is concurrent.

1189: %a node sends and receives messages in an atomic

1190: %manner---while no other messages are in transit.

1191: More precisely, the algorithms are simulated such that in each cycle, each node updates its view

1192: before sending its random value or its attribute value.  Given this implementation, the

1193: cycle-based simulator does not allow us to realistically simulate concurrency, and a

1194: drawback is that view is up-to-date when a message is sent.  In the following

1195: we artificially introduce concurrency (so that view might be out-of-date) into the

1196: simulator and show that it has only a slight impact on the convergence speed.

1197:

1198: %This parameter is not realistic in case the

1199: %goal is to converge as fast as possible, especially if messages

1200: %are continuously sent for this purpose.  Let $\delta$ be the time a message takes to reach

1201: %its destination.  If a large portion of nodes send a message in a

1202: %period of $\delta$ time units then it is likely that at least two overlapping messages target the

1203: %same node.

1204: %Here, we simulate an artificial concurrency to see the impact on convergence speed.

1205:

1206:

1207: %This implementation choice presents two side-effects in case of high concurrency injection,

1208: %proper to ordering algorithms.

1209: %First, there might be a large number of useless messages.  Second, the convergence

1210: %can slow down compare to the non-concurrent case.

1211: %

1212: Introducing concurrency might result in some problems because of

1213: %These two problems arise from

1214: the potential staleness of views: unsuccessful swaps due to useless messages.

1215: Technically, the view of node $i$ might

1216: indicate that $j$ has a random value $r$ while this value is no longer

1217: up-to-date.  This happens if $i$ has lastly updated its view before $j$ swapped

1218: its random value with another $j'$.

1219: %

1220: Moreover, due to asynchrony, it could happen that by the time a message is received

1221: this message has become useless.  Assume that node $i$ sends its random value $r_i$ to $j$

1222: in order to obtain $r_j$ at time $t$ and $j$ receives it by time $t+\delta$.

1223: With no loss of generality assume $r_i > r_j$.  Then if $j$ swaps its random value with

1224: $j'$ such that $r_j'>r_i$ between time $t$ and $t+\delta$, then the message of $i$

1225: becomes \emph{useless} and the expected swap does not occur (we call this an \emph{unsuccessful swap}).

1226: %

1227: %

1228: %When $i$ sends a message

1229: %$m_i$ to a targeted neighbor $j$, then between the time $i$ updates its view and the time

1230: %$j$ receives the message $m_i$, $j$ may swap its own random value with a distinct node $j'$ (upon

1231: %reception of message $m_{j'}$ from $j'$).

1232: %%

1233: %In case the message from $j'$ results in ordering $j$ regarding to $i$, then $j$ will

1234: %ignore the message from $i$.  This scenario occurs if initially $r_{j'} \leq r_i \leq r_j$

1235: %and $a_j < a_i$ or $r_j \leq r_i \leq r_{j'}$ and $a_i < a_j$.

1236: %Consequently, the random value of $j$ that $i$ knows of is stale, and message $m_i$ does not

1237: %result in any swap, $j$ ignores it.

1238:

1239: %

1240: %\begin{figure*}

1241: %  \begin{center}

1242: %    \caption{

1243: %    }

1244: %  \end{center}

1245: %\end{figure*}

1246:

1247: Figure~\ref{fig:sdm_count} indicates the impact of concurrent message exchange

1248: on the convergence speed while

1249: Figure~\ref{fig:swap} shows the amount of useless messages that are sent.

1250: %shows the percentage of messages that are ignored during

1251: %simulations of the ordering algorithms (JK and mod-JK) over the total number

1252: %of messages.

1253: Now, we explain how the concurrency is simulated.

1254: Let the \emph{overlapping messages} be a set of messages that mutually overlap:

1255: it exists, for any couple of overlapping messages, at least one instant at

1256: which they are both in-transit.

1257: For each algorithm we simulated \textit{(i)}~full concurrency: in a given

1258: cycle, all messages are overlapping messages; and \textit{(ii)}~half

1259: concurrency: in a given cycle, each message is an overlapping message with

1260: probability $\frac{1}{2}$.

1261: Generally, we see that increasing the concurrency increases the number of

1262: useless messages. Moreover, in the modified version of JK, more messages are

1263: ignored than in the original JK algorithm.  This

1264: is due to the fact that some nodes (the most misplaced ones) are more likely

1265: targeted which increases the number of concurrent messages arriving at the

1266: same nodes.  Since a node $i$ ignored more likely a message when it receives

1267: more messages during the same cycle, it comes out that concentrating

1268: message sending at some targets increases the number of useless messages.

1269: %

1270:

1271: Figure~\ref{fig:sdm_count} compares the convergence speed under full concurrency

1272: and no concurrency.  We omit the curve of half-concurrency since it would have been

1273: similar to the two other curves.  Full-concurrency

1274: impacts on the convergence speed very slightly.

1275: %While in the non-concurrent case all messages result

1276: %in an ordering gain, in the concurrent case, some of them may not.

1277: %

1278: %These side-effects are absent in the ranking algorithm (Section~\ref{sec:ranking}) because the

1279: %information conveyed in message from each node $i$ is altered

1280: %independently from any message receipt.

1281: %Indeed, either attribute values slightly change over time (if it

1282: %represents battery, or available storage space), or do not change

1283: %at all, while an ordering algorithm makes nodes exchange random values

1284: %that can be drastically modified (swapped) upon reception of another message.

1285:

1286:

1287:

1288: \section{Dynamic Ranking by Sampling of Attribute Values}\label{sec:ranking}

1289: \label{dynamicranking}

1290: In this section we propose an alternative algorithm for the distributed slicing

1291: problem. This algorithm circumvents some of the problems identified in

1292: the previous approach by continuously ranking nodes based on observing attribute value

1293: information.

1294: Random values no longer play a role, so non-perfect uniformity in the random value

1295: distribution is no longer a problem.

1296: Besides, this algorithm

1297: is not sensitive to churn even if it is correlated with attribute values.

1298:

1299: In the remaining part of the report we refer to this new algorithm

1300: as the ranking algorithm while referring to JK and mod-JK as the ordering algorithms.

1301: Here, we elaborate on the drawbacks arising from the ordering algorithms

1302: relying on the use of random values that are solved by the ranking approach.

1303:

1304: \paragraph{Impact of attribute correlated with dynamics.}

1305: As already mentioned, the ordering algorithms

1306: rely on the fact that random values are uniformly distributed.

1307: However, if the attribute values are not constant but correlated with

1308: the dynamic behavior of the system, the distribution of random values may change

1309: from uniform to  skewed quickly.

1310: For instance, assume that each node maintains an attribute

1311: value that represents its own lifetime.

1312: Although the algorithm is able to quickly sort random values, so

1313: nodes with small lifetime will obtain the small random values, it

1314: is more likely that these nodes leave the system sooner than other nodes.

1315: This results in

1316: a higher concentration of high random values and a large population of the nodes

1317: wrongly estimate themselves as being part of the higher slices.

1318: %Similar situations arise when

1319: %attribute values represent other dynamic features such as

1320: %the remaining storage space of a node, etc.

1321:

1322: \paragraph{Inaccurate slice assignments.}

1323:

1324: As discussed in previous sections in detail, slice assignments will

1325: typically be imperfect even when the random values are perfectly

1326: ordered.

1327: Since the ranking approach does not rely on ordering random nodes,

1328: this problem is not raised: the algorithm guarantees eventually

1329: perfect assignment in a static environment.

1330:

1331: \paragraph{Concurrency side-effect.}

1332: In the previous ordering algorithms, a non negligible amount of messages are sent unnecessarily.

1333: The concurrency of messages has a drastic effect on the number of useless messages

1334: as shown previously, slowing down convergence.

1335: In the ranking algorithm

1336: concurrency has no impact on convergence speed because all

1337: received messages are taken in account.

1338: This is because the information encapsulated in a message (the attribute value of a node)

1339: is guaranteed to be up to date, as long as  the attribute values are constant,

1340: or at least change slowly.

1341:

1342: \subsection{Ranking Algorithm Specification}

1343:

1344: The pseudocode of the ranking algorithm is presented in Figure~\ref{alg:attr}.

1345: As opposed to the ordering algorithm of the previous section, the ranking

1346: algorithm does not assign random initial unalterable values as candidate ranks.

1347: Instead, the ranking algorithm improves its rank estimate each

1348: time a new message is received.

1349:

1350: The ranking algorithm works as follows.

1351: Periodically each node $i$ updates its view ${\mathcal N}_i$ following

1352: an underlying protocol that provides a uniform random sample (Line~\ref{N02}); later, we simulate the algorithm using

1353: the variant of Cyclon protocol presented in Section~\ref{sec:modifiedjk}.

1354: Node $i$ computes its rank estimate (and hence its slice)

1355: by comparing the attribute value of its neighbors to its own attribute value.

1356: This estimate is set to the ratio of the number of nodes with a lower

1357: attribute value that $i$ has seen over the total number of nodes $i$ has seen (Line~\ref{N14}).

1358: Node $i$ looks at the normalized rank estimate of all its neighbors.

1359: Then, $i$

1360: selects the node $j_{1}$ closest to a slice boundary (according to the rank

1361: estimates of its neighbors).

1362: Node $i$ selects also a random neighbor $j_{2}$ among its view (Line~\ref{N11}).

1363: When those two nodes are selected, $i$ sends an update message, denoted by a flag $\lit{UPD}$,

1364: to $j_{1}$ and $j_{2}$

1365: containing its attribute value (Line~\ref{N12}--\ref{N13}).

1366:

1367: The reason why a node close to the slice boundary is selected as one of the

1368: contacts is that such nodes need more samples to accurately determine

1369: which slice they belong to (subsection~\ref{sec:analysis2} shows this point).

1370: This technique introduces a bias towards them, so they receive more messages.

1371:

1372: Upon reception of a message from node $i$, the passive threads of $j_{1}$ and

1373: $j_{2}$ are activated so that $j_{1}$ and $j_{2}$ compute their new rank estimate

1374: $r_{j_{1}}$ and $r_{j_{2}}$.  The estimate of the slice a node belongs to,

1375: follows the computation of the rank estimate.

1376: Messages are not replied, communication is one-way, resulting in identical message

1377: complexity to JK and mod-JK.

1378:

1379:

1380: %\input{attr.alg}

1381: \begin{figure}

1382: \centering{

1383: \fbox{

1384: \begin{minipage}[t]{150mm}

1385: \footnotesize

1386: \renewcommand{\baselinestretch}{1.5}

1387: \resetline

1388: \begin{tabbing}

1389: aaaaA\=aaaaaA\=aaaaaaA\kill

1390: {\bf Initial state of node $i$} \\

1391: \line{N00} \> $\ms{period}_{i}$, initially set to a constant;

1392: $r_i$, a value in $(0,1]$; \\

1393: $a_i$, the attribute value;

1394: $b$, the closest slice boundary to node $i$; \\

1395: $g_i$, the counter of encountered attribute values;

1396: $l_i$, the counter \\

1397: of lower attribute values;

1398: $\ms{slice}_i \gets \bot$;

1399: ${\mathcal N_i}$, the view.

1400: \\ ~ \\

1401:

1402: {\bf Active thread at node $i$} \\

1403:

1404: %% with period, the other nodes must send more frequently

1405: %~(\ref{L02}b)  \> $\act{wait}(\ms{period_i})$ \\

1406: %~(\ref{L02}c)  \> $\act{recompute-view}()_i$ \\

1407: %~(\ref{L02}d)  \> $\ms{dist-min} \gets \infty$ \\

1408: %~(\ref{L03}e)  \> {\bf for} $j' \in {\mathcal N}_i$ \\

1409: %~(\ref{L03}f)    \> \T $g_i \gets g_i + 1$ \\

1410: %~(\ref{L03}g)   \> \T {\bf if} $a_{j'} \leq a_i$ {\bf then} $\ell_i \gets \ell_i + 1$ \\

1411: %~(\ref{L03}h) \> \T {\bf if} $\lit{dist}(a_{j'},b) < \ms{dist-min}$ {\bf then} \\

1412: %~(\ref{L03}i) \> \T \T $\ms{dist-min} \gets \lit{dist}(a_{j'},b)$ \\

1413: %~(\ref{L03}j) \> \T \T $j_{1} \gets j'$ \\

1414: %~(\ref{L03}k) \> {\bf end for} \\

1415: %~(\ref{L03}l) \> Let $j_{2}$ be a random node of ${\mathcal N}_i$ \\

1416: %~(\ref{L04}b)  \> $\act{send}(\lit{UPD},r_i)$ to $j_{1}$ \\

1417: %~(\ref{L04}c)  \> $\act{send}(\lit{UPD},r_i)$ to $j_{2}$ \\

1418: %%~(\ref{L08})  \> $\act{recv}(\lit{ACK},r_j)$ from $j$ \\

1419: %%\line{L13}   \> {\bf if} $a_{j'} \leq a_i$ {\bf then} $\ell_i \gets \ell_i + 1$ \\

1420: %%\line{L14}   \> $g_i \gets g_i + 1$ \\

1421: %~(\ref{L05}d)   \> $r_i \gets \ell_i / g_i$ \\

1422: %~(\ref{L05}e)   \> $\ms{slice} \gets {\cal S}_{l,u}$ such that $l < r_i \leq u$\\

1423: % ~ \\

1424:

1425: % with period, the other nodes must send more frequently

1426: \line{N01}  \> $\act{wait}(\ms{period_i})$ \\

1427: \line{N02}  \> $\act{recompute-view}()_i$ \\

1428: \line{N03}  \> $\ms{dist-min} \gets \infty$ \\

1429: \line{N04}  \> {\bf for} $j' \in {\mathcal N}_i$ \\

1430: \line{N05}    \> \T $g_i \gets g_i + 1$ \\

1431: \line{N06}   \> \T {\bf if} $a_{j'} \leq a_i$ {\bf then} $\ell_i \gets \ell_i + 1$ \\

1432: \line{N07} \> \T {\bf if} $\lit{dist}(a_{j'},b) < \ms{dist-min}$ {\bf then} \\

1433: \line{N08} \> \T \T $\ms{dist-min} \gets \lit{dist}(a_{j'},b)$ \\

1434: \line{N09} \> \T \T $j_{1} \gets j'$ \\

1435: \line{N10} \> {\bf end for} \\

1436: \line{N11} \> Let $j_{2}$ be a random node of ${\mathcal N}_i$ \\

1437: \line{N12}  \> $\act{send}(\lit{UPD},a_i)$ to $j_{1}$ \\

1438: \line{N13}  \> $\act{send}(\lit{UPD},a_i)$ to $j_{2}$ \\

1439: %~(\ref{L08})  \> $\act{recv}(\lit{ACK},r_j)$ from $j$ \\

1440: %\line{L13}   \> {\bf if} $a_{j'} \leq a_i$ {\bf then} $\ell_i \gets \ell_i + 1$ \\

1441: %\line{L14}   \> $g_i \gets g_i + 1$ \\

1442: \line{N14}   \> $r_i \gets \ell_i / g_i$ \\

1443: \line{N15}   \> $\ms{slice} \gets {\cal S}_{l,u}$ such that $l < r_i \leq u$\\

1444:  ~ \\

1445:

1446: %{\bf Passive thread at node $i$ activated upon reception} \\

1447: %~(\ref{R01}) \> $\act{recv}(\lit{UPD},a_j)$ from $j$ \\

1448: %%~(\ref{R02}) \> $\act{send}(\lit{ACK},r_i)$ to $j$ \\

1449: %\line{R06}   \> {\bf if} $a_{j} \leq a_i$ {\bf then} $\ell_i \gets \ell_i + 1$ \\

1450: %\line{R07}   \> $g_i \gets g_i + 1$ \\

1451: %\line{R08}   \> $r_i \gets \ell_i / g_i$ \\

1452: %~(\ref{R04}') \> $\ms{slice} \gets {\cal S}_{l,u}$ such that $l < r_i \leq u$

1453:

1454: {\bf Passive thread at node $i$ activated upon reception} \\

1455: \line{O01} \> $\act{recv}(\lit{UPD},a_j)$ from $j$ \\

1456: %~(\ref{R02}) \> $\act{send}(\lit{ACK},r_i)$ to $j$ \\

1457: \line{O02}   \> {\bf if} $a_{j} \leq a_i$ {\bf then} $\ell_i \gets \ell_i + 1$ \\

1458: \line{O03}   \> $g_i \gets g_i + 1$ \\

1459: \line{O04}   \> $r_i \gets \ell_i / g_i$ \\

1460: \line{O05} \> $\ms{slice} \gets {\cal S}_{l,u}$ such that $l < r_i \leq u$

1461:

1462: \end{tabbing}

1463: \normalsize

1464: \end{minipage}

1465: }

1466: \caption{Dynamic ranking by exchange of attribute values.}

1467: \label{alg:attr}

1468: }

1469: \end{figure}

1470:

1471: \subsection{Theoretical Analysis}\label{sec:analysis2}

1472:

1473: The following Theorem shows a lower bound on the probability

1474: for a node $i$ to accurately estimate the slice it belongs to.

1475: %

1476: This probability depends not only on the number of attribute exchanges

1477: but also on the rank estimate of $i$.

1478:

1479: \begin{theorem}

1480: %Assume that node $i$ with normalized rank $p$, estimated in $\hat{p}$, executes

1481: %$\left( Z_\frac{\alpha}{2} \frac{ \sqrt{\hat{p}(1-\hat{p})} }{ d } \right)^2$

1482: %steps of the algorithm, where $d$ is the distance between the rank estimate of $i$ and

1483: %the closest slice boundary, and $Z_\frac{\alpha}{2}$ represents the endpoints of the

1484: %confidence interval.

1485: %The slice estimate of $i$ is exact with confidence coefficient of $100(1 - \alpha)\%$.

1486: %

1487: Let $p$ be the normalized rank of $i$ and let $\hat{p}$ be its estimate.

1488: For node $i$ to exactly estimate its slice with confidence coefficient of

1489: $100(1 - \alpha)\%$, the number of messages $i$ must receive is:

1490: $$\left( Z_\frac{\alpha}{2} \frac{ \sqrt{\hat{p}(1-\hat{p})} }{ d } \right)^2,$$

1491: \noindent

1492: where $d$ is the distance between the rank estimate of $i$ and

1493: the closest slice boundary, and $Z_\frac{\alpha}{2}$ represents the endpoints of the

1494: confidence interval.

1495: \end{theorem}

1496: \begin{proof}

1497: %At each step of the algorithm execution, node $i$ draws uniformly at random

1498: %a sample in the network.

1499: Each time a node receives a message, it checks whether or not the attribute value

1500: is larger or lower than its own.

1501: Let $X_1, ..., X_k$ be $k$~$(k>0)$ independent identically distributed random variables described as follows. $X_j = 1$

1502: with probability $\frac{i}{n} = p$ (indicating that the attribute value is lower) and

1503: $j \in \{1, ..., k\}$, otherwise $X_j = 0$ (indicating the attribute value is larger).

1504: %

1505: By the central limit theorem, we assume $k > 30$ and we approximate the distribution of $X = \sum_{j=1}^k X_j$

1506: as the normal distribution. We estimate $X$ by $\hat{X} = \sum_{j=1}^k \hat{X}_j$

1507: and $p$ by $\hat{p} = \frac{\hat{X}}{k}$.

1508:

1509: We want a confidence coefficient with value $1-\alpha$.

1510: Let $\Phi$ be the standard normal distribution function, and let

1511: $Z_{\frac{\alpha}{2}}$ be $\Phi^{-1}(1-\frac{\alpha}{2})$.

1512: %

1513: Now, by the Wald large-sample normal test in the binomial case,

1514: where the standard deviation of $\hat{p}$ is

1515: $\sigma(\hat{p}) = \frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{k}}$,

1516: we have:

1517: \begin{eqnarray}

1518: \left|\frac{\hat{p}-p}{\sigma(\hat{p})}\right| &\leq Z_{\frac{\alpha}{2}} & \notag \\

1519: \hat{p} - Z_{\frac{\alpha}{2}} \sigma(\hat{p}) &\leq p & \leq \hat{p} + Z_{\frac{\alpha}{2}}\sigma(\hat{p}). \notag

1520: \end{eqnarray}

1521:

1522: % To obtain probability $1-\alpha$, we must have

1523: % $$\hat{p} - Z_{\frac{\alpha}{2}} \sqrt{\frac{\hat{p}(1-\hat{p})}{k}} \leq p \leq \hat{p} +

1524: % Z_{\frac{\alpha}{2}} \sqrt{\frac{\hat{p}(1-\hat{p})}{k}},$$

1525: % %\in (\hat{p}\rip Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{k}})$,

1526: % where $Z_{\frac{\alpha}{2}}$ is obtained from tables

1527: % ($\alpha = 0,1 \Rightarrow Z_{\frac{\alpha}{2}} \neq 2$;

1528: % $\alpha = 0,0001 \Rightarrow Z_{\frac{\alpha}{2}} \leq 4$)

1529: %

1530: Next, assume that $\hat{p}$ falls into the slice $S_{l,u}$, with $l$ and $u$ its lower

1531: and upper boundaries, respectively.

1532: Then, as long as $\hat{p} - Z_\frac{\alpha}{2}\sqrt{\frac{\hat{p}(1-\hat{p})}{k}}

1533: > l$ and

1534: $\hat{p} + Z_\frac{\alpha}{2}\sqrt{\frac{\hat{p}(1-\hat{p})}{k}} \leq u$, the slice estimate

1535: is exact with a confidence coefficient of $100(1-\alpha)\%$.

1536: Let $d=\min(\hat{p}-l,u-\hat{p})$, then we need

1537: %Assume without loss of generality that $\hat{p}-l \leq u-\hat{p}$, then we need

1538: \begin{eqnarray}

1539: d &\geq& Z_{ \frac{\alpha}{2} } \sqrt{\frac{\hat{p}(1-\hat{p}) }{ k }}, \notag \\

1540: k &\geq& \left( Z_\frac{\alpha}{2} \frac{ \sqrt{ \hat{p}(1-\hat{p}) } }{d}

1541: \right)^2.\notag

1542: \end{eqnarray}

1543: %Consequently, we obtain:

1544: %$$k \geq \left( Z_\frac{\alpha}{2} \frac{ \sqrt{\hat{p}(1-\hat{p})} }{ \min(\hat{p}-l, u-\hat{p})} \right)^2.\Box$$

1545: \end{proof}

1546:

1547: To conclude, under reasonable assumptions all node estimate its slice with confidence coefficient $100(1 - \alpha)\%$, after a finite number of message receipts.

1548: Moreover a node closer to the slice boundary needs more messages than a node far from the boundary.

1549:

1550: \subsection{Simulation Results}

1551:

1552: This section evaluates the ranking algorithm by focusing on three

1553: different aspects.

1554: %

1555: First, the performance of the ranking algorithm is

1556: compared to the performance of the ordering algorithm\footnote{We

1557: omit comparison with JK since the performance obtained with mod-JK are either

1558: similar or better.} in a large-scale system where the distribution

1559: of attribute values does not vary over time.

1560: %

1561: Second, we investigate if sufficient uniformity is achievable in reality using a dedicated

1562: protocol.

1563: %

1564: %study the feasibility of our approach.

1565: %Each node

1566: %requires to be associated with a random set of neighbors, chosen

1567: %at uniform.

1568: %We present a view management protocol that gives

1569: %performance results very similar to the desired

1570: %uniform simulation.

1571: %

1572: Third, the ranking algorithm and ordering algorithm are compared in

1573: a dynamic system where the distribution of attribute values may

1574: change.

1575:

1576: %This Section evaluates the impact of dynamics on our algorithms.

1577: %The ranking algorithm presented in Figure~\ref{alg:attr} is compared to

1578: %the ordering algorithm presented in the previous section.

1579: %

1580: For this purpose, we ran two simulations, one for each algorithms.

1581: The system contains (initially) $10^4$ nodes and each view contains $10$ uniformly drawn random

1582: nodes and is updated in each cycle.

1583: The number of slices is 100, and we present the evolution of the slice disorder measure

1584: over time.

1585:

1586: \begin{figure*}

1587:   \begin{center}

1588:     \subfigure[]

1589:     {

1590:       \label{fig:comparison}

1591:       \resizebox{2.6in}{!}{\includegraphics[scale=0.5,angle=270,clip=true]{comparison.ps}}

1592:     }

1593:     \subfigure[]

1594:     {

1595:       \label{fig:convergence3}

1596:       \resizebox{2.6in}{!}{\includegraphics[scale=0.5,angle=270,clip=true]{cyclon.ps}}

1597:     }

1598:     \subfigure[]

1599:     {

1600:       \label{fig:failures2}

1601:       \resizebox{2.6in}{!}{\includegraphics[scale=0.5,angle=270,clip=true]{sdm_battery_200.ps}}

1602:     }

1603:     \subfigure[]

1604:     {

1605:     \label{fig:failures1}

1606:           \resizebox{2.6in}{!}{\includegraphics[scale=0.5,angle=270,clip=true]{sdm_failure01.ps}}

1607:     }

1608:     \caption{

1609:        (a) Comparing performance of the ordering algorithm and the ranking algorithm (static case).

1610:        (b) Comparing the ranking algorithm on top of a uniform drawing or a Cyclon-like protocol.

1611:        (c) Effect of dynamics burst on the convergence of the ordering algorithm and the ranking algorithm.

1612:        (d) Effect of a low and regular churn on the convergence of the ordering algorithm and the ranking algorithm.

1613:        }

1614:   \end{center}

1615: \end{figure*}

1616:

1617: \subsubsection{Performance Comparison in the Static Case}

1618: Figure~\ref{fig:comparison} compares the ranking algorithm to the ordering algorithm

1619: while the distribution of attribute values do not change over time

1620: (varying distribution is simulated below).

1621:

1622: The difference between the ordering algorithm and the ranking algorithm

1623: indicates that the ranking algorithm gives a more precise result (in terms of node to

1624: slice assignments) than the ordering algorithm.

1625: More importantly, the slice disorder measure obtained by the ordering algorithm is

1626: lower bounded while the one of the ranking algorithm is not.

1627: Consequently, this simulation shows that the ordering algorithm might fail in

1628: slicing the system while the ranking algorithm keeps improving its accuracy over

1629: time.

1630:

1631: \subsubsection{Feasibility of the Ranking Algorithm}

1632: %Figure~\ref{fig:convergence3} shows that even though the underlying membership

1633: %protocol might impact on the convergence time of the ordering and ranking algorithms,

1634: %it does not impact on their performance in terms of precision.

1635: Figure~\ref{fig:convergence3} shows that the ranking algorithm does not need artificial

1636: uniform drawing of neighbors.  Indeed, an underlying view management protocol might

1637: lead to similar performance results.

1638: %

1639: %the ordering as the ranking algorithm

1640: %present performance independent from the underlying.  First this shows that the result

1641: %achieve by the ordering algorithm is limited regardless the underlying

1642: %sampling protocol.  Second, this shows that the performance in term of

1643: %precision of the ranking algorithm are independent

1644: %

1645: %shows that our approach is realistic.

1646: %In the previous simulations the neighbors chosen for communication exchanges

1647: %were selected uniformly at random among all nodes in the system.

1648: %Here, we simulate the ranking algorithm using an underlying view exchange

1649: %protocol, called Cyclon~\cite{VGS05}.

1650: In the presented simulation we used an artificial protocol, drawing neighbors

1651: randomly at uniform in each cycle of the algorithm execution, and the variant

1652: of the Cyclon~\cite{VGS05} view management protocol presented above.

1653: Those underlying protocols are distinguished on the figure using terms

1654: "uniform" (for the former one) and "views" (for the later one).

1655: %

1656: As said previously, the Cyclon protocol consists of exchanging views

1657: between neighbors such that

1658: the communication graph produced shares similarities with a random graph.

1659: This figure shows that both cases give very similar results.

1660: The SDM legend is on the right-handed vertical axis while

1661: the left-handed vertical axis indicates what percentage the SDM difference

1662: represents over the total SDM value.  At any time during the simulation

1663: (and for both type of algorithms) its

1664: value remains within plus or minus $7\%$.

1665: The two SDM curves of the ranking algorithm almost overlap.

1666: Consequently, the ranking algorithm and the variant of Cyclon

1667: presented in subection~\ref{sec:modifiedjk} achieve very similar result.

1668: %

1669: %Surprisingly however, the variant of Cyclon as an underlying algorithm

1670: %leads seemingly to better performance than drawing the samples uniformly at random.

1671: %In a recent work~\cite{I05} about the experimental analysis of Cyclon compare to a random graph,

1672: %the clustering coefficient provided by the communication graph

1673: %of Cyclon is slightly smaller than the clustering coefficient provided by the random

1674: %graph. Based on this observation, it is reasonable to assume that new nodes are more

1675: %likely discovered with Cyclon than a random graph, and this might explain this

1676: %improvement.

1677:

1678: To conclude, the variant of Cyclon algorithm presented in the previous section can be used

1679: with the ranking algorithm to provide the shuffling of views.

1680: %despite this observation the similarity of the use of Cyclon and the use

1681: %of randomly drawing shows the feasibility of our ranking algorithms in reality.

1682:

1683: \subsubsection{Performance Comparison in the Dynamic Case}

1684:

1685:

1686: In Figure~\ref{fig:failures2} each of the two curves represents the

1687: slice disorder measure obtained over time using the ordering algorithm and

1688: the ranking algorithm respectively.

1689: We simulate the churn such that 0.1\% of nodes leave and 0.1\% of the nodes

1690: join in each cycle during the 200 first cycles. We observe how the SDM converges.

1691: The churn is reasonably and pessimistically tuned compared to recent

1692: experimental evaluations~\cite{SR06} of the session duration in three well-known P2P

1693: systems.\footnote{In~\cite{SR06}, roughly all nodes have left the system after 1 day while there are still

1694: 50\% of nodes after 25 minutes. In our case, assuming that in average a cycle lasts

1695: one second would lead to more than 54\% of leave in 9 minutes.}

1696:

1697: The distribution of the churn is correlated to the attribute value of the nodes.

1698: The leaving nodes are the

1699: nodes with the lowest attribute values while the entering nodes have higher attribute

1700: values than all nodes already in the system.  The parameter choices are motivated by

1701: the need of simulating a system in which the attribute value corresponds to the

1702: session duration of nodes, for example.

1703:

1704: The churn introduces a significant disorder in the system which

1705: counters the fast decrease.  When, the churn stops, the ranking algorithm readapts

1706: well the slice assignments: the SDM starts decreasing again.  However,

1707: in the ordering algorithm, the convergence of SDM gets stuck.

1708: This leads to a poor slice assignment accuracy.

1709:

1710: In Figure~\ref{fig:failures1},

1711: each of the two curves represent the slice disorder measure obtained over time

1712: using the ordering algorithm, the ranking algorithm, and a modified version of the ranking

1713: algorithm using attribute values recorded in a sliding-window, respectively.

1714: (The simulation obtained using sliding windows is described in the next subsection.)

1715: The churn is diminished and made more regular than in the previous simulation such that

1716: 0.1\% of nodes leave and 0.1\% of nodes join every 10 cycles.

1717: %Recall that the slice

1718: %disorder measure (SDM) represents the sum for all nodes of the distance between the

1719: %slice the node thinks it belongs to and the slice it really belongs to.

1720:

1721: The curves fits a fast decrease (superlinear in the number of cycles) at the

1722: beginning of the simulation.

1723: At first cycles, the ordering gain is significant making the impact of churn

1724: negligible.  This phenomenon is due to the fact that SDM decreases rapidly

1725: when the system is fully disordered.  Later on, however, the decrease slope

1726: diminishes and the churn effect reduces the amount of nodes with a low attribute

1727: value while increasing the amount of nodes with a large attribute value.

1728: This unbalance leads to a messy slice assignment, that is, each node must quickly

1729: find its new slice to prevent the SDM from increasing.

1730: %

1731: In the ordering algorithm the SDM starts increasing from cycle 120. Conversely,

1732: with the ranking algorithm the SDM starts increasing not earlier than at cycle 730.

1733: Moreover the increase slope is much larger in the former algorithm than in the latter

1734: one.

1735:

1736: Even though the performance of the ranking algorithm are really significant, its

1737: adaptiveness to churn is not surprising.  Unlike the ordering algorithm, the

1738: ranking one keeps re-estimating the rank of each node depending on the

1739: attribute values present in the system.

1740: Since the churn increases the attribute values present in the

1741: system, nodes tend to receive more messages with higher attribute values and

1742: less messages with lower attribute values, which turns out to keep the SDM

1743: low, despite churn. Further on, we propose a solution  based on

1744: sliding-window technique to limit the increase of the SDM in the ranking

1745: algorithm.

1746:

1747: To conclude, the results show that when the churn is related to the attribute

1748: (e.g., attribute represents the session duration, uptime of a node),

1749: then the ranking algorithm is better suited than the ordering algorithm.

1750:

1751: %In Figure~\ref{fig:failures1}, %simulates 10,000 nodes using Peersim simulator.

1752: %each of the two curves represent the slice disorder measure obtained over time

1753: %using the ordering algorithm, the ranking algorithm, and a modified version of the ranking

1754: %algorithm using attribute values recorded in a sliding-window, respectively.

1755: %(The simulation obtained using sliding windows is described in the next subsection.)

1756: %Recall that the slice

1757: %disorder measure (SDM) represents the sum for all nodes of the distance between the

1758: %slice the node thinks it belongs to and the slice it really belongs to.

1759: %

1760: %A churn applied to the system is regular (1\% every period of 100 cycles), while its

1761: %distribution is correlated to the attribute value of the nodes. More precisely, each

1762: %10 cycles, a node leaves the system while another joins it.  The leaving node is the

1763: %node with the lowest attribute value while the entering node has a higher attribute

1764: %value than all nodes

1765: %already in the system.  The parameter choices are motivated by the need of simulating

1766: %a system in which the attribute value corresponds to the remaining battery power or

1767: %session duration of nodes.

1768: %

1769: %The curves fits a fast decrease (superlinear in the number of cycles) at the

1770: %beginning of the simulation.

1771: %At first cycles, the ordering gain is significant making the impact of churn

1772: %negligible.  This phenomenon is due to the fact that SDM decreases rapidly

1773: %when the system is fully disordered.  Later on, however, the decrease slope

1774: %diminishes and the churn effect reduces the amount of nodes with a low attribute

1775: %value while increasing the amount of nodes with a large attribute value.

1776: %This unbalance leads to a messy slice assignment, that is, each node must quickly

1777: %find its new slice to prevent the SDM from increasing.

1778: %%

1779: %In the ordering algorithm the SDM starts increasing from cycle 120. Conversely,

1780: %with the ranking algorithm the SDM starts increasing not earlier than at cycle 730.

1781: %Moreover the increase slope is much larger in the former algorithm than in the latter

1782: %one.

1783: %

1784: %Even though the performance of the ranking algorithm are really significant, its

1785: %adaptiveness to churn is not surprising.  Unlike the ordering algorithm, the

1786: %ranking one keeps re-estimating the rank of each node depending on the

1787: %attribute values present in the system.

1788: %Since the churn increases the attribute values present in the

1789: %system, nodes tend to receive more messages with higher attribute values and

1790: %less messages with lower attribute values, which turns out to keep the SDM

1791: %low, despite churn. Further on, we propose a solution  based on

1792: %sliding-window technique to limit the increase of the SDM in the ranking

1793: %algorithm.

1794: %

1795: %In Figure~\ref{fig:failures2} the parameters of the simulator are the same as above

1796: %except that the churn is briefer and much higher.  At each step during the first 200 cycles,

1797: %10 nodes leave while 10 new nodes join the system.  The 10 leaving nodes have the

1798: %10 lowest attribute values while the 10 arriving nodes obtain the 10 highest attribute values.

1799: %

1800: %The result shows that when the churn is related to the attribute (e.g., attribute represents

1801: %the remaining session duration, uptime of a node), then the ranking algorithm is better suited

1802: %than the ordering algorithm.

1803: %In this case, small random values

1804: %tends to disappear under the churn effect.

1805: %Consequently, the slices

1806: %of lowest attribute values are progressively emptied and slices with highest attribute

1807: %values are progressively filled.

1808: %Since the rank estimates are constantly re-estimated

1809: %in the ranking algorithm, this algorithm still converges despite churn.

1810: %Conversely, since all candidate ranks are initially randomly chosen once for

1811: %all in the ordering algorithm, then the convergence gets stuck.

1812:

1813: \subsubsection{Sliding-window for Limiting the SDM Increase}

1814:

1815: In Figure~\ref{fig:failures1}, the "sliding-window" curve

1816: presents a slightly modified version of the ranking algorithm

1817: that encompasses SDM increase due to churn correlated to attribute

1818: values.  Here, we present this enrichment.

1819:

1820: In Section~\ref{sec:ranking}, the ranking algorithm specifies that

1821: each node takes into account all received messages.

1822: More precisely, upon reception of a new message each node $i$ re-computes

1823: immediately its rank estimate and the slice it thinks it belongs to

1824: without remembering the attribute values it has seen.

1825: Consequently the messages

1826: received long-time ago have as much importance as the fresh messages

1827: in the estimate of $i$.

1828: The drawback, as it appeared in Figure~\ref{fig:failures1} of Section~\ref{sec:simu2}, is that if

1829: the attribute values are correlated to churn, then the precision of the algorithm might diminish.

1830:

1831: To cope with this issue, the previous algorithm can be easily enriched

1832: in the following way.  Upon reception of a message, each node $i$

1833: records an information about the attribute value received in a fixed-size ordered set

1834: of values.  Say this set is a first-in first-out buffer such that

1835: only the most recent values remain.

1836: Right after having recorded this information, node $i$ can re-compute

1837: its rank estimate and its slice estimate based on the most relevant

1838: piece of information (having discarded the irrelevant piece).

1839: Consequently, the estimate would rely only on fresh attribute values

1840: encountered so that the algorithm would be more tolerant to changes

1841: (e.g., dynamics or non-uniform evolution of attribute values).

1842: %

1843: Of course, since the analysis (cf. Section~\ref{sec:analysis2}) shows that nodes close to

1844: the slice boundary

1845: require a large number of attribute values for estimating precisely their

1846: estimates, it would be unaffordable to record all these last attribute

1847: values encountered due to space limitation.

1848:

1849: Actually, the only necessary relevant information of a message

1850: is simply whether it contains a lower attribute value than the attribute

1851: value of $i$, or not.  Consequently, a single bit per message would be sufficient

1852: to record the necessary information (e.g., adding a 1 meaning that the attribute value

1853: is lower, and 0 otherwise).  Thus, even though a node $i$ would require

1854: $10^4$ messages to rightly estimate its slice (with high probability),

1855: node $i$ simply needs to allocate an array of size

1856: %$\lceil 10^4/(8*1024)\rceil =  1,23$ KB.

1857: $10^4/(8*1000) =  1,25$ kB.

1858:

1859: As expected, Figure~\ref{fig:failures1} shows that the sliding-window method

1860: applied to the ranking algorithm prevents its SDM from increasing.  Consequently, at some

1861: point in time, the resulting slice assignment may become even more accurate.

1862:

1863: \section{Conclusion}

1864: \label{conclusion}

1865:

1866: \subsection{Summary}

1867: Peer to peer systems may now be turned into general frameworks on top of which

1868: several applications might cohabit. To this end, allocating resources to applications,

1869: according to their needs require specific algorithms to partition the network in a relevant

1870: way. The ordered slicing algorithm proposed in \cite{JK06} provided a first attempt to

1871: ``slice'' the network, taking into account the potential heterogeneity of nodes. This algorithm

1872: relies on each node drawing a random value uniformly and swapping continuously those random values,

1873:  with candidate

1874: nodes, so that the order between attributes values (reflecting the capabilities of nodes) and random ones

1875: match. Results from \cite{JK06} have shown that slices can be maintained efficiently and

1876: in large-scale systems even in the presence of churn.

1877:

1878: In this report, we first proposed an improvement over the initial sorting algorithm based on a judicious

1879: choice of candidate nodes to swap values. This is based on each node being able to estimate locally

1880: the potential decrease of the global disorder measure. We provided an analysis along with some simulation

1881: results showing that the convergence speed is significantly improved.

1882: We then identified two issues related to the use of static random values.

1883: The first one refers to the fact that slice assignment heavily depends on

1884: the degree of uniformity of the initial random value.

1885:

1886: The second is related to the fact that

1887: once sorted along one attribute axis, the churn (or failures) might be correlated to the

1888: attribute, therefore leading to a unrecoverable skewed distribution of the

1889: random values resulting in a wrong slice assignment.

1890: Our second contribution is an algorithm enabling nodes to continuously

1891: re-estimate their rank relatively to other nodes based on their sampling of

1892: the network.

1893:

1894: \subsection{Perspective}

1895: %\subsection{Cyclon as an Underlying Sampling Algorithm}

1896: %This report presents two algorithms relying on a variant of Cyclon to benefit from

1897: %the distribution quasi-uniform of the neighbors.

1898: %The simulation with the variant of Cyclon shows that the two algorithms

1899: %are independent from the underlying physical overlay.

1900: %More technically, this shows the feasibility of our approach by providing

1901: %the distribution requirement of the samples.

1902: %

1903: %Nevertheless, previous simulation surveys have shown that Cyclon might suffer

1904: %from a lack of robustness in face of massive departures compared

1905: %to other gossip-based approach such as Newscast.  This report did not

1906: %simulate the behavior of Cyclon under such circumstances.  Despite the

1907: %interest of simulating the ranking algorithm on top of Cyclon and in

1908: %face of massive failures, such extensive experiments were out of the

1909: %scope of this report and could be an interesting future work.

1910:

1911:

1912: This report used a variant of the Cyclon protocol to obtain quasi-uniform

1913: distribution of neighbors.  There are various protocols that might be used

1914: for different purpose.  For instance, Newscast can be used for its resilience

1915: to very high dynamics as in~\cite{JK06}.  Some other protocols exist

1916: in the literature.  Deciding exactly how to parameterize the underlying

1917: peer sampling service might be an interesting future direction.

1918:

1919: \subsection*{Acknowledgment}

1920: We wish especially to thank M\'ark Jelasity for the fruitful discussions we

1921: had and the time he spent improving this report.

1922: We are also thankful to Spyros Voulgaris for having kindly shared his work

1923: on the Cyclon development. The work of A.  Fern\'andez and E. Jim\'enez was partially

1924: supported by the Spanish MEC under grants TIN2005-09198-C02-01,

1925: TIN2004-07474-C02-02, and TIN2004-07474-C02-01, and the

1926: Comunidad de Madrid under grant S-0505/TIC/0285. The work of A.  Fern\'andez was done

1927: while on leave at IRISA, supported by the Spanish MEC under grant

1928: PR-2006-0193.

1929:

1930:

1931: \bibliographystyle{plain}

1932: \bibliography{slice}

1933:

1934: \end{document}

1935:

1936: