cs0612035/RR.tex
1: \documentclass{article}
2: \usepackage{RR}
3: 
4: \usepackage{epsf}
5: \usepackage{amsmath}
6: \usepackage{latexsym}
7: \usepackage{amssymb}
8: \usepackage{amsthm}
9: \usepackage{times}
10: \usepackage{rotating}
11: \usepackage{setspace}
12: \usepackage{graphicx}
13: \usepackage{multicol}
14: \usepackage{latexsym}
15: \usepackage{amsfonts}
16: \usepackage{subfigure}
17: 
18: \newtheorem{theorem}{Theorem}[section]
19: \newtheorem{lemma}[theorem]{Lemma}
20: \newtheorem{invariant}[theorem]{Invariant}
21: \newtheorem{definition}[theorem]{Definition}
22: \newtheorem{corollary}[theorem]{Corollary}
23: 
24: %%%
25: % michel algorithm style
26: \newcounter{linecounter}
27: \newcommand{\linenumbering}{(\arabic{linecounter})}
28: \renewcommand{\line}[1]{\refstepcounter{linecounter}
29: \label{#1}
30: \linenumbering}
31: \newcommand{\resetline}{\setcounter{linecounter}{0}}
32: %%%
33: 
34: \newif\ifcode
35: \codefalse
36: 
37: %%%
38: % comment this if you do not have
39: % algpseudocode package for algorithms.
40: %\codetrue
41: %%%
42: \usepackage{macros}
43: \ifcode
44: \usepackage[ruled]{algorithm}
45: \usepackage{ioa_code}
46: \fi
47: 
48: \newcommand{\remove}[1]{}
49: 
50: \begin{document}
51: 
52: \RRdate{D\'ecembre 2006}
53: %\RRNo{ }
54: 
55: \title{{\bf Distributed Slicing in Dynamic Systems}}
56: 
57: \RRauthor{{\bf Antonio Fern\'andez\thanks{Universidad Rey Juan Carlos, 28933 M\'ostoles, Spain. anto@gsyc.escet.urjc.es}\and Vincent Gramoli\thanks[2]{IRISA, INRIA Universit\'e Rennes 1 (ASAP Research Group) 35042 Rennes, France. \{vgramoli,akermarr,raynal\}@irisa.fr}\and Ernesto Jim\'enez\thanks{Universidad Polit\'ecnica de Madrid, 28031 Madrid, Spain. ernes@eui.upm.es}\and \\
58: Anne-Marie Kermarrec\thanksref{2}\and 
59: Michel Raynal\thanksref{2}}
60: }
61: \authorhead{Fern\'andez \& Gramoli \& Jimenez \& Kermarrec \& Raynal} %% Ceci apparait sur chaque page paire.
62: \RRetitle{Distributed Slicing in Dynamic Systems} %% english title
63: \titlehead{Distributed Slicing in Dynamic Systems} %%titre court, sur chaque page impaire.
64: \RRtitle{Morcellement distribu\'e dans les syst\`emes dynamiques} 
65: \RRnote{Contact author: Vincent Gramoli \texttt{vgramoli@irisa.fr}}
66: 
67: \RRabstract{Peer to peer (P2P) systems are moving from application specific architectures to 
68: a generic service oriented design philosophy.
69: This raises interesting problems in connection with providing useful
70: P2P middleware services that are capable of dealing with resource assignment
71: and management in a large-scale, heterogeneous and unreliable environment.
72: One such service, the slicing service, has been proposed to allow for an
73: automatic partitioning of P2P networks into groups (slices) that
74: represent a controllable amount of some resource and that are also relatively
75: homogeneous with respect to that resource, in the face of churn and
76: other failures.
77: In this report we propose two algorithms to solve the distributed slicing problem.
78: The first algorithm improves upon an existing algorithm that is based on 
79: gossip-based sorting of a set of uniform random numbers.
80: We speed up convergence via a heuristic for gossip peer selection.
81: The second algorithm is based on a different approach: statistical approximation
82: of the rank of nodes in the ordering.
83: The scalability, efficiency  and resilience to dynamics of both algorithms 
84: relies on their gossip-based models.
85: We present theoretical and experimental results to prove the viability
86: of these algorithms.}
87: 
88: \RRresume{Un service de \emph{morcellement} d'un r\'eseau pair-\`a-pair permet de partitionner les n\oe uds du syst\`eme en plusieurs groupes appel\'es \emph{morceaux}.  Ce rapport pr\'esente deux algorithmes pour r\'esoudre le probl\`eme du morcellement r\'eparti.  Le premier algorithme am\'eliore un algorithme existant en acc\'erant son temps de convergence.
89: Le second algorithme utilise une approche diff\'erente d'appoximation statistique. Des r\'esultats th\'eoriques et experimentaux
90: montrent la viabilit\'e de nos algorithmes.}
91: 
92: 
93: \RRmotcle{Morcellement, Bavardage, Morceau, Va-et-vient, Pair-\`a-pair, Aggr\'egation, Grande \'echelle, Allocation de ressources.}
94: \RRkeyword{Slicing, Gossip, Slice, Churn, Peer-to-Peer, Aggregation, Large Scale, Resource Allocation.}
95: %% \RRprojet{Bibli} % cas d'un seul projet
96: 
97: %\RRprojets{Bibli et Ami} %% cas de 2 projets.
98: \RRprojet{Asap}
99: %% \RRtheme{\THBio} % cas d'un seul theme
100: \RRtheme{\THCom} %% cas de 2 themes
101: 
102: \URRennes
103: 
104: \makeRR
105: %
106: %\author{
107: %\authorblockN{Antonio Fern\'andez\authorrefmark{1},
108: %Vincent Gramoli\authorrefmark{2}, 
109: %%Mark Jelasity\authorrefmark{3}, 
110: %Ernesto Jimenez\authorrefmark{4},  \\
111: %Anne-Marie Kermarrec\authorrefmark{2}, 
112: %Michel Raynal\authorrefmark{2}
113: %} \\
114: %\authorblockA{
115: %\hspace{-0.5cm}\begin{tabular}{ccccccc}
116: %\authorrefmark{1}{\small Universidad Rey} &  & \authorrefmark{2}{\small IRISA, INRIA}  & & \authorrefmark{4}{\small Universidad Polit\'ecnica} \\
117: %{\small Juan Carlos,} & & {\small et Universit\'e Rennes 1,}    & & {\small de Madrid,} \\
118: %{\small 28933 M\'ostoles, Spain.} & & {\small 35042 Rennes, France.}    & & {\small 28031 Madrid, Spain.} \\ 
119: %{\small anto@gsyc.escet.urjc.es} & &  {\small \{vgramoli,akermarr,raynal\}@irisa.fr} & & {\small ernes@eui.upm.es}
120: %\end{tabular}}}
121: 
122: 
123: %\newcommand{\pb}{\emph Self-Organizing Locality-based Algorithm for Distributed Slicing}
124: 
125: \maketitle
126: 
127: %\doublespace
128: 
129: %\begin{abstract}
130: %Peer to peer (P2P) systems are moving from application specific architectures to 
131: %a generic service oriented design philosophy.
132: %This raises interesting problems in connection with providing useful
133: %P2P middleware services that are capable of dealing with resource assignment
134: %and management in a large-scale, heterogeneous and unreliable environment.
135: %One such service, the slicing service, has been proposed to allow for an
136: %automatic partitioning of P2P networks into groups (slices) that
137: %represent a controllable amount of some resource and that are also relatively
138: %homogeneous with respect to that resource, in the face of churn and
139: %other failures.
140: %In this report we propose two algorithms to solve the distributed slicing problem.
141: %The first algorithm improves upon an existing algorithm that is based on 
142: %gossip-based sorting of a set of uniform random numbers.
143: %We speed up convergence via a heuristic for gossip peer selection.
144: %The second algorithm is based on a different approach: statistical approximation
145: %of the rank of nodes in the ordering.
146: %The scalability, efficiency  and resilience to dynamics of both algorithms 
147: %relies on their gossip-based models.
148: %We present theoretical and experimental results to prove the viability
149: %of these algorithms. \\
150: %\end{abstract}
151: %\vspace{-0.3cm}
152: %\noindent
153: %Keywords: Slicing, Gossip, Slice, Churn, Peer-to-Peer, Aggregation, Large Scale, Resource Allocation. \\
154: %Technical Areas: Operating Systems and Middleware, Internet Computing and Applications, Peer-to-Peer. \\
155: %Contact Author: Vincent Gramoli, vgramoli@irisa.fr
156: 
157: \section{Introduction}
158: 
159: \subsection{Context and Motivations}
160: 
161: The peer to peer (P2P) communication paradigm has now become the prevalent
162: model to build large-scale distributed applications, able to cope  with both 
163: scalability and system dynamics.  This is now a mature technology:
164: peer to peer systems are slowly moving from application-specific
165:  architectures to a generic-service oriented design philosophy. 
166: More specifically, peer to peer protocols hold the promise to integrate into
167: platforms on top of which several applications with various requirements
168:  may cohabit. This leads to the interesting issue of resource 
169: assignment or how to allocate a set of nodes for a given application. 
170: Examples of  targeted platforms for such a service are telecommunication 
171: platforms, where some set of peers may be automatically assigned to a 
172: specific task depending on their capabilities,
173: testbed platform such as Planetlab~\cite{BBC+04}, 
174: or desktop-grid-like applications~\cite{A04}.
175:  
176: In this context, the ordered slicing service has been 
177: recently proposed as a building block to allocate resources,
178: i.e a set of nodes sharing some characteristics with respect to  a
179: given metric or attribute,  in a large-scale peer to peer
180: system. This service acknowledges the fact that peers 
181: potentially offer heterogeneous capabilities as 
182: revealed by many recent works describing heavy-tailed distributions
183:  of storage space, bandwidth, and uptime of peers~\cite{SGG02,BSV03,SR06}.
184:  The slicing  service \cite{JK06} enables peers in a large-scale
185:  unstructured network to self-organize into a partitioning,
186:  where partitions (slices) are connected overlay networks that represent
187:  a given percentage of some resource.
188: Such slices
189: can be allocated to specific applications later on.
190: The slicing is ordered in the sense that peers get ranked
191:  according to their capabilities expressed by an attribute value.
192: 
193: Large scale dynamic distributed systems consist of many 
194: participants that can join and leave 
195: at will. Identifying peers in such systems that have a similar level
196:  of power or capability (for instance, in terms of bandwidth, 
197: processing power, storage space, or uptime) in a completely 
198: decentralized manner is a difficult task. It is even harder 
199: to maintain this information in the presence of churn. 
200:  Due to the intrinsic dynamics of contemporary peer to peer
201:  systems it is impossible to obtain accurate information about 
202: the capabilities (or even the identity) of the system participants. 
203: Consequently, no node is able to maintain accurate information about 
204: all the nodes. This  disqualifies  centralized approaches.  
205: 
206: Taking this into account, we can summarize the ordered slicing problem
207: we tackle in this report:
208: we need to rank nodes depending on their capability, slice the network 
209: depending on these capabilities and, most importantly, readapting
210: the slices continuously to cope with system dynamism.  
211: 
212: Building  upon the work on ordered distributed slicing
213: %%MJ I removed "seminal"; felt it's too self-promoting
214: proposed in \cite{JK06}, here we focus on the issue of \textit{accurate}
215:  slicing.
216: That is, we focus on improving the quality and stability of the slices, both
217: aspects being crucial for potential applications.
218: 
219: %A slice is 
220: %composed of a subset of nodes which are attribute values close 
221: %to each other. Additionally we can consider slices ordered, 
222: %so that the first slice contains the nodes with the
223: % smallest attribute values while the last one contains 
224: %the nodes with the largest attribute values.
225: %A node determines the slice it belongs to by comparing
226: % its attribute value with the values of other nodes.
227: %Roughly speaking, depending on the attribute values of the other nodes 
228: %it encounters, a node $i$  determines if its attribute value 
229: %is relatively high or low and where it lies in the
230: %set of nodes. Thus, node $i$ tries to estimate 
231: %its rank (dictated by its capability) to determine 
232: %the slice it belongs to.
233: %%
234: %%MJ I felt the paragraph was reduntant and too detailed for an intro.
235: %%besides, we don't need it since nothing is built on this info here
236: 
237: 
238: \subsection{Contributions}
239: 
240: %%MJ moved para below here from before the section heading
241: %%MJ also commented it out, becuase it is redundant
242: %% note that the intro is very long, and this section in particular
243: 
244: %This report presents two main contributions. The first contribution is
245: %the speedup of the convergence of the \cite{JK06} algorithm. We then 
246: %identified a few issues of~\cite{JK06} related to slice 
247: %consistency under churn correlated to attribute values. 
248: %We fix these issues
249: % by proposing a second algorithm where each node 
250: %continuously approximates its rank relatively to the other peers.
251: % Both algorithms rely on  a gossip-based communication model,
252: %which has proved to be both lightweight and highly resilient to 
253: %system dynamics. 
254: 
255: 
256: The report presents two gossip-based solutions to slice the nodes
257:  according to their capability (reflected by an attribute value) 
258:  in a distributed manner with high probability.
259: The first contribution of the report builds upon the distributed 
260: slicing algorithm proposed in \cite{JK06} that we call the JK algorithm in the
261: sequel of this report.
262: The second algorithm is a different approach based on rank approximation
263: through statistical sampling.
264: 
265: In JK, each node $i$  
266: maintains a random number $r_i$, picked up uniformly at random (between 0 and 1),  
267: and an attribute value $a_i$, expressing its capability according to a given
268: metric. 
269: Each peer periodically gossips with another peer $j$, 
270: randomly chosen among the peers it knows about. If the 
271: order between $r_j$ and $r_i$ is different than 
272: the order between $a_j$ and $a_i$, random values 
273: are swapped between nodes. The algorithm ensures that 
274: eventually the order on the 
275: random values matches the order of the attribute ones.  
276: The quality of the ranking can then be measured by using a
277:  global disorder measure expressing
278: the difference between the exact rank and the actual rank of each peer along the attribute value.
279: 
280:  The first contribution of this report is to propose a local disorder
281:  measure so that a peer chooses the neighbor to communicate with in order to  maximize
282: the chance of decreasing the global disorder measure.
283:  The interest of this approach is to speed the convergence up. 
284: We provide the analysis and experimental results of this improvement.
285: 
286: Once peers are ordered along the attribute values, the slicing in JK 
287: takes place as follows. Random values are used
288:  to calculate which slice a node belongs to. For example,  
289: a slice containing 20\% of the  best nodes according to a given attribute, 
290: will be composed of the nodes that end up holding random values greater than 0.8. 
291: The accuracy of the slicing (independent from the accuracy of the ranking)
292:  fully depends on the uniformity of the random value
293:  spread between 0 and 1 and the fact that the proportion of 
294: random values between 0.8 and 1 is approximately (but usually not exactly)
295: 20\% of the nodes. Another contribution of this report is to precisely 
296: characterize the potential inaccuracy resulting from this effect.
297: 
298: This observation means that the problem of ordering nodes 
299: based on uniform random values is not fully sufficient for
300: determining slices.
301: This motivates us to find an alternative
302: approach to this algorithm and JK
303: in order to determine more precisely the slice each node belongs to.
304: 
305: Another motivation for an alternative approach is related to churn and dynamism.
306: It may well happen that the
307:  churn is actually correlated  
308: to the attribute value. For example, if the peers are 
309: sorted according to their connectivity potential, 
310: a portion of the attribute space (and therefore the random value space) might be 
311: suddenly affected. 
312: New nodes will then pick up new random values and eventually the distribution of random values will be skewed towards high values.
313: 
314: The second contribution is an alternative algorithm solving these issues  
315: by approximating the rank of the nodes in the ordering locally, without
316: the application of random values.
317:   The basic idea is that each node periodically estimates its rank 
318: along the attribute axis depending of the attributes it has seen so far.
319: This algorithm is robust and lightweight due to its
320: gossip-based communication pattern: each node communicates 
321: periodically with a
322: restricted dynamic neighborhood that guarantees connectivity and provides
323: a continuous stream of new samples.
324: Based on continuously aggregated information, 
325: the node can determine the
326: slice it belongs to with a decreasing error margin.  
327:  We show that this algorithm provides accurate estimation at the price of a slower convergence.
328: 
329: \subsection{Outline}
330: The rest of the report is organized as follows: Section \ref{related} surveys some related work. 
331: The system model is presented in Section \ref{model}. 
332: The first contribution of an improved ordered slicing algorithm based 
333: on random values is presented  in Section~\ref{JK+} and the second algorithm 
334: based on dynamic ranking in Section~\ref{sec:ranking}. 
335: %Section \ref{sec:dicussion} provide some material for discussion before concluding in Section \ref{conclusion}.
336: Section~\ref{conclusion} concludes the report.
337: 
338: 
339: \section{Related Work}
340: \label{related}
341: %A more general problem than the one investigated here 
342: %is the sorting problem, where each node is ordered among others.
343: %A version of this problem has been referred as the 
344: %\emph{external sorting problem}~\cite{DNS91}.  What caught our attention is that 
345: %this provides a distributed sorting
346: %algorithm where the memory space of each processor does not necessarily 
347: %depend on the input and it outputs a sorted sequence of values
348: %distributed among processors.  
349: %The \emph{percentile finding} problem was defined in~\cite{IRV89} as
350: %dividing a set of values into equally sized sets.
351: 
352: Most of the solutions proposed so far for ordering nodes come
353: from the context of databases,
354: where parallelizing query executions is used to improve efficiency. 
355: A large majority of the solutions in this area rely on centralized gathering
356: or all-to-all exchange, which makes them unsuitable 
357: for large-scale networks.
358: %
359: For instance, the \emph{external sorting problem}~\cite{DNS91} consists in 
360: providing
361: a distributed sorting algorithm where the memory space of each 
362: processor does not necessarily depend on the input. This algorithm
363: must output a sorted sequence of values distributed among 
364: processors.  
365: %
366: The solution proposed in~\cite{DNS91} needs a global merge of 
367: the whole information, and
368: thus it implies a centralization of information.
369: Similarly, the \emph{percentile finding} problem~\cite{IRV89}, which 
370: aims at dividing a set of values into equally sized sets,
371: requires a logarithmic number of all-to-all message exchanges.
372: %Despite the unaffordable number of messages it requires, it would 
373: %be worthy to investigate an adaptation of this algorithm using
374: %gossip-based algorithm for the slice partitioning problem.
375: 
376: Other related problems are the selection problem and the $\phi$-quantile search.
377: The selection problem~\cite{FR75,BFPRT72} aims at determining the $i^{th}$
378: smallest element with as few comparisons as possible.
379: The $\phi$-\emph{quantile} search (with 
380: $\phi \in (0,1]$) is the problem to find among $n$ elements the $(\phi n)^{th}$ element.
381: Even though these problems look similar to our problem, 
382: %in the sense
383: %that they all focus on finding the ordering of a node among the system, it remains 
384: %different in the following sense: 
385: they aim 
386: at finding a specific node among all, while the distributed slicing problem aims
387: at solving a global problem where each node maintains a piece of
388: information.  
389: Additionally, solutions to the quantile search problem like the one presented 
390: in~\cite{KDG03} use an approximation of the system size. The same holds 
391: for the algorithm in~\cite{SDCM06}, which uses similar ideas to determine the distribution
392: of a utility in order to isolate peers with high capability---i.e., super-peers.
393: 
394: %As far as we know, the most similar work appeared in
395: As far as we know, the distributed slicing problem was studied in a P2P system 
396: for the first time in~\cite{JK06}. In this report, a node with the $k^{th}$
397: smallest attribute value, among those in a system of size $n$, tries to estimate
398: its normalized index $k/n$.
399: %where 
400: %each node $i$ is provided with a random value which is an estimate of the 
401: %rank of node $i$.   
402: %periodically exchange random value to reflect the rank
403: %of its attribute value among all attribute values of the system.  
404: %
405: The \emph{JK algorithm} proposed in~\cite{JK06} works as follows.
406: Initially, each node draws independently and uniformly a random value in the
407: interval $(0,1]$ which serves as its first estimate of its normalized index.
408: Then, the nodes use a variant of Newscast~\cite{JMB05} to gossip among each
409: other to exchange random values when they
410: find that the relative order of their random values and that of their attribute
411: values do not match.
412: %The algorithm aims at exchanging random 
413: %values among peers to reflect the order given by their attribute values.
414: %When for any $k$ the node with the $k^{th}$ attribute value has also 
415: %the $k^{th}$ random value, the system is ordered and random value.
416: %%
417: This algorithm is robust in face of frequent 
418: dynamics and guarantees a fast convergence to the same sequence of peers with
419: respect to the random and the attribute values. At every point in time the
420: current random value of a node serves to estimate the slice to which it
421: belongs (its slice).
422: %Consequently this solution 
423: %is well-suited for ordering nodes in a large-scale dynamic systems.
424: %The first algorithm we propose speeds up this decrease.
425: 
426: %The more similar work has been proposed in~\cite{JK06}, the authors investigate 
427: %the slice partitioning problem in a deterministic way.
428: %However, their approach requires that each node generates a random 
429: %value.  More precisely, each node maintains a view containing neighbors 
430: %information such as their id, their attribute value, and their random value.  
431: %The algorithm presented in~\cite{JK06} makes a node to select an arbitrary 
432: %misplaced neighbor to swap its random value with.  A node estimates the slice
433: %it belongs to according to the rank given by the current random value it owns.
434: %This solves the slice partitioning problem if the random values are uniformly 
435: %spread among all possible values.
436: 
437: \section{Model}
438: \label{model}
439: 
440: \subsection{System model}
441: 
442: We  consider a  system $\Sigma$ containing  a set
443:  of $n$ uniquely identified nodes. (Value $n$ may vary over time, dynamics is explained below).
444: %%MJ don't know what this means, but not needed anyway: commented out
445: % we should use no footnotes at all in any case, or only if it is absolutely
446: %essential
447:  %\footnote{The value $n$ is observed instantaneously but may vary over time.}  
448: The set of identifiers is denoted by $I$.
449: %and the number of nodes  by $n$.
450: Each node can leave and new nodes can join the system at any  
451: time, thus the number of nodes is a function of time.
452: Nodes may also crash. In this report,
453: we do not differentiate between a crash and a voluntary node departure. 
454: 
455: Each node $i$ maintains an attribute value $a_i$, reflecting the node
456: capability according to a specific metric.
457: These attribute values over the network might have an arbitrary
458: skewed distribution.
459: Initially, a node
460: has no global information neither about the structure or size of the system nor
461: about 
462: the attribute values of the other nodes. 
463: 
464: We can define a total ordering over the nodes based on their attribute value,
465: with the node identifier used to break ties.
466: Formally, we let $i$ precede $j$ if and only if $a_i < a_j$, or
467: $a_i = a_j$ and $i < j$. We refer to this totally ordered sequence as the
468: \emph{attribute-based sequence}, denoted by $A.\ms{sequence}$. The attribute-based
469: rank of a node $i$, denoted by $\alpha_i \in \{1,...,n\}$, is defined as the index
470: of $a_i$ in $A.\ms{sequence}$. 
471: %
472: For instance, let us consider three nodes: 1, 2, and 3, with three
473: different attribute values $a_1 = 50$, $a_2 = 120$, and $a_3 = 25$.
474: In this case, the attribute-based rank of node $1$ would be
475: $\alpha_1 = 2$.
476: %
477: In the rest of the report, we assume that nodes are sorted according to a single 
478: attribute and that each node belongs to a unique slice.
479: The sorting along several attributes is out of the scope of this report.
480: 
481: 
482: %Initially, node $i$ does not know the value of $\alpha_i$.
483: %%MJ Nor does it learn it later, at least in the ordering approach
484: %AMK: I don;t think we should mention this indeed.
485: 
486: 
487: \subsection{Distributed Slicing Problem}\label{ssec:pb}
488: 
489: Let ${\cal S}_{l,u}$ denote the \emph{slice} containing every node $i$ whose
490: normalized rank, namely $\frac{\alpha_{i}}{n}$, satisfies $l < \frac{\alpha_{i}}{n}
491: \leq u$
492: where $l\in [0,1)$ is the slice lower boundary and $u\in (0,1]$ is 
493: the slice upper boundary so that all slices represent adjacent intervals $(l_1, u_1], (l_2, u_2]$...
494: Let us assume that we partition the interval $(0,1]$ using a set of slices,
495: and this partitioning is known by all nodes.
496: The distributed slicing problem requires each node 
497: to determine the slice it currently belongs to.
498: Note that the problem stated this way is similar to the
499: ordering problem, where each node has to determine its own index in
500: $A.\ms{sequence}$.
501: However, the reference to slices introduces special requirements related to
502: stability and fault tolerance, besides, it allows for future generalizations
503: when one considers different types of categorizations.
504: 
505: %\subsection{Example of Slicing a Population}
506:  Figure~\ref{fig:size} illustrates an example 
507: of  a population of 10 persons, to be sorted 
508: against their height.
509: A partition of this population could be defined by two slices of the same
510: size: the group of short persons, and the 
511: group of tall persons. This is clearly an example 
512: where the distribution of attribute values is skewed towards 2 meters.
513:  The rank of each person in
514: the population and the two slices are represented on the bottom axis. Each
515: person is represented as a small cross on these axes.\footnote{Note that the shortest (resp. largest) rank is represented by a cross at the 
516: extreme left (resp. right) of the bottom axis.} 
517: Each slice is represented as an oval.  The slice $S_1 = {\cal
518: S}_{0,\frac{1}{2}}$ contains the five shortest persons and 
519: the slice $S_2 = {\cal S}_{\frac{1}{2},1}$ contains the five tallest
520: persons.
521: 
522: \begin{figure}
523: \centering\includegraphics[scale=0.6]{size_new.eps}
524: \caption{Slicing of a population based on a height attribute.}
525: \label{fig:size}
526: \end{figure}
527: 
528: Observe that another way of partitioning the population could be to define the
529: group of short persons as that containing all the persons shorter than a 
530: predefined measure (e.g., $1.65m$) and the group of tall persons as that containing
531: the persons taller than this measure.
532: %The groups differs from the slices because the slices rely tightly on 
533: %the distribution of attribute values:  while one of these three
534: %groups could be empty even though the population is large, slices have a minimal size
535: %depending on the population size.
536: However, this way of partitioning would most certainly lead to an unbalanced
537: distribution of persons, in which, for instance a group might be empty (while a
538: slice is almost surely non-empty). Since the distribution of attribute values is
539: unknown and hard to predict, defining relevant groups is a difficult task.  For
540: example, if the distribution of the human heights were unknown, then the persons
541: taller than $1m$ could be considered as tall and the persons shorter than $1m$
542: could be considered as short.  Conversely, slices partition the population  
543: into subsets representing a predefined portion of this population.
544: Therefore, in the rest of the report, we consider slices as defined as a proportion of the
545: network.
546: 
547: \subsection{Facing Churn}
548: 
549: %%MJ rewrote it from skretch. Still, I'm a bit hesitant, becase we do not
550: % actually solve these problems, do we?
551: %AMK: we do somehow in the second case
552: 
553: Node churn, that is, the continuous arrival and departure of nodes is an intrinsic characteristic 
554: of peer to peer systems and may significantly impact  the outcome, and more specifically the accuracy 
555: of the slicing algorithm.
556: The easier case is when  the distribution of the attribute values of the departing and
557: arriving nodes are identical.
558: In this case, in principle, the arriving nodes must find their slices, but
559: the nodes that stay in the system  are mostly able to keep their slice assignment.
560: Even in this case however, nodes that are close to the border of a slice
561: may expect frequent changes in their slice due to the variance of the
562: attribute values, which is non-zero for any non-constant distribution.
563: If the arriving and departing nodes have different attribute distributions,
564:  so that the distribution in the actual network of live nodes keeps changing,
565: then this effect is amplified. However, we believe that this is a realistic assumption
566: to consider that the churn may be correlated to some specific values (for example 
567: if the considered attribute is uptime or connectivity).
568: 
569: 
570: \section{Dynamic Ordering by Exchange of Random Values}\label{sec:ordering}
571: \label{JK+}
572: 
573: This section proposes an  algorithm for the distributed slicing 
574: problem improving upon the original JK algorithm \cite{JK06},
575:  by considering a local measure of the global disorder function.
576: In this section we present the algorithm along with the corresponding analysis and 
577: simulation results.
578: 
579: \subsection{On Using Random Numbers to Sort Nodes}
580: 
581: This Section presents the algorithm built upon JK. 
582: We refer to this algorithm as \emph{mod-JK} (standing for modified JK).
583: In JK,
584: %We build upon  the JK algorithm, where
585: each node $i$
586:  generates a number $r_i\in (0,1]$ independently and uniformly at random.
587: The key idea is to sort these random numbers with 
588: respect to the attribute values by swapping these random numbers between nodes,
589: so that if $a_{i} < a_{j}$ then $r_{i} < r_{j}$. 
590: Eventually,  the attribute values (that are fixed) and the 
591: random values (that are exchanged)  should be sorted in the same order.
592: That is, each node would like to obtain the $x^{th}$ largest random number if 
593: it owns the $x^{th}$ largest attribute value.
594: Let $R.\ms{sequence}$ denote the \emph{random sequence} obtained 
595: by ordering all nodes according to their random number.
596: Let $\rho_i(t)$ denote the index of node $i$ in $R.\ms{sequence}$ at time
597: $t$. When not required, the time parameter is omitted.
598: 
599: To illustrate the above ideas, consider that nodes 1, 2, and 3 
600: %from the example above 
601: from the previous example
602: have three
603: distinct random values: $r_1 = 0.85$, $r_2 = 0.1$, and $r_3 = 0.35$.
604: In this case, the index $\rho_1$ of node $1$ would be $3$. Since the
605: attribute values are $a_{i}=50$, $a_{2}=120$, and $a_{3}=25$, the algorithm
606: must achieve the following final assignment of random numbers: $r_{1}=0.35$,
607: $r_{2}=0.85$, and $r_{3}=0.1$.
608: 
609: Once sorted, the random values are used to determine the portion of the network a peer belongs to.
610: 
611: \subsection{Definitions}
612: 
613: 
614: \paragraph{View.}
615: Every node $i$ keeps track of some neighbors and their age. 
616: The \emph{age} of neighbor $j$ is a timestamp, $t_j$, set to 0 when $j$ becomes a 
617: neighbor of $i$.
618: Thus, node $i$ maintains an array
619: containing the id, the age,
620: the attribute value, and the 
621: random value of its neighbors.  This array, denoted ${\mathcal N}_i$, is 
622: called the \emph{view} of node $i$. The views of all nodes have the same size, 
623: denoted by $c$.  
624: %Neighborhoods are updated regularly following a simple protocol called 
625: %newscast~\cite{JMB05}.
626: 
627: \paragraph{Misplacement.}
628: A node participates in the algorithm by exchanging its rank with a misplaced 
629: neighbor in its view.  Neighbor $j$ is misplaced 
630: if and only if
631: \begin{itemize}
632: \item $a_i >  a_j$ and  $r_i <  r_j$, or
633: \item $a_i <  a_j$ and  $r_i >  r_j$.
634: \end{itemize}
635: We can characterize these two cases by the predicate $(a_j - a_i)(r_j - r_i)<0$.
636: %We call the \emph{misplaced measurement} of node $i$ 
637: %the difference $|\frac{a_i}{n} - r_i|$, and the larger this value is,
638: %the more $i$ is misplaced  (if this value is large, then $i$ estimate is far 
639: %from reflecting its real rank).
640: %%MJ you never actually use this definition
641: 
642: \paragraph{Global Disorder Measure.}
643: 
644: In~\cite{JK06}, a measure of the relative
645: disorder of sequence $R.\ms{sequence}$ with respect to sequence
646: $A.\ms{sequence}$ was introduced, called the 
647: \emph{global disorder measure (GDM)} and defined, for any time $t$, as
648: $$\ms{GDM}(t) = \frac{1}{n}\sum_{i}(\alpha_i - \rho(t)_i)^2.$$
649: 
650: The minimal value of GDM is 0, which is obtained
651: when $\rho(t)_i = \alpha_i$ for all nodes $i$.  In this case the 
652: attribute-based index of a node is equal to its random value index, indicating
653: that random values are ordered.
654: 
655: \subsection{Improved Ordering Algorithm}
656: 
657: In this algorithm, each node $i$ searches its own view ${\mathcal N}_i$
658:  for misplaced neighbors. Then, 
659: one of them is chosen to swap  random value with. This process is repeated
660: until there is no global disorder.
661: In this version of the algorithm, we provide each node with  the capability
662: of measuring locally the disorder. This leads to  a new
663: heuristic for each node to determine 
664: the neighbor to exchange with which decreases most the disorder.
665: 
666: The proposed technique attempts to decrease the global
667: disorder in each exchange as much as possible via selecting the neighbor
668: from the view that minimizes the local disorder (or, equivalently,
669: maximizes the order \emph{gain}) as defined below.
670: 
671: For a node $i$ to evaluate the gain of exchanging with a node $j$
672: of its current view ${\mathcal N}_i$,
673: we define its \emph{local disorder measure} (abbreviated \emph{LDM}$_i$).  Let $LA.\ms{sequence}_i$ and 
674: $LR.\ms{sequence}_i$ be the local attribute sequence and the local 
675: random sequence of node $i$, respectively.   These sequences are computed locally by $i$ 
676: using the information ${\mathcal N}_i \cup \{i\}$.  Similarly to $A.\ms{sequence}$
677: and $R.\ms{sequence}$, these are the sequences of neighbors where each node
678: is ordered according to its attribute value and random number, respectively.
679: Let, for any $j\in {\cal N}_i \cup \{i\}$, $\ell\rho_j(t)$ and $\ell\alpha_j(t)$ be the 
680: indices of $r_j$ and $a_j$ in sequences $LR.\ms{sequence}_i$ and
681: $LA.\ms{sequence}_i$, respectively, at time $(t)$.
682: At any time $t$, the local disorder measure of node $i$ is defined as:
683: %
684: $$\ms{LDM}_i(t) = \frac{1}{c+1}\sum_{j\in {\mathcal N}_i(t)\cup \{i\}}(\ell\alpha_j(t) - \ell\rho_j(t))^2.$$
685: We denote by $G_{i,j}(t+1)$ the reduction on this measure that $i$ obtains after
686: exchanging its random value with node $j$ between 
687: time $t$ and $t+1$. We define it as:
688: %{\small
689: \begin{eqnarray} 
690: G_{i,j}(t+1) &=& \ms{LDM}_i(t) - \ms{LDM}_i(t+1), \notag \\
691: G_{i,j}(t+1) &=& \frac{(\ell\alpha_i(t) - \ell\rho_i(t))^2 + (\ell\alpha_j(t) -
692: \ell\rho_j(t))^2
693: - (\ell\alpha_i(t) - \ell\rho_j(t))^2 - (\ell\alpha_j(t) - \ell\rho_i(t))^2}{c+1}.
694: \label{eq:gain} 
695: \end{eqnarray}
696: %}
697: 
698: The heuristic used chooses for node $i$ the misplaced neighbor $j$ that
699: maximizes $G_{i,j}(t+1)$.
700: %\begin{eqnarray} 
701: %$G_{i,k}(t+1) &=& \tau_i(t) - \tau_i(t+1), \notag \\
702: % %The gain that $i$ obtains after swapping with $k$ can be expressed as:
703: % G_{i,k}(t+1) &=& (\alpha_i - \rho_i)^2 + (\alpha_k - \rho_k)^2 - (\alpha_i - \rho_k)^2 - (\alpha_k - \rho_i)^2. \notag 
704: % \end{eqnarray}
705: % That is, the node $k$ that maximizes the gain verifies, for any $j\in V_i$:
706: % \begin{eqnarray} 
707: % G_{i,j}(t+1) &\leq& G_{i,k}(t+1), \notag \\
708: % % (p_j - r_j(t))^2 - (p_i - r_j(t))^2 - (p_j - r_i(t))^2 &\leq& (p_k - r_k(t))^2 - (p_i - r_k(t))^2 - (p_k - r_i(t))^2, \notag \\
709: % \alpha_i \rho_j(t) + \alpha_j \rho_i(t) - \alpha_j \rho_j(t)  &\leq& \alpha_i \rho_k(t) + \alpha_k \rho_i(t) - \alpha_k \rho_k(t). \Box\notag
710: %\end{eqnarray}
711: 
712: \subsubsection{Sampling Uniformly at Random}
713: 
714: The algorithm relies on the fact that potential misplaced nodes are found
715: so that they can swap their
716: random numbers thereby increasing order. 
717: If the global disorder is high,  it is very likely that
718: any given node has misplaced neighbors in its view to exchange with.
719: %
720: %Nevertheless, the ordering slows down drastically when only few nodes are misplaced.
721: %To ensure continuous ordering, the underlying membership protocol must provide
722: %overlying layer with samples as uniformly drawn as possible.
723: Nevertheless, as the system gets ordered, it becomes more unlikely for a node
724: $i$ to have misplaced neighbors.
725: In this stage the way the view is composed plays a crucial role: if fresh
726: samples from the network are not available, convergence can be slower than
727: optimal.
728: %This problem is exacerbated if the neighbors
729: %of node $i$ are not drawn uniformly, since then some nodes, misplaced with
730: %respect to $i$, tend to 
731: %remain misplaced since they never show up in the view of $i$.
732: %%MJ This is not true in this way, as non-uniform can even be better than
733: % uniform. I said something softer.
734: 
735: Several protocols may be used to provide a random and dynamic sampling 
736: in a peer to peer system such as Newscast~\cite{JMB05}, Cyclon~\cite{VGS05} or Lpbcast~\cite{JGKS04}.
737: They differ mainly by their \textit{closeness}
738: to the uniform random sampling of the neighbors and the way they handle churn.
739: In this report,
740: we chose to  use a variant of the Cyclon
741: protocol to construct and update the views~\cite{I05}, as it is reportedly
742: the best approach to achieve a uniform random neighbor set for all nodes.
743: %%%%% show that Cyclon creates more random views whereas Newscast
744: %tends to clusterize the communication graph.  Technically, 
745: %the Cyclon graph (in which links are defined by the views) has a clustering
746: %coefficient similar to the 
747: %clustering coefficient of a random graph~\cite{ER59}.  
748: %More precisely, the coefficient provided by Cyclon is more than 1800 time closer 
749: %to the coefficient of a random graph, than the coefficient provided by Newscast.
750: %
751: %A direct consequence is that with Cyclon a node has a higher chance of
752: %discovering new nodes than with Newscast so the algorithm described in
753: %Figure~\ref{alg:rand}
754: %can be expected to offer better convergence speed.
755: 
756: %AMK: I actually removed almost evrything here and certainly did not claim 
757: %that was a  contribution.
758: 
759: 
760: \subsubsection{Description of the Algorithm}
761: \label{sec:modifiedjk}
762: 
763: \begin{table}
764:   \centering
765:   \begin{tabular}{|c|l|}
766: \hline
767: {\small {\bf Variable}} & {\small {\bf Description}} \\ \hline \hline
768: $j$ & {\small the identifier of the neighbor} \\ \hline
769: $t_j$ & {\small the age of the neighbor} \\ \hline
770: $a_j$ & {\small the attribute value of the neighbor} \\ \hline
771: $r_j$ & {\small the random value of the neighbor} \\ \hline
772: \end{tabular}
773: \caption{The array corresponding to the view entry of the neighbor $j$ ($j\in {\cal N}_i$).}
774: \label{table:entry}
775: \end{table}
776: 
777: 
778: 
779: The algorithm is presented in Figure~\ref{alg:rand}.  The active thread at node $i$
780: runs the membership (gossiping) procedure ($\lit{recompute-view}()_i$)
781: and the 
782: exchange of random values periodically.
783: As motivated above, the membership procedure, specified in
784: Figure~\ref{alg:rcyclon}, is similar to the Cyclon algorithm: 
785: each node $i$ maintains a view ${\mathcal N_i}$ containing one
786: entry per neighbor.  The entry of a neighbor $j$ corresponds to a tuple
787: presented in Table~\ref{table:entry}.  Node $i$ copies its view, selects the
788: oldest neighbor $j$ of its view, removes the entry $e_{j}$ of $j$ from the copy
789: of its
790: view, and finally sends the resulting copy to $j$.  When $j$ receives the
791: view, $j$ sends its own view back to $i$ discarding possible pointers to 
792: $i$, and $i$ and $j$ update their view with the one they 
793: receive. This variant of Cyclon, as opposed to the original version,
794: exchanges all entries of the view at each step.
795: 
796: 
797: %\input{rand.alg}
798: \begin{figure}
799: \centering{
800: \fbox{
801: \begin{minipage}[ht!]{150mm}
802: \footnotesize
803: \renewcommand{\baselinestretch}{1.5}
804: \resetline
805: \begin{tabbing}
806: aaaaA\=aaaaaA\=aaaaaaA\kill
807: {\bf Initial state of node $i$} \\
808: \line{L00} \> $\ms{period}_{i}$, initially set to a constant; \\
809: $r_i$, a random value chosen in $(0,1]$; 
810: $a_i$, the attribute value;   \\
811: $\ms{slice}_i \gets \bot$, the slice $i$ belongs to;
812: ${\mathcal N_i}$, the view; \\
813: $\ms{gain}_{j'}$, a real value indicating the gain achieved by exchanging with
814: $j'$; \\
815: $\ms{gain-max} = 0$, a real.
816: \\ ~ \\
817:  
818:  
819: %{\bf Active thread at node $i$} \\
820: %\line{L02} \> $\act{wait}(\ms{period_i})$ \\
821: %~(\ref{L02}a) \> $\act{recompute-view}()_i$ \\
822: %\line{L03} \> {\bf for} $j' \in {\mathcal N}_i$ \\
823: %~(\ref{L03}a) \> \T {\bf if} $\ms{gain}_{j'} \geq \ms{gain-max}$ {\bf then} \\
824: %~(\ref{L03}b) \> \T \T $\ms{gain-max} \gets \ms{gain}_{j'}$ \\
825: %~(\ref{L03}c) \> \T \T $j \gets j'$ \\
826: %~(\ref{L03}d) \> {\bf end for} \\
827: %\line{L04} \> $\act{send}(\lit{REQ}, r_i, a_i)$ to $j$ \\
828: %~(\ref{L04}a) \> $\act{recv}(\lit{ACK}, r_j')$ from $j$ \\
829: %\line{L05} \> $r_j \gets r_j'$ \\
830: %~(\ref{L05}a) \> {\bf if} $(a_j - a_i)(r_j - r_i) < 0$ {\bf then} \\
831: %~(\ref{L05}b) \> \T $r_i \gets r_j$ \\
832: %~(\ref{L05}c) \> \T $\ms{slice}_i \gets {\cal S}_{l,u}$ such that $l < r_i \leq u$ \\
833: % ~ \\
834: 
835:  
836: {\bf Active thread at node $i$} \\
837: \line{M01} \> $\act{wait}(\ms{period_i})$ \\
838: \line{M02} \> $\act{recompute-view}()_i$ \\
839: \line{M03} \> {\bf for} $j' \in {\mathcal N}_i$ \\
840: \line{M04} \> \T {\bf if} $\ms{gain}_{j'} \geq \ms{gain-max}$ {\bf then} \\
841: \line{M05} \> \T \T $\ms{gain-max} \gets \ms{gain}_{j'}$ \\
842: \line{M06} \> \T \T $j \gets j'$ \\
843: \line{M07} \> {\bf end for} \\
844: \line{M08} \> $\act{send}(\lit{REQ}, r_i, a_i)$ to $j$ \\
845: \line{M09} \> $\act{recv}(\lit{ACK}, r_j')$ from $j$ \\
846: \line{M10} \> $r_j \gets r_j'$ \\
847: \line{M11} \> {\bf if} $(a_j - a_i)(r_j - r_i) < 0$ {\bf then} \\
848: \line{M12} \> \T $r_i \gets r_j$ \\
849: \line{M13} \> \T $\ms{slice}_i \gets {\cal S}_{l,u}$ such that $l < r_i \leq u$ \\
850:  ~ \\
851: 
852: 
853: %% with period, the other nodes must send more frequently
854: %~(\ref{L01})  \> $\act{wait}(\ms{period_i})$ \\
855: %~(\ref{L02})  \> ${\mathcal N}_i \gets \act{recompute-view}()_i$ \\
856: %~(\ref{L03})  \> {\bf for} $j' \in {\mathcal N}_i$ \\
857: %~(\ref{L04}') \> \T {\bf if} $\lit{dist}(a_{j'},b) < dmin$ {\bf then} \\
858: %~(\ref{L05}') \> \T \T $\ms{dist-min} \gets \lit{dist}(a_{j'},b)$ \\
859: %~(\ref{L06}')  \> \T \T $j \gets j'$ \\
860: %\line{L11}    \> \T {\bf if} $a_{j'} \leq a_i$ {\bf then} $\ell_i \gets \ell_i + 1$ \\
861: %\line{L12}    \> \T $g_i \gets g_i + 1$ \\
862: %~(\ref{L07})  \> $\act{send}(\lit{REQ},r_i)$ to $j$ \\
863: %~(\ref{L08})  \> $\act{recv}(\lit{ACK},r_j)$ from $j$ \\
864: %\line{L13}   \> {\bf if} $a_{j'} \leq a_i$ {\bf then} $\ell_i \gets \ell_i + 1$ \\
865: %\line{L14}   \> $g_i \gets g_i + 1$ \\
866: %\line{L15}   \> $r_i \gets \ell_i / g_i$ \\
867: %\line{L16}   \> $\ms{slice} \gets {\cal S}_{l,u}$ such that $l < r_i \leq u$\\
868: % ~ \\
869: 
870: {\bf Passive thread at node $i$ activated upon reception} \\
871: \line{R01} \> $\act{recv}(\lit{REQ}, r_j, a_j)$ from $j$ \\
872: \line{R02} \> $\act{send}(\lit{ACK}, r_i)$ to $j$ \\
873: \line{R03} \> {\bf if} $(a_j - a_i)(r_j - r_i) < 0$ {\bf then} \\
874: \line{R04} \> \T $r_i \gets r_j$ \\
875: \line{R05} \> \T $\ms{slice}_i \gets {\cal S}_{l,u}$ such that $l < r_i \leq u$
876:  
877: \end{tabbing}
878: \normalsize
879: \end{minipage} 
880: }
881: \caption{Dynamic ordering by exchange of random values.}
882: \label{alg:rand}
883: }
884: \end{figure}
885: 
886: The algorithm for exchanging random values from node $i$ starts by measuring 
887: the ordering that can be gained by swapping with each neighbor (Lines~\ref{M03}--\ref{M07}).
888: Then, $i$ chooses the neighbor $j \in {\mathcal N}_i$ that maximizes gain 
889: $G_{i,k}$ for any of its neighbor $k$. Formally, $i$ finds 
890: $j\in {\mathcal N}_i$ such that for any $k\in{\mathcal N}_i$, we have
891: \begin{eqnarray}
892: G_{i,j}(t+1) &\geq& G_{i,k}(t+1).  \label{eq:gainmax}
893: \end{eqnarray}
894: Using the definition of $G_{i,j}$ in Equation (\ref{eq:gain}), Equation (\ref{eq:gainmax}) is 
895: equivalent to 
896: \begin{eqnarray}
897: \ell\alpha_i(t) \ell\rho_j(t) + \ell\alpha_j(t) \ell\rho_i(t) - \ell\alpha_j(t)
898: \ell\rho_j(t)  &\geq& \ell\alpha_i(t) \ell\rho_k(t) + \ell\alpha_k(t) \ell\rho_i(t)
899: - \ell\alpha_k(t) \ell\rho_k(t). \notag
900: \end{eqnarray}
901: \noindent In Figure~\ref{alg:rand} of node $i$, we refer to $\ms{gain}_j$ as the value of
902: $\ell\alpha_i(t) \ell\rho_j(t) + \ell\alpha_j(t) \ell\rho_i(t) - \ell\alpha_j(t)
903: \ell\rho_j(t)$.
904: %\begin{eqnarray}
905: %G_{i,j}(t+1) &\geq& G_{i,k}(t+1), \notag \\
906: %\ell\alpha_i \ell\rho_j(t) + \ell\alpha_j \ell\rho_i(t) - \ell\alpha_j \ell\rho_j(t)  &\geq& \ell\alpha_i \ell\rho_k(t) + \ell\alpha_k \ell\rho_i(t) - \ell\alpha_k \ell\rho_k(t). \Box\notag
907: %\end{eqnarray}
908: 
909: %%this $\ms{gain}$ (i.e., 
910: %neighbor $j$ whose swapping minimizes the local disorder measure of $i$).
911: From this point on, $i$ exchanges its random value $r_i$ with the random value $r_j$ of 
912: node $j$ (Line~\ref{M10}). The passive threads are executed upon reception of a message.
913: In Figure~\ref{alg:rand}, when $j$ receives the random value $r_i$ of node 
914: $i$, it sends back its own random value $r_j$ for the exchange to occur (Lines~\ref{R01}--\ref{R02}).
915: % 
916: Observe that the attribute value of $i$ is also sent to $j$, so that $j$ can
917: check if it is correct to exchange before updating its own random number (Lines~\ref{R03}--\ref{R04}). Node
918: $i$ does not need to receive attribute value $a_{j}$ of $j$, since $i$ already has this 
919: information in its view and the attribute value of a node never changes over time.
920: 
921: \begin{figure}
922: \centering{
923: \fbox{
924: \begin{minipage}[b]{150mm}
925: \footnotesize
926: \renewcommand{\baselinestretch}{1.5}
927: \resetline
928: \begin{tabbing}
929: aaaaA\=aaaaaA\=aaaaaaA\kill
930: %{\bf Initial state of node $i$} \\
931: %\line{L00} \> $r_i$, a random value chosen in $(0,1]$. 
932: %$a_i$, the attribute\\ value. 
933: %${\mathcal N_i}$, the view.
934: %\\ ~ \\
935: %
936: {\bf Active thread at node $i$} \\
937: \line{V01} \> {\bf for} $j' \in {\mathcal N}_{i}$ {\bf do} $t_{j'} \gets t_{j'} + 1$ {\bf end for} \\
938: \line{V02} \> $j \gets j'': t_{j''} = \lit{max}_{j'\in {\mathcal N}_i}(t_{j'})$ \\
939: \line{V03} \> $\act{send}(\lit{REQ'}, {\mathcal N}_i \setminus \{e_j\} \cup \{\tup{i, 0, a_i, r_i}\})$ to $j$ \\
940: \line{V04} \> $\act{recv}(\lit{ACK'}, {\mathcal N}_{j})$ from $j$ \\
941: \line{V05} \> $\ms{duplicated-entries} = \{e: e.id \in {\mathcal N}_j \cap {\mathcal N}_i\}$ \\
942: \line{V06} \> ${\mathcal N}_i \gets {\mathcal N}_i \cup ({\mathcal N}_j \setminus \ms{duplicated-entries} \setminus \{e_i\})$ \\
943: ~ \\
944: 
945: {\bf Passive thread at node $i$ activated upon reception} \\
946: \line{RV01} \> $\act{recv}(\lit{REQ'}, {\mathcal N}_{j})$ from $j$ \\
947: \line{RV03} \> $\act{send}(\lit{ACK'}, {\mathcal N}_{i})$ to $j$ \\
948: \line{RV04} \> $\ms{duplicated-entries} = \{e \in {\mathcal N}_j: e.id \in {\mathcal N}_j \cap {\mathcal N}_i\}$ \\
949: \line{RV05} \> ${\mathcal N}_i \gets {\mathcal N}_i \cup ({\mathcal N}_j \setminus \ms{duplicated-entries})$ \\
950: %{\bf last line says that we updates views and keep timestamp of i'e entries, even if entries are duplicated in Ni and Nj}
951: %\line{RV05} \> ${\mathcal N}_i \gets {\mathcal N}_i \cup ({\mathcal N}_j \setminus ({\mathcal N}_j \cap {\mathcal N}_i))$
952: 
953: \end{tabbing}
954: \normalsize
955: \end{minipage} 
956: }
957: \caption{$\lit{recompute-view()}$: procedure used to update the view based on a simple variant of the Cyclon algorithm.}
958: \label{alg:rcyclon}
959: }
960: \end{figure}
961: 
962: 
963: \subsection{Analysis of Slice Misplacement}
964: \label{sec:anal}
965: 
966: In mod-JK, as in JK, the current random number
967: $r_{i}$ of a node $i$ determines the slice $s_{i}$ of the node. The
968: objective of both algorithms is to reduce the global disorder as quickly as
969: possible.  Algorithm mod-JK consists in choosing one neighbor among the possible neighbors
970: that would have been chosen in JK, plus the GDM of JK has been shown to fit an exponential 
971: decrease~\cite{JK06}. Consequently mod-JK experiences also an exponential decrease of the global disorder.
972: Eventually, JK and mod-JK 
973: ensure that the disorder has fully disappeared.
974: However, the accuracy of the slices heavily depends on the uniformity of the random value
975: spread between 0 and 1.  It may happen, that the distribution of the random values
976: is such that some peers decide upon a wrong slice. Even more problematic is the fact that this situation
977: is unrecoverable unless a new random value is drawn for all nodes. 
978: This may be considered as an inherent limitation of
979: the approach.
980: For example, consider a system of size 2, where nodes 1 and 2 have the
981: random values $r_1=0.1$, $r_2=0.4$. If we are interested in creating
982: two slices of equal size, the first slice will be of size 2 and the second
983: of size zero, even after perfect ordering of the random values.
984: 
985: Therefore, an important step is to characterize the inaccuracy of the 
986: uniform distribution  to access the potential impact on
987: the  slice assignment
988: resulting from the fact that uniformly generated random numbers are
989: not distributed perfectly evenly throughout the domain.
990: First of all, consider a slice $S_p$ of length $p$.
991: In a network of $n$ nodes,
992: the number of nodes that will fall into this slice is a random variable $X$
993: with a binomial distribution with parameters $n$ and $p$.
994: The standard deviation of $X$ is therefore $\sqrt{np(1-p)}$.
995: This means that the relative proportional expected difference from the
996: mean (i.e., $np$) can be approximated as $\sqrt{(1-p)/(np)}$, which is very large
997: if $p$ is small, in fact, goes to infinity as $p$ tends to zero,
998: although a very large $n$ compensates for this effect.
999: For a ``normal'' value of $p$, and a reasonably large network, the variance
1000: is very low however.
1001: 
1002: To stay with this random variable,
1003: the following result bounds, with high probability, its deviation from its
1004: mean.
1005: %Consequently, this bounds (w.h.p.) the number of nodes that can be misplaced in the system.
1006: 
1007: \begin{lemma}
1008: For any $\beta \in (0,1]$, a slice $S_p$ of length  $p \in (0,1]$ has a
1009: number of peers 
1010: $X \in [(1-\beta)np, (1+\beta)np]$ with probability at least 
1011: $1-\epsilon$ as long as $p  \geq \frac{3}{\beta^2 n} \ln (2/\epsilon)$.
1012: \end{lemma}
1013: \begin{proof}
1014: The way nodes choose their random number is like drawing $n$ times, with replacement and
1015: independently uniformly at random, a value in the interval $(0,1]$. 
1016: Let $X_1, ..., X_n$ be the $n$ corresponding independent identically distributed random 
1017: variables such that:
1018: \begin{equation}
1019: \left\{\begin{array}{ll}
1020: X_i &= 1 \text{~if the value drawn by node $i$ belongs to~}S_p\text{~and~}\notag
1021: \\
1022: X_i &= 0 \text{~otherwise.}
1023: \end{array}\right.
1024: \end{equation}
1025: 
1026: We denote $X= \sum_{i=1}^n X_i$ the number of elements of interval $S_p$ drawn 
1027: among the $n$ drawings. 
1028: The expectation of $X$ is  $np$. 
1029: From now on we compute the probability that a bounded portion of the expected 
1030: elements are misplaced.
1031: Two Chernoff bounds~\cite{MR95} give:
1032: \begin{equation}
1033: \left.\begin{array}{cc}
1034: \Pr[X\geq (1+\beta)np] &\leq e^{-\frac{\beta^2 np}{3}} \notag \\
1035: \Pr[X\leq (1-\beta)np] &\leq e^{-\frac{\beta^2 np}{2}} \notag
1036: \end{array}\right\} \\
1037: \Rightarrow \Pr[|X-np| \geq \beta np] \leq 2e^{-\frac{\beta^2 np}{3}},
1038: \end{equation}
1039: 
1040: \noindent with $0 < \beta \leq 1$. 
1041: That is, the probability that more than ($\beta$ time the number expected) elements are 
1042: misplaced regarding to interval $S_p$ is bounded by $2e^{-\frac{\beta^2 n
1043: p}{3}}$. We want this to be at most $\epsilon$. This yields the result.
1044: \end{proof}
1045: %
1046: %The previous result shows that the number of nodes allocated to a given slice
1047: %is not far from the amount is should be.
1048: %
1049: %%MJ Note that you draw the wrong conclusion (commented out), as the simple fact I included
1050: % shows that if p is small then we are in trouble, and also if n is
1051: % not that large. The Chernoff shows the same, but in a more obscure way.
1052: % I repeat that these bounds are useful only if one wants to prove some
1053: % complexity results, and these bounds are not that tight anyway to be
1054: % very useful directly.
1055: % Besides, the "theorem" is a simple application of the Chernoff bound, and
1056: % is rather basic. I renamed it to "lemma", and I also feel the proof can
1057: % be significantly shortened. To something like: "simple application of
1058: % the Chernoff bound".
1059: % I left the derivation in though, with these notes...
1060: 
1061: To measure the effect discussed above during the simulation experiments, 
1062: we introduce the slice 
1063: disorder measure (SDM) as the sum over all nodes $i$ of the
1064: distance between the slice $i$ actually belongs to and the slice $i$ believes
1065: it belongs to.
1066: % 
1067: For example (in the case where all slices have the same size), 
1068: if node $i$ belongs to the $1^{st}$ slice (according to its attribute 
1069: value) while it thinks it belongs to the $3^{rd}$ slice (according to its rank 
1070: estimate) then the distance for node $i$ is $|1-3| = 2$.
1071: %
1072: Formally, for any node $i$, let $S_{u_i, l_i}$ be the actual correct slice 
1073: of node $i$ and let $S_{\hat{u_i}, \hat{l_i}}(t)$ be the slice $i$ estimates
1074: as its slice at time $t$.  The slice disorder measure is defined as:
1075: %
1076: $$\ms{SDM}(t) = \sum_{i}\frac{1}{u_i - l_i}\left|\frac{u_i+l_i}{2}-\frac{\hat{u_i}+\hat{l_i}}{2}\right|.$$
1077: %
1078: $\ms{SDM}(t)$ is minimal (equals 0) if for all nodes $i$, 
1079: we have $S_{\hat{u_i}, \hat{l_i}}(t) = S_{u_i, l_i}$.
1080: %
1081: %
1082: %This
1083: %interval
1084: %is divided in $k$ subintervals $I_1, ..., I_k$ of length $\Delta_1, ..., \Delta_k \in (0,1]$, respectively.  
1085: %Let $\Delta$ be the minimal length of these intervals, formally $\Delta = min_{\forall j} \{p\}$.
1086: %Without loss of generality fix $j$ such that $p$ is one of these subintervals.
1087: %
1088: %
1089: %
1090: %
1091: %We then bound (using additive bounds), with high probability, the number of misplaced 
1092: %elements with respect to all subintervals.
1093: %\begin{eqnarray}
1094: %P &=& \sum_{j=1}^k \Pr[|X-np| > \beta np] \leq 2k e^{-\frac{\beta^2n\Delta}{3}}. \notag
1095: %\end{eqnarray}
1096: %
1097: %We want $P \leq \epsilon$:
1098: %\begin{eqnarray}
1099: %& 2k e^{-\frac{\beta^2 n \Delta}{3}} \leq \epsilon, \notag \\
1100: %\Rightarrow & -\frac{\beta^2n\Delta}{3} \leq \ln{\frac{\epsilon}{2k}}, \notag \\
1101: %\Rightarrow & \Delta \geq \frac{3}{\beta^2 n}\ln{\frac{2k}{\epsilon}}.\Box \notag
1102: %\end{eqnarray}
1103: %\end{proof}
1104: %
1105: %\begin{corollary}
1106: %For any $\beta \in (0,1]$, any slice of length $\Delta_i$ has a number of peers $n_i \in [(1-\beta)n\Delta_i, (1 +\beta)n\Delta_i]$ with high probability (of at least $1-1/n$) as long as  $\ms{min}_j p \geq \frac{3}{\beta^2 n} \ln{(2nk)}$.
1107: %\end{corollary}
1108: 
1109: In fact, it is simple to show that, in general, the probability of  dividing $n$ peers into two slices of the same size is less than $\sqrt{2/n\pi}$. This value is very small even for moderate values of  $n$. Hence, it is highly possible that the random number distribution  does not lead to a perfect division into slices. 
1110: 
1111: \subsection{Simulation Results}\label{sec:simu2}
1112: 
1113: We present simulation results using PeerSim~\cite{JMB04}, using a simplified
1114: cycle-based simulation model, where all messages exchanges are atomic,
1115: so messages never overlap.
1116: First, we compare the performance of the two algorithms: JK and mod-JK.
1117: Second, we study the impact of concurrency that is ignored by the cycle-based
1118: simulations.
1119: 
1120: \subsubsection{Performance Comparison}
1121: %To compare the performance of our algorithm, 
1122: %mod-JK, with JK,
1123: %we implemented both approaches in a cycle-based 
1124: %simulator using PeerSim~\cite{JMB04}.
1125: We compare the time taken by these algorithms 
1126: to sort the random values according to the attribute values (i.e., the node with 
1127: the $j^{th}$ largest attribute value of the system value obtains the $j^{th}$ random value).
1128: In order to evaluate the convergence speed of each algorithm, we use the slice
1129: disorder measure as defined in Section~\ref{sec:anal}.
1130: 
1131: We simulated $10^4$ participants in 100 equally sized slices (when unspecified), each with 
1132: a view size $c=20$.  Figure~\ref{fig:convergence2} illustrates the difference between the global disorder
1133: measure and the slice disorder measure while Figure~\ref{fig:convergence1} presents the evolution of
1134: the slice disorder measure over time for JK, and mod-JK.
1135: 
1136: \begin{figure*}
1137:   \begin{center}
1138:     \subfigure[]
1139:     { 
1140:       \label{fig:convergence2}
1141:       \resizebox{2.6in}{!}{\includegraphics[scale=0.5,angle=270,clip=true]{sdm_gdm.ps}}
1142:     }
1143:     \subfigure[]
1144:     {
1145:     \label{fig:convergence1}
1146:           \resizebox{2.6in}{!}{\includegraphics[scale=0.5,angle=270,clip=true]{newscast_cyclon_sdm.ps}}
1147:     }
1148:     \subfigure[]
1149:     { 
1150:       \label{fig:swap}
1151:       \resizebox{2.6in}{!}{\includegraphics[scale=0.5,angle=0,clip=true,bb=40 10 260 200,viewport=0 0 260 210
1152:       ]{swp.ps}}
1153:     } 
1154:     \subfigure[]
1155:     { 
1156:       \label{fig:sdm_count}
1157:       \resizebox{2.6in}{!}{\includegraphics[scale=0.5,angle=270,origin=c,clip=true]{sdm_cont.ps}}
1158:     }    
1159:     \caption{
1160:        (a) Evolution of the global disorder measure over time.
1161:        (b) Slice disorder measure over time.
1162:        (c) Percentage of unsuccessful swaps in the ordering algorithms.
1163:        (d) Convergence speed under high concurrency.}
1164:   \end{center}
1165: \end{figure*}
1166: 
1167: Figure~\ref{fig:convergence2} shows the different speed at which 
1168: the global disorder measure and the slice disorder measure converge.
1169: When values are sufficiently large, the GDM and SDM seem tightly related: if
1170: GDM increases then SDM increases too.  Conversely, there is a significant difference between 
1171: the GDM and SDM when the values are relatively low: the GDM reaches 0 while the SDM is lower bounded by 
1172: a positive value.  This is because the algorithm does lead to a totally
1173: ordered set of nodes, while it still does not associate each node with its
1174: correct slice. Consequently the GDM is not sufficient to rightly estimate the
1175: performance of our algorithms.
1176: 
1177: Figure~\ref{fig:convergence1} shows the slice disorder measure to compare
1178: the convergence speed of our algorithm to that of JK with 10 equally sized slices.
1179: Our algorithm converges significantly faster than JK.
1180: Note that none of the algorithm reaches zero SDM, since they are both based on
1181: the same idea of sorting randomly generated values.
1182: Besides, since they both used an identical set of randomly generated values,
1183: both converge to the same SDM.
1184: 
1185: \subsubsection{Concurrency}
1186: 
1187: The simulations are cycle-based and at each cycle an 
1188: algorithm step is done atomically so that no other execution is concurrent.
1189: %a node sends and receives messages in an atomic 
1190: %manner---while no other messages are in transit.  
1191: More precisely, the algorithms are simulated such that in each cycle, each node updates its view 
1192: before sending its random value or its attribute value.  Given this implementation, the 
1193: cycle-based simulator does not allow us to realistically simulate concurrency, and a
1194: drawback is that view is up-to-date when a message is sent.  In the following 
1195: we artificially introduce concurrency (so that view might be out-of-date) into the 
1196: simulator and show that it has only a slight impact on the convergence speed.
1197: 
1198: %This parameter is not realistic in case the
1199: %goal is to converge as fast as possible, especially if messages
1200: %are continuously sent for this purpose.  Let $\delta$ be the time a message takes to reach 
1201: %its destination.  If a large portion of nodes send a message in a 
1202: %period of $\delta$ time units then it is likely that at least two overlapping messages target the 
1203: %same node.
1204: %Here, we simulate an artificial concurrency to see the impact on convergence speed.
1205: 
1206: 
1207: %This implementation choice presents two side-effects in case of high concurrency injection, 
1208: %proper to ordering algorithms.  
1209: %First, there might be a large number of useless messages.  Second, the convergence
1210: %can slow down compare to the non-concurrent case.  
1211: %
1212: Introducing concurrency might result in some problems because of
1213: %These two problems arise from 
1214: the potential staleness of views: unsuccessful swaps due to useless messages.  
1215: Technically, the view of node $i$ might 
1216: indicate that $j$ has a random value $r$ while this value is no longer 
1217: up-to-date.  This happens if $i$ has lastly updated its view before $j$ swapped
1218: its random value with another $j'$.
1219: %
1220: Moreover, due to asynchrony, it could happen that by the time a message is received
1221: this message has become useless.  Assume that node $i$ sends its random value $r_i$ to $j$
1222: in order to obtain $r_j$ at time $t$ and $j$ receives it by time $t+\delta$. 
1223: With no loss of generality assume $r_i > r_j$.  Then if $j$ swaps its random value with
1224: $j'$ such that $r_j'>r_i$ between time $t$ and $t+\delta$, then the message of $i$ 
1225: becomes \emph{useless} and the expected swap does not occur (we call this an \emph{unsuccessful swap}).
1226: %
1227: %
1228: %When $i$ sends a message 
1229: %$m_i$ to a targeted neighbor $j$, then between the time $i$ updates its view and the time 
1230: %$j$ receives the message $m_i$, $j$ may swap its own random value with a distinct node $j'$ (upon 
1231: %reception of message $m_{j'}$ from $j'$). 
1232: %%
1233: %In case the message from $j'$ results in ordering $j$ regarding to $i$, then $j$ will
1234: %ignore the message from $i$.  This scenario occurs if initially $r_{j'} \leq r_i \leq r_j$
1235: %and $a_j < a_i$ or $r_j \leq r_i \leq r_{j'}$ and $a_i < a_j$. 
1236: %Consequently, the random value of $j$ that $i$ knows of is stale, and message $m_i$ does not
1237: %result in any swap, $j$ ignores it.
1238: 
1239: %
1240: %\begin{figure*}
1241: %  \begin{center}
1242: %    \caption{
1243: %    }
1244: %  \end{center}
1245: %\end{figure*}
1246: 
1247: Figure~\ref{fig:sdm_count} indicates the impact of concurrent message exchange
1248: on the convergence speed while
1249: Figure~\ref{fig:swap} shows the amount of useless messages that are sent.
1250: %shows the percentage of messages that are ignored during
1251: %simulations of the ordering algorithms (JK and mod-JK) over the total number 
1252: %of messages.
1253: Now, we explain how the concurrency is simulated.
1254: Let the \emph{overlapping messages} be a set of messages that mutually overlap: 
1255: it exists, for any couple of overlapping messages, at least one instant at 
1256: which they are both in-transit.  
1257: For each algorithm we simulated \textit{(i)}~full concurrency: in a given 
1258: cycle, all messages are overlapping messages; and \textit{(ii)}~half 
1259: concurrency: in a given cycle, each message is an overlapping message with 
1260: probability $\frac{1}{2}$.
1261: Generally, we see that increasing the concurrency increases the number of
1262: useless messages. Moreover, in the modified version of JK, more messages are 
1263: ignored than in the original JK algorithm.  This
1264: is due to the fact that some nodes (the most misplaced ones) are more likely 
1265: targeted which increases the number of concurrent messages arriving at the 
1266: same nodes.  Since a node $i$ ignored more likely a message when it receives
1267: more messages during the same cycle, it comes out that concentrating
1268: message sending at some targets increases the number of useless messages.
1269: %
1270: 
1271: Figure~\ref{fig:sdm_count} compares the convergence speed under full concurrency 
1272: and no concurrency.  We omit the curve of half-concurrency since it would have been 
1273: similar to the two other curves.  Full-concurrency 
1274: impacts on the convergence speed very slightly.
1275: %While in the non-concurrent case all messages result
1276: %in an ordering gain, in the concurrent case, some of them may not.
1277: %
1278: %These side-effects are absent in the ranking algorithm (Section~\ref{sec:ranking}) because the 
1279: %information conveyed in message from each node $i$ is altered 
1280: %independently from any message receipt.
1281: %Indeed, either attribute values slightly change over time (if it 
1282: %represents battery, or available storage space), or do not change 
1283: %at all, while an ordering algorithm makes nodes exchange random values
1284: %that can be drastically modified (swapped) upon reception of another message.
1285: 
1286: 
1287: 
1288: \section{Dynamic Ranking by Sampling of Attribute Values}\label{sec:ranking}
1289: \label{dynamicranking}
1290: In this section we propose an alternative algorithm for the distributed slicing
1291: problem. This algorithm circumvents some of the problems identified in
1292: the previous approach by continuously ranking nodes based on observing attribute value
1293: information. 
1294: Random values no longer play a role, so non-perfect uniformity in the random value
1295: distribution is no longer a problem.
1296: Besides, this algorithm
1297: is not sensitive to churn even if it is correlated with attribute values. 
1298: 
1299: In the remaining part of the report we refer to this new algorithm 
1300: as the ranking algorithm while referring to JK and mod-JK as the ordering algorithms.
1301: Here, we elaborate on the drawbacks arising from the ordering algorithms 
1302: relying on the use of random values that are solved by the ranking approach.
1303: 
1304: \paragraph{Impact of attribute correlated with dynamics.}
1305: As already mentioned, the ordering algorithms
1306: rely on the fact that random values are uniformly distributed.
1307: However, if the attribute values are not constant but correlated with
1308: the dynamic behavior of the system, the distribution of random values may change
1309: from uniform to  skewed quickly.
1310: For instance, assume that each node maintains an attribute
1311: value that represents its own lifetime. 
1312: Although the algorithm is able to quickly sort random values, so
1313: nodes with small lifetime will obtain the small random values, it
1314: is more likely that these nodes leave the system sooner than other nodes.
1315: This results in
1316: a higher concentration of high random values and a large population of the nodes
1317: wrongly estimate themselves as being part of the higher slices.
1318: %Similar situations arise when
1319: %attribute values represent other dynamic features such as
1320: %the remaining storage space of a node, etc.
1321: 
1322: \paragraph{Inaccurate slice assignments.}
1323: 
1324: As discussed in previous sections in detail, slice assignments will
1325: typically be imperfect even when the random values are perfectly
1326: ordered.
1327: Since the ranking approach does not rely on ordering random nodes,
1328: this problem is not raised: the algorithm guarantees eventually
1329: perfect assignment in a static environment.
1330: 
1331: \paragraph{Concurrency side-effect.}
1332: In the previous ordering algorithms, a non negligible amount of messages are sent unnecessarily.
1333: The concurrency of messages has a drastic effect on the number of useless messages
1334: as shown previously, slowing down convergence.
1335: In the ranking algorithm
1336: concurrency has no impact on convergence speed because all 
1337: received messages are taken in account. 
1338: This is because the information encapsulated in a message (the attribute value of a node)
1339: is guaranteed to be up to date, as long as  the attribute values are constant,
1340: or at least change slowly.
1341: 
1342: \subsection{Ranking Algorithm Specification}
1343: 
1344: The pseudocode of the ranking algorithm is presented in Figure~\ref{alg:attr}.
1345: As opposed to the ordering algorithm of the previous section, the ranking
1346: algorithm does not assign random initial unalterable values as candidate ranks.
1347: Instead, the ranking algorithm improves its rank estimate each
1348: time a new message is received.
1349: 
1350: The ranking algorithm works as follows.
1351: Periodically each node $i$ updates its view ${\mathcal N}_i$ following 
1352: an underlying protocol that provides a uniform random sample (Line~\ref{N02}); later, we simulate the algorithm using
1353: the variant of Cyclon protocol presented in Section~\ref{sec:modifiedjk}.
1354: Node $i$ computes its rank estimate (and hence its slice)
1355: by comparing the attribute value of its neighbors to its own attribute value.
1356: This estimate is set to the ratio of the number of nodes with a lower 
1357: attribute value that $i$ has seen over the total number of nodes $i$ has seen (Line~\ref{N14}).
1358: Node $i$ looks at the normalized rank estimate of all its neighbors.
1359: Then, $i$
1360: selects the node $j_{1}$ closest to a slice boundary (according to the rank 
1361: estimates of its neighbors).
1362: Node $i$ selects also a random neighbor $j_{2}$ among its view (Line~\ref{N11}).
1363: When those two nodes are selected, $i$ sends an update message, denoted by a flag $\lit{UPD}$, 
1364: to $j_{1}$ and $j_{2}$
1365: containing its attribute value (Line~\ref{N12}--\ref{N13}).
1366: 
1367: The reason why a node close to the slice boundary is selected as one of the
1368: contacts is that such nodes need more samples to accurately determine
1369: which slice they belong to (subsection~\ref{sec:analysis2} shows this point).
1370: This technique introduces a bias towards them, so they receive more messages.
1371: 
1372: Upon reception of a message from node $i$, the passive threads of $j_{1}$ and 
1373: $j_{2}$ are activated so that $j_{1}$ and $j_{2}$ compute their new rank estimate 
1374: $r_{j_{1}}$ and $r_{j_{2}}$.  The estimate of the slice a node belongs to,
1375: follows the computation of the rank estimate.
1376: Messages are not replied, communication is one-way, resulting in identical message
1377: complexity to JK and mod-JK.
1378: 
1379: 
1380: %\input{attr.alg}
1381: \begin{figure}
1382: \centering{
1383: \fbox{
1384: \begin{minipage}[t]{150mm}
1385: \footnotesize
1386: \renewcommand{\baselinestretch}{1.5}
1387: \resetline
1388: \begin{tabbing}
1389: aaaaA\=aaaaaA\=aaaaaaA\kill
1390: {\bf Initial state of node $i$} \\
1391: \line{N00} \> $\ms{period}_{i}$, initially set to a constant;
1392: $r_i$, a value in $(0,1]$; \\
1393: $a_i$, the attribute value;
1394: $b$, the closest slice boundary to node $i$; \\
1395: $g_i$, the counter of encountered attribute values;
1396: $l_i$, the counter \\
1397: of lower attribute values;
1398: $\ms{slice}_i \gets \bot$;
1399: ${\mathcal N_i}$, the view.
1400: \\ ~ \\
1401:  
1402: {\bf Active thread at node $i$} \\
1403: 
1404: %% with period, the other nodes must send more frequently
1405: %~(\ref{L02}b)  \> $\act{wait}(\ms{period_i})$ \\
1406: %~(\ref{L02}c)  \> $\act{recompute-view}()_i$ \\
1407: %~(\ref{L02}d)  \> $\ms{dist-min} \gets \infty$ \\
1408: %~(\ref{L03}e)  \> {\bf for} $j' \in {\mathcal N}_i$ \\
1409: %~(\ref{L03}f)    \> \T $g_i \gets g_i + 1$ \\
1410: %~(\ref{L03}g)   \> \T {\bf if} $a_{j'} \leq a_i$ {\bf then} $\ell_i \gets \ell_i + 1$ \\
1411: %~(\ref{L03}h) \> \T {\bf if} $\lit{dist}(a_{j'},b) < \ms{dist-min}$ {\bf then} \\
1412: %~(\ref{L03}i) \> \T \T $\ms{dist-min} \gets \lit{dist}(a_{j'},b)$ \\
1413: %~(\ref{L03}j) \> \T \T $j_{1} \gets j'$ \\
1414: %~(\ref{L03}k) \> {\bf end for} \\
1415: %~(\ref{L03}l) \> Let $j_{2}$ be a random node of ${\mathcal N}_i$ \\
1416: %~(\ref{L04}b)  \> $\act{send}(\lit{UPD},r_i)$ to $j_{1}$ \\
1417: %~(\ref{L04}c)  \> $\act{send}(\lit{UPD},r_i)$ to $j_{2}$ \\
1418: %%~(\ref{L08})  \> $\act{recv}(\lit{ACK},r_j)$ from $j$ \\
1419: %%\line{L13}   \> {\bf if} $a_{j'} \leq a_i$ {\bf then} $\ell_i \gets \ell_i + 1$ \\
1420: %%\line{L14}   \> $g_i \gets g_i + 1$ \\
1421: %~(\ref{L05}d)   \> $r_i \gets \ell_i / g_i$ \\
1422: %~(\ref{L05}e)   \> $\ms{slice} \gets {\cal S}_{l,u}$ such that $l < r_i \leq u$\\
1423: % ~ \\
1424: 
1425: % with period, the other nodes must send more frequently
1426: \line{N01}  \> $\act{wait}(\ms{period_i})$ \\
1427: \line{N02}  \> $\act{recompute-view}()_i$ \\
1428: \line{N03}  \> $\ms{dist-min} \gets \infty$ \\
1429: \line{N04}  \> {\bf for} $j' \in {\mathcal N}_i$ \\
1430: \line{N05}    \> \T $g_i \gets g_i + 1$ \\
1431: \line{N06}   \> \T {\bf if} $a_{j'} \leq a_i$ {\bf then} $\ell_i \gets \ell_i + 1$ \\
1432: \line{N07} \> \T {\bf if} $\lit{dist}(a_{j'},b) < \ms{dist-min}$ {\bf then} \\
1433: \line{N08} \> \T \T $\ms{dist-min} \gets \lit{dist}(a_{j'},b)$ \\
1434: \line{N09} \> \T \T $j_{1} \gets j'$ \\
1435: \line{N10} \> {\bf end for} \\
1436: \line{N11} \> Let $j_{2}$ be a random node of ${\mathcal N}_i$ \\
1437: \line{N12}  \> $\act{send}(\lit{UPD},a_i)$ to $j_{1}$ \\
1438: \line{N13}  \> $\act{send}(\lit{UPD},a_i)$ to $j_{2}$ \\
1439: %~(\ref{L08})  \> $\act{recv}(\lit{ACK},r_j)$ from $j$ \\
1440: %\line{L13}   \> {\bf if} $a_{j'} \leq a_i$ {\bf then} $\ell_i \gets \ell_i + 1$ \\
1441: %\line{L14}   \> $g_i \gets g_i + 1$ \\
1442: \line{N14}   \> $r_i \gets \ell_i / g_i$ \\
1443: \line{N15}   \> $\ms{slice} \gets {\cal S}_{l,u}$ such that $l < r_i \leq u$\\
1444:  ~ \\
1445: 
1446: %{\bf Passive thread at node $i$ activated upon reception} \\
1447: %~(\ref{R01}) \> $\act{recv}(\lit{UPD},a_j)$ from $j$ \\
1448: %%~(\ref{R02}) \> $\act{send}(\lit{ACK},r_i)$ to $j$ \\
1449: %\line{R06}   \> {\bf if} $a_{j} \leq a_i$ {\bf then} $\ell_i \gets \ell_i + 1$ \\
1450: %\line{R07}   \> $g_i \gets g_i + 1$ \\
1451: %\line{R08}   \> $r_i \gets \ell_i / g_i$ \\
1452: %~(\ref{R04}') \> $\ms{slice} \gets {\cal S}_{l,u}$ such that $l < r_i \leq u$
1453: 
1454: {\bf Passive thread at node $i$ activated upon reception} \\
1455: \line{O01} \> $\act{recv}(\lit{UPD},a_j)$ from $j$ \\
1456: %~(\ref{R02}) \> $\act{send}(\lit{ACK},r_i)$ to $j$ \\
1457: \line{O02}   \> {\bf if} $a_{j} \leq a_i$ {\bf then} $\ell_i \gets \ell_i + 1$ \\
1458: \line{O03}   \> $g_i \gets g_i + 1$ \\
1459: \line{O04}   \> $r_i \gets \ell_i / g_i$ \\
1460: \line{O05} \> $\ms{slice} \gets {\cal S}_{l,u}$ such that $l < r_i \leq u$
1461: 
1462: \end{tabbing}
1463: \normalsize
1464: \end{minipage} 
1465: }
1466: \caption{Dynamic ranking by exchange of attribute values.}
1467: \label{alg:attr}
1468: }
1469: \end{figure}
1470: 
1471: \subsection{Theoretical Analysis}\label{sec:analysis2}
1472: 
1473: The following Theorem shows a lower bound on the probability 
1474: for a node $i$ to accurately estimate the slice it belongs to.
1475: %
1476: This probability depends not only on the number of attribute exchanges
1477: but also on the rank estimate of $i$.
1478: 
1479: \begin{theorem}
1480: %Assume that node $i$ with normalized rank $p$, estimated in $\hat{p}$, executes 
1481: %$\left( Z_\frac{\alpha}{2} \frac{ \sqrt{\hat{p}(1-\hat{p})} }{ d } \right)^2$
1482: %steps of the algorithm, where $d$ is the distance between the rank estimate of $i$ and 
1483: %the closest slice boundary, and $Z_\frac{\alpha}{2}$ represents the endpoints of the 
1484: %confidence interval.  
1485: %The slice estimate of $i$ is exact with confidence coefficient of $100(1 - \alpha)\%$.
1486: %
1487: Let $p$ be the normalized rank of $i$ and let $\hat{p}$ be its estimate.
1488: For node $i$ to exactly estimate its slice with confidence coefficient of 
1489: $100(1 - \alpha)\%$, the number of messages $i$ must receive is:
1490: $$\left( Z_\frac{\alpha}{2} \frac{ \sqrt{\hat{p}(1-\hat{p})} }{ d } \right)^2,$$
1491: \noindent
1492: where $d$ is the distance between the rank estimate of $i$ and 
1493: the closest slice boundary, and $Z_\frac{\alpha}{2}$ represents the endpoints of the 
1494: confidence interval.  
1495: \end{theorem}
1496: \begin{proof}
1497: %At each step of the algorithm execution, node $i$ draws uniformly at random 
1498: %a sample in the network.
1499: Each time a node receives a message, it checks whether or not the attribute value
1500: is larger or lower than its own.  
1501: Let $X_1, ..., X_k$ be $k$~$(k>0)$ independent identically distributed random variables described as follows. $X_j = 1$ 
1502: with probability $\frac{i}{n} = p$ (indicating that the attribute value is lower) and 
1503: $j \in \{1, ..., k\}$, otherwise $X_j = 0$ (indicating the attribute value is larger).
1504: %
1505: By the central limit theorem, we assume $k > 30$ and we approximate the distribution of $X = \sum_{j=1}^k X_j$
1506: as the normal distribution. We estimate $X$ by $\hat{X} = \sum_{j=1}^k \hat{X}_j$
1507: and $p$ by $\hat{p} = \frac{\hat{X}}{k}$.
1508: 
1509: We want a confidence coefficient with value $1-\alpha$.
1510: Let $\Phi$ be the standard normal distribution function, and let 
1511: $Z_{\frac{\alpha}{2}}$ be $\Phi^{-1}(1-\frac{\alpha}{2})$.
1512: %
1513: Now, by the Wald large-sample normal test in the binomial case, 
1514: where the standard deviation of $\hat{p}$ is 
1515: $\sigma(\hat{p}) = \frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{k}}$,
1516: we have:
1517: \begin{eqnarray}
1518: \left|\frac{\hat{p}-p}{\sigma(\hat{p})}\right| &\leq Z_{\frac{\alpha}{2}} & \notag \\
1519: \hat{p} - Z_{\frac{\alpha}{2}} \sigma(\hat{p}) &\leq p & \leq \hat{p} + Z_{\frac{\alpha}{2}}\sigma(\hat{p}). \notag
1520: \end{eqnarray}
1521: 
1522: % To obtain probability $1-\alpha$, we must have 
1523: % $$\hat{p} - Z_{\frac{\alpha}{2}} \sqrt{\frac{\hat{p}(1-\hat{p})}{k}} \leq p \leq \hat{p} + 
1524: % Z_{\frac{\alpha}{2}} \sqrt{\frac{\hat{p}(1-\hat{p})}{k}},$$
1525: % %\in (\hat{p}\rip Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{k}})$,
1526: % where $Z_{\frac{\alpha}{2}}$ is obtained from tables 
1527: % ($\alpha = 0,1 \Rightarrow Z_{\frac{\alpha}{2}} \neq 2$; 
1528: % $\alpha = 0,0001 \Rightarrow Z_{\frac{\alpha}{2}} \leq 4$) 
1529: %
1530: Next, assume that $\hat{p}$ falls into the slice $S_{l,u}$, with $l$ and $u$ its lower
1531: and upper boundaries, respectively.
1532: Then, as long as $\hat{p} - Z_\frac{\alpha}{2}\sqrt{\frac{\hat{p}(1-\hat{p})}{k}}
1533: > l$ and
1534: $\hat{p} + Z_\frac{\alpha}{2}\sqrt{\frac{\hat{p}(1-\hat{p})}{k}} \leq u$, the slice estimate 
1535: is exact with a confidence coefficient of $100(1-\alpha)\%$.
1536: Let $d=\min(\hat{p}-l,u-\hat{p})$, then we need
1537: %Assume without loss of generality that $\hat{p}-l \leq u-\hat{p}$, then we need
1538: \begin{eqnarray}
1539: d &\geq& Z_{ \frac{\alpha}{2} } \sqrt{\frac{\hat{p}(1-\hat{p}) }{ k }}, \notag \\
1540: k &\geq& \left( Z_\frac{\alpha}{2} \frac{ \sqrt{ \hat{p}(1-\hat{p}) } }{d}
1541: \right)^2.\notag
1542: \end{eqnarray}
1543: %Consequently, we obtain:
1544: %$$k \geq \left( Z_\frac{\alpha}{2} \frac{ \sqrt{\hat{p}(1-\hat{p})} }{ \min(\hat{p}-l, u-\hat{p})} \right)^2.\Box$$
1545: \end{proof}
1546: 
1547: To conclude, under reasonable assumptions all node estimate its slice with confidence coefficient $100(1 - \alpha)\%$, after a finite number of message receipts. 
1548: Moreover a node closer to the slice boundary needs more messages than a node far from the boundary.
1549: 
1550: \subsection{Simulation Results}
1551: 
1552: This section evaluates the ranking algorithm by focusing on three
1553: different aspects.  
1554: %
1555: First, the performance of the ranking algorithm is
1556: compared to the performance of the ordering algorithm\footnote{We 
1557: omit comparison with JK since the performance obtained with mod-JK are either 
1558: similar or better.} in a large-scale system where the distribution 
1559: of attribute values does not vary over time.
1560: %
1561: Second, we investigate if sufficient uniformity is achievable in reality using a dedicated 
1562: protocol.
1563: %
1564: %study the feasibility of our approach.  
1565: %Each node
1566: %requires to be associated with a random set of neighbors, chosen 
1567: %at uniform. 
1568: %We present a view management protocol that gives 
1569: %performance results very similar to the desired 
1570: %uniform simulation.
1571: %
1572: Third, the ranking algorithm and ordering algorithm are compared in 
1573: a dynamic system where the distribution of attribute values may 
1574: change.
1575: 
1576: %This Section evaluates the impact of dynamics on our algorithms.
1577: %The ranking algorithm presented in Figure~\ref{alg:attr} is compared to 
1578: %the ordering algorithm presented in the previous section.
1579: %
1580: For this purpose, we ran two simulations, one for each algorithms.
1581: The system contains (initially) $10^4$ nodes and each view contains $10$ uniformly drawn random 
1582: nodes and is updated in each cycle.
1583: The number of slices is 100, and we present the evolution of the slice disorder measure
1584: over time.
1585: 
1586: \begin{figure*}
1587:   \begin{center}
1588:     \subfigure[]
1589:     { 
1590:       \label{fig:comparison}    
1591:       \resizebox{2.6in}{!}{\includegraphics[scale=0.5,angle=270,clip=true]{comparison.ps}}
1592:     }
1593:     \subfigure[]
1594:     { 
1595:       \label{fig:convergence3}
1596:       \resizebox{2.6in}{!}{\includegraphics[scale=0.5,angle=270,clip=true]{cyclon.ps}}
1597:     }
1598:     \subfigure[]
1599:     { 
1600:       \label{fig:failures2}
1601:       \resizebox{2.6in}{!}{\includegraphics[scale=0.5,angle=270,clip=true]{sdm_battery_200.ps}}
1602:     }
1603:     \subfigure[]
1604:     {
1605:     \label{fig:failures1}
1606:           \resizebox{2.6in}{!}{\includegraphics[scale=0.5,angle=270,clip=true]{sdm_failure01.ps}}
1607:     }
1608:     \caption{
1609:        (a) Comparing performance of the ordering algorithm and the ranking algorithm (static case).
1610:        (b) Comparing the ranking algorithm on top of a uniform drawing or a Cyclon-like protocol.
1611:        (c) Effect of dynamics burst on the convergence of the ordering algorithm and the ranking algorithm.
1612:        (d) Effect of a low and regular churn on the convergence of the ordering algorithm and the ranking algorithm.
1613:        }
1614:   \end{center}
1615: \end{figure*}
1616: 
1617: \subsubsection{Performance Comparison in the Static Case}
1618: Figure~\ref{fig:comparison} compares the ranking algorithm to the ordering algorithm
1619: while the distribution of attribute values do not change over time 
1620: (varying distribution is simulated below).
1621: 
1622: The difference between the ordering algorithm and the ranking algorithm
1623: indicates that the ranking algorithm gives a more precise result (in terms of node to 
1624: slice assignments) than the ordering algorithm.  
1625: More importantly, the slice disorder measure obtained by the ordering algorithm is
1626: lower bounded while the one of the ranking algorithm is not.
1627: Consequently, this simulation shows that the ordering algorithm might fail in 
1628: slicing the system while the ranking algorithm keeps improving its accuracy over 
1629: time.
1630: 
1631: \subsubsection{Feasibility of the Ranking Algorithm}
1632: %Figure~\ref{fig:convergence3} shows that even though the underlying membership
1633: %protocol might impact on the convergence time of the ordering and ranking algorithms,
1634: %it does not impact on their performance in terms of precision. 
1635: Figure~\ref{fig:convergence3} shows that the ranking algorithm does not need artificial
1636: uniform drawing of neighbors.  Indeed, an underlying view management protocol might 
1637: lead to similar performance results.
1638: %
1639: %the ordering as the ranking algorithm
1640: %present performance independent from the underlying.  First this shows that the result
1641: %achieve by the ordering algorithm is limited regardless the underlying
1642: %sampling protocol.  Second, this shows that the performance in term of 
1643: %precision of the ranking algorithm are independent
1644: %
1645: %shows that our approach is realistic.  
1646: %In the previous simulations the neighbors chosen for communication exchanges
1647: %were selected uniformly at random among all nodes in the system.  
1648: %Here, we simulate the ranking algorithm using an underlying view exchange 
1649: %protocol, called Cyclon~\cite{VGS05}.
1650: In the presented simulation we used an artificial protocol, drawing neighbors
1651: randomly at uniform in each cycle of the algorithm execution, and the variant
1652: of the Cyclon~\cite{VGS05} view management protocol presented above. 
1653: Those underlying protocols are distinguished on the figure using terms 
1654: "uniform" (for the former one) and "views" (for the later one).
1655: %
1656: As said previously, the Cyclon protocol consists of exchanging views 
1657: between neighbors such that 
1658: the communication graph produced shares similarities with a random graph.
1659: This figure shows that both cases give very similar results.  
1660: The SDM legend is on the right-handed vertical axis while
1661: the left-handed vertical axis indicates what percentage the SDM difference
1662: represents over the total SDM value.  At any time during the simulation 
1663: (and for both type of algorithms) its 
1664: value remains within plus or minus $7\%$.
1665: The two SDM curves of the ranking algorithm almost overlap.  
1666: Consequently, the ranking algorithm and the variant of Cyclon 
1667: presented in subection~\ref{sec:modifiedjk} achieve very similar result.
1668: %
1669: %Surprisingly however, the variant of Cyclon as an underlying algorithm 
1670: %leads seemingly to better performance than drawing the samples uniformly at random.
1671: %In a recent work~\cite{I05} about the experimental analysis of Cyclon compare to a random graph, 
1672: %the clustering coefficient provided by the communication graph
1673: %of Cyclon is slightly smaller than the clustering coefficient provided by the random 
1674: %graph. Based on this observation, it is reasonable to assume that new nodes are more 
1675: %likely discovered with Cyclon than a random graph, and this might explain this 
1676: %improvement.
1677: 
1678: To conclude, the variant of Cyclon algorithm presented in the previous section can be used 
1679: with the ranking algorithm to provide the shuffling of views.
1680: %despite this observation the similarity of the use of Cyclon and the use 
1681: %of randomly drawing shows the feasibility of our ranking algorithms in reality.
1682: 
1683: \subsubsection{Performance Comparison in the Dynamic Case}
1684: 
1685: 
1686: In Figure~\ref{fig:failures2} each of the two curves represents the 
1687: slice disorder measure obtained over time using the ordering algorithm and 
1688: the ranking algorithm respectively.  
1689: We simulate the churn such that 0.1\% of nodes leave and 0.1\% of the nodes 
1690: join in each cycle during the 200 first cycles. We observe how the SDM converges. 
1691: The churn is reasonably and pessimistically tuned compared to recent
1692: experimental evaluations~\cite{SR06} of the session duration in three well-known P2P 
1693: systems.\footnote{In~\cite{SR06}, roughly all nodes have left the system after 1 day while there are still 
1694: 50\% of nodes after 25 minutes. In our case, assuming that in average a cycle lasts 
1695: one second would lead to more than 54\% of leave in 9 minutes.}
1696: 
1697: The distribution of the churn is correlated to the attribute value of the nodes. 
1698: The leaving nodes are the 
1699: nodes with the lowest attribute values while the entering nodes have higher attribute 
1700: values than all nodes already in the system.  The parameter choices are motivated by 
1701: the need of simulating a system in which the attribute value corresponds to the 
1702: session duration of nodes, for example.
1703: 
1704: The churn introduces a significant disorder in the system which 
1705: counters the fast decrease.  When, the churn stops, the ranking algorithm readapts 
1706: well the slice assignments: the SDM starts decreasing again.  However, 
1707: in the ordering algorithm, the convergence of SDM gets stuck. 
1708: This leads to a poor slice assignment accuracy.
1709: 
1710: In Figure~\ref{fig:failures1}, 
1711: each of the two curves represent the slice disorder measure obtained over time
1712: using the ordering algorithm, the ranking algorithm, and a modified version of the ranking
1713: algorithm using attribute values recorded in a sliding-window, respectively.
1714: (The simulation obtained using sliding windows is described in the next subsection.)
1715: The churn is diminished and made more regular than in the previous simulation such that
1716: 0.1\% of nodes leave and 0.1\% of nodes join every 10 cycles.
1717: %Recall that the slice
1718: %disorder measure (SDM) represents the sum for all nodes of the distance between the 
1719: %slice the node thinks it belongs to and the slice it really belongs to.
1720: 
1721: The curves fits a fast decrease (superlinear in the number of cycles) at the 
1722: beginning of the simulation.
1723: At first cycles, the ordering gain is significant making the impact of churn
1724: negligible.  This phenomenon is due to the fact that SDM decreases rapidly 
1725: when the system is fully disordered.  Later on, however, the decrease slope 
1726: diminishes and the churn effect reduces the amount of nodes with a low attribute 
1727: value while increasing the amount of nodes with a large attribute value.
1728: This unbalance leads to a messy slice assignment, that is, each node must quickly
1729: find its new slice to prevent the SDM from increasing.
1730: %
1731: In the ordering algorithm the SDM starts increasing from cycle 120. Conversely, 
1732: with the ranking algorithm the SDM starts increasing not earlier than at cycle 730.
1733: Moreover the increase slope is much larger in the former algorithm than in the latter 
1734: one.
1735: 
1736: Even though the performance of the ranking algorithm are really significant, its 
1737: adaptiveness to churn is not surprising.  Unlike the ordering algorithm, the 
1738: ranking one keeps re-estimating the rank of each node depending on the 
1739: attribute values present in the system.
1740: Since the churn increases the attribute values present in the 
1741: system, nodes tend to receive more messages with higher attribute values and 
1742: less messages with lower attribute values, which turns out to keep the SDM 
1743: low, despite churn. Further on, we propose a solution  based on 
1744: sliding-window technique to limit the increase of the SDM in the ranking 
1745: algorithm.
1746: 
1747: To conclude, the results show that when the churn is related to the attribute 
1748: (e.g., attribute represents the session duration, uptime of a node), 
1749: then the ranking algorithm is better suited than the ordering algorithm.  
1750: 
1751: %In Figure~\ref{fig:failures1}, %simulates 10,000 nodes using Peersim simulator.
1752: %each of the two curves represent the slice disorder measure obtained over time
1753: %using the ordering algorithm, the ranking algorithm, and a modified version of the ranking
1754: %algorithm using attribute values recorded in a sliding-window, respectively.
1755: %(The simulation obtained using sliding windows is described in the next subsection.)
1756: %Recall that the slice
1757: %disorder measure (SDM) represents the sum for all nodes of the distance between the 
1758: %slice the node thinks it belongs to and the slice it really belongs to.
1759: %
1760: %A churn applied to the system is regular (1\% every period of 100 cycles), while its 
1761: %distribution is correlated to the attribute value of the nodes. More precisely, each 
1762: %10 cycles, a node leaves the system while another joins it.  The leaving node is the 
1763: %node with the lowest attribute value while the entering node has a higher attribute 
1764: %value than all nodes
1765: %already in the system.  The parameter choices are motivated by the need of simulating
1766: %a system in which the attribute value corresponds to the remaining battery power or 
1767: %session duration of nodes.
1768: %
1769: %The curves fits a fast decrease (superlinear in the number of cycles) at the 
1770: %beginning of the simulation.
1771: %At first cycles, the ordering gain is significant making the impact of churn
1772: %negligible.  This phenomenon is due to the fact that SDM decreases rapidly 
1773: %when the system is fully disordered.  Later on, however, the decrease slope 
1774: %diminishes and the churn effect reduces the amount of nodes with a low attribute 
1775: %value while increasing the amount of nodes with a large attribute value.
1776: %This unbalance leads to a messy slice assignment, that is, each node must quickly
1777: %find its new slice to prevent the SDM from increasing.
1778: %%
1779: %In the ordering algorithm the SDM starts increasing from cycle 120. Conversely, 
1780: %with the ranking algorithm the SDM starts increasing not earlier than at cycle 730.
1781: %Moreover the increase slope is much larger in the former algorithm than in the latter 
1782: %one.
1783: %
1784: %Even though the performance of the ranking algorithm are really significant, its 
1785: %adaptiveness to churn is not surprising.  Unlike the ordering algorithm, the 
1786: %ranking one keeps re-estimating the rank of each node depending on the 
1787: %attribute values present in the system.
1788: %Since the churn increases the attribute values present in the 
1789: %system, nodes tend to receive more messages with higher attribute values and 
1790: %less messages with lower attribute values, which turns out to keep the SDM 
1791: %low, despite churn. Further on, we propose a solution  based on 
1792: %sliding-window technique to limit the increase of the SDM in the ranking 
1793: %algorithm.
1794: %
1795: %In Figure~\ref{fig:failures2} the parameters of the simulator are the same as above
1796: %except that the churn is briefer and much higher.  At each step during the first 200 cycles, 
1797: %10 nodes leave while 10 new nodes join the system.  The 10 leaving nodes have the 
1798: %10 lowest attribute values while the 10 arriving nodes obtain the 10 highest attribute values.
1799: %
1800: %The result shows that when the churn is related to the attribute (e.g., attribute represents
1801: %the remaining session duration, uptime of a node), then the ranking algorithm is better suited
1802: %than the ordering algorithm.  
1803: %In this case, small random values 
1804: %tends to disappear under the churn effect.
1805: %Consequently, the slices 
1806: %of lowest attribute values are progressively emptied and slices with highest attribute
1807: %values are progressively filled.
1808: %Since the rank estimates are constantly re-estimated
1809: %in the ranking algorithm, this algorithm still converges despite churn.  
1810: %Conversely, since all candidate ranks are initially randomly chosen once for 
1811: %all in the ordering algorithm, then the convergence gets stuck.
1812: 
1813: \subsubsection{Sliding-window for Limiting the SDM Increase}
1814: 
1815: In Figure~\ref{fig:failures1}, the "sliding-window" curve 
1816: presents a slightly modified version of the ranking algorithm
1817: that encompasses SDM increase due to churn correlated to attribute 
1818: values.  Here, we present this enrichment.
1819: 
1820: In Section~\ref{sec:ranking}, the ranking algorithm specifies that
1821: each node takes into account all received messages.
1822: More precisely, upon reception of a new message each node $i$ re-computes
1823: immediately its rank estimate and the slice it thinks it belongs to 
1824: without remembering the attribute values it has seen.
1825: Consequently the messages
1826: received long-time ago have as much importance as the fresh messages
1827: in the estimate of $i$.
1828: The drawback, as it appeared in Figure~\ref{fig:failures1} of Section~\ref{sec:simu2}, is that if  
1829: the attribute values are correlated to churn, then the precision of the algorithm might diminish.
1830: 
1831: To cope with this issue, the previous algorithm can be easily enriched
1832: in the following way.  Upon reception of a message, each node $i$
1833: records an information about the attribute value received in a fixed-size ordered set 
1834: of values.  Say this set is a first-in first-out buffer such that
1835: only the most recent values remain.
1836: Right after having recorded this information, node $i$ can re-compute 
1837: its rank estimate and its slice estimate based on the most relevant 
1838: piece of information (having discarded the irrelevant piece). 
1839: Consequently, the estimate would rely only on fresh attribute values 
1840: encountered so that the algorithm would be more tolerant to changes 
1841: (e.g., dynamics or non-uniform evolution of attribute values).
1842: %
1843: Of course, since the analysis (cf. Section~\ref{sec:analysis2}) shows that nodes close to 
1844: the slice boundary 
1845: require a large number of attribute values for estimating precisely their 
1846: estimates, it would be unaffordable to record all these last attribute 
1847: values encountered due to space limitation.
1848: 
1849: Actually, the only necessary relevant information of a message
1850: is simply whether it contains a lower attribute value than the attribute
1851: value of $i$, or not.  Consequently, a single bit per message would be sufficient 
1852: to record the necessary information (e.g., adding a 1 meaning that the attribute value
1853: is lower, and 0 otherwise).  Thus, even though a node $i$ would require 
1854: $10^4$ messages to rightly estimate its slice (with high probability), 
1855: node $i$ simply needs to allocate an array of size 
1856: %$\lceil 10^4/(8*1024)\rceil =  1,23$ KB.
1857: $10^4/(8*1000) =  1,25$ kB.
1858: 
1859: As expected, Figure~\ref{fig:failures1} shows that the sliding-window method
1860: applied to the ranking algorithm prevents its SDM from increasing.  Consequently, at some
1861: point in time, the resulting slice assignment may become even more accurate.
1862: 
1863: \section{Conclusion}
1864: \label{conclusion}
1865: 
1866: \subsection{Summary}
1867: Peer to peer systems may now be turned into general frameworks on top of which 
1868: several applications might cohabit. To this end, allocating resources to applications,
1869: according to their needs require specific algorithms to partition the network in a relevant
1870: way. The ordered slicing algorithm proposed in \cite{JK06} provided a first attempt to 
1871: ``slice'' the network, taking into account the potential heterogeneity of nodes. This algorithm
1872: relies on each node drawing a random value uniformly and swapping continuously those random values,
1873:  with candidate
1874: nodes, so that the order between attributes values (reflecting the capabilities of nodes) and random ones
1875: match. Results from \cite{JK06} have shown that slices can be maintained efficiently and 
1876: in large-scale systems even in the presence of churn. 
1877: 
1878: In this report, we first proposed an improvement over the initial sorting algorithm based on a judicious
1879: choice of candidate nodes to swap values. This is based on each node being able to estimate locally
1880: the potential decrease of the global disorder measure. We provided an analysis along with some simulation
1881: results showing that the convergence speed is significantly improved.
1882: We then identified two issues related to the use of static random values. 
1883: The first one refers to the fact that slice assignment heavily depends on 
1884: the degree of uniformity of the initial random value. 
1885: 
1886: The second is related to the fact that
1887: once sorted along one attribute axis, the churn (or failures) might be correlated to the
1888: attribute, therefore leading to a unrecoverable skewed distribution of the 
1889: random values resulting in a wrong slice assignment.
1890: Our second contribution is an algorithm enabling nodes to continuously
1891: re-estimate their rank relatively to other nodes based on their sampling of
1892: the network. 
1893: 
1894: \subsection{Perspective}
1895: %\subsection{Cyclon as an Underlying Sampling Algorithm}
1896: %This report presents two algorithms relying on a variant of Cyclon to benefit from
1897: %the distribution quasi-uniform of the neighbors.
1898: %The simulation with the variant of Cyclon shows that the two algorithms 
1899: %are independent from the underlying physical overlay.
1900: %More technically, this shows the feasibility of our approach by providing 
1901: %the distribution requirement of the samples.
1902: %
1903: %Nevertheless, previous simulation surveys have shown that Cyclon might suffer
1904: %from a lack of robustness in face of massive departures compared
1905: %to other gossip-based approach such as Newscast.  This report did not
1906: %simulate the behavior of Cyclon under such circumstances.  Despite the 
1907: %interest of simulating the ranking algorithm on top of Cyclon and in 
1908: %face of massive failures, such extensive experiments were out of the 
1909: %scope of this report and could be an interesting future work.
1910: 
1911: 
1912: This report used a variant of the Cyclon protocol to obtain quasi-uniform
1913: distribution of neighbors.  There are various protocols that might be used 
1914: for different purpose.  For instance, Newscast can be used for its resilience
1915: to very high dynamics as in~\cite{JK06}.  Some other protocols exist 
1916: in the literature.  Deciding exactly how to parameterize the underlying 
1917: peer sampling service might be an interesting future direction.
1918: 
1919: \subsection*{Acknowledgment}
1920: We wish especially to thank M\'ark Jelasity for the fruitful discussions we 
1921: had and the time he spent improving this report. 
1922: We are also thankful to Spyros Voulgaris for having kindly shared his work 
1923: on the Cyclon development. The work of A.  Fern\'andez and E. Jim\'enez was partially 
1924: supported by the Spanish MEC under grants TIN2005-09198-C02-01, 
1925: TIN2004-07474-C02-02, and TIN2004-07474-C02-01, and the 
1926: Comunidad de Madrid under grant S-0505/TIC/0285. The work of A.  Fern\'andez was done 
1927: while on leave at IRISA, supported by the Spanish MEC under grant 
1928: PR-2006-0193.
1929: 
1930: 
1931: \bibliographystyle{plain}
1932: \bibliography{slice}
1933: 
1934: \end{document}
1935: 
1936: