0810.1571/analysis.tex
1: \section{An Analytical Model of Information Dissemination}
2: \label{sec:analysis}
3: 
4: We analyse dissemination of a generic item $d$ in a network in which the nodes execute the shuffling protocol.
5: 
6: 
7: \subsection{Probabilities of state transitions}
8: 
9: \begin{wrapfigure}[17]{r}{55mm}
10:  \begin{center}
11:   \vspace{-30pt}
12: \includegraphics{figs/all.eps} 
13:  \end{center}
14:   \vspace{-20pt}
15:  \caption{Symbolic representation for caches of gossiping nodes.}
16:  \label{fig:allsc}
17: \end{wrapfigure}
18: We present a model of the shuffle protocol that captures the presence or absence of a generic item $d$ after a shuffle between two nodes $A$ and $B$. There are four possible states of the caches of $A$ and $B$ before the shuffle: both hold $d$, only $A$'s cache holds $d$, only $B$'s cache holds $d$, or neither cache holds $d$.
19: 
20: We use the notation $P(a_2 b_2|a_1 b_1)$ for the probability that a shuffle takes state $a_1 b_1$ to state $a_2 b_2$, with $a_i,b_i\in\{0,1\}$. The indices $a_1$, $a_2$ and $b_1$, $b_2$ indicate the presence (if equal to $1$) or the absence (if equal to $0$) of a generic item $d$ in the cache of the initiator $A$ and of the contacted node $B$, respectively. For example, $P(01|10)$ is the probability that only node $A$ holds $d$ before the shuffle and only node $B$ holds it afterwards. Due to the symmetry of information exchange between nodes $A$ and $B$ in the shuffle protocol, $P(a_2 b_2|a_1 b_1)=P(b_2 a_2|b_1 a_1)$.
21: 
22: Fig.~\ref{fig:allsc} depicts all possible outcomes for the caches of gossiping nodes as a state transition diagram. If before the exchange $A$ and $B$ do not have $d$ ($a_1 b_1 = 00$), then clearly after the exchange $A$ and $B$ still do not have $d$ ($a_2 b_2 = 00$). Otherwise, if $A$ or $B$ has $d$ ($a_1 = 1 \vee b_1 = 1$), the shuffle protocol guarantees that after the exchange $A$ or $B$ still has $d$ ($a_2 = 1 \vee b_2 = 1$). Therefore, the state $(-,-)$ has a self-transition, and no other outgoing or incoming transitions.
23: 
24: We determine values for all probabilities $P(a_2 b_2|a_1 b_1)$. They are expressed in terms of the probabilities $\probsn$ and $\probdn$. The probability $\probsn$ is the chance that an item is selected by a node from its local cache when engaged in an exchange. The probability $\probdn$ is the probability that an item which can be overwritten (meaning it is in the exchange buffer of its node, but not in that of the other node in the shuffle) is indeed overwritten by an item received by its node in the shuffle. Due to the symmetry of the protocol, these probabilities are the same for both initiating and contacted nodes. In Sec.~\ref{sec:expl}, we will calculate $\probsn$ and $\probdn$. We write $\probnotsn$ for $1-\probsn$ and $\probkn$ for $1-\probdn$.
25: 
26: \begin{scenario}[$a_1 b_1 =00$] Before shuffling, neither node $A$ nor node $B$ has $d$ in its cache.
27: \vspace{-2mm}
28: \begin{description}
29:  \item[$a_2 b_2=00$:] neither node $A$ nor node $B$ has item $d$ after the shuffle, because neither of them had it in its cache before the shuffle: $P(00|00)=1$
30: \item[$a_2 b_2 \in \{01,10,11\}$:] cannot occur, because neither node has item $d$.
31:  \end{description} 
32: \end{scenario}
33: 
34: \begin{scenario}[$a_1 b_1 =01$] Before shuffling, a copy of $d$ is only in the cache of node $B$.
35: \vspace{-2mm}
36: \begin{description}
37:  \item[$a_2 b_2=01$:] only node $B$ has $d$ because node $B$ did not select $d$ (to send); hence $A$ did not receive $d$ and $B$ did not overwrite it, i.e. the probability is $P(01|01)= \probnotsn$
38:  \item[$a_2 b_2=10$:] only node $A$ has $d$ because node $B$ selected $d$ and dropped it; that is, the probability is $P(10|01)=\probsn \cdot \probdn$
39:  \item[$a_2 b_2=11$:] both nodes $A$ and $B$ have a copy of $d$ because node $B$ selected $d$ and kept it; that is, \(P(11|01)=\probsn \cdot \probkn\)
40:  \item[$a_2 b_2=00$:] cannot occur, as completely discarding $d$ is not possible in the protocol; that is, if either node sends an item, its partner keeps a copy of it, and if an item is not among those selected for a shuffle, it is not replaced by another one (see Sec.~\ref{sec:alg}).
41: \end{description}
42: \end{scenario}
43: 
44: \begin{scenario}[$a_1 b_1=10$] Before shuffling, $d$ is only in the cache of node $A$.
45: Due to the symmetry of nodes $A$ and $B$, this scenario is symmetric to the previous one with $P(a_2 b_2|10)=P(b_2 a_2|01)$.
46: \end{scenario}
47: 
48: \begin{scenario}[$a_1 b_1=11$] Before shuffling, $d$ is in the cache of node $A$ as well as in the cache of node $B$.
49: \vspace{-1.5mm}
50: \begin{description}
51:  \item[$a_2 b_2=01$:] only node $B$ has $d$ because node $A$ selected $d$ and dropped it and node $B$ did not select $d$; that is, \( P(01|11)= \probsn \cdot \probdn \cdot \probnotsn \)
52:  \item[$a_2 b_2=10$:] this outcome is symmetric to the previous one:
53: \( P(10|11) = \probnotsn \cdot \probsn \cdot \probdn\)
54:  \item[$a_2 b_2=11$:] after the shuffle both nodes $A$ and $B$ have $d$, because:
55:  \subitem nodes $A$ and $B$ had $d$ but neither selected it, i.e. $\probnotsn \cdot \probnotsn$;
56:  \subitem both nodes $A$ and $B$ selected $d$ (thus, both kept it), i.e. $\probsn \cdot \probsn$;
57:  \subitem node $A$ selected $d$ and kept it and node $B$ did not select $d$: $\probsn \cdot \probkn \cdot \probnotsn$;
58:  \subitem the case symmetric to the previous one: $\probnotsn \cdot \probsn \cdot \probkn$.
59: 
60: \noindent Thus, \(P(11|11)=\probnotsn \cdot \probnotsn + \probsn \cdot \probsn + 2 \cdot \probsn \cdot \probnotsn \cdot \probkn \)
61: \item[$a_2 b_2=00$:] cannot occur, because discarding an item completely is not permitted by the protocol (see Sec.~\ref{sec:alg}).
62: \end{description}
63: \end{scenario}
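
As a consistency check on the scenarios above (a short verification using only $\probnotsn = 1-\probsn$ and $\probkn = 1-\probdn$), the outgoing probabilities of each state sum to one:
\[
\begin{array}{rcl}
P(01|01)+P(10|01)+P(11|01) &=& \probnotsn + \probsn \cdot \probdn + \probsn \cdot \probkn ~=~ \probnotsn + \probsn ~=~ 1, \vspace{2mm}\\
P(01|11)+P(10|11)+P(11|11) &=& 2 \cdot \probsn \cdot \probnotsn \cdot (\probdn + \probkn) + \probnotsn \cdot \probnotsn + \probsn \cdot \probsn ~=~ (\probsn + \probnotsn)^2 ~=~ 1.
\end{array}
\]
The state $10$ follows from the state $01$ by symmetry.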
64: 
65: \subsection{Probabilities of selecting and dropping an item}
66: \label{sec:expl}
67: 
68: The following analysis assumes that all node caches are full (that is, the network has already been running for a while). Moreover, we assume a uniform distribution of items over the network; this assumption is supported by experiments in \cite{GVS06,steen2007.15}.
69: 
70: Consider nodes $A$ and $B$ engaged in a shuffle, and let $B$ receive the exchange buffer $S_A$ from $A$. Let $k$ be the number of duplicates (see Fig.~\ref{fig:schemenc}), i.e. the number of items in the intersection $S_A \cap C_B$ of the node cache $C_B$ and the exchange buffer $S_A$ of its gossiping partner. Recall from Sec.~\ref{sec:assum} that $C_A$ and $C_B$ contain the same number of items for all $A$ and $B$, and likewise for $S_A$ and $S_B$; we use $c$ and $s$ for these values. The total number of different items in the network is denoted by $n$.
71: 
72: \begin{figure}[!hptb]
73: \begin{minipage}[b]{0.5\linewidth}
74: \centering
75: \includegraphics[]{figs/kset}
76: \caption{$k$ items in $S_A \cap C_B$}
77: \label{fig:schemenc}
78: \end{minipage}%
79: \begin{minipage}[b]{0.5\linewidth}
80: \centering
81: \includegraphics[]{figs/shatset}
82: \caption{$\sasb$ items in $S_A \cap S_B$}
83: \label{fig:sasbschema}
84: \end{minipage}
85: \end{figure}
86: 
87: The probability of selecting an item $d$ in the cache is the probability of a single selection trial (i.e. $\frac{1}{c}$) times the number of selections (i.e. $s$): \( \probsn = \probs \). 
88: Thus, the probability that an item $d$ in the cache is not selected is: \( \probnotsn = 1 - \probsn = \frac{c-s}{c} \). 
89: 
90: Consider Figs.~\ref{fig:schemenc} and~\ref{fig:sasbschema}. The shuffle protocol demands that all items in $S_A$ are kept in $C_B$ after the shuffle. This implies that: a) all items in $S_A \sminus C_B$ will overwrite items in $S_B \subseteq C_B$, and b) all items in $S_A \cap C_B$ are kept in $C_B$. Thus, the probability that an item from $S_B$ will be overwritten is determined by the probability that an item from $S_A$ is in $C_B$, but not in $S_B$. Namely, the items in $S_B \sminus S_A$ provide space in the cache for the items from $S_A \sminus C_B$. We want to express the probability $\probdn$ that a selected item $d$ in $S_B \sminus S_A$ (or $S_A \sminus S_B$) is overwritten by another item in $C_B$ (or $C_A$). Due to symmetry, this probability is the same for $A$ and $B$; therefore, we only calculate the probability that an item in $S_B \sminus S_A$ is dropped from $C_B$. The expected value of this probability depends on how many duplicates a node receives from its gossiping partner:
91: \[E[\probdn] =  \begin{cases}
92: \bigsum{k=0}{s}(\probdwk \cdot \probok) &\text{if } s + c \loeq n\\
93: \bigsum{k=(s+c)-n}{s}(\probdwk \cdot \probok) &\text{otherwise}
94: \end{cases}
95: \]
96: where $\probok$ is the probability of having exactly $k$ items in $S_A \cap C_B$, and $\probdwk$ is the probability that an item in $S_B \sminus S_A$ is dropped from $C_B$ given $k$ duplicates in $S_A\cap C_B$. The case distinction is needed because, if $s + c > n$, there are at least $(s+c) -n$ items in $S_A \cap C_B$.
97: 
98: From the $\binom{n}{s}$ possible sets $S_A$, we compute how many have $k$ items in common with $C_B$. Firstly, there are $\binom{c}{k}$ ways to choose $k$ such items in $C_B$. Secondly, there are $\binom{n-c}{s-k}$ ways to choose the remaining $s-k$ items outside $C_B$. So in total, $\binom{c}{k} \cdot \binom{n-c}{s-k}$ possible sets $S_A$ have $k$ items in common with $C_B$. Hence, under the assumption of a uniform distribution of the data items over the caches of the nodes,\footnote{Here we use a generalization of the usual definition of binomial coefficients to negative integers. That is, for all $m$ and $l \geq 0$, $\binom{m}{l} = (-1)^l \binom{-m+l-1}{l}$ (cf.~\cite{HHP97})}
99: \(\probok = \binom{c}{k} \frac{\binom{n-c}{s-k}}{\binom{n}{s}}\).
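For illustration, take the hypothetical values $n=6$, $c=3$ and $s=2$ (chosen only to keep the arithmetic small; they are not taken from the experiments). Then $\binom{n}{s}=15$, and the probabilities of $k=0,1,2$ duplicates are
\[
\begin{array}{lcl}
k=0: & \binom{3}{0} \frac{\binom{3}{2}}{\binom{6}{2}} ~=~ \frac{3}{15}, \vspace{1mm}\\
k=1: & \binom{3}{1} \frac{\binom{3}{1}}{\binom{6}{2}} ~=~ \frac{9}{15}, \vspace{1mm}\\
k=2: & \binom{3}{2} \frac{\binom{3}{0}}{\binom{6}{2}} ~=~ \frac{3}{15},
\end{array}
\]
which sum to one, as they should for a probability distribution over $k$.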
100: %
101: The expected value of $\probdwk$ is:
102: \[E[\probdwk] =  \begin{cases}
103: \bigsum{\sasb=0}{k}\probdwsh \cdot \probosh &\text{ if } s + k \loeq c\\
104: \bigsum{\sasb=(s+k)-c}{k}\probdwsh \cdot \probosh &\text{otherwise}
105: \end{cases}
106:  \]
107: where $\sasb$ is the number of items in $S_A \cap S_B$ (see Fig.~\ref{fig:sasbschema}). The case distinction is needed because, if $s + k > c$ (with $k$ the number of items in $S_A \cap C_B$), there are at least $(s+k) - c$ items in $S_A \cap S_B$.
108: 
109: Among the $s$ items in $S_B$, there are $\sasb$ items also in $S_A$, and thus only the $s - \sasb$ items in $S_B \sminus S_A$ can be dropped from $C_B$. $\probdwsh$ is the probability that an item in $S_B \sminus S_A$ is dropped from $C_B$, given $\sasb$ items in $S_A \cap S_B$:
110: \[\probdwsh =  \begin{cases}
111: 0 &\text{if } s = \sasb \\
112: \frac{s-k}{s-\sasb} &\text{otherwise}
113: \end{cases}
114: \]
115: %
116: $\probosh$ is the probability of having exactly $\sasb$ items in $S_A\cap S_B$: %\footnotemark[\value{footnote}]
117: \(E[\probosh] = \binom{s}{\sasb} \frac{\binom{c-s}{k-\sasb}}{\binom{c}{k}}\).
118: The intuition behind this expected value of $\probosh$ is similar to the one of $\probok$. From the $\binom{c}{k}$ possible placements of the $k$ duplicates $S_A \cap C_B$ within $C_B$, we compute how many have $\sasb$ items in common with $S_B$. That is, there are $\binom{s}{\sasb}$ ways to choose $\sasb$ items in $S_B$, and $\binom{c-s}{k-\sasb}$ ways to choose the remaining $k-\sasb$ items outside $S_B$.
119: 
120: Assume $2s \leq c \leq n-s$ (so that $s+c \leq n$ and $s+k \leq 2s \leq c$). Then, substituting in the expression for $E [\probdn]$ in the case $s+c \leq n$, and noting that for the summand $k=s$ the factor $\probdwshp{s}$ is equal to zero, we get:
121: %{\allowdisplaybreaks
122: \begin{eqnarray}
123: E [\probdn] &~=~& \bigsum{k=0}{s-1} \binom{c}{k} \frac{\binom{n-c}{s-k}}{\binom{n}{s}}  
124:     \bigsum{\sasb=0}{k} \frac{s-k}{s-\sasb} 
125: \binom{s}{\sasb} \frac{\binom{c-s}{k-\sasb}}{\binom{c}{k}}  \nonumber \\
126: &~=~& \frac{n-c}{\binom{n}{s}}\bigsum{k=0}{s-1} \binom{(n-c)-1}{(s-k)-1} \bigsum{\sasb=0}{k} \frac{\binom{c-s}{k-\sasb} \binom{s}{\sasb}}{s-\sasb}
127: \label{eq:exact}
128: \end{eqnarray}
129: %}
130: The probability of keeping an item $d$ in $S_B \sminus S_A\! \subseteq\! C_B$ can be expressed as $\probkn = 1 - \probdn$. 
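
As a sanity check of \eqref{eq:exact} (a worked special case, assuming $s=1$ and hence $2 \leq c \leq n-1$), only the summand with $k=0$ and $\sasb=0$ remains:
\[
E[\probdn] ~=~ \frac{n-c}{\binom{n}{1}} \cdot \binom{(n-c)-1}{0} \cdot \frac{\binom{c-1}{0} \binom{1}{0}}{1} ~=~ \frac{n-c}{n}.
\]
This agrees with a direct argument: for $s=1$, the single item in $S_B$ is dropped exactly when the single item in $S_A$ is not already in $C_B$, which under the uniformity assumption happens with probability $1-\frac{c}{n}$.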
131: 
132: 
133: \subsection{Simplification of $\probdn$}
134: \label{subsec:simplifiedPdrop}
135: In order to gain a clearer insight into the emergent behaviour of the gossiping protocol, we simplify the formula for the probability $\probdn$ that an item in $S_B \sminus S_A$ is dropped from $C_B$ after a shuffle. To this end, we re-examine the relationships between the $k$ duplicates received from a neighbour, the $\sasb$ items of the overlap $S_A \cap S_B$, and $\probdn$. We estimate $\probdwk$ by considering each item from $S_A$ separately, and calculating the probability that the item is a duplicate (i.e., is also in $C_B$). In view of the uniform distribution of items over the network, the items in a node's cache are a random sample from the universe of $n$ data items; so all items in $S_A$ have the same chance, $\frac{c}{n}$, of being a duplicate. Thus, the expected number of items in $S_A \cap C_B$ can be estimated by \( E[k] = s \cdot \frac{c}{n} \). Likewise, the expected number of items in $S_A \cap S_B$ can be estimated by $E[\sasb] = k \cdot \frac{s}{c}$, because only the $k$ items in $S_A\cap C_B$ may end up in $S_A\cap S_B$; $\frac{s}{c}$ captures the probability that an item from $C_B$ is also selected to be in $S_B$.
136: It follows that the probability of an item in $S_B \sminus S_A$ to be dropped from $C_B$ after the shuffle is
137: \(E[\probdn] = \frac{s-k}{s-\sasb} = \frac{s - s \cdot \frac{c}{n}}{s - s \cdot \frac{c}{n} \cdot \frac{s}{c}} = \frac{n-c}{n-s}.
138: \)
139: The complementary probability of keeping an item is 
140: \( E[\probkn] = 1 - \frac{n-c}{n-s} = \frac{c-s}{n-s} \). These estimates are valid for general $s \leq c \leq n$.
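To make the estimate concrete, consider the hypothetical parameters $n=500$, $c=100$ and $s=50$ (illustrative values only). Then $E[k] = 50 \cdot \frac{100}{500} = 10$ and $E[\sasb] = 10 \cdot \frac{50}{100} = 5$, so that
\[
E[\probdn] ~=~ \frac{50-10}{50-5} ~=~ \frac{40}{45} ~=~ \frac{8}{9} ~=~ \frac{500-100}{500-50},
\qquad
E[\probkn] ~=~ \frac{1}{9} ~=~ \frac{100-50}{500-50}.
\]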
141: 
142: Substituting the expressions for $\probsn$ and the simplified $\probdn$ into the formulas for the transition probabilities in Fig.~\ref{fig:allsc}, we obtain:
143: \[
144: \begin{array}{rclrcl}
145: P(01|01)=P(10|10) &=& \frac{c-s}{c} & P(01|11)=P(10|11) &=& \frac{s}{c} \frac{c-s}{c} \frac{n-c}{n-s} \vspace{2mm}\\
146: P(10|01)=P(01|10) &=& \frac{s}{c} \frac{n-c}{n-s} \hspace*{1.5cm} & P(11|11) &=& 1 - 2 \frac{s}{c} \frac{c-s}{c} \frac{n-c}{n-s} \vspace{2mm}\\
147: P(11|01)=P(11|10) &=& \frac{s}{c} \frac{c-s}{n-s}
148: \end{array}
149: \]
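These simplified probabilities remain consistent (a quick check): for the state $01$,
\[
\frac{c-s}{c} + \frac{s}{c} \frac{n-c}{n-s} + \frac{s}{c} \frac{c-s}{n-s} ~=~ \frac{c-s}{c} + \frac{s}{c} \cdot \frac{(n-c)+(c-s)}{n-s} ~=~ \frac{c-s}{c} + \frac{s}{c} ~=~ 1,
\]
while for the state $11$ the sum is one by the form of $P(11|11)$. As a numerical illustration with the hypothetical values $n=500$, $c=100$ and $s=50$: $P(01|01)=\frac{1}{2}$, $P(10|01)=\frac{4}{9}$, $P(11|01)=\frac{1}{18}$, $P(01|11)=P(10|11)=\frac{2}{9}$ and $P(11|11)=\frac{5}{9}$.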
150: 
151: In order to verify the accuracy of the proposed simplification for $E[\probdn]$, we compare the simplification and the accurate formula \eqref{eq:exact} for different values of $n$. We plot the difference of the accurate $\probdn$ and the simplification, for cache sizes $c=250$ and $c=500$ (Fig.~\ref{fig:pdrop_comparison1}).
152: \begin{figure}[!hptb]
153: \vspace{-4mm}
154: \begin{minipage}[b]{0.5\linewidth}
155: \centering
156: \includegraphics[width=1.0\textwidth]{figs/Pdrop-P1.eps}
157: \end{minipage}%
158: \begin{minipage}[b]{0.5\linewidth}
159: \centering
160: \includegraphics[width=1.0\textwidth]{figs/Pdrop-P2.eps}
161: \end{minipage}
162: \caption{The difference of the accurate $\probdn$ and its approximation, for different values of $n$ and $c$.}
163: \vspace{-3mm}
164: \label{fig:pdrop_comparison1}
165: \end{figure}
166: 
167: 
168: \subsection{Correction factor}
169: We now examine how closely the simplified formula $E[\probdn] = \frac{n-c}{n-s}$ (referred to here as $S(n,c,s)$) approximates formula \eqref{eq:exact} (referred to here as $E(n,c,s)$). We compared the difference between these two formulas using an implementation on the basis of common fractions, which provides lossless calculation \cite{bigj}. We observed that the inverse of the difference of the inverse values of both formulas, i.e. $e_{c,s}(n) = \left( E(n,c,s)^{-1} - S(n,c,s)^{-1} \right)^{-1}$, exhibits a certain pattern for different values of $n$, $c$ and $s$. For $s=1$, $E(n,c,1) = \frac{n-c}{n}$, whereas $S(n,c,1) = \frac{n-c}{n-1}$. We then investigated the correction factor $\theta$ in $E(n,c,s) = \frac{n-c}{(n-s) + \theta }$. Thus, for $s=1$ we have $\theta=1$. Yet, for $s > 1$ the situation turned out to be more complicated. For $s=2$, we got $e_{4,2}(7) - e_{4,2}(6) = 3.5$, $e_{4,2}(8) - e_{4,2}(7) = 4$, $e_{4,2}(9) - e_{4,2}(8) = 4.5$, etc. Therefore we calculated the first, the second and further (forward) differences\footnote{The forward difference of a discrete function $f: \mathbb{Z}\rightarrow\mathbb{Z}$ is the function $\Delta f:\mathbb{Z}\rightarrow\mathbb{Z}$ with $\Delta f(n)=f(n+1)-f(n)$ (cf. \cite{AS72}).} over $n$. We recognized that the $s$-th difference of the function $e_{c,s}(n)$ is always $\frac{1}{s}$. Moreover, at the point $n=0$ the $1$st, \ldots, $s$-th differences of the function $e_{c,s}$ exhibit a pattern similar to the Pascal triangle \cite{GKP94}; i.e. for $d \ge 1$ the $d$-th difference is: $({\rm \Delta}^d\; e_{c,s})(0) = \frac{1}{s \cdot \binom{s-1}{d}}$ (assuming $\binom{a}{b}=0$, whenever $b > a$). Knowing the initial differences at the point $n=0$, we were able to use the Newton forward difference equation \cite{AS72} to derive the following formula for $n>0$: $E[\probdn] = \frac{n-c}{(n-s) + \frac{1}{\corr}}$, where
170: 
171: \begin{equation}
172: \corr 
173: ~=~ \bigsum{d=0}{s-1} \frac{\binom{n}{d}}{s \cdot \binom{s-1}{d}} 
174: ~=~ \frac{\binom{n}{s}}{(n-s)+1} \cdot \bigsum{d=0}{s-1} \frac{ 1 }{\binom{n-d}{(s-1)-d}} 
175: %= \frac{\bigsum{d=0}{s-1} \frac{ \binom{n}{s} }{\binom{n-d}{(n-s)+1}} }{(n-s)+1}
176: \label{eq:simpwcor}
177: \end{equation}
178: In this equation the sum is finite because, by the observation that the $s$-th difference is the constant $\frac{1}{s}$, all higher differences are $0$.
179: 
180: Extensive experiments with Mathematica and Matlab indicate that $\frac{n-c}{(n-s) + \frac{1}{\corr}}$ and formula \eqref{eq:exact} coincide. We can also see from Fig.~\ref{fig:pdrop_comparison1} that the correction factor is small.
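
For $s=2$ the agreement can be checked by hand (a worked instance, assuming $4 \leq c \leq n-2$ so that \eqref{eq:exact} applies). Both forms of \eqref{eq:simpwcor} give
\[
\corr ~=~ \frac{\binom{n}{0}}{2 \cdot \binom{1}{0}} + \frac{\binom{n}{1}}{2 \cdot \binom{1}{1}} ~=~ \frac{\binom{n}{2}}{n-1} \cdot \left( \frac{1}{\binom{n}{1}} + \frac{1}{\binom{n-1}{0}} \right) ~=~ \frac{n+1}{2},
\qquad\mbox{hence}\qquad
\frac{n-c}{(n-2) + \frac{2}{n+1}} ~=~ \frac{(n-c)(n+1)}{n(n-1)}.
\]
On the other hand, in \eqref{eq:exact} the inner sums for $k=0$ and $k=1$ evaluate to $\frac{1}{2}$ and $\frac{c-2}{2}+2$, respectively, so that
\[
E[\probdn] ~=~ \frac{n-c}{\binom{n}{2}} \left( \frac{(n-c)-1}{2} + \frac{c-2}{2} + 2 \right) ~=~ \frac{n-c}{\binom{n}{2}} \cdot \frac{n+1}{2} ~=~ \frac{(n-c)(n+1)}{n(n-1)}.
\]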
181: 
182: \subsection{Optimal size for the exchange buffer}
183: \label{sec:optimal-s}
184: 
185: \begin{wrapfigure}[14]{r}{0.52\textwidth}
186:  \vspace{-30pt}
187:  \begin{center}
188: \includegraphics[scale=.50]{figs/optimal.eps}
189:  \end{center}
190:   \vspace{-20pt}
191:  \caption{Optimal value of exchange buffer size, depending on $n$.}
192:  \label{fig:optim}
193: \end{wrapfigure}
194: We study which value of the exchange buffer size $s$ is optimal for fast convergence of replication and coverage with respect to an item $d$.
195: Since $d$ is introduced at only one node in the network, one needs to optimize the chance that an item is duplicated.
196: That is, the probabilities $P(11|01)$ and $P(11|10)$ should be optimized (then $P(01|11)$ and $P(10|11)$ are optimized
197: as well, intuitively because for each duplicated item in a shuffle, another item must be dropped).
198: These probabilities both equal $\frac{s}{c}\frac{c-s}{n-s}$; we compute when the $s$-derivative of this formula is zero.
199: This yields the equation $s^2-2ns+nc=0$; taking into account that $s\leq n$, the only solution of this equation is
200: $s = n - \sqrt{n(n-c)}$.  We conclude that this is the optimal value for $s$ to obtain fast convergence of replication
201: and coverage. This will also be confirmed by the experiments and analyses in the following sections.
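
The derivative computation can be spelled out as follows (a sketch that treats $s$ as a continuous variable and uses the simplified $\probdn$):
\[
\frac{d}{ds} \left( \frac{s}{c} \cdot \frac{c-s}{n-s} \right) ~=~ \frac{(c-2s)(n-s) + s(c-s)}{c \cdot (n-s)^2} ~=~ \frac{s^2 - 2ns + nc}{c \cdot (n-s)^2}.
\]
The numerator vanishes for $s = n \pm \sqrt{n(n-c)}$, and only the root with the minus sign satisfies $s \leq n$. As an illustration with the hypothetical values $n=100$ and $c=36$, we get $s = 100 - \sqrt{100 \cdot 64} = 20$ and $P(11|01) = \frac{20}{36} \cdot \frac{16}{80} = \frac{1}{9} \approx 0.1111$, whereas the neighbouring buffer sizes $s=19$ and $s=21$ give $\frac{19}{36} \cdot \frac{17}{81} \approx 0.1108$ and $\frac{21}{36} \cdot \frac{15}{79} \approx 0.1108$, respectively.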
202: