0812.0209/hh.tex
1: 
2: \section{Tracking the Heavy Hitters}
3: \label{sec:track-heavy-hitt}
4: 
5: %Our communication-optimal algorithm for tracking the heavy hitters
6: %is actually much simpler than the heuristic solutions previously
7: %proposed \cite{Babcock:Olston:03,fuller07:_fids}. We believe that
8: %the algorithm is also very practical due to its simplicity.
9: 
10: \subsection{The upper bound}
11: \label{sec:upper-bound}
12: \paragraph{The algorithm.}
13: Let $m$ be the current size of $A$. First, the coordinator $C$
14: always maintains $C.m$, an $\eps$-approximation of $m$.  This can
15: be achieved by letting each site send its local count every time
16: it has increased by a certain amount (to be specified shortly).
17: Each site $S_j$ maintains the exact frequency of each $x \in U$ at
18: site $S_j$, denoted $m_{x,j}$, at all times. The overall frequency
19: of $x$ is $m_x = \sum_j m_{x,j}$. Of course, we cannot afford to
20: keep track of $m_x$ exactly.  Instead, the coordinator $C$ maintains
21: an underestimate $C.m_{x,j}$ of $m_{x,j}$, and sets $C.m_x =
22: \sum_j C.m_{x,j}$ as an estimate of $m_x$.  $S_j$ will send its
23: local increment of $m_{x,j}$ to $C$, hence updating $C.m_{x,j}$,
24: from time to time following certain rules to be specified shortly.
25: In addition, each site $S_j$ maintains ${S_j}.m$, an estimate of
26: $m$; a counter ${S_j}.\Delta(m)$, the number of items that have
27: arrived at $S_j$ since its last communication with $C$ about $m$; and
28: a counter ${S_j}.\Delta(m_x)$ for each $x$, the number of copies
29: of $x$ that have arrived at $S_j$ since its last communication with $C$
30: about $m_{x,j}$.
31: 
32: We can assume that the system starts with $m = k/\eps$ items;
33: before that we could simply send each item to the coordinator.  So
34: when the algorithm initiates, all the estimates are exact. We
35: initialize ${S_j}.\Delta(m)$ and ${S_j}.\Delta(m_x)$ for all $x$
36: to 0. The protocols for tracking the $\phi$-heavy hitters are as
37: follows.
38: \begin{enumerate}
39: \item {\em Each site $S_j$:} When a new copy of item $x$ arrives,
40: ${S_j}.\Delta(m)$ and ${S_j}.\Delta(m_x)$ are incremented by 1.
41: When ${S_j}.\Delta(m)$ (resp.\ ${S_j}.\Delta(m_x)$) reaches $(\eps
42: \cdot {S_j}.m)/3k$, site $S_j$ sends a message $(all, (\eps \cdot
43: {S_j}.m)/3k)$ (resp.\ $(x,(\eps \cdot {S_j}.m)/3k)$) to the
44: coordinator, and resets ${S_j}.\Delta(m)$ (resp.\
45: ${S_j}.\Delta(m_x)$) to 0.
46: 
47: \item {\em Coordinator:} When $C$ has received a message $(all,
48: (\eps \cdot {S_j}.m)/3k)$ or $(x,(\eps \cdot {S_j}.m)/3k)$, it
49: updates $C.m$ to $C.m + (\eps \cdot {S_j}.m)/3k$ or $C.m_x$ to
50: $C.m_x + (\eps \cdot {S_j}.m)/3k$, respectively. Once $C$ has
51: received $k$ messages of the form $(all, (\eps \cdot
52: {S_j}.m)/3k)$, it collects the local counts from each site to compute
53: the exact value of $m$, sets $C.m = m$, and then broadcasts $C.m$
54: to all sites.  Then each site $S_j$ updates its ${S_j}.m$ to $m$.
55: After getting a new $S_j.m$, $S_j$ also resets $S_j.\Delta(m)$ to
56: 0.
57: \end{enumerate}
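
As a concrete illustration of the thresholds used in the two steps above
(the numbers are hypothetical and chosen only to keep the arithmetic
simple, and rounding is ignored), take $k = 10$ sites and $\eps = 0.1$,
and suppose the last broadcast set ${S_j}.m = 30000$ at every site. Each
site then reports after every
$$\frac{\eps \cdot {S_j}.m}{3k} = \frac{0.1 \cdot 30000}{3 \cdot 10} = 100$$
new local items, and after every $100$ new local copies of any fixed $x$.
Once the coordinator has received $k = 10$ messages of the form
$(all, 100)$, it has accounted for
$$10 \cdot 100 = 1000 = \frac{\eps}{3} \cdot 30000$$
new items, at which point it collects the exact local counts and
broadcasts the new value of $m$, from which each site recomputes its
threshold.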
58: 
59: Finally, at any time, the coordinator $C$ declares an item $x$ to be a
60: $\phi$-heavy hitter if and only if
61: \begin{equation}
62: \label{eq:hh}
63: \frac{C.m_x}{C.m} \ge \phi + \frac{\eps}{2}.
64: \end{equation}
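
For instance, with the hypothetical values $\phi = 0.1$, $\eps = 0.02$
and $C.m = 10^6$, the threshold in (\ref{eq:hh}) is
$\phi + \eps/2 = 0.11$. An item with $C.m_x = 115000$ is declared a
heavy hitter, whereas an item with $C.m_x = 105000$ is not:
$$\frac{115000}{10^6} = 0.115 \ge 0.11, \qquad
\frac{105000}{10^6} = 0.105 < 0.11.$$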
65: 
66: 
67: \paragraph{Correctness.}
68: To prove correctness we first establish the following invariants
69: maintained by the algorithm.
70: \begin{equation}
71: \label{eq:bound_C.m_x} m_x - \frac{\eps m}{3} + k \le C.{m_x} \le
72: m_x,
73: \end{equation}
74: \begin{equation}
75: \label{eq:bound_C.m} m - \frac{\eps m}{3} + k \le C.m \le m.
76: \end{equation}
77: 
78: The second inequalities of both (\ref{eq:bound_C.m}) and
79: (\ref{eq:bound_C.m_x}) are obvious. The first inequality of
80: (\ref{eq:bound_C.m_x}) is valid since once a site $S_j$ gets
81: $(\eps \cdot {S_j}.m)/3k$ items of $x$, it sends a message to the
82: coordinator and the coordinator updates $C.m_x$ accordingly. Thus
83: the error of $C.m_x$ at the coordinator is at most
84: $\sum_{j=1}^k \left(\frac{\eps \cdot {S_j}.m}{3k}- 1\right) \le \frac{\eps
85: m}{3} - k$, since ${S_j}.m \le m$. The first inequality of (\ref{eq:bound_C.m}) follows
86: for a similar reason. Combining (\ref{eq:bound_C.m_x}) and
87: (\ref{eq:bound_C.m}), we have
88: $$\frac{m_x}{m} - \frac{\eps}{3} < \frac{C.m_x}{C.m} <
89: \frac{m_x}{m}\cdot\frac{1}{1-\eps/3} < \frac{m_x}{m} + \frac{\eps}{2},$$
90: which guarantees that the approximate ratio $\frac{C.m_x}{C.m}$ is within
91: $\eps/2$ of $\frac{m_x}{m}$. Thus, classifying an item using (\ref{eq:hh})
92: does not generate any false positives or false negatives.
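
For completeness, the chain of inequalities above can be unpacked as
follows. This is only a sketch: it uses the two invariants
(\ref{eq:bound_C.m_x}) and (\ref{eq:bound_C.m}) and additionally assumes
$\eps \le 1$ and $k \ge 1$. For the lower bound, $C.m \le m$ and the
first inequality of (\ref{eq:bound_C.m_x}) give
$$\frac{C.m_x}{C.m} \ge \frac{C.m_x}{m} \ge
\frac{m_x - \eps m/3 + k}{m} > \frac{m_x}{m} - \frac{\eps}{3}.$$
For the upper bound, $C.m_x \le m_x$ and the first inequality of
(\ref{eq:bound_C.m}) give
$$\frac{C.m_x}{C.m} \le \frac{m_x}{m - \eps m/3 + k} \le
\frac{m_x}{m}\cdot\frac{1}{1-\eps/3}.$$
Finally, since
$$\frac{1}{1-\eps/3} = 1 + \frac{\eps/3}{1-\eps/3} \le 1 + \frac{\eps}{2}
\quad \mbox{for } \eps \le 1,$$
and $m_x/m \le 1$, the right-hand side is at most
$\frac{m_x}{m} + \frac{\eps}{2}$.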
93: 
94: \paragraph{Analysis of communication complexity.}
95: We divide the whole tracking period into rounds. A round runs
96: from the time when the coordinator finishes a broadcast of $C.m$ to the
97: time when it initiates the next broadcast. Since the coordinator
98: initiates a broadcast after $C.m$ is increased by a factor of
99: $1+\sum_{i=1}^{k}(\eps/3k) = 1+\eps/3$, the number of rounds is
100: bounded by
101: $$\log_{1+\eps/3} n = O\left(\frac{\log n}{\eps}\right).$$
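
The asymptotic bound in the last step can be checked with the elementary
inequality $\ln(1+x) \ge x/2$ for $0 \le x \le 1$, applied here with
$x = \eps/3$ (this assumes $\eps \le 3$, which holds for any meaningful
error parameter):
$$\log_{1+\eps/3} n = \frac{\ln n}{\ln(1+\eps/3)} \le
\frac{\ln n}{\eps/6} = \frac{6 \ln n}{\eps}.$$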
102: 
103: 
104: In each round, the number of messages of the form $(all, (\eps \cdot
105: {S_j}.m)/3k)$ sent by all the sites is $k$ by the definition of our
106: protocol. Since there are $O(\log n/ \eps)$ rounds in total, the number of
107: messages of the form $(all, (\eps \cdot {S_j}.m)/3k)$ is bounded by
108: $O(k/\eps \cdot \log n)$. On the other hand, it is easy to see that the total
109: number of messages of the form $(x, (\eps \cdot S_j.m)/3k)$ is no more than
110: the total number of messages of the form $(all, (\eps \cdot S_j.m)/3k)$.
111: Therefore, the total cost of the whole system is bounded by
112: $O(k/\eps \cdot \log n)$.
113: 
114: 
115: \begin{theorem}
116: For any $\eps \le \phi \le 1$, there is a deterministic algorithm that
117:   continuously tracks the $\phi$-heavy hitters and incurs a total
118:   communication cost of 
119:   $O(k/\eps\cdot \log n)$.
120: \end{theorem}
121: 
122: \paragraph{Implementing with small space.}
123: In the algorithm described above, we have assumed that each site maintains
124: all of its local frequencies $S_j.m_x$ exactly.  In fact, it is not
125: difficult to see that our algorithm still works if we replace these exact
126: frequencies with a heavy hitter sketch, such as the {\em space-saving}
127: sketch \cite{metwally06}, that maintains the local $\eps'$-approximate
128: frequencies for all items for some $\eps' = \Theta(\eps)$.  More precisely,
129: such a sketch gives us an approximate $S_j.m_x$ for any $x\in U$ with
130: absolute error at most $\eps' |S_j|$, where $|S_j|$ denotes the
131: number of items received at $S_j$ so far.  We need to adjust some of the
132: constants above, but this does not affect our asymptotic results.  By using
133: such a sketch at each site, our tracking algorithm can be implemented in
134: $O(1/\eps)$ space per site and amortized $O(1)$ time per item.
135: 
136: \input{lowerBound}
137: 
138: %%% Local Variables:
139: %%% mode: latex
140: %%% TeX-master: "paper"
141: %%% End:
142: