
\subsubsection{Analytical results}
\label{sec:analysis-hash-functions}

In this section we sketch the analysis of the hash functions of our scheme. Note that the hash functions $h_0$, $y_1$, $y_2$, $\dots$, $y_k$ have a range of $b+kr$ bits in total. Thus, by universality of linear hash functions~\cite{admp99}, the probability that there exist two keys that have the same values under all functions is at most $\binom{n}{2}/2^{b+kr}$. We will choose $r$ such that this probability becomes negligible.
For simplicity, we assume that the zero vector $0^L$ is not in the set $S$ -- it is not hard to see that this assumption is insignificant.

A direct consequence of Theorem~5 in~\cite{admp99} is that, assuming $b\leq\log n-\log\log n$, the expected size of the largest bucket is $O(n\log b / 2^b)$, i.e., within a factor $O(\log b)$ of the average
bucket size. This justifies the choice of $b$ in Eq.~(\ref{eq:parameterb}), imposing the requirement that
$\ell \geq \log n\log\log n$.

%If we choose $\ell = O(\log n\log\log n)$ then the total space usage for the bucket MPHFs is $n\log\ell = n\log\log n + n\log\log\log n + O(n)$ bits.

%In the following we will need the following fact:
% XXX Need to modify linear map for this to be true: Add "oddity" bit such that the parity of any vector becomes odd.
%For any three distinct nonzero vectors $x_1,x_2,x_3\in\{0,1\}^L$, the function values $y_i(x_1)$, $y_i(x_2)$, $y_i(x_3)$ are random and independent. (In other words, the family of linear functions over GF(2) is 3-wise independent on $\{0,1\}^L \backslash \{0^L\}$.) This is because the matrix consisting of $x_1$, $x_2$, and $x_3$ has rank 3 in GF(2), as no two distinct nonzero vectors can be linearly dependent.

For any choice of $s$, we will now analyze the probability (over the choice of
$y_1,\dots,y_k$) that $x\mapsto \rho(x,s,0)$ and $x\mapsto \rho(x,s,1)$ map the
elements of $B_i$ uniformly and independently to $\{0,\dots,p-1\}$. A sufficient
criterion for this is that the sums $\sum_{j=1}^{k} t_j[y_j(x) \oplus \Delta]$
and $\sum_{j=k+1}^{2k}t_j[y_j(x) \oplus\Delta]$, $\Delta\in\{0,1\}$, have values
that are uniform in $\{0,\dots,p-1\}$ and independent. This is the case if for
every $x\in B_i$ there exists an index $j_x$ such that neither $y_{j_x}(x)$ nor
$y_{j_x}(x)\oplus 1$ belongs to $y_{j_x}(B_i - \{x\})$. Since $y_1,\dots,y_k$ are
universal hash functions, the probability that this is not the case for a given element
$x\in B_i$ is bounded by $(|B_i|/2^r)^k \leq (\ell/2^r)^k$. If we choose, for example,
$r=\lceil\log(\sqrt[3]{n}\ell)\rceil$
and $k=4$, this probability is $o(1/n)$. Hence, by a union bound, the probability that
this happens for {\em any} key in $S$ is $o(1)$.
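
To make the parameter choice concrete, the bound can be evaluated directly. This is only a sketch: it takes the stated bound $(\ell/2^r)^k$ at face value and assumes that $\log$ denotes the base-2 logarithm. With $r=\lceil\log(\sqrt[3]{n}\,\ell)\rceil$ we have $2^r\geq n^{1/3}\ell$, so for $k=4$
\[
\left(\frac{\ell}{2^r}\right)^{k} \;\leq\; \left(\frac{\ell}{n^{1/3}\ell}\right)^{4} \;=\; n^{-4/3} \;=\; o(1/n),
\]
and summing over the $n$ keys of $S$ gives a total failure probability of at most $n\cdot n^{-4/3} = n^{-1/3} = o(1)$.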

Finally, we need to argue that for each bucket $i$ it is easy to find a value of $s$ such that
the pair $h_{i1}$, $h_{i2}$ is good for the MPHF of the bucket. We know that this is the case with constant
probability if the functions are truly random. Now, as argued above, with
probability $1-o(1)$ the functions $x\mapsto \rho(x,s,0)$ and $x\mapsto \rho(x,s,1)$ are
random and independent on each bucket, for every value of $s$. Then, for a given bucket
and a given value of $s$, there is a probability $\Omega(1)$ that the pair of hash functions
works for that bucket. Now, for any $\Delta\in\{0,1\}$ and $s\neq s'$, the functions $x\mapsto \rho(x,s,\Delta)$ and $x\mapsto \rho(x,s',\Delta)$ are independent. Thus, by Chebyshev's inequality,
the probability that fewer than a constant fraction of the values of $s$ work for a given bucket is
$O(1/p)$. So with probability $1-o(1)$ there is a constant fraction of ``good'' choices of $s$ in
every bucket, which means that trying an expected constant number of random values for $s$
is sufficient in each bucket.
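
The Chebyshev step can be spelled out as follows. This is a sketch under assumptions that are not stated above: we assume that $s$ ranges over $m$ values with $m=\Theta(p)$, and we introduce the indicator $X_s\in\{0,1\}$ (notation used only here) for the event that $s$ works for the given bucket, with $\Pr[X_s=1]\geq q$ for some constant $q>0$. Let $X=\sum_s X_s$ and $\mu=E[X]\geq qm$. If the $X_s$ are pairwise independent, then $\mathrm{Var}[X]=\sum_s\mathrm{Var}[X_s]\leq\mu$, and hence
\[
\Pr[X<\mu/2] \;\leq\; \Pr\bigl[|X-\mu|\geq\mu/2\bigr] \;\leq\; \frac{\mathrm{Var}[X]}{(\mu/2)^2} \;\leq\; \frac{4}{\mu} \;\leq\; \frac{4}{qm} \;=\; O(1/p).
\]
Under these assumptions, at least a $q/2$ fraction of the values of $s$ work for the bucket except with probability $O(1/p)$.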

%\enlargethispage{\baselineskip}
%\vspace{-0.5mm}