1: \subsubsection{Analytical results}
2: \label{sec:analysis-hash-functions}
3:
4: In this section we sketch the analysis of the hash functions of our scheme. Note that the hash functions $h_0$, $y_1$, $y_2$, $\dots$, $y_k$ have a range of $b+kr$ bits in total. Thus, by universality of linear hash functions~\cite{admp99}, the probability that there exist two keys that have the same values under all functions is at most $\binom{n}{2}/2^{b+kr}$. We will choose $r$ such that this probability becomes negligible.
5: For simplicity, we assume that the zero vector $0^L$ is not in the set $S$; it is not hard to see that this assumption is insignificant.
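
To make the collision bound above concrete, the following sketch spells out the union bound and then instantiates it with the example parameters $r=\lceil\log(\sqrt[3]{n}\ell)\rceil$ and $k=4$ used later in this section. The symbols $x,x'$ for a pair of distinct keys are introduced here only for illustration, and the closing condition on $2^b\ell^4$ is an observation about the formula, not a statement about the actual parameter choice.
\[
\Pr\bigl[\exists\, x\neq x'\in S \mbox{ that agree under } h_0,y_1,\dots,y_k\bigr]
\;\leq\; \binom{n}{2}\, 2^{-(b+kr)}
\;\leq\; \frac{n^2}{2}\cdot\frac{1}{2^{b}\, n^{4/3}\,\ell^{4}}
\;=\; \frac{n^{2/3}}{2^{b+1}\,\ell^{4}} .
\]
The first inequality holds because a fixed pair of distinct keys agrees on all $b+kr$ output bits with probability $2^{-(b+kr)}$, and there are $\binom{n}{2}$ pairs. The second uses $\binom{n}{2}\leq n^2/2$ and $2^{kr}\geq(\sqrt[3]{n}\,\ell)^4=n^{4/3}\ell^4$. Under these example parameters the bound is $o(1)$ whenever $2^b\ell^4=\omega(n^{2/3})$.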
6:
7: A direct consequence of Theorem~5 in~\cite{admp99} is that, assuming $b\leq\log n-\log\log n$, the expected size of the largest bucket is $O(n\log b / 2^b)$, i.e., within a factor $O(\log b)$ of the average
8: bucket size. This justifies the choice of $b$ in Eq.~(\ref{eq:parameterb}), imposing the requirement that
9: $\ell \geq \log n\log\log n$.
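
As a rough illustration of where this requirement comes from, consider the largest value of $b$ permitted by the assumption, $b=\log n-\log\log n$. This value is used here only as an example and ignores rounding; the actual $b$ is the one fixed in Eq.~(\ref{eq:parameterb}). Then
\[
\frac{n}{2^b}=\log n, \qquad \log b\leq\log\log n, \qquad\mbox{and hence}\qquad
O\!\left(\frac{n\log b}{2^b}\right)=O(\log n\log\log n),
\]
which agrees with the stated bound on $\ell$ up to the constant hidden in the $O$-notation.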
10: %If we choose $\ell = O(\log n\log\log n)$ then the total space usage for the bucket MPHFs is $n\log\ell = n\log\log n + n\log\log\log n + O(n)$ bits.
11:
12: %In the following we will need the following fact:
13: % XXX Need to modify linear map for this to be true: Add "oddity" bit such that the parity of any vector becomes odd.
14: %For any three distinct nonzero vectors $x_1,x_2,x_3\in\{0,1\}^L$, the function values $y_i(x_1)$, $y_i(x_2)$, $y_i(x_3)$ are random and independent. (In other words, the family of linear functions over GF(2) is 3-wise independent on $\{0,1\}^L \backslash \{0^L\}$.) This is because the matrix consisting of $x_1$, $x_2$, and $x_3$ has rank 3 in GF(2), as no two distinct nonzero vectors can be linearly dependent.
15:
16: For any choice of $s$, we will now analyze the probability (over the choice of
17: $y_1,\dots,y_k$) that $x\mapsto \rho(x,s,0)$ and $x\mapsto \rho(x,s,1)$ map the
18: elements of $B_i$ uniformly and independently to $\{0,\dots,p-1\}$. A sufficient
19: criterion for this is that the sums $\sum_{j=1}^{k} t_j[y_j(x) \oplus \Delta]$
20: and $\sum_{j=k+1}^{2k}t_j[y_{j-k}(x) \oplus\Delta]$, $\Delta\in\{0,1\}$, have values
21: that are uniform in $\{0,\dots,p-1\}$ and independent. This is the case if for
22: every $x\in B_i$ there exists an index $j_x$ such that neither $y_{j_x}(x)$ nor
23: $y_{j_x}(x)\oplus 1$ belongs to $y_{j_x}(B_i - \{x\})$. Since $y_1,\dots,y_k$ are
24: universal hash functions, the probability that this is not the case for a given element
25: $x\in B_i$ is bounded by $(|B_i|/2^r)^k \leq (\ell/2^r)^k$. If we choose, for example,
26: $r=\lceil\log(\sqrt[3]{n}\ell)\rceil$
27: and $k=4$, this probability is $o(1/n)$. Hence, by a union bound over the $n$ keys, the probability that
28: this happens for {\em any} key in $S$ is $o(1)$.
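
The arithmetic behind this example choice can be checked directly. The following lines are only a worked verification of the stated bounds for $r=\lceil\log(\sqrt[3]{n}\ell)\rceil$ and $k=4$, and introduce no new assumptions:
\[
\left(\frac{\ell}{2^r}\right)^{k}
\;\leq\; \left(\frac{\ell}{\sqrt[3]{n}\,\ell}\right)^{4}
\;=\; n^{-4/3} \;=\; o(1/n),
\qquad\qquad
n\cdot n^{-4/3} \;=\; n^{-1/3} \;=\; o(1).
\]
The first chain uses $2^r\geq\sqrt[3]{n}\,\ell$. The second is the union bound over the $n$ keys of $S$.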
29:
30: Finally, we need to argue that for each bucket $i$ it is easy to find a value of $s$ such that
31: the pair $h_{i1}$, $h_{i2}$ is good for the MPHF of the bucket. We know that with constant
32: probability this would be the case if the functions were truly random. Now, as argued above, with
33: probability $1-o(1)$ the functions $x\mapsto \rho(x,s,0)$ and $x\mapsto \rho(x,s,1)$ are
34: random and independent on each bucket, for every value of $s$. Then, for a given bucket
35: and a given value of $s$ there is a probability $\Omega(1)$ that the pair of hash functions
36: works for that bucket. Now, for any $\Delta\in\{0,1\}$ and $s\neq s'$, the functions $x\mapsto \rho(x,s,\Delta)$ and $x\mapsto \rho(x,s',\Delta)$ are independent. Thus, by Chebyshev's inequality,
37: the probability that fewer than a constant fraction of the values of $s$ work for a given bucket is
38: $O(1/p)$. So with probability $1-o(1)$ there is a constant fraction of ``good'' choices of $s$ in
39: every bucket, which means that trying an expected constant number of random values for $s$
40: is sufficient in each bucket.
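
The Chebyshev step can be sketched as follows. The notation is introduced here for illustration only and is not taken from the scheme's definition: $m$ denotes the number of candidate values of $s$, $X_s\in\{0,1\}$ indicates that the value $s$ works for the given bucket, $X=\sum_s X_s$, and $q=\Omega(1)$ is a lower bound on $\Pr[X_s=1]$. We assume, as the argument above suggests, that the $X_s$ are pairwise independent and that $m=\Theta(p)$. Then $\mathrm{E}[X]\geq qm$ and $\mathrm{Var}[X]=\sum_s\mathrm{Var}[X_s]\leq\mathrm{E}[X]$, so
\[
\Pr\bigl[X<qm/2\bigr]
\;\leq\; \Pr\bigl[\,|X-\mathrm{E}[X]|\geq \mathrm{E}[X]/2\,\bigr]
\;\leq\; \frac{4\,\mathrm{Var}[X]}{\mathrm{E}[X]^2}
\;\leq\; \frac{4}{\mathrm{E}[X]}
\;\leq\; \frac{4}{qm}
\;=\; O(1/p).
\]
Under these assumptions, at least a $q/2$ fraction of the values of $s$ is good for the bucket except with probability $O(1/p)$.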
41:
42: %\enlargethispage{\baselineskip}
43: %\vspace{-0.5mm}
44: