1: \documentclass[12pt]{article}
2: \usepackage{latexsym}
3: \usepackage{amsfonts}
4: % This is for including postscript figures.
5: \newcommand{\zed}{\mbox{$\Bbb Z$}}
6: \setlength{\oddsidemargin}{0in}
7: \setlength{\topmargin}{0in}
8: \setlength{\headheight}{0in}
9: \setlength{\textheight}{8.3in}
10: \setlength{\textwidth}{6.7in}
11: \setlength{\topskip}{0in}
12: \setcounter{tocdepth}{4}
13: \newtheorem{thm}{Theorem}[section]
14: \newtheorem{cor}[thm]{Corollary}
15: \newtheorem{lem}[thm]{Lemma}
16: \newtheorem{pro}[thm]{Proposition}
17: \newtheorem{defn}[thm]{Definition}
18: \newtheorem{fact}[thm]{Fact}
19: \newcommand{\ds} {\displaystyle}
20: \renewcommand{\arraystretch}{.6}
21: \bibliographystyle{abbrv}
22:
23:
24: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
25: %
26: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
27:
28: \title{Probabilistic behavior of hash tables}
29: \author{Dawei Hong\footnote{D.~Hong and J.C.~Birget,
30: Dept.\ of Computer Science,
31: Rutgers University at Camden, Camden, NJ 08102,
32: USA, \{dhong, birget\}@camden.rutgers.edu}, \ \
33: Jean-Camille Birget\footnote{Supported in part by
34: NSF grant DMS-9970471}, \ \
35: Shushuang Man \footnote{ Dept.\ of Mathematics and Computer Science,
36: Marshall, MN 56258, USA, mans@southwest.msus.edu}
37: }
38: \date{}
39: \begin{document}
40: \maketitle
41:
42: \begin{abstract}
43: We extend a result of Goldreich and Ron about estimating the collision
44: probability of a hash function. Their estimate has a polynomial tail.
45: We prove that when the load factor is greater than a certain constant,
the estimator has a Gaussian tail. As an application we find an estimate of
an upper bound for the average search time in hashing with chaining,
for a particular user (we allow the overall key distribution to be different
from the key distribution of a particular user). The estimator has a
Gaussian tail.
51: \end{abstract}
52:
53: %%%%%%%%%%%%%%%%%%%%%%%%%
54: % Section 1
55:
56: \section{Introduction}
57:
58: Hash tables have many applications in computer science \cite{CLRS}, \cite{Kn}.
We especially mention databases, where hash tables are used for storing
60: values of an attribute; see chapter 12 of \cite{SKS}.
61: Following the notation of \cite{CLRS}, a hash function is a function
$h: U \to T$, where both the domain $U$ and the range $T$
63: are finite. Traditionally, $U$ is called the {\it key space} or the ``universe'',
and elements $x \in U$ are called {\it keys}. The set $T$ is called the
{\it table}, and its elements are called the {\it table slots}. When
66: $h(x) = i$ we say that $h$ {\it hashes} the key $x$ into the slot $i$.
67: We shall denote by $n$ the cardinality of $T$ and we will simply assume that
68: $T = \{1, \ldots, n\}$.
We assume that $U$ is (very much) larger than $T$.
70:
71: \smallskip
72:
73: We assume that a probability measure $q$ has been defined on $U$.
74: The probability of $S \ (\subset U)$ is denoted by ${\sf P}(S)$
75: $( \ = \sum_{x \in S} q(x))$. We also put the product measure on
76: $U \times U$ and on $U^m$ (for any positive integer $m$); using
77: the product measure amounts to saying that in a sequence of $m$ keys,
78: all the keys are {\it independent}.
79:
80: The probability on $U$ induces a probability measure on $T$:
81: The {\it probability that some key hashes to slot} $i \ (\in T)$ is \
82: $p_i = \sum_{x \in h^{-1}(i)} q(x)$ $= {\sf P}(h^{-1}(i))$.
83:
84: If two keys $x_1, x_2 \in U$ have the same hash value, these keys are said
85: to {\it collide}. The {\it collision probability} of the hash function $h$
86: is defined to be \ ${\sf P}\{(x_1, x_2) \in U \times U : h(x_1) = h(x_2) \}$
87: (in short-hand this is denoted by ${\sf P}(h(x_1) = h(x_2))$).
88: Here we use the product measure (i.e., keys are ``chosen independently'').
89: A {\it true collision} corresponds to keys $x_1, x_2 \in U$ such that
90: $x_1 \neq x_2$ and $h(x_1) = h(x_2)$.
91:
Throughout this paper, $\|\cdot\|$ denotes the Euclidean norm. It is straightforward
93: to prove the following.
94: \begin{pro}
95: The collision probability of $h$ is equal to \
96: $\sum_{i = 1}^n p_i^2 \ \ ( = \|p\|^2)$.
97:
98: \medskip
99:
100: \noindent
101: Moreover, we always have \ $\sum_{i=1}^n p_i^2 \geq \frac{1}{n}$,
102: and equality holds iff \ $p_i = \frac{1}{n}$ for all $i \in T$.
103: \end{pro}
104: Similarly, the probability that two independently chosen keys are equal is
\ $\sum_{x \in U} q(x)^2$. Hence, the probability of true collisions
for $h$ is \ $\sum_{i = 1}^n p_i^2 \ - \ \sum_{x \in U} q(x)^2.$

Note that \ $\sum_{x \in U} q(x)^2$ \ will usually be very small
109: assuming that $U$ is very large (compared to $n$ and compared to the length
110: $m$ of key sequences used), and assuming that the probability distribution
111: $q$ on $U$ is not very concentrated.
112: Therefore, the difference between the collision probability $\|p\|^2$ and
113: the probability of true collisions is usually quite small.
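
\smallskip

\noindent For completeness, here is a brief sketch of why the proposition
above holds, using only the definitions already introduced.
Since the two keys $x_1, x_2$ are chosen independently,
$${\sf P}(h(x_1) = h(x_2)) \ = \
\sum_{i=1}^n {\sf P}(h(x_1) = i) \, {\sf P}(h(x_2) = i)
\ = \ \sum_{i=1}^n p_i^2 .$$
The lower bound follows from the Cauchy-Schwarz inequality applied to $p$
and the all-ones vector:
$$1 \ = \ \left( \sum_{i=1}^n p_i \cdot 1 \right)^2
\ \leq \ n \sum_{i=1}^n p_i^2 ,$$
with equality iff $p$ is a constant vector, i.e., $p_i = \frac{1}{n}$ for
all $i \in T$.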
114:
115: \medskip
116:
117: In this paper we assume that collisions
118: are resolved by some form of {\it chaining}; i.e., all the keys that are
119: hashed into one slot are stored in that slot.
For a hash table with chaining, we assume that the search time
(for both successful and unsuccessful searches) in a slot $i$ is proportional
to the number of keys stored in that slot; for simplicity, we identify
the search time in a slot with the chain length in that slot.
124:
125: \bigskip
126:
127: \noindent {\bf Notation} ``$k_i(x)$'': \
128: Let $x = (x_1, \ldots, x_m)$ be a sequence of $m$ keys that are inserted into
129: our hash table, and let $i$ be a slot ($i = 1, \ldots, n$).
130: We let $k_i(x)$ denote the number of keys (counted with multiplicities)
131: inserted into slot $i$. (``With multiplicities'' means that if a key
132: occurs several times in $x$ it is counted as many times as it occurs.)
133:
134: Since in $k_i(x)$ we count keys with multiplicities, $k_i(x)$ is an upper bound
135: on the number of different keys stored in slot $i$.
136:
137: \begin{pro}
138: For a sequence of keys $x = (x_1, \ldots, x_m)$ that are inserted, the
139: number of collisions between keys in $x$ is
140: $$\sum_{i = 1}^n \frac{k_i(x)(k_i(x) - 1)}{2}.$$
141: \end{pro}
142: The proof is straightforward. Recall that we count pairs of equal keys in
143: the sequence $x$ as collisions. Since there are $\frac{m(m - 1)}{2}$
144: unordered pairs of key insertions in $x$, we call \
145: $$\sum_{i = 1}^n \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)}$$
146: the {\em empirical collision probability} of $x$.
147: This concept, and its relation with the collision probability $\|p\|^2$,
148: were first studied by Goldreich and Ron \cite{GR}.
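
\smallskip

\noindent As a small illustration (the numbers are chosen here only for
concreteness and do not come from any data set), take $n = 3$ slots and a
sequence $x$ of $m = 4$ keys whose hash values are $(1, 1, 2, 1)$.
Then $k_1(x) = 3$, $k_2(x) = 1$, $k_3(x) = 0$, the number of collisions is
$$\frac{3 \cdot 2}{2} + \frac{1 \cdot 0}{2} + \frac{0 \cdot (-1)}{2}
\ = \ 3,$$
and the empirical collision probability of $x$ is
$$\frac{3 \cdot 2 + 1 \cdot 0 + 0 \cdot (-1)}{4 \cdot 3} \ = \ \frac{1}{2}.$$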
149:
150: \bigskip
151:
152: In this paper we obtain two results, in the form of deviation bounds.
153: (1) We give an estimation of the collision probability.
154: (2) We give a deviation bound for an upper bound on the average search time.
155:
156: In the second result we assume that the load factor is $> 9$
157: (see later for the exact assumptions).
Applications in databases often lead to hash tables with
159: large load factor (\cite{SKS}, Chapter 12).
160: We allow arbitrary key distributions.
161:
162: \bigskip
163:
164: \noindent {\bf Estimation of the collision probability}
165:
166: \medskip
167:
168: \noindent
169: Our first result extends a result of Goldreich and Ron \cite{GR},
170: namely that \ $\sum_{i = 1}^n \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)}$
171: \ is a very good estimator for the collision probability $\|p\|^2$.
172: How good the estimator is can be measured by the relative error \
173: $|\sum_{i=1}^n \frac{k_i(x)(k_i(x)-1)}{m(m-1)} \cdot \frac{1}{\|p\|^2}$
174: $ \ - \ 1|$. Their result, as well as ours, gives a deviation bound for
this relative error. Goldreich and Ron \cite{GR} proved a polynomial deviation
176: bound for the estimator \ $\sum_{i=1}^n \frac{k_i(x)(k_i(x)-1)}{m(m-1)}$.
177: Their goal was to find sublinear-time algorithms for testing expansion
178: properties of bounded-degree graphs.
179:
180: \begin{thm} {\rm (Goldreich and Ron \cite{GR}).} \
181: For all \ $\beta > 0$, $\lambda \geq 0$, if
182: $m = n^{1/2 + \beta + \lambda}$ then
183: $${\sf P} \left\{ \left| \sum_{i = 1}^{n}
184: \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} \cdot \frac{1}{\|p\|^2} - 1 \right|
185: \leq \frac{3}{n^{\beta/2}} \right\}
186: \ \geq \ 1 - \frac{4}{9n^{\lambda}}.$$
187: \end{thm}
188: We extend the theorem of Goldreich and Ron as follows:
189:
190: \smallskip
191:
192: \begin{thm}
193: \label{OurResult1} \
194: For all \ $n > 24$, \ $\frac{1}{3} > \epsilon > 0$, \ $\delta > 0$, \
195: $s > 0$, if \ $m = \epsilon^{-2}n^{1 + \delta}$ \ we have
196: $${\sf P} \left\{ \left| \sum_{i = 1}^n
197: \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} \cdot \frac{1}{\|p\|^2} - 1 \right|
198: \leq \epsilon \left( 3 + \frac{6s}{n^{\delta/2}} +
199: \frac{5s^2\epsilon}{n^\delta} \right) \right\}
200: \ \geq \ 1 - \frac{10}{9} \, e^{-s^2 /4}.$$
201: \end{thm}
202:
203: \medskip
204:
205: \noindent By taking $s = 2 \, n^{\delta/2}$, the expression \
206: $3 + \frac{6s}{n^{\delta/2}} + \frac{5s^2\epsilon}{n^\delta} $ \
207: becomes \ $3 + 12 + 20 \, \epsilon$ $(< 22)$; here we use
208: $\epsilon < \frac{1}{3}$. Therefore,
209:
210: \begin{cor} \label{OurResult1_cor1} \
211: For all \ $n > 24$, \ $\frac{1}{3} > \epsilon > 0$, \ $\delta > 0$, \
212: if \ $m = \epsilon^{-2}n^{1 + \delta}$ \ we have
213:
214: \medskip
215:
216: \ \ \ \ \ \ \ \ \ \
217: ${\sf P} \left\{ \left| \sum_{i = 1}^n
218: \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} \cdot \frac{1}{\|p\|^2} - 1 \right|
219: \leq 22 \, \epsilon \right\}
220: \ \geq \ 1 - \frac{10}{9} \, e^{-n^{\delta}}.$
221: \end{cor}
222:
223: \smallskip
224:
225: \noindent
226: Writing $\delta = \frac{\log C}{\log n}$, for $C > 1$, we obtain
227: $n^{\delta} = C$, and $m = \epsilon^{-2}Cn$, i.e., the load factor is
228: $L = C \, \epsilon^{-2}$. Therefore,
229: \begin{cor} \label{OurResult1_cor2} \
230: For all \ $n > 24$, \ $\frac{1}{3} > \epsilon > 0$, \ and all $m$ such that
231: $L = \frac{m}{n} > \epsilon^{-2} \ ( > 9)$ \ we have
232:
233: $${\sf P} \left\{ \left| \sum_{i = 1}^n
234: \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} \cdot \frac{1}{\|p\|^2} - 1 \right|
235: \leq 22 \, \epsilon \right\}
236: \ \geq \ 1 - \frac{10}{9} \, e^{- L \epsilon^2}.$$
237: \end{cor}
238:
239: \smallskip
240:
241: \noindent
242: Note that the assumptions of this Corollary impose the following relation
243: between $L$ and $\epsilon$: \ $\frac{1}{3} > \epsilon > \frac{1}{\sqrt{L}}$;
244: equivalently, $L = \frac{m}{n} > \epsilon^{-2} \ ( > 9)$.
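
\smallskip

\noindent To get a feeling for the orders of magnitude in Corollary
\ref{OurResult1_cor2}, here is a worked instance with hypothetical
parameter values (chosen only for illustration). Take $\epsilon = 0.01$
and $L = 40000$, so that $L > \epsilon^{-2} = 10000$. Then
$$22 \, \epsilon \ = \ 0.22, \ \ \ \ \
L \epsilon^2 \ = \ 4, \ \ \ \ \
1 - \frac{10}{9} \, e^{-4} \ \approx \ 0.98 ,$$
i.e., the empirical collision probability is within a relative error of
$22\%$ of $\|p\|^2$ with probability at least about $0.98$.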
245:
246: \medskip
247:
248: To compare with the result of Goldreich and Ron, let us pick
249: $\epsilon = n^{- \beta / 2}$ in Corollary \ref{OurResult1_cor1}. Then
250: $n^{1/2 + \beta + \lambda} = m = \epsilon^{-2}n^{1 + \delta}$ implies
251: $\delta = \lambda - \frac{1}{2}$. Hence our Corollary becomes:
252:
253: \begin{cor} \label{OurResult1_cor3} \
254: For all \ $n > 24$, \ $\beta > \frac{\log 3}{\log n}$, \
255: $\lambda > \frac{1}{2}$,
256: if \ $m = n^{1/2 + \beta + \lambda}$ \ we have
257: $${\sf P} \left\{ \left| \sum_{i = 1}^n
258: \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} \cdot \frac{1}{\|p\|^2} - 1 \right|
\leq 22 \, n^{- \beta / 2} \right\}
260: \ \geq \ 1 - \frac{10}{9}e^{-n^{\lambda - \frac{1}{2}}}.$$
261: \end{cor}
262:
263: \smallskip
264:
Comparing Corollary \ref{OurResult1_cor3} with the theorem of Goldreich and Ron,
we see that our result gives a much better deviation bound
267: (it is exponential, as opposed to the polynomial bound of Goldreich and Ron);
268: but it applies only when the load factor $L$ is $> 9$ (whereas in the
269: result of Goldreich and Ron, the load factor $L = n^{\beta + \lambda - 1/2}$
270: can be arbitrarily small, depending on $n$).
271:
272: \bigskip
273:
274:
275: \noindent {\bf The average search time for a particular user}
276:
277: \medskip
278:
279: In order to analyze the efficiency of a hash table one considers the
280: overall usage statistics of the keys (over all users).
281: By ``user'' we mean a person or a process.
282: For every user we introduce a vector
283: $v = (v_1, \ldots, v_n)$, where $v_i$ is the frequency of the user's access
284: (for search) to slot $i$. More precisely, $v_i$ is the number of searches at
285: slot $i$, divided by the total number of searches in the table, for this user.
286: Then $0 \leq v_i \leq 1$ and $\sum_{i=1}^n v_i = 1$.
287: We shall call $v$ the user's {\it access pattern}.
Traditional analysis of the average search time assumes that the access
pattern of a user is the same as the key distribution (see, e.g., \cite{CLRS}).
290:
291: \smallskip
292:
293: We let ${\rm AST}(v, x)$ denote the average search time for a user with access
294: pattern $v$, under the condition that a sequence $x$ of $m$ independent keys
295: was previously inserted into the hash table. Clearly, we have the following
296: upper bound:
297:
298: \smallskip
299:
300: \ \ \ \ \ ${\rm AST}(v, x) \ \leq \ \sum_{i=1}^n v_i \cdot k_i(x)$.
301:
302: \smallskip
303:
\noindent The difference between ${\rm AST}(v, x)$ and
$\sum_{i=1}^n v_i \cdot k_i(x)$ is caused by the possibility of
pseudo-collisions (pairs of equal keys in $x$, which $k_i(x)$ counts with
multiplicity). Here we are only concerned with upper bounds
on ${\rm AST}(v, x)$, so we can use $\sum_{i=1}^n v_i \cdot k_i(x)$.
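
\smallskip

\noindent As a toy illustration of this upper bound (with hypothetical
values, not taken from any experiment), let $n = 3$, let the chain lengths be
$(k_1(x), k_2(x), k_3(x)) = (3, 1, 0)$, and let the access pattern be
$v = (\frac{1}{2}, \frac{1}{2}, 0)$. Then
$$\sum_{i=1}^n v_i \cdot k_i(x) \ = \
\frac{1}{2} \cdot 3 + \frac{1}{2} \cdot 1 + 0 \cdot 0 \ = \ 2 ,$$
so this user's average search time is at most 2; the empty third slot does
not contribute, since this user never accesses it.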
308:
309: \smallskip
310:
311: We write $m$ as $m = Ln$, where $L$ is called the {\it load factor}.
312: We do not assume that $L$ is a constant. Applying Theorem \ref{OurResult1}
313: we show
314: \begin{cor}
315: \label{OurResult2}
316: For all \ $n > 24$, \ $s > 0$, \ $L > 9$, and $m = L n$ we have
317: \[
318: {\sf P} \left\{ {\rm AST}(v, x) \leq \ L \, n \|v\| \, \|p\| \,
319: \sqrt{1+ \frac{3+6s}{\sqrt{L}} + \frac{5s^2}{L} } + 1 \right\}
320: \ \geq \ 1 - \frac{10}{9}e^{-s^2 /4}. \]
321: \end{cor}
322:
323: \medskip
324:
325: \noindent Noting that
326: \ $\sqrt{1+ \frac{3+6s}{\sqrt{L}} + \frac{5s^2}{L} }$ $<$
327: $1 + \frac{ 4s}{\sqrt{L}}$ \ and letting \ $\epsilon = \frac{s}{2\sqrt{L}}$
328: \ we obtain
329: \begin{cor}
330: \label{OurResult2_cor1} \ For all \ $n > 24$, \ $\epsilon > 0$, \ $L > 9$,
331: and $m = L n$ we have
332: \[
333: {\sf P} \left\{ {\rm AST}(v, x) \leq \ L \, n \|v\| \, \|p\| \,
334: (1 + 8\epsilon) + 1 \right\}
335: \ \geq \ 1 - \frac{10}{9}e^{-L \epsilon^2}. \]
336: \end{cor}
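
\noindent The step from Corollary \ref{OurResult2} to Corollary
\ref{OurResult2_cor1} can be checked by squaring; the following is a sketch
of that check, spelled out here for the reader's convenience. We have
$$\left( 1 + \frac{4s}{\sqrt{L}} \right)^2 \ = \
1 + \frac{8s}{\sqrt{L}} + \frac{16 s^2}{L},$$
so the inequality \ $\sqrt{1+ \frac{3+6s}{\sqrt{L}} + \frac{5s^2}{L} }$ $<$
$1 + \frac{4s}{\sqrt{L}}$ \ is equivalent to
$$3 \ < \ 2s + \frac{11 s^2}{\sqrt{L}} ,$$
which holds, for example, whenever $s > \frac{3}{2}$. With
$s = 2 \epsilon \sqrt{L}$ this sufficient condition reads
$\epsilon > \frac{3}{4 \sqrt{L}}$, and then
$\frac{4s}{\sqrt{L}} = 8 \epsilon$ and $e^{-s^2/4} = e^{-L \epsilon^2}$.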
337:
338: \noindent One notices that the probability bound is only interesting when
339: $L$ is significantly larger than $\epsilon^{-2}$. Also, the error bound
340: is interesting only when $\epsilon$ is less than $1/8$; this means that
the load factor has to be at least 100 for our results to be interesting.
342: In that sense, the results are theoretical, and show just what type of
343: behavior to expect, up to big-O.
344:
345: In \cite{CLRS} (chapt.~12, exercise 12-3) the expected search time (for
346: every user) was found to be $\Theta \left(L\right)$, under the assumption
347: that both the key distribution and the distribution of user's accesses are
348: uniform.
349: Our Corollary implies that if $\|p\|^2 = \Theta \left(\frac{1}{n}\right)$
350: and $\|v\|^2 = \Theta \left(\frac{1}{n}\right)$
351: (which is much more relaxed than the assumption of a uniform distribution),
352: then with exponentially high probability, the average search time is $O(L)$
353: for a user with access pattern $v$.
354:
355: \bigskip
356:
357: \noindent {\bf Example 1}
358:
359: \smallskip
360:
Suppose that a hash table, designed for a certain population of users, has
collision probability \ $\|p\|^2 \leq \frac{c^2}{n}$, i.e., \
$\|p\| \leq \frac{c}{\sqrt{n}}$ (for
the overall population of users); $c$ is a positive constant.
364: The keys in the hash table are independent random samples.
365: Now consider an individual user who accesses a subset of cardinality
366: $\alpha \, n$ (where $0 < \alpha \leq 1$) of the $n$ slots of the hash table,
367: with uniform probability $\frac{1}{\alpha n}$, and who does not access the
368: other $(1 - \alpha) n$ slots of the hash table at all (i.e., those slots
369: have probability 0 for this user). Then the question is: What is the average
370: search time for this user and this table, and what is the deviation bound?
371:
372: Since the user accesses a fraction $\alpha$ of the slots uniformly, we have
373: $\|v\| = \frac{1}{\sqrt{\alpha n}}$. By Corollary \ref{OurResult2_cor1},
374:
375: \smallskip
376:
377: \ \ \ \ \ ${\sf P}\{ {\rm AST}(v, x) \leq \ $
378: $\frac{cL}{\sqrt{\alpha}} \, (1 + 8\epsilon) + 1 \}$
379: $ \ \geq \ 1 - \frac{10}{9}e^{-L \epsilon^2}$.
380:
381: \smallskip
382:
383: \noindent
So, the average search time is at most $1 + \frac{cL}{\sqrt{\alpha}}$,
up to an additive error term (namely \ $\frac{cL}{\sqrt{\alpha}} \, 8\epsilon$),
386: and with probability close to 1 (namely \
387: $1 - \frac{10}{9}e^{-L \epsilon^2}$).
388:
389: One observes that when the fraction $\alpha$ of the table used by the user
390: becomes smaller, the upper bound on the average search time for this user
391: increases, as does the error bound. This is not surprising; hashing works
392: best when the keys are spread over the table as evenly as possible.
393: Interestingly, our probability bound does not depend on $\alpha$.
394:
395: Some possible numerical values: For $c = 5$, \ $\alpha = 0.1$,
396: \ $\epsilon = 0.05$, $L = 1000$, we get \
397: ${\rm AST}(v, x) \leq \ 15811 \pm 6324$, with probability at least
$1 - \frac{10}{9}e^{-L \epsilon^2}$ \ $\approx \ 0.909$.
For $c = 5$, \ $\alpha = 0.1$, \ $\epsilon = 0.05$, $L = 10000$,
we get \ ${\rm AST}(v, x) \leq \ (1.58 \pm 0.63) \cdot 10^5$, with
401: probability at least $1 - 1.54 \cdot 10^{-11}$.
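
\smallskip

\noindent The arithmetic behind the first set of values ($L = 1000$) can be
retraced as follows (all values rounded; this is only a sketch of the
computation):
$$\frac{cL}{\sqrt{\alpha}} \ = \ \frac{5 \cdot 1000}{\sqrt{0.1}}
\ \approx \ 1.58 \cdot 10^4, \ \ \ \ \
\frac{cL}{\sqrt{\alpha}} \cdot 8 \epsilon \ \approx \
1.58 \cdot 10^4 \cdot 0.4 \ \approx \ 6.32 \cdot 10^3,$$
$$L \epsilon^2 \ = \ 1000 \cdot 0.0025 \ = \ 2.5, \ \ \ \ \
1 - \frac{10}{9} \, e^{-2.5} \ \approx \
1 - \frac{10}{9} \cdot 0.0821 \ \approx \ 0.909 .$$
For $L = 10000$ the same steps give $L \epsilon^2 = 25$ and
$\frac{10}{9} \, e^{-25} \approx 1.54 \cdot 10^{-11}$.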
402:
403: \bigskip
404:
405: \noindent {\bf Example 2}
406:
407: \smallskip
408:
409: Let us consider the situation in which a query consists of two subqueries,
410: $Q_1$ and $Q_2$. This happens very commonly (e.g., in a ``three-tier
411: architecture''); see \cite{SKS}.
412: The two subqueries can be viewed as two users with
413: access patterns $v^{(1)}$ and $v^{(2)}$. Assume, for this example, that
414: each of $Q_1$ and $Q_2$ behaves like the user in Example 1 above.
415: In particular, for $Q_i$ $(i = 1, 2)$ we have \
416: $\|v^{(i)}\| = \frac{1}{\sqrt{\alpha_i n}}$, and
417:
418: \smallskip
419:
420: \ \ \ ${\sf P}\{ {\rm AST}_i(v^{(i)}, x) \leq \ $
421: $\frac{cL}{\sqrt{\alpha_i}} \, (1 + 8\epsilon) + 1 \}$
422: $ \ \geq \ 1 - \frac{10}{9}e^{-L \epsilon^2}$.
423:
424: \smallskip
425:
426: \noindent Hence, for the combined query the average search time is a
427: weighted sum \
428:
429: \smallskip
430:
431: \ \ \ ${\rm AST} = w_1 \cdot {\rm AST}_1 + w_2 \cdot {\rm AST}_2$,
432: \ \ \ with $w_1 + w_2 = 1$.
433:
434: \smallskip
435:
436: \noindent Let \ $a_i = \frac{cL}{\sqrt{\alpha_i}} \, (1 + 8\epsilon) + 1$. Then \
437:
438: \smallskip
439:
${\sf P}\{ {\rm AST} \leq {\rm max}\{a_1,a_2\} \} \ \geq \ $
${\sf P}\{ {\rm AST}_1 \leq a_1, \ $
${\rm AST}_2 \leq a_2 \} $

\smallskip

$ \geq \ 1 - 2 \, \frac{10}{9}e^{-L \epsilon^2}$,

\smallskip

\noindent where the first inequality holds because
$w_1 a_1 + w_2 a_2 \leq {\rm max}\{a_1,a_2\}$, and the second one is the
union bound.
447:
448: \smallskip
449:
450: \noindent Therefore, the average search time AST$(v^{(1)}, v^{(2)}, x)$
451: of the combined query satisfies
452:
453: \smallskip
454:
455: \ \ \ ${\sf P}\{ {\rm AST}(v^{(1)}, v^{(2)}, x) \leq \ $
456: $\frac{cL}{\sqrt{{\rm min}\{\alpha_1,\alpha_2\}}} \, (1+ 8\epsilon) +1 \}$
457: $ \ \geq \ 1 - \frac{20}{9}e^{-L \epsilon^2}$.
458:
459: \bigskip
460:
\noindent Hence, when the load factor is large (compared to $\epsilon^{-2}$) we
462: obtain a very reliable upper bound on the average search time for the
463: combined query. The knowledge of this upper bound enables various processes
464: (that wait for the completion of this query) to be scheduled in a predictable
465: way.
466:
467: The constants in our results are rather large. This is due to the generality
of our results. In a specific practical situation, our results could be used
for the form of the probabilistic behavior, with constants to be
470: determined empirically.
471:
472: \bigskip
473:
474: The next section contains the proofs of our theorems.
475: %%%%%%%%%%%%%%%%%%%%%%%%%
476: % Section 2
477:
478: \section{Proofs}
479:
480:
481: \subsection{A deviation bound for the empirical
482: collision probability: Proof of Theorem \ref{OurResult1} }
483:
484: Our main technique will be Talagrand's isoperimetric theory, developed
485: by Talagrand in the mid 1990s \cite{Ta}. It has had a profound
486: impact on the probabilistic theory of combinatorial optimization \cite{St}
(see Sections 6--13 of \cite{Ta} and chapter 6 of \cite{St}).
488:
489: Let $(\Omega, \mu)$ be a probability space, and let $(\Omega^m, \mu^m)$
490: be the product space. For $x \in \Omega^m$ and $A \subset \Omega^m$,
491: Talagrand's convex distance $d_T(x, A)$ is defined by
492: \[ d_T(x, A) = \sup_{\alpha} \left \{z_\alpha = \inf_{y \in A}
\left\{\sum_{j = 1}^m \alpha_j \ {\bf 1}(x_j \not= y_j) \right\} \ : \
494: \alpha = (\alpha_1, \ldots, \alpha_m), \ \sum_{j = 1}^m \alpha_j^2 \leq 1
495: \right\},
496: \]
497: where $x$ = $(x_1, \ldots , x_m)$, $y$ = $(y_1, \ldots , y_m)$.
Here, ${\bf 1}(x_j \not= y_j)$ = 1 if $x_j$ $\not=$ $y_j$, and it is 0
499: otherwise.
500:
501: \begin{thm}
502: \label{Ta}
503: {\rm (Talagrand 1995)} \
504: For every $A \subset \Omega^m$ with $\mu^m(A) > 0$, we have
505: $$\int_{\Omega^m} \exp \left(\frac{1}{4} d_T (x, A)^2 \right) d \mu^m(x)
506: \ \leq \ \frac{1}{\mu^m(A)},$$
507: and consequently, we have for all $s > 0$,
508: $${\sf P} \left \{ d_T(x, A) \geq s \right \} \ \leq \
509: \frac{e^{-s^2/4}}{\mu^m(A)}.$$
510: \end{thm}
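
\noindent The second inequality follows from the first by Markov's
inequality; we spell out this standard step for the reader's convenience.
Here ${\sf P}$ stands for the product measure $\mu^m$. For all $s > 0$,
\begin{eqnarray*}
{\sf P} \left\{ d_T(x, A) \geq s \right\}
 & = & {\sf P} \left\{ \exp \left( \frac{1}{4} d_T(x, A)^2 \right)
       \geq e^{s^2/4} \right\} \\
 & \leq & e^{-s^2/4} \int_{\Omega^m}
       \exp \left( \frac{1}{4} d_T(x, A)^2 \right) d \mu^m(x) \\
 & \leq & \frac{e^{-s^2/4}}{\mu^m(A)} .
\end{eqnarray*}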
511:
512: \medskip
513:
514: \noindent
515: To apply Talagrand's theorem to our situation we define a set
516: $A \subseteq U^m$ by
517: \[ A \ = \ \left\{y \in U^m \ : \ \left|
518: \sum_{i=1}^n \frac{k_i(y)(k_i(y) - 1)}{m(m - 1)} \cdot
519: \frac{1}{\|p\|^2} - 1 \right|
520: \leq 3\epsilon
521: \right\}. \]
522:
523: \begin{lem}
524: \label{SubsetA} \
525: For all $n > 24$ we have \ ${\sf P}(A) \geq \frac{9}{10}.$
526: \end{lem}
527: {\bf Proof.} \ Recall that $m = \epsilon^{-2}n^{1 + \delta}$ with
528: $\frac{1}{3} > \epsilon > 0$, \ $\delta > 0$.
529: Letting $\beta = \frac{- 2 \log \epsilon}{\log n}$ and
530: $\lambda = 1/2 + \delta$, we rewrite $m$ as $n^{1/2 + \beta + \lambda}$.
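Then $\frac{3}{n^{\beta/2}} = 3\epsilon$, and for $n > 24$ and
$\lambda > \frac{1}{2}$ we have
$\frac{4}{9n^{\lambda}} < \frac{4}{9\sqrt{24}} < \frac{1}{10}$,
so the lemma follows from the theorem of Goldreich and Ron.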
532: \ \ \ $\Box$
533:
534: \bigskip
535:
536: \noindent For every $s > 0$ we define a set $C_s \subseteq U^m$ by
537: $$C_s = \{ x \in U^m : d_T(x, A) < s \}.$$
538: By Theorem \ref{Ta} and Lemma \ref{SubsetA} we have for all $n > 24$
539: and all $s > 0$
540: \begin{equation}
541: \label{SubsetC}
542: {\sf P}(C_s) \geq 1 - \frac{10}{9}e^{-s^2/4}.
543: \end{equation}
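\noindent Indeed, the complement of $C_s$ in $U^m$ is
$\{x \in U^m : d_T(x, A) \geq s\}$, so (spelling out the step)
$${\sf P}(C_s) \ = \ 1 - {\sf P}\{ d_T(x, A) \geq s \}
\ \geq \ 1 - \frac{e^{-s^2/4}}{{\sf P}(A)}
\ \geq \ 1 - \frac{10}{9} \, e^{-s^2/4},$$
where the last step uses ${\sf P}(A) \geq \frac{9}{10}$.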
544: \begin{lem}
545: \label{ExpandingA} \
546: For every $x = (x_1, \ldots, x_m) \in C_s$ there is
547: $y = (y_1, \ldots, y_m) \in A$ such that
548:
549: $$\sum_{j = 1}^m {\bf 1}(x_j \not= y_j) \ \leq \ sm^{1/2}.$$
550: \end{lem}
551: {\bf Proof.} \ Assume, by contradiction, that there is $x \in C_s$ such that
552: for all $y \in A$, \ \
553:
554: \smallskip
555:
556: $\sum_{j = 1}^m {\bf 1}(x_j \not= y_j) > sm^{1/2}.$
557:
558: \smallskip
559:
560: \noindent Now, if we take $\alpha = (\alpha_1, \ldots , \alpha_m)$
561: $ = (m^{-1/2}, \ldots , m^{-1/2})$ in the definition of the Talagrand
562: distance $d_T$,
the inequality above implies \ $d_T(x, A) \geq s$. But since $x \in C_s$,
we also have $d_T(x, A) < s$, a contradiction. \ \ \ $\Box$
565:
566: \bigskip
567:
568: Recall that for any $x = (x_1, ..., x_m)$, $y = (y_1, ..., y_m)$ $\in U^m$,
569: we defined $k_i(x)$ (resp. $k_i(y)$) to be the number of the keys (with
570: multiplicity) that are hashed into the slot $i$ for input sequence $x$, resp.
571: $y$. We define integers $s_i$ ($1 \leq i \leq n$) by
572: $$k_i(x) = k_i(y) + s_i.$$
573: \begin{lem}
574: \label{ExpandingA0} \
575: For all $x, y \in U^m$, \
576: $$\sum_{i = 1}^n |s_i| \ \leq \ 2 \sum_{j = 1}^m {\bf 1}(x_j \not= y_j).$$
577: \end{lem}
578: {\bf Proof.} \ We prove the lemma by induction on \
$\sum_{j = 1}^m {\bf 1}(x_j \not= y_j)$.
580:
581: \smallskip
582:
583: \noindent {\bf (0)} \ \ $\sum_{j = 1}^m{\bf 1}(x_j \not= y_j) = 0$:
584:
585: \smallskip
586:
587: \noindent Then we have $x_j = y_j$ for all $j = 1, \ldots, m$, and hence,
588: $k_i(x) = k_i(y)$ for all $i = 1, \ldots, n$. Thus, we have
589: $\sum_{i = 1}^n |s_i| = 0$, finishing the base case.
590:
591: \smallskip
592:
593: \noindent {\bf (Inductive step)} \ \ Assume \
594: $\sum_{j = 1}^m{\bf 1}(x_j \not= y_j) > 0$:
595:
596: \smallskip
597:
598: \noindent Without loss of generality we assume that $x_m \not= y_m$.
599: Now, consider $\bar x = (x_1, \ldots , x_{m - 1}, y_m)$. We write
600: $k_i(\bar x) = k_i(y) + \bar {s_i}$ for $i = 1, \ldots, n$.
601: By the induction hypothesis we have
602: \begin{equation}
603: \label{ExpandingA0_1}
\sum_{i=1}^n |\bar s_i| \ \leq \ 2 \sum_{j=1}^m {\bf 1}(\bar x_j \not= y_j).
605: \end{equation}
Since $x$ differs from $\bar x$ only in its last component, there are two
cases. If $h(x_m) = h(y_m)$, then
$\bar s_i = s_i$ for all $i = 1, \ldots , n$. If
$h(x_m) \neq h(y_m)$, let $i_1 = h(x_m)$ and $i_2 = h(y_m)$; then
\ $\bar s_{i_1} = s_{i_1} - 1$, \ $\bar s_{i_2} = s_{i_2} + 1$, and
$\bar s_i = s_i$ for all $i \in \{1, \ldots , n \} \setminus \{i_1, i_2\}$.
612: In both cases,
613: \begin{equation}
614: \label{ExpandingA0_2}
615: \left| \sum_{i = 1}^n |\bar {s_i}| -
616: \sum_{i = 1}^n |s_i| \right| \leq 2.
617: \end{equation}
618: On the other hand,
619: $$\sum_{j = 1}^m {\bf 1}(x_j \not= y_j) \ = \
620: \sum_{j = 1}^m {\bf 1}(\bar x_j \not= y_j) + 1.$$
621: Combining this, (\ref{ExpandingA0_1}), and (\ref{ExpandingA0_2}), completes
622: the proof for the inductive step. \ \ \ $\Box$
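
\smallskip

\noindent A small example (with hypothetical values, for illustration only)
shows that the factor 2 in Lemma \ref{ExpandingA0} cannot be improved in
general. Let $n = 3$, $m = 3$, and suppose that the hash values of $x$ are
$(1, 1, 2)$ and those of $y$ are $(3, 3, 2)$, with $x_3 = y_3$. Then
$\sum_{j=1}^m {\bf 1}(x_j \not= y_j) = 2$, while
$$(k_1(x), k_2(x), k_3(x)) = (2, 1, 0), \ \ \ \
(k_1(y), k_2(y), k_3(y)) = (0, 1, 2), \ \ \ \
(s_1, s_2, s_3) = (2, 0, -2),$$
so that \ $\sum_{i=1}^n |s_i| = 4 = 2 \cdot 2$.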
623:
624: \begin{lem}
625: \label{ExpandingA1} \
626: For every $x \in C_s$ there is $y \in A$ such that for all $n > 24$,
627: $0 < \epsilon < 1/3$, $s > 0$, and $m = \epsilon^{-2}n^{1+\delta}$,
628: we have
629: $$\left|
630: \sum_{i = 1}^{n} \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} -
631: \sum_{i = 1}^{n} \frac{k_i(y)(k_i(y) - 1)}{m(m - 1)}
632: \right| \ \leq \ \epsilon \|p\|^2 \left(
633: \frac{6s}{n^{\delta/2}} + \frac{5s^2\epsilon}{n^{\delta}} \right).$$
634: \end{lem}
635: {\bf Proof.} \ For any fixed $x \in C_s$ we take $y \in A$ according to
636: Lemma \ref{ExpandingA}. That is,
637: \begin{equation}
638: \label{ExpandingA1_1}
639: \sum_{j = 1}^m {\bf 1}(x_j \not= y_j) \ \leq \ s m^{1/2}.
640: \end{equation}
641: As in the proof for Lemma \ref{ExpandingA0} we use the notation
642: $k_i(x)$, $k_i(y)$, and $s_i$ \ ($i = 1, \ldots , n$).
643: We will leave the common denominator $m(m-1)$ out of the computations
644: until the end:
645:
646: \medskip
647:
648: $| \sum_{i=1}^n k_i(x)(k_i(x) - 1) \ - \ \sum_{i=1}^n k_i(y)(k_i(y)-1) |$
649:
650: \medskip
651:
652: $= |\sum_{i=1}^n (k_i(y) +s_i)(k_i(y)+s_i-1) \ - \ $
653: $\sum_{i=1}^n k_i(y)(k_i(y)-1) |$
654:
655: \medskip
656:
657: $= |\sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ $
658: $[(k_i(y) +s_i)(k_i(y)+s_i-1) - k_i(y)(k_i(y)-1)] $
659:
660: \smallskip
661:
662: \hspace{3in} $ \ + \ \sum_{1\leq i \leq n, \, k_i(y) = 0} \ s_i(s_i -1) |$
663:
664: \medskip
665:
666: $\leq \ \sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ 2\, |s_i|(k_i(y)-1)$
667: $ \ + \ \sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ (s_i^2 + |s_i|)$
668: $ \ + \ | \sum_{1\leq i \leq n, \, k_i(y) =0} \ s_i(s_i -1)|$
669:
670: \medskip
671:
672: $\leq \ \sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ 2\, |s_i|(k_i(y)-1)$
673: $ \ + \ \sum_{i=1}^n (s_i^2 + |s_i|).$
674:
675: \medskip
676:
677: \noindent By the Cauchy-Schwarz inequality, this is bounded by
678:
679: \smallskip
680:
681: $\leq \ 2 \ (\sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ s_i^2)^{1/2}$
682: $(\sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ (k_i(y)-1)^2)^{1/2}$
683: $ \ + \ \sum_{i=1}^n (s_i^2 + |s_i|)$
684:
685: \medskip
686:
687: $\leq \ 2 \ (\sum_{i=1}^n s_i^2)^{1/2}$
$(\sum_{i=1}^n k_i(y)(k_i(y)-1))^{1/2}$
689: $ \ + \ \sum_{i=1}^n (s_i^2 + |s_i|).$
690:
691: \medskip
692:
693: \noindent By Lemma \ref{ExpandingA0} and (\ref{ExpandingA1_1}) we have
694: \begin{equation}
695: \label{ExpandingA1_4}
696: \sum_{i=1}^n s_i^2 \ \leq \ \left(\sum_{i=1}^n |s_i| \right)^2 \ \leq \
697: \left(2 \sum_{j=1}^m {\bf 1}(x_j \not= y_j) \right)^2 \ \leq \ 4 s^2 m.
698: \end{equation}
699: Since $y \in A$ we have
700: $$\sum_{i=1}^n \frac{k_i(y)(k_i(y) - 1)}{m (m - 1)} \leq
701: \|p\|^2 \left(1 + 3\epsilon \right).$$
702: Hence, by all the above:
703: $$\left| \sum_{i=1}^n \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} -
704: \sum_{i=1}^n \frac{k_i(y)(k_i(y) - 1)}{m(m - 1)} \right|$$
705: $$\leq \ \frac{4s}{(m - 1)^{1/2}} \cdot \|p\|
706: \left(1+ 3\epsilon \right)^{1/2} + \frac{4s^2}{m-1} + \frac{2s}{m^{1/2}(m-1)}.$$
707:
708: \noindent By calculating, and using the fact that $\|p\|^2 \geq \frac{1}{n}$,
709: $0 < \epsilon < 1/3$, and $m = \epsilon^{-2}n^{1 + \delta}$, we find the
710: following upper bound for \
711: $\left|
712: \sum_{i = 1}^{n} \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} -
713: \sum_{i = 1}^{n} \frac{k_i(y)(k_i(y) - 1)}{m(m - 1)}
714: \right| \ $:
715: \[
716: \frac{s\epsilon}{n^{\delta/2}} \ \|p\|^2 \
717: \frac{4(1+3\epsilon)^{1/2}n^{1/2}}{(n - \epsilon^2 n^{-\delta})^{1/2} }
718: \ + \
719: \frac{s\epsilon}{n^{\delta/2}} \ \|p\|^2 \
720: \frac{2 \epsilon^2n}{n^{1/2} (n^{1+\delta} - \epsilon^2)}
721: \ + \
\frac{s^2\epsilon^2}{n^{\delta}} \ \|p\|^2 \
\frac{4n}{n - \epsilon^2 n^{-\delta}}
724: \]
725: \noindent Combining this and using $n > 24$ we obtain the upper bound \
726: $$\epsilon \|p\|^2 (\frac{6s}{n^{\delta/2}} +
727: \frac{5s^2 \epsilon}{n^{\delta}}).$$
728: $\Box$
729:
730: \bigskip
731:
732: \noindent
733: {\bf Proof of Theorem \ref{OurResult1}.} \
734: The theorem follows from the definition of $A$, inequality (\ref{SubsetC}),
735: and Lemma \ref{ExpandingA1}. \ \ \ $\Box$
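
\smallskip

\noindent In more detail, the three ingredients combine as follows (a sketch;
the shorthand $E(x)$ is introduced only for this remark). Write
$E(x) = \sum_{i=1}^n \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)}$.
For $x \in C_s$, let $y \in A$ be as in Lemma \ref{ExpandingA1}.
By the triangle inequality,
$$\left| \frac{E(x)}{\|p\|^2} - 1 \right| \ \leq \
\frac{|E(x) - E(y)|}{\|p\|^2} + \left| \frac{E(y)}{\|p\|^2} - 1 \right|
\ \leq \ \epsilon \left( \frac{6s}{n^{\delta/2}} +
\frac{5s^2\epsilon}{n^{\delta}} \right) + 3 \epsilon ,$$
and by (\ref{SubsetC}) the event $x \in C_s$ has probability at least
$1 - \frac{10}{9} \, e^{-s^2/4}$.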
736:
737:
738: %%%%%%%%%%%%%%%%%%%%%%%%%
739:
740: \subsection{Average search time for a particular user}
741:
742: \noindent
743: {\bf Proof of Corollary \ref{OurResult2}.} \
744: Recall that the average search time AST$(v,x)$ is bounded from above by \
745: $\sum_{i = 1}^n v_i \cdot k_i(x)$.
746: In Theorem \ref{OurResult1} let us write $m = L_1 L_2 \, n$, and choose
$$\epsilon = \frac{1}{\sqrt{L_1}}~~{\rm and}~~\delta = \frac{\log L_2}{\log n}.$$
748: Note that for all $i$, \
749:
750: \smallskip
751:
752: $k_i(x) - 1 \leq \sqrt{k_i(x)(k_i(x) - 1)}$ \
753:
754: \smallskip
755:
\noindent since the left side is at most 0 when $k_i(x) = 0$ or 1, and since
$k - 1 \leq \sqrt{k(k-1)}$ for every integer $k \geq 2$. Therefore,
757:
758: \smallskip
759:
${\rm AST}(v,x) \leq \ $
$\sum_{i=1}^n v_i \cdot k_i(x) \ = \ \sum_{i=1}^n v_i (k_i(x) - 1) + 1 \ $
$\leq \ \sum_{i=1}^n v_i \sqrt{k_i(x)(k_i(x) - 1)} + 1$

\medskip

$\leq \ \sqrt{\sum_{i=1}^n v_i^2} \sqrt{\sum_{i=1}^n k_i(x)(k_i(x) - 1)} + 1$
$ \ = \ \|v\| \, \|p\| \, \sqrt{m(m-1)}$
$\sqrt{\sum_{i=1}^n \frac{k_i(x)(k_i(x)-1)}{m(m-1)} \, \frac{1}{\|p\|^2}} + 1$.

\medskip

\noindent Here the equality in the first line uses $\sum_{i=1}^n v_i = 1$,
and the first inequality in the second line is the Cauchy-Schwarz inequality.
Since $\sqrt{m(m-1)} < m = L n$,
the corollary follows from this and Theorem \ref{OurResult1}.
773: \ \ \ $\Box$
774:
775: \bigskip
776:
777: \noindent
778: {\bf Remark.} \ Our proof method depends crucially on Talagrand's theorem.
779: Many readers, more familiar with techniques like the Chernoff bound, or more
780: generally, the Hoeffding inequality for martingale differences (from which
781: the Chernoff bound follows directly), may wonder whether these simpler
782: techniques don't work here. In order to apply Hoeffding's inequality we
783: could view $\sum_{i=1}^n v_i \cdot k_i(x)$ as a weighted sum of the random
784: variables $k_i(x)$; to apply Hoeffding one needs to bound $|k_i(x)|$, but
785: we don't have good bounds a priori; finding good bounds on $|k_i(x)|$ seems
786: harder and less promising than our method, based on Talagrand's theorem.
787: See, e.g., Michael Steele's book \cite{St}, which discusses the advantages
788: of applying Talagrand's theorem at length.
789:
790:
791:
792:
793:
794:
795: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
796: % Bibliography
797:
798: \begin{thebibliography}{60}
799:
800: \bibitem{CLRS}
801: T.H.\ Cormen, C.E.\ Leiserson, R.L.\ Rivest, C.\ Stein,
802: {\it Introduction to Algorithms}, 2nd ed., McGraw-Hill, 2001.
803:
804: \bibitem{GR}
805: O.\ Goldreich and D.\ Ron, ``On testing expansion in bounded-degree graphs'',
806: Technical Report TR00-020, ECCC, 2000.
807:
808: \bibitem{Kn}
809: D.E.\ Knuth, {\it Sorting and Searching}, 2nd ed., Addison-Wesley, 1998.
810:
811: \bibitem{SKS}
812: A.\ Silberschatz, H.F.\ Korth, S.\ Sudarshan,
813: {\it Database System Concepts}, 4th ed., McGraw-Hill, 2002.
814:
815: \bibitem{St}
J.M.\ Steele, {\it Probability Theory and Combinatorial Optimization},
817: SIAM, 1997.
818:
819: \bibitem{Ta}
820: M.\ Talagrand, ``Concentration of measure and isoperimetric inequalities
821: in product spaces'',
822: {\it Institut des Hautes \'Etudes Scientifiques, Publications
Math\'ematiques,} 81 (1995) 73--205.
824:
825: \end{thebibliography}
826:
827:
828: \end{document}
829: