1: \documentclass[12pt]{article}
2: \usepackage{latexsym}
3: \usepackage{amsfonts}
4: % This is for including postscript figures.
5: \newcommand{\zed}{\mbox{$\Bbb Z$}}
6: \setlength{\oddsidemargin}{0in}
7: \setlength{\topmargin}{0in}
8: \setlength{\headheight}{0in}
9: \setlength{\textheight}{8.3in}
10: \setlength{\textwidth}{6.7in}
11: \setlength{\topskip}{0in}
12: \setcounter{tocdepth}{4}
13: \newtheorem{thm}{Theorem}[section]
14: \newtheorem{cor}[thm]{Corollary}
15: \newtheorem{lem}[thm]{Lemma}
16: \newtheorem{pro}[thm]{Proposition}
17: \newtheorem{defn}[thm]{Definition}
18: \newtheorem{fact}[thm]{Fact}
19: \newcommand{\ds} {\displaystyle}
20: \renewcommand{\arraystretch}{.6}
21: \bibliographystyle{abbrv}
22: 
23: 
24: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
25: %
26: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
27: 
28: \title{Probabilistic behavior of hash tables}
29: \author{Dawei Hong\footnote{D.~Hong and J.C.~Birget,
30:                        Dept.\ of Computer Science,
31:                        Rutgers University at Camden, Camden, NJ 08102,
32:                        USA, \{dhong, birget\}@camden.rutgers.edu}, \ \   
33:         Jean-Camille Birget\footnote{Supported in part by 
34:                               NSF grant DMS-9970471}, \ \ 
35:         Shushuang Man \footnote{ Dept.\ of Mathematics and Computer Science,
36:                    Marshall, MN 56258, USA, mans@southwest.msus.edu} 
37:        }
38: \date{}
39: \begin{document}
40: \maketitle
41: 
42: \begin{abstract}
We extend a result of Goldreich and Ron about estimating the collision
probability of a hash function. Their estimate has a polynomial tail.
We prove that when the load factor is greater than a certain constant,
the estimator has a Gaussian tail. As an application we find an estimate of 
an upper bound for the average search time in hashing with chaining, 
for a particular user (we allow the overall key distribution to be different 
from the key distribution of a particular user). This estimator also has a 
Gaussian tail. 
51: \end{abstract}
52: 
53: %%%%%%%%%%%%%%%%%%%%%%%%%
54: % Section 1
55: 
56: \section{Introduction}
57: 
Hash tables have many applications in computer science \cite{CLRS,Kn}.
We especially mention databases, where hash tables are used for storing 
values of an attribute; see chapter 12 of \cite{SKS}.
Following the notation of \cite{CLRS}, a hash function is a function
$h: U \to T$, where both the domain $U$ and the range $T$
are finite. Traditionally, $U$ is called the {\it key space} or the ``universe'', 
and elements $x \in U$ are called {\it keys}. The set $T$ is called 
the {\it table}, and its elements are called the {\it table slots}. When
$h(x) = i$ we say that $h$ {\it hashes} the key $x$ into the slot $i$.
We denote by $n$ the cardinality of $T$, and we simply assume that 
$T = \{1, \ldots, n\}$.
We assume that $U$ is (very much) larger than $T$. 
70: 
71: \smallskip
72: 
73: We assume that a probability measure $q$ has been defined on $U$.
74: The probability of $S \ (\subset U)$ is denoted by ${\sf P}(S)$ 
75: $( \ = \sum_{x \in S} q(x))$. We also put the product measure on 
76: $U \times U$ and on $U^m$ (for any positive integer $m$); using 
77: the product measure amounts to saying that in a sequence of $m$ keys,
78: all the keys are {\it independent}.   
79: 
80: The probability on $U$ induces a probability measure on $T$:
81: The {\it probability that some key hashes to slot} $i \ (\in T)$ is \  
82: $p_i =  \sum_{x \in h^{-1}(i)} q(x)$ $= {\sf P}(h^{-1}(i))$. 
83: 
84: If two keys $x_1, x_2 \in U$ have the same hash value, these keys are said 
85: to {\it collide}. The {\it collision probability} of the hash function $h$ 
86: is defined to be \ ${\sf P}\{(x_1, x_2) \in U \times U : h(x_1) = h(x_2) \}$ 
87: (in short-hand this is denoted by ${\sf P}(h(x_1) = h(x_2))$). 
88: Here we use the product measure (i.e., keys are ``chosen independently'').
89: A {\it true collision} corresponds to keys $x_1, x_2 \in U$ such that 
90: $x_1 \neq x_2$ and $h(x_1) = h(x_2)$. 
91: 
Throughout this paper, $\|.\|$ denotes the Euclidean norm. It is straightforward 
to prove the following.
94: \begin{pro}
95: The collision probability of $h$ is equal to \  
96: $\sum_{i = 1}^n p_i^2 \ \ ( = \|p\|^2)$. 
97: 
98: \medskip
99: 
100: \noindent
101: Moreover, we always have \ $\sum_{i=1}^n p_i^2 \geq \frac{1}{n}$, 
102: and equality holds iff \  $p_i = \frac{1}{n}$ for all $i \in T$.
103: \end{pro}
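
\smallskip

\noindent A short sketch of why this holds (our wording; the independence 
of $x_1$ and $x_2$ is the product-measure assumption made above): 
$${\sf P}(h(x_1) = h(x_2)) \ = \ \sum_{i=1}^n {\sf P}(h(x_1) = i) \, 
{\sf P}(h(x_2) = i) \ = \ \sum_{i=1}^n p_i^2 ,$$
and, by the Cauchy-Schwarz inequality, 
$$1 \ = \ \Big( \sum_{i=1}^n p_i \cdot 1 \Big)^2 \ \leq \ 
n \, \sum_{i=1}^n p_i^2 ,$$
with equality iff all the $p_i$ are equal, i.e., $p_i = \frac{1}{n}$ for 
all $i \in T$.

\smallskip
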
Similarly, the probability that two independently chosen keys are equal is 
 \ $\sum_{x \in U} q(x)^2$. Hence, the probability of true collisions 
for $h$ is \ $\sum_{i = 1}^n p_i^2 \ - \ \sum_{x \in U} q(x)^2.$

Note that \ $\sum_{x \in U} q(x)^2$ \ will usually be very small
109: assuming that $U$ is very large (compared to $n$ and compared to the length
110: $m$ of key sequences used), and assuming that the probability distribution
111: $q$ on $U$ is not very concentrated. 
112: Therefore, the difference between the collision probability $\|p\|^2$ and
113: the probability of true collisions is usually quite small.
114: 
115: \medskip
116: 
117: In this paper we assume that collisions
118: are resolved by some form of {\it chaining}; i.e., all the keys that are 
119: hashed into one slot are stored in that slot.
For a hash table with chaining, we assume that the search time
(for both successful and unsuccessful searches) in a slot $i$ is proportional
to the number of keys stored in that slot; for simplicity, we identify 
search time in a slot with chain length in that slot.
124: 
125: \bigskip
126: 
127: \noindent {\bf Notation} ``$k_i(x)$'':  \     
128: Let $x = (x_1, \ldots, x_m)$ be a sequence of $m$ keys that are inserted into 
129: our hash table, and let $i$ be a slot ($i = 1, \ldots, n$).
130: We let $k_i(x)$ denote the number of keys (counted with multiplicities) 
131: inserted into slot $i$. (``With multiplicities'' means that if a key 
132: occurs several times in $x$ it is counted as many times as it occurs.)
133: 
134: Since in $k_i(x)$ we count keys with multiplicities, $k_i(x)$ is an upper bound
135: on the number of different keys stored in slot $i$.
136: 
137: \begin{pro}
138: For a sequence of keys $x = (x_1, \ldots, x_m)$ that are inserted, the 
139: number of collisions between keys in $x$ is  
140: $$\sum_{i = 1}^n \frac{k_i(x)(k_i(x) - 1)}{2}.$$
141: \end{pro}
142: The proof is straightforward. Recall that we count pairs of equal keys in 
143: the sequence $x$ as collisions. Since there are $\frac{m(m - 1)}{2}$ 
144: unordered pairs of key insertions in $x$, we call \    
145: $$\sum_{i = 1}^n \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)}$$
146: the  {\em empirical collision probability} of $x$.
147: This concept, and its relation with the collision probability $\|p\|^2$,
148: were first studied by Goldreich and Ron \cite{GR}.  
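
\smallskip

\noindent As a small illustration (the numbers are hypothetical and chosen 
by us), take $n = 3$ slots and $m = 4$ inserted keys with 
$(k_1(x), k_2(x), k_3(x)) = (2, 2, 0)$. The number of collisions is 
$\frac{2 \cdot 1}{2} + \frac{2 \cdot 1}{2} + 0 = 2$, out of 
$\frac{4 \cdot 3}{2} = 6$ unordered pairs, so the empirical collision 
probability is 
$$\frac{2 \cdot 1 + 2 \cdot 1 + 0}{4 \cdot 3} \ = \ \frac{1}{3}.$$
We also note the standard fact that, since each $k_i(x)$ is binomially 
distributed with parameters $m$ and $p_i$, the expected value of 
$k_i(x)(k_i(x) - 1)$ is $m(m-1) \, p_i^2$; hence the expected value of the 
empirical collision probability is exactly $\|p\|^2$.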
149: 
150: \bigskip
151: 
152: In this paper we obtain two results, in the form of deviation bounds.  
153: (1) We give an estimation of the collision probability. 
154: (2) We give a deviation bound for an upper bound on the average search time.
155: 
156: In the second result we assume that the load factor is $> 9$
157: (see later for the exact assumptions).
Applications in databases often lead to hash tables with
large load factor (\cite{SKS}, Chapter 12). 
160: We allow arbitrary key distributions.
161: 
162: \bigskip
163: 
164: \noindent {\bf Estimation of the collision probability}
165: 
166: \medskip
167: 
168: \noindent 
169: Our first result extends a result of Goldreich and Ron \cite{GR},
170: namely that \  $\sum_{i = 1}^n \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)}$
171:  \ is a very good estimator for the collision probability $\|p\|^2$. 
172: How good the estimator is can be measured  by the relative error \    
173: $|\sum_{i=1}^n \frac{k_i(x)(k_i(x)-1)}{m(m-1)} \cdot \frac{1}{\|p\|^2}$
174: $ \ - \ 1|$. Their result, as well as ours, gives a deviation bound for 
this relative error. Goldreich and Ron \cite{GR} proved a polynomial deviation
176: bound for the estimator \ $\sum_{i=1}^n \frac{k_i(x)(k_i(x)-1)}{m(m-1)}$.
177: Their goal was to find sublinear-time algorithms for testing expansion
178: properties of bounded-degree graphs.
179: 
180: \begin{thm} {\rm (Goldreich and Ron \cite{GR}).} \
181: For all \ $\beta > 0$, $\lambda \geq 0$, if
182: $m = n^{1/2 + \beta + \lambda}$ then
183: $${\sf P} \left\{ \left| \sum_{i = 1}^{n}
184: \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)}  \cdot \frac{1}{\|p\|^2} - 1 \right|
185: \leq \frac{3}{n^{\beta/2}} \right\}
186:  \ \geq \  1 - \frac{4}{9n^{\lambda}}.$$
187: \end{thm}
188: We extend the theorem of Goldreich and Ron as follows:
189: 
190: \smallskip
191: 
192: \begin{thm}
193: \label{OurResult1} \ 
194: For all \ $n > 24$, \ $\frac{1}{3} > \epsilon > 0$, \ $\delta > 0$, \ 
195: $s > 0$, if \ $m = \epsilon^{-2}n^{1 + \delta}$ \  we have 
196: $${\sf P} \left\{ \left| \sum_{i = 1}^n
197: \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} \cdot \frac{1}{\|p\|^2} - 1 \right| 
198: \leq \epsilon \left( 3 + \frac{6s}{n^{\delta/2}} +
199: \frac{5s^2\epsilon}{n^\delta} \right) \right\} 
200:  \ \geq \ 1 - \frac{10}{9} \, e^{-s^2 /4}.$$
201: \end{thm} 
202: 
203: \medskip
204: 
205: \noindent By taking $s = 2 \, n^{\delta/2}$, the expression \
206: $3 + \frac{6s}{n^{\delta/2}} + \frac{5s^2\epsilon}{n^\delta} $ \   
207: becomes \ $3 + 12 + 20 \, \epsilon$ $(< 22)$; here we use 
208: $\epsilon < \frac{1}{3}$. Therefore,
209:  
210: \begin{cor} \label{OurResult1_cor1} \  
211: For all \ $n > 24$, \ $\frac{1}{3} > \epsilon > 0$, \ $\delta > 0$, \
212: if \ $m = \epsilon^{-2}n^{1 + \delta}$ \  we have
213: 
214: \medskip
215: 
216:  \ \ \ \ \   \ \ \ \ \  
217: ${\sf P} \left\{ \left| \sum_{i = 1}^n
218: \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} \cdot \frac{1}{\|p\|^2} - 1 \right|
219: \leq 22 \, \epsilon \right\}
220:  \ \geq \ 1 - \frac{10}{9} \, e^{-n^{\delta}}.$
221: \end{cor}
222: 
223: \smallskip
224: 
225: \noindent
226: Writing $\delta = \frac{\log C}{\log n}$, for $C > 1$, we obtain 
227: $n^{\delta} = C$, and $m = \epsilon^{-2}Cn$, i.e., the load factor is 
228: $L = C \, \epsilon^{-2}$. Therefore, 
229: \begin{cor} \label{OurResult1_cor2} \
230: For all \ $n > 24$, \ $\frac{1}{3} > \epsilon > 0$, \ and all $m$ such that 
231: $L = \frac{m}{n} > \epsilon^{-2} \ ( > 9)$ \  we have
232: 
233: $${\sf P} \left\{ \left| \sum_{i = 1}^n
234: \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} \cdot \frac{1}{\|p\|^2} - 1 \right|
235: \leq 22 \, \epsilon \right\}
236:  \ \geq \ 1 - \frac{10}{9} \, e^{- L \epsilon^2}.$$ 
237: \end{cor}
238: 
239: \smallskip
240: 
241: \noindent
242: Note that the assumptions of this Corollary impose the following relation
243: between $L$ and $\epsilon$: \ $\frac{1}{3} > \epsilon > \frac{1}{\sqrt{L}}$;
244: equivalently, $L = \frac{m}{n} > \epsilon^{-2} \ ( > 9)$.
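
\smallskip

\noindent For orientation, a sample instance (values chosen by us for 
illustration): with $\epsilon = 0.01$ and $L = 10^5$ (so that 
$L > \epsilon^{-2} = 10^4$), Corollary \ref{OurResult1_cor2} bounds the 
relative error by $22 \, \epsilon = 0.22$ with probability at least 
$$1 - \frac{10}{9} \, e^{-L \epsilon^2} \ = \ 1 - \frac{10}{9} \, e^{-10} 
 \ \approx \ 1 - 5.0 \cdot 10^{-5}.$$
The bound $22 \, \epsilon$ is below 1 only when $\epsilon < \frac{1}{22}$.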
245: 
246: \medskip
247: 
248: To compare with the result of Goldreich and Ron, let us pick 
249: $\epsilon = n^{- \beta / 2}$ in Corollary \ref{OurResult1_cor1}. Then 
250: $n^{1/2 + \beta + \lambda} = m = \epsilon^{-2}n^{1 + \delta}$ implies
251: $\delta = \lambda - \frac{1}{2}$. Hence our Corollary becomes:
252: 
253: \begin{cor} \label{OurResult1_cor3} \  
For all \ $n > 24$, \ $\beta > \frac{2 \log 3}{\log n}$ 
(so that $\epsilon = n^{-\beta/2} < \frac{1}{3}$), \  
$\lambda > \frac{1}{2}$, 
if \ $m = n^{1/2 + \beta + \lambda}$ \  we have
$${\sf P} \left\{ \left| \sum_{i = 1}^n
\frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} \cdot \frac{1}{\|p\|^2} - 1 \right|
\leq 22 \, n^{- \beta / 2} \right\}
260:  \ \geq \ 1 - \frac{10}{9}e^{-n^{\lambda - \frac{1}{2}}}.$$
261: \end{cor}
262: 
263: \smallskip
264: 
Comparing Corollary \ref{OurResult1_cor3} with the theorem of Goldreich and Ron: 
our result gives a much better deviation bound
(it is exponential, as opposed to the polynomial bound of Goldreich and Ron), 
but it applies only when the load factor $L$ is $> 9$ (whereas in the 
result of Goldreich and Ron, the load factor $L = n^{\beta + \lambda - 1/2}$ 
can be arbitrarily small, depending on $n$).
271: 
272: \bigskip
273: 
274: 
275: \noindent {\bf The average search time for a particular user}
276: 
277: \medskip
278: 
279: In order to analyze the efficiency of a hash table one considers the 
280: overall usage statistics of the keys (over all users). 
281: By ``user'' we mean a person or a process.
282: For every user we introduce a vector
283: $v = (v_1, \ldots, v_n)$, where $v_i$ is the frequency of the user's access 
284: (for search) to slot $i$. More precisely, $v_i$ is the number of searches at 
285: slot $i$, divided by the total number of searches in the table, for this user.
286:  Then $0 \leq v_i \leq 1$ and $\sum_{i=1}^n v_i = 1$.
287: We shall call $v$ the user's {\it access pattern}.
Traditional analysis of the average search time assumes that the access 
pattern of a user is the same as the key distribution (see e.g., \cite{CLRS}).
290: 
291: \smallskip
292: 
293: We let ${\rm AST}(v, x)$ denote the average search time for a user with access 
294: pattern $v$, under the condition that a sequence $x$ of $m$ independent keys
295: was previously inserted into the hash table. Clearly, we have the following 
296: upper bound:
297: 
298: \smallskip
299: 
300:   \ \ \ \ \ ${\rm AST}(v, x) \ \leq \ \sum_{i=1}^n v_i \cdot k_i(x)$.
301: 
302: \smallskip
303: 
304: \noindent The difference between ${\rm AST}(v, x)$ and 
305: $\sum_{i=1}^n v_i \cdot k_i(x)$ is caused by the possibility of 
306: pseudo-collisions. Here we are only concerned with upper bounds 
307: on ${\rm AST}(v, x)$, so we can use $\sum_{i=1}^n v_i \cdot k_i(x)$.
308: 
309: \smallskip
310: 
311: We write $m$ as $m = Ln$, where $L$ is called the {\it load factor}.
312: We do not assume that $L$ is a constant. Applying Theorem \ref{OurResult1} 
313: we show 
314: \begin{cor}
315: \label{OurResult2}
316: For all \ $n > 24$, \ $s > 0$, \ $L > 9$,  and $m = L n$ we have 
317: \[
318: {\sf P} \left\{ {\rm AST}(v, x) \leq  \ L \, n \|v\| \, \|p\| \, 
319: \sqrt{1+ \frac{3+6s}{\sqrt{L}} + \frac{5s^2}{L} } + 1 \right\}
320:  \ \geq \ 1 - \frac{10}{9}e^{-s^2 /4}. \]
321: \end{cor} 
322: 
323: \medskip
324: 
325: \noindent Noting that 
326:  \ $\sqrt{1+ \frac{3+6s}{\sqrt{L}} + \frac{5s^2}{L} }$ $<$ 
327: $1 + \frac{ 4s}{\sqrt{L}}$ \ and letting \ $\epsilon = \frac{s}{2\sqrt{L}}$ 
328:  \ we obtain 
329: \begin{cor}
330: \label{OurResult2_cor1} \ For all \ $n > 24$, \ $\epsilon > 0$, \ $L > 9$,  
331: and $m = L n$ we have           
332: \[
333: {\sf P} \left\{ {\rm AST}(v, x) \leq \ L \, n \|v\| \, \|p\| \,          
334:  (1 + 8\epsilon) + 1 \right\}
335:  \ \geq \ 1 - \frac{10}{9}e^{-L \epsilon^2}. \]
336: \end{cor} 
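
\smallskip

\noindent The substitution can be checked directly (a sketch of the routine 
computation, in our wording). With $s = 2 \epsilon \sqrt{L}$ we get 
$$\frac{4s}{\sqrt{L}} \ = \ 8 \epsilon \ \ \ \ {\rm and} \ \ \ \ 
e^{-s^2/4} \ = \ e^{-L \epsilon^2}.$$
For the comparison of the square root with $1 + \frac{4s}{\sqrt{L}}$, 
squaring gives 
$\left(1 + \frac{4s}{\sqrt{L}}\right)^2 = 1 + \frac{8s}{\sqrt{L}} + 
\frac{16 s^2}{L}$, so the inequality is equivalent to 
$$3 \ < \ 2s + \frac{11 s^2}{\sqrt{L}}.$$
This holds, for instance, whenever $s \geq \frac{3}{2}$, i.e., 
$\epsilon \geq \frac{3}{4 \sqrt{L}}$; for very small $s$ the comparison 
can fail, so, as far as we can tell, this step should be read with $s$ 
not too small.

\smallskip
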
337: 
338: \noindent One notices that the probability bound is only interesting when 
339: $L$ is significantly larger than $\epsilon^{-2}$. Also, the error bound
340: is interesting only when $\epsilon$ is less than $1/8$; this means that 
the load factor has to be at least 100 for our results to be interesting.
342: In that sense, the results are theoretical, and show just what type of 
343: behavior to expect, up to big-O.  
344: 
345: In \cite{CLRS} (chapt.~12, exercise 12-3) the expected search time (for 
346: every user) was found to be $\Theta \left(L\right)$, under the assumption 
347: that both the key distribution and the distribution of user's accesses are 
348: uniform. 
349: Our Corollary implies that if $\|p\|^2 = \Theta \left(\frac{1}{n}\right)$ 
350: and $\|v\|^2 = \Theta \left(\frac{1}{n}\right)$
351: (which is much more relaxed than the assumption of a uniform distribution), 
352: then with exponentially high probability, the average search time is $O(L)$ 
353: for a user with access pattern  $v$.
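
\smallskip

\noindent To spell out the computation (the constants $c_1, c_2$ are names 
introduced here for illustration): if $\|p\| \leq \frac{c_1}{\sqrt{n}}$ and 
$\|v\| \leq \frac{c_2}{\sqrt{n}}$, then the bound of 
Corollary \ref{OurResult2_cor1} becomes 
$$L \, n \, \|v\| \, \|p\| \, (1 + 8 \epsilon) + 1 \ \leq \ 
L \, n \cdot \frac{c_2}{\sqrt{n}} \cdot \frac{c_1}{\sqrt{n}} \, 
(1 + 8 \epsilon) + 1 \ = \ c_1 c_2 \, L \, (1 + 8 \epsilon) + 1,$$
which is $O(L)$ for bounded $\epsilon$.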
354: 
355: \bigskip
356: 
357: \noindent {\bf Example 1} 
358: 
359: \smallskip
360: 
Suppose that a hash table, designed for a certain population of users, has 
collision probability \ $\|p\|^2 \leq \frac{c^2}{n}$, i.e., 
$\|p\| \leq \frac{c}{\sqrt{n}}$ (for 
the overall population of users); $c$ is a positive constant.
364: The keys in the hash table are independent random samples.
365: Now consider an individual user who accesses a subset of cardinality
366: $\alpha \, n$ (where $0 < \alpha \leq 1$) of the $n$ slots of the hash table, 
367: with uniform probability $\frac{1}{\alpha n}$, and who does not access the 
368: other $(1 - \alpha) n$ slots of the hash table at all (i.e., those slots 
369: have probability 0 for this user).  Then the question is: What is the average 
370: search time for this user and this table, and what is the deviation bound?
371: 
372: Since the user accesses a fraction $\alpha$ of the slots uniformly, we have
373: $\|v\| = \frac{1}{\sqrt{\alpha n}}$. By Corollary \ref{OurResult2_cor1}, 
374: 
375: \smallskip
376: 
377:   \ \ \ \ \ ${\sf P}\{ {\rm AST}(v, x) \leq \ $
378:                    $\frac{cL}{\sqrt{\alpha}} \, (1 + 8\epsilon) + 1 \}$
379:  $ \ \geq \ 1 - \frac{10}{9}e^{-L \epsilon^2}$.
380: 
381: \smallskip
382: 
383: \noindent
So, with probability close to 1 (namely \ 
$1 - \frac{10}{9}e^{-L \epsilon^2}$), the average search time is at most 
$1 + \frac{cL}{\sqrt{\alpha}}$ plus an error term that is smaller by the 
factor $8\epsilon$ (namely \ $\frac{cL}{\sqrt{\alpha}} \, 8\epsilon$). 
388: 
389: One observes that when the fraction $\alpha$ of the table used by the user
390: becomes smaller, the upper bound on the average search time for this user
391: increases, as does the error bound. This is not surprising; hashing works 
392: best when the keys are spread over the table as evenly as possible. 
393: Interestingly, our probability bound does not depend on $\alpha$. 
394: 
395: Some possible numerical values: For $c = 5$, \ $\alpha = 0.1$, 
396:  \ $\epsilon = 0.05$, $L = 1000$, we get \ 
397: ${\rm AST}(v, x) \leq \ 15811 \pm 6324$, with probability at least
398: $1 - \frac{10}{9}e^{-L \epsilon^2}$ \ $= \ 0.909$.
399: For $c = 5$, \ $\alpha = 0.1$, \ $\epsilon = 0.05$, $L = 10000$, 
400: we get \ ${\rm AST}(v, x) \leq \ (1.58 \pm 0.64) \cdot 10^5$, with 
401: probability at least $1 - 1.54 \cdot 10^{-11}$.
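
\smallskip

\noindent The arithmetic behind the first set of values can be retraced 
as follows (rounded; a check of the computation only): 
$$\frac{cL}{\sqrt{\alpha}} = \frac{5 \cdot 1000}{\sqrt{0.1}} \approx 15811, 
\ \ \ \ 8 \epsilon \cdot \frac{cL}{\sqrt{\alpha}} \approx 0.4 \cdot 15811 
\approx 6324, \ \ \ \ 
\frac{10}{9} \, e^{-L \epsilon^2} = \frac{10}{9} \, e^{-2.5} \approx 0.091.$$
For $L = 10000$ the first two quantities scale by $10$, while 
$L \epsilon^2 = 25$ and 
$\frac{10}{9} \, e^{-25} \approx 1.54 \cdot 10^{-11}$.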
402: 
403: \bigskip
404: 
405: \noindent {\bf Example 2}
406: 
407: \smallskip
408: 
409: Let us consider the situation in which a query consists of two subqueries,
410: $Q_1$ and $Q_2$. This happens very commonly (e.g., in a ``three-tier
411: architecture''); see \cite{SKS}.
412: The two subqueries  can be viewed as two users with 
413: access patterns $v^{(1)}$ and $v^{(2)}$. Assume, for this example, that 
414: each of $Q_1$ and $Q_2$ behaves like the user in Example 1 above.
415: In particular, for $Q_i$ $(i = 1, 2)$ we have \ 
416: $\|v^{(i)}\| = \frac{1}{\sqrt{\alpha_i n}}$,  and 
417: 
418: \smallskip
419: 
420:   \ \ \ ${\sf P}\{ {\rm AST}_i(v^{(i)}, x) \leq \ $
421:                    $\frac{cL}{\sqrt{\alpha_i}} \, (1 + 8\epsilon) + 1 \}$
422:  $ \ \geq \ 1 - \frac{10}{9}e^{-L \epsilon^2}$.
423: 
424: \smallskip
425: 
426: \noindent Hence, for the combined query the average search time is a 
427: weighted sum \ 
428: 
429: \smallskip
430: 
431:   \ \ \ ${\rm AST} = w_1 \cdot {\rm AST}_1 + w_2 \cdot {\rm AST}_2$, 
432:   \ \ \ with $w_1 + w_2 = 1$. 
433: 
434: \smallskip
435: 
436: \noindent Let \ $a_i = \frac{cL}{\sqrt{\alpha_i}} \, (1 + 8\epsilon) + 1$. Then \   
437: 
438: \smallskip
439:    
${\sf P}\{ {\rm AST} \leq w_1a_1 + w_2a_2 \} \ \geq \ $ 
${\sf P}\{ {\rm AST}_1 \leq a_1, \ $
          ${\rm AST}_2 \leq a_2 \} $

\smallskip

$ \geq \ 1 - 2 \, \frac{10}{9}e^{-L \epsilon^2}$,

\smallskip

\noindent and \ $w_1a_1 + w_2a_2 \leq {\rm max}\{a_1,a_2\}$.
447: 
448: \smallskip
449: 
450: \noindent Therefore, the average search time AST$(v^{(1)}, v^{(2)}, x)$
451:  of the combined query satisfies 
452: 
453: \smallskip
454: 
455:  \ \ \ ${\sf P}\{ {\rm AST}(v^{(1)}, v^{(2)}, x) \leq \ $
456:  $\frac{cL}{\sqrt{{\rm min}\{\alpha_1,\alpha_2\}}} \, (1+ 8\epsilon) +1 \}$
457:  $ \ \geq \ 1 - \frac{20}{9}e^{-L \epsilon^2}$.
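
\smallskip

\noindent The probability estimate used here is the union bound; a sketch, 
with the event names $E_1, E_2$ introduced by us. Let 
$E_i = \{ {\rm AST}_i \leq \frac{cL}{\sqrt{\alpha_i}} \, (1 + 8\epsilon) + 1 \}$ 
for $i = 1, 2$. Then 
$${\sf P}(E_1 \cap E_2) \ \geq \ 1 - {\sf P}(E_1^c) - {\sf P}(E_2^c) 
 \ \geq \ 1 - 2 \cdot \frac{10}{9} \, e^{-L \epsilon^2} 
 \ = \ 1 - \frac{20}{9} \, e^{-L \epsilon^2}.$$
On $E_1 \cap E_2$, the weighted sum 
$w_1 \cdot {\rm AST}_1 + w_2 \cdot {\rm AST}_2$ is at most the larger of the 
two individual bounds, which is the one with the smaller $\alpha_i$, i.e., 
$\frac{cL}{\sqrt{{\rm min}\{\alpha_1,\alpha_2\}}} \, (1 + 8\epsilon) + 1$.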
458: 
459: \bigskip
460: 
\noindent Hence, when the load factor is large (compared to $\epsilon^{-2}$) we 
462: obtain a very reliable upper bound on the average search time for the 
463: combined query. The knowledge of this upper bound enables various processes 
464: (that wait for the completion of this query) to be scheduled in a predictable 
465: way. 
466: 
467: The constants in our results are rather large. This is due to the generality
468: of our results. In a precise practical situation, our results could be used
469: for the format of the probabilistic behavior, with constants to be 
470: determined empirically.
471: 
472: \bigskip
473: 
474: The next section contains the proofs of our theorems.
475: %%%%%%%%%%%%%%%%%%%%%%%%%
476: % Section 2
477: 
478: \section{Proofs}
479: 
480:  
481: \subsection{A deviation bound for the empirical
482: collision probability: Proof of Theorem \ref{OurResult1} }
483:  
Our main technique is Talagrand's isoperimetric theory, developed 
in the mid-1990s \cite{Ta}. It has had a profound
impact on the probabilistic theory of combinatorial optimization \cite{St}
(see Sections 6--13 of \cite{Ta} and chapter 6 of \cite{St}).
488:  
489: Let $(\Omega, \mu)$ be a probability space, and let $(\Omega^m, \mu^m)$
490: be the product space. For $x \in \Omega^m$ and $A \subset \Omega^m$, 
491: Talagrand's convex distance $d_T(x, A)$ is defined by
\[     d_T(x, A) = \sup_{\alpha} \left \{z_\alpha  = \inf_{y \in A}
\left\{\sum_{j = 1}^m \alpha_j \ {\bf 1}(x_j \not= y_j) \right\} \ : \ 
 \alpha = (\alpha_1, \ldots, \alpha_m), \ \sum_{j = 1}^m \alpha_j^2 \leq 1  
\right\},
\]
where $x$ = $(x_1, \ldots , x_m)$,  $y$ = $(y_1, \ldots , y_m)$. 
Here, ${\bf 1}(x_j \not= y_j)$ = 1 if $x_j$ $\not=$ $y_j$, and it is 0
otherwise.
500:  
501: \begin{thm}
502: \label{Ta}
503: {\rm (Talagrand 1995)} \ 
504: For every $A \subset \Omega^m$ with $\mu^m(A) > 0$, we have
505: $$\int_{\Omega^m} \exp \left(\frac{1}{4} d_T (x, A)^2 \right) d \mu^m(x)
506:   \  \leq \   \frac{1}{\mu^m(A)},$$
507: and consequently, we have for all $s > 0$, 
508: $${\sf P} \left \{ d_T(x, A) \geq s \right \} \ \leq \  
509: \frac{e^{-s^2/4}}{\mu^m(A)}.$$
510: \end{thm}
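
\smallskip

\noindent The second inequality follows from the first by Markov's 
inequality (a standard step, recorded here for convenience): 
$${\sf P} \left\{ d_T(x, A) \geq s \right\} \ = \ 
{\sf P} \left\{ e^{d_T(x, A)^2/4} \geq e^{s^2/4} \right\} \ \leq \ 
e^{-s^2/4} \int_{\Omega^m} \exp \left( \frac{1}{4} d_T(x, A)^2 \right) 
d \mu^m(x) \ \leq \ \frac{e^{-s^2/4}}{\mu^m(A)}.$$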
511: 
512: \medskip
513: 
514: \noindent 
515: To apply Talagrand's theorem to our situation we define a set 
516: $A \subseteq U^m$ by 
517: \[      A \ = \  \left\{y \in U^m  \ : \  \left|
518: \sum_{i=1}^n \frac{k_i(y)(k_i(y) - 1)}{m(m - 1)} \cdot
519: \frac{1}{\|p\|^2} - 1 \right|
520: \leq 3\epsilon
521: \right\}.        \]
522: 
523: \begin{lem}
524: \label{SubsetA} \ 
525: For all $n > 24$ we have \  ${\sf P}(A) \geq \frac{9}{10}.$
526: \end{lem}
527: {\bf Proof.} \ Recall that $m = \epsilon^{-2}n^{1 + \delta}$  with
528: $\frac{1}{3} > \epsilon > 0$, \ $\delta > 0$. 
529: Letting $\beta = \frac{- 2 \log \epsilon}{\log n}$ and
530: $\lambda = 1/2 + \delta$, we rewrite $m$ as $n^{1/2 + \beta + \lambda}$.
Then the lemma follows from the theorem of Goldreich and Ron.
532:  \ \ \ $\Box$
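
\smallskip

\noindent To see where the hypothesis $n > 24$ enters (our reading of the 
omitted arithmetic): with $\beta = \frac{- 2 \log \epsilon}{\log n}$ we have 
$n^{\beta/2} = \epsilon^{-1}$, so the error term $\frac{3}{n^{\beta/2}}$ 
of Goldreich and Ron equals $3 \epsilon$, as in the definition of $A$. 
With $\lambda = 1/2 + \delta$ and $\delta > 0$, the failure probability 
satisfies 
$$\frac{4}{9 \, n^{\lambda}} \ < \ \frac{4}{9 \sqrt{n}} \ < \ 
\frac{4}{9 \sqrt{24}} \ \approx \ 0.091 \ < \ \frac{1}{10}.$$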
533:  
534: \bigskip
535:  
536: \noindent For every $s > 0$ we define a set $C_s \subseteq U^m$ by 
537: $$C_s = \{ x \in U^m : d_T(x, A) < s \}.$$
538: By Theorem \ref{Ta} and Lemma \ref{SubsetA} we have for all $n > 24$
539: and all $s > 0$
540: \begin{equation}
541: \label{SubsetC}
542: {\sf P}(C_s) \geq 1 - \frac{10}{9}e^{-s^2/4}.
543: \end{equation}
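Indeed, spelling out the step: since $C_s$ is the complement of 
$\{x \in U^m : d_T(x, A) \geq s\}$, 
$${\sf P}(C_s) \ = \ 1 - {\sf P}\{ d_T(x, A) \geq s \} \ \geq \ 
1 - \frac{e^{-s^2/4}}{{\sf P}(A)} \ \geq \ 
1 - \frac{10}{9} \, e^{-s^2/4}.$$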
544: \begin{lem}
545: \label{ExpandingA} \ 
546: For every $x = (x_1, \ldots, x_m) \in C_s$ there is 
547: $y = (y_1, \ldots, y_m) \in A$ such that 
548:  
549:   $$\sum_{j = 1}^m {\bf 1}(x_j \not= y_j) \ \leq \ sm^{1/2}.$$
550: \end{lem}
551: {\bf Proof.} \ Assume, by contradiction, that there is $x \in C_s$ such that 
552: for all $y \in A$, \ \ 
553: 
554: \smallskip
555: 
556: $\sum_{j = 1}^m {\bf 1}(x_j \not= y_j) > sm^{1/2}.$
557: 
558: \smallskip
559: 
560: \noindent Now, if we take $\alpha = (\alpha_1, \ldots , \alpha_m)$
561: $ = (m^{-1/2}, \ldots , m^{-1/2})$ in the definition of the Talagrand 
562: distance $d_T$, 
the inequality above implies  \ $d_T(x, A) \geq s$. But since $x \in C_s$, 
we also have $d_T(x, A) < s$, a contradiction.  \ \ \ $\Box$
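
\smallskip

\noindent In more detail (a sketch of the computation behind this choice 
of $\alpha$): the vector $\alpha = (m^{-1/2}, \ldots , m^{-1/2})$ is 
admissible because $\sum_{j=1}^m \alpha_j^2 = m \cdot \frac{1}{m} = 1$. 
Since the number of differing positions takes only finitely many values, 
the infimum over $y \in A$ is attained, and under the assumption made 
for contradiction, 
$$z_\alpha \ = \ m^{-1/2} \, \min_{y \in A} 
\sum_{j=1}^m {\bf 1}(x_j \not= y_j) \ > \ m^{-1/2} \cdot s \, m^{1/2} 
 \ = \ s.$$
The Talagrand distance is the supremum of $z_\alpha$ over all admissible 
$\alpha$, hence it is at least $s$.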
565:  
566: \bigskip
567:  
568: Recall that for any $x = (x_1, ..., x_m)$, $y = (y_1, ..., y_m)$ $\in U^m$,
569: we defined $k_i(x)$ (resp. $k_i(y)$) to be the number of the keys (with
570: multiplicity) that are hashed into the slot $i$ for input sequence $x$, resp.
571: $y$. We define integers $s_i$ ($1 \leq i \leq n$) by
572: $$k_i(x) = k_i(y) + s_i.$$
573: \begin{lem}
574: \label{ExpandingA0} \ 
575: For all $x, y \in U^m$, \ 
576: $$\sum_{i = 1}^n |s_i| \ \leq \ 2 \sum_{j = 1}^m {\bf 1}(x_j \not= y_j).$$
577: \end{lem}
{\bf Proof.} \ We prove the lemma by induction on \ 
$\sum_{j = 1}^m {\bf 1}(x_j \not= y_j)$.
580:  
581: \smallskip
582:  
583: \noindent {\bf (0)} \ \ $\sum_{j = 1}^m{\bf 1}(x_j \not= y_j) = 0$: 
584: 
585: \smallskip
586: 
587: \noindent Then we have $x_j = y_j$ for all $j = 1, \ldots, m$, and hence,
588: $k_i(x) = k_i(y)$ for all $i = 1, \ldots, n$. Thus, we have
589: $\sum_{i = 1}^n |s_i| = 0$, finishing the base case.
590: 
591: \smallskip
592:  
593: \noindent {\bf (Inductive step)} \ \ Assume \  
594: $\sum_{j = 1}^m{\bf 1}(x_j \not= y_j) > 0$:
595: 
596: \smallskip
597: 
598: \noindent Without loss of generality we assume that $x_m \not= y_m$. 
599: Now, consider $\bar x = (x_1, \ldots , x_{m - 1}, y_m)$. We write
600: $k_i(\bar x) = k_i(y) + \bar {s_i}$ for $i = 1, \ldots, n$.
601: By the induction hypothesis we have
602: \begin{equation}
603: \label{ExpandingA0_1}
\sum_{i=1}^n |\bar s_i| \ \leq \ 2 \sum_{j=1}^m {\bf 1}(\bar x_j \not= y_j).
605: \end{equation}
Since $x$ differs from $\bar x$ only in its last component, there are two 
cases. Either $h(x_m) = h(y_m)$, in which case 
$\bar s_i = s_i$ for all $i = 1, \ldots , n$; or 
$h(x_m) \neq h(y_m)$. In the latter case let $i_1 = h(x_m)$ and 
$i_2 = h(y_m)$; then 
 \ $\bar s_{i_1} = s_{i_1} - 1$, \  $\bar s_{i_2} = s_{i_2} + 1$, and
$\bar s_i = s_i$ for all $i \in \{1, \ldots , n \} \setminus \{i_1, i_2\}$.
612: In both cases, 
613: \begin{equation}
614: \label{ExpandingA0_2}
615: \left| \sum_{i = 1}^n |\bar {s_i}| -
616: \sum_{i = 1}^n |s_i| \right| \leq 2.
617: \end{equation}
618: On the other hand, 
619: $$\sum_{j = 1}^m {\bf 1}(x_j \not= y_j)  \ = \  
620: \sum_{j = 1}^m {\bf 1}(\bar x_j \not= y_j) + 1.$$
621: Combining this, (\ref{ExpandingA0_1}), and (\ref{ExpandingA0_2}), completes 
622: the proof for the inductive step.  \ \ \ $\Box$
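
\smallskip

\noindent Written out, the combination is the following chain (a sketch of 
the step just described): 
$$\sum_{i=1}^n |s_i| \ \leq \ \sum_{i=1}^n |\bar s_i| + 2 \ \leq \ 
2 \sum_{j=1}^m {\bf 1}(\bar x_j \not= y_j) + 2 \ = \ 
2 \left( \sum_{j=1}^m {\bf 1}(x_j \not= y_j) - 1 \right) + 2 \ = \ 
2 \sum_{j=1}^m {\bf 1}(x_j \not= y_j).$$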
623:  
624: \begin{lem}
625: \label{ExpandingA1} \  
For all $n > 24$, $0 < \epsilon < 1/3$, $\delta > 0$, $s > 0$, and 
$m = \epsilon^{-2}n^{1+\delta}$, the following holds: 
for every $x \in C_s$ there is $y \in A$ such that 
629: $$\left|
630: \sum_{i = 1}^{n} \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} -
631: \sum_{i = 1}^{n} \frac{k_i(y)(k_i(y) - 1)}{m(m - 1)}
632: \right| \ \leq \ \epsilon \|p\|^2 \left(
633: \frac{6s}{n^{\delta/2}} + \frac{5s^2\epsilon}{n^{\delta}} \right).$$
634: \end{lem}
635: {\bf Proof.} \ For any fixed $x \in C_s$ we take $y \in A$ according to 
636: Lemma \ref{ExpandingA}.  That is,
637: \begin{equation}
638: \label{ExpandingA1_1}
639: \sum_{j = 1}^m {\bf 1}(x_j \not= y_j) \ \leq \ s m^{1/2}.
640: \end{equation}
641: As in the proof for Lemma \ref{ExpandingA0} we use the notation 
642: $k_i(x)$, $k_i(y)$, and $s_i$ \ ($i = 1, \ldots , n$).
643: We will leave the common denominator $m(m-1)$ out of the computations 
644: until the end:
645: 
646: \medskip
647: 
648: $| \sum_{i=1}^n k_i(x)(k_i(x) - 1) \ - \ \sum_{i=1}^n k_i(y)(k_i(y)-1) |$
649: 
650: \medskip
651: 
652: $= |\sum_{i=1}^n (k_i(y) +s_i)(k_i(y)+s_i-1) \ -  \ $
653: $\sum_{i=1}^n k_i(y)(k_i(y)-1) |$
654: 
655: \medskip
656: 
657: $= |\sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ $
658:   $[(k_i(y) +s_i)(k_i(y)+s_i-1) - k_i(y)(k_i(y)-1)] $
659: 
660: \smallskip
661: 
662:  \hspace{3in}  $ \ + \ \sum_{1\leq i \leq n, \, k_i(y) = 0} \ s_i(s_i -1) |$ 
663: 
664: \medskip
665: 
666: $\leq \ \sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ 2\, |s_i|(k_i(y)-1)$
667:   $ \ + \ \sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ (s_i^2 + |s_i|)$ 
668:   $ \ + \ | \sum_{1\leq i \leq n, \, k_i(y) =0} \ s_i(s_i -1)|$
669: 
670: \medskip
671: 
672: $\leq \ \sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ 2\, |s_i|(k_i(y)-1)$ 
673: $ \ + \ \sum_{i=1}^n (s_i^2 + |s_i|).$ 
674: 
675: \medskip 
676: 
677: \noindent By the Cauchy-Schwarz inequality, this is bounded by 
678: 
679: \smallskip
680: 
681: $\leq \ 2 \ (\sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ s_i^2)^{1/2}$
682:        $(\sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ (k_i(y)-1)^2)^{1/2}$ 
683:    $ \ + \ \sum_{i=1}^n (s_i^2 + |s_i|)$ 
684: 
685: \medskip
686: 
$\leq \ 2 \ (\sum_{i=1}^n s_i^2)^{1/2}$ 
  $(\sum_{i=1}^n k_i(y)(k_i(y)-1))^{1/2}$ 
689:  $ \ + \ \sum_{i=1}^n (s_i^2 + |s_i|).$ 
690: 
691: \medskip 
692: 
693: \noindent By Lemma \ref{ExpandingA0} and (\ref{ExpandingA1_1}) we have
694: \begin{equation}
695: \label{ExpandingA1_4}
696: \sum_{i=1}^n s_i^2 \ \leq \ \left(\sum_{i=1}^n |s_i| \right)^2 \ \leq \  
697: \left(2 \sum_{j=1}^m {\bf 1}(x_j \not= y_j) \right)^2 \ \leq \ 4 s^2 m.
698: \end{equation}
699: Since $y \in A$ we have
700: $$\sum_{i=1}^n \frac{k_i(y)(k_i(y) - 1)}{m (m - 1)} \leq
701: \|p\|^2 \left(1 + 3\epsilon \right).$$
702: Hence, by all the above:
703: $$\left| \sum_{i=1}^n \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} -
704: \sum_{i=1}^n \frac{k_i(y)(k_i(y) - 1)}{m(m - 1)} \right|$$
705: $$\leq \ \frac{4s}{(m - 1)^{1/2}} \cdot \|p\|
706: \left(1+ 3\epsilon \right)^{1/2} + \frac{4s^2}{m-1} + \frac{2s}{m^{1/2}(m-1)}.$$
707: 
708: \noindent By calculating, and using the fact that $\|p\|^2 \geq \frac{1}{n}$, 
709: $0 < \epsilon < 1/3$, and $m = \epsilon^{-2}n^{1 + \delta}$, we find the 
710: following upper bound for \ 
711: $\left|
712: \sum_{i = 1}^{n} \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} -
713: \sum_{i = 1}^{n} \frac{k_i(y)(k_i(y) - 1)}{m(m - 1)}
714: \right| \ $: 
715: \[
716:   \frac{s\epsilon}{n^{\delta/2}} \ \|p\|^2 \ 
717:  \frac{4(1+3\epsilon)^{1/2}n^{1/2}}{(n - \epsilon^2 n^{-\delta})^{1/2} } 
718:  \ + \ 
719: \frac{s\epsilon}{n^{\delta/2}} \ \|p\|^2 \  
720:   \frac{2 \epsilon^2n}{n^{1/2} (n^{1+\delta} - \epsilon^2)}
721: \ + \  
 s^2\epsilon^2 \ \|p\|^2 \  
723:  \frac{4n}{n^{1+\delta} - \epsilon^2}
724: \]
725: \noindent Combining this and using $n > 24$ we obtain the upper bound \ 
726: $$\epsilon \|p\|^2 (\frac{6s}{n^{\delta/2}} + 
727: \frac{5s^2 \epsilon}{n^{\delta}}).$$
728: $\Box$
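
\smallskip

\noindent For the reader's convenience, the routine algebra behind the 
chain of inequalities in this proof can be spelled out (a sketch, not part 
of the original argument). For integers $k$ and $s$, 
$$(k + s)(k + s - 1) - k(k - 1) \ = \ 2ks + s^2 - s \ = \ 
2s(k - 1) + s^2 + s .$$
After dividing by $m(m-1)$ and using (\ref{ExpandingA1_4}), 
Lemma \ref{ExpandingA0} with (\ref{ExpandingA1_1}), and 
$\sum_{i=1}^n k_i(y)(k_i(y) - 1) \leq m(m-1) \, \|p\|^2 (1 + 3\epsilon)$, 
the three resulting terms are bounded by 
$$\frac{2 \, (4 s^2 m)^{1/2} \, \left( m(m-1) \, \|p\|^2 (1 + 3\epsilon) 
\right)^{1/2}}{m(m-1)} \ = \ 
\frac{4s}{(m-1)^{1/2}} \, \|p\| \, (1 + 3\epsilon)^{1/2},$$
$$\frac{1}{m(m-1)} \sum_{i=1}^n s_i^2 \ \leq \ \frac{4 s^2}{m - 1}, 
\ \ \ \ \ \ 
\frac{1}{m(m-1)} \sum_{i=1}^n |s_i| \ \leq \ 
\frac{2s}{m^{1/2}(m - 1)}.$$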
729: 
730: \bigskip
731:  
732: \noindent
733: {\bf Proof of Theorem \ref{OurResult1}.} \  
734: The theorem follows from the definition of $A$, inequality (\ref{SubsetC}), 
735: and Lemma \ref{ExpandingA1}. \ \ \ $\Box$
736:  
737: 
738: %%%%%%%%%%%%%%%%%%%%%%%%%
739:  
740: \subsection{Average search time for a particular user}
741: 
742: \noindent
743: {\bf Proof of Corollary \ref{OurResult2}.} \   
744: Recall that the average search time AST$(v,x)$ is bounded from above by \
745: $\sum_{i = 1}^n v_i \cdot k_i(x)$.
746: In Theorem \ref{OurResult1} let us write $m = L_1 L_2 \, n$, and choose  
$$\epsilon = \frac{1}{\sqrt{L_1}}~~{\rm and}~~\delta = \frac{\log L_2}{\log n}.$$
748: Note that for all $i$, \ 
749: 
750: \smallskip
751: 
752: $k_i(x) - 1 \leq \sqrt{k_i(x)(k_i(x) - 1)}$ \   
753: 
754: \smallskip
755: 
\noindent since the left side is at most 0 when $k_i(x) = 0$ or 1, and 
$(k-1)^2 \leq k(k-1)$ for every integer $k \geq 1$. Therefore, using 
$\sum_{i=1}^n v_i = 1$ and the Cauchy-Schwarz inequality,  

\smallskip

${\rm AST}(v,x) \leq \ $
$\sum_{i=1}^n v_i \cdot k_i(x) \ = \ \sum_{i=1}^n v_i (k_i(x) - 1) + 1 \  $
$\leq \ \sum_{i=1}^n v_i \sqrt{k_i(x)(k_i(x) - 1)} + 1$

\medskip

$\leq \ \sqrt{\sum_{i=1}^n v_i^2} \sqrt{\sum_{i=1}^n k_i(x)(k_i(x) - 1)} + 1$
$ \ = \ \|v\| \, \|p\| \, \sqrt{m(m-1)} \, $
  $\sqrt{\sum_{i=1}^n \frac{k_i(x)(k_i(x)-1)}{m(m-1)} \, \frac{1}{\|p\|^2}} + 1$,

\smallskip

\noindent where $\sqrt{m(m-1)} < m = Ln$.
769: 
770: \medskip
771: 
772: \noindent The corollary follows from this and Theorem \ref{OurResult1}.
773:  \ \ \   $\Box$   
774: 
775: \bigskip
776: 
777: \noindent
778: {\bf Remark.} \ Our proof method depends crucially on Talagrand's theorem.
779: Many readers, more familiar with techniques like the Chernoff bound, or more
780: generally, the Hoeffding inequality for martingale differences (from which 
781: the Chernoff bound follows directly), may wonder whether these simpler 
782: techniques don't work here. In order to apply Hoeffding's inequality we 
783: could view $\sum_{i=1}^n v_i \cdot k_i(x)$ as a weighted sum of the random
784: variables $k_i(x)$; to apply Hoeffding one needs to bound $|k_i(x)|$, but 
785: we don't have good bounds a priori; finding good bounds on $|k_i(x)|$ seems
786: harder and less promising than our method, based on Talagrand's theorem.
787: See, e.g., Michael Steele's book \cite{St}, which discusses the advantages 
788: of applying Talagrand's theorem at length. 
789:   
790: 
791: 
792: 
793: 
794: 
795: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
796: % Bibliography
797: 
798: \begin{thebibliography}{60}
799: 
800: \bibitem{CLRS}
801: T.H.\ Cormen, C.E.\ Leiserson, R.L.\ Rivest, C.\ Stein, 
802: {\it Introduction to Algorithms}, 2nd ed., McGraw-Hill, 2001.
803: 
804: \bibitem{GR}
805: O.\ Goldreich and D.\ Ron, ``On testing expansion in bounded-degree graphs'',
806: Technical Report TR00-020, ECCC, 2000. 
807: 
808: \bibitem{Kn}
809: D.E.\ Knuth, {\it Sorting and Searching}, 2nd ed., Addison-Wesley, 1998.
810: 
811: \bibitem{SKS}
812: A.\ Silberschatz, H.F.\ Korth, S.\ Sudarshan, 
813: {\it Database System Concepts}, 4th ed., McGraw-Hill, 2002.
814: 
815: \bibitem{St}
816: M.\ Steele, {\it Probability Theory and Combinatorial Optimization},
817: SIAM, 1997.
818: 
819: \bibitem{Ta}
820: M.\ Talagrand, ``Concentration of measure and isoperimetric inequalities 
821: in product spaces'', 
822: {\it Institut des Hautes \'Etudes Scientifiques, Publications
823: Math\'ematiques,} 81 (1995) 73-205.
824: 
825: \end{thebibliography}
826: 
827: 
828: \end{document}
829: