
1: \documentclass[12pt]{article}

2: \usepackage{latexsym}

3: \usepackage{amsfonts}

4: % This is for including postscript figures.

5: \newcommand{\zed}{\mbox{$\Bbb Z$}}

6: \setlength{\oddsidemargin}{0in}

7: \setlength{\topmargin}{0in}

8: \setlength{\headheight}{0in}

9: \setlength{\textheight}{8.3in}

10: \setlength{\textwidth}{6.7in}

11: \setlength{\topskip}{0in}

12: \setcounter{tocdepth}{4}

13: \newtheorem{thm}{Theorem}[section]

14: \newtheorem{cor}[thm]{Corollary}

15: \newtheorem{lem}[thm]{Lemma}

16: \newtheorem{pro}[thm]{Proposition}

17: \newtheorem{defn}[thm]{Definition}

18: \newtheorem{fact}[thm]{Fact}

19: \newcommand{\ds} {\displaystyle}

20: \renewcommand{\arraystretch}{.6}

21: \bibliographystyle{abbrv}

22:

23:

24: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

25: %

26: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

27:

28: \title{Probabilistic behavior of hash tables}

29: \author{Dawei Hong\footnote{D.~Hong and J.C.~Birget,

30:                        Dept.\ of Computer Science,

31:                        Rutgers University at Camden, Camden, NJ 08102,

32:                        USA, \{dhong, birget\}@camden.rutgers.edu}, \ \

33:         Jean-Camille Birget\footnote{Supported in part by

34:                               NSF grant DMS-9970471}, \ \

35:         Shushuang Man \footnote{ Dept.\ of Mathematics and Computer Science,

36:                    Marshall, MN 56258, USA, mans@southwest.msus.edu}

37:        }

38: \date{}

39: \begin{document}

40: \maketitle

41:

42: \begin{abstract}

43: We extend a result of Goldreich and Ron about estimating the collision

44: probability of a hash function. Their estimate has a polynomial tail.

45: We prove that when the load factor is greater than a certain constant,

46: the estimator has a Gaussian tail. As an application we find an estimate of

47: an upper bound for the average search time in hashing with chaining,

48: for a particular user (we allow the overall key distribution to be different

49: from the key distribution of a particular user). The estimator has a

50: Gaussian tail.

51: \end{abstract}

52:

53: %%%%%%%%%%%%%%%%%%%%%%%%%

54: % Section 1

55:

56: \section{Introduction}

57:

58: Hash tables have many applications in computer science \cite{CLRS}, \cite{Kn}.

59: We especially mention databases, where hash tables are used for storing

60: values of an attribute; see chapter 12 of \cite{SKS}.

61: Following the notation of \cite{CLRS}, a hash function is a function

62: $h: U \to T$, where both the domain $U$ and the range $T$

63: are finite. Traditionally, $U$ is called the {\it key space} or the ``universe'',

64: and elements $x \in U$ are called {\it keys}. The set $T$ is called the

65: {\it table}, and its elements are called the {\it table slots}. When

66: $h(x) = i$ we say that $h$ {\it hashes} the key $x$ into the slot $i$.

67: We shall denote by $n$ the cardinality of $T$ and we will simply assume that

68: $T = \{1, \ldots, n\}$.

69: We assume that $U$ is (very much) larger than $T$.

70:

71: \smallskip

72:

73: We assume that a probability measure $q$ has been defined on $U$.

74: The probability of $S \ (\subset U)$ is denoted by ${\sf P}(S)$

75: $( \ = \sum_{x \in S} q(x))$. We also put the product measure on

76: $U \times U$ and on $U^m$ (for any positive integer $m$); using

77: the product measure amounts to saying that in a sequence of $m$ keys,

78: all the keys are {\it independent}.

79:

80: The probability on $U$ induces a probability measure on $T$:

81: The {\it probability that some key hashes to slot} $i \ (\in T)$ is \

82: $p_i =  \sum_{x \in h^{-1}(i)} q(x)$ $= {\sf P}(h^{-1}(i))$.

83:

84: If two keys $x_1, x_2 \in U$ have the same hash value, these keys are said

85: to {\it collide}. The {\it collision probability} of the hash function $h$

86: is defined to be \ ${\sf P}\{(x_1, x_2) \in U \times U : h(x_1) = h(x_2) \}$

87: (in short-hand this is denoted by ${\sf P}(h(x_1) = h(x_2))$).

88: Here we use the product measure (i.e., keys are ``chosen independently'').

89: A {\it true collision} corresponds to keys $x_1, x_2 \in U$ such that

90: $x_1 \neq x_2$ and $h(x_1) = h(x_2)$.

91:

92: Throughout this paper, $\|\cdot\|$ denotes the Euclidean norm. It is straightforward

93: to prove the following.

94: \begin{pro}

95: The collision probability of $h$ is equal to \

96: $\sum_{i = 1}^n p_i^2 \ \ ( = \|p\|^2)$.

97:

98: \medskip

99:

100: \noindent

101: Moreover, we always have \ $\sum_{i=1}^n p_i^2 \geq \frac{1}{n}$,

102: and equality holds iff \  $p_i = \frac{1}{n}$ for all $i \in T$.

103: \end{pro}
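\noindent For the reader's convenience, here is a short verification of the proposition (a sketch added for illustration; it uses only the definitions above and the Cauchy-Schwarz inequality). Since the two keys are chosen independently,
\[ {\sf P}(h(x_1) = h(x_2)) \ = \ \sum_{i=1}^n {\sf P}(h(x_1) = i) \, {\sf P}(h(x_2) = i) \ = \ \sum_{i=1}^n p_i^2 . \]
Moreover, since $\sum_{i=1}^n p_i = 1$, the Cauchy-Schwarz inequality gives
\[ 1 \ = \ \left( \sum_{i=1}^n p_i \cdot 1 \right)^2 \ \leq \ n \sum_{i=1}^n p_i^2 , \]
with equality iff all the $p_i$ are equal, i.e., iff $p_i = \frac{1}{n}$ for all $i \in T$.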

104: Similarly, the probability that two independently chosen keys are equal is

105:  \ $\sum_{x \in U} q(x)^2$. Hence, the probability of true collisions

106: for $h$ is \ $\sum_{i = 1}^n p_i^2 \ - \ \sum_{x \in U} q(x)^2.$

107:

108: Note that \ $\sum_{x \in U} q(x)^2$ \ will usually be very small

109: assuming that $U$ is very large (compared to $n$ and compared to the length

110: $m$ of key sequences used), and assuming that the probability distribution

111: $q$ on $U$ is not very concentrated.

112: Therefore, the difference between the collision probability $\|p\|^2$ and

113: the probability of true collisions is usually quite small.

114:

115: \medskip

116:

117: In this paper we assume that collisions

118: are resolved by some form of {\it chaining}; i.e., all the keys that are

119: hashed into one slot are stored in that slot.

120: For a hash table with chaining, we will simply assume that the search time

121: (for both successful and unsuccessful searches) in a slot $i$ is proportional

122: to the number of keys stored in that slot; for simplicity, we identify

123: search time in a slot with chain length in that slot.

124:

125: \bigskip

126:

127: \noindent {\bf Notation} ``$k_i(x)$'':  \

128: Let $x = (x_1, \ldots, x_m)$ be a sequence of $m$ keys that are inserted into

129: our hash table, and let $i$ be a slot ($i = 1, \ldots, n$).

130: We let $k_i(x)$ denote the number of keys (counted with multiplicities)

131: inserted into slot $i$. (``With multiplicities'' means that if a key

132: occurs several times in $x$ it is counted as many times as it occurs.)

133:

134: Since in $k_i(x)$ we count keys with multiplicities, $k_i(x)$ is an upper bound

135: on the number of different keys stored in slot $i$.

136:

137: \begin{pro}

138: For a sequence of keys $x = (x_1, \ldots, x_m)$ that are inserted, the

139: number of collisions between keys in $x$ is

140: $$\sum_{i = 1}^n \frac{k_i(x)(k_i(x) - 1)}{2}.$$

141: \end{pro}

142: The proof is straightforward. Recall that we count pairs of equal keys in

143: the sequence $x$ as collisions. Since there are $\frac{m(m - 1)}{2}$

144: unordered pairs of key insertions in $x$, we call \

145: $$\sum_{i = 1}^n \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)}$$

146: the  {\em empirical collision probability} of $x$.

147: This concept, and its relation with the collision probability $\|p\|^2$,

148: were first studied by Goldreich and Ron \cite{GR}.
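\noindent As a small illustration (the numbers are hypothetical and chosen only for this example), take $n = 3$ slots and $m = 5$ inserted keys with $k_1(x) = 3$, $k_2(x) = 2$, $k_3(x) = 0$. The number of collisions is
\[ \frac{3 \cdot 2}{2} + \frac{2 \cdot 1}{2} + 0 \ = \ 4 , \]
and there are $\frac{5 \cdot 4}{2} = 10$ unordered pairs of insertions, so the empirical collision probability is
\[ \frac{3 \cdot 2 + 2 \cdot 1 + 0}{5 \cdot 4} \ = \ \frac{8}{20} \ = \ 0.4 . \]
For comparison, the uniform distribution on three slots has $\|p\|^2 = \frac{1}{3}$.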

149:

150: \bigskip

151:

152: In this paper we obtain two results, in the form of deviation bounds.

153: (1) We give an estimation of the collision probability.

154: (2) We give a deviation bound for an upper bound on the average search time.

155:

156: In the second result we assume that the load factor is $> 9$

157: (see later for the exact assumptions).

158: Applications in databases often lead to hash tables with

159: large load factor (\cite{SKS}, Chapter 12).

160: We allow arbitrary key distributions.

161:

162: \bigskip

163:

164: \noindent {\bf Estimation of the collision probability}

165:

166: \medskip

167:

168: \noindent

169: Our first result extends a result of Goldreich and Ron \cite{GR},

170: namely that \  $\sum_{i = 1}^n \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)}$

171:  \ is a very good estimator for the collision probability $\|p\|^2$.

172: How good the estimator is can be measured  by the relative error \

173: $|\sum_{i=1}^n \frac{k_i(x)(k_i(x)-1)}{m(m-1)} \cdot \frac{1}{\|p\|^2}$

174: $ \ - \ 1|$. Their result, as well as ours, gives a deviation bound for

175: this relative error. Goldreich and Ron \cite{GR} proved a polynomial deviation

176: bound for the estimator \ $\sum_{i=1}^n \frac{k_i(x)(k_i(x)-1)}{m(m-1)}$.

177: Their goal was to find sublinear-time algorithms for testing expansion

178: properties of bounded-degree graphs.

179:

180: \begin{thm} {\rm (Goldreich and Ron \cite{GR}).} \

181: For all \ $\beta > 0$, $\lambda \geq 0$, if

182: $m = n^{1/2 + \beta + \lambda}$ then

183: $${\sf P} \left\{ \left| \sum_{i = 1}^{n}

184: \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)}  \cdot \frac{1}{\|p\|^2} - 1 \right|

185: \leq \frac{3}{n^{\beta/2}} \right\}

186:  \ \geq \  1 - \frac{4}{9n^{\lambda}}.$$

187: \end{thm}

188: We extend the theorem of Goldreich and Ron as follows:

189:

190: \smallskip

191:

192: \begin{thm}

193: \label{OurResult1} \

194: For all \ $n > 24$, \ $\frac{1}{3} > \epsilon > 0$, \ $\delta > 0$, \

195: $s > 0$, if \ $m = \epsilon^{-2}n^{1 + \delta}$ \  we have

196: $${\sf P} \left\{ \left| \sum_{i = 1}^n

197: \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} \cdot \frac{1}{\|p\|^2} - 1 \right|

198: \leq \epsilon \left( 3 + \frac{6s}{n^{\delta/2}} +

199: \frac{5s^2\epsilon}{n^\delta} \right) \right\}

200:  \ \geq \ 1 - \frac{10}{9} \, e^{-s^2 /4}.$$

201: \end{thm}

202:

203: \medskip

204:

205: \noindent By taking $s = 2 \, n^{\delta/2}$, the expression \

206: $3 + \frac{6s}{n^{\delta/2}} + \frac{5s^2\epsilon}{n^\delta} $ \

207: becomes \ $3 + 12 + 20 \, \epsilon$ $(< 22)$; here we use

208: $\epsilon < \frac{1}{3}$. Therefore,

209:

210: \begin{cor} \label{OurResult1_cor1} \

211: For all \ $n > 24$, \ $\frac{1}{3} > \epsilon > 0$, \ $\delta > 0$, \

212: if \ $m = \epsilon^{-2}n^{1 + \delta}$ \  we have

213:

214: \medskip

215:

216:  \ \ \ \ \   \ \ \ \ \

217: ${\sf P} \left\{ \left| \sum_{i = 1}^n

218: \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} \cdot \frac{1}{\|p\|^2} - 1 \right|

219: \leq 22 \, \epsilon \right\}

220:  \ \geq \ 1 - \frac{10}{9} \, e^{-n^{\delta}}.$

221: \end{cor}

222:

223: \smallskip

224:

225: \noindent

226: Writing $\delta = \frac{\log C}{\log n}$, for $C > 1$, we obtain

227: $n^{\delta} = C$, and $m = \epsilon^{-2}Cn$, i.e., the load factor is

228: $L = C \, \epsilon^{-2}$. Therefore,

229: \begin{cor} \label{OurResult1_cor2} \

230: For all \ $n > 24$, \ $\frac{1}{3} > \epsilon > 0$, \ and all $m$ such that

231: $L = \frac{m}{n} > \epsilon^{-2} \ ( > 9)$ \  we have

232:

233: $${\sf P} \left\{ \left| \sum_{i = 1}^n

234: \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} \cdot \frac{1}{\|p\|^2} - 1 \right|

235: \leq 22 \, \epsilon \right\}

236:  \ \geq \ 1 - \frac{10}{9} \, e^{- L \epsilon^2}.$$

237: \end{cor}

238:

239: \smallskip

240:

241: \noindent

242: Note that the assumptions of this Corollary impose the following relation

243: between $L$ and $\epsilon$: \ $\frac{1}{3} > \epsilon > \frac{1}{\sqrt{L}}$;

244: equivalently, $L = \frac{m}{n} > \epsilon^{-2} \ ( > 9)$.

245:

246: \medskip

247:

248: To compare with the result of Goldreich and Ron, let us pick

249: $\epsilon = n^{- \beta / 2}$ in Corollary \ref{OurResult1_cor1}. Then

250: $n^{1/2 + \beta + \lambda} = m = \epsilon^{-2}n^{1 + \delta}$ implies

251: $\delta = \lambda - \frac{1}{2}$. Hence our Corollary becomes:

252:

253: \begin{cor} \label{OurResult1_cor3} \

254: For all \ $n > 24$, \ $\beta > \frac{2 \log 3}{\log n}$, \

255: $\lambda > \frac{1}{2}$,

256: if \ $m = n^{1/2 + \beta + \lambda}$ \  we have

257: $${\sf P} \left\{ \left| \sum_{i = 1}^n

258: \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} \cdot \frac{1}{\|p\|^2} - 1 \right|

259: \leq 22 \, n^{- \beta / 2} \right\}

260:  \ \geq \ 1 - \frac{10}{9}e^{-n^{\lambda - \frac{1}{2}}}.$$

261: \end{cor}

262:

263: \smallskip

264:

265: Comparing Corollary \ref{OurResult1_cor3} with the theorem of Goldreich and Ron:

266: Our theorem gives a much better deviation bound

267: (it is exponential, as opposed to the polynomial bound of Goldreich and Ron);

268: but it applies only when the load factor $L$ is $> 9$ (whereas in the

269: result of Goldreich and Ron, the load factor $L = n^{\beta + \lambda - 1/2}$

270: can be arbitrarily small, depending on $n$).

271:

272: \bigskip

273:

274:

275: \noindent {\bf The average search time for a particular user}

276:

277: \medskip

278:

279: In order to analyze the efficiency of a hash table one considers the

280: overall usage statistics of the keys (over all users).

281: By ``user'' we mean a person or a process.

282: For every user we introduce a vector

283: $v = (v_1, \ldots, v_n)$, where $v_i$ is the frequency of the user's access

284: (for search) to slot $i$. More precisely, $v_i$ is the number of searches at

285: slot $i$, divided by the total number of searches in the table, for this user.

286:  Then $0 \leq v_i \leq 1$ and $\sum_{i=1}^n v_i = 1$.

287: We shall call $v$ the user's {\it access pattern}.

288: Traditional analysis of the average search time assumes that the access

289: pattern of a user is the same as the key distribution (see e.g., \cite{CLRS}).

290:

291: \smallskip

292:

293: We let ${\rm AST}(v, x)$ denote the average search time for a user with access

294: pattern $v$, under the condition that a sequence $x$ of $m$ independent keys

295: was previously inserted into the hash table. Clearly, we have the following

296: upper bound:

297:

298: \smallskip

299:

300:   \ \ \ \ \ ${\rm AST}(v, x) \ \leq \ \sum_{i=1}^n v_i \cdot k_i(x)$.

301:

302: \smallskip

303:

304: \noindent The difference between ${\rm AST}(v, x)$ and

305: $\sum_{i=1}^n v_i \cdot k_i(x)$ is caused by the possibility of

306: pseudo-collisions (equal keys, which $k_i(x)$ counts with multiplicity). Here we are only concerned with upper bounds

307: on ${\rm AST}(v, x)$, so we can use $\sum_{i=1}^n v_i \cdot k_i(x)$.

308:

309: \smallskip

310:

311: We write $m$ as $m = Ln$, where $L$ is called the {\it load factor}.

312: We do not assume that $L$ is a constant. Applying Theorem \ref{OurResult1}

313: we show

314: \begin{cor}

315: \label{OurResult2}

316: For all \ $n > 24$, \ $s > 0$, \ $L > 9$,  and $m = L n$ we have

317: \[

318: {\sf P} \left\{ {\rm AST}(v, x) \leq  \ L \, n \|v\| \, \|p\| \,

319: \sqrt{1+ \frac{3+6s}{\sqrt{L}} + \frac{5s^2}{L} } + 1 \right\}

320:  \ \geq \ 1 - \frac{10}{9}e^{-s^2 /4}. \]

321: \end{cor}

322:

323: \medskip

324:

325: \noindent Noting that

326:  \ $\sqrt{1+ \frac{3+6s}{\sqrt{L}} + \frac{5s^2}{L} }$ $<$

327: $1 + \frac{ 4s}{\sqrt{L}}$ \ and letting \ $\epsilon = \frac{s}{2\sqrt{L}}$

328:  \ we obtain

329: \begin{cor}

330: \label{OurResult2_cor1} \ For all \ $n > 24$, \ $\epsilon > 0$, \ $L > 9$,

331: and $m = L n$ we have

332: \[

333: {\sf P} \left\{ {\rm AST}(v, x) \leq \ L \, n \|v\| \, \|p\| \,

334:  (1 + 8\epsilon) + 1 \right\}

335:  \ \geq \ 1 - \frac{10}{9}e^{-L \epsilon^2}. \]

336: \end{cor}
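\noindent The substitution can be made explicit (a short check, added for convenience): with $\epsilon = \frac{s}{2\sqrt{L}}$, i.e., $s = 2\sqrt{L} \, \epsilon$, we get
\[ \frac{4s}{\sqrt{L}} \ = \ 8\epsilon \ \ \ {\rm and} \ \ \ \frac{s^2}{4} \ = \ L\epsilon^2 , \]
which turns the bound and the probability of Corollary \ref{OurResult2} into those of Corollary \ref{OurResult2_cor1}.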

337:

338: \noindent One notices that the probability bound is only interesting when

339: $L$ is significantly larger than $\epsilon^{-2}$. Also, the error bound

340: is interesting only when $\epsilon$ is less than $1/8$; this means that

341: the load factor has to be at least 100 for our results to be interesting.

342: In that sense, the results are theoretical, and show just what type of

343: behavior to expect, up to big-O.

344:

345: In \cite{CLRS} (chapt.~12, exercise 12-3) the expected search time (for

346: every user) was found to be $\Theta \left(L\right)$, under the assumption

347: that both the key distribution and the distribution of user's accesses are

348: uniform.

349: Our Corollary implies that if $\|p\|^2 = \Theta \left(\frac{1}{n}\right)$

350: and $\|v\|^2 = \Theta \left(\frac{1}{n}\right)$

351: (which is much more relaxed than the assumption of a uniform distribution),

352: then with exponentially high probability, the average search time is $O(L)$

353: for a user with access pattern  $v$.
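To spell out this implication (a sketch; the constants $c_1, c_2 > 0$ are introduced here only for illustration), suppose that $\|p\| \leq \frac{c_1}{\sqrt{n}}$ and $\|v\| \leq \frac{c_2}{\sqrt{n}}$. Then the bound of Corollary \ref{OurResult2_cor1} becomes
\[ L \, n \, \|v\| \, \|p\| \, (1 + 8\epsilon) + 1 \ \leq \ c_1 c_2 \, L \, (1 + 8\epsilon) + 1 , \]
which is $O(L)$ for fixed $\epsilon$, and it holds with probability at least $1 - \frac{10}{9}e^{-L \epsilon^2}$.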

354:

355: \bigskip

356:

357: \noindent {\bf Example 1}

358:

359: \smallskip

360:

361: Suppose that a hash table, designed for a certain population of users, has

362: collision probability \ $\|p\|^2 \leq \frac{c^2}{n}$, i.e., $\|p\| \leq \frac{c}{\sqrt{n}}$ (for

363: the overall population of users); $c$ is a positive constant.

364: The keys in the hash table are independent random samples.

365: Now consider an individual user who accesses a subset of cardinality

366: $\alpha \, n$ (where $0 < \alpha \leq 1$) of the $n$ slots of the hash table,

367: with uniform probability $\frac{1}{\alpha n}$, and who does not access the

368: other $(1 - \alpha) n$ slots of the hash table at all (i.e., those slots

369: have probability 0 for this user).  Then the question is: What is the average

370: search time for this user and this table, and what is the deviation bound?

371:

372: Since the user accesses a fraction $\alpha$ of the slots uniformly, we have

373: $\|v\| = \frac{1}{\sqrt{\alpha n}}$. By Corollary \ref{OurResult2_cor1},

374:

375: \smallskip

376:

377:   \ \ \ \ \ ${\sf P}\{ {\rm AST}(v, x) \leq \ $

378:                    $\frac{cL}{\sqrt{\alpha}} \, (1 + 8\epsilon) + 1 \}$

379:  $ \ \geq \ 1 - \frac{10}{9}e^{-L \epsilon^2}$.

380:

381: \smallskip

382:

383: \noindent

384: So, the average search time is at most $1 + \frac{cL}{\sqrt{\alpha}}$,

385: plus a smaller error term (namely \ $\frac{cL}{\sqrt{\alpha}} \, 8\epsilon$),

386: and with probability close to 1 (namely \

387:  $1 - \frac{10}{9}e^{-L \epsilon^2}$).
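The quantities used above can be checked directly (a verification sketch, using only the assumptions of this example). The access pattern $v$ has $\alpha n$ entries equal to $\frac{1}{\alpha n}$ and all other entries equal to 0, so
\[ \|v\|^2 \ = \ \alpha n \cdot \frac{1}{(\alpha n)^2} \ = \ \frac{1}{\alpha n} , \]
and, with $\|p\| \leq \frac{c}{\sqrt{n}}$,
\[ L \, n \, \|v\| \, \|p\| \ \leq \ L \, n \cdot \frac{1}{\sqrt{\alpha n}} \cdot \frac{c}{\sqrt{n}} \ = \ \frac{cL}{\sqrt{\alpha}} . \]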

388:

389: One observes that when the fraction $\alpha$ of the table used by the user

390: becomes smaller, the upper bound on the average search time for this user

391: increases, as does the error bound. This is not surprising; hashing works

392: best when the keys are spread over the table as evenly as possible.

393: Interestingly, our probability bound does not depend on $\alpha$.

394:

395: Some possible numerical values: For $c = 5$, \ $\alpha = 0.1$,

396:  \ $\epsilon = 0.05$, $L = 1000$, we get \

397: ${\rm AST}(v, x) \leq \ 15811 \pm 6324$, with probability at least

398: $1 - \frac{10}{9}e^{-L \epsilon^2}$ \ $= \ 0.909$.

399: For $c = 5$, \ $\alpha = 0.1$, \ $\epsilon = 0.05$, $L = 10000$,

400: we get \ ${\rm AST}(v, x) \leq \ (1.58 \pm 0.64) \cdot 10^5$, with

401: probability at least $1 - 1.54 \cdot 10^{-11}$.
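The arithmetic behind the first set of values can be reproduced as follows (rounded values, added here as a check):
\[ \frac{cL}{\sqrt{\alpha}} \ = \ \frac{5000}{\sqrt{0.1}} \ \approx \ 15811 , \ \ \ \ \ 8\epsilon \cdot \frac{cL}{\sqrt{\alpha}} \ = \ 0.4 \cdot 15811 \ \approx \ 6324 , \]
\[ L\epsilon^2 \ = \ 1000 \cdot 0.0025 \ = \ 2.5 , \ \ \ \ \ 1 - \frac{10}{9} e^{-2.5} \ \approx \ 1 - 0.0912 \ \approx \ 0.909 . \]
For $L = 10000$ the exponent is $L\epsilon^2 = 25$, and $\frac{10}{9}e^{-25} \approx 1.54 \cdot 10^{-11}$.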

402:

403: \bigskip

404:

405: \noindent {\bf Example 2}

406:

407: \smallskip

408:

409: Let us consider the situation in which a query consists of two subqueries,

410: $Q_1$ and $Q_2$. This happens very commonly (e.g., in a ``three-tier

411: architecture''); see \cite{SKS}.

412: The two subqueries  can be viewed as two users with

413: access patterns $v^{(1)}$ and $v^{(2)}$. Assume, for this example, that

414: each of $Q_1$ and $Q_2$ behaves like the user in Example 1 above.

415: In particular, for $Q_i$ $(i = 1, 2)$ we have \

416: $\|v^{(i)}\| = \frac{1}{\sqrt{\alpha_i n}}$,  and

417:

418: \smallskip

419:

420:   \ \ \ ${\sf P}\{ {\rm AST}_i(v^{(i)}, x) \leq \ $

421:                    $\frac{cL}{\sqrt{\alpha_i}} \, (1 + 8\epsilon) + 1 \}$

422:  $ \ \geq \ 1 - \frac{10}{9}e^{-L \epsilon^2}$.

423:

424: \smallskip

425:

426: \noindent Hence, for the combined query the average search time is a

427: weighted sum \

428:

429: \smallskip

430:

431:   \ \ \ ${\rm AST} = w_1 \cdot {\rm AST}_1 + w_2 \cdot {\rm AST}_2$,

432:   \ \ \ with $w_1 + w_2 = 1$.

433:

434: \smallskip

435:

436: \noindent Let \ $a_i = \frac{cL}{\sqrt{\alpha_i}} \, (1 + 8\epsilon) + 1$. Then \

437:

438: \smallskip

439:

440: ${\sf P}\{ {\rm AST} \leq w_1a_1 + w_2a_2 \} \ \geq \ $

441: ${\sf P}\{ {\rm AST}_1 \leq a_1, \ $

442:           ${\rm AST}_2 \leq a_2 \} $

443:

444: \smallskip

445:

446: $ \geq \ 1 - 2 \, \frac{10}{9}e^{-L \epsilon^2}$.

447:

448: \smallskip

449:

450: \noindent Therefore, since $w_1a_1 + w_2a_2 \leq {\rm max}\{a_1,a_2\}$, the average search time AST$(v^{(1)}, v^{(2)}, x)$

451:  of the combined query satisfies

452:

453: \smallskip

454:

455:  \ \ \ ${\sf P}\{ {\rm AST}(v^{(1)}, v^{(2)}, x) \leq \ $

456:  $\frac{cL}{\sqrt{{\rm min}\{\alpha_1,\alpha_2\}}} \, (1+ 8\epsilon) +1 \}$

457:  $ \ \geq \ 1 - \frac{20}{9}e^{-L \epsilon^2}$.
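\noindent The factor 2 in the probability bound comes from the union bound (spelled out here for completeness, assuming $w_1, w_2 \geq 0$). Writing $E_i$ for the event ${\rm AST}_i \leq a_i$ $(i = 1, 2)$, we have
\[ {\sf P}(E_1 \cap E_2) \ \geq \ 1 - {\sf P}(E_1^c) - {\sf P}(E_2^c) \ \geq \ 1 - 2 \cdot \frac{10}{9}e^{-L \epsilon^2} \ = \ 1 - \frac{20}{9}e^{-L \epsilon^2} , \]
and no independence between the two subqueries is needed. On the event $E_1 \cap E_2$,
\[ w_1 \cdot {\rm AST}_1 + w_2 \cdot {\rm AST}_2 \ \leq \ w_1 a_1 + w_2 a_2 \ \leq \ {\rm max}\{a_1, a_2\} . \]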

458:

459: \bigskip

460:

461: \noindent Hence, when the load factor is large (compared to $\epsilon^{-2}$) we

462: obtain a very reliable upper bound on the average search time for the

463: combined query. The knowledge of this upper bound enables various processes

464: (that wait for the completion of this query) to be scheduled in a predictable

465: way.

466:

467: The constants in our results are rather large. This is due to the generality

468: of our results. In a precise practical situation, our results could be used

469: to predict the form of the probabilistic behavior, with constants to be

470: determined empirically.

471:

472: \bigskip

473:

474: The next section contains the proofs of our theorems.

475: %%%%%%%%%%%%%%%%%%%%%%%%%

476: % Section 2

477:

478: \section{Proofs}

479:

480:

481: \subsection{A deviation bound for the empirical

482: collision probability: Proof of Theorem \ref{OurResult1} }

483:

484: Our main technique will be  Talagrand's isoperimetric theory, developed

485: by Talagrand in the mid 1990s \cite{Ta}. It has had a profound

486: impact on the probabilistic theory of combinatorial optimization \cite{St}

487: (see Sections 6 - 13 of \cite{Ta} and chapter 6 of \cite{St}).

488:

489: Let $(\Omega, \mu)$ be a probability space, and let $(\Omega^m, \mu^m)$

490: be the product space. For $x \in \Omega^m$ and $A \subset \Omega^m$,

491: Talagrand's convex distance $d_T(x, A)$ is defined by

492: \[     d_T(x, A) = \sup_{\alpha} \left \{z_\alpha  = \inf_{y \in A}

493: \left\{\sum_{j = 1}^m \alpha_j \ {\bf 1}(x_j \not= y_j) \right\} \ : \

494:  \alpha = (\alpha_1, \ldots, \alpha_m), \ \sum_{j = 1}^m \alpha_j^2 \leq 1

495: \right\},

496: \]

497: where $x$ = $(x_1, \ldots , x_m)$,  $y$ = $(y_1, \ldots , y_m)$.

498: Here, ${\bf 1}(x_j \not= y_j)$ = 1 if $x_j$ $\not=$ $y_j$, and it is 0

499: otherwise.

500:

501: \begin{thm}

502: \label{Ta}

503: {\rm (Talagrand 1995)} \

504: For every $A \subset \Omega^m$ with $\mu^m(A) > 0$, we have

505: $$\int_{\Omega^m} \exp \left(\frac{1}{4} d_T (x, A)^2 \right) d \mu^m(x)

506:   \  \leq \   \frac{1}{\mu^m(A)},$$

507: and consequently, we have for all $s > 0$,

508: $${\sf P} \left \{ d_T(x, A) \geq s \right \} \ \leq \

509: \frac{e^{-s^2/4}}{\mu^m(A)}.$$

510: \end{thm}
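\noindent The second inequality follows from the first by Markov's inequality (a standard step, written out here for convenience): for all $s > 0$,
\[ {\sf P} \left\{ d_T(x, A) \geq s \right\} \ = \ {\sf P} \left\{ \exp \left( \frac{1}{4} d_T(x, A)^2 \right) \geq e^{s^2/4} \right\} \ \leq \ e^{-s^2/4} \int_{\Omega^m} \exp \left( \frac{1}{4} d_T(x, A)^2 \right) d \mu^m(x) \ \leq \ \frac{e^{-s^2/4}}{\mu^m(A)} . \]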

511:

512: \medskip

513:

514: \noindent

515: To apply Talagrand's theorem to our situation we define a set

516: $A \subseteq U^m$ by

517: \[      A \ = \  \left\{y \in U^m  \ : \  \left|

518: \sum_{i=1}^n \frac{k_i(y)(k_i(y) - 1)}{m(m - 1)} \cdot

519: \frac{1}{\|p\|^2} - 1 \right|

520: \leq 3\epsilon

521: \right\}.        \]

522:

523: \begin{lem}

524: \label{SubsetA} \

525: For all $n > 24$ we have \  ${\sf P}(A) \geq \frac{9}{10}.$

526: \end{lem}

527: {\bf Proof.} \ Recall that $m = \epsilon^{-2}n^{1 + \delta}$  with

528: $\frac{1}{3} > \epsilon > 0$, \ $\delta > 0$.

529: Letting $\beta = \frac{- 2 \log \epsilon}{\log n}$ and

530: $\lambda = 1/2 + \delta$, we rewrite $m$ as $n^{1/2 + \beta + \lambda}$.

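531: Then $\frac{3}{n^{\beta/2}} = 3\epsilon$, and since $\lambda > \frac{1}{2}$ and $n > 24$ we have $\frac{4}{9n^{\lambda}} < \frac{4}{9\sqrt{n}} < \frac{1}{10}$; hence the lemma follows from the theorem of Goldreich and Ron.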

532:  \ \ \ $\Box$

533:

534: \bigskip

535:

536: \noindent For every $s > 0$ we define a set $C_s \subseteq U^m$ by

537: $$C_s = \{ x \in U^m : d_T(x, A) < s \}.$$

538: By Theorem \ref{Ta} and Lemma \ref{SubsetA} we have for all $n > 24$

539: and all $s > 0$

540: \begin{equation}

541: \label{SubsetC}

542: {\sf P}(C_s) \geq 1 - \frac{10}{9}e^{-s^2/4}.

543: \end{equation}
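\noindent In detail (a one-line check, added for convenience): since $C_s$ is the complement of the set $\{x \in U^m : d_T(x, A) \geq s\}$, and ${\sf P}(A) \geq \frac{9}{10}$,
\[ {\sf P}(C_s) \ = \ 1 - {\sf P} \left\{ d_T(x, A) \geq s \right\} \ \geq \ 1 - \frac{e^{-s^2/4}}{{\sf P}(A)} \ \geq \ 1 - \frac{10}{9}e^{-s^2/4} . \]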

544: \begin{lem}

545: \label{ExpandingA} \

546: For every $x = (x_1, \ldots, x_m) \in C_s$ there is

547: $y = (y_1, \ldots, y_m) \in A$ such that

548:

549:   $$\sum_{j = 1}^m {\bf 1}(x_j \not= y_j) \ \leq \ sm^{1/2}.$$

550: \end{lem}

551: {\bf Proof.} \ Assume, by contradiction, that there is $x \in C_s$ such that

552: for all $y \in A$, \ \

553:

554: \smallskip

555:

556: $\sum_{j = 1}^m {\bf 1}(x_j \not= y_j) > sm^{1/2}.$

557:

558: \smallskip

559:

560: \noindent Now, if we take $\alpha = (\alpha_1, \ldots , \alpha_m)$

561: $ = (m^{-1/2}, \ldots , m^{-1/2})$ in the definition of the Talagrand

562: distance $d_T$,

563: the inequality above implies  \ $d_T(x, A) \geq s$. But since $x \in C_s$,

564: we also have $d_T(x, A) < s$, a contradiction.  \ \ \ $\Box$

565:

566: \bigskip
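\noindent To make the choice of $\alpha$ in the proof above explicit (an added remark): for $\alpha = (m^{-1/2}, \ldots, m^{-1/2})$ we have $\sum_{j=1}^m \alpha_j^2 = 1$, and
\[ z_\alpha \ = \ \inf_{y \in A} \ m^{-1/2} \sum_{j=1}^m {\bf 1}(x_j \not= y_j) \ = \ m^{-1/2} \min_{y \in A} \sum_{j=1}^m {\bf 1}(x_j \not= y_j) , \]
where the infimum is attained because $A$ is finite (as $U$ is finite) and nonempty (as ${\sf P}(A) > 0$). Hence, if every $y \in A$ differed from $x$ in more than $s m^{1/2}$ positions, we would get $d_T(x, A) \geq z_\alpha > s$.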

567:

568: Recall that for any $x = (x_1, ..., x_m)$, $y = (y_1, ..., y_m)$ $\in U^m$,

569: we defined $k_i(x)$ (resp. $k_i(y)$) to be the number of the keys (with

570: multiplicity) that are hashed into the slot $i$ for input sequence $x$, resp.

571: $y$. We define integers $s_i$ ($1 \leq i \leq n$) by

572: $$k_i(x) = k_i(y) + s_i.$$

573: \begin{lem}

574: \label{ExpandingA0} \

575: For all $x, y \in U^m$, \

576: $$\sum_{i = 1}^n |s_i| \ \leq \ 2 \sum_{j = 1}^m {\bf 1}(x_j \not= y_j).$$

577: \end{lem}

578: {\bf Proof.} \ We prove the lemma by induction on \

579: $\sum_{j = 1}^m {\bf 1}(x_j \not= y_j)$.

580:

581: \smallskip

582:

583: \noindent {\bf (0)} \ \ $\sum_{j = 1}^m{\bf 1}(x_j \not= y_j) = 0$:

584:

585: \smallskip

586:

587: \noindent Then we have $x_j = y_j$ for all $j = 1, \ldots, m$, and hence,

588: $k_i(x) = k_i(y)$ for all $i = 1, \ldots, n$. Thus, we have

589: $\sum_{i = 1}^n |s_i| = 0$, finishing the base case.

590:

591: \smallskip

592:

593: \noindent {\bf (Inductive step)} \ \ Assume \

594: $\sum_{j = 1}^m{\bf 1}(x_j \not= y_j) > 0$:

595:

596: \smallskip

597:

598: \noindent Without loss of generality we assume that $x_m \not= y_m$.

599: Now, consider $\bar x = (x_1, \ldots , x_{m - 1}, y_m)$. We write

600: $k_i(\bar x) = k_i(y) + \bar {s_i}$ for $i = 1, \ldots, n$.

601: By the induction hypothesis we have

602: \begin{equation}

603: \label{ExpandingA0_1}

604: \sum_{i=1}^n |\bar s_i| \ \leq \ 2 \sum_{j=1}^m {\bf 1}(\bar x_j \not= y_j).

605: \end{equation}

606: Since $x$ differs from $\bar x$ only in its last component, we either have

607: $h(x_m) = h(y_m)$, in which case

608: $\bar s_i = s_i$ for all $i = 1, \ldots , n$.  Or we have

609: $h(x_m) \neq h(y_m)$; let $i_1 = h(x_m)$ and $i_2 = h(y_m)$. Then

610:  \ $\bar s_{i_1} = s_{i_1} - 1$, \  $\bar s_{i_2} = s_{i_2} + 1$, and

611: $\bar s_i = s_i$ for all $i \in \{1, \ldots , n \} \setminus \{i_1, i_2\}$.

612: In both cases,

613: \begin{equation}

614: \label{ExpandingA0_2}

615: \left| \sum_{i = 1}^n |\bar {s_i}| -

616: \sum_{i = 1}^n |s_i| \right| \leq 2.

617: \end{equation}

618: On the other hand,

619: $$\sum_{j = 1}^m {\bf 1}(x_j \not= y_j)  \ = \

620: \sum_{j = 1}^m {\bf 1}(\bar x_j \not= y_j) + 1.$$

621: Combining this, (\ref{ExpandingA0_1}), and (\ref{ExpandingA0_2}), completes

622: the proof for the inductive step.  \ \ \ $\Box$
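\noindent Explicitly, the three relations combine as follows (written out for convenience; note that $\bar x$ differs from $y$ in one position fewer than $x$ does, so the induction hypothesis applies to $\bar x$):
\[ \sum_{i=1}^n |s_i| \ \leq \ \sum_{i=1}^n |\bar s_i| + 2 \ \leq \ 2 \sum_{j=1}^m {\bf 1}(\bar x_j \not= y_j) + 2 \ = \ 2 \left( \sum_{j=1}^m {\bf 1}(x_j \not= y_j) - 1 \right) + 2 \ = \ 2 \sum_{j=1}^m {\bf 1}(x_j \not= y_j) . \]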

623:

624: \begin{lem}

625: \label{ExpandingA1} \

626: For every $x \in C_s$ there is $y \in A$ such that for all $n > 24$,

627: $0 < \epsilon < 1/3$, $s > 0$, and $m = \epsilon^{-2}n^{1+\delta}$,

628:  we have

629: $$\left|

630: \sum_{i = 1}^{n} \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} -

631: \sum_{i = 1}^{n} \frac{k_i(y)(k_i(y) - 1)}{m(m - 1)}

632: \right| \ \leq \ \epsilon \|p\|^2 \left(

633: \frac{6s}{n^{\delta/2}} + \frac{5s^2\epsilon}{n^{\delta}} \right).$$

634: \end{lem}

635: {\bf Proof.} \ For any fixed $x \in C_s$ we take $y \in A$ according to

636: Lemma \ref{ExpandingA}.  That is,

637: \begin{equation}

638: \label{ExpandingA1_1}

639: \sum_{j = 1}^m {\bf 1}(x_j \not= y_j) \ \leq \ s m^{1/2}.

640: \end{equation}

641: As in the proof for Lemma \ref{ExpandingA0} we use the notation

642: $k_i(x)$, $k_i(y)$, and $s_i$ \ ($i = 1, \ldots , n$).

643: We will leave the common denominator $m(m-1)$ out of the computations

644: until the end:

645:

646: \medskip

647:

648: $| \sum_{i=1}^n k_i(x)(k_i(x) - 1) \ - \ \sum_{i=1}^n k_i(y)(k_i(y)-1) |$

649:

650: \medskip

651:

652: $= |\sum_{i=1}^n (k_i(y) +s_i)(k_i(y)+s_i-1) \ -  \ $

653: $\sum_{i=1}^n k_i(y)(k_i(y)-1) |$

654:

655: \medskip

656:

657: $= |\sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ $

658:   $[(k_i(y) +s_i)(k_i(y)+s_i-1) - k_i(y)(k_i(y)-1)] $

659:

660: \smallskip

661:

662:  \hspace{3in}  $ \ + \ \sum_{1\leq i \leq n, \, k_i(y) = 0} \ s_i(s_i -1) |$

663:

664: \medskip

665:

666: $\leq \ \sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ 2\, |s_i|(k_i(y)-1)$

667:   $ \ + \ \sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ (s_i^2 + |s_i|)$

668:   $ \ + \ | \sum_{1\leq i \leq n, \, k_i(y) =0} \ s_i(s_i -1)|$

669:

670: \medskip

671:

672: $\leq \ \sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ 2\, |s_i|(k_i(y)-1)$

673: $ \ + \ \sum_{i=1}^n (s_i^2 + |s_i|).$

674:

675: \medskip

676:

677: \noindent By the Cauchy-Schwarz inequality, this is bounded by

678:

679: \smallskip

680:

681: $\leq \ 2 \ (\sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ s_i^2)^{1/2}$

682:        $(\sum_{1\leq i \leq n, \, k_i(y) \geq 1} \ (k_i(y)-1)^2)^{1/2}$

683:    $ \ + \ \sum_{i=1}^n (s_i^2 + |s_i|)$

684:

685: \medskip

686:

687: $\leq \ 2 \ (\sum_{i=1}^n s_i^2)^{1/2}$

688:   $(\sum_{i=1}^n k_i(y)(k_i(y)-1))^{1/2}$

689:  $ \ + \ \sum_{i=1}^n (s_i^2 + |s_i|).$
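\noindent The first inequality in the chain above rests on the following identity (a routine expansion, recorded here for convenience): writing $k = k_i(y)$ and $t = s_i$,
\[ (k + t)(k + t - 1) - k(k - 1) \ = \ 2kt + t^2 - t \ = \ 2t(k - 1) + t^2 + t , \]
so for $k \geq 1$ the absolute value of each summand is at most $2 |t| (k - 1) + t^2 + |t|$. When $k_i(y) = 0$ we have $s_i = k_i(x) \geq 0$, hence $0 \leq s_i(s_i - 1) \leq s_i^2 + |s_i|$. The last step of the chain uses $(k - 1)^2 \leq k(k - 1)$ for $k \geq 1$.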

690:

691: \medskip

692:

693: \noindent By Lemma \ref{ExpandingA0} and (\ref{ExpandingA1_1}) we have

694: \begin{equation}

695: \label{ExpandingA1_4}

696: \sum_{i=1}^n s_i^2 \ \leq \ \left(\sum_{i=1}^n |s_i| \right)^2 \ \leq \

697: \left(2 \sum_{j=1}^m {\bf 1}(x_j \not= y_j) \right)^2 \ \leq \ 4 s^2 m.

698: \end{equation}

699: Since $y \in A$ we have

700: $$\sum_{i=1}^n \frac{k_i(y)(k_i(y) - 1)}{m (m - 1)} \leq

701: \|p\|^2 \left(1 + 3\epsilon \right).$$

702: Hence, by all the above:

703: $$\left| \sum_{i=1}^n \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} -

704: \sum_{i=1}^n \frac{k_i(y)(k_i(y) - 1)}{m(m - 1)} \right|$$

705: $$\leq \ \frac{4s}{(m - 1)^{1/2}} \cdot \|p\|

706: \left(1+ 3\epsilon \right)^{1/2} + \frac{4s^2}{m-1} + \frac{2s}{m^{1/2}(m-1)}.$$

707:

708: \noindent By calculating, and using the fact that $\|p\|^2 \geq \frac{1}{n}$,

709: $0 < \epsilon < 1/3$, and $m = \epsilon^{-2}n^{1 + \delta}$, we find the

710: following upper bound for \

711: $\left|

712: \sum_{i = 1}^{n} \frac{k_i(x)(k_i(x) - 1)}{m(m - 1)} -

713: \sum_{i = 1}^{n} \frac{k_i(y)(k_i(y) - 1)}{m(m - 1)}

714: \right| \ $:

715: \[

716:   \frac{s\epsilon}{n^{\delta/2}} \ \|p\|^2 \

717:  \frac{4(1+3\epsilon)^{1/2}n^{1/2}}{(n - \epsilon^2 n^{-\delta})^{1/2} }

718:  \ + \

719: \frac{s\epsilon}{n^{\delta/2}} \ \|p\|^2 \

720:   \frac{2 \epsilon^2n}{n^{1/2} (n^{1+\delta} - \epsilon^2)}

721: \ + \

722:  s^2\epsilon^2 \ \|p\|^2 \

723:  \frac{4n}{n^{1+\delta} - \epsilon^2}

724: \]

725: \noindent Combining this and using $n > 24$ we obtain the upper bound \

726: $$\epsilon \|p\|^2 (\frac{6s}{n^{\delta/2}} +

727: \frac{5s^2 \epsilon}{n^{\delta}}).$$

728: $\Box$
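\noindent The final numerical step of the proof above can be checked as follows (a verification sketch added for convenience; the intermediate constants are not part of the original argument). Since $n > 24$ is an integer, we have $n^{1+\delta} \geq n \geq 25$, and $\epsilon^2 < \frac{1}{9}$, so
\[ \frac{n^{1+\delta}}{n^{1+\delta} - \epsilon^2} \ \leq \ \frac{25}{25 - \frac{1}{9}} \ = \ \frac{225}{224} . \]
Using $m - 1 = \epsilon^{-2}(n^{1+\delta} - \epsilon^2)$, \ $\|p\| \leq \sqrt{n} \, \|p\|^2$, \ $1 \leq n \|p\|^2$ and $(1 + 3\epsilon)^{1/2} < \sqrt{2}$, the two terms that are linear in $s$ satisfy
\[ \frac{4s \, \|p\| \, (1+3\epsilon)^{1/2}}{(m-1)^{1/2}} + \frac{2s}{m^{1/2}(m-1)} \ \leq \ \frac{\epsilon s}{n^{\delta/2}} \, \|p\|^2 \left( 4\sqrt{2} \, \sqrt{\frac{225}{224}} + \frac{2}{9} \cdot \frac{1}{5} \cdot \frac{225}{224} \right) \ < \ \frac{6 \epsilon s}{n^{\delta/2}} \, \|p\|^2 , \]
because the quantity in parentheses is less than $5.67 + 0.05$. Similarly,
\[ \frac{4s^2}{m-1} \ \leq \ \frac{4 s^2 \epsilon^2}{n^{\delta}} \, \|p\|^2 \cdot \frac{225}{224} \ < \ \frac{5 s^2 \epsilon^2}{n^{\delta}} \, \|p\|^2 . \]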

729:

730: \bigskip

731:

732: \noindent

733: {\bf Proof of Theorem \ref{OurResult1}.} \

734: The theorem follows from the definition of $A$, inequality (\ref{SubsetC}),

735: and Lemma \ref{ExpandingA1}. \ \ \ $\Box$

736:

737:

738: %%%%%%%%%%%%%%%%%%%%%%%%%

739:

740: \subsection{Average search time for a particular user}

741:

742: \noindent

743: {\bf Proof of Corollary \ref{OurResult2}.} \

744: Recall that the average search time AST$(v,x)$ is bounded from above by \

745: $\sum_{i = 1}^n v_i \cdot k_i(x)$.

746: In Theorem \ref{OurResult1} let us write $m = L_1 L_2 \, n$, and choose

747: $$\epsilon = \frac{1}{\sqrt{L_1}}~~{\rm and}~~\delta = \frac{\log L_2}{\log n}.$$

748: Note that for all $i$, \

749:

750: \smallskip

751:

752: $k_i(x) - 1 \leq \sqrt{k_i(x)(k_i(x) - 1)}$ \

753:

754: \smallskip

755:

756: \noindent since the left side is at most 0 when $k_i(x) = 0$ or 1. Therefore,

757:

758: \smallskip

759:

760: ${\rm AST}(v,x) \leq \ $

761: $\sum_{i=1}^n v_i \cdot k_i(x) \ = \ \sum_{i=1}^n v_i (k_i(x) - 1) + 1 \  $

762: $\leq \ \sqrt{\sum_{i=1}^n v_i^2} \sqrt{\sum_{i=1}^n (k_i(x) - 1)^2} +1$

763:

764: \medskip

765:

766: $\leq \ \sqrt{\sum_{i=1}^n v_i^2} \sqrt{\sum_{i=1}^n k_i(x)(k_i(x) - 1)} + 1$

767: $ \ = \ \|v\| \, \|p\| \, \sqrt{m(m-1)}$

768:   $\sqrt{\sum_{i=1}^n \frac{k_i(x)(k_i(x)-1)}{m(m-1)} \, \frac{1}{\|p\|^2}} + 1$.

769:

770: \medskip

771:

772: \noindent The corollary follows from this and Theorem \ref{OurResult1}.

773:  \ \ \   $\Box$

774:

775: \bigskip

776:

777: \noindent

778: {\bf Remark.} \ Our proof method depends crucially on Talagrand's theorem.

779: Many readers, more familiar with techniques like the Chernoff bound, or more

780: generally, the Hoeffding inequality for martingale differences (from which

781: the Chernoff bound follows directly), may wonder whether these simpler

782: techniques would also work here. In order to apply Hoeffding's inequality we

783: could view $\sum_{i=1}^n v_i \cdot k_i(x)$ as a weighted sum of the random

784: variables $k_i(x)$; to apply Hoeffding one needs to bound $|k_i(x)|$, but

785: we don't have good bounds a priori; finding good bounds on $|k_i(x)|$ seems

786: harder and less promising than our method, based on Talagrand's theorem.

787: See, e.g., Michael Steele's book \cite{St}, which discusses the advantages

788: of applying Talagrand's theorem at length.

789:

790:

791:

792:

793:

794:

795: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

796: % Bibliography

797:

798: \begin{thebibliography}{60}

799:

800: \bibitem{CLRS}

801: T.H.\ Cormen, C.E.\ Leiserson, R.L.\ Rivest, C.\ Stein,

802: {\it Introduction to Algorithms}, 2nd ed., McGraw-Hill, 2001.

803:

804: \bibitem{GR}

805: O.\ Goldreich and D.\ Ron, ``On testing expansion in bounded-degree graphs'',

806: Technical Report TR00-020, ECCC, 2000.

807:

808: \bibitem{Kn}

809: D.E.\ Knuth, {\it Sorting and Searching}, 2nd ed., Addison-Wesley, 1998.

810:

811: \bibitem{SKS}

812: A.\ Silberschatz, H.F.\ Korth, S.\ Sudarshan,

813: {\it Database System Concepts}, 4th ed., McGraw-Hill, 2002.

814:

815: \bibitem{St}

816: M.\ Steele, {\it Probability Theory and Combinatorial Optimization},

817: SIAM, 1997.

818:

819: \bibitem{Ta}

820: M.\ Talagrand, ``Concentration of measure and isoperimetric inequalities

821: in product spaces'',

822: {\it Institut des Hautes \'Etudes Scientifiques, Publications

823: Math\'ematiques,} 81 (1995) 73-205.

824:

825: \end{thebibliography}

826:

827:

828: \end{document}

829: