0502:cs0502032/full.tex

1: \documentclass[11pt,letterpaper]{article}

2: \usepackage{fullpage, comment}

3: \usepackage{amsmath, amsthm}

4:

5: \newtheorem{theorem}{Theorem}

6: \newtheorem{lemma}[theorem]{Lemma}

7:

8: % Avoid line breaks before citations (\cite) and references (\ref)

9: \let\latexcite=\cite

10: \def\cite{\nolinebreak\latexcite}

11: \let\latexref=\ref

12: \def\ref{\nolinebreak\latexref}

13:

14:

15: \let\epsilon=\varepsilon

16:

17: \newenvironment{description*}%

18:   {\vspace{-2ex}

19:    \begin{description}%

20:     \setlength{\itemsep}{-1ex}%

21:     \setlength{\parsep}{0pt}}%

22:   {\end{description}}

23:

24: \newenvironment{itemize*}%

25:   {\vspace{-2ex}

26:    \begin{itemize}%

27:     \setlength{\itemsep}{-1ex}%

28:     \setlength{\parsep}{0pt}}%

29:   {\end{itemize}

30:    \vspace{-1ex}}

31:

32: \newcommand{\func}[1] {\texttt{#1}}

33:

34: \begin{document}

35:

36: \title{On Dynamic Range Reporting in One Dimension}

37:

38: \author{

39:         Christian Worm Mortensen%

40: \footnote{Part of this work was done while

41:   the author was visiting the Max-Planck-Institut f\"ur Informatik,

42:   Saarbr\"ucken, as a Marie Curie doctoral fellow.} \\

43: \small  IT U. Copenhagen \\

44:         \texttt{cworm@itu.dk}

45: \and

46:         Rasmus Pagh \\

47: \small  IT U. Copenhagen \\

48:         \texttt{pagh@itu.dk}

49: \and

50:         Mihai P\v{a}tra\c{s}cu \\

51: \small  MIT \\

52:         \texttt{mip@mit.edu}

53: }

54:

55: \maketitle

56:

57:

58: \begin{abstract}

59:   We consider the problem of maintaining a dynamic set of integers and

60:   answering queries of the form: report a point (equivalently, all

61:   points) in a given interval. Range searching is a natural and

62:   fundamental variant of integer search, and can be solved using

63:   predecessor search. However, for a RAM with $w$-bit words, we show

64:   how to perform updates in $O(\lg w)$ time and answer queries in

65:   $O(\lg\lg w)$ time. The update time is identical to the van Emde

66:   Boas structure, but the query time is exponentially faster. Existing

67:   lower bounds show that achieving our query time for predecessor

68:   search requires doubly-exponentially slower updates. We present some

69:   arguments supporting the conjecture that our solution is optimal.

70:

71:   Our solution is based on a new and interesting recursion idea which

72:   is ``more extreme'' than the van Emde Boas recursion. Whereas van

73:   Emde Boas uses a simple recursion (repeated halving) on each path in

74:   a trie, we use a nontrivial, van Emde Boas-like recursion on every

75:   such path. Despite this, our algorithm is quite clean when seen from

76:   the right angle. To achieve linear space for our data structure, we

77:   solve a problem which is of independent interest. We develop the

78:   first scheme for dynamic perfect hashing requiring sublinear

79:   space. This gives a dynamic Bloomier filter (an approximate storage

80:   scheme for sparse vectors) which uses low space. We strengthen

81:   previous lower bounds to show that these results are optimal.

82: \end{abstract}

83:

84:

85: \section{Introduction}

86:

87: Our problem is to maintain a set $S$ under insertions and deletions of

88: values, and a range reporting query. The query $\func{findany}(a,b)$

89: should return an arbitrary value in $S \cap [a,b]$, or report that $S

90: \cap [a,b] = \emptyset$. This is a form of existential range query.

91: In fact, since we only consider update times above the predecessor

92: bound, updates can maintain a linked list of the values in $S$ in

93: increasing order. Given a value $x \in S \cap [a,b]$, one can traverse

94: this list in both directions starting from $x$ and list all values in

95: the interval $[a,b]$ in constant time per value.  Thus, the

96: $\func{findany}$ query is equivalent to one-dimensional range

97: reporting.
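
As a small illustration (the concrete values are hypothetical and serve
only as an example), let $S = \{3, 8, 15, 21\}$ and consider the
interval $[5, 20]$. A call $\func{findany}(5, 20)$ may return either
$8$ or $15$; suppose it returns $15$. Walking the sorted list from $15$
towards smaller values yields $8$ and then stops at $3 < 5$, while
walking towards larger values stops immediately at $21 > 20$:
\[
  3 \;\big|\; \underbrace{8 \leftarrow 15}_{S \cap [5,20]} \;\big|\; 21 .
\]
Each reported value costs one pointer traversal, plus $O(1)$ work for
the two values that end the walk.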

98:

99: The model in which we study this problem is the word RAM. We assume

100: the elements of $S$ are integers that fit in a word, and let $w$ be

101: the number of bits in a word (thus, the ``universe size'' is $u =

102: 2^w$). We let $n = |S|$. Our data structure will use Las Vegas

103: randomization (through hashing), and the bounds stated will hold with

104: high probability in $n$.

105:

106: Range reporting is a very natural problem, and its higher-dimensional

107: versions have been studied for decades. In one dimension, the problem

108: is easily solved using predecessor search. The predecessor problem has

109: also been studied intensively, and the known bounds are now tight in

110: almost all cases \cite{beame02predecessor}. Another well-studied

111: problem related to ours is the lookup problem (usually solved by

112: hashing), which asks to find a key in a set of values. Our problem is

113: more general than the lookup problem, and less general than the

114: predecessor problem. While these two problems are often dubbed ``the

115: integer search problems'', we feel range reporting is an equally

116: natural and fundamental incarnation of this idea, and deserves similar

117: attention.

118:

119: The first to ask whether or not range reporting is as hard as finding

120: predecessors were Miltersen et al.~in STOC'95

121: \cite{miltersen99asymmetric}. For the static case, they gave a data

122: structure with space $O(nw)$ and constant query time, which cannot be

123: achieved for the predecessor problem with polynomial space. An even

124: more surprising result from STOC'01 is due to Alstrup, Brodal and

125: Rauhe \cite{alstrup01range}, who gave an optimal solution for the

126: static case, achieving linear space and constant query time. In the

127: dynamic case, however, no solution better than the predecessor problem

128: was known. For this problem, the fastest known solution in terms of

129: $w$ is the classic van Emde Boas structure \cite{veb77predecessor},

130: which achieves $O(\lg w)$ time per operation.

131:

132: For the range reporting problem, we show how to perform updates in

133: $O(\lg w)$ time, while supporting queries in $O(\lg\lg w)$ time. The

134: space usage is optimal, i.e. $O(n)$ words. The update time is

135: identical to the one given by the van Emde Boas structure, but the

136: query time is exponentially faster. In contrast, Beame and Fich

137: \cite[Theorem 3.7]{beame02predecessor} show that achieving any query

138: time that is $o(\lg w / \lg\lg w)$ for the predecessor problem

139: requires update time $\Omega(2^{w^{1 - \epsilon}})$, which is

140: doubly-exponentially slower than our update time. We also give an

141: interesting tradeoff between update and query times; see Theorem

142: \ref{thm:range} below.

143:

144: Our solution incorporates some basic ideas from the previous solutions

145: to static range reporting in one dimension

146: \cite{miltersen99asymmetric, alstrup01range}. However, it brings two

147: important technical contributions. First, we develop a new and

148: interesting recursion idea which is more advanced than van Emde Boas

149: recursion (but, nonetheless, not technically involved). We describe

150: this idea by first considering a simpler problem, the bit-probe

151: complexity of the greater-than function. Then, the solution for

152: dynamic range reporting is obtained by using the recursion for this

153: simpler problem, on \emph{every path} of a binary trie of depth

154: $w$. This should be contrasted to the van Emde Boas structure, which

155: uses a very simple recursion idea (repeated halving) on every

156: root-to-leaf path of the trie. The van Emde Boas recursion is

157: fundamental in the modern world of data structures, and has found many

158: unrelated applications (e.g., exponential trees, integer sorting,

159: cache-oblivious layouts, interpolation search trees). It will be

160: interesting to see if our recursion scheme has a similar impact.

161:

162: The second important contribution of this paper is needed to achieve

163: linear space for our data structure. We develop a scheme for dynamic

164: perfect hashing, which requires sublinear space. This can be used to

165: store a sparse vector in small space, if we are only interested in

166: obtaining correct results when querying non-null positions (the

167: Bloomier filter problem). We also prove that our solution is

168: optimal. To our knowledge, this solves the last important theoretical

169: problem connected to Bloom filters. The stringent space requirements

170: that our data structure can meet are important in data-stream

171: algorithms and database systems. We mention one application below, but

172: believe others exist as well.

173:

174:

175: \subsection{Data-Stream Perfect Hashing and Bloomier Filters}

176:

177: The Bloom filter is a classic data structure for testing membership in

178: a set. If a constant rate of false-positives is allowed, the space

179: \emph{in bits} can be made essentially linear in the size of the

180: set. Optimal bounds for this problem are obtained in

181: \cite{pagh05bloom}. Bloomier filters, an extension of the classical

182: Bloom filter with a catchy name, were defined and analyzed in the

183: static case by Chazelle et al.~\cite{chazelle04bloom}. The problem is

184: to represent a vector $V[0..u-1]$ with elements from $\{ 0, \dots, 2^r

185: - 1\}$ which is nonzero in only $n$ places (assume $n \ll u$, so the

186: vector is sparse). Thus, we have a sparse set as before, but with

187: values associated to the elements.  The information theoretic lower

188: bound for representing such a vector is $\Omega(n\cdot r + \lg

189: \binom{u}{n}) \approx \Omega(n (r + \lg u))$ bits. However, if we only

190: want correct answers when $V[x] \ne 0$, we can obtain a space usage of

191: roughly $O(nr)$ bits in the static case.
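
To get a feeling for the gap, here is a rough calculation with
illustrative parameters (chosen only as an example): $u = 2^{64}$,
$n = 2^{20}$ and $r = 8$. Using the approximation
$\lg \binom{u}{n} \approx n \lg \frac{u}{n}$ for $n \ll u$, which ignores
an additive $O(n)$ term,
\[
  n \cdot r + \lg \binom{u}{n} \approx n\,(8 + 44) = 52\,n \text{ bits},
  \qquad \text{versus roughly} \qquad n \cdot r = 8\,n \text{ bits}
\]
when correct answers are only required at nonzero positions. The
constants hidden by the $O$-notation are not accounted for here.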

192:

193: For the dynamic problem, where the values of $V$ can change

194: arbitrarily at any point, achieving such low space is impossible

195: regardless of the query and update times. Chazelle et

196: al.~\cite{chazelle04bloom} proved that $\Omega(n(r + \min(\lg\lg

197: \frac{u}{n^3}, \lg n)))$ bits are needed. No non-trivial upper bound

198: was known. We give matching lower and upper bounds:

199:

200: \begin{theorem} \label{thm:bloomlb}

201: The randomized space complexity of maintaining a dynamic Bloomier

202: filter for $r\geq 2$ is $\Theta(n(r + \lg\lg \frac{u}{n}))$ bits in

203: expectation. The upper bound is achieved by a RAM data structure that

204: allows access to elements of the vector in worst-case constant time,

205: and supports updates in amortized expected $O(1)$ time.

206: \end{theorem}
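
For a sense of scale, consider the illustrative parameters
$u = 2^{64}$ and $n = 2^{20}$ (chosen only as an example). Ignoring the
constants hidden by the $\Theta$-notation, the overhead per nonzero
entry beyond the $r$ payload bits is
\[
  \lg\lg \frac{u}{n} = \lg\lg 2^{44} = \lg 44 \approx 5.5 \text{ bits},
\]
compared to the $\lg \frac{u}{n} = 44$ bits per entry that would be
needed to store the nonzero positions explicitly.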

207:

208: To detect whether $V[x] = 0$ with probability of correctness at least

209: $1-\epsilon$, one can use a Bloom filter on top. This requires space

210: $\Theta(n\lg( 1/\epsilon ))$, and also works in the dynamic case

211: \cite{pagh05bloom}. Note that even for $\epsilon = 1$, randomization

212: is essential, since any deterministic solution must use $\Omega(n

213: \lg(u/n))$ bits of space, i.e.~it must essentially store the set of

214: nonzero entries in the vector.

215:

216: With marginally more space, $O(n(r + \lg\lg u))$, we can make the

217: space and update bounds hold with high probability. To do that, we

218: analyze a harder problem, namely maintaining a perfect hash function

219: dynamically using low space. The problem is to maintain a set $S$ of

220: keys from $\{0, \dots, u-1\}$ under insertions and deletions, and be

221: able to evaluate a perfect hash function (i.e. a one-to-one function)

222: from $S$ to a small range. An element needs to maintain the same hash

223: value while it is in $S$. However, if an element is deleted and

224: subsequently reinserted, its hash value may change.

225:

226: \begin{theorem} \label{thm:hash}

227: We can maintain a perfect hash function from a set $S \subset \{ 0,

228: \dots, u-1 \}$ with $|S| \leq n$ to a range of size $n + o(n)$, under

229: $n^{O(1)}$ insertions and deletions, using $O(n\lg\lg u)$ bits of

230: space w.h.p., plus a constant number of machine words. The function

231: can be evaluated in worst-case constant time, and updates take

232: constant time w.h.p.

233: \end{theorem}

234:

235: This is the first dynamic perfect hash function that uses less space

236: than needed to store $S$ ($\lg \binom{u}{n}$ bits). Our space usage

237: is close to optimal, since the problem is harder than dynamic Bloomier

238: filtering. These operating conditions are typical of data-stream

239: computation, where one needs to support a stream of updates and

240: queries, but does not have space to hold the entire state of the data

241: structure. Quite remarkably, our solution can achieve this goal

242: without introducing errors (we use only Las Vegas randomization).

243:

244: We mention an independent application of Theorem \ref{thm:hash}.

245: In a database we can maintain an index of a relation under insertions

246: of tuples, using internal memory per tuple which is logarithmic in the

247: length of the key for the tuple. If tuples have fixed length, they can

248: be placed directly in the hash table, and need only be moved if the

249: capacity of the hash table is exceeded.

250:

251:

252: \subsection{Tradeoffs and the scheme of things} \label{scheme}

253:

254: We begin with a discussion of the greater-than problem. Consider an

255: infinite memory of bits, initialized to zero. Our problem has two

256: stages. In the update stage, the algorithm is given a number $a \in

257: [0..n-1]$. After seeing $a$, the algorithm is allowed to flip $O(T_u)$

258: bits in the memory. In the query stage, the algorithm is given a

259: number $b \in [0..n-1]$. Now the algorithm may inspect $O(T_q)$ bits,

260: and must decide whether or not $b > a$. The problem was previously

261: studied by Fredman \cite{fredman82sums}, who showed that $\max(T_u,

262: T_q) = \Omega(\lg n / \lg\lg n)$. It is quite tempting to believe that

263: one cannot improve past the trivial upper bound $T_u = T_q = O(\lg

264: n)$, since, in some sense, this is the complexity of ``writing down''

265: $a$. However, as we show in this paper, Fredman's bound is optimal, in

266: the sense that it is a point on our tradeoff curve. We give upper and

267: lower bounds that completely characterize the possible asymptotic

268: tradeoffs:

269:

270: \begin{theorem} \label{thm:bitgt}

271:   The bit-probe complexity of the greater-than function satisfies the

272:   tight tradeoffs:

273:

274:   \vspace{-4ex}

275:   \begin{eqnarray*}

276:    T_q \geq \lg\lg n,\ T_u \leq \lg n &:& T_u = \Theta(\lg_{T_q} n) \\

277:    T_q \leq \lg\lg n,\ T_u \geq \lg n &:& 2^{T_q} = \Theta(\lg_{T_u} n) \\

278:   \end{eqnarray*}

279: \end{theorem}

280: \vspace{-3ex}
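
Two points of this tradeoff can be checked by direct substitution; the
following is only a sanity check, with constant factors suppressed.
Setting $T_u = T_q = T$ in the first branch gives
\[
  T = \Theta\left( \frac{\lg n}{\lg T} \right)
  \quad\Longrightarrow\quad
  T = \Theta\left( \frac{\lg n}{\lg\lg n} \right),
\]
since $\lg T = \Theta(\lg\lg n)$ for this value of $T$. This is exactly
Fredman's bound. In the second branch, a constant $T_q$ gives
\[
  \lg T_u = \Theta\left( \frac{\lg n}{2^{T_q}} \right)
  \quad\Longrightarrow\quad
  T_u = n^{\Theta(2^{-T_q})},
\]
that is, a polynomial number of bit flips with an exponent that shrinks
as the constant $T_q$ grows.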

281:

282: As mentioned already, for dynamic range reporting we use the same

283: recursion idea as in the greater-than algorithm, except that we apply

284: this recursion to every root-to-leaf path of a binary trie of depth

285: $w$. Quite remarkably, these structures can be made to overlap

286: inasmuch as the paths overlap, so only one update suffices for all

287: paths going through a node. Due to this close relation, we view the

288: lower bounds for the greater-than function as giving an indication

289: that our range reporting data structure is likewise optimal. In any

290: case, the lower bounds show that markedly different ideas would be

291: necessary to improve our solution for range reporting.

292:

293: Let $T_{pred}$ be the time needed by one update and one query in the

294: dynamic predecessor problem. The following theorem summarizes our

295: results for dynamic range reporting:

296:

297: \begin{theorem} \label{thm:range}

298:   There is a data structure for the dynamic range reporting problem,

299:   which uses $O(n)$ space and supports updates in time $O(T_u)$, and

300:   queries in time $O(T_q)$, for all $T_u, T_q$ satisfying:

301:

302:   \vspace{-3ex}

303:   \begin{eqnarray*}

304:     T_q \geq \lg\lg w,\ \frac{\lg w}{\lg\lg w} \leq T_u \leq \lg w

305:       &:& T_u = O(\lg_{T_q} w) + T_{pred} \\

306:     T_q \leq \lg\lg w,\phantom{\ \ \frac{\lg w}{\lg\lg w} \leq}

307:       T_u \geq \lg w &:& 2^{T_q} = O(\lg_{T_u} w) \\

308:   \end{eqnarray*}

309: \end{theorem}

310: \vspace{-3ex}

311:

312: Notice that the most appealing point of the tradeoff is the cross-over

313: of the two curves: $T_u = O(\lg w)$ and $T_q = O(\lg\lg w)$ (and

314: indeed, this has been the focus of our discussion). Another

315: interesting point is at constant query time. In this case, our data

316: structure needs $O(w^{\epsilon})$ update time. Thus, our data

317: structure can be used as an optimal static data structure, which is

318: constructed in time $O(n w^{\epsilon})$, improving on the construction

319: time of $O(n \sqrt{w})$ given by Alstrup et al.~\cite{alstrup01range}.
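
The exponent $\epsilon$ comes from the second branch of Theorem
\ref{thm:range}. As a sketch, treat that bound as an equality with a
hidden constant $c > 0$ (an assumption made only for illustration):
\[
  2^{T_q} = c \cdot \frac{\lg w}{\lg T_u}
  \quad\Longleftrightarrow\quad
  T_u = w^{\,c / 2^{T_q}} .
\]
For any constant $T_q$ this is $w^{\epsilon}$ with
$\epsilon = c / 2^{T_q}$, and $\epsilon$ can be made an arbitrarily small
constant by increasing the constant $T_q$. Performing $n$ insertions
then costs $O(n w^{\epsilon})$ time in total.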

320:

321: The first branch of our tradeoff is not interesting with $T_{pred} =

322: \Theta(\lg w)$. However, it is generally believed that one can achieve

323: $T_{pred} = \Theta( \lg w / \lg\lg w)$, matching the optimal bound for

324: the static case. If this is true, the $T_{pred}$ term can be ignored.

325: In this case, we observe a very interesting relation between our

326: problem and the predecessor problem. When $T_u = T_q$, the bounds we

327: achieve are identical to the ones for the predecessor problem, i.e.

328: $T_u = T_q = O(\lg w / \lg\lg w)$. However, if we are interested in

329: the possible tradeoffs, the gap between range reporting and the

330: predecessor problem quickly becomes huge. The same situation appears

331: to be true for deterministic dictionaries with linear space, though

332: the known tradeoffs are not as general as ours. We set forth the bold

333: conjecture (the proof of which requires many missing pieces) that all

334: three search problems are united by an optimal time of $\Theta(\lg w /

335: \lg\lg w)$ at this point of their tradeoff curves.

336:

337: We can achieve bounds in terms of $n$, rather than $w$, by the classic

338: trick of using our structure for small $w$ and a fusion tree structure

339: \cite{fredman93fusion} for large $w$. In particular, we can achieve

340: $T_q = O(\lg\lg n)$ and $T_u = O\left( \frac{\lg n}{\lg\lg n}

341: \right)$. Compared with the optimal bound for the predecessor problem

342: of $\Theta\left( \sqrt{\frac{\lg n}{\lg\lg n}} \right)$, our data

343: structure improves the query time exponentially by sacrificing the

344: update time quadratically.
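
Both comparisons can be verified directly. Write
$P = \sqrt{\lg n / \lg\lg n}$ for the predecessor bound (the symbol $P$
is introduced only for this calculation). Then
\[
  P^2 = \frac{\lg n}{\lg\lg n},
  \qquad
  \lg P = \tfrac{1}{2} \left( \lg\lg n - \lg\lg\lg n \right)
        = \Theta(\lg\lg n),
\]
so, up to constant factors, the update time $T_u$ is the square of $P$
and the query time $T_q$ is logarithmic in $P$.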

345:

346: \begin{comment}

347:   Specifically, we use \cite{exptrees}, which gives a

348:   bound of $O(\lg_w n + \lg\lg n)$. We obtain the interesting tradeoff

349:   $T_u \cdot T_q \lg T_q = O(\lg n)$ for $\Omega(\lg\lg n) < T_q <

350:   O\left( \sqrt{\frac{\lg n}{\lg\lg n}} \right)$.

351: \end{comment}

352:

353:

354: \section{Data-Stream Perfect Hashing}

355:

356: We denote by $S$ the set of values that we need to hash at the present

357: time. Our data structure has the following parts:

358:

359: \begin{itemize}

360: \item A hash function $\rho: \{0,\dots,u-1 \} \rightarrow

361:   \{0,1\}^{v}$, where $v = O(\lg n)$, from a family of universal hash

362:   functions with small representations (for example, the one

363:   from \cite{dietzfel96universal}).

364:

365: \item A hash function $\phi: \{0,1\}^{v} \rightarrow \{1,\dots,r\}$,

366:   where $r=\lceil n/\lg^2 n \rceil$, taken from Siegel's class of

367:   highly independent hash functions \cite{siegel04hash}.

368:   % note: used to be thm 3 in \cite{ncstrl.nyu_cs//TR1995-684}

369:

370: \item An array of hash functions $h_1,\dots,h_r: \{0,1\}^v \rightarrow

371:   \{0,1\}^s$, where $s=\lceil (6+2c)\lg\lg u \rceil$, chosen

372:   independently from a family of universal hash functions; $c$ is a

373:   constant specified below.

374:

375: \item A high performance dictionary \cite{dietzfel90highperf} for a

376:   subset $S'$ of the keys in $S$. The dictionary should have a

377:   capacity of $O(\lceil n/\lg u \rceil)$ keys (but might expand

378:   further). Along with the dictionary we store a linked list of length

379:   $O(\lceil n/\lg u \rceil)$, specifying certain vacant positions in

380:   the hash table.

381:

382: \item An array of dictionaries $D_1,\dots,D_r$, where $D_i$ is a

383:   dictionary that holds $h_i(\rho(k))$ for each key $k\in S \setminus

384:   S'$ with $\phi(\rho(k))=i$. A unique value in $\{0,\dots,j-1\}$,

385:   where $j=(1+o(1))\lg^2 n$, is associated with each key in $D_i$. A

386:   bit vector of $j$ bits and an additional string of $\lg n$ bits is

387:   used to keep track of which associated values are in use. We will

388:   return to the exact choice of $j$ and the implementation of the

389:   dictionaries.

390: \end{itemize}

391:

392: The main idea is that all dictionaries in the construction assign to

393: each of their keys a unique value within a subinterval of $[1 .. m]$, where $m = n + o(n)$ is the size of the range.

394: Each of the dictionaries $D_1, \dots, D_r$ is responsible for an

395: interval of size $j$, and the high performance dictionary is

396: responsible for an interval of size $O(n/\lg u) = o(n)$.

397:

398: The hash function $\rho$ is used to reduce the key length to $v$. The

399: constant in $v = O(\lg n)$ can be chosen such that with high

400: probability, over a polynomially bounded sequence of updates, $\rho$

401: will never map two elements of $S$ to the same value (the conflicts,

402: if they occur, end up in $S'$ and are handled by the high performance

403: dictionary).

404:

405: When inserting a new value $k$, the new key is included in $S'$ if

406: either:

407:

408: \begin{itemize}

409: \item There are $j$ keys in $D_i$, where $i=\phi(\rho(k))$, or

410:

411: \item There exists a key $k'\in S$ where

412:   $\phi(\rho(k))=\phi(\rho(k'))=i$ and $h_i(\rho(k))=h_i(\rho(k'))$.

413: \end{itemize}

414:

415: Otherwise $k$ is associated with the key $h_i(\rho(k))$ in $D_i$.

416: Deletion of a key $k$ is done in $S'$ if $k\in S'$, and otherwise the

417: associated key in the appropriate $D_i$ is deleted.

418:

419: To evaluate the perfect hash function on a key $k$ we first see

420: whether $k$ is in the high performance dictionary. If so, we return

421: the value associated with $k$. Otherwise we compute $i=\phi(\rho(k))$

422: and look up the value $\Delta$ associated with the key $h_i(\rho(k))$

423: in $D_i$. Then we return $(i-1)j+\Delta$, i.e., position $\Delta$

424: within the $i$-th interval.
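
As a worked example with hypothetical numbers, take $n = 2^{16}$ and
$j = \lceil \lg^2 n + \lg^{5/3} n \rceil$, the value used in the
analysis below. Then
\[
  \lg^2 n = 256, \qquad
  r = \lceil n / \lg^2 n \rceil = 256, \qquad
  j = \lceil 256 + 16^{5/3} \rceil = \lceil 256 + 101.59\ldots \rceil = 358 .
\]
A key $k \notin S'$ with $i = \phi(\rho(k)) = 3$ whose entry in $D_3$
carries the associated value $\Delta = 17$ is hashed to
\[
  (i-1)j + \Delta = 2 \cdot 358 + 17 = 733 .
\]
For such a small $n$ the lower-order term in $j$ is still noticeable;
the bound $j = (1+o(1))\lg^2 n$ is asymptotic.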

425:

426: Since $D_1,\dots,D_r$ store keys and associated values of $O(\lg\lg

427: u)$ bits, they can be efficiently implemented as constant depth search

428: trees of degree $w^{\Omega(1)}$, where each internal node resides in a

429: single machine word. This yields constant time for dictionary

430: insertions and lookups, with an optimal space usage of $O(\lg^2

431: n\lg\lg u)$ bits for each dictionary.  We do not go into details of

432: the implementation as they are standard; refer to \cite{hagerup98ram}

433: for explanation of the required word-level parallelism techniques.

434:

435: What remains to describe is how the dictionaries keep track of vacant

436: positions in the hash table in constant time per insertion and

437: deletion. The high performance dictionary simply keeps a linked list

438: of all vacant positions in its interval. Each of $D_1,\dots,D_r$

439: maintains a bit vector indicating vacant positions, and an additional

440: $O(\lg n)$ summary bits, each taking the OR of an interval of size

441: $O(\lg n)$. This can be maintained in constant time per operation,

442: employing standard techniques.

443:

444: Only $o(n)$ preprocessing is necessary for the data structure

445: (essentially to build tables needed for the word-level parallelism).

446: The major part of the data structure is initialized lazily.

447:

448:

449: \subsection{Analysis}

450:

451: Since evaluation of all involved hash functions and lookup in the

452: dictionaries takes constant time, evaluation of the perfect hash

453: function is done in constant time. As we will see below, the high

454: performance dictionary is empty with high probability unless $n/\lg u

455: > \sqrt{n}$. This means that it always uses constant time per

456: update with high probability in $n$. All other operations done for

457: update are easily seen to require constant time w.h.p.

458:

459: We now consider the space usage of our scheme. The function $\rho$ can

460: be represented in $O(w)$ bits. Siegel's highly independent hash

461: function uses $o(n)$ bits of space. The hash functions $h_1,\dots,h_r$

462: use $O(\lg n + \lg\lg u)$ bits each, and $o(n\lg\lg u)$ bits in

463: total.  The main space bottleneck is the space for $D_1,\dots,D_r$,

464: which sums to $O(n\lg\lg u)$.
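
The two sums can be spelled out as follows, using
$r = \lceil n/\lg^2 n \rceil$ and the $O(\lg^2 n \lg\lg u)$ bits per
dictionary $D_i$ stated above:
\begin{align*}
  \text{dictionaries } D_i\text{:} \quad
    & r \cdot O(\lg^2 n \lg\lg u) = O(n \lg\lg u), \\
  \text{functions } h_i\text{:} \quad
    & r \cdot O(\lg n + \lg\lg u)
      = O\left( \frac{n}{\lg n} + \frac{n \lg\lg u}{\lg^2 n} \right)
      = o(n \lg\lg u).
\end{align*}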

465:

466: Finally, we show that the space used by the high performance

467: dictionary is $O(n)$ bits w.h.p. This is done by showing that each of

468: the following hold with high probability throughout a polynomial

469: sequence of operations:

470:

471: \begin{itemize*}

472: \item[1.] The function $\rho$ is one-to-one on $S$.

473:

474: \item[2.] There is no $i$ such that $S_i = \{ k \in S \mid

475:   \phi(\rho(k))=i \}$ has more than $j$ elements.

476:

477: \item[3.] The set $S'$ has $O(\lceil n/\lg u \rceil)$ elements.

478: \end{itemize*}

479:

480: That 1.~holds with high probability is well known. To show 2.~we use

481: the fact that, with high probability, Siegel's hash function is

482: independent on every set of $n^{\Omega(1)}$ keys. We may thus employ

483: Chernoff bounds for random variables with limited independence to

484: bound the probability that any $i$ has $|S_i| > j$, conditioned on the

485: fact that 1.~holds. Specifically, we can use \cite[Theorem

486: 5.I.b]{schmidt95chernoff} to argue that for any $i$, the probability

487: that $|S_{i}| > j$ for $j = \lceil \lg^2 n + \lg^{5/3} n \rceil$ is

488: $n^{-\omega(1)}$, which is negligible. On the assumption that 1.~and

489: 2.~hold, we finally consider~3. We note that every key $k'\in S'$ is

490: involved in an $h_i$-collision in $S_i$ for $i=\phi(\rho(k'))$,

491: i.e.~there exists $k''\in S_i \setminus \{k'\}$ where

492: $h_i(\rho(k'))=h_i(\rho(k''))$. By universality, for any $i$ the expected number

493: of $h_i$-collisions in $S_i$ is $O(\lg^4 n / (\lg u)^{6+2c}) = O((\lg

494: u)^{-(2+2c)})$.  Thus the probability of one or more collisions is

495: $O((\lg u)^{-(2+2c)})$.  For $\lg u \geq \sqrt{n}$ this means that

496: there are no keys in $S'$ with high probability. Specifically, $c$ may

497: be chosen as the sum of the constants in the exponents of the length

498: of the operation sequence and the desired high probability bound. For

499: the case $\lg u < \sqrt{n}$ we note that the expected number of

500: elements in $S'$ is certainly $O(n/\lg u)$. To see that this also

501: holds with high probability, note that the event that one or more keys

502: from $S_i$ end up in $S'$ is independent among the $i$'s. Thus we can

503: use Chernoff bounds to get that the deviation from the expectation is

504: small with high probability.
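
The collision bound can be unpacked as follows; this is a sketch of the
calculation behind the argument above, with constants suppressed. Under
2., $S_i$ contains at most $\binom{j}{2} = O(\lg^4 n)$ pairs of keys,
and by universality each pair collides under $h_i$ with probability
$O(2^{-s}) = O((\lg u)^{-(6+2c)})$. Since $\lg n \le \lg u$,
\[
  E[\text{number of } h_i\text{-collisions in } S_i]
    = O\left( \frac{\lg^4 n}{(\lg u)^{6+2c}} \right)
    = O\left( (\lg u)^{-(2+2c)} \right).
\]
When $\lg u \ge \sqrt{n}$ this is $O(n^{-(1+c)})$, so a union bound over
the $r \le n$ indices and the $n^{O(1)}$ operations still leaves a
polynomially small failure probability once $c$ is large enough. When
$\lg u < \sqrt{n}$, each $S_i$ contributes at most $j$ keys to $S'$,
which gives
\[
  E[|S'|] \le r \cdot j \cdot O\left( (\lg u)^{-(2+2c)} \right)
          = O\left( \frac{n}{(\lg u)^{2+2c}} \right)
          = O\left( \frac{n}{\lg u} \right).
\]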

505:

506:

507: \section{Lower Bound for Bloomier Filters}

508:

509: For the purpose of the lower bound, we consider the following two-set

510: distinction problem, following \cite{chazelle04bloom}. The problem has

511: the following stages:

512:

513: \begin{enumerate}

514: \item[0.] a random string $R$ is drawn, which will be available to the

515:   data structure throughout its operation. This is equivalent to

516:   drawing a deterministic algorithm from a given distribution, and is

517:   more general than assuming each stage has its own random coins (we

518:   are giving the data structure free storage for its random bits).

519:

520: \item the data structure is given $A \subset [u], |A| \le n$. It must

521:   produce a representation $f_R(A)$, which for any $A$ has size at

522:   most $S$ bits, in expectation over all choices of $R$. Here $S$ is a

523:   function of $n$ and $u$, which is the target of our lower bound.

524:

525: \item the data structure is given $B \subset [u]$, such that $|B| \le

526:   n, A \cap B = \emptyset$. Based on the old state $f_R(A)$, it must

527:   produce $g_R(B, f_R(A))$ with expected size at most $S$ bits.

528:

529: \item the data structure is given $x \in [u]$ and its previously

530:   generated state, i.e.~$f_R(A)$ and $g_R(B, f_R(A))$. Now it must

531:   answer as follows with no error allowed: if $x \in A$, it must

532:   answer ``A''; if $x \in B$, it must answer ``B''; if $x \notin A

533:   \cup B$, it can answer either ``A'' or ``B''. Let $h_R(x,f,g)$

534:   be the answer computed by the data structure, when the previous

535:   state is $f$ and $g$.

536:

537: \end{enumerate}

538:

539: It is easy to see that a solution for dynamic Bloomier filters

540: supporting ternary associated data, using expected space $o(n\lg\lg

541: \frac{u}{n})$, yields a solution to the two-set distinction problem

542: with $S = o(n\lg\lg \frac{u}{n})$. We will prove such a solution does

543: not exist.

544:

545: Since a solution to the distinction problem is not allowed to make an

546: error we can assume w.l.o.g.~that step 3 is implemented as follows. If

547: there exist appropriate $A, B \subset [u]$, with $x \in A$ such that

548: $f_R(A) = f_0$ and $g_R(B, f_0) = g_0$, then $h_R(x, f_0, g_0)$ must

549: be ``A''. Similarly, if there exists a plausible scenario with $x \in

550: B$, the answer must be ``B''. Otherwise, the answer can be arbitrary.

551:

552: Assume that the inputs $A \times B$ are drawn from a given

553: distribution. We argue that if the expected sizes of $f$ and $g$ are

554: allowed to be at most $2S$, the data structure need not be

555: randomized. This uses a bicriteria minimax principle. We have

556: $E_{R,A,B}\left[ \frac{|f|}{S} + \frac{|g|}{S} \right] \leq 2$, where

557: $|f|, |g|$ denote the length of the representations. Then, there

558: exists a random string $R_0$ such that $E_{A,B} \left[ \frac{|f|}{S} +

559: \frac{|g|}{S} \right] \leq 2$. Since $|f|, |g| \geq 0$, this implies

560: $E_{A,B}[|f|] \leq 2S, E_{A,B}[|g|] \leq 2S$. The data structure can

561: simply use the deterministic sequence $R_0$ as its random bits; we

562: drop the subscript from $f_R, g_R$ when talking about this

563: deterministic data structure.
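
The averaging step can be written out as follows. For a fixed random
string $R$, let
$X_R = E_{A,B}\left[ \frac{|f|}{S} + \frac{|g|}{S} \right]$ (the name
$X_R$ is used only here). Then
\[
  E_R[X_R] \le 2
  \quad\Longrightarrow\quad
  \exists\, R_0 :\ X_{R_0} \le 2,
\]
because a random variable cannot always exceed its mean. Both terms of
$X_{R_0}$ are nonnegative, so each is at most $2$ on its own; that is,
the expected sizes of $f$ and $g$ under $R_0$ are each at most $2S$.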

564:

565:

566: \subsection{Lower Bound for Two-Set Distinction}

567:

568: Assume $u = \omega(n)$, since a lower bound of $\Omega(n)$ is trivial

569: for universe $u \ge 2n$. Break the universe into $n$ equal parts $U_1,

570: \dots, U_n$; w.l.o.g.~assume $n$ divides $u$, so $|U_i| =

571: \frac{u}{n}$. The hard input distribution chooses $A$ uniformly at

572: random from $U_1 \times \dots \times U_n$. We write $A = \{ a_1,

573: \dots, a_n \}$, where $a_i$ is a random variable drawn from

574: $U_i$. Then, $B'$ is chosen uniformly at random from the same product

575: space; again $B' = \{b_1, \dots, b_n\}, b_i \gets U_i$. We let $B = B'

576: \setminus A$. Note that $E[|B|] = n \cdot \Pr[a_1 \ne b_1] = (1 -

577: \frac{n}{u}) \cdot n = (1 - o(1)) \cdot n$.

578:

579: Let $A_i^p$ be the plausible values of $a_i$ after we see $f(A)$; that

580: is, $A_i^p$ contains all $a \in U_i$ for which there exists a valid

581: $A'$ with $a \in A'$ and $f(A') = f(A)$. Intuitively speaking, if

582: $f(A)$ has expected size $o(n \lg\lg \frac{u}{n})$, it contains on

583: average $o(\lg\lg \frac{u}{n})$ bits of information about each

584: $a_i$. This is much smaller than the range of $a_i$, which is

585: $\frac{u}{n}$, so we would expect that the average $|A_i^p|$ is quite

586: large, around $\frac{u}{n} / (\lg \frac{u}{n})^{o(1)}$. This intuition

587: is formalized in the following lemma:

588:

589: \begin{lemma}

590: With probability at least a half over a uniform choice of $A$ and $i$,

591: we have $|A_i^p| \geq \frac{u/n}{2^{O(S/n)}}$.

592: \end{lemma}

593:

594: \begin{proof}

595: The Kolmogorov complexity of $A$ is $n\lg \frac{u}{n} - O(1)$; no

596: encoding for $A$ can have an expected size less than this quantity.

597: We propose an encoding for $A$ consisting of two parts: first, we

598: include $f(A)$; second, for each $i$ we include the index of $a_i$ in

599: $A_i^p$, using $\lceil \lg|A_i^p| \rceil$ bits. This is easily

600: decodable. We first generate all possible $A'$ with $f(A') = f(A)$,

601: and thus obtain the sets $A_i^p$. Then, we extract from each plausible

602: set the element with the given index. The expected size of the

603: encoding is $2S + \sum_i E_{A}[\lg |A_i^p|] + O(n)$, which must be

604: $\ge n\lg \frac{u}{n} - O(1)$. This implies $\lg \frac{u}{n} -

605: E_{i,A}[\lg |A_i^p|] \le \frac{2S}{n} + O(1)$. By Markov's inequality,

606: with probability at least a half over $i$ and $A$, $\lg \frac{u}{n} -

607: \lg |A_i^p| \le \frac{4S}{n} + O(1)$, so $\lg |A_i^p| \ge \lg

608: \frac{u}{n} - O(\frac{S}{n})$.

609: \end{proof}

610:

611: We now make a crucial observation which justifies our interest in

612: $A_i^p$. Assume that $b_i \in A_i^p$. In this case, the data structure

613: must be able to determine $b_i$ from $f(A)$ and $g(B,f(A))$. Indeed,

614: suppose we compute $h(x,f,g)$ for all $x \in A_i^p$. If the data

615: structure does not answer ``B'' when $x = b_i$, it is obviously

616: incorrect. On the other hand, if it answers ``B'' for both $x = b_i$

617: and some other $x' \in A_i^p$, it also makes an error. Since $x'$ is

618: plausible, there exists $A'$ with $x' \in A'$ such that $f(A') =

619: f(A)$. Then, we can run the data structure with $A'$ as the first set

620: and $B$ as the second set. Since $f(A') = f(A)$, the data structure

621: will behave exactly the same, and will incorrectly answer ``B'' for

622: $x'$.

623:

624: To draw our conclusion, we consider another encoding argument, this

625: time in connection to the set $B'$. The Kolmogorov complexity of $B'$

626: is $n \lg \frac{u}{n} - O(1)$. Consider a randomized encoding,

627: depending on a set $A$ drawn at random. First, we encode an $n$-bit

628: vector specifying which indices $i$ have $a_i = b_i$. It remains to

629: encode $B' \setminus A = B$. We encode another $n$-bit vector,

630: specifying for which positions $i$ we have $b_i \in A_i^p$. For each

631: $b_i \notin A_i^p$, we simply encode $b_i$ using $\lceil \lg

632: \frac{u}{n} \rceil$ bits. Finally, we include in the encoding $g(B,

633: f(A))$. As explained already, this is enough to recover all $b_i$

634: which are in $A_i^p$. Note that we do not need to encode $f(A)$, since

635: this depends only on our random coins, and the decoding algorithm can

636: reconstruct it.

637:

638: The expected size of this encoding will be $O(n + S) + n\cdot

639: \Pr_{A,B',i} [b_i \notin A_i^p] \cdot \lg \frac{u}{n}$. We know that

640: with probability a half over $A$ and $i$, we have $|A_i^p| \geq

641: \frac{u/n}{2^{O(S/n)}}$. Thus, $\Pr_{A,B',i} [b_i \in A_i^p] \geq

642: \frac{1}{2} \cdot 2^{-O(S/n)}$. Thus, the expected size of the

643: encoding is at most $O(n + S) + (1 - 2^{-O(S/n)}) \cdot n \lg

644: \frac{u}{n}$. Note that by the minimax principle, randomness in the

645: encoding is unessential and we can always fix $A$ guaranteeing the

646: same encoding size, in expectation over $B$. We now get the bound:

647:

648: \begin{eqnarray*}

649: & & O(n + S) + (1 - 2^{-O(S/n)}) \cdot n \lg \frac{u}{n} \geq n \lg

650: \frac{u}{n} - O(1) \\

651: & \Rightarrow & O\left( \frac{S}{n} \right) \geq 2^{-O(S/n)} \lg

652: \frac{u}{n} - O(1) \Rightarrow 2^{O(S/n)} O(S / n) \geq \lg

653: \frac{u}{n} \Rightarrow \frac{S}{n} = \Omega \left( \lg\lg \frac{u}{n}

654: \right)

655: \end{eqnarray*}
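To spell out the last implication under simplifying assumptions, write $x = S/n$ and $L = \lg \frac{u}{n}$, and suppose that a single constant $c > 0$ can stand in for the hidden constants, so that the middle inequality reads $2^{cx} \cdot cx \geq L$. Since $y \leq 2^y$ for all $y \geq 0$, we get
\[
L \;\leq\; 2^{cx} \cdot cx \;\leq\; 2^{2cx}
\quad \Longrightarrow \quad
x \;\geq\; \frac{\lg L}{2c} \;=\; \Omega\left( \lg\lg \frac{u}{n} \right).
\]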

656:

657:

658: \section{A Space-Optimal Bloomier Filter}

659:

660: It was shown in \cite{carter78bloom} that the approximate membership

661: problem (i.e., the problem solved by Bloom filters) can be solved

662: optimally using a reduction to the exact membership problem. The

663: reduction uses universal hashing.  In this section we extend this idea

664: to achieve optimal dynamic Bloomier filters.

665:

666: Recall that Bloomier filters encode sparse vectors with entries from

667: $\{0,\dots,2^r - 1\}$.  Let $S\subseteq [u]$ be the set of at most $n$

668: indexes of nonzero entries in the vector $V$.  The data structure must

669: encode a vector $V'$ that agrees with $V$ on indexes in $S$, and such

670: that for any $x\not\in S$, $\Pr[V'[x]\neq 0]\leq \epsilon$, where

671: $\epsilon > 0$ is the error probability of the Bloomier

672: filter. Updates to $V$ are done using the following operations:

673: \begin{itemize}

674: \item {\sc Insert($x$, $a$)}. Set $V[x]:=a$, where $a\neq 0$.

675: \item {\sc Delete($x$)}. Set $V[x]:=0$.

676: \end{itemize}

677:

678: The data structure assumes that only valid updates are performed,

679: i.e., that inserts are done only in situations where $V[x]=0$ and

680: deletions are done only when $V[x]\neq 0$.

681:

682: \begin{theorem}\label{thm:filter}

683: Let positive integers $n$ and $r$, and $\epsilon > 0$ be given. On a

684: RAM with word length $w$ we can maintain a Bloomier filter $V'$ for a

685: vector $V$ of length $u=2^{O(w)}$ with at most $n$ nonzero entries

686: from $\{0,\dots,2^r - 1\}$, such that:

687:

688: \begin{itemize}

689: \item {\sc Insert} and {\sc Delete} can be done in amortized

690:   expected constant time. The data structure assumes all updates are

691:   valid.

692:

693: \item Computing $V'[x]$ on input $x$ takes worst case constant

694:   time. If $V[x]\neq 0$ the answer is always $V[x]$. If $V[x]=0$ the

695:   answer is $0$ with probability at least $1-\epsilon$.

696:

697: \item The expected space usage is $O(n(\lg\lg(u/n) + \lg(1/\epsilon) +

698:   r))$ bits.

699: \end{itemize}

700: \end{theorem}

701:

702:

703: \subsection{The Data Structure}

704:

705: Assume without loss of generality that $u\geq 2n$ and that

706: $\epsilon\geq n/u$.  Let $v=\max(n \log(u/n), n/\epsilon)$, and choose

707: $h: \{0,\dots,u-1\} \rightarrow \{0,\dots,v-1\}$ as a random function

708: from a universal class of hash functions. The data structure maintains

709: information about a minimal set $S'$ such that $h$ is 1-1 on $S

710: \setminus S'$. Specifically, it consists of two parts:

711:

712: \begin{enumerate}

713: \item A dictionary for the set $S'$, with corresponding values of $V$

714:   as associated information.

715:

716: \item A dictionary for the set $h(S\backslash S')$, where the element

717:   $h(x)$, $x\in S\backslash S'$, has $V[x]$ as associated information.

718: \end{enumerate}

719:

720: Both dictionaries should be succinct, i.e., use space close to the

721: information theoretic lower bound.  Raman and Rao

722: \cite{raman03succinct} have described such a dictionary using space

723: that is $1+o(1)$ times the minimum possible, while supporting lookups

724: in $O(1)$ time and updates in expected amortized $O(1)$ time.

725:

726: To compute $V'[x]$ we first check whether $x\in S'$, in which case

727: $V'[x]$ is stored in the first dictionary. If this is not the case, we

728: check whether $h(x)\in h(S\backslash S')$.  If this is the case we

729: return the information associated with $h(x)$ in the second

730: dictionary.  Otherwise, we return $0$.

731:

732: {\sc Insert($x$, $a$)}. First determine whether $h(x)\in h(S\backslash

733: S')$, in which case we add $x$ to the set $S'$, inserting $x$ in the

734: first dictionary.  Otherwise we add $h(x)$ to the second

735: dictionary. In both cases, we associate $a$ with the inserted element.

736:

737: {\sc Delete($x$)} proceeds by deleting $x$ from the first dictionary

738: if $x\in S'$, and otherwise deleting $h(x)$ from the second

739: dictionary.
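The lookup, insert and delete procedures above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: ordinary Python \texttt{dict}s stand in for the succinct dictionaries (so the space bound is not reproduced), the hash function is a standard Carter--Wegman family modulo a fixed prime, and the class and method names are hypothetical.

```python
import random


class BloomierSketch:
    """Two-dictionary scheme; Python dicts stand in for succinct dictionaries."""

    PRIME = (1 << 61) - 1  # assumes all keys are smaller than this prime

    def __init__(self, v, rng=None):
        rng = rng or random.Random()
        self.v = v
        # h(x) = ((c1*x + c0) mod p) mod v, drawn from a universal family.
        self.c1 = rng.randrange(1, self.PRIME)
        self.c0 = rng.randrange(self.PRIME)
        self.exact = {}   # first dictionary: x -> V[x] for x in S'
        self.hashed = {}  # second dictionary: h(x) -> V[x] for x in S \ S'

    def h(self, x):
        return ((self.c1 * x + self.c0) % self.PRIME) % self.v

    def insert(self, x, a):
        """Valid only when V[x] == 0 and a != 0."""
        hx = self.h(x)
        if hx in self.hashed:
            self.exact[x] = a      # collision: x joins S'
        else:
            self.hashed[hx] = a    # h stays 1-1 on S \ S'

    def delete(self, x):
        """Valid only when V[x] != 0."""
        if x in self.exact:
            del self.exact[x]
        else:
            del self.hashed[self.h(x)]

    def lookup(self, x):
        if x in self.exact:
            return self.exact[x]
        return self.hashed.get(self.h(x), 0)
```

Choosing a very small range such as \texttt{v = 4} forces most keys into the first dictionary, which is a convenient way to exercise the collision path; lookups of stored keys remain exact regardless of \texttt{v}.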

740:

741:

742: \subsection{Analysis}

743:

744: It is easy to see that the data structure always returns correct

745: function values on elements in $S$, given that all updates are

746: valid. When computing $V'[x]$ for $x\not\in S$ we get a nonzero result

747: if and only if there exists $x'\in S$ such that $h(x)=h(x')$. Since

748: $h$ was chosen from a universal family, this happens with probability

749: at most $n/v \leq \epsilon$.

750:

751: It remains to analyze the space usage. Using once again that $h$ was

752: chosen from a universal family, the expected size of $S'$ is

753: $O(n/\log(u/n))$. This implies that the expected number of bits

754: necessary to store the set $S'$ is $\log\binom{u}{O(n/\log(u/n))} =

755: O(n)$, using concavity of the function $x\mapsto \log\binom{u}{x}$ in the

756: interval $0\dots u/2$. In particular, the first dictionary achieves an

757: expected space usage of $O(n)$ bits.  The information theoretical

758: minimum space for the set $h(S\backslash S')$ is bounded by

759: $\log\binom{v}{n} = O(n \log(v/n)) = O(n \log\log(u/n) +

760: n\log(1/\epsilon))$ bits, matching the lower bound.  What we disregarded is

761: the space for the universal hash function, which is $O(\log u)$ bits.

762: However, this can be reduced to $O(\log n + \log\log u)$ bits, which

763: is vanishing, by using slightly weaker universal functions and

764: doubling the size $v$ of the range. Specifically, $2$-universal

765: functions suffice; see \cite{pagh00dispers} for a construction. Using

766: such a family requires preprocessing time $(\log u)^{O(1)}$, expected.
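For completeness, here is one way to verify the bound on the second dictionary, with $v$ the range size of $h$ defined above and logarithms taken base 2 (an assumption of this sketch). The standard estimate $\binom{v}{n} \leq (ev/n)^n$ gives
\[
\log\binom{v}{n} \;\leq\; n \log\frac{ev}{n} \;=\; n \log\frac{v}{n} + O(n).
\]
Both $\log(u/n)$ and $1/\epsilon$ are at least $1$ when $u \geq 2n$ and $\epsilon \leq 1$, so their maximum is at most their product:
\[
\log\frac{v}{n} \;=\; \log \max\left( \log\frac{u}{n},\, \frac{1}{\epsilon} \right)
\;\leq\; \log\log\frac{u}{n} + \log\frac{1}{\epsilon}.
\]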

767:

768:

769: \section{Upper Bounds for the Greater-Than Problem}

770:

771: We start with a simple upper bound of $T_u = O(\lg n), T_q = O(\lg\lg

772: n)$. Our upper bound uses a trie structure. We consider a balanced

773: tree with branching factor 2, and with $n$ leaves. Every possible

774: value of the update parameter $a$ is represented by a root-to-leaf

775: path. In the update stage, we mark this root-to-leaf path, taking time

776: $O(\lg n)$. In the query stage, we want to find the point where $b$'s

777: path in the trie would diverge from $a$'s path. This uses binary

778: search on the $\lg n$ levels, as follows. To test if the paths diverge

779: on a level, we examine the node on that level on $b$'s path.  If the

780: node is marked, the paths diverge below; otherwise they diverge

781: above. Once we have found the divergence point, we know that the

782: larger of $a$ and $b$ is the one following the right child of the

783: lowest common ancestor.
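The simple upper bound can be sketched in Python as follows. The class name, the use of a hash set for marked nodes, and the encoding of a node as a (depth, prefix) pair are assumptions of this illustration; \texttt{bits} plays the role of $\lg n$.

```python
class GreaterThan:
    """One update(a), then queries comparing b against a."""

    def __init__(self, bits):
        self.bits = bits      # depth of the binary trie, i.e. lg n
        self.marked = set()   # marked nodes, stored as (depth, prefix)

    def update(self, a):
        # Mark the whole root-to-leaf path of a: bits + 1 nodes.
        for d in range(self.bits + 1):
            self.marked.add((d, a >> (self.bits - d)))

    def query(self, b):
        """Return 1 if b > a, -1 if b < a, 0 if b == a."""
        # Binary search for the deepest marked node on b's path.
        # Invariant: the node at depth lo on b's path is marked.
        lo, hi = 0, self.bits
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if (mid, b >> (self.bits - mid)) in self.marked:
                lo = mid          # paths diverge below this level
            else:
                hi = mid - 1      # paths diverge above this level
        if lo == self.bits:
            return 0
        # The node at depth lo is the lowest common ancestor; b is larger
        # exactly when b follows its right child.
        next_bit = (b >> (self.bits - lo - 1)) & 1
        return 1 if next_bit else -1
```

The update touches $\lg n + 1$ nodes and the query probes $O(\lg\lg n)$ levels, matching the stated bounds up to the cost of the hash set.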

784:

785: For the full tradeoff, we consider a balanced tree with branching

786: factor $B$. In the update stage, we need to mark a root-to-leaf path,

787: taking time $O(\lg_B n)$. In the query stage, we first find the point

788: where $b$'s path in the trie would diverge from $a$'s path. This uses

789: binary search on the $\lg_B n$ levels, so it takes time $O(\lg\lg_B

790: n)$. Now we know the level where the paths of $a$ and $b$ diverge. The

791: nodes on that level from the paths of $a$ and $b$ must be siblings in

792: the tree. To test whether $b > a$, we must find the relative order of

793: the two sibling nodes. There are two strategies for this, giving the

794: two branches of the tradeoff curve. To achieve small update time, we

795: can do all work at query time. We simply test all siblings to the left

796: of $b$'s path on the level of divergence. If we find a marked one,

797: then $a$'s path goes to the left of $b$'s path, so $a < b$; otherwise

798: $a > b$. This strategy gives $T_u = O(\lg_B n)$ and $T_q = O(\lg(\lg_B

799: n) + B)$, for any $B \geq 2$. For $T_q > \Omega(\lg\lg n)$, we have

800: $T_q = \Theta(B)$, so we have achieved the tradeoff $T_u = O(\lg_{T_q}

801: n)$.

802:

803: The second strategy is to do all work at update time. For every node

804: on $a$'s path, we mark all left siblings of the node as such. Then to

805: determine if $b$'s path is to the left or to the right of $a$'s path,

806: we can simply query the node on $b$'s path just below the divergence

807: point, and see if it is marked as a left sibling. This strategy gives

808: $T_u = O(B \lg_B n)$ and $T_q = O(\lg(\lg_B n))$. For small enough $B$

809: (say $B = O(\lg n)$), this strategy gives $T_q = O(\lg\lg n)$

810: regardless of $B$ and $T_u$. For $B = \Omega(\lg n)$, we have $\lg B =

811: \Theta(\lg T_u)$. Therefore, we can express our tradeoff as: $2^{T_q}

812: = O(\lg_{T_u} n)$.
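Both branches of the tradeoff can be illustrated with one Python sketch. The flag \texttt{eager} selects the second strategy (all work at update time); with \texttt{eager=False} the siblings are scanned at query time. The class name, the flag, and the (depth, index) node encoding are assumptions of this example.

```python
class GreaterThanB:
    """Greater-than structure on a trie with branching factor B."""

    def __init__(self, B, levels, eager):
        self.B, self.levels, self.eager = B, levels, eager
        self.path = set()   # nodes on a's path, as (depth, index)
        self.left = set()   # left siblings of nodes on a's path (eager only)

    def _node(self, x, d):
        # Index of the depth-d ancestor of leaf x among all depth-d nodes.
        return (d, x // self.B ** (self.levels - d))

    def update(self, a):
        for d in range(self.levels + 1):
            depth, idx = self._node(a, d)
            self.path.add((depth, idx))
            if self.eager and d > 0:
                first = idx - idx % self.B          # leftmost sibling
                for s in range(first, idx):         # up to B - 1 marks per level
                    self.left.add((depth, s))

    def query(self, b):
        """Return 1 if b > a, -1 if b < a, 0 if b == a."""
        lo, hi = 0, self.levels
        while lo < hi:                              # binary search on levels
            mid = (lo + hi + 1) // 2
            if self._node(b, mid) in self.path:
                lo = mid
            else:
                hi = mid - 1
        if lo == self.levels:
            return 0
        depth, idx = self._node(b, lo + 1)          # b's node just below divergence
        if self.eager:
            return -1 if (depth, idx) in self.left else 1
        first = idx - idx % self.B
        a_is_left = any((depth, s) in self.path for s in range(first, idx))
        return 1 if a_is_left else -1
```

The lazy variant performs $O(\lg_B n)$ marks per update and up to $B - 1$ sibling probes per query; the eager variant moves the factor $B$ to the update.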

813:

814:

815: \section{Dynamic Range Reporting}

816:

817: We begin with the case $T_u = O(\lg w), T_q = O(\lg\lg w)$. Let $S$ be

818: the current set of values stored by the data structure.  Without loss

819: of generality, assume $w$ is a power of two.  For an arbitrary $t \in

820: [0, \lg w]$, we define the trie of order $t$, denoted $T_t$, to be the

821: trie of depth $w / 2^t$ and alphabet of $2^t$ \emph{bits}, which

822: represents all numbers in $S$. We call $T_0$ the \emph{primary trie}

823: (this is the classic binary trie with elements from $S$). Observe that

824: we can assign distinct names of $O(w)$ bits to all nodes in all

825: tries. We call \emph{active paths} the paths in the tries which

826: correspond to elements of $S$. A node $v$ from $T_t$ corresponds to a

827: subtree of depth $2^t$ in the primary trie; we denote the root of this

828: subtree by $r_0(v)$. A node from $T_t$ corresponds to a 2-level

829: subtree in $T_{t-1}$; we call such a subtree a \emph{natural

830: subtree}. Alternatively, a 2-level subtree of any trie is natural iff

831: its root is at an even depth.
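The correspondence between tries of different orders can be made concrete with a small Python sketch. Naming a node by the triple (order, depth, prefix) is an assumption of this illustration (the text only requires distinct names of $O(w)$ bits), and the function names are hypothetical.

```python
def node(x, w, t, d):
    """Node of T_t at depth d on the path of the w-bit key x."""
    width = 1 << t                      # each edge of T_t consumes 2^t bits
    return (t, d, x >> (w - d * width))


def r0(v):
    """Root, in the primary trie T_0, of the depth-2^t subtree that v stands for."""
    t, d, prefix = v
    return (0, d << t, prefix)


def natural_root(v):
    """Root, in T_{t-1}, of the natural 2-level subtree contracted into v (t >= 1)."""
    t, d, prefix = v
    return (t - 1, 2 * d, prefix)
```

Note that \texttt{natural\_root} always returns a node at even depth, in agreement with the characterization of natural subtrees above.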

832:

833: A root-to-leaf path in the primary trie is seen as the leaves of the

834: tree used for the greater-than problem. The paths from the primary

835: trie are broken into chunks of length $2^t$ in the trie of order

836: $t$. So $T_t$ is similar to the $t$-th level (counted bottom-up) of

837: the greater-than tree. Indeed, every node on the $t$-th level of that

838: tree held information about a subtree with $2^t$ leaves; here one edge

839: in $T_t$ summarizes a segment of length $2^t$ bits.  Also, a natural

840: subtree corresponds to two siblings in the greater-than structure. On

841: the next level, the two siblings are contracted into a node; in the

842: trie of higher order, a natural subtree is also contracted into a

843: node. It will be very useful for the reader to hold these parallels in

844: mind, and realize that the data structure from this section is

845: implementing the old recursion idea \emph{on every path}.

846:

847: The root-to-leaf paths corresponding to the values in $S$ determine at

848: most $n-1$ branching nodes in any trie. By convention, we always

849: consider roots to be branching nodes. For every branching node from

850: $T_0$, we consider the extreme points of the interval spanned by the

851: node's subtree. By doubling the universe size, we can assume these are

852: never elements of $S$ (alternatively, such extreme points are formal

853: rationals like $x + \frac{1}{2}$). We define $\overline{S}$ to be the

854: union of $S$ and the two special values for each branching node in the

855: primary trie; observe that $|\overline{S}| = O(n)$. We are interested

856: in holding $\overline{S}$ for navigation purposes: it gives a way to

857: find in constant time the maximum and minimum element from $S$ that

858: fits under a branching node (because these two values should be the

859: elements from $S$ closest to the special values for the branching

860: node).

861:

862: \smallskip

863:

864: Our data structure has the following components:

865:

866: \begin{itemize}

867: \item[1.] a linked list with all elements of $S$ in increasing order,

868:   and a predecessor structure for $S$.

869:

870: \item[2.] a linked list with all elements of $\overline{S}$ in

871:   increasing order, accompanied by a navigation structure which

872:   enables us to find in constant time the largest value from $S$

873:   smaller than a given value from $\overline{S} \setminus S$.  We also

874:   hold a predecessor structure for $\overline{S}$.

875:

876: \item[3.] every branching node from the primary trie holds pointers to

877:   its lowest branching ancestor, and the two branching descendants

878:   (the highest branching nodes from the left and right subtrees; we

879:   consider leaves associated with elements from $S$ as branching

880:   descendants). We also hold pointers to the two extreme values

881:   associated with the node in the list in item 2. Finally, we hold a

882:   hash table with these branching nodes.

883:

884: \item[4.] for each $t$, and every node $v$ in $T_t$, which is either a

885:   branching node or a child of a branching node on an active path, we

886:   hold the depth of the lowest branching ancestor of $r_0(v)$, using a

887:   Bloomier filter.

888: \end{itemize}

889:

890: We begin by showing that this data structure takes linear space. Items

891: 1-3 handle $O(n)$ elements, and have constant overhead per element.

892: We show below that the navigation structure from 2.~can be implemented

893: in linear space. The predecessor structure should also use linear

894: space; for van Emde Boas, this can be achieved through hashing

895: \cite{willard83predecessor}.

896:

897: In item 4., there are $O(n)$ branching nodes per trie. In addition,

898: there are $O(n)$ children of branching nodes which are on active

899: paths. Thus, we consider $O(n\lg w)$ nodes in total, and hold $O(\lg

900: w)$ bits of information for each (a depth). Using our solution for the

901: Bloomier filter, this takes $O(n(\lg w)^2 + w)$ bits, which is $o(n)$

902: words. Note that storing the depth of the branching ancestor is just a

903: trick to reduce space. Once we have a node in $T_0$ and we know the

904: depth of its branching ancestor, we can calculate the ancestor in

905: $O(1)$ time (just ignore the bits below the depth of the ancestor). So

906: in essence these are ``compressed pointers'' to the ancestors.

907:

908: We now sketch the navigation structure from item 2. Observe that the

909: longest run in the list of elements from $\overline{S} \setminus S$

910: can have length at most $2w$. Indeed, the leftmost and rightmost

911: extreme values for the branching nodes form a parenthesis structure;

912: the maximum depth is $w$, corresponding to the maximum depth in the

913: trie. Between an open and a closed parenthesis, there must be at least

914: one element from $S$, so the longest uninterrupted sequence of

915: parentheses can be $w$ closed parentheses and $w$ open parentheses.

916:

917: The implementation of the navigation structure uses classic ideas. We

918: bucket $\Theta(\sqrt{w})$ consecutive elements from the list, and then

919: we bucket $\Theta(\sqrt{w})$ buckets. Each bucket holds a summary

920: word, with a bit for each element indicating whether it is in $S$ or

921: not; second-order buckets hold bits saying whether first order buckets

922: contain at least one element from $S$ or not. There is also an array

923: with pointers to the elements or first order buckets. By shifting, we

924: can always insert another summary bit in constant time when something

925: is added. However, we cannot insert something in the array in constant

926: time; to fix that, we insert elements in the array on the next

927: available position, and hold the correct permutation packed in a word

928: (using $O(\sqrt{w} \lg w)$ bits). To find an element from $S$, we need

929: to walk $O(1)$ buckets. The time is $O(1)$ per traversed bucket, since

930: we can use the classic constant-time subroutine for finding the most

931: significant bit \cite{fredman93fusion}.
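The two word operations used on a summary word can be sketched in Python as follows. Python's \texttt{int.bit\_length} stands in for the constant-time most-significant-bit subroutine, and the function names and the convention that bit $i$ describes the $i$-th element of a bucket are assumptions of this example.

```python
def prev_marked(summary, i):
    """Index of the highest set bit of `summary` at position <= i, or -1 if none."""
    below = summary & ((1 << (i + 1)) - 1)   # keep bits 0..i only
    return below.bit_length() - 1            # most significant remaining bit


def insert_bit(summary, i, bit):
    """Insert `bit` at position i, shifting the bits at positions >= i up by one."""
    low = summary & ((1 << i) - 1)
    high = summary >> i
    return (high << (i + 1)) | (bit << i) | low
```

For instance, in the word \texttt{0b0100100} the nearest marked position at or below 4 is 2, and inserting a set bit at position 3 yields \texttt{0b1001100}.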

932:

933: We also describe a useful subroutine, $\func{test-branching}(v)$,

934: which tests whether a node $v$ from some $T_t$ is a branching node. To

935: do that, we query the structure in item 4.~to find the lowest

936: branching ancestor of $r_0(v)$. This value is defined if $v$ is a

937: branching node, but the Bloomier filter may return an arbitrary result

938: otherwise. We look up the purported ancestor in the structure of item

939: 3. If the node is not a branching node, the value in the Bloomier filter

940: for $v$ was bogus, so $v$ is not a branching node. Otherwise, we

941: inspect the two branching descendants of this node. If $v$ is a

942: branching node, one of these two descendants must be mapped to $v$ in

943: the trie of order $t$, which can be tested easily.

944:

945:

946: \subsection{Implementation of Updates}

947:

948: We only discuss insertions; deletions follow parallel steps

949: uneventfully. We first insert the new element in $S$ and

950: $\overline{S}$ using the predecessor structures. Inserting a new

951: element creates exactly one branching node $v$ in the primary trie.

952: This node can be determined by examining the predecessor and successor

953: in $S$. Indeed, the lowest common ancestor in the primary trie can be

954: determined by taking an xor of the two values, finding the most

955: significant bit, and then masking everything below that bit from the

956: original values \cite{alstrup01range}.
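The xor computation of the lowest common ancestor can be sketched in Python as follows. Returning the node as a (depth, prefix) pair and the function name are assumptions of this illustration; \texttt{int.bit\_length} again stands in for the constant-time most-significant-bit subroutine.

```python
def lowest_common_ancestor(x, y, w):
    """Return (depth, prefix) of the LCA of two w-bit keys in the binary trie."""
    diff = x ^ y
    if diff == 0:
        return (w, x)                 # identical keys: the LCA is the leaf itself
    msb = diff.bit_length() - 1       # index of the most significant differing bit
    depth = w - 1 - msb               # number of leading bits shared by x and y
    prefix = x >> (msb + 1)           # drop bit msb and everything below it
    return (depth, prefix)


print(lowest_common_ancestor(0b10110010, 0b10111000, 8))  # (4, 11)
```

Here the two keys share the four leading bits \texttt{1011} and differ in the fifth, so the ancestor sits at depth 4.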

957:

958: We calculate the extreme values for the new branching node $v$, and

959: insert them in $\overline{S}$ using the predecessor structure. Finding

960: the branching ancestor of $v$ is equivalent to finding the enclosing

961: parentheses for the pair of parentheses which was just inserted. But

962: $\overline{S}$ has a special structure: a pair of parentheses always

963: encloses two subexpressions, which are either values from $S$, or a

964: parenthesized expression (i.e., the branching nodes from $T_0$ form a

965: binary tree structure). So one of the enclosing parentheses is either

966: immediately to the left, or immediately to the right of the new

967: pair. We can traverse a link from there to find the branching

968: ancestor. Once we have this ancestor, it is easy to update the local

969: structure of the branching nodes from item 3. Until now, the time is

970: dominated by the predecessor structure.

971:

972: It remains to update the structure in item 4. For each $t > 0$, we can

973: either create a new branching node in $T_t$, or the branching node

974: existed already (this is possible for $t > 0$ because nodes have many

975: children). We first test whether the branching node existed or not

976: (using the $\func{test-branching}$ subroutine). If we need to

977: introduce a branching node, we simply add a new entry in the

978: Bloomier filter with the depth of the branching ancestor of $v$. It

979: remains to consider active children of branching nodes, for which we

980: must store the depth of $v$. If we have just introduced a branching

981: node, it has exactly two active children (if there exist more than two

982: children on active paths, the node was a branching node before). These

983: children are determined by looking at the branching descendants of

984: $v$; these give the two active paths going into $v$. Both descendants

985: are mapped to active children of the new branching node from $T_t$. If

986: the branching node already existed, we must add one active child,

987: which is simply the child that the path to the newly inserted value

988: follows. Thus, to update item 4., we spend constant time per $T_t$. In

989: total, the running time of an update is $T_{pred} + O(\lg w) = O(\lg

990: w)$.

991:

992: \subsection{Implementation of Queries}

993:

994: Remember that a query receives an interval $[a,b]$ and must return a

995: value in $S \cap [a,b]$, if one exists. We begin by finding the node

996: $v$ which is the lowest common ancestor of $a$ and $b$ in the primary

997: trie; this takes constant time \cite{alstrup01range}. Note that $v$

998: spans an interval which includes $[a,b]$. The easiest case is when $v$

999: is a branching node; this can be recognized by a lookup in the hash

1000: table from item 3. If so, we find the two branching descendants of

1001: $v$; call the left one $v_L$ and the right one $v_R$. Then, if $S \cap

1002: [a,b] \ne \emptyset$, either the rightmost value from $S$ that fits

1003: under $v_L$ or the leftmost value from $S$ that fits under $v_R$

1004: must be in the interval $[a,b]$. This is so because $[a,b]$ straddles

1005: the middle point of the interval spanned by $v$. The two values

1006: mentioned above are the two values from $S$ closest (on both sides) to

1007: this middle point, so if $S \cap [a,b]$ is non-empty, it must contain one of

1008: these two. To find these two values, we follow a pointer from $v_L$ to

1009: its right extreme point in $\overline{S}$. Then, we use the navigation

1010: structure from item 2., and find the predecessor from $S$ of this

1011: value in constant time. The leftmost value under $v_R$ is the next

1012: element from $S$. Altogether, the case when $v$ is a branching node

1013: takes constant time.
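The straddling argument can be checked with a short Python sketch. It deliberately replaces the constant-time navigation structure by a binary search over a sorted list (so it runs in $O(\lg n)$ time rather than $O(1)$) and only illustrates why the two values closest to the middle point suffice; the function name is hypothetical.

```python
from bisect import bisect_left


def report_one(sorted_s, a, b):
    """Return some element of sorted_s inside [a, b], or None. Assumes a <= b."""
    if a == b:
        i = bisect_left(sorted_s, a)
        return a if i < len(sorted_s) and sorted_s[i] == a else None
    msb = (a ^ b).bit_length() - 1    # highest bit in which a and b differ
    mid = (b >> msb) << msb           # smallest key in the right subtree of the LCA
    # Now a < mid <= b, so [a, b] straddles the middle point of the LCA's interval.
    i = bisect_left(sorted_s, mid)
    candidates = []
    if i > 0:
        candidates.append(sorted_s[i - 1])   # closest element below mid
    if i < len(sorted_s):
        candidates.append(sorted_s[i])       # closest element at or above mid
    for c in candidates:
        if a <= c <= b:
            return c
    return None
```

If neither candidate lies in $[a,b]$, no element of the set does, since any element of $[a,b]$ below (respectively, at or above) the middle point is no closer to it than the corresponding candidate.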

1014:

1015: Now we must handle the case when $v$ is not a branching node. If $S

1016: \cap [a,b] \ne \emptyset$, it must be the case that $v$ is on an

1017: active path. Below we describe how to find the lowest branching

1018: ancestor of $v$, \emph{assuming that $v$ is on an active path}. If

1019: this assumption is violated, the value returned can be arbitrary. Once

1020: we have the branching ancestor of $v$, we find the branching

1021: descendant $w$ which is in $v$'s subtree. Now it is easy to see, by

1022: the same reasoning as above, that if $[a,b] \cap S \ne \emptyset$

1023: either the leftmost or the rightmost value from $S$ which is under $w$

1024: must be in $[a,b]$. These two values are found in constant time using

1025: the navigation structure from item 2., as described above. So if

1026: $[a,b] \cap S \ne \emptyset$, we can find an element inside $[a,b]$.

1027: If neither of these two elements is in $[a,b]$, it must be the case that

1028: $S \cap [a,b]$ is empty, because the algorithm works correctly when $[a,b]

1029: \cap S \ne \emptyset$.

1030:

1031: It remains to show how to find $v$'s branching ancestor, assuming $v$

1032: is on an active path, but is not a branching node. If for some $t >

1033: 0$, $v$ is mapped to a branching node in $T_t$, it will also be mapped

1034: to a branching node in tries of higher order. We are interested in the

1035: smallest $t$ for which this happens. We find this $t$ by binary

1036: search, taking time $O(\lg\lg w)$. For some proposed $t$, we check

1037: whether the node to which $v$ is mapped in $T_t$ is a branching node

1038: (using the $\func{test-branching}$ subroutine). If it is, we continue

1039: searching below; otherwise, we continue above.

1040:

1041: Suppose we found the smallest $t$ for which $v$ is mapped to a

1042: branching node. In $T_{t-1}$, $v$ is mapped to some $z$ which is

1043: \emph{not} a branching node. Finding the lowest branching ancestor of

1044: $v$ is identical to finding the lowest branching ancestor of $r_0(z)$

1045: in the primary trie (since $z$ is not a branching node, there is no

1046: branching node in the primary trie in the subtree corresponding to

1047: $z$). Since in $T_t$ $z$ gets mapped to a branching node, its natural

1048: subtree in $T_{t-1}$ must contain at least one branching node. We have

1049: two cases: either $z$ is the root or a leaf of the natural subtree

1050: (remember that a natural subtree has two levels). These can be

1051: distinguished based on the parity of $z$'s depth. If $z$ is a leaf,

1052: the root must be a branching node (because there is at least another

1053: active leaf). But then $z$ is an active child of a branching node, so

1054: item 4.~tells us the branching ancestor of $r_0(z)$. Now consider the

1055: case when $z$ is the root of the natural subtree. Then $z$ is above

1056: any branching node in its natural subtree, so to find the branching

1057: ancestor of $r_0(z)$ we can find the branching ancestor of the node

1058: from $T_t$ to which the natural subtree is mapped. But this is a

1059: branching node, so the structure in item 4.~gives the desired

1060: branching ancestor. To summarize, the only super-constant cost is the

1061: binary search for $t$, which takes $O(\lg\lg w)$ time.

1062:

1063:

1064: \section{Tradeoffs from Dynamic Range Reporting}

1065:

1066: Fix a value $B \in [2,\sqrt{w}]$; varying $B$ will give our tradeoff

1067: curve.  For an arbitrary $t \in [0, \lg_B w]$, we define the trie of

1068: order $t$ to be the trie of depth $w / B^t$ and alphabet of $B^t$

1069: bits, which represents all numbers in $S$. We call the trie for $t =

1070: 0$ the primary trie. A node $v$ in a trie of order $t$ is represented

1071: by a subtree of depth $B^t$ in the primary trie; we say that the root

1072: of this subtree ``corresponds to'' the node $v$. A node from a trie of

1073: order $t$ is represented by a subtree of depth $B$ in the trie of

1074: order $t-1$; we call such a subtree a ``natural depth-$B$

1075: subtree''. Alternatively, a depth-$B$ subtree is natural if it starts

1076: at a depth divisible by $B$.

1077:

1078: The root-to-leaf paths from the primary trie are broken into chunks of

1079: length $B^t$ in the trie of order $t$. A trie of order $t$ is similar

1080: to the $t$-th level (counted bottom-up) of the tree used for the

1081: greater-than problem, since a path in the primary trie is seen as the

1082: leaves of that tree. Indeed, every node on the $t$-th level of that

1083: tree held information about a subtree with $B^t$ leaves; here one edge

1084: in a trie of order $t$ summarizes a segment of length $B^t$ bits.

1085: Also, a natural depth-$B$ subtree corresponds to $B$ siblings in the

1086: old structure. On the next level, the $B$ siblings are contracted into

1087: a node; in the trie of higher order, a natural depth-$B$ subtree is

1088: also contracted into a node.

1089:

1090: Our data structure has the following new components:

1091:

1092: \begin{itemize}

1093:

1094: \item[5A.] choose this for the first branch of the tradeoff (faster

1095:   updates, slower queries): hold the same information as in 4.~for

1096:   each $t$, and every node $v$ in the trie of order $t$ which is not a

1097:   branching node, is on an active path, and is the child of a

1098:   branching node in the trie of order $t$.

1099:

1100: \item[5B.] choose this for the second branch of the tradeoff: hold the

1101:   same information as above for each $t$, and every node $v$ which is

1102:   not a branching node, is on an active path, and has a branching

1103:   ancestor in the same natural depth-$B$ subtree.

1104: \end{itemize}

1105:

1106: In item 5A., notice that for every $t$ there are at most $2n - 2$

1107: children of branching nodes which are on active paths. We store $O(\lg

1108: w)$ bits for each, and there are $O(\lg_B w)$ values of $t$, so we can

1109: store this in a Bloomier filter with $o(n)$ words of space. In item

1110: 5B., the number of interesting nodes blows up by at most $B$ compared

1111: to 5A., and since $B \leq \sqrt{w}$, we are still using $o(n)$ words

1112: of space.

1113:

1114: \paragraph{Updates.}

1115: For each $t > 0$, we can either create a new branching node in the

1116: trie of order $t$, or the branching node existed already. We first

1117: test whether the branching node existed or not. If we just introduced

1118: a branching node, it has at most two children which are not branching

1119: nodes and are on active paths (if more than two such children exist,

1120: the node was a branching node before). If the branching node was old,

1121: we might have added one such child. These children are determined by

1122: looking at the branching descendants of $v$ (these give the two active

1123: paths going into $v$, one or both of which are new active paths going

1124: into the node in the subtrie of order $t$). For such children, we add

1125: the depth of $v$ in the structure from item 5A. If we are in case 5B,

1126: we follow both paths either until we find a branching node, or the

1127: border of the natural depth-$B$ subtree. For each of these $O(B)$

1128: positions, we add the depth of $v$ in the structure from item 5B. To summarize, the

1129: running time is $O(T_{pred} + \lg_B w)$ if we need to update 5A., and

1130: $O(T_{pred} + B \lg_B w)$ if we need to update 5B.

1131:

1132:

1133: \paragraph{Queries.}

1134: We need to show how to find $v$'s branching ancestor, assuming $v$ is

1135: on an active path, but is not a branching node. For some $t > 0$, and

1136: all $t$'s above that value, $v$ will be mapped in the trie of order

1137: $t$ to some branching node. That is the smallest $t$ such that the

1138: depth-$B^t$ natural subtree containing $v$ contains some branching

1139: node. We find this $t$ by binary search, taking time $O(\lg(\lg_B

1140: w))$. For some proposed $t$, we check if the node to which $v$ is

1141: mapped is a branching node in the trie of order $t$ (using the

1142: subroutine described above). If it is, we continue searching below;

1143: otherwise, we continue above.

1144:
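The binary search is valid because the predicate is monotone in $t$:
as noted above, once $v$ is mapped to a branching node for some order,
the same holds for all larger orders. As a small illustration (the
concrete numbers are hypothetical and chosen only to make the
arithmetic visible), take $w = 64$ and $B = 4$. Then
\[
  \lg_B w = \frac{\lg 64}{\lg 4} = 3 ,
\]
so the candidate orders are $t \in \{1, 2, 3\}$, and a binary search
over three candidates needs at most two tests of the form ``is $v$
mapped to a branching node in the trie of order $t$?''.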

1145: Say we found the smallest $t$ for which $v$ is mapped to a branching

1146: node. In the trie of order $t-1$, $v$ is mapped to some $w$ which is

1147: not a branching node. Finding the lowest branching ancestor of $v$ is

1148: identical to finding the lowest branching ancestor of the node

1149: corresponding to $w$ in the primary trie (since $w$ is not a

1150: branching node, there is no branching node in the primary trie in the

1151: subtree represented by $w$). In the trie of order $t$, $w$ gets mapped

1152: to a branching node, so the natural depth-$B$ subtree of $w$ contains

1153: at least one branching node. Then either: (1) there is some branching

1154: node above $w$ in its natural depth-$B$ subtree, or (2) $w$ is on the

1155: active path going to the root of this natural subtree (it is above any

1156: branching node).

1157:

1158: We first deal with case (2). If $w$ is above any branching node in its

1159: natural subtree, to find $w$'s branching ancestor we can find the

1160: branching ancestor of the node from the trie of order $t$, to which

1161: this subtree is mapped. But this is a branching node, so the structure

1162: in item 4.~gives the branching ancestor $z$. We can test that we are

1163: indeed in case (2), and not case (1), by looking at the two branching

1164: descendants of $z$, and checking that one of them is strictly under

1165: $v$.

1166:

1167: Now we deal with case (1). If we have the structure 5B., this is

1168: trivial. Because $w$ is on an active path and has a branching ancestor

1169: in its natural depth-$B$ subtree, it records the depth of the

1170: branching ancestor of the node corresponding to $w$ in the primary

1171: trie. So in this case, the only super-constant cost is the binary

1172: search for $t$, which is $O(\lg(\lg_B w))$. If we only have the

1173: structure 5A., we need to walk up the trie of order $t-1$ starting

1174: from $w$. When we reach the child of the branching node above $w$, the

1175: branching node from the primary trie is recorded in item 5A. Since the

1176: branching node is in the same natural depth-$B$ subtree as $w$, we

1177: reach this point after $O(B)$ steps. One last detail is that we do not

1178: actually know when we have reached the child of a branching node

1179: (because the Bloomier filter from item 5A.~can return arbitrary

1180: results for nodes not satisfying this property). To cope with this, at

1181: each level we optimistically assume that we have reached the destination:

1182: we query the structure in item 5A., find the purported branching ancestor, and

1183: check that it really is the lowest branching ancestor of $v$. This

1184: takes constant time; if the result is wrong, we continue walking up

1185: the trie. Overall, with the structure of 5A.~we need query time

1186: $O(\lg(\lg_B w) + B)$.

1187:

1188: We have shown how to achieve the same running times (as functions of

1189: $B$) as in the case of the greater-than function. The same calculation

1190: establishes our tradeoff curve.

1191:

1192:

1193: \section{Lower Bounds for the Greater-Than Problem}

1194:

1195: A lower bound for the first branch of the tradeoff can be obtained

1196: based on Fredman's proof idea \cite{fredman82sums}. We omit the

1197: details for now. To get a lower bound for the second case ($T_q <

1198: O(\lg\lg n)$), we use the sunflower lemma of Erd\H{o}s and Rado. A

1199: sunflower is a collection of sets (called petals) such that the

1200: intersection of any two of the sets is equal to the intersection of

1201: all the sets.

1202:

1203: \begin{lemma}[Sunflower Lemma]

1204:   Consider a collection of $n$ sets, of cardinalities at most $s$. If

1205:   $n > (p-1)^{s+1} s!$, the collection contains as a subcollection a

1206:   sunflower with $p$ petals.

1207: \end{lemma}

1208:

1209: For every query parameter in $[0,n-1]$, the algorithm performs at most

1210: $T_q$ probes to the memory. Thus, there are $2^{T_q}$ possible

1211: execution paths, and at most $2^{T_q} - 1$ bit cells are probed on at

1212: least one execution path. This gives $n$ sets of cells of sizes at

1213: most $s = O(2^{T_q})$; we call these sets query schemes. By the

1214: sunflower lemma, we can find a sunflower with $p$ petals, if $p$

1215: satisfies: $n > (p-1)^{s+1} s! \Rightarrow \lg n > \Theta(s (\lg p +

1216: \lg s))$. If $T_q < (1-\epsilon) \lg\lg n$, we have $s\lg s = o(\lg

1217: n)$, so our condition becomes $\lg n > \Theta(s \lg p)$. So we can

1218: find a sunflower with $p$ petals such that $\lg p = \Omega((\lg

1219: n)/s)$. Let $P$ be the set of query parameters whose query schemes are

1220: these $p$ petals.

1221:
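The asymptotic form of the sunflower condition can be checked by
taking logarithms. The following is a sketch of that step, using only
the standard bound $\lg s! \leq s \lg s$:
\[
  \lg\big( (p-1)^{s+1} s! \big)
  = (s+1) \lg (p-1) + \lg s!
  \leq (s+1) \lg p + s \lg s
  = O\big( s (\lg p + \lg s) \big).
\]
Hence the condition of the lemma holds as soon as $\lg n$ exceeds a
sufficiently large constant times $s (\lg p + \lg s)$. If $T_q <
(1-\epsilon) \lg\lg n$, then $s = O(2^{T_q}) = O(\lg^{1-\epsilon} n)$,
so $s \lg s = O(\lg^{1-\epsilon} n \cdot \lg\lg n) = o(\lg n)$, and
only the term $s \lg p$ remains relevant.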

1222: The center of the sunflower (the intersection of all sets) obviously

1223: has size at most $s$. Now consider the update schemes for the numbers

1224: in $P$. We can always find $T \subset P$ such that $|T| \geq |P| /

1225: 2^s$ and the update schemes for all numbers in $T$ look identical if

1226: we only inspect the center of the sunflower. Thus $\lg |T| = \lg |P| -

1227: s = \Omega(\frac{\lg n}{s} - s)$. If $T_q < (\frac{1}{2} - \epsilon)

1228: \lg\lg n$, we have $s = o(\frac{\lg n}{s})$, so we obtain $\lg |T| =

1229: \Omega(\frac{\lg n}{s})$.

1230:

1231: Now we restrict our attention to numbers in $T$ for both the update

1232: and query value. The cells in the center of the sunflower are thus

1233: fixed. Define the natural result of a certain query to be the result

1234: (greater than vs. not greater than) of the query if all bit cells read

1235: by the query outside the center of the sunflower are zero. Now pick a

1236: random $x \in T$. For some $y$ in the middle third of $T$ (when

1237: considering the elements of $T$ in increasing order), we have $\Pr[y

1238: \leq x] \geq \frac{1}{3}, \Pr[y > x] \geq \frac{1}{3}$, so no matter

1239: what the natural result of querying $y$ is, it is wrong with

1240: probability at least $\frac{1}{3}$. So for a random $x$, in expectation at

1241: least a $\frac{1}{9}$ fraction of the natural results are wrong. Consider

1242: an explicit $x$ with this property. The update scheme for $x$ must set

1243: sufficiently many cells to change these natural results. But these

1244: cells can only be in the petals of the queries whose natural results

1245: are wrong, and the petals are disjoint except for the center, which is

1246: fixed. So the update scheme must set at least one cell for every

1247: natural result which is wrong. Hence $T_u \geq |T|/9 \Rightarrow \lg

1248: T_u = \Omega(\lg |T|) = \Omega(\frac{\lg n}{s}) = \Omega(\frac{\lg

1249:   n}{2^{T_q}}) \Rightarrow 2^{T_q} = \Omega(\lg_{T_u} n)$.

1250:
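The final chain of implications can be unrolled as follows; this is a
sketch with constants suppressed, and it assumes $|T|$ is large enough
that $\lg |T|$ dominates the additive constant $\lg 9$. The fraction
$\frac{1}{9}$ arises because a third of the elements $y \in T$ lie in
the middle third, and each of them has a wrong natural result with
probability at least $\frac{1}{3}$ over the choice of $x$, giving
$\frac{1}{3} \cdot \frac{1}{3} = \frac{1}{9}$. Then, using $s =
O(2^{T_q})$,
\begin{align*}
  T_u \geq \frac{|T|}{9}
  &\implies \lg T_u \geq \lg |T| - \lg 9
     = \Omega\Big( \frac{\lg n}{2^{T_q}} \Big) \\
  &\implies 2^{T_q} = \Omega\Big( \frac{\lg n}{\lg T_u} \Big)
     = \Omega( \lg_{T_u} n ).
\end{align*}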

1251:

1252:

1253: \paragraph*{Acknowledgement.}

1254: The authors would like to thank Gerth Brodal for discussions in the

1255: early stages of this work, in particular on how the results could be

1256: extended to dynamic range counting.

1257:

1258:

1259: \bibliographystyle{plain}

1260: \bibliography{../general}

1261:

1262: %\begin{thebibliography}{10}

1263: %  \setlength{\parsep}{0pt}

1264: %  \setlength{\itemsep}{0pt}

1265:

1266: \end{document}

1267:

1268: % cite bloom using:

1269: % @Article{Bloom:1970:STT,

1270: % author =       "Burton H. Bloom",

1271: % title = "Space\slash Time Trade-offs in Hash Coding with Allowable Errors",

1272: % journal =      "Communications of the {ACM}",

1273: % volume =       "13",

1274: % number =       "7",

1275: % pages =        "422--426",

1276: % year =         "1970",

1277: %}

1278: