1: \documentclass[11pt]{article}
2: \usepackage{times}
3: \usepackage{graphicx}
4: \usepackage{epsfig}
5: \usepackage{amsmath}
6: 
7: %\usepackage{fullpage}
8: \topmargin 0pt
9: \advance \topmargin by -\headheight
10: \advance \topmargin by -\headsep
11: 
12: \textheight 9in
13: \oddsidemargin 0pt
14: \evensidemargin \oddsidemargin
15: \marginparwidth 0.5in
16: \textwidth 6.5in
17: 
18: 
19: 
20: %\renewcommand{\baselinestretch}{1.28}
21: \newcommand{\qed}{\hfill \rule{2mm}{3mm}}
22: \newenvironment{proof}{\par \noindent{\bf Proof:}}{\(\qed\) \par}
23: \newcommand{\eat}[1]{}
24: \newcommand{\mnote}[1]{ [[[ \marginpar{\mbox{$<==$}} #1 ]]] }
25: %\newcommand{\eatreminders}{\renewcommand{\reminder}[1]{}}
26: \newcommand{\opt}{\mbox{O{\sc pt}}}
27: \newcommand{\var}{\mbox{V{\sc ar}}}
28: \newcommand{\sol}{\mbox{S{\sc ol}}}
29: \newcommand{\E}{\mbox{\bf E}}
30: \renewcommand{\thefootnote}{\fnsymbol{footnote}}
31: %\begin{document}
32: 
33: 
34: \newcommand{\goal}[1]{ {\noindent {$\Rightarrow$} \em {#1} } }
35: \newcommand{\hide}[1]{}
36: \newcommand{\junk}[1]{}
\newcommand{\etal}{ {\em et al.\ } }
38: \newcommand{\comment}[1]{ {\footnotesize {#1} } }
39: 
40: \newtheorem{theorem}{Theorem}
41: \newtheorem{lemma}{Lemma}[theorem]
42: \newtheorem{claim}{Claim}[theorem]
43: \newtheorem{corollary}{Corollary}[theorem]
44: \newtheorem{defn}{Definition}[theorem]
45: \newtheorem{algo}{Algorithm}
46: \newtheorem{observation}[theorem]{Observation}
47: \newtheorem{proposition}{Proposition}[theorem]
48: \newtheorem{problem}{Problem}
49: \newtheorem{example}{Example}
50: 
51: \newcommand{\vsum}{\mbox{S{\sc um}}}
52: \newcommand{\sqvsum}{\mbox{S{\sc qsum}}}
53: \newcommand{\se}{\mbox{S{\sc qerror}}}
54: \newcommand{\sse}{\mbox{T{\sc err}}}
55: \newcommand{\asse}{\mbox{A{\sc pxerr}}}
56: \newenvironment{proofidea}{{\bf Proof Idea:}}{}
57: \newcommand{\barw}{\bar{W}}
58: \newcommand{\vsumw}{\mbox{I{\sc nfo}}}
59: \newcommand{\sew}{\mbox{E}_B}
60: \newcommand{\ssew}{\mbox{E}_T}
61: \newcommand{\assew}{\mbox{A{\sc px}E}}
62: 
63: \newcounter{ccc}
64: \newcommand{\bcc}{\setcounter{ccc}{1}\theccc.}
65: \newcommand{\icc}{\addtocounter{ccc}{1}\theccc.}
66: \newcommand{\wa}{{\cal W}}
67: \newcommand{\wai}{{\cal W}^{-1}}
68: \newcommand{\zr}{{\cal Z}_R}
69: \newcommand{\real}{{\cal R}}
70: 
71: \title{How far will you walk to find your shortcut: 
72: Space Efficient Synopsis Construction Algorithms}
73: \author{Sudipto Guha \thanks{Department of Computer Science, University of Pennsylvania, 3330 Walnut St, Philadelphia, PA 19104. Email: {\tt sudipto@cis.upenn.edu}}}
74: 
75: \begin{document}
76: \date{}
77: \maketitle
78: \begin{abstract}
79:   In this paper we consider the wavelet synopsis construction problem
80:   without the restriction that we only choose a subset of coefficients
81:   of the original data. We provide the first near optimal algorithm.
82:   
83:   We arrive at the above algorithm by considering space efficient
84:   algorithms for the restricted version of the problem. In this context 
85:   we improve previous algorithms by almost a linear factor and reduce the 
86:   required space to almost linear. Our techniques also extend to histogram
87:   construction, and improve the space-running time tradeoffs for V-Opt
88:   and range query histograms. We believe the idea applies to a broad
89:   range of dynamic programs and demonstrate it by showing improvements
90:   in a knapsack-like setting seen in construction of Extended Wavelets.
91: \end{abstract}
92: \section{Introduction}
93: Wavelet synopsis techniques have become extremely popular in query
optimization, approximate query answering and a large number of decision
support systems. Wavelets, especially Haar wavelets, are one-one mappings and 
96: admit a natural 
97: multi-resolution interpretation, as well as 
98: fast algorithms for the forward and inverse transforms.
99:  
100: Given a set of $n$ numbers $X=x_1,\ldots,x_n$ the wavelet synopsis
101: construction problem seeks to choose a synopsis vector $Z$ with at
102: most $B$ non-zero entries, such that the inverse wavelet transform of
103: $Z$ (denoted by $\wai(Z)$) gives a good estimate of the data. The
104: typical objective measures are (suitably weighted\footnote{ The
105:   weighted $\ell_k$ with weights $\pi_i>0$ and wlog $\sum_i \pi_i=n$
106:   minimizes $\left( \sum_i (\pi_i (x_i -
107:     \wai(Z)_i))^k\right)^{\frac1k}$.}) $\ell_k$ norm of $X - \wai(Z)$.
An early paper \cite{MVW98} demonstrated a number of different
applications for wavelet synopses and proposed greedy algorithms.
However, for objective measures other than the $\ell_2$ measure,
the greedy algorithm does not necessarily provide the optimum solution.  The
problem is quite non-trivial, primarily due to the fact that the wavelet
113: basis vectors overlap and cancellations (subtractions) occur. This
114: means that we can have two coefficients that cancel out each other
115: leaving a significantly (exponentially) smaller contribution, which
116: needs to be accounted for.  The precision of the coefficients in the
117: optimum solution can be much larger than the precision of the data. In
118: fact there are no known bounds or promising techniques for quantifying
119: the precision - this is the biggest stumbling block in the synopsis
120: construction.
121: 
122: Most of the literature focuses on the {\bf Restricted case} where the non-zero
123: entries of $Z$ are equal to the corresponding entries in the
124: transform of the original data, $\wa(X)$. A natural question remains: {\em why
125:   should we be optimizing under the restriction of retaining the
126:   coefficients of the data -- with no guarantees that such a
127:   restriction does not compromise the quality of the final synopsis?}
128: This is clearly suboptimal -- a comparable example would be to
129: optimize the synopsis for point queries, and use it for range queries.
130: 
131: 
132: A simple example renders the discussion concrete; $X=\{1,2,3,7\}$ and
133: $B=1$ illustrates that choosing any single coefficient of
134: $\wa(X)=\{3.25,-1.75,-0.5,-2\}$ (non-normalized) 
135: does not give the optimum answer for
136: $\ell_1$ or $\ell_\infty$ norm. 
137: Normalization does not help. The normalized transform is 
138: $\{4.55,-2.45,-0.5,-2\}$ -- but choosing the first coefficient as 
139: $4.55$ in the normalized setting implies assigning $4.55/\sqrt{2}=3.25$ everywhere. 
140: Thus dynamic program approaches that seek to see the effect of the coefficient on the data come to the same conclusion in both settings.
141: The optimum choices of $Z$ are
142: $\{z,0,0,0\}$ for any $2\leq z\leq 3$ and $\{4,0,0,0\}$ for $\ell_1$
143: and $\ell_\infty$ respectively. 
144: The same example applies to {\bf weighted} $\mathbf \ell_2$,
145: e.g., if $\pi=\{\frac12,\frac12,\frac32,\frac32\}$ then the best error achieved by retaining any single entry of $\wa(X)$ 
146: is $5.78$ whereas $Z=\{4.65,0,0,0\}$ gives an error of $4.87$.   
The example can be extended to any
$B$ (by repetition and scaling). Restricting the synopsis to 
coefficients of the data is therefore significantly 
self-defeating.
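
\noindent For illustration, the $\ell_1$ and $\ell_\infty$ claims in the example can be checked by hand (a worked calculation using only $X=\{1,2,3,7\}$ and $B=1$ from the text). Retaining the first coefficient $3.25$ of $\wa(X)$ estimates every $x_i$ by $3.25$, so that
\begin{eqnarray*}
\ell_1: & & |1-3.25|+|2-3.25|+|3-3.25|+|7-3.25| = 2.25+1.25+0.25+3.75 = 7.5, \\
\ell_\infty: & & \max\{2.25,\,1.25,\,0.25,\,3.75\} = 3.75 .
\end{eqnarray*}
Retaining any one of the other three coefficients instead gives an $\ell_1$ error of $13$ and an $\ell_\infty$ error of at least $5$. In contrast, $Z=\{2,0,0,0\}$ gives $\ell_1$ error $1+0+1+5=7$, and $Z=\{4,0,0,0\}$ gives $\ell_\infty$ error $\max\{3,2,1,3\}=3$.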
151: 
152: However the restriction does ease the search for a solution, and as
153: this paper shows, is an important stepping stone towards the final
154: result. For the restricted case, \cite{GG02} gave a probabilistic
155: scheme (the space constraint is preserved in expectation only, along
156: with the error) and very recently \cite{GK04} gave an optimal solution.
157: This has been extended and improved in
158: \cite{muthu-wave}.  {\em However, the solution to the unrestricted
159:   case has remained elusive and we provide the first near optimal
160:   solutions.} In the process, we also improve upon previous algorithms
161: for the restricted case as well.  However our algorithm is best
162: explained by taking a different path, which brings us to the {\em major
163: theme} of the paper.
164: 
165: \paragraph{}
166: Synopsis construction is perhaps most relevant in context of massive
167: data sets. In some scenarios we can justify that the synopsis is
168: created using a ``scratch'' space larger than the synopsis and stored.  
169: However a {\em quadratic} or extremely superlinear space complexity is 
170: near infeasible for large $n$. 
171: The dependence on synopsis size $B$ is also important in this context --
172: the smaller the dependence is, the larger is the synopsis that can be 
173: computed in the environment of a particular system. Further, 
174: {\em space is typically a more inflexible resource}, 
and not just a matter of waiting. A natural conceptual 
question arises: {\em we are only given $n$ numbers; do we
  really need to save so much information to compute the optimum
  answer?}
179: 
180: All previous algorithms (for the restricted case) are expensive in
181: space (see table below). This (super-linearity in $n,B$) is also seen
182: in context of histogram construction (we provide a detailed table in
183: Section~\ref{sec:hist}).  To avoid this expensive space complexity,
184: several researchers have introduced the notion of {\em working space},
185: which is the amount of space required to compute the error -- the rest
186: of the space is used to construct the answer (coefficients,
187: representatives, etc.).  In case of wavelets the working space used by
188: previous algorithms is $O(nB)$. In case of histograms, known
189: algorithms reconstruct the answer only using the $O(n)$ working space,
190: but {\em with a penalty of an extra factor of $B$ in the running
  time}.  In this paper, we reduce the space for wavelets and
eliminate the penalty for histograms; in fact, our results show that
the {\em working space notion is not needed} for a wide range of
problems. To summarize, {\bf our contributions} are:
195: \begin{itemize}\parskip=-0.05in
196: \item We provide the first near optimum algorithm for the wavelet
197:   synopsis construction problem. The algorithm naturally extends to
198:   multiple dimensions.
\item For the restricted case, \cite{GG02} provided approximation algorithms; however, the space constraints were obeyed only in expectation. The results for (optimum) algorithms with strict space bounds are\footnote{In \cite{muthu-wave} the space bounds are not explicitly provided, but
the total space appears to be $O(n^2B/\log B)$ as well. The authors of \cite{yossi1} consider the same problem for a
  non-Haar basis, which is excluded from the discussion here.}:
203: \begin{center}
204: {\small
\begin{tabular}{|c|c|c|c|c|}
206: \hline
207: Paper &  Error & Time & Space & Working Space \\
208: \hline
209: \hline
210: \cite{GK04} & $\ell_\infty$ & $ O(n^2B \log B) $ & $O(n^2B)$ & $O(nB)$\\
211:          & $\ell_k$ & $ O(n^2B^2) $ & $O(n^2B)$ & $O(nB)$\\
212: \hline
\cite{muthu-wave} & (weighted) $\ell_k$ & $O(\frac{n^2B}{\log B}) $ & ? &  ? \\
216: \hline
217: \hline
218: {\bf This Paper} & (weighted) $\ell_\infty$& $O(n^2)$ & $O(n+B\log (n/B))=O(n)$ & $O(n+B\log (n/B))=O(n)$\\
219: & (weighted) $\ell_k$& $O(n^2 \log B)$ & $O(n)$ & $O(n)$\\
220: \hline
221: \end{tabular}
222: }
223: \end{center}
\cite{GK04} also provided approximation algorithms for multiple dimensions;
our techniques extend to this context as well and improve the
running time and space by almost a factor of $B$. 
227: 
228: \item We improve several histogram construction algorithms, e.g.,
229:   V-Opt histograms, range query histograms, by simultaneously
230:   achieving the best known running time and space bounds. The results
231:   and {\em a table comparing the results} are presented in
  Section~\ref{sec:hist}. Due to lack of space we omit the improvements
  for the range query histograms, which are similar.
234: \item We believe the space efficient paradigm is applicable to other
235:   dynamic programs as well, and we demonstrate the improvements in
236:   case of Extended Wavelets in Section~\ref{sec:ext}.
237: \end{itemize}
238: 
239: 
240: 
241: \section{The Restricted (Haar) Wavelet Synopsis construction Problem}
242: \label{sec:old}
243: We will work with {\bf non-normalized} wavelet transforms where the
244: inverse computation is simply adding the coefficients that affect a
coordinate\footnote{For normalized wavelets the normalization
  constant appears in both the forward and inverse transforms; all the
  results in the paper carry over to that setting as well, with
  the introduction of normalization constants at several places.}.
249: The wavelet basis vectors are defined as (assume $n$ is a power of $2$):
250: \[ \begin{array}{rll}
251:   V_0(j) = & 1 & \mbox{~~for all $j$}\\
252:   V_{2^s+t}(j) = & \left\{ \begin{array}{l}
253:           1 \\
254:           -1 \end{array} \right. &
255:           \begin{array}{l}
256:           \mbox{for $(t-1)\frac{n}{2^s}+1 \leq j \leq \frac{tn}{2^s} - \frac{n}{2^{s+1}}$} \\
257:           \mbox{for $\frac{nt}{2^s}- \frac{n}{2^{s+1}}+1 \leq j \leq \frac{tn}{2^s} $} \\
258: \end{array} \hspace{0.5in} (1\leq t\leq \frac{n}{2^s}, 1\leq s \leq \log n)
259: \end{array}
260: \]
The above definitions ensure $\wai(Z) = \sum_i Z_i V_i$. To compute
$\wa(X)$, the algorithm computes the average
$\frac{x_{2i+1}+x_{2i+2}}{2}$ and the difference
$\frac{x_{2i+1}-x_{2i+2}}{2}$ for each pair of consecutive elements as
$i$ ranges over $0,1,2,\ldots,\frac{n}{2}-1$. The difference coefficients form the
last $n/2$ entries of $\wa(X)$. The process is repeated on the $n/2$
average coefficients; {\em their difference coefficients yield the
  $n/4+1,\ldots,n/2$'th coefficients of $\wa(X)$}. The process stops
when we compute the overall average, which is the first element of
$\wa(X)$. The wavelet basis functions naturally form a complete binary
271: tree since their support sets are nested and are of size powers of $2$
272: (with one additional node as a parent of the tree, see
273: Figure~\ref{fig:one}).  The $x_j$ correspond to the leaves, denoted by
274: boxes, and the coefficients correspond to the non-leaf nodes of the
275: tree.  This tree of coefficients is termed as the error tree
(following \cite{GK04}).  Likewise, assigning a value $c_i$ to the
coefficient $i$ corresponds to assigning $+c_i$ to all leaves $j$ that are
{\bf left descendants} (descendants of the left child) and $-c_i$ to
all right descendants.  The leaves that are descendants of a
280: coefficient are termed as the {\bf support} of the coefficient.
Recall the {\bf Restricted} (Haar) Wavelet construction problem:
given a set of $n$ numbers $X=x_1,\ldots,x_n$,
choose at most $B$ terms from the wavelet representation
$\wa(X)$ of $X$, denoted by $\zr$, such that a (weighted) $\ell_k$
norm of $X - \wai(\zr)$ is minimized.
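
\noindent As an illustration of the transform just described (a worked instance on the four-point data set $X=\{1,2,3,7\}$ from the introduction, not an additional result), the first pass produces the averages and differences
\[ \tfrac{1+2}{2}=1.5,\quad \tfrac{3+7}{2}=5,\qquad \tfrac{1-2}{2}=-0.5,\quad \tfrac{3-7}{2}=-2, \]
and the second pass on the averages $(1.5,5)$ gives the overall average $\frac{1.5+5}{2}=3.25$ and the difference $\frac{1.5-5}{2}=-1.75$, so $\wa(X)=\{3.25,-1.75,-0.5,-2\}$. The inverse adds the coefficients along each root-to-leaf path of the error tree, with a $+$ sign for left descendants and a $-$ sign for right descendants:
\[ \begin{array}{ll}
x_1 = 3.25 + (-1.75) + (-0.5) = 1, & x_2 = 3.25 + (-1.75) - (-0.5) = 2, \\
x_3 = 3.25 - (-1.75) + (-2) = 3, & x_4 = 3.25 - (-1.75) - (-2) = 7.
\end{array} \]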
286: 
287: 
288: \begin{figure}
289: \begin{center}
290: \begin{tabular}[t]{cc}
291: \begin{minipage}{2in}
292: \centerline{\psfig{figure=a.eps,width=1.8in}}
293: \end{minipage}
294: & \framebox{
295: \small
296: \begin{minipage}{4in}
297: At each internal node $i$ to compute $E[i,b,S]$:
298: \begin{itemize}\parskip=-0.05in
299: \item We determine if we are choosing the coefficient $i$. 
\item Assuming we are, we decide how the remaining $b-1$ coefficients are to be allocated between the two subtrees. If the children are $i_L$ and $i_R$,  we are interested in 
301: \vspace{-0.1in}
302: \[ \min_{b'} E[i_L,b',S\cup\{i\}]+ E[i_R,b-1-b',S\cup\{i\}] \]
303: \item Assuming that we do not choose $i$ we are interested in a similar expression giving the overall minimization to be 
304: \vspace{-0.1in}
305: \[
306: \min \left \{ \begin{array}{l} \min_{b'} E[i_L,b',S\cup\{i\}]+ E[i_R,b-1-b',S\cup\{i\}] \\ 
307: \min_{b'} E[i_L,b',S]+ E[i_R,b-b',S] \end{array} \right. 
308: \]
309: \end{itemize}
310: \end{minipage}
311: } \\
312: (a) & (b) 
313: \end{tabular}
314: \caption{The Error Tree and the previous algorithm. \label{fig:one} }
315: \end{center}
316: \end{figure}
317: 
318: 
319: \subsection{Reviewing Previous Algorithm(s)}
320: 
321: It is immediate that the value of $\wai(\zr)_j$ is fixed by the choices of
322: all coefficients $i$ such that $j$ belongs to the support of $i$.
323: Suppose $S$ is a subset of the ancestors of a coefficient $i$.  Thus a
324: natural dynamic program emerges where we define $E[i,b,S]$ to be {\em
325:   the minimum contribution to the error from all $j$ in the support of $i$, such
326:   that exactly $b$ coefficients that are descendants of $i$ are chosen
  along with the coefficients of $S$.} The algorithm is given in Figure~\ref{fig:one}(b). Clearly the number of entries in the array $E[]$ is $Bn$ times $2^r$,
where $r$ is the maximum number of ancestors of any node. It is easy
to see that $r=\log n +1$ and thus the number of entries is $O(n^2B)$.
330: For $\ell_1$ measure we need to spend $O(B)$ time in the minimization
331: giving a running time of $O(n^2B^2)$. For $\ell_\infty$, we may
332: perform binary search and only need $\log B$ time (see \cite{GK04}).
333: 
334: \subsection{A Simple Improvement}
335: 
336: \begin{observation}
  A node $i$ at level $t_i$ can have at most $2^{t_i}-1$ descendants.
338:   Thus $E[i,b,S]$ is meaningful only for $2^{t_i}$ values of $b$
339:   (including $b=0$). Further, the number of nodes at level $t_i$ is
340:   $\lceil\frac{n}{2^{t_i}}\rceil$  and the number of possible subsets of 
341:   ancestors of a node is $2^{\log n + 1 - t_i}$.
342: \end{observation}
343: 
344: Thus the number of $E[]$ entries to fill corresponding to $i$ is $
2^{\log n + 1 - t_i}\min \{ B, 2^{t_i} \} $. The time taken is
346: $2^{\log n + 1 - t_i}\min \{ B^2, 2^{2t_i} \}$. Thus one way of
347: computing the total time taken is
348: \begin{eqnarray*}
349: & & \sum_{t_i=1}^{\log n} \frac{n}{2^{t_i}} 2^{\log n + 1 - t_i}\min \{ B^2, 2^{2t_i} \} + B^2 = \sum_{t_i=1}^{\log B} \frac{n}{2^{t_i}} 2^{\log n + 1 - t_i} 2^{2t_i} 
350: + \sum_{t_i=\log B+1}^{\log n} \frac{n}{2^{t_i}} 2^{\log n + 1 - t_i} B^2 
351:  + B^2 \\
352: & & = \sum_{t_i=1}^{\log B} 2n^2 + \sum_{u=1}^{\log n - \log B} \frac{n}{2^{u+\log B}} 2^{\log n + 1 - u - \log B} B^2 
353:  + B^2  = \ 2n^2 \log B + 2n^2 \sum_{u=1}^{\log n - \log B} \frac{1}{4^u} + B^2 
354: \end{eqnarray*}
355: 
356: \noindent which is $O(n^2\log B)$. In case of $\ell_\infty$, the expression
357: $\sum_{t_i=1}^{\log n} \frac{n}{2^{t_i}} 2^{\log n + 1 - t_i}\min \{
358: B\log B, t_i2^{t_i} \} + B^2$ can be shown to be $O(n^2)$ using the
359: same scheme and change of variables as above.
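
\noindent One way to carry out this calculation is sketched here. It uses only the change of variables $t_i=u+\log B$ and the standard facts $\sum_{t\geq 1} t2^{-t}=2$ and $\sum_{u \geq 1} 4^{-u} = \frac13$; logarithms are taken base $2$ and $B\geq 2$ is assumed.
\begin{eqnarray*}
\sum_{t_i=1}^{\log B} \frac{n}{2^{t_i}} 2^{\log n + 1 - t_i}\, t_i 2^{t_i} & = & 2n^2 \sum_{t_i=1}^{\log B} \frac{t_i}{2^{t_i}} \ \leq \ 4n^2 \\
\sum_{t_i=\log B+1}^{\log n} \frac{n}{2^{t_i}} 2^{\log n + 1 - t_i} B\log B & = & 2n^2 \, \frac{\log B}{B} \sum_{u=1}^{\log n - \log B} \frac{1}{4^u} \ \leq \ \frac{2n^2\log B}{3B} \ \leq \ n^2
\end{eqnarray*}
Together with the additive $B^2 \leq n^2$ term, the total is $O(n^2)$.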
360: %Observe that the above also reduces the total space complexity. But we
361: %omit the discussion since we will achieve a better space bound
362: %anyways.
363: 
364: \subsection{The Intuition and the new algorithm}
The properties that stand out from the above dynamic program are
366: \begin{itemize}\parskip=-0.05in
367: \item {\em There is no connection between $E[i,b,S]$ and $E[i,b',S']$ as long as $S \neq S'$.}
368: \item {\em We do not need $E[i,b,S]$ while computing $E[i',b',S']$ unless $i$ is a child of $i'$ and either  $S=S'$ or $S=S'\cup\{i'\}$.}
369: \item {\em And finally, there is no need to allocate space for $E[i,b,S]$ while computing $E[i',b',S']$ if $i$ is an ancestor (not a descendant) of $i'$.}
370: \end{itemize} 
371: 
372: The simplest view of the new algorithm that computes the same table
373: ({\em but it is not stored in entirety at any time}) is a parallel
374: algorithm, where there is a processor at each node of the error tree.
375: The algorithm at a node $i$ with children $i_L,i_R$ can be described as follows:
376: \begin{enumerate}\parskip=-0.05in
377: \item The node $i$ receives $S$ from its parent and seeks to return an 
378: array of size $B$ (or less) corresponding to $E[i,b,S]$ for $0\leq b \leq B$.
379: It actually receives
380: \[ v(i,S)=\sum_{i'\in S,i \mbox{ left descendant of }i'} c_{i'} - 
381: \sum_{i'\in S,i \mbox{ right descendant of }i'} c_{i'} \]
382: 
383: \item To evaluate $\min_{b'} E[i_L,b',S\cup\{i\}]+ E[i_R,b-1-b',S\cup\{i\}] $ 
384: the node $i$ passes $S\cup\{i\}$ to both of its children, i.e., 
385: $v({i_L},S\cup\{i\})$ and $v(i_R,S\cup\{i\})$. 
386: The children return the two arrays of size $B$ (or less), 
387: and the $\min_{b'}$ is performed for each $b$. Note that the right child can reuse the same space needed by the left child.
388: \item Now $i$ passes $S$ to the children and asks for $E[i_L,b,S]$ for all $b$ and likewise for $i_R$.
389: \item The node $i$ can now compute all $E[i,b,S]$. The entire time spent at this node is $\min \{ 2^{2t_i},B^2\}$.
390: \item If $i$ is the overall root, then $i$ also performs a minimization over all $b$ to find the solution with {\bf at most $\mathbf B$} coefficients. 
391: \end{enumerate}
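
\noindent The values passed down in the steps above satisfy a simple recursion, which follows directly from the definition of $v(i,S)$. The base case at a leaf is our reading of the dynamic program for the (unweighted) $\ell_k$ error and is stated here as an assumption rather than taken from the text:
\[ \begin{array}{ll}
v(i_L,S\cup\{i\}) = v(i,S) + c_i, & v(i_R,S\cup\{i\}) = v(i,S) - c_i, \\
v(i_L,S) = v(i,S), & v(i_R,S) = v(i,S),
\end{array} \]
and a leaf $j$ that receives the value $v$ contributes $|x_j - v|^k$ to the error (for $\ell_\infty$, $|x_j-v|$ combined by a maximum instead of a sum). In this view a node never needs the set $S$ itself, only the single number $v(i,S)$.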
392: 
393: \begin{lemma} No node receives the value $v(i,S)$ twice for the same set $S$.
394: \end{lemma}
The above shows that the algorithm is correct and runs in time $O(n^2\log B)$ (and $O(n^2)$ for $\ell_\infty$).
396: The next lemma is also immediate from the description of the algorithm:
397: \begin{lemma}
398: The space required at node $i$ is $\min \{ B , 2^{t_i} \}$, since this space is used for all $S$.
399: \end{lemma}
400: Thus the total space required is $O(B \log (n/B))$ (the last $\log B$ levels use geometrically decreasing space which sums to $O(B)$ and $\log n - \log B = \log (n/B)$).
401: Therefore if we consider the algorithm that simulates the parallel algorithm, we can
402: conclude with
403: \begin{theorem}
404: We can compute the {\bf error} of optimum $B$ term wavelet synopsis in time $O(n^2\log B)$ (and $O(n^2)$ for $\ell_\infty$) using overall space $O(n+B \log (n/B))=O(n)$.
405: \end{theorem}
406: Observe that we can only compute the error, and we do not know which coefficients are in the synopsis.
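
\noindent The space bound can be spelled out as follows. This is a short derivation under the assumption, implicit in the simulation, that at any time only the nodes on one root-to-leaf path hold an array. With an array of size $\min\{B,2^{t}\}$ per level $t$ (logarithms base $2$),
\[ \sum_{t=1}^{\log n} \min\{B,2^{t}\} = \sum_{t=1}^{\log B} 2^{t} + \sum_{t=\log B+1}^{\log n} B \leq 2B + B\log (n/B), \]
and the additional $O(n)$ term presumably accounts for holding the input itself.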
407: \subsection{How do we find the coefficients?}
408: We now show how to retrieve the coefficients after finding the total
409: error.  When we find the optimum error, we also resolve (i) if the
410: topmost coefficient is present or not and (ii) what is the allocation
411: of the coefficients to the left and right children.  Armed with these
412: two pieces of information, {\em we simply recurse/recompute}, i.e., we
413: pass the appropriate set (or $v(i,S)$ values) to the two children and
414: their respective allocations. Each child now finds the total error
415: {\em restricted to its subtree} and each decides on the two pieces of
416: information to set up the recursive game.
417: 
418: \noindent{\bf Analysis:} Let the running time of the recompute strategy be $f(n)$. To find the optimum error, we spend $c n^2 \log B $ time and therefore we have the recursion:
419: \[ f(n) = c n^2 \log B + 2f(n/2) \]
420: If we unroll the recursion one step, we see that $f(n) = c n^2 \log B + 2c(n/2)^2 \log B + 4 f(n/4)$. We can immediately observe that we are setting up a geometric sum and we can bound $f(n)$ by $2cn^2 \log B$. Therefore we conclude:
421: 
422: \begin{theorem}
423:   We can compute the {\em complete solution, i.e., total error and the
424:     stored coefficients} of the optimum $B$ term wavelet synopsis in
425:   $O(n^2\log B)$ time ($O(n^2)$ for $\ell_\infty$) using overall space
426:   $O(n + B \log (n/B))$.
427: \end{theorem}
428: 
429: {\bf Caveat:} We have to be careful and ensure that when we output the
430: coefficients recursively, we output all the coefficients of the first
half before outputting all the coefficients of the next half. In the
process, we need to remember the partition of the buckets, the
parameter $b'$, for $\log n$ levels. Since we have to remember
only one number per level, the total space is $O(n+B\log (n/B)+\log n)=O(n+B\log
(n/B))$.
436: 
437: \section{Unrestricted Wavelet Synopsis construction Algorithms} 
438: \label{sec:new}
439: We now show how to obtain an approximation algorithm for the
general/unrestricted wavelet synopsis construction problem. We focus
our attention on $\ell_k$ error and indicate the changes necessary for
the weighted case where appropriate. Recall that the Wavelet synopsis
443: problem is: Given a set of $n$ numbers $X=x_1,\ldots,x_n$, find a $Z
444: \in \real^n$ with at most $B$ non-zero entries such that $\| X -
445: \wai(Z) \|_k$ is minimized.
446: 
447: The following will be an important observation leading towards a
448: suitable algorithm: {\em If we observe the previous algorithm based on
449:   assigning a processor to each coefficient in the error tree, we
450:   immediately observe that if for different subsets of ancestors, we
451:   receive the same value, i.e., $v(i,S)=v(i,S')$ for $S'\neq S$, we
  need not redo the computation.}  {\bf Note} that the savings cannot
be guaranteed, and in order to achieve the savings we have to increase
the space bound.
455: 
456: \paragraph{Overview:} The above will form a kernel of our algorithm for the
457: (unrestricted) wavelet synopsis construction problem. We would
458: actually perform the computation {\em for all possible, anticipated
459:   values of $v(i,S)$. However, non-zero elements of $Z$ can have any
460:   real value and it is not clear how to restrict the set of values.}
461: 
462: In what follows, we first describe the algorithm assuming that the
463: wavelet coefficients belong to a set of anticipated values $R$.
464: Subsequently we describe how to
465: determine $R$ and more importantly, bound $|R|$. 
466: 
467: 
468: \subsection{The Algorithm}
469: \begin{defn}
470: Let $E[i,v,b]$ be the minimum possible contribution to the overall 
471: error from all descendants of $i$ using exactly $b$ coefficients, under the 
472: assumption that the combined value of all ancestors chosen is $v$. 
473: \end{defn}
474: 
The overall answer is clearly $\min_b E[root,0,b]$. A natural dynamic
program is immediate: to compute $E[i,v,b]$, if we decide the best
choice is to allocate $b'$ coefficients to the left and let the
$i^{th}$ coefficient be $r$, then we need to add $E[i_L,v+r,b']$ and
$E[i_R,v-r,b-b'-1]$. The overall algorithm is:
480: 
481: \begin{enumerate}\parskip=-0.05in
\item The number of $b$ that are relevant to $i$ is $\min\{ B,2^{t_i} \}$, where $t_i$ is the level of $i$. 
483: The node receives the $E[i_L,v',b'],E[i_R,v'',b'']$ from its children.
484: \item A non-root node computes $E[i,v,b]$ as follows:
485: \vspace{-0.05in}
486: \[ E[i,v,b] = \min \left \{ \begin{array}{ll}
487: \min_{r,b'} E[i_L,v+r,b'] + E[i_R,v-r,b-b'-1] & \mbox{~~~$i^{th}$ coefficient is $r$} \\ 
488: \min_{b'} E[i_L,v,b'] + E[i_R,v,b-b'] & \mbox{~~~$i^{th}$ coefficient not chosen} 
489: \end{array} \right.
490: \] 
491: \item  If $i$ is the root, then $i$ computes
492: \vspace{-0.05in} \[
493: \min_b \left \{ \begin{array}{ll}
494: \min_{r,b'} E[i_L,r,b'] + E[i_R,r,b-b'-1] & \mbox{~~~root coefficient is $r$}\\
495: \min_{b'} E[i_L,0,b'] + E[i_R,0,b-b'] & \mbox{~~~root coefficient not chosen} 
496: \end{array} \right.
497: \]
498: \end{enumerate}
499: 
500: Note that the root can figure out (i) the optimum error (ii) if any
501: coefficient corresponding to it is chosen and (iii) the value $r$ of
502: the coefficient. After the final solution is computed, we apply the
503: recompute strategy, and each node in the tree finds out if it has a
504: coefficient in the answer and its value. The running time is 
\[ \sum_i |R| \min \{ 2^{t_i},B \} \cdot
|R| \min \{ 2^{t_i},B \} = \sum_{t} |R|^2 \frac{n}{2^t} \min \{ 2^{2t},B^2 \} = O(|R|^2 nB)
507: \]
508: \vspace{-0.1in}
509: 
For $\ell_\infty$ the bound is $\sum_{t} |R|^2 \frac{n}{2^t} \min \{
t2^{t},B \log B \} = O(n|R|^2\log^2 B)$. The required space can be
shown to be $O(|R|B\log (n/B))$ by ensuring that the computation resembles a
post-order traversal of the tree and that we do not keep the tables of the
children nodes once we are done with them. Thus for each level we may need at
most $2$ tables of size $|R| \min\{B,2^\ell\}$, which sums to the above.
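
\noindent The closed form of the running time for $\ell_k$ follows by splitting the sum at level $\log B$. This is a short calculation in the same spirit as the one in Section~\ref{sec:old}, with logarithms base $2$:
\[ \sum_{t=1}^{\log n} \frac{n}{2^t} \min \{ 2^{2t},B^2 \} = n\sum_{t=1}^{\log B} 2^{t} + nB^2 \sum_{t=\log B+1}^{\log n} \frac{1}{2^{t}} \leq 2nB + nB = 3nB, \]
so the total running time is at most $3|R|^2nB$.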
516: 
517: \subsection{Computing $R$}
518: \begin{lemma}
519: \label{poo}
520: If the $\max_i |x_i|$ is $M$ then $\max_i |\wa(X)_i| \leq M$.
521: \end{lemma}
522: \begin{proof}
523:   The $1^{st}$ coefficient is the average of all values and therefore
524:   cannot exceed $M$.  Every other coefficient is half the average value of
525:   left half (of the support) minus half the average value of right half.
526:   Each cannot be more than $M$ in absolute value.
527: \end{proof}
528: \begin{lemma}
529: \label{boo}
530:   If the optimum solution is $Z^*$ then $\max_i |Z^*_i| \leq 2n^{\frac1k}M$.
531: \end{lemma}
532: \begin{proof}
533:   If $\max_i |\wai(Z^*)_i| \geq 2n^{\frac1k}M$ then $\|X - \wai(Z^*)\|_k \geq \|\wai(Z^*)\|_k - \|X\|_k $ and 
534: \[ \|\wai(Z^*)\|_k - \|X\|_k \geq  \|\wai(Z^*)\|_k - Mn^{\frac1k} \geq \max_i |\wai(Z^*)_i| - Mn^{\frac1k} \geq Mn^{\frac1k} \geq \|X\|_k\]
535: The all zero solution is a better solution, which is a contradiction. 
536: Now we apply Lemma~\ref{poo} and get $\max_i
537:   |\wa(\wai(Z^*))_i| = \max_i |Z^*_i| \leq 2n^{\frac1k}M$, which proves the lemma.
538: \end{proof}
539: In case of weighted $\ell_k$ the above is modified to $\max_i |Z^*_i|
540: \leq 2n^{\frac1k}M \frac 1 {\min_i \pi_i}$.
541: The next lemma follows from triangle inequality.
542: \begin{lemma}
543: If we round each non-zero value of the optimum $Z^*$ to the nearest multiple 
544: of $\delta$ thereby obtaining $\hat{Z}$, then $\| X - \wai(\hat{Z})\|_k \leq  \| X - \wai(Z^*)\|_k + \delta n^{\frac1k}$ and $|R| \leq \frac{2n^{\frac1k}M}{\delta}$.
545: 
546: \end{lemma}
  Therefore if we set $\delta = \epsilon M/n^{\frac1k}$, we obtain an additive approximation of $\epsilon M$ with $|R|  = O(\epsilon^{-1} n^{\frac2k})$.
548: Therefore we conclude the following:
549: \begin{theorem}
550:   We can solve the Wavelet Synopsis Construction problem with $\ell_k$
551:   error with an additive approximation of $\epsilon M$ where $M=\max_i
  |x_i|$ in time $O(n^{1+\frac4k}B\epsilon^{-2})$ and space
  $O(n+n^{\frac2k}\epsilon^{-1} B \log (n/B))$. For $\ell_\infty$ the running
  time is $O(n\epsilon^{-2}\log^2 B)$.
555: \end{theorem} 
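
\noindent The parameters in the theorem can be traced as follows (a sketch that simply substitutes $\delta = \epsilon M/n^{\frac1k}$ into the rounding lemma and the running-time bound $|R|^2nB$):
\[ \delta n^{\frac1k} = \epsilon M, \qquad |R| \leq \frac{2n^{\frac1k}M}{\delta} = \frac{2n^{\frac2k}}{\epsilon}, \qquad |R|^2 nB \leq \frac{4 n^{1+\frac4k}B}{\epsilon^{2}} . \]
For $\ell_\infty$ (that is, $k\rightarrow\infty$) the factors $n^{\frac1k}$ tend to $1$, so $|R|=O(\epsilon^{-1})$ and the bound $O(n|R|^2\log^2 B)$ becomes $O(n\epsilon^{-2}\log^2 B)$.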
556: 
557: \section{The theme of space efficiency and applications}
558: 
559: A natural paradigm emerges from inspecting the above: 
560: {\em If we can
561: compute the total error and the best way to partition the problem into
562: two halves of $\frac{n}2$ elements, we do not need to store the entire
563: dynamic programming table} -- {\em and thereby save space.} 
564: If we can compute the
565: overall error in time $f(n)=An^\alpha$ where $A$ is independent of
566: $n$, then the time taken by the {\em Recompute} strategy is
567: $g(n)=f(n)+2g(n/2)$. The solution to the recurrence is 
568: $O(An^\alpha)$ if $\alpha>1$ and $O(An\log n)$ if $\alpha=1$.
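
\noindent The solution of the recurrence can be verified by unrolling it. This is a standard calculation, included as a sketch; it assumes that $n$ is a power of $2$, that the recursion stops at constant size, and that $\alpha$ is a constant:
\[ g(n) = \sum_{j=0}^{\log n} 2^j f\!\left(\frac{n}{2^j}\right) = An^\alpha \sum_{j=0}^{\log n} 2^{j(1-\alpha)} . \]
For $\alpha>1$ the sum is geometric and $g(n) \leq An^\alpha/(1-2^{1-\alpha}) = O(An^\alpha)$; for $\alpha=1$ every term equals $1$ and $g(n) = An(\log n+1) = O(An\log n)$. For $\alpha=2$ the constant is $2$, which matches the bound $2cn^2\log B$ obtained in Section~\ref{sec:old}.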
569: 
570: 
571: We demonstrate the above idea in two examples. First, we show its
572: impact in space efficient V-Opt histogram construction.  Second, we
573: show the applicability in a new synopsis technique, {\em Extended
574:   Wavelets}.
575: 
576: The idea also improves several results on range query histograms --
577: however those algorithms are quite similar in spirit to the V-Opt
578: histogram construction and we relegate the discussion to a fuller
579: version of the paper. However the idea does help in reducing the space
580: bound across the board -- in fact for a large variety of problems it
581: is immediate that the notion of {\em working space}, the space
582: necessary to compute the {\em value} of the final answer, is not
583: required any more. We can compute the entire answer, in the
584: aforementioned working space.
585: 
586: \subsection{V-Opt Histograms}
587: \label{sec:hist}
588: The V-Opt histogram is a classic problem in synopsis construction.
Given a sequence of $n$ numbers $X=x_1,\ldots,x_n$, the problem seeks a
piecewise constant representation $H$ with at most $B$ pieces such that $\|X -
H \|_2$ (or its square) is minimized.  Since their introduction in
592: query optimization in \cite{I93}, and subsequently in
593: approximate query answering (\cite{Aqua}, among others), histograms
594: have accumulated a rich history \cite{I03}.  Several different
595: optimization criteria have been proposed for histogram construction,
596: e.g., $\ell_1$, relative error, $\ell_\infty$, to name a few.  However
597: most of them are based on a dynamic program similar to the V-Opt case.
598: Thus the V-Opt histograms provide an excellent foil to discuss all of
599: the measures at the same time.  As mentioned in the introduction,
\cite{Jag98} gave an $O(n^2B)$ time algorithm to find
601: the optimum histogram using $O(nB)$ space. They observed that the
602: space could be reduced to $O(n)$ at the expense of increasing the
603: running time to $O(n^2B^2)$.  The data stream algorithms\footnote{Note
604:   that by the streaming model we refer to the ``sorted'' or
605:   ``aggregate'' model, most useful in time series data, where the
606:   input is $x_i$ in increasing order of $i$. Only \cite{GGIKMS02}
607:   applies to the general ``turnstile'' or ``update'' model, but seems
608:   to have high polynomial dependence on $B\epsilon^{-1}\log n$. See
609:   \cite{muthu-survey,pods02} for more details on data stream models.}
of \cite{GKS01} (extended in \cite{GKS04}) represent sparse dynamic
programming tables, but the space is still $\tilde{O}(B^2)$, quadratic in $B$.
In those algorithms the $\tilde{O}(B^2)$ space plays a double role:
storing the coefficients and maintaining a frontier.
614: 
615: This is somewhat remedied in \cite{GIMS02,MS04}, where a robust
616: wavelet representation of $\tilde{O}(B)$ coefficients is constructed 
617: and then a dynamic program in the fashion of \cite{Jag98} or \cite{GKS01} 
618: restricted to the {\em endpoints of the support regions} is used.  
619: The dynamic program of \cite{Jag98} can be used to compute the answer 
620: in $\tilde{O}(B)$ space, but with an extra factor of $B$ in
621: running time. Therefore, irrespective of offline or streaming computation
622: there was a tradeoff between large space and an increased running time
623: -- this is {\em the penalty} referred to in the introduction.
624: This is the first paper which removes that penalty and gives an algorithm that 
625: simultaneously achieves the best known space and time bounds.
626: 
627: \begin{center}
628: {\small
629: \begin{tabular}{|c|c|c|c|c|c|}
630: \hline
631: Paper & Stream & Factor & Time & Space &  Working space  \\
632: \hline
633: \hline
634: \cite{Jag98} & No & Opt & $O(n^2B)$ & $O(nB)$ & $O(n)$ \\ 
635: & &  & $O(n^2B^2)$ & $O(n)$ & $O(n)$ \\
636: \hline
\cite{GKS01} & Yes & $(1+\epsilon)$ & $O(nB^2\epsilon^{-1}\log n)$ & $O(B^2\epsilon^{-1} \log n)$ & --\\
638: \hline \cite{GIMS02} & Yes & $(1+\epsilon)$ & $O(n+B^3 \epsilon^{-8} \log^4 n)$ & $O(B^2\epsilon^{-4} \log^2 n)$ & -- \\  
639: & & &  $O(n+B^4 \epsilon^{-8} \log^4 n)$ & $O(B \epsilon^{-4} \log^2 n)$ & --\\ 
640: \hline
641: \cite{newver}& No & $(1+\epsilon)$ & $O(n + B^3(\epsilon^{-2} + \log n) \log n)$ & $O(n + B^2\epsilon^{-1})$ & $O(n + B\epsilon^{-1})$ \\ 
642: & Yes &  & $O(n+ (n/M)B^3 \epsilon^{-2} \log^3 n)$ & $O(M + B^2\epsilon^{-1} \log n) $ & -- \\ 
643: \hline \cite{MS04} & Yes & $(1+\epsilon)$ & $O(n+B^3 \epsilon^{-3} (\log 1/\epsilon) \log n)$ & $O(B\epsilon^{-2} (\log 1/\epsilon) \log n + B^2/\epsilon)$ & -- \\  
644: & & &  $O(n+B^4 \epsilon^{-3} (\log 1/\epsilon) \log n)$ & $O(B \epsilon^{-2} (\log 1/\epsilon) \log n + B/\epsilon)$ & --\\  
645: \hline
646: \hline 
647: {\bf This Paper} & No & Opt & $O(n^2B)$ & $O(n)$ & $O(n)$ \\
648: & No &  $(1+\epsilon)$ & $O(n + B^3(\epsilon^{-2} + \log n) \log n)$ & $O(n + B\epsilon^{-1})$ & $O(n + B\epsilon^{-1})$ \\
649: & Yes & $(1+\epsilon)$ & $O(n+B^3 \epsilon^{-3} (\log 1/\epsilon)\log n)$ & $O(B\epsilon^{-2} (\log 1/\epsilon) \log n + B/\epsilon)$ & -- \\
650: \hline
651: \end{tabular}
652: }
653: \end{center}
654: 
655: 
\paragraph{Algorithm idea:} Due to lack of space, we describe only the
modification to the optimal algorithm; the modifications to the
approximation and streaming algorithms are similar. The optimal
algorithm maintains $E[i,b]$, the
minimum error of representing the interval $[1,i]$ by at most $b$
buckets (intervals where the representation is constant).
A natural dynamic program arises: $E[i,b] = \min_{j<i}
\left( E[j,b-1] + e(j+1,i) \right)$, where $e(j+1,i)$ is the minimum error of
representing $x_{j+1},\ldots,x_i$ by a single
bucket\footnote{It is straightforward to show that the minimum error
  is achieved by the mean of $x_{j+1},\ldots,x_i$.}.  The running time
is $O(n^2B)$. If we are interested in computing only the value of the final answer,
there is an $O(n)$ space algorithm which computes $E[i,1]$ for all $i$, and then
extends that to $b=2,3,\ldots,B$.
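
For concreteness, taking the error to be the squared $\ell_2$ error, the single-bucket error can be evaluated in $O(1)$ time after linear preprocessing. Writing $S_i=\sum_{t\leq i} x_t$ and $Q_i=\sum_{t\leq i} x_t^2$ for prefix sums (this notation is introduced here only for illustration), and using the mean $\mu=(S_i-S_j)/(i-j)$ as the bucket value,
\[
e(j+1,i) \;=\; \sum_{t=j+1}^{i} (x_t-\mu)^2 \;=\; (Q_i-Q_j) \;-\; \frac{(S_i-S_j)^2}{i-j} .
\]
Each of the $O(nB)$ entries $E[i,b]$ then takes $O(n)$ time to minimize over $j$, consistent with the $O(n^2B)$ bound, and the prefix sums occupy only $O(n)$ space.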
669: 
For $i>\frac{n}{2}$ we maintain $A[i]$, the starting point of the
bucket that contains $x_{n/2}$ in the best representation of $[1,i]$
by $b$ buckets, $B[i]$, the ending point
of that bucket, and $C[i]$, the number of buckets used before $A[i]$.
This requires $O(n)$ space, and is updated as shown in Figure~\ref{fig:opt}. After we compute
$E[n,B]$ we can divide the problem into two parts: representing $[1,A[n]-1]$ using
$C[n]$ buckets and $[B[n]+1,n]$ using $B - C[n] - 1$ buckets. {\em Note that each subproblem is defined on at most $\frac{n}{2}$ elements}. Therefore the {\em Recompute strategy} also runs in time $O(n^2B)$ and computes all the coefficients.
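
To sketch why the recomputation does not increase the asymptotic running time, suppose (as an assumption for this calculation) that one pass of the dynamic program on $m$ elements with $b$ buckets costs at most $c\,m^2 b$ for some constant $c$. At depth $t$ of the recursion every subproblem has at most $n/2^t$ elements, and the bucket budgets $b_s$ of the subproblems at that depth sum to at most $B$. The total cost is therefore at most
\[
\sum_{t\geq 0}\; \sum_{s} c\left(\frac{n}{2^t}\right)^{2} b_s
\;\leq\; c\,n^2 B \sum_{t\geq 0} 4^{-t}
\;=\; \frac{4}{3}\, c\, n^2 B ,
\]
which is $O(n^2B)$. Since the recursive calls run one after another, the arrays can be reused and the space remains $O(n)$.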
677: 
678: \begin{figure}[htbp]
679: \begin{center}
680: \framebox{\small
681: \begin{minipage}{5.5in}
682: \begin{tabbing}11111\=111\=111\=111\=111\=111\=111\=111\=111\=111\=111\kill 
\bcc \> $A[i]=0$ if $i\leq \frac{n}2$ and $1$ otherwise. $B[i]=0$ if
  $i\leq \frac{n}2$ and $i$ otherwise. $C[i]=0$ for all $i$.\\
\icc \> For $b=2$ to $B$ do \\
\icc \> \> For $i=2$ to $n/2$ do \\
\icc \> \> \> $E[i,b]=\min_{j<i} E[j,b-1] + e(j+1,i)$ \\
\icc \> \> For $i=n/2+1$ to $n$ do \\
\icc \> \> \> $E[i,b]=\min_{j<i} E[j,b-1] + e(j+1,i)$ \\
\icc \> \> \> If $j$ (which achieved the minimum) $\leq \frac{n}2$ then $newA[i]=j+1,newC[i]=b-1,newB[i]=i$.\\
\icc \> \> \> else $newA[i]=A[j],newB[i]=B[j],newC[i]=C[j]$; \\
\icc \> \> $A \leftarrow newA, B \leftarrow newB, C \leftarrow newC$. \\
\icc \> Recurse using $A[n],B[n],C[n]$ to compute the coefficients.
694: \end{tabbing}
695: \end{minipage}
696: }
697: \end{center}
698: \vspace{-0.2in}
699: \caption{The $O(n)$ space optimum algorithm\label{fig:opt}}
700: \end{figure}
701: 
Observe that we have kept the $E[j,b-1],E[i,b]$ notation, but we can
reuse two arrays of size $n$ for this purpose (switching them
as $newE,E$, etc.), so the overall space required is $O(n)$.  We now
know the final solution $E[n,B]$ and how to partition the problem.
For the {\em offline approximation algorithm}, when we recurse, we have to
add the approximate error $E'[B[i]+1,C[i]+1]$ to all the elements of
the right subproblem (since we build histograms with error increasing by a $1+\epsilon$ factor, this ``shift'' is needed). Due to lack of space, the details are relegated
to the full version.
710: 
711: \subsection{Extended Wavelets}
712: \label{sec:ext}
713: Extended wavelets were introduced in
714: \cite{DR03}. The central idea is that in case of multi-dimensional
715: data, there can be significant saving of space if we use a
716: non-standard way of storing the information. There are several
717: standard ways of extending 1-dimensional (Haar) wavelets to multiple
718: dimensions. The wavelet basis corresponds to high-dimensional squares.
719: But irrespective of the number of dimensions, the format of the
720: synopsis is a pair of numbers {\em (coefficient index,value)}.
721: In Extended Wavelets we perform wavelet decomposition independently in
722: each dimension but then 
723: %If we wish to store the coefficient $i$ in all the
724: %dimensions, we can store $i$ followed by a list of the values. This
725: %would use roughly half the space to store each coefficient (since we
726: %are storing 1 number per dimension). 
we store tuples consisting of
the coefficient index, a bitmap indicating the dimensions for
which the coefficient in that dimension is chosen, and a list of
values. Since the coefficient index and the bitmap are shared across
the coefficients, we can store more coefficients than a simple union
of unidimensional transforms.
733: 
734: Notice that there is no interaction between the benefits of storing
735: coefficient $i$ and $i'$. The problem reduces naturally to a {\em
736:   Knapsack} problem with a twist that each item (coefficient $i$) can
737: be present in varying sizes (how many values corresponding to
738: different dimensions are stored). However the variant also has a
739: simplifying feature that the space bound is polynomially bounded,
740: therefore allowing a simple dynamic program. The program estimates
741: $E[i,b]$ which indicates the minimum error on using {\em at most} $b$ 
742: space and storing only a subset of the first $i$ coefficients.
743: 
The idea is relatively new, and it remains to be seen whether Extended
Wavelets will be applied widely. But it is an intriguing and novel idea in
synopsis construction and serves as an example of the broad
applicability of the ideas in this paper.
This paper also gives the first (almost) linear ($O(B)$, ignoring $M$) space
algorithm in the streaming (as well as offline) model.  We present the
results on the optimum algorithms below\footnote{The input is $n$
  tuples in $M$ dimensions and the total synopsis size is $B$.  The
  papers \cite{DR03,GKS04} contain other approximation algorithms that
  are not relevant to our context. The extended version of \cite{GKS04} reduces
  Extended Wavelets to a problem similar to V-Opt histogram
  construction and gives an $O(NM)$ time algorithm using dynamic
  programming.  The ideas of this paper naturally imply improvements
  to the space requirement under the assumption that $B \ll NM$.  The
  reduction is somewhat detailed and is omitted in this draft.}.
759: 
760: \begin{center}
761: {\small
\begin{tabular}{|c|c|c|c|c|}
763: \hline
764: Paper & Stream &  Time & Space & Working Space \\
765: \hline
766: \hline
767: \cite{DR03} & No & $ O(nMB) $ & $O(nMB)$ & $O(nM+MB)$\\
768: \hline
769:  \cite{GKS04} & Yes & $O(nMB)$ & $O(MB+B^2)$ & $O(MB+B^2)$ \\
770: & & $O(nM \log M +B^2M^2)$ & $O(MB+B^2)$ & $O(MB+B^2)$ \\
771:  \hline
772: \hline
773: {\bf This Paper} & Yes & $O(nM \log M + B^2 \log M )$ & $O(MB)$ & $O(MB)$\\
774: \hline
775: \end{tabular} 
776: }
777: \end{center}
778: 
779: 
780: 
781: 
782: \paragraph{Algorithm Idea:} We follow the previous algorithms and introduce a 
783: few small changes and a more careful analysis. For each item $i$ we
784: compute the best profit if $i$ is allocated size $j$. This is done in
785: time $O(nM\log M)$ as in \cite{GKS04}. For each $1\leq j\leq M$ we
786: maintain the top $B/j$ items corresponding to size $j$. For each $j$ we
787: can achieve this in $O(B/j)$ space and $O(n)$ running time (using
788: details from \cite{GIMS02}), using overall $O(nM)$ time and $\sum_j
(B/j)j = O(BM)$ space. The optimum answer uses items and sizes from
these lists only. The total number of item-size pairs is $\sum_j (B/j)
= O(B \log M)$.
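
Spelling out the two sums, under the simplifying assumption that $B/j$ is treated as a real number rather than rounded:
\[
\sum_{j=1}^{M} \frac{B}{j}\cdot j \;=\; BM ,
\qquad
\sum_{j=1}^{M} \frac{B}{j} \;=\; B\,H_M \;\leq\; B\,(1+\ln M) ,
\]
where $H_M$ denotes the $M$-th harmonic number. The first sum bounds the space of the retained values and the second counts the retained item-size pairs.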
792:  
793: We can sort this list in lexicographic order. %using time $O(B(\log M) \log B)$.
794: Suppose item $i$ has $x_i \geq 1$ occurrences (thus $\sum_i x_i =O(B \log M)$).
The dynamic program to extend the answer to $i$ (from the item before
$i$) first needs to guess/choose which of the $x_i$ occurrences
is used (or none) and compute the best solution for each space bound up to $B$. The
time taken is $c(x_i+1)B$ at $i$, which totals at most $2cB^2\log M$.
799: 
We maintain an $O(B)$ size array where $P[z]$ corresponds to the best
profit for space $z$ up to the current $i$.  For {\em space
  efficiency}, for $z\geq B/2$ we keep track of $Q[z]$, which contains
the triple $\langle i',r,b'\rangle$ such that the optimum solution for space
$z$ for the current $i$ uses space $b' < B/2$ up to $i'$ and a size $r$
copy of $i'$ with $b'+r\geq B/2$. In other words, $Q[z]$ records the {\em crossing
  point} where we crossed $B/2$ space for that solution (which remains
the same even if we extend it later).
808: 
809: We now recurse with $b,b' \leq B/2$ on the two parts. Now each item
810: contributes $c(x_i+1)B/2$ adding up to less than $cB^2 \log M$. Once
811: again we have a geometric sum which sums up to $O(B^2\log M)$ for the
812: entire recursion.~\\
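
As a sketch of the geometric sum, assume that at depth $t$ of the recursion every subproblem has space budget at most $B/2^t$ and that each item of the sorted list belongs to at most one subproblem at that depth (the items before and after a crossing point go to different parts). The cost at depth $t$ is then at most $\sum_i c(x_i+1)B/2^t$, and
\[
\sum_{t\geq 0}\; \sum_i c\,(x_i+1)\frac{B}{2^t}
\;\leq\; 2cB \sum_i x_i \sum_{t\geq 0} 2^{-t}
\;=\; 4cB\sum_i x_i \;=\; O(B^2\log M),
\]
using $x_i\geq 1$ and $\sum_i x_i = O(B\log M)$.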
813: 
814: \noindent {\bf Acknowledgments:} We would like to thank Hyoungmin 
815: Park and Kyuseok Shim for many interesting discussions.
816: {\small
817: \bibliographystyle{plain}
818: \begin{thebibliography}{10}
819: 
820: \bibitem{Aqua}
821: S.~Acharya, P.~Gibbons, V.~Poosala, and S.~Ramaswamy.
822: \newblock {The Aqua Approximate Query Answering System}.
823: \newblock {\em Proc. of ACM SIGMOD}, 1999.
824: 
825: \bibitem{pods02}
826: B.~Babcock, S.~Babu, M.~Datar, R.~Motwani, and J.~Widom.
827: \newblock Models and issues in data stream systems.
828: \newblock {\em PODS}, pages 1--16, 2002.
829: 
830: \bibitem{DR03}
831: A.~Deligiannakis and N.~Roussopoulos.
832: \newblock Extended wavelets for multiple measures.
833: \newblock In {\em SIGMOD Conference}, 2003.
834: 
835: \bibitem{GK04}
836: M.~Garofalakis and A.~Kumar.
837: \newblock Deterministic wavelet thresholding for maximum error metric.
838: \newblock {\em Proc. of PODS}, 2004.
839: 
840: \bibitem{GG02}
841: M.~N. Garofalakis and P.~B. Gibbons.
842: \newblock Wavelet synopses with error guarantees.
843: \newblock In {\em Proc. of ACM SIGMOD}, 2002.
844: 
845: \bibitem{GGIKMS02}
846: A.~C. Gilbert, S.~Guha, P.~Indyk, Y.~Kotidis, S.~Muthukrishnan, and Martin
847:   Strauss.
848: \newblock Fast, small-space algorithms for approximate histogram maintenance.
849: \newblock In {\em Proc. of ACM STOC}, 2002.
850: 
851: \bibitem{GIMS02}
852: S.~Guha, P.~Indyk, S.~Muthukrishnan, and M.~Strauss.
853: \newblock Histogramming data streams with fast per-item processing.
854: \newblock In {\em Proc. of ICALP}, 2002.
855: 
856: \bibitem{GKS04}
857: S.~Guha, C.~Kim, and K.~Shim.
858: \newblock {XWAVE}: Optimal and approximate extended wavelets for streaming
859:   data.
860: \newblock {\em Proceedings of VLDB Conference}, 2004.
861: 
862: \bibitem{GKS01}
863: S.~Guha, N.~Koudas, and K.~Shim.
864: \newblock {Data Streams and Histograms}.
865: \newblock In {\em Proc. of STOC}, 2001.
866: 
867: \bibitem{newver}
S.~Guha, N.~Koudas, and K.~Shim.
\newblock Approximation algorithms for histogram construction problems.
\newblock {\em Technical Report, the full version of \cite{GKS01}, available at
  http://www.cis.upenn.edu/\~{}sudipto/mypapers/histjour.pdf.gz}, 2004.
872: 
873: \bibitem{I93}
874: Y.~E. Ioannidis.
875: \newblock Universality of serial histograms.
876: \newblock In {\em Proc. of the VLDB Conference}, 1993.
877: 
878: \bibitem{I03}
879: Y.~E. Ioannidis.
880: \newblock The history of histograms (abridged).
881: \newblock {\em Proc. of VLDB Conference}, pages 19--30, 2003.
882: 
883: \bibitem{Jag98}
H.~V. Jagadish, N.~Koudas, S.~Muthukrishnan, V.~Poosala, K.~C. Sevcik, and
885:   T.~Suel.
886: \newblock {Optimal Histograms with Quality Guarantees}.
887: \newblock In {\em Proc. of the VLDB Conference}, 1998.
888: 
889: \bibitem{yossi1}
890: Y.~Matias and D.~Urieli.
891: \newblock Optimal workload-based wavelet synopses.
892: \newblock {\em TR-TAU}, 2004.
893: 
894: \bibitem{MVW98}
895: Y.~Matias, J.~Scott Vitter, and M.~Wang.
896: \newblock { Wavelet-Based Histograms for Selectivity Estimation}.
897: \newblock {\em Proc. of ACM SIGMOD}, 1998.
898: 
899: \bibitem{muthu-survey}
900: S.~Muthukrishnan.
901: \newblock Data streams: Algorithms and applications.
902: \newblock {\em Survey available on request at {\tt muthu@research.att.com}},
903:   2003.
904: 
905: \bibitem{muthu-wave}
906: S.~Muthukrishnan.
907: \newblock Workload optimal wavelet synopsis.
908: \newblock {\em DIMACS TR}, 2004.
909: 
910: \bibitem{MS04}
911: S.~Muthukrishnan and M.~Strauss.
912: \newblock Approximate histogram and wavelet summaries of streaming data.
913: \newblock {\em DIMACS TR 52}, 2003.
914: 
915: \end{thebibliography}
916: 
917: }
918: \end{document}
919: