
1: \documentclass[11pt]{article}

2: \usepackage{times}

3: \usepackage{graphicx}

4: \usepackage{epsfig}

5: \usepackage{amsmath}

6:

7: %\usepackage{fullpage}

8: \topmargin 0pt

9: \advance \topmargin by -\headheight

10: \advance \topmargin by -\headsep

11:

12: \textheight 9in

13: \oddsidemargin 0pt

14: \evensidemargin \oddsidemargin

15: \marginparwidth 0.5in

16: \textwidth 6.5in

17:

18:

19:

20: %\renewcommand{\baselinestretch}{1.28}

21: \newcommand{\qed}{\hfill \rule{2mm}{3mm}}

22: \newenvironment{proof}{\par \noindent{\bf Proof:}}{\(\qed\) \par}

23: \newcommand{\eat}[1]{}

24: \newcommand{\mnote}[1]{ [[[ \marginpar{\mbox{$<==$}} #1 ]]] }

25: %\newcommand{\eatreminders}{\renewcommand{\reminder}[1]{}}

26: \newcommand{\opt}{\mbox{O{\sc pt}}}

27: \newcommand{\var}{\mbox{V{\sc ar}}}

28: \newcommand{\sol}{\mbox{S{\sc ol}}}

29: \newcommand{\E}{\mbox{\bf E}}

30: \renewcommand{\thefootnote}{\fnsymbol{footnote}}

31: %\begin{document}

32:

33:

34: \newcommand{\goal}[1]{ {\noindent {$\Rightarrow$} \em {#1} } }

35: \newcommand{\hide}[1]{}

36: \newcommand{\junk}[1]{}

37: \newcommand{\etal}{ {\em et al.\ } }

38: \newcommand{\comment}[1]{ {\footnotesize {#1} } }

39:

40: \newtheorem{theorem}{Theorem}

41: \newtheorem{lemma}{Lemma}[theorem]

42: \newtheorem{claim}{Claim}[theorem]

43: \newtheorem{corollary}{Corollary}[theorem]

44: \newtheorem{defn}{Definition}[theorem]

45: \newtheorem{algo}{Algorithm}

46: \newtheorem{observation}[theorem]{Observation}

47: \newtheorem{proposition}{Proposition}[theorem]

48: \newtheorem{problem}{Problem}

49: \newtheorem{example}{Example}

50:

51: \newcommand{\vsum}{\mbox{S{\sc um}}}

52: \newcommand{\sqvsum}{\mbox{S{\sc qsum}}}

53: \newcommand{\se}{\mbox{S{\sc qerror}}}

54: \newcommand{\sse}{\mbox{T{\sc err}}}

55: \newcommand{\asse}{\mbox{A{\sc pxerr}}}

56: \newenvironment{proofidea}{{\bf Proof Idea:}}{}

57: \newcommand{\barw}{\bar{W}}

58: \newcommand{\vsumw}{\mbox{I{\sc nfo}}}

59: \newcommand{\sew}{\mbox{E}_B}

60: \newcommand{\ssew}{\mbox{E}_T}

61: \newcommand{\assew}{\mbox{A{\sc px}E}}

62:

63: \newcounter{ccc}

64: \newcommand{\bcc}{\setcounter{ccc}{1}\theccc.}

65: \newcommand{\icc}{\addtocounter{ccc}{1}\theccc.}

66: \newcommand{\wa}{{\cal W}}

67: \newcommand{\wai}{{\cal W}^{-1}}

68: \newcommand{\zr}{{\cal Z}_R}

69: \newcommand{\real}{{\cal R}}

70:

71: \title{How far will you walk to find your shortcut:

72: Space Efficient Synopsis Construction Algorithms}

73: \author{Sudipto Guha \thanks{Department of Computer Science, University of Pennsylvania, 3330 Walnut St, Philadelphia, PA 19104. Email: {\tt sudipto@cis.upenn.edu}}}

74:

75: \begin{document}

76: \date{}

77: \maketitle

78: \begin{abstract}

79:   In this paper we consider the wavelet synopsis construction problem

80:   without the restriction that we only choose a subset of coefficients

81:   of the original data. We provide the first near optimal algorithm.

82:

83:   We arrive at the above algorithm by considering space efficient

84:   algorithms for the restricted version of the problem. In this context

85:   we improve previous algorithms by almost a linear factor and reduce the

86:   required space to almost linear. Our techniques also extend to histogram

87:   construction, and improve the space-running time tradeoffs for V-Opt

88:   and range query histograms. We believe the idea applies to a broad

89:   range of dynamic programs and demonstrate it by showing improvements

90:   in a knapsack-like setting seen in the construction of Extended Wavelets.

91: \end{abstract}

92: \section{Introduction}

93: Wavelet synopsis techniques have become extremely popular in query

94: optimization, approximate query answering and a large number of decision

95: support systems. Wavelets, especially Haar wavelets, are one-to-one mappings and

96: admit a natural

97: multi-resolution interpretation, as well as

98: fast algorithms for the forward and inverse transforms.

99:

100: Given a set of $n$ numbers $X=x_1,\ldots,x_n$ the wavelet synopsis

101: construction problem seeks to choose a synopsis vector $Z$ with at

102: most $B$ non-zero entries, such that the inverse wavelet transform of

103: $Z$ (denoted by $\wai(Z)$) gives a good estimate of the data. The

104: typical objective measures are (suitably weighted\footnote{ The

105:   weighted $\ell_k$ with weights $\pi_i>0$ and wlog $\sum_i \pi_i=n$

106:   minimizes $\left( \sum_i (\pi_i (x_i -

107:     \wai(Z)_i))^k\right)^{\frac1k}$.}) $\ell_k$ norm of $X - \wai(Z)$.

108: An early paper \cite{MVW98} demonstrated a number of different

109: applications for wavelet synopsis and proposed greedy algorithms.

110: However for objective measures other than the $\ell_2$ measure,

111: the greedy algorithm does not necessarily provide the optimum solution.  The

112: problem is quite non-trivial, primarily due to the fact that the wavelet

113: basis vectors overlap and cancellations (subtractions) occur. This

114: means that we can have two coefficients that cancel out each other

115: leaving a significantly (exponentially) smaller contribution, which

116: needs to be accounted for.  The precision of the coefficients in the

117: optimum solution can be much larger than the precision of the data. In

118: fact there are no known bounds or promising techniques for quantifying

119: the precision - this is the biggest stumbling block in the synopsis

120: construction.

121:

122: Most of the literature focuses on the {\bf Restricted case} where the non-zero

123: entries of $Z$ are equal to the corresponding entries in the

124: transform of the original data, $\wa(X)$. A natural question remains: {\em why

125:   should we be optimizing under the restriction of retaining the

126:   coefficients of the data -- with no guarantees that such a

127:   restriction does not compromise the quality of the final synopsis?}

128: This is clearly suboptimal -- a comparable example would be to

129: optimize the synopsis for point queries, and use it for range queries.

130:

131:

132: A simple example renders the discussion concrete; $X=\{1,2,3,7\}$ and

133: $B=1$ illustrates that choosing any single coefficient of

134: $\wa(X)=\{3.25,-1.75,-0.5,-2\}$ (non-normalized)

135: does not give the optimum answer for

136: $\ell_1$ or $\ell_\infty$ norm.

137: Normalization does not help. The normalized transform is

138: $\{4.55,-2.45,-0.5,-2\}$ -- but choosing the first coefficient as

139: $4.55$ in the normalized setting implies assigning $4.55/\sqrt{2}=3.25$ everywhere.

140: Thus dynamic programming approaches that examine the effect of a coefficient on the data reach the same conclusion in both settings.

141: The optimum choices of $Z$ are

142: $\{z,0,0,0\}$ for any $2\leq z\leq 3$ and $\{4,0,0,0\}$ for $\ell_1$

143: and $\ell_\infty$ respectively.

144: The same example applies to {\bf weighted} $\mathbf \ell_2$,

145: e.g., if $\pi=\{\frac12,\frac12,\frac32,\frac32\}$ then the best error achieved by retaining any single entry of $\wa(X)$

146: is $5.78$ whereas $Z=\{4.65,0,0,0\}$ gives an error of $4.87$.

147: The example can be extended to any

148: $B$ (by repetition and scaling). The restriction of only retaining

149: the coefficients of the data is significantly

150: self-defeating.
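
To make the comparison concrete, the following worked check (our own arithmetic, using the unweighted $\ell_1$ and $\ell_\infty$ norms and the non-normalized transform quoted above) lists the error of retaining each single coefficient of $\wa(X)$ for $X=\{1,2,3,7\}$, followed by two unrestricted choices:
\[ \begin{array}{l|l|c|c}
Z & \wai(Z) & \ell_1 \mbox{ error} & \ell_\infty \mbox{ error} \\ \hline
\{3.25,0,0,0\} & (3.25,3.25,3.25,3.25) & 7.5 & 3.75 \\
\{0,-1.75,0,0\} & (-1.75,-1.75,1.75,1.75) & 13 & 5.25 \\
\{0,0,-0.5,0\} & (-0.5,0.5,0,0) & 13 & 7 \\
\{0,0,0,-2\} & (0,0,-2,2) & 13 & 5 \\ \hline
\{2,0,0,0\} & (2,2,2,2) & 7 & 5 \\
\{4,0,0,0\} & (4,4,4,4) & 9 & 3
\end{array} \]
The best restricted choice attains $\ell_1$ error $7.5$ and $\ell_\infty$ error $3.75$, whereas the unrestricted choices attain $7$ and $3$ respectively.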

151:

152: However the restriction does ease the search for a solution, and as

153: this paper shows, is an important stepping stone towards the final

154: result. For the restricted case, \cite{GG02} gave a probabilistic

155: scheme (the space constraint is preserved in expectation only, along

156: with the error) and very recently \cite{GK04} gave an optimal solution.

157: This has been extended and improved in

158: \cite{muthu-wave}.  {\em However, the solution to the unrestricted

159:   case has remained elusive and we provide the first near optimal

160:   solutions.} In the process, we also improve upon previous algorithms

161: for the restricted case.  However, our algorithm is best

162: explained by taking a different path, which brings us to the {\em major

163: theme} of the paper.

164:

165: \paragraph{}

166: Synopsis construction is perhaps most relevant in the context of massive

167: data sets. In some scenarios we can justify that the synopsis is

168: created using a ``scratch'' space larger than the synopsis and stored.

169: However a {\em quadratic} or extremely superlinear space complexity is

170: near infeasible for large $n$.

171: The dependence on synopsis size $B$ is also important in this context --

172: the smaller the dependence is, the larger is the synopsis that can be

173: computed in the environment of a particular system. Further,

174: {\em space is typically a more inflexible resource} than time,

175: and not just a matter of waiting. However a natural conceptual

176: question arises: {\em We are only given $n$ numbers; do we

177:   really need to save so much information to compute the optimum

178:   answer?}

179:

180: All previous algorithms (for the restricted case) are expensive in

181: space (see table below). This (super-linearity in $n,B$) is also seen

182: in the context of histogram construction (we provide a detailed table in

183: Section~\ref{sec:hist}).  To avoid this expensive space complexity,

184: several researchers have introduced the notion of {\em working space},

185: which is the amount of space required to compute the error -- the rest

186: of the space is used to construct the answer (coefficients,

187: representatives, etc.).  In the case of wavelets the working space used by

188: previous algorithms is $O(nB)$. In the case of histograms, known

189: algorithms reconstruct the answer only using the $O(n)$ working space,

190: but {\em with a penalty of an extra factor of $B$ in the running

191:   time}.  In this paper, we reduce the space for wavelets and

192: eliminate the penalty for histograms. In fact, our results show that

193: the {\em working space notion is not needed} for a wide range of

194: problems. To summarize, {\bf our contributions} are:

195: \begin{itemize}\parskip=-0.05in

196: \item We provide the first near optimum algorithm for the wavelet

197:   synopsis construction problem. The algorithm naturally extends to

198:   multiple dimensions.

199: \item For the restricted case, \cite{GG02} provided approximation algorithms; however, the space constraints were obeyed only in expectation. The results for (optimum) algorithms with strict space bounds are

200: \footnote{In \cite{muthu-wave} the space bounds are not explicitly provided, but

201: the total space appears to be $O(n^2B/\log B)$ as well. The authors of \cite{yossi1} consider the same problem for a

202:   non-Haar basis, and their work is excluded from the discussion here.}:

203: \begin{center}

204: {\small

205: \begin{tabular}{|c|c|c|c|c|}

206: \hline

207: Paper &  Error & Time & Space & Working Space \\

208: \hline

209: \hline

210: \cite{GK04} & $\ell_\infty$ & $ O(n^2B \log B) $ & $O(n^2B)$ & $O(nB)$\\

211:          & $\ell_k$ & $ O(n^2B^2) $ & $O(n^2B)$ & $O(nB)$\\

212: \hline

213: \cite{muthu-wave}

214:

215: & (weighted) $\ell_k$ & $O(\frac{n^2B}{\log B}) $ & ? &  ? \\

216: \hline

217: \hline

218: {\bf This Paper} & (weighted) $\ell_\infty$& $O(n^2)$ & $O(n+B\log (n/B))=O(n)$ & $O(n+B\log (n/B))=O(n)$\\

219: & (weighted) $\ell_k$& $O(n^2 \log B)$ & $O(n)$ & $O(n)$\\

220: \hline

221: \end{tabular}

222: }

223: \end{center}

224: \cite{GK04} also provided approximation algorithms for multiple dimensions.

225: Our techniques extend to this context as well, and improve the

226: running time and space by almost a factor of $B$.

227:

228: \item We improve several histogram construction algorithms, e.g.,

229:   V-Opt histograms, range query histograms, by simultaneously

230:   achieving the best known running time and space bounds. The results

231:   and {\em a table comparing the results} are presented in

232:   Section~\ref{sec:hist}. Due to lack of space we omit the improvements

233:   for the range query histograms, which are similar.

234: \item We believe the space efficient paradigm is applicable to other

235:   dynamic programs as well, and we demonstrate the improvements in

236:   the case of Extended Wavelets in Section~\ref{sec:ext}.

237: \end{itemize}

238:

239:

240:

241: \section{The Restricted (Haar) Wavelet Synopsis construction Problem}

242: \label{sec:old}

243: We will work with {\bf non-normalized} wavelet transforms where the

244: inverse computation is simply adding the coefficients that affect a

245: coordinate\footnote{For normalized wavelets the normalization

246:   constant appears in both the forward and inverse transforms. All the

247:   results in the paper carry over to that setting as well, with

248:   the introduction of the normalization constants at several places.}.

249: The wavelet basis vectors are defined as (assume $n$ is a power of $2$):

250: \[ \begin{array}{rll}

251:   V_0(j) = & 1 & \mbox{~~for all $j$}\\

252:   V_{2^s+t-1}(j) = & \left\{ \begin{array}{l}

253:           1 \\

254:           -1 \end{array} \right. &

255:           \begin{array}{l}

256:           \mbox{for $(t-1)\frac{n}{2^s}+1 \leq j \leq \frac{tn}{2^s} - \frac{n}{2^{s+1}}$} \\

257:           \mbox{for $\frac{tn}{2^s}- \frac{n}{2^{s+1}}+1 \leq j \leq \frac{tn}{2^s} $} \\

258: \end{array} \hspace{0.5in} (1\leq t\leq 2^s, 0\leq s < \log n)

259: \end{array}

260: \]

261: The above definitions ensure $\wai(Z) = \sum_i Z_i V_i$. To compute

262: $\wa(X)$, the algorithm computes the average

263: $\frac{x_{2i+1}+x_{2i+2}}{2}$ and the difference

264: $\frac{x_{2i+1}-x_{2i+2}}{2}$ for each pair of consecutive elements as

265: $i$ ranges over $0,1,2,\ldots,\frac{n}{2}-1$. The difference coefficients form the

266: last $n/2$ entries of $\wa(X)$. The process is repeated on the $n/2$

267: average coefficients - {\em their difference coefficients yield the

268:   $n/4+1,\ldots,n/2$'th coefficients of $\wa(X)$}. The process stops

269: when we compute the overall average, which is the first element of

270: $\wa(X)$. The wavelet basis functions naturally form a complete binary

271: tree since their support sets are nested and are of size powers of $2$

272: (with one additional node as a parent of the tree, see

273: Figure~\ref{fig:one}).  The $x_j$ correspond to the leaves, denoted by

274: boxes, and the coefficients correspond to the non-leaf nodes of the

275: tree.  This tree of coefficients is termed as the error tree

276: (following \cite{GK04}).  Likewise assigning a value $c_i$ to the

277: coefficient corresponds to assigning $+c_i$ to all leaves $j$ that are

278: {\bf left descendants} (descendants of the left child) and $-c_i$ to

279: all right descendants.  The leaves that are descendants of a

280: coefficient are termed as the {\bf support} of the coefficient.

281: Recall the {\bf Restricted} (Haar) Wavelet construction problem:

282: given a set of $n$ numbers $X=x_1,\ldots,x_n$, we

283: seek to choose at most $B$ terms from the wavelet representation

284: $\wa(X)$ of $X$, denoted by $\zr$, such that a (weighted) $\ell_k$

285: norm of $X - \wai(\zr)$ is minimized.
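
As a sanity check of the definitions, here is a worked example (our own arithmetic) on the data $X=\{1,2,3,7\}$ used in the introduction. The first pass over consecutive pairs gives
\[ \mbox{averages: } \tfrac{1+2}{2}=1.5,\ \tfrac{3+7}{2}=5 \qquad \mbox{differences: } \tfrac{1-2}{2}=-0.5,\ \tfrac{3-7}{2}=-2 . \]
The second pass, on the averages $(1.5,5)$, gives the average $\tfrac{1.5+5}{2}=3.25$ and the difference $\tfrac{1.5-5}{2}=-1.75$, so $\wa(X)=\{3.25,-1.75,-0.5,-2\}$. Listing the four basis vectors in the order of these coefficients, the inverse transform is the plain sum
\[ 3.25\,(1,1,1,1) - 1.75\,(1,1,-1,-1) - 0.5\,(1,-1,0,0) - 2\,(0,0,1,-1) = (1,2,3,7) , \]
which recovers $X$.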

286:

287:

288: \begin{figure}

289: \begin{center}

290: \begin{tabular}[t]{cc}

291: \begin{minipage}{2in}

292: \centerline{\psfig{figure=a.eps,width=1.8in}}

293: \end{minipage}

294: & \framebox{

295: \small

296: \begin{minipage}{4in}

297: At each internal node $i$ to compute $E[i,b,S]$:

298: \begin{itemize}\parskip=-0.05in

299: \item We determine if we are choosing the coefficient $i$.

300: \item Assuming we are, we decide how the remaining $b-1$ coefficients are to be allocated between the two subtrees. If the children are $i_L$ and $i_R$, we are interested in

301: \vspace{-0.1in}

302: \[ \min_{b'} E[i_L,b',S\cup\{i\}]+ E[i_R,b-1-b',S\cup\{i\}] \]

303: \item Assuming that we do not choose $i$ we are interested in a similar expression giving the overall minimization to be

304: \vspace{-0.1in}

305: \[

306: \min \left \{ \begin{array}{l} \min_{b'} E[i_L,b',S\cup\{i\}]+ E[i_R,b-1-b',S\cup\{i\}] \\

307: \min_{b'} E[i_L,b',S]+ E[i_R,b-b',S] \end{array} \right.

308: \]

309: \end{itemize}

310: \end{minipage}

311: } \\

312: (a) & (b)

313: \end{tabular}

314: \caption{The Error Tree and the previous algorithm. \label{fig:one} }

315: \end{center}

316: \end{figure}

317:

318:

319: \subsection{Reviewing Previous Algorithm(s)}

320:

321: It is immediate that the value of $\wai(\zr)_j$ is fixed by the choices of

322: all coefficients $i$ such that $j$ belongs to the support of $i$.

323: Suppose $S$ is a subset of the ancestors of a coefficient $i$.  Thus a

324: natural dynamic program emerges where we define $E[i,b,S]$ to be {\em

325:   the minimum contribution to the error from all $j$ in the support of $i$, such

326:   that exactly $b$ coefficients that are descendants of $i$ are chosen

327:   along with the coefficients of $S$.} The algorithm is given in Figure~\ref{fig:one}(b). Clearly the number of entries in the array $E[]$ is $Bn$ times $2^r$

328: where $r$ is the maximum number of ancestors of any node. It is easy

329: to see that $r=\log n +1$ and thus the number of entries is $O(n^2B)$.

330: For $\ell_1$ measure we need to spend $O(B)$ time in the minimization

331: giving a running time of $O(n^2B^2)$. For $\ell_\infty$, we may

332: perform binary search and only need $\log B$ time (see \cite{GK04}).

333:

334: \subsection{A Simple Improvement}

335:

336: \begin{observation}

337:   A node $i$ at level $t_i$ can have at most $2^{t_i}-1$ descendants.

338:   Thus $E[i,b,S]$ is meaningful only for $2^{t_i}$ values of $b$

339:   (including $b=0$). Further, the number of nodes at level $t_i$ is

340:   $\lceil\frac{n}{2^{t_i}}\rceil$  and the number of possible subsets of

341:   ancestors of a node is $2^{\log n + 1 - t_i}$.

342: \end{observation}

343:

344: Thus the number of $E[]$ entries to fill corresponding to $i$ is $

345: 2^{\log n + 1 - t_i}\min \{ B, 2^{t_i} \} $. The time taken is

346: $2^{\log n + 1 - t_i}\min \{ B^2, 2^{2t_i} \}$. Thus one way of

347: computing the total time taken is

348: \begin{eqnarray*}

349: & & \sum_{t_i=1}^{\log n} \frac{n}{2^{t_i}} 2^{\log n + 1 - t_i}\min \{ B^2, 2^{2t_i} \} + B^2 = \sum_{t_i=1}^{\log B} \frac{n}{2^{t_i}} 2^{\log n + 1 - t_i} 2^{2t_i}

350: + \sum_{t_i=\log B+1}^{\log n} \frac{n}{2^{t_i}} 2^{\log n + 1 - t_i} B^2

351:  + B^2 \\

352: & & = \sum_{t_i=1}^{\log B} 2n^2 + \sum_{u=1}^{\log n - \log B} \frac{n}{2^{u+\log B}} 2^{\log n + 1 - u - \log B} B^2

353:  + B^2  = \ 2n^2 \log B + 2n^2 \sum_{u=1}^{\log n - \log B} \frac{1}{4^u} + B^2

354: \end{eqnarray*}

355:

356: \noindent which is $O(n^2\log B)$. In case of $\ell_\infty$, the expression

357: $\sum_{t_i=1}^{\log n} \frac{n}{2^{t_i}} 2^{\log n + 1 - t_i}\min \{

358: B\log B, t_i2^{t_i} \} + B^2$ can be shown to be $O(n^2)$ using the

359: same scheme and change of variables as above.

360: %Observe that the above also reduces the total space complexity. But we

361: %omit the discussion since we will achieve a better space bound

362: %anyways.

363:

364: \subsection{The Intuition and the new algorithm}

365: The properties that stand out from the above dynamic program are

366: \begin{itemize}\parskip=-0.05in

367: \item {\em There is no connection between $E[i,b,S]$ and $E[i,b',S']$ as long as $S \neq S'$.}

368: \item {\em We do not need $E[i,b,S]$ while computing $E[i',b',S']$ unless $i$ is a child of $i'$ and either  $S=S'$ or $S=S'\cup\{i'\}$.}

369: \item {\em And finally, there is no need to allocate space for $E[i,b,S]$ while computing $E[i',b',S']$ if $i$ is an ancestor (not a descendant) of $i'$.}

370: \end{itemize}

371:

372: The simplest view of the new algorithm that computes the same table

373: ({\em but it is not stored in entirety at any time}) is a parallel

374: algorithm, where there is a processor at each node of the error tree.

375: The algorithm at a node $i$ with children $i_L,i_R$ can be described as follows:

376: \begin{enumerate}\parskip=-0.05in

377: \item The node $i$ receives $S$ from its parent and seeks to return an

378: array of size $B$ (or less) corresponding to $E[i,b,S]$ for $0\leq b \leq B$.

379: It actually receives

380: \[ v(i,S)=\sum_{i'\in S,i \mbox{ left descendant of }i'} c_{i'} -

381: \sum_{i'\in S,i \mbox{ right descendant of }i'} c_{i'} \]

382:

383: \item To evaluate $\min_{b'} E[i_L,b',S\cup\{i\}]+ E[i_R,b-1-b',S\cup\{i\}] $

384: the node $i$ passes $S\cup\{i\}$ to both of its children, i.e.,

385: $v({i_L},S\cup\{i\})$ and $v(i_R,S\cup\{i\})$.

386: The children return the two arrays of size $B$ (or less),

387: and the $\min_{b'}$ is performed for each $b$. Note that the right child can reuse the same space needed by the left child.

388: \item Now $i$ passes $S$ to the children and asks for $E[i_L,b,S]$ for all $b$ and likewise for $i_R$.

389: \item The node $i$ can now compute all $E[i,b,S]$. The entire time spent at this node is $\min \{ 2^{2t_i},B^2\}$.

390: \item If $i$ is the overall root, then $i$ also performs a minimization over all $b$ to find the solution with {\bf at most $\mathbf B$} coefficients.

391: \end{enumerate}

392:

393: \begin{lemma} No node receives the value $v(i,S)$ twice for the same set $S$.

394: \end{lemma}

395: The above shows that the algorithm is correct and runs in time $O(n^2\log B)$ (and $O(n^2)$ for $\ell_\infty$).

396: The next lemma is also immediate from the description of the algorithm:

397: \begin{lemma}

398: The space required at node $i$ is $\min \{ B , 2^{t_i} \}$, since this space is used for all $S$.

399: \end{lemma}

400: Thus the total space required is $O(B \log (n/B))$ (the last $\log B$ levels use geometrically decreasing space which sums to $O(B)$ and $\log n - \log B = \log (n/B)$).
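
The sum behind this bound can be written out explicitly. This is a sketch under two assumptions of ours: that $n$ and $B$ are powers of $2$, and that the sequential simulation keeps one array per level along the current root-to-leaf path. Then
\[ \sum_{t=1}^{\log n} \min\{B,2^{t}\} = \sum_{t=1}^{\log B} 2^{t} + \sum_{t=\log B+1}^{\log n} B = (2B-2) + B\log (n/B) \leq 2B + B\log (n/B) . \]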

401: Therefore if we consider the algorithm that simulates the parallel algorithm, we can

402: conclude with

403: \begin{theorem}

404: We can compute the {\bf error} of optimum $B$ term wavelet synopsis in time $O(n^2\log B)$ (and $O(n^2)$ for $\ell_\infty$) using overall space $O(n+B \log (n/B))=O(n)$.

405: \end{theorem}

406: Observe that we can only compute the error, and we do not know which coefficients are in the synopsis.

407: \subsection{How do we find the coefficients?}

408: We now show how to retrieve the coefficients after finding the total

409: error.  When we find the optimum error, we also resolve (i) if the

410: topmost coefficient is present or not and (ii) what is the allocation

411: of the coefficients to the left and right children.  Armed with these

412: two pieces of information, {\em we simply recurse/recompute}, i.e., we

413: pass the appropriate set (or $v(i,S)$ values) to the two children and

414: their respective allocations. Each child now finds the total error

415: {\em restricted to its subtree} and each decides on the two pieces of

416: information to set up the recursive game.

417:

418: \noindent{\bf Analysis:} Let the running time of the recompute strategy be $f(n)$. To find the optimum error, we spend $c n^2 \log B $ time and therefore we have the recursion:

419: \[ f(n) = c n^2 \log B + 2f(n/2) \]

420: If we unroll the recursion one step, we see that $f(n) = c n^2 \log B + 2c(n/2)^2 \log B + 4 f(n/4)$. We can immediately observe that we are setting up a geometric sum and we can bound $f(n)$ by $2cn^2 \log B$. Therefore we conclude:

421:

422: \begin{theorem}

423:   We can compute the {\em complete solution, i.e., total error and the

424:     stored coefficients} of the optimum $B$ term wavelet synopsis in

425:   $O(n^2\log B)$ time ($O(n^2)$ for $\ell_\infty$) using overall space

426:   $O(n + B \log (n/B))$.

427: \end{theorem}

428:

429: {\bf Caveat:} We have to be careful and ensure that when we output the

430: coefficients recursively, we output all the coefficients of the first

431: half before outputting all the coefficients of the next half. In the

432: process, we need to remember the partition of the buckets, the

433: parameter $b'$, for $\log n$ levels. But since we have to remember

434: only one number per level, the total space is $O(n+B\log (n/B)+\log n)=O(n+B\log

435: (n/B))$.

436:

437: \section{Unrestricted Wavelet Synopsis construction Algorithms}

438: \label{sec:new}

439: We now show how to obtain an approximation algorithm for the

440: general/unrestricted wavelet synopsis construction problem. We focus

441: our attention on $\ell_k$ error and indicate the changes necessary for

442: the weighted case appropriately. Recall that the Wavelet synopsis

443: problem is: Given a set of $n$ numbers $X=x_1,\ldots,x_n$, find a $Z

444: \in \real^n$ with at most $B$ non-zero entries such that $\| X -

445: \wai(Z) \|_k$ is minimized.

446:

447: The following will be an important observation leading towards a

448: suitable algorithm: {\em If we observe the previous algorithm based on

449:   assigning a processor to each coefficient in the error tree, we

450:   immediately observe that if for different subsets of ancestors, we

451:   receive the same value, i.e., $v(i,S)=v(i,S')$ for $S'\neq S$, we

452:   need not redo the computation.}  {\bf Note} that the savings cannot

453: be guaranteed and in order to achieve the savings we have to increase

454: the space bound.

455:

456: \paragraph{Overview:} The above will form a kernel of our algorithm for the

457: (unrestricted) wavelet synopsis construction problem. We would

458: actually perform the computation {\em for all possible, anticipated

459:   values of $v(i,S)$. However, non-zero elements of $Z$ can have any

460:   real value and it is not clear how to restrict the set of values.}

461:

462: In what follows, we first describe the algorithm assuming that the

463: wavelet coefficients belong to a set of anticipated values $R$.

464: Subsequently we describe how to

465: determine $R$ and more importantly, bound $|R|$.

466:

467:

468: \subsection{The Algorithm}

469: \begin{defn}

470: Let $E[i,v,b]$ be the minimum possible contribution to the overall

471: error from all descendants of $i$ using exactly $b$ coefficients, under the

472: assumption that the combined value of all ancestors chosen is $v$.

473: \end{defn}

474:

475: The overall answer is clearly $\min_b E[root,0,b]$. A natural dynamic

476: program is immediate: to compute $E[i,v,b]$, if we decide the best

477: choice is to allocate $b'$ coefficients to the left and let the

478: $i^{th}$ coefficient be $r$, then we need to add $E[i_L,v+r,b']$ and

479: $E[i_R,v-r,b-b'-1]$. The overall algorithm is:

480:

481: \begin{enumerate}\parskip=-0.05in

482: \item The number of values of $b$ that are relevant to $i$ is $\min\{ B,2^{r_i} \}$, where $r_i$ denotes the level of node $i$.

483: The node receives the $E[i_L,v',b'],E[i_R,v'',b'']$ from its children.

484: \item A non-root node computes $E[i,v,b]$ as follows:

485: \vspace{-0.05in}

486: \[ E[i,v,b] = \min \left \{ \begin{array}{ll}

487: \min_{r,b'} E[i_L,v+r,b'] + E[i_R,v-r,b-b'-1] & \mbox{~~~$i^{th}$ coefficient is $r$} \\

488: \min_{b'} E[i_L,v,b'] + E[i_R,v,b-b'] & \mbox{~~~$i^{th}$ coefficient not chosen}

489: \end{array} \right.

490: \]

491: \item  If $i$ is the root, then $i$ computes

492: \vspace{-0.05in} \[

493: \min_b \left \{ \begin{array}{ll}

494: \min_{r,b'} E[i_L,r,b'] + E[i_R,r,b-b'-1] & \mbox{~~~root coefficient is $r$}\\

495: \min_{b'} E[i_L,0,b'] + E[i_R,0,b-b'] & \mbox{~~~root coefficient not chosen}

496: \end{array} \right.

497: \]

498: \end{enumerate}

499:

500: Note that the root can figure out (i) the optimum error (ii) if any

501: coefficient corresponding to it is chosen and (iii) the value $r$ of

502: the coefficient. After the final solution is computed, we apply the

503: recompute strategy, and each node in the tree finds out if it has a

504: coefficient in the answer and its value. The running time is

505: \[ \sum_i |R| \min \{ 2^{r_i},B \} \cdot

506: |R| \min \{ 2^{r_i},B \} = \sum_{t} |R|^2 \frac{n}{2^t} \min \{ 2^{2t},B^2 \} = O(|R|^2 nB)

507: \]

508: \vspace{-0.1in}

509:

510: For $\ell_\infty$ the bound is $\sum_{t} |R|^2 \frac{n}{2^t} \min \{

511: t2^{t},B \log B \} = O(n|R|^2\log^2 B)$. The required space can be

512: shown to be $O(|R|B\log (n/B))$ by ensuring that the computation resembles a

513: post-order traversal of the tree and that we do not keep the tables of the

514: children nodes once we are done with them. Thus for each level we may need at

515: most $2$ tables of size $|R| \min\{B,2^\ell\}$, which sums to the above.
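
For completeness, the sum in the $\ell_k$ running time bound can be evaluated by splitting it at level $\log B$. This is a sketch assuming, for simplicity, that $n$ and $B$ are powers of $2$:
\[ \sum_{t=1}^{\log n} \frac{n}{2^t}\min\{2^{2t},B^2\} = n\sum_{t=1}^{\log B} 2^{t} + nB^2\sum_{t=\log B+1}^{\log n} 2^{-t} \leq 2nB + nB = 3nB , \]
where the second sum is at most $2^{-\log B}=1/B$. Multiplying by $|R|^2$ gives a running time of $O(|R|^2nB)$.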

516:

517: \subsection{Computing $R$}

518: \begin{lemma}

519: \label{poo}

520: If the $\max_i |x_i|$ is $M$ then $\max_i |\wa(X)_i| \leq M$.

521: \end{lemma}

522: \begin{proof}

523:   The $1^{st}$ coefficient is the average of all values and therefore

524:   cannot exceed $M$.  Every other coefficient is half the average value of

525:   left half (of the support) minus half the average value of right half.

526:   Each cannot be more than $M$ in absolute value.

527: \end{proof}

528: \begin{lemma}

529: \label{boo}

530:   If the optimum solution is $Z^*$ then $\max_i |Z^*_i| \leq 2n^{\frac1k}M$.

531: \end{lemma}

532: \begin{proof}

533:   If $\max_i |\wai(Z^*)_i| \geq 2n^{\frac1k}M$ then $\|X - \wai(Z^*)\|_k \geq \|\wai(Z^*)\|_k - \|X\|_k $ and

534: \[ \|\wai(Z^*)\|_k - \|X\|_k \geq  \|\wai(Z^*)\|_k - Mn^{\frac1k} \geq \max_i |\wai(Z^*)_i| - Mn^{\frac1k} \geq Mn^{\frac1k} \geq \|X\|_k\]

535: The all zero solution is a better solution, which is a contradiction.

536: Now we apply Lemma~\ref{poo} and get $\max_i

537:   |\wa(\wai(Z^*))_i| = \max_i |Z^*_i| \leq 2n^{\frac1k}M$, which proves the lemma.

538: \end{proof}

539: In case of weighted $\ell_k$ the above is modified to $\max_i |Z^*_i|

540: \leq 2n^{\frac1k}M \frac 1 {\min_i \pi_i}$.

541: The next lemma follows from triangle inequality.

542: \begin{lemma}

543: If we round each non-zero value of the optimum $Z^*$ to the nearest multiple

544: of $\delta$ thereby obtaining $\hat{Z}$, then $\| X - \wai(\hat{Z})\|_k \leq  \| X - \wai(Z^*)\|_k + \delta n^{\frac1k}$ and $|R| \leq \frac{2n^{\frac1k}M}{\delta}$.

545:

546: \end{lemma}

547:   Therefore if we set $\delta = \epsilon M/n^{\frac1k}$ we obtain an additive approximation of $\epsilon M$ with $|R|  = O(n^{\frac2k}\epsilon^{-1})$.

548: Therefore we conclude the following:

549: \begin{theorem}

550:   We can solve the Wavelet Synopsis Construction problem with $\ell_k$

551:   error with an additive approximation of $\epsilon M$ where $M=\max_i

552:   |x_i|$ in time $O(n^{1+\frac4k}B\epsilon^{-2})$ and space

553:   $O(n+n^{\frac2k}\epsilon^{-1} B \log (n/B))$. For $\ell_\infty$ the running

554:   time is $O(n\epsilon^{-2}\log^2 B)$.

555: \end{theorem}
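
The parameters in the theorem follow by substitution; the steps below are our own restatement of the preceding lemmas. With $\delta = \epsilon M/n^{\frac1k}$, the rounding loss and the size of $R$ are
\[ \delta n^{\frac1k} = \epsilon M , \qquad |R| \leq \frac{2n^{\frac1k}M}{\delta} = \frac{2n^{\frac2k}}{\epsilon} , \]
and the running time bound of $O(|R|^2nB)$ becomes
\[ O\!\left( \frac{n^{\frac4k}}{\epsilon^{2}} \cdot nB \right) = O(n^{1+\frac4k}B\epsilon^{-2}) . \]
For $\ell_\infty$ the factor $n^{\frac1k}$ is replaced by $1$, so $|R| = O(\epsilon^{-1})$ and the bound $O(n|R|^2\log^2 B)$ becomes $O(n\epsilon^{-2}\log^2 B)$.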

556:

557: \section{The theme of space efficiency and applications}

558:

559: A natural paradigm emerges from inspecting the above:

560: {\em If we can

561: compute the total error and the best way to partition the problem into

562: two halves of $\frac{n}2$ elements, we do not need to store the entire

563: dynamic programming table} -- {\em and thereby save space.}

564: If we can compute the

565: overall error in time $f(n)=An^\alpha$ where $A$ is independent of

566: $n$, then the time taken by the {\em Recompute} strategy is

567: $g(n)=f(n)+2g(n/2)$. The solution to the recurrence is

568: $O(An^\alpha)$ if $\alpha>1$ and $O(An\log n)$ if $\alpha=1$.
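
The solution of the recurrence can be verified by unrolling it. This is a sketch assuming $n$ is a power of $2$ and taking $g(1)=f(1)$:
\[ g(n) = \sum_{j=0}^{\log n} 2^{j} f\!\left(\frac{n}{2^{j}}\right) = An^\alpha \sum_{j=0}^{\log n} 2^{j(1-\alpha)} . \]
If $\alpha>1$ the sum is geometric and bounded by $\frac{1}{1-2^{1-\alpha}}$, giving $g(n)=O(An^\alpha)$. If $\alpha=1$ each of the $\log n+1$ terms equals $1$, giving $g(n)=O(An\log n)$.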

569:

570:

571: We demonstrate the above idea in two examples. First, we show its

572: impact in space efficient V-Opt histogram construction.  Second, we

573: show the applicability in a new synopsis technique, {\em Extended

574:   Wavelets}.

575:

576: The idea also improves several results on range query histograms --

577: however those algorithms are quite similar in spirit to the V-Opt

578: histogram construction and we relegate the discussion to a fuller

579: version of the paper. However the idea does help in reducing the space

580: bound across the board -- in fact, for a large variety of problems it

581: is immediate that the distinction between the {\em working space}, the space

582: necessary to compute the {\em value} of the final answer, and the total space is no longer

583: needed: we can compute the entire answer in the

584: aforementioned working space.

585:

586: \subsection{V-Opt Histograms}

587: \label{sec:hist}

588: The V-Opt histogram is a classic problem in synopsis construction.

589: Given a set of $n$ numbers $X=x_1,\ldots,x_n$ the problem seeks to

590: construct a $B$ piecewise constant representation $H$ such that $\|X -

591: H \|_2$ (or its square) is minimized.  Since their introduction in

592: query optimization in \cite{I93}, and subsequently in

593: approximate query answering (\cite{Aqua}, among others), histograms

594: have accumulated a rich history \cite{I03}.  Several different

595: optimization criteria have been proposed for histogram construction,

596: e.g., $\ell_1$, relative error, $\ell_\infty$, to name a few.  However

597: most of them are based on a dynamic program similar to the V-Opt case.

598: Thus the V-Opt histograms provide an excellent foil to discuss all of

599: the measures at the same time.  As mentioned in the introduction,

600: \cite{Jag98} gave an $O(n^2B)$ time algorithm to find

601: the optimum histogram using $O(nB)$ space. They observed that the

602: space could be reduced to $O(n)$ at the expense of increasing the

603: running time to $O(n^2B^2)$.  The data stream algorithms\footnote{Note

604:   that by the streaming model we refer to the ``sorted'' or

605:   ``aggregate'' model, most useful in time series data, where the

606:   input is $x_i$ in increasing order of $i$. Only \cite{GGIKMS02}

607:   applies to the general ``turnstile'' or ``update'' model, but seems

608:   to have high polynomial dependence on $B\epsilon^{-1}\log n$. See

609:   \cite{muthu-survey,pods02} for more details on data stream models.}

610: of \cite{GKS01} (extended in \cite{GKS04}) represent sparse dynamic

611: tables -- but the space is still $\tilde{O}(B^2)$, quadratic in $B$.

612: In those algorithms the $\tilde{O}(B^2)$ space performs a double role

613: of storing the coefficients as well as maintaining a frontier.

614:

615: This is somewhat remedied in \cite{GIMS02,MS04}, where a robust

616: wavelet representation of $\tilde{O}(B)$ coefficients is constructed

617: and then a dynamic program in the fashion of \cite{Jag98} or \cite{GKS01}

618: restricted to the {\em endpoints of the support regions} is used.

619: The dynamic program of \cite{Jag98} can be used to compute the answer

620: in $\tilde{O}(B)$ space, but with an extra factor of $B$ in

621: running time. Therefore, irrespective of offline or streaming computation

622: there was a tradeoff between large space and an increased running time

623: -- this is {\em the penalty} referred to in the introduction.

624: This is the first paper which removes that penalty and gives an algorithm that

625: simultaneously achieves the best known space and time bounds.

626:

627: \begin{center}

628: {\small

629: \begin{tabular}{|c|c|c|c|c|c|}

630: \hline

631: Paper & Stream & Factor & Time & Space &  Working space  \\

632: \hline

633: \hline

634: \cite{Jag98} & No & Opt & $O(n^2B)$ & $O(nB)$ & $O(n)$ \\

635: & &  & $O(n^2B^2)$ & $O(n)$ & $O(n)$ \\

636: \hline

637: \cite{GKS01} & Yes & $(1+\epsilon)$ & $O(nB^2\epsilon^{-1}\log n)$ & $O(B^2\epsilon^{-1} \log n)$ & --\\

638: \hline \cite{GIMS02} & Yes & $(1+\epsilon)$ & $O(n+B^3 \epsilon^{-8} \log^4 n)$ & $O(B^2\epsilon^{-4} \log^2 n)$ & -- \\

639: & & &  $O(n+B^4 \epsilon^{-8} \log^4 n)$ & $O(B \epsilon^{-4} \log^2 n)$ & --\\

640: \hline

641: \cite{newver}& No & $(1+\epsilon)$ & $O(n + B^3(\epsilon^{-2} + \log n) \log n)$ & $O(n + B^2\epsilon^{-1})$ & $O(n + B\epsilon^{-1})$ \\

642: & Yes &  & $O(n+ (n/M)B^3 \epsilon^{-2} \log^3 n)$ & $O(M + B^2\epsilon^{-1} \log n) $ & -- \\

643: \hline \cite{MS04} & Yes & $(1+\epsilon)$ & $O(n+B^3 \epsilon^{-3} (\log 1/\epsilon) \log n)$ & $O(B\epsilon^{-2} (\log 1/\epsilon) \log n + B^2/\epsilon)$ & -- \\

644: & & &  $O(n+B^4 \epsilon^{-3} (\log 1/\epsilon) \log n)$ & $O(B \epsilon^{-2} (\log 1/\epsilon) \log n + B/\epsilon)$ & --\\

645: \hline

646: \hline

647: {\bf This Paper} & No & Opt & $O(n^2B)$ & $O(n)$ & $O(n)$ \\

648: & No &  $(1+\epsilon)$ & $O(n + B^3(\epsilon^{-2} + \log n) \log n)$ & $O(n + B\epsilon^{-1})$ & $O(n + B\epsilon^{-1})$ \\

649: & Yes & $(1+\epsilon)$ & $O(n+B^3 \epsilon^{-3} (\log 1/\epsilon)\log n)$ & $O(B\epsilon^{-2} (\log 1/\epsilon) \log n + B/\epsilon)$ & -- \\

650: \hline

651: \end{tabular}

652: }

653: \end{center}

654:

655:

656: \paragraph{Algorithm idea:} Due to lack of space, we indicate the

657: modification to the optimum algorithm. The modifications to the

658: approximation and streaming algorithms are similar. The optimal

659: algorithm maintains $E[i,b]$ which is the

660: minimum error of expressing the interval $[1,i]$ by at most $b$

661: buckets (intervals where the representation is constant).

662: A natural dynamic program arises: $E[i,b] = \min_{j<i}

663: E[j,b-1] + e(j+1,i)$ where $e(j+1,i)$ is the minimum error of a single

664: bucket covering $x_{j+1},\ldots,x_i$\footnote{It is straightforward to show that the minimum error

665:   is achieved by the mean of $x_{j+1},\ldots,x_i$.}.  The running time

666: is $O(n^2B)$. If we are interested in computing only the final answer,

667: there is an $O(n)$ space algorithm which computes $E[i,1]$ for all $i$, and then

668: extends that to $b=2,3,$ etc.
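
One standard way to evaluate the single-bucket error in $O(1)$ time, which the $O(n^2B)$ bound relies on, uses prefix sums. The following is a sketch; the arrays $S$ and $Q$ are notation introduced only for this illustration. Let $S_i=\sum_{l\leq i} x_l$ and $Q_i=\sum_{l\leq i} x_l^2$, with $S_0=Q_0=0$. Since the best representative of a bucket is its mean,
\[
e(j+1,i) = \sum_{l=j+1}^{i}\left(x_l - \frac{S_i-S_j}{i-j}\right)^2
 = (Q_i-Q_j) - \frac{(S_i-S_j)^2}{i-j} .
\]
Both arrays take $O(n)$ space and are computed in a single pass over the input.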

669:

670: If $i>\frac n 2$ we maintain $A[i]$ to be the starting point of the

671: bucket that contains $x_{\frac{n}2}$ in the best representation of $[1,i]$

672: by $b$ buckets, $B[i]$ to be the ending point

673: of that bucket, and $C[i]$ to be the number of buckets used before $A[i]$.

674: This requires $O(n)$ space, and is updated as shown in Figure~\ref{fig:opt}. Now, after we compute

675: $E[n,B]$ we can divide the problem into two parts, representing $[1,A[n]-1]$ using

676: $C[n]$ buckets and $[B[n]+1,n]$ by $B - C[n] - 1$ buckets. {\em Note that each subproblem is defined on at most $\frac{n}{2}$ elements}. Therefore the {\em Recompute strategy} will run in time $O(n^2B)$ as well and compute all the coefficients.
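
The $O(n^2B)$ bound for the whole recursion can be checked as follows (a sketch; the constant $c$ and the budgets $b_s$ are notation introduced only for this calculation). Suppose the dynamic program on $m$ elements with $b$ buckets takes at most $cm^2b$ time. At depth $t$ of the recursion every subproblem $s$ has at most $n/2^t$ elements, and the bucket budgets satisfy $\sum_s b_s \leq B$, since the two parts of a subproblem share the buckets of their parent. Hence the total time is at most
\[
\sum_{t\geq 0}\sum_{s} c\left(\frac{n}{2^t}\right)^2 b_s
\leq cn^2B\sum_{t\geq 0} 4^{-t} = \frac43 cn^2B .
\]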

677:

678: \begin{figure}[htbp]

679: \begin{center}

680: \framebox{\small

681: \begin{minipage}{5.5in}

682: \begin{tabbing}11111\=111\=111\=111\=111\=111\=111\=111\=111\=111\=111\kill

683: \bcc \> $A[i]=0$ if $i\leq \frac{n}2$ and $1$ otherwise. $B[i]=0$ if

684:   $i\leq \frac{n}2$ and $i$ otherwise. $C[i]=0$ for all $i$.\\

685: \icc \> For $b=2$ to $B$ do \\

686: \icc \> \> For $i=2$ to $n/2$ do \\

687: \icc \> \> \> $E[i,b]=\min_{j<i} E[j,b-1] + e(j+1,i)$ \\

688: \icc \> \> For $i=n/2+1$ to $n$ do \\

689: \icc \> \> \> $E[i,b]=\min_{j<i} E[j,b-1] + e(j+1,i)$ \\

690: \icc \> \> \> If $j$ (which achieved the minimum) $\leq \frac{n}2$ then $newA[i]=j+1,newC[i]=b-1,newB[i]=i$.\\

691: \icc \> \> \> else $newA[i]=A[j],newB[i]=B[j],newC[i]=C[j]$; \\

692: \icc \> \> $A \leftarrow newA, B \leftarrow newB, C \leftarrow newC$. \\

693: \icc \> Recurse using $A[n],B[n],C[n]$ to compute the coefficients.

694: \end{tabbing}

695: \end{minipage}

696: }

697: \end{center}

698: \vspace{-0.2in}

699: \caption{The $O(n)$ space optimum algorithm\label{fig:opt}}

700: \end{figure}

701:

702: Observe that we have kept the $E[j,b-1],E[i,b]$ notation, but we can

703: reuse two arrays of size $n$ for this purpose (and keep switching them

704: as $newE,E$ etc.) -- the overall space required is $O(n)$.  We now

705: know the final solution $E[n,B]$ and how to partition the problem.

706: For the {\em offline approximation algorithm}, when we recurse, we have to

707: add the approximate error $E'[B[i]+1,C[i]+1]$ to all the elements on

708: the right subproblem (since we build histograms with error increasing by a $(1+\epsilon)$ factor, this ``shift'' is needed). Due to lack of space, the details are relegated

709: to the full version.

710:

711: \subsection{Extended Wavelets}

712: \label{sec:ext}

713: Extended wavelets were introduced in

714: \cite{DR03}. The central idea is that in case of multi-dimensional

715: data, there can be significant saving of space if we use a

716: non-standard way of storing the information. There are several

717: standard ways of extending 1-dimensional (Haar) wavelets to multiple

718: dimensions. The wavelet basis corresponds to high-dimensional squares.

719: But irrespective of the number of dimensions, the format of the

720: synopsis is a pair of numbers {\em (coefficient index,value)}.

721: In Extended Wavelets we perform wavelet decomposition independently in

722: each dimension but then

723: %If we wish to store the coefficient $i$ in all the

724: %dimensions, we can store $i$ followed by a list of the values. This

725: %would use roughly half the space to store each coefficient (since we

726: %are storing 1 number per dimension).

727: we store tuples consisting of

728: the coefficient index, a bitmap indicating the dimensions for

729: which the coefficient in that dimension is chosen, and a list of

730: values. Since the coefficient number and the bitmap are shared across

731: the coefficients, we can store more coefficients than a simple union

732: of unidimensional transforms.

733:

734: Notice that there is no interaction between the benefits of storing

735: coefficient $i$ and $i'$. The problem reduces naturally to a {\em

736:   Knapsack} problem with a twist that each item (coefficient $i$) can

737: be present in varying sizes (how many values corresponding to

738: different dimensions are stored). However the variant also has a

739: simplifying feature that the space bound is polynomially bounded,

740: therefore allowing a simple dynamic program. The program estimates

741: $E[i,b]$ which indicates the minimum error on using {\em at most} $b$

742: space and storing only a subset of the first $i$ coefficients.

743:

744: The idea is relatively new, and it remains to be seen if Extended

745: wavelets are applied widely. But it is an intriguing and novel idea in

746: synopsis construction and serves as an example of the broad

747: applicability of the ideas in this paper.

748: This paper also gives the first (almost) linear ($O(B)$, ignoring $M$) space

749: algorithm in the streaming (as well as offline) model.  We present the

750: results on the optimum algorithms below\footnote{The input is $n$

751:   tuples in $M$ dimensions and the total synopsis size is $B$.  The

752:   papers \cite{DR03,GKS04} contain other approximation algorithms that

753:   are not relevant to our context. The extended version of \cite{GKS04} reduces

754:   Extended Wavelets to a problem similar to V-Opt histogram

755:   construction and gives an $O(NM)$ time algorithm using dynamic

756:   programming.  The ideas of this paper naturally imply improvements

757:   to the space requirement under the assumption that $B \ll NM$.  The

758:   reduction is somewhat detailed and is omitted in this draft.}.

759:

760: \begin{center}

761: {\small

762: \begin{tabular}{|c|c|c|c|c|}

763: \hline

764: Paper & Stream &  Time & Space & Working Space \\

765: \hline

766: \hline

767: \cite{DR03} & No & $ O(nMB) $ & $O(nMB)$ & $O(nM+MB)$\\

768: \hline

769:  \cite{GKS04} & Yes & $O(nMB)$ & $O(MB+B^2)$ & $O(MB+B^2)$ \\

770: & & $O(nM \log M +B^2M^2)$ & $O(MB+B^2)$ & $O(MB+B^2)$ \\

771:  \hline

772: \hline

773: {\bf This Paper} & Yes & $O(nM \log M + B^2 \log M )$ & $O(MB)$ & $O(MB)$\\

774: \hline

775: \end{tabular}

776: }

777: \end{center}

778:

779:

780:

781:

782: \paragraph{Algorithm Idea:} We follow the previous algorithms and introduce a

783: few small changes and a more careful analysis. For each item $i$ we

784: compute the best profit if $i$ is allocated size $j$. This is done in

785: time $O(nM\log M)$ as in \cite{GKS04}. For each $1\leq j\leq M$ we

786: maintain the top $B/j$ items corresponding to size $j$. For each $j$ we

787: can achieve this in $O(B/j)$ space and $O(n)$ running time (using

788: details from \cite{GIMS02}), using overall $O(nM)$ time and $\sum_j

789: (B/j)j = O(BM)$ space. The optimum answer uses items and sizes from

790: this list only. The total number of item-size pairs is $\sum_j (B/j)

791: = O(B \log M)$.
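
Both sums can be written out explicitly (a sketch; $H_M$ denotes the $M$-th harmonic number, notation introduced here, and rounding of $B/j$ is ignored). The space counts $j$ values for each of the $B/j$ retained items of size $j$, while the number of item-size pairs counts each retained item once:
\[
\sum_{j=1}^{M} \frac{B}{j}\, j = BM
\qquad\mbox{and}\qquad
\sum_{j=1}^{M} \frac{B}{j} = B H_M \leq B(1+\ln M) .
\]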

792:

793: We can sort this list in lexicographic order. %using time $O(B(\log M) \log B)$.

794: Suppose item $i$ has $x_i \geq 1$ occurrences (thus $\sum_i x_i =O(B \log M)$).

795: The dynamic program to extend the answer to $i$ (from the item before

796: $i$) first needs to guess/choose which of the $x_i$ occurrences

797: are used (or none) and compute the best solution for each $B$. The

798: time taken is $c(x_i+1)B$ at $i$, which totals to at most $2cB^2\log M$.

799:

800: We maintain an $O(B)$ size array $P$ where $P[z]$ corresponds to the best

801: profit for space $z$ up to the current $i$.  For {\em space

802:   efficiency}, for $z\geq B/2$ we keep track of $Q[z]$ which contains

803: the triple $\langle i',r,b'\rangle$ s.t. the optimum solution for space

804: $z$ for current $i$ uses space $b' < B/2$ up to $i'$ and a size $r$

805: copy of $i'$ with $b'+r\geq B/2$. In other words, $Q[z]$ records the {\em crossing

806:   point} where we crossed $B/2$ space for that solution (which remains

807: the same even if we extend it later).

808:

809: We now recurse with $b,b' \leq B/2$ on the two parts. Now each item

810: contributes $c(x_i+1)B/2$ adding up to less than $cB^2 \log M$. Once

811: again we have a geometric sum which sums up to $O(B^2\log M)$ for the

812: entire recursion.~\\
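
The geometric sum can be sketched as follows, under the assumption that every item appears in at most one subproblem at each depth of the recursion; $K=\sum_i x_i$ is notation introduced only for this calculation. Since $x_i\geq 1$, the top level costs $\sum_i c(x_i+1)B \leq 2cBK$. At depth $t$ the space budget of every subproblem is at most $B/2^t$, so the total over all depths is at most
\[
\sum_{t\geq 0} \sum_i c(x_i+1)\frac{B}{2^t} \leq 2cBK\sum_{t\geq 0}2^{-t} = 4cBK = O(B^2\log M),
\]
using $K=O(B\log M)$.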

813:

814: \noindent {\bf Acknowledgments:} We would like to thank Hyoungmin

815: Park and Kyuseok Shim for many interesting discussions.

816: {\small

817: \bibliographystyle{plain}

818: \begin{thebibliography}{10}

819:

820: \bibitem{Aqua}

821: S.~Acharya, P.~Gibbons, V.~Poosala, and S.~Ramaswamy.

822: \newblock {The Aqua Approximate Query Answering System}.

823: \newblock {\em Proc. of ACM SIGMOD}, 1999.

824:

825: \bibitem{pods02}

826: B.~Babcock, S.~Babu, M.~Datar, R.~Motwani, and J.~Widom.

827: \newblock Models and issues in data stream systems.

828: \newblock {\em PODS}, pages 1--16, 2002.

829:

830: \bibitem{DR03}

831: A.~Deligiannakis and N.~Roussopoulos.

832: \newblock Extended wavelets for multiple measures.

833: \newblock In {\em SIGMOD Conference}, 2003.

834:

835: \bibitem{GK04}

836: M.~Garofalakis and A.~Kumar.

837: \newblock Deterministic wavelet thresholding for maximum error metric.

838: \newblock {\em Proc. of PODS}, 2004.

839:

840: \bibitem{GG02}

841: M.~N. Garofalakis and P.~B. Gibbons.

842: \newblock Wavelet synopses with error guarantees.

843: \newblock In {\em Proc. of ACM SIGMOD}, 2002.

844:

845: \bibitem{GGIKMS02}

846: A.~C. Gilbert, S.~Guha, P.~Indyk, Y.~Kotidis, S.~Muthukrishnan, and

847:   M.~Strauss.

848: \newblock Fast, small-space algorithms for approximate histogram maintenance.

849: \newblock In {\em Proc. of ACM STOC}, 2002.

850:

851: \bibitem{GIMS02}

852: S.~Guha, P.~Indyk, S.~Muthukrishnan, and M.~Strauss.

853: \newblock Histogramming data streams with fast per-item processing.

854: \newblock In {\em Proc. of ICALP}, 2002.

855:

856: \bibitem{GKS04}

857: S.~Guha, C.~Kim, and K.~Shim.

858: \newblock {XWAVE}: Optimal and approximate extended wavelets for streaming

859:   data.

860: \newblock {\em Proceedings of VLDB Conference}, 2004.

861:

862: \bibitem{GKS01}

863: S.~Guha, N.~Koudas, and K.~Shim.

864: \newblock {Data Streams and Histograms}.

865: \newblock In {\em Proc. of STOC}, 2001.

866:

867: \bibitem{newver}

868: S.~Guha, N.~Koudas, and K.~Shim.

869: \newblock Approximation algorithms for histogram construction problems.

870: \newblock {\em Technical Report, the full version of \cite{GKS01}, available at

871:   http://www.cis.upenn.edu/~sudipto/mypapers/histjour.pdf.gz}, 2004.

872:

873: \bibitem{I93}

874: Y.~E. Ioannidis.

875: \newblock Universality of serial histograms.

876: \newblock In {\em Proc. of the VLDB Conference}, 1993.

877:

878: \bibitem{I03}

879: Y.~E. Ioannidis.

880: \newblock The history of histograms (abridged).

881: \newblock {\em Proc. of VLDB Conference}, pages 19--30, 2003.

882:

883: \bibitem{Jag98}

884: H.~V. Jagadish, N.~Koudas, S.~Muthukrishnan, V.~Poosala, K.~C. Sevcik, and

885:   T.~Suel.

886: \newblock {Optimal Histograms with Quality Guarantees}.

887: \newblock In {\em Proc. of the VLDB Conference}, 1998.

888:

889: \bibitem{yossi1}

890: Y.~Matias and D.~Urieli.

891: \newblock Optimal workload-based wavelet synopses.

892: \newblock {\em TR-TAU}, 2004.

893:

894: \bibitem{MVW98}

895: Y.~Matias, J.~S. Vitter, and M.~Wang.

896: \newblock { Wavelet-Based Histograms for Selectivity Estimation}.

897: \newblock {\em Proc. of ACM SIGMOD}, 1998.

898:

899: \bibitem{muthu-survey}

900: S.~Muthukrishnan.

901: \newblock Data streams: Algorithms and applications.

902: \newblock {\em Survey available on request at {\tt muthu@research.att.com}},

903:   2003.

904:

905: \bibitem{muthu-wave}

906: S.~Muthukrishnan.

907: \newblock Workload optimal wavelet synopsis.

908: \newblock {\em DIMACS TR}, 2004.

909:

910: \bibitem{MS04}

911: S.~Muthukrishnan and M.~Strauss.

912: \newblock Approximate histogram and wavelet summaries of streaming data.

913: \newblock {\em DIMACS TR 52}, 2003.

914:

915: \end{thebibliography}

916:

917: }

918: \end{document}

919: