1: \NeedsTeXFormat{LaTeX2e}[1995/12/01]
2:
3: \documentclass[global]{svjour}
4: \usepackage{epsfig}
5:
6:
7: \def\U{{{\cal U}}}
8: \def\F{{{\cal F}}}
9: \def\S{{{\cal S}}}
10: \newcommand{\eex}{\hspace*{\fill}$~_\triangle$}
11:
12: \title{Enhancing Histograms by Tree-Like Bucket Indices\thanks{An
13: abridged version of this paper appeared in the Proceedings
14: of the International Conference on Data Engineering (ICDE 2002),
15: IEEE Computer Society 2002, ISBN 0-7695-1531-2 \protect\cite{Buccafurri02Improving}}
16: }
17:
18: \author{Francesco Buccafurri\inst{1} \and
19: Gianluca Lax\inst{1} \and
20: Domenico Sacc\`a\inst{2} \and
21: Luigi Pontieri\inst{2} \and
22: Domenico Rosaci\inst{1}
23: }
24:
25: \institute{DIMET Dept., University ``Mediterranea'' of Reggio Calabria, Italy \\
26: \email{\{bucca,lax,domenico.rosaci\}@unirc.it}
27: \and
28: DEIS Dept., University of Calabria, \& ICAR-CNR, Rende, Italy \\
29: \email{pontieri@icar.cnr.it, sacca@unical.it}
30: }
31:
32: %-------------------------------------------------------------------------
33: \begin{document}
34: \maketitle
35: \begin{abstract}
36: Histograms are used to summarize the contents of relations into a
37: number of buckets for the estimation of query result sizes.
38: Several techniques (e.g., MaxDiff and V-Optimal) have been
39: proposed in the past for determining bucket boundaries which
40: provide accurate estimations. However, while search strategies for
41: optimal bucket boundaries are rather sophisticated, not much
42: attention has been paid to estimating queries inside buckets and
43: all of the above techniques adopt naive methods for such an
44: estimation. This paper focuses on the problem of improving the
45: estimation inside a bucket once its boundaries have been fixed.
46: The proposed technique is based on the addition, to each bucket,
47: of 32-bit additional information (organized into a 4-level tree
48: index), storing approximate cumulative frequencies at 7 internal
49: intervals of the bucket. Both theoretical analysis and
50: experimental results show that, among a number of alternative ways
51: to organize the additional information, the 4-level tree index
52: provides the best frequency estimation inside a bucket. The index
53: is later added to two well-known histograms, MaxDiff and
54: V-Optimal, obtaining the non-obvious result that, despite the
55: spatial cost of 4LT, which reduces the number of allowed buckets
56: once the storage space has been fixed, the original methods are
57: strongly improved in terms of accuracy.
58: \end{abstract}
59:
60: \keywords{histograms -- range query estimation -- approximate OLAP}
61:
62: \section{Introduction}
63: A {\em histogram} is a lossy compression technique used for
64: efficiently representing a relation. It is based on the partition
65: of one of the relation attributes into {\em buckets} and the
66: storage, for each of them, of a small amount of summary information
67: in place of the detailed data. Among others, some important examples of
68: application domains of histograms are the estimation of query
69: selectivity
70: \protect\cite{IoPo95,Jagadish98Optimal,Poosala96Improved,Ja*01,Wu03Using},
71: temporal databases, where histograms are used for improving the
72: join processing \protect\cite{Sitzmann00Improving}, statistical
73: databases, where histograms represent a method for approximating
74: probability distributions \protect\cite{Malvestuto93Universal}.
75: Recently, histograms have received renewed interest, mainly
76: because they can be effectively used for approximate query
77: answering in order to reduce the query response time in on-line
78: decision support systems and OLAP \protect\cite{Poosala99Approx},
79: as well as for reconstructing original data from
80: aggregate information \protect\cite{BuFuSa01} and, finally, in the
81: context of data streams
82: \protect\cite{Guha01Data,Babcock02Models,Datar02Maintaining,Guha02Histogramming}.
83:
84: For a given storage space reduction, the problem of determining
85: the best histogram is crucial. Indeed, different partitions lead
86: to dramatically different errors in reconstructing the original
87: data distribution, especially for skewed data. To better explain
88: the problem, consider a typical case of recovering original data
89: from a histogram: the evaluation of range queries. Think of a
90: histogram defined on the attribute $X$ of a relation $R$ as a set
91: of non-overlapping intervals of $X$ covering all values assumed by
92: $X$ in $R$. To each of these intervals, say $B$, the number of
93: occurrences (called {\em frequency}) in $R$, having the value of
94: $X$ belonging to the interval $B$, is associated (and included
95: into a data structure called {\em bucket}). A {\em range query},
96: defined on an interval $Q$ of $X$, evaluates the number of
97: occurrences in $R$ with value of $X$ in $Q$. Thus, buckets embed a
98: set of pre-computed disjoint range queries capable of covering the
99: whole active domain of $X$ in $R$ (by active we mean the
100: attribute values actually appearing in $R$). As a consequence, the
101: histogram does not give, in general, the possibility of evaluating
102: exactly a range query not corresponding to one of the pre-computed
103: embedded queries. In other words, while the contribution to the
104: answer coming from the sub-ranges coinciding with entire buckets
105: can be returned exactly, the contribution coming from the
106: sub-ranges which partially overlap buckets can be only estimated,
107: since the actual data distribution inside the buckets is not
108: available.
109:
110: It turns out that it is convenient to define the boundaries of
111: buckets in such a way that the estimation error of the
112: non-precomputed range queries is minimized (e.g., by avoiding that
113: large frequency differences arise inside a bucket). In other
114: words, among all possible sets of pre-computed range queries, we
115: find the set which guarantees the best estimation of the other
116: (non-precomputed) queries, once a technique for estimating such
117: queries is defined. This issue has been investigated for some
118: decades, and a large number of techniques for arranging histograms
119: have been proposed
120: \protect\cite{Cri81,Cri84,IoPo95,Jagadish98Optimal,Poosala96Improved,DonIoa00,Ja*01}.
121:
122: All these techniques adopt simple methods for estimating
123: non-precomputed queries (actually, their portions partially
124: overlapping buckets). The most significant approaches are the {\em
125: continuous value assumption} (often denoted in this paper by CVA)
126: \protect\cite{Sac79}, where the estimation is made by linear
127: interpolation on the whole domain of the bucket, and the {\em
128: uniform spread assumption} (denoted by USA)
129: \protect\cite{Poosala96Improved}, which assumes that values are
130: located at equal distance from each other so that the overall
131: frequency sum can be equally distributed among them.
132:
133:
134: An interesting problem is understanding whether, by exploiting
135: information typically contained in histogram buckets, and possibly
136: adding a small amount of summary information, the frequency estimation
137: inside buckets, and hence the histogram accuracy, can be improved. This
138: paper focuses on this problem. Starting from the limitations of
139: CVA and USA studied in \protect\cite{BuFuSa01}, we
140: propose to use some additional storage space (32 bits per bucket) in
141: order to describe the distribution inside a bucket in an approximate
142: yet very effective way.
143:
144: The first step is studying how to use the 32 additional bits in
145: order to maximize benefits in terms of accuracy. Our analysis
146: shows that the trivial technique of partitioning the bucket into 8
147: equal-size parts and encoding each corresponding sum by 4 bits
148: leads to high scaling errors, since each sum must be represented
149: as a fraction of the overall sum of the bucket. Our proposal
150: then relies on the idea of storing partial sums internal to the
151: bucket in a hierarchical fashion, using a tree-like index
152: (occupying 32 bits). This way, the sum contained in a given tree
153: node can be represented as a fraction of the sum contained in the
154: parent node, which is a value (reasonably) smaller than the
155: overall sum of the bucket. It turns out that the encoding length
156: may decrease as the level of the tree increases. The benefits we
157: expect by applying this approach concern the scaling error. But a
158: crucial point is to decide how to arrange the tree, that is, how
159: deep the index should go. Of course, the higher the
160: resolution, the larger the number of embedded precomputed range
161: queries (internal to the buckets) is. Hence, we expect better
162: accuracy as the resolution increases. However, increasing
163: resolution reduces the number of bits available for encoding
164: nodes, and, thus, amplifies scaling errors. We study the above
165: trade-off by considering the two possible (from a practical point
166: of view) tree-indices with 32 bits, which we call 3LT and 4LT,
167: with depth 3 and 4, respectively. The analysis leads to the
168: conclusion that the 4LT-index represents the best solution.
169:
170: The next step is then understanding whether this improvement of
171: accuracy for the estimation inside buckets can really give
172: benefits in terms of accuracy of a histogram arranged by one of
173: the existing techniques. This problem is not straightforward:
174: consider, to mention the most evident aspect, that 4LT buckets use 32
175: bits more than CVA ones and thus, for a fixed storage space,
176: allow a smaller number of buckets. The last part of this paper is
177: thus devoted to evaluating the effects of the combination of the 4LT
178: technique with existing methods for building histograms. Through an
179: extensive comparative experimental analysis conducted, for a fixed
180: storage space, over several data sets, both synthetic and
181: real-life, we show that 4LT significantly improves the accuracy of
182: the considered histograms. Therefore this paper, besides giving the
183: specific contribution of proposing a technique (i.e., the 4LT) for
184: estimating accurately range queries internal to buckets, proves
185: the more general result that going beyond classical techniques
186: (i.e., CVA and USA) for the estimation inside buckets may give
187: concrete improvements of histogram accuracy.
188:
189: It is worth noting that the choice of MaxDiff and V-Optimal
190: histograms for testing our method does not limit the generality of
191: the 4LT index, which is applicable to every bucket-based
192: histogram\footnote{There are histograms, like wavelet-based ones,
193: that are not based on a set of buckets.}. Nor does this choice
194: limit the validity of our comparison, since MaxDiff and
195: V-Optimal, despite their age, are still considered by
196: the research community as points of reference due to their
197: accuracy \protect\cite{Ioannidis03History}.
198:
199: The paper is organized as follows. In Section
200: \ref{sec-preliminary}, we introduce some preliminary definitions.
201: The comparison, both experimental and theoretical, among a number
202: of techniques including our tree-based methods (3LT and 4LT) for
203: estimating range queries {\em inside} a bucket is reported in
204: Section \ref{sec-Estimation}. Therein, 3LT and 4LT are also
205: presented. This analysis shows that 4LT has the best
206: performance in terms of accuracy. Thus, 4LT can be combined with
207: every bucket-based histogram for increasing its accuracy. Section
208: \ref{sec-Improved} presents a large set of experiments, conducted
209: by applying 4LT to two well-known methods, {\em MaxDiff} and {\em
210: V-Optimal} \protect\cite{Poosala96Improved}. Results show high
211: improvements in the estimation of range queries w.r.t. the
212: original methods; of course, the comparisons are made at parity
213: of storage consumption so that the revised methods use fewer
214: buckets to compensate for the additional storage for the 4LT indices.
215: The 4LT technique provides good results also when combined with
216: the very simple method {\em EquiSplit}, which consists in dividing
217: the histogram value domain into buckets of the same size so that
218: the bucket boundaries need not be stored, thus obtaining a very
219: high number of buckets at the same compression rate. We draw our
220: conclusions in Section \ref{sec-Conclusion}.
221:
222:
223:
224:
225:
226:
227: \section{Basic Definitions}\label{sec-preliminary}
228:
229: Given a relation $R$ and an attribute $X$ of $R$, a histogram for $R$ on
230: $X$ is constructed as follows. Let $\U = \{u_1, ... , u_m\}$ be
231: the set of all possible values (the {\em domain}) of $X$ and let
232: $u_i < u_{i+1}$, for each $i$, $1 \leq i <m$. The {\em frequency
233: set} for $X$ is the set $\F =\{f(u_1), ... , f(u_m) \} $ such that
234: for each $i$, $1 \leq i \leq m$, $f(u_i)$ is the number of
235: occurrences of the attribute value $u_i$ in the relation $R$. The
236: {\em cumulative frequency set} $\S =\{s_1, ... , s_m \}$ contains
237: the value $s_i = \sum_{j=1}^i f(u_j)$ for each attribute value
238: $u_i$. The {\em value set} $V=$ $\{u_i \in \U \ | \ f(u_i) > 0 \}$
239: is the active domain of $X$ in $R$ as it consists of all attribute
240: values actually occurring in the relation $R$ ({\em non-null
241: values}). Given any $u_i$ in $V$, the {\em spread} $d_i$ of $u_i$
242: is defined as 1 if $u_i$ is the last
243: non-null value or otherwise as the difference $u_j - u_{i}$, where
244: $u_{j}$ is the first non-null value for which $u_j > u_i$ (i.e.,
245: $d_i$ is the distance from $u_i$ to the next non-null value).
246:
247: A {\em bucket} $B$ for $R$ on $X$ is a 4-tuple $\langle inf, sup,
248: t, c \rangle$, where $u_{inf}$ and $u_{sup}$, $1 \leq inf \leq sup
249: \leq m$, are the boundaries of the domain range pertaining to the
250: bucket, $t$ is the number of non-null values occurring in the
251: range, and $c = \sum_{i=inf}^{sup} f(u_i)$ is the sum of
252: frequencies of all values in the range.
253: We say that the bucket $B$ is {\em
254: 1-biased} if $u_{sup}$ is not null; if also $u_{inf} $ is not
255: null, then we say that $B$ is {\em 2-biased}.
256:
257:
258: A {\em histogram} $H$ for $R$ on $X$ is an $h$-tuple $\langle
259: B_1,B_2, ..., B_h \rangle$ of buckets such that: (1) for each $1
260: \leq i < h$, the upper bound of $B_i$ precedes the lower bound of
261: $B_{i+1}$ and (2) $u \in V$ implies $u \in B_i$, for some $i$, $1
262: \leq i \leq h$. Condition (1) guarantees that buckets do not
263: overlap each other, and condition (2) enforces that every non-null
264: value be hosted by some bucket. Classically, histograms have
265: 2-biased buckets; sometimes, for storage optimization, 2-biased
266: buckets are made 1-biased by replacing the lower bound of each
267: bucket with the successor, in the domain, of the upper bound of the
268: preceding bucket.
269:
270: A classical problem on histograms is: given a histogram $H$ and a
271: (range) query of the form $u_j \leq X \leq u_i$, $1 \leq j \leq i \leq m$,
272: estimate the overall frequency $\sum_{k=j}^i f(u_k)$ in the range from $u_j$
273: to $u_i$.
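
\noindent To make these definitions concrete, consider the following small
instance (the values are hypothetical and chosen only for illustration). Let
$\U=\{1,\ldots,6\}$ with $u_i=i$, and let the frequencies
$f(u_1),\ldots,f(u_6)$ be $3,0,0,5,2,0$. Then
\[
s_1,\ldots,s_6 = 3,3,3,8,10,10, \qquad V=\{1,4,5\}, \qquad
d_1=3,\ d_4=1,\ d_5=1 .
\]
The bucket $\langle 1,5,3,10 \rangle$ contains all non-null values and is
2-biased, since both $u_1$ and $u_5$ are non-null. The range query
$u_2 \leq X \leq u_5$ has the exact answer
\[
\sum_{k=2}^{5} f(u_k) = 0+0+5+2 = 7 = s_5 - s_1 .
\]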
274:
275:
276:
277:
278: \section{Estimation Inside a Bucket}\label{sec-Estimation}
279:
280: In this section we investigate in depth the problem of frequency
281: estimation inside buckets. First of all, we present the two
282: classical techniques (CVA and USA), discuss their limitations and
283: propose some simple alternatives. Then we introduce a novel
284: technique which is based on a 4-level tree index storing
285: approximate representations of the partial sums of 7 fixed bucket
286: intervals. Later we evaluate the accuracy of the various
287: techniques by performing both a theoretical analysis of errors and
288: a number of experiments on some typical sample distributions.
289:
290: \subsection{Notations and Problem Formulation}
291:
292: Let $B=\langle inf, sup, t, c \rangle$ be a bucket on an attribute
293: $X$ of a relation $R$. Without loss of generality, we assume that
294: $inf=1$ and $sup=b$ so that we can represent the frequency set
295: inside the bucket as a vector $F$ with indexes ranging from $1$ to $b$
296: ({\em frequency vector of} $B$). Similarly, the cumulative
297: frequencies are represented by a vector $S$ with indexes from $1$
298: to $b$ ({\em cumulative frequency vector of} $B$). Hence, for each
299: $i$, $1 \leq i \leq b$, $F[i]\geq 0$ is the frequency of the value
300: $u_i$ while $S[i]=$ $\sum_{j=1}^{i}F[j]$ is the cumulative
301: frequency. Then $c=S[b]$ is the sum of all frequencies in the
302: bucket; moreover, for notation convenience, we assume that
303: $S[0]=0$.
304:
305: The problem of the estimation inside a bucket can be formulated as
306: follows: {\em given any pair} $i,j$, $1 \leq i \leq j \leq b$, such
307: that $d=j-i+1 < b$, {\em estimate the range query} $S[j] - S[i-1] =$
308: $\sum_{k=i}^j
309: F[k]$.
310: We focus our attention on the basic problem of estimating $S[d]$
311: (thus assuming $i=1$).
312:
313: We now introduce the following notation.
314: Given $1 \leq i \leq j \leq 8$,
315: we denote by $\delta_{i/j}$ the sum
316: $\sum_{k=x}^{y} F[k]$,
317: where $x=1+ \lceil \frac{b}{j} \cdot (i -1) \rceil$
318: and $y= \lceil \frac{b}{j} \cdot i\rceil$.
319: $\delta_{i/j}$ represents the frequency sum of the $i$-th
320: element of the partition of $B$ into $j$ equal-size sub-ranges.
321: Thus, the frequency sum for a bucket is $\delta_{1/1}$; the
322: frequency sums for two halves are $\delta_{1/2}$ and
323: $\delta_{2/2}$; the frequency sums for the 4 quarters are
324: $\delta_{i/4}$, $1 \leq i \leq 4$; the frequency sums for the 8
325: eighths are $\delta_{i/8}$, $1 \leq i \leq 8$, and so on.
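
\noindent As an illustration of this notation (the bucket sizes are assumed
only for the example), take $b=16$. For $\delta_{3/4}$ we get
$x=1+\lceil \frac{16}{4}\cdot 2\rceil=9$ and
$y=\lceil \frac{16}{4}\cdot 3\rceil=12$, so that
\[
\delta_{3/4}=F[9]+F[10]+F[11]+F[12], \qquad
\delta_{1/4}=\delta_{1/8}+\delta_{2/8}=F[1]+F[2]+F[3]+F[4].
\]
When $b$ is not a multiple of $j$ the sub-ranges are equal only up to
rounding: for $b=10$ and $j=4$ the four sub-ranges are
$[1,3]$, $[4,5]$, $[6,8]$ and $[9,10]$.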
326:
327: \subsection{Estimation Techniques}
328:
329: Next we illustrate the existing approximation techniques
330: and discuss some additional simple approaches.
331:
332: \noindent{\bf Continuous Value Assumption (CVA).} The estimation
333: of $S[d]$ is computed as $\widetilde{S}[d]=\frac{d}{b} \cdot c$.
334: In words, the partial contribution of a bucket to a range query
335: result is estimated by linear interpolation. As pointed out in
336: \protect\cite{Buccafurri99Compressed,BuFuSa01}, the above
337: estimation coincides with the expected value of $S[d]$ when it
338: is regarded as a random variable over the population of all
339: frequency distributions in the bucket for which the overall
340: cumulative frequency is $c$. \par \noindent{\bf Uniform Spread
341: Assumption (USA).} The estimation of $S[d]$ is given by
342: $\widetilde{S}[d] = \left ( 1 + \frac{(t-1)\cdot (d-1)}{(b-1)}
343: \right ) \cdot \frac{c}{t}$, where $t$ is the number of non-null
344: attribute values in the bucket. The uniform spread assumption
345: assumes that such values are distributed at equal distance from
346: each other and the overall frequency sum is equally distributed
347: among them. Obviously, in this case the information $t$ is
348: necessary. We stress that, as discussed in
349: \protect\cite{BuFuSa01}, this estimation is not supported by any
350: unbiased probabilistic model so the assumption is rather
351: arbitrary.
352:
353: \noindent{\bf 1-Biased Estimation (1b).} The possibly available
354: information on the number $t$ of non-null elements cannot be
355: exploited in the estimation unless some further information on the
356: frequency distribution is either available or assumed (as for the
357: USA estimation). We next show how to exploit the fact that a
358: bucket is often 1-biased (i.e., $u_b$ is not null) using the
359: probabilistic approach proposed in \protect\cite{BuFuSa01}. This
360: approach assumes that the query is a random variable on the
361: population of all 1-biased frequency distributions having $c$ as
362: overall cumulative frequency. The estimation of the range query
363: $S[d]$ for a 1-biased bucket is given by $\widetilde{S}[d]=
364: \frac{d}{b-1} \cdot \frac{t-1}{t} \cdot c$.
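
\noindent The three estimators above can be compared on a small hypothetical
bucket (the numbers are chosen only for illustration) with $b=7$, $c=140$,
$t=4$ and query length $d=3$:
\[
\begin{array}{ll}
\mbox{CVA:} & \widetilde{S}[3]=\frac{3}{7}\cdot 140 = 60 \\[1mm]
\mbox{USA:} & \widetilde{S}[3]=\left(1+\frac{(4-1)\cdot(3-1)}{7-1}\right)\cdot\frac{140}{4} = 2\cdot 35 = 70 \\[1mm]
\mbox{1b:} & \widetilde{S}[3]=\frac{3}{7-1}\cdot\frac{4-1}{4}\cdot 140 = 52.5
\end{array}
\]
The USA value corresponds to placing the $4$ non-null values at positions
$1,3,5,7$ with frequency $35$ each, two of which fall within the first three
positions.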
365:
366: \noindent{\bf 2-Split Estimation (2s).} We split the bucket into
367: two parts of the same size and store the cumulative frequency of
368: the first part, say $\delta_{1/2}=S[b/2]$ --- we therefore need
369: additional storage space (typically 32 bits). We call this method
370: {\em 2-split} or $2s$ for short. Following this approach, the
371: estimation of the range query $S[d]$ is given by
372: $2 \cdot \frac{d}{b} \cdot \delta_{1/2}$ if $d \leq \frac{b}{2}$, and by
373: $\delta_{1/2} + \frac{2d - b}{b} \cdot (c - \delta_{1/2})$
374: otherwise.
375: Thus we use the CVA technique for each of the two halves of the
376: bucket.
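
\noindent For instance, under the assumed values $b=8$, $c=120$ and
$\delta_{1/2}=90$ (hypothetical numbers used only as an illustration),
applying CVA separately to each half gives
\[
\widetilde{S}[3]=\frac{3}{4}\cdot 90 = 67.5, \qquad
\widetilde{S}[6]=90+\frac{6-4}{4}\cdot(120-90)=105,
\]
whereas plain CVA on the whole bucket would return
$\frac{3}{8}\cdot 120=45$ and $\frac{6}{8}\cdot 120=90$, respectively.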
377:
378: \noindent{\bf 4-Split Estimation (4s).} We split the bucket into
379: 4 parts of the same size ({\em quarters}) and store the
380: approximate values of the cumulative frequency of each
381: part $\delta_{i/4}$, $1 \leq i \leq 4$.
382: In case the additional available space is 32 bits, we use 8 bits for each
383: approximate value, which is therefore computed as
384: $\tilde{\delta}_{i/4}=\langle\frac{\delta_{i/4}}{c}
385: \times (2^8-1)\rangle$,
386: where $\langle x \rangle$ stands for $round(x)$.
387: The frequency sum
388: for the interval from $1$ to $d$ is estimated by adding the approximate values
389: of all the quarters that are fully contained in the interval plus
390: the CVA estimation of the portion of the last quarter that
391: partially overlaps the interval. Obviously, in order to reduce the
392: approximation error, in case $d>b/2$, it is convenient to derive
393: the approximate value from the estimation of the cumulative
394: frequency in the complementary interval from $d+1$ to $b$.
395:
396:
397: \noindent{\bf 8-Split Estimation (8s).}
398: It is analogous to the 4-Split Estimation. The only difference is that the
399: bucket is
400: divided into 8 parts ({\em eighths}) and, for each of them, we use
401: 4 bits for storing the cumulative frequency.
402: Thus, the approximate value of the $i$-th eighth ($1 \leq i \leq 8$) is
403: computed as
404: $\tilde{\delta}_{i/8}=\langle\frac{\delta_{i/8}}{c}
405: \times (2^4-1)\rangle$,
406: where $\langle x \rangle$ stands for $round(x)$.
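
\noindent The effect of the code length can be seen on a hypothetical bucket
with $c=1000$ (values chosen only for illustration, and assuming that a stored
code is decoded by the inverse scaling). With 8 bits, a quarter sum
$\delta_{1/4}=137$ is stored as
$\tilde{\delta}_{1/4}=\langle 0.137\times 255\rangle=\langle 34.935\rangle=35$
and decoded as $\frac{35}{255}\cdot 1000\approx 137.25$. With 4 bits, an
eighth sum $\delta_{1/8}=137$ is stored as
$\tilde{\delta}_{1/8}=\langle 0.137\times 15\rangle=\langle 2.055\rangle=2$
and decoded as $\frac{2}{15}\cdot 1000\approx 133.33$. Consecutive codes are
$c/255\approx 3.9$ apart in the first case and $c/15\approx 66.7$ apart in the
second, which is why the scaling error grows as the number of bits per sum
decreases.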
407:
408:
409:
410: \subsection{The Tree Indices for Bucket Frequency Estimation}
411:
412: We now propose to use the 32 bits as a more sophisticated tree index for
413: providing an {\em approximate description} of the cumulative
414: frequencies in the bucket (this index can be easily extended
415: to the case in which more bits are available). To this end, we store
416: the approximate value of the cumulative frequency in a suitable number of
417: intervals
418: inside the bucket.
419: The first type of tree-index is 3LT.
420:
421: \noindent
422: {\bf 3 Level Tree index (3LT)}
423: The 3LT index uses 11 bits for
424: approximating the value of $\delta_{1/2}$, and 10 bits both for
425: approximating $\delta_{1/4}$ and for $\delta_{3/4}$.
426:
427: Let $L_{1/2}$ be the
428: 11-bit string corresponding to $\delta_{1/2}$, and let $L_{1/4}$ and
429: $L_{3/4}$ be the 10-bit strings corresponding, respectively, to
430: $\delta_{1/4}$ and $\delta_{3/4}$.
431:
432: The three $L$ strings are constructed as follows:
433:
434: \vspace{2mm}
435: \begin{center}
436: {\small
437: $L_{1/2} = \langle\frac{\delta_{1/2}}{\delta_{1/1}}\cdot (2^{11}-1)
438: \rangle;
439: \ \ \ \
440: L_{1/4}= \langle\frac{\delta_{1/4}}{\delta_{1/2}}\cdot (2^{10}-1)\rangle;
441: \ \ \ \
442: L_{3/4}= \langle\frac{\delta_{3/4}}{\delta_{2/2}}\cdot (2^{10}-1)\rangle$
443: }
444: \end{center}
445:
446: \vspace{2mm}
447: \noindent where, we recall, $\langle x \rangle$ stands for $round(x)$.
448:
449:
450: The approximate values for the partial sums are given by:
451:
452: \vspace{2mm}
453: \begin{center}
454: {\small
455: $\widetilde{\delta}_{1/1}=\delta_{1/1}=c$\\
456:
457: $\widetilde{\delta}_{1/2}= \frac{L_{1/2}}{2^{11}-1} \cdot
458: \widetilde{\delta}_{1/1};
459: \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \
460: \widetilde{\delta}_{2/2}= \widetilde{\delta}_{1/1} -
461: \widetilde{\delta}_{1/2}$ \\
462:
463: $\widetilde{\delta}_{1/4}= \frac{L_{1/4}}{2^{10}-1}\cdot
464: \widetilde{\delta}_{1/2};
465: \ \ \ \ \
466: \widetilde{\delta}_{2/4}=\widetilde{\delta}_{1/2} -
467: \widetilde{\delta}_{1/4};
468: \ \ \ \ \ \
469: \widetilde{\delta}_{3/4}= \frac{L_{3/4}}{2^{10}-1}\cdot
470: \widetilde{\delta}_{2/2};
471: \ \ \ \ \
472: \widetilde{\delta}_{4/4}=\widetilde{\delta}_{2/2} -
473: \widetilde{\delta}_{3/4}$\\
474: }
475: \end{center}
476: \vspace{2mm}
477:
478: Observe that the 32-bit index refers to a 3-level tree whose
479: nodes store directly or indirectly the approximate values of the
480: cumulative frequencies for fixed intervals: the root stores the
481: overall cumulative frequency $c$, the two nodes of the second
482: level store the cumulative frequencies for the two halves of the
483: bucket and so on.
484:
485: \begin{example}\label{3LT-example}
486: Consider the 3-level tree in Figure
487: \ref{fig-3LT}. The 32 bits store the following approximate
488: cumulative frequencies: $L_{1/2}=\langle \frac{5594}{8678} \cdot
489: 2047 \rangle=1320$, $L_{1/4}=\langle \frac{2834}{5594} \cdot 1023
490: \rangle=518$, $L_{3/4}=\langle \frac{2818}{8678-5594} \cdot 1023
491: \rangle=935$.
492: \end{example}
493:
494: \begin{figure}[t]
495: \epsfig{file=fig1.eps,width=11cm}
496: \caption{The 3-level tree.}\label{fig-3LT}
497: \end{figure}
498:
499:
500: We are now ready to solve the frequency estimation inside the bucket
501: $B$. Given $d$, $1 \leq d < b$, let $i$ be the integer for which
502: $\lceil(i-1)/4 \cdot b \rceil \leq d < \lceil i/4 \cdot b \rceil$.
503: Then the approximate value of $S[d]$ is:
504:
505:
506: \[
507: \begin{array}{l}
508: \widetilde{S}[d]= P(i)+P'(i)+\frac{d-\lceil(i-1)/4 \cdot
509: b\rceil} {\lceil i/4 \cdot b\rceil-\lceil(i-1)/4 \cdot b\rceil}
510: \cdot \widetilde{\delta}_{i/4}
511: \end{array}
512: \]
513: \noindent where
514: \[
515: \begin{array}{ll}
516: P(i)= \left \{
517: \begin{array}{ll}
518: \widetilde{\delta}_{1/2} & \mbox{if $i > 2$}\\
519: 0 & \mbox{if $i \leq 2$}
520: \end{array} \right.
521: & \ \ \ \ \
522: P'(i)= \left \{
523: \begin{array}{ll}
524: \widetilde{\delta}_{1/4} & \mbox{if $i = 2$}\\
525: \widetilde{\delta}_{3/4} & \mbox{if $i = 4$}\\
526: 0 & \mbox{otherwise}
527: \end{array} \right.
528: \\
529: \end{array}
530: \]
531: Thus we use the interpolation based on the CVA only inside a
532: segment of length $\lceil(1/4) \cdot b\rceil$. This component
533: becomes zero at each distance $d=\lceil i \cdot \frac{b}{4} \rceil$, $1
534: \leq i < 4$.
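
\noindent As a sketch of how the index is used, the codes of
Example \ref{3LT-example} (with $c=8678$) decode to
\[
\begin{array}{ll}
\widetilde{\delta}_{1/2}=\frac{1320}{2047}\cdot 8678\approx 5595.97 &
\widetilde{\delta}_{2/2}=8678-\widetilde{\delta}_{1/2}\approx 3082.03 \\[1mm]
\widetilde{\delta}_{1/4}=\frac{518}{1023}\cdot \widetilde{\delta}_{1/2}\approx 2833.54 &
\widetilde{\delta}_{2/4}=\widetilde{\delta}_{1/2}-\widetilde{\delta}_{1/4}\approx 2762.43 \\[1mm]
\widetilde{\delta}_{3/4}=\frac{935}{1023}\cdot \widetilde{\delta}_{2/2}\approx 2816.90 &
\widetilde{\delta}_{4/4}=\widetilde{\delta}_{2/2}-\widetilde{\delta}_{3/4}\approx 265.12
\end{array}
\]
which can be compared with the exact sums $\delta_{1/2}=5594$,
$\delta_{1/4}=2834$ and $\delta_{3/4}=2818$ of the example. Assuming, purely
for illustration, a bucket of size $b=16$ and $d=6$, we have $i=2$, since
$\lceil b/4 \rceil = 4 \leq 6$ and $6$ is smaller than $\lceil 2b/4 \rceil = 8$.
Hence the estimate of the cumulative frequency up to $d$ is
\[
P(2)+P'(2)+\frac{6-4}{8-4}\cdot\widetilde{\delta}_{2/4}
\approx 0+2833.54+\frac{1}{2}\cdot 2762.43 \approx 4214.76 .
\]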
535:
536:
537: The 32 bits may also be distributed in such a way that the granularity of
538: the tree-index increases w.r.t. 3LT. The 4LT index has 4 levels
539: and uses 6 bits for the node stored at the second level, 5 bits for each of the two nodes stored at the third level and
540: 4 bits for each of the four nodes stored at the last level.
541:
542: \noindent
543: {\bf 4 Level Tree index (4LT)}
544: We reserve 4 bits to store the approximate value of each of the
545: following 4 partial sums: $\delta_{1/8}$, $\delta_{3/8}$,
546: $\delta_{5/8}$ and $\delta_{7/8}$ --- let $L_{i/8}$, $i=1,3,5,7$,
547: denote such 4-bit strings. We then use the remaining 16 bits as
548: follows: the partial sums $\delta_{1/4}$ and $\delta_{3/4}$ are
549: approximated by the 5-bit strings $L_{1/4}$ and $L_{3/4}$,
550: respectively, while the partial sum $\delta_{1/2}$ is approximated by a 6-bit
551: string $L_{1/2}$. As a result, the larger the intervals, the higher
552: the number of bits used.
553: The 7 $L$ strings are constructed as follows:
554:
555:
556: \[
557: \small
558: \begin{array}{ll}
559: L_{1/2} = \langle\frac{\delta_{1/2}
560: }{\delta_{1/1}}\cdot (2^6-1)\rangle &
561: \\
562:
563: L_{i/4}= \langle\frac{\delta_{i/4}}{\delta_{j/2}}\cdot (2^5-1)\rangle
564: & (i=1 \wedge j =1 ), (i=3 \wedge j =2 )\\
565:
566: L_{i/8}=\langle\frac{\delta_{i/8}}{\delta_{j/4}}\cdot (2^4-1)\rangle
567: & (i=1 \wedge j =1), (i=3 \wedge j =2), \\
568:
569: & (i=5 \wedge j =3), (i=7 \wedge j =4)\\
570:
571: \end{array}
572: \]
573:
574: \noindent where,
575: we recall, $\langle x \rangle$ stands for $round(x)$.
576:
577:
578: The approximate values for the partial sums are eventually
579: computed as:
580:
581: \[
582: \begin{array}{ll}
583: \widetilde{\delta}_{1/1}=\delta_{1/1}=c & \\
584:
585: \widetilde{\delta}_{1/2}= \frac{L_{1/2}}{2^6-1}\times
586: \widetilde{\delta}_{1/1}\\
587:
588: \widetilde{\delta}_{2/2}= \widetilde{\delta}_{1/1} -
589: \widetilde{\delta}_{1/2} &\\
590:
591:
592: \widetilde{\delta}_{i/4}= \frac{L_{i/4}}{2^5-1}\times
593: \widetilde{\delta}_{j/2} & (i=1 \wedge j =1), (i=3 \wedge j =2) \\
594:
595: \widetilde{\delta}_{i/4}=\widetilde{\delta}_{j/2} -
596: \widetilde{\delta}_{(i-1)/4} &
597: (i=2 \wedge j =1),
598: (i=4 \wedge j =2)\\
599:
600: \widetilde{\delta}_{i/8}= \frac{L_{i/8}}{2^4-1}\times
601: \widetilde{\delta}_{j/4} & (i=1 \wedge j =1), (i=3 \wedge j =2) \\ &
602: (i=5 \wedge j =3), (i=7 \wedge j =4) \\
603: \widetilde{\delta}_{i/8}=\widetilde{\delta}_{j/4} -
604: \widetilde{\delta}_{(i-1)/8} &
605: (i=2 \wedge j =1),
606: (i=4 \wedge j =2)\\ & (i=6 \wedge j =3), (i=8 \wedge j =4)
607: \end{array}
608: \]
609:
610: Similarly to the 3LT-index, the 4LT-index
611: refers to a 4-level tree whose
612: nodes store directly or indirectly the approximate values of the
613: cumulative frequencies for fixed hierarchical intervals
614: starting from the root which stores the
615: overall cumulative frequency $c$.
616:
617: \begin{figure}[t]
618: \epsfig{file=fig2.eps,width=11cm}
619: \caption{The 4-level tree.}\label{fig-4LT}
620: \end{figure}
621:
622:
623:
624: \begin{example}
625: Consider the 4-level tree in Figure
626: \ref{fig-4LT}.
627: The 32 bits store the following approximate
628: cumulative frequencies: $L_{1/2}=33$, $L_{1/4}=18$, $L_{3/4}=13$,
629: $L_{1/8}=6$, $L_{3/8}=11$, $L_{5/8}=5$, $L_{7/8}=7$.
630: \end{example}
631:
632:
633:
634: Again, similarly to the 3LT-index, the frequency estimation inside the
635: bucket
636: $B$ can be obtained by exploiting the content of the nodes of the index.
637: Given $d$, $1 \leq d < b$, and the integer $i$ for which
638: $\lceil(i-1)/8\times b\rceil \leq d < \lceil i/8\times b\rceil$,
639: the approximate value of $S[d]$ is:
640: \[
641: \begin{array}{l}
642: \widetilde{S}[d]= P(i)+P'(i)+P''(i)+\frac{d-\lceil(i-1)/8\times
643: b\rceil} {\lceil i/8\times b\rceil-\lceil(i-1)/8\times b\rceil}
644: \times \widetilde{\delta}_{i/8}
645: \end{array}
646: \]
647: \noindent where
648: \[
649: \begin{array}{ll}
650: P(i)= \left \{
651: \begin{array}{ll}
652: \widetilde{\delta}_{1/2} & \mbox{if $i > 4$}\\
653: 0 & \mbox{if $i \leq 4$}
654: \end{array} \right.
655: &
656: P'(i)= \left \{
657: \begin{array}{ll}
658: \widetilde{\delta}_{1/4} & \mbox{if $i = 3,4$}\\
659: \widetilde{\delta}_{3/4} & \mbox{if $i = 7,8$}\\
660: 0 & \mbox{otherwise}
661: \end{array} \right.
662: \\
663: \end{array}
664: \]
665: \[
666: P''(i)= \left \{
667: \begin{array}{ll}
668: \widetilde{\delta}_{(i-1)/8} & \mbox{if $i$ is even}\\
669: 0 & \mbox{otherwise}
670: \end{array} \right.
671: \]
672:
673: Thus we
674: use interpolation as in CVA only
675: inside a segment of length $\lceil(1/8)b\rceil$. This component
676: becomes zero at each distance $d=\lceil i \times b/8 \rceil$, $1
677: \leq i < 8$. We call the estimation {\em 4-level tree} or 4LT for short.
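
\noindent As an illustration (the bucket size is an assumption made only for
this sketch), let $b=64$, so that each eighth spans $8$ values; note also that
the index occupies $6+2\cdot 5+4\cdot 4=32$ bits. For $d=20$ we have $i=3$,
since $20$ lies between $16$ (included) and $24$ (excluded), and for $d=44$ we
have $i=6$, since $44$ lies between $40$ (included) and $48$ (excluded).
The corresponding estimates of the cumulative frequency are
\[
\widetilde{\delta}_{1/4}+\frac{20-16}{24-16}\cdot\widetilde{\delta}_{3/8}
\qquad \mbox{and} \qquad
\widetilde{\delta}_{1/2}+\widetilde{\delta}_{5/8}+\frac{44-40}{48-40}\cdot\widetilde{\delta}_{6/8},
\]
that is, whole indexed intervals contribute their decoded sums and only the
last, partially covered eighth is interpolated.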
678:
679:
680:
681: \subsection{Worst-case Error Analysis}\label{sec-Analysis}
682:
683: The approximation error for CVA, 1b, USA and 2s arises only from interpolation.
684: On the contrary, for other methods (i.e., 4s, 8s, 3LT and 4LT), the scaling
685: error due to bit saving is added to the interpolation error.
686: However, all methods but CVA, 1b and USA implement an equi-size division of
687: the bucket and
688: 3LT and 4LT provide also an index over sub-buckets.
689: We expect that such a division into sub-buckets produces an improvement
690: from the side of the interpolation error.
691: Indeed, sub-buckets increase
692: the granularity of summarization.
693: In addition, we expect that index-based methods (i.e., 3LT and 4LT), reduce
694: the scaling error, since
695: hierarchical tree-like organization allows us to
696: represent the sum inside a given sub-bucket, corresponding to a
697: node of the tree, as a fraction of the sum contained in the parent
698: node, instead of a fraction of the entire bucket sum (as it happens for the
699: "flat" methods 4s and 8s).
700: The worst-case analysis confirms the above observations.
701: In particular we show that while CVA, 1b and USA are the same, under the
702: worst-case point of view, 4LT outperforms the other methods.
703:
704: Results of our analysis are summarized in the following theorem.
705: Recall that, throughout the whole section, a bucket $B$ of
706: size $b$ is given.
707:
708: \begin{theorem}
Let $F$ be the maximum frequency value occurring in $B$ and
assume that $b \bmod 8 = 0$. Then the
interpolation and scaling worst-case errors of
CVA, 1b, USA, 2s, 4s, 8s, 3LT and 4LT are the following:
713:
714: \begin{center}
715: \begin{tabular}[h]{||c||c|c|c|c|c|c|c|c||}
716: \hline\hline
717: error/method & CVA & 1b & USA & 2s & 4s & 8s & 3LT & 4LT
718: \\ \hline
719: interpolation & $\frac{F \cdot b}{4}$ & $\frac{F \cdot b}{4}$ & $\frac{F
720: \cdot b}{4}$ & $\frac{F \cdot b}{8}$ &
721: $\frac{F \cdot b}{16}$ & $\frac{F \cdot b}{32}$ & $\frac{F \cdot b}{16}$
722: & $\frac{F \cdot b}{32}$
723: \\ \hline
724: scaling & 0 & 0& 0 & 0 & $\frac{F \cdot b}{2^9}$ & $\frac{F \cdot
725: b}{32}$ & $\frac{F \cdot b}{2^{12}}$ & $\frac{F \cdot b}{2^7}$
726:
727: \\ \hline
728: total & $\frac{F \cdot b}{4}$ & $\frac{F \cdot b}{4}$ & $\frac{F \cdot
729: b}{4}$ & $\frac{F \cdot b}{8}$
730: & $\frac{F \cdot b}{16}$ & $\frac{F \cdot b}{16}$ & $\frac{F \cdot b}{16}$
731: & $\frac{F \cdot b}{32}$
732: \\ \hline\hline
733: \end{tabular}
734: \end{center}
735: \end{theorem}
736:
737: \begin{proof}
Let $b_M$ be the size of the smallest sub-bucket produced by the method
$M$, where $M$ is one of CVA, 1b, USA, 2s, 4s, 8s, 3LT and 4LT.
Observe that $b_M=b$ for CVA, 1b and USA (since they do not produce
sub-buckets), while $b_{2s}= \frac{b}{2}$, $b_M = \frac{b}{4}$ for
$M=$ 4s or $M=$ 3LT, and $b_M = \frac{b}{8}$ for $M=$ 8s or $M=$ 4LT.
743:
744: Consider first the interpolation error
745: (by assuming that no scaling error occurs).
746:
747: \noindent
748: {\bf Interpolation error bounds.}
It can be easily verified that the worst case for a method $M$ occurs
when
both of the following conditions hold:
\begin{enumerate}
\item [(1)]
there is a smallest sub-bucket, say $B'$ (of size $b_M$), containing,
in its first half, $\frac{b_M}{2}$ frequencies with value $F$,
and, in its second half, $\frac{b_M}{2}$ frequencies with value 0, and
\item [(2)]
the range query involves exactly the first half of the sub-bucket $B'$.
\end{enumerate}
For each method, we now determine
the maximum absolute interpolation error.
762:
763: \noindent{\bf CVA:}
764: In this case, $b_M = b$, that is the sub-bucket coincides with
765: the entire
766: bucket and the query boundaries are $1$ and $\frac{b}{2}$.
767: The cumulative value of the bucket is $F \cdot \frac{b}{2}$.
768: Under CVA, the estimated value of the query is $\frac{F \cdot
769: \frac{b}{2}}{b}\cdot \frac{b}{2}$,
770: that is $\frac{F \cdot b}{4}$. The actual value of the query is $\frac{F
771: \cdot b}{2}$.
772: Therefore the absolute error is $\frac{F \cdot b}{4}$.
773:
774:
775:
\noindent{\bf 1b:}
We obtain the same absolute error $\frac{F \cdot b}{4}$.
Indeed, since the first value of the bucket is $F$
(i.e., not null), the 1-biased estimation gives no additional information
w.r.t. CVA.
781:
782: \noindent{\bf USA:}
783: Also in this case, $b_M = b$, that is the sub-bucket coincides
784: with the entire
785: bucket and the query boundaries are $1$ and $\frac{b}{2}$.
786: The cumulative value of the bucket is $F \cdot \frac{b}{2}$.
USA assumes that the $\frac{b}{2}$ non-null values are
located at equal distance from each other,
and that each has the value $F$. As a consequence, the estimated value of the
query
is $F \cdot \frac{b}{4}$, since the query involves just half of the
estimated non-null values.
The actual value is $\frac{F \cdot b}{2}$. Thus, the absolute error is
$\frac{F \cdot b}{4}$, the same as for CVA.
795:
\noindent{\bf 2s:}
In this case $b_M = \frac{b}{2}$.
Applying the argument used for CVA to a sub-bucket of size $b_M$, the absolute error is
$\frac{F \cdot b_M}{4}$, that is
$\frac{F \cdot b}{8}$.
801:
802: \noindent{\bf 4s} and {\bf 3LT}:
803: Both 4s and 3LT produce sub-buckets
804: of size $\frac{b}{4}$. Thus, in these cases $b_M = \frac{b}{4}$.
805: Identically to the previous case, the absolute error is
806: $\frac{F \cdot b_M}{4}$, that is
807: $\frac{F \cdot b}{16}$.
808:
809: \noindent{\bf 8s} and {\bf 4LT}:
810: Both 8s and 4LT produce sub-buckets
811: of size $\frac{b}{8}$. Thus, in these cases $b_M = \frac{b}{8}$.
812: Identically to the previous case, the absolute error is
813: $\frac{F \cdot b_M}{4}$, that is
814: $\frac{F \cdot b}{32}$.
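
The case analysis above can be checked numerically with a small sketch. It builds the worst-case sub-bucket (first half filled with $F$, second half with 0), estimates the query over its first half by linear interpolation, and compares the result with the actual sum. The function names and the parameter \texttt{parts} (the number of equal-size sub-buckets, i.e., 1, 2, 4 or 8) are hypothetical and are introduced only for illustration; scaling errors are ignored, as in this part of the proof.

```python
def cva_estimate(total, size, length):
    """Linear interpolation: a prefix of `length` cells gets length/size of `total`."""
    return total * length / size


def worst_case_interpolation_error(F, b, parts):
    """Absolute interpolation error on the worst-case sub-bucket of size b/parts."""
    b_m = b // parts                                 # size of the smallest sub-bucket
    freqs = [F] * (b_m // 2) + [0] * (b_m // 2)      # condition (1) of the proof
    actual = sum(freqs[: b_m // 2])                  # query on the first half, condition (2)
    estimated = cva_estimate(sum(freqs), b_m, b_m // 2)
    return abs(actual - estimated)


# Illustrative values only: F = 10, b = 64.
for parts in (1, 2, 4, 8):
    print(parts, worst_case_interpolation_error(10, 64, parts))
```

For every value of \texttt{parts} the computed error equals $\frac{F \cdot b_M}{4}$, in agreement with the interpolation row of the table.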
815:
816:
817: Now we consider the scaling error.
818:
819: \noindent{\bf Scaling error bounds.}
820: The proof that CVA, 1b, USA and 2s do not
821: produce scaling error is straightforward.
822: Let us consider the other methods:
823:
824: \noindent{\bf 4s:}
825: Since each sub-bucket sum is encoded by 8 bits and is scaled
826: w.r.t. the overall bucket sum, the maximum scaling error is
827: $\frac{F \cdot b}{2^9}$.
828:
829: \noindent{\bf 8s:}
830: Since each sub-bucket sum is encoded by 4 bits and scaled
831: w.r.t. the overall bucket sum, the maximum scaling error is
832: $\frac{F \cdot b}{2^5}= \frac{F \cdot b}{32}$.
833:
\noindent{\bf 3LT:}
In this case, the scaling error may propagate
along the path from the root to the leaves of the tree.
We can determine an upper bound on the worst-case error
by considering the sum of the maximum scaling errors at each level.
Thus, we obtain the following upper bound:
$\frac{\frac{F \cdot b}{2^{12}} + \frac{F \cdot b}{2}}{2^{11}}$.
Indeed, the maximum scaling error of the first level is
$\frac{F \cdot b}{2^{12}}$. The above value is obtained by considering
that the maximum sum in the half bucket corresponding to the first level
is $\frac{F \cdot b}{2}$, and that going down to the second level
introduces a maximum scaling error obtained by dividing the
overall sum by $2^{11}$. Thus, the maximum scaling error for 3LT
is $\Theta(\frac{F \cdot b}{2^{12}})$ (that is, of the order of the scaling error of the first
level).
849:
\noindent{\bf 4LT:}
The same argument used for 3LT applies to 4LT:
the maximum scaling error is of the same order as that of the first level,
that is, $\Theta(\frac{F \cdot b}{2^7})$, since the first level uses 6 bits.
855:
856: The proof is thus completed.
857: \end{proof}
858:
It is worth noting that,
as expected, 4LT and 8s produce the smallest worst-case interpolation error,
that is $\frac{F \cdot b}{32}$.
Taking the scaling error into account as well,
the overall conclusion we may draw from the above analysis is that the two best
methods w.r.t. interpolation, namely 8s and 4LT, differ in terms of
scaling error.
Indeed, 4LT shows a significant accuracy improvement, since the scaling error
goes from $\frac{F \cdot b}{2^5}$ for 8s to $\frac{F \cdot b}{2^7}$
for 4LT.
869:
In the next subsection we report a number of experiments that
provide additional arguments in favor of the superiority of the 4LT
estimation, through an average-case analysis
of the methods under a number of meaningful data distributions.
We do not conduct experiments on CVA,
since CVA uses 32 bits less per bucket and could, therefore,
use smaller buckets for the same storage, thus providing a better
accuracy. In any case, its behaviour coincides with that
of the 2s estimation, which is CVA applied to each half bucket.
879:
880: \subsection{Experiments inside a Bucket}\label{sec-ExperimentsIntra}
881:
In this section we report the results of a large number of
experiments performed on various synthetic data sets obtained
with different distributions. We measure the accuracy of the
above mentioned methods in estimating range queries inside a
bucket. In particular, the methods considered are: USA, 1b, 2s, 4s, 8s, 3LT and
4LT. We observe that the space required for storing a bucket is the same
for all the considered methods.
Experiments are conducted
on synthetic data generated according to several data distributions.
A data distribution is characterized by a distribution for frequencies and
a distribution for spreads.
The frequency set and the value set are generated independently; then
frequencies are randomly
assigned to the elements of the value set.
896:
897: \subsubsection{Test Bed.}
898:
In this section we illustrate the test bed used in our
experiments. In particular, we describe (1) the
{\em data distributions}, that is, the probability
distributions used for generating frequencies in the tested
buckets, (2) the {\em bucket populations}, that is, the sets
of parameters characterizing the buckets
generated under the
probability distributions, (3) the {\em data sets},
that is, the sets of samples produced by combining (1)
and (2), and (4) the {\em query set and error metrics}, that is,
the set of queries submitted to the sample data and
the metrics used for measuring the approximation error.
911:
912: \noindent {\bf Data Distributions:} We consider four data
913: distributions: ({\bf 1}) {\em Zipf-$cusp\_max$ (0.5,1.0)}:
914: Frequencies are distributed according to a Zipf distribution
915: \protect\cite{Zipf49Human} with the $z$ parameter equal to $0.5$.
916: Spreads are distributed according to a Zipf {\em $cusp\_max$}
917: \protect\cite{Poo97} (i.e., increasing spreads following a Zipf
918: for the first half elements and decreasing spreads following a
919: Zipf distribution for the remaining elements) with $z$ parameter
920: equal to $1.0$. ({\bf 2}) {\em Zipf-$cusp\_max$(1.0,1.0).} ({\bf
921: 3}) {\em Zipf-$cusp\_max$(1.5,1.0).} ({\bf 4}) {\em Gauss-rand}:
922: Frequencies are distributed according to a Gauss distribution with
923: standard deviation $1.0$. Spreads are randomly distributed as
924: well.
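
As an illustration of how Zipf-distributed frequencies can be produced, the sketch below assigns to the element of rank $r$ a frequency proportional to $1/r^z$ and rescales the values so that they sum to the overall cumulative frequency $c$. This is an assumption about the generator, which is not detailed here: the function name \texttt{zipf\_frequencies} is hypothetical, frequencies are left as real numbers, and neither the generation of spreads nor the random assignment of frequencies to values is covered.

```python
def zipf_frequencies(c, t, z):
    """Return t frequencies proportional to 1/rank**z that sum to c.

    c -- overall cumulative frequency of the bucket
    t -- number of non-null values
    z -- Zipf skew parameter (z = 0 gives uniform frequencies)
    """
    weights = [1.0 / (rank ** z) for rank in range(1, t + 1)]
    total = sum(weights)
    return [c * w / total for w in weights]


# Illustrative call with c = 20000, t = 10 and z = 0.5.
freqs = zipf_frequencies(20000, 10, 0.5)
```

Larger values of $z$ concentrate a larger share of $c$ on the first few ranks, which is the sense in which $z$ measures the data skew.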
925:
926:
927:
928:
929: \noindent {\bf Bucket Populations:} A population is characterized
930: by the values of $c$ (overall cumulative frequency), $b$ (the
931: bucket size) and $t$ (number of non-null attribute values) and
932: consists of all buckets having such values. We consider 9
933: different populations divided into two sets, that are called t-var
934: and b-var, respectively.
935:
936: \noindent
937: {\em Set of populations t-var.}
938: It is a set of 6 populations of buckets, all of them with
939: $c=20000$ and $b=500$. The 6 populations differ on the value of
940: the parameter $t$ ($t$=10, 100, 200, 300, 400, 500), and are denoted by
941: t-var(10), t-var(100), t-var(200), t-var(300), t-var(400) and
t-var(500), respectively.
943:
944: \noindent
945: {\em Set of populations b-var.}
946: It is a set of 4 populations of buckets, all of them with
947: $c=20000$. They differ on the value of the parameters $b$ and
948: $t$. We consider $4$ different values for $b$ ($b$=100, 200, 500,
949: 1000). The number of non-null values $t$ of each population is
950: fixed in a way that the ratio $t/b$ is constant and equal to
951: $0.2$; so the values of $t$ are 20, 40, 100 and 200. The four
952: populations are denoted by b-var(100), b-var(200), b-var(500) and
953: b-var(1000).
954:
955: Moreover, a generic population whose parameter values are,
956: say, $\bar c$, $\bar b$ and $\bar t$ (for $c$, $b$ and $t$, respectively),
957: is denoted by p($\bar c$, $\bar b$, $\bar t$).
958:
959: \noindent {\bf Data Sets:} As a data set we mean a sampling of the
960: set of buckets belonging to a given population following a given
961: data distribution. Each data set included in the experiments is
962: obtained by generating $100$ buckets belonging to one of the
963: populations specified above under one of the above described data
964: distributions. We denote a data set by the name of the data
965: distribution and the name of the population. For example, the data
966: set (Zipf-cusp\_max(0.5,1.0), b-var(200)) denotes a sampling of
967: the set of buckets belonging to the population of b-var
968: corresponding to the value 200 for the parameter $b$ following the
969: data distribution Zipf-cusp\_max(0.5,1.0).
970:
We generate 23 different data sets,
classified as follows:
(1)
{\bf Zipf-t} (i.e., Zipf data, different bucket density),
containing the six data sets (Zipf-cusp\_max(0.5,1.0), t-var($t$)), for
$t$=10,
100, 200, 300, 400, 500.
(2)
{\bf Zipf-b} (i.e., Zipf data, different bucket size),
containing the
four data sets (Zipf-cusp\_max(0.5,1.0), b-var($b$)), for $b$=100,
200, 500, 1000.
(3) {\bf Gauss-t} (i.e., Gauss data, different bucket density),
containing the six data sets (Gauss-rand, t-var($t$)), for $t$=10,
100, 200, 300, 400, 500.
(4)
{\bf Gauss-b} (i.e., Gauss data, different bucket size),
containing the
four data sets (Gauss-rand, b-var($b$)), for $b$=100, 200, 500,
1000.
(5)
{\bf Zipf-z} (i.e., Zipf data, different skew), containing the three
data sets (Zipf-cusp\_max($z$,1.0), p(20000,400,200)), for $z$=0.5,
1.0, 1.5. Recall that p(20000,400,200) denotes the population characterized
by
$c=20000$, $b=400$, $t=200$.
997:
998:
999:
Each class of data sets is designed for studying the dependence of
the accuracy of the various methods on a different parameter
(parameter $t$, measuring the density of the bucket, parameter $b$,
measuring the size of the bucket, and parameter $z$, measuring the
data skew). For each data set, 1000 different samples obtained by
permutation
of frequencies were generated and tested, in order to give
statistical significance to the experiments.
1008:
1009:
1010:
1011: \noindent {\bf Query set and error metrics:} We perform all the
1012: queries $S[d]$, for all $1 \leq d < b$. We measure the error of
1013: approximation made by the various estimation techniques on the
1014: above query set by using both:
1015: \begin{itemize}
1016: \item
the {\em average} of the {\em relative
error} $\frac{1}{b-1}\sum_{d=1}^{b-1}e_d^{rel}$,
where $e_d^{rel}$ is the {\em relative error} of the query with
range $d$, i.e., $e_d^{rel}=\frac{\vert{S[d]-
\widetilde{S}[d]}\vert}{S[d]}$, and
1022:
1023: \item
1024: the {\em normalized absolute error}, that is the ratio between the average
1025: absolute error
1026: and the overall sum of the frequencies in the bucket, i.e.
1027: $\sum_{d=1}^{b-1}\frac{\vert{S[d]- \widetilde{S}[d]}\vert}{c \cdot b}$
1028: \end{itemize}
1029: where $\widetilde{S}[d]$ is the value of $S[d]$ estimated by the
1030: technique at hand.
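
The two metrics can be computed as in the following sketch. The function names are hypothetical; the lists \texttt{true\_sums} and \texttt{est\_sums} are assumed to hold $S[d]$ and $\widetilde{S}[d]$ for $d=1,\ldots,b-1$, and every exact value $S[d]$ is assumed to be positive, so that the relative error is well defined.

```python
def average_relative_error(true_sums, est_sums):
    """Average over d = 1..b-1 of |S[d] - S~[d]| / S[d]."""
    n = len(true_sums)                      # n = b - 1 queries
    return sum(abs(s - e) / s for s, e in zip(true_sums, est_sums)) / n


def normalized_absolute_error(true_sums, est_sums, c, b):
    """Sum over d = 1..b-1 of |S[d] - S~[d]|, divided by c * b."""
    return sum(abs(s - e) for s, e in zip(true_sums, est_sums)) / (c * b)


# Toy bucket with b = 4 and overall sum c = 40 (illustrative values only).
exact = [10, 20, 40]
approx = [10, 25, 30]
rel = average_relative_error(exact, approx)
norm = normalized_absolute_error(exact, approx, c=40, b=4)
```

The first metric weighs each query by the size of its exact answer, while the second one relates the absolute deviation to the overall content of the bucket.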
1031:
1032:
1033: \subsubsection{Results of Experiments and Discussion.}
1034:
In this section we give a qualitative discussion of the approximation
error
of the considered methods, excluding USA and 1-biased, for which we have
already
provided a theoretical analysis in Section \ref{sec-Analysis}.
First we consider the methods that work simply by splitting the original
bucket, namely 2s, 4s and 8s.
For all these methods,
the estimation error may arise from the following approximation sources:
1044:
1045: \begin{enumerate}
1046:
1047: \item
the linear interpolation (i.e., CVA), concerning the evaluation of the query
inside
the ``smallest'' sub-buckets (for instance, in the case of 4s, the
smallest sub-buckets
are the quarters of the bucket),
1053:
1054: \item
1055:
1056: the numeric approximation, in case sums are stored by less than 32 bits
1057: (note that only 2s is not affected by this error).
1058:
1059: \end{enumerate}
We refer to the above components
of the approximation error as errors of type 1 and type 2, respectively.
1062:
1063:
1064: \subsubsection*{Relative error vs data density.}
1065:
1066: Concerning error of type 1, what we expect is that, for all methods, it
1067: increases as
1068: data sparsity increases.
1069: Indeed, in case of sparse data, the sum tends to concentrate in a few
1070: points,
1071: and this reduces the suitability of linear interpolation to approximate
1072: the frequency distribution.
1073: Moreover, we expect that such a component of the error
1074: decreases as splitting degree increases:
1075: for instance,
1076: in case of 8s, which splits the bucket into 8 parts,
1077: we expect more accuracy (in terms of the error of type 1) than
1078: the 2s method. The reason is that having smaller sub-buckets
1079: means applying linear interpolation to shorter
1080: (and, thus, better linearly-approximable) segments of
1081: the cumulative frequency distribution.
1082:
1083:
1084: About error of type 2 we expect that both (i) it increases
1085: as the splitting degree increases and (ii) it is independent of
1086: data sparsity.
1087: Claim (i) is explained by considering that increasing the splitting degree
1088: means reducing the number of bits used for representing the sum of
1089: sub-buckets.
1090: Claim (ii) is related to the numeric nature of the error.
1091:
The observations above show the existence of a trade-off between the need to
increase the splitting degree for improving CVA precision, on the one hand, and
the need to
use as many bits as possible for representing partial sums in the bucket,
on the other hand.
However, we expect such a trade-off to be more evident in case of high
splitting degree,
that is, when the error of type 2 is more relevant.
For instance, recall that the maximum absolute error of type 2 is
$\frac{c}{2^{k+1}}$,
where $k$ is the number of bits assigned to the smallest sub-buckets. Since
$k=4$ for 8s
and $k=8$ for 4s, the maximum absolute error of type 2 in
case $c=20000$
is 625 for 8s (i.e., about 3\% of $c$), while it is about 39 (i.e., a negligible
percentage of $c$) for 4s.
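
The figures quoted above follow directly from the bound $\frac{c}{2^{k+1}}$, as the short sketch below shows. The function name \texttt{max\_scaling\_error} is hypothetical and the code only evaluates the bound; it does not model the encoding of the sums.

```python
def max_scaling_error(c, k):
    """Maximum absolute error of type 2 for a sum scaled w.r.t. c and stored in k bits."""
    return c / 2 ** (k + 1)


c = 20000
error_8s = max_scaling_error(c, 4)   # 8s stores each sub-bucket sum in 4 bits
error_4s = max_scaling_error(c, 8)   # 4s stores each sub-bucket sum in 8 bits
share_8s = error_8s / c              # fraction of c lost in the worst case by 8s
share_4s = error_4s / c              # same fraction for 4s
```

Each additional bit halves the bound, so moving from 8 to 4 bits per sub-bucket increases it by a factor of 16.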
1108:
1109: \begin{figure}[ht]
1110: \begin{center}
1111: \begin{tabular}{c}
1112: \epsfig{file=fig3a.eps,width=9cm} \\
1113: {\bf (a)}: Error for different values of $t$ \\
1114: \epsfig{file=fig3b.eps,width=9cm} \\
1115: {\bf (b)}: Error for different values of $b$
1116: \end{tabular}
1117: \end{center}
1118: \caption{Experimental Results for data sets Zipf}\label{fig1}
1119: \end{figure}
1120:
Experiments confirm the above considerations. By looking at the graphs of
Figure \ref{fig1}.(a) we may observe that for 2s and 4s the error
decreases as the data density increases. On the contrary, for
8s, the error is quasi-constant (slightly increasing) in case of
Zipf distributions, while it is slightly decreasing (but much less
quickly than for 4s) in case of Gauss distribution (see Figure
\ref{fig2}.(a)). Concerning the comparison between 2s, 4s and 8s,
we may observe in Figure \ref{fig1}.(a) that for low values of
data density, as expected, the accuracy of 8s is higher than that of 4s and,
in turn, the accuracy of 4s is higher than that of 2s. But, as observed above,
for increasing data density, the trends of 4s and 8s are affected, to a
different extent, by the error of type 2. This
appears quite evident in Figure \ref{fig1}.(a), where we may note
that 8s becomes worse than 4s from about 210 non-null elements on,
and that the improving trend of 2s is considerably faster than that of the
other methods (since 2s does not suffer from the error of type 2).
1137:
We observe that USA
gives better estimations than 1b on Zipf data (see Figure
\ref{fig1}.(a)). This suggests that the
assumption made by USA is appropriate for particular
distributions of frequencies and spreads, like those of the data sets
Zipf-t. Results obtained on data sets distributed according to a
Gauss distribution confirm this claim: the accuracy of USA
becomes the worst when the data sets have a random distribution, as
happens for Gauss-t (see Figure \ref{fig2}.(a)).
1149:
Concerning 1b, we may observe that the behaviours of 1b
and 2s are similar. As expected, exploiting the
information that the bucket is 1-biased does not give a
significant contribution to the accuracy of the estimation.
Indeed, the knowledge of the position of just one element in the
bucket does not, in general, add appreciable information.
1156:
1157:
Consider now the usage of the tree indices 3LT and 4LT. Recall
that 3LT has the same splitting degree as 4s, since both methods
divide the bucket into 4 sub-buckets. Possible differences in terms
of accuracy between the two methods may arise from the error of type 2.
Indeed, the tree-like organization of the indices allows us to
represent the sum inside a given sub-bucket corresponding to a
node of the tree as a fraction of the sum contained in the parent
node, instead of the entire sum (as happens for the ``flat''
methods).
Thus, we expect tree indices to produce smaller
errors of type 2. However, as previously noted, 4s produces a
negligible percentage of error of type 2. This explains why
3LT and 4s basically present
the same error (their lines in the graphs almost entirely
overlap).
1173:
4LT has the same splitting degree as 8s (since both methods divide the
bucket into 8 sub-buckets). As a consequence, since the error of
type 2 of 8s is appreciable (as already discussed), we may expect
improvements from the usage of 4LT. This is indeed what results from the
experiments. 4LT has the best performance: it shows only the benefits
deriving from the increase of data density (producing the
reduction of the error of type 1), with no appreciable increase of the
error of type 2. 4LT, thanks to the tree-like organization of the
sums, seems to solve the trade-off between increasing the splitting
degree (for improving CVA precision) and controlling the numeric error
arising from the usage of a reduced number of bits for
representing sums.
1186:
1187:
1188: \subsubsection*{Relative error vs bucket size and
1189: data skew.}
1190:
First consider populations b-var. Recall that for such data sets
we have kept the data density constant at 20\%. Thus,
increasing the bucket size also means increasing the number of non-null
elements. While, as for the previous experiments, the error of type 2 is
independent of the bucket size (and all the above considerations
about the relationship between error of type 2, splitting degree
and number of bits per smallest sub-bucket are still valid), we
expect the CVA precision to be affected by the variation of the bucket
size. Indeed, on the one hand the CVA precision decreases as the
bucket size increases, since, for a larger bucket, linear
interpolation is applied to a larger segment of the cumulative
frequency. But, on the other hand, increasing the bucket size means
increasing the number of non-null elements (keeping constant the overall sum),
and this means reducing the probability that the sum is
concentrated into a few peaks. Thus, whenever the cumulative
frequency is smooth, linear interpolation tends to give better
results. Depending on the data distribution, we may observe either
that the two opposite components compensate each other or that one
prevails over the other. Indeed, experiments with Zipf data,
corresponding to Figure \ref{fig1}.(b), show that the methods have a
quasi-constant trend (with a slight prevalence of the first
component), while experiments conducted on Gauss data,
corresponding to Figure \ref{fig2}.(b), show a net prevalence of the
second component (all the methods present a decreasing trend for
increasing bucket size). Such experiments do not give new
information about the comparison between the considered methods,
substantially confirming the previous results. Again, 4LT has the
best performance.
1219:
1220:
1221: \begin{figure}[ht]
1222: \begin{center}
1223: \begin{tabular}{c}
1224: \epsfig{file=fig4a.eps,width=9cm} \\
1225: {\bf (a)}: Data sets Gauss-t: error for different values of $t$ \\
1226: \epsfig{file=fig4b.eps,width=9cm} \\
{\bf (b)}: Data sets Gauss-b: error for different values of $b$
1228: \end{tabular}
1229: \end{center}
1230: \caption{Experimental Results for data sets Gauss} \label{fig2}
1231: \end{figure}
1232:
1233:
1234:
1235: Results of experiments conducted on the class of data sets Zipf-z,
1236: for measuring the dependence of the accuracy of methods on the
1237: data skew are reported in Figure \ref{fig3}. We note that all
1238: methods become worse as $z$ increases (as it can be
1239: intuitively expected).
1240: The behaviours of $1b$ and 2s are similar, while 4LT shows the best
1241: performance.
1242:
1243:
1244: As a final remark we may summarize the comparison between the
1245: considered methods concluding that the worst method is always 2s,
1246: followed by 8s and then by 3LT and 4s for sparse data. On the
1247: contrary, for dense data 3LT and 4s show better performance than
8s. Observe that 4s and 3LT have basically the same accuracy. The
best method is definitely 4LT.
1250:
1251:
1252:
1253:
1254: \begin{figure}[h]
1255: \begin{center}
1256: \begin{tabular}{c}
1257: \epsfig{file=fig5.eps,width=9cm}
1258: \end{tabular}
1259: \end{center}
1260: \caption{Data sets Zipf-z: dependence on data skew} \label{fig3}
1261: \end{figure}
1262:
1263:
1264:
1265:
1266:
1267: \section{Applying the 4LT Index to the Entire Histogram}\label{sec-Improved}
1268:
The analysis described in the previous sections suggests applying
the technique of the 4-level tree index to a whole histogram in
order to improve its accuracy in the approximation of the
underlying frequency set.
We stress that
whether such an addition is really convenient
is not obvious: observe that 4LT buckets use 32 bits more than CVA ones and, hence, for a fixed storage
space, allow a smaller number of buckets.
In this section we show how to combine
the 4LT technique with classical methods for constructing
histograms, and we perform a large number of experiments to measure
the actual improvement given by the usage of 4LT.
The advantage of the 4LT index is shown to be relevant also when
it is compared with buckets using CVA,
that is, when the storage space required by
4LT is larger than that of the original method.
Moreover, the 4LT index shows very good performance
when it is combined with a very
simple method for constructing histograms, called EquiSplit,
which consists in partitioning the attribute domain into equal-size
buckets.
Let us start with a quick overview of the most relevant methods
proposed so far for the construction of histograms.
1292:
1293: \subsection{Methods for Constructing Histograms}
1294:
Besides the method used for approximating frequencies inside
buckets, the capability of a histogram to accurately approximate
the underlying frequency set strongly depends on the way such a
set is partitioned into buckets. Typically, the criterion driving the
construction of a histogram is the minimization of the error in
the reconstruction of the original (cumulative) frequency set from
the histogram. The partition rules proposed in
\protect\cite{Poosala96Improved,Jagadish98Optimal} try to achieve
this goal. Among those, we sketch the description of two
well-known approaches: {\em MaxDiff} and {\em V-Optimal} (see
\protect\cite{Poosala96Improved,Poo97} for an exhaustive
taxonomy). Note that these methods are defined for 2-histograms
but are in practice mainly used for 1-histograms to minimize
storage consumption.
1309:
1310:
1311: \noindent {\bf MaxDiff.} A MaxDiff histogram
1312: \protect\cite{Cri81,Poosala96Improved} of size $h$ is obtained by
1313: putting a boundary between two adjacent attribute values $v_i$ and
1314: $v_{i+1}$ of $V$ if the difference between $f(v_{i+1}) \cdot
1315: \sigma_{i+1}$ and $f(v_{i}) \cdot \sigma_{i}$ is one of the $h-1$
1316: largest such differences (where $\sigma_i$ denotes the spread of
$v_i$). The product $f(v_{i}) \cdot \sigma_{i}$ is called the {\em
area} of $v_i$.
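
The MaxDiff rule can be sketched as follows. The code takes the areas $f(v_i) \cdot \sigma_i$ as input, so that no assumption on how spreads are computed is needed. Two further choices are assumptions made only for this illustration: differences between adjacent areas are compared in absolute value, and ties are broken in favour of the leftmost position. The function names are hypothetical.

```python
def maxdiff_boundaries(areas, h):
    """Return the positions i such that a boundary is put between i and i+1.

    areas -- list of areas f(v_i) * sigma_i, in attribute-value order
    h     -- number of buckets of the histogram
    """
    diffs = [(abs(areas[i + 1] - areas[i]), i) for i in range(len(areas) - 1)]
    # Keep the h-1 largest differences (ties: leftmost first, an assumption).
    largest = sorted(diffs, key=lambda pair: (-pair[0], pair[1]))[: h - 1]
    return sorted(i for _, i in largest)


def split_into_buckets(items, boundaries):
    """Cut `items` after each boundary position."""
    buckets, start = [], 0
    for i in boundaries:
        buckets.append(items[start : i + 1])
        start = i + 1
    buckets.append(items[start:])
    return buckets


areas = [1, 2, 10, 11, 3, 4]
cuts = maxdiff_boundaries(areas, 3)
buckets = split_into_buckets(areas, cuts)
```

On this toy input the two largest jumps separate the values with large areas from the others, so that each bucket groups values with similar areas.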
1319:
1320:
\noindent {\bf V-Optimal.} A V-Optimal histogram
\protect\cite{Poosala96Improved,Jagadish98Optimal} gives very good
performance. It is obtained by selecting the boundaries of each
bucket, $inf_i$ and $sup_i$, $1 \leq i \leq n$, so that
$\sum_{i=1}^n SSE_i$ is minimal, where $SSE_i =
\sum_{j=inf_i}^{sup_i} (f(j)-avg_i)^2$ and $avg_i$ is the
average frequency in the $i$-th bucket, that is, the cumulative
frequency in the whole bucket divided by the size $sup_i -
inf_i+1$.
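
One way to find boundaries minimizing $\sum_{i=1}^n SSE_i$ is a straightforward dynamic program over prefixes of the frequency vector, sketched below. It is meant only as an illustration of the optimization criterion and is not necessarily the construction adopted in the cited works; the function name \texttt{v\_optimal} is hypothetical, positions are 0-based, and the running time is cubic in the number of frequencies.

```python
def v_optimal(freqs, n_buckets):
    """Return (minimal total SSE, list of inclusive (inf, sup) pairs)."""
    m = len(freqs)
    # Prefix sums of f and f^2 give the SSE of any range in constant time.
    ps, ps2 = [0.0], [0.0]
    for f in freqs:
        ps.append(ps[-1] + f)
        ps2.append(ps2[-1] + f * f)

    def sse(lo, hi):
        # SSE of freqs[lo:hi] (hi exclusive): sum f^2 - (sum f)^2 / size
        s = ps[hi] - ps[lo]
        return (ps2[hi] - ps2[lo]) - s * s / (hi - lo)

    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n_buckets + 1)]
    cut = [[0] * (m + 1) for _ in range(n_buckets + 1)]
    best[0][0] = 0.0
    for k in range(1, n_buckets + 1):
        for j in range(k, m + 1):
            for i in range(k - 1, j):
                cost = best[k - 1][i] + sse(i, j)
                if cost < best[k][j]:
                    best[k][j], cut[k][j] = cost, i

    # Walk back through the stored cuts to recover the buckets.
    bounds, j = [], m
    for k in range(n_buckets, 0, -1):
        i = cut[k][j]
        bounds.append((i, j - 1))
        j = i
    return best[n_buckets][m], bounds[::-1]


total, bounds = v_optimal([1, 1, 1, 9, 9, 9], 2)
```

On the toy input the optimal partition places the boundary between the third and the fourth frequency, where each bucket has zero variance.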
1330:
We now propose to combine both methods, MaxDiff and V-Optimal, with
the 4LT index in order to have an approximate representation of
frequency distributions inside the buckets. We shall compare the
so-revised methods with the original ones with CVA estimation for
the same storage consumption. The results will show that the 4LT
index considerably increases the estimation accuracy of both methods.
The additional estimation power carried by the 4LT index even
enables a very simple method like the one described below to
produce very accurate estimations.
1340:
\noindent {\bf EquiSplit.} The attribute domain is split into $k$
buckets of approximately the same size $b=\lceil m/k \rceil$. In
this way, as the boundaries of all buckets can be easily
determined from the value $b$, we only need to store one value for
each bucket: the sum of all its frequencies. This method was
first introduced in \protect\cite{Cri81} and, as the experimental
analysis will confirm, it performs very well on
low-skew data, while its performance gets worse in case of high
skew.
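
A minimal sketch of EquiSplit is given below. It assumes that the frequency vector has one entry per element of the attribute domain, so that its length plays the role of $m$; the function name \texttt{equisplit} is hypothetical. Note that, with $b=\lceil m/k \rceil$, the last bucket may be shorter than the others and fewer than $k$ buckets may be produced.

```python
def equisplit(freqs, k):
    """Split the domain into buckets of size ceil(m / k) and store only their sums."""
    m = len(freqs)
    b = -(-m // k)                      # ceil(m / k) with integer arithmetic
    sums = [sum(freqs[start : start + b]) for start in range(0, m, b)]
    return b, sums


# Toy domain of m = 10 values split into k = 3 buckets.
b, sums = equisplit(list(range(1, 11)), 3)
```

Since the $j$-th bucket always starts at position $j \cdot b$ (counting from 0), the value $b$ and the list of sums are enough to reconstruct all bucket boundaries.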
1350:
1351:
1352:
1353: \subsection{Experiments on Histograms}\label{exp}
1354:
1355: In this section we shall conduct several experiments both on
1356: synthetic and real-life data in order to compare the effectiveness of
1357: several histograms in estimating range query size.
1358:
1359: \subsubsection*{Experiments on Synthetic Data.}
1360: First we present the experiments performed on synthetic data.
1361: Below we describe data sets, error metrics and the query set
1362: considered in our experiments.
1363:
\noindent {\bf Available Storage:} Under CVA each bucket
stores only two integers, while with the 4LT index each bucket
needs three integers. Assuming that an integer takes 32 bits and
that $K$ bits are available for the
whole histogram, both MaxDiff and V-Optimal
under CVA produce $\lfloor \frac{K}{64} \rfloor$ buckets, while
both of them with 4LT indices only produce $\lfloor \frac{K}{96}
\rfloor$ buckets. On the other hand, a bucket of EquiSplit just
needs one integer (the sum of all the frequencies), while a bucket of
EquiSplit-4LT needs two integers. Thus, for the same $K$
bits of total storage space, EquiSplit with CVA produces
$\lfloor \frac{K}{32} \rfloor$ buckets and EquiSplit with 4LT indices
produces $\lfloor \frac{K}{64} \rfloor$ buckets, the same number as MaxDiff with CVA.
1377:
For our experiments, we use a storage space of $42$
four-byte numbers, in line with the experiments reported in
\protect\cite{Poosala96Improved,Jagadish98Optimal}, which we
replicate. From the above considerations, it follows
that MaxDiff with CVA, V-Optimal with CVA, and EquiSplit
with 4LT indices produce 21 buckets, EquiSplit with CVA produces
42 buckets, and both MaxDiff and V-Optimal with 4LT indices only
produce 14 buckets.
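
These counts can be checked directly from the bucket formulas given for the available storage. A storage space of $42$ four-byte numbers corresponds to $K = 42 \cdot 32 = 1344$ bits, hence
\[
\left\lfloor \frac{1344}{64} \right\rfloor = 21, \qquad
\left\lfloor \frac{1344}{32} \right\rfloor = 42, \qquad
\left\lfloor \frac{1344}{96} \right\rfloor = 14 .
\]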
1386:
1387:
1388:
\noindent {\bf Data Distributions:} A data distribution is
characterized by a distribution for frequencies and a distribution
for spreads. The frequency set and the value set are generated
independently; then frequencies are randomly assigned to the
elements of the value set. We consider 5 data distributions: ({\bf
1}) $D_1=$ {\em Zipf-$cusp\_max$(0.5,1.0)}. ({\bf 2}) $D_2=$ {\em
Zipf-zrand(0.5,1.0)}: Frequencies are distributed according to a
Zipf distribution with the $z$ parameter equal to $0.5$. Spreads
follow a $ZRand$ distribution \protect\cite{Poo97} with $z$
parameter equal to $1.0$ (i.e., spreads following a Zipf
distribution with $z$ parameter equal to $1.0$ are randomly
assigned to attribute values). ({\bf 3}) $D_3=$ {\em Gauss-rand}:
Frequencies are distributed according to a Gauss distribution with
standard deviation $1.0$. Spreads are randomly distributed. ({\bf
4}) $D_4=$ {\em Zipf-$cusp\_max$(1.5,1.0)}. ({\bf 5}) $D_5=$ {\em
Zipf-$cusp\_max$(3.0,1.0)}.
1405:
1406:
\noindent {\bf Histogram Populations:} A population is
characterized by the values of three parameters, namely $T$, $D$
and $t$, and represents the set of histograms storing a relation of
cardinality $T$, attribute domain size $D$ and value set size $t$
(i.e., number of non-null attribute values).
1412:
1413: \noindent
1414: {\em Population $P_1$.}
1415: This population is characterized by the following values for the
1416: parameters: $D=4100$, $t=500$ and $T=100000$.
1417:
1418: \noindent
1419: {\em Population $P_2$.}
1420: This population is characterized by the following values for the
1421: parameters: $D=4100$, $t=500$ and $T=500000$.
1422:
1423: \noindent
1424: {\em Population $P_3$.}
1425: This population is characterized by the following values for the
1426: parameters: $D=4100$, $t=1000$ and $T=500000$.
1427:
1428:
\noindent
{\bf Data Sets:} Similarly to the experiments inside
buckets, each data set included in the experiments is obtained by
generating, under one of the above described data distributions,
$10$ histograms belonging to one of the populations specified
above. We consider the 15 data sets that are generated by
combining all 5 data distributions with all 3 populations.\\
All queries belonging to the query set described below are evaluated over
the histograms of each data set.
1438:
\noindent
{\bf Query set and error metrics:} In our experiments, we use the
query set $\{X\leq d :d\in \U \}$ (recall that $X$ is the
histogram attribute and $\U$ is its domain) for evaluating the
effectiveness of the various methods. We measure the
approximation error made by histograms on the above query set by using
the {\em average relative error}
$\frac{1}{Q}\sum_{i=1}^Q e_i^{rel}$,
where $Q$ is the cardinality of the query set and $e_i^{rel}$ is
the {\em relative error} of the $i$-th query, i.e.,
$e_i^{rel}=\frac{\vert S_i-\widetilde{S}_i \vert}{S_i}$,
where $S_i$ and $\widetilde{S}_i$ are the actual answer and the
estimated answer, respectively, of the $i$-th query of the query set.
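
As a small worked example of this metric (the numbers are made up for illustration and do not come from our experiments), consider $Q=3$ queries with actual answers $S_1=100$, $S_2=200$, $S_3=400$ and estimated answers $\widetilde{S}_1=90$, $\widetilde{S}_2=210$, $\widetilde{S}_3=400$. Then
\[
e_1^{rel} = \frac{\vert 100-90 \vert}{100} = 0.10, \qquad
e_2^{rel} = \frac{\vert 200-210 \vert}{200} = 0.05, \qquad
e_3^{rel} = \frac{\vert 400-400 \vert}{400} = 0,
\]
and the average relative error is
\[
\frac{1}{3}\left(0.10 + 0.05 + 0\right) = 0.05 .
\]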
1452:
1453:
\subsubsection{Results of the Experiments.} Tables
\ref{table-1}, \ref{table-2} and \ref{table-3} report the results of
the experiments conducted on all data sets. We denote the
methods MaxDiff, V-Optimal and EquiSplit with CVA by MD, VO and
ES, respectively; the same methods with 4LT indices are denoted by
MD\_4LT, VO\_4LT and ES\_4LT.
1460:
1461:
1462: \begin{table}
1463: \begin{center}
1464:
1465: \begin{tabular}[h]{|c|c|c|c|c|c|}
1466: \hline\hline
1467:
1468: $method/distr.$ & $D_1$ & $D_2$ & $D_3$ & $D_4$ & $D_5$
1469:
1470:
1471:
1472: \\ \hline
1473:
1474: $ES$& $0.79$& $1.69$& $10.61$& $3.89$& $57.63$
1475:
1476: \\ \hline
1477:
1478: $ES\_4LT$& $0.29$& $0.84$& $2.01$& $2.89$& $29.63$
1479:
1480: \\ \hline
1481:
1482: $MD$& $4.29$& $19.37$& $11.65$& $7.02$& $31.46$
1483:
1484: \\ \hline
1485:
1486: $MD\_4LT$& $0.70$& $1.57$& $3.14$& $1.92$& $4.39$
1487:
1488: \\ \hline
1489:
1490: $VO$& $1.43$& $5.55$& $10.6$& $5.16$& $21.57$
1491:
1492: \\ \hline
1493:
1494: $VO\_4LT$& $0.29$& $1.33$& $2.32$& $1.62$& $3.15$
1495:
1496: \\ \hline\hline
1497:
1498: \end{tabular}
1499:
1500: \end{center}
1501:
\caption{Population $P_1$: average relative error of the various methods.}
1503: \label{table-1}
1504: \end{table}
1505:
1506:
1507: \begin{table}
1508: \begin{center}
1509:
\begin{tabular}[h]{|c|c|c|c|c|c|}
1511: \hline\hline
1512:
1513: $method/distr.$ & $D_1$ & $D_2$ & $D_3$ & $D_4$ & $D_5$
1514:
1515:
1516:
1517: \\ \hline
1518:
1519: $ES$& $0.76$& $1.78$& $4.83$& $3.63$& $59.74$
1520:
1521: \\ \hline
1522:
1523: $ES\_4LT$& $0.28$& $0.84$& $6.40$& $1.40$& $31.12$
1524:
1525: \\ \hline
1526:
1527: $MD$& $5.79$& $16.04$& $6.65$& $13.56$& $33.51$
1528:
1529: \\ \hline
1530:
1531: $MD\_4LT$& $0.80$& $1.60$& $2.32$& $2.36$& $4.87$
1532:
1533: \\ \hline
1534:
1535: $VO$& $1.68$& $5.96$& $6.16$& $7.25$& $18.10$
1536:
1537: \\ \hline
1538:
1539: $VO\_4LT$& $0.32$& $1.41$& $4.85$& $1.53$& $3.12$
1540:
1541:
1542: \\ \hline\hline
1543:
1544: \end{tabular}
1545:
1546: \end{center}
1547:
\caption{Population $P_2$: average relative error of the various methods.}
1549: \label{table-2}
1550: \end{table}
1551:
1552: \begin{table}
1553: \begin{center}
1554:
\begin{tabular}[h]{|c|c|c|c|c|c|}
1556: \hline\hline
1557:
1558: $method/distr.$ & $D_1$ & $D_2$ & $D_3$ & $D_4$ & $D_5$
1559:
1560:
1561: \\ \hline
1562:
1563: $ES$& $0.47$& $0.87$& $2.31$& $7.54$& $66.41$
1564:
1565: \\ \hline
1566:
1567: $ES\_4LT$& $0.27$& $0.35$& $1.14$& $3.59$& $25.01$
1568:
1569: \\ \hline
1570:
1571: $MD$& $8.37$& $2.89$& $3.30$& $3.46$& $25.01$
1572:
1573: \\ \hline
1574:
1575: $MD\_4LT$& $0.70$& $0.59$& $1.33$& $1.79$& $2.02$
1576:
1577: \\ \hline
1578:
1579: $VO$& $1.77$& $2.16$& $2.82$& $3.37$& $7.78$
1580:
1581: \\ \hline
1582:
1583: $VO\_4LT$& $0.32$& $0.56$& $1.24$& $1.68$& $1.82$
1584:
1585: \\ \hline\hline
1586:
1587: \end{tabular}
1588:
1589: \end{center}
1590:
\caption{Population $P_3$: average relative error of the various methods.}
1592: \label{table-3}
1593: \end{table}
1594:
1595:
The relative behavior of the various methods is
similar for the three populations. The experiments confirm the good
performance of the MaxDiff method and, particularly, of V-Optimal,
but they also show that 4LT adds significant
benefits to both methods. Indeed, MD\_4LT and VO\_4LT show very low errors.
EquiSplit and EquiSplit-4LT also perform well. However, as shown
in Figure \ref{fig-5}(a), where the dependence of the estimation
error on data skew is plotted, these methods quickly get worse for
high data skew. Indeed, in such cases, the benefit given by the
higher number of buckets is lost because of the high skew inside
buckets. In case of high skew, partition rules play a central
role, and the naive approach of EquiSplit is not suitable.
Interestingly, we observe that the improvement of MaxDiff and
V-Optimal brought by 4LT indices is significant also for high
skew, which supports the effectiveness of such indices. In Figure \ref{fig-5}(b)
we show the dependence of the accuracy of the methods on the amount of
space.
There, we consider the data distribution $D_4$ and the population
$P_1$ and generate 10 histograms belonging to $P_1$ according to
$D_4$ for different amounts of space. The aim of this experiment
is to study the behavior of the various methods as the compression factor increases.
Clearly, when the available amount of space
increases, all methods behave well. The differences are more
significant for values corresponding to high compression. Methods
using 4LT are the best. This can be intuitively explained by
considering that in case of large buckets the role of the
approximation technique inside buckets becomes more important than
the rules followed for constructing buckets.
1624:
1625:
1626: \begin{figure}[h]
1627: \begin{center}
1628: \begin{tabular}{c@{\hspace{0.6cm}}c}
1629: \epsfig{file=fig6a.eps,width=9cm} \\
1630: {\bf (a)}: Dependence of the accuracy on the data skew \\
1631: \epsfig{file=fig6b.eps,width=9cm} \\
1632: {\bf (b)}: Dependence of the accuracy on the representation \\
1633: size (i.e., number of stored 4-byte integers)
1634: \end{tabular}
1635: \end{center}
1636: \caption{Experimental Results}
1637: \label{fig-5}
1638: \end{figure}
1639:
1640:
1641:
1642: \subsubsection*{Experiments on Real-Life Data.}
1643: We have performed further experiments using real-life data. We
1644: have considered two data sets (that we denote by Data Set A and
1645: Data Set B) obtained from the {\em 1997 U.S. Census Statistics}
1646: \protect\cite{Census}, by choosing two attributes of the table
1647: {\em Special District Governments}, having the following
1648: characteristics:
1649:
1650: \noindent
1651: {\bf Data Set A:}
1652: attribute name: {\em Type Code},
1653: domain size: $D= 998$,
1654: number of non-null attribute values: $t = 787$,
1655: cardinality: $T=34683$.
1656:
1657: \noindent
1658: {\bf Data Set B:}
1659: attribute name: {\em Function Code},
1660: domain size: $D= 99$,
1661: number of non-null attribute values: $t = 32$,
1662: cardinality: $T=34683$.
1663:
We use for each histogram the same amount of
storage space, that is, $21$ four-byte numbers.
The query set and the error metrics are the same as those used for the experiments
on synthetic data.
1668:
1669:
1670: \begin{table}
1671:
1672: \begin{center}
1673:
1674: \begin{tabular}{|c|c|c|}
1675: \hline
1676: method & data set A & data set B \\
1677: \hline
1678: $ES$ & 4.32 & 7.02 \\
1679: \hline
1680: $ES\_4LT$ & 0.97 & 3.59 \\
1681: \hline
1682: $MD$ & 11.30 & 22.82 \\
1683: \hline
1684: $MD\_4LT$ & 1.63 & 1.25 \\
1685: \hline
1686: $VO$ & 4.49 & 17.19 \\
1687: \hline
1688: $VO\_4LT$ & 1.86 & 3.05 \\
1689: \hline
1690:
1691:
1692:
1693: \end{tabular}
\caption{Errors obtained on real data.}
\label{realtable}

\end{center}
\end{table}
1698:
1699:
1700:
\noindent {\bf Results of the Experiments.} As shown in Table 4,
the experiments on real data confirm the results obtained with
synthetic data. We note that 4LT adds significant benefits to MaxDiff and V-Optimal,
and that both EquiSplit and EquiSplit-4LT perform well.
Not surprisingly, for Data Set A, EquiSplit-4LT produces the
smallest error. This can be explained
by considering that the data of this set are rather uniform and, in this case, as
discussed previously, the cheapest technique (in terms of storage space) gives the best
performance. In other words, the extra storage space required by the more sophisticated
techniques for recording bucket boundaries does not pay off, due to the
simple data distribution.
1712:
1713:
1714:
1715:
1716:
1717:
1718:
1719: \section{Conclusions}\label{sec-Conclusion}
1720:
In this paper we have presented a technique for improving the frequency estimation within
each bucket of a histogram. This technique goes beyond the simple methods used in the
literature, that is, the continuous value assumption and the uniform spread assumption.
Our method is based on the addition to each bucket of a 32-bit data item, organized into a 4-level
tree index (4LT, for short), that stores, in a bit-saving approximate form, a number of hierarchical range queries
internal to the bucket. We have shown both theoretically and experimentally that such
additional information effectively allows us to better estimate range queries inside buckets.
Interestingly, the usage of 4LT on top of histograms built through well-known techniques like
MaxDiff and V-Optimal outperforms such histograms in terms of accuracy.
This claim is supported in the paper by a large number of experiments conducted on both synthetic
and real-life data, where classical histograms combined with 4LT are compared
with the standard versions (i.e., with no 4LT) under several
different data distributions and for the same consumed storage space.
It turns out that the price we have to pay
in terms of storage space, by consuming 32 more bits per bucket
w.r.t. CVA-based histograms, is outweighed by the benefits given
by the improved precision in estimating
queries inside buckets.
Thus, the main conclusion we draw is that the 4LT index may represent a general technique
that can be combined with any bucket-based histogram for significantly
improving its accuracy.
1742:
1743: {\footnotesize
1744: \bibliography{isto}
1745:
1746: \bibliographystyle{plain}
1747: }
1748:
1749: \end{document}
1750: