
1: \NeedsTeXFormat{LaTeX2e}[1995/12/01]

2:

3: \documentclass[global]{svjour}

4: \usepackage{epsfig}

5:

6:

7: \def\U{{{\cal U}}}

8: \def\F{{{\cal F}}}

9: \def\S{{{\cal S}}}

10: \newcommand{\eex}{\hspace*{\fill}$~_\triangle$}

11:

12: \title{Enhancing Histograms by Tree-Like Bucket Indices\thanks{An

13:     abridged version of this paper appeared in the Proceedings

14:     of the International Conference on Data Engineering (ICDE 2002),

15:     IEEE Computer Society 2002, ISBN 0-7695-1531-2 \protect\cite{Buccafurri02Improving}}

16: }

17:

18: \author{Francesco Buccafurri \and\inst{1}

19:     Gianluca Lax \and\inst{1}

20:     Domenico Sacc\`a \and\inst{2}

21:     Luigi Pontieri \and\inst{2}

22:     Domenico Rosaci\inst{1}

23: }

24:

25: \institute{DIMET Dept., University ``Mediterranea'' of Reggio Calabria, Italy \\

26:     \email{\{bucca,lax,domenico.rosaci\}@unirc.it}

27:     \and

28:     DEIS Dept., University of Calabria, \& ICAR-CNR, Rende, Italy \\

29:     \email{pontieri@icar.cnr.it, sacca@unical.it}

30: }

31:

32: %-------------------------------------------------------------------------

33: \begin{document}

34: \maketitle

35: \begin{abstract}

36: Histograms are used to summarize the contents of relations into a

37: number of buckets for the estimation of query result sizes.

38: Several techniques (e.g., MaxDiff and V-Optimal) have been

39: proposed in the past for determining bucket boundaries which

40: provide accurate estimations. However, while search strategies for

41: optimal bucket boundaries are rather sophisticated, not much

42: attention has been paid to estimating queries inside buckets and

43: all of the above techniques adopt naive methods for such an

44: estimation. This paper focuses on the problem of improving the

45: estimation inside a bucket once its boundaries have been fixed.

46: The proposed technique is based on the addition, to each bucket,

47: of 32-bit additional information (organized into a 4-level tree

48: index), storing approximate cumulative frequencies at 7 internal

49: intervals of the bucket. Both theoretical analysis and

50: experimental results show that, among a number of alternative ways

51: to organize the additional information, the 4-level tree index

52: provides the best frequency estimation inside a bucket. The index

53: is later added to two well-known histograms, MaxDiff and

54: V-Optimal, obtaining the non-obvious result that despite the

55: spatial cost of 4LT which reduces the number of allowed buckets

56: once the storage space has been fixed, the original methods are

57: strongly improved in terms of accuracy.

58: \end{abstract}

59:

60: \keywords{histograms -- range query estimation -- approximate OLAP}

61:

62: \section{Introduction}

63: A {\em histogram} is a lossy compression technique used for

64: efficiently representing a relation. It is based on the partition

65: of one of the relation attributes into {\em buckets} and the

66: storage, for each of them, of a small amount of summary information in place

67: of the detailed one. Among others, some important examples of

68: application domains of histograms are the estimation of query

69: selectivity

70: \protect\cite{IoPo95,Jagadish98Optimal,Poosala96Improved,Ja*01,Wu03Using},

71: temporal databases, where histograms are used for improving the

72: join processing \protect\cite{Sitzmann00Improving}, statistical

73: databases, where histograms represent a method for approximating

74: probability distributions \protect\cite{Malvestuto93Universal}.

75: Recently, histograms have received renewed interest, mainly

76: because they can be effectively used for approximating query

77: answering in order to reduce the query response time in on-line

78: decision support systems and OLAP \protect\cite{Poosala99Approx},

79: as well as for the problem of reconstructing original data from

80: aggregate information \protect\cite{BuFuSa01} and, finally, in the

81: context of Data Streams

82: \protect\cite{Guha01Data,Babcock02Models,Datar02Maintaining,Guha02Histogramming}.

83:

84: For a given storage space reduction, the problem of determining

85: the best histogram is crucial. Indeed, different partitions lead

86: to dramatically different errors in reconstructing the original

87: data distribution, especially for skewed data. To better explain

88: the problem, consider a typical case of recovering original data

89: from a histogram: the evaluation of range queries. Think of a

90: histogram defined on the attribute $X$ of a relation $R$ as a set

91: of non-overlapping intervals of $X$ covering all values assumed by

92: $X$ in $R$. To each of these intervals, say $B$, the number of

93: occurrences (called {\em frequency}) in $R$, having the value of

94: $X$ belonging to the interval $B$, is associated (and included

95: into a data structure called {\em bucket}). A {\em range query},

96: defined on an interval $Q$ of $X$, evaluates the number of

97: occurrences in $R$ with value of $X$ in $Q$. Thus, buckets embed a

98: set of pre-computed disjoint range queries capable of covering the

99: whole active domain of $X$ in $R$ (by active we mean the

100: attribute values actually appearing in $R$). As a consequence, the

101: histogram does not give, in general, the possibility of evaluating

102: exactly a range query not corresponding to one of the pre-computed

103: embedded queries. In other words, while the contribution to the

104: answer coming from the sub-ranges coinciding with entire buckets

105: can be returned exactly, the contribution coming from the

106: sub-ranges which partially overlap buckets can be only estimated,

107: since the actual data distribution inside the buckets is not

108: available.

109:

110: It turns out that it is convenient to define the boundaries of

111: buckets in such a way that the estimation error of the

112: non-precomputed range queries is minimized (e.g., by avoiding that

113: large frequency differences arise inside a bucket). In other

114: words, among all possible sets of pre-computed range queries, we

115: find the set which guarantees the best estimation of the other

116: (non-precomputed) queries, once a technique for estimating such

117: queries is defined. This issue has been investigated for some

118: decades, and a large number of techniques for arranging histograms

119: have been proposed

120: \protect\cite{Cri81,Cri84,IoPo95,Jagadish98Optimal,Poosala96Improved,DonIoa00,Ja*01}.

121:

122: All these techniques adopt simple methods for estimating

123: non-precomputed queries (actually, their portions partially

124: overlapping buckets). The most significant approaches are the {\em

125: continuous value assumption} (often denoted in this paper by CVA)

126: \protect\cite{Sac79}, where the estimation is made by linear

127: interpolation on the whole domain of the bucket, and the {\em

128: uniform spread assumption} (denoted by USA)

129: \protect\cite{Poosala96Improved}, which assumes that values are

130: located at equal distance from each other so that the overall

131: frequency sum can be equally distributed among them.

132:

133:

134: An interesting problem is understanding whether, by exploiting

135: information typically contained in histogram buckets, and possibly

136: adding a small amount of summary information, the frequency estimation inside

137: buckets, and then, the histogram accuracy, can be improved. This

138: paper focuses on this problem. Starting from the consideration of

139: limits of CVA and USA studied in \protect\cite{BuFuSa01}, we

140: propose to use some additional storage space in order to describe

141: the distribution inside a bucket in an approximate yet very

142: effective way.

143:

144: The first step is studying how to use 32 additional bits per bucket in

145: order to maximize benefits in terms of accuracy. Our analysis

146: shows that the trivial technique of partitioning the bucket into 8

147: equal-size parts and encoding each corresponding sum by 4 bits,

148: leads to high scaling errors, since we need to represent each

149: sum as a fraction of the overall sum of the bucket. Our proposal

150: then relies on the idea of storing partial sums internal to the

151: bucket in a hierarchical fashion, using a tree-like index

152: (occupying 32 bits). This way, the sum contained in a given tree

153: node, can be represented as a fraction of the sum contained in the

154: parent node, which is a value (reasonably) smaller than the

155: overall sum of the bucket. It turns out that the encoding length

156: may decrease as the level of the tree increases. The benefits we

157: expect by applying this approach concern the scaling error. But a

158: crucial point is to decide how to arrange the tree, that is, how

159: far to go down in depth with the index. Of course, the higher the

160: resolution, the larger the number of embedded precomputed range

161: queries (internal to the buckets) is. Hence, we expect better

162: accuracy as the resolution increases. However, increasing

163: resolution reduces the number of bits available for encoding

164: nodes, and, thus, amplifies scaling errors. We study the above

165: trade-off by considering the two possible (from a practical point

166: of view) tree-indices with 32 bits, which we call 3LT and 4LT,

167: with depth 3 and 4, respectively. The analysis leads to the

168: conclusion that the 4LT-index represents the best solution.

169:

170: The next step is then understanding whether this improvement of

171: accuracy for the estimation inside buckets can really give

172: benefits in terms of accuracy of a histogram arranged by one of

173: the existing techniques. This problem is not straightforward:

174: think, to mention the most evident aspect, that 4LT buckets use 32

175: bits more than CVA ones, and, then, for a fixed storage space,

176: allow a smaller number of buckets. The last part of this paper is

177: thus devoted to evaluating the effects of the combination of the 4LT

178: technique with existing methods for building histograms. Through a

179: deep experimental comparative analysis conducted, for a fixed

180: storage space, over several data sets, both synthetic and

181: real-life, we show that 4LT improves significantly the accuracy of

182: the considered histograms. Therefore this paper, besides giving the

183: specific contribution of proposing a technique (i.e., the 4LT) for

184: estimating accurately range queries internal to buckets, proves

185: the more general result that going beyond classical techniques

186: (i.e., CVA and USA) for the estimation inside buckets may give

187: concrete improvements of histogram accuracy.

188:

189: It is worth noting that the choice of MaxDiff and V-Optimal

190: histograms for testing our method does not limit the generality of

191: the 4LT index, which is applicable to every bucket-based

192: histogram\footnote{There are histograms, like wavelet-based ones,

193: that are not based on a set of buckets.}. Nor does it

194: limit the validity of our comparison, since MaxDiff and

195: V-Optimal, despite their age, are still regarded in

196: this scientific community as points of reference due to their

197: accuracy \protect\cite{Ioannidis03History}.

198:

199: The paper is organized as follows. In Section

200: \ref{sec-preliminary}, we introduce some preliminary definitions.

201: The comparison, both experimental and theoretical, among a number

202: of techniques including our tree-based methods (3LT and 4LT) for

203: estimating range queries {\em inside} a bucket is reported in

204: Section \ref{sec-Estimation}. Therein, 3LT and 4LT are also

205: presented. This analysis shows that 4LT has the best

206: performance in terms of accuracy. Thus, 4LT can be combined with

207: every bucket-based histogram to increase its accuracy. Section

208: \ref{sec-Improved} presents a large set of experiments, conducted

209: by applying 4LT to two well-known methods, {\em MaxDiff} and {\em

210: V-Optimal} \protect\cite{Poosala96Improved}. Results show high

211: improvements in the estimation of range queries w.r.t. the

212: original methods; of course, the comparisons are made at parity

213: of storage consumption so that the revised methods use fewer

214: buckets to compensate for the additional storage for the 4LT indices.

215: The 4LT technique provides good results also when combined with

216: the very simple method {\em EquiSplit}, which consists in dividing

217: the histogram value domain into buckets of the same size so that

218: the bucket boundaries need not be stored, thus obtaining a very

219: high number of buckets at the same compression rate. We draw our

220: conclusions in Section \ref{sec-Conclusion}.

221:

222:

223:

224:

225:

226:

227: \section{Basic Definitions}\label{sec-preliminary}

228:

229: Given a relation $R$ and  an attribute $X$ of $R$, a histogram for $R$ on

230: $X$ is constructed as follows. Let  $\U = \{u_1, ... , u_m\}$ be

231: the set of all possible values (the {\em domain}) of $X$ and let

232: $u_i < u_{i+1}$, for each $i$, $1 \leq i <m$. The {\em frequency

233: set} for $X$ is the set $\F =\{f(u_1), ... , f(u_m) \} $ such that

234: for each $i$, $1 \leq i \leq m$, $f(u_i)$ is the number of

235: occurrences of the attribute value $u_i$ in the relation $R$. The

236: {\em cumulative frequency set} $\S =\{s_1, ... , s_m \}$ contains

237: the value $s_i = \sum_{j=1}^i f(u_j)$ for each attribute value

238: $u_i$. The {\em value set} $V=$ $\{u_i \in \U \ | \ f(u_i) > 0 \}$

239: is the active domain of $X$ in $R$ as it consists of all attribute

240: values actually occurring in the relation $R$ ({\em non-null

241: values}). Given any $u_i \in V$, $1 \leq i \leq m$, the {\em spread} $d_i$ of $u_i$

242: is defined as 1 if $u_i$ is the last

243: non-null value or otherwise as the difference $u_j - u_{i}$, where

244: $u_{j}$ is the first non-null value for which $u_j > u_i$ (i.e.,

245: $d_i$ is the distance from $u_i$ to the next non-null value).

246:

247: A {\em bucket} $B$ for $R$ on $X$  is a 4-tuple $\langle inf, sup,

248: t, c \rangle$, where $u_{inf}$ and $u_{sup}$, $1 \leq inf \leq sup

249: \leq m$, are the boundaries of the domain range pertaining to the

250: bucket, $t$ is the number of non-null values occurring in the

251: range, and $c = \sum_{i=inf}^{sup} f(u_i)$ is the sum of

252: frequencies of all values in the range.

253: We say that the bucket $B$ is {\em

254: 1-biased} if $u_{sup}$ is not null; if also $u_{inf} $ is not

255: null, then we say that $B$ is {\em 2-biased}.

256:

257:

258: A {\em histogram} $H$ for $R$ on $X$  is a $h$-tuple $\langle

259: B_1,B_2, ..., B_h \rangle$ of buckets such that: (1) for each $1

260: \leq i < h$, the upper bound of $B_i$ precedes the lower bound of

261: $B_{i+1}$ and (2) $u \in V$ implies $u \in B_i$, for some $i$, $1

262: \leq i \leq h$. Condition (1) guarantees that buckets do not

263: overlap each other, and condition (2) enforces that every non-null

264: value be hosted by some bucket. Classically, histograms have

265: 2-biased buckets; sometimes, for storage optimization, 2-biased

266: buckets are made 1-biased by replacing the lower bound of each

267: bucket with the successor, in the domain, of the upper bound of the

268: preceding bucket.

269:

270: A classical problem on histograms is: given a histogram $H$ and a

271: (range) query of the form $u_j \leq X \leq u_i$, $1 \leq j \leq i \leq m$,

272: estimate the overall frequency $\sum_{k=j}^i f(u_k)$ in the range from $u_j$

273: to $u_i$.

274:

275:

276:

277:

278: \section{Estimation Inside a Bucket}\label{sec-Estimation}

279:

280: In this section we investigate in depth the problem of frequency

281: estimation inside buckets. First of all, we present the two

282: classical techniques (CVA and USA), discuss their limitations and

283: propose some simple alternatives. Then we introduce a novel

284: technique which is based on a 4-level tree index storing

285: approximate representations of the partial sums of 7 fixed bucket

286: intervals. Later we evaluate the accuracy of the various

287: techniques by performing both a theoretical analysis of errors and

288: a number of experiments on some typical sample distributions.

289:

290: \subsection{Notations and Problem Formulation}

291:

292: Let $B=\langle inf, sup, t, c \rangle$ be a bucket on an attribute

293: $X$ of a relation $R$. Without loss of generality, we assume that

294: $inf=1$ and $sup=b$ so  that we can represent the frequency set

295: inside the bucket as a vector $F$ with indexes ranging from $1$ to $b$

296: ({\em frequency vector of} $B$). Similarly, the cumulative

297: frequencies are represented by a vector $S$ with indexes from $1$

298: to $b$ ({\em cumulative frequency vector of} $B$). Hence, for each

299: $i$, $1 \leq i \leq b$, $F[i]\geq 0$ is the frequency of the value

300: $u_i$ while $S[i]=$ $\sum_{j=1}^{i}F[j]$ is the cumulative

301: frequency. Then $c=S[b]$ is the sum of all frequencies in the

302: bucket; moreover, for notation convenience, we assume that

303: $S[0]=0$.

304:

305: The problem of the estimation inside a bucket can be formulated as

306: follows: {\em given any pair} $i,j$, $1 \leq i \leq j \leq b$, such

307: that $d=j-i+1 < b$, {\em estimate the range query} $S[j] - S[i-1] =$

308: $\sum_{k=i}^j

309: F[k]$.

310: We focus our attention on the basic problem of estimating $S[d]$

311: (that is, we assume $i=1$).

312:

313: We introduce now the following notation.

314: Given $1 \leq i \leq j \leq 8$,

315: we denote by $\delta_{i/j}$ the sum

316: $\sum_{k=x}^{y} F[k]$,

317: where $x=1+ \lceil \frac{b}{j}  \cdot (i -1) \rceil$

318: and $y= \lceil \frac{b}{j}  \cdot i\rceil$.

319: $\delta_{i/j}$ represents the frequency sum of the $i$-th

320: element of the partition of $B$ into $j$ equal size sub-ranges.

321: Thus, the frequency sum for a bucket is $\delta_{1/1}$; the

322: frequency sums for two halves are $\delta_{1/2}$ and

323: $\delta_{2/2}$; the frequency sums for the 4 quarters are

324: $\delta_{i/4}$, $1 \leq i \leq 4$; the frequency sums for the 8

325: eighths are $\delta_{i/8}$, $1 \leq i \leq 8$, and so on.

326:
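\noindent As an illustration with a hypothetical bucket size (not taken from the data sets used in this paper), let $b=16$. For $j=4$ and $i=2$ the definition gives $x=1+\lceil \frac{16}{4} \cdot 1 \rceil=5$ and $y=\lceil \frac{16}{4} \cdot 2 \rceil=8$, so that
\[
\delta_{2/4}=\sum_{k=5}^{8} F[k],
\]
that is, the frequency sum of the second quarter of the bucket. Likewise, $\delta_{1/2}=\sum_{k=1}^{8} F[k]=\delta_{1/4}+\delta_{2/4}$: a partial sum at one resolution is the sum of two partial sums at the next finer resolution.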

327: \subsection{Estimation Techniques}

328:

329: Next we illustrate the existing approximation techniques

330: and discuss some additional simple approaches.

331:

332: \noindent{\bf Continuous Value Assumption (CVA).} The estimation

333: of $S[d]$ is computed as $\widetilde{S}[d]=\frac{d}{b} \cdot c$.

334: In words, the partial contribution of a bucket to a range query

335: result is estimated by linear interpolation. As pointed out in

336: \protect\cite{Buccafurri99Compressed,BuFuSa01}, the above

337: estimation coincides with the expected value of the $S[d]$ when it

338: is considered a random variable over the population of all

339: frequency distributions in the bucket for which the overall

340: cumulative frequency is $c$.

341: \par\noindent{\bf Uniform Spread Assumption (USA).} The estimation of $S[d]$ is given by

342: $\widetilde{S}[d] = \left ( 1 + \frac{(t-1)\cdot (d-1)}{(b-1)}

343: \right ) \cdot \frac{c}{t}$, where $t$ is the number of non-null

344: attribute values in the bucket. The uniform spread assumption

345: assumes that such values are distributed at equal distance from

346: each other and the overall frequency sum is equally distributed

347: among them. Obviously, in this case the information $t$ is

348: necessary.  We stress that, as discussed in

349: \protect\cite{BuFuSa01}, this estimation is not supported by any

350: unbiased probabilistic model so the assumption is rather

351: arbitrary.

352:
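\noindent As a small worked instance (the values are hypothetical and chosen only for illustration), take $b=9$, $c=100$, $t=5$ and $d=5$. CVA gives
\[
\widetilde{S}[5]=\frac{5}{9} \cdot 100 \approx 55.6,
\]
whereas USA gives
\[
\widetilde{S}[5]=\left( 1+\frac{(5-1)\cdot(5-1)}{9-1} \right) \cdot \frac{100}{5}=3 \cdot 20=60.
\]
The USA value corresponds to placing the $5$ non-null values at positions $1,3,5,7,9$, each with frequency $20$, so that three of them fall within the first $5$ positions.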

353: \noindent{\bf 1-Biased Estimation (1b).} The possibly available

354: information on the number $t$ of non-null elements cannot be

355: exploited in the estimation unless some further information on the

356: frequency distribution is either available or assumed (as for the

357: USA estimation). We next show how to exploit the fact that a

358: bucket is often 1-biased (i.e., $u_b$ is not null) using the

359: probabilistic approach proposed in \protect\cite{BuFuSa01}. This

360: approach assumes that the query is a random variable on the

361: population  of all 1-biased frequency distributions having $c$ as

362: overall cumulative frequency. The estimation of the range query

363: $S[d]$ for a 1-biased bucket is given by $\widetilde{S}[d]=

364: \frac{d}{b-1} \cdot \frac{t-1}{t} \cdot c$.

365:

366: \noindent{\bf 2-Split Estimation (2s).} We split the bucket into

367: two parts of the same size and  store the cumulative frequency of

368: the first part, say $\delta_{1/2}=S[b/2]$ ---  we therefore need

369: additional storage space (typically 32 bits). We call this method

370: {\em 2-split} or $2s$ for short. Following this approach, the

371: estimation of the range query $S[d]$ is given by

372: $2 \cdot \frac{d}{b} \cdot  \delta_{1/2}$ if $d \leq \frac{b}{2}$,

373: $\delta_{1/2} + 2 \cdot \frac{d - b/2}{b} \cdot (c - \delta_{1/2})$,

374: otherwise.

375: Thus we use the CVA techniques for each of the two halves of the

376: bucket.

377:
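\noindent As an illustration with hypothetical values, let $b=16$, $c=100$ and $\delta_{1/2}=70$. For $d=4$ the 2s estimate is
\[
2 \cdot \frac{4}{16} \cdot 70=35,
\]
whereas CVA applied to the whole bucket would give $\frac{4}{16} \cdot 100=25$. For $d=12$, applying CVA to the second half (which holds $c-\delta_{1/2}=30$ over $b/2=8$ positions) adds $\frac{12-8}{8} \cdot 30=15$ to $\delta_{1/2}$, giving an estimate of $85$ instead of the $75$ obtained by CVA on the whole bucket.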

378: \noindent{\bf 4-Split Estimation (4s).} We split the bucket into

379: 4 parts of the same size ({\em quarts}) and  store the

380: approximate values of the cumulative frequency of each

381: part $\delta_{i/4}$, $1 \leq i \leq 4$.

382: In case the additional available space is 32 bits, we use 8 bits for each

383: approximate value, which is therefore computed as

384: $\tilde{\delta}_{i/4}=\langle\frac{\delta_{i/4}}{c}

385: \times (2^8-1)\rangle$,

386: where $\langle x \rangle$ stands for $round(x)$.

387: The frequency sum

388: for an interval $d$ is estimated by adding the approximate values

389: of all the quarts that are fully contained in the interval plus

390: the CVA estimation of the portion of the quart that

391: partially overlaps the interval. Obviously, in order to reduce the

392: approximation error, in case $d>b/2$, it is convenient to derive

393: the approximate value from the estimation of the cumulative

394: frequency in the complementary interval from $d+1$ to $b$.

395:

396:

397: \noindent{\bf 8-Split Estimation (8s).}

398: It is analogous to the 4-Split Estimation. The only difference is that the

399: bucket is

400: divided into 8 parts ({\em eighths}) and, for each of them, we use

401: 4 bits for storing the cumulative frequency.

402: Thus, the approximate value of the $i$-th eighth ($1 \leq i \leq 8$), is

403: computed as

404: $\tilde{\delta}_{i/8}=\langle\frac{\delta_{i/8}}{c}

405: \times (2^4-1)\rangle$,

406: where $\langle x \rangle$ stands for $round(x)$.

407:
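\noindent To see the effect of scaling on this flat encoding, consider a hypothetical bucket with $c=1000$ and $\delta_{1/8}=40$. With 4 bits we obtain
\[
\tilde{\delta}_{1/8}=\left\langle \frac{40}{1000} \times (2^4-1) \right\rangle=\langle 0.6 \rangle=1.
\]
Assuming that the stored value is mapped back to a frequency by inverting the scaling (an assumption made here by analogy with the tree indices, since this step is not spelled out above), the recovered sum is $\frac{1}{15} \cdot 1000 \approx 66.7$, an error of about $26.7$. More generally, under the same assumption, rounding alone can shift the recovered value by up to $\frac{c}{2 \cdot 15} \approx 33.3$, because every sum is scaled by the whole bucket sum $c$.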

408:

409:

410: \subsection{The Tree Indices for Bucket Frequency Estimation}

411:

412: We now propose to use 32 bits as sophisticated tree-indices for

413: providing an {\em approximate description} of the cumulative

414: frequencies in the bucket --- this index can be easily extended

415: also to the case that more bits are available. To this end, we store

416: the approximate value of the cumulative frequency in a suitable number of

417: intervals

418: inside the bucket.

419: The first type of tree-index is 3LT.

420:

421: \noindent

422: {\bf 3 Level Tree index (3LT)}

423: The 3LT index uses 11 bits for

424: approximating the value of $\delta_{1/2}$, and 10 bits both for

425: approximating $\delta_{1/4}$ and for $\delta_{3/4}$.

426:

427: Let $L_{1/2}$ be the

428: 11-bit string corresponding to $\delta_{1/2}$, and let $L_{1/4}$ and

429: $L_{3/4}$ be the 10-bit strings corresponding, respectively, to

430: $\delta_{1/4}$ and $\delta_{3/4}$.

431:

432: The three $L$ strings are constructed as follows:

433:

434: \vspace{2mm}

435: \begin{center}

436: {\small

437: $L_{1/2} =  \langle\frac{\delta_{1/2}}{\delta_{1/1}}\cdot (2^{11}-1)

438: \rangle;

439: \ \ \ \

440: L_{1/4}= \langle\frac{\delta_{1/4}}{\delta_{1/2}}\cdot (2^{10}-1)\rangle;

441: \ \ \ \

442: L_{3/4}= \langle\frac{\delta_{3/4}}{\delta_{2/2}}\cdot (2^{10}-1)\rangle$

443: }

444: \end{center}

445:

446: \vspace{2mm}

447: \noindent where, we recall, $\langle x \rangle$ stands for $round(x)$.

448:

449:

450: The approximate values for the partial sums are given by:

451:

452: \vspace{2mm}

453: \begin{center}

454: {\small

455: $\widetilde{\delta}_{1/1}=\delta_{1/1}=c$\\

456:

457: $\widetilde{\delta}_{1/2}= \frac{L_{1/2}}{2^{11}-1} \cdot

458: \widetilde{\delta}_{1/1};

459: \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

460: \widetilde{\delta}_{2/2}= \widetilde{\delta}_{1/1} -

461: \widetilde{\delta}_{1/2}$ \\

462:

463: $\widetilde{\delta}_{1/4}= \frac{L_{1/4}}{2^{10}-1}\cdot

464: \widetilde{\delta}_{1/2};

465: \ \ \ \ \

466: \widetilde{\delta}_{2/4}=\widetilde{\delta}_{1/2} -

467: \widetilde{\delta}_{1/4};

468: \ \ \ \ \ \

469: \widetilde{\delta}_{3/4}= \frac{L_{3/4}}{2^{10}-1}\cdot

470: \widetilde{\delta}_{2/2};

471: \ \ \ \ \

472: \widetilde{\delta}_{4/4}=\widetilde{\delta}_{2/2} -

473: \widetilde{\delta}_{3/4}$\\

474: }

475: \end{center}

476: \vspace{2mm}

477:

478: Observe that the 32-bit index refers to a 3-level tree whose

479: nodes store directly or indirectly the approximate values of the

480: cumulative frequencies for fixed intervals: the root stores the

481: overall cumulative frequency $c$, the two nodes of the second

482: level store the cumulative frequencies for the two halves of the

483: bucket and so on.

484:

485: \begin{example}\label{3LT-example}

486: Consider the 3-level tree in Figure

487: \ref{fig-3LT}. The 32 bits store the following approximate

488: cumulative frequencies: $L_{1/2}=\langle \frac{5594}{8678} \cdot

489: 2047 \rangle=1320$, $L_{1/4}=\langle \frac{2834}{5594} \cdot 1023

490: \rangle=518$, $L_{3/4}=\langle \frac{2818}{8678-5594} \cdot 1023

491: \rangle=935$.

492: \end{example}

493:
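\noindent Continuing Example \ref{3LT-example} as a sketch (values rounded to one decimal place), the stored strings yield the following approximate partial sums:
\[
\begin{array}{ll}
\widetilde{\delta}_{1/2}=\frac{1320}{2047} \cdot 8678 \approx 5596.0 &
\widetilde{\delta}_{2/2} \approx 8678-5596.0=3082.0 \\
\widetilde{\delta}_{1/4}=\frac{518}{1023} \cdot 5596.0 \approx 2833.5 &
\widetilde{\delta}_{3/4}=\frac{935}{1023} \cdot 3082.0 \approx 2816.9
\end{array}
\]
These are to be compared with the exact sums $5594$, $8678-5594=3084$, $2834$ and $2818$ appearing in the example.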

494: \begin{figure}[t]

495: \epsfig{file=fig1.eps,width=11cm}

496: \caption{The 3-level tree.}\label{fig-3LT}

497: \end{figure}

498:

499:

500: We are now ready to solve the frequency estimation inside the bucket

501: $B$. Given $d$, $1 \leq d < b$, let $i$ be the integer for which

502: $\lceil(i-1)/4 \cdot b \rceil \leq d < \lceil i/4 \cdot b \rceil$.

503: Then the approximate value of $S[d]$ is:

504:

505:

506: \[

507: \begin{array}{l}

508: \widetilde{S}[d]= P(i)+P'(i)+\frac{d-\lceil(i-1)/4 \cdot

509: b\rceil} {\lceil i/4 \cdot b\rceil-\lceil(i-1)/4 \cdot b\rceil}

510: \cdot \widetilde{\delta}_{i/4}

511: \end{array}

512: \]

513: \noindent where

514: \[

515: \begin{array}{ll}

516: P(i)= \left \{

517: \begin{array}{ll}

518: \widetilde{\delta}_{1/2} & \mbox{if $i > 2$}\\

519: 0 & \mbox{if $i \leq 2$}

520: \end{array}  \right.

521: & \ \ \ \ \

522: P'(i)= \left \{

523: \begin{array}{ll}

524: \widetilde{\delta}_{1/4} & \mbox{if $i = 2$}\\

525: \widetilde{\delta}_{3/4} & \mbox{if $i = 4$}\\

526: 0 & \mbox{otherwise}

527: \end{array}  \right.

528: \\

529: \end{array}

530: \]

531: Thus we use the interpolation based on the CVA only inside a

532: segment of length $\lceil(1/4) \cdot b\rceil$. This component

533: becomes zero at each distance $d=\lceil i \cdot \frac{b}{4} \rceil$, $1

534: \leq i < 4$.

535:

536:

537: 32 bits may be distributed in such a way that the granularity of

538: the tree-index increases w.r.t. 3LT. The 4LT index has 4 levels

539: and uses 6 bits for the first level, 5 bits for the second one and

540: 4 bits for the last level.

541:

542: \noindent

543: {\bf 4 Level Tree index (4LT)}

544: We reserve 4 bits to store the approximate value of each of the

545: following 4 partial sums: $\delta_{1/8}$, $\delta_{3/8}$,

546: $\delta_{5/8}$ and $\delta_{7/8}$ --- let $L_{i/8}$, $i=1,3,5,7$,

547: denote such 4-bits strings. We then use the remaining 16 bits as

548: follows: the partial sums $\delta_{1/4}$ and $\delta_{3/4}$ are

549: approximated by the 5-bit strings $L_{1/4}$ and $L_{3/4}$,

550: respectively, while the partial sum $\delta_{1/2}$ with a 6-bits

551: string $L_{1/2}$. As a result, the larger the intervals, the higher

552: is the number of bits used.

553: The 7 $L$ strings are constructed as follows:

554:

555:

556: \[

557: \small

558: \begin{array}{ll}

559: L_{1/2} =  \langle\frac{\delta_{1/2}

560: }{\delta_{1/1}}\cdot (2^6-1)\rangle  &

561: \\

562:

563: L_{i/4}= \langle\frac{\delta_{i/4}}{\delta_{j/2}}\cdot (2^5-1)\rangle

564: & (i=1 \wedge j =1 ), (i=3 \wedge j =2 )\\

565:

566: L_{i/8}=\langle\frac{\delta_{i/8}}{\delta_{j/4}}\cdot (2^4-1)\rangle

567: & (i=1 \wedge j =1), (i=3 \wedge j =2), \\

568:

569: & (i=5 \wedge j =3), (i=7 \wedge j =4)\\

570:

571: \end{array}

572: \]

573:

574: \noindent where,

575: we recall, $\langle x \rangle$ stands for $round(x)$.

576:

577:

578: The approximate values for the partial sums are eventually

579: computed as:

580:

581: \[

582: \begin{array}{ll}

583: \widetilde{\delta}_{1/1}=\delta_{1/1}=c & \\

584:

585: \widetilde{\delta}_{1/2}= \frac{L_{1/2}}{2^6-1}\times

586: \widetilde{\delta}_{1/1}\\

587:

588: \widetilde{\delta}_{2/2}= \widetilde{\delta}_{1/1} -

589: \widetilde{\delta}_{1/2} &\\

590:

591:

592: \widetilde{\delta}_{i/4}= \frac{L_{i/4}}{2^5-1}\times

593: \widetilde{\delta}_{j/2} & (i=1 \wedge j =1), (i=3 \wedge j =2) \\

594:

595: \widetilde{\delta}_{i/4}=\widetilde{\delta}_{j/2} -

596: \widetilde{\delta}_{(i-1)/4} &

597: (i=2 \wedge j =1),

598: (i=4 \wedge j =2)\\

599:

600: \widetilde{\delta}_{i/8}= \frac{L_{i/8}}{2^4-1}\times

601: \widetilde{\delta}_{j/4} & (i=1 \wedge j =1), (i=3 \wedge j =2) \\ &

602: (i=5 \wedge j =3), (i=7 \wedge j =4) \\

603: \widetilde{\delta}_{i/8}=\widetilde{\delta}_{j/4} -

604: \widetilde{\delta}_{(i-1)/8} &

605: (i=2 \wedge j =1),

606: (i=4 \wedge j =2)\\ & (i=6 \wedge j =3), (i=8 \wedge j =4)

607: \end{array}

608: \]

609:

610: Similarly to the 3LT-index, the 4LT-index

611: refers to a 4-level tree whose

612: nodes store directly or indirectly the approximate values of the

613: cumulative frequencies for fixed hierarchical intervals

614: starting from the root which stores the

615: overall cumulative frequency $c$.

616:

617: \begin{figure}[t]

618: \epsfig{file=fig2.eps,width=11cm}

619: \caption{The 4-level tree.}\label{fig-4LT}

620: \end{figure}

621:

622:

623:

624: \begin{example}

625: Consider the 4-level tree in Figure

626: \ref{fig-4LT}.

627: The 32 bits store the following approximate

628: cumulative frequencies: $L_{1/2}=33$, $L_{1/4}=18$, $L_{3/4}=13$,

629: $L_{1/8}=6$, $L_{3/8}=11$, $L_{5/8}=5$, $L_{7/8}=7$.

630: \end{example}

631:

632:

633:

634: Again, similarly to the 3LT-index, the frequency estimation inside the

635: bucket

636: $B$ can be obtained by exploiting the content of the nodes of the index.

637: Given $d$, $1 \leq d < b$, let $i$ be the integer for which

638: $\lceil(i-1)/8\times b\rceil \leq d < \lceil i/8\times b\rceil$;

639: the approximate value of $S[d]$ is:

640: \[

641: \begin{array}{l}

642: \widetilde{S}[d]= P(i)+P'(i)+P''(i)+\frac{d-\lceil(i-1)/8\times

643: b\rceil} {\lceil i/8\times b\rceil-\lceil(i-1)/8\times b\rceil}

644: \times \widetilde{\delta}_{i/8}

645: \end{array}

646: \]

647: \noindent where

648: \[

649: \begin{array}{ll}

650: P(i)= \left \{

651: \begin{array}{ll}

652: \widetilde{\delta}_{1/2} & \mbox{if $i > 4$}\\

653: 0 & \mbox{if $i \leq 4$}

654: \end{array}  \right.

655: &

656: P'(i)= \left \{

657: \begin{array}{ll}

658: \widetilde{\delta}_{1/4} & \mbox{if $i = 3,4$}\\

659: \widetilde{\delta}_{3/4} & \mbox{if $i = 7,8$}\\

660: 0 & \mbox{otherwise}

661: \end{array}  \right.

662: \\

663: \end{array}

664: \]

665: \[

666: P''(i)= \left \{

667: \begin{array}{ll}

\widetilde{\delta}_{(i-1)/8} & \mbox{if $i$ is even}\\

669: 0 & \mbox{otherwise}

670: \end{array}  \right.

671: \]

672:

Thus, CVA-style linear
interpolation is used only
inside a segment of length $\lceil(1/8)b\rceil$. This interpolation component
becomes zero at each distance $d=\lceil i \times b/8 \rceil$, $1
\leq i < 8$, where the estimate relies on the index alone. We call this estimation {\em 4-level tree}, or 4LT for short.
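
\noindent
As a hedged illustration of the formula above (the numbers are hypothetical and are not taken from Figure \ref{fig-4LT}), assume $b=80$, so that each eighth of the bucket spans $10$ positions, and suppose that the index yields $\widetilde{\delta}_{1/4}=180$, $\widetilde{\delta}_{3/8}=110$ and $\widetilde{\delta}_{4/8}=40$, with $\widetilde{\delta}_{1/2}=330$. For $d=35$ we have $i=4$, since $30 \leq 35 < 40$; hence $P(4)=0$, $P'(4)=\widetilde{\delta}_{1/4}$, and $P''(4)$ is the contribution of the third eighth, $\widetilde{\delta}_{3/8}$:
\[
\widetilde{F}[35] = 0 + 180 + 110 + \frac{35-30}{40-30}\times 40 = 310 .
\]
For $d=40$ we have $i=5$, so only $P(5)=\widetilde{\delta}_{1/2}$ is non-null and the interpolation term vanishes, giving $\widetilde{F}[40]=330$.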

678:

679:

680:

681: \subsection{Worst-case Error Analysis}\label{sec-Analysis}

682:

The approximation error of CVA, 1b, USA and 2s arises only from interpolation.
For the other methods (i.e., 4s, 8s, 3LT and 4LT), instead, the scaling
error due to bit saving adds to the interpolation error.
However, all methods except CVA, 1b and USA implement an equi-size division of
the bucket, and
3LT and 4LT also provide an index over the sub-buckets.
We expect such a division into sub-buckets to reduce
the interpolation error,
since sub-buckets increase
the granularity of the summarization.
In addition, we expect the index-based methods (i.e., 3LT and 4LT) to reduce
the scaling error, since
the hierarchical tree-like organization allows us to
represent the sum inside a given sub-bucket, corresponding to a
node of the tree, as a fraction of the sum contained in the parent
node, instead of a fraction of the entire bucket sum (as happens for the
``flat'' methods 4s and 8s).
The worst-case analysis confirms the above observations.
In particular, we show that CVA, 1b and USA are equivalent from the
worst-case point of view, while 4LT outperforms the other methods.
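
\noindent
The expected gain in scaling error can be sketched as follows, under the simplifying assumption (made here for illustration only) that a sum stored with $k$ bits and scaled w.r.t. a reference value $R$ is affected by a rounding error of at most $\frac{R}{2^{k+1}}$, and ignoring the error propagated from the upper levels of the tree. A flat method uses $R=c$, whereas an index-based method uses as $R$ the sum $p \leq c$ of the parent node, so that
\[
\frac{p}{2^{k+1}} \leq \frac{c}{2^{k+1}} .
\]
For example, with the hypothetical values $k=4$, $c=1024$ and $p=256$, the bound drops from $32$ to $8$.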

703:

The results of our analysis are summarized in the following theorem.
Recall that, throughout this section, a bucket $B$ of
size $b$ is given.

707:

708: \begin{theorem}

Let $F$ be the maximum frequency value occurring in $B$ and
assume that $b \bmod 8 = 0$. Then the
worst-case interpolation and scaling errors of
CVA, 1b, USA, 2s, 4s, 8s, 3LT and 4LT are the following:

713:

714: \begin{center}

715: \begin{tabular}[h]{||c||c|c|c|c|c|c|c|c||}

716: \hline\hline

717: error/method & CVA & 1b & USA & 2s & 4s & 8s & 3LT & 4LT

718: \\ \hline

719: interpolation & $\frac{F \cdot b}{4}$ &  $\frac{F \cdot b}{4}$  &  $\frac{F

720: \cdot b}{4}$ &  $\frac{F \cdot b}{8}$ &

721: $\frac{F \cdot b}{16}$  &  $\frac{F \cdot b}{32}$ &  $\frac{F \cdot b}{16}$

722: &  $\frac{F \cdot b}{32}$

723: \\ \hline

724: scaling &   0 & 0& 0 & 0 &  $\frac{F \cdot b}{2^9}$  &  $\frac{F \cdot

725: b}{32}$  & $\frac{F \cdot b}{2^{12}}$ &   $\frac{F \cdot b}{2^7}$

726:

727: \\ \hline

728: total & $\frac{F \cdot b}{4}$ &  $\frac{F \cdot b}{4}$  &  $\frac{F \cdot

729: b}{4}$ &  $\frac{F \cdot b}{8}$

730: & $\frac{F \cdot b}{16}$  &  $\frac{F \cdot b}{16}$ & $\frac{F \cdot b}{16}$

731:   &  $\frac{F \cdot b}{32}$

732: \\ \hline\hline

733: \end{tabular}

734: \end{center}

735: \end{theorem}

736:

737: \begin{proof}

Let $b_M$ be the size of the smallest sub-bucket produced by the method
$M$, where $M$ is one of CVA, 1b, USA, 2s, 4s, 8s, 3LT or 4LT.
Observe that $b_M=b$ for CVA, 1b and USA (since they do not produce
sub-buckets), while $b_{2s}= \frac{b}{2}$, $b_M = \frac{b}{4}$ for
$M=$ 4s or $M=$ 3LT, and $b_M = \frac{b}{8}$ for $M=$ 8s or $M=$ 4LT.

743:

Consider first the interpolation error
(assuming that no scaling error occurs).

\noindent
{\bf Interpolation error bounds.}
It can be easily verified that the worst case for a method $M$ occurs
whenever
both of the following conditions hold:

752: \begin{enumerate}

753: \item [(1)]

754: there is a smallest sub-bucket, say $B$ (of size $b_M$) containing,

755: in the first half, $\frac{b_M}{2}$ frequencies with value $F$,

756: and, in the second half, $\frac{b_M}{2}$ frequencies with value 0, and

757: \item [(2)]

758: the range query involves exactly the first half of the sub-bucket $B$.

759: \end{enumerate}

760: The proof of this part is conducted separately for each method,

761: by determining the maximum absolute interpolation error:

762:

763: \noindent{\bf CVA:}

764: In this case, $b_M = b$, that is the sub-bucket coincides with

765: the entire

766: bucket and the query boundaries are $1$ and $\frac{b}{2}$.

767: The cumulative value of the bucket is $F \cdot \frac{b}{2}$.

768: Under CVA, the estimated value of the query is $\frac{F \cdot

769: \frac{b}{2}}{b}\cdot \frac{b}{2}$,

770: that is $\frac{F \cdot b}{4}$. The actual value of the query is $\frac{F

771: \cdot b}{2}$.

772: Therefore the absolute error is $\frac{F \cdot b}{4}$.

773:

774:

775:

776: \noindent{\bf 1b:}

777: We obtain the same absolute error $\frac{F \cdot b}{4}$.

Indeed, since the first value of the bucket is $F$
(i.e., non-null), the 1-biased estimation does not give additional information
w.r.t. CVA.

781:

782: \noindent{\bf USA:}

783: Also in this case, $b_M = b$, that is the sub-bucket coincides

784: with the entire

785: bucket and the query boundaries are $1$ and $\frac{b}{2}$.

786: The cumulative value of the bucket is $F \cdot \frac{b}{2}$.

USA assumes that the $\frac{b}{2}$ non-null values are
located at equal distance from each other,
and that each has the value $F$. As a consequence, the estimated value of the
query
is $F \cdot \frac{b}{4}$, since the query range covers just half of the
estimated non-null values.
The actual value is $\frac{F \cdot b}{2}$. Thus, the absolute error is
$\frac{F \cdot b}{4}$, that is, the same as for CVA.

795:

796: \noindent{\bf 2s:}

797:  In this case $b_M = \frac{b}{2}$.

As in the CVA case, the absolute error is

799: $\frac{F \cdot b_M}{4}$, that is

800: $\frac{F \cdot b}{8}$.

801:

802: \noindent{\bf 4s} and {\bf 3LT}:

803:  Both 4s and 3LT produce sub-buckets

804: of size $\frac{b}{4}$. Thus, in these cases $b_M = \frac{b}{4}$.

805: Identically to the previous case, the absolute error is

806: $\frac{F \cdot b_M}{4}$, that is

807: $\frac{F \cdot b}{16}$.

808:

809: \noindent{\bf 8s} and {\bf 4LT}:

810: Both 8s and 4LT produce sub-buckets

811: of size $\frac{b}{8}$. Thus, in these cases $b_M = \frac{b}{8}$.

812: Identically to the previous case, the absolute error is

813: $\frac{F \cdot b_M}{4}$, that is

814: $\frac{F \cdot b}{32}$.
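
\noindent
All the cases above that rely on CVA inside the smallest sub-bucket follow one pattern, which can be checked directly: in the worst-case configuration the smallest sub-bucket has sum $F \cdot \frac{b_M}{2}$, the query covers its first half, and linear interpolation returns half of that sum, so that
\[
\left| F \cdot \frac{b_M}{2} - \frac{1}{2}\cdot F \cdot \frac{b_M}{2} \right| = \frac{F \cdot b_M}{4}.
\]
For instance, taking the purely illustrative values $F=10$ and $b=64$, the resulting bounds are $160$ for CVA, 1b and USA, $80$ for 2s, $40$ for 4s and 3LT, and $20$ for 8s and 4LT.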

815:

816:

817: Now we consider the scaling error.

818:

819: \noindent{\bf Scaling error bounds.}

820: The proof that CVA, 1b, USA and 2s do not

821: produce scaling error is straightforward.

822: Let us consider the other methods:

823:

824: \noindent{\bf 4s:}

825: Since each sub-bucket sum is encoded by 8 bits and is scaled

826: w.r.t. the overall bucket sum, the maximum scaling error is

827: $\frac{F \cdot b}{2^9}$.

828:

829: \noindent{\bf 8s:}

830: Since each sub-bucket sum is encoded by 4 bits and scaled

831: w.r.t. the overall bucket sum, the maximum scaling error is

832: $\frac{F \cdot b}{2^5}= \frac{F \cdot b}{32}$.

833:

834: \noindent{\bf 3LT:}

835: In this case, the scaling error may be propagated

836: going down along the path from the root to the leaves of the tree.

837: We may determine an upper bound of the worst-case error

838: by considering the sum of the maximum scaling error at each level.

839: Thus, we obtain the following upper bound:

840: $\frac{\frac{F \cdot b}{2^{12}} + \frac{F \cdot b}{2}}{2^{11}}$.

841: Indeed, the maximum scaling error of the first level is

842: $\frac{F \cdot b}{2^{12}}$. The above value is obtained by considering

843: that the maximum sum in the half bucket corresponding to the first level

844: is $\frac{F \cdot b}{2}$, and that going down to the second level

845: introduces a maximum scaling error obtained by dividing the

846: overall sum by $2^{11}$. Thus, the maximum scaling error for 3LT

847: is $\Theta(\frac{F \cdot b}{2^{12}})$ (that is, the scaling error of the first

848: level).

849:

850: \noindent{\bf 4LT:}

The same argument as for 3LT applies to 4LT:
the maximum scaling error is of the same order as that of the first level,
that is, $\Theta(\frac{F \cdot b}{2^7})$, since the first level uses 6 bits.

855:

856: The proof is thus completed.

857: \end{proof}

858:

859: It is worth noting that,

860: as expected, 4LT and 8s produce the smallest interpolation worst-case error,

861: that is $\frac{F \cdot b}{32}$.

Considering also the results on the scaling error,
the overall conclusion we may draw from the above analysis is that the two best
methods w.r.t. interpolation, that is, 8s and 4LT, are not equivalent in terms of
scaling error.
Indeed, 4LT shows a significant accuracy improvement, since the worst-case scaling error
goes from $\frac{F \cdot b}{2^5}$ for 8s to $\frac{F \cdot b}{2^7}$
for 4LT.

869:

In the next subsection we report a number of experiments that
provide additional arguments in favor of the superiority of the 4LT
estimation, by performing an average-case analysis
of the methods under a number of meaningful data distributions.
We do not conduct experiments on CVA,
because CVA uses 32 bits less per bucket and, therefore,
could reduce the size of the bucket, thus providing a better
accuracy. Actually, its performance analysis coincides with that
of the 2s estimation, which is CVA applied to each half bucket.

879:

880: \subsection{Experiments inside a Bucket}\label{sec-ExperimentsIntra}

881:

In this section we report the results of a large number of
experiments performed on various synthetic data sets obtained
with different distributions. We measure the accuracy of the
above-mentioned methods in estimating range queries inside a
bucket. In particular, the methods considered are USA, 1b, 2s, 4s, 8s, 3LT and
4LT. We observe that the space required for storing a bucket is the same
for all the considered methods.
Experiments are conducted
on synthetic data generated according to several data distributions.
A data distribution is characterized by a distribution for frequencies and
a distribution for spreads.
The frequency set and the value set are generated independently; then
frequencies are randomly
assigned to the elements of the value set.

896:

897: \subsubsection{Test Bed.}

898:

In this section we illustrate the test bed used in our
experiments. In particular, we describe (1) the
{\em data distributions}, that is, the probability
distributions used for generating frequencies in the tested
buckets, (2) the {\em bucket populations}, that is, the sets
of parameters characterizing the buckets and used for
generating them under the
probability distributions, (3) the {\em data sets},
that is, the sets of samples produced by combining (1)
and (2), and (4) the {\em query set and error metrics}, that is,
the set of queries submitted to the sample data and
the metrics used for measuring the approximation error.

911:

912: \noindent {\bf Data Distributions:} We consider four data

913: distributions: ({\bf 1}) {\em Zipf-$cusp\_max$ (0.5,1.0)}:

914: Frequencies are distributed according to a Zipf distribution

915: \protect\cite{Zipf49Human} with the $z$ parameter equal to $0.5$.

916: Spreads are distributed according to a Zipf {\em $cusp\_max$}

\protect\cite{Poo97} (i.e., increasing spreads following a Zipf distribution
for the first half of the elements and decreasing spreads following a

919: Zipf distribution for the remaining elements) with $z$ parameter

920: equal to $1.0$. ({\bf 2})  {\em Zipf-$cusp\_max$(1.0,1.0).} ({\bf

921: 3}) {\em Zipf-$cusp\_max$(1.5,1.0).} ({\bf 4}) {\em Gauss-rand}:

922: Frequencies are distributed according to a Gauss distribution with

923: standard deviation $1.0$. Spreads are randomly distributed as

924: well.

925:

926:

927:

928:

929: \noindent {\bf Bucket Populations:} A population is characterized

930: by the values of $c$ (overall cumulative frequency), $b$ (the

931: bucket size) and $t$ (number of non-null attribute values) and

consists of all buckets having such values.  We consider 9
different populations divided into two sets, called t-var
and b-var, respectively (the two sets share one population, since b-var(500) coincides with t-var(100)).

935:

936: \noindent

937: {\em Set of populations t-var.}

938: It is a set of 6 populations of buckets, all of them with

$c=20000$ and  $b=500$. The 6 populations differ in the value of
the parameter $t$ ($t$=10, 100, 200, 300, 400, 500), and are denoted by
t-var(10), t-var(100), t-var(200), t-var(300), t-var(400) and
t-var(500), respectively.

943:

944: \noindent

945: {\em Set of populations b-var.}

It is a set of 4 populations of buckets, all of them with
$c=20000$. They differ in the values of the parameters $b$ and
$t$. We consider $4$ different values for $b$ ($b$=100, 200, 500,
1000). The number of non-null values $t$ of each population is
chosen so that the ratio $t/b$ is constant and equal to
$0.2$; so the values of $t$ are 20, 40, 100 and 200. The four

952: populations are denoted by b-var(100), b-var(200), b-var(500) and

953: b-var(1000).

954:

955: Moreover, a generic population whose parameter values are,

956: say, $\bar c$, $\bar b$ and $\bar t$ (for $c$, $b$ and $t$, respectively),

957: is denoted by p($\bar c$, $\bar b$, $\bar t$).

958:

\noindent {\bf Data Sets:} By a data set we mean a sampling of the

960: set of buckets belonging to a given population following a given

961: data distribution. Each data set included in the experiments is

962: obtained by generating $100$ buckets belonging to one of the

963: populations specified above under one of the above described data

964: distributions. We denote a data set by the name of the data

965: distribution and the name of the population. For example, the data

966: set (Zipf-cusp\_max(0.5,1.0), b-var(200)) denotes a sampling of

967: the set of buckets belonging to the population of b-var

968: corresponding to the value 200 for the parameter $b$ following the

969: data distribution Zipf-cusp\_max(0.5,1.0).

970:

We generate 23 different data sets,
classified as follows:
(1)
{\bf Zipf-t} (i.e., Zipf data, different bucket density),
containing the six data sets (Zipf-cusp\_max(0.5,1.0), t-var($t$)), for
$t$=10,
100, 200, 300, 400, 500.
(2)
{\bf Zipf-b} (i.e., Zipf data, different bucket size),
containing the
four data sets (Zipf-cusp\_max(0.5,1.0), b-var($b$)), for $b$=100,
200, 500, 1000.
(3) {\bf Gauss-t} (i.e., Gauss data, different bucket density),
containing the six data sets (Gauss-rand, t-var($t$)), for $t$=10,
100, 200, 300, 400, 500.
(4)
{\bf Gauss-b} (i.e., Gauss data, different bucket size),
containing the
four data sets (Gauss-rand, b-var($b$)), for $b$=100, 200, 500,
1000.
(5)
{\bf Zipf-z} (i.e., Zipf data, different skew), containing the three
data sets (Zipf-cusp\_max($z$,1.0), p(20000,400,200)), for $z$=0.5,
1.0, 1.5. Recall that p(20000,400,200) denotes the population characterized
by
$c=20000, b=400, t=200$.

997:

998:

999:

1000: Each class of data sets is designed for studying the dependence of

1001: the accuracy of the various methods on a different parameter

1002: (parameter $t$ measuring the density of the bucket, parameter $b$

1003: measuring the size of the bucket and parameter $z$, measuring the

data skew). For each data set, 1000 different samples obtained by
permutation
of frequencies were generated and tested, in order to give
statistical significance to the experiments.

1008:

1009:

1010:

\noindent {\bf Query set and error metrics:} We perform all the
queries $S[d]$, for all  $1 \leq d < b$. We measure the
approximation error of the various estimation techniques on the
above query set by using both:
\begin{itemize}
\item
the {\em average relative
error} $\frac{1}{b-1}\sum_{d=1}^{b-1}e_d^{rel}$,
where $e_d^{rel}$ is the {\em relative error} of the query with
range $d$, i.e., $e_d^{rel}=\frac{\vert{S[d]-
\widetilde{S}[d]}\vert}{S[d]}$, and

\item
the {\em normalized absolute error}, that is, the ratio between the average
absolute error
and the overall sum of the frequencies in the bucket, i.e.,
$\sum_{d=1}^{b-1}\frac{\vert{S[d]- \widetilde{S}[d]}\vert}{c \cdot b}$,
\end{itemize}

1029: where $\widetilde{S}[d]$ is the value of $S[d]$ estimated by the

1030: technique at hand.
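
\noindent
As a small illustration of the two metrics (all values are hypothetical), consider a bucket with $b=4$ and $c=80$, with exact answers $S[1]=10$, $S[2]=30$, $S[3]=60$ and estimates $\widetilde{S}[1]=15$, $\widetilde{S}[2]=30$, $\widetilde{S}[3]=45$. Then
\[
\frac{1}{b-1}\sum_{d=1}^{b-1}e_d^{rel} = \frac{1}{3}\left(\frac{5}{10}+\frac{0}{30}+\frac{15}{60}\right) = 0.25,
\qquad
\sum_{d=1}^{b-1}\frac{\vert S[d]-\widetilde{S}[d]\vert}{c \cdot b} = \frac{5+0+15}{80 \cdot 4} = 0.0625 .
\]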

1031:

1032:

1033: \subsubsection{Results of Experiments and Discussion.}

1034:

1035: In this section we give a qualitative discussion about the approximation

1036: error

1037: of the considered methods, excluding USA and 1-biased, about which we have

1038: already

1039: provided a theoretical analysis in Section \ref{sec-Analysis}.

First we consider the methods that work simply by splitting the original
bucket, namely 2s, 4s and 8s.
For all these methods,
the estimation error may arise from the following approximation sources:

1045: \begin{enumerate}

1046:

1047: \item

the linear interpolation (i.e., CVA), concerning the evaluation of the query
inside
the ``smallest'' sub-buckets (for instance, in the case of 4s, the
smallest sub-buckets
are the quarters of the bucket),

1053:

1054: \item

1055:

the numeric approximation, in case sums are stored using fewer than 32 bits
(note that only 2s is not affected by this error).

1058:

1059: \end{enumerate}

We refer to the two components of the approximation error described above
as errors of type 1 and of type 2, respectively.

1062:

1063:

1064: \subsubsection*{Relative error vs data density.}

1065:

1066: Concerning error of type 1, what we expect is that, for all methods, it

1067: increases as

1068: data sparsity increases.

1069: Indeed, in case of sparse data, the sum tends to concentrate in a few

1070: points,

1071: and this reduces the suitability of linear interpolation to approximate

1072: the frequency distribution.

1073: Moreover, we expect that such a component of the error

1074: decreases as splitting degree increases:

1075: for instance,

1076: in case of 8s, which splits the bucket into 8 parts,

1077: we expect more accuracy (in terms of the error of type 1) than

1078: the 2s method. The reason is that having smaller sub-buckets

1079: means applying linear interpolation to shorter

1080: (and, thus, better linearly-approximable) segments of

1081: the cumulative frequency distribution.

1082:

1083:

1084: About error of type 2 we expect that both (i) it increases

1085: as the splitting degree increases and (ii) it is independent of

1086: data sparsity.

1087: Claim (i) is explained by considering that increasing the splitting degree

1088: means reducing the number of bits used for representing the sum of

1089: sub-buckets.

1090: Claim (ii) is related to the numeric nature of the error.

1091:

The observations above show the existence of a trade-off between the need to
increase the splitting degree for improving CVA precision on the one hand, and
the need to
use as many bits as possible for representing partial sums in the bucket
on the other hand.
However, we expect such a trade-off to be more evident in case of high
splitting degree,
that is, when the error of type 2 is more relevant.
For instance, recall that the maximum absolute error of type 2 is
$\frac{c}{2^{k+1}}$,
where $k$ is the number of bits assigned to the smallest sub-buckets, with
$k=4$ for 8s
and $k=8$ for 4s. For $c=20000$, the maximum absolute error of type 2
is 625 for 8s
(i.e., about 3\% of $c$), while it is about 39 (i.e., a negligible
percentage of $c$) for 4s.
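
\noindent
The two values above follow directly from the bound $\frac{c}{2^{k+1}}$ with $c=20000$:
\[
\mbox{8s } (k=4):\ \frac{20000}{2^{5}} = 625 \ (3.125\% \mbox{ of } c), \qquad
\mbox{4s } (k=8):\ \frac{20000}{2^{9}} \approx 39.06 \ (\mbox{about } 0.2\% \mbox{ of } c).
\]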

1108:

1109: \begin{figure}[ht]

1110: \begin{center}

1111: \begin{tabular}{c}

1112: \epsfig{file=fig3a.eps,width=9cm} \\

1113: {\bf (a)}: Error for different values of $t$ \\

1114: \epsfig{file=fig3b.eps,width=9cm} \\

1115: {\bf (b)}: Error for different values of $b$

1116: \end{tabular}

1117: \end{center}

1118: \caption{Experimental Results for data sets Zipf}\label{fig1}

1119: \end{figure}

1120:

Experiments confirm the above considerations. Looking at the graphs of
Figure \ref{fig1}.(a), we may observe that for 2s and 4s the error
decreases as the data density increases. On the contrary, for
8s, the error is quasi-constant (slightly increasing) in the case of
Zipf distributions, while it is slightly decreasing (but much less
quickly than for 4s) in the case of the Gauss distribution (see Figure
\ref{fig2}.(a)). Concerning the comparison between 2s, 4s and 8s,
we may observe in Figure \ref{fig1}.(a) that for low values of
data density, as expected, the accuracy of 8s is higher than that of 4s and,
in turn, the accuracy of 4s is higher than that of 2s. But, as observed above,
for increasing data density, the trends of 4s and 8s are affected, to a
different extent, by the error of type 2. This
appears quite evident in Figure \ref{fig1}.(a), where we may note
that 8s becomes worse than 4s from about 210 non-null elements on,
and that the improving trend of 2s is considerably faster than that of the
other methods (since 2s does not suffer from the error of type 2).

1137:

We observe that USA
gives better estimations than 1b on Zipf data (see Figure
\ref{fig1}.(a)).
This suggests that the
assumption made by USA is suitable for particular
distributions of frequencies and spreads, like those of the data sets
Zipf-t. The results obtained on data sets following a
Gauss distribution confirm this claim: the accuracy of USA
becomes the worst when the data sets have a random distribution, as
happens for Gauss-t (see Figure \ref{fig2}.(a)).

1149:

Concerning 1b, we may observe that the behaviours of 1b
and 2s are similar. As expected, the exploitation of the

1152: information that the bucket is 1-biased does not give a

1153: significant contribution to the accuracy of the estimation.

1154: Indeed, the knowledge of the position of just one element in the

1155: bucket does not add in general appreciable information.

1156:

1157:

Consider now the usage of the tree indices 3LT and 4LT. Recall
that 3LT has the same splitting degree as 4s, since both methods
divide the bucket into 4 sub-buckets. A possible difference in
accuracy between the two methods may arise from the error of type 2.
Indeed, the tree-like organization of the indices allows us to
represent the sum inside a given sub-bucket corresponding to a
node of the tree as a fraction of the sum contained in the parent
node, instead of the entire sum (as happens for the ``flat''
methods).
Thus, we expect tree indices to produce smaller
errors of type 2. However, as previously noted, 4s produces a
negligible percentage of error of type 2. This explains why
3LT and 4s basically present
the same error (the lines in the graphs almost entirely
overlap).

1173:

4LT has the same splitting degree as 8s (since both methods divide the
bucket into 8 sub-buckets). As a consequence, since the error of
type 2 of 8s is appreciable (as already discussed), we may expect
improvements from the usage of 4LT. This is indeed what the
experiments show. 4LT has the best performance: it shows only the benefits
deriving from the increase in data density (which produces a
reduction of the error of type 1), with no appreciable increase in the
error of type 2. Thanks to the tree-like organization of the
sums, 4LT seems to resolve the trade-off between increasing the splitting
degree (for improving CVA precision) and controlling the numeric error
arising from the usage of a reduced number of bits for
representing sums.

1186:

1187:

1188: \subsubsection*{Relative error vs bucket size and

1189: data skew.}

1190:

First consider populations b-var. Recall that for such data sets
we have kept the data density constant at 20\%. Thus,
increasing the bucket size also means increasing the number of non-null
elements. While, as in the previous experiments, the error of type 2 is
independent of the bucket size (even though all the above considerations
about the relationship between error of type 2, splitting degree
and number of bits per smallest sub-bucket are still valid), we
expect CVA precision to be affected by the variation of the bucket
size. Indeed, on the one hand, the CVA precision decreases as the
bucket size increases, since, for a larger bucket, linear
interpolation is applied to a larger segment of the cumulative
frequency. On the other hand, increasing the bucket size means
increasing the number of non-null elements (keeping the overall sum constant),
and this reduces the probability that the sum is
concentrated into a few peaks. Thus, whenever the cumulative
frequency is smooth, linear interpolation tends to give better
results. Depending on the data distribution, we may observe either
that the two opposite components compensate each other or that one
prevails over the other. Indeed, experiments with Zipf data,
corresponding to Figure \ref{fig1}.(b), show that the methods have a
quasi-constant trend (with a slight prevalence of the first
component), while experiments conducted on Gauss data,
corresponding to Figure \ref{fig2}.(b), show a net prevalence of the
second component (all the methods present a decreasing trend for
increasing bucket size). Such experiments do not give new
information about the comparison between the considered methods,
substantially confirming the previous results. Again, 4LT has the
best performance.

1219:

1220:

1221: \begin{figure}[ht]

1222: \begin{center}

1223: \begin{tabular}{c}

1224: \epsfig{file=fig4a.eps,width=9cm} \\

1225: {\bf (a)}: Data sets Gauss-t: error for different values of $t$ \\

1226: \epsfig{file=fig4b.eps,width=9cm} \\

{\bf (b)}: Data sets Gauss-b: error for different values of $b$

1228: \end{tabular}

1229: \end{center}

1230: \caption{Experimental Results for data sets Gauss} \label{fig2}

1231: \end{figure}

1232:

1233:

1234:

The results of the experiments conducted on the class of data sets Zipf-z,
measuring the dependence of the accuracy of the methods on the
data skew, are reported in Figure \ref{fig3}. We note that all
methods become worse as $z$ increases (as can be
intuitively expected).
The behaviours of 1b and 2s are similar, while 4LT shows the best
performance.

1242:

1243:

As a final remark, we may summarize the comparison between the
considered methods by concluding that the worst method is always 2s.
For sparse data it is followed by 3LT and 4s and then by 8s, whereas,
on the contrary, for dense data 3LT and 4s show better performance than
8s. Observe that 4s and 3LT have basically the same accuracy. The
best method is definitely 4LT.

1250:

1251:

1252:

1253:

1254: \begin{figure}[h]

1255: \begin{center}

1256: \begin{tabular}{c}

1257: \epsfig{file=fig5.eps,width=9cm}

1258: \end{tabular}

1259: \end{center}

1260: \caption{Data sets Zipf-z: dependence on data skew} \label{fig3}

1261: \end{figure}

1262:

1263:

1264:

1265:

1266:

1267: \section{Applying the 4LT Index to the Entire Histogram}\label{sec-Improved}

1268:

The analysis described in the previous sections suggests applying
the 4-level tree index to a whole histogram in
order to improve its accuracy in approximating the
underlying frequency set.
We stress that
investigating whether such an addition is really convenient
is not straightforward: observe that 4LT buckets use 32 bits more than CVA ones and, hence, for a fixed storage
space, allow a smaller number of buckets.
In this section we show how to combine
the 4LT technique with classical methods for constructing
histograms, and we perform a large number of experiments to measure
the actual improvement given by the usage of 4LT.
The advantage of the 4LT index is shown to be relevant also when
it is compared with buckets using CVA,
that is, when the storage space required by
a 4LT bucket is larger than that of the original method.
Moreover, the 4LT index shows very good performance
when it is combined with a very
simple method for constructing histograms, called EquiSplit,
which consists in partitioning the attribute domain into equal-size
buckets.

1290: Let us start with a quick overview of the most relevant methods

1291: proposed so far for the construction of histograms.

1292:

1293: \subsection{Methods for Constructing Histograms}

1294:

Besides the method used for approximating frequencies inside
buckets, the capability of a histogram to accurately approximate
the underlying frequency set strongly depends on the way such a
set is partitioned into buckets. Typically, the criterion driving the
construction of a histogram is the minimization of the error in
reconstructing the original (cumulative) frequency set from
the histogram. The partition rules proposed in
\protect\cite{Poosala96Improved,Jagadish98Optimal} try to achieve
this goal. Among those, we sketch two
well-known approaches: {\em MaxDiff} and {\em V-Optimal} (see
\protect\cite{Poosala96Improved,Poo97} for an exhaustive
taxonomy). Note that these methods are defined for 2-histograms
but are in practice mainly used for 1-histograms to minimize
storage consumption.

1309:

1310:

1311: \noindent {\bf MaxDiff.} A MaxDiff histogram

1312: \protect\cite{Cri81,Poosala96Improved} of size $h$ is obtained by

1313: putting a boundary between two adjacent attribute values $v_i$ and

1314: $v_{i+1}$ of $V$ if the difference between $f(v_{i+1}) \cdot

1315: \sigma_{i+1}$ and $f(v_{i}) \cdot \sigma_{i}$ is one of the $h-1$

1316: largest such differences (where $\sigma_i$ denotes the spread of

$v_i$). The product $f(v_{i}) \cdot \sigma_{i}$ is called the {\em
area} of $v_i$.
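
\noindent
As an illustrative sketch with hypothetical values (and assuming that differences are taken in absolute value), let five consecutive attribute values $v_1,\ldots,v_5$ have areas $10, 12, 40, 42, 5$. The differences between adjacent areas are $2, 28, 2, 37$. For $h=3$, the $h-1=2$ largest differences are $37$ and $28$, so boundaries are placed between $v_2$ and $v_3$ and between $v_4$ and $v_5$, yielding the buckets $\{v_1,v_2\}$, $\{v_3,v_4\}$ and $\{v_5\}$.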

1319:

1320:

\noindent {\bf V-Optimal.} A V-Optimal histogram
\protect\cite{Poosala96Improved,Jagadish98Optimal} gives very good
performance. It is obtained by selecting the boundaries of each
bucket, $inf_i$ and $sup_i$, $1 \leq i \leq n$, so that
$\sum_{i=1}^n SSE_i$ is minimal, where $SSE_i =
\sum_{j=inf_i}^{sup_i} (f(j)-avg_i)^2$ and $avg_i$ is the
average frequency in the $i$-th bucket, that is, the cumulative
frequency of the whole bucket divided by its size $sup_i -
inf_i+1$.
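
\noindent
As an illustrative sketch (the frequencies are hypothetical), let $f(1)=2$, $f(2)=4$, $f(3)=10$, $f(4)=12$ and $n=2$. The three possible partitions into two non-empty contiguous buckets give
\[
\begin{array}{lll}
\{1,2\},\{3,4\}: & avg_1=3,\ avg_2=11, & \sum_i SSE_i = 1+1+1+1 = 4,\\
\{1\},\{2,3,4\}: & avg_1=2,\ avg_2=26/3, & \sum_i SSE_i = 0+104/3 \approx 34.67,\\
\{1,2,3\},\{4\}: & avg_1=16/3,\ avg_2=12, & \sum_i SSE_i = 104/3+0 \approx 34.67,
\end{array}
\]
so the first partition is the one selected by the V-Optimal criterion.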

1330:

We now propose to combine both methods, MaxDiff and V-Optimal, with
the 4LT index in order to have an approximate representation of
the frequency distribution inside the buckets. We shall compare the
so-revised methods with the original ones with CVA estimation, for
the same storage consumption. The results will show that the 4LT
index considerably increases the estimation accuracy of both methods.
The additional estimation power carried by the 4LT index even
enables a very simple method like the one described below to
produce very accurate estimations.

1340:

\noindent {\bf EquiSplit.} The attribute domain is split into $k$
buckets of approximately the same size $b=\lceil m/k \rceil$. In
this way, as the boundaries of all buckets can be easily
determined from the value $b$, we only need to store one value for
each bucket: the sum of all its frequencies. This method was
first introduced in \protect\cite{Cri81} and, as the experimental
analysis will confirm, it performs very well on data with low
skew, while its performance degrades for highly skewed
data.
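
\noindent As a sketch of how the boundaries can be recovered from $b$
alone, assume (for this illustration only) that $m$ denotes the number
of values in the attribute domain and that these values are numbered
$1,\ldots,m$. Then the $j$-th bucket, $1 \leq j \leq k$, covers the
positions
\[
(j-1)\,b + 1, \ \ldots, \ \min(j\,b,\ m) .
\]
For instance, with the hypothetical values $m = 100$ and $k = 8$ we
get $b = \lceil 100/8 \rceil = 13$; the first seven buckets cover $13$
positions each ($1$--$13$, $14$--$26$, \ldots, $79$--$91$), while the
last one covers the remaining $9$ positions $92$--$100$.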

1350:

1351:

1352:

1353: \subsection{Experiments on Histograms}\label{exp}

1354:

1355: In this section we shall conduct several experiments both on

1356: synthetic and real-life data in order to compare the effectiveness of

1357: several histograms in estimating range query size.

1358:

1359: \subsubsection*{Experiments on Synthetic Data.}

1360: First we present the experiments performed on synthetic data.

1361: Below we describe data sets, error metrics and the query set

1362: considered in our experiments.

1363:

\noindent {\bf Available Storage:} Note that under CVA each bucket
stores only two integers, while with the 4LT index each bucket
needs three integers. Assuming that an integer takes 32 bits, and
given a fixed number $K$ of bits for the total storage
space of the whole histogram, both MaxDiff and V-Optimal
under CVA produce $\lfloor \frac{K}{64} \rfloor$ buckets, while
both of them with 4LT indices produce only $\lfloor \frac{K}{96}
\rfloor$ buckets. On the other hand, a bucket for EquiSplit needs
just one integer (the sum of all the frequencies), while for
EquiSplit-4LT it needs two integers. Thus, for the same $K$,
EquiSplit with CVA produces
$\lfloor \frac{K}{32} \rfloor$ buckets and EquiSplit with 4LT indices
produces $\lfloor \frac{K}{64} \rfloor$ buckets, the same number as
MaxDiff with CVA.

1377:

For our experiments, we shall use a storage space of $42$
four-byte numbers, to be in line with the experiments reported in
\protect\cite{Poosala96Improved,Jagadish98Optimal}, which we
replicate. From the above considerations, it follows
that MaxDiff with CVA, V-Optimal with CVA, and EquiSplit
with 4LT indices produce 21 buckets, EquiSplit with CVA produces
42 buckets, and both MaxDiff and V-Optimal with 4LT indices produce
only 14 buckets.
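
\noindent For the reader's convenience, these bucket counts can be
checked against the storage formulas given above. With $42$
four-byte numbers, the available space is $K = 42 \cdot 32 = 1344$
bits, so that
\[
\left\lfloor \frac{1344}{64} \right\rfloor = 21, \qquad
\left\lfloor \frac{1344}{96} \right\rfloor = 14, \qquad
\left\lfloor \frac{1344}{32} \right\rfloor = 42 .
\]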

1386:

1387:

1388:

\noindent {\bf Data Distributions:} A data distribution is
characterized by a distribution for frequencies and a distribution
for spreads. The frequency set and the value set are generated
independently; then frequencies are randomly assigned to the
elements of the value set. We consider 5 data distributions: ({\bf
1}) $D_1=$ {\em Zipf-$cusp\_max$(0.5,1.0)}. ({\bf 2}) $D_2=$ {\em
Zipf-zrand(0.5,1.0)}: frequencies are distributed according to a
Zipf distribution with the $z$ parameter equal to $0.5$; spreads
follow a $ZRand$ distribution \protect\cite{Poo97} with $z$
parameter equal to $1.0$ (i.e., spreads following a Zipf
distribution with $z$ parameter equal to $1.0$ are randomly
assigned to attribute values). ({\bf 3}) $D_3=$ {\em Gauss-rand}:
frequencies are distributed according to a Gauss distribution with
standard deviation $1.0$; spreads are randomly distributed. ({\bf
4}) $D_4=$ {\em Zipf-$cusp\_max$(1.5,1.0)}. ({\bf 5}) $D_5=$ {\em
Zipf-$cusp\_max$(3.0,1.0)}.

1405:

1406:

\noindent {\bf Histogram Populations:} A population is
characterized by the values of three parameters, $T$, $D$
and $t$, and represents the set of histograms storing a relation of
cardinality $T$, attribute domain size $D$ and value set size $t$
(i.e., number of non-null attribute values).

1412:

1413: \noindent

1414: {\em Population $P_1$.}

1415: This population is characterized by the following values for the

1416: parameters: $D=4100$, $t=500$ and $T=100000$.

1417:

1418: \noindent

1419: {\em Population $P_2$.}

1420: This population is characterized by the following values for the

1421: parameters: $D=4100$, $t=500$ and $T=500000$.

1422:

1423: \noindent

1424: {\em Population $P_3$.}

1425: This population is characterized by the following values for the

1426: parameters: $D=4100$, $t=1000$ and $T=500000$.

1427:

1428:

1429: \noindent

{\bf Data Sets:} Similarly to the experiments inside
buckets, each data set included in the experiments is obtained by
generating, under one of the data distributions described above,
$10$ histograms belonging to one of the populations specified
above. We consider the 15 data sets that are generated by
combining all data distributions and all populations.
All queries belonging to the query set below are evaluated over
the histograms of each data set.

1438:

1439: \noindent

{\bf Query set and error metrics:} In our experiments, we use the
query set $\{X\leq d :d\in \U \}$ (recall that $X$ is the
histogram attribute and $\U$ is its domain) for evaluating the
effectiveness of the various methods. We measure the
approximation error made by histograms on the above query set by using
the {\em average} of the {\em relative error},
$\frac{1}{Q}\sum_{i=1}^Qe_i^{rel}$,
where $Q$ is the cardinality of the query set and $e_i^{rel}$ is
the {\em relative error}, i.e.,
$e_i^{rel}=\frac{\vert{S_i-\widetilde{S}_i}\vert}{S_i}$,
where $S_i$ and $\widetilde{S}_i$ are the actual answer and the
estimated answer of the $i$-th query of the query set.
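
\noindent As a small worked illustration of this metric (with
hypothetical answers that are not taken from the experiments),
consider $Q = 3$ queries with actual answers
$S_1 = 100$, $S_2 = 250$, $S_3 = 400$ and estimated answers
$\widetilde{S}_1 = 90$, $\widetilde{S}_2 = 260$,
$\widetilde{S}_3 = 400$. Then
\[
e_1^{rel} = \frac{|100 - 90|}{100} = 0.10, \qquad
e_2^{rel} = \frac{|250 - 260|}{250} = 0.04, \qquad
e_3^{rel} = \frac{|400 - 400|}{400} = 0,
\]
and the average relative error is
\[
\frac{1}{3}\left(0.10 + 0.04 + 0\right) \approx 0.047 .
\]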

1452:

1453:

1454: \subsubsection{Results of the Experiments.} In Tables

1455: \ref{table-1}, \ref{table-2} and \ref{table-3} the results of

1456: experiments conducted on all data sets are reported. We denote the

1457: methods MaxDiff, V-Optimal and EquiSplit with CVA by MD, VO and

1458: ES, respectively; these methods with 4LT indices are denoted by

1459: MD\_4LT, VO\_4LT, ES\_4LT.

1460:

1461:

1462: \begin{table}

1463: \begin{center}

1464:

1465: \begin{tabular}[h]{|c|c|c|c|c|c|}

1466: \hline\hline

1467:

1468: $method/distr.$ &  $D_1$ &  $D_2$ &  $D_3$ & $D_4$ & $D_5$

1469:

1470:

1471:

1472: \\ \hline

1473:

1474: $ES$& $0.79$& $1.69$& $10.61$& $3.89$& $57.63$

1475:

1476: \\ \hline

1477:

1478: $ES\_4LT$& $0.29$& $0.84$& $2.01$& $2.89$& $29.63$

1479:

1480: \\ \hline

1481:

1482: $MD$& $4.29$& $19.37$& $11.65$& $7.02$& $31.46$

1483:

1484: \\ \hline

1485:

1486: $MD\_4LT$& $0.70$& $1.57$& $3.14$& $1.92$& $4.39$

1487:

1488: \\ \hline

1489:

1490: $VO$& $1.43$& $5.55$& $10.6$& $5.16$& $21.57$

1491:

1492: \\ \hline

1493:

1494: $VO\_4LT$& $0.29$& $1.33$& $2.32$& $1.62$& $3.15$

1495:

1496: \\ \hline\hline

1497:

1498: \end{tabular}

1499:

1500: \end{center}

1501:

1502: \caption{Pop. 1: error for various methods.}

1503: \label{table-1}

1504: \end{table}

1505:

1506:

1507: \begin{table}

1508: \begin{center}

1509:

\begin{tabular}[h]{|c|c|c|c|c|c|}

1511: \hline\hline

1512:

1513: $method/distr.$ &  $D_1$ &  $D_2$ &  $D_3$ & $D_4$ & $D_5$

1514:

1515:

1516:

1517: \\ \hline

1518:

1519: $ES$& $0.76$& $1.78$& $4.83$& $3.63$& $59.74$

1520:

1521: \\ \hline

1522:

1523: $ES\_4LT$& $0.28$& $0.84$& $6.40$& $1.40$& $31.12$

1524:

1525: \\ \hline

1526:

1527: $MD$& $5.79$& $16.04$& $6.65$& $13.56$& $33.51$

1528:

1529: \\ \hline

1530:

1531: $MD\_4LT$& $0.80$& $1.60$& $2.32$& $2.36$& $4.87$

1532:

1533: \\ \hline

1534:

1535: $VO$& $1.68$& $5.96$& $6.16$& $7.25$& $18.10$

1536:

1537: \\ \hline

1538:

1539: $VO\_4LT$& $0.32$& $1.41$& $4.85$& $1.53$& $3.12$

1540:

1541:

1542: \\ \hline\hline

1543:

1544: \end{tabular}

1545:

1546: \end{center}

1547:

1548: \caption{Pop. 2: error for various methods.}

1549: \label{table-2}

1550: \end{table}

1551:

1552: \begin{table}

1553: \begin{center}

1554:

\begin{tabular}[h]{|c|c|c|c|c|c|}

1556: \hline\hline

1557:

1558: $method/distr.$ &  $D_1$ &  $D_2$ &  $D_3$ & $D_4$ & $D_5$

1559:

1560:

1561: \\ \hline

1562:

1563: $ES$& $0.47$& $0.87$& $2.31$& $7.54$& $66.41$

1564:

1565: \\ \hline

1566:

1567: $ES\_4LT$& $0.27$& $0.35$& $1.14$& $3.59$& $25.01$

1568:

1569: \\ \hline

1570:

1571: $MD$& $8.37$& $2.89$& $3.30$& $3.46$& $25.01$

1572:

1573: \\ \hline

1574:

1575: $MD\_4LT$& $0.70$& $0.59$& $1.33$& $1.79$& $2.02$

1576:

1577: \\ \hline

1578:

1579: $VO$& $1.77$& $2.16$& $2.82$& $3.37$& $7.78$

1580:

1581: \\ \hline

1582:

1583: $VO\_4LT$& $0.32$& $0.56$& $1.24$& $1.68$& $1.82$

1584:

1585: \\ \hline\hline

1586:

1587: \end{tabular}

1588:

1589: \end{center}

1590:

1591: \caption{Pop. 3: error for various methods.}

1592: \label{table-3}

1593: \end{table}

1594:

1595:

The relative behavior of the various methods is
similar for the three populations. The experiments confirm the good
performance of the MaxDiff method and, particularly, of V-Optimal,
but they also show that 4LT brings relevant
benefits to both methods. Indeed, MD\_4LT and VO\_4LT show very low errors.
EquiSplit and EquiSplit-4LT also perform well. However, as shown
in Figure \ref{fig-5}.(a), where the dependence of the estimation
error on data skew is plotted, these methods quickly get worse for
high data skew. Indeed, in such cases, the benefit given by the
higher number of buckets is lost because of the high skew inside
buckets. In case of high skew, partition rules play a central
role, and the naive approach of EquiSplit is not suitable.
Interestingly, we observe that the improvement of MaxDiff and
V-Optimal obtained by using 4LT indices is relevant also for high
skew, which supports the effectiveness of such indices. In Figure \ref{fig-5}.(b)
we show the dependence of the accuracy of the methods on the amount of
space.
There, we consider the data distribution $D_4$ and the population
$P_1$, and generate 10 histograms belonging to $P_1$ according to
$D_4$ for different amounts of space. The aim of this experiment
is to study the behaviour of the various methods as the compression factor increases.
Clearly, when the available amount of space
increases, all methods behave well. The differences are more
relevant for values corresponding to high compression, where the methods
using 4LT are the best. This can be intuitively explained by
considering that, in case of large buckets, the
approximation technique used inside buckets becomes more important than
the rules followed for constructing buckets.

1624:

1625:

1626: \begin{figure}[h]

1627: \begin{center}

1628: \begin{tabular}{c@{\hspace{0.6cm}}c}

1629: \epsfig{file=fig6a.eps,width=9cm} \\

1630: {\bf (a)}: Dependence of the accuracy on the data skew \\

1631: \epsfig{file=fig6b.eps,width=9cm} \\

1632: {\bf (b)}: Dependence of the accuracy on the representation \\

1633:   size (i.e., number of stored 4-byte integers)

1634: \end{tabular}

1635: \end{center}

1636: \caption{Experimental Results}

1637: \label{fig-5}

1638: \end{figure}

1639:

1640:

1641:

1642: \subsubsection*{Experiments on Real-Life Data.}

1643: We have performed further experiments using real-life data. We

1644: have considered two data sets (that we denote by Data Set A and

1645: Data Set B) obtained from the {\em 1997 U.S. Census Statistics}

1646: \protect\cite{Census}, by choosing two attributes of the table

1647: {\em Special District Governments}, having the following

1648: characteristics:

1649:

1650: \noindent

1651: {\bf Data Set A:}

1652: attribute name: {\em Type Code},

1653: domain size: $D= 998$,

1654: number of non-null attribute values: $t = 787$,

1655: cardinality: $T=34683$.

1656:

1657: \noindent

1658: {\bf Data Set B:}

1659: attribute name: {\em Function Code},

1660: domain size: $D= 99$,

1661: number of non-null attribute values: $t = 32$,

1662: cardinality: $T=34683$.

1663:

We use for each histogram the same amount of
storage space, that is $21$ four-byte numbers.
The query set and the error metric are the same as those used for the
experiments on synthetic data.
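
\noindent As a rough check, and assuming that the per-bucket storage
costs stated for the synthetic experiments carry over unchanged to
this setting (the bucket counts below are derived here and are not
reported explicitly for the real-life data), $21$ four-byte numbers
correspond to $K = 21 \cdot 32 = 672$ bits, so that
\[
\left\lfloor \frac{672}{64} \right\rfloor = 10, \qquad
\left\lfloor \frac{672}{96} \right\rfloor = 7, \qquad
\left\lfloor \frac{672}{32} \right\rfloor = 21 ,
\]
i.e., $10$ buckets for MD, VO and ES\_4LT, $7$ buckets for MD\_4LT and
VO\_4LT, and $21$ buckets for ES.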

1668:

1669:

1670: \begin{table}

1671:

1672: \begin{center}

1673:

1674: \begin{tabular}{|c|c|c|}

1675:   \hline

1676:   method & data set A & data set B \\

1677:   \hline

1678:   $ES$ & 4.32 & 7.02 \\

1679:    \hline

1680:   $ES\_4LT$ & 0.97 & 3.59 \\

1681:   \hline

1682:   $MD$ & 11.30 & 22.82 \\

1683:   \hline

1684:   $MD\_4LT$ & 1.63 & 1.25 \\

1685:   \hline

1686:   $VO$ & 4.49 & 17.19 \\

1687:   \hline

1688:   $VO\_4LT$ & 1.86 & 3.05 \\

1689:   \hline

1690:

1691:

1692:

1693: \end{tabular}

\caption{Errors obtained on real data.}
\label{realtable}

\end{center}
\end{table}

1698:

1699:

1700:

\noindent {\bf Results of the Experiments.} As shown in Table 4,
experiments on real data confirm the results obtained with
synthetic data. We note that 4LT brings relevant benefits to MaxDiff
and V-Optimal, and that both EquiSplit and EquiSplit-4LT perform
well. Not surprisingly, for Data Set A, EquiSplit-4LT produces the
smallest error. This can be explained
by considering that the data of this set are rather uniform and, in this case, as
discussed previously, the cheapest technique (in terms of storage space) gives the best
performance. In other words, the extra storage space required by the more
sophisticated techniques for recording bucket boundaries does not pay off, due to the
trivial data distribution.

1712:

1713:

1714:

1715:

1716:

1717:

1718:

1719: \section{Conclusions}\label{sec-Conclusion}

1720:

In this paper we have presented a technique for improving the frequency estimation within
each bucket of a histogram. This technique goes beyond the simple methods used in the
literature, that is, the continuous value assumption and the uniform spread assumption.
Our method is based on the addition, to each bucket, of a 32-bit data item organized into a 4-level
tree index (4LT, for short) that stores, in a bit-saving approximate form, a number of hierarchical range queries
internal to the bucket. We have shown both theoretically and experimentally that such additional
information effectively allows us to better estimate range queries inside buckets.
Interestingly, using 4LT on top of histograms built through well-known techniques like
MaxDiff and V-Optimal outperforms the original histograms in terms of accuracy.
This claim is supported in the paper by a large number of experiments conducted on both synthetic
and real-life data, where classical histograms combined with 4LT are compared
with the standard versions (i.e., with no 4LT) under several
different data distributions and at equal storage consumption.
It turns out that the price we have to pay
in terms of storage space, by consuming 32 bits more per bucket
w.r.t. CVA-based histograms, is outweighed by the benefits given
by the improved precision in estimating
queries inside buckets.
Thus, the main conclusion we draw is that the 4LT index may represent a general technique
that can be combined with any bucket-based histogram to significantly
improve its accuracy.

1742:

1743: {\footnotesize

1744: \bibliography{isto}

1745:

1746: \bibliographystyle{plain}

1747: }

1748:

1749: \end{document}

1750: