0611:cs0611031/geo.tex

1: \documentclass{article}

2:

3: \usepackage[left=1in,right=1in,top=1in,bottom=1in]{geometry}

4: \usepackage[singlespacing]{setspace}

5: \usepackage{amssymb}

6: \usepackage{amsfonts}

7: \usepackage{amsmath}

8: \usepackage{psfrag}

9: \usepackage{graphicx}

10: \usepackage{afterpage}

11: \usepackage{rotating}

12: \usepackage{calc}

13: \usepackage{url}

14: \usepackage{natbib}

15:

16: \def\lfp{\mathop{\hbox{\it lfp}}}

17: \def\impl{\mathrel{\hbox{~~:---~~}}}

18: \def\progstart{\singlespacing\begin{center}\begin{minipage}{.95\textwidth}\small\noindent\rule[0pt]{\linewidth}{0.4pt}\vspace{6pt} \\}

19: \def\progend{\rm\rule[6pt]{\linewidth}{0.4pt} \\ \end{minipage}\end{center}\doublespacing}

20:

21: %Included for Gather Purpose only:

22: %input "h:\unl\bibtex\all.bib"

23:

24: \newtheorem{theorem}{Theorem}

25: \newtheorem{acknowledgement}[theorem]{Acknowledgement}

26: \newtheorem{algorithm}[theorem]{Algorithm}

27: \newtheorem{axiom}[theorem]{Axiom}

28: \newtheorem{case}[theorem]{Case}

29: \newtheorem{claim}[theorem]{Claim}

30: \newtheorem{conclusion}[theorem]{Conclusion}

31: \newtheorem{condition}[theorem]{Condition}

32: \newtheorem{conjecture}[theorem]{Conjecture}

33: \newtheorem{corollary}[theorem]{Corollary}

34: \newtheorem{criterion}[theorem]{Criterion}

35: \newtheorem{definition}[theorem]{Definition}

36: \newtheorem{example}[theorem]{Example}

37: \newtheorem{exercise}[theorem]{Exercise}

38: \newtheorem{lemma}[theorem]{Lemma}

39: \newtheorem{notation}[theorem]{Notation}

40: \newtheorem{problem}[theorem]{Problem}

41: \newtheorem{proposition}[theorem]{Proposition}

42: \newtheorem{remark}[theorem]{Remark}

43: \newtheorem{solution}[theorem]{Solution}

44: \newtheorem{summary}[theorem]{Summary}

45: \newenvironment{proof}[1][Proof]{\noindent\textbf{#1.} }{\ \rule{0.5em}{0.5em}}

46:

47:

48: \begin{document}

49:

50: \author{Scot Anderson\\ sanderson@southern.edu\\ Southern Adventist University, Tennessee

51: \and Peter Revesz\\ revesz@cse.unl.edu\\ University of Nebraska-Lincoln}

52:

53: \title{Efficient Threshold Aggregation of Moving Objects}

54:

55: \date{}

56:

57: \maketitle

58:

59: \begin{abstract}

60: Calculating aggregation operators of moving point objects, using

61: time as a continuous variable, presents unique problems when

62: querying for congestion in a moving and changing (or dynamic) query

63: space. We present a set of congestion query operators, based on a

64: threshold value, that estimate the following $5$ aggregation

65: operations in $d$ dimensions. 1) We call the count of point objects

66: that intersect the dynamic query space during the query time

67: interval, the \textsc{CountRange}. 2) We call the Maximum (or

68: Minimum) congestion in the dynamic query space at any time during

69: the query time interval, the \textsc{MaxCount} (or

70: \textsc{MinCount}). 3) We call the sum of time that the dynamic

71: query space is congested, the \textsc{ThresholdSum}. 4) We call the

72: number of times that the dynamic query space is congested, the

73: \textsc{ThresholdCount}. And 5) we call the average length of time

74: of all the time intervals when the dynamic query space is congested,

75: the \textsc{ThresholdAverage}. These operators rely on a novel

76: approach to transforming the problem of selection based on position

77: to a problem of selection based on a threshold. These operators can

78: be used to predict concentrations of migrating birds that may carry

79: disease such as Bird Flu and hence the information may be used to

80: predict high risk areas. On a smaller scale, those operators are

81: also applicable to maintaining safety in airplane operations. We

82: present the theory of our estimation operators and provide

83: algorithms for exact operators. The implementations of those

84: operators, and experiments, which include data from more than 7500

85: queries, indicate that our estimation operators produce fast,

86: efficient results with error under 5\%.

87: \end{abstract}

88:

89: \section{Introduction}

90:

91: \label{intro:ST}

92:

93: Safety can often be reduced to a problem of congestion. The

94: safety of flight depends on separation of airplanes or more

95: generally the maximum number of airplanes that a particular airspace

96: can safely contain, and the maximum number of airplanes that air

97: traffic controllers (ATC) responsible for directing airplanes can

98: safely track. When considering epidemics, the presence of a single

99: animal with Bird Flu does not indicate the start of an

100: epidemic. Instead the presence of a certain number of instances of

101: the disease indicates a high risk of starting an epidemic, or actual

102: epidemic conditions. Consequently, we see that congestion often

103: links to safety and can predict high risk or even dangerous

104: conditions.

105:

106: Congestion is defined differently depending on the application.

107: Hence it is necessary to provide aggregation operators that take a

108: threshold value as a parameter to define congestion.

109:

110: In relational databases, \textsc{Max}, \textsc{Min}, \textsc{Count},

111: \textsc{Sum} and \textsc{Average} form the set of natural

112: aggregation-operators. Spatiotemporal databases containing moving

113: objects, based on continuous time, cannot apply these operators in

114: the same way. However, these operators may still function in

115: interesting ways for moving objects. For example, one can ask how

116: many moving point objects exist within a moving and changing (or

117: \emph{dynamic}) rectangular area \emph{at a certain time}, or what

118: is the maximum distance between two moving points \emph{at certain

119: times}. Obviously, when we are interested in discrete time

120: instances, then the moving point object database can be reduced to a

121: relational database and the above queries can be expressed as simple

122: \textsc{Count} or \textsc{Max} queries.

123:

124: Moving object databases naturally suggest new aggregate operators

125: that have no equivalents in relational databases. For example, one

126: may ask what is the maximum number of moving-point objects that

127: exist simultaneously within a dynamic rectangular area at any time

128: during a time interval $T$? We call this the \textsc{MaxCount} query

129: (symmetrically, we can also find the \textsc{MinCount}). One may

130: also ask during what time intervals in $T$ do there exist more

131: than $M$ moving objects within a rectangular area? We call this the

132: \textsc{ThresholdRange}. We show that a strong relationship exists

133: between \textsc{MaxCount} and \textsc{ThresholdRange}, and we show

134: that \textsc{ThresholdRange} forms the basis for a family of

135: threshold operators that include: \textsc{ThresholdCount},

136: \textsc{ThresholdSum}, and \textsc{ThresholdAverage}. A related,

137: though less complex, operator answers the question: what is the

138: number of moving objects that exist within or intersect a dynamic

139: rectangular area at any time instance during interval $T$. We call

140: this type of query the \textsc{CountRange} query.

141:

142: We give the following definitions for aggregation operators:

143:

144: \begin{definition}[Dynamic Query Space]

145: \label{def:DynamicQuerySpace} Dynamic query space is defined by a

146: continuous time interval $T$, and a $d$-dimensional space that may

147: move and change size or shape over the query time interval.

148: \end{definition}

149:

150: Throughout this paper we consider the shape of the query space to be

151: a box or cube.

152:

153: \begin{definition}[\textsc{MaxCount (MinCount)}]

154: \label{def:MaxCount} Let $S$ be a set of moving points. Given a

155: dynamic query space $R$ defined by two moving points $Q_1$ and $Q_2$

156: as the lower-left and upper-right corners of $R$, and a time

157: interval $T$, the \textsc{MaxCount} (\textsc{MinCount}) operator

158: finds the time $t_{\max(\min)}$ and maximum (or minimum) number of

159: points $M_{\max(\min)}$ in $S$ that $R$ can contain at any time

160: instance within $T$.

161: \end{definition}

162:

163: Throughout this paper we develop the \textsc{MaxCount} operator

164: because wherever we find a maximum, a minimum can be found

165: similarly.

166:

167: \begin{definition}[\textsc{ThresholdRange}]

168: \label{def:ThresholdRange}Let $S$ be a set of moving points. Given a

169: dynamic query space $R$ defined by two moving points $Q_1$ and $Q_2$

170: as the lower-left and upper-right corners of $R$, a time interval

171: $T$, and a threshold value $M$, the \textsc{ThresholdRange} operator

172: finds the set of time intervals $T_M$ where the count of objects in

173: $R$ is larger than $M$.

174: \end{definition}

175:

176: \textsc{ThresholdRange} is directly related to \textsc{MaxCount} in

177: that when $M$ is raised to $M_{\max}-1$, then \textsc{ThresholdRange}

178: returns a time interval containing $t_{\max}$ and during this time

179: interval, the count will be $M_{\max}$.
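
The link between the two operators can be illustrated with a small sweep
over time. The following Python sketch is not the estimation algorithm
developed in this paper. It assumes, purely for illustration, that for every
point we already know the single half-open time interval during which it lies
inside $R$; the names \texttt{presence}, \texttt{max\_count}, and
\texttt{threshold\_range} are hypothetical.

```python
def _events(presence):
    """Turn (enter, leave) intervals into sorted sweep events."""
    events = []
    for enter, leave in presence:
        events.append((enter, +1))
        events.append((leave, -1))
    # At equal times, exits (-1) sort before entries (+1), so intervals
    # that merely touch are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    return events


def max_count(presence):
    """Return (t_max, M_max): a time where the count peaks, and the peak."""
    best, best_t, count = 0, None, 0
    for t, delta in _events(presence):
        count += delta
        if count > best:
            best, best_t = count, t
    return best_t, best


def threshold_range(presence, m):
    """Return the maximal intervals during which the count is larger than m."""
    result, count, start = [], 0, None
    for t, delta in _events(presence):
        before = count
        count += delta
        if before <= m < count:        # count rises above the threshold
            start = t
        elif count <= m < before:      # count falls back to the threshold
            if t > start:
                result.append((start, t))
    return result


presence = [(0, 4), (1, 3), (2, 5), (6, 7)]
t_max, m_max = max_count(presence)
print(t_max, m_max)
print(threshold_range(presence, m_max - 1))
```

In this sketch, calling \texttt{threshold\_range} with a threshold one below
the maximal count returns an interval that contains the time reported by
\texttt{max\_count}.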

180:

181: \begin{definition}[\textsc{ThresholdCount}]

182: \label{def:ThresholdCount} Given a \textsc{ThresholdRange}, \textsc{%

183: ThresholdCount} returns the number of time intervals.

184: \end{definition}

185:

186: \begin{definition}[\textsc{ThresholdSum}]

187: \label{def:ThresholdSum} Given a \textsc{ThresholdRange}, \textsc{%

188: ThresholdSum} returns the total time $T_s$ during which the count is above $%

189: M $. That is, summing over the intervals $T_i \in T_M$, \textsc{ThresholdSum} returns:

190: \begin{equation}

191: T_s = \displaystyle{\sum\limits_i}|T_i|

192: \end{equation}

193: where $|T_i|$ means the length of the interval.

194: \end{definition}

195:

196: \begin{definition}[\textsc{ThresholdAverage}]

197: \label{def:ThresholdAverage} Given a \textsc{ThresholdRange}, \textsc{%

198: ThresholdAverage} returns the average length of the intervals in

199: $T_M$.

200: \end{definition}
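
Once the set of intervals $T_M$ is available, the three derived operators are
simple folds over it. The Python sketch below assumes $T_M$ is given as a
list of disjoint \texttt{(start, end)} pairs. The function names are
hypothetical, and returning $0$ as the average of an empty set is our
assumption, since the definitions above do not cover that case.

```python
def threshold_count(intervals):
    """Number of intervals in T_M."""
    return len(intervals)


def threshold_sum(intervals):
    """Total length of all intervals in T_M."""
    return sum(end - start for start, end in intervals)


def threshold_average(intervals):
    """Average interval length; 0.0 for an empty T_M (an assumption)."""
    if not intervals:
        return 0.0
    return threshold_sum(intervals) / len(intervals)


t_m = [(1, 4), (6, 8)]
print(threshold_count(t_m), threshold_sum(t_m), threshold_average(t_m))
```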

201:

202: In addition to the threshold aggregation operators, we also use our

203: bucketing method to implement the \textsc{CountRange} defined as

204: follows.

205:

206: \begin{definition}[\textsc{CountRange}]

207: \label{def:SpatioTemporalRangeCount} Let $S$ be a set of moving

208: points. Given a dynamic query space $R$ defined by two moving points

209: $Q_1$ and $Q_2$ as the lower-left and upper-right corners of $R$ and

210: a time interval $T$, the \textsc{CountRange} query returns the total

211: number of points that intersect $R$ in $T$.

212: \end{definition}

213:

214: Together \textsc{MaxCount (MinCount)} and the threshold operators

215: form a complete set of threshold aggregation operators comparable to

216: the aggregation operators given in relational databases.

217:

218: The following examples use the simple concepts of flying to

219: demonstrate the use of a few of these threshold aggregation

220: operators.

221:

222: \begin{example}

223: \label{ex:MaxCount}\textrm{Airplanes are commonly modeled as

224: linearly moving objects with preestablished flight plans. Suppose,

225: at any time, at most a constant number $M$ of airplanes is allowed

226: to be in the O'Hare airspace to avoid congestion. Suppose also a new

227: airplane requests approval of its flight plan for entering the

228: O'Hare airspace between times $t_a$ and $t_b$. The air traffic

229: controllers can avoid congestion as follows. If after adding a new

230: flight plan, the \textsc{MaxCount} between $t_a$ and $t_b$ is still

231: at most $M$, then they can approve the flight. Otherwise, they

232: need to find some alternative path, and check it again against the

233: database. }

234:

235: \textrm{Air traffic controllers try to direct airplanes as linearly

236: moving objects for fuel efficiency, among other reasons. If they

237: recognize a developing congestion too late, then they often must

238: direct the airplane to fly in circles until the congestion has

239: cleared. That solution wastes fuel. On the other hand, if they

240: recognize the developing congestion early, then they can often

241: simply tell the airplane to change its speed, which saves fuel.

242: Therefore, it is important to identify congestions as early as

243: possible. We may identify congestions by using a \textsc{MaxCount}

244: query where a moving box around the airplane and a time interval

245: $[t_{a},t_{b}]$ define the query. If the \textsc{MaxCount} predicts

246: congestion, then the airplane's speed can be adjusted early in the

247: flight. }

248: \end{example}

249:

250: \begin{example}

251: \label{ex:ThresholdCount}\textrm{Suppose we want to alert pilots if

252: their current flight path takes them through at least one congested

253: region. }

254:

255: \textrm{The \emph{Traffic Alert/Collision Avoidance System (TCAS)} is a

256: system that provides similar functionality. TCASs only provide

257: alerts for current congestion, not predictive congestion. Although

258: TCASs were implemented in 1986, we continue to have mid-air

259: collisions and near misses indicating that the system still needs

260: improvement. \textsc{ThresholdRange} is a modification of

261: \textsc{MaxCount} that returns all predicted time intervals on the

262: flight path where the \textsc{Count} exceeds a given threshold.

263: Hence using \textsc{ThresholdRange} we can alert a pilot of

264: predicted congestions where more than $M$ other airplanes will be

265: within the space $B$ around the airplane. Predicting and avoiding

266: these areas can significantly reduce the chances of mid-air

267: collisions. }

268: \end{example}

269:

270: \begin{example}

271: \label{ex:CountRange}\textrm{Suppose we are especially concerned

272: about a rush-hour period $[t_a,t_b]$ that is particularly stressful

273: to air traffic controllers. Suppose controllers can direct at most

274: $M$ airplanes safely. We can determine the number of controllers

275: needed during the rush-hour time by executing the

276: \textsc{CountRange} query over the controlled airspace during the

277: rush-hour and dividing by $M$. By ensuring that a sufficient number

278: of controllers are present, safety is achieved and controllers are

279: not overstressed. }

280: \end{example}

281:

282: Each of the operators can also be applied to examine different

283: aspects of congestion with regard to bird migration and hence

284: disease control. These questions and examples, motivated by research

285: on \textsc{MaxCount}, led us to explore complex threshold

286: aggregations and data structures to support them.

287:

288: The rest of this paper is organized as follows.

289: Section~\ref{sec:BucketDataStructures} gives some background on the

290: concepts of point domination, sweeping techniques and then

291: introduces the data structures used to build buckets. These buckets

292: can then be used in various indexing algorithms to fit the type of

293: application used. Section~\ref{sec:DynamicMaxCount} develops the

294: {\sc MaxCount} estimation algorithm using a running example.

295: Section~\ref{sec:ThresholdOperators} develops the {\sc

296: ThresholdRange} algorithm based on {\sc MaxCount} and demonstrates

297: the relationship that ties {\sc MaxCount} to the remaining threshold

298: operators. This section also develops algorithms for each of those

299: operators including {\sc CountRange}.

300: Section~\ref{sec:ExperimentalResults} gives the experimental results

301: of the implementation. Section~\ref{sec:RelatedWork} reviews the

302: related work and Section~\ref{sec:Conclusions} gives conclusions and

303: future work.

304:

305:

306:

307: \section{Hyper-Bucket Data Structures}\label{sec:BucketDataStructures}

308:

309: This section presents an updatable {\em skew-aware} bucket for

310: indices that models the skewed point distributions in each bucket.

311: The skew-aware technique allows the index structure to perform

312: inserts, deletes, and updates in {\em fast constant time} using a

313: \textsc{HashTable} to store the buckets. Many spatiotemporal

314: applications, such as tracking clients on a wireless network,

315: particularly need these fast updates, and no previously presented {\sc MaxCount}

316: algorithm can meet that requirement. Because the

317: buckets are spatially defined, the bucketing technique also easily

318: adapts to other spatial and spatiotemporal indices such as the

319: \textsc{R-tree}~\cite{DBLP:conf/sigmod/Guttman84}. Hence, by

320: choosing an appropriate index, the technique performs well

321: whether search operations or update operations occur more

322: frequently.

323:

324: Our algorithm uses a sweeping method to evaluate the threshold

325: aggregation operators similar to previous approaches from

326: \cite{Chen20041,Revesz20031} and \cite{Anderson20061}. The algorithm

327: differs in that the sweeping algorithm integrates a skew-aware

328: density function over the spatial dimensions of the bucket to obtain

329: the time dependent count function. The density function in the

330: bucket increases accuracy over methods given in

331: \citep{Chen20041,Anderson20061} while maintaining the same number of

332: buckets. This idea is a crucial improvement because we model the

333: point distribution skew in a bucket, whereas previous methods

334: adapted to skew by increasing the number of buckets or changing

335: their shape and contents. We also present a precise algorithm for

336: evaluating the threshold aggregation operators that requires no

337: index and runs in $O(N)+ O(n \log n)$ time and $O(n)$ space where

338: $N$ is the number of points in the database and $n$ is the value of

339: a {\sc CountRange} query using the same query space and time. Both

340: the threshold aggregation algorithms and the skew-aware bucket data

341: structure presented are implemented and analyzed in 3-dimensional

342: space. We show that the approximation achieves good results while

343: significantly reducing the running times.

344:

345: Section~\ref{ssec:buckets} describes the problems related to

346: creating hyper-buckets (also referred to as just buckets) and a

347: specific solution for creating $6$-dimensional buckets for

348: $3$-dimensional linearly moving points.  In all cases, we can extend

349: our method to $d$ dimensions. Section~\ref{ssec:updates} describes

350: the method for inserting and deleting a point from a bucket and

351: shows that updates take constant time. Section~\ref{ssec:structures}

352: applies two different data structures to contain the buckets suited

353: for applications where either inserts and deletes or threshold

354: aggregation queries dominate.

355:

356:

357: \subsection{Hyper-Bucket Data Structure}\label{ssec:buckets}

358:

359: \begin{definition}[Hex Representation]

360: \label{def:hex} Define each 3-dimensional linearly moving point $p$

361: by parametric linear equations in $t$ as follows:

362: \begin{equation}

363:     p=\left\{

364:     \begin{array}{c}

365:         p_x ~=~ v_x t ~+~ x_0 \\

366:         p_y ~=~ v_y t ~+~ y_0 \\

367:         p_z ~=~ v_z t ~+~ z_0 \\

368:     \end{array}

369:     \right.

370: \end{equation}

371: where the corresponding {\em hex representation} of $p$ is the tuple

372: $(v_x,x_0,v_y,y_0,v_z,z_0)$ containing the duals of $p_x$, $p_y$,

373: and $p_z$. For simplicity we often denote the six-tuple as

374: $(x_1,\ldots,x_6)$.

375: \end{definition}
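
As a concrete reading of the definition, the following Python sketch stores a
linearly moving point by its six coefficients and evaluates its position at a
given time. The class name \texttt{MovingPoint} and its methods are
hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MovingPoint:
    """A 3-dimensional linearly moving point: velocity and start per axis."""
    vx: float
    x0: float
    vy: float
    y0: float
    vz: float
    z0: float

    def hex(self):
        """The hex representation (v_x, x_0, v_y, y_0, v_z, z_0)."""
        return (self.vx, self.x0, self.vy, self.y0, self.vz, self.z0)

    def position(self, t):
        """Evaluate the three parametric equations at time t."""
        return (self.vx * t + self.x0,
                self.vy * t + self.y0,
                self.vz * t + self.z0)


p = MovingPoint(1.0, 2.0, 0.0, 5.0, -1.0, 3.0)
print(p.hex())
print(p.position(2.0))
```

The hex tuple is a static point in six dimensions, which is what allows the
moving points to be bucketed without reference to time.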

376:

377: \medskip

378: Consider a relation $D(x_1,\ldots,x_6)$ that contains the {\em hex

379: representation} of linearly moving points in $3$ dimensions. Then

380: $D$ represents a $6$-dimensional {\em static} space. Divide the

381: space into axis-aligned hyper-rectangles where the $k^{th}$ axis has

382: $d_k$ divisions. Each hyper-rectangle becomes a bucket containing

383: moving points whose hex falls inside the hyper-rectangle.

384:

385: \begin{definition}[Hyper-bucket dimensions]

386: \label{def:bucketDimensions} Define the dimensions of each bucket

387: $B_i$ by inequalities of the form:

388: \begin{equation}

389: \begin{array}{lcl}

390:     v_{x,L} \leq v_x < v_{x,U} &\bigwedge& x_{0,L} \leq x_0 < x_{0,U} ~\bigwedge \\

391:     v_{y,L} \leq v_y < v_{y,U} &\bigwedge& y_{0,L} \leq y_0 < y_{0,U} ~\bigwedge \\

392:     v_{z,L} \leq v_z < v_{z,U} &\bigwedge& z_{0,L} \leq z_0 < z_{0,U} \\

393: \end{array}

394: \end{equation}

395: where we denote the lower bound as:

396: \begin{equation}

397:     (v_{x,L}, x_{0,L},v_{y,L}, y_{0,L}, v_{z,L}, z_{0,L})

398: \end{equation}

399: and the upper bound as

400: \begin{equation}

401:     (v_{x,U}, x_{0,U},v_{y,U}, y_{0,U}, v_{z,U}, z_{0,U}).

402: \end{equation}

403: \end{definition}

404:

405: Each hyper-rectangle defines the spatial dimensions of a possible

406: bucket, where only buckets that contain points need be included in

407: the index. The maximum number of possible buckets is given by

408: $m=\prod\limits_{k}d_k$.

409:

410: \begin{definition}[Histograms]

411: \label{def:histograms} Given a $6$-dimensional rectangle $B_i$,

412: given by Definition~\ref{def:bucketDimensions}, containing $b_i$

413: points, build the {\em histograms} $h_{i,1},\ldots,h_{i,6}$ for each

414: axis using $s$ subdivisions as follows. To create histogram

415: $h_{i,j}$, divide bucket $B_i$ into $s$ parallel subdivisions along

416: the $j$th axis, and record separately the number of points within

417: $B_i$ that fall within each subdivision.

418: \end{definition}
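
The construction can be sketched in a few lines of Python. The sketch assumes
that every point already lies inside the bucket, that is, between the
hypothetical arguments \texttt{lower} and \texttt{upper}; it works for any
number of axes, and the function name \texttt{build\_histograms} is ours.

```python
def build_histograms(points, lower, upper, s):
    """Return one histogram (a list of s counts) per axis of the bucket."""
    dims = len(lower)
    hists = [[0] * s for _ in range(dims)]
    for p in points:
        for j in range(dims):
            width = (upper[j] - lower[j]) / s      # width of one subdivision
            k = int((p[j] - lower[j]) / width)     # subdivision hit on axis j
            k = min(k, s - 1)                      # guard against rounding at the upper edge
            hists[j][k] += 1
    return hists


points = [(1.0, 9.0), (3.0, 9.0), (3.5, 1.0), (9.9, 0.0)]
h = build_histograms(points, lower=(0.0, 0.0), upper=(10.0, 10.0), s=5)
print(h[0])
print(h[1])
```

Every histogram sums to the number of points in the bucket, because each
point contributes exactly once per axis.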

419:

420:

421: \begin{example}[Building Histograms]\rm

422:     \begin{figure}[ht]

423:         \centering

424:         \psfrag{Y}{$X_0$}

425:         \psfrag{X}{$V_x$}

426:         \includegraphics[width=4in]{figs/pointssplit3.eps}

427:         \caption{Points projected onto $v_x,x_0$ plane.} \label{fig:points}

428:     \end{figure}

429:     Consider a set of 6-dimensional points projected onto the $v_x,x_0$

430:     plane as shown in Figure~\ref{fig:points}. Assume that the number of

431:     subdivisions is $s=10$ along both $v_x$ and $x_0$.

432:     Figure~\ref{fig:vxx0histograms} shows $h_{i,1}$ and $h_{i,2}$. For

433:     example, the subdivision $0\leq v_x < 1$ contains six points and

434:     hence the first bar of histogram $h_{i,1}$ rises to level $6$. The

435:     other values can be determined similarly.

436:

437:     \begin{figure}[htb]

438:         \begin{minipage}[t]{3in}

439:             \begin{center}

440:             \includegraphics[width=3in]{figs/vxhist.eps}\\

441:             \mbox{$h_{i,1}$: Points projected onto $v_x$.}

442:             \end{center}

443:         \end{minipage}

444:         \hfill

445:         \begin{minipage}[t]{3in}

446:             \begin{center}

447:             \includegraphics[width=3in]{figs/x0hist.eps}\\

448:             \mbox{$h_{i,2}$: Points projected onto $x_0$.}

449:             \end{center}

450:         \end{minipage}

451:         \begin{center}

452:         \end{center}

453:         \caption{Histogram of Points in 2 Dimensions.}

454:         \label{fig:vxx0histograms}

455:     \end{figure}

456:     \begin{figure}[htb]

457:             \begin{minipage}[t]{3in}

458:                 \psfrag{A}{$x_0$} \psfrag{B}{$v_x$}

459:                 \includegraphics[width=3in]{figs/xaccurate2.eps}

460:             \end{minipage}

461:             \hfill

462:             \begin{minipage}[t]{3in}

463:                 \psfrag{A}{$x_0$} \psfrag{B}{$v_x$}

464:                 \includegraphics[width=3in]{figs/xwrong2.eps}

465:             \end{minipage}

466:             \caption{2D Distribution Functions}

467:             \label{fig:vx2d}

468:     \end{figure}

469:

470:     Histograms tell much about the distribution of the points in a

471:     bucket but they introduce some ambiguity. For example, the

472:     histograms in Figure~\ref{fig:vxx0histograms} match both of the

473:     $2d$-distributions in Figure~\ref{fig:vx2d}.

474: \end{example}

475: \bigskip

476:

477: \begin{definition}[Axis Trend Function]

478: \label{def:trendfunctions}%

479: The {\em axis trend function} $f_{i,j}(x_j)$ is some polynomial

480: function for bucket $B_i$ and axis $j$ such that the following hold:

481: \begin{enumerate}

482:     \item $f_{i,j} \geq 0$ over $B_i$.

483:     \item $f'_{i,j}$, the derivative of $f_{i,j}$, does not change sign over the valid range.

484: \end{enumerate}

485: The {\em bucket trend function} $f_i$ for bucket $B_i$ is the

486: following:

487:     \begin{equation}

488:         \label{eq:bucketdensity}

489:         f_i=\prod_j f_{i,j}

490:     \end{equation}

491: \end{definition}

492:

493: Condition 1 ensures that the bucket trend function, built from the

494: axis trend functions, does not contain a negative probability

495: region. Condition 2 requires that the bucket density increase,

496: decrease, or remain constant when considering any single axis. This

497: condition avoids the ambiguity demonstrated in

498: Figures~\ref{fig:vxx0histograms} and \ref{fig:vx2d} by giving a

499: polynomial that approximates the density change correctly. We show

500: this in the following Lemma.

501:

502: \begin{lemma}

503: \label{lem:distributionindependence} Given a bucket $B_{i}$ with

504: axis trend functions $f_{i,j}$ and bucket trend function $f_{i}$, let $r_{1}$ and $r_{2}$ be

505: identically sized regions in bucket $B_{i}$. If the density in

506: $B_{i}$ along each axis monotonically increases from $r_{1}$ to

507: $r_{2}$,

508: then the following holds:%

509: \begin{equation}

510: \int_{r_{2}}f_{i}~d\phi \geq \int_{r_{1}}f_{i}~d\phi

511: \end{equation}

512: \end{lemma}

513:

514: \begin{proof}

515: Increasing densities from $r_{1}$ to $r_{2}$ translates into

516: histograms that also increase from $r_{1}$ in the direction of

517: $r_{2}$ along each axis. The translation from histograms to the axis

518: trend functions gives the following conditions:

519: \begin{equation}

520:     f_{i,j}(x_{2,j})\geq f_{i,j}( x_{1,j})

521: \end{equation}

522: where $x_{1,j}$ and $x_{2,j}$ are the $j^{th}$ coordinates of the

523: points in $r_{1}$ and $r_{2}$ respectively, and are located the same

524: distance from the $j^{th}$ coordinates of the lower bounds of $r_1$

525: and $r_2$ respectively. Since this constraint holds for each $j$ and

526: $f_{i,j}\geq 0$ we have:

527: \begin{equation}

528:     f_{i}(x_2)\geq f_{i}(x_1)

529: \end{equation}

530: Hence by the properties of integration we conclude

531: \begin{equation}

532:     \int_{r_{2}}f_{i}~d\phi \geq \int_{r_{1}}f_{i}~d\phi

533: \end{equation}

534: \end{proof}

535:

536: Definition~\ref{def:trendfunctions} allows a whole class of

537: polynomial functions, and Lemma~\ref{lem:distributionindependence}

538: applies to each member of that class. However, in the following, we

539: use a particular polynomial function derived from the product of

540: linear functions, which are obtained by using the least squares

541: method for each histogram.

542:

543: \begin{definition}[Normalized Trend Functions]

544: \label{def:NormalizedTrendFunction} Let $n$ be the number of points

545: in the database, $b_i$ the number of points in bucket $B_i$, and

546: $f_i$ be given by Equation~(\ref{eq:bucketdensity}). The {\em

547: normalized trend function} $F_i$ for bucket $B_i$ is:

548: \begin{equation}

549:     F_{i} = \frac{b_i f_{i}}

550:     {

551:         n \mathop{\displaystyle\int}\limits_{B_i}^{~}

552:         f_{i}~d\phi

553:     }

554:     \label{eq:NormalizedSurface}

555: \end{equation}

556: and the {\em percentage of points} in bucket $B_i$ is:

557: \begin{equation}

558:         p = \mathop{\displaystyle\int}\limits_{B_i} F_i~d\phi.

559: \label{eq:percentagepoints}

560: \end{equation}

561: \end{definition}

562:

563: With this definition we can calculate the number of points in $O(1)$

564: time using the following simple lemma.

565:

566: \begin{lemma}

567: \label{lem:ConstRunningTimeForBucket} Let $B_i$ be a bucket, $n$ the

568: number of points in the database, and $p$ be given by

569: Definition~\ref{def:NormalizedTrendFunction}. Then $np$ is the

570: number of points in bucket $B_i$ and $np$ is calculated in $O(1)$

571: time.

572: \end{lemma}

573: \begin{proof}

574: By Equations~(\ref{eq:NormalizedSurface}) and

575: (\ref{eq:percentagepoints}) we have:

576: \begin{equation}

577: \begin{array}{ccl}

578:     n p & = &n \mathop{\displaystyle\int}\limits_{B_i} F_i~d\phi \vspace{6pt}\\

579:         & = &n  \mathop{\displaystyle\int}\limits_{B_i} \displaystyle{\frac{b_i}{n}} \frac{f_{i}}{\mathop{\displaystyle\int}_{B_i} f_i~d\phi}~d\phi \vspace{6pt}\\

580:         & = &n  \displaystyle{\frac{b_i}{n}} \cdot \frac{\mathop{\displaystyle\int}_{B_i} f_{i}~d\phi}{\mathop{\displaystyle\int}_{B_i} f_{i}~d\phi} \vspace{6pt}\\

581:         & = &b_i. \\

582: \end{array}

583: \end{equation}

584: Clearly the above calculations take only $O(1)$ time.

585: \end{proof}
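
The constant running time comes from the product form of $f_i$: the integral
over an axis-aligned box factors into one closed-form integral per axis. The
Python sketch below illustrates this for linear axis trend functions
$a_j x_j + b_j$ and for an axis-aligned sub-box of the bucket only. The
function names and the representation of each trend as a pair
\texttt{(a, b)} are assumptions made for the illustration.

```python
def axis_integral(a, b, lo, hi):
    """Integral of a*x + b over [lo, hi], in closed form."""
    return 0.5 * a * (hi * hi - lo * lo) + b * (hi - lo)


def box_integral(trends, lower, upper):
    """Integral of the product of the axis trends over a box."""
    result = 1.0
    for (a, b), lo, hi in zip(trends, lower, upper):
        result *= axis_integral(a, b, lo, hi)
    return result


def estimate_count(trends, bucket_lower, bucket_upper, b_i,
                   region_lower, region_upper):
    """Estimated number of the bucket's b_i points that fall in the region."""
    total = box_integral(trends, bucket_lower, bucket_upper)
    if total == 0:
        return 0.0
    return b_i * box_integral(trends, region_lower, region_upper) / total


trends = [(2.0, 0.0), (0.0, 1.0)]        # rising along axis 0, flat along axis 1
print(estimate_count(trends, (0, 0), (4, 2), 16, (0, 0), (4, 2)))
print(estimate_count(trends, (0, 0), (4, 2), 16, (2, 0), (4, 2)))
```

Integrating over the whole bucket returns $b_i$, in agreement with
Lemma~\ref{lem:ConstRunningTimeForBucket}, while the upper half along the
rising axis receives more than half of the points.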

586:

587: Using the above definitions we can now define the bucket data

588: structure used throughout the rest of this paper.

589:

590: \begin{definition}[Skew Aware Buckets]\label{def:bucketsN}%

591: A bucket is a hyper-rectangle with dimensions given by

592: Definition~\ref{def:bucketDimensions} and that maintains histograms

593: given by Definition~\ref{def:histograms}, additional data for the

594: least squares method, and the normalized trend function given by

595: Definition~\ref{def:NormalizedTrendFunction}. Throughout the rest of

596: this paper we refer to these as buckets.

597: \end{definition}

598:

599:

600:

601: \subsection{Inserts and Deletes}\label{ssec:updates}

602:

603: We can maintain the bucket (and hence the index) while deleting or

604: inserting a point for any bucket $B_i$ by recalculating the trend

605: function $F_i$ for the bucket.

606:

607: \begin{lemma}\label{lem:ConstantUpdates}

608: Insertion and deletion of a moving point can be done in $O(1)$ time.

609: \end{lemma}

610:

611: \begin{proof}

612: When we insert or delete a point, we need to update the histograms

613: and the normalized trend function. Let the point to insert/delete be

614: $P_a$ represented using the hex representation as

615: $(a_0,a_1,a_2,a_3,a_4,a_5)$, let $d_j$, for $0 \leq j \leq 5$, be the

616: bucket width in the $j^{th}$ dimension, and let $s$ be the number of

617: subdivisions in each histogram. The concatenation of $id_0, \ldots,

618: id_5$ gives the $ID_i$ of the bucket $B_i$ into which $P_a$ is inserted

619: (or from which it is deleted), where each $id_l$, $0 \le l \le 5$, is defined by:

620: \begin{equation}

621:     id_l = \left\lfloor \frac{a_l}{d_l} \right\rfloor.

622: \end{equation}

623: The calculation of $ID_i$ and retrieving bucket $B_i$ takes $O(1)$

624: time using a \textsc{HashTable}.

625:

626: Let $hw_{i,j}$ be the histogram-division width for the $j^{th}$ dimension,

627: calculated as $hw_{i,j} = \left\lceil \frac{d_j}{s} \right\rceil$.

628: Then $P_a$ is projected onto each dimension to determine which

629: division of the histogram to update. For the $j^{th}$ dimension the

630: $k^{th}$ division of histogram $h_{i,j}$ is given as follows:

631: \begin{equation}

632:     k(j) = \left\lfloor \frac{a_j - id_j \cdot d_j}{hw_{i,j}} \right\rfloor

633: \end{equation}

634: Let $h_{i,j,k}$ be the histogram division to update for each

635: histogram. Update $h_{i,j,k}$ and the sums $\displaystyle{\sum}y_i$,

636: and $\displaystyle{\sum}x_i y_{i}$ from the normal equations in the

637: least squares method. $N$, $\displaystyle{\sum}x_i$ and

638: $\displaystyle{\sum}x_{i}^{2}$ from the normal equations do not need

639: updating since the number of histogram divisions $s$ is fixed within

640: the database.

641:

642: We can now recalculate each $f_{i,j}$ in constant time by solving

643: the $2 \times 3$ augmented matrix corresponding to the normal equations of the

644: least squares method for each histogram. For each $f_{i,j}$

645: calculate the endpoints to determine the required shift amount

646: (Definition~\ref{def:trendfunctions}, property 1) and calculate

647: $f_i$ from Equation~(\ref{eq:bucketdensity}). Now we calculate $F_i$

648: using Equation~(\ref{def:NormalizedTrendFunction}). Each of these

649: steps depends only on the dimension of the database. Hence for any

650: fixed dimension we can rebuild the normalized trend function $F_i$

651: in $O(1)$ time.

652: \end{proof}
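The constant-time update in the proof above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the names \texttt{bucket\_id}, \texttt{division\_index} and \texttt{Histogram} are hypothetical, and the sketch assumes that the least squares fit uses the division index $k$ as the $x$ value and the division count as the $y$ value.

```python
import math

def bucket_id(point, widths):
    """Concatenate floor(a_l / d_l) over all dimensions into one hashable ID."""
    return tuple(math.floor(a / d) for a, d in zip(point, widths))

def division_index(a_j, id_j, d_j, hw_j):
    """Histogram division of coordinate a_j inside its bucket."""
    return math.floor((a_j - id_j * d_j) / hw_j)

class Histogram:
    """One axis histogram with s fixed divisions plus least-squares sums.

    Assumption: x_k is the division index k and y_k is the count in
    division k. N, sum(x) and sum(x^2) depend only on s, so they are
    computed once and never updated.
    """
    def __init__(self, s):
        self.counts = [0] * s
        self.n = s
        self.sum_x = sum(range(s))
        self.sum_xx = sum(k * k for k in range(s))
        self.sum_y = 0
        self.sum_xy = 0

    def update(self, k, delta):
        """Insert (delta=+1) or delete (delta=-1) one point in division k."""
        self.counts[k] += delta
        self.sum_y += delta
        self.sum_xy += k * delta

    def fit(self):
        """Solve the normal equations for slope and intercept."""
        det = self.n * self.sum_xx - self.sum_x ** 2
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / det
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return slope, intercept
```

Each call to \texttt{update} and \texttt{fit} performs a fixed number of arithmetic operations, independent of the number of points stored in the bucket.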

653:

654: \subsection{Index Data Structures}\label{ssec:structures}

655:

656: There is no need to create a bucket unless it contains at least one

657: point. We consider two classes of data structures for organizing the

658: buckets: \textsc{HashTables} and \textsc{Trees}.

659:

660: For databases where inserts and deletes are the most common

661: operation, the \textsc{HashTable} approach allows these operations

662: to run in constant time. However, the {\sc MaxCount} operation will

663: require an enumeration of all the buckets and thus a running

664: time of at least $O(|B|)$. As long as the number of buckets is

665: reasonable, this approach works well.
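A sparse hash-table organization of the buckets might look like the sketch below. It is only an illustration under simplifying assumptions: the class name \texttt{BucketIndex} is hypothetical, and each bucket stores just a point count in place of its histograms and trend function.

```python
import math

class BucketIndex:
    """Sparse bucket store: a bucket exists only while it holds a point."""

    def __init__(self, widths):
        self.widths = widths
        self.buckets = {}  # bucket ID (tuple) -> number of points

    def _key(self, point):
        return tuple(math.floor(a / d) for a, d in zip(point, self.widths))

    def insert(self, point):
        key = self._key(point)
        self.buckets[key] = self.buckets.get(key, 0) + 1

    def delete(self, point):
        key = self._key(point)
        if key not in self.buckets:
            raise KeyError("no bucket holds this point")
        self.buckets[key] -= 1
        if self.buckets[key] == 0:
            del self.buckets[key]  # drop buckets that became empty

    def all_buckets(self):
        """Enumeration needed by a query: touches every stored bucket."""
        return list(self.buckets.items())
```

Insertion and deletion touch one dictionary entry, while \texttt{all\_buckets} visits every stored bucket, which mirrors the trade-off described above.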

666:

667: For databases where {\sc MaxCount} is the most common operation, we

668: may use an \textsc{R-tree} structure

669: \citep{DBLP:conf/sigmod/Guttman84,BKS+90} where the elements to be

670: inserted are the buckets. This approach speeds up the {\sc MaxCount}

671: query to $O(\log|B| + R)$ where $R$ is the number of buckets needed

672: to calculate the query. The insert and delete costs for these

673: \textsc{R-trees} are $O(\log|B|)$, because buckets do not overlap.

674:

675: Since buckets do not change shape, the database is decomposable and

676: allows each type of aggregation to be calculated from simultaneous

677: executions on subspaces of the index space. We discuss the method

678: and ramifications of this capability at the end of Section

679: \ref{sec:ExactMaxCount}.

680:

681:

682: \section{Dynamic \textsc{MaxCount}}\label{sec:DynamicMaxCount}

683:

684: Section~\ref{ssec:PointDomination} reviews point domination in

685: higher dimensions. Section~\ref{ssec:IntegratingBuckets} examines

686: finding the percentage of points in a bucket that are in the query

687: space as a function of time. Section~\ref{ssec:MaxCountAlgorithm}

688: puts the two previous sections together to create the dynamic {\sc

689: MaxCount} algorithm for $d$-dimensions.

690:

691: \subsection{Point Domination in 6-Dimensional Space}\label{ssec:PointDomination}

692:

693: Let $B$ be the set of 6-dimensional hyper-buckets in the input where

694: each hyper-bucket $B_i$ has an associated normalized trend function

695: $F_i$ as in Definition~\ref{def:NormalizedTrendFunction}. Let the

696: vertices of $B_i$ be denoted $v_{i,j}$ where $1 \leq j \leq 64$,

697: because there are $2^6$ corner vertices to a 6-dimensional

698: hyper-cube.

699:

700: \begin{definition}[Point Domination]\label{def:pointdomination}

701:     Given two linearly moving points in three dimensions

702:     \begin{equation}

703:             P(t)=\left\{

704:         \begin{array}{l}

705:             p_{x}=x_1 t + x_2 \\

706:             p_{y}=x_3 t + x_4 \\

707:             p_{z}=x_5 t + x_6

708:         \end{array}

709:         \right.   %\label{eq:point2}

710:         \quad {\rm and} \quad

711:         Q(t)=\left\{

712:         \begin{array}{l}

713:             q_{x}=v_{x}t+x_{0} \\

714:             q_{y}=v_{y}t+y_{0} \\

715:             q_{z}=v_{z}t+z_{0}

716:         \end{array}

717:         \right.

718:     \end{equation}

719:     $Q(t)$ dominates $P(t)$ if and only if the following holds:

720:     \begin{equation}

721:         (p_x < q_x) \quad \wedge \quad (p_y < q_y) \quad \wedge \quad (p_z < q_z).

722:     \end{equation}

723: \end{definition}

724:

725: The previous definition takes 6-dimensional points defined in

726: Definition~\ref{def:hex} and places them into three inequalities of

727: the form $x_2 < -t(x_1-v_x) + x_0$. Each inequality defines a region

728: below a line with slope $-t$.

729:

730: \begin{definition}[$x$-view, $y$-view and $z$-view projections]\label{def:views}

731: Projecting the inequalities from

732: Definition~\ref{def:pointdomination} onto their respective dual

733: planes allows a visualization in three 2-dimensional planes. Define

734: these three projections as the $x$-view, $y$-view and $z$-view,

735: respectively. Because the time $t$ defines the slope $-t$ of each line,

736: all views contain lines with identical slopes (see

737: Figure~\ref{fig:views}).

738: \end{definition}

739:

740:

741:

742: \begin{definition}[Query Space]\label{def:queryspace}

743: Given two moving query points $Q_1(t)$ and $Q_2(t)$ and lines

744: $l_{x1}$, $l_{x2}$, $l_{y1}$, $l_{y2}$, $l_{z1}$, $l_{z2}$ crossing

745: them in their respective hexes with slopes $-t$, the intersection of

746: the bands formed by the area between $l_{x1}$ and $l_{x2}$, $l_{y1}$

747: and $l_{y2}$, and $l_{z1}$ and $l_{z2}$ in the 6-dimensional space

748: forms a hyper-tunnel that defines the {\em query space} as shown in

749: Figure~\ref{fig:views}.

750: \end{definition}

751:

752: \begin{figure}[ht]

753:     \centering

754:     \psfrag{X-View}{$X-$view}

755:     \psfrag{Y-View}{$Y-$view}

756:     \psfrag{Z-View}{$Z-$view}

757:     \psfrag{Q2x}{$Q_{2x}$}

758:     \psfrag{Q1x}{$Q_{1x}$}

759:     \psfrag{Q2y}{$Q_{2y}$}

760:     \psfrag{Q1y}{$Q_{1y}$}

761:     \psfrag{Q2z}{$Q_{2z}$}

762:     \psfrag{Q1z}{$Q_{1z}$}

763:     \psfrag{lx2}{$l_{x2}$}

764:     \psfrag{lx1}{$l_{x1}$}

765:     \psfrag{ly2}{$l_{y2}$}

766:     \psfrag{ly1}{$l_{y1}$}

767:     \psfrag{lz2}{$l_{z2}$}

768:     \psfrag{lz1}{$l_{z1}$}

769:     \psfrag{Position}{Position}

770:     \psfrag{Velocity}{Velocity}

771:     \includegraphics[width=5.9in]{figs/views.eps}\\

772:     \caption{Views.}\label{fig:views}

773: \end{figure}

774:

775: We can now visualize the query in space and time as the {\em query

776: space} sweeping through a bucket as the slopes of the lines change

777: with time. Using the above, it is now easy to prove the following

778: lemma.

779:

780: \begin{lemma}

781: At any time $t$, the moving points whose hex-representation lies

782: below (or above) $l_{x1},l_{y1}$ and $l_{z1}$ in their respective

783: views are exactly those points that lie below (or above) $Q_{1}$ in

784: the original 3-dimensional space.

785: \end{lemma}

786:

787: \begin{proof}

788: Let $Q_{x}(t)=v_{x}t+x_{0}$ where $v_{x}$ and $x_{0}$ are constants

789: and consider any $x$ component of a point $P_{x}(t)=x_1 t + x_2$

790: that lies below $Q$ on the $x$-axis. Then

791: \begin{eqnarray}

792:     x_1 t + x_2 &<& v_{x} t + x_{0} \\

793:     x_2         &<& -t (x_1 - v_{x}) + x_{0}

794: \end{eqnarray}

795: Obviously, at any time $t$ these are the points below the line $x_2

796: = -t(x_1 - v_{x}) + x_{0}$, which has a slope of $-t$ and goes

797: through $( v_{x},x_{0})$. This representation is the dual of point

798: $Q_{x}$. By Definition \ref{def:queryspace}, this is exactly the

799: line $l_{x1}$. We can prove similarly that the points with duals

800: above $l_{x1}$ are above $Q_{1}$ at any time $t$. The proof that

801: points whose hex-representations are above or below $l_{y1}$ and

802: $l_{z1}$ are exactly those points that lie above or below $Q_{1}$ is

803: similar to the proof for points above or below $l_{x1}$. By

804: Definition~\ref{def:pointdomination}, we conclude that the points

805: dominated by $Q_{1}$ in the dual space are those points that are

806: below $l_{x1}, l_{y1}$, and $l_{z1}$ in the $x$-view, $y$-view, and

807: $z$-view, respectively. Similarly, we conclude that the points that

808: dominate $Q_1$ in the dual space are those points that are above

809: $l_{x1},~l_{y1}$, and $l_{z1}$ in the $x$-view, $y$-view, and

810: $z$-view, respectively.

811: \end{proof}
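The equivalence stated in the lemma can be checked numerically with a short sketch. The function names are hypothetical, and a point is assumed to be stored in hex representation as (velocity, start position) pairs for $x$, $y$ and $z$.

```python
def below_in_primal(p, q, t):
    """p = (velocity, start position), q likewise; compare positions at time t."""
    return p[0] * t + p[1] < q[0] * t + q[1]

def below_in_dual(p, q, t):
    """True if dual point p lies below the line of slope -t through q."""
    x1, x2 = p
    vx, x0 = q
    return x2 < -t * (x1 - vx) + x0

def dominated(P, Q, t):
    """Q dominates P when P is below Q in the x-, y- and z-views."""
    views = [(P[0:2], Q[0:2]), (P[2:4], Q[2:4]), (P[4:6], Q[4:6])]
    return all(below_in_dual(p, q, t) for p, q in views)
```

For exact arithmetic the two predicates \texttt{below\_in\_primal} and \texttt{below\_in\_dual} agree for every $t$, since one inequality is an algebraic rearrangement of the other.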

812:

813: Throughout the examples in this paper, we use the points shown in

814: Figures~\ref{fig:Points} and \ref{fig:ExPointsProjected} to

815: demonstrate the evaluation of a {\sc MaxCount} query. We begin by

816: creating the index.

817:

818: \begin{example}[Creating the Index]\label{ex:BuildIndex}\rm

819:     \begin{figure}[htb]

820:         \centering

821:         \includegraphics[scale=1]{figs/ExamplePoints_v2.eps}

822:         \caption{Example points.}

823:         \label{fig:Points}

824:     \end{figure}

825:

826:     Consider a relation that contains the $6$-dimensional space 10

827:     units $(0 \ldots 10)$ in each dimension. If we break this up

828:     into buckets that are $5$ units long in each dimension, we have

829:     $2^{6}$ buckets. Although these divisions make a space with $64$

830:     buckets, all the points are contained in a single bucket whose

831:     index is $(2,2,2,2,2,2)$. All the points listed in Figure

832:     \ref{fig:Points} have the same velocities for each dual plane.

833:     Notice the columns for $x_1$, $x_3$, and $x_5$ all have the same

834:     values in different orders. The projection of the points onto

835:     the $3$ dual planes shown in Figure~\ref{fig:ExPointsProjected}

836:     does not immediately show this organization. Projecting the

837:     points for any view in Figure~\ref{fig:ExPointsProjected} onto each

838:     axis and creating histograms with $5$ divisions gives the

839:     histograms for the Velocity and Position axes shown in

840:     Figure~\ref{fig:HistogramVP}.

841:     \begin{figure}[h]

842:         \centering

843:         \begin{minipage}[t]{2in}

844:             \begin{center}

845:                 \includegraphics[width=2in]{graphics/ex3d1.eps} \\

846:                 (a)

847:             \end{center}

848:         \end{minipage}

849:         \begin{minipage}[t]{2in}

850:             \begin{center}

851:                 \includegraphics[width=2in]{graphics/ex3d2.eps} \\

852:                 (b)

853:             \end{center}

854:         \end{minipage}

855:         \begin{minipage}[t]{2in}

856:             \begin{center}

857:                 \includegraphics[width=2in]{graphics/ex3d3.eps} \\

858:                 (c)

859:             \end{center}

860:         \end{minipage}

861:     \caption{Points projected onto (a) $X$-view, (b) $Y$-view, and (c) $Z$-view.}

862:     \label{fig:ExPointsProjected}

863:     \end{figure}

864:     \begin{figure}[h]

865:         \centering

866:         \begin{minipage}[t]{3in}

867:             \begin{center}

868:                 \includegraphics[width=3in]{graphics/ExHistV.eps} \\

869:                 (a) Velocity.

870:             \end{center}

871:         \end{minipage}

872:         \begin{minipage}[t]{3in}

873:             \begin{center}

874:                 \includegraphics[width=3in]{graphics/ExHistP.eps} \\

875:                 (b) Position.

876:             \end{center}

877:         \end{minipage}

878:         \caption{Position and velocity histograms, identical for each view.}

879:         \label{fig:HistogramVP}

880:     \end{figure}

881:     Hence, each velocity dimension has the same histogram. Similarly each

882:     position dimension has the same histogram. To create these

883:     histograms each point is projected onto the axis. For example point

884:     $1$ projected onto the $x_1$ axis is given as:

885:     \begin{equation}

886:     (5.345,\,7.543,\,5.345,\,8.158,\,5.345,\,5.488)\rightarrow 5.345.

887:     \end{equation}

888:     Calculate the widths of the histograms as:

889:     \begin{equation}

890:         Histogram\_Width =(10-5)/5 =1

891:     \end{equation}

892:     We determine the histogram for each point by looping through the

893:     points and calculating the following:

894:     \begin{equation}

895:         division=\left\lfloor((point-lowerbound)/Histogram\_Width)\right\rfloor

896:     \end{equation}

897:     For example the lowest and highest points in velocity would be added

898:     to the division calculated as

899:     $\left\lfloor \left(  5.84-5\right)  /1\right\rfloor = 0$ and

900:     $\left\lfloor (9.468-5)/1\right\rfloor = 4$.

901:

902:     The histograms translate into a set of points for each view given

903:     as:

904:     \begin{eqnarray}

905:         Velocity =\{(0,1),(1,1),(2,2),(3,2),(4,4)\}\label{pt:Vel} \\

906:         Position =\{(0,2),(1,2),(2,2),(3,2),(4,2)\}\label{pt:Pos}

907:     \end{eqnarray}

908:     Before applying the least squares method each division number

909:     must be translated back into the bucket. Translation is done

910:     using the following code fragment:

911:     \medskip

912:

913:     \progstart \vspace{-18pt}

914:     \begin{tabbing}

915:     \hspace{.25in}\= \kill

916:     \textbf{for} $i \leftarrow 0$ \textbf{to} \emph{number\_of\_divisions} $-1$\\

917:     \> $point[i][0]\ \leftarrow i*histogram\_width + lowerbound$ \\

918:     \> $point[i][1]\ \leftarrow histogram\_value[i]$ \\

919:     \textbf{end for}

920:     \end{tabbing}

921:     \progend

922:

923:     Translating the points from (\ref{pt:Vel}) and (\ref{pt:Pos}) gives the

924:     histograms for velocity and position in each view:

925:     \begin{eqnarray}

926:         Velocity =\{(5,1),(6,1),(7,2),(8,2),(9,4)\} \\

927:         Position =\{(5,2),(6,2),(7,2),(8,2),(9,2)\}.

928:     \end{eqnarray}

929:     Using the least squares method to fit each of these to a line yields

930:     the following for each velocity and position dimension:

931:     \begin{align}

932:         Velocity:~~  &y=0.7x-2.9\label{eq:RawVelocity}\\

933:         Position:~~  &y=0x+2 \label{eq:RawPosition}.%

934:     \end{align}

935:     Evaluating Equations~(\ref{eq:RawVelocity}) and

936:     (\ref{eq:RawPosition}) at the end points to find the shift value

937:     for the axis trend function to add to each equation gives:

938:     \begin{align}

939:         Velocity:~~  &y(5)=0.6,~~ y(10)=4.1\\

940:         Position:~~  &y(5)=y(10)=2.

941:     \end{align}

942:     In this case no constant needs to be added to our equation and

943:     the trend function becomes:

944:     \begin{equation}

945:         f_{i}=(0.7x_{0}-2.9)(0x_{1}+2)(0.7x_{2}-2.9)(0x_{3}+2)(0.7x_{4}-2.9)(0x_{5}+2)

946:     \end{equation}

947:     Calculating $F_{i}$ from Equation~(\ref{eq:NormalizedSurface}) requires

948:     integrating $f_i$ over the bucket. With

949:     $\int_{B_{i}}\equiv\int_{5}^{10}\cdots\int_{5}^{10}$ and

950:     $d\phi\equiv

951:     dx_{0}dx_{1}dx_{2}dx_{3} dx_{4}dx_{5}$, this gives

952:     \begin{align}

953:         \int_{B_{i}}f_{i}d\phi &  =8 \int_{B_{i}}(0.7x_{0}-2.9)(0.7x_{2}-2.9)(0.7x_{4}-2.9)d\phi \nonumber\\

954:                             &  =1622234.375.

955:     \end{align}

956:     Since all the points reside in a single bucket, $b_{i}=n$, the

957:     constant $c$ is given by $c=1/1622234.375 \approx 6.164\times10^{-7}$. Then

958:     $F_{i}$ is given by

959:     \begin{align}

960:         F_{i}  &  \approx c~(0.7x_{0}-2.9)(0x_{1}+2)(0.7x_{2}-2.9)(0x_{3}+2)(0.7x_{4}-2.9)(0x_{5}+2)\nonumber\\

961:             &  =8c(0.7x_{0}-2.9)(0.7x_{2}-2.9)(0.7x_{4}-2.9).\label{eq:Fi}

962:     \end{align}

963:     So far we have calculated the normalized trend function $F_{i}$

964:     for just one bucket. This calculation finishes the bucket

965:     creation process, and the index contains this single bucket

966:     defined by the points $lowerbound=(5,5,5,5,5,5)$ and

967:     $upperbound=(10,10,10,10,10,10)$.

968: \end{example}
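The numbers in Example~\ref{ex:BuildIndex} can be reproduced with the following sketch. The helper names \texttt{least\_squares} and \texttt{line\_integral} are hypothetical; the sketch relies only on the translated histogram points listed in the example and on the fact that $f_i$ is a product of one linear factor per axis, so that its integral over the bucket factors into six one-dimensional integrals.

```python
def least_squares(points):
    """Fit y = m*x + b by solving the normal equations."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return m, b

def line_integral(m, b, lo, hi):
    """Integral of m*x + b over [lo, hi]."""
    return m * (hi * hi - lo * lo) / 2 + b * (hi - lo)

velocity = [(5, 1), (6, 1), (7, 2), (8, 2), (9, 4)]
position = [(5, 2), (6, 2), (7, 2), (8, 2), (9, 2)]

mv, bv = least_squares(velocity)   # slope and intercept of the velocity fit
mp, bp = least_squares(position)   # slope and intercept of the position fit

# Three velocity axes and three position axes, each integrated over [5, 10].
iv = line_integral(mv, bv, 5, 10)
ip = line_integral(mp, bp, 5, 10)
total = iv ** 3 * ip ** 3          # integral of f_i over the bucket
c = 1 / total                      # normalizing constant when b_i = n
```

Both fitted lines stay positive on $[5,10]$, which is consistent with no shift being required in the example.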

969:

970:

971:

972: \subsection{Approximating the Number of Points in a Bucket}\label{ssec:IntegratingBuckets}

973:

974: As a line through a query point sweeps across a bucket, the points

975: in the bucket that dominate the query point are approximated by the

976: integral over the region above the line. In each of the three views

977: the query space intersects the plane giving the cases shown in

978: Figure~\ref{fig:cases}.

979: \begin{figure}[ht]

980:     \centering

981:     \includegraphics[width=4.25in]{figs/casesfilled.eps}\\

982:     \caption{Sweep algorithm cases.}

983:     \label{fig:cases}

984: \end{figure}

985:

986:

987: \begin{definition}[Percentage Function]\label{def:percentagefunction}

988: Integrating over the region above the line gives an approximation of

989: the percentage of points in the query space. We define the

990: percentage function given as:

991: \begin{equation}\label{eq:percentofbucket}

992:     p=\int\limits_{r_1} F_i~d\phi

993: \end{equation}

994: where $r_1$ is the region of the bucket in the query space. If two

995: lines go through the same bucket we have the smaller region $r_2$

996: subtracted from the larger region $r_1$ as follows.

997: \begin{equation}\label{eq:percentofbucket2}

998:     \triangle p=\int\limits_{r_1} F_i~d\phi - \int\limits_{r_2} F_i~d\phi.

999: \end{equation}

1000: Here, regions $r_1$ and $r_2$ correspond to regions above $Q_1$ and

1001: $Q_2$ in Figure~\ref{fig:views}, respectively.

1002: Lemma~\ref{lem:ConstRunningTimeForBucket} showed that finding the

1003: number of points in the bucket requires multiplying

1004: Equation~(\ref{eq:percentofbucket}) or (\ref{eq:percentofbucket2})

1005: by $n$.

1006: \end{definition}

1007:

1008: For each case shown in Figure~\ref{fig:cases}, we describe the

1009: function that results from integration in one view. To extend the

1010: result to any number of views, we take the result from the last view

1011: and integrate it in the next view. If the region below the line were

1012: desired, $p_{lower}=\frac{b_i}{n}-p$ gives the percentage of points

1013: below the line.

1014:

1015: For cases (a) -- (h) below, let $Q=(x_{1,q},x_{2,q},...,x_{6,q})$.

1016: For the $x$-view, let the lower left corner vertex be

1017: $(x_{1,l},x_{2,l})$ and the upper right corner vertex be

1018: $(x_{1,u},x_{2,u})$. In addition each line denoted $l$ is given by

1019: $x_2 = -t (x_1 - x_{i,q}) + x_{i+1,q}$ and corresponds to a line

1020: shown in the corresponding case in Figure~\ref{fig:cases}.

1021:

1022: %%%CASE A

1023: \medskip\noindent{\bf Case (a):}

1024: For this case $l$ crosses the bucket at $x_{1,l}$ and $x_{2,u}$. The

1025: integral over the shaded region is given by the following:

1026: \begin{equation}\label{eq:integrala}

1027:     p_a = \int\limits_{x_{1,l}}^{\frac{x_{2,u} - x_{2,q}}{-t} + x_{1,q}}

1028:     \int\limits_{-t(x_1 - x_{1,q}) + x_{2,q}}^{x_{2,u}}

1029:     F_i~dx_2 dx_1

1030: \end{equation}

1031: Notice that the lower bound of the integral over $dx_2$ contains

1032: $x_1$. This dependence within each view does not affect the

1033: integration in the remaining four dimensions. The solution to

1034: Equation~(\ref{eq:integrala})

1035: %, given in Appendix~\ref{apx:casesolutions},

1036: has the form:

1037: \begin{equation}\label{eq:forma}

1038:     a t^2 + b t + c + \frac{d}{t} + \frac{e}{t^2}.

1039: \end{equation}
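
To see why Equation~(\ref{eq:integrala}) takes the form of Equation~(\ref{eq:forma}), the following sketch works through the integration under the assumption that the part of $F_i$ that depends on this view is a product of two linear factors, $(\alpha x_1+\beta)(\gamma x_2+\delta)$, as is the case for the trend function built in Example~\ref{ex:BuildIndex}. The symbols $\alpha,\beta,\beta',\gamma,\delta,u,w,K,M$ and $G$ are introduced here only for illustration. With $u=x_1-x_{1,q}$ and $w=x_{2,u}-x_{2,q}$, the inner integral is
\begin{align*}
    \int\limits_{-tu+x_{2,q}}^{x_{2,u}}(\gamma x_2+\delta)~dx_2 &= K+Mtu-\tfrac{\gamma}{2}t^2u^2, \\
    K &= \tfrac{\gamma}{2}\left(x_{2,u}^2-x_{2,q}^2\right)+\delta w, \qquad M=\gamma x_{2,q}+\delta .
\end{align*}
Writing $\alpha x_1+\beta=\alpha u+\beta'$ with $\beta'=\alpha x_{1,q}+\beta$, an antiderivative of the outer integrand with respect to $u$ is
\begin{equation*}
    G(u)=\beta' K u+\tfrac{1}{2}\left(\alpha K+\beta' M t\right)u^2
        +\tfrac{1}{3}\left(\alpha M t-\tfrac{\gamma}{2}\beta' t^2\right)u^3
        -\tfrac{\alpha\gamma}{8}t^2u^4 .
\end{equation*}
The lower bound $u=x_{1,l}-x_{1,q}$ does not depend on $t$, so it contributes only terms in $t^0$, $t$ and $t^2$. The upper bound is $u=-w/t$, and every term of $G(-w/t)$ reduces to a multiple of $1/t$ or $1/t^2$; for example, $t^2u^4=w^4/t^2$. Under this assumption, $p_a=G(-w/t)-G(x_{1,l}-x_{1,q})$ therefore has the five terms of Equation~(\ref{eq:forma}).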

1040:

1041:

1042: %%%CASE B

1043: \medskip\noindent{\bf Case (b):}

1044: For this case $l$ crosses the bucket at $x_{1,u}$ and $x_{2,u}$. The

1045: integral over the shaded region is given by:

1046: \begin{equation}\label{eq:integralb}

1047:     p_b = \int\limits_{-\frac{(x_{2,u}-x_{2,q})}{t}+x_{1,q}}^{x_{1,u}}\int

1048:     \limits_{-t(x_{1}-x_{1,q})+x_{2,q}}^{x_{2,u}}F_i~dx_{2}dx_{1}.

1049: \end{equation}

1050: The solution

1051: %is given in Appendix~\ref{apx:casesolutions} and

1052: has the form of Equation~(\ref{eq:forma}).

1053:

1054:

1055: %%%CASE C

1056: \medskip\noindent{\bf Case (c):}

1057: For this case $l$ crosses the bucket at $x_{1,l}$ and $x_{2,l}$. The

1058: integral over the shaded region above the line is given by:

1059: \begin{equation}\label{eq:integrale}

1060:     p_e = \int\limits_{x_{1,l}}^{\frac{x_{2,l}-x_{2,q}}{-t}+x_{1,q}}

1061:     \int\limits_{-t(x_1 - x_{1,q}) + x_{2,q}}^{x_{2,u}}

1062:     F_i~dx_2 dx_1 ~+~

1063:     \int\limits_{\frac{x_{2,l}-x_{2,q}}{-t}+x_{1,q}}^{x_{1,u}}

1064:     \int\limits_{x_{2,l}}^{x_{2,u}}

1065:     F_i~dx_2 dx_1.

1066: \end{equation}

1067: The solution

1068: %is given in Appendix~\ref{apx:casesolutions} and

1069: has the form of Equation~(\ref{eq:forma}).

1070:

1071:

1072: %%%CASE D

1073: \medskip\noindent{\bf Case (d):}

1074: For this case $l$ crosses the bucket at $x_{1,u}$ and $x_{2,l}$. The

1075: integral over the shaded region is given by:

1076: \begin{equation}\label{eq:integralf}

1077:     p_f = \int\limits_{\frac{x_{2,l}-x_{2,q}}{-t}+x_{1,q}}^{x_{1,u}}

1078:     \int\limits_{-t(x_1 - x_{1,q}) + x_{2,q}}^{x_{2,u}}

1079:     F_i~dx_2 dx_1 ~+~

1080:     \int\limits_{x_{1,l}}^{\frac{x_{2,l}-x_{2,q}}{-t}+x_{1,q}}

1081:     \int\limits_{x_{2,l}}^{x_{2,u}}

1082:     F_i~dx_2 dx_1.

1083: \end{equation}

1084: The solution

1085: %is given in Appendix~\ref{apx:casesolutions} and

1086: has the form of Equation~(\ref{eq:forma}).

1087:

1088: %%%CASE E

1089: \medskip\noindent{\bf Case (e):}

1090: For this case $l$ crosses the bucket at $x_{2,l}$ and $x_{2,u}$. The

1091: integral over the shaded region is given by:

1092: \begin{equation}\label{eq:integralc}

1093:     p_c = \int\limits_{x_{2,l}}^{x_{2,u}}

1094:     \int\limits_{x_{1,l}}^{\frac{x_2 - x_{2,q}}{-t} + x_{1,q}}

1095:     F_i~dx_1 dx_2.

1096: \end{equation}

1097: The solution

1098: %is given in Appendix~\ref{apx:casesolutions} and

1099: has the form of

1100: \begin{equation}\label{eq:formc}

1101:     c + \frac{d}{t} + \frac{e}{t^2}

1102: \end{equation}

1103: which is like Equation~(\ref{eq:forma}) with $a=b=0$.

1104:

1105:

1106: %%%CASE F

1107: \medskip\noindent{\bf Case (f):}

1108: Similar to case (e), $l$ crosses the bucket at $x_{2,l}$ and

1109: $x_{2,u}$. The integral over the shaded region is given by:

1110: \begin{equation}\label{eq:integrald}

1111:     p_d = \int\limits_{x_{2,l}}^{x_{2,u}}

1112:     \int\limits_{\frac{x_2 - x_{2,q}}{-t} + x_{1,q}}^{x_{1,u}}

1113:     F_i~dx_1 dx_2.

1114: \end{equation}

1115: The solution

1116: %is given in Appendix~\ref{apx:casesolutions} and

1117: has the form of Equation~(\ref{eq:formc}).

1118:

1119:

1120: %%%CASE G

1121: \medskip\noindent{\bf Case (g):}

1122: For this case $l$ crosses the bucket at $x_{1,l}$ and $x_{1,u}$. The

1123: integral over the shaded region is given by:

1124: \begin{equation}\label{eq:integralg}

1125:     p_g = \int\limits_{x_{1,l}}^{x_{1,u}}

1126:     \int\limits_{-t(x_1 - x_{1,q})+x_{2,q}}^{x_{2,u}}

1127:     F_i~dx_2 dx_1.

1128: \end{equation}

1129: The solution

1130: %is given in Appendix~\ref{apx:casesolutions} and

1131: has the form

1132: \begin{equation} \label{eq:formg}

1133:     at^2+bt+c

1134: \end{equation}

1135: which is like Equation~(\ref{eq:forma}) with $d=e=0$.

1136:

1137:

1138: %%%CASE H

1139: \medskip\noindent{\bf Case (h):}

1140: The line $l$ passes below all the corner vertices; hence the

1141: integral of the function is given as:

1142: \begin{equation}\label{eq:integralh}

1143:     p_h = \int\limits_{x_{1,l}}^{x_{1,u}}

1144:     \int\limits_{x_{2,l}}^{x_{2,u}}

1145:     F_i~dx_2 dx_1.

1146: \end{equation}

1147: The solution

1148: %is given in Appendix~\ref{apx:casesolutions} and

1149: has the form of Equation~(\ref{eq:formg}).

1150:

1151: %%DONE WITH CASES

1152:

1153: The above cases have solutions for each view in the form of

1154: Equation~(\ref{eq:forma}). Hence the percentage function for a

1155: single bucket as a function of $t$ is of the form:

1156: \begin{align}\label{eq:BucketProbability}

1157:     p &=\left( a_x t^2 + b_x t + c_x + \frac{d_x}{t} + \frac{e_x}{t^2} \right)

1158:         \left( a_y t^2 + b_y t + c_y + \frac{d_y}{t} + \frac{e_y}{t^2} \right)\nonumber \\

1159:       &~~~~\left( a_z t^2 + b_z t + c_z + \frac{d_z}{t} + \frac{e_z}{t^2} \right)

1160: \end{align}

1161: where $t\neq0$ when $d_x,d_y,d_z,e_x,e_y,e_z \neq 0$. Finally,

1162: renaming variables gives the general form:

1163: \begin{equation}\label{eq:BucketGeneralForm}

1164:     p=a_6 t^6 + a_5 t^5 + a_4 t^4 + a_3 t^3 + a_2 t^2 + a_1 t + c +

1165:     \frac{d_1}{t} + \frac{d_2}{t^2} + \frac{d_3}{t^3} + \frac{d_4}{t^4} +

1166:     \frac{d_5}{t^5} + \frac{d_6}{t^6}

1167: \end{equation}

1168: where $t \neq 0$ when $d_i \neq 0$ for $1 \leq i \leq 6$. Since

1169: functions of the form of Equation~(\ref{eq:BucketGeneralForm}) are closed under subtraction,

1170: $\triangle p$ from Equation~(\ref{eq:percentofbucket2}) will also

1171: have the same form.\medskip
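The bookkeeping behind Equation~(\ref{eq:BucketGeneralForm}) can be illustrated with a small sketch that stores such a function as a map from exponent to coefficient. The representation and the names \texttt{multiply}, \texttt{subtract} and \texttt{evaluate} are assumptions made for illustration only.

```python
from collections import defaultdict

def multiply(p, q):
    """Multiply two functions given as {exponent: coefficient} maps."""
    out = defaultdict(float)
    for i, a in p.items():
        for j, b in q.items():
            out[i + j] += a * b
    return dict(out)

def subtract(p, q):
    """Return p - q; no exponent outside those of p and q can appear."""
    out = defaultdict(float)
    for i, a in p.items():
        out[i] += a
    for j, b in q.items():
        out[j] -= b
    return dict(out)

def evaluate(p, t):
    """Evaluate at t; t must be nonzero if any exponent is negative."""
    return sum(a * t ** i for i, a in p.items())
```

Multiplying three per-view functions with exponents $-2,\ldots,2$ yields exponents $-6,\ldots,6$, and subtracting two such products never introduces an exponent outside that range.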

1172:

1173: As the {\em query space} from Definition~\ref{def:queryspace} sweeps

1174: through a bucket, it crosses the bucket corner vertices. Each time a

1175: corner vertex crosses the {\em query space} boundary, the case that

1176: applies may change in one or more of the views.

1177:

1178: \begin{definition}[Bucket and Index Time-Intervals]\label{def:buckettimeinterval}

1179: The span of time in which no vertex from bucket $B_i$ enters or

1180: leaves the query space defines a {\em bucket time-interval}. We

1181: denote the time-interval as a half-open interval $[l,u)$ where $l$

1182: is the lower bound and $u$ is the upper bound. Each {\em bucket

1183: time-interval} has an associated percentage function $\triangle p$

1184: given by Equation~(\ref{eq:percentofbucket2}). We define the {\em

1185: index time-interval} similarly except that the span of time is

1186: defined when no vertex from {\em any} bucket in the index enters or

1187: leaves the query space.

1188: \end{definition}

1189:

1190: As we will see, index time-intervals are created from individual

1191: bucket intervals. Throughout the rest of this paper we use

1192: the term {\em time intervals} when the context clearly identifies

1193: which type we mean.

1194:

1195: \begin{definition}[Time-Partition Order]\label{def:timepartitionorder}%

1196: Let $B$ be the set of buckets. Let $Q_1$ and $Q_2$ be two query

1197: points and $(t^{[},t^{]})$ be the query time interval. We define the

1198: {\em Time-Partition Order} to be the set of ordered time instances

1199: $TP=\{t_1,t_2,\ldots,t_i,\ldots,t_k\}$ such that $t_1=t^{[}$ and $t_k=t^{]}$,

1200: and each $[t_i,t_{i+1})$ is an {\em index time-interval}.

1201: \end{definition}

1202:

1203: \begin{example}[Calculating Bucket Time-Intervals]

1204:     \label{ex:TimeIntervals} \rm %

1205:     Continuing Example~\ref{ex:BuildIndex}, let $Q$ be a query defined

1206:     by:

1207:     \begin{eqnarray}

1208:         q_{1} &=& (9.5,~8,~9.5,~8,~9.5,~8)\\

1209:         q_{2} &=& (8.5,~5,~8.5,~5,~8.5,~5)\\

1210:         T &=& (0.1,~10)

1211:     \end{eqnarray}

1212:     where $q_{1}$ and $q_{2}$ form the query space over the query time

1213:     interval $T$. To determine the time intervals during which no corner

1214:     vertex enters or leaves the query space, find the slopes of the lines through each query point and

1215:     each corner vertex of the bucket. Figure~\ref{fig:CornerLines} shows

1216:     lines from the two query points to the corner vertices for the first

1217:     view. Since the query points have the same coordinates in each view, every

1218:     view will appear the same.

1219:     \begin{figure}[t]

1220:         \centering

1221:         \includegraphics[width=3.75in]{figs/ExampleSlopes_v2.eps}

1222:         \caption{Lines from query points to corner vertices.}%

1223:         \label{fig:CornerLines}

1224:     \end{figure}

1225:     The set of times when lines through $q_1$ (shown as solid lines) cross

1226:     corner vertices is $\{0.\overline{4}, 6\}$. The set of times when

1227:     lines through $q_2$ (shown as dotted lines) cross corner

1228:     vertices and are in the time interval is $\{1.42857\}$. The

1229:     union of these two sets along with the end points makes up the

1230:     times used to create the time intervals:

1231:     $\{(0.1,0.\overline{4}),(0.\overline{4},1.42857),(1.42857,6),(6,10)\}$.

1232: \end{example}
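The crossing times in Example~\ref{ex:TimeIntervals} can be recomputed with the sketch below. The function names are hypothetical; the sketch assumes one view with the bucket corners at $(5,5)$ and $(10,10)$ and uses the fact that a line of slope $-t$ through a query point $(v,x_0)$ passes through a corner $(x_1,x_2)$ when $t=-(x_2-x_0)/(x_1-v)$.

```python
from itertools import product

def crossing_times(q, lower, upper):
    """Times t at which the line of slope -t through the dual point
    q = (v, x0) passes through a corner of the rectangle [lower, upper]."""
    v, x0 = q
    times = set()
    for x1, x2 in product((lower[0], upper[0]), (lower[1], upper[1])):
        if x1 != v:  # a corner directly above or below q is never reached
            times.add(-(x2 - x0) / (x1 - v))
    return times

def time_intervals(queries, lower, upper, t_start, t_end):
    """Split (t_start, t_end) at every crossing time that falls inside it."""
    cuts = {t for q in queries for t in crossing_times(q, lower, upper)
            if t_start < t < t_end}
    points = [t_start] + sorted(cuts) + [t_end]
    return list(zip(points, points[1:]))

intervals = time_intervals([(9.5, 8.0), (8.5, 5.0)],
                           (5.0, 5.0), (10.0, 10.0), 0.1, 10.0)
```

For the query points of the example this produces four intervals with interior boundaries at $4/9$, $10/7$ and $6$.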

1233: \medskip

1234:

1235: Integration over the {\em spatial dimensions} of the eight possible

1236: cases presented in Figure~\ref{fig:cases} gave a function of the

1237: form of Equation~(\ref{eq:BucketGeneralForm}). To {\em maximize}

1238: Equation~(\ref{eq:BucketGeneralForm}) in the {\em temporal

1239: dimension}, we first take the derivative:

1240: \begin{eqnarray}

1241:     \triangle p'&=&(6a_{6}t^{12}+5a_{5}t^{11}+4a_{4}t^{10}+3a_{3}t^{9}+2a_{2}t^{8}+a_{1}t^{7} \nonumber \\

1242:                 &~&~ -d_{1}t^{5}-2d_{2}t^{4}-3d_{3}t^{3}-4d_{4}t^{2}-5d_{5}t -6d_{6}) / t^7 \label{eq:Derivative}

1243: \end{eqnarray}

1244: where $t \neq 0$. Solving $\triangle p'=0$ requires finding the

1245: roots of this degree-$12$ polynomial, {\em which in general is not possible

1246: using an exact (closed-form) method}. Hence we need a numerical method for solving

1247: the polynomial.

1248: %(Note that an exact solution is possible if the problem

1249: %uses only two dimensional moving points because that would require

1250: %solving only $8$-degree polynomial.)

1251:

1252: The following factors influenced the choice of the numerical method:

1253: \begin{enumerate}

1254:     \item Speed of the algorithm is more important than accuracy

1255:           because we don't expect the original function to change

1256:           dramatically over an index time-interval. We expect small

1257:           change because in practice the time intervals are short.

1258:     \item The algorithm must converge toward a solution within the

1259:           interval; that is, the algorithm must be stable.

1260:     \item Given that we are maximizing Equation~(\ref{eq:BucketGeneralForm})

1261:           over a short time interval, we don't expect

1262:           Equation~(\ref{eq:Derivative}) to have more than one

1263:           solution. This assumption may seem naive, but it is

1264:           reasonable given factor (1).

1265: \end{enumerate}

1266:

1267: Factor (1) above is related to (3) in that it indicates that points

1268: close together have similar values, but emphasizes that speed is the

1269: goal. Factor (2) above eliminates several algorithms from

1270: consideration, but is required to avoid choosing a solution

1271: that is not within the time interval evaluated.

1272:

1273: Of the three factors to consider, (3) is probably the least

1274: intuitive. Consider the following conjecture:

1275:

1276: \begin{conjecture}\label{lem:NearMaximums}

1277: Given $p$ for a set of buckets, if the Euclidean distance between

1278: two maxima is small, then the difference between the maxima is

1279: small.

1280: \end{conjecture}

1281:

1282: Consider the physical characteristics of the system. The value of

1283: $p$ over the time interval changes by no more than $b_i$ for any bucket

1284: $B_i$. Clearly $p$ either increases as it encompasses more of the

1285: bucket or decreases as it encompasses less of the bucket. When

1286: $p$ represents the distribution over several buckets, each bucket

1287: contributes a decreasing or increasing amount over the time

1288: interval. Clearly $p$ is bounded below by $0$ and above by

1289: $\sum\limits_i b_i$. Hence, the rate at which the derivative $p'$

1290: changes is characterized by the physical system and reflects the

1291: differences in the buckets as $t$ changes. Since $p$ does not change

1292: dramatically over $t$ for any bucket, then change in several buckets

1293: over $t$ will likewise not be dramatic. Hence if the distance

1294: between two maxima is small, the maxima have a small difference in

1295: magnitude. {\em This rationale for the conjecture above is verified

1296: by the experiments}.\medskip

1297:

1298: Based on these factors, we use a common method for the first

1299: approximation: we look at the graph of $p'$. Programmatically check

1300: $c$ intervals of Equation~(\ref{eq:Derivative}) for a change in

1301: sign. If there exists a sign change, use the bisection method to

1302: find the root. When no sign change is found but the values of $p'$ at

1303: two adjacent sample points lie within $\epsilon$ of $0$, we also check

1304: the interval between them. If any roots exist, we evaluate $p$ at each

1305: root and at the end points and keep the largest value.
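The procedure just described can be sketched in code. The following Python fragment is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names \texttt{bisect\_root} and \texttt{approx\_max}, the callables \texttt{p} and \texttt{dp} (standing for $p$ and $p'$), and the default values of \texttt{c}, \texttt{eps} and \texttt{max\_iter} are all hypothetical.

```python
def bisect_root(dp, lo, hi, max_iter=10):
    """Bisection on dp over [lo, hi]; assumes dp(lo) and dp(hi) differ in sign."""
    f_lo = dp(lo)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = dp(mid)
        if f_mid == 0.0:
            return mid
        if (f_lo < 0.0) != (f_mid < 0.0):
            hi = mid                  # sign change lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid     # sign change lies in [mid, hi]
    return 0.5 * (lo + hi)


def approx_max(p, dp, t_lo, t_hi, c=8, eps=1e-6, max_iter=10):
    """Return (t, p(t)) for the approximate maximum of p on [t_lo, t_hi]."""
    step = (t_hi - t_lo) / c
    grid = [t_lo + i * step for i in range(c + 1)]
    slopes = [dp(t) for t in grid]
    candidates = [t_lo, t_hi]         # the end points are always checked
    for a, b, fa, fb in zip(grid, grid[1:], slopes, slopes[1:]):
        if fa == 0.0:
            candidates.append(a)      # sample point is itself a root of dp
        elif fa * fb < 0.0:
            candidates.append(bisect_root(dp, a, b, max_iter))
        elif abs(fa) <= eps and abs(fb) <= eps:
            candidates.append(0.5 * (a + b))   # near-zero slope, no sign change
    best = max(candidates, key=p)
    return best, p(best)
```

With \texttt{c} and \texttt{max\_iter} fixed, the sketch evaluates $p'$ at $c+1$ sample points and runs at most $c$ bounded bisections, so the work per time interval does not depend on the input size.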

1306:

1307: \begin{lemma}\label{lem:ConstRunningTimeForTimeInterval}

1308: The approximate maximum within a time interval can be found in

1309: $O(1)$ time.

1310: \end{lemma}

1311:

1312: \begin{proof}

1313: Each {\em time interval} has an associated probability function

1314: $\triangle p$ which is calculated in $O(1)$ time. Finding $\triangle

1315: p' = 0$ also takes $O(1)$ time. By placing a constant bound on the

1316: number of iterations in the bisection method, we bound the time

1317: required in the numerical section of the algorithm by a constant.

1318: Plugging in the solution found by the bisection method along with

1319: the end points also takes $O(1)$ time. Hence, the running time to

1320: find the approximate maximum within a time interval is $O(1)$.

1321: \end{proof}

1322:

1323: We chose to limit the number of iterations in the bisection method

1324: to 10, which limits the running time to a small constant value. This

1325: value was chosen based on empirical observation that index

1326: time-intervals remain small (about $0.01$ to $4$). Hence, using the

1327: bisection method allows us to narrow our search down to an interval

1328: at least as small as $\frac{1}{256}$ units of time. If time is

1329: measured in hours, this interval equates to only $14$ seconds.
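As a quick check of these figures, assuming the longest index time-interval has length $4$ and that each bisection step halves the bracketing interval, ten iterations give
\begin{equation*}
    \frac{4}{2^{10}} = \frac{4}{1024} = \frac{1}{256},
    \qquad
    \frac{1}{256}\ \mathrm{hours} = \frac{3600}{256}\ \mathrm{seconds} \approx 14.06\ \mathrm{seconds}.
\end{equation*}
For the shortest observed interval length of $0.01$, the same ten iterations leave a bracket of $0.01/1024 \approx 9.8\times10^{-6}$ units of time.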

1330:

1331: \begin{example}[Building Time-Intervals and Finding {\sc MaxCount}]\rm %

1332:     Continuing Example~\ref{ex:TimeIntervals} we build the functions for

1333:     time intervals

1334:     \begin{equation}

1335:         \{(0.1,0.\overline{4}),(0.\overline{4},1.42857),(1.42857,6),(6,10)\}

1336:     \end{equation}

1337:     by integrating using the different cases from

1338:     Figure~\ref{fig:cases}. For space concerns we omit the integrals here and

1339:     note that the result of integrating each interval and finding

1340:     the maximum gives a maximum of approximately $3$ at

1341:     $t=0.\overline{4}$.

1342:

1343:

1344:     \noindent\textbf{Time Interval: }$[0.1,0.\overline{4}]$. Here case

1345:     (c) holds for query point $q_2$ over this time interval. Hence the

1346:     integral for query point $q_{2}$ and $t\in[0.1,0.\overline{4}]$

1347:     in each dimension is given as:

1348:     \begin{eqnarray}

1349:         p_{c} &=& c\int_{8.5}^{10}\int_{5}^{10}2(0.7x_{0}-2.9)dx_{1}dx_{0}+\int_{5}^{8.5}\int_{-t(x_{0}-8.5)+5}^{10}2(0.7x_{0}-2.9)dx_{1}dx_{0}\nonumber\\

1350:             &=& 117.5-17.354\bar{6}t \label{eq:Q2CaseEinterval1}

1351:     \end{eqnarray}

1352:     Case (g) holds for query point $q_1$ and thus the integral for query

1353:     point $q_{1}$ and $t\in (0.1,0.\overline{4})$ in each dimension is

1354:     given as:

1355:     \begin{align}

1356:         p_{g} &  =c\int_{5}^{10}\int_{-t(x_{0}-9.5)+8}^{10}2(0.7x_{0}-2.9)dx_{1}dx_{0}\nonumber\\

1357:             &  =47.0-32.41\bar{6}t

1358:     \end{align}

1359:     Hence the integral of the region is:

1360:     \begin{align}

1361:         p &  =c\left(  p_{c}-p_{g}\right)  ^{3}\nonumber\\

1362:         &  =2.106\times10^{-3}t^{3}+2.957\times10^{-2}t^{2}+0.138t+0.216

1363:     \end{align}

1364:     Evaluating $p$ at the start and end of the time interval we have

1365:     $p(0.1)\approx0.23$ and $p(0.\overline{4})\approx0.28$. Figure

1366:     \ref{fig:Interval1} shows $p$ in the time interval. Clearly $p$ is

1367:     increasing and consequently we have a maximum at the end point

1368:     $t=0.\overline{4}$.

1369:     \begin{figure}[h]

1370:         \centering

1371:         \includegraphics[width=4in]{figs/ExampleInterval1.eps}

1372:         \caption{Graph of $p$, $0.1 \leq t \leq 0.\overline{4}$.}

1373:         \label{fig:Interval1}

1374:     \end{figure}

1375:     Since there are $10$ points we must multiply

1376:     $p(0.\overline{4})$ by $10$ to get the approximation for the time

1377:     interval as:%

1378:     \begin{equation}

1379:         MaxCount_{0.1\leq t\leq 0.\overline{4}} \approx 2.8.

1380:     \end{equation}

1381:     Since we cannot have partial points, we can round this result

1382:     to $3$.\medskip

1383:

1384:     The rest of the intervals are similar using different cases. We

1385:     omit the remaining cases to save space and to eliminate the risk

1386:     of boring the reader. None of the other intervals has a higher

1387:     {\sc MaxCount} and so it follows that {\sc MaxCount} has an

1388:     approximate value of $3$ at time $t=0.\overline{4}$.

1389: \end{example}
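The arithmetic for the first time interval of the example can be reproduced with a few lines of Python. This is only a numerical check under stated assumptions: the coefficients are copied as printed in the example, $0.\overline{4}$ is written as $4/9$, and the names \texttt{p}, \texttt{T\_LO}, \texttt{T\_HI} and \texttt{NUM\_POINTS} are hypothetical.

```python
def p(t):
    """Cubic for the first time interval, coefficients as printed in the example."""
    return 2.106e-3 * t**3 + 2.957e-2 * t**2 + 0.138 * t + 0.216


T_LO, T_HI = 0.1, 4.0 / 9.0   # 0.444... written as 4/9
NUM_POINTS = 10               # number of points in the example

# All coefficients are positive, so p is increasing for t > 0 and the
# maximum over the interval is attained at the right end point.
max_count_estimate = NUM_POINTS * p(T_HI)
print(round(p(T_LO), 2), round(p(T_HI), 2), round(max_count_estimate, 1))
# prints: 0.23 0.28 2.8
```

Rounding the estimate of about $2.8$ to the nearest whole point gives the value $3$ reported in the example.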

1390:

1391: \subsection{Dynamic {\sc MaxCount} Algorithm}\label{ssec:MaxCountAlgorithm}

1392:

1393: \progstart \vspace{-18pt}

1394: \begin{tabbing}

1395: \hspace*{.5in}\=\hspace*{.5in}\=\hspace*{.5in}\= \kill

1396: {\bf {\sc MaxCount}$(H, Q_1, Q_2, t^[,t^])$} \\

1397: {\bf input:} \>\> A set of buckets $H$ built by the index structure presented, \\

1398:              \>\> query points $Q_1(t)$ and $Q_2(t)$ and a query time interval $(t^[,t^])$. \\

1399: {\bf output:}\>\> The estimated {\sc MaxCount} value.\\

1400: \\

1401: 01.  \> $TimeIntervals \leftarrow \emptyset $                            \` $O(1)$ \\

1402: 02.  \> {\bf for} $i \leftarrow 0$ {\bf to} $|H|-1$                                  \` $O(B)$ \\

1403: 03.  \>   \> $CrossTimes \leftarrow $\textsc{ CalculateCrossTimes}$(Q_1,Q_2,t^[,t^],H_i)$     \` $O(1)$ \\

1404: 04.  \>   \> {\bf for} $j \leftarrow 1$ {\bf to} $|CrossTimes|-1$                    \` $O(1)$ \\

1405: 05.  \>   \>   \>\textsc{Union}$(TimeIntervals,TimeInterval(t_{j-1}, t_{j}))$     \` $O(1)$ \\

1406: 06.  \>   \> {\bf end for} \\

1407: 07.  \> {\bf end for}\\

1408: \\

1409: 08. \> $TimeIntervals \leftarrow $\textsc{ BucketSort}$(TimeIntervals)$                       \` $O(B)$ \\

1410: 09. \> $IndexTimeIntervals \leftarrow $\textsc{ Merge}$(TimeIntervals)$                       \` $O(B)$ \\

1411: 10. \> {\bf for each} $IndexTimeInterval \in IndexTimeIntervals$               \` $O(B)$ \\

1412: 11. \>  \> \textsc{calculate}$(MaxCount, MaxTime, IndexTimeInterval)$             \` $O(1)$ \\

1413: 12. \> {\bf end for} \\

1414: \\

1415: 13. \> {\bf return} $(MaxCount, MaxTime)$

1416: \end{tabbing}

1417: \progend

1418:

1419: The algorithm to compute {\sc MaxCount} with each line labeled with

1420: its running time is given above. Line 01 initializes a set of bucket

1421: time-interval objects to be empty. Line 03 returns a list of ordered

1422: times when a line through $Q_1$ or $Q_2$ crosses a bucket corner

1423: vertex. Line 05 turns this list into a set of $TimeInterval$ objects

1424: and adds them to the set of $TimeIntervals$. We list the inner

1425: loop (Lines 04-06) as $O(1)$ because it consists of a constant number of

1426: calculations bounded by the number of vertices in the bucket.  Line

1427: 08 uses the linear time sorting algorithm \textsc{BucketSort} to

1428: sort the bucket time intervals. Line 09 creates the time-partition

1429: order and index bucket time intervals from the bucket time intervals

1430: in $O(B)$. An additional pass adds the bucket time intervals to the

1431: appropriate index time-intervals in $O(B)$. Lines 10-12 perform the

1432: {\sc MaxCount} calculation discussed above.
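The flow from cross times to index time-intervals can be sketched as follows. This Python fragment is a simplified illustration, not the algorithm as analyzed: it uses Python's built-in comparison sort instead of \textsc{BucketSort} (so it does not have the linear running time claimed above), it assumes the cross times are already clipped to the query time interval, and the function names and sample cross times are hypothetical.

```python
def bucket_intervals(cross_times):
    """Lines 04-06: consecutive cross times become (lower, upper) pairs."""
    return list(zip(cross_times, cross_times[1:]))


def index_intervals(intervals):
    """Line 09: split the time axis at every distinct end point."""
    cuts = sorted({t for iv in intervals for t in iv})
    return list(zip(cuts, cuts[1:]))


# Hypothetical ordered cross times for two buckets on the query interval [0.1, 10].
per_bucket = [[0.1, 1.42857, 10.0], [0.1, 0.4444, 6.0, 10.0]]

# Line 08 (here with a comparison sort): order the bucket time intervals.
all_intervals = sorted(iv for ct in per_bucket for iv in bucket_intervals(ct))

# Line 09: the resulting time partition.
partition = index_intervals(all_intervals)
```

For these sample values the partition consists of four index time-intervals whose end points are $0.1$, $0.4444$, $1.42857$, $6$ and $10$.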

1433:

1434: \medskip

1435: In order to use the linear time \textsc{BucketSort} algorithm, we

1436: need the following definition and lemmas.

1437:

1438: \begin{definition}[Time-Interval Ordering]

1439: \label{def:IntervalOrder}%

1440: We define the lexicographical ordering $\prec$ of two {\em time

1441: intervals} $A$ and $B$ as follows:

1442: \begin{eqnarray}

1443:     A.l < B.l                       & \Rightarrow & A \prec B \\

1444:     A.l = B.l \quad \wedge \quad A.u < B.u & \Rightarrow & A \prec B \\

1445:     A.l = B.l \quad \wedge \quad A.u = B.u & \Rightarrow & A = B

1446: \end{eqnarray}

1447:

1448:

1449: \end{definition}
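Definition~\ref{def:IntervalOrder} corresponds to comparing the pairs $(l,u)$ lexicographically. A minimal Python sketch, in which the class name \texttt{TimeInterval} and the function name \texttt{precedes} are hypothetical, is:

```python
from typing import NamedTuple


class TimeInterval(NamedTuple):
    l: float  # lower bound
    u: float  # upper bound


def precedes(a, b):
    """True when a comes strictly before b: compare l first, then u."""
    return (a.l, a.u) < (b.l, b.u)
```

Two intervals with equal lower and upper bounds are equal under the ordering, so \texttt{precedes} returns \texttt{False} in both directions for them.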

1450:

1451: %??? Fix this so that the values are near the correct areas

1452: \begin{figure}

1453:   \centering

1454:   \psfrag{Q}{{\tiny $Q$}}

1455:   \psfrag{A1}{{\tiny $A=\frac{1}{2}$}}

1456:   \psfrag{A2}{{\tiny $A=\frac{1}{4}$}}

1457:   \psfrag{A3}{{\tiny $A=\frac{1}{12}$}}

1458:   \includegraphics[height=3in]{figs/DistributionBox1.eps}\\

1459:   \caption{Areas of successive slopes.}

1460:   \label{fig:SlopeDistribution}

1461: \end{figure}

1462:

1463:

1464: The distribution of time interval objects sorted in Line 08 of the

1465: {\sc MaxCount} algorithm may not be uniform across the query time

1466: interval $T=[t^[,t^]]$. However, we can still prove the following.

1467:

1468: \begin{lemma}

1469: \label{lem:TimeIntervalDistribution}%

1470: If the distribution of buckets is uniform, then the distribution of

1471: bucket time-interval objects can be uniformly distributed within the

1472: sorting buckets of the bucket sort.

1473: \end{lemma}

1474: \begin{proof}

1475: Consider the relationship between successive slopes measured as the

1476: angles between lines through a query point $Q$ with slopes

1477: $s_i=-t_i$ and $s_{i+1}=-t_{i+1}$. Suppose $\triangle t=1$ with

1478: $t_0=0$ and $t_1=1$, then the angle between the two lines is

1479: $\triangle s=\frac{\pi}{4}$. The solid lines in

1480: Figure~\ref{fig:SlopeDistribution} show that half of the bucket

1481: corner vertices are swept by the line sweeping through $Q$ between

1482: $s_0=0$ and $s_1=-1$. Consider a query time interval $[0,10]$. Half

1483: of the corner vertices, and thus half of the time intervals, are

1484: between time $t=0$ and $t=1$. Thus, we conclude that the time

1485: interval objects created by sweeping will not be uniformly

1486: distributed throughout the query time interval.

1487:

1488: Let $Q'$ be the midpoint between $Q_1$ and $Q_2$. Let $S =

1489: \{t_1,\ldots,t_k\}$ where $t_1 = t^[$, $t_k=t^]$ and $t_{i+1} - t_i = L$

1490: for some positive constant $L$ and $1 \leq i \leq k-1$. Let $D_B$ be

1491: a bucket that contains the space in the 6-dimensional index. Model

1492: the normalized bucket function for $D_B$ as a constant $F=1$. Thus

1493: $p$, the bucket probability, from

1494: Equation~(\ref{eq:BucketProbability}) becomes the hyper-volume of

1495: the space swept by the line through $Q'$. By

1496: Lemma~\ref{lem:ConstRunningTimeForTimeInterval}, we can find the

1497: area for a specific time interval in $S$ in constant time. The

1498: percentage of sorting buckets, $posb_i$, needed in any time interval

1499: $T_i=[t_i,t_{i+1}]$, with $t_i,t_{i+1} \in S$, within the query time interval is given

1500: by:

1501: \begin{equation}

1502:     posb_i = \frac{p(t_{i+1})-p(t_i)}{p(t^])-p(t^[)}

1503: \end{equation}

1504: Let $N$ be the number of sorting buckets. Then, the number of

1505: sorting buckets, $nosb_i$, assigned to interval $i$ is given by:

1506: \begin{equation}

1507:     nosb_i = N \cdot posb_i

1508: \end{equation}

1509: If $nosb_i<1$ we can combine it with $nosb_{i+1}$. If the query time

1510: interval is very large, then we may need to include multiple time

1511: intervals from $S$ to get one sorting bucket. Thus, we create more

1512: sorting buckets (with smaller time intervals) in areas where the

1513: expected number of bucket time intervals is large. Conversely, we

1514: create fewer sorting buckets (with larger time intervals) in areas

1515: where the expected number of bucket time intervals is small. Hence

1516: we model each sorting bucket so that its time interval length

1517: directly relates to the percentage of bucket time intervals that are

1518: assigned to it. Thus, we conclude that we will uniformly distribute

1519: the time interval objects across all sorting buckets.

1520: \end{proof}
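The allocation $nosb_i = N \cdot posb_i$ used in the proof can be illustrated numerically. In the Python sketch below, \texttt{swept\_area} is an assumed stand-in for $p$: it is chosen so that successive unit time steps yield the areas $\frac{1}{2}$, $\frac{1}{4}$ and $\frac{1}{12}$ labelled in Figure~\ref{fig:SlopeDistribution}, and it is not taken from Equation~(\ref{eq:BucketProbability}). The function names are hypothetical, and combining intervals with $nosb_i<1$ is not shown.

```python
def swept_area(t):
    """Illustrative p(t): area of the unit square swept by a line through a
    corner as its slope goes from 0 to -t (steps give 1/2, 1/4, 1/12)."""
    return 0.5 * t if t <= 1.0 else 1.0 - 1.0 / (2.0 * t)


def allocate_sorting_buckets(p, times, n_buckets):
    """Return nosb_i = N * posb_i for each interval [times[i], times[i+1]]."""
    total = p(times[-1]) - p(times[0])
    return [n_buckets * (p(b) - p(a)) / total
            for a, b in zip(times, times[1:])]


# Query time interval [0, 3] split into steps of length L = 1, N = 100.
shares = allocate_sorting_buckets(swept_area, [0.0, 1.0, 2.0, 3.0], 100)
```

Here roughly $60$, $30$ and $10$ sorting buckets go to the three steps, so the early part of the query time interval, where most corner vertices are swept, receives the most sorting buckets.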

1521:

1522: \begin{lemma}

1523: \label{lem:BuscketSortConstantTimeInsertion}%

1524: Insertion of any bucket time-interval object $T_O$ into the proper

1525: sorting bucket can be done in $O(1)$ time.

1526: \end{lemma}

1527: \begin{proof}

1528: The distribution of sorting buckets is determined by $k$ time

1529: intervals in Lemma~\ref{lem:TimeIntervalDistribution}. Call these

1530: {\em sorting time interval objects} where each object contains: the

1531: lower bound $l$, the upper bound $u$, the number of sorting buckets

1532: assigned to this interval $b_s$, the length of the time interval for

1533: the sorting bucket $w$ and an array $B_p$ containing pointers to

1534: these sorting buckets. Let $A$ be the array of sorting time interval

1535: objects, and  $L$ be the length of each time interval where the time

1536: intervals are as in Lemma~\ref{lem:TimeIntervalDistribution}. Then,

1537: finding the correct sorting bucket for $T_O$ requires two

1538: calculations:

1539: \begin{eqnarray}

1540:     SortingTimeInterval &=& A \left[ ~ \left\lfloor \frac{T_O.l}{L} \right\rfloor ~ \right] \\

1541:     SortingBucket       &=& B_p \left[ ~ \left\lfloor \frac{T_O.l - SortingTimeInterval.l}{w} \right\rfloor ~ \right].

1542: \end{eqnarray}

1543: Each of these calculations requires constant time, hence $T_O$ can

1544: be inserted into the proper sorting bucket in $O(1)$ time.

1545: \end{proof}
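The two index calculations in the proof can be sketched as a two-level lookup. The Python fragment below is an illustration under stated assumptions: the class and function names are hypothetical, the first index subtracts the start of the query time interval (\texttt{t\_start}), which coincides with the formula above when $t^[ = 0$, and both indices are clamped so that a lower bound equal to $t^]$ falls into the last sorting bucket.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SortingTimeInterval:
    l: float                                     # lower bound
    u: float                                     # upper bound
    w: float                                     # time length per sorting bucket
    buckets: list = field(default_factory=list)  # B_p: the sorting buckets


def make_sorting_interval(l, u, b_s):
    """Create a sorting time interval with b_s empty, equally wide buckets."""
    return SortingTimeInterval(l, u, (u - l) / b_s, [[] for _ in range(b_s)])


def insert(A, t_start, L, obj_l, obj):
    """Place obj, whose lower bound is obj_l, into its sorting bucket in O(1)."""
    i = min(int(math.floor((obj_l - t_start) / L)), len(A) - 1)
    sti = A[i]
    j = min(int(math.floor((obj_l - sti.l) / sti.w)), len(sti.buckets) - 1)
    sti.buckets[j].append(obj)
    return i, j


# Two sorting time intervals of length L = 1 with 4 and 2 sorting buckets.
A = [make_sorting_interval(0.0, 1.0, 4), make_sorting_interval(1.0, 2.0, 2)]
first = insert(A, 0.0, 1.0, 0.6, "x")    # -> (0, 2)
second = insert(A, 0.0, 1.0, 1.75, "y")  # -> (1, 1)
```

Each insertion performs two divisions, two floor operations and one list append, independent of the number of objects already stored.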

1546:

1547: Using the above two lemmas, we can prove the following.

1548:

1549: \begin{theorem}

1550: \label{th:constanttime} The running time of the {\sc MaxCount}

1551: algorithm is $O(B)$ where $B$ is the number of buckets.

1552: \end{theorem}

1553:

1554: \begin{proof}

1555: Let $H$ be the set of buckets where each bucket $B_i$ contains the

1556: normalized trend function $F_i$. Let $Q_1$ and $Q_2$ be the query

1557: points and $[t^[,t^]]$ be the query time interval. (Lines 01-07):

1558: Calculating the time intervals takes $O(B)$ time because the cross

1559: times for each bucket can be calculated in constant time. (Line 08):

1560: By Lemmas~\ref{lem:TimeIntervalDistribution} and

1561: \ref{lem:BuscketSortConstantTimeInsertion}, we have an approximately

1562: even distribution of time interval objects within the sorting

1563: buckets where we can insert an object in constant time. This result

1564: fulfills the requirements of the \textsc{BucketSort},

1565: \cite{IntroToAlgorithms}, which allows the intervals to be sorted in

1566: $O(B)$ time. (Lines 09-12): Calculate the {\sc MaxCount} and time

1567: for each time interval in constant time using

1568: Lemma~\ref{lem:ConstRunningTimeForTimeInterval}. These lines take

1569: $O(B)$ time because there are $O(B)$ time intervals. Finding the

1570: global {\sc MaxCount} and time requires retaining the maximum time

1571: and count at line 11. Returning the {\sc MaxCount} and time takes

1572: $O(1)$ time. Thus, the running time is given by $O(B) + O(B) + O(B)

1573: + O(1) = O(B)$.

1574: \end{proof}

1575:

1576: \subsection{An Exact {\sc MaxCount} Algorithm}\label{sec:ExactMaxCount}

1577:

1578: The {\sc ExactMaxCount} algorithm below finds the exact {\sc MaxCount}

1579: values. It is easy to see that the running time is given by:

1580: \begin{equation}\label{eq:ExactRunningTime}

1581:     O(N) + O(n \log n)

1582: \end{equation}

1583: where $N$ is the number of points in the database and $n$ represents

1584: the result size of the query.

1585:

1586: %Considering a tree structure to store points to reduce the first

1587: %term in (\ref{eq:ExactRunningTime}) may result in negligible

1588: %benefits since we expect to examine a significant number of the

1589: %points contained in the database. Even so the worst case running

1590: %time remains $O(N) + O(n \log n)$.

1591:

1592: It is possible to slightly improve the algorithm below. First,

1593: divide the index space into $k$ subspaces and maintain separate

1594: partial databases for each. Assign processes on individual systems

1595: to each database to calculate the {\sc MaxCount} query and return

1596: the time intervals to a central process. Merging the time interval

1597: lists into a global time interval list saves time on the sorting

1598: part of the algorithm. The running time for each of $k$ partial

1599: databases would be close to $O(\frac{n}{k} \log \frac{n}{k})$. This

1600: result is an approximate value because we do not guarantee an even

1601: split between partial databases. Placing buckets for each partial

1602: database in a \textsc{Tree} structure may be reasonable and could

1603: cut down the average running time to $O(\log N + n \log n/k)$.

1604: %For small enough data

1605: %subsets $\log n/k$ may be considered a constant resulting in an

1606: %average running time of $\max (\log N, n)$.

1607: Implementation and analysis of this particular approach are left as

1608: future work.

1609:

1610: \progstart \vspace{-18pt}

1611: \begin{tabbing}

1612: \hspace*{.5in}\=\hspace*{.5in}\=\hspace*{.5in}\=\hspace*{.5in}\=

1613: \kill

1614: {\bf  {\sc ExactMaxCount}$(D, Q_1, Q_2, t^[, t^])$} \\

1615: {\bf input:}  \>\> $D$ is the database of points. The query is made up of a \\

1616:               \>\> hyper-rectangle $Q$ defined by points $Q_1$ and $Q_2$ and the time\\

1617:               \>\> interval $T=[t^[, t^]]$ \\

1618: {\bf output:} \>\> The exact {\sc MaxCount} and time at which it occurs. \\ \\

1619: 01.  \>$Times \leftarrow \emptyset$ //of \emph{CrossTime} objects      \` $O(1)$\\

1620: 02.  \>\textbf{for each} \emph{point} $p_i \in D$                      \` $O(N)$\\

1621: 03.  \>   \> \textbf{if} $p_i \in Q$ during $T$                        \` $O(1)$\\

1622: 04.  \>   \>   \> $EntryTime \leftarrow CalculateEntryTime(p_i,Q,T)$   \` $O(1)$\\

1623: 05.  \>   \>   \> $ExitTime \leftarrow CalculateExitTime(p_i,Q,T)$     \` $O(1)$\\

1624: 06.  \>   \>   \> \textbf{if} $EntryTime \in Times$                    \` $O(1)$\\

1625: 07.  \>   \>   \>   \> $Times.$\textsc{get}$(EntryTime).Count$++       \` $O(1)$\\

1626: 08.  \>   \>   \> \textbf{else} \\

1627: 09.  \>   \>   \>   \> $Times.$\textsc{add}$(new CrossTime(EntryTime))$\` $O(1)$\\

1628: 10.  \>   \>   \> \textbf{end if} \\

1629: 11.  \>   \>   \> \textbf{if} $ExitTime \in Times$                     \` $O(1)$\\

1630: 12.  \>   \>   \>   \> $Times.$\textsc{get}$(ExitTime).Count$-\,-      \` $O(1)$\\

1631: 13.  \>   \>   \> \textbf{else} \\

1632: 14.  \>   \>   \>   \> $Times.$\textsc{add}$(new CrossTime(ExitTime))$ \` $O(1)$\\

1633: 15.  \>   \>   \> \textbf{end if} \\

1634: 16.  \>\textbf{end for} \\

1635: 17.  \>\textsc{Sort}$(Times)$                                          \` $O(n \log n)$\\

1636: 18.  \>\textsc{traverse}$(Times,time,Max\textrm{-}Count)$ //tracking time\` $O(N)$\\

1637: \> \> \> \> \qquad \qquad \qquad \quad //and {\sc MaxCount} \\

1638: 19.  \>\textbf{return} (time,{\sc MaxCount}) \` $O(1)$

1639: \end{tabbing}

1640: \progend
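The counting and traversal steps of the pseudocode above can be sketched in Python. This is a simplified illustration under stated assumptions, not a full implementation: the entry and exit times of each point (Lines 03-05) are taken as given and already clipped to $T$, a newly created cross time is assumed to start with a count of $+1$ for an entry and $-1$ for an exit, and each stay is treated as the half-open interval from entry to exit. The function name \texttt{exact\_max\_count} is hypothetical.

```python
from collections import defaultdict


def exact_max_count(stays):
    """stays: (entry_time, exit_time) pairs, one per point that is inside Q
    at some time during T. Returns (time, max_count)."""
    delta = defaultdict(int)        # plays the role of Times / CrossTime.Count
    for entry, exit_ in stays:
        delta[entry] += 1           # lines 06-10: one more point from entry on
        delta[exit_] -= 1           # lines 11-15: one fewer point from exit on
    best_time, best, running = None, 0, 0
    for t in sorted(delta):         # line 17: sort the cross times
        running += delta[t]         # line 18: running count from time t on
        if running > best:
            best_time, best = t, running
    return best_time, best


# Four points; three of them are inside Q simultaneously from t = 2 to t = 3.
result = exact_max_count([(0, 5), (1, 3), (2, 4), (6, 7)])
```

For the sample input the sketch returns time $2$ and a count of $3$; for an empty input it returns no time and a count of $0$.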

1641:

1642:

1643: \section{Threshold Operators}\label{sec:ThresholdOperators}%includes CountRange

1644:

1645: \progstart \vspace{-18pt}

1646: \begin{tabbing}

1647: \hspace*{.35in}\=\hspace*{.3in}\=\hspace*{.3in}\=\hspace*{.3in}\=

1648: \kill

1649: {\bf  {\sc ThresholdRange}$(H, Q_1, Q_2, t^[, t^], M)$} \\

1650: {\bf input:}  \>\> A set of buckets $H$ built by the index structure presented, \\

1651:               \>\> query points $Q_1(t)$ and $Q_2(t)$, a query time interval $[t^[, t^]]$, \\

1652:               \>\> and $M$ is the threshold value \\

1653: {\bf output:} \>\> The estimated set of time intervals where $R$ contains more \\

1654:               \>\> than $M$ points.\\

1655: \\

1656: 01 - 08 are the same as the {\sc MaxCount} algorithm.\\

1657: 09.  \> $TimeIntervals \leftarrow \emptyset$                                        \`$O(1)$ \\

1658: 10.  \> \textbf{for each} $TimeInterval \in TimePartitionOrder$                     \`$O(B)$ \\

1659: 11.  \>     \> $CMaxCount \leftarrow \textsc{calculate}(\textsc{MaxCount}, MaxTime, TimeInterval)$\`$O(1)$ \\

1660: 12.  \>     \> \textbf{if} $CMaxCount > M$                                          \`$O(1)$ \\

1661: 13.  \>     \>  \> $TimeIntervals \leftarrow TimeIntervals \bigcup TimeInterval$    \`$O(1)$ \\

1662: 14.  \>     \>  \textbf{end if}                                                              \\

1663: 15.  \> \textbf{end for}                                                                     \\

1664: 16.  \> $\textsc{Merge}(TimeIntervals)$                                             \`$O(B)$ \\

1665: 17.  \> \textbf{return} $TimeIntervals$

1666: \end{tabbing}

1667: \progend

1668:

1669: The {\sc ThresholdRange} algorithm shown above and described in

1670: Definition~\ref{def:ThresholdRange} relates to {\sc MaxCount} in the

1671: way we calculate the aggregation. We maintain a running count to

1672: find time intervals that exceed the threshold value $M$. If we set

1673: the threshold value near the {\sc MaxCount} value ($M \rightarrow$

1674: {\sc MaxCount}), {\sc ThresholdRange} finds a small interval

1675: containing the {\sc MaxCount}. We demonstrate this in the

1676: experimental results,

1677: Section~\ref{sec:ExperimentalResults}.\smallskip

1678:

1679: The {\sc ThresholdRange} algorithm is the same as {\sc MaxCount} up

1680: to Line 08, and then collects different information from each

1681: $TimeInterval$ starting in Line 10. This leads to the following

1682: theorem.

1683:

1684: \begin{theorem}

1685: \label{th:ThresholdConstantTime}%

1686: The estimated {\sc ThresholdRange} query runs in $O(B)$ time.

1687: \end{theorem}

1688: \begin{proof}

1689: The {\sc ThresholdRange} algorithm differs from the {\sc MaxCount}

1690: algorithm only in lines 09-17. Lines 11-14 run in $O(1)$ time. Line

1691: 10 executes lines 11-14 $O(B)$ times. In line 16,

1692: $\textsc{Merge}(TimeIntervals)$ is a linear walk of the time

1693: intervals that joins adjacent time intervals $T_a$ and $T_b$ when

1694: $T_a \bigcup T_b$ would form a continuous time interval. The

1695: calculation is trivially $O(1)$ time for joining the adjacent

1696: intervals. Hence, we conclude by Theorem~\ref{th:constanttime} that

1697: the {\sc ThresholdRange} runs in $O(B)$ time.

1698: \end{proof}
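The filtering and merging steps can be sketched in a single pass. In the Python fragment below, which is an illustration rather than the algorithm as stated, the per-interval estimates are taken as given, the threshold test of Lines 10-15 and the \textsc{Merge} of Line 16 are fused into one loop, and the function name \texttt{threshold\_range} and the sample data are hypothetical.

```python
def threshold_range(interval_counts, M):
    """interval_counts: time-ordered ((lower, upper), estimate) pairs, one per
    index time interval. Returns the merged intervals whose estimate exceeds M."""
    merged = []
    for (lo, hi), count in interval_counts:
        if count <= M:
            continue                          # not above the threshold
        if merged and merged[-1][1] == lo:
            merged[-1] = (merged[-1][0], hi)  # adjacent: extend the last interval
        else:
            merged.append((lo, hi))           # gap: start a new interval
    return merged


sample = [((0, 1), 2), ((1, 2), 5), ((2, 3), 6), ((3, 4), 1), ((4, 5), 7)]
above = threshold_range(sample, 4)   # -> [(1, 3), (4, 5)]
```

Each index time interval is visited once and joined to its predecessor in constant time, which matches the linear walk described in the proof.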

1699:

1700: \subsection{Threshold: Sum, Count and Average}

1701:

1702: We give the following three operators based on {\sc ThresholdRange}

1703: and conclude that none of the changes to the algorithm affect the

1704: running time of the {\sc ThresholdRange} algorithm.\medskip

1705:

1706: \noindent {\sc ThresholdCount}: \\

1707: By adding a line between 14 and 15 in the {\sc ThresholdRange}

1708: algorithm that counts the merged time intervals, we can return the

1709: count of time intervals during the query time interval where

1710: congestion occurs. This count of time intervals gives a measure of

1711: variation in congestion. That is, if we have lots of time intervals,

1712: we expect that we have a large number of pockets of congestion.

1713: Since {\sc ThresholdCount} does not give information relative to the

1714: entire time interval, it may need to be examined in light of the

1715: total time above the threshold.\medskip

1716:

1717: \noindent {\sc ThresholdSum}: \\

1718: By summing the times instead of using the $\bigcup$ operator in line

1719: 13 of the {\sc ThresholdRange} algorithm, we can return the total

1720: congestion time during the query time interval. This total gives a

1721: measure of the severity of congestion that may be compared to the

1722: length of query time.\medskip

1723:

1724: \noindent {\sc ThresholdAverage}: \\

1725: By adding a line between lines 14 and 15 in the {\sc ThresholdRange}

1726: algorithm that finds the average length of the merged time intervals, we

1727: can return the average length of time each congestion will last.

1728: This average gives a different measure of the severity of each

1729: congestion.\medskip
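Given the merged time intervals returned by {\sc ThresholdRange}, the three operators reduce to simple statistics. The following Python sketch computes them after merging rather than inside the loop, which is a simplification; the function name \texttt{threshold\_stats} is hypothetical and the average of an empty result is taken to be $0$ by assumption.

```python
def threshold_stats(merged):
    """merged: disjoint (lower, upper) intervals above the threshold.
    Returns (ThresholdCount, ThresholdSum, ThresholdAverage)."""
    count = len(merged)                       # number of congestion intervals
    total = sum(hi - lo for lo, hi in merged) # total time above the threshold
    average = total / count if count else 0.0 # mean length of one congestion
    return count, total, average
```

For the merged intervals $[1,3]$ and $[4,5]$ this gives a count of $2$, a total congestion time of $3$ and an average congestion length of $1.5$.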

1730:

1731: %We could calculate other operators such as the standard deviation of

1732: %the time intervals or many other complicated statistics on the

1733: %distribution of time intervals. However the five operators we define

1734: %mirror the standard aggregation operators available in relational

1735: %databases.

1736:

1737: %???Check this section!!!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1738:

1739: \subsection{Count Range Algorithm}

1740:

1741: The {\sc CountRange} algorithm is an adaptation of {\sc MaxCount} in

1742: that it is the {\sc Count} portion of the {\sc MaxCount} query.

1743: Using the equations for the cases described in

1744: Figure~\ref{fig:cases}, we calculate the {\sc CountRange} as

1745: follows:

1746:

1747: %%%%EDIT

1748:

1749: \begin{figure}[h]

1750: \begin{minipage}{0.49\textwidth}

1751:     \centering

1752:     \psfrag{Q1}{$Q_1$}

1753:     \psfrag{Q2}{$Q_2$}

1754:     \psfrag{lq2t1}{$l_{Q_2,t^[}$}

1755:     \psfrag{lq2t2}{$l_{Q_2,t^]}$}

1756:     \psfrag{lq1t1}{$l_{Q_1,t^[}$}

1757:     \psfrag{q1t2}{$l_{Q_1,t^]}$}

1758:     \psfrag{x0lvxl}{$(x_{0,l},v_{x,l})$}

1759:     \psfrag{x0uvxu}{$(x_{0,u},v_{x,u})$}

1760:     \includegraphics[width=.8\textwidth]{figs/CntRngNorm1.eps}\\

1761:     \caption{{\sc CountRange} $Q_1$ at $t^{]}$ to $Q_2$ at $t^{[}$.}

1762:     \label{fig:CountRangeNormal1}

1763: \end{minipage}

1764: \begin{minipage}{0.49\textwidth}

1765: %\end{figure}

1766: %\begin{figure}[h]

1767:     \noindent

1768:     \psfrag{Q1}{$Q_1$}

1769:     \psfrag{Q2}{$Q_2$}

1770:     \psfrag{lq2t1}{$l_{Q_2,t^[}$}

1771:     \psfrag{lq2t2}{$l_{Q_2,t^]}$}

1772:     \psfrag{lq1t1}{$l_{Q_1,t^[}$}

1773:     \psfrag{lq1t2}{$l_{Q_1,t^]}$}

1774:     \psfrag{x0lvxl}{$(x_{0,l},v_{x,l})$}

1775:     \psfrag{x0uvxu}{$(x_{0,u},v_{x,u})$}

1776:     \includegraphics[width=.8\textwidth]{figs/CntRngNorm2.eps}\\

1777:     \caption{{\sc CountRange} $Q_1$ at $t^{[}$ to $Q_2$ at $t^{]}$.}

1778:     \label{fig:countRangeNormal2}

1779: \end{minipage}

1780: \end{figure}

1781:

1782:

1783: For each bucket we determine if the bucket is completely in or

1784: completely out of the query space. First we find the beginning and

1785: ending time intervals. For each time interval, we get the associated

1786: function $\triangle p$ given in Equation~(\ref{eq:percentofbucket2})

1787: and its components. The components of $\triangle p$ given in

1788: Equation~(\ref{eq:percentofbucket}) define the area above a line

1789: through $Q_1$ and $Q_2$ at times $t^[$ and $t^]$.

1790: Figures~\ref{fig:CountRangeNormal1} and \ref{fig:countRangeNormal2}

1791: show these four lines. Figure~\ref{fig:CountRangeNormal1} shows the

1792: shaded area defined by:

1793: \begin{equation}\label{eq:pleft}

1794:     \triangle \overleftarrow{p} = p_{Q_2,t^[} - p_{Q_1,t^]}.

1795: \end{equation}

1796: Figure~\ref{fig:countRangeNormal2} shows the shaded area:

1797: \begin{equation}\label{eq:pright}

1798:     \triangle \overrightarrow{p} = p_{Q_2,t^]} - p_{Q_1,t^[}.

1799: \end{equation}

1800: If $\triangle \overleftarrow{p}$ or $\triangle \overrightarrow{p}$

1801: for bucket $i$ is equal to the count of the bucket, then bucket $i$

1802: is completely contained in the query. If $\triangle

1803: \overleftarrow{p}$ and $\triangle \overrightarrow{p}$ for bucket $i$

1804: are equal to $0$, then bucket $i$ is not contained in the query. If

1805: neither of these is true, we approximate the count for bucket $i$ as

1806: the $\max (\triangle \overleftarrow{p}, \triangle

1807: \overrightarrow{p})$. That is, we calculate the number of points in

1808: bucket $i$ that contribute to the {\sc CountRange} as:

1809: \begin{equation}\label{eq:CountRangei}

1810:     count_i = \left\{

1811:     \begin{array}{llr}

1812:         b_i & \textrm{ if } & \triangle \overleftarrow{p} = b_i \vee \triangle \overrightarrow{p} = b_i \\

1813:         0   & \textrm{ if } & \triangle \overleftarrow{p}=\triangle \overrightarrow{p} = 0 \\

1814:         \max (\triangle \overleftarrow{p}, \triangle \overrightarrow{p}) & & \textrm{ Otherwise}

1815:     \end{array}

1816:     \right.

1817: \end{equation}

1818: This calculation requires that we keep the single dimension

1819: equations for $Q_1$ and $Q_2$ available and not discard them after

1820: finding $\triangle p$ (see Equation~(\ref{eq:percentofbucket2})).

1821:

1822: Hence, we have the following algorithm for {\sc CountRange}:

1823:

1824: \progstart\vspace{-18pt}

1825: \begin{tabbing}

1826: \hspace*{.5in}\=\hspace*{.5in}\=\hspace*{.5in}\=\hspace*{.5in}\=

1827: \kill

1828: {\bf  {\sc CountRange}$(H, Q_1, Q_2, t^[,t^])$} \\

1829: {\bf input:}  \>\> A set of buckets $H$ built by the index structure presented, \\

1830:               \>\> query points $Q_1(t)$ and $Q_2(t)$ and a query time interval $(t^[,t^])$. \\

1831: {\bf output:} \>\>the estimated {\sc CountRange}. \\

1832: \\

1833: 1.\>  $Count \leftarrow 0$                                             \` $O(1)$ \\

1834: 2.\>  \textbf{for each} \emph{bucket} $B_i \in H$                                      \` $O(B)$ \\

1835: 3.\>  \>  \textsc{Calculate}($\triangle \overleftarrow{p}, \triangle \overrightarrow{p}$) //using Equations~(\ref{eq:pleft})-(\ref{eq:pright}) \` $O(1)$ \\

1836: 4.\>  \>  \textsc{Calculate}($count_i$) //using Equation~(\ref{eq:CountRangei})   \` $O(1)$ \\

1837: 5.\>  \>  $Count \leftarrow Count + count_i$                                           \` $O(1)$ \\

1838: 6.\>  \textbf{end for} \\

1839: 7.\>  \textbf{return} $Count$ \` $O(1)$

1840: \end{tabbing}

1841: \progend
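The per-bucket rule of Equation~(\ref{eq:CountRangei}) and the summation loop above can be sketched in Python. This is an illustration under stated assumptions: the values of $\triangle \overleftarrow{p}$ and $\triangle \overrightarrow{p}$ are taken as already computed and expressed as counts, exact equality is used where a tolerance may be needed in floating point, and the function names are hypothetical.

```python
def count_i(dp_left, dp_right, b_i):
    """Contribution of one bucket with count b_i to CountRange."""
    if dp_left == b_i or dp_right == b_i:
        return b_i                    # bucket completely contained in the query
    if dp_left == 0 and dp_right == 0:
        return 0                      # bucket not contained in the query
    return max(dp_left, dp_right)     # partial overlap


def count_range(buckets):
    """buckets: iterable of (dp_left, dp_right, b_i) triples."""
    return sum(count_i(dl, dr, b) for dl, dr, b in buckets)


# One full bucket, one empty bucket and one partially covered bucket.
estimate = count_range([(10, 4, 10), (0, 0, 7), (2.5, 3.5, 8)])   # -> 13.5
```

When both values lie between $0$ and $b_i$, all three cases agree with taking the maximum of the two values; the cases are kept separate here only to mirror the equation.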

1842:

1843: \begin{theorem}

1844: \label{th:RangeConstantTime} The {\sc CountRange} query runs in

1845: $O(B)$ time.

1846: \end{theorem}

1847: \begin{proof}

1848: Consider two different data structures for our buckets:

1849: \textsc{HashTables} and \textsc{R-trees}. In the case of indexing

1850: using an \textsc{R-tree}, the worst case requires that we examine

1851: all buckets used in generating {\sc CountRange}. It is possible that

1852: this list could include all $B$ buckets giving a worst case of

1853: $O(B)$. In the case of using a \textsc{HashTable}, we must examine

1854: all $B$ buckets. By Lemma \ref{lem:ConstRunningTimeForBucket}, and

1855: because Equations~(\ref{eq:BucketGeneralForm}) and

1856: (\ref{eq:CountRangei}) are calculated in constant time, each bucket

1857: can be examined to determine the count that contributes to the {\sc

1858: CountRange} query in constant time. Therefore, the algorithm runs in

1859: $O(B)$ time.

1860: \end{proof}

1861:

1862: We note that {\sc CountRange} is a simplification of the {\sc

1863: MaxCount} operator in that we do not examine every time interval.

1864: Further we have a slightly different form of $\triangle p$ from

1865: Equation~(\ref{eq:percentofbucket2}) to find the count.

1866:

1867:

1868: \section{Experimental Results}\label{sec:ExperimentalResults}

1869:

1870: We collected data from over $7500$ queries that were selected from a

1871: set of randomly generated queries. The selection process weeded out

1872: most similar queries and kept a set that represents narrow queries,

1873: wide queries, near corner or edge queries, and queries outside the

1874: space contained in the database. Throughout our experiments, we did

1875: not see significant accuracy fluctuation due to any of these types

1876: of queries.

1877:

1878: Each experimental run consists of running all of the queries at

1879: several different decreasing bucket sizes on a single data set. We

1880: made experimental runs against data sets ranging from 10,000 points

1881: to 1,500,000 points\footnote{Threshold aggregation runs go only to 1

1882: million points at which we already achieve acceptable error.}.

1883:

1884: In the following experimental analysis, we measure the percentage

1885: error of the estimation algorithm relative to the exact-count

1886: algorithm as follows:

1887: \begin{equation}\label{eq:RelativeError}

1888:     \mathit{Error}_{\mathit{Relative}} = \frac{|\text{Exact Operator}-\text{Estimated Operator}|}{\text{Exact Operator}}

1889: \end{equation}

1890: Equation (\ref{eq:RelativeError}) provides a useful measure if the

1891: query returns a reasonable number of points. Queries that return a

1892: small number of points indicate that we should use the exact method.

1893:
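For a concrete sense of scale, consider two hypothetical queries (the counts below are illustrative only and are not taken from our experiments). If the exact operator returns $1000$ and the estimate is $960$, Equation~(\ref{eq:RelativeError}) gives
\begin{equation*}
    \frac{|1000-960|}{1000} = 0.04,
\end{equation*}
that is, a $4\%$ error. If instead the exact operator returns $4$ and the estimate is $6$, an absolute difference of only $2$ points gives $|4-6|/4 = 0.5$, a $50\%$ error. This is why the relative measure is informative only for queries that return a reasonable number of points.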

1894: For {\sc ThresholdRange}, we measure the percentage of intervals

1895: given by the accurate algorithm not covered by the estimation

1896: algorithm using the operator {\sc UC} for uncovered. That is, {\sc

1897: UC}$(a,b)$ returns the sum of the lengths of intervals in $a$ not

1898: covered by intervals in $b$. We divide the result by the accurate

1899: {\sc ThresholdSum} to determine the {\sc ThresholdRange error}:

1900: \begin{equation}\label{eq:ThresholdRangeError}

1901:     \texttt{error} =

1902:     \frac{\textsc{UC}\left(\textit{Ext. }\textsc{ThresholdRange}, \textit{Est. }\textsc{ThresholdRange}\right)}{\textit{Ext. }\textsc{ThresholdSum}}

1903: \end{equation}

1904: We also measure the percentage of intervals given by the estimate

1905: algorithm not covered by the exact algorithm. We divide the result

1906: by the estimated {\sc ThresholdSum} to determine the {\sc

1907: ThresholdRange excess-error}.

1908: \begin{equation}\label{eq:ThresholdRangeExcessError}

1909:     \texttt{excess-error} = \frac{\textsc{UC}\left(\textit{Est. }\textsc{ThresholdRange}, \textit{Ext. }\textsc{ThresholdRange}\right)}{\textit{Est. }\textsc{ThresholdSum}}

1910: \end{equation}

1911:
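As an illustration of the two measures, suppose (hypothetically) that the exact {\sc ThresholdRange} is the single interval $[0,10]$ and the estimated {\sc ThresholdRange} is the single interval $[2,13]$, so that the exact and estimated {\sc ThresholdSum} values are $10$ and $11$, respectively. The part of the exact interval not covered by the estimate is $[0,2)$, with length $2$, and the part of the estimate not covered by the exact interval is $(10,13]$, with length $3$. Then
\begin{equation*}
    \texttt{error} = \frac{2}{10} = 0.2, \qquad \texttt{excess-error} = \frac{3}{11} \approx 0.27 .
\end{equation*}
These intervals are invented for the example and serve only to show how Equations~(\ref{eq:ThresholdRangeError}) and (\ref{eq:ThresholdRangeExcessError}) are evaluated.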

1912: We performed all the data runs on an Athlon 2000 with 1 GB of RAM.

1913: During each of the queries the program does not contact the server

1914: tier and, thus, minimizes the impact of running a server on the same

1915: computer. The program pre-loads all data into data structures so

1916: that even the exact algorithms do not contact the server tier.

1917:

1918:

1919: \subsection{Data Generation}

1920:

1921: %\vspace{-.3in}

1922: \begin{figure}[h]

1923: \centering

1924: \begin{minipage}{6in}

1925: \centerline{ \hspace{-1em}

1926: \mbox{\includegraphics[width=2in]{figs/C10P10K_11.eps}}\hspace{-0.1in}

1927: \mbox{\includegraphics[width=2in]{figs/C10P10K_21.eps}}\hspace{-0.1in}

1928: \mbox{\includegraphics[width=2in]{figs/C10P10K_31.eps}}}

1929: \end{minipage}

1930: %\vspace{-.3in}

1931: \caption{$X$-view, $Y$-view and $Z$-view of sample data.}

1932: \label{fig:sampledata}

1933: \end{figure}

1934:

1935: Data for the experiments was randomly generated around several

1936: cluster centers. The $i^{th}$ point generated for the database is

1937: located near a randomly selected cluster at a distance between $0$

1938: and $d$, where $d$ is proportional to $i$. This method is similar to

1939: the Ziggurat~\citep{marsaglia2000zmg} method of generating Gaussian

1940: (or normal) distributions used in the

1941: GSTD~\citep{theodoridis1999gsd} and

1942: G-TERD~\citep{tzouramanis2002gte} spatiotemporal data

1943: generators~\citep{nascimento2003sar}. However, our method does not

1944: generate strictly Gaussian distributions since the distributions may

1945: stretch and compress along an axis. Our goal was to generate a

1946: cluster that represents a source location and velocity that has most

1947: elements starting near a center point and decreasing as one moves to

1948: a boundary for the cluster. This method models source regions where

1949: the objects all head about the same direction. A secondary goal was

1950: to make certain that clusters were random in size and shape. The

1951: program is also capable of approximating a Zipf distribution used in

1952: \citep{CC02,Revesz20031,TSP03}. However, a single Zipf distribution

1953: does not test the adaptability of our algorithm well. That is, our

1954: algorithm is capable of modeling a Zipf distribution, and as such we

1955: could use a single bucket. Figure~\ref{fig:sampledata} shows a

1956: sample of a data set with points projected onto the three views. The

1957: clusters look even more random, because they can overlay one

1958: another. When one looks at these, they nearly resemble the lights of

1959: a city from the air.

1960:
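One way to write down the generation step is sketched below. The symbols $c_j$, $u_i$, $r_i$, and $\alpha$ are introduced here only for illustration and are assumptions of this sketch. If $c_j$ is the randomly selected cluster center and $u_i$ is a unit direction vector, the $i^{th}$ point can be written as
\begin{equation*}
    p_i = c_j + r_i u_i, \qquad 0 \le r_i \le d_i, \qquad d_i = \alpha i,
\end{equation*}
where $\alpha > 0$ is the constant of proportionality. Every point may fall close to $c_j$, but only points with a large index $i$ can fall far from it, which is consistent with a density that is highest near the center and decreases toward the cluster boundary.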

1961: Along with a single Zipf distribution, we also note that a randomly

1962: generated uniform-distribution is not a good distribution to use for

1963: these types of experiments. Uniform distributions do not test the

1964: ability of the algorithm to adapt. In fact from earlier experiments

1965: in~\citep{Anderson20061} we have found that using such a

1966: distribution gives great (though meaningless) results. The problem

1967: reduces to a system capable of modeling (and willing to model) a uniform

1968: distribution finding a nearly perfect uniform distribution to model.

1969: Hence these results are neither realistic, nor meaningful.

1970:

1971: \subsection{Parameter Effects}

1972:

1973: The index space ranges from $0$ to $100$ in each dimension. The {\bf

1974: number of points} in the different data sets ranges from $10{,}000$ to

1975: $1{,}500{,}000$. The following parameters were used in creating the

1976: index and finding the {\sc MaxCount}. \medskip

1977:

1978: \noindent{\bf Size of Buckets:} The size of the buckets determines

1979: the number of possible buckets in the index. In the experiments,

1980: buckets divide the space up such that there are $5$ to $20$

1981: divisions in each dimension\footnote{Some {\sc MaxCount} runs

1982: included up to 40 divisions increasing accuracy, but not enough to

1983: warrant the extra running time.}. These divisions equate to bucket

1984: sizes ranging from $5$ to $20$ units wide in each dimension.

1985: Relative to our previous work \citep{Anderson20061}, this algorithm

1986: puts much more space into each bucket creating bigger buckets.

1987: \medskip

1988:

1989: \noindent{\bf Query Location:} Locating the query near the lower or

1990: upper corners affects relative accuracy because the query returns

1991: very few points. Queries in this region are not interesting because

1992: they rarely involve many points and represent a query region that

1993: moves away from points in the database or barely moves at all. The

1994: small number of points returned indicates use of the exact

1995: algorithms.

1996: \medskip

1997:

1998: \noindent{\bf Query Types:} In~\citep{Anderson20061}, we considered

1999: queries with several different characteristics: dense, sparse, and

2000: Euclidean distance as it related to bucket size. By modeling the

2001: skew in buckets, we minimized the effect of these characteristics to

2002: the point that they did not impact the query error. Queries where

2003: the distance between the query points was small appeared to do as

2004: well as wider queries {\em providing they returned a reasonable

2005: number of points}. This result is a clear improvement over previous

2006: work that assumed uniform density within a bucket.\medskip

2007:

2008: \noindent{\bf Cluster Points:} Index space saturation determines the

2009: number of buckets necessary for the index. The number of cluster

2010: points does not appear to affect error as much as the space

2011: saturation. Further, we do not consider a larger number of cluster

2012: points reasonable since the index space approaches a uniform

2013: distribution as the number of cluster points increases. Gaps

2014: introduce difficult areas to model when they are not uniform. Once

2015: again, uniform distributions are not useful for these experiments. In

2016: our experiments cluster points number between 10 and 50. \medskip

2017:

2018: \noindent{\bf Histogram Divisions:} Increasing histogram divisions

2019: to $s>5$ had no effect on the accuracy. This result is not

2020: unexpected because histograms are used to define a trend function

2021: relative to trend functions on other axes. Increasing the histogram

2022: divisions has a tendency to flatten the lines. However,

2023: normalization flattens the trend function while maintaining the

2024: relationships between trends and hence this behavior is easily

2025: explained. Thus, increasing histogram divisions only increases the

2026: running time without increasing accuracy.\medskip

2027:

2028: \noindent{\bf Threshold Value:} The threshold value determines the

2029: accuracy when set to low values compared to the number of points in

2030: the database. As expected, these extreme point values produce

2031: accurate estimations. High values also follow this trend.\medskip

2032:

2033:

2034: \noindent{\bf Time Endpoints:} When dealing with either small time

2035: endpoints or small buckets, the method is susceptible to rounding

2036: error. In particular, Equation~(\ref{eq:BucketGeneralForm}) contains

2037: both $t^6$ and $\frac{1}{t^6}$ terms. For very small values, on the

2038: order of $1 \times 10^{-54}$ for 64-bit doubles, these calculations

2039: are extremely sensitive and care must be given to guard against

2040: rounding error. Those errors showed in two ways. First, by a direct

2041: warning programmed into the solution, and second, by a series of

2042: fairly stable time values for the {\sc MaxCount} followed by

2043: unstable variations when increasing the number of buckets. At some

2044: point, smaller bucket sizes increase the likelihood of errors in

2045: both time and count values. Also smaller buckets contain fewer

2046: points, which impacts the size of the constants in

2047: Equation~(\ref{eq:BucketGeneralForm}). Hence, as the bucket size

2048: becomes smaller in successive runs, the existence of instability in

2049: the time values after a series of stable values predicts that an

2050: accurate {\sc MaxCount} may be found in the previous larger bucket

2051: size. {\em Throughout our experiments, this condition was an

2052: excellent predictor of an accurate {\sc MaxCount}}.\medskip

2053:
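To illustrate the scale of the problem, take $t = 10^{-9}$ as a hypothetical time value, chosen only to match the order of magnitude mentioned above. Then
\begin{equation*}
    t^6 = 10^{-54}, \qquad \frac{1}{t^6} = 10^{54}.
\end{equation*}
A 64-bit double carries about 16 significant decimal digits (relative precision $2^{-52} \approx 2.2 \times 10^{-16}$). Hence, when a term of size $10^{54}$ is added to a term of magnitude roughly $10^{37}$ or less, the smaller term is lost entirely to rounding.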

2054: The experiments demonstrated that 6-dimensional space compounds the

2055: problem when creating small buckets. Creating an index with unit

2056: buckets would result in the possibility of having $1\times 10^{12}$

2057: buckets. Clearly this number is unrealistic for common moving object

2058: applications where we may be dealing with millions of objects. In

2059: practice the number of buckets needed to reach acceptable error

2060: levels was between $78{,}000$ and $227{,}000$ buckets. These numbers

2061: reflect the ability to reach error levels under $5\%$ and were

2062: roughly related to the saturation of the space by the points. It

2063: should be clear that a higher saturation of the space by points

2064: would require a larger number of buckets.

2065: Figure~\ref{fig:BucketsToPointsRatio} shows that we had a roughly

2066: linear increase in the number of buckets for an exponential increase

2067: in the space. This pleasant surprise indicates that for unsaturated

2068: data sets, the exponential explosion of space is manageable.

2069:
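The growth in the number of possible buckets follows directly from the dimension. With $k$ divisions per dimension in a 6-dimensional index space, there are $k^6$ possible buckets (the symbol $k$ is used here only for this calculation):
\begin{equation*}
    5^6 = 15{,}625, \qquad 20^6 = 6.4 \times 10^{7}, \qquad 100^6 = 10^{12},
\end{equation*}
where $k = 100$ corresponds to unit buckets in an index space that ranges from $0$ to $100$ in each dimension.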

2070: \begin{figure}[htb]

2071:   \centering

2072:   \includegraphics[width=4.5in]{figs/Points2Buckets.eps}\\

2073:   \caption{Ratio of the number of buckets in the index to the width of the space measured in buckets.}

2074:   \label{fig:BucketsToPointsRatio}

2075: \end{figure}

2076:

2077: \subsection{Running Time Observations}

2078:

2079: Figure~\ref{fig:runningtime} shows the average ratio of the exact

2080: {\sc MaxCount} running time to the estimated {\sc MaxCount} running

2081: time as a function of the number of points in the database. This

2082: result shows a nearly exponential growth when comparing the values

2083: between 10,000 and 1,000,000. The leveling off occurs because the

2084: number of points returned by the queries of 1 million points nearly

2085: equals the number of points returned by the queries of 1.5 million

2086: points. This result precisely matches our running-time analysis of

2087: the exact and estimation algorithms.

2088:

2089: \begin{figure}[h]

2090:   \centering

2091:   \includegraphics[width=4.5in]{figs/results/CSpeedup.eps}\\%used to be runningtime.eps

2092:   \caption{Ratio of exact running time to estimated running time.}

2093:   \label{fig:runningtime}

2094: \end{figure}

2095:

2096: A natural question is when to use the exact versus the estimated

2097: methods. In runs with a small number of points that need to be

2098: processed, %returned by a {\sc MaxCount} query,

2099: the exact and estimation methods run about equally fast. However,

2100: when the result size reaches values greater than $40{,}000$ (our

2101: experiments returned sets as large as 331,491), the estimation

2102: algorithms run up to $35$ times faster than the exact algorithms.

2103: Further, we note that the error is less predictable at smaller

2104: result sizes. Hence for small databases or in queries that return

2105: small result sets, efficiency and accuracy both indicate using the

2106: exact method. However, for large data sets greater than or equal to

2107: 1 million points, the estimation method greatly outperforms the

2108: exact method.

2109:

2110: \subsection{Operator Observations}

2111:

2112: As expected, we noticed that each operator runs in about the same

2113: time as {\sc MaxCount}. Only error values seemed to be different

2114: when studying different types of aggregation (e.g., when studying

2115: overlap error in {\sc ThresholdRange} versus count error in {\sc

2116: MaxCount}). Nevertheless, we have similarities between the

2117: results. Almost all the figures in this section look like a view of

2118: mountains from a valley. That is what we expected to see and the

2119: lower and flatter the terrain the better. Buckets increase from back

2120: to front and point set sizes increase from left to right.

2121:

2122: \subsection{\sc MaxCount}

2123:

2124: Figure~\ref{fig:RelativeError} shows that increasing the number of

2125: buckets to the indicated values dramatically decreases the {\sc

2126: MaxCount} error. As the number of points increases we also see a

2127: decrease in the error. Note that for larger buckets (i.e., smaller

2128: values on the ``Buckets per Dimension'' axis), the error decreases

2129: at a slightly faster rate.

2130:

2131: \begin{figure}[h]

2132:   \centering

2133:   \includegraphics[width=5.5in]{figs/results/CMaxCount.eps}\\

2134:   \caption{{\sc MaxCount} error.}

2135:   \label{fig:RelativeError}

2136: \end{figure}

2137:

2138: The exact {\sc MaxCount} provided the values against which our

2139: estimation algorithm was tested for accuracy. Since the method does

2140: not rely on buckets, and has zero error, we note only that on

2141: queries with small result sizes, this method performs as well, or

2142: better than the estimation algorithm.

2143:

2144: \subsection{\sc ThresholdRange}

2145:

2146: \begin{figure}[h]

2147:   \centering

2148:   \includegraphics[width=4in]{figs/results/CTRE10.eps}\\

2149:   \caption{{\sc ThresholdRange} error, T=10.}

2150:   \label{fig:TRE10}

2151: \end{figure}

2152:

2153: \begin{figure}[h]

2154:   \centering

2155:   \includegraphics[width=4in]{figs/results/CTREE10.eps}\\

2156:   \caption{{\sc ThresholdRange} excess error, T=10.}

2157:   \label{fig:TREE10}

2158: \end{figure}

2159:

2160: Figures~\ref{fig:TRE10} and \ref{fig:TREE10} give the {\sc

2161: ThresholdRange} error and {\sc ThresholdRange} excess error

2162: respectively for $T=10$. {\sc ThresholdRange} error gives the

2163: percentage of the exact intervals not covered by the estimation

2164: value, and {\sc ThresholdRange} excess error gives the percentage of

2165: the estimation not covering the exact. These figures show that our

2166: method acts conservatively in covering more than is needed. However,

2167: at larger point-set sizes, we still achieve under 5\% error.

2168: Figure~\ref{fig:TRE10} shows 0\% error caused by the point count

2169: staying above the threshold of 10 in data sets containing more than 30,000 points.

2170: Figure~\ref{fig:TREE10} shows that we covered at least 10\% more

2171: time in the query time interval than needed until we reach larger

2172: point sets. Still, we showed improvement with more buckets.

2173:

2174: At $T=1000$, we see 0\% error until we reach point sets of 500,000

2175: and greater. Figure~\ref{fig:TRE1000} shows excellent results with

2176: buckets above 10. Also, Figure~\ref{fig:TREE1000} shows that the

2177: excess error drops to near 0\% as well.

2178:

2179: \begin{figure}

2180:   \centering

2181:   \includegraphics[width=4in]{figs/results/CTRE1000.eps}\\

2182:   \caption{{\sc ThresholdRange} error, T=1000.}

2183:   \label{fig:TRE1000}

2184: \end{figure}

2185:

2186: \begin{figure}

2187:   \centering

2188:   \includegraphics[width=4in]{figs/results/CTREE1000.eps}\\

2189:   \caption{{\sc ThresholdRange} excess error, T=1000.}

2190:   \label{fig:TREE1000}

2191: \end{figure}

2192:

2193: Figures~\ref{fig:TRE100000} and \ref{fig:TREE100000} show what

2194: happens when we find an interval near the {\sc MaxCount} value. The

2195: two figures show the consequences of the estimation intervals being

2196: offset from the exact intervals by small amounts. The error

2197: decreases with more buckets.

2198:

2199: \begin{figure}

2200:   \centering

2201:   \includegraphics[width=4in]{figs/results/CTRE100000.eps}\\

2202:   \caption{{\sc ThresholdRange} error, T=100000.}

2203:   \label{fig:TRE100000}

2204: \end{figure}

2205:

2206: \begin{figure}

2207:   \centering

2208:   \includegraphics[width=4in]{figs/results/CTREE100000.eps}\\

2209:   \caption{{\sc ThresholdRange} excess error, T=100000.}

2210:   \label{fig:TREE100000}

2211: \end{figure}

2212:

2213:

2214: \subsection{\sc ThresholdCount}

2215:

2216: This operator is the only operator that does not have relative error

2217: measurements. Instead we report the average number of intervals the

2218: estimation method differs from the exact method. On average, we

2219: differ by fewer than two intervals from the correct number.

2220:

2221: Figure~\ref{fig:TCE10} shows the average error at $T=10$ where the

2222: errors are small. Figure~\ref{fig:TCE1000} ($T=1000$) looks much

2223: worse, but in reality we are still below 2 intervals off. We also

2224: note that the estimation may split or combine an interval

2225: incorrectly when the intervals are very close together without

2226: greatly affecting the error of other operators. Given this

2227: possibility, the results are excellent.

2228:

2229: \begin{figure}[h]

2230:   \centering

2231:   \includegraphics[width=4in]{figs/results/CTCE10.eps}\\

2232:   \caption{{\sc ThresholdCount} error, T=10.}

2233:   \label{fig:TCE10}

2234: \end{figure}

2235:

2236: \begin{figure}[h]

2237:   \centering

2238:   \includegraphics[width=4in]{figs/results/CTCE100.eps}\\

2239:   \caption{{\sc ThresholdCount} error, T=100.}

2240:   \label{fig:TCE1000}

2241: \end{figure}

2242:

2243:

2244:

2245: \subsection{\sc ThresholdSum}

2246:

2247: {\sc ThresholdSum} gives the total time above the threshold $T$. As

2248: one can see in Figure~\ref{fig:TSE10}, at higher bucket counts we

2249: have excellent error rates at $T=10$. We did not always expect great

2250: results at this threshold level across all data sets, but {\sc

2251: ThresholdSum} gives this result consistently all the way across.

2252:

2253: \begin{figure}[h]

2254:   \centering

2255:   \includegraphics[width=4in]{figs/results/CTSE10.eps}\\

2256:   \caption{{\sc ThresholdSum} error, T=10.}

2257:   \label{fig:TSE10}

2258: \end{figure}

2259:

2260: We do note that when the threshold approaches {\sc MaxCount}, we see

2261: extremely good accuracy as shown in Figure~\ref{fig:TSE100000}.

2262:

2263: \begin{figure}[h]

2264:   \centering

2265:   \includegraphics[width=4in]{figs/results/CTSE100000.eps}\\

2266:   \caption{{\sc ThresholdSum} error, T=100000.}

2267:   \label{fig:TSE100000}

2268: \end{figure}

2269:

2270:

2271: \subsection{\sc ThresholdAverage}

2272:

2273: {\sc ThresholdAverage} gives the average length of each time

2274: interval. Figure~\ref{fig:TAE10} shows the now familiar mountains

2275: descending below 5\% error at 20 buckets for $T=10$. The figure also

2276: shows that even though a few of the data sets tended to have good

2277: results at 5 and 10 buckets, these results are not guaranteed in

2278: general. In Figure~\ref{fig:TAE1000}, the error reaches a plateau

2279: below 5\% with only small bumps in the data.

2280:

2281: \begin{figure}[h]

2282:   \centering

2283:   \includegraphics[width=4in]{figs/results/CTAE10.eps}\\

2284:   \caption{{\sc ThresholdAverage} error, T=10.}

2285:   \label{fig:TAE10}

2286: \end{figure}

2287:

2288: \begin{figure}[h]

2289:   \centering

2290:   \includegraphics[width=4in]{figs/results/CTAE1000.eps}\\

2291:   \caption{{\sc ThresholdAverage} error, T=1000.}

2292:   \label{fig:TAE1000}

2293: \end{figure}

2294:

2295: \subsection{\sc CountRange}

2296:

2297: Other {\sc CountRange} algorithms have achieved error values between

2298: 2\% and 3\%. Using our method we conjecture that we could reduce the

2299: error because our method of approximation, although much more

2300: complicated, theoretically adapts to skewed distributions better

2301: than other methods. Figure~\ref{fig:CountRangeError} shows that we

2302: achieved errors under 2\% for 20 buckets across all the data sets,

2303: and in some cases, under 1\%.

2304:

2305: \begin{figure}[h]

2306:   \centering

2307:   \includegraphics[width=4in]{figs/results/CCountRange.eps}\\

2308:   \caption{{\sc CountRange} error.}

2309:   \label{fig:CountRangeError}

2310: \end{figure}

2311:

2312: {\sc CountRange} also runs at about the same speed as the threshold

2313: operators due to its similar implementation.

2314:

2315:

2316: Additional information that contains error analyses of all the

2317: threshold values is given in~\cite{Anderson2007D}.

2318:

2319:

2320: \section{Related Work}\label{sec:RelatedWork}

2321:

2322: This section reviews the literature specific to aggregation. Spatial

2323: and spatiotemporal databases have attracted an enormous amount of

2324: interest, and there exists a wide range of literature that is

2325: related to our work only through indexing. For books on the subjects

2326: of spatiotemporal and constraint databases we suggest:

2327: \cite{rigaux2001B,revesz2002,77589,S05Book}, and

2328: \cite{guting_book05}.

2329:

2330: \subsection{\textsc{MaxCount} and \textsc{CountRange} Aggregation}

2331:

2332: There exist only a few previous algorithms to compute {\sc

2333: MaxCount}~\citep{Revesz20031,Chen20041,Anderson20061}.  None of

2334: those previous algorithms provides efficient queries without

2335: rebuilding the index (i.e., they do not provide dynamic updates).

2336:

2337: Previous \emph{approximate} {\sc MaxCount} solutions use indices

2338: from~\cite{APR99} that minimize the skew of point distributions in

2339: the buckets by creating hyper-buckets based on the properties of all

2340: points at index creation time. Updates require the index to be

2341: rebuilt because the buckets depend on the point distribution at a

2342: specific time. In contrast, the probabilistic method we presented

2343: {\em recognizes} point density skew in each bucket instead and

2344: creates a density distribution to model it. We present the first

2345: efficient and dynamic algorithm for {\sc MaxCount}.

2346: Table~\ref{table:results} compares the results of earlier {\sc

2347: MaxCount} algorithms with our current algorithm where $N$ is the

2348: number of points and $B$ is the number of buckets in the index.

2349:

2350: \begin{table}[ht]

2351: \centering \caption{{\sc MaxCount} aggregation complexity on

2352: linearly moving objects.} \label{table:results}

2353: \begin{tabular}[bt]{|c|c|c|l|l|l|} \hline

2354: {\bf Max.}& {\bf Worst Case} & {\bf Space}  &  {\bf Exact }     & {\bf Static or}      & {\bf Reference}      \\

2355: {\bf Dim.}& {\bf Time}       &              &  {\bf or Est.}    &

2356: {\bf Dynamic}        & \\ \hline\hline 1         & $O(\log N)$ &

2357: $O(N^2)$     & Exact             & Static               &

2358: \cite{Revesz20031}   \\ \hline 1         & $O(B \log B)$    & $O(B)$

2359: & Est.              & Static               & \cite{Chen20041}     \\

2360: \hline 2         & $O(B \log B)$    & $O(B)$       & Est. & Static

2361: & \cite{Anderson20061} \\ \hline d         & $O(B)$           &

2362: $O(B)$       & Est.              & Dynamic & \cite{Anderson2007D} \\

2363: \hline d         & $O(N)$           & $O(1)$       & Exact

2364: & Dynamic              & \cite{Anderson2007D} \\ \hline

2365: \end{tabular}

2366: \end{table}

2367:

2368: To our knowledge, we present the first proposal of these threshold

2369: aggregate operators for moving points: {\sc MaxCount (and MinCount),

2370: ThresholdRange, ThresholdCount, ThresholdSum}, and {\sc

2371: ThresholdAverage}.

2372:

2373: We can modify {\sc Spatiotemporal-Range} algorithms to return the

2374: {\sc CountRange} by counting the objects returned. Several other

2375: algorithms were proposed directly for the {\sc CountRange} problem.

2376: We summarize previous {\sc Spatiotemporal-Range} and {\sc

2377: CountRange} algorithms in Table~\ref{tbl:count}, where $N$ is the

2378: number of moving objects or points in the database, $d$ is the

2379: dimension of the space, and $B$ is the number of buckets. All

2380: algorithms listed are dynamic, which means that they allow

2381: insertions and deletions of moving objects without rebuilding the

2382: index.

2383:

2384: \begin{table}[htb]

2385: \centering \caption{{\sc Range} and {\sc CountRange} aggregation

2386: summary.}

2387: \begin{tabular}{|c|l|l|l|l|l|} \hline

2388: {\bf Max.}& {\bf Worst Case}                 & {\bf Worst case}                 & {\bf Exact }       & {\bf Reference}      \\                                         %& {\bf Static or}

2389: {\bf Dim.}& {\bf Time}                       & {\bf Space}                      & {\bf or Est.}      & \\ \hline\hline                                                 %& {\bf Dynamic}

2390: 2         & $O(N^{\frac{3}{4}+\epsilon}+k)$  & $O(N)$                           & Exact              & \cite{KGT99} \\   % Simplex range method                  %& Dynamic

2391: 2         & $O(\log_2 N + k)$                & $O(N^2)$\footnotemark            & Exact              &  \\ \hline        % time limited method

2392: 2         & $O(N)$                           & $O(N)$                           & Exact              & \cite{PKGT02} \\ \hline % R*-Tree Model                         %& Dynamic

2393: 3         & $O(N)$                           & $O(N)$                           & Exact              & \cite{SJLL00} \\ \hline % Saltenis Jensen... TPR-Tree Model     %& Dynamic

2394: d         & $O(N)$                           & $O(N)$                           & Exact              & \cite{PLM01} \\ \hline  % R-Tree Model                          %& Dynamic

2395: d         & $O(B^{d-1} \log_{B}^{d} N)$      & $O(\frac{N}{B}\log_{B}^{d-1} N)$ & Exact              & \cite{ZGTS03} \\ \hline  % also in --> cite{ZTG02} ECDF-B-Tree and BA-Tree (sum,avg,cnt) %& Dynamic

2396: 2         & $O(\log_B N + C)/B$              & $O(N)$                           & Est.               & \cite{KGT99}\footnotemark \\ \hline                                          %& Dynamic

2397: 2         & $O(B)$                           & $O(B)$                           & Est.               & \cite{CC02} \\ \hline %first S-T selectivity estimation         %& Dynamic

2398: d         & $O(B)$                           & $O(B)$                           & Est.               & Tao et al. (2003) \\ \hline %\citet{TSP03} \footnotemark \\ \hline                            %& Dynamic

2399: d         & $O(\sqrt{N})$                    & $O(N)$                           & Est.               & \cite{TP05} \\ \hline %aMVRB-tree Tao, Papadias    %& Dynamic

2400: d         & $O(B)$                           & $O(B)$                           & Est.               & \cite{Anderson2007D} \\ \hline                                         %& Dynamic

2401: \end{tabular}

2402: \label{tbl:count}

2403: \end{table}

2404:

2405: \footnotetext[1]{This is a restricted future time query with

2406: expected $O(N)$ space that becomes quadratic if the restriction is

2407: too far into the future.} \footnotetext[2]{$C=K+K'$, where $K'$ is

2408: the approximation error.} \footnotetext[3]{Although \cite{TSP03}

2409: allow dynamic updates, over time the index must be rebuilt.}

2410:

2411: In all our work we consider time as a continuous variable. Time as a

2412: discrete variable is discussed in both temporal and spatiotemporal

2413: aggregation by \cite{AAE03,TP05} and \cite{BGJ06}. In the discrete

2414: approach, time stamps describe the temporal nature of objects. This

2415: approach is less relevant to our work, but is relevant to many

2416: applications.

2417:

2418: \subsection{Indices and Estimation Techniques}

2419:

2420: There are many ways our work is indirectly related to previous work

2421: on indexing structures and estimation techniques. %Example~\ref{ex:MaxCount} shows that the

2422: {\sc Count} and {\sc Max} aggregation operators have only a titular

2423: relationship to the {\sc MaxCount} aggregation, because one cannot

2424: use the {\sc Count} and {\sc Max} aggregation operators to implement

2425: the {\sc MaxCount} aggregation. Nevertheless, several techniques

2426: used in the {\sc MaxCount} problem are also used in other indices

2427: and algorithms designed for range, max/min, and count queries. We

2428: summarize several of these related techniques next.

2429:

2430: \subsubsection{Indices}

2431:

2432: The index structure of \cite{AAE03} finds the 2-dimensional moving

2433: points contained in a rectangle in $O(\sqrt {N})$ time.

2434: \cite{GKTD05} gave a selectivity estimation with a histogram

2435: structure of overlapping buckets designed to approximate the density

2436: of multi-dimensional data. The algorithm runs in time

2437: $O(d|B|)$, where $d$ is the number of dimensions and $B$ is the

2438: number of buckets. \cite{GKR04} gave a technique for answering

2439: spatiotemporal range, intercept, incidence, and shortest path

2440: queries on objects that move along curves in a planar graph.

2441: \cite{CJNP04,CJSP05} also gave indexing methods that use networks,

2442: such as roads, to predict position and motion changes of objects

2443: that follow roads and characteristics of routes. \cite{PJ05} used

2444: networks to reduce the dimensionality of constrained moving objects

2445: to two-dimensional trajectories and examined the method in terms of

2446: the spatiotemporal range query. \cite{AG051} proposed the MON-Tree

2447: to index moving objects in networks using graphs or route-oriented

2448: networks to answer spatiotemporal range and window queries. They

2449: define window queries as returning the pieces of an object's

2450: movement that intersect the query window. \cite{ZMTGS01} proposed

2451: the {\em multiversion SB-tree} to perform range temporal aggregates:

2452: {\sc Sum}, {\sc Count} and {\sc Avg} in $O(\log_b n)$, where $b$ is

2453: the number of records per block and $n$ is the number of entries in

2454: the database. \cite{TIME05_Revesz} gave efficient rectangle indexing

2455: algorithms based on point dominance to find count interpreted in $k$

2456: dimensions using the following concepts:

2457: \begin{enumerate}

2458: \item {\em stabbing} gives the number of objects that contain a point;

2459: \item {\em contain} gives the number of rectangles that contain the query rectangle;

2460: \item {\em overlap} gives the number of rectangles that overlap the query rectangle; and

2461: \item {\em within} gives the number of rectangles within the query space.

2462: \end{enumerate}

2463: These four operators have a running time of $O(\log^k n)$ where $k$

2464: is the number of dimensions and $n$ is the number of points.

2465:
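As an illustrative formalization (the notation here is ours and is only a sketch, not taken from \cite{TIME05_Revesz}), let $R$ be the indexed set of $k$-dimensional rectangles, $p$ a query point, and $Q$ a query rectangle. Under these assumed symbols, the four count interpretations listed above can be written as
\begin{align*}
\mathit{stabbing}(p) &= \bigl|\{\, r \in R : p \in r \,\}\bigr|, \\
\mathit{contain}(Q)  &= \bigl|\{\, r \in R : Q \subseteq r \,\}\bigr|, \\
\mathit{overlap}(Q)  &= \bigl|\{\, r \in R : r \cap Q \neq \emptyset \,\}\bigr|, \\
\mathit{within}(Q)   &= \bigl|\{\, r \in R : r \subseteq Q \,\}\bigr|.
\end{align*}
If we further assume that each rectangle is a product of closed intervals, $r = \prod_{i=1}^{k} [\ell_i(r), u_i(r)]$, then each test reduces to endpoint comparisons in every dimension. For example, $Q \subseteq r$ holds exactly when $\ell_i(r) \le \ell_i(Q)$ and $u_i(Q) \le u_i(r)$ for all $1 \le i \le k$.
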

2466: %TIME PARAMETERIZED INDICES

2467: \cite{SJLL00} gave an R$^{\ast}$-tree based indexing technique for

2468: one-, two-, and three-dimensional moving objects that supports time-slice

2469: queries (selection queries), window queries, and moving queries.

2470: Window queries return the same information as range queries, but

2471: with a valid time window starting at the current time and continuing

2472: to $t_h$. Window queries may request predictions for range queries

2473: within this window of time. Moving queries, similar to incidence

2474: queries, return the points that are contained within the space

2475: connecting one rectangle at a start time to a second rectangle at an

2476: end time. The proposed time-parameterized R-tree (TPR-Tree) search

2477: runs in expected {\em logarithmic time}. Another R$^{\ast}$-tree

2478: extension given by \cite{CR00} forms tighter parametric bounding

2479: boxes than~\cite{SJLL00} and has similar running time. \cite{Tao03}

2480: proposed the TPR$^\ast$-Tree that extends the TPR-Tree with improved

2481: insert and delete algorithms. In the context of a variety of count

2482: queries it performs similarly to previous indices.

2483:
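The time-parameterized bounding idea behind the TPR-Tree family can be sketched as follows. The symbols are our own assumptions for illustration and do not reproduce the exact definitions of \cite{SJLL00}. Suppose a tree node stores a set $S$ of objects, and that each object $o \in S$ moves linearly in dimension $i$, with position $x_{o,i}(t) = x_{o,i}(t_0) + v_{o,i}\,(t - t_0)$ for a reference time $t_0$. A conservative bounding interval for the node in dimension $i$ at any time $t \ge t_0$ is then
\begin{align*}
\underline{x}_i(t) &= \min_{o \in S} x_{o,i}(t_0) + (t - t_0)\,\min_{o \in S} v_{o,i}, \\
\overline{x}_i(t)  &= \max_{o \in S} x_{o,i}(t_0) + (t - t_0)\,\max_{o \in S} v_{o,i}.
\end{align*}
Because $t - t_0 \ge 0$, every object satisfies $\underline{x}_i(t) \le x_{o,i}(t) \le \overline{x}_i(t)$, so the interval never loses an object. It is generally not tight for $t > t_0$, which is the looseness that tighter parametric bounding boxes aim to reduce.
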

2484: \cite{SPTL04} use time-dependent, updatable histograms to query

2485: counts at specific times, including the past, present, and future.

2486: Recently, \cite{PSJ06} proposed the $R^{PPF}$-tree that indexes

2487: past, present and predictive positions of moving points, and extends

2488: the previous work on TPR-Trees \citep{SJLL00} with a partial

2489: persistence framework. Earlier work by \cite{TayebUW98} adapted the

2490: PMR-quadtree~\citep{77589}, a variant of the quadtree structure, for

2491: indexing moving objects to answer time-slice queries, which they

2492: called instantaneous queries, and infinitely repeated time-slice

2493: queries, called continuous queries. Search performance is similar to

2494: that of quadtrees, allowing searches in $O(\log N)$ time.

2495:

2496: \cite{MSI02} use the sweeping technique from computational geometry

2497: to define a query language to evaluate past, present, and future

2498: positions of moving objects in constraint databases.

2499:

2500: Finally, \cite{HadjieleftheriouKGT03} use an efficient approximation

2501: method to find areas where the density of objects is above a

2502: specific threshold during a specific time interval. This method

2503: comes the closest to the method used in our aggregation operators,

2504: but does not allow for the query to move or change shape over time.

2505: In fact, this method is not applied to counting at all.

2506:

2507: Note that each of these indexing methods that return the moving

2508: points in a query window or rectangle can be easily modified to

2509: return instead the {\em count} of those moving points.

2510: However, they may not be easily extended to provide a {\sc MaxCount}

2511: within a changing, moving query space.

2512:

2513:

2514: With a few exceptions, {\sc Count} aggregation takes

2515: $O(\log N + d)$ time for exact methods and $O(B)$ time or better for

2516: estimation methods. The hidden constant in the exact method is the

2517: number of buckets that must be traversed to find the {\sc Count}.

2518: Estimation methods vary in many ways, and the asymptotic running time

2519: does not always give a meaningful indication of how large $B$ will be.

2520:

2521: %However since estimation methods aim at being faster than precise

2522: %methods, generally we have that $B < \log N$.\medskip

2523:

2524: \subsubsection{Estimation Techniques}

2525:

2526: Our work is related to several other papers that {\em estimate} the

2527: count aggregate operation on spatiotemporal databases.

2528:

2529: \cite{APR99} gave an algorithm that estimates the {\sc Count} of

2530: the rectangles that intersect a query rectangle for

2531: selectivity estimation. \cite{CC02} and \cite{TSP03} proposed

2532: methods that can estimate the {\sc Count} of the moving points in

2533: the plane that intersect a query rectangle. More recently,

2534: \cite{KPGT05} gave a predictive method based on dual

2535: transformations.

2536:

2537: \cite{WolfsonY03} and \cite{TWHC04} gave methods for generating

2538: pseudo trajectories of moving objects. Most of these estimation

2539: algorithms use {\em buckets} as the basic building blocks of the

2540: index. Extending this idea, we use $2d$-dimensional

2541: hyper-buckets in our algorithms, where $d$ is the number of

2542: dimensions in the moving-objects space.

2543:
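To illustrate why $2d$ dimensions arise, consider the following sketch. It assumes linear motion and uses our own notation, which may differ from the definitions given earlier in the paper. A point moving in $d$ dimensions with position $x_i(t) = p_i + v_i t$ in dimension $i$ is determined by the $2d$ parameters $(p_1, v_1, \ldots, p_d, v_d)$, so it can be viewed as a static point in $\mathbb{R}^{2d}$. A hyper-bucket can then be taken to be an axis-parallel box in this parameter space:
\begin{align*}
H &= \prod_{i=1}^{d} \bigl[p_i^{\ell}, p_i^{u}\bigr] \times \bigl[v_i^{\ell}, v_i^{u}\bigr].
\end{align*}
Under this assumption, every moving point whose parameters lie in $H$ satisfies, for all $t \ge 0$ and each dimension $i$,
\begin{align*}
p_i^{\ell} + v_i^{\ell}\, t \;\le\; x_i(t) \;\le\; p_i^{u} + v_i^{u}\, t,
\end{align*}
so the region of space that the bucket's points can occupy at time $t$ is bounded by endpoints that are linear in $t$.
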

2544:

2545: \section{Conclusions and Future Work}\label{sec:Conclusions}

2546:

2547: We implemented and compared two new {\sc MaxCount} algorithms. The

2548: estimated {\sc MaxCount} was shown to be fast and accurate while

2549: still allowing fast, constant-time updates. No other algorithm has

2550: these features to date. We showed that {\sc ThresholdRange}, {\sc

2551: ThresholdCount}, {\sc ThresholdSum}, {\sc ThresholdAverage}, and

2552: {\sc CountRange} are related to {\sc MaxCount} and can be evaluated

2553: using similar techniques, with error values under 5\%

2554: in these operations. We gave an empirical threshold for choosing

2555: between the exact and estimated algorithms. We discussed the issues

2556: related to higher dimensions and noted that all sweeping algorithms

2557: share this problem. We also note that, using our technique, it is

2558: possible to decompose the problem and run it in a multiprocessor or

2559: grid environment where the database is divided into smaller

2560: databases.

2561:

2562: Future work may include finding other techniques that reduce

2563: the running time, because there does not appear to be a clear method for

2564: speeding up sweeping methods. One could also

2565: consider implementing and comparing these techniques in a grid

2566: computing environment.

2567:

2568: \bibliographystyle{agsm}

2569: \bibliography{h:/unl/BibTex/all}

2570:

2571: \end{document}

2572: