1: \documentclass{article}
2:
3: \usepackage[left=1in,right=1in,top=1in,bottom=1in]{geometry}
4: \usepackage[singlespacing]{setspace}
5: \usepackage{amssymb}
6: \usepackage{amsfonts}
7: \usepackage{amsmath}
8: \usepackage{psfrag}
9: \usepackage{graphicx}
10: \usepackage{afterpage}
11: \usepackage{rotating}
12: \usepackage{calc}
13: \usepackage{url}
14: \usepackage{natbib}
15:
16: \def\lfp{\mathop{\hbox{\it lfp}}}
17: \def\impl{\mathrel{\hbox{~~:---~~}}}
18: \def\progstart{\singlespacing\begin{center}\begin{minipage}{.95\textwidth}\small\noindent\rule[0pt]{\linewidth}{0.4pt}\vspace{6pt} \\}
19: \def\progend{\rm\rule[6pt]{\linewidth}{0.4pt} \\ \end{minipage}\end{center}\doublespacing}
20:
21: %Included for Gather Purpose only:
22: %input "h:\unl\bibtex\all.bib"
23:
24: \newtheorem{theorem}{Theorem}
25: \newtheorem{acknowledgement}[theorem]{Acknowledgement}
26: \newtheorem{algorithm}[theorem]{Algorithm}
27: \newtheorem{axiom}[theorem]{Axiom}
28: \newtheorem{case}[theorem]{Case}
29: \newtheorem{claim}[theorem]{Claim}
30: \newtheorem{conclusion}[theorem]{Conclusion}
31: \newtheorem{condition}[theorem]{Condition}
32: \newtheorem{conjecture}[theorem]{Conjecture}
33: \newtheorem{corollary}[theorem]{Corollary}
34: \newtheorem{criterion}[theorem]{Criterion}
35: \newtheorem{definition}[theorem]{Definition}
36: \newtheorem{example}[theorem]{Example}
37: \newtheorem{exercise}[theorem]{Exercise}
38: \newtheorem{lemma}[theorem]{Lemma}
39: \newtheorem{notation}[theorem]{Notation}
40: \newtheorem{problem}[theorem]{Problem}
41: \newtheorem{proposition}[theorem]{Proposition}
42: \newtheorem{remark}[theorem]{Remark}
43: \newtheorem{solution}[theorem]{Solution}
44: \newtheorem{summary}[theorem]{Summary}
45: \newenvironment{proof}[1][Proof]{\noindent\textbf{#1.} }{\ \rule{0.5em}{0.5em}}
46:
47:
48: \begin{document}
49:
50: \author{Scot Anderson\\ sanderson@southern.edu\\ Southern Adventist University, Tennessee
51: \and Peter Revesz\\ revesz@cse.unl.edu\\ University of Nebraska-Lincoln}
52:
53: \title{Efficient Threshold Aggregation of Moving Objects}
54:
55: \date{}
56:
57: \maketitle
58:
59: \begin{abstract}
60: Calculating aggregation operators of moving point objects, using
61: time as a continuous variable, presents unique problems when
62: querying for congestion in a moving and changing (or dynamic) query
63: space. We present a set of congestion query operators, based on a
64: threshold value, that estimate the following $5$ aggregation
65: operations in $d$-dimensions. 1) We call the count of point objects
66: that intersect the dynamic query space during the query time
67: interval, the \textsc{CountRange}. 2) We call the Maximum (or
68: Minimum) congestion in the dynamic query space at any time during
69: the query time interval, the \textsc{MaxCount} (or
70: \textsc{MinCount}). 3) We call the sum of time that the dynamic
71: query space is congested, the \textsc{ThresholdSum}. 4) We call the
72: number of times that the dynamic query space is congested, the
73: \textsc{ThresholdCount}. And 5) we call the average length of time
74: of all the time intervals when the dynamic query space is congested,
75: the \textsc{ThresholdAverage}. These operators rely on a novel
76: approach to transforming the problem of selection based on position
77: to a problem of selection based on a threshold. These operators can
78: be used to predict concentrations of migrating birds that may carry
79: disease such as Bird Flu and hence the information may be used to
80: predict high risk areas. On a smaller scale, those operators are
81: also applicable to maintaining safety in airplane operations. We
82: present the theory of our estimation operators and provide
83: algorithms for exact operators. The implementations of those
84: operators, and experiments, which include data from more than 7500
85: queries, indicate that our estimation operators produce fast,
86: efficient results with error under 5\%.
87: \end{abstract}
88:
89: \section{Introduction}
90:
91: \label{intro:ST}
92:
Safety can often be reduced to a problem of congestion. The
safety of flight depends on separation of airplanes or, more
generally, on the maximum number of airplanes that a particular airspace
can safely contain, and the maximum number of airplanes that air
traffic controllers (ATC) responsible for directing airplanes can
safely track. When considering epidemics, the presence of a single
animal with Bird Flu does not indicate the start of an
100: epidemic. Instead the presence of a certain number of instances of
101: the disease indicates a high risk of starting an epidemic, or actual
102: epidemic conditions. Consequently, we see that congestion often
103: links to safety and can predict high risk or even dangerous
104: conditions.
105:
106: Congestion is defined differently depending on the application.
107: Hence it is necessary to provide aggregation operators that take a
108: threshold value as a parameter to define congestion.
109:
110: In relational databases, \textsc{Max}, \textsc{Min}, \textsc{Count},
111: \textsc{Sum} and \textsc{Average} form the set of natural
aggregation operators. Spatiotemporal databases containing moving
objects, based on continuous time, cannot apply these operators in
114: the same way. However, these operators may still function in
115: interesting ways for moving objects. For example, one can ask how
116: many moving point objects exist within a moving and changing (or
117: \emph{dynamic}) rectangular area \emph{at a certain time}, or what
118: is the maximum distance between two moving points \emph{at certain
119: times}. Obviously, when we are interested in discrete time
120: instances, then the moving point object database can be reduced to a
121: relational database and the above queries can be expressed as simple
122: \textsc{Count} or \textsc{Max} queries.
123:
124: Moving object databases naturally suggest new aggregate operators
125: that have no equivalents in relational databases. For example, one
126: may ask what is the maximum number of moving-point objects that
127: exist simultaneously within a dynamic rectangular area at any time
128: during a time interval $T$? We call this the \textsc{MaxCount} query
(symmetrically, we can also find the \textsc{MinCount}). One may
also ask during what time intervals in $T$ there exist more
than $M$ moving objects within a rectangular area. We call this the
\textsc{ThresholdRange}. We show that a strong relationship exists
between \textsc{MaxCount} and \textsc{ThresholdRange}, and we show
that \textsc{ThresholdRange} forms the basis for a family of
threshold operators that include: \textsc{ThresholdCount},
\textsc{ThresholdSum}, and \textsc{ThresholdAverage}. A related,
though less complex, operator answers the question: what is the
number of moving objects that exist within or intersect a dynamic
rectangular area at any time instance during interval $T$? We call
140: this type of query the \textsc{CountRange} query.
141:
142: We give the following definitions for aggregation operators:
143:
144: \begin{definition}[Dynamic Query Space]
145: \label{def:DynamicQuerySpace} Dynamic query space is defined by a
146: continuous time interval $T$, and a $d$-dimensional space that may
147: move and change size or shape over the query time interval.
148: \end{definition}
149:
150: Throughout this paper we consider the shape of the query space to be
151: a box or cube.
152:
153: \begin{definition}[\textsc{MaxCount (MinCount)}]
154: \label{def:MaxCount} Let $S$ be a set of moving points. Given a
155: dynamic query space $R$ defined by two moving points $Q_1$ and $Q_2$
156: as the lower-left and upper-right corners of $R$, and a time
interval $T$, the \textsc{MaxCount} (\textsc{MinCount}) operator
158: finds the time $t_{\max(\min)}$ and maximum (or minimum) number of
159: points $M_{\max(\min)}$ in $S$ that $R$ can contain at any time
160: instance within $T$.
161: \end{definition}
162:
Throughout this paper we develop the \textsc{MaxCount} operator
because wherever we find a maximum, a minimum can be found
similarly.
166:
167: \begin{definition}[\textsc{ThresholdRange}]
168: \label{def:ThresholdRange}Let $S$ be a set of moving points. Given a
169: dynamic query space $R$ defined by two moving points $Q_1$ and $Q_2$
170: as the lower-left and upper-right corners of $R$, a time interval
171: $T$, and a threshold value $M$, the \textsc{ThresholdRange} operator
172: finds the set of time intervals $T_M$ where the count of objects in
173: $R$ is larger than $M$.
174: \end{definition}
175:
\textsc{ThresholdRange} is directly related to \textsc{MaxCount} in
that when $M$ is raised to just below $M_{\max}$, \textsc{ThresholdRange}
returns a time interval containing $t_{\max}$, and during this time
interval the count will be $M_{\max}$.
180:
181: \begin{definition}[\textsc{ThresholdCount}]
182: \label{def:ThresholdCount} Given a \textsc{ThresholdRange}, \textsc{%
183: ThresholdCount} returns the number of time intervals.
184: \end{definition}
185:
186: \begin{definition}[\textsc{ThresholdSum}]
\label{def:ThresholdSum} Given a \textsc{ThresholdRange}, \textsc{ThresholdSum}
returns the total time $T_s$ during which the count is above $M$. That is,
summing over the intervals $T_i \in T_M$, \textsc{ThresholdSum} returns:
\begin{equation}
T_s = \sum_{T_i \in T_M} |T_i|
\end{equation}
where $|T_i|$ denotes the length of the interval $T_i$.
194: \end{definition}
195:
\begin{definition}[\textsc{ThresholdAverage}]
\label{def:ThresholdAverage} Given a \textsc{ThresholdRange}, \textsc{ThresholdAverage}
returns the average length of the intervals in
$T_M$.
200: \end{definition}
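
To illustrate how these definitions fit together, consider a purely
hypothetical count function; the numbers are made up for illustration
and are not taken from our experiments, and the symbol $c(t)$ is
introduced only for this sketch. Suppose $T=[0,16]$ and the number of
points of $S$ inside $R$ at time $t$ is
\[
c(t)=\left\{
\begin{array}{ll}
2 & 0 \leq t < 2 \\
4 & 2 \leq t < 5 \\
1 & 5 \leq t < 8 \\
6 & 8 \leq t < 9 \\
3 & 9 \leq t < 12 \\
1 & 12 \leq t \leq 16
\end{array}
\right.
\]
Then \textsc{MaxCount} returns $M_{\max}=6$ with any $t_{\max} \in [8,9)$.
For the threshold $M=2$, \textsc{ThresholdRange} returns
$T_M=\{[2,5),\,[8,12)\}$, so \textsc{ThresholdCount} is $2$,
\textsc{ThresholdSum} is $T_s = 3+4 = 7$, and \textsc{ThresholdAverage}
is $7/2 = 3.5$.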
201:
202: In addition to the threshold aggregation operators, we also use our
203: bucketing method to implement the \textsc{CountRange} defined as
204: follows.
205:
206: \begin{definition}[\textsc{CountRange}]
207: \label{def:SpatioTemporalRangeCount} Let $S$ be a set of moving
208: points. Given a dynamic query space $R$ defined by two moving points
209: $Q_1$ and $Q_2$ as the lower-left and upper-right corners of $R$ and
210: a time interval $T$, the \textsc{CountRange} query returns the total
211: number of points that intersect $R$ in $T$.
212: \end{definition}
213:
214: Together \textsc{MaxCount (MinCount)} and the threshold operators
215: form a complete set of threshold aggregation operators comparable to
216: the aggregation operators given in relational databases.
217:
218: The following examples use the simple concepts of flying to
219: demonstrate the use of a few of these threshold aggregation
220: operators.
221:
222: \begin{example}
223: \label{ex:MaxCount}\textrm{Airplanes are commonly modeled as
224: linearly moving objects with preestablished flight plans. Suppose,
225: at any time, at most a constant number $M$ of airplanes is allowed
226: to be in the O'Hare airspace to avoid congestion. Suppose also a new
227: airplane requests approval of its flight plan for entering the
228: O'Hare airspace between times $t_a$ and $t_b$. The air traffic
229: controllers can avoid congestion as follows. If after adding a new
230: flight plan, the \textsc{MaxCount} between $t_a$ and $t_b$ is still
no greater than $M$, then they can approve the flight. Otherwise, they
232: need to find some alternative path, and check it again against the
233: database. }
234:
235: \textrm{Air traffic controllers try to direct airplanes as linearly
236: moving objects for fuel efficiency, among other reasons. If they
237: recognize a developing congestion too late, then they often must
238: direct the airplane to fly in circles until the congestion has
239: cleared. That solution wastes fuel. On the other hand, if they
240: recognize the developing congestion early, then they can often
241: simply tell the airplane to change its speed, which saves fuel.
242: Therefore, it is important to identify congestions as early as
243: possible. We may identify congestions by using a \textsc{MaxCount}
244: query where a moving box around the airplane and a time interval
245: $[t_{a},t_{b}]$ define the query. If the \textsc{MaxCount} predicts
246: congestion, then the airplane's speed can be adjusted early in the
247: flight. }
248: \end{example}
249:
250: \begin{example}
251: \label{ex:ThresholdCount}\textrm{Suppose we want to alert pilots if
252: their current flight path takes them through at least one congested
253: region. }
254:
\textrm{The \emph{Traffic Alert and Collision Avoidance System (TCAS)}
provides similar functionality. However, TCAS only provides
alerts for current congestion, not predicted congestion. Although
TCAS was implemented in 1986, we continue to have mid-air
collisions and near misses, indicating that the system still needs
improvement. \textsc{ThresholdRange} is a modification of
261: \textsc{MaxCount} that returns all predicted time intervals on the
262: flight path where the \textsc{Count} exceeds a given threshold.
263: Hence using \textsc{ThresholdRange} we can alert a pilot of
264: predicted congestions where more than $M$ other airplanes will be
265: within the space $B$ around the airplane. Predicting and avoiding
266: these areas can significantly reduce the chances of mid-air
267: collisions. }
268: \end{example}
269:
270: \begin{example}
271: \label{ex:CountRange}\textrm{Suppose we are especially concerned
272: about a rush-hour period $[t_a,t_b]$ that is particularly stressful
273: to air traffic controllers. Suppose controllers can direct at most
274: $M$ airplanes safely. We can determine the number of controllers
275: needed during the rush-hour time by executing the
276: \textsc{CountRange} query over the controlled airspace during the
277: rush-hour and dividing by $M$. By ensuring that a sufficient number
278: of controllers are present, safety is achieved and controllers are
not overstressed. }
280: \end{example}
281:
282: Each of the operators can also be applied to examine different
283: aspects of congestion with regard to bird migration and hence
284: disease control. These questions and examples, motivated by research
285: on \textsc{MaxCount}, led us to explore complex threshold
286: aggregations and data structures to support them.
287:
288: The rest of this paper is organized as follows.
Section~\ref{sec:BucketDataStructures} gives some background on
point domination and sweeping techniques, and then
introduces the data structures used to build buckets. These buckets
can then be used in various indexing algorithms to suit the
application at hand. Section~\ref{sec:DynamicMaxCount} develops the
294: {\sc MaxCount} estimation algorithm using a running example.
295: Section~\ref{sec:ThresholdOperators} develops the {\sc
296: ThresholdRange} algorithm based on {\sc MaxCount} and demonstrates
297: the relationship that ties {\sc MaxCount} to the remaining threshold
298: operators. This section also develops algorithms for each of those
299: operators including {\sc CountRange}.
300: Section~\ref{sec:ExperimentalResults} gives the experimental results
301: of the implementation. Section~\ref{sec:RelatedWork} reviews the
302: related work and Section~\ref{sec:Conclusions} gives conclusions and
303: future work.
304:
305:
306:
307: \section{Hyper-Bucket Data Structures}\label{sec:BucketDataStructures}
308:
309: This section presents an updatable {\em skew-aware} bucket for
310: indices that models the skewed point distributions in each bucket.
311: The skew-aware technique allows the index structure to perform
312: inserts, deletes, and updates in {\em fast constant time} using a
313: \textsc{HashTable} to store the buckets. Many spatiotemporal
applications, such as tracking clients on a wireless network,
particularly need these fast updates, and no previously presented
{\sc MaxCount} algorithm can meet that requirement. Because the
buckets are spatially defined, the bucketing technique also easily
adapts to other spatial and spatiotemporal indices such as the
\textsc{R-tree}~\cite{DBLP:conf/sigmod/Guttman84}. Hence, by choosing
an appropriate index, the technique performs well both for
applications dominated by search operations and for those dominated
by update operations.
323:
324: Our algorithm uses a sweeping method to evaluate the threshold
325: aggregation operators similar to previous approaches from
326: \cite{Chen20041,Revesz20031} and \cite{Anderson20061}. The algorithm
327: differs in that the sweeping algorithm integrates a skew-aware
328: density function over the spatial dimensions of the bucket to obtain
329: the time dependent count function. The density function in the
330: bucket increases accuracy over methods given in
331: \citep{Chen20041,Anderson20061} while maintaining the same number of
332: buckets. This idea is a crucial improvement because we model the
333: point distribution skew in a bucket, whereas previous methods
334: adapted to skew by increasing the number of buckets or changing
335: their shape and contents. We also present a precise algorithm for
336: evaluating the threshold aggregation operators that requires no
337: index and runs in $O(N)+ O(n \log n)$ time and $O(n)$ space where
338: $N$ is the number of points in the database and $n$ is the value of
339: a {\sc CountRange} query using the same query space and time. Both
340: the threshold aggregation algorithms and the skew-aware bucket data
341: structure presented are implemented and analyzed in 3-dimensional
342: space. We show that the approximation achieves good results while
343: significantly reducing the running times.
344:
345: Section~\ref{ssec:buckets} describes the problems related to
346: creating hyper-buckets (also referred to as just buckets) and a
347: specific solution for creating $6$-dimensional buckets for
348: $3$-dimensional linearly moving points. In all cases, we can extend
349: our method to $d$-dimensions. Section~\ref{ssec:updates} describes
350: the method for inserting and deleting a point from a bucket and
351: shows that updates take constant time. Section~\ref{ssec:structures}
352: applies two different data structures to contain the buckets suited
353: for applications where either inserts and deletes or threshold
354: aggregation queries dominate.
355:
356:
357: \subsection{Hyper-Bucket Data Structure}\label{ssec:buckets}
358:
359: \begin{definition}[Hex Representation]
360: \label{def:hex} Define each 3-dimensional linearly moving point $p$
361: by parametric linear equations in $t$ as follows:
362: \begin{equation}
363: p=\left\{
364: \begin{array}{c}
365: p_x ~=~ v_x t ~+~ x_0 \\
366: p_y ~=~ v_y t ~+~ y_0 \\
367: p_z ~=~ v_z t ~+~ z_0 \\
368: \end{array}
369: \right.
370: \end{equation}
371: where the corresponding {\em hex representation} of $p$ is the tuple
372: $(v_x,x_0,v_y,y_0,v_z,z_0)$ containing the duals of $p_x$, $p_y$,
and $p_z$. For simplicity we often denote the six-tuple as
$(x_1,\ldots,x_6)$.
375: \end{definition}
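
As a small illustration with made-up values, consider a point that
starts at $(x_0,y_0,z_0)=(10,5,3)$ and moves with velocity
$(v_x,v_y,v_z)=(2,-1,0)$. Its hex representation is
\[
(v_x,x_0,v_y,y_0,v_z,z_0) = (2,\,10,\,-1,\,5,\,0,\,3),
\]
and its position at time $t=3$ is
\[
(p_x,p_y,p_z) = (2\cdot 3+10,\; -1\cdot 3+5,\; 0\cdot 3+3) = (16,\,2,\,3).
\]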
376:
377: \medskip
Consider a relation $D(x_1,\ldots,x_6)$ that contains the {\em hex
representation} of linearly moving points in $3$ dimensions. Then
$D$ represents a $6$-dimensional {\em static} space. Divide the
space into axis-aligned hyper-rectangles where the $k^{th}$ axis has
$d_k$ divisions. Each hyper-rectangle becomes a bucket containing
the moving points whose hex representation falls inside the hyper-rectangle.
384:
385: \begin{definition}[Hyper-bucket dimensions]
386: \label{def:bucketDimensions} Define the dimensions of each bucket
387: $B_i$ by inequalities of the form:
388: \begin{equation}
389: \begin{array}{lcl}
390: v_{x,L} \leq v_x < v_{x,U} &\bigwedge& x_{0,L} \leq x_0 < x_{0,U} ~\bigwedge \\
391: v_{y,L} \leq v_y < v_{y,U} &\bigwedge& y_{0,L} \leq y_0 < y_{0,U} ~\bigwedge \\
392: v_{z,L} \leq v_z < v_{z,U} &\bigwedge& z_{0,L} \leq z_0 < z_{0,U} \\
393: \end{array}
394: \end{equation}
395: where we denote the lower bound as:
396: \begin{equation}
397: (v_{x,L}, x_{0,L},v_{y,L}, y_{0,L}, v_{z,L}, z_{0,L})
398: \end{equation}
399: and the upper bound as
400: \begin{equation}
401: (v_{x,U}, x_{0,U},v_{y,U}, y_{0,U}, v_{z,U}, z_{0,U}).
402: \end{equation}
403: \end{definition}
404:
405: Each hyper-rectangle defines the spatial dimensions of a possible
406: bucket, where only buckets that contain points need be included in
407: the index. The maximum number of possible buckets is given by
408: $m=\prod\limits_{k}d_k$.
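
For instance, assuming for illustration $d_k=10$ divisions on each of
the six axes, at most
\[
m=\prod_{k=1}^{6} d_k = 10^6
\]
buckets are possible, although only those that contain at least one
point are actually stored.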
409:
410: \begin{definition}[Histograms]
411: \label{def:histograms} Given a $6$-dimensional rectangle $B_i$,
412: given by Definition~\ref{def:bucketDimensions}, containing $b_i$
points, build the {\em histograms} $h_{i,1},\ldots,h_{i,6}$ for each
414: axis using $s$ subdivisions as follows. To create histogram
415: $h_{i,j}$, divide bucket $B_i$ into $s$ parallel subdivisions along
416: the $j$th axis, and record separately the number of points within
417: $B_i$ that fall within each subdivision.
418: \end{definition}
419:
420:
421: \begin{example}[Building Histograms]\rm
422: \begin{figure}[ht]
423: \centering
424: \psfrag{Y}{$X_0$}
425: \psfrag{X}{$V_x$}
426: \includegraphics[width=4in]{figs/pointssplit3.eps}
427: \caption{Points projected onto $v_x,x_0$ plane.} \label{fig:points}
428: \end{figure}
429: Consider a set of 6-dimensional points projected onto the $v_x,x_0$
430: plane as shown in Figure~\ref{fig:points}. Assume that the number of
431: subdivisions is $s=10$ along both $v_x$ and $x_0$.
432: Figure~\ref{fig:vxx0histograms} shows $h_{i,1}$ and $h_{i,2}$. For
433: example, the subdivision $0\leq v_x < 1$ contains six points and
434: hence the first bar of histogram $h_{i,1}$ rises to level $6$. The
435: other values can be determined similarly.
436:
437: \begin{figure}[htb]
438: \begin{minipage}[t]{3in}
439: \begin{center}
440: \includegraphics[width=3in]{figs/vxhist.eps}\\
441: \mbox{$h_{i,1}$: Points projected onto $v_x$.}
442: \end{center}
443: \end{minipage}
444: \hfill
445: \begin{minipage}[t]{3in}
446: \begin{center}
447: \includegraphics[width=3in]{figs/x0hist.eps}\\
448: \mbox{$h_{i,2}$: Points projected onto $x_0$.}
449: \end{center}
450: \end{minipage}
453: \caption{Histogram of Points in 2 Dimensions.}
454: \label{fig:vxx0histograms}
455: \end{figure}
456: \begin{figure}[htb]
457: \begin{minipage}[t]{3in}
458: \psfrag{A}{$x_0$} \psfrag{B}{$v_x$}
459: \includegraphics[width=3in]{figs/xaccurate2.eps}
460: \end{minipage}
461: \hfill
462: \begin{minipage}[t]{3in}
463: \psfrag{A}{$x_0$} \psfrag{B}{$v_x$}
464: \includegraphics[width=3in]{figs/xwrong2.eps}
465: \end{minipage}
466: \caption{2D Distribution Functions}
467: \label{fig:vx2d}
468: \end{figure}
469:
470: Histograms tell much about the distribution of the points in a
471: bucket but they introduce some ambiguity. For example, the
histograms in Figure~\ref{fig:vxx0histograms} match both of the
2D distributions in Figure~\ref{fig:vx2d}.
474: \end{example}
475: \bigskip
476:
477: \begin{definition}[Axis Trend Function]
478: \label{def:trendfunctions}%
479: The {\em axis trend function} $f_{i,j}(x_j)$ is some polynomial
480: function for bucket $B_i$ and axis $j$ such that the following hold:
481: \begin{enumerate}
482: \item $f_{i,j} \geq 0$ over $B_i$.
\item $f'_{i,j}$, the derivative of $f_{i,j}$, does not change sign over the valid range.
484: \end{enumerate}
485: The {\em bucket trend function} $f_i$ for bucket $B_i$ is the
486: following:
487: \begin{equation}
488: \label{eq:bucketdensity}
489: f_i=\prod_j f_{i,j}
490: \end{equation}
491: \end{definition}
492:
493: Condition 1 ensures that the bucket trend function, built from the
494: axis trend functions, does not contain a negative probability
495: region. Condition 2 requires that the bucket density increase,
496: decrease, or remain constant when considering any single axis. This
497: condition avoids the ambiguity demonstrated in
498: Figures~\ref{fig:vxx0histograms} and \ref{fig:vx2d} by giving a
499: polynomial that approximates the density change correctly. We show
500: this in the following Lemma.
501:
502: \begin{lemma}
\label{lem:distributionindependence} Given a bucket $B_{i}$ with
axis trend functions $f_{i,j}$ and bucket trend function $f_{i}$,
let $r_{1}$ and $r_{2}$ be
identically sized regions in bucket $B_{i}$. If the density in
$B_{i}$ along each axis monotonically increases from $r_{1}$ to
$r_{2}$, then the following holds:
509: \begin{equation}
510: \int_{r_{2}}f_{i}~d\phi \geq \int_{r_{1}}f_{i}~d\phi
511: \end{equation}
512: \end{lemma}
513:
514: \begin{proof}
515: Increasing densities from $r_{1}$ to $r_{2}$ translates into
516: histograms that also increase from $r_{1}$ in the direction of
517: $r_{2}$ along each axis. The translation from histograms to the axis
518: trend functions gives the following conditions:
519: \begin{equation}
520: f_{i,j}(x_{2,j})\geq f_{i,j}( x_{1,j})
521: \end{equation}
522: where $x_{1,j}$ and $x_{2,j}$ are the $j^{th}$ coordinates of the
523: points in $r_{1}$ and $r_{2}$ respectively, and are located the same
524: distance from the $j^{th}$ coordinates of the lower bounds of $r_1$
525: and $r_2$ respectively. Since this constraint holds for each $j$ and
526: $f_{i,j}\geq 0$ we have:
527: \begin{equation}
528: f_{i}(x_2)\geq f_{i}(x_1)
529: \end{equation}
530: Hence by the properties of integration we conclude
531: \begin{equation}
532: \int_{r_{2}}f_{i}~d\phi \geq \int_{r_{1}}f_{i}~d\phi
533: \end{equation}
534: \end{proof}
535:
536: Definition~\ref{def:trendfunctions} allows a whole class of
537: polynomial functions, and Lemma~\ref{lem:distributionindependence}
538: applies to each member of that class. However, in the following, we
539: use a particular polynomial function derived from the product of
540: linear functions, which are obtained by using the least squares
541: method for each histogram.
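
The least squares step can be made concrete as follows. This is only
a sketch under illustrative assumptions: division $k$ of a histogram
is represented by an abscissa $x_k$ (for example its index or
midpoint) and its count $y_k$, and the fitted line is written
$\alpha x + \beta$; the symbols $\alpha$, $\beta$, $x_k$ and $y_k$ are
introduced here for illustration only. The normal equations for the
$s$ divisions are
\[
\begin{array}{rcl}
\alpha \sum_k x_k^2 + \beta \sum_k x_k & = & \sum_k x_k y_k \\
\alpha \sum_k x_k + \beta s & = & \sum_k y_k
\end{array}
\]
with solution
\[
\alpha = \frac{s\sum_k x_k y_k - \sum_k x_k \sum_k y_k}
              {s\sum_k x_k^2 - \left(\sum_k x_k\right)^2},
\qquad
\beta = \frac{\sum_k y_k - \alpha \sum_k x_k}{s}.
\]
For instance, with made-up counts $y=(2,3,5,6)$ at $x=(1,2,3,4)$ we
have $\sum_k x_k = 10$, $\sum_k x_k^2 = 30$, $\sum_k y_k = 16$ and
$\sum_k x_k y_k = 47$, so
\[
\alpha = \frac{4\cdot 47 - 10\cdot 16}{4\cdot 30 - 10^2} = \frac{28}{20} = 1.4,
\qquad
\beta = \frac{16 - 1.4\cdot 10}{4} = 0.5.
\]
A linear function automatically satisfies condition 2 of
Definition~\ref{def:trendfunctions}, since its derivative is constant.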
542:
543: \begin{definition}[Normalized Trend Functions]
544: \label{def:NormalizedTrendFunction} Let $n$ be the number of points
545: in the database, $b_i$ the number of points in bucket $B_i$, and
546: $f_i$ be given by Equation~(\ref{eq:bucketdensity}). The {\em
547: normalized trend function} $F_i$ for bucket $B_i$ is:
548: \begin{equation}
549: F_{i} = \frac{b_i f_{i}}
550: {
551: n \mathop{\displaystyle\int}\limits_{B_i}^{~}
552: f_{i}~d\phi
553: }
554: \label{eq:NormalizedSurface}
555: \end{equation}
556: and the {\em percentage of points} in bucket $B_i$ is:
557: \begin{equation}
558: p = \mathop{\displaystyle\int}\limits_{B_i} F_i~d\phi.
559: \label{eq:percentagepoints}
560: \end{equation}
561: \end{definition}
562:
563: With this definition we can calculate the number of points in $O(1)$
564: time using the following simple lemma.
565:
566: \begin{lemma}
567: \label{lem:ConstRunningTimeForBucket} Let $B_i$ be a bucket, $n$ the
number of points in the database, and $p$ be given by
569: Definition~\ref{def:NormalizedTrendFunction}. Then $np$ is the
570: number of points in bucket $B_i$ and $np$ is calculated in $O(1)$
571: time.
572: \end{lemma}
573: \begin{proof}
By Equations~(\ref{eq:NormalizedSurface}) and
575: (\ref{eq:percentagepoints}) we have:
576: \begin{equation}
577: \begin{array}{ccl}
578: n p & = &n \mathop{\displaystyle\int}\limits_{B_i} F_i~d\phi \vspace{6pt}\\
579: & = &n \mathop{\displaystyle\int}\limits_{B_i} \displaystyle{\frac{b_i}{n}} \frac{f_{i}}{\mathop{\displaystyle\int}_{B_i} f_i~d\phi}~d\phi \vspace{6pt}\\
580: & = &n \displaystyle{\frac{b_i}{n}} \cdot \frac{\mathop{\displaystyle\int}_{B_i} f_{i}~d\phi}{\mathop{\displaystyle\int}_{B_i} f_{i}~d\phi} \vspace{6pt}\\
581: & = &b_i. \\
582: \end{array}
583: \end{equation}
584: Clearly the above calculations take only $O(1)$ time.
585: \end{proof}
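
As a sanity check with made-up numbers, take a bucket $B_i=[0,1]^6$
whose only non-constant axis trend function is
$f_{i,1}(x_1)=1+x_1$, with $f_{i,j}=1$ for $j=2,\ldots,6$, and let
$b_i=30$ and $n=100$. Then
\[
\int_{B_i} f_i~d\phi = \int_0^1 (1+x_1)~dx_1 = \frac{3}{2},
\qquad
F_i = \frac{30\,(1+x_1)}{100\cdot\frac{3}{2}} = \frac{1+x_1}{5},
\]
so that
\[
p = \int_{B_i} F_i~d\phi = \frac{1}{5}\cdot\frac{3}{2} = 0.3
\qquad\mbox{and}\qquad
np = 100 \cdot 0.3 = 30 = b_i .
\]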
586:
587: Using the above definitions we can now define the bucket data
588: structure used throughout the rest of this paper.
589:
\begin{definition}[Skew-Aware Buckets]\label{def:bucketsN}%
A bucket is a hyper-rectangle with dimensions given by
Definition~\ref{def:bucketDimensions} that maintains the histograms
given by Definition~\ref{def:histograms}, additional data for the
least squares method, and the normalized trend function given by
Definition~\ref{def:NormalizedTrendFunction}. Throughout the rest of
this paper we refer to these skew-aware buckets simply as buckets.
597: \end{definition}
598:
599:
600:
601: \subsection{Inserts and Deletes}\label{ssec:updates}
602:
603: We can maintain the bucket (and hence the index) while deleting or
604: inserting a point for any bucket $B_i$ by recalculating the trend
605: function $F_i$ for the bucket.
606:
607: \begin{lemma}\label{lem:ConstantUpdates}
608: Insertion and deletion of a moving point can be done in $O(1)$ time.
609: \end{lemma}
610:
611: \begin{proof}
612: When we insert or delete a point, we need to update the histograms
and the normalized trend function. Let the point to insert/delete be
$P_a$, represented using the hex representation as
$(a_0,a_1,a_2,a_3,a_4,a_5)$, let $d_j$, for $0 \leq j \leq 5$, be the
bucket width in the $j^{th}$ dimension (here $d_j$ denotes a width,
not a number of divisions), and let $s$ be the number of
subdivisions in each histogram. The concatenation of $id_0, \ldots,
id_5$ gives the identifier $ID_i$ of the bucket $B_i$ into which $P_a$
is inserted (or from which it is deleted), where each $id_l$,
$0 \le l \le 5$, is defined by:
620: \begin{equation}
621: id_l = \left\lfloor \frac{a_l}{d_l} \right\rfloor.
622: \end{equation}
623: The calculation of $ID_i$ and retrieving bucket $B_i$ takes $O(1)$
624: time using a \textsc{HashTable}.
625:
Let $hw_{i,j}$ be the histogram-division width for the $j^{th}$
dimension, calculated as $hw_{i,j} = \left\lceil \frac{d_j}{s} \right\rceil$.
Then $P_a$ is projected onto each dimension to determine which
division of the histogram to update. For the $j^{th}$ dimension, the
index $k(j)$ of the division of histogram $h_{i,j}$ to update is given as follows:
\begin{equation}
k(j) = \left\lfloor \frac{a_j - id_j \cdot d_j}{hw_{i,j}} \right\rfloor
\end{equation}
634: Let $h_{i,j,k}$ be the histogram division to update for each
635: histogram. Update $h_{i,j,k}$ and the sums $\displaystyle{\sum}y_i$,
636: and $\displaystyle{\sum}x_i y_{i}$ from the normal equations in the
637: least squares method. $N$, $\displaystyle{\sum}x_i$ and
638: $\displaystyle{\sum}x_{i}^{2}$ from the normal equations do not need
639: updating since the number of histogram divisions $s$ is fixed within
640: the database.
641:
We can now recalculate each $f_{i,j}$ in constant time by solving
the $2 \times 3$ augmented matrix corresponding to the normal equations of the
least squares method for each histogram. For each $f_{i,j}$
calculate the endpoints to determine the required shift amount
(Definition~\ref{def:trendfunctions}, condition 1) and calculate
$f_i$ from Equation~(\ref{eq:bucketdensity}). Now we calculate $F_i$
using Equation~(\ref{eq:NormalizedSurface}). Each of these
steps depends only on the dimension of the database. Hence for any
fixed dimension we can rebuild the normalized trend function $F_i$
in $O(1)$ time.
652: \end{proof}
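
As a worked illustration with hypothetical values, suppose that in
dimension $j$ the bucket width is $d_j=10$, the histograms use $s=5$
subdivisions, and the point has coordinate $a_j=37.5$. Then
\[
id_j = \left\lfloor \frac{37.5}{10} \right\rfloor = 3,
\qquad
hw_{i,j} = \left\lceil \frac{10}{5} \right\rceil = 2,
\qquad
k(j) = \left\lfloor \frac{37.5 - 3\cdot 10}{2} \right\rfloor
     = \left\lfloor 3.75 \right\rfloor = 3 .
\]
Hence the count in division $3$ of $h_{i,j}$ (counting divisions from
$0$) is incremented on an insert or decremented on a delete. If $x_k$
denotes the abscissa used for that division in the least squares fit
(a symbol assumed here for illustration), this changes
$\sum y_i$ by $\pm 1$ and $\sum x_i y_i$ by $\pm x_k$, while the
remaining sums are unchanged.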
653:
654: \subsection{Index Data Structures}\label{ssec:structures}
655:
656: There is no need to create a bucket unless it contains at least one
657: point. We consider two classes of data structures for organizing the
658: buckets: \textsc{HashTables} and \textsc{Trees}.
659:
For databases where inserts and deletes are the most common
operations, the \textsc{HashTable} approach allows these operations
to run in constant time. However, the {\sc MaxCount} operation will
require an enumeration of all the buckets and thus a
running time of at least $O(|B|)$, where $|B|$ is the number of
buckets. As long as the number of buckets is
665: reasonable, this approach works well.
666:
667: For databases where {\sc MaxCount} is the most common operation, we
668: may use an \textsc{R-tree} structure
669: \citep{DBLP:conf/sigmod/Guttman84,BKS+90} where the elements to be
670: inserted are the buckets. This approach speeds up the {\sc MaxCount}
671: query to $O(\log|B| + R)$ where $R$ is the number of buckets needed
672: to calculate the query. The insert and delete costs for these
673: \textsc{R-trees} are $O(\log|B|)$, because buckets do not overlap.
674:
675: Since buckets do not change shape, the database is decomposable and
676: allows each type of aggregation to be calculated from simultaneous
677: executions on subspaces of the index space. We discuss the method
and ramifications of this capability at the end of
Section~\ref{sec:ExactMaxCount}.
680:
681:
682: \section{Dynamic \textsc{MaxCount}}\label{sec:DynamicMaxCount}
683:
684: Section~\ref{ssec:PointDomination} reviews point domination in
685: higher dimensions. Section~\ref{ssec:IntegratingBuckets} examines
686: finding the percentage of points in a bucket that are in the query
687: space as a function of time. Section~\ref{ssec:MaxCountAlgorithm}
688: puts the two previous sections together to create the dynamic {\sc
689: MaxCount} algorithm for $d$-dimensions.
690:
691: \subsection{Point Domination in 6-Dimensional Space}\label{ssec:PointDomination}
692:
693: Let $B$ be the set of 6-dimensional hyper-buckets in the input where
694: each hyper-bucket $B_i$ has an associated normalized trend function
695: $F_i$ as in Definition~\ref{def:NormalizedTrendFunction}. Let the
696: vertices of $B_i$ be denoted $v_{i,j}$ where $1 \leq j \leq 64$,
697: because there are $2^6$ corner vertices to a 6-dimensional
698: hyper-cube.
699:
700: \begin{definition}[Point Domination]\label{def:pointdomination}
701: Given two linearly moving points in three dimensions
702: \begin{equation}
703: P(t)=\left\{
704: \begin{array}{l}
705: p_{x}=x_1 t + x_2 \\
706: p_{y}=x_3 t + x_4 \\
707: p_{z}=x_5 t + x_6
708: \end{array}
709: \right. %\label{eq:point2}
710: \quad {\rm and} \quad
711: Q(t)=\left\{
712: \begin{array}{l}
713: q_{x}=v_{x}t+x_{0} \\
714: q_{y}=v_{y}t+y_{0} \\
715: q_{z}=v_{z}t+z_{0}
716: \end{array}
717: \right.
718: \end{equation}
719: $Q(t)$ dominates $P(t)$ if and only if the following holds:
720: \begin{equation}
721: (p_x < q_x) \quad \wedge \quad (p_y < q_y) \quad \wedge \quad (p_z < q_z).
722: \end{equation}
723: \end{definition}
724:
725: The previous definition takes 6-dimensional points defined in
726: Definition~\ref{def:hex} and places them into three inequalities of
727: the form $x_2 < -t(x_1-v_x) + x_0$. Each inequality defines a region
728: below a line with slope $-t$.
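
\medskip\noindent
A small numerical check may help here. The values below are illustrative assumptions and are not taken from the example data used later. Let $Q_x(t) = 2t + 3$ (so $v_x = 2$, $x_0 = 3$) and $P_x(t) = t + 4$ (so $x_1 = 1$, $x_2 = 4$). At $t = 2$ we have
\begin{align*}
p_x &= 1\cdot 2 + 4 = 6 \;<\; 7 = 2\cdot 2 + 3 = q_x, \\
x_2 &= 4 \;<\; 5 = -2\,(1 - 2) + 3 = -t(x_1 - v_x) + x_0 .
\end{align*}
Both the primal inequality and its dual form hold together, as expected, because the second is only a rearrangement of the first.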
729:
730: \begin{definition}[$x$-view, $y$-view and $z$-view projections]\label{def:views}
Projecting the inequalities from
Definition~\ref{def:pointdomination} onto their respective dual
planes allows a visualization in three 2-dimensional planes. Define
these three projections as the $x$-view, $y$-view and $z$-view,
respectively. Because $-t$ is the slope of every line,
all views contain lines with identical slopes (see
Figure~\ref{fig:views}).
738: \end{definition}
739:
740:
741:
742: \begin{definition}[Query Space]\label{def:queryspace}
743: Given two moving query points $Q_1(t)$ and $Q_2(t)$ and lines
744: $l_{x1}$, $l_{x2}$, $l_{y1}$, $l_{y2}$, $l_{z1}$, $l_{z2}$ crossing
745: them in their respective hexes with slopes $-t$, the intersection of
746: the bands formed by the area between $l_{x1}$ and $l_{x2}$, $l_{y1}$
747: and $l_{y2}$, and $l_{z1}$ and $l_{z2}$ in the 6-dimensional space
748: forms a hyper-tunnel that defines the {\em query space} as shown in
749: Figure~\ref{fig:views}.
750: \end{definition}
751:
752: \begin{figure}[ht]
753: \centering
\psfrag{X-View}{$X$-view}
\psfrag{Y-View}{$Y$-view}
\psfrag{Z-View}{$Z$-view}
757: \psfrag{Q2x}{$Q_{2x}$}
758: \psfrag{Q1x}{$Q_{1x}$}
759: \psfrag{Q2y}{$Q_{2y}$}
760: \psfrag{Q1y}{$Q_{1y}$}
761: \psfrag{Q2z}{$Q_{2z}$}
762: \psfrag{Q1z}{$Q_{1z}$}
763: \psfrag{lx2}{$l_{x2}$}
764: \psfrag{lx1}{$l_{x1}$}
765: \psfrag{ly2}{$l_{y2}$}
766: \psfrag{ly1}{$l_{y1}$}
767: \psfrag{lz2}{$l_{z2}$}
768: \psfrag{lz1}{$l_{z1}$}
769: \psfrag{Position}{Position}
770: \psfrag{Velocity}{Velocity}
771: \includegraphics[width=5.9in]{figs/views.eps}\\
772: \caption{Views.}\label{fig:views}
773: \end{figure}
774:
775: We can now visualize the query in space and time as the {\em query
776: space} sweeping through a bucket as the slopes of the lines change
777: with time. Using the above, it is now easy to prove the following
778: lemma.
779:
780: \begin{lemma}
At any time $t$, the moving points whose hex-representation lies
below (or above) $l_{x1}$, $l_{y1}$, and $l_{z1}$ in their respective
views are exactly those points that lie below (or above) $Q_{1}$ in
the original 3-dimensional space.
785: \end{lemma}
786:
787: \begin{proof}
788: Let $Q_{x}(t)=v_{x}t+x_{0}$ where $v_{x}$ and $x_{0}$ are constants
789: and consider any $x$ component of a point $P_{x}(t)=x_1 t + x_2$
790: that lies below $Q$ on the $x$-axis. Then
791: \begin{eqnarray}
792: x_1 t + x_2 &<& v_{x} t + x_{0} \\
793: x_2 &<& -t (x_1 - v_{x}) + x_{0}
794: \end{eqnarray}
At any time $t$ these are the points below the line $x_2
= -t(x_1 - v_{x}) + x_{0}$, which has a slope of $-t$ and goes
through $( v_{x},x_{0})$. This representation is the dual of point
$Q_{x}$. By Definition~\ref{def:queryspace}, this is exactly the
line $l_{x1}$. We can prove similarly that the points with duals
above $l_{x1}$ are above $Q_{1}$ at any time $t$. The proof that
points whose hex-representations are above or below $l_{y1}$ and
$l_{z1}$ are exactly those points that lie above or below $Q_{1}$ is
similar to the proof for points above or below $l_{x1}$. By
Definition~\ref{def:pointdomination}, we conclude that the points
dominated by $Q_{1}$ in the dual space are those points that are
below $l_{x1}$, $l_{y1}$, and $l_{z1}$ in the $x$-view, $y$-view, and
$z$-view, respectively. Similarly, we conclude that the points that
dominate $Q_1$ in the dual space are those points that are above
$l_{x1}$, $l_{y1}$, and $l_{z1}$ in the $x$-view, $y$-view, and
$z$-view, respectively.
811: \end{proof}
812:
Throughout the examples in this section, we use the points shown in
814: Figures~\ref{fig:Points} and \ref{fig:ExPointsProjected} to
815: demonstrate the evaluation of a {\sc MaxCount} query. We begin by
816: creating the index.
817:
818: \begin{example}[Creating the Index]\label{ex:BuildIndex}\rm
819: \begin{figure}[htb]
820: \centering
821: \includegraphics[scale=1]{figs/ExamplePoints_v2.eps}
822: \caption{Example points.}
823: \label{fig:Points}
824: \end{figure}
825:
    Consider a relation whose $6$-dimensional space spans $10$
    units $(0 \ldots 10)$ in each dimension. If we break this up
    into buckets that are $5$ units long in each dimension, we have
    $2^{6}$ buckets. Although these divisions make a space with $64$
    buckets, all the points are contained in a single bucket whose
    index is $(2,2,2,2,2,2)$. All the points listed in
    Figure~\ref{fig:Points} share the same set of velocities in each dual plane.
    Notice the columns for $x_1$, $x_3$, and $x_5$ all have the same
    values in different orders. The projection of the points onto
    the $3$ dual planes shown in Figure~\ref{fig:ExPointsProjected}
    does not immediately show this organization. Projecting the
    points for any view in Figure~\ref{fig:ExPointsProjected} onto each
    axis and creating histograms with $5$ divisions gives the
    histograms for the Velocity and Position axes shown in
    Figure~\ref{fig:HistogramVP}.
841: \begin{figure}[h]
842: \centering
843: \begin{minipage}[t]{2in}
844: \begin{center}
845: \includegraphics[width=2in]{graphics/ex3d1.eps} \\
846: (a)
847: \end{center}
848: \end{minipage}
849: \begin{minipage}[t]{2in}
850: \begin{center}
851: \includegraphics[width=2in]{graphics/ex3d2.eps} \\
852: (b)
853: \end{center}
854: \end{minipage}
855: \begin{minipage}[t]{2in}
856: \begin{center}
857: \includegraphics[width=2in]{graphics/ex3d3.eps} \\
858: (c)
859: \end{center}
860: \end{minipage}
861: \caption{Points projected onto (a) $X$-view, (b) $Y$-view, and (c) $Z$-view.}
862: \label{fig:ExPointsProjected}
863: \end{figure}
864: \begin{figure}[h]
865: \centering
866: \begin{minipage}[t]{3in}
867: \begin{center}
868: \includegraphics[width=3in]{graphics/ExHistV.eps} \\
869: (a) Velocity.
870: \end{center}
871: \end{minipage}
872: \begin{minipage}[t]{3in}
873: \begin{center}
874: \includegraphics[width=3in]{graphics/ExHistP.eps} \\
875: (b) Position.
876: \end{center}
877: \end{minipage}
878: \caption{Position and velocity histograms, identical for each view.}
879: \label{fig:HistogramVP}
880: \end{figure}
Hence, each velocity dimension has the same histogram. Similarly, each
position dimension has the same histogram. To create these
histograms each point is projected onto the axis. For example, point
$1$ projected onto the $x_1$ axis is given as:
\begin{equation}
(5.345,7.543,5.345,8.158,5.345,5.488)\rightarrow5.345.
\end{equation}
Calculate the widths of the histograms as:
\begin{equation}
\mathit{Histogram\_Width} =(10-5)/5 =1
\end{equation}
We determine the histogram division for each point by looping through the
points and calculating the following:
\begin{equation}
\mathit{division}=\left\lfloor(\mathit{point}-\mathit{lowerbound})/\mathit{Histogram\_Width}\right\rfloor
\end{equation}
897: For example the lowest and highest points in velocity would be added
898: to the division calculated as
899: $\left\lfloor \left( 5.84-5\right) /1\right\rfloor = 0$ and
900: $\left\lfloor (9.468-5)/1\right\rfloor = 4$.
901:
902: The histograms translate into a set of points for each view given
903: as:
904: \begin{eqnarray}
905: Velocity =\{(0,1),(1,1),(2,2),(3,2),(4,4)\}\label{pt:Vel} \\
906: Position =\{(0,2),(1,2),(2,2),(3,2),(4,2)\}\label{pt:Pos}
907: \end{eqnarray}
908: Before applying the least squares method each division number
909: must be translated back into the bucket. Translation is done
910: using the following code fragment:
911: \medskip
912:
913: \progstart \vspace{-18pt}
914: \begin{tabbing}
915: \hspace{.25in}\= \kill
916: \textbf{for} $i \leftarrow 0$ \textbf{to} \emph{number\_of\_divisions} $-1$\\
917: \> $point[i][0]\ \leftarrow i*histogram\_width + lowerbound$ \\
918: \> $point[i][1]\ \leftarrow histogram\_value[i]$ \\
919: \textbf{end for}
920: \end{tabbing}
921: \progend
922:
Translating the points from (\ref{pt:Vel}) and (\ref{pt:Pos}) gives the
histograms for velocity and position in each view:
925: \begin{eqnarray}
926: Velocity =\{(5,1),(6,1),(7,2),(8,2),(9,4)\} \\
927: Position =\{(5,2),(6,2),(7,2),(8,2),(9,2)\}.
928: \end{eqnarray}
929: Using the least squares method to fit each of these to a line yields
930: the following for each velocity and position dimension:
931: \begin{align}
932: Velocity:~~ &y=0.7x-2.9\label{eq:RawVelocity}\\
933: Position:~~ &y=0x+2 \label{eq:RawPosition}.%
934: \end{align}
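
\noindent
As a check on these coefficients, the sums entering the normal equations can be worked out by hand from the translated histogram points. The slope and intercept symbols $m$ and $b$ are used here only for illustration. For the velocity points, $N=5$, $\sum x_i = 35$, $\sum y_i = 10$, $\sum x_i^2 = 255$, and $\sum x_i y_i = 77$, so
\begin{align*}
m &= \frac{5\cdot 77 - 35\cdot 10}{5\cdot 255 - 35^{2}} = \frac{35}{50} = 0.7, &
b &= \frac{10 - 0.7\cdot 35}{5} = -2.9 .
\end{align*}
For the position points every $y_i = 2$, so $\sum x_i y_i = 70$ and
\begin{align*}
m &= \frac{5\cdot 70 - 35\cdot 10}{50} = 0, &
b &= \frac{10 - 0}{5} = 2 .
\end{align*}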
935: Evaluating Equations~(\ref{eq:RawVelocity}) and
936: (\ref{eq:RawPosition}) at the end points to find the shift value
937: for the axis trend function to add to each equation gives:
938: \begin{align}
Velocity:~~ &y(5)=0.6,~~ y(10)=4.1\\
940: Position:~~ &y(5)=y(10)=2.
941: \end{align}
942: In this case no constant needs to be added to our equation and
943: the trend function becomes:
944: \begin{equation}
945: f_{i}=(0.7x_{0}-2.9)(0x_{1}+2)(0.7x_{2}-2.9)(0x_{3}+2)(0.7x_{4}-2.9)(0x_{5}+2)
946: \end{equation}
Calculating $F_{i}$ from Equation~(\ref{eq:NormalizedSurface}) requires
integrating $f_i$ over the bucket. With
$\int_{B_{i}}\equiv\int_{5}^{10}\cdots\int_{5}^{10}$ and
$d\phi\equiv
dx_{0}dx_{1}dx_{2}dx_{3} dx_{4}dx_{5}$, this gives
952: \begin{align}
953: \int_{B_{i}}f_{i}d\phi & =8 \int_{B_{i}}(0.7x_{0}-2.9)(0.7x_{2}-2.9)(0.7x_{4}-2.9)d\phi \nonumber\\
954: & =1622234.375.
955: \end{align}
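
\noindent
The value of this integral can be verified by hand, since $f_i$ is a product of one-variable factors and the six-fold integral separates into six one-dimensional integrals. A sketch of that check:
\begin{align*}
\int_{5}^{10}(0.7x-2.9)\,dx &= \Big[0.35x^{2}-2.9x\Big]_{5}^{10} = 6-(-5.75) = 11.75, &
\int_{5}^{10}2\,dx &= 10, \\
\int_{B_{i}}f_{i}\,d\phi &= 11.75^{3}\cdot 10^{3} = 1622.234375\cdot 1000 = 1622234.375 .
\end{align*}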
956: Since all the points reside in a single bucket, $b_{i}=n$, the
957: constant $c$ is given by $c=1/1622234.375 \approx 6.164\times10^{-7}$. Then
958: $F_{i}$ is given by
959: \begin{align}
F_{i} & \approx c~(0.7x_{0}-2.9)(0x_{1}+2)(0.7x_{2}-2.9)(0x_{3}+2)(0.7x_{4}-2.9)(0x_{5}+2)\nonumber\\
& =8c(0.7x_{0}-2.9)(0.7x_{2}-2.9)(0.7x_{4}-2.9)\label{eq:Fi}
962: \end{align}
963: So far we have calculated the normalized trend function $F_{i}$
964: for just one bucket. This calculation finishes the bucket
965: creation process, and the index contains this single bucket
966: defined by the points $lowerbound=(5,5,5,5,5,5)$ and
967: $upperbound=(10,10,10,10,10,10)$.
968: \end{example}
969:
970:
971:
972: \subsection{Approximating the Number of Points in a Bucket}\label{ssec:IntegratingBuckets}
973:
As a line through a query point sweeps across a bucket, the number of points
in the bucket that dominate the query point is approximated by the
integral over the region above the line. In each of the three views
the query space intersects the plane, giving the cases shown in
Figure~\ref{fig:cases}.
979: \begin{figure}[ht]
980: \centering
981: \includegraphics[width=4.25in]{figs/casesfilled.eps}\\
982: \caption{Sweep algorithm cases.}
983: \label{fig:cases}
984: \end{figure}
985:
986:
987: \begin{definition}[Percentage Function]\label{def:percentagefunction}
988: Integrating over the region above the line gives an approximation of
the percentage of points in the query space. We define the
percentage function as:
991: \begin{equation}\label{eq:percentofbucket}
992: p=\int\limits_{r_1} F_i~d\phi
993: \end{equation}
where $r_1$ is the region of the bucket in the query space. If two
lines go through the same bucket, we subtract the smaller region $r_2$
from the larger region $r_1$ as follows:
997: \begin{equation}\label{eq:percentofbucket2}
998: \triangle p=\int\limits_{r_1} F_i~d\phi - \int\limits_{r_2} F_i~d\phi.
999: \end{equation}
1000: Here, regions $r_1$ and $r_2$ correspond to regions above $Q_1$ and
1001: $Q_2$ in Figure~\ref{fig:views}, respectively.
1002: Lemma~\ref{lem:ConstRunningTimeForBucket} showed that finding the
1003: number of points in the bucket requires multiplying
1004: Equation~(\ref{eq:percentofbucket}) or (\ref{eq:percentofbucket2})
1005: by $n$.
1006: \end{definition}
1007:
1008: For each case shown in Figure~\ref{fig:cases}, we describe the
1009: function that results from integration in one view. To extend the
1010: result to any number of views, we take the result from the last view
1011: and integrate it in the next view. If the region below the line were
1012: desired, $p_{lower}=\frac{b_i}{n}-p$ gives the percentage of points
1013: below the line.
1014:
For cases (a) -- (h) below, let $Q=(x_{1,q},x_{2,q},\ldots,x_{6,q})$.
1016: For the $x$-view, let the lower left corner vertex be
1017: $(x_{1,l},x_{2,l})$ and the upper right corner vertex be
1018: $(x_{1,u},x_{2,u})$. In addition each line denoted $l$ is given by
1019: $x_2 = -t (x_1 - x_{i,q}) + x_{i+1,q}$ and corresponds to a line
1020: shown in the corresponding case in Figure~\ref{fig:cases}.
1021:
1022: %%%CASE A
1023: \medskip\noindent{\bf Case (a):}
1024: For this case $l$ crosses the bucket at $x_{1,l}$ and $x_{2,u}$. The
1025: integral over the shaded region is given by the following:
1026: \begin{equation}\label{eq:integrala}
1027: p_a = \int\limits_{x_{1,l}}^{\frac{x_{2,u} - x_{2,q}}{-t} + x_{1,q}}
1028: \int\limits_{-t(x_1 - x_{1,q}) + x_{2,q}}^{x_{2,u}}
1029: F_i~dx_2 dx_1
1030: \end{equation}
1031: Notice that the lower bound of the integral over $dx_2$ contains
1032: $x_1$. This dependence within each view does not affect the
1033: integration in the remaining four dimensions. The solution to
1034: Equation~(\ref{eq:integrala})
1035: %, given in Appendix~\ref{apx:casesolutions},
1036: has the form:
1037: \begin{equation}\label{eq:forma}
1038: a t^2 + b t + c + \frac{d}{t} + \frac{e}{t^2}.
1039: \end{equation}
1040:
1041:
1042: %%%CASE B
1043: \medskip\noindent{\bf Case (b):}
1044: For this case $l$ crosses the bucket at $x_{1,u}$ and $x_{2,u}$. The
1045: integral over the shaded region is given by:
1046: \begin{equation}\label{eq:integralb}
1047: p_b = \int\limits_{-\frac{(x_{2,u}-x_{2,q})}{t}+x_{1,q}}^{x_{1,u}}\int
1048: \limits_{-t(x_{1}-x_{1,q})+x_{2,q}}^{x_{2,u}}F_i~dx_{2}dx_{1}.
1049: \end{equation}
1050: The solution
1051: %is given in Appendix~\ref{apx:casesolutions} and
1052: has the form of Equation~(\ref{eq:forma}).
1053:
1054:
1055: %%%CASE C
1056: \medskip\noindent{\bf Case (c):}
1057: For this case $l$ crosses the bucket at $x_{1,l}$ and $x_{2,l}$. The
1058: integral over the shaded region above the line is given by:
1059: \begin{equation}\label{eq:integrale}
p_c = \int\limits_{x_{1,l}}^{\frac{x_{2,l}-x_{2,q}}{-t}+x_{1,q}}
\int\limits_{-t(x_1 - x_{1,q}) + x_{2,q}}^{x_{2,u}}
F_i~dx_2 dx_1 ~+~
\int\limits_{\frac{x_{2,l}-x_{2,q}}{-t}+x_{1,q}}^{x_{1,u}}
\int\limits_{x_{2,l}}^{x_{2,u}}
F_i~dx_2 dx_1.
1066: \end{equation}
1067: The solution
1068: %is given in Appendix~\ref{apx:casesolutions} and
1069: has the form of Equation~(\ref{eq:forma}).
1070:
1071:
1072: %%%CASE D
1073: \medskip\noindent{\bf Case (d):}
1074: For this case $l$ crosses the bucket at $x_{1,u}$ and $x_{2,l}$. The
1075: integral over the shaded region is given by:
1076: \begin{equation}\label{eq:integralf}
p_d = \int\limits_{\frac{x_{2,l}-x_{2,q}}{-t}+x_{1,q}}^{x_{1,u}}
1078: \int\limits_{-t(x_1 - x_{1,q}) + x_{2,q}}^{x_{2,u}}
1079: F_i~dx_2 dx_1 ~+~
1080: \int\limits_{x_{1,l}}^{\frac{x_{2,l}-x_{2,q}}{-t}+x_{1,q}}
1081: \int\limits_{x_{2,l}}^{x_{2,u}}
1082: F_i~dx_2 dx_1.
1083: \end{equation}
1084: The solution
1085: %is given in Appendix~\ref{apx:casesolutions} and
1086: has the form of Equation~(\ref{eq:forma}).
1087:
1088: %%%CASE E
1089: \medskip\noindent{\bf Case (e):}
For this case $l$ crosses the bucket at $x_{2,l}$ and $x_{2,u}$. The
integral over the shaded region is given by:
\begin{equation}\label{eq:integralc}
p_e = \int\limits_{x_{2,l}}^{x_{2,u}}
1094: \int\limits_{x_{1,l}}^{\frac{x_2 - x_{2,q}}{-t} + x_{1,q}}
1095: F_i~dx_1 dx_2.
1096: \end{equation}
1097: The solution
1098: %is given in Appendix~\ref{apx:casesolutions} and
1099: has the form of
1100: \begin{equation}\label{eq:formc}
1101: c + \frac{d}{t} + \frac{e}{t^2}
1102: \end{equation}
1103: which is like Equation~(\ref{eq:forma}) with $a=b=0$.
1104:
1105:
1106: %%%CASE F
1107: \medskip\noindent{\bf Case (f):}
Similar to case (e), $l$ crosses the bucket at $x_{2,l}$ and
$x_{2,u}$. The integral over the shaded region is given by:
\begin{equation}\label{eq:integrald}
p_f = \int\limits_{x_{2,l}}^{x_{2,u}}
1112: \int\limits_{\frac{x_2 - x_{2,q}}{-t} + x_{1,q}}^{x_{1,u}}
1113: F_i~dx_1 dx_2.
1114: \end{equation}
1115: The solution
1116: %is given in Appendix~\ref{apx:casesolutions} and
1117: has the form of Equation~(\ref{eq:formc}).
1118:
1119:
1120: %%%CASE G
1121: \medskip\noindent{\bf Case (g):}
1122: For this case $l$ crosses the bucket at $x_{1,l}$ and $x_{1,u}$. The
1123: integral over the shaded region is given by:
1124: \begin{equation}\label{eq:integralg}
1125: p_g = \int\limits_{x_{1,l}}^{x_{1,u}}
1126: \int\limits_{-t(x_1 - x_{1,q})+x_{2,q}}^{x_{2,u}}
1127: F_i~dx_2 dx_1.
1128: \end{equation}
1129: The solution
1130: %is given in Appendix~\ref{apx:casesolutions} and
1131: has the form
1132: \begin{equation} \label{eq:formg}
1133: at^2+bt+c
1134: \end{equation}
1135: which is like Equation~(\ref{eq:forma}) with $d=e=0$.
1136:
1137:
1138: %%%CASE H
1139: \medskip\noindent{\bf Case (h):}
The line $l$ passes below all the corner vertices; hence the
integral of the function is given as:
1142: \begin{equation}\label{eq:integralh}
1143: p_h = \int\limits_{x_{1,l}}^{x_{1,u}}
1144: \int\limits_{x_{2,l}}^{x_{2,u}}
1145: F_i~dx_2 dx_1.
1146: \end{equation}
1147: The solution
1148: %is given in Appendix~\ref{apx:casesolutions} and
1149: has the form of Equation~(\ref{eq:formg}).
1150:
1151: %%DONE WITH CASES
1152:
1153: The above cases have solutions for each view in the form of
1154: Equation~(\ref{eq:forma}). Hence the percentage function for a
1155: single bucket as a function of $t$ is of the form:
1156: \begin{align}\label{eq:BucketProbability}
1157: p &=\left( a_x t^2 + b_x t + c_x + \frac{d_x}{t} + \frac{e_x}{t^2} \right)
1158: \left( a_y t^2 + b_y t + c_y + \frac{d_y}{t} + \frac{e_y}{t^2} \right)\nonumber \\
1159: &~~~~\left( a_z t^2 + b_z t + c_z + \frac{d_z}{t} + \frac{e_z}{t^2} \right)
1160: \end{align}
where $t\neq0$ whenever any of $d_x,d_y,d_z,e_x,e_y,e_z$ is nonzero. Finally,
expanding the product and renaming the coefficients gives the general form:
1163: \begin{equation}\label{eq:BucketGeneralForm}
1164: p=a_6 t^6 + a_5 t^5 + a_4 t^4 + a_3 t^3 + a_2 t^2 + a_1 t + c +
1165: \frac{d_1}{t} + \frac{d_2}{t^2} + \frac{d_3}{t^3} + \frac{d_4}{t^4} +
1166: \frac{d_5}{t^5} + \frac{d_6}{t^6}
1167: \end{equation}
where $t \neq 0$ whenever some $d_i \neq 0$, $1 \leq i \leq 6$. Since
the family of functions of the form of
Equation~(\ref{eq:BucketGeneralForm}) is closed under subtraction,
$\triangle p$ from Equation~(\ref{eq:percentofbucket2}) will also
have the same form.\medskip
1172:
1173: As the {\em query space} from Definition~\ref{def:queryspace} sweeps
1174: through a bucket, it crosses the bucket corner vertices. Each time a
1175: corner vertex crosses the {\em query space} boundary, the case that
1176: applies may change in one or more of the views.
1177:
1178: \begin{definition}[Bucket and Index Time-Intervals]\label{def:buckettimeinterval}
1179: The span of time in which no vertex from bucket $B_i$ enters or
1180: leaves the query space defines a {\em bucket time-interval}. We
1181: denote the time-interval as a half-open interval $[l,u)$ where $l$
1182: is the lower bound and $u$ is the upper bound. Each {\em bucket
1183: time-interval} has an associated percentage function $\triangle p$
1184: given by Equation~(\ref{eq:percentofbucket2}). We define the {\em
1185: index time-interval} similarly except that the span of time is
1186: defined when no vertex from {\em any} bucket in the index enters or
1187: leaves the query space.
1188: \end{definition}
1189:
1190: As we will see, index time-intervals are created from individual
bucket intervals. Throughout the rest of this paper we use
1192: the term {\em time intervals} when the context clearly identifies
1193: which type we mean.
1194:
1195: \begin{definition}[Time-Partition Order]\label{def:timepartitionorder}%
1196: Let $B$ be the set of buckets. Let $Q_1$ and $Q_2$ be two query
1197: points and $(t^[,t^])$ be the query time interval. We define the
1198: {\em Time-Partition Order} to be the set of ordered time instances
$TP=\{t_1,t_2,\ldots,t_i,\ldots,t_k\}$ such that $t_1=t^[$ and $t_k=t^]$,
1200: and each $[t_i,t_{i+1})$ is an {\em index time-interval}.
1201: \end{definition}
1202:
1203: \begin{example}[Calculating Bucket Time-Intervals]
1204: \label{ex:TimeIntervals} \rm %
1205: Continuing Example~\ref{ex:BuildIndex}, let $Q$ be a query defined
1206: by:
1207: \begin{eqnarray}
1208: q_{1} &=& (9.5,~8,~9.5,~8,~9.5,~8)\\
1209: q_{2} &=& (8.5,~5,~8.5,~5,~8.5,~5)\\
1210: T &=& (0.1,~10)
1211: \end{eqnarray}
where $q_{1}$ and $q_{2}$ form the query space over the query time
interval $T$. To determine the time intervals during which no corner
vertex enters or leaves the query space, find the slopes of the lines
through each query point and each corner vertex of the bucket.
Figure~\ref{fig:CornerLines} shows
lines from the two query points to the corner vertices for the first
view. Since the query points have the same coordinates in each view,
every view looks the same.
1219: \begin{figure}[t]
1220: \centering
1221: \includegraphics[width=3.75in]{figs/ExampleSlopes_v2.eps}
1222: \caption{Lines from query points to corner vertices.}%
1223: \label{fig:CornerLines}
1224: \end{figure}
1225: The set of times when lines through $q_1$ (shown as solid lines) cross
1226: corner vertices is $\{0.\overline{4}, 6\}$. The set of times when
1227: lines through $q_2$ (shown as dotted lines) cross corner
1228: vertices and are in the time interval is $\{1.42857\}$. The
1229: union of these two sets along with the end points makes up the
1230: times used to create the time intervals:
$\{(0.1,0.\overline{4}),(0.\overline{4},1.42857),(1.42857,6),(6,10)\}$.
1232: \end{example}
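
\medskip\noindent
The crossing times quoted in the example can be reproduced with a short calculation. The sketch below assumes that each corner vertex is written as $(v,p)$ with the same coordinate order as the query points; the symbols $v$ and $p$ are introduced here only for illustration. A line with slope $-t$ through a query point $(x_{1,q},x_{2,q})$ passes through $(v,p)$ when $p = -t(v - x_{1,q}) + x_{2,q}$, that is, when
\begin{equation*}
t = -\frac{p - x_{2,q}}{v - x_{1,q}} .
\end{equation*}
For $q_1=(9.5,8)$ and the corners $(5,5)$, $(5,10)$, $(10,5)$, $(10,10)$ this gives
\begin{align*}
t &= -\tfrac{-3}{-4.5} = -0.\overline{6}, &
t &= -\tfrac{2}{-4.5} = 0.\overline{4}, &
t &= -\tfrac{-3}{0.5} = 6, &
t &= -\tfrac{2}{0.5} = -4,
\end{align*}
and for $q_2=(8.5,5)$ it gives
\begin{align*}
t &= 0, &
t &= -\tfrac{5}{-3.5} \approx 1.42857, &
t &= 0, &
t &= -\tfrac{5}{1.5} = -3.\overline{3} .
\end{align*}
Only $0.\overline{4}$, $1.42857$ and $6$ fall inside the query time interval $T=(0.1,10)$.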
1233: \medskip
1234:
1235: Integration over the {\em spatial dimensions} of the eight possible
1236: cases presented in Figure~\ref{fig:cases} gave a function of the
form of Equation~(\ref{eq:BucketGeneralForm}). To {\em maximize}
Equation~(\ref{eq:BucketGeneralForm}) in the {\em temporal
dimension}, we first take the derivative:
1240: \begin{eqnarray}
1241: \triangle p'&=&(6a_{6}t^{12}+5a_{5}t^{11}+4a_{4}t^{10}+3a_{3}t^{9}+2a_{2}t^{8}+a_{1}t^{7} \nonumber \\
1242: &~&~ -d_{1}t^{5}-2d_{2}t^{4}-3d_{3}t^{3}-4d_{4}t^{2}-5d_{5}t -6d_{6}) / t^7 \label{eq:Derivative}
1243: \end{eqnarray}
where $t \neq 0$. Solving $\triangle p'=0$ requires finding the
roots of the degree-$12$ polynomial in the numerator, {\em for which
no general closed-form solution in radicals exists} (Abel--Ruffini
theorem). Hence we need a numerical method for finding the roots.
1248: %(Note that an exact solution is possible if the problem
1249: %uses only two dimensional moving points because that would require
1250: %solving only $8$-degree polynomial.)
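
\medskip\noindent
For readers who want the intermediate step, the derivative can be obtained by differentiating Equation~(\ref{eq:BucketGeneralForm}) term by term and then placing everything over the common denominator $t^{7}$. A sketch of that step:
\begin{align*}
\triangle p' &= 6a_{6}t^{5}+5a_{5}t^{4}+4a_{4}t^{3}+3a_{3}t^{2}+2a_{2}t+a_{1}
 -\frac{d_{1}}{t^{2}}-\frac{2d_{2}}{t^{3}}-\frac{3d_{3}}{t^{4}}
 -\frac{4d_{4}}{t^{5}}-\frac{5d_{5}}{t^{6}}-\frac{6d_{6}}{t^{7}} .
\end{align*}
Multiplying each term by $t^{7}/t^{7}$ raises the polynomial part to degrees $12$ through $7$ and turns the reciprocal part into degrees $5$ through $0$; the constant term $c$ contributes nothing, which is why no $t^{6}$ term appears in the numerator.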
1251:
1252: The following factors influenced the choice of the numerical method:
1253: \begin{enumerate}
    \item Speed of the algorithm is more important than accuracy
        because we do not expect the original function to change
        dramatically over an index time-interval. We expect small
        change because in practice the time intervals are short.
    \item The algorithm must converge toward a solution within the
        interval; that is, the algorithm must be stable.
    \item Given that we are maximizing Equation~(\ref{eq:BucketGeneralForm})
        over a short time interval, we do not expect
        Equation~(\ref{eq:Derivative}) to have more than one
        solution. This assumption may seem naive, but it is
        reasonable given factor (1).
1265: \end{enumerate}
1266:
1267: Factor (1) above is related to (3) in that it indicates that points
1268: close together have similar values, but emphasizes that speed is the
1269: goal. Factor (2) above eliminates several algorithms from
1270: consideration, but must be required to keep from choosing a solution
1271: that is not within the time interval evaluated.
1272:
1273: Of the three points to consider, (3) is probably the least
1274: intuitive. Consider the following conjecture:
1275:
1276: \begin{conjecture}\label{lem:NearMaximums}
1277: Given $p$ for a set of buckets, if the Euclidean distance between
1278: two maxima is small, then the difference between the maxima is
1279: small.
1280: \end{conjecture}
1281:
1282: Consider the physical characteristics of the system. The value of
1283: $p$ over the time interval changes no more than $b_i$ for any bucket
$B_i$. Clearly $p$ either increases as it encompasses more of the
bucket or decreases as it encompasses less of the bucket. When
1286: $p$ represents the distribution over several buckets, each bucket
1287: contributes a decreasing or increasing amount over the time
1288: interval. Clearly $p$ is bounded below by $0$ and above by
1289: $\sum\limits_i b_i$. Hence, the rate at which the derivative $p'$
1290: changes is characterized by the physical system and reflects the
1291: differences in the buckets as $t$ changes. Since $p$ does not change
1292: dramatically over $t$ for any bucket, then change in several buckets
1293: over $t$ will likewise not be dramatic. Hence if the distance
1294: between two maxima is small, the maxima have a small difference in
magnitude. {\em This rationale for the conjecture above is supported
by the experiments}.\medskip
1297:
Based on these factors, we use a common method for the first
approximation: we look at the graph of $p'$. Programmatically, we divide
the time interval into $c$ subintervals and check
Equation~(\ref{eq:Derivative}) on each for a change in
sign. If a sign change exists, we use the bisection method to
find the root. When no change of sign is found, we also check
any subinterval whose two end points both give values within
$\epsilon$ of $0$. If some roots exist, we check them for maximal values
along with the end points.
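
\medskip\noindent
The sign-change scan followed by bisection can be outlined in pseudocode. This is only a sketch under stated assumptions, not the implementation used in the experiments: the names \textsc{ApproxMax}, $g$ (standing for $\triangle p'$), $maxIter$, $best$, $a$, $b$ and $m$ are introduced here for illustration, $[l,u)$ is the time interval, $c$ is the number of subintervals, and the near-zero ($\epsilon$) check is omitted for brevity.

\progstart \vspace{-18pt}
\begin{tabbing}
\hspace*{.25in}\=\hspace*{.25in}\=\hspace*{.25in}\= \kill
{\bf \textsc{ApproxMax}$(p, g, l, u, c, maxIter)$} \\
$best \leftarrow \max(p(l), p(u))$ \\
{\bf for} $i \leftarrow 0$ {\bf to} $c-1$ \\
\> $a \leftarrow l + i(u-l)/c$; \quad $b \leftarrow l + (i+1)(u-l)/c$ \\
\> {\bf if} $g(a)\cdot g(b) < 0$ {\bf then} \\
\> \> {\bf for} $k \leftarrow 1$ {\bf to} $maxIter$ \\
\> \> \> $m \leftarrow (a+b)/2$ \\
\> \> \> {\bf if} $g(a)\cdot g(m) \leq 0$ {\bf then} $b \leftarrow m$ {\bf else} $a \leftarrow m$ \\
\> \> {\bf end for} \\
\> \> $best \leftarrow \max(best, p((a+b)/2))$ \\
\> {\bf end if} \\
{\bf end for} \\
{\bf return} $best$
\end{tabbing}
\progend

\noindent
Each pass of the inner loop halves the bracket $[a,b]$ while keeping a sign change of $g$ inside it, so with both $c$ and $maxIter$ fixed the total work is bounded by a constant.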
1306:
1307: \begin{lemma}\label{lem:ConstRunningTimeForTimeInterval}
1308: The approximate maximum within a time interval can be found in
1309: $O(1)$ time.
1310: \end{lemma}
1311:
1312: \begin{proof}
Each {\em time interval} has an associated percentage function
$\triangle p$ which is calculated in $O(1)$ time. Finding $\triangle
p' = 0$ also takes $O(1)$ time: by placing a constant bound on the
number of iterations in the bisection method, we bound the time
required in the numerical section of the algorithm by a constant.
Evaluating $\triangle p$ at the solution found by the bisection method and at
the end points also takes $O(1)$ time. Hence, the running time to
find the maximum within a time interval is $O(1)$.
1321: \end{proof}
1322:
1323: We chose to limit the number of iterations in the bisection method
1324: to 10, which limits the running time to a small constant value. This
1325: value was chosen based on empirical observation that index
1326: time-intervals remain small (about $0.01$ to $4$). Hence, using the
1327: bisection method allows us to narrow our search down to an interval
1328: at least as small as $\frac{1}{256}$ units of time. If time is
1329: measured in hours, this interval equates to only $14$ seconds.
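
\medskip\noindent
The arithmetic behind these figures is short. Assuming the longest observed interval of $4$ time units as the starting bracket, each bisection step halves the bracket, so after $10$ steps
\begin{equation*}
\frac{4}{2^{10}} = \frac{4}{1024} = \frac{1}{256},
\qquad
\frac{1}{256}\ \mbox{hour} = \frac{3600}{256}\ \mbox{s} \approx 14.06\ \mbox{s}.
\end{equation*}
Shorter starting intervals give proportionally tighter brackets.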
1330:
1331: \begin{example}[Building Time-Intervals and Finding {\sc MaxCount}]\rm %
1332: Continuing Example~\ref{ex:TimeIntervals} we build the functions for
1333: time intervals
1334: \begin{equation}
1335: \{(.1,0.\overline{4}),(0.\overline{4},1.42857),(1.42857,6),(6,10)\}
1336: \end{equation}
1337: by integrating using the different cases from
Figure~\ref{fig:cases}. To save space we omit the integrals for all but the
first interval and note that the result of integrating each interval and finding
the maximum gives a maximum of approximately $3$ at
$t=0.\overline{4}$.
1342:
1343:
\noindent\textbf{Time Interval: }$[0.1,0.\overline{4}]$. Here case
(c) holds for query point $q_2$ over this time interval. Hence the
integral for query point $q_{2}$ and $t\in[0.1,0.\overline{4}]$
in each view is given as:
\begin{eqnarray}
p_{c} &=& \int_{8.5}^{10}\int_{5}^{10}2(0.7x_{0}-2.9)dx_{1}dx_{0}+\int_{5}^{8.5}\int_{-t(x_{0}-8.5)+5}^{10}2(0.7x_{0}-2.9)dx_{1}dx_{0}\nonumber\\
&=& 117.5-17.3541\bar{6}t \label{eq:Q2CaseEinterval1}
1351: \end{eqnarray}
Case (g) holds for query point $q_1$ and thus the integral for query
point $q_{1}$ and $t\in [0.1,0.\overline{4}]$ in each view is
given as:
\begin{align}
p_{g} & =\int_{5}^{10}\int_{-t(x_{0}-9.5)+8}^{10}2(0.7x_{0}-2.9)dx_{1}dx_{0}\nonumber\\
& =47.0-32.41\bar{6}t
1358: \end{align}
1359: Hence the integral of the region is:
1360: \begin{align}
1361: p & =c\left( p_{c}-p_{g}\right) ^{3}\nonumber\\
1362: & =2.106\times10^{-3}t^{3}+2.957\times10^{-2}t^{2}+0.138t+0.216
1363: \end{align}
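
\noindent
The coefficients of this cubic can be checked by hand. The sketch below uses the fact that $c = 1/1622234.375 = 1/117.5^{3}$ and that subtracting the two per-view integrals gives approximately $70.5 + 15.0625\,t$:
\begin{align*}
p &= \left(\frac{70.5 + 15.0625\,t}{117.5}\right)^{3} \approx (0.6 + 0.12819\,t)^{3} \\
  &= 0.12819^{3}\,t^{3} + 3(0.6)(0.12819)^{2}\,t^{2} + 3(0.6)^{2}(0.12819)\,t + 0.6^{3} \\
  &\approx 2.107\times10^{-3}\,t^{3} + 2.958\times10^{-2}\,t^{2} + 0.138\,t + 0.216 .
\end{align*}
The small differences in the last digit relative to the coefficients stated above come only from rounding.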
Evaluating $p$ at the start and end of the time interval we have
$p(0.1)\approx0.23$ and $p(0.\overline{4})\approx0.28$.
Figure~\ref{fig:Interval1} shows $p$ in the time interval. Clearly $p$ is
1367: increasing and consequently we have a maximum at the end point
1368: $t=0.\overline{4}$.
1369: \begin{figure}[h]
1370: \centering
1371: \includegraphics[width=4in]{figs/ExampleInterval1.eps}
1372: \caption{Graph of $p$, $0.1 \leq t \leq 0.\overline{4}$.}
1373: \label{fig:Interval1}
1374: \end{figure}
1375: Since there are $10$ points we must multiply
1376: $p(0.\overline{4})$ by $10$ to get the approximation for the time
1377: interval as:%
1378: \begin{equation}
1379: MaxCount_{0.1\leq t\leq 0.\overline{4}} \approx 2.8.
1380: \end{equation}
Since we cannot have partial points, we round this result
to $3$.\medskip
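
\noindent As a quick check of these figures, and using only the rounded
coefficients printed above (so the last digits are indicative rather
than exact), the end-point value can be evaluated term by term with
$t=0.\overline{4}=\frac{4}{9}$:
\begin{align*}
p\left(\tfrac{4}{9}\right) & \approx 2.106\times10^{-3}(0.0878)+2.957\times10^{-2}(0.1975)+0.138(0.4444)+0.216\\
& \approx 0.0002+0.0058+0.0613+0.216 \approx 0.283,
\end{align*}
so that $10\cdot p(0.\overline{4})\approx 2.8$, in agreement with the
estimate above.\medskip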
1383:
The remaining intervals are handled similarly using different cases,
and we omit them to save space. None of the other intervals has a
higher {\sc MaxCount}, and so it follows that {\sc MaxCount} has an
approximate value of $3$ at time $t=0.\overline{4}$.
1389: \end{example}
1390:
1391: \subsection{Dynamic {\sc MaxCount} Algorithm}\label{ssec:MaxCountAlgorithm}
1392:
1393: \progstart \vspace{-18pt}
1394: \begin{tabbing}
1395: \hspace*{.5in}\=\hspace*{.5in}\=\hspace*{.5in}\= \kill
1396: {\bf {\sc MaxCount}$(H, Q_1, Q_2, t^[,t^])$} \\
1397: {\bf input:} \>\> A set of buckets $H$ built by the index structure presented, \\
              \>\> query points $Q_1(t)$ and $Q_2(t)$ and a query time interval $[t^[,t^]]$. \\
1399: {\bf output:}\>\> The estimated {\sc MaxCount} value.\\
1400: \\
1401: 01. \> $TimeIntervals \leftarrow \emptyset $ \` $O(1)$ \\
1402: 02. \> {\bf for} $i \leftarrow 0$ {\bf to} $|H|-1$ \` $O(B)$ \\
1403: 03. \> \> $CrossTimes \leftarrow $\textsc{ CalculateCrossTimes}$(Q_1,Q_2,t^[,t^],H_i)$ \` $O(1)$ \\
1404: 04. \> \> {\bf for} $j \leftarrow 1$ {\bf to} $|CrossTimes|-1$ \` $O(1)$ \\
05. \> \> \>\textsc{Union}$(TimeIntervals,TimeInterval(t_{j-1}, t_{j}))$ \` $O(1)$ \\
1406: 06. \> \> {\bf end for} \\
1407: 07. \> {\bf end for}\\
1408: \\
08. \> $TimeIntervals \leftarrow $\textsc{ BucketSort}$(TimeIntervals)$ \` $O(B)$ \\
09. \> $IndexTimeIntervals \leftarrow $\textsc{ Merge}$(TimeIntervals)$ \` $O(B)$ \\
1411: 10. \> {\bf for each} $IndexTimeInterval \in IndexTimeIntervals$ \` $O(B)$ \\
1412: 11. \> \> \textsc{calculate}$(MaxCount, MaxTime, IndexTimeInterval)$ \` $O(1)$ \\
1413: 12. \> {\bf end for} \\
1414: \\
1415: 13. \> {\bf return} $(MaxCount, MaxTime)$
1416: \end{tabbing}
1417: \progend
1418:
The algorithm above computes {\sc MaxCount}; each line is labeled with
its running time. Line 01 initializes the set of bucket time-interval
objects to be empty. Line 03 returns an ordered list of the times at
which a line through $Q_1$ or $Q_2$ crosses a bucket corner vertex.
Line 05 turns this list into $TimeInterval$ objects and adds them to
the set of $TimeIntervals$. We list the inner ``for'' loop (Line 04)
as $O(1)$ because it consists of a constant number of calculations
bounded by the number of vertices in the bucket. Line 08 uses the
linear time sorting algorithm \textsc{BucketSort} to sort the bucket
time intervals. Line 09 creates the time-partition order and the index
time intervals from the bucket time intervals in $O(B)$ time. An
additional pass adds the bucket time intervals to the appropriate
index time intervals in $O(B)$ time. Lines 10--12 perform the
{\sc MaxCount} calculation discussed above.
1433:
1434: \medskip
1435: In order to use the linear time \textsc{BucketSort} algorithm, we
1436: need the following definition and lemmas.
1437:
1438: \begin{definition}[Time-Interval Ordering]
1439: \label{def:IntervalOrder}%
We define the lexicographical ordering $\prec$ of two {\em time
intervals} $A$ and $B$, with lower bounds $A.l$, $B.l$ and upper bounds
$A.u$, $B.u$, as follows:
1442: \begin{eqnarray}
1443: A.l < B.l & \Rightarrow & A \prec B \\
1444: A.l = B.l \quad \wedge \quad A.u < B.u & \Rightarrow & A \prec B \\
1445: A.l = B.l \quad \wedge \quad A.u = B.u & \Rightarrow & A = B
1446: \end{eqnarray}
1447:
1448:
1449: \end{definition}
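
\noindent For instance, taking three of the interval end points from the
example above purely as an illustration, let $A=[0.1,0.\overline{4}]$,
$B=[0.1,1.42857]$ and $C=[0.\overline{4},1.42857]$. Then
\begin{equation*}
A.l = B.l \;\wedge\; A.u < B.u \;\Rightarrow\; A \prec B,
\qquad
B.l < C.l \;\Rightarrow\; B \prec C,
\end{equation*}
so the sorted order is $A \prec B \prec C$; the upper bound is consulted
only when the lower bounds tie.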
1450:
1451: %??? Fix this so that the values are near the correct areas
1452: \begin{figure}
1453: \centering
1454: \psfrag{Q}{{\tiny $Q$}}
1455: \psfrag{A1}{{\tiny $A=\frac{1}{2}$}}
1456: \psfrag{A2}{{\tiny $A=\frac{1}{4}$}}
1457: \psfrag{A3}{{\tiny $A=\frac{1}{12}$}}
1458: \includegraphics[height=3in]{figs/DistributionBox1.eps}\\
1459: \caption{Areas of successive slopes.}
1460: \label{fig:SlopeDistribution}
1461: \end{figure}
1462:
1463:
The distribution of time interval objects sorted in Line 08 of the
1465: {\sc MaxCount} algorithm may not be uniform across the query time
1466: interval $T=[t^[,t^]]$. However, we can still prove the following.
1467:
1468: \begin{lemma}
1469: \label{lem:TimeIntervalDistribution}%
1470: If the distribution of buckets is uniform, then the distribution of
1471: bucket time-interval objects can be uniformly distributed within the
1472: sorting buckets of the bucket sort.
1473: \end{lemma}
1474: \begin{proof}
1475: Consider the relationship between successive slopes measured as the
1476: angles between lines through a query point $Q$ with slopes
$s_i=-t_i$ and $s_{i+1}=-t_{i+1}$. Suppose $\triangle t=1$ with
$t_0=0$ and $t_1=1$. Then the angle between the two lines is
$\triangle s=\frac{\pi}{4}$. The solid lines in
1480: Figure~\ref{fig:SlopeDistribution} show that half of the bucket
1481: corner vertices are swept by the line sweeping through $Q$ between
1482: $s_0=0$ and $s_1=-1$. Consider a query time interval $[0,10]$. Half
1483: of the corner vertices, and thus half of the time intervals, are
1484: between time $t=0$ and $t=1$. Thus, we conclude that the time
1485: interval objects created by sweeping will not be uniformly
1486: distributed throughout the query time interval.
1487:
Let $Q'$ be the midpoint between $Q_1$ and $Q_2$. Let $S =
\{t_1,\ldots,t_k\}$ where $t_1 = t^[$, $t_k=t^]$ and $t_{i+1} - t_i = L$
for some positive constant $L$ and $1 \leq i \leq k-1$. Let $D_B$ be
a bucket that contains the space in the 6-dimensional index. Model
the normalized bucket function for $D_B$ as a constant $F=1$. Thus
$p$, the bucket probability, from
Equation~(\ref{eq:BucketProbability}) becomes the hyper-volume of
the space swept by the line through $Q'$. By
Lemma~\ref{lem:ConstRunningTimeForTimeInterval}, we can find the
area for a specific time interval defined by consecutive times in $S$
in constant time. The
percentage of sorting buckets, $posb_i$, needed in any time interval
$T_i=[t_i,t_{i+1}]$ with $t_i, t_{i+1} \in S$ within the query time
interval is given by:
1501: \begin{equation}
1502: posb_i = \frac{p(t_{i+1})-p(t_i)}{p(t^])-p(t^[)}
1503: \end{equation}
1504: Let $N$ be the number of sorting buckets. Then, the number of
1505: sorting buckets, $nosb_i$, assigned to interval $i$ is given by:
1506: \begin{equation}
1507: nosb_i = N \cdot posb_i
1508: \end{equation}
1509: If $nosb_i<1$ we can combine it with $nosb_{i+1}$. If the query time
1510: interval is very large, then we may need to include multiple time
1511: intervals from $S$ to get one sorting bucket. Thus, we create more
1512: sorting buckets (with smaller time intervals) in areas where the
1513: expected number of bucket time intervals is large. Conversely, we
1514: create fewer sorting buckets (with larger time intervals) in areas
1515: where the expected number of bucket time intervals is small. Hence
1516: we model each sorting bucket so that its time interval length
1517: directly relates to the percentage of bucket time intervals that are
1518: assigned to it. Thus, we conclude that we will uniformly distribute
1519: the time interval objects across all sorting buckets.
1520: \end{proof}
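
\noindent To illustrate the allocation, suppose (as a hypothetical
setting that is not part of the experiments) that the query time
interval is $[0,3]$, that $L=1$, and that the increments of $p$ over
$[0,1]$, $[1,2]$ and $[2,3]$ are the swept areas $\frac{1}{2}$,
$\frac{1}{4}$ and $\frac{1}{12}$ shown in
Figure~\ref{fig:SlopeDistribution}, so that
$p(t^])-p(t^[)=\frac{5}{6}$. With an assumed $N=10$ sorting buckets we
obtain
\begin{align*}
posb_1 &= \frac{1/2}{5/6} = \frac{3}{5}, & nosb_1 &= 10\cdot\frac{3}{5} = 6,\\
posb_2 &= \frac{1/4}{5/6} = \frac{3}{10}, & nosb_2 &= 10\cdot\frac{3}{10} = 3,\\
posb_3 &= \frac{1/12}{5/6} = \frac{1}{10}, & nosb_3 &= 10\cdot\frac{1}{10} = 1.
\end{align*}
The first unit of time, which is expected to receive $60\%$ of the
bucket time intervals, is therefore split into six sorting buckets of
width $\frac{1}{6}$, while the last unit of time is covered by a single
sorting bucket.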
1521:
1522: \begin{lemma}
1523: \label{lem:BuscketSortConstantTimeInsertion}%
1524: Insertion of any bucket time-interval object $T_O$ into the proper
1525: sorting bucket can be done in $O(1)$ time.
1526: \end{lemma}
1527: \begin{proof}
1528: The distribution of sorting buckets is determined by $k$ time
1529: intervals in Lemma~\ref{lem:TimeIntervalDistribution}. Call these
1530: {\em sorting time interval objects} where each object contains: the
1531: lower bound $l$, the upper bound $u$, the number of sorting buckets
1532: assigned to this interval $b_s$, the length of the time interval for
1533: the sorting bucket $w$ and an array $B_p$ containing pointers to
1534: these sorting buckets. Let $A$ be the array of sorting time interval
1535: objects, and $L$ be the length of each time interval where the time
1536: intervals are as in Lemma~\ref{lem:TimeIntervalDistribution}. Then,
1537: finding the correct sorting bucket for $T_O$ requires two
1538: calculations:
1539: \begin{eqnarray}
1540: SortingTimeInterval &=& A \left[ ~ \left\lfloor \frac{T_O.l}{L} \right\rfloor ~ \right] \\
1541: SortingBucket &=& B_p \left[ ~ \left\lfloor \frac{T_O.l - SortingTimeInterval.l}{w} \right\rfloor ~ \right].
1542: \end{eqnarray}
1543: Each of these calculations requires constant time, hence $T_O$ can
1544: be inserted into the proper sorting bucket in $O(1)$ time.
1545: \end{proof}
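
\noindent As a small worked instance under assumed values, take
$t^[=0$ and $L=1$, and let the array $A$ (indexed from $0$) hold three
sorting time interval objects covering $[0,1)$, $[1,2)$ and $[2,3]$
with $b_s=6$, $3$ and $1$ sorting buckets, hence $w=\frac{1}{6}$,
$\frac{1}{3}$ and $1$. For a bucket time-interval object with
$T_O.l=1.5$ the two calculations give
\begin{align*}
SortingTimeInterval &= A\left[\left\lfloor \frac{1.5}{1} \right\rfloor\right] = A[1], \quad\text{with } l=1 \text{ and } w=\tfrac{1}{3},\\
SortingBucket &= B_p\left[\left\lfloor \frac{1.5-1}{1/3} \right\rfloor\right] = B_p[\lfloor 1.5 \rfloor] = B_p[1],
\end{align*}
that is, the second sorting bucket of the second interval. When
$t^[\neq 0$ the first index would presumably be computed from
$T_O.l-t^[$; this offset is an assumption on our part and is not stated
in the lemma.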
1546:
1547: Using the above two lemmas, we can prove the following.
1548:
1549: \begin{theorem}
1550: \label{th:constanttime} The running time of the {\sc MaxCount}
1551: algorithm is $O(B)$ where $B$ is the number of buckets.
1552: \end{theorem}
1553:
1554: \begin{proof}
1555: Let $H$ be the set of buckets where each bucket $B_i$ contains the
1556: normalized trend function $F_i$. Let $Q_1$ and $Q_2$ be the query
1557: points and $[t^[,t^]]$ be the query time interval. (Lines 01-07):
1558: Calculating the time intervals takes $O(B)$ time because the cross
1559: times for each bucket can be calculated in constant time. (Line 08):
1560: By Lemmas~\ref{lem:TimeIntervalDistribution} and
1561: \ref{lem:BuscketSortConstantTimeInsertion}, we have an approximately
1562: even distribution of time interval objects within the sorting
1563: buckets where we can insert an object in constant time. This result
1564: fulfills the requirements of the \textsc{BucketSort},
1565: \cite{IntroToAlgorithms}, which allows the intervals to be sorted in
1566: $O(B)$ time. (Lines 09-12): Calculate the {\sc MaxCount} and time
1567: for each time interval in constant time using
Lemma~\ref{lem:ConstRunningTimeForTimeInterval}. These lines take
$O(B)$ time because there are $O(B)$ time intervals. Finding the
global {\sc MaxCount} and time requires retaining the maximum count
and its time at Line 11. Returning the {\sc MaxCount} and time takes
1572: $O(1)$ time. Thus, the running time is given by $O(B) + O(B) + O(B)
1573: + O(1) = O(B)$.
1574: \end{proof}
1575:
1576: \subsection{An Exact {\sc MaxCount} Algorithm}\label{sec:ExactMaxCount}
1577:
The {\sc ExactMaxCount} algorithm below finds the exact {\sc MaxCount}
value. Scanning every point takes $O(N)$ time and sorting the resulting
cross times takes $O(n \log n)$ time, so the running time is given by:
1580: \begin{equation}\label{eq:ExactRunningTime}
1581: O(N) + O(n \log n)
1582: \end{equation}
1583: where $N$ is the number of points in the database and $n$ represents
1584: the result size of the query.
1585:
1586: %Considering a tree structure to store points to reduce the first
1587: %term in (\ref{eq:ExactRunningTime}) may result in negligible
1588: %benefits since we expect to examine a significant number of the
1589: %points contained in the database. Even so the worst case running
1590: %time remains $O(N) + O(n \log n)$.
1591:
1592: It is possible to slightly improve the algorithm below. First,
1593: divide the index space into $k$ subspaces and maintain separate
1594: partial databases for each. Assign processes on individual systems
1595: to each database to calculate the {\sc MaxCount} query and return
1596: the time intervals to a central process. Merging the time interval
1597: lists into a global time interval list saves time on the sorting
1598: part of the algorithm. The running time for each of $k$ partial
1599: databases would be close to $O(\frac{n}{k} \log \frac{n}{k})$. This
1600: result is an approximate value because we do not guarantee an even
1601: split between partial databases. Placing buckets for each partial
1602: database in a \textsc{Tree} structure may be reasonable and could
1603: cut down the average running time to $O(\log N + n \log n/k)$.
1604: %For small enough data
1605: %subsets $\log n/k$ may be considered a constant resulting in an
1606: %average running time of $\max (\log N, n)$.
Implementation and analysis of this particular approach are left as
future work.
1609:
1610: \progstart \vspace{-18pt}
1611: \begin{tabbing}
1612: \hspace*{.5in}\=\hspace*{.5in}\=\hspace*{.5in}\=\hspace*{.5in}\=
1613: \kill
1614: {\bf {\sc ExactMaxCount}$(D, Q_1, Q_2, t^[, t^])$} \\
1615: {\bf input:} \>\> $D$ is the database of points. The query is made up of a \\
1616: \>\> hyper-rectangle $Q$ defined by points $Q_1$ and $Q_2$ and the time\\
1617: \>\> interval $T=[t^[, t^]]$ \\
1618: {\bf output:} \>\> The exact {\sc MaxCount} and time at which it occurs. \\ \\
1619: 01. \>$Times \leftarrow \emptyset$ //of \emph{CrossTime} objects \` $O(1)$\\
1620: 02. \>\textbf{for each} \emph{point} $p_i \in D$ \` $O(N)$\\
1621: 03. \> \> \textbf{if} $p_i \in Q$ during $T$ \` $O(1)$\\
1622: 04. \> \> \> $EntryTime \leftarrow CalculateEntryTime(p_i,Q,T)$ \` $O(1)$\\
1623: 05. \> \> \> $ExitTime \leftarrow CalculateExitTime(p_i,Q,T)$ \` $O(1)$\\
1624: 06. \> \> \> \textbf{if} $EntryTime \in Times$ \` $O(1)$\\
1625: 07. \> \> \> \> $Times.$\textsc{get}$(EntryTime).Count$++ \` $O(1)$\\
1626: 08. \> \> \> \textbf{else} \\
09. \> \> \> \> $Times.$\textsc{add}$(\mathbf{new}\ CrossTime(EntryTime))$\` $O(1)$\\
1628: 10. \> \> \> \textbf{end if} \\
1629: 11. \> \> \> \textbf{if} $ExitTime \in Times$ \` $O(1)$\\
1630: 12. \> \> \> \> $Times.$\textsc{get}$(ExitTime).Count$-\,- \` $O(1)$\\
1631: 13. \> \> \> \textbf{else} \\
14. \> \> \> \> $Times.$\textsc{add}$(\mathbf{new}\ CrossTime(ExitTime))$ \` $O(1)$\\
1633: 15. \> \> \> \textbf{end if} \\
1634: 16. \>\textbf{end for} \\
1635: 17. \>\textsc{Sort}$(Times)$ \` $O(n \log n)$\\
1636: 18. \>\textsc{traverse}$(Times,time,Max\textrm{-}Count)$ //tracking time\` $O(N)$\\
1637: \> \> \> \> \qquad \qquad \qquad \quad //and {\sc MaxCount} \\
1638: 19. \>\textbf{return} (time,{\sc MaxCount}) \` $O(1)$
1639: \end{tabbing}
1640: \progend
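
\noindent As a small hypothetical trace (the three points and their times
are invented for illustration), suppose three points are inside $Q$
during $[1,4]$, $[2,5]$ and $[2,3]$, respectively, and that a newly
created $CrossTime$ object starts with count $+1$ for an entry and $-1$
for an exit, which is an assumption not spelled out in the listing.
After the sort in Line 17, the traversal in Line 18 accumulates:
\begin{center}
\begin{tabular}{l|rrrrr}
time & $1$ & $2$ & $3$ & $4$ & $5$ \\ \hline
$Count$ at this time & $+1$ & $+2$ & $-1$ & $-1$ & $-1$ \\
running total & $1$ & $3$ & $2$ & $1$ & $0$
\end{tabular}
\end{center}
The running total peaks at $3$, so the algorithm would return
{\sc MaxCount} $=3$ at time $t=2$.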
1641:
1642:
1643: \section{Threshold Operators}\label{sec:ThresholdOperators}%includes CountRange
1644:
1645: \progstart \vspace{-18pt}
1646: \begin{tabbing}
1647: \hspace*{.35in}\=\hspace*{.3in}\=\hspace*{.3in}\=\hspace*{.3in}\=
1648: \kill
1649: {\bf {\sc ThresholdRange}$(H, Q_1, Q_2, t^[, t^], M)$} \\
{\bf input:} \>\> A set of buckets $H$ built by the index structure presented, \\
              \>\> query points $Q_1(t)$ and $Q_2(t)$, a query time interval $[t^[, t^]]$, \\
              \>\> and the threshold value $M$. \\
1653: {\bf output:} \>\> The estimated set of time intervals where $R$ contains more \\
1654: \>\> than $M$ points.\\
1655: \\
Lines 01--08 are the same as in the {\sc MaxCount} algorithm.\\
1657: 09. \> $TimeIntervals \leftarrow \emptyset$ \`$O(1)$ \\
1658: 10. \> \textbf{for each} $TimeInterval \in TimePartitionOrder$ \`$O(B)$ \\
1659: 11. \> \> $CMaxCount \leftarrow \textsc{calculate}(\textsc{MaxCount}, MaxTime, TimeInterval)$\`$O(1)$ \\
1660: 12. \> \> \textbf{if} $CMaxCount > M$ \`$O(1)$ \\
1661: 13. \> \> \> $TimeIntervals \leftarrow TimeIntervals \bigcup TimeInterval$ \`$O(1)$ \\
1662: 14. \> \> \textbf{end if} \\
1663: 15. \> \textbf{end for} \\
1664: 16. \> $\textsc{Merge}(TimeIntervals)$ \`$O(B)$ \\
1665: 17. \> \textbf{return} $TimeIntervals$
1666: \end{tabbing}
1667: \progend
1668:
1669: The {\sc ThresholdRange} algorithm shown above and described in
1670: Definition~\ref{def:ThresholdRange} relates to {\sc MaxCount} in the
1671: way we calculate the aggregation. We maintain a running count to
1672: find time intervals that exceed the threshold value $M$. If we set
1673: the threshold value near the {\sc MaxCount} value ($M \rightarrow$
1674: {\sc MaxCount}), {\sc ThresholdRange} finds a small interval
containing the {\sc MaxCount}. We demonstrate this in the
experimental results in
Section~\ref{sec:ExperimentalResults}.\smallskip
1678:
1679: The {\sc ThresholdRange} algorithm is the same as {\sc MaxCount} up
1680: to Line 08, and then collects different information from each
$TimeInterval$ starting in Line 10. This leads to the following
theorem.
1683:
1684: \begin{theorem}
1685: \label{th:ThresholdConstantTime}%
1686: The estimated {\sc ThresholdRange} query runs in $O(B)$ time.
1687: \end{theorem}
1688: \begin{proof}
The {\sc ThresholdRange} algorithm differs from the {\sc MaxCount}
algorithm only in Lines 09--17. Lines 11--14 run in $O(1)$ time, and
Line 10 executes them $O(B)$ times. In Line 16,
$\textsc{Merge}(TimeIntervals)$ is a linear walk of the time
intervals that joins adjacent time intervals $T_a$ and $T_b$ when
$T_a \bigcup T_b$ would form a continuous time interval. Joining two
adjacent intervals takes $O(1)$ time, so the walk takes $O(B)$ time.
Hence, we conclude by Theorem~\ref{th:constanttime} that
{\sc ThresholdRange} runs in $O(B)$ time.
1698: \end{proof}
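
\noindent To make the selection and merge steps concrete, consider a
hypothetical time partition of $[0,5]$ into unit intervals whose
estimated counts (invented for illustration) are $2$, $5$, $6$, $3$ and
$7$, with threshold $M=4$:
\begin{center}
\begin{tabular}{l|ccccc}
$TimeInterval$ & $[0,1]$ & $[1,2]$ & $[2,3]$ & $[3,4]$ & $[4,5]$ \\ \hline
$CMaxCount$ & $2$ & $5$ & $6$ & $3$ & $7$ \\
$CMaxCount > M$ & no & yes & yes & no & yes
\end{tabular}
\end{center}
Lines 10--15 collect $\{[1,2],[2,3],[4,5]\}$, and the \textsc{Merge} in
Line 16 joins the two adjacent intervals, so the returned set would be
$\{[1,3],[4,5]\}$.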
1699:
1700: \subsection{Threshold: Sum, Count and Average}
1701:
1702: We give the following three operators based on {\sc ThresholdRange}
1703: and conclude that none of the changes to the algorithm affect the
1704: running time of the {\sc ThresholdRange} algorithm.\medskip
1705:
1706: \noindent {\sc ThresholdCount}: \\
By adding a line between lines 14 and 15 in the {\sc ThresholdRange}
algorithm that counts the merged time intervals, we can return the
count of time intervals during the query time interval where
congestion occurs. This count of time intervals gives a measure of
variation in congestion. That is, if we have many time intervals,
we expect a large number of separate pockets of congestion.
Since {\sc ThresholdCount} does not give information relative to the
entire time interval, it may need to be examined in light of the
total time above the threshold.\medskip
1716:
1717: \noindent {\sc ThresholdSum}: \\
1718: By summing the times instead of using the $\bigcup$ operator in line
1719: 13 of the {\sc ThresholdRange} algorithm, we can return the total
1720: congestion time during the query time interval. This total gives a
1721: measure of the severity of congestion that may be compared to the
1722: length of query time.\medskip
1723:
1724: \noindent {\sc ThresholdAverage}: \\
By adding a line between lines 14 and 15 in the {\sc ThresholdRange}
algorithm that finds the average length of the merged time intervals, we
can return the average length of time each period of congestion lasts.
This average gives a different measure of the severity of each
period of congestion.\medskip
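
\noindent As an illustration with invented values, suppose the merged
intervals above the threshold within the query time interval $[0,5]$
are $[1,3]$ and $[4,5]$. The three operators would then return
\begin{align*}
\textsc{ThresholdCount} &= 2,\\
\textsc{ThresholdSum} &= (3-1)+(5-4) = 3,\\
\textsc{ThresholdAverage} &= \tfrac{3}{2} = 1.5,
\end{align*}
that is, two pockets of congestion covering $3$ of the $5$ time units
and lasting $1.5$ time units on average.\medskip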
1730:
1731: %We could calculate other operators such as the standard deviation of
1732: %the time intervals or many other complicated statistics on the
1733: %distribution of time intervals. However the five operators we define
1734: %mirror the standard aggregation operators available in relational
1735: %databases.
1736:
1737: %???Check this section!!!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1738:
1739: \subsection{Count Range Algorithm}
1740:
1741: The {\sc CountRange} algorithm is an adaptation of {\sc MaxCount} in
1742: that it is the {\sc Count} portion of the {\sc MaxCount} query.
1743: Using the equations for the cases described in
1744: Figure~\ref{fig:cases}, we calculate the {\sc CountRange} as
1745: follows:
1746:
1747: %%%%EDIT
1748:
1749: \begin{figure}[h]
1750: \begin{minipage}{0.49\textwidth}
1751: \centering
1752: \psfrag{Q1}{$Q_1$}
1753: \psfrag{Q2}{$Q_2$}
1754: \psfrag{lq2t1}{$l_{Q_2,t^[}$}
1755: \psfrag{lq2t2}{$l_{Q_2,t^]}$}
1756: \psfrag{lq1t1}{$l_{Q_1,t^[}$}
1757: \psfrag{q1t2}{$l_{Q_1,t^]}$}
1758: \psfrag{x0lvxl}{$(x_{0,l},v_{x,l})$}
1759: \psfrag{x0uvxu}{$(x_{0,u},v_{x,u})$}
1760: \includegraphics[width=.8\textwidth]{figs/CntRngNorm1.eps}\\
1761: \caption{{\sc CountRange} $Q_1$ at $t^{]}$ to $Q_2$ at $t^{[}$.}
1762: \label{fig:CountRangeNormal1}
1763: \end{minipage}
1764: \begin{minipage}{0.49\textwidth}
1765: %\end{figure}
1766: %\begin{figure}[h]
1767: \noindent
1768: \psfrag{Q1}{$Q_1$}
1769: \psfrag{Q2}{$Q_2$}
1770: \psfrag{lq2t1}{$l_{Q_2,t^[}$}
1771: \psfrag{lq2t2}{$l_{Q_2,t^]}$}
1772: \psfrag{lq1t1}{$l_{Q_1,t^[}$}
1773: \psfrag{lq1t2}{$l_{Q_1,t^]}$}
1774: \psfrag{x0lvxl}{$(x_{0,l},v_{x,l})$}
1775: \psfrag{x0uvxu}{$(x_{0,u},v_{x,u})$}
1776: \includegraphics[width=.8\textwidth]{figs/CntRngNorm2.eps}\\
1777: \caption{{\sc CountRange} $Q_1$ at $t^{[}$ to $Q_2$ at $t^{]}$.}
1778: \label{fig:countRangeNormal2}
1779: \end{minipage}
1780: \end{figure}
1781:
1782:
1783: For each bucket we determine if the bucket is completely in or
1784: completely out of the query space. First we find the beginning and
1785: ending time intervals. For each time interval, we get the associated
1786: function $\triangle p$ given in Equation~(\ref{eq:percentofbucket2})
and its components. The components of $\triangle p$ given in
Equation~(\ref{eq:percentofbucket}) define the areas above the lines
through $Q_1$ and $Q_2$ at times $t^[$ and $t^]$.
1790: Figures~\ref{fig:CountRangeNormal1} and \ref{fig:countRangeNormal2}
1791: show these four lines. Figure~\ref{fig:CountRangeNormal1} shows the
1792: shaded area defined by:
1793: \begin{equation}\label{eq:pleft}
1794: \triangle \overleftarrow{p} = p_{Q_2,t^[} - p_{Q_1,t^]}.
1795: \end{equation}
1796: Figure~\ref{fig:countRangeNormal2} shows the shaded area:
1797: \begin{equation}\label{eq:pright}
1798: \triangle \overrightarrow{p} = p_{Q_2,t^]} - p_{Q_1,t^[}.
1799: \end{equation}
1800: If $\triangle \overleftarrow{p}$ or $\triangle \overrightarrow{p}$
1801: for bucket $i$ is equal to the count of the bucket, then bucket $i$
1802: is completely contained in the query. If $\triangle
1803: \overleftarrow{p}$ and $\triangle \overrightarrow{p}$ for bucket $i$
1804: are equal to $0$, then bucket $i$ is not contained in the query. If
1805: neither of these is true, we approximate the count for bucket $i$ as
1806: the $\max (\triangle \overleftarrow{p}, \triangle
1807: \overrightarrow{p})$. That is, we calculate the number of points in
1808: bucket $i$ that contribute to the {\sc CountRange} as:
1809: \begin{equation}\label{eq:CountRangei}
1810: count_i = \left\{
1811: \begin{array}{llr}
1812: b_i & \textrm{ if } & \triangle \overleftarrow{p} = b_i \vee \triangle \overrightarrow{p} = b_i \\
1813: 0 & \textrm{ if } & \triangle \overleftarrow{p}=\triangle \overrightarrow{p} = 0 \\
1814: \max (\triangle \overleftarrow{p}, \triangle \overrightarrow{p}) & & \textrm{ Otherwise}
1815: \end{array}
1816: \right.
1817: \end{equation}
where $b_i$ denotes the count of bucket $i$.
This calculation requires that we keep the single dimension
equations for $Q_1$ and $Q_2$ available and not discard them after
finding $\triangle p$ (see Equation~(\ref{eq:percentofbucket2})).
1821:
1822: Hence, we have the following algorithm for {\sc CountRange}:
1823:
1824: \progstart\vspace{-18pt}
1825: \begin{tabbing}
1826: \hspace*{.5in}\=\hspace*{.5in}\=\hspace*{.5in}\=\hspace*{.5in}\=
1827: \kill
1828: {\bf {\sc CountRange}$(H, Q_1, Q_2, t^[,t^])$} \\
1829: {\bf input:} \>\> A set of buckets $H$ built by the index structure presented, \\
              \>\> query points $Q_1(t)$ and $Q_2(t)$ and a query time interval $[t^[,t^]]$. \\
{\bf output:} \>\> The estimated {\sc CountRange}. \\
\\
1.\> $Count \leftarrow 0$ \` $O(1)$ \\
2.\> \textbf{for each} \emph{bucket} $B_i \in H$ \` $O(B)$ \\
1835: 3.\> \> \textsc{Calculate}($\triangle \overleftarrow{p}, \triangle \overrightarrow{p}$) //using Equations~(\ref{eq:pleft})-(\ref{eq:pright}) \` $O(1)$ \\
1836: 4.\> \> \textsc{Calculate}($count_i$) //using Equation~(\ref{eq:CountRangei}) \` $O(1)$ \\
1837: 5.\> \> $Count \leftarrow Count + count_i$ \` $O(1)$ \\
1838: 6.\> \textbf{end for} \\
1839: 7.\> \textbf{return} $Count$ \` $O(1)$
1840: \end{tabbing}
1841: \progend
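
\noindent As a worked instance of Equation~(\ref{eq:CountRangei}) with
invented numbers, consider three buckets with counts $b_1=40$,
$b_2=25$ and $b_3=10$:
\begin{center}
\begin{tabular}{c|ccc|c}
bucket $i$ & $b_i$ & $\triangle \overleftarrow{p}$ & $\triangle \overrightarrow{p}$ & $count_i$ \\ \hline
$1$ & $40$ & $40$ & $35$ & $40$ \\
$2$ & $25$ & $0$ & $0$ & $0$ \\
$3$ & $10$ & $3$ & $7$ & $7$
\end{tabular}
\end{center}
Bucket $1$ is completely contained in the query, bucket $2$ is outside
it, and bucket $3$ contributes $\max(3,7)=7$, so the loop in Lines 2--6
would return $Count = 40+0+7 = 47$.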
1842:
1843: \begin{theorem}
1844: \label{th:RangeConstantTime} The {\sc CountRange} query runs in
1845: $O(B)$ time.
1846: \end{theorem}
1847: \begin{proof}
1848: Consider two different data structures for our buckets:
1849: \textsc{HashTables} and \textsc{R-trees}. In the case of indexing
1850: using an \textsc{R-tree}, the worst case requires that we examine
1851: all buckets used in generating {\sc CountRange}. It is possible that
1852: this list could include all $B$ buckets giving a worst case of
1853: $O(B)$. In the case of using a \textsc{HashTable}, we must examine
1854: all $B$ buckets. By Lemma \ref{lem:ConstRunningTimeForBucket}, and
1855: because Equations~(\ref{eq:BucketGeneralForm}) and
1856: (\ref{eq:CountRangei}) are calculated in constant time, each bucket
1857: can be examined to determine the count that contributes to the {\sc
1858: CountRange} query in constant time. Therefore, the algorithm runs in
1859: $O(B)$ time.
1860: \end{proof}
1861:
1862: We note that {\sc CountRange} is a simplification of the {\sc
1863: MaxCount} operator in that we do not examine every time interval.
Further, we use a form of $\triangle p$ that differs slightly from
1865: Equation~(\ref{eq:percentofbucket2}) to find the count.
1866:
1867:
1868: \section{Experimental Results}\label{sec:ExperimentalResults}
1869:
We collected data from over $7500$ queries that were selected from a
set of randomly generated queries. The selection process removed
most queries that were similar to one another and kept a set that
represents narrow queries, wide queries, near corner or edge queries,
and queries outside the space contained in the database. Throughout
our experiments, we did not see significant accuracy fluctuation due
to any of these types of queries.
1877:
1878: Each experimental run consists of running all of the queries at
1879: several different decreasing bucket sizes on a single data set. We
1880: made experimental runs against data sets ranging from 10,000 points
1881: to 1,500,000 points\footnote{Threshold aggregation runs go only to 1
1882: million points at which we already achieve acceptable error.}.
1883:
1884: In the following experimental analysis, we measure the percentage
1885: error of the estimation algorithm relative to the exact-count
1886: algorithm as follows:
1887: \begin{equation}\label{eq:RelativeError}
1888: Error_{Relative} = \frac{|Exact~Operator-Estimated~Operator|}{Exact~Operator}
1889: \end{equation}
1890: Equation (\ref{eq:RelativeError}) provides a useful measure if the
1891: query returns a reasonable number of points. Queries that return a
1892: small number of points indicate that we should use the exact method.
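
\noindent For example, with invented values, an exact count of $200$
and an estimate of $190$ give
\begin{equation*}
Error_{Relative} = \frac{|200-190|}{200} = 0.05,
\end{equation*}
whereas an exact count of $2$ and an estimate of $3$ give
$\frac{|2-3|}{2}=0.5$; an absolute miss of a single point already
produces a $50\%$ relative error, which is why the measure is
informative only when the query returns a reasonable number of points.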
1893:
For {\sc ThresholdRange}, we measure the percentage of intervals
given by the exact algorithm not covered by the estimation
algorithm using the operator {\sc UC} for uncovered. That is, {\sc
UC}$(a,b)$ returns the sum of the lengths of intervals in $a$ not
covered by intervals in $b$. We divide the result by the exact
{\sc ThresholdSum} to determine the {\sc ThresholdRange error}:
1900: \begin{equation}\label{eq:ThresholdRangeError}
1901: \texttt{error} =
1902: \frac{\textsc{UC}\left(\textit{Ext. }\textsc{ThresholdRange}, \textit{Est. }\textsc{ThresholdRange}\right)}{\textit{Ext. }\textsc{ThresholdSum}}
1903: \end{equation}
We also measure the percentage of intervals given by the estimation
algorithm not covered by the exact algorithm. We divide the result
by the estimated {\sc ThresholdSum} to determine the {\sc
ThresholdRange excess-error}:
\begin{equation}\label{eq:ThresholdRangeExcessError}
\texttt{excess-error} = \frac{\textsc{UC}\left(\textit{Est. }\textsc{ThresholdRange}, \textit{Ext. }\textsc{ThresholdRange}\right)}{\textit{Est. }\textsc{ThresholdSum}}
\end{equation}
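
\noindent As an illustration with invented intervals, let the exact
{\sc ThresholdRange} be $\{[1,3],[4,5]\}$ and the estimated
{\sc ThresholdRange} be $\{[1.5,3.2],[4,5]\}$, so that the exact and
estimated {\sc ThresholdSum} values are $3$ and $2.7$. The part of the
exact result left uncovered is $[1,1.5]$ and the part of the estimate
lying outside the exact result is $[3,3.2]$, hence
\begin{equation*}
\texttt{error} = \frac{0.5}{3} \approx 0.17,
\qquad
\texttt{excess-error} = \frac{0.2}{2.7} \approx 0.07 .
\end{equation*}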
1911:
We performed all the data runs on an Athlon 2000 with 1 GB of RAM.
1913: During each of the queries the program does not contact the server
1914: tier and, thus, minimizes the impact of running a server on the same
1915: computer. The program pre-loads all data into data structures so
1916: that even the exact algorithms do not contact the server tier.
1917:
1918:
1919: \subsection{Data Generation}
1920:
1921: %\vspace{-.3in}
1922: \begin{figure}[h]
1923: \centering
1924: \begin{minipage}{6in}
1925: \centerline{ \hspace{-1em}
1926: \mbox{\includegraphics[width=2in]{figs/C10P10K_11.eps}}\hspace{-0.1in}
1927: \mbox{\includegraphics[width=2in]{figs/C10P10K_21.eps}}\hspace{-0.1in}
1928: \mbox{\includegraphics[width=2in]{figs/C10P10K_31.eps}}}
1929: \end{minipage}
1930: %\vspace{-.3in}
1931: \caption{$X$-View, $Y$-view and $Z$-view of sample data.}
1932: \label{fig:sampledata}
1933: \end{figure}
1934:
1935: Data for the experiments was randomly generated around several
1936: cluster centers. The $i^{th}$ point generated for the database is
1937: located near a randomly selected cluster at a distance between $0$
1938: and $d$, where $d$ is proportional to $i$. This method is similar to
the Ziggurat~\citep{marsaglia2000zmg} method of generating Gaussian
1940: (or normal) distributions used in the
1941: GSTD~\citep{theodoridis1999gsd} and
1942: G-TERD~\citep{tzouramanis2002gte} spatiotemporal data
1943: generators~\citep{nascimento2003sar}. However, our method does not
1944: generate strictly Gaussian distributions since the distributions may
1945: stretch and compress along an axis. Our goal was to generate a
1946: cluster that represents a source location and velocity that has most
1947: elements starting near a center point and decreasing as one moves to
a boundary for the cluster. This method models source regions where
the objects all head in about the same direction. A secondary goal was
1950: to make certain that clusters were random in size and shape. The
1951: program is also capable of approximating a Zipf distribution used in
1952: \citep{CC02,Revesz20031,TSP03}. However, a single Zipf distribution
does not test the adaptability of our algorithm well. That is, our
algorithm is capable of modeling a Zipf distribution, and as such we
could use a single bucket. Figure~\ref{fig:sampledata} shows a
1956: sample of a data set with points projected onto the three views. The
1957: clusters look even more random, because they can overlay one
1958: another. When one looks at these, they nearly resemble the lights of
1959: a city from the air.
1960:
Like a single Zipf distribution, a randomly generated uniform
distribution is not a good distribution to use for
these types of experiments. Uniform distributions do not test the
ability of the algorithm to adapt. In fact, from earlier experiments
in~\citep{Anderson20061} we have found that using such a
distribution gives great (though meaningless) results. The problem
reduces to a system that is capable of (and biased toward) modeling a
uniform distribution being handed a nearly perfect uniform
distribution to model. Hence these results are neither realistic nor
meaningful.
1970:
1971: \subsection{Parameter Effects}
1972:
1973: The index space ranges from $0$ to $100$ in each dimension. The {\bf
1974: number of points} in the different data sets ranges from $10,000$ to
1975: $1,500,000$. The following parameters were used in creating the
1976: index and finding the {\sc MaxCount}. \medskip
1977:
\noindent{\bf Size of Buckets:} The size of the buckets determines
the number of possible buckets in the index. In the experiments,
buckets divide the space such that there are $5$ to $20$
divisions in each dimension.\footnote{Some {\sc MaxCount} runs
included up to 40 divisions, which increased accuracy, but not enough
to warrant the extra running time.} These divisions equate to bucket
sizes ranging from $5$ to $20$ units wide in each dimension.
Relative to our previous work \citep{Anderson20061}, this algorithm
puts much more space into each bucket, creating bigger buckets.
\medskip
1988:
\noindent{\bf Query Location:} Locating the query near the lower or
upper corners affects relative accuracy because the query returns
very few points. Queries in these regions are not interesting because
they rarely involve many points and represent a query region that
moves away from the points in the database or barely moves at all.
When so few points are returned, the exact algorithms are the
appropriate choice.
1996: \medskip
1997:
\noindent{\bf Query Types:} In~\citep{Anderson20061}, we considered
queries with several different characteristics: dense, sparse, and
Euclidean distance as it related to bucket size. By modeling the
skew in buckets, we minimized the effect of these characteristics to
the point that they did not affect the query error. Queries where
the distance between the query points was small did as
well as wider queries {\em provided they returned a reasonable
number of points}. This result is a clear improvement over previous
work that assumed uniform density within a bucket.\medskip
2007:
\noindent{\bf Cluster Points:} Index space saturation determines the
number of buckets necessary for the index. The number of cluster
points does not appear to affect error as much as the space
saturation does. Further, we do not consider a larger number of
cluster points reasonable, since the index space approaches a uniform
distribution as the number of cluster points increases. Non-uniform
gaps introduce areas that are difficult to model, whereas, as noted
earlier, uniform distributions are not useful test cases. In
our experiments the number of cluster points ranges between 10 and 50. \medskip
2017:
\noindent{\bf Histogram Divisions:} Increasing histogram divisions
to $s>5$ had no effect on the accuracy. This result is not
unexpected because histograms are used to define a trend function
relative to trend functions on other axes. Increasing the histogram
divisions tends to flatten the lines, but normalization also flattens
the trend function while maintaining the relationships between
trends, so the outcome is unchanged. Thus, increasing histogram
divisions only increases the running time without increasing
accuracy.\medskip
2027:
\noindent{\bf Threshold Value:} The threshold value affects
accuracy chiefly at its extremes. Thresholds that are low compared to
the number of points in the database produce accurate estimations, as
expected, and high thresholds follow the same trend.\medskip
2032:
2033:
\noindent{\bf Time Endpoints:} When dealing with either small time
endpoints or small buckets, the method is susceptible to rounding
error. In particular, Equation~(\ref{eq:BucketGeneralForm}) contains
both $t^6$ and $\frac{1}{t^6}$ terms. For very small values, on the
order of $1 \times 10^{-54}$ for 64-bit doubles, these calculations
are extremely sensitive and care must be taken to guard against
rounding error. These errors appeared in two ways: first, as a direct
warning programmed into the solution, and second, as a series of
fairly stable time values for the {\sc MaxCount} followed by
unstable variations as the number of buckets increased. At some
point, smaller bucket sizes increase the likelihood of errors in
both time and count values. Smaller buckets also contain fewer
points, which affects the size of the constants in
Equation~(\ref{eq:BucketGeneralForm}). Hence, as the bucket size
becomes smaller in successive runs, instability in
the time values after a series of stable values indicates that an
accurate {\sc MaxCount} may be found at the previous, larger bucket
size. {\em Throughout our experiments, this condition was an
excellent predictor of an accurate {\sc MaxCount}}.\medskip
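
As a rough illustration of this sensitivity, consider only the
magnitudes involved. The value of $t$ and the two-term expression
below are assumptions chosen for illustration; they are not taken from
Equation~(\ref{eq:BucketGeneralForm}). A 64-bit double carries a
53-bit significand, so its relative precision is about
$2^{-52} \approx 2.2 \times 10^{-16}$. For $t = 10^{-9}$,
\[
t^6 = 10^{-54}, \qquad \frac{1}{t^6} = 10^{54}, \qquad
\frac{t^6}{1/t^6} = t^{12} = 10^{-108} \ll 2^{-52}.
\]
Both values are representable, but in a sum of the form
$a\,t^6 + b/t^6$ with coefficients $a$ and $b$ of comparable size, the
$t^6$ term is far below the precision of the $1/t^6$ term and is lost
entirely when the two are added. Likewise, any rounding error in a
quantity that is multiplied by $1/t^6$ is magnified by a factor of
$10^{54}$.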
2053:
The experiments demonstrated that 6-dimensional space compounds the
problem when creating small buckets. An index with unit
buckets could contain as many as $1\times 10^{12}$
buckets. Clearly this number is unrealistic for common moving-object
applications, where we may be dealing with millions of objects. In
practice, the number of buckets needed to reach acceptable error
levels was between $78{,}000$ and $227{,}000$. These numbers
reflect the ability to reach error levels under $5\%$ and were
roughly related to the saturation of the space by the points: a
higher saturation of the space by points requires a larger number of
buckets. Figure~\ref{fig:BucketsToPointsRatio} shows that we had a roughly
linear increase in the number of buckets for an exponential increase
in the space. This pleasant surprise indicates that for unsaturated
data sets, the exponential explosion of space is manageable.
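
To make the scale concrete, the number of possible buckets can be
worked out from the figures above, assuming a regular grid with $k$
divisions along each of the six index dimensions (the symbol $k$ is
introduced here for illustration only):
\[
k^6\Big|_{k=100} = 10^{12}, \qquad
k^6\Big|_{k=20} = 6.4 \times 10^{7}, \qquad
k^6\Big|_{k=5} = 15{,}625 .
\]
Under this assumption, even the largest reported count of $227{,}000$
buckets is only about $0.35\%$ of the $6.4 \times 10^{7}$ cells of a
20-division grid. This would be consistent with only a small fraction
of the possible buckets being needed for unsaturated data, although
the division count at which each reported figure was obtained is not
stated here.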
2069:
2070: \begin{figure}[htb]
2071: \centering
2072: \includegraphics[width=4.5in]{figs/Points2Buckets.eps}\\
2073: \caption{Ratio of the number of buckets in the index to the width of the space measured in buckets.}
2074: \label{fig:BucketsToPointsRatio}
2075: \end{figure}
2076:
2077: \subsection{Running Time Observations}
2078:
Figure~\ref{fig:runningtime} shows the average ratio of the exact
{\sc MaxCount} running time to the estimated {\sc MaxCount} running
time as a function of the number of points in the database. The
ratio grows nearly exponentially between 10,000 and 1,000,000
points. It levels off beyond 1,000,000 points because the
number of points returned by the queries on 1 million points nearly
equals the number of points returned by the queries on 1.5 million
points. This result matches our running-time analysis of
the exact and estimation algorithms.
2088:
2089: \begin{figure}[h]
2090: \centering
2091: \includegraphics[width=4.5in]{figs/results/CSpeedup.eps}\\%used to be runningtime.eps
2092: \caption{Ratio of exact running time to estimated running time.}
2093: \label{fig:runningtime}
2094: \end{figure}
2095:
A natural question is when to use the exact versus the estimated
methods. In runs with a small number of points that need to be
processed, %returned by a {\sc MaxCount} query,
the exact and estimation methods run about equally fast. However,
when the result size exceeds $40{,}000$ (our
experiments returned sets as large as 331,491), the estimation
algorithms run up to $35$ times faster than the exact algorithms.
Further, we note that the error is less predictable at smaller
result sizes. Hence, for small databases or for queries that return
small result sets, efficiency and accuracy both favor the
exact method. However, for large data sets of 1 million points or
more, the estimation method greatly outperforms the
exact method.
2109:
2110: \subsection{Operator Observations}
2111:
As expected, each operator runs in about the same
time as {\sc MaxCount}. Only the error values differed
across the types of aggregation (e.g.,
overlap error in {\sc ThresholdRange} versus count error in {\sc
MaxCount}). Nevertheless, the results share similarities.
Almost all the figures in this section look like a view of
mountains from a valley. That is what we expected to see, and the
lower and flatter the terrain, the better. Buckets increase from back
to front and point-set sizes increase from left to right.
2121:
2122: \subsection{\sc MaxCount}
2123:
Figure~\ref{fig:RelativeError} shows that increasing the number of
buckets to the indicated values dramatically decreases the {\sc
MaxCount} error. As the number of points increases, we also see a
decrease in the error. Note that for larger buckets (i.e., smaller
values on the ``Buckets per Dimension'' axis), the error decreases
at a slightly faster rate.
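
For reference, one common way to quantify such an error is the
relative error sketched below. The symbols $M_{\mathrm{exact}}$ and
$M_{\mathrm{est}}$ are introduced here for illustration, and the
formula is an assumption about how the plotted error could be
computed rather than a definition taken from the experiments:
\[
\mathrm{error} =
\frac{\left| M_{\mathrm{est}} - M_{\mathrm{exact}} \right|}{M_{\mathrm{exact}}}
\times 100\% .
\]
For example, an exact {\sc MaxCount} of $1{,}000$ and an estimate of
$960$ would give an error of $4\%$.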
2130:
2131: \begin{figure}[h]
2132: \centering
2133: \includegraphics[width=5.5in]{figs/results/CMaxCount.eps}\\
2134: \caption{{\sc MaxCount} error.}
2135: \label{fig:RelativeError}
2136: \end{figure}
2137:
The exact {\sc MaxCount} provided the values against which our
estimation algorithm was tested for accuracy. Since the exact method
does not rely on buckets and has zero error, we note only that on
queries with small result sizes, it performs as well as, or
better than, the estimation algorithm.
2143:
2144: \subsection{\sc ThresholdRange}
2145:
2146: \begin{figure}[h]
2147: \centering
2148: \includegraphics[width=4in]{figs/results/CTRE10.eps}\\
\caption{{\sc ThresholdRange} error, $T=10$.}
2150: \label{fig:TRE10}
2151: \end{figure}
2152:
2153: \begin{figure}[h]
2154: \centering
2155: \includegraphics[width=4in]{figs/results/CTREE10.eps}\\
\caption{{\sc ThresholdRange} excess error, $T=10$.}
2157: \label{fig:TREE10}
2158: \end{figure}
2159:
Figures~\ref{fig:TRE10} and \ref{fig:TREE10} give the {\sc
ThresholdRange} error and {\sc ThresholdRange} excess error,
respectively, for $T=10$. {\sc ThresholdRange} error gives the
percentage of the exact intervals not covered by the estimated
intervals, and {\sc ThresholdRange} excess error gives the percentage of
the estimated intervals not covering the exact intervals. These
figures show that our method acts conservatively, covering more than
is needed. However, at larger point-set sizes, we still achieve under 5\% error.
Figure~\ref{fig:TRE10} shows 0\% error, caused by the point count
staying above the threshold of 10 in data sets containing more than 30,000 points.
Figure~\ref{fig:TREE10} shows that we covered at least 10\% more
time in the query time interval than needed until we reached larger
point sets. Still, we showed improvement with more buckets.
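
The two measures can be written compactly as follows. This is a
sketch under stated assumptions: $E$ denotes the union of the exact
intervals, $A$ the union of the estimated intervals, and $|\cdot|$
total length. These symbols, and the choice of $|A|$ as the
denominator of the excess error, are one reading of the definitions
above and are not taken from the implementation.
\[
\mathrm{error} = \frac{|E \setminus A|}{|E|} \times 100\%,
\qquad
\mathrm{excess\ error} = \frac{|A \setminus E|}{|A|} \times 100\% .
\]
For instance, with $E = [2,6]$ and $A = [3,8]$, the uncovered part of
$E$ has length $1$ out of $4$, giving an error of $25\%$, and the part
of $A$ outside $E$ has length $2$ out of $5$, giving an excess error
of $40\%$.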
2173:
At $T=1000$, we see 0\% error until we reach point sets of 500,000
and greater. Figure~\ref{fig:TRE1000} shows excellent results with
more than 10 buckets per dimension. Figure~\ref{fig:TREE1000} shows
that the excess error also drops to near 0\%.
2178:
2179: \begin{figure}
2180: \centering
2181: \includegraphics[width=4in]{figs/results/CTRE1000.eps}\\
\caption{{\sc ThresholdRange} error, $T=1000$.}
2183: \label{fig:TRE1000}
2184: \end{figure}
2185:
2186: \begin{figure}
2187: \centering
2188: \includegraphics[width=4in]{figs/results/CTREE1000.eps}\\
\caption{{\sc ThresholdRange} excess error, $T=1000$.}
2190: \label{fig:TREE1000}
2191: \end{figure}
2192:
2193: Figures~\ref{fig:TRE100000} and \ref{fig:TREE100000} show what
2194: happens when we find an interval near the {\sc MaxCount} value. The
2195: two figures show the consequences of the estimation intervals being
2196: offset from the exact intervals by small amounts. The error
2197: decreases with more buckets.
2198:
2199: \begin{figure}
2200: \centering
2201: \includegraphics[width=4in]{figs/results/CTRE100000.eps}\\
\caption{{\sc ThresholdRange} error, $T=100{,}000$.}
2203: \label{fig:TRE100000}
2204: \end{figure}
2205:
2206: \begin{figure}
2207: \centering
2208: \includegraphics[width=4in]{figs/results/CTREE100000.eps}\\
\caption{{\sc ThresholdRange} excess error, $T=100{,}000$.}
2210: \label{fig:TREE100000}
2211: \end{figure}
2212:
2213:
2214: \subsection{\sc ThresholdCount}
2215:
This operator is the only operator that does not have relative error
measurements. Instead, we report the average number of intervals by
which the estimation method differs from the exact method. As the
figures show, we differ from the correct number by fewer than two
intervals.
2220:
2221: Figure~\ref{fig:TCE10} shows the average error at $T=10$ where the
2222: errors are small. Figure~\ref{fig:TCE1000} ($T=1000$) looks much
2223: worse, but in reality we are still below 2 intervals off. We also
2224: note that the estimation may split or combine an interval
2225: incorrectly when the intervals are very close together without
2226: greatly affecting the error of other operators. Given this
2227: possibility, the results are excellent.
2228:
2229: \begin{figure}[h]
2230: \centering
2231: \includegraphics[width=4in]{figs/results/CTCE10.eps}\\
\caption{{\sc ThresholdCount} error, $T=10$.}
2233: \label{fig:TCE10}
2234: \end{figure}
2235:
2236: \begin{figure}[h]
2237: \centering
2238: \includegraphics[width=4in]{figs/results/CTCE100.eps}\\
\caption{{\sc ThresholdCount} error, $T=100$.}
2240: \label{fig:TCE1000}
2241: \end{figure}
2242:
2243:
2244:
2245: \subsection{\sc ThresholdSum}
2246:
{\sc ThresholdSum} gives the total time above the threshold $T$. As
Figure~\ref{fig:TSE10} shows, at higher bucket counts we
have excellent error rates at $T=10$. We did not necessarily expect
great results at this threshold level across all data sets, but {\sc
ThresholdSum} gives this result consistently across all of them.
2252:
2253: \begin{figure}[h]
2254: \centering
2255: \includegraphics[width=4in]{figs/results/CTSE10.eps}\\
\caption{{\sc ThresholdSum} error, $T=10$.}
2257: \label{fig:TSE10}
2258: \end{figure}
2259:
2260: We do note that when the threshold approaches {\sc MaxCount}, we see
2261: extremely good accuracy as shown in Figure~\ref{fig:TSE100000}.
2262:
2263: \begin{figure}[h]
2264: \centering
2265: \includegraphics[width=4in]{figs/results/CTSE100000.eps}\\
\caption{{\sc ThresholdSum} error, $T=100{,}000$.}
2267: \label{fig:TSE100000}
2268: \end{figure}
2269:
2270:
2271: \subsection{\sc ThresholdAverage}
2272:
2273: {\sc ThresholdAverage} gives the average length of each time
2274: interval. Figure~\ref{fig:TAE10} shows the now familiar mountains
descending below 5\% error at 20 buckets for $T=10$. The figure also
shows that even though a few of the data sets tended to have good
results at 5 and 10 buckets, these results are not guaranteed in
general. In Figure~\ref{fig:TAE1000}, the error reaches a plateau
2279: below 5\% with only small bumps in the data.
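
The threshold operators are closely related, which the following
sketch makes explicit. Suppose the count stays above the threshold
$T$ on $m$ disjoint time intervals $I_1, \ldots, I_m$ within the query
time interval; the symbols $m$ and $I_j$ are introduced here for
illustration only. Then, following the descriptions given above,
\[
\mbox{\textsc{ThresholdCount}} = m, \qquad
\mbox{\textsc{ThresholdSum}} = \sum_{j=1}^{m} |I_j|, \qquad
\mbox{\textsc{ThresholdAverage}} =
\frac{\mbox{\textsc{ThresholdSum}}}{\mbox{\textsc{ThresholdCount}}} .
\]
For example, the intervals $[1,3]$, $[5,6]$, and $[8,12]$ give a
count of $3$, a sum of $2+1+4=7$, and an average of $7/3 \approx 2.33$
time units.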
2280:
2281: \begin{figure}[h]
2282: \centering
2283: \includegraphics[width=4in]{figs/results/CTAE10.eps}\\
\caption{{\sc ThresholdAverage} error, $T=10$.}
2285: \label{fig:TAE10}
2286: \end{figure}
2287:
2288: \begin{figure}[h]
2289: \centering
2290: \includegraphics[width=4in]{figs/results/CTAE1000.eps}\\
\caption{{\sc ThresholdAverage} error, $T=1000$.}
2292: \label{fig:TAE1000}
2293: \end{figure}
2294:
2295: \subsection{\sc CountRange}
2296:
Other {\sc CountRange} algorithms have achieved error values between
2\% and 3\%. We conjectured that our method could reduce this
error because our method of approximation, although much more
complicated, theoretically adapts to skewed distributions better
than other methods. Figure~\ref{fig:CountRangeError} shows that we
achieved errors under 2\% for 20 buckets across all the data sets,
and in some cases, under 1\%.
2304:
2305: \begin{figure}[h]
2306: \centering
2307: \includegraphics[width=4in]{figs/results/CCountRange.eps}\\
2308: \caption{{\sc CountRange} error.}
2309: \label{fig:CountRangeError}
2310: \end{figure}
2311:
{\sc CountRange} also runs at about the same speed as the threshold
operators due to its similar implementation.
2314:
2315:
Error analyses for all the threshold values are given
in~\cite{Anderson2007D}.
2318:
2319:
2320: \section{Related Work}\label{sec:RelatedWork}
2321:
This section reviews the literature specific to aggregation. Spatial
and spatiotemporal databases have attracted an enormous amount of
interest, and there exists a wide range of literature that is
related to our work only through indexing. For books on the subjects
of spatiotemporal and constraint databases we suggest
\cite{rigaux2001B,revesz2002,77589,S05Book} and
\cite{guting_book05}.
2329:
2330: \subsection{\textsc{MaxCount} and \textsc{CountRange} Aggregation}
2331:
There exist only a few previous algorithms to compute {\sc
MaxCount}~\citep{Revesz20031,Chen20041,Anderson20061}. None of
those previous algorithms supports updates without
rebuilding the index (i.e., they do not provide dynamic updates).
2336:
2337: Previous \emph{approximate} {\sc MaxCount} solutions use indices
2338: from~\cite{APR99} that minimize the skew of point distributions in
2339: the buckets by creating hyper-buckets based on the properties of all
2340: points at index creation time. Updates require the index to be
2341: rebuilt because the buckets depend on the point distribution at a
specific time. In contrast, the probabilistic method we presented
{\em recognizes} point density skew in each bucket and
creates a density distribution to model it. We present the first
2345: efficient and dynamic algorithm for {\sc MaxCount}.
2346: Table~\ref{table:results} compares the results of earlier {\sc
2347: MaxCount} algorithms with our current algorithm where $N$ is the
2348: number of points and $B$ is the number of buckets in the index.
2349:
2350: \begin{table}[ht]
2351: \centering \caption{{\sc MaxCount} aggregation complexity on
2352: linearly moving objects.} \label{table:results}
\begin{tabular}[bt]{|c|c|c|l|l|l|} \hline
{\bf Max.} & {\bf Worst Case} & {\bf Space} & {\bf Exact} & {\bf Static or} & {\bf Reference} \\
{\bf Dim.} & {\bf Time} & & {\bf or Est.} & {\bf Dynamic} & \\ \hline\hline
1 & $O(\log N)$ & $O(N^2)$ & Exact & Static & \cite{Revesz20031} \\ \hline
1 & $O(B \log B)$ & $O(B)$ & Est. & Static & \cite{Chen20041} \\ \hline
2 & $O(B \log B)$ & $O(B)$ & Est. & Static & \cite{Anderson20061} \\ \hline
$d$ & $O(B)$ & $O(B)$ & Est. & Dynamic & \cite{Anderson2007D} \\ \hline
$d$ & $O(N)$ & $O(1)$ & Exact & Dynamic & \cite{Anderson2007D} \\ \hline
2365: \end{tabular}
2366: \end{table}
2367:
2368: To our knowledge, we present the first proposal of these threshold
2369: aggregate operators for moving points: {\sc MaxCount (and MinCount),
2370: ThresholdRange, ThresholdCount, ThresholdSum}, and {\sc
2371: ThresholdAverage}.
2372:
2373: We can modify {\sc Spatiotemporal-Range} algorithms to return the
2374: {\sc CountRange} by counting the objects returned. Several other
2375: algorithms were proposed directly for the {\sc CountRange} problem.
2376: We summarize previous {\sc Spatiotemporal-Range} and {\sc
2377: CountRange} algorithms in Table~\ref{tbl:count}, where $N$ is the
2378: number of moving objects or points in the database, $d$ is the
2379: dimension of the space, and $B$ is the number of buckets. All
2380: algorithms listed are dynamic, which means that they allow
2381: insertions and deletions of moving objects without rebuilding the
2382: index.
2383:
2384: \begin{table}[htb]
2385: \centering \caption{{\sc Range} and {\sc CountRange} aggregation
2386: summary.}
\begin{tabular}{|c|l|l|l|l|} \hline
2388: {\bf Max.}& {\bf Worst Case} & {\bf Worst case} & {\bf Exact } & {\bf Reference} \\ %& {\bf Static or}
2389: {\bf Dim.}& {\bf Time} & {\bf Space} & {\bf or Est.} & \\ \hline\hline %& {\bf Dynamic}
2390: 2 & $O(N^{\frac{3}{4}+\epsilon}+k)$ & $O(N)$ & Exact & \cite{KGT99} \\ % Simplex range method %& Dynamic
2391: 2 & $O(\log_2 N + k)$ & $O(N^2)$\footnotemark & Exact & \\ \hline % time limited method
2392: 2 & $O(N)$ & $O(N)$ & Exact & \cite{PKGT02} \\ \hline % R*-Tree Model %& Dynamic
2393: 3 & $O(N)$ & $O(N)$ & Exact & \cite{SJLL00} \\ \hline % Saltenis Jensen... TPR-Tree Model %& Dynamic
2394: d & $O(N)$ & $O(N)$ & Exact & \cite{PLM01} \\ \hline % R-Tree Model %& Dynamic
2395: d & $O(B^{d-1} \log_{B}^{d} N)$ & $O(\frac{N}{B}\log_{B}^{d-1} N)$ & Exact & \cite{ZGTS03} \\ \hline % also in --> cite{ZTG02} ECDF-B-Tree and BA-Tree (sum,avg,cnt) %& Dynamic
2396: 2 & $O(\log_B N + C)/B$ & $O(N)$ & Est. & \cite{KGT99}\footnotemark \\ \hline %& Dynamic
2397: 2 & $O(B)$ & $O(B)$ & Est. & \cite{CC02} \\ \hline %first S-T selectivity estimation %& Dynamic
2398: d & $O(B)$ & $O(B)$ & Est. & Tao et al. (2003) \\ \hline %\citet{TSP03} \footnotemark \\ \hline %& Dynamic
2399: d & $O(\sqrt{N})$ & $O(N)$ & Est. & \cite{TP05} \\ \hline %aMVRB-tree Tao, Papadias %& Dynamic
2400: d & $O(B)$ & $O(B)$ & Est. & \cite{Anderson2007D} \\ \hline %& Dynamic
2401: \end{tabular}
2402: \label{tbl:count}
2403: \end{table}
2404:
2405: \footnotetext[1]{This is a restricted future time query with
2406: expected $O(N)$ space that becomes quadratic if the restriction is
2407: too far into the future.} \footnotetext[2]{$C=K+K'$, where $K'$ is
2408: the approximation error.} \footnotetext[3]{Although \cite{TSP03}
2409: allow dynamic updates, over time the index must be rebuilt.}
2410:
2411: In all our work we consider time as a continuous variable. Time as a
2412: discrete variable is discussed in both temporal and spatiotemporal
2413: aggregation by \cite{AAE03,TP05} and \cite{BGJ06}. In the discrete
2414: approach, time stamps describe the temporal nature of objects. This
2415: approach is less relevant to our work, but is relevant to many
2416: applications.
2417:
2418: \subsection{Indices and Estimation Techniques}
2419:
2420: There are many ways our work is indirectly related to previous work
2421: on indexing structures and estimation techniques. %Example~\ref{ex:MaxCount} shows that the
2422: {\sc Count} and {\sc Max} aggregation operators have only a titular
2423: relationship to the {\sc MaxCount} aggregation, because one cannot
2424: use the {\sc Count} and {\sc Max} aggregation operators to implement
2425: the {\sc MaxCount} aggregation. Nevertheless, several techniques
2426: used in the {\sc MaxCount} problem are also used in other indices
2427: and algorithms designed for range, max/min, and count queries. We
2428: summarize several of these related techniques next.
2429:
2430: \subsubsection{Indices}
2431:
2432: The index structure of \cite{AAE03} finds the 2-dimensional moving
2433: points contained in a rectangle in $O(\sqrt {N})$ time.
\cite{GKTD05} gave a selectivity estimation with a histogram
structure of overlapping buckets designed to approximate the density
of multi-dimensional data. The algorithm runs in
$O(d|B|)$ time, where $d$ is the number of dimensions and $|B|$ is the
number of buckets. \cite{GKR04} gave a technique for answering
spatiotemporal range, intercept, incidence, and shortest path
queries on objects that move along curves in a planar graph.
\cite{CJNP04,CJSP05} also gave indexing methods that use networks,
such as roads, to predict position and motion changes of objects
that follow roads and characteristics of routes. \cite{PJ05} used
networks to reduce the dimensionality of constrained moving objects
to two-dimensional trajectories and examined the method in terms of
the spatiotemporal range query. \cite{AG051} proposed the MON-Tree
to index moving objects in networks using graphs or route-oriented
networks to answer spatiotemporal range and window queries. They
define window queries as returning the pieces of the object's
movement that intersect the query window. \cite{ZMTGS01} proposed
the {\em multiversion SB-tree} to perform range temporal aggregates
({\sc Sum}, {\sc Count}, and {\sc Avg}) in $O(\log_b n)$ time, where $b$ is
the number of records per block and $n$ is the number of entries in
the database. \cite{TIME05_Revesz} gave efficient rectangle indexing
algorithms based on point dominance to compute counts in $k$
dimensions under the following interpretations:
2457: \begin{enumerate}
2458: \item {\em stabbing} gives the number of objects that contain a point;
2459: \item {\em contain} gives the number of rectangles that contain the query rectangle;
2460: \item {\em overlap} gives the number of rectangles that overlap the query rectangle; and
2461: \item {\em within} gives the number of rectangles within the query space.
2462: \end{enumerate}
2463: These four operators have a running time of $O(\log^k n)$ where $k$
2464: is the number of dimensions and $n$ is the number of points.
2465:
2466: %TIME PARAMETERIZED INDICES
\cite{SJLL00} gave an R$^{\ast}$-tree based indexing technique for
1-, 2-, and 3-dimensional moving objects that supports time-slice
queries (selection queries), window queries, and moving queries.
2470: Window queries return the same information as range queries, but
2471: with a valid time window starting at the current time and continuing
2472: to $t_h$. Window queries may request predictions for range queries
2473: within this window of time. Moving queries, similar to incidence
2474: queries, return the points that are contained within the space
2475: connecting one rectangle at a start time to a second rectangle at an
2476: end time. The proposed time parameterized R-tree (TPR-Tree) search
2477: runs in expected {\em logarithmic time}. Another R$^{\ast}$-tree
2478: extension given by \cite{CR00} forms tighter parametric bounding
2479: boxes than~\cite{SJLL00} and has similar running time. \cite{Tao03}
2480: proposed the TPR$^\ast$-Tree that extends the TPR-Tree with improved
2481: insert and delete algorithms. In the context of a variety of count
2482: queries it performs similarly to previous indices.
2483:
\cite{SPTL04} use time-dependent, updatable histograms to query
counts at specific times including past, present and future.
2486: Recently, \cite{PSJ06} proposed the $R^{PPF}$-tree that indexes
2487: past, present and predictive positions of moving points, and extends
2488: the previous work on TPR-Trees \citep{SJLL00} with a partial
2489: persistence framework. Earlier work by \cite{TayebUW98} adapted the
2490: PMR-quadtree~\citep{77589}, a variant of the quadtree structure, for
2491: indexing moving objects to answer time-slice queries, which they
2492: called instantaneous queries, and infinitely repeated time-slice
2493: queries, called continuous queries. Search performance is similar to
2494: quadtrees and allows searches in $O(\log N)$ time.
2495:
2496: \cite{MSI02} use the sweeping technique from computational geometry
2497: to define a query language to evaluate past, present, and future
2498: positions of moving objects in constraint databases.
2499:
2500: Finally, \cite{HadjieleftheriouKGT03} use an efficient approximation
2501: method to find areas where the density of objects is above a
2502: specific threshold during a specific time interval. This method
2503: comes the closest to the method used in our aggregation operators,
2504: but does not allow for the query to move or change shape over time.
2505: In fact, this method is not applied to counting at all.
2506:
2507: Note that each of these indexing methods that return the moving
2508: points in a query window or rectangle can be easily modified to
2509: return instead the {\em count} of the number of moving points.
2510: However, they may not be easily extended to provide a {\sc MaxCount}
2511: within a changing, moving query space.
2512:
2513:
With a few exceptions, {\sc Count} aggregation is
$O(\log N + d)$ for exact methods and $O(B)$ or better for
estimation methods. The hidden constant in the exact method is the
number of buckets that must be traversed to find the {\sc Count}.
Estimation methods vary in many ways, and asymptotic running time
does not always give a meaningful indication of how large $B$ will be.
2520:
2521: %However since estimation methods aim at being faster than precise
2522: %methods, generally we have that $B < \log N$.\medskip
2523:
2524: \subsubsection{Estimation Techniques}
2525:
2526: Our work is related to several other papers that {\em estimate} the
2527: count aggregate operation on spatiotemporal databases.
2528:
\cite{APR99} gave a selectivity estimation algorithm that estimates
the {\sc Count} of the rectangles that intersect a query rectangle.
\cite{CC02} and \cite{TSP03} proposed
2532: methods that can estimate the {\sc Count} of the moving points in
2533: the plane that intersect a query rectangle. More recently,
2534: \cite{KPGT05} gave a predictive method based on dual
2535: transformations.
2536:
2537: \cite{WolfsonY03} and \cite{TWHC04} gave a method for generating
2538: pseudo trajectories of moving objects. Most of these estimation
2539: algorithms use {\em buckets} as basic building structures of the
2540: index. In extending this idea, we use $2d$-dimensional
2541: hyper-buckets in our algorithms where $d$ is the number of
2542: dimensions in the moving-objects space.
2543:
2544:
2545: \section{Conclusions and Future Work}\label{sec:Conclusions}
2546:
We implemented and compared two new {\sc MaxCount} algorithms. The
estimated {\sc MaxCount} was shown to be fast and accurate while
still allowing fast constant-time updates. No other algorithm has
these features to date. We showed that {\sc ThresholdRange}, {\sc
ThresholdCount}, {\sc ThresholdSum}, {\sc ThresholdAverage}, and
{\sc CountRange} are related to {\sc MaxCount} and can be evaluated
using similar techniques, and that we achieve error values under 5\%
in these operations. We gave an empirical threshold for choosing
between the exact and estimated algorithms. We discussed the issues
related to higher dimensions and noted that all sweeping algorithms
have this problem. We also note that, using our technique, it is
possible to decompose the problem and run it in a multiprocessor or
grid environment where the database is divided into smaller
databases.
2561:
2562: Future work may include decreasing the running time by finding other
2563: techniques because there does not appear to be a clear method for
2564: decreasing the running time of sweeping methods. One could also
2565: consider implementing and comparing these techniques in a grid
2566: computing environment.
2567:
2568: \bibliographystyle{agsm}
2569: \bibliography{h:/unl/BibTex/all}
2570:
2571: \end{document}
2572: