1: \documentclass[11pt]{article}
2: \usepackage{epsfig}
3: \usepackage{amsmath}
4: \usepackage{latexsym}
5: \usepackage{graphicx}
6: \usepackage{amsfonts}
7: \usepackage{amssymb}
8: 
9: \newtheorem{theorem}{Theorem}
10: \newtheorem{acknowledgement}[theorem]{Acknowledgement}
11: \newtheorem{algorithm}[theorem]{Algorithm}
12: \newtheorem{axiom}[theorem]{Axiom}
13: \newtheorem{case}[theorem]{Case}
14: \newtheorem{claim}[theorem]{Claim}
15: \newtheorem{conclusion}[theorem]{Conclusion}
16: \newtheorem{condition}[theorem]{Condition}
17: \newtheorem{conjecture}[theorem]{Conjecture}
18: \newtheorem{corollary}[theorem]{Corollary}
19: \newtheorem{criterion}[theorem]{Criterion}
20: \newtheorem{definition}{Definition}
21: \newtheorem{example}{Example}
22: \newtheorem{exercise}[theorem]{Exercise}
23: \newtheorem{lemma}{Lemma}
24: \newtheorem{notation}[theorem]{Notation}
25: \newtheorem{problem}[theorem]{Problem}
26: \newtheorem{proposition}{Proposition}
27: \newtheorem{remark}[theorem]{Remark}
28: \newtheorem{solution}[theorem]{Solution}
29: \newtheorem{summary}[theorem]{Summary}
30: \newenvironment{proof}[1][Proof]{\noindent\textbf{#1.}
31: }{\hspace*{\fill}\ \rule{0.5em}{0.5em} \vspace{2ex}}
32: 
33: \setlength{\textwidth}{6.5in}
34: \setlength{\textheight}{9in}
35: \hoffset=-0.7in
36: \voffset=-0.8in
37: 
38: % Try to prevent stanky page breaks.
39: \clubpenalty=10000
40: \widowpenalty=10000
41: 
42: \newcommand{\middlebar}[2]{#1 \;\big|\; #2}
43: \newcommand{\set}[2]{\left\{\middlebar{#1}{#2}\right\}}
44: \DeclareMathOperator{\dom}{dom}
45: \DeclareMathOperator{\probop}{P}
46: \newcommand{\prob}[1]{\probop\!\left\{{#1}\right\}}
47: \newcommand{\probgiven}[2]{\probop\!\set{#1}{#2}}
48: \DeclareMathOperator{\expecop}{E}
49: \newcommand{\expec}[1]{\expecop\!\left[{#1}\right]}
50: \newcommand{\expecover}[2]{\expecop_{#1}\!\left[{#2}\right]}
51: \newcommand{\expecgiven}[2]{\expec{\middlebar{#1}{#2}}}
52: \DeclareMathOperator{\varop}{Var}
53: \newcommand{\variance}[1]{\varop\!\left[{#1}\right]}
54: \newcommand{\varover}[2]{\varop_{#1}\!\left[{#2}\right]}
55: \newcommand{\vargiven}[2]{\variance{\middlebar{#1}{#2}}}
56: \newcommand{\card}[1]{\left|{#1}\right|}
57: \newcommand{\abs}[1]{\left|{#1}\right|}
58: \newcommand{\norm}[1]{\left\|{#1}\right\|}
59: \DeclareMathOperator{\expdistrib}{Exp}
60: \DeclareMathOperator{\bindistrib}{Bin}
61: \DeclareMathOperator{\poissondistrib}{Poisson}
62: \DeclareMathOperator{\udistrib}{U}
63: \DeclareMathOperator{\bigO}{O}
64: \DeclareMathOperator{\intrinsic}{I}
65: \DeclareMathOperator{\insertion}{I}
66: \DeclareMathOperator{\deletion}{D}
67: \DeclareMathOperator{\modification}{M}
68: \newcommand{\nhexp}{\expdistrib_s}
69: \newcommand{\expof}{\exp\negthickspace}
70: \newcommand{\intdspace}{\hspace{0.15em}}
71: \newcommand{\tighteq}{\!=\!}
72: \newcommand{\mathtti}[1]{\mbox{\tt \em #1}}
73: \newcommand{\msis}{Department of Management Science and Information Systems}
74: \newcommand{\rutgers}{Rutgers University, Piscataway, NJ 08854 USA}
75: 
76: \begin{document}
77: 
78: 
79: \title{Managing Periodically Updated Data in Relational Databases: \\A Stochastic Modeling Approach}
80: \author{Avigdor Gal\thanks{ Department of Management Science and Information Systems,
81: Rutgers University, Piscataway, NJ 08854 USA, phone: (732) 445-3245, fax:
82: (732) 445-6329, e-mail: \texttt{avigal@rci.rutgers.edu} }
83: \and Jonathan Eckstein\thanks{ Department of Management Science and Information
84: Systems\ and RUTCOR, Rutgers University, Piscataway, NJ 08854 USA, phone:
85: (732) 445-0510, fax: (732) 445-6329, e-mail:
86: \texttt{jeckstei@rutcor.rutgers.edu}}}
87: \date{}
88: \maketitle
89: \begin{abstract}
90: Recent trends in information management involve the periodic transcription of
91: data onto secondary devices in a networked environment, and the proper
92: scheduling of these transcriptions is critical for efficient data management.
93: To assist in the scheduling process, we are interested in modeling
94: \emph{data obsolescence}, that is, the
95: reduction of consistency over time between a relation and its replica. The
96: modeling is based on techniques from the field of stochastic processes, and
97: provides several stochastic models for content evolution in the base relations
98: of a database, taking referential integrity constraints into account. These
99: models are general enough to accommodate most of the common scenarios in
100: databases, including batch insertions and life spans both with and without
101: memory. As an initial ``proof of concept'' of 
102: the applicability of our approach, we validate the insertion portion of
103: our model framework
104: via experiments with real data feeds. We also discuss a set of
105: transcription protocols which make use of the proposed stochastic model.
106: \end{abstract}
107: 
108: 
109: \section{Introduction and motivation}
110: 
111: Recent developments in information management involve the transcription of
112: data onto secondary devices in a networked environment, \emph{e.g.},
113: materialized views in data warehouses and search engines, and replicas in
114: pervasive systems. Data transcription influences the way databases define and
115: maintain consistency. In particular, the networked environment may require
116: periodic (rather than continuous) synchronization between the database and
117: secondary copies, either due to paucity of resources (\emph{e.g.}, low
118: bandwidth or limited night windows) or to the transient characteristics of the
119: connection. Hence, the consistency of the information in secondary copies,
120: with respect to the transcription origin, varies over time and depends on the
121: rate of change of the base data and on the frequency of synchronization.
122: 
123: Systematic approaches to the proper scheduling of transcriptions necessarily
124: involve optimizing a trade-off between the cost of transcribing fresh
125: information and the cost of using obsolescent data. To do so, one must
126: quantify, at least in probabilistic terms, this latter cost, which we call
127: \emph{obsolescence cost} \cite{GAL99c}. This paper aims to provide a
128: comprehensive stochastic framework for quantifying time-dependent data
129: obsolescence in replicas. Suppose we are given a relation $R$, a start time
130: $s\in\Re$, and some later time $f>s$. We denote the extension of a relation
131: $R$ at time $t\in\Re$ by $R(t)$. Starting from a known extension $R(s)$, we
132: are interested in making probabilistic predictions about the contents of the
133: later extension $R(f)$. We also suggest a cost model schema to quantify the
134: difference between $R(s)$ and $R(f)$. Such tools assist in optimizing the
135: synchronization process, as demonstrated in this paper. Our approach is based
136: on techniques from the field of stochastic processes, and provides several
137: stochastic models for content evolution in a relational database, taking
138: referential integrity constraints into account. In particular, we make use of
139: compound nonhomogeneous Poisson models and Markov chains; see for example
140: \cite{ROSS80,ROSS95,TK94}. We use Poisson processes to model the behavior of
141: tuples entering and departing relations, allowing (nonhomogeneous) time-varying behavior ---
142: \emph{e.g.}, more intensive activity during work hours, and less intensive
143: activity after hours and on weekends --- as well as compound (bulk) 
144: insertions, that is,
145: the simultaneous arrival of several tuples. We use Markov chains in a
146: general modeling approach for attribute modifications, allowing the assignment
147: of a new value to an attribute in a tuple to depend on its current value. The
148: approach is general enough to accommodate most of the common scenarios in
149: databases, including batch insertions and both memoryless and
150: time-dependent life spans.
151: 
152: As motivation, consider the following two examples:
153: 
154: \begin{example}
155: [Query optimization]Query optimization relies heavily on estimating the
156: cardinality and value distribution of relations in a database. If these
157: statistics are outdated and inaccurate, even the best query optimizer may
158: formulate poor execution plans. Typically, statistics are updated only
159: periodically, usually at the discretion of the database administrator, using
160: utilities such as DB2's RUNSTATS. Although some research has been devoted to
161: speeding up statistics collection through sampling and wavelet
162: approximations~\cite{HAAS95,MATIAS98}, periodic updates are unavoidable in
163: very large databases such as IBM's Net.Commerce \cite{SHURETY98}, an
164: e-business software package with roughly one hundred relations, or an SAP
165: application, which has more than 8,000 relations and 9,000 indices. Collection
166: of statistics becomes an even more acute problem in database federations
167: \cite{SHETH90}, where the federation members do not always ``volunteer'' their
168: statistics \cite{RASCHID2001} (or their cost models for that matter
169: \cite{ROTH96}), and are unwilling to burden their resources with frequent
170: statistics collection.
171: 
172: In current practice, cardinality or histogram data recorded at time $s$ are
173: used unchanged until the next full analysis of the database at some later time
174: $s^{\prime}>s$. If a query optimization must be performed at some time
175: $f\in(s,s^{\prime})$, the optimizer simply uses the statistics gathered at
176: time $s$, since the time spent recomputing them may overwhelm any benefits of
177: the query optimization. As an alternative, we suggest using a probabilistic
178: estimate of the necessary statistics at time $f$. Use of these techniques
179: might make it possible to increase the interval between statistics-gathering
180: scans, as will be discussed in Example~\ref{ex:qopt2}.\hspace*{\fill}$\Box$
181: \end{example}
182: 
183: \begin{example}
184: [Replication management in distributed databases]\label{ex:replica} We now
185: consider replication management in a distributed database. Since fully
186: synchronous replication management, in which a user is guaranteed access to
187: the most current data, comes at a significant computational cost, most
188: commercial distributed database providers have adopted asynchronous
189: replication management. That is, updates to relation replicas are performed
190: after the original transaction has committed, in accordance with the workload
191: of the machine on which the secondary copy is stored. Asynchronous
192: replicas are also
193: very common in Web applications such as search engines, where Web crawlers
194: sample Web sites periodically, and in \emph{pervasive systems} (\emph{e.g.},
195: Microsoft's Mobile Information
196: Server\footnote{http://www.microsoft.com/servers/miserver/} and Caf\'{e}
197: Central\footnote{http://www.comalex.com/central.htm}). In a pervasive system,
198: a server serves many different users, each with her own unpredictable
199: connectivity schedule and dynamically changing device capabilities. Our
200: modeling techniques would allow client devices to reduce the rate at which
201: they poll the server, saving both server resources and network
202: bandwidth. We demonstrate the usefulness of stochastic modeling in
203: this setting in Sections~\ref{sec:condepup} 
204: and~\ref{sec:exwebcraw}.\hspace*{\fill}$\Box$
205: \end{example}
206: 
207: The novelty of this paper is in developing a formal framework for modeling
208: content evolution in relational databases. The problem of content evolution
209: with respect to materialized views (which may be regarded as a complex form of
210: data transcription) in databases has already been recognized. For example, in
211: \cite{ABITEBOUL99}, the incompleteness of data in views was noted as being a
212: ``dynamic notion since data may be constantly added/removed from the view.''
213: Yet, we believe that there has been no prior formal modeling of the evolution
214: process.\footnote{Other research efforts involve probabilistic database
215: systems (\emph{e.g.}, \cite{LAKSHMANAN97}), but this work is concerned with
216: uncertainty in the stored data, rather than data evolution.} Related research
217: involves the containment property of a materialized view with respect to its
218: base data: a few of the many references in this area include
219: \cite{YANG87,CHAUDHURI95,LEVY95a,ABITEBOUL98,GRUMBACH00}. However, the
220: temporal aspects of content evolution have not been systematically addressed
221: in this work. In \cite{ABITEBOUL98}, for example, the containment
222: relationships between a materialized view $I$ and the ``true'' query result
223: $\mathcal{V}(D)$, taken from a database $D$, can be either $I=\mathcal{V}(D)$
224: or $I\subseteq\mathcal{V}(D)$. The latter relationship represents a situation
225: where the materialized view stores only a subset of the query result.
226: However, taking content evolution into account, it is also possible that
227: $I\supset\mathcal{V}(D)$, if tuples may be deleted from $\mathcal{V}(D)$ and
228: $I$ is updated only periodically. Moreover, modifications to the base data may
229: result in both $I\not \subseteq\mathcal{V}(D)$ and $I\not \supseteq
230: \mathcal{V}(D)$.
231: 
232: Refresh policies for materialized views have been previously discussed in the
233: literature (\emph{e.g.}, \cite{LINDSAY86} and \cite{COLBY96}). Typically,
234: materialized views are refreshed immediately upon updates to the base data, at
235: query time (as in \cite{COLBY96}), or using snapshot databases (as in
236: \cite{LINDSAY86}). The latter approach can produce obsolescent materialized
237: views. A combination of all three approaches appears in \cite{COLBY97}. Our
238: methodology differs in that we do not assume an \emph{a priori} association of
239: a materialized view with a refresh policy, but instead design policies based
240: on their transcription and obsolescence costs.
241: 
242: A preliminary attempt to describe the time dependency of updates in the
243: context of Web management was given in \cite{CHO00}, which suggests a simple
244: homogeneous Poisson process to model the updating of Web pages. We suggest
245: instead a nonhomogeneous compound Poisson model, which is far more flexible,
246: and yet still tractable. In addition, the work in~\cite{CHO00} supposes that
247: transcriptions are performed at uniform time intervals, mainly because
248: ``crawlers cannot guess the best time to visit each site.'' We show in this
249: paper that our model of content evolution gives rise to other, better
250: transcription policies.
251: 
252: In \cite{OLSTON00}, a trade-off mechanism was suggested to decide between the
253: use of a cache or recomputation from base data by using range data, computed
254: at the source. In this framework, an update is ``pushed'' to a replication
255: site whenever updated data falls outside a predetermined interval, or whenever
256: a query requires current data. The former requires the client and the server
257: to be in touch continuously, in case the server needs to track down the
258: client, which is not always realistic (either because the server does not
259: provide such services, or because the overhead for such services undermines
260: the cost-effectiveness of the client). The latter requirement puts the burden
261: of deciding whether to refresh the data on the client, without providing it
262: with any model for the evolution of the base data. We attempt to fill this gap
263: by providing a stochastic model for content evolution, which allows a client
264: to make judicious requests for current data. Other work in related areas
265: (\emph{e.g.}, \cite{ALONSO90,CAREY91a,DELIS98}) has considered
266: various alternatives for pushing updated data from a server to a cache on
267: the client side. Lazy replica-update policies using replication graphs have
268: also been discussed in, for example, \cite{ANDERSON98}. This work, however,
269: does not take the data obsolescence into account, and is primarily concerned
270: with transaction throughput and timely updates, subject to network constraints.
271: 
272: As with models in general, our model is an idealized representation of a
273: process. For the model to be useful, it should support predictions based on tractable
274: analytical calculations, rather than detailed, computationally intensive
275: simulations. Therefore, we restrict our modeling to some of the more basic
276: tools of applied probability theory, specifically those relating to Poisson
277: processes and Markov chains. Texts such as~\cite{ROSS80,TK94} contain the
278: necessary reference material on Markov chains and Poisson processes, and
279: specifically on nonhomogeneous Poisson processes. Poisson processes can model
280: a world where data updates are independent from one another. In databases with
281: widely distributed access, \emph{e.g.}, Web-interfacing databases, such an
282: independence assumption seems plausible, as was verified in \cite{CHO00}.
283: 
284: The rest of the paper is organized as follows: Section \ref{sec:preliminaries}
285: introduces some basic notation. Section~\ref{sec:estcard}
286: provides a content evolution model for insertions and deletions, while
287: Section~\ref{sec:modif} discusses data modifications. We
288: present preliminary results of fitting the insertion model parameters
289: to real data feeds in Section \ref{sec:verify}. A cost model
290: and transcription policies that utilize it follow in Section \ref{costmodel},
291: highlighting the practical impact of the model. Conclusions and topics for
292: further research are provided in Section \ref{sec:conclusion}. 
293: 
294: 
295: \subsection{Notational preliminaries}
296: 
297: \label{sec:preliminaries} In what follows, we denote the sets of attributes and
298: relations in the database by $\mathcal{B}$ and $\mathcal{R}$, respectively.
299: Each $R\in\mathcal{R}$ consists of a set of attributes $\mathcal{A}%
300: (R)\subseteq\mathcal{B}$, and also has a \emph{primary key} $\mathcal{K}(R)$,
301: which is a nonempty subset of $\mathcal{A}(R)$. Each attribute $A\in
302: \mathcal{B}$ has a \emph{domain} $\dom
303: A$, which we assume to be a finite set, and for any subset of attributes
304: $\mathcal{A}=\left\{  A_{1},A_{2},\ldots,A_{k}\right\}  $, we let $\dom
305: {\mathcal{A}}=\dom A_{1}\times\dom A_{2}\times\cdots\times\dom
306: A_{k}$ denote the compound domain of $\mathcal{A}$. We denote by $r.A(t)$ the
307: value of attribute $A$ in tuple $r$ at time $t$, and similarly use
308: $r.\mathcal{A}(t)$ for the value of a compound attribute. For a given time
309: $t$, subset of attributes $\mathcal{A}\subseteq\mathcal{A}(R)$, and value
310: $v=\langle v_{1},v_{2},\ldots,v_{k}\rangle\in\dom{\mathcal{A}}$, we define
311: $R_{\mathcal{A},v}(t)=\left\{  r\in R(t)\;\left|  \;\;(r.A_{1}(t)=v_{1}%
312: )\wedge(r.A_{2}(t)=v_{2})\wedge\ldots\wedge(r.A_{k}(t)=v_{k})\right.
313: \right\}  $. We also define $\hat{R}_{\mathcal{A}}(t)$ to be the
314: \emph{histogram} of values of $\mathcal{A}$ at time $t$, that is, for each
315: value $v\in\dom\mathcal{A}$, $\hat{R}_{\mathcal{A}}(t)$ associates a
316: nonnegative integer $\hat{R}_{\mathcal{A},v}(t)$, which is the cardinality of
317: $R_{\mathcal{A},v}(t)$.\footnote{This vector can be computed exactly and
318: efficiently using indices. Alternatively, in the absence of an index for a
319: given attribute, statistical methods (such as ``probabilistic'' counting
320: \cite{WHANG90}, sampling-based estimators \cite{HAAS95}, and wavelets
321: \cite{MATIAS98}) can be applied.}  This notation, as well as other
322: symbols used throughout the paper, 
323: are also summarized in Table \ref{tab:listssym}.
324: 
325: \renewcommand{\arraystretch}{1.3}
326: 
327: \begin{table}[t] \centering
328: {\scriptsize
329: \begin{tabular}
330: [c]{|l|p{4.5in}|}\hline
331: $s,f$ & Points in time\\\hline
332: $R,S\in\mathcal{R}; R(t); \card{R(s)}$ & Relations; 
333: $R$'s extension at time $t$; its cardinality at time $s$.\\
334: $A\in\mathcal{B}; \mathcal{A} \subseteq \mathcal{B}; \dom A;
335: \dom\mathcal{A}$ & 
336: Attribute; compound attribute; domain of attribute;
337: domain of compound attribute \\
338: $\mathcal{A}(R) \subseteq \mathcal{B}; 
339: \mathcal{K}(R); 
340: \mathcal{C}(R)$ & 
341: Attributes of $R$; primary key of $R$; modifiable attributes of $R$ \\
342: $r; r.A(t); r.\mathcal{A}(t)$ & 
343: Tuple; value of attribute $A$ in $r$ at $t$; 
344: value of compound attribute $\mathcal{A}$ in $r$ at $t$ \\
345: $v\in\dom\mathcal{A}; 
346: R_{\mathcal{A},v}(t); \hat{R}_{\mathcal{A}}(t)$ &
347: Value; set of tuples with $r.\mathcal{A}(t)=v$; histogram of
348: $\mathcal{A}$\\
349: $b(r); d(r)$ & Insertion time of $r$; deletion time of $r$\\
350: $\mathcal{N}\subset\mathcal{B}$ & Set of numeric attributes \\
351: $G; G(R)$ & 
352: Dependency multigraph; dependency sub-multigraph generated by $R$\\
353: $\lambda_{R}(t)\hspace{0.15em};\Lambda_{R}(s,f);B_{R}(s,f)$ & 
354: Insertion rate (intensity); 
355: expected number of insertion events during $(s,f]$;
356: number of insertions during $(s,f]$ \\
357: $\expdistrib_{s}(\phi(\cdot));L_{R,s};L_{R,s}^{\intrinsic}$ & 
358: Nonhomogeneous exponential distribution; 
359: interarrival time;
360: remaining life span
361: \\\hline
362: $\Delta_{R,i}^{+}; \Delta_{i}^{-}$ & 
363: Number of tuples for insertion event $i$;
364: number of tuples for deletion  event $i$ \\
365: $\mu_{R}(t); M_{R}(s,f)$ & 
366: Deletion rate (intensity); expected number of deletion events
367: \\\hline
368: $w(r,S)$ & 
369: Number of tuples in $S$ forcing deletion of $r$ via referential
370: integrity \\
371: $W(R,S,t)$ & 
372: Random variable of $w(r,S)$ over uniform selection of $r\in R$\\
373: $W(R,t)$ &
374: Vector of $W(R,S,t)$ over $S\in G(R)$
375: \\\hline
376: $p_{R}(s,f)$ & 
377: Probability that a tuple in $R$ at time $s$ survives through $f$ \\
378: $\hat{p}_{R}(t,f)$ &
379: Survival probability through $f$ for tuple inserted at $t$
380: \\\hline
381: $\expecop_{r\in R(s)}\!\left[  {\cdot}\right]  $ & 
382: Expectation over uniform random
383: selection of tuples $r\in R(s)$ \\\hline
384: $X_{R}(s,f)$ & 
385: Number of tuples inserted into $R$ during $(s,f]$ \\
386: $Y_{R}(s,f)$ &
387: Number of tuples in $R(s)$ surviving through $f$\\
388: $Y_{R}^{+}(s,f); Y_{R}^{-}(s,f)$ &
389: Surviving tuples that were modified;
390: surviving tuples that were not modified\\\hline
391: $\tau_{v,s}^{R,A}; \gamma_{R,A}(t); \Gamma_{R,A}(s,f)$ & 
392: Remaining time to next modification;
393: modification rate;
394: expected number of modification events \\
395: $\ell_{v}^{R,A}$ & 
396: Relative exit rate\\
397: $P_{u,v}^{R,A}(s,f); q_{u,v}^{R,A}$ & 
398: Transition probability; relative transition rate\\\hline
399: $\Delta A; \delta; \sigma^{2}$ & 
400: Change to a value of $A$ in a random-walk update
401: event; expected value of change; variance of change\\\hline
402: $C_{R,\text{u}}(s,f);C_{R,\text{o}}(s,f);C_{R}(t)$ & 
403: Transcription cost; obsolescence cost; total cost\\
404: $\iota_{r,A}(s,f); \iota_{R,A}(s,f); \iota_{r}(s,f)$ & 
405: Contribution to obsolescence of: $r$ via $A$; $A$; $r$\\
406: $\hat{\iota}_{R,A}^{\modification}(s,f); 
407: \hat{\iota}_{R}^{\deletion}(s,f);
408: \hat{\iota}_{R}^{\medspace\insertion}(s,f)$ & 
409: Expected obsolescence cost
410: due to: modification; deletion; insertion\\
411: $\hat{\iota}_{R,A,u}^{\modification}(s,f)$ & 
412: Expected obsolescence cost due to modification to the value $u$\\
413: $c_{u,v}^{R,A}$ & 
414: Elements of a cost matrix\\\hline
415: \end{tabular}
416: }
417: \caption{List of Symbols.}
418: \label{tab:listssym}
419: \end{table}
420: 
421: \renewcommand{\arraystretch}{1}
422: 
423: 
424: \section{Modeling insertions and deletions}
425: 
426: \label{sec:estcard}This section introduces the stochastic models
427: for insertions and deletions. Section
428: \ref{sec:insertion} discusses insertions, 
429: while deletions are discussed in Section
430: \ref{sec:deletions}. Section \ref{sec:combinedinsertdelete} 
431: combines the effect of
432: insertions and deletions on a relation's cardinality. We conclude 
433: with a discussion of non-exponential life spans in Section
434: \ref{sec:nonexplife}. We defer discussing model
435: validation until Section \ref{sec:verify}.
436: 
437: \subsection{Insertion}
438: \label{sec:insertion}
439: We use a nonhomogeneous Poisson process~\cite{ROSS80,TK94}
440: with instantaneous arrival rate $\lambda_{R}:\Re\rightarrow\lbrack0,\infty)$
441: to model the occurrence of \emph{insertion events} into $R$. That is, the
442: number of insertion events occurring in any interval $(s,f]$ is a Poisson
443: random variable with expected value $\Lambda_{R}(s,f)=\int_{s}^{f}\lambda
444: _{R}(t)\hspace{0.15em}dt.$ A homogeneous Poisson process may be considered as
445: the special case where $\lambda_{R}(t)$ is equal to a constant $\lambda_{R}>0$
446: for all $t$, yielding $\Lambda_{R}(s,f)=\int_{s}^{f}\lambda_{R}(t)\hspace
447: {0.15em}dt=\int_{s}^{f}\lambda_{R}\hspace{0.15em}dt=\lambda_{R}\cdot(f-s)$.
448: 
449: We now consider the interarrival time distribution of the nonhomogeneous
450: Poisson process. \ We first define the nonhomogeneous exponential
451: distribution, as follows:
452: 
453: \begin{definition}
454: [Nonhomogeneous exponential distribution]\label{def:nhexp}
455: Let $\phi:\Re\rightarrow\lbrack0,\infty)$ be an integrable
456: function. Given some $s\in\Re$, a random variable $V$ is said to have a
457: \emph{nonhomogeneous exponential} distribution (denoted by $V\sim
458: \expdistrib_{s}(\phi(\cdot))$) if $V$'s density function is
459: \[
460: p(\tau)=\left\{
461: \begin{array}
462: [c]{ll}%
463: {\displaystyle\phi(s+\tau)\exp\!{\left(  -\!\!\int_{0}^{\tau}\!\!\!\phi
464: (s+u)\hspace{0.15em}du\right)  }}, & \tau\geq0\\
465: 0, & \tau<0.
466: \end{array}
467: \right.
468: \]
469: \end{definition}
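
As a quick consistency check on Definition~\ref{def:nhexp} (a remark added for illustration, using only the density given above), observe that the integrand is the derivative of $-\exp\!\left(-\!\int_{0}^{x}\!\phi(s+u)\hspace{0.15em}du\right)$, so that
\[
\int_{0}^{\tau}p(x)\hspace{0.15em}dx
=1-\exp\!\left(-\!\int_{0}^{\tau}\!\phi(s+u)\hspace{0.15em}du\right),
\qquad\tau\geq0.
\]
Hence $p$ integrates to one over $[0,\infty)$ exactly when $\int_{0}^{\infty}\phi(s+u)\hspace{0.15em}du$ diverges. Under this reading, which is our assumption rather than a statement made in the definition, $\phi$ should be understood as integrable over bounded intervals; if its integral over $[s,\infty)$ were finite, there would be a positive probability that the event never occurs.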
470: 
471: It is worth noting that if $\phi(t)$ is constant, $p(\tau)$ is just the density of a standard
472: exponential distribution. We shall now show that, as with homogeneous Poisson
473: processes, the interarrival time of insertion events is distributed like an
474: exponential random variable, $L_{R,s}$, but with a time-varying density function.
475: 
476: \begin{lemma}
477: \label{lem:interarrival}At any time $s$, the amount of time $L_{R,s}$
478: to the next insertion event is distributed like $\expdistrib_{s}(\lambda
479: _{R}(\cdot))$. The probability of an insertion event occurring during $(s,f]$
480: is $\probop
481: \!\{L_{R,s}<f-s\}=1-e^{-\Lambda_{R}(s,f)}$.
482: \end{lemma}
483: 
484: \begin{proof}
485: \noindent Let $\{N(t),t\geq0\}$ be a nonhomogeneous Poisson process with
486: intensity function $\lambda_{R}(t)$, which implies $\probop\!\left\{
487: {N(f)-N(s)=0}\right\}  =e^{-\Lambda_{R}(s,f)}$. Now, the chance that no new
488: tuple was inserted during $(s,f]$ is the same as the chance that the process
489: $N(\cdot)$ has no arrivals during $(s,f]$, that is, $e^{-\Lambda_{R}(s,f)}$.
490: The chance that a new tuple was inserted during $(s,f]$ is just the complement
491: of the chance of no arrivals, namely,
492: \[
493: \probop\!\left\{  L_{R,s}{<f-s}\right\}  =\probop\!\left\{N(f)-N(s)\geq
494: 1\right\}=1-\probop\!\left\{N(f)-N(s)=0\right\}=1-e^{-\Lambda_{R}(s,f)}.
495: \]
496: Taking the derivative of this expression with respect to $f$ and making a
497: change of variables, the probability density of the time until the next
498: insertion from time $s$ is $p(\tau)=\lambda_{R}(s+\tau)e^{-\Lambda
499: _{R}(s,s+\tau)}$. Thus, $L_{R,s}\sim\expdistrib_{s}(\lambda_{R}(\cdot))$.
500: \end{proof}
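
To make Lemma~\ref{lem:interarrival} concrete, suppose, purely as an assumption for illustration, that the insertion rate is linear on the interval of interest, $\lambda_{R}(t)=a+bt$, where $a$ and $b$ are hypothetical constants chosen so that $\lambda_{R}(t)\geq0$ on $[s,f]$. Then
\[
\Lambda_{R}(s,f)=\int_{s}^{f}(a+bt)\hspace{0.15em}dt
=a(f-s)+\tfrac{b}{2}\left(f^{2}-s^{2}\right),
\]
and the probability of at least one insertion event during $(s,f]$ is
\[
1-\exp\!\left(-a(f-s)-\tfrac{b}{2}\left(f^{2}-s^{2}\right)\right).
\]
Setting $b=0$ recovers the homogeneous case, $1-e^{-a(f-s)}$.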
501: 
502: At insertion event $i$, a random number of tuples $\Delta_{R,i}^{+}$ are
503: inserted, allowing us to model bulk insertions. A \emph{bulk insertion} is the
504: simultaneous arrival of multiple tuples, and may occur because the tuples are
505: related, or because of limitations in the implementation of the server. For
506: example, e-mail servers may process an input stream periodically, resulting in
507: bulk updates of a mailbox. Assuming that the $\{\Delta_{R,i}^{+}\}$ are
508: independent and identically distributed (IID), the stochastic process
509: $\{B_{R}(t),t\geq0\}$ representing the cumulative number of insertions through
510: time $t$ is a \emph{compound Poisson} process (\emph{e.g.}, \cite{ROSS95},
511: pp.~87--88). We let $B_{R}(s,f)$ denote the number of insertions falling into the
512: interval $(s,f]$. The expected number of inserted tuples during $(s,f]$ may be
513: computed via $\expecop\!\left[  {B}_{R}{(s,f)}\right]  =\int_{s}%
514: ^{f}\!\!\lambda_{R}(t)\expecop
515: \!\left[  {\Delta}_{R}^{+}\right]  \hspace{0.15em}dt=\expecop\left[  {\Delta
516: }_{R}^{+}\right]  \int_{s}^{f}\!\!\lambda_{R}(t)\!\hspace{0.15em}%
517: dt=\expecop\!\left[  {\Delta}_{R}^{+}\right]  \Lambda_{R}(s,f).$ 
518: Here, $\Delta_{R}^{+}$ represents a generic random variable distributed like
519: the $\{\Delta_{R,i}^{+}\}$.
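
The same expectation can be obtained by conditioning on the number of insertion events. As a sketch, let $N$ denote the number of insertion events in $(s,f]$ (a symbol introduced here only for this illustration), and assume, as is standard for compound Poisson processes, that the batch sizes are independent of $N$. Then
\[
\expecop\!\left[  B_{R}(s,f)\right]
=\sum_{n=0}^{\infty}\probop\!\left\{  N=n\right\}  \,n\,\expecop\!\left[  \Delta_{R}^{+}\right]
=\expecop\!\left[  N\right]  \expecop\!\left[  \Delta_{R}^{+}\right]
=\Lambda_{R}(s,f)\,\expecop\!\left[  \Delta_{R}^{+}\right]  .
\]
For instance, with hypothetical values of a constant rate of $2$ insertion events per hour, an average batch size of $\expecop\!\left[  \Delta_{R}^{+}\right]  =3$ tuples, and an interval of length $f-s=4$ hours, one would expect $2\cdot4\cdot3=24$ inserted tuples.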
520: 
521: We now consider three simple cases of this model:
522: 
523: \paragraph{General nonhomogeneous Poisson process:}
524: Assume that $\expecop\!\left[  {\Delta_{R}^{+}}\right]  =1$. The expected
525: number of insertions simplifies to $\expecop\!\left[  {B}_{R}{(s,f)}\right]
526: =\expecop
527: \!\left[  {\Delta}_{R}^{+}\right]  \Lambda_{R}(s,f)=1\cdot\Lambda
528: _{R}(s,f)=\Lambda_{R}(s,f)$.
529: 
530: \paragraph{Homogeneous Poisson process:}
531: Assume once more that $\expecop\!\left[  {\Delta_{R}^{+}}\right]  =1$. Assume
532: further that $\lambda_{R}(t)$ is a constant function, that is, $\lambda
533: _{R}(t)=\lambda_{R}$ for all times $t$. In this case, as shown above,
534: $\Lambda_{R}(s,f)$ takes on the simple form of $\lambda_{R}\cdot(f-s)$. Thus,
535: $\expecop\!\left[  {B}_{R}{(s,f)}\right]  =\Lambda_{R}(s,f)=\lambda_{R}%
536: \cdot(f-s)$. The interarrival times are distributed as $\expdistrib
537: (\lambda_{R})$, the exponential distribution with parameter $\lambda_{R}$.
538: 
539: \paragraph{Recurrent piecewise-constant Poisson process:}
540: A simple kind of nonhomogeneous Poisson process can be built out of
541: homogeneous Poisson processes that repeat in a cyclic pattern. Given some
542: length of time $T$, such as one day or one week, suppose that the arrival rate
543: function $\lambda_{R}(t)$ of the recurrent Poisson process repeats every $T$
544: time units, that is, $\lambda_{R}(t)=\lambda_{R}\!\left(  t-T\!\left\lfloor
545: {t}/{T}\right\rfloor \right)  $ for all $t$. Furthermore, the interval
546: $\left[  0,T\right)  $ is partitioned into a finite number of subsets
547: $J_{1},\ldots,J_{K}$, with $\lambda_{R}(t)$ constant throughout each $J_{k}$,
548: $k=1,\ldots,K$. Finally, each $J_{k}$ is in turn composed of a finite number
549: of half-open intervals of the form $[s,f)$. For instance, $T$ might be one
550: day, with $K=24$ and $J_{1}=[0\text{:}00,1\text{:}00),J_{2}=[1\text{:}%
551: 00,2\text{:}00),\ldots,J_{24}=[23\text{:}00,24\text{:}00)$. As another simple
552: example, $T$ might be one week, and $K=2$. The subset $J_{1}$ would consist of
553: a firm's normal hours of operation, say $[9$:$00,18$:$00)$ for each weekday,
554: and $J_{2}=[0,T)\backslash J_{1}$ would denote all ``off-hour'' times.
555: Formalisms like those of~\cite{NIEZETTE92} could also be used to describe such
556: processes in a more structured way. We call this class of Poisson processes
557: \emph{recurrent piecewise-constant}, abbreviated \emph{RPC}.
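
As a worked illustration of the weekly example above, measure time in hours, so that $T=168$, and suppose (these symbols and numbers are hypothetical, introduced only for this sketch) that $\lambda_{R}(t)=\lambda_{1}$ on $J_{1}$ and $\lambda_{R}(t)=\lambda_{2}$ on $J_{2}$. With five weekdays of nine working hours each, $J_{1}$ has total length $45$ and $J_{2}$ has total length $168-45=123$, so the expected number of insertion events over one full period is
\[
\Lambda_{R}(0,T)=45\lambda_{1}+123\lambda_{2}.
\]
For instance, $\lambda_{1}=10$ and $\lambda_{2}=1$ events per hour would give $450+123=573$ expected insertion events per week. More generally, if $\lambda_{k}$ denotes the constant rate on $J_{k}$ and $m_{k}(s,f)$ denotes the total amount of time in $(s,f]$ during which that rate applies, then $\Lambda_{R}(s,f)=\sum_{k=1}^{K}\lambda_{k}\,m_{k}(s,f)$.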
558: 
559: It is worth noting that, in client-server environments, the 
560: insertion model should typically be formed from the client's
561: point of view. Thus, if the server keeps a database from which many
562: clients transcribe data, the insertion model for a given client should
563: include only the part of the database that the client actually transcribes.
564: For example, if a ``road warrior'' is interested only in new orders for the
565: 08904 zip code area, the insertion model for that client should concentrate on
566: that zip code, ignoring orders arriving from other areas.
567: 
568: \subsubsection{The complexity of computing $\Lambda_{R}(s,f)$}
569: 
570: \label{sec:lambdacomplex}$\Lambda_{R}(s,f)$, the Poisson
571: expected value, is computed by integrating the model parameter $\lambda
572: _{R}(t)$ over the interval $[s,f]$. Standard numerical methods allow rapid
573: approximation of this definite integral even if no closed formula is known for
574: the indefinite integral. However, the complexity of this calculation depends
575: on the information-theoretic properties of $\lambda_{R}(t)$~\cite[Section
576: 1]{TRAUB98}.
577: 
578: For our purposes, however, simple models of $\lambda_{R}(t)$ are likely to
579: suffice. For example, if $\lambda_{R}(t)$ is a polynomial of degree $d
580: \geq 0$,
581: %Note -- I have to put d+1 here because d can be zero!  JE
582: the integration can be performed in $\bigO(d+1)$ time. Consider next a
583: piecewise-polynomial Poisson process: the time line is divided into intervals
584: such that, in each time interval, $\lambda_{R}(t)$ can be written as a
585: polynomial. The complexity of calculating $\Lambda_{R}(s,f)$ in this case is
586: $\bigO(n(d+1))$, where $n$ is the number of segments in the time interval $(s,f]$,
587: and $d$ is the highest degree of the $n$ polynomials.
588: 
589: Further suppose that the piecewise-polynomial process is recurrent in a
590: similar manner to the RPC process, that is, given some fixed time interval
591: $T$, $\lambda_{R}(t)=\lambda_{R}\!\left(  t-T\!\left\lfloor {t}/{T}%
592: \right\rfloor \right)  $ for all $t$. Note that the RPC Poisson process is the
593: special case of this model in which $d=0$. If there are $c$ segments in the
594: interval $[0,T]$, then the complexity of calculating $\Lambda_{R}(s,f)$
595: becomes $\bigO(c(d+1))$, regardless of the length of the interval $[s,f]$. This
596: reduction occurs because, for all intervals of the form $[kT,(k+1)T]\subseteq
597: [s,f]$ for which $k$ is an integer, the integral $\int_{kT}^{(k+1)T}%
598: \lambda_{R}(t)\hspace{0.15em} dt$ is equal to $\int_{0}^{T}\lambda
599: _{R}(t)\hspace{0.15em} dt$, which only needs to be calculated once.
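
The following Python fragment is a minimal sketch of this calculation for the RPC case ($d=0$). It is an illustration only, not part of the system described here: the function name \texttt{rpc\_cumulative\_rate} and its parameters (\texttt{period} for $T$, \texttt{breakpoints} for the segment start offsets within $[0,T)$, \texttt{rates} for the constant value of $\lambda_{R}$ on each segment) are assumptions made for the example. The integral over one full period is computed once and reused, so the work per call grows with the number of segments $c$ rather than with the length of $[s,f]$.

```python
import math


def rpc_cumulative_rate(s, f, period, breakpoints, rates):
    """Integrate a recurrent piecewise-constant rate over [s, f].

    breakpoints: sorted segment start offsets in [0, period); the first is 0.
    rates: constant rate on each segment (same length as breakpoints).
    """
    ends = list(breakpoints[1:]) + [period]

    def within_period(x):
        # Integral of the rate over [0, x], for 0 <= x <= period.
        total = 0.0
        for start, end, rate in zip(breakpoints, ends, rates):
            if x <= start:
                break
            total += rate * (min(x, end) - start)
        return total

    # Integral over one whole period: computed once, reused for every cycle.
    full_period = within_period(period)

    def from_zero(t):
        # Integral over [0, t]: whole periods plus the partial remainder.
        k = math.floor(t / period)
        return k * full_period + within_period(t - k * period)

    return from_zero(f) - from_zero(s)
```

For instance, with a 24-hour period and hypothetical rates of $1$, $5$, and $2$ arrivals per hour on $[0,9)$, $[9,18)$, and $[18,24)$, the call \texttt{rpc\_cumulative\_rate(20, 34, 24, [0, 9, 18], [1.0, 5.0, 2.0])} integrates across midnight without enumerating the segments of more than one period.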
600: 
601: In Section \ref{sec:verify}, we demonstrate the usefulness
602: of the RPC model for one specific application. We hypothesize that a recurrent
piecewise-polynomial process of modest degree (for example, $d=3$) will be
604: sufficient to model most systems we are likely to encounter, and so the
605: complexity of computing $\Lambda_{R}(s,f)$ should be very manageable.
606: 
607: \subsection{Deletion}
608: \label{sec:deletions}We allow for two distinct deletion mechanisms. First, we
609: assume individual tuples have their own intrinsic stochastic life spans.
610: Second, we assume that tuples are deleted to satisfy referential integrity
611: constraints when tuples in other relations are deleted. 
612: These two
613: mechanisms are combined in a tuple's overall probability of being deleted. 
614: Let $R$ and $S$ be two
615: relations such that $\mathcal{K}(S)$ is a foreign key of $S$ in $R$. We refer
616: to $S$ as a \emph{primary relation} of $R$. Consider the directed multigraph
617: $G$ whose vertices consist of all relations $R$ in the database, and whose
618: edges are of the form $\langle R,S\rangle$, where $S$ is a primary relation of
619: $R$. The number of edges $\langle R,S\rangle$ is the number of foreign keys of
$S$ in $R$ for which integrity constraints are enforced. We assume that $G$
621: has no directed cycles. Let $G(R)$ denote the subgraph of $G$ consisting of
622: $R$ and all directed paths starting at $R$. We denote the vertices of this
623: subgraph by $S(R)$.
624: 
625: \begin{figure}
626: [ptb]
627: \begin{center}
628: \epsfig{file=multigraph.eps}
629: \caption{A partial multigraph of the case study.}%
630: \label{fig:multigraph}
631: \end{center}
632: \end{figure}
634: 
635: \begin{example}
636: [Referential integrity constraints in Net.Commerce]
637: IBM's Net.Commerce is supported by a DB2 database with
638: about a hundred relations interrelated through foreign keys. For demonstration
639: purposes, consider a sample of seven relations in the Net.Commerce database.
640: Figure \ref{fig:multigraph} is a pictorial
641: representation of the multigraph $G$ of these seven relations. The
642: \texttt{MERCHANT} relation provides data about merchant profiles, the
643: \texttt{SCALE} and \texttt{DISCCALC} relations are for computing price
644: discounts, the \texttt{CATEGORY} and \texttt{CGRYREL} relations assist in
645: categorizing products, and the \texttt{ORDERS} and \texttt{SHIPTO} relations
646: contain information about orders. The six relations, \texttt{SCALE, DISCCALC,
647: CATEGORY,} \texttt{CGRYREL, ORDERS, }and \texttt{SHIPTO} have a foreign key to
648: the \texttt{MERCHANT} relation, through \texttt{MERCHANT}'s primary key
649: (\texttt{MERFNBR}). Integrity constraints are enforced between the
650: \texttt{SCALE} relation and the \texttt{MERCHANT} relation, as long as
651: \texttt{SCALE.SCLMENBR} (the foreign key to \texttt{MERCHANT.MERFNBR}) does
not have the value \texttt{NULL}. That is, unless a \texttt{NULL} value is
assigned to the \texttt{SCALE.SCLMENBR} attribute, a deletion of a tuple
654: in \texttt{MERCHANT} results in a deletion of all tuples in \texttt{SCALE}
655: such that \texttt{SCALE.SCLMENBR$\,=\,$MERCHANT.MERFNBR}. \texttt{DISCCALC}
656: has a foreign key to the \texttt{SCALE} relation, through \texttt{SCALE}'s
657: primary key (\texttt{SCLRFNBR}). There are two attributes of \texttt{CGRYREL}
658: that serve as foreign keys to the \texttt{CATEGORY} relation, through
659: \texttt{CATEGORY}'s primary key (\texttt{CGRFNBR}). Finally, \texttt{SHIPTO}
660: contains shipment information of each product in an order, and therefore it
661: has a foreign key to \texttt{ORDERS} through its primary key (\texttt{ORFNBR}%
662: ). \hspace*{\fill}$\Box$
663: \end{example}
664: 
665: With regard to intrinsic deletions within a relation, we assume that each
666: tuple $r\in R(s)$ has a stochastic remaining life span $L_{R,s}^{\intrinsic}$.
667: This random variable is identically distributed for each $r\in R(s)$, and is
668: independent of the remaining life span of any other tuple and of $r$'s age at
669: time $s$ (see Section \ref{sec:nonexplife} for a
670: discussion of tuples with a non-memoryless life span). Specifically, we will
671: assume that the chance of $r\in R(t)$ being deleted in the time interval
672: $[t,t+\Delta t]$ approaches $\mu_{R}(t)\Delta t$ as $\Delta t\rightarrow0$,
673: for some function $\mu_{R}:\Re\rightarrow\lbrack0,\infty)$. We define
674: $M_{R}(s,f)=\int_{s}^{f}\mu_{R}(t)\hspace{0.15em}dt$.
675: 
676: \begin{lemma}
677: $L_{R,s}^{\intrinsic}\thicksim\expdistrib_{s}(\mu_{R}(\cdot))$. The
678: probability that a tuple $r\in R(s)$ is deleted by time $f$, given that no
679: corresponding tuple in $S(R)\backslash\{R\}$ is deleted, is $\probop
680: \!\{{L_{R,s}^{\intrinsic}<f-s\}=}1-e^{-M_{R}(s,f)}$.
681: \end{lemma}
682: 
683: \begin{proof}
684: \noindent Let $r\in R(s)$ be a randomly chosen tuple, and assume that no
685: corresponding tuple to $r$ in $S(R)\backslash\{R\}$ is deleted. The proof is
686: identical to that of Lemma \ref{lem:interarrival}, replacing
687: $\lambda_{R}(t)$ with $\mu_{R}(t)$ and $\Lambda_{R}(s,f)$ by $M_{R}(s,f)$.
688: \end{proof}
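
As a quick numerical illustration of the lemma (the figures are hypothetical and chosen only for convenience), suppose the intrinsic deletion rate is constant, $\mu_{R}(t)=0.1$ per day, and $f-s=7$ days. Then
\[
M_{R}(s,f)=0.1\cdot7=0.7,\qquad\probop\!\left\{  L_{R,s}^{\intrinsic}<f-s\right\}  =1-e^{-0.7}\approx0.503,
\]
so roughly half of the tuples present at time $s$ would be expected to be intrinsically deleted within a week, provided no corresponding tuple in $S(R)\backslash\{R\}$ is deleted.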
689: 
690: \subsubsection{Deletion and referential integrity}
691: For any $r\in R(s)$ and any relation $S\in S(R)$, we define $w(r,S)$ to be the
692: number of tuples in $S$ whose deletion would force deletion of $r$ in order to
693: maintain referential integrity. This value can be between $0$ and the number
694: of paths from $R$ to $S$ in $G(R)$. For example, if 
695: $r\in R=\mathtt{CGRYREL}$ of
696: Figure \ref{fig:multigraph}, then $0\leq
697: w(r,\mathtt{CATEGORY})\leq2$ and $0\leq w(r,\mathtt{MERCHANT})\leq3$. For
698: completeness, we define $w(r,R)=1$. Each tuple in $S$ has an independent
699: remaining lifetime distributed as $\expdistrib
700: _{s}(\mu_{S}(\cdot))$, and if any of the $w(r,S)$ tuples corresponding to $r$
701: is deleted, then $r$ must be immediately deleted, to maintain referential integrity constraints. We use $p_{R}(s,f)$ to
702: denote the probability that a randomly chosen tuple in $R(s)$ survives until
703: time $f$.
704: 
705: \begin{lemma}
706: \label{prop:rawdelete} $p_{R}(s,f)=\expecop_{r\in R(s)}\!\!\left[
707: \exp\!\left(  -\!\sum_{S\in S(R)}\!w(r,S)M_{S}(s,f)\right)  \right]  $, where
708: $\expecop
709: _{r\in R(s)}\!\left[  {\cdot}\right]  $ denotes expectation over random
710: selection of tuples in $R(s)$.
711: \end{lemma}
712: 
713: \begin{proof}
714: \noindent Considering all $S\in S(R)$, and using the well-known fact that if
715: $L_{i}\sim\expdistrib
716: _{s}(\mu_{i}(\cdot))$ for $i=1,\ldots,k$ are independent, then
717: \begin{equation}
718: \min\!\left\{  L_{1},\ldots,L_{k}\right\}  \sim\expdistrib_{s}\!\left(
\sum_{i=1}^{k}\mu_{i}(\cdot)\right)  ,\label{eq:combineexp}%
720: \end{equation}
721: we conclude that the remaining lifetime of $r$ (denoted $L_{R,s}$) has a
722: nonhomogeneous exponential distribution with intensity function $\sum_{S\in
723: S(R)}w(r,S)\mu_{S}(\cdot)$. The probability of a given tuple $r\in R(s)$
724: surviving through time $f$ is thus
725: \[
726: \exp\negthickspace\left(  -\int_{s}^{f}\!\!\left(  \sum_{S\in S(R)}%
727: \!\!\!w(r,S)\mu_{S}(t)\right)  dt\right)  =\exp\negthickspace
728: \left(  -\!\!\!\sum_{S\in S(R)}\!\!\!w(r,S)M_{S}(s,f)\right)  ,
729: \]
730: and the probability that a randomly chosen tuple in $R(s)$ survives until time
731: $f$ is therefore
732: \begin{equation}
733: p_{R}(s,f)=\expecop_{r\in R(s)}\!\!\left[  \exp\!\!\left(  -\!\!\!\sum_{S\in
734: S(R)}\!\!\!w(r,S)M_{S}(s,f)\right)  \right]  .\label{eq:prdef}%
735: \end{equation}
736: \hspace*{\fill}\hspace*{\fill}
737: \end{proof}
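
The expectation in Lemma~\ref{prop:rawdelete} can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an implementation taken from any system: the function name \texttt{survival\_probability} and the data layout (one dictionary of multiplicities $w(r,S)$ per tuple, and one dictionary of cumulative rates $M_{S}(s,f)$ keyed by relation name) are hypothetical. The same function could be applied to a sample of $R(s)$ rather than to every tuple.

```python
import math


def survival_probability(multiplicities, cumulative_rates):
    """Average exp(-sum_S w(r,S) * M_S(s,f)) over the given tuples.

    multiplicities: one dict per tuple r, mapping relation name -> w(r,S).
    cumulative_rates: dict mapping relation name -> M_S(s,f).
    """
    total = 0.0
    for w in multiplicities:
        # Combined cumulative deletion intensity acting on this tuple.
        exponent = sum(count * cumulative_rates[name]
                       for name, count in w.items())
        total += math.exp(-exponent)
    return total / len(multiplicities)
```

When every tuple carries the same multiplicities, the average collapses to a single exponential, which is the situation treated next.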
738: 
739: The complexity analysis of integrating $\mu_{S}(t)$ over time to
740: obtain $M_S(s,f)$
741: is similar to
742: that of Section \ref{sec:lambdacomplex}. However, the
743: computation required by Lemma \ref{prop:rawdelete} may be prohibitive, in the
744: most general case, because it requires knowing the empirical distribution of
745: the $w(r,S)$ over all $r\in R(s)$ for all $S\in S(R)$. This empirical
distribution can be computed accurately by counting, for each tuple upon
insertion, the number of tuples in any $S\in S(R)$ with a comparable foreign
key, using either histograms or direct queries against the database. Maintaining
749: this information requires $\bigO(\left|  {R(s)}\right|  \left|  {S(R)}\right|  )$
750: space. This complexity can be reduced using a manageably-sized sample from
751: $R(s)$. Our initial analysis of real-world applications, however, indicates
752: that in many cases, $w(r,S)$ takes on a much simpler form, in which $w(r,S)$
753: is identical for all $r\in R(s)$. We term such a typical relationship between
754: $R$ and $S\in S(R)$ a \emph{fixed multiplicity}, as defined next:
755: 
756: \begin{definition}
757: The pair $\langle R,S\rangle$, where $S\in S(R)$, has \emph{fixed
758: multiplicity} if $w(r,S)$ is identical for all tuples in $R$. In this case, we
759: denote its common value by $w(R,S)$. \hspace*{\fill}$\Box$
760: \end{definition}
761: 
762: \begin{example}
[Fixed multiplicities in Net.Commerce] Consider the example
multigraph of
Figure \ref{fig:multigraph}. Both \texttt{DISCCALC} and
\texttt{SCALE} reference \texttt{MERCHANT}. It is clear that the discount
calculation of a product (as stored in \texttt{DISCCALC}) cannot reference a
different merchant than \texttt{SCALE}. The only exception is when the foreign
key in \texttt{SCALE} is assigned a null value. Even in this case,
however, there is only a single tuple in \texttt{MERCHANT} whose deletion
requires the deletion of a tuple in \texttt{DISCCALC}. Thus, for any tuple
$r\in$~\texttt{DISCCALC}, $w(r,\mbox{\tt\em SCALE})=w(r,\mbox{\tt\em
MERCHANT})=1$ and therefore $\langle\mbox{\tt\em DISCCALC},\mbox{\tt\em
SCALE}\rangle$ and $\langle\mbox{\tt\em DISCCALC},\mbox{\tt\em MERCHANT}%
\rangle$ both have fixed multiplicity of $1$. Now consider \texttt{CGRYREL}.
776: Since each tuple in \texttt{CGRYREL} describes the relationship between a
777: category and a subcategory, it is clear that its two foreign keys to
778: \texttt{CATEGORY} must always have distinct values. Thus, $\langle\mbox{\tt\em
779: CGRYREL},\mbox{\tt\em CATEGORY}\rangle$ has a fixed multiplicity, and
780: $w(\mbox{\tt\em CGRYREL},\mbox{\tt\em CATEGORY})=2$.\hspace*{\fill}$\Box$
781: \end{example}
782: 
783: As the following lemma shows, fixed multiplicities permit great simplification
784: in computing $p_{R}(s,f)$.
785: 
786: \begin{lemma}
787: \label{prop:fixed} If $\langle R,S\rangle$ has fixed multiplicity for all
788: $S\in S(R)$, $p_{R}(s,f)=\exp(-\widetilde{M}_{R}(s,f))$, where $\widetilde
789: {M}_{R}(s,f)=\int_{s}^{f}\tilde{\mu}_{R}(t)\hspace{0.15em}dt$ and $\tilde{\mu
790: }_{R}(t)=\!\!\sum_{S\in S(R)}w(R,S)\mu_{S}(t)$.
791: \end{lemma}
792: 
793: \begin{proof}%
794: \begin{align*}
795: p_{R}(s,f) &  =\expecop_{r\in R(s)}\!\!\left[  \exp\!\!\left(  -\!\!\!\sum
796: _{S\in S(R)}\!\!\!w(r,S)M_{S}(s,f)\right)  \right]  \\
797: &  =\expecop_{r\in R(s)}\!\!\left[  \exp\negthickspace\left(  -\int_{s}%
798: ^{f}\!\!\left(  \sum_{S\in S(R)}\!\!\!w(r,S)\mu_{S}(t)\right)  dt\right)
799: \right]  \\
800: &  =\expecop_{r\in R(s)}\!\!\left[  \exp\negthickspace\left(  -\int_{s}%
801: ^{f}\!\!\left(  \sum_{S\in S(R)}\!\!\!w(R,S)\mu_{S}(t)\right)  dt\right)
802: \right]  \\
803: &  =\exp\negthickspace\left(  -\!\!\!\sum_{S\in S(R)}\!\!\!\left(
804: w(R,S)\int_{s}^{f}\!\!\mu_{S}(t)\hspace{0.15em}dt\right)  \right)  \\
805: &  =\exp\negthickspace{\left( - \!\!\! \sum_{S\in S(R)}%
806: \!\!\! w(R,S)M_S(s,f)\right)}.
807: \end{align*}
808: \end{proof}
809: 
Since $w(R,S)$ is fixed and constant over time, no additional statistics need
to be collected for it. Finally, we note that in certain
situations another alternative may be available. Let $\{N_{R}%
(t),t\geq0\}$ be a nonhomogeneous Poisson process with intensity function
$\hat{\mu}_{R}(t)$, modeling the occurrence of \emph{deletion events} in $R$.
At deletion event $i$, a random number $\Delta_{i}^{-}$ of tuples is deleted
from $R$. Generally speaking, this kind of model cannot be exact, since it
817: ignores that each deletion causes a reduction in the number of remaining
818: tuples, and thus presumably a change in the spacing of subsequent deletion
819: events. However, it may be reasonably accurate for large databases with either
820: a stable or steadily growing number of tuples, or whenever the time interval
821: $(s,f]$ is sufficiently small. Statistical analysis of the database log would
822: be required to say whether the model is applicable. 
823: If the model is valid, then the
824: stochastic process $\{D_{R}(t),t\geq0\}$ representing the cumulative number of
825: deletions through time $t$, can be taken to be a compound Poisson process. The
826: expected number of deleted tuples during $(s,f]$ may be computed via
827: \[
\expecop\!\left[  {D_{R}(f)-D_{R}(s)}\right]  =\int_{s}^{f}\!\!\hat{\mu}_{R}(t)\expecop
\!\left[  {\Delta}^{-}\right]  dt=\expecop\!\left[  {\Delta}%
^{-}\right]  \int_{s}^{f}\!\!\hat{\mu}_{R}(t)\hspace{0.15em}dt,
831: \]
832: where $\Delta^{-}$ is a generic random variable distributed like the
833: $\{\Delta_{i}^{-}\}$.\hspace*{\fill}
834: 
835: \subsection{Tuple survival: the combined effect of insertions and deletions}
836: \label{sec:combinedinsertdelete} Some tuples inserted during $(s,f]$ may be
837: deleted by time $f$. Let the random variable $X_{R}(s,f)$ denote the number of
838: tuples inserted during the interval $(s,f]$ that survive through time $f$.
839: Consider any tuple inserted into $R$ at time $t\in(s,f]$, and denote its
840: chance of surviving through time $f$ by $\hat{p}_{R}(t,f)$. For any $S\in
841: S(R)$ and $t\in(s,f]$, let $W(R,S,t)$ be a random variable denoting the value
842: of $w(r,S)$, given that $r$ was inserted into $R$ at time $t$. Let $W(R,t)$
843: denote the random vector, of length $\left|  {S(R)}\right|  $, formed by
844: concatenating the $W(R,S,t)$ for all $S\in S(R)$.
845: 
846: \begin{lemma}
847: \label{prop:insertsurvival} $\hat{p}_{R}(t,f)=\expecop_{W(R,t)}\!\left[
848: {\exp\!\left(  -\sum_{S\in S(R)}W(R,S,t)M_{S}(t,f)\right)  }\right]  $. When
849: $\langle R,S\rangle$ has fixed multiplicity for all $S\in S(R)$, then $\hat
850: {p}_{R}(t,f)=p_{R}(t,f)=\exp(-\widetilde{M}_{R}(t,f))$.
851: \end{lemma}
852: 
853: \begin{proof}
854: \noindent Let $L_{R,t}$ denote the lifetime of a tuple inserted into $R$ at
855: time $t$. Similarly to the proof of Lemma \ref{prop:rawdelete}, we know that
856: $L_{R,t}\thicksim\expdistrib_{t}(\sum_{S\in S(R)}W(R,S,t)\mu_{S}(\cdot))$. The
857: probability such a tuple survives through time $f$ is the random quantity
858: \[
859: \exp\negthickspace\left(  -\int_{t}^{f}\!\!\left(  \sum_{S\in S(R)}%
860: \!\!\!W(R,S,t)\mu_{S}(\tau)\right)  \hspace{0.15em}d\tau\right)
861: =\exp\negthickspace
862: \left(  -\!\!\!\sum_{S\in S(R)}\!\!\!W(R,S,t)M_{S}(t,f)\right)  .
863: \]
864: Considering all the possible elements of the vector $W(R,t)$, we then obtain
865: \[
866: \hat{p}_{R}(t,f)=\expecop_{W(R,t)}\!\!\left[  \exp\!\!\left(  \!-\!\!\!\!\sum
_{S\in S(R)}\!\!\!W(R,S,t)M_{S}(t,f)\right)  \right]  .
868: \]
869: 
870: Assume now that $\langle R,S\rangle$ has fixed multiplicity for all $S\in
871: S(R)$. Consequently, we replace $W(R,S,t)$ with $w(R,S)$. Drawing on the proof
872: of the previous lemma,
873: \begin{align*}
874: \hat{p}_{R}(t,f) &  =\expecop_{W(R,t)}\!\!\left[  \exp\!\!\left(
875: \!-\!\!\!\!\sum_{S\in S(R)}\!\!\!w(R,S)M_{S}(t,f)\right)  \right]  \\
876: &  =\exp\negthickspace\left(  -\int_{t}^{f}\!\!\left(  \sum_{S\in
877: S(R)}\!\!\!w(R,S)\mu_{S}(\tau)\right)  \hspace{0.15em}d\tau\right)  \\
878: &  =p_{R}(t,f)\text{.}%
879: \end{align*}
880: \hspace*{\fill}
881: \end{proof}
882: 
883: \noindent The following proposition establishes the formula for the expected
884: value of ${X_{R}(s,f)}$.
885: 
886: \begin{proposition}
887: $\expecop\!\left[  {X_{R}(s,f)}\right]  =\widetilde{\Lambda}_{R}%
888: (s,f)\expecop\!\left[  {\Delta_{R}^{+}}\right]  $, where $\widetilde{\Lambda
889: }_{R}(s,f)=\int_{s}^{f}\lambda_{R}(t)\hat{p}_{R}(t,f)\hspace{0.15em}dt$. In
890: the simple case where each insertion involves exactly one tuple,
891: $X_{R}(s,f)\sim\poissondistrib(\widetilde{\Lambda}_{R}(s,f))$.
892: \end{proposition}
893: 
894: \begin{proof}
895: \noindent Let $N$ be the number of insertion events in $(s,f]$, and let their
896: times be $\{T_{1},T_{2},\ldots,T_{N}\}$. Suppose that $N=n$ and that insertion
897: event $i$ happens at time $t_{i}\in(s,f]$. Event $i$ inserts a random number
898: of tuples $\Delta_{R,i}^{+}$, each of which has probability $\hat{p}_{R}%
899: (t_{i},f)$ of surviving through time $f$. Therefore, the expected number of
900: tuples surviving through $f$ from insertion event $i$ is $\expecop\!\left[
901: {\Delta_{R}^{+}}\right]  \hat{p}_{R}(t_{i},f)$. Consequently,
902: \[
903: \expecop\!\left[  {X_{R}(s,f)\;\big|\;N=n,T_{1}=t_{1},T_{2}=t_{2},\ldots
904: ,T_{n}=t_{n}}\right]  =\expecop\!\left[  {\Delta_{R}^{+}}\right]  \sum
905: _{i=1}^{n}\hat{p}_{R}(t_{i},f).
906: \]
907: Next, we recall, given that $N=n$, that the times $T_{i}$ of the insertion
908: events are distributed like $n$ independent random variables with probability
909: density function $\lambda_{R}(t)/\Lambda_{R}(s,f)$ on the interval $(s,f]$.
910: Thus,
911: \begin{align*}
912: \expecop\!\left[  {X_{R}(s,f)\;\big|\;N=n}\right]   &  =\expecop_{T_{1}%
913: ,\ldots,T_{n}}\!\left[  {\expecop\!\left[  {\Delta_{R}^{+}}\right]  \sum
914: _{i=1}^{n}\hat{p}_{R}(T_{i},f)}\right]  \\
915: &  =\expecop\!\left[  {\Delta_{R}^{+}}\right]  \sum_{i=1}^{n}\left(  \int
916: _{s}^{f}\!\!\hat{p}_{R}(t,f)\frac{\lambda_{R}(t)}{\Lambda_{R}(s,f)}%
917: \hspace{0.15em}dt\right)  \\
918: &  =n\left(  \frac{\expecop\!\left[  {\Delta_{R}^{+}}\right]  \widetilde
919: {\Lambda}_{R}(s,f)}{\Lambda_{R}(s,f)}\right)  .
920: \end{align*}
921: Finally, removing the conditioning on $N=n$, we obtain
922: \begin{align*}
923: \expecop\!\left[  {X_{R}(s,f)}\right]   &  =\expecop_{N}\!\left[  {N\left(
924: \frac{\expecop\!\left[  {\Delta_{R}^{+}}\right]  \widetilde{\Lambda}_{R}%
925: (s,f)}{\Lambda_{R}(s,f)}\right)  }\right]  \\
926: &  =\Lambda_{R}(s,f)\left(  \frac{\expecop\!\left[  {\Delta_{R}^{+}}\right]
927: \widetilde{\Lambda}_{R}(s,f)}{\Lambda_{R}(s,f)}\right)  \\
928: &  =\expecop\!\left[  {\Delta_{R}^{+}}\right]  \widetilde{\Lambda}_{R}(s,f).
929: \end{align*}
930: 
931: In the case that $\Delta_{R}^{+}$ is always $1$, we may use the notion of a
932: \emph{filtered} Poisson process: if we consider only tuples that manage to
933: survive until time $f$, the chance of a single insertion in time interval
934: $[t,t+\Delta t]$ no longer has the limiting value $\lambda_{R}(t)\Delta t$,
935: but instead $\lambda_{R}(t)\hat{p}_{R}(t,f)\Delta t$. Therefore, the insertion
936: of surviving tuples can be viewed as a nonhomogeneous Poisson process with
937: intensity function $\lambda_{R}(t)\hat{p}_{R}(t,f)$ over the time interval
938: $(s,f]$, so $X_{R}(s,f)\sim\poissondistrib(\widetilde{\Lambda}_{R}(s,f))$.
939: \end{proof}
940: 
941: In the general case, the computation of $\widetilde{\Lambda}_{R}(s,f)$ will
942: require approximation by numerical integration techniques; the complexity of
943: this calculation will depend on the information-theoretic properties of
944: $\lambda_{R}(\cdot)$ and the $\mu_{S}(\cdot)$, $S\in S(R)$, but is unlikely to
945: be burdensome if these functions are reasonably smoothly-varying. In one
946: important special case, however, the complexity of computing $\widetilde
947: {\Lambda}_{R}(s,f)$ is essentially the same as that of calculating
948: $\Lambda_{R}(s,f)$: suppose that for some constants $\alpha(R,S)$, $S\in
949: S(R)$, one has that $\mu_{S}(t)=\alpha(R,S)\lambda_{R}(t)$ for all $t$. That
950: is, the general insertion and deletion activity level of the relations in
951: $S(R)$ all vary proportionally to some common fluctuation pattern. In this
case, assuming fixed multiplicities, we have $\widetilde{\mu}_{R}(t) = \alpha(R)\lambda_{R}(t)$ and
$\widetilde{M}_{R}(s,f) = \alpha(R)\Lambda_{R}(s,f)$ for all $t,s,f$, where
$\alpha(R) = \sum_{S\in S(R)} w(R,S)\alpha(R,S)$. Making the substitution
955: $u(t)=\Lambda_{R}(t,f)$, we have:
956: \begin{align*}
957: \widetilde{\Lambda}_{R}(s,f)  &  = \int_{s}^{f} \!\! \lambda_{R}(t)
958: \exp(-\alpha(R)\Lambda_{R}(t,f)) \hspace{0.15em} dt\\
959: &  = \int_{s}^{f} \! \left(  \frac{-d\Lambda_{R}(t,f)}{dt} \right)
960: \exp(-\alpha(R)\Lambda_{R}(t,f)) \hspace{0.15em} dt\\
961: &  = \int_{s}^{f} \!\! - \exp(-\alpha(R)u(t)) \hspace{0.15em} du(t)\\
962: &  = - \int_{u(s)}^{u(f)} \!\! e^{-\alpha(R)u} \hspace{0.15em} du\\
&  = \frac{1}{\alpha(R)} \left(  1 - e^{-\alpha(R)\Lambda_{R}(s,f)} \right)  ,
964: \end{align*}
965: so $\widetilde{\Lambda}_{R}(s,f)$ can be calculated directly from
$\Lambda_{R}(s,f)$.
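
The closed form for the proportional case can be checked numerically. The Python sketch below does so for one hypothetical rate function, $\lambda_{R}(t)=2+\sin t$, chosen only because its integral is available in closed form; the function names and the value of $\alpha(R)$ passed as \texttt{alpha} are assumptions for illustration. The midpoint rule stands in for whatever numerical integration technique would be used in practice.

```python
import math


def rate(t):
    # Hypothetical insertion rate lambda_R(t) = 2 + sin(t).
    return 2.0 + math.sin(t)


def cumulative_rate(a, b):
    # Exact integral of rate over [a, b].
    return 2.0 * (b - a) - math.cos(b) + math.cos(a)


def filtered_rate_closed_form(s, f, alpha):
    # (1 / alpha) * (1 - exp(-alpha * Lambda_R(s, f)))
    return (1.0 - math.exp(-alpha * cumulative_rate(s, f))) / alpha


def filtered_rate_numeric(s, f, alpha, steps=20000):
    # Midpoint-rule integral of lambda_R(t) * exp(-alpha * Lambda_R(t, f)).
    h = (f - s) / steps
    total = 0.0
    for i in range(steps):
        t = s + (i + 0.5) * h
        total += rate(t) * math.exp(-alpha * cumulative_rate(t, f))
    return total * h
```

The two functions should agree to within the discretization error of the midpoint rule, while the closed form needs only a single evaluation of $\Lambda_{R}(s,f)$.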
967: 
968: We define the random variable $Y_{R}(s,f)$ to be the number of tuples in
969: $R(s)$ that survive through time $f$.
970: 
971: \begin{proposition}
972: $\expecop\!\left[  {Y_{R}(s,f)}\right]  =p_{R}(s,f)\left|  {R(s)}\right|  $
973: and $\expecop\!\left[  {\left|  {R(f)}\right|  }\right]  =p_{R}(s,f)\left|
974: {R(s)}\right|  +\widetilde{\Lambda}_{R}(s,f)\expecop\!\left[  {\Delta}_{R}%
975: ^{+}\right]  $.
976: \end{proposition}
977: 
978: \begin{proof}
979: \noindent Each tuple in $R(s)$ has a survival probability of $p_{R}(s,f)$,
980: which yields that $\expecop
981: \!\left[  {Y_{R}(s,f)}\right]  =p_{R}(s,f)\left|  {R(s)}\right|  $. By the
982: definitions of $Y_{R}(s,f)$ and $X_{R}(s,f)$, one has that
983: \[
984: \left|  {R(f)}\right|  =Y_{R}(s,f)+X_{R}(s,f),
985: \]
986: so therefore
987: \[
988: \expecop\!\left[  {\left|  {R(f)}\right|  }\right]  =\expecop\!\left[
989: {Y_{R}(s,f)}\right]  +\expecop\!\left[  {X_{R}(s,f)}\right]  =p_{R}%
990: (s,f)\left|  {R(s)}\right|  +\widetilde{\Lambda}_{R}(s,f)\expecop\!\left[
991: {\Delta_{R}}^{+}\right]  .
992: \]
993: \hspace*{\fill}\hspace*{\fill}\hspace*{\fill}
994: \end{proof}
995: 
996: In cases where deletions may also be accurately modeled as a compound Poisson
997: process, we have
998: \begin{align*}
999: \expecop\!\left[  {\left|  {R(f)}\right|  }\right]   &  =\expecop\!\left[
1000: {\left|  {R(s)}\right|  }\right]  +B_{R}(s,f)-D_{R}(s,f)\\
&  =\left|  {R(s)}\right|  +\Lambda_{R}(s,f)\expecop\!\left[  {\Delta}%
^{+}\right]  -\expecop\!\left[  {\Delta}^{-}\right]  \int_{s}^{f}\!\!\hat{\mu}_{R}(t)\hspace{0.15em}dt.
1003: \end{align*}
1004: 
1005: \begin{example}
1006: [The homogeneous case]Assume that $\expecop\!\left[  {\Delta_{R}^{+}}\right]
1007: =1$, that $\langle R,S\rangle$ has fixed multiplicity for all $S\in S(R)$, and
1008: furthermore $\lambda_{R}(t)$ and $\mu_{S}(t)$, for all $S\in S(R)$, are
1009: constant functions, that is, $\lambda_{R}(t)=\lambda_{R}$ for all times $t$
1010: and $\mu_{S}(t)=\mu_{S}$ for all $S\in S(R)$ and times $t$. Then $\Lambda
1011: _{R}(s,f)=\lambda_{R}\cdot(f-s)$ and $M_{R}(s,f)=\mu_{R}\cdot(f-s)$. Thus,
letting $\tilde{\mu}_{R}=\sum_{S\in S(R)}w(R,S)\mu_{S}$,
1013: \[
\widetilde{\Lambda}_{R}(s,f)=\int_{s}^{f}\!\!\lambda_{R}e^{-\tilde{\mu}%
_{R}(f-t)}\hspace{0.15em}dt=\frac{\lambda_{R}}{\tilde{\mu}_{R}}\left(
1016: 1-e^{-\tilde{\mu}_{R}(f-s)}\right)  ,
1017: \]
1018: and assuming ${\Delta}_{R,i}^{+}=1$ for all $i>0$,
1019: \[
1020: \expecop\!\left[  {\left|  {R(f)}\right|  }\right]  =\left|  {R(s)}\right|
1021: e^{-\tilde{\mu}_{R}(f-s)}+\frac{\lambda_{R}}{\tilde{\mu}_{R}}\left(
1022: 1-e^{-\tilde{\mu}_{R}(f-s)}\right)  =\frac{\lambda_{R}}{\tilde{\mu}_{R}%
1023: }+e^{-\tilde{\mu}_{R}(f-s)}\left(  \left|  {R(s)}\right|  -\frac{\lambda_{R}%
1024: }{\tilde{\mu}_{R}}\right)  .
1025: \]
1026: \hspace*{\fill}$\Box$
1027: \end{example}
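
The homogeneous formula is simple enough to evaluate directly. The Python sketch below is illustrative only; the function name \texttt{expected\_cardinality} and its parameter names are assumptions, with \texttt{lam} standing for $\lambda_{R}$, \texttt{mu} for $\tilde{\mu}_{R}$, and \texttt{elapsed} for $f-s$. It makes the qualitative behavior explicit: the expected cardinality decays exponentially from $\left|R(s)\right|$ toward the steady-state level $\lambda_{R}/\tilde{\mu}_{R}$.

```python
import math


def expected_cardinality(initial, lam, mu, elapsed):
    """Expected |R(f)| in the homogeneous case, with elapsed = f - s."""
    steady_state = lam / mu  # limiting cardinality as elapsed grows
    return steady_state + math.exp(-mu * elapsed) * (initial - steady_state)
```

A relation that starts above $\lambda_{R}/\tilde{\mu}_{R}$ is therefore expected to shrink, and one that starts below it is expected to grow.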
1028: 
1029: \subsection{Tuples with non-exponential life spans}
1030: \label{sec:nonexplife}We now consider the possibility
1031: that tuples in $R$ have a stochastic life span $L_{R}^{\intrinsic}$ that is
1032: not memoryless, but rather has some general cumulative distribution function
1033: $G_{R}$. For example, if tuples in $R$ correspond to pieces of work in process
1034: on a production floor, the likelihood of deletion might rise the longer the
1035: tuple has been in existence. Let us consider a single relation, and thus no
1036: referential integrity constraints. For any tuple $r$, let $b(r)$ denote the
1037: time it was created. We next establish the expected cardinality of $R$ at time
1038: $f$.
1039: 
1040: \begin{proposition}
1041: In the case that tuples in $R$ have lifetimes with a general cumulative
1042: distribution function $G_{R}$,
1043: \begin{equation}
1044: \expecop\!\left[  {\left|  {R(f)}\right|  }\right]  =\left|  {R(s)}\right|
1045: \expecop_{r\in R(s)}\!\left[  {\frac{1-G_{R}(f-b(r))}{1-G_{R}(s-b(r))}%
1046: }\right]  +\expecop\!\left[  {\Delta}_{R}^{+}\right]  \int_{s}^{f}%
1047: \!\!\lambda_{R}(t)\left(  1-G_{R}(f-t)\right)  \hspace{0.15em}%
1048: dt.\label{eq:nonexperf}%
1049: \end{equation}
1050: \end{proposition}
1051: 
1052: \begin{proof}
1053: \noindent Let $L_{R}^{\intrinsic}$ denote a generic random variable with
1054: cumulative distribution $G_{R}$. The probability of $r\in R(s)$ surviving
1055: throughout $(s,f]$ is then
1056: \[
1057: \probop\!\left\{  L_{R}^{\intrinsic}\geq f-b(r)\;\big|\;L_{R}^{\intrinsic
1058: }\geq s-b(r)\right\}  =\frac{1-G_{R}(f-b(r))}{1-G_{R}(s-b(r))},
1059: \]
1060: and therefore the expected number of tuples in $R(s)$ that survive through
1061: time $f$ is
1062: \[
1063: \expecop\!\left[  {Y_{R}(s,f)}\right]  =\!\!\!\sum_{r\in R(s)}\!\!\!\left(
1064: \frac{1-G_{R}(f-b(r))}{1-G_{R}(s-b(r))}\right)  =\left|  {R(s)}\right|
1065: \expecop_{r\in R(s)}\!\left[  {\frac{1-G_{R}(f-b(r))}{1-G_{R}(s-b(r))}%
1066: }\right]  .
1067: \]
1068: We now consider a tuple $r$ inserted at some time $t\in(s,f]$. The probability
1069: that such a tuple survives through time $f$ is simply $\hat{p}_{R}%
(t,f)=1-G_{R}(f-t)$. By reasoning similar to the proof of
Lemma~\ref{prop:insertsurvival},
\[
\expecop\!\left[  {X_{R}(s,f)}\right]  =\expecop
\!\left[  {\Delta_{R}}^{+}\right]  \widetilde{\Lambda}_{R}(s,f)=\expecop
\!\left[  {\Delta_{R}}^{+}\right]  \int_{s}^{f}\!\!\lambda_{R}(t)\left(
1-G_{R}(f-t)\right)  \hspace{0.15em}dt.
\]
The conclusion then follows from $\expecop\!\left[  {\left|  {R(f)}\right|
}\right]  =\expecop\!\left[  {Y_{R}(s,f)}\right]  +\expecop\!\left[
{X_{R}(s,f)}\right]  $.
1081: \end{proof}
1082: 
1083: It is worth noting that, as opposed to the memoryless case presented above,
1084: the calculation of $\expecop\!\left[  {Y_{R}(s,f)}\right]  $ requires
1085: remembering the commit times $b(r)$ of all tuples $r\in R(s)$, or equivalently
1086: the ages of all such tuples. Of course, for large relations $R$, a reasonable
1087: approximation could be obtained by using a manageably-sized sample to
1088: estimate
1089: \[
1090: \expecop_{r\in R(s)}\!\left[  \frac{1-G_{R}(f-b(r))}{1-G_{R}(s-b(r))}
1091: \right]  .
1092: \]
1093: It is likely that the integral in (\ref{eq:nonexperf}) will require general
1094: numerical integration, depending on the exact form of $G_{R}$.
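
A minimal Python sketch of evaluating (\ref{eq:nonexperf}) follows. It is an illustration under stated assumptions: the function name \texttt{expected\_cardinality\_general} and its parameters are hypothetical, \texttt{survival(x)} stands for $1-G_{R}(x)$, \texttt{birth\_times} holds the recorded $b(r)$ (for all of $R(s)$ or for a sample, suitably rescaled), and a plain midpoint rule is used for the integral where a production system might prefer a more refined method.

```python
import math


def expected_cardinality_general(birth_times, s, f, rate, survival,
                                 mean_batch=1.0, steps=10000):
    """Evaluate the expected |R(f)| for a general life span distribution.

    birth_times: creation times b(r) of the tuples present at time s.
    rate: function t -> lambda_R(t).
    survival: function x -> 1 - G_R(x).
    mean_batch: expected number of tuples per insertion event.
    """
    # Tuples already present at s: conditional survival given their age.
    old = sum(survival(f - b) / survival(s - b) for b in birth_times)

    # Tuples inserted during (s, f]: midpoint rule for the integral.
    h = (f - s) / steps
    new = 0.0
    for i in range(steps):
        t = s + (i + 0.5) * h
        new += rate(t) * survival(f - t)
    return old + mean_batch * new * h
```

Passing an exponential survival function, for example \texttt{lambda x: math.exp(-0.5 * x)}, with a constant rate should reproduce the memoryless homogeneous result, which gives a simple sanity check before substituting another distribution such as a Weibull.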
1095: 
1096: \subsection{Summary}
1097: 
1098: In this section, we have provided a model for the insertion and deletion of
1099: tuples in a relational database. The immediate benefit of this model is the
1100: computation of the expected relation cardinality ($\expecop\!\left[  {\left|
1101: {R(f)}\right|  }\right]  $), given an initial cardinality and insertion and
1102: tuple life span parameters. Relation cardinality has proven to be an important
1103: property in many database tools, including query optimization and database
tuning. Reasonable assumptions regarding fixed multiplicity allow, once
1105: appropriate statistics have been gathered, a rapid computation of
1106: cardinalities in this framework. Section \ref{sec:verify}
1107: elaborates on statistics gathering and model validation.
1108: 
1109: A note regarding tuples with non-exponential life spans is now warranted. For
1110: the case of a single relation, non-exponential life spans add only a moderate
1111: amount of complexity to our model, namely the requirement to store at least an
approximation of the distribution of tuple ages in $R(s)$. For multiple
1113: relations with referential integrity constraints, 
1114: however, the complexity of dealing with
1115: general tuple life spans is much greater. First, to estimate the cardinality
1116: of $R(f)$, we must keep (approximate) tuple age distributions for all
1117: relations in $S(R)$. Second, because the tuple life span distributions of some
1118: of the members of $S(R)$ are not memoryless, we cannot combine them with a
1119: simple relation like (\ref{eq:combineexp}). Furthermore, in attempting to find
1120: the distribution of the remaining life span of a particular tuple $r\in R(s)$,
1121: it may become necessary to consider the issue of the correlation of ages of
1122: tuples in $R(s)$ with the ages of the corresponding tuples in other relations
1123: of $S(R)$. Because of these complications, we defer further consideration of
1124: non-exponential tuple life spans to future research.
1125: 
1126: \section{Modeling data modification}
1127: \label{sec:modif}This section describes various ways to model
1128: the modification of the contents of tuples. We start with a general approach,
1129: using Markov chains, followed by several special cases where the amount of
1130: computation can be greatly reduced.
1131: 
1132: \subsection{Content-dependent updates}
1133: 
1134: \label{sec:condepup}In this section, we model the modification of the contents of tuples as a
1135: finite-state continuous-time Markov chain, thus assuming dependence on tuples'
1136: previous contents. For each relation $R$, we allow for some (possibly empty)
1137: subset $\mathcal{C}(R)\subset\mathcal{A}(R)$ of its attributes to be subject
1138: to change over the lifetime of a tuple. We do not permit primary key fields to
1139: be modified, that is, $\mathcal{C}(R)\cap\mathcal{K}(R)=\emptyset$.
1140: 
1141: Attribute values may change at time instants called \emph{transition events},
1142: which are the transition times of the Markov chain. We assume that the spacing
1143: of transition events is memoryless with respect to the age of a tuple
1144: (although it may depend on the time and the current value of the attribute, as
1145: demonstrated below). For any attribute $A$, tuple $r$, time $s$, and value
1146: $v\in\dom A$ with $r.A(s)=v$, the time remaining until the next transition
1147: event for $r.A$ is a random variable $\tau_{v,s}^{R,A}$ with the distribution
1148: $\expdistrib
1149: _{s}(\ell_{v}^{R,A}\gamma_{R,A}(\cdot))$, where $\gamma_{R,A}:\Re
1150: \rightarrow\lbrack0,\infty)$ is a function giving the general instantaneous
1151: rate of change for the attribute, and $\ell_{v}^{R,A}$ is a nonnegative scalar
1152: which we call the \emph{relative exit rate} of $v$. We define $\Gamma
1153: _{R,A}(s,f)=\int_{s}^{f}\gamma_{R,A}(t)\hspace{0.15em}dt$. When a transition
1154: event occurs from state $u\in\dom A$, attribute $A$ changes to $v\in\dom A$
1155: with probability $P_{u,v}^{R,A}$.
1156: 
1157: Suppose $\mathcal{A}=\{A_{1},A_{2},...,A_{k}\}\subseteq\mathcal{R}$ is an
1158: independently varying set of attributes, and $v=\langle v_{1},v_{2}%
1159: ,...,v_{k}\rangle\in\dom\mathcal{A}$ is a compound value. Then the time until
1160: the next transition event for $r.\mathcal{A}$ is $\tau_{v,s}^{R,\mathcal{A}}=\min
1161: \{\tau_{v_{1},s}^{R,A_{1}},\ldots,\tau_{v_{k},s}^{R,A_{k}}\}$. As a rule, we
1162: will assume that the modification processes for the attributes of a relation
1163: are independent, so $\tau_{v,s}^{R,\mathcal{A}}\sim\expdistrib
_{s}(\sum_{i=1}^{k}\ell_{v_{i}}^{R,A_{i}}\gamma_{R,A_{i}}(\cdot))$. When the
1165: functions $\gamma_{R,A_{i}}$ are identical for $i=1,\ldots,k$, we define
1166: $\gamma_{R,\mathcal{A}}=\gamma_{R,A_{i}}$ and $\ell_{v}^{R,\mathcal{A}}=\sum_{i=1}^{k}\ell_{v_{i}%
1167: }^{R,A_{i}}$, so $\tau_{v,s}^{R,\mathcal{A}}\sim\expdistrib_{s}(\ell_{v}^{R,\mathcal{A}}%
1168: \gamma_{R,\mathcal{A}}(\cdot))$. To justify the assumption of independence, we note that
1169: coordinated modifications among attributes can be modeled by replacing the
1170: coordinated attributes with a single compound attribute (this technique
1171: requires that the attributes have identical $\gamma_{R,\mathcal{A}}(\cdot)$ functions,
1172: which is reasonable if they change in a coordinated way).
1173: 
1174: Under these assumptions,
1175: let $\overline{\mathcal{C}}(R)$ denote a partition of $\mathcal{C}(R)$ into
1176: subsets $\mathcal{A}$ such that any two attributes $A_{1},A_{2}\in
1177: \mathcal{C}(R)$ vary dependently iff they are in the same $\mathcal{A}%
1178: \in\overline{\mathcal{C}}(R)$.
1179: 
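As a small numerical illustration of the compound exit rate
$\ell_{v}^{R,\mathcal{A}}$ introduced above (all figures here are hypothetical
and chosen only for arithmetic convenience), take $k=2$ independently varying
attributes with $\ell_{v_{1}}^{R,A_{1}}=2$ and $\ell_{v_{2}}^{R,A_{2}}=3$, and
suppose that both share the constant rate function
$\gamma_{R,\mathcal{A}}(t)\equiv0.1$, so that
$\Gamma_{R,\mathcal{A}}(s,f)=0.1\,(f-s)$. Then
\[
\ell_{v}^{R,\mathcal{A}}=2+3=5,\qquad
\probop\!\left\{  \tau_{v,s}^{R,\mathcal{A}}>f-s\right\}
=e^{-\ell_{v}^{R,\mathcal{A}}\Gamma_{R,\mathcal{A}}(s,f)}=e^{-0.5(f-s)},
\]
so the compound value remains untouched over an interval of length $f-s=2$
with probability $e^{-1}\approx0.368$.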
1180: \begin{example}
1181: [First alteration time]\label{ex:fat} For a relation $R$ and time
1182: $s$, we define $\Upsilon_{R,s}$ to be the amount of time until the next change
1183: in $R$, be it a tuple insertion, a tuple deletion, or an attribute
1184: modification. Also, for any $S\in S(R)$, let $D(R,S,s)$ denote the number of
1185: tuples in $S(s)$ whose deletion would force the deletion of some tuple in
1186: $R(s)$ $($and therefore $D(R,R,s)=\left|  {R(s)}\right|  )$. The following
1187: proposition establishes the distribution of $\Upsilon_{R,s}$.
1188: \end{example}
1189: 
1190: \begin{proposition}
1191: $\Upsilon_{R,s}\sim\expdistrib_{s}(\zeta_{R}(\cdot))$, where
\[
\zeta_{R}(t)=\lambda_{R}(t)\;\;+\sum_{S\in S(R)}\!\!\!D(R,S,s)\mu
_{S}(t)\;\;+\sum_{\mathcal{A}\in\overline{\mathcal{C}}(R)}\!h(R,\mathcal{A},s)\gamma
_{R,\mathcal{A}}(t)
\]
and $h(R,\mathcal{A},s)=\sum_{v\in\dom\mathcal{A}}\hat{R}_{\mathcal{A},v}(s)\ell
_{v}^{R,\mathcal{A}}$. The probability of any alteration to $R$ in the time interval
$(s,f]$ is $1-e^{-Z_{R}(s,f)}$, where
\[
Z_{R}(s,f)=\Lambda_{R}(s,f)\;\;+\sum_{S\in S(R)}\!\!\!D(R,S,s)M_{S}%
(s,f)\;\;+\sum_{\mathcal{A}\in\overline{\mathcal{C}}(R)}\!\!\!h(R,\mathcal{A},s)\Gamma
_{R,\mathcal{A}}(s,f).
\]
1205: \end{proposition}
1206: 
1207: \begin{proof}
1208: Let $\Upsilon_{R,s}^{\insertion}$, $\Upsilon_{R,s}^{\modification}$, and
1209: $\Upsilon_{R,s}^{\deletion}$ be the times until the next insertion,
1210: modification, and deletion in $R$, respectively. From Section
1211: \ref{sec:estcard}, we have that $\Upsilon_{R,s}^{\insertion}\sim\expdistrib
1212: _{s}(\lambda_{R}(\cdot))$. Now, for each $S\in S(R)$, there are $D(R,S,s)$
1213: tuples whose deletion would cause a deletion in $R$. The time until deletion
1214: of any such $r\in S\in S(R)$ is distributed like $\expdistrib_{s}(\mu
1215: _{S}(\cdot))$. The deletion processes for all these tuples are independent
1216: across all of $S(R)$, so we can use (\ref{eq:combineexp}) to conclude that
1217: \[
1218: \Upsilon_{R,s}^{\deletion}\sim\expdistrib_{s}\left(  \sum_{S\in S(R)}%
1219: \!\!\!D(R,S,s)\mu_{S}(\cdot)\right)  .
1220: \]
From the preceding discussion, the time until the next modification of any
single tuple $r\in R(s)$, whose current value is $v=r.\mathcal{C}(R)(s)$, is
\[
\tau_{v,s}^{R,\mathcal{C}(R)}\sim\expdistrib
_{s}\left(  \sum_{\mathcal{A}\in\overline{\mathcal{C}}(R)}\ell_{r.\mathcal{A}%
(s)}^{R,\mathcal{A}}\gamma_{R,\mathcal{A}}(\cdot)\right)  ,
\]
and $\Upsilon_{R,s}^{\modification}$ is the minimum of these times over all
$r\in R(s)$.
1227: Since $\Upsilon_{R,s}=\min\{\Upsilon_{R,s}^{\insertion
1228: },\Upsilon_{R,s}^{\modification},\Upsilon_{R,s}^{\deletion}\}$, we therefore
1229: have, again using independence and (\ref{eq:combineexp}), that $\Upsilon
1230: _{R,s}\sim\expdistrib_{s}(\zeta_{R}(\cdot))$, where
1231: \begin{align*}
1232: \zeta_{R}(t) &  =\lambda_{R}(t)\;\;+\sum_{S\in S(R)}\!\!\!D(R,S,s)\mu
1233: _{S}(t)\;\;+\sum_{r\in R(s)}\!\!\left[  \sum_{\mathcal{A}\in\overline
1234: {\mathcal{C}}(R)}\!\!\!\ell_{r.\mathcal{A}(s)}^{R,\mathcal{A}}\gamma
1235: _{R,\mathcal{A}}(t)\right]  \\
&  =\lambda_{R}(t)\;\;+\sum_{S\in S(R)}\!\!\!D(R,S,s)\mu_{S}(t)\;\;+\sum
_{\mathcal{A}\in\overline{\mathcal{C}}(R)}\!\!\!h(R,\mathcal{A},s)\gamma_{R,\mathcal{A}%
}(t).
\end{align*}
Integrating over $(s,f]$ results in
\begin{align*}
Z_{R}(s,f) &  =\int_{s}^{f}\!\zeta_{R}(t)\hspace{0.15em}dt\\
&  =\Lambda_{R}(s,f)\;\;+\sum_{S\in S(R)}\!\!\!D(R,S,s)M_{S}(s,f)\;\;+\sum
_{\mathcal{A}\in\overline{\mathcal{C}}(R)}\!\!\!h(R,\mathcal{A},s)\Gamma_{R,\mathcal{A}%
}(s,f).
1246: \end{align*}
1247: Therefore, the probability of any alteration to $R$ in the time interval
1248: $(s,f]$ is%
1249: \[
\probop\left\{  \Upsilon_{R,s}\leq f-s\right\}  =1-e^{-Z_{R}(s,f)}.%
1251: \]
1252: \hspace*{\fill}
1253: \end{proof}
1254: 
1255: \noindent\textbf{Example \ref{ex:fat} continued (First alteration
1256: transcription policy)~~} \textit{Suppose the
1257: user wishes to refresh her replica of relation $R$ whenever the
1258: probability that it
1259: contains any inaccuracy exceeds some threshold $\pi$, a tactic we call
1260: the \emph{first
1261: alteration policy}. Then, a refresh is required at time $f$ if $1-e^{-Z_{R}
1262: (s,f)}>\pi$.}\hspace*{\fill}$\Box$
1263: 
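To see how the first alteration policy translates into a refresh interval,
consider the special case, assumed here purely for illustration, in which all
the rate functions are constant, so that $\zeta_{R}(t)\equiv\zeta>0$ and
$Z_{R}(s,f)=\zeta\,(f-s)$. For $0<\pi<1$, the refresh condition
$1-e^{-Z_{R}(s,f)}>\pi$ is then equivalent to
\[
f-s>\frac{-\ln(1-\pi)}{\zeta}.
\]
With the hypothetical values $\zeta=3.5$ alterations per hour and $\pi=0.5$,
a refresh is required once $f-s>\ln2/3.5\approx0.198$ hours, that is, roughly
every $12$ minutes.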
1264: 
1265: ~
1266: 
1267: Given $\mathcal{A}$, we describe the transition process for the value
1268: $r.\mathcal{A}$ over time by probabilities
1269: \[
1270: P_{u,v}^{R,\mathcal{A}}(s,f)=\probop\!\left\{  r.\mathcal{A}(f)\!=\!v\;\big
1271: |\;r.\mathcal{A}(s)\!=\!u\right\}  ,
1272: \]
1273: for any two values $u,v\in\dom\mathcal{A}$ and times $s<f$. Under the
1274: assumption of independence,
1275: \begin{equation}
1276: P_{u,v}^{R,\mathcal{A}}(s,f)=\prod_{i=1}^{k}P_{u_{i},v_{i}}^{R,A_{i}}(s,f).
1277: \label{eq:probproduct}%
1278: \end{equation}
1279: Given any simple attribute $A$, we define $q_{u,v}^{R,A}$, the \emph{relative
1280: transition rate} from $u$ to $v$, by
1281: \[
1282: q_{u,v}^{R,A}=\ell_{u}^{R,A}P_{u,v}^{R,A}.
1283: \]
Given a set of attributes $\mathcal{A}$ with identical $\gamma_{R,A}(\cdot)$
functions, the compound transition rate $q_{u,v}^{R,\mathcal{A}}$ may be computed via
1286: \begin{equation}
1287: q_{u,v}^{R,\mathcal{A}}=\ell_{u}^{R,\mathcal{A}}P_{u,v}^{R,\mathcal{A}%
1288: }=\left(  \sum_{i=1}^{k}\ell_{u_{i}}^{R,A_{i}}\right)  \left(  \prod_{i=1}%
1289: ^{k}P_{u_{i},v_{i}}^{R,A_{i}}\right)  . \label{eq:jointexit}%
1290: \end{equation}
1291: Let $Q^{R,A}$ be the matrix of $q_{u,v}^{R,A}$, where $q_{u,u}^{R,A}=-\ell
1292: _{u}^{R,A}$.
1293: 
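For concreteness, the following hypothetical three-state attribute shows how
$Q^{R,A}$ is assembled; the values are not drawn from any data set. Let
$\dom A=\{a,b,c\}$ with relative exit rates $\ell_{a}^{R,A}=1$,
$\ell_{b}^{R,A}=2$, and $\ell_{c}^{R,A}=0$ (so that $c$ is absorbing), and with
$P_{a,b}^{R,A}=0.75$, $P_{a,c}^{R,A}=0.25$, and
$P_{b,a}^{R,A}=P_{b,c}^{R,A}=0.5$. Ordering the states as $(a,b,c)$,
\[
Q^{R,A}=
\begin{pmatrix}
-1 & 0.75 & 0.25\\
1 & -2 & 1\\
0 & 0 & 0
\end{pmatrix}
,
\]
where, for instance, $q_{b,c}^{R,A}=\ell_{b}^{R,A}P_{b,c}^{R,A}=2\cdot0.5=1$.
Each row sums to zero, as expected for the rate matrix of a continuous-time
Markov chain in which every transition event changes the state.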
1294: \begin{proposition}
1295: The matrix $P^{R,A}(s,f)$ of elements $P_{u,v}^{R,A}(s,f)$ is given by the
1296: matrix exponential formula
1297: \begin{equation}
1298: P^{R,A}(s,f)=\exp\!\left(  \Gamma_{R,A}(s,f)\,Q^{R,A}\right)  =\sum
1299: _{n=0}^{\infty}\frac{\Gamma_{R,A}(s,f)^{n}}{n!}{\left(  Q^{R,A}\right)  }%
1300: ^{n}.\label{eq:matrixexp}%
1301: \end{equation}
1302: \end{proposition}
1303: 
1304: \begin{proof}
Consider a continuous-time Markov chain on the same state space
$\dom
A$, and with the same instantaneous transition probabilities $P_{u,v}^{R,A}$,
where $u,v\in\dom A$. However, in the new chain, the holding time in each
state $v$ is simply a homogeneous exponential random variable with
rate $\ell_{v}^{R,A}$. We call this system the \emph{linear-time} chain, to
distinguish it from the original chain. Define $\overline{P}_{u,v}^{R,A}(t)$
to be the probability that the linear-time chain is in state $v$ at time $t$,
given that it is in state $u$ at time $0$, and let $\overline{P}^{R,A}(t)$
denote the matrix of these probabilities. Standard results for finite-state
continuous-time Markov chains imply that
\[
\overline{P}^{R,A}(t)=\exp\!\left(  t\,Q^{R,A}\right)  =\sum
_{n=0}^{\infty}\frac{t^{n}}{n!}{\left(  Q^{R,A}\right)  }^{n}.
\]
1319: By a transformation of the time variable, we then assert that
1320: \[
1321: P_{u,v}^{R,A}(s,f)=\overline{P}_{u,v}^{R,A}(\Gamma_{R,A}(s,f)),
1322: \]
1323: from which the result follows.
1324: \end{proof}
1325: 
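The time change used in the proof can be illustrated with a hypothetical
piecewise-constant rate function (an assumption made only for this sketch).
Take $s=0$, $f=5$, and
\[
\gamma_{R,A}(t)=
\begin{cases}
2, & 0\leq t<3,\\
0.5, & t\geq3,
\end{cases}
\qquad\text{so that}\qquad
\Gamma_{R,A}(0,5)=2\cdot3+0.5\cdot2=7.
\]
Then $P^{R,A}(0,5)=\exp\!\left(  7\,Q^{R,A}\right)  $: over five units of real
time, the attribute behaves like the linear-time chain observed for seven
units. When $\Gamma_{R,A}(s,f)$ is small, truncating (\ref{eq:matrixexp})
after the linear term gives the first-order approximation
$P_{u,v}^{R,A}(s,f)\approx\Gamma_{R,A}(s,f)\,q_{u,v}^{R,A}$ for $u\neq v$.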
1326: \begin{example}[Query optimization, revisited]
1327: \label{ex:qopt2}As the following proposition shows, our model
1328: can be used to estimate the histogram of a relation $R$ at time $f$. A query
1329: optimizer running at time $f$ could use expected histograms, calculated in
1330: this manner, instead of the old histograms $\hat{R}_{A}(s)$.
1331: \end{example}
1332: 
1333: \begin{proposition}
1334: Assume that $w(r,S)$, for all $S\in S(R)$, is independent of the attribute
1335: values $r.A(s)$ for all $A\in\mathcal{C}(R)$. Let $\hat{\omega}_{u}^{R,A}(t)$
1336: denote the probability that $r.A(t)=u$, given that $r$ is inserted into $R$ at
1337: time $t$. Then, for all $v\in\dom A$,
1338: \begin{align}
1339: \expecop\!\left[  {\hat{R}_{A,v}(f)}\right]   &  =p_{R}(s,f)\!\!\!\!\sum
1340: _{u\in\dom A}\!\!\!\!\hat{R}_{A,u}(s)P_{u,v}^{R,A}(s,f)\nonumber\\
1341: &  \quad\quad+\quad\expecop\!\left[  {\Delta_{R}^{+}}\right]  \!\!\!\!\sum
1342: _{u\in\dom A}\!\!\!\left(  \int_{s}^{f}\!\!\hat{\omega}_{u}^{R,A}(t)\hat
1343: {p}_{R}(s,f)\lambda_{R}(t)P_{u,v}^{R,A}(t,f)\intdspace dt
1344: \right)  .\label{eq:query}%
1345: \end{align}
1346: \end{proposition}
1347: 
1348: \begin{proof}
1349: We first compute the expected number of surviving tuples $r$ whose values
1350: $r.A$ migrate to $v$. Given a value $u\in\dom A$, there are $\hat{R}_{A,u}(s)$
1351: tuples at time $s$ such that $r.A(s)=u$. Using the previous results, the
1352: expected number of these tuples surviving through time $f$ is $\hat{R}%
1353: _{A,u}(s)p_{R}(s,f)$, and the probability of each surviving tuple $r$ having
1354: $r.A(f)=v$ is $P_{u,v}^{R,A}(s,f)$. Using the independence assumption and
1355: summing over all $u\in\dom
A$, one has that the expected number of tuples in $R(s)$ that survive through
1357: $f$ and have $r.A(f)=v$ is
1358: \[
1359: \sum_{u\in\dom A}\!\!\!\!\hat{R}_{A,u}(s)p_{R}(s,f)P_{u,v}^{R,A}%
1360: (s,f)=p_{R}(s,f)\!\!\!\!\sum_{u\in\dom A}\!\!\!\!\hat{R}_{A,u}(s)P_{u,v}%
1361: ^{R,A}(s,f).
1362: \]
1363: We next consider newly inserted tuples. Recall that $\hat{\omega}_{u}%
1364: ^{R,A}(t)$ denotes the probability that $r.A(t)=u$, given that $r$ is inserted
1365: into $R$ at time $t$. Suppose that an insertion occurs at time $t\in(s,f]$.
1366: The expected number of tuples $r$ created at this insertion that both survive
1367: until $f$ and have $r.A(f)=v$ is
1368: \[
\hat{p}_{R}(t,f)\!\!\!\sum_{u\in\dom A}\!\!\!\!\hat{\omega}_{u}^{R,A}%
1370: (t)P_{u,v}^{R,A}(t,f).
1371: \]
1372: By logic similar to Proposition~\ref{prop:insertsurvival}, one may then
1373: conclude that the expected number of newly-inserted tuples that survive
1374: through time $f$ and have $r.A(f)=v$ is
1375: \[
1376: \expecop\!\left[  {\Delta_{R}^{+}}\right]  \!\!\!\!\sum_{u\in\dom
1377: A}\!\!\!\left(  \int_{s}^{f}\!\!\hat{\omega}_{u}^{R,A}(t)\hat{p}%
1378: _{R}(s,f)\lambda_{R}(t)P_{u,v}^{R,A}(t,f)\hspace{0.15em}dt\right)  .
1379: \]
1380: The result follows by adding the last two expressions.
1381: \end{proof}
1382: 
1383: \noindent\textbf{Example \ref{ex:qopt2} continued~} \emph{We next
1384: consider whether the complexity of calculating (\ref{eq:query}) is
1385: preferable to recomputing the histogram vector ${\hat{R}_{A}(f)}$.
1386: This topic is quite involved and depends heavily on the specific structure of
1387: the database (\emph{e.g.}, the availability of indices) and the
1388: specific application (\emph{e.g.}, the concentration of values in a
1389: small subset of an attribute's domain). In what follows, we lay out
1390: some qualitative considerations in deciding whether calculating
1391: (\ref{eq:query}) would be more efficient than recalculating
${\hat{R}_{A}(f)}$ ``from scratch.''  Experimentation with
real-world applications is left for further research.}
1394: 
1395: \emph{Generally speaking, direct computation of the histogram of an
1396: attribute $A$ (in the absence of an index for $A$) can be done by
1397: either scanning all tuples (although sampling may also be used) or
1398: scanning a modification log to capture changes to the prior histogram
1399: vector ${\hat{R}_{A}(s)}$ during $(s,f]$. Therefore, the 
1400: recomputation can be performed in $\bigO(\min\{\card{R(f)},T(s,f)\})$
1401: time, where $T(s,f)$ denotes the total number of updates during
1402: $(s,f]$. Whenever $\card{R(f)}$ and $T(s,f)$ are both large ---
1403: \emph{i.e.}, the database is large and the transaction load is high
1404: --- the straightforward techniques will be relatively
1405: unattractive. As for the estimation technique, it will probably work
best when $\card{\dom A}$ is small (for example, for a binary attribute)
1407: or whenever the subset of actually utilized values in the domain is
1408: small. In addition, commercial databases recompute
1409: the entire histogram as a single, atomic task. Formula
1410: (\ref{eq:query}), on the other hand, can be performed on a subset of the
1411: attribute values.  For example, in the case of exact matching (say, a
1412: condition of the form $\mathtt{WHERE}\;A=v$), it is sufficient to
1413: compute $\hat{R}_{A,v}(f)$, rather than the full
${\hat{R}_{A}(f)}$ vector. Finally, it is worth noting that
computing the expected value
1416: of ${\hat{R}_{A,v}(f)}$ via (\ref{eq:query}) does not require locking $R$,
1417: while a full histogram recomputation may involve extended periods of
1418: locking.}\hspace*{\fill}$\Box$
1419: 
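As a sketch of how the first term of (\ref{eq:query}) is evaluated, consider a
binary attribute with $\dom A=\{0,1\}$ and no insertions during $(s,f]$ (that
is, $\lambda_{R}\equiv0$, so that the second term vanishes). The numbers below
are hypothetical. Suppose $\hat{R}_{A,0}(s)=800$, $\hat{R}_{A,1}(s)=200$,
$p_{R}(s,f)=0.9$, $P_{0,1}^{R,A}(s,f)=0.1$, and $P_{1,1}^{R,A}(s,f)=0.8$. Then
\[
\expecop\!\left[  {\hat{R}_{A,1}(f)}\right]  =0.9\,(800\cdot0.1+200\cdot
0.8)=0.9\cdot240=216,
\]
and, using $P_{0,0}^{R,A}(s,f)=0.9$ and $P_{1,0}^{R,A}(s,f)=0.2$,
\[
\expecop\!\left[  {\hat{R}_{A,0}(f)}\right]  =0.9\,(800\cdot0.9+200\cdot
0.2)=0.9\cdot760=684.
\]
The two estimates sum to $900=0.9\cdot1000$, the expected number of surviving
tuples, which serves as a simple consistency check.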
1420: 
1421: ~
1422: 
1423: We next consider the number of tuples in $R(s)$ that have survived through
1424: time $f$ without being modified, which we denote $Y_{R}^{-}(s,f)$. The
1425: expectation of this random variable is
1426: \[
1427: \expecop\!\left[  {Y_{R}^{-}(s,f)}\right]  =p_{R}(s,f)\!\!\!\!
1428: \sum_{v\in\dom{\mathcal{A}}}\!\!\!\!
\hat{R}_{\mathcal{A},v}(s)P_{v,v}^{R,\mathcal{A}}(s,f).
1430: \]
1431: We let $Y_{R}^{+}(s,f)=Y_{R}(s,f)-Y_{R}^{-}(s,f)$ denote the number of tuples
1432: in $R(s)$ that have survived through time $f$ and were modified; it follows
1433: from the linearity of the $\expecop\!\left[  {\cdot}\right]  $ operator that
1434: \[
1435: \expecop\!\left[  {Y_{R}^{+}(s,f)}\right]  =\expecop\!\left[  {Y_{R}%
1436: (s,f)}\right]  -\expecop\!\left[  {Y_{R}^{-}(s,f)}\right]  .
1437: \]
1438: 
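For a hypothetical illustration, let $\dom\mathcal{A}=\{0,1\}$ with
$\hat{R}_{\mathcal{A},0}(s)=600$, $\hat{R}_{\mathcal{A},1}(s)=400$,
$p_{R}(s,f)=0.8$, $P_{0,0}^{R,\mathcal{A}}(s,f)=0.7$, and
$P_{1,1}^{R,\mathcal{A}}(s,f)=0.9$. Then
\[
\expecop\!\left[  {Y_{R}^{-}(s,f)}\right]  =0.8\,(600\cdot0.7+400\cdot
0.9)=0.8\cdot780=624.
\]
If, as we assume for this sketch, $Y_{R}(s,f)$ counts all tuples of $R(s)$
that survive through $f$, so that $\expecop\!\left[  {Y_{R}(s,f)}\right]
=p_{R}(s,f)\left|  {R(s)}\right|  =0.8\cdot1000=800$, then
$\expecop\!\left[  {Y_{R}^{+}(s,f)}\right]  =800-624=176$.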
1439: \subsubsection{Complexity analysis of content-dependent updates}
1440: 
1441: In practice, as with computing a scalar exponential, only a limited number of
1442: terms will be needed to compute the sum (\ref{eq:matrixexp}) to machine
1443: precision. It is worth noting that efficient means of calculating
1444: (\ref{eq:matrixexp}) are a major topic in the field of computational probability.
1445: 
1446: In the case of a compound attribute $\mathcal{A}=\{A_{1},A_{2},...,A_{k}\}$
1447: with independently varying components, it will be computationally more
1448: efficient to first calculate the individual transition probability matrices
$P^{R,A_{i}}(s,f)$ via (\ref{eq:matrixexp}), and then calculate the joint
probability matrix $P^{R,\mathcal{A}}(s,f)$ using (\ref{eq:probproduct}%
), rather than first finding the joint exit rate matrix $Q^{R,\mathcal{A}}$
1452: via (\ref{eq:jointexit}) and then applying (\ref{eq:matrixexp}). The former
1453: approach would involve repeated multiplications of square matrices of size
1454: $\left|  {\dom A_{i}}\right|  $, for $i=1,\ldots,k$, resulting in a
1455: computational complexity of $\bigO(\sum_{i=1}^{k}n_{i}{\left|  {\dom
1456: A_{i}}\right|  }^{\nu})$, where $n_{i}$ is the number of iterations needed to
1457: compute the sum (\ref{eq:matrixexp}) to machine precision, and the complexity
1458: of multiplying two $n\times n$ matrices is $\bigO(n^{\nu})$.\footnote{$\nu=3$ for
1459: the standard method and $\nu=\log_{2}7$ for Strassen's and related methods.}
1460: The latter would involve multiplying square matrices of size $\prod_{i=1}%
1461: ^{k}\left|  {\dom
1462: A_{i}}\right|  $, resulting in the considerably worse complexity of
$\bigO(n{(\prod_{i=1}^{k}\left|  {\dom A_{i}}\right|  )}^{\nu})$, where $n$ is the
1464: number of iterations needed to obtain the desired precision.
1465: 
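To give a rough sense of the gap, take the hypothetical case of $k=3$
components with $\left|  {\dom A_{i}}\right|  =10$ each, the standard
multiplication method ($\nu=3$), and $n_{i}=n=20$ terms in the series. The
componentwise approach then costs on the order of
\[
\sum_{i=1}^{3}n_{i}{\left|  {\dom A_{i}}\right|  }^{\nu}=3\cdot20\cdot
10^{3}=6\times10^{4}
\]
operations, whereas the joint approach costs on the order of
\[
n{\left(  \prod_{i=1}^{3}\left|  {\dom A_{i}}\right|  \right)  }^{\nu}%
=20\cdot{(10^{3})}^{3}=2\times10^{10}
\]
operations, a difference of more than five orders of magnitude. Constant
factors are ignored here; assembling the $10^{3}\times10^{3}$ joint matrix
from (\ref{eq:probproduct}) adds on the order of $10^{6}$ further
multiplications, which does not change the comparison.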
1466: \subsection{Simplified modification models}
1467: 
1468: We next introduce several possible simplifications of the general Markov chain
1469: case. To do so, we start by differentiating numeric domains from non-numeric
1470: domains. Certain database attributes $A\in\mathcal{A}$, such as prices and
1471: order quantities, represent numbers, and numeric operations such as
1472: addition are meaningful for these attributes. For such attributes, one can
1473: easily define a distance function between two attribute values, as we shall
1474: see below. We call the domains $\dom A$ of such attributes \emph{numeric
1475: domains}, and denote the set of all attributes with numeric domains by
1476: $\mathcal{N}\subset\mathcal{B}$. All other attributes and domains are
1477: considered \emph{non-numeric}.\footnote{Distance metrics can also be defined
1478: for complex data types such as images. We leave the handling of such cases to
1479: further research.} It is worth noting that not all numeric data necessarily
1480: constitute a numeric domain. Consider, for example, a customer relation $R$
1481: whose primary key is a customer number. Although the customer number consists
1482: of numeric symbols, it is essentially an arbitrary identification string for
1483: which arithmetic operations like addition and subtraction are not
1484: intrinsically meaningful for the database application. We consider such
1485: attributes to be non-numeric.
1486: 
1487: \subsubsection{Domain lumping}
1488: 
1489: \label{sec:lumping}To make our data modification model more computationally
1490: tractable, it may be appropriate, in many cases, to simplify the Markov chain
1491: state space for an attribute $A$ so that it is much smaller than $\dom A$.
1492: Suppose, for example, that $A$ is a 64-character string representing a street
address. Restricting to 96 printable characters, $A$ may assume on the order
of $96^{64}\approx7\times10^{126}$ possible values. It is obviously unnecessary,
1495: inappropriate, and intractable to work with a Markov chain with such an
1496: astronomical number of states.
1497: 
One possible remedy for such situations is referred to as \emph{lumping} in
1499: the Markov chain literature~\cite{KEMENY60}. In our terminology, suppose we
1500: can partition $\dom A$ into a collection of sets ${\{V\}}_{V\in\mathcal{V}}$
1501: with the property that $\left|  {\mathcal{V}}\right|  \ll\left|  {\dom
1502: A}\right|  $ and
1503: \[
1504: \forall\;U,V\in\mathcal{V},\;\forall\;u,u^{\prime}\in U\quad\sum_{v\in
1505: V}q_{u,v}^{R,A}=\sum_{v\in V}q_{u^{\prime},v}^{R,A}.
1506: \]
1507: Then, one can model the transitions between the ``lumps'' $V\in\mathcal{V}$ as
1508: a much smaller Markov chain whose set of states is $\mathcal{V}$, with the
1509: transition rate from $U\in\mathcal{V}$ to $V\in\mathcal{V}$ being given by the
1510: common value of $\sum_{v\in V}q_{u,v}^{R,A}$, $u\in U$. If we are interested
1511: only in which lump the attribute is in, rather than its precise value, this
1512: smaller chain will suffice. Using lumping, the complexity of the computation
1513: is directly dependent on the number of lumps. We now give a few simple examples:
1514: 
1515: \begin{example}
1516: [Lumping into a binary domain]\label{ex:binarylump}Consider the
1517: street address example just discussed. Fortunately, if an address has changed
1518: since time $s$, the database user is unlikely to be concerned with how
1519: different it is from the address at time $s$, but simply whether it is
1520: different. Thus, instead of modeling the full domain $\dom A$, we can
1521: represent the domain via the simple binary set $\{0,1\}$, where $0$
1522: indicates that the address has not changed since time $s$, and $1$ indicates
that it has. We assume that the transition rates $q_{v,r.A(s)}^{R,A}$ from all other
1524: addresses $v\in\dom A$ back to the original value $r.A(s)$ all have the
1525: identical value $\theta^{\prime}$. In this case, one has $P_{0,1}%
1526: ^{R,A}=P_{1,0}^{R,A}=1$, and the behavior of the attribute is fully captured
1527: by the exit rates $\ell_{0}^{R,A}=q_{0,1}^{R,A}$ and 
1528: $\ell_{1}^{R,A}=q_{1,0}^{R,A}$.  We will abbreviate
1529: these quantities by $\theta$ and $\theta^{\prime}$, respectively.
1530: 
1531: Using standard results for a two-state continuous-time Markov chain
1532: \cite[Section VI.3.3]{TK94}, we conclude that
1533: \begin{align}
1534: P_{0,0}^{R,A}(s,f) &  =
1535: \frac{\theta^{\prime}+\theta e^{-{(\theta+\theta^{\prime})\Gamma_{R,A}(s,f)}}}
1536: {\theta+\theta^{\prime}}\label{binary:0to0} \\
1537: P_{0,1}^{R,A}(s,f) &  =
1538: \frac{\theta-\theta e^{-{(\theta+\theta^{\prime})\Gamma_{R,A}(s,f)}}}
1539: {\theta+\theta^{\prime}}.  \label{binary:0to1}%
1540: \end{align}
1541: \hspace*{\fill}$\Box$
1542: \end{example}
1543: 
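As a numerical check of (\ref{binary:0to0}) and (\ref{binary:0to1}), take the
hypothetical values $\theta=0.3$, $\theta^{\prime}=0.1$, and
$\Gamma_{R,A}(s,f)=5$, so that $(\theta+\theta^{\prime})\Gamma_{R,A}(s,f)=2$
and $e^{-2}\approx0.1353$. Then
\[
P_{0,0}^{R,A}(s,f)\approx\frac{0.1+0.3\cdot0.1353}{0.4}\approx0.3515,
\qquad
P_{0,1}^{R,A}(s,f)\approx\frac{0.3-0.3\cdot0.1353}{0.4}\approx0.6485,
\]
which sum to one, as required. As $\Gamma_{R,A}(s,f)\rightarrow\infty$, the
two probabilities tend to $\theta^{\prime}/(\theta+\theta^{\prime})=0.25$ and
$\theta/(\theta+\theta^{\prime})=0.75$, respectively.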
1555: 
1556: \begin{example}[Web crawling]
1557: As an even simpler special case, consider a Web crawler
1558: (\emph{e.g.}, \cite{PINKERTON94,HEYDON99,CHO00}). Such a crawler needs to
1559: visit Web pages upon change to re-process their content, possibly for the use
1560: of a search engine. Recalling Example \ref{ex:binarylump}, one
1561: may define a boolean attribute \texttt{Modified} in a relation that collects
1562: information on Web pages. \texttt{Modified} is set to \texttt{True} once the
page has changed, and back to \texttt{False} once the Web crawler has visited
1564: the page. Therefore, once a page has been modified to \texttt{True}, it cannot
1565: be modified back to \texttt{False} before the next visit of the Web
1566: crawler. 
1567: In the analysis of 
1568: Example \ref{ex:binarylump}, one can set $\theta^{\prime}=0$,
resulting in $P_{0,0}^{R,A}(s,f)=e^{-{\theta\Gamma_{R,A}(s,f)}}$ and $P_{0,1}%
^{R,A}(s,f)=1-e^{-{\theta\Gamma_{R,A}(s,f)}}$. \hspace*{\fill}$\Box$
1571: \end{example}
1572: 
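Returning to the general lumping condition of Section~\ref{sec:lumping}, the
following hypothetical three-state chain (not taken from any application)
shows how the condition can be checked by hand. Let $\dom A=\{a,b,c\}$,
ordered as $(a,b,c)$, with
\[
Q^{R,A}=
\begin{pmatrix}
-2 & 1 & 1\\
3 & -4 & 1\\
3 & 2 & -5
\end{pmatrix}
,
\]
and consider the partition $\mathcal{V}=\{V_{1},V_{2}\}$ with $V_{1}=\{a\}$
and $V_{2}=\{b,c\}$. The condition holds trivially for the singleton $V_{1}$.
For the two states of $V_{2}$, we have $q_{b,a}^{R,A}=q_{c,a}^{R,A}=3$ and
$q_{b,b}^{R,A}+q_{b,c}^{R,A}=-4+1=-3=2-5=q_{c,b}^{R,A}+q_{c,c}^{R,A}$, so the
condition holds there as well. The lumped chain on $\{V_{1},V_{2}\}$ therefore
has transition rate $q_{a,b}^{R,A}+q_{a,c}^{R,A}=2$ from $V_{1}$ to $V_{2}$,
and rate $3$ from $V_{2}$ to $V_{1}$.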
1573: \subsubsection{Random walks}
1574: \label{sec:randomwalks} 
1575: Like large non-numeric domains, many numeric domains may
1576: also be cumbersome to model directly via Markov chain techniques. For example,
1577: a 32-bit integer attribute can, in theory, take $2^{32}\approx4\times10^{9}$
1578: distinct values, and it would be virtually impossible to directly form, much
1579: less exponentiate, a full transition rate matrix for a Markov chain of this size.
1580: 
1581: Fortunately, it is likely that such attributes will have ``structured'' value
1582: transition patterns that can be modeled, or at least closely approximated, in
1583: a tractable way. As an example, we consider here a random walk model for
1584: numeric attributes.
1585: 
1586: In this case, we still suppose that the attribute $A$ is modified only at
1587: transition event times that are distributed as described above. Letting
1588: $t_{i}$ denote the time of transition event $i$, with $t_{0}=s$, we suppose
1589: that at transition event $i$, the value of attribute $A$ is modified according
1590: to
1591: \[
1592: r.A(t_{i})=r.A(t_{i-1})+\Delta A_{i},
1593: \]
1594: where $\Delta A_{i}$ is a random variable. We suppose that the random
1595: variables $\left\{  \Delta A_{i}\right\}  $ are IID, that is, they are
1596: independent and share a common distribution with mean $\delta$ and variance
1597: $\sigma^{2}$. Defining
1598: \[
1599: \Delta A(s,f)=\!\!\!\sum_{i:t_{i}\in(s,f]}\!\!\!\!\Delta A_{i},
1600: \]
1601: we obtain that $\{\Delta A(s,f),f\geq s\}$ is a nonhomogeneous compound
1602: Poisson process, and $r.A(f)=r.A(s)+\Delta A(s,f)$. From standard results for
1603: compound Poisson processes, we then obtain for each tuple $r\in R(s)$ that
1604: $\expecop\!\left[  {r.A(f)}\right]  =r.A(s)+\Gamma_{R,A}(s,f)\delta$.
1605: 
1606: It should be stressed that such a model must ultimately be only an
1607: approximation, since a random walk model of this kind would, strictly
1608: speaking, require an infinite number of possible states, while $\dom A$ is
1609: necessarily finite for any real database. However, we still expect it to be
1610: accurate and useful in many situations, such as when $r.A(s)$ and
$\expecop\!\left[  {r.A(f)}\right]  $ are both far from the largest and smallest
1612: possible values in $\dom A$.
1613: 
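Within the range where the random walk approximation is reasonable, the same
standard compound Poisson results also describe the spread around the mean.
Assuming, as in the expression for $\expecop\!\left[  {r.A(f)}\right]  $
above, that the number of transition events in $(s,f]$ is Poisson with mean
$\Gamma_{R,A}(s,f)$ (that is, a unit relative exit rate), one obtains
\[
\varop\!\left[  {r.A(f)}\right]  =\Gamma_{R,A}(s,f)\left(  \sigma^{2}%
+\delta^{2}\right)  .
\]
For instance, with the hypothetical values $r.A(s)=100$,
$\Gamma_{R,A}(s,f)=4$, $\delta=2.5$, and $\sigma^{2}=1$, we have
$\expecop\!\left[  {r.A(f)}\right]  =100+4\cdot2.5=110$ and
$\varop\!\left[  {r.A(f)}\right]  =4\,(1+6.25)=29$, a standard deviation of
about $5.4$.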
1614: \subsubsection{Content-independent overwrites}
1615: 
1616: Consider the simple case in which $P_{u,v}^{R,A}(s,f)$ is independent of $u$
1617: once a transition event has occurred. Let $\mathcal{A}\subseteq\mathcal{C}(R)$
1618: be a set of attributes $A$ with identical $\gamma_{R,A}$ functions, and let
1619: $\Gamma_{R,\mathcal{A}}(s,t)=\Gamma_{R,A}(s,t)$ for any $A\in\mathcal{A}$. We
1620: define a probability distribution $\omega_{R,\mathcal{A}}$ over $\dom
1621: \mathcal{A}$, and assume that at each transition event, a new value for
1622: $\mathcal{A}$ is selected at random from this distribution, without regard to
1623: the prior value of $r.\mathcal{A}$. It is thus possible that a transition
1624: event will leave $r.\mathcal{A}$ unchanged, since the value selected may be
1625: the same one already stored in $r$. For any tuple $r\in R(s)\cap R(f)$ and
1626: $u\in\dom\mathcal{A}$, we thus compute the probability $P_{u,u}^{R,\mathcal{A}%
1627: }(s,f)$ that the value of $r.\mathcal{A}$ remains unchanged at $u$ at time
1628: $f$ to be
1629: \begin{align*}
1630: P_{u,u}^{R,\mathcal{A}}(s,f)  &  =\probop\!\left\{  {\tau_{u}^{R,\mathcal{A}%
1631: }>f-s}\right\}  +\probop\!\left\{  {\tau_{u}^{R,\mathcal{A}}\leq f-s}\right\}
1632: \omega_{R,\mathcal{A}}(u)\\
1633: &  =e^{-\ell_{u}^{R,\mathcal{A}}\Gamma_{R,\mathcal{A}}(s,f)}+\left(
1634: 1-e^{-\ell_{u}^{R,\mathcal{A}}\Gamma_{R,\mathcal{A}}(s,f)}\right)
1635: \omega_{R,\mathcal{A}}(u)\\
1636: &  =e^{-\ell_{u}^{R,\mathcal{A}}\Gamma_{R,\mathcal{A}}(s,f)}\left(
1637: 1-\omega_{R,\mathcal{A}}(u)\right)  +\omega_{R,\mathcal{A}}(u)
1638: \end{align*}
For $u,v\in\dom\mathcal{A}$ such that $u\neq v$, we also compute the
probability $P_{u,v}^{R,\mathcal{A}}(s,f)$ that $r.\mathcal{A}$ changes from $u$ to $v$
in $(s,f]$ to be
1642: \begin{align*}
1643: P_{u,v}^{R,\mathcal{A}}(s,f)  &  =\probop\!\left\{  {\tau_{u}^{R,\mathcal{A}%
1644: }\leq f-s}\right\}  \omega_{R,\mathcal{A}}(v)\\
1645: &  =\left(  1-e^{-\ell_{u}^{R,\mathcal{A}}\Gamma_{R,\mathcal{A}}(s,f)}\right)
1646: \omega_{R,\mathcal{A}}(v).
1647: \end{align*}
1648: 
1649: Content-independent overwrites are a special case of the Markov chain model
1650: discussed above. To apply the general model formulae when content-independent
1651: updates are present, each $\ell_{v}^{R,A}$ is multiplied by $1-\omega
1652: _{R,A}(v)$ and $P_{u,v}^{R,A}=\omega_{R,A}(v)/(1-\omega_{R,A}(u))$ for all
1653: $u,v\in\dom A$, $u\neq v$.
1654: 
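A hypothetical three-valued example illustrates the closed-form probabilities
for content-independent overwrites. Let $\dom\mathcal{A}=\{a,b,c\}$ with
$\omega_{R,\mathcal{A}}(a)=0.5$, $\omega_{R,\mathcal{A}}(b)=0.3$, and
$\omega_{R,\mathcal{A}}(c)=0.2$, and suppose that
$\ell_{a}^{R,\mathcal{A}}\Gamma_{R,\mathcal{A}}(s,f)=1$, so that
$e^{-1}\approx0.3679$. Then
\begin{align*}
P_{a,a}^{R,\mathcal{A}}(s,f)  &  \approx0.3679\cdot(1-0.5)+0.5\approx0.6839,\\
P_{a,b}^{R,\mathcal{A}}(s,f)  &  \approx(1-0.3679)\cdot0.3\approx0.1896,\\
P_{a,c}^{R,\mathcal{A}}(s,f)  &  \approx(1-0.3679)\cdot0.2\approx0.1264,
\end{align*}
and the three probabilities sum to one, up to rounding.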
1655: \subsection{Summary}
1656: 
1657: In this section we have introduced a general Markov-chain model for data
modification, and discussed three simplified models that allow tractable
1659: computation. Using these models, one can compute, in probabilistic terms, value
1660: histograms at time $f$, given a known initial set of value histograms
1661: at time $s<f$. Such a
1662: model could be useful in query optimization, whenever the
1663: continual gathering of statistics becomes impossible due to either heavy system
1664: loads or structural constraints (\emph{e.g.}, federations of databases with
1665: autonomous DBMSs).
1666: 
1667: Generally speaking, computing the transition matrix for an attribute $A$
1668: involves repeated multiplications of square matrices of size $\left|  {\dom
1669: A}\right|  $, resulting in a computational complexity of $\bigO(n{\left|  {\dom
1670: A}\right|  }^{\nu})$, where $n$ is the number of iterations needed to compute
1671: the sum (\ref{eq:matrixexp}) to machine precision. While $n$ is usually small,
1672: $\left|  {\dom
1673: A}\right|  $ may be very large, as demonstrated in Section
1674: \ref{sec:lumping} 
1675: and Section \ref{sec:randomwalks}. Methods such as domain lumping would
1676: require $\bigO(nX^{\nu})$ time, where $X\ll\left|  {\dom
1677: A}\right|  $.\footnote{Here, $n$ may also be affected by the change of
1678: domain.} As for random walks and independent updates, both methods no longer
1679: require repeated matrix multiplications, but rather the computation of
1680: $\Gamma_{R,A}(s,f)$. The complexity of calculating $\Gamma_{R,A}(s,f)$ is
1681: similar to that for $\Lambda_{R}(s,f)$ in Section \ref{sec:lambdacomplex}.
1682: 
1683: \section{Insertion model verification}
1684: \label{sec:verify} 
1685: It is well-known that Poisson processes
1686: model a world where data updates are independent from one another. While in
1687: databases with widely distributed access, \emph{e.g.}, incoming e-mails,
1688: postings to newsgroups, or posting of orders from independent customers, such
1689: an independence assumption seems plausible, we still need to validate the
1690: model against real data. In this section we shall present some initial
1691: experiments as a ``proof of concept.'' These experiments deal only with the
1692: insertion component of the model. Further experiments, including modification
1693: and deletion operations, will be reported in future work.
1694: 
1695: \begin{figure}[ptb]
1696: \begin{center}
1697: \epsfig{file=training.eps,width=6.5in}
1698: \caption{Training data set.}
1699: \label{fig:training}
1700: \end{center}
1701: \end{figure}
1703: 
1704: Our data set is taken from postings to the DBWORLD electronic bulletin board.
The data were collected over more than six months and consist of about 750
insertions, from November 9$^{\text{th}}$, 2000 through May 14$^{\text{th}}$,
1707: 2001. Figure \ref{fig:training} illustrates a data set with 580
1708: insertions during the interval 
1709: [2000/11/9:00:00:00,~2001/3/31:00:00:00). We used the 
1710: Figure \ref{fig:training} data as a \emph{training set}, 
1711: \emph{i.e.}, it serves as our basis for
1712: parameter estimation. Later, in order to test the model, we applied these
1713: parameters to a separate \emph{testing set} covering the period
[2001/3/31:00:00:00,~2001/5/15:00:00:00). In the experiments described below,
we fitted the training data with two insertion-only models, namely a
homogeneous Poisson process and an RPC Poisson process (see Section
\ref{sec:insertion}). We applied each of these two models in two
variations, compound and non-compound. To assess each fit, we
used the Kolmogorov-Smirnov goodness-of-fit
test (see, for example,~\cite[Section 7.7]{HOGG83}). For completeness, we
first review the principles of this statistical test.
1722: 
1723: The Kolmogorov-Smirnov test evaluates the likelihood of a \emph{null
1724: hypothesis} that a given sample may
1725: have been drawn from some
hypothesized distribution.  If the null hypothesis is true, and the
sample has indeed been drawn from the
1728: hypothesized distribution, then the empirical cumulative distribution of the
1729: sample should be close to its theoretical counterpart. If the sample
1730: cumulative distribution is too far from the hypothesized distribution at any
1731: point, that suggests that the sample comes from a different distribution.
1732: Formally, suppose that the theoretical distribution is $F(x)$, and we have $n$
1733: sample values $x_{1},...,x_{n}$ in nondecreasing order. We define an empirical
1734: cumulative distribution $F_{n}(x)$ via
1735: \[
1736: F_{n}(x)=\left\{
1737: \begin{array}
1738: [c]{cl}%
1739: 0, & \text{if }x<x_{1}\\
\frac{k}{n}, & \text{if }x_{k}\leq x<x_{k+1},\ k=1,\ldots,n-1\\
1, & \text{if }x\geq x_{n},
1742: \end{array}
1743: \right.
1744: \]
and then compute $D_{n}=\sup_{k=1,\ldots,n}\{|F_{n}(x_{k})-F(x_{k})|\}$. For
large $n$, given a significance level $\alpha$, the test compares $D_{n}$
against the threshold $X(\alpha)/\sqrt{n}$ and rejects the null hypothesis if
$D_{n}$ exceeds it; here $X(\alpha)$ is a factor depending on the
\emph{significance level} $\alpha$.
For example, $X(0.05)=1.36$ and $X(0.1)=1.22$.  The value of $\alpha$
is the probability of a type I error (a false rejection), that is, the chance that
the null hypothesis is rejected when it is actually true.
Larger values of $\alpha$ lower the threshold and thus make the test harder to pass.
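
As a small worked illustration of the statistic (the four sample values below
are hypothetical, chosen only for this sketch), suppose the hypothesized
distribution is the unit exponential, $F(x)=1-e^{-x}$, and the ordered sample
is $x_{1}=0.2$, $x_{2}=0.5$, $x_{3}=1.0$, $x_{4}=2.0$. Then
\[
\begin{array}{l|cccc}
k & 1 & 2 & 3 & 4\\\hline
F_{4}(x_{k})=k/4 & 0.2500 & 0.5000 & 0.7500 & 1.0000\\
F(x_{k})=1-e^{-x_{k}} & 0.1813 & 0.3935 & 0.6321 & 0.8647\\
|F_{4}(x_{k})-F(x_{k})| & 0.0687 & 0.1065 & 0.1179 & 0.1353
\end{array}
\]
so that $D_{4}\approx0.1353$. A sample this small is far from the large-$n$
regime in which the threshold $X(\alpha)/\sqrt{n}$ applies; it serves only to
show the mechanics. For a sample of the size of our training set, $n=580$, the
thresholds quoted above become $1.36/\sqrt{580}\approx0.0565$ at $\alpha=0.05$
and $1.22/\sqrt{580}\approx0.0507$ at $\alpha=0.1$.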
1753: 
1754: \subsection{Fitting the homogeneous Poisson process}
1755: \label{sec:fithomo}
1756: \begin{figure}[ptb]
1757: \begin{center}
1758: \epsfig{file=kshomo.eps, width=6.5in}
\caption{A comparison of the theoretical and empirical distribution functions
1760: for the homogeneous Poisson process model (a) and the compound homogeneous
1761: Poisson process model (b).}%
1762: \label{fig:kshomo}%
1763: \end{center}
1764: \end{figure}
1766: 
Based on the training set, we estimated the parameter of a homogeneous Poisson
process by averaging the 580 interarrival times; this sample mean is an unbiased
estimator of the mean interarrival time $1/\lambda$. The average interarrival time was computed to be
5:15:19, and thus $\lambda=4.57$ per day. Figure \ref{fig:kshomo}(a)
provides a pictorial comparison of the cumulative
distribution function of the interarrival times with its theoretical
counterpart. We applied the Kolmogorov-Smirnov test to the distribution
of interarrival times, comparing it with an exponential distribution with a
parameter of $\lambda=4.57$. The outcome of the test is $D_{n}=0.106$, which
means we can reject the null hypothesis at any reasonable significance level
$\alpha\geq 0.005$ (for $\alpha=0.005$, the rejection threshold is $0.0718$ for
$n=580$). In all likelihood, then, the data are not derived from a homogeneous
Poisson process.
1780: 
1781: Next, we have applied a compound homogeneous Poisson model.  Our
1782: rationale in this case is that DBWORLD is a moderated list, and the
1783: moderators sometimes work on postings in batches.  
1784: These batches are sometimes posted to the group in tightly-spaced clusters.
1785: For all practical purposes, we
1786: treat each such cluster as a single batch insertion event.
To construct the model, any
two insertions occurring less than one minute apart were
considered to be a single event occurring at the insertion
1790: time of the first arrival. For example, on November 14, 2000, we had three
1791: arrivals, one at 13:43:19, and two more at 13:43:23. All three arrivals are
1792: considered to occur at the same insertion arrival event, with an insertion
1793: time of 13:43:19. Using the compound variation, the data set now has 557
1794: insertion events. The revised average interarrival time is now 5:28:20, and
1795: thus $\lambda=4.39$ per day. Figure \ref{fig:kshomo}(b)
1796: provides a pictorial comparison of the cumulative distribution functions of
1797: the interarrival times, assuming a compound model, with their theoretical
1798: counterpart. We have applied the Kolmogorov-Smirnov test to the distribution
1799: of interarrival times, comparing it with an exponential distribution with a
parameter of $\lambda=4.39$. The outcome was somewhat better than before:
$D_{n}=0.094$. Nevertheless, we can still reject the null hypothesis at any
significance level $\alpha\geq 0.005$ (for $\alpha=0.005$, the rejection
threshold is $0.0733$ for $n=557$). Although the compound variant of the model
fits the data better, it is still not statistically plausible.
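
The conversion from an average interarrival time to a daily rate used in both
fits above is simple arithmetic, reproduced here as a check:
\begin{align*}
5\text{:}15\text{:}19  &  =5+\tfrac{15}{60}+\tfrac{19}{3600}\text{ hours}
\approx5.2553\text{ hours}, & \lambda &  \approx24/5.2553\approx4.57\text{ per day},\\
5\text{:}28\text{:}20  &  =5+\tfrac{28}{60}+\tfrac{20}{3600}\text{ hours}
\approx5.4722\text{ hours}, & \lambda &  \approx24/5.4722\approx4.39\text{ per day}.
\end{align*}
Likewise, the two rejection thresholds quoted for $\alpha=0.005$ are mutually
consistent: $0.0718\sqrt{580}\approx1.73$ and $0.0733\sqrt{557}\approx1.73$,
so both correspond to the same factor $X(0.005)\approx1.73$.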
1805: 
1806: \subsection{Fitting the RPC Poisson process}
1807: \label{sec:rpcfit}%
1808: \begin{table}[tbp] \centering
1809: \begin{tabular}[c]{|l|l|l|l|}\hline
1810: & \textbf{Workdays} & \textbf{Saturday} & \textbf{Sunday}\\\hline\hline
1811: $\lbrack0\text{:}00,3\text{:}00)$ & \multicolumn{1}{|c|}{$2.40$} &
1812: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\cline{1-2}%
1813: $\lbrack3\text{:}00,6\text{:}00)$ & \multicolumn{1}{|c|}{$5.96$} &
1814: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\cline{1-2}%
1815: $\lbrack6\text{:}00,9\text{:}00)$ & \multicolumn{1}{|c|}{$6.04$} &
1816: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\cline{1-2}%
1817: & & \multicolumn{1}{|c|}{$1.50$} & \multicolumn{1}{|c|}{$1.15$}\\
1818: $\lbrack9\text{:}00,18\text{:}00)$ & \multicolumn{1}{|c|}{$7.50$} &
1819: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\
1820: & & \multicolumn{1}{|c|}{}& \multicolumn{1}{|c|}{}\\\cline{1-2}%
1821: $\lbrack18\text{:}00,21\text{:}00)$ & \multicolumn{1}{|c|}{$3.03$} &
1822: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\cline{1-2}
1823: $\lbrack21\text{:}00,24\text{:}00)$ & \multicolumn{1}{|c|}{$2.41$} &
1824: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\hline
1825: \end{tabular}
1826: \caption{Average $\lambda$ levels for the recurrent piecewise-constant 
1827: Poisson model.}
1828: \label{tab:rpclambdas}
1829: \end{table}
1830: 
Next, we tried fitting the data to an RPC model.
Examining the data, we chose a cycle of one week.
Within each week, we used the same pattern for each weekday, with one
interval for work hours (9:00--18:00), plus five additional three-hour
intervals for ``off hours.''
We treated Saturday and Sunday each as one long interval.
Table \ref{tab:rpclambdas}
shows the arrival rate parameters for each segment
of the RPC Poisson model, calculated in much the same manner as
for the homogeneous Poisson model.
1841: 
1842: The specific methodology for structuring the RPC Poisson model is
1843: beyond the scope of this paper and can range from \emph{ad hoc}
1844: ``look and feel'' crafting (as practiced here)
1845: to more established formal processes for
1846: statistically segmenting, filtering, and aggregating intervals
1847: \cite{STOUMBOS97,STOUMBOS2002}. It is worth noting, however, that from
1848: experimenting with different methods, we have found that the model is not
1849: sensitive to slight changes in the interval definitions.  Also, the model we
1850: selected has only $8$ segments, and thus only $8$ parameters, so there
1851: is little danger of ``overfitting'' the training data set, which has over
1852: $500$ observations.
1853: 
1854: Next, we attempted to statistically validate the RPC model. To this end, we use
1855: the following lemma:
1856: 
1857: \begin{lemma}
1858: \label{lem:udistrib}
1859: Given a nonhomogeneous Poisson process with arrival intensity
1860: $\lambda(t)$, the random variable $U_{s}=\int
1861: _{s}^{s+L_{R,s}}\lambda(t)\hspace{0.15em}dt$ is of the distribution
1862: $\expdistrib(1)$.
1863: \end{lemma}
1864: 
1865: \begin{proof}
Let $f_{s}(t)=\Lambda(s,s+t)$, which is a monotonically nondecreasing
function. From Lemma \ref{lem:interarrival}, $\probop
\!\{L_{R,s}<t\}=1-e^{-f_{s}(t)}$ for all $t\geq0$. We have $U_{s}=f_{s}(L_{R,s})$.
By applying the monotonic function $f_{s}$ to both sides of the inequality
$L_{R,s}<t$, one has that $\probop\!\{f_{s}(L_{R,s})<f_{s}(t)\}=\probop
\!\{L_{R,s}<t\}=1-e^{-f_{s}(t)}$ for all $t\geq0$. Substituting in the
definitions of $U_{s}$ and $u=f_{s}(t)$, one then obtains $\probop
\!\{U_{s}<u\}=1-e^{-u}$ for all $u\geq0$, and therefore $U_{s}
\sim\expdistrib(1)$.
1875: \end{proof}
1876: 
1877: Thus, given an instantaneous arrival rate $\lambda(t)$, and a sequence of
1878: observed arrival events $\{t_{n}\}_{n=0}^{N}$, we compute the set of values
1879: $u_{n}=\int_{t_{n-1}}^{t_{n}}\lambda(t)\hspace{0.15em}dt$, $n=1,\ldots,N,$ and
1880: perform a Kolmogorov-Smirnov test of them versus the unit exponential
1881: distribution. 
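
For an RPC model, each such integral reduces to a finite sum, since
$\lambda(t)$ is constant on each segment. As a sketch, write $I_{j}$ for the
segments and $\lambda_{j}$ for the corresponding rates (notation introduced
only for this illustration); then
\[
u_{n}=\sum_{j}\lambda_{j}\,\bigl|[t_{n-1},t_{n}]\cap I_{j}\bigr|,
\]
where $|\cdot|$ denotes the length of an interval. For instance, assuming that
the rates in Table \ref{tab:rpclambdas} are expressed per day (as is the
homogeneous rate above), two consecutive workday arrivals at the hypothetical
times 8:00 and 10:30 straddle the segments $[6\text{:}00,9\text{:}00)$ and
$[9\text{:}00,18\text{:}00)$ and give
\[
u_{n}=6.04\cdot\tfrac{1}{24}+7.50\cdot\tfrac{1.5}{24}\approx0.2517+0.4688\approx0.720.
\]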
1882: 
1883: \begin{figure}[tb]
1884: \begin{center}
1885: \epsfig{file=ksrpc.eps,width=6.5in}
\caption{A comparison of the theoretical and empirical distribution functions of
1887: $U$ for the RPC Poisson model (a) and the compound RPC Poisson model (b).}
1888: \label{fig:ksrpc}%
1889: \end{center}
1890: \end{figure}
1891: 
1892: Figure \ref{fig:ksrpc}(a) provides a
1893: comparison of the theoretical and empirical cumulative distribution of the
1894: random variable $U$. We applied the Kolmogorov-Smirnov test to $U$,
1895: comparing it with an exponential distribution with $\lambda=1$, based on Lemma
1896: \ref{lem:udistrib}. The outcome of the test is $D_{n}=0.080$,
1897: which is better than either homogeneous model, but is still rejected
1898: at any reasonable level of significance
1899: (recall that for $\alpha=0.005$, the rejection threshold is again $0.0718$ for $n=580$). 
1900: 
1901: Finally, we evaluated a compound version of the RPC model, combining
1902: successive postings separated by less than one minute.  We kept the
1903: same segmentation as in Table \ref{tab:rpclambdas}, but recalculated the
1904: arrival intensities in each segment, as shown in
1905: Table~\ref{tab:compoundrpclambdas}. 
1906: 
1907: \begin{table}[tbp] \centering
1908: \begin{tabular}[c]{|l|l|l|l|}
1909: \hline
1910: & \textbf{Workdays} & \textbf{Saturday} & \textbf{Sunday}\\\hline\hline
1911: $\lbrack0\text{:}00,3\text{:}00)$ & \multicolumn{1}{|c|}{$2.40$} &
1912: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\cline{1-2}%
1913: $\lbrack3\text{:}00,6\text{:}00)$ & \multicolumn{1}{|c|}{$5.96$} &
1914: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\cline{1-2}%
1915: $\lbrack6\text{:}00,9\text{:}00)$ & \multicolumn{1}{|c|}{$5.59$} &
1916: \multicolumn{1}{|c|}{$$} & \multicolumn{1}{|c|}{$$}\\\cline{1-2}%
1917: & &
1918: \multicolumn{1}{|c|}{$1.45$} & \multicolumn{1}{|c|}{$1.15$}\\
1919: $\lbrack9\text{:}00,18\text{:}00)$ & \multicolumn{1}{|c|}{$7.11$} &
1920: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\
1921: & & \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\cline{1-2}%
1922: $\lbrack18\text{:}00,21\text{:}00)$ & \multicolumn{1}{|c|}{$3.03$} &
1923: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\cline{1-2}
1924: $\lbrack21\text{:}00,24\text{:}00)$ & \multicolumn{1}{|c|}{$2.33$} &
1925: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\hline
1926: \end{tabular}
1927: \caption{Average $\lambda$ levels for the compound RPC Poisson model.}
1928: \label{tab:compoundrpclambdas}
1929: \end{table}
1930: 
Next we recalculated the sample of the random variable $U$ for the
compound RPC Poisson model, and applied the Kolmogorov-Smirnov
test. In this case, we have $D_{n}=0.050$, so the null hypothesis cannot be
rejected at any significance level up to $\alpha=0.10$ (for $\alpha=0.10$, the
rejection threshold is $0.0517$ for $n=557$).  Figure
1936: \ref{fig:ksrpc}(b) 
1937: shows the
1938: theoretical and empirical distributions of $U$ in this case.
1939: 
1940: As a final confirmation of the applicability of the compound RPC
1941: Poisson model, we attempted to validate the assumption that the number
1942: of postings in successive insertion events are independent and
1943: identically distributed (IID).  In the sample, 536 insertion events
1944: were of size 1, 19 were of size 2, and 2 were of size 3.  Thus, we
1945: approximate the random variable $\Delta^+_R$ as having a $536/557
1946: \approx .962$ probability of being 1, a $19/557 \approx .034$
1947: probability of being 2, and a $2/557 \approx .004$ probability of
1948: being 3.  Validating that the observed insertion batch sizes
1949: $\Delta^+_{R,i}$ appear to be independently drawn from this
1950: distribution is somewhat delicate, since they nearly always take the
1951: value 1.  To compensate, we performed our test on the 
1952: \emph{runs} in the sample, that is, the number of consecutive insertion
events of size 1 between insertions of size 2 or 3.  Our sample
contains 21 runs, ranging in length from 0 to 112.  If the insertion
1955: batch sizes $\{\Delta^+_{R,i}\}$ are independent with the distribution
1956: $\Delta^+_R$, then the length of a run should be a geometric random
1957: variable with parameter $536/557\approx .962$.  We tested this
1958: hypothesis via a Kolmogorov-Smirnov test, as shown in
1959: Figure~\ref{fig:runs}.  The $D_n$ statistic is $0.207$, which is 
1960: within the $\alpha=0.1$ acceptance level for a sample of size $n=21$
1961: (although the divergence of the theoretical and empirical curves in
1962: Figure~\ref{fig:runs} is more visually pronounced than in the prior figures, it
1963: should be remembered that the sample is far smaller).
1964: Thus, the assumption that the insertion batch sizes
1965: $\{\Delta^+_{R,i}\}$ are IID is plausible.
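
For reference, the geometric model used in this test can be written out
explicitly. Writing $p=536/557$ for the probability that an insertion event
has size 1 and $K$ for the length of a run (symbols introduced only for this
sketch, and assuming that $K$ counts the size-1 events preceding a larger
event), independence of the batch sizes gives
\[
\probop\!\{K=k\}=p^{k}(1-p),\qquad\probop\!\{K\leq k\}=1-p^{k+1},\qquad
k=0,1,2,\ldots,
\]
with mean $\expecop\!\left[  K\right]  =p/(1-p)=536/21\approx25.5$.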
1966: 
1967: \begin{figure}[tb]
1968: \begin{center}
1969: \epsfig{file=runs.eps}
\caption{Empirical and theoretical distributions for number of
single arrivals between multiple arrivals, compound RPC Poisson model.}
1972: \label{fig:runs}
1973: \end{center}
1974: \end{figure}
1975: 
1976: \begin{table}[tbp] \centering
1977: \begin{tabular}[c]{|l|c|c|}\hline
1978: \textbf{Model} & $D_{n}$ & \textbf{Rejection level}\\\hline\hline
1979: Homogeneous & {$0.106$} & {$<0.005$}\\\hline
1980: Homogeneous+compound & {$0.094$} &
1981: {$<0.005$}\\\hline
1982: RPC & {$0.080$} & {$<0.005$}\\\hline
1983: RPC+compound & {$0.050$} & {$>0.100$}\\\hline
1984: \end{tabular}
1985: \caption{Goodness of fit of the four models.}
1986: \label{tab:goodfit}
1987: \end{table}
1988: 
Table \ref{tab:goodfit} compares the goodness of fit of
the four models to the training data. For each of the models, we list
the KS test statistic ($D_{n}$) and the significance level at which one can reject the null
hypothesis. The higher the significance level that the null hypothesis
withstands, the better the fit. The
compound RPC Poisson model fits the data set best: its null
hypothesis cannot be rejected at any level up to $0.1$, which in practice means that the model
fits the data well. The main conclusion from these experiments is that
the simple homogeneous Poisson process model is limited to
a restricted class of applications (one of which was suggested in
\cite{CHO00}). Therefore, there is a need for a more elaborate model, as
1999: suggested in this paper, to capture a broader range of update behaviors. A
2000: nonhomogeneous model consisting of just 8 segments per week, 
2001: as we have constructed, seems to model the arrivals significantly
2002: better than the homogeneous approach.
2003: 
2004: \section{Content evolution cost model}
2005: 
2006: \label{costmodel} We now develop a cost model suitable for
2007: transcription-scheduling applications such as those described in
2008: Example~\ref{ex:replica}. The question is how often to generate a remote
2009: replica of a relation $R$. We have suggested one such policy in Example
2010: \ref{ex:fat}. In this section, we shall introduce two more policies and
2011: show an empirical comparison based on the data introduced in Section
2012: \ref{sec:verify}.
2013: 
2014: A transcription policy aims to minimize the combined cost of
2015: \emph{transcription cost} and \emph{obsolescence cost}~\cite{GAL99c}. The
2016: former includes the cost of connecting to a network and the cost of
2017: transcribing the data, and may depend on the time at which the transcription
2018: is performed (\emph{e.g.}, as a function of network congestion), and the
2019: length of connection needed to perform the transcription. The obsolescence
2020: cost captures the cost of using obsolescent data, and is 
2021: basically a function of the
2022: amount of time that has passed since the last transcription.
2023: 
In what follows, let the set $\{b_{i},e_{i}\}_{i=1}^{\infty}$ represent an
infinite sequence of connectivity periods between a client and a server.
2026: During session $i$, the client data is synchronized with the state of the
2027: server at time $b_{i}$, the information becoming available at the client at
2028: time $e_{i}$. At the next session, beginning at time $b_{i+1}$, the client is
2029: updated with all the information arriving at the server during the interval
2030: $(b_{i},b_{i+1}]$, which becomes usable at time $e_{i+1}$, and so forth. We
2031: define $b_{0}=e_{0}=0$, and require that $0<b_{1}\leq e_{1}<b_{2}\leq
2032: e_{2}<\ldots$.
2033: 
2034: Let $C_{R,\text{u}}(s,f)$ denote the cost of performing a transcription of $R$
2035: starting at time $f$, given that the last update was started at time $s$. Let
2036: $C_{R,\text{o}}(s,f)$, to be described in more detail later, denote the
2037: obsolescence cost through time $f$ attributable to tuples inserted into $R$ at
2038: the server during the time interval $(s,f]$. Then the total cost $C_{R}(t)$
2039: through time $t$ is
2040: \begin{equation}
2041: C_{R}(t)=\sum_{i:b_{i}\leq t}\!\!
2042: \Big(  
2043: \alpha C_{R,\text{u}}(b_{i-1},b_{i})
2044: +(1-\alpha)C_{R,\text{o}}(b_{i-1},b_{i})
2045: \Big) + (1-\alpha)C_{R,\text{o}}(b_{i^*(t)},t), 
2046: \label{costformula}
2047: \end{equation}
2048: where $i^*(t)=\max\left\{i\;\big|\;b_{i}\leq t\right\}$ and
2049: $\alpha$ serves as the ratio of importance a user puts on the
2050: transcription cost versus the obsolescence cost. Traditionally, $\alpha=0$,
2051: and therefore $C_{R}(t)$ is minimized for $C_{R,\text{o}}(b_{i-1},b_{i})=0$,
2052: $\forall b_{i}<t$, allowing the use of current data only. In this section we
2053: shall look into another, more realistic approach, where data currency is
2054: sacrificed (up to a level defined by the user through $\alpha$) for the sake
2055: of reducing the transcription cost. Ideally, one would want to choose the
2056: sequence $\{b_{i},e_{i}\}_{i=1}^{\infty}$ of connectivity periods, subject to
2057: any constraints on their durations $e_{i}-b_{i}$, to minimize $C_{R}(t)$ over
2058: some time horizon $t$. One may also consider the asymptotic problem of
2059: minimizing the average cost over time, $\lim_{t\rightarrow\infty}C_{R}(t)/t$.
2060: We note that the presence of $\alpha$ is not strictly required, as its effects
2061: could be subsumed into the definitions of the $C_{R,\text{u}}$ and
2062: $C_{R,\text{o}}$ functions, especially if both are expressed in natural
2063: monetary units. However, we retain $\alpha$ in order to demonstrate some of
2064: the parametric properties of our model.
2065: 
2066: In general, modeling transcription and obsolescence costs may be difficult and
2067: application-dependent. They may be difficult to quantify and difficult to
2068: convert to a common set of units, such as dollars or seconds. Some subjective
2069: estimation may be needed, especially for the obsolescence costs. However, we
maintain that, rather than avoiding the subject altogether, it is best to try to
construct these cost models and then use them, perhaps parametrically, to
evaluate transcription policies. Any transcription policy implicitly makes
some trade-off between consuming network resources and incurring
obsolescence, so it is
best to try to quantify the trade-off and see if a better policy exists. In
2076: particular, one should try to avoid policies that are clearly \emph{dominated}%
2077: , meaning that there is another policy with the same or lower transcription
2078: cost, and strictly lower obsolescence, or \emph{vice versa}. Below, for
2079: purposes of illustration, we will give one simple, plausible way in which the
2080: cost functions may be constructed; alternatives are left to future research.
2081: 
2082: \subsection{Transcription costing example}
2083: In determining the transcription cost, one may use existing research into
costs of distributed query execution strategies. Typically (\emph{e.g.},
\cite{LOHMAN85}), the transcription time can be computed as some function of
2086: the CPU and I/O time for writing the new tuples onto the client and the cost
2087: of transmitting the tuples over a network. There is also some fixed setup time
2088: to establish the connection, which can be substantial. For purposes of
2089: example, suppose that
2090: \begin{align*}
2091: C_{R,\text{u}}(s,f)  &  =c+\beta\cdot\left(  X_{R}(s,f)+Y_{R}^{+}(s,f)+\left|
2092: R(s)\right|  -Y_{R}(s,f)\right) \\
2093: &  = c+\beta\cdot\left(  X_{R}(s,f)+\left|  R(s)\right|  -Y_{R}^{-}%
2094: (s,f)\right)
2095: \end{align*}
Here, $c\geq0$ denotes the fixed setup cost, $\beta\geq0$ is the cost per
transmitted tuple, $X_{R}(s,f)$
denotes the number of tuples inserted during the interval $(s,f]$ that survive
through time $f$, $Y_{R}^{+}(s,f)$ is the number of tuples that
survive but are
modified by time $f$, and $\left|  R(s)\right|  -Y_{R}(s,f)$ is the number of
deleted tuples; the second line follows from
$Y_{R}(s,f)=Y_{R}^{+}(s,f)+Y_{R}^{-}(s,f)$. For deleted tuples, it may suffice to transmit only the
2102: primary key of each deleted tuple, incurring a unit cost of less than
2103: $\beta$. For sake of simplicity, however, we use the same cost factor
2104: $\beta$ for deletion, insertion, and modification. We note that, under
2105: this assumption,
2106: \[
2107: \sum_{i:b_{i}\leq t}C_{R,\text{u}}(b_{i-1},b_{i})=n(t)c+\beta\left|
2108: R(s)\right|  +\beta\sum_{i:b_{i}\leq t}\left(  X_{R}(b_{i-1},b_{i})-Y_{R}%
2109: ^{-}(b_{i-1},b_{i})\right)  ,
2110: \]
2111: where $n(t)$ is the number of transcriptions in the interval $[0,t]$. For the
2112: special case that there are no deletions or modifications, 
2113: $\beta\left|  R(s)\right|
2114: +\beta\left(  X_{R}(s,f)-Y_{R}^{-}(s,f)\right)  =\beta B(s,f)$ and
2115: \[
\sum_{i:b_{i}\leq t}C_{R,\text{u}}(b_{i-1},b_{i})=n(t)c+\beta B(0,b_{i^{\ast
}(t)}).
2118: \]
For large $t$, one would expect the $\beta B(0,b_{i^{\ast}(t)})$ term to be
roughly comparable across most reasonable policies, whereas the $n(t)c$ term
may vary widely for any value of $t$. It is worth noting that $c$ and $\beta$
could be generalized to vary with time or other factors.
For example, due to network congestion, certain
times of day may have higher unit transcription costs than others.
Also, transcribing via an airline-seat telephone costs substantially more than
connecting via a cellular phone. For simplicity, we have refrained from
2127: discussing such variations in the transcription cost.
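
To illustrate why the $n(t)c$ term drives the comparison, consider the purely
hypothetical values $c=10$, $\beta=1$, and $B(0,b_{i^{\ast}(t)})=100$ in the
insertion-only case. A policy that performs $n(t)=4$ transcriptions then
incurs a total transcription cost of
\[
4\cdot10+1\cdot100=140,
\]
whereas a policy with $n(t)=20$ incurs $20\cdot10+1\cdot100=300$, even though
both have transmitted the same $100$ tuples by their last session.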
2128: 
2129: \subsection{Obsolescence costing example}
2130: We next turn our attention to the obsolescence cost, which is clearly a
2131: function of the update time of tuples and the time they were transcribed to
2132: the client. Intuitively, the shorter the time between the update of a tuple
2133: and its transcription to the client, the better off the client would be. As a
2134: basis for the obsolescence cost, we suggest a criterion that takes into
2135: account user preferences, as well as the content evolution parameters. For any
2136: relation $R$, times $s<f$, and tuple $r\in R(s)\cup R(f)$, let $b(r)$ and
$d(r)$ denote the times at which $r$ was inserted into and deleted from $R$,
2138: respectively. We let $\iota_{r}(s,f)$ be some function denoting the
2139: contribution of tuple $r$ to the obsolescence cost over $(s,f]$; we will give
2140: some more specific example forms of this function later. We then make the
2141: following definition:
2142: 
2143: \begin{definition}
The total \emph{obsolescence cost} of a relation $R$ over the time interval
$(s,f]$, denoted $C_{R,\text{o}}(s,f)$, is defined to be
$C_{R,\text{o}}(s,f)\triangleq\sum_{r\in R(s)\cup R(f)}\iota
_{r}(s,f)$.\hspace*{\fill}$\Box$
2148: \end{definition}
2149: 
2150: Our principal concern is with the \emph{expected} 
2151: obsolescence cost, that is, the
2152: expected value of $C_{R,\text{o}}(s,f)$,
2153: \[
2154: \expecop\!\left[  C_{R,\text{o}}(s,f)\right]  = \expecop\!\left[
2155: \sum_{r\in R(s)\cup R(f)}\!\!\!\!\!\!\!\!\iota_{r}(s,f)\right]  .
2156: \]
2157: To compute $\expecop\!\left[  C_{R,\text{o}}(s,f)\right]  $, we note that
2158: \[
2159: \expecop\!\left[  C_{R,\text{o}}(s,f)\right]  =
2160: \expecop\!\left[  \sum_{r\in R(s)\cap R(f)}\!\!\!\!\!\!\!\!
2161: \iota_{r}(s,f)\right]  
2162: +\expecop\!\left[  {\sum_{r\in R(s)\backslash R(f)}\!\!\!\!\!\!\!\!
2163: \iota_{r}(s,f)}\right]  
2164: +\expecop\!\left[{\sum_{r\in R(f)\backslash R(s)}\!\!\!\!\!\!\!\!
2165: \iota_{r}(s,f)}\right]  .
2166: \]
2167: The three terms in the last expression represent potentially modified tuples,
2168: deleted tuples, and inserted tuples, respectively. We denote these three terms
2169: by $\hat{\iota}_{R}^{\modification}(s,f)$, $\hat{\iota}_{R}^{\deletion
2170: }(s,f)$, and $\hat{\iota}_{R}^{\medspace\insertion}(s,f)$, respectively,
2171: whence
2172: \[
2173: \expecop\!\left[  C_{R,\text{o}}(s,f)\right]  =\hat{\iota}_{R}^{\modification
2174: }(s,f)+\hat{\iota}_{R}^{\deletion
2175: }(s,f)+\hat{\iota}_{R}^{\medspace\insertion}(s,f).
2176: \]
2177: 
2178: \subsection{Obsolescence for insertions}
2179: \label{sec:insertobs}
2180: We will now consider a specific
2181: metric for computing the obsolescence stemming from insertions in $(s,f]$, as
2182: follows:
2183: \begin{equation}
2184: \iota_{r}^{\insertion}(s,f)=\left\{
2185: \begin{array}
2186: [c]{ll}%
2187: g^{\insertion}(s,f,b(r)) & s<b(r)\leq f<d(r)\\
2188: 0 & \text{otherwise},%
2189: \end{array}
2190: \right.  \label{iota}%
2191: \end{equation}
2192: where $g^{\insertion}(s,f,t)$ is some application-dependent function representing the
2193: level of importance a user assigns, over the interval $(s,f]$,
2194: to a tuple arriving at a time $t$. For
2195: example, in an e-mail transcription application, a user may attach greater
2196: importance to messages arriving during official work hours, and a lesser
2197: measure of importance to non-work hours (since no one expects her to be
2198: available at those times). Thus, one might define
2199: \begin{equation}
2200: g^{\insertion}(s,f,t)=\int_{t}^{f}a(\tau)\hspace{0.15em}d\tau,\quad\text{where }%
2201: a(\tau)=\left\{
2202: \begin{array}
2203: [c]{ll}%
2204: a_{1}, & \text{if }\tau\text{ is during work hours}\\
2205: a_{2}, & \text{if }\tau\text{ is after hours,}%
2206: \end{array}
2207: \right.  \label{importanceformula}%
2208: \end{equation}
2209: and $a_{1}\geq a_{2}$. For $a_{1}=a_{2}=1$, $g^{\insertion}(s,f,t)$ 
2210: takes a form resembling the age of a local element in~\cite{CHO00}.
2211: More complex forms of $g^{\insertion}(s,f,t)$ are certainly possible.  In this
2212: simple case, we refer to $a_1/a_2$ as the \emph{preference ratio}.
2213: 
2214: Using the properties of nonhomogeneous Poisson processes, we calculate
2215: \begin{align*}
2216: \hat{\iota}_{R}^{\medspace\insertion}(s,f) &  =\expecop\!\left[  {\sum_{r\in
2217: R(f)\backslash R(s)}\!\!\!\!\!\!\!\!}\iota_{r}(s,f)\right]  \\
&  =\expecop\!\left[  {X}_{R}{(s,f)}\right]  \cdot\expecop\!\left[
{g^{\insertion}(s,f,b(r))\;\big|\;s<b(r)\leq f<d(r)}\right]  \\
2220: &  =\widetilde{\Lambda}_{R}(s,f)\expecop\!\left[  {\Delta_{R}^{+}}\right]
2221: \int_{s}^{f}\frac{\lambda_{R}(t^{\prime})}{\widetilde{\Lambda}_{R}%
2222: (s,f)}g^{\insertion}(s,f,t^{\prime})dt^{\prime}\\
2223: &  =\expecop\!\left[  {\Delta}_{R}^{+}\right]  \int_{s}^{f}\lambda
2224: _{R}(t^{\prime})g^{\insertion}(s,f,t^{\prime})dt^{\prime}.
2225: \end{align*}%
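
As a sketch of how this integral evaluates for a piecewise-constant intensity,
suppose (for illustration only) that $\expecop\!\left[  {\Delta}_{R}^{+}\right]
=1$, that $a_{1}=a_{2}=1$ so that $g^{\insertion}(s,f,t)=f-t$, and that
$\lambda_{R}(t)$ equals $\lambda_{1}$ on $(s,m]$ and $\lambda_{2}$ on $(m,f]$
for some breakpoint $s<m<f$; the symbols $\lambda_{1}$, $\lambda_{2}$, and $m$
are introduced only for this example. Then
\begin{align*}
\hat{\iota}_{R}^{\medspace\insertion}(s,f)  &  =\lambda_{1}\int_{s}^{m}
(f-t)\hspace{0.15em}dt+\lambda_{2}\int_{m}^{f}(f-t)\hspace{0.15em}dt\\
&  =\tfrac{1}{2}\lambda_{1}\left[  (f-s)^{2}-(f-m)^{2}\right]  +\tfrac{1}
{2}\lambda_{2}(f-m)^{2}.
\end{align*}
When $\lambda_{1}=\lambda_{2}=\lambda$, the expression collapses to
$\tfrac{1}{2}\lambda(f-s)^{2}$, so the cost grows quadratically in the time
elapsed since the last transcription.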
2226: 
2227: \begin{example}
2228: [Transcription policies using the expected obsolescence cost]
2229: Consider the insertion-only data set of Section~\ref{sec:verify}.
2230: Figure
2231: \ref{fig:ttimes} compares two transcription
2232: policies for the week 
2233: $[2001/4/2\mathrm{:}0\mathrm{:}00,2001/4/8\mathrm{:}0\mathrm{:}00)$. The
2234: transcription policy in Figure \ref{fig:ttimes}(a) 
2235: (referred to below as the \emph{uniform synchronization
2236: point} --- USP --- policy) was suggested in
2237: \cite{CHO00}. According to this policy, the intervals $(s,f]$ are always of
2238: the same size. The decision regarding the interval size $f-s$ may be either
2239: arbitrary (\emph{e.g.}, once a day) or may depend on $\lambda$, the Poisson
2240: model parameter (in which case a homogeneous Poisson process is implicitly
2241: assumed). The policy may be expressed as $f=s+{M}/{\lambda}$ for some
multiplier $M>0$. With $M=1$ (as suggested in
\cite{CHO00}) and $\lambda=4.57$ per day, as computed from the training data,
one would refresh the database every 5:15:19.
2245: Figure \ref{fig:ttimes}(a) shows
2246: the transcription times resulting from the 
2247: USP policy.
2248: 
2249: \begin{figure}[tbp]
2250: \begin{center}
2251: \epsfig{file=ttimes.eps,width=6.5in}
2252: \caption{Transcription times for the USP and RPC/threshold policies.}%
2253: \label{fig:ttimes}%
2254: \end{center}
2255: \end{figure}
2256: 
2257: Consider now another transcription policy, dubbed the
2258: \emph{threshold} policy. With this policy, given that the last connection
2259: started at time $s$, we transcribe at time $f$ if the expected obsolescence
2260: cost from insertions ($\hat{\iota}_{R}^{\medspace\insertion}(s,f)$) exceeds
2261: $\Pi$, where $\Pi$ is a threshold that measures the user's tolerance to
2262: obsolescent data. In comparing the two policies, one can compute $\Pi$, given
2263: $M$, as follows. Consider the homogeneous case where $\expecop\!\left[
2264: {\Delta}_{R}^{+}\right]  =1$ and $\lambda_{R}(t)=\lambda_{R}$ for all $t$.
2265: Assume further that $a_{1}=a_{2}=1$ for all $t$. In this case,
2266: \[
\hat{\iota}_{R}^{\medspace\insertion}(s,f)=\int_{s}^{f}\lambda_{R}%
\cdot(f-t)\,dt=\lambda_{R}\int_{s}^{f}(f-t)\,dt=\frac{1}{2}\lambda
_{R}\cdot(f-s)^{2}.%
2270: \]
2271: Setting $f=s+M/\lambda_{R}$ and $\Pi=\hat{\iota}_{R}^{\medspace\insertion
2272: }(s,f)$, one has that
2273: \[
\Pi=({1}/{2})\lambda_{R}\cdot(f-s)^{2}=({1}/{2})\lambda_{R}\cdot\left(
{M}/{\lambda}_{R}\right)  ^{2}={M^{2}}/{(2\lambda_{R})}.
2276: \]
2277: Figure \ref{fig:ttimes}(b) shows the
2278: transcription times using the RPC arrival model (see Section 
2279: \ref{sec:fithomo}) and the threshold policy with
$\Pi=0.109$ (obtained by setting $M=1$ and $\lambda=4.57$ per day, and letting
$\Pi={M^{2}}/{(2\lambda)}$). It is worth noting that transcriptions are more
frequent when the arrival intensity $\lambda$ is higher and less frequent when the
arrival rate is expected to be lower.
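
As a quick consistency check of this calibration (a sketch that uses only the values quoted above and the homogeneous assumptions of this example), note that the threshold rule can be inverted for the interval length:
\[
\tfrac{1}{2}\lambda(f-s)^{2}=\Pi
\quad\Longrightarrow\quad
f-s=\sqrt{2\Pi/\lambda}.
\]
With $M=1$ and $\lambda=4.57$ per day, $\Pi=1/(2\cdot4.57)\approx0.109$ and
$f-s=\sqrt{2\Pi/\lambda}=M/\lambda=1/4.57\approx0.219$ days. Under a constant
arrival rate the threshold policy therefore reproduces the USP schedule, and
the two can differ only when $\lambda_{R}(t)$ varies over time.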
2284: 
2285: \begin{figure}[tbp]
2286: \begin{center}
2287: \epsfig{file=thpolicy.eps,width=6.5in}
2288: \caption{Threshold policy, homogeneous vs. RPC.}%
2289: \label{fig:thpolicy}%
2290: \end{center}
2291: \end{figure}
2292: 
2293: We have performed experiments comparing the performance of the threshold
2294: policy for the homogeneous Poisson model (equivalent to the USP
2295: policy) and the
2296: RPC Poisson model. Figure \ref{fig:thpolicy} shows
2297: representative results, with costs computed over the testing set. 
2298: Figure \ref{fig:thpolicy}(a)
2299: displays the obsolescence cost and the number of transcriptions for various
2300: $M$ values, with a preference ratio $a_2/a_1=4$. For all $M$ values, there is
2301: no dominant model. For example, for $M=1$, the RPC model has a slightly higher
2302: obsolescence cost (43.02 versus 42.35, a 1.6\% increase) 
2303: and a significantly lower
2304: number of transcriptions (137 versus 204, a 32.8\% decrease).
2305: 
2306: Figure \ref{fig:thpolicy}(b) provides a comparison of
2307: combined normalized obsolescence and transcription costs for both insertion
2308: models and $\alpha\in\{0.6,0.7,0.8\}$ (still assuming a 4:1 preference ratio).
Solid lines represent results for the homogeneous Poisson model,
while dotted lines represent results for the RPC Poisson model.
2311: Generally speaking, the RPC model performs better for small $M$ values
2312: ($M\leq 7$), 
2313: while the homogeneous model performs better for the largest $M$ values
2314: ($M \geq 8$). \hspace*{\fill}$\Box$
2315: \end{example}
2316: 
2317: \begin{example}
2318: [Comparison of USP, threshold and FA policies]
2319: Once again with the data from Section~\ref{sec:verify}, 
2320: we consider one more transcription policy, the first alteration (FA) policy
2321: derived from the analysis
2322: of Example \ref{ex:fat}.  Since there are no deletions or
2323: modifications, $Z_R(s,f)$ simplifies to $\Lambda_R(s,f)$.  We choose
2324: $\pi$ in the FA policy 
2325: to be a function of $M$ such that the transcription intervals
2326: agree with the USP policy in the case of the homogeneous model.
2327: Figure \ref{fig:3policies} 
2328: compares the performance of all three transcription policies: 
2329: USP, threshold, and FA,
2330: for a 4:1 preference ratio and
2331: $\alpha=0.8$, using the testing data set to compute the costs. 
For $M=1$, the threshold and FA policies perform similarly, with the FA policy slightly better; both outperform the USP policy. 
The threshold policy is best for $M\in\{2,\dots,8\}$. For all $M>8$, the USP
policy is best. The best policy for this choice of $a_{2}/a_{1}$ and $\alpha$
is threshold with $M=6$, followed closely by FA with $M\in\{5,6\}$. We have
conducted our experiments with various $\alpha$ values and our conclusion is
that the threshold policy is preferred over the USP policy for larger $\alpha$, that is, the more the user is willing to sacrifice currency for the sake of reducing transcription cost.\hspace*{\fill}$\Box$
2338: \end{example}
2339: 
2340: \begin{figure}[tbp]
2341: \begin{center}
2342: \epsfig{file=fapolicy.eps,width=5.4in,height=2.9in}
2343: \caption{Transcription schedule based on the first alteration policy.}
2344: \label{fig:fapolicy}
2345: \end{center}
2346: \end{figure}
2347: 
2348: \begin{figure}[tbp]
2349: \begin{center}
2350: \epsfig{file=3policies.eps,width=5.4in,height=2.9in}
2351: \caption{Comparison of three policies, for a 4:1 preference ratio and
2352: $\alpha=0.5$.}
2353: \label{fig:3policies}
2354: \end{center}
2355: \end{figure}
2356: 
2357: 
2358: \subsection{Obsolescence for deletions}
2359: In a similar manner to Section \ref{sec:insertobs}, 
2360: we will consider the following
2361: metric for computing the obsolescence stemming from deletions in $(s,f]$. We
2362: compute $\iota_{r}^{\deletion}(s,f)$ via
2363: \begin{equation}
2364: \iota_{r}^{\deletion}(s,f)=\left\{
2365: \begin{array}
2366: [c]{ll}%
2367: g^{\deletion}(s,f,d(r)) & b(r)\leq s<d(r)\leq f\\
2368: 0 & \text{otherwise}%
2369: \end{array}
2370: \right.
2371: \end{equation}
2372: where $g^{\deletion}(s,f,t)$ is some application-dependent
2373: function,  possibly similar to $g^{\insertion}(s,f,t)$ above.
2374: 
2375: 
2376: Using the properties of nonhomogeneous Poisson processes, we calculate
2377: \begin{align*}
2378: \hat{\iota}_{R}^{\deletion}(s,f)  &  =\expecop\!\left[  {\sum_{r\in
2379: R(s)\backslash R(f)}\!\!\!\!\!\!\!\!}\iota_{r}(s,f)\right] \\
2380: &  =\left(  \left|  R(s)\right|  -\expecop\!\left[  {Y}_{R}{(s,f)}\right]
2381: \right)  \expecop\!\left[  g^{\deletion}(s,f,d(r)){\;\big|\;b(r)\leq s<d(r)\leq
2382: f}\right] \\
2383: &  =\left(  \left|  R(s)\right|  -p_{R}(s,f)\left|  {R(s)}\right|  \right)
2384: \expecop\!\left[  g^{\deletion}(s,f,d(r)){\;\big|\;b(r)\leq s<d(r)\leq f}\right] \\
2385: &  =\left|  R(s)\right|  \left(  1-p_{R}(s,f)\right)  
2386: \expecop\!\left[  g^{\deletion}(s,f,d(r)){\;\big|\;b(r)\leq s<d(r)\leq f}\right]
2387: \end{align*}
2388: In the case $\langle R,S\rangle$ has fixed multiplicity for all $S\in S(R)$,
2389: $p_{R}(s,f)=\exp(-\widetilde{M}_{R}(s,f))$, where $\widetilde{M}_{R}%
2390: (s,f)=\int_{s}^{f}\tilde{\mu}_{R}(t)\hspace{0.15em}dt$ and $\tilde{\mu}%
2391: _{R}(t)=\!\!\sum_{S\in S(R)}w(R,S)\mu_{S}(t)$. Therefore,%
2392: \begin{align*}
\hat{\iota}_{R}^{\deletion}(s,f)  &  =\left|  R(s)\right|  \left(
2394: 1-\exp(-\widetilde{M}_{R}(s,f))\right)  \int_{s}^{f}\frac{\tilde{\mu}%
2395: _{R}(t^{\prime})}{\widetilde{M}_{R}(s,f)}g^{\deletion}(s,f,t^{\prime})dt^{\prime}\\
2396: &  =\left|  R(s)\right|  \left(  \frac{1-\exp(-\widetilde{M}_{R}%
2397: (s,f))}{\widetilde{M}_{R}(s,f)}\right)  \int_{s}^{f}\tilde{\mu}%
2398: _{R}(t^{\prime})\cdot g^{\deletion}(s,f,t^{\prime})dt^{\prime}%
2399: \end{align*}
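As an illustration (a sketch under two assumptions not made above: a constant
aggregate deletion rate $\tilde{\mu}_{R}(t)=\tilde{\mu}$, and the linear cost
$g^{\deletion}(s,f,t)=f-t$, chosen by analogy with the insertion example), we
have $\widetilde{M}_{R}(s,f)=\tilde{\mu}(f-s)$ and
\[
\int_{s}^{f}\tilde{\mu}\,(f-t^{\prime})\,dt^{\prime}=\tfrac{1}{2}\tilde{\mu}(f-s)^{2},
\]
so that the expression above reduces to
\[
\left|  R(s)\right|  \left(  1-e^{-\tilde{\mu}(f-s)}\right)  \frac{f-s}{2}.
\]
In words, this is the expected number of deleted tuples times an average
staleness of half the interval length.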
2400: 
2401: \subsection{Obsolescence for modification}
2402: We now consider obsolescence costs relating to modifications.
2403: While, in some applications, a user may be
primarily concerned with how many tuples were modified during $(s,f]$, 
2405: we believe that a more general, attribute-based framework is warranted
2406: here, taking into account exactly how each tuple was changed.
2407: Therefore, we define $\iota_{r,A}(s,f)$ to be some function denoting the
2408: contribution of attribute $A\in\mathcal{A}(R)$ in tuple $r$ to the
2409: obsolescence cost over $(s,f]$ and assume that%
2410: \[
2411: \iota_{r}(s,f)=\!\!\!\sum_{A\in\mathcal{A}(R)}\!\!\!\iota_{r,A}(s,f)
2412: \]
2413: Therefore,
2414: \begin{align*}
2415: \hat{\iota}_{R}^{\modification}(s,f)  & 
2416: =\expecop\!\left[  \sum_{r\in R(s)\cap R(f)}\!\!\!\!\!\iota_{r}(s,f)\right]  \\
2417: & =\expecop\!\left[  \sum_{A\in\mathcal{A}(R)}\;\sum_{r\in R(s)\cap R(f)}
2418: \!\!\!\!\!\!\!\iota_{r,A}(s,f)\right]  \\
2419: & =\sum_{A\in\mathcal{A}(R)}\!\!\!\hat{\iota}_{R,A}^{\modification}(s,f)
2420: \end{align*}
2421: where $\hat{\iota}_{R,A}^{\modification}(s,f)$ 
2422: is the expected obsolescence cost due to modifications to $A$ during
2423: $(s,f]$.  Assuming that attributes not in $\mathcal{C}(R)$ incur zero
2424: modification cost, the last sum may be taken over $\mathcal{C}(R)$
2425: instead of $\mathcal{A}(R)$.
2426: 
We start by introducing the notion of a distance metric and provide
2428: two models of $\iota_{r,A}(s,f)$, for numeric and non-numeric domains. We then
2429: provide an explicit description of $\hat{\iota}_{R,A}^{\modification
2430: }$, based on distance metrics. 
2431: 
2432: \subsubsection{General distance metrics}
Let $c_{u,v}^{R,A}$, where $u,v\in\dom A$, denote the
2434: elements of a matrix of costs for an attribute $A$. We declare that if
2435: $r.A(s)=u$ and $r.A(f)=v$, then $\iota_{r,A}(s,f)=c_{u,v}^{R,A}$, or
2436: equivalently,
2437: \[
2438: \iota_{r,A}(s,f)=c_{r.A(s),r.A(f)}^{R,A}.
2439: \]
2440: Consequently, we require that $c_{u,u}^{R,A}=0$ for all $u\in\dom A$, so that
2441: an unchanged attribute field yields a cost of zero.
2442: 
2443: \paragraph{A squared-error metric for numeric domains:}
2444: For numeric domains, that
2445: is, $A\in\mathcal{N}$, we propose a squared-error metric, as is standard in
2446: statistical regression models. In this case, we let
2447: \[
2448: \iota_{r,A}(s,f)=c_{r.A(s),r.A(f)}^{R,A}=k_{R,A}(s){\left(
2449: r.A(f)-r.A(s)\right)  }^{2},
2450: \]
2451: where $k_{R,A}(s)$ is a user-specified
2452: scaling factor. A typical choice for the scaling
2453: factor would be the reciprocal ${1}/\left(  {\varop_{r\in R(s)}\!\left[
2454: {r.A(s)}\right]  }\right)  $ of the 
2455: variance of attribute $A$ in $R$ at time $s$,
2456: \begin{align*}
2457: \varop_{r\in R(s)}\!\left[  {r.A(s)}\right]   &  =\expecop_{r\in
2458: R(s)}\!\left[  {{\left(  r.A(s)-\expecop_{r\in R(s)}\!\left[  {r.A(s)}\right]
2459: \right)  }^{2}}\right] \\
2460: &  =\expecop_{r\in R(s)}\!\left[  {{r.A(s)}^{2}}\right]  -{\expecop_{r\in
2461: R(s)}\!\left[  {r.A(s)}\right]  }^{2}\\
2462: &  =\frac{1}{\left|  {R(s)}\right|  }\left(  \,\sum_{v\in\dom A}%
2463: \!\!\!v^{2}\hat{R}_{A,v}(s)\right)  -{\left(  \frac{1}{\left|  {R(s)}\right|
2464: }\sum_{v\in\dom A}\!\!\!v\hat{R}_{A,v}(s)\right)  }^{2}.
2465: \end{align*}
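For instance, consider a hypothetical relation of four tuples in which two
have $r.A(s)=1$ and two have $r.A(s)=3$, so that
$\hat{R}_{A,1}(s)=\hat{R}_{A,3}(s)=2$. The last formula then gives
\[
\frac{1}{4}\left(  1^{2}\cdot2+3^{2}\cdot2\right)  -\left(  \frac{1}{4}\left(
1\cdot2+3\cdot2\right)  \right)  ^{2}=5-4=1,
\]
and the corresponding scaling factor would be $k_{R,A}(s)=1$.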
2466: Other choices for the scaling factor $k_{R,A}(s)$ are also possible. In any
2467: case, we may calculate the expected alteration cost for attribute $A$ in tuple
2468: $r$ via
2469: \begin{align}
2470: \expecop\!\left[  \iota_{r,A}(s,f)\right]   &  =\expecop\!\left[
2471: {k_{R,A}(s){\left(  r.A(f)-r.A(s)\right)  }^{2}}\right] \nonumber\\
2472: &  =k_{R,A}(s)\expecop\!\left[  {{r.A(f)}^{2}-2\,r.A(f)r.A(s)+{r.A(s)}^{2}%
2473: }\right] \nonumber\\
2474: &  =k_{R,A}(s)\left(  \expecop\!\left[  {{r.A(f)}^{2}}\right]
2475: -2\,r.A(s)\expecop\!\left[  {r.A(f)}\right]  +{r.A(s)}^{2}\right)  .
2476: \label{eq:gennumexpec}%
2477: \end{align}
2478: 
2479: \paragraph{A general metric for non-numeric domains:}
2480: \label{nonnummetric}For non-numeric domains, it may not be possible or
2481: meaningful to compute the difference of $r.A(s)$ and $r.A(f)$. In such cases,
we shall use a general cost matrix $[c_{u,v}^{R,A}]_{u,v\in\dom A}$
2483: and compute
2484: \begin{align*}
2485: \expecop\!\left[  \iota_{r,A}(s,f)\right]   &  =\sum_{v\in\dom A}\!\!\left(
2486: P_{r.A(s),v}^{R,A}(s,f)\right)  \left(  c_{r.A(s),v}^{R,A}\right) \\
2487: &  =\sum_{
2488: \genfrac{}{}{0pt}{1}{v\in\dom A}{v\neq r.A(s)}%
2489: }\!\!\left(  P_{r.A(s),v}^{R,A}(s,f)\right)  \left(  c_{r.A(s),v}%
2490: ^{R,A}\right)  .
2491: \end{align*}
2492: 
2493: For domains that have no particular structure, a typical choice might be
2494: $c_{u,v}^{R,A}=1$ whenever $u\neq v$. In this case, the expected cost
2495: calculation simplifies to
2496: \begin{align*}
2497: \expecop\!\left[  \iota_{r,A}(s,f)\right]   &  =\probop\!\left\{  {r.A(f)\neq
2498: r.A(s)}\right\} \\
&  =1-P_{r.A(s),r.A(s)}^{R,A}(s,f).
2500: \end{align*}
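
To illustrate both formulas with made-up numbers, suppose $\dom A=\{a,b,c\}$,
$r.A(s)=a$, and the transition probabilities over $(s,f]$ are
$P_{a,a}^{R,A}(s,f)=0.7$, $P_{a,b}^{R,A}(s,f)=0.2$, and
$P_{a,c}^{R,A}(s,f)=0.1$. With the hypothetical costs $c_{a,b}^{R,A}=1$ and
$c_{a,c}^{R,A}=3$, the general formula gives
\[
\expecop\!\left[  \iota_{r,A}(s,f)\right]  =(0.2)(1)+(0.1)(3)=0.5,
\]
whereas with unit costs it gives $1-0.7=0.3$.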
2501: 
2502: We are now ready to consider the calculation of $\hat{\iota}_{R,A}%
2503: ^{\modification}(s,f)$.
2504: 
2505: \subsubsection{The expected modification cost}
2506: We next consider computing the
2507: expected modification cost $\hat{\iota}_{R,A}^{\modification}(s,f)$. To do so,
2508: we partition the tuples $r$ in $R(s)\cap R(f)$ according to their initial
2509: value $r.A(s)$ of the attribute $A$. Consider the subset $R_{A,u}(s)\cap R(f)$
2510: of all $r\in R(s)\cap R(f)$ that have $r.A(s)=u$. Since all such tuples are
2511: indistinguishable from the point of view of the modification process for
2512: $(R,A)$, their $\iota_{r,A}(s,f)$ random variables will be identically
2513: distributed. The number of tuples $r\in R(s)$ with $r.A(s)=u$ is, by
2514: definition, $\hat{R}_{A,u}(s)$. The number $\left|  {R_{A,u}(s)\cap
R(f)}\right|  $ of these that are also in $R(f)$ is a random variable whose
2516: expectation, by the independence of the deletion and modification processes,
2517: must be $p_{R}(s,f)\hat{R}_{A,u}(s)$. Using standard results for sums of
2518: random numbers of IID random variables, we conclude that
2519: \begin{align*}
2520: \hat{\iota}_{R,A}^{\modification}(s,f)  &  =\expecop\!\left[  {\sum_{r\in
2521: R(s)\cap R(f)}\!\!\!\!\!\!\!\!}\iota_{r,A}(s,f)\right] \\
2522: &  =\!\!\!\!\sum_{u\in\dom A}\!\!\!\left(  p_{R}(s,f)\hat{R}_{A,u}(s)\right)
2523: \expecop\!\left[  \iota_{r,A}(s,f){\;\big|\;r.A(s)\!=\!u}\right] \\
2524: &  =p_{R}(s,f)\!\!\!\!\!\!\sum_{%
2525: \genfrac{}{}{0pt}{1}{u\in\dom A}{\hat{R}_{A,u}(s)>0}%
2526: }\!\!\!\!\!\!\!\hat{R}_{A,u}(s)\expecop\!\left[  \iota_{r,A}(s,f){\;\big
2527: |\;r.A(s)\!=\!u}\right] \\
2528: &  =p_{R}(s,f)\!\!\!\!\!\!\!\sum_{%
2529: \genfrac{}{}{0pt}{1}{u\in\dom A}{\hat{R}_{A,u}(s)>0}%
2530: }\!\!\!\!\!\!\hat{R}_{A,u}(s)\hat{\iota}_{R,A,u}^{\modification}(s,f),
2531: \end{align*}
2532: where we define $\hat{\iota}_{R,A,u}^{\modification}(s,f)=\expecop\!\left[
2533: \iota_{r,A}(s,f){\;\big|\;r.A(s)\!=\!u}\right]  $. We now address the
2534: calculation of the $\hat{\iota}_{R,A,u}^{\modification}(s,f)$.
2535: 
2536: For a non-numeric domain, we have from Section \ref{nonnummetric} that
2537: \[
2538: \hat{\iota}_{R,A,u}^{\modification}(s,f)=\!\!\!\sum_{v\in\dom A}%
2539: \!\!\!\!\!\left(  P_{u,v}^{R,A}(s,f)\right)  \left(  c_{u,v}^{R,A}\right)  ,
2540: \]
2541: and in the simple case of $c_{u,v}^{R,A}=1$ whenever $u\neq v$,
2542: \[
2543: \hat{\iota}_{R,A,u}^{\modification}(s,f)=1-P_{u,u}^{R,A}(s,f).
2544: \]
2545: In any case, $P_{u,v}^{R,A}(s,f)$ and $P_{u,u}^{R,A}(s,f)$ may be computed
2546: using the results of Section \ref{sec:modif}.
2547: 
2548: For a numeric domain, we have from (\ref{eq:gennumexpec}) that
2549: \begin{align*}
2550: \hat{\iota}_{R,A,u}^{\modification}(s,f)  &  =k_{R,A}(s)\left(  \expecop
2551: \!\left[  {{\left(  r.A(f)\right)  }^{2}\;\big|\;r.A(s)\!=\!u}\right]
2552: -2\,u\expecop\!\left[  {r.A(f)\;\big|\;r.A(s)\!=\!u}\right]  +u^{2}\right) \\
2553: &  =k_{R,A}(s)\left(  \left(  \sum_{v\in\dom A}\!\!\!(v^{2}-2uv)P_{u,v}%
2554: ^{R,A}(s,f)\right)  +u^{2}\right)  .
2555: \end{align*}
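
As a sanity check on this expression (with hypothetical values), let $u=10$
and suppose the attribute either keeps its value, with
$P_{10,10}^{R,A}(s,f)=0.6$, or moves to $12$, with $P_{10,12}^{R,A}(s,f)=0.4$.
Then
\[
\sum_{v\in\dom A}\!\!\!(v^{2}-2uv)P_{u,v}^{R,A}(s,f)+u^{2}
=(100-200)(0.6)+(144-240)(0.4)+100=1.6,
\]
which agrees with the direct computation
$\expecop\!\left[  (r.A(f)-u)^{2}\right]  =(0.4)(12-10)^{2}=1.6$, so that
$\hat{\iota}_{R,A,u}^{\modification}(s,f)=1.6\,k_{R,A}(s)$.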
2556: 
2557: In cases where a random walk approximation applies, however, the situation
2558: simplifies considerably, as demonstrated in the following proposition.
2559: 
\begin{proposition}
When a random walk model with mean $\delta$ and variance $\sigma^2$
accurately describes 
modifications to a numeric attribute $A$, $\hat{\iota
}_{R,A,u}^{\modification}(s,f)\approx 
k_{R,A}(s)\,\Gamma_{R,A}(s,f)\left(  \sigma
^{2}+\left(  1+\Gamma_{R,A}(s,f)\right)  \delta^{2}\right)  .$
\end{proposition}

\begin{proof}
\noindent In this case, we note that the random variable $r.A(f)-r.A(s)$ is
identical to $\Delta A(s,f)$ 
(using the notation of Section \ref{sec:randomwalks}%
), and is independent of $r.A(s)$. The number $N$ of modification events in
$(s,f]$ has a Poisson distribution with mean $\Gamma_{R,A}(s,f)$, and hence
variance also equal to $\Gamma_{R,A}(s,f)$. Therefore we have, for any $u\in\dom
A$,
\begin{align*}
\hat{\iota}_{R,A,u}^{\modification}(s,f) &  \approx k_{R,A}(s)\expecop\!\left[
{{\left(  \Delta A(s,f)\right)  }^{2}}\right]  \\
&  =k_{R,A}(s)\left(  \varop\!\left[  {\Delta A(s,f)}\right]  +{\expecop
\!\left[  {\Delta A(s,f)}\right]  }^{2}\right)  \\
&  =k_{R,A}(s)\left(  \expecop\!\left[  {N}\right]  \sigma^{2}+\delta
^{2}\varop\!\left[  {N}\right]  +{\expecop\!\left[  {N}\right]  }^{2}%
\delta^{2}\right)  \\
&  =k_{R,A}(s)\left(  \Gamma_{R,A}(s,f)\sigma^{2}+\Gamma_{R,A}(s,f)\delta
^{2}+{\Gamma_{R,A}(s,f)}^{2}\delta^{2}\right)  \\
&  =k_{R,A}(s)\,\Gamma_{R,A}(s,f)\left(  \sigma^{2}+\left(  1+\Gamma_{R,A}%
(s,f)\right)  \delta^{2}\right)  .
\end{align*}
\hspace*{\fill}
\end{proof}
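
For example, in the driftless case $\delta=0$ the approximation reduces to
$k_{R,A}(s)\,\Gamma_{R,A}(s,f)\,\sigma^{2}$, so the expected cost grows
linearly in the expected number of modification events. With the illustrative
(hypothetical) values $\Gamma_{R,A}(s,f)=3$, $\sigma^{2}=4$, and
$k_{R,A}(s)=1/2$, this gives
\[
\hat{\iota}_{R,A,u}^{\modification}(s,f)\approx\tfrac{1}{2}\cdot3\cdot4=6.
\]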
2590: 
2591: \subsection{Example: the use of the cost model in Web crawling}
2592: 
2593: \label{sec:exwebcraw} The following example concludes the introduction of the cost function. We show
2594: how, by using the cost model, one can generate an optimal transcription policy
2595: for Web crawling.
2596: 
2597: \begin{example}
2598: [Web Monitoring]WebSQL \cite{MENDELZON97} is a Web monitoring tool which uses
2599: a virtual database schema to query the structural properties of Web documents.
2600: The database schema consists of two relations, \texttt{Document} with six
2601: attributes, namely \texttt{url, title, text, type, length, }and \texttt{modif}%
2602: , and \texttt{Anchor} with four attributes, namely \texttt{base, label, href},
2603: and \texttt{context}. Each tuple in \texttt{Anchor} indicates that document
2604: \texttt{base} contains a link to document \texttt{href}. Consider the
2605: following query (taken from \texttt{http://www.cs.toronto.edu/\symbol{126}%
2606: websql/}), which identifies locally reachable documents that contain some
hyperlink to a compressed PostScript file:
2608: 
2609: \vspace{2ex} \texttt{SELECT d.url, d.modif }
2610: 
2611: \texttt{FROM Document d SUCH THAT ``http://www.OtherDoc.html'' -%
2612: $>$%
2613: -%
2614: $>$%
2615: * d, }
2616: 
2617: \texttt{Anchor a SUCH THAT base = d }
2618: 
2619: \texttt{WHERE filename(a.href) CONTAINS ``.ps.Z''; }
2620: 
2621: \vspace{2ex}
(We refrain from dwelling
on the language specification; the interested reader is referred to the cited
Web site.)
2625: Assume that the cost of performing the query at time $t$ is $\sum_{d\in
2626: D(t)}\psi_{d}$, where $D(t)$ represents the set of scanned documents and
2627: $\psi_{d}$ is a random variable representing the size of document $d$ in
2628: bytes. Assuming the $\{\psi_{d}\}$ are IID, the expected cost of performing
2629: the query at time $t$ is thus
2630: \[
2631: \expecop\left[  \sum_{d\in D(t)}\!\!\psi_{d}\right]  =\expecop[\card{D(t)}%
2632: ]\expecop[\psi],
2633: \]
2634: where $\psi$ is a generic random variable distributed like the $\{\psi_{d}\}$.
2635: 
2636: A modification to a document is identified using changes to the \texttt{modif}
2637: attribute of the Document relation. For brevity in what follows, we let
2638: $R=\text{\texttt{Document}}$ and $A=\text{\texttt{modif}}$. We assign the
2639: following costs to changes in $A$:
2640: 
2641: \begin{itemize}
2642: \item $g^{\deletion}(s,f,t) = 0$ for all $s<t<f$, 
2643: that is, the user has no interest in being
2644: notified of deleted documents.
2645: 
\item For all $s<t<f$ and $u,v\in\dom A$, $u\neq v$, 
$c^{R,A}_{u,v}=g^{\insertion}(s,f,t)=\expecop[\psi]$, 
where $c^{R,A}_{u,v}$ is the cost for
a modified document. For all other attributes $A^{\prime}\neq A$, 
$c^{R,A^{\prime}}_{u,v}=0$ for all $u,v\in\dom A^{\prime}$.
2651: \end{itemize}
2652: 
2653: Suppose that a query was performed at time $s$, scanning the set of documents
2654: $D(s)$, and returning the set of documents $B(s)$, where $\left|
2655: {B(s)}\right|  \leq\left|  {D(s)}\right|  $. A user is interested in
2656: refreshing the query result without overloading system resources, thus
2657: balancing the cost of refreshing the query results against the cost of using
2658: partial or obsolescent data. This trade-off can be captured by the following
policy: refresh the query at time $f$, after performing it at time $s$, if and only if
\[
\expecop\!\!\left[  \sum_{d\in D(f)}\!\!\psi_{d}\right]  \;<\;\expecop
\!\left[  C_{R,\mathrm{o}}(s,f)\right]  .
\]
Thus, an equivalent condition is
2665: \begin{align*}
2666: \expecop[\card{D(f)}]\expecop[\psi]\; &  <\;\sum_{A^{\prime}\in\mathcal{A}%
2667: (R)}\hat{\iota}_{R,A^{\prime}}^{\modification}(s,f)+\hat{\iota}_{R}%
2668: ^{\deletion}(s,f)+\hat{\iota}_{R}^{\medspace\insertion}(s,f)\\
2669: &  =\hat{\iota}_{R,A}^{\modification}(s,f)+\hat{\iota}_{R}^{\medspace
2670: \insertion}(s,f),
2671: \end{align*}
2672: or
2673: \begin{align*}
2674: &  \left(  p_{R}(s,f)\left|  {D(s)}\right|  +\widetilde{\Lambda}%
2675: _{R}(s,f)\expecop\!\left[  {\Delta}_{R}^{+}\right]  \right)  \expecop[\psi
2676: ]\;\\
2677: &  <\left(  p_{R}(s,f)\!\!\!\!\sum_{{u\in\dom A}}\!\!\!\hat{R}_{A,u}%
(s)(1-P_{u,u}^{{R,A}}(s,f))+\widetilde{\Lambda}_{R}(s,f)\expecop\!\left[
2679: {\Delta}_{R}^{+}\right]  p^{\medspace\insertion}\right)  \expecop[\psi],
2680: \end{align*}
2681: where $p^{\medspace\insertion}$ is the probability of a newly-inserted
2682: document being relevant to the query. Cancelling the factor of $\expecop[\psi
2683: ]$, another equivalent condition is
2684: \[
2685: p_{R}(s,f)\left|  {D(s)}\right|  +\widetilde{\Lambda}_{R}(s,f)\expecop
2686: \!\left[  {\Delta}_{R}^{+}\right]  \;<\;p_{R}(s,f)\!\!\!\!\sum_{u\in\dom
A}\!\!\!\hat{R}_{A,u}(s)(1-P_{u,u}^{{R,A}}(s,f))+\widetilde{\Lambda}%
2688: _{R}(s,f)\expecop\!\left[  {\Delta}_{R}^{+}\right]  p^{\medspace\insertion},
2689: \]
2690: which is independent of the expected document size. Further assume that
$P_{u,u}^{R,A}(s,f)=P_{\ast,\ast}^{R,A}(s,f)$ is independent of $u$. Then
the refresh condition can be expressed as
\[
p_{R}(s,f)\left|  {D(s)}\right|  +\widetilde{\Lambda}_{R}(s,f)\expecop\!\left[  {\Delta}_{R}%
^{+}\right]  \;<\;p_{R}(s,f)\left|  {B(s)}\right|  (1-P_{\ast,\ast}^{R,A%
}(s,f))+\widetilde{\Lambda}_{R}(s,f)\expecop\!\left[  {\Delta}_{R}^{+}\right]
p^{\medspace\insertion}.%
2698: \]
2699: \hspace*{\fill}$\Box$
2700: \end{example}
2701: 
2702: \section{Conclusion and topics for future research}
2703: \label{sec:conclusion} 
2704: This paper represents a first step in a new research area,
2705: the stochastic estimation of the consistency of transcribed data over time. We
2706: have also suggested one possible 
2707: technique for assigning a cost to the differences
2708: between two relation extensions, including a means of computing the expected
2709: value of this cost under our stochastic model. We have discussed a number
of potential applications relating to replica management, query
2711: management, and Web crawling. We have also examined several
2712: strategies for refreshing replicas, although other strategies are
2713: certainly possible.
2714: 
2715: As an illustration of the low client-side computational demands of the
2716: insertion-only transcription application of our model, 
2717: a Java-based demo, based on the transcription policies described in
2718: \cite{GAL2001} and in this paper, can be accessed at
2719: \texttt{http://rbs.rutgers.edu:6677/}. The demo compares the performance of
various policies using data stored in a back-end mSQL database.
2721: 
2722: We hope to extend our work to the case where the materialized views are not
2723: simple replications, but are produced by SQL queries that involve selections,
2724: projections, natural joins, and certain types of aggregations. This work will
2725: involve a \emph{propagation algebra} for tracing the base data changes through
2726: a series of relational operators.
2727: 
2728: This development should make it possible to apply the theory to the management
2729: of more complex queries than presented here. In particular, it will facilitate
2730: a possible approach to managing general materialized view obsolescence on a
2731: query-by-query basis, taking into account current user preferences for query
2732: accuracy and speed. The refresh rate of materialized views in a
2733: periodically-updated data source (such as a data warehouse) can be defined in
2734: terms of data obsolescence, which in turn can be stochastically estimated
2735: using our model for content evolution. In this case, we advocate a three-way
2736: cost model for query optimization~\cite{GAL99c}, in which the query optimizer
2737: evaluates various query plans using three complementary factors, namely
2738: \textit{generation cost}, \textit{transmission cost}, and \textit{obsolescence
2739: cost}. The first two factors take on a conventional interpretation and the
2740: obsolescence cost of a query represents a penalty for basing the query result
2741: on possibly obsolescent materialized views. A query plan using only selection
2742: from a local materialized view, for example, might have lower generation and
2743: transmission costs, but a higher obsolescence cost, than a plan fetching
2744: complete base relations from an extranet and then processing them through a
2745: series of join operations. Our model, when combined with additional techniques
2746: to propagate updates through relational operators, can be used as a basis for
2747: estimating the obsolescence cost. However, developing the propagation algebra
2748: may require some enrichment of our basic model, in particular the 
2749: introduction of dependency between the deletion and modification processes.
2750: 
2751: We foresee several additional future research directions. One direction
2752: involves the design of efficient algorithms for the numerical computations
2753: required by our model. As it stands so far, the most demanding computations
2754: required are general numerical integration and the matrix exponentiation
2755: formula (\ref{eq:matrixexp}). With regard to integration, we note that, in
2756: practice, the nonhomogeneous Poisson arrival rate functions $\lambda_{R}%
2757: (\cdot)$, $\mu_{R}(\cdot)$, and $\gamma_{R,A}(\cdot)$ will most likely be
2758: chosen to be periodic piecewise low-order polynomials, as suggested in
2759: Section \ref{sec:verify}. In such cases, many of the integrals
2760: needed by the model could be performed in closed form within each time
2761: period.
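
As a sketch of such a closed form (assuming, for illustration only, a single
linear piece $\lambda_{R}(t)=a+bt$ on $[s,f]$ and the linear cost
$g^{\insertion}(s,f,t)=f-t$), write $h=f-s$; then
\begin{align*}
\int_{s}^{f}\lambda_{R}(t)\,dt  &  =ah+\tfrac{1}{2}b\left(  f^{2}-s^{2}\right)  ,\\
\int_{s}^{f}\lambda_{R}(t)\,(f-t)\,dt  &  =\tfrac{1}{2}(a+bf)h^{2}-\tfrac{1}{3}bh^{3}.
\end{align*}
Both expressions involve only polynomial evaluations, so no numerical
quadrature would be needed within a piece. For instance, $a=0$, $b=1$, $s=0$,
and $f=1$ give $1/2$ and $1/6$, respectively.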
2762: 
2763: Further calibration and verification of the models in real situations
2764: is also needed.  So far, we have demonstrated that the insertion model
2765: has plausible applications, but this work needs to be extended to the
2766: deletion and modification models.  Furthermore, the insertion model
may need to be generalized to handle situations where there is
``burstiness'' or autocorrelation in the interarrival times, which may require more involved techniques than simply combining very
closely spaced arrivals.
2770: 
2771: Another future research direction involves applying the model to real-life
2772: settings such as managing a data warehouse. While the model is quite flexible,
2773: a methodology is still needed for structuring Markov chains and estimating the
2774: stochastic model's parameters. Finally, in order to calibrate the cost model,
2775: the issue of measuring user tolerance for data obsolescence 
2776: should be considered.
2777: 
2778: \section*{Acknowledgments}
2779: 
2780: We would like to thank Benny Avi-Itzhak, Adi Ben-Israel, David Shanno, Andrzej
2781: Ruszczynski, Ben Melamed, Zachary Stoumbos, and Bob Vanderbei for their help.
2782: Also, we thank Kumaresan Chinnusamy and Shah Mitul for their comparative
2783: research on statistics gathering methods and Connie Lu and Gunjan Modha for
2784: their assistance in designing and implementing the demo.
2785: 
2786: \bibliographystyle{plain}
2787: \bibliography{bib}
2788: \end{document}