
1: \documentclass[11pt]{article}

2: \usepackage{epsfig}

3: \usepackage{amsmath}

4: \usepackage{latexsym}

5: \usepackage{graphicx}

6: \usepackage{amsfonts}

7: \usepackage{amssymb}

8:

9: \newtheorem{theorem}{Theorem}

10: \newtheorem{acknowledgement}[theorem]{Acknowledgement}

11: \newtheorem{algorithm}[theorem]{Algorithm}

12: \newtheorem{axiom}[theorem]{Axiom}

13: \newtheorem{case}[theorem]{Case}

14: \newtheorem{claim}[theorem]{Claim}

15: \newtheorem{conclusion}[theorem]{Conclusion}

16: \newtheorem{condition}[theorem]{Condition}

17: \newtheorem{conjecture}[theorem]{Conjecture}

18: \newtheorem{corollary}[theorem]{Corollary}

19: \newtheorem{criterion}[theorem]{Criterion}

20: \newtheorem{definition}{Definition}

21: \newtheorem{example}{Example}

22: \newtheorem{exercise}[theorem]{Exercise}

23: \newtheorem{lemma}{Lemma}

24: \newtheorem{notation}[theorem]{Notation}

25: \newtheorem{problem}[theorem]{Problem}

26: \newtheorem{proposition}{Proposition}

27: \newtheorem{remark}[theorem]{Remark}

28: \newtheorem{solution}[theorem]{Solution}

29: \newtheorem{summary}[theorem]{Summary}

30: \newenvironment{proof}[1][Proof]{\noindent\textbf{#1.}

31: }{\hspace*{\fill}\ \rule{0.5em}{0.5em} \vspace{2ex}}

32:

33: \setlength{\textwidth}{6.5in}

34: \setlength{\textheight}{9in}

35: \hoffset=-0.7in

36: \voffset=-0.8in

37:

38: % Try to prevent stanky page breaks.

39: \clubpenalty=10000

40: \widowpenalty=10000

41:

42: \newcommand{\middlebar}[2]{#1 \;\big|\; #2}

43: \newcommand{\set}[2]{\left\{\middlebar{#1}{#2}\right\}}

44: \DeclareMathOperator{\dom}{dom}

45: \DeclareMathOperator{\probop}{P}

46: \newcommand{\prob}[1]{\probop\!\left\{{#1}\right\}}

47: \newcommand{\probgiven}[2]{\probop\!\set{#1}{#2}}

48: \DeclareMathOperator{\expecop}{E}

49: \newcommand{\expec}[1]{\expecop\!\left[{#1}\right]}

50: \newcommand{\expecover}[2]{\expecop_{#1}\!\left[{#2}\right]}

51: \newcommand{\expecgiven}[2]{\expec{\middlebar{#1}{#2}}}

52: \DeclareMathOperator{\varop}{Var}

53: \newcommand{\variance}[1]{\varop\!\left[{#1}\right]}

54: \newcommand{\varover}[2]{\varop_{#1}\!\left[{#2}\right]}

55: \newcommand{\vargiven}[2]{\variance{\middlebar{#1}{#2}}}

56: \newcommand{\card}[1]{\left|{#1}\right|}

57: \newcommand{\abs}[1]{\left|{#1}\right|}

58: \newcommand{\norm}[1]{\left\|{#1}\right\|}

59: \DeclareMathOperator{\expdistrib}{Exp}

60: \DeclareMathOperator{\bindistrib}{Bin}

61: \DeclareMathOperator{\poissondistrib}{Poisson}

62: \DeclareMathOperator{\udistrib}{U}

63: \DeclareMathOperator{\bigO}{O}

64: \DeclareMathOperator{\intrinsic}{I}

65: \DeclareMathOperator{\insertion}{I}

66: \DeclareMathOperator{\deletion}{D}

67: \DeclareMathOperator{\modification}{M}

68: \newcommand{\nhexp}{\expdistrib_s}

69: \newcommand{\expof}{\exp\negthickspace}

70: \newcommand{\intdspace}{\hspace{0.15em}}

71: \newcommand{\tighteq}{\!=\!}

72: \newcommand{\mathtti}[1]{\mbox{\tt \em #1}}

73: \newcommand{\msis}{Department of Management Science and Information Systems}

74: \newcommand{\rutgers}{Rutgers University, Piscataway, NJ 08854 USA}

75:

76: \begin{document}

77:

78:

79: \title{Managing Periodically Updated Data in Relational Databases: \\A Stochastic Modeling Approach}

80: \author{Avigdor Gal\thanks{ Department of Management Science and Information Systems,

81: Rutgers University, Piscataway, NJ 08854 USA, phone: (732) 445-3245, fax:

82: (732) 445-6329, e-mail: \texttt{avigal@rci.rutgers.edu} }

83: \and Jonathan Eckstein\thanks{ Department of Management Science and Information

84: Systems\ and RUTCOR, Rutgers University, Piscataway, NJ 08854 USA, phone:

85: (732) 445-0510, fax: (732) 445-6329, e-mail:

86: \texttt{jeckstei@rutcor.rutgers.edu}}}

87: \date{}

88: \maketitle

89: \begin{abstract}

90: Recent trends in information management involve the periodic transcription of

91: data onto secondary devices in a networked environment, and the proper

92: scheduling of these transcriptions is critical for efficient data management.

93: To assist in the scheduling process, we are interested in modeling

94: \emph{data obsolescence}, that is, the

95: reduction of consistency over time between a relation and its replica. The

96: modeling is based on techniques from the field of stochastic processes, and

97: provides several stochastic models for content evolution in the base relations

98: of a database, taking referential integrity constraints into account. These

99: models are general enough to accommodate most of the common scenarios in

100: databases, including batch insertions and life spans both with and without

101: memory. As an initial ``proof of concept'' of

102: the applicability of our approach, we validate the insertion portion of

103: our model framework

104: via experiments with real data feeds. We also discuss a set of

105: transcription protocols which make use of the proposed stochastic model.

106: \end{abstract}

107:

108:

109: \section{Introduction and motivation}

110:

111: Recent developments in information management involve the transcription of

112: data onto secondary devices in a networked environment, \emph{e.g.},

113: materialized views in data warehouses and search engines, and replicas in

114: pervasive systems. Data transcription influences the way databases define and

115: maintain consistency. In particular, the networked environment may require

116: periodic (rather than continuous) synchronization between the database and

117: secondary copies, either due to paucity of resources (\emph{e.g.}, low

118: bandwidth or limited night windows) or to the transient characteristics of the

119: connection. Hence, the consistency of the information in secondary copies,

120: with respect to the transcription origin, varies over time and depends on the

121: rate of change of the base data and on the frequency of synchronization.

122:

123: Systematic approaches to the proper scheduling of transcriptions necessarily

124: involve optimizing a trade-off between the cost of transcribing fresh

125: information and the cost of using obsolescent data. To do so, one must

126: quantify, at least in probabilistic terms, this latter cost, which we call

127: \emph{obsolescence cost} \cite{GAL99c}. This paper aims to provide a

128: comprehensive stochastic framework for quantifying time-dependent data

129: obsolescence in replicas. Suppose we are given a relation $R$, a start time

130: $s\in\Re$, and some later time $f>s$. We denote the extension of a relation

131: $R$ at time $t\in\Re$ by $R(t)$. Starting from a known extension $R(s)$, we

132: are interested in making probabilistic predictions about the contents of the

133: later extension $R(f)$. We also suggest a cost model schema to quantify the

134: difference between $R(s)$ and $R(f)$. Such tools assist in optimizing the

135: synchronization process, as demonstrated in this paper. Our approach is based

136: on techniques from the field of stochastic processes, and provides several

137: stochastic models for content evolution in a relational database, taking

138: referential integrity constraints into account. In particular, we make use of

139: compound nonhomogeneous Poisson models and Markov chains; see for example

140: \cite{ROSS80,ROSS95,TK94}. We use Poisson processes to model the behavior of

141: tuples entering and departing relations, allowing (nonhomogeneous) time-varying behavior ---

142: \emph{e.g.}, more intensive activity during work hours, and less intensive

143: activity after hours and on weekends --- as well as compound (bulk)

144: insertions, that is,

145: the simultaneous arrival of several tuples. We use Markov chains in a

146: general modeling approach for attribute modifications, allowing the assignment

147: of a new value to an attribute in a tuple to depend on its current value. The

148: approach is general enough to accommodate most of the common scenarios in

149: databases, including batch insertions and memoryless, as well as time

150: dependent, life spans.

151:

152: As motivation, consider the following two examples:

153:

154: \begin{example}

155: [Query optimization]Query optimization relies heavily on estimating the

156: cardinality and value distribution of relations in a database. If these

157: statistics are outdated and inaccurate, even the best query optimizer may

158: formulate poor execution plans. Typically, statistics are updated only

159: periodically, usually at the discretion of the database administrator, using

160: utilities such as DB2's RUNSTATS. Although some research has been devoted to

161: speeding up statistics collection through sampling and wavelet

162: approximations~\cite{HAAS95,MATIAS98}, periodic updates are unavoidable in

163: very large databases such as IBM's Net.Commerce \cite{SHURETY98}, an

164: e-business software package with roughly one hundred relations, or an SAP

165: application, which has more than 8,000 relations and 9,000 indices. Collection

166: of statistics becomes an even more acute problem in database federations

167: \cite{SHETH90}, where the federation members do not always ``volunteer'' their

168: statistics \cite{RASCHID2001} (or their cost models for that matter

169: \cite{ROTH96}), and are unwilling to burden their resources with frequent

170: statistics collection.

171:

172: In current practice, cardinality or histogram data recorded at time $s$ are

173: used unchanged until the next full analysis of the database at some later time

174: $s^{\prime}>s$. If a query optimization must be performed at some time

175: $f\in(s,s^{\prime})$, the optimizer simply uses the statistics gathered at

176: time $s$, since the time spent recomputing them may overwhelm any benefits of

177: the query optimization. As an alternative, we suggest using a probabilistic

178: estimate of the necessary statistics at time $f$. Use of these techniques

179: might make it possible to increase the interval between statistics-gathering

180: scans, as will be discussed in Example~\ref{ex:qopt2}.\hspace*{\fill}$\Box$

181: \end{example}

182:

183: \begin{example}

184: [Replication management in distributed databases]\label{ex:replica} We now

185: consider replication management in a distributed database. Since fully

186: synchronous replication management, in which a user is guaranteed access to

187: the most current data, comes at a significant computational cost, most

188: commercial distributed database providers have adopted asynchronous

189: replication management. That is, updates to relation replicas are performed

190: after the original transaction has committed, in accordance with the workload

191: of the machine on which the secondary copy is stored. Asynchronous

192: replicas are also

193: very common in Web applications such as search engines, where Web crawlers

194: sample Web sites periodically, and in \emph{pervasive systems} (\emph{e.g.},

195: Microsoft's Mobile Information

196: Server\footnote{http://www.microsoft.com/servers/miserver/} and Caf\'{e}

197: Central\footnote{http://www.comalex.com/central.htm}). In a pervasive system,

198: a server serves many different users, each with her own unpredictable

199: connectivity schedule and dynamically changing device capabilities. Our

200: modeling techniques would allow client devices to reduce the rate at which

201: they poll the server, saving both server resources and network

202: bandwidth. We demonstrate the usefulness of stochastic modeling in

203: this setting in Sections~\ref{sec:condepup}

204: and~\ref{sec:exwebcraw}.\hspace*{\fill}$\Box$

205: \end{example}

206:

207: The novelty of this paper is in developing a formal framework for modeling

208: content evolution in relational databases. The problem of content evolution

209: with respect to materialized views (which may be regarded as a complex form of

210: data transcription) in databases has already been recognized. For example, in

211: \cite{ABITEBOUL99}, the incompleteness of data in views was noted as being a

212: ``dynamic notion since data may be constantly added/removed from the view.''

213: Yet, we believe that there has been no prior formal modeling of the evolution

214: process.\footnote{Other research efforts involve probabilistic database

215: systems (\emph{e.g.}, \cite{LAKSHMANAN97}), but this work is concerned with

216: uncertainty in the stored data, rather than data evolution.} Related research

217: involves the containment property of a materialized view with respect to its

218: base data: a few of the many references in this area include

219: \cite{YANG87,CHAUDHURI95,LEVY95a,ABITEBOUL98,GRUMBACH00}. However, the

220: temporal aspects of content evolution have not been systematically addressed

221: in this work. In \cite{ABITEBOUL98}, for example, the containment

222: relationships between a materialized view $I$ and the ``true'' query result

223: $\mathcal{V}(D)$, taken from a database $D$, can be either $I=\mathcal{V}(D)$

224: or $I\subseteq\mathcal{V}(D)$. The latter relationship represents a situation

225: where the materialized view stores only part of the query result.

226: However, taking content evolution into account, it is also possible that

227: $I\supset\mathcal{V}(D)$, if tuples may be deleted from $\mathcal{V}(D)$ and

228: $I$ is periodically updated. Moreover, modifications to the base data may

229: result in both $I\not \subseteq\mathcal{V}(D)$ and $I\not \supseteq

230: \mathcal{V}(D)$.

231:

232: Refresh policies for materialized views have been previously discussed in the

233: literature (\emph{e.g.}, \cite{LINDSAY86} and \cite{COLBY96}). Typically,

234: materialized views are refreshed immediately upon updates to the base data, at

235: query time (as in \cite{COLBY96}), or using snapshot databases (as in

236: \cite{LINDSAY86}). The latter approach can produce obsolescent materialized

237: views. A combination of all three approaches appears in \cite{COLBY97}. Our

238: methodology differs in that we do not assume an \emph{a priori} association of

239: a materialized view with a refresh policy, but instead design policies based

240: on their transcription and obsolescence costs.

241:

242: A preliminary attempt to describe the time dependency of updates in the

243: context of Web management was given in \cite{CHO00}, which suggests a simple

244: homogeneous Poisson process to model the updating of Web pages. We suggest

245: instead a nonhomogeneous compound Poisson model, which is far more flexible,

246: and yet still tractable. In addition, the work in~\cite{CHO00} supposes that

247: transcriptions are performed at uniform time intervals, mainly because

248: ``crawlers cannot guess the best time to visit each site.'' We show in this

249: paper that our model of content evolution gives rise to other, better

250: transcription policies.

251:

252: In \cite{OLSTON00}, a trade-off mechanism was suggested to decide between the

253: use of a cache or recomputation from base data by using range data, computed

254: at the source. In this framework, an update is ``pushed'' to a replication

255: site whenever updated data falls outside a predetermined interval, or whenever

256: a query requires current data. The former requires the client and the server

257: to be in touch continuously, in case the server needs to track down the

258: client, which is not always realistic (either because the server does not

259: provide such services, or because the overhead for such services undermines

260: the cost-effectiveness of the client). The latter requirement puts the burden

261: of deciding whether to refresh the data on the client, without providing it

262: with any model for the evolution of the base data. We attempt to fill this gap

263: by providing a stochastic model for content evolution, which allows a client

264: to make judicious requests for current data. Other work in related areas

265: (\emph{e.g.}, \cite{ALONSO90,CAREY91a,DELIS98}) has considered

266: various alternatives for pushing updated data from a server to a cache on

267: the client side. Lazy replica-update policies using replication graphs have

268: also been discussed in, for example, \cite{ANDERSON98}. This work, however,

269: does not take the data obsolescence into account, and is primarily concerned

270: with transaction throughput and timely updates, subject to network constraints.

271:

272: As with models in general, our model is an idealized representation of a

273: process. For the model to be useful, it must yield predictions based on tractable

274: analytical calculations, rather than detailed, computationally intensive

275: simulations. Therefore, we restrict our modeling to some of the more basic

276: tools of applied probability theory, specifically those relating to Poisson

277: processes and Markov chains. Texts such as~\cite{ROSS80,TK94} contain the

278: necessary reference material on Markov chains and Poisson processes, and

279: specifically on nonhomogeneous Poisson processes. Poisson processes can model

280: a world where data updates are independent from one another. In databases with

281: widely distributed access, \emph{e.g.}, Web interfacing databases, such an

282: independence assumption seems plausible, as was verified in \cite{CHO00}.

283:

284: The rest of the paper is organized as follows: Section \ref{sec:preliminaries}

285: introduces some basic notation. Section~\ref{sec:estcard}

286: provides a content evolution model for insertions and deletions, while

287: Section~\ref{sec:modif} discusses data modifications. We

288: shall introduce preliminary results of fitting the insertion model parameters

289: to real data feeds in Section \ref{sec:verify}. A cost model

290: and transcription policies that utilize it follow in Section \ref{costmodel},

291: highlighting the practical impact of the model. Conclusions and topics for

292: further research are provided in Section \ref{sec:conclusion}.

293:

294:

295: \subsection{Notational preliminaries}

296:

297: \label{sec:preliminaries} In what follows, we denote the set of attributes and

298: relations in the database by $\mathcal{B}$ and $\mathcal{R}$, respectively.

299: Each $R\in\mathcal{R}$ consists of a set of attributes $\mathcal{A}%

300: (R)\subseteq\mathcal{B}$, and also has a \emph{primary key} $\mathcal{K}(R)$,

301: which is a nonempty subset of $\mathcal{A}(R)$. Each attribute $A\in

302: \mathcal{B}$ has a \emph{domain} $\dom

303: A$, which we assume to be a finite set, and for any subset of attributes

304: $\mathcal{A}=\left\{  A_{1},A_{2},\ldots,A_{k}\right\}  $, we let $\dom

305: {\mathcal{A}}=\dom A_{1}\times\dom A_{2}\times\cdots\times\dom

306: A_{k}$ denote the compound domain of $\mathcal{A}$. We denote by $r.A(t)$ the

307: value of attribute $A$ in tuple $r$ at time $t$, and similarly use

308: $r.\mathcal{A}(t)$ for the value of a compound attribute. For a given time

309: $t$, subset of attributes $\mathcal{A}\subseteq\mathcal{A}(R)$, and value

310: $v=\langle v_{1},v_{2},\ldots,v_{k}\rangle\in\dom{\mathcal{A}}$, we define

311: $R_{\mathcal{A},v}(t)=\left\{  r\in R(t)\;\left|  \;\;(r.A_{1}(t)=v_{1}%

312: )\wedge(r.A_{2}(t)=v_{2})\wedge\ldots\wedge(r.A_{k}(t)=v_{k})\right.

313: \right\}  $. We also define $\hat{R}_{\mathcal{A}}(t)$ to be the

314: \emph{histogram} of values of $\mathcal{A}$ at time $t$, that is, for each

315: value $v\in\dom\mathcal{A}$, $\hat{R}_{\mathcal{A}}(t)$ associates a

316: nonnegative integer $\hat{R}_{\mathcal{A},v}(t)$, which is the cardinality of

317: $R_{\mathcal{A},v}(t)$.\footnote{This vector can be computed exactly and

318: efficiently using indices. Alternatively, in the absence of an index for a

319: given attribute, statistical methods (such as ``probabilistic'' counting

320: \cite{WHANG90}, sampling-based estimators \cite{HAAS95}, and wavelets

321: \cite{MATIAS98}) can be applied.}  This notation, as well as other

322: symbols used throughout the paper,

323: are also summarized in Table \ref{tab:listssym}.
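
As a small illustration of this notation (the relation and its values are
hypothetical and serve only as an example), suppose $R(t)$ contains five
tuples and $\mathcal{A}=\{A\}$ with
$\dom A=\{\mathit{open},\mathit{closed}\}$. If three tuples have
$r.A(t)=\mathit{open}$ and two have $r.A(t)=\mathit{closed}$, then
\[
\hat{R}_{\mathcal{A},\mathit{open}}(t)=3,\qquad
\hat{R}_{\mathcal{A},\mathit{closed}}(t)=2,\qquad
\sum_{v\in\dom\mathcal{A}}\hat{R}_{\mathcal{A},v}(t)=\card{R(t)}=5.
\]
The last identity holds because every tuple carries exactly one value of
$\mathcal{A}$ at time $t$.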

324:

325: \renewcommand{\arraystretch}{1.3}

326:

327: \begin{table}[t] \centering

328: {\scriptsize

329: \begin{tabular}

330: [c]{|l|p{4.5in}|}\hline

331: $s,f$ & Points in time\\\hline

332: $R,S\in\mathcal{R}; R(t); \card{R(s)}$ & Relations;

333: $R$'s extension at time $t$; its cardinality at time $s$.\\

334: $A\in\mathcal{B}; \mathcal{A} \subseteq \mathcal{B}; \dom A;

335: \dom\mathcal{A}$ &

336: Attribute; compound attribute; domain of attribute;

337: domain of compound attribute \\

338: $\mathcal{A}(R) \subseteq \mathcal{B};

339: \mathcal{K}(R);

340: \mathcal{C}(R)$ &

341: Attributes of $R$; primary key of $R$; modifiable attributes of $R$ \\

342: $r; r.A(t); r.\mathcal{A}(t)$ &

343: Tuple; value of attribute $A$ in $r$ at $t$;

344: value of compound attribute $\mathcal{A}$ in $r$ at $t$ \\

345: $v\in\dom\mathcal{A};

346: R_{\mathcal{A},v}(t); \hat{R}_{\mathcal{A}}(t)$ &

347: Value; set of tuples with $r.\mathcal{A}(t)=v$; histogram of

348: $\mathcal{A}$\\

349: $b(r); d(r)$ & Insertion time of $r$; deletion time of $r$\\

350: $\mathcal{N}\subset\mathcal{B}$ & Set of numeric attributes \\

351: $G; G(R)$ &

352: Dependency multigraph; dependency sub-multigraph generated by $R$\\

353: $\lambda_{R}(t)\hspace{0.15em};\Lambda_{R}(s,f);B_{R}(s,f)$ &

354: Insertion rate (intensity);

355: expected number of insertion events during $(s,f]$;

356: number of insertions during $(s,f]$ \\

357: $\expdistrib_{s}(\phi(\cdot));L_{R,s};L_{R,s}^{\intrinsic}$ &

358: Nonhomogeneous exponential distribution;

359: interarrival time;

360: remaining life span

361: \\\hline

362: $\Delta_{R,i}^{+}; \Delta_{i}^{-}$ &

363: Number of tuples for insertion event $i$;

364: number of tuples for deletion  event $i$ \\

365: $\mu_{R}(t); M_{R}(s,f)$ &

366: Deletion rate (intensity); expected number of deletion events

367: \\\hline

368: $w(r,S)$ &

369: Number of tuples in $S$ forcing deletion of $r$ via referential

370: integrity \\

371: $W(R,S,t)$ &

372: Random variable of $w(r,S)$ over uniform selection of $r\in R$\\

373: $W(R,t)$ &

374: Vector of $W(R,S,t)$ over $S\in G(R)$

375: \\\hline

376: $p_{R}(s,f)$ &

377: Probability that a tuple in $R$ at time $s$ survives through $f$ \\

378: $\hat{p}_{R}(t,f)$ &

379: Survival probability through $f$ for tuple inserted at $t$

380: \\\hline

381: $\expecop_{r\in R(s)}\!\left[  {\cdot}\right]  $ &

382: Expectation over uniform random

383: selection of tuples $r\in R(s)$ \\\hline

384: $X_{R}(s,f)$ &

385: Number of tuples inserted into $R$ during $(s,f]$ \\

386: $Y_{R}(s,f)$ &

387: Number of tuples in $R(s)$ surviving through $f$\\

388: $Y_{R}^{+}(s,f); Y_{R}^{-}(s,f)$ &

389: Surviving tuples that were modified;

390: surviving tuples that were not modified\\\hline

391: $\tau_{v,s}^{R,A}; \gamma_{R,A}(t); \Gamma_{R,A}(s,f)$ &

392: Remaining time to next modification;

393: modification rate;

394: expected number of modification events \\

395: $\ell_{v}^{R,A}$ &

396: Relative exit rate\\

397: $P_{u,v}^{R,A}(s,f); q_{u,v}^{R,A}$ &

398: Transition probability; relative transition rate\\\hline

399: $\Delta A; \delta; \sigma^{2}$ &

400: Change to a value of $A$ in a random-walk update

401: event; expected value of change; variance of change\\\hline

402: $C_{R,\text{u}}(s,f);C_{R,\text{o}}(s,f);C_{R}(t)$ &

403: Transcription cost; obsolescence cost; total cost\\

404: $\iota_{r,A}(s,f); \iota_{R,A}(s,f); \iota_{r}(s,f)$ &

405: Contribution to obsolescence of: $r$ via $A$; $A$; $r$\\

406: $\hat{\iota}_{R,A}^{\modification}(s,f);

407: \hat{\iota}_{R}^{\deletion}(s,f);

408: \hat{\iota}_{R}^{\medspace\insertion}(s,f)$ &

409: Expected obsolescence cost

410: due to: modification; deletion; insertion\\

411: $\hat{\iota}_{R,A,u}^{\modification}(s,f)$ &

412: Expected obsolescence cost due to modification to the value $u$\\

413: $c_{u,v}^{R,A}$ &

414: Elements of a cost matrix\\\hline

415: \end{tabular}

416: }

417: \caption{List of Symbols.}

418: \label{tab:listssym}

419: \end{table}

420:

421: \renewcommand{\arraystretch}{1}

422:

423:

424: \section{Modeling insertions and deletions}

425:

426: \label{sec:estcard}This section introduces the stochastic models

427: for insertions and deletions. Section

428: \ref{sec:insertion} discusses insertions,

429: while deletions are discussed in Section

430: \ref{sec:deletions}. Section \ref{sec:combinedinsertdelete}

431: combines the effect of

432: insertions and deletions on a relation's cardinality. We conclude

433: with a discussion of non-exponential life spans in Section

434: \ref{sec:nonexplife}. We defer discussing model

435: validation until Section \ref{sec:verify}.

436:

437: \subsection{Insertion}

438: \label{sec:insertion}

439: We use a nonhomogeneous Poisson process~\cite{ROSS80,TK94}

440: with instantaneous arrival rate $\lambda_{R}:\Re\rightarrow\lbrack0,\infty)$

441: to model the occurrence of \emph{insertion events} into $R$. That is, the

442: number of insertion events occurring in any interval $(s,f]$ is a Poisson

443: random variable with expected value $\Lambda_{R}(s,f)=\int_{s}^{f}\lambda

444: _{R}(t)\hspace{0.15em}dt.$ A homogeneous Poisson process may be considered as

445: the special case where $\lambda_{R}(t)$ is equal to a constant $\lambda_{R}>0$

446: for all $t$, yielding $\Lambda_{R}(s,f)=\int_{s}^{f}\lambda_{R}(t)\hspace

447: {0.15em}dt=\int_{s}^{f}\lambda_{R}\hspace{0.15em}dt=\lambda_{R}\cdot(f-s)$.

448:

449: We now consider the interarrival time distribution of the nonhomogeneous

450: Poisson process. We first define the nonhomogeneous exponential

451: distribution, as follows:

452:

453: \begin{definition}

454: [Nonhomogeneous exponential distribution]\label{def:nhexp}

455: Let $\phi:\Re\rightarrow\lbrack0,\infty)$ be an integrable

456: function. Given some $s\in\Re$, a random variable $V$ is said to have a

457: \emph{nonhomogeneous exponential} distribution (denoted by $V\sim

458: \expdistrib_{s}(\phi(\cdot))$) if $V$'s density function is

459: \[

460: p(\tau)=\left\{

461: \begin{array}

462: [c]{ll}%

463: {\displaystyle\phi(s+\tau)\exp\!{\left(  -\!\!\int_{0}^{\tau}\!\!\!\phi

464: (s+u)\hspace{0.15em}du\right)  }}, & \tau\geq0\\

465: 0, & \tau<0.

466: \end{array}

467: \right.

468: \]

469: \end{definition}
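
To illustrate Definition~\ref{def:nhexp}, consider a rate function assumed
purely for the sake of example: take $s=0$ and $\phi(t)=\max\{t,0\}$. Then
$\int_{0}^{\tau}\phi(u)\hspace{0.15em}du=\tau^{2}/2$ for $\tau\geq0$, so
\[
p(\tau)=\tau e^{-\tau^{2}/2},\qquad\tau\geq0,
\]
which is the Rayleigh density with unit scale. More generally, writing
$\Phi(\tau)=\int_{0}^{\tau}\phi(s+u)\hspace{0.15em}du$, the density in the
definition satisfies $p(\tau)=-\frac{d}{d\tau}e^{-\Phi(\tau)}$ for
$\tau\geq0$, and therefore
\[
\int_{0}^{\tau}p(x)\hspace{0.15em}dx=1-e^{-\Phi(\tau)}.
\]
Hence $p$ integrates to one exactly when $\Phi(\tau)\rightarrow\infty$ as
$\tau\rightarrow\infty$. This holds in the example above and for any
positive constant rate.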

470:

471: It is worth noting that if $\phi(t)$ is constant, $p(\tau)$ is just the ordinary

472: exponential density with that constant as its rate. We shall now show that, as with homogeneous Poisson

473: processes, the interarrival time of insertion events is distributed like an

474: exponential random variable, $L_{R,s}$, but with a time-varying density function.

475:

476: \begin{lemma}

477: \label{lem:interarrival}At any time $s$, the amount of time $L_{R,s}$

478: to the next insertion event is distributed like $\expdistrib_{s}(\lambda

479: _{R}(\cdot))$. The probability of an insertion event occurring during $(s,f]$

480: is $\probop

481: \!\{L_{R,s}<f-s\}=1-e^{-\Lambda_{R}(s,f)}$.

482: \end{lemma}

483:

484: \begin{proof}

485: \noindent Let $\{N(t),t\geq0\}$ be a nonhomogeneous Poisson process with

486: intensity function $\lambda_{R}(t)$, which implies $\probop\!\left\{

487: {N(f)-N(s)=0}\right\}  =e^{-\Lambda_{R}(s,f)}$. Now, the chance that no new

488: tuple was inserted during $(s,f]$ is the same as the chance that the process

489: $N(\cdot)$ has no arrivals during $(s,f]$, that is, $e^{-\Lambda_{R}(s,f)}$.

490: The chance that a new tuple was inserted during $(s,f]$ is just the complement

491: of the chance of no arrivals, namely,

492: \[

493: \probop\!\left\{  L_{R,s}{<f-s}\right\}  =\probop\!\left\{  N(f)-N(s)\geq

494: 1\right\}  =1-\probop\!\left\{  N(f)-N(s)=0\right\}  =1-e^{-\Lambda_{R}(s,f)}.

495: \]

496: Taking the derivative of this expression with respect to $f$ and making a

497: change of variables, the probability density of the time until the next

498: insertion from time $s$ is $p(\tau)=\lambda_{R}(s+\tau)e^{-\Lambda

499: _{R}(s,s+\tau)}$. Thus, $L_{R,s}\sim\expdistrib_{s}(\lambda_{R}(\cdot))$.

500: \end{proof}
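
As a numerical illustration of Lemma~\ref{lem:interarrival}, with figures
that are assumptions chosen for the example rather than measurements,
suppose first that $\lambda_{R}(t)=2$ insertion events per hour and that
$f-s=0.5$ hours. Then $\Lambda_{R}(s,f)=2\cdot0.5=1$, and
\[
\probop\!\left\{  L_{R,s}<f-s\right\}  =1-e^{-1}\approx0.632.
\]
For a time-varying rate, take $s=0$, $f=2$, and $\lambda_{R}(t)=1+t$ on
$[0,2]$, with time measured in hours. Then
\[
\Lambda_{R}(0,2)=\int_{0}^{2}(1+t)\hspace{0.15em}dt=2+2=4,\qquad
\probop\!\left\{  L_{R,0}<2\right\}  =1-e^{-4}\approx0.982.
\]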

501:

502: At insertion event $i$, a random number of tuples $\Delta_{R,i}^{+}$ are

503: inserted, allowing us to model bulk insertions. A \emph{bulk insertion} is the

504: simultaneous arrival of multiple tuples, and may occur because the tuples are

505: related, or because of limitations in the implementation of the server. For

506: example, e-mail servers may process an input stream periodically, resulting in

507: bulk updates of a mailbox. If the $\{\Delta_{R,i}^{+}\}$ are

508: independent and identically distributed (IID), then the stochastic process

509: $\{B_{R}(t),t\geq0\}$ representing the cumulative number of tuples inserted through

510: time $t$ is a \emph{compound Poisson} process (\emph{e.g.}, \cite{ROSS95},

511: pp.~87--88). We let $B_{R}(s,f)$ denote the number of tuples inserted during the

512: interval $(s,f]$. The expected number of inserted tuples during $(s,f]$ may be

513: computed via $\expecop\!\left[  {B}_{R}{(s,f)}\right]  =\int_{s}%

514: ^{f}\!\!\lambda_{R}(t)\expecop

515: \!\left[  {\Delta}_{R}^{+}\right]  \hspace{0.15em}dt=\expecop\left[  {\Delta

516: }_{R}^{+}\right]  \int_{s}^{f}\!\!\lambda_{R}(t)\!\hspace{0.15em}%

517: dt=\expecop\!\left[  {\Delta}_{R}^{+}\right]  \Lambda_{R}(s,f).$

518: Here, $\Delta_{R}^{+}$ represents a generic random variable distributed like

519: the $\{\Delta_{R,i}^{+}\}$.
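
A short worked instance of this formula may help. The batch-size
distribution and the value of $\Lambda_{R}(s,f)$ below are assumed for
illustration only. Suppose each insertion event carries one tuple with
probability $0.5$, two tuples with probability $0.3$, and five tuples with
probability $0.2$, so that
\[
\expecop\!\left[  {\Delta}_{R}^{+}\right]  =0.5\cdot1+0.3\cdot2+0.2\cdot
5=2.1.
\]
If $\Lambda_{R}(s,f)=4$, that is, four insertion events are expected during
$(s,f]$, then
\[
\expecop\!\left[  {B}_{R}{(s,f)}\right]  =\expecop\!\left[  {\Delta}_{R}%
^{+}\right]  \Lambda_{R}(s,f)=2.1\cdot4=8.4
\]
tuples are expected to be inserted during the interval.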

520:

521: We now consider three simple cases of this model:

522:

523: \paragraph{General nonhomogeneous Poisson process:}

524: Assume that $\expecop\!\left[  {\Delta_{R}^{+}}\right]  =1$. The expected

525: number of insertions simplifies to $\expecop\!\left[  {B}_{R}{(s,f)}\right]

526: =\expecop

527: \!\left[  {\Delta}_{R}^{+}\right]  \Lambda_{R}(s,f)=1\cdot\Lambda

528: _{R}(s,f)=\Lambda_{R}(s,f)$.

529:

530: \paragraph{Homogeneous Poisson process:}

531: Assume once more that $\expecop\!\left[  {\Delta_{R}^{+}}\right]  =1$. Assume

532: further that $\lambda_{R}(t)$ is a constant function, that is, $\lambda

533: _{R}(t)=\lambda_{R}$ for all times $t$. In this case, as shown above,

534: $\Lambda_{R}(s,f)$ takes on the simple form of $\lambda_{R}\cdot(f-s)$. Thus,

535: $\expecop\!\left[  {B}_{R}{(s,f)}\right]  =\Lambda_{R}(s,f)=\lambda_{R}%

536: \cdot(f-s)$. The interarrival times are distributed as $\expdistrib

537: (\lambda_{R})$, the exponential distribution with parameter $\lambda_{R}$.

538:

539: \paragraph{Recurrent piecewise-constant Poisson process:}

540: A simple kind of nonhomogeneous Poisson process can be built out of

541: homogeneous Poisson processes that repeat in a cyclic pattern. Given some

542: length of time $T$, such as one day or one week, suppose that the arrival rate

543: function $\lambda_{R}(t)$ of the recurrent Poisson process repeats every $T$

544: time units, that is, $\lambda_{R}(t)=\lambda_{R}\!\left(  t-T\!\left\lfloor

545: {t}/{T}\right\rfloor \right)  $ for all $t$. Furthermore, the interval

546: $\left[  0,T\right)  $ is partitioned into a finite number of subsets

547: $J_{1},\ldots,J_{K}$, with $\lambda_{R}(t)$ constant throughout each $J_{k}$,

548: $k=1,\ldots,K$. Finally, each $J_{k}$ is in turn composed of a finite number

549: of half-open intervals of the form $[s,f)$. For instance, $T$ might be one

550: day, with $K=24$ and $J_{1}=[0\text{:}00,1\text{:}00),J_{2}=[1\text{:}%

551: 00,2\text{:}00),\ldots,J_{24}=[23\text{:}00,0\text{:}00)$. As another simple

552: example, $T$ might be one week, and $K=2$. The subset $J_{1}$ would consist of

553: a firm's normal hours of operation, say $[9$:$00,18$:$00)$ for each weekday,

554: and $J_{2}=[0,T)\backslash J_{1}$ would denote all ``off-hour'' times.

555: Formalisms like those of~\cite{NIEZETTE92} could also be used to describe such

556: processes in a more structured way. We call this class of Poisson processes

557: \emph{recurrent piecewise-constant}, abbreviated \emph{RPC}.
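
As a small numerical illustration of the weekly example above (the rates are
assumed purely for illustration and are not taken from any measured system),
measure time in hours, so that $T=168$, and suppose that $\lambda_{R}(t)=10$
for $t\in J_{1}$ and $\lambda_{R}(t)=1$ for $t\in J_{2}$. The subset $J_{1}$
covers $5\times9=45$ hours of each week and $J_{2}$ the remaining $123$ hours,
so the expected number of insertion events over one full cycle is
\[
\int_{0}^{T}\lambda_{R}(t)\hspace{0.15em}dt=10\cdot45+1\cdot123=573.
\]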

558:

559: It is worth noting that, in client-server environments, the

560: insertion model should typically be formed from the client's

561: point of view. Therefore, if the server keeps a database from which many

562: clients transcribe data, the modeling of insertions for a given client should

563: only include the part of the database the client actually transcribes.

564: For example, if a ``road warrior'' is interested only in new orders for the

565: 08904 zip code area, the insertion model for that client should concentrate on

566: that zip code, ignoring orders arriving from other areas.

567:

568: \subsubsection{The complexity of computing $\Lambda_{R}(s,f)$}

569:

570: \label{sec:lambdacomplex}$\Lambda_{R}(s,f)$, the Poisson

571: expected value, is computed by integrating the model parameter $\lambda

572: _{R}(t)$ over the interval $[s,f]$. Standard numerical methods allow rapid

573: approximation of this definite integral even if no closed formula is known for

574: the indefinite integral. However, the complexity of this calculation depends

575: on the information-theoretic properties of $\lambda_{R}(t)$~\cite[Section

576: 1]{TRAUB98}.

577:

578: For our purposes, however, simple models of $\lambda_{R}(t)$ are likely to

579: suffice. For example, if $\lambda_{R}(t)$ is a polynomial of degree $d

580: \geq 0$,

581: %Note -- I have to put d+1 here because d can be zero!  JE

582: the integration can be performed in $\bigO(d+1)$ time. Consider next a

583: piecewise-polynomial Poisson process: the time line is divided into intervals

584: such that, in each time interval, $\lambda_{R}(t)$ can be written as a

585: polynomial. The complexity of calculating $\Lambda_{R}(s,f)$ in this case is

586: $\bigO(n(d+1))$, where $n$ is the number of segments in the time interval $(s,f]$,

587: and $d$ is the highest degree of the $n$ polynomials.
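
To make the operation count concrete, write $\lambda_{R}(t)=\sum_{j=0}^{d}
a_{j}t^{j}$, where the coefficients $a_{j}$ are notation introduced here only
for illustration. Then
\[
\Lambda_{R}(s,f)=\int_{s}^{f}\lambda_{R}(t)\hspace{0.15em}dt=\sum_{j=0}^{d}
\frac{a_{j}}{j+1}\left(  f^{j+1}-s^{j+1}\right)  ,
\]
a sum of $d+1$ terms. For instance, with the assumed rate $\lambda_{R}%
(t)=2+0.5t$, $s=0$, and $f=4$, we obtain $\Lambda_{R}(0,4)=2\cdot4+\frac
{0.5}{2}\cdot16=12$. In the piecewise-polynomial case, the same formula is
applied once to each of the $n$ segments and the results are added.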

588:

589: Further suppose that the piecewise-polynomial process is recurrent in a

590: similar manner to the RPC process, that is, given some fixed time interval

591: $T$, $\lambda_{R}(t)=\lambda_{R}\!\left(  t-T\!\left\lfloor {t}/{T}%

592: \right\rfloor \right)  $ for all $t$. Note that the RPC Poisson process is the

593: special case of this model in which $d=0$. If there are $c$ segments in the

594: interval $[0,T]$, then the complexity of calculating $\Lambda_{R}(s,f)$

595: becomes $\bigO(c(d+1))$, regardless of the length of the interval $[s,f]$. This

596: reduction occurs because, for all intervals of the form $[kT,(k+1)T]\subseteq

597: [s,f]$ for which $k$ is an integer, the integral $\int_{kT}^{(k+1)T}%

598: \lambda_{R}(t)\hspace{0.15em} dt$ is equal to $\int_{0}^{T}\lambda

599: _{R}(t)\hspace{0.15em} dt$, which only needs to be calculated once.
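
Concretely, let $k_{1}=\lceil s/T\rceil$ and $k_{2}=\lfloor f/T\rfloor$
(symbols introduced here only for illustration), and assume that $k_{1}\leq
k_{2}$, so that $[s,f]$ contains at least one period boundary. The integral
then decomposes as
\[
\Lambda_{R}(s,f)=\Lambda_{R}(s,k_{1}T)+(k_{2}-k_{1})\Lambda_{R}(0,T)+\Lambda
_{R}(k_{2}T,f),
\]
where the first and last terms each span less than one period. As a
hypothetical instance, take a daily cycle with $T=24$ hours, $\lambda
_{R}(t)=10$ on $[9,18)$, and $\lambda_{R}(t)=1$ elsewhere in $[0,24)$, so that
$\Lambda_{R}(0,T)=10\cdot9+1\cdot15=105$. For $s=20$ and $f=82$, we have
$k_{1}=1$, $k_{2}=3$, $\Lambda_{R}(20,24)=4$, and $\Lambda_{R}(72,82)=\Lambda
_{R}(0,10)=19$, giving
\[
\Lambda_{R}(20,82)=4+2\cdot105+19=233.
\]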

600:

601: In Section \ref{sec:verify}, we demonstrate the usefulness

602: of the RPC model for one specific application. We hypothesize that a recurrent

603: piecewise-polynomial process of modest degree (for example, $d=3$) will be

604: sufficient to model most systems we are likely to encounter, and so the

605: complexity of computing $\Lambda_{R}(s,f)$ should be very manageable.

606:

607: \subsection{Deletion}

608: \label{sec:deletions}We allow for two distinct deletion mechanisms. First, we

609: assume individual tuples have their own intrinsic stochastic life spans.

610: Second, we assume that tuples are deleted to satisfy referential integrity

611: constraints when tuples in other relations are deleted.

612: These two

613: mechanisms are combined in a tuple's overall probability of being deleted.

614: Let $R$ and $S$ be two

615: relations such that $\mathcal{K}(S)$ is a foreign key of $S$ in $R$. We refer

616: to $S$ as a \emph{primary relation} of $R$. Consider the directed multigraph

617: $G$ whose vertices consist of all relations $R$ in the database, and whose

618: edges are of the form $\langle R,S\rangle$, where $S$ is a primary relation of

619: $R$. The number of edges $\langle R,S\rangle$ is the number of foreign keys of

620: $S$ in $R$ for which integrity constraints are enforced. Let $G(R)$ denote

621: the subgraph of $G$ consisting of $R$ and all directed paths starting at $R$.

622: We assume that $G(R)$ has no directed cycles, and we denote the vertices of this

623: subgraph by $S(R)$.

624:

625: \begin{figure}

626: [ptb]

627: \begin{center}

628: \epsfig{file=multigraph.eps}

629: \caption{A partial multigraph of the case study.}%

630: \label{fig:multigraph}

631: \end{center}

632: \end{figure}


634:

635: \begin{example}

636: [Referential integrity constraints in Net.Commerce]

637: IBM's Net.Commerce is supported by a DB2 database with

638: about a hundred relations interrelated through foreign keys. For demonstration

639: purposes, consider a sample of seven relations in the Net.Commerce database.

640: Figure \ref{fig:multigraph} is a pictorial

641: representation of the multigraph $G$ of these seven relations. The

642: \texttt{MERCHANT} relation provides data about merchant profiles, the

643: \texttt{SCALE} and \texttt{DISCCALC} relations are for computing price

644: discounts, the \texttt{CATEGORY} and \texttt{CGRYREL} relations assist in

645: categorizing products, and the \texttt{ORDERS} and \texttt{SHIPTO} relations

646: contain information about orders. The six relations, \texttt{SCALE, DISCCALC,

647: CATEGORY,} \texttt{CGRYREL, ORDERS, }and \texttt{SHIPTO} have a foreign key to

648: the \texttt{MERCHANT} relation, through \texttt{MERCHANT}'s primary key

649: (\texttt{MERFNBR}). Integrity constraints are enforced between the

650: \texttt{SCALE} relation and the \texttt{MERCHANT} relation, as long as

651: \texttt{SCALE.SCLMENBR} (the foreign key to \texttt{MERCHANT.MERFNBR}) does

652: not have the value \texttt{NULL}. That is, unless a \texttt{NULL} value is

653: assigned to the \texttt{SCALE.SCLMENBR} attribute, a deletion of a tuple

654: in \texttt{MERCHANT} results in a deletion of all tuples in \texttt{SCALE}

655: such that \texttt{SCALE.SCLMENBR$\,=\,$MERCHANT.MERFNBR}. \texttt{DISCCALC}

656: has a foreign key to the \texttt{SCALE} relation, through \texttt{SCALE}'s

657: primary key (\texttt{SCLRFNBR}). There are two attributes of \texttt{CGRYREL}

658: that serve as foreign keys to the \texttt{CATEGORY} relation, through

659: \texttt{CATEGORY}'s primary key (\texttt{CGRFNBR}). Finally, \texttt{SHIPTO}

660: contains shipment information of each product in an order, and therefore it

661: has a foreign key to \texttt{ORDERS} through its primary key (\texttt{ORFNBR}%

662: ). \hspace*{\fill}$\Box$

663: \end{example}

664:

665: With regard to intrinsic deletions within a relation, we assume that each

666: tuple $r\in R(s)$ has a stochastic remaining life span $L_{R,s}^{\intrinsic}$.

667: This random variable is identically distributed for each $r\in R(s)$, and is

668: independent of the remaining life span of any other tuple and of $r$'s age at

669: time $s$ (see Section \ref{sec:nonexplife} for a

670: discussion of tuples with a non-memoryless life span). Specifically, we will

671: assume that the chance of $r\in R(t)$ being deleted in the time interval

672: $[t,t+\Delta t]$ approaches $\mu_{R}(t)\Delta t$ as $\Delta t\rightarrow0$,

673: for some function $\mu_{R}:\Re\rightarrow\lbrack0,\infty)$. We define

674: $M_{R}(s,f)=\int_{s}^{f}\mu_{R}(t)\hspace{0.15em}dt$.

675:

676: \begin{lemma}

677: $L_{R,s}^{\intrinsic}\thicksim\expdistrib_{s}(\mu_{R}(\cdot))$. The

678: probability that a tuple $r\in R(s)$ is deleted by time $f$, given that no

679: corresponding tuple in $S(R)\backslash\{R\}$ is deleted, is $\probop

680: \!\left\{  L_{R,s}^{\intrinsic}<f-s\right\}  =1-e^{-M_{R}(s,f)}$.

681: \end{lemma}

682:

683: \begin{proof}

684: \noindent Let $r\in R(s)$ be a randomly chosen tuple, and assume that no

685: corresponding tuple to $r$ in $S(R)\backslash\{R\}$ is deleted. The proof is

686: identical to that of Lemma \ref{lem:interarrival}, replacing

687: $\lambda_{R}(t)$ with $\mu_{R}(t)$ and $\Lambda_{R}(s,f)$ by $M_{R}(s,f)$.

688: \end{proof}
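
For a sense of scale, consider a hypothetical constant intrinsic deletion rate
$\mu_{R}(t)=0.01$ per day and a horizon of $f-s=30$ days (both values are
assumed for illustration only). Then $M_{R}(s,f)=0.01\cdot30=0.3$, and the
probability that a tuple present at time $s$ is intrinsically deleted by time
$f$ is
\[
1-e^{-M_{R}(s,f)}=1-e^{-0.3}\approx0.259.
\]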

689:

690: \subsubsection{Deletion and referential integrity}

691: For any $r\in R(s)$ and any relation $S\in S(R)$, we define $w(r,S)$ to be the

692: number of tuples in $S$ whose deletion would force deletion of $r$ in order to

693: maintain referential integrity. This value can be between $0$ and the number

694: of paths from $R$ to $S$ in $G(R)$. For example, if

695: $r\in R=\mathtt{CGRYREL}$ of

696: Figure \ref{fig:multigraph}, then $0\leq

697: w(r,\mathtt{CATEGORY})\leq2$ and $0\leq w(r,\mathtt{MERCHANT})\leq3$. For

698: completeness, we define $w(r,R)=1$. Each tuple in $S$ has an independent

699: remaining lifetime distributed as $\expdistrib

700: _{s}(\mu_{S}(\cdot))$, and if any of the $w(r,S)$ tuples corresponding to $r$

701: is deleted, then $r$ must be immediately deleted, to maintain referential integrity constraints. We use $p_{R}(s,f)$ to

702: denote the probability that a randomly chosen tuple in $R(s)$ survives until

703: time $f$.

704:

705: \begin{lemma}

706: \label{prop:rawdelete} $p_{R}(s,f)=\expecop_{r\in R(s)}\!\!\left[

707: \exp\!\left(  -\!\sum_{S\in S(R)}\!w(r,S)M_{S}(s,f)\right)  \right]  $, where

708: $\expecop

709: _{r\in R(s)}\!\left[  {\cdot}\right]  $ denotes expectation over random

710: selection of tuples in $R(s)$.

711: \end{lemma}

712:

713: \begin{proof}

714: \noindent Considering all $S\in S(R)$, and using the well-known fact that if

715: $L_{i}\sim\expdistrib

716: _{s}(\mu_{i}(\cdot))$ for $i=1,\ldots,k$ are independent, then

717: \begin{equation}

718: \min\!\left\{  L_{1},\ldots,L_{k}\right\}  \sim\expdistrib_{s}\!\left(

719: \sum_{i=1}^{k}\mu_{i}(\cdot)\right)  ,\label{eq:combineexp}%

720: \end{equation}

721: we conclude that the remaining lifetime of $r$ (denoted $L_{R,s}$) has a

722: nonhomogeneous exponential distribution with intensity function $\sum_{S\in

723: S(R)}w(r,S)\mu_{S}(\cdot)$. The probability of a given tuple $r\in R(s)$

724: surviving through time $f$ is thus

725: \[

726: \exp\negthickspace\left(  -\int_{s}^{f}\!\!\left(  \sum_{S\in S(R)}%

727: \!\!\!w(r,S)\mu_{S}(t)\right)  dt\right)  =\exp\negthickspace

728: \left(  -\!\!\!\sum_{S\in S(R)}\!\!\!w(r,S)M_{S}(s,f)\right)  ,

729: \]

730: and the probability that a randomly chosen tuple in $R(s)$ survives until time

731: $f$ is therefore

732: \begin{equation}

733: p_{R}(s,f)=\expecop_{r\in R(s)}\!\!\left[  \exp\!\!\left(  -\!\!\!\sum_{S\in

734: S(R)}\!\!\!w(r,S)M_{S}(s,f)\right)  \right]  .\label{eq:prdef}%

735: \end{equation}

736: \hspace*{\fill}\hspace*{\fill}

737: \end{proof}
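
As a hypothetical illustration of the lemma, suppose that $S(R)=\{R,S\}$, that
$M_{R}(s,f)=0.1$ and $M_{S}(s,f)=0.2$, and that $70\%$ of the tuples in $R(s)$
have $w(r,S)=1$ while the remaining $30\%$ have $w(r,S)=2$ (recall that
$w(r,R)=1$); all of these figures are assumed. Then
\[
p_{R}(s,f)=0.7\,e^{-(0.1+0.2)}+0.3\,e^{-(0.1+0.4)}=0.7\,e^{-0.3}+0.3\,e^{-0.5}%
\approx0.7005.
\]
Replacing $w(r,S)$ by its mean value $1.3$ inside the exponent would instead
give $e^{-0.36}\approx0.6977$. Since the exponential function is convex,
Jensen's inequality implies that this substitution can never overestimate
$p_{R}(s,f)$, but in general it does not reproduce it, which is why the
distribution of the $w(r,S)$, and not merely their mean, enters the formula.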

738:

739: The complexity analysis of integrating $\mu_{S}(t)$ over time to

740: obtain $M_S(s,f)$

741: is similar to

742: that of Section \ref{sec:lambdacomplex}. However, the

743: computation required by Lemma \ref{prop:rawdelete} may be prohibitive, in the

744: most general case, because it requires knowing the empirical distribution of

745: the $w(r,S)$ over all $r\in R(s)$ for all $S\in S(R)$. This empirical

746: distribution can be obtained by computing, for each tuple upon insertion,

747: the number of tuples in each $S\in S(R)$ with a matching foreign

748: key, either from histograms or by directly querying the database. Maintaining

749: this information requires $\bigO(\left|  {R(s)}\right|  \left|  {S(R)}\right|  )$

750: space. This complexity can be reduced using a manageably-sized sample from

751: $R(s)$. Our initial analysis of real-world applications, however, indicates

752: that in many cases, $w(r,S)$ takes on a much simpler form, in which $w(r,S)$

753: is identical for all $r\in R(s)$. We term such a typical relationship between

754: $R$ and $S\in S(R)$ a \emph{fixed multiplicity}, as defined next:

755:

756: \begin{definition}

757: The pair $\langle R,S\rangle$, where $S\in S(R)$, has \emph{fixed

758: multiplicity} if $w(r,S)$ is identical for all tuples in $R$. In this case, we

759: denote its common value by $w(R,S)$. \hspace*{\fill}$\Box$

760: \end{definition}

761:

762: \begin{example}

763: [Fixed multiplicities in Net.Commerce] Consider the example

764: multigraph of

765: Figure \ref{fig:multigraph}. Both \texttt{DISCCALC} and

766: \texttt{SCALE} reference \texttt{MERCHANT}. It is clear that the discount

767: calculation of a product (as stored in \texttt{DISCCALC}) cannot reference a

768: different merchant than the corresponding tuple in \texttt{SCALE}. The only exception is when the foreign

769: key in \texttt{SCALE} is assigned a null value. If this is the case,

770: however, there is only a single tuple in \texttt{MERCHANT} whose deletion

771: requires the deletion of a tuple in \texttt{DISCCALC}. Thus, for any tuple

772: $r\in$~\texttt{DISCCALC}, $w(r,\mbox{\tt\em SCALE})=w(r,\mbox{\tt\em

773: MERCHANT})=1$ and therefore $\langle\mbox{\tt\em DISCCALC},\mbox{\tt\em

774: SCALE}\rangle$ and $\langle\mbox{\tt\em DISCCALC},\mbox{\tt\em MERCHANT}%

775: \rangle$ both have fixed multiplicity of $1$. Now consider \texttt{CGRYREL}.

776: Since each tuple in \texttt{CGRYREL} describes the relationship between a

777: category and a subcategory, it is clear that its two foreign keys to

778: \texttt{CATEGORY} must always have distinct values. Thus, $\langle\mbox{\tt\em

779: CGRYREL},\mbox{\tt\em CATEGORY}\rangle$ has a fixed multiplicity, and

780: $w(\mbox{\tt\em CGRYREL},\mbox{\tt\em CATEGORY})=2$.\hspace*{\fill}$\Box$

781: \end{example}

782:

783: As the following lemma shows, fixed multiplicities permit great simplification

784: in computing $p_{R}(s,f)$.

785:

786: \begin{lemma}

787: \label{prop:fixed} If $\langle R,S\rangle$ has fixed multiplicity for all

788: $S\in S(R)$, $p_{R}(s,f)=\exp(-\widetilde{M}_{R}(s,f))$, where $\widetilde

789: {M}_{R}(s,f)=\int_{s}^{f}\tilde{\mu}_{R}(t)\hspace{0.15em}dt$ and $\tilde{\mu

790: }_{R}(t)=\!\!\sum_{S\in S(R)}w(R,S)\mu_{S}(t)$.

791: \end{lemma}

792:

793: \begin{proof}%

794: \begin{align*}

795: p_{R}(s,f) &  =\expecop_{r\in R(s)}\!\!\left[  \exp\!\!\left(  -\!\!\!\sum

796: _{S\in S(R)}\!\!\!w(r,S)M_{S}(s,f)\right)  \right]  \\

797: &  =\expecop_{r\in R(s)}\!\!\left[  \exp\negthickspace\left(  -\int_{s}%

798: ^{f}\!\!\left(  \sum_{S\in S(R)}\!\!\!w(r,S)\mu_{S}(t)\right)  dt\right)

799: \right]  \\

800: &  =\expecop_{r\in R(s)}\!\!\left[  \exp\negthickspace\left(  -\int_{s}%

801: ^{f}\!\!\left(  \sum_{S\in S(R)}\!\!\!w(R,S)\mu_{S}(t)\right)  dt\right)

802: \right]  \\

803: &  =\exp\negthickspace\left(  -\!\!\!\sum_{S\in S(R)}\!\!\!\left(

804: w(R,S)\int_{s}^{f}\!\!\mu_{S}(t)\hspace{0.15em}dt\right)  \right)  \\

805: &  =\exp\negthickspace{\left( - \!\!\! \sum_{S\in S(R)}%

806: \!\!\! w(R,S)M_S(s,f)\right)}=\exp\!\left(  -\widetilde{M}_{R}(s,f)\right)  .

807: \end{align*}

808: \end{proof}
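
As an illustration, take $R=\mathtt{DISCCALC}$ in the seven-relation sample,
for which $S(R)=\{\mathtt{DISCCALC},\mathtt{SCALE},\mathtt{MERCHANT}\}$ and
all multiplicities equal $1$, as argued in the example above. Assume, purely
hypothetically, the constant rates $\mu_{\mathtt{DISCCALC}}=0.02$,
$\mu_{\mathtt{SCALE}}=0.01$, and $\mu_{\mathtt{MERCHANT}}=0.005$ per day.
Then $\tilde{\mu}_{R}=0.02+0.01+0.005=0.035$ per day, and over a horizon of
$f-s=20$ days,
\[
p_{R}(s,f)=e^{-0.035\cdot20}=e^{-0.7}\approx0.497.
\]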

809:

810: Since $w(R,S)$ is fixed and constant over time, no additional statistics need

811: to be collected for it. Finally, in certain

812: situations, another alternative may be available. Let $\{N_{R}%

813: (t),t\geq0\}$ be a nonhomogeneous Poisson process with intensity function

814: $\hat{\mu}_{R}(t)$, modeling the occurrence of \emph{deletion events} in $R$.

815: At deletion event $i$, a random number $\Delta_{i}^{-}$ tuples are deleted

816: from $R$. Generally speaking, this kind of model cannot be accurate, since it

817: ignores that each deletion causes a reduction in the number of remaining

818: tuples, and thus presumably a change in the spacing of subsequent deletion

819: events. However, it may be reasonably accurate for large databases with either

820: a stable or steadily growing number of tuples, or whenever the time interval

821: $(s,f]$ is sufficiently small. Statistical analysis of the database log would

822: be required to say whether the model is applicable.

823: If the model is valid, then the

824: stochastic process $\{D_{R}(t),t\geq0\}$ representing the cumulative number of

825: deletions through time $t$, can be taken to be a compound Poisson process. The

826: expected number of deleted tuples during $(s,f]$ may be computed via

827: \[

828: \expecop\!\left[  {D_{R}(f)-D_{R}(s)}\right]  =\int_{s}^{f}\!\!\mu_{R}(t)\expecop

829: \!\left[  {\Delta}^{-}\right]  dt=M_{R}(s,f)\expecop\!\left[  {\Delta}%

830: ^{-}\right]  ,

831: \]

832: where $\Delta^{-}$ is a generic random variable distributed like the

833: $\{\Delta_{i}^{-}\}$.\hspace*{\fill}

834:

835: \subsection{Tuple survival: the combined effect of insertions and deletions}

836: \label{sec:combinedinsertdelete} Some tuples inserted during $(s,f]$ may be

837: deleted by time $f$. Let the random variable $X_{R}(s,f)$ denote the number of

838: tuples inserted during the interval $(s,f]$ that survive through time $f$.

839: Consider any tuple inserted into $R$ at time $t\in(s,f]$, and denote its

840: chance of surviving through time $f$ by $\hat{p}_{R}(t,f)$. For any $S\in

841: S(R)$ and $t\in(s,f]$, let $W(R,S,t)$ be a random variable denoting the value

842: of $w(r,S)$, given that $r$ was inserted into $R$ at time $t$. Let $W(R,t)$

843: denote the random vector, of length $\left|  {S(R)}\right|  $, formed by

844: concatenating the $W(R,S,t)$ for all $S\in S(R)$.

845:

846: \begin{lemma}

847: \label{prop:insertsurvival} $\hat{p}_{R}(t,f)=\expecop_{W(R,t)}\!\left[

848: {\exp\!\left(  -\sum_{S\in S(R)}W(R,S,t)M_{S}(t,f)\right)  }\right]  $. When

849: $\langle R,S\rangle$ has fixed multiplicity for all $S\in S(R)$, then $\hat

850: {p}_{R}(t,f)=p_{R}(t,f)=\exp(-\widetilde{M}_{R}(t,f))$.

851: \end{lemma}

852:

853: \begin{proof}

854: \noindent Let $L_{R,t}$ denote the lifetime of a tuple inserted into $R$ at

855: time $t$. Similarly to the proof of Lemma \ref{prop:rawdelete}, we know that

856: $L_{R,t}\thicksim\expdistrib_{t}(\sum_{S\in S(R)}W(R,S,t)\mu_{S}(\cdot))$. The

857: probability such a tuple survives through time $f$ is the random quantity

858: \[

859: \exp\negthickspace\left(  -\int_{t}^{f}\!\!\left(  \sum_{S\in S(R)}%

860: \!\!\!W(R,S,t)\mu_{S}(\tau)\right)  \hspace{0.15em}d\tau\right)

861: =\exp\negthickspace

862: \left(  -\!\!\!\sum_{S\in S(R)}\!\!\!W(R,S,t)M_{S}(t,f)\right)  .

863: \]

864: Considering all the possible elements of the vector $W(R,t)$, we then obtain

865: \[

866: \hat{p}_{R}(t,f)=\expecop_{W(R,t)}\!\!\left[  \exp\!\!\left(  \!-\!\!\!\!\sum

867: _{S\in S(R)}\!\!\!W(R,S,t)M_{S}(t,f)\right)  \right]  .

868: \]

869:

870: Assume now that $\langle R,S\rangle$ has fixed multiplicity for all $S\in

871: S(R)$. Consequently, we replace $W(R,S,t)$ with $w(R,S)$. Drawing on the proof

872: of the previous lemma,

873: \begin{align*}

874: \hat{p}_{R}(t,f) &  =\expecop_{W(R,t)}\!\!\left[  \exp\!\!\left(

875: \!-\!\!\!\!\sum_{S\in S(R)}\!\!\!w(R,S)M_{S}(t,f)\right)  \right]  \\

876: &  =\exp\negthickspace\left(  -\int_{t}^{f}\!\!\left(  \sum_{S\in

877: S(R)}\!\!\!w(R,S)\mu_{S}(\tau)\right)  \hspace{0.15em}d\tau\right)  \\

878: &  =p_{R}(t,f)\text{.}%

879: \end{align*}

880: \hspace*{\fill}

881: \end{proof}

882:

883: \noindent The following proposition establishes the formula for the expected

884: value of ${X_{R}(s,f)}$.

885:

886: \begin{proposition}

887: $\expecop\!\left[  {X_{R}(s,f)}\right]  =\widetilde{\Lambda}_{R}%

888: (s,f)\expecop\!\left[  {\Delta_{R}^{+}}\right]  $, where $\widetilde{\Lambda

889: }_{R}(s,f)=\int_{s}^{f}\lambda_{R}(t)\hat{p}_{R}(t,f)\hspace{0.15em}dt$. In

890: the simple case where each insertion involves exactly one tuple,

891: $X_{R}(s,f)\sim\poissondistrib(\widetilde{\Lambda}_{R}(s,f))$.

892: \end{proposition}

893:

894: \begin{proof}

895: \noindent Let $N$ be the number of insertion events in $(s,f]$, and let their

896: times be $\{T_{1},T_{2},\ldots,T_{N}\}$. Suppose that $N=n$ and that insertion

897: event $i$ happens at time $t_{i}\in(s,f]$. Event $i$ inserts a random number

898: of tuples $\Delta_{R,i}^{+}$, each of which has probability $\hat{p}_{R}%

899: (t_{i},f)$ of surviving through time $f$. Therefore, the expected number of

900: tuples surviving through $f$ from insertion event $i$ is $\expecop\!\left[

901: {\Delta_{R}^{+}}\right]  \hat{p}_{R}(t_{i},f)$. Consequently,

902: \[

903: \expecop\!\left[  {X_{R}(s,f)\;\big|\;N=n,T_{1}=t_{1},T_{2}=t_{2},\ldots

904: ,T_{n}=t_{n}}\right]  =\expecop\!\left[  {\Delta_{R}^{+}}\right]  \sum

905: _{i=1}^{n}\hat{p}_{R}(t_{i},f).

906: \]

907: Next, we recall, given that $N=n$, that the times $T_{i}$ of the insertion

908: events are distributed like $n$ independent random variables with probability

909: density function $\lambda_{R}(t)/\Lambda_{R}(s,f)$ on the interval $(s,f]$.

910: Thus,

911: \begin{align*}

912: \expecop\!\left[  {X_{R}(s,f)\;\big|\;N=n}\right]   &  =\expecop_{T_{1}%

913: ,\ldots,T_{n}}\!\left[  {\expecop\!\left[  {\Delta_{R}^{+}}\right]  \sum

914: _{i=1}^{n}\hat{p}_{R}(T_{i},f)}\right]  \\

915: &  =\expecop\!\left[  {\Delta_{R}^{+}}\right]  \sum_{i=1}^{n}\left(  \int

916: _{s}^{f}\!\!\hat{p}_{R}(t,f)\frac{\lambda_{R}(t)}{\Lambda_{R}(s,f)}%

917: \hspace{0.15em}dt\right)  \\

918: &  =n\left(  \frac{\expecop\!\left[  {\Delta_{R}^{+}}\right]  \widetilde

919: {\Lambda}_{R}(s,f)}{\Lambda_{R}(s,f)}\right)  .

920: \end{align*}

921: Finally, removing the conditioning on $N=n$, we obtain

922: \begin{align*}

923: \expecop\!\left[  {X_{R}(s,f)}\right]   &  =\expecop_{N}\!\left[  {N\left(

924: \frac{\expecop\!\left[  {\Delta_{R}^{+}}\right]  \widetilde{\Lambda}_{R}%

925: (s,f)}{\Lambda_{R}(s,f)}\right)  }\right]  \\

926: &  =\Lambda_{R}(s,f)\left(  \frac{\expecop\!\left[  {\Delta_{R}^{+}}\right]

927: \widetilde{\Lambda}_{R}(s,f)}{\Lambda_{R}(s,f)}\right)  \\

928: &  =\expecop\!\left[  {\Delta_{R}^{+}}\right]  \widetilde{\Lambda}_{R}(s,f).

929: \end{align*}

930:

931: In the case that $\Delta_{R}^{+}$ is always $1$, we may use the notion of a

932: \emph{filtered} Poisson process: if we consider only tuples that manage to

933: survive until time $f$, the chance of a single insertion in time interval

934: $[t,t+\Delta t]$ no longer has the limiting value $\lambda_{R}(t)\Delta t$,

935: but instead $\lambda_{R}(t)\hat{p}_{R}(t,f)\Delta t$. Therefore, the insertion

936: of surviving tuples can be viewed as a nonhomogeneous Poisson process with

937: intensity function $\lambda_{R}(t)\hat{p}_{R}(t,f)$ over the time interval

938: $(s,f]$, so $X_{R}(s,f)\sim\poissondistrib(\widetilde{\Lambda}_{R}(s,f))$.

939: \end{proof}

940:

941: In the general case, the computation of $\widetilde{\Lambda}_{R}(s,f)$ will

942: require approximation by numerical integration techniques; the complexity of

943: this calculation will depend on the information-theoretic properties of

944: $\lambda_{R}(\cdot)$ and the $\mu_{S}(\cdot)$, $S\in S(R)$, but is unlikely to

945: be burdensome if these functions are reasonably smoothly-varying. In one

946: important special case, however, the complexity of computing $\widetilde

947: {\Lambda}_{R}(s,f)$ is essentially the same as that of calculating

948: $\Lambda_{R}(s,f)$: suppose that for some constants $\alpha(R,S)$, $S\in

949: S(R)$, one has that $\mu_{S}(t)=\alpha(R,S)\lambda_{R}(t)$ for all $t$. That

950: is, the general insertion and deletion activity level of the relations in

951: $S(R)$ all vary proportionally to some common fluctuation pattern. In this

952: case, we have $\widetilde{\mu}_{R}(t) = \alpha(R)\lambda_{R}(t)$ and

953: $\widetilde{M}_{R}(s,f) = \alpha(R)\Lambda_{R}(s,f)$ for all $t,s,f$, where

954: $\alpha(R) = \sum_{S\in S(R)} w(R,S)\alpha(R,S)$. Making the substitution

955: $u(t)=\Lambda_{R}(t,f)$, we have:

956: \begin{align*}

957: \widetilde{\Lambda}_{R}(s,f)  &  = \int_{s}^{f} \!\! \lambda_{R}(t)

958: \exp(-\alpha(R)\Lambda_{R}(t,f)) \hspace{0.15em} dt\\

959: &  = \int_{s}^{f} \! \left(  \frac{-d\Lambda_{R}(t,f)}{dt} \right)

960: \exp(-\alpha(R)\Lambda_{R}(t,f)) \hspace{0.15em} dt\\

961: &  = \int_{s}^{f} \!\! - \exp(-\alpha(R)u(t)) \hspace{0.15em} du(t)\\

962: &  = - \int_{u(s)}^{u(f)} \!\! e^{-\alpha(R)u} \hspace{0.15em} du\\

963: &  = \frac{1}{\alpha(R)} \left(  1 - e^{-\alpha(R)\Lambda_{R}(s,f)} \right)  ,

964: \end{align*}

965: so $\widetilde{\Lambda}_{R}(s,f)$ can be calculated directly from

966: $\Lambda_{R}(s,f)$.
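
For example, with the assumed values $\alpha(R)=0.5$ and $\Lambda_{R}(s,f)=4$
(chosen only for illustration),
\[
\widetilde{\Lambda}_{R}(s,f)=\frac{1}{0.5}\left(  1-e^{-0.5\cdot4}\right)
=2\left(  1-e^{-2}\right)  \approx1.73,
\]
so, when each insertion event adds a single tuple, about $1.73$ of the $4$
tuples expected to be inserted during $(s,f]$ are expected to survive through
time $f$.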

967:

968: We define the random variable $Y_{R}(s,f)$ to be the number of tuples in

969: $R(s)$ that survive through time $f$.

970:

971: \begin{proposition}

972: $\expecop\!\left[  {Y_{R}(s,f)}\right]  =p_{R}(s,f)\left|  {R(s)}\right|  $

973: and $\expecop\!\left[  {\left|  {R(f)}\right|  }\right]  =p_{R}(s,f)\left|

974: {R(s)}\right|  +\widetilde{\Lambda}_{R}(s,f)\expecop\!\left[  {\Delta}_{R}%

975: ^{+}\right]  $.

976: \end{proposition}

977:

978: \begin{proof}

979: \noindent Each tuple in $R(s)$ has a survival probability of $p_{R}(s,f)$,

980: which yields that $\expecop

981: \!\left[  {Y_{R}(s,f)}\right]  =p_{R}(s,f)\left|  {R(s)}\right|  $. By the

982: definitions of $Y_{R}(s,f)$ and $X_{R}(s,f)$, one has that

983: \[

984: \left|  {R(f)}\right|  =Y_{R}(s,f)+X_{R}(s,f),

985: \]

986: so therefore

987: \[

988: \expecop\!\left[  {\left|  {R(f)}\right|  }\right]  =\expecop\!\left[

989: {Y_{R}(s,f)}\right]  +\expecop\!\left[  {X_{R}(s,f)}\right]  =p_{R}%

990: (s,f)\left|  {R(s)}\right|  +\widetilde{\Lambda}_{R}(s,f)\expecop\!\left[

991: {\Delta_{R}}^{+}\right]  .

992: \]

993: \hspace*{\fill}\hspace*{\fill}\hspace*{\fill}

994: \end{proof}

995:

996: In cases where deletions may also be accurately modeled as a compound Poisson

997: process, we have

998: \begin{align*}

999: \expecop\!\left[  {\left|  {R(f)}\right|  }\right]   &  =\expecop\!\left[

1000: {\left|  {R(s)}\right|  }\right]  +\expecop\!\left[  {B_{R}(s,f)}\right]  -\expecop\!\left[  {D_{R}(s,f)}\right]  \\

1001: &  =\left|  {R(s)}\right|  +\Lambda_{R}(s,f)\expecop\!\left[  {\Delta}_{R}%

1002: ^{+}\right]  -M_{R}(s,f)\expecop\!\left[  {\Delta}^{-}\right]  .

1003: \end{align*}

1004:

1005: \begin{example}

1006: [The homogeneous case]Assume that $\expecop\!\left[  {\Delta_{R}^{+}}\right]

1007: =1$, that $\langle R,S\rangle$ has fixed multiplicity for all $S\in S(R)$, and

1008: furthermore $\lambda_{R}(t)$ and $\mu_{S}(t)$, for all $S\in S(R)$, are

1009: constant functions, that is, $\lambda_{R}(t)=\lambda_{R}$ for all times $t$

1010: and $\mu_{S}(t)=\mu_{S}$ for all $S\in S(R)$ and times $t$. Then $\Lambda

1011: _{R}(s,f)=\lambda_{R}\cdot(f-s)$ and $M_{R}(s,f)=\mu_{R}\cdot(f-s)$. Thus,

1012: letting $\tilde{\mu}_{R}=\sum_{S\in S(R)}w(R,S)\mu_{S}$,

1013: \[

1014: \widetilde{\Lambda}_{R}(s,f)=\int_{s}^{f}\!\!\lambda_{R}e^{-\tilde{\mu}%

1015: _{R}(f-t)}\hspace{0.15em}dt=\frac{\lambda_{R}}{\tilde{\mu}_{R}}\left(

1016: 1-e^{-\tilde{\mu}_{R}(f-s)}\right)  ,

1017: \]

1018: and assuming ${\Delta}_{R,i}^{+}=1$ for all $i>0$,

1019: \[

1020: \expecop\!\left[  {\left|  {R(f)}\right|  }\right]  =\left|  {R(s)}\right|

1021: e^{-\tilde{\mu}_{R}(f-s)}+\frac{\lambda_{R}}{\tilde{\mu}_{R}}\left(

1022: 1-e^{-\tilde{\mu}_{R}(f-s)}\right)  =\frac{\lambda_{R}}{\tilde{\mu}_{R}%

1023: }+e^{-\tilde{\mu}_{R}(f-s)}\left(  \left|  {R(s)}\right|  -\frac{\lambda_{R}%

1024: }{\tilde{\mu}_{R}}\right)  .

1025: \]

1026: \hspace*{\fill}$\Box$

1027: \end{example}
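
To put numbers on the homogeneous case (all values are hypothetical), let
$\lambda_{R}=100$ insertions per day, $\tilde{\mu}_{R}=0.05$ per day, $\left|
{R(s)}\right|  =1000$, and $f-s=10$ days. Then $\lambda_{R}/\tilde{\mu}%
_{R}=2000$, $e^{-\tilde{\mu}_{R}(f-s)}=e^{-0.5}\approx0.6065$, and
\[
\expecop\!\left[  {\left|  {R(f)}\right|  }\right]  \approx
2000+0.6065\cdot\left(  1000-2000\right)  \approx1393.5.
\]
As $f-s$ grows, the exponential term vanishes and the expected cardinality
approaches $\lambda_{R}/\tilde{\mu}_{R}=2000$, regardless of the initial
cardinality $\left|  {R(s)}\right|  $.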

1028:

1029: \subsection{Tuples with non-exponential life spans}

1030: \label{sec:nonexplife}We now consider the possibility

1031: that tuples in $R$ have a stochastic life span $L_{R}^{\intrinsic}$ that is

1032: not memoryless, but rather has some general cumulative distribution function

1033: $G_{R}$. For example, if tuples in $R$ correspond to pieces of work in process

1034: on a production floor, the likelihood of deletion might rise the longer the

1035: tuple has been in existence. Let us consider a single relation, and thus no

1036: referential integrity constraints. For any tuple $r$, let $b(r)$ denote the

1037: time it was created. We next establish the expected cardinality of $R$ at time

1038: $f$.

1039:

1040: \begin{proposition}

1041: In the case that tuples in $R$ have lifetimes with a general cumulative

1042: distribution function $G_{R}$,

1043: \begin{equation}

1044: \expecop\!\left[  {\left|  {R(f)}\right|  }\right]  =\left|  {R(s)}\right|

1045: \expecop_{r\in R(s)}\!\left[  {\frac{1-G_{R}(f-b(r))}{1-G_{R}(s-b(r))}%

1046: }\right]  +\expecop\!\left[  {\Delta}_{R}^{+}\right]  \int_{s}^{f}%

1047: \!\!\lambda_{R}(t)\left(  1-G_{R}(f-t)\right)  \hspace{0.15em}%

1048: dt.\label{eq:nonexperf}%

1049: \end{equation}

1050: \end{proposition}

1051:

1052: \begin{proof}

1053: \noindent Let $L_{R}^{\intrinsic}$ denote a generic random variable with

1054: cumulative distribution $G_{R}$. The probability of $r\in R(s)$ surviving

1055: throughout $(s,f]$ is then

1056: \[

1057: \probop\!\left\{  L_{R}^{\intrinsic}\geq f-b(r)\;\big|\;L_{R}^{\intrinsic

1058: }\geq s-b(r)\right\}  =\frac{1-G_{R}(f-b(r))}{1-G_{R}(s-b(r))},

1059: \]

1060: and therefore the expected number of tuples in $R(s)$ that survive through

1061: time $f$ is

1062: \[

1063: \expecop\!\left[  {Y_{R}(s,f)}\right]  =\!\!\!\sum_{r\in R(s)}\!\!\!\left(

1064: \frac{1-G_{R}(f-b(r))}{1-G_{R}(s-b(r))}\right)  =\left|  {R(s)}\right|

1065: \expecop_{r\in R(s)}\!\left[  {\frac{1-G_{R}(f-b(r))}{1-G_{R}(s-b(r))}%

1066: }\right]  .

1067: \]

1068: We now consider a tuple $r$ inserted at some time $t\in(s,f]$. The probability

1069: that such a tuple survives through time $f$ is simply $\hat{p}_{R}%

1070: (t,f)=1-G_{R}(f-t)$. By reasoning similar to the proof of

1071: Lemma~\ref{prop:insertsurvival},

1072: \[

1073: \expecop\!\left[  {X_{R}(s,f)}\right]  =\expecop

1074: \!\left[  {\Delta_{R}}^{+}\right]  \widetilde{\Lambda}_{R}(s,f)=\expecop

1075: \!\left[  {\Delta_{R}}^{+}\right]  \int_{s}^{f\!\!}\lambda_{R}(t)\left(

1076: 1-G_{R}(f-t)\right)  \hspace{0.15em}dt.

1077: \]

1078: The conclusion then follows from $\expecop\!\left[  {\left|  {R(f)}\right|

1079: }\right]  =\expecop\!\left[  {Y_{R}(s,f)}\right]  +\expecop\!\left[

1080: {X_{R}(s,f)}\right]  $.

1081: \end{proof}

1082:

1083: It is worth noting that, as opposed to the memoryless case presented above,

1084: the calculation of $\expecop\!\left[  {Y_{R}(s,f)}\right]  $ requires

1085: remembering the commit times $b(r)$ of all tuples $r\in R(s)$, or equivalently

1086: the ages of all such tuples. Of course, for large relations $R$, a reasonable

1087: approximation could be obtained by using a manageably-sized sample to

1088: estimate

1089: \[

1090: \expecop_{r\in R(s)}\!\left[  \frac{1-G_{R}(f-b(r))}{1-G_{R}(s-b(r))}

1091: \right]  .

1092: \]

1093: It is likely that the integral in (\ref{eq:nonexperf}) will require general

1094: numerical integration, depending on the exact form of $G_{R}$.
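
In some simple cases, however, a closed form is available. As a sketch under
assumed parameters, suppose that life spans are uniformly distributed on
$[0,a]$, so that $G_{R}(x)=x/a$ for $0\leq x\leq a$, that $\lambda_{R}%
(t)=\lambda_{R}$ is constant, and that $f-s\leq a$. Then
\[
\int_{s}^{f}\!\!\lambda_{R}\left(  1-\frac{f-t}{a}\right)  \hspace
{0.15em}dt=\lambda_{R}\left(  (f-s)-\frac{(f-s)^{2}}{2a}\right)  .
\]
With the illustrative values $\lambda_{R}=10$, $a=20$, and $f-s=10$, the
integral equals $10\cdot(10-2.5)=75$. Likewise, a tuple $r\in R(s)$ of age
$s-b(r)=5$ survives through time $f$ with probability $(1-15/20)/(1-5/20)=1/3$.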

1095:

1096: \subsection{Summary}

1097:

1098: In this section, we have provided a model for the insertion and deletion of

1099: tuples in a relational database. The immediate benefit of this model is the

1100: computation of the expected relation cardinality ($\expecop\!\left[  {\left|

1101: {R(f)}\right|  }\right]  $), given an initial cardinality and insertion and

1102: tuple life span parameters. Relation cardinality has proven to be an important

1103: property in many database tools, including query optimization and database

1104: tuning. Reasonable assumptions regarding fixed multiplicity allow, once

1105: appropriate statistics have been gathered, a rapid computation of

1106: cardinalities in this framework. Section \ref{sec:verify}

1107: elaborates on statistics gathering and model validation.

1108:

1109: A note regarding tuples with non-exponential life spans is now warranted. For

1110: the case of a single relation, non-exponential life spans add only a moderate

1111: amount of complexity to our model, namely the requirement to store at least an

1112: approximation of the distribution of tuple ages in $R(s)$. For multiple

1113: relations with referential integrity constraints,

1114: however, the complexity of dealing with

1115: general tuple life spans is much greater. First, to estimate the cardinality

1116: of $R(f)$, we must keep (approximate) tuple age distributions for all

1117: relations in $S(R)$. Second, because the tuple life span distributions of some

1118: of the members of $S(R)$ are not memoryless, we cannot combine them with a

1119: simple relation like (\ref{eq:combineexp}). Furthermore, in attempting to find

1120: the distribution of the remaining life span of a particular tuple $r\in R(s)$,

1121: it may become necessary to consider the issue of the correlation of ages of

1122: tuples in $R(s)$ with the ages of the corresponding tuples in other relations

1123: of $S(R)$. Because of these complications, we defer further consideration of

1124: non-exponential tuple life spans to future research.

1125:

1126: \section{Modeling data modification}

1127: \label{sec:modif}This section describes various ways to model

1128: the modification of the contents of tuples. We start with a general approach,

1129: using Markov chains, followed by several special cases where the amount of

1130: computation can be greatly reduced.

1131:

1132: \subsection{Content-dependent updates}

1133:

1134: \label{sec:condepup}In this section, we model the modification of the contents of tuples as a

1135: finite-state continuous-time Markov chain, thus assuming dependence on tuples'

1136: previous contents. For each relation $R$, we allow for some (possibly empty)

1137: subset $\mathcal{C}(R)\subset\mathcal{A}(R)$ of its attributes to be subject

1138: to change over the lifetime of a tuple. We do not permit primary key fields to

1139: be modified, that is, $\mathcal{C}(R)\cap\mathcal{K}(R)=\emptyset$.

1140:

1141: Attribute values may change at time instants called \emph{transition events},

1142: which are the transition times of the Markov chain. We assume that the spacing

1143: of transition events is memoryless with respect to the age of a tuple

1144: (although it may depend on the time and the current value of the attribute, as

1145: demonstrated below). For any attribute $A$, tuple $r$, time $s$, and value

1146: $v\in\dom A$ with $r.A(s)=v$, the time remaining until the next transition

1147: event for $r.A$ is a random variable $\tau_{v,s}^{R,A}$ with the distribution

1148: $\expdistrib

1149: _{s}(\ell_{v}^{R,A}\gamma_{R,A}(\cdot))$, where $\gamma_{R,A}:\Re

1150: \rightarrow\lbrack0,\infty)$ is a function giving the general instantaneous

1151: rate of change for the attribute, and $\ell_{v}^{R,A}$ is a nonnegative scalar

1152: which we call the \emph{relative exit rate} of $v$. We define $\Gamma

1153: _{R,A}(s,f)=\int_{s}^{f}\gamma_{R,A}(t)\hspace{0.15em}dt$. When a transition

1154: event occurs from state $u\in\dom A$, attribute $A$ changes to $v\in\dom A$

1155: with probability $P_{u,v}^{R,A}$.

1156:

1157: Suppose $\mathcal{A}=\{A_{1},A_{2},...,A_{k}\}\subseteq\mathcal{R}$ is an

1158: independently varying set of attributes, and $v=\langle v_{1},v_{2}%

1159: ,...,v_{k}\rangle\in\dom\mathcal{A}$ is a compound value. Then the time until

1160: the next transition event for $r.\mathcal{A}$ is $\tau_{v,s}^{R,\mathcal{A}}=\min

1161: \{\tau_{v_{1},s}^{R,A_{1}},\ldots,\tau_{v_{k},s}^{R,A_{k}}\}$. As a rule, we

1162: will assume that the modification processes for the attributes of a relation

1163: are independent, so $\tau_{v,s}^{R,\mathcal{A}}\sim\expdistrib

1164: _{s}(\sum_{i=1}^{k}\ell_{v_{i}}^{R,A_{i}}\gamma_{R,A_{i}}(\cdot))$. When the

1165: functions $\gamma_{R,A_{i}}$ are identical for $i=1,\ldots,k$, we define

1166: $\gamma_{R,\mathcal{A}}=\gamma_{R,A_{i}}$ and $\ell_{v}^{R,\mathcal{A}}=\sum_{i=1}^{k}\ell_{v_{i}%

1167: }^{R,A_{i}}$, so $\tau_{v,s}^{R,\mathcal{A}}\sim\expdistrib_{s}(\ell_{v}^{R,\mathcal{A}}%

1168: \gamma_{R,\mathcal{A}}(\cdot))$. To justify the assumption of independence, we note that

1169: coordinated modifications among attributes can be modeled by replacing the

1170: coordinated attributes with a single compound attribute (this technique

1171: requires that the attributes have identical $\gamma_{R,A}(\cdot)$ functions,

1172: which is reasonable if they change in a coordinated way).

1173:

1174: Under these assumptions,

1175: let $\overline{\mathcal{C}}(R)$ denote a partition of $\mathcal{C}(R)$ into

1176: subsets $\mathcal{A}$ such that any two attributes $A_{1},A_{2}\in

1177: \mathcal{C}(R)$ vary dependently iff they are in the same $\mathcal{A}%

1178: \in\overline{\mathcal{C}}(R)$.

1179:

1180: \begin{example}

1181: [First alteration time]\label{ex:fat} For a relation $R$ and time

1182: $s$, we define $\Upsilon_{R,s}$ to be the amount of time until the next change

1183: in $R$, be it a tuple insertion, a tuple deletion, or an attribute

1184: modification. Also, for any $S\in S(R)$, let $D(R,S,s)$ denote the number of

1185: tuples in $S(s)$ whose deletion would force the deletion of some tuple in

1186: $R(s)$ $($and therefore $D(R,R,s)=\left|  {R(s)}\right|  )$. The following

1187: proposition establishes the distribution of $\Upsilon_{R,s}$.

1188: \end{example}

1189:

1190: \begin{proposition}

1191: $\Upsilon_{R,s}\sim\expdistrib_{s}(\zeta_{R}(\cdot))$, where

1192: \[

1193: \zeta_{R}(t)=\lambda_{R}(t)\;\;+\sum_{S\in S(R)}\!\!\!D(R,S,s)\mu

1194: _{S}(t)\;\;+\sum_{\mathcal{A}\in\overline{\mathcal{C}}(R)}\!h(R,A,s)\gamma

1195: _{R,\mathcal{A}}(t)

1196: \]

1197: and $h(R,A,s)=\sum_{v\in\dom\mathcal{A}}\hat{R}_{\mathcal{A},v}(s)\ell

1198: _{v}^{R,\mathcal{A}}$. The probability of any alteration to $R$ in the time interval

1199: $(s,f]$ is $1-e^{-Z_{R}(s,f)}$, where

1200: \[

1201: Z_{R}(s,f)=\Lambda_{R}(s,f)\;\;+\sum_{S\in S(R)}\!\!\!D(R,S,s)M_{S}%

1202: (s,f)\;\;+\sum_{\mathcal{A}\in\overline{\mathcal{C}}(R)}\!\!\!h(R,A,s)\Gamma

1203: _{R,\mathcal{A}}(s,f)

1204: \]

1205: \end{proposition}

1206:

1207: \begin{proof}

1208: Let $\Upsilon_{R,s}^{\insertion}$, $\Upsilon_{R,s}^{\modification}$, and

1209: $\Upsilon_{R,s}^{\deletion}$ be the times until the next insertion,

1210: modification, and deletion in $R$, respectively. From Section

1211: \ref{sec:estcard}, we have that $\Upsilon_{R,s}^{\insertion}\sim\expdistrib

1212: _{s}(\lambda_{R}(\cdot))$. Now, for each $S\in S(R)$, there are $D(R,S,s)$

1213: tuples whose deletion would cause a deletion in $R$. The time until deletion

1214: of any such $r\in S\in S(R)$ is distributed like $\expdistrib_{s}(\mu

1215: _{S}(\cdot))$. The deletion processes for all these tuples are independent

1216: across all of $S(R)$, so we can use (\ref{eq:combineexp}) to conclude that

1217: \[

1218: \Upsilon_{R,s}^{\deletion}\sim\expdistrib_{s}\left(  \sum_{S\in S(R)}%

1219: \!\!\!D(R,S,s)\mu_{S}(\cdot)\right)  .

1220: \]

1221: From the preceding discussion, we have

1222: \[

1223: \Upsilon_{R,s}^{\modification}=\min_{r\in R(s)}\tau_{r.\mathcal{C}(R)(s),s}^{R,\mathcal{C}(R)}\sim\expdistrib

1224: _{s}\left(  \sum_{r\in R(s)}\sum_{\mathcal{A}\in\overline{\mathcal{C}}(R)}\ell_{r.\mathcal{A}%

1225: (s)}^{R,\mathcal{A}}\gamma_{R,\mathcal{A}}(\cdot)\right)  .

1226: \]

1227: Since $\Upsilon_{R,s}=\min\{\Upsilon_{R,s}^{\insertion

1228: },\Upsilon_{R,s}^{\modification},\Upsilon_{R,s}^{\deletion}\}$, we therefore

1229: have, again using independence and (\ref{eq:combineexp}), that $\Upsilon

1230: _{R,s}\sim\expdistrib_{s}(\zeta_{R}(\cdot))$, where

1231: \begin{align*}

1232: \zeta_{R}(t) &  =\lambda_{R}(t)\;\;+\sum_{S\in S(R)}\!\!\!D(R,S,s)\mu

1233: _{S}(t)\;\;+\sum_{r\in R(s)}\!\!\left[  \sum_{\mathcal{A}\in\overline

1234: {\mathcal{C}}(R)}\!\!\!\ell_{r.\mathcal{A}(s)}^{R,\mathcal{A}}\gamma

1235: _{R,\mathcal{A}}(t)\right]  \\

1236: &  =\lambda_{R}(t)\;\;+\sum_{S\in S(R)}\!\!\!D(R,S,s)\mu_{S}(t)\;\;+\sum

1237: _{\mathcal{A}\in\overline{\mathcal{C}}(R)}\!\!\!h(R,A,s)\gamma_{R,\mathcal{A}%

1238: }(t).

1239: \end{align*}

1240: Integrating over $(s,f]$ results in

1241: \begin{align*}

1242: Z_{R}(s,f) &  =\int_{s}^{f}\!\zeta_{R}(t)\hspace{0.15em}dt\\

1243: &  =\Lambda_{R}(s,f)\;\;+\sum_{S\in S(R)}\!\!\!D(R,S,s)M_{S}(s,f)\;\;+\sum

1244: _{\mathcal{A}\in\overline{\mathcal{C}}(R)}\!\!\!h(R,A,s)\Gamma_{R,\mathcal{A}%

1245: }(s,f).

1246: \end{align*}

1247: Therefore, the probability of any alteration to $R$ in the time interval

1248: $(s,f]$ is%

1249: \[

1250: \probop\left\{  \Upsilon_{R,s}<f-s\right\}  =1-e^{-Z_{R}(s,f)}%

1251: \]

1252: \hspace*{\fill}

1253: \end{proof}

1254:

1255: \noindent\textbf{Example \ref{ex:fat} continued (First alteration

1256: transcription policy)~~} \textit{Suppose the

1257: user wishes to refresh her replica of relation $R$ whenever the

1258: probability that it

1259: contains any inaccuracy exceeds some threshold $\pi$, a tactic we call

1260: the \emph{first

1261: alteration policy}. Then, a refresh is required at time $f$ if $1-e^{-Z_{R}

1262: (s,f)}>\pi$.}\hspace*{\fill}$\Box$

1263:
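To make this refresh rule concrete, the threshold condition can be solved for
$Z_{R}(s,f)$. The numerical values below are hypothetical and serve only as an
illustration. For $0\leq\pi<1$,
\[
1-e^{-Z_{R}(s,f)}>\pi\quad\Longleftrightarrow\quad Z_{R}(s,f)>-\ln(1-\pi).
\]
If one assumes, in addition, that $\zeta_{R}(t)$ equals a constant $\zeta>0$
throughout the interval, so that $Z_{R}(s,f)=\zeta\,(f-s)$, then the policy
calls for a refresh as soon as
\[
f-s>\frac{-\ln(1-\pi)}{\zeta}.
\]
For instance, with $\pi=0.1$ and $\zeta=2$ alterations per hour, one has
$-\ln(0.9)\approx0.105$, so a refresh would be due after roughly $0.053$
hours, or a little over three minutes.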

1264:

1265: ~

1266:

1267: Given $\mathcal{A}$, we describe the transition process for the value

1268: $r.\mathcal{A}$ over time by probabilities

1269: \[

1270: P_{u,v}^{R,\mathcal{A}}(s,f)=\probop\!\left\{  r.\mathcal{A}(f)\!=\!v\;\big

1271: |\;r.\mathcal{A}(s)\!=\!u\right\}  ,

1272: \]

1273: for any two values $u,v\in\dom\mathcal{A}$ and times $s<f$. Under the

1274: assumption of independence,

1275: \begin{equation}

1276: P_{u,v}^{R,\mathcal{A}}(s,f)=\prod_{i=1}^{k}P_{u_{i},v_{i}}^{R,A_{i}}(s,f).

1277: \label{eq:probproduct}%

1278: \end{equation}

1279: Given any simple attribute $A$, we define $q_{u,v}^{R,A}$, the \emph{relative

1280: transition rate} from $u$ to $v$, by

1281: \[

1282: q_{u,v}^{R,A}=\ell_{u}^{R,A}P_{u,v}^{R,A}.

1283: \]

1284: Given a set of attributes $\mathcal{A}$ with identical $\gamma_{R,A}(\cdot)$

1285: functions, the compound transition rate $q_{u,v}^{R,\mathcal{A}}$ may be computed via

1286: \begin{equation}

1287: q_{u,v}^{R,\mathcal{A}}=\ell_{u}^{R,\mathcal{A}}P_{u,v}^{R,\mathcal{A}%

1288: }=\left(  \sum_{i=1}^{k}\ell_{u_{i}}^{R,A_{i}}\right)  \left(  \prod_{i=1}%

1289: ^{k}P_{u_{i},v_{i}}^{R,A_{i}}\right)  . \label{eq:jointexit}%

1290: \end{equation}

1291: Let $Q^{R,A}$ be the matrix of $q_{u,v}^{R,A}$, where $q_{u,u}^{R,A}=-\ell

1292: _{u}^{R,A}$.

1293:

1294: \begin{proposition}

1295: The matrix $P^{R,A}(s,f)$ of elements $P_{u,v}^{R,A}(s,f)$ is given by the

1296: matrix exponential formula

1297: \begin{equation}

1298: P^{R,A}(s,f)=\exp\!\left(  \Gamma_{R,A}(s,f)\,Q^{R,A}\right)  =\sum

1299: _{n=0}^{\infty}\frac{\Gamma_{R,A}(s,f)^{n}}{n!}{\left(  Q^{R,A}\right)  }%

1300: ^{n}.\label{eq:matrixexp}%

1301: \end{equation}

1302: \end{proposition}

1303:

1304: \begin{proof}

1305: \noindent Consider a continuous-time Markov chain on the same state space

1306: $\dom

1307: A$, and with the same instantaneous transition probabilities $P_{u,v}^{R,A}$,

1308: where $u,v\in\dom A$. However, in the new chain, the holding time in each

1309: state $v$ is simply a homogeneous exponential random variable with arrival

1310: rate $\ell_{v}^{R,A}$. We call this system the \emph{linear-time} chain, to

1311: distinguish it from the original chain. Define $\overline{P}_{u,v}^{R,A}(t)$

1312: to be the chance that the linear-time chain is in state $v$ at time $t$, given

1313: that it is in state $u$ at time $0$. Standard results for finite-state

1314: continuous-time Markov chains imply that the matrix $\overline{P}^{R,A}(t)$ of these probabilities satisfies

1315: \[

1316: \overline{P}^{R,A}(t)=\exp\!\left(  t\,Q^{R,A}\right)  =\sum

1317: _{n=0}^{\infty}\frac{t^{n}}{n!}{\left(  Q^{R,A}\right)  }^{n}.

1318: \]

1319: By a transformation of the time variable, we then assert that

1320: \[

1321: P_{u,v}^{R,A}(s,f)=\overline{P}_{u,v}^{R,A}(\Gamma_{R,A}(s,f)),

1322: \]

1323: from which the result follows.

1324: \end{proof}

1325:

1326: \begin{example}[Query optimization, revisited]

1327: \label{ex:qopt2}As the following proposition shows, our model

1328: can be used to estimate the histogram of a relation $R$ at time $f$. A query

1329: optimizer running at time $f$ could use expected histograms, calculated in

1330: this manner, instead of the old histograms $\hat{R}_{A}(s)$.

1331: \end{example}

1332:

1333: \begin{proposition}

1334: Assume that $w(r,S)$, for all $S\in S(R)$, is independent of the attribute

1335: values $r.A(s)$ for all $A\in\mathcal{C}(R)$. Let $\hat{\omega}_{u}^{R,A}(t)$

1336: denote the probability that $r.A(t)=u$, given that $r$ is inserted into $R$ at

1337: time $t$. Then, for all $v\in\dom A$,

1338: \begin{align}

1339: \expecop\!\left[  {\hat{R}_{A,v}(f)}\right]   &  =p_{R}(s,f)\!\!\!\!\sum

1340: _{u\in\dom A}\!\!\!\!\hat{R}_{A,u}(s)P_{u,v}^{R,A}(s,f)\nonumber\\

1341: &  \quad\quad+\quad\expecop\!\left[  {\Delta_{R}^{+}}\right]  \!\!\!\!\sum

1342: _{u\in\dom A}\!\!\!\left(  \int_{s}^{f}\!\!\hat{\omega}_{u}^{R,A}(t)\hat

1343: {p}_{R}(s,f)\lambda_{R}(t)P_{u,v}^{R,A}(t,f)\intdspace dt

1344: \right)  .\label{eq:query}%

1345: \end{align}

1346: \end{proposition}

1347:

1348: \begin{proof}

1349: We first compute the expected number of surviving tuples $r$ whose values

1350: $r.A$ migrate to $v$. Given a value $u\in\dom A$, there are $\hat{R}_{A,u}(s)$

1351: tuples at time $s$ such that $r.A(s)=u$. Using the previous results, the

1352: expected number of these tuples surviving through time $f$ is $\hat{R}%

1353: _{A,u}(s)p_{R}(s,f)$, and the probability of each surviving tuple $r$ having

1354: $r.A(f)=v$ is $P_{u,v}^{R,A}(s,f)$. Using the independence assumption and

1355: summing over all $u\in\dom

1356: A$, one has that the expected number of tuples in $R(s)$ that survive through

1357: $f$ and have $r.A(f)=v$ is

1358: \[

1359: \sum_{u\in\dom A}\!\!\!\!\hat{R}_{A,u}(s)p_{R}(s,f)P_{u,v}^{R,A}%

1360: (s,f)=p_{R}(s,f)\!\!\!\!\sum_{u\in\dom A}\!\!\!\!\hat{R}_{A,u}(s)P_{u,v}%

1361: ^{R,A}(s,f).

1362: \]

1363: We next consider newly inserted tuples. Recall that $\hat{\omega}_{u}%

1364: ^{R,A}(t)$ denotes the probability that $r.A(t)=u$, given that $r$ is inserted

1365: into $R$ at time $t$. Suppose that an insertion occurs at time $t\in(s,f]$.

1366: The expected number of tuples $r$ created at this insertion that both survive

1367: until $f$ and have $r.A(f)=v$ is

1368: \[

1369: \hat{p}_{R}(t,f)\!\!\!\sum_{u\in\dom A}\!\!\!\!\hat{\omega}_{u}^{R,A}%

1370: (t)P_{u,v}^{R,A}(t,f).

1371: \]

1372: By logic similar to Proposition~\ref{prop:insertsurvival}, one may then

1373: conclude that the expected number of newly-inserted tuples that survive

1374: through time $f$ and have $r.A(f)=v$ is

1375: \[

1376: \expecop\!\left[  {\Delta_{R}^{+}}\right]  \!\!\!\!\sum_{u\in\dom

1377: A}\!\!\!\left(  \int_{s}^{f}\!\!\hat{\omega}_{u}^{R,A}(t)\hat{p}%

1378: _{R}(s,f)\lambda_{R}(t)P_{u,v}^{R,A}(t,f)\hspace{0.15em}dt\right)  .

1379: \]

1380: The result follows by adding the last two expressions.

1381: \end{proof}

1382:

1383: \noindent\textbf{Example \ref{ex:qopt2} continued~} \emph{We next

1384: consider whether the complexity of calculating (\ref{eq:query}) is

1385: preferable to recomputing the histogram vector ${\hat{R}_{A}(f)}$.

1386: This topic is quite involved and depends heavily on the specific structure of

1387: the database (\emph{e.g.}, the availability of indices) and the

1388: specific application (\emph{e.g.}, the concentration of values in a

1389: small subset of an attribute's domain). In what follows, we lay out

1390: some qualitative considerations in deciding whether calculating

1391: (\ref{eq:query}) would be more efficient than recalculating

1392: ${\hat{R}_{A}(f)}$ ``from scratch.''  Experimentation with

1393: real-world applications is left for further research.}

1394:

1395: \emph{Generally speaking, direct computation of the histogram of an

1396: attribute $A$ (in the absence of an index for $A$) can be done by

1397: either scanning all tuples (although sampling may also be used) or

1398: scanning a modification log to capture changes to the prior histogram

1399: vector ${\hat{R}_{A}(s)}$ during $(s,f]$. Therefore, the

1400: recomputation can be performed in $\bigO(\min\{\card{R(f)},T(s,f)\})$

1401: time, where $T(s,f)$ denotes the total number of updates during

1402: $(s,f]$. Whenever $\card{R(f)}$ and $T(s,f)$ are both large ---

1403: \emph{i.e.}, the database is large and the transaction load is high

1404: --- the straightforward techniques will be relatively

1405: unattractive. As for the estimation technique, it will probably work

1406: best when $\card{\dom A}$ is small (for example, for a binary attribute)

1407: or whenever the subset of actually utilized values in the domain is

1408: small. In addition, commercial databases recompute

1409: the entire histogram as a single, atomic task. Formula

1410: (\ref{eq:query}), on the other hand, can be performed on a subset of the

1411: attribute values.  For example, in the case of exact matching (say, a

1412: condition of the form $\mathtt{WHERE}\;A=v$), it is sufficient to

1413: compute $\hat{R}_{A,v}(f)$, rather than the full

1414: ${\hat{R}_{A}(f)}$ vector. Finally, it is worth noting that

1415: computing the expected value

1416: of ${\hat{R}_{A,v}(f)}$ via (\ref{eq:query}) does not require locking $R$,

1417: while a full histogram recomputation may involve extended periods of

1418: locking.}\hspace*{\fill}$\Box$

1419:

1420:

1421: ~

1422:

1423: We next consider the number of tuples in $R(s)$ that have survived through

1424: time $f$ without being modified, which we denote $Y_{R}^{-}(s,f)$. The

1425: expectation of this random variable is

1426: \[

1427: \expecop\!\left[  {Y_{R}^{-}(s,f)}\right]  =p_{R}(s,f)\!\!\!\!

1428: \sum_{v\in\dom{\mathcal{A}}}\!\!\!\!

1429: \hat{R}_{\mathcal{A},v}(s)P_{v,v}^{R,\mathcal{A}}(s,f)

1430: \]

1431: We let $Y_{R}^{+}(s,f)=Y_{R}(s,f)-Y_{R}^{-}(s,f)$ denote the number of tuples

1432: in $R(s)$ that have survived through time $f$ and were modified; it follows

1433: from the linearity of the $\expecop\!\left[  {\cdot}\right]  $ operator that

1434: \[

1435: \expecop\!\left[  {Y_{R}^{+}(s,f)}\right]  =\expecop\!\left[  {Y_{R}%

1436: (s,f)}\right]  -\expecop\!\left[  {Y_{R}^{-}(s,f)}\right]  .

1437: \]

1438:

1439: \subsubsection{Complexity analysis of content-dependent updates}

1440:

1441: In practice, as with computing a scalar exponential, only a limited number of

1442: terms will be needed to compute the sum (\ref{eq:matrixexp}) to machine

1443: precision. It is worth noting that efficient means of calculating

1444: (\ref{eq:matrixexp}) are a major topic in the field of computational probability.

1445:

1446: In the case of a compound attribute $\mathcal{A}=\{A_{1},A_{2},...,A_{k}\}$

1447: with independently varying components, it will be computationally more

1448: efficient to first calculate the individual transition probability matrices

1449: $P^{R,A_{i}}(s,f)$ via (\ref{eq:matrixexp}), and then calculate the joint

1450: probability matrix $P^{R,\mathcal{A}}(s,f)$ using (\ref{eq:probproduct}%

1451: ), rather than first finding the joint exit rate matrix $Q^{R,\mathcal{A}}$

1452: via (\ref{eq:jointexit}) and then applying (\ref{eq:matrixexp}). The former

1453: approach would involve repeated multiplications of square matrices of size

1454: $\left|  {\dom A_{i}}\right|  $, for $i=1,\ldots,k$, resulting in a

1455: computational complexity of $\bigO(\sum_{i=1}^{k}n_{i}{\left|  {\dom

1456: A_{i}}\right|  }^{\nu})$, where $n_{i}$ is the number of iterations needed to

1457: compute the sum (\ref{eq:matrixexp}) to machine precision, and the complexity

1458: of multiplying two $n\times n$ matrices is $\bigO(n^{\nu})$.\footnote{$\nu=3$ for

1459: the standard method and $\nu=\log_{2}7$ for Strassen's and related methods.}

1460: The latter would involve multiplying square matrices of size $\prod_{i=1}%

1461: ^{k}\left|  {\dom

1462: A_{i}}\right|  $, resulting in the considerably worse complexity of

1463: $\bigO(n(\prod_{i=1}^{k}{\left|  {\dom A_{i}}\right|  )}^{\nu})$, where $n$ is the

1464: number of iterations needed to obtain the desired precision.

1465:
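As a rough illustration of the gap between these two approaches, consider the
following hypothetical sizes, chosen only for arithmetic convenience: $k=3$
attributes with $\left|{\dom A_{i}}\right|=10$ each, the standard
multiplication method ($\nu=3$), and, as a simplifying assumption, the same
number of terms $n_{i}=n$ in every case. Ignoring the constants hidden in the
$\bigO$ notation,
\[
\sum_{i=1}^{3}n{\left|{\dom A_{i}}\right|}^{3}=3000\,n,
\qquad
n{\left(\prod_{i=1}^{3}\left|{\dom A_{i}}\right|\right)}^{3}
=n\,{(10^{3})}^{3}=10^{9}\,n,
\]
so the factored computation is cheaper by a factor of roughly $3\times10^{5}$.
Assembling the joint matrix via (\ref{eq:probproduct}) adds only a product of
$k$ factors for each of its $10^{6}$ entries, which does not alter this
conclusion.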

1466: \subsection{Simplified modification models}

1467:

1468: We next introduce several possible simplifications of the general Markov chain

1469: case. To do so, we start by differentiating numeric domains from non-numeric

1470: domains. Certain database attributes $A\in\mathcal{A}$, such as prices and

1471: order quantities, represent numbers, and numeric operations such as

1472: addition are meaningful for these attributes. For such attributes, one can

1473: easily define a distance function between two attribute values, as we shall

1474: see below. We call the domains $\dom A$ of such attributes \emph{numeric

1475: domains}, and denote the set of all attributes with numeric domains by

1476: $\mathcal{N}\subset\mathcal{B}$. All other attributes and domains are

1477: considered \emph{non-numeric}.\footnote{Distance metrics can also be defined

1478: for complex data types such as images. We leave the handling of such cases to

1479: further research.} It is worth noting that not all numeric data necessarily

1480: constitute a numeric domain. Consider, for example, a customer relation $R$

1481: whose primary key is a customer number. Although the customer number consists

1482: of numeric symbols, it is essentially an arbitrary identification string for

1483: which arithmetic operations like addition and subtraction are not

1484: intrinsically meaningful for the database application. We consider such

1485: attributes to be non-numeric.

1486:

1487: \subsubsection{Domain lumping}

1488:

1489: \label{sec:lumping}To make our data modification model more computationally

1490: tractable, it may be appropriate, in many cases, to simplify the Markov chain

1491: state space for an attribute $A$ so that it is much smaller than $\dom A$.

1492: Suppose, for example, that $A$ is a 64-character string representing a street

1493: address. Restricting to 96 printable characters, $A$ may assume on the order

1494: of $96^{64}\approx10^{126}$ possible values. It is obviously unnecessary,

1495: inappropriate, and intractable to work with a Markov chain with such an

1496: astronomical number of states.

1497:

1498: One possible remedy for such situations is referred to as \emph{lumping} in

1499: the Markov chain literature~\cite{KEMENY60}. In our terminology, suppose we

1500: can partition $\dom A$ into a collection of sets ${\{V\}}_{V\in\mathcal{V}}$

1501: with the property that $\left|  {\mathcal{V}}\right|  \ll\left|  {\dom

1502: A}\right|  $ and

1503: \[

1504: \forall\;U,V\in\mathcal{V},\;\forall\;u,u^{\prime}\in U\quad\sum_{v\in

1505: V}q_{u,v}^{R,A}=\sum_{v\in V}q_{u^{\prime},v}^{R,A}.

1506: \]

1507: Then, one can model the transitions between the ``lumps'' $V\in\mathcal{V}$ as

1508: a much smaller Markov chain whose set of states is $\mathcal{V}$, with the

1509: transition rate from $U\in\mathcal{V}$ to $V\in\mathcal{V}$ being given by the

1510: common value of $\sum_{v\in V}q_{u,v}^{R,A}$, $u\in U$. If we are interested

1511: only in which lump the attribute is in, rather than its precise value, this

1512: smaller chain will suffice. Using lumping, the complexity of the computation

1513: is directly dependent on the number of lumps. We now give a few simple examples:

1514:
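As a purely numerical illustration of the lumpability condition above,
consider a domain with three values $a$, $b$, and $c$. The rates are
hypothetical and are not drawn from any application. Let the relative
transition rate matrix, with rows and columns ordered $a,b,c$, be
\[
Q^{R,A}=
\begin{pmatrix}
-2 & 1 & 1\\
3 & -4 & 1\\
3 & 2 & -5
\end{pmatrix}
,
\]
and take the lumps $U=\{a\}$ and $V=\{b,c\}$. The rates from $b$ and from $c$
into $U$ are both $3$, and the summed rates from $b$ and from $c$ into $V$ are
$-4+1=-3$ and $2-5=-3$, so the condition holds. The lumped chain on
$\{U,V\}$ then has the rate matrix
\[
\begin{pmatrix}
-2 & 2\\
3 & -3
\end{pmatrix}
,
\]
where the rate from $U$ to $V$ is $1+1=2$.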

1515: \begin{example}

1516: [Lumping into a binary domain]\label{ex:binarylump}Consider the

1517: street address example just discussed. Fortunately, if an address has changed

1518: since time $s$, the database user is unlikely to be concerned with how

1519: different it is from the address at time $s$, but simply whether it is

1520: different. Thus, instead of modeling the full domain $\dom A$, we can

1521: represent the domain via the simple binary set $\{0,1\}$, where $0$

1522: indicates that the address has not changed since time $s$, and $1$ indicates

1523: that it has. We assume that the exit rates $q_{v,r.A(s)}^{R,A}$ from all other

1524: addresses $v\in\dom A$ back to the original value $r.A(s)$ all have the

1525: identical value $\theta^{\prime}$. In this case, one has $P_{0,1}%

1526: ^{R,A}=P_{1,0}^{R,A}=1$, and the behavior of the attribute is fully captured

1527: by the exit rates $\ell_{0}^{R,A}=q_{0,1}^{R,A}$ and

1528: $\ell_{1}^{R,A}=q_{1,0}^{R,A}$.  We will abbreviate

1529: these quantities by $\theta$ and $\theta^{\prime}$, respectively.

1530:

1531: Using standard results for a two-state continuous-time Markov chain

1532: \cite[Section VI.3.3]{TK94}, we conclude that

1533: \begin{align}

1534: P_{0,0}^{R,A}(s,f) &  =

1535: \frac{\theta^{\prime}+\theta e^{-{(\theta+\theta^{\prime})\Gamma_{R,A}(s,f)}}}

1536: {\theta+\theta^{\prime}}\label{binary:0to0} \\

1537: P_{0,1}^{R,A}(s,f) &  =

1538: \frac{\theta-\theta e^{-{(\theta+\theta^{\prime})\Gamma_{R,A}(s,f)}}}

1539: {\theta+\theta^{\prime}}.  \label{binary:0to1}%

1540: \end{align}

1541: \hspace*{\fill}$\Box$

1542: \end{example}

1543:
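The formulas (\ref{binary:0to0}) and (\ref{binary:0to1}) can also be checked
directly against (\ref{eq:matrixexp}). The following short derivation is
included only as a sanity check, and assumes $\theta+\theta^{\prime}>0$. For
the two-state chain,
\[
Q^{R,A}=
\begin{pmatrix}
-\theta & \theta\\
\theta^{\prime} & -\theta^{\prime}
\end{pmatrix}
,\qquad
{\left(Q^{R,A}\right)}^{2}=-(\theta+\theta^{\prime})\,Q^{R,A},
\]
so that ${\left(Q^{R,A}\right)}^{n}={\left(-(\theta+\theta^{\prime})\right)}^{n-1}Q^{R,A}$
for all $n\geq1$. Writing $\Gamma=\Gamma_{R,A}(s,f)$ and letting $I$ denote
the $2\times2$ identity matrix, the series in (\ref{eq:matrixexp}) becomes
\[
\exp\!\left(\Gamma\,Q^{R,A}\right)
=I+\sum_{n=1}^{\infty}\frac{\Gamma^{n}{\left(-(\theta+\theta^{\prime})\right)}^{n-1}}{n!}\,Q^{R,A}
=I+\frac{1-e^{-(\theta+\theta^{\prime})\Gamma}}{\theta+\theta^{\prime}}\,Q^{R,A}.
\]
Reading off the first row gives
\[
P_{0,0}^{R,A}(s,f)=1-\frac{\theta\left(1-e^{-(\theta+\theta^{\prime})\Gamma}\right)}{\theta+\theta^{\prime}}
=\frac{\theta^{\prime}+\theta e^{-(\theta+\theta^{\prime})\Gamma}}{\theta+\theta^{\prime}},
\qquad
P_{0,1}^{R,A}(s,f)=\frac{\theta-\theta e^{-(\theta+\theta^{\prime})\Gamma}}{\theta+\theta^{\prime}},
\]
in agreement with the two formulas of the example.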

1544: %JE -- I think it is much simpler if we give just skip directly to the

1545: %web crawling example...

1546: %[Domain lumping in the semi-interval case]Consider an attribute $A$ with a

1547: %numeric domain for which we are interested in identifying whether a tuple $r$

1548: %is in some semi-interval of the form $[\min(\dom A),k)$, where $k\in\dom

1549: %A$. Assume further that once an attribute has been assigned with a value in

1550: %$[k,\max(\dom A))$ it can never go back to $[\min(\dom A),k)$. Under this

1551: %condition, we can collapse $[k,\max(\dom A))$ into a single state

1552: %$k^{^{\prime}}$ for which $\forall v\in\lbrack\min(\dom

1553: %A),k)$, $q_{{v,k^{^{\prime}}}}^{R,A}=\sum_{u\in\lbrack k,\max(\dom A))}%

1554: %q_{v,u}^{R,A}$.

1555:

1556: \begin{example}[Web crawling]

1557: As an even simpler special case, consider a Web crawler

1558: (\emph{e.g.}, \cite{PINKERTON94,HEYDON99,CHO00}). Such a crawler needs to

1559: visit Web pages upon change to re-process their content, possibly for the use

1560: of a search engine. Recalling Example \ref{ex:binarylump}, one

1561: may define a boolean attribute \texttt{Modified} in a relation that collects

1562: information on Web pages. \texttt{Modified} is set to \texttt{True} once the

1563: page has changed, and back to \texttt{False} once the Web crawler has visited

1564: the page. Therefore, once a page has been modified to \texttt{True}, it cannot

1565: be modified back to \texttt{False} before the next visit of the Web

1566: crawler.

1567: In the analysis of

1568: Example \ref{ex:binarylump}, one can set $\theta^{\prime}=0$,

1569: resulting in $P_{0,0}^{R,A}(s,f)=e^{-{\theta\Gamma_{R,A}(s,f)}}$ and $P_{0,1}%

1570: ^{R,A}(s,f)=1-e^{-{\theta\Gamma_{R,A}(s,f)}}$. \hspace*{\fill}$\Box$

1571: \end{example}

1572:

1573: \subsubsection{Random walks}

1574: \label{sec:randomwalks}

1575: Like large non-numeric domains, many numeric domains may

1576: also be cumbersome to model directly via Markov chain techniques. For example,

1577: a 32-bit integer attribute can, in theory, take $2^{32}\approx4\times10^{9}$

1578: distinct values, and it would be virtually impossible to directly form, much

1579: less exponentiate, a full transition rate matrix for a Markov chain of this size.

1580:

1581: Fortunately, it is likely that such attributes will have ``structured'' value

1582: transition patterns that can be modeled, or at least closely approximated, in

1583: a tractable way. As an example, we consider here a random walk model for

1584: numeric attributes.

1585:

1586: In this case, we still suppose that the attribute $A$ is modified only at

1587: transition event times that are distributed as described above. Letting

1588: $t_{i}$ denote the time of transition event $i$, with $t_{0}=s$, we suppose

1589: that at transition event $i$, the value of attribute $A$ is modified according

1590: to

1591: \[

1592: r.A(t_{i})=r.A(t_{i-1})+\Delta A_{i},

1593: \]

1594: where $\Delta A_{i}$ is a random variable. We suppose that the random

1595: variables $\left\{  \Delta A_{i}\right\}  $ are IID, that is, they are

1596: independent and share a common distribution with mean $\delta$ and variance

1597: $\sigma^{2}$. Defining

1598: \[

1599: \Delta A(s,f)=\!\!\!\sum_{i:t_{i}\in(s,f]}\!\!\!\!\Delta A_{i},

1600: \]

1601: we obtain that $\{\Delta A(s,f),f\geq s\}$ is a nonhomogeneous compound

1602: Poisson process, and $r.A(f)=r.A(s)+\Delta A(s,f)$. From standard results for

1603: compound Poisson processes, we then obtain for each tuple $r\in R(s)$ that

1604: $\expecop\!\left[  {r.A(f)}\right]  =r.A(s)+\Gamma_{R,A}(s,f)\delta$.

1605:
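For a sense of scale, consider the following hypothetical figures, which are
not taken from any data set: $\Gamma_{R,A}(s,f)=4$, $\delta=0.5$, and
$\sigma^{2}=1$. The expected value at time $f$ is then
$r.A(s)+4\times0.5=r.A(s)+2$. If, as the mean formula above implicitly does,
one takes the number of transition events in $(s,f]$ to be Poisson with mean
$\Gamma_{R,A}(s,f)$, the standard variance formula for compound Poisson sums
gives
\[
\varop\!\left[  {r.A(f)}\right]  =\Gamma_{R,A}(s,f)\left(  \sigma^{2}
+\delta^{2}\right)  =4\,(1+0.25)=5,
\]
that is, a standard deviation of about $2.24$ around the expected value.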

1606: It should be stressed that such a model must ultimately be only an

1607: approximation, since a random walk model of this kind would, strictly

1608: speaking, require an infinite number of possible states, while $\dom A$ is

1609: necessarily finite for any real database. However, we still expect it to be

1610: accurate and useful in many situations, such as when $r.A(s)$ and

1611: $\expecop\!\left[  {r.A(f)}\right]  $ are both far from the largest and smallest

1612: possible values in $\dom A$.

1613:

1614: \subsubsection{Content-independent overwrites}

1615:

1616: Consider the simple case in which $P_{u,v}^{R,A}(s,f)$ is independent of $u$

1617: once a transition event has occurred. Let $\mathcal{A}\subseteq\mathcal{C}(R)$

1618: be a set of attributes $A$ with identical $\gamma_{R,A}$ functions, and let

1619: $\Gamma_{R,\mathcal{A}}(s,t)=\Gamma_{R,A}(s,t)$ for any $A\in\mathcal{A}$. We

1620: define a probability distribution $\omega_{R,\mathcal{A}}$ over $\dom

1621: \mathcal{A}$, and assume that at each transition event, a new value for

1622: $\mathcal{A}$ is selected at random from this distribution, without regard to

1623: the prior value of $r.\mathcal{A}$. It is thus possible that a transition

1624: event will leave $r.\mathcal{A}$ unchanged, since the value selected may be

1625: the same one already stored in $r$. For any tuple $r\in R(s)\cap R(f)$ and

1626: $u\in\dom\mathcal{A}$, we thus compute the probability $P_{u,u}^{R,\mathcal{A}%

1627: }(s,f)$ that the value of $r.\mathcal{A}$ remains unchanged at $u$ at time

1628: $f$ to be

1629: \begin{align*}

1630: P_{u,u}^{R,\mathcal{A}}(s,f)  &  =\probop\!\left\{  {\tau_{u}^{R,\mathcal{A}%

1631: }>f-s}\right\}  +\probop\!\left\{  {\tau_{u}^{R,\mathcal{A}}\leq f-s}\right\}

1632: \omega_{R,\mathcal{A}}(u)\\

1633: &  =e^{-\ell_{u}^{R,\mathcal{A}}\Gamma_{R,\mathcal{A}}(s,f)}+\left(

1634: 1-e^{-\ell_{u}^{R,\mathcal{A}}\Gamma_{R,\mathcal{A}}(s,f)}\right)

1635: \omega_{R,\mathcal{A}}(u)\\

1636: &  =e^{-\ell_{u}^{R,\mathcal{A}}\Gamma_{R,\mathcal{A}}(s,f)}\left(

1637: 1-\omega_{R,\mathcal{A}}(u)\right)  +\omega_{R,\mathcal{A}}(u)

1638: \end{align*}

1639: For $u,v\in\dom\mathcal{A}$ such that $u\neq v$, we also compute the

1640: probability $P_{u,v}^{R,\mathcal{A}}(s,f)$ that $r.\mathcal{A}$ changes from $u$ to $v$

1641: in $(s,f]$ to be

1642: \begin{align*}

1643: P_{u,v}^{R,\mathcal{A}}(s,f)  &  =\probop\!\left\{  {\tau_{u}^{R,\mathcal{A}%

1644: }\leq f-s}\right\}  \omega_{R,\mathcal{A}}(v)\\

1645: &  =\left(  1-e^{-\ell_{u}^{R,\mathcal{A}}\Gamma_{R,\mathcal{A}}(s,f)}\right)

1646: \omega_{R,\mathcal{A}}(v).

1647: \end{align*}

1648:
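As a quick check that the two formulas above define a probability
distribution over $\dom\mathcal{A}$, write
$\rho_{u}=e^{-\ell_{u}^{R,\mathcal{A}}\Gamma_{R,\mathcal{A}}(s,f)}$ (a
shorthand introduced only for this check) and use
$\sum_{v\neq u}\omega_{R,\mathcal{A}}(v)=1-\omega_{R,\mathcal{A}}(u)$:
\[
P_{u,u}^{R,\mathcal{A}}(s,f)+\sum_{v\neq u}P_{u,v}^{R,\mathcal{A}}(s,f)
=\rho_{u}+(1-\rho_{u})\,\omega_{R,\mathcal{A}}(u)+(1-\rho_{u})\left(
1-\omega_{R,\mathcal{A}}(u)\right)  =1.
\]
For instance, with the hypothetical values $\rho_{u}=0.6$ and
$\omega_{R,\mathcal{A}}(u)=0.25$, the attribute holds the value $u$ at time
$f$ with probability $0.6+0.4\times0.25=0.7$, and holds some other value with
probability $0.4\times0.75=0.3$.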

1649: Content-independent overwrites are a special case of the Markov chain model

1650: discussed above. To apply the general model formulae when content-independent

1651: overwrites are present, each $\ell_{v}^{R,A}$ is multiplied by $1-\omega

1652: _{R,A}(v)$, and one sets $P_{u,v}^{R,A}=\omega_{R,A}(v)/(1-\omega_{R,A}(u))$ for all

1653: $u,v\in\dom A$, $u\neq v$.

1654:

1655: \subsection{Summary}

1656:

1657: In this section we have introduced a general Markov-chain model for data

1658: modification, and discussed three simplified models that allow tractable

1659: computation. Using these models, one can compute, in probabilistic terms, value

1660: histograms at time $f$, given a known initial set of value histograms

1661: at time $s<f$. Such a

1662: model could be useful in query optimization, whenever the

1663: continual gathering of statistics becomes impossible due to either heavy system

1664: loads or structural constraints (\emph{e.g.}, federations of databases with

1665: autonomous DBMSs).

1666:

1667: Generally speaking, computing the transition matrix for an attribute $A$

1668: involves repeated multiplications of square matrices of size $\left|  {\dom

1669: A}\right|  $, resulting in a computational complexity of $\bigO(n{\left|  {\dom

1670: A}\right|  }^{\nu})$, where $n$ is the number of iterations needed to compute

1671: the sum (\ref{eq:matrixexp}) to machine precision. While $n$ is usually small,

1672: $\left|  {\dom

1673: A}\right|  $ may be very large, as demonstrated in Section

1674: \ref{sec:lumping}

1675: and Section \ref{sec:randomwalks}. Methods such as domain lumping would

1676: require $\bigO(nX^{\nu})$ time, where $X\ll\left|  {\dom

1677: A}\right|  $.\footnote{Here, $n$ may also be affected by the change of

1678: domain.} As for random walks and independent updates, both methods no longer

1679: require repeated matrix multiplications, but rather the computation of

1680: $\Gamma_{R,A}(s,f)$. The complexity of calculating $\Gamma_{R,A}(s,f)$ is

1681: similar to that for $\Lambda_{R}(s,f)$ in Section \ref{sec:lambdacomplex}.

1682:

1683: \section{Insertion model verification}

1684: \label{sec:verify}

1685: It is well-known that Poisson processes

1686: model a world where data updates are independent from one another. While in

1687: databases with widely distributed access, \emph{e.g.}, incoming e-mails,

1688: postings to newsgroups, or posting of orders from independent customers, such

1689: an independence assumption seems plausible, we still need to validate the

1690: model against real data. In this section we shall present some initial

1691: experiments as a ``proof of concept.'' These experiments deal only with the

1692: insertion component of the model. Further experiments, including modification

1693: and deletion operations, will be reported in future work.

1694:

1695: \begin{figure}[ptb]

1696: \begin{center}

1697: \epsfig{file=training.eps,width=6.5in}

1698: \caption{Training data set.}

1699: \label{fig:training}

1700: \end{center}

1701: \end{figure}


1703:

1704: Our data set is taken from postings to the DBWORLD electronic bulletin board.

1705: The data were collected over more than six months and consist of about 750

1706: insertions, from November 9$^{\text{th}}$, 2000 through May 14$^{\text{th}}$,

1707: 2001. Figure \ref{fig:training} illustrates a data set with 580

1708: insertions during the interval

1709: [2000/11/9:00:00:00,~2001/3/31:00:00:00). We used the

1710: Figure \ref{fig:training} data as a \emph{training set},

1711: \emph{i.e.}, it serves as our basis for

1712: parameter estimation. Later, in order to test the model, we applied these

1713: parameters to a separate \emph{testing set} covering the period

1714: [2001/3/31:00:00:00,~2001/5/15:00:00:00). In the experiments described below,

1715: we tried fitting the training data with two insertion-only models, namely a

1716: homogeneous Poisson process and an RPC Poisson process (see Section

1717: \ref{sec:insertion}). For each of these two models, we have applied two

1718: variations, either as a compound or as a non-compound model. In the

1719: experiments described below, we have used the Kolmogorov-Smirnov goodness of

1720: fit test (see for example~\cite[Section 7.7]{HOGG83}). For completeness, we

1721: first overview the principles of this statistical test.

1722:

1723: The Kolmogorov-Smirnov test evaluates the likelihood of a \emph{null

1724: hypothesis} that a given sample may

1725: have been drawn from some

1726: hypothesized distribution.  If the null hypothesis is true, and the

1727: sample has indeed been drawn from the

1728: hypothesized distribution, then the empirical cumulative distribution of the

1729: sample should be close to its theoretical counterpart. If the sample

1730: cumulative distribution is too far from the hypothesized distribution at any

1731: point, that suggests that the sample comes from a different distribution.

1732: Formally, suppose that the theoretical distribution is $F(x)$, and we have $n$

1733: sample values $x_{1},\ldots,x_{n}$ in nondecreasing order. We define an empirical

1734: cumulative distribution $F_{n}(x)$ via

1735: \[

1736: F_{n}(x)=\left\{

1737: \begin{array}

1738: [c]{cl}%

1739: 0, & \text{if }x<x_{1}\\

1740: \frac{k}{n}, & \text{if }x_{k}\leq x<x_{k+1}\\

1741: 1, & \text{if }x\geq x_{n},

1742: \end{array}

1743: \right.

1744: \]

1745: and then compute $D_{n}=\sup_{k=1,\ldots,n}\{|F_{n}(x_{k})-F(x_{k})|\}$. For

1746: large $n$, given a significance level $\alpha$, the test measures $D_{n}$

1747: against $X(\alpha)/\sqrt{n}$, where $X(\alpha)$ is a factor depending on the

1748: \emph{significance level} $\alpha$ at which we reject the null hypothesis.

1749: For example, $X(0.05)=1.36$ and $X(0.1)=1.22$.  The value of $\alpha$

1750: is the probability of a false rejection (a Type~I error), that is, the chance that

1751: the null hypothesis is rejected when it is actually true.

1752: Larger values of $\alpha$ make the test harder to pass.

1753:
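To make the decision rule concrete, the following worked computation evaluates the large-sample rejection threshold $X(\alpha)/\sqrt{n}$ for a sample of size $n=580$ (the size of the training set), using only the two factors quoted above; values are rounded to three significant digits.
\begin{align*}
\alpha=0.05:  &  \quad\frac{X(0.05)}{\sqrt{580}}=\frac{1.36}{24.08}\approx0.0565,\\
\alpha=0.1:  &  \quad\frac{X(0.1)}{\sqrt{580}}=\frac{1.22}{24.08}\approx0.0507.
\end{align*}
The null hypothesis is rejected at level $\alpha$ whenever $D_{n}$ exceeds the corresponding threshold, so the larger $\alpha$ yields the smaller, stricter threshold.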

1754: \subsection{Fitting the homogeneous Poisson process}

1755: \label{sec:fithomo}

1756: \begin{figure}[ptb]

1757: \begin{center}

1758: \epsfig{file=kshomo.eps, width=6.5in}

1759: \caption{A comparison of the theoretical and empirical distribution functions

1760: for the homogeneous Poisson process model (a) and the compound homogeneous

1761: Poisson process model (b).}%

1762: \label{fig:kshomo}%

1763: \end{center}

1764: \end{figure}


1766:

1767: Based on the training set, we computed the parameter for a homogeneous Poisson

1768: process by averaging the 580 interarrival times, an unbiased estimator of the

1769: Poisson process parameter. The average interarrival time was computed to be

1770: 5:15:19, and thus $\lambda=4.57$ per day. Figure \ref{fig:kshomo}(a)

1771: provides a pictorial comparison of the cumulative

1772: distribution functions of the interarrival times with their theoretical

1773: counterpart. We applied the Kolmogorov-Smirnov test to the distribution

1774: of interarrival times, comparing it with an exponential distribution with a

1775: parameter of $\lambda=4.57$. The outcome of the test is $D_{n}=0.106$, which

1776: means we can reject the null hypothesis at any reasonable significance level

1777: $\alpha\geq 0.005$ (for $\alpha=0.005$, the rejection threshold is $0.0718$ for

1778: $n=580$). In all likelihood, then, the data are not derived from a homogeneous

1779: Poisson process.

1780:
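The rate quoted above follows directly from the average interarrival time. As a worked check, converting 5:15:19 to hours and inverting gives
\[
\bar{t}=5+\frac{15}{60}+\frac{19}{3600}\approx5.255\text{ hours},\qquad
\lambda=\frac{24\text{ hours/day}}{\bar{t}}\approx4.57\text{ per day}.
\]
Here $\bar{t}$ is shorthand introduced only for this computation.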

1781: Next, we have applied a compound homogeneous Poisson model.  Our

1782: rationale in this case is that DBWORLD is a moderated list, and the

1783: moderators sometimes work on postings in batches.

1784: These batches are sometimes posted to the group in tightly-spaced clusters.

1785: For all practical purposes, we

1786: treat each such cluster as a single batch insertion event.

1787: To construct the model, any

1788: two insertions occurring within less than one minute from one another were

1789: considered to be a single event occurring at the insertion

1790: time of the first arrival. For example, on November 14, 2000, we had three

1791: arrivals, one at 13:43:19, and two more at 13:43:23. All three arrivals are

1792: considered to occur at the same insertion arrival event, with an insertion

1793: time of 13:43:19. Using the compound variation, the data set now has 557

1794: insertion events. The revised average interarrival time is now 5:28:20, and

1795: thus $\lambda=4.39$ per day. Figure \ref{fig:kshomo}(b)

1796: provides a pictorial comparison of the cumulative distribution functions of

1797: the interarrival times, assuming a compound model, with their theoretical

1798: counterpart. We have applied the Kolmogorov-Smirnov test to the distribution

1799: of interarrival times, comparing it with an exponential distribution with a

1800: parameter of $\lambda=4.39$. The outcome was somewhat better than before:

1801: $D_{n}=0.094$, which means we can still reject the null hypothesis at any

1802: significance level $\alpha\geq 0.005$ (for $\alpha=0.005$, the rejection

1803: threshold is $0.0733$ for $n=557$). Although the compound variant of the model

1804: fits the data better, it is still not statistically plausible.

1805:

1806: \subsection{Fitting the RPC Poisson process}

1807: \label{sec:rpcfit}%

1808: \begin{table}[tbp] \centering

1809: \begin{tabular}[c]{|l|l|l|l|}\hline

1810: & \textbf{Workdays} & \textbf{Saturday} & \textbf{Sunday}\\\hline\hline

1811: $\lbrack0\text{:}00,3\text{:}00)$ & \multicolumn{1}{|c|}{$2.40$} &

1812: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\cline{1-2}%

1813: $\lbrack3\text{:}00,6\text{:}00)$ & \multicolumn{1}{|c|}{$5.96$} &

1814: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\cline{1-2}%

1815: $\lbrack6\text{:}00,9\text{:}00)$ & \multicolumn{1}{|c|}{$6.04$} &

1816: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\cline{1-2}%

1817: & & \multicolumn{1}{|c|}{$1.50$} & \multicolumn{1}{|c|}{$1.15$}\\

1818: $\lbrack9\text{:}00,18\text{:}00)$ & \multicolumn{1}{|c|}{$7.50$} &

1819: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\

1820: & & \multicolumn{1}{|c|}{}& \multicolumn{1}{|c|}{}\\\cline{1-2}%

1821: $\lbrack18\text{:}00,21\text{:}00)$ & \multicolumn{1}{|c|}{$3.03$} &

1822: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\cline{1-2}

1823: $\lbrack21\text{:}00,24\text{:}00)$ & \multicolumn{1}{|c|}{$2.41$} &

1824: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\hline

1825: \end{tabular}

1826: \caption{Average $\lambda$ levels for the recurrent piecewise-constant

1827: Poisson model.}

1828: \label{tab:rpclambdas}

1829: \end{table}

1830:

1831: Next, we tried fitting the data to an RPC model.

1832: Examining the data, we chose a cycle of one week.

1833: Within each week, we used the same pattern for each weekday, with one

1834: interval for work hours (9:00--18:00), plus five additional three-hour

1835: intervals for ``off hours''.

1836: We treated Saturday and Sunday each as one long interval.

1837: Table \ref{tab:rpclambdas}

1838: shows the arrival rate parameters for each segment

1839: of the RPC Poisson model, calculated in much the same manner as

1840: for the homogeneous Poisson model.

1841:

1842: The specific methodology for structuring the RPC Poisson model is

1843: beyond the scope of this paper and can range from \emph{ad hoc}

1844: ``look and feel'' crafting (as practiced here)

1845: to more established formal processes for

1846: statistically segmenting, filtering, and aggregating intervals

1847: \cite{STOUMBOS97,STOUMBOS2002}. It is worth noting, however, that from

1848: experimenting with different methods, we have found that the model is not

1849: sensitive to slight changes in the interval definitions.  Also, the model we

1850: selected has only $8$ segments, and thus only $8$ parameters, so there

1851: is little danger of ``overfitting'' the training data set, which has over

1852: $500$ observations.

1853:

1854: Next, we attempted to statistically validate the RPC model. To this end, we use

1855: the following lemma:

1856:

1857: \begin{lemma}

1858: \label{lem:udistrib}

1859: Given a nonhomogeneous Poisson process with arrival intensity

1860: $\lambda(t)$, the random variable $U_{s}=\int

1861: _{s}^{s+L_{R,s}}\lambda(t)\hspace{0.15em}dt$ is of the distribution

1862: $\expdistrib(1)$.

1863: \end{lemma}

1864:

1865: \begin{proof}

1866: Let $f_{s}(t)=\Lambda(s,s+t)$, which is a monotonically nondecreasing

1867: function. From Lemma \ref{lem:interarrival}, $\probop

1868: \!\{{L_{R,s}<t\}=}1-e^{-f_{s}(t)}$ for all $t\geq0$. We have $U_{s}=f_{s}(L_{R,s}%

1869: )$. By applying the monotonic function $f_{s}$ to both sides of the inequality

1870: $L_{R,s}<t$, one has that $\probop\!\{f_{s}({L_{R,s})<f_{s}(t)\}}=\probop

1871: \!\{{L_{R,s}<t\}}=1-e^{-f_{s}(t)}$ for all $t\geq0$. Substituting in the

1872: definitions of $U_{s}$ and $u=f_{s}(t)$, one then obtains $\probop

1873: \!\{U{_{s}<u\}}=1-e^{-u}$ for all $u\geq0$, and therefore $U_{s}%

1874: \sim\expdistrib(1)$.

1875: \end{proof}

1876:

1877: Thus, given an instantaneous arrival rate $\lambda(t)$, and a sequence of

1878: observed arrival events $\{t_{n}\}_{n=0}^{N}$, we compute the set of values

1879: $u_{n}=\int_{t_{n-1}}^{t_{n}}\lambda(t)\hspace{0.15em}dt$, $n=1,\ldots,N,$ and

1880: perform a Kolmogorov-Smirnov test of them versus the unit exponential

1881: distribution.

1882:
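For an RPC model the intensity is piecewise constant, so the integral defining $u_{n}$ reduces to a finite sum. As a sketch, write $\lambda_{j}$ for the rate on segment $I_{j}$ and $|\cdot|$ for interval length (both notations are introduced here for illustration only):
\[
u_{n}=\int_{t_{n-1}}^{t_{n}}\lambda(t)\hspace{0.15em}dt=\sum_{j}\lambda_{j}\,\bigl|[t_{n-1},t_{n})\cap I_{j}\bigr|.
\]
For instance, assuming that the rates in Table~\ref{tab:rpclambdas} are expressed per day (an assumption; the table does not state its units), a hypothetical pair of successive arrivals at 8:00 and 10:00 on a workday spends one hour in the $[6\text{:}00,9\text{:}00)$ segment and one hour in the $[9\text{:}00,18\text{:}00)$ segment, giving
\[
u_{n}=6.04\cdot\tfrac{1}{24}+7.50\cdot\tfrac{1}{24}=\tfrac{13.54}{24}\approx0.564.
\]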

1883: \begin{figure}[tb]

1884: \begin{center}

1885: \epsfig{file=ksrpc.eps,width=6.5in}

1886: \caption{A comparison of the theoretical and empirical distribution functions of

1887: $U$ for the RPC Poisson model (a) and the compound RPC Poisson model (b).}

1888: \label{fig:ksrpc}%

1889: \end{center}

1890: \end{figure}

1891:

1892: Figure \ref{fig:ksrpc}(a) provides a

1893: comparison of the theoretical and empirical cumulative distribution of the

1894: random variable $U$. We applied the Kolmogorov-Smirnov test to $U$,

1895: comparing it with an exponential distribution with $\lambda=1$, based on Lemma

1896: \ref{lem:udistrib}. The outcome of the test is $D_{n}=0.080$,

1897: which is better than either homogeneous model, but the null hypothesis is still rejected

1898: at any reasonable level of significance

1899: (recall that for $\alpha=0.005$, the rejection threshold is again $0.0718$ for $n=580$).

1900:

1901: Finally, we evaluated a compound version of the RPC model, combining

1902: successive postings separated by less than one minute.  We kept the

1903: same segmentation as in Table \ref{tab:rpclambdas}, but recalculated the

1904: arrival intensities in each segment, as shown in

1905: Table~\ref{tab:compoundrpclambdas}.

1906:

1907: \begin{table}[tbp] \centering

1908: \begin{tabular}[c]{|l|l|l|l|}

1909: \hline

1910: & \textbf{Workdays} & \textbf{Saturday} & \textbf{Sunday}\\\hline\hline

1911: $\lbrack0\text{:}00,3\text{:}00)$ & \multicolumn{1}{|c|}{$2.40$} &

1912: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\cline{1-2}%

1913: $\lbrack3\text{:}00,6\text{:}00)$ & \multicolumn{1}{|c|}{$5.96$} &

1914: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\cline{1-2}%

1915: $\lbrack6\text{:}00,9\text{:}00)$ & \multicolumn{1}{|c|}{$5.59$} &

1916: \multicolumn{1}{|c|}{$$} & \multicolumn{1}{|c|}{$$}\\\cline{1-2}%

1917: & &

1918: \multicolumn{1}{|c|}{$1.45$} & \multicolumn{1}{|c|}{$1.15$}\\

1919: $\lbrack9\text{:}00,18\text{:}00)$ & \multicolumn{1}{|c|}{$7.11$} &

1920: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\

1921: & & \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\cline{1-2}%

1922: $\lbrack18\text{:}00,21\text{:}00)$ & \multicolumn{1}{|c|}{$3.03$} &

1923: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\cline{1-2}

1924: $\lbrack21\text{:}00,24\text{:}00)$ & \multicolumn{1}{|c|}{$2.33$} &

1925: \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{}\\\hline

1926: \end{tabular}

1927: \caption{Average $\lambda$ levels for the compound RPC Poisson model.}

1928: \label{tab:compoundrpclambdas}

1929: \end{table}

1930:

1931: Next we recalculated the sample of the random variable $U$ for the

1932: compound RPC Poisson model, and applied the Kolmogorov-Smirnov

1933: test. In this case, we have $D_{n}=0.050$, so the null hypothesis cannot be rejected at

1934: any reasonable significance level up to $\alpha=0.10$ (for $\alpha=0.10$, the

1935: rejection threshold is $0.0517$ for $n=557$).  Figure

1936: \ref{fig:ksrpc}(b)

1937: shows the

1938: theoretical and empirical distributions of $U$ in this case.

1939:

1940: As a final confirmation of the applicability of the compound RPC

1941: Poisson model, we attempted to validate the assumption that the number

1942: of postings in successive insertion events are independent and

1943: identically distributed (IID).  In the sample, 536 insertion events

1944: were of size 1, 19 were of size 2, and 2 were of size 3.  Thus, we

1945: approximate the random variable $\Delta^+_R$ as having a $536/557

1946: \approx .962$ probability of being 1, a $19/557 \approx .034$

1947: probability of being 2, and a $2/557 \approx .004$ probability of

1948: being 3.  Validating that the observed insertion batch sizes

1949: $\Delta^+_{R,i}$ appear to be independently drawn from this

1950: distribution is somewhat delicate, since they nearly always take the

1951: value 1.  To compensate, we performed our test on the

1952: \emph{runs} in the sample, that is, the number of consecutive insertion

1953: events of size 1 between insertions of size 2 or 3.  Our sample

1954: contains 21 runs, ranging in length from 0 to 112.  If the insertion

1955: batch sizes $\{\Delta^+_{R,i}\}$ are independent with the distribution

1956: $\Delta^+_R$, then the length of a run should be a geometric random

1957: variable with parameter $536/557\approx .962$.  We tested this

1958: hypothesis via a Kolmogorov-Smirnov test, as shown in

1959: Figure~\ref{fig:runs}.  The $D_n$ statistic is $0.207$, which is

1960: within the $\alpha=0.1$ acceptance level for a sample of size $n=21$

1961: (although the divergence of the theoretical and empirical curves in

1962: Figure~\ref{fig:runs} is more visually pronounced than in the prior figures, it

1963: should be remembered that the sample is far smaller).

1964: Thus, the assumption that the insertion batch sizes

1965: $\{\Delta^+_{R,i}\}$ are IID is plausible.

1966:
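As a rough plausibility check on the geometric model for run lengths, let $p=536/557$ and let $K$ denote the length of a run (both symbols are shorthand introduced here). Under the IID hypothesis, with $K$ counting the size-1 events that precede an event of size 2 or 3,
\[
\probop\!\left\{K=k\right\}=p^{k}(1-p),\quad k=0,1,2,\ldots,\qquad
\expecop\!\left[K\right]=\frac{p}{1-p}=\frac{536}{21}\approx25.5 .
\]
The large-sample Kolmogorov-Smirnov threshold at $\alpha=0.1$ for $n=21$ would be $1.22/\sqrt{21}\approx0.266$, which exceeds the observed $D_{n}=0.207$; for a sample this small the asymptotic factor is only approximate, so this figure should be read as indicative rather than exact.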

1967: \begin{figure}[tb]

1968: \begin{center}

1969: \epsfig{file=runs.eps}

1970: \caption{Empirical and theoretical distributions for number of

1971: single arrivals between multiple arrivals, compound RPC Poisson model.}

1972: \label{fig:runs}

1973: \end{center}

1974: \end{figure}

1975:

1976: \begin{table}[tbp] \centering

1977: \begin{tabular}[c]{|l|c|c|}\hline

1978: \textbf{Model} & $D_{n}$ & \textbf{Rejection level}\\\hline\hline

1979: Homogeneous & {$0.106$} & {$<0.005$}\\\hline

1980: Homogeneous+compound & {$0.094$} &

1981: {$<0.005$}\\\hline

1982: RPC & {$0.080$} & {$<0.005$}\\\hline

1983: RPC+compound & {$0.050$} & {$>0.100$}\\\hline

1984: \end{tabular}

1985: \caption{Goodness of fit of the four models.}

1986: \label{tab:goodfit}

1987: \end{table}

1988:

1989: Table \ref{tab:goodfit} compares the goodness-of-fit of

1990: the four models to the test data. For each of the models, we have specified

1991: the KS test result ($D_{n}$) and the level at which one can reject the null

1992: hypothesis. The higher the significance level at which the null hypothesis still cannot be rejected, the better the fit. The

1993: compound RPC Poisson model fits the data set best: the null

1994: hypothesis cannot be rejected at any level up to $0.1$ (which in practice means that the model

1995: fits the data well). The main conclusion from these experiments is that

1996: the simple model of homogeneous Poisson process is limited to the modeling of

1997: a restricted class of applications (one of which was suggested in

1998: \cite{CHO00}). Therefore, there is a need for a more elaborate model, as

1999: suggested in this paper, to capture a broader range of update behaviors. A

2000: nonhomogeneous model consisting of just 8 segments per week,

2001: as we have constructed, seems to model the arrivals significantly

2002: better than the homogeneous approach.

2003:

2004: \section{Content evolution cost model}

2005:

2006: \label{costmodel} We now develop a cost model suitable for

2007: transcription-scheduling applications such as those described in

2008: Example~\ref{ex:replica}. The question is how often to generate a remote

2009: replica of a relation $R$. We have suggested one such policy in Example

2010: \ref{ex:fat}. In this section, we shall introduce two more policies and

2011: show an empirical comparison based on the data introduced in Section

2012: \ref{sec:verify}.

2013:

2014: A transcription policy aims to minimize the combined cost of

2015: \emph{transcription cost} and \emph{obsolescence cost}~\cite{GAL99c}. The

2016: former includes the cost of connecting to a network and the cost of

2017: transcribing the data, and may depend on the time at which the transcription

2018: is performed (\emph{e.g.}, as a function of network congestion), and the

2019: length of connection needed to perform the transcription. The obsolescence

2020: cost captures the cost of using obsolescent data, and is

2021: basically a function of the

2022: amount of time that has passed since the last transcription.

2023:

2024: In what follows, let the set $\{b_{i},e_{i}\}_{i=1}^{\infty}$ represent an

2025: infinite sequence of connectivity periods between a client and a server.

2026: During session $i$, the client data is synchronized with the state of the

2027: server at time $b_{i}$, the information becoming available at the client at

2028: time $e_{i}$. At the next session, beginning at time $b_{i+1}$, the client is

2029: updated with all the information arriving at the server during the interval

2030: $(b_{i},b_{i+1}]$, which becomes usable at time $e_{i+1}$, and so forth. We

2031: define $b_{0}=e_{0}=0$, and require that $0<b_{1}\leq e_{1}<b_{2}\leq

2032: e_{2}<\ldots$.

2033:

2034: Let $C_{R,\text{u}}(s,f)$ denote the cost of performing a transcription of $R$

2035: starting at time $f$, given that the last update was started at time $s$. Let

2036: $C_{R,\text{o}}(s,f)$, to be described in more detail later, denote the

2037: obsolescence cost through time $f$ attributable to tuples inserted into $R$ at

2038: the server during the time interval $(s,f]$. Then the total cost $C_{R}(t)$

2039: through time $t$ is

2040: \begin{equation}

2041: C_{R}(t)=\sum_{i:b_{i}\leq t}\!\!

2042: \Big(

2043: \alpha C_{R,\text{u}}(b_{i-1},b_{i})

2044: +(1-\alpha)C_{R,\text{o}}(b_{i-1},b_{i})

2045: \Big) + (1-\alpha)C_{R,\text{o}}(b_{i^*(t)},t),

2046: \label{costformula}

2047: \end{equation}

2048: where $i^*(t)=\max\left\{i\;\big|\;b_{i}\leq t\right\}$ and

2049: $\alpha$ serves as the ratio of importance a user puts on the

2050: transcription cost versus the obsolescence cost. Traditionally, $\alpha=0$,

2051: and therefore $C_{R}(t)$ is minimized for $C_{R,\text{o}}(b_{i-1},b_{i})=0$,

2052: $\forall b_{i}<t$, allowing the use of current data only. In this section we

2053: shall look into another, more realistic approach, where data currency is

2054: sacrificed (up to a level defined by the user through $\alpha$) for the sake

2055: of reducing the transcription cost. Ideally, one would want to choose the

2056: sequence $\{b_{i},e_{i}\}_{i=1}^{\infty}$ of connectivity periods, subject to

2057: any constraints on their durations $e_{i}-b_{i}$, to minimize $C_{R}(t)$ over

2058: some time horizon $t$. One may also consider the asymptotic problem of

2059: minimizing the average cost over time, $\lim_{t\rightarrow\infty}C_{R}(t)/t$.

2060: We note that the presence of $\alpha$ is not strictly required, as its effects

2061: could be subsumed into the definitions of the $C_{R,\text{u}}$ and

2062: $C_{R,\text{o}}$ functions, especially if both are expressed in natural

2063: monetary units. However, we retain $\alpha$ in order to demonstrate some of

2064: the parametric properties of our model.

2065:

2066: In general, modeling transcription and obsolescence costs may be difficult and

2067: application-dependent. They may be difficult to quantify and difficult to

2068: convert to a common set of units, such as dollars or seconds. Some subjective

2069: estimation may be needed, especially for the obsolescence costs. However, we

2070: maintain that, rather than avoiding the subject altogether, it is best to try to

2071: construct these cost models and then use them, perhaps parametrically, to

2072: evaluate transcription policies. Any transcription policy implicitly makes

2073: some trade-off between consuming network resources and incurring

2074: obsolescence, so it is

2075: best to try to quantify the trade-off and see if a better policy exists. In

2076: particular, one should try to avoid policies that are clearly \emph{dominated}%

2077: , meaning that there is another policy with the same or lower transcription

2078: cost, and strictly lower obsolescence, or \emph{vice versa}. Below, for

2079: purposes of illustration, we will give one simple, plausible way in which the

2080: cost functions may be constructed; alternatives are left to future research.

2081:

2082: \subsection{Transcription costing example}

2083: In determining the transcription cost, one may use existing research into

2084: costs of distributed query execution strategies. Typically (\emph{e.g.},

2085: \cite{LOHMAN85}), the transcription time can be computed as some function of

2086: the CPU and I/O time for writing the new tuples onto the client and the cost

2087: of transmitting the tuples over a network. There is also some fixed setup time

2088: to establish the connection, which can be substantial. For purposes of

2089: example, suppose that

2090: \begin{align*}

2091: C_{R,\text{u}}(s,f)  &  =c+\beta\cdot\left(  X_{R}(s,f)+Y_{R}^{+}(s,f)+\left|

2092: R(s)\right|  -Y_{R}(s,f)\right) \\

2093: &  = c+\beta\cdot\left(  X_{R}(s,f)+\left|  R(s)\right|  -Y_{R}^{-}%

2094: (s,f)\right).

2095: \end{align*}

2096: Here, $c\geq0$ denotes the fixed setup cost, $\beta\geq0$ is the per-tuple cost factor, $X_{R}(s,f)$

2097: denotes the number of tuples inserted during the interval $(s,f]$ that survive

2098: through time $f$, $Y_{R}^{+}(s,f)$ is the number of tuples that

2099: survive but are

2100: modified by time $f$, and $\left|  R(s)\right|  -Y_{R}(s,f)$ is the number of

2101: deleted tuples. For the latter, it may suffice to transmit only the

2102: primary key of each deleted tuple, incurring a unit cost of less than

2103: $\beta$. For sake of simplicity, however, we use the same cost factor

2104: $\beta$ for deletion, insertion, and modification. We note that, under

2105: this assumption,

2106: \[

2107: \sum_{i:b_{i}\leq t}C_{R,\text{u}}(b_{i-1},b_{i})=n(t)c+\beta\left|

2108: R(s)\right|  +\beta\sum_{i:b_{i}\leq t}\left(  X_{R}(b_{i-1},b_{i})-Y_{R}%

2109: ^{-}(b_{i-1},b_{i})\right)  ,

2110: \]

2111: where $n(t)$ is the number of transcriptions in the interval $[0,t]$. For the

2112: special case that there are no deletions or modifications,

2113: $\beta\left|  R(s)\right|

2114: +\beta\left(  X_{R}(s,f)-Y_{R}^{-}(s,f)\right)  =\beta B(s,f)$ and

2115: \[

2116: \sum_{i:b_{i}\leq t}C_{R,\text{u}}(b_{i-1},b_{i})=n(t)c+\beta B(0,b_{i^{\ast

2117: }(t)}).

2118: \]

2119: For large $t$, one would expect the $\beta B(0,b_{i^{\ast}(t)})$ term to be

2120: roughly comparable across most reasonable policies, whereas the $n(t)c$ term

2121: may vary widely for any value of $t$. It is worth noting that $c$ and $\beta$

2122: could be generalized to vary with time or other factors.

2123: For example, due to network congestion, certain

2124: times of day may have higher unit transcription costs than others.

2125: Also, transcribing via airline-seat telephone costs substantially more than

2126: connecting via a cellular phone. For simplicity, we have refrained from

2127: discussing such variations in the transcription cost.

2128:

2129: \subsection{Obsolescence costing example}

2130: We next turn our attention to the obsolescence cost, which is clearly a

2131: function of the update time of tuples and the time they were transcribed to

2132: the client. Intuitively, the shorter the time between the update of a tuple

2133: and its transcription to the client, the better off the client would be. As a

2134: basis for the obsolescence cost, we suggest a criterion that takes into

2135: account user preferences, as well as the content evolution parameters. For any

2136: relation $R$, times $s<f$, and tuple $r\in R(s)\cup R(f)$, let $b(r)$ and

2137: $d(r)$ denote the times at which $r$ was inserted into and deleted from $R$,

2138: respectively. We let $\iota_{r}(s,f)$ be some function denoting the

2139: contribution of tuple $r$ to the obsolescence cost over $(s,f]$; we will give

2140: some more specific example forms of this function later. We then make the

2141: following definition:

2142:

2143: \begin{definition}

2144: The total \emph{obsolescence cost} of a relation $R$ over the time interval

2145: $(s,f]$ (denoted \emph{$C_{R,\text{o}}(s,f)$}) is defined to be

2146: \emph{$C_{R,\text{o}}(s,f)\triangleq\sum_{r\in R(s)\cup R(f)}\iota

2147: _{r}(s,f)\!\!.$}\hspace*{\fill}$\Box$

2148: \end{definition}

2149:

2150: Our principal concern is with the \emph{expected}

2151: obsolescence cost, that is, the

2152: expected value of $C_{R,\text{o}}(s,f)$,

2153: \[

2154: \expecop\!\left[  C_{R,\text{o}}(s,f)\right]  = \expecop\!\left[

2155: \sum_{r\in R(s)\cup R(f)}\!\!\!\!\!\!\!\!\iota_{r}(s,f)\right]  .

2156: \]

2157: To compute $\expecop\!\left[  C_{R,\text{o}}(s,f)\right]  $, we note that

2158: \[

2159: \expecop\!\left[  C_{R,\text{o}}(s,f)\right]  =

2160: \expecop\!\left[  \sum_{r\in R(s)\cap R(f)}\!\!\!\!\!\!\!\!

2161: \iota_{r}(s,f)\right]

2162: +\expecop\!\left[  {\sum_{r\in R(s)\backslash R(f)}\!\!\!\!\!\!\!\!

2163: \iota_{r}(s,f)}\right]

2164: +\expecop\!\left[{\sum_{r\in R(f)\backslash R(s)}\!\!\!\!\!\!\!\!

2165: \iota_{r}(s,f)}\right]  .

2166: \]

2167: The three terms in the last expression represent potentially modified tuples,

2168: deleted tuples, and inserted tuples, respectively. We denote these three terms

2169: by $\hat{\iota}_{R}^{\modification}(s,f)$, $\hat{\iota}_{R}^{\deletion

2170: }(s,f)$, and $\hat{\iota}_{R}^{\medspace\insertion}(s,f)$, respectively,

2171: whence

2172: \[

2173: \expecop\!\left[  C_{R,\text{o}}(s,f)\right]  =\hat{\iota}_{R}^{\modification

2174: }(s,f)+\hat{\iota}_{R}^{\deletion

2175: }(s,f)+\hat{\iota}_{R}^{\medspace\insertion}(s,f).

2176: \]

2177:

2178: \subsection{Obsolescence for insertions}

2179: \label{sec:insertobs}

2180: We will now consider a specific

2181: metric for computing the obsolescence stemming from insertions in $(s,f]$, as

2182: follows:

2183: \begin{equation}

2184: \iota_{r}^{\insertion}(s,f)=\left\{

2185: \begin{array}

2186: [c]{ll}%

2187: g^{\insertion}(s,f,b(r)) & s<b(r)\leq f<d(r)\\

2188: 0 & \text{otherwise},%

2189: \end{array}

2190: \right.  \label{iota}%

2191: \end{equation}

2192: where $g^{\insertion}(s,f,t)$ is some application-dependent function representing the

2193: level of importance a user assigns, over the interval $(s,f]$,

2194: to a tuple arriving at a time $t$. For

2195: example, in an e-mail transcription application, a user may attach greater

2196: importance to messages arriving during official work hours, and a lesser

2197: measure of importance to non-work hours (since no one expects her to be

2198: available at those times). Thus, one might define

2199: \begin{equation}

2200: g^{\insertion}(s,f,t)=\int_{t}^{f}a(\tau)\hspace{0.15em}d\tau,\quad\text{where }%

2201: a(\tau)=\left\{

2202: \begin{array}

2203: [c]{ll}%

2204: a_{1}, & \text{if }\tau\text{ is during work hours}\\

2205: a_{2}, & \text{if }\tau\text{ is after hours,}%

2206: \end{array}

2207: \right.  \label{importanceformula}%

2208: \end{equation}

2209: and $a_{1}\geq a_{2}$. For $a_{1}=a_{2}=1$, $g^{\insertion}(s,f,t)$

2210: takes a form resembling the age of a local element in~\cite{CHO00}.

2211: More complex forms of $g^{\insertion}(s,f,t)$ are certainly possible.  In this

2212: simple case, we refer to $a_1/a_2$ as the \emph{preference ratio}.

2213:

2214: Using the properties of nonhomogeneous Poisson processes, we calculate

2215: \begin{align*}

2216: \hat{\iota}_{R}^{\medspace\insertion}(s,f) &  =\expecop\!\left[  {\sum_{r\in

2217: R(f)\backslash R(s)}\!\!\!\!\!\!\!\!}\iota_{r}(s,f)\right]  \\

2218: &  =\expecop\!\left[  {X}_{R}{(s,f)}\right]  \cdot\expecop\!\left[

2219: {g^{\insertion}(s,f,b(r))\;\big|\;s<b(r)\leq f<d(r)}\right]  \\

2220: &  =\widetilde{\Lambda}_{R}(s,f)\expecop\!\left[  {\Delta_{R}^{+}}\right]

2221: \int_{s}^{f}\frac{\lambda_{R}(t^{\prime})}{\widetilde{\Lambda}_{R}%

2222: (s,f)}g^{\insertion}(s,f,t^{\prime})dt^{\prime}\\

2223: &  =\expecop\!\left[  {\Delta}_{R}^{+}\right]  \int_{s}^{f}\lambda

2224: _{R}(t^{\prime})g^{\insertion}(s,f,t^{\prime})dt^{\prime}.

2225: \end{align*}%

2226:
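To illustrate the last expression, consider the following sketch, whose
assumptions are made only for the sake of a closed form: take
$g^{\insertion}(s,f,t)=f-t$ (that is, $a_{1}=a_{2}=1$), let
$\expecop\!\left[{\Delta}_{R}^{+}\right]=1$, and let the arrival intensity be
piecewise constant with a single hypothetical breakpoint $m\in(s,f)$, so that
$\lambda_{R}(t)=\lambda_{1}$ for $s<t\leq m$ and $\lambda_{R}(t)=\lambda_{2}$
for $m<t\leq f$. Then
\begin{align*}
\hat{\iota}_{R}^{\medspace\insertion}(s,f)
&  =\lambda_{1}\int_{s}^{m}(f-t)\,dt+\lambda_{2}\int_{m}^{f}(f-t)\,dt\\
&  =\frac{\lambda_{1}}{2}\left(  (f-s)^{2}-(f-m)^{2}\right)
+\frac{\lambda_{2}}{2}(f-m)^{2}.
\end{align*}
Setting $\lambda_{1}=\lambda_{2}=\lambda_{R}$ collapses this to
$\frac{1}{2}\lambda_{R}(f-s)^{2}$, the value for a homogeneous process.
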

2227: \begin{example}

2228: [Transcription policies using the expected obsolescence cost]

2229: Consider the insertion-only data set of Section~\ref{sec:verify}.

2230: Figure

2231: \ref{fig:ttimes} compares two transcription

2232: policies for the week

2233: $[2001/4/2\mathrm{:}0\mathrm{:}00,2001/4/8\mathrm{:}0\mathrm{:}00)$. The

2234: transcription policy in Figure \ref{fig:ttimes}(a)

2235: (referred to below as the \emph{uniform synchronization

2236: point} --- USP --- policy) was suggested in

2237: \cite{CHO00}. According to this policy, the intervals $(s,f]$ are always of

2238: the same size. The decision regarding the interval size $f-s$ may be either

2239: arbitrary (\emph{e.g.}, once a day) or may depend on $\lambda$, the Poisson

2240: model parameter (in which case a homogeneous Poisson process is implicitly

2241: assumed). The policy may be expressed as $f=s+{M}/{\lambda}$ for some

2242: multiplier $M>0$. With $M=1$ (as suggested in

2243: \cite{CHO00}) and $\lambda=4.57$ per day, as computed from the training data,

2244: one would refresh the database every 5:15:19.

2245: Figure \ref{fig:ttimes}(a) shows

2246: the transcription times resulting from the

2247: USP policy.

2248:

2249: \begin{figure}[tbp]

2250: \begin{center}

2251: \epsfig{file=ttimes.eps,width=6.5in}

2252: \caption{Transcription times for the USP and RPC/threshold policies.}%

2253: \label{fig:ttimes}%

2254: \end{center}

2255: \end{figure}

2256:

2257: Consider now another transcription policy, dubbed the

2258: \emph{threshold} policy. With this policy, given that the last connection

2259: started at time $s$, we transcribe at time $f$ if the expected obsolescence

2260: cost from insertions ($\hat{\iota}_{R}^{\medspace\insertion}(s,f)$) exceeds

2261: $\Pi$, where $\Pi$ is a threshold that measures the user's tolerance to

2262: obsolescent data. In comparing the two policies, one can compute $\Pi$, given

2263: $M$, as follows. Consider the homogeneous case where $\expecop\!\left[

2264: {\Delta}_{R}^{+}\right]  =1$ and $\lambda_{R}(t)=\lambda_{R}$ for all $t$.

2265: Assume further that $a_{1}=a_{2}=1$ for all $t$. In this case,

2266: \[

2267: \hat{\iota}_{R}^{\medspace\insertion}(s,f)=\int_{s}^{f}\lambda_{R}%

2268: \cdot(f-t)dt=\lambda_{R}\int_{s}^{f}(f-t)dt=\frac{1}{2}\lambda

2269: _{R}\cdot(f-s)^{2}.%

2270: \]

2271: Setting $f=s+M/\lambda_{R}$ and $\Pi=\hat{\iota}_{R}^{\medspace\insertion

2272: }(s,f)$, one has that

2273: \[

2274: \Pi=({1}/{2})\lambda_{R}\cdot(f-s)^{2}=({1}/{2})\lambda_{R}\cdot\left(

2275: {M}/{\lambda}_{R}\right)  ^{2}={M^{2}}/{(2\lambda_{R})}.

2276: \]

2277: Figure \ref{fig:ttimes}(b) shows the

2278: transcription times using the RPC arrival model (see Section

2279: \ref{sec:fithomo}) and the threshold policy with

2280: $\Pi=0.109$ (obtained by setting $M=1$ and $\lambda=4.57$ per day, and letting

2281: $\Pi={M^{2}}/{(2\lambda)}$). It is worth noting that transcriptions are more

2282: frequent when the arrival intensity $\lambda$ is higher and less frequent when the

2283: arrival rate is expected to be lower.

2284:
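As a quick check of this calibration, the homogeneous relation between $\Pi$
and the interval length can be inverted to give the interval implied by a
given threshold,
\[
f-s=\sqrt{2\Pi/\lambda_{R}}.
\]
With $M=1$ and $\lambda=4.57$ per day, $\Pi=1/(2\times4.57)\approx0.1094$ and
$\sqrt{2\Pi/\lambda}=1/\lambda\approx0.219$ days, so under the homogeneous
model the threshold policy reproduces the USP interval, as intended. Under a
nonhomogeneous arrival model, the same $\Pi$ instead yields intervals whose
length varies with the intensity.
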

2285: \begin{figure}[tbp]

2286: \begin{center}

2287: \epsfig{file=thpolicy.eps,width=6.5in}

2288: \caption{Threshold policy, homogeneous vs. RPC.}%

2289: \label{fig:thpolicy}%

2290: \end{center}

2291: \end{figure}

2292:

2293: We have performed experiments comparing the performance of the threshold

2294: policy for the homogeneous Poisson model (equivalent to the USP

2295: policy) and the

2296: RPC Poisson model. Figure \ref{fig:thpolicy} shows

2297: representative results, with costs computed over the testing set.

2298: Figure \ref{fig:thpolicy}(a)

2299: displays the obsolescence cost and the number of transcriptions for various

2300: $M$ values, with a preference ratio $a_1/a_2=4$. For all $M$ values, there is

2301: no dominant model. For example, for $M=1$, the RPC model has a slightly higher

2302: obsolescence cost (43.02 versus 42.35, a 1.6\% increase)

2303: and a significantly lower

2304: number of transcriptions (137 versus 204, a 32.8\% decrease).

2305:

2306: Figure \ref{fig:thpolicy}(b) provides a comparison of

2307: combined normalized obsolescence and transcription costs for both insertion

2308: models and $\alpha\in\{0.6,0.7,0.8\}$ (still assuming a 4:1 preference ratio).

2309: Solid lines represent results for the homogeneous Poisson model,

2310: while dotted lines represent results for the RPC Poisson model.

2311: Generally speaking, the RPC model performs better for small $M$ values

2312: ($M\leq 7$),

2313: while the homogeneous model performs better for the largest $M$ values

2314: ($M \geq 8$). \hspace*{\fill}$\Box$

2315: \end{example}

2316:

2317: \begin{example}

2318: [Comparison of USP, threshold and FA policies]

2319: Once again with the data from Section~\ref{sec:verify},

2320: we consider one more transcription policy, the first alteration (FA) policy

2321: derived from the analysis

2322: of Example \ref{ex:fat}.  Since there are no deletions or

2323: modifications, $Z_R(s,f)$ simplifies to $\Lambda_R(s,f)$.  We choose

2324: $\pi$ in the FA policy

2325: to be a function of $M$ such that the transcription intervals

2326: agree with the USP policy in the case of the homogeneous model.

2327: Figure \ref{fig:3policies}

2328: compares the performance of all three transcription policies:

2329: USP, threshold, and FA,

2330: for a 4:1 preference ratio and

2331: $\alpha=0.8$, using the testing data set to compute the costs.

2332: For $M=1$, the threshold and FA policies perform similarly, with the FA policy slightly ahead; both outperform the USP policy.

2333: The threshold policy is best for $M\in\{2,\dots,8\}$. For all $M>8$, the USP

2334: policy is best. The best policy for this choice of $a_{1}/a_{2}$ and $\alpha$

2335: is threshold with $M=6$, followed closely by FA with $M\in\{5,6\}$. We have

2336: conducted experiments with various $\alpha$ values, and our conclusion is

2337: that the threshold policy is preferred over the USP policy for larger $\alpha$, that is, the more the user is willing to sacrifice currency for the sake of reducing transcription cost.\hspace*{\fill}$\Box$

2338: \end{example}

2339:

2340: \begin{figure}[tbp]

2341: \begin{center}

2342: \epsfig{file=fapolicy.eps,width=5.4in,height=2.9in}

2343: \caption{Transcription schedule based on the first alteration policy.}

2344: \label{fig:fapolicy}

2345: \end{center}

2346: \end{figure}

2347:

2348: \begin{figure}[tbp]

2349: \begin{center}

2350: \epsfig{file=3policies.eps,width=5.4in,height=2.9in}

2351: \caption{Comparison of three policies, for a 4:1 preference ratio and

2352: $\alpha=0.5$.}

2353: \label{fig:3policies}

2354: \end{center}

2355: \end{figure}

2356:

2357:

2358: \subsection{Obsolescence for deletions}

2359: In a similar manner to Section \ref{sec:insertobs},

2360: we will consider the following

2361: metric for computing the obsolescence stemming from deletions in $(s,f]$. We

2362: compute $\iota_{r}^{\deletion}(s,f)$ via

2363: \begin{equation}

2364: \iota_{r}^{\deletion}(s,f)=\left\{

2365: \begin{array}

2366: [c]{ll}%

2367: g^{\deletion}(s,f,d(r)) & b(r)\leq s<d(r)\leq f\\

2368: 0 & \text{otherwise},%

2369: \end{array}

2370: \right.

2371: \end{equation}

2372: where $g^{\deletion}(s,f,t)$ is some application-dependent

2373: function,  possibly similar to $g^{\insertion}(s,f,t)$ above.

2374:

2375:

2376: Using the properties of nonhomogeneous Poisson processes, we calculate

2377: \begin{align*}

2378: \hat{\iota}_{R}^{\deletion}(s,f)  &  =\expecop\!\left[  {\sum_{r\in

2379: R(s)\backslash R(f)}\!\!\!\!\!\!\!\!}\iota_{r}(s,f)\right] \\

2380: &  =\left(  \left|  R(s)\right|  -\expecop\!\left[  {Y}_{R}{(s,f)}\right]

2381: \right)  \expecop\!\left[  g^{\deletion}(s,f,d(r)){\;\big|\;b(r)\leq s<d(r)\leq

2382: f}\right] \\

2383: &  =\left(  \left|  R(s)\right|  -p_{R}(s,f)\left|  {R(s)}\right|  \right)

2384: \expecop\!\left[  g^{\deletion}(s,f,d(r)){\;\big|\;b(r)\leq s<d(r)\leq f}\right] \\

2385: &  =\left|  R(s)\right|  \left(  1-p_{R}(s,f)\right)

2386: \expecop\!\left[  g^{\deletion}(s,f,d(r)){\;\big|\;b(r)\leq s<d(r)\leq f}\right]

2387: \end{align*}

2388: In the case that $\langle R,S\rangle$ has fixed multiplicity for all $S\in S(R)$,

2389: $p_{R}(s,f)=\exp(-\widetilde{M}_{R}(s,f))$, where $\widetilde{M}_{R}%

2390: (s,f)=\int_{s}^{f}\tilde{\mu}_{R}(t)\hspace{0.15em}dt$ and $\tilde{\mu}%

2391: _{R}(t)=\!\!\sum_{S\in S(R)}w(R,S)\mu_{S}(t)$. Therefore,%

2392: \begin{align*}

2393: \hat{\iota}_{R}^{\deletion}(s,f)  &  =\left|  R(s)\right|  \left(

2394: 1-\exp(-\widetilde{M}_{R}(s,f))\right)  \int_{s}^{f}\frac{\tilde{\mu}%

2395: _{R}(t^{\prime})}{\widetilde{M}_{R}(s,f)}g^{\deletion}(s,f,t^{\prime})dt^{\prime}\\

2396: &  =\left|  R(s)\right|  \left(  \frac{1-\exp(-\widetilde{M}_{R}%

2397: (s,f))}{\widetilde{M}_{R}(s,f)}\right)  \int_{s}^{f}\tilde{\mu}%

2398: _{R}(t^{\prime})\cdot g^{\deletion}(s,f,t^{\prime})dt^{\prime}.%

2399: \end{align*}

2400:
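For illustration only, suppose that in the fixed-multiplicity case the
combined deletion intensity is constant, $\tilde{\mu}_{R}(t)=\tilde{\mu}_{R}$,
and take $g^{\deletion}(s,f,t)=f-t$ by analogy with the insertion case. Both
choices are assumptions made here to obtain a closed form. Then
$\widetilde{M}_{R}(s,f)=\tilde{\mu}_{R}\cdot(f-s)$ and
\begin{align*}
\hat{\iota}_{R}^{\deletion}(s,f)
&  =\left|  R(s)\right|  \left(  \frac{1-\exp(-\tilde{\mu}_{R}\cdot(f-s))}
{\tilde{\mu}_{R}\cdot(f-s)}\right)  \cdot\frac{\tilde{\mu}_{R}\cdot(f-s)^{2}}{2}\\
&  =\frac{1}{2}\left|  R(s)\right|  (f-s)\left(  1-\exp(-\tilde{\mu}_{R}\cdot(f-s))\right)  .
\end{align*}
When $\tilde{\mu}_{R}\cdot(f-s)$ is small, this behaves like
$\frac{1}{2}\left|  R(s)\right|  \tilde{\mu}_{R}\cdot(f-s)^{2}$, that is, it
grows quadratically in the interval length, as in the homogeneous insertion
case.
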

2401: \subsection{Obsolescence for modification}

2402: We now consider obsolescence costs relating to modifications.

2403: While, in some applications, a user may be

2404: primarily concerned with how many tuples were modified during $(s,f]$,

2405: we believe that a more general, attribute-based framework is warranted

2406: here, taking into account exactly how each tuple was changed.

2407: Therefore, we define $\iota_{r,A}(s,f)$ to be some function denoting the

2408: contribution of attribute $A\in\mathcal{A}(R)$ in tuple $r$ to the

2409: obsolescence cost over $(s,f]$ and assume that%

2410: \[

2411: \iota_{r}(s,f)=\!\!\!\sum_{A\in\mathcal{A}(R)}\!\!\!\iota_{r,A}(s,f)

2412: \]

2413: Therefore,

2414: \begin{align*}

2415: \hat{\iota}_{R}^{\modification}(s,f)  &

2416: =\expecop\!\left[  \sum_{r\in R(s)\cap R(f)}\!\!\!\!\!\iota_{r}(s,f)\right]  \\

2417: & =\expecop\!\left[  \sum_{A\in\mathcal{A}(R)}\;\sum_{r\in R(s)\cap R(f)}

2418: \!\!\!\!\!\!\!\iota_{r,A}(s,f)\right]  \\

2419: & =\sum_{A\in\mathcal{A}(R)}\!\!\!\hat{\iota}_{R,A}^{\modification}(s,f)

2420: \end{align*}

2421: where $\hat{\iota}_{R,A}^{\modification}(s,f)$

2422: is the expected obsolescence cost due to modifications to $A$ during

2423: $(s,f]$.  Assuming that attributes not in $\mathcal{C}(R)$ incur zero

2424: modification cost, the last sum may be taken over $\mathcal{C}(R)$

2425: instead of $\mathcal{A}(R)$.

2426:

2427: We begin by introducing the notion of a distance metric and then provide

2428: two models of $\iota_{r,A}(s,f)$, for numeric and non-numeric domains. We then

2429: provide an explicit description of $\hat{\iota}_{R,A}^{\modification

2430: }$, based on distance metrics.

2431:

2432: \subsubsection{General distance metrics}

2433: Let $c_{u,v}^{R,A}$, where $u,v\in\dom A$, denote the

2434: elements of a matrix of costs for an attribute $A$. We declare that if

2435: $r.A(s)=u$ and $r.A(f)=v$, then $\iota_{r,A}(s,f)=c_{u,v}^{R,A}$, or

2436: equivalently,

2437: \[

2438: \iota_{r,A}(s,f)=c_{r.A(s),r.A(f)}^{R,A}.

2439: \]

2440: Consequently, we require that $c_{u,u}^{R,A}=0$ for all $u\in\dom A$, so that

2441: an unchanged attribute field yields a cost of zero.

2442:

2443: \paragraph{A squared-error metric for numeric domains:}

2444: For numeric domains, that

2445: is, $A\in\mathcal{N}$, we propose a squared-error metric, as is standard in

2446: statistical regression models. In this case, we let

2447: \[

2448: \iota_{r,A}(s,f)=c_{r.A(s),r.A(f)}^{R,A}=k_{R,A}(s){\left(

2449: r.A(f)-r.A(s)\right)  }^{2},

2450: \]

2451: where $k_{R,A}(s)$ is a user-specified

2452: scaling factor. A typical choice for the scaling

2453: factor would be the reciprocal ${1}/\left(  {\varop_{r\in R(s)}\!\left[

2454: {r.A(s)}\right]  }\right)  $ of the

2455: variance of attribute $A$ in $R$ at time $s$,

2456: \begin{align*}

2457: \varop_{r\in R(s)}\!\left[  {r.A(s)}\right]   &  =\expecop_{r\in

2458: R(s)}\!\left[  {{\left(  r.A(s)-\expecop_{r\in R(s)}\!\left[  {r.A(s)}\right]

2459: \right)  }^{2}}\right] \\

2460: &  =\expecop_{r\in R(s)}\!\left[  {{r.A(s)}^{2}}\right]  -{\expecop_{r\in

2461: R(s)}\!\left[  {r.A(s)}\right]  }^{2}\\

2462: &  =\frac{1}{\left|  {R(s)}\right|  }\left(  \,\sum_{v\in\dom A}%

2463: \!\!\!v^{2}\hat{R}_{A,v}(s)\right)  -{\left(  \frac{1}{\left|  {R(s)}\right|

2464: }\sum_{v\in\dom A}\!\!\!v\hat{R}_{A,v}(s)\right)  }^{2}.

2465: \end{align*}

2466: Other choices for the scaling factor $k_{R,A}(s)$ are also possible. In any

2467: case, we may calculate the expected alteration cost for attribute $A$ in tuple

2468: $r$ via

2469: \begin{align}

2470: \expecop\!\left[  \iota_{r,A}(s,f)\right]   &  =\expecop\!\left[

2471: {k_{R,A}(s){\left(  r.A(f)-r.A(s)\right)  }^{2}}\right] \nonumber\\

2472: &  =k_{R,A}(s)\expecop\!\left[  {{r.A(f)}^{2}-2\,r.A(f)r.A(s)+{r.A(s)}^{2}%

2473: }\right] \nonumber\\

2474: &  =k_{R,A}(s)\left(  \expecop\!\left[  {{r.A(f)}^{2}}\right]

2475: -2\,r.A(s)\expecop\!\left[  {r.A(f)}\right]  +{r.A(s)}^{2}\right)  .

2476: \label{eq:gennumexpec}%

2477: \end{align}

2478:
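As a small numerical illustration of the variance-based scaling factor (the
values are hypothetical), suppose $R(s)$ holds three tuples whose $A$ values
are $10$, $10$, and $20$, so that $\hat{R}_{A,10}(s)=2$ and
$\hat{R}_{A,20}(s)=1$. Then
\[
\varop_{r\in R(s)}\!\left[  {r.A(s)}\right]  =\frac{1}{3}\left(  2\cdot
10^{2}+1\cdot20^{2}\right)  -{\left(  \frac{1}{3}\left(  2\cdot10+1\cdot
20\right)  \right)  }^{2}=200-\frac{1600}{9}=\frac{200}{9}\approx22.2,
\]
and hence $k_{R,A}(s)=9/200=0.045$. A tuple whose value moves from $10$ to
$20$ over $(s,f]$ would then contribute
$\iota_{r,A}(s,f)=0.045\cdot(20-10)^{2}=4.5$.
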

2479: \paragraph{A general metric for non-numeric domains:}

2480: \label{nonnummetric}For non-numeric domains, it may not be possible or

2481: meaningful to compute the difference of $r.A(s)$ and $r.A(f)$. In such cases,

2482: we shall use a general cost matrix ${[}${$c_{u,v}^{R,A}$}${]}_{u,v\in\dom A}$

2483: and compute

2484: \begin{align*}

2485: \expecop\!\left[  \iota_{r,A}(s,f)\right]   &  =\sum_{v\in\dom A}\!\!\left(

2486: P_{r.A(s),v}^{R,A}(s,f)\right)  \left(  c_{r.A(s),v}^{R,A}\right) \\

2487: &  =\sum_{

2488: \genfrac{}{}{0pt}{1}{v\in\dom A}{v\neq r.A(s)}%

2489: }\!\!\left(  P_{r.A(s),v}^{R,A}(s,f)\right)  \left(  c_{r.A(s),v}%

2490: ^{R,A}\right)  .

2491: \end{align*}

2492:

2493: For domains that have no particular structure, a typical choice might be

2494: $c_{u,v}^{R,A}=1$ whenever $u\neq v$. In this case, the expected cost

2495: calculation simplifies to

2496: \begin{align*}

2497: \expecop\!\left[  \iota_{r,A}(s,f)\right]   &  =\probop\!\left\{  {r.A(f)\neq

2498: r.A(s)}\right\} \\

2499: &  =1-P_{r.A(s),r.A(s)}^{R,A}(s,f).

2500: \end{align*}

2501:
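For instance, under hypothetical values, let $\dom A=\{x,y,z\}$ and
$r.A(s)=x$, with transition probabilities $P_{x,x}^{R,A}(s,f)=0.7$,
$P_{x,y}^{R,A}(s,f)=0.2$, and $P_{x,z}^{R,A}(s,f)=0.1$, and with costs
$c_{x,y}^{R,A}=1$ and $c_{x,z}^{R,A}=3$. The general formula gives
\[
\expecop\!\left[  \iota_{r,A}(s,f)\right]  =0.2\cdot1+0.1\cdot3=0.5,
\]
while the unit-cost choice $c_{u,v}^{R,A}=1$ for $u\neq v$ gives
$\expecop\!\left[  \iota_{r,A}(s,f)\right]  =1-0.7=0.3$.
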

2502: We are now ready to consider the calculation of $\hat{\iota}_{R,A}%

2503: ^{\modification}(s,f)$.

2504:

2505: \subsubsection{The expected modification cost}

2506: We next consider computing the

2507: expected modification cost $\hat{\iota}_{R,A}^{\modification}(s,f)$. To do so,

2508: we partition the tuples $r$ in $R(s)\cap R(f)$ according to their initial

2509: value $r.A(s)$ of the attribute $A$. Consider the subset $R_{A,u}(s)\cap R(f)$

2510: of all $r\in R(s)\cap R(f)$ that have $r.A(s)=u$. Since all such tuples are

2511: indistinguishable from the point of view of the modification process for

2512: $(R,A)$, their $\iota_{r,A}(s,f)$ random variables will be identically

2513: distributed. The number of tuples $r\in R(s)$ with $r.A(s)=u$ is, by

2514: definition, $\hat{R}_{A,u}(s)$. The number $\left|  {R_{A,u}(s)\cap

2515: R(f)}\right|  $ that are also in $R(f)$ is a random variable whose

2516: expectation, by the independence of the deletion and modification processes,

2517: must be $p_{R}(s,f)\hat{R}_{A,u}(s)$. Using standard results for sums of

2518: random numbers of IID random variables, we conclude that

2519: \begin{align*}

2520: \hat{\iota}_{R,A}^{\modification}(s,f)  &  =\expecop\!\left[  {\sum_{r\in

2521: R(s)\cap R(f)}\!\!\!\!\!\!\!\!}\iota_{r,A}(s,f)\right] \\

2522: &  =\!\!\!\!\sum_{u\in\dom A}\!\!\!\left(  p_{R}(s,f)\hat{R}_{A,u}(s)\right)

2523: \expecop\!\left[  \iota_{r,A}(s,f){\;\big|\;r.A(s)\!=\!u}\right] \\

2524: &  =p_{R}(s,f)\!\!\!\!\!\!\sum_{%

2525: \genfrac{}{}{0pt}{1}{u\in\dom A}{\hat{R}_{A,u}(s)>0}%

2526: }\!\!\!\!\!\!\!\hat{R}_{A,u}(s)\expecop\!\left[  \iota_{r,A}(s,f){\;\big

2527: |\;r.A(s)\!=\!u}\right] \\

2528: &  =p_{R}(s,f)\!\!\!\!\!\!\!\sum_{%

2529: \genfrac{}{}{0pt}{1}{u\in\dom A}{\hat{R}_{A,u}(s)>0}%

2530: }\!\!\!\!\!\!\hat{R}_{A,u}(s)\hat{\iota}_{R,A,u}^{\modification}(s,f),

2531: \end{align*}

2532: where we define $\hat{\iota}_{R,A,u}^{\modification}(s,f)=\expecop\!\left[

2533: \iota_{r,A}(s,f){\;\big|\;r.A(s)\!=\!u}\right]  $. We now address the

2534: calculation of the $\hat{\iota}_{R,A,u}^{\modification}(s,f)$.

2535:

2536: For a non-numeric domain, we have from Section \ref{nonnummetric} that

2537: \[

2538: \hat{\iota}_{R,A,u}^{\modification}(s,f)=\!\!\!\sum_{v\in\dom A}%

2539: \!\!\!\!\!\left(  P_{u,v}^{R,A}(s,f)\right)  \left(  c_{u,v}^{R,A}\right)  ,

2540: \]

2541: and in the simple case of $c_{u,v}^{R,A}=1$ whenever $u\neq v$,

2542: \[

2543: \hat{\iota}_{R,A,u}^{\modification}(s,f)=1-P_{u,u}^{R,A}(s,f).

2544: \]

2545: In any case, $P_{u,v}^{R,A}(s,f)$ and $P_{u,u}^{R,A}(s,f)$ may be computed

2546: using the results of Section \ref{sec:modif}.

2547:
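To make the aggregation over initial values concrete, here is a sketch with
made-up numbers for a non-numeric attribute with unit costs. Suppose
$p_{R}(s,f)=0.9$ and that $R(s)$ contains $30$ tuples with $r.A(s)=u_{1}$ and
$70$ tuples with $r.A(s)=u_{2}$, with $P_{u_{1},u_{1}}^{R,A}(s,f)=0.8$ and
$P_{u_{2},u_{2}}^{R,A}(s,f)=0.95$. Then
$\hat{\iota}_{R,A,u_{1}}^{\modification}(s,f)=0.2$,
$\hat{\iota}_{R,A,u_{2}}^{\modification}(s,f)=0.05$, and
\[
\hat{\iota}_{R,A}^{\modification}(s,f)=0.9\left(  30\cdot0.2+70\cdot
0.05\right)  =0.9\cdot9.5=8.55.
\]
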

2548: For a numeric domain, we have from (\ref{eq:gennumexpec}) that

2549: \begin{align*}

2550: \hat{\iota}_{R,A,u}^{\modification}(s,f)  &  =k_{R,A}(s)\left(  \expecop

2551: \!\left[  {{\left(  r.A(f)\right)  }^{2}\;\big|\;r.A(s)\!=\!u}\right]

2552: -2\,u\expecop\!\left[  {r.A(f)\;\big|\;r.A(s)\!=\!u}\right]  +u^{2}\right) \\

2553: &  =k_{R,A}(s)\left(  \left(  \sum_{v\in\dom A}\!\!\!(v^{2}-2uv)P_{u,v}%

2554: ^{R,A}(s,f)\right)  +u^{2}\right)  .

2555: \end{align*}

2556:

2557: In cases where a random walk approximation applies, however, the situation

2558: simplifies considerably, as demonstrated in the following proposition.

2559:

2560: \begin{proposition}

2561: When a random walk model with mean $\delta$ and variance $\sigma^2$

2562: accurately describes

2563: modifications to a numeric attribute $A$, $\hat{\iota

2564: }_{R,A,u}^{\modification}(s,f)\approx

2565: k_{R,A}(s)\,\Gamma_{R,A}(s,f)\left(  \sigma

2566: ^{2}+\left(  1+\Gamma_{R,A}(s,f)\right)  \delta^{2}\right)  .$

2567: \end{proposition}

2568:

2569: \begin{proof}

2570: \noindent In this case, we note that the random variable $r.A(f)-r.A(s)$ is

2571: identical to $\Delta A(s,f)$

2572: (using the notation of Section \ref{sec:randomwalks}%

2573: ), and is independent of $r.A(s)$. The number $N$ of modification events in

2574: $(s,f]$ has a Poisson distribution with mean $\Gamma_{R,A}(s,f)$, and hence

2575: also variance $\Gamma_{R,A}(s,f)$. Therefore we have, for any $u\in\dom

2576: A$,

2577: \begin{align*}

2578: \hat{\iota}_{R,A,u}^{\modification}(s,f) &  \approx k_{R,A}(s)\expecop\!\left[

2579: {{\left(  \Delta A(s,f)\right)  }^{2}}\right]  \\

2580: &  =k_{R,A}(s)\left(  \varop\!\left[  {\Delta A(s,f)}\right]  +{\expecop

2581: \!\left[  {\Delta A(s,f)}\right]  }^{2}\right)  \\

2582: &  =k_{R,A}(s)\left(  \expecop\!\left[  {N}\right]  \sigma^{2}+\delta

2583: ^{2}\varop\!\left[  {N}\right]  +{\expecop\!\left[  {N}\right]  }^{2}%

2584: \delta^{2}\right)  \\

2585: &  =k_{R,A}(s)\,\Gamma_{R,A}(s,f)\left(  \sigma^{2}+\left(  1+\Gamma_{R,A}%

2586: (s,f)\right)  \delta^{2}\right)  .

2587: \end{align*}

2588: \hspace*{\fill}

2589: \end{proof}

2590:
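As a sanity check, consider the driftless special case $\delta=0$, in which
the approximation reduces to
\[
\hat{\iota}_{R,A,u}^{\modification}(s,f)\approx k_{R,A}(s)\,\Gamma
_{R,A}(s,f)\,\sigma^{2}.
\]
If one further assumes a constant modification rate $\gamma_{R,A}$, so that
$\Gamma_{R,A}(s,f)=\gamma_{R,A}\cdot(f-s)$ (an assumption made here only for
illustration), the expected modification cost per surviving tuple grows
linearly in the interval length,
$k_{R,A}(s)\,\gamma_{R,A}\,\sigma^{2}\cdot(f-s)$, in contrast to the quadratic
growth of the insertion cost under a homogeneous model with
$g^{\insertion}(s,f,t)=f-t$.
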

2591: \subsection{Example: the use of the cost model in Web crawling}

2592:

2593: \label{sec:exwebcraw} The following example concludes the introduction of the cost function. We show

2594: how, by using the cost model, one can generate an optimal transcription policy

2595: for Web crawling.

2596:

2597: \begin{example}

2598: [Web Monitoring]WebSQL \cite{MENDELZON97} is a Web monitoring tool which uses

2599: a virtual database schema to query the structural properties of Web documents.

2600: The database schema consists of two relations, \texttt{Document} with six

2601: attributes, namely \texttt{url, title, text, type, length, }and \texttt{modif}%

2602: , and \texttt{Anchor} with four attributes, namely \texttt{base, label, href},

2603: and \texttt{context}. Each tuple in \texttt{Anchor} indicates that document

2604: \texttt{base} contains a link to document \texttt{href}. Consider the

2605: following query (taken from \texttt{http://www.cs.toronto.edu/\symbol{126}%

2606: websql/}), which identifies locally reachable documents that contain some

2607: hyperlink to a compressed PostScript file:

2608:

2609: \vspace{2ex} \texttt{SELECT d.url, d.modif }

2610:

2611: \texttt{FROM Document d SUCH THAT ``http://www.OtherDoc.html'' -%

2612: $>$%

2613: -%

2614: $>$%

2615: * d, }

2616:

2617: \texttt{Anchor a SUCH THAT base = d }

2618:

2619: \texttt{WHERE filename(a.href) CONTAINS ``.ps.Z''; }

2620:

2621: \vspace{2ex}

2622: (We refrain from dwelling

2623: on the language specification; the interested reader is referred to the cited

2624: Web site.)

2625: Assume that the cost of performing the query at time $t$ is $\sum_{d\in

2626: D(t)}\psi_{d}$, where $D(t)$ represents the set of scanned documents and

2627: $\psi_{d}$ is a random variable representing the size of document $d$ in

2628: bytes. Assuming the $\{\psi_{d}\}$ are IID, the expected cost of performing

2629: the query at time $t$ is thus

2630: \[

2631: \expecop\left[  \sum_{d\in D(t)}\!\!\psi_{d}\right]  =\expecop[\card{D(t)}%

2632: ]\expecop[\psi],

2633: \]

2634: where $\psi$ is a generic random variable distributed like the $\{\psi_{d}\}$.

2635:

2636: A modification to a document is identified using changes to the \texttt{modif}

2637: attribute of the Document relation. For brevity in what follows, we let

2638: $R=\text{\texttt{Document}}$ and $A=\text{\texttt{modif}}$. We assign the

2639: following costs to changes in $A$:

2640:

2641: \begin{itemize}

2642: \item $g^{\deletion}(s,f,t) = 0$ for all $s<t<f$,

2643: that is, the user has no interest in being

2644: notified of deleted documents.

2645:

2646: \item For all $s<t<f$ and $u,v\in\dom A$, $u\neq v$,

2647: $c^{R,A}_{u,v}=g^{\insertion}(s,f,t)=\expecop[\psi]$,

2648: where $c^{R,A}_{u,v}$ is the cost for

2649: a modified document. For all other attributes $A^{\prime}\neq A$,

2650: $c^{R,A^{\prime}}_{u,v}=0$ for all $u,v\in\dom A^{\prime}$.

2651: \end{itemize}

2652:

2653: Suppose that a query was performed at time $s$, scanning the set of documents

2654: $D(s)$, and returning the set of documents $B(s)$, where $\left|

2655: {B(s)}\right|  \leq\left|  {D(s)}\right|  $. A user is interested in

2656: refreshing the query result without overloading system resources, thus

2657: balancing the cost of refreshing the query results against the cost of using

2658: partial or obsolescent data. This trade-off can be captured by the following

2659: policy: refresh the query at time $f$, after performing it at time $s$ iff

2660: \[

2661: \expecop\!\!\left[  \sum_{d\in D(f)}\!\!\psi_{d}\right]  \;<\;\expecop

2662: \!\left[  C_{R,\mathrm{o}}(s,f)\right]

2663: \]

2664: Thus, an equivalent condition is

2665: \begin{align*}

2666: \expecop[\card{D(f)}]\expecop[\psi]\; &  <\;\sum_{A^{\prime}\in\mathcal{A}%

2667: (R)}\hat{\iota}_{R,A^{\prime}}^{\modification}(s,f)+\hat{\iota}_{R}%

2668: ^{\deletion}(s,f)+\hat{\iota}_{R}^{\medspace\insertion}(s,f)\\

2669: &  =\hat{\iota}_{R,A}^{\modification}(s,f)+\hat{\iota}_{R}^{\medspace

2670: \insertion}(s,f),

2671: \end{align*}

2672: or

2673: \begin{align*}

2674: &  \left(  p_{R}(s,f)\left|  {D(s)}\right|  +\widetilde{\Lambda}%

2675: _{R}(s,f)\expecop\!\left[  {\Delta}_{R}^{+}\right]  \right)  \expecop[\psi

2676: ]\;\\

2677: &  <\left(  p_{R}(s,f)\!\!\!\!\sum_{{u\in\dom A}}\!\!\!\hat{R}_{A,u}%

2678: (s)(1-P_{u,u}^{{R,A}}(s,f))+\widetilde{\Lambda}_{R}(s,f)\expecop\!\left[

2679: {\Delta}_{R}^{+}\right]  p^{\medspace\insertion}\right)  \expecop[\psi],

2680: \end{align*}

2681: where $p^{\medspace\insertion}$ is the probability of a newly-inserted

2682: document being relevant to the query. Cancelling the factor of $\expecop[\psi

2683: ]$, another equivalent condition is

2684: \[

2685: p_{R}(s,f)\left|  {D(s)}\right|  +\widetilde{\Lambda}_{R}(s,f)\expecop

2686: \!\left[  {\Delta}_{R}^{+}\right]  \;<\;p_{R}(s,f)\!\!\!\!\sum_{u\in\dom

2687: A}\!\!\!\hat{R}_{A,u}(s)(1-P_{u,u}^{{R,A}}(s,f))+\widetilde{\Lambda}%

2688: _{R}(s,f)\expecop\!\left[  {\Delta}_{R}^{+}\right]  p^{\medspace\insertion},

2689: \]

2690: which is independent of the expected document size. Further assume that

2691: $P_{u,u}^{R,A}(s,f)=P_{\ast,\ast}^{R,{A}}(s,f)$ is independent of $u$. Then

2692: the refresh condition can be expressed as

2693: \[

2694: p_{R}(s,f)\left|  {D(s)}\right|  +\widetilde{\Lambda}_{R}(s,f)\expecop\!\left[  {\Delta}_{R}%

2695: ^{+}\right]  \;<\;p_{R}(s,f)\left|  {B(s)}\right|  (1-P_{\ast,\ast}^{{R,A}%

2696: }(s,f))+\widetilde{\Lambda}_{R}(s,f)\expecop\!\left[  {\Delta}_{R}^{+}\right]

2697: p^{\medspace\insertion}.%

2698: \]

2699: \hspace*{\fill}$\Box$

2700: \end{example}

2701:

2702: \section{Conclusion and topics for future research}

2703: \label{sec:conclusion}

2704: This paper represents a first step in a new research area,

2705: the stochastic estimation of the consistency of transcribed data over time. We

2706: have also suggested one possible

2707: technique for assigning a cost to the differences

2708: between two relation extensions, including a means of computing the expected

2709: value of this cost under our stochastic model. We have discussed a number

2710: of potential applications relating to managing replicas, query

2711: management, and Web crawling. We have also examined several

2712: strategies for refreshing replicas, although other strategies are

2713: certainly possible.

2714:

2715: As an illustration of the low client-side computational demands of the

2716: insertion-only transcription application of our model,

2717: a Java-based demo, based on the transcription policies described in

2718: \cite{GAL2001} and in this paper, can be accessed at

2719: \texttt{http://rbs.rutgers.edu:6677/}. The demo compares the performance of

2720: various policies using data stored in a back-end mSQL database.

2721:

2722: We hope to extend our work to the case where the materialized views are not

2723: simple replications, but are produced by SQL queries that involve selections,

2724: projections, natural joins, and certain types of aggregations. This work will

2725: involve a \emph{propagation algebra} for tracing the base data changes through

2726: a series of relational operators.

2727:

2728: This development should make it possible to apply the theory to the management

2729: of more complex queries than presented here. In particular, it will facilitate

2730: a possible approach to managing general materialized view obsolescence on a

2731: query-by-query basis, taking into account current user preferences for query

2732: accuracy and speed. The refresh rate of materialized views in a

2733: periodically-updated data source (such as a data warehouse) can be defined in

2734: terms of data obsolescence, which in turn can be stochastically estimated

2735: using our model for content evolution. In this case, we advocate a three-way

2736: cost model for query optimization~\cite{GAL99c}, in which the query optimizer

2737: evaluates various query plans using three complementary factors, namely

2738: \textit{generation cost}, \textit{transmission cost}, and \textit{obsolescence

2739: cost}. The first two factors take on a conventional interpretation and the

2740: obsolescence cost of a query represents a penalty for basing the query result

2741: on possibly obsolescent materialized views. A query plan using only selection

2742: from a local materialized view, for example, might have lower generation and

2743: transmission costs, but a higher obsolescence cost, than a plan fetching

2744: complete base relations from an extranet and then processing them through a

2745: series of join operations. Our model, when combined with additional techniques

2746: to propagate updates through relational operators, can be used as a basis for

2747: estimating the obsolescence cost. However, developing the propagation algebra

2748: may require some enrichment of our basic model, in particular the

2749: introduction of dependency between the deletion and modification processes.

2750:

2751: We foresee several additional future research directions. One direction

2752: involves the design of efficient algorithms for the numerical computations

2753: required by our model. As it stands so far, the most demanding computations

2754: required are general numerical integration and the matrix exponentiation

2755: formula (\ref{eq:matrixexp}). With regard to integration, we note that, in

2756: practice, the nonhomogeneous Poisson arrival rate functions $\lambda_{R}%

2757: (\cdot)$, $\mu_{R}(\cdot)$, and $\gamma_{R,A}(\cdot)$ will most likely be

2758: chosen to be periodic piecewise low-order polynomials, as suggested in

2759: Section \ref{sec:verify}. In such cases, many of the integrals

2760: needed by the model could be performed in closed form within each time

2761: period.

2762:

2763: Further calibration and verification of the models in real situations

2764: is also needed.  So far, we have demonstrated that the insertion model

2765: has plausible applications, but this work needs to be extended to the

2766: deletion and modification models.  Furthermore, the insertion model

2767: may need to be generalized to handle situations where there is

2768: ``burstiness'' or autocorrelation in the interarrival times, which may require techniques more involved than simply combining very

2769: closely spaced arrivals.

2770:

2771: Another future research direction involves applying the model to real-life

2772: settings such as managing a data warehouse. While the model is quite flexible,

2773: a methodology is still needed for structuring Markov chains and estimating the

2774: stochastic model's parameters. Finally, in order to calibrate the cost model,

2775: the issue of measuring user tolerance for data obsolescence

2776: should be considered.

2777:

2778: \section*{Acknowledgments}

2779:

2780: We would like to thank Benny Avi-Itzhak, Adi Ben-Israel, David Shanno, Andrzej

2781: Ruszczynski, Ben Melamed, Zachary Stoumbos, and Bob Vanderbei for their help.

2782: Also, we thank Kumaresan Chinnusamy and Shah Mitul for their comparative

2783: research on statistics gathering methods and Connie Lu and Gunjan Modha for

2784: their assistance in designing and implementing the demo.

2785:

2786: \bibliographystyle{plain}

2787: \bibliography{bib}

2788: \end{document}