1: \documentclass[12pt]{article}

2: \usepackage{amsfonts,latexsym,fullpage,amsmath,amssymb}

3: \renewcommand{\baselinestretch}{1.2}

4: \newcommand{\Y}{{\cal Y}}

5: \newcommand{\cP}{{\bf P}}

6: \newcommand{\N}{\mathbb N}

7: \newcommand{\R}{\mathbb R}

8: \newcommand{\x}{{\bf x}}

9: \newcommand{\qed}[0]{\hfill $\Box$}

10:

11: \newtheorem{theorem}{Theorem}

12: \newtheorem{lemma}{Lemma}

13: \newtheorem{corollary}{Corollary}

14: \newtheorem{remark}{Remark}

15: \newtheorem{example}[theorem]{Example}

16: \newtheorem{definition}[theorem]{Definition}

17:

18: \title{Required sample size for learning sparse\\ Bayesian

19: networks with many variables}

20:

21: \date{April 26, 2002}

22:

23: \author{ Pawe{\l}

24: Wocjan\thanks{e-mail: {\protect\tt

25: \{wocjan,janzing,eiss\_office\}@ira.uka.de}}, Dominik Janzing, and Thomas Beth\\

26: \small Institut f{\"u}r Algorithmen und Kognitive Systeme,

27: Universit{\"a}t Karlsruhe,\\[-1ex] \small Am Fasanengarten 5,

28: D-76\,131 Karlsruhe, Germany}

29: \begin{document}

30:

31: \maketitle

32:

33: \abstract{Learning joint probability distributions on $n$ random

34: variables requires exponential sample size in the generic case.  Here

35: we consider the case that a temporal (or causal) order of the

36: variables is known and that the (unknown) graph of causal dependencies

37: has bounded in-degree $\Delta$. Then the joint measure is uniquely

38: determined by the probabilities of all $(2\Delta+1)$-tuples.  Upper bounds on

39: the sample size required for estimating their probabilities can be

40: given in terms of the VC-dimension of the set of corresponding

41: cylinder sets. The sample size grows less than linearly with $n$.}

42:

43:

44: \section{Introduction}

45: Learning joint probability measures on a large set of variables is an

46: important task of statistics. One of the main motivations to estimate

47: joint probabilities is to study statistical dependencies and

48: independencies between the random variables \cite{Pearl:00}. In many

49: applications the goal is to obtain information on the underlying

50: causal structure that produces the statistical correlations.  However,

51: the problem of learning causal structure from statistical data is in

52: general a deep problem and cannot be solved by statistical

53: considerations alone \cite{Pearl:00,Glymour}.

54:

55: Here we do not focus on the problem of uncovering the causal

56: structure; rather, we address the problem of learning the probability

57: distribution on a large set of variables. In general, the sample size

58: required for estimating an unknown measure on the variables

59: $X_1,\dots,X_n$ grows exponentially with $n$. Assume for simplicity

60: that each $X_j$ is a discrete variable with $d$ possible values. Then

61: the probabilities of $d^n$ possible outcomes have to be estimated. The

62: sample size can be decreased considerably if prior knowledge on the

63: possible correlations is given. Consider for example the trivial case

64: when no statistical dependencies are possible at all, i.e.,

65: \[

66: P(x_1,x_2,\dots,x_n)=P(x_1)P(x_2)\dots P(x_n)\,,

67: \]

68: where $x_j$ denotes particular realizations of the corresponding

69: variable $X_j$. Then one has only to learn the probabilities

70: $P(x_1),\ldots,P(x_n)$.
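
For a rough sense of scale (the numbers $n=20$ and $d=2$ are chosen here
purely for illustration), one may compare the number of free parameters
in the two cases:
\[
\underbrace{d^n-1}_{\mbox{\scriptsize generic}} = 2^{20}-1 = 1\,048\,575
\qquad\mbox{versus}\qquad
\underbrace{n(d-1)}_{\mbox{\scriptsize product measure}} = 20\,.
\]
The count $n(d-1)$ uses only that each marginal $P(x_j)$ sums to one.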

71:

72: There are less trivial examples where prior information on the

73: statistical dependencies strongly reduces the required sample size.

74: For instance, this information may stem from knowledge on the

75: underlying causal structure. Following \cite{Pearl:00,Spirtes} we

76: encode causal structure in a directed graph with random variables as

77: its nodes.  Here we assume the graph to be acyclic. The decisive prior

78: information assumed to be given here is that each variable has at most

79: $\Delta$ parents, i.e., is influenced directly by at most $\Delta$ other nodes.

80: Note that we do not assume that we know which nodes are the parents.

81: Therefore, our assumption is merely a kind of {\em simplicity

82: assumption} on the causation for the statistical

83: dependencies. Furthermore, it should be emphasized that in many cases

84: one will not find any pair of variables that are statistically

85: independent. The constraints on the causal structure for the joint

86: probability measure are more sophisticated and are only reflected in

87: {\it conditional} probabilities. These constraints are well-known as

88: the {\it Markov condition} in {\it Bayesian networks}

89: \cite{Pearl:00,Glymour}. Conversely,

90: Bayesian networks may be considered as a convenient and intuitive way

91: of encoding statistical dependencies among variables in a graph

92: (without any causal interpretation).

93:

94:

95: \section{Bayesian networks}

96: Let us briefly introduce Bayesian networks. To do that we define {\em

97: conditional independence} relationships among variables, a central

98: notion in the analysis of probability distributions.

99:

100: \begin{definition}[Conditional independence]${}$\\

101: Let ${\bf V}=\{X_1,X_2,\ldots,X_n\}$ be a finite set of variables. Let

102: $P(\cdot)$ be a joint probability distribution over the variables in

103: ${\bf V}$, and let ${\bf X}$, ${\bf Y}$ and ${\bf Z}$ stand for any three

104: subsets of ${\bf V}$. The sets ${\bf X}$ and ${\bf Y}$ are said to be

105: conditionally independent given ${\bf Z}$, denoted by

106: \begin{equation}

107: ({\bf X} \perp {\bf Y}\, |\, {\bf Z})

108: \end{equation}

109: if

110: \begin{equation}

111: P({\bf x},{\bf y}|{\bf z})=P({\bf x}|{\bf z})P({\bf y}|{\bf z})\,,

112: \quad\mbox{whenever } P({\bf z})>0\,,

113: \end{equation}

114: where ${\bf x}$ is the tuple denoting a particular realization of the

115: values of the variables in ${\bf X}$ and the tuples ${\bf y}$ and

116: ${\bf z}$ are defined analogously. In words, if all the actual values

117: of the variables in ${\bf Z}$ are known the actual values of the

118: variables in ${\bf Y}$ do not provide any further information on the

119: actual values of the variables in ${\bf X}$.

120: \end{definition}
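
An equivalent form of the condition, which may be easier to relate to
the verbal description, is obtained by dividing by
$P({\bf y}|{\bf z})$ (a sketch, valid under the additional assumption
$P({\bf y},{\bf z})>0$):
\[
P({\bf x}|{\bf y},{\bf z})
=\frac{P({\bf x},{\bf y}|{\bf z})}{P({\bf y}|{\bf z})}
=P({\bf x}|{\bf z})\,.
\]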

121:

122: Directed acyclic graphs or Bayesian networks -- a term coined in

123: \cite{Pearl:85} -- are used to facilitate economical representation of

124: joint probability distributions. The basic decomposition scheme

125: offered by directed acyclic graphs can be illustrated as follows. Let

126: $P(\cdot)$ be a joint probability distribution as in Definition~1. The

127: chain rule of probability calculus always permits us to decompose $P$ as

128: a product of $n$ conditional probability distributions:

129: \begin{equation}

130: P(x_1,\ldots,x_n)=\prod_{j=1}^n P(x_j|x_1,\ldots,x_{j-1})\,.

131: \end{equation}

132: Now suppose that the conditional probability of some variable $X_j$ is

133: not sensitive to all the predecessors of $X_j$ but only to a small

134: subset of those predecessors. In words, suppose that $X_j$ is

135: independent of all other predecessors, once we know the values of a

136: selected group of predecessors called ${\bf

137: P}_j:=\{X_{j,1},\ldots,X_{j,m_j}\}$. We can then write

138: \begin{equation}

139: P(x_1,\ldots,x_n)=\prod_{j=1}^n P(x_j|{\bf p}_j)

140: \end{equation}

141: considerably simplifying the input information. Instead of specifying

142: the probability of $X_j$ conditional on all possible realizations of

143: its predecessors $X_1,\ldots,X_{j-1}$, we need only to take into

144: account the possible realizations of the set ${\bf P}_j$. The set

145: ${\bf P}_j$ is called the {\em Markovian parents} of $X_j$, or the

146: parents for short. The reason for the name becomes clear when we

147: introduce graphs around this concept.
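
As a minimal illustration (a hypothetical three-variable example, not
taken from the references), let $n=3$ with binary variables and suppose
${\bf P}_2=\{X_1\}$ and ${\bf P}_3=\{X_2\}$. The factorization then
reads
\[
P(x_1,x_2,x_3)=P(x_1)\,P(x_2|x_1)\,P(x_3|x_2)\,,
\]
which is specified by $1+2+2=5$ numbers, whereas the unrestricted chain
rule requires $1+2+4=7$.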

148:

149: \begin{definition}[Markov parents]${}$\\

150: Let ${\bf V}=\{X_1,\ldots,X_n\}$ be an ordered set of variables, and let

151: $P(\cdot)$ be the joint probability distribution on these

152: variables. A set of variables ${\bf P}_j$ is said to be Markovian

153: parents of $X_j$ if ${\bf P}_j$ is a minimal set of predecessors of

154: $X_j$ that renders $X_j$ independent of all its other predecessors. In

155: words, ${\bf P}_j$ is any subset of $\{X_1,\ldots,X_{j-1}\}$ satisfying

156: \begin{equation}\label{eq:markovian}

157: P(x_j|{\bf p}_j)=P(x_j|x_1,\ldots,x_{j-1})

158: \end{equation}

159: such that no proper subset of ${\bf P}_j$ satisfies

160: Eq.~(\ref{eq:markovian}).

161: \end{definition}

162:

163: This definition assigns to each variable $X_j$ a selected set ${\bf

164: P}_j$ of preceding variables that are sufficient for determining the

165: probability of $X_j$. The values of the other preceding variables are

166: redundant once we know the values ${\bf p}_j$ of the parent set ${\bf

167: P}_j$. This assignment can be encoded in a directed acyclic graph in

168: which the variables are represented by the nodes and arrows are drawn

169: from each node of the parent set toward the child node $X_j$.

170:

171: Furthermore, Definition~2 also provides a simple recursive method for

172: constructing such a DAG: Starting with the pair $(X_1,X_2)$, we draw

173: an arrow from $X_1$ to $X_2$ if and only if the two variables are

174: dependent. Assume that we have constructed the DAG up to node

175: $j-1$. At the $j$th stage, we select any minimal set of predecessors

176: of $X_j$ that renders $X_j$ independent from its other predecessors

177: (as in Eq.~(\ref{eq:markovian})), call this set ${\bf P}_j$ and draw

178: an arrow from each member in ${\bf P}_j$ to $X_j$. The result is a

179: directed acyclic graph, called a Bayesian network, in which an arrow

180: from $X_i$ to $X_j$ assigns $X_i$ as a Markovian parent of $X_j$,

181: consistent with Definition~2.

182:

183: Let us mention that the set ${\bf P}_j$ is unique whenever the

184: distribution $P(\cdot)$ is strictly positive, i.e.\ every

185: configuration of variables, no matter how unlikely, has some nonzero

186: probability of occurring. Under such conditions, the Bayesian network

187: associated with $P(\cdot)$ is unique, given the ordering of the

188: variables \cite{Pearl:88}.

189:

190: \begin{definition}[Markov Compatibility]${}$\\

191: Let $G$ be a DAG. If a probability distribution $P$ admits a

192: factorization relative to $G$, i.e.\

193: \begin{equation}

194: P(x_1,\ldots,x_n)=\prod_{j=1}^n P(X_j=x_j|{\bf P}_j={\bf p}_j)\,,

195: \end{equation}

196: where ${\bf P}_j$ are the parents of the node $X_j$ defined by the

197: graph $G$, then we say $G$ and $P$ are compatible, or that $P$ is

198: Markov relative to $G$.

199: \end{definition}

200:

201: The problem of learning a Bayesian network usually treated in the

202: literature is as follows: given a {\em training set}

203: $\{\x^1,\ldots,\x^l\}$, find a network that {\em best matches} the

204: training set (see e.g.\ \cite{CBL:97,FNP:99}), i.e., determine a

205: graph $G$ such that $P$ is Markov relative to $G$.

206:

207:

208: \section{Networks with bounded in-degree}

209:

210: To motivate our decisive assumption we would like to note that

211: scientific reasoning always tries to find a simple explanation

212: for the data (``Occam's Razor''). We are aware of the fact that

213: ``simplicity'' is hard to formalize.

214: However, it seems reasonable to try

215: to explain data by   {\it simple causal graphs}.

216: Here we may use the {\it in-degree} of the graph

217: as a criterion for simplicity.

218: It is defined as the

219: greatest number of parents that occurs. The intuitive meaning

220: of in-degree $\Delta$ is that no variable is directly influenced

221: by more than $\Delta$ others.

222: For $\Delta \ll n$

223: we call the graph {\it sparse}.

224: Clearly, the in-degree is only one of the graph theoretical notions that

225: may be used to define {\it simplicity} of causal explanations;

226: we could use e.g.\ the number of

227: edges.

228:

229: Let $G$ be an arbitrary DAG with in-degree $\Delta$. Then every

230: probability measure that is Markov relative to $G$ is already

231: determined by the probabilities of all $(\Delta+1)$-tuples. This

232: follows directly from the factorization in Definition~3,

233: since the conditional probabilities $P(x_j|{\bf p}_j)$ are the

234: quotients of the probabilities $P(x_j,{\bf p}_j)$ and $P({\bf p}_j)$

235: of tuples of sizes at most $\Delta+1$ and $\Delta$, respectively. Consequently,

236: if $G$ is known we can learn the probability measure $P$ by learning

237: the probabilities of all $(\Delta+1)$-tuples.
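
Written out, each factor is recovered from tuple probabilities as
\[
P(x_j|{\bf p}_j)=\frac{P(x_j,{\bf p}_j)}{P({\bf p}_j)}\,,
\qquad P({\bf p}_j)>0\,.
\]
For a rough parameter count, assume in addition that every variable
takes at most $d$ values (an assumption made only for this
illustration). Then a measure that is Markov relative to $G$ is
described by at most $n\,d^{\Delta}(d-1)$ conditional probabilities;
for instance, $n=100$, $d=2$, $\Delta=3$ gives at most $800$ numbers,
compared with $2^{100}-1$ for a generic measure.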

238:

239: In contrast, we do not assume that we know the exact structure of $G$

240: but only that its in-degree is at most $\Delta$.  Now the situation is more

241: complicated. Since we do not know the set of parents for any $X_j$, we

242: do not know which conditional probabilities have to appear in the

243: factorization in Eq.~(\ref{eq:markovian}). Therefore, it is not

244: sufficient to know the probabilities of all tuples of size $\Delta+1$

245: to reconstruct the structure. We have to know the

246: probabilities of at least all $(\Delta+2)$-tuples to be able to {\it test}

247: conditional independencies. The following theorem shows that it is

248: {\it sufficient} to know the probabilities of all $(2\Delta+1)$-tuples.

249:

250: \begin{theorem}[Graph structure from correlations]${}$\\\label{Algo}

251: Let $X_1<X_2<\ldots <X_n$ be an ordering of the variables. Assume that

252: $P$ is a probability measure that is Markov relative to a directed

253: acyclic graph (DAG) $G$. Let $G$ be consistent with the ordering,

254: i.e., the graph $G$ contains no arrow from $X_j$ to $X_i$ for $i<j$.

255: Let $G$ have in-degree $\Delta$ and assume that the probabilities of

256: all $(2\Delta+1)$-tuples are known. Then we can find a graph

257: $\tilde{G}$ (possibly different from $G$) with in-degree at most $\Delta$

258: such that $P$ is Markov relative to $\tilde{G}$.

259: \end{theorem}

260: {\bf Proof:}

261: We can find the correct graph structure by the following iteration:

262: Draw an arrow from  $X_1$ to $X_2$ if the two variables are dependent.

263: Assume we have found the correct structure on $X_1,X_2,\dots,X_{j-1}$.

264:

265: In order to find a possible minimal set $\cP_j$ of parents of $X_j$

266: we proceed as follows:

267: Let $m:=\min\{j-1,\Delta\}$.

268: For each $m$-subset  $K\subseteq V_j:=\{X_1,X_2,\dots,X_{j-1}\}$

269: test whether the following statement is true:

270:

271: $(X_j \perp L \, |\, K)$ for all sets $L\subseteq V_j$ (disjoint from $K$)

272: that contain at most $m$

273: elements.

274:

275: If this is true, $K$ necessarily contains a set $\cP'_j$ that can be

276: taken as Markovian parents of $X_j$.  This can be seen as follows:

277: Choose $L$ such that $(L\cup K) \supseteq \cP_j$ for an arbitrary

278: minimal choice of parents of $X_j$.  This is possible since $X_j$ has

279: at most $m$ parents.  Since $L\cup K$ contains the parents of $X_j$ it

280: renders $X_j$ independent of its predecessors (see the $d$-separation

281: criterion in \cite{Pearl:88,Spirtes,Glymour}). Formally we have $(X_j

282: \perp V_j \,|\, L\cup K)$.  By the contraction rule for conditional

283: independencies (see \cite{Pearl:88}) the statements $(X_j \perp V_j \,

284: |\, L\cup K)$ and $(X_j \perp L \,|\, K)$ imply $(X_j \perp V_j \,|\,

285: K)$. Hence $K$ must contain a set $\cP_j'$ that can be viewed as

286: Markovian parents of $X_j$.

287:

288: Now we can test whether a proper subset $K'$  of $K$  satisfies

289: $(X_j \perp L\,| \,K')$ and obtain a minimal set of parents of $X_j$ by

290: iterating this procedure. \qed
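
To see why tuples of size $2\Delta+1$ suffice for these tests, it may
help to write a single test without conditional probabilities (a
sketch; the lower-case letters ${\bf l}$ and ${\bf k}$ for realizations
of $L$ and $K$ are notation introduced here). The statement
$(X_j \perp L\,|\,K)$ is equivalent to
\[
P(x_j,{\bf l},{\bf k})\,P({\bf k})=P(x_j,{\bf k})\,P({\bf l},{\bf k})
\quad\mbox{for all } x_j,{\bf l},{\bf k}\,.
\]
The largest tuple that occurs, $(x_j,{\bf l},{\bf k})$, involves at
most $1+2m\le 2\Delta+1$ variables, and the remaining terms are
marginals of it.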

291:

292:

293: \section{Learning the probabilities of $k$-tuples}

294: Now we shall present an upper bound on the required sample size in

295: order to learn the probabilities of all $k$-tuples with good

296: reliability. Then we can apply this result to the case

297: $k:=2\Delta +1$.

298:

299: Let $P(\cdot)$ be a probability distribution over an (ordered) set of

300: random variables ${\bf V}=\{X_1,\ldots,X_n\}$ taking on values in

301: $\Omega_j$ for $j=1,\ldots,n$.

302:

303: Let $X_{j_{1}},\dots,X_{j_{k}}$ be any $k$-subset of ${\bf V}$. We

304: would like to have a reliable statement on the probability of the

305: event

306: $(x_{j_1},\dots,x_{j_k})\in\Omega_{j_1}\times\cdots\times\Omega_{j_k}$,

307: i.e.\ the probability

308: \begin{equation}

309: P(X_{j_1}=x_{j_1},\ldots,X_{j_k}=x_{j_k})\,.

310: \end{equation}

311: The problem of determining the sample size required for reliably

312: estimating the probability of {\it one specific} event

313: is a standard problem in statistics. However,

314: the problem we encounter in learning Bayesian networks is more

315: sophisticated: we have to be almost sure that the estimated

316: probabilities of all $(2\Delta+1)$-tuples are sufficiently close to

317: the real (unknown) probabilities.

318:

319: The problem of determining whether and how fast

320: the relative frequencies of a large

321: set of events converge {\it uniformly} to their probabilities is

322: well-known in statistical learning theory \cite{Vapnik:98}.

323: Statements on uniform convergence rely on the so-called

324: Vapnik-Chervonenkis dimension (VC-dimension) of the considered set of events.

325:

326: \begin{definition}[VC dimension]${}$\\

327: Let $P$ be an unknown probability measure on a probability space

328: $\Omega$ and $S$ a set of events, i.e., a set of measurable subsets of

329: $\Omega$. Define the VC-dimension of $S:=(M_\lambda)$ as the largest

330: number $h$ such that there exist $h$ points

331: $\omega_1,\omega_2,\dots,\omega_h \in \Omega$ such that the sets

332: $M_\lambda \cap \{\omega_1,\dots,\omega_h\}$ run over all $2^h$ subsets of

333: $\{\omega_1,\dots,\omega_h\}$.

334: Intuitively, one can consider the sets $M_\lambda$ as classifiers

335: and the VC-dimension as the largest number of points that

336: can be classified in all $2^h$ possible ways.

337:  The VC-dimension is said to be

338: infinite if such an $h$-subset can be found for all $h\in\N$.

339: \end{definition}

340:

341: A trivial upper bound on the VC-dimension is given by the logarithm to

342: base $2$ of the number of events (in the case that $S$ is finite).
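
The reasoning behind this bound can be spelled out in one line (a
sketch, with $|S|$ denoting the number of events in $S$): if $h$ points
are classified in all $2^h$ possible ways, then $S$ must contain at
least $2^h$ distinct sets, so
\[
2^h\le |S| \quad\Longrightarrow\quad h\le \log_2 |S|\,.
\]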

343:

344: Finite VC-dimension is known to be sufficient and necessary in order to

345: have uniform convergence of relative frequencies to their probabilities.

346: Quantitatively, one has the following theorem:

347:

348: \begin{theorem}[Uniform convergence]${}$\\\label{Uniform}

349: Let $f(M)$ be the relative frequency of the event

350: $M$ in $l$ independent runs. Let $S$ have VC-dimension $h$. Let $R_{\epsilon}$

351: be the risk (probability) that $S$ contains at least one set $M$ such

352: that $|f(M)-P(M)|\geq \epsilon$ for an arbitrary positive

353: $\epsilon$. Then we have

354: \begin{equation}

355: R_{\epsilon} <

356: 4 \exp\left\{\left(

357: \frac{h(1+\ln(2l/h))}{l}-(\epsilon-1/l)^2 \right)l\right\}\,.

358: \end{equation}

359: \end{theorem}

360: {\bf Proof:} see Theorem 4.4 in \cite{Vapnik:98}. \qed

361:

362: This theorem allows us to derive an upper bound on the required sample

363: size in order to estimate the probabilities of all $k$-tuples. First we

364: have to define the set of events and give an upper bound on its

365: VC-dimension.

366:

367: Let $\Omega:=\Omega_1\times\cdots\times\Omega_n$

368: be the probability space.

369: This means that the $j$th

370: random variable takes on values from $\Omega_j$ for $j=1,\ldots,n$.

371: The $k$-tuples are characterized by the positions and values the

372: corresponding random variables take on. Let ${\bf

373: j}:=\{j_1,j_2,\dots,j_k\}$ be an arbitrary $k$-subset of

374: $\{1,\dots,n\}$ and ${\bf x}\in \Omega_{j_1}\times\cdots\times\Omega_{j_k}$.

375:

376: We then denote by $M^{\bf j}_{\bf x}$ the event that the random

377: variables $X_{j_1},X_{j_2},\dots, X_{j_k}$ take on the values

378: $x_{j_1},\dots,x_{j_k}$. This event corresponds uniquely to a cylinder

379: set $C^{\bf j}_{\bf x}\subset \Omega$.

380:

381: An upper bound on the VC-dimension of the set of those  events that

382: correspond to cylinder sets $C^{\bf j}_{\bf x}$ is easy to

383: get.  Let $d$ be the maximal

384: cardinality of the sets $\Omega_j$. Then, for fixed $k$, there exist

385: at most

386: \[

387: d^k {n \choose k}

388: \]

389: such cylinder sets. The first term gives an upper bound on the

390: possible combinations of values and the second term the number of

391: different positions. This number is smaller than $(nd)^k$.

392: By taking the logarithm to base $2$ we obtain an upper bound on the

393: VC-dimension

394: \begin{equation}\label{eq:upper}

395: h\le k\log_2 (nd)

396: \end{equation}

397: Obviously, we can use much better bounds for concrete applications,

398: e.g.\ given by Stirling's approximation (giving a less intuitive

399: expression but providing a tighter bound). However, this crude upper

400: bound is sufficient to study the asymptotic behavior.
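
To give a feeling for the numbers (the values $n=100$, $d=2$,
$\Delta=3$, hence $k=2\Delta+1=7$, are chosen here only for
illustration), the crude bound yields
\[
h\le 7\log_2 200\approx 53.5\,,
\]
whereas taking the logarithm of the exact count gives
\[
h\le \log_2\left(2^7{100 \choose 7}\right)\approx 40.9\,.
\]
Both values are small compared with the $2^{100}$ elementary outcomes.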

401:

402: Now we will present a lower bound on the VC-dimension in order to get

403: an idea how tight the upper bound in (\ref{eq:upper}) is.

404:

405: We construct $l$ $n$-tuples with $l:=\lfloor \log_2 (n-k+1) \rfloor$

406: as follows. For each set $\Omega_j$ we choose two different  values

407: $x_{j;0}$ and $x_{j;1}$ for $j=1,\ldots,n$. This defines a map

408: $\phi$ from the set of binary words of length $n$ into $\Omega$ by

409: setting

410: \begin{equation}

411: \phi: b_1 b_2 \ldots b_n \mapsto

412: x_{1;b_1} x_{2;b_2} \ldots x_{n;b_n}\,.

413: \end{equation}

414:

415: Now we define an $l \times n$ matrix $M$ with entries $0$ and $1$ as

416: follows: The first $k-1$ columns have only $1$ as entries. The next $2^l$

417: columns are the binary words of length $l$. The remaining

418: $(n-k+1 -2^l)$ columns can be chosen arbitrarily.

419:

420: The rows of $M$ correspond

421: to $n$-tuples by the map $\phi$. Let $\Y$ be the set of those

422: $n$-tuples and $S$ be an arbitrary subset of $\Y$. $S$ can uniquely be

423: characterized by a vector $s$ of length $l$ with entries $0$ and $1$

424: where the $j$-th entry of $s$ indicates whether the $j$-th $n$-tuple

425: is an element of $S$ or not.  The matrix $M$ contains a column that

426: coincides with $s$.  Assume it to be the $i$-th column. Then

427: $C^{\bf j}_{\bf x}

428: \cap \Y $ contains exactly those $n$-tuples that are elements of $S$

429: provided that $C^{\bf j}_{\bf x}$ is chosen as follows.  Let ${\bf j}$ be

430: $(1,2,\dots,k-1,i)$ and choose ${\bf x}$ as the $k$-tuple

431: $(x_{1;1},x_{2;1},\dots,x_{k-1;1},x_{i;1})$.

432: This shows that the cylinder sets corresponding to $k$-tuples

433: are able to classify $\Y$ in all $2^l$ possible ways.

434: Therefore

435: $\lfloor\log_2 (n-k+1)\rfloor$ is a lower

436: bound on the VC-dimension of the cylinder sets. Comparing this bound

437: with the upper bound in (\ref{eq:upper}), we see that it gives the

438: correct asymptotic behavior in the $O$-notation

439: if $k$ and $d$ are considered as constants.
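
A small instance may make the construction concrete (the values $n=5$
and $k=2$ are chosen here for illustration). Then
$l=\lfloor\log_2 4\rfloor=2$, the first $k-1=1$ column consists of
ones, and the remaining $2^l=4$ columns are the binary words of length
$2$:
\[
M=\left(\begin{array}{ccccc}
1&0&0&1&1\\
1&0&1&0&1
\end{array}\right)\,.
\]
The subset $S$ containing only the second row has $s=(0,1)^T$, which is
the third column. With ${\bf j}=(1,3)$ and
${\bf x}=(x_{1;1},x_{3;1})$, the cylinder set $C^{\bf j}_{\bf x}$
contains the image of the second row but not that of the first. The
other three subsets are obtained in the same way from columns $2$, $4$
and $5$.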

440:

441: \begin{theorem}

442: For $\epsilon >0$ let $R_\epsilon$ be the risk that there is a

443: cylinder set $C^{\bf j}_{\bf x}$ such that its relative frequency

444: deviates from its probability by more than $\epsilon$. Then

445: $R_\epsilon$ can be made smaller than any $\delta>0$ while only

446: increasing the sample size linearly with $n$.

447: \end{theorem}

448: {\bf Proof:}

449: We choose $l$ such that

450: \[

451: \frac{l}{1+\ln(2l)}\frac{(\epsilon -1/l)^2}{2} \geq

452: k \log_2 (nd) \,.

453: \]

454: This can asymptotically be achieved by letting $l$ grow as $O(n)$,

455: since the right-hand side grows only logarithmically in $n$, whereas

456: $l/(1+\ln(2l))$ grows faster than any fixed power $l^{c}$ with $c<1$.

457:

458: Using our bound

459: \[

460: h\leq k\log_2 (nd)

461: \]

462: we obtain

463: \[

464: h \leq \frac{(\epsilon-1/l)^2}{2}\frac{l}{1+\ln(2l)}

465: \]

466: and get

467: \[

468: \frac{h(1+\ln (2l))}{l} \leq \frac{(\epsilon -1/l)^2}{2}\,.

469: \]

470: By elementary calculation, this implies

471: \[

472: \frac{h(1+\ln(2l/h))}{l} -(\epsilon -1/l)^2 \leq

473: -\frac{(\epsilon -1/l)^2}{2}\,.

474: \]

475: Using the bound of Theorem 6 this shows that the risk $R_\epsilon$

476: can even be made to decrease exponentially in $n$

477: while increasing the sample size $l$ only linearly in $n$.

478: \qed
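
The proof can be turned into an explicit sufficient condition for a
prescribed risk level $\delta$ (a sketch that merely rearranges the
bounds above). Since the exponent in Theorem~\ref{Uniform} is at most
$-(\epsilon-1/l)^2\,l/2$, we have
\[
R_\epsilon < 4\exp\left\{-\frac{(\epsilon-1/l)^2}{2}\,l\right\}\le\delta
\quad\mbox{whenever}\quad
l\,(\epsilon-1/l)^2\ge 2\ln\frac{4}{\delta}\,.
\]
Hence it suffices to choose $l$ so that this inequality and the
condition on $l$ at the beginning of the proof both hold.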

479:

480: Note that the sample size has to be chosen such that the deviation of

481: the relative frequencies from their probabilities is small compared to

482: the relative frequencies. Then we have a reasonable criterion to

483: decide for which sets ${\bf X},{\bf Y},{\bf Z}$ of variables we may

484: assume ${\bf X}$ and ${\bf Y}$ to be independent given ${\bf Z}$. This

485: criterion is as follows: Based on the error bound of

486: Theorem~\ref{Uniform} we compute the relative uncertainty of the

487: conditional probabilities used in the algorithm in the proof of

488: Theorem~\ref{Algo}. If the observed statistical dependencies are

489: greater than the uncertainty we assume the variables to be dependent.

490:

491:

492: \section{Conclusions}

493: The sample size required to learn the joint probability distribution on $n$

494: nodes increases at most linearly with $n$ if the underlying causal

495: structure is assumed to be sufficiently simple.  Here we considered

496: the case that we know that the (unknown) causal graph has

497: in-degree at most $\Delta$ and a known time order exists.  Then a graph

498: that is compatible with the unknown probability measure can be found

499: efficiently if only the probabilities of all $(2\Delta+1)$-tuples are

500: known. They can be learned with linear sample size.  We have shown

501: this by finding bounds on the VC-dimension of the corresponding

502: cylinder sets.  We would like to note that the causal structure can at

503: least be {\it guessed} if only the probabilities of $(2\Delta

504: +1)$-tuples are known, since they allow one to test a large number of

505: statistical independencies.

506:

507: \begin{thebibliography}{1}

508:

509: \bibitem{CBL:97}

510: J.~Cheng, D.~Bell, and W.~Liu.

511: \newblock Learning belief networks from data: an information theory based

512:   approach.

513: \newblock In {\em Proc. of the Sixth ACM International Conference on

514:   Information and Knowledge Management}, 1997.

515:

516: \bibitem{FNP:99}

517: N.~Friedman, I.~Nachman, and D.~Pe'er.

518: \newblock Learning Bayesian network structure from massive datasets: The

519:   ``sparse candidate'' algorithm.

520: \newblock In {\em Proc. Fifteenth Conf. on Uncertainty in Artificial

521:   Intelligence (UAI)}, 1999.

522:

523: \bibitem{Glymour}

524: C.~Glymour and G.~F. Cooper, editors.

525: \newblock {\em Computation, Causation \& Discovery}.

526: \newblock AAAI Press/The MIT Press, 1999.

527:

528: \bibitem{Pearl:85}

529: J.~Pearl.

530: \newblock Bayesian networks: A model of self-activated memory for evidential

531:   reasoning.

532: \newblock In {\em Proceedings, Cognitive Science Society}, pages 329--334,

533:   Greenwich, CT: Ablex, 1985.

534:

535: \bibitem{Pearl:88}

536: J.~Pearl.

537: \newblock {\em Probabilistic Reasoning in Intelligent Systems}.

538: \newblock Morgan Kaufmann, San Mateo, CA, 1988.

539:

540: \bibitem{Pearl:00}

541: J.~Pearl.

542: \newblock {\em Causality: models, reasoning, and inference}.

543: \newblock Cambridge University Press, 2000.

544:

545: \bibitem{Spirtes}

546: P.~Spirtes, C.~Glymour, and R.~Scheines.

547: \newblock {\em Causation, Prediction, and Search}, volume~81 of {\em Lecture

548:   Notes in Statistics}.

549: \newblock Springer, 1993.

550:

551: \bibitem{Vapnik:98}

552: V.~N. Vapnik.

553: \newblock {\em Statistical Learning Theory}.

554: \newblock Adaptive and learning systems for signal processing, communications,

555:   and control. Wiley Interscience, 1998.

556:

557: \end{thebibliography}

558: \end{document}

559: