1: \documentclass[12pt]{article}
\usepackage{amsfonts,latexsym,fullpage,amsmath,amssymb}
3: \renewcommand{\baselinestretch}{1.2}
4: \newcommand{\Y}{{\cal Y}}
5: \newcommand{\cP}{{\bf P}}
6: \newcommand{\N}{\mathbb N}
7: \newcommand{\R}{\mathbb R}
8: \newcommand{\x}{{\bf x}}
9: \newcommand{\qed}[0]{\hfill $\Box$}
10:
11: \newtheorem{theorem}{Theorem}
12: \newtheorem{lemma}{Lemma}
13: \newtheorem{corollary}{Corollary}
14: \newtheorem{remark}{Remark}
15: \newtheorem{example}[theorem]{Example}
16: \newtheorem{definition}[theorem]{Definition}
17:
18: \title{Required sample size for learning sparse\\ Bayesian
19: networks with many variables}
20:
21: \date{April 26, 2002}
22:
23: \author{ Pawe{\l}
24: Wocjan\thanks{e-mail: {\protect\tt
25: \{wocjan,janzing,eiss\_office\}@ira.uka.de}}, Dominik Janzing, and Thomas Beth\\
26: \small Institut f{\"u}r Algorithmen und Kognitive Systeme,
27: Universit{\"a}t Karlsruhe,\\[-1ex] \small Am Fasanengarten 5,
28: D-76\,131 Karlsruhe, Germany}
29: \begin{document}
30:
31: \maketitle
32:
33: \abstract{Learning joint probability distributions on $n$ random
34: variables requires exponential sample size in the generic case. Here
35: we consider the case that a temporal (or causal) order of the
36: variables is known and that the (unknown) graph of causal dependencies
37: has bounded in-degree $\Delta$. Then the joint measure is uniquely
38: determined by the probabilities of all $(2\Delta+1)$-tuples. Upper bounds on
39: the sample size required for estimating their probabilities can be
40: given in terms of the VC-dimension of the set of corresponding
41: cylinder sets. The sample size grows less than linearly with $n$.}
42:
43:
44: \section{Introduction}
45: Learning joint probability measures on a large set of variables is an
46: important task of statistics. One of the main motivations to estimate
47: joint probabilities is to study statistical dependencies and
48: independencies between the random variables \cite{Pearl:00}. In many
49: applications the goal is to obtain information on the underlying
50: causal structure that produces the statistical correlations. However,
51: the problem of learning causal structure from statistical data is in
52: general a deep problem and cannot be solved by statistical
53: considerations alone \cite{Pearl:00,Glymour}.
54:
Here we do not focus on the problem of uncovering the causal
structure; rather, we address the problem of learning the probability
distribution on a large set of variables. In general, the sample size
58: required for estimating an unknown measure on the variables
59: $X_1,\dots,X_n$ grows exponentially with $n$. Assume for simplicity
60: that each $X_j$ is a discrete variable with $d$ possible values. Then
61: the probabilities of $d^n$ possible outcomes have to be estimated. The
62: sample size can be decreased considerably if prior knowledge on the
63: possible correlations is given. Consider for example the trivial case
64: when no statistical dependencies are possible at all, i.e.,
65: \[
66: P(x_1,x_2,\dots,x_n)=P(x_1)P(x_2)\dots P(x_n)\,,
67: \]
68: where $x_j$ denotes particular realizations of the corresponding
69: variable $X_j$. Then one has only to learn the probabilities
70: $P(x_1),\ldots,P(x_n)$.
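
To illustrate the reduction (a simple parameter count, assuming that
every variable takes exactly $d$ values): a general joint distribution
has $d^n-1$ free parameters, whereas a product distribution has only
$n(d-1)$. For $n=20$ binary variables ($d=2$) this gives
\[
d^n-1=2^{20}-1=1\,048\,575
\qquad\mbox{versus}\qquad
n(d-1)=20\,.
\]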
71:
There are less trivial examples where prior information on the
statistical dependencies strongly reduces the required sample size.
For instance, this information may stem from knowledge of the
75: underlying causal structure. Following \cite{Pearl:00,Spirtes} we
76: encode causal structure in a directed graph with random variables as
77: its nodes. Here we assume the graph to be acyclic. The decisive prior
78: information assumed to be given here is that each variable has at most
79: $\Delta$ parents, i.e., is influenced directly by at most $\Delta$ other nodes.
80: Note that we do not assume that we know which nodes are the parents.
81: Therefore, our assumption is merely a kind of {\em simplicity
82: assumption} on the causation for the statistical
83: dependencies. Furthermore, it should be emphasized that in many cases
84: one will not find any pair of variables that are statistically
85: independent. The constraints on the causal structure for the joint
86: probability measure are more sophisticated and are only reflected in
87: {\it conditional} probabilities. These constraints are well-known as
88: the {\it Markov condition} in {\it Bayesian networks}
89: \cite{Pearl:00,Glymour}. Conversely,
90: Bayesian networks may be considered as a convenient and intuitive way
91: of encoding statistical dependencies among variables in a graph
92: (without any causal interpretation).
93:
94:
95: \section{Bayesian networks}
96: Let us briefly introduce Bayesian networks. To do that we define {\em
97: conditional independence} relationships among variables, a central
98: notion in the analysis of probability distributions.
99:
100: \begin{definition}[Conditional independence]${}$\\
101: Let ${\bf V}=\{X_1,X_2,\ldots,X_n\}$ be a finite set of variables. Let
$P(\cdot)$ be a joint probability distribution over the variables in
${\bf V}$, and let ${\bf X}$, ${\bf Y}$ and ${\bf Z}$ stand for any three
104: subsets of ${\bf V}$. The sets ${\bf X}$ and ${\bf Y}$ are said to be
105: conditionally independent given ${\bf Z}$, denoted by
106: \begin{equation}
107: ({\bf X} \perp {\bf Y}\, |\, {\bf Z})
108: \end{equation}
109: if
110: \begin{equation}
111: P({\bf x},{\bf y}|{\bf z})=P({\bf x}|{\bf z})P({\bf y}|{\bf z})\,,
112: \quad\mbox{whenever } P({\bf z})>0\,,
113: \end{equation}
114: where ${\bf x}$ is the tuple denoting a particular realization of the
115: values of the variables in ${\bf X}$ and the tuples ${\bf y}$ and
${\bf z}$ are defined analogously. In words, once the actual values
of the variables in ${\bf Z}$ are known, the actual values of the
variables in ${\bf Y}$ do not provide any further information on the
actual values of the variables in ${\bf X}$.
120: \end{definition}
121:
122: Directed acyclic graphs or Bayesian networks -- a term coined in
123: \cite{Pearl:85} -- are used to facilitate economical representation of
joint probability distributions. The basic decomposition scheme
offered by directed acyclic graphs can be illustrated as follows. Let
$P(\cdot)$ be a joint probability distribution as in Definition~1. The
chain rule of probability calculus always permits us to decompose $P$ as
a product of $n$ conditional probability distributions:
129: \begin{equation}
130: P(x_1,\ldots,x_n)=\prod_{j=1}^n P(x_j|x_1,\ldots,x_{j-1})\,.
131: \end{equation}
132: Now suppose that the conditional probability of some variable $X_j$ is
133: not sensitive to all the predecessors of $X_j$ but only to a small
134: subset of those predecessors. In words, suppose that $X_j$ is
135: independent of all other predecessors, once we know the values of a
136: selected group of predecessors called ${\bf
137: P}_j:=\{X_{j,1},\ldots,X_{j,m_j}\}$. We can then write
138: \begin{equation}
139: P(x_1,\ldots,x_n)=\prod_{j=1}^n P(x_j|{\bf p}_j)
140: \end{equation}
141: considerably simplifying the input information. Instead of specifying
142: the probability of $X_j$ conditional on all possible realizations of
143: its predecessors $X_1,\ldots,X_{j-1}$, we need only to take into
144: account the possible realizations of the set ${\bf P}_j$. The set
145: ${\bf P}_j$ is called the {\em Markovian parents} of $X_j$, or the
146: parents for short. The reason for the name becomes clear when we
147: introduce graphs around this concept.
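
As an illustration (our own counting, assuming that every variable
takes exactly $d$ values), the factorization over the parent sets
requires
\[
\sum_{j=1}^n (d-1)\,d^{m_j}
\]
free parameters, since for each of the $d^{m_j}$ realizations
${\bf p}_j$ of the parents a distribution over the $d$ values of $X_j$
has to be specified. For a chain of three binary variables with
${\bf P}_2=\{X_1\}$ and ${\bf P}_3=\{X_2\}$ we get
\[
P(x_1,x_2,x_3)=P(x_1)P(x_2|x_1)P(x_3|x_2)\,,
\]
which has $1+2+2=5$ free parameters instead of the $2^3-1=7$ of a
general distribution.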
148:
149: \begin{definition}[Markov parents]${}$\\
Let ${\bf V}=\{X_1,\ldots,X_n\}$ be an ordered set of variables, and let
151: $P(\cdot)$ be the joint probability distribution on these
152: variables. A set of variables ${\bf P}_j$ is said to be Markovian
153: parents of $X_j$ if ${\bf P}_j$ is a minimal set of predecessors of
154: $X_j$ that renders $X_j$ independent of all its other predecessors. In
155: words, ${\bf P}_j$ is any subset of $\{X_1,\ldots,X_{j-1}\}$ satisfying
156: \begin{equation}\label{eq:markovian}
157: P(x_j|{\bf p}_j)=P(x_j|x_1,\ldots,x_{j-1})
158: \end{equation}
159: such that no proper subset of ${\bf P}_j$ satisfies
160: Eq.~(\ref{eq:markovian}).
161: \end{definition}
162:
163: This definition assigns to each variable $X_j$ a selected set ${\bf
164: P}_j$ of preceding variables that are sufficient for determining the
165: probability of $X_j$. The values of the other preceding variables are
166: redundant once we know the values ${\bf p}_j$ of the parent set ${\bf
167: P}_j$. This assignment can be encoded in a directed acyclic graph in
168: which the variables are represented by the nodes and arrows are drawn
169: from each node of the parent set toward the child node $X_j$.
170:
171: Furthermore, Definition~2 also provides a simple recursive method for
172: constructing such a DAG: Starting with the pair $(X_1,X_2)$, we draw
173: an arrow from $X_1$ to $X_2$ if and only if the two variables are
174: dependent. Assume that we have constructed the DAG up to node
175: $j-1$. At the $j$th stage, we select any minimal set of predecessors
176: of $X_j$ that renders $X_j$ independent from its other predecessors
177: (as in Eq.~(\ref{eq:markovian})), call this set ${\bf P}_j$ and draw
178: an arrow from each member in ${\bf P}_j$ to $X_j$. The result is a
179: directed acyclic graph, called a Bayesian network, in which an arrow
180: from $X_i$ to $X_j$ assigns $X_i$ as a Markovian parent of $X_j$,
181: consistent with Definition~2.
182:
183: Let us mention that the set ${\bf P}_j$ is unique whenever the
distribution $P(\cdot)$ is strictly positive, i.e.\ every
configuration of variables, no matter how unlikely, has nonzero
probability of occurring. Under such conditions, the Bayesian network
187: associated with $P(\cdot)$ is unique, given the ordering of the
188: variables \cite{Pearl:88}.
189:
190: \begin{definition}[Markov Compatibility]${}$\\
191: Let $G$ be a DAG. If a probability distribution $P$ admits a
192: factorization relative to $G$, i.e.\
193: \begin{equation}
194: P(x_1,\ldots,x_n)=\prod_{j=1}^n P(X_j=x_j|{\bf P}_j={\bf p}_j)\,,
195: \end{equation}
196: where ${\bf P}_j$ are the parents of the node $X_j$ defined by the
197: graph $G$, then we say $G$ and $P$ are compatible, or that $P$ is
198: Markov relative to $G$.
199: \end{definition}
200:
The problem of learning a Bayesian network usually treated in the
literature is as follows: given a {\em training set}
$\{\x^1,\ldots,\x^l\}$, find a network that {\em best matches} the
training set (see e.g.\ \cite{CBL:97,FNP:99}), i.e., determine a
graph $G$ such that $P$ is Markov relative to $G$.
206:
207:
208: \section{Networks with bounded in-degree}
209:
210: To motivate our decisive assumption we would like to note that
211: scientific reasoning always tries to find a simple explanation
212: for the data (``Occam's Razor''). We are aware of the fact that
213: ``simplicity'' is hard to formalize.
214: However, it seems reasonable to try
215: to explain data by {\it simple causal graphs}.
Here we may use the {\it in-degree} of the graph
as a criterion for simplicity.
It is defined as the
greatest number of parents of any node. The intuitive meaning
220: of in-degree $\Delta$ is that no variable is directly influenced
221: by more than $\Delta$ others.
222: For $\Delta \ll n$
223: we call the graph {\it sparse}.
224: Clearly, the in-degree is only one of the graph theoretical notions that
225: may be used to define {\it simplicity} of causal explanations;
226: we could use e.g.\ the number of
227: edges.
228:
Let $G$ be an arbitrary DAG with in-degree $\Delta$. Then every
probability measure that is Markov relative to $G$ is already
determined by the probabilities of all $(\Delta+1)$-tuples. This
follows directly from the factorization
$P(x_1,\ldots,x_n)=\prod_{j=1}^n P(x_j|{\bf p}_j)$,
since the conditional probabilities $P(x_j|{\bf p}_j)$ are the
quotients of the probabilities $P(x_j,{\bf p}_j)$ and $P({\bf p}_j)$
of tuples of sizes at most $\Delta+1$ and $\Delta$, respectively. Consequently,
if $G$ is known we can learn the probability measure $P$ by learning
the probabilities of all $(\Delta+1)$-tuples.
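
For illustration, consider a node $X_3$ with parents
${\bf P}_3=\{X_1,X_2\}$, i.e., $\Delta=2$ (a hypothetical example; we
assume $P(x_1,x_2)>0$). Then
\[
P(x_3|x_1,x_2)=\frac{P(x_1,x_2,x_3)}{P(x_1,x_2)}\,,
\qquad
P(x_1,x_2)=\sum_{x_3'} P(x_1,x_2,x_3')\,,
\]
so that the conditional probability is a function of the probabilities
of $3$-tuples alone. The probabilities of smaller tuples are obtained
from those of larger tuples by summation in the same way.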
238:
In contrast, we do not assume that we know the exact structure of $G$
but only that its in-degree is at most $\Delta$. Now the situation is more
complicated. Since we do not know the set of parents for any $X_j$, we
do not know which conditional probabilities appear in the
factorization of $P$. Therefore, it is not
sufficient to know the probabilities of all tuples of size $\Delta+1$
to reconstruct the structure. We have to know the
probabilities of at least all $(\Delta+2)$-tuples to be able to {\it test}
conditional independencies. The following theorem shows that it is
{\it sufficient} to know the probabilities of all $(2\Delta+1)$-tuples.
249:
250: \begin{theorem}[Graph structure from correlations]${}$\\\label{Algo}
251: Let $X_1<X_2<\ldots <X_n$ be an ordering of the variables. Assume that
252: $P$ is a probability measure that is Markov relative to a directed
253: acyclic graph (DAG) $G$. Let $G$ be consistent with the ordering,
254: i.e., the graph $G$ contains no arrow from $X_j$ to $X_i$ for $i<j$.
Let $G$ have in-degree $\Delta$ and assume that the probabilities of
all $(2\Delta+1)$-tuples are known. Then we can find a graph
$\tilde{G}$ (possibly different from $G$) with in-degree at most
$\Delta$ such that $P$ is Markov relative to $\tilde{G}$.
259: \end{theorem}
260: {\bf Proof:}
We can find a suitable graph structure by the following iteration:
Draw an arrow from $X_1$ to $X_2$ if and only if the two variables are dependent.
Assume we have found a suitable structure on $X_1,X_2,\dots,X_{j-1}$.
264:
265: In order to find a possible minimal set $\cP_j$ of parents of $X_j$
266: we proceed as follows:
267: Let $m:=\min\{j-1,\Delta\}$.
268: For each $m$-subset $K\subseteq V_j:=\{X_1,X_2,\dots,X_{j-1}\}$
269: test whether the following statement is true:
270:
271: $(X_j \perp L \, |\, K)$ for all sets $L$ (disjoint from $K$)
272: that contain at most $m$
273: elements.
274:
If this is true, $K$ necessarily contains a set $\cP'_j$ that can be
taken as Markovian parents of $X_j$. This can be seen as follows:
Choose $L$ such that $(L\cup K) \supseteq \cP_j$ for an arbitrary
minimal choice of parents of $X_j$. This is possible since $X_j$ has
at most $m$ parents. Since $L\cup K$ contains the parents of $X_j$ it
renders $X_j$ independent of its predecessors (see the $d$-separation
criterion in \cite{Pearl:88,Spirtes,Glymour}). Formally we have $(X_j
282: \perp V_j \,|\, L\cup K)$. By the contraction rule for conditional
283: independencies (see \cite{Pearl:88}) the statements $(X_j \perp V_j \,
284: |\, L\cup K)$ and $(X_j \perp L \,|\, K)$ imply $(X_j \perp V_j \,|\,
285: K)$. Hence $K$ must contain a set $\cP_j'$ that can be viewed as
286: Markovian parents of $X_j$.
287:
Now we can test whether a proper subset $K'$ of $K$ satisfies
$(X_j \perp L\,| \,K')$ for all such sets $L$, and obtain a minimal
set of parents of $X_j$ by iterating this procedure. \qed
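
To see why $(2\Delta+1)$-tuples suffice for these tests, note that for
$P({\bf k})>0$ the statement $(X_j\perp L\,|\,K)$ can be rewritten as
\[
P(x_j,{\bf l},{\bf k})\,P({\bf k})=P(x_j,{\bf k})\,P({\bf l},{\bf k})\,,
\]
where ${\bf l}$ and ${\bf k}$ denote realizations of $L$ and $K$ (this
notation is introduced here only for illustration). The largest tuple
that appears involves $1+|L|+|K|\le 2m+1\le 2\Delta+1$ variables. For
example, for $\Delta=1$ and $j=3$ we have $m=1$, the candidate sets
are $K=\{X_1\}$ and $K=\{X_2\}$, and the only non-trivial tests are
$(X_3\perp X_2\,|\,X_1)$ and $(X_3\perp X_1\,|\,X_2)$, both of which
are decided by the probabilities $P(x_1,x_2,x_3)$ of $3$-tuples.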
291:
292:
293: \section{Learning the probabilities of $k$-tuples}
294: Now we shall present an upper bound on the required sample size in
295: order to learn the probabilities of all $k$-tuples with good
296: reliability. Then we can apply this result to the case
297: $k:=2\Delta +1$.
298:
Let $P(\cdot)$ be a probability distribution over an (ordered) set of
random variables ${\bf V}=\{X_1,\ldots,X_n\}$, where $X_j$ takes
values in $\Omega_j$ for $j=1,\ldots,n$.
302:
303: Let $X_{j_{1}},\dots,X_{j_{k}}$ be any $k$-subset of ${\bf V}$. We
304: would like to have a reliable statement on the probability of the
305: event
306: $(x_{j_1},\dots,x_{j_k})\in\Omega_{j_1}\times\cdots\times\Omega_{j_k}$,
307: i.e.\ the probability
308: \begin{equation}
309: P(X_{j_1}=x_{j_1},\ldots,X_{j_k}=x_{j_k})\,.
310: \end{equation}
Determining the sample size required for reliably estimating
the probability of {\it one specific} event
is a standard problem of statistics. However,
314: the problem we encounter in learning Bayesian networks is more
315: sophisticated: we have to be almost sure that the estimated
316: probabilities of all $(2\Delta+1)$-tuples are sufficiently close to
317: the real (unknown) probabilities.
318:
The problem of determining whether and how fast
the relative frequencies of a large
set of events converge {\it uniformly} to their probabilities is
322: well-known in statistical learning theory \cite{Vapnik:98}.
323: Statements on uniform convergence rely on the so-called
324: Vapnik-Chervonenkis dimension (VC-dimension) of the considered set of events.
325:
326: \begin{definition}[VC dimension]${}$\\
327: Let $P$ be an unknown probability measure on a probability space
328: $\Omega$ and $S$ a set of events, i.e., a set of measurable subsets of
329: $\Omega$. Define the VC-dimension of $S:=(M_\lambda)$ as the largest
330: number $h$ such that there exist $h$ points
331: $\omega_1,\omega_2,\dots,\omega_h \in \Omega$ such that the sets
332: $M_\lambda \cap \{\omega_1,\dots,\omega_h\}$ run over all $2^h$ subsets of
333: $\{\omega_1,\dots,\omega_h\}$.
334: Intuitively, one can consider the sets $M_\lambda$ as classifiers
335: and the VC-dimension as the largest number of points that
336: can be classified in all $2^h$ possible ways.
337: The VC-dimension is said to be
338: infinite if such an $h$-subset can be found for all $h\in\N$.
339: \end{definition}
340:
341: A trivial upper bound on the VC-dimension is given by the logarithm to
342: base $2$ of the number of events (in the case that $S$ is finite).
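
The reason is a simple counting argument, which we sketch here: if $h$
points are classified in all $2^h$ possible ways, then $S$ must
contain at least $2^h$ different sets, hence $2^h\le |S|$ and
\[
h\le \log_2 |S|\,.
\]
As a toy example (not taken from the setting above), let
$\Omega=\{a,b,c\}$ and $S=\{\emptyset,\{a\},\{b\},\{a,b\}\}$. The
intersections with $\{a,b\}$ run over all $4$ subsets of $\{a,b\}$, so
the VC-dimension is at least $2$, and the bound $\log_2 4=2$ shows
that it is exactly $2$.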
343:
344: Finite VC-dimension is known to be sufficient and necessary in order to
345: have uniform convergence of relative frequencies to their probabilities.
346: Quantitatively, one has the following theorem:
347:
348: \begin{theorem}[Uniform convergence]${}$\\\label{Uniform}
Let $f(M)$ be the relative frequency of the event
$M$ in $l$ independent runs. Let $S$ have VC-dimension $h$. For an
arbitrary $\epsilon>0$, let $R_{\epsilon}$
be the risk (probability) that $S$ contains at least one set $M$ such
that $|f(M)-P(M)|\geq \epsilon$. Then we have
354: \begin{equation}
355: R_{\epsilon} <
356: 4 \exp\left\{\left(
357: \frac{h(1+\ln(2l/h))}{l}-(\epsilon-1/l)^2 \right)l\right\}\,.
358: \end{equation}
359: \end{theorem}
{\bf Proof:} See Theorem~4.4 in \cite{Vapnik:98}. \qed
361:
This theorem allows us to derive an upper bound on the sample size
required to estimate the probabilities of all $k$-tuples. First we
have to define the set of events and give an upper bound on its
VC-dimension.
366:
367: Let $\Omega:=\Omega_1\times\cdots\times\Omega_n$
368: be the probability space.
369: This means that the $j$th
370: random variable takes on values from $\Omega_j$ for $j=1,\ldots,n$.
371: The $k$-tuples are characterized by the positions and values the
corresponding random variables take on. Let ${\bf
j}:=\{j_1,j_2,\dots,j_k\}$ be an arbitrary $k$-subset of
$\{1,\dots,n\}$ and ${\bf x}\in \Omega_{j_1}\times\cdots\times\Omega_{j_k}$.
375:
376: We then denote by $M^{\bf j}_{\bf x}$ the event that the random
377: variables $X_{j_1},X_{j_2},\dots, X_{j_k}$ take on the values
378: $x_{j_1},\dots,x_{j_k}$. This event corresponds uniquely to a cylinder
379: set $C^{\bf j}_{\bf x}\subset \Omega$.
380:
381: An upper bound on the VC-dimension of the set of those events that
382: correspond to cylinder sets $C^{\bf j}_{\bf x}$ is easy to
383: get. Let $d$ be the maximal
384: cardinality of the sets $\Omega_j$. Then, for fixed $k$, there exist
385: at most
386: \[
387: d^k {n \choose k}
388: \]
such cylinder sets. The first factor is an upper bound on the number of
possible combinations of values and the second factor is the number of
different positions. This number is at most $(nd)^k$.
By taking the logarithm to base $2$ we obtain an upper bound on the
VC-dimension:
\begin{equation}\label{eq:upper}
h\le k\log_2 (nd)\,.
\end{equation}
397: Obviously, we can use much better bounds for concrete applications,
398: e.g.\ given by Stirling's approximation (giving a less intuitive
399: expression but providing a tighter bound). However, this crude upper
400: bound is sufficient to study the asymptotic behavior.
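
To give an impression of the orders of magnitude, consider the purely
illustrative values $n=100$, $d=2$ and $k=3$. Then
\[
d^k{n\choose k}=8\cdot 161\,700=1\,293\,600\,,
\qquad
\log_2 1\,293\,600\approx 20.3\,,
\]
whereas the crude bound (\ref{eq:upper}) gives
$h\le 3\log_2 200\approx 22.9$.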
401:
402: Now we will present a lower bound on the VC-dimension in order to get
403: an idea how tight the upper bound in (\ref{eq:upper}) is.
404:
405: We construct $l$ $n$-tuples with $l:=\lfloor \log_2 (n-k+1) \rfloor$
406: as follows. For each set $\Omega_j$ we choose two different values
407: $x_{j;0}$ and $x_{j;1}$ for $j=1,\ldots,n$. This defines a map
408: $\phi$ from the set of binary words of length $n$ into $\Omega$ by
409: setting
410: \begin{equation}
\phi: b_1 b_2 \ldots b_n \mapsto
x_{1;b_1} x_{2;b_2} \ldots x_{n;b_n}\,.
413: \end{equation}
414:
Now we define an $l \times n$ matrix $M$ with entries $0$ and $1$ as
follows: The first $k-1$ columns have only $1$ as entries. The next $2^l$
columns are the $2^l$ distinct binary words of length $l$. The remaining
$(n-k+1 -2^l)$ columns can be chosen arbitrarily.
419:
420: The rows of $M$ correspond
421: to $n$-tuples by the map $\phi$. Let $\Y$ be the set of those
422: $n$-tuples and $S$ be an arbitrary subset of $\Y$. $S$ can uniquely be
423: characterized by a vector $s$ of length $l$ with entries $0$ and $1$
424: where the $j$-th entry of $s$ indicates whether the $j$-th $n$-tuple
is an element of $S$ or not. By construction, one of the columns
$k,\dots,k+2^l-1$ of $M$
coincides with $s$. Assume it to be the $i$-th column. Then
$C^{\bf j}_{\bf x}
\cap \Y $ contains exactly those $n$-tuples that are elements of $S$
provided that $C^{\bf j}_{\bf x}$ is chosen as follows. Let ${\bf j}$ be
$\{1,2,\dots,k-1,i\}$ and choose ${\bf x}$ as the $k$-tuple
$(x_{1;1},x_{2;1},\dots,x_{k-1;1},x_{i;1})$.
This shows that the cylinder sets corresponding to $k$-tuples
are able to classify $\Y$ in all $2^l$ possible ways.
434: Therefore
435: $\lfloor\log_2 (n-k+1)\rfloor$ is a lower
436: bound on the VC-dimension of the cylinder sets. Comparing this bound
437: with the upper bound in (\ref{eq:upper}), we see that it gives the
438: correct asymptotic behavior in the $O$-notation
439: if $k$ and $d$ are considered as constants.
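
As a small illustration of this construction (the values are chosen by
us), let $n=5$ and $k=2$, so that $l=\lfloor\log_2 4\rfloor=2$. The
first column of $M$ contains only ones, the next $2^l=4$ columns are
the binary words of length $2$, and no columns remain:
\[
M=\left(\begin{array}{ccccc}
1&0&0&1&1\\
1&0&1&0&1
\end{array}\right)\,.
\]
The two rows define the set $\Y$ of two $5$-tuples. The subset $S$
containing only the first tuple has $s=(1,0)^T$, which is the fourth
column, so we take ${\bf j}=\{1,4\}$ and ${\bf x}=(x_{1;1},x_{4;1})$.
The first tuple has the entries $x_{1;1}$ and $x_{4;1}$ at positions
$1$ and $4$ and lies in $C^{\bf j}_{\bf x}$, whereas the second tuple
has $x_{4;0}$ at position $4$ and does not. The columns $2$, $3$ and
$5$ yield the subsets $\emptyset$, the second tuple alone, and $\Y$
itself in the same way.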
440:
441: \begin{theorem}
442: For $\epsilon >0$ let $R_\epsilon$ be the risk that there is a
443: cylinder set $C^{\bf j}_{\bf x}$ such that its relative frequency
deviates from its probability by more than $\epsilon$. Then
445: $R_\epsilon$ can be made smaller than any $\delta>0$ while only
446: increasing the sample size linearly with $n$.
447: \end{theorem}
448: {\bf Proof:}
449: We choose $l$ such that
450: \[
451: \frac{l}{1+\ln(2l)}\frac{(\epsilon -1/l)^2}{2} \geq
452: k \log_2 (nd) \,.
453: \]
This can asymptotically be achieved by increasing $l$ with $O(n)$:
for fixed $\epsilon$ the left-hand side grows like $l/\ln l$ in $l$,
whereas the right-hand side grows only logarithmically in $n$.
457:
458: Using our bound
459: \[
460: h\leq k\log_2 (nd)
461: \]
462: we obtain
463: \[
464: h \leq \frac{(\epsilon-1/l)^2}{2}\frac{l}{1+\ln(2l)}
465: \]
466: and get
467: \[
468: \frac{h(1+\ln (2l))}{l} \leq \frac{(\epsilon -1/l)^2}{2}\,.
469: \]
Since $h\geq 1$ we have $\ln(2l/h)\leq \ln(2l)$, so this implies
471: \[
472: \frac{h(1+\ln(2l/h))}{l} -(\epsilon -1/l)^2 \leq
473: -\frac{(\epsilon -1/l)^2}{2}\,.
474: \]
Using the bound of Theorem~\ref{Uniform}, this shows that the risk $R_\epsilon$
can even be made to decrease exponentially in $n$
while increasing the sample size $l$ only linearly in $n$.
478: \qed
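
For orientation, the last step can be made explicit (a sketch; the
numerical values below are merely illustrative). With the above choice
of $l$, Theorem~\ref{Uniform} gives
\[
R_\epsilon < 4\exp\left\{-\frac{(\epsilon-1/l)^2}{2}\,l\right\}\,.
\]
Since $l(\epsilon-1/l)^2=l\epsilon^2-2\epsilon+1/l\geq
l\epsilon^2-2\epsilon$, the right-hand side is at most $\delta$
whenever
\[
l\geq \frac{2\left(\ln(4/\delta)+\epsilon\right)}{\epsilon^2}\,.
\]
For $\epsilon=0.1$ and $\delta=0.01$ this requires $l\geq 1219$, in
addition to the condition on $l$ stated at the beginning of the proof.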
479:
480: Note that the sample size has to be chosen such that the deviation of
481: the relative frequencies from their probabilities is small compared to
482: the relative frequencies. Then we have a reasonable criterion to
483: decide for which sets ${\bf X},{\bf Y},{\bf Z}$ of variables we may
484: assume ${\bf X}$ and ${\bf Y}$ to be independent given ${\bf Z}$. This
485: criterion is as follows: Based on the error bound of
486: Theorem~\ref{Uniform} we compute the relative uncertainty of the
487: conditional probabilities used in the algorithm in the proof of
488: Theorem~\ref{Algo}. If the observed statistical dependencies are
489: greater than the uncertainty we assume the variables to be dependent.
490:
491:
492: \section{Conclusions}
The sample size required to learn the joint probability distribution on $n$
nodes increases at most linearly with $n$ if the underlying causal
structure is assumed to be sufficiently simple. Here we considered
the case that the (unknown) causal graph is known to have
in-degree at most $\Delta$ and that a known time order exists. Then a graph
relative to which the unknown probability measure is Markov can be found
efficiently if only the probabilities of all $(2\Delta+1)$-tuples are
known. They can be learned with linear sample size. We have shown
this by finding bounds on the VC-dimension of the corresponding
cylinder sets. We would like to note that the causal structure can at
least be {\it guessed} if only the probabilities of $(2\Delta
+1)$-tuples are known, since they allow us to test a large number of
statistical independencies.
506:
507: \begin{thebibliography}{1}
508:
509: \bibitem{CBL:97}
510: J.~Cheng, D.~Bell, and W.~Liu.
511: \newblock Learning belief networks from data: an information theory based
512: approach.
513: \newblock In {\em Proc. of the Sixth ACM International Conference on
514: Information and Knowledge Management}, 1997.
515:
516: \bibitem{FNP:99}
N.~Friedman, I.~Nachman, and D.~Pe'er.
\newblock Learning {B}ayesian network structure from massive datasets: The
519: ``sparse candidate'' algorithm.
520: \newblock In {\em Proc. Fifteenth Conf. on Uncertainty in Artificial
521: Intelligence (UAI)}, 1999.
522:
523: \bibitem{Glymour}
C.~Glymour and G.~F. Cooper, editors.
\newblock {\em Computation, Causation \& Discovery}.
\newblock AAAI Press/The MIT Press, 1999.
527:
528: \bibitem{Pearl:85}
529: J.~Pearl.
530: \newblock Bayesian networks: A model of self-activated memory for evidential
531: reasoning.
532: \newblock In {\em Proceedings, Cognitive Science Society}, pages 329--334,
Greenwich, CT: Ablex, 1985.
534:
535: \bibitem{Pearl:88}
536: J.~Pearl.
\newblock {\em Probabilistic Reasoning in Intelligent Systems}.
\newblock Morgan Kaufmann, San Mateo, CA, 1988.
539:
540: \bibitem{Pearl:00}
541: J.~Pearl.
542: \newblock {\em Causality: models, reasoning, and inference}.
543: \newblock Cambridge University Press, 2000.
544:
545: \bibitem{Spirtes}
546: P.~Spirtes, C.~Glymour, and R.~Scheines.
547: \newblock {\em Causation, Prediction, and Search}, volume~81 of {\em Lecture
548: Notes in Statistics}.
549: \newblock Springer, 1993.
550:
551: \bibitem{Vapnik:98}
552: V.~N. Vapnik.
\newblock {\em Statistical Learning Theory}.
554: \newblock Adaptive and learning systems for signal processing, communications,
555: and control. Wiley Interscience, 1998.
556:
557: \end{thebibliography}
558: \end{document}
559: