1:
2: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3: % Distribution of Mutual Information %%
4: %% Marcus Hutter: Start: 07.05.01 LastEdit: 15.12.01 %%
5: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
6:
7: %-------------------------------%
8: % Document-Style %
9: %-------------------------------%
10: \documentclass[12pt]{article}
11: \parskip=1.5ex plus 1ex minus 1ex \parindent=0ex
12: \topmargin=0cm \oddsidemargin=0cm \evensidemargin=0cm
13: \textwidth=16cm \textheight=22.2cm \unitlength=1mm %\sloppy
14:
15: %-------------------------------%
16: % My Math-Spacings %
17: %-------------------------------%
18: \def\,{\mskip 3mu} \def\>{\mskip 4mu plus 2mu minus 4mu} \def\;{\mskip 5mu plus 5mu} \def\!{\mskip-3mu}
19: \def\dispmuskip{\thinmuskip= 3mu plus 0mu minus 2mu \medmuskip= 4mu plus 2mu minus 2mu \thickmuskip=5mu plus 5mu minus 2mu}
20: \def\textmuskip{\thinmuskip= 0mu \medmuskip= 1mu plus 1mu minus 1mu \thickmuskip=2mu plus 3mu minus 1mu}
21: \textmuskip
22: \def\beq{\dispmuskip\begin{equation}} \def\eeq{\end{equation}\textmuskip}
23: \def\beqn{\dispmuskip\begin{displaymath}}\def\eeqn{\end{displaymath}\textmuskip}
24: \def\bqa{\dispmuskip\begin{eqnarray}} \def\eqa{\end{eqnarray}\textmuskip}
25: \def\bqan{\dispmuskip\begin{eqnarray*}} \def\eqan{\end{eqnarray*}\textmuskip}
26:
27: %-------------------------------%
28: % Macro-Definitions %
29: %-------------------------------%
30: \newenvironment{keywords}{\centerline{\small\bf
31: Keywords}\vspace{0.5ex}\begin{quote}\small}{\par\end{quote}\vskip
32: 1ex}
33: \def\nq{\hspace{-1em}}
34: \def\odt{{\textstyle{1\over 2}}}
35: \def\eps{\varepsilon}
36: \def\vec#1{{\bf #1}}
37: \def\p{{\scriptscriptstyle+}}
38: \def\pp{{\scriptscriptstyle++}}
39: \def\n{n}
40: \def\npp{\n}
41: \def\t{\pi}
42:
43: \begin{document}
44:
45: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
46: % T i t l e - P a g e %
47: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
48:
49: \begin{titlepage}
50:
51: \begin{center}
52: {\small Technical Report IDSIA-13-01 \hfill 15 December 2001}\\[5mm]
53: {\Large\sc\hrule height1pt \vskip 2mm
54: Distribution of Mutual Information
55: \vskip 5mm \hrule height1pt} \vspace{10mm}
56: {\bf Marcus Hutter} \\[10mm]
57: {\rm IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland} \\
58: {\rm\footnotesize marcus@idsia.ch \qquad
59: http://www.idsia.ch/$^{_{_\sim}}\!$marcus}
60: \\[15mm]
61: \end{center}
62:
63: \begin{keywords}
64: Mutual Information, Cross Entropy, Dirichlet distribution, Second
65: order distribution, expectation and variance of mutual
66: information.
67: \end{keywords}
68:
69: \begin{abstract}
70: The mutual information of two random variables $\imath$ and
71: $\jmath$ with joint probabilities $\{\t_{ij}\}$ is commonly used in
72: learning Bayesian nets as well as in many other fields. The
73: chances $\t_{ij}$ are usually estimated by the empirical sampling
74: frequency $\n_{ij}/\n$ leading to a point estimate $I(\n_{ij}/\n)$
75: for the mutual information. To answer questions like ``is
76: $I(\n_{ij}/\n)$ consistent with zero?'' or ``what is the
77: probability that the true mutual information is much larger than
78: the point estimate?'' one has to go beyond the point estimate.
79: %
80: In the Bayesian framework one can answer
81: these questions by utilizing a (second order) prior distribution
82: $p(\t)$ comprising prior information about $\t$. From the prior
83: $p(\t)$ one can compute the posterior $p(\t|\vec\n)$, from which
84: the distribution $p(I|\vec\n)$ of the mutual information can be
85: calculated.
86: %
87: We derive reliable and quickly computable approximations for
88: $p(I|\vec\n)$. We concentrate on the mean, variance, skewness, and
89: kurtosis, and non-informative priors. For the mean we also give an
90: exact expression. Numerical issues and the range of validity are
91: discussed.
92: \end{abstract}
93:
94: \end{titlepage}
95:
96: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
97: \section{Introduction}\label{secInt}
98: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
99: The mutual information $I$ (also called cross entropy) is a
100: widely used information theoretic measure for the stochastic
101: dependency of random variables \cite{Cover:91,Soofi:00}. It is
102: used, for instance, in learning Bayesian nets
103: \cite{Buntine:96,Heckerman:98}, where stochastically dependent
nodes should be connected. The mutual information defined in
105: (\ref{mi}) can be computed if the joint probabilities
106: $\{\t_{ij}\}$ of the two random variables $\imath$ and
107: $\jmath$ are known. The standard procedure in the common case
108: of unknown chances $\t_{ij}$ is to use the sample frequency
109: estimates ${\n_{ij}\over\n}$ instead, as if they were
110: precisely known probabilities; but this is not always
111: appropriate. Furthermore, the point estimate
112: $I({\n_{ij}\over\n})$ gives no clue about the reliability of
113: the value if the sample size $n$ is finite. For instance, for
114: independent $\imath$ and $\jmath$, $I(\t)=0$ but
115: $I({\n_{ij}\over\n})=O(n^{-1/2})$ due to noise in the data.
116: The criterion for judging dependency is how many standard
117: deviations $I({\n_{ij}\over\n})$ is away from zero. In
118: \cite{Kleiter:96,Kleiter:99} the probability that the true
119: $I(\vec\t)$ is greater than a given threshold has been used to
120: construct Bayesian nets. In the Bayesian framework one can
121: answer these questions by utilizing a (second order) prior
122: distribution
$p(\t)$,
%comprising prior information about $\t$.
which takes account of any imprecision about $\t$.
126: From the prior
127: $p(\t)$ one can compute the posterior $p(\t|\vec n)$, from which
128: the distribution $p(I|\vec\n)$ of the mutual information can be
129: obtained.
130:
The objective of this work is to derive reliable and quickly
computable analytical expressions for $p(I|\vec\n)$. Section
\ref{secMI} introduces the mutual information distribution, and
Section \ref{secResults} discusses some results in advance before
delving into the derivation. Since the central limit theorem
ensures that $p(I|\vec\n)$ converges to a Gaussian distribution, a
good starting point is to compute the mean and variance of
$p(I|\vec\n)$. In Section \ref{secApprox} we relate the mean and
variance to the covariance structure of $p(\t|\vec n)$. Most
non-informative priors lead to a Dirichlet posterior. An exact
expression for the mean (Section \ref{secExact}) and approximate
expressions for the variance (Section \ref{secDD}) are given for
the Dirichlet distribution. More accurate estimates of the
variance and higher central moments are derived in Section
\ref{secGeneral}, which lead to good approximations of
$p(I|\vec\n)$ even for small sample sizes. We show that the
expressions obtained in \cite{Kleiter:96,Kleiter:99} by heuristic
numerical methods are incorrect. Numerical issues and the range of
validity are briefly discussed in Section \ref{secNum}.
150:
151: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
152: \section{Mutual Information Distribution}\label{secMI}
153: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
154:
155: We consider discrete random variables $\imath\in\{1,...,r\}$ and $\jmath\in
156: \{1,...,s\}$ and an i.i.d.\ random process with samples
157: $(i,j)\in\{1,...,r\}\times\{1,...,s\}$ drawn with joint probability
158: $\t_{ij}$. An important measure of the stochastic
159: dependence of $\imath$ and $\jmath$ is the mutual
160: information
161: \beq\label{mi}
162: I({\vec \t}) \;=\; \sum_{i=1}^r\sum_{j=1}^s
163: \t_{ij}\log{\t_{ij}\over\t_{i\p}\t_{\p j}} \;=\;
164: \sum_{ij}\t_{ij}\log\t_{ij} -
165: \sum_{i}\t_{i\p}\log\t_{i\p} -
166: \sum_{j}\t_{\p j}\log\t_{\p j}.
167: \eeq
168: $\log$ denotes the natural logarithm and
169: $\t_{i\p}=\sum_j\t_{ij}$ and
170: $\t_{\p j}=\sum_i\t_{ij}$ are marginal probabilities.
171: Often one does not know the probabilities $\t_{ij}$ exactly,
172: but one has a sample set with $\n_{ij}$ outcomes of pair $(i,j)$.
173: The frequency $\hat\t_{ij}:={\n_{ij}\over\npp}$ may
174: be used as a first estimate of the unknown probabilities.
175: $\npp:=\sum_{ij}\n_{ij}$ is the total sample size.
176: This leads to a point (frequency) estimate $I(\hat\vec\t) =
177: \sum_{ij}{\n_{ij}\over\npp}
178: \log{\n_{ij}\npp\over\n_{i\p}\n_{\p j}}$
179: for the mutual information (per sample).
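
As a purely illustrative example (the counts are hypothetical and
not taken from any data set), consider $r=s=2$ with
$\n_{11}=\n_{22}=40$ and $\n_{12}=\n_{21}=10$, so that $\npp=100$
and all marginal counts are $\n_{i\p}=\n_{\p j}=50$. The point
estimate then evaluates to
\beqn
  I(\hat\vec\t) \;=\;
  2\cdot{40\over 100}\log{40\cdot 100\over 50\cdot 50} +
  2\cdot{10\over 100}\log{10\cdot 100\over 50\cdot 50}
  \;=\; 0.8\log 1.6 + 0.2\log 0.4 \;\approx\; 0.1927 .
\eeqn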
180:
Unfortunately the point estimate $I(\hat\vec\t)$ gives no
182: information about its accuracy. In the Bayesian approach to this
183: problem one assumes a prior (second order) probability density
184: $p(\vec\t)$ for the unknown probabilities $\t_{ij}$ on the
185: probability simplex. From this one can compute the posterior
186: distribution $p(\vec\t|\vec\n) \propto
187: p(\t)\prod_{ij}\t_{ij}^{\n_{ij}}$ (the $n_{ij}$ are multinomially
distributed). This allows one to compute the
189: posterior probability density of the mutual information.$\!$%
190: \footnote{$I(\vec\t)$ denotes the mutual information for the
191: specific chances $\vec\t$, whereas $I$ in the context above is
192: just some non-negative real number. $I$ will also denote the
193: mutual information {\it random variable} in the
expectation $E[I]$ and variance $\mbox{Var}[I]$. Expectations are
{\it always} w.r.t.\ the posterior distribution
$p(\vec\t|\vec\n)$. }
197: \beq\label{midistr}
198: p(I|\vec\n) = \int
199: \delta(I(\vec\t)-I)p(\vec\t|\vec\n)d^{rs}\vec\t
200: \eeq
201: \footnote{Since $0\leq I(\t)\leq I_{max}$ with sharp upper
202: bound $I_{max}:= \min\{\log r,\log s\}$, the integral may be
203: restricted to $\int_0^{I_{max}}$, which shows that the domain
204: of $p(I|\vec n)$ is $[0,I_{max}].$}%
205: The $\delta()$ distribution restricts the integral to $\t$ for
206: which $I(\t)=I$. For large sample size $\npp\to\infty$,
207: $p(\vec\t|\vec\n)$ is strongly peaked around $\vec\t=\hat\vec\t$
208: and $p(I|\vec\n)$ gets strongly peaked around the frequency
209: estimate $I=I(\hat\vec\t)$. The mean $E[I] = \int_0^\infty I
210: p(I|\vec\n)\,dI = \int I(\vec\t)p(\vec\t|\vec\n)d^{rs}\vec\t$ and
211: the variance $\mbox{Var}[I]=E[(I-E[I])^2]=E[I^2]-E[I]^2$ are of
212: central interest.
213:
214: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
215: \section{Results for $I$ under the Dirichlet P{\rm(}oste{\rm)}rior}\label{secResults}
216: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
217: Most\footnote{But not all priors which one can argue to be
218: non-informative lead to Dirichlet posteriors. Brand \cite{Brand:99}
219: (and others), for instance, advocate the entropic prior
220: $p(\vec\t)\propto e^{-H(\vec\t)}$.}
221: non-informative priors for $p(\t)$ lead to a Dirichlet
222: posterior distribution $p(\vec\t|\vec\n) \propto
223: \prod_{ij}\t_{ij}^{\n_{ij}-1}$ with interpretation
224: $\n_{ij}=\n'_{ij}+\n''_{ij}$, where
225: $\n'_{ij}$ are the number of samples $(i,j)$, and
226: $\n''_{ij}$ comprises prior information
227: ($1$ for the uniform prior, $\odt$ for Jeffreys' prior, $0$ for
228: Haldane's prior, ${1\over rs}$ for Perks' prior \cite{Gelman:95}).
In principle this allows one to compute the
posterior density $p(I|\vec\n)$ of the mutual information. In
Sections \ref{secApprox} and \ref{secDD} we expand the mean and
variance in terms of $\npp^{-1}$:
233: \bqa\label{mvappr}
234: E[I] &=&
235: \sum_{ij}{\n_{ij}\over\npp}
236: \log{\n_{ij}\npp\over\n_{i\p}\n_{\p j}} \;+\;
237: {(r-1)(s-1)\over 2\npp} \;+\; O(\npp^{-2}),
238: \\\nonumber
239: \mbox{Var}[I] &=&
240: {1\over\npp}
241: \sum_{ij}{\n_{ij}\over\npp}\bigg(\log{\n_{ij}\npp\over
242: \n_{i\p}\n_{\p j}}\bigg)^2 -
243: {1\over\npp}\bigg(\sum_{ij}{\n_{ij}\over\npp}\log{\n_{ij}\npp\over
244: \n_{i\p}\n_{\p j}}\bigg)^2 \;+\; O(\npp^{-2}).
245: \eqa
The first term for the mean is just the point estimate
$I(\hat\t)$. The second term is a small correction if $\npp\gg
r \cdot s$. Kleiter \cite{Kleiter:96,Kleiter:99} determined the
correction by Monte Carlo studies as $\min\{{r-1\over
2\npp},{s-1\over 2\npp}\}$. This is wrong unless $s$ or $r$ is 2.
The expression $2E[I]/\npp$ they determined for the variance has a
completely different structure from ours. Note that the mean is
lower bounded by ${const.\over\npp}+O(\npp^{-2})$, which is
strictly positive for large but finite sample sizes, even if
$\imath$ and $\jmath$ are statistically independent and
independence is perfectly represented in the data ($I(\hat\t)=0$).
On the other hand, in this case, the standard deviation
$\sigma=\sqrt{\mbox{Var}[I]}\sim {1\over\npp}\sim E[I]$ correctly
indicates that the mean is still consistent with zero.
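
For a rough numerical feel, assume the hypothetical $2\times 2$
table $\n_{11}=\n_{22}=40$, $\n_{12}=\n_{21}=10$ (so $\npp=100$ and
$\n_{i\p}=\n_{\p j}=50$), with the $\n_{ij}$ read as the Dirichlet
parameters. The logarithms in (\ref{mvappr}) are then $\log 1.6$ on
the diagonal and $\log 0.4$ off the diagonal, and dropping the
$O(\npp^{-2})$ terms gives
\bqan
  E[I] &\approx& 0.8\log 1.6 + 0.2\log 0.4 + {1\over 200}
  \;\approx\; 0.1927 + 0.0050 \;=\; 0.1977,
  \\
  \mbox{Var}[I] &\approx& {1\over 100}\left[0.8(\log 1.6)^2 +
  0.2(\log 0.4)^2 - 0.1927^2\right]
  \;\approx\; {0.3446-0.0372\over 100} \;\approx\; 0.0031,
\eqan
so that $\sigma\approx 0.055$ and the mean lies roughly three and a
half standard deviations above zero.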
260:
261: Our approximations (\ref{mvappr}) for the mean and variance are
262: good if ${r \cdot s\over\npp}$ is small. The central limit
263: theorem ensures that $p(I|\vec\n)$ converges to a Gaussian
264: distribution with mean $E[I]$ and variance $\mbox{Var}[I]$. Since
265: $I$ is non-negative it is more appropriate to approximate
$p(I|\vec\n)$ as a Gamma ($=$ scaled $\chi^2$) or log-normal
267: distribution with mean $E[I]$ and variance $\mbox{Var}[I]$, which
268: is of course also asymptotically correct.
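
A minimal sketch of such a moment match, using standard
parametrizations that are an assumption of this illustration rather
than part of the derivation: write $\mu:=E[I]$ and
$\sigma^2:=\mbox{Var}[I]$. A Gamma density with shape $\alpha$ and
scale $\theta$ has mean $\alpha\theta$ and variance
$\alpha\theta^2$; a log-normal density with parameters $m$ and $v$
has mean $e^{m+v/2}$ and variance $(e^v-1)e^{2m+v}$. Matching mean
and variance gives
\beqn
  \alpha = {\mu^2\over\sigma^2},\quad
  \theta = {\sigma^2\over\mu},\qquad
  v = \log\left(1+{\sigma^2\over\mu^2}\right),\quad
  m = \log\mu - {v\over 2}.
\eeqn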
269:
270: A systematic expansion in $\npp^{-1}$ of the mean, variance, and
271: higher moments is possible but gets arbitrarily cumbersome.
272: The $O(\npp^{-2})$ terms for the variance and leading order
273: terms for the skewness and kurtosis
274: are given in Section \ref{secGeneral}.
275: For the mean it is possible to give an exact expression
276: \beq\label{miexex2}
277: E[I] = {1\over\npp}\sum_{ij}\n_{ij}
278: [\psi(\n_{ij}+1)-\psi(\n_{i\p}+1)-\psi(\n_{\p
279: j}+1)+\psi(\npp+1)]
280: \eeq
281: with $\psi(n+1)=-\gamma+\sum_{k=1}^n{1\over k}=\log
282: n+O({1\over n})$ for integer $n$. See Section \ref{secExact} for
283: details and more general expressions for $\psi$ for non-integer
284: arguments.
285:
There may be other prior information available which cannot be
expressed by a Dirichlet distribution. In this general case, the
mean and variance of $I$ can still be related to the covariance
structure of $p(\t|\vec\n)$, which will be done in the following
section.
291:
292: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
293: \section{Approximation of Expectation and Variance of $I$}\label{secApprox}
294: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
295: In the following let $\hat\t_{ij}:=E[\t_{ij}]$.
296: Since $p(\vec\t|\vec\n)$ is strongly peaked
297: around $\vec\t=\hat\vec\t$ for large $\npp$ we may
298: expand $I(\t)$ around $\hat\vec\t$ in the integrals for the mean and the variance.
299: With
300: $\Delta_{ij}:=\t_{ij}-\hat\t_{ij}$ and using $\sum_{ij}\t_{ij}= 1
301: =\sum_{ij}\hat\t_{ij}$ we get for the expansion of (\ref{mi})
302: \beq\label{miexp}
303: I(\t) \;=\; I(\hat\t) +
304: \sum_{ij}\log\left({\hat\t_{ij}\over\hat\t_{i\p}\hat\t_{\p j}}\right)\Delta_{ij}
305: + \sum_{ij}{\Delta_{ij}^2\over 2\hat\t_{ij}} -
306: \sum_i{\Delta_{i\p}^2\over 2\hat\t_{i\p}} -
307: \sum_j{\Delta_{\p j}^2\over 2\hat\t_{\p j}} +
308: O(\Delta^3).
309: \eeq
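The form of (\ref{miexp}) can be checked term by term from the last
expression in (\ref{mi}); the following is only a sketch of that
check. For $f(x)=x\log x$ one has $f'(x)=\log x+1$ and
$f''(x)=1/x$, hence
\beqn
  f(\hat x+\Delta) \;=\; f(\hat x) + (\log\hat x+1)\Delta +
  {\Delta^2\over 2\hat x} + O(\Delta^3).
\eeqn
Applying this to $x=\t_{ij}$, $\t_{i\p}$ and $\t_{\p j}$, with
$\Delta_{i\p}:=\sum_j\Delta_{ij}$ and
$\Delta_{\p j}:=\sum_i\Delta_{ij}$, the constant parts of the linear
terms vanish because
$\sum_{ij}\Delta_{ij}=\sum_i\Delta_{i\p}=\sum_j\Delta_{\p j}=0$, and
the remaining linear terms combine as
\beqn
  \sum_{ij}\log\hat\t_{ij}\,\Delta_{ij}
  - \sum_i\log\hat\t_{i\p}\,\Delta_{i\p}
  - \sum_j\log\hat\t_{\p j}\,\Delta_{\p j}
  \;=\; \sum_{ij}\log\left({\hat\t_{ij}\over
  \hat\t_{i\p}\hat\t_{\p j}}\right)\Delta_{ij}.
\eeqn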
310: Taking the expectation, the linear term $E[\Delta_{ij}]=0$ drops
311: out. The quadratic terms $E[\Delta_{ij}\Delta_{kl}] =
312: \mbox{Cov}(\t_{ij},\t_{kl})$ are the covariance of $\t$ under
313: distribution $p(\vec\t|\vec\n)$ and are proportional to
314: $\npp^{-1}$. It can be shown that $E[\Delta^3]\sim\npp^{-2}$ (see
315: Section \ref{secGeneral}).
316: \beq\label{exnlo}
317: E[I] \;=\; I(\hat\t) + {1\over 2}
318: \sum_{ijkl}\left({\delta_{ik}\delta_{jl}\over\hat\t_{ij}} -
319: {\delta_{ik}\over\hat\t_{i\p}} -
320: {\delta_{jl}\over\hat\t_{\p j}}\right)\mbox{Cov}(\t_{ij},\t_{kl}) +
321: O(\npp^{-2}).
322: \eeq
323: The Kronecker delta $\delta_{ij}$ is $1$ for $i=j$ and $0$ otherwise.
324: The variance of $I$ in leading order in $\npp^{-1}$ is
325: \bqa\nonumber
326: \mbox{Var}\,I(\t) &=&
327: E[(I-E[I])^2] \;\stackrel+=\;
328: E\left[\left(\sum_{ij}\log\left({\hat\t_{ij}\over
329: \hat\t_{i\p}\hat\t_{\p j}}\right)\Delta_{ij}\right)^2\right]
330: \;=\; \\\label{varlo}
331: &=&
332: \sum_{ijkl}\log{\hat\t_{ij}\over\hat\t_{i\p}\hat\t_{\p j}}
333: \log{\hat\t_{kl}\over\hat\t_{k\p}\hat\t_{\p l}}
334: \mbox{Cov}(\t_{ij},\t_{kl}),
335: \eqa
336: where $\stackrel+=$ means $=$ up to terms of order
337: $\npp^{-2}$. So the leading order variance and the leading and
338: next to leading order mean of the mutual information $I(\t)$ can be
339: expressed in terms of the covariance of $\t$ under the posterior distribution
340: $p(\t|\vec\n)$.
341:
342: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
343: \section{The Second Order Dirichlet Distribution}\label{secDD}
344: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Non-informative priors for $p(\t)$ are commonly used if no
346: additional prior information is available. Many non-informative
347: choices (uniform, Jeffreys', Haldane's, Perks', ... prior) lead to
348: a Dirichlet posterior distribution:
349: \bqa\nonumber
350: p(\t|\vec\n) &=&
351: {1\over
352: N(\vec\n)}\prod_{ij}\t_{ij}^{\n_{ij}-1}\delta(\t_\pp-1)
353: \quad\mbox{with normalization}
354: \\\label{norm}
355: N(\vec\n) &=&
356: \int\prod_{ij}\t_{ij}^{\n_{ij}-1}\delta(\t_\pp-1)
357: d^{rs}\t \;=\;
358: {\prod_{ij}\Gamma(\n_{ij})\over\Gamma(\npp)},
359: \eqa
360: where $\Gamma$ is the Gamma function, and
361: $\n_{ij}=\n'_{ij}+\n''_{ij}$, where $\n'_{ij}$ are
362: the number of samples $(i,j)$, and $\n''_{ij}$ comprises prior
363: information
364: ($1$ for the uniform prior, $\odt$ for Jeffreys' prior,
365: $0$ for Haldane's prior, ${1\over rs}$ for Perks' prior).
366: Mean and covariance of $p(\t|\vec\n)$ are
367: \beq\label{ecov}
368: \hat\t_{ij} := E[\t_{ij}]=
369: {\n_{ij}\over\npp}, \quad
370: \mbox{Cov}(\t_{ij},\t_{kl}) =
371: {1\over\npp+1}(\hat\t_{ij}\delta_{ik}\delta_{jl}-
372: \hat\t_{ij}\hat\t_{kl})
373: \eeq
374: Inserting this into (\ref{exnlo}) and (\ref{varlo}) we get after some
375: algebra for the mean and variance of the mutual information
376: $I(\t)$ up to terms of order $\npp^{-2}$:
377: \bqa\label{exnlodi}
378: E[I] &=& J \;+\; {(r-1)(s-1)\over 2(\npp+1)}
379: \;+\; O(\npp^{-2}),
380: \\\label{varlodi}
381: \mbox{Var}[I] &=&
382: {1\over\npp+1}(K-J^2) \;+\;
383: O(\npp^{-2}), \quad
384: \\\label{Jdef}
385: J &:=& \sum_{ij}{\n_{ij}\over\npp}\log{\n_{ij}\npp\over
386: \n_{i\p}\n_{\p j}} \;=\; I(\hat\t), \quad
387: \\\label{Kdef}
388: K &:=& \sum_{ij}{\n_{ij}\over\npp}\left(\log{\n_{ij}\npp\over
389: \n_{i\p}\n_{\p j}}\right)^2.
390: \eqa
391: $J$ and $K$ (and $L$, $M$, $P$, $Q$ defined later) depend on
392: $\hat\t_{ij} = {\n_{ij}\over\npp}$ only, i.e.\ are $O(1)$ in
$\vec\n$. Strictly speaking we should expand
${1\over\npp+1}={1\over\npp}+O(\npp^{-2})$, i.e.\ drop the $+1$,
but the exact expression (\ref{ecov}) for the covariance suggests
keeping the $+1$. We compared both versions with the exact values
(from Monte-Carlo simulations) for various parameters $\vec\t$. In
most cases the expansion in ${1\over\npp+1}$ was more accurate, so
we suggest using this variant.
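
The algebra leading to (\ref{exnlodi}) and (\ref{varlodi}) can be
sketched as follows; the abbreviations $C_{ij,kl}$ and $\ell_{ij}$
are introduced here only for illustration. Let
$C_{ij,kl}:=(\npp+1)\mbox{Cov}(\t_{ij},\t_{kl}) =
\hat\t_{ij}\delta_{ik}\delta_{jl}-\hat\t_{ij}\hat\t_{kl}$. The three
sums in (\ref{exnlo}) then evaluate to
\bqan
  \sum_{ijkl}{\delta_{ik}\delta_{jl}\over\hat\t_{ij}}C_{ij,kl}
  &=& \sum_{ij}(1-\hat\t_{ij}) \;=\; rs-1,
  \\
  \sum_{ijkl}{\delta_{ik}\over\hat\t_{i\p}}C_{ij,kl}
  &=& \sum_i{\hat\t_{i\p}-\hat\t_{i\p}^2\over\hat\t_{i\p}}
  \;=\; r-1,
  \\
  \sum_{ijkl}{\delta_{jl}\over\hat\t_{\p j}}C_{ij,kl}
  &=& \sum_j{\hat\t_{\p j}-\hat\t_{\p j}^2\over\hat\t_{\p j}}
  \;=\; s-1,
\eqan
so the bracket in (\ref{exnlo}) contributes
$(rs-1)-(r-1)-(s-1)=(r-1)(s-1)$, which gives the correction
${(r-1)(s-1)\over 2(\npp+1)}$. For the variance, with
$\ell_{ij}:=\log{\hat\t_{ij}\over\hat\t_{i\p}\hat\t_{\p j}}$,
\beqn
  \sum_{ijkl}\ell_{ij}\ell_{kl}C_{ij,kl} \;=\;
  \sum_{ij}\hat\t_{ij}\ell_{ij}^2 -
  \bigg(\sum_{ij}\hat\t_{ij}\ell_{ij}\bigg)^2 \;=\; K-J^2 .
\eeqn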
400:
401: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
402: \section{Exact Value for $E[I]$}\label{secExact}
403: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
404: It is possible to get an exact expression for the mean mutual
405: information $E[I]$ under the Dirichlet distribution.
406: By noting that $x\log x = {d\over d\beta}x^\beta|_{\beta=1}$,
407: ($x = \{\t_{ij},\t_{i\p},\t_{\p j}\}$), one
408: can replace the logarithms in the last expression of
409: (\ref{mi}) by powers. From (\ref{norm}) we see that
410: $E[(\t_{ij})^\beta]={\Gamma(\n_{ij}+\beta)\Gamma(\npp)\over
411: \Gamma(\n_{ij})\Gamma(\npp+\beta)}$. Taking the
412: derivative and setting $\beta=1$ we get
\beqn
E[\t_{ij}\log\t_{ij}] = {d\over d\beta}E[(\t_{ij})^\beta]\Big|_{\beta=1}
= {\n_{ij}\over\npp}[\psi(\n_{ij}+1)-\psi(\npp+1)].
\eeqn
417: The $\psi$ function has the following properties (see
418: \cite{Abramowitz:74} for details)
419: \beqn
420: \psi(z)={d\log\Gamma(z)\over dz}={\Gamma'(z)\over\Gamma(z)},\quad
421: \psi(z+1)=\log z + {1\over 2z} - {1\over 12z^2} + O({1\over z^4}),
422: \eeqn
\beq\label{psi2}
\psi(n)=-\gamma+\sum_{k=1}^{n-1}{1\over k},\quad
\psi(n+\odt)=-\gamma-2\log 2+2\sum_{k=1}^n{1\over 2k-1}.
\eeq
427: The value of the Euler constant $\gamma$ is irrelevant here,
428: since it cancels out. Since the marginal distributions of
429: $\t_{i\p}$ and $\t_{\p j}$ are also Dirichlet (with parameters
430: $\n_{i\p}$ and $\n_{\p j}$) we get similarly
\bqan
E[\t_{i\p}\log\t_{i\p}] &=&
{\n_{i\p}\over\npp}[\psi(\n_{i\p}+1)-\psi(\npp+1)],
\\
E[\t_{\p j}\log\t_{\p j}] &=&
{\n_{\p j}\over\npp}[\psi(\n_{\p j}+1)-\psi(\npp+1)].
\eqan
438: Inserting this into (\ref{mi}) and rearranging terms we get the
439: exact expression\footnote{This expression has independently
440: been derived in \cite{Wolpert:93b}.}
441: \beq\label{miexex}
442: E[I] = {1\over\npp}\sum_{ij}\n_{ij}
443: [\psi(\n_{ij}+1)-\psi(\n_{i\p}+1)-\psi(\n_{\p
444: j}+1)+\psi(\npp+1)]
445: \eeq
446: For large sample sizes, $\psi(z+1)\approx\log z$ and (\ref{miexex})
447: approaches the frequency estimate $I(\hat\t)$ as it should be.
448: Inserting the expansion $\psi(z+1)=\log z+{1\over 2z}+...$ into
449: (\ref{miexex}) we also get the correction term ${(r-1)(s-1)\over
450: 2\npp}$ of (\ref{mvappr}).
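To make this last step explicit, the ${1\over 2z}$ terms inserted
into (\ref{miexex}) give
\beqn
  {1\over\npp}\sum_{ij}\n_{ij}\left[{1\over 2\n_{ij}} -
  {1\over 2\n_{i\p}} - {1\over 2\n_{\p j}} + {1\over 2\npp}\right]
  \;=\; {rs-r-s+1\over 2\npp} \;=\; {(r-1)(s-1)\over 2\npp},
\eeqn
using $\sum_{ij}{\n_{ij}\over\n_{i\p}}=r$ and
$\sum_{ij}{\n_{ij}\over\n_{\p j}}=s$. As a small check with
hypothetical integer parameters $\n_{11}=\n_{22}=4$,
$\n_{12}=\n_{21}=1$ (so $\npp=10$ and $\n_{i\p}=\n_{\p j}=5$), write
$\psi(m+1)=-\gamma+H_m$ with $H_m:=\sum_{k=1}^m{1\over k}$. The
constant $\gamma$ cancels and (\ref{miexex}) gives
\beqn
  E[I] \;=\; {1\over 10}\left[8(H_4-2H_5+H_{10}) +
  2(H_1-2H_5+H_{10})\right] \;=\; {577\over 2520}
  \;\approx\; 0.2290,
\eeqn
to be compared with $I(\hat\t)\approx 0.1927$,
$I(\hat\t)+{1\over 2\npp}\approx 0.2427$ and
$I(\hat\t)+{1\over 2(\npp+1)}\approx 0.2382$. For such a small
table ${r \cdot s\over\npp}=0.4$, so the expansions are only rough.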
451:
452: The presented method (with some refinements) may also be used to
453: determine an exact expression for the variance of $I(\t)$. All but
454: one term can be expressed in terms of Gamma functions. The final
455: result after differentiating w.r.t.\ $\beta_1$ and $\beta_2$ can
456: be represented in terms of $\psi$ and its derivative $\psi'$. The
457: mixed term $E[(\t_{i\p})^{\beta_1}(\t_{\p j})^{\beta_2}]$ is more
458: complicated and involves confluent hypergeometric functions, which
459: limits its practical use \cite{Wolpert:93b}.
460:
461: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
462: \section{Generalizations}\label{secGeneral}
463: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
A systematic expansion of all moments of $p(I|\vec\n)$ to arbitrary order in
$\npp^{-1}$ is possible, but soon becomes quite cumbersome.
466: For the mean we already gave an exact expression (\ref{miexex}), so we
467: concentrate here on the variance, skewness and the kurtosis of $p(I|\vec\n)$.
468: The $3^{rd}$ and $4^{th}$
469: central moments of $\t$ under
470: the Dirichlet distribution are
471: \beq\label{mom3}
472: E[\Delta_a\Delta_b\Delta_c] \;=\; {2\over(\npp+1)(\npp+2)}
473: [2\hat\t_a\hat\t_b\hat\t_c
474: - \hat\t_a\hat\t_b\delta_{bc}
475: - \hat\t_b\hat\t_c\delta_{ca}
476: - \hat\t_c\hat\t_a\delta_{ab}
477: + \hat\t_a\delta_{ab}\delta_{bc}]
478: \eeq
479: \bqa
480: E[\Delta_a\Delta_b\Delta_c\Delta_d] &=& {1\over\npp^2}
481: [3\hat\t_a\hat\t_b\hat\t_c\hat\t_d
482: - \hat\t_c\hat\t_d\hat\t_a\delta_{ab}
483: - \hat\t_b\hat\t_d\hat\t_a\delta_{ac}
484: - \hat\t_b\hat\t_c\hat\t_a\delta_{ad} \nq\\[-2ex]\nonumber
485: && \qquad\qquad\qquad\; - \hat\t_a\hat\t_d\hat\t_b\delta_{bc}
486: - \hat\t_a\hat\t_c\hat\t_b\delta_{bd}
487: - \hat\t_a\hat\t_b\hat\t_c\delta_{cd} \nq\\\nonumber
488: && \qquad\qquad\qquad\;
489: + \hat\t_a\hat\t_c\delta_{ab}\delta_{cd}
490: + \hat\t_a\hat\t_b\delta_{ac}\delta_{bd}
491: + \hat\t_a\hat\t_b\delta_{ad}\delta_{bc}]
492: +O(\npp^{-3})\nq
493: \eqa
with $a = ij$, $b = kl,...\in\{1,...,r\}\times\{1,...,s\}$
being double indices,
$\delta_{ab} = \delta_{ik}\delta_{jl},...$, and
$\hat\t_{ij}={\n_{ij}\over\npp}$.
498: Expanding $\Delta^k = (\t-\hat\t)^k$ in $E[\Delta_a\Delta_b...]$ leads to
499: expressions containing $E[\t_a\t_b...]$, which can be
500: computed by a case analysis of all combinations of equal/unequal
501: indices $a,b,c,...$ using (\ref{norm}).
Many terms cancel, leading to the above expressions.
They allow one to compute the order $\npp^{-2}$ term of
the variance of $I(\t)$. Again, inspection of (\ref{mom3})
suggests expanding in $[(\npp+1)(\npp+2)]^{-1}$, rather than in
$\npp^{-2}$. The variance in leading and next to leading order
is
508: \bqa\label{var2ndo}
509: \mbox{Var}[I] %&=&
510: &=& {K-J^2\over\npp+1} +
511: {M+(r - 1)(s - 1)(\odt - J)-Q
512: \over(\npp+1)(\npp+2)} + O(\npp^{-3})
513: \\\label{Mdef}
514: M &:=& \sum_{ij}
515: \left({1\over\n_{ij}}-{1\over\n_{i\p}}-{1\over\n_{\p
516: j}}+{1\over\npp}\right)
517: \n_{ij}\log{\n_{ij}\npp\over\n_{i\p}\n_{\p j}},
518: \\\label{Qdef}
519: Q &:=& 1-\sum_{ij}{\n_{ij}^2\over\n_{i\p}\n_{\p j}}.
520: \eqa
521: $J$ and $K$ are defined in (\ref{Jdef}) and (\ref{Kdef}).
522: Note that the first term ${K-J^2\over\n+1}$ also contains second
523: order terms when expanded in $\npp^{-1}$. The leading order
524: terms for the $3^{rd}$ and $4^{th}$ central moments of $p(I|\vec\n)$ are
525: \bqan
526: E[(I-E[I])^3] & = &
527: {2\over\npp^2}[2J^3 - 3KJ + L] +
528: {3\over\npp^2}[K + J^2 - P] +
529: O(\npp^{-3}),
530: \\
531: L & := & \sum_{ij}{\n_{ij}\over\npp}\left(\log{\n_{ij}\npp\over
532: \n_{i\p}\n_{\p j}}\right)^3,\quad
533: P \;:=\; \sum_i{\n J_{i\p}^2\over\n_{i\p}} + \sum_j{\n J_{\p j}^2\over\n_{\p j}},
534: \\
535: J_{i\p} & :=& \sum_{j}{\n_{ij}\over\npp}\log{\n_{ij}\npp\over\n_{i\p}\n_{\p
536: j}}\qquad,\quad
537: J_{\p j} \;:=\; \sum_{i}{\n_{ij}\over\npp}\log{\n_{ij}\npp\over\n_{i\p}\n_{\p j}},
538: \\
539: E[(I-E[I])^4] & = &
540: {3\over\npp^2}[K-J^2]^2 + O(\npp^{-3}),
541: \eqan
542: from which the skewness and kurtosis can be obtained by dividing
543: by $\mbox{Var}[I]^{3/2}$ and $\mbox{Var}[I]^2$
544: respectively. One can see that the skewness is of order
545: $\npp^{-1/2}$ and the kurtosis is $3+O(\npp^{-1})$.
546: Significant deviation of the skewness from $0$ or the kurtosis from
547: $3$ would indicate a non-Gaussian $I$. They can be used to get an improved
548: approximation for $p(I|\vec\n)$ by making, for instance, an ansatz
549: \beqn
550: p(I|\vec\n)\propto (1+\tilde b I+\tilde c I^2) \cdot p_0(I|\tilde\mu,\tilde\sigma^2)
551: \eeqn
552: and fitting the parameters $\tilde b$, $\tilde c$, $\tilde\mu$,
553: and $\tilde\sigma^2$ to the mean, variance, skewness, and kurtosis
554: expressions above. $p_0$ is the Normal or Gamma distribution (or
555: any other distribution with Gaussian limit). From this, quantiles
556: $p(I > I_*|\vec\n):=\int_{I_*}^\infty p(I|\vec\n)\, dI$, needed in
557: \cite{Kleiter:96,Kleiter:99}, can be computed. A systematic
558: expansion of arbitrarily high moments to arbitrarily high order in
559: $\npp^{-1}$ leads, in principle, to arbitrarily accurate
560: estimates.
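
As a numerical illustration of (\ref{var2ndo}), take the
hypothetical $2\times 2$ table $\n_{11}=\n_{22}=40$,
$\n_{12}=\n_{21}=10$ with $\npp=100$ and $\n_{i\p}=\n_{\p j}=50$
(values chosen only for this example). The definitions
(\ref{Jdef}), (\ref{Kdef}), (\ref{Mdef}) and (\ref{Qdef}) give
\bqan
  J &=& 0.8\log 1.6 + 0.2\log 0.4 \;\approx\; 0.1927,
  \qquad
  K \;=\; 0.8(\log 1.6)^2 + 0.2(\log 0.4)^2 \;\approx\; 0.3446,
  \\
  M &=& -0.4\log 1.6 + 1.4\log 0.4 \;\approx\; -1.4708,
  \qquad
  Q \;=\; 1-{3400\over 2500} \;=\; -0.36,
\eqan
so that, dropping the $O(\npp^{-3})$ term,
\beqn
  \mbox{Var}[I] \;\approx\; {0.3075\over 101} +
  {-1.4708+0.3073+0.36\over 101\cdot 102}
  \;\approx\; 0.003044 - 0.000078 \;\approx\; 0.002966,
\eeqn
i.e.\ a standard deviation of about $0.054$. In this example the
second order term changes the variance by less than three percent.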
561:
562: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
563: \section{Numerics}\label{secNum}
564: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
565: %-------------------------------%
566: %\subsection{Implementation of $\psi(z)$}
567: %-------------------------------%
568: There are short and fast implementations of
569: $\psi$. The code of the Gamma function in \cite{Press:92}, for
570: instance, can be modified to compute the $\psi$ function. For
571: integer and half-integer values one may create a lookup table from
572: (\ref{psi2}).
573: %-------------------------------%
574: %\subsection{Computation time of (central moments)}
575: %-------------------------------%
576: The needed quantities $J$, $K$, $L$, $M$, and $Q$ (depending on $\vec
577: n$) involve a double sum, $P$ only a single sum, and the $r + s$
578: quantities $J_{i\p}$ and $J_{\p j}$ also only a single sum. Hence,
579: the computation time for the (central) moments is of the same
580: order $O(r \cdot s)$ as for the point estimate (\ref{mi}).
581: %-------------------------------%
582: %\subsection{Exact Monte Carlo}
583: %-------------------------------%
``Exact'' values have been obtained for representative choices of
$\t_{ij}$, $r$, $s$, and $\npp$ by Monte Carlo simulation.
The $\t_{ij}:=x_{ij}/x_\pp$ are Dirichlet distributed with parameters
$\n_{ij}$ if the $x_{ij}$ are independent and each
$x_{ij}$ follows a Gamma distribution with shape $\n_{ij}$ and unit scale.
See \cite{Press:92} for how to sample from a Gamma distribution.
589: %-------------------------------%
590: %\subsection{Numerical accuracy of expansion}
591: %-------------------------------%
The variance has been expanded in ${r \cdot s\over \npp}$,
so the relative errors ${\mbox{\scriptsize
Var}[I]_{approx}-\mbox{\scriptsize Var}[I]_{exact}\over
\mbox{\scriptsize Var}[I]_{exact}}$ of the approximations
(\ref{varlodi}) and (\ref{var2ndo}) are of the order of
${r \cdot s\over \npp}$ and $({r \cdot s\over \npp})^2$
respectively, {\em if} $\imath$ and $\jmath$ are dependent. If
they are independent, the leading term (\ref{varlodi}) itself
drops to order $\npp^{-2}$, resulting in a reduced
relative accuracy $O({r \cdot s\over \npp})$ of (\ref{var2ndo}).
Comparison with the Monte Carlo values confirmed an accuracy
in the range $({r \cdot s\over\npp})^{1...2}$. The mean
(\ref{miexex2}) is exact. Together with the skewness and
kurtosis we have a good description of the distribution of
the mutual information $p(I|\vec n)$ for not too small sample
bin sizes $n_{ij}$.
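To see the independent case concretely, suppose as an idealized
illustration that the parameters factorize exactly,
$\n_{ij}=\n_{i\p}\n_{\p j}/\npp$. Then every logarithm in
(\ref{Jdef}), (\ref{Kdef}) and (\ref{Mdef}) vanishes, so $J=K=M=0$,
and (\ref{Qdef}) gives $Q=1-\sum_{ij}{\n_{ij}\over\npp}=0$.
Equations (\ref{exnlodi}) and (\ref{var2ndo}) reduce to
\beqn
  E[I] \;=\; {(r-1)(s-1)\over 2(\npp+1)} + O(\npp^{-2}),
  \qquad
  \mbox{Var}[I] \;=\; {(r-1)(s-1)\over 2(\npp+1)(\npp+2)} +
  O(\npp^{-3}),
\eeqn
so the standard deviation is
$\sqrt{(r-1)(s-1)/2}\,/\npp$ to leading order, which is of the same
order in $\npp^{-1}$ as the mean.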
608: %-------------------------------%
609: %\subsection{Useful accuracy}
610: %-------------------------------%
611: We want to conclude with some notes on {\it useful} accuracy. The
612: hypothetical prior sample sizes $\n''_{ij}=\{0,{1\over
613: rs},\odt,1\}$ can all be argued to be non-informative
614: \cite{Gelman:95}. Since the central moments are expansions in
615: $\npp^{-1}$, the next to leading order term can be freely adjusted
by adjusting $\n''_{ij}\in[0,1]$.
So one may argue that anything beyond leading order is a matter
of choice, and the leading order terms may be regarded as being as
accurate as we can specify our prior knowledge. On the other hand, exact
expressions have the advantage of being safe against
cancellations. For instance, the leading orders of $E[I]$ and $E[I^2]$
do not suffice to compute the leading order of $\mbox{Var}[I]$.
623:
624: %------------------------------%
625: \subsubsection*{Acknowledgements}
626: %------------------------------%
627: I want to thank Ivo Kwee for valuable discussions and Marco
628: Zaffalon for encouraging me to investigate this topic. This work
629: was supported by SNF grant 2000-61847.00 to J\"urgen Schmidhuber.
630:
631: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
632: % Bibliography %
633: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
634: \begin{thebibliography}{PFTV92}
635:
636: \bibitem[AS74]{Abramowitz:74}
637: M.~Abramowitz and I.~A. Stegun, editors.
638: \newblock {\em Handbook of mathematical functions}.
639: \newblock Dover publications, inc., 1974.
640:
641: \bibitem[Bra99]{Brand:99}
642: M.~Brand.
643: \newblock Structure learning in conditional probability models via an entropic
644: prior and parameter extinction.
645: \newblock {\em Neural Computation}, 11(5):1155--1182, 1999.
646:
647: \bibitem[Bun96]{Buntine:96}
648: W.~Buntine.
649: \newblock A guide to the literature on learning probabilistic networks from
650: data.
651: \newblock {\em {IEEE} Transactions on Knowledge and Data Engineering},
652: 8:195--210, 1996.
653:
654: \bibitem[CT91]{Cover:91}
655: T.~M. Cover and J.~A. Thomas.
656: \newblock {\em Elements of Information Theory}.
657: \newblock Wiley Series in Telecommunications. John Wiley \& Sons, New York, NY,
658: USA, 1991.
659:
660: \bibitem[GCSR95]{Gelman:95}
661: A.~Gelman, J.~B. Carlin, H.~S. Stern, and D.~B. Rubin.
662: \newblock {\em Bayesian Data Analysis.}
663: \newblock Chapman, 1995.
664:
665: \bibitem[Hec98]{Heckerman:98}
666: D.~Heckerman.
667: \newblock A tutorial on learning with {B}ayesian networks.
\newblock {\em Learning in Graphical Models}, pages 301--354, 1998.
669:
670: \bibitem[KJ96]{Kleiter:96}
671: G.~D. Kleiter and R.~Jirousek.
672: \newblock Learning {B}ayesian networks under the control of mutual information.
673: \newblock {\em Proceedings of the 6th International Conference on Information
674: Processing and Management of Uncertainty in Knowledge-Based Systems
675: (IPMU-1996)}, pages 985--990, 1996.
676:
677: \bibitem[Kle99]{Kleiter:99}
678: G.~D. Kleiter.
679: \newblock The posterior probability of {B}ayes nets with strong dependences.
680: \newblock {\em Soft Computing}, 3:162--173, 1999.
681:
682: \bibitem[PFTV92]{Press:92}
683: W.~H. Press, B.~P. Flannery, S.~A. Teukolsky, and W.~T. Vetterling.
684: \newblock {\em Numerical Recipes in {C}: The Art of Scientific Computing}.
685: \newblock Cambridge University Press, Cambridge, second edition, 1992.
686:
687: \bibitem[Soo00]{Soofi:00}
688: E.~S. Soofi.
689: \newblock Principal information theoretic approaches.
690: \newblock {\em Journal of the American Statistical Association}, 95:1349--1353,
691: 2000.
692:
693: \bibitem[WW93]{Wolpert:93b}
694: D.~R. Wolf and D.~H. Wolpert.
\newblock Estimating functions of distributions from a finite set of samples,
696: part 2: Bayes estimators for mutual information, chi-squared, covariance and
697: other statistics.
698: \newblock Technical Report LANL-LA-UR-93-833, Los Alamos National Laboratory,
699: 1993.
\newblock Also Santa Fe Institute report SFI-TR-93-07-047.
701:
702: \end{thebibliography}
703:
704: \end{document}
705:
706: %---------------------------------------------------------------
707: