
1:

2: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3: %             Distribution of Mutual Information             %%

4: %%     Marcus Hutter: Start: 07.05.01  LastEdit: 15.12.01    %%

5: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

6:

7: %-------------------------------%

8: %   Document-Style              %

9: %-------------------------------%

10: \documentclass[12pt]{article}

11: \parskip=1.5ex plus 1ex minus 1ex \parindent=0ex

12: \topmargin=0cm  \oddsidemargin=0cm \evensidemargin=0cm

13: \textwidth=16cm \textheight=22.2cm \unitlength=1mm %\sloppy

14:

15: %-------------------------------%

16: %       My Math-Spacings        %

17: %-------------------------------%

18: \def\,{\mskip 3mu} \def\>{\mskip 4mu plus 2mu minus 4mu} \def\;{\mskip 5mu plus 5mu} \def\!{\mskip-3mu}

19: \def\dispmuskip{\thinmuskip= 3mu plus 0mu minus 2mu \medmuskip=  4mu plus 2mu minus 2mu \thickmuskip=5mu plus 5mu minus 2mu}

20: \def\textmuskip{\thinmuskip= 0mu                    \medmuskip=  1mu plus 1mu minus 1mu \thickmuskip=2mu plus 3mu minus 1mu}

21: \textmuskip

22: \def\beq{\dispmuskip\begin{equation}}    \def\eeq{\end{equation}\textmuskip}

23: \def\beqn{\dispmuskip\begin{displaymath}}\def\eeqn{\end{displaymath}\textmuskip}

24: \def\bqa{\dispmuskip\begin{eqnarray}}    \def\eqa{\end{eqnarray}\textmuskip}

25: \def\bqan{\dispmuskip\begin{eqnarray*}}  \def\eqan{\end{eqnarray*}\textmuskip}

26:

27: %-------------------------------%

28: %   Macro-Definitions           %

29: %-------------------------------%

30: \newenvironment{keywords}{\centerline{\small\bf

31: Keywords}\vspace{0.5ex}\begin{quote}\small}{\par\end{quote}\vskip

32: 1ex}

33: \def\nq{\hspace{-1em}}

34: \def\odt{{\textstyle{1\over 2}}}

35: \def\eps{\varepsilon}

36: \def\vec#1{{\bf #1}}

37: \def\p{{\scriptscriptstyle+}}

38: \def\pp{{\scriptscriptstyle++}}

39: \def\n{n}

40: \def\npp{\n}

41: \def\t{\pi}

42:

43: \begin{document}

44:

45: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

46: %                      T i t l e - P a g e                      %

47: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

48:

49: \begin{titlepage}

50:

51: \begin{center}

52:  {\small Technical Report IDSIA-13-01 \hfill 15 December 2001}\\[5mm]

53:   {\Large\sc\hrule height1pt \vskip 2mm

54:      Distribution of Mutual Information

55:      \vskip 5mm \hrule height1pt} \vspace{10mm}

56:   {\bf Marcus Hutter} \\[10mm]

57:   {\rm IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland}  \\

58:   {\rm\footnotesize marcus@idsia.ch \qquad

59:       http://www.idsia.ch/$^{_{_\sim}}\!$marcus}

60:       \\[15mm]

61: \end{center}

62:

63: \begin{keywords}

64: Mutual Information, Cross Entropy, Dirichlet distribution, Second

65: order distribution, expectation and variance of mutual

66: information.

67: \end{keywords}

68:

69: \begin{abstract}

70: The mutual information of two random variables $\imath$ and

71: $\jmath$ with joint probabilities $\{\t_{ij}\}$ is commonly used in

72: learning Bayesian nets as well as in many other fields. The

73: chances $\t_{ij}$ are usually estimated by the empirical sampling

74: frequency $\n_{ij}/\n$ leading to a point estimate $I(\n_{ij}/\n)$

75: for the mutual information. To answer questions like ``is

76: $I(\n_{ij}/\n)$ consistent with zero?'' or ``what is the

77: probability that the true mutual information is much larger than

78: the point estimate?'' one has to go beyond the point estimate.

79: %

80: In the Bayesian framework one can answer

81: these questions by utilizing a (second order) prior distribution

82: $p(\t)$ comprising prior information about $\t$. From the prior

83: $p(\t)$ one can compute the posterior $p(\t|\vec\n)$, from which

84: the distribution $p(I|\vec\n)$ of the mutual information can be

85: calculated.

86: %

87: We derive reliable and quickly computable approximations for

88: $p(I|\vec\n)$. We concentrate on the mean, variance, skewness, and

89: kurtosis, and non-informative priors. For the mean we also give an

90: exact expression. Numerical issues and the range of validity are

91: discussed.

92: \end{abstract}

93:

94: \end{titlepage}

95:

96: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

97: \section{Introduction}\label{secInt}

98: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

99: The mutual information $I$ (also called cross entropy) is a

100: widely used information theoretic measure for the stochastic

101: dependency of random variables \cite{Cover:91,Soofi:00}. It is

102: used, for instance, in learning Bayesian nets

103: \cite{Buntine:96,Heckerman:98}, where stochastically dependent

104: nodes shall be connected. The mutual information defined in

105: (\ref{mi}) can be computed if the joint probabilities

106: $\{\t_{ij}\}$ of the two random variables $\imath$ and

107: $\jmath$ are known. The standard procedure in the common case

108: of unknown chances $\t_{ij}$ is to use the sample frequency

109: estimates ${\n_{ij}\over\n}$ instead, as if they were

110: precisely known probabilities; but this is not always

111: appropriate. Furthermore, the point estimate

112: $I({\n_{ij}\over\n})$ gives no clue about the reliability of

113: the value if the sample size $n$ is finite. For instance, for

114: independent $\imath$ and $\jmath$, $I(\t)=0$ but

115: $I({\n_{ij}\over\n})=O(n^{-1/2})$ due to noise in the data.

116: The criterion for judging dependency is how many standard

117: deviations $I({\n_{ij}\over\n})$ is away from zero. In

118: \cite{Kleiter:96,Kleiter:99} the probability that the true

119: $I(\vec\t)$ is greater than a given threshold has been used to

120: construct Bayesian nets. In the Bayesian framework one can

121: answer these questions by utilizing a (second order) prior

122: distribution

123: $p(\t)$%

124: %comprising prior information about $\t$.

, which accounts for any imprecision about $\t$.

126: From the prior

127: $p(\t)$ one can compute the posterior $p(\t|\vec n)$, from which

128: the distribution $p(I|\vec\n)$ of the mutual information can be

129: obtained.

130:

131: The objective of this work is to derive reliable and quickly

computable analytical expressions for $p(I|\vec\n)$. Section
\ref{secMI} introduces the mutual information distribution, and
Section \ref{secResults} discusses some results in advance before
delving into the derivation. Since the central limit theorem
ensures that $p(I|\vec\n)$ converges to a Gaussian distribution, a
good starting point is to compute the mean and variance of
$p(I|\vec\n)$. In Section \ref{secApprox} we relate the mean and
variance to the covariance structure of $p(\t|\vec n)$. Most
non-informative priors lead to a Dirichlet posterior. An exact
expression for the mean (Section \ref{secExact}) and approximate
expressions for the variance (Section \ref{secDD}) are given for
the Dirichlet distribution. More accurate estimates of the
variance and higher central moments are derived in Section
\ref{secGeneral}, which lead to good approximations of
$p(I|\vec\n)$ even for small sample sizes. We show that the
expressions obtained in \cite{Kleiter:96,Kleiter:99} by heuristic
numerical methods are incorrect. Numerical issues and the range of
validity are briefly discussed in Section \ref{secNum}.

150:

151: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

152: \section{Mutual Information Distribution}\label{secMI}

153: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

154:

155: We consider discrete random variables $\imath\in\{1,...,r\}$ and $\jmath\in

156: \{1,...,s\}$ and an i.i.d.\ random process with samples

157: $(i,j)\in\{1,...,r\}\times\{1,...,s\}$ drawn with joint probability

158: $\t_{ij}$. An important measure of the stochastic

159: dependence of $\imath$ and $\jmath$ is the mutual

160: information

161: \beq\label{mi}

162:   I({\vec \t}) \;=\; \sum_{i=1}^r\sum_{j=1}^s

163:   \t_{ij}\log{\t_{ij}\over\t_{i\p}\t_{\p j}} \;=\;

164:   \sum_{ij}\t_{ij}\log\t_{ij} -

165:   \sum_{i}\t_{i\p}\log\t_{i\p} -

166:   \sum_{j}\t_{\p j}\log\t_{\p j}.

167: \eeq

168: $\log$ denotes the natural logarithm and

169: $\t_{i\p}=\sum_j\t_{ij}$ and

170: $\t_{\p j}=\sum_i\t_{ij}$ are marginal probabilities.

171: Often one does not know the probabilities $\t_{ij}$ exactly,

172: but one has a sample set with $\n_{ij}$ outcomes of pair $(i,j)$.

173: The frequency $\hat\t_{ij}:={\n_{ij}\over\npp}$ may

174: be used as a first estimate of the unknown probabilities.

175: $\npp:=\sum_{ij}\n_{ij}$ is the total sample size.

176: This leads to a point (frequency) estimate $I(\hat\vec\t) =

177: \sum_{ij}{\n_{ij}\over\npp}

178: \log{\n_{ij}\npp\over\n_{i\p}\n_{\p j}}$

179: for the mutual information (per sample).
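
As a small illustration, consider a hypothetical $2\times 2$ sample (the numbers are chosen for convenience only and do not come from any data set) with $\n_{11}=\n_{22}=3$ and $\n_{12}=\n_{21}=1$, hence $\npp=8$ and $\n_{i\p}=\n_{\p j}=4$. The frequency estimate then evaluates to
\beqn
  I(\hat\vec\t) \;=\;
  2\cdot{3\over 8}\log{3\cdot 8\over 4\cdot 4} +
  2\cdot{1\over 8}\log{1\cdot 8\over 4\cdot 4} \;=\;
  {3\over 4}\log{3\over 2} - {1\over 4}\log 2 \;\approx\; 0.131
\eeqn
(in nats). This single number carries no indication of how far the true $I(\vec\t)$ may lie from it.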

180:

Unfortunately, the point estimate $I(\hat\vec\t)$ gives no

182: information about its accuracy. In the Bayesian approach to this

183: problem one assumes a prior (second order) probability density

184: $p(\vec\t)$ for the unknown probabilities $\t_{ij}$ on the

185: probability simplex. From this one can compute the posterior

186: distribution $p(\vec\t|\vec\n) \propto

187: p(\t)\prod_{ij}\t_{ij}^{\n_{ij}}$ (the $n_{ij}$ are multinomially

distributed). This allows one to compute the

189: posterior probability density of the mutual information.$\!$%

190: \footnote{$I(\vec\t)$ denotes the mutual information for the

191: specific chances $\vec\t$, whereas $I$ in the context above is

192: just some non-negative real number. $I$ will also denote the

193: mutual information {\it random variable} in the

expectation $E[I]$ and variance $\mbox{Var}[I]$. Expectations are
{\it always} w.r.t.\ the posterior distribution

196: $p(\vec\t|\vec\n)$. }

197: \beq\label{midistr}

198:   p(I|\vec\n) = \int

199:   \delta(I(\vec\t)-I)p(\vec\t|\vec\n)d^{rs}\vec\t

200: \eeq

201: \footnote{Since $0\leq I(\t)\leq I_{max}$ with sharp upper

202: bound $I_{max}:= \min\{\log r,\log s\}$, the integral may be

203: restricted to $\int_0^{I_{max}}$, which shows that the domain

204: of $p(I|\vec n)$ is $[0,I_{max}].$}%

205: The $\delta()$ distribution restricts the integral to $\t$ for

206: which $I(\t)=I$. For large sample size $\npp\to\infty$,

207: $p(\vec\t|\vec\n)$ is strongly peaked around $\vec\t=\hat\vec\t$

208: and $p(I|\vec\n)$ gets strongly peaked around the frequency

209: estimate $I=I(\hat\vec\t)$. The mean $E[I] = \int_0^\infty I

210: p(I|\vec\n)\,dI = \int I(\vec\t)p(\vec\t|\vec\n)d^{rs}\vec\t$ and

211: the variance $\mbox{Var}[I]=E[(I-E[I])^2]=E[I^2]-E[I]^2$ are of

212: central interest.

213:

214: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

215: \section{Results for $I$ under the Dirichlet P{\rm(}oste{\rm)}rior}\label{secResults}

216: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

217: Most\footnote{But not all priors which one can argue to be

218: non-informative lead to Dirichlet posteriors. Brand \cite{Brand:99}

219: (and others), for instance, advocate the entropic prior

220: $p(\vec\t)\propto e^{-H(\vec\t)}$.}

221: non-informative priors for $p(\t)$ lead to a Dirichlet

222: posterior distribution $p(\vec\t|\vec\n) \propto

223: \prod_{ij}\t_{ij}^{\n_{ij}-1}$ with interpretation

224: $\n_{ij}=\n'_{ij}+\n''_{ij}$, where

225: $\n'_{ij}$ are the number of samples $(i,j)$, and

226: $\n''_{ij}$ comprises prior information

227: ($1$ for the uniform prior, $\odt$ for Jeffreys' prior, $0$ for

228: Haldane's prior, ${1\over rs}$ for Perks' prior \cite{Gelman:95}).

In principle, this allows one to compute the
posterior density $p(I|\vec\n)$ of the mutual information. In
Sections \ref{secApprox} and \ref{secDD} we expand the mean and

232: variance in terms of $\npp^{-1}$:

233: \bqa\label{mvappr}

234:   E[I] &=&

235:   \sum_{ij}{\n_{ij}\over\npp}

236:   \log{\n_{ij}\npp\over\n_{i\p}\n_{\p j}} \;+\;

237:   {(r-1)(s-1)\over 2\npp} \;+\; O(\npp^{-2}),

238:   \\\nonumber

239:   \mbox{Var}[I] &=&

240:   {1\over\npp}

241:   \sum_{ij}{\n_{ij}\over\npp}\bigg(\log{\n_{ij}\npp\over

242:     \n_{i\p}\n_{\p j}}\bigg)^2 -

243:   {1\over\npp}\bigg(\sum_{ij}{\n_{ij}\over\npp}\log{\n_{ij}\npp\over

244:     \n_{i\p}\n_{\p j}}\bigg)^2 \;+\; O(\npp^{-2}).

245: \eqa

246: The first term for the mean is just the point estimate

247: $I(\hat\t)$. The second term is a small correction if $\npp\gg

248: r \cdot s$. Kleiter \cite{Kleiter:96,Kleiter:99} determined the

249: correction by Monte Carlo studies as $\min\{{r-1\over

2\npp},{s-1\over 2\npp}\}$. This is wrong unless $s$ or $r$ is 2.
The expression $2E[I]/n$ they determined for the variance has a
completely different structure from ours. Note that the mean is

253: lower bounded by ${const.\over\npp}+O(\npp^{-2})$, which is

254: strictly positive for large, but finite sample sizes, even if

255: $\imath$ and $\jmath$ are statistically independent and

256: independence is perfectly represented in the data ($I(\hat\t)=0$).

257: On the other hand, in this case, the standard deviation

258: $\sigma=\sqrt{\mbox{Var} (I)}\sim {1\over\npp}\sim E[I]$ correctly

259: indicates that the mean is still consistent with zero.
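
To give a feeling for the magnitudes, here is a worked evaluation of (\ref{mvappr}) for a hypothetical $2\times 2$ table with $\n_{11}=\n_{22}=3$, $\n_{12}=\n_{21}=1$, $\npp=8$ (illustrative numbers only). Writing $J$ for the sum in the mean and $K$ for the sum of squared logarithms in the variance, as they are named in Section \ref{secDD}, one finds
\bqan
  J &=& {3\over 4}\log{3\over 2} - {1\over 4}\log 2 \;\approx\; 0.1308,
  \qquad
  K \;=\; {3\over 4}\Big(\log{3\over 2}\Big)^2 + {1\over 4}(\log 2)^2 \;\approx\; 0.2434,
  \\
  E[I] &\approx& 0.1308 + {1\over 16} \;\approx\; 0.193,
  \qquad
  \mbox{Var}[I] \;\approx\; {0.2434-0.0171\over 8} \;\approx\; 0.0283,
  \qquad
  \sigma \;\approx\; 0.168.
\eqan
Here ${r \cdot s\over\npp}=\odt$ is not small, so these values should be read as rough orientation only.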

260:

261: Our approximations (\ref{mvappr}) for the mean and variance are

262: good if ${r \cdot s\over\npp}$ is small. The central limit

263: theorem ensures that $p(I|\vec\n)$ converges to a Gaussian

264: distribution with mean $E[I]$ and variance $\mbox{Var}[I]$. Since

265: $I$ is non-negative it is more appropriate to approximate

$p(I|\vec\n)$ as a Gamma ($=$ scaled $\chi^2$) or log-normal

267: distribution with mean $E[I]$ and variance $\mbox{Var}[I]$, which

268: is of course also asymptotically correct.
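
One simple way to fix the parameters of such an approximation is moment matching. The following is only a sketch; the symbols $k$, $\theta$ (shape and scale of the Gamma density) and $m$, $v$ (mean and variance of $\log I$ in the log-normal case) are introduced here for illustration and do not appear elsewhere in the text. A Gamma density $\propto I^{k-1}e^{-I/\theta}$ has mean $k\theta$ and variance $k\theta^2$, and a log-normal variable has mean $e^{m+v/2}$ and variance $(e^v-1)e^{2m+v}$, so that
\beqn
  k = {E[I]^2\over\mbox{Var}[I]},\quad
  \theta = {\mbox{Var}[I]\over E[I]},\qquad
  v = \log\Big(1+{\mbox{Var}[I]\over E[I]^2}\Big),\quad
  m = \log E[I] - {v\over 2}.
\eeqn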

269:

270: A systematic expansion in $\npp^{-1}$ of the mean, variance, and

higher moments is possible but becomes increasingly cumbersome.

272: The $O(\npp^{-2})$ terms for the variance and leading order

273: terms for the skewness and kurtosis

274: are given in Section \ref{secGeneral}.

275: For the mean it is possible to give an exact expression

276: \beq\label{miexex2}

277:   E[I] = {1\over\npp}\sum_{ij}\n_{ij}

278:   [\psi(\n_{ij}+1)-\psi(\n_{i\p}+1)-\psi(\n_{\p

279:   j}+1)+\psi(\npp+1)]

280: \eeq

281: with $\psi(n+1)=-\gamma+\sum_{k=1}^n{1\over k}=\log

282: n+O({1\over n})$ for integer $n$. See Section \ref{secExact} for

283: details and more general expressions for $\psi$ for non-integer

284: arguments.
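
For integer parameters the constant $\gamma$ cancels and (\ref{miexex2}) reduces to harmonic numbers $H_m:=\sum_{k=1}^m{1\over k}$ (a notation used only in this example). As an illustration, take the hypothetical $2\times 2$ table $\n_{11}=\n_{22}=3$, $\n_{12}=\n_{21}=1$, $\npp=8$, $\n_{i\p}=\n_{\p j}=4$. With $H_1=1$, $H_3={11\over 6}$, $H_4={25\over 12}$ and $H_8={761\over 280}$ one gets
\bqan
  E[I] &=& {1\over 8}\big[\,2\cdot 3\,(H_3-2H_4+H_8) + 2\cdot 1\,(H_1-2H_4+H_8)\big]
  \\
  &=& {1\over 8}\Big[\,6\cdot{323\over 840} - 2\cdot{377\over 840}\Big]
  \;=\; {37\over 210} \;\approx\; 0.176.
\eqan
For comparison, the frequency estimate for the same table is $\approx 0.131$ and the leading order approximation (\ref{mvappr}) gives $\approx 0.193$.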

285:

286: There may be other prior information available which cannot be

287: comprised in a Dirichlet distribution. In this general case, the

288: mean and variance of $I$ can still be related to the covariance

289: structure of $p(\t|\vec\n)$, which will be done in the following

290: Section.

291:

292: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

293: \section{Approximation of Expectation and Variance of $I$}\label{secApprox}

294: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

295: In the following let $\hat\t_{ij}:=E[\t_{ij}]$.

296: Since $p(\vec\t|\vec\n)$ is strongly peaked

297: around $\vec\t=\hat\vec\t$ for large $\npp$ we may

298: expand $I(\t)$ around $\hat\vec\t$ in the integrals for the mean and the variance.

299: With

300: $\Delta_{ij}:=\t_{ij}-\hat\t_{ij}$ and using $\sum_{ij}\t_{ij}= 1

301: =\sum_{ij}\hat\t_{ij}$ we get for the expansion of (\ref{mi})

302: \beq\label{miexp}

303:   I(\t) \;=\; I(\hat\t) +

304:   \sum_{ij}\log\left({\hat\t_{ij}\over\hat\t_{i\p}\hat\t_{\p j}}\right)\Delta_{ij}

305:   + \sum_{ij}{\Delta_{ij}^2\over 2\hat\t_{ij}} -

306:   \sum_i{\Delta_{i\p}^2\over 2\hat\t_{i\p}} -

307:   \sum_j{\Delta_{\p j}^2\over 2\hat\t_{\p j}} +

308:   O(\Delta^3).

309: \eeq
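
The coefficients in (\ref{miexp}) can be checked by differentiating (\ref{mi}) directly. As a sketch of that step, treating all $\t_{ij}$ as independent variables and writing $\delta$ for the Kronecker delta,
\beqn
  {\partial I\over\partial\t_{ij}} =
  \log{\t_{ij}\over\t_{i\p}\t_{\p j}} - 1, \qquad
  {\partial^2 I\over\partial\t_{ij}\partial\t_{kl}} =
  {\delta_{ik}\delta_{jl}\over\t_{ij}} -
  {\delta_{ik}\over\t_{i\p}} -
  {\delta_{jl}\over\t_{\p j}}.
\eeqn
The constant $-1$ in the gradient does not contribute because $\sum_{ij}\Delta_{ij}=0$. Contracting the second derivative, evaluated at $\hat\t$, with $\odt\Delta_{ij}\Delta_{kl}$ gives the three quadratic sums, where $\Delta_{i\p}:=\sum_j\Delta_{ij}$ and $\Delta_{\p j}:=\sum_i\Delta_{ij}$.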

310: Taking the expectation, the linear term $E[\Delta_{ij}]=0$ drops

311: out. The quadratic terms $E[\Delta_{ij}\Delta_{kl}] =

312: \mbox{Cov}(\t_{ij},\t_{kl})$ are the covariance of $\t$ under

313: distribution $p(\vec\t|\vec\n)$ and are proportional to

314: $\npp^{-1}$. It can be shown that $E[\Delta^3]\sim\npp^{-2}$ (see

315: Section \ref{secGeneral}).

316: \beq\label{exnlo}

317:   E[I] \;=\; I(\hat\t) + {1\over 2}

318:   \sum_{ijkl}\left({\delta_{ik}\delta_{jl}\over\hat\t_{ij}} -

319:   {\delta_{ik}\over\hat\t_{i\p}} -

320:   {\delta_{jl}\over\hat\t_{\p j}}\right)\mbox{Cov}(\t_{ij},\t_{kl}) +

321:   O(\npp^{-2}).

322: \eeq

323: The Kronecker delta $\delta_{ij}$ is $1$ for $i=j$ and $0$ otherwise.

324: The variance of $I$ in leading order in $\npp^{-1}$ is

325: \bqa\nonumber

326:   \mbox{Var}\,I(\t) &=&

327:   E[(I-E[I])^2] \;\stackrel+=\;

328:   E\left[\left(\sum_{ij}\log\left({\hat\t_{ij}\over

329:     \hat\t_{i\p}\hat\t_{\p j}}\right)\Delta_{ij}\right)^2\right]

330:   \;=\; \\\label{varlo}

331:   &=&

332:   \sum_{ijkl}\log{\hat\t_{ij}\over\hat\t_{i\p}\hat\t_{\p j}}

333:   \log{\hat\t_{kl}\over\hat\t_{k\p}\hat\t_{\p l}}

334:   \mbox{Cov}(\t_{ij},\t_{kl}),

335: \eqa

336: where $\stackrel+=$ means $=$ up to terms of order

337: $\npp^{-2}$. So the leading order variance and the leading and

338: next to leading order mean of the mutual information $I(\t)$ can be

339: expressed in terms of the covariance of $\t$ under the posterior distribution

340: $p(\t|\vec\n)$.

341:

342: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

343: \section{The Second Order Dirichlet Distribution}\label{secDD}

344: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

345: Noninformative priors for $p(\t)$ are commonly used if no

346: additional prior information is available. Many non-informative

347: choices (uniform, Jeffreys', Haldane's, Perks', ... prior) lead to

348: a Dirichlet posterior distribution:

349: \bqa\nonumber

350:   p(\t|\vec\n) &=&

351:   {1\over

352:   N(\vec\n)}\prod_{ij}\t_{ij}^{\n_{ij}-1}\delta(\t_\pp-1)

353:   \quad\mbox{with normalization}

354:   \\\label{norm}

355:   N(\vec\n) &=&

356:   \int\prod_{ij}\t_{ij}^{\n_{ij}-1}\delta(\t_\pp-1)

357:   d^{rs}\t \;=\;

358:   {\prod_{ij}\Gamma(\n_{ij})\over\Gamma(\npp)},

359: \eqa

360: where $\Gamma$ is the Gamma function, and

361: $\n_{ij}=\n'_{ij}+\n''_{ij}$, where $\n'_{ij}$ are

362: the number of samples $(i,j)$, and $\n''_{ij}$ comprises prior

363: information

364: ($1$ for the uniform prior, $\odt$ for Jeffreys' prior,

365: $0$ for Haldane's prior, ${1\over rs}$ for Perks' prior).

366: Mean and covariance of $p(\t|\vec\n)$ are

367: \beq\label{ecov}

368:   \hat\t_{ij} := E[\t_{ij}]=

369:   {\n_{ij}\over\npp}, \quad

370:   \mbox{Cov}(\t_{ij},\t_{kl}) =

371:   {1\over\npp+1}(\hat\t_{ij}\delta_{ik}\delta_{jl}-

372:   \hat\t_{ij}\hat\t_{kl})

373: \eeq

374: Inserting this into (\ref{exnlo}) and (\ref{varlo}) we get after some

375: algebra for the mean and variance of the mutual information

376: $I(\t)$ up to terms of order $\npp^{-2}$:

377: \bqa\label{exnlodi}

378:   E[I] &=& J \;+\; {(r-1)(s-1)\over 2(\npp+1)}

379:   \;+\; O(\npp^{-2}),

380:   \\\label{varlodi}

381:   \mbox{Var}[I] &=&

382:   {1\over\npp+1}(K-J^2) \;+\;

383:   O(\npp^{-2}), \quad

384:   \\\label{Jdef}

385:   J &:=& \sum_{ij}{\n_{ij}\over\npp}\log{\n_{ij}\npp\over

386:     \n_{i\p}\n_{\p j}} \;=\; I(\hat\t), \quad

387:   \\\label{Kdef}

388:   K &:=& \sum_{ij}{\n_{ij}\over\npp}\left(\log{\n_{ij}\npp\over

389:     \n_{i\p}\n_{\p j}}\right)^2.

390: \eqa

391: $J$ and $K$ (and $L$, $M$, $P$, $Q$ defined later) depend on

392: $\hat\t_{ij} = {\n_{ij}\over\npp}$ only, i.e.\ are $O(1)$ in

393: $\vec\n$. Strictly speaking we should expand

394: ${1\over\npp+1}={1\over\npp}+O(\npp^{-2})$, i.e.\ drop the $+1$,

but the exact expression (\ref{ecov}) for the covariance suggests
keeping the $+1$. We compared both versions with the exact values
(from Monte Carlo simulations) for various parameters $\vec\t$. In
most cases the expansion in ${1\over\npp+1}$ was more accurate, so
we suggest using this variant.
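
The algebra leading to (\ref{exnlodi}) and (\ref{varlodi}) can be sketched as follows, using only (\ref{exnlo}), (\ref{varlo}) and (\ref{ecov}). The three contributions to the correction of the mean are
\bqan
  \sum_{ijkl}{\delta_{ik}\delta_{jl}\over\hat\t_{ij}}\mbox{Cov}(\t_{ij},\t_{kl})
  &=& {1\over\npp+1}\sum_{ij}(1-\hat\t_{ij}) \;=\; {rs-1\over\npp+1},
  \\
  \sum_{ijkl}{\delta_{ik}\over\hat\t_{i\p}}\mbox{Cov}(\t_{ij},\t_{kl})
  &=& {1\over\npp+1}\sum_{i}(1-\hat\t_{i\p}) \;=\; {r-1\over\npp+1},
  \\
  \sum_{ijkl}{\delta_{jl}\over\hat\t_{\p j}}\mbox{Cov}(\t_{ij},\t_{kl})
  &=& {1\over\npp+1}\sum_{j}(1-\hat\t_{\p j}) \;=\; {s-1\over\npp+1},
\eqan
so that $\odt[(rs-1)-(r-1)-(s-1)]/(\npp+1) = (r-1)(s-1)/(2(\npp+1))$. For the variance, the $\hat\t_{ij}\delta_{ik}\delta_{jl}$ part of the covariance yields $K/(\npp+1)$, while the product part $\hat\t_{ij}\hat\t_{kl}$ factorizes into $J^2/(\npp+1)$.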

400:

401: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

402: \section{Exact Value for $E[I]$}\label{secExact}

403: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

404: It is possible to get an exact expression for the mean mutual

405: information $E[I]$ under the Dirichlet distribution.

406: By noting that $x\log x = {d\over d\beta}x^\beta|_{\beta=1}$,

407: ($x = \{\t_{ij},\t_{i\p},\t_{\p j}\}$), one

408: can replace the logarithms in the last expression of

409: (\ref{mi}) by powers. From (\ref{norm}) we see that

410: $E[(\t_{ij})^\beta]={\Gamma(\n_{ij}+\beta)\Gamma(\npp)\over

411: \Gamma(\n_{ij})\Gamma(\npp+\beta)}$. Taking the

412: derivative and setting $\beta=1$ we get

413: \beqn

  E[\t_{ij}\log\t_{ij}] = {d\over d\beta}E[(\t_{ij})^\beta]_{\beta=1}
  = {\n_{ij}\over\npp}[\psi(\n_{ij}+1)-\psi(\npp+1)].

416: \eeqn
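
Spelled out for a fixed pair $(i,j)$, the intermediate step uses $\psi=\Gamma'/\Gamma$ and $\Gamma(z+1)=z\Gamma(z)$. The following lines are a sketch of that computation:
\bqan
  {d\over d\beta}E[(\t_{ij})^\beta] &=&
  {\Gamma(\n_{ij}+\beta)\Gamma(\npp)\over\Gamma(\n_{ij})\Gamma(\npp+\beta)}
  \,[\psi(\n_{ij}+\beta)-\psi(\npp+\beta)],
  \\
  {d\over d\beta}E[(\t_{ij})^\beta]_{\beta=1} &=&
  {\n_{ij}\over\npp}\,[\psi(\n_{ij}+1)-\psi(\npp+1)].
\eqan
Summing the last line over $ij$ gives the expectation of the first sum in the last expression of (\ref{mi}).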

417: The $\psi$ function has the following properties (see

418: \cite{Abramowitz:74} for details)

419: \beqn

420:   \psi(z)={d\log\Gamma(z)\over dz}={\Gamma'(z)\over\Gamma(z)},\quad

421:   \psi(z+1)=\log z + {1\over 2z} - {1\over 12z^2} + O({1\over z^4}),

422: \eeqn

423: \beq\label{psi2}

424:   \psi(n)=-\gamma+\sum_{k=1}^{n-1}{1\over k},\quad

  \psi(n+\odt)=-\gamma-2\log 2+2\sum_{k=1}^n{1\over 2k-1}.

426: \eeq

427: The value of the Euler constant $\gamma$ is irrelevant here,

428: since it cancels out. Since the marginal distributions of

429: $\t_{i\p}$ and $\t_{\p j}$ are also Dirichlet (with parameters

430: $\n_{i\p}$ and $\n_{\p j}$) we get similarly

431: \bqan

  E[\t_{i\p}\log\t_{i\p}] &=&
  {\n_{i\p}\over\npp}[\psi(\n_{i\p}+1)-\psi(\npp+1)],
  \\
  E[\t_{\p j}\log\t_{\p j}] &=&
  {\n_{\p j}\over\npp}[\psi(\n_{\p j}+1)-\psi(\npp+1)].

437: \eqan

438: Inserting this into (\ref{mi}) and rearranging terms we get the

439: exact expression\footnote{This expression has independently

440: been derived in \cite{Wolpert:93b}.}

441: \beq\label{miexex}

442:   E[I] = {1\over\npp}\sum_{ij}\n_{ij}

443:   [\psi(\n_{ij}+1)-\psi(\n_{i\p}+1)-\psi(\n_{\p

444:   j}+1)+\psi(\npp+1)]

445: \eeq

446: For large sample sizes, $\psi(z+1)\approx\log z$ and (\ref{miexex})

approaches the frequency estimate $I(\hat\t)$, as it should.

448: Inserting the expansion $\psi(z+1)=\log z+{1\over 2z}+...$ into

449: (\ref{miexex}) we also get the correction term ${(r-1)(s-1)\over

450: 2\npp}$ of (\ref{mvappr}).
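
The correction term follows in one line. As a sketch, using $\sum_{ij}1=rs$, $\sum_{ij}{\n_{ij}\over\n_{i\p}}=r$, $\sum_{ij}{\n_{ij}\over\n_{\p j}}=s$ and $\sum_{ij}{\n_{ij}\over\npp}=1$, the ${1\over 2z}$ terms of the expansion contribute
\beqn
  {1\over\npp}\sum_{ij}\n_{ij}
  \left[{1\over 2\n_{ij}}-{1\over 2\n_{i\p}}-{1\over 2\n_{\p j}}+{1\over 2\npp}\right]
  \;=\; {rs-r-s+1\over 2\npp}
  \;=\; {(r-1)(s-1)\over 2\npp}.
\eeqn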

451:

452: The presented method (with some refinements) may also be used to

453: determine an exact expression for the variance of $I(\t)$. All but

454: one term can be expressed in terms of Gamma functions. The final

455: result after differentiating w.r.t.\ $\beta_1$ and $\beta_2$ can

456: be represented in terms of $\psi$ and its derivative $\psi'$. The

457: mixed term $E[(\t_{i\p})^{\beta_1}(\t_{\p j})^{\beta_2}]$ is more

458: complicated and involves confluent hypergeometric functions, which

459: limits its practical use \cite{Wolpert:93b}.

460:

461: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

462: \section{Generalizations}\label{secGeneral}

463: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

464: A systematic expansion of all moments of $p(I|\vec\n)$ to arbitrary order in

$\npp^{-1}$ is possible, but soon becomes quite cumbersome.

466: For the mean we already gave an exact expression (\ref{miexex}), so we

467: concentrate here on the variance, skewness and the kurtosis of $p(I|\vec\n)$.

468: The $3^{rd}$ and $4^{th}$

469: central moments of $\t$ under

470: the Dirichlet distribution are

471: \beq\label{mom3}

472:   E[\Delta_a\Delta_b\Delta_c] \;=\; {2\over(\npp+1)(\npp+2)}

473:   [2\hat\t_a\hat\t_b\hat\t_c

474:    - \hat\t_a\hat\t_b\delta_{bc}

475:    - \hat\t_b\hat\t_c\delta_{ca}

476:    - \hat\t_c\hat\t_a\delta_{ab}

477:    + \hat\t_a\delta_{ab}\delta_{bc}]

478: \eeq

479: \bqa

480:    E[\Delta_a\Delta_b\Delta_c\Delta_d] &=& {1\over\npp^2}

481:    [3\hat\t_a\hat\t_b\hat\t_c\hat\t_d

482:    - \hat\t_c\hat\t_d\hat\t_a\delta_{ab}

483:    - \hat\t_b\hat\t_d\hat\t_a\delta_{ac}

484:    - \hat\t_b\hat\t_c\hat\t_a\delta_{ad} \nq\\[-2ex]\nonumber

485:    && \qquad\qquad\qquad\; - \hat\t_a\hat\t_d\hat\t_b\delta_{bc}

486:    - \hat\t_a\hat\t_c\hat\t_b\delta_{bd}

487:    - \hat\t_a\hat\t_b\hat\t_c\delta_{cd} \nq\\\nonumber

488:    && \qquad\qquad\qquad\;

489:    + \hat\t_a\hat\t_c\delta_{ab}\delta_{cd}

490:    + \hat\t_a\hat\t_b\delta_{ac}\delta_{bd}

491:    + \hat\t_a\hat\t_b\delta_{ad}\delta_{bc}]

492:    +O(\npp^{-3})\nq

493: \eqa

494: with $a = ij$, $b = kl,...\in\{1,...,r\}\times\{1,...,s\}$

being double indices,
$\delta_{ab} = \delta_{ik}\delta_{jl},...$, and
$\hat\t_{ij}={\n_{ij}\over\npp}$.

498: Expanding $\Delta^k = (\t-\hat\t)^k$ in $E[\Delta_a\Delta_b...]$ leads to

499: expressions containing $E[\t_a\t_b...]$, which can be

500: computed by a case analysis of all combinations of equal/unequal

501: indices $a,b,c,...$ using (\ref{norm}).

Many terms cancel, leading to the above expressions.
They allow one to compute the order $\npp^{-2}$ term of
the variance of $I(\t)$. Again, inspection of (\ref{mom3})
suggests expanding in $[(\npp+1)(\npp+2)]^{-1}$ rather than in
$\npp^{-2}$. The variance in leading and next to leading order
is

508: \bqa\label{var2ndo}

509:   \mbox{Var}[I] %&=&

510:   &=& {K-J^2\over\npp+1} +

511:   {M+(r - 1)(s - 1)(\odt - J)-Q

512:   \over(\npp+1)(\npp+2)} + O(\npp^{-3})

513:   \\\label{Mdef}

514:   M &:=& \sum_{ij}

515:   \left({1\over\n_{ij}}-{1\over\n_{i\p}}-{1\over\n_{\p

516:   j}}+{1\over\npp}\right)

517:   \n_{ij}\log{\n_{ij}\npp\over\n_{i\p}\n_{\p j}},

518:   \\\label{Qdef}

519:   Q &:=& 1-\sum_{ij}{\n_{ij}^2\over\n_{i\p}\n_{\p j}}.

520: \eqa

521: $J$ and $K$ are defined in (\ref{Jdef}) and (\ref{Kdef}).

522: Note that the first term ${K-J^2\over\n+1}$ also contains second

523: order terms when expanded in $\npp^{-1}$. The leading order

524: terms for the $3^{rd}$ and $4^{th}$ central moments of $p(I|\vec\n)$ are

525: \bqan

526:   E[(I-E[I])^3] & = &

527:   {2\over\npp^2}[2J^3 - 3KJ + L] +

528:   {3\over\npp^2}[K + J^2 - P] +

529:   O(\npp^{-3}),

530:   \\

531:   L & := & \sum_{ij}{\n_{ij}\over\npp}\left(\log{\n_{ij}\npp\over

532:     \n_{i\p}\n_{\p j}}\right)^3,\quad

533:   P \;:=\; \sum_i{\n J_{i\p}^2\over\n_{i\p}} + \sum_j{\n J_{\p j}^2\over\n_{\p j}},

534:   \\

535:   J_{i\p} & :=&  \sum_{j}{\n_{ij}\over\npp}\log{\n_{ij}\npp\over\n_{i\p}\n_{\p

536:   j}}\qquad,\quad

537:   J_{\p j} \;:=\; \sum_{i}{\n_{ij}\over\npp}\log{\n_{ij}\npp\over\n_{i\p}\n_{\p j}},

538:   \\

539:   E[(I-E[I])^4] & = &

540:   {3\over\npp^2}[K-J^2]^2 + O(\npp^{-3}),

541: \eqan

542: from which the skewness and kurtosis can be obtained by dividing

543: by $\mbox{Var}[I]^{3/2}$ and $\mbox{Var}[I]^2$

544: respectively. One can see that the skewness is of order

545: $\npp^{-1/2}$ and the kurtosis is $3+O(\npp^{-1})$.

546: Significant deviation of the skewness from $0$ or the kurtosis from

547: $3$ would indicate a non-Gaussian $I$. They can be used to get an improved

548: approximation for $p(I|\vec\n)$ by making, for instance, an ansatz

549: \beqn

550:   p(I|\vec\n)\propto (1+\tilde b I+\tilde c I^2) \cdot p_0(I|\tilde\mu,\tilde\sigma^2)

551: \eeqn

552: and fitting the parameters $\tilde b$, $\tilde c$, $\tilde\mu$,

553: and $\tilde\sigma^2$ to the mean, variance, skewness, and kurtosis

554: expressions above. $p_0$ is the Normal or Gamma distribution (or

555: any other distribution with Gaussian limit). From this, quantiles

556: $p(I > I_*|\vec\n):=\int_{I_*}^\infty p(I|\vec\n)\, dI$, needed in

557: \cite{Kleiter:96,Kleiter:99}, can be computed. A systematic

558: expansion of arbitrarily high moments to arbitrarily high order in

559: $\npp^{-1}$ leads, in principle, to arbitrarily accurate

560: estimates.
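
The orders quoted for the skewness and kurtosis can be read off from the leading terms. As a sketch, assuming the dependent case $K-J^2>0$ so that $\mbox{Var}[I]={K-J^2\over\npp}+O(\npp^{-2})$,
\beqn
  {E[(I-E[I])^3]\over\mbox{Var}[I]^{3/2}} =
  {O(\npp^{-2})\over O(\npp^{-3/2})} = O(\npp^{-1/2}),
  \qquad
  {E[(I-E[I])^4]\over\mbox{Var}[I]^2} =
  {3(K-J^2)^2\npp^{-2}+O(\npp^{-3})\over(K-J^2)^2\npp^{-2}+O(\npp^{-3})}
  = 3+O(\npp^{-1}).
\eeqn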

561:

562: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

563: \section{Numerics}\label{secNum}

564: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

565: %-------------------------------%

566: %\subsection{Implementation of $\psi(z)$}

567: %-------------------------------%

568: There are short and fast implementations of

569: $\psi$. The code of the Gamma function in \cite{Press:92}, for

570: instance, can be modified to compute the $\psi$ function. For

571: integer and half-integer values one may create a lookup table from

572: (\ref{psi2}).

573: %-------------------------------%

574: %\subsection{Computation time of (central moments)}

575: %-------------------------------%

576: The needed quantities $J$, $K$, $L$, $M$, and $Q$ (depending on $\vec

577: n$) involve a double sum, $P$ only a single sum, and the $r + s$

578: quantities $J_{i\p}$ and $J_{\p j}$ also only a single sum. Hence,

579: the computation time for the (central) moments is of the same

580: order $O(r \cdot s)$ as for the point estimate (\ref{mi}).

581: %-------------------------------%

582: %\subsection{Exact Monte Carlo}

583: %-------------------------------%

584: ``Exact'' values have been obtained for representative choices of

585: $\t_{ij}$, $r$, $s$, and $\npp$ by Monte Carlo simulation.

The $\t_{ij}:=x_{ij}/x_\pp$ are Dirichlet distributed with parameters
$\n_{ij}$ if the $x_{ij}$ are independent and each $x_{ij}$ follows a
Gamma distribution with shape $\n_{ij}$ and unit scale. See
\cite{Press:92} for how to sample from a Gamma distribution.
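
In formulas, one possible way to set up such a Monte Carlo estimate is sketched below. The number of draws $T$, the draw index $t$ and the sample mean $\bar I$ are notation introduced only for this sketch, and the Gamma variables are taken with shape $\n_{ij}$ and unit scale, which is the standard construction:
\bqan
  x_{ij}^{(t)} &\sim& \mbox{Gamma}(\n_{ij},1)\;\;\mbox{independently},\qquad
  \t_{ij}^{(t)} \;:=\; {x_{ij}^{(t)}\over\sum_{kl}x_{kl}^{(t)}},
  \\
  E[I] &\approx& \bar I \;:=\; {1\over T}\sum_{t=1}^T I(\vec\t^{(t)}),\qquad
  \mbox{Var}[I] \;\approx\; {1\over T}\sum_{t=1}^T\big(I(\vec\t^{(t)})-\bar I\big)^2.
\eqan
The statistical error of such estimates decreases only like $T^{-1/2}$, so ``exact'' is meant up to this sampling error.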

589: %-------------------------------%

590: %\subsection{Numerical accuracy of expansion}

591: %-------------------------------%

592: The variance has been expanded in ${r \cdot s\over \npp}$,

593: so the relative error ${\mbox{\scriptsize

594: Var}[I]_{approx}-\mbox{\scriptsize Var}[I]_{exact}\over

\mbox{\scriptsize Var}[I]_{exact}}$ of the approximations
(\ref{varlodi}) and (\ref{var2ndo}) is of the order of

597: ${r \cdot s\over \npp}$ and $({r \cdot s\over \npp})^2$

respectively, {\em if} $\imath$ and $\jmath$ are dependent. If
they are independent, the leading term (\ref{varlodi}) itself
drops to order $\npp^{-2}$, resulting in a reduced
relative accuracy $O({r \cdot s\over \npp})$ of (\ref{var2ndo}).
Comparison with the Monte Carlo values confirmed an accuracy

603: in the range $({r \cdot s\over\npp})^{1...2}$. The mean

604: (\ref{miexex2}) is exact. Together with the skewness and

605: kurtosis we have a good description for the distribution of

606: the mutual information $p(I|\vec n)$ for not too small sample

607: bin sizes $n_{ij}$.
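
The independent case also allows a simple consistency check, sketched here. The comparison with the $\chi^2$ distribution is a classical large-sample fact about the frequency estimate and is not part of the derivation in this paper. If $\n_{ij}\npp=\n_{i\p}\n_{\p j}$ for all $ij$, all logarithms vanish, so $J=K=M=0$, and $Q=1-\sum_{ij}{\n_{ij}\over\npp}=0$. Then (\ref{mvappr}) and (\ref{var2ndo}) reduce to
\beqn
  E[I] = {(r-1)(s-1)\over 2\npp}+O(\npp^{-2}),\qquad
  \mbox{Var}[I] = {(r-1)(s-1)\over 2(\npp+1)(\npp+2)}+O(\npp^{-3}).
\eeqn
To leading order these coincide with the mean ${d\over 2\npp}$ and variance ${d\over 2\npp^2}$ of a $\chi^2$ variable with $d=(r-1)(s-1)$ degrees of freedom divided by $2\npp$.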

608: %-------------------------------%

609: %\subsection{Useful accuracy}

610: %-------------------------------%

611: We want to conclude with some notes on {\it useful} accuracy. The

612: hypothetical prior sample sizes $\n''_{ij}=\{0,{1\over

613: rs},\odt,1\}$ can all be argued to be non-informative

614: \cite{Gelman:95}. Since the central moments are expansions in

615: $\npp^{-1}$, the next to leading order term can be freely adjusted

616: by adjusting $\n''_{ij}\in[0...1]$.

So one may argue that anything beyond leading order is a matter of
choice, and the leading order terms may be regarded as being as accurate as
we can specify our prior knowledge. On the other hand, exact

620: expressions have the advantage of being safe against

621: cancellations. For instance, leading order of $E[I]$ and $E[I^2]$

622: does not suffice to compute the leading order of $\mbox{Var}[I]$.

623:

624: %------------------------------%

625: \subsubsection*{Acknowledgements}

626: %------------------------------%

627: I want to thank Ivo Kwee for valuable discussions and Marco

628: Zaffalon for encouraging me to investigate this topic. This work

629: was supported by SNF grant 2000-61847.00 to J\"urgen Schmidhuber.

630:

631: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

632: %         Bibliography        %

633: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

634: \begin{thebibliography}{PFTV92}

635:

636: \bibitem[AS74]{Abramowitz:74}

637: M.~Abramowitz and I.~A. Stegun, editors.

638: \newblock {\em Handbook of mathematical functions}.

639: \newblock Dover publications, inc., 1974.

640:

641: \bibitem[Bra99]{Brand:99}

642: M.~Brand.

643: \newblock Structure learning in conditional probability models via an entropic

644:   prior and parameter extinction.

645: \newblock {\em Neural Computation}, 11(5):1155--1182, 1999.

646:

647: \bibitem[Bun96]{Buntine:96}

648: W.~Buntine.

649: \newblock A guide to the literature on learning probabilistic networks from

650:   data.

651: \newblock {\em {IEEE} Transactions on Knowledge and Data Engineering},

652:   8:195--210, 1996.

653:

654: \bibitem[CT91]{Cover:91}

655: T.~M. Cover and J.~A. Thomas.

656: \newblock {\em Elements of Information Theory}.

657: \newblock Wiley Series in Telecommunications. John Wiley \& Sons, New York, NY,

658:   USA, 1991.

659:

660: \bibitem[GCSR95]{Gelman:95}

661: A.~Gelman, J.~B. Carlin, H.~S. Stern, and D.~B. Rubin.

662: \newblock {\em Bayesian Data Analysis.}

663: \newblock Chapman, 1995.

664:

665: \bibitem[Hec98]{Heckerman:98}

666: D.~Heckerman.

667: \newblock A tutorial on learning with {B}ayesian networks.

\newblock {\em Learning in Graphical Models}, pages 301--354, 1998.

669:

670: \bibitem[KJ96]{Kleiter:96}

671: G.~D. Kleiter and R.~Jirousek.

672: \newblock Learning {B}ayesian networks under the control of mutual information.

673: \newblock {\em Proceedings of the 6th International Conference on Information

674:   Processing and Management of Uncertainty in Knowledge-Based Systems

675:   (IPMU-1996)}, pages 985--990, 1996.

676:

677: \bibitem[Kle99]{Kleiter:99}

678: G.~D. Kleiter.

679: \newblock The posterior probability of {B}ayes nets with strong dependences.

680: \newblock {\em Soft Computing}, 3:162--173, 1999.

681:

682: \bibitem[PFTV92]{Press:92}

683: W.~H. Press, B.~P. Flannery, S.~A. Teukolsky, and W.~T. Vetterling.

684: \newblock {\em Numerical Recipes in {C}: The Art of Scientific Computing}.

685: \newblock Cambridge University Press, Cambridge, second edition, 1992.

686:

687: \bibitem[Soo00]{Soofi:00}

688: E.~S. Soofi.

689: \newblock Principal information theoretic approaches.

690: \newblock {\em Journal of the American Statistical Association}, 95:1349--1353,

691:   2000.

692:

693: \bibitem[WW93]{Wolpert:93b}

694: D.~R. Wolf and D.~H. Wolpert.

\newblock Estimating functions of distributions from a finite set of samples,

696:   part 2: Bayes estimators for mutual information, chi-squared, covariance and

697:   other statistics.

698: \newblock Technical Report LANL-LA-UR-93-833, Los Alamos National Laboratory,

699:   1993.

\newblock Also Santa Fe Institute report SFI-TR-93-07-047.

701:

702: \end{thebibliography}

703:

704: \end{document}

705:

706: %---------------------------------------------------------------

707: