
1: \newcommand{\NIPS}[3]{#1}

2: \newcommand{\NEC}[3]{#2}

3: \newcommand{\LANL}[3]{#3}

4:

5: %\newcommand{\FORMAT}[3]{\NIPS{#1}{#2}{#3}}

6: %\newcommand{\FORMAT}[3]{\NEC{#1}{#2}{#3}}

7: \newcommand{\FORMAT}[3]{\LANL{#1}{#2}{#3}}

8:

9: \FORMAT{

10:   \documentclass[fleqn]{article}

11:   \usepackage{times,epsf,graphics,floatflt,wrapfig}

12:   \usepackage{nips99}

13:   }

14: {

15:   \documentclass[fleqn]{article}

16:   \usepackage{times,epsf,graphics,floatflt,wrapfig}

17:   }

18: {

19:   \documentclass[fleqn]{article}

20:   \usepackage{times,epsf,graphics,floatflt,wrapfig}

21:   \usepackage[hyperindex,hyperfigures]{hyperref}

22:   \setlength{\oddsidemargin}{0.35in}

23:   \setlength{\textwidth}{5.75in}

24:   }

25:

26:

27: \FORMAT{\intextsep 0mm}{\intextsep 2mm}{\intextsep2mm}

28: \columnsep 3.5mm

29:

30:

31: \title{Entropy and Inference, Revisited}

32:

33: \author{Ilya Nemenman,$^{1,2}$ Fariel Shafee,$^3$ and William Bialek$^{1,3}$

34:   \\

35:   $^1$NEC Research Institute, 4 Independence Way,

36: \FORMAT{}{\\}{}

37:   Princeton, New Jersey 08540\\

38: $^2$Institute for Theoretical Physics, University of California, Santa Barbara, CA 93106\\

39:   $^3$Department of Physics, Princeton University,

40: \FORMAT{}{\\}{}

41:   Princeton, New

42:   Jersey 08544\\ {\it nemenman@itp.ucsb.edu,

43:     \{fshafee/wbialek\}@princeton.edu}}

44:

45:

46: \begin{document}

47:

48: \maketitle

49:

50:

51:

52: \begin{abstract}

53: We study properties of popular near--uniform (Dirichlet) priors for

54: learning undersampled probability distributions on discrete nonmetric

55: spaces and show that they lead to disastrous results.  However, an

56: Occam--style phase space argument expands the priors into their infinite

57: mixture and resolves most of the observed problems. This leads to a

58: surprisingly good estimator of entropies of discrete distributions.

59: \end{abstract}

60:

61: \FORMAT{}{\newpage}{}

62:

63:

64:

65: Learning a probability distribution from examples is one of the basic

66: problems in data analysis. Common practical approaches introduce a

67: family of parametric models, leading to questions about model

68: selection. In Bayesian inference, computing the total probability of

69: the data arising from a model involves an integration over parameter

70: space, and the resulting ``phase space volume'' automatically

71: discriminates against models with larger numbers of parameters---hence

72: the description of these volume terms as Occam factors

73: \cite{mackay,vijay}.  As we move from finite parameterizations to

74: models that are described by smooth functions, the integrals over

75: parameter space become functional integrals and methods from quantum

76: field theory allow us to do these integrals asymptotically; again the

77: volume in model space consistent with the data is larger for models

78: that are smoother and hence less complex \cite{bcs}.  Further, at

79: least under some conditions the relevant degree of smoothness can be

80: determined self--consistently from the data, so that we approach

81: something like a model independent method for learning a distribution

82: \cite{nb}.

83:

84: The results emphasizing the importance of phase space factors in

85: learning prompt us to look back at a seemingly much simpler problem,

86: namely learning a distribution on a discrete, nonmetric space.  Here

87: the probability distribution is just a list of numbers $\{q_i\}$, $i =

88: 1, 2, \cdots , K$, where $K$ is the number of bins or possibilities.

89: We do not assume any metric on the space, so that a priori there is no

90: reason to believe that any $q_i$ and $q_j$ should be similar.  The

91: task is to learn this distribution from a set of examples, which we

92: can describe as the number of times $n_i$ each possibility is observed

93: in a set of $N= \sum_{i=1}^K n_i$ samples. This problem arises in the

94: context of language, where the index $i$ might label words or phrases,

95: so that there is no natural way to place a metric on the space, nor is

96: it even clear that our intuitions about similarity are consistent with

97: the constraints of a metric space.  Similarly, in bioinformatics the

98: index $i$ might label n--mers of the DNA or amino acid sequence,

99: and although most work in the field is based on metrics for sequence

100: comparison one might like an alternative approach that does not rest

101: on such assumptions.  In the analysis of neural responses, once we fix

102: our time resolution the response becomes a set of discrete ``words,''

103: and estimates of the information content in the response are

104: determined by the probability distribution on this discrete space.

105: What all of these examples have in common is that we often need to

106: draw some conclusions with data sets that are {\em not} in the

107: asymptotic limit $N \gg K$.  Thus, while we might use a large corpus

108: to sample the distribution of words in English by brute force

109: (reaching $N \gg K$ with $K$ the size of the vocabulary), we can

110: hardly do the same for three or four word phrases.

111:

112:

113: In models described by continuous functions, the infinite number of

114: ``possibilities'' can never be overwhelmed by examples; one is saved

115: by the notion of smoothness. Is there some nonmetric analog of this

116: notion that we can apply in the discrete case?  Our intuition is that

117: information theoretic quantities may play this role.  If we have a

118: joint distribution of two variables, the analog of a smooth

119: distribution would be one which does not have too much mutual

120: information between these variables.  Even more simply, we might say that

121: smooth distributions have large entropy.  While the idea of ``maximum

122: entropy inference'' is common \cite{maxent}, the interplay between

123: constraints on the entropy and the volume in the space of models seems

124: not to have been considered.  As we shall explain, phase space factors

125: alone imply that seemingly sensible, more or less uniform priors on the

126: space of discrete probability distributions correspond to disastrously

127: singular prior hypotheses about the entropy of the underlying

128: distribution.  We argue that reliable inference outside the asymptotic

129: regime $N \gg K$ requires a more uniform prior on the entropy, and we

130: offer one way of doing this.  While many distributions are consistent

131: with the data when $N \leq K$, we provide empirical evidence that this

132: flattening of the entropic prior allows us to make surprisingly reliable

133: statements about the entropy itself in this regime.

134:

135: At the risk of being pedantic, we state very explicitly what we mean by

136: uniform or nearly uniform priors on the space of distributions.

137: The natural ``uniform'' prior  is given by

138: \begin{equation}

139:   {\mathcal P}_{\rm u}(\{q_i\}) = {1\over Z_{\rm u}}\,\delta\left(

140:     1 - \sum_{i=1}^K q_i\right), \;\; Z_{\rm u} = \int_{\mathcal

141:     A}dq_1 dq_2 \cdots  dq_K

142:   \,\delta\left( 1 - \sum_{i=1}^K q_i\right)

143: \end{equation}

144: where the delta function imposes the normalization, $Z_{\rm u}$ is the

145: total volume in the space of models, and the integration domain

146: ${\mathcal A}$ is such that each $q_i$ varies in the range $[0,1]$.

147: Note that, because of the normalization constraint, an {\em

148:   individual} $q_i$ chosen from this distribution in fact is not

149: uniformly distributed---this is also an example of phase space

150: effects, since in choosing one $q_i$ we constrain all the other

151: $\{q_{j\neq i}\}$. What we mean by uniformity is that all

152: distributions that obey the normalization constraint are equally

153: likely a priori.

154:

155: Inference with this uniform prior is straightforward.  If our examples

156: come independently from $\{ q_i\}$, then we calculate the probability

157: of the model $\{ q_i\}$ with the usual Bayes rule: \footnote{If the data

158: are unordered,  extra combinatorial factors have to be included in $P(\{

159: n_i\} | \{ q_i\})$. However, these cancel  immediately in later

160: expressions.}

161: \begin{equation}

162:   P(\{ q_i\}| \{ n_i\} ) = \frac{P(\{ n_i\} | \{ q_i\})

163:     {\mathcal P}_{\rm u}(\{q_i\})}{P_{\rm u}(\{ n_i\})}, \;\;

164:   P(\{ n_i\} | \{ q_i\}) = \prod_{i=1}^K (q_i)^{n_i}.

165: \end{equation}

166: If we want the best estimate of the probability $q_i$ in the least

167: squares sense, then we should compute the conditional mean, and this

168: can be done exactly, so that \cite{ww,thesis}

169: \vspace{-0.5mm}

170: \begin{equation}

171: \langle q_i\rangle = {{n_i +1}\over{N+K}} .

172: \label{laprule}

173: \end{equation}

174: Thus we can think of inference with this uniform prior as setting

175: probabilities equal to the observed frequencies, but with an ``extra

176: count'' in every bin.  This sensible procedure was first introduced by

177: Laplace \cite{laplace}. It has the desirable property that events which have not been observed are not automatically assigned probability zero.

178:

179:

180: A natural generalization of these ideas is to consider priors that

81: have a power--law dependence on the probabilities, the so--called Dirichlet family of priors:

182: \vspace{-0.5mm}

183: \begin{equation}

184: {\mathcal P}_\beta(\{q_i\}) = {1\over Z(\beta)}

185: \delta\left( 1 - \sum_{i=1}^K q_i\right)

186: \prod_{i=1}^K q_i^{\beta-1} \,,

187: \label{P(q)}

188: \end{equation}

189:

190: It is interesting to see what typical distributions from these priors

191: look like. Even though different $q_i$'s are not independent random

192: variables due to the normalizing $\delta$--function, generation of

193: random distributions is still easy: one can show that if $q_i$'s are

194: generated successively (starting from $i=1$ and proceeding up to

195: $i=K$) from the Beta--distribution

196: \begin{equation}

197:   P(q_i) = B\left(\frac{q_i}{1-\sum_{j<i} q_j}; \beta, (K-i)\beta

198:   \right),\;\;\;\;  B\left(x; a,b \right) =

199:   \frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}\,,

200: \label{betadistr}

201: \end{equation}

202:

203: \begin{wrapfigure}{r}{63mm}

204:   \vspace{-0mm}

205:   \centerline{\epsfxsize=1.0\hsize\epsffile{Q_example.eps}}

206:   \vspace{-3.5mm}

207:   \caption{Typical distributions, $K=1000$.}

208:   \FORMAT{\vspace{-1mm}}{}{}

209:   \label{example}

210: \end{wrapfigure}

211:

212: \noindent then the probability of the whole sequence $\{q_i\}$ is ${\mathcal

213:   P}_{\beta}(\{q_i\})$.  Fig.~\ref{example} shows some typical

214: distributions generated this way. They represent different regions of

215: the range of possible entropies: low entropy ($\sim 1$ bit, where only

216: a few bins have observable probabilities), entropy in the middle of

217: the possible range, and entropy in the vicinity of the maximum,

218: $\log_2 K$.  When learning an unknown distribution, we usually have no

219: a priori reason to expect it to look like only one of these

220: possibilities, but choosing $\beta$ pretty much fixes allowed

221: ``shapes.''  This will be a focal point of our discussion.

222:

223:

224: Even though distributions look different, inference with all priors

225: Eq.~(\ref{P(q)}) is similar \cite{ww,thesis}:

226: \begin{equation}

227: \langle q_i\rangle_\beta = {{n_i

228: +\beta}\over{N+\kappa}}\,,\;\;\;\; \kappa = K\beta.

229: \label{estim}

230: \end{equation}

31: This simple modification of Laplace's rule, Eq.~(\ref{laprule}),

32: which allows us to vary the probability assigned to the outcomes not yet

33: seen, was first examined by Hardy and Lidstone \cite{hardy,lidstone}.

34: Together with Laplace's formula, $\beta=1$, this family includes the

235: usual maximum likelihood estimator (MLE), $\beta \to 0$, that identifies

236: probabilities with frequencies, as well as the Jeffreys' or

237: Krichevsky--Trofimov (KT) estimator, $\beta=1/2$ \cite{jeffreys,kt,wst},

238: the Schurmann--Grassberger (SG) estimator, $\beta=1/K$ \cite{sg}, and

239: other popular choices.

240:

241:

242:

243: To understand why inference in the family of priors defined by

244: Eq.~(\ref{P(q)}) is unreliable, consider the entropy of a distribution

245: drawn at random from this ensemble.  Ideally we would like to compute

246: this whole a priori distribution of entropies,

247: \begin{equation}

248: {\mathcal  P}_\beta (S) = \int dq_1  dq_2 \cdots dq_K \,

49: {\mathcal P}_\beta(\{q_i\})

250: \,\delta\left[

251: S + \sum_{i =1}^K q_i\log_2 q_i \right] ,

252: \end{equation}

253: but this is quite difficult. However, as noted by Wolpert and Wolf

254: \cite{ww}, one can compute the moments of ${\mathcal P}_\beta (S)$

255: rather easily.  Transcribing their results to the present notation

256: (and correcting some small errors), we find:

257: \begin{eqnarray}

258:   \xi(\beta)  \equiv  \langle\, S [n_i =0]\, \rangle_\beta  &=&

259:   \psi_0(\kappa+1)

260:   -\psi_0(\beta+1) \, ,

261:   \label{Sap}

262:   \\

263:   \sigma^2(\beta) \equiv \langle \, (\delta S)^2  [n_i =0] \rangle_\beta

264:      &=&

265:   \frac{\beta+1}{\kappa +

266:     1}\, \psi_1(\beta+1) -\psi_1(\kappa+1) \,,

267:   \label{dS2ap}

268: \end{eqnarray}

269: \vspace{-0.5mm}

270: where $\psi_m(x) = (d/dx)^{m+1} \log_2 \Gamma(x)$ are the polygamma

271: functions.

272:

273:

274: \begin{wrapfigure}{L}{63mm}

275:   \vspace{-1mm}

276:   \centerline{\epsfxsize=1.0\hsize\epsffile{mean_var.eps}}

277:   \vspace{-4mm}

278:   \caption{$\xi(\beta) / \log_2

279:     K$ and $\sigma(\beta)$ as functions of $\beta$ and $K$; gray bands

280:     are the region of $\pm \sigma(\beta)$ around the mean. Note the

281:     transition from the logarithmic to the linear scale at

282:     $\beta=0.25$ in the insert.}

283:   \FORMAT{\vspace{1mm}}{}{}

284: \label{Sapriori}

285: \end{wrapfigure}

286:

87: This behavior of the moments is shown in Fig.~\ref{Sapriori}.  We are

288: faced with a striking observation: a priori distributions of entropies

289: in the power--law priors are extremely peaked for even moderately

290: large $K$. Indeed, as a simple analysis shows, their maximum standard

291: deviation of approximately 0.61 bits is attained at $\beta \approx

292: 1/K$, where $\xi(\beta) \approx 1/\ln 2$ bits. This has to be compared

293: with the possible range of entropies, $[0, \log_2 K]$, which is

294: asymptotically large with $K$.  Even worse, for any fixed $\beta$ and

295: sufficiently large $K$, $\xi(\beta) = \log_2 K - O(K^0)$, and

296: $\sigma(\beta) \propto 1/\sqrt{\kappa}$. Similarly, if $K$ is large,

297: but $\kappa$ is small, then $\xi(\beta) \propto \kappa$, and

298: $\sigma(\beta) \propto \sqrt{\kappa}$.  This paints a lively picture:

299: varying $\beta$ between $0$ and $\infty$ results in a smooth variation

300: of $\xi$, the a priori expectation of the entropy, from $0$ to $S_{\rm

301:   max}= \log_2 K$.  Moreover, for large $K$, the standard deviation of

302: ${\mathcal P}_{\beta} (S)$ is always negligible relative to the

303: possible range of entropies, and it is negligible even absolutely for

304: $\xi\gg 1$ ($\beta \gg 1/K$). Thus a seemingly innocent choice of the

305: prior, Eq.~(\ref{P(q)}), leads to a disaster: {\em fixing $\beta$

306:   specifies the entropy almost uniquely}.  Furthermore, the situation

307: persists even after we observe some data: {\em until the distribution

308:   is well sampled, our estimate of the entropy is dominated by the prior!}

309:

310: Thus it is clear that all commonly used estimators mentioned above

311: have a problem. While they may or may not provide a reliable estimate

312: of the distribution $\{q_i\}$\footnote{In any case, the answer to

313:   this question depends mostly on the ``metric'' chosen to measure

314:   reliability. Minimization of bias, variance, or information cost

315:   (Kullback--Leibler divergence between the target distribution and

316:   the estimate) leads to very different ``best'' estimators.}, they

317: are definitely a poor tool to learn entropies.  Unfortunately, often

318: we are interested precisely in these entropies or similar

319: information--theoretic quantities, as in the examples (neural code,

320: language, and bio\-informatics) we briefly mentioned earlier.

321:

322: Are the usual estimators really this bad? Consider this: for the MLE

323: ($\beta=0$), Eqs.~(\ref{Sap}, \ref{dS2ap}) are formally wrong since it

324: is impossible to normalize ${\mathcal P}_0(\{q_i\})$.  However, the

325: prediction that ${\mathcal P}_0(S) = \delta(S)$ still holds. Indeed,

326: $S_{\rm ML}$, the entropy of the ML distribution, is zero even for

327: $N=1$, let alone for $N=0$. In general, it is well known that $S_{\rm

328:   ML}$ always underestimates the actual value of the entropy, and the

329: correction  \vspace{-0.5mm}

330: \begin{equation}

331:   S = S_{\rm ML} + \frac{K^*}{2N} + O \left( \frac{1}{N^2} \right)

332:   \label{corr}

333: \end{equation}

334: \vspace{-0.5mm} is usually used (cf.~\cite{sg}).  Here we must set

335: $K^*=K-1$ to have an asymptotically correct result.  Unfortunately in

336: an undersampled regime, $N \ll K$, this is a disaster. To alleviate

37: the problem, various authors have suggested determining the dependence

338: $K^*=K^*(K)$ by various (rather ad hoc) empirical \cite{srrb} or

339: pseudo--Bayesian techniques \cite{pt}.  However, then there is no

340: principled way to estimate both the residual bias and the error of the

341: estimator.

342:

343:

44: The situation is even worse for Laplace's rule, $\beta=1$. We were

345: unable to find any results in the literature that would show a clear

346: understanding of the effects of the prior on the entropy estimate,

347: $S_{\rm L}$.  And these effects are enormous: the a priori

348: distribution of the entropy has $\sigma(1) \sim 1/\sqrt{K}$ and is

349: almost $\delta$-like. This translates into a very certain, but

350: nonetheless possibly wrong, estimate of the entropy. We believe that

351: this type of error (cf.~Fig.~\ref{fixedbeta}) has been overlooked in

352: some previous literature.

353:

354:

355:

56: The Schurmann--Grassberger estimator, $\beta=1/K$, deserves special

357: attention. The variance of ${\mathcal P}_{\beta}(S)$ is maximized near

358: this value of $\beta$ (cf.~Fig.~\ref{Sapriori}).  Thus the SG

359: estimator results in the most uniform a priori expectation of $S$

360: possible for the power--law priors, and consequently in the least

361: bias. We suspect that this feature is responsible for a remark in

362: Ref.~\cite{sg} that this $\beta$ was empirically the best for studying

363: printed texts. But even the SG estimator is flawed: it is biased

364: towards (roughly) $1/\ln 2$, and it is still a priori rather narrow.

365:

366: \begin{wrapfigure}{r}{63mm}

367:   \vspace{-1mm}

368:   \centerline{\epsfxsize=1.0\hsize\epsffile{diffbeta.eps}}

369:   \vspace{-5mm}

370:   \caption{Learning the $\beta=0.02$ distribution from  Fig.~\ref{example}

371:     with $\beta=0.001, 0.02, 1$. The actual error of the estimators is

372:     plotted; the error bars are the standard deviations of the

373:     posteriors. The ``wrong'' estimators are very certain but

374:     nonetheless incorrect.}

375:   \FORMAT{\vspace{-2mm}}{}{\vspace{-2mm} }

376:   \label{fixedbeta}

377: \end{wrapfigure}

378:

379:

380:

381:

382:

383: Summarizing, we conclude that simple power--law priors,

384: Eq.~(\ref{P(q)}), must not be used to learn entropies when there is no

385: strong a priori knowledge to back them up. On the other hand, they are

86: the only priors we know of that allow one to calculate $\langle q_i

387: \rangle$, $\langle S \rangle$, $\langle \chi^2 \rangle$, \dots exactly

388: \cite{ww}. Is there a way to resolve the problem of peakedness of

389: ${\mathcal P}_{\beta}(S)$ without throwing away their analytical ease?

390: One approach would be to use $ {\mathcal P}^{\rm

391:   flat}_{\beta}(\{q_i\}) = \frac{{\mathcal P}_{\beta}(\{q_i\})

392:   }{{\mathcal P}_{\beta}(S[q_i])} \; {\mathcal P}^{\rm

93:   actual}(S[q_i])\,$ as a prior on $\{q_i\}$. This has the feature that

394: the a priori distribution of $S$ deviates from uniformity only due to

395: our actual knowledge ${\mathcal P}^{\rm actual} (S[q_i])$, but not in

396: the way ${\mathcal P}_{\beta}(S)$ does.  However, as we already

397: mentioned, ${\mathcal P}_{\beta}(S[q_i])$ is yet to be calculated.

398:

399:

00: Another route to a flat prior is to write ${\mathcal P}(S) = 1 = \int

401: \delta(S - \xi) d \xi$. If we find a family of priors ${\mathcal

402:   P}(\{q_i\}, {\rm parameters})$ that result in a $\delta$-function

403: over $S$, and if changing the parameters moves the peak across the

404: whole range of entropies uniformly, we may be able to use this.

405: Luckily, ${\mathcal P}_{\beta}(S)$ is almost a

06: $\delta$-function!~\footnote{The approximation deteriorates as

407:   $\beta \to 0$ since $\sigma(\beta)$ becomes $O(1)$ before dropping

408:   to zero.  Even worse, ${\mathcal P}_{\beta}(S)$ is skewed at small

409:   $\beta$. This accumulates an extra weight at $S=0$.  Our approach to

410:   dealing with these problems is to ignore them while the posterior

411:   integrals are dominated by $\beta$'s that are far away from zero.

412:   This was always the case in our simulations, but is an open

413: question for the analysis of real data.} In addition, changing

414: $\beta$ results in changing $\xi(\beta) = \langle\, S [n_i=0] \,

15: \rangle_\beta$ across the whole range $[0, \log_2 K]$. So we may hope

416: that the prior \footnote{Priors that are formed as weighted sums of the

417: different members of the Dirichlet family are usually called {\em

418: Dirichlet mixture priors}. They have been used to estimate probability

419: distributions of, for example, protein sequences \cite{mixt}.

420: Equation (\ref{Pflat}), an {\em infinite} mixture, is a further

421: generalization, and, to our knowledge, it has not been studied before.}

422: \begin{equation}

423: {\mathcal P} (\{q_i\};\beta) = {1\over Z}\,

424: \delta\left( 1 - \sum_{i=1}^K q_i\right)

425: \prod_{i=1}^K q_i^{\beta-1} \frac{d \xi(\beta)}{d\beta} \,{\mathcal P}(\beta)

426: \label{Pflat}

427: \end{equation}

428: may do the trick and estimate entropy reliably even for small $N$, and

429: even for distributions that are atypical for any one $\beta$. We have less

430: reason, however, to expect that this will give an equally reliable

431: estimator of the atypical distributions themselves.$^2$ Note the term $d\xi/d\beta$  in Eq.~(\ref{Pflat}). It is there because $\xi$, not $\beta$, measures the position of the entropy density peak.

432:

433:

434: Inference with the prior, Eq.~(\ref{Pflat}), involves additional

435: averaging over $\beta$ (or, equivalently, $\xi$), but is nevertheless

436: straightforward. The a posteriori moments of the entropy are

437: \begin{eqnarray}

438:   \widehat{S^m} &=& \frac{\int d\xi\,

39:     \rho(\xi,[n_i]) \langle\, S^m [n_i]\, \rangle_{\beta(\xi)}}

440:   {\int d\xi\, \rho(\xi,[n_i])}\,,\;\;\;\mbox{where}

441:   \label{Shat}

442:   \\

443:   \rho(\xi, [n_i]) &=& {\mathcal P}\left(\beta\left(\xi\right)\right)

444:   \frac{\Gamma(\kappa(\xi))}{\Gamma(N+\kappa(\xi))}\,

445:   \prod_{i=1}^K \frac{\Gamma(n_i+\beta(\xi))}{\Gamma(\beta(\xi))}\,.

446:   \label{rho}

447: \end{eqnarray}

448: Here the moments $\langle\, S^m [n_i]\, \rangle_{\beta(\xi)}$ are

449: calculated at fixed $\beta$ according to the (corrected) formulas of

450: Wolpert and Wolf \cite{ww}.  We can view this inference scheme as

451: follows: first, one sets the value of $\beta$ and calculates the

452: expectation value (or other moments) of the entropy at this $\beta$.

453: For small $N$, the expectations will be very close to their a priori

454: values due to the peakedness of ${\mathcal P}_{\beta}(S)$.

455: Afterwards, one integrates over $\beta(\xi)$ with the density

456: $\rho(\xi)$, which includes our a priori expectations about the

457: entropy of the distribution we are studying [${\mathcal

458:   P}\left(\beta\left(\xi\right)\right)$], as well as the evidence for

459: a particular value of $\beta$ [$\Gamma$-terms in Eq.~(\ref{rho})].

460:

461: The crucial point is the behavior of the evidence. If it has a

462: pronounced peak at some $\beta_{\rm cl}$, then the integrals over

463: $\beta$ are dominated by the vicinity of the peak, $\widehat{S}$ is

464: close to $\xi(\beta_{\rm cl})$, and the variance of the estimator is

65: small. In other words, the data ``select'' some value of $\beta$, much in

466: the spirit of Refs.~\cite{mackay} -- \cite{nb}.  However, this

467: scenario may fail in two ways.  First, there may be no peak in the

468: evidence; this will result in a very wide posterior and poor

469: inference. Second, the posterior density may be dominated by $\beta

470: \to 0$, which corresponds to MLE, the best possible fit to the data,

471: and is a discrete analog of overfitting.  While all these situations

472: are possible, we claim that generically the evidence is well--behaved.

473: Indeed, while small $\beta$ increases the fit to the data, it also

474: increases the phase space volume of all allowed distributions and thus

75: decreases the probability of each particular one [remember that $\langle

476: q_i \rangle_{\beta}$ has an extra $\beta$ counts in each bin, thus

477: distributions with $q_i < \beta/(N+\kappa)$ are strongly suppressed].

478: The fight between the ``goodness of fit'' and the phase space volume

79: should then result in some non--trivial $\beta_{\rm cl}$, set by factors

480: $\propto N$ in the exponent of the integrand.

481:

482:

483: Figure~\ref{learning} shows how the prior, Eq.~(\ref{Pflat}), performs

484: on some of the many distributions we tested. The left panel describes

485: learning of distributions that are typical in the prior ${\mathcal

486:   P}_{\beta}(\{q_i\})$ and, therefore, are also likely in ${\mathcal

487:   P}(\{q_i\};\beta)$. Thus we may expect a reasonable performance, but

488: the real results exceed all expectations: for all three cases, the

489: actual relative error drops to the $10\%$ level at $N$ as low as 30

490: (recall that $K=1000$, so we only have $\sim 0.03$ data points per bin

491: on average)! To put this in perspective, simple estimates like fixed

492: $\beta$ ones, MLE, and MLE corrected as in Eq.~(\ref{corr}) with $K^*$

493: equal to the number of nonzero $n_i$'s produce an error so big that it

494: puts them off the axes until $N >100$. \footnote{More work is needed to

95:   compare our estimator to more complex techniques, such as those in

96:   Refs.~\cite{srrb,pt}.}  Our results have two more nice features: the

497: estimator seems to know its error pretty well, and it is almost

498: completely unbiased.

499:

500:

501: \begin{figure}[t]

502:   \begin{center}

503:     \begin{picture}(60,5)(0,0)

504:       \put(-60,0){(a)}

505:       \put(120,0){(b)}

506:     \end{picture}

507:   \end{center}

508:   \vspace{-1mm}

509:   \centerline{\epsfxsize=.49\hsize\epsffile{correct.eps}

510:     \epsfxsize=.49\hsize\epsffile{incorrect.eps}}

511:   \vspace{-4mm}

512:   \caption{Learning entropies with the prior Eq.~(\ref{Pflat}) and

513:     ${\mathcal P}(\beta)=1$. The actual relative errors of the

514:     estimator are plotted; the error bars are the relative widths of

515:     the posteriors. (a) Distributions from Fig.~\ref{example}. (b)

516:     Distributions atypical in the prior.  Note that while

517:     $\widehat{S}$ may be safely calculated as just $\langle S

518:     \rangle_{\beta_{\rm cl}}$, one has to do an honest integration

519:     over $\beta$ to get $\widehat{S^2}$ and the error bars.  Indeed,

520:     since ${\mathcal P}_{\beta} (S)$ is almost a $\delta$-function,

521:     the uncertainty at any fixed $\beta$ is very small (see

522:     Fig.~\ref{fixedbeta}).}

523:   \label{learning}

524:   \vspace{-4mm}

525: \end{figure}

526:

527:

528: One might be puzzled at how it is possible to estimate anything in a

529: 1000--bin distribution with just a few samples: the distribution is

530: completely unspecified for low $N$! The point is that we are not

531: trying to learn the distribution --- in the absence of additional prior

532: information this would, indeed, take $N\gg K$ --- but to estimate

533: just one of its characteristics. It is less surprising that one number

534: can be learned well with only a handful of measurements. In practice

535: the algorithm builds its estimate based on the number of coinciding

536: samples (multiple coincidences are likely only for small $\beta$), as

37: in Ma's approach to entropy estimation from simulations of physical

538: systems

539: \cite{ma}.

540:

541:

542:

543:

544: What will happen if the algorithm is fed with data from a distribution

545: $\{\tilde{q}_i\}$ that is strongly atypical in ${\mathcal

546:   P}(\{q_i\};\beta)$? Since there is no $\{\tilde{q}_i\}$ in our

547: prior, its estimate may suffer.  Nonetheless, for any

548: $\{\tilde{q}_i\}$, there is some $\beta$ which produces distributions

549: with the same mean entropy as $S[\tilde{q}_i]$.  Such $\beta$ should

550: be determined in the usual fight between the ``goodness of fit'' and

551: the Occam factors, and the correct value of entropy will follow.

552: However, there will be an important distinction from the ``correct

553: prior'' cases. The value of $\beta$ indexes available phase space

554: volumes, and thus the smoothness (complexity) of the model class

555: \cite{bnt}. In the case of discrete distributions, smoothness is the

556: absence of high peaks. Thus data with faster decaying Zipf plots

557: (plots of bins' occupancy vs.\ occupancy rank $i$) are rougher. The priors ${\mathcal P}_{\beta}(\{q_i\})$ cannot account for all possible roughnesses. Indeed, they only generate distributions for which the expected number of bins $\nu$ with the probability mass less than some $q$ is given by $\nu(q) = K B(q, \beta, \kappa -\beta)$, where $B$ is the familiar incomplete Beta function, as in Eq.~(\ref{betadistr}). This means that the expected rank ordering for small and large ranks is

558: \begin{eqnarray}

559: q_i &\approx& 1 - \left[\frac{ \beta B(\beta, \kappa - \beta )  (K-1) \,i}

560: {K} \right] ^{1/(\kappa-\beta)}, \,\,\,\, i\ll K\,,

561: \label{left}\\

562: q_i &\approx& \left[ \frac{ \beta B(\beta, \kappa - \beta )  (K-i+1)}

563: {K}\right]^{1/\beta},\,\,\,\, K-i+1 \ll K\,.

564: \end{eqnarray}

565: In an undersampled regime we can observe only the first of the behaviors. Therefore, any

566: distribution with $q_i$ decaying

567: faster (rougher) or slower (smoother) than Eq.~(\ref{left}) for some $\beta$ cannot be explained

568: well with fixed $\beta_{\rm cl}$ for different $N$.  So, unlike in the cases of learning  data that are typical in ${\mathcal P}_{\beta}(\{q_i\})$, we should

569: expect to see $\beta_{\rm cl}$ growing (falling) for qualitatively

570: smoother (rougher) cases as $N$ grows.

571:

572: \FORMAT{

573: \tabcolsep 0.5mm

574: \begin{wraptable}{r}{40.5mm}{

575: %\begin{floatingtable}{

576: %\begin{tabular}{ccccccc}

577: %$N$  &0.0007& 0.02 & 1.0  & 1/2 full & Zipf & rough\\ \hline

578: %{\small units}  & $\cdot 10^{-4}$ & $\cdot 10^{-2}$ & $\cdot 10^{-0}$ &

579: %$\cdot 10^{-2}$ & $\cdot 10^{-1}$ & $\cdot 10^{-3}$ \\ \hline

580: %10   & 4.3  & 4.1  & 2773 & 1.7      & 1907 & 16.8\\

581: %30   & 6.1  & 1.9  & 0.74 & 2.2      & 0.99 & 11.5\\

582: %100  & 4.3  & 2.3  & 0.80 & 2.4      & 0.86 & 12.9\\

583: %300  & 3.4  & 2.0  & 1.12 & 2.2      & 1.36 & 8.3 \\

584: %1000 & 5.9  & 2.0  & 0.96 & 2.1      & 2.24 & 6.4 \\

585: %3000 & 6.3  & 1.9  & 0.99 & 1.9      & 3.36 & 5.4 \\

586: %10000& 1.0  & 1.8  & 0.99 & 2.0      & 4.89 & 4.5 \\

587: %\end{tabular}

588: \begin{tabular}{cccc}

589: $N$  & 1/2 full & Zipf & rough\\ \hline

590: {\small units} & $\cdot 10^{-2}$ & $\cdot 10^{-1}$ & $\cdot 10^{-3}$ \\ \hline

591: 10   & 1.7      & 1907 & 16.8\\

592: 30   & 2.2      & 0.99 & 11.5\\

593: 100  & 2.4      & 0.86 & 12.9\\

594: 300  & 2.2      & 1.36 & 8.3 \\

595: 1000 & 2.1      & 2.24 & 6.4 \\

596: 3000 & 1.9      & 3.36 & 5.4 \\

597: 10000& 2.0      & 4.89 & 4.5 \\

598: \end{tabular}}

599: \vspace{-3mm}

600: \caption{$\beta_{\rm cl}$ for solutions shown in Fig.~\ref{learning}(b).}

601: \label{betacl}

602: \end{wraptable}}{}{}

603:

604: Figure~\ref{learning}(b) and Tbl.~\ref{betacl} illustrate these

605: points. First, we study the $\beta=0.02$ distribution from

606: Fig.~\ref{example}. However, we added 1000 extra bins, each with

607: $q_i=0$.  Our estimator performs remarkably well, and $\beta_{\rm cl}$

608: does not drift because the ranking law remains the same. Then we turn

609: to the famous Zipf's distribution, so common in Nature. It has $n_i

610: \propto 1/i$, which is qualitatively smoother than our prior allows.

611: Correspondingly, we get an upwards drift in $\beta_{\rm cl}$. Finally,

612: we analyze a ``rough'' distribution, which has $q_i \propto 50 - 4(\ln

613: i)^2$, and $\beta_{\rm cl}$ drifts downwards. Clearly, one would want

614: to predict the dependence $\beta_{\rm cl}(N)$ analytically, but this

615: requires calculation of the predictive information (complexity) for the

616: involved distributions \cite{bnt} and is left for future work. Notice that the entropy estimator for atypical

617: \FORMAT{}{}{

618: \tabcolsep 0.5mm

619: \begin{wraptable}{r}{40.5mm}{

620: \begin{tabular}{cccc}

621: $N$  & 1/2 full & Zipf & rough\\ \hline

622: {\small units} & $\cdot 10^{-2}$ & $\cdot 10^{-1}$ & $\cdot 10^{-3}$ \\ \hline

623: 10   & 1.7      & 1907 & 16.8\\

624: 30   & 2.2      & 0.99 & 11.5\\

625: 100  & 2.4      & 0.86 & 12.9\\

626: 300  & 2.2      & 1.36 & 8.3 \\

627: 1000 & 2.1      & 2.24 & 6.4 \\

628: 3000 & 1.9      & 3.36 & 5.4 \\

629: 10000& 2.0      & 4.89 & 4.5 \\

630: \end{tabular}}

631: \vspace{-3mm}

632: \caption{$\beta_{\rm cl}$ for solutions shown in Fig.~\ref{learning}(b).}

633: \label{betacl}

634: \FORMAT{}{}{\vspace{-3mm}}

635: \end{wraptable}}

636:  cases is almost as

637: good as for typical ones.  A possible exception is the $N=100$--$1000$

638: points for the Zipf distribution, which are about two standard

639: deviations off. We saw similar effects in some other ``smooth'' cases

640: also.  This may be another manifestation of an observation made in

641: Ref.~\cite{nb}: smooth priors can easily adapt to rough distributions,

642: but there is a limit to the smoothness beyond which rough priors

643: become inaccurate.

644:

645:

646:

647: To summarize, an analysis of a priori entropy statistics in common

648: power--law Bayesian estimators revealed some very undesirable features. We are fortunate, however, that these minuses can be easily

649: turned into pluses, and the resulting estimator of entropy is precise,

650: knows its own error, and gives amazing results for a very large class of

651: distributions.

652:

653:

654:

655:

656: \section*{Acknowledgements}

657: We thank Vijay Balasubramanian, Curtis Callan, Adrienne Fairhall, Tim

658: Holy, Jonathan Miller, Vipul Periwal, Steve Strong, and Naftali Tishby for useful

659: discussions. I.\ N.\ was supported in part by NSF Grant No.\ PHY99-07949 to the Institute for Theoretical Physics.

660:

661:

662:

663: \begin{thebibliography}{99}

664: \itemsep 0mm

665: {\small

666:     \bibitem{mackay}\newblock{D.~MacKay, {\it Neural Comp.} {\bf 4},

667:     415--448 (1992).}

668:

669:     \bibitem{vijay}\newblock{V.~Balasubramanian, {\em Neural Comp.}

670:     {\bf 9}, 349--368 (1997)\FORMAT{.}{, {\tt \small

671:         adap-org/9601001}.}{, {\tt \small adap-org/9601001}.}}

672:

673:     \bibitem{bcs}\newblock{W.~Bialek, C.~Callan, and S.~Strong, {\it

674:       Phys.~Rev.~Lett.}  {\bf 77}, 4693--4697 (1996)\FORMAT{.}{, {\tt

675:         \small cond-mat/9607180}.}{, {\tt \small cond-mat/9607180}.}}

676:

677:     \bibitem{nb}\newblock{I.~Nemenman and W.~Bialek, {\it Advances in

678:       Neural Inf.\ Processing Systems} {\bf 13}, 287--293 (2001)\FORMAT{.}{,

679:       {\tt \small cond-mat/0009165}.}{, {\tt \small

680:         cond-mat/0009165}.}}

681:

682:     \bibitem{maxent}\newblock{J.~Skilling, in {\it Maximum entropy and

683:       Bayesian methods,} J.~Skilling ed. (Kluwer Academic Publ.,

684:     Amsterdam, 1989), pp.~45--52.}

685:

686:     \bibitem{ww}\newblock{D.~Wolpert and D.~Wolf, {\it Phys.~Rev.~E}

687:     {\bf 52}, 6841--6854 (1995)\FORMAT{.}{, {\tt \small

688:         comp-gas/9403001}.}{, {\tt \small comp-gas/9403001}.}}

689:

690:     \bibitem{thesis}\newblock{I.~Nemenman, Ph.D. Thesis, Princeton,

691:     (2000), ch.~3, \FORMAT{\small

692:       http://arXiv.org/abs/physics/0009032} {\tt \small

693:       physics/0009032} {\tt \small physics/0009032}.}

694:

695: \bibitem{laplace}\newblock{P.~de Laplace, marquis de, {\em Essai philosophique sur les probabilit\'es} (Courcier, Paris, 1814), trans.\ by F.~Truscott and F.~Emory, {\em A philosophical essay on probabilities}  (Dover, New York, 1951).}

696:

697: \bibitem{hardy}\newblock{G.~Hardy, {\em Insurance Record} (1889), reprinted in {\em Trans.~Fac.~Actuaries} {\bf 8} (1920).}

698:

699: \bibitem{lidstone}\newblock{G.~Lidstone, {\em Trans.~Fac.~Actuaries} {\bf 8}, 182--192 (1920).}%Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities.

700:

701: \bibitem{jeffreys}\newblock{H.~Jeffreys, {\em Proc.~Roy.~Soc.~(London) A} {\bf 186}, 453--461 (1946).} %An invariant form for the prior probability in estimation problems.

702:

703: \bibitem{kt}\newblock{R.~Krichevskii and V.~Trofimov, {\em IEEE Trans.\ Inf.\ Thy.} {\bf  27}, 199--207 (1981).}

704:

705:     \bibitem{wst}\newblock{F.~Willems, Y.~Shtarkov, and T.~Tjalkens,

706:     {\it IEEE Trans.\ Inf.\ Thy.} {\bf 41}, 653--664 (1995).}

707:

708:     \bibitem{sg}\newblock{T.~Sch\"urmann and P.~Grassberger, {\it Chaos}

709:     {\bf 6}, 414--427 (1996).}

710:

711:     \bibitem{srrb}\newblock{S.~Strong, R.\ Koberle, R.\ de Ruyter van Steveninck, and W.\ Bialek, {\em Phys.\ Rev.\ Lett.}

712:     {\bf 80}, 197--200 (1998)\FORMAT{.}{, {\tt \small

713:         cond-mat/9603127}.}{, {\tt \small cond-mat/9603127}.}}

714:

715:     \bibitem{pt}\newblock{S.~Panzeri and A.~Treves, {\em Network:

716:       Comput. in Neural Syst.} {\bf 7}, 87--107 (1996).}

717:

718: \bibitem{mixt}\newblock{K.\ Sj\"olander, K.\ Karplus, M.\ Brown, R.\ Hughey, A.\ Krogh, I.\ S.\ Mian, and D.\ Haussler,

719: {\em Computer Applications in the Biosciences (CABIOS)} {\bf 12}, 327--345 (1996).}

720:

721:     \bibitem{ma}\newblock{S.~Ma, {\em J.\ Stat.\ Phys.} {\bf 26}, 221

722:     (1981).}

723:

724:     \bibitem{bnt}\newblock{W.~Bialek, I.~Nemenman, and N.~Tishby, {\em Neural Comp.} {\bf 13}, 2409--2463 (2001)\FORMAT{.}{, {\tt

725:         \small physics/0007070}.}{, {\tt \small physics/0007070}.}}  }

726:

727: \end{thebibliography}

728:

729: \end{document}

730: