1: \newcommand{\NIPS}[3]{#1}
2: \newcommand{\NEC}[3]{#2}
3: \newcommand{\LANL}[3]{#3}
4:
5: %\newcommand{\FORMAT}[3]{\NIPS{#1}{#2}{#3}}
6: %\newcommand{\FORMAT}[3]{\NEC{#1}{#2}{#3}}
7: \newcommand{\FORMAT}[3]{\LANL{#1}{#2}{#3}}
8:
9: \FORMAT{
10: \documentclass[fleqn]{article}
11: \usepackage{times,epsf,graphics,floatflt,wrapfig}
12: \usepackage{nips99}
13: }
14: {
15: \documentclass[fleqn]{article}
16: \usepackage{times,epsf,graphics,floatflt,wrapfig}
17: }
18: {
19: \documentclass[fleqn]{article}
20: \usepackage{times,epsf,graphics,floatflt,wrapfig}
21: \usepackage[hyperindex,hyperfigures]{hyperref}
22: \setlength{\oddsidemargin}{0.35in}
23: \setlength{\textwidth}{5.75in}
24: }
25:
26:
27: \FORMAT{\intextsep 0mm}{\intextsep 2mm}{\intextsep2mm}
28: \columnsep 3.5mm
29:
30:
31: \title{Entropy and Inference, Revisited}
32:
33: \author{Ilya Nemenman,$^{1,2}$ Fariel Shafee,$^3$ and William Bialek$^{1,3}$
34: \\
35: $^1$NEC Research Institute, 4 Independence Way,
36: \FORMAT{}{\\}{}
37: Princeton, New Jersey 08540\\
38: $^2$Institute for Theoretical Physics, University of California, Santa Barbara, CA 93106\\
39: $^3$Department of Physics, Princeton University,
40: \FORMAT{}{\\}{}
41: Princeton, New
42: Jersey 08544\\ {\it nemenman@itp.ucsb.edu,
43: \{fshafee/wbialek\}@princeton.edu}}
44:
45:
46: \begin{document}
47:
48: \maketitle
49:
50:
51:
52: \begin{abstract}
53: We study properties of popular near--uniform (Dirichlet) priors for
54: learning undersampled probability distributions on discrete nonmetric
55: spaces and show that they lead to disastrous results. However, an
56: Occam--style phase space argument expands the priors into their infinite
57: mixture and resolves most of the observed problems. This leads to a
58: surprisingly good estimator of entropies of discrete distributions.
59: \end{abstract}
60:
61: \FORMAT{}{\newpage}{}
62:
63:
64:
65: Learning a probability distribution from examples is one of the basic
66: problems in data analysis. Common practical approaches introduce a
67: family of parametric models, leading to questions about model
68: selection. In Bayesian inference, computing the total probability of
69: the data arising from a model involves an integration over parameter
70: space, and the resulting ``phase space volume'' automatically
71: discriminates against models with larger numbers of parameters---hence
72: the description of these volume terms as Occam factors
73: \cite{mackay,vijay}. As we move from finite parameterizations to
74: models that are described by smooth functions, the integrals over
75: parameter space become functional integrals and methods from quantum
76: field theory allow us to do these integrals asymptotically; again the
77: volume in model space consistent with the data is larger for models
78: that are smoother and hence less complex \cite{bcs}. Further, at
79: least under some conditions the relevant degree of smoothness can be
80: determined self--consistently from the data, so that we approach
81: something like a model independent method for learning a distribution
82: \cite{nb}.
83:
84: The results emphasizing the importance of phase space factors in
85: learning prompt us to look back at a seemingly much simpler problem,
86: namely learning a distribution on a discrete, nonmetric space. Here
87: the probability distribution is just a list of numbers $\{q_i\}$, $i =
88: 1, 2, \cdots , K$, where $K$ is the number of bins or possibilities.
89: We do not assume any metric on the space, so that a priori there is no
90: reason to believe that any $q_i$ and $q_j$ should be similar. The
91: task is to learn this distribution from a set of examples, which we
92: can describe as the number of times $n_i$ each possibility is observed
93: in a set of $N= \sum_{i=1}^K n_i$ samples. This problem arises in the
94: context of language, where the index $i$ might label words or phrases,
95: so that there is no natural way to place a metric on the space, nor is
96: it even clear that our intuitions about similarity are consistent with
97: the constraints of a metric space. Similarly, in bioinformatics the
index $i$ might label n--mers of the DNA or amino acid sequence,
99: and although most work in the field is based on metrics for sequence
100: comparison one might like an alternative approach that does not rest
101: on such assumptions. In the analysis of neural responses, once we fix
102: our time resolution the response becomes a set of discrete ``words,''
103: and estimates of the information content in the response are
104: determined by the probability distribution on this discrete space.
105: What all of these examples have in common is that we often need to
106: draw some conclusions with data sets that are {\em not} in the
107: asymptotic limit $N \gg K$. Thus, while we might use a large corpus
108: to sample the distribution of words in English by brute force
109: (reaching $N \gg K$ with $K$ the size of the vocabulary), we can
110: hardly do the same for three or four word phrases.
111:
112:
113: In models described by continuous functions, the infinite number of
114: ``possibilities'' can never be overwhelmed by examples; one is saved
115: by the notion of smoothness. Is there some nonmetric analog of this
116: notion that we can apply in the discrete case? Our intuition is that
117: information theoretic quantities may play this role. If we have a
118: joint distribution of two variables, the analog of a smooth
119: distribution would be one which does not have too much mutual
120: information between these variables. Even more simply, we might say that
121: smooth distributions have large entropy. While the idea of ``maximum
122: entropy inference'' is common \cite{maxent}, the interplay between
123: constraints on the entropy and the volume in the space of models seems
124: not to have been considered. As we shall explain, phase space factors
125: alone imply that seemingly sensible, more or less uniform priors on the
126: space of discrete probability distributions correspond to disastrously
127: singular prior hypotheses about the entropy of the underlying
128: distribution. We argue that reliable inference outside the asymptotic
129: regime $N \gg K$ requires a more uniform prior on the entropy, and we
130: offer one way of doing this. While many distributions are consistent
131: with the data when $N \leq K$, we provide empirical evidence that this
132: flattening of the entropic prior allows us to make surprisingly reliable
133: statements about the entropy itself in this regime.
134:
135: At the risk of being pedantic, we state very explicitly what we mean by
136: uniform or nearly uniform priors on the space of distributions.
137: The natural ``uniform'' prior is given by
138: \begin{equation}
139: {\mathcal P}_{\rm u}(\{q_i\}) = {1\over Z_{\rm u}}\,\delta\left(
140: 1 - \sum_{i=1}^K q_i\right), \;\; Z_{\rm u} = \int_{\mathcal
141: A}dq_1 dq_2 \cdots dq_K
142: \,\delta\left( 1 - \sum_{i=1}^K q_i\right)
143: \end{equation}
144: where the delta function imposes the normalization, $Z_{\rm u}$ is the
145: total volume in the space of models, and the integration domain
146: ${\mathcal A}$ is such that each $q_i$ varies in the range $[0,1]$.
147: Note that, because of the normalization constraint, an {\em
148: individual} $q_i$ chosen from this distribution in fact is not
149: uniformly distributed---this is also an example of phase space
150: effects, since in choosing one $q_i$ we constrain all the other
151: $\{q_{j\neq i}\}$. What we mean by uniformity is that all
152: distributions that obey the normalization constraint are equally
153: likely a priori.
154:
155: Inference with this uniform prior is straightforward. If our examples
156: come independently from $\{ q_i\}$, then we calculate the probability
157: of the model $\{ q_i\}$ with the usual Bayes rule: \footnote{If the data
158: are unordered, extra combinatorial factors have to be included in $P(\{
159: n_i\} | \{ q_i\})$. However, these cancel immediately in later
160: expressions.}
161: \begin{equation}
162: P(\{ q_i\}| \{ n_i\} ) = \frac{P(\{ n_i\} | \{ q_i\})
163: {\mathcal P}_{\rm u}(\{q_i\})}{P_{\rm u}(\{ n_i\})}, \;\;
164: P(\{ n_i\} | \{ q_i\}) = \prod_{i=1}^K (q_i)^{n_i}.
165: \end{equation}
166: If we want the best estimate of the probability $q_i$ in the least
167: squares sense, then we should compute the conditional mean, and this
168: can be done exactly, so that \cite{ww,thesis}
169: \vspace{-0.5mm}
170: \begin{equation}
171: \langle q_i\rangle = {{n_i +1}\over{N+K}} .
172: \label{laprule}
173: \end{equation}
174: Thus we can think of inference with this uniform prior as setting
175: probabilities equal to the observed frequencies, but with an ``extra
176: count'' in every bin. This sensible procedure was first introduced by
177: Laplace \cite{laplace}. It has the desirable property that events which have not been observed are not automatically assigned probability zero.
178:
179:
180: A natural generalization of these ideas is to consider priors that
have a power--law dependence on the probabilities, the so--called Dirichlet family of priors:
182: \vspace{-0.5mm}
183: \begin{equation}
184: {\mathcal P}_\beta(\{q_i\}) = {1\over Z(\beta)}
185: \delta\left( 1 - \sum_{i=1}^K q_i\right)
\prod_{i=1}^K q_i^{\beta-1} \,.
187: \label{P(q)}
188: \end{equation}
189:
190: It is interesting to see what typical distributions from these priors
191: look like. Even though different $q_i$'s are not independent random
192: variables due to the normalizing $\delta$--function, generation of
193: random distributions is still easy: one can show that if $q_i$'s are
194: generated successively (starting from $i=1$ and proceeding up to
195: $i=K$) from the Beta--distribution
196: \begin{equation}
197: P(q_i) = B\left(\frac{q_i}{1-\sum_{j<i} q_j}; \beta, (K-i)\beta
198: \right),\;\;\;\; B\left(x; a,b \right) =
199: \frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}\,,
200: \label{betadistr}
201: \end{equation}
202:
203: \begin{wrapfigure}{r}{63mm}
204: \vspace{-0mm}
205: \centerline{\epsfxsize=1.0\hsize\epsffile{Q_example.eps}}
206: \vspace{-3.5mm}
207: \caption{Typical distributions, $K=1000$.}
208: \FORMAT{\vspace{-1mm}}{}{}
209: \label{example}
210: \end{wrapfigure}
211:
212: \noindent then the probability of the whole sequence $\{q_i\}$ is ${\mathcal
213: P}_{\beta}(\{q_i\})$. Fig.~\ref{example} shows some typical
214: distributions generated this way. They represent different regions of
215: the range of possible entropies: low entropy ($\sim 1$ bit, where only
216: a few bins have observable probabilities), entropy in the middle of
217: the possible range, and entropy in the vicinity of the maximum,
218: $\log_2 K$. When learning an unknown distribution, we usually have no
219: a priori reason to expect it to look like only one of these
possibilities, but choosing $\beta$ largely fixes the allowed
``shapes.'' This will be a focal point of our discussion.
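
As a small consistency check of the sequential construction (an
illustration with $K=3$ and $\beta=1$, values chosen only for
concreteness), Eq.~(\ref{betadistr}) gives $P(q_1) = 2(1-q_1)$, while
$x = q_2/(1-q_1)$ is uniform on $[0,1]$, so that $P(q_2|q_1) =
1/(1-q_1)$. The last component is fixed by normalization, $q_3 =
1-q_1-q_2$. The joint density is then
\[
P(q_1)\,P(q_2|q_1) = 2(1-q_1)\,\frac{1}{1-q_1} = 2 = \frac{1}{Z_{\rm u}}\,,
\]
which is constant on the simplex and equal to the inverse of its
volume, $Z_{\rm u}=1/2$ for $K=3$, as it should be for the uniform
prior.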
222:
223:
Even though the distributions look different, inference with all priors of
the form of Eq.~(\ref{P(q)}) is similar \cite{ww,thesis}:
226: \begin{equation}
227: \langle q_i\rangle_\beta = {{n_i
228: +\beta}\over{N+\kappa}}\,,\;\;\;\; \kappa = K\beta.
229: \label{estim}
230: \end{equation}
This simple modification of Laplace's rule, Eq.~(\ref{laprule}),
which allows us to vary the probability assigned to the outcomes not yet
seen, was first examined by Hardy and Lidstone \cite{hardy,lidstone}.
Together with Laplace's formula, $\beta=1$, this family includes the
235: usual maximum likelihood estimator (MLE), $\beta \to 0$, that identifies
236: probabilities with frequencies, as well as the Jeffreys' or
237: Krichevsky--Trofimov (KT) estimator, $\beta=1/2$ \cite{jeffreys,kt,wst},
238: the Schurmann--Grassberger (SG) estimator, $\beta=1/K$ \cite{sg}, and
239: other popular choices.
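
To see how strongly the choice of $\beta$ matters when $N \ll K$,
consider an illustrative case (numbers chosen only for concreteness)
with $K=1000$, $N=30$, and all samples falling into different bins, so
that $K_1=30$ bins are occupied. Summing Eq.~(\ref{estim}) over the
$K-K_1$ empty bins, the total probability assigned to outcomes not yet
seen is
\[
\sum_{i:\,n_i=0} \langle q_i\rangle_\beta = \frac{(K-K_1)\,\beta}{N+\kappa}
\approx \left\{ \begin{array}{ll}
0.94\,, & \beta=1 \;\; (\kappa=1000)\,,\\
0.92\,, & \beta=1/2 \;\; (\kappa=500)\,,\\
0.03\,, & \beta=1/K \;\; (\kappa=1)\,,
\end{array} \right.
\]
and it vanishes for $\beta \to 0$. The same thirty samples thus lead to
very different estimates, depending on the prior.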
240:
241:
242:
243: To understand why inference in the family of priors defined by
244: Eq.~(\ref{P(q)}) is unreliable, consider the entropy of a distribution
245: drawn at random from this ensemble. Ideally we would like to compute
246: this whole a priori distribution of entropies,
247: \begin{equation}
248: {\mathcal P}_\beta (S) = \int dq_1 dq_2 \cdots dq_K \,
{\mathcal P}_\beta(\{q_i\})
250: \,\delta\left[
251: S + \sum_{i =1}^K q_i\log_2 q_i \right] ,
252: \end{equation}
253: but this is quite difficult. However, as noted by Wolpert and Wolf
254: \cite{ww}, one can compute the moments of ${\mathcal P}_\beta (S)$
255: rather easily. Transcribing their results to the present notation
256: (and correcting some small errors), we find:
257: \begin{eqnarray}
258: \xi(\beta) \equiv \langle\, S [n_i =0]\, \rangle_\beta &=&
259: \psi_0(\kappa+1)
260: -\psi_0(\beta+1) \, ,
261: \label{Sap}
262: \\
263: \sigma^2(\beta) \equiv \langle \, (\delta S)^2 [n_i =0] \rangle_\beta
264: &=&
265: \frac{\beta+1}{\kappa +
266: 1}\, \psi_1(\beta+1) -\psi_1(\kappa+1) \,,
267: \label{dS2ap}
268: \end{eqnarray}
269: \vspace{-0.5mm}
270: where $\psi_m(x) = (d/dx)^{m+1} \log_2 \Gamma(x)$ are the polygamma
271: functions.
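
A quick sanity check of Eq.~(\ref{Sap}) (an illustration, not part of
the original derivation) is the case $K=2$, $\beta=1$, where $q_1$ is
uniform on $[0,1]$ and the mean entropy can be computed directly. Using
$\int_0^1 q \ln q\, dq = -1/4$, one finds
\[
\langle S \rangle = -\frac{2}{\ln 2}\int_0^1 q \ln q \, dq =
\frac{1}{2\ln 2} \approx 0.72\;\mbox{bits},
\]
in agreement with $\psi_0(3)-\psi_0(2) = 1/(2\ln 2)$, which follows
from the recursion $\Gamma(x+1) = x\,\Gamma(x)$.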
272:
273:
274: \begin{wrapfigure}{L}{63mm}
275: \vspace{-1mm}
276: \centerline{\epsfxsize=1.0\hsize\epsffile{mean_var.eps}}
277: \vspace{-4mm}
278: \caption{$\xi(\beta) / \log_2
279: K$ and $\sigma(\beta)$ as functions of $\beta$ and $K$; gray bands
280: are the region of $\pm \sigma(\beta)$ around the mean. Note the
281: transition from the logarithmic to the linear scale at
282: $\beta=0.25$ in the insert.}
283: \FORMAT{\vspace{1mm}}{}{}
284: \label{Sapriori}
285: \end{wrapfigure}
286:
The behavior of these moments is shown in Fig.~\ref{Sapriori}. We are
288: faced with a striking observation: a priori distributions of entropies
289: in the power--law priors are extremely peaked for even moderately
290: large $K$. Indeed, as a simple analysis shows, their maximum standard
291: deviation of approximately 0.61 bits is attained at $\beta \approx
292: 1/K$, where $\xi(\beta) \approx 1/\ln 2$ bits. This has to be compared
293: with the possible range of entropies, $[0, \log_2 K]$, which is
294: asymptotically large with $K$. Even worse, for any fixed $\beta$ and
295: sufficiently large $K$, $\xi(\beta) = \log_2 K - O(K^0)$, and
296: $\sigma(\beta) \propto 1/\sqrt{\kappa}$. Similarly, if $K$ is large,
297: but $\kappa$ is small, then $\xi(\beta) \propto \kappa$, and
298: $\sigma(\beta) \propto \sqrt{\kappa}$. This paints a lively picture:
299: varying $\beta$ between $0$ and $\infty$ results in a smooth variation
300: of $\xi$, the a priori expectation of the entropy, from $0$ to $S_{\rm
301: max}= \log_2 K$. Moreover, for large $K$, the standard deviation of
302: ${\mathcal P}_{\beta} (S)$ is always negligible relative to the
303: possible range of entropies, and it is negligible even absolutely for
304: $\xi\gg 1$ ($\beta \gg 1/K$). Thus a seemingly innocent choice of the
305: prior, Eq.~(\ref{P(q)}), leads to a disaster: {\em fixing $\beta$
306: specifies the entropy almost uniquely}. Furthermore, the situation
307: persists even after we observe some data: {\em until the distribution
308: is well sampled, our estimate of the entropy is dominated by the prior!}
309:
310: Thus it is clear that all commonly used estimators mentioned above
311: have a problem. While they may or may not provide a reliable estimate
312: of the distribution $\{q_i\}$\footnote{In any case, the answer to
313: this question depends mostly on the ``metric'' chosen to measure
314: reliability. Minimization of bias, variance, or information cost
315: (Kullback--Leibler divergence between the target distribution and
316: the estimate) leads to very different ``best'' estimators.}, they
317: are definitely a poor tool to learn entropies. Unfortunately, often
318: we are interested precisely in these entropies or similar
319: information--theoretic quantities, as in the examples (neural code,
320: language, and bio\-informatics) we briefly mentioned earlier.
321:
322: Are the usual estimators really this bad? Consider this: for the MLE
323: ($\beta=0$), Eqs.~(\ref{Sap}, \ref{dS2ap}) are formally wrong since it
324: is impossible to normalize ${\mathcal P}_0(\{q_i\})$. However, the
325: prediction that ${\mathcal P}_0(S) = \delta(S)$ still holds. Indeed,
326: $S_{\rm ML}$, the entropy of the ML distribution, is zero even for
327: $N=1$, let alone for $N=0$. In general, it is well known that $S_{\rm
328: ML}$ always underestimates the actual value of the entropy, and the
329: correction \vspace{-0.5mm}
330: \begin{equation}
331: S = S_{\rm ML} + \frac{K^*}{2N} + O \left( \frac{1}{N^2} \right)
332: \label{corr}
333: \end{equation}
334: \vspace{-0.5mm} is usually used (cf.~\cite{sg}). Here we must set
335: $K^*=K-1$ to have an asymptotically correct result. Unfortunately in
an undersampled regime, $N \ll K$, this is a disaster. To alleviate
the problem, various authors have suggested determining the dependence
338: $K^*=K^*(K)$ by various (rather ad hoc) empirical \cite{srrb} or
339: pseudo--Bayesian techniques \cite{pt}. However, then there is no
340: principled way to estimate both the residual bias and the error of the
341: estimator.
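
To see the scale of the problem (an illustrative estimate; we assume
here that Eq.~(\ref{corr}) is written in nats, as is common for this
correction), take $K=1000$ and $N=30$. Since the ML distribution has at
most $N$ occupied bins, $S_{\rm ML} \leq \ln N \approx 3.4$ nats, while
\[
\frac{K-1}{2N} = \frac{999}{60} \approx 16.65\;\mbox{nats},
\]
which by itself exceeds the largest possible entropy, $\ln K \approx
6.9$ nats. This suggests that an expansion in powers of $1/N$ cannot be
trusted when $N \ll K$.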
342:
343:
The situation is even worse for Laplace's rule, $\beta=1$. We were
345: unable to find any results in the literature that would show a clear
346: understanding of the effects of the prior on the entropy estimate,
347: $S_{\rm L}$. And these effects are enormous: the a priori
348: distribution of the entropy has $\sigma(1) \sim 1/\sqrt{K}$ and is
349: almost $\delta$-like. This translates into a very certain, but
350: nonetheless possibly wrong, estimate of the entropy. We believe that
351: this type of error (cf.~Fig.~\ref{fixedbeta}) has been overlooked in
352: some previous literature.
353:
354:
355:
The Schurmann--Grassberger estimator, $\beta=1/K$, deserves special
attention. The variance of ${\mathcal P}_{\beta}(S)$ is maximized near
358: this value of $\beta$ (cf.~Fig.~\ref{Sapriori}). Thus the SG
359: estimator results in the most uniform a priori expectation of $S$
360: possible for the power--law priors, and consequently in the least
361: bias. We suspect that this feature is responsible for a remark in
362: Ref.~\cite{sg} that this $\beta$ was empirically the best for studying
363: printed texts. But even the SG estimator is flawed: it is biased
364: towards (roughly) $1/\ln 2$, and it is still a priori rather narrow.
365:
366: \begin{wrapfigure}{r}{63mm}
367: \vspace{-1mm}
368: \centerline{\epsfxsize=1.0\hsize\epsffile{diffbeta.eps}}
369: \vspace{-5mm}
370: \caption{Learning the $\beta=0.02$ distribution from Fig.~\ref{example}
371: with $\beta=0.001, 0.02, 1$. The actual error of the estimators is
372: plotted; the error bars are the standard deviations of the
373: posteriors. The ``wrong'' estimators are very certain but
374: nonetheless incorrect.}
375: \FORMAT{\vspace{-2mm}}{}{\vspace{-2mm} }
376: \label{fixedbeta}
377: \end{wrapfigure}
378:
379:
380:
381:
382:
383: Summarizing, we conclude that simple power--law priors,
384: Eq.~(\ref{P(q)}), must not be used to learn entropies when there is no
385: strong a priori knowledge to back them up. On the other hand, they are
the only priors we know of that allow one to calculate $\langle q_i
387: \rangle$, $\langle S \rangle$, $\langle \chi^2 \rangle$, \dots exactly
388: \cite{ww}. Is there a way to resolve the problem of peakedness of
389: ${\mathcal P}_{\beta}(S)$ without throwing away their analytical ease?
390: One approach would be to use $ {\mathcal P}^{\rm
391: flat}_{\beta}(\{q_i\}) = \frac{{\mathcal P}_{\beta}(\{q_i\})
392: }{{\mathcal P}_{\beta}(S[q_i])} \; {\mathcal P}^{\rm
393: actual}(S[q_i])\,$ as a prior on $\{q_i\}$. This has a feature that
394: the a priori distribution of $S$ deviates from uniformity only due to
395: our actual knowledge ${\mathcal P}^{\rm actual} (S[q_i])$, but not in
396: the way ${\mathcal P}_{\beta}(S)$ does. However, as we already
397: mentioned, ${\mathcal P}_{\beta}(S[q_i])$ is yet to be calculated.
398:
399:
400: Another way to a flat prior is to write ${\mathcal P}(S) = 1 = \int
401: \delta(S - \xi) d \xi$. If we find a family of priors ${\mathcal
402: P}(\{q_i\}, {\rm parameters})$ that result in a $\delta$-function
403: over $S$, and if changing the parameters moves the peak across the
404: whole range of entropies uniformly, we may be able to use this.
405: Luckily, ${\mathcal P}_{\beta}(S)$ is almost a
$\delta$-function!~\footnote{The approximation deteriorates as
407: $\beta \to 0$ since $\sigma(\beta)$ becomes $O(1)$ before dropping
408: to zero. Even worse, ${\mathcal P}_{\beta}(S)$ is skewed at small
409: $\beta$. This accumulates an extra weight at $S=0$. Our approach to
410: dealing with these problems is to ignore them while the posterior
411: integrals are dominated by $\beta$'s that are far away from zero.
412: This was always the case in our simulations, but is an open
413: question for the analysis of real data.} In addition, changing
414: $\beta$ results in changing $\xi(\beta) = \langle\, S [n_i=0] \,
\rangle_\beta$ across the whole range $[0, \log_2 K]$. So we may hope
416: that the prior \footnote{Priors that are formed as weighted sums of the
417: different members of the Dirichlet family are usually called {\em
418: Dirichlet mixture priors}. They have been used to estimate probability
419: distributions of, for example, protein sequences \cite{mixt}.
420: Equation (\ref{Pflat}), an {\em infinite} mixture, is a further
421: generalization, and, to our knowledge, it has not been studied before.}
422: \begin{equation}
423: {\mathcal P} (\{q_i\};\beta) = {1\over Z}\,
424: \delta\left( 1 - \sum_{i=1}^K q_i\right)
425: \prod_{i=1}^K q_i^{\beta-1} \frac{d \xi(\beta)}{d\beta} \,{\mathcal P}(\beta)
426: \label{Pflat}
427: \end{equation}
428: may do the trick and estimate entropy reliably even for small $N$, and
429: even for distributions that are atypical for any one $\beta$. We have less
430: reason, however, to expect that this will give an equally reliable
431: estimator of the atypical distributions themselves.$^2$ Note the term $d\xi/d\beta$ in Eq.~(\ref{Pflat}). It is there because $\xi$, not $\beta$, measures the position of the entropy density peak.
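
For reference, the Jacobian follows directly by differentiating
Eq.~(\ref{Sap}) with $\kappa = K\beta$:
\[
\frac{d\xi(\beta)}{d\beta} = K\,\psi_1(\kappa+1) - \psi_1(\beta+1)\,.
\]
As a sketch of why this factor is needed, assume that ${\mathcal
P}_\beta(S)$ is exactly $\delta\left(S-\xi(\beta)\right)$ and that
$\xi(\beta)$ is monotonic. The a priori density of the entropy induced
by Eq.~(\ref{Pflat}) with ${\mathcal P}(\beta)=1$ is then, up to
normalization,
\[
\int d\beta\, \frac{d\xi}{d\beta}\, \delta\left(S-\xi(\beta)\right) =
\int d\xi\, \delta(S-\xi) = 1\,, \;\;\;\; 0 < S < \log_2 K\,,
\]
which is the flat prior on $S$ that we are after.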
432:
433:
434: Inference with the prior, Eq.~(\ref{Pflat}), involves additional
435: averaging over $\beta$ (or, equivalently, $\xi$), but is nevertheless
436: straightforward. The a posteriori moments of the entropy are
437: \begin{eqnarray}
438: \widehat{S^m} &=& \frac{\int d\xi\,
  \rho(\xi,[n_i]) \langle\, S^m [n_i]\, \rangle_{\beta(\xi)}}
440: {\int d\xi\, \rho(\xi,[n_i])}\,,\;\;\;\mbox{where}
441: \label{Shat}
442: \\
443: \rho(\xi, [n_i]) &=& {\mathcal P}\left(\beta\left(\xi\right)\right)
444: \frac{\Gamma(\kappa(\xi))}{\Gamma(N+\kappa(\xi))}\,
445: \prod_{i=1}^K \frac{\Gamma(n_i+\beta(\xi))}{\Gamma(\beta(\xi))}\,.
446: \label{rho}
447: \end{eqnarray}
448: Here the moments $\langle\, S^m [n_i]\, \rangle_{\beta(\xi)}$ are
449: calculated at fixed $\beta$ according to the (corrected) formulas of
450: Wolpert and Wolf \cite{ww}. We can view this inference scheme as
451: follows: first, one sets the value of $\beta$ and calculates the
452: expectation value (or other moments) of the entropy at this $\beta$.
453: For small $N$, the expectations will be very close to their a priori
454: values due to the peakedness of ${\mathcal P}_{\beta}(S)$.
455: Afterwards, one integrates over $\beta(\xi)$ with the density
456: $\rho(\xi)$, which includes our a priori expectations about the
457: entropy of the distribution we are studying [${\mathcal
458: P}\left(\beta\left(\xi\right)\right)$], as well as the evidence for
459: a particular value of $\beta$ [$\Gamma$-terms in Eq.~(\ref{rho})].
460:
461: The crucial point is the behavior of the evidence. If it has a
462: pronounced peak at some $\beta_{\rm cl}$, then the integrals over
463: $\beta$ are dominated by the vicinity of the peak, $\widehat{S}$ is
464: close to $\xi(\beta_{\rm cl})$, and the variance of the estimator is
small. In other words, the data ``select'' some value of $\beta$, much in
466: the spirit of Refs.~\cite{mackay} -- \cite{nb}. However, this
467: scenario may fail in two ways. First, there may be no peak in the
468: evidence; this will result in a very wide posterior and poor
469: inference. Second, the posterior density may be dominated by $\beta
470: \to 0$, which corresponds to MLE, the best possible fit to the data,
471: and is a discrete analog of overfitting. While all these situations
472: are possible, we claim that generically the evidence is well--behaved.
473: Indeed, while small $\beta$ increases the fit to the data, it also
474: increases the phase space volume of all allowed distributions and thus
decreases the probability of each particular one [remember that $\langle
476: q_i \rangle_{\beta}$ has an extra $\beta$ counts in each bin, thus
477: distributions with $q_i < \beta/(N+\kappa)$ are strongly suppressed].
478: The fight between the ``goodness of fit'' and the phase space volume
should then result in some non--trivial $\beta_{\rm cl}$, set by factors
480: $\propto N$ in the exponent of the integrand.
481:
482:
483: Figure~\ref{learning} shows how the prior, Eq.~(\ref{Pflat}), performs
484: on some of the many distributions we tested. The left panel describes
485: learning of distributions that are typical in the prior ${\mathcal
486: P}_{\beta}(\{q_i\})$ and, therefore, are also likely in ${\mathcal
487: P}(\{q_i\};\beta)$. Thus we may expect a reasonable performance, but
488: the real results exceed all expectations: for all three cases, the
489: actual relative error drops to the $10\%$ level at $N$ as low as 30
490: (recall that $K=1000$, so we only have $\sim 0.03$ data points per bin
491: on average)! To put this in perspective, simple estimates like fixed
492: $\beta$ ones, MLE, and MLE corrected as in Eq.~(\ref{corr}) with $K^*$
493: equal to the number of nonzero $n_i$'s produce an error so big that it
494: puts them off the axes until $N >100$. \footnote{More work is needed to
compare our estimator to more complex techniques, such as those of
Refs.~\cite{srrb,pt}.} Our results have two more nice features: the
497: estimator seems to know its error pretty well, and it is almost
498: completely unbiased.
499:
500:
501: \begin{figure}[t]
502: \begin{center}
503: \begin{picture}(60,5)(0,0)
504: \put(-60,0){(a)}
505: \put(120,0){(b)}
506: \end{picture}
507: \end{center}
508: \vspace{-1mm}
509: \centerline{\epsfxsize=.49\hsize\epsffile{correct.eps}
510: \epsfxsize=.49\hsize\epsffile{incorrect.eps}}
511: \vspace{-4mm}
512: \caption{Learning entropies with the prior Eq.~(\ref{Pflat}) and
513: ${\mathcal P}(\beta)=1$. The actual relative errors of the
514: estimator are plotted; the error bars are the relative widths of
515: the posteriors. (a) Distributions from Fig.~\ref{example}. (b)
516: Distributions atypical in the prior. Note that while
517: $\widehat{S}$ may be safely calculated as just $\langle S
518: \rangle_{\beta_{\rm cl}}$, one has to do an honest integration
519: over $\beta$ to get $\widehat{S^2}$ and the error bars. Indeed,
520: since ${\mathcal P}_{\beta} (S)$ is almost a $\delta$-function,
521: the uncertainty at any fixed $\beta$ is very small (see
522: Fig.~\ref{fixedbeta}).}
523: \label{learning}
524: \vspace{-4mm}
525: \end{figure}
526:
527:
528: One might be puzzled at how it is possible to estimate anything in a
529: 1000--bin distribution with just a few samples: the distribution is
530: completely unspecified for low $N$! The point is that we are not
531: trying to learn the distribution --- in the absence of additional prior
532: information this would, indeed, take $N\gg K$ --- but to estimate
533: just one of its characteristics. It is less surprising that one number
534: can be learned well with only a handful of measurements. In practice
535: the algorithm builds its estimate based on the number of coinciding
536: samples (multiple coincidences are likely only for small $\beta$), as
in Ma's approach to entropy estimation from simulations of physical
538: systems
539: \cite{ma}.
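
The role of coincidences can be seen already for $N=2$ (a minimal
illustration that uses only the $\Gamma$--terms of Eq.~(\ref{rho})). If
both samples fall into the same bin, then $\Gamma(\kappa) /
\Gamma(2+\kappa) = 1/[\kappa(\kappa+1)]$ and $\Gamma(2+\beta) /
\Gamma(\beta) = \beta(\beta+1)$, whereas for two different bins the
product over bins gives $\beta^2$. With $\kappa = K\beta$ the evidence
is
\[
\mbox{coincidence:}\;\; \frac{\beta+1}{K(\kappa+1)}\,, \;\;\;\;\;\;
\mbox{no coincidence:}\;\; \frac{\beta}{K(\kappa+1)}\,.
\]
For $K>1$ the first expression decreases monotonically with $\beta$,
while the second increases and saturates at $1/K^2$ for $\beta \gg
1/K$. A coincidence is thus evidence for small $\beta$ (low entropy),
and its absence favors larger $\beta$.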
540:
541:
542:
543:
544: What will happen if the algorithm is fed with data from a distribution
545: $\{\tilde{q}_i\}$ that is strongly atypical in ${\mathcal
546: P}(\{q_i\};\beta)$? Since there is no $\{\tilde{q}_i\}$ in our
547: prior, its estimate may suffer. Nonetheless, for any
548: $\{\tilde{q}_i\}$, there is some $\beta$ which produces distributions
549: with the same mean entropy as $S[\tilde{q}_i]$. Such $\beta$ should
550: be determined in the usual fight between the ``goodness of fit'' and
551: the Occam factors, and the correct value of entropy will follow.
552: However, there will be an important distinction from the ``correct
553: prior'' cases. The value of $\beta$ indexes available phase space
554: volumes, and thus the smoothness (complexity) of the model class
555: \cite{bnt}. In the case of discrete distributions, smoothness is the
556: absence of high peaks. Thus data with faster decaying Zipf plots
557: (plots of bins' occupancy vs.\ occupancy rank $i$) are rougher. The priors ${\mathcal P}_{\beta}(\{q_i\})$ cannot account for all possible roughnesses. Indeed, they only generate distributions for which the expected number of bins $\nu$ with the probability mass less than some $q$ is given by $\nu(q) = K B(q, \beta, \kappa -\beta)$, where $B$ is the familiar incomplete Beta function, as in Eq.~(\ref{betadistr}). This means that the expected rank ordering for small and large ranks is
558: \begin{eqnarray}
559: q_i &\approx& 1 - \left[\frac{ \beta B(\beta, \kappa - \beta ) (K-1) \,i}
560: {K} \right] ^{1/(\kappa-\beta)}, \,\,\,\, i\ll K\,,
561: \label{left}\\
562: q_i &\approx& \left[ \frac{ \beta B(\beta, \kappa - \beta ) (K-i+1)}
563: {K}\right]^{1/\beta},\,\,\,\, K-i+1 \ll K\,.
564: \end{eqnarray}
565: In an undersampled regime we can observe only the first of the behaviors. Therefore, any
566: distribution with $q_i$ decaying
567: faster (rougher) or slower (smoother) than Eq.~(\ref{left}) for some $\beta$ cannot be explained
568: well with fixed $\beta_{\rm cl}$ for different $N$. So, unlike in the cases of learning data that are typical in ${\mathcal P}_{\beta}(\{q_i\})$, we should
569: expect to see $\beta_{\rm cl}$ growing (falling) for qualitatively
570: smoother (rougher) cases as $N$ grows.
571:
572: \FORMAT{
573: \tabcolsep 0.5mm
574: \begin{wraptable}{r}{40.5mm}{
575: %\begin{floatingtable}{
576: %\begin{tabular}{ccccccc}
577: %$N$ &0.0007& 0.02 & 1.0 & 1/2 full & Zipf & rough\\ \hline
578: %{\small units} & $\cdot 10^{-4}$ & $\cdot 10^{-2}$ & $\cdot 10^{-0}$ &
579: %$\cdot 10^{-2}$ & $\cdot 10^{-1}$ & $\cdot 10^{-3}$ \\ \hline
580: %10 & 4.3 & 4.1 & 2773 & 1.7 & 1907 & 16.8\\
581: %30 & 6.1 & 1.9 & 0.74 & 2.2 & 0.99 & 11.5\\
582: %100 & 4.3 & 2.3 & 0.80 & 2.4 & 0.86 & 12.9\\
583: %300 & 3.4 & 2.0 & 1.12 & 2.2 & 1.36 & 8.3 \\
584: %1000 & 5.9 & 2.0 & 0.96 & 2.1 & 2.24 & 6.4 \\
585: %3000 & 6.3 & 1.9 & 0.99 & 1.9 & 3.36 & 5.4 \\
586: %10000& 1.0 & 1.8 & 0.99 & 2.0 & 4.89 & 4.5 \\
587: %\end{tabular}
588: \begin{tabular}{cccc}
589: $N$ & 1/2 full & Zipf & rough\\ \hline
590: {\small units} & $\cdot 10^{-2}$ & $\cdot 10^{-1}$ & $\cdot 10^{-3}$ \\ \hline
591: 10 & 1.7 & 1907 & 16.8\\
592: 30 & 2.2 & 0.99 & 11.5\\
593: 100 & 2.4 & 0.86 & 12.9\\
594: 300 & 2.2 & 1.36 & 8.3 \\
595: 1000 & 2.1 & 2.24 & 6.4 \\
596: 3000 & 1.9 & 3.36 & 5.4 \\
597: 10000& 2.0 & 4.89 & 4.5 \\
598: \end{tabular}}
599: \vspace{-3mm}
\caption{$\beta_{\rm cl}$ for solutions shown in Fig.~\ref{learning}(b).}
601: \label{betacl}
602: \end{wraptable}}{}{}
603:
Figure~\ref{learning}(b) and Table~\ref{betacl} illustrate these
605: points. First, we study the $\beta=0.02$ distribution from
Fig.~\ref{example}. However, we added 1000 extra bins, each with
607: $q_i=0$. Our estimator performs remarkably well, and $\beta_{\rm cl}$
608: does not drift because the ranking law remains the same. Then we turn
to the famous Zipf distribution, so common in Nature. It has $q_i
\propto 1/i$, which is qualitatively smoother than our prior allows.
611: Correspondingly, we get an upwards drift in $\beta_{\rm cl}$. Finally,
612: we analyze a ``rough'' distribution, which has $q_i \propto 50 - 4(\ln
613: i)^2$, and $\beta_{\rm cl}$ drifts downwards. Clearly, one would want
to predict the dependence $\beta_{\rm cl}(N)$ analytically, but this
requires calculating the predictive information (complexity) for the
distributions involved \cite{bnt} and is left for future work. Notice that the entropy estimator for atypical
617: \FORMAT{}{}{
618: \tabcolsep 0.5mm
619: \begin{wraptable}{r}{40.5mm}{
620: \begin{tabular}{cccc}
621: $N$ & 1/2 full & Zipf & rough\\ \hline
622: {\small units} & $\cdot 10^{-2}$ & $\cdot 10^{-1}$ & $\cdot 10^{-3}$ \\ \hline
623: 10 & 1.7 & 1907 & 16.8\\
624: 30 & 2.2 & 0.99 & 11.5\\
625: 100 & 2.4 & 0.86 & 12.9\\
626: 300 & 2.2 & 1.36 & 8.3 \\
627: 1000 & 2.1 & 2.24 & 6.4 \\
628: 3000 & 1.9 & 3.36 & 5.4 \\
629: 10000& 2.0 & 4.89 & 4.5 \\
630: \end{tabular}}
631: \vspace{-3mm}
\caption{$\beta_{\rm cl}$ for the solutions shown in Fig.~\ref{learning}(b).}
633: \label{betacl}
634: \FORMAT{}{}{\vspace{-3mm}}
635: \end{wraptable}}
cases is almost as
good as for typical ones. A possible exception is the Zipf
distribution at $N=100$--$1000$, where the estimates are about two standard
deviations off. We saw similar effects in some other ``smooth'' cases
as well. This may be another manifestation of an observation made in
Ref.~\cite{nb}: smooth priors can easily adapt to rough distributions,
but there is a limit to the smoothness beyond which rough priors
become inaccurate.
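
For reference, the exact entropies against which such estimates are compared follow directly from the ranking laws. As a sketch only, with $K$ denoting the number of bins, $H_K$ the normalization constant, and $S$ the entropy in natural units (notation assumed here for illustration), the normalized Zipf distribution and its entropy are
\[
q_i = \frac{1}{i\,H_K}, \qquad
H_K = \sum_{j=1}^{K}\frac{1}{j}, \qquad
S = -\sum_{i=1}^{K} q_i \ln q_i
  = \ln H_K + \frac{1}{H_K}\sum_{i=1}^{K}\frac{\ln i}{i},
\]
and dividing by $\ln 2$ converts $S$ to bits. For the ``rough'' law, $50-4(\ln i)^2>0$ requires $\ln i<\sqrt{12.5}\approx 3.54$, that is, $i\le 34$. How higher-ranked bins are treated is not specified here; a natural assumption is that they carry $q_i=0$.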
644:
645:
646:
647: To summarize, an analysis of a priori entropy statistics in common
648: power--law Bayesian estimators revealed some very undesirable features. We are fortunate, however, that these minuses can be easily
649: turned into pluses, and the resulting estimator of entropy is precise,
650: knows its own error, and gives amazing results for a very large class of
651: distributions.
652:
653:
654:
655:
656: \section*{Acknowledgements}
657: We thank Vijay Balasubramanian, Curtis Callan, Adrienne Fairhall, Tim
658: Holy, Jonathan Miller, Vipul Periwal, Steve Strong, and Naftali Tishby for useful
659: discussions. I.\ N.\ was supported in part by NSF Grant No.\ PHY99-07949 to the Institute for Theoretical Physics.
660:
661:
662:
663: \begin{thebibliography}{99}
664: \itemsep 0mm
665: {\small
666: \bibitem{mackay}\newblock{D.~MacKay, {\it Neural Comp.} {\bf 4},
667: 415--448 (1992).}
668:
669: \bibitem{vijay}\newblock{V.~Balasubramanian, {\em Neural Comp.}
670: {\bf 9}, 349--368 (1997)\FORMAT{.}{, {\tt \small
671: adap-org/9601001}.}{, {\tt \small adap-org/9601001}.}}
672:
673: \bibitem{bcs}\newblock{W.~Bialek, C.~Callan, and S.~Strong, {\it
674: Phys.~Rev.~Lett.} {\bf 77}, 4693--4697 (1996)\FORMAT{.}{, {\tt
675: \small cond-mat/9607180}.}{, {\tt \small cond-mat/9607180}.}}
676:
677: \bibitem{nb}\newblock{I.~Nemenman and W.~Bialek, {\it Advances in
678: Neural Inf.\ Processing Systems} {\bf 13}, 287--293 (2001)\FORMAT{.}{,
679: {\tt \small cond-mat/0009165}.}{, {\tt \small
680: cond-mat/0009165}.}}
681:
682: \bibitem{maxent}\newblock{J.~Skilling, in {\it Maximum entropy and
683: Bayesian methods,} J.~Skilling ed. (Kluwer Academic Publ.,
684: Amsterdam, 1989), pp.~45--52.}
685:
686: \bibitem{ww}\newblock{D.~Wolpert and D.~Wolf, {\it Phys.~Rev.~E}
687: {\bf 52}, 6841--6854 (1995)\FORMAT{.}{, {\tt \small
688: comp-gas/9403001}.}{, {\tt \small comp-gas/9403001}.}}
689:
\bibitem{thesis}\newblock{I.~Nemenman, Ph.D.\ Thesis, Princeton
  (2000), ch.~3, \FORMAT{\small
  http://arXiv.org/abs/physics/0009032} {\tt \small
  physics/0009032} {\tt \small physics/0009032}.}
694:
695: \bibitem{laplace}\newblock{P.~de Laplace, marquis de, {\em Essai philosophique sur les probabilit\'es} (Courcier, Paris, 1814), trans.\ by F.~Truscott and F.~Emory, {\em A philosophical essay on probabilities} (Dover, New York, 1951).}
696:
697: \bibitem{hardy}\newblock{G.~Hardy, {\em Insurance Record} (1889), reprinted in {\em Trans.~Fac.~Actuaries} {\bf 8} (1920).}
698:
699: \bibitem{lidstone}\newblock{G.~Lidstone, {\em Trans.~Fac.~Actuaries} {\bf 8}, 182--192 (1920).}%Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities.
700:
701: \bibitem{jeffreys}\newblock{H.~Jeffreys, {\em Proc.~Roy.~Soc.~(London) A} {\bf 186}, 453--461 (1946).} %An invariant form for the prior probability in estimation problems.
702:
703: \bibitem{kt}\newblock{R.~Krichevskii and V.~Trofimov, {\em IEEE Trans.\ Inf.\ Thy.} {\bf 27}, 199--207 (1981).}
704:
705: \bibitem{wst}\newblock{F.~Willems, Y.~Shtarkov, and T.~Tjalkens,
706: {\it IEEE Trans.\ Inf.\ Thy.} {\bf 41}, 653--664 (1995).}
707:
\bibitem{sg}\newblock{T.~Sch\"urmann and P.~Grassberger, {\it Chaos}
709: {\bf 6}, 414--427 (1996).}
710:
711: \bibitem{srrb}\newblock{S.~Strong, R.\ Koberle, R.\ de Ruyter van Steveninck, and W.\ Bialek, {\em Phys.\ Rev.\ Lett.}
712: {\bf 80}, 197--200 (1998)\FORMAT{.}{, {\tt \small
713: cond-mat/9603127}.}{, {\tt \small cond-mat/9603127}.}}
714:
715: \bibitem{pt}\newblock{S.~Panzeri and A.~Treves, {\em Network:
716: Comput. in Neural Syst.} {\bf 7}, 87--107 (1996).}
717:
\bibitem{mixt}\newblock{K.\ Sj\"olander, K.\ Karplus, M.\ Brown, R.\ Hughey, A.\ Krogh, I.~S.\ Mian, and D.\ Haussler,
719: {\em Computer Applications in the Biosciences (CABIOS)} {\bf 12}, 327--345 (1996).}
720:
721: \bibitem{ma}\newblock{S.~Ma, {\em J.\ Stat.\ Phys.} {\bf 26}, 221
722: (1981).}
723:
\bibitem{bnt}\newblock{W.~Bialek, I.~Nemenman, and N.~Tishby, {\em Neural Comp.} {\bf 13}, 2409--2463 (2001)\FORMAT{.}{, {\tt
  \small physics/0007070}.}{, {\tt \small physics/0007070}.}} }
726:
727: \end{thebibliography}
728:
729: \end{document}
730: