physics0108025/nsb.tex
1: \newcommand{\NIPS}[3]{#1}
2: \newcommand{\NEC}[3]{#2}
3: \newcommand{\LANL}[3]{#3}
4: 
5: %\newcommand{\FORMAT}[3]{\NIPS{#1}{#2}{#3}}
6: %\newcommand{\FORMAT}[3]{\NEC{#1}{#2}{#3}}
7: \newcommand{\FORMAT}[3]{\LANL{#1}{#2}{#3}}
8: 
9: \FORMAT{
10:   \documentclass[fleqn]{article}
11:   \usepackage{times,epsf,graphics,floatflt,wrapfig}
12:   \usepackage{nips99}
13:   }
14: {
15:   \documentclass[fleqn]{article}
16:   \usepackage{times,epsf,graphics,floatflt,wrapfig}
17:   }
18: {  
19:   \documentclass[fleqn]{article}
20:   \usepackage{times,epsf,graphics,floatflt,wrapfig}
21:   \usepackage[hyperindex,hyperfigures]{hyperref}
22:   \setlength{\oddsidemargin}{0.35in}
23:   \setlength{\textwidth}{5.75in}
24:   }
25: 
26: 
27: \FORMAT{\intextsep 0mm}{\intextsep 2mm}{\intextsep2mm}
28: \columnsep 3.5mm
29: 
30: 
31: \title{Entropy and Inference, Revisited}
32: 
33: \author{Ilya Nemenman,$^{1,2}$ Fariel Shafee,$^3$ and William Bialek$^{1,3}$
34:   \\
35:   $^1$NEC Research Institute, 4 Independence Way,
36: \FORMAT{}{\\}{}
37:   Princeton, New Jersey 08540\\
38: $^2$Institute for Theoretical Physics, University of California, Santa Barbara, CA 93106\\
39:   $^3$Department of Physics, Princeton University, 
40: \FORMAT{}{\\}{}
41:   Princeton, New
42:   Jersey 08544\\ {\it nemenman@itp.ucsb.edu,
43:     \{fshafee/wbialek\}@princeton.edu}}
44: 
45: 
46: \begin{document}
47: 
48: \maketitle
49: 
50: 
51: 
52: \begin{abstract}
53: We study properties of popular near--uniform (Dirichlet) priors for
54: learning undersampled probability distributions on discrete nonmetric
55: spaces and show that they lead to disastrous results.  However, an
56: Occam--style phase space argument expands the priors into their infinite
57: mixture and resolves most of the observed problems. This leads to a
58: surprisingly good estimator of entropies of discrete distributions.
59: \end{abstract}
60: 
61: \FORMAT{}{\newpage}{}
62: 
63: 
64: 
65: Learning a probability distribution from examples is one of the basic
66: problems in data analysis. Common practical approaches introduce a
67: family of parametric models, leading to questions about model
68: selection. In Bayesian inference, computing the total probability of
69: the data arising from a model involves an integration over parameter
70: space, and the resulting ``phase space volume'' automatically
71: discriminates against models with larger numbers of parameters---hence
72: the description of these volume terms as Occam factors
73: \cite{mackay,vijay}.  As we move from finite parameterizations to
74: models that are described by smooth functions, the integrals over
75: parameter space become functional integrals and methods from quantum
76: field theory allow us to do these integrals asymptotically; again the
77: volume in model space consistent with the data is larger for models
78: that are smoother and hence less complex \cite{bcs}.  Further, at
79: least under some conditions the relevant degree of smoothness can be
80: determined self--consistently from the data, so that we approach
81: something like a model independent method for learning a distribution
82: \cite{nb}.
83: 
84: The results emphasizing the importance of phase space factors in
85: learning prompt us to look back at a seemingly much simpler problem,
86: namely learning a distribution on a discrete, nonmetric space.  Here
87: the probability distribution is just a list of numbers $\{q_i\}$, $i =
88: 1, 2, \cdots , K$, where $K$ is the number of bins or possibilities.
89: We do not assume any metric on the space, so that a priori there is no
90: reason to believe that any $q_i$ and $q_j$ should be similar.  The
91: task is to learn this distribution from a set of examples, which we
92: can describe as the number of times $n_i$ each possibility is observed
93: in a set of $N= \sum_{i=1}^K n_i$ samples. This problem arises in the
94: context of language, where the index $i$ might label words or phrases,
95: so that there is no natural way to place a metric on the space, nor is
96: it even clear that our intuitions about similarity are consistent with
97: the constraints of a metric space.  Similarly, in bioinformatics the
98: index $i$ might label n--mers of the DNA or amino acid sequence,
99: and although most work in the field is based on metrics for sequence
100: comparison one might like an alternative approach that does not rest
101: on such assumptions.  In the analysis of neural responses, once we fix
102: our time resolution the response becomes a set of discrete ``words,''
103: and estimates of the information content in the response are
104: determined by the probability distribution on this discrete space.
105: What all of these examples have in common is that we often need to
106: draw some conclusions with data sets that are {\em not} in the
107: asymptotic limit $N \gg K$.  Thus, while we might use a large corpus
108: to sample the distribution of words in English by brute force
109: (reaching $N \gg K$ with $K$ the size of the vocabulary), we can
110: hardly do the same for three or four word phrases.
111: 
112: 
113: In models described by continuous functions, the infinite number of
114: ``possibilities'' can never be overwhelmed by examples; one is saved
115: by the notion of smoothness. Is there some nonmetric analog of this
116: notion that we can apply in the discrete case?  Our intuition is that
117: information theoretic quantities may play this role.  If we have a
118: joint distribution of two variables, the analog of a smooth
119: distribution would be one which does not have too much mutual
120: information between these variables.  Even more simply, we might say that
121: smooth distributions have large entropy.  While the idea of ``maximum
122: entropy inference'' is common \cite{maxent}, the interplay between
123: constraints on the entropy and the volume in the space of models seems
124: not to have been considered.  As we shall explain, phase space factors
125: alone imply that seemingly sensible, more or less uniform priors on the
126: space of discrete probability distributions correspond to disastrously
127: singular prior hypotheses about the entropy of the underlying
128: distribution.  We argue that reliable inference outside the asymptotic
129: regime $N \gg K$ requires a more uniform prior on the entropy, and we
130: offer one way of doing this.  While many distributions are consistent
131: with the data when $N \leq K$, we provide empirical evidence that this
132: flattening of the entropic prior allows us to make surprisingly reliable
133: statements about the entropy itself in this regime.
134: 
135: At the risk of being pedantic, we state very explicitly what we mean by
136: uniform or nearly uniform priors on the space of distributions.
137: The natural ``uniform'' prior  is given by
138: \begin{equation}
139:   {\mathcal P}_{\rm u}(\{q_i\}) = {1\over Z_{\rm u}}\,\delta\left(
140:     1 - \sum_{i=1}^K q_i\right), \;\; Z_{\rm u} = \int_{\mathcal
141:     A}dq_1 dq_2 \cdots  dq_K 
142:   \,\delta\left( 1 - \sum_{i=1}^K q_i\right) 
143: \end{equation}
144: where the delta function imposes the normalization, $Z_{\rm u}$ is the
145: total volume in the space of models, and the integration domain
146: ${\mathcal A}$ is such that each $q_i$ varies in the range $[0,1]$.
147: Note that, because of the normalization constraint, an {\em
148:   individual} $q_i$ chosen from this distribution in fact is not
149: uniformly distributed---this is also an example of phase space
150: effects, since in choosing one $q_i$ we constrain all the other
151: $\{q_{j\neq i}\}$. What we mean by uniformity is that all
152: distributions that obey the normalization constraint are equally
153: likely a priori.
154: 
155: Inference with this uniform prior is straightforward.  If our examples
156: come independently from $\{ q_i\}$, then we calculate the probability
157: of the model $\{ q_i\}$ with the usual Bayes rule: \footnote{If the data
158: are unordered,  extra combinatorial factors have to be included in $P(\{
159: n_i\} | \{ q_i\})$. However, these cancel  immediately in later
160: expressions.}
161: \begin{equation}
162:   P(\{ q_i\}| \{ n_i\} ) = \frac{P(\{ n_i\} | \{ q_i\})
163:     {\mathcal P}_{\rm u}(\{q_i\})}{P_{\rm u}(\{ n_i\})}, \;\;
164:   P(\{ n_i\} | \{ q_i\}) = \prod_{i=1}^K (q_i)^{n_i}.
165: \end{equation}
166: If we want the best estimate of the probability $q_i$ in the least
167: squares sense, then we should compute the conditional mean, and this
168: can be done exactly, so that \cite{ww,thesis}
169: \vspace{-0.5mm}
170: \begin{equation}
171: \langle q_i\rangle = {{n_i +1}\over{N+K}} .
172: \label{laprule}
173: \end{equation}
174: Thus we can think of inference with this uniform prior as setting
175: probabilities equal to the observed frequencies, but with an ``extra
176: count'' in every bin.  This sensible procedure was first introduced by
177: Laplace \cite{laplace}. It has the desirable property that events which have not been observed are not automatically assigned probability zero.
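
As a brief check of Eq.~(\ref{laprule}), the following sketch reconstructs the conditional mean from the standard Dirichlet integral. The shorthand $I(\{n_i\})$ is introduced here only for illustration and is not used elsewhere in the text.
\[
  I(\{n_i\}) \equiv \int_{\mathcal A} dq_1 \cdots dq_K \,
  \delta\left( 1 - \sum_{i=1}^K q_i\right) \prod_{i=1}^K q_i^{n_i}
  = \frac{\prod_{i=1}^K \Gamma(n_i+1)}{\Gamma(N+K)} ,
\]
\[
  \langle q_i \rangle = \frac{I(n_1,\dots,n_i+1,\dots,n_K)}{I(n_1,\dots,n_K)}
  = \frac{\Gamma(n_i+2)}{\Gamma(n_i+1)}\,
  \frac{\Gamma(N+K)}{\Gamma(N+K+1)} = \frac{n_i+1}{N+K} .
\]
For a hypothetical data set with $K=4$ bins and $N=2$ samples, both falling into the first bin, this gives $\langle q_1\rangle = 3/6 = 1/2$ and $\langle q_{2,3,4}\rangle = 1/6$ each, which sum to one as they should.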
178: 
179: 
180: A natural generalization of these ideas is to consider priors that
181: have a power--law dependence on the probabilities, the so--called Dirichlet family of priors:
182: \vspace{-0.5mm}
183: \begin{equation}
184: {\mathcal P}_\beta(\{q_i\}) = {1\over Z(\beta)}
185: \delta\left( 1 - \sum_{i=1}^K q_i\right)
186: \prod_{i=1}^K q_i^{\beta-1} \,.
187: \label{P(q)}
188: \end{equation}
189: 
190: It is interesting to see what typical distributions from these priors
191: look like. Even though different $q_i$'s are not independent random
192: variables due to the normalizing $\delta$--function, generation of
193: random distributions is still easy: one can show that if $q_i$'s are
194: generated successively (starting from $i=1$ and proceeding up to
195: $i=K$) from the Beta--distribution
196: \begin{equation}
197:   P(q_i) = B\left(\frac{q_i}{1-\sum_{j<i} q_j}; \beta, (K-i)\beta
198:   \right),\;\;\;\;  B\left(x; a,b \right) =
199:   \frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}\,,
200: \label{betadistr}
201: \end{equation}
202: 
203: \begin{wrapfigure}{r}{63mm}
204:   \vspace{-0mm}
205:   \centerline{\epsfxsize=1.0\hsize\epsffile{Q_example.eps}}  
206:   \vspace{-3.5mm}
207:   \caption{Typical distributions, $K=1000$.} 
208:   \FORMAT{\vspace{-1mm}}{}{}
209:   \label{example}
210: \end{wrapfigure}
211: 
212: \noindent then the probability of the whole sequence $\{q_i\}$ is ${\mathcal
213:   P}_{\beta}(\{q_i\})$.  Fig.~\ref{example} shows some typical
214: distributions generated this way. They represent different regions of
215: the range of possible entropies: low entropy ($\sim 1$ bit, where only
216: a few bins have observable probabilities), entropy in the middle of
217: the possible range, and entropy in the vicinity of the maximum,
218: $\log_2 K$.  When learning an unknown distribution, we usually have no
219: a priori reason to expect it to look like only one of these
220: possibilities, but choosing $\beta$ pretty much fixes the allowed
221: ``shapes.''  This will be a focal point of our discussion.
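
To make the sampling recipe of Eq.~(\ref{betadistr}) concrete, it may help to write it in terms of auxiliary variables $x_i$, the fraction of the still unassigned probability mass given to bin $i$. The symbols $x_i$ are our notation, introduced only for this illustration.
\[
  x_i \equiv \frac{q_i}{1-\sum_{j<i} q_j}\,, \;\;\;\;
  P(x_i) = B\left(x_i; \beta, (K-i)\beta\right), \;\;\;\;
  q_i = x_i \prod_{j<i} (1-x_j) \,.
\]
For example, for $K=3$ one draws $x_1$ from $B(x;\beta,2\beta)$ and $x_2$ from $B(x;\beta,\beta)$, and the last bin simply takes the remaining mass:
\[
  q_1 = x_1, \;\;\;\; q_2 = x_2 (1-x_1), \;\;\;\; q_3 = (1-x_1)(1-x_2) \,,
\]
so that $q_1+q_2+q_3=1$ holds by construction.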
222: 
223: 
224: Even though distributions look different, inference with all priors
225: Eq.~(\ref{P(q)}) is similar \cite{ww,thesis}:
226: \begin{equation}
227: \langle q_i\rangle_\beta = {{n_i
228: +\beta}\over{N+\kappa}}\,,\;\;\;\; \kappa = K\beta.
229: \label{estim}
230: \end{equation}
231: This simple modification of Laplace's rule, Eq.~(\ref{laprule}),
232: which allows us to vary the probability assigned to outcomes not yet
233: seen, was first examined by Hardy and Lidstone \cite{hardy,lidstone}.
234: Together with Laplace's formula, $\beta=1$, this family includes the
235: usual maximum likelihood estimator (MLE), $\beta \to 0$, that identifies
236: probabilities with frequencies, as well as the Jeffreys' or
237: Krichevsky--Trofimov (KT) estimator, $\beta=1/2$ \cite{jeffreys,kt,wst},
238: the Schurmann--Grassberger (SG) estimator, $\beta=1/K$ \cite{sg}, and
239: other popular choices.
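
A small numerical illustration of Eq.~(\ref{estim}) may help to compare these choices. The data set is hypothetical: $K=1000$ bins and $N=10$ samples, all of which land in different bins. The table lists the estimate for a bin seen once, for an unseen bin, and the total probability assigned to the $990$ unseen bins, $990\,\beta/(N+\kappa)$.
\begin{center}
\begin{tabular}{lccccc}
estimator & $\beta$ & $\kappa$ & $\langle q_i\rangle$, $n_i=1$ & $\langle q_i\rangle$, $n_i=0$ & unseen mass \\ \hline
MLE     & $0$   & $0$    & $0.1$    & $0$                 & $0$     \\
SG      & $1/K$ & $1$    & $0.0910$ & $9.1\cdot 10^{-5}$  & $0.090$ \\
KT      & $1/2$ & $500$  & $0.0029$ & $9.8\cdot 10^{-4}$  & $0.971$ \\
Laplace & $1$   & $1000$ & $0.0020$ & $9.9\cdot 10^{-4}$  & $0.980$ \\
\end{tabular}
\end{center}
In this undersampled example the four estimators disagree by orders of magnitude about the probability of what has not yet been seen.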
240: 
241: 
242: 
243: To understand why inference in the family of priors defined by
244: Eq.~(\ref{P(q)}) is unreliable, consider the entropy of a distribution
245: drawn at random from this ensemble.  Ideally we would like to compute
246: this whole a priori distribution of entropies,
247: \begin{equation}
248: {\mathcal  P}_\beta (S) = \int dq_1  dq_2 \cdots dq_K \,
249: {\mathcal P}_\beta(\{q_i\})
250: \,\delta\left[
251: S + \sum_{i =1}^K q_i\log_2 q_i \right] ,
252: \end{equation}
253: but this is quite difficult. However, as noted by Wolpert and Wolf
254: \cite{ww}, one can compute the moments of ${\mathcal P}_\beta (S)$
255: rather easily.  Transcribing their results to the present notation
256: (and correcting some small errors), we find:
257: \begin{eqnarray}
258:   \xi(\beta)  \equiv  \langle\, S [n_i =0]\, \rangle_\beta  &=& 
259:   \psi_0(\kappa+1) 
260:   -\psi_0(\beta+1) \, ,
261:   \label{Sap}
262:   \\
263:   \sigma^2(\beta) \equiv \langle \, (\delta S)^2  [n_i =0] \rangle_\beta
264:      &=& 
265:   \frac{\beta+1}{\kappa +
266:     1}\, \psi_1(\beta+1) -\psi_1(\kappa+1) \,,
267:   \label{dS2ap}
268: \end{eqnarray}
269: \vspace{-0.5mm}
270: where $\psi_m(x) = (d/dx)^{m+1} \log_2 \Gamma(x)$ are the polygamma
271: functions. 
272: 
273: 
274: \begin{wrapfigure}{L}{63mm}
275:   \vspace{-1mm}
276:   \centerline{\epsfxsize=1.0\hsize\epsffile{mean_var.eps}}
277:   \vspace{-4mm}
278:   \caption{$\xi(\beta) / \log_2
279:     K$ and $\sigma(\beta)$ as functions of $\beta$ and $K$; gray bands
280:     are the region of $\pm \sigma(\beta)$ around the mean. Note the
281:     transition from the logarithmic to the linear scale at
282:     $\beta=0.25$ in the insert.} 
283:   \FORMAT{\vspace{1mm}}{}{}
284: \label{Sapriori}
285: \end{wrapfigure}
286: 
287: This behavior of the moments is shown in Fig.~\ref{Sapriori}.  We are
288: faced with a striking observation: a priori distributions of entropies
289: in the power--law priors are extremely peaked for even moderately
290: large $K$. Indeed, as a simple analysis shows, their maximum standard
291: deviation of approximately 0.61 bits is attained at $\beta \approx
292: 1/K$, where $\xi(\beta) \approx 1/\ln 2$ bits. This has to be compared
293: with the possible range of entropies, $[0, \log_2 K]$, which is
294: asymptotically large with $K$.  Even worse, for any fixed $\beta$ and
295: sufficiently large $K$, $\xi(\beta) = \log_2 K - O(K^0)$, and
296: $\sigma(\beta) \propto 1/\sqrt{\kappa}$. Similarly, if $K$ is large,
297: but $\kappa$ is small, then $\xi(\beta) \propto \kappa$, and
298: $\sigma(\beta) \propto \sqrt{\kappa}$.  This paints a lively picture:
299: varying $\beta$ between $0$ and $\infty$ results in a smooth variation
300: of $\xi$, the a priori expectation of the entropy, from $0$ to $S_{\rm
301:   max}= \log_2 K$.  Moreover, for large $K$, the standard deviation of
302: ${\mathcal P}_{\beta} (S)$ is always negligible relative to the
303: possible range of entropies, and it is negligible even absolutely for
304: $\xi\gg 1$ ($\beta \gg 1/K$). Thus a seemingly innocent choice of the
305: prior, Eq.~(\ref{P(q)}), leads to a disaster: {\em fixing $\beta$
306:   specifies the entropy almost uniquely}.  Furthermore, the situation
307: persists even after we observe some data: {\em until the distribution
308:   is well sampled, our estimate of the entropy is dominated by the prior!}
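
The numbers quoted above for $\beta \approx 1/K$ can be checked by hand from Eqs.~(\ref{Sap}, \ref{dS2ap}). In this sketch $\psi(x)$ and $\psi'(x)$ denote the digamma and trigamma functions defined with the natural logarithm (our shorthand for this check only), so that intermediate values are in nats and are converted to bits by dividing the entropy or its standard deviation by $\ln 2$. We use $\psi(2)-\psi(1)=1$, $\psi'(1)=\pi^2/6$, and $\psi'(2)=\pi^2/6-1$. For $\beta=1/K$ we have $\kappa=1$, and for large $K$ we may set $\beta+1 \approx 1$:
\[
  \xi \approx \frac{\psi(2)-\psi(1)}{\ln 2} = \frac{1}{\ln 2} \approx 1.44 \;{\rm bits},
\]
\[
  \sigma^2 \approx \frac{1}{2}\,\psi'(1) - \psi'(2) = 1 - \frac{\pi^2}{12}
  \approx 0.178 \;{\rm nats}^2, \;\;\;\;
  \sigma \approx 0.42 \;{\rm nats} \approx 0.61 \;{\rm bits}.
\]
For comparison, the full range of entropies for $K=1000$ is $\log_2 K \approx 9.97$ bits.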
309: 
310: Thus it is clear that all commonly used estimators mentioned above
311: have a problem. While they may or may not provide a reliable estimate
312: of the distribution $\{q_i\}$\footnote{In any case, the answer to
313:   this question depends mostly on the ``metric'' chosen to measure
314:   reliability. Minimization of bias, variance, or information cost
315:   (Kullback--Leibler divergence between the target distribution and
316:   the estimate) leads to very different ``best'' estimators.}, they
317: are definitely a poor tool to learn entropies.  Unfortunately, often
318: we are interested precisely in these entropies or similar
319: information--theoretic quantities, as in the examples (neural code,
320: language, and bio\-informatics) we briefly mentioned earlier.
321: 
322: Are the usual estimators really this bad? Consider this: for the MLE
323: ($\beta=0$), Eqs.~(\ref{Sap}, \ref{dS2ap}) are formally wrong since it
324: is impossible to normalize ${\mathcal P}_0(\{q_i\})$.  However, the
325: prediction that ${\mathcal P}_0(S) = \delta(S)$ still holds. Indeed,
326: $S_{\rm ML}$, the entropy of the ML distribution, is zero even for
327: $N=1$, let alone for $N=0$. In general, it is well known that $S_{\rm
328:   ML}$ always underestimates the actual value of the entropy, and the
329: correction  \vspace{-0.5mm}
330: \begin{equation}
331:   S = S_{\rm ML} + \frac{K^*}{2N} + O \left( \frac{1}{N^2} \right) 
332:   \label{corr}
333: \end{equation}
334: \vspace{-0.5mm} is usually used (cf.~\cite{sg}).  Here we must set
335: $K^*=K-1$ to have an asymptotically correct result.  Unfortunately in
336: an undersampled regime, $N \ll K$, this is a disaster. To alleviate
337: the problem, different authors have suggested determining the dependence
338: $K^*=K^*(K)$ by various (rather ad hoc) empirical \cite{srrb} or
339: pseudo--Bayesian techniques \cite{pt}.  However, then there is no
340: principled way to estimate both the residual bias and the error of the
341: estimator.
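
A hypothetical example shows the size of the problem with Eq.~(\ref{corr}). Take $K=1000$ and $N=30$, with all samples falling into different bins, so that $S_{\rm ML} = \log_2 30 \approx 4.91$ bits. With the asymptotically correct $K^*=K-1=999$ the correction term is
\[
  \frac{K^*}{2N} = \frac{999}{60} \approx 16.7 \,,
\]
which by itself exceeds the maximum possible entropy, $\log_2 K \approx 9.97$ bits. This conclusion holds whether the correction is read in nats or converted to bits ($16.7/\ln 2 \approx 24.0$).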
342: 
343: 
344: The situation is even worse for Laplace's rule, $\beta=1$. We were
345: unable to find any results in the literature that would show a clear
346: understanding of the effects of the prior on the entropy estimate,
347: $S_{\rm L}$.  And these effects are enormous: the a priori
348: distribution of the entropy has $\sigma(1) \sim 1/\sqrt{K}$ and is
349: almost $\delta$-like. This translates into a very certain, but
350: nonetheless possibly wrong, estimate of the entropy. We believe that
351: this type of error (cf.~Fig.~\ref{fixedbeta}) has been overlooked in
352: some previous literature.
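
To attach numbers to this statement, consider $\beta=1$ and $K=1000$ in Eqs.~(\ref{Sap}, \ref{dS2ap}). As a rough sketch we write $\psi$ and $\psi'$ for the natural--logarithm digamma and trigamma functions (our shorthand for this estimate only), use the large--$\kappa$ approximations $\psi(\kappa+1)\approx \ln \kappa$ and $\psi'(\kappa+1) \approx 1/\kappa$, and convert to bits at the end by dividing by $\ln 2$. With $\psi(2) = 1-\gamma \approx 0.423$ and $\psi'(2) = \pi^2/6 - 1 \approx 0.645$,
\[
  \xi(1) \approx \log_2 K - \frac{1-\gamma}{\ln 2} \approx 9.97 - 0.61 = 9.36 \;{\rm bits},
\]
\[
  \sigma(1) \approx \frac{1}{\ln 2} \sqrt{\frac{2\psi'(2) - 1}{K}}
  \approx \frac{0.54}{\ln 2\, \sqrt{K}} \approx 0.025 \;{\rm bits}.
\]
Thus, before any data are seen, Laplace's rule already asserts that the entropy lies within a few hundredths of a bit of $9.36$ bits.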
353: 
354: 
355: 
356: The Schurmann--Grassberger estimator, $\beta=1/K$, deserves special
357: attention. The variance of ${\mathcal P}_{\beta}(S)$ is maximized near
358: this value of $\beta$ (cf.~Fig.~\ref{Sapriori}).  Thus the SG
359: estimator results in the most uniform a priori expectation of $S$
360: possible for the power--law priors, and consequently in the least
361: bias. We suspect that this feature is responsible for a remark in
362: Ref.~\cite{sg} that this $\beta$ was empirically the best for studying
363: printed texts. But even the SG estimator is flawed: it is biased
364: towards (roughly) $1/\ln 2$, and it is still a priori rather narrow.
365: 
366: \begin{wrapfigure}{r}{63mm}
367:   \vspace{-1mm}
368:   \centerline{\epsfxsize=1.0\hsize\epsffile{diffbeta.eps}}
369:   \vspace{-5mm}
370:   \caption{Learning the $\beta=0.02$ distribution from  Fig.~\ref{example}
371:     with $\beta=0.001, 0.02, 1$. The actual error of the estimators is
372:     plotted; the error bars are the standard deviations of the
373:     posteriors. The ``wrong'' estimators are very certain but
374:     nonetheless incorrect.}  
375:   \FORMAT{\vspace{-2mm}}{}{\vspace{-2mm} }
376:   \label{fixedbeta}
377: \end{wrapfigure}
378: 
379: 
380: 
381: 
382: 
383: Summarizing, we conclude that simple power--law priors,
384: Eq.~(\ref{P(q)}), must not be used to learn entropies when there is no
385: strong a priori knowledge to back them up. On the other hand, they are
386: the only priors we know of that allow one to calculate $\langle q_i
387: \rangle$, $\langle S \rangle$, $\langle \chi^2 \rangle$, \dots exactly
388: \cite{ww}. Is there a way to resolve the problem of peakedness of
389: ${\mathcal P}_{\beta}(S)$ without throwing away their analytical ease?
390: One approach would be to use $ {\mathcal P}^{\rm
391:   flat}_{\beta}(\{q_i\}) = \frac{{\mathcal P}_{\beta}(\{q_i\})
392:   }{{\mathcal P}_{\beta}(S[q_i])} \; {\mathcal P}^{\rm
393:   actual}(S[q_i])\,$ as a prior on $\{q_i\}$. This has the feature that
394: the a priori distribution of $S$ deviates from uniformity only due to
395: our actual knowledge ${\mathcal P}^{\rm actual} (S[q_i])$, but not in
396: the way ${\mathcal P}_{\beta}(S)$ does.  However, as we already
397: mentioned, ${\mathcal P}_{\beta}(S[q_i])$ is yet to be calculated.
398: 
399: 
400: Another way to a flat prior is to write ${\mathcal P}(S) = 1 = \int
401: \delta(S - \xi) d \xi$. If we find a family of priors ${\mathcal
402:   P}(\{q_i\}, {\rm parameters})$ that result in a $\delta$-function
403: over $S$, and if changing the parameters moves the peak across the
404: whole range of entropies uniformly, we may be able to use this.
405: Luckily, ${\mathcal P}_{\beta}(S)$ is almost a
406: $\delta$-function!~\footnote{The approximation becomes not so good as
407:   $\beta \to 0$ since $\sigma(\beta)$ becomes $O(1)$ before dropping
408:   to zero.  Even worse, ${\mathcal P}_{\beta}(S)$ is skewed at small
409:   $\beta$. This accumulates an extra weight at $S=0$.  Our approach to
410:   dealing with these problems is to ignore them while the posterior
411:   integrals are dominated by $\beta$'s that are far away from zero.
412:   This was always the case in our simulations, but is an open
413: question for the analysis of real data.} In addition, changing
414: $\beta$ results in changing $\xi(\beta) = \langle\, S [n_i=0] \,
415: \rangle_\beta$ across the whole range $[0, \log_2 K]$. So we may hope
416: that the prior \footnote{Priors that are formed as weighted sums of the
417: different members of the Dirichlet family are usually called {\em
418: Dirichlet mixture priors}. They have been used to estimate probability
419: distributions of, for example, protein sequences \cite{mixt}.
420: Equation (\ref{Pflat}), an {\em infinite} mixture, is a further
421: generalization, and, to our knowledge, it has not been studied before.}
422: \begin{equation}
423: {\mathcal P} (\{q_i\};\beta) = {1\over Z}\,
424: \delta\left( 1 - \sum_{i=1}^K q_i\right)
425: \prod_{i=1}^K q_i^{\beta-1} \frac{d \xi(\beta)}{d\beta} \,{\mathcal P}(\beta)
426: \label{Pflat}
427: \end{equation}
428: may do the trick and estimate entropy reliably even for small $N$, and
429: even for distributions that are atypical for any one $\beta$. We have less
430: reason, however, to expect that this will give an equally reliable
431: estimator of the atypical distributions themselves.$^2$ Note the term $d\xi/d\beta$  in Eq.~(\ref{Pflat}). It is there because $\xi$, not $\beta$, measures the position of the entropy density peak.
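
For reference, the Jacobian in Eq.~(\ref{Pflat}) follows directly from differentiating Eq.~(\ref{Sap}) with $\kappa = K\beta$. This is a short sketch in the notation already defined above:
\[
  \frac{d\xi(\beta)}{d\beta} = K\, \psi_1(\kappa+1) - \psi_1(\beta+1) \,.
\]
With this factor, integrating over $\beta$ with the weight $\frac{d\xi}{d\beta}\,{\mathcal P}(\beta)$ is the same as integrating over $\xi$ with the weight ${\mathcal P}(\beta(\xi))$, so that a constant ${\mathcal P}$ corresponds to a prior that is (approximately) flat in the entropy rather than in $\beta$.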
432: 
433: 
434: Inference with the prior, Eq.~(\ref{Pflat}), involves additional
435: averaging over $\beta$ (or, equivalently, $\xi$), but is nevertheless
436: straightforward. The a posteriori moments of the entropy are
437: \begin{eqnarray}
438:   \widehat{S^m} &=& \frac{\int d\xi\, 
439:     \rho(\xi,[n_i]) \langle\, S^m [n_i]\, \rangle_{\beta(\xi)}}
440:   {\int d\xi\, \rho(\xi,[n_i])}\,,\;\;\;\mbox{where}
441:   \label{Shat}
442:   \\
443:   \rho(\xi, [n_i]) &=& {\mathcal P}\left(\beta\left(\xi\right)\right)
444:   \frac{\Gamma(\kappa(\xi))}{\Gamma(N+\kappa(\xi))}\,
445:   \prod_{i=1}^K \frac{\Gamma(n_i+\beta(\xi))}{\Gamma(\beta(\xi))}\,.
446:   \label{rho}
447: \end{eqnarray}
448: Here the moments $\langle\, S^m [n_i]\, \rangle_{\beta(\xi)}$ are
449: calculated at fixed $\beta$ according to the (corrected) formulas of
450: Wolpert and Wolf \cite{ww}.  We can view this inference scheme as
451: follows: first, one sets the value of $\beta$ and calculates the
452: expectation value (or other moments) of the entropy at this $\beta$.
453: For small $N$, the expectations will be very close to their a priori
454: values due to the peakedness of ${\mathcal P}_{\beta}(S)$.
455: Afterwards, one integrates over $\beta(\xi)$ with the density
456: $\rho(\xi)$, which includes our a priori expectations about the
457: entropy of the distribution we are studying [${\mathcal
458:   P}\left(\beta\left(\xi\right)\right)$], as well as the evidence for
459: a particular value of $\beta$ [$\Gamma$-terms in Eq.~(\ref{rho})].
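
The $\Gamma$-terms in Eq.~(\ref{rho}) can be recovered in one line. The sketch below assumes only the standard normalization of the prior in Eq.~(\ref{P(q)}), $Z(\beta) = \Gamma(\beta)^K/\Gamma(\kappa)$, and the likelihood $\prod_i q_i^{n_i}$:
\[
  \int dq_1 \cdots dq_K \; {\mathcal P}_\beta(\{q_i\}) \prod_{i=1}^K q_i^{n_i}
  = \frac{\Gamma(\kappa)}{\Gamma(\beta)^K}\,
  \frac{\prod_{i=1}^K \Gamma(n_i+\beta)}{\Gamma(N+\kappa)}
  = \frac{\Gamma(\kappa)}{\Gamma(N+\kappa)}
  \prod_{i=1}^K \frac{\Gamma(n_i+\beta)}{\Gamma(\beta)} \,.
\]
Bins with $n_i=0$ contribute a factor of one, so only occupied bins matter. For numerical work it is usually safer to evaluate the logarithm,
\[
  \ln \rho = \ln {\mathcal P}(\beta) + \ln\Gamma(\kappa) - \ln\Gamma(N+\kappa)
  + \sum_{i:\, n_i>0} \left[ \ln\Gamma(n_i+\beta) - \ln\Gamma(\beta) \right] ,
\]
since the individual $\Gamma$-functions overflow quickly for realistic $N$ and $K$.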
460: 
461: The crucial point is the behavior of the evidence. If it has a
462: pronounced peak at some $\beta_{\rm cl}$, then the integrals over
463: $\beta$ are dominated by the vicinity of the peak, $\widehat{S}$ is
464: close to $\xi(\beta_{\rm cl})$, and the variance of the estimator is
465: small. In other words, data ``selects'' some value of $\beta$, much in
466: the spirit of Refs.~\cite{mackay} -- \cite{nb}.  However, this
467: scenario may fail in two ways.  First, there may be no peak in the
468: evidence; this will result in a very wide posterior and poor
469: inference. Second, the posterior density may be dominated by $\beta
470: \to 0$, which corresponds to MLE, the best possible fit to the data,
471: and is a discrete analog of overfitting.  While all these situations
472: are possible, we claim that generically the evidence is well--behaved.
473: Indeed, while small $\beta$ increases the fit to the data, it also
474: increases the phase space volume of all allowed distributions and thus
475: decreases the probability of each particular one [remember that $\langle
476: q_i \rangle_{\beta}$ has $\beta$ extra counts in each bin, thus
477: distributions with $q_i < \beta/(N+\kappa)$ are strongly suppressed].
478: The fight between the ``goodness of fit'' and the phase space volume
479: should then result in some non--trivial $\beta_{\rm cl}$, set by factors
480: $\propto N$ in the exponent of the integrand.
481: 
482: 
483: Figure~\ref{learning} shows how the prior, Eq.~(\ref{Pflat}), performs
484: on some of the many distributions we tested. The left panel describes
485: learning of distributions that are typical in the prior ${\mathcal
486:   P}_{\beta}(\{q_i\})$ and, therefore, are also likely in ${\mathcal
487:   P}(\{q_i\};\beta)$. Thus we may expect a reasonable performance, but
488: the real results exceed all expectations: for all three cases, the
489: actual relative error drops to the $10\%$ level at $N$ as low as 30
490: (recall that $K=1000$, so we only have $\sim 0.03$ data points per bin
491: on average)! To put this in perspective, simple estimates like fixed
492: $\beta$ ones, MLE, and MLE corrected as in Eq.~(\ref{corr}) with $K^*$
493: equal to the number of nonzero $n_i$'s produce an error so big that it
494: puts them off the axes until $N >100$. \footnote{More work is needed to
495:   compare our estimator to more complex techniques, such as those of
496:   Refs.~\cite{srrb,pt}.}  Our results have two more nice features: the
497: estimator seems to know its error pretty well, and it is almost
498: completely unbiased.
499: 
500: 
501: \begin{figure}[t]
502:   \begin{center}
503:     \begin{picture}(60,5)(0,0)
504:       \put(-60,0){(a)}
505:       \put(120,0){(b)}
506:     \end{picture}
507:   \end{center}
508:   \vspace{-1mm}
509:   \centerline{\epsfxsize=.49\hsize\epsffile{correct.eps}
510:     \epsfxsize=.49\hsize\epsffile{incorrect.eps}}
511:   \vspace{-4mm}
512:   \caption{Learning entropies with the prior Eq.~(\ref{Pflat}) and
513:     ${\mathcal P}(\beta)=1$. The actual relative errors of the
514:     estimator are plotted; the error bars are the relative widths of
515:     the posteriors. (a) Distributions from Fig.~\ref{example}. (b)
516:     Distributions atypical in the prior.  Note that while
517:     $\widehat{S}$ may be safely calculated as just $\langle S
518:     \rangle_{\beta_{\rm cl}}$, one has to do an honest integration
519:     over $\beta$ to get $\widehat{S^2}$ and the error bars.  Indeed,
520:     since ${\mathcal P}_{\beta} (S)$ is almost a $\delta$-function,
521:     the uncertainty at any fixed $\beta$ is very small (see
522:     Fig.~\ref{fixedbeta}).}
523:   \label{learning}
524:   \vspace{-4mm}
525: \end{figure}
526: 
527: 
528: One might be puzzled at how it is possible to estimate anything in a
529: 1000--bin distribution with just a few samples: the distribution is
530: completely unspecified for low $N$! The point is that we are not
531: trying to learn the distribution --- in the absence of additional prior
532: information this would, indeed, take $N\gg K$ --- but to estimate
533: just one of its characteristics. It is less surprising that one number
534: can be learned well with only a handful of measurements. In practice
535: the algorithm builds its estimate based on the number of coinciding
536: samples (multiple coincidences are likely only for small $\beta$), as
537: in Ma's approach to entropy estimation from simulations of physical
538: systems
539: \cite{ma}.
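
The role of coincidences can be seen explicitly in the smallest nontrivial case, $N=2$, using only the $\Gamma$-terms of Eq.~(\ref{rho}) and assuming a constant ${\mathcal P}(\beta)$. This is an illustrative sketch, not a description of the general behavior. If the two samples fall into different bins, then
\[
  \rho \propto \frac{\beta^2}{\kappa(\kappa+1)} = \frac{\beta}{K(K\beta+1)} \,,
\]
which grows monotonically with $\beta$, so the evidence favors large $\beta$ and entropies near $\log_2 K$. If instead both samples fall into the same bin, then
\[
  \rho \propto \frac{\beta(\beta+1)}{\kappa(\kappa+1)} = \frac{\beta+1}{K(K\beta+1)} \,,
\]
which decreases monotonically with $\beta$ for $K>1$, so the evidence favors small $\beta$ and low entropies. More generally, if all $N$ samples are distinct, $\rho \propto \prod_{j=0}^{N-1} (K + j/\beta)^{-1}$, which again only increases with $\beta$: without coincidences the data cannot select a finite $\beta$.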
540: 
541: 
542: 
543: 
544: What will happen if the algorithm is fed with data from a distribution
545: $\{\tilde{q}_i\}$ that is strongly atypical in ${\mathcal
546:   P}(\{q_i\};\beta)$? Since there is no $\{\tilde{q}_i\}$ in our
547: prior, its estimate may suffer.  Nonetheless, for any
548: $\{\tilde{q}_i\}$, there is some $\beta$ which produces distributions
549: with the same mean entropy as $S[\tilde{q}_i]$.  Such $\beta$ should
550: be determined in the usual fight between the ``goodness of fit'' and
551: the Occam factors, and the correct value of entropy will follow.
552: However, there will be an important distinction from the ``correct
553: prior'' cases. The value of $\beta$ indexes available phase space
554: volumes, and thus the smoothness (complexity) of the model class
555: \cite{bnt}. In the case of discrete distributions, smoothness is the
556: absence of high peaks. Thus data with faster decaying Zipf plots
557: (plots of bins' occupancy vs.\ occupancy rank $i$) are rougher. The priors ${\mathcal P}_{\beta}(\{q_i\})$ cannot account for all possible roughnesses. Indeed, they only generate distributions for which the expected number of bins $\nu$ with the probability mass less than some $q$ is given by $\nu(q) = K B(q, \beta, \kappa -\beta)$, where $B$ is the familiar incomplete Beta function, as in Eq.~(\ref{betadistr}). This means that the expected rank ordering for small and large ranks is
558: \begin{eqnarray}
559: q_i &\approx& 1 - \left[\frac{ \beta B(\beta, \kappa - \beta )  (K-1) \,i}
560: {K} \right] ^{1/(\kappa-\beta)}, \,\,\,\, i\ll K\,,
561: \label{left}\\
562: q_i &\approx& \left[ \frac{ \beta B(\beta, \kappa - \beta )  (K-i+1)}
563: {K}\right]^{1/\beta},\,\,\,\, K-i+1 \ll K\,.
564: \end{eqnarray}
565: In an undersampled regime we can observe only the first of the behaviors. Therefore, any
566: distribution with $q_i$ decaying
567: faster (rougher) or slower (smoother) than Eq.~(\ref{left}) for some $\beta$ cannot be explained
568: well with fixed $\beta_{\rm cl}$ for different $N$.  So, unlike in the cases of learning  data that are typical in ${\mathcal P}_{\beta}(\{q_i\})$, we should
569: expect to see $\beta_{\rm cl}$ growing (falling) for qualitatively
570: smoother (rougher) cases as $N$ grows.
571: 
572: \FORMAT{
573: \tabcolsep 0.5mm
574: \begin{wraptable}{r}{40.5mm}{
575: %\begin{floatingtable}{
576: %\begin{tabular}{ccccccc}
577: %$N$  &0.0007& 0.02 & 1.0  & 1/2 full & Zipf & rough\\ \hline
578: %{\small units}  & $\cdot 10^{-4}$ & $\cdot 10^{-2}$ & $\cdot 10^{-0}$ &
579: %$\cdot 10^{-2}$ & $\cdot 10^{-1}$ & $\cdot 10^{-3}$ \\ \hline
580: %10   & 4.3  & 4.1  & 2773 & 1.7      & 1907 & 16.8\\
581: %30   & 6.1  & 1.9  & 0.74 & 2.2      & 0.99 & 11.5\\
582: %100  & 4.3  & 2.3  & 0.80 & 2.4      & 0.86 & 12.9\\
583: %300  & 3.4  & 2.0  & 1.12 & 2.2      & 1.36 & 8.3 \\
584: %1000 & 5.9  & 2.0  & 0.96 & 2.1      & 2.24 & 6.4 \\
585: %3000 & 6.3  & 1.9  & 0.99 & 1.9      & 3.36 & 5.4 \\
586: %10000& 1.0  & 1.8  & 0.99 & 2.0      & 4.89 & 4.5 \\
587: %\end{tabular}
588: \begin{tabular}{cccc}
589: $N$  & 1/2 full & Zipf & rough\\ \hline
590: {\small units} & $\cdot 10^{-2}$ & $\cdot 10^{-1}$ & $\cdot 10^{-3}$ \\ \hline
591: 10   & 1.7      & 1907 & 16.8\\
592: 30   & 2.2      & 0.99 & 11.5\\
593: 100  & 2.4      & 0.86 & 12.9\\
594: 300  & 2.2      & 1.36 & 8.3 \\
595: 1000 & 2.1      & 2.24 & 6.4 \\
596: 3000 & 1.9      & 3.36 & 5.4 \\
597: 10000& 2.0      & 4.89 & 4.5 \\
598: \end{tabular}}
599: \vspace{-3mm}
\caption{$\beta_{\rm cl}$ for solutions shown in Fig.~\ref{learning}(b).}
601: \label{betacl}
602: \end{wraptable}}{}{}
603: 
Figure~\ref{learning}(b) and Tbl.~\ref{betacl} illustrate these
points. First, we study the $\beta=0.02$ distribution from
Fig.~\ref{example}, to which we added 1000 extra bins, each with
$q_i=0$.  Our estimator performs remarkably well, and $\beta_{\rm cl}$
does not drift because the ranking law remains the same. Then we turn
to the famous Zipf distribution, so common in Nature. It has $q_i
\propto 1/i$, which is qualitatively smoother than our prior allows.
Correspondingly, we get an upward drift in $\beta_{\rm cl}$. Finally,
we analyze a ``rough'' distribution, which has $q_i \propto 50 - 4(\ln
i)^2$, and $\beta_{\rm cl}$ drifts downward. Clearly, one would want
to predict the dependence $\beta_{\rm cl}(N)$ analytically, but this
requires calculating the predictive information (complexity) for the
distributions involved \cite{bnt} and is left for future work. Notice that the entropy estimator for atypical
617: \FORMAT{}{}{
618: \tabcolsep 0.5mm
619: \begin{wraptable}{r}{40.5mm}{
620: \begin{tabular}{cccc}
621: $N$  & 1/2 full & Zipf & rough\\ \hline
622: {\small units} & $\cdot 10^{-2}$ & $\cdot 10^{-1}$ & $\cdot 10^{-3}$ \\ \hline
623: 10   & 1.7      & 1907 & 16.8\\
624: 30   & 2.2      & 0.99 & 11.5\\
625: 100  & 2.4      & 0.86 & 12.9\\
626: 300  & 2.2      & 1.36 & 8.3 \\
627: 1000 & 2.1      & 2.24 & 6.4 \\
628: 3000 & 1.9      & 3.36 & 5.4 \\
629: 10000& 2.0      & 4.89 & 4.5 \\
630: \end{tabular}}
631: \vspace{-3mm}
\caption{$\beta_{\rm cl}$ for solutions shown in Fig.~\ref{learning}(b).}
633: \label{betacl}
634: \FORMAT{}{}{\vspace{-3mm}}
635: \end{wraptable}}
 cases is almost as
good as for typical ones.  A possible exception is the $N=100$--$1000$
points for the Zipf distribution, which are about two standard
deviations off. We also saw similar effects in some other ``smooth''
cases.  This may be another manifestation of an observation made in
Ref.~\cite{nb}: smooth priors can easily adapt to rough distributions,
but there is a limit to the smoothness beyond which rough priors
become inaccurate.
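
For the Zipf case the quantity being estimated can be written explicitly.
As an illustration only, assume that the Zipf law refers to bin probabilities
and is imposed on exactly $K$ bins, $q_i = 1/(i\,H_K)$ with
$H_K=\sum_{j=1}^{K} 1/j$; the normalization $H_K$ and the unspecified $K$ are
introduced here and are not taken from the text. With natural logarithms the
entropy is then
\[
S = -\sum_{i=1}^{K} q_i \ln q_i
  = \ln H_K + \frac{1}{H_K}\sum_{i=1}^{K}\frac{\ln i}{i}\,.
\]
For large $K$ one has $H_K \approx \ln K + \gamma$, where
$\gamma\approx 0.577$ is Euler's constant, and
$\sum_{i=1}^{K} (\ln i)/i \approx (\ln K)^2/2$, so that
\[
S \approx \frac{1}{2}\ln K + \ln\ln K
\]
up to corrections of order one. Under these assumptions the Zipf entropy is
roughly half of the maximal value $\ln K$.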
644: 
645: 
646: 
647: To summarize, an analysis of a priori entropy statistics in common
648: power--law Bayesian estimators revealed some very undesirable features. We are fortunate, however, that these minuses can be easily
649: turned into pluses, and the resulting estimator of entropy is precise,
650: knows its own error, and gives amazing results for a very large class of
651: distributions.
652: 
653: 
654: 
655: 
656: \section*{Acknowledgements}
657: We thank Vijay Balasubramanian, Curtis Callan, Adrienne Fairhall, Tim
658: Holy, Jonathan Miller, Vipul Periwal, Steve Strong, and Naftali Tishby for useful
659: discussions. I.\ N.\ was supported in part by NSF Grant No.\ PHY99-07949 to the Institute for Theoretical Physics. 
660: 
661: 
662: 
663: \begin{thebibliography}{99}
664: \itemsep 0mm
665: {\small
666:     \bibitem{mackay}\newblock{D.~MacKay, {\it Neural Comp.} {\bf 4},
667:     415--448 (1992).}
668: 
669:     \bibitem{vijay}\newblock{V.~Balasubramanian, {\em Neural Comp.}
670:     {\bf 9}, 349--368 (1997)\FORMAT{.}{, {\tt \small
671:         adap-org/9601001}.}{, {\tt \small adap-org/9601001}.}}
672: 
673:     \bibitem{bcs}\newblock{W.~Bialek, C.~Callan, and S.~Strong, {\it
674:       Phys.~Rev.~Lett.}  {\bf 77}, 4693--4697 (1996)\FORMAT{.}{, {\tt
675:         \small cond-mat/9607180}.}{, {\tt \small cond-mat/9607180}.}}
676: 
677:     \bibitem{nb}\newblock{I.~Nemenman and W.~Bialek, {\it Advances in
678:       Neural Inf.\ Processing Systems} {\bf 13}, 287--293 (2001)\FORMAT{.}{,
679:       {\tt \small cond-mat/0009165}.}{, {\tt \small
680:         cond-mat/0009165}.}}
681: 
682:     \bibitem{maxent}\newblock{J.~Skilling, in {\it Maximum entropy and
683:       Bayesian methods,} J.~Skilling ed. (Kluwer Academic Publ.,
684:     Amsterdam, 1989), pp.~45--52.}
685: 
686:     \bibitem{ww}\newblock{D.~Wolpert and D.~Wolf, {\it Phys.~Rev.~E}
687:     {\bf 52}, 6841--6854 (1995)\FORMAT{.}{, {\tt \small
688:         comp-gas/9403001}.}{, {\tt \small comp-gas/9403001}.}}
689: 
690:     \bibitem{thesis}\newblock{I.~Nemenman, Ph.D. Thesis, Princeton,
691:     (2000), ch.~3, \FORMAT{\small
692:       http://arXiv.org/abs/physics/0009032} {\tt \small
693:       physics/0009032} {\tt \small physics/0009032}.}
694: 
695: \bibitem{laplace}\newblock{P.~de Laplace, marquis de, {\em Essai philosophique sur les probabilit\'es} (Courcier, Paris, 1814), trans.\ by F.~Truscott and F.~Emory, {\em A philosophical essay on probabilities}  (Dover, New York, 1951).}
696: 
697: \bibitem{hardy}\newblock{G.~Hardy, {\em Insurance Record} (1889), reprinted in {\em Trans.~Fac.~Actuaries} {\bf 8} (1920).}
698: 
699: \bibitem{lidstone}\newblock{G.~Lidstone, {\em Trans.~Fac.~Actuaries} {\bf 8}, 182--192 (1920).}%Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities.
700: 
701: \bibitem{jeffreys}\newblock{H.~Jeffreys, {\em Proc.~Roy.~Soc.~(London) A} {\bf 186}, 453--461 (1946).} %An invariant form for the prior probability in estimation problems.
702: 
703: \bibitem{kt}\newblock{R.~Krichevskii and V.~Trofimov, {\em IEEE Trans.\ Inf.\ Thy.} {\bf  27}, 199--207 (1981).}
704: 
705:     \bibitem{wst}\newblock{F.~Willems, Y.~Shtarkov, and T.~Tjalkens,
706:     {\it IEEE Trans.\ Inf.\ Thy.} {\bf 41}, 653--664 (1995).}
707: 
    \bibitem{sg}\newblock{T.~Sch\"urmann and P.~Grassberger, {\it Chaos}
709:     {\bf 6}, 414--427 (1996).}
710: 
711:     \bibitem{srrb}\newblock{S.~Strong, R.\ Koberle, R.\ de Ruyter van Steveninck, and W.\ Bialek, {\em Phys.\ Rev.\ Lett.}
712:     {\bf 80}, 197--200 (1998)\FORMAT{.}{, {\tt \small
713:         cond-mat/9603127}.}{, {\tt \small cond-mat/9603127}.}}
714: 
715:     \bibitem{pt}\newblock{S.~Panzeri and A.~Treves, {\em Network:
716:       Comput. in Neural Syst.} {\bf 7}, 87--107 (1996).}
717: 
\bibitem{mixt}\newblock{K.\ Sj\"olander, K.\ Karplus, M.\ Brown, R.\ Hughey, A.\ Krogh, I.~S.\ Mian, and D.\ Haussler,
719: {\em Computer Applications in the Biosciences (CABIOS)} {\bf 12}, 327--345 (1996).}
720: 
721:     \bibitem{ma}\newblock{S.~Ma, {\em J.\ Stat.\ Phys.} {\bf 26}, 221
722:     (1981).}
723: 
    \bibitem{bnt}\newblock{W.~Bialek, I.~Nemenman, and N.~Tishby, {\em Neural Comp.} {\bf 13}, 2409--2463 (2001)\FORMAT{.}{, {\tt
725:         \small physics/0007070}.}{, {\tt \small physics/0007070}.}}  }
726: 
727: \end{thebibliography}
728: 
729: \end{document}
730: