physics0306063/nbr.tex
1: \documentclass[pre,twocolumn,superscriptaddress,
2: showkeys,showpacs,amsmath,amsfonts,final]{revtex4}
3: 
4: \newif\ifpdf 
5: \ifx\pdfoutput\undefined
6: \pdffalse % we are not running PDFLaTeX
7: \else
8: \pdfoutput=1 % we are running PDFLaTeX
9: \pdfcompresslevel=9
10: \pdftrue
11: \fi
12: 
13: \ifpdf 
14: \usepackage[pdftex]{graphicx}
15: %\usepackage[pdftex]{hyperref}
16: %\usepackage{thumbpdf}
17: \else 
18: \usepackage{graphicx}
19: %\usepackage[dvips]{hyperref}  
20: \fi
21: \usepackage[cdot,squaren,textstyle]{SIunits}
22: \usepackage[sort&compress]{natbib}
23: 
24: 
25: 
26: %\hypersetup{pdfsubject={Preprint; Journal submission},   
27: %  pdftitle  = {Entropy and information in neural spike trains: Progress on the
28: %    sampling problem}
29: %  pdfauthor = {Ilya Nemenman, William Bialek, Rob de Ruyter van Steveninck}
30: %  pdfkeywords = {entropy, information, estimation, Bayesian statistics, neuroscience, neural code}
31: %  bookmarks ={true}, 
32: %}
33: 
34: %\include{aliases}
35: 
36: 
37: \begin{document}
38: 
39: \title{Entropy and information in neural spike trains: Progress on the
40:   sampling problem}
41: 
42: 
43: \author{Ilya Nemenman}\email{nemenman@kitp.ucsb.edu}
44: \affiliation{Kavli Institute for Theoretical Physics, University of
45:   California, Santa Barbara, California 93106}
46: 
47: \author{William Bialek} \email{wbialek@princeton.edu}
48: \affiliation{Departments of Physics and the Lewis--Sigler Institute
49:   for Integrative Genomics, Princeton University, Princeton, New
50:   Jersey 08544}
51: 
52: \author{Rob de Ruyter van Steveninck} \email{deruyter@indiana.edu}
53: \affiliation{Department of Molecular Biology, Princeton University,
54:   Princeton, New Jersey 08544} 
55: 
56: \altaffiliation[Current address: ]{Department of Physics, Indiana
57:   University, 727 E. Third St., Bloomington, Indiana 47405}
58: 
59: 
60: \begin{abstract}
61:   The major problem in information theoretic analysis of neural
62:   responses and other biological data is the reliable estimation of
63:   entropy--like quantities from small samples. We apply a recently
64:   introduced Bayesian entropy estimator to synthetic data inspired by
65:   experiments, and to real experimental spike trains. The estimator
66:   performs admirably even very deep in the undersampled regime, where
67:   other techniques fail.  This opens new possibilities for the
68:   information theoretic analysis of experiments, and may be of general
69:   interest as an example of learning from limited data.
70: \end{abstract}
71: 
72: \pacs{02.50.Tt, 89.70.+c, 87.19.La, 87.80.Tq}
73: 
74: \keywords{entropy, information, estimation, Bayesian statistics,
75:   neuroscience, neural code}
76: 
77: \preprint{NSF-KITP-03-43}
78: %\date{\today}
79: \maketitle
80: 
81: 
82: 
83: 
84: 
85: 
86: \section{Introduction}
87: \label{intro}
88: 
89: There has been considerable progress in using information theoretic
90: methods to sharpen and to answer many questions about the structure of
91: the neural code
92: \cite{BialekEtAl1991,TheunissenAndMiller1991,berry-97,strong-98,BorstAndTheunissen1999,brenner-00a,ReinagelAndReid2000,reich-01}.
93: Where classical experimental approaches have focused on the mean response
94: of neurons to relatively simple stimuli, information theoretic methods
95: have the power to quantify the responses to arbitrarily complex and
96: even fully natural stimuli \cite{spikes,LewenEtAl2001}, taking account
97: of both the mean response and its variability in a rigorous way,
98: independent of detailed modeling assumptions.  Measurements of entropy
99: and information in spike trains also allow us to test directly the
100: hypothesis that the neural code adapts to the distribution of sensory
101: inputs, optimizing the rate or efficiency of information transmission
102: \cite{barlow-59,barlow-61,laughlin-81,BrennerEtAl2000,FairhallEtAl2001}.
103: 
104: A problem with such measurements is that entropy and information
105: depend explicitly on the full distribution of neural responses, just a
106: limited sample of which is provided by experiments. In particular, we
107: need to know the distribution of responses to each stimulus in our
108: ensemble, and the number of samples from this distribution is limited
109: by the number of times the full set of stimuli can be repeated.  For
110: natural stimuli with long correlation times the time required to
111: present a useful ``full set of stimuli'' is long, limiting the number
112: of independent samples we can obtain from stable neural recordings.
113: Furthermore, natural stimuli generate neural responses of high timing
114: precision, and thus the space of meaningful responses itself is very
115: large \cite{mainen-95,rds-97,berry-97,LewenEtAl2001}.  These factors
116: make the sampling problem more serious as we move to more interesting
117: and natural stimuli.
118: 
119: A natural response to this problem is to give up the generality of a
120: completely model independent information theoretic approach.  Some
121: explicit help from models is required to regularize learning of the
122: underlying probability distributions from the experiments.  The
123: question is whether we can keep the generality of our analysis by
124: introducing the gentlest of regularizations for the abstract learning
125: problem, or whether we need stronger assumptions about the structure of the
126: neural code itself (for example, introducing a metric on the space of
127: responses \cite{victor-purpura-97,Victor2002}).
128: 
129: A classical problem suggests that we may succeed even with very weak
130: assumptions. Remember that one needs to have only $N\sim 23$ people in
131: a room before any two of them are reasonably likely to share the same
132: birthday. This is much less than $K=365$, the number of possible
133: birthdays.  Turning this around, we can estimate the number of
134: possible birthdays by polling $N$ people and counting how often we
135: find coincidences.  Once $N$ is large enough to have observed a few of
136: those, we can get a pretty good estimate of $K$.  This will happen
137: with a significant probability for $N \sim \sqrt{K}\ll K$.
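
To make the counting explicit, assume $N$ independent samples from $K$
equally likely outcomes (an idealization, since real birthdays are not
exactly uniform).  The probability of at least one coincidence and the
expected number of coincident pairs are then approximately
\[
P_{\rm coinc} \approx 1 - \exp\left[-\frac{N(N-1)}{2K}\right],
\qquad
\langle n_{\rm pairs} \rangle = \frac{N(N-1)}{2K} .
\]
For $N=23$ and $K=365$ this gives $N(N-1)/2K = 253/365 \approx 0.69$
and $P_{\rm coinc} \approx 0.50$.  Inverting the second relation, an
observed number of coincident pairs $n_{\rm pairs}$ suggests the rough
estimate $K \approx N(N-1)/(2 n_{\rm pairs})$.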
138: 
139: The idea of estimating entropy by counting coincidences was proposed
140: long ago by Ma \cite{ma-81} for physical systems in the microcanonical
141: ensemble where distributions should be uniform at fixed energy.
142: Clearly, if we could generalize the Ma idea to arbitrary
143: distributions, then we would be able to explore a much wider variety
144: of questions about information in the neural code.  Here we argue that
145: a simple and abstract Bayesian prior, introduced in Ref.~\cite{nsb},
146: comes close to the objective.
147: 
148: It is well known that, for $N<K$, there are no universally good
149: entropy estimators \cite{paninski-03,rubinfeld-02}.  Thus the main
150: question is: does a particular method work well only for (possibly
151: irrelevant) abstract model problems, or can it also be trusted for
152: natural data?  Hence our goal is neither to search for potential
153: theoretical limitations of the approach (these must exist and have
154: been found), nor to analyze the neural code (this will be left for the
155: future).  Instead we aim at convincingly showing that the method of
156: Ref.~\cite{nsb} can generate reliable estimates of entropy well into a
157: classically undersampled regime for an experimentally relevant case of
158: neurophysiological recordings.
159: 
160: 
161: 
162: \section{An estimation strategy}
163: \label{strategy}
164: Consider the problem of estimating the entropy $S$ of a probability
165: distribution $\{p_{\rm i}\}$, $ S = -\sum_{{\rm i}=1}^K p_{\rm
166:   i}\log_2 p_{\rm i}$,  where the index $\rm i$ runs over $K$
167: possibilities (e.g., $K$ possible neural responses). In an experiment
168: we observe that in $N$ examples each possibility $\rm i$ occurred
169: $n_{\rm i}$ times.  If $N \gg K$, we approximate the probabilities by
170: frequencies, $p_{\rm i} \approx f_{\rm i} \equiv n_{\rm i} /N$, and
171: construct a naive estimate of the entropy,
172: \begin{equation}
173: S_{\rm naive} = -\sum_{{\rm i}=1}^K f_{\rm i}\log_2 f_{\rm i} .
174: \end{equation}
175: This is also a maximum likelihood estimator, since the maximum
176: likelihood estimate of the probabilities is given by the frequencies.
177: Thus we will replace $S_{\rm naive}$ by $S^{\rm ML}$ in what follows.
178: 
179: It is well known that $S^{\rm ML}$ underestimates the entropy (cf.\ 
180: Ref.~\cite{paninski-03}).  With good sampling ($N \gg K$), classical
181: arguments due to Miller \cite{miller-55} show that the ML estimate
182: should be corrected by a universal term $(K-1)/2N$, and several groups
183: have used this correction in the analysis of neural data.  In
184: practice, many bins may have truly zero probability (for example, as a
185: result of refractoriness; see below), and the samples from the
186: distribution might not be completely independent.  Then $S^{\rm ML}$
187: still deviates from the correct answer by a term $\propto 1/N$, but
188: the coefficient is no longer known a priori. Under these conditions
189: one can heuristically verify and extrapolate the $1/N$ behavior from
190: subsets of the available data \cite{strong-98}.  Alternatively, still
191: agreeing on the $1/N$ correction, one can calculate its coefficient
192: (interpretable as an effective number of bins $K^*$) for some classes
193: of distributions
194: \cite{grassberger-88,panzeri-treves-96,grassberger-03}.  All of these
195: approaches, however, work only when the sampling errors are in some
196: sense a small perturbation.
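
As a rough illustration of when such a correction is perturbative,
assume that the Miller term is expressed in natural units, so that it
is divided by $\ln 2$ when the entropy is measured in bits (the values
of $K$ and $N$ below are chosen only for illustration).  Then
\[
\frac{K-1}{2N\ln 2} \approx
\left\{
\begin{array}{ll}
0.05~\mbox{bits}, & K=8,\; N=100,\\
2.4\times 10^{2}~\mbox{bits}, & K=2^{16},\; N=196 .
\end{array}
\right.
\]
The first value is a small fraction of $\log_2 K = 3$ bits.  The
second exceeds the largest possible entropy, $\log_2 K = 16$ bits, so
the correction can no longer be treated as a small perturbation.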
197: 
198: If we want to make progress outside of the asymptotically large $N$
199: regime we need an estimator that does not have a perturbative
200: expansion in $1/N$ with $S^{\rm ML}$ as the zeroth order term. The
201: estimator of Ref.~\cite{nsb} has just this property. Recall that
202: $S^{\rm ML}$ is a limiting case of Bayesian estimation with Dirichlet
203: priors.  Formally, we consider that the probability distributions
204: ${\bf p} \equiv \{p_{\rm i}\}$ are themselves drawn from a
205: distribution ${\mathcal P}_\beta ({\bf p})$ of the form
206: \begin{equation}
207: {\mathcal P}_\beta ({\bf p}) = \frac{1}{Z(\beta; K)}
208: \left[\prod_{{\rm i}=1}^K p_{\rm i}^{(\beta-1)}\right]
209:  \delta \Bigl( \sum_{{\rm i}=1}^K p_{\rm i} -1 \Bigr) ,
210: \end{equation}
211: where the delta function enforces normalization of distributions ${\bf
212:   p}$ and the partition function $Z(\beta ; K)$ normalizes the prior
213: ${\mathcal P}_\beta ({\bf p})$.  Maximum likelihood estimation is
214: Bayesian estimation with this prior in the limit $\beta \rightarrow
215: 0$, while the natural ``uniform'' prior is $\beta =1$.  The key
216: observation of Ref.~\cite{nsb} is that while these priors are quite
217: smooth on the space of ${\bf p}$, the distributions drawn at random
218: from ${\mathcal P}_\beta$ all have very similar entropies, with a
219: variance that vanishes as $K$ becomes large.  Fundamentally, this is
220: the origin of the sample size dependent bias in entropy estimation,
221: and one might thus hope to correct the bias at its source.  The goal
222: then is to construct a prior on the space of probability distributions
223: which generates a nearly uniform distribution of entropies.  Because
224: the entropy of distributions chosen from ${\mathcal P}_\beta$ is
225: sharply defined {\em and} monotonically dependent on the parameter
226: $\beta$, we can come close to this goal by an average over $\beta$,
227: \begin{equation}
228:   {\mathcal P}_{\rm NSB}  ({\bf p} )  \propto \int d\beta 
229:    \,\frac{d \bar S(\beta;K)}{d\beta}\,
230:   {\mathcal P}_\beta ({\bf p})\,.
231: \end{equation}
232: Here $\bar S(\beta;K)$ is the average entropy of distributions chosen
233: from ${\mathcal P}_\beta$ \cite{ww,nsb},
234: \begin{equation}
235:   \bar S(\beta;K)  \equiv \xi =
236:   \psi_0(K\beta+1) 
237:   -\psi_0(\beta+1) \, ,
238:   \label{Sap}
239: \end{equation}
240: where $\psi_m(x) = (d/dx)^{m+1} \log_2 \Gamma(x)$ are the polygamma
241: functions.
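
Two limits of Eq.~(\ref{Sap}) illustrate why the average over $\beta$
spans the full range of entropies.  With the base-2 convention used
here, $\psi_0(x) \approx \log_2 x$ for large $x$, so that
\[
\lim_{\beta\to 0} \xi = \psi_0(1) - \psi_0(1) = 0 ,
\qquad
\lim_{\beta\to\infty} \xi = \log_2 K ,
\]
and the change of variables $d\xi = (d\bar S/d\beta)\, d\beta$ turns
the average over $\beta$ into a flat measure on $0 \le \xi \le \log_2
K$.  The weight itself is $d\xi/d\beta = K\psi_1(K\beta+1) -
\psi_1(\beta+1)$, and its positivity corresponds to the monotonic
dependence of the entropy on $\beta$.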
242: 
243: Given this prior, we proceed in standard Bayesian fashion.  The
244: probability of observing the data ${\bf n}\equiv\{n_{\rm i}\}$ given
245: the distribution $\bf p$ is
246: \begin{equation}
247: P({\bf n} | {\bf p}) \propto \prod_{{\rm i}=1}^K p_{\rm i}^{n_{\rm i}} ,
248: \end{equation}
249: and then
250: \begin{eqnarray}
251:   P({\bf p} | {\bf n}) &=& 
252:   P({\bf n} | {\bf p})   {\mathcal P}_{\rm NSB}  ({\bf p} ){\bf \cdot}
253:   \frac{1}{P({\bf n})},\\
254:   P({\bf n}) &=& \int  d{\bf p} \,P({\bf n} | {\bf p})
255:   {\mathcal P}_{\rm NSB}  ({\bf p}
256:   ),\\
257:   \left(S^{\rm NSB}\right)^m &=& \int  d{\bf p} \,
258:   \left( -\sum_{{\rm i=1}}^K
259:     p_{\rm i} \log_2 p_{\rm i} \right)^m P({\bf p} | {\bf n}) .
260: \end{eqnarray}
261: Here we need to calculate the first two posterior moments of the
262: entropy, $m=1,2$, in order to obtain both the entropy estimate
263: and its variance.
264: 
265: The Dirichlet priors allow all the ($K$ dimensional) integrals over
266: $\bf p$ to be done analytically, so that the computation of $S^{\rm
267:   NSB}$ and of its posterior error reduces to just three numerical
268: one--dimensional integrals:
269: \begin{eqnarray}
270:   \left(S^{\rm NSB}\right)^m &=& \frac{\int d\xi\, 
271:     \rho(\xi,{\bf n}) \, S_\beta^m ({\bf n})}
272:   {\int d\xi\, \rho(\xi,{\bf n})}\,,\;\;\;\mbox{where}
273:   \label{Shat}
274:   \\
275:   \rho(\xi, {\bf n}) &=&
276:   \frac{\Gamma(K\beta(\xi))}{\Gamma(N+K\beta(\xi))}\,
277:   \prod_{i=1}^K \frac{\Gamma(n_i+\beta(\xi))}{\Gamma(\beta(\xi))}\,,
278:   \label{rho}
279: \end{eqnarray}
280: where the one--to--one relation between $\beta$ and $\xi$ is given by
281: Eq.~(\ref{Sap}), and $S_\beta^m({\bf n})$ is the expectation value of
282: the $m$-th entropy moment at fixed $\beta$; the exact expression for
283: $m=1,2$ is given in Ref.~\cite{ww}.
284: 
285: Details of the NSB method can be found in Refs.~\cite{nsb,nsb2}, and
286: the source code of the implementations in either Octave/C++ or plain
287: C++ is available from the authors.  We draw attention to several
288: points.
289: 
290: First, since the analysis is Bayesian, we obtain not only $S^{\rm NSB}$
291: but also its a posteriori standard deviation, $\delta S^{\rm
292:   NSB}$---an error bar on our estimate, see Eq.~(\ref{Shat}).
293: 
294: Second, for $N\to\infty$ and $N/K\to 0$ the estimator admits
295: asymptotic analysis. The important parameter is the number of
296: coincidences $\Delta = N-K_1$, where $K_1$ is the number of bins with
297: non-zero counts. If $\Delta/N\to {\rm const}<1$ (many coincidences),
298: then the standard saddle point evaluation of the integrals in
299: Eq.~(\ref{Shat}) is possible. Interestingly, the second derivative at
300: the saddle is $(\ln^2 2)\, \Delta$ to the leading order in $\Delta/N$.
301: The second asymptotic can be obtained for $\Delta\sim O(N^0)$ (few
302: coincidences).  Then
303: \begin{eqnarray}
304:   S^{\rm NSB} &\approx&\frac{C_\gamma}{\ln 2} - 1 + 2 \log_2 N
305:   -\psi_0(\Delta)\,,
306:   \label{Shat_res}
307:   \\ 
308:   \delta S^{\rm NSB} &\approx& \sqrt{\psi_1(\Delta)}\,,
309:   \label{dShat_res}
310: \end{eqnarray}
311: where $C_\gamma$ is Euler's constant. This is particularly
312: interesting since $S^{\rm NSB}$ happens to have a finite limit for
313: $K\to\infty$, thus allowing entropy estimation even for infinite (or
314: unknown) cardinalities.
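
For integer $\Delta$, and assuming the base-2 convention for $\psi_0$
introduced after Eq.~(\ref{Sap}), the identity $\psi_0(\Delta) =
(H_{\Delta-1} - C_\gamma)/\ln 2$, where $H_n = \sum_{k=1}^{n} 1/k$ are
the harmonic numbers and $H_0 = 0$, lets Eq.~(\ref{Shat_res}) be
rewritten as
\[
S^{\rm NSB} \approx 2\log_2 N - 1 + \frac{2 C_\gamma - H_{\Delta-1}}{\ln 2} .
\]
As a purely illustrative example (the numbers are hypothetical and not
taken from the data), $N=100$ samples with $\Delta = 5$ coincidences
give $2\log_2 N \approx 13.29$ and $H_4 = 25/12$, hence $S^{\rm NSB}
\approx 10.9$ bits, with no reference to $K$.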
315: 
316: Third, both of the above asymptotics show that the estimation
317: procedure relies on $\Delta$ to make its estimates; this is in the
318: spirit of Ref.~\cite{ma-81}.
319: 
320: 
321: Finally, $S^{\rm NSB}$ is unbiased if the distribution being learned
322: is typical in ${\mathcal P}_\beta({\bf p})$ for some $\beta$, that is,
323: its rank ordered (Zipf) plot is of the form
324: \begin{eqnarray}
325: q_i &\approx& 1 - \left[\frac{ \beta B(\beta, K\beta - \beta )  (K-1) \,i}
326: {K} \right] ^{1/(K\beta-\beta)}, 
327: \label{left}\\
328: q_i &\approx& \left[ \frac{ \beta B(\beta, K\beta - \beta )  (K-i+1)}
329: {K}\right]^{1/\beta},
330: \label{right}
331: \end{eqnarray}  
332: for $i/K\to 0$ and $i/K\to1$ respectively. If the Zipf plot has tails
333: that are too short (too long), then the estimator should over (under)
334: estimate.  While underestimation may be severe (though always strictly
335: smaller than that for $S^{\rm ML}$), overestimation is very mild, if
336: present at all, in the most interesting regime $1\ll\Delta\ll N$.
337: $S^{\rm NSB}$ is also unbiased for distributions that are typical in
338: some weighted combinations of ${\mathcal P}_\beta$ for different
339: $\beta$'s, in particular in ${\mathcal P}_{\rm NSB}$ itself. However,
340: the typical Zipf plots in this case are more complicated and will be
341: detailed elsewhere.
342: 
343: 
344: Before proceeding it is worth asking what we hope to accomplish. Any
345: reasonable estimator will converge to the right answer in the limit of
346: large $N$.  In particular, this is true for $S^{\rm NSB}$, which is a
347: {\em consistent} Bayesian estimator \footnote{In reference to Bayesian
348:   estimators, consistency usually means that, as $N$ grows, the
349:   posterior probability concentrates around unknown parameters of the
350:   true model that generated the data. For finite parameter models,
351:   such as the one considered here, only technical assumptions like
352:   positivity of the prior for all parameter values, soundness
353:   (different parameters always correspond to different distributions)
354:   \cite{clarke-barron-90}, and a few others are needed for
355:   consistency.  For nonparametric models, the situation is more
356:   complicated. There one also needs ultraviolet convergence of the
357:   functional integrals defined by the prior
358:   \cite{nemenman-00,bnt-01}.}.  The central problem of entropy
359: estimation is systematic bias, which will cause us to (perhaps
360: significantly) under- or overestimate the information content of spike
361: trains or the efficiency of the neural code. The bias, which vanishes
362: for $N\to\infty$, will manifest itself as a systematic drift in plots
363: of the estimated value versus the sample size. A successful estimator
364: would remove this bias as much as possible.  Ideally we thus hope to
365: see an estimate which for all values of $N$ is within its error bars
366: from the correct answer.  As $N$ increases the error bars should
367: narrow, with relatively little variation of the (mean) estimate
368: itself. When data are such that no reliable estimation is possible,
369: the estimator should remain uncertain, that is, the posterior variance
370: should be large. The main purpose of this paper is to show that the
371: NSB procedure applied to natural and nature--inspired synthetic
372: signals comes close to this ideal over a wide range of $N \ll K$, and
373: even $N \ll 2^S$.  The procedure thus is a viable tool for
374: experimental analysis.
375: 
376: 
377: 
378: \section{A model problem}
379: \label{modprob}
380: 
381: It is important to test our techniques on a problem which captures
382: some aspects of real world data yet is sufficiently well defined that
383: we know the correct answer.  We constructed synthetic spike trains
384: where intervals between successive spikes were independent and chosen
385: from an exponential distribution with a dead time or refractory period
386: of $g=1.8$ \milli\second; the mean spike rate was $r=0.26$
387: spikes/\milli\second.  This corresponds to a rate of $r_0 = r/(1 -
388: rg) = 0.49$ spikes/\milli\second\ for the part of the signal where
389: spiking is not prohibited by refractoriness.  These parameters are
390: typical of the high spike rate, noisy regions of the experiment
391: discussed below, which provide the greatest challenge for entropy
392: estimation.
393: 
394: Following the scheme outlined in Ref.~\cite{strong-98}, we examine the
395: spike train in windows of duration $T=15$ \milli\second\ and
396: discretize the response with a time resolution $\tau = 0.5$
397: \milli\second.  Because of the refractory period each bin of size
398: $\tau$ can contain at most one spike, and hence the neural response is
399: a binary word with $T/\tau = 30$ letters.  The space of responses has
400: $K = 2^{30}\approx 10^9$ possibilities.  Of course, most of these have
401: probability exactly zero because of refractoriness, and the number of
402: possible responses consistent with this constraint is bounded by $\sim
403: 2^{16} \approx 10^5$.  An approximation to the entropy of this
404: distribution is given by an appropriate correction to Eq.~(3.21) of
405: Ref.~\cite{spikes}, the entropy of a non--refractory Poisson process:
406: \begin{equation}
407: S =\frac{rT}{\ln 2} \left[-\ln \left(1-{\rm e}^{-r_0\tau} \right) 
408:   + \frac{r_0\tau\, {\rm e}^{-r_0\tau}}{1-{\rm e}^{-r_0\tau}}\right]=
409:  13.57~{\rm bits}.
410: \end{equation}
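As a numerical check of the quoted value, using only the parameters
given above and keeping the unrounded $r_0$: $rg = 0.26 \times 1.8 =
0.468$, so that $r_0 = 0.26/0.532 \approx 0.4887$
spikes/\milli\second\ and $r_0\tau \approx 0.2444$.  Then ${\rm
e}^{-r_0\tau} \approx 0.7832$ and $rT = 3.9$, which gives
\[
S \approx \frac{3.9}{\ln 2}
\left[ -\ln (0.2168) + \frac{0.2444 \times 0.7832}{0.2168} \right]
\approx 5.627 \times \left( 1.529 + 0.883 \right)
\approx 13.57~{\rm bits}.
\]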
411: 
412: 
413: \begin{figure}
414:   \ifpdf \includegraphics[width=3.2in]{refractory} \else
415:   \includegraphics[height=3.2in, angle=270]{refractory} \fi
416:   \caption{\label{fig:artif}Entropy estimation for a model
417:     problem. Notice that the estimator reaches the true value within
418:     the error bars as soon as $N^2 \sim 2^S$, at which point
419:     coincidences start to occur with high probability. Slight
420:     overestimation for $N>10^3$ is expected (see text) since this
421:     distribution is atypical in ${\mathcal P}_{\rm NSB}$.}
422: \end{figure}
423: In Fig.~\ref{fig:artif} we show the results of entropy estimation for
424: this model problem. As expected, the naive estimate $S^{\rm ML}$
425: reaches its asymptotic behavior only when $N > 2^S$, so that the $1/N$
426: extrapolation becomes successful at $N\sim10^4$ (the ``ML fit'' line
427: on the plot).  In contrast, we see that $S^{\rm NSB}$ gives the right
428: answer within errors at $N \sim 100$.  We can improve convergence by
429: providing the estimator with the ``hint'' that the number of possible
430: responses $K$ is much smaller than the upper limit of $2^{30}$, but
431: even without this hint we have excellent entropy estimates already at
432: $N \sim (2^S)^{1/2}$.  This is in accord with expectations from Ma's
433: analysis of (microcanonical) entropy estimation \cite{ma-81}. However,
434: here we achieve these results for a nonuniform distribution.
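
For orientation, with $S = 13.57$ bits the two sampling scales
discussed here are
\[
2^{S} \approx 1.2\times 10^{4} ,
\qquad
(2^{S})^{1/2} = 2^{S/2} \approx 1.1 \times 10^{2} ,
\]
consistent with the sample sizes $N \sim 10^4$ and $N \sim 100$ at
which the ML extrapolation and the NSB estimate, respectively, become
reliable.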
435: 
436: 
437: 
438: \section{Analyzing real data}
439: 
440: \begin{figure}
441:   \ifpdf \includegraphics[width=3in]{sfly022_vel_raster}
442:   \else \includegraphics[width=3in,angle=0]{sfly022_vel_raster} \fi
443:   \caption{\label{fig:flyexp} Data from a fly motion sensitive neuron
444:     in a natural stimulus setting. Top: a 500 \milli\second\ section
445:     of a 10 \second\ angular velocity trace that was repeated 196
446:     times.  Bottom: raster plot showing the
447:     response to 30 consecutive trials; each dot marks the occurrence of a spike.}
448: \end{figure}
449: 
450: For a test on real neurophysiological data, we use recordings from a
451: wide field motion sensitive neuron (H1) in the visual system of the
452: blowfly {\em Calliphora vicina}. While action potentials from H1 were
453: recorded, the fly rotated on a stepper motor outside among the bushes,
454: with time dependent angular velocity representative of natural flight.
455: Figure~\ref{fig:flyexp} presents a sample of raw data from such an
456: experiment (see~Ref.~\cite{LewenEtAl2001} for details).
457: 
458: 
459: Following Ref.~\cite{strong-98}, the information content of a spike
460: train is the difference between its total entropy and the entropy of
461: neural responses to repeated presentations of the same stimulus
462: \footnote{It may happen that information is a small difference between
463:   two large entropies. Then, due to statistical errors, methods that
464:   estimate information directly will have an advantage over NSB, which
465:   estimates entropies first. In our case, this is not a problem since
466:   the information is roughly half of the total available entropy
467:   \cite{strong-98}.}. The latter is substantially more difficult to
468: estimate. It is called the noise entropy $S_n$, since it measures
469: response variations that are uncorrelated with the sensory input. The
470: noise in neurons depends on the stimulus itself---there are, for
471: example, stimuli which generate with certainty zero spikes in a given
472: window of time---and so we write $S_{n|t}$ to mark the dependence on
473: the time $t$ at which we take a slice through the raster of responses.
474: In this experiment the full stimulus was repeated 196 times, which
475: actually is a relatively large number by the standards of
476: neurophysiology.  The fly makes behavioral decisions based on $\sim
477: 10$--$30~\milli\second$ windows of its visual input
478: \cite{LandAndCollett1974}, and under natural conditions the time
479: resolution of the neural responses is of order 1 \milli\second\ or
480: even less \cite{LewenEtAl2001}, so that a meaningful analysis of
481: neural responses must deal with binary words of length $10$--$30$ or
482: more. Refractoriness limits the number of these words which can occur
483: with nonzero probability (as in our model problem), but nonetheless we
484: easily reach the limit where the number of samples is substantially
485: smaller than the number of possible responses.
486: 
487: \begin{figure}[t]
488:    \ifpdf
489:    \includegraphics[width=2.7in]{T8_nsb_vs_ml_2}
490:    \includegraphics[width=2.7in]{T15_nsb_vs_ml_2}
491:    \else
492:    \includegraphics[height=2.7in,angle=270]{T8_nsb_vs_ml_2}
493:    \includegraphics[height=2.7in,angle=270]{T15_nsb_vs_ml_2}
494:    \fi
495:    \caption{\label{fig:nsbml}Slice entropy vs.\ sample size.  Dashed
496:      line on both plots is drawn at the value of $\left.S^{\rm
497:          NSB}\right|_{N=N_{\rm max}}$ to show that the estimator is
498:      stable within its error bars even for very low $N$. Triangle
499:      corresponds to the value of $S^{\rm ML}$ extrapolated to
500:      $N\to\infty$ from the four largest values of $N$. First and second
501:      panels show examples of word lengths for which $S^{\rm ML}$ can
502:      or cannot be reliably extrapolated. $S^{\rm NSB}$ is stable in
503:      both cases, shows no $N$ dependent drift, and agrees with $S^{\rm
504:        ML}$ where the latter is reliable.}
505: \end{figure}
506: Let us start by looking at a single moment in time,
507: $t=1800~\milli\second$ from the start of the repeated stimulus, as in
508: Fig.~\ref{fig:flyexp}.  If we consider a window of duration $T =
509: 16~\milli\second$ at time resolution $\tau = 2~\milli\second$
510: \footnote{For our and many other neural systems, the spike timing can
511:   be more accurate than the refractory period of roughly 2
512:   \milli\second\ \cite{brenner-00a,rob-01,LewenEtAl2001}.  For the
513:   current amount of data, discretization of $\tau\ll1~\milli\second$
514:   and large enough $T$ will push the limits of all estimation methods,
515:   including ours, that do not make explicit assumptions about
516:   properties of the spike trains. Thus, to have enough statistics to
517:   convincingly show validity of the NSB approach, in this paper we
518:   choose $\tau =0.75\dots2~\milli\second$, which is still much shorter
519:   than other methods can handle. We leave open the possibility that
520:   more information is contained in timing precision at finer scales.},
521: we obtain the entropy estimates shown in the first panel of
522: Fig.~\ref{fig:nsbml}.  Notice that in this case we actually have a
523: total number of samples which is comparable to or larger than
524: $2^{S_{n|t}}$, and so the maximum likelihood estimate of the entropy
525: is converging with the expected $1/N$ behavior.  The NSB estimate
526: agrees with this extrapolation.  The crucial result is that the NSB
527: estimate is correct within error bars across the whole range of $N$;
528: there is a slight variation in the mean estimate, but the main effect
529: as we add samples is that the error bars narrow around the correct
530: answer.  In this case our estimation procedure has removed essentially
531: all of the sample size dependent bias.
532: 
533: 
534:  
535: 
536: As we open our window to $T = 30~\milli\second$, the number of
537: possible responses (even considering refractoriness) is vastly larger
538: than the number of samples. As we see in the second panel of
539: Fig.~\ref{fig:nsbml}, any attempt to extrapolate the ML estimate of
540: entropy now requires some wishful thinking.  Nonetheless, in parallel
541: with our results for the model problem, we find that the NSB estimate
542: is stable within error bars across the full range of available $N$.
543: 
544: 
545: 
546: \begin{figure}
547:   \centerline{\includegraphics[width=2.9in,height=2.3in]{cond_entr_75}}
548:   \caption{\label{fig:conds}Distribution of the normalized entropy
549:     error conditional on $S^{\rm NSB}(N_{\rm max})$ for $N=75$ and
550:     $\tau=0.75~\milli\second$. Darker patches correspond to higher
551:     probability.  The band in the right part of the plot is the normal
552:     distribution around zero with the standard deviation of 1 (the
553:     standard deviation of plotted conditional distributions averaged
554:     over $S^{\rm NSB}$ is about 0.7, which indicates a non--Gaussian
555:     form of the posterior for a small number of coincidences
556:     \cite{nsb2}). For values of $S^{\rm NSB}$ up to about 12 bits the
557:     estimator performs remarkably well. For yet larger entropies,
558:     where the number of coincidences is just a few, the discrete nature
559:     of the estimated values is evident, and this puts a bound on
560:     reliability of $S^{\rm NSB}$.}
561: \end{figure}
562: 
563: For small $T$ we can compare the results of our Bayesian estimation
564: with an extrapolation of the ML estimate; each moment in time relative
565: to the repeated stimulus provides an example.  We have found that the
566: results in the first panel of Fig.~\ref{fig:nsbml} are typical: in the
567: regime where extrapolation of the ML estimator is reliable, our
568: estimator agrees within error bars over a broad range of sample sizes.
569: More precisely, if we take the extrapolated ML estimate as the correct
570: answer, and measure the deviation of $S^{\rm NSB}$ from this answer in
571: units of the predicted error bar, we find that the mean square value
572: of this normalized error is of order one.  This is as expected if our
573: estimation errors are random rather than systematic.
574:  
575: 
576: For larger $T$ we do not have a calibration against the (extrapolated)
577: $S^{\rm ML}$, but we can still ask if the estimator is stable, within
578: error bars, over a wide range of $N$.  To check this stability we
579: treat the value of $S^{\rm NSB}$ at $N=N_{\rm max}=196$ as our best
580: guess for the entropy and compute the normalized deviation of the
581: estimates at smaller values of $N$ from this guess,
582: $\varepsilon=\left[S^{\rm NSB}(N) - S^{\rm NSB}(N_{\rm
583:     max})\right]/\delta S^{\rm NSB}(N)$.  Again, each moment in time
584: is an example.  Figure~\ref{fig:conds} shows the distribution of these
585: normalized deviations conditional on the entropy estimate with $N=75$;
586: this analysis is done for $\tau = 0.75~\milli\second$, with $T$ in the
587: range between $1.5~\milli\second$ and $22.5~\milli\second$.  Since the
588: different time slices span a range of entropies, over some range we
589: have $N > 2^S$, and in this regime the entropy estimate must be
590: accurate (as in the analysis of small $T$ above).  Throughout this
591: range, the normalized deviations fall in a narrow band with mean close
592: to zero and a variance of order one, as expected if the only
593: variations with the sample size were random.  Remarkably this pattern
594: continues for larger entropies, $S > \log_2 N=6.2$ bits, demonstrating
595: that our estimator is stable even deep into the undersampled regime.
596: This is consistent with the results obtained in our model problem, but
597: here we find the same answer for the real data.
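
The two sampling scales for $N=75$ follow directly:
\[
\log_2 75 \approx 6.2~\mbox{bits} ,
\qquad
2\log_2 75 \approx 12.5~\mbox{bits} .
\]
The first is the nominal limit $N \sim 2^S$.  The second is the
entropy at which $N^2 \sim 2^S$, that is, the coincidence counting
scale of Ref.~\cite{ma-81}, and it lies close to the value of about 12
bits beyond which Fig.~\ref{fig:conds} shows reduced reliability.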
598: 
599: Note that Fig.~\ref{fig:conds} illustrates results with $N$ less than
600: one half the total number of samples, so we really are testing for
601: stability over a large range in $N$.  This emphasizes that our
602: estimation procedure moves smoothly from the well sampled into the
603: undersampled regime without accumulating any clear signs of systematic
604: error.  The procedure collapses only when the entropy is so large that
605: the probability of observing the same response more than once (a
606: coincidence) becomes negligible.
607: 
608: \section{Discussion}
609: 
610: %% We finish with the following observations. While one might expect that
611: %% a meaningful estimate of entropy $S$ requires many more than $2^S$
612: %% samples, we know from Ma \cite{ma-81} that for uniform distributions
613: %% of unknown size (as in the microcanonical ensemble) we can make
614: %% reliable estimates of the entropy when $N^2 \sim 2^S$.  This is
615: %% related to the fact that we will encounter two people who have the
616: %% same birthday once we have chosen at random a group of $N \sim 23 <<
617: %% 365$ people---conversely, testing for coincidences tells us something
618: %% about the entropy of the distribution of birthdays even when we are
619: %% far from having seen all the possibilities.  The challenge is to
620: %% convert these ideas about coincidences into a reliable estimator for
621: %% the entropy of nonuniform distributions.
622: 
623: The estimator we have explored here is constructed from a prior that
624: has a nearly uniform distribution of entropies.  It is plausible that
625: such a uniform prior would largely remove the sample size dependent
626: bias in entropy estimation, but it is crucial to test this
627: experimentally. In particular, there are infinitely many priors which
628: are approximately (and even exactly) uniform in entropy, and it is not
629: clear which of them will allow successful estimation in real world
630: problems.  We have found that the NSB prior almost completely removed
631: the bias in the model problem (Fig.~\ref{fig:artif}). Further, for
632: real data in a regime where undersampling can be beaten down by data
633: the bias is removed to yield agreement with the extrapolated ML
634: estimator even at very small sample sizes (Fig.~\ref{fig:nsbml}, first
635: panel).  Finally and most crucially, the NSB estimation procedure
636: continues to perform smoothly and stably past the nominal sampling
637: limit of $N \sim 2^S$, all the way to the Ma cutoff $N^2 \sim 2^S$
638: (Fig.~\ref{fig:conds}).  This opens the opportunity for rigorous
639: analysis of entropy and information in spike trains under a much wider
640: set of experimental conditions.
641: 
642: \acknowledgments
643: 
644: We thank J Miller for important discussions, GD Lewen for his help
645: with the experiments, which were supported by the NEC Research
646: Institute, and the organizers of the NIC'03 workshop for providing a
647: venue for a preliminary presentation of this work.  IN was supported
648: by NSF Grant No.\ PHY99-07949 to the Kavli Institute for Theoretical
649: Physics.  IN is also very thankful to the developers of the following
650: Open Source software: GNU Emacs, GNU Octave, GNUplot, and te\TeX.
651: 
652: 
653: 
654: 
655: \bibliographystyle{unsrtnat} {\small\bibliography{flies}}
656: 
657: \end{document}
658: 
659: