0306:physics0306063/nbr.tex

1: \documentclass[pre,twocolumn,superscriptaddress,

2: showkeys,showpacs,amsmath,amsfonts,final]{revtex4}

3:

4: \newif\ifpdf

5: \ifx\pdfoutput\undefined

6: \pdffalse % we are not running PDFLaTeX

7: \else

8: \pdfoutput=1 % we are running PDFLaTeX

9: \pdfcompresslevel=9

10: \pdftrue

11: \fi

12:

13: \ifpdf

14: \usepackage[pdftex]{graphicx}

15: %\usepackage[pdftex]{hyperref}

16: %\usepackage{thumbpdf}

17: \else

18: \usepackage{graphicx}

19: %\usepackage[dvips]{hyperref}

20: \fi

21: \usepackage[cdot,squaren,textstyle]{SIunits}

22: \usepackage[sort&compress]{natbib}

23:

24:

25:

26: %\hypersetup{pdfsubject={Preprint; Journal submission},

27: %  pdftitle  = {Entropy and information in neural spike trains: Progress on the

28: %    sampling problem}

29: %  pdfauthor = {Ilya Nemenman, William Bialek, Rob de Ruyter van Steveninck}

30: %  pdfkeywords = {entropy, information, estimation, Bayesian statistics, neuroscience, neural code}

31: %  bookmarks ={true},

32: %}

33:

34: %\include{aliases}

35:

36:

37: \begin{document}

38:

39: \title{Entropy and information in neural spike trains: Progress on the

40:   sampling problem}

41:

42:

43: \author{Ilya Nemenman}\email{nemenman@kitp.ucsb.edu}

44: \affiliation{Kavli Institute for Theoretical Physics, University of

45:   California, Santa Barbara, California 93106}

46:

47: \author{William Bialek} \email{wbialek@princeton.edu}

48: \affiliation{Departments of Physics and the Lewis--Sigler Institute

49:   for Integrative Genomics, Princeton University, Princeton, New

50:   Jersey 08544}

51:

52: \author{Rob de Ruyter van Steveninck} \email{deruyter@indiana.edu}

53: \affiliation{Department of Molecular Biology, Princeton University,

54:   Princeton, New Jersey 08544}

55:

56: \altaffiliation[Current address: ]{Department of Physics, Indiana

57:   University, 727 E. Third St., Bloomington, Indiana 47405}

58:

59:

60: \begin{abstract}

61:   The major problem in information theoretic analysis of neural

62:   responses and other biological data is the reliable estimation of

63:   entropy--like quantities from small samples. We apply a recently

64:   introduced Bayesian entropy estimator to synthetic data inspired by

65:   experiments, and to real experimental spike trains. The estimator

66:   performs admirably even very deep in the undersampled regime, where

67:   other techniques fail.  This opens new possibilities for the

68:   information theoretic analysis of experiments, and may be of general

69:   interest as an example of learning from limited data.

70: \end{abstract}

71:

72: \pacs{02.50.Tt, 89.70.+c, 87.19.La, 87.80.Tq}

73:

74: \keywords{entropy, information, estimation, Bayesian statistics,

75:   neuroscience, neural code}

76:

77: \preprint{NSF-KITP-03-43}

78: %\date{\today}

79: \maketitle

80:

81:

82:

83:

84:

85:

86: \section{Introduction}

87: \label{intro}

88:

89: There has been considerable progress in using information theoretic

90: methods to sharpen and to answer many questions about the structure of

91: the neural code

92: \cite{BialekEtAl1991,TheunissenAndMiller1991,berry-97,strong-98,BorstAndTheunissen1999,brenner-00a,ReinagelAndReid2000,reich-01}.

93: Where classical experimental approaches have focused on mean response

94: of neurons to relatively simple stimuli, information theoretic methods

95: have the power to quantify the responses to arbitrarily complex and

96: even fully natural stimuli \cite{spikes,LewenEtAl2001}, taking account

97: of both the mean response and its variability in a rigorous way,

98: independent of detailed modeling assumptions.  Measurements of entropy

99: and information in spike trains also allow us to test directly the

100: hypothesis that the neural code adapts to the distribution of sensory

101: inputs, optimizing the rate or efficiency of information transmission

102: \cite{barlow-59,barlow-61,laughlin-81,BrennerEtAl2000,FairhallEtAl2001}.

103:

104: A problem with such measurements is that entropy and information

105: depend explicitly on the full distribution of neural responses, just a

106: limited sample of which is provided by experiments. In particular, we

107: need to know the distribution of responses to each stimulus in our

108: ensemble, and the number of samples from this distribution is limited

109: by the number of times the full set of stimuli can be repeated.  For

110: natural stimuli with long correlation times the time required to

111: present a useful ``full set of stimuli'' is long, limiting the number

112: of independent samples we can obtain from stable neural recordings.

113: Furthermore, natural stimuli generate neural responses of high timing

114: precision, and thus the space of meaningful responses itself is very

115: large \cite{mainen-95,rds-97,berry-97,LewenEtAl2001}.  These factors

116: make the sampling problem more serious as we move to more interesting

117: and natural stimuli.

118:

119: A natural response to this problem is to give up the generality of a

120: completely model independent information theoretic approach.  Some

121: explicit help from models is required to regularize learning of the

122: underlying probability distributions from the experiments.  The

123: question is whether we can keep the generality of our analysis by

124: introducing the gentlest of regularizations for the abstract learning

125: problem, or whether we need stronger assumptions about the structure of the

126: neural code itself (for example, introducing a metric on the space of

127: responses \cite{victor-purpura-97,Victor2002}).

128:

129: A classical problem suggests that we may succeed even with very weak

130: assumptions. Remember that one needs to have only $N\sim 23$ people in

131: a room before any two of them are reasonably likely to share the same

132: birthday. This is much less than $K=365$, the number of possible

133: birthdays.  Turning this around, we can estimate the number of

134: possible birthdays by polling $N$ people and counting how often we

135: find coincidences.  Once $N$ is large enough to have observed a few of

136: those, we can get a pretty good estimate of $K$.  This will happen

137: with a significant probability for $N \sim \sqrt{K}\ll K$.

138:
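To make the counting argument concrete, the following is a rough sketch
that assumes $K$ equally likely outcomes and $N \ll K$; the symbols
$P_0$, $n_{\rm c}$ (the number of coinciding pairs), $\hat K$, and
$\hat S$ are introduced here only for illustration.  The probability
that $N$ independent draws contain no coincidence is
\[
P_0 = \prod_{j=1}^{N-1}\left(1-\frac{j}{K}\right)
\approx \exp\left[-\frac{N(N-1)}{2K}\right] ,
\]
so that for $N=23$ and $K=365$ the exponent is $253/365\approx 0.69$
and $P_0\approx 1/2$.  The expected number of coinciding pairs is
$N(N-1)/(2K)$, and inverting this relation suggests
\[
\hat K \approx \frac{N(N-1)}{2 n_{\rm c}} , \qquad
\hat S \approx \log_2 \hat K ,
\]
which becomes usable once $n_{\rm c}$ is of order one, that is, for
$N\sim\sqrt{K}$.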

139: The idea of estimating entropy by counting coincidences was proposed

140: long ago by Ma \cite{ma-81} for physical systems in the microcanonical

141: ensemble where distributions should be uniform at fixed energy.

142: Clearly, if we could generalize the Ma idea to arbitrary

143: distributions, then we would be able to explore a much wider variety

144: of questions about information in the neural code.  Here we argue that

145: a simple and abstract Bayesian prior, introduced in Ref.~\cite{nsb},

146: comes close to the objective.

147:

148: It is well known that, for $N<K$, there are no universally good

149: entropy estimators \cite{paninski-03,rubinfeld-02}.  Thus the main

150: question is: does a particular method work well only for (possibly

151: irrelevant) abstract model problems, or can it also be trusted for

152: natural data?  Hence our goal is neither to search for potential

153: theoretical limitations of the approach (these must exist and have

154: been found), nor to analyze the neural code (this will be left for the

155: future).  Instead we aim at convincingly showing that the method of

156: Ref.~\cite{nsb} can generate reliable estimates of entropy well into a

157: classically undersampled regime for an experimentally relevant case of

158: neurophysiological recordings.

159:

160:

161:

162: \section{An estimation strategy}

163: \label{strategy}

164: Consider the problem of estimating the entropy $S$ of a probability

165: distribution $\{p_{\rm i}\}$, $ S = -\sum_{{\rm i}=1}^K p_{\rm

166:   i}\log_2 p_{\rm i}$,  where the index $\rm i$ runs over $K$

167: possibilities (e.g., $K$ possible neural responses). In an experiment

168: we observe that in $N$ examples each possibility $\rm i$ occurred

169: $n_{\rm i}$ times.  If $N \gg K$, we approximate the probabilities by

170: frequencies, $p_{\rm i} \approx f_{\rm i} \equiv n_{\rm i} /N$, and

171: construct a naive estimate of the entropy,

172: \begin{equation}

173: S_{\rm naive} = -\sum_{{\rm i}=1}^K f_{\rm i}\log_2 f_{\rm i} .

174: \end{equation}

175: This is also a maximum likelihood estimator, since the maximum

176: likelihood estimate of the probabilities is given by the frequencies.

177: Thus we will replace $S_{\rm naive}$ by $S^{\rm ML}$ in what follows.

178:

179: It is well known that $S^{\rm ML}$ underestimates the entropy (cf.\

180: Ref.~\cite{paninski-03}).  With good sampling ($N \gg K$), classical

181: arguments due to Miller \cite{miller-55} show that the ML estimate

182: should be corrected by a universal term $(K-1)/2N$, and several groups

183: have used this correction in the analysis of neural data.  In

184: practice, many bins may have truly zero probability (for example, as a

185: result of refractoriness; see below), and the samples from the

186: distribution might not be completely independent.  Then $S^{\rm ML}$

187: still deviates from the correct answer by a term $\propto 1/N$, but

188: the coefficient is no longer known a priori. Under these conditions

189: one can heuristically verify and extrapolate the $1/N$ behavior from

190: subsets of the available data \cite{strong-98}.  Alternatively, still

191: agreeing on the $1/N$ correction, one can calculate its coefficient

192: (interpretable as an effective number of bins $K^*$) for some classes

193: of distributions

194: \cite{grassberger-88,panzeri-treves-96,grassberger-03}.  All of these

195: approaches, however, work only when the sampling errors are in some

196: sense a small perturbation.

197:
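As an illustration of the extrapolation idea, one form that such a fit
can take is sketched below; the coefficients $S_\infty$, $a$, $b$ and
the subsample size $N'$ are notation assumed here, not taken from the
cited works.
\[
S^{\rm ML}(N') \approx S_\infty + \frac{a}{N'} + \frac{b}{N'^2} ,
\]
where $S^{\rm ML}(N')$ is evaluated on random subsets of $N'\le N$
samples and $S_\infty$ is taken as the entropy estimate.  In this
language the Miller correction fixes $a = -(K-1)/2$ when the entropy is
measured in nats (in bits the coefficient carries an extra factor
$1/\ln 2$), while approaches based on an effective number of bins
replace $K$ by $K^*$.  Such a fit is meaningful only if the $b/N'^2$
term remains a small correction over the range of $N'$ used.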

198: If we want to make progress outside of the asymptotically large $N$

199: regime we need an estimator that does not have a perturbative

200: expansion in $1/N$ with $S^{\rm ML}$ as the zeroth order term. The

201: estimator of Ref.~\cite{nsb} has just this property. Recall that

202: $S^{\rm ML}$ is a limiting case of Bayesian estimation with Dirichlet

203: priors.  Formally, we consider that the probability distributions

204: ${\bf p} \equiv \{p_{\rm i}\}$ are themselves drawn from a

205: distribution ${\mathcal P}_\beta ({\bf p})$ of the form

206: \begin{equation}

207: {\mathcal P}_\beta ({\bf p}) = \frac{1}{Z(\beta; K)}

208: \left[\prod_{{\rm i}=1}^K p_{\rm i}^{(\beta-1)}\right]

209:  \delta \Bigl( \sum_{{\rm i}=1}^K p_{\rm i} -1 \Bigr) ,

210: \end{equation}

211: where the delta function enforces normalization of distributions ${\bf

212:   p}$ and the partition function $Z(\beta ; K)$ normalizes the prior

213: ${\mathcal P}_\beta ({\bf p})$.  Maximum likelihood estimation is

214: Bayesian estimation with this prior in the limit $\beta \rightarrow

215: 0$, while the natural ``uniform'' prior is $\beta =1$.  The key

216: observation of Ref.~\cite{nsb} is that while these priors are quite

217: smooth on the space of ${\bf p}$, the distributions drawn at random

218: from ${\mathcal P}_\beta$ all have very similar entropies, with a

219: variance that vanishes as $K$ becomes large.  Fundamentally, this is

220: the origin of the sample size dependent bias in entropy estimation,

221: and one might thus hope to correct the bias at its source.  The goal

222: then is to construct a prior on the space of probability distributions

223: which generates a nearly uniform distribution of entropies.  Because

224: the entropy of distributions chosen from ${\mathcal P}_\beta$ is

225: sharply defined {\em and} monotonically dependent on the parameter

226: $\beta$, we can come close to this goal by an average over $\beta$,

227: \begin{equation}

228:   {\mathcal P}_{\rm NSB}  ({\bf p} )  \propto \int d\beta

229:    \,\frac{d \bar S(\beta;K)}{d\beta}\,

230:   {\mathcal P}_\beta ({\bf p})\,.

231: \end{equation}

232: Here $\bar S(\beta;K)$ is the average entropy of distributions chosen

233: from ${\mathcal P}_\beta$ \cite{ww,nsb},

234: \begin{equation}

235:   \bar S(\beta;K)  \equiv \xi =

236:   \psi_0(K\beta+1)

237:   -\psi_0(\beta+1) \, ,

238:   \label{Sap}

239: \end{equation}

240: where $\psi_m(x) = (d/dx)^{m+1} \log_2 \Gamma(x)$ are the polygamma

241: functions.

242:
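A few limiting values of Eq.~(\ref{Sap}) may help to show why an average
over $\beta$ is needed.  They follow from standard properties of the
digamma function, with $\psi_0$ taken in base 2 as defined above, and
are given only as a sketch:
\begin{eqnarray*}
  \bar S(\beta;K) &\to& 0 \quad (\beta\to 0) , \\
  \bar S(1;K) &=& \frac{1}{\ln 2}\sum_{j=2}^{K}\frac{1}{j}
  \approx \log_2 K - 0.61~{\rm bits} \quad (K\gg 1) , \\
  \bar S(\beta;K) &\to& \log_2 K \quad (\beta\to\infty) .
\end{eqnarray*}
Thus the ``uniform'' prior $\beta=1$ already commits to entropies within
about a bit of the maximum $\log_2 K$, while small $\beta$ favors low
entropy distributions.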

243: Given this prior, we proceed in standard Bayesian fashion.  The

244: probability of observing the data ${\bf n}\equiv\{n_{\rm i}\}$ given

245: the distribution $\bf p$ is

246: \begin{equation}

247: P({\bf n} | {\bf p}) \propto \prod_{{\rm i}=1}^K p_{\rm i}^{n_{\rm i}} ,

248: \end{equation}

249: and then

250: \begin{eqnarray}

251:   P({\bf p} | {\bf n}) &=&

252:   P({\bf n} | {\bf p})   {\mathcal P}_{\rm NSB}  ({\bf p} ){\bf \cdot}

253:   \frac{1}{P({\bf n})},\\

254:   P({\bf n}) &=& \int  d{\bf p} \,P({\bf n} | {\bf p})

255:   {\mathcal P}_{\rm NSB}  ({\bf p}

256:   ),\\

257:   \left(S^{\rm NSB}\right)^m &=& \int  d{\bf p} \,

258:   \left( -\sum_{{\rm i}=1}^K

259:     p_{\rm i} \log_2 p_{\rm i} \right)^m P({\bf p} | {\bf n}) .

260: \end{eqnarray}

261: Here we need to calculate the first two posterior moments of the

262: entropy, $m=1,2$, to obtain both the entropy estimate

263: and its variance.

264:

265: The Dirichlet priors allow all the ($K$ dimensional) integrals over

266: $\bf p$ to be done analytically, so that the computation of $S^{\rm

267:   NSB}$ and of its posterior error reduces to just three numerical

268: one--dimensional integrals:

269: \begin{eqnarray}

270:   \left(S^{\rm NSB}\right)^m &=& \frac{\int d\xi\,

271:     \rho(\xi,{\bf n}) \, S_\beta^m ({\bf n})}

272:   {\int d\xi\, \rho(\xi,{\bf n})}\,,\;\;\;\mbox{where}

273:   \label{Shat}

274:   \\

275:   \rho(\xi, {\bf n}) &=&

276:   \frac{\Gamma(K\beta(\xi))}{\Gamma(N+K\beta(\xi))}\,

277:   \prod_{{\rm i}=1}^K \frac{\Gamma(n_{\rm i}+\beta(\xi))}{\Gamma(\beta(\xi))}\,,

278:   \label{rho}

279: \end{eqnarray}

280: where the one--to--one relation between $\beta$ and $\xi$ is given by

281: Eq.~(\ref{Sap}), and $S_\beta^m({\bf n})$ is the expectation value of

282: the $m$-th entropy moment at fixed $\beta$; the exact expression for

283: $m=1,2$ is given in Ref.~\cite{ww}.

284:

285: Details of the NSB method can be found in Refs.~\cite{nsb,nsb2}, and

286: the source code of the implementations in either Octave/C++ or plain

287: C++ is available from the authors.  We draw attention to several

288: points.

289:

290: First, since the analysis is Bayesian, we obtain not only $S^{\rm NSB}$

291: but also its a posteriori standard deviation, $\delta S^{\rm

292:   NSB}$---an error bar on our estimate, see Eq.~(\ref{Shat}).

293:

294: Second, for $N\to\infty$ and $N/K\to 0$ the estimator admits

295: asymptotic analysis. The important parameter is the number of

296: coincidences $\Delta = N-K_1$, where $K_1$ is the number of bins with

297: non-zero counts. If $\Delta/N\to {\rm const}<1$ (many coincidences),

298: then the standard saddle point evaluation of the integrals in

299: Eq.~(\ref{Shat}) is possible. Interestingly, the second derivative at

300: the saddle is $(\ln^2 2)\, \Delta$ to the leading order in $\Delta/N$.

301: The second asymptotic can be obtained for $\Delta\sim O(N^0)$ (few

302: coincidences).  Then

303: \begin{eqnarray}

304:   S^{\rm NSB} &\approx&\frac{C_\gamma}{\ln 2} - 1 + 2 \log_2 N

305:   -\psi_0(\Delta)\,,

306:   \label{Shat_res}

307:   \\

308:   \delta S^{\rm NSB} &\approx& \sqrt{\psi_1(\Delta)}\,,

309:   \label{dShat_res}

310: \end{eqnarray}

311: where $C_\gamma$ is Euler's constant. This is particularly

312: interesting since $S^{\rm NSB}$ happens to have a finite limit for

313: $K\to\infty$, thus allowing entropy estimation even for infinite (or

314: unknown) cardinalities.

315:
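For integer $\Delta$, Eq.~(\ref{Shat_res}) can be evaluated by hand.
The following is a sketch that uses only the identity
$\psi_0(\Delta) = [-C_\gamma + \sum_{j=1}^{\Delta-1} 1/j]/\ln 2$, written
in the base 2 convention adopted above:
\[
S^{\rm NSB} \approx \frac{2C_\gamma}{\ln 2} - 1 + 2\log_2 N
- \frac{1}{\ln 2}\sum_{j=1}^{\Delta-1}\frac{1}{j} .
\]
As a purely illustrative case, $N=100$ samples with a single
coincidence, $\Delta=1$, give $S^{\rm NSB}\approx 1.665 - 1 + 13.288
\approx 13.95$ bits, whatever the value of $K$.  Formally, for larger
$\Delta$, where $\psi_0(\Delta)\approx\log_2\Delta$, the same expression
approaches $\log_2[N^2/(2\Delta)] + C_\gamma/\ln 2$, which has the form
of a coincidence counting estimate up to an additive constant.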

316: Third, both of the above asymptotics show that the estimation

317: procedure relies on $\Delta$ to make its estimates; this is in the

318: spirit of Ref.~\cite{ma-81}.

319:

320:

321: Finally, $S^{\rm NSB}$ is unbiased if the distribution being learned

322: is typical in ${\mathcal P}_\beta({\bf p})$ for some $\beta$, that is,

323: its rank ordered (Zipf) plot is of the form

324: \begin{eqnarray}

325: q_i &\approx& 1 - \left[\frac{ \beta B(\beta, K\beta - \beta )  (K-1) \,i}

326: {K} \right] ^{1/(K\beta-\beta)},

327: \label{left}\\

328: q_i &\approx& \left[ \frac{ \beta B(\beta, K\beta - \beta )  (K-i+1)}

329: {K}\right]^{1/\beta},

330: \label{right}

331: \end{eqnarray}

332: for $i/K\to 0$ and $i/K\to1$ respectively. If the Zipf plot has tails

333: that are too short (too long), then the estimator should over (under)

334: estimate.  While underestimation may be severe (though always strictly

335: smaller than that for $S^{\rm ML}$), overestimation is very mild, if

336: present at all, in the most interesting regime $1\ll\Delta\ll N$.

337: $S^{\rm NSB}$ is also unbiased for distributions that are typical in

338: some weighted combinations of ${\mathcal P}_\beta$ for different

339: $\beta$'s, in particular in ${\mathcal P}_{\rm NSB}$ itself. However,

340: the typical Zipf plots in this case are more complicated and will be

341: detailed elsewhere.

342:

343:

344: Before proceeding it is worth asking what we hope to accomplish. Any

345: reasonable estimator will converge to the right answer in the limit of

346: large $N$.  In particular, this is true for $S^{\rm NSB}$, which is a

347: {\em consistent} Bayesian estimator \footnote{In reference to Bayesian

348:   estimators, consistency usually means that, as $N$ grows, the

349:   posterior probability concentrates around unknown parameters of the

350:   true model that generated the data. For finite parameter models,

351:   such as the one considered here, only technical assumptions like

352:   positivity of the prior for all parameter values, soundness

353:   (different parameters always correspond to different distributions)

354:   \cite{clarke-barron-90}, and a few others are needed for

355:   consistency.  For nonparametric models, the situation is more

356:   complicated. There one also needs ultraviolet convergence of the

357:   functional integrals defined by the prior

358:   \cite{nemenman-00,bnt-01}.}.  The central problem of entropy

359: estimation is systematic bias, which will cause us to (perhaps

360: significantly) under- or overestimate the information content of spike

361: trains or the efficiency of the neural code. The bias, which vanishes

362: for $N\to\infty$, will manifest itself as a systematic drift in plots

363: of the estimated value versus the sample size. A successful estimator

364: would remove this bias as much as possible.  Ideally we thus hope to

365: see an estimate which for all values of $N$ is within its error bars

366: from the correct answer.  As $N$ increases the error bars should

367: narrow, with relatively little variation of the (mean) estimate

368: itself. When data are such that no reliable estimation is possible,

369: the estimator should remain uncertain, that is, the posterior variance

370: should be large. The main purpose of this paper is to show that the

371: NSB procedure applied to natural and nature--inspired synthetic

372: signals comes close to this ideal over a wide range of $N \ll K$, and

373: even $N \ll 2^S$.  The procedure thus is a viable tool for

374: experimental analysis.

375:

376:

377:

378: \section{A model problem}

379: \label{modprob}

380:

381: It is important to test our techniques on a problem which captures

382: some aspects of real world data yet is sufficiently well defined that

383: we know the correct answer.  We constructed synthetic spike trains

384: where intervals between successive spikes were independent and chosen

385: from an exponential distribution with a dead time or refractory period

386: of $g=1.8$ \milli\second; the mean spike rate was $r=0.26$

387: spikes/\milli\second.  This corresponds to the rate of $r_0 = r/(1 -

388: rg) = 0.49$ spikes/\milli\second\ for the part of the signal where

389: spiking is not prohibited by refractoriness.  These parameters are

390: typical of the high spike rate, noisy regions of the experiment

391: discussed below, which provide the greatest challenge for entropy

392: estimation.

393:

394: Following the scheme outlined in Ref.~\cite{strong-98}, we examine the

395: spike train in windows of duration $T=15$ \milli\second\ and

396: discretize the response with a time resolution $\tau = 0.5$

397: \milli\second.  Because of the refractory period each bin of size

398: $\tau$ can contain at most one spike, and hence the neural response is

399: a binary word with $T/\tau = 30$ letters.  The space of responses has

400: $K = 2^{30}\approx 10^9$ possibilities.  Of course, most of these have

401: probability exactly zero because of refractoriness, and the number of

402: possible responses consistent with this constraint is bounded by $\sim

403: 2^{16} \approx 10^5$.  An approximation to the entropy of this

404: distribution is given by an appropriate correction to Eq.~(3.21) of

405: Ref.~\cite{spikes}, the entropy of a non--refractory Poisson process:

406: \begin{equation}

407: S =\frac{rT}{\ln 2} \left[-\ln \left(1-{\rm e}^{-r_0\tau} \right)

408:   + \frac{r_0\tau\, {\rm e}^{-r_0\tau}}{1-{\rm e}^{-r_0\tau}}\right]=

409:  13.57~{\rm bits}.

410: \end{equation}

411:
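The numerical value can be checked step by step from the stated
parameters (intermediate values are rounded).  With $r_0\tau \approx
0.489\times 0.5 \approx 0.244$ and ${\rm e}^{-r_0\tau}\approx 0.783$,
\[
-\ln\left(1-{\rm e}^{-r_0\tau}\right) \approx -\ln 0.217 \approx 1.529 ,
\qquad
\frac{r_0\tau\,{\rm e}^{-r_0\tau}}{1-{\rm e}^{-r_0\tau}} \approx 0.883 ,
\]
and with $rT = 0.26\times 15 = 3.9$ expected spikes per window,
\[
S \approx \frac{3.9}{\ln 2}\,(1.529+0.883) \approx 5.627\times 2.412
\approx 13.57~{\rm bits} .
\]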

412:

413: \begin{figure}

414:   \ifpdf \includegraphics[width=3.2in]{refractory} \else

415:   \includegraphics[height=3.2in, angle=270]{refractory} \fi

416:   \caption{\label{fig:artif}Entropy estimation for a model

417:     problem. Notice that the estimator reaches the true value within

418:     the error bars as soon as $N^2 \sim 2^S$, at which point

419:     coincidences start to occur with high probability. Slight

420:     overestimation for $N>10^3$ is expected (see text) since this

421:     distribution is atypical in ${\mathcal P}_{\rm NSB}$.}

422: \end{figure}

423: In Fig.~\ref{fig:artif} we show the results of entropy estimation for

424: this model problem. As expected, the naive estimate $S^{\rm ML}$

425: reaches its asymptotic behavior only when $N > 2^S$; thus the $1/N$

426: extrapolation becomes successful at $N\sim10^4$ (the ``ML fit'' line

427: on the plot).  In contrast, we see that $S^{\rm NSB}$ gives the right

428: answer within errors at $N \sim 100$.  We can improve convergence by

429: providing the estimator with the ``hint'' that the number of possible

430: responses $K$ is much smaller than the upper limit of $2^{30}$, but

431: even without this hint we have excellent entropy estimates already at

432: $N \sim (2^S)^{1/2}$.  This is in accord with expectations from Ma's

433: analysis of (microcanonical) entropy estimation \cite{ma-81}. However,

434: here we achieve these results for a nonuniform distribution.

435:
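The sample sizes quoted here follow directly from $S = 13.57$ bits.  As
a quick check,
\[
2^S \approx 1.2\times 10^4 , \qquad
(2^S)^{1/2} = 2^{S/2} \approx 1.1\times 10^2 ,
\]
so the ML extrapolation needs roughly a hundred times more samples than
are required for coincidences to start appearing.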

436:

437:

438: \section{Analyzing real data}

439:

440: \begin{figure}

441:   \ifpdf \includegraphics[width=3in]{sfly022_vel_raster}

442:   \else \includegraphics[width=3in,angle=0]{sfly022_vel_raster} \fi

443:   \caption{\label{fig:flyexp} Data from a fly motion sensitive neuron

444:     in a natural stimulus setting. Top: a 500 \milli\second\ section

445:     of a 10 \second\ angular velocity trace that was repeated 196

446:     times.  Bottom: raster plot showing the

447:     response to 30 consecutive trials; each dot marks the occurrence of a spike.}

448: \end{figure}

449:

450: For a test on real neurophysiological data, we use recordings from a

451: wide field motion sensitive neuron (H1) in the visual system of the

452: blowfly {\em Calliphora vicina}. While action potentials from H1 were

453: recorded, the fly rotated on a stepper motor outside among the bushes,

454: with time dependent angular velocity representative of natural flight.

455: Figure~\ref{fig:flyexp} presents a sample of raw data from such an

456: experiment (see~Ref.~\cite{LewenEtAl2001} for details).

457:

458:

459: Following Ref.~\cite{strong-98}, the information content of a spike

460: train is the difference between its total entropy and the entropy of

461: neural responses to repeated presentations of the same stimulus

462: \footnote{It may happen that information is a small difference between

463:   two large entropies. Then, due to statistical errors, methods that

464:   estimate information directly will have an advantage over NSB, which

465:   estimates entropies first. In our case, this is not a problem since

466:   the information is roughly half of the total available entropy

467:   \cite{strong-98}.}. The latter is substantially more difficult to

468: estimate. It is called the noise entropy $S_n$, since it measures

469: response variations that are uncorrelated with the sensory input. The

470: noise in neurons depends on the stimulus itself---there are, for

471: example, stimuli which generate with certainty zero spikes in a given

472: window of time---and so we write $S_{n|t}$ to mark the dependence on

473: the time $t$ at which we take a slice through the raster of responses.

474: In this experiment the full stimulus was repeated 196 times, which

475: actually is a relatively large number by the standards of

476: neurophysiology.  The fly makes behavioral decisions based on $\sim

477: 10$--$30~\milli\second$ windows of its visual input

478: \cite{LandAndCollett1974}, and under natural conditions the time

479: resolution of the neural responses is of order 1 \milli\second\ or

480: even less \cite{LewenEtAl2001}, so that a meaningful analysis of

481: neural responses must deal with binary words of length $10$--$30$ or

482: more. Refractoriness limits the number of these words which can occur

483: with nonzero probability (as in our model problem), but nonetheless we

484: easily reach the limit where the number of samples is substantially

485: smaller than the number of possible responses.

486:

487: \begin{figure}[t]

488:    \ifpdf

489:    \includegraphics[width=2.7in]{T8_nsb_vs_ml_2}

490:    \includegraphics[width=2.7in]{T15_nsb_vs_ml_2}

491:    \else

492:    \includegraphics[height=2.7in,angle=270]{T8_nsb_vs_ml_2}

493:    \includegraphics[height=2.7in,angle=270]{T15_nsb_vs_ml_2}

494:    \fi

495:    \caption{\label{fig:nsbml}Slice entropy vs.\ sample size.  Dashed

496:      line on both plots is drawn at the value of $\left.S^{\rm

497:          NSB}\right|_{N=N_{\rm max}}$ to show that the estimator is

498:      stable within its error bars even for very low $N$. Triangle

499:      corresponds to the value of $S^{\rm ML}$ extrapolated to

500:      $N\to\infty$ from the four largest values of $N$. First and second

501:      panels show examples of word lengths for which $S^{\rm ML}$ can

502:      or cannot be reliably extrapolated. $S^{\rm NSB}$ is stable in

503:      both cases, shows no $N$ dependent drift, and agrees with $S^{\rm

504:        ML}$ where the latter is reliable.}

505: \end{figure}

506: Let us start by looking at a single moment in time,

507: $t=1800~\milli\second$ from the start of the repeated stimulus, as in

508: Fig.~\ref{fig:flyexp}.  If we consider a window of duration $T =

509: 16~\milli\second$ at time resolution $\tau = 2~\milli\second$

510: \footnote{For our and many other neural systems, the spike timing can

511:   be more accurate than the refractory period of roughly 2

512:   \milli\second\ \cite{brenner-00a,rob-01,LewenEtAl2001}.  For the

513:   current amount of data, discretization of $\tau\ll1~\milli\second$

514:   and large enough $T$ will push the limits of all estimation methods,

515:   including ours, that do not make explicit assumptions about

516:   properties of the spike trains. Thus, to have enough statistics to

517:   convincingly show validity of the NSB approach, in this paper we

518:   choose $\tau =0.75\dots2~\milli\second$, which is still much shorter

519:   than other methods can handle. We leave open the possibility that

520:   more information is contained in timing precision at finer scales.},

521: we obtain the entropy estimates shown in the first panel of

522: Fig.~\ref{fig:nsbml}.  Notice that in this case we actually have a

523: total number of samples which is comparable to or larger than

524: $2^{S_{n|t}}$, and so the maximum likelihood estimate of the entropy

525: is converging with the expected $1/N$ behavior.  The NSB estimate

526: agrees with this extrapolation.  The crucial result is that the NSB

527: estimate is correct within error bars across the whole range of $N$;

528: there is a slight variation in the mean estimate, but the main effect

529: as we add samples is that the error bars narrow around the correct

530: answer.  In this case our estimation procedure has removed essentially

531: all of the sample size dependent bias.

532:

533:

534:

535:

536: As we open our window to $T = 30~\milli\second$, the number of

537: possible responses (even considering refractoriness) is vastly larger

538: than the number of samples. As we see in the second panel of

539: Fig.~\ref{fig:nsbml}, any attempt to extrapolate the ML estimate of

540: entropy now requires some wishful thinking.  Nonetheless, in parallel

541: with our results for the model problem, we find that the NSB estimate

542: is stable within error bars across the full range of available $N$.

543:

544:

545:

546: \begin{figure}

547:   \centerline{\includegraphics[width=2.9in,height=2.3in]{cond_entr_75}}

548:   \caption{\label{fig:conds}Distribution of the normalized entropy

549:     error conditional on $S^{\rm NSB}(N_{\rm max})$ for $N=75$ and

550:     $\tau=0.75~\milli\second$. Darker patches correspond to higher

551:     probability.  The band in the right part of the plot is the normal

552:     distribution around zero with the standard deviation of 1 (the

553:     standard deviation of plotted conditional distributions averaged

554:     over $S^{\rm NSB}$ is about 0.7, which indicates a non--Gaussian

555:     form of the posterior for small number of coincidences

556:     \cite{nsb2}). For values of $S^{\rm NSB}$ up to about 12 bits the

557:     estimator performs remarkably well. For yet larger entropies,

558:     where the number of coincidences is just a few, the discrete nature

559:     of the estimated values is evident, and this puts a bound on

560:     reliability of $S^{\rm NSB}$.}

561: \end{figure}

562:

563: For small $T$ we can compare the results of our Bayesian estimation

564: with an extrapolation of the ML estimate; each moment in time relative

565: to the repeated stimulus provides an example.  We have found that the

566: results in the first panel of Fig.~\ref{fig:nsbml} are typical: in the

567: regime where extrapolation of the ML estimator is reliable, our

568: estimator agrees within error bars over a broad range of sample sizes.

569: More precisely, if we take the extrapolated ML estimate as the correct

570: answer, and measure the deviation of $S^{\rm NSB}$ from this answer in

571: units of the predicted error bar, we find that the mean square value

572: of this normalized error is of order one.  This is as expected if our

573: estimation errors are random rather than systematic.

574:

575:

576: For larger $T$ we do not have a calibration against the (extrapolated)

577: $S^{\rm ML}$, but we can still ask if the estimator is stable, within

578: error bars, over a wide range of $N$.  To check this stability we

579: treat the value of $S^{\rm NSB}$ at $N=N_{\rm max}=196$ as our best

580: guess for the entropy and compute the normalized deviation of the

581: estimates at smaller values of $N$ from this guess,

582: $\varepsilon=\left[S^{\rm NSB}(N) - S^{\rm NSB}(N_{\rm

583:     max})\right]/\delta S^{\rm NSB}(N)$.  Again, each moment in time

584: is an example.  Figure~\ref{fig:conds} shows the distribution of these

585: normalized deviations conditional on the entropy estimate with $N=75$;

586: this analysis is done for $\tau = 0.75~\milli\second$, with $T$ in the

587: range between $1.5~\milli\second$ and $22.5~\milli\second$.  Since the

588: different time slices span a range of entropies, over some range we

589: have $N > 2^S$, and in this regime the entropy estimate must be

590: accurate (as in the analysis of small $T$ above).  Throughout this

591: range, the normalized deviations fall in a narrow band with mean close

592: to zero and a variance of order one, as expected if the only

593: variations with the sample size were random.  Remarkably, this pattern

594: continues for larger entropies, $S > \log_2 N=6.2$ bits, demonstrating

595: that our estimator is stable even deep into the undersampled regime.

596: This is consistent with the results obtained in our model problem, but

597: here we find the same answer for the real data.

598:
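As a rough numerical guide (simple arithmetic on the sample sizes
quoted above, not an additional result), the nominal sampling limit
$N \sim 2^S$ corresponds to
\[
\log_2 75 \approx 6.2~{\rm bits},
\qquad
\log_2 196 \approx 7.6~{\rm bits}.
\]
Time slices with entropy above roughly 6.2~bits are therefore
undersampled at $N=75$, and those above roughly 7.6~bits remain
undersampled even when all $N_{\rm max}=196$ samples are used.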

599: Note that Fig.~\ref{fig:conds} illustrates results with $N$ less than

600: one half the total number of samples, so we really are testing for

601: stability over a large range in $N$.  This emphasizes that our

602: estimation procedure moves smoothly from the well sampled into the

603: undersampled regime without accumulating any clear signs of systematic

604: error.  The procedure collapses only when the entropy is so large that

605: the probability of observing the same response more than once (a

606: coincidence) becomes negligible.

607:
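To see roughly where this happens, consider the idealized case
(an assumption made here only for illustration) of a uniform
distribution over $2^S$ possible responses.  The expected number of
coinciding pairs among $N$ samples, written here as
$\langle n_{\rm pairs} \rangle$, is then
\[
\langle n_{\rm pairs} \rangle = \frac{N(N-1)}{2}\, 2^{-S} ,
\]
which falls to about $1/2$ when $2^S = N^2$.  For $N=75$ this
corresponds to $S = 2\log_2 75 \approx 12.5$~bits.  A nonuniform
distribution with the same entropy has a collision probability
$\sum_i p_i^2 \ge 2^{-S}$, so on average it produces at least as many
coincidences, and the uniform case serves as a conservative guide.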

608: \section{Discussion}

609:

610: %% We finish with the following observations. While one might expect that

611: %% a meaningful estimate of entropy $S$ requires many more than $2^S$

612: %% samples, we know from Ma \cite{ma-81} that for uniform distributions

613: %% of unknown size (as in the microcanonical ensemble) we can make

614: %% reliable estimates of the entropy when $N^2 \sim 2^S$.  This is

615: %% related to the fact that we will encounter two people who have the

616: %% same birthday once we have chosen at random a group of $N \sim 23 <<

617: %% 365$ people---conversely, testing for coincidences tells us something

618: %% about the entropy of the distribution of birthdays even when we are

619: %% far from having seen all the possibilities.  The challenge is to

620: %% convert these ideas about coincidences into a reliable estimator for

621: %% the entropy of nonuniform distributions.

622:

623: The estimator we have explored here is constructed from a prior that

624: has a nearly uniform distribution of entropies.  It is plausible that

625: such a uniform prior would largely remove the sample size dependent

626: bias in entropy estimation, but it is crucial to test this

627: experimentally. In particular, there are infinitely many priors which

628: are approximately (and even exactly) uniform in entropy, and it is not

629: clear which of them will allow successful estimation in real world

630: problems.  We have found that the NSB prior almost completely removed

631: the bias in the model problem (Fig.~\ref{fig:artif}). Further, for

632: real data in a regime where undersampling can be beaten down by data,

633: the bias is removed to yield agreement with the extrapolated ML

634: estimator even at very small sample sizes (Fig.~\ref{fig:nsbml}, first

635: panel).  Finally and most crucially, the NSB estimation procedure

636: continues to perform smoothly and stably past the nominal sampling

637: limit of $N \sim 2^S$, all the way to the Ma cutoff $N^2 \sim 2^S$

638: (Fig.~\ref{fig:conds}).  This opens the opportunity for rigorous

639: analysis of entropy and information in spike trains under a much wider

640: set of experimental conditions.

641:

642: \acknowledgments

643:

644: We thank J Miller for important discussions, GD Lewen for his help

645: with the experiments, which were supported by the NEC Research

646: Institute, and the organizers of the NIC'03 workshop for providing a

647: venue for a preliminary presentation of this work.  IN was supported

648: by NSF Grant No.\ PHY99-07949 to the Kavli Institute for Theoretical

649: Physics.  IN is also very thankful to the developers of the following

650: Open Source software: GNU Emacs, GNU Octave, GNUplot, and te\TeX.

651:

652:

653:

654:

655: \bibliographystyle{unsrtnat} {\small\bibliography{flies}}

656:

657: \end{document}

658:

659: