
1: %

2: \documentclass[12pt]{article}

3: \usepackage{psfig,mydef}

4: \begin{document}

5: \begin{titlepage}

6: \mbox{} \vskip 3cm \centerline{\LARGE\bf \sf Information Theory in Molecular

7: Biology} \vskip 1in \centerline{\large \bf \sf Christoph Adami$^{1,2}$} \vskip

8: 0.2in \centerline{\it $^1$Keck Graduate Institute of Applied Life Sciences,}

9: \centerline{\it 535 Watson Drive, Claremont, CA 91711}\vskip 0.5cm \centerline{\it

10: $^2$Digital Life Laboratory 139-74,}  \centerline {\it California Institute of

11: Technology, Pasadena, CA 91125} \vskip 2cm

12:

13: \date{\today}

14: \centerline{\bf Abstract} This article introduces the physics of

15: information in the context of molecular biology and genomics.

16: Entropy and information, the two central concepts of Shannon's

17: theory of information and communication, are often confused with

18: each other but play transparent roles when applied to statistical

19: {\em ensembles} (i.e., identically prepared sets) of symbolic

20: sequences. Such an approach can distinguish between entropy and

21: information in genes, predict the secondary structure of

22: ribozymes, and detect the covariation between residues in folded

23: proteins. We also review applications to molecular sequence and

24: structure analysis, and introduce new tools in the

25: characterization of resistance mutations, and in drug design.

26:

27: \vskip 4 cm

28:  %

29: \end{titlepage}

30: \vskip 0.25cm In a curious twist of history, the dawn of the age

31: of genomics has both seen the rise of the science of

32: bioinformatics as a tool to cope with the enormous amounts of data

33: being generated daily, and the decline of the {\em theory} of

34: information as applied to molecular biology. Hailed as a harbinger

35: of a ``new movement''\nocite{quast53} (Quastler 1953) along with

36: Cybernetics, the principles of information theory were thought to

37: be applicable to the higher functions of living organisms, and

38: able to analyze such functions as metabolism, growth, and

39: differentiation (Quastler 1953). Today, the metaphors and the

40: jargon of information theory are still widely used (Maynard Smith

41: 1999a, 1999b)\nocite{maynard99a,maynard99b}, as opposed to the

42: mathematical formalism, which is too often considered to be

43: inapplicable to biological information.

44:

In retrospect, too much hope was placed in this theory's relevance
for biology. Still, there was well-founded optimism that information
theory ought to be able to address the complex issues associated with
the storage of information in the genetic code, even though this
optimism was repeatedly questioned and rebuked (see, e.g.,
\nocite{vincent94,sarkar96}Vincent 1994, Sarkar 1996). In this
article, I outline the concepts of entropy

52: and information (as defined by Shannon) in the context of

53: molecular biology. We shall see that not only are these terms

54: well-defined and useful, they also coincide precisely with what we

55: intuitively mean when we speak about information stored in genes,

56: for example. I then present examples of applications of the theory

57: to measuring the information content of biomolecules, the

58: identification of polymorphisms, RNA and protein secondary

59: structure prediction, the prediction and analysis of molecular

60: interactions, and drug design.

61:

62: \section{Entropy and Information}

The terms ``entropy'' and ``information'' are often used inconsistently in
the literature. A precise understanding, both mathematical and

65: intuitive, of the notion of information (and its relationship to

66: entropy) is crucial for applications in molecular biology.

67: Therefore, let us begin by outlining Shannon's original entropy

68: concept (Shannon, 1948).

69:

70: \subsection{Shannon's Uncertainty Measure}

71: Entropy in Shannon's theory (defined mathematically below) is a

72: measure of uncertainty about the identity of objects in an

73: ensemble. Thus, while ``entropy'' and ``uncertainty'' can be used

74: interchangeably, they can {\it never} mean information. There is a

75: simple relationship between the entropy concept in information

76: theory and the Boltzmann-Gibbs entropy concept in thermodynamics,

77: briefly pointed out below.

78:

79: Shannon entropy or uncertainty is usually defined with respect to

80: a particular observer.  More precisely, the entropy of a system

81: represents the amount of uncertainty {\em one particular observer}

82: has about the state of this system. The simplest example of a

83: system is a {\em random variable}, a mathematical object that can

84: be thought of as an $N$-sided die that is uneven, i.e., the

85: probability of it landing in any of its $N$ states is not equal

86: for all $N$ states. For our purposes, we can conveniently think of

87: a polymer of fixed length (fixed number of monomers), which can

88: take on any one of $N$ possible states, where each possible

89: sequence corresponds to one possible state. Thus, for a sequence

90: made of $L$ monomers taken from an alphabet of size $D$, we would

91: have $N=D^L$. The uncertainty we calculate below then describes

92: the observer's uncertainty about the true identity of the molecule

93: (among a very large number of identically prepared molecules: an

{\em ensemble}), given that the observer has only a certain amount of

95: probabilistic knowledge, as explained below.

96:

97: This hypothetical molecule plays the role of a random variable if we

98: are given its {\em probability distribution}: the set of probabilities

99: $p_1,...,p_N$ to find it in its $N$ possible states.  Let us thus

100: call our random variable (random molecule) ``$X$'', and give the names

101: $x_1,...,x_N$ to its $N$ states. If $X$ will be found in state $x_i$

102: with probability $p_i$, then the entropy $H$ of $X$ is given by

103: Shannon's formula

104: \begin{equation}

105: H(X)=-\sum_{i=1}^N p_i\log p_i\;. \label{entropy}

106: \end{equation}

I have not yet specified the base of the logarithm in the above formula.
Specifying it assigns units to the uncertainty. It is sometimes convenient to use
the number of possible states of $X$ as the base of the logarithm (in which case
the entropy lies between zero and one); in other cases base 2 is convenient (leading
to an entropy in units of ``bits''). For biomolecular sequences, a convenient unit
is obtained by taking logarithms to the base of the alphabet size, leading to an
entropy whose units we shall call ``mers''. Then, the maximal entropy equals the
length of the sequence in mers.

115:
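As a small worked illustration (the probabilities are invented for this
example and are not taken from any data set), consider a single nucleotide
site, i.e., $L=1$, $D=4$ and $N=4$, with $p_{\rm G}=1/2$, $p_{\rm C}=1/4$,
and $p_{\rm A}=p_{\rm T}=1/8$. Equation~(\ref{entropy}) with logarithms to
base 2 gives
\[
H(X)=\frac12\log_2 2+\frac14\log_2 4+2\cdot\frac18\log_2 8
=\frac12+\frac12+\frac34=1.75\ {\rm bits}\;,
\]
compared to the maximal value $\log_2 4=2$ bits. Since
$\log_4 p=\frac12\log_2 p$, the same uncertainty expressed in mers is
$1.75/2=0.875$ mers, out of a maximum of 1 mer for this single site.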

116: Let us examine Eq.~(\ref{entropy}) more closely. If measured in

117: bits, a standard interpretation of $H(X)$ as an uncertainty

function connects it to the smallest number of ``yes-no'' questions

119: necessary, on average, to identify the state of random variable

120: $X$. Because this series of yes/no questions can be thought of as

121: a {\em description} of the random variable, the entropy $H(X)$ can

122: also be viewed as the {\em length of the shortest description of

123: $X$} (Cover and Thomas, 1991). In case nothing is known about $X$,

124: this entropy is $H(X)=\log N$, the maximal value that $H(X)$ can

125: take on. This occurs if all states are equally likely:

126: $p_i=1/N\,;i=1,...,N$. If something (beyond the possible number of

127: states $N$) is known about $X$, this reduces our necessary number

128: of questions, or the length of tape necessary to describe $X$.

129: If I know that state $X=x_7$, for example, is highly

130: unlikely, then my uncertainty about $X$ is going to be smaller.

131:

How do we ever learn anything about a system?  There are two ways. Either we
obtain the probability distribution using {\em prior knowledge} (for example, by
taking the system apart and predicting its states theoretically), or we make
measurements on it, which might reveal, for example, that not all states are in
fact taken on with the same probability. In both cases, the difference between the
maximal entropy and the entropy that remains after we have either examined the
system or performed our measurements is the amount of information we have about the
system. Before I write this into a formula, let me remark that, by its very
definition, information is a {\it relative} quantity. It measures a {\it
difference of uncertainty} (in the previous case, the entropy before and after the
measurement) and thus can never be absolute, in the same sense as potential energy
in physics is not absolute. In fact, it is not a bad analogy to refer to entropy as
``potential information'', because potentially all of a system's entropy can be
transformed into information (for example by measurement).

146:
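To illustrate with hypothetical numbers: suppose $X$ has $N=8$ states, so
that the maximal entropy is $H_{\rm max}=\log_2 8=3$ bits. If a measurement
reveals that state $x_7$ never occurs while the remaining seven states are
equally likely, the remaining entropy is $H=\log_2 7\approx 2.81$ bits, and
\[
I=H_{\rm max}-H\approx 3-2.81=0.19\ {\rm bits}\;.
\]
If instead the measurement pins down the state completely, then $H=0$ and
all 3 bits of ``potential information'' have been turned into information.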

147: \subsection{Information}

148: In the above case, information was the difference between the maximal and the

149: actual entropy of a system. This is not the most general definition as I have

150: alluded to. More generally, information measures the amount of {\it correlation}

151: between two systems, and reduces to a difference in entropies in special cases. To

152: define information properly, let me introduce another random variable or molecule

153: (call it ``$Y$''), which can be in states $y_1,...,y_M$ with probabilities

154: $p_1,...,p_M$. We can then, along with the entropy $H(Y)$, introduce the joint

155: entropy $H(XY)$, which measures my uncertainty about the joint system $XY$ (which

156: can be in $N\cdot M$ states). If $X$ and $Y$ are {\em independent} random variables

157: (like, e.g., two dice that are thrown independently) the joint entropy will be just

158: the sum of the entropy of each of the random variables. Not so if $X$ and $Y$ are

159: somehow connected.  Imagine, for example, two coins that are glued together at one

160: face. Then, heads for one of the coins will always imply tails for the other, and

161: vice versa. By gluing them together, the two coins can only take on two states, not

162: four, and the joint entropy is equal to the entropy of one of the coins.

163:

164: The same is true for two molecules that can bind to each other.

165: First, remark that random molecules do not bind. Second, binding

166: is effected by mutual specificity, which requires that part of the

167: sequence of one of the molecules is interacting with the sequence

168: of the other, so that the joint entropy of the pair is much less

169: than the sum of entropies of each. Quite clearly, this binding

170: introduces strong correlations between the states of $X$ and $Y$:

171: if I know the state of one, I can make strong predictions about

172: the state of the other. The information that one molecule has {\em

173: about} the other is given by

174: \begin{equation}

175: I(X:Y)=H(X)+H(Y)-H(XY)\;, \label{info}

176: \end{equation}

177: i.e., it is the difference between the sum of the entropies of each, and the joint

178: entropy. The colon between $X$ and $Y$ in the notation for the information is

179: standard; it is supposed to remind the reader that information is a symmetric

180: quantity: what $X$ knows about $Y$, $Y$ also knows about $X$. For later reference,

181: let me introduce some more jargon. When more than one random variable is involved,

182: we can define the concept of {\em conditional entropy}. This is straightforward.

183: The entropy of $X$ conditional on $Y$ is the entropy of $X$ {\em given} $Y$, that

184: is, if I know which state $Y$ is in. It is denoted by $H(X|Y)$ (read ``$H$ of $X$

185: given $Y$'') and is calculated as

186: \begin{equation}

187: H(X|Y)=H(XY)-H(Y)\;.

188: \end{equation}

189: This formula is self-explanatory: the uncertainty I have about $X$ if $Y$ is known

190: is the uncertainty about the joint system minus the uncertainty about $Y$ alone.

191: The latter, namely the entropy of $Y$ without regard to $X$ (as opposed to

192: ``conditional on $X$'') is sometimes called a {\em marginal} entropy. Using the

193: concept of conditional entropy, we can rewrite Eq.~(\ref{info}) as

194: \begin{equation}

195: I(X:Y)=H(X)-H(X|Y)\;.

196: \end{equation}

197:

198: We have seen earlier that for independent variables

199: $H(XY)=H(X)+H(Y)$, so information measures the {\em deviation}

200: from independence. In fact, it measures exactly the amount by

201: which the entropy of $X$ or $Y$ is reduced by knowing the other,

202: $Y$ or $X$.  If $I$ is non-zero, knowing one of the molecules

203: allows you to make more accurate predictions about the other:

204: quite clearly this is exactly what we mean by information in

205: ordinary language. Note that this definition reduces to the

206: example given earlier (information as difference between

207: entropies), if the only possible correlations are {\em between}

208: $X$ and $Y$, while in the absence of the other each molecule is

209: equiprobable (meaning that any sequence is equally likely).  In

210: that case, the marginal entropy $H(X)$ must be maximal ($H=\log

211: N$) and the information is the difference between maximal and

212: actual (i.e., conditional) entropy, as before.

213:
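The glued coins provide a simple check of these formulas (a toy
calculation, assuming fair coins). Each coin alone shows heads or tails
with probability $1/2$, so $H(X)=H(Y)=1$ bit, while the glued pair has only
two equally likely joint states, so $H(XY)=1$ bit. Then
\[
I(X:Y)=1+1-1=1\ {\rm bit}\;,\qquad H(X|Y)=H(XY)-H(Y)=1-1=0\;,
\]
i.e., knowing one coin removes all uncertainty about the other. For two
independent fair coins, in contrast, $H(XY)=2$ bits, so that $I(X:Y)=0$
and $H(X|Y)=H(X)=1$ bit.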

214: \subsection{Entropy in Thermodynamics}

I will briefly comment on the relationship between Shannon's theory and
thermodynamics (Adami and Cerf 1999)\nocite{adami99}. For the present purpose it
should suffice to remark that Boltzmann-Gibbs thermodynamic entropy is just like
Shannon entropy, except that the probability distribution $p_i$ is given by the

219: Boltzmann distribution of the relevant degrees of freedom (position and momentum):

220: \begin{equation}

221: \rho(p,q)= \frac1Ze^{-E(p,q)/kT}\;,

222: \end{equation}

223: and the thermodynamic quantity is made dimensional by multiplying

224: Shannon's dimensionless uncertainty by Boltzmann's constant. It

225: should not worry us that the degrees of freedom in thermodynamics

226: are continuous, because any particular measurement device that is

227: used to measure these quantities will have a finite resolution,

228: rendering these variables effectively discrete through

229: coarse-graining. More importantly, equilibrium thermodynamics

230: assumes that all entropies of isolated systems are at their

231: maximum, so there are no correlations in equilibrium thermodynamic

232: systems, and therefore there is {\em no information}. This is

233: important for our purposes, because it implies, a fortiori, that

234: the information stored in biological genomes guarantees that

235: living systems are far away from thermodynamical equilibrium.

236: Information theory can thus be viewed as a type of non-equilibrium

237: thermodynamics.

238:
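Written out in its standard textbook form (stated here only for
orientation, with discrete states $i$ standing in for the coarse-grained
degrees of freedom), the correspondence between the two entropies reads
\[
S=-k\sum_i p_i\ln p_i=k\,H\;,
\]
where $H$ is the Shannon entropy of Eq.~(\ref{entropy}) taken with natural
logarithms and $k$ is Boltzmann's constant. For $N$ equally likely states
this reduces to $S=k\ln N$.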

239:

240: Before exploring the uses of these concepts in molecular biology,

241: let me reiterate the most important points which tend to be

242: obscured when discussing information. Information is defined as

243: the amount of correlation between two systems. It measures the

244: amount of entropy {\em shared} between two systems, and this

245: shared entropy is the information that one system has {\em about

246: the other}. Perhaps this is the key insight that I would like to

247: convey: Information is always {\em about something}. If it cannot

248: be specified what the information is about, then we are dealing

249: with entropy, not information. Indeed, entropy is sometimes

called, in what borders on an abuse of language, ``useless
information''. The previous discussion also implies that

252: information is only defined {\it relative} to the system it is

253: information about, and is therefore {\em never} absolute. This

254: will be particularly clear in the discussion of the information

255: content of genomes, which we now enter.

256:

257: \section{Information in Genomes}

258: There is a long history of applying information theory to symbolic

259: sequences. Most of this work is concerned with the randomness, or,

260: conversely, regularity, of the sequence. Ascertaining the

261: probabilities with which symbols are found on a sequence or

262: message will allow us to estimate the entropy of the {\em source

263: of symbols}, but not what they stand for. In other words,

264: information cannot be accessed in this manner.  It should be

265: noted, however, that studying {\em horizontal} correlations, i.e.,

266: correlations between symbols along a sequence rather than across

267: sequences, can be useful for distinguishing coding from non-coding

268: regions in DNA (Grosse et al., 2000), and can serve as a distance

269: measure between DNA sequences that can be used to assemble

270: fragments obtained from shotgun-sequencing (Otu and Sayood, 2003).

271:

272: In terms of the jargon introduced above, measuring the

273: probabilities with which symbols (or groups of symbols) appear

274: {\em anywhere} in a sequence will reveal the {\em marginal}

275: entropy of the sequence, i.e., the entropy {\em without} regard to

276: the environment or context. The entropy {\em with} regard to the

277: environment is the entropy {\em given} the environment, a

278: conditional entropy, which we shall calculate below. This will

279: involve obtaining the probability to find a symbol at a {\em

280: specific} point in the sequence, as opposed to anywhere on it. We

281: sometimes refer to this as obtaining the {\em vertical}

282: correlations between symbols.

283:
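A deliberately extreme toy example (hypothetical, chosen only to make the
distinction explicit) is an ensemble consisting of many identical copies of
the four-base sequence GCAT. Counting symbols {\em anywhere} in the
sequence gives $p({\rm G})=p({\rm C})=p({\rm A})=p({\rm T})=1/4$, hence a
marginal entropy of
\[
-4\cdot\frac14\log_2\frac14=2\ {\rm bits\ per\ base}\;,
\]
the maximal value. Counting symbols at each {\em specific} position across
the ensemble, however, finds the same base every time, so that the
per-position entropy vanishes at all four positions.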

284: \subsection{Meaning from Correlations}

285: Obtaining the marginal entropy of a genetic sequence can be quite

286: involved (in particular if multi-symbol probabilities are

287: required), but a very good approximative answer can be given

288: without any work at all: This entropy (for DNA sequences) is about

289: two bits per base. There are deviations of interest (for example

290: in GC-rich genes, etc.) but overall this is what the

291: (non-conditional) entropy of most of DNA is (see, e.g., Schmitt

292: and Herzel 1997)\nocite{SchmittHerzel1997}. The reason for this is

293: immediately clear: DNA is a {\em code}, and codes do not reveal

294: information from sequence alone. Optimal codes, e.g., are such

295: that the encoded sequences cannot be compressed any further (Cover

296: and Thomas, 1991)\nocite{CoverThomas1991}. While DNA is not

297: optimal (there are some correlations between symbols along the

298: sequence), it is nearly so. The same seems to hold true for

299: proteins: a random protein would have $\log_2(20)=4.32$ bits of

300: entropy per site (or 1 mer, the entropy of a random monomer

301: introduced above), while the actual entropy is somewhat lower due

302: to biases in the overall abundance (leucine is over three times as

303: abundant as tyrosine, for example), and due to pair and triplet

304: correlations. Depending on the data set used, the protein entropy

305: per site is between 2.5 (Strait and Dewey, 1996) and 4.17 bits

306: (Weiss et al., 2000)\nocite{StraitDewey1996,Weissetal2000}, or

307: between 0.6 and 0.97 mers. Indeed, it seems that protein sequences

308: can only be compressed by about 1\% (Weiss et al. 2000). This is a

pretty good code! However, this entropy per symbol only allows us to
quantify our uncertainty about the sequence identity; it will
not reveal to us the {\em function} of the genes. If this were all

312: that information theory could do, we would have to agree with the

313: critics that information theory is nearly useless in molecular

314: biology. Yet, I have promised that information theory {\em is}

315: relevant, and I shall presently point out how. First of all, let

316: us return to the concept of information. How should we decide

whether or not {\em potential information} (a.k.a.\ entropy) is in

318: {\em actuality} information, i.e., whether it is shared with

319: another variable?

320:

321: The key to information lies in its use to make predictions {\em

322: about} other systems. Only in {\em reference} to another ensemble

323: can entropy become information, i.e., be promoted from useless to

324: useful, from potential to actual. Information therefore is clearly

325: not stored {\em within} a sequence, but rather in the {\em

326: correlations} between the sequence and what it describes, or what

327: it {\em corresponds to}. What do biomolecular sequences correspond

328: to? What is the {\em meaning} of a genomic sequence, what

329: information does it represent?  This depends, quite naturally, on

330: what environment the sequence is to be interpreted within.

331: According to the arguments advanced here, no sequence has an

332: intrinsic meaning, but only a relative (or conditional) one with

333: respect to an environment. So, for example, the genome of {\it

334: Mycoplasma pneumoniae} (a bacterium that causes pneumonia-like

335: respiratory illnesses) has an entropy of almost a million base

336: pairs, which is its genome length. Within the soft tissues that it

337: relies on for survival, most of these base pairs (about 89\%) are

338: information (Dandekar et al., 2000). Indeed, Mycoplasmas are

339: obligate parasites in these soft tissues, having shed from 50\% to

340: three quarters of the genome of their bacterial ancestors (the

341: {\em Bacillae}). Within these soft tissues that make many

342: metabolites readily available, what was information for a Bacillus

has become entropy for the Mycoplasma. With respect to {\em other}

344: environments, the Mycoplasma information might mean very little,

345: i.e., it might not {\em correspond} to anything there. Whether or

346: not a sequence means something in its environment determines

347: whether or not the organism hosting it lives or dies there. This

348: will allow us to find a way to distinguish entropy from

349: information in genomes.

350:

351: \subsection{Physical Complexity}

352: In practice, how can we determine whether a particular base's

353: entropy is shared, i.e., whether a nucleotide carries entropy or

354: information? At first glance one might fear that we would have to

355: know a gene's function (i.e., know what it corresponds to within

356: its surrounding) before we can determine the information content;

357: that, for example, we might need to know that a gene codes for an

alcohol dehydrogenase before we can ascertain which base pairs code

359: for it. Fortunately, this is not true. What is clear, however, is

360: that we may never distinguish entropy from information if we are

361: only given a {\em single} sequence to make this determination,

362: because, in a single sequence, symbols that carry information are

363: indistinguishable from those that do not. The trick lies in

364: studying {\em functionally equivalent sets} of sequences, and the

365: substitution patterns at each aligned position. In an equilibrated

population, i.e., one where sufficient time has passed since the

367: last evolutionary innovation or bottleneck, we expect a position

368: that codes for information to be nearly uniform {\em across} the

369: population (meaning that the same base pair will be found at that

370: position in all sequences of that population), because a mutation

371: at that position would detrimentally affect the fitness of the

372: bearer, and, over time, be purged from the ensemble (this holds in

373: its precise form only for asexual populations). Positions that do

374: not code for information, on the other hand, are selectively

375: neutral, and, with time, will take on all possible symbols at that

376: position. Thus, we may think of each position on the genome as a

377: four-sided die. A priori, the uncertainty (entropy) at each

378: position is two bits, the maximal entropy:

379: \begin{equation}

380: H = -\sum_{i= {\rm G,C,A,T}}p(i)\log_2 p(i)=\log_2 4 = 2 \ \ {\rm bits}

381: \end{equation}

382: because, a priori, $p(i)=1/4$. For the {\it actual} entropy, we need the actual

383: probabilities $p_j(i)$, for each position $j$ on the sequence. In a pool of $N$

384: sequences, $p_j(i)$ is estimated by counting the number $n_j(i)$ of occurrences of

385: nucleotide $i$ at position $j$, so that $p_j(i)=n_j(i)/N$. This should be done for

386: all positions $j=1,...,L$ of the sequence, where $L$ is the sequence length.

387: Ignoring correlations {\em between} positions $j$ on a sequence (so-called

388: ``epistatic'' correlations, to which we shall return below), the information stored

389: in the sequence is then (with logs to base 2)

390:

391: \begin{equation}

392: I=H_{\rm max}- H = 2L -H\;\; {\rm bits}\;, \label{infomeasure}

393: \end{equation}

394: where

395: \begin{equation}

396: H = -\sum_{j=1}^L\,\sum_{i={\rm G,C,A,T}}p_j(i)\log_2 p_j(i)\;. \label{cond}

397: \end{equation}

398: Note that this estimate, because it relies on the difference of

399: maximal and actual entropy, does not require us to know which

400: variables in the environment cause some nucleotides to be uniform,

401: or ``fixed''. These probabilities are set by mutation-selection

402: balance in the environment. I have argued earlier (Adami and Cerf

403: 2000, Adami et al. 2000)\nocite{Adamietal2000} that the

404: information stored in a sequence is a good proxy for the

sequence's complexity (called ``physical complexity''), which

406: itself might be a good predictor of functional complexity. And

407: indeed, it seems to correspond to the quantity that increases

408: during Darwinian evolution (Adami 2002a)\nocite{Adami2002a}. We

409: will encounter below an evolutionary experiment that seems to

410: corroborate these notions.

411:

412: In general (for sequences taken from any monomer alphabet of size $D$), the

413: information stored in the sequence is

414: \begin{eqnarray}

415: I=H_{\rm max}-H &=& L-\left(-\sum_{j=1}^L\,\sum_{i=1}^{D}p_j(i)\log_D p_j(i)\right)\\

416: &=& L-J\;\; {\rm mers}\;, \label{infomer}

417: \end{eqnarray}

where $J$ (the expression in parentheses, i.e., the entropy measured in
mers) can be thought of as the number of {\it non-functional}
(i.e., ``junk'') instructions, and I
remind the reader that we defined the ``mer'' as the entropy of a
random monomer, normalized to lie between zero and one.

422:
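As a worked example with invented counts (not data from any real
alignment), take $N=8$ aligned DNA sequences of length $L=3$. Suppose the
first position is G in all eight sequences, the second is G in six and C
in two, and the third shows each of the four nucleotides twice. With the
convention $0\log 0=0$, the per-site entropies are
\[
H_1=0\;,\qquad
H_2=-\frac34\log_2\frac34-\frac14\log_2\frac14\approx 0.81\ {\rm bits}\;,\qquad
H_3=\log_2 4=2\ {\rm bits}\;,
\]
so that $H\approx 2.81$ bits and, from Eq.~(\ref{infomeasure}),
$I\approx 2\cdot 3-2.81=3.19$ bits. In mers (logarithms to base $D=4$,
i.e., half the value in bits) this is $J\approx 1.41$ and
$I=L-J\approx 1.59$ mers. With so few sequences the finite-sampling error
would be substantial in practice; the numbers are meant only to illustrate
the bookkeeping.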

423: \subsection{Application to DNA and RNA}

424: In the simplest case, the environment is essentially given by the

425: intra-cellular binding proteins, and the measure

426: (\ref{infomeasure}) can be used to investigate the information

427: content of DNA binding sites (this use of information theory was

428: pioneered by Schneider et al., 1986). Here, the sample of

429: sequences can be provided by a sample of equivalent binding sites

430: within a single genome.  For example, the latter authors aligned

431: the sequences of 149 {\it E. coli} and coliphage ribosome binding

432: sites in order to calculate the substitution probabilities at each

433: position of a 44 base pair region (which encompasses the 34

434: positions that can be said to constitute the binding site).

435: \begin{figure}[h]

436: \centerline{\psfig{figure=Adami-Fig1.ps,width=4in,angle=90}}

437:  \caption{Information content (in bits) of an {\it E. coli} ribosome

438:  binding site, aligned at the $f$Met-tRNA$_f$ initiation site (L=0),

439:  from Schneider et al.\ (1986).

440: \label{schneider}}

441: \end{figure}

Fig.~\ref{schneider} shows the information content as a function of position

443: (Schneider et al.\ 1986), where position $L=0$ is the first base

444: of the initiation codon. The information content is highest near

445: the initiation codon, and shows several distinct peaks. The peak

446: at $L=-10$ corresponds to the Shine-Dalgarno sequence (Shine and

447: Dalgarno, 1974).

448:

449: When the information content of a base is zero we must assume that it

450: has no function, i.e., it is neither expressed nor does anything

451: bind to it. Regions with positive information

452: content\footnote{Finite sampling of the substitution probabilities

453: introduces a systematic error in the information content, which

454: can be corrected (Miller 1954, Basharin 1959, Schneider et al.\

455: 1986). In the present case, the correction ensures that the

456: information content is approximately zero at the left and right

457: edge of the binding site.} carry information about the binding

458: protein, just as the binding protein carries information about the

459: binding site.

460:

It is important to emphasize that sites $L=1$ and $L=2$, for
example, have maximal information content because their
{\em conditional} entropy, Eq.~(\ref{cond}), vanishes.  The entropy is conditional

464: because only {\em given} the environment of binding proteins in which it functions

465: in {\em E. coli} or a coliphage, is the entropy zero. If there were, say, two

466: different proteins which could initiate translation at the same site (two different

467: environments), the conditional entropy of these sites could be higher. Intermediate

468: information content (between zero and 2 bits) signals the presence of {\em

469: polymorphisms} implying either non-specific binding to one protein or competition

470: between more than one protein for that site.

471:

472: A polymorphism is a deviation from the consensus sequence that is

473: not, as a rule, detrimental to the organism carrying it. If it

were, we would call it a ``deleterious mutation'' (or
just ``mutation''). The latter should be very infrequent as it
implies disease or death for the carrier. Polymorphisms, by contrast,
can establish themselves in the population. They either lead
to no change in the phenotype whatsoever, in which case we
may term them ``strictly neutral'', or they are deleterious by
themselves but neutral if associated with a commensurate
(compensatory) mutation either on the same sequence or somewhere
else.

483:

484: Polymorphisms are easily detected if we plot the per-site

485: entropies of a sequence vs.\ residue or nucleotide number in an

486: {\em entropy map} of a gene. Polymorphisms carry per-site

487: entropies intermediate between zero (perfectly conserved locus)

and unity (strictly neutral locus, with entropy measured in mers). Mutations, on the other hand,

489: (because they are deleterious) are associated with very low

490: entropy (Rogan and Schneider 1995), so polymorphisms stand out

491: among conserved regions and even mutations. In principle,

492: mutations can occur on sites which are themselves polymorphic;

493: those can only be detected by a more in-depth analysis of

494: substitution patterns such as suggested in Schneider (1997).

495: Because polymorphic sites in proteins are a clue to which sites

496: can easily be mutated, per-site entropies have also been

497: calculated for the directed evolution of proteins and enzymes

498: (Saven and Wolynes 1997, Voigt et al. 2001).

499:

500: \begin{figure}[!h]

501: \centerline{\psfig{figure=Adami-Fig2.ps,width=3in,angle=0}}

502:  \caption{Entropy (in bits) of {\it E. coli} tRNA (upper panel)

503:   from 5' (L=0) to 3' (L=76), from

504:  33 structurally similar sequences obtained from Sprinzl et al. (1996), where we

505:  arbitrarily set the entropy of the anti-codon to zero.

506: Lower panel: Same for 32 sequences of {\it B. subtilis} tRNA. \label{ecolisubt}}

507: \end{figure}

508:

509: As mentioned earlier, the actual function of a sequence is irrelevant for

510: determining its information content. In the previous example, the region

511: investigated was a binding site. However, any gene's information content can be

512: measured in such a manner. In Adami and Cerf (2000), the information content of the

513: 76 base pair nucleic acid sequence that codes for bacterial tRNA was investigated.

514: In this case the analysis is complicated by the fact that the tRNA sequence

515: displays secondary and tertiary structure, so that the entropy of those sites that

516: bind in Watson-Crick pairs, for example, is shared, reducing the information

517: content estimate based on Eq.~(\ref{info}) significantly. In Fig.~\ref{ecolisubt},

518: I show the entropy (in bits)

519: \begin{figure}[h]

520: \centerline{\psfig{figure=Adami-Fig3.eps,width=4.5in,angle=0}}

521: \caption{Secondary structure of tRNA molecule, with bases colored

522: black for low entropy ($0\leq H\leq0.3$ mers), grey for

523: intermediate ($0.3<H\leq 0.7$ mers), and white for maximal entropy

524: ($0.7<H\leq 1.0$ mers), numbered 1-76 (entropies from {\it E.

525: coli} sequences). \label{tRNA}}

526: \end{figure}

527: derived from 33 structurally similar sequences of {\it E. coli}

528: tRNA (upper panel) and 32 sequences of {\it B. subtilis} tRNA,

529: respectively, obtained from the EMBL nucleotide sequence library

530: (Sprinzl et al. 1996). Note how similar these entropy maps are

531: across species (even though they last shared an ancestor over 1.6

532: billion years ago), indicating that the profiles are

533: characteristic of the {\em function} of the molecule, and thus

534: statistically stable.

535:

536: Because of base-pairing, we should not expect to be able to simply

537: sum up the per-site entropies of the sequence to obtain the

538: (conditional) sequence entropy. The pairing in the stacks (the

539: ladder-like arrangement of bases that bind in pairs) of the

540: secondary structure (see Fig.~\ref{tRNA}) reduces the actual

541: entropy, because two nucleotides that are bound together {\em

542: share} their entropy. This is an example where {\em epistatic

543: correlations} are important. Two sites (loci) are called epistatic

544: if their contributions to the sequence's fitness are not

545: independent, in other words, if the probability to find a

546: particular base at one position depends on the identity of a base

547: at another position. Watson-Crick-binding in stacks is the

548: simplest such example; it is also a typical example of the

549: maintenance of polymorphisms in a population because of functional

550: association. Indeed, the fact that polymorphisms are correlated in

551: stacks makes it possible to deduce the secondary structure of an

552: RNA molecule from sequence information alone.
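
To see what sharing entropy means quantitatively, consider an idealized pair of sites (a toy case assumed for illustration, not one of the measured pairs): site $a$ is equally likely to be any of the four bases, and site $b$ is always its Watson-Crick complement. Only four of the sixteen possible base pairs then occur, each with joint probability $1/4$, so that with entropies measured in mers (logarithms to base 4)
\begin{eqnarray*}
H(a)&=&H(b)=-4\times\frac{1}{4}\log_4\frac{1}{4}=1\;,\\
H(a,b)&=&-4\times\frac{1}{4}\log_4\frac{1}{4}=1\;,\\
H(a)+H(b)-H(a,b)&=&1\;.
\end{eqnarray*}
Summing the per-site entropies would give 2 mers, twice the actual joint entropy of 1 mer. If the two sites varied independently instead, we would have $H(a,b)=2$ and the shared entropy would vanish.
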

553: \begin{figure}[h]

554: \centerline{\psfig{figure=Adami-Fig4.ps,width=3in,angle=0}}

555: \caption{Mutual

556:  entropy (information) between base 28 and bases 39 to 45 (information

557:  is normalized to $I_{\rm max}=1$ by taking logarithms to base 4). Because

558:  finite sample size corrections of higher order have been neglected, the

559:  information estimate can appear to be negative by an amount of the order of this

560:  error.  \label{spike}}

561: \end{figure}

562: Take, for example, nucleotide $L=28$ (in the anti-codon stack) which is bound to

563: nucleotide $L=42$, and let us measure entropies in mers (by taking logarithms to

564: the base 4). The mutual entropy between $L=28$ and $L=42$ (in {\em E. coli}) can be

565: calculated using Eq. (4):

566: \begin{equation}

567: I(28:42)=H(28)+H(42)-H(28,42)=0.78\;. \label{info28}

568: \end{equation}

569: Thus indeed, these two bases share almost all of their entropy. We

570: can see furthermore that they share very little entropy with any

571: other base. Note that, in order to estimate the entropies in

572: Eq.~(\ref{info28}), we applied a first-order correction that takes

573: into account a bias due to the finite size of the sample, as

574: described in Miller (1954). This correction amounts to $\Delta

575: H_1=3/(132 \ln2)$ for single nucleotide entropies, and $\Delta

576: H_2=15/(132 \ln2)$ for the joint entropy. In Fig.~\ref{spike}, I

577: plot the mutual entropy of base 28 with bases 39 to 45

578: respectively, showing that base 42 is picked out unambiguously.

579: Such an analysis can be carried out for all pairs of nucleotides,

580: so that the secondary structure of the molecule is revealed

581: unambiguously (see, e.g., Durbin et al. 1998). In

582: Fig.~\ref{wagenaar}, I show the mutual entropy (in bits) for all pairs of

583: bases of the set of {\it E. coli} sequences used to produce the

584: entropy map in Fig.~\ref{ecolisubt}, which demonstrates how the

585: paired bases in the four stems stand out.

586: \begin{figure}[h]

587: \centerline{\psfig{figure=Adami-Fig5.eps,width=5in,angle=0}}

588: \caption{Mutual

589:  entropy (information) between all bases (in bits), colored according

590:  to the color bar on the right, from 33 sequences of {\it E. coli} tRNA.

591:  The four stems are readily identified by their correlations as indicated.

592:   \label{wagenaar}}

593: \end{figure}
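
The size of the finite-sample correction quoted above can be checked by hand. The following is a minimal sketch, assuming the standard first-order bias term $(s-1)/(2N)$ in natural units, where $s$ is the number of possible states and $N$ the number of sequences (the symbols $s$ and $N$ are introduced here for illustration). Converting to mers divides by $\ln 4=2\ln 2$, so that with $N=33$
\begin{eqnarray*}
\Delta H_1&=&\frac{4-1}{2\times 33\times 2\ln 2}=\frac{3}{132\ln 2}\approx 0.033\;,\\
\Delta H_2&=&\frac{16-1}{2\times 33\times 2\ln 2}=\frac{15}{132\ln 2}\approx 0.164\;.
\end{eqnarray*}
Because each correction is added to the corresponding naive entropy estimate, the net shift of the mutual entropy $H(28)+H(42)-H(28,42)$ is $2\Delta H_1-\Delta H_2=-9/(132\ln 2)\approx -0.098$ mers. This is why estimates for uncorrelated pairs scatter around zero and can occasionally come out slightly negative, as in Fig.~\ref{spike}.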

594:

595: Since we found that most bases in stacks share almost all of their

596: entropy with their binding partner, it is easy to correct formula

597: (\ref{infomer}) to account for the epistatic effects of

598: stack-binding: We only need to subtract from the total length of

599: the molecule (in mers) the number of bases involved in stack

600: binding. In a tRNA molecule (with a secondary structure as in

601: Fig.~\ref{tRNA}) there are 21 such bases, so the sum in

602: Eq.~(\ref{cond}) should only go over the 52 ``reference

603: positions''\footnote{We exclude the three anticodon-specifying

604: bases from the entropy calculation because they have zero

605: conditional entropy by {\em definition} (they cannot vary among a

606: tRNA-type because it would change the type). However, the

607: substitution probabilities are obtained from mixtures of {\em

608: different} tRNA-types, and therefore appear to deviate from zero or one.}.

609: For {\it E. coli}, the entropy summed over the reference positions

610: gives $H\approx 24$ mers, while the {\it B. subtilis} set gives

611: $H\approx 21$ mers. We thus conclude that bacterial tRNA stores

612: between 52 and 55 mers of information about its environment (104-110

613: bits).
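
The arithmetic behind this estimate can be spelled out. As a sketch, assuming that the information content is obtained as the sequence length ($L=76$ mers) minus the entropy summed over the reference positions, with the stack-bound partner bases and the anticodon contributing no entropy of their own:
\begin{eqnarray*}
I_{\rm E.\,coli}&\approx&76-24=52\ {\rm mers}=104\ {\rm bits}\;,\\
I_{\rm B.\,subtilis}&\approx&76-21=55\ {\rm mers}=110\ {\rm bits}\;,
\end{eqnarray*}
where the conversion uses $1\ {\rm mer}=\log_2 4=2$ bits for nucleotides.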

614:

615: This type of sequence analysis combining structural and complexity

616: information has recently been used to quantify the information

617: gain during in-vitro evolution of catalytic RNA molecules

618: (ribozyme ligases) (Carothers et al.\

619: 2004)\nocite{Carothersetal2004}. The authors evolved RNA aptamers

620: that bind GTP (guanosine triphosphate) with different catalytic

621: effectiveness (different functional capacity) from a mutagenized

622: sequence library. They found 11 different classes of ribozymes,

623: whose structure they determined using the correlation analysis

624: outlined above. They subsequently measured the amount of

625: information in each structure [using Eq.~(\ref{infomeasure}) and

626: correcting for stack binding as described above] and showed that

627: ligases with higher affinity for the substrate had more complex

628: secondary structure {\em and} stored more information.

629: Furthermore, they found that the information estimate based on

630: Eq.~(\ref{infomeasure}) was consistent with an interpretation in

631: terms of the amount of information necessary to specify the

632: particular structure in the given environment. Thus, at least in

633: this restricted biochemical example, structural, functional, and

634: informational complexity seem to go hand in hand.

635:

636:

637: \subsection{Application to Proteins}

638: If the secondary structure of RNA and DNA enzymes can be predicted

639: based on correlations alone, what about protein secondary

640: structure? Because proteins fold and function via the interactions

641: among the amino acids they are made of, these interactions should,

642: in evolutionary time, lead to correlations between residues so

643: that the fitness effect of an amino acid substitution at one

644: position will depend on the residue at another position. (Care

645: must be taken to avoid contamination from correlations that are

646: due entirely to a common evolutionary path, see Wollenberg and

647: Atchley 2000; Govindarajan et al.\

648: 2003.)\nocite{WollenbergAtchley2000,Govindarajanetal2003} Such an

649: analysis has been carried out on a number of different molecule

650: families, such as the V3 loop region of HIV-1 (Korber et al.

651: 1993)\nocite{Korberetal1993}, which shows high variability (high

652: entropy) and strong correlations between residues (leading to

653: shared entropy) that are due to functional constraints. These

654: correlations have also been modelled (Giraud et al.\

655: 1998)\nocite{Giraudetal1998}.

656:

657: A similar analysis for the homeodomain sequence family was

658: performed by Clarke (1995)\nocite{Clarke1995}, who was able to

659: detect 16 strongly co-varying pairs in this 60 amino acid binding

660: motif. However, determining secondary structure based on these

661: correlations alone is much more difficult, because proteins do not

662: fold neatly into stacks and loops as does RNA. Also, residue

663: covariation does not necessarily indicate physical proximity

664: (Clarke 1995)\nocite{Clarke1995}, even though the strongest

665: correlations are often due to salt-bridges. But the correlations

666: can at least help in eliminating some models of protein structure

667: (Clarke 1995).

668:

669: Atchley et al.\ (2000)\nocite{Atchleyetal2000} carried out a

670: detailed analysis of correlated mutations in the bHLH (basic

671: helix-loop-helix) protein domain of a family of transcription

672: factors. Their set covered 242 proteins across a large number of

673: vertebrates that could be aligned to detect covariation. They

674: found that amino acid sites known to pack against each other

675: showed low entropy, whereas exposed non-contact sites exhibited

676: significantly larger entropy. Furthermore, they determined that a

677: significant amount of the observed correlations between sites was

678: due to functional or structural constraints that could help in

679: elucidating the structural, functional, and evolutionary dynamics

680: of these proteins (Atchley et al. 2000).

681:

682: Some attempts have been made to study the {\em thermodynamics} of protein

683: structures and relate it to the sequence entropy (Dewey 1997)\nocite{Dewey1997}, by

684: studying the mutual entropy between protein sequence and {\em structure}. This line

685: of thought is inspired by our concept of the genotype-phenotype map, which implies

686: that sequence should predict structure. If we hypothesize a structural entropy of

687: proteins $H({\rm str})$, obtained for example as the logarithm of the possible

688: stable protein structures for a given chain length (and a given environment), then

689: we can write down the mutual entropy between structure and sequence simply as

690: \begin{equation}

691: I({\rm seq}:{\rm str})=H({\rm seq})-H({\rm seq|str})\;, \label{seqstr}

692: \end{equation}

693: where $H({\rm seq})$ is the entropy of sequences of length $L$, given by $L$ (in mers), and

694: $H({\rm seq|str})$ is the entropy of sequences {\em given} the structure. If we

695: assume that the environment perfectly dictates structure (i.e., if we assume that

696: only one particular structure will perform any given function) then

697: \begin{equation}

698: H({\rm seq|str})\approx H({\rm seq}|{\rm env})

699: \end{equation}

700: and $I({\rm str}:{\rm seq})$ is then roughly equal to the physical

701: complexity defined earlier. Because $H({\rm str|seq})=0$ (per the

702: above assumption that any given sequence gives rise to exactly one

703: structure), we can rewrite (\ref{seqstr}) as

704: \begin{equation}

705: I({\rm seq}:{\rm env})\approx I({\rm seq}:{\rm str})=H({\rm

706: str})-\underbrace{H({\rm str}|{\rm seq})}_{=0} \;,

707: \end{equation}

708: i.e., the mutual entropy between sequence and structure only tells

709: us that the thermodynamical entropy of possible protein structures

710: is limited by the amount of information about the environment

711: coded for by the sequence. This is interesting because it implies

712: that sequences that encode more information about the environment

713: are also potentially more complex, a relationship we discussed

714: earlier in connection with ribozymes (Carothers et al.\ 2004).

715: Note, however, that the assumption that only one particular

716: structure will perform any given function need not hold. Szostak

717: (2003), for example, advocates a definition of {\em functional

718: information} that allows for different structures carrying out an

719: equivalent biochemical function.
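
The step from Eq.~(\ref{seqstr}) to the expression in terms of $H({\rm str})$ relies on the symmetry of the mutual entropy. This follows from writing the joint entropy in its two equivalent forms (a standard identity, see, e.g., Cover and Thomas 1991):
\begin{eqnarray*}
H({\rm seq},{\rm str})&=&H({\rm seq})+H({\rm str}|{\rm seq})=H({\rm str})+H({\rm seq}|{\rm str})\;,\\
I({\rm seq}:{\rm str})&=&H({\rm seq})-H({\rm seq}|{\rm str})=H({\rm str})-H({\rm str}|{\rm seq})\;.
\end{eqnarray*}
Setting $H({\rm str}|{\rm seq})=0$ then gives $I({\rm seq}:{\rm str})=H({\rm str})$.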

720:

721: \section{Molecular Interactions and Resistance}

722:

723: One of the more pressing concerns in bioinformatics is the identification of DNA

724: protein-binding regions, such as promoters, regulatory regions, and splice

725: junctions. The common method to find such regions is through {\em sequence

726: identity}, i.e., known promoter or binding sites are compared to the region being

727: scanned (e.g., via freely available bioinformatics software such as BLAST), and a

728: ``hit'' results if the scanned region is sufficiently identical according to a

729: user-specified threshold. Such a method cannot, of course, find {\em unknown}

730: binding sites, nor can it detect interactions between proteins, which is another

731: one of bioinformatics' holy grails (see, e.g., Tucker et al.

732: 2001)\nocite{TuckerGeraUetz2001}. Information theory can in principle detect

733: interactions between different molecules (such as DNA-protein or protein-protein

734: interactions) from {\it sequence heterogeneity}, because interacting pairs share

735: {\em correlated mutations}, which arise as follows.

736:

737: \subsection{Detecting Protein-Protein and DNA-Protein Interactions}

738: Imagine two proteins bound to each other, while each protein has

739: some entropy in its binding motif (substitutions that do not

740: affect structure). If a mutation in one of the proteins leads to

741: severely reduced interaction specificity, the substitution is

742: strongly selected against. It is possible, however, that a {\em

743: compensatory} mutation in the binding partner restores

744: specificity, such that the {\em pair} of mutations together is

745: neutral (and will persist in the population), while each mutation

746: by itself is deleterious. Over evolutionary time, such pairs of

747: correlated mutations will establish themselves in populations and

748: in homologous genes across species, and could be used to identify

749: interacting pairs. This effect has been seen previously in the

750: Cytochrome c/Cytochrome oxidase (CYC/COX) heterodimer (Rawson and

751: Burton 2002)\nocite{RawsonBurton2002} of the marine copepod {\it

752: Tigriopus californicus}. In Rawson and Burton (2002), the authors performed crosses between

753: the San Diego (SD) and Santa Cruz (SC) variants from two natural

754: allopatric populations that have long, independent evolutionary

755: histories. Inter-population crosses produced strongly reduced

756: activity of the cytochrome complex, while intra-population crosses

757: were vigorous. Indeed, the SD and SC variants of COX differ by at

758: least 30 amino acid substitutions, while the smaller CYC has up to

759: 5 substitutions. But can these correlated mutations be found from

760: sequence data alone? This turns out to be a difficult

761: computational problem unless it is known precisely which member of

762: a set of $N$ sequences of one binding partner binds to which

763: member of a set of $N$ of the other. Unless we are in possession

764: of this $N$ to $N$ assignment, we cannot calculate the joint

765: probabilities $p_{ij}$ that go into the calculation of the mutual

766: entropies such as Eq.~(\ref{info28}) that reveal correlated

767: mutations.

768:

769: Of course, if we have one pair of sequences from $N$ species of organisms with the

770: same homologous gene, the assignment is automatically implied. In the absence of

771: such an assignment, it may be possible to recover the correct matches from two sets

772: of $N$ sequences by searching for the assignment with the highest mutual entropy,

773: because we can safely assume that the correct assignment maximizes the correlations

774: (Adami and Thomsen 2004)\nocite{AdamiThomsen2004}. However, this is a difficult

775: search problem because the number of possible assignments scales like $N!$. Still,

776: because correlated mutations due to coevolution seem to be relatively common

777: (Bonneton et al. 2003)\nocite{Bonnetonetal2003}, this would be a useful tool for

778: revealing those residues involved in binding, or even in protein-protein

779: interaction prediction.
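
To give a sense of scale for this search (an illustrative count; the set size $N=20$ is a made-up example), note that the number of one-to-one assignments between two sets of $N$ sequences is $N!$, so that
\[
20!=2\,432\,902\,008\,176\,640\,000\approx 2.4\times10^{18}\;.
\]
Even for such a modest set, exhaustive enumeration is out of reach, and a heuristic search over assignments would appear to be required.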

780:

781: In principle, the information-theoretical method described above

782: can identify {\em unknown} binding sites by

783: identifying complementary patterns (between binding sites and

784: protein coding regions), if the binding regions are not

785: well-conserved, i.e., when the binding site and the corresponding

786: transcription factor carry a reasonable amount of polymorphisms,

787: and if enough annotation exists to identify the genomic partners

788: that correspond to each other in a set. If a sufficient number of

789: transcription-factor/binding-domain pairs can be sequenced, an

790: information-theoretic analysis could conceivably reveal genomic

791: regulatory regions that standard sequence analysis methods miss.

792: For example, it was suggested recently (Brown and Callan,

793: 2004)\nocite{BrownCallan2004} that the cAMP receptor protein (CRP,

794: a transcription factor that regulates many {\it E. coli} genes)

795: binds to a number of entropic sites in {\it E. coli}, i.e., sites

796: that are not strictly conserved, but that still retain

797: functionality (see also Berg and von Hippel, 1987).

798:

799: \begin{figure}[h]

800: \centerline{\psfig{figure=Adami-Fig6.ps,width=4in,angle=90}}

801: \caption{Normalized ($0\le H\le1$) entropy of HIV-1 protease in

802: mers, as a function of residue number, using 146 sequences from

803: patients exposed to a protease inhibitor drug (entropy is

804: normalized to $H_{\rm max}=1$ per amino acid by taking logarithms

805: to base 20).\label{hivent}}

806: \end{figure}

807:

808:

809: \subsection{Tracking Drug Resistance}

810:

811: An interesting pattern of mutations can be observed in the

812: protease of HIV-1, a protein that binds to particular motifs on a

813: virus polyprotein, and then cuts it into functional pieces.

814: Resistance to protease inhibitors (small molecules designed to

815: bind to the ``business end'' of the protease, thereby preventing

816: its function) occurs via mutations in the protease that do not

817: change the protease's cutting function (proteolysis), while

818: preventing the inhibitor from binding to it. Information theory can be

819: used to study whether mutations are involved in drug resistance or

820: whether they are purely neutral, and to discover correlated

821: resistance mutations.

822:

823: The emergence of resistance mutations in the protease after

824: exposure to antiviral drugs has been well studied (Molla et al.

825: 1996, Schinazi, Larder, and Mellors 1999). The entropy map of HIV

826: protease in Fig.~\ref{hivent}\footnote{The map was created using

827: 146 sequences obtained from a cohort in Luxembourg, and deposited

828: in GenBank (Servais et al.\ 1999 and 2001a).}  (on the level of

829: amino acids) reveals a distinctive pattern of polymorphisms and

830: only two strictly conserved regions. HIV protease {\em not}

831: exposed to inhibitory drugs, on the other hand, shows three such

832: conserved regions (Loeb et al. 1989)\nocite{Loebetal1989}. It is

833: believed that the polymorphisms contribute to resistance mutations

834: involved in HAART (Highly Active Antiretroviral Therapy) failure

835: patients (Servais et al.\ 2001b). But, as a matter of fact, many

836: of the observed polymorphisms can be observed in treatment-naive

837: patients (Kozal et al. 1996, and Lech et al. 1996) so it is not

838: immediately clear which of the polymorphic sites are involved in

839: drug resistance.

840:

841: In principle, exposure of a population to a new environment can

842: lead to fast adaptation if the mutation rate is high enough. This

843: is certainly the case with HIV. The adaptive changes generally

844: fall into two classes: mutations in regions that were previously

845: conserved (true resistance mutations), and changes in the

846: substitution pattern on sites that were previously polymorphic. In

847: the case of HIV-1 protease, both patterns seem to contribute. In

848: Fig.~\ref{figdiff}, I show the {\it changes} in the entropic

849: profile of HIV-1 protease obtained from a group of patients before

850: and six months after treatment with high doses of saquinavir (a

851: protease inhibitor). Most spikes are positive, in particular the

852: changes around residues 46-56, a region that is well-conserved in

853: treatment-naive proteases, and that is associated with a {\em

854: flap} in the molecule that must be flexible and that extends over

855: the substrate binding cleft (Shao et al. 1997). Mutations in that

856: region indeed appeared on sites that were previously uniform,

857: while some changes occurred on polymorphic sites (negative

858: spikes). For those, exposure to the new environment usually

859: reduced the entropy at that site.
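
The sign of a spike can be read off from a simple worked case (hypothetical frequencies, not taken from the patient data). With entropies in mers for amino acids (logarithms to base 20), a site that is uniform before treatment has $H_0=0$. If, after exposure, two residues each occur with frequency $1/2$ at that site, then
\[
H_{26}=-2\times\frac{1}{2}\log_{20}\frac{1}{2}=\frac{\ln 2}{\ln 20}\approx 0.23\;,\qquad \Delta H=H_{26}-H_0\approx +0.23\;,
\]
which is a positive spike. A site that starts out with this two-residue polymorphism and becomes fixed for a single residue would instead show $\Delta H\approx -0.23$.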

860:

861: Some of the resistance mutations actually appear in pairs,

862: indicating that they may be compensatory in nature (Leigh Brown et

863: al. 1999, Hoffman et al. 2003, Wu et al. 2003)\nocite{Wuetal2003}.

864: The strongest association occurs between residues 54 and 82, the

865: former associated with the flap, and the latter right within the

866: binding cleft. This association does not occur in treatment-naive

867: patients, but stands out strongly after therapy (such correlations

868: are easily detected by creating mutual entropy graphs such as

869: Fig.~\ref{wagenaar}, data not shown). The common explanation for

870: this covariation is again compensation: while a mutation in the

871: flap or in the cleft leads to reduced functionality of the

872: protease, both together restore function while evading the

873: inhibitor.

874:

875: \begin{figure}[h] \centerline{\psfig{figure=Adami-Fig7.ps,width=4in,angle=90}}

876: \caption{Change in per-site entropy of HIV-1 protease after six

877: months of exposure to saquinavir, $\Delta$Entropy=$H_{26}-H_0$,

878: where $H_{26}$ is the entropy after 26 weeks of exposure. The

879: entropies were obtained from 34 sequences before and after

880: exposure, available through GenBank (Schapiro et al.\ 1997). The

881: three highest (positive) spikes are associated with the well-known

882: resistance mutations G48V, T74(A,S), and L90M, respectively.

883: }\label{figdiff}

884: \end{figure}

885:

886: \subsection{Information-theoretic Drug Design}

887: Because many of the protease polymorphisms are prevalent in

888: treatment-naive patients, we must assume that they are either

889: neutral, or that the steric changes they entail do not impede the

890: protease's proteolytic activity while failing to bind the protease

891: inhibitor. Thus, a typical protease population is a mixture of

892: polymorphic molecules (polymorphic both in genotype and in

893: structure, see Maggio et al. 2002)\nocite{Maggioetal2002} that can

894: outsmart a drug designed for a single protease type relatively

895: easily. An interesting alternative in drug design would therefore

896: use an entropic mixture of polymorphisms, or ``quasispecies''

897: (Eigen 1971) as the drug target. Such a drug would {\em itself}

898: form a quasispecies rather than a pure drug. Indeed, an analysis

899: of the information content of realistic ensembles shows that

900: consensus sequences are exceedingly rare in real populations

901: (Schneider 1997), and certainly absent in highly variable ones

902: such as HIV proteases. The absence of a consensus sequence is also

903: predicted for molecules evolving at the {\em error threshold}

904: (Eigen 1971), which is very likely in these viruses.

905:

906: The ideal {\em superdrug} should represent a mixture of inhibitors

907: that is perfectly tuned to the mixture of proteases. What this

908: mixture is can be determined with information theory, by ensuring

909: that the ensemble of inhibitors {\em co-varies} with the protease,

910: such as to produce tight binding even in the presence of mutations

911: (or more precisely {\em because} of the presence of mutations).

912: The substitution probabilities of the inhibitor ensemble would be

913: obtained by maximizing the mutual entropy (information) between

914: the protease and an inhibitor library obtained by combinatorial

915: methods, either on a nucleotide or on the amino acid level (Adami

916: 2002b)\nocite{Adami2002b}. If such a procedure could create a drug

917: that successfully inhibits resistance mutations, we could no

918: longer doubt the utility of information theory for molecular

919: biology.

920:

921:

922: \section{Conclusions}

923: Information theory is not widely used in bioinformatics today even

924: though, as the name suggests, it should be {\em the} relevant

925: theory for investigating the information content of sequences. The

926: reason for the neglect appears to be a misunderstanding of the

927: concepts of entropy versus information throughout most of the

928: literature, which has led to the widespread perception that the

929: theory is inadequate. Instead, I point out that Shannon's theory precisely

930: defines both entropy and information, and that our intuitive

931: concept of information coincides with the mathematical notion.

932: Using these concepts, it is possible in principle to distinguish

933: information-coding regions from random ones in ensembles of

934: genomes, and thus quantify the information content. A thorough

935: application of this program should resolve the C-paradox, that is,

936: the absence of a correlation between the size of the genome and

937: the apparent complexity of an organism (Cavalier-Smith 1985), by

938: distinguishing information that contributes to complexity from

939: non-functional stretches that do not. However, this is a challenge

940: for the future because of the dearth of multiply sequenced

941: genomes.

942:

943: Another possible application of information theory in molecular

944: biology is the association of regulatory molecules with their

945: binding sites or even protein-protein interactions, in the case

946: where transcription factors and their corresponding binding site

947: show a good amount of polymorphism (methods based on correlated

948: heterogeneity), and the binding association between pairs can be

949: established. This approach is complementary to sequence comparison

950: of conserved regions (methods based on sequence identity), in

951: which information theory methods cannot be used because zero

952: (conditional) entropy regions cannot share entropy. Conversely,

953: sequence comparison methods must fail if polymorphisms are too

954: pronounced. Finally, the recognition of the polymorphic (or

955: quasispecies) nature of many viral proteins suggests an

956: information theory based approach to drug design in which the

957: quasispecies of proteins---rather than the consensus sequence---is

958: the drug target, by maximizing the information shared between the

959: target and drug ensembles.

960:

961:

962: \vskip 0.25cm \noindent{\bf Acknowledgements} \vskip 0.25cm I

963: thank David Baltimore and Alan Perelson for discussions, and Titus

964: Brown for comments on the manuscript. Thanks are also due to

965: Daniel Wagenaar for producing Fig.~5. This work was supported by the

966: National Science Foundation under grant DEB-9981397.

967:

968:

969: \begin{thebibliography}{99}

970:

971: \setlength{\itemindent}{-0.5cm}

972:

973: \bibitem[]{Adami2002a}Adami, C. 2002a. What is complexity? BioEssays 24:1085-1094.

974:

975: \bibitem[]{Adami2002b}Adami, C. 2002b. Combinatorial drug design augmented by

976: information theory. NASA Tech Briefs 26:52.

977:

978: \bibitem[]{AdamiThomsen2004}Adami, C. and S.W. Thomsen 2004. Predicting protein-protein interactions from sequence data. In: Hicks MG, editor. {\it The Chemical

979: Theatre of Biological Systems}. Proceedings of the Beilstein-Institute

980: Workshop.

981:

982:

983: \bibitem[]{adami99} Adami, C. and N.J. Cerf 1999. Prolegomena to a non-equilibrium

984: quantum statistical mechanics. Chaos, Solitons, and Fractals

985: 10:1637-1650.

986:

987: \bibitem[]{Adamietal2000} Adami, C., C. Ofria and T.C. Collier 2000. Evolution of

988: biological complexity. Proc. Natl. Acad. Sci. USA 97:4463-4468.

989:

990: \bibitem[]{adami00}Adami, C. and N.J. Cerf. 2000. Physical complexity

991: of symbolic sequences. Physica D 137:62-69.

992:

993: \bibitem[]{Atchleyetal2000}Atchley W.R., K.R. Wollenberg, W.M. Fitch, W. Terhalle,

994: and A.D. Dress 2000. Correlations among amino acid sites in bHLH protein domains:

995: An information-theoretic analysis. Mol. Biol. Evol. 17:164-178.

996:

997: \bibitem[]{basharin59}Basharin, G.P. 1959. On a statistical estimate

998: for the entropy of a sequence of random variables. Theory Probability Appl. 4:333.

999:

1000: \bibitem[]{BergvonHippel1987}Berg, O.G. and P.H. von Hippel 1987. Selection of DNA binding sites by regulatory proteins. II. The binding specificity of

1001: cyclic AMP receptor protein to recognition sites. J. Mol. Biol.

1002: 200:709-723.

1003:

1004: \bibitem[]{Bonnetonetal2003}Bonneton, F., D. Zelus, T. Iwema, M. Robinson-Rechavi,

1005: and V. Laudet 2003. Rapid divergence of the Ecdysone receptor in Diptera and

1006: Lepidoptera suggests coevolution between ECR and USP-RXR. Mol. Biol. Evol.

1007: 20:541-553.

1008:

1009: \bibitem[]{BrownCallan2004}Brown, C.T. and C.G. Callan, Jr. 2004. Evolutionary comparisons suggest

1010: many novel CRP binding sites in {\it E. coli}. Proc. Natl. Acad. Sci. USA 101:2404-2409.

1011:

1012:

1013: \bibitem[]{Carothersetal2004}Carothers, J.M., S.C. Oestreich, J.H. Davis and J.W.

1014: Szostak 2004. Informational complexity and functional activity. J. Am. Chem.

1015: Society 126 (in press).

1016:

1017: \bibitem[]{cavalier85}Cavalier-Smith, T. 1985. {\it The Evolution of

1018: Genome Size}, ed. Cavalier-Smith, T. (Wiley, N.Y.)

1019:

1020: \bibitem[]{Clarke1995}Clarke, N.D. 1995. Covariation of residues in the homeodomain

1021: sequence family. Prot. Sci. 4:2269-2278.

1022:

1023: \bibitem[]{CoverThomas1991}Cover, T.M. and J.A. Thomas 1991. {\it Elements of

1024: Information Theory}. (John Wiley, New York).

1025:

1026: \bibitem[]{Dandekaretal2000}Dandekar, T. et al.\ 2000.

1027: Re-annotating the {\it Mycoplasma pneumoniae} genome sequence:

1028: Adding value, function, and reading frames. Nucl. Acids Res.

1029: 28:3278-3288.

1030:

1031: \bibitem[]{Dewey1997}Dewey, T.G. 1997. Algorithmic complexity and thermodynamics of

1032: sequence-structure relationships in proteins. Phys. Rev. E 56:4545-4552.

1033:

1034: \bibitem[]{durbin85}Durbin, R., S. Eddy, A. Krogh and G. Mitchison 1998. {\it Biological

1035: Sequence Analysis}. (Cambridge University Press, Cambridge, UK).

1036:

1037: \bibitem[]{eigen71}Eigen, M. 1971. Self-organization of matter and the

1038: evolution of macromolecules. Naturwissenschaften 58:465.

1039:

1040: \bibitem[]{Giraudetal1998}Giraud, B.G., A. Lapedes and L.C. Liu 1998. Analysis of

1041: correlations between sites of model protein sequences. Phys. Rev. E 58:6312-6322.

1042:

1043: \bibitem[]{Govindarajanetal2003}Govindarajan, S., J.E. Ness, S. Kim, E.C. Mundorff, J. Minshull, and C.

1044: Gustafsson 2003. Systematic variation of amino acid substitutions for stringent

1045: assignment of pairwise covariation. J. Mol. Biol. 328:1061-1069.

1046:

1047: \bibitem[]{Grosseetal2000}Grosse, I., H. Herzel, S.V. Buldyrev,

1048: and H.E. Stanley 2000. Species independence of mutual information

1049: in coding and non-coding DNA. Phys. Rev. E 61:5624-5629.

1050:

1051: \bibitem[]{Hoffmanetal2003}Hoffman, N.G., C.A. Schiffer, and R. Swanstrom 2003.

1052: Covariation of amino acid positions in HIV-1 protease. Virology 314:536-548.

1053:

1054: \bibitem[]{Korberetal1993}Korber, B.T.M., R.M. Farber, D.H. Wolpert, and A.S. Lapedes

1055: 1993. Covariation of mutations in the V3 loop of human immunodeficiency virus type

1056: 1 envelope protein: An information-theoretic analysis. Proc. Natl. Acad. Sci. USA

1057: 90:7176-7180.

1058:

1059: \bibitem[]{kozaletal96}Kozal, M.J., et al.\ 1996. Extensive polymorphisms

1060: observed in HIV-1 clade B protease gene using high density

1061: oligonucleotide arrays. Nature Medicine 2:753-759.

1062:

1063: \bibitem[]{lechetal96}Lech, W.J., et al.\ 1996. In vivo sequence diversity of the

1064: protease of human immunodeficiency virus type 1: {P}resence of protease inhibitor

1065: resistant variants in untreated subjects. J. Virol. 70:2038-2043.

1066:

1067: \bibitem[]{LeighBrownetal1999}Leigh Brown, A.J., B.T. Korber, and J.H. Condra 1999.

1068: Associations between amino acids in the evolution of HIV type 1 protease sequences

1069: under indinavir therapy. AIDS Res. Hum. Retrovir. 15:247-253.

1070:

1071: \bibitem[]{Loebetal1989}Loeb, D.D., R. Swanstrom, L. Everitt, M. Manchester,

1072: S.E. Stamper, and C.A. Hutchison III 1989. Complete mutagenesis of the HIV-1

1073: protease. Nature 340:397-400.

1074:

1075: \bibitem[]{Maggioetal2002}Maggio, E.T., M. Shenderovich, R. Kagan, D. Goddette, and

1076: K. Ramnarayan 2002. Structural pharmacogenetics, drug resistance, and the design of

1077: anti-infective superdrugs. Drug Discovery Today 7:1214-1220.

1078:

1079: \bibitem[]{maynard99b}Maynard Smith, J. 1999a. The idea of

1080: information in biology. Quart. Rev. Biol. 74:395-400.

1081:

1082: \bibitem[]{maynard99a}Maynard Smith, J. 1999b. The concept of

1083: information in biology. Philo. Sci. 67:177-194.

1084:

1085: \bibitem[]{miller54}Miller, G.A. 1954. Note on the bias of information estimates.

1086: In H. Quastler, ed., {\it Information Theory and Psychology}, pp. 95-100 (The Free

1087: Press, Illinois).

1088:

1089: \bibitem[]{mollaetal96}Molla, A. et al.\ 1996. Ordered accumulation of mutations in

1090: HIV protease confers resistance to ritonavir. Nature Medicine

1091: 2:760-766.

1092:

1093: \bibitem[]{OtuSayood2003}Otu, H.H. and K. Sayood 2003. A

1094: divide-and-conquer approach to fragment assembly. Bioinformatics

1095: 19:22-29.

1096:

1097: \bibitem[]{quast53}Quastler, H. (Ed.) 1953. {\it Information Theory in Biology}

1098: (University of Illinois Press, Urbana).

1099:

1100: \bibitem[]{RawsonBurton2002}Rawson, P.D. and R.S. Burton 2002. Functional

1101: coadaptation between cytochrome c and cytochrome oxidase within allopatric

1102: populations of a marine copepod. Proc. Natl. Acad. Sci. USA 99:12955-12958.

1103:

1104: \bibitem[]{roganschneider95}Rogan, P.K.  and T.D. Schneider 1995. Using

1105: information content and base frequencies to distinguish mutations from

1106: genetic polymorphisms. Hum. Mut. 6:74-76.

1107:

1108: \bibitem[]{sarkar96}Sarkar, S. 1996. Decoding ``coding''---information

1109: and DNA.  BioScience 46:857-864.

1110:

1111: \bibitem[]{saven97}Saven, J.G., and P.G. Wolynes 1997. Statistical

1112: mechanics of the combinatorial synthesis and analysis of folding

1113: macromolecules. J. Phys. Chem. B 101:8375-8389.

1114:

1115: \bibitem[]{schapiroetal97} Schapiro, J.M., M.A. Winters, F. Stewart, B. Efron, J.

1116: Norris, M.J. Kozal, and T.C. Merigan 1997. The effect of high-dose

1117: saquinavir on viral load and CD4+ T-cell counts in HIV-infected

1118: patients (unpublished).

1119:

1120: \bibitem[]{schinazietal99}Schinazi, R.F., B.A. Larder, and

1121: J.W. Mellors 1999. Mutations in retroviral genes associated with

1122: drug resistance: 1999-2000 update. Intern. Antiviral News

1123: 7.4:46-68.

1124:

1125: \bibitem[]{schneider86} Schneider, T.D., G.D. Stormo, L. Gold, and A.

1126:   Ehrenfeucht 1986. Information content of binding sites on nucleotide

1127:  sequences. J. Mol. Biol. 188:415-431.

1128:

1129: \bibitem[]{SchmittHerzel1997} Schmitt, A.O. and H. Herzel 1997. Estimating the

1130: entropy of DNA sequences. J. theor. Biol. 188:369-377.

1131:

1132: \bibitem[]{schneider97} Schneider, T.D. 1997. Information content of

1133: individual genetic sequences. J. theor. Biol. 189:427-441.

1134:

1135: \bibitem[]{servais99}Servais, J. et al.\ 1999. The natural

1136: polymorphism of the HIV-1 protease gene in treatment naive patients

1137: and response to combination therapy including a protease

1138: inhibitor. 4th European Conference on Experimental AIDS Research, June

1139: 18-21, Tampere, Finland.

1140:

1141: \bibitem[]{servais01a}Servais, J., et al.\ 2001a. Comparison of DNA sequencing and a

1142: line probe assay for detection of Human Immunodeficiency Virus type 1 drug

1143: resistance mutations in patients failing highly active antiretroviral therapy. J.

1144: Clin. Microbiol. 39:454-459.

1145:

1146: \bibitem[]{servais01b}Servais, J., et al.\ 2001b. Variant HIV-1 proteases

1147: and response to combination therapy including a protease inhibitor. Antimicrob.

1148: Agents Chemother. 45:893-900.

1149:

1150: \bibitem[]{shannon48}Shannon, C.E. 1948. A mathematical theory of

1151: communication. Bell System Technical Journal 27:379-423; {\em ibid}

1152: 27:623-656; reprinted in {\it C.E. Shannon: Collected Papers},

1153: N.J.A. Sloane and A.D. Wyner, eds., IEEE Press (1993).

1154:

1155: \bibitem[]{Shaoetal1997}Shao, W., L. Everitt, M. Manchester, D.D. Loeb, C.A.

1156: Hutchison III, and R. Swanstrom 1997. Sequence requirements of the

1157: HIV-1 protease flap region determined by saturation mutagenesis

1158: and kinetic analysis of flap mutants. Proc. Natl. Acad. Sci. USA

1159: 94:2243-2248.

1160:

1161: \bibitem[]{ShineDalgarno1974}Shine, J., and L. Dalgarno 1974. The

1162: 3'-terminal sequence of {\it E. coli} 16S ribosomal RNA:

1163: Complementarity to nonsense triplets and ribosome binding sites.

1164: Proc. Natl. Acad. Sci. USA 71:1342-1346.

1165:

1166: \bibitem[]{Sprinzletal1996}Sprinzl, M., C. Steegborn, F. H\"ubel, and S. Steinberg

1167: 1996. Compilation of tRNA sequences and sequences of tRNA genes. Nucl. Acids Res.

1168: 24:68-72.

1169:

1170: \bibitem[]{StraitDewey1996}Strait, B.J. and T.G. Dewey 1996. The Shannon information entropy of

1171: protein sequences. Biophysical Journal 71:148-155.

1172:

1173: \bibitem[]{Szostak2003}Szostak, J.W. 2003. Molecular messages:

1174: Functional information. Nature 423:689.

1175:

1176: \bibitem[]{TuckerGeraUetz2001} Tucker, C.L., J.F. Gera, and P. Uetz 2001. Towards

1177: an understanding of complex protein interaction networks. Trends

1178: Cell Biol. 11:102-106.

1179:

1180: \bibitem[]{vincent94}Vincent, L.-M. 1994. R\'eflexions sur l'usage, en

1181: biologie, de la th\'eorie de l'information. Acta Biotheoretica

1182: 42:167-179.

1183:

1184: \bibitem[]{voigtetal01}Voigt, C.A., S.L. Mayo, F.H. Arnold, and

1185: Z.-G. Wang 2001. Computational method to reduce the search space

1186: for directed protein evolution. Proc. Natl. Acad. Sci. USA

1187: 98:3778-3783.

1188:

1189: \bibitem[]{Weissetal2000}Weiss, O., M.A. Jimenez-Monta\~no, and H. Herzel 2000.

1190: Information content of protein sequences. J. theor. Biol. 206:379-386.

1191:

1192: \bibitem[]{WollenbergAtchley2000}Wollenberg, K.R. and W.R. Atchley 2000. Separation

1193: of phylogenetic and functional associations in biological sequences by using the

1194: parametric bootstrap. Proc. Natl. Acad. Sci. USA 97:3288-3291.

1195:

1196: \bibitem[]{Wuetal2003}Wu, T.D. et al.\ 2003. Mutation patterns and structural

1197: correlates in human immunodeficiency virus type 1 protease following different

1198: protease inhibitor treatments. J. Virol. 77:4836-4847.

1199:

1200:

1201: %\bibitem[]{wintersetal2000} Winters, M.A. et al. 2000. Frequency of

1202: %antiretroviral drug resistance mutations in HIV-1 strains from

1203: %patients failing triple drug regimens. Antiviral Therapy 5:57-63.

1204:

1205: \end{thebibliography}

1206: \end{document}

1207: