1: %
2: \documentclass[12pt]{article}
3: \usepackage{psfig,mydef}
4: \begin{document}
5: \begin{titlepage}
6: \mbox{} \vskip 3cm \centerline{\LARGE\bf \sf Information Theory in Molecular
7: Biology} \vskip 1in \centerline{\large \bf \sf Christoph Adami$^{1,2}$} \vskip
8: 0.2in \centerline{\it $^1$Keck Graduate Institute of Applied Life Sciences,}
9: \centerline{\it 535 Watson Drive, Claremont, CA 91711}\vskip 0.5cm \centerline{\it
10: $^2$Digital Life Laboratory 139-74,} \centerline {\it California Institute of
11: Technology, Pasadena, CA 91125} \vskip 2cm
12:
13: \date{\today}
14: \centerline{\bf Abstract} This article introduces the physics of
15: information in the context of molecular biology and genomics.
16: Entropy and information, the two central concepts of Shannon's
17: theory of information and communication, are often confused with
18: each other but play transparent roles when applied to statistical
19: {\em ensembles} (i.e., identically prepared sets) of symbolic
20: sequences. Such an approach can distinguish between entropy and
21: information in genes, predict the secondary structure of
22: ribozymes, and detect the covariation between residues in folded
23: proteins. We also review applications to molecular sequence and
24: structure analysis, and introduce new tools in the
25: characterization of resistance mutations, and in drug design.
26:
27: \vskip 4 cm
28: %
29: \end{titlepage}
\vskip 0.25cm In a curious twist of history, the dawn of the age
of genomics has seen both the rise of the science of
32: bioinformatics as a tool to cope with the enormous amounts of data
33: being generated daily, and the decline of the {\em theory} of
34: information as applied to molecular biology. Hailed as a harbinger
35: of a ``new movement''\nocite{quast53} (Quastler 1953) along with
36: Cybernetics, the principles of information theory were thought to
37: be applicable to the higher functions of living organisms, and
38: able to analyze such functions as metabolism, growth, and
39: differentiation (Quastler 1953). Today, the metaphors and the
40: jargon of information theory are still widely used (Maynard Smith
41: 1999a, 1999b)\nocite{maynard99a,maynard99b}, as opposed to the
42: mathematical formalism, which is too often considered to be
43: inapplicable to biological information.
44:
Looking back, it is clear that too much hope was placed on
this theory's relevance for biology. Nevertheless, the optimism that
information theory ought to be able to address the complex issues
associated with the storage of information in the genetic code was
well founded, even though it was repeatedly questioned
and rebuked (see, e.g., \nocite{vincent94,sarkar96}Vincent 1994,
Sarkar 1996). In this article, I outline the concepts of entropy
52: and information (as defined by Shannon) in the context of
53: molecular biology. We shall see that not only are these terms
54: well-defined and useful, they also coincide precisely with what we
55: intuitively mean when we speak about information stored in genes,
56: for example. I then present examples of applications of the theory
57: to measuring the information content of biomolecules, the
58: identification of polymorphisms, RNA and protein secondary
59: structure prediction, the prediction and analysis of molecular
60: interactions, and drug design.
61:
62: \section{Entropy and Information}
63: Entropy and information are often used in conflicting manners in
64: the literature. A precise understanding, both mathematical and
65: intuitive, of the notion of information (and its relationship to
66: entropy) is crucial for applications in molecular biology.
67: Therefore, let us begin by outlining Shannon's original entropy
68: concept (Shannon, 1948).
69:
70: \subsection{Shannon's Uncertainty Measure}
71: Entropy in Shannon's theory (defined mathematically below) is a
72: measure of uncertainty about the identity of objects in an
73: ensemble. Thus, while ``entropy'' and ``uncertainty'' can be used
74: interchangeably, they can {\it never} mean information. There is a
75: simple relationship between the entropy concept in information
76: theory and the Boltzmann-Gibbs entropy concept in thermodynamics,
77: briefly pointed out below.
78:
79: Shannon entropy or uncertainty is usually defined with respect to
80: a particular observer. More precisely, the entropy of a system
81: represents the amount of uncertainty {\em one particular observer}
82: has about the state of this system. The simplest example of a
83: system is a {\em random variable}, a mathematical object that can
84: be thought of as an $N$-sided die that is uneven, i.e., the
85: probability of it landing in any of its $N$ states is not equal
86: for all $N$ states. For our purposes, we can conveniently think of
87: a polymer of fixed length (fixed number of monomers), which can
88: take on any one of $N$ possible states, where each possible
89: sequence corresponds to one possible state. Thus, for a sequence
90: made of $L$ monomers taken from an alphabet of size $D$, we would
91: have $N=D^L$. The uncertainty we calculate below then describes
92: the observer's uncertainty about the true identity of the molecule
93: (among a very large number of identically prepared molecules: an
94: {\em ensemble}), given that he only has a certain amount of
95: probabilistic knowledge, as explained below.
96:
97: This hypothetical molecule plays the role of a random variable if we
98: are given its {\em probability distribution}: the set of probabilities
99: $p_1,...,p_N$ to find it in its $N$ possible states. Let us thus
100: call our random variable (random molecule) ``$X$'', and give the names
101: $x_1,...,x_N$ to its $N$ states. If $X$ will be found in state $x_i$
102: with probability $p_i$, then the entropy $H$ of $X$ is given by
103: Shannon's formula
104: \begin{equation}
105: H(X)=-\sum_{i=1}^N p_i\log p_i\;. \label{entropy}
106: \end{equation}
I have not specified here the base of the logarithm in the above formula.
Specifying it assigns units to the uncertainty. It is sometimes convenient to use
the number of possible states of $X$ as the base of the logarithm (in which case
the entropy lies between zero and one); in other cases base 2 is convenient (leading
to an entropy in units of ``bits''). For biomolecular sequences, a convenient unit
is obtained by taking logarithms to the base of the alphabet size, leading to an
entropy whose units we shall call ``mers''. Then, the maximal entropy equals the
length of the sequence in mers.
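
As a worked illustration of Eq.~(\ref{entropy}) and of the choice of units (the
probabilities are invented for the purpose of the example and are not taken
from any data set), consider a single monomer drawn from an alphabet of size
$D=4$ with $p_1=1/2$, $p_2=1/4$, and $p_3=p_4=1/8$. With logarithms to base 2,
\[
H(X)=\frac12\log_2 2+\frac14\log_2 4+2\cdot\frac18\log_2 8
=\frac12+\frac12+\frac34=1.75\ \ {\rm bits}\;,
\]
while with logarithms to base $D=4$ (using $\log_4 x=\frac12\log_2 x$) the
same uncertainty is $H(X)=0.875$ mers, below the maximum of 1 mer that is
reached for $p_i=1/4$.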
115:
116: Let us examine Eq.~(\ref{entropy}) more closely. If measured in
117: bits, a standard interpretation of $H(X)$ as an uncertainty
function connects it to the smallest number of ``yes-no'' questions
119: necessary, on average, to identify the state of random variable
120: $X$. Because this series of yes/no questions can be thought of as
121: a {\em description} of the random variable, the entropy $H(X)$ can
122: also be viewed as the {\em length of the shortest description of
123: $X$} (Cover and Thomas, 1991). In case nothing is known about $X$,
124: this entropy is $H(X)=\log N$, the maximal value that $H(X)$ can
125: take on. This occurs if all states are equally likely:
126: $p_i=1/N\,;i=1,...,N$. If something (beyond the possible number of
127: states $N$) is known about $X$, this reduces our necessary number
128: of questions, or the length of tape necessary to describe $X$.
129: If I know that state $X=x_7$, for example, is highly
130: unlikely, then my uncertainty about $X$ is going to be smaller.
131:
How do we ever learn anything about a system? There are two choices. Either we
obtain the probability distribution using {\em prior knowledge} (for example, by
taking the system apart and predicting its states theoretically) or we make
measurements on it, which for example might reveal that not all states, in fact,
are taken on with the same probability. In both cases, the difference between the
maximal entropy and the entropy that remains after we have either examined the
system or performed our measurements is the amount of information we have about the
system. Before I write this into a formula, let me remark that, by its very
140: definition, information is a {\it relative} quantity. It measures the {\it
141: difference of uncertainty}, in the previous case the entropy before and after the
142: measurement, and thus can never be absolute, in the same sense as potential energy
143: in physics is not absolute. In fact, it is not a bad analogy to refer to entropy as
144: ``potential information'', because potentially all of a system's entropy can be
145: transformed into information (for example by measurement).
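
To make this difference concrete (a hypothetical example, with numbers chosen
only for illustration), suppose a system has $N=8$ possible states, so that
without any knowledge the maximal entropy is $H_{\rm max}=\log_2 8=3$ bits. If a
measurement reveals that only two of these states are ever taken on, each with
probability $1/2$, the remaining entropy is $H=\log_2 2=1$ bit, and the
information gained is
\[
I=H_{\rm max}-H=3-1=2\ \ {\rm bits}\;.
\]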
146:
147: \subsection{Information}
148: In the above case, information was the difference between the maximal and the
149: actual entropy of a system. This is not the most general definition as I have
150: alluded to. More generally, information measures the amount of {\it correlation}
151: between two systems, and reduces to a difference in entropies in special cases. To
152: define information properly, let me introduce another random variable or molecule
(call it ``$Y$''), which can be in states $y_1,...,y_M$ with probabilities
$p(y_1),...,p(y_M)$. We can then, along with the entropy $H(Y)$, introduce the joint
155: entropy $H(XY)$, which measures my uncertainty about the joint system $XY$ (which
156: can be in $N\cdot M$ states). If $X$ and $Y$ are {\em independent} random variables
157: (like, e.g., two dice that are thrown independently) the joint entropy will be just
158: the sum of the entropy of each of the random variables. Not so if $X$ and $Y$ are
159: somehow connected. Imagine, for example, two coins that are glued together at one
160: face. Then, heads for one of the coins will always imply tails for the other, and
161: vice versa. By gluing them together, the two coins can only take on two states, not
162: four, and the joint entropy is equal to the entropy of one of the coins.
163:
164: The same is true for two molecules that can bind to each other.
First, note that random molecules do not bind. Second, binding
166: is effected by mutual specificity, which requires that part of the
167: sequence of one of the molecules is interacting with the sequence
168: of the other, so that the joint entropy of the pair is much less
169: than the sum of entropies of each. Quite clearly, this binding
170: introduces strong correlations between the states of $X$ and $Y$:
171: if I know the state of one, I can make strong predictions about
172: the state of the other. The information that one molecule has {\em
173: about} the other is given by
174: \begin{equation}
175: I(X:Y)=H(X)+H(Y)-H(XY)\;, \label{info}
176: \end{equation}
177: i.e., it is the difference between the sum of the entropies of each, and the joint
178: entropy. The colon between $X$ and $Y$ in the notation for the information is
179: standard; it is supposed to remind the reader that information is a symmetric
180: quantity: what $X$ knows about $Y$, $Y$ also knows about $X$. For later reference,
181: let me introduce some more jargon. When more than one random variable is involved,
182: we can define the concept of {\em conditional entropy}. This is straightforward.
183: The entropy of $X$ conditional on $Y$ is the entropy of $X$ {\em given} $Y$, that
184: is, if I know which state $Y$ is in. It is denoted by $H(X|Y)$ (read ``$H$ of $X$
185: given $Y$'') and is calculated as
186: \begin{equation}
187: H(X|Y)=H(XY)-H(Y)\;.
188: \end{equation}
189: This formula is self-explanatory: the uncertainty I have about $X$ if $Y$ is known
190: is the uncertainty about the joint system minus the uncertainty about $Y$ alone.
191: The latter, namely the entropy of $Y$ without regard to $X$ (as opposed to
192: ``conditional on $X$'') is sometimes called a {\em marginal} entropy. Using the
193: concept of conditional entropy, we can rewrite Eq.~(\ref{info}) as
194: \begin{equation}
195: I(X:Y)=H(X)-H(X|Y)\;.
196: \end{equation}
197:
198: We have seen earlier that for independent variables
199: $H(XY)=H(X)+H(Y)$, so information measures the {\em deviation}
200: from independence. In fact, it measures exactly the amount by
201: which the entropy of $X$ or $Y$ is reduced by knowing the other,
202: $Y$ or $X$. If $I$ is non-zero, knowing one of the molecules
203: allows you to make more accurate predictions about the other:
204: quite clearly this is exactly what we mean by information in
205: ordinary language. Note that this definition reduces to the
206: example given earlier (information as difference between
207: entropies), if the only possible correlations are {\em between}
208: $X$ and $Y$, while in the absence of the other each molecule is
209: equiprobable (meaning that any sequence is equally likely). In
210: that case, the marginal entropy $H(X)$ must be maximal ($H=\log
211: N$) and the information is the difference between maximal and
212: actual (i.e., conditional) entropy, as before.
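
The glued coins introduced above provide a minimal worked example of these
definitions (assuming, for illustration, that each coin is fair). Each coin
alone shows heads or tails with probability $1/2$, so $H(X)=H(Y)=1$ bit, but
the pair can only take on two equally likely joint states, so $H(XY)=1$ bit.
Then
\begin{eqnarray*}
I(X:Y)&=&H(X)+H(Y)-H(XY)=1+1-1=1\ \ {\rm bit}\;,\\
H(X|Y)&=&H(XY)-H(Y)=1-1=0\;,
\end{eqnarray*}
and indeed $I(X:Y)=H(X)-H(X|Y)=1$ bit: knowing one coin removes all
uncertainty about the other. For two independent fair coins, by contrast,
$H(XY)=2$ bits and $I(X:Y)=0$.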
213:
214: \subsection{Entropy in Thermodynamics}
I will briefly comment on the relationship between Shannon's theory and
216: thermodynamics (Adami and Cerf 1999)\nocite{adami99}. For the present purpose it
217: should suffice to remark that Boltzmann-Gibbs thermodynamic entropy is just like
218: Shannon entropy, only that the probability distribution $p_i$ is given by the
219: Boltzmann distribution of the relevant degrees of freedom (position and momentum):
220: \begin{equation}
221: \rho(p,q)= \frac1Ze^{-E(p,q)/kT}\;,
222: \end{equation}
223: and the thermodynamic quantity is made dimensional by multiplying
224: Shannon's dimensionless uncertainty by Boltzmann's constant. It
225: should not worry us that the degrees of freedom in thermodynamics
226: are continuous, because any particular measurement device that is
227: used to measure these quantities will have a finite resolution,
228: rendering these variables effectively discrete through
229: coarse-graining. More importantly, equilibrium thermodynamics
230: assumes that all entropies of isolated systems are at their
231: maximum, so there are no correlations in equilibrium thermodynamic
232: systems, and therefore there is {\em no information}. This is
233: important for our purposes, because it implies, a fortiori, that
234: the information stored in biological genomes guarantees that
living systems are far away from thermodynamic equilibrium.
236: Information theory can thus be viewed as a type of non-equilibrium
237: thermodynamics.
238:
239:
240: Before exploring the uses of these concepts in molecular biology,
241: let me reiterate the most important points which tend to be
242: obscured when discussing information. Information is defined as
243: the amount of correlation between two systems. It measures the
244: amount of entropy {\em shared} between two systems, and this
245: shared entropy is the information that one system has {\em about
246: the other}. Perhaps this is the key insight that I would like to
247: convey: Information is always {\em about something}. If it cannot
248: be specified what the information is about, then we are dealing
249: with entropy, not information. Indeed, entropy is sometimes
called, in what borders on an abuse of language, ``useless
information''. The previous discussion also implies that
252: information is only defined {\it relative} to the system it is
253: information about, and is therefore {\em never} absolute. This
254: will be particularly clear in the discussion of the information
255: content of genomes, which we now enter.
256:
257: \section{Information in Genomes}
258: There is a long history of applying information theory to symbolic
259: sequences. Most of this work is concerned with the randomness, or,
260: conversely, regularity, of the sequence. Ascertaining the
261: probabilities with which symbols are found on a sequence or
262: message will allow us to estimate the entropy of the {\em source
263: of symbols}, but not what they stand for. In other words,
264: information cannot be accessed in this manner. It should be
265: noted, however, that studying {\em horizontal} correlations, i.e.,
266: correlations between symbols along a sequence rather than across
267: sequences, can be useful for distinguishing coding from non-coding
268: regions in DNA (Grosse et al., 2000), and can serve as a distance
269: measure between DNA sequences that can be used to assemble
270: fragments obtained from shotgun-sequencing (Otu and Sayood, 2003).
271:
272: In terms of the jargon introduced above, measuring the
273: probabilities with which symbols (or groups of symbols) appear
274: {\em anywhere} in a sequence will reveal the {\em marginal}
275: entropy of the sequence, i.e., the entropy {\em without} regard to
276: the environment or context. The entropy {\em with} regard to the
277: environment is the entropy {\em given} the environment, a
278: conditional entropy, which we shall calculate below. This will
279: involve obtaining the probability to find a symbol at a {\em
280: specific} point in the sequence, as opposed to anywhere on it. We
281: sometimes refer to this as obtaining the {\em vertical}
282: correlations between symbols.
283:
284: \subsection{Meaning from Correlations}
285: Obtaining the marginal entropy of a genetic sequence can be quite
286: involved (in particular if multi-symbol probabilities are
required), but a very good approximate answer can be given
288: without any work at all: This entropy (for DNA sequences) is about
289: two bits per base. There are deviations of interest (for example
290: in GC-rich genes, etc.) but overall this is what the
291: (non-conditional) entropy of most of DNA is (see, e.g., Schmitt
292: and Herzel 1997)\nocite{SchmittHerzel1997}. The reason for this is
293: immediately clear: DNA is a {\em code}, and codes do not reveal
294: information from sequence alone. Optimal codes, e.g., are such
295: that the encoded sequences cannot be compressed any further (Cover
296: and Thomas, 1991)\nocite{CoverThomas1991}. While DNA is not
297: optimal (there are some correlations between symbols along the
298: sequence), it is nearly so. The same seems to hold true for
299: proteins: a random protein would have $\log_2(20)=4.32$ bits of
300: entropy per site (or 1 mer, the entropy of a random monomer
301: introduced above), while the actual entropy is somewhat lower due
302: to biases in the overall abundance (leucine is over three times as
303: abundant as tyrosine, for example), and due to pair and triplet
304: correlations. Depending on the data set used, the protein entropy
305: per site is between 2.5 (Strait and Dewey, 1996) and 4.17 bits
306: (Weiss et al., 2000)\nocite{StraitDewey1996,Weissetal2000}, or
307: between 0.6 and 0.97 mers. Indeed, it seems that protein sequences
308: can only be compressed by about 1\% (Weiss et al. 2000). This is a
pretty good code! But this entropy per symbol only allows us to
quantify our uncertainty about the sequence identity; it will
not reveal to us the {\em function} of the genes. If this is all
312: that information theory could do, we would have to agree with the
313: critics that information theory is nearly useless in molecular
314: biology. Yet, I have promised that information theory {\em is}
315: relevant, and I shall presently point out how. First of all, let
316: us return to the concept of information. How should we decide
whether or not {\em potential information} (a.k.a.\ entropy) is in
318: {\em actuality} information, i.e., whether it is shared with
319: another variable?
320:
321: The key to information lies in its use to make predictions {\em
322: about} other systems. Only in {\em reference} to another ensemble
323: can entropy become information, i.e., be promoted from useless to
324: useful, from potential to actual. Information therefore is clearly
325: not stored {\em within} a sequence, but rather in the {\em
326: correlations} between the sequence and what it describes, or what
327: it {\em corresponds to}. What do biomolecular sequences correspond
328: to? What is the {\em meaning} of a genomic sequence, what
329: information does it represent? This depends, quite naturally, on
330: what environment the sequence is to be interpreted within.
331: According to the arguments advanced here, no sequence has an
332: intrinsic meaning, but only a relative (or conditional) one with
333: respect to an environment. So, for example, the genome of {\it
334: Mycoplasma pneumoniae} (a bacterium that causes pneumonia-like
335: respiratory illnesses) has an entropy of almost a million base
336: pairs, which is its genome length. Within the soft tissues that it
337: relies on for survival, most of these base pairs (about 89\%) are
338: information (Dandekar et al., 2000). Indeed, Mycoplasmas are
339: obligate parasites in these soft tissues, having shed from 50\% to
340: three quarters of the genome of their bacterial ancestors (the
341: {\em Bacillae}). Within these soft tissues that make many
342: metabolites readily available, what was information for a Bacillus
has become entropy for the Mycoplasma. With respect to {\em other}
344: environments, the Mycoplasma information might mean very little,
345: i.e., it might not {\em correspond} to anything there. Whether or
346: not a sequence means something in its environment determines
347: whether or not the organism hosting it lives or dies there. This
348: will allow us to find a way to distinguish entropy from
349: information in genomes.
350:
351: \subsection{Physical Complexity}
352: In practice, how can we determine whether a particular base's
353: entropy is shared, i.e., whether a nucleotide carries entropy or
354: information? At first glance one might fear that we would have to
355: know a gene's function (i.e., know what it corresponds to within
356: its surrounding) before we can determine the information content;
357: that, for example, we might need to know that a gene codes for an
alcohol dehydrogenase before we can ascertain which base pairs code
359: for it. Fortunately, this is not true. What is clear, however, is
360: that we may never distinguish entropy from information if we are
361: only given a {\em single} sequence to make this determination,
362: because, in a single sequence, symbols that carry information are
363: indistinguishable from those that do not. The trick lies in
364: studying {\em functionally equivalent sets} of sequences, and the
365: substitution patterns at each aligned position. In an equilibrated
population, i.e., one where sufficient time has passed since the
367: last evolutionary innovation or bottleneck, we expect a position
368: that codes for information to be nearly uniform {\em across} the
369: population (meaning that the same base pair will be found at that
370: position in all sequences of that population), because a mutation
371: at that position would detrimentally affect the fitness of the
372: bearer, and, over time, be purged from the ensemble (this holds in
373: its precise form only for asexual populations). Positions that do
374: not code for information, on the other hand, are selectively
375: neutral, and, with time, will take on all possible symbols at that
376: position. Thus, we may think of each position on the genome as a
377: four-sided die. A priori, the uncertainty (entropy) at each
378: position is two bits, the maximal entropy:
379: \begin{equation}
380: H = -\sum_{i= {\rm G,C,A,T}}p(i)\log_2 p(i)=\log_2 4 = 2 \ \ {\rm bits}
381: \end{equation}
382: because, a priori, $p(i)=1/4$. For the {\it actual} entropy, we need the actual
383: probabilities $p_j(i)$, for each position $j$ on the sequence. In a pool of $N$
384: sequences, $p_j(i)$ is estimated by counting the number $n_j(i)$ of occurrences of
385: nucleotide $i$ at position $j$, so that $p_j(i)=n_j(i)/N$. This should be done for
386: all positions $j=1,...,L$ of the sequence, where $L$ is the sequence length.
387: Ignoring correlations {\em between} positions $j$ on a sequence (so-called
388: ``epistatic'' correlations, to which we shall return below), the information stored
389: in the sequence is then (with logs to base 2)
390:
391: \begin{equation}
392: I=H_{\rm max}- H = 2L -H\;\; {\rm bits}\;, \label{infomeasure}
393: \end{equation}
394: where
395: \begin{equation}
396: H = -\sum_{j=1}^L\,\sum_{i={\rm G,C,A,T}}p_j(i)\log_2 p_j(i)\;. \label{cond}
397: \end{equation}
398: Note that this estimate, because it relies on the difference of
399: maximal and actual entropy, does not require us to know which
400: variables in the environment cause some nucleotides to be uniform,
401: or ``fixed''. These probabilities are set by mutation-selection
402: balance in the environment. I have argued earlier (Adami and Cerf
403: 2000, Adami et al. 2000)\nocite{Adamietal2000} that the
404: information stored in a sequence is a good proxy for the
sequence's complexity (called ``physical complexity''), which
406: itself might be a good predictor of functional complexity. And
407: indeed, it seems to correspond to the quantity that increases
408: during Darwinian evolution (Adami 2002a)\nocite{Adami2002a}. We
409: will encounter below an evolutionary experiment that seems to
410: corroborate these notions.
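
As a toy illustration of Eqs.~(\ref{infomeasure}) and (\ref{cond}) (the
alignment is hypothetical and far too small for a real estimate), take a pool
of $N=8$ aligned sequences of length $L=3$. Suppose position 1 is G in all
eight sequences, position 2 is A in four and T in four, and position 3 shows
each of the four nucleotides twice. Writing $H_j$ for the contribution of
position $j$ to the sum in Eq.~(\ref{cond}) (a shorthand introduced only for
this example), and using $p_j(i)=n_j(i)/N$, we find $H_1=0$,
$H_2=\log_2 2=1$ bit, and $H_3=\log_2 4=2$ bits, so that
\[
H=0+1+2=3\ \ {\rm bits}\;,\qquad I=2L-H=6-3=3\ \ {\rm bits}\;.
\]
The conserved position contributes its full two bits of information, the
position that tolerates two nucleotides contributes one bit, and the fully
variable position contributes none.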
411:
412: In general (for sequences taken from any monomer alphabet of size $D$), the
413: information stored in the sequence is
414: \begin{eqnarray}
415: I=H_{\rm max}-H &=& L-\left(-\sum_{j=1}^L\,\sum_{i=1}^{D}p_j(i)\log_D p_j(i)\right)\\
416: &=& L-J\;\; {\rm mers}\;, \label{infomer}
417: \end{eqnarray}
418: where $J$ can be thought of as the number of {\it non-functional}
(i.e., ``junk'') instructions, and I
remind the reader that we defined the ``mer'' as the entropy of a
random monomer, normalized to lie between zero and one.
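
For example (again with invented numbers), consider a protein segment of
$L=3$ residues, for which $D=20$. If the first site is perfectly conserved,
the second is occupied by two amino acids with equal probability, and the
third by all twenty with equal probability, the per-site entropies are $0$,
$\log_{20}2\approx 0.23$, and $1$ mers, respectively. Hence
\[
J\approx 0+0.23+1=1.23\ \ {\rm mers}\;,\qquad I=L-J\approx 1.77\ \ {\rm mers}\;.
\]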
422:
423: \subsection{Application to DNA and RNA}
424: In the simplest case, the environment is essentially given by the
425: intra-cellular binding proteins, and the measure
426: (\ref{infomeasure}) can be used to investigate the information
427: content of DNA binding sites (this use of information theory was
428: pioneered by Schneider et al., 1986). Here, the sample of
429: sequences can be provided by a sample of equivalent binding sites
430: within a single genome. For example, the latter authors aligned
431: the sequences of 149 {\it E. coli} and coliphage ribosome binding
432: sites in order to calculate the substitution probabilities at each
433: position of a 44 base pair region (which encompasses the 34
434: positions that can be said to constitute the binding site).
435: \begin{figure}[h]
436: \centerline{\psfig{figure=Adami-Fig1.ps,width=4in,angle=90}}
437: \caption{Information content (in bits) of an {\it E. coli} ribosome
438: binding site, aligned at the $f$Met-tRNA$_f$ initiation site (L=0),
439: from Schneider et al.\ (1986).
440: \label{schneider}}
441: \end{figure}
Fig.~\ref{schneider} shows the information content as a function of position
443: (Schneider et al.\ 1986), where position $L=0$ is the first base
444: of the initiation codon. The information content is highest near
445: the initiation codon, and shows several distinct peaks. The peak
446: at $L=-10$ corresponds to the Shine-Dalgarno sequence (Shine and
447: Dalgarno, 1974).
448:
449: When the information content of a base is zero we must assume that it
450: has no function, i.e., it is neither expressed nor does anything
451: bind to it. Regions with positive information
452: content\footnote{Finite sampling of the substitution probabilities
453: introduces a systematic error in the information content, which
454: can be corrected (Miller 1954, Basharin 1959, Schneider et al.\
455: 1986). In the present case, the correction ensures that the
456: information content is approximately zero at the left and right
457: edge of the binding site.} carry information about the binding
458: protein, just as the binding protein carries information about the
459: binding site.
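
The leading-order form of the finite-sampling correction mentioned in the
footnote can be sketched as follows. This is the standard first-order result
usually attributed to Miller (1954); it is stated here as an assumption about
the general form of the correction, not as the exact procedure of Schneider et
al.\ (1986), and the symbols $s$ and $n$ are introduced only for this sketch.
If the probabilities of $s$ possible symbols are estimated from $n$ sequences,
the entropy computed from the observed frequencies underestimates the true
entropy, on average, by approximately
\[
\Delta H\approx\frac{s-1}{2n\ln 2}\ \ {\rm bits}\;,
\]
so that the uncorrected information content is overestimated by the same
amount. For $s=4$ nucleotides and $n=149$ aligned sites this gives
$\Delta H\approx 0.015$ bits per position.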
460:
It is important to emphasize that sites $L=1$ and $L=2$, for
example, have maximal information content because their
{\em conditional} entropy Eq.~(\ref{cond}) vanishes. The entropy is conditional
because only {\em given} the environment of binding proteins in which it functions
in {\em E. coli} or a coliphage, is the entropy zero. If there were, say, two
466: different proteins which could initiate translation at the same site (two different
467: environments), the conditional entropy of these sites could be higher. Intermediate
468: information content (between zero and 2 bits) signals the presence of {\em
469: polymorphisms} implying either non-specific binding to one protein or competition
470: between more than one protein for that site.
471:
472: A polymorphism is a deviation from the consensus sequence that is
not, as a rule, detrimental to the organism carrying it. If it
were, we would call it a ``deleterious mutation'' (or
just ``mutation''). The latter should be very infrequent as it
implies disease or death for the carrier. By contrast,
polymorphisms can establish themselves in the population. They may
lead to no change in the phenotype whatsoever, in which case we
may term them ``strictly neutral'', or they may be deleterious by
themselves but neutral if associated with a commensurate
(compensatory) mutation either on the same sequence or somewhere
else.
483:
484: Polymorphisms are easily detected if we plot the per-site
485: entropies of a sequence vs.\ residue or nucleotide number in an
486: {\em entropy map} of a gene. Polymorphisms carry per-site
487: entropies intermediate between zero (perfectly conserved locus)
488: and unity (strictly neutral locus). Mutations, on the other hand,
489: (because they are deleterious) are associated with very low
490: entropy (Rogan and Schneider 1995), so polymorphisms stand out
491: among conserved regions and even mutations. In principle,
492: mutations can occur on sites which are themselves polymorphic;
493: those can only be detected by a more in-depth analysis of
494: substitution patterns such as suggested in Schneider (1997).
495: Because polymorphic sites in proteins are a clue to which sites
496: can easily be mutated, per-site entropies have also been
497: calculated for the directed evolution of proteins and enzymes
498: (Saven and Wolynes 1997, Voigt et al. 2001).
499:
500: \begin{figure}[!h]
501: \centerline{\psfig{figure=Adami-Fig2.ps,width=3in,angle=0}}
502: \caption{Entropy (in bits) of {\it E. coli} tRNA (upper panel)
503: from 5' (L=0) to 3' (L=76), from
504: 33 structurally similar sequences obtained from Sprinzl et al. (1996), where we
505: arbitrarily set the entropy of the anti-codon to zero.
506: Lower panel: Same for 32 sequences of {\it B. subtilis} tRNA. \label{ecolisubt}}
507: \end{figure}
508:
509: As mentioned earlier, the actual function of a sequence is irrelevant for
510: determining its information content. In the previous example, the region
511: investigated was a binding site. However, any gene's information content can be
512: measured in such a manner. In Adami and Cerf (2000), the information content of the
513: 76 base pair nucleic acid sequence that codes for bacterial tRNA was investigated.
514: In this case the analysis is complicated by the fact that the tRNA sequence
displays secondary and tertiary structure, so that the entropies of those sites that
bind in Watson-Crick pairs, for example, are shared, reducing the information
517: content estimate based on Eq.~(\ref{info}) significantly. In Fig.~\ref{ecolisubt},
518: I show the entropy (in bits)
519: \begin{figure}[h]
520: \centerline{\psfig{figure=Adami-Fig3.eps,width=4.5in,angle=0}}
521: \caption{Secondary structure of tRNA molecule, with bases colored
522: black for low entropy ($0\leq H\leq0.3$ mers), grey for
523: intermediate ($0.3<H\leq 0.7$ mers), and white for maximal entropy
524: ($0.7<H\leq 1.0$ mers), numbered 1-76 (entropies from {\it E.
525: coli} sequences). \label{tRNA}}
526: \end{figure}
derived from 33 structurally similar sequences of {\it E. coli}
tRNA (upper panel) and 32 sequences of {\it B. subtilis} tRNA
(lower panel), obtained from the EMBL nucleotide sequence library
530: (Sprinzl et al. 1996). Note how similar these entropy maps are
531: across species (even though they last shared an ancestor over 1.6
532: billion years ago), indicating that the profiles are
533: characteristic of the {\em function} of the molecule, and thus
534: statistically stable.
535:
536: Because of base-pairing, we should not expect to be able to simply
537: sum up the per-site entropies of the sequence to obtain the
538: (conditional) sequence entropy. The pairing in the stacks (the
539: ladder-like arrangement of bases that bind in pairs) of the
540: secondary structure (see Fig.~\ref{tRNA}) reduces the actual
541: entropy, because two nucleotides that are bound together {\em
542: share} their entropy. This is an example where {\em epistatic
543: correlations} are important. Two sites (loci) are called epistatic
544: if their contributions to the sequence's fitness are not
independent, in other words, if the probability of finding a
particular base at one position depends on the identity of the base
547: at another position. Watson-Crick-binding in stacks is the
548: simplest such example; it is also a typical example of the
549: maintenance of polymorphisms in a population because of functional
550: association. Indeed, the fact that polymorphisms are correlated in
551: stacks makes it possible to deduce the secondary structure of an
552: RNA molecule from sequence information alone.
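
As a hedged illustration of this definition (the notation $p_i(a)$, $p_j(b)$, and $p_{ij}(a,b)$ for the single-site and joint base frequencies in the alignment is introduced here for convenience and is an assumption, not notation taken from the equations above), the mutual entropy between two sites $i$ and $j$ can be written, in mers, as
\[
I(i:j)=H(i)+H(j)-H(i,j)=\sum_{a,b} p_{ij}(a,b)\,\log_4\frac{p_{ij}(a,b)}{p_i(a)\,p_j(b)}\;,
\]
which vanishes exactly when $p_{ij}(a,b)=p_i(a)\,p_j(b)$ for all $a$ and $b$, that is, when the two sites vary independently. In an idealized stack in which the four Watson-Crick pairs A-U, U-A, G-C, and C-G each occur with probability $1/4$, one finds $H(i)=H(j)=H(i,j)=1$ mer, so that $I(i:j)=1$ mer: knowing one base determines its partner completely.
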
553: \begin{figure}[h]
554: \centerline{\psfig{figure=Adami-Fig4.ps,width=3in,angle=0}}
555: \caption{Mutual
556: entropy (information) between base 28 and bases 39 to 45 (information
557: is normalized to $I_{\rm max}=1$ by taking logarithms to base 4). Because
558: finite sample size corrections of higher order have been neglected, the
559: information estimate can appear to be negative by an amount of the order of this
560: error. \label{spike}}
561: \end{figure}
562: Take, for example, nucleotide $L=28$ (in the anti-codon stack) which is bound to
563: nucleotide $L=42$, and let us measure entropies in mers (by taking logarithms to
564: the base 4). The mutual entropy between $L=28$ and $L=42$ (in {\em E. coli}) can be
565: calculated using Eq. (4):
566: \begin{equation}
567: I(28:42)=H(28)+H(42)-H(28,42)=0.78\;. \label{info28}
568: \end{equation}
569: Thus indeed, these two bases share almost all of their entropy. We
570: can see furthermore that they share very little entropy with any
571: other base. Note that, in order to estimate the entropies in
572: Eq.~(\ref{info28}), we applied a first-order correction that takes
573: into account a bias due to the finite size of the sample, as
574: described in Miller (1954). This correction amounts to $\Delta
575: H_1=3/(132 \ln2)$ for single nucleotide entropies, and $\Delta
576: H_2=15/(132 \ln2)$ for the joint entropy. In Fig.~\ref{spike}, I
577: plot the mutual entropy of base 28 with bases 39 to 45
578: respectively, showing that base 42 is picked out unambiguously.
579: Such an analysis can be carried out for all pairs of nucleotides,
580: so that the secondary structure of the molecule is revealed
581: unambiguously (see, e.g., Durbin et al. 1998). In
Fig.~\ref{wagenaar}, I show the mutual entropy (in bits) for all pairs of
583: bases of the set of {\it E. coli} sequences used to produce the
584: entropy map in Fig.~\ref{ecolisubt}, which demonstrates how the
585: paired bases in the four stems stand out.
586: \begin{figure}[h]
587: \centerline{\psfig{figure=Adami-Fig5.eps,width=5in,angle=0}}
588: \caption{Mutual
589: entropy (information) between all bases (in bits), colored according
590: to the color bar on the right, from 33 sequences of {\it E. coli} tRNA.
591: The four stems are readily identified by their correlations as indicated.
592: \label{wagenaar}}
593: \end{figure}
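
The size of the finite-sample correction quoted above can be checked by hand. The following is a minimal sketch, assuming that the first-order bias term of Miller (1954) has the form $(s-1)/(2N)$ nats for $s$ possible states and $N$ sampled sequences, and that it is {\em added} to each naive entropy estimate (the symbols $s$, $N$, $I_{\rm naive}$, and $I_{\rm corr}$, as well as the sign convention, are assumptions made here). With $N=33$ sequences and entropies measured in mers,
\begin{eqnarray*}
\Delta H_1 &=& \frac{4-1}{2\cdot 33\,\ln 4}=\frac{3}{132\ln 2}\approx 0.033\ {\rm mers}\;,\\
\Delta H_2 &=& \frac{16-1}{2\cdot 33\,\ln 4}=\frac{15}{132\ln 2}\approx 0.164\ {\rm mers}\;,\\
I_{\rm corr} &=& I_{\rm naive}+2\Delta H_1-\Delta H_2\approx I_{\rm naive}-0.098\ {\rm mers}\;.
\end{eqnarray*}
Under these assumptions the net first-order correction lowers the mutual entropy estimate by roughly a tenth of a mer, so the corrected estimate for a pair of essentially uncorrelated sites can fall slightly below zero, as noted in the caption of Fig.~\ref{spike}.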
594:
595: Since we found that most bases in stacks share almost all of their
596: entropy with their binding partner, it is easy to correct formula
597: (\ref{infomer}) to account for the epistatic effects of
598: stack-binding: We only need to subtract from the total length of
599: the molecule (in mers) the number of bases involved in stack
600: binding. In a tRNA molecule (with a secondary structure as in
601: Fig.~\ref{tRNA}) there are 21 such bases, so the sum in
Eq.~(\ref{cond}) should only go over the 52 ``reference
positions''\footnote{We exclude the three anticodon-specifying
bases from the entropy calculation because they have zero
conditional entropy by {\em definition} (they cannot vary within a
tRNA type because a change would alter the type). However, the
substitution probabilities are obtained from mixtures of {\em
different} tRNA types, and therefore appear to deviate from zero or one.}.
609: For {\it E. coli}, the entropy summed over the reference positions
610: gives $H\approx 24$ mers, while the {\it B. subtilis} set gives
611: $H\approx 21$ mers. We thus conclude that bacterial tRNA stores
612: between 52 and 55 mers of information about its environment (104-110
613: bits).
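
The arithmetic behind this estimate can be spelled out. This is a sketch that assumes Eq.~(\ref{infomer}) has the form $I=L-H$ with maximal entropy $L=76$ mers (a form inferred here from the numbers quoted above, not restated from the equation itself), and that uses $1\ {\rm mer}=\log_2 4=2$ bits:
\begin{eqnarray*}
I_{\rm E.\,coli} &\approx& 76-24 = 52\ {\rm mers} = 104\ {\rm bits}\;,\\
I_{\rm B.\,subtilis} &\approx& 76-21 = 55\ {\rm mers} = 110\ {\rm bits}\;.
\end{eqnarray*}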
614:
615: This type of sequence analysis combining structural and complexity
616: information has recently been used to quantify the information
617: gain during in-vitro evolution of catalytic RNA molecules
618: (ribozyme ligases) (Carothers et al.\
619: 2004)\nocite{Carothersetal2004}. The authors evolved RNA aptamers
that bind GTP (guanosine triphosphate) with different catalytic
621: effectiveness (different functional capacity) from a mutagenized
622: sequence library. They found 11 different classes of ribozymes,
623: whose structure they determined using the correlation analysis
624: outlined above. They subsequently measured the amount of
625: information in each structure [using Eq.~(\ref{infomeasure}) and
626: correcting for stack binding as described above] and showed that
627: ligases with higher affinity for the substrate had more complex
628: secondary structure {\em and} stored more information.
629: Furthermore, they found that the information estimate based on
630: Eq.~(\ref{infomeasure}) was consistent with an interpretation in
631: terms of the amount of information necessary to specify the
632: particular structure in the given environment. Thus, at least in
633: this restricted biochemical example, structural, functional, and
634: informational complexity seem to go hand in hand.
635:
636:
637: \subsection{Application to Proteins}
638: If the secondary structure of RNA and DNA enzymes can be predicted
639: based on correlations alone, what about protein secondary
640: structure? Because proteins fold and function via the interactions
641: among the amino acids they are made of, these interactions should,
642: in evolutionary time, lead to correlations between residues so
643: that the fitness effect of an amino acid substitution at one
644: position will depend on the residue at another position. (Care
645: must be taken to avoid contamination from correlations that are
646: due entirely to a common evolutionary path, see Wollenberg and
647: Atchley 2000; Govindarajan et al.\
648: 2003.)\nocite{WollenbergAtchley2000,Govindarajanetal2003} Such an
649: analysis has been carried out on a number of different molecule
650: families, such as the V3 loop region of HIV-1 (Korber et al.
651: 1993)\nocite{Korberetal1993}, which shows high variability (high
652: entropy) and strong correlations between residues (leading to
653: shared entropy) that are due to functional constraints. These
654: correlations have also been modelled (Giraud et al.\
655: 1998)\nocite{Giraudetal1998}.
656:
657: A similar analysis for the homeodomain sequence family was
658: performed by Clarke (1995)\nocite{Clarke1995}, who was able to
659: detect 16 strongly co-varying pairs in this 60 amino acid binding
660: motif. However, determining secondary structure based on these
661: correlations alone is much more difficult, because proteins do not
662: fold neatly into stacks and loops as does RNA. Also, residue
663: covariation does not necessarily indicate physical proximity
664: (Clarke 1995)\nocite{Clarke1995}, even though the strongest
665: correlations are often due to salt-bridges. But the correlations
666: can at least help in eliminating some models of protein structure
667: (Clarke 1995).
668:
669: Atchley et al.\ (2000)\nocite{Atchleyetal2000} carried out a
670: detailed analysis of correlated mutations in the bHLH (basic
671: helix-loop-helix) protein domain of a family of transcription
672: factors. Their set covered 242 proteins across a large number of
673: vertebrates that could be aligned to detect covariation. They
674: found that amino acid sites known to pack against each other
675: showed low entropy, whereas exposed non-contact sites exhibited
676: significantly larger entropy. Furthermore, they determined that a
677: significant amount of the observed correlations between sites was
678: due to functional or structural constraints that could help in
679: elucidating the structural, functional, and evolutionary dynamics
680: of these proteins (Atchley et al. 2000).
681:
682: Some attempts have been made to study the {\em thermodynamics} of protein
683: structures and relate it to the sequence entropy (Dewey 1997)\nocite{Dewey1997}, by
684: studying the mutual entropy between protein sequence and {\em structure}. This line
685: of thought is inspired by our concept of the genotype-phenotype map, which implies
686: that sequence should predict structure. If we hypothesize a structural entropy of
proteins $H({\rm str})$, obtained for example as the logarithm of the number of possible
stable protein structures for a given chain length (and a given environment), then
689: we can write down the mutual entropy between structure and sequence simply as
690: \begin{equation}
691: I({\rm seq}:{\rm str})=H({\rm seq})-H({\rm seq|str})\;, \label{seqstr}
692: \end{equation}
where $H({\rm seq})$ is the entropy of sequences of length $L$, given by $L$ when measured in mers, and
694: $H({\rm seq|str})$ is the entropy of sequences {\em given} the structure. If we
695: assume that the environment perfectly dictates structure (i.e., if we assume that
696: only one particular structure will perform any given function) then
697: \begin{equation}
698: H({\rm seq|str})\approx H({\rm seq}|{\rm env})
699: \end{equation}
and $I({\rm seq}:{\rm str})$ is then roughly equal to the physical
complexity defined earlier. Because mutual entropy is symmetric in
its arguments, and because $H({\rm str|seq})=0$ if we further assume
that any given sequence gives rise to exactly one structure, we can
rewrite Eq.~(\ref{seqstr}) as
704: \begin{equation}
705: I({\rm seq}:{\rm env})\approx I({\rm seq}:{\rm str})=H({\rm
706: str})-\underbrace{H({\rm str}|{\rm seq})}_{=0} \;,
707: \end{equation}
708: i.e., the mutual entropy between sequence and structure only tells
709: us that the thermodynamical entropy of possible protein structures
710: is limited by the amount of information about the environment
711: coded for by the sequence. This is interesting because it implies
712: that sequences that encode more information about the environment
713: are also potentially more complex, a relationship we discussed
714: earlier in connection with ribozymes (Carothers et al.\ 2004).
715: Note, however, that the assumption that only one particular
716: structure will perform any given function need not hold. Szostak
717: (2003), for example, advocates a definition of {\em functional
718: information} that allows for different structures carrying out an
719: equivalent biochemical function.
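
To give a sense of scale for this bound, one can turn it into a count of structures. The following is a rough sketch, assuming that $H({\rm str})$ is the logarithm of the number $N_{\rm str}$ of stable structures and that logarithms are taken to base $D=20$ so that protein entropies are measured in mers (the symbols $N_{\rm str}$ and $D$ and the numerical value below are hypothetical, chosen only for illustration):
\[
H({\rm str})=\log_{D} N_{\rm str}\approx I({\rm seq}:{\rm env})
\quad\Longrightarrow\quad
N_{\rm str}\approx D^{\,I({\rm seq}:{\rm env})}\;.
\]
For example, a protein storing $I=10$ mers of information about its environment would correspond to about $20^{10}\approx 10^{13}$ distinguishable structures under these assumptions.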
720:
721: \section{Molecular Interactions and Resistance}
722:
723: One of the more pressing concerns in bioinformatics is the identification of DNA
724: protein-binding regions, such as promoters, regulatory regions, and splice
725: junctions. The common method to find such regions is through {\em sequence
726: identity}, i.e., known promoter or binding sites are compared to the region being
727: scanned (e.g., via freely available bioinformatics software such as BLAST), and a
``hit'' results if the scanned region is sufficiently similar according to a
729: user-specified threshold. Such a method cannot, of course, find {\em unknown}
730: binding sites, nor can it detect interactions between proteins, which is another
731: one of bioinformatics' holy grails (see, e.g., Tucker et al.
732: 2001)\nocite{TuckerGeraUetz2001}. Information theory can in principle detect
733: interactions between different molecules (such as DNA-protein or protein-protein
734: interactions) from {\it sequence heterogeneity}, because interacting pairs share
{\em correlated mutations}, which arise as follows.
736:
737: \subsection{Detecting Protein-Protein and DNA-Protein Interactions}
738: Imagine two proteins bound to each other, while each protein has
739: some entropy in its binding motif (substitutions that do not
740: affect structure). If a mutation in one of the proteins leads to
741: severely reduced interaction specificity, the substitution is
742: strongly selected against. It is possible, however, that a {\em
743: compensatory} mutation in the binding partner restores
744: specificity, such that the {\em pair} of mutations together is
745: neutral (and will persist in the population), while each mutation
746: by itself is deleterious. Over evolutionary time, such pairs of
747: correlated mutations will establish themselves in populations and
748: in homologous genes across species, and could be used to identify
749: interacting pairs. This effect has been seen previously in the
750: Cytochrome c/Cytochrome oxidase (CYC/COX) heterodimer (Rawson and
751: Burton 2002)\nocite{RawsonBurton2002} of the marine copepod {\it
752: Tigriopus californicus}. In Rawson and Burton (2002), the authors performed crosses between
753: the San Diego (SD) and Santa Cruz (SC) variants from two natural
754: allopatric populations that have long, independent evolutionary
755: histories. Inter-population crosses produced strongly reduced
756: activity of the cytochrome complex, while intra-population crosses
757: were vigorous. Indeed, the SD and SC variants of COX differ by at
758: least 30 amino acid substitutions, while the smaller CYC has up to
759: 5 substitutions. But can these correlated mutations be found from
760: sequence data alone? This turns out to be a difficult
761: computational problem unless it is known precisely which member of
762: a set of $N$ sequences of one binding partner binds to which
763: member of a set of $N$ of the other. Unless we are in possession
764: of this $N$ to $N$ assignment, we cannot calculate the joint
765: probabilities $p_{ij}$ that go into the calculation of the mutual
766: entropies such as Eq.~(\ref{info28}) that reveal correlated
767: mutations.
768:
Of course, if we have one pair of sequences from each of $N$ species of organisms carrying the
same pair of homologous genes, the assignment is automatically implied. In the absence of
771: such an assignment, it may be possible to recover the correct matches from two sets
772: of $N$ sequences by searching for the assignment with the highest mutual entropy,
773: because we can safely assume that the correct assignment maximizes the correlations
774: (Adami and Thomsen 2004)\nocite{AdamiThomsen2004}. However, this is a difficult
775: search problem because the number of possible assignments scales like $N!$. Still,
776: because correlated mutations due to coevolution seem to be relatively common
777: (Bonneton et al. 2003)\nocite{Bonnetonetal2003}, this would be a useful tool for
778: revealing those residues involved in binding, or even in protein-protein
779: interaction prediction.
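
The search problem can be stated compactly. In the following sketch the objective and the symbols are assumptions made here for illustration and are not taken from Adami and Thomsen (2004): $\pi$ denotes a permutation that pairs sequence $n$ of the first set with sequence $\pi(n)$ of the second, $X_j$ is column $j$ of the first alignment, and $Y^{\pi}_k$ is column $k$ of the second alignment after reordering its rows by $\pi$. One possible objective is then
\[
\pi^\ast=\arg\max_{\pi\in S_N}\ \sum_{j,k} I(X_j:Y^{\pi}_k)
=\arg\min_{\pi\in S_N}\ \sum_{j,k} H(X_j,Y^{\pi}_k)\;.
\]
The second form follows because the single-column entropies $H(X_j)$ and $H(Y^{\pi}_k)$ do not depend on the pairing; only the joint entropies do. An exhaustive search must examine $N!$ permutations, which is already $10!\approx 3.6\times 10^{6}$ for ten sequence pairs and $20!\approx 2.4\times 10^{18}$ for twenty, so a heuristic optimization would be needed in practice.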
780:
In principle, the information-theoretic method described above
can identify {\em unknown} binding sites by
identifying complementary patterns (between binding sites and
protein coding regions), if the binding regions are not
well-conserved, i.e., when the binding site and the corresponding
transcription factor carry a reasonable number of polymorphisms,
and if enough annotation exists to identify the genomic partners
that correspond to each other in a set. If a sufficient number of
transcription-factor/binding-domain pairs can be sequenced, an
information-theoretic analysis could conceivably reveal genomic
regulatory regions that standard sequence analysis methods miss.
For example, it was suggested recently (Brown and Callan,
2004)\nocite{BrownCallan2004} that the cAMP receptor protein (CRP,
a transcription factor that regulates many {\it E. coli} genes)
binds to a number of entropic sites in {\it E. coli}, i.e., sites
that are not strictly conserved, but that still retain
functionality (see also Berg and von Hippel, 1987).
798:
799: \begin{figure}[h]
800: \centerline{\psfig{figure=Adami-Fig6.ps,width=4in,angle=90}}
801: \caption{Normalized ($0\le H\le1$) entropy of HIV-1 protease in
802: mers, as a function of residue number, using 146 sequences from
803: patients exposed to a protease inhibitor drug (entropy is
804: normalized to $H_{\rm max}=1$ per amino acid by taking logarithms
805: to base 20).\label{hivent}}
806: \end{figure}
807:
808:
809: \subsection{Tracking Drug Resistance}
810:
811: An interesting pattern of mutations can be observed in the
protease of HIV-1, a protein that binds to particular motifs on a
viral polyprotein, and then cuts it into functional pieces.
Resistance to protease inhibitors (small molecules designed to
bind to the ``business end'' of the protease, thereby preventing
its function) occurs via mutations in the protease that do not
change the protease's cutting function (proteolysis), while
preventing the inhibitor from binding to it. Information theory can be
819: used to study whether mutations are involved in drug resistance or
820: whether they are purely neutral, and to discover correlated
821: resistance mutations.
822:
823: The emergence of resistance mutations in the protease after
824: exposure to antiviral drugs has been well studied (Molla et al.
825: 1996, Schinazi, Larder, and Mellors 1999). The entropy map of HIV
826: protease in Fig.~\ref{hivent}\footnote{The map was created using
827: 146 sequences obtained from a cohort in Luxembourg, and deposited
828: in GenBank (Servais et al.\ 1999 and 2001a).} (on the level of
829: amino acids) reveals a distinctive pattern of polymorphisms and
830: only two strictly conserved regions. HIV protease {\em not}
831: exposed to inhibitory drugs, on the other hand, shows three such
832: conserved regions (Loeb et al. 1989)\nocite{Loebetal1989}. It is
believed that the polymorphisms contribute to the resistance mutations
found in patients in whom HAART (Highly Active Antiretroviral Therapy)
fails (Servais et al.\ 2001b). But, as a matter of fact, many
836: of the observed polymorphisms can be observed in treatment-naive
837: patients (Kozal et al. 1996, and Lech et al. 1996) so it is not
838: immediately clear which of the polymorphic sites are involved in
839: drug resistance.
840:
841: In principle, exposure of a population to a new environment can
842: lead to fast adaptation if the mutation rate is high enough. This
843: is certainly the case with HIV. The adaptive changes generally
844: fall into two classes: mutations in regions that were previously
845: conserved (true resistance mutations), and changes in the
846: substitution pattern on sites that were previously polymorphic. In
847: the case of HIV-1 protease, both patterns seem to contribute. In
848: Fig.~\ref{figdiff}, I show the {\it changes} in the entropic
profile of HIV-1 protease obtained from a group of patients before
and after six months of treatment with high doses of saquinavir (a
851: protease inhibitor). Most spikes are positive, in particular the
852: changes around residues 46-56, a region that is well-conserved in
853: treatment-naive proteases, and that is associated with a {\em
854: flap} in the molecule that must be flexible and that extends over
855: the substrate binding cleft (Shao et al. 1997). Mutations in that
856: region indeed appeared on sites that were previously uniform,
857: while some changes occurred on polymorphic sites (negative
858: spikes). For those, exposure to the new environment usually
859: reduced the entropy at that site.
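
The quantity plotted in Fig.~\ref{figdiff} can be written out per site. This is a sketch in which $p_j^{(t)}(a)$ denotes the frequency of amino acid $a$ at residue $j$ in the sample taken at week $t$ (notation assumed here for illustration), and in which the base-20 normalization of Fig.~\ref{hivent} is assumed to carry over:
\[
\Delta H(j)=H_{26}(j)-H_0(j)\;,\qquad
H_t(j)=-\sum_{a=1}^{20} p_j^{(t)}(a)\,\log_{20} p_j^{(t)}(a)\;.
\]
As a hypothetical example, a site that is uniform before treatment ($H_0=0$) and split evenly between two residues afterwards has $\Delta H=\log_{20}2\approx 0.23$ mers (a positive spike), whereas a polymorphic site that becomes fixed for a single residue shows a negative spike equal in size to its original entropy.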
860:
861: Some of the resistance mutations actually appear in pairs,
862: indicating that they may be compensatory in nature (Leigh Brown et
863: al. 1999, Hoffman et al. 2003, Wu et al. 2003)\nocite{Wuetal2003}.
864: The strongest association occurs between residues 54 and 82, the
865: former associated with the flap, and the latter right within the
866: binding cleft. This association does not occur in treatment-naive
867: patients, but stands out strongly after therapy (such correlations
868: are easily detected by creating mutual entropy graphs such as
869: Fig.~\ref{wagenaar}, data not shown). The common explanation for
870: this covariation is again compensation: while a mutation in the
871: flap or in the cleft leads to reduced functionality of the
872: protease, both together restore function while evading the
873: inhibitor.
874:
875: \begin{figure}[h] \centerline{\psfig{figure=Adami-Fig7.ps,width=4in,angle=90}}
876: \caption{Change in per-site entropy of HIV-1 protease after six
877: months of exposure to saquinavir, $\Delta$Entropy=$H_{26}-H_0$,
878: where $H_{26}$ is the entropy after 26 weeks of exposure. The
879: entropies were obtained from 34 sequences before and after
880: exposure, available through GenBank (Schapiro et al.\ 1997). The
three highest (positive) spikes are associated with the well-known
882: resistance mutations G48V, T74(A,S), and L90M, respectively.
883: }\label{figdiff}
884: \end{figure}
885:
886: \subsection{Information-theoretic Drug Design}
Because many of the protease polymorphisms are prevalent in
treatment-naive patients, we must assume either that they are
neutral, or that the steric changes they entail do not impede the
protease's proteolytic activity while preventing the protease
inhibitor from binding. Thus, a typical protease population is a mixture of
892: polymorphic molecules (polymorphic both in genotype and in
893: structure, see Maggio et al. 2002)\nocite{Maggioetal2002} that can
894: outsmart a drug designed for a single protease type relatively
895: easily. An interesting alternative in drug design would therefore
896: use an entropic mixture of polymorphisms, or ``quasispecies''
897: (Eigen 1971) as the drug target. Such a drug would {\em itself}
898: form a quasispecies rather than a pure drug. Indeed, an analysis
899: of the information content of realistic ensembles shows that
900: consensus sequences are exceedingly rare in real populations
901: (Schneider 1997), and certainly absent in highly variable ones
902: such as HIV proteases. The absence of a consensus sequence is also
903: predicted for molecules evolving at the {\em error threshold}
904: (Eigen 1971), which is very likely in these viruses.
905:
906: The ideal {\em superdrug} should represent a mixture of inhibitors
907: that is perfectly tuned to the mixture of proteases. What this
908: mixture is can be determined with information theory, by ensuring
909: that the ensemble of inhibitors {\em co-varies} with the protease,
910: such as to produce tight binding even in the presence of mutations
911: (or more precisely {\em because} of the presence of mutations).
912: The substitution probabilities of the inhibitor ensemble would be
913: obtained by maximizing the mutual entropy (information) between
914: the protease and an inhibitor library obtained by combinatorial
915: methods, either on a nucleotide or on the amino acid level (Adami
916: 2002b)\nocite{Adami2002b}. If such a procedure could create a drug
917: that successfully inhibits resistance mutations, we could no
918: longer doubt the utility of information theory for molecular
919: biology.
920:
921:
922: \section{Conclusions}
923: Information theory is not widely used in bioinformatics today even
924: though, as the name suggests, it should be {\em the} relevant
925: theory for investigating the information content of sequences. The
926: reason for the neglect appears to be a misunderstanding of the
927: concepts of entropy versus information throughout most of the
literature, which has led to the widespread perception that the
theory is inadequate. Instead, I point out that Shannon's theory precisely
930: defines both entropy and information, and that our intuitive
931: concept of information coincides with the mathematical notion.
932: Using these concepts, it is possible in principle to distinguish
933: information-coding regions from random ones in ensembles of
934: genomes, and thus quantify the information content. A thorough
935: application of this program should resolve the C-paradox, that is,
936: the absence of a correlation between the size of the genome and
937: the apparent complexity of an organism (Cavalier-Smith 1985), by
938: distinguishing information that contributes to complexity from
939: non-functional stretches that do not. However, this is a challenge
940: for the future because of the dearth of multiply sequenced
941: genomes.
942:
943: Another possible application of information theory in molecular
944: biology is the association of regulatory molecules with their
945: binding sites or even protein-protein interactions, in the case
where transcription factors and their corresponding binding sites
947: show a good amount of polymorphism (methods based on correlated
948: heterogeneity), and the binding association between pairs can be
949: established. This approach is complementary to sequence comparison
950: of conserved regions (methods based on sequence identity), in
951: which information theory methods cannot be used because zero
952: (conditional) entropy regions cannot share entropy. Conversely,
953: sequence comparison methods must fail if polymorphisms are too
954: pronounced. Finally, the recognition of the polymorphic (or
955: quasispecies) nature of many viral proteins suggests an
information-theory-based approach to drug design in which the
957: quasispecies of proteins---rather than the consensus sequence---is
958: the drug target, by maximizing the information shared between the
959: target and drug ensembles.
960:
961:
962: \vskip 0.25cm \noindent{\bf Acknowledgements} \vskip 0.25cm I
963: thank David Baltimore and Alan Perelson for discussions, and Titus
964: Brown for comments on the manuscript. Thanks are also due to
965: Daniel Wagenaar for producing Fig.~5. This work was supported by the
966: National Science Foundation under grant DEB-9981397.
967:
968:
969: \begin{thebibliography}{99}
970:
971: \setlength{\itemindent}{-0.5cm}
972:
\bibitem[]{Adami2002a}Adami, C. 2002a. What is complexity? BioEssays 24:1085-1094.
974:
975: \bibitem[]{Adami2002b}Adami, C. 2002b. Combinatorial drug design augmented by
976: information theory. NASA Tech Briefs 26:52.
977:
978: \bibitem[]{AdamiThomsen2004}Adami, C. and S.W. Thomsen 2004. Predicting protein-protein interactions from sequence data. In: Hicks MG, editor. {\it The Chemical
979: Theatre of Biological Systems}. Proceedings of the Beilstein-Institute
980: Workshop.
981:
982:
983: \bibitem[]{adami99} Adami, C. and N.J. Cerf 1999. Prolegomena to a non-equilibrium
984: quantum statistical mechanics. Chaos, Solitons, and Fractals
985: 10:1637-1650.
986:
987: \bibitem[]{Adamietal2000} Adami, C., C. Ofria and T.C. Collier 2000. Evolution of
988: biological complexity. Proc. Natl. Acad. Sci. USA 97:4463-4468.
989:
\bibitem[]{adami00}Adami, C. and N.J. Cerf 2000. Physical complexity
991: of symbolic sequences. Physica D 137:62-69.
992:
993: \bibitem[]{Atchleyetal2000}Atchley W.R., K.R. Wollenberg, W.M. Fitch, W. Terhalle,
994: and A.D. Dress 2000. Correlations among amino acid sites in bHLH protein domains:
995: An information-theoretic analysis. Mol. Biol. Evol. 17:164-178.
996:
997: \bibitem[]{basharin59}Basharin, G.P. 1959. On a statistical estimate
998: for the entropy of a sequence of random variables. Theory Probability Appl. 4:333.
999:
1000: \bibitem[]{BergvonHippel1987}Berg, O.G. and P.H. von Hippel 1987. Selection of DNA binding sites by regulatory proteins. II. The binding specificity of
1001: cyclic AMP receptor protein to recognition sites. J. Mol. Biol.
1002: 200:709-723.
1003:
1004: \bibitem[]{Bonnetonetal2003}Bonneton, F., D. Zelus, T. Iwema, M. Robinson-Rechavi,
1005: and V. Laudet 2003. Rapid divergence of the Ecdysone receptor in Diptera and
1006: Lepidoptera suggests coevolution between ECR and USP-RXR. Mol. Biol. Evol.
1007: 20:541-553.
1008:
1009: \bibitem[]{BrownCallan2004}Brown, C.T. and C.G. Callan, Jr. 2004. Evolutionary comparisons suggest
1010: many novel CRP binding sites in {\it E. coli}. Proc. Natl. Acad. Sci. USA 101:2404-2409.
1011:
1012:
1013: \bibitem[]{Carothersetal2004}Carothers, J.M., S.C. Oestreich, J.H. Davis and J.W.
1014: Szostak 2004. Informational complexity and functional activity. J. Am. Chem.
1015: Society 126 (in press).
1016:
1017: \bibitem[]{cavalier85}Cavalier-Smith, T. 1985. {\it The Evolution of
1018: Genome Size}, ed. Cavalier-Smith, T. (Wiley, N.Y.)
1019:
1020: \bibitem[]{Clarke1995}Clarke, N.D. 1995. Covariation of residues in the homeodomain
1021: sequence family. Prot. Sci. 4:2269-2278.
1022:
1023: \bibitem[]{CoverThomas1991}Cover, T.M. and J.A. Thomas 1991. {\it Elements of
1024: Information Theory}. (John Wiley, New York).
1025:
1026: \bibitem[]{Dandekaretal2000}Dandekar, T. et al.\ 2000.
1027: Re-annotating the {\it Mycoplasma pneumoniae} genome sequence:
1028: Adding value, function, and reading frames. Nucl. Acids Res.
1029: 28:3278-3288.
1030:
1031: \bibitem[]{Dewey1997}Dewey, T.G. 1997. Algorithmic complexity and thermodynamics of
1032: sequence-structure relationships in proteins. Phys. Rev. E 56:4545-4552.
1033:
\bibitem[]{durbin85}Durbin, R., S. Eddy, A. Krogh and G. Mitchison 1998. {\it Biological
Sequence Analysis}. (Cambridge University Press, Cambridge, UK).
1036:
1037: \bibitem[]{eigen71}Eigen, M. 1971. Self-organization of matter and the
1038: evolution of macromolecules. Naturwissenschaften 58:465.
1039:
1040: \bibitem[]{Giraudetal1998}Giraud B.G., A. Lapedes and L.C. Liu 1998. Analysis of
1041: correlations between sites of model protein sequences. Phys. Rev. E 58:6312-6322.
1042:
1043: \bibitem[]{Govindarajanetal2003}Govindarajan S., J.E. Ness, S. Kim, E.C. Mundorff, J. Minshull, and C.
1044: Gustafsson 2003. Systematic variation of amino acid substitutions for stringent
assignment of pairwise covariation. J. Mol. Biol. 328:1061-1069.
1046:
1047: \bibitem[]{Grosseetal2000}Grosse, I., H. Herzel, S.V. Buldyrev,
1048: and H.E. Stanley 2000. Species independence of mutual information
1049: in coding and non-coding DNA. Phys. Rev. E 61:5624-5629.
1050:
\bibitem[]{Hoffmanetal2003}Hoffman, N.G., C.A. Schiffer, and R. Swanstrom 2003.
1052: Covariation of amino acid positions in HIV-1 protease. Virology 314:536-548.
1053:
1054: \bibitem[]{Korberetal1993}Korber B.T.M., R.M. Farber, D.H. Wolpert, and A.S. Lapedes
1055: 1993. Covariation of mutations in the V3 loop of human immunodeficiency virus type
1056: 1 envelope protein: An information-theoretic analysis. Proc. Natl. Acad. Sci. USA
1057: 90:7176-7180.
1058:
\bibitem[]{kozaletal96}Kozal, M.J., et al.\ 1996. Extensive polymorphisms
1060: observed in HIV-1 clade B protease gene using high density
1061: oligonucleotide arrays. Nature Medicine 2:753-759.
1062:
\bibitem[]{lechetal96}Lech, W.J., et al.\ 1996. In vivo sequence diversity of the
1064: protease of human immunodeficiency virus type 1: {P}resence of protease inhibitor
1065: resistant variants in untreated subjects. J. Virol. 70:2038-2043.
1066:
\bibitem[]{LeighBrownetal1999}Leigh Brown, A.J., B.T. Korber, and J.H. Condra 1999.
1068: Associations between amino acids in the evolution of HIV type 1 protease sequences
1069: under indinavir therapy. AIDS Res. Hum. Retrovir. 15:247-253.
1070:
\bibitem[]{Loebetal1989}Loeb, D.D., R. Swanstrom, L. Everitt, M. Manchester,
1072: S.E. Stamper, and C.A. Hutchison III 1989. Complete mutagenesis of the HIV-1
1073: protease. Nature 340:397-400.
1074:
1075: \bibitem[]{Maggioetal2002}Maggio, E.T., M. Shenderovich, R. Kagan, D. Goddette, and
1076: K. Ramnarayan 2002. Structural pharmacogenetics, drug resistance, and the design of
1077: anti-infective superdrugs. Drug Discovery Today 7:1214-1220.
1078:
1079: \bibitem[]{maynard99b}Maynard Smith, J. 1999a. The idea of
1080: information in biology. Quart. Rev. Biol. 74:395-400.
1081:
1082: \bibitem[]{maynard99a}Maynard Smith, J. 1999b. The concept of
1083: information in biology. Philo. Sci. 67:177-194.
1084:
1085: \bibitem[]{miller54}Miller, G.A. 1954. Note on the bias of information estimates.
1086: In H. Quastler, ed., {\it Information Theory and Psychology}, pp. 95-100 (The Free
1087: Press, Illinois).
1088:
\bibitem[]{mollaetal96}Molla, A., et al.\ 1996. Ordered accumulation of mutations in
1090: HIV protease confers resistance to ritonavir. Nature Medicine
1091: 2:760-766.
1092:
\bibitem[]{OtuSayood2003}Otu, H.H. and K. Sayood 2003. A
1094: divide-and-conquer approach to fragment assembly. Bioinformatics
1095: 19:22-29.
1096:
1097: \bibitem[]{quast53}Quastler, H. (Ed.) 1953. {\it Information Theory in Biology}
1098: (University of Illinois Press, Urbana).
1099:
1100: \bibitem[]{RawsonBurton2002}Rawson, P.D. and R.S. Burton 2002. Functional
1101: coadaptation between cytochrome c and cytochrome oxidase within allopatric
1102: populations of a marine copepod. Proc. Natl. Acad. Sci. USA 99:12955-12958.
1103:
1104: \bibitem[]{roganschneider95}Rogan, P.K. and T.D. Schneider 1995. Using
1105: information content and base frequencies to distinguish mutations from
1106: genetic polymorphisms. Hum. Mut. 6:74-76.
1107:
1108: \bibitem[]{sarkar96}Sarkar, S. 1996. Decoding ``coding''---information
1109: and DNA. BioScience 46:857-864.
1110:
1111: \bibitem[]{saven97}Saven, J.G., and P.G. Wolynes 1997. Statistical
1112: mechanics of the combinatorial synthesis and analysis of folding
1113: macromolecules. J. Phys. Chem. B 101:8375-8389.
1114:
1115: \bibitem[]{schapiroetal97} Schapiro, J.M., M.A. Winters, F. Stewart, B. Efron, J.
Norris, M.J. Kozal, and T.C. Merigan 1997. The effect of high-dose
1117: saquinavir on viral load and CD4+ T-cell counts in HIV-infected
1118: patients (unpublished).
1119:
\bibitem[]{schinazietal99}Schinazi, R.F., B.A. Larder, and
J.W. Mellors 1999. Mutations in retroviral genes associated with
1122: drug resistance: 1999-2000 update. Intern. Antiviral News
1123: 7.4:46-68.
1124:
\bibitem[]{schneider86}Schneider, T.D., G.D. Stormo, L. Gold, and A.
1126: Ehrenfeucht 1986. Information content of binding sites on nucleotide
1127: sequences. J. Mol. Biol. 188:415-431.
1128:
\bibitem[]{SchmittHerzel1997}Schmitt, A.O. and H. Herzel 1997. Estimating the
1130: entropy of DNA sequences. J. theor. Biol. 188:369-377.
1131:
1132: \bibitem[]{schneider97} Schneider, T.D. 1997. Information content of
1133: individual genetic sequences. J. theor. Biol. 189:427-441.
1134:
1135: \bibitem[]{servais99}Servais, J. et al.\ 1999. The natural
1136: polymorphism of the HIV-1 protease gene in treatment naive patients
1137: and response to combination therapy including a protease
1138: inhibitor. 4th European Conference on Experimental AIDS Research, June
1139: 18-21, Tampere, Finland.
1140:
1141: \bibitem[]{servais01a}Servais, J., et al.\ 2001a. Comparison of DNA sequencing and a
1142: line probe assay for detection of Human Immunodeficiency Virus type 1 drug
1143: resistance mutations in patients failing highly active antiretroviral therapy. J.
1144: Clin. Microbiol. 39:454-459.
1145:
1146: \bibitem[]{servais01b}Servais, J., et al.\ 2001b. Variant HIV-1 proteases
1147: and response to combination therapy including a protease inhibitor. Antimicrob.
1148: Agents Chemother. 45:893-900.
1149:
1150: \bibitem[]{shannon48}Shannon, C.E. 1948. A mathematical theory of
communication. Bell System Technical Journal 27:379-423; {\em ibid.}
1152: 27:623-656; reprinted in {\it C.E. Shannon: Collected Papers},
1153: N.J.A. Sloane and A.D. Wyner, eds., IEEE Press (1993).
1154:
\bibitem[]{Shaoetal1997}Shao, W., L. Everitt, M. Manchester, D.D. Loeb, C.A.
1156: Hutchison III, and R. Swanstrom 1997. Sequence requirements of the
1157: HIV-1 protease flap region determined by saturation mutagenesis
1158: and kinetic analysis of flap mutants. Proc. Natl. Acad. Sci. USA
1159: 94:2243-2248.
1160:
1161: \bibitem[]{ShineDalgarno1974}Shine, J., and L. Dalgarno 1974. The
3'-terminal sequence of {\it E. coli} 16S ribosomal RNA:
1163: Complementarity to nonsense triplets and ribosome binding sites.
1164: Proc. Natl. Acad. Sci. USA 71:1342-1346.
1165:
1166: \bibitem[]{Sprinzletal1996}Sprinzl, M., C. Steegborn, F. H\"ubel, and S. Steinberg
1996. Compilation of tRNA sequences and sequences of tRNA genes. Nucl. Acids Res.
24:68-72.
1169:
1170: \bibitem[]{StraitDewey1996}Strait, B.J. and T.G. Dewey 1996. The Shannon information entropy of
1171: protein sequences. Biophysical Journal 71:148-155.
1172:
1173: \bibitem[]{Szostak2003}Szostak, J.W. 2003. Molecular messages:
1174: Functional information. Nature 423:689.
1175:
1176: \bibitem[]{TuckerGeraUetz2001} Tucker, C.L., J.F. Gera, and P. Uetz 2001. Towards
1177: an understanding of complex protein interaction networks. Trends
1178: Cell Biol. 11:102-106.
1179:
1180: \bibitem[]{vincent94}Vincent, L.-M. 1994. R\'eflexions sur l'usage, en
1181: biologie, de la th\'eorie de l'information. Acta Biotheoretica
1182: 42:167-179.
1183:
1184: \bibitem[]{voigtetal01}Voigt, C.A., S.L. Mayo, F.H. Arnold, and
Z.-G. Wang 2001. Computational method to reduce the search space
1186: for directed protein evolution. Proc. Natl. Acad. Sci. USA
1187: 98:3778-3783.
1188:
\bibitem[]{Weissetal2000}Weiss, O., M.A. Jimenez-Monta\~no, and H. Herzel 2000.
1190: Information content of protein sequences. J. theor. Biol. 206:379-386.
1191:
1192: \bibitem[]{WollenbergAtchley2000}Wollenberg, K.R. and W.R. Atchley 2000. Separation
1193: of phylogenetic and functional associations in biological sequences by using the
1194: parametric bootstrap. Proc. Natl. Acad. Sci. USA 97:3288-3291.
1195:
\bibitem[]{Wuetal2003}Wu, T.D., et al.\ 2003. Mutation patterns and structural
correlates in human immunodeficiency virus type 1 protease following different
1198: protease inhibitor treatments. J. Virol. 77:4836-4847.
1199:
1200:
1201: %\bibitem[]{wintersetal2000} Winters, M.A. et al. 2000. Frequency of
1202: %antiretroviral drug resistance mutations in HIV-1 strains from
1203: %patients failing triple drug regimens. Antiviral Therapy 5:57-63.
1204:
1205: \end{thebibliography}
1206: \end{document}
1207: