1: \documentclass[12pt]{article}
2: \usepackage{epsfig}
3: \usepackage{enumerate}
4: \usepackage{amsmath, amsfonts, amssymb}
5: \bibliographystyle{plain}
6: \setlength{\textheight}{8.4in} \setlength{\textwidth}{6.5in}
7: \setlength{\topmargin}{0.25in} \setlength{\headheight}{0.0in}
8: \setlength{\headsep}{0.0in} \setlength{\leftmargin}{0.25in}
9: \setlength{\rightmargin}{0.0in} \setlength{\oddsidemargin}{0.0in}
10: \setlength{\evensidemargin}{0.0in}
11: 
12: 
13: \newcommand{\ct}[1]{#1}
14: \renewcommand{\ct}[1]{ \cite{#1}}
15: \newcommand{\sub}[1]{\vspace{1.2ex}\noindent{\bf #1}}
16: \newcommand{\et}{\emph{et~al.}}
17: \newcommand{\jbreak}{}
18: \newcommand{\comment}[1]{{\bf[[#1]]}}
19: \newcommand{\cc}{\comment{Changed}}
20: 
21: \newcommand{\bes}{\mathcal{B}}
22: \newcommand{\acid}{\text{aa}}
23: 
24: \newcommand{\mon}{\begin{displaymath}}
25: \newcommand{\moff}{\end{displaymath}}
26: \newcommand{\eon}{\begin{equation}}
27: \newcommand{\eoff}{\end{equation}}
28: \newcommand{\eq}[1]{Eq. \ref{#1}}
29: \newcommand{\fig}[1]{Fig. \ref{#1}}
30: 
31: \newenvironment{changemargin}[2]{%
32:  \begin{list}{}{%
33:   \setlength{\topsep}{0pt}%
34:   \setlength{\leftmargin}{#1}%
35:   \setlength{\rightmargin}{#2}%
36:   \setlength{\listparindent}{\parindent}%
37:   \setlength{\itemindent}{\parindent}%
38:   \setlength{\parsep}{\parskip}%
39:  }%
40: \item[]}{\end{list}}
41: 
42: \long\def\symbolfootnote[#1]#2{\begingroup%
43: \def\thefootnote{\fnsymbol{footnote}}\footnote[#1]{#2}\endgroup}
44: \renewcommand{\baselinestretch}{1.0}
45: 
46: 
47: \title{Synonymous codon usage and\\
48: selection on proteins}
49: \date{October 13, 2004}
50: 
51: \author{Joshua B. Plotkin$^1$, Jonathan Dushoff$^2$, Michael M.
52: Desai$^3$,
53: Hunter B. Fraser$^4$}
54: \linespread{1.2}
55: 
56: 
57: 
58: \begin{document}
59: 
60: \maketitle
61: 
62: \begin{center} $^1$\textsc{Harvard Society of Fellows\\ 
63: and Bauer Center for Genomics Research \\
64: 7 Divinity Avenue, Cambridge MA 02138, USA} \end{center} 
65: 
66: \begin{center} $^2$\textsc{Department of Ecology and Evolutionary Biology\\ 
67: Princeton University, Princeton, NJ 08540, USA} \end{center} 
68: 
69: \begin{center} $^3$\textsc{Department of Physics \\
70: and Department of
71: Molecular and Cellular Biology\\ 
72: Harvard University, Cambridge, MA 02138, USA} \end{center} 
73: 
74: \begin{center} $^4$\textsc{Department of Molecular and Cell Biology\\
75: University of California, Berkeley, CA, 94720, USA} \end{center}
76: 
77: 
78: 
79: \bigskip
80: 
81: \vspace*{.3in}
82: 
83: 
84: 
85: \pagebreak
86: 
87: 
88: \begin{changemargin}{.9cm}{.9cm} 
89: 
90: 
91: \noindent Selection pressures on proteins are usually measured by
92: comparing homologous nucleotide sequences\ct{ZuckPaul65}.  Recently we
93: introduced a novel method, termed `volatility', to estimate selection
94: pressures on protein sequences from their synonymous codon
95: usage\ct{PlotDush03,PlotDush04}.  Here we provide a theoretical
96: foundation for this approach.  We derive the expected frequencies of
97: synonymous codons as a function of the strength of selection, the
98: mutation rate, and the effective population size.  We analyze the
99: conditions under which we can expect to draw inferences from biased
100: codon usage, and we estimate the time scales required to establish and
101: maintain such a signal.  Our results indicate that, over a broad range
102: of parameters, synonymous codon usage can reliably distinguish between
103: negative selection, positive selection, and neutrality.  While the power
104: of volatility to detect negative selection depends on the population
105: size, there is no such dependence for the detection of positive
106: selection.  Furthermore, we show that phenomena such as transient
107: hyper-mutators in microbes can improve the power of volatility to detect
108: negative selection, even when the typical observed neutral site
109: heterozygosity is low.
110: 
111: \end{changemargin}
112: 
113: \section{Introduction}
114: 
115: Nucleotide coding sequences of many organisms exhibit significant codon
116: bias -- that is, unequal usage of synonymous codons.  Codon bias has
117: been attributed both to neutral processes, such as asymmetric mutation
118: rates, and to selection acting on the synonymous codons
119: themselves. The most common selective explanation of codon bias posits
120: that synonymous codons differ in their fitness according to the relative
121: abundances of iso-accepting tRNAs; a codon corresponding to a more
122: abundant tRNA would be used preferentially so as to increase
123: translational efficiency\ct{Ikem81,DebrMarz94,SoreKurl89}.  To a large
124: extent, this hypothesis has successfully
125: explained interspecific variation in genome-wide codon usage for
126: organisms ranging from \textit{Escherichia coli} to \textit{Drosophila
127: melanogaster}\ct{Akas01}.
128: 
129: Recently, however, we have noted that codon bias in a protein sequence
130: can also result from selection at the amino acid level, even in the
131: absence of direct selection on synonymous codons
132: themselves\ct{PlotDush03,PlotDush04}. Codon bias arises from selection
133: at the amino acid level because of asymmetries in the structure of the
134: standard genetic code. Proteins that experience different selective
135: regimes should exhibit different synonymous codon usage.  Following from
136: this observation, we have introduced methods to screen a single genome
137: sequence for estimates of the selection pressures acting on its proteins
138: by comparing their synonymous codon usage\ct{PlotDush04}.
139: 
140: In this paper, we provide a theoretical discussion of codon usage biases
141: that result from selection at the amino acid level.  Our analysis helps
142: to provide a theoretical grounding for techniques of estimating
143: selection pressures on proteins using signals gathered from their
144: synonymous codon usage\ct{PlotDush03,PlotDush04}. Throughout most of
145: this paper, we will ignore any source of direct selection on synonymous
146: codons, and focus on the codon biases that result purely from selection
147: at the amino acid level.  To the extent that any other sources of codon
148: bias apply equally across the genome, we have devised a bootstrap method
149: to control for these external sources of codon bias when estimating
150: selection pressures on proteins\ct{PlotDush03,PlotDush04}.  In the
151: discussion, however, we describe a range of confounding factors that may
152: vary across the genome in some organisms and limit the applicability of
153: codon-based methods to detect selection.
154: 
155: \section{Codon volatility}
156: 
157: Codon usage biases can arise from the familiar process of
158: selection on proteins because synonymous codons may differ in their
159: \textit{volatility} -- defined, loosely, as the proportion of a codon's
160: point mutations that result in an amino acid
161: substitution\ct{PlotDush03}. Although there are several possible
162: definitions of volatility, which can all be informative, we have
163: recently used the following formal definition\ct{PlotDush04}.
164: 
165: We index the 61 sense codons in an arbitrary order $i=1,\ldots,61$.  We
166: use the notation $\acid(i)$ to denote the amino acid encoded by codon
167: $i$.  For each codon $i$, let $B(i)$ denote the set of sense codons
168: that differ from codon $i$ by a single point mutation. 
169: We define the volatility of codon $i$ by:
170: \begin{equation}
171: \nu(i) = \frac{1}{\#B(i)}\sum_{j \in B(i)}
172: D[\acid(i),\acid(j)],
173: \label{voldef}
174: \end{equation}
175: where $D$ denotes the Hamming metric, which is zero if two amino acids
176: are identical, and one otherwise.  The definition in Eq.~\ref{voldef}
177: applies when all nucleotide mutations occur at the same rate.
178: When differential nucleotide mutation rates are known
179: (\textit{e.g.} a transition/transversion bias\ct{Wake96}), these rates can be
180: incorporated into the definition of codon volatility by appropriately
181: weighting the ancestor codons\ct{PlotDush04}.
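
As an illustration of Eq.~\ref{voldef}, consider the Arginine codon CGA
under the standard genetic code.  Of its nine single-nucleotide
neighbors, one (TGA) is a termination codon and is excluded, so that
$\#B(\mathrm{CGA})=8$.  Four of the remaining neighbors (CGC, CGG, CGT,
and AGA) also encode Arginine, and the other four (GGA, CAA, CCA, and
CTA) encode a different amino acid, which gives
\[
\nu(\mathrm{CGA}) = \frac{1}{8}\,(0+0+0+0+1+1+1+1) = \frac{4}{8}.
\]
The same count for AGG, whose nine neighbors are all sense codons and
only two of which (AGA and CGG) are synonymous, gives
$\nu(\mathrm{AGG}) = 7/9$.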
182: 
183: Minor variants of Eq. \ref{voldef} yield related definitions of codon
184: volatility. For some applications, one may want to allow termination
185: codons in the definition of $B(i)$. It is also natural to consider
186: alternatives to the Hamming metric, $D$, that weight substitutions
187: between amino acids depending upon the differences in their
188: stereochemical properties\ct{MiyaMiya79,PlotDush03}. A variety of other
189: metrics\ct{TangWyck04,YampStol04} that reflect the effects of different
190: amino acid substitutions on protein structure may likewise be
191: incorporated into the definition of codon volatility.  In this paper,
192: however, we will focus on the most basic definition of codon volatility
193: (Eq.  \ref{voldef}, using the Hamming metric), because variant
194: definitions are based on the same underlying principle and produce
195: similar results in practice\ct{PlotDush03}.
196: 
197: Under the most basic definition of volatility, there are four amino
198: acids (Glycine, Leucine, Arginine, and Serine) whose codons differ in
199: their volatility.  As a result, when controlling for amino acid content,
200: we obtain a volatility signal from only those sites that contain one of
201: these four amino acids -- which amounts to about 30\% of the sites in a
202: typical gene. (If one uses stereochemical
203: metrics\ct{MiyaMiya79,PlotDush03} for $D$ in the definition of
204: volatility, then $\sim\!75$\% of the sites in a gene contain a
205: volatility signal).  Although 30\% may seem like a small proportion of
206: sites from which to obtain a signal of selective pressures, it is larger
207: than the proportion of sites often used to detect selection via sequence
208: comparison of recently diverged species\ct{FleiAlla02,ClarkGlan03}. (For
209: example, fewer than 4\% of neutral sites exhibit substitutions when
210: comparing human and chimpanzee sequences\ct{ClarkGlan03}.)
211: 
212: 
213: In the following sections we analyze the consequences of selection on
214: proteins for codon usage in general, as well as for the volatility
215: measure in particular.  We demonstrate that the expected codon usage at
216: a site, as well as its temporal dynamics, depend upon the strength of
217: positive or negative selection on the amino acid sequence.  In Sections
218: \ref{QSmodel} through \ref{PopSize} we examine negative selection in
219: infinite and finite populations. In Section \ref{PosSel} we discuss
220: positive selection.  Our analysis is initially confined to the patterns
221: of codon usage at a single site under selection at the amino acid level.
222: Proceeding from this analysis, we also discuss codon usage over many
223: sites within a gene or genome, and analyze how many sites are required
224: in principle to detect a reliable signal of selection by inspecting
225: synonymous codon usage. 
226: 
227: \section{Negative Selection and Codon Bias in an Infinite Population}
228: 
229: \label{QSmodel}
230: 
231: Most nonsynonymous mutations in a protein coding sequence presumably
232: reduce the fitness of an organism. For a large proportion of sites,
233: therefore, natural selection opposes any change in the amino acid.  We
234: refer to this type of selection as ``negative selection.''  
235: 
236: For the purposes of exploring the effect of negative selection on codon
237: usage, we assume that selection cannot discriminate between the
238: synonymous codons for the favored amino acid at a site.
239: However, mutations are more likely to be nonsynonymous, and hence
240: deleterious, if the codon at that site has high volatility. As we will
241: show, this fact results in an effective preference for the less
242: volatile codons, among those codons that code for the favored amino acid
243: at the site. We emphasize that this preference for a codon of low
244: volatility at a site under negative selection is \textit{not} caused by
245: a direct fitness difference between synonyms.  Rather, more volatile
246: codons will occur less frequently as a second-order consequence of
247: negative selection at the amino acid level, and the structure of the
248: genetic code. 
249: 
250: 
251: Proteins with a larger number of sites under negative selection will
252: exhibit a statistical bias towards less volatile codons, after
253: controlling for their amino acid content.  
254: Here we calculate the expected magnitude of the codon bias as a function
255: of the mutation rate, the strength of negative selection, and, in
256: Section \ref{Wright}, the population size.  We also analyze the
257: conditions under which we can expect to detect and draw inferences from
258: this bias, and we estimate the time scales needed to establish and
259: maintain such a signal.
260: 
261: \subsection{A simplified genetic code}
262: 
263: In an infinite population, we can describe the dynamics of codon usage
264: at an individual site by using the standard multi-allele model first
265: introduced by Haldane\ct{Hald27} and used throughout the literature
266: (\textit{e.g.} ref.\ct{Nagy92} Eq. 2.25 or ref.\ct{Higg94}).  This model
267: describes a single site which can assume any of $K$ states. In order to
268: investigate codon usage, we consider $K = 64$ states, corresponding to
269: each of the $64$ possible codons.  In continuous time, the frequency
270: $x_i$ of individuals with codon $i$ evolves according to
271: \begin{equation} \frac{dx_i}{dt}=\sum_{j=1}^Kx_j(t) w_j M_{ij} - x_iW(t)
272: \label{QS} \end{equation} where $w_j$ is the Malthusian fitness of codon
273: $j$, $W(t) \equiv \sum_j w_j x_j(t)$ is the mean fitness of the
274: population, and $M_{ij}$ is the instantaneous rate of mutation from
275: codon $j$ to codon $i$, with $\sum_i M_{ij}=1$.  Although Eq. \ref{QS} is non-linear, the
276: equilibrium frequencies of the ``alleles'' $i=1,2,\ldots, K$ are given by
277: the leading eigenvector of the matrix $w_j M_{ij}$\ct{ThomMcBr74}.
278: These frequencies determine the expected equilibrium codon usage at a
279: site. For the purposes of this paper, alternative formulations of the
280: $K$-allele model that treat the processes of selection and mutation
281: separately (\textit{e.g.} ref.\ct{CrowKimu70} Eq. 6.4.1) yield the same
282: results.
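
To see why the equilibrium is an eigenvector, one can set $dx_i/dt=0$ in
Eq. \ref{QS}.  Writing $\hat{x}_i$ for the equilibrium frequencies and
$\hat{W}=\sum_j w_j \hat{x}_j$ for the equilibrium mean fitness
(notation introduced here only for this remark), we obtain
\[
\sum_{j=1}^K w_j M_{ij}\, \hat{x}_j = \hat{W}\, \hat{x}_i ,
\]
so that $\hat{x}$ is an eigenvector of the matrix with entries
$w_j M_{ij}$, with eigenvalue $\hat{W}$.  Provided that this matrix is
non-negative and irreducible (as it is when all fitnesses are positive
and every codon can be reached from every other by a chain of
mutations), the Perron-Frobenius theorem implies that the leading
eigenvector is the only eigenvector whose entries are all positive, and
hence the only one that can represent a set of frequencies.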
283: 
284: The equilibrium solution to Eq. \ref{QS} for the full genetic code does
285: not lend itself to intuitive understanding. Transient dynamics are also
286: difficult to calculate in this high-dimensional system.  Therefore, in
287: order to highlight the essential points of our analysis, we first
288: consider a ``toy'' genetic code that retains those features of the true
289: genetic code relevant to the study of synonymous codon usage under
290: negative selection.  As we will demonstrate, the solution for the
291: simplified genetic code yields a complete understanding for the full
292: genetic code as well.
293: 
294: We imagine a simplified genetic system with only three possible codons,
295: $a_1$, $a_2$, and $b$.  Codons $a_1$ and $a_2$ code for amino acid $A$,
296: which is favored, and codon $b$ encodes amino acid $B$, which has
297: selective disadvantage $\sigma$.  We assume that mutations occur at rate
298: $u$ between these codons according to the structure \[ a_1
299: \rightleftarrows a_2 \rightleftarrows b, \] so that of the two
300: synonymous codons, $a_2$ is more volatile.
301: 
302: According to the standard multi-allele model (\eq{QS}), the relative
303: frequencies of codons $a_1$, $a_2$, and $b$ are described by the
304: equation
305: \begin{equation} \label{trit}
306: \frac{d}{dt}\left (\begin{array}{c} a_1(t) \\ a_2(t) \\ b(t)
307: \end{array}\right ) = \left ( \begin{array}{ccc}
308: 1-u & u & 0 \\ u & 1-2u & u(1-\sigma) \\ 0 & u & (1-u)(1-\sigma)
309: \end{array} \right ) \left (\begin{array}{c} a_1(t) \\ a_2(t) \\
310: b(t) \end{array}\right ) - W(t) \left (\begin{array}{c} a_1(t) \\
311: a_2(t) \\ b(t) \end{array}\right ),
312: \end{equation} 
313: where $W(t)=a_1(t)+a_2(t)+(1-\sigma)b(t)$. 
314: 
315: The equilibrium frequencies
316: of codons are given by the leading eigenvector of the matrix in Eq.
317: \ref{trit}. A simple perturbation analysis of this eigenvector shows that the
318: equilibrium frequency of $a_1$ depends monotonically
319: on $\sigma$, and it exhibits a sharp transition between two regimes: the
320: weak selection regime $\sigma \ll u$ and the strong selection regime
321: $\sigma \gg u$. In the weak selection regime, the equilibrium relative
322: frequencies of synonyms are given by the expansion
323: \eon \frac{\hat{a_1}}{\hat{a_1}+\hat{a_2}}=\frac{1}{2}+\frac{1}{12}
324: \frac{\sigma}{u} +
325: O\left(\frac{\sigma^2}{u^2}\right). \eoff
326: And in the strong selection regime, the equilibrium relative
327: frequencies are given by
328: \eon \frac{\hat{a_1}}{\hat{a_1}+\hat{a_2}}=\frac{\sqrt 5 -1}{2}-\frac{
329: (5-2\sqrt5)(1-\sigma)}{5} \frac{u}{\sigma} +
330: O\left(\frac{u^2}{\sigma^2}\right). \eoff
331: 
332: In the absence of selection $(\sigma=0)$ all three codons occur with
333: equal frequency, as we would expect.  In particular, the relative
334: frequency of the two synonymous codons $a_1$ and $a_2$ equals
335: $\frac{1}{2}$, regardless of the mutation rate.  For weak selection
336: ($\sigma \ll u$), this result is still approximately true, according to
337: the perturbation expansion above.  In the case of strong negative
338: selection ($\sigma \gg u$), the relative frequency of the two synonymous
339: codons is given approximately by the inverse of the golden mean,
340: $\frac{\sqrt 5 -1}{2} \approx 0.62$. 
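
Both limiting values can be checked directly from the matrix in Eq.
\ref{trit}.  The following is only a sketch, with $\lambda$ denoting the
leading eigenvalue of that matrix (a symbol introduced here for
illustration).  The first row of the eigenvalue equation reads
$(1-u)\hat{a_1} + u\,\hat{a_2} = \lambda\, \hat{a_1}$, so that
\[
\frac{\hat{a_2}}{\hat{a_1}} = \frac{\lambda - 1 + u}{u}.
\]
For $\sigma \ll u$, first-order perturbation theory about the uniform
eigenvector gives $\lambda \approx 1 - \sigma/3$, hence
$\hat{a_2}/\hat{a_1} \approx 1 - \sigma/(3u)$ and
\[
\frac{\hat{a_1}}{\hat{a_1}+\hat{a_2}} \approx \frac{1}{2 -
\sigma/(3u)} \approx \frac{1}{2} + \frac{1}{12}\frac{\sigma}{u}.
\]
In the lethal limit $\sigma = 1$, the third column of the matrix
vanishes and $\lambda$ is the larger eigenvalue of the upper-left
$2 \times 2$ block, $\lambda = 1 - \frac{3-\sqrt 5}{2}\,u$.  Then
$\hat{a_2}/\hat{a_1} = \frac{\sqrt 5 - 1}{2}$ and
\[
\frac{\hat{a_1}}{\hat{a_1}+\hat{a_2}} = \frac{1}{1 + \frac{\sqrt 5
-1}{2}} = \frac{2}{1+\sqrt 5} = \frac{\sqrt 5-1}{2}.
\]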
341: 
342: The sharp transition between the weak and strong selection regimes
343: defines $\sigma = u$ as a critical value for negative selection.  For
344: $\sigma \ll u$ negative selection is ineffective at favoring the less
345: volatile codon, and the site is effectively neutral.  But when $\sigma
346: \gg u$, negative selection favors the less volatile codon, and the
347: magnitude of this effect depends only weakly on the value of $\sigma$.
348: This is an essential point.  In the strong selection regime, the
349: magnitude of negative selection is relatively unimportant; volatile
350: codons are disfavored at all sites where $\sigma \gg u$. The transition
351: between the weak and strong selection regimes is shown in Fig.
352: \ref{TritEquil}.
353: 
354: \begin{figure}[ht] \begin{center}
355: \epsfig{file=TritEquil.eps,angle=0,width=12cm} \caption{The relationship
356: between selection at the amino acid level and resulting synonymous codon
357: usage. The graph shows relative equilibrium frequency of synonymous
358: codons,  $\hat{a_1}/(\hat{a_1}+\hat{a_2})$, as a function of the strength
359: of negative selection, $\sigma$. The relative frequency of codon $a_1$
360: is approximately $\frac{1}{2}$ in the weak selection regime ($\sigma \ll
361: u$), and approximately $\frac{\sqrt{5}-1}{2}$ in the strong selection
362: regime ($\sigma \gg u$). In this figure $u=10^{-5}$.} \label{TritEquil}
363: \end{center} \end{figure}
364: %From TritInfinitePopGraph.nb
365: 
366: \subsection{The effective disadvantage of a volatile codon}
367: 
368: The critical value of $\sigma$ discussed above can be understood
369: intuitively by considering the ``effective selective disadvantage'' of
370: the more volatile codon $a_2$ that results indirectly from its
371: volatility.  We will use the notion of an ``effective selective
372: disadvantage'' to aid in our analysis of codon usage at a
373: site under negative selection. But we emphasize that our model (Eq.
374: \ref{QS}) does not assume any direct fitness difference
375: between synonymous codons.
376: 
377: When the disfavored amino acid $B$ is lethal to the organism, then the
378: effective selective disadvantage of codon $a_2$ is particularly simple
379: to understand.  In this case, individuals with codon $a_2$ are removed
380: from the population at rate $u$ because they mutate to the lethal codon
381: $b$, but receive no back-mutations.  Hence the effective selective
382: disadvantage, denoted $s$, of codon $a_2$ versus codon $a_1$ is given by
383: $s = u$. The effective selective disadvantage of $a_2$ does not arise
384: from a fitness difference between synonyms, but rather from selection at
385: the level of amino acids and the structure of the genetic code.
386: 
387: When amino acid $B$ is not lethal the situation is slightly more
388: complicated.  Nevertheless, for $\sigma \gg u$, mutants from $a_2$ to
389: $b$ typically die due to negative selection before they mutate back from
390: $b$ to $a_2$. As a result, the effective selective disadvantage will
391: still be $s=u$ in the regime of strong selection. We can make this
392: argument concrete by considering the mutation-selection balance between
393: codon $b$ and codon $a_2$. According to the standard mutation-selection
394: balance, the equilibrium frequency of codon $b$ relative to codon $a_2$
395: equals $\frac{u}{\sigma}$ in the regime $\sigma \gg u$.  Thus for each
396: mutant from $a_2$ to $b$, there are at most of order $\frac{u}{\sigma}$
397: mutations from $b$ to $a_2$. The net mutation rate from $a_2$ to $b$ is
398: therefore $u \left(1-\frac{u}{\sigma} \right)$.  This is the rate at
399: which individuals of type $a_2$ are lost from the population due to the
400: fact that $a_2$ is more volatile than $a_1$.  Thus the effective
401: selective disadvantage of $a_2$ relative to $a_1$ is given by $s = u
402: \left( 1 - \frac{u}{\sigma} \right)$.  By definition, in the strong
403: selection regime we neglect $\frac{u}{\sigma}$ compared to 1, and the
404: effective selective disadvantage of codon $a_2$ is simply $s = u$.  
405: 
406: 
407: A similar argument holds for the real genetic code.  In this case, the
408: favored amino acid may correspond to several synonymous codons, each
409: with a potentially different volatility. However, the effective
410: selective disadvantage, $s$, of a more volatile codon relative to a less
411: volatile synonym is simply the difference in the number of mutations
412: leading to a disfavored codon ($\sigma \gg u$) times $\frac{u}{3}$,
413: where $u$ is the nucleotide mutation rate. (Note that $\frac{u}{3}$ is
414: the rate of mutation between any two particular nucleotides.)   For
415: example, when considering the relative frequencies of codons AGA and CGG
416: at a site under negative selection for Arginine, AGA has selective
417: disadvantage $s=\frac{2}{3}u$ compared to CGG, since AGA has two more
418: disfavored neighbors than CGG. By using the value of the effective
419: selective disadvantage, $s$, we can calculate the equilibrium relative
420: frequency of any pair of synonymous codons in mutation-selection
421: balance, and thereby deduce the relative frequencies of all synonyms.
422: Therefore, we can predict synonymous codon usage in the genetic code
423: without resorting to the full solution of \eq{QS}.
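
The count for this example can be made explicit.  Of the nine point
mutations of AGA, two lead to other Arginine codons (AGG and CGA) and
seven lead to disfavored codons (AGC, AGT, GGA, AAA, ACA, ATA, and the
termination codon TGA).  Of the nine point mutations of CGG, four lead
to other Arginine codons (CGA, CGC, CGT, and AGG) and five lead to
disfavored codons (GGG, TGG, CAG, CCG, and CTG).  Assuming that all of
these disfavored codons fall in the strong selection regime, we have
\[
s = (7-5)\,\frac{u}{3} = \frac{2}{3}\,u .
\]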
424: 
425: An analogous argument can be used to calculate the effective selective
426: disadvantage of codon $a_2$ in the regime of weak selection ($\sigma \ll
427: u$). In this regime, the relative equilibrium frequency of codon $b$
428: versus codon $a_2$ equals $1-\frac{\sigma}{2u}$. Thus, the effective
429: selective disadvantage of $a_2$ versus $a_1$ is approximately $s = 0$,
430: plus a small correction of order $\sigma$.  In other words, when $\sigma
431: \ll u$ selection between $a_1$ and $a_2$ is effectively neutral; it
432: cannot generate codon bias.  We therefore refer to the regime
433: $\sigma \ll u$ as the ``almost neutral regime.'' This result holds both
434: for the simplified three-codon model and for the real genetic code.  
435: %SeeClassicMutationSelectionBalance.nb
436: 
437: 
438: It is also important to calculate the amount of time required to reach
439: equilibrium codon usage in the presence of strong negative selection.
440: Explicit solution of Eq. \ref{trit}, assuming $\sigma \gg u$, indicates
441: that the $e$-fold relaxation time is of order $\frac{1}{u}$ (the
442: selection coefficient is $s \sim u$, and so the time scale for population
443: sizes to change under selection is of order $\frac{1}{s} \sim
444: \frac{1}{u}$).  In other words, starting from any initial frequencies
445: $a_1(0)$ and $a_2(0)$, these frequencies will become $e$-fold closer to
446: their equilibrium values after a duration of order $\frac{1}{u}$
447: generations.  The same time scale holds for almost neutral sites
448: ($\sigma \ll u$) and for the real genetic code\symbolfootnote[2]{For
449: $\sigma \ll u$, the process is almost neutral and the time scale
450: calculation of Section \ref{relax} applies.  The real genetic code has
451: the same dynamics because we still have $s \sim u$ for $\sigma \gg u$
452: and neutral behavior for $\sigma \ll u$.}.  In practice, $u$ will
453: be quite small, and equilibrium volatility is approached very slowly.
454: We will revisit this point when we discuss finite populations, and again
455: when we discuss positive selection.
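
The order-$\frac{1}{u}$ time scale can be illustrated in the lethal
limit $\sigma=1$ of Eq. \ref{trit}; this is a sketch under that
simplifying assumption.  The relative frequencies of $a_1$ and $a_2$ are
then obtained by normalizing the solution of the linear system defined
by the upper-left $2\times 2$ block of the matrix, whose eigenvalues
(denoted $\lambda_{\pm}$ here for illustration) are
\[
\lambda_{\pm} = 1 - \frac{3 \mp \sqrt{5}}{2}\,u .
\]
The contribution of the subdominant eigenvector relative to the
dominant one decays as $e^{-(\lambda_+ - \lambda_-)t} = e^{-\sqrt 5\, u
t}$, so the $e$-fold relaxation time is $\frac{1}{\sqrt 5\,u} \approx
\frac{0.45}{u}$.  For $u = 10^{-5}$ this is roughly $4.5 \times 10^4$
generations.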
456: 
457: \subsection{A specific example: selection for Arginine}
458: 
459: In this section we consider a simple example that demonstrates how
460: our analysis applies to the real genetic code. We use Eq. \ref{QS} to
461: model the dynamics of $K=64$ alleles corresponding to the 64 codons,
462: indexed in an arbitrary order. For our example, we consider a single
463: site under negative selection for an Arginine codon. In this case we
464: define
465: \begin{equation}
466: M_{ij}=\begin{cases}
467: 1-3u, \text{\ \ \ if $i=j$}\\
468: u/3, \text{\ \ \ if $i$ and $j$ differ by a point mutation} \\
469: 0, \text{\ \ \ otherwise}
470: \end{cases}
471: \end{equation}
472: where $u$ is the nucleotide mutation rate. We define
473: \begin{equation}
474: w_i=\begin{cases}
475: 1, \text{\ \ \ if $i$ encodes Arginine} \\
476: 1-\sigma, \text{\ \ \ if $i$ encodes a non-Arginine amino acid}\\
477: 1-\gamma, \text{\ \ \ if $i$ encodes stop}
478: \end{cases}
479: \end{equation}
480: so that a codon encoding an amino acid other than Arginine has
481: fitness $1-\sigma$, and a termination codon has fitness
482: $1-\gamma$.  We analyze this model numerically by calculating the
483: leading eigenvector of the matrix $w_j M_{ij}$, which yields the
484: equilibrium frequencies of all 64 codons.
485: 
486: In the case of no selection ($\sigma = \gamma = 0$), we find that all
487: codons occur with the same equilibrium frequency, independent of
488: mutation rate, as we would expect.  For almost neutral selection
489: ($\sigma \sim \gamma \ll u$), codon usage is still approximately
490: uniform. In the opposite case when Arginine is favored and all other
491: amino acids (or termination codons) are strongly disfavored
492: (\textit{i.e.} $\sigma \sim \gamma \gg u$), the Arginine codons CGA,
493: CGG, CGC, CGT, AGA, and AGG occur with equilibrium relative frequencies
494: $\approx$ 0.214 : 0.214 : 0.191 : 0.191 : 0.095 : 0.095.  As expected,
495: under negative selection the more volatile Arginine codons occur with
496: lower relative frequency in equilibrium.
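
These ratios can be recovered analytically in the limit where every
non-Arginine codon is lethal ($\sigma=\gamma=1$), a simplification
adopted here only for illustration.  In that limit only the six Arginine
codons matter, and the equilibrium frequencies are given by the leading
eigenvector of the adjacency matrix of their synonymous point mutations.
By symmetry we may write $x$ for the weight of CGA and CGG, $y$ for CGC
and CGT, and $z$ for AGA and AGG, and $\beta$ for the leading eigenvalue
of the adjacency matrix (all four symbols are introduced here for
illustration).  The eigenvalue equations are
\[
\beta x = x + 2y + z, \qquad \beta y = 2x + y, \qquad \beta z = x + z ,
\]
which give $(\beta-1)^2 = 5$ and therefore $x : y : z = \sqrt 5 : 2 :
1$.  Normalizing so that $2(x+y+z)=1$ yields
\[
x = \frac{\sqrt 5}{2(3+\sqrt 5)} \approx 0.214, \qquad
y = \frac{1}{3+\sqrt 5} \approx 0.191, \qquad
z = \frac{1}{2(3+\sqrt 5)} \approx 0.095 .
\]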
497: 
498: The equilibrium frequencies of Arginine codons determine the expected
499: volatility at a single Arginine site under negative selection.
500: Assuming free recombination\ct{SawyHart92}, an individual gene consists
501: of many such sites randomly assembled; the mean and standard deviation
502: in the volatility (per site) of a randomly sampled gene are shown in
503: Fig.  \ref{argex}, as a function of the strength of negative selection
504: $\sigma$.  Note that the stronger the negative selection, the lower the
505: expected equilibrium volatility. The expected volatility exhibits a
506: sharp transition from high to low values when the strength of negative
507: selection $\sigma$ reaches the mutation rate $u$, as discussed above.
508: On either side of this transition, the volatility is insensitive to
509: $\sigma$. The standard deviations plotted in Fig. \ref{argex} correspond
510: to a gene comprised of $L=200$ such sites, each modeled independently by
511: the multi-allele equation. 
512: 
513: 
514: \begin{figure}[ht] \begin{center}
515: \epsfig{file=ArgEx.eps,angle=0,width=12cm} \caption{The relationship
516: between selection and volatility for a gene comprised of $L=200$ freely
517: recombining sites under selection for Arginine. The graph shows expected
518: volatility per site in the gene ($\pm 1$ standard deviation, dashed) as
519: a function of the strength of negative selection, $\sigma$. The
520: nucleotide mutation rate is $u=10^{-5}$.  The expected volatility is
521: significantly depressed in the regime of strong negative selection,
522: $\sigma \gg u$.  (For this figure we assume $\gamma = 1$; virtually
523: identical results hold for $\gamma = \sigma$.) } \label{argex}
524: \end{center} \end{figure}
525: 
526: According to Fig.  \ref{argex}, $L=200$ independent sites that each
527: experience neutrality ($\sigma \ll u$) can be distinguished on the basis
528: of their volatility from $L=200$ sites that experience negative
529: selection ($\sigma \gg u$).  The difference in the expected volatility
530: between these two regimes is greater than four standard deviations of
531: the volatility within either regime.
532: 
533: In reality, the selective constraint $\sigma$ will vary greatly across
534: the sites of a given protein.  In this case, disregarding the
535: possibility of positive selection, the volatility of a gene (after
536: controlling for its amino acid sequence) essentially reflects the
537: relative number of informative sites that experience negative selection
538: versus neutrality.  For example, the volatility of gene $X$ that
539: contains $L=200$ informative sites under negative selection and an equal
540: number of neutral sites will be significantly greater (with a
541: $Z$-score of about three) than the volatility of gene $Y$ that
542: consists of $2L$ informative sites all under negative selection.
543: A more thorough discussion of variable selection pressures across genes
544: is given in Section \ref{Infer}, below.
545: %Z-score in MixedGenesArginineExample.xls
546: 
547: Table \ref{GLRS} shows the equilibrium relative frequencies of
548: synonymous codons for each of the informative amino acids (G, L, R, and
549: S) under neutrality versus various selective regimes.  In Table
550: \ref{GLRS} we assume, as we do throughout this manuscript, that
551: volatility is measured using the Hamming metric and that there is no
552: transition/transversion bias.  Corresponding values for different
553: metrics or including a mutational bias may be calculated using the same
554: approach. As seen in Table \ref{GLRS}, the difference in the expected
555: volatility between selective regimes is least extreme (indeed, barely
556: informative) for Glycine sites.  The volatility difference is most
557: extreme for Serine sites: the highly volatile codons AGT and AGC are not
558: expected to occur at a site under negative selection, but they
559: preferentially occur at a site under positive selection.  This extreme
560: case results from the fact that codons AGT and AGC are not connected by
561: synonymous point mutations to the other Serine codons.
562: This situation does not imply that codons AGT and AGC should be treated
563: separately from the other Serine codons. In fact, when treated as an
564: entire group, the Serine codons are particularly informative for
565: positive selection (Table \ref{GLRS}).
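
The summary rows of Table \ref{GLRS} can be reproduced from the codon
frequencies and the volatilities in the last column, taking
$\mathbb{E}[\nu]$ to be the frequency-weighted mean volatility.  For
example, at a Glycine site under neutrality each codon has frequency
$\frac{1}{4}$, so that
\[
\mathbb{E}[\nu] = \frac{1}{4}\left(\frac{5}{8} + \frac{6}{9} +
\frac{6}{9} + \frac{6}{9}\right) = \frac{21}{32} \approx 0.656 ,
\]
whereas at a Serine site under strong negative selection only TCA, TCC,
TCG, and TCT occur, each with frequency $\frac{1}{4}$, so that
\[
\mathbb{E}[\nu] = \frac{1}{4}\left(\frac{4}{7} + \frac{6}{9} +
\frac{5}{8} + \frac{6}{9}\right) \approx 0.632 .
\]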
566: 
567: %from RelFreqs.xls
568: \renewcommand{\baselinestretch}{.9}
569: \begin{table}
570: {\small
571: \begin{center}
572: \begin{tabular}{lllllc}
573: & Neutral & Neutral* & Negative & Positive & $\nu$\\
574: \textbf{Leucine} & & & & &\\
575: cta & 0.16667 & 0.17300 & 0.21353 & 0.14213 & 5/9\\
576: ctc & 0.16667 & 0.18580 & 0.19098 & 0.17056 & 6/9\\
577: ctg & 0.16667 & 0.17890 & 0.21353 & 0.14213 & 5/9\\
578: ctt & 0.16667 & 0.18580 & 0.19098 & 0.17056 & 6/9 \\
579: tta & 0.16667 & 0.12990 & 0.09549 & 0.18274 & 5/7\\
580: ttg & 0.16667 & 0.14650 & 0.09549 & 0.19188 & 6/8 \\
581: \hline $\mathbb{E}[\nu]$ & 0.65146 & 0.64590 & 0.63172 & 0.65978 &\\
582: $\sigma[\nu]$ & 0.07362 & 0.07259 & 0.07022 & 0.07217 &\\
583: \\
584: \textbf{Arginine} & & & & &\\
585: aga & 0.16667 & 0.15210 & 0.09549 & 0.19149 & 6/8\\
586: agg & 0.16667 & 0.17050 & 0.09549 & 0.19859 & 7/9\\
587: cga & 0.16667 & 0.15210 & 0.21353 & 0.12766 & 4/8\\
588: cgc & 0.16667 & 0.17740 & 0.19098 & 0.17021 & 6/9\\
589: cgg & 0.16667 & 0.17050 & 0.21353 & 0.14184 & 5/9\\
590: cgt & 0.16667 & 0.17740 & 0.19098 & 0.17021 & 6/9\\
591: \hline $\mathbb{E}[\nu]$ & 0.65278 & 0.65400 & 0.62592 & 0.66766 & \\
592: $\sigma[\nu]$ & 0.09854 & 0.09660 & 0.09354 & 0.09528 & \\
593: \\
594: \textbf{Serine} & & & & & \\
595: agc & 0.16667 & 0.18510 & 0.00000 & 0.20636 & 8/9\\
596: agt & 0.16667 & 0.18510 & 0.00000 & 0.20636 & 8/9\\
597: tca & 0.16667 & 0.13440 & 0.25000 & 0.13265 & 4/7\\
598: tcc & 0.16667 & 0.17190 & 0.25000 & 0.15477 & 6/9\\
599: tcg & 0.16667 & 0.15162 & 0.25000 & 0.14510 & 5/8\\
600: tct & 0.16667 & 0.17190 & 0.25000 & 0.15477 & 6/9\\
601: \hline $\mathbb{E}[\nu]$ & 0.71792 & 0.72981 & 0.63243 & 0.73970 & \\
602: $\sigma[\nu]$ & 0.12504 & 0.12561 & 0.03913 & 0.12847 & \\
603: \\
604: \textbf{Glycine} & & & & & \\
605: gga & 0.25000 & 0.22460 & 0.25000 & 0.23810 & 5/8\\
606: ggc & 0.25000 & 0.26180 & 0.25000 & 0.25397 & 6/9\\
607: ggg & 0.25000 & 0.25170 & 0.25000 & 0.25397 & 6/9\\
608: ggt & 0.25000 & 0.26180 & 0.25000 & 0.25397 & 6/9\\
609: \hline $\mathbb{E}[\nu]$ & 0.65625 & 0.65724 & 0.65625 & 0.65675 & \\
610: $\sigma[\nu]$ & 0.01804 & 0.01859 & 0.01804 & 0.01775 & \\
611: \end{tabular}
612: \end{center}
613: \caption{Equilibrium codon usage under neutrality versus selective
614: regimes.  In each selective regime, we report the equilibrium relative
615: abundance of codons, and the resulting mean and standard deviation in
616: volatility per site. The first column corresponds to neutrality
617: ($\sigma=\gamma \ll u$); the second column corresponds to neutrality but
618: with disfavored termination codons ($\sigma \ll u$, $\gamma=1$); the third
619: column corresponds to strong negative selection in an infinite
620: population ($\sigma \gg u$, $\gamma \gg u$); the fourth column
621: corresponds to the expected frequencies after a positively selected
622: sweep (see Section \ref{PosSel}). The final column gives the volatility
623: of each codon, assuming no transition/transversion bias\ct{PlotDush04}.}
624: \label{GLRS} } \end{table}
625: \renewcommand{\baselinestretch}{1.0}
626: 
627: 
628: 
629: \section{Negative Selection in a Finite Population} \label{Wright}
630: 
631: The models presented in Section \ref{QSmodel} describe the processes of
632: mutation and negative selection in an infinite population. In finite
633: populations, however, genetic drift also affects allelic frequencies.
634: In this section, we study the combined effects of mutation, negative
635: selection, and drift, which we analyze using diffusion equations.  These
636: equations can be very complex.  A full treatment of even the simplified
637: three-codon genetic code requires a two-dimensional diffusion process,
638: and the real genetic code involves a $63$-dimensional process.  To make
639: this problem tractable, we use the notion of the ``effective selective
disadvantage'' of more volatile codons, as discussed above.  This allows
641: us to consider the dynamics only at the favored codons, thereby reducing
642: the dimensionality of the diffusion process.
643: 
644: The neutral ($\sigma = 0$) or almost neutral ($\sigma \ll u$) regimes
645: are straightforward: here all synonymous codons for the favored amino
646: acid have the same effective fitness.  In this regime, each synonymous
647: codon occurs with the same probability in steady state, independent of
648: population size.  
649: 
650: For the remainder of this section, we analyze the case of strong
651: negative selection ($\sigma \gg u$) at a single site.   We consider a
652: diffusion approximation to the process of mutation, selection, and drift
653: operating only on the synonymous codons, to each of which we assign an
654: effective selective coefficient. For the simplified three-codon
655: genetic system, the more volatile codon $a_2$ has an effective selective
656: disadvantage of $s = u$ compared to codon $a_1$.  For the real genetic
657: code, more volatile codons will have a selective disadvantage of this
658: order, but the precise value of $s$ will depend on the specific amino
659: acid in question.  In the following analysis, we consider the case of
660: the simplified three-codon system.  However, we do not explicitly make
661: the substitution $s = u$, so that our results can also be applied (with
662: a slightly different value of $s$) to the real genetic code.
663: 
The time-dependent probability density $f(x,t)$ of the frequency $x$ of
allele $a_1$ relative to allele $a_2$ can be described by the Kolmogorov
forward equation\ct{KimuCrow64}
667: \begin{equation} \frac{\partial f(x,t)}{\partial t} =
668: -\frac{\partial}{\partial x} \{a(x) f(x,t)\} +
669: \frac{1}{2}\frac{\partial^2}{\partial x^2} \{b(x)f(x,t)\} \end{equation}
670: where the instantaneous mean and variance in the change of allelic
671: frequency are given by
672: \begin{eqnarray*}
673: a(x)&=&sx(1-x)-ux+u(1-x)\\
674: b(x)&=&x(1-x)/N.
675: \end{eqnarray*}
676: The stationary distribution of allele frequencies $\hat f(x)$ satisfies
677: the equation
678: \begin{equation}
679: \frac{d}{dx}\{b(x)\hat f(x)\}=2 a(x)\hat f(x)
680: \end{equation}
681: which has the solution\ct{Wrig31} \eon \hat
682: f(x)=Cx^{\theta-1}(1-x)^{\theta -1} \ e^{S x} \eoff where $\theta=2Nu$,
683: $S=2Ns$, and $C$ is chosen so that $\int_0^1\hat f(x)  dx=1.$  Since $s
\sim u$ (and thus $S \sim \theta$), the shape of the stationary
distribution $\hat f(x)$ falls into two categories: a bell-shaped
distribution in the regime $\theta>1$, and a U-shaped distribution in
the regime $\theta<1$. In other words, for $\theta>1$ the steady-state
population is typically polymorphic at the locus, much like the infinite
population mutation-selection balance, whereas for $\theta<1$ the
steady-state population is usually near-monomorphic at the locus,
occasionally switching between alleles $a_1$ and $a_2$, with a bias
(whose strength is determined by $S$) towards allele $a_1$.
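
The stationary solution quoted above can be checked directly.  The
following is a brief sketch of the standard calculation, spelled out only
for convenience; the auxiliary function $g(x)=b(x)\hat f(x)$ is
introduced here for illustration.  With the expressions for $a(x)$ and
$b(x)$ given above,
\begin{displaymath}
\frac{2a(x)}{b(x)} = 2Ns + 2Nu\,\frac{1-2x}{x(1-x)}
= S + \theta\left(\frac{1}{x}-\frac{1}{1-x}\right),
\end{displaymath}
so the stationarity condition reads $g'(x)/g(x) = 2a(x)/b(x)$.
Integrating once gives
\begin{displaymath}
\ln g(x) = Sx + \theta \ln x + \theta \ln (1-x) + \text{const},
\end{displaymath}
and dividing $g(x)$ by $b(x)=x(1-x)/N$ recovers $\hat f(x) \propto
x^{\theta-1}(1-x)^{\theta-1}e^{Sx}$.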
693: 
694: In stationary state, 
695: the expected frequency of allele $a_1$ is given by
696: \begin{equation}
697: M(\theta,S)=\int_0^1x \hat f(x) dx =\frac{1}{2}+\frac{\bes(\theta+1/2,
698: S/2)}{2\bes(\theta-1/2,S/2)}
699: \label{mean}
700: \end{equation}
where $\bes(x,y)$ denotes the modified Bessel function of the first kind of order $x$, evaluated at $y$.
702: Similarly, the variance in the frequency of allele $a_1$ is given by
703: \begin{eqnarray}
V(\theta,S)&=&\int_0^1 x^2 \hat f(x) dx - M(\theta,S)^2\\
705: &=&\frac{1}{4+8\theta}+\frac{2\theta \bes(\theta-1/2,S/2)
706: \bes(\theta+3/2,S/2)-(1+2\theta)
707: \bes(\theta+1/2,S/2)^2}{(4+8\theta)\bes(\theta-1/2,S/2)^2}
708: \label{var}
709: \end{eqnarray}
710: We use the standard Taylor series expansion of
711: $\bes(x,y)$, 
712: \begin{equation}
713: \bes(x,y)=\sum_{m=0}^{\infty}\frac{(y/2)^{x+2m}}{m!\Gamma(x+m+1)}, 
714: \label{Bexpand}
715: \end{equation} 
716: to obtain a simple approximation for the mean
717: stationary frequency of allele $a_1$:
718: \begin{equation}
719: M(\theta,S) = \frac{1}{2}+\frac{S}{4} + O(\theta^2),
720: \label{nearmean}
721: \end{equation}
722: valid for $\theta \sim S \ll 1$. This approximation indicates that the
723: difference in expected volatility at a site under neutral versus
724: negative selection is of order $S$, when $\theta \ll 1$. 
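
The small-$\theta$ approximation can be recovered in a few lines.  As a
sketch, assuming that only the leading ($m=0$) term of the series for
each Bessel function is retained, the ratio appearing in the exact mean
becomes
\begin{displaymath}
\frac{\bes(\theta+1/2,S/2)}{\bes(\theta-1/2,S/2)} \approx
\frac{S}{4}\,\frac{\Gamma(\theta+1/2)}{\Gamma(\theta+3/2)}
= \frac{S}{2(1+2\theta)},
\end{displaymath}
so that
\begin{displaymath}
M(\theta,S) \approx \frac{1}{2}+\frac{S}{4(1+2\theta)}
= \frac{1}{2}+\frac{S}{4}-\frac{S\theta}{2}+\cdots .
\end{displaymath}
When $S \sim \theta \ll 1$, the term $S\theta/2$ and the neglected $m
\geq 1$ terms are all of second order in the small parameters.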
725: 
726: When $\theta=S=1$, the mean stationary frequency of allele $a_1$ assumes
727: the value $\frac{1}{e-1}\approx 0.58$. For $\theta \sim S \gg 1$, the
728: mean frequency quickly approaches the asymptotic value
729: $\lim_{\theta\rightarrow \infty} M(\theta,\theta)=\frac{\sqrt{5}-1}{2}$,
730: in agreement with our earlier result for an infinite population. 
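
Both of these values can be verified by elementary means.  The following
check is a sketch that assumes, for the second case, that for large
$\theta$ the stationary distribution concentrates around the
deterministic equilibrium at which the mean change in frequency
vanishes.  For $\theta=S=1$, the stationary density is proportional to
$e^{x}$, and therefore
\begin{displaymath}
M(1,1) = \frac{\int_0^1 x e^{x} dx}{\int_0^1 e^{x} dx}
= \frac{1}{e-1} \approx 0.582 .
\end{displaymath}
For $\theta \gg 1$ with $s=u$, setting $sx(1-x)-ux+u(1-x)=0$ gives
\begin{displaymath}
x^2+x-1=0, \qquad x=\frac{\sqrt{5}-1}{2} \approx 0.618 .
\end{displaymath}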
731: 
732: The results in this section generalize our analysis of an infinite
733: population.  For an infinite population, we found that the expected
734: relative frequency of codon $a_1$ equals $\frac{1}{2}$ in the almost
735: neutral regime, and it equals $\frac{\sqrt 5 -1}{2}$ in the strong
736: selection regime. In a finite population with $\theta \gg 1$, the same
737: results hold. In a finite population with $\theta \ll 1$, the
expected relative frequency of the less volatile codon, $a_1$, equals
$\frac{1}{2}$ in the neutral regime, and it approximately equals
$\frac{1}{2}+\frac{Ns}{2}$ in the strong selection regime.  For any
741: population size, the relative frequency of codon $a_1$ depends
742: monotonically on the strength of selection at the amino acid level,
743: $\sigma$, and it exhibits a sharp transition at the critical value
744: $\sigma=u$.
745: 
746: It is worth noting that our exact expression (Eq. \ref{mean}) for the
747: mean stationary frequency of allele $a_1$ generalizes earlier work by
748: Bulmer\ct{Bulm91} on the relative frequency of two synonymous codons
749: that experience a direct fitness difference. In the limit of small
750: $\theta$, we find that
751: \begin{equation} \lim_{\theta\rightarrow0} M(\theta,S) = \frac{1}{2} +
752: \frac{\bes(1/2, S/2)}{2\bes(-1/2,S/2)} = \frac{1}{1+e^{-S}},
753: \end{equation}
754: which agrees with Bulmer's result (his Equation 6). In other words, Bulmer's
approximation applies only for vanishingly small mutation rates (or
756: population sizes).
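
The closed form of this limit follows from the elementary expressions
for the Bessel functions of half-integer order.  As a short worked
derivation (the variable $z$ is used here only as a placeholder
argument):
\begin{displaymath}
\bes(1/2,z)=\sqrt{\frac{2}{\pi z}}\,\sinh z, \qquad
\bes(-1/2,z)=\sqrt{\frac{2}{\pi z}}\,\cosh z,
\end{displaymath}
so that
\begin{displaymath}
\frac{1}{2}+\frac{\bes(1/2,S/2)}{2\bes(-1/2,S/2)}
=\frac{1+\tanh(S/2)}{2}
=\frac{e^{S/2}}{e^{S/2}+e^{-S/2}}
=\frac{1}{1+e^{-S}} .
\end{displaymath}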
757: 
758: We can again use the standard Taylor expansion of the Bessel function to
759: obtain a simple expression for the variance in the stationary
760: frequency of allele $a_1$,
761: \begin{equation}
762: V(\theta,S) \approx
763: \frac{(3+2\theta)(4+8\theta)-3S^2}{16(3+2\theta)(1+2\theta)^2},
764: \label{nearvar}
765: \end{equation}
766: which is a highly accurate approximation for all $\theta$,
767: provided as usual that $S$ is of order $\theta$ or smaller. Note that
768: when $\theta \ll 1$ the variance is approximated by
769: $\frac{1}{4}-\frac{\theta}{2}$, and when $\theta \gg 1$ the variance is
770: of order $\frac{1}{\theta}$.
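
The limiting forms of the approximate variance can be read off as
follows.  This is a sketch that uses only the approximate expression
above; the comparison with the symmetric Beta distribution is a standard
fact and is included only as a consistency check.  Setting $S=0$,
\begin{displaymath}
V(\theta,0) \approx \frac{4+8\theta}{16(1+2\theta)^2}
=\frac{1}{4(1+2\theta)},
\end{displaymath}
which is the variance of a Beta$(\theta,\theta)$ distribution, as
expected for the neutral stationary density
$x^{\theta-1}(1-x)^{\theta-1}$.  Expanding for small and large $\theta$
gives
\begin{displaymath}
\frac{1}{4(1+2\theta)} \approx \frac{1}{4}-\frac{\theta}{2}
\quad (\theta \ll 1), \qquad
\frac{1}{4(1+2\theta)} \approx \frac{1}{8\theta} \quad (\theta \gg 1).
\end{displaymath}
For $\theta \ll 1$ the selection term contributes approximately
$-S^2/16$, which is of second order when $S \sim \theta$.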
771: 
772: 
773: \subsection{Inferring Negative Selection in a Finite Population}
774: \label{Infer}
775: Our exact (Eq. \ref{mean}) or approximate (Eq.  \ref{nearmean})
776: expressions for the stationary mean frequency of codon $a_1$ allow us to
777: determine the minimum number of sites required for codon volatility to
distinguish reliably between neutrality and negative selection.  When
779: sites are modeled independently (equivalent to the assumption of
780: linkage equilibrium\ct{SawyHart92}), under neutrality ($\sigma \ll u$;
781: $s=0$) the relative frequency of codon $a_1$ versus codon $a_2$ across a
782: gene of length $L$ is binomially distributed with mean $\frac{1}{2}$ and
783: variance $\frac{1}{4L}$.  If, on the other hand, the gene experiences
784: negative selection ($\sigma \gg u$; $s=u$), then the relative frequency
785: of codon $a_1$ is binomially distributed with mean $M(\theta,S)$ and
variance $M(\theta,S)[1-M(\theta,S)]/L$.  Therefore, in order to
reliably reject neutrality at about the 95\% confidence level, we
788: require \begin{equation} \label{minLeq} M(\theta,S)-\frac{1}{2} \ > \  2
789: \sqrt{\frac{1}{4L}} \end{equation} Using this equation, Fig. \ref{minL}
790: shows the minimum number of sites required to reliably distinguish
791: negative selection from neutrality on the basis of codon volatility,
under our simplified ``genetic code'' consisting of three codons. 
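
To illustrate how the condition in Eq. \ref{minLeq} translates into a
number of sites, it can be rearranged as
\begin{displaymath}
L \ > \ \frac{1}{\left[M(\theta,S)-\frac{1}{2}\right]^2} .
\end{displaymath}
The following evaluations are a rough sketch based on the values of
$M$ derived earlier in this section, with $S=\theta$ assumed throughout:
\begin{eqnarray*}
\theta=1: && M=\tfrac{1}{e-1}\approx 0.582, \qquad L \gtrsim 149,\\
\theta \gg 1: && M \rightarrow \tfrac{\sqrt{5}-1}{2}\approx 0.618, \qquad
L \gtrsim 72,\\
\theta \ll 1: && M \approx \tfrac{1}{2}+\tfrac{S}{4}, \qquad
L \gtrsim 16/S^2 .
\end{eqnarray*}
Under these assumptions, the required number of sites therefore grows as
$1/\theta^2$ when $\theta$ is small.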
793: 
794: \begin{figure}[ht] \begin{center}
795: \epsfig{file=MinL.eps,angle=0,width=12cm} \caption{The relationship
796: between the scaled population size, $\theta=2Nu$, and the minimum number
797: of sites required to distinguish negative selection from neutrality, at
the 95\% confidence level. Sites are assumed to be unlinked.  It is
important to note that the value of $\theta$ that is appropriate in
practice does not necessarily equal the value estimated from
the average neutral site heterozygosity (see Section \ref{PopSize}).}
802: \label{minL} \end{center} \end{figure}
803: %in CheckEquationsAndSomeFigures.nb
804: 
805: Eq. \ref{minLeq} applies when comparing a collection of neutral
806: sites against a collection of sites under negative selection.  In most
807: situations, however, the selective constraint $\sigma$ will vary across
808: the sites of a protein. For example, consider gene $X$ with $L+J$ sites
809: under negative selection, compared to gene $Y$ with $L$ neutral sites
810: and $J$ sites under negative selection. In this case, the expected
811: frequency of codon $a_1$ in gene $Y$ is $(L/2 + J M(\theta,S))/(L+J)$.
812: Therefore, in order to reliably infer that gene $X$ experiences more
813: negative selection than gene $Y$, at the 95\% confidence level we
814: require \begin{equation} \label {minLeq2} M(\theta,S) - \frac{L/2 +
815: JM(\theta,S)}{(L+J)} \  > \ 2 \sqrt{\frac{L/4 + J M(\theta,S)
816: [1-M(\theta,S)]}{(L+J)^2}} \end{equation} As Eq. \ref{minLeq2}
817: indicates, the power to discriminate between two genes is decreased when
818: both genes contain many sites, $J$, under negative selection and only a
819: few sites, $L$, under different selective regimes. Nevertheless,
820: provided $J \sim L$, the power to discriminate between genes $X$ and $Y$
821: is decreased by $\sim$20\% at most (compared to $J=0$), and so the
822: minimum number of sites required to detect negative selection (Fig.
823: \ref{minL}) remains mostly unchanged.
824: %see MinLWithOtherSites.nb
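
The structure of Eq. \ref{minLeq2} is easier to see after a short
rearrangement, sketched here using only the quantities already defined.
The left-hand side simplifies to
\begin{displaymath}
M(\theta,S) - \frac{L/2 + JM(\theta,S)}{L+J}
= \frac{L\left[M(\theta,S)-\frac{1}{2}\right]}{L+J},
\end{displaymath}
so that, after multiplying by $L+J$ and squaring, the condition becomes
\begin{displaymath}
L^2\left[M(\theta,S)-\tfrac{1}{2}\right]^2 \ > \
L + 4J M(\theta,S)\left[1-M(\theta,S)\right] .
\end{displaymath}
Setting $J=0$ recovers Eq. \ref{minLeq}.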
825: 
826: Although the results in this section were derived for a simplified
827: genetic code, the scaling behavior of these solutions holds for the full
828: genetic code as well -- \textit{i.e.} when comparing neutrality to
829: negative selection, for $\theta \ll 1$ the expected difference in
830: volatility per site will be of order $\theta$; and for $\theta \gg 1$
831: the expected difference in volatility can be calculated from the
832: infinite population model (Eq. \ref{QS} and Table \ref{GLRS}). 
833: 
834: \subsection{Relaxation towards steady state} \label{relax} Although Eq.
835: \ref{mean} predicts the steady-state relative frequencies of codons
836: $a_1$ and $a_2$ in the selected regime ($\sigma \gg u$), we have not yet
837: discussed how long it takes, on average, to reach this steady
838: state. In the case of a very large population, $\theta \gg 1$, we know
from the infinite population model (Section \ref{QSmodel}) that the $e$-fold
840: relaxation time to equilibrium is of order $\frac{1}{u}$ generations. In
841: this section, we demonstrate that the same result applies to the
842: time scale of relaxation towards steady state in the regime $\theta \ll
843: 1$.
844: 
845: As usual, we consider a single site under negative selection. In the
846: regime $\theta \ll 1$, we have seen that the steady-state population
847: will spend most of the time in a nearly monomorphic state, with a
848: preference (of order $\theta$) for the less volatile codon, $a_1$.
849: Therefore, in order to calculate the time scale of relaxation towards
850: steady state, we may simply calculate the amount of time required such
851: that, starting with a population fixed for allele $a_2$, the probability
852: of the population remaining fixed for allele $a_2$ has been reduced
853: $e$-fold.
854: 
855: Given a population initially fixed for codon $a_2$, there are $Nu$
856: mutations to codon $a_1$ generated per generation. Each of these
857: mutations has an effective selective advantage $s=u$ over allele $a_2$,
858: and will therefore fix with probability
859: $2s/(1-e^{-2Ns})$\ct{CrowKimu70}. Hence the rate of production of a
860: mutation that will eventually fix is given by \begin{equation} P_{fix} =
861: \frac{2Nus}{1-e^{-2Ns}} \approx u, \label{Pfix} \end{equation} assuming
862: $\theta \ll 1$.  According to this calculation, the mean time until
863: fixation of codon $a_1$ is of order $\frac{1}{u}$ generations, which
864: gives the time scale of relaxation to the steady-state codon usage in a
865: finite population under negative selection.
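
The approximation in \eq{Pfix} can be made explicit as follows.  This
is a sketch that assumes $Ns \ll 1$ and neglects the time that a
successful mutation takes to sweep through the population; the variable
$t$ (time in generations) is introduced here for illustration.
Expanding the exponential,
\begin{displaymath}
\frac{2Nus}{1-e^{-2Ns}} = u\,\frac{2Ns}{1-e^{-2Ns}}
= u\left[1+Ns+O\big((Ns)^2\big)\right] .
\end{displaymath}
With $s=u$ we have $Ns=\theta/2$, so the rate differs from $u$ only by a
relative correction of order $\theta$.  If successful mutations arise at
an approximately constant rate $u$ per generation, then the probability
that the population is still fixed for $a_2$ after $t$ generations is
approximately $e^{-ut}$, which has fallen $e$-fold at $t=1/u$.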
866: 
867: 
868: \section{About Population Sizes} \label{PopSize} As discussed above, the
869: strength of the signal of negative selection depends upon the
870: parameter $\theta = 2Nu$. What is the appropriate value of
871: $\theta$ in practice?
872: 
873: Unfortunately, this question is far easier asked than answered.
874: Population geneticists have long struggled to reconcile estimates of
875: $\theta$ deduced from polymorphism data with direct measurements of $N$
876: and $u$ across broad taxonomic ranges.  The effective population sizes
877: of micro-organisms in particular are topics of active debate.  Estimates
878: of $\theta$ are usually obtained by comparing SNP data at neutral (or
879: presumably neutral) sites against the expected site diversity
880: or the expected number of segregating sites under a neutral
model\ct{Ewen04}. The authors of a recent survey\ct{LyncCone03} have
882: reported an average value of $\theta \approx 0.15$ among the prokaryotes
883: studied. But estimates of $\theta$ for a microbial species can
884: vary by four orders of magnitude, and they depend strongly on
885: assumptions about population structure\ct{Berg96}.  To complicate
886: matters further, heterogeneity in mutation rates leads to substantial
887: underestimates of $\theta$\ct{Taji96}.  
888: 
889: Aside from uncertainty in its estimation, the value of $\theta$ deduced
890: from neutral SNP data\ct{LyncCone03} may not be relevant to questions of
891: selection and volatility.  Monomorphism observed at neutral sites may
892: result from non-neutral processes, such as background
893: selection\ct{CharMorg93} or hitchhiking on periodically sweeping
894: sites\ct{MaynHaig74}.  As a result, the variance effective population
895: size estimated from SNP data may not be relevant to other aspects of
896: evolution, such as substitutions at linked weakly selected
897: sites\ct{Gill01}.  
898: 
899: One particularly striking example of a discrepancy in the appropriate
900: effective population sizes arises from the consideration of mutator
901: phenotypes. Populations of microbial species periodically experience a
transient increase in the mutation rate, often $10^2$--$10^3$ times greater
than that of a non-mutator strain\ct{GiraRadm01}.  Between 2\% and 20\% of
904: bacterial populations isolated in the wild at any given time exhibit a
905: mutator phenotype\ct{GiraRadm01,OlivCant00,LeclLi96}. The mutator phase
906: can be induced in several ways. A defective DNA repair gene may arise
907: and sweep to fixation by hitchhiking on a positively selected
908: mutation\ct{NotlSeet02}. The entire population then experiences an
909: elevated mutation rate until a non-mutator allele sweeps and replaces
910: the mutator\ct{NotlSeet02,DenaLeco00}. A second, perhaps more common
911: mechanism is stress-induced mutagenesis; natural isolates of \textit{E.
912: coli} often experience an increase in their mutation rate in response to
913: stress\ct{BjedTena03}. As a result of these and other observations,
914: researchers have argued that bacterial populations evolve primarily by
915: periodic acquisition of mutator phenotypes followed by adaptive sweeps
916: and subsequent loss of the mutator\ct{GiraRadm01,DenaLeco00,NotlSeet02}.
917: As we shall see, the effect of this process on synonymous codon usage is
918: dramatic: the expected site diversity is driven by the value of $\theta$
919: in the wildtype regime ($\theta_w = 2 N u_w$), but the pattern of
920: synonymous codon usage at a site under negative selection is driven by
921: the value of $\theta$ in the mutator regime ($\theta_m = 2 N u_m \gg
922: \theta_w$). 
923: 
924: As a simple example of this phenomenon, we have simulated a
925: Fisher-Wright model of a single locus in a population of constant size
926: $N=1000$.  The simulated site is subject to recurrent mutation between
``alleles'' $a_1$ and $a_2$ at wildtype rate $u_w=10^{-5}$. As in Section
928: \ref{Wright}, the alleles $a_1$ and $a_2$ differ in fitness by $s$,
929: where $s$ equals the mutation rate.  Periodically, we model the fixation
930: of a mutator allele (or, equivalently, the stress-induced mutagenesis
931: across the entire population) by exogenously increasing the mutation
932: rate to $u_m= 10^3 \times u_w$ for 100 generations; thereafter we
933: (artificially) enforce a selective sweep at the site, followed by
934: reversion to the wildtype mutation rate.  Overall, the population
935: experiences the mutator regime for 5\% of the time, consistent with
936: observed frequencies of mutator phenotypes in the
937: wild\ct{GiraRadm01,OlivCant00,LeclLi96}. According to our simulations,
938: the average site diversity, $2x(1-x)$, at a randomly chosen time equals
939: $0.028$, which is close to its expected value assuming that $\theta$ is
940: given by $\theta_w$: $\mathbb{E}[2x(1-x)]=\theta_w=0.02$.  But the
941: average frequency of allele $a_1$ equals $0.611$, which is close to its
942: expectation assuming that $\theta$ is given by $\theta_m$:
943: $\mathbb{E}[x]=M(\theta_m,\theta_m) = 0.616$ (Eq.  \ref{mean}).  In
944: other words, the average frequency of the less volatile codon $a_1$ is
945: dominated by the mutator periods, but the average site heterozygosity
946: (and any estimate of $\theta$ based on it) is dominated by the
947: non-mutator periods.  
948: %/VOLATILITYTHEORY/Simulations/FreeLociLinkedToSweeperWithMutator/averages.1_onelocus.out
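
For orientation, the scaled parameters implied by these simulation
settings can be worked out directly.  The following arithmetic uses only
the values stated above; the spacing between mutator phases is an
illustrative estimate that assumes the phases recur regularly.
\begin{eqnarray*}
\theta_w &=& 2Nu_w = 2 \times 1000 \times 10^{-5} = 0.02,\\
u_m &=& 10^{3} \times u_w = 10^{-2}, \qquad
\theta_m = 2Nu_m = 2 \times 1000 \times 10^{-2} = 20,\\
1/u_m &=& 100 \ \text{generations}, \qquad
1/u_w = 10^{5} \ \text{generations}.
\end{eqnarray*}
Each mutator phase thus lasts about $1/u_m$ generations.  If the
population spends 5\% of the time in such phases, one phase occurs
roughly every $100/0.05 = 2000$ generations, which is far shorter than
$1/u_w$.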
949: 
950: There is a simple, intuitive explanation for this result.  The average
951: heterozygosity at the site is low at virtually all times (except during
952: the brief mutator periods) because selective sweeps cause monomorphism,
953: followed by long periods of low $\theta$. Therefore, the effective
$\theta$ for SNP diversity is small, \textit{i.e.} close to $\theta_w = 2Nu_w$.  But
955: the site converges quickly towards the less volatile codon during the
956: mutator periods, since the rate of convergence is determined by
957: $s=u_m$. And the site is essentially frozen during the non-mutator
958: periods, since the decay rate of volatility is only $u_w$.  Therefore
959: the expected frequency of $a_1$ at a random time  is
960: primarily determined by the frequency reached during the mutator regime.
961: As is clear from this explanation, the expected frequency of codon $a_1$
962: will, in general, depend upon the stochastic scheduling of mutator
963: periods. For example, the site will converge towards $M(\theta_m,
964: \theta_m)$ provided the population experiences at least one mutator
965: phase of duration of order $1/u_m$ generations, within every $1/u_w$
966: generations.  In fact, even if the mutator phases are very brief and
967: infrequent, the average frequency of allele $a_1$ can greatly exceed the
968: value predicted by $\theta$ estimated from the average site
969: heterozygosity.
970: 
971: Although the simple model used in this section does not describe any but
972: the most phenomenological features of mutator alleles, it does reveal an
973: important general observation: the value of $\theta$ estimated from
974: neutral SNP data does not in general equal the effective value of
975: $\theta$ that determines synonymous codon usage at a site under negative
976: selection.  This result is of utmost importance to any discussion of the
977: relationship between $\theta$ and the power of volatility to detect
978: negative selection.
979: 
980: 
981: \section{Positive selection} \label{PosSel} In the sections above, we have considered
982: selection that opposes a change to the amino acid at a site.  This type
983: of negative selection induces a bias towards the less volatile codons
984: for the favored amino acid at a site.  However, selection sometimes
985: favors a change in the amino acid at a particular site. In such
986: situations, as we will demonstrate, a site is more likely to be occupied
987: by a codon of greater than average volatility.
988: 
989: A variety of mechanisms are known to cause positive selection.
990: Frequency dependence often induces diversifying selection at a site,
991: whereas an exogenous change in the environment can induce directional
992: selection for a new, specific amino acid. We do not here model all of
993: the various types of positive selection, but rather focus on the
994: essential aspect shared by these mechanisms. We analyze the dynamics at
995: a site that has, for a period of time, experienced negative selection
996: for amino acid $A$, and that subsequently experiences negative selection
for a different amino acid, $B$ (for whatever reason).  We refer to the
998: change in the selective regime as a positive selection event.
999: 
1000: Prior to the onset of positive selection, amino acid $A$ is assigned
1001: fitness 1 and all other amino acids fitness $1-\sigma$; subsequently,
1002: amino acid $B$ is assigned fitness 1 and all others fitness $1-\sigma$.
1003: We assume that $N \sigma \gg 1$ (otherwise, the site is effectively
1004: neutral at the amino acid level) and that $\sigma \gg u$ (otherwise, the
1005: expected codon frequencies are uniform).  Once the population shifts to
1006: the new amino acid $B$, it is clear that the site will more likely
1007: contain a codon that is more volatile than the average $B$-codon,
1008: because it has just arisen through a nonsynonymous mutation. Since $B$
1009: is now favored, negative selection subsequently operates to reduce the
1010: volatility at the site. However, this process takes time. Thus, for some
1011: time after the positive selection event, there is a bias toward elevated
1012: volatility at the site, which gradually decays. In this section, we
1013: analyze this process.
1014: 
Analogously to previous sections, we initially consider a simplified
1016: genetic code consisting of four codons, $a_1$, $a_2$, $b_1$, and $b_2$,
1017: the first two of which encode amino acid $A$, and the latter two amino
1018: acid $B$.  Mutations can only occur between codons $a_1$ and $a_2$,
1019: $a_2$ and $b_2$, and $b_2$ and $b_1$, creating the mutation structure \[
1020: a_1 \rightleftarrows a_2 \rightleftarrows b_2 \rightleftarrows b_1.  \]
1021: In this simplified genetic code, codons $a_2$ and $b_2$ are the more
1022: volatile codons for their respective amino acids.
1023: 
1024: After the change in selection from amino acid $A$ to $B$, a mutation to
1025: codon $b_2$ that survives stochastic drift will eventually arise.  Thus,
1026: at least initially, the more volatile codon $b_2$ is more prevalent
1027: than the less volatile codon $b_1$. During this period, we can detect
1028: the signature of the positively selected sweep because of the elevated
1029: volatility at the site.  However, negative selection for amino acid $B$
1030: will eventually favor codon $b_1$.  Therefore, the volatility signature
1031: of the positive selection event will be present provided that the time
1032: scale of decay toward codon $b_1$ is longer than the interval since the
1033: positive selection event.
1034: 
1035: Fortunately, the time scale of decay towards $b_1$ is quite long.  For
1036: $\theta \gg 1$, we can use the infinite population model to find this
1037: time scale.  As discussed above, the time required to reduce the
1038: volatility $e$-fold is of order $\frac{1}{u}$.  For $\theta \ll 1$, we
1039: must use a finite population size calculation.  In this regime, the
1040: population is nearly monomorphic at almost all times.  Following the
1041: selective sweep, the site will be monomorphic for $b_2$ with almost unit
1042: probability.  We are interested in the duration of time required such
that the probability of being monomorphic for $b_2$ (as opposed to $b_1$)
1044: has been reduced $e$-fold.  The probability of switching between $b_2$
1045: and $b_1$, however, is of order $u$ per unit time (even before $b_2$ has
1046: finished outcompeting $a_2$), according to Eq.  \ref{Pfix}.  Thus, the
1047: time scale of decay in a finite population is also $\frac{1}{u}$. 
1048: 
1049: According to this analysis, a selective sweep will result in the
presence of a more volatile codon for a period of order $\frac{1}{u}$
generations -- a very long time indeed. (In the case of \textit{E. coli}, for example,
$\frac{1}{u}$ generations is nearly 100,000 years, given $u \approx
1053: 5\times10^{-10}$ and generation time $\approx$ $20$ minutes. The
1054: generation length and resulting time scale for \textit{E. coli} in the
1055: wild may be much longer yet\ct{GibbKaps67}.) Equivalently, repeated
1056: sweeps for amino acid changes at a site will result in the presence of
1057: more volatile codons at almost all times, provided that new sweeps occur
1058: more often than every $\frac{1}{u}$ generations.
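
The order of magnitude of the \textit{E. coli} time scale can be checked
with a short calculation.  This is a rough sketch that uses the quoted
values of $u$ and the generation time and assumes a 365-day year:
\begin{eqnarray*}
\frac{1}{u} &=& \frac{1}{5\times10^{-10}} = 2\times10^{9} \
\text{generations},\\
2\times10^{9} \times 20 \ \text{min} &=& 4\times10^{10} \ \text{min},\\
\frac{4\times10^{10} \ \text{min}}{5.26\times10^{5} \ \text{min/yr}}
&\approx& 7.6\times10^{4} \ \text{yr},
\end{eqnarray*}
which is of the same order as the figure quoted above.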
1059: 
1060: 
1061: \subsection{Inferring Positive Selection}
1062: 
1063: The above analysis for a simplified genetic system generalizes in an
1064: obvious way to the real genetic code.  After a positive selection event
1065: at a site, the population switches from a codon for amino acid $A$ to a
1066: codon for amino acid $B$.  The expected volatility of the new codon is
1067: greater than the average volatility of $B$-codons, because the new codon
1068: has just arisen through a nonsynonymous mutation.  To be more precise,
1069: if the population is monomorphic for a random non-$B$ codon before the
1070: selective sweep, then after the sweep occurs the expected relative
1071: frequencies of the $B$-codons are given, approximately, by their relative
1072: volatilities.  Subsequent to the selective sweep, the increased
volatility at the site will decay on a time scale of order
$\frac{1}{u}$ generations.
1075: 
1076: There is a critical distinction between the volatility signature of
1077: positive selection versus that of negative selection.  The depressed
1078: volatility at a site under negative selection is caused by a
1079: mutation-selection-drift balance. When the effective population size is
1080: small, a large number of sites are required to distinguish negative
1081: selection from neutrality. By contrast, the volatility signature of
\textit{positive} selection is \textit{not} an equilibrium property, and
1083: it is not sensitive to population size.  Regardless of $\theta$, the
1084: probability of sampling a more volatile codon is significantly elevated
1085: immediately after a selective sweep at a site, and this probability
1086: decays only at rate $u$.
1087: 
1088: As we have seen, a gene that contains many sites under positive
1089: selection will exhibit a greater volatility (controlling for its amino
1090: acid composition) than a gene under neutral or, especially, negative
1091: selection.  How many positively selected sites are required in order to
1092: detect a reliable signal? In the case of Leucine, for example, the
1093: expected volatility of a site that has recently experienced a positively
1094: selected sweep is approximately $0.660 \pm 0.072$ (one standard
1095: deviation), whereas a neutral Leucine site has expected volatility
1096: $0.646 \pm 0.073$, and a Leucine site under negative selection has
expected volatility $0.632 \pm 0.070$ (see Table \ref{GLRS}).
1098: Therefore, the volatility of about 100 Leucine sites under positive
1099: selection will be significantly greater (at the 95\% confidence level)
1100: than that of 100 neutral sites.  Similarly, the volatility of about 25
1101: positively selected Leucine sites will be significantly greater than that of
1102: 25 negatively selected sites.  Similar results hold for Serine and
1103: Arginine; Glycine is less informative.
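
These sample sizes can be recovered, at least approximately, from the
values quoted above. The following is a sketch only: it assumes a
two-sided normal approximation in which the mean volatility of $n$
positively selected Leucine sites, each with standard deviation $s =
0.072$, is compared against a fixed reference mean. The symbols $n$,
$s$, and $\Delta$ are introduced here for illustration. Under these
assumptions, a difference $\Delta$ in expected volatility is detectable
at the 95\% level once
\begin{displaymath}
n \gtrsim \left(\frac{1.96\, s}{\Delta}\right)^2 .
\end{displaymath}
For positive versus neutral sites, $\Delta = 0.660 - 0.646 = 0.014$
gives $n \gtrsim (1.96 \times 0.072 / 0.014)^2 \approx 102$; for positive
versus negative sites, $\Delta = 0.660 - 0.632 = 0.028$ gives $n \gtrsim
(1.96 \times 0.072 / 0.028)^2 \approx 25$. A two-sample comparison that
also accounts for the sampling error of the reference sites would
require somewhat more sites.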
1104: 
1105: It is worth noting that the elevated volatility for a positively selected
1106: Serine site will decay even more slowly than for other amino acids,
1107: because the highly volatile codons AGC and AGT are not connected by
1108: synonymous mutations to other Serine codons. 
1109: 
1110: \section{Discussion}
1111: \label{Discussion}
1112: 
1113: \subsection{Codon volatility versus comparative sequence analysis}
1114: 
1115: Selection pressures on proteins are usually estimated by comparing
1116: homologous nucleotide sequences\ct{ZuckPaul65}.  Orthologous genes are
1117: identified in different organisms and sequenced; their sequences are
1118: then aligned, and the changes that have accumulated since divergence are
1119: used to infer the selection pressures that have been
1120: acting\ct{GoldYang94}. When available, sequence variation sampled from
1121: individuals within a species can be compared with variation across
1122: species to produce an elegant test for adaptive evolution at a
1123: locus\ct{McDoKrei91,SawyHart92}. In addition, there are a variety of
1124: statistical tests designed to detect a departure from neutrality in the
1125: site frequency spectrum sampled within a single species (see ref.\ct{Krei00}
1126: and references therein). In many cases, the complete distributions of
1127: these statistics under the neutral null model are difficult to derive,
1128: but they have been studied through computer simulation\ct{SimoChur95}.
1129: 
1130: Techniques for estimating selective constraints via sequence comparison
1131: are typically applied, independently, to one or several genes at a time.
1132: When extensive intra- or inter-specific sequence data are available at a
1133: locus of interest, such techniques have proven enormously useful for
1134: measuring selection, and it is unlikely that they will be significantly
1135: improved by incorporating information about synonymous codon usage.  But
1136: the accurate estimation of selective constraints requires a large number
1137: (approximately six or more\ct{AnisBiel01}) of orthologous sequences for
1138: each gene of interest.  At the genome-wide scale, comparative data
1139: (\textit{i.e.} orthologous gene sequences) will not be available for all
1140: genes, and methods to estimate selective constraints based on sequence
1141: comparison will often be inapplicable.  Furthermore, the genes under
1142: positive selection are often of particular interest, but such genes are
1143: even less likely to have identifiable orthologs in related species due
1144: to their rapid sequence divergence\ct{PlotDush04}.  Even in the lineage
1145: of the \textit{Saccharomyces} genus, which is currently the best-case
1146: scenario for comparative genomics, the genomes of four species have been
1147: fully sequenced and only two-thirds of the genes in \textit{S.
1148: cerevisiae} have unambiguously identifiable orthologs in related
1149: species\ct{PlotFras04}. Unlike comparative techniques, the analysis of
1150: synonymous codon usage offers a computational tool to screen for
1151: selection pressures on \textit{all} genes in a sequenced genome. Genome-wide
1152: screens based on analyzing synonymous codon usage may prove useful in
1153: identifying important classes of genes under strong selection, such as
1154: the antigens of pathogens\ct{PlotDush04}.
1155: 
1156: Unlike most comparative statistics that test for a departure from
1157: neutrality, estimates of selection based on bootstrapped volatility
1158: scores\ct{PlotDush04} are not `estimators' in a rigorous statistical
1159: sense -- \textit{i.e.} statistics whose sampling properties can be
1160: derived from a null model, and which can be used in likelihood ratio
1161: tests of a null hypothesis\ct{YangNiel00,ClarkGlan03}. Given the
1162: expected relative frequencies of codons that we have derived for each of
1163: the three regimes (neutral, negative, and positive selection; Table
1164: \ref{GLRS}), it may yet be possible to design maximum-likelihood methods
1165: that estimate the number of sites of a gene in each regime. This
1166: approach will be complicated, however, by other sources of codon bias;
1167: see below.
1168: 
1169: Aside from the different situations in which they are applicable, and
1170: differences in the rigor of their derivation, estimates of selection
1171: based on codon volatility differ in a fundamental way from most
1172: estimates based on sequence comparison.  Homologous sequence comparison
1173: between species is often used to assess, either by maximum
1174: likelihood\ct{GoldYang94} or maximum parsimony\ct{Li93}, the rates of
1175: synonymous and non-synonymous substitutions in a coding sequence. The
1176: ratio of these rates, dN/dS, is used as a measure of the selective
1177: constraints that have been acting on a protein since the divergence of
1178: the species being compared.  An alternative approach, based on a Poisson
1179: Random Field (PRF) model of mutation frequencies, uses the site
1180: frequency spectrum at a locus sampled from individuals within a species
1181: to deduce the average selective pressure for or against amino acid
1182: changes in a gene\ct{SawyHart92}. (Poisson Random Field models can also
1183: be used to construct likelihood ratio tests of departure from
1184: neutrality\ct{BustWak01}.) Like most comparative methods, however, both
1185: of these models typically assume that all the sites within a gene
1186: experience the same selective pressure against amino acid substitutions
1187: (but see the site-by-site likelihood tests of Yang \textit{et
1188: al.}\ct{YangNiel00}).  Under the PRF theory, for example, authors have
1189: estimated a very small ``average'' selection pressure against amino acid
1190: changes in \textit{E.  coli} genes: $\sigma \sim
1191: 10^{-8}$\ct{HartSawy94}.  This value does not represent the arithmetic
1192: average of the true $\sigma$ values across sites, but rather the
1193: best-fit constant value of $\sigma$ that would make the PRF model
1194: consistent with observed sequence variation at polymorphic sites.
1195: 
1196: When evolutionary rates are estimated at \textit{individual}
1197: residues\ct{Yang00,YangNiel00}, however, we find great variation across
1198: sites. Moreover, direct experimental measurements of the fitness
1199: consequences of amino acid substitutions in micro-organisms reveal huge
1200: variation in selection pressures across the residues of an individual
1201: protein: a substantial proportion of substitutions are lethal, and a
1202: substantial proportion have undetectable
1203: effect\ct{WertDrub92,WlocSzaf01,ZeylDeVi01,SanjMoya04}. Therefore, it is
1204: not entirely clear how best to interpret the value of $\sigma \sim
1205: 10^{-8}$ estimated for \textit{E. coli} genes using the PRF model, which
1206: assumes constant pressure at each residue. 
1207: 
1208: Compared to dN/dS or $\sigma$ estimated by the PRF model, codon
1209: volatility quantifies selection pressures in a very different, coarser
1210: manner.  As discussed above, volatility essentially measures the number
1211: of sites in a gene that experience negative ($\sigma \gg u$) versus
1212: neutral ($\sigma \ll u$) versus positive selection. Given that, in
1213: reality, many amino acid changes to a protein sequence are lethal while
1214: other changes have no effect whatsoever, it is reasonable and meaningful
1215: to estimate the number of sites in the selected versus neutral regimes.
1216: But volatility is not sensitive to variation in selective pressures
1217: within either of these regimes. Hence, the volatility measure is in some
1218: ways a coarser description of selective pressure than PRF or dN/dS.  One
1219: should not necessarily expect that volatility will correlate very
1220: strongly with dN/dS or PRF estimates, because the latter measures
1221: represent some sort of average $\sigma$ over the entire gene, and are
1222: thus presumably sensitive to the full range of variation in $\sigma$.  A
1223: measure based on codon volatility is therefore different from and
1224: complementary to dN/dS or PRF estimates of the selective
1225: constraints on a gene.
1226: 
1227: As an aside, it is important to note that the most common model used to
1228: estimate dN/dS from divergent nucleotide sequences\ct{GoldYang94} does
1229: not itself reflect the relationship between selection and volatility.
1230: dN/dS is often estimated by fitting maximum likelihood parameters to a
1231: simplified Markov-chain model of sequence evolution that ignores
1232: population variability\ct{GoldYang94}.  Models that ignore population
1233: variability are perfectly reasonable approximations when comparing the
1234: sequences of relatively divergent lineages\ct{GoldYang94}; but such
1235: models fail to detect the effect of amino-acid selection on synonymous
1236: codon usage.  Such models consider only a single sequence that is
1237: assumed to represent the dominant genotype in the population at any
1238: time.  Mutation and selection are modeled simultaneously by adjusting
1239: the transition rates between codon states in the
1240: sequence\ct{GoldYang94}.  As a result, in equilibrium, the number of
1241: transitions into a state per unit time must equal the number of
1242: transitions out of that state; and so equilibrium synonymous codon usage
1243: does not depend upon the strength of selection in these simplified
1244: models\ct{GoldYang94}. (In fact, under the standard assumption of
1245: time-reversibility, such models require as parameters the specification
1246: of the equilibrium codon usage\ct{GoldYang94}, and therefore they
1247: clearly cannot be used to predict equilibrium codon usage.) Simulations
1248: of sequence evolution based on these simplified models (such as the
1249: non-frequency-dependent simulations of Zhang\ct{Zhan04}) will thus fail
1250: to detect the relationship between dN/dS and volatility, whereas more
1251: detailed simulations that account for population variability (such as
1252: the frequency-dependent simulations of Zhang\ct{Zhan04}, as well as the
1253: non-frequency-dependent simulations in this work) will properly reflect
1254: the relationship between selection and volatility, as predicted by
1255: Fisher-Wright models of a replicating population.
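
The parenthetical remark about time-reversibility can be made concrete
with a short sketch. The notation below is introduced for illustration
and is not taken from the cited work: let $q_{ij}$ denote the
substitution rate from codon $i$ to codon $j$, and $\pi_i$ the
equilibrium frequency of codon $i$. A common way to build a reversible
model is to write $q_{ij} = s_{ij}\, \pi_j$ with a symmetric
exchangeability $s_{ij} = s_{ji}$, which may absorb mutational biases
and selection on the encoded amino acid. Detailed balance then holds
identically,
\begin{displaymath}
\pi_i\, q_{ij} = \pi_i\, s_{ij}\, \pi_j = \pi_j\, s_{ji}\, \pi_i = \pi_j\, q_{ji},
\end{displaymath}
for any choice of $s_{ij}$. In a model of this form the frequencies
$\pi_i$ are inputs rather than predictions, no matter how selection
alters the rates.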
1256: 
1257: 
1258: \subsection{Other sources of codon bias} 
1259: 
1260: Although it came as a surprise to early neutral
1261: theorists\ct{KingJuke69}, it is now clear that there are several
1262: processes that result in unequal usage of synonymous codons.  Many
1263: processes that cause codon bias in microorganisms, such as biased
1264: nucleotide content or mutation rates, can apply roughly equally to all
1265: the genes in a genome.  To the extent that other sources of codon bias
1266: apply equally across a genome, it is straightforward to control for
1267: these biases when comparing the volatilities of genes within a genome to
1268: estimate selection pressures on proteins\ct{PlotDush04}.
1269: 
1270: To the extent that other sources of codon bias differ from gene to gene
1271: within a genome, they may (if not properly controlled for) introduce
1272: errors into estimates of the relative selection pressures on proteins
1273: inferred from codon volatility\ct{PlotDush04}. Similarly, selection on
1274: synonymous codons -- particularly selection that varies from gene to
1275: gene -- will likewise introduce errors into estimates of selection on
1276: protein sequences obtained by comparative techniques such as
1277: dN/dS\ct{SharLi87,HirsFras04}.
1278: 
1279: As we have argued, some of the variation in synonymous codon usage
1280: across a genome is caused by the variation in selection pressures on
1281: protein sequences.  Throughout our analysis, we have specifically
1282: ignored any other source of codon biases so as to derive the effects of
1283: selection at the amino acid level on codon usage.  But in many organisms
1284: other processes that vary between genes are certainly operating as well.
1285: For instance, it is known that the transition/transversion mutation bias
1286: can vary across a genome.  Results on \textit{S. cerevisiae}, whose
1287: genome exhibits marked variation in the tr/tv bias\ct{PlotFras04},
1288: suggest that this source of variable codon bias will not distort
1289: estimates of selection based on volatility: whether or not one accounts
1290: for the variation in the tr/tv bias across the genome of \textit{S.
1291: cerevisiae} one obtains virtually the same rankings of gene volatilities
1292: ($r>0.99$)\ct{PlotFras04}. 
1293: 
1294: Aside from mutational biases, there are other sources of codon bias that
1295: vary from gene to gene in some organisms. In the yeast \textit{S.
1296: cerevisiae}, researchers have observed that synonymous codon usage,
1297: measured by the Codon Adaptation Index (CAI)\ct{SharLi87}, is correlated
1298: with a gene's expression level in laboratory conditions\ct{CoghWolf00}.
1299: This correlation is thought to be caused by selection for translational
1300: efficiency and/or accuracy: a codon corresponding to a more abundant tRNA is
1301: expected to be translated more quickly (due to the higher probability
1302: per unit time that the appropriate tRNA will ``find'' the codon) and more
1303: accurately (since the correct tRNA will likely have the greatest chance
1304: of pairing if it is the most abundant). 
1305: 
1306: Considering this alternative source of biased codon usage, two questions
1307: should be asked: do other sources of codon bias distort estimates of
1308: selection based on volatility, and how can we control for these
1309: confounding factors? Unfortunately, we do not have a truly satisfactory
1310: answer for either of these questions, but the discussion below may shed
1311: some light on the issues involved.  
1312: 
1313: With regard to the first question, we note that the degree to which
1314: other sources of codon bias may distort volatility-based estimates of
1315: selection will strongly depend on the organism being studied.  Some
1316: species (such as humans) exhibit a much weaker correspondence between
1317: codon frequencies and tRNA abundances than other species; so clearly
1318: other sources of codon bias will affect volatility values differently in
1319: different species. In a species with a strong correspondence between
1320: codon usage and tRNA abundances, the extent to which variation in this
1321: source of codon bias across the genome affects volatility will depend on
1322: whether volatile codons are (un)preferred: if there is no correlation
1323: between volatility and tRNA abundances, then the other sources of codon
1324: bias will only introduce random error into volatility estimates, making
1325: them less powerful but still reliable. If instead the preferred codons
1326: tend to have either high or low volatility, then this effect could
1327: introduce systematic errors into volatility estimates. In the latter
1328: case, in order to quantify how much codon usage bias is caused by
1329: volatility as opposed to other factors, one would require a method to
1330: predict for individual genes the amount of codon bias due to these other
1331: factors. Unfortunately we are far from having the necessary level of
1332: predictive power for other sources of codon bias in any organism.
1333: Although gene expression level is somewhat predictive of codon bias,
1334: expression levels do not explain most of the variation in codon bias in
1335: any genome studied thus far\ct{Akas01,CoghWolf00}.  Until the various
1336: sources of biased codon usage can be reliably disentangled, we cannot
1337: reliably quantify the effects of these biases on volatility-based
1338: estimates of selection.
1339: 
1340: The second question, how to control for other sources of biased codon
1341: usage, is also difficult to answer at present.  As discussed above, an
1342: appropriate method to control for other sources of bias would require
1343: disentangling the various sources of codon bias in a predictive manner
1344: for each gene. While this degree of precision is not currently possible,
1345: one approach is to assume that the codon bias measured by CAI is
1346: entirely independent of volatility, and then control for CAI using
1347: partial correlations. For several reasons, we expect this approach to be
1348: conservative, as we illustrate using the yeast \textit{S. cerevisiae}
1349: (we use this species as an example because it shows a strong preference
1350: for codons that match abundant tRNAs, and because we have reliable dN/dS
1351: values for almost two-thirds of its genes, calculated from multiple
1352: alignments of closely related species\ct{HirsFras04}). First, we note
1353: that dN/dS is itself strongly correlated with both CAI and gene
1354: expression levels\ct{PalPapp01}, and it is therefore impossible to
1355: construct any measure of selective constraint that agrees with dN/dS and
1356: is not itself strongly correlated with CAI and expression levels in
1357: yeast. Second, it is possible that the codon bias measured by CAI is in
1358: part \textit{caused} by volatility (\textit{i.e.} highly expressed genes
1359: tend to experience stronger purifying selection and therefore exhibit
1360: unusual codon usage biased towards lower volatility), and so controlling
1361: for CAI would be inappropriate. Despite several biological hypotheses,
1362: there is no accepted mechanistic explanation for the correlation between
1363: CAI and dN/dS in yeast\ct{PalPapp01, Akas01}, and so it is unclear
1364: whether controlling for CAI is appropriate.
1365: Nevertheless, we have tested the correlation between volatility and
1366: dN/dS while controlling for CAI using a partial correlation. We find
1367: that even when controlling for CAI (or expression levels), there remains
1368: a highly significant correlation between volatility and dN/dS in yeast
1369: ($p<10^{-34}$\ct{PlotFras04}).  Therefore, even under this conservative
1370: test, estimates of selection obtained by volatility are still consistent
1371: with estimates obtained by homologous sequence comparison. We interpret
1372: this result as evidence that volatility is measuring selective
1373: constraints above and beyond any signal inherent in CAI.
1374: 
1375: Indeed, there is a great deal of empirical evidence
1376: indicating that the volatility of a gene is correlated with the
1377: selective constraint it experiences.  Aside from highly significant
1378: correlations between volatility and dN/dS in bacterial species and
1379: yeast\ct{PlotDush04}, volatility also reflects a range of other features
1380: known to correlate with selection on proteins. In \textit{S.
1381: cerevisiae}, for example, volatility is strongly correlated with the
1382: essentiality of genes, the number of their protein-protein interactions,
1383: and the degree to which they are preserved throughout the eukaryotic
1384: kingdom\ct{PlotFras04}.  Furthermore, volatility is significantly
1385: elevated among the known antigens and surface proteins (which experience
1386: positive selection) in the pathogens \textit{Mycobacterium
1387: tuberculosis}, \textit{Plasmodium falciparum}, and Influenza A
1388: virus\ct{PlotDush03,PlotDush04}. And volatility is significantly
1389: depressed in the genes essential for growth of \textit{M.
1390: tuberculosis}, as well as in the genes conserved between related
1391: \textit{Mycobacterium} species\ct{PlotDush04}. Therefore, despite
1392: potential confounding sources of codon bias that cannot at present be
1393: controlled for with appropriate accuracy, in practice volatility-based
1394: methods produce estimates of selection pressures that are consistent
1395: with our understanding of protein evolution over a diverse range of
1396: taxa.
1397: 
1398: Finally, we note that there may be direct selection on synonymous codons
1399: in order to evade mistranslation\ct{Kono85}. Since mistranslation is far
1400: more likely to occur between a codon and an anticodon that differ by a
1401: single nucleotide, the definition of volatility (Eq. \ref{voldef})  is
1402: appropriate for measuring the selective pressure for or against 
1403: mistranslation. The strength of this type of selection on synonymous
1404: codons would depend upon the mis-incorporation rate of tRNA (which is
1405: far higher than the mutation rate) and the detriment of mistranslation
1406: (which is likely far lower than that of most mis-sense mutations). It is
1407: difficult at present to measure the molecular parameters of tRNA
1408: mis-incorporation and its fitness effects; so it is unclear how much of
1409: a volatility signal arises from mistranslation avoidance versus standard
1410: selection on mis-sense mutations. However strong this signal, though,
1411: the volatility of a gene would still reflect the degree to which there
1412: is selection to conserve, or not to conserve, the (translated) protein
1413: sequence.
1414: 
1415: 
1416: \section*{Acknowledgments}
1417: 
1418: We thank Daniel Fisher, Andrew Murray, and Michael Turelli for their
1419: input during the preparation of this manuscript. J.B.P. acknowledges
1420: support from the Harvard Society of Fellows. M.M.D. acknowledges
1421: support from a Merck Award for Genome-Related Research.
1422: 
1423: \bibliography{./bib}
1424: 
1425: 
1426: \end{document}
1427: 
1428: 
1429: