0410:q-bio0410013/draft.tex

1: \documentclass[12pt]{article}

2: \usepackage{epsfig}

3: \usepackage{enumerate}

4: \usepackage{amsmath, amsfonts, amssymb}

5: \bibliographystyle{plain}

6: \setlength{\textheight}{8.4in} \setlength{\textwidth}{6.5in}

7: \setlength{\topmargin}{0.25in} \setlength{\headheight}{0.0in}

8: \setlength{\headsep}{0.0in} \setlength{\leftmargin}{0.25in}

9: \setlength{\rightmargin}{0.0in} \setlength{\oddsidemargin}{0.0in}

10: \setlength{\evensidemargin}{0.0in}

11:

12:

13: \newcommand{\ct}[1]{#1}

14: \renewcommand{\ct}[1]{ \cite{#1}}

15: \newcommand{\sub}[1]{\vspace{1.2ex}\noindent{\bf #1}}

16: \newcommand{\et}{\emph{et~al.}}

17: \newcommand{\jbreak}{}

18: \newcommand{\comment}[1]{{\bf[[#1]]}}

19: \newcommand{\cc}{\comment{Changed}}

20:

21: \newcommand{\bes}{\mathcal{B}}

22: \newcommand{\acid}{\text{aa}}

23:

24: \newcommand{\mon}{\begin{displaymath}}

25: \newcommand{\moff}{\end{displaymath}}

26: \newcommand{\eon}{\begin{equation}}

27: \newcommand{\eoff}{\end{equation}}

28: \newcommand{\eq}[1]{Eq. \ref{#1}}

29: \newcommand{\fig}[1]{Fig. \ref{#1}}

30:

31: \newenvironment{changemargin}[2]{%

32:  \begin{list}{}{%

33:   \setlength{\topsep}{0pt}%

34:   \setlength{\leftmargin}{#1}%

35:   \setlength{\rightmargin}{#2}%

36:   \setlength{\listparindent}{\parindent}%

37:   \setlength{\itemindent}{\parindent}%

38:   \setlength{\parsep}{\parskip}%

39:  }%

40: \item[]}{\end{list}}

41:

42: \long\def\symbolfootnote[#1]#2{\begingroup%

43: \def\thefootnote{\fnsymbol{footnote}}\footnote[#1]{#2}\endgroup}

44: \renewcommand{\baselinestretch}{1.0}

45:

46:

47: \title{Synonymous codon usage and\\

48: selection on proteins}

49: \date{October 13, 2004}

50:

51: \author{Joshua B. Plotkin$^1$, Jonathan Dushoff$^2$, Michael M.

52: Desai$^3$,

53: Hunter B. Fraser$^4$}

54: \linespread{1.2}

55:

56:

57:

58: \begin{document}

59:

60: \maketitle

61:

62: \begin{center} $^1$\textsc{Harvard Society of Fellows\\

63: and Bauer Center for Genomics Research \\

64: 7 Divinity Avenue, Cambridge MA 02138, USA} \end{center}

65:

66: \begin{center} $^2$\textsc{Department of Ecology and Evolutionary Biology\\

67: Princeton University, Princeton, NJ 08540, USA} \end{center}

68:

69: \begin{center} $^3$\textsc{Department of Physics \\

70: and Department of

71: Molecular and Cellular Biology\\

72: Harvard University, Cambridge, MA 02138, USA} \end{center}

73:

74: \begin{center} $^4$\textsc{Department of Molecular and Cell Biology\\

75: University of California, Berkeley, CA, 94720, USA} \end{center}

76:

77:

78:

79: \bigskip

80:

81: \vspace*{.3in}

82:

83:

84:

85: \pagebreak

86:

87:

88: \begin{changemargin}{.9cm}{.9cm}

89:

90:

91: \noindent Selection pressures on proteins are usually measured by

92: comparing homologous nucleotide sequences\ct{ZuckPaul65}.  Recently we

93: introduced a novel method, termed `volatility', to estimate selection

94: pressures on protein sequences from their synonymous codon

95: usage\ct{PlotDush03,PlotDush04}.  Here we provide a theoretical

96: foundation for this approach.  We derive the expected frequencies of

97: synonymous codons as a function of the strength of selection, the

98: mutation rate, and the effective population size.  We analyze the

99: conditions under which we can expect to draw inferences from biased

100: codon usage, and we estimate the time scales required to establish and

101: maintain such a signal.  Our results indicate that, over a broad range

102: of parameters, synonymous codon usage can reliably distinguish between

103: negative selection, positive selection, and neutrality.  While the power

104: of volatility to detect negative selection depends on the population

105: size, there is no such dependence for the detection of positive

106: selection.  Furthermore, we show that phenomena such as transient

107: hyper-mutators in microbes can improve the power of volatility to detect

108: negative selection, even when the typical observed neutral site

109: heterozygosity is low.

110:

111: \end{changemargin}

112:

113: \section{Introduction}

114:

115: Nucleotide coding sequences of many organisms exhibit significant codon

116: bias -- that is, unequal usage of synonymous codons.  Codon bias has

117: been attributed both to neutral processes, such as asymmetric mutation

118: rates, and to selection acting on the synonymous codons

119: themselves. The most common selective explanation of codon bias posits

120: that synonymous codons differ in their fitness according to the relative

121: abundances of iso-accepting tRNAs; a codon corresponding to a more

122: abundant tRNA would be used preferentially so as to increase

123: translational efficiency\ct{Ikem81,DebrMarz94,SoreKurl89}.  To a large

124: extent, this hypothesis has successfully

125: explained interspecific variation in genome-wide codon usage for

126: organisms ranging from \textit{Escherichia coli} to \textit{Drosophila

127: melanogaster}\ct{Akas01}.

128:

129: Recently, however, we have noted that codon bias in a protein sequence

130: can also result from selection at the amino acid level, even in the

131: absence of direct selection on synonymous codons

132: themselves\ct{PlotDush03,PlotDush04}. Codon bias arises from selection

133: at the amino acid level because of asymmetries in the structure of the

134: standard genetic code. Proteins that experience different selective

135: regimes should exhibit different synonymous codon usage.  Following from

136: this observation, we have introduced methods to screen a single genome

137: sequence for estimates of the selection pressures acting on its proteins

138: by comparing their synonymous codon usage\ct{PlotDush04}.

139:

140: In this paper, we provide a theoretical discussion of codon usage biases

141: that result from selection at the amino acid level.  Our analysis helps

142: to provide a theoretical grounding for techniques of estimating

143: selection pressures on proteins using signals gathered from their

144: synonymous codon usage\ct{PlotDush03,PlotDush04}. Throughout most of

145: this paper, we will ignore any source of direct selection on synonymous

146: codons, and focus on the codon biases that result purely from selection

147: at the amino acid level.  To the extent that any other sources of codon

148: bias apply equally across the genome, we have devised a bootstrap method

149: to control for these external sources of codon bias when estimating

150: selection pressures on proteins\ct{PlotDush03,PlotDush04}.  In the

151: discussion, however, we describe a range of confounding factors that may

152: vary across the genome in some organisms and limit the applicability of

153: codon-based methods to detect selection.

154:

155: \section{Codon volatility}

156:

157: Codon usage biases can arise from the familiar process of

158: selection on proteins because synonymous codons may differ in their

159: \textit{volatility} -- defined, loosely, as the proportion of a codon's

160: point mutations that result in an amino acid

161: substitution\ct{PlotDush03}. Although there are several possible

162: definitions of volatility, which can all be informative, we have

163: recently used the following formal definition\ct{PlotDush04}.

164:

165: We index the 61 sense codons in an arbitrary order $i=1\ldots61$.  We

166: use the notation $\acid(i)$ to denote the amino acid encoded by codon

167: $i$.  For each codon $i$, let $B(i)$ denote the set of sense codons

168: that differ from codon $i$ by a single point mutation.

169: We define the volatility of codon $i$ by:

170: \begin{equation}

171: \nu(i) = \frac{1}{\#B(i)}\sum_{j \in B(i)}

172: D[\acid(i),\acid(j)]

173: \label{voldef}

174: \end{equation}

175: where $D$ denotes the Hamming metric, which is zero if two amino acids

176: are identical, and one otherwise.  The definition in Eq.~\ref{voldef}

177: applies when all nucleotide mutations occur at the same rate.

178: When differential nucleotide mutation rates are known

179: (\textit{e.g.} a transition/transversion bias\ct{Wake96}), these rates can be

180: incorporated into the definition of codon volatility by appropriately

181: weighting the ancestor codons\ct{PlotDush04}.

182:
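As a concrete illustration of Eq.~\ref{voldef} (a worked sketch that assumes the standard genetic code and equal mutation rates), consider the Arginine codons AGA and CGG.  The nine point-mutation neighbors of AGA are CGA, GGA, TGA, AAA, ACA, ATA, AGC, AGG, and AGT.  Excluding the termination codon TGA leaves $\#B(\text{AGA})=8$ sense neighbors, of which only CGA and AGG encode Arginine, so that
\[
\nu(\text{AGA}) = \frac{6}{8}.
\]
The nine neighbors of CGG (AGG, GGG, TGG, CAG, CCG, CTG, CGA, CGC, and CGT) are all sense codons, and four of them (AGG, CGA, CGC, and CGT) encode Arginine, so that
\[
\nu(\text{CGG}) = \frac{5}{9}.
\]
By this measure AGA is the more volatile of the two synonyms.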

183: Minor variants of Eq. \ref{voldef} yield related definitions of codon

184: volatility. For some applications, one may want to allow termination

185: codons in the definition of $B(i)$. It is also natural to consider

186: alternatives to the Hamming metric, $D$, that weight substitutions

187: between amino acids depending upon the differences in their

188: stereochemical properties\ct{MiyaMiya79,PlotDush03}. A variety of other

189: metrics\ct{TangWyck04,YampStol04} that reflect the effects of different

190: amino acid substitutions on protein structure may likewise be

191: incorporated into the definition of codon volatility.  In this paper,

192: however, we will focus on the most basic definition of codon volatility

193: (Eq.  \ref{voldef}, using the Hamming metric), because variant

194: definitions are based on the same underlying principle and produce

195: similar results in practice\ct{PlotDush03}.

196:

197: Under the most basic definition of volatility, there are four amino

198: acids (Glycine, Leucine, Arginine, and Serine) whose codons differ in

199: their volatility.  As a result, when controlling for amino acid content,

200: we obtain a volatility signal from only those sites that contain one of

201: these four amino acids -- which amounts to about 30\% of the sites in a

202: typical gene. (If one uses stereochemical

203: metrics\ct{MiyaMiya79,PlotDush03} for $D$ in the definition of

204: volatility, then $\sim\!75$\% of the sites in a gene contain a

205: volatility signal.)  Although 30\% may seem like a small proportion of

206: sites from which to obtain a signal of selective pressures, it is larger

207: than the proportion of sites often used to detect selection via sequence

208: comparison of recently diverged species\ct{FleiAlla02,ClarkGlan03}. (For

209: example, fewer than 4\% of neutral sites exhibit substitutions when

210: comparing human and chimpanzee sequences\ct{ClarkGlan03}.)

211:

212:

213: In the following sections we analyze the consequences of selection on

214: proteins for codon usage in general, as well as for the volatility

215: measure in particular.  We demonstrate that the expected codon usage at

216: a site, as well as its temporal dynamics, depend upon the strength of

217: positive or negative selection on the amino acid sequence.  In Sections

218: \ref{QSmodel} through \ref{PopSize} we examine negative selection in

219: infinite and finite populations. In Section \ref{PosSel} we discuss

220: positive selection.  Our analysis is initially confined to the patterns

221: of codon usage at a single site under selection at the amino acid level.

222: Proceeding from this analysis, we also discuss codon usage over many

223: sites within a gene or genome, and analyze how many sites are required

224: in principle to detect a reliable signal of selection by inspecting

225: synonymous codon usage.

226:

227: \section{Negative Selection and Codon Bias in an Infinite Population}

228:

229: \label{QSmodel}

230:

231: Most nonsynonymous mutations in a protein coding sequence presumably

232: reduce the fitness of an organism. For a large proportion of sites,

233: therefore, natural selection opposes any change in the amino acid.  We

234: refer to this type of selection as ``negative selection.''

235:

236: For the purposes of exploring the effect of negative selection on codon

237: usage, we assume that selection cannot discriminate between the

238: synonymous codons for the favored amino acid at a site.

239: However, mutations are more likely to be nonsynonymous, and hence

240: deleterious, if the codon at that site has high volatility. As we will

241: show, this fact results in an effective preference for the less

242: volatile codons, among those codons that code for the favored amino acid

243: at the site. We emphasize that this preference for a codon of low

244: volatility at a site under negative selection is \textit{not} caused by

245: a direct fitness difference between synonyms.  Rather, more volatile

246: codons will occur less frequently as a second-order consequence of

247: negative selection at the amino acid level, and the structure of the

248: genetic code.

249:

250:

251: Proteins with a larger number of sites under negative selection will

252: exhibit a statistical bias towards less volatile codons, after

253: controlling for their amino acid content.

254: Here we calculate the expected magnitude of the codon bias as a function

255: of the mutation rate, the strength of negative selection, and, in

256: Section \ref{Wright}, the population size.  We also analyze the

257: conditions under which we can expect to detect and draw inferences from

258: this bias, and we estimate the time scales needed to establish and

259: maintain such a signal.

260:

261: \subsection{A simplified genetic code}

262:

263: In an infinite population, we can describe the dynamics of codon usage

264: at an individual site by using the standard multi-allele model first

265: introduced by Haldane\ct{Hald27} and used throughout the literature

266: (\textit{e.g.} ref.\ct{Nagy92} Eq. 2.25 or ref.\ct{Higg94}).  This model

267: describes a single site which can assume any of $K$ states. In order to

268: investigate codon usage, we consider $K = 64$ states, corresponding to

269: each of the $64$ possible codons.  In continuous time, the frequency

270: $x_i$ of individuals with codon $i$ evolves according to

271: \begin{equation} \frac{dx_i}{dt}=\sum_{j=1}^Kx_j(t) w_j M_{ij} - x_iW(t)

272: \label{QS} \end{equation} where $w_j$ is the Malthusian fitness of codon

273: $j$, $W(t) \equiv \sum_j w_j x_j(t)$ is the mean fitness of the

274: population, and $M_{ij}$ is the instantaneous rate of mutation from

275: codon $j$ to codon $i$, with $\sum_i M_{ij}=1$.  Although Eq. \ref{QS} is non-linear, the

276: equilibrium frequencies of the ``alleles'' $i=1,2,\ldots K$ are given by

277: the leading eigenvector of the matrix $w_j M_{ij}$\ct{ThomMcBr74}.

278: These frequencies determine the expected equilibrium codon usage at a

279: site. For the purposes of this paper, alternative formulations of the

280: $K$-allele model that treat the processes of selection and mutation

281: separately (\textit{e.g.} ref.\ct{CrowKimu70} Eq. 6.4.1) yield exactly the same

282: results.

283:

284: The equilibrium solution to Eq. \ref{QS} for the full genetic code does

285: not lend itself to intuitive understanding. Transient dynamics are also

286: difficult to calculate in this high-dimensional system.  Therefore, in

287: order to highlight the essential points of our analysis, we first

288: consider a ``toy'' genetic code that retains those features of the true

289: genetic code relevant to the study of synonymous codon usage under

290: negative selection.  As we will demonstrate, the solution for the

291: simplified genetic code yields a complete understanding for the full

292: genetic code as well.

293:

294: We imagine a simplified genetic system with only three possible codons,

295: $a_1$, $a_2$, and $b$.  Codons $a_1$ and $a_2$ code for amino acid $A$,

296: which is favored, and codon $b$ encodes amino acid $B$, which has

297: selective disadvantage $\sigma$.  We assume that mutations occur at rate

298: $u$ between these codons according to the structure \[ a_1

299: \rightleftarrows a_2 \rightleftarrows b, \] so that of the two

300: synonymous codons, $a_2$ is more volatile.

301:

302: According to the standard multi-allele model (\eq{QS}), the relative

303: frequencies of codons $a_1$, $a_2$, and $b$ are described by the

304: equation

305: \begin{equation} \label{trit}

306: \frac{d}{dt}\left (\begin{array}{c} a_1(t) \\ a_2(t) \\ b(t)

307: \end{array}\right ) = \left ( \begin{array}{ccc}

308: 1-u & u & 0 \\ u & 1-2u & u(1-\sigma) \\ 0 & u & (1-u)(1-\sigma)

309: \end{array} \right ) \left (\begin{array}{c} a_1(t) \\ a_2(t) \\

310: b(t) \end{array}\right ) - W(t) \left (\begin{array}{c} a_1(t) \\

311: a_2(t) \\ b(t) \end{array}\right ),

312: \end{equation}

313: where $W(t)=a_1(t)+a_2(t)+(1-\sigma)b(t)$.

314:

315: The equilibrium frequencies

316: of codons are given by the leading eigenvector of the matrix in Eq.

317: \ref{trit}. A simple perturbation analysis of this eigenvector shows that the

318: equilibrium frequency of $a_1$ depends monotonically

319: on $\sigma$, and it exhibits a sharp transition between two regimes: the

320: weak selection regime $\sigma \ll u$ and the strong selection regime

321: $\sigma \gg u$. In the weak selection regime, the equilibrium relative

322: frequencies of synonyms are given by the expansion

323: \eon \frac{\hat{a_1}}{\hat{a_1}+\hat{a_2}}=\frac{1}{2}+\frac{1}{12}

324: \frac{\sigma}{u} +

325: O\left(\frac{\sigma^2}{u^2}\right). \eoff

326: And in the strong selection regime, the equilibrium relative

327: frequencies are given by

328: \eon \frac{\hat{a_1}}{\hat{a_1}+\hat{a_2}}=\frac{\sqrt 5 -1}{2}-\frac{

329: (5-2\sqrt5)(1-\sigma)}{5} \frac{u}{\sigma} +

330: O\left(\frac{u^2}{\sigma^2}\right). \eoff

331:

332: In the absence of selection $(\sigma=0)$ all three codons occur with

333: equal frequency, as we would expect.  In particular, the relative

334: frequency of the two synonymous codons $a_1$ and $a_2$ equals

335: $\frac{1}{2}$, regardless of the mutation rate.  For weak selection

336: ($\sigma \ll u$), this result is still approximately true, according to

337: the perturbation expansion above.  In the case of strong negative

338: selection ($\sigma \gg u$), the relative frequency of the two synonymous

339: codons is given approximately by the inverse of the golden mean,

340: $\frac{\sqrt 5 -1}{2} \approx 0.62$.

341:
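The limiting value $\frac{\sqrt 5 -1}{2}$ can be recovered with a short calculation, sketched here under the simplifying assumption that the frequency of codon $b$ is negligible when $\sigma \gg u$.  Restricting the matrix in Eq.~\ref{trit} to the codons $a_1$ and $a_2$, the leading eigenvalue $\lambda$ and the ratio $r = \hat{a_2}/\hat{a_1}$ satisfy
\[
\lambda = (1-u) + u r, \qquad \lambda r = u + (1-2u) r .
\]
Eliminating $\lambda$ gives $r^2 + r - 1 = 0$, whose positive root is $r = \frac{\sqrt 5 - 1}{2}$.  Since this root satisfies $r(1+r)=1$, it follows that
\[
\frac{\hat{a_1}}{\hat{a_1}+\hat{a_2}} = \frac{1}{1+r} = r = \frac{\sqrt 5 -1}{2}.
\]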

342: The sharp transition between the weak and strong selection regimes

343: defines $\sigma = u$ as a critical value for negative selection.  For

344: $\sigma \ll u$ negative selection is ineffective at favoring the less

345: volatile codon, and the site is effectively neutral.  But when $\sigma

346: \gg u$, negative selection favors the less volatile codon, and the

347: magnitude of this effect depends only weakly on the value of $\sigma$.

348: This is an essential point.  In the strong selection regime, the

349: magnitude of negative selection is relatively unimportant; volatile

350: codons are disfavored at all sites where $\sigma \gg u$. The transition

351: between the weak and strong selection regimes is shown in Fig.

352: \ref{TritEquil}.

353:
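For a sense of scale, the two expansions can be evaluated at illustrative parameter values (chosen here only as an example, with $u=10^{-5}$ as in Fig.~\ref{TritEquil}).  For $\sigma = 10^{-6}$, so that $\sigma/u = 0.1$, the weak-selection expansion gives
\[
\frac{\hat{a_1}}{\hat{a_1}+\hat{a_2}} \approx \frac{1}{2} + \frac{1}{12}\,(0.1) \approx 0.508 .
\]
For $\sigma = 10^{-3}$, so that $u/\sigma = 0.01$, the strong-selection expansion gives
\[
\frac{\hat{a_1}}{\hat{a_1}+\hat{a_2}} \approx \frac{\sqrt 5 -1}{2} - \frac{(5-2\sqrt 5)(1-10^{-3})}{5}\,(0.01) \approx 0.618 - 0.001 = 0.617 .
\]
In both cases the first-order correction is small compared to the difference between the two limiting values.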

354: \begin{figure}[ht] \begin{center}

355: \epsfig{file=TritEquil.eps,angle=0,width=12cm} \caption{The relationship

356: between selection at the amino acid level and resulting synonymous codon

357: usage. The graph shows relative equilibrium frequency of synonymous

358: codons,  $\hat{a_1}/(\hat{a_1}+\hat{a_2})$, as a function of the strength

359: of negative selection, $\sigma$. The relative frequency of codon $a_1$

360: is approximately $\frac{1}{2}$ in the weak selection regime ($\sigma \ll

361: u$), and approximately $\frac{\sqrt{5}-1}{2}$ in the strong selection

362: regime ($\sigma \gg u$). In this figure $u=10^{-5}$.} \label{TritEquil}

363: \end{center} \end{figure}

364: %From TritInfinitePopGraph.nb

365:

366: \subsection{The effective disadvantage of a volatile codon}

367:

368: The critical value of $\sigma$ discussed above can be understood

369: intuitively by considering the ``effective selective disadvantage'' of

370: the more volatile codon $a_2$ that results indirectly from its

371: volatility.  We will use the notion of an ``effective selective

372: disadvantage'' to aid in our analysis of codon usage at a

373: site under negative selection. But we emphasize that our model (Eq.

374: \ref{QS}) does not assume any direct fitness difference

375: between synonymous codons.

376:

377: When the disfavored amino acid $B$ is lethal to the organism, then the

378: effective selective disadvantage of codon $a_2$ is particularly simple

379: to understand.  In this case, individuals with codon $a_2$ are removed

380: from the population at rate $u$ because they mutate to the lethal codon

381: $b$, but receive no back-mutations.  Hence the effective selective

382: disadvantage, denoted $s$, of codon $a_2$ versus codon $a_1$ is given by

383: $s = u$. The effective selective disadvantage of $a_2$ does not arise

384: from a fitness difference between synonyms, but rather from selection at

385: the level of amino acids and the structure of the genetic code.

386:

387: When amino acid $B$ is not lethal the situation is slightly more

388: complicated.  Nevertheless, for $\sigma \gg u$, mutants from $a_2$ to

389: $b$ are typically removed by negative selection before they mutate back from

390: $b$ to $a_2$. As a result, the effective selective disadvantage will

391: still be $s=u$ in the regime of strong selection. We can make this

392: argument concrete by considering the mutation-selection balance between

393: codon $b$ and codon $a_2$. According to the standard mutation-selection

394: balance, the equilibrium frequency of codon $b$ relative to codon $a_2$

395: equals $\frac{u}{\sigma}$ in the regime $\sigma \gg u$.  Thus for each

396: mutant from $a_2$ to $b$, there are at most of order $\frac{u}{\sigma}$

397: mutations from $b$ to $a_2$. The net mutation rate from $a_2$ to $b$ is

398: therefore $u \left(1-\frac{u}{\sigma} \right)$.  This is the rate at

399: which individuals of type $a_2$ are lost from the population due to the

400: fact that $a_2$ is more volatile than $a_1$.  Thus the effective

401: selective disadvantage of $a_2$ relative to $a_1$ is given by $s = u

402: \left( 1 - \frac{u}{\sigma} \right)$.  By definition, in the strong

403: selection regime we neglect $\frac{u}{\sigma}$ compared to 1, and the

404: effective selective disadvantage of codon $a_2$ is simply $s = u$.

405:
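The mutation-selection balance invoked above can be written out explicitly, as a sketch to leading order in $\frac{u}{\sigma}$.  Codon $b$ is produced from $a_2$ at rate $u$ and is removed by selection at rate $\sigma$, so that at equilibrium
\[
u\,\hat{a_2} \approx \sigma\,\hat{b}
\quad\Longrightarrow\quad
\frac{\hat{b}}{\hat{a_2}} \approx \frac{u}{\sigma}.
\]
The net flux from $a_2$ to $b$ is the forward flux minus the back-mutation flux,
\[
u\,\hat{a_2} - u\,\hat{b} \approx u\left(1-\frac{u}{\sigma}\right)\hat{a_2},
\]
which is the per-capita loss rate $s = u\left(1-\frac{u}{\sigma}\right)$ quoted in the text.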

406:

407: A similar argument holds for the real genetic code.  In this case, the

408: favored amino acid may correspond to several synonymous codons, each

409: with a potentially different volatility. However, the effective

410: selective disadvantage, $s$, of a more volatile codon relative to a less

411: volatile synonym is simply the difference in the number of mutations

412: leading to a disfavored codon ($\sigma \gg u$) times $\frac{u}{3}$,

413: where $u$ is the nucleotide mutation rate. (Note that $\frac{u}{3}$ is

414: the rate of mutation between any two particular nucleotides.)   For

415: example, when considering the relative frequencies of codons AGA and CGG

416: at a site under negative selection for Arginine, AGA has selective

417: disadvantage $s=\frac{2}{3}u$ compared to CGG, since AGA has two more

418: disfavored neighbors than CGG. By using the value of the effective

419: selective disadvantage, $s$, we can calculate the equilibrium relative

420: frequency of any pair of synonymous codons in mutation-selection

421: balance, and thereby deduce the relative frequencies of all synonyms.

422: Therefore, we can predict synonymous codon usage in the genetic code

423: without resorting to the full solution of \eq{QS}.

424:
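The count behind this example can be made explicit (a sketch that assumes every non-Arginine neighbor, including a termination codon, is strongly disfavored).  AGA has nine point-mutation neighbors, of which two (CGA and AGG) encode Arginine, leaving seven disfavored neighbors.  CGG has four Arginine neighbors (AGG, CGA, CGC, and CGT), leaving five disfavored neighbors.  Therefore
\[
s = (7-5)\,\frac{u}{3} = \frac{2}{3}\,u .
\]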

425: An analogous argument can be used to calculate the effective selective

426: disadvantage of codon $a_2$ in the regime of weak selection ($\sigma \ll

427: u$). In this regime, the relative equilibrium frequency of codon $b$

428: versus codon $a_2$ equals $1-\frac{\sigma}{2u}$. Thus, the effective

429: selective disadvantage of $a_2$ versus $a_1$ is approximately $s = 0$,

430: plus a small correction of order $\sigma$.  In other words, when $\sigma

431: \ll u$ selection between $a_1$ and $a_2$ is effectively neutral; it

432: cannot generate codon bias.  We therefore refer to the regime

433: $\sigma \ll u$ as the ``almost neutral regime.'' This result holds both

434: for the simplified three-codon model and for the real genetic code.

435: %SeeClassicMutationSelectionBalance.nb

436:

437:

438: It is also important to calculate the amount of time required to reach

439: equilibrium codon usage in the presence of strong negative selection.

440: Explicit solution of Eq. \ref{trit}, assuming $\sigma \gg u$, indicates

441: that the $e$-fold relaxation time is of order $\frac{1}{u}$ (the

442: selection coefficient is $s \sim u$, and so the time scale for codon

443: frequencies to change under selection is of order $\frac{1}{s} \sim

444: \frac{1}{u}$).  In other words, starting from any initial frequencies

445: $a_1(0)$ and $a_2(0)$, these frequencies will become $e$-fold closer to

446: their equilibrium values after a duration of order $\frac{1}{u}$

447: generations.  The same time scale holds for almost neutral sites

448: ($\sigma \ll u$) and for the real genetic code\symbolfootnote[2]{For

449: $\sigma \ll u$, the process is almost neutral and the time scale

450: calculation of Section \ref{relax} applies.  The real genetic code has

451: the same dynamics because we still have $s \sim u$ for $\sigma \gg u$

452: and neutral behavior for $\sigma \ll u$.}.  In practice, $u$ will

453: be quite small, and equilibrium volatility is approached very slowly.

454: We will revisit this point when we discuss finite populations, and again

455: when we discuss positive selection.

456:
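The order-of-magnitude estimate of the relaxation time can be illustrated in the three-codon model, again as a sketch that neglects codon $b$ when $\sigma \gg u$.  The matrix in Eq.~\ref{trit}, restricted to the codons $a_1$ and $a_2$, has eigenvalues
\[
\lambda_{\pm} = 1 - \frac{3 \mp \sqrt 5}{2}\,u .
\]
The relative frequencies of $a_1$ and $a_2$ approach equilibrium at a rate set by the gap between these eigenvalues, $\lambda_{+}-\lambda_{-} = \sqrt 5\, u$, so that the $e$-fold relaxation time is approximately
\[
\frac{1}{\sqrt 5\, u} \approx \frac{0.45}{u}.
\]
For an illustrative value of $u = 10^{-5}$, this corresponds to roughly $4.5 \times 10^{4}$ generations.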

457: \subsection{A specific example: selection for Arginine}

458:

459: In this section we consider a simple example that demonstrates how

460: our analysis applies to the real genetic code. We use Eq. \ref{QS} to

461: model the dynamics of $K=64$ alleles corresponding to the 64 codons,

462: indexed in an arbitrary order. For our example, we consider a single

463: site under negative selection for an Arginine codon. In this case we

464: define

465: \begin{equation}

466: M_{ij}=\begin{cases}

467: 1-3u, & \text{if $i=j$}\\

468: u/3, & \text{if $i$ and $j$ differ by a point mutation} \\

469: 0, & \text{otherwise}

470: \end{cases}

471: \end{equation}

472: where $u$ is the nucleotide mutation rate. We define

473: \begin{equation}

474: w_i=\begin{cases}

475: 1, & \text{if $i$ encodes Arginine} \\

476: 1-\sigma, & \text{if $i$ encodes a non-Arginine amino acid}\\

477: 1-\gamma, & \text{if $i$ encodes stop}

478: \end{cases}

479: \end{equation}

480: so that a codon encoding an amino acid other than Arginine has

481: fitness $1-\sigma$, and a termination codon has fitness

482: $1-\gamma$.  We analyze this model numerically by calculating the

483: leading eigenvector of the matrix $w_j M_{ij}$, which yields the

484: equilibrium frequencies of all 64 codons.

485:

486: In the case of no selection ($\sigma = \gamma = 0$), we find that all

487: codons occur with the same equilibrium frequency, independent of

488: mutation rate, as we would expect.  For almost neutral selection

489: ($\sigma \sim \gamma \ll u$), codon usage is still approximately

490: uniform. In the opposite case when Arginine is favored and all other

491: amino acids (or termination codons) are strongly disfavored

492: (\textit{i.e.} $\sigma \sim \gamma \gg u$), the Arginine codons CGA,

493: CGG, CGC, CGT, AGA, and AGG occur with equilibrium relative frequencies

494: $\approx$ 0.214 : 0.214 : 0.191 : 0.191 : 0.095 : 0.095.  As expected,

495: under negative selection the more volatile Arginine codons occur with

496: lower relative frequency in equilibrium.

497:
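These relative frequencies can also be recovered analytically in the limit of strong selection, sketched here under the assumption that all non-Arginine codons have negligible frequency.  The matrix $w_j M_{ij}$ restricted to the six Arginine codons equals $(1-3u)$ times the identity plus $\frac{u}{3}$ times the adjacency matrix of synonymous point mutations, so its leading eigenvector is that of the adjacency matrix.  By symmetry, write $x$ for the frequency of CGA and of CGG, $y$ for CGC and CGT, and $z$ for AGA and AGG.  With $\mu$ denoting the leading eigenvalue of the adjacency matrix,
\[
\mu x = x + 2y + z, \qquad \mu y = 2x + y, \qquad \mu z = x + z .
\]
The last two equations give $y = \frac{2x}{\mu-1}$ and $z=\frac{x}{\mu-1}$.  Substituting into the first yields $(\mu-1)^2 = 5$, so that $\mu = 1+\sqrt 5$ and
\[
x : y : z = \sqrt 5 : 2 : 1 .
\]
Normalizing the six frequencies to sum to one gives $x = \frac{\sqrt 5}{6+2\sqrt 5} \approx 0.214$, $y = \frac{2}{6+2\sqrt 5} \approx 0.191$, and $z = \frac{1}{6+2\sqrt 5} \approx 0.095$, in agreement with the numerical values above.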

498: The equilibrium frequencies of Arginine codons determine the expected

499: volatility at a single Arginine site under negative selection.

500: Assuming free recombination\ct{SawyHart92}, an individual gene consists

501: of many such sites randomly assembled; the mean and standard deviation

502: in the volatility (per site) of a randomly sampled gene are shown in

503: Fig.  \ref{argex}, as a function of the strength of negative selection

504: $\sigma$.  Note that the stronger the negative selection, the lower the

505: expected equilibrium volatility. The expected volatility exhibits a

506: sharp transition from high to low values when the strength of negative

507: selection $\sigma$ reaches the mutation rate $u$, as discussed above.

508: On either side of this transition, the volatility is insensitive to

509: $\sigma$. The standard deviations plotted in Fig. \ref{argex} correspond

510: to a gene composed of $L=200$ such sites, each modeled independently by

511: the multi-allele equation.

512:
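As an arithmetic illustration, the expected volatility at an Arginine site follows by weighting each codon's volatility, computed from Eq.~\ref{voldef}, by its equilibrium frequency.  The volatilities of CGA, CGG, CGC, CGT, AGA, and AGG are $\frac{4}{8}$, $\frac{5}{9}$, $\frac{6}{9}$, $\frac{6}{9}$, $\frac{6}{8}$, and $\frac{7}{9}$, respectively.  Using the rounded strong-selection frequencies quoted above,
\[
\mathbb{E}[\nu] \approx 0.214\left(\frac{4}{8}+\frac{5}{9}\right) + 0.191\left(\frac{6}{9}+\frac{6}{9}\right) + 0.095\left(\frac{6}{8}+\frac{7}{9}\right) \approx 0.626,
\]
whereas uniform codon usage gives
\[
\mathbb{E}[\nu] = \frac{1}{6}\left(\frac{4}{8}+\frac{5}{9}+\frac{6}{9}+\frac{6}{9}+\frac{6}{8}+\frac{7}{9}\right) \approx 0.653 .
\]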

513:

514: \begin{figure}[ht] \begin{center}

515: \epsfig{file=ArgEx.eps,angle=0,width=12cm} \caption{The relationship

516: between selection and volatility for a gene composed of $L=200$ freely

517: recombining sites under selection for Arginine. The graph shows expected

518: volatility per site in the gene ($\pm 1$ standard deviation, dashed) as

519: a function of the strength of negative selection, $\sigma$. The

520: nucleotide mutation rate is $u=10^{-5}$.  The expected volatility is

521: significantly depressed in the regime of strong negative selection,

522: $\sigma \gg u$.  (For this figure we assume $\gamma = 1$; virtually

523: identical results hold for $\gamma = \sigma$.) } \label{argex}

524: \end{center} \end{figure}

525:

526: According to Fig.  \ref{argex}, $L=200$ independent sites that each

527: experience neutrality ($\sigma \ll u$) can be distinguished on the basis

528: of their volatility from $L=200$ sites that experience negative

529: selection ($\sigma \gg u$).  The difference in the expected volatility

530: between these two regimes is greater than four standard deviations of

531: the volatility within either regime.

532:

533: In reality, the selective constraint $\sigma$ will vary greatly across

534: the sites of a given protein.  In this case, disregarding the

535: possibility of positive selection, the volatility of a gene (after

536: controlling for its amino acid sequence) essentially reflects the

537: relative number of informative sites that experience negative selection

538: versus neutrality.  For example, the volatility of gene $X$ that

539: contains $L=200$ informative sites under negative selection and an equal

540: number of neutral sites will be significantly greater (with a

541: $Z$-score of about three) than the volatility of gene $Y$ that

542: consists of $2L$ informative sites all under negative selection.

543: A more thorough discussion of variable selection pressures across genes

544: is given in Section \ref{Infer}, below.

545: %Z-score in MixedGenesArginineExample.xls

546:

547: Table \ref{GLRS} shows the equilibrium relative frequencies of

548: synonymous codons for each of the informative amino acids (G, L, R, and

549: S) under neutrality versus various selective regimes.  In Table

550: \ref{GLRS} we assume, as we do throughout this manuscript, that

551: volatility is measured using the Hamming metric and that there is no

552: transition/transversion bias.  Corresponding values for different

553: metrics or including a mutational bias may be calculated using the same

554: approach. As seen in Table \ref{GLRS}, the difference in the expected

555: volatility between selective regimes is least extreme (indeed, barely

556: informative) for Glycine sites.  The volatility difference is most

557: extreme for Serine sites: the highly volatile codons AGT and AGC are not

558: expected to occur at a site under negative selection, but they

559: preferentially occur at a site under positive selection.  This extreme

560: case results from the fact that codons AGT and AGC are not connected by

561: synonymous point mutations to the other Serine codons.

562: This situation does not imply that codons AGT and AGC should be treated

563: separately from the other Serine codons. In fact, when treated as an

564: entire group, the Serine codons are particularly informative for

565: positive selection (Table \ref{GLRS}).

566:

567: %from RelFreqs.xls

568: \renewcommand{\baselinestretch}{.9}

569: \begin{table}

570: {\small

571: \begin{center}

572: \begin{tabular}{lllllc}

573: & Neutral & Neutral* & Negative & Positive & $\nu$\\

574: \textbf{Leucine} & & & & &\\

575: cta & 0.16667 & 0.17300 & 0.21353 & 0.14213 & 5/9\\

576: ctc & 0.16667 & 0.18580 & 0.19098 & 0.17056 & 6/9\\

577: ctg & 0.16667 & 0.17890 & 0.21353 & 0.14213 & 5/9\\

578: ctt & 0.16667 & 0.18580 & 0.19098 & 0.17056 & 6/9 \\

579: tta & 0.16667 & 0.12990 & 0.09549 & 0.18274 & 5/7\\

580: ttg & 0.16667 & 0.14650 & 0.09549 & 0.19188 & 6/8 \\

581: \hline $\mathbb{E}[\nu]$ & 0.65146 & 0.64590 & 0.63172 & 0.65978 &\\

582: $\sigma[\nu]$ & 0.07362 & 0.07259 & 0.07022 & 0.07217 &\\

583: \\

584: \textbf{Arginine} & & & & &\\

585: aga & 0.16667 & 0.15210 & 0.09549 & 0.19149 & 6/8\\

586: agg & 0.16667 & 0.17050 & 0.09549 & 0.19859 & 7/9\\

587: cga & 0.16667 & 0.15210 & 0.21353 & 0.12766 & 4/8\\

588: cgc & 0.16667 & 0.17740 & 0.19098 & 0.17021 & 6/9\\

589: cgg & 0.16667 & 0.17050 & 0.21353 & 0.14184 & 5/9\\

590: cgt & 0.16667 & 0.17740 & 0.19098 & 0.17021 & 6/9\\

591: \hline $\mathbb{E}[\nu]$ & 0.65278 & 0.65400 & 0.62592 & 0.66766 & \\

592: $\sigma[\nu]$ & 0.09854 & 0.09660 & 0.09354 & 0.09528 & \\

593: \\

594: \textbf{Serine} & & & & & \\

595: agc & 0.16667 & 0.18510 & 0.00000 & 0.20636 & 8/9\\

596: agt & 0.16667 & 0.18510 & 0.00000 & 0.20636 & 8/9\\

597: tca & 0.16667 & 0.13440 & 0.25000 & 0.13265 & 4/7\\

598: tcc & 0.16667 & 0.17190 & 0.25000 & 0.15477 & 6/9\\

599: tcg & 0.16667 & 0.15162 & 0.25000 & 0.14510 & 5/8\\

600: tct & 0.16667 & 0.17190 & 0.25000 & 0.15477 & 6/9\\

601: \hline $\mathbb{E}[\nu]$ & 0.71792 & 0.72981 & 0.63243 & 0.73970 & \\

602: $\sigma[\nu]$ & 0.12504 & 0.12561 & 0.03913 & 0.12847 & \\

603: \\

604: \textbf{Glycine} & & & & & \\

605: gga & 0.25000 & 0.22460 & 0.25000 & 0.23810 & 5/8\\

606: ggc & 0.25000 & 0.26180 & 0.25000 & 0.25397 & 6/9\\

607: ggg & 0.25000 & 0.25170 & 0.25000 & 0.25397 & 6/9\\

608: ggt & 0.25000 & 0.26180 & 0.25000 & 0.25397 & 6/9\\

609: \hline $\mathbb{E}[\nu]$ & 0.65625 & 0.65724 & 0.65625 & 0.65675 & \\

610: $\sigma[\nu]$ & 0.01804 & 0.01859 & 0.01804 & 0.01775 & \\

611: \end{tabular}

612: \end{center}

613: \caption{Equilibrium codon usage under neutrality versus selective

614: regimes.  In each selective regime, we report the equilibrium relative

615: abundance of codons, and the resulting mean and standard deviation in

616: volatility per site. The first column corresponds to neutrality

617: ($\sigma=\gamma \ll u$); the second column corresponds to neutrality but

618: with disfavored termination codons ($\sigma \ll u$, $\gamma=1$); the third

619: column corresponds to strong negative selection in an infinite

620: population ($\sigma \gg u$, $\gamma \gg u$); the fourth column

621: corresponds to the expected frequencies after a positively selected

622: sweep (see Section \ref{PosSel}). The final column gives the volatility

623: of each codon, assuming no transition/transversion bias\ct{PlotDush04}.}

624: \label{GLRS} } \end{table}

625: \renewcommand{\baselinestretch}{1.0}
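
As a worked check of how the summary rows of Table \ref{GLRS} follow from the codon frequencies, consider Serine under strong negative selection. This illustration uses only the tabulated values; the symbols $p_c$ and $\nu_c$, for the equilibrium frequency and the volatility of codon $c$, are introduced here for convenience. Only the four codons tca, tcc, tcg, and tct occur, each with $p_c = 1/4$, so that
\[
\mathbb{E}[\nu] = \sum_c p_c \nu_c = \frac{1}{4}\left(\frac{4}{7}+\frac{6}{9}+\frac{5}{8}+\frac{6}{9}\right) \approx 0.6324
\]
and
\[
\sigma[\nu] = \sqrt{\sum_c p_c \nu_c^2 - \mathbb{E}[\nu]^2} \approx \sqrt{0.40151 - 0.39998} \approx 0.039,
\]
in agreement with the tabulated values up to rounding.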

626:

627:

628:

629: \section{Negative Selection in a Finite Population} \label{Wright}

630:

631: The models presented in Section \ref{QSmodel} describe the processes of

632: mutation and negative selection in an infinite population. In finite

633: populations, however, genetic drift also affects allelic frequencies.

634: In this section, we study the combined effects of mutation, negative

635: selection, and drift, which we analyze using diffusion equations.  These

636: equations can be very complex.  A full treatment of even the simplified

637: three-codon genetic code requires a two-dimensional diffusion process,

638: and the real genetic code involves a $63$-dimensional process.  To make

639: this problem tractable, we use the notion of the ``effective selective

640: disadvantage" of more volatile codons, as discussed above.  This allows

641: us to consider the dynamics only at the favored codons, thereby reducing

642: the dimensionality of the diffusion process.

643:

644: The neutral ($\sigma = 0$) or almost neutral ($\sigma \ll u$) regimes

645: are straightforward: here all synonymous codons for the favored amino

646: acid have the same effective fitness.  In this regime, each synonymous

647: codon occurs with the same probability in steady state, independent of

648: population size.

649:

650: For the remainder of this section, we analyze the case of strong

651: negative selection ($\sigma \gg u$) at a single site.   We consider a

652: diffusion approximation to the process of mutation, selection, and drift

653: operating only on the synonymous codons, to each of which we assign an

654: effective selective coefficient. For the simplified three-codon

655: genetic system, the more volatile codon $a_2$ has an effective selective

656: disadvantage of $s = u$ compared to codon $a_1$.  For the real genetic

657: code, more volatile codons will have a selective disadvantage of this

658: order, but the precise value of $s$ will depend on the specific amino

659: acid in question.  In the following analysis, we consider the case of

660: the simplified three-codon system.  However, we do not explicitly make

661: the substitution $s = u$, so that our results can also be applied (with

662: a slightly different value of $s$) to the real genetic code.

663:

664: The probability density $f(x,t)$ of the frequency $x$ of allele $a_1$ relative to

665: allele $a_2$ at time $t$ can be described by the Kolmogorov forward

666: equation\ct{KimuCrow64}

667: \begin{equation} \frac{\partial f(x,t)}{\partial t} =

668: -\frac{\partial}{\partial x} \{a(x) f(x,t)\} +

669: \frac{1}{2}\frac{\partial^2}{\partial x^2} \{b(x)f(x,t)\} \end{equation}

670: where the instantaneous mean and variance in the change of allelic

671: frequency are given by

672: \begin{eqnarray*}

673: a(x)&=&sx(1-x)-ux+u(1-x)\\

674: b(x)&=&x(1-x)/N.

675: \end{eqnarray*}

676: The stationary distribution of allele frequencies $\hat f(x)$ satisfies

677: the equation

678: \begin{equation}

679: \frac{d}{dx}\{b(x)\hat f(x)\}=2 a(x)\hat f(x)

680: \end{equation}

681: which has the solution\ct{Wrig31} \eon \hat

682: f(x)=Cx^{\theta-1}(1-x)^{\theta -1} \ e^{S x} \eoff where $\theta=2Nu$,

683: $S=2Ns$, and $C$ is chosen so that $\int_0^1\hat f(x)  dx=1.$  Since $s

684: \sim u$ (and thus $S \sim \theta$), the shape of the stationary

685: distribution $\hat f(x)$ falls into two categories: a bell-shaped

686: distribution in the regime $\theta>1$, and a U-shaped distribution in

687: the regime $\theta<1$. In other words, for $\theta>1$ the steady-state

688: population is typically polymorphic at the locus, much like the infinite

689: population mutation-selection balance.  By contrast, for $\theta<1$ the

690: steady-state population is usually near-monomorphic at the locus,

691: occasionally switching between alleles $a_1$ and $a_2$, with a bias

692: (whose strength is determined by $S$) towards allele $a_1$.

693:

694: In stationary state,

695: the expected frequency of allele $a_1$ is given by

696: \begin{equation}

697: M(\theta,S)=\int_0^1x \hat f(x) dx =\frac{1}{2}+\frac{\bes(\theta+1/2,

698: S/2)}{2\bes(\theta-1/2,S/2)}

699: \label{mean}

700: \end{equation}

701: where $\bes(x,y)$ is the modified Bessel function of the first kind.

702: Similarly, the variance in the frequency of allele $a_1$ is given by

703: \begin{eqnarray}

704: V(\theta,S)&=&\int_0^1 x^2 \hat f(x) dx - M(\theta,S)^2\\

705: &=&\frac{1}{4+8\theta}+\frac{2\theta \bes(\theta-1/2,S/2)

706: \bes(\theta+3/2,S/2)-(1+2\theta)

707: \bes(\theta+1/2,S/2)^2}{(4+8\theta)\bes(\theta-1/2,S/2)^2}

708: \label{var}

709: \end{eqnarray}

710: We use the standard Taylor series expansion of

711: $\bes(x,y)$,

712: \begin{equation}

713: \bes(x,y)=\sum_{m=0}^{\infty}\frac{(y/2)^{x+2m}}{m!\Gamma(x+m+1)},

714: \label{Bexpand}

715: \end{equation}

716: to obtain a simple approximation for the mean

717: stationary frequency of allele $a_1$:

718: \begin{equation}

719: M(\theta,S) = \frac{1}{2}+\frac{S}{4} + O(\theta^2),

720: \label{nearmean}

721: \end{equation}

722: valid for $\theta \sim S \ll 1$. This approximation indicates that the

723: difference in expected volatility at a site under neutral versus

724: negative selection is of order $S$, when $\theta \ll 1$.

725:

726: When $\theta=S=1$, the mean stationary frequency of allele $a_1$ assumes

727: the value $\frac{1}{e-1}\approx 0.58$. For $\theta \sim S \gg 1$, the

728: mean frequency quickly approaches the asymptotic value

729: $\lim_{\theta\rightarrow \infty} M(\theta,\theta)=\frac{\sqrt{5}-1}{2}$,

730: in agreement with our earlier result for an infinite population.
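
Both values can be checked directly from the stationary distribution; the following is a heuristic sketch rather than a full derivation. For $\theta = 1$ the power-law factors in $\hat f(x)$ are constant, so that $\hat f(x) = C e^{Sx}$ and, with $S=1$,
\[
M(1,1) = \frac{\int_0^1 x e^{x} dx}{\int_0^1 e^{x} dx} = \frac{1}{e-1} \approx 0.58 .
\]
For $\theta = S \gg 1$, assume that $\hat f(x)$ is sharply peaked, so that its mean lies close to its mode. Setting the derivative of $\log \hat f(x)$ to zero gives
\[
(\theta-1)\left(\frac{1}{x}-\frac{1}{1-x}\right) + \theta = 0 ,
\]
which for large $\theta$ reduces to $x^2 + x - 1 = 0$. The root of this equation in the interval $(0,1)$ is $x = \frac{\sqrt{5}-1}{2} \approx 0.618$.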

731:

732: The results in this section generalize our analysis of an infinite

733: population.  For an infinite population, we found that the expected

734: relative frequency of codon $a_1$ equals $\frac{1}{2}$ in the almost

735: neutral regime, and it equals $\frac{\sqrt 5 -1}{2}$ in the strong

736: selection regime. In a finite population with $\theta \gg 1$, the same

737: results hold. In a finite population with $\theta \ll 1$, the

738: expected relative frequency of the less volatile codon $a_1$ equals

739: $\frac{1}{2}$ in the neutral regime, and it equals

740: $\frac{1}{2}+\frac{Ns}{2}$ in the strong selection regime.  For any

741: population size, the relative frequency of codon $a_1$ depends

742: monotonically on the strength of selection at the amino acid level,

743: $\sigma$, and it exhibits a sharp transition at the critical value

744: $\sigma=u$.

745:

746: It is worth noting that our exact expression (Eq. \ref{mean}) for the

747: mean stationary frequency of allele $a_1$ generalizes earlier work by

748: Bulmer\ct{Bulm91} on the relative frequency of two synonymous codons

749: that experience a direct fitness difference. In the limit of small

750: $\theta$, we find that

751: \begin{equation} \lim_{\theta\rightarrow0} M(\theta,S) = \frac{1}{2} +

752: \frac{\bes(1/2, S/2)}{2\bes(-1/2,S/2)} = \frac{1}{1+e^{-S}},

753: \end{equation}

754: which agrees with Bulmer's result (his Equation 6). In other words, Bulmer's

755: approximation applies only for vanishingly small mutation rates (or

756: population sizes).
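
The closed form in this limit can be verified using the elementary expressions for the modified Bessel functions of half-integer order, $\bes(1/2,z) = \sqrt{2/(\pi z)}\,\sinh z$ and $\bes(-1/2,z) = \sqrt{2/(\pi z)}\,\cosh z$. These are standard identities, quoted here as a check rather than taken from the derivation above:
\[
\frac{1}{2} + \frac{\bes(1/2,S/2)}{2\bes(-1/2,S/2)} = \frac{1}{2}+\frac{1}{2}\tanh\frac{S}{2} = \frac{1}{1+e^{-S}} .
\]
Expanding for small $S$ gives $\frac{1}{2}+\frac{S}{4}+O(S^3)$, consistent with Eq. \ref{nearmean}.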

757:

758: We can again use the standard Taylor expansion of the Bessel function to

759: obtain a simple expression for the variance in the stationary

760: frequency of allele $a_1$,

761: \begin{equation}

762: V(\theta,S) \approx

763: \frac{(3+2\theta)(4+8\theta)-3S^2}{16(3+2\theta)(1+2\theta)^2},

764: \label{nearvar}

765: \end{equation}

766: which is a highly accurate approximation for all $\theta$,

767: provided as usual that $S$ is of order $\theta$ or smaller. Note that

768: when $\theta \ll 1$ the variance is approximated by

769: $\frac{1}{4}-\frac{\theta}{2}$, and when $\theta \gg 1$ the variance is

770: of order $\frac{1}{\theta}$.

771:

772:

773: \subsection{Inferring Negative Selection in a Finite Population}

774: \label{Infer}

775: Our exact (Eq. \ref{mean}) or approximate (Eq.  \ref{nearmean})

776: expressions for the stationary mean frequency of codon $a_1$ allow us to

777: determine the minimum number of sites required for codon volatility to

778: distinguish reliably between neutral versus negative selection.  When

779: sites are modeled independently (equivalent to the assumption of

780: linkage equilibrium\ct{SawyHart92}), under neutrality ($\sigma \ll u$;

781: $s=0$) the relative frequency of codon $a_1$ versus codon $a_2$ across a

782: gene of length $L$ is binomially distributed with mean $\frac{1}{2}$ and

783: variance $\frac{1}{4L}$.  If, on the other hand, the gene experiences

784: negative selection ($\sigma \gg u$; $s=u$), then the relative frequency

785: of codon $a_1$ is binomially distributed with mean $M(\theta,S)$ and

786: variance $M(\theta,S)[1-M(\theta,S)]/L$.  Therefore, in order to

787: reliably reject neutrality at about the 95\% confidence level, we

788: require \begin{equation} \label{minLeq} M(\theta,S)-\frac{1}{2} \ > \  2

789: \sqrt{\frac{1}{4L}} \end{equation} Using this equation, Fig. \ref{minL}

790: shows the minimum number of sites required to reliably distinguish

791: negative selection from neutrality on the basis of codon volatility,

792: under our simplified ``genetic code'' consisting of three codons.

793:

794: \begin{figure}[ht] \begin{center}

795: \epsfig{file=MinL.eps,angle=0,width=12cm} \caption{The relationship

796: between the scaled population size, $\theta=2Nu$, and the minimum number

797: of sites required to distinguish negative selection from neutrality, at

798: the 95\% confidence level. Sites are assumed to be unlinked.  It is

799: important to note that the appropriate effective population size that

800: determines the value of $\theta$ in practice does not necessarily equal

801: the average neutral site heterozygosity (see Section \ref{PopSize}).}

802: \label{minL} \end{center} \end{figure}

803: %in CheckEquationsAndSomeFigures.nb
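
As a rough illustration of Eq. \ref{minLeq} (a sketch under the approximations stated here, not a substitute for the exact curve in Fig. \ref{minL}), the condition can be rearranged as
\[
L > \left(M(\theta,S) - \tfrac{1}{2}\right)^{-2}.
\]
For $\theta \sim S \ll 1$, Eq. \ref{nearmean} gives $M(\theta,S) - \frac{1}{2} \approx \frac{S}{4}$, so that $L \gtrsim 16/S^2$ to leading order: the required number of sites grows as the inverse square of the scaled selection coefficient. At $\theta=S=1$, where $M = \frac{1}{e-1} \approx 0.582$, the condition gives $L \gtrsim 150$. In the limit $\theta = S \rightarrow \infty$, where $M \rightarrow \frac{\sqrt{5}-1}{2} \approx 0.618$, it gives $L \gtrsim 72$.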

804:

805: Eq. \ref{minLeq} applies when comparing a collection of neutral

806: sites against a collection of sites under negative selection.  In most

807: situations, however, the selective constraint $\sigma$ will vary across

808: the sites of a protein. For example, consider gene $X$ with $L+J$ sites

809: under negative selection, compared to gene $Y$ with $L$ neutral sites

810: and $J$ sites under negative selection. In this case, the expected

811: frequency of codon $a_1$ in gene $Y$ is $(L/2 + J M(\theta,S))/(L+J)$.

812: Therefore, in order to reliably infer that gene $X$ experiences more

813: negative selection than gene $Y$, at the 95\% confidence level we

814: require \begin{equation} \label {minLeq2} M(\theta,S) - \frac{L/2 +

815: JM(\theta,S)}{(L+J)} \  > \ 2 \sqrt{\frac{L/4 + J M(\theta,S)

816: [1-M(\theta,S)]}{(L+J)^2}} \end{equation} As Eq. \ref{minLeq2}

817: indicates, the power to discriminate between two genes is decreased when

818: both genes contain many sites, $J$, under negative selection and only a

819: few sites, $L$, under different selective regimes. Nevertheless,

820: provided $J \sim L$, the power to discriminate between genes $X$ and $Y$

821: is decreased by $\sim$20\% at most (compared to $J=0$), and so the

822: minimum number of sites required to detect negative selection (Fig.

823: \ref{minL}) remains mostly unchanged.

824: %see MinLWithOtherSites.nb

825:

826: Although the results in this section were derived for a simplified

827: genetic code, the scaling behavior of these solutions holds for the full

828: genetic code as well -- \textit{i.e.} when comparing neutrality to

829: negative selection, for $\theta \ll 1$ the expected difference in

830: volatility per site will be of order $\theta$; and for $\theta \gg 1$

831: the expected difference in volatility can be calculated from the

832: infinite population model (Eq. \ref{QS} and Table \ref{GLRS}).

833:

834: \subsection{Relaxation towards steady state} \label{relax} Although Eq.

835: \ref{mean} predicts the steady-state relative frequencies of codons

836: $a_1$ and $a_2$ in the selected regime ($\sigma \gg u$), we have not yet

837: discussed how long it takes, on average, to reach this steady

838: state. In the case of a very large population, $\theta \gg 1$, we know

839: from the infinite population model (Section \ref{QSmodel}) that the $e$-fold

840: relaxation time to equilibrium is of order $\frac{1}{u}$ generations. In

841: this section, we demonstrate that the same result applies to the

842: time scale of relaxation towards steady state in the regime $\theta \ll

843: 1$.

844:

845: As usual, we consider a single site under negative selection. In the

846: regime $\theta \ll 1$, we have seen that the steady-state population

847: will spend most of the time in a nearly monomorphic state, with a

848: preference (of order $\theta$) for the less volatile codon, $a_1$.

849: Therefore, in order to calculate the time scale of relaxation towards

850: steady state, we may simply calculate the amount of time required such

851: that, starting with a population fixed for allele $a_2$, the probability

852: of the population remaining fixed for allele $a_2$ has been reduced

853: $e$-fold.

854:

855: Given a population initially fixed for codon $a_2$, there are $Nu$

856: mutations to codon $a_1$ generated per generation. Each of these

857: mutations has an effective selective advantage $s=u$ over allele $a_2$,

858: and will therefore fix with probability

859: $2s/(1-e^{-2Ns})$\ct{CrowKimu70}. Hence the rate of production of a

860: mutation that will eventually fix is given by \begin{equation} P_{fix} =

861: \frac{2Nus}{1-e^{-2Ns}} \approx u, \label{Pfix} \end{equation} assuming

862: $\theta \ll 1$.  According to this calculation, the mean time until

863: fixation of codon $a_1$ is of order $\frac{1}{u}$ generations, which

864: gives the time scale of relaxation to the steady-state codon usage in a

865: finite population under negative selection.
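
The estimate in Eq. \ref{Pfix} can be refined slightly by treating the site as a two-state process that switches between a population fixed for $a_2$ and one fixed for $a_1$. This is a heuristic sketch: it assumes that the same fixation formula applies to the reverse mutation with $s$ replaced by $-s$, and the symbols $r_{+}$ and $r_{-}$ are introduced here only for illustration. With $S = 2Ns$, the switching rates are
\[
r_{+} = \frac{uS}{1-e^{-S}} \quad (a_2 \rightarrow a_1), \qquad r_{-} = \frac{uS}{e^{S}-1} \quad (a_1 \rightarrow a_2).
\]
Their ratio is $r_{+}/r_{-} = e^{S}$, so the stationary probability of being fixed for $a_1$ is $r_{+}/(r_{+}+r_{-}) = 1/(1+e^{-S})$, in agreement with the small-$\theta$ limit of Eq. \ref{mean}. The relaxation rate of a two-state process is the sum of its switching rates,
\[
r_{+} + r_{-} = uS\coth\frac{S}{2} \approx 2u \qquad (S \ll 1),
\]
which again corresponds to a relaxation time of order $\frac{1}{u}$ generations.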

866:

867:

868: \section{About Population Sizes} \label{PopSize} As discussed above, the

869: strength of the signal of negative selection depends upon the

870: parameter $\theta = 2Nu$. What is the appropriate value of

871: $\theta$ in practice?

872:

873: Unfortunately, this question is far easier asked than answered.

874: Population geneticists have long struggled to reconcile estimates of

875: $\theta$ deduced from polymorphism data with direct measurements of $N$

876: and $u$ across broad taxonomic ranges.  The effective population sizes

877: of micro-organisms in particular are topics of active debate.  Estimates

878: of $\theta$ are usually obtained by comparing SNP data at neutral (or

879: presumably neutral) sites against the expected site diversity

880: or the expected number of segregating sites under a neutral

881: model\ct{Ewen04}. The authors of a recent survey\ct{LyncCone03} have

882: reported an average value of $\theta \approx 0.15$ among the prokaryotes

883: studied. But estimates of $\theta$ for a microbial species can

884: vary by four orders of magnitude, and they depend strongly on

885: assumptions about population structure\ct{Berg96}.  To complicate

886: matters further, heterogeneity in mutation rates leads to substantial

887: underestimates of $\theta$\ct{Taji96}.

888:

889: Aside from uncertainty in its estimation, the value of $\theta$ deduced

890: from neutral SNP data\ct{LyncCone03} may not be relevant to questions of

891: selection and volatility.  Monomorphism observed at neutral sites may

892: result from non-neutral processes, such as background

893: selection\ct{CharMorg93} or hitchhiking on periodically sweeping

894: sites\ct{MaynHaig74}.  As a result, the variance effective population

895: size estimated from SNP data may not be relevant to other aspects of

896: evolution, such as substitutions at linked weakly selected

897: sites\ct{Gill01}.

898:

899: One particularly striking example of a discrepancy in the appropriate

900: effective population sizes arises from the consideration of mutator

901: phenotypes. Populations of microbial species periodically experience a

902: transient increase in the mutation rate, often $10^2$--$10^3$ times greater

903: than that of a non-mutator strain\ct{GiraRadm01}.  Between 2\% and 20\% of

904: bacterial populations isolated in the wild at any given time exhibit a

905: mutator phenotype\ct{GiraRadm01,OlivCant00,LeclLi96}. The mutator phase

906: can be induced in several ways. A defective DNA repair gene may arise

907: and sweep to fixation by hitchhiking on a positively selected

908: mutation\ct{NotlSeet02}. The entire population then experiences an

909: elevated mutation rate until a non-mutator allele sweeps and replaces

910: the mutator\ct{NotlSeet02,DenaLeco00}. A second, perhaps more common

911: mechanism is stress-induced mutagenesis; natural isolates of \textit{E.

912: coli} often experience an increase in their mutation rate in response to

913: stress\ct{BjedTena03}. As a result of these and other observations,

914: researchers have argued that bacterial populations evolve primarily by

915: periodic acquisition of mutator phenotypes followed by adaptive sweeps

916: and subsequent loss of the mutator\ct{GiraRadm01,DenaLeco00,NotlSeet02}.

917: As we shall see, the effect of this process on synonymous codon usage is

918: dramatic: the expected site diversity is driven by the value of $\theta$

919: in the wildtype regime ($\theta_w = 2 N u_w$), but the pattern of

920: synonymous codon usage at a site under negative selection is driven by

921: the value of $\theta$ in the mutator regime ($\theta_m = 2 N u_m \gg

922: \theta_w$).

923:

924: As a simple example of this phenomenon, we have simulated a

925: Fisher-Wright model of a single locus in a population of constant size

926: $N=1000$.  The simulated site is subject to recurrent mutation between

927: ``alleles" $a_1$ and $a_2$ at wildtype rate $u_w=10^{-5}$. As in Section

928: \ref{Wright}, the alleles $a_1$ and $a_2$ differ in fitness by $s$,

929: where $s$ equals the mutation rate.  Periodically, we model the fixation

930: of a mutator allele (or, equivalently, the stress-induced mutagenesis

931: across the entire population) by exogenously increasing the mutation

932: rate to $u_m= 10^3 \times u_w$ for 100 generations; thereafter we

933: (artificially) enforce a selective sweep at the site, followed by

934: reversion to the wildtype mutation rate.  Overall, the population

935: experiences the mutator regime for 5\% of the time, consistent with

936: observed frequencies of mutator phenotypes in the

937: wild\ct{GiraRadm01,OlivCant00,LeclLi96}. According to our simulations,

938: the average site diversity, $2x(1-x)$, at a randomly chosen time equals

939: $0.028$, which is close to its expected value assuming that $\theta$ is

940: given by $\theta_w$: $\mathbb{E}[2x(1-x)]=\theta_w=0.02$.  But the

941: average frequency of allele $a_1$ equals $0.611$, which is close to its

942: expectation assuming that $\theta$ is given by $\theta_m$:

943: $\mathbb{E}[x]=M(\theta_m,\theta_m) = 0.616$ (Eq.  \ref{mean}).  In

944: other words, the average frequency of the less volatile codon $a_1$ is

945: dominated by the mutator periods, but the average site heterozygosity

946: (and any estimate of $\theta$ based on it) is dominated by the

947: non-mutator periods.

948: %/VOLATILITYTHEORY/Simulations/FreeLociLinkedToSweeperWithMutator/averages.1_onelocus.out

949:

950: There is a simple, intuitive explanation for this result.  The average

951: heterozygosity at the site is low at virtually all times (except during

952: the brief mutator periods) because selective sweeps cause monomorphism,

953: followed by long periods of low $\theta$. Therefore, the effective

954: $\theta$ for SNP diversity is small, \textit{i.e.} close to $\theta_w = 2Nu_w$.  But

955: the site converges quickly towards the less volatile codon during the

956: mutator periods, since the rate of convergence is determined by

957: $s=u_m$. And the site is essentially frozen during the non-mutator

958: periods, since the decay rate of volatility is only $u_w$.  Therefore

959: the expected frequency of $a_1$ at a random time  is

960: primarily determined by the frequency reached during the mutator regime.

961: As is clear from this explanation, the expected frequency of codon $a_1$

962: will, in general, depend upon the stochastic scheduling of mutator

963: periods. For example, the site will converge towards $M(\theta_m,

964: \theta_m)$ provided the population experiences at least one mutator

965: phase of duration of order $1/u_m$ generations, within every $1/u_w$

966: generations.  In fact, even if the mutator phases are very brief and

967: infrequent, the average frequency of allele $a_1$ can greatly exceed the

968: value predicted by $\theta$ estimated from the average site

969: heterozygosity.
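
The parameter values of the simulation make this argument concrete. The following is simple arithmetic on the values quoted above:
\[
\theta_w = 2Nu_w = 2\times 10^{3}\times 10^{-5} = 0.02, \qquad \theta_m = 2Nu_m = 2 \times 10^{3} \times 10^{-2} = 20 .
\]
Each mutator phase lasts 100 generations, which equals $1/u_m$. If such phases occupy 5\% of the time, they recur on average every 2000 generations, far more often than once per $1/u_w = 10^{5}$ generations. Under these assumptions, the condition for convergence towards $M(\theta_m,\theta_m)$ stated above is satisfied.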

970:

971: Although the simple model used in this section does not describe any but

972: the most phenomenological features of mutator alleles, it does reveal an

973: important general observation: the value of $\theta$ estimated from

974: neutral SNP data does not in general equal the effective value of

975: $\theta$ that determines synonymous codon usage at a site under negative

976: selection.  This result is of utmost importance to any discussion of the

977: relationship between $\theta$ and the power of volatility to detect

978: negative selection.

979:

980:

981: \section{Positive selection} \label{PosSel} In the sections above, we have considered

982: selection that opposes a change to the amino acid at a site.  This type

983: of negative selection induces a bias towards the less volatile codons

984: for the favored amino acid at a site.  However, selection sometimes

985: favors a change in the amino acid at a particular site. In such

986: situations, as we will demonstrate, a site is more likely to be occupied

987: by a codon of greater than average volatility.

988:

989: A variety of mechanisms are known to cause positive selection.

990: Frequency dependence often induces diversifying selection at a site,

991: whereas an exogenous change in the environment can induce directional

992: selection for a new, specific amino acid. We do not here model all of

993: the various types of positive selection, but rather focus on the

994: essential aspect shared by these mechanisms. We analyze the dynamics at

995: a site that has, for a period of time, experienced negative selection

996: for amino acid $A$, and that subsequently experiences negative selection

997: for a different amino acid, $B$ (for whatever reason).  We refer to the

998: change in the selective regime as a positive selection event.

999:

1000: Prior to the onset of positive selection, amino acid $A$ is assigned

1001: fitness 1 and all other amino acids fitness $1-\sigma$; subsequently,

1002: amino acid $B$ is assigned fitness 1 and all others fitness $1-\sigma$.

1003: We assume that $N \sigma \gg 1$ (otherwise, the site is effectively

1004: neutral at the amino acid level) and that $\sigma \gg u$ (otherwise, the

1005: expected codon frequencies are uniform).  Once the population shifts to

1006: the new amino acid $B$, it is clear that the site will more likely

1007: contain a codon that is more volatile than the average $B$-codon,

1008: because it has just arisen through a nonsynonymous mutation. Since $B$

1009: is now favored, negative selection subsequently operates to reduce the

1010: volatility at the site. However, this process takes time. Thus, for some

1011: time after the positive selection event, there is a bias toward elevated

1012: volatility at the site, which gradually decays. In this section, we

1013: analyze this process.

1014:

1015: Analogously to previous sections, we initially consider a simplified

1016: genetic code consisting of four codons, $a_1$, $a_2$, $b_1$, and $b_2$,

1017: the first two of which encode amino acid $A$, and the latter two amino

1018: acid $B$.  Mutations can only occur between codons $a_1$ and $a_2$,

1019: $a_2$ and $b_2$, and $b_2$ and $b_1$, creating the mutation structure \[

1020: a_1 \rightleftarrows a_2 \rightleftarrows b_2 \rightleftarrows b_1.  \]

1021: In this simplified genetic code, codons $a_2$ and $b_2$ are the more

1022: volatile codons for their respective amino acids.

1023:

1024: After the change in selection from amino acid $A$ to $B$, a mutation to

1025: codon $b_2$ that survives stochastic drift will eventually arise.  Thus,

1026: at least initially, the more volatile codon $b_2$ is more prevalent

1027: than the less volatile codon $b_1$. During this period, we can detect

1028: the signature of the positively selected sweep because of the elevated

1029: volatility at the site.  However, negative selection for amino acid $B$

1030: will eventually favor codon $b_1$.  Therefore, the volatility signature

1031: of the positive selection event will be present provided that the time

1032: scale of decay toward codon $b_1$ is longer than the interval since the

1033: positive selection event.

1034:

1035: Fortunately, the time scale of decay towards $b_1$ is quite long.  For

1036: $\theta \gg 1$, we can use the infinite population model to find this

1037: time scale.  As discussed above, the time required to reduce the

1038: volatility $e$-fold is of order $\frac{1}{u}$.  For $\theta \ll 1$, we

1039: must use a finite population size calculation.  In this regime, the

1040: population is nearly monomorphic at almost all times.  Following the

1041: selective sweep, the site will be monomorphic for $b_2$ with almost unit

1042: probability.  We are interested in the duration of time required such

1043: that the probability of being monomorphic for $b_2$ (as opposed to $b_1$)

1044: has been reduced $e$-fold.  The probability of switching between $b_2$

1045: and $b_1$, however, is of order $u$ per unit time (even before $b_2$ has

1046: finished outcompeting $a_2$), according to Eq.  \ref{Pfix}.  Thus, the

1047: time scale of decay in a finite population is also $\frac{1}{u}$.

1048:

1049: According to this analysis, a selective sweep will result in the

1050: presence of a more volatile codon for a period of order $\frac{1}{u}$ generations --

1051: a very long time indeed. (In the case of \textit{E. coli}, for example,

1052: $\frac{1}{u}$ generations is nearly $100{,}000$ years, given $u \approx

1053: 5\times10^{-10}$ and generation time $\approx$ $20$ minutes. The

1054: generation length and resulting time scale for \textit{E. coli} in the

1055: wild may be much longer yet\ct{GibbKaps67}.) Equivalently, repeated

1056: sweeps for amino acid changes at a site will result in the presence of

1057: more volatile codons at almost all times, provided that new sweeps occur

1058: more often than every $\frac{1}{u}$ generations.
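
The time scale quoted for \textit{E. coli} follows from simple arithmetic, assuming the stated values of $u$ and of the generation time, and a year of roughly $5.3\times 10^{5}$ minutes:
\[
\frac{1}{u} = \frac{1}{5\times 10^{-10}} = 2\times 10^{9} \text{ generations}, \qquad 2\times 10^{9} \times 20 \text{ min} = 4\times 10^{10} \text{ min} \approx 7.6\times 10^{4} \text{ years}.
\]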

1059:

1060:

1061: \subsection{Inferring Positive Selection}

1062:

1063: The above analysis for a simplified genetic system generalizes in an

1064: obvious way to the real genetic code.  After a positive selection event

1065: at a site, the population switches from a codon for amino acid $A$ to a

1066: codon for amino acid $B$.  The expected volatility of the new codon is

1067: greater than the average volatility of $B$-codons, because the new codon

1068: has just arisen through a nonsynonymous mutation.  To be more precise,

1069: if the population is monomorphic for a random non-$B$ codon before the

1070: selective sweep, then after the sweep occurs the expected relative

1071: frequencies of the $B$-codons are given, approximately, by their relative

1072: volatilities.  Subsequent to the selective sweep, the increased

1073: volatility at the site will decay on a time scale of order of

1074: $\frac{1}{u}$ generations.

1075:

1076: There is a critical distinction between the volatility signature of

1077: positive selection versus that of negative selection.  The depressed

1078: volatility at a site under negative selection is caused by a

1079: mutation-selection-drift balance. When the effective population size is

1080: small, a large number of sites are required to distinguish negative

1081: selection from neutrality. By contrast, the volatility signature of

1082: \textit{positive} selection is {\it not} an equilibrium property, and

1083: it is not sensitive to population size.  Regardless of $\theta$, the

1084: probability of sampling a more volatile codon is significantly elevated

1085: immediately after a selective sweep at a site, and this probability

1086: decays only at rate $u$.

1087:

1088: As we have seen, a gene that contains many sites under positive

1089: selection will exhibit a greater volatility (controlling for its amino

1090: acid composition) than a gene under neutral or, especially, negative

1091: selection.  How many positively selected sites are required in order to

1092: detect a reliable signal? In the case of Leucine, for example, the

1093: expected volatility of a site that has recently experienced a positive

1094: selective sweep is approximately $0.660 \pm 0.072$ (one standard

1095: deviation), whereas a neutral Leucine site has expected volatility

1096: $0.646 \pm 0.073$, and a Leucine site under negative selection has

1097: expected volatility $0.632 \pm 0.0070$ (see Table \ref{GLRS}).

1098: Therefore, the volatility of about 100 Leucine sites under positive

1099: selection will be significantly greater (at the 95\% confidence level)

1100: than that of 100 neutral sites.  Similarly, the volatility of about 25

1101: positively selected Leucine sites will be significantly greater than that of

1102: 25 negatively selected sites.  Similar results hold for Serine and

1103: Arginine; Glycine is less informative.

1104:
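
The order of magnitude of these sample sizes can be checked with a
normal approximation.  This is only a sketch: it assumes independent
sites and a simple two-sample comparison of mean volatilities, which may
differ from the exact test behind the figures quoted above, and the
symbols $n$, $\Delta$, $s_1$, $s_2$ and $z_\alpha$ are introduced here
for illustration.  If two classes of sites have expected volatilities
that differ by $\Delta$, with per-site standard deviations $s_1$ and
$s_2$, then the expected difference between the mean volatilities of $n$
sites of each class exceeds $z_\alpha$ standard errors when
\begin{displaymath}
n \gtrsim z_\alpha^2 \, \frac{s_1^2 + s_2^2}{\Delta^2} .
\end{displaymath}
For positively selected versus neutral Leucine sites, $\Delta = 0.660 -
0.646 = 0.014$, $s_1 = 0.072$ and $s_2 = 0.073$, so that $n \gtrsim 54 \,
z_\alpha^2$, which is of order $10^2$ sites for conventional
significance thresholds.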

1105: It is worth noting that the elevated volatility for a positively selected

1106: Serine site will decay even more slowly than for other amino acids,

1107: because the highly volatile codons AGC and AGT are not connected by

1108: synonymous mutations to other Serine codons.

1109:

1110: \section{Discussion}

1111: \label{Discussion}

1112:

1113: \subsection{Codon volatility versus comparative sequence analysis}

1114:

1115: Selection pressures on proteins are usually estimated by comparing

1116: homologous nucleotide sequences\ct{ZuckPaul65}.  Orthologous genes are

1117: identified in different organisms and sequenced; their sequences are

1118: then aligned, and the changes that have accumulated since divergence are

1119: used to infer the selection pressures that have been

1120: acting\ct{GoldYang94}. When available, sequence variation sampled from

1121: individuals within a species can be compared with variation across

1122: species to produce an elegant test for adaptive evolution at a

1123: locus\ct{McDoKrei91,SawyHart92}. In addition, there are a variety of

1124: statistical tests designed to detect a departure from neutrality in the

1125: site frequency spectrum sampled within a single species (see ref.\ct{Krei00}

1126: and references therein). In many cases, the complete distributions of

1127: these statistics under the neutral null model are difficult to derive,

1128: but they have been studied through computer simulation\ct{SimoChur95}.

1129:

1130: Techniques for estimating selective constraints via sequence comparison

1131: are typically applied, independently, to one or several genes at a time.

1132: When extensive intra- or inter-specific sequence data are available at a

1133: locus of interest, such techniques have proven enormously useful for

1134: measuring selection, and it is unlikely that they will be significantly

1135: improved by incorporating information about synonymous codon usage.  But

1136: the accurate estimation of selective constraints requires a large number

1137: (approximately six or more\ct{AnisBiel01}) of orthologous sequences for

1138: each gene of interest.  At the genome-wide scale, comparative data

1139: (\textit{i.e.} orthologous gene sequences) will not be available for all

1140: genes, and methods to estimate selective constraints based on sequence

1141: comparison will often be inapplicable.  Furthermore, the genes under

1142: positive selection are often of particular interest, but such genes are

1143: even less likely to have identifiable orthologs in related species due

1144: to their rapid sequence divergence\ct{PlotDush04}.  Even in the

1145: \textit{Saccharomyces} genus, which is currently the best-case

1146: scenario for comparative genomics, with the genomes of four species

1147: fully sequenced, only two-thirds of the genes in \textit{S.

1148: cerevisiae} have unambiguously identifiable orthologs in related

1149: species\ct{PlotFras04}. Unlike comparative techniques, the analysis of

1150: synonymous codon usage offers a computational tool to screen for

1151: selection pressures on \textit{all} genes in a sequenced genome. Genome-wide

1152: screens based on analyzing synonymous codon usage may prove useful in

1153: identifying important classes of genes under strong selection, such as

1154: the antigens of pathogens\ct{PlotDush04}.

1155:

1156: Unlike most comparative statistics that test for a departure from

1157: neutrality, estimates of selection based on bootstrapped volatility

1158: scores\ct{PlotDush04} are not `estimators' in a rigorous statistical

1159: sense -- \textit{i.e.} statistics whose sampling properties can be

1160: derived from a null model, and which can be used in likelihood ratio

1161: tests of a null hypothesis\ct{YangNiel00,ClarkGlan03}. Given the

1162: expected relative frequencies of codons that we have derived for each of

1163: the three regimes (neutral, negative, and positive selection; Table

1164: \ref{GLRS}), it may yet be possible to design maximum-likelihood methods

1165: that estimate the number of sites of a gene in each regime. This

1166: approach will be complicated, however, by other sources of codon bias;

1167: see below.

1168:

1169: Aside from the different situations in which they are applicable, and

1170: differences in the rigor of their derivation, estimates of selection

1171: based on codon volatility differ in a fundamental way from most

1172: estimates based on sequence comparison.  Homologous sequence comparison

1173: between species is often used to assess, either by maximum

1174: likelihood\ct{GoldYang94} or maximum parsimony\ct{Li93}, the rates of

1175: synonymous and non-synonymous substitutions in a coding sequence. The

1176: ratio of these rates, dN/dS, is used as a measure of the selective

1177: constraints that have been acting on a protein since the divergence of

1178: the species being compared.  An alternative approach, based on a Poisson

1179: Random Field (PRF) model of mutation frequencies, uses the site

1180: frequency spectrum at a locus sampled from individuals within a species

1181: to deduce the average selective pressure for or against amino acid

1182: changes in a gene\ct{SawyHart92}. (Poisson Random Field models can also

1183: be used to construct likelihood ratio tests of departure from

1184: neutrality\ct{BustWak01}.) Like most comparative methods, however, both

1185: of these models typically assume that all the sites within a gene

1186: experience the same selective pressure against amino acid substitutions

1187: (but see the site-by-site likelihood tests of Yang \textit{et

1188: al.}\ct{YangNiel00}).  Under the PRF theory, for example, authors have

1189: estimated a very small ``average'' selection pressure against amino acid

1190: changes in \textit{E.  coli} genes: $\sigma \sim

1191: 10^{-8}$\ct{HartSawy94}.  This value does not represent the arithmetic

1192: average of the true $\sigma$ values across sites, but rather the

1193: best-fit constant value of $\sigma$ that would make the PRF model

1194: consistent with observed sequence variation at polymorphic sites.

1195:

1196: When evolutionary rates are estimated at \textit{individual}

1197: residues\ct{Yang00,YangNiel00}, however, we find great variation across

1198: sites. Moreover, direct experimental measurements of the fitness

1199: consequences of amino acid substitutions in microorganisms reveal huge

1200: variation in selection pressures across the residues of an individual

1201: protein: a substantial proportion of substitutions are lethal, and a

1202: substantial proportion have undetectable

1203: effect\ct{WertDrub92,WlocSzaf01,ZeylDeVi01,SanjMoya04}. Therefore, it is

1204: not entirely clear how best to interpret the value of $\sigma \sim

1205: 10^{-8}$ estimated for \textit{E. coli} genes using the PRF model, which

1206: assumes constant pressure at each residue.

1207:

1208: Compared to dN/dS or $\sigma$ estimated by the PRF model, codon

1209: volatility quantifies selection pressures in a very different, coarser

1210: manner.  As discussed above, volatility essentially measures the number

1211: of sites in a gene that experience negative ($\sigma \gg u$) versus

1212: neutral ($\sigma \ll u$) versus positive selection. Given that, in

1213: reality, many amino acid changes to a protein sequence are lethal while

1214: other changes have no effect whatsoever, it is reasonable and meaningful

1215: to estimate the number of sites in the selected versus neutral regimes.

1216: But volatility is not sensitive to variation in selective pressures

1217: within either of these regimes. Hence, the volatility measure is in some

1218: ways a coarser description of selective pressure than PRF or dN/dS.  One

1219: should not necessarily expect that volatility will correlate very

1220: strongly with dN/dS or PRF estimates, because the latter measures

1221: represent some sort of average $\sigma$ over the entire gene, and are

1222: thus presumably sensitive to the full range of variation in $\sigma$.  A

1223: measure based on codon volatility is therefore different from and

1224: complementary to dN/dS or PRF estimates of the selective

1225: constraints on a gene.

1226:

1227: As an aside, it is important to note that the most common model used to

1228: estimate dN/dS from divergent nucleotide sequences\ct{GoldYang94} does

1229: not itself reflect the relationship between selection and volatility.

1230: dN/dS is often estimated by fitting maximum likelihood parameters to a

1231: simplified Markov-chain model of sequence evolution that ignores

1232: population variability\ct{GoldYang94}.  Models that ignore population

1233: variability are perfectly reasonable approximations when comparing the

1234: sequences of relatively divergent lineages\ct{GoldYang94}; but such

1235: models fail to detect the effect of amino-acid selection on synonymous

1236: codon usage.  Such models consider only a single sequence that is

1237: assumed to represent the dominant genotype in the population at any

1238: time.  Mutation and selection are modeled simultaneously by adjusting

1239: the transition rates between codon states in the

1240: sequence\ct{GoldYang94}.  As a result, in equilibrium, the number of

1241: transitions into a state per unit time must equal the number of

1242: transitions out of that state; and so equilibrium synonymous codon usage

1243: does not depend upon the strength of selection in these simplified

1244: models\ct{GoldYang94}. (In fact, under the standard assumption of

1245: time-reversibility, such models require as parameters the specification

1246: of the equilibrium codon usage\ct{GoldYang94}, and therefore they

1247: clearly cannot be used to predict equilibrium codon usage.) Simulations

1248: of sequence evolution based on these simplified models (such as the

1249: non-frequency-dependent simulations of Zhang\ct{Zhan04}) will thus fail

1250: to detect the relationship between dN/dS and volatility, whereas more

1251: detailed simulations that account for population variability (such as

1252: the frequency-dependent simulations of Zhang\ct{Zhan04}, as well as the

1253: non-frequency-dependent simulations in this work) will properly reflect

1254: the relationship between selection and volatility, as predicted by

1255: Fisher-Wright models of a replicating population.

1256:
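
The independence of equilibrium codon usage from selection in such
models can be seen directly in a commonly used simplified
parameterization.  This is a sketch: the symbols $q_{ij}$, $s_{ij}$,
$\pi_j$, $\kappa$ and $\omega$ are introduced here for illustration, and
published formulations differ in how they weight amino acid changes.
Let the substitution rate from codon $i$ to codon $j$ be $q_{ij} =
s_{ij} \pi_j$, where $\pi_j$ is the specified equilibrium frequency of
codon $j$.  The factor $s_{ij}$ is zero for codons that differ at more
than one nucleotide; otherwise it is the product of a factor $\kappa$ if
the change is a transition and a factor $\omega$ if the change is
nonsynonymous, so that $s_{ij} = s_{ji}$.  Then
\begin{displaymath}
\pi_i q_{ij} = \pi_i \pi_j s_{ij} = \pi_j \pi_i s_{ji} = \pi_j q_{ji} ,
\end{displaymath}
so detailed balance holds for every value of $\omega$, and the
stationary codon frequencies are the $\pi_j$ regardless of the strength
of selection on amino acid changes.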

1257:

1258: \subsection{Other sources of codon bias}

1259:

1260: Although it came as a surprise to early neutral

1261: theorists\ct{KingJuke69}, it is now clear that there are several

1262: processes that result in unequal usage of synonymous codons.  Many

1263: processes that cause codon bias in microorganisms, such as biased

1264: nucleotide content or mutation rates, can apply roughly equally to all

1265: the genes in a genome.  To the extent that other sources of codon bias

1266: apply equally across a genome, it is straightforward to control for

1267: these biases when comparing the volatilities of genes within a genome to

1268: estimate selection pressures on proteins\ct{PlotDush04}.

1269:

1270: To the extent that other sources of codon bias differ from gene to gene

1271: within a genome, they may (if not properly controlled for) introduce

1272: errors into estimates of the relative selection pressures on proteins

1273: inferred from codon volatility\ct{PlotDush04}. Similarly, selection on

1274: synonymous codons -- particularly selection that varies from gene to

1275: gene -- will likewise introduce errors into estimates of selection on

1276: protein sequences obtained by comparative techniques such as

1277: dN/dS\ct{SharLi87,HirsFras04}.

1278:

1279: As we have argued, some of the variation in synonymous codon usage

1280: across a genome is caused by the variation in selection pressures on

1281: protein sequences.  Throughout our analysis, we have specifically

1282: ignored any other source of codon biases so as to derive the effects of

1283: selection at the amino acid level on codon usage.  But in many organisms

1284: other processes that vary between genes are certainly operating as well.

1285: For instance, it is known that the transition/transversion mutation bias

1286: can vary across a genome.  Results on \textit{S. cerevisiae}, whose

1287: genome exhibits marked variation in the tr/tv bias\ct{PlotFras04},

1288: suggest that this source of variable codon bias will not distort

1289: estimates of selection based on volatility: whether or not one accounts

1290: for the variation in the tr/tv bias across the genome of \textit{S.

1291: cerevisiae} one obtains virtually the same rankings of gene volatilities

1292: ($r>0.99$)\ct{PlotFras04}.

1293:

1294: Aside from mutational biases, there are other sources of codon bias that

1295: vary from gene to gene in some organisms. In the yeast \textit{S.

1296: cerevisiae}, researchers have observed that synonymous codon usage,

1297: measured by the Codon Adaptation Index (CAI)\ct{SharLi87}, is correlated

1298: with a gene's expression level in laboratory conditions\ct{CoghWolf00}.

1299: This correlation is thought to be caused by selection for translational

1300: efficiency and/or accuracy: a codon corresponding to a more abundant tRNA is

1301: expected to be translated more quickly (due to the higher probability

1302: per unit time that the appropriate tRNA will ``find'' the codon) and more

1303: accurately (since the correct tRNA will likely have the greatest chance

1304: of pairing if it is the most abundant).

1305:

1306: Considering this alternative source of biased codon usage, two questions

1307: should be asked: do other sources of codon bias distort estimates of

1308: selection based on volatility, and how can we control for these

1309: confounding factors? Unfortunately we do not have a truly satisfactory

1310: answer for either of these questions, but the discussion below may shed

1311: some light on the issues involved.

1312:

1313: With regard to the first question, we note that the degree to which

1314: other sources of codon bias may distort volatility-based estimates of

1315: selection will strongly depend on the organism being studied.  Some

1316: species (such as humans) exhibit a much weaker correspondence between

1317: codon frequencies and tRNA abundances than other species; so clearly

1318: other sources of codon bias will affect volatility values differently in

1319: different species. In a species with a strong correspondence between

1320: codon usage and tRNA abundances, the extent to which variation in this

1321: source of codon bias across the genome affects volatility will depend on

1322: whether volatile codons are (un)preferred: if there is no correlation

1323: between volatility and tRNA abundances, then the other sources of codon

1324: bias will only introduce random error into volatility estimates, making

1325: them less powerful but still reliable. If instead the preferred codons

1326: tend to have either high or low volatility, then this effect could

1327: introduce systematic errors into volatility estimates. In the latter

1328: case, in order to quantify how much codon usage bias is caused by

1329: volatility as opposed to other factors, one would require a method to

1330: predict for individual genes the amount of codon bias due to these other

1331: factors. Unfortunately we are far from having the necessary level of

1332: predictive power for other sources of codon bias in any organism.

1333: Although gene expression level is somewhat predictive of codon bias,

1334: expression levels do not explain most of the variation in codon bias in

1335: any genome studied thus far\ct{Akas01,CoghWolf00}.  Until the various

1336: sources of biased codon usage can be reliably disentangled, we cannot

1337: reliably quantify the effects of these biases on volatility-based

1338: estimates of selection.

1339:

1340: The second question, how to control for other sources of biased codon

1341: usage, is also difficult to answer at present.  As discussed above, an

1342: appropriate method to control for other sources of bias would require

1343: disentangling the various sources of codon bias in a predictive manner

1344: for each gene. While this degree of precision is not currently possible,

1345: one approach is to assume that the codon bias measured by CAI is

1346: entirely independent of volatility, and then control for CAI using

1347: partial correlations. For several reasons, we expect this approach to be

1348: conservative, as we illustrate using the yeast \textit{S. cerevisiae}

1349: (we use this species as an example because it shows a strong preference

1350: for codons that match abundant tRNAs, and because we have reliable dN/dS

1351: values for almost two-thirds of its genes, calculated from multiple

1352: alignments of closely related species\ct{HirsFras04}). First, we note

1353: that dN/dS is itself strongly correlated with both CAI and gene

1354: expression levels\ct{PalPapp01}, and it is therefore impossible to

1355: construct any measure of selective constraint that agrees with dN/dS and

1356: is not itself strongly correlated with CAI and expression levels in

1357: yeast. Second, it is possible that the codon bias measured by CAI is in

1358: part \textit{caused} by volatility (\textit{i.e.} highly expressed genes

1359: tend to experience stronger purifying selection and therefore exhibit

1360: unusual codon usage biased towards lower volatility), and so controlling

1361: for CAI would be inappropriate. Despite several biological hypotheses,

1362: there is no accepted mechanistic explanation for the correlation between

1363: CAI and dN/dS in yeast\ct{PalPapp01, Akas01}, and so it is unclear

1364: whether controlling for CAI is appropriate.

1365: Nevertheless, we have tested the correlation between volatility and

1366: dN/dS while controlling for CAI using a partial correlation. We find

1367: that even when controlling for CAI (or expression levels), there remains

1368: a highly significant correlation between volatility and dN/dS in yeast

1369: ($p<10^{-34}$\ct{PlotFras04}).  Therefore, even under this conservative

1370: test, estimates of selection obtained by volatility are still consistent

1371: with estimates obtained by homologous sequence comparison. We interpret

1372: this result as evidence that volatility is measuring selective

1373: constraints above and beyond any signal inherent in CAI.

1374:
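
The partial correlation used in this test has a standard closed form,
sketched here with labels introduced only for illustration.  Writing
$r_{VD}$, $r_{VC}$ and $r_{DC}$ for the pairwise correlations among
volatility ($V$), dN/dS ($D$) and CAI ($C$), the correlation between
volatility and dN/dS controlling for CAI is
\begin{displaymath}
r_{VD \cdot C} = \frac{r_{VD} - r_{VC} \, r_{DC}}
{\sqrt{\left( 1 - r_{VC}^2 \right) \left( 1 - r_{DC}^2 \right)}} .
\end{displaymath}
A value of $r_{VD \cdot C}$ that remains significantly different from
zero indicates an association between volatility and dN/dS that is not
accounted for by a linear dependence of each quantity on CAI.  The same
expression applies to rank-based correlations when the pairwise values
are computed on ranks.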

1375: Indeed, there is a great deal of empirical evidence

1376: indicating that the volatility of a gene is correlated with the

1377: selective constraint it experiences.  Aside from highly significant

1378: correlations between volatility and dN/dS in bacterial species and

1379: yeast\ct{PlotDush04}, volatility also reflects a range of other features

1380: known to correlate with selection on proteins. In \textit{S.

1381: cerevisiae}, for example, volatility is strongly correlated with the

1382: essentiality of genes, the number of their protein-protein interactions,

1383: and the degree to which they are preserved throughout the eukaryotic

1384: kingdom\ct{PlotFras04}.  Furthermore, volatility is significantly

1385: elevated among the known antigens and surface proteins (which experience

1386: positive selection) in the pathogens \textit{Mycobacterium

1387: tuberculosis}, \textit{Plasmodium falciparum}, and Influenza A

1388: virus\ct{PlotDush03,PlotDush04}. And volatility is significantly

1389: depressed in the genes essential for growth of \textit{M.

1390: tuberculosis}, as well as in the genes conserved between related

1391: \textit{Mycobacterium} species\ct{PlotDush04}. Therefore, despite

1392: potential confounding sources of codon bias that cannot at present be

1393: controlled for with appropriate accuracy, in practice volatility-based

1394: methods produce estimates of selection pressures that are consistent

1395: with our understanding of protein evolution over a diverse range of

1396: taxa.

1397:

1398: Finally, we note that there may be direct selection on synonymous codons

1399: in order to evade mistranslation\ct{Kono85}. Since mistranslation is far

1400: more likely to occur between a codon and an anticodon that differ by a

1401: single nucleotide, the definition of volatility (Eq. \ref{voldef})  is

1402: appropriate for measuring the selective pressure for or against

1403: mistranslation. The strength of this type of selection on synonymous

1404: codons would depend upon the mis-incorporation rate of tRNA (which is

1405: far higher than the mutation rate) and the detriment of mistranslation

1406: (which is likely far lower than that of most mis-sense mutations). It is

1407: difficult at present to measure the molecular parameters of tRNA

1408: mis-incorporation and its fitness effects; so it is unclear how much of

1409: a volatility signal arises from mistranslation avoidance versus standard

1410: selection on mis-sense mutations. However strong this signal, though,

1411: the volatility of a gene would still reflect the degree to which there

1412: is selection to conserve, or not to conserve, the (translated) protein

1413: sequence.

1414:

1415:

1416: \section*{Acknowledgments}

1417:

1418: We thank Daniel Fisher, Andrew Murray, and Michael Turelli for their

1419: input during the preparation of this manuscript. J.B.P. acknowledges

1420: support from the Harvard Society of Fellows. M.M.D. acknowledges

1421: support from a Merck Award for Genome-Related Research.

1422:

1423: \bibliography{./bib}

1424:

1425:

1426: \end{document}

1427:

1428:

1429: