
1: \documentclass[amsmath,amssymb,aps]{revtex4}

2: \usepackage{graphicx}

3: \usepackage{bm}

4: \usepackage[usenames]{color}

5: \usepackage{multirow}

6: \usepackage{amsmath}

7:

8: \pdfoutput=1

9:

10: % \newcommand{\julius}[1]{\textbf{\textcolor{blue}{#1}}}

11: \newcommand{\julius}[1]{#1}

12: \newcommand{\josh}[1]{\textbf{\textcolor{red}{#1}}}

13: \newcommand{\ecoli}{\emph{E. coli}}

14: \newcommand{\paeru}{\emph{P. aeruginosa}}

15: \newcommand{\llact}{\emph{L. lactis}}

16:

17: \begin{document}

18:

19: \title{Genome landscapes and \\

20: bacteriophage codon usage}

21:

22: \author{Julius B. Lucks$^1$} \author{David R. Nelson$^{1,2}$}

23: \author{Grzegorz Kudla$^1$} \author{Joshua B. Plotkin$^{3,*}$}

24: \affiliation{ $^1$FAS Center for Systems Biology, Harvard University,

25: \\ $^2$ Lyman Laboratory of Physics, Harvard

26: University\\ $^3$ Department of Biology, University

27: of Pennsylvania\\ $^*$E-mail:

28: jplotkin@sas.upenn.edu }

29:

30: \date{\today}

31: \begin{abstract}

32:

33:     Across all kingdoms of biological life, protein-coding genes exhibit

34:     unequal usage of synonymous codons. Although alternative theories abound,

35:     translational selection has been accepted as an important mechanism that

36:     shapes the patterns of codon usage in prokaryotes and simple eukaryotes.

37:     Here we analyze patterns of codon usage across 74 diverse bacteriophages

38:     that infect \emph{E. coli}, \emph{P. aeruginosa} and \emph{L. lactis} as

39:     their primary host. We introduce the concept of a `genome landscape,' which

40:     helps reveal non-trivial, long-range patterns in codon usage across a

41:     genome. We develop a series of randomization tests that allow us to

42:     interrogate the significance of one aspect of codon usage, such as GC

43:     content, while controlling for another aspect, such as adaptation to

44:     host-preferred codons. We find that 33 phage genomes exhibit highly

45:     non-random patterns in their GC3-content, use of host-preferred codons, or

46:     both. We show that the head and tail proteins of these phages

47:     exhibit significant bias towards host-preferred codons, relative

48:     to the non-structural phage proteins. Our results support the hypothesis of

49:     translational selection on viral genes for host-preferred codons, over a

50:     broad range of bacteriophages.

51:

52:

53: \end{abstract}

54:

55:

56:

57: \maketitle

58:

59: \section{Introduction}\label{sec:introduction}

60:

61: The genomes of most organisms exhibit significant codon bias -- that is, the

62: unequal usage of synonymous codons. There are longstanding and contradictory

63: theories to account for such biases. Variation in codon usage between taxa,

64: particularly within mammals, is sometimes attributed to neutral processes --

65: such as mutational biases during DNA replication, repair, and gene conversion

66: \cite{Bern95,Francino1999,Galtier2003,Eyre91}.

67:

68: There are also theories for codon bias driven by selection. Some researchers

69: have discussed codon bias as the result of selection for regulatory function

70: mediated by ribosome pausing \cite{LawrHart91}, or selection against

71: pre-termination codons \cite{Fitc80,ModiBatt81}. However, the dominant selective

72: theory of codon bias in organisms ranging from \textit{E. coli} to

73: \textit{Drosophila} posits that preferred codons correlate with the relative

74: abundances of isoaccepting tRNAs, thereby increasing translational efficiency

75: \cite{ZuckPaul65,Ikem81a,Ikem85,PoweMori97,DebrMarz94,SoreKurl89} and accuracy

76: \cite{Akas94}. This theory helps to explain why codon bias is often more extreme

77: in highly expressed genes \cite{Ikem81b}, or at highly conserved sites within a

78: gene \cite{Akas94}. Translational selection may also explain variation in codon

79: usage between genes selectively expressed in different tissues

80: \cite{Plotkin2004,Dittmar2006}. However, recent work suggests that synonymous

81: variation, particularly with respect to GC content, affects transcriptional

82: processes as well \cite{Kudla2006}.

83:

84: The codon usage of viruses has also received considerable attention

85: \cite{Jenkins2003,PlotDush03}, particularly in the case of bacteriophages

86: \cite{Sharp1984,Kunisawa1998,Sahu2004, Sahu2005,Sau2005,SauGosh2005}. Most work

87: along these lines has focused on individual phages, or on the patterns of

88: genomic codon usage across a handful of phages of the same host.

89:

90: Here, we provide a systematic analysis of intragenomic variation in

91: bacteriophage codon usage, using 74 fully sequenced viruses that infect a

92: diverse range of bacterial hosts. Motivated by energy landscapes associated with

93: DNA unzipping \cite{LubenskyNelson2002,Weeks2005}, we develop a novel

94: methodological tool, called a genome landscape, for studying the long-range

95: properties of codon usage across a phage genome. We introduce a series of

96: randomization tests that isolate different features of codon usage from each

97: other, and from the amino acid sequence of encoded proteins. More than twenty of

98: the phages in our analysis are shown to exhibit non-random variation in

99: synonymous GC content, non-random variation in codons adapted for

100: host translation, or both. Additionally, we demonstrate that phage genes

101: encoding structural proteins are significantly more adapted to host-preferred

102: codons compared to non-structural genes. We discuss our results in the context

103: of translational selection and lateral gene transfer amongst phages.

104:

105:

106: % section introduction (end)

107: \section{Results}\label{sec:results}

108:

109: % (fold)

110: \subsection{Genome Landscapes}\label{sub:genome_landscapes}

111:

112: % (fold)

113: We start by introducing the concept of a genome landscape, which provides a

114: simple means for visualizing long-range correlations of sequence properties

115: across a genome. A genome landscape is simply a cumulative sum of a specified

116: quantitative property of codons. The calculation of the cumulative sum is

117: straightforward, and it consists of scanning over the genome sequence one codon

118: at a time, gathering the property of each codon, and summing it with the

119: properties of previous codons in the genome sequence.

120: Similar cumulative

121: sums are used in solid-state physics for, e.g., the

122: calculation of energy levels

123: \cite{Ashcroft1976}.

124: In the case of the GC3 landscape, we have

125: \begin{equation}

126:     \label{eq:FGC3}

127:     F_{\mathrm{GC3}}(m) = \sum_{i=1}^m

128: (\eta_{\mathrm{GC3}}(i) - \overline{\eta_{\mathrm{GC3}}})

129: \end{equation}

130: where $\eta_{\mathrm{GC3}}(m)$ equals one or zero, depending upon whether

131: the $m^{th}$ codon ends in a G/C or A/T, respectively. Note that we subtract the

132: genome-wide average GC3 content, $\overline{\eta_{\mathrm{GC3}}}$, so that

133: $F_{\mathrm{GC3}}(0) = F_{\mathrm{GC3}}(N) = 0$, where $N$ is the length of the

134: genome in codons. In other words, we convert the genome codon sequence into a binary

135: string of 1's and 0's according to whether each codon is of type GC3 or AT3, and

136: we cumulatively sum this sequence to compute $F_{\mathrm{GC3}}(m)$.
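
The cumulative sum in Eq.~\ref{eq:FGC3} can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the analysis; the function name \texttt{gc3\_landscape} and the representation of the genome as a list of three-letter codon strings are assumptions made here.

```python
def gc3_landscape(codons):
    """Return [F(0), F(1), ..., F(N)] for a list of codon strings.

    eta is 1 for a codon ending in G or C and 0 otherwise; the
    genome-wide mean is subtracted so that F(0) = F(N) = 0.
    """
    eta = [1.0 if codon[2].upper() in "GC" else 0.0 for codon in codons]
    mean_eta = sum(eta) / len(eta)
    landscape = [0.0]  # F(0) = 0 by construction
    for value in eta:
        landscape.append(landscape[-1] + value - mean_eta)
    return landscape
```

An uphill step in the returned list corresponds to a GC3 codon and a downhill step to an AT3 codon, measured relative to the genome-wide mean.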

137:

138: The interpretation of a GC3 landscape is straightforward. Regions of the genome

139: whose landscape exhibits an uphill slope contain higher than average GC3

140: content, whereas regions of downhill slope contain lower than average GC3

141: content. The genome landscape provides an efficient visualization of long-range

142: correlations in sequence properties across a genome, similar to the techniques

143: introduced by Karlin \cite{Karlin1993}.

144:

145: Traditional visualizations of GC3 content involve moving window averages of

146: \%GC3 over the genome \cite{Gregory2006}. In order to compare these techniques

147: with the landscape approach, we focus on the \emph{E. coli} phage lambda as an

148: illustrative example. Figure \ref{fig:land_hist} (a) shows the lambda phage GC3

149: landscape above its associated ``GC3 histogram''. The histogram shows the GC3

150: content of each gene, and the width of each histogram bar reflects the length of

151: the corresponding gene. The figure reveals a striking pattern of lambda phage

152: codon usage: the genome is apparently divided into two halves that contain

153: significantly different GC3 contents \cite{Inman1966,Sanger1982}. The large

154: region of uphill slope on the left half of the GC3 landscape reflects the fact

155: that the majority of the genes in this region contain an excess of codons that

156: end in G or C. This trend is also reflected in the GC3 histogram bars, which are

157: higher than average in the left half of the genome (Figure \ref{fig:land_hist}).

158:

159: Genome landscapes also provide a natural means of evaluating whether or not

160: features of codon usage are due to random chance. Under a null model in

161: which

162: the $\eta(i)$'s above are chosen as independent random variables

163: with $\mathrm{var}(\eta(i)) = \langle \eta(i)^2 \rangle

164: - \langle \eta(i) \rangle^2 = \Delta$, one can show (see

165: Methods) that the standard

166: deviation of $F_{\mathrm{GC3}}(m)$ is

167: \begin{equation}

168:     \label{eq:sigma}

169:     \sigma_{\mathrm{GC3}}(m) = \sqrt{\langle

170:     F_{\mathrm{GC3}}(m)^2 \rangle - \langle F_{\mathrm{GC3}}(m) \rangle^2} = \sqrt{\frac{\Delta_{\mathrm{GC3}} m (N-m)}{N}}.

171: \end{equation}

172: This quantity is shown as a purple band in Figure

173: \ref{fig:land_hist}. For $\eta(i)$'s chosen to be 0 or 1 at random,

174: $\Delta_{\mathrm{GC3}} = 1/4$ and the maximum width $\sqrt{N}/4$ is

175: obtained at $m= N/2$. Since the scale of variation across the lambda phage GC3

176: landscape is much greater than its expectation under the null, we can

177: conclude that the distribution of G/C versus A/T ending codons is

178: highly non-random in the lambda phage genome.
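
As a quick numerical check of Eq.~\ref{eq:sigma}, the width of the null band can be evaluated directly. The sketch below is illustrative only; the function name \texttt{null\_band} is hypothetical, and the per-codon variance $\Delta$ is passed in as a parameter rather than estimated from a genome.

```python
import math

def null_band(n_codons, delta):
    """Standard deviation of the landscape under the independent null model.

    Returns sigma(m) = sqrt(delta * m * (N - m) / N) for m = 0..N.
    """
    return [
        math.sqrt(delta * m * (n_codons - m) / n_codons)
        for m in range(n_codons + 1)
    ]

# With delta = 1/4 the band vanishes at both ends and peaks at m = N/2,
# where its width is sqrt(N)/4.
band = null_band(100, 0.25)
```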

179:

180: We can also gain intuition about the degree of non-randomness in the GC3

181: landscape by considering what would happen if the lambda phage genome were to

182: accumulate random synonymous mutations.  Figure

183: \ref{fig:land_decay}(a) shows snapshots of the lambda GC3 landscape as

184: we simulate synonymous mutations to the genome. Between each snapshot,

185: $N$ synonymous mutations were introduced by

186: picking a codon at random along the genome, and then choosing a new

187: synonymous codon at random according to the global lambda phage codon

188: distribution. As more mutations are introduced, the GC3 landscape of the

189: synonymously mutated lambda genome approaches the purple band,

190: indicating that the GC3 pattern in the real lambda phage genome is

191: highly non-random.
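
The mutation procedure described above can be sketched as follows. This is one possible reading under stated assumptions: the standard genetic code is used, the global codon distribution is taken from the unmutated genome and held fixed, and the name \texttt{mutate\_synonymously} is introduced here for illustration only.

```python
import random
from collections import Counter

# Standard genetic code, codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def mutate_synonymously(codons, n_mutations, rng=random):
    """Apply n_mutations random synonymous substitutions to a codon list."""
    counts = Counter(codons)  # global codon usage of the starting genome
    synonyms = {}
    for codon in counts:
        synonyms.setdefault(CODON_TO_AA[codon], []).append(codon)
    mutated = list(codons)
    for _ in range(n_mutations):
        pos = rng.randrange(len(mutated))          # pick a codon at random
        options = synonyms[CODON_TO_AA[mutated[pos]]]
        weights = [counts[c] for c in options]     # global usage as weights
        mutated[pos] = rng.choices(options, weights=weights, k=1)[0]
    return mutated
```

Because every replacement is drawn from codons for the same amino acid, the encoded protein sequence is unchanged while positional structure in codon choice is progressively erased.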

192:

193: The procedure of producing a genome landscape can be applied to other

194: properties of codon usage. In addition to GC3, we will study patterns in

195: the Codon Adaptation Index (CAI).  CAI measures the similarity of a

196: gene's codon usage to the `preferred' codons of an organism

197: \cite{Sharp1987} -- in this case, the host bacterium of the phage under

198: study.  Every bacterium has a preferred set of codons defined as the

199: codons, one for each amino acid, that occur most frequently in genes

200: that are translated at high abundance. These genes are often taken to be

201: the ribosomal proteins and translational elongation factors

202: \cite{Sharp1987} (see Methods).

203:

204: In order to calculate CAI, the preferred codons are each assigned a weight $w =

205: 1$. The remaining codons are assigned weights according to their frequency in

206: the highly-translated genes, relative to the frequency of the $w=1$ codon. The

207: CAI of a gene is defined as the geometric mean of the $w$-values for its codons

208: \begin{equation}

209: \label{eq:CAI_def} \mathrm{CAI} = \left(\prod_{i=1}^{M} w_i\right)^{1/M},

210: \end{equation}

211: where $w_i$ is the $w$-value of the $i^{th}$ codon, and

212: $M$ is the length of the gene. This quantity can be re-written as

213: \begin{equation} \mathrm{CAI} = \exp\left(\frac{1}{M} \sum_{i=1}^{M}

214: \ln(w_i)\right).

215: \end{equation}

216: The latter formulation is more useful for calculating genome landscapes,

217: because the argument of the exponential function is now a sum of the logs of the

218: $w$-values. Therefore, we define the CAI landscape as \begin{equation}

219: F_{\mathrm{CAI}}(m) = \sum_{i=1}^m (\eta_{\mathrm{CAI}}(i) -

220: \overline{\eta_{\mathrm{CAI}}}), \end{equation} where $\eta_{\mathrm{CAI}}(i) =

221: \ln(w_i)$.
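
The two CAI quantities defined above can be sketched in Python as follows. The function names are hypothetical, and the input is assumed to be the list of $w$-values for consecutive codons, each strictly positive so that the logarithm is defined.

```python
import math

def cai(weights):
    """CAI of a gene: geometric mean of its codon w-values (Eq. CAI_def)."""
    return math.exp(sum(math.log(w) for w in weights) / len(weights))

def cai_landscape(weights):
    """Cumulative sum of mean-subtracted ln(w) along the genome."""
    eta = [math.log(w) for w in weights]
    mean_eta = sum(eta) / len(eta)
    landscape = [0.0]  # F(0) = 0
    for value in eta:
        landscape.append(landscape[-1] + value - mean_eta)
    return landscape
```

Working with $\ln(w)$ makes the landscape additive in the same way as the GC3 landscape, so the same null band applies with the variance of $\ln(w)$ in place of the GC3 variance.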

222:

223: The CAI landscape for lambda phage is shown in Figure

224: \ref{fig:land_hist}(b), along with the CAI histogram of lambda phage.

225: For the CAI histograms, the height of each bar represents the CAI value

226: of that gene (Eq. \ref{eq:CAI_def}).  As in the case with the GC3

227: landscape, we find that the lambda phage CAI landscape corresponds

228: closely to the CAI histogram, but it offers a more striking global view

229: of the long-range CAI structure in the lambda phage genome. One

230: contiguous half of the lambda phage genome exhibits elevated CAI,

231: whereas the other half exhibits depressed CAI.  The observed CAI

232: landscape lies far outside the purple band in Figure

233: \ref{fig:land_hist}, calculated according to Eq. \ref{eq:sigma},

234: indicating that the pattern of CAI across the lambda phage genome is

235: non-random. However,

236: the purple band is wider for the CAI landscape than for the GC3

237: landscape, because the variance in the $\ln{(w_i)}$'s,

238: $\Delta_{\mathrm{CAI}}$, is greater than $\Delta_{\mathrm{GC3}}$.

239:

240: The GC3 and CAI landscapes for lambda phage are highly correlated with each

241: other (Figure \ref{fig:land_hist}). In particular they both have large uphill

242: regions on the left-hand side of the genome, indicating a region containing

243: codons with elevated GC3-content and CAI values, compared to the genome average.

244: It is possible that the observed correlation between the GC3 and CAI landscapes

245: could be caused by the conflation between high CAI and GC3 in the preferred

246: \emph{E. coli} codons, as we discuss below.

247:

248: We note that the genes in the region of elevated CAI primarily encode the highly

249: translated structural proteins that form the capsid and tail of the lambda phage

250: virions. This pattern suggests the hypothesis that, because of the need to

251: produce structural proteins in high copy number during the viral life cycle, structural

252: genes preferentially use codons that match the host's preferred set of codons.

253: We will explore this translational-selection hypothesis in greater detail below.

254:

255: % subsection genome_landscapes (end)

256: \subsection{The Effect of Amino Acid Content on Genome

257: Landscapes}\label{sub:the_effect_of_amino_acid_content_on_genome_landscapes}

258:

259: % (fold)

260: The previous section illustrated that the codon usage across the lambda phage

261: genome is highly non-random with respect to both GC3 and CAI. In this section we

262: quantify this statement, and we focus on aspects of lambda's codon usage

263: patterns that are \emph{independent} of the amino acid sequences of the

264: encoded proteins.

265:

266: Since we are interested in studying the patterns of \emph{synonymous} codon

267: usage, it is important that we control for the amino acid sequence of encoded

268: proteins. Phages utilize a diverse spectrum of proteins, ranging from

269: those that form the protective capsid for nascent progeny, to those

270: encoding for the tail and tail fibers, to those that regulate the switch

271: between lytic or lysogenic infection pathways. As with other organisms, phage

272: proteins have been selected at the amino acid level for function and folding.

273: Some portion of a phage's codon usage is surely influenced by selection

274: for amino acid content.

275:

276: We can construct a simple randomization test to interrogate the potential

277: influence of the amino acid sequence on the GC3 and CAI landscapes of lambda

278: phage. In this test, we generate random genomes that have the exact same amino

279: acid sequence as lambda phage, but shuffled codons, such that the genome-wide,

280: or global, codon distribution is preserved in each random genome (see

281: Methods). As

282: summarized in Table \ref{tab:tests}, we refer to this test as the `aqua'

283: randomization test. For each of the randomized genomes, we calculate GC3 and CAI

284: landscapes. Similar to a recent randomization method \cite{Zeldovich2007}, we then

285: compare the observed landscape of the actual genome to the

286: distribution of landscapes generated from the randomized genomes.
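
One simple way to draw such a randomized genome is to permute codons among the positions that encode the same amino acid, which preserves both the protein sequence and the genome-wide codon counts exactly. The sketch below assumes this reading of the procedure and the standard genetic code; the name \texttt{aqua\_shuffle} is hypothetical, and the Methods remain the authoritative description.

```python
import random
from collections import defaultdict

# Standard genetic code, codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def aqua_shuffle(codons, rng=random):
    """Return a random genome with the same protein sequence and codon counts."""
    positions = defaultdict(list)
    for index, codon in enumerate(codons):
        positions[CODON_TO_AA[codon]].append(index)
    shuffled = list(codons)
    for indices in positions.values():
        pool = [codons[i] for i in indices]
        rng.shuffle(pool)  # permute codons within one amino acid class
        for i, codon in zip(indices, pool):
            shuffled[i] = codon
    return shuffled
```

Repeating the draw many times and computing a landscape for each draw gives the distribution against which the observed landscape is compared.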

287:

288: Figure \ref{fig:aqua} shows the results of this comparison, with the observed

289: landscapes plotted as black lines, and the mean

290: $\pm$ one and two standard deviations of random trials shown in dark and light

291: aqua, respectively. As the figures show, the observed landscapes lie in the far

292: extremes of the randomized distributions -- indicating that the amino acid

293: sequence of the lambda phage genome does not determine the extraordinary

294: features of the observed landscapes.

295:

296: It is also instructive to query the influence of amino acid content on codon

297: usage in each gene individually. The histogram view of these randomization tests

298: allows us to ask this question precisely. Because the amino acid sequence is

299: preserved exactly across the genome, each histogram bar in Figure \ref{fig:aqua}

300: can be considered as its own randomization test, one for each gene. The position

301: of the horizontal black bar reflects the actual codon usage of

302: each gene, and it can be compared to the distribution of random trials in order

303: to compute a quantile for each gene:

304: \begin{equation} q^{>} = \frac{\mathrm{number\ of\

305: trials\ less\ than\ observed}}{\mathrm{number\ of\ trials}}, \qquad

306: q^{<} = \frac{\mathrm{number\ of\ trials\ greater\ than\

307: observed}}{\mathrm{number\ of\ trials}}.

308: \end{equation}

309: Note that we have defined two quantiles, $q^{>}$ and $q^{<}$, that describe the

310: proportion of random trials strictly less or strictly greater than the observed data.

311: These two quantities sum to a value no greater than one (and equal to one if there

312: are no ties). A large value of $q^{>}$ signifies that the observed statistic

313: (e.g. GC3 or CAI) is \emph{greater} than most of the random trials.

314:

315: Associated with each of these quantiles is a p-value quantifying whether the

316: observed gene sequence has significantly different codon usage than the random

317: trials: $p^{<} = 1 - q^{<}$ and $p^{>} = 1 - q^{>}$. If either one of these

318: $p$-values is low, it signifies that the GC3 (or CAI) content of the gene is

319: significantly different than the genomic average, controlling for the amino acid

320: sequence of the gene. $p^<$ tests for significantly depressed GC3 (or CAI) in a

321: gene; and $p^>$ tests for significantly elevated GC3 (or CAI) in a gene. We will

322: use these $p$-values, which arise from the `aqua' randomization test, in two

323: ways.
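
For a single gene, the quantiles and $p$-values defined above reduce to simple counting over the random trials. The following is a minimal sketch; the function name \texttt{gene\_p\_values} and the dictionary keys are illustrative choices, not part of the original analysis.

```python
def gene_p_values(observed, trials):
    """Quantiles and p-values of one gene's statistic against random trials.

    q> is the fraction of trials strictly below the observed value,
    q< the fraction strictly above; p> = 1 - q> and p< = 1 - q<.
    """
    n = len(trials)
    q_greater = sum(1 for t in trials if t < observed) / n
    q_less = sum(1 for t in trials if t > observed) / n
    return {
        "q>": q_greater,
        "q<": q_less,
        "p>": 1.0 - q_greater,  # small when the statistic is elevated
        "p<": 1.0 - q_less,     # small when the statistic is depressed
    }
```

Note that with a finite number of trials a $p$-value of exactly zero can occur, so the resolution of the test is limited by the number of random genomes drawn.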

324:

325: Since we are interested in studying the effects of synonymous codon usage alone,

326: we first wish to filter out any genes whose codon usage does not significantly deviate

327: from random, given the amino acid sequence. Therefore, in the subsequent

328: gene-by-gene analyses reported in this paper, we retain only those genes whose

329: quantiles fall in the extreme 5\% of random trials. That is, we only keep those

330: genes for which $p^{<}_{\mathrm{aqua}} < 0.025$ or $p^{>}_{\mathrm{aqua}} <

331: 0.025$. These genes are said to `pass' the aqua test, and they are

332: unshaded in Figure \ref{fig:aqua}.

333:

334: We also use the gene-by-gene $p$-values to quantify the degree to which

335: codon usage is independent of amino acid sequence across the genome as a

336: whole. To do so, we combine all the gene-by-gene $p$-values into an

337: aggregate $p$-value for the entire genome, $p_{\mathrm{aqua}}$, using

338: the method of Fisher \cite{Fisher1948}. We calculate the combined

339: $p$-value by summing the logs of twice the minimum of each gene-specific

340: $p$-value \begin{equation} f_{\mathrm{aqua}} = -2 \sum_{i=1}^{k} \ln{[2

341: \min(p^{<}_{\mathrm{aqua},i}, p^{>}_{\mathrm{aqua},i})]}, \end{equation}

342: where $p^{<}_{\mathrm{aqua},i}$ represents the aqua $p^<$-value for gene

343: $i$, and $k$ is the number of genes in the genome. It is well known that

344: $f_{\mathrm{aqua}}$ is chi-squared distributed with $2k$ degrees of

345: freedom \cite{Fisher1948}.  Thus, the combined $p$-value for the

346: entire genome is $p_{\text{combined}}^{\mathrm{aqua}} = 1-

347: P_{\chi^2,2k}(f_{\mathrm{aqua}})$, where $P_{\chi^2,2k}(f)$ is the

348: cumulative chi-squared distribution with $2k$ degrees of freedom. In

349: the case of lambda phage, we find $p_{\text{combined}}^{\mathrm{aqua}} =

350: 7.42 \times 10^{-98}$ for GC3 and $p_{\text{combined}}^{\mathrm{aqua}}

351: = 1.50 \times 10^{-41}$ for CAI. Thus, we conclude that neither

352: the GC3 nor the CAI patterns across the lambda phage genome are

353: determined by the genome's amino acid sequence.
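
The Fisher combination can be sketched without external libraries, because the upper tail of a chi-squared distribution with an even number of degrees of freedom, $2k$, has the closed form $e^{-x/2}\sum_{j=0}^{k-1}(x/2)^j/j!$. The function names below are hypothetical. Two choices are assumptions made here rather than taken from the text: the two-sided value $2\min(p^<,p^>)$ is capped at one, and gene-level $p$-values are assumed to be strictly positive.

```python
import math

def chi2_upper_tail_even(x, k):
    """P(X > x) for X chi-squared with 2k degrees of freedom (closed form)."""
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return math.exp(-half) * total

def fisher_combined(p_less, p_greater):
    """Combine per-gene two-sided p-values; returns (f, combined p-value)."""
    two_sided = [min(1.0, 2.0 * min(a, b)) for a, b in zip(p_less, p_greater)]
    f = -2.0 * sum(math.log(p) for p in two_sided)  # requires every p > 0
    return f, chi2_upper_tail_even(f, len(two_sided))
```

For very large genomes or extremely small combined values this direct evaluation may lose precision, so a library routine working in log space would be preferable in practice.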

354:

355: In the following sections we will use the aqua test (see Table

356: \ref{tab:tests}) and its associated gene-by-gene and combined p-values

357: as a control to verify that features of codon usage are not driven by

358: the amino acid sequence.

359:

360: % subsection the_effect_of_amino_acid_content_on_genome_landscapes (end)

361: \subsection{Disentangling CAI from GC3}\label{sub:disentangling_cai_from_gc3}

362:

363: % (fold)

364: Depending upon the preferred codons of the host species, the effect of

365: selection for high CAI in a viral gene is not necessarily independent

366: from the effect of selection for other features of viral codon usage,

367: such as high GC3.

368: For example, codons with high CAI values associated with a given host

369: may be biased towards high GC3 values as well (see Figure

370: \ref{fig:E_coli_master}, and Section

371: \ref{sub:disentangling_cai_from_gc3} below). It is important, therefore,

372: to disentangle the effects of selection for CAI versus selection for

373: GC3, in order to determine which one of these forces is responsible for

374: the non-random patterns of codon usage observed in the lambda genome.

375:

376: The weights used to compute CAI for \emph{E. coli} are shown in Figure

377: \ref{fig:E_coli_master}. The 61 codons are placed into one of four groups

378: according to whether they are GC3 or not (red or blue, respectively), and

379: whether they have high CAI or not (dark or light, respectively). High CAI is

380: determined by an arbitrary cutoff of $w \geq 0.9$. As this table demonstrates,

381: the set of preferred codons in \emph{E. coli} is slightly biased towards GC-ending

382: codons (58\%).

383:

384: The GC bias of preferred codons, although slight, could conflate the

385: results of selection for CAI versus GC3 in phages that infect \emph{E.

386: coli}, such as lambda.  We therefore introduce another randomization

387: test that allows us to disentangle patterns of CAI content from patterns

388: of GC3 content. Similar to the aqua randomization test described above,

389: we draw random phage genomes such that the amino acid sequence is

390: conserved, but we add the additional constraint of conserving the exact

391: GC3 sequence as well (see Methods). For example, at a site containing a

392: GC3 codon for leucine, in our random trials we only allow those leucine

393: codons terminating in G or C. By comparing the observed landscapes of

394: the genome with the distribution of randomly drawn landscapes, we can

395: isolate the features of codon usage driven by CAI, independent of GC3

396: and amino acid content. We refer to this randomization procedure as the

397: `orange' randomization test (Table \ref{tab:tests}).

398:

399: Conversely, we also wish to assess the strength of patterns in GC3 content,

400: independent of CAI and amino acid content. The appropriate randomization

401: procedure in this case requires that we constrain the amino acid sequence and

402: the sequence of codon CAI values while allowing GC3 to vary. However, because

403: CAI values are not binary, CAI cannot be constrained exactly while still

404: allowing for enough variability to produce a meaningful randomization test.

405: Thus, we introduce a binary version of the CAI measure, called BCAI, that is

406: qualitatively the same as and, for our purposes, interchangeable with CAI.

407:

408: The BCAI $w$-value for a codon is defined to be 0.7 if the codon is high CAI,

409: and 0.3 if the codon has low CAI. High CAI is defined by the threshold of $w

410: \geq 0.9$ (see Figure \ref{fig:E_coli_master}). The actual values assigned for

411: BCAI are arbitrary and have no effect on our results. In addition, the threshold

412: value $w \geq 0.9$ is also arbitrary, and our results are robust to changing

413: this threshold. BCAI provides a useful surrogate for CAI because its values are

414: binary, thereby allowing us to constrain a gene's amino acid sequence and BCAI

415: sequence \emph{exactly}, while varying GC3 content in random trials. The BCAI

416: landscapes and histograms are calculated in the same way as CAI landscapes and

417: histograms, except using BCAI $w$-values. As expected, the BCAI landscape of a

418: genome is qualitatively similar to its CAI landscape (compare Figures

419: \ref{fig:green_orange}b and \ref{fig:aqua}b), and the two landscapes are highly

420: correlated (e.g. $r = 0.72$ for lambda phage). Thus BCAI is interchangeable

421: with CAI for the purposes of our randomization tests.
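
The binarization itself is a one-line mapping. The sketch below assumes the CAI weights are available as a dictionary from codon to $w$-value; the function name \texttt{bcai\_weights} is hypothetical, and the three weights in the usage example are made-up placeholders rather than measured \emph{E. coli} values.

```python
def bcai_weights(cai_weights, threshold=0.9, high=0.7, low=0.3):
    """Map each codon's CAI w-value to a binary BCAI w-value.

    Codons with w >= threshold receive `high`; all others receive `low`.
    """
    return {
        codon: (high if w >= threshold else low)
        for codon, w in cai_weights.items()
    }

# Placeholder weights for illustration only (not real E. coli values).
example = bcai_weights({"CTG": 1.0, "CTC": 0.9, "CTA": 0.04})
```

The resulting dictionary can be passed to the same landscape and histogram calculations as the ordinary CAI weights.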

422:

423: Figure \ref{fig:green_orange} shows the results of the two randomization tests

424: outlined above: the `green' test that compares the observed GC3 landscape to a

425: distribution of random trials constraining the amino acid sequence and the BCAI

426: sequence; and the `orange' test that compares the observed BCAI landscape to a

427: distribution of random trials constraining the amino acid sequence and the GC3

428: sequence. Our convention for naming these two tests is summarized in Table

429: \ref{tab:tests}.

430:

431: As seen in Figure \ref{fig:green_orange}a, the observed GC3 landscape lies

432: significantly outside of the random trials that preserve amino acid sequence and

433: BCAI sequence. Combining the gene-by-gene p-values for this test, we find

434: $p_{\text{combined}}^{\text{green}} = 5.1 \times 10^{-68}$ -- indicating that

435: the lambda phage genome as a whole has non-random GC3 variation independent of

436: amino acid and CAI (actually, BCAI) sequence. Conversely, Figure

437: \ref{fig:green_orange}b shows that the BCAI landscape contains non-random

438: features when controlling for both GC3 and amino acid sequence

439: ($p_{\text{combined}}^{\text{orange}} = 6.3 \times 10^{-9}$). In other words,

440: the lambda phage genome exhibits highly non-random patterns of both GC3 and CAI

441: codon variation, independent of one another and independent of the amino acid

442: sequence.

443:

444: % subsection disentangling_cai_from_gc3 (end)

445: \subsection{Non-random patterns of CAI and GC3 In

446: Bacteriophages}\label{sub:selection_for_cai_and_gc3_in_bacteriophages}

447:

448: % (fold)

449:

450: In the sections above we have demonstrated and quantified highly

451: non-random patterns of GC3 and CAI codon usage variation across the

452: lambda phage genome. We have also demonstrated that these trends are

453: independent of one another.  In this section, we will extend our

454: analysis to a large range of diverse phages.

455:

456: In this section we consider all sequenced phages that infect \emph{E. coli},

457: \emph{Pseudomonas aeruginosa} or \emph{Lactococcus lactis} as their primary host.

458: The latter two hosts were chosen because they contain unusually extreme GC3

459: content: 88 \%GC3 for \emph{P. aeruginosa} and 25 \%GC3 for \emph{L. lactis},

460: genome-wide. The extreme GC3 content of these hosts gives rise to opposing

461: relationships between high CAI and GC3 -- as indicated schematically in

462: Figure \ref{fig:master_cartoons}. In particular, \emph{P. aeruginosa} strongly

463: favors GC3 in high-CAI codons (94\%), and \emph{L. lactis} strongly favors AT3 in

464: high-CAI codons (72\%). Thus, these three hosts span a large spectrum of

465: relationships between CAI and GC3. Since our randomization tests constrain amino

466: acid and BCAI exactly (the `green' test), and amino acids and GC3 exactly (the

467: `orange' test), we can control for any possible conflation between GC3 and CAI

468: trends. Thus, the randomization tests are equally applicable to all of the phage

469: genomes, regardless of their host.

470:

471: We performed the aqua, green, and orange randomization tests on the 45

472: phages of \emph{E. coli}, 12 phages of \emph{P. aeruginosa}, and 17

473: phages of \emph{L. lactis} whose genomes have been sequenced

474: (see Methods). In the first step of our

475: analysis, we removed any phages which failed either the aqua GC3 or aqua

476: CAI tests, because the codon usage of such genomes is influenced by

477: their amino acid sequence. A phage was said to pass these two control

478: tests if its Fisher combined p-values for both aqua GC3 and aqua CAI

479: were significant. The significance criterion for each test is

480: $p_{\text{combined}} < 5\%/74$, which incorporates a Bonferroni

481: correction for multiple tests.  With this cutoff, 50 of the initial 74

482: phages passed the aqua control tests.

483:

484: Figure \ref{fig:green_orange_examples} shows results of these tests for

485: several example genomes. P2, a temperate phage, and T3, a non-temperate

486: phage, both infect \emph{E. coli}, and both pass the control tests and

487: exhibit significant `orange' and `green' results, as does D3112, a

488: temperate phage that infects \emph{P.  aeruginosa}. However, not all

489: phages that pass the control test exhibit significant `orange' and

490: `green' results -- as evidenced by bIL286, a temperate phage infecting

491: \emph{L. lactis}.

492:

493: Figure \ref{fig:green_orange_pass_genomes} plots the distribution of

494: combined Fisher p-values of the orange and green tests, for the 50

495: phages that pass the control tests. The majority of these

496: p-values are highly significant. Using a Bonferroni-corrected threshold of

497: 5\%/50, a total of 22 genomes show significance in the orange test, 29

498: in the green test, and 17 in both orange and green.  These results

499: indicate that non-random patterns in codon usage are not unique to

500: lambda phage.  Indeed, over a range of bacterial hosts and a range of

501: phage viruses, there is apparent pressure for non-random patterns of

502: both GC3 content and CAI content, independent of one another and

503: independent of the amino acid sequence.

504:

505:

506: % subsection selection_for_cai_and_gc3_in_bacteriophages (end)

507: \subsection{Translational selection on phage structural

508: proteins}\label{sub:translational_selection_on_phage_structural_proteins}

509:

510: % (fold)

511: In this section, we investigate a natural hypothesis concerning the patterns of

512: non-random CAI usage we have observed in phage genomes -- namely, that these

513: patterns may be driven by selection for translational accuracy and efficiency,

514: which is stronger in more highly expressed proteins \cite{Ikem81a,Sharp1984}.

515:

516: Among all phage proteins, the structural proteins are the most highly expressed

517: \cite{Hendrix2004}. The structural proteins form the protective capsid that

518: encloses the viral genome, as well as the tail, which is often used for

519: transmission of the phage genome to the inside of the host \cite{Roessner1983}.

520: These proteins must be produced in high copy number -- many tens of copies of

521: each type of structural protein are needed to form each of hundreds of viral progeny

522: \cite{Hendrix2004}. For each gene in a phage genome, we assigned a structural

523: annotation of 1 if the gene was known to encode a structural protein and 0

524: otherwise (see Methods).

525:

526: According to the standard hypothesis of translational selection, the

527: structural genes of phages should exhibit elevated CAI levels compared

528: to other phage genes, since they are translated (by the host) in high

529: copy numbers. To test this hypothesis, we performed regressions between

530: the structural annotation of phage genes and their aqua CAI and orange

531: BCAI p-values.  In other words, we compared the structural properties of

532: genes against their CAI content, controlling for amino acid sequence,

533: and against their BCAI content, controlling for both amino acid sequence and

534: GC3 sequence.

535:

536: In the case of lambda phage, Figure \ref{fig:structural} shows the results of

537: the aqua CAI and orange BCAI randomization tests, with the structural genes

538: highlighted. The plot reveals a striking pattern: the vast majority of the

539: structural proteins lie on the left half of the genome, exactly in the region

540: where genes have elevated CAI values. In order to quantify this association we

541: performed ANOVAs. Before regressing structural

542: annotations against codon usage, we first removed the non-informative genes --

543: i.e. genes whose codon usage is influenced by their amino acid content, as

544: indicated by a failure to pass the aqua CAI test.

545:

546: Table \ref{tab:lambda_all_struct_non_aqua_orange} shows the results of the

547: regression between aqua CAI and orange BCAI $p^{>}$-values versus structural

548: annotations in lambda phage. The results are highly significant: structural

549: annotations explain roughly half of the variation in CAI, even when controlling for

550: genes' amino acid sequences (aqua, $r^2$=56\%) as well as GC3 sequences (orange

551: test, $r^2$=46\%). The median $p^{>}$-value among structural genes is close to

552: zero, whereas the median $p^{>}$-value among non-structural genes is close to

553: one -- indicating that structural genes exhibit significantly \emph{elevated}

554: CAI values. These highly significant results are consistent with the hypothesis

555: of translational selection on structural proteins.
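
For a binary factor such as the structural annotation, the fraction of variance explained ($r^2$) in a one-way ANOVA is the between-group sum of squares divided by the total sum of squares. The Python sketch below illustrates that calculation on made-up $p^{>}$-values. It is not the authors' code, and the function name and data are hypothetical.

```python
def anova_r_squared(values, labels):
    """Fraction of variance in `values` explained by group `labels`."""
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    # Between-group sum of squares: group size times squared offset of the group mean.
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in groups.values()
    )
    return ss_between / ss_total

# Made-up p-values: structural genes (label 1) near 0, others (label 0) near 1.
p_values = [0.0, 0.1, 0.9, 1.0]
structural = [1, 1, 0, 0]
r2 = anova_r_squared(p_values, structural)
```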

556:

557: In order to examine the relationship between structural annotation and CAI

558: across all 74 phages in our study, we performed the same ANOVA on the 1,309

559: informative genes (i.e. genes that pass the aqua CAI randomization test). Once

560: again, Table \ref{tab:lambda_all_struct_non_aqua_orange} shows a highly

561: significant relationship between structural annotation and CAI values,

562: controlling for amino acid content and GC3. Thus, the tendency toward elevated

563: CAI values in structural genes holds across all the phages in this study,

564: despite the fact that they infect a diverse range of hosts with a wide

565: variety of GC contents.

566:

567: % subsection translational_selection_on_phage_structural_proteins (end)

568: % section results (end)

569: \section{Discussion}\label{sec:discussion}

570:

571: In this paper, we have introduced genome landscapes as a tool for visualizing

572: and analyzing long-range patterns of codon usage across a genome. In combination

573: with a series of randomization tests, we have applied this tool to study

574: synonymous codon usage in 74 fully sequenced phages that infect a diverse range

575: of bacterial hosts. Genome landscapes provide a convenient means to identify

576: long-range trends that are not apparent through conventional, gene-by-gene or

577: moving-window analyses. Using a statistical test that compares codon usage to

578: random trials, controlling for the amino acid sequence,

579: we found that many of the phages studied exhibit non-random variation

580: in codon usage.  However, not all of the phages exhibit non-random variation, as

581: exemplified by phage bIL286 (Figure \ref{fig:green_orange_examples}(d)).

582:

583: In light of long-standing \cite{Ikem81a} and recent \cite{Kudla2006}

584: literature from other organisms, we have focussed on two aspects of

585: phage codon usage: variation in third-position GC/AT content (GC3) and

586: variation in the degree of adaptation to the `preferred' codons of the

587: host (CAI). Roughly two-thirds of the phages in our study (50 of 74) exhibit

588: non-random intragenomic patterns of codon usage, even when controlling

589: for the amino acid sequence encoded by the genome. Almost half of such

590: genomes also show non-random patterns of CAI when additionally

591: controlling for the GC3 sequence. In other words, there is substantial

592: variation in CAI above and beyond what would be expected by random

593: chance, given the amino acid and GC3 sequences of these genomes.

594:

595: We have also compared the CAI values of phage genes to their annotations

596: as structural or non-structural proteins. We have conclusively

597: demonstrated that phage genes encoding structural proteins exhibit

598: significantly elevated CAI values compared to the non-structural proteins

599: from the same genome. These results hold even when controlling for

600: the amino acid sequence and GC3 sequence of genes. Our

601: conclusions across a diverse range of phages are consistent with

602: early observations on lambda's codon usage \cite{Sanger1982},

603: early results for T7 \cite{Sharp1984}, and with the general hypothesis

604: of translational selection, which predicts elevated CAI in genes

605: expressed at high levels \cite{Ikem81a,Ikem81b,Sharp1987}. The pattern

606: of elevated CAI in structural proteins is particularly striking in the case

607: of lambda phage. It is also worth noting that we find no

608: significant relationship between a phage's life-history (i.e. temperate

609: versus non-temperate) and the degree to which its structural proteins

610: exhibit elevated CAI (see Table \ref{tab:temperate_non}). This

611: observation likely reflects the fact that at some point every phage,

612: regardless of its life history, must generate certain structural proteins in

613: high abundance -- and so it is beneficial to encode such proteins using

614: the host's translationally preferred codons.

615:

616: Our results on translational selection in phages shed light on the

617: nature of selection on viruses. The standard interpretation of elevated

618: CAI in highly expressed bacterial proteins assumes a fitness cost (per

619: molecule) associated with inefficient or inaccurate translation. We have

620: observed a similar relationship between expression level and CAI across

621: a diverse range of bacteriophages, which presumably do not incur a

622: direct energetic cost from inefficient translation by their hosts. Thus,

623: our results suggest that either there is an adaptive benefit (to the

624: virus) of elevated CAI in phage structural proteins, or that costs

625: incurred by the host bacterium also reduce the fitness of the virus.

626:

627: In addition to our results on CAI, we have also observed non-random patterns of

628: GC3 variation across the genomes of many phages. These patterns are highly

629: significant even after controlling for potential conflating factors, such as the

630: amino acid sequences and CAI sequences of genes. Unlike our results on CAI,

631: there is no clear mechanistic hypothesis underlying the non-random patterns of

632: GC3 in phages. It is possible that these patterns reflect selection for

633: efficient transcription \cite{Kudla2006} or for mRNA secondary structure. But in

634: the absence of independent information on such constraints, we cannot assess the

635: merits of these selective hypotheses, nor rule out the possibility of variation

636: in mutational biases across the phage genomes. It is interesting to note

637: that we find these significant non-random patterns of GC3 predominantly in

638: temperate phages (see Table \ref{tab:temperate_non}).

639:

640: Our study benefits from the number and breadth of phages we

641: have analyzed. Unlike previous studies, here we analyze phages whose

642: suspected hosts span a diverse range of bacteria, which themselves

643: differ in their genomic GC3 content and preferred codon choice. We have

644: calibrated CAI for each phage according to its primary host, and

645: nevertheless we find consistent relationships between CAI and viral

646: protein function. These results therefore conclusively extend the

647: classical theory of translational selection to the relationship between

648: viruses and their hosts.

649:

650: The present study also benefits from the development of randomization tests that

651: isolate the patterns of variation in CAI from variation in GC content. Due to

652: intrinsic biases in the GC content of the preferred codons of hosts, previous

653: studies on codon usage in phage have conflated these two types of synonymous

654: variation \cite{Sahu2004, Sahu2005,Sau2005,SauGosh2005}. The mechanisms

655: underlying GC3 variation and CAI variation likely differ, and so it is

656: critically important that we have analyzed each of these features controlling

657: for the other one.

658:

659: There is a large literature on the structure and evolution of phage genomes

660: which is pertinent to our analyses of phage codon usage. The genomes of phages

661: that infect \emph{E. coli}, \emph{L. lactis}, and \emph{Mycobacteria} are known

662: to be highly mosaic in structure

663: \cite{Juhala2000,Brussow2002,Hendrix2002,Lawrence2002,Pedulla2003,Hatfull2006}.

664: In other words, these genomes exhibit many similar local features that suggest

665: each genome was assembled from a common pool of bacteriophage genomic regions

666: \cite{Hendrix1999}. Recently, mosaicism was discussed in the lambdoid

667: phages focusing specifically on the \emph{E. coli} phages lambda, HK97 and N15

668: \cite{Hendrix2004}. We note that both HK97 and N15 have peaked landscape

669: structures like lambda, although not as pronounced, indicating that some degree

670: of mosaicism can be observed in genome landscapes among closely related phages.

671: The postulated mechanism for mosaicism is homologous and non-homologous

672: recombination between co-infecting phages or between a phage and a prophage

673: embedded in the host genome \cite{Hendrix1999,Brussow2002,Lawrence2001}. Some

674: have argued that the latter mechanism occurs more frequently, due to the large

675: number of lysogenized prophages in bacterial genomes \cite{Lawrence2001}.

676:

677: Lateral gene transfers could affect the codon usage patterns of phages,

678: especially if recombination occurs between phages whose preferred hosts

679: differ. In this case, the codon usage patterns of each phage may be expected to

680: reflect the preferred codons of its preferred host; a recent recombination

681: may result in regions of dramatically different codon usage from the average

682: phage codon usage. In particular, regions of unusual GC3 content in a phage

683: genome could reflect gene transfers between phages that typically infect hosts

684: of different GC3 content, in analogy with lateral gene transfer amongst

685: bacteria \cite{Ochman2000}. Morons are genes in phage genomes that are under

686: different transcriptional control than the rest of the phage genes, and are

687: often expressed when the phage is in the lysogenic state \cite{Hendrix2000}.

688: These morons have been observed to have very different nucleotide compositions

689: compared to the rest of the phage genome, suggesting that they are the result of

690: such gene transfers \cite{Hendrix2000}. Thus one interpretation for our

691: observations of the 29 phages exhibiting non-random GC3 patterns is that these

692: genomes arose through recent recombination events, and have not subsequently

693: experienced enough time to equilibrate their GC3 content to that of their

694: current host. Given the lack of reliable estimates for time scales between

695: putative phage recombination events, or for codon usage equilibration, this

696: study neither supports nor refutes this interpretation. However, the

697: predominance of significant non-random patterns of GC3 in the genomes of

698: temperate phages (see Table \ref{tab:temperate_non}) may suggest that such

699: recombination occurs more frequently among temperate phage populations.

700:

701: We have demonstrated that phage genes encoding structural proteins exhibit

702: significantly elevated CAI values compared to the non-structural phage genes. These

703: results support the classical translational selection hypothesis, now extended to

704: the relationship between viral and host codon usage. We do not find much

705: variation in codon usage among the structural genes themselves. This observation

706: has two plausible interpretations within the literature of lateral gene

707: transfers: either phages of different preferred hosts rarely co-infect, or there

708: is substantially less recombination among the structural proteins of phages. The

709: latter hypothesis has been independently suggested for the capsid proteins of

710: phages, based on the idea that capsid proteins form a complex with

711: multiple physical interactions whose function would be disrupted by individual

712: gene transfer events \cite{Hendrix2002}. Unlike capsid genes, phage tail genes

713: often exhibit mosaicism, and they can include elements from diverse viruses

714: with variable host ranges \cite{Haggard-Ljungquist1992,Hendrix2002}. To

715: investigate this phenomenon in the context of codon usage, we refined the

716: structural annotation to separate head from tail genes (see Section

717: \ref{sub:structural_annotation}). We performed three separate ANOVAs to compare

718: the CAI usage in these genes: comparing head versus non-structural, tail versus

719: non-structural, and head versus tail (Table

720: \ref{tab:all_head_tail_aqua_orange}). These regressions indicate that the head

721: genes are primarily responsible for the pattern of elevated CAI in structural

722: proteins. In addition, we detect a difference in codon usage between head and

723: tail genes. These results have at least two possible explanations: either the

724: head proteins are produced in higher copy number than the tail proteins, or

725: lateral gene transfers between diverse phages occur frequently enough in the

726: tail genes to impair their ability to optimize codon usage to their current

727: host. The first hypothesis is very plausible, in light of evidence on the copy

728: number of head and tail proteins \cite{Hendrix2004}; nevertheless, we cannot

729: rule out the second possibility.

730:

731: % section discussion (end)

732: \section{Materials and Methods}\label{sec:materials_and_methods}

733:

734: % (fold)

735: \subsection{Bacteriophage Genomes}\label{sub:bacteriophage_genomes}

736:

737: % (fold)

738: Bacteriophage genomes were downloaded from NCBI's GenBank

739: (\verb=http://www.ncbi.nlm.nih.gov/Genbank/index.html=) release 156 (October,

740: 2006) using Biopython's \cite{biopython} NCBI interface. We only used

741: reference sequence (refseq) phage genome records with accessions

742: of the form NC\_00dddd in order to have the most complete records

743: available. Of the 396 phage RefSeq records available, we focused on the 74 genomes of

744: phages whose primary host, as listed in the \verb=specific_host= tag in the

745: GenBank file, was \emph{E. coli}, \emph{P. aeruginosa} or \emph{L. lactis}. (A

746: complete list of the accession numbers used can be found in the supplementary

747: material.)

748:

749: All phage genomes were downloaded from GenBank. Before being used for the rest

750: of this study, every gene within a genome was scanned for overlaps with other

751: genes in the same genome, and all overlapping sequences were removed. A codon

752: was only retained if all three of its nucleotides occurred in a single open

753: reading frame. Thus the final genome sequence used was a concatenation

754: of all non-overlapping coding sequences, omitting any control elements and other

755: non-coding sequences.
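
The overlap filter described above can be sketched in Python as follows. This is an illustrative reconstruction under simplifying assumptions that are not taken from the source: genes are given as half-open, 0-based intervals on the forward strand, each a multiple of three nucleotides long, and the function name is hypothetical.

```python
def non_overlapping_codons(genome, genes):
    """Concatenate codons whose three nucleotides lie in exactly one gene.

    genome: nucleotide string.
    genes:  list of (start, end) half-open, 0-based intervals on the forward
            strand, each a multiple of three long (a simplifying assumption).
    """
    # Count how many annotated genes cover each nucleotide.
    coverage = [0] * len(genome)
    for start, end in genes:
        for pos in range(start, end):
            coverage[pos] += 1

    kept = []
    for start, end in genes:
        for pos in range(start, end, 3):
            # Retain the codon only if none of its nucleotides is shared.
            if all(coverage[i] == 1 for i in range(pos, pos + 3)):
                kept.append(genome[pos:pos + 3])
    return "".join(kept)

# Two toy genes that overlap at positions 7 and 8.
genome = "ATGAAACCCGGGTTT"
genes = [(0, 9), (7, 13)]
coding = non_overlapping_codons(genome, genes)
```

In this toy case the codons touching the shared positions are dropped from both genes, and the non-coding tail of the sequence is omitted.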

756:

757: % subsection bacteriophage_genomes (end)

758: \subsection{Calculation of CAI Master

759: Tables}\label{sub:calculation_of_cai_master_tables}

760:

761: % (fold)

762: The definition of the Codon Adaptation Index requires the construction of a

763: `master' $w$-table for the host organism. Each of the 61 sense codons is

764: assigned a $w$-value based on the codon's frequency among the most highly

765: expressed genes in the host organism. In defining this set of genes, we follow

766: Sharp \cite{Sharp1987}, who specified highly expressed genes for \emph{E. coli}.

767:

768: In order to calculate the CAI master $w$-tables for \emph{P. aeruginosa} and \emph{L. lactis},

769: we identified the homologs of the highly expressed \emph{E. coli} genes within

770: the other host genomes, using BLAST \cite{Altschul1990}. In particular, we used

771: qblast to find homologs to these \emph{E. coli} genes by inputting the gene

772: protein sequences, and blasting (blastp) against the nr database, restricting

773: the database to include proteins of the target organism. In all cases, we used

774: the most significant blast result as the ortholog, provided its e-value was less

775: than $1\times10^{-10}$.

776:

777: The particular proteins used for each of these three hosts are as

778: follows (NCBI genome accession numbers listed in parentheses beside the

779: host name, GI numbers listed in parentheses beside each protein). \emph{E.

780: coli} (NC\_000913): 30S ribosomal protein S10 (16131200), 30S ribosomal

781: protein S21 (16130961), 30S ribosomal protein S12 (16131221), 30S

782: ribosomal protein S20 (16128017), 30S ribosomal protein S1 (16128878),

783: 30S ribosomal protein S2 (16128162), 30S ribosomal protein S15

784: (16131057), 30S ribosomal protein S7 (16131220), 50S ribosomal protein

785: L28 (16131508), 50S ribosomal protein L33 (16131507), 50S ribosomal

786: protein L34 (16131571), 50S ribosomal protein L11 (16131813), 50S

787: ribosomal protein L10 (16131815), 50S ribosomal protein L1 (1790416),

788: 50S ribosomal protein L7/L12 (1790418), 50S ribosomal protein L17

789: (16131173), 50S ribosomal protein L3 (16131199), murein lipoprotein

790: (16129633), outer membrane protein A (3a;II*;G;d) (16128924), outer

791: membrane porin protein C (16130152), outer membrane porin 1a (Ia;b;F)

792: (16128896), protein chain elongation factor EF-Tu (duplicate of tufB)

793: (16131218), TufB (29140507), elongation factor Ts (16128163), elongation

794: factor EF-2 (16131219), recombinase A (16130606), molecular chaperone

795: DnaK (16128008); \emph{P. aeruginosa} (NC\_002516): elongation factor G

796: (15599462), 30S ribosomal protein S10 (15599460), 30S ribosomal protein

797: S21 (15595776), 30S ribosomal protein S12 (15599464), 30S ribosomal

798: protein S20 (15599759), 30S ribosomal protein S1 (15598358), 30S

799: ribosomal protein S2 (15598852), 30S ribosomal protein S15 (15599935),

800: 30S ribosomal protein S7 (15599463), 50S ribosomal protein L28

801: (15600509), 50S ribosomal protein L33 (15600508), 50S ribosomal protein

802: L34 (15600763), 50S ribosomal protein L11 (15599470), 50S ribosomal

803: protein L10 (15599468), 50S ribosomal protein L1 (15599469), 50S

804: ribosomal protein L7/L12 (15599467), 50S ribosomal protein L17

805: (15599433), 50S ribosomal protein L3 (15599459), probable outer membrane

806: protein precursor (15596238), elongation factor Tu (15599461),

807: elongation factor Ts (15598851), elongation factor G (15599462),

808: recombinase A (15598813), molecular chaperone DnaK (15599955); \emph{L. lactis}

809: (NC\_002662): 30S ribosomal protein S10 (15674082), 30S ribosomal

810: protein S21 (15672222), 30S ribosomal protein S12 (15674244), 30S

811: ribosomal protein S20 (15673721), 30S ribosomal protein S1 (15672820),

812: 30S ribosomal protein S2 (15674135), 30S ribosomal protein S15

813: (15673868), 30S ribosomal protein S7 (15674243), 50S ribosomal protein

814: L34 (15672113), 50S ribosomal protein L11 (15673983), 50S ribosomal

815: protein L10 (15673251), 50S ribosomal protein L1 (15673982), 50S

816: ribosomal protein L7/L12 (15673250), 50S ribosomal protein L17

817: (15674049), 50S ribosomal protein L3 (15674081), elongation factor Tu

818: (15673843), elongation factor Ts (15674134), elongation factor EF-2

819: (15674242), recombinase A (15672336), molecular chaperone DnaK

820: (15672936).

821:

822: Given the set of highly expressed genes, the CAI master $w$-table was

823: calculated as follows. For each host, the GenBank file (GenBank release

824: 156) was downloaded locally and transformed into a local data

825: structure using Biopython's \cite{biopython} GenBank parser. The

826: data structure was then scanned for each of the genes in the

827: highly translated gene set, and the collective CDS codon sequences of

828: these genes were concatenated together into one long sequence. Stop

829: codons and codons encoding methionine (M) and

830: tryptophan (W) (each encoded by only one codon) were removed

831: from the concatenated sequence. The frequencies of codons encoding all

832: other amino acids were then tabulated, and divided into groups according

833: to which amino acid they encode. The $w$-values were then calculated,

834: according to the procedure of Sharp \cite{Sharp1987}, as these

835: frequencies, normalized by the maximum frequency within each group. Thus

836: each amino acid has a codon with a $w$-value of 1, representing

837: the most commonly used codon for that amino acid. The $w$-values for

838: the stop codons and codons for methionine and tryptophan were set to the

839: average $w$-value of the remaining codons.
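
A minimal Python sketch of the $w$-table construction described above is given below, assuming the standard genetic code and a single concatenated reference coding sequence as input. The names \verb=w_table= and \verb=CODON_TABLE= are hypothetical, and the handling of codon families that never appear in the reference set (assigned $w=0$ here) is an assumption of this sketch, not something specified in the text.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' marks a stop).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
EXCLUDED = {"M", "W", "*"}

def w_table(reference_cds):
    """Relative adaptiveness w for every codon, given concatenated reference CDS."""
    codons = [reference_cds[i:i + 3] for i in range(0, len(reference_cds) - 2, 3)]
    counts = Counter(c for c in codons if CODON_TABLE[c] not in EXCLUDED)

    w = {}
    for aa in set(AMINO) - EXCLUDED:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        top = max(counts[c] for c in family)
        for c in family:
            # Families absent from the reference set get w = 0 in this sketch.
            w[c] = counts[c] / top if top else 0.0

    # Stop, Met and Trp codons receive the average w of the remaining codons.
    average = sum(w.values()) / len(w)
    for c, aa in CODON_TABLE.items():
        if aa in EXCLUDED:
            w[c] = average
    return w

# Toy reference sequence (illustrative only): Ala GCC x2, Ala GCT x1, Lys AAA x1.
w = w_table("GCCGCCGCTAAA")
```

With a realistic reference set every amino acid is observed, so each codon family contains a codon with $w=1$.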

840:

841: % subsection calculation_of_cai_master_tables (end)

842: \subsection{Drawing Random Genomes According to

843: Constraints}\label{sub:drawing_random_genomes_according_to_constraints}

844:

845: % (fold)

846: Our randomization tests require drawing randomized phage genomes that are

847: constrained to have specific properties. In all of the randomization tests

848: discussed, the random sequences were drawn as a sequence of synonymous codons at

849: each position, thereby exactly preserving the amino acid sequences of proteins.

850:

851: The three randomization tests used in this work can all be considered variants

852: of a canonical randomization test that preserves both the amino acid sequence

853: and a bit mask sequence exactly, while drawing codons from the global,

854: genome-wide distribution. A bit mask sequence is a string of zeros and ones

855: corresponding to all codons in the genome. For example, GC3 is 1 if the third

856: position of a codon is G or C, and 0 otherwise.

857:

858: Using the GC3 bit mask as an example, the randomization test procedure is

859: initialized by calculating the global codon frequencies that fit into categories

860: specified by the amino acid and the bit-mask value. Each amino acid has

861: associated with it two distributions: one for a bit-mask value of 1 and one for

862: a bit-mask value of 0. For example, alanine (A), is encoded by four codons, GCC

863: (1), GCG (1), GCT (0), GCA (0), where the GC3 bit-mask is shown in parentheses.

864: Thus to calculate the codon distribution of alanine GC3 codons ($A_1$), we

865: compute the frequency of GCC and GCG codons across the whole phage genome.

866: Similarly, the distribution of $A_0$ codons is determined from the frequency of

867: GCT and GCA codons across the genome. In order to produce a random genome,

868: random codons are drawn at each position according to the distribution

869: associated with the position's amino acid and bit-mask value.

870:

871: Thus the three null tests can be specified by the definition of the bit mask

872: along the sequence, which determines the constraints on the

873: randomized trials. The aqua randomization test constrains the amino acid

874: sequence and nothing else, and so its bit mask consists of all 1's. The orange

875: randomization test preserves the amino acid and the GC3, and so its bit mask is

876: the GC3 sequence mentioned above. The green randomization test preserves the

877: amino acid and BCAI exactly, thus its bit mask is the thresholded BCAI (1 if

878: BCAI $\geq$ 0.7, 0 otherwise).
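
The canonical randomization procedure can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the standard genetic code is assumed, and the genome is represented as a list of codons. The bit mask is passed as a function of the codon, so the aqua test uses a constant mask and the orange test uses the GC3 mask; a green-test mask could be supplied in the same way from a thresholded $w$-table.

```python
import random
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' marks a stop).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def gc3_bit(codon):
    """GC3 bit mask: 1 if the third position is G or C, 0 otherwise."""
    return 1 if codon[2] in "GC" else 0

def randomize(codons, bit_of, rng):
    """Draw a random genome preserving the amino acid and bit-mask sequences."""
    # Genome-wide codon counts, split by (amino acid, bit-mask value).
    pools = defaultdict(Counter)
    for c in codons:
        pools[(CODON_TABLE[c], bit_of(c))][c] += 1

    drawn = []
    for c in codons:
        pool = pools[(CODON_TABLE[c], bit_of(c))]
        choices, weights = zip(*pool.items())
        drawn.append(rng.choices(choices, weights=weights, k=1)[0])
    return drawn

# Toy genome of alanine and lysine codons (illustrative only).
genome = ["GCC", "GCT", "GCG", "GCA", "AAA", "GCC", "AAG", "GCT"]
rng = random.Random(0)
orange = randomize(genome, gc3_bit, rng)     # preserves amino acids and GC3
aqua = randomize(genome, lambda c: 1, rng)   # preserves amino acids only
```

Because every draw comes from the pool matching the original codon's amino acid and bit value, both sequences are preserved exactly in each random genome.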

879:

880: % subsection drawing_random_genomes_according_to_constraints (end)

881: \subsection{Structural Annotation}\label{sub:structural_annotation}

882:

883: % (fold)

884: All phage genes were annotated as structural or non-structural by inspecting

885: the annotations of high-scoring BLAST hits among viral proteins. This procedure is

886: described in detail below.

887:

888: Each gene was considered separately within each genome object, although overlaps

889: were removed in the process of creating the genome objects (see section

890: \ref{sub:bacteriophage_genomes}). The amino acid sequence of each gene was

891: blasted against all known viral protein sequences using Biopython's interface

892: \cite{biopython} to the NCBI blast utility \cite{Altschul1990}. Specifically, we

893: used the blastp utility specifying the nr database, with entrez query `Viruses

894: [ORGN]'. We retained only those BLAST hits with e-values below the cutoff

895: $1\times10^{-4}$. All words in the title of these BLAST hits were collected,

896: using white space as a word-delimiter.

897:

898: The unique words from the blast hits were then compared against a set of

899: structural keywords: ``capsid'', ``structural'', ``head'', ``tail'', ``fiber'',

900: ``scaffold'', ``portal'', ``coat'', and ``tape''. The words associated with the

901: BLAST hits were scanned for matches to the keywords, where each keyword was

902: treated as a regular expression. As a result, partial matching was counted as a

903: match. For example, a BLAST title containing the word `head-tail' would match

904: both keywords `head' and `tail'. If a gene had at least one structural keyword

905: match in its BLAST hit titles, it was annotated as structural. Otherwise, it was

906: annotated as non-structural.

907:

908: We further subdivided the structural annotation into two classes: head and tail

909: genes. Tail genes were identified with the keywords ``tail'', ``fiber'', and

910: ``tape''. The remaining structural genes that did not contain any of these

911: keywords were annotated as head genes. Two false positives for tail

912: identification in the lambda phage genome were manually corrected.
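
The keyword-matching step can be illustrated with the short Python sketch below. The function names and example titles are hypothetical, and the matching here is case-sensitive, which is an assumption of this sketch. Each keyword is used as a regular expression against whitespace-delimited words, so partial matches count.

```python
import re

STRUCTURAL = ["capsid", "structural", "head", "tail", "fiber",
              "scaffold", "portal", "coat", "tape"]
TAIL = ["tail", "fiber", "tape"]

def matched_keywords(titles, keywords):
    """Keywords that match, as regular expressions, any word in the hit titles."""
    words = {w for title in titles for w in title.split()}
    return {k for k in keywords if any(re.search(k, w) for w in words)}

def annotate(titles):
    """Return 'tail', 'head', or 'non-structural' for one gene's BLAST hit titles."""
    if not matched_keywords(titles, STRUCTURAL):
        return "non-structural"
    return "tail" if matched_keywords(titles, TAIL) else "head"

# Made-up BLAST hit titles (illustrative only).
a = annotate(["major capsid protein [Enterobacteria phage lambda]"])
b = annotate(["head-tail joining protein"])
c = annotate(["DNA polymerase"])
```

Because partial matches are accepted, a title such as `head-tail' is classified as a tail gene under this rule, which is the kind of case that may call for manual correction.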

913:

914: \subsection{Null Model: Results for Random Walk

915: Landscapes}\label{sub:null_model_results_for_random_walk_landscapes}

916:

917: % (fold)

918:

919: In the sections above we have compared the genome landscapes calculated

920: from real genome sequences to a null model in which the sequences are

921: randomly drawn from a defined distribution. In this section, we compute

922: several properties of genome landscapes calculated from these random

923: genomes.

924:

925: We write the general genome landscape of length $N$ as

926: \begin{equation}

927:     F(m) = \sum_{i=1}^m (\eta(i) - \overline{\eta}),

928: \end{equation}

929: where $\eta(i)$ are independent, and chosen from a random distribution with

930: $\mathrm{var}(\eta(i)) = \langle \eta(i)^2 \rangle - \langle \eta(i) \rangle^2 =

931: \Delta$, and

932: \begin{equation}

933:     \overline{\eta} = \frac{1}{N}\sum_{i=1}^N \eta(i),

934: \end{equation}

935: which ensures $F(0) = F(N) = 0$.

936:

937: The purple regions in Figure \ref{fig:land_hist} represent the standard deviation of the

938: genome landscapes of this null model at each $m$, $\sigma(m) = \sqrt{\langle F(m)^2

939: \rangle - \langle F(m) \rangle^2}$. Using the definitions above, we have

940: \begin{equation}

941:     \begin{aligned}

942:         F(m) &= \sum_{i=1}^m \eta(i)- \frac{m}{N}\sum_{i=1}^N \eta(i) \\

943:              &= \left( \frac{m + (N-m)}{N} \right) \sum_{i=1}^m \eta(i)- \frac{m}{N}\sum_{i=1}^N \eta(i) \\

944:              &= \frac{N-m}{N}\sum_{i=1}^m \eta(i) - \frac{m}{N}\sum_{i=m+1}^N \eta(i),

945:     \end{aligned}

946: \end{equation}

947: and

948: \begin{equation}

949:     \langle F(m) \rangle = \frac{m(N-m)\langle\eta\rangle}{N} - \frac{m(N-m)\langle\eta\rangle}{N} = 0.

950: \end{equation}

951: When we use $\langle \eta(i)\eta(j) \rangle = \langle \eta^2 \rangle \delta_{i,j}

952: + (1- \delta_{i,j}) \langle\eta\rangle^2$, with $\delta_{i,j} = 1$ if $i = j$ and 0 otherwise, we find

953: \begin{equation}

954:     \begin{aligned}

955:     \langle F(m)^2 \rangle &= \frac{m(N-m)}{N} (\langle\eta^2\rangle - \langle\eta\rangle^2) \\

956:     &= \frac{\Delta m(N-m)}{N},

957:     \end{aligned}

958: \end{equation}

959: leading to $\sigma(m) = \sqrt{\langle F(m)^2 \rangle - \langle F(m) \rangle^2} = \sqrt{\Delta m

960: (N-m)/N}$. In the case of GC3 landscapes, $\eta(i)$ is either 1 or 0 with equal

961: probability, giving $\Delta_{\mathrm{GC3}} = 1/4$.
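
The result $\sigma(m)^2 = \Delta m(N-m)/N$ can be checked directly for the GC3 case by enumerating every 0/1 sequence of a small length $N$. The Python sketch below does this for $N=8$, a value chosen only to keep the enumeration small; the function names are hypothetical.

```python
from itertools import product

def landscape(eta, m):
    """F(m) = sum_{i<=m} (eta_i - mean(eta)) for one sequence eta."""
    mean = sum(eta) / len(eta)
    return sum(eta[:m]) - m * mean

def exact_variance(N, m):
    """Variance of F(m) over all 2**N equally likely 0/1 sequences."""
    values = [landscape(eta, m) for eta in product((0, 1), repeat=N)]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

N, delta = 8, 0.25
# For each m, pair the enumerated variance with the predicted delta*m*(N-m)/N.
check = {m: (exact_variance(N, m), delta * m * (N - m) / N) for m in range(N + 1)}
```

Each pair in \verb=check= should agree to floating-point precision, including the endpoints $m=0$ and $m=N$, where the variance vanishes.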

962:

963: We can also calculate the full probability distribution,

964: $P(f;m,N,\Delta)$ that the genome landscape of length $N$ has an intermediate

965: value $F(m) = f$, at point $m$, by considering an $N$-step random walk that is

966: constrained to start and stop at $0$. This probability distribution can be

967: written as a product of two conditional probabilities for a walk that starts at

968: $0$ and ends at $f$ in $m$ steps, and a walk that starts at $f$ and ends at $0$

969: in $N-m$ steps

970: \begin{equation}

971:     \label{eq:P_decomp}

972:     \begin{aligned}

973:     P(f;m,N,\Delta) &= A G(0,f;m,\Delta) G(f,0;N-m,\Delta) \\

974:                     &= A G(0,f;m,\Delta) G(0,f;N-m,\Delta),

975:     \end{aligned}

976: \end{equation}

977: where $A$ is a normalization constant, and the last step used the inversion

978: symmetry of the random walks. Thus we seek the form of the conditional

979: probability $G(0,f;m,\Delta)$. In the same way as in Eq. (\ref{eq:P_decomp}), we

980: decompose this conditional probability into a multiplication of the conditional

981: probabilities for two walks, one that starts at $0$ and ends at $y$ in $x$

982: steps, and one that starts at $y$ and ends at $f$ in $m-x$ steps, and integrate

983: over all possible intermediate values $y$

984: \begin{equation}

985:     G(0,f;m,\Delta) = \int_{-\infty}^{\infty} \mathrm{d}y G(0,y;x,\Delta) G(y,f;m-x,\Delta).

986: \end{equation}

987: We can continue this decomposition for each intermediate step to give

988: \begin{equation}

989:     G(0,f;m,\Delta) = \int_{-\infty}^{\infty} \mathrm{d}y_1 \ldots \int_{-\infty}^{\infty} \mathrm{d}y_{m-1} G(0,y_1;1,\Delta) G(y_1,y_2;1,\Delta) \ldots G(y_{m-1},f;1,\Delta).

990: \end{equation}

991: Keeping the order of integration the same, and noting that $G(y_1,y_2;1,\Delta)

992: = G(y_2 - y_1;1,\Delta)$ for these random walks, we can write $y_{i+1} - y_i =

993: s_{i+1}$ to give

994: \begin{equation}

995:     G(0,f;m,\Delta) = \int_{-\infty}^{\infty} \mathrm{d}s_1 \ldots \int_{-\infty}^{\infty} \mathrm{d}s_m G(s_1;1,\Delta) G(s_2;1,\Delta) \ldots G(s_m;1,\Delta) \delta\left( \sum_{i=1}^m s_i - f\right),

996: \end{equation}

997: where the delta function is added to force the constraint that the sum of all

998: the intermediate steps must be equal to $f$. All of the intermediate conditional

999: probabilities now represent one step walks, and so are equal to the underlying

1000: probability distribution of drawing a step size $s_i$, $p(s_i;\Delta)$

1001: \begin{equation}

1002:     G(0,f;m,\Delta) = \int_{-\infty}^{\infty} \mathrm{d}s_1 \ldots \int_{-\infty}^{\infty} \mathrm{d}s_m \delta\left( \sum_{i=1}^m s_i - f\right) \prod_{i=1}^m p(s_i;\Delta).

1003: \end{equation}

1004: Making use of the integral representation of the delta function \cite{Grosberg1994}

1005: \begin{equation}

1006:     \delta(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \mathrm{d}k e^{-ikx},

1007: \end{equation}

1008: we have

1009: \begin{equation}

1010:     G(0,f;m,\Delta) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \mathrm{d}k e^{-ikf} \tilde{p}(k;\Delta)^m,

1011: \end{equation}

1012: where $\tilde{p}(k;\Delta)$ is the Fourier transform of $p(s;\Delta)$

1013: \begin{equation}

1014:     \tilde{p}(k;\Delta) = \int_{-\infty}^{\infty} \mathrm{d}s e^{iks} p(s;\Delta) .

1015: \end{equation}

1016: For the purpose of this discussion, we assume $p(s;\Delta)$ has a Gaussian form

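1017: $p(s;\Delta) = \frac{1}{\sqrt{2\pi\Delta}}e^{-\frac{s^2}{2\Delta}}$. We note that, by the central limit theorem, the

1018: same large-$m$ results hold for any zero-mean step distribution with variance $\Delta$. For the Gaussian form, $\tilde{p}(k;\Delta) =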

1019: e^{-\frac{k^2\Delta}{2}}$, and we have

1020: \begin{equation}

1021:     G(0,f;m,\Delta) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \mathrm{d}k e^{-m\Delta k^2/2}e^{-ikf} = \frac{1}{\sqrt{2\pi m\Delta}}e^{-f^2/(2m\Delta)}.

1022: \end{equation}
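The Gaussian integral in the last step can be evaluated by completing the square in the exponent. As a sketch of that intermediate step (the shifted integration variable $k'$ is introduced here only for illustration),
\[
    -\frac{m\Delta k^2}{2} - ikf = -\frac{m\Delta}{2}\left(k + \frac{if}{m\Delta}\right)^2 - \frac{f^2}{2m\Delta},
\]
so that, writing $k' = k + if/(m\Delta)$ and using $\int_{-\infty}^{\infty} \mathrm{d}k'\, e^{-m\Delta k'^2/2} = \sqrt{2\pi/(m\Delta)}$,
\[
    \frac{1}{2\pi}\, e^{-f^2/(2m\Delta)} \sqrt{\frac{2\pi}{m\Delta}} = \frac{1}{\sqrt{2\pi m\Delta}}\, e^{-f^2/(2m\Delta)}.
\]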

1023: To determine $A$, we enforce the normalization condition

1024: \begin{equation}

1025:     \int_{-\infty}^{\infty} \mathrm{d}f P(f;m,N,\Delta) = 1,

1026: \end{equation}

1027: which gives

1028: \begin{eqnarray}

1029:     \begin{aligned}

1030:     P(f;m,N,\Delta) &= \frac{1}{\sigma\sqrt{2\pi}}e^{-f^2/2\sigma^2} \\

1031:     \sigma(m) &= \sqrt{\Delta\frac{m(N-m)}{N}}.

1032:     \end{aligned}

1033: \end{eqnarray}
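For completeness, the algebra behind this result can be sketched as follows, under the Gaussian form assumed above. The product of the two conditional probabilities in Eq. (\ref{eq:P_decomp}) has the exponent
\[
    -\frac{f^2}{2m\Delta} - \frac{f^2}{2(N-m)\Delta} = -\frac{f^2}{2\Delta}\,\frac{N}{m(N-m)} = -\frac{f^2}{2\sigma^2},
\]
and the normalization condition then fixes the constant to be $A = \sqrt{2\pi N\Delta}$, i.e. $A = 1/G(0,0;N,\Delta)$, the inverse of the probability density for an unconstrained $N$-step walk to return to $0$.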

1034: Note that from the full distribution, we can immediately identify $\sigma(m) =

1035: \sqrt{\langle F(m)^2 \rangle - \langle F(m) \rangle^2}$, confirming the explicit

1036: calculation above.

1037:

1038: \subsection{Acknowledgments}\label{sub:acknowledgments}

1039: % (fold)

1040:

1041: The authors would like to thank Herv\'{e} Isambert, Graham Hatfull, and Roger

1042: Hendrix for conversations and suggestions on this work. JBL and DRN would like

1043: to thank the Institut Curie, Paris, for hospitality during the initial phases

1044: of this work. Work by DRN was supported by the National Science Foundation

1045: through grants DMR-0231631 and DMR-0213805. JBL acknowledges the financial

1046: support of the Fannie and John Hertz Foundation. JBP acknowledges

1047: support from the Burroughs Wellcome Fund.

1048:

1049: % subsection acknowledgements (end)

1050:

1051: % subsection null_model_results_for_random_walk_landscapes (end)

1052:

1053:

1054: \clearpage

1055: % subsection structural_annotation (end)

1056: % section materials_and_methods (end)

1057: % ---- FIGURES ------- (fold)

1058: \begin{figure}

1059: 	[p]

1060: 	\begin{center}

1061: 		\begin{tabular}

1062: 			{cc}

1063: 			\includegraphics[scale=0.8]{Lambda_GC3.pdf} &

1064: 			\includegraphics[scale=0.8]{Lambda_CAI.pdf} \\

1065: 			\includegraphics[scale=0.8]{Lambda_GC3_histogram.pdf} &

1066: 			\includegraphics[scale=0.8]{Lambda_CAI_histogram.pdf} \\

1067: 		\end{tabular}

1068: 	\end{center}

1069: 	\caption{{\bf GC3 and CAI landscapes for lambda phage.} Landscapes of GC3

1070: (left) and CAI (right) measures of codon usage in lambda phage. Only

1071: coding sequences are considered, which when concatenated together are 40,773 bp

1072: long (see Table \ref{tab:phage_properties}). The GC3 landscape is the

1073: mean-centered cumulative sum of the GC3 content (GC3=1, AT3=0) of codons. The

1074: CAI landscape is the mean-centered cumulative sum of the log $w$-value for each

1075: codon. For each landscape, a region

1076: exhibiting an uphill slope corresponds to higher than average GC3 or CAI. The

1077: horizontal purple band represents the expected

1078: amount of variation in a random walk of GC3 or AT3 choices,

1079: given by Equation \eqref{eq:sigma}. Both landscapes exhibit features far

1080: outside of the purple bands, indicating that the patterns of codon

1081: usage are highly non-random. Gene boundaries are represented by the bars in the

1082: histograms below each landscape. The heights of the bars in the histograms indicate

1083: the GC3 and CAI values for each gene.} \label{fig:land_hist}

1084: \end{figure}

1085: \clearpage

1086: \begin{figure}

1087: 	[p]

1088: 	\begin{center}

1089: 		\begin{tabular}

1090: 			{cc}

1091: 			(a) \includegraphics[scale=0.8]{lambda_decay_GC3.pdf} &

1092: 			(b) \includegraphics[scale=0.8]{lambda_decay_CAI.pdf} \\

1093: 		\end{tabular}

1094: 	\end{center}

1095: 	\caption{ {\bf Snapshots of simulated

1096: synonymous mutation in the lambda phage genome.} Panel (a) shows GC3 and

1097: (b) shows CAI landscapes. In between successive snapshots (labeled

1098: by integers), $N$ synonymous mutations are introduced into the genome and the

1099: resulting landscape is shown, where $N$ is the number of codons in the lambda

1100: phage genome (see Section \ref{sub:genome_landscapes}). These snapshots show that

1101: the simulated genome landscapes approach the random null model, indicated by

1102: the purple band (see Figure \ref{fig:land_hist}). The final CAI landscape (3) lies

1103: almost completely within the purple band. Using the lambda phage mutation rate

1104: of $7.7\mathrm{x}10^{-8}$ mutations/bp/replication \cite{Drake1991}, we can

1105: estimate that approximately $10^7$ genome replications would be

1106: required

1107: to relax within the purple band.} \label{fig:land_decay}

1108: \end{figure}

1109:

1110: \clearpage

1111: \begin{figure}

1112: 	[p]

1113: 	\begin{center}

1114: 		\begin{tabular}

1115: 			{cc}

1116: 			\includegraphics[scale=0.8]{Lambda_aqua_GC3.pdf} &

1117: 			\includegraphics[scale=0.8]{Lambda_aqua_CAI.pdf} \\

1118: 			\includegraphics[scale=0.8]{Lambda_aqua_GC3_histogram_filtered.pdf} &

1119: 			\includegraphics[scale=0.8]{Lambda_aqua_CAI_histogram_filtered.pdf} \\

1120: 		\end{tabular}

1121: 	\end{center}

1122: 	\caption{{\bf Observed and randomized landscapes for lambda phage. } The figure

1123: shows the observed GC3 (left) and CAI (right) landscapes, plotted in black,

1124: along with the mean $\pm 1$, and $\pm 2$ standard deviations of

1125: randomized trials, shown in aqua (bold line, dark and light regions,

1126: respectively). The `aqua' randomization test shown here draws random

1127: synonymous codons that preserve the exact amino acid

1128: sequence, according to probabilities that preserve the global codon

1129: usage

1130: distribution of the lambda genome. For the most part, the observed landscapes

1131: lie significantly outside the distribution of randomized landscapes -- implying

1132: that the amino acid content of genes is not responsible for the observed pattern

1133: of the CAI landscape. In the lower panel, however, genes whose GC3 (left), or

1134: CAI (right) values fall between the 0.025 and 0.975 quantiles of the random

1135: trials are shadowed in grey; the GC3/CAI values of such genes are not

1136: significantly different from random, given their amino acid sequence.}

1137: \label{fig:aqua}

1138: \end{figure}

1139: \clearpage

1140: \begin{figure}

1141: 	[p]

1142: 	\begin{center}

1143: 		\includegraphics[scale=0.6]{ecoli_master.pdf}

1144: 	\end{center}

1145: 	\caption{{\bf \emph{E. coli} codon usage master table.} The table of 61 codons

1146: 	along with their associated w-values is shown for \emph{E. coli}.

1147: 	The $w$-value of each codon reflects its frequency in

1148: 	highly transcribed \emph{E. coli} genes (see main text). The table

1149: 	is divided into four regions: codons with high CAI ($w \geq 0.9$)

1150: 	ending in G or C (dark red); codons with high CAI ending in A or

1151: 	T (dark blue); codons with low CAI ($w < 0.9$) ending in G or C

1152: 	(light red); codons with low CAI ending in A or T (light blue).

1153: 	As the table shows, there is

1154: 	a slight bias for GC3 in the high-CAI codons (58\%), and slight

1155: 	bias away from GC3 in the low-CAI codons (48\%).} \label{fig:E_coli_master}

1156: \end{figure}

1157: \clearpage

1158: \begin{figure}

1159: 	[p]

1160: 	\begin{center}

1161: 		\begin{tabular}

1162: 			{cc}

1163: 			\includegraphics[scale=0.8]{Lambda_blue_GC3.pdf} &

1164: 			\includegraphics[scale=0.8]{Lambda_orange_BCAI.pdf} \\

1165: 			\includegraphics[scale=0.8]{Lambda_blue_GC3_histogram.pdf} &

1166: 			\includegraphics[scale=0.8]{Lambda_orange_BCAI_histogram.pdf} \\

1167: 		\end{tabular}

1168: 	\end{center}

1169: 	\caption{{\bf Observed and randomized landscapes for lambda phage.}

1170: 	Observed landscapes are shown along with randomized landscapes

1171: 	associated with the `green' and `orange' tests.

1172: 	The green randomization procedure tests the

1173: 	significance of the GC3 landscape controlling for the observed

1174: 	CAI (actually, BCAI) variation across the genome. The orange

1175: 	randomization procedure tests the significance of the BCAI landscape,

1176: 	controlling for the observed GC3 variation across the genome.

1177: 	Both tests preserve the amino-acid sequence exactly.

1178: 	Both observed landscapes lie outside the distribution

1179: 	of random trials, indicating there is non-random GC3

1180: 	content controlling for CAI, and non-random CAI

1181: 	content controlling for GC3.} \label{fig:green_orange}

1182: \end{figure}

1183: \clearpage

1184: \begin{figure}

1185: 	[p]

1186: 	\begin{center}

1187: 		\begin{tabular}

1188: 			{ccc}

1189: 			\includegraphics[scale=1]{ecoli_CAI_master_cartoon} &

1190: 			\includegraphics[scale=1]{paeruginosa_CAI_master_cartoon} &

1191: 			\includegraphics[scale=1]{llactis_CAI_master_cartoon} \\

1192: 		\end{tabular}

1193: 	\end{center}

1194: 	\caption{{\bf Schematics of preferred codon usage tables for

1195: 	\emph{E. coli}, \emph{P. aeruginosa}, and \emph{L. lactis} following the

1196: 	conventions of Figure \ref{fig:E_coli_master}.}

1197: 	Unlike \emph{E. coli},

1198: 	\emph{P. aeruginosa} strongly favors GC3 in high-CAI codons

1199: 	(94\%), and \emph{L. lactis} strongly favors AT3 in high-CAI

1200: 	codons (72\%).} \label{fig:master_cartoons}

1201: \end{figure}

1202: \clearpage

1203: \begin{figure}

1204: 	[p]

1205: 	% P2 NC_001895

1206: 	% T3 NC_003298

1207: 	% D3112 NC_005178

1208: 	% bIL286 NC_002667

1209: 	\begin{center}

1210: 		\begin{tabular}

1211: 			{cc}

1212: 		(a) \includegraphics[scale=0.5]{P2_green_GC3.pdf} &

1213: 			\includegraphics[scale=0.5]{P2_orange_BCAI.pdf} \\

1214: 		(b)	\includegraphics[scale=0.5]{T3_green_GC3.pdf} &

1215: 			\includegraphics[scale=0.5]{T3_orange_BCAI.pdf} \\

1216: 		(c)	\includegraphics[scale=0.5]{D3112_green_GC3.pdf} &

1217: 			\includegraphics[scale=0.5]{D3112_orange_BCAI.pdf} \\

1218: 		(d)	\includegraphics[scale=0.5]{bIL286_green_GC3.pdf} &

1219: 			\includegraphics[scale=0.5]{bIL286_orange_BCAI.pdf} \\

1220: 		\end{tabular}

1221: 	\end{center}

1222: 	\caption{{\bf `Green' (left) and `orange' (right) randomization tests

1223: for several phages.} Bacteriophages P2 (a) and T3 (b) both

1224: infect \emph{E. coli}. Phage D3112 (c) infects \emph{P. aeruginosa}.

1225: Phage bIL286 (d)

1226: infects \emph{L.

1227: lactis}. T3 is the only non-temperate phage of this group. See

1228: Table \ref{tab:phage_properties} for combined Fisher p-values for these tests.

1229: In the case of bIL286, note the lack of evidence for codon bias in

1230: the green and orange

1231: tests, as confirmed by the non-significant $p$-values in

1232: Table \ref{tab:phage_properties}. In this case, we cannot rule out the

1233: possibility that the observed pattern in GC3 is determined

1234: completely by the amino acid and CAI sequence (green), or that the observed pattern in

1235: CAI is determined by the amino acid and GC3 sequence (orange).}

1236: \label{fig:green_orange_examples}

1237: \end{figure}

1238: \clearpage

1239: \begin{figure}

1240: 	[p]

1241: 	\begin{center}

1242: 		\includegraphics[scale=1]{gla_orange_blue_fisher_extreme_hist.pdf}

1243: 	\end{center}

1244: 	\caption{{\bf Combined Fisher p-values for the `green' and `orange'

1245: 	randomization tests across 50 phage genomes.} Phage names are

1246: 	listed on the x-axis, and are sorted by their `orange' p-value.

1247: 	A total of 29 genomes exhibit non-random

1248: 	GC3 content controlling for CAI (green test); and a total of

1249: 	22 genomes exhibit non-random

1250: 	CAI content controlling for GC3 (orange test). 17 genomes pass both of

1251: 	these tests. The dashed horizontal line indicates the

1252: 	threshold for significance after Bonferroni correction (i.e. 5\%/50).

1253: 	Upwards arrows indicate p-values that lie beyond the limits of the

1254: 	y-axis. See Table \ref{tab:phage_properties} for phage properties,

1255: 	including the

1256: 	p-values for these tests. Twenty-four phage genomes

1257: 	that failed the aqua GC3 or CAI control tests

1258: 	are not included in this figure.} \label{fig:green_orange_pass_genomes}

1259: \end{figure}

1260: \clearpage

1261: \begin{figure}

1262: 	[p]

1263: 	\begin{center}

1264: 		\begin{tabular}

1265: 			{cc}

1266: 			\includegraphics[scale=0.8]{Lambda_aqua_CAI_histogram_structural.pdf} &

1267: 			\includegraphics[scale=0.8]{Lambda_orange_BCAI_histogram_structural.pdf} \\

1268: 		\end{tabular}

1269: 	\end{center}

1270: 	\caption{{\bf The relationship between codon usage and protein

1271: 	function in lambda phage.} The figure shows the aqua

1272: 	(CAI, as in Figure \ref{fig:aqua}) and orange (BCAI, as in

1273: 	Figure \ref{fig:green_orange}) randomization tests

1274: 	overlaid with information about protein function:

1275: 	genes classified as structural are shown with a white background

1276: 	and all other genes with a grey background. The histograms

1277: 	indicate a clear relationship between the structural

1278: 	classification of a gene and its significance under the aqua

1279: 	and orange tests: structural genes typically have elevated

1280: 	quantiles in the aqua test, whereas other genes typically have

1281: 	depressed quantiles. In other words, structural genes

1282: 	exhibit elevated CAI values when controlling for their

1283: 	amino acid sequence, compared to codon usage in the

1284: 	genome as a whole. Moreover, as the orange histograms

1285: 	indicate, this pattern is not caused by variation in GC3 content:

1286: 	the structural genes exhibit elevated BCAI values after

1287: 	controlling for both their amino acid sequence and their

1288: 	GC3 sequence.} \label{fig:structural}

1289: \end{figure}

1290: \clearpage

1291:

1292: % section figures (end)

1293: % ------- Tables ------- (fold)

1294:

1295: \begin{table}

1296:     \begin{center}

1297:     \begin{tabular}{c|c|c|c}

1298:         Test Name & Genome Properties Constrained & Genome Properties Varied & Figure \\

1299:         \hline

1300:         Aqua & amino acid sequence, global codon distribution & synonymous codons & \ref{fig:aqua} \\

1301:         Green & amino acid and BCAI sequences & GC3 & \ref{fig:green_orange} \\

1302:         Orange & amino acid and GC3 sequences & BCAI & \ref{fig:green_orange} \\

1303:     \end{tabular}

1304:     \end{center}

1305:     \caption{Randomization test descriptions.

1306: 	 The three randomization tests used in the paper

1307: 	 are color-coded according to what genome properties

1308: 	 are constrained in the random trials.}

1309:     \label{tab:tests}

1310: \end{table}

1311: \clearpage

1312:

1313: \begin{table}

1314:     % \begin{center}

1315:     {\tiny

1316:     \begin{tabular}{c|c|c|c|c|c|c|c|c|c}

1317:         Name & Host & Accession & Lifestyle & \# Genes &

1318:            Length & Coding Length & \%GC3 & Orange p-value & Green p-value \\

1319:            \hline

1320:            T5 & \ecoli & NC\_005859 & NT  & 161 & 121,750 & 96,051 & 31.6 & $1.38\mathrm{x}10^{-31}$ & $1.71\mathrm{x}10^{-19}$ \\

1321:            RB69 & \ecoli & NC\_004928 &  NT & 273 & 167,560 & 156,147 & 29.0 & $1.25\mathrm{x}10^{-21}$ & $5.21\mathrm{x}10^{-01}$ \\

1322:            phiEL & \paeru & NC\_007623 & NT  & 201 & 211,215 & 194,850 & 57.8 & $7.38\mathrm{x}10^{-20}$ & $2.17\mathrm{x}10^{-09}$ \\

1323:            RB49 & \ecoli & NC\_005066 &NT   & 273 & 164,018 & 152,592 & 36.9 & $2.01\mathrm{x}10^{-18}$ & $2.48\mathrm{x}10^{-01}$ \\

1324:            F116 & \paeru & NC\_006552 &  T & 70 & 65,195 & 60,240 & 76.3 & $1.31\mathrm{x}10^{-10}$ & $6.31\mathrm{x}10^{-16}$ \\

1325:            CTX & \paeru & NC\_003278 &T   & 47 & 35,580 & 31,971 & 81.2 & $1.44\mathrm{x}10^{-09}$ & $6.82\mathrm{x}10^{-32}$ \\

1326:            phiKMV & \paeru & NC\_005045 & NT  & 49 & 42,519 & 38,310 & 79.9 & $3.25\mathrm{x}10^{-09}$ & $9.54\mathrm{x}10^{-03}$ \\

1327:            T4 & \ecoli & NC\_000866 &  NT & 269 & 168,903 & 153,660 & 24.3 & $4.59\mathrm{x}10^{-09}$ & $8.62\mathrm{x}10^{-01}$ \\

1328:            lambda & \ecoli & NC\_001416 & T  & 69 & 48,502 & 40,773 & 53.5 & $6.25\mathrm{x}10^{-09}$ & $5.10\mathrm{x}10^{-68}$ \\

1329:            D3 & \paeru & NC\_002484 & T  & 94 & 56,425 & 49,095 & 68.3 & $1.57\mathrm{x}10^{-08}$ & $3.85\mathrm{x}10^{-07}$ \\

1330:            P2 & \ecoli & NC\_001895 & T  & 42 & 33,593 & 30,411 & 54.7 & $5.60\mathrm{x}10^{-08}$ & $2.54\mathrm{x}10^{-61}$ \\

1331:            P1 & \ecoli & NC\_005856 & T  & 108 & 94,800 & 80,103 & 48.2 & $9.37\mathrm{x}10^{-08}$ & $3.51\mathrm{x}10^{-11}$ \\

1332:            D3112 & \paeru & NC\_005178 & T  & 55 & 37,611 & 34,908 & 80.4 & $3.05\mathrm{x}10^{-07}$ & $4.35\mathrm{x}10^{-05}$ \\

1333:            WPhi & \ecoli & NC\_005056 &T   & 43 & 32,684 & 29,601 & 56.4 & $8.39\mathrm{x}10^{-07}$ & $7.80\mathrm{x}10^{-55}$ \\

1334:            K1F & \ecoli & NC\_007456 & NT  & 43 & 39,704 & 34,629 & 53.4 & $1.75\mathrm{x}10^{-05}$ & $8.03\mathrm{x}10^{-02}$ \\

1335:            T3 & \ecoli & NC\_003298 &  NT & 47 & 38,208 & 29,694 & 54.3 & $3.50\mathrm{x}10^{-05}$ & $3.07\mathrm{x}10^{-04}$ \\

1336:            PaP3 & \paeru & NC\_004466 &  T & 71 & 45,503 & 41,115 & 58.1 & $5.09\mathrm{x}10^{-05}$ & $1.64\mathrm{x}10^{-19}$ \\

1337:            phiV10 & \ecoli & NC\_007804 & T  & 55 & 39,104 & 36,111 & 48.8 & $1.25\mathrm{x}10^{-04}$ & $9.38\mathrm{x}10^{-11}$ \\

1338:            P27 & \ecoli & NC\_003356 &   T& 58 & 42,575 & 37,707 & 50.5 & $2.24\mathrm{x}10^{-04}$ & $2.23\mathrm{x}10^{-20}$ \\

1339:            933W & \ecoli & NC\_000924 &  T & 78 & 61,670 & 52,956 & 50.0 & $4.29\mathrm{x}10^{-04}$ & $8.88\mathrm{x}10^{-09}$ \\

1340:            B3 & \paeru & NC\_006548 &  T & 56 & 38,439 & 36,138 & 77.3 & $4.40\mathrm{x}10^{-04}$ & $3.33\mathrm{x}10^{-05}$ \\

1341:            HK97 & \ecoli & NC\_002167 & T  & 59 & 39,732 & 34,191 & 52.1 & $7.61\mathrm{x}10^{-04}$ & $1.19\mathrm{x}10^{-20}$ \\

1342:            VT2-Sa & \ecoli & NC\_000902 & T  & 83 & 60,942 & 52,647 & 51.3 & $1.31\mathrm{x}10^{-03}$ & $7.40\mathrm{x}10^{-07}$ \\

1343:            PRD1 & \ecoli & NC\_001421 &  NT & 21 & 14,925 & 11,988 & 47.6 & $2.99\mathrm{x}10^{-03}$ & $5.97\mathrm{x}10^{-02}$ \\

1344:            JK06 & \ecoli & NC\_007291 &  U & 71 & 46,072 & 32,841 & 43.0 & $3.84\mathrm{x}10^{-03}$ & $1.63\mathrm{x}10^{-03}$ \\

1345:            T1 & \ecoli & NC\_005833 & NT  & 77 & 48,836 & 44,010 & 47.7 & $7.45\mathrm{x}10^{-03}$ & $3.64\mathrm{x}10^{-01}$ \\

1346:            Pf1 & \paeru & NC\_001331 &  U & 12 & 7,349 & 6,282 & 75.7 & $9.66\mathrm{x}10^{-03}$ & $6.67\mathrm{x}10^{-01}$ \\

1347:            HK022 & \ecoli & NC\_002166 & T  & 57 & 40,751 & 33,885 & 52.7 & $1.25\mathrm{x}10^{-02}$ & $4.36\mathrm{x}10^{-18}$ \\

1348:            4268 & \llact & NC\_004746 &  NT & 49 & 36,596 & 33,759 & 24.7 & $1.59\mathrm{x}10^{-02}$ & $3.20\mathrm{x}10^{-01}$ \\

1349:            BP-4795 & \ecoli & NC\_004813 & T  & 48 & 57,930 & 22,356 & 48.1 & $1.66\mathrm{x}10^{-02}$ & $3.29\mathrm{x}10^{-10}$ \\

1350:            186 & \ecoli & NC\_001317 &T   & 43 & 30,624 & 27,747 & 58.7 & $4.02\mathrm{x}10^{-02}$ & $1.79\mathrm{x}10^{-22}$ \\

1351:            I2-2 & \ecoli & NC\_001332 &  U & 8 & 6,744 & 5,166 & 35.0 & $6.91\mathrm{x}10^{-02}$ & $1.01\mathrm{x}10^{-01}$ \\

1352:            phiKZ & \paeru & NC\_004629 & NT  & 306 & 280,334 & 243,384 & 26.8 & $1.32\mathrm{x}10^{-01}$ & $1.79\mathrm{x}10^{-14}$ \\

1353:            bIL312 & \llact & NC\_002671 &  T & 27 & 15,179 & 11,292 & 28.1 & $1.49\mathrm{x}10^{-01}$ & $8.85\mathrm{x}10^{-04}$ \\

1354:            HK620 & \ecoli & NC\_002730 &  T & 58 & 38,297 & 33,717 & 45.9 & $1.61\mathrm{x}10^{-01}$ & $1.41\mathrm{x}10^{-05}$ \\

1355:            Mu & \ecoli & NC\_000929 & T  & 54 & 36,717 & 33,900 & 54.1 & $1.68\mathrm{x}10^{-01}$ & $4.49\mathrm{x}10^{-10}$ \\

1356:            P4 & \ecoli & NC\_001609 &  T & 14 & 11,624 & 9,765 & 52.4 & $1.71\mathrm{x}10^{-01}$ & $4.17\mathrm{x}10^{-18}$ \\

1357:            N15 & \ecoli & NC\_001901 &  T & 59 & 46,375 & 41,472 & 54.9 & $2.17\mathrm{x}10^{-01}$ & $1.38\mathrm{x}10^{-09}$ \\

1358:            Stx2 I & \ecoli & NC\_003525 & T  & 97 & 61,765 & 34,932 & 48.4 & $3.04\mathrm{x}10^{-01}$ & $4.23\mathrm{x}10^{-04}$ \\

1359:            bIL286 & \llact & NC\_002667 &  T & 61 & 41,834 & 38,694 & 24.8 & $3.68\mathrm{x}10^{-01}$ & $1.17\mathrm{x}10^{-01}$ \\

1360:            Tuc2009 & \llact & NC\_002703 &  T & 56 & 38,347 & 35,178 & 28.0 & $4.08\mathrm{x}10^{-01}$ & $1.81\mathrm{x}10^{-02}$ \\

1361:            Stx2 II & \ecoli & NC\_004914 &T   & 99 & 62,706 & 34,755 & 50.1 & $5.85\mathrm{x}10^{-01}$ & $9.94\mathrm{x}10^{-03}$ \\

1362:            BK5-T & \llact & NC\_002796 &  T & 52 & 40,003 & 33,267 & 24.0 & $5.91\mathrm{x}10^{-01}$ & $6.68\mathrm{x}10^{-01}$ \\

1363:            Stx1 & \ecoli & NC\_004913 &  T & 93 & 59,866 & 33,444 & 49.5 & $6.75\mathrm{x}10^{-01}$ & $2.97\mathrm{x}10^{-03}$ \\

1364:            LC3 & \llact & NC\_005822 &T   & 51 & 32,172 & 29,607 & 24.6 & $7.31\mathrm{x}10^{-01}$ & $4.90\mathrm{x}10^{-01}$ \\

1365:            ul36 & \llact & NC\_004066 &  NT & 58 & 36,798 & 32,400 & 27.7 & $8.64\mathrm{x}10^{-01}$ & $4.66\mathrm{x}10^{-02}$ \\

1366:            Pf3 & \paeru & NC\_001418 &U   & 9 & 5,833 & 5,487 & 35.9 & $8.70\mathrm{x}10^{-01}$ & $1.64\mathrm{x}10^{-06}$ \\

1367:            bIL285 & \llact & NC\_002666 &T   & 62 & 35,538 & 32,646 & 26.7 & $9.20\mathrm{x}10^{-01}$ & $9.93\mathrm{x}10^{-01}$ \\

1368:            r1t & \llact & NC\_004302 &T   & 50 & 33,350 & 30,315 & 25.4 & $9.53\mathrm{x}10^{-01}$ & $6.03\mathrm{x}10^{-01}$ \\

1369:            bIL170 & \llact & NC\_001909 & T  & 63 & 31,754 & 27,663 & 27.1 & $9.91\mathrm{x}10^{-01}$ & $8.71\mathrm{x}10^{-01}$ \\

1370:     \end{tabular}

1371:     }

1372:     % \end{center}

1373:     \caption{Phage properties. Properties are listed for all phages included in

1374:     Figure \ref{fig:green_orange_pass_genomes}, in the same order based on the

1375:     orange p-value. Lifestyle annotations are T (temperate), NT (non-temperate),

1376:     U (unknown). The coding length refers to the length of all coding sequences

1377:     concatenated together (see Methods).}

1378:     \label{tab:phage_properties}

1379: \end{table}

1380:

1381: \begin{table}

1382: 	\begin{center}

1383: 		\begin{tabular}

1384: 		    {c|c|c}

1385: 		     & Lambda & All Phage Genes \\

1386: 		     \hline

1387: 		     Number structural & 7 & 279 \\

1388: 		     Number non-structural & 18 & 1022 \\

1389: 		     \hline

1390: 		     \multicolumn{3}{c}{Aqua CAI Randomization Test} \\

1391: 		     \hline

1392: 		     median $p^{>}$ structural & $1.3\mathrm{x}10^{-4}$ & $8.0\mathrm{x}10^{-3}$ \\

1393: 		     median $p^{>}$ non-structural & 1.0 & 1.0 \\

1394: 		     ANOVA significance & $p=4.5\mathrm{x}10^{-5}$ & $p=4.7\mathrm{x}10^{-12}$ \\

1395: 		     \hline

1396: 		     \multicolumn{3}{c} {Orange BCAI Randomization Test} \\

1397: 		     \hline

1398: 		     median $p^{>}$ structural & $2.8\mathrm{x}10^{-2}$ & $2.0\mathrm{x}10^{-1}$ \\

1399: 		     median $p^{>}$ non-structural & 0.98 & 0.73 \\

1400: 		     ANOVA significance & $p=1.8\mathrm{x}10^{-4}$ & $p=1.6\mathrm{x}10^{-15}$ \\

1401: 		\end{tabular}

1402: 	\end{center}

1403: 	\caption{Structural annotation versus codon usage.  The table shows

1404: 	the median $p^>$ values among structural and non-structural genes,

1405: 	under the aqua and orange randomization tests. Small $p^>$ values indicate

1406: 	significantly elevated CAI, controlling for the amino acid sequence

1407: 	(aqua test) and the GC3 sequence (orange test). We also report the

1408: 	significance of non-parametric ANOVAs that compare median $p^>$-values between

1409: 	the structural and non-structural genes. Analyses are limited to

1410: 	those genes that pass the aqua test, as described in the main text;

1411: 	similar results are found without this restriction.

1412: }

1413: 	\label{tab:lambda_all_struct_non_aqua_orange}

1414: \end{table}

1415: \clearpage

1416:

1417: \begin{table}

1418: 	\begin{center}

1419: 		\begin{tabular}

1420: 		    {c|c}

1421: 		     & All Phage Genes \\

1422: 		     \hline

1423: 		     Number `Head'  & 145 \\

1424: 		     Number `Tail'  & 134 \\

1425: 		     Number non-structural (NS) & 1022 \\

1426: 		     \hline

1427: 		     \multicolumn{2}{c}{Aqua CAI Randomization Test} \\

1428: 		     \hline

1429: 		     median $p^{>}$ head &  $2.0\mathrm{x}10^{-3}$ \\

1430: 		     median $p^{>}$ tail &  $2.0\mathrm{x}10^{-2}$ \\

1431: 		     median $p^{>}$ NS &  1.0 \\

1432: 		     ANOVA Head vs NS & $p=6.4\mathrm{x}10^{-19}$ \\

1433: 		     ANOVA Tail vs NS & $p=1.8\mathrm{x}10^{-1}$ \\

1434: 		     ANOVA Head vs Tail & $p=2.1\mathrm{x}10^{-8}$ \\

1435: 		     \hline

1436: 		     \multicolumn{2}{c} {Orange BCAI Randomization Test} \\

1437: 		     \hline

1438: 		     median $p^{>}$ head &  $7.0\mathrm{x}10^{-2}$ \\

1439: 		     median $p^{>}$ tail &  $4.3\mathrm{x}10^{-1}$ \\

1440: 		     median $p^{>}$ NS &  0.73 \\

1441: 		     ANOVA Head vs NS  & $p=4.2\mathrm{x}10^{-21}$ \\

1442: 		     ANOVA Tail vs NS  & $p=1.7\mathrm{x}10^{-2}$ \\

1443: 		     ANOVA Head vs Tail  & $p=6.0\mathrm{x}10^{-8}$ \\

1444: 		\end{tabular}

1445: 	\end{center}

1446: 	\caption{Comparison between codon usage and refined structural

1447: 	annotations.

1448: 	As in Table \ref{tab:lambda_all_struct_non_aqua_orange},

1449: 	we compare the median aqua and orange $p^>$ values among head genes, tail

1450: 	genes, and non-structural genes. We report the significance of

1451: 	pairwise non-parametric ANOVAs comparing head to non-structural, tail

1452: 	to non-structural, and head to tail genes.

1453: 	These analyses are limited to genes that pass the aqua test;

1454: 	similar results are found without this

1455: 	restriction.

1456: }

1457: 	\label{tab:all_head_tail_aqua_orange}

1458: \end{table}

1459:

1460: \clearpage

1461: \begin{table}

1462: 	\begin{center}

1463: 		\begin{tabular}

1464: 		    {c|c}

1465: 		     \multicolumn{2}{c}{Median $p_{\text{combined}}^{\mathrm{orange}}$} \\

1466: 		     \hline

1467: 		     Temperate & $1.4\mathrm{x}10^{-2}$ \\

1468: 		     Non-temperate & $2.6\mathrm{x}10^{-5}$ \\

1469: 		     Un-identified & $4\mathrm{x}10^{-2}$ \\

1470: 		     ANOVA significance & $p = 0.1$ \\

1471: 		     \hline

1472: 		     \multicolumn{2}{c}{Median $p_{\text{combined}}^{\mathrm{green}}$} \\

1473: 		     \hline

1474: 		     Temperate & $5.1\mathrm{x}10^{-9}$ \\

1475: 		     Non-temperate & $7.0\mathrm{x}10^{-2}$ \\

1476: 		     Un-identified & $5\mathrm{x}10^{-2}$ \\

1477: 		     ANOVA significance & $p = 0.009$ \\

1478: 		\end{tabular}

1479: 	\end{center}

1480: \caption{{\bf Phage lifestyle versus codon usage}. The table shows the median

1481: $p_{\text{combined}}^{\mathrm{orange}}$ and

1482: $p_{\text{combined}}^{\mathrm{green}}$ values among phages classified as

1483: temperate, non-temperate, or un-identified for all phages included in Figure

1484: \ref{fig:green_orange_pass_genomes} and Table \ref{tab:phage_properties}. Small

1485: median $p_{\text{combined}}^{\mathrm{orange}}$ values indicate that these phages

1486: have significantly non-random (in either direction) BCAI, controlling for the

1487: amino acid sequence and the GC3 sequence, while small median

1488: $p_{\text{combined}}^{\mathrm{green}}$ values indicate that these phages have

1489: significantly non-random (in either direction) GC3, controlling for the amino

1490: acid sequence and the BCAI sequence. We also report the significance of

1491: non-parametric ANOVAs that compare these medians between these groups of phages.

1492: }

1493: 	\label{tab:temperate_non}

1494: \end{table}

1495: % section tables (end)

1496: \clearpage

1497: \bibliography{GLA,GLA_lux}

1498:

1499: \end{document}

1500:

1501: