1: \documentclass[amsmath,amssymb,aps]{revtex4}
2: \usepackage{graphicx}
3: \usepackage{bm}
4: \usepackage[usenames]{color}
5: \usepackage{multirow}
6: \usepackage{amsmath}
7:
8: \pdfoutput=1
9:
10: % \newcommand{\julius}[1]{\textbf{\textcolor{blue}{#1}}}
11: \newcommand{\julius}[1]{#1}
12: \newcommand{\josh}[1]{\textbf{\textcolor{red}{#1}}}
13: \newcommand{\ecoli}{\emph{E. coli}}
14: \newcommand{\paeru}{\emph{P. aeruginosa}}
15: \newcommand{\llact}{\emph{L. lactis}}
16:
17: \begin{document}
18:
19: \title{Genome landscapes and \\
20: bacteriophage codon usage}
21:
22: \author{Julius B. Lucks$^1$} \author{David R. Nelson$^{1,2}$}
23: \author{Grzegorz Kudla$^1$} \author{Joshua B. Plotkin$^{3,*}$}
24: \affiliation{ $^1$FAS Center for Systems Biology, Harvard University,
25: \\ $^2$ Lyman Laboratory of Physics, Harvard
26: University\\ $^3$ Department of Biology, University
27: of Pennsylvania\\ $^*$E-mail:
28: jplotkin@sas.upenn.edu }
29:
30: \date{\today}
31: \begin{abstract}
32:
33: Across all kingdoms of biological life, protein-coding genes exhibit
unequal usage of synonymous codons. Although alternative theories abound,
35: translational selection has been accepted as an important mechanism that
36: shapes the patterns of codon usage in prokaryotes and simple eukaryotes.
37: Here we analyze patterns of codon usage across 74 diverse bacteriophages
38: that infect \emph{E. coli}, \emph{P. aeruginosa} and \emph{L. lactis} as
39: their primary host. We introduce the concept of a `genome landscape,' which
40: helps reveal non-trivial, long-range patterns in codon usage across a
41: genome. We develop a series of randomization tests that allow us to
interrogate the significance of one aspect of codon usage, such as GC
43: content, while controlling for another aspect, such as adaptation to
44: host-preferred codons. We find that 33 phage genomes exhibit highly
45: non-random patterns in their GC3-content, use of host-preferred codons, or
46: both. We show that the head and tail proteins of these phages
47: exhibit significant bias towards host-preferred codons, relative
48: to the non-structural phage proteins. Our results support the hypothesis of
49: translational selection on viral genes for host-preferred codons, over a
50: broad range of bacteriophages.
51:
52:
53: \end{abstract}
54:
55:
56:
57: \maketitle
58:
59: \section{Introduction}\label{sec:introduction}
60:
61: The genomes of most organisms exhibit significant codon bias -- that is, the
62: unequal usage of synonymous codons. There are longstanding and contradictory
63: theories to account for such biases. Variation in codon usage between taxa,
particularly within mammals, is sometimes attributed to neutral processes --
65: such as mutational biases during DNA replication, repair, and gene conversion
66: \cite{Bern95,Francino1999,Galtier2003,Eyre91}.
67:
68: There are also theories for codon bias driven by selection. Some researchers
69: have discussed codon bias as the result of selection for regulatory function
70: mediated by ribosome pausing \cite{LawrHart91}, or selection against
71: pre-termination codons \cite{Fitc80,ModiBatt81}. However, the dominant selective
72: theory of codon bias in organisms ranging from \textit{E. coli} to
73: \textit{Drosophila} posits that preferred codons correlate with the relative
74: abundances of isoaccepting tRNAs, thereby increasing translational efficiency
75: \cite{ZuckPaul65,Ikem81a,Ikem85,PoweMori97,DebrMarz94,SoreKurl89} and accuracy
76: \cite{Akas94}. This theory helps to explain why codon bias is often more extreme
77: in highly expressed genes \cite{Ikem81b}, or at highly conserved sites within a
78: gene \cite{Akas94}. Translational selection may also explain variation in codon
79: usage between genes selectively expressed in different tissues
80: \cite{Plotkin2004,Dittmar2006}. However, recent work suggests that synonymous
81: variation, particularly with respect to GC content, affects transcriptional
82: processes as well \cite{Kudla2006}.
83:
84: The codon usage of viruses has also received considerable attention
85: \cite{Jenkins2003,PlotDush03}, particularly in the case of bacteriophages
86: \cite{Sharp1984,Kunisawa1998,Sahu2004, Sahu2005,Sau2005,SauGosh2005}. Most work
87: along these lines has focused on individual phages, or on the patterns of
88: genomic codon usage across a handful of phages of the same host.
89:
90: Here, we provide a systematic analysis of intragenomic variation in
91: bacteriophage codon usage, using 74 fully sequenced viruses that infect a
92: diverse range of bacterial hosts. Motivated by energy landscapes associated with
93: DNA unzipping \cite{LubenskyNelson2002,Weeks2005}, we develop a novel
94: methodological tool, called a genome landscape, for studying the long-range
95: properties of codon usage across a phage genome. We introduce a series of
96: randomization tests that isolate different features of codon usage from each
other, and from the amino acid sequence of encoded proteins. More than twenty of
the phages in our analysis are shown to exhibit non-random variation in
synonymous GC content, non-random variation in codons adapted for
host translation, or both. Additionally, we demonstrate that phage genes
101: encoding structural proteins are significantly more adapted to host-preferred
102: codons compared to non-structural genes. We discuss our results in the context
103: of translational selection and lateral gene transfer amongst phages.
104:
105:
106: % section introduction (end)
107: \section{Results}\label{sec:results}
108:
109: % (fold)
110: \subsection{Genome Landscapes}\label{sub:genome_landscapes}
111:
112: % (fold)
113: We start by introducing the concept of a genome landscape, which provides a
114: simple means for visualizing long-range correlations of sequence properties
115: across a genome. A genome landscape is simply a cumulative sum of a specified
116: quantitative property of codons. The calculation of the cumulative sum is
117: straightforward, and it consists of scanning over the genome sequence one codon
118: at a time, gathering the property of each codon, and summing it with the
119: properties of previous codons in the genome sequence.
120: Similar cumulative
121: sums are used in solid-state physics for, e.g., the
calculation of energy levels
123: \cite{Ashcroft1976}.
124: In the case of the GC3 landscape, we have
\begin{equation}
\label{eq:FGC3}
F_{\mathrm{GC3}}(m) = \sum_{i=1}^m
(\eta_{\mathrm{GC3}}(i) - \overline{\eta_{\mathrm{GC3}}})
\end{equation}
where $\eta_{\mathrm{GC3}}(i)$ equals one or zero, depending upon whether
the $i^{th}$ codon ends in G/C or A/T, respectively. Note that we subtract the
132: genome-wide average GC3 content, $\overline{\eta_{\mathrm{GC3}}}$, so that
$F_{\mathrm{GC3}}(0) = F_{\mathrm{GC3}}(N) = 0$, where $N$ is the length of the
genome in codons. In other words, we convert the genome codon sequence into a binary
135: string of 1's and 0's according to whether each codon is of type GC3 or AT3, and
136: we cumulatively sum this sequence to compute $F_{\mathrm{GC3}}(m)$.
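
As a small illustration of Eq. \ref{eq:FGC3}, consider a hypothetical
sequence of $N = 5$ codons (chosen only for arithmetic convenience, not taken
from any phage genome) with $\eta_{\mathrm{GC3}} = (1, 1, 0, 1, 0)$, so that
$\overline{\eta_{\mathrm{GC3}}} = 3/5$. The landscape is then
\begin{equation*}
F_{\mathrm{GC3}}(1) = 0.4, \quad F_{\mathrm{GC3}}(2) = 0.8, \quad
F_{\mathrm{GC3}}(3) = 0.2, \quad F_{\mathrm{GC3}}(4) = 0.6, \quad
F_{\mathrm{GC3}}(5) = 0,
\end{equation*}
which rises by $1 - 3/5$ at each G/C-ending codon, falls by $3/5$ at each
A/T-ending codon, and returns to zero at $m = N$ as required.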
137:
138: The interpretation of a GC3 landscape is straightforward. Regions of the genome
139: whose landscape exhibits an uphill slope contain higher than average GC3
140: content, whereas regions of downhill slope contain lower than average GC3
141: content. The genome landscape provides an efficient visualization of long-range
142: correlations in sequence properties across a genome, similar to the techniques
143: introduced by Karlin \cite{Karlin1993}.
144:
145: Traditional visualizations of GC3 content involve moving window averages of
146: \%GC3 over the genome \cite{Gregory2006}. In order to compare these techniques
147: with the landscape approach, we focus on the \emph{E. coli} phage lambda as an
148: illustrative example. Figure \ref{fig:land_hist} (a) shows the lambda phage GC3
landscape above its associated ``GC3 histogram''. The histogram shows the GC3
150: content of each gene, and the width of each histogram bar reflects the length of
151: the corresponding gene. The figure reveals a striking pattern of lambda phage
152: codon usage: the genome is apparently divided into two halves that contain
153: significantly different GC3 contents \cite{Inman1966,Sanger1982}. The large
154: region of uphill slope on the left half of the GC3 landscape reflects the fact
155: that the majority of the genes in this region contain an excess of codons that
156: end in G or C. This trend is also reflected in the GC3 histogram bars, which are
157: higher than average in the left half of the genome (Figure \ref{fig:land_hist}).
158:
159: Genome landscapes also provide a natural means of evaluating whether or not
160: features of codon usage are due to random chance. Under a null model in
161: which
the $\eta_{\mathrm{GC3}}(i)$'s above are chosen as independent random variables
with $\mathrm{var}(\eta_{\mathrm{GC3}}(i)) = \langle \eta_{\mathrm{GC3}}(i)^2 \rangle
- \langle \eta_{\mathrm{GC3}}(i) \rangle^2 = \Delta_{\mathrm{GC3}}$, one can show (see
Methods) that the standard
deviation of $F_{\mathrm{GC3}}(m)$ is
\begin{equation}
\label{eq:sigma}
\sigma_{\mathrm{GC3}}(m) = \sqrt{\langle
F_{\mathrm{GC3}}(m)^2 \rangle - \langle F_{\mathrm{GC3}}(m) \rangle^2} = \sqrt{\frac{\Delta_{\mathrm{GC3}} m (N-m)}{N}}.
\end{equation}
This quantity is shown as a purple band in Figure
\ref{fig:land_hist}. For $\eta_{\mathrm{GC3}}(i)$'s chosen to be 0 or 1 with equal probability,
$\Delta_{\mathrm{GC3}} = 1/4$ and the maximum width $\sqrt{N}/4$ is
175: obtained at $m= N/2$. Since the scale of variation across the lambda phage GC3
176: landscape is much greater than its expectation under the null, we can
177: conclude that the distribution of G/C versus A/T ending codons is
178: highly non-random in the lambda phage genome.
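
One way to see where Eq. \ref{eq:sigma} comes from is the following sketch,
which assumes only that the $\eta_{\mathrm{GC3}}(i)$ are independent with a
common mean and variance; the partial sum $S_m$ is notation introduced here
for convenience. Writing $S_m = \sum_{i=1}^{m} \eta_{\mathrm{GC3}}(i)$, the
landscape is $F_{\mathrm{GC3}}(m) = S_m - (m/N) S_N$, so that
$\langle F_{\mathrm{GC3}}(m) \rangle = 0$ and
\begin{align*}
\sigma_{\mathrm{GC3}}^2(m)
&= \mathrm{var}(S_m) - \frac{2m}{N}\,\mathrm{cov}(S_m, S_N)
 + \frac{m^2}{N^2}\,\mathrm{var}(S_N) \\
&= m \Delta_{\mathrm{GC3}} - \frac{2m^2}{N} \Delta_{\mathrm{GC3}}
 + \frac{m^2}{N} \Delta_{\mathrm{GC3}}
 = \frac{\Delta_{\mathrm{GC3}}\, m (N-m)}{N}.
\end{align*}
Setting $\Delta_{\mathrm{GC3}} = 1/4$ and $m = N/2$ gives
$\sigma_{\mathrm{GC3}} = \sqrt{N/16} = \sqrt{N}/4$, the maximum width quoted
above.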
179:
180: We can also gain intuition about the degree of non-randomness in the GC3
181: landscape by considering what would happen if the lambda phage genome were to
182: accumulate random synonymous mutations. Figure
183: \ref{fig:land_decay}(a) shows snapshots of the lambda GC3 landscape as
184: we simulate synonymous mutations to the genome. Between each snapshot,
185: $N$ synonymous mutations were introduced by
186: picking a codon at random along the genome, and then choosing a new
187: synonymous codon at random according to the global lambda phage codon
188: distribution. As more mutations are introduced, the GC3 landscape of the
189: synonymously mutated lambda genome approaches the purple band,
190: indicating that the GC3 pattern in the real lambda phage genome is
191: highly non-random.
192:
193: The procedure of producing a genome landscape can be applied to other
194: properties of codon usage. In addition to GC3, we will study patterns in
195: the Codon Adaptation Index (CAI). CAI measures the similarity of a
196: gene's codon usage to the `preferred' codons of an organism
197: \cite{Sharp1987} -- in this case, the host bacterium of the phage under
198: study. Every bacterium has a preferred set of codons defined as the
199: codons, one for each amino acid, that occur most frequently in genes
200: that are translated at high abundance. These genes are often taken to be
201: the ribosomal proteins and translational elongation factors
202: \cite{Sharp1987} (see Methods).
203:
204: In order to calculate CAI, the preferred codons are each assigned a weight $w =
205: 1$. The remaining codons are assigned weights according to their frequency in
206: the highly-translated genes, relative to the frequency of the $w=1$ codon. The
207: CAI of a gene is defined as the geometric mean of the $w$-values for its codons
208: \begin{equation}
\label{eq:CAI_def} \mathrm{CAI} = \left(\prod_{i=1}^{M} w_i\right)^{1/M},
210: \end{equation}
211: where $w_i$ is the $w$-value of the $i^{th}$ codon, and
212: $M$ is the length of the gene. This quantity can be re-written as
\begin{equation} \mathrm{CAI} = \exp\left(\frac{1}{M} \sum_{i=1}^{M}
\ln(w_i)\right).
\end{equation}
216: The latter formulation is more useful for calculating genome landscapes,
217: because the argument of the exponential function is now a sum of the logs of the
$w$-values. Therefore, we define the CAI landscape as \begin{equation}
F_{\mathrm{CAI}}(m) = \sum_{i=1}^m (\eta_{\mathrm{CAI}}(i) -
\overline{\eta_{\mathrm{CAI}}}), \end{equation} where $\eta_{\mathrm{CAI}}(i) =
\ln(w_i)$.
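
To illustrate the equivalence of the two expressions for CAI, consider a
hypothetical gene of $M = 3$ codons with weights $w = (1, 0.5, 0.25)$; these
values are chosen for arithmetic convenience and are not actual \emph{E. coli}
weights. Then
\begin{equation*}
\mathrm{CAI} = (1 \times 0.5 \times 0.25)^{1/3} = 0.5
= \exp\left(\frac{0 - \ln 2 - 2 \ln 2}{3}\right).
\end{equation*}
If these three codons made up the entire sequence, the landscape increments
would be $\ln(w_i) = (0, -\ln 2, -2 \ln 2)$ with mean $-\ln 2$, giving
$F_{\mathrm{CAI}}(m) = (\ln 2, \ln 2, 0)$ for $m = 1, 2, 3$.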
222:
223: The CAI landscape for lambda phage is shown in Figure
224: \ref{fig:land_hist}(b), along with the CAI histogram of lambda phage.
225: For the CAI histograms, the height of each bar represents the CAI value
226: of that gene (Eq. \ref{eq:CAI_def}). As in the case with the GC3
227: landscape, we find that the lambda phage CAI landscape corresponds
228: closely to the CAI histogram, but it offers a more striking global view
229: of the long-range CAI structure in the lambda phage genome. One
230: contiguous half of the lambda phage genome exhibits elevated CAI,
231: whereas the other half exhibits depressed CAI. The observed CAI
232: landscape lies far outside the purple band in Figure
233: \ref{fig:land_hist}, calculated according to Eq. \ref{eq:sigma},
234: indicating that the pattern of CAI across the lambda phage genome is
235: non-random. However,
236: the purple band is wider for the CAI landscape than for the GC3
237: landscape, because the variance in the $\ln{(w_i)}$'s,
238: $\Delta_{\mathrm{CAI}}$, is greater than $\Delta_{\mathrm{GC3}}$.
239:
240: The GC3 and CAI landscapes for lambda phage are highly correlated with each
241: other (Figure \ref{fig:land_hist}). In particular they both have large uphill
242: regions on the left-hand side of the genome, indicating a region containing
243: codons with elevated GC3-content and CAI values, compared to the genome average.
244: It is possible that the observed correlation between the GC3 and CAI landscapes
245: could be caused by the conflation between high CAI and GC3 in the preferred
246: \emph{E. coli} codons, as we discuss below.
247:
248: We note that the genes in the region of elevated CAI primarily encode the highly
249: translated structural proteins that form the capsid and tail of the lambda phage
virions. This pattern suggests the hypothesis that, because of the need to
produce structural proteins in high copy number during the viral life cycle, structural
252: genes preferentially use codons that match the host's preferred set of codons.
253: We will explore this translational-selection hypothesis in greater detail below.
254:
255: % subsection genome_landscapes (end)
256: \subsection{The Effect of Amino Acid Content on Genome
257: Landscapes}\label{sub:the_effect_of_amino_acid_content_on_genome_landscapes}
258:
259: % (fold)
260: The previous section illustrated that the codon usage across the lambda phage
261: genome is highly non-random with respect to both GC3 and CAI. In this section we
262: quantify this statement, and we focus on aspects of lambda's codon usage
263: patterns that are \emph{independent} of the amino acid sequences of the
264: encoded proteins.
265:
266: Since we are interested in studying the patterns of \emph{synonymous} codon
267: usage, it is important that we control for the amino acid sequence of encoded
268: proteins. Phages utilize a diverse spectrum of proteins, ranging from
those that form the protective capsid for nascent progeny, to those
that form the tail and tail fibers, to those that regulate the switch
between lytic and lysogenic infection pathways. As with other organisms, phage
272: proteins have been selected at the amino acid level for function and folding.
273: Some portion of a phage's codon usage is surely influenced by selection
274: for amino acid content.
275:
276: We can construct a simple randomization test to interrogate the potential
277: influence of the amino acid sequence on the GC3 and CAI landscapes of lambda
278: phage. In this test, we generate random genomes that have the exact same amino
279: acid sequence as lambda phage, but shuffled codons, such that the genome-wide,
280: or global, codon distribution is preserved in each random genome (see
281: Methods). As
282: summarized in Table \ref{tab:tests}, we refer to this test as the `aqua'
randomization test. For each of the randomized genomes, we calculate the GC3 and CAI
landscapes. Similar to a recent randomization method \cite{Zeldovich2007}, we then
285: compare the observed landscape of the actual genome to the
286: distribution of landscapes generated from the randomized genomes.
287:
288: Figure \ref{fig:aqua} shows the results of this comparison, with the observed
289: landscapes plotted as black lines, and the mean
290: $\pm$ one and two standard deviations of random trials shown in dark and light
291: aqua, respectively. As the figures show, the observed landscapes lie in the far
292: extremes of the randomized distributions -- indicating that the amino acid
293: sequence of the lambda phage genome does not determine the extraordinary
294: features of the observed landscapes.
295:
296: It is also instructive to query the influence of amino acid content on codon
297: usage in each gene individually. The histogram view of these randomization tests
298: allows us to ask this question precisely. Because the amino acid sequence is
299: preserved exactly across the genome, each histogram bar in Figure \ref{fig:aqua}
300: can be considered as its own randomization test, one for each gene. The position
301: of the horizontal black bar reflects the actual codon usage of
302: each gene, and it can be compared to the distribution of random trials in order
303: to compute a quantile for each gene:
\begin{equation}
\begin{aligned}
q^{>} &= \frac{\mathrm{number\ of\ trials\ less\ than\ observed}}{\mathrm{number\ of\ trials}},\\
q^{<} &= \frac{\mathrm{number\ of\ trials\ greater\ than\ observed}}{\mathrm{number\ of\ trials}}.
\end{aligned}
\end{equation}
309: Note that we have defined two quantiles, $q^{>}$ and $q^{<}$, that describe the
310: proportion of random trials strictly less or strictly greater than the observed data.
These two quantities sum to a value of at most one (and equal to one if there
are no ties). A large value of $q^{>}$ signifies that the observed statistic
313: (e.g. GC3 or CAI) is \emph{greater} than most of the random trials.
314:
315: Associated with each of these quantiles is a p-value quantifying whether the
316: observed gene sequence has significantly different codon usage than the random
317: trials: $p^{<} = 1 - q^{<}$ and $p^{>} = 1 - q^{>}$. If either one of these
318: $p$-values is low, it signifies that the GC3 (or CAI) content of the gene is
significantly different from the genomic average, controlling for the amino acid
320: sequence of the gene. $p^<$ tests for significantly depressed GC3 (or CAI) in a
321: gene; and $p^>$ tests for significantly elevated GC3 (or CAI) in a gene. We will
322: use these $p$-values, which arise from the `aqua' randomization test, in two
323: ways.
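
For example, suppose (hypothetically) that 1000 random trials are drawn for a
gene, of which 990 have GC3 content strictly less than the observed value, 4
strictly greater, and 6 exactly equal. Then
\begin{equation*}
q^{>} = 0.990, \quad q^{<} = 0.004, \quad
p^{>} = 1 - q^{>} = 0.010, \quad p^{<} = 1 - q^{<} = 0.996,
\end{equation*}
so $q^{>} + q^{<} = 0.994 < 1$ because of the ties, and the small value of
$p^{>}$ indicates elevated GC3 content in this gene relative to the random
trials.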
324:
325: Since we are interested in studying the effects of synonymous codon usage alone,
326: we first wish to filter out any genes whose codon usage does not significantly deviate
327: from random, given the amino acid sequence. Therefore, in the subsequent
328: gene-by-gene analyses reported in this paper, we retain only those genes whose
329: quantiles fall in the extreme 5\% of random trials. That is, we only keep those
330: genes for which $p^{<}_{\mathrm{aqua}} < 0.025$ or $p^{>}_{\mathrm{aqua}} <
331: 0.025$. These genes are said to `pass' the aqua test, and they are
332: unshaded in Figure \ref{fig:aqua}.
333:
334: We also use the gene-by-gene $p$-values to quantify the degree to which
335: codon usage is independent of amino acid sequence across the genome as a
336: whole. To do so, we combine all the gene-by-gene $p$-values into an
337: aggregate $p$-value for the entire genome, $p_{\mathrm{aqua}}$, using
338: the method of Fisher \cite{Fisher1948}. We calculate the combined
339: $p$-value by summing the logs of twice the minimum of each gene-specific
$p$-value \begin{equation} f_{\mathrm{aqua}} = -2 \sum_{i=1}^{k} \ln{[2
\min(p^{<}_{\mathrm{aqua},i}, p^{>}_{\mathrm{aqua},i})]}, \end{equation}
342: where $p^{<}_{\mathrm{aqua},i}$ represents the aqua $p^<$-value for gene
$i$, and $k$ is the number of genes in the genome. Under the null hypothesis,
$f_{\mathrm{aqua}}$ is chi-squared distributed with $2k$ degrees of
freedom \cite{Fisher1948}. Thus, the combined $p$-value for the
entire genome is $p_{\text{combined}}^{\mathrm{aqua}} = 1-
347: P_{\chi^2,2k}(f_{\mathrm{aqua}})$, where $P_{\chi^2,2k}(f)$ is the
348: cumulative chi-squared distribution with $2k$ degrees of freedom. In
349: the case of lambda phage, we find $p_{\text{combined}}^{\mathrm{aqua}} =
7.42 \times 10^{-98}$ for GC3 and $p_{\text{combined}}^{\mathrm{aqua}}
= 1.50 \times 10^{-41}$ for CAI. Thus, we conclude that neither
352: the GC3 nor the CAI patterns across the lambda phage genome are
353: determined by the genome's amino acid sequence.
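
As a worked illustration of the combination step, consider a hypothetical
genome with $k = 2$ genes whose smaller gene-specific $p$-values are $0.01$ and
$0.02$ (values chosen for illustration only). Then
\begin{equation*}
f_{\mathrm{aqua}} = -2 \left[ \ln(0.02) + \ln(0.04) \right]
= -2 \ln(8 \times 10^{-4}) \approx 14.26 .
\end{equation*}
For $2k = 4$ degrees of freedom the upper tail of the chi-squared distribution
has the closed form $e^{-f/2}(1 + f/2)$, which gives
$p_{\text{combined}}^{\mathrm{aqua}} \approx 8 \times 10^{-4} \times 8.13
\approx 6.5 \times 10^{-3}$.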
354:
355: In the following sections we will use the aqua test (see Table
356: \ref{tab:tests}) and its associated gene-by-gene and combined p-values
357: as a control to verify that features of codon usage are not driven by
358: the amino acid sequence.
359:
360: % subsection the_effect_of_amino_acid_content_on_genome_landscapes (end)
361: \subsection{Disentangling CAI from GC3}\label{sub:disentangling_cai_from_gc3}
362:
363: % (fold)
364: Depending upon the preferred codons of the host species, the effect of
365: selection for high CAI in a viral gene is not necessarily independent
366: from the effect of selection for other features of viral codon usage,
367: such as high GC3.
368: For example, codons with high CAI values associated with a given host
369: may be biased towards high GC3 values as well (see Figure
370: \ref{fig:E_coli_master}, and Section
371: \ref{sub:disentangling_cai_from_gc3} below). It is important, therefore,
372: to disentangle the effects of selection for CAI versus selection for
373: GC3, in order to determine which one of these forces is responsible for
374: the non-random patterns of codon usage observed in the lambda genome.
375:
376: The weights used to compute CAI for \emph{E. coli} are shown in Figure
377: \ref{fig:E_coli_master}. The 61 codons are placed into one of four groups
378: according to whether they are GC3 or not (red or blue, respectively), and
379: whether they have high CAI or not (dark or light, respectively). High CAI is
determined by an arbitrary cutoff of $w \geq 0.9$. As this figure demonstrates,
the set of preferred codons in \emph{E. coli} is slightly biased towards GC-ending
codons (58\%).
383:
384: The GC bias of preferred codons, although slight, could conflate the
385: results of selection for CAI versus GC3 in phages that infect \emph{E.
386: coli}, such as lambda. We therefore introduce another randomization
387: test that allows us to disentangle patterns of CAI content from patterns
388: of GC3 content. Similar to the aqua randomization test described above,
389: we draw random phage genomes such that the amino acid sequence is
390: conserved, but we add the additional constraint of conserving the exact
391: GC3 sequence as well (see Methods). For example, at a site containing a
392: GC3 codon for leucine, in our random trials we only allow those leucine
393: codons terminating in G or C. By comparing the observed landscapes of
394: the genome with the distribution of randomly drawn landscapes, we can
395: isolate the features of codon usage driven by CAI, independent of GC3
and amino acid content. We refer to this randomization procedure as the
`orange' randomization test (Table \ref{tab:tests}).
398:
399: Conversely, we also wish to assess the strength of patterns in GC3 content,
400: independent of CAI and amino acid content. The appropriate randomization
401: procedure in this case requires that we constrain the amino acid sequence and
402: the sequence of codon CAI values while allowing GC3 to vary. However, because
403: CAI values are not binary, CAI cannot be constrained exactly while still
404: allowing for enough variability to produce a meaningful randomization test.
405: Thus, we introduce a binary version of the CAI measure, called BCAI, that is
406: qualitatively the same as and, for our purposes, interchangeable with CAI.
407:
408: The BCAI $w$-value for a codon is defined to be 0.7 if the codon is high CAI,
409: and 0.3 if the codon has low CAI. High CAI is defined by the threshold of $w
410: \geq 0.9$ (see Figure \ref{fig:E_coli_master}). The actual values assigned for
411: BCAI are arbitrary and have no effect on our results. In addition, the threshold
412: value $w \geq 0.9$ is also arbitrary, and our results are robust to changing
413: this threshold. BCAI provides a useful surrogate for CAI because its values are
414: binary, thereby allowing us to constrain a gene's amino acid sequence and BCAI
415: sequence \emph{exactly}, while varying GC3 content in random trials. The BCAI
416: landscapes and histograms are calculated in the same way as CAI landscapes and
417: histograms, except using BCAI $w$-values. As expected, the BCAI landscape of a
418: genome is qualitatively similar to its CAI landscape (compare Figures
419: \ref{fig:green_orange}b and \ref{fig:aqua}b), and the two landscapes are highly
420: correlated (e.g. $r = 0.72$ for lambda phage). Thus BCAI is interchangeable
421: with CAI for the purposes of our randomization tests.
422:
423: Figure \ref{fig:green_orange} shows the results of the two randomization tests
424: outlined above: the `green' test that compares the observed GC3 landscape to a
425: distribution of random trials constraining the amino acid sequence and the BCAI
426: sequence; and the `orange' test that compares the observed BCAI landscape to a
427: distribution of random trials constraining the amino acid sequence and the GC3
428: sequence. Our convention for naming these two tests is summarized in Table
429: \ref{tab:tests}.
430:
431: As seen in Figure \ref{fig:green_orange}a, the observed GC3 landscape lies
432: significantly outside of the random trials that preserve amino acid sequence and
433: BCAI sequence. Combining the gene-by-gene p-values for this test, we find
$p_{\text{combined}}^{\text{green}} = 5.1 \times 10^{-68}$ -- indicating that
435: the lambda phage genome as a whole has non-random GC3 variation independent of
436: amino acid and CAI (actually, BCAI) sequence. Conversely, Figure
437: \ref{fig:green_orange}b shows that the BCAI landscape contains non-random
438: features when controlling for both GC3 and amino acid sequence
($p_{\text{combined}}^{\text{orange}} = 6.3 \times 10^{-9}$). In other words,
440: the lambda phage genome exhibits highly non-random patterns of both GC3 and CAI
441: codon variation, independent of one another and independent of the amino acid
442: sequence.
443:
444: % subsection disentangling_cai_from_gc3 (end)
\subsection{Non-random patterns of CAI and GC3 in
446: Bacteriophages}\label{sub:selection_for_cai_and_gc3_in_bacteriophages}
447:
448: % (fold)
449:
450: In the sections above we have demonstrated and quantified highly
451: non-random patterns of GC3 and CAI codon usage variation across the
452: lambda phage genome. We have also demonstrated that these trends are
453: independent of one another. In this section, we will extend our
454: analysis to a large range of diverse phages.
455:
456: In this section we consider all sequenced phages that infect \emph{E. coli},
457: \emph{Pseudomonas aeruginosa} or \emph{Lactococcus lactis} as their primary host.
The latter two hosts were chosen because they contain unusually extreme GC3
content: 88 \%GC3 for \emph{P. aeruginosa} and 25 \%GC3 for \emph{L. lactis},
genome-wide. The extreme GC3 content of these hosts gives rise to opposing
461: relationships between high CAI and GC3 -- as indicated schematically in
462: Figure \ref{fig:master_cartoons}. In particular, \emph{P. aeruginosa} strongly
463: favors GC3 in high-CAI codons (94\%), and \emph{L. lactis} strongly favors AT3 in
464: high-CAI codons (72\%). Thus, these three hosts span a large spectrum of
465: relationships between CAI and GC3. Since our randomization tests constrain amino
466: acid and BCAI exactly (the `green' test), and amino acids and GC3 exactly (the
467: `orange' test), we can control for any possible conflation between GC3 and CAI
468: trends. Thus, the randomization tests are equally applicable to all of the phage
469: genomes, regardless of their host.
470:
471: We performed the aqua, green, and orange randomization tests on the 45
472: phages of \emph{E. coli}, 12 phages of \emph{P. aeruginosa}, and 17
473: phages of \emph{L. lactis} whose genomes have been sequenced
474: (see Methods). In the first step of our
475: analysis, we removed any phages which failed either the aqua GC3 or aqua
CAI tests, because the codon usage of such genomes is influenced by
477: their amino acid sequence. A phage was said to pass these two control
478: tests if its Fisher combined p-values for both aqua GC3 and aqua CAI
479: were significant. The significance criterion for each test is
480: $p_{\text{combined}} < 5\%/74$, which incorporates a Bonferroni
481: correction for multiple tests. With this cutoff, 50 of the initial 74
482: phages passed the aqua control tests.
483:
484: Figure \ref{fig:green_orange_examples} shows results of these tests for
several example genomes. P2, a temperate phage, and T3, a non-temperate
phage, both infect \emph{E. coli}, and both pass the control tests and
exhibit significant `orange' and `green' results, as does D3112, a
temperate phage that infects \emph{P. aeruginosa}. However, not all
phages that pass the control test exhibit significant `orange' and
490: `green' results -- as evidenced by bIL286, a temperate phage infecting
491: \emph{L. lactis}.
492:
493: Figure \ref{fig:green_orange_pass_genomes} plots the distribution of
494: combined Fisher p-values of the orange and green tests, for the 50
495: phages that pass the control tests. The majority of these
p-values are highly significant. Using a Bonferroni-corrected threshold of
5\%/50, a total of 22 genomes show significance in the orange test, 29
in the green test, and 17 in both orange and green. These results
499: indicate that non-random patterns in codon usage are not unique to
500: lambda phage. Indeed, over a range of bacterial hosts and a range of
501: phage viruses, there is apparent pressure for non-random patterns of
502: both GC3 content and CAI content, independent of one another and
503: independent of the amino acid sequence.
504:
505:
506: % subsection selection_for_cai_and_gc3_in_bacteriophages (end)
507: \subsection{Translational selection on phage structural
508: proteins}\label{sub:translational_selection_on_phage_structural_proteins}
509:
510: % (fold)
511: In this section, we investigate a natural hypothesis concerning the patterns of
512: non-random CAI usage we have observed in phage genomes -- namely, that these
513: patterns may be driven by selection for translational accuracy and efficiency,
514: which is stronger in more highly expressed proteins \cite{Ikem81a,Sharp1984}.
515:
516: Among all phage proteins, the structural proteins are the most highly expressed
517: \cite{Hendrix2004}. The structural proteins form the protective capsid that
518: encloses the viral genome, as well as the tail, which is often used for
519: transmission of the phage genome to the inside of the host \cite{Roessner1983}.
These proteins must be produced in high copy number -- many tens of copies of
each type of structural protein are needed to form each of hundreds of viral progeny
522: \cite{Hendrix2004}. For each gene in a phage genome, we assigned a structural
523: annotation of 1 if the gene was known to encode a structural protein and 0
524: otherwise (see Methods).
525:
526: According to the standard hypothesis of translational selection, the
527: structural genes of phages should exhibit elevated CAI levels compared
528: to other phage genes, since they are translated (by the host) in high
529: copy numbers. To test this hypothesis, we performed regressions between
530: the structural annotation of phage genes and their aqua CAI and orange
531: BCAI p-values. In other words, we compared the structural properties of
532: genes against their CAI content, controlling for amino acid sequence,
533: and against their BCAI content, controlling for both amino acid sequence and
534: GC3 sequence.
535:
536: In the case of lambda phage, Figure \ref{fig:structural} shows the results of
537: the aqua CAI and orange BCAI randomization tests, with the structural genes
538: highlighted. The plot reveals a striking pattern: the vast majority of the
539: structural proteins lie on the left half of the genome, exactly in the region
540: where genes have elevated CAI values. In order to quantify this association we
541: performed ANOVAs. Before regressing structural
542: annotations against codon usage, we first removed the non-informative genes --
i.e. genes whose codon usage is influenced by their amino acid content, as
544: indicated by a failure to pass the aqua CAI test.
545:
546: Table \ref{tab:lambda_all_struct_non_aqua_orange} shows the results of the
547: regression between aqua CAI and orange BCAI $p^{>}$-values versus structural
548: annotations in lambda phage. The results are highly significant: structural
549: annotations explain half of the variation in CAI, even when controlling for
genes' amino acid sequences (aqua, $r^2$=56\%) as well as GC3 sequences (orange
551: test, $r^2$=46\%). The median $p^{>}$-value among structural genes is close to
552: zero, whereas the median $p^{>}$-value among non-structural genes is close to
553: one -- indicating that structural genes exhibit significantly \emph{elevated}
554: CAI values. These highly significant results are consistent with the hypothesis
555: of translational selection on structural proteins.
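
For a binary predictor such as the structural annotation, a one-way ANOVA and
a simple linear regression are equivalent, which is why the two terms can be
used interchangeably here. As a sketch under the standard definitions (the
notation is introduced only for this illustration, and the exact software
settings are not stated in this section), let $y_g$ be the $p^{>}$-value of
gene $g$, $\bar{y}$ the mean over all $n$ informative genes, and
$\bar{y}_{s(g)}$ the mean within the class (structural or non-structural) to
which gene $g$ belongs. Then
\begin{equation}
r^2 = 1 - \frac{\sum_{g} \left( y_g - \bar{y}_{s(g)} \right)^2}{\sum_{g} \left( y_g - \bar{y} \right)^2},
\qquad
F = \frac{r^2 \, (n-2)}{1 - r^2},
\end{equation}
where $F$ is compared with an $F$ distribution on $(1, n-2)$ degrees of
freedom. In this reading, $r^2 = 56\%$ means that the split into structural and
non-structural genes accounts for 56\% of the total sum of squares of the
$p^{>}$-values.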
556:
557: In order to examine the relationship between structural annotation and CAI
558: across all 74 phages in our study, we performed the same ANOVA on the 1,309
559: informative genes (i.e. genes that pass the aqua CAI randomization test). Once
560: again, Table \ref{tab:lambda_all_struct_non_aqua_orange} shows a highly
561: significant relationship between structural annotation and CAI values,
562: controlling for amino acid content and GC3. Thus, the tendency toward elevated
563: CAI values in structural genes holds across all the phages in this study,
564: despite the fact that they infect a diverse range of hosts with a wide
565: variety of GC contents.
566:
567: % subsection translational_selection_on_phage_structural_proteins (end)
568: % section results (end)
569: \section{Discussion}\label{sec:discussion}
570:
571: In this paper, we have introduced genome landscapes as a tool for visualizing
572: and analyzing long-range patterns of codon usage across a genome. In combination
573: with a series of randomization tests, we have applied this tool to study
574: synonymous codon usage in 74 fully sequenced phages that infect a diverse range
575: of bacterial hosts. Genome landscapes provide a convenient means to identify
576: long-range trends that are not apparent through conventional, gene-by-gene or
577: moving-window analyses. Using a statistical test that compares codon usage to
random trials, controlling for the amino acid sequence, we found that
many of the phages studied exhibit non-random variation
in codon usage. However, not all of the phages exhibit non-random variation, as
exemplified by phage bIL286 (Figure \ref{fig:green_orange_examples}(d)).
582:
583: In light of long-standing \cite{Ikem81a} and recent \cite{Kudla2006}
584: literature from other organisms, we have focussed on two aspects of
585: phage codon usage: variation in third-position GC/AT content (GC3) and
586: variation in the degree of adaptation to the `preferred' codons of the
587: host (CAI). Almost three-quarters of the phages in our study exhibit
588: non-random intragenomic patterns of codon usage, even when controlling
589: for the amino acid sequence encoded by the genome. Almost half of such
590: genomes also show non-random patterns of CAI when additionally
591: controlling for the GC3 sequence. In other words, there is substantial
592: variation in CAI above and beyond what would be expected by random
593: chance, given the amino acid and GC3 sequences of these genomes.
594:
595: We have also compared the CAI values of phage genes to their annotations
596: as structural or non-structural proteins. We have conclusively
597: demonstrated that phage genes encoding structural proteins exhibit
598: significantly elevated CAI values compared to the non-structural proteins
from the same genome. These results hold even when controlling for the
amino acid sequence and GC3 sequence of genes. Our
601: conclusions across a diverse range of phages are consistent with
602: early observations on lambda's codon usage \cite{Sanger1982},
603: early results for T7 \cite{Sharp1984}, and with the general hypothesis
604: of translational selection, which predicts elevated CAI in genes
605: expressed at high levels \cite{Ikem81a,Ikem81b,Sharp1987}. The pattern
of elevated CAI in structural proteins is particularly striking in the case
607: of lambda phage. It is also worth noting that we find no
608: significant relationship between a phage's life-history (i.e. temperate
609: versus non-temperate) and the degree to which its structural proteins
610: exhibit elevated CAI (see Table \ref{tab:temperate_non}). This
611: observation likely reflects the fact that at some point every phage,
612: regardless of its life history, must generate certain structural proteins in
high abundance -- and so it is beneficial to encode such proteins using
614: the host's translationally preferred codons.
615:
616: Our results on translational selection in phages shed light on the
617: nature of selection on viruses. The standard interpretation of elevated
618: CAI in highly expressed bacterial proteins assumes a fitness cost (per
619: molecule) associated with inefficient or inaccurate translation. We have
620: observed a similar relationship between expression level and CAI across
621: a diverse range of bacteriophages, which presumably do not incur a
622: direct energetic cost from inefficient translation by their hosts. Thus,
623: our results suggest that either there is an adaptive benefit (to the
624: virus) of elevated CAI in phage structural proteins, or that costs
625: incurred by the host bacterium also reduce the fitness of the virus.
626:
627: In addition to our results on CAI, we have also observed non-random patterns of
628: GC3 variation across the genomes of many phages. These patterns are highly
significant even after controlling for potential confounding factors, such as the
630: amino acid sequences and CAI sequences of genes. Unlike our results on CAI,
631: there is no clear mechanistic hypothesis underlying the non-random patterns of
632: GC3 in phages. It is possible that these patterns reflect selection for
633: efficient transcription \cite{Kudla2006} or for mRNA secondary structure. But in
634: the absence of independent information on such constraints, we cannot assess the
635: merits of these selective hypotheses, nor rule out the possibility of variation
636: in mutational biases across the phage genomes. It is interesting to note
637: that we find these significant non-random patterns of GC3 predominantly in
638: temperate phages (see Table \ref{tab:temperate_non}).
639:
640: Our study benefits from the number and breadth of phages we
641: have analyzed. Unlike previous studies, here we analyze phages whose
642: suspected hosts span a diverse range of bacteria, which themselves
643: differ in their genomic GC3 content and preferred codon choice. We have
644: calibrated CAI for each phage according to its primary host, and
645: nevertheless we find consistent relationships between CAI and viral
646: protein function. These results therefore conclusively extend the
647: classical theory of translational selection to the relationship between
648: viruses and their hosts.
649:
650: The present study also benefits from the development of randomization tests that
651: isolate the patterns of variation in CAI from variation in GC content. Due to
intrinsic biases in the GC content of the preferred codons of hosts, previous
studies on codon usage in phage have conflated these two types of synonymous
654: variation \cite{Sahu2004, Sahu2005,Sau2005,SauGosh2005}. The mechanisms
655: underlying GC3 variation and CAI variation likely differ, and so it is
656: critically important that we have analyzed each of these features controlling
657: for the other one.
658:
659: There is a large literature on the structure and evolution of phage genomes
660: which is pertinent to our analyses of phage codon usage. The genomes of phages
661: that infect \emph{E. coli}, \emph{L. lactis}, and \emph{Mycobacteria} are known
662: to be highly mosaic in structure
663: \cite{Juhala2000,Brussow2002,Hendrix2002,Lawrence2002,Pedulla2003,Hatfull2006}.
664: In other words, these genomes exhibit many similar local features that suggest
665: each genome was assembled from a common pool of bacteriophage genomic regions
666: \cite{Hendrix1999}. Recently, mosaicism was discussed in the lambdoid
667: phages focusing specifically on the \emph{E. coli} phages lambda, HK97 and N15
668: \cite{Hendrix2004}. We note that both HK97 and N15 have peaked landscape
669: structures like lambda, although not as pronounced, indicating that some degree
670: of mosaicism can be observed in genome landscapes among closely related phages.
The postulated mechanism for mosaicism is homologous and non-homologous
672: recombination between co-infecting phages or between a phage and a prophage
673: embedded in the host genome \cite{Hendrix1999,Brussow2002,Lawrence2001}. Some
674: have argued that the latter mechanism occurs more frequently, due to the large
675: number of lysogenized prophages in bacterial genomes \cite{Lawrence2001}.
676:
677: Lateral gene transfers could affect the codon usage patterns of phages,
678: especially if recombination occurs between phages whose preferred hosts
679: differ. In this case, the codon usage patterns of each phage may be expected to
680: reflect the preferred codons of their preferred hosts; a recent recombination
681: may result in regions of dramatically different codon usage from the average
682: phage codon usage. In particular, regions of unusual GC3 content in a phage
683: genome could reflect gene transfers between phages that typically infect hosts
684: of different GC3 content, in analogy with lateral gene transfer amongst
685: bacteria \cite{Ochman2000}. Morons are genes in phage genomes that are under
686: different transcriptional control than the rest of the phage genes, and are
687: often expressed when the phage is in the lysogenic state \cite{Hendrix2000}.
688: These morons have been observed to have very different nucleotide compositions
689: compared to the rest of the phage genome suggesting that they are the result of
690: such gene transfers \cite{Hendrix2000}. Thus one interpretation for our
691: observations of the 29 phages exhibiting non-random GC3 patterns is that these
692: genomes arose through recent recombination events, and have not subsequently
693: experienced enough time to equilibrate their GC3 content to that of their
694: current host. Given the lack of reliable estimates for time scales between
695: putative phage recombination events, or for codon usage equilibration, this
696: study neither supports nor refutes this interpretation. However, the
697: predominance of significant non-random patterns of GC3 in the genomes of
698: temperate phages (see Table \ref{tab:temperate_non}) may suggest that such
699: recombination occurs more frequently among temperate phage populations.
700:
701: We have demonstrated that phage genes encoding structural proteins exhibit
significantly elevated CAI values compared to the non-structural phage genes. These
results support the classical translational selection hypothesis, now extended to
704: the relationship between viral and host codon usage. We do not find much
705: variation in codon usage among the structural genes themselves. This observation
706: has two plausible interpretations within the literature of lateral gene
707: transfers: either phages of different preferred hosts rarely co-infect, or there
708: is substantially less recombination among the structural proteins of phages. The
709: latter hypothesis has been independently suggested for the capsid proteins of
710: phages, based on the idea that capsid proteins form a complex with
711: multiple physical interactions whose function would be disrupted by individual
712: gene transfer events \cite{Hendrix2002}. Unlike capsid genes, phage tail genes
often exhibit mosaicism, and they can include elements from diverse viruses
714: with variable host ranges \cite{Haggard-Ljungquist1992,Hendrix2002}. To
715: investigate this phenomenon in the context of codon usage, we refined the
structural annotation to separate head from tail genes (see Section
\ref{sub:structural_annotation}). We performed three separate ANOVAs to compare
718: the CAI usage in these genes: comparing head versus non-structural, tail versus
719: non-structural, and head versus tail (Table
720: \ref{tab:all_head_tail_aqua_orange}). These regressions indicate that the head
genes are primarily responsible for the pattern of elevated CAI in structural
722: proteins. In addition, we detect a difference in codon usage between head and
723: tail genes. These results have at least two possible explanations: either the
724: head proteins are produced in higher copy number than the tail proteins, or
725: lateral gene transfers between diverse phages occur frequently enough in the
726: tail genes to impair their ability to optimize codon usage to their current
727: host. The first hypothesis is very plausible, in light of evidence on the copy
728: number of head and tail proteins \cite{Hendrix2004}; nevertheless, we cannot
729: rule out the second possibility.
730:
731: % section discussion (end)
732: \section{Materials and Methods}\label{sec:materials_and_methods}
733:
734: % (fold)
735: \subsection{Bacteriophage Genomes}\label{sub:bacteriophage_genomes}
736:
737: % (fold)
738: Bacteriophage genomes were downloaded from NCBI's GenBank
739: (\verb=http://www.ncbi.nlm.nih.gov/Genbank/index.html=) release 156 (October,
740: 2006) using Biopython's \cite{biopython} NCBI interface. We only used
741: reference sequence (refseq) phage genome records with accessions
742: of the form NC\_00dddd in order to have the most complete records
743: available. Of the 396 phage refseq's available, we focused on the 74 genomes of
744: phages whose primary host, as listed in the \verb=specific_host= tag in the
GenBank file, was \emph{E. coli}, \emph{P. aeruginosa} or \emph{L. lactis}. (A
746: complete list of the accession numbers used can be found in the supplementary
747: material.)
748:
Before being used for the rest
of this study, every gene within a genome was scanned for overlaps with other
genes in the same genome, and all overlapping sequences were removed. A codon
752: was only retained if all three of its nucleotides occurred in a single open
753: reading frame. Thus the final genome sequence used was a concatenation
754: of all non-overlapping coding sequences, omitting any control elements and other
755: non-coding sequences.
756:
757: % subsection bacteriophage_genomes (end)
758: \subsection{Calculation of CAI Master
759: Tables}\label{sub:calculation_of_cai_master_tables}
760:
761: % (fold)
762: The definition of the Codon Adaptation Index requires the construction of a
763: `master' $w$-table for the host organism. Each of the 61 sense codons is
764: assigned a $w$-value based on the codon's frequency among the most highly
765: expressed genes in the host organism. In defining this set of genes, we follow
766: Sharp \cite{Sharp1987}, who specified highly expressed genes for \emph{E. coli}.
767:
In order to calculate the CAI master $w$-tables for \emph{P. aeruginosa} and \emph{L. lactis},
769: we identified the homologs of the highly expressed \emph{E. coli} genes within
770: the other host genomes, using BLAST \cite{Altschul1990}. In particular, we used
771: qblast to find homologs to these \emph{E. coli} genes by inputting the gene
772: protein sequences, and blasting (blastp) against the nr database, restricting
773: the database to include proteins of the target organism. In all cases, we used
774: the most significant blast result as the ortholog, provided its e-value was less
than $1\times10^{-10}$.
776:
777: The particular proteins used for each of these three hosts are as
778: follows (NCBI genome accession numbers listed in parentheses beside the
host name, GI numbers listed in parentheses beside each protein). \emph{E.
780: coli} (NC\_000913): 30S ribosomal protein S10 (16131200), 30S ribosomal
781: protein S21 (16130961), 30S ribosomal protein S12 (16131221), 30S
782: ribosomal protein S20 (16128017), 30S ribosomal protein S1 (16128878),
783: 30S ribosomal protein S2 (16128162), 30S ribosomal protein S15
784: (16131057), 30S ribosomal protein S7 (16131220), 50S ribosomal protein
785: L28 (16131508), 50S ribosomal protein L33 (16131507), 50S ribosomal
786: protein L34 (16131571), 50S ribosomal protein L11 (16131813), 50S
ribosomal protein L10 (16131815), 50S ribosomal protein L1 (1790416),
50S ribosomal protein L7/L12 (1790418), 50S ribosomal protein L17
789: (16131173), 50S ribosomal protein L3 (16131199), murein lipoprotein
790: (16129633), outer membrane protein A (3a;II*;G;d) (16128924), outer
791: membrane porin protein C (16130152), outer membrane porin 1a (Ia;b;F)
792: (16128896), protein chain elongation factor EF-Tu (duplicate of tufB)
793: (16131218), TufB (29140507), elongation factor Ts (16128163), elongation
794: factor EF-2 (16131219), recombinase A (16130606), molecular chaperone
795: DnaK (16128008); \emph{P. aeruginosa} (NC\_002516): elongation factor G
796: (15599462), 30S ribosomal protein S10 (15599460), 30S ribosomal protein
797: S21 (15595776), 30S ribosomal protein S12 (15599464), 30S ribosomal
798: protein S20 (15599759), 30S ribosomal protein S1 (15598358), 30S
799: ribosomal protein S2 (15598852), 30S ribosomal protein S15 (15599935),
800: 30S ribosomal protein S7 (15599463), 50S ribosomal protein L28
801: (15600509), 50S ribosomal protein L33 (15600508), 50S ribosomal protein
802: L34 (15600763), 50S ribosomal protein L11 (15599470), 50S ribosomal
803: protein L10 (15599468), 50S ribosomal protein L1 (15599469), 50S
804: ribosomal protein L7/L12 (15599467), 50S ribosomal protein L17
805: (15599433), 50S ribosomal protein L3 (15599459), probable outer membrane
806: protein precursor (15596238), elongation factor Tu (15599461),
807: elongation factor Ts (15598851), elongation factor G (15599462),
808: recombinase A (15598813), molecular chaperone DnaK (15599955); \emph{L. lactis}
809: (NC\_002662): 30S ribosomal protein S10 (15674082), 30S ribosomal
810: protein S21 (15672222), 30S ribosomal protein S12 (15674244), 30S
811: ribosomal protein S20 (15673721), 30S ribosomal protein S1 (15672820),
812: 30S ribosomal protein S2 (15674135), 30S ribosomal protein S15
813: (15673868), 30S ribosomal protein S7 (15674243), 50S ribosomal protein
814: L34 (15672113), 50S ribosomal protein L11 (15673983), 50S ribosomal
815: protein L10 (15673251), 50S ribosomal protein L1 (15673982), 50S
816: ribosomal protein L7/L12 (15673250), 50S ribosomal protein L17
817: (15674049), 50S ribosomal protein L3 (15674081), elongation factor Tu
818: (15673843), elongation factor Ts (15674134), elongation factor EF-2
819: (15674242), recombinase A (15672336), molecular chaperone DnaK
820: (15672936).
821:
822: Given the set of highly expressed genes, the CAI master $w$-table was
823: calculated as follows. For each host, the GenBank file (GenBank release
824: 156) was downloaded locally and transformed into a local data
825: structure using Biopython's \cite{biopython} GenBank parser. The
826: data structure was then scanned for each of the genes in the
827: highly translated gene set, and the collective CDS codon sequences of
these genes were concatenated together into one long sequence. Stop
codons and codons encoding methionine (M) and
tryptophan (W) (each encoded by only one codon) were removed
from the concatenated sequence. The frequencies of codons encoding all
other amino acids were then tabulated, and divided into groups according
to which amino acid they encode. The $w$-values were then calculated,
according to the procedure of Sharp \cite{Sharp1987}, as these
frequencies, normalized by the maximum frequency within each group. Thus
each amino acid has a codon with a $w$-value of 1, representing
the most commonly used codon for that amino acid. The $w$-values for
the stop codons and codons for methionine and tryptophan were set to the
average $w$-value of the remaining codons.
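
The normalization described above can be written compactly. In notation
introduced here only for illustration, let $x_{a,c}$ be the number of times
codon $c$ of amino acid $a$ occurs in the concatenated sequence of highly
expressed genes. The $w$-value of that codon, and the standard definition of
the CAI of a gene of $L$ codons as the geometric mean of its $w$-values, are
then
\begin{equation}
w_{a,c} = \frac{x_{a,c}}{\max_{c'} x_{a,c'}},
\qquad
\mathrm{CAI} = \left( \prod_{k=1}^{L} w_k \right)^{1/L},
\end{equation}
where $w_k$ is the $w$-value of the $k$-th codon of the gene. As a purely
hypothetical example (the counts are invented for illustration), if the
alanine codons GCT, GCC, GCA and GCG occur 40, 10, 20 and 30 times among the
highly expressed genes, their $w$-values are $1$, $0.25$, $0.5$ and $0.75$,
respectively.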
840:
841: % subsection calculation_of_cai_master_tables (end)
842: \subsection{Drawing Random Genomes According to
843: Constraints}\label{sub:drawing_random_genomes_according_to_constraints}
844:
845: % (fold)
846: Our randomization tests require drawing randomized phage genomes that are
847: constrained to have specific properties. In all of the randomization tests
848: discussed, the random sequences were drawn as a sequence of synonymous codons at
849: each position, thereby exactly preserving the amino acid sequences of proteins.
850:
851: The three randomization tests used in this work can all be considered variants
852: of a canonical randomization test that preserves both the amino acid sequence
853: and a bit mask sequence exactly, while drawing codons from the global,
genome-wide distribution. A bit mask sequence is a string of zeros and ones
855: corresponding to all codons in the genome. For example, GC3 is 1 if the third
856: position of a codon is G or C, and 0 otherwise.
857:
858: Using the GC3 bit mask as an example, the randomization test procedure is
859: initialized by calculating the global codon frequencies that fit into categories
860: specified by the amino acid and the bit-mask value. Each amino acid has
861: associated with it two distributions: one for a bit-mask value of 1 and one for
862: a bit-mask value of 0. For example, alanine (A), is encoded by four codons, GCC
(1), GCG (1), GCT (0), GCA (0), where the GC3 bit-mask is shown in parentheses.
864: Thus to calculate the codon distribution of alanine GC3 codons ($A_1$), we
865: compute the frequency of GCC and GCG codons across the whole phage genome.
866: Similarly, the distribution of $A_0$ codons is determined from the frequency of
867: GCT and GCA codons across the genome. In order to produce a random genome,
868: random codons are drawn at each position according to the distribution
869: associated with the position's amino acid and bit-mask value.
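
The drawing step can be stated as a conditional probability. In notation
introduced here only for illustration, let $n_c$ be the genome-wide count of
codon $c$, and let $S(a,b)$ be the set of codons that encode amino acid $a$ and
carry bit-mask value $b$. At a position with amino acid $a$ and bit-mask value
$b$, codon $c \in S(a,b)$ is drawn with probability
\begin{equation}
P(c \,|\, a, b) = \frac{n_c}{\sum_{c' \in S(a,b)} n_{c'}} .
\end{equation}
As a hypothetical example (the counts are invented), suppose GCC occurs 30
times and GCG occurs 10 times across a phage genome. At an alanine position
whose GC3 bit is 1, GCC is then drawn with probability $30/40 = 0.75$ and GCG
with probability $10/40 = 0.25$, while GCT and GCA are never drawn. Because
every draw stays within $S(a,b)$, both the amino acid sequence and the bit mask
are preserved exactly.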
870:
871: Thus the three null tests can be specified by the definition of the bit mask
along the sequence, which determines the constraints on the
randomized trials. The aqua randomization test constrains the amino acid
874: sequence and nothing else, and so its bit mask consists of all 1's. The orange
875: randomization test preserves the amino acid and the GC3, and so its bit mask is
876: the GC3 sequence mentioned above. The green randomization test preserves the
877: amino acid and BCAI exactly, thus its bit mask is the thresholded BCAI (1 if
878: BCAI $\geq$ 0.7, 0 otherwise).
879:
880: % subsection drawing_random_genomes_according_to_constraints (end)
881: \subsection{Structural Annotation}\label{sub:structural_annotation}
882:
883: % (fold)
884: All phage genes were annotated as structural or non-structural by inspecting
885: the annotations of high-scoring BLAST hits among viral proteins. This procedure is
886: described in detail below.
887:
888: Each gene was considered separately within each genome object, although overlaps
889: were removed in the process of creating the genome objects (see section
890: \ref{sub:bacteriophage_genomes}). The amino acid sequence of each gene was
891: blasted against all known viral protein sequences using Biopython's interface
892: \cite{biopython} to the NCBI blast utility \cite{Altschul1990}. Specifically, we
893: used the blastp utility specifying the nr database, with entrez query `Viruses
894: [ORGN]'. We retained only those BLAST hits with e-values below the cutoff
$1\times10^{-4}$. All words in the title of these BLAST hits were collected,
896: using white space as a word-delimiter.
897:
898: The unique words from the blast hits were then compared against a set of
structural keywords: ``capsid'', ``structural'', ``head'', ``tail'', ``fiber'',
``scaffold'', ``portal'', ``coat'', and ``tape''. The words associated with the
901: BLAST hits were scanned for matches to the keywords, where each keyword was
902: treated as a regular expression. As a result, partial matching was counted as a
903: match. For example, a BLAST title containing the word `head-tail' would match
904: both keywords `head' and `tail'. If a gene had at least one structural keyword
905: match in its BLAST hit title, it was annotated as structural. Otherwise, it was
906: annotated as non-structural.
907:
908: We further subdivided the structural annotation into two classes: head and tail
genes. Tail genes were identified with the keywords ``tail'', ``fiber'', and
``tape''. The remaining structural genes that did not contain any of these
911: keywords were annotated as head genes. Two false positives for tail
912: identification in the lambda phage genome were manually corrected.
913:
914: \subsection{Null Model: Results for Random Walk
915: Landscapes}\label{sub:null_model_results_for_random_walk_landscapes}
916:
917: % (fold)
918:
919: In the sections above we have compared the genome landscapes calculated
920: from real genome sequences to a null model in which the sequences are
921: randomly drawn from a defined distribution. In this section, we compute
922: several properties of genome landscapes calculated from these random
923: genomes.
924:
925: We write the general genome landscape of length $N$ as
926: \begin{equation}
927: F(m) = \sum_{i=1}^m (\eta(i) - \overline{\eta}),
928: \end{equation}
where $\eta(i)$ are independent, and chosen from a random distribution with
930: $\mathrm{var}(\eta(i)) = \langle \eta(i)^2 \rangle - \langle \eta(i) \rangle^2 =
931: \Delta$, and
932: \begin{equation}
933: \overline{\eta} = \frac{1}{N}\sum_{i=1}^N \eta(i),
934: \end{equation}
935: which ensures $F(0) = F(N) = 0$.
936:
The purple regions in Figure \ref{fig:land_hist} represent the standard deviation of the
genome landscapes of this null model at each $m$, $\sigma(m) = \sqrt{\langle F(m)^2
\rangle - \langle F(m) \rangle^2}$. Using the definitions above, we have
940: \begin{equation}
941: \begin{aligned}
942: F(m) &= \sum_{i=1}^m \eta(i)- \frac{m}{N}\sum_{i=1}^N \eta(i) \\
943: &= \left( \frac{m + (N-m)}{N} \right) \sum_{i=1}^m \eta(i)- \frac{m}{N}\sum_{i=1}^N \eta(i) \\
944: &= \frac{N-m}{N}\sum_{i=1}^m \eta(i) - \frac{m}{N}\sum_{i=m+1}^N \eta(i),
945: \end{aligned}
946: \end{equation}
947: and
948: \begin{equation}
949: \langle F(m) \rangle = \frac{m(N-m)\langle\eta\rangle}{N} - \frac{m(N-m)\langle\eta\rangle}{N} = 0.
950: \end{equation}
951: When we use $\langle \eta(i)\eta(j) \rangle = \langle \eta^2 \rangle \delta_{i,j}
952: + (1- \delta_{i,j}) \langle\eta\rangle^2$, with $\delta_{i,j} = 1$ if $i = j$ and 0 otherwise, we find
953: \begin{equation}
954: \begin{aligned}
955: \langle F(m)^2 \rangle &= \frac{m(N-m)}{N} (\langle\eta^2\rangle - \langle\eta\rangle^2) \\
956: &= \frac{\Delta m(N-m)}{N},
957: \end{aligned}
958: \end{equation}
959: leading to $\sigma(m) = \sqrt{\langle F(m)^2 \rangle - \langle F(m) \rangle^2} = \sqrt{\Delta m
960: (N-m)/N}$. In the case of GC3 landscapes, $\eta(i)$ is either 1 or 0 with equal
961: probability, giving $\Delta_{\mathrm{GC3}} = 1/4$.
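
The intermediate step in this calculation can be made explicit. The two sums
in the last line of the decomposition of $F(m)$ run over disjoint sets of
independent variables, so their variances add:
\begin{equation}
\begin{aligned}
\langle F(m)^2 \rangle - \langle F(m) \rangle^2
&= \left( \frac{N-m}{N} \right)^2 m \Delta + \left( \frac{m}{N} \right)^2 (N-m) \Delta \\
&= \frac{\Delta\, m (N-m)}{N^2} \left[ (N-m) + m \right]
= \frac{\Delta\, m (N-m)}{N} .
\end{aligned}
\end{equation}
As an illustrative case (the genome length is a hypothetical round number), a
GC3 landscape of $N = 10^4$ codons with $\Delta_{\mathrm{GC3}} = 1/4$ has, at
its midpoint $m = N/2$, a standard deviation of $\sigma(N/2) =
\sqrt{\tfrac{1}{4} \times 2500} = 25$.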
962:
963: We can also calculate the full probability distribution,
964: $P(f;m,N,\Delta)$ that the genome landscape of length $N$ has an intermediate
965: value $F(m) = f$, at point $m$, by considering an $N$-step random walk that is
966: constrained to start and stop at $0$. This probability distribution can be
967: written as a product of two conditional probabilities for a walk that starts at
968: $0$ and ends at $f$ in $m$ steps, and a walk that starts at $f$ and ends at $0$
969: in $N-m$ steps
\begin{equation}
\label{eq:P_decomp}
\begin{aligned}
P(f;m,N,\Delta) &= A G(0,f;m,\Delta) G(f,0;N-m,\Delta) \\
&= A G(0,f;m,\Delta) G(0,f;N-m,\Delta),
\end{aligned}
\end{equation}
977: where $A$ is a normalization constant, and the last step used the inversion
978: symmetry of the random walks. Thus we seek the form of the conditional
979: probability $G(0,f;m,\Delta)$. In the same way as in Eq. (\ref{eq:P_decomp}), we
980: decompose this conditional probability into a multiplication of the conditional
981: probabilities for two walks, one that starts at $0$ and ends at $y$ in $x$
982: steps, and one that starts at $y$ and ends at $f$ in $m-x$ steps, and integrate
983: over all possible intermediate values $y$
984: \begin{equation}
985: G(0,f;m,\Delta) = \int_{-\infty}^{\infty} \mathrm{d}y G(0,y;x,\Delta) G(y,f;m-x,\Delta).
986: \end{equation}
987: We can continue this decomposition for each intermediate step to give
988: \begin{equation}
989: G(0,f;m,\Delta) = \int_{-\infty}^{\infty} \mathrm{d}y_1 \ldots \int_{-\infty}^{\infty} \mathrm{d}y_{m-1} G(0,y_1;1,\Delta) G(y_1,y_2;1,\Delta) \ldots G(y_{m-1},f;1,\Delta).
990: \end{equation}
991: Keeping the order of integration the same, and noting that $G(y_1,y_2;1,\Delta)
992: = G(y_2 - y_1;1,\Delta)$ for these random walks, we can write $y_{i+1} - y_i =
993: s_{i+1}$ to give
994: \begin{equation}
G(0,f;m,\Delta) = \int_{-\infty}^{\infty} \mathrm{d}s_1 \ldots \int_{-\infty}^{\infty} \mathrm{d}s_m G(s_1;1,\Delta) G(s_2;1,\Delta) \ldots G(s_m;1,\Delta) \delta\left( \sum_{i=1}^m s_i - f\right),
996: \end{equation}
997: where the delta function is added to force the constraint that the sum of all
998: the intermediate steps must be equal to $f$. All of the intermediate conditional
999: probabilities now represent one step walks, and so are equal to the underlying
probability distribution of drawing a step size $s_i$, $p(s_i;\Delta)$
1001: \begin{equation}
G(0,f;m,\Delta) = \int_{-\infty}^{\infty} \mathrm{d}s_1 \ldots \int_{-\infty}^{\infty} \mathrm{d}s_m \delta\left( \sum_{i=1}^m s_i - f\right) \prod_{i=1}^m p(s_i;\Delta).
1003: \end{equation}
1004: Making use of the integral representation of the delta function \cite{Grosberg1994}
1005: \begin{equation}
1006: \delta(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \mathrm{d}k e^{-ikx},
1007: \end{equation}
1008: we have
1009: \begin{equation}
1010: G(0,f;m,\Delta) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \mathrm{d}k e^{-ikf} \tilde{p}(k;\Delta)^m,
1011: \end{equation}
where we have used the evenness of the delta function, $\delta(x) = \delta(-x)$,
and $\tilde{p}(k;\Delta)$ is the Fourier transform of $p(s;\Delta)$
\begin{equation}
\tilde{p}(k;\Delta) = \int_{-\infty}^{\infty} \mathrm{d}s e^{iks} p(s;\Delta) .
1015: \end{equation}
For the purpose of this discussion, we assume $p(s;\Delta)$ has a Gaussian form
$p(s;\Delta) = \frac{1}{\sqrt{2\pi\Delta}}e^{-\frac{s^2}{2\Delta}}$, and note that,
by the central limit theorem, the results hold at large $m$ for any step
distribution with zero mean and finite variance $\Delta$. In this case,
$\tilde{p}(k;\Delta) = e^{-\frac{k^2\Delta}{2}}$, and we have
1020: \begin{equation}
G(0,f;m,\Delta) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \mathrm{d}k e^{-m\Delta k^2/2}e^{-ikf} = \frac{1}{\sqrt{2\pi m\Delta}}e^{-f^2/(2m\Delta)}.
1022: \end{equation}
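As a brief check of the last equality (a sketch of the standard Gaussian
integral, included only for completeness), completing the square in the
exponent gives
\[
-\frac{m\Delta k^2}{2} - ikf = -\frac{m\Delta}{2}\left(k + \frac{if}{m\Delta}\right)^2 - \frac{f^2}{2m\Delta},
\]
so that the shifted integral over $k$ contributes a factor
$\sqrt{2\pi/(m\Delta)}$, which combines with the prefactor $1/2\pi$ to give
$1/\sqrt{2\pi m\Delta}$.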
1023: To determine $A$, we enforce the normalization condition
1024: \begin{equation}
1025: \int_{-\infty}^{\infty} \mathrm{d}f P(f;m,N,\Delta) = 1,
1026: \end{equation}
1027: which gives
\begin{equation}
\begin{aligned}
P(f;m,N,\Delta) &= \frac{1}{\sigma\sqrt{2\pi}}e^{-f^2/(2\sigma^2)}, \\
\sigma(m) &= \sqrt{\Delta\frac{m(N-m)}{N}}.
\end{aligned}
\end{equation}
1034: Note that from the full distribution, we can immediately identify $\sigma(m) =
1035: \sqrt{\langle F(m)^2 \rangle - \langle F(m) \rangle^2}$, confirming the explicit
1036: calculation above.
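The intermediate algebra can be sketched as follows, assuming (as the form of
the result suggests) that $P(f;m,N,\Delta)$ is proportional to the product of
the conditional probabilities for the two segments of the walk,
$P(f;m,N,\Delta) = A\, G(0,f;m,\Delta)\, G(f,0;N-m,\Delta)$. The exponents of
the two Gaussians add,
\[
\frac{f^2}{2m\Delta} + \frac{f^2}{2(N-m)\Delta} = \frac{f^2 N}{2\Delta m(N-m)} = \frac{f^2}{2\sigma(m)^2},
\]
and, under this assumed form, carrying out the normalization integral over $f$
gives $A = \sqrt{2\pi N\Delta}$. As an illustrative estimate, assuming
mean-centered GC3 steps with variance $\Delta = g(1-g)$ for a GC3 fraction $g$,
the lambda phage values in Table \ref{tab:phage_properties}, $g \approx 0.535$
and $N \approx 40{,}773/3 = 13{,}591$ codons, give a maximum width at $m = N/2$
of
\[
\sigma(N/2) = \sqrt{\frac{\Delta N}{4}} \approx \sqrt{\frac{0.249 \times 13{,}591}{4}} \approx 29 .
\]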
1037:
1038: \subsection{Acknowledgments}\label{sub:acknowledgments}
1039: % (fold)
1040:
1041: The authors would like to thank Herv\'{e} Isambert, Graham Hatfull, and Roger
1042: Hendrix for conversations and suggestions on this work. JBL and DRN would like
to thank the Institut Curie, Paris, for hospitality during the initial phases
1044: of this work. Work by DRN was supported by the National Science Foundation
1045: through grants DMR-0231631 and DMR-0213805. JBL acknowledges the financial
1046: support of the Fannie and John Hertz Foundation. JBP acknowledges
1047: support from the Burroughs Wellcome Fund.
1048:
1049: % subsection acknowledgements (end)
1050:
1051: % subsection null_model_results_for_random_walk_landscapes (end)
1052:
1053:
1054: \clearpage
1055: % subsection structural_annotation (end)
1056: % section materials_and_methods (end)
1057: % ---- FIGURES ------- (fold)
1058: \begin{figure}
1059: [p]
1060: \begin{center}
1061: \begin{tabular}
1062: {cc}
1063: \includegraphics[scale=0.8]{Lambda_GC3.pdf} &
1064: \includegraphics[scale=0.8]{Lambda_CAI.pdf} \\
1065: \includegraphics[scale=0.8]{Lambda_GC3_histogram.pdf} &
1066: \includegraphics[scale=0.8]{Lambda_CAI_histogram.pdf} \\
1067: \end{tabular}
1068: \end{center}
1069: \caption{{\bf GC3 and CAI landscapes for lambda phage.} Landscapes of GC3
(left) and CAI (right) measures of codon usage in lambda phage. Only
1071: coding sequences are considered, which when concatenated together are 40,773 bp
1072: long (see Table \ref{tab:phage_properties}). The GC3 landscape is the
1073: mean-centered cumulative sum of the GC3 content (GC3=1, AT3=0) of codons. The
1074: CAI landscape is the mean-centered cumulative sum of the log $w$-value for each
1075: codon. For each landscape, a region
1076: exhibiting an uphill slope corresponds to higher than average GC3 or CAI. The
1077: horizontal purple band represents the expected
1078: amount of variation in a random walk of GC3 or AT3 choices,
1079: given by Equation \eqref{eq:sigma}. Both landscapes exhibit features far
1080: outside of the purple bands, indicating that the patterns of codon
1081: usage are highly non-random. Gene boundaries are represented by the bars in the
histograms below each landscape. The heights of the bars in the histograms indicate
the GC3 and CAI values for each gene.} \label{fig:land_hist}
1084: \end{figure}
1085: \clearpage
1086: \begin{figure}
1087: [p]
1088: \begin{center}
1089: \begin{tabular}
1090: {cc}
1091: (a) \includegraphics[scale=0.8]{lambda_decay_GC3.pdf} &
1092: (b) \includegraphics[scale=0.8]{lambda_decay_CAI.pdf} \\
1093: \end{tabular}
1094: \end{center}
1095: \caption{ {\bf Snapshots of simulated
1096: synonymous mutation in the lambda phage genome.} Panel (a) shows GC3 and
1097: (b) shows CAI landscapes. In between successive snapshots (labeled
1098: by integers), $N$ synonymous mutations are introduced into the genome and the
1099: resulting landscape is shown, where $N$ is the number of codons in the lambda
1100: phage genome (see Section \ref{sub:genome_landscapes}). These snapshots show that
1101: the simulated genome landscapes approach the random null model, indicated by
1102: the purple band (see Figure \ref{fig:land_hist}). The final CAI landscape (3) lies
1103: almost completely within the purple band. Using the lambda phage mutation rate
1104: of $7.7\mathrm{x}10^{-8}$ mutations/bp/replication \cite{Drake1991}, we can
estimate that approximately $10^7$ genome replications would be
required for the landscapes
to relax to within the purple band.} \label{fig:land_decay}
1108: \end{figure}
1109:
1110: \clearpage
1111: \begin{figure}
1112: [p]
1113: \begin{center}
1114: \begin{tabular}
1115: {cc}
1116: \includegraphics[scale=0.8]{Lambda_aqua_GC3.pdf} &
1117: \includegraphics[scale=0.8]{Lambda_aqua_CAI.pdf} \\
1118: \includegraphics[scale=0.8]{Lambda_aqua_GC3_histogram_filtered.pdf} &
1119: \includegraphics[scale=0.8]{Lambda_aqua_CAI_histogram_filtered.pdf} \\
1120: \end{tabular}
1121: \end{center}
1122: \caption{{\bf Observed and randomized landscapes for lambda phage. } The figure
1123: shows the observed GC3 (left) and CAI (right) landscapes, plotted in black,
along with the mean and the $\pm 1$ and $\pm 2$ standard deviation ranges of
1125: randomized trials, shown in aqua (bold line, dark and light regions,
1126: respectively). The `aqua' randomization test shown here draws random
1127: synonymous codons that preserve the exact amino acid
1128: sequence, according to probabilities that preserve the global codon
1129: usage
1130: distribution of the lambda genome. For the most part, the observed landscapes
lie significantly outside the distribution of randomized landscapes, implying
that the amino acid content of genes is not responsible for the observed pattern
of the CAI landscape. In the lower panel, however, genes whose GC3 (left) or
CAI (right) values fall between the 0.025 and 0.975 quantiles of the random
1135: trials are shadowed in grey; the GC3/CAI values of such genes are not
1136: significantly different from random, given their amino acid sequence.}
1137: \label{fig:aqua}
1138: \end{figure}
1139: \clearpage
1140: \begin{figure}
1141: [p]
1142: \begin{center}
1143: \includegraphics[scale=0.6]{ecoli_master.pdf}
1144: \end{center}
1145: \caption{{\bf \emph{E. coli} codon usage master table.} The table of 61 codons
along with their associated $w$-values is shown for \emph{E. coli}.
1147: The $w$-value of each codon reflects its frequency in
1148: highly transcribed \emph{E. coli} genes (see main text). The table
1149: is divided into four regions: codons with high CAI ($w \geq 0.9$)
1150: ending in G or C (dark red); codons with high CAI ending in A or
T (dark blue); codons with low CAI ($w < 0.9$) ending in G or C
1152: (light red); codons with low CAI ending in A or T (light blue).
1153: As the table shows, there is
1154: a slight bias for GC3 in the high-CAI codons (58\%), and slight
1155: bias away from GC3 in the low-CAI codons (48\%).} \label{fig:E_coli_master}
1156: \end{figure}
1157: \clearpage
1158: \begin{figure}
1159: [p]
1160: \begin{center}
1161: \begin{tabular}
1162: {cc}
1163: \includegraphics[scale=0.8]{Lambda_blue_GC3.pdf} &
1164: \includegraphics[scale=0.8]{Lambda_orange_BCAI.pdf} \\
1165: \includegraphics[scale=0.8]{Lambda_blue_GC3_histogram.pdf} &
1166: \includegraphics[scale=0.8]{Lambda_orange_BCAI_histogram.pdf} \\
1167: \end{tabular}
1168: \end{center}
1169: \caption{{\bf Observed and randomized landscapes for lambda phage.}
1170: Observed landscapes are shown along with randomized landscapes
1171: associated with the `green' and `orange' tests.
1172: The green randomization procedure tests the
1173: significance of the GC3 landscape controlling for the observed
1174: CAI (actually, BCAI) variation across the genome. The orange
1175: randomization procedure tests the significance of the BCAI landscape,
1176: controlling for the observed GC3 variation across the genome.
1177: Both tests preserve the amino-acid sequence exactly.
1178: Both observed landscapes lie outside the distribution
1179: of random trials, indicating there is non-random GC3
1180: content controlling for CAI, and non-random CAI
1181: content controlling for GC3.} \label{fig:green_orange}
1182: \end{figure}
1183: \clearpage
1184: \begin{figure}
1185: [p]
1186: \begin{center}
1187: \begin{tabular}
1188: {ccc}
1189: \includegraphics[scale=1]{ecoli_CAI_master_cartoon} &
1190: \includegraphics[scale=1]{paeruginosa_CAI_master_cartoon} &
1191: \includegraphics[scale=1]{llactis_CAI_master_cartoon} \\
1192: \end{tabular}
1193: \end{center}
\caption{{\bf Schematics of preferred codon usage tables for
1195: \emph{E. coli}, \emph{P. aeruginosa}, and \emph{L. lactis} following the
1196: conventions of Figure \ref{fig:E_coli_master}.}
1197: Unlike \emph{E. coli},
1198: \emph{P. aeruginosa} strongly favors GC3 in high-CAI codons
1199: (94\%), and \emph{L. lactis} strongly favors AT3 in high-CAI
1200: codons (72\%).} \label{fig:master_cartoons}
1201: \end{figure}
1202: \clearpage
1203: \begin{figure}
1204: [p]
1205: % P2 NC_001895
1206: % T3 NC_003298
1207: % D3112 NC_005178
1208: % bIL286 NC_002667
1209: \begin{center}
1210: \begin{tabular}
1211: {cc}
1212: (a) \includegraphics[scale=0.5]{P2_green_GC3.pdf} &
1213: \includegraphics[scale=0.5]{P2_orange_BCAI.pdf} \\
1214: (b) \includegraphics[scale=0.5]{T3_green_GC3.pdf} &
1215: \includegraphics[scale=0.5]{T3_orange_BCAI.pdf} \\
1216: (c) \includegraphics[scale=0.5]{D3112_green_GC3.pdf} &
1217: \includegraphics[scale=0.5]{D3112_orange_BCAI.pdf} \\
1218: (d) \includegraphics[scale=0.5]{bIL286_green_GC3.pdf} &
1219: \includegraphics[scale=0.5]{bIL286_orange_BCAI.pdf} \\
1220: \end{tabular}
1221: \end{center}
1222: \caption{{\bf `Green' (left) and `orange' (right) randomization tests
for several phages.} Bacteriophages P2 (a) and T3 (b) both
1224: infect \emph{E. coli}. Phage D3112 (c) infects \emph{P. aeruginosa}.
1225: Phage bIL286 (d)
1226: infects \emph{L.
1227: lactis}. T3 is the only non-temperate phage of this group. See
1228: Table \ref{tab:phage_properties} for combined Fisher p-values for these tests.
In the case of bIL286, note the lack of evidence for codon bias in
the green and orange
tests, as confirmed by the insignificant $p$-values in
Table \ref{tab:phage_properties}. In this case, we cannot rule out the
1233: possibility that the observed pattern in GC3 is determined
1234: completely by the amino acid and CAI sequence (green), or that the observed pattern in
1235: CAI is determined by the amino acid and GC3 sequence (orange).}
1236: \label{fig:green_orange_examples}
1237: \end{figure}
1238: \clearpage
1239: \begin{figure}
1240: [p]
1241: \begin{center}
1242: \includegraphics[scale=1]{gla_orange_blue_fisher_extreme_hist.pdf}
1243: \end{center}
1244: \caption{{\bf Combined Fisher p-values for the `green' and `orange'
1245: randomization tests across 50 phage genomes.} Phage names are
1246: listed on the x-axis, and are sorted by their `orange' p-value.
1247: A total of 29 genomes exhibit non-random
1248: GC3 content controlling for CAI (green test); and a total of
22 genomes exhibit non-random
1250: CAI content controlling for GC3 (orange test). 17 genomes pass both of
1251: these tests. The dashed horizontal line indicates the
threshold for significance after Bonferroni correction (i.e. 5\%/50).
1253: Upwards arrows indicate p-values that lie beyond the limits of the
1254: y-axis. See Table \ref{tab:phage_properties} for phage properties,
1255: including the
p-values for these tests. Twenty-four phage genomes
1257: that failed the aqua GC3 or CAI control tests
1258: are not included in this figure.} \label{fig:green_orange_pass_genomes}
1259: \end{figure}
1260: \clearpage
1261: \begin{figure}
1262: [p]
1263: \begin{center}
1264: \begin{tabular}
1265: {cc}
1266: \includegraphics[scale=0.8]{Lambda_aqua_CAI_histogram_structural.pdf} &
1267: \includegraphics[scale=0.8]{Lambda_orange_BCAI_histogram_structural.pdf} \\
1268: \end{tabular}
1269: \end{center}
1270: \caption{{\bf The relationship between codon usage and protein
1271: function in lambda phage.} The figure shows the aqua
1272: (CAI, as in Figure \ref{fig:aqua}) and orange (BCAI, as in
1273: Figure \ref{fig:green_orange}) randomization tests
1274: overlaid with information about protein function:
1275: genes classified as structural are shown with a white background
1276: and all other genes with a grey background. The histograms
1277: indicate a clear relationship between the structural
1278: classification of a gene and its significance under the aqua
1279: and orange tests: structural genes typically have elevated
1280: quantiles in the aqua test, whereas other genes typically have
1281: depressed quantiles. In other words, structural genes
1282: exhibit elevated CAI values when controlling for their
1283: amino acid sequence, compared to codon usage in the
1284: genome as a whole. Moreover, as the orange histograms
1285: indicate, this pattern is not caused by variation in GC3 content:
1286: the structural genes exhibit elevated BCAI values after
1287: controlling for both their amino acid sequence and their
1288: GC3 sequence.} \label{fig:structural}
1289: \end{figure}
1290: \clearpage
1291:
1292: % section figures (end)
1293: % ------- Tables ------- (fold)
1294:
1295: \begin{table}
1296: \begin{center}
1297: \begin{tabular}{c|c|c|c}
1298: Test Name & Genome Properties Constrained & Genome Properties Varied & Figure \\
1299: \hline
1300: Aqua & amino acid sequence, global codon distribution & synonymous codons & \ref{fig:aqua} \\
Orange & amino acid and GC3 sequences & BCAI & \ref{fig:green_orange} \\
Green & amino acid and BCAI sequences & GC3 & \ref{fig:green_orange} \\
1303: \end{tabular}
1304: \end{center}
1305: \caption{Randomization test descriptions.
1306: The three randomization tests used in the paper
1307: are color-coded according to what genome properties
1308: are constrained in the random trials.}
1309: \label{tab:tests}
1310: \end{table}
1311: \clearpage
1312:
1313: \begin{table}
1314: % \begin{center}
1315: {\tiny
1316: \begin{tabular}{c|c|c|c|c|c|c|c|c|c}
1317: Name & Host & Accession & Lifestyle & \# Genes &
1318: Length & Coding Length & \%GC3 & Orange p-value & Green p-value \\
1319: \hline
1320: T5 & \ecoli & NC\_005859 & NT & 161 & 121,750 & 96,051 & 31.6 & $1.38\mathrm{x}10^{-31}$ & $1.71\mathrm{x}10^{-19}$ \\
1321: RB69 & \ecoli & NC\_004928 & NT & 273 & 167,560 & 156,147 & 29.0 & $1.25\mathrm{x}10^{-21}$ & $5.21\mathrm{x}10^{-01}$ \\
1322: phiEL & \paeru & NC\_007623 & NT & 201 & 211,215 & 194,850 & 57.8 & $7.38\mathrm{x}10^{-20}$ & $2.17\mathrm{x}10^{-09}$ \\
1323: RB49 & \ecoli & NC\_005066 &NT & 273 & 164,018 & 152,592 & 36.9 & $2.01\mathrm{x}10^{-18}$ & $2.48\mathrm{x}10^{-01}$ \\
1324: F116 & \paeru & NC\_006552 & T & 70 & 65,195 & 60,240 & 76.3 & $1.31\mathrm{x}10^{-10}$ & $6.31\mathrm{x}10^{-16}$ \\
1325: CTX & \paeru & NC\_003278 &T & 47 & 35,580 & 31,971 & 81.2 & $1.44\mathrm{x}10^{-09}$ & $6.82\mathrm{x}10^{-32}$ \\
1326: phiKMV & \paeru & NC\_005045 & NT & 49 & 42,519 & 38,310 & 79.9 & $3.25\mathrm{x}10^{-09}$ & $9.54\mathrm{x}10^{-03}$ \\
1327: T4 & \ecoli & NC\_000866 & NT & 269 & 168,903 & 153,660 & 24.3 & $4.59\mathrm{x}10^{-09}$ & $8.62\mathrm{x}10^{-01}$ \\
1328: lambda & \ecoli & NC\_001416 & T & 69 & 48,502 & 40,773 & 53.5 & $6.25\mathrm{x}10^{-09}$ & $5.10\mathrm{x}10^{-68}$ \\
1329: D3 & \paeru & NC\_002484 & T & 94 & 56,425 & 49,095 & 68.3 & $1.57\mathrm{x}10^{-08}$ & $3.85\mathrm{x}10^{-07}$ \\
1330: P2 & \ecoli & NC\_001895 & T & 42 & 33,593 & 30,411 & 54.7 & $5.60\mathrm{x}10^{-08}$ & $2.54\mathrm{x}10^{-61}$ \\
1331: P1 & \ecoli & NC\_005856 & T & 108 & 94,800 & 80,103 & 48.2 & $9.37\mathrm{x}10^{-08}$ & $3.51\mathrm{x}10^{-11}$ \\
1332: D3112 & \paeru & NC\_005178 & T & 55 & 37,611 & 34,908 & 80.4 & $3.05\mathrm{x}10^{-07}$ & $4.35\mathrm{x}10^{-05}$ \\
1333: WPhi & \ecoli & NC\_005056 &T & 43 & 32,684 & 29,601 & 56.4 & $8.39\mathrm{x}10^{-07}$ & $7.80\mathrm{x}10^{-55}$ \\
1334: K1F & \ecoli & NC\_007456 & NT & 43 & 39,704 & 34,629 & 53.4 & $1.75\mathrm{x}10^{-05}$ & $8.03\mathrm{x}10^{-02}$ \\
1335: T3 & \ecoli & NC\_003298 & NT & 47 & 38,208 & 29,694 & 54.3 & $3.50\mathrm{x}10^{-05}$ & $3.07\mathrm{x}10^{-04}$ \\
1336: PaP3 & \paeru & NC\_004466 & T & 71 & 45,503 & 41,115 & 58.1 & $5.09\mathrm{x}10^{-05}$ & $1.64\mathrm{x}10^{-19}$ \\
1337: phiV10 & \ecoli & NC\_007804 & T & 55 & 39,104 & 36,111 & 48.8 & $1.25\mathrm{x}10^{-04}$ & $9.38\mathrm{x}10^{-11}$ \\
1338: P27 & \ecoli & NC\_003356 & T& 58 & 42,575 & 37,707 & 50.5 & $2.24\mathrm{x}10^{-04}$ & $2.23\mathrm{x}10^{-20}$ \\
1339: 933W & \ecoli & NC\_000924 & T & 78 & 61,670 & 52,956 & 50.0 & $4.29\mathrm{x}10^{-04}$ & $8.88\mathrm{x}10^{-09}$ \\
1340: B3 & \paeru & NC\_006548 & T & 56 & 38,439 & 36,138 & 77.3 & $4.40\mathrm{x}10^{-04}$ & $3.33\mathrm{x}10^{-05}$ \\
1341: HK97 & \ecoli & NC\_002167 & T & 59 & 39,732 & 34,191 & 52.1 & $7.61\mathrm{x}10^{-04}$ & $1.19\mathrm{x}10^{-20}$ \\
1342: VT2-Sa & \ecoli & NC\_000902 & T & 83 & 60,942 & 52,647 & 51.3 & $1.31\mathrm{x}10^{-03}$ & $7.40\mathrm{x}10^{-07}$ \\
1343: PRD1 & \ecoli & NC\_001421 & NT & 21 & 14,925 & 11,988 & 47.6 & $2.99\mathrm{x}10^{-03}$ & $5.97\mathrm{x}10^{-02}$ \\
1344: JK06 & \ecoli & NC\_007291 & U & 71 & 46,072 & 32,841 & 43.0 & $3.84\mathrm{x}10^{-03}$ & $1.63\mathrm{x}10^{-03}$ \\
1345: T1 & \ecoli & NC\_005833 & NT & 77 & 48,836 & 44,010 & 47.7 & $7.45\mathrm{x}10^{-03}$ & $3.64\mathrm{x}10^{-01}$ \\
1346: Pf1 & \paeru & NC\_001331 & U & 12 & 7,349 & 6,282 & 75.7 & $9.66\mathrm{x}10^{-03}$ & $6.67\mathrm{x}10^{-01}$ \\
1347: HK022 & \ecoli & NC\_002166 & T & 57 & 40,751 & 33,885 & 52.7 & $1.25\mathrm{x}10^{-02}$ & $4.36\mathrm{x}10^{-18}$ \\
1348: 4268 & \llact & NC\_004746 & NT & 49 & 36,596 & 33,759 & 24.7 & $1.59\mathrm{x}10^{-02}$ & $3.20\mathrm{x}10^{-01}$ \\
1349: BP-4795 & \ecoli & NC\_004813 & T & 48 & 57,930 & 22,356 & 48.1 & $1.66\mathrm{x}10^{-02}$ & $3.29\mathrm{x}10^{-10}$ \\
1350: 186 & \ecoli & NC\_001317 &T & 43 & 30,624 & 27,747 & 58.7 & $4.02\mathrm{x}10^{-02}$ & $1.79\mathrm{x}10^{-22}$ \\
1351: I2-2 & \ecoli & NC\_001332 & U & 8 & 6,744 & 5,166 & 35.0 & $6.91\mathrm{x}10^{-02}$ & $1.01\mathrm{x}10^{-01}$ \\
1352: phiKZ & \paeru & NC\_004629 & NT & 306 & 280,334 & 243,384 & 26.8 & $1.32\mathrm{x}10^{-01}$ & $1.79\mathrm{x}10^{-14}$ \\
1353: bIL312 & \llact & NC\_002671 & T & 27 & 15,179 & 11,292 & 28.1 & $1.49\mathrm{x}10^{-01}$ & $8.85\mathrm{x}10^{-04}$ \\
1354: HK620 & \ecoli & NC\_002730 & T & 58 & 38,297 & 33,717 & 45.9 & $1.61\mathrm{x}10^{-01}$ & $1.41\mathrm{x}10^{-05}$ \\
1355: Mu & \ecoli & NC\_000929 & T & 54 & 36,717 & 33,900 & 54.1 & $1.68\mathrm{x}10^{-01}$ & $4.49\mathrm{x}10^{-10}$ \\
1356: P4 & \ecoli & NC\_001609 & T & 14 & 11,624 & 9,765 & 52.4 & $1.71\mathrm{x}10^{-01}$ & $4.17\mathrm{x}10^{-18}$ \\
1357: N15 & \ecoli & NC\_001901 & T & 59 & 46,375 & 41,472 & 54.9 & $2.17\mathrm{x}10^{-01}$ & $1.38\mathrm{x}10^{-09}$ \\
1358: Stx2 I & \ecoli & NC\_003525 & T & 97 & 61,765 & 34,932 & 48.4 & $3.04\mathrm{x}10^{-01}$ & $4.23\mathrm{x}10^{-04}$ \\
1359: bIL286 & \llact & NC\_002667 & T & 61 & 41,834 & 38,694 & 24.8 & $3.68\mathrm{x}10^{-01}$ & $1.17\mathrm{x}10^{-01}$ \\
1360: Tuc2009 & \llact & NC\_002703 & T & 56 & 38,347 & 35,178 & 28.0 & $4.08\mathrm{x}10^{-01}$ & $1.81\mathrm{x}10^{-02}$ \\
1361: Stx2 II & \ecoli & NC\_004914 &T & 99 & 62,706 & 34,755 & 50.1 & $5.85\mathrm{x}10^{-01}$ & $9.94\mathrm{x}10^{-03}$ \\
1362: BK5-T & \llact & NC\_002796 & T & 52 & 40,003 & 33,267 & 24.0 & $5.91\mathrm{x}10^{-01}$ & $6.68\mathrm{x}10^{-01}$ \\
1363: Stx1 & \ecoli & NC\_004913 & T & 93 & 59,866 & 33,444 & 49.5 & $6.75\mathrm{x}10^{-01}$ & $2.97\mathrm{x}10^{-03}$ \\
1364: LC3 & \llact & NC\_005822 &T & 51 & 32,172 & 29,607 & 24.6 & $7.31\mathrm{x}10^{-01}$ & $4.90\mathrm{x}10^{-01}$ \\
1365: ul36 & \llact & NC\_004066 & NT & 58 & 36,798 & 32,400 & 27.7 & $8.64\mathrm{x}10^{-01}$ & $4.66\mathrm{x}10^{-02}$ \\
1366: Pf3 & \paeru & NC\_001418 &U & 9 & 5,833 & 5,487 & 35.9 & $8.70\mathrm{x}10^{-01}$ & $1.64\mathrm{x}10^{-06}$ \\
1367: bIL285 & \llact & NC\_002666 &T & 62 & 35,538 & 32,646 & 26.7 & $9.20\mathrm{x}10^{-01}$ & $9.93\mathrm{x}10^{-01}$ \\
1368: r1t & \llact & NC\_004302 &T & 50 & 33,350 & 30,315 & 25.4 & $9.53\mathrm{x}10^{-01}$ & $6.03\mathrm{x}10^{-01}$ \\
1369: bIL170 & \llact & NC\_001909 & T & 63 & 31,754 & 27,663 & 27.1 & $9.91\mathrm{x}10^{-01}$ & $8.71\mathrm{x}10^{-01}$ \\
1370: \end{tabular}
1371: }
1372: % \end{center}
1373: \caption{Phage properties. Properties are listed for all phages included in
1374: Figure \ref{fig:green_orange_pass_genomes}, in the same order based on the
1375: orange p-value. Lifestyle annotations are T (temperate), NT (non-temperate),
1376: U (unknown). The coding length refers to the length of all coding sequences
concatenated together (see Methods).}
1378: \label{tab:phage_properties}
1379: \end{table}
1380:
1381: \begin{table}
1382: \begin{center}
1383: \begin{tabular}
1384: {c|c|c}
1385: & Lambda & All Phage Genes \\
1386: \hline
1387: Number structural & 7 & 279 \\
1388: Number non-structural & 18 & 1022 \\
1389: \hline
1390: \multicolumn{3}{c}{Aqua CAI Randomization Test} \\
1391: \hline
1392: median $p^{>}$ structural & $1.3\mathrm{x}10^{-4}$ & $8.0\mathrm{x}10^{-3}$ \\
1393: median $p^{>}$ non-structural & 1.0 & 1.0 \\
1394: ANOVA significance & $p=4.5\mathrm{x}10^{-5}$ & $p=4.7\mathrm{x}10^{-12}$ \\
1395: \hline
1396: \multicolumn{3}{c} {Orange BCAI Randomization Test} \\
1397: \hline
1398: median $p^{>}$ structural & $2.8\mathrm{x}10^{-2}$ & $2.0\mathrm{x}10^{-1}$ \\
1399: median $p^{>}$ non-structural & 0.98 & 0.73 \\
1400: ANOVA significance & $p=1.8\mathrm{x}10^{-4}$ & $p=1.6\mathrm{x}10^{-15}$ \\
1401: \end{tabular}
1402: \end{center}
\caption{Structural annotation versus codon usage. The table shows
the median $p^>$ values among structural and non-structural genes,
under the aqua and orange randomization tests. Small $p^>$ values indicate
significantly elevated CAI, controlling for the amino acid sequence
(aqua test) and the GC3 sequence (orange test). We also report the
significance of non-parametric ANOVAs that compare median $p^>$-values between
1409: the structural and non-structural genes. Analyses are limited to
1410: those genes that pass the aqua test, as described in the main text;
1411: similar results are found without this restriction.
1412: }
1413: \label{tab:lambda_all_struct_non_aqua_orange}
1414: \end{table}
1415: \clearpage
1416:
1417: \begin{table}
1418: \begin{center}
1419: \begin{tabular}
1420: {c|c}
1421: & All Phage Genes \\
1422: \hline
1423: Number `Head' & 145 \\
1424: Number `Tail' & 134 \\
1425: Number non-structural (NS) & 1022 \\
1426: \hline
1427: \multicolumn{2}{c}{Aqua CAI Randomization Test} \\
1428: \hline
1429: median $p^{>}$ head & $2.0\mathrm{x}10^{-3}$ \\
1430: median $p^{>}$ tail & $2.0\mathrm{x}10^{-2}$ \\
1431: median $p^{>}$ NS & 1.0 \\
1432: ANOVA Head vs NS & $p=6.4\mathrm{x}10^{-19}$ \\
1433: ANOVA Tail vs NS & $p=1.8\mathrm{x}10^{-1}$ \\
1434: ANOVA Head vs Tail & $p=2.1\mathrm{x}10^{-8}$ \\
1435: \hline
1436: \multicolumn{2}{c} {Orange BCAI Randomization Test} \\
1437: \hline
1438: median $p^{>}$ head & $7.0\mathrm{x}10^{-2}$ \\
1439: median $p^{>}$ tail & $4.3\mathrm{x}10^{-1}$ \\
1440: median $p^{>}$ NS & 0.73 \\
1441: ANOVA Head vs NS & $p=4.2\mathrm{x}10^{-21}$ \\
1442: ANOVA Tail vs NS & $p=1.7\mathrm{x}10^{-2}$ \\
1443: ANOVA Head vs Tail & $p=6.0\mathrm{x}10^{-8}$ \\
1444: \end{tabular}
1445: \end{center}
1446: \caption{Comparison between codon usage and refined structural
1447: annotations.
1448: As in Table \ref{tab:lambda_all_struct_non_aqua_orange},
1449: we compare the median aqua and orange $p^>$ values among head genes, tail
1450: genes, and non-structural genes. We report the significance of
1451: pairwise non-parametric ANOVAs comparing head to non-structural, tail
1452: to non-structural, and head to tail genes.
1453: These analyses are limited to genes that pass the aqua test;
1454: similar results are found without this
1455: restriction.
1456: }
1457: \label{tab:all_head_tail_aqua_orange}
1458: \end{table}
1459:
1460: \clearpage
1461: \begin{table}
1462: \begin{center}
1463: \begin{tabular}
1464: {c|c}
1465: \multicolumn{2}{c}{Median $p_{\text{combined}}^{\mathrm{orange}}$} \\
1466: \hline
1467: Temperate & $1.4\mathrm{x}10^{-2}$ \\
1468: Non-temperate & $2.6\mathrm{x}10^{-5}$ \\
1469: Un-identified & $4\mathrm{x}10^{-2}$ \\
1470: ANOVA significance & $p = 0.1$ \\
1471: \hline
1472: \multicolumn{2}{c}{Median $p_{\text{combined}}^{\mathrm{green}}$} \\
1473: \hline
1474: Temperate & $5.1\mathrm{x}10^{-9}$ \\
1475: Non-temperate & $7.0\mathrm{x}10^{-2}$ \\
1476: Un-identified & $5\mathrm{x}10^{-2}$ \\
1477: ANOVA significance & $p = 0.009$ \\
1478: \end{tabular}
1479: \end{center}
1480: \caption{{\bf Phage lifestyle versus codon usage}. The table shows the median
1481: $p_{\text{combined}}^{\mathrm{orange}}$ and
1482: $p_{\text{combined}}^{\mathrm{green}}$ values among phages classified as
1483: temperate, non-temperate, or un-identified for all phages included in Figure
1484: \ref{fig:green_orange_pass_genomes} and Table \ref{tab:phage_properties}. Small
1485: median $p_{\text{combined}}^{\mathrm{orange}}$ values indicate that these phages
1486: have significantly non-random (in either direction) BCAI, controlling for the
1487: amino acid sequence and the GC3 sequence, while small median
1488: $p_{\text{combined}}^{\mathrm{green}}$ values indicate that these phages have
1489: significantly non-random (in either direction) GC3, controlling for the amino
1490: acid sequence and the BCAI sequence. We also report the significance of
non-parametric ANOVAs that compare these medians between these groups of phages.
1492: }
1493: \label{tab:temperate_non}
1494: \end{table}
1495: % section tables (end)
1496: \clearpage
1497: \bibliography{GLA,GLA_lux}
1498:
1499: \end{document}
1500:
1501: