0708.2038/GLA.tex
1: \documentclass[amsmath,amssymb,aps]{revtex4} 
2: \usepackage{graphicx} 
3: \usepackage{bm} 
4: \usepackage[usenames]{color} 
5: \usepackage{multirow}
6: \usepackage{amsmath}
7: 
8: \pdfoutput=1
9: 
10: % \newcommand{\julius}[1]{\textbf{\textcolor{blue}{#1}}} 
11: \newcommand{\julius}[1]{#1} 
12: \newcommand{\josh}[1]{\textbf{\textcolor{red}{#1}}}
13: \newcommand{\ecoli}{\emph{E. coli}}
14: \newcommand{\paeru}{\emph{P. aeruginosa}}
15: \newcommand{\llact}{\emph{L. lactis}}
16: 
17: \begin{document}
18: 
19: \title{Genome landscapes and \\
20: bacteriophage codon usage}
21: 
22: \author{Julius B. Lucks$^1$} \author{David R. Nelson$^{1,2}$}
23: \author{Grzegorz Kudla$^1$} \author{Joshua B. Plotkin$^{3,*}$}
24: \affiliation{ $^1$FAS Center for Systems Biology, Harvard University,
25: \\ $^2$ Lyman Laboratory of Physics, Harvard
26: University\\ $^3$ Department of Biology, University
27: of Pennsylvania\\ $^*$E-mail:
28: jplotkin@sas.upenn.edu }
29: 
30: \date{\today} 
31: \begin{abstract}
32:     
33:     Across all kingdoms of biological life, protein-coding genes exhibit
34:     unequal usage of synonymous codons. Although alternative theories abound,
35:     translational selection has been accepted as an important mechanism that
36:     shapes the patterns of codon usage in prokaryotes and simple eukaryotes.
37:     Here we analyze patterns of codon usage across 74 diverse bacteriophages
38:     that infect \emph{E. coli}, \emph{P. aeruginosa} and \emph{L. lactis} as
39:     their primary host. We introduce the concept of a `genome landscape,' which
40:     helps reveal non-trivial, long-range patterns in codon usage across a
41:     genome. We develop a series of randomization tests that allow us to
42:     interrogate the significance of one aspect of codon usage, such as GC
43:     content, while controlling for another aspect, such as adaptation to
44:     host-preferred codons. We find that 33 phage genomes exhibit highly
45:     non-random patterns in their GC3-content, use of host-preferred codons, or
46:     both. We show that the head and tail proteins of these phages 
47:     exhibit significant bias towards host-preferred codons, relative
48:     to the non-structural phage proteins. Our results support the hypothesis of
49:     translational selection on viral genes for host-preferred codons, over a
50:     broad range of bacteriophages.
51:     
52: 
53: \end{abstract}
54: 
55: 
56: 
57: \maketitle
58: 
59: \section{Introduction}\label{sec:introduction} 
60: 
61: The genomes of most organisms exhibit significant codon bias -- that is, the
62: unequal usage of synonymous codons. There are longstanding and contradictory
63: theories to account for such biases. Variation in codon usage between taxa,
64: particularly within mammals, is sometimes attributed to neutral processes --
65: such as mutational biases during DNA replication, repair, and gene conversion
66: \cite{Bern95,Francino1999,Galtier2003,Eyre91}.
67: 
68: There are also theories for codon bias driven by selection. Some researchers
69: have discussed codon bias as the result of selection for regulatory function
70: mediated by ribosome pausing \cite{LawrHart91}, or selection against
71: pre-termination codons \cite{Fitc80,ModiBatt81}. However, the dominant selective
72: theory of codon bias in organisms ranging from \textit{E. coli} to
73: \textit{Drosophila} posits that preferred codons correlate with the relative
74: abundances of isoaccepting tRNAs, thereby increasing translational efficiency
75: \cite{ZuckPaul65,Ikem81a,Ikem85,PoweMori97,DebrMarz94,SoreKurl89} and accuracy
76: \cite{Akas94}. This theory helps to explain why codon bias is often more extreme
77: in highly expressed genes \cite{Ikem81b}, or at highly conserved sites within a
78: gene \cite{Akas94}. Translational selection may also explain variation in codon
79: usage between genes selectively expressed in different tissues
80: \cite{Plotkin2004,Dittmar2006}. However, recent work suggests that synonymous
81: variation, particularly with respect to GC content, affects transcriptional
82: processes as well \cite{Kudla2006}.
83: 
84: The codon usage of viruses has also received considerable attention
85: \cite{Jenkins2003,PlotDush03}, particularly in the case of bacteriophages
86: \cite{Sharp1984,Kunisawa1998,Sahu2004, Sahu2005,Sau2005,SauGosh2005}. Most work
87: along these lines has focused on individual phages, or on the patterns of
88: genomic codon usage across a handful of phages of the same host.
89: 
90: Here, we provide a systematic analysis of intragenomic variation in
91: bacteriophage codon usage, using 74 fully sequenced viruses that infect a
92: diverse range of bacterial hosts. Motivated by energy landscapes associated with
93: DNA unzipping \cite{LubenskyNelson2002,Weeks2005}, we develop a novel
94: methodological tool, called a genome landscape, for studying the long-range
95: properties of codon usage across a phage genome. We introduce a series of
96: randomization tests that isolate different features of codon usage from each
97: other, and from the amino acid sequence of encoded proteins. More than twenty of
98: the phages in our analysis are shown to exhibit non-random variation in
99: synonymous GC content, non-random variation in codons adapted for
100: host translation, or both. Additionally, we demonstrate that phage genes
101: encoding structural proteins are significantly more adapted to host-preferred
102: codons compared to non-structural genes. We discuss our results in the context
103: of translational selection and lateral gene transfer amongst phages.
104: 
105: 
106: % section introduction (end)
107: \section{Results}\label{sec:results} 
108: 
109: % (fold)
110: \subsection{Genome Landscapes}\label{sub:genome_landscapes} 
111: 
112: % (fold)
113: We start by introducing the concept of a genome landscape, which provides a
114: simple means for visualizing long-range correlations of sequence properties
115: across a genome. A genome landscape is simply a cumulative sum of a specified
116: quantitative property of codons. The calculation of the cumulative sum is
117: straightforward, and it consists of scanning over the genome sequence one codon
118: at a time, gathering the property of each codon, and summing it with the
119: properties of previous codons in the genome sequence. 
120: Similar cumulative
121: sums are used in solid-state physics for, e.g., the
122: calculation of energy levels 
123: \cite{Ashcroft1976}. 
124: In the case of the GC3 landscape, we have
125: \begin{equation} 
126:     \label{eq:FGC3}
127:     F_{\mathrm{GC3}}(m) = \sum_{i=1}^m
128: (\eta_{\mathrm{GC3}}(i) - \overline{\eta_{\mathrm{GC3}}}) 
129: \end{equation}
130: where $\eta_{\mathrm{GC3}}(i)$ equals one or zero, depending upon whether the
131: $i^{th}$ codon ends in a G/C or A/T, respectively. Note that we subtract the
132: genome-wide average GC3 content, $\overline{\eta_{\mathrm{GC3}}}$, so that
133: $F_{\mathrm{GC3}}(0) = F_{\mathrm{GC3}}(N) = 0$, where $N$ is the length of the
134: genome in codons. In other words, we convert the genome codon sequence into a binary
135: string of 1's and 0's according to whether each codon is of type GC3 or AT3, and
136: we cumulatively sum this sequence to compute $F_{\mathrm{GC3}}(m)$.
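As a minimal worked illustration, consider a hypothetical four-codon
sequence (a toy example, not taken from any real genome) whose codons end in
G, C, A and T, so that $\eta_{\mathrm{GC3}} = (1, 1, 0, 0)$ and
$\overline{\eta_{\mathrm{GC3}}} = 1/2$. The landscape then takes the values
\begin{equation*}
    F_{\mathrm{GC3}}(1) = \tfrac{1}{2}, \quad
    F_{\mathrm{GC3}}(2) = 1, \quad
    F_{\mathrm{GC3}}(3) = \tfrac{1}{2}, \quad
    F_{\mathrm{GC3}}(4) = 0,
\end{equation*}
rising over the GC3-rich first half, falling over the AT3-rich second half,
and returning to zero at the end of the sequence.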
137: 
138: The interpretation of a GC3 landscape is straightforward. Regions of the genome
139: whose landscape exhibits an uphill slope contain higher than average GC3
140: content, whereas regions of downhill slope contain lower than average GC3
141: content. The genome landscape provides an efficient visualization of long-range
142: correlations in sequence properties across a genome, similar to the techniques
143: introduced by Karlin \cite{Karlin1993}.
144: 
145: Traditional visualizations of GC3 content involve moving window averages of
146: \%GC3 over the genome \cite{Gregory2006}. In order to compare these techniques
147: with the landscape approach, we focus on the \emph{E. coli} phage lambda as an
148: illustrative example. Figure \ref{fig:land_hist} (a) shows the lambda phage GC3
149: landscape above its associated ``GC3 histogram''. The histogram shows the GC3
150: content of each gene, and the width of each histogram bar reflects the length of
151: the corresponding gene. The figure reveals a striking pattern of lambda phage
152: codon usage: the genome is apparently divided into two halves that contain
153: significantly different GC3 contents \cite{Inman1966,Sanger1982}. The large
154: region of uphill slope on the left half of the GC3 landscape reflects the fact
155: that the majority of the genes in this region contain an excess of codons that
156: end in G or C. This trend is also reflected in the GC3 histogram bars, which are
157: higher than average in the left half of the genome (Figure \ref{fig:land_hist}).
158: 
159: Genome landscapes also provide a natural means of evaluating whether or not
160: features of codon usage are due to random chance. Under a null model in
161: which
162: the $\eta(i)$'s above are chosen as independent random variables
163: with $\mathrm{var}(\eta(i)) = \langle \eta(i)^2 \rangle
164: - \langle \eta(i) \rangle^2 = \Delta$, one can show (see
165: Methods) that the standard
166: deviation of $F_{\mathrm{GC3}}(m)$ is
167: \begin{equation}
168:     \label{eq:sigma}
169:     \sigma_{\mathrm{GC3}}(m) = \sqrt{\langle
170:     F_{\mathrm{GC3}}(m)^2 \rangle - \langle F_{\mathrm{GC3}}(m) \rangle^2} = \sqrt{\frac{\Delta_{\mathrm{GC3}} m (N-m)}{N}}.    
171: \end{equation}
172: This quantity is shown as a purple band in Figure
173: \ref{fig:land_hist}. For $\eta(i)$'s chosen to be 0 or 1 with equal probability, 
174: $\Delta_{\mathrm{GC3}} = 1/4$ and the maximum width $\sqrt{N}/4$ is
175: obtained at $m= N/2$. Since the scale of variation across the lambda phage GC3
176: landscape is much greater than its expectation under the null, we can
177: conclude that the distribution of G/C versus A/T ending codons is
178: highly non-random in the lambda phage genome.
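The following is only a brief sketch of why Eq. \ref{eq:sigma} is plausible
under the stated independence assumption; it is not a substitute for the
derivation referred to in Methods. Writing the genome-wide average as
$\overline{\eta} = \frac{1}{N}\sum_{j=1}^{N} \eta(j)$, the landscape is a
linear combination of the independent $\eta(i)$'s,
\begin{equation*}
    F(m) = \left(1 - \frac{m}{N}\right) \sum_{i=1}^{m} \eta(i)
    - \frac{m}{N} \sum_{i=m+1}^{N} \eta(i),
\end{equation*}
so that its variance is
\begin{equation*}
    \mathrm{var}(F(m)) = m \Delta \left(1 - \frac{m}{N}\right)^2
    + (N-m) \Delta \left(\frac{m}{N}\right)^2
    = \frac{\Delta\, m (N-m)}{N}.
\end{equation*}
For example, for a hypothetical genome of $N = 10^4$ codons with
$\Delta = 1/4$, the band is widest at $m = N/2$, where
$\sigma = \sqrt{N}/4 = 25$.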
179: 
180: We can also gain intuition about the degree of non-randomness in the GC3
181: landscape by considering what would happen if the lambda phage genome were to
182: accumulate random synonymous mutations.  Figure
183: \ref{fig:land_decay}(a) shows snapshots of the lambda GC3 landscape as
184: we simulate synonymous mutations to the genome. Between each snapshot,
185: $N$ synonymous mutations were introduced by
186: picking a codon at random along the genome, and then choosing a new
187: synonymous codon at random according to the global lambda phage codon
188: distribution. As more mutations are introduced, the GC3 landscape of the
189: synonymously mutated lambda genome approaches the purple band,
190: indicating that the GC3 pattern in the real lambda phage genome is
191: highly non-random.
192: 
193: The procedure of producing a genome landscape can be applied to other
194: properties of codon usage. In addition to GC3, we will study patterns in
195: the Codon Adaptation Index (CAI).  CAI measures the similarity of a
196: gene's codon usage to the `preferred' codons of an organism
197: \cite{Sharp1987} -- in this case, the host bacterium of the phage under
198: study.  Every bacterium has a preferred set of codons defined as the
199: codons, one for each amino acid, that occur most frequently in genes
200: that are translated at high abundance. These genes are often taken to be
201: the ribosomal proteins and translational elongation factors
202: \cite{Sharp1987} (see Methods).
203: 
204: In order to calculate CAI, the preferred codons are each assigned a weight $w =
205: 1$. The remaining codons are assigned weights according to their frequency in
206: the highly-translated genes, relative to the frequency of the $w=1$ codon. The
207: CAI of a gene is defined as the geometric mean of the $w$-values for its codons
208: \begin{equation}
209: \label{eq:CAI_def} \mathrm{CAI} = \left(\prod_{i=1}^{M} w_i\right)^{1/M},
210: \end{equation} 
211: where $w_i$ is the $w$-value of the $i^{th}$ codon, and
212: $M$ is the length of the gene. This quantity can be re-written as
213: \begin{equation} \mathrm{CAI} = \exp\left(\frac{1}{M} \sum_{i=1}^{M}
214: \ln(w_i)\right).  
215: \end{equation} 
216: The latter formulation is more useful for calculating genome landscapes,
217: because the argument of the exponential function is now a sum of the logs of the
218: $w$-values. Therefore, we define the CAI landscape as \begin{equation}
219: F_{\mathrm{CAI}}(m) = \sum_{i=1}^m (\eta_{\mathrm{CAI}}(i) -
220: \overline{\eta_{\mathrm{CAI}}}), \end{equation} where $\eta_{\mathrm{CAI}}(i) =
221: \ln(w_i)$.
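As a small numerical check that the two forms of CAI agree, take a
hypothetical gene of $M = 3$ codons with weights $w = (1, 0.5, 0.25)$; these
are illustrative values only, not actual \emph{E. coli} weights. Then
\begin{equation*}
    \mathrm{CAI} = (1 \times 0.5 \times 0.25)^{1/3} = 0.125^{1/3} = 0.5,
    \qquad
    \exp\left(\tfrac{1}{3}\left[\ln 1 + \ln 0.5 + \ln 0.25\right]\right)
    = \exp\left(\tfrac{1}{3}\ln 0.125\right) = 0.5 .
\end{equation*}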
222: 
223: The CAI landscape for lambda phage is shown in Figure
224: \ref{fig:land_hist}(b), along with the CAI histogram of lambda phage.
225: For the CAI histograms, the height of each bar represents the CAI value
226: of that gene (Eq. \ref{eq:CAI_def}).  As in the case with the GC3
227: landscape, we find that the lambda phage CAI landscape corresponds
228: closely to the CAI histogram, but it offers a more striking global view
229: of the long-range CAI structure in the lambda phage genome. One
230: contiguous half of the lambda phage genome exhibits elevated CAI,
231: whereas the other half exhibits depressed CAI.  The observed CAI
232: landscape lies far outside the purple band in Figure
233: \ref{fig:land_hist}, calculated according to Eq. \ref{eq:sigma},
234: indicating that the pattern of CAI across the lambda phage genome is
235: non-random. However,
236: the purple band is wider for the CAI landscape than for the GC3
237: landscape, because the variance in the $\ln{(w_i)}$'s,
238: $\Delta_{\mathrm{CAI}}$, is greater than $\Delta_{\mathrm{GC3}}$.
239: 
240: The GC3 and CAI landscapes for lambda phage are highly correlated with each
241: other (Figure \ref{fig:land_hist}). In particular they both have large uphill
242: regions on the left-hand side of the genome, indicating a region containing
243: codons with elevated GC3-content and CAI values, compared to the genome average.
244: It is possible that the observed correlation between the GC3 and CAI landscapes
245: could be caused by the conflation between high CAI and GC3 in the preferred
246: \emph{E. coli} codons, as we discuss below.
247: 
248: We note that the genes in the region of elevated CAI primarily encode the highly
249: translated structural proteins that form the capsid and tail of the lambda phage
250: virions. This pattern suggests the hypothesis that, because of the need to
251: produce structural proteins in high copy number during the viral life cycle, structural
252: genes preferentially use codons that match the host's preferred set of codons.
253: We will explore this translational-selection hypothesis in greater detail below.
254: 
255: % subsection genome_landscapes (end)
256: \subsection{The Effect of Amino Acid Content on Genome
257: Landscapes}\label{sub:the_effect_of_amino_acid_content_on_genome_landscapes} 
258: 
259: % (fold)
260: The previous section illustrated that the codon usage across the lambda phage
261: genome is highly non-random with respect to both GC3 and CAI. In this section we
262: quantify this statement, and we focus on aspects of lambda's codon usage
263: patterns that are \emph{independent} of the amino acid sequences of the
264: encoded proteins.
265: 
266: Since we are interested in studying the patterns of \emph{synonymous} codon
267: usage, it is important that we control for the amino acid sequence of encoded
268: proteins. Phages utilize a diverse spectrum of proteins, ranging from
269: those that form the protective capsid for nascent progeny, to those
270: encoding for the tail and tail fibers, to those that regulate the switch
271: between lytic or lysogenic infection pathways. As with other organisms, phage
272: proteins have been selected at the amino acid level for function and folding.
273: Some portion of a phage's codon usage is surely influenced by selection
274: for amino acid content.
275: 
276: We can construct a simple randomization test to interrogate the potential
277: influence of the amino acid sequence on the GC3 and CAI landscapes of lambda
278: phage. In this test, we generate random genomes that have the exact same amino
279: acid sequence as lambda phage, but shuffled codons, such that the genome-wide,
280: or global, codon distribution is preserved in each random genome (see
281: Methods). As
282: summarized in Table \ref{tab:tests}, we refer to this test as the `aqua'
283: randomization test. For each of the randomized genomes, we calculate the GC3 and CAI
284: landscapes. Similar to a recent randomization method \cite{Zeldovich2007}, we then 
285: compare the observed landscape of the actual genome to the
286: distribution of landscapes generated from the randomized genomes.
287: 
288: Figure \ref{fig:aqua} shows the results of this comparison, with the observed
289: landscapes plotted as black lines, and the mean
290: $\pm$ one and two standard deviations of random trials shown in dark and light
291: aqua, respectively. As the figures show, the observed landscapes lie in the far
292: extremes of the randomized distributions -- indicating that the amino acid
293: sequence of the lambda phage genome does not determine the extraordinary
294: features of the observed landscapes.
295: 
296: It is also instructive to query the influence of amino acid content on codon
297: usage in each gene individually. The histogram view of these randomization tests
298: allows us to ask this question precisely. Because the amino acid sequence is
299: preserved exactly across the genome, each histogram bar in Figure \ref{fig:aqua}
300: can be considered as its own randomization test, one for each gene. The position
301: of the horizontal black bar reflects the actual codon usage of
302: each gene, and it can be compared to the distribution of random trials in order
303: to compute a quantile for each gene: 
304: \begin{equation} q^{>} = \frac{\mathrm{number\ of\
305: trials\ less\ than\ observed}}{\mathrm{number\ of\ trials}}, \qquad
306: q^{<} = \frac{\mathrm{number\ of\ trials\ greater\ than\
307: observed}}{\mathrm{number\ of\ trials}}.  
308: \end{equation} 
309: Note that we have defined two quantiles, $q^{>}$ and $q^{<}$, that describe the
310: proportion of random trials strictly less or strictly greater than the observed data.
311: These two quantities sum to a value of at most one (and exactly one if there
312: are no ties). A large value of $q^{>}$ signifies that the observed statistic
313: (e.g. GC3 or CAI) is \emph{greater} than most of the random trials.
314: 
315: Associated with each of these quantiles is a p-value quantifying whether the
316: observed gene sequence has significantly different codon usage than the random
317: trials: $p^{<} = 1 - q^{<}$ and $p^{>} = 1 - q^{>}$. If either one of these
318: $p$-values is low, it signifies that the GC3 (or CAI) content of the gene is
319: significantly different than the genomic average, controlling for the amino acid
320: sequence of the gene. $p^<$ tests for significantly depressed GC3 (or CAI) in a
321: gene; and $p^>$ tests for significantly elevated GC3 (or CAI) in a gene. We will
322: use these $p$-values, which arise from the `aqua' randomization test, in two
323: ways.
324: 
325: Since we are interested in studying the effects of synonymous codon usage alone,
326: we first wish to filter out any genes whose codon usage does not significantly deviate
327: from random, given the amino acid sequence. Therefore, in the subsequent
328: gene-by-gene analyses reported in this paper, we retain only those genes whose
329: quantiles fall in the extreme 5\% of random trials. That is, we only keep those
330: genes for which $p^{<}_{\mathrm{aqua}} < 0.025$ or $p^{>}_{\mathrm{aqua}} <
331: 0.025$. These genes are said to `pass' the aqua test, and they are
332: unshaded in Figure \ref{fig:aqua}.
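As a hypothetical numerical illustration (the counts are invented for this
example), suppose that for some gene 985 of 1000 random trials have a CAI
below the observed value, 10 have a CAI above it, and 5 tie with it. Then
\begin{equation*}
    q^{>} = 0.985, \qquad q^{<} = 0.010, \qquad
    p^{>} = 1 - q^{>} = 0.015, \qquad p^{<} = 1 - q^{<} = 0.990 .
\end{equation*}
Here $q^{>} + q^{<} = 0.995 < 1$ because of the ties, and since
$p^{>} < 0.025$ this gene would pass the aqua test with significantly
elevated CAI.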
333: 
334: We also use the gene-by-gene $p$-values to quantify the degree to which
335: codon usage is independent of amino acid sequence across the genome as a
336: whole. To do so, we combine all the gene-by-gene $p$-values into an
337: aggregate $p$-value for the entire genome, $p_{\mathrm{aqua}}$, using
338: the method of Fisher \cite{Fisher1948}. We calculate the combined
339: $p$-value by summing the logs of twice the minimum of each gene-specific
340: $p$-value \begin{equation} f_{\mathrm{aqua}} = -2 \sum_{i=1}^{k} \ln{[2
341: \min(p^{<}_{\mathrm{aqua},i}, p^{>}_{\mathrm{aqua},i})]}, \end{equation}
342: where $p^{<}_{\mathrm{aqua},i}$ represents the aqua $p^<$-value for gene
343: $i$, and $k$ is the number of genes in the genome. Under the null hypothesis,
344: $f_{\mathrm{aqua}}$ is chi-squared distributed with $2k$ degrees of
345: freedom \cite{Fisher1948}.  Thus, the combined $p$-value for the
346: entire genome is $p_{\text{combined}}^{\mathrm{aqua}} = 1-
347: P_{\chi^2,2k}(f_{\mathrm{aqua}})$, where $P_{\chi^2,2k}(f)$ is the
348: cumulative chi-squared distribution with $2k$ degrees of freedom. In
349: the case of lambda phage, we find $p_{\text{combined}}^{\mathrm{aqua}} =
350: 7.42 \times 10^{-98}$ for GC3 and $p_{\text{combined}}^{\mathrm{aqua}}
351: = 1.50 \times 10^{-41}$ for CAI. Thus, we conclude that neither
352: the GC3 nor the CAI patterns across the lambda phage genome are
353: determined by the genome's amino acid sequence.
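To illustrate the arithmetic of the combined test, consider a hypothetical
genome of $k = 3$ genes whose smaller gene-by-gene $p$-values are
$\min(p^{<}_{\mathrm{aqua},i}, p^{>}_{\mathrm{aqua},i}) = 0.01$, $0.02$ and
$0.20$ (values invented for this example). Then
\begin{equation*}
    f_{\mathrm{aqua}} = -2 \ln(0.02 \times 0.04 \times 0.40)
    = -2 \ln(3.2 \times 10^{-4}) \approx 16.09 .
\end{equation*}
For an even number of degrees of freedom, $2k$, the upper tail of the
chi-squared distribution has a closed form, which gives
\begin{equation*}
    p_{\text{combined}}^{\mathrm{aqua}}
    = e^{-f_{\mathrm{aqua}}/2} \sum_{j=0}^{k-1}
    \frac{(f_{\mathrm{aqua}}/2)^j}{j!}
    \approx 3.2 \times 10^{-4} \, (1 + 8.05 + 32.38)
    \approx 1.3 \times 10^{-2} .
\end{equation*}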
354: 
355: In the following sections we will use the aqua test (see Table
356: \ref{tab:tests}) and its associated gene-by-gene and combined p-values
357: as a control to verify that features of codon usage are not driven by
358: the amino acid sequence.
359: 
360: % subsection the_effect_of_amino_acid_content_on_genome_landscapes (end)
361: \subsection{Disentangling CAI from GC3}\label{sub:disentangling_cai_from_gc3} 
362: 
363: % (fold)
364: Depending upon the preferred codons of the host species, the effect of
365: selection for high CAI in a viral gene is not necessarily independent
366: from the effect of selection for other features of viral codon usage,
367: such as high GC3.
368: For example, codons with high CAI values associated with a given host
369: may be biased towards high GC3 values as well (see Figure
370: \ref{fig:E_coli_master}, and Section
371: \ref{sub:disentangling_cai_from_gc3} below). It is important, therefore,
372: to disentangle the effects of selection for CAI versus selection for
373: GC3, in order to determine which one of these forces is responsible for
374: the non-random patterns of codon usage observed in the lambda genome.
375: 
376: The weights used to compute CAI for \emph{E. coli} are shown in Figure
377: \ref{fig:E_coli_master}. The 61 codons are placed into one of four groups
378: according to whether they are GC3 or not (red or blue, respectively), and
379: whether they have high CAI or not (dark or light, respectively). High CAI is
380: determined by an arbitrary cutoff of $w \geq 0.9$. As this table demonstrates,
381: the set of preferred codons in \emph{E. coli} is slightly biased towards GC-ending
382: codons (58\%).
383: 
384: The GC bias of preferred codons, although slight, could conflate the
385: results of selection for CAI versus GC3 in phages that infect \emph{E.
386: coli}, such as lambda.  We therefore introduce another randomization
387: test that allows us to disentangle patterns of CAI content from patterns
388: of GC3 content. Similar to the aqua randomization test described above,
389: we draw random phage genomes such that the amino acid sequence is
390: conserved, but we add the additional constraint of conserving the exact
391: GC3 sequence as well (see Methods). For example, at a site containing a
392: GC3 codon for leucine, in our random trials we only allow those leucine
393: codons terminating in G or C. By comparing the observed landscapes of
394: the genome with the distribution of randomly drawn landscapes, we can
395: isolate the features of codon usage driven by CAI, independent of GC3
396: and amino acid content. We refer to this randomization procedure as the
397: `orange' randomization test (Table \ref{tab:tests}).
398: 
399: Conversely, we also wish to assess the strength of patterns in GC3 content,
400: independent of CAI and amino acid content. The appropriate randomization
401: procedure in this case requires that we constrain the amino acid sequence and
402: the sequence of codon CAI values while allowing GC3 to vary. However, because
403: CAI values are not binary, CAI cannot be constrained exactly while still
404: allowing for enough variability to produce a meaningful randomization test.
405: Thus, we introduce a binary version of the CAI measure, called BCAI, that is
406: qualitatively the same as and, for our purposes, interchangeable with CAI.
407: 
408: The BCAI $w$-value for a codon is defined to be 0.7 if the codon is high CAI,
409: and 0.3 if the codon has low CAI. High CAI is defined by the threshold of $w
410: \geq 0.9$ (see Figure \ref{fig:E_coli_master}). The actual values assigned for
411: BCAI are arbitrary and have no effect on our results. In addition, the threshold
412: value $w \geq 0.9$ is also arbitrary, and our results are robust to changing
413: this threshold. BCAI provides a useful surrogate for CAI because its values are
414: binary, thereby allowing us to constrain a gene's amino acid sequence and BCAI
415: sequence \emph{exactly}, while varying GC3 content in random trials. The BCAI
416: landscapes and histograms are calculated in the same way as CAI landscapes and
417: histograms, except using BCAI $w$-values. As expected, the BCAI landscape of a
418: genome is qualitatively similar to its CAI landscape (compare Figures
419: \ref{fig:green_orange}b and \ref{fig:aqua}b), and the two landscapes are highly
420: correlated (e.g. $r = 0.72$ for lambda phage). Thus BCAI is interchangeable
421: with CAI for the purposes of our randomization tests.
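To make the definition concrete, consider a hypothetical three-codon gene
whose codons are classified as high, high and low CAI, respectively (an
invented example). Its BCAI value is
\begin{equation*}
    \mathrm{BCAI} = (0.7 \times 0.7 \times 0.3)^{1/3} = 0.147^{1/3}
    \approx 0.53,
\end{equation*}
and the corresponding landscape increments, before subtraction of the
genome-wide mean, are $\ln 0.7 \approx -0.36$ for each high-CAI codon and
$\ln 0.3 \approx -1.20$ for the low-CAI codon.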
422: 
423: Figure \ref{fig:green_orange} shows the results of the two randomization tests
424: outlined above: the `green' test that compares the observed GC3 landscape to a
425: distribution of random trials constraining the amino acid sequence and the BCAI
426: sequence; and the `orange' test that compares the observed BCAI landscape to a
427: distribution of random trials constraining the amino acid sequence and the GC3
428: sequence. Our convention for naming these two tests is summarized in Table
429: \ref{tab:tests}.
430: 
431: As seen in Figure \ref{fig:green_orange}a, the observed GC3 landscape lies
432: significantly outside of the random trials that preserve amino acid sequence and
433: BCAI sequence. Combining the gene-by-gene p-values for this test, we find
434: $p_{\text{combined}}^{\text{green}} = 5.1 \times 10^{-68}$ -- indicating that
435: the lambda phage genome as a whole has non-random GC3 variation independent of
436: amino acid and CAI (actually, BCAI) sequence. Conversely, Figure
437: \ref{fig:green_orange}b shows that the BCAI landscape contains non-random
438: features when controlling for both GC3 and amino acid sequence
439: ($p_{\text{combined}}^{\text{orange}} = 6.3 \times 10^{-9}$). In other words,
440: the lambda phage genome exhibits highly non-random patterns of both GC3 and CAI
441: codon variation, independent of one another and independent of the amino acid
442: sequence.
443: 
444: % subsection disentangling_cai_from_gc3 (end)
445: \subsection{Non-random patterns of CAI and GC3 in
446: Bacteriophages}\label{sub:selection_for_cai_and_gc3_in_bacteriophages} 
447: 
448: % (fold) 
449: 
450: In the sections above we have demonstrated and quantified highly
451: non-random patterns of GC3 and CAI codon usage variation across the
452: lambda phage genome. We have also demonstrated that these trends are
453: independent of one another.  In this section, we will extend our
454: analysis to a large range of diverse phages.
455: 
456: We consider all sequenced phages that infect \emph{E. coli},
457: \emph{Pseudomonas aeruginosa} or \emph{Lactococcus lactis} as their primary host.
458: The latter two hosts were chosen because they contain unusually extreme GC3
459: content: 88 \%GC3 for \emph{P. aeruginosa} and 25 \%GC3 for \emph{L. lactis},
460: genome-wide. The extreme GC3 content of these hosts gives rise to opposing
461: relationships between high CAI and GC3 -- as indicated schematically in
462: Figure \ref{fig:master_cartoons}. In particular, \emph{P. aeruginosa} strongly
463: favors GC3 in high-CAI codons (94\%), and \emph{L. lactis} strongly favors AT3 in
464: high-CAI codons (72\%). Thus, these three hosts span a large spectrum of
465: relationships between CAI and GC3. Since our randomization tests constrain amino
466: acid and BCAI exactly (the `green' test), and amino acids and GC3 exactly (the
467: `orange' test), we can control for any possible conflation between GC3 and CAI
468: trends. Thus, the randomization tests are equally applicable to all of the phage
469: genomes, regardless of their host.
470: 
471: We performed the aqua, green, and orange randomization tests on the 45
472: phages of \emph{E. coli}, 12 phages of \emph{P. aeruginosa}, and 17
473: phages of \emph{L. lactis} whose genomes have been sequenced
474: (see Methods). In the first step of our
475: analysis, we removed any phages which failed either the aqua GC3 or aqua
476: CAI tests, because the codon usage of such genomes is influenced by
477: their amino acid sequence. A phage was said to pass these two control
478: tests if its Fisher combined p-values for both aqua GC3 and aqua CAI
479: were significant. The significance criterion for each test is
480: $p_{\text{combined}} < 5\%/74$, which incorporates a Bonferroni
481: correction for multiple tests.  With this cutoff, 50 of the initial 74
482: phages passed the aqua control tests.
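Numerically, this Bonferroni-corrected criterion corresponds to
\begin{equation*}
    p_{\text{combined}} < \frac{0.05}{74} \approx 6.8 \times 10^{-4}
\end{equation*}
for each of the two aqua tests.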
483: 
484: Figure \ref{fig:green_orange_examples} shows results of these tests for
485: several example genomes. P2, a temperate phage, and T3, a non-temperate
486: phage, both infect \emph{E. coli} and both pass the control tests and
487: exhibit significant `orange' and `green' results, as does D3112, a
488: temperate phage that infects \emph{P.  aeruginosa}. However, not all
489: phages that pass the control test exhibit significant `orange' and
490: `green' results -- as evidenced by bIL286, a temperate phage infecting
491: \emph{L. lactis}.
492: 
493: Figure \ref{fig:green_orange_pass_genomes} plots the distribution of
494: combined Fisher p-values of the orange and green tests, for the 50
495: phages that pass the control tests. The majority of these
496: p-values are highly significant. Using a Bonferroni-corrected threshold of
497: 5\%/50, a total of 22 genomes show significance in the orange test, 29
498: in the green test, and 17 in both orange and green.  These results
499: indicate that non-random patterns in codon usage are not unique to
500: lambda phage.  Indeed, over a range of bacterial hosts and a range of
501: phage viruses, there is apparent pressure for non-random patterns of
502: both GC3 content and CAI content, independent of one another and
503: independent of the amino acid sequence.
504: 
505: 
506: % subsection selection_for_cai_and_gc3_in_bacteriophages (end)
507: \subsection{Translational selection on phage structural
508: proteins}\label{sub:translational_selection_on_phage_structural_proteins} 
509: 
510: % (fold)
511: In this section, we investigate a natural hypothesis concerning the patterns of
512: non-random CAI usage we have observed in phage genomes -- namely, that these
513: patterns may be driven by selection for translational accuracy and efficiency,
514: which is stronger in more highly expressed proteins \cite{Ikem81a,Sharp1984}. 
515: 
516: Among all phage proteins, the structural proteins are the most highly expressed
517: \cite{Hendrix2004}. The structural proteins form the protective capsid that
518: encloses the viral genome, as well as the tail, which is often used for
519: transmission of the phage genome to the inside of the host \cite{Roessner1983}.
520: These proteins must be produced in high copy number -- many tens of copies of
521: each type of structural protein are needed to form each of hundreds of viral progeny
522: \cite{Hendrix2004}. For each gene in a phage genome, we assigned a structural
523: annotation of 1 if the gene was known to encode a structural protein and 0
524: otherwise (see Methods).
525: 
526: According to the standard hypothesis of translational selection, the
527: structural genes of phages should exhibit elevated CAI levels compared
528: to other phage genes, since they are translated (by the host) in high
529: copy numbers. To test this hypothesis, we performed regressions between
530: the structural annotation of phage genes and their aqua CAI and orange
531: BCAI p-values.  In other words, we compared the structural properties of
532: genes against their CAI content, controlling for amino acid sequence,
533: and against their BCAI content, controlling for both amino acid sequence and
534: GC3 sequence.
535: 
536: In the case of lambda phage, Figure \ref{fig:structural} shows the results of
537: the aqua CAI and orange BCAI randomization tests, with the structural genes
538: highlighted. The plot reveals a striking pattern: the vast majority of the
539: structural proteins lie on the left half of the genome, exactly in the region
540: where genes have elevated CAI values. In order to quantify this association we
541: performed ANOVAs. Before regressing structural
542: annotations against codon usage, we first removed the non-informative genes --
543: i.e. genes whose codon usage is influenced by their amino acid content, as
544: indicated by a failure to pass the aqua CAI test.
545: 
546: Table \ref{tab:lambda_all_struct_non_aqua_orange} shows the results of the
547: regression between aqua CAI and orange BCAI $p^{>}$-values versus structural
548: annotations in lambda phage. The results are highly significant: structural
549: annotations explain roughly half of the variation in CAI, even when controlling for
550: genes' amino acid sequences (aqua, $r^2$=56\%) as well as GC3 sequences (orange
551: test, $r^2$=46\%). The median $p^{>}$-value among structural genes is close to
552: zero, whereas the median $p^{>}$-value among non-structural genes is close to
553: one -- indicating that structural genes exhibit significantly \emph{elevated}
554: CAI values. These highly significant results are consistent with the hypothesis
555: of translational selection on structural proteins.
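
For a binary predictor such as the structural annotation, the $r^2$ of the regression coincides with the between-group fraction of variance in a one-way ANOVA. As a sketch in notation introduced here only for illustration, let $y_i$ denote the $p^{>}$-value of gene $i$, let $\bar{y}$ be the mean over all $n$ informative genes, and let $\bar{y}_1$ and $\bar{y}_0$ be the means over the $n_1$ structural and $n_0$ non-structural genes, respectively. Then
\begin{equation}
    r^2 = \frac{n_1(\bar{y}_1 - \bar{y})^2 + n_0(\bar{y}_0 - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\end{equation}
so that a value of $r^2 = 56\%$ indicates that 56\% of the variance in $p^{>}$-values lies between the structural and non-structural groups.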
556: 
557: In order to examine the relationship between structural annotation and CAI
558: across all 74 phages in our study, we performed the same ANOVA on the 1,309
559: informative genes (i.e. genes that pass the aqua CAI randomization test). Once
560: again, Table \ref{tab:lambda_all_struct_non_aqua_orange} shows a highly
561: significant relationship between structural annotation and CAI values,
562: controlling for amino acid content and GC3. Thus, the tendency toward elevated
563: CAI values in structural genes holds across all the phages in this study,
564: despite the fact that they infect a diverse range of hosts with a wide
565: variety of GC contents.
566: 
567: % subsection translational_selection_on_phage_structural_proteins (end)
568: % section results (end)
569: \section{Discussion}\label{sec:discussion} 
570: 
571: In this paper, we have introduced genome landscapes as a tool for visualizing
572: and analyzing long-range patterns of codon usage across a genome. In combination
573: with a series of randomization tests, we have applied this tool to study
574: synonymous codon usage in 74 fully sequenced phages that infect a diverse range
575: of bacterial hosts. Genome landscapes provide a convenient means to identify
576: long-range trends that are not apparent through conventional, gene-by-gene or
577: moving-window analyses. Using a statistical test that compares codon usage to
578: random trials, controlling for the amino acid sequence, we found that
579: many of the phages studied exhibit non-random variation
580: in codon usage.  However, not all of the phages exhibit non-random variation, as
581: exemplified by phage bIL286 (Figure \ref{fig:green_orange_examples}(d)).
582: 
583: In light of long-standing \cite{Ikem81a} and recent \cite{Kudla2006}
584: literature from other organisms, we have focussed on two aspects of
585: phage codon usage: variation in third-position GC/AT content (GC3) and
586: variation in the degree of adaptation to the `preferred' codons of the
587: host (CAI). Almost three-quarters of the phages in our study exhibit
588: non-random intragenomic patterns of codon usage, even when controlling
589: for the amino acid sequence encoded by the genome. Almost half of such
590: genomes also show non-random patterns of CAI when additionally
591: controlling for the GC3 sequence. In other words, there is substantial
592: variation in CAI above and beyond what would be expected by random
593: chance, given the amino acid and GC3 sequences of these genomes.
594: 
595: We have also compared the CAI values of phage genes to their annotations
596: as structural or non-structural proteins. We have conclusively
597: demonstrated that phage genes encoding structural proteins exhibit
598: significantly elevated CAI values compared to the non-structural proteins
599: from the same genome. These results hold even when controlling for the
600: amino acid sequence and GC3 sequence of genes. Our
601: conclusions across a diverse range of phages are consistent with
602: early observations on lambda's codon usage \cite{Sanger1982},
603: early results for T7 \cite{Sharp1984}, and with the general hypothesis
604: of translational selection, which predicts elevated CAI in genes
605: expressed at high levels \cite{Ikem81a,Ikem81b,Sharp1987}. The pattern
606: of elevated CAI in structural proteins is particularly striking in the case
607: of lambda phage. It is also worth noting that we find no
608: significant relationship between a phage's life-history (i.e. temperate
609: versus non-temperate) and the degree to which its structural proteins
610: exhibit elevated CAI (see Table \ref{tab:temperate_non}). This
611: observation likely reflects the fact that at some point every phage,
612: regardless of its life history, must generate certain structural proteins in
613: high abundance -- and so it is beneficial to encode such proteins using
614: the host's translationally preferred codons.
615: 
616: Our results on translational selection in phages shed light on the
617: nature of selection on viruses. The standard interpretation of elevated
618: CAI in highly expressed bacterial proteins assumes a fitness cost (per
619: molecule) associated with inefficient or inaccurate translation. We have
620: observed a similar relationship between expression level and CAI across
621: a diverse range of bacteriophages, which presumably do not incur a
622: direct energetic cost from inefficient translation by their hosts. Thus,
623: our results suggest that either there is an adaptive benefit (to the
624: virus) of elevated CAI in phage structural proteins, or that costs
625: incurred by the host bacterium also reduce the fitness of the virus.
626: 
627: In addition to our results on CAI, we have also observed non-random patterns of
628: GC3 variation across the genomes of many phages. These patterns are highly
629: significant even after controlling for potential confounding factors, such as the
630: amino acid sequences and CAI sequences of genes. Unlike our results on CAI,
631: there is no clear mechanistic hypothesis underlying the non-random patterns of
632: GC3 in phages. It is possible that these patterns reflect selection for
633: efficient transcription \cite{Kudla2006} or for mRNA secondary structure. But in
634: the absence of independent information on such constraints, we cannot assess the
635: merits of these selective hypotheses, nor rule out the possibility of variation
636: in mutational biases across the phage genomes. It is interesting to note
637: that we find these significant non-random patterns of GC3 predominantly in
638: temperate phages (see Table \ref{tab:temperate_non}).
639: 
640: Our study benefits from the number and breadth of phages we
641: have analyzed. Unlike previous studies, here we analyze phages whose
642: suspected hosts span a diverse range of bacteria, which themselves
643: differ in their genomic GC3 content and preferred codon choice. We have
644: calibrated CAI for each phage according to its primary host, and
645: nevertheless we find consistent relationships between CAI and viral
646: protein function. These results therefore conclusively extend the
647: classical theory of translational selection to the relationship between
648: viruses and their hosts.
649: 
650: The present study also benefits from the development of randomization tests that
651: isolate the patterns of variation in CAI from variation in GC content. Due to
652: intrinsic biases in the GC content of the preferred codons of hosts, previous
653: studies on codon usage in phage have conflated these two types of synonymous
654: variation \cite{Sahu2004, Sahu2005,Sau2005,SauGosh2005}. The mechanisms
655: underlying GC3 variation and CAI variation likely differ, and so it is
656: critically important that we have analyzed each of these features controlling
657: for the other one.
658: 
659: There is a large literature on the structure and evolution of phage genomes
660: which is pertinent to our analyses of phage codon usage. The genomes of phages
661: that infect \emph{E. coli}, \emph{L. lactis}, and \emph{Mycobacteria} are known
662: to be highly mosaic in structure
663: \cite{Juhala2000,Brussow2002,Hendrix2002,Lawrence2002,Pedulla2003,Hatfull2006}.
664: In other words, these genomes exhibit many similar local features that suggest
665: each genome was assembled from a common pool of bacteriophage genomic regions
666: \cite{Hendrix1999}. Recently, mosaicism was discussed in the lambdoid
667: phages focusing specifically on the \emph{E. coli} phages lambda, HK97 and N15
668: \cite{Hendrix2004}. We note that both HK97 and N15 have peaked landscape
669: structures like lambda, although not as pronounced, indicating that some degree
670: of mosaicism can be observed in genome landscapes among closely related phages.
671: The postulated mechanism for mosaicism is homologous and non-homologous
672: recombination between co-infecting phages or between a phage and a prophage
673: embedded in the host genome \cite{Hendrix1999,Brussow2002,Lawrence2001}. Some
674: have argued that the latter mechanism occurs more frequently, due to the large
675: number of lysogenized prophages in bacterial genomes \cite{Lawrence2001}.
676: 
677: Lateral gene transfers could affect the codon usage patterns of phages,
678: especially if recombination occurs between phages whose preferred hosts
679: differ. In this case, the codon usage patterns of each phage may be expected to
680: reflect the preferred codons of their preferred hosts; a recent recombination
681: may result in regions of dramatically different codon usage from the average
682: phage codon usage. In particular, regions of unusual GC3 content in a phage
683: genome could reflect gene transfers between phages that typically infect hosts
684: of different GC3 content, in analogy with lateral gene transfer amongst
685: bacteria \cite{Ochman2000}. Morons are genes in phage genomes that are under
686: different transcriptional control than the rest of the phage genes, and are
687: often expressed when the phage is in the lysogenic state \cite{Hendrix2000}.
688: These morons have been observed to have very different nucleotide compositions
689: compared to the rest of the phage genome, suggesting that they are the result of
690: such gene transfers \cite{Hendrix2000}. Thus one interpretation for our
691: observations of the 29 phages exhibiting non-random GC3 patterns is that these
692: genomes arose through recent recombination events, and have not subsequently
693: experienced enough time to equilibrate their GC3 content to that of their
694: current host. Given the lack of reliable estimates for time scales between
695: putative phage recombination events, or for codon usage equilibration, this
696: study neither supports nor refutes this interpretation. However, the
697: predominance of significant non-random patterns of GC3 in the genomes of
698: temperate phages (see Table \ref{tab:temperate_non}) may suggest that such
699: recombination occurs more frequently among temperate phage populations.
700: 
701: We have demonstrated that phage genes encoding structural proteins exhibit
702: significantly elevated CAI values compared to the non-structural phage genes. These
703: results support the classical translational selection hypothesis, now extended to
704: the relationship between viral and host codon usage. We do not find much
705: variation in codon usage among the structural genes themselves. This observation
706: has two plausible interpretations within the literature of lateral gene
707: transfers: either phages of different preferred hosts rarely co-infect, or there
708: is substantially less recombination among the structural proteins of phages. The
709: latter hypothesis has been independently suggested for the capsid proteins of
710: phages, based on the idea that capsid proteins form a complex with
711: multiple physical interactions whose function would be disrupted by individual
712: gene transfer events \cite{Hendrix2002}. Unlike capsid genes, phage tail genes
713: often exhibit mosaicism, and they can include elements from diverse viruses
714: with variable host ranges \cite{Haggard-Ljungquist1992,Hendrix2002}. To
715: investigate this phenomenon in the context of codon usage, we refined the
716: structural annotation to separate head from tail genes (see Section
717: \ref{sub:structural_annotation}). We performed three separate ANOVAs to compare
718: the CAI usage in these genes: comparing head versus non-structural, tail versus
719: non-structural, and head versus tail (Table
720: \ref{tab:all_head_tail_aqua_orange}). These regressions indicate that the head
721: genes are primarily responsible for the pattern of elevated CAI in structural
722: proteins. In addition, we detect a difference in codon usage between head and
723: tail genes. These results have at least two possible explanations: either the
724: head proteins are produced in higher copy number than the tail proteins, or
725: lateral gene transfers between diverse phages occur frequently enough in the
726: tail genes to impair their ability to optimize codon usage to their current
727: host. The first hypothesis is very plausible, in light of evidence on the copy
728: number of head and tail proteins \cite{Hendrix2004}; nevertheless, we cannot
729: rule out the second possibility.
730: 
731: % section discussion (end)
732: \section{Materials and Methods}\label{sec:materials_and_methods} 
733: 
734: % (fold)
735: \subsection{Bacteriophage Genomes}\label{sub:bacteriophage_genomes} 
736: 
737: % (fold)
738: Bacteriophage genomes were downloaded from NCBI's GenBank
739: (\verb=http://www.ncbi.nlm.nih.gov/Genbank/index.html=) release 156 (October,
740: 2006) using Biopython's \cite{biopython} NCBI interface. We only used
741: reference sequence (refseq) phage genome records with accessions
742: of the form NC\_00dddd in order to have the most complete records
743: available. Of the 396 phage refseq's available, we focused on the 74 genomes of
744: phages whose primary host, as listed in the \verb=specific_host= tag in the
745: GenBank file, was \emph{E. coli}, \emph{P. aeruginosa} or \emph{L. lactis}. (A
746: complete list of the accession numbers used can be found in the supplementary
747: material.)
748: 
749: All phage genomes were downloaded from GenBank. Before being used for the rest
750: of this study, every gene within a genome was scanned for overlaps with other
751: genes in the same genome, and all overlapping sequences were removed. A codon
752: was only retained if all three of its nucleotides occurred in a single open
753: reading frame. Thus the final genome sequence used was a concatenation
754: of all non-overlapping coding sequences, omitting any control elements and other
755: non-coding sequences.
756: 
757: % subsection bacteriophage_genomes (end)
758: \subsection{Calculation of CAI Master
759: Tables}\label{sub:calculation_of_cai_master_tables} 
760: 
761: % (fold)
762: The definition of the Codon Adaptation Index requires the construction of a
763: `master' $w$-table for the host organism. Each of the 61 sense codons is
764: assigned a $w$-value based on the codon's frequency among the most highly
765: expressed genes in the host organism. In defining this set of genes, we follow
766: Sharp \cite{Sharp1987}, who specified highly expressed genes for \emph{E. coli}.
767: 
768: In order to calculate the CAI master $w$-tables for \emph{P. aeruginosa} and \emph{L. lactis},
769: we identified the homologs of the highly expressed \emph{E. coli} genes within
770: the other host genomes, using BLAST \cite{Altschul1990}. In particular, we used
771: qblast to find homologs to these \emph{E. coli} genes by inputting the gene
772: protein sequences, and blasting (blastp) against the nr database, restricting
773: the database to include proteins of the target organism. In all cases, we used
774: the most significant blast result as the ortholog, provided its e-value was less
775: than $1\times10^{-10}$.
776: 
777: The particular proteins used for each of these three hosts are as
778: follows (NCBI genome accession numbers listed in parentheses beside the
779: host name, GI numbers listed in parentheses beside each protein). \emph{E.
780: coli} (NC\_000913): 30S ribosomal protein S10 (16131200), 30S ribosomal
781: protein S21 (16130961), 30S ribosomal protein S12 (16131221), 30S
782: ribosomal protein S20 (16128017), 30S ribosomal protein S1 (16128878),
783: 30S ribosomal protein S2 (16128162), 30S ribosomal protein S15
784: (16131057), 30S ribosomal protein S7 (16131220), 50S ribosomal protein
785: L28 (16131508), 50S ribosomal protein L33 (16131507), 50S ribosomal
786: protein L34 (16131571), 50S ribosomal protein L11 (16131813), 50S
787: ribosomal protein L10 (16131815), 50S ribosomal protein L1 (1790416),
788: 50S ribosomal protein L7/L12 (1790418), 50S ribosomal protein L17
789: (16131173), 50S ribosomal protein L3 (16131199), murein lipoprotein
790: (16129633), outer membrane protein A (3a;II*;G;d) (16128924), outer
791: membrane porin protein C (16130152), outer membrane porin 1a (Ia;b;F)
792: (16128896), protein chain elongation factor EF-Tu (duplicate of tufB)
793: (16131218), TufB (29140507), elongation factor Ts (16128163), elongation
794: factor EF-2 (16131219), recombinase A (16130606), molecular chaperone
795: DnaK (16128008); \emph{P. aeruginosa} (NC\_002516): elongation factor G
796: (15599462), 30S ribosomal protein S10 (15599460), 30S ribosomal protein
797: S21 (15595776), 30S ribosomal protein S12 (15599464), 30S ribosomal
798: protein S20 (15599759), 30S ribosomal protein S1 (15598358), 30S
799: ribosomal protein S2 (15598852), 30S ribosomal protein S15 (15599935),
800: 30S ribosomal protein S7 (15599463), 50S ribosomal protein L28
801: (15600509), 50S ribosomal protein L33 (15600508), 50S ribosomal protein
802: L34 (15600763), 50S ribosomal protein L11 (15599470), 50S ribosomal
803: protein L10 (15599468), 50S ribosomal protein L1 (15599469), 50S
804: ribosomal protein L7/L12 (15599467), 50S ribosomal protein L17
805: (15599433), 50S ribosomal protein L3 (15599459), probable outer membrane
806: protein precursor (15596238), elongation factor Tu (15599461),
807: elongation factor Ts (15598851), elongation factor G (15599462),
808: recombinase A (15598813), molecular chaperone DnaK (15599955); \emph{L. lactis}
809: (NC\_002662): 30S ribosomal protein S10 (15674082), 30S ribosomal
810: protein S21 (15672222), 30S ribosomal protein S12 (15674244), 30S
811: ribosomal protein S20 (15673721), 30S ribosomal protein S1 (15672820),
812: 30S ribosomal protein S2 (15674135), 30S ribosomal protein S15
813: (15673868), 30S ribosomal protein S7 (15674243), 50S ribosomal protein
814: L34 (15672113), 50S ribosomal protein L11 (15673983), 50S ribosomal
815: protein L10 (15673251), 50S ribosomal protein L1 (15673982), 50S
816: ribosomal protein L7/L12 (15673250), 50S ribosomal protein L17
817: (15674049), 50S ribosomal protein L3 (15674081), elongation factor Tu
818: (15673843), elongation factor Ts (15674134), elongation factor EF-2
819: (15674242), recombinase A (15672336), molecular chaperone DnaK
820: (15672936).
821: 
822: Given the set of highly expressed genes, the CAI master $w$-table was
823: calculated as follows. For each host, the GenBank file (GenBank release
824: 156) was downloaded locally and transformed into a local data
825: structure using Biopython's \cite{biopython} GenBank parser. The
826: data structure was then scanned for each of the genes in the
827: highly translated gene set, and the collective CDS codon sequences of
828: these genes were concatenated together into one long sequence. Stop
829: codons and codons encoding the amino acids methionine (M) and
830: tryptophan (W) (each encoded by only one codon) were removed
831: from the concatenated sequence. The frequencies of codons encoding all
832: other amino acids were then tabulated, and divided into groups according
833: to which amino acid they encode. The $w$-values were then calculated,
834: according to the procedure of Sharp \cite{Sharp1987}, as these
835: frequencies, normalized by the maximum frequency within each group. Thus
836: each amino acid has a codon with a $w$-value of 1, representing
837: the most commonly used codon for that amino acid. The $w$-values for
838: the stop codons and codons for methionine and tryptophan were set to the
839: average $w$-value of the remaining codons.
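
In symbols, and as a sketch of the procedure just described, let $x_{a,c}$ be the number of occurrences of codon $c$ encoding amino acid $a$ in the concatenated sequence of highly expressed genes (the symbols $x_{a,c}$ and $L$ are introduced here for illustration). The $w$-value of the codon is
\begin{equation}
    w_{a,c} = \frac{x_{a,c}}{\max_{c'} x_{a,c'}},
\end{equation}
and, following \cite{Sharp1987}, the CAI of a stretch of $L$ codons $c_1,\ldots,c_L$ is the geometric mean of the corresponding $w$-values,
\begin{equation}
    \mathrm{CAI} = \left( \prod_{l=1}^{L} w_{c_l} \right)^{1/L} = \exp\left( \frac{1}{L}\sum_{l=1}^{L} \ln w_{c_l} \right).
\end{equation}
For example, with purely hypothetical alanine counts of 40 (GCT), 10 (GCC), 20 (GCA) and 30 (GCG), the $w$-values would be 1, 0.25, 0.5 and 0.75, respectively.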
840: 
841: % subsection calculation_of_cai_master_tables (end)
842: \subsection{Drawing Random Genomes According to
843: Constraints}\label{sub:drawing_random_genomes_according_to_constraints} 
844: 
845: % (fold)
846: Our randomization tests require drawing randomized phage genomes that are
847: constrained to have specific properties. In all of the randomization tests
848: discussed, the random sequences were drawn as a sequence of synonymous codons at
849: each position, thereby exactly preserving the amino acid sequences of proteins.
850: 
851: The three randomization tests used in this work can all be considered variants
852: of a canonical randomization test that preserves both the amino acid sequence
853: and a bit mask sequence exactly, while drawing codons from the global,
854: genome-wide distribution. A bit mask sequence is a string of zeros and ones
855: corresponding to all codons in the genome. For example, GC3 is 1 if the third
856: position of a codon is G or C, and 0 otherwise.
857: 
858: Using the GC3 bit mask as an example, the randomization test procedure is
859: initialized by calculating the global codon frequencies that fit into categories
860: specified by the amino acid and the bit-mask value. Each amino acid has
861: associated with it two distributions: one for a bit-mask value of 1 and one for
862: a bit-mask value of 0. For example, alanine (A), is encoded by four codons, GCC
863: (1), GCG (1), GCT (0), GCA (0), where the GC3 bit-mask is shown in parentheses.
864: Thus to calculate the codon distribution of alanine GC3 codons ($A_1$), we
865: compute the frequency of GCC and GCG codons across the whole phage genome.
866: Similarly, the distribution of $A_0$ codons is determined from the frequency of
867: GCT and GCA codons across the genome. In order to produce a random genome,
868: random codons are drawn at each position according to the distribution
869: associated with the position's amino acid and bit-mask value.
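
The drawing step can be summarized as follows (a sketch in notation introduced here; the counts in the example are hypothetical and not taken from our data). Let $n_c$ be the number of times codon $c$ appears in the genome, and let $S(a,b)$ be the set of codons that encode amino acid $a$ and have bit-mask value $b$. At a position with amino acid $a$ and bit-mask value $b$, a codon $c \in S(a,b)$ is drawn with probability
\begin{equation}
    P(c \,|\, a,b) = \frac{n_c}{\sum_{c' \in S(a,b)} n_{c'}}.
\end{equation}
For instance, if GCC and GCG occurred 30 and 10 times in a genome, every alanine position with GC3 value 1 would receive GCC with probability 0.75 and GCG with probability 0.25 in each random trial, while alanine positions with GC3 value 0 would be filled from GCT and GCA in the same way.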
870: 
871: Thus the three null tests can be specified by the definition of the bit mask
872: along the sequence, which determines the constraints on the
873: randomized trials. The aqua randomization test constrains the amino acid
874: sequence and nothing else, and so its bit mask consists of all 1's. The orange
875: randomization test preserves the amino acid and GC3 sequences, and so its bit mask is
876: the GC3 sequence mentioned above. The green randomization test preserves the
877: amino acid and BCAI exactly, thus its bit mask is the thresholded BCAI (1 if
878: BCAI $\geq$ 0.7, 0 otherwise).
879: 
880: % subsection drawing_random_genomes_according_to_constraints (end)
881: \subsection{Structural Annotation}\label{sub:structural_annotation} 
882: 
883: % (fold)
884: All phage genes were annotated as structural or non-structural by inspecting
885: the annotations of high-scoring BLAST hits among viral proteins. This procedure is
886: described in detail below.
887: 
888: Each gene was considered separately within each genome object, although overlaps
889: were removed in the process of creating the genome objects (see section
890: \ref{sub:bacteriophage_genomes}). The amino acid sequence of each gene was
891: blasted against all known viral protein sequences using Biopython's interface
892: \cite{biopython} to the NCBI blast utility \cite{Altschul1990}. Specifically, we
893: used the blastp utility specifying the nr database, with entrez query `Viruses
894: [ORGN]'. We retained only those BLAST hits with e-values below the cutoff
895: $1\times10^{-4}$. All words in the title of these BLAST hits were collected,
896: using white space as a word-delimiter.
897: 
898: The unique words from the BLAST hits were then compared against a set of
899: structural keywords: ``capsid'', ``structural'', ``head'', ``tail'', ``fiber'',
900: ``scaffold'', ``portal'', ``coat'', and ``tape''. The words associated with the
901: BLAST hits were scanned for matches to the keywords, where each keyword was
902: treated as a regular expression. As a result, partial matching was counted as a
903: match. For example, a BLAST title containing the word `head-tail' would match
904: both keywords `head' and `tail'. If a gene had at least one structural keyword
905: match in its BLAST hit title, it was annotated as structural. Otherwise, it was
906: annotated as non-structural.
907: 
908: We further subdivided the structural annotation into two classes: head and tail
909: genes. Tail genes were identified with the keywords ``tail'', ``fiber'', and
910: ``tape''. The remaining structural genes that did not contain any of these
911: keywords were annotated as head genes. Two false positives for tail
912: identification in the lambda phage genome were manually corrected.
913: 
914: \subsection{Null Model: Results for Random Walk
915: Landscapes}\label{sub:null_model_results_for_random_walk_landscapes} 
916: 
917: % (fold)
918: 
919: In the sections above we have compared the genome landscapes calculated
920: from real genome sequences to a null model in which the sequences are
921: randomly drawn from a defined distribution. In this section, we compute
922: several properties of genome landscapes calculated from these random
923: genomes.
924: 
925: We write the general genome landscape of length $N$ as
926: \begin{equation}
927:     F(m) = \sum_{i=1}^m (\eta(i) - \overline{\eta}),
928: \end{equation}
929: where $\eta(i)$ are independent, and chosen from a random distribution with
930: $\mathrm{var}(\eta(i)) = \langle \eta(i)^2 \rangle - \langle \eta(i) \rangle^2 =
931: \Delta$, and
932: \begin{equation}
933:     \overline{\eta} = \frac{1}{N}\sum_{i=1}^N \eta(i),
934: \end{equation}
935: which ensures $F(0) = F(N) = 0$.
936: 
937: The purple regions in Figure \ref{fig:land_hist} represent the standard deviation of the
938: genome landscapes of this null model at each $m$, $\sigma(m) = \sqrt{\langle F(m)^2
939: \rangle - \langle F(m) \rangle^2}$. Using the definitions above, we have
940: \begin{equation}
941:     \begin{aligned}
942:         F(m) &= \sum_{i=1}^m \eta(i)- \frac{m}{N}\sum_{i=1}^N \eta(i) \\
943:              &= \left( \frac{m + (N-m)}{N} \right) \sum_{i=1}^m \eta(i)- \frac{m}{N}\sum_{i=1}^N \eta(i) \\
944:              &= \frac{N-m}{N}\sum_{i=1}^m \eta(i) - \frac{m}{N}\sum_{i=m+1}^N \eta(i),
945:     \end{aligned}
946: \end{equation}
947: and
948: \begin{equation}
949:     \langle F(m) \rangle = \frac{m(N-m)\langle\eta\rangle}{N} - \frac{m(N-m)\langle\eta\rangle}{N} = 0.
950: \end{equation}
951: When we use $\langle \eta(i)\eta(j) \rangle = \langle \eta^2 \rangle \delta_{i,j}
952: + (1- \delta_{i,j}) \langle\eta\rangle^2$, with $\delta_{i,j} = 1$ if $i = j$ and 0 otherwise, we find
953: \begin{equation}
954:     \begin{aligned}
955:     \langle F(m)^2 \rangle &= \frac{m(N-m)}{N} (\langle\eta^2\rangle - \langle\eta\rangle^2) \\
956:     &= \frac{\Delta m(N-m)}{N},
957:     \end{aligned}
958: \end{equation}
959: leading to $\sigma(m) = \sqrt{\langle F(m)^2 \rangle - \langle F(m) \rangle^2} = \sqrt{\Delta m
960: (N-m)/N}$. In the case of GC3 landscapes, $\eta(i)$ is either 1 or 0 with equal
961: probability, giving $\Delta_{\mathrm{GC3}} = 1/4$.
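
The variance above can be verified in one line (a sketch of the intermediate step, in the notation of this section). The two sums in the last line of the decomposition of $F(m)$ run over disjoint sets of independent variables, with $m$ and $N-m$ terms respectively, so their variances add:
\begin{equation}
    \begin{aligned}
    \langle F(m)^2 \rangle - \langle F(m) \rangle^2 &= \frac{(N-m)^2}{N^2}\, m\Delta + \frac{m^2}{N^2}\, (N-m)\Delta \\
    &= \frac{\Delta m(N-m)}{N^2}\left[ (N-m) + m \right] = \frac{\Delta m(N-m)}{N}.
    \end{aligned}
\end{equation}
The width is largest at the midpoint, where $\sigma(N/2) = \sqrt{\Delta N}/2$. As a purely illustrative number, a hypothetical genome of $N = 10^4$ codons with $\Delta_{\mathrm{GC3}} = 1/4$ would have $\sigma(N/2) = 25$.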
962: 
963: We can also calculate the full probability distribution,
964: $P(f;m,N,\Delta)$ that the genome landscape of length $N$ has an intermediate
965: value $F(m) = f$, at point $m$, by considering an $N$-step random walk that is
966: constrained to start and stop at $0$. This probability distribution can be
967: written as a product of two conditional probabilities for a walk that starts at
968: $0$ and ends at $f$ in $m$ steps, and a walk that starts at $f$ and ends at $0$
969: in $N-m$ steps
970: \begin{equation}
971:     \label{eq:P_decomp}
972:     \begin{aligned}
973:     P(f;m,N,\Delta) &= A G(0,f;m,\Delta) G(f,0;N-m,\Delta) \\
974:                     &= A G(0,f;m,\Delta) G(0,f;N-m,\Delta),
975:     \end{aligned}
976: \end{equation}
977: where $A$ is a normalization constant, and the last step used the inversion
978: symmetry of the random walks. Thus we seek the form of the conditional
979: probability $G(0,f;m,\Delta)$. In the same way as in Eq. (\ref{eq:P_decomp}), we
980: decompose this conditional probability into a multiplication of the conditional
981: probabilities for two walks, one that starts at $0$ and ends at $y$ in $x$
982: steps, and one that starts at $y$ and ends at $f$ in $m-x$ steps, and integrate
983: over all possible intermediate values $y$
984: \begin{equation}
985:     G(0,f;m,\Delta) = \int_{-\infty}^{\infty} \mathrm{d}y G(0,y;x,\Delta) G(y,f;m-x,\Delta).
986: \end{equation}
987: We can continue this decomposition for each intermediate step to give
988: \begin{equation}
989:     G(0,f;m,\Delta) = \int_{-\infty}^{\infty} \mathrm{d}y_1 \ldots \int_{-\infty}^{\infty} \mathrm{d}y_{m-1} G(0,y_1;1,\Delta) G(y_1,y_2;1,\Delta) \ldots G(y_{m-1},f;1,\Delta).
990: \end{equation}
991: Keeping the order of integration the same, and noting that $G(y_1,y_2;1,\Delta)
992: = G(y_2 - y_1;1,\Delta)$ for these random walks, we can write $y_{i+1} - y_i =
993: s_{i+1}$ to give
994: \begin{equation}
995:     G(0,f;m,\Delta) = \int_{-\infty}^{\infty} \mathrm{d}s_1 \ldots \int_{-\infty}^{\infty} \mathrm{d}s_m G(s_1;1,\Delta) G(s_2;1,\Delta) \ldots G(s_m;1,\Delta) \delta\left( \sum_{i=1}^m s_i - f\right),
996: \end{equation}
997: where the delta function is added to force the constraint that the sum of all
998: the intermediate steps must be equal to $f$. All of the intermediate conditional
999: probabilities now represent one step walks, and so are equal to the underlying
1000: probability distribution of drawing a step size $s_i$, $p(s_i;\Delta)$
1001: \begin{equation}
1002:     G(0,f;m,\Delta) = \int_{-\infty}^{\infty} \mathrm{d}s_1 \ldots \int_{-\infty}^{\infty} \mathrm{d}s_m \delta\left( \sum_{i=1}^m s_i - f\right) \prod_{i=1}^m p(s_i;\Delta).
1003: \end{equation}
1004: Making use of the integral representation of the delta function \cite{Grosberg1994}
1005: \begin{equation}
1006:     \delta(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \mathrm{d}k e^{-ikx},
1007: \end{equation}
1008: we have
1009: \begin{equation}
1010:     G(0,f;m,\Delta) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \mathrm{d}k e^{-ikf} \tilde{p}(k;\Delta)^m,
1011: \end{equation}
1012: where $\tilde{p}(k;\Delta)$ is the Fourier transform of $p(s;\Delta)$
1013: \begin{equation}
1014:     \tilde{p}(k;\Delta) = \int_{-\infty}^{\infty} \mathrm{d}s e^{-iks} p(s;\Delta) .
1015: \end{equation}
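To make this step explicit (a sketch that uses only the definitions above): inserting the integral representation of the delta function, with $x = \sum_{i=1}^m s_i - f$, the integrals over the individual steps factorize,
\begin{equation}
    G(0,f;m,\Delta) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \mathrm{d}k\, e^{ikf} \prod_{i=1}^m \left[ \int_{-\infty}^{\infty} \mathrm{d}s_i\, e^{-iks_i} p(s_i;\Delta) \right] = \frac{1}{2\pi} \int_{-\infty}^{\infty} \mathrm{d}k\, e^{ikf}\, \tilde{p}(k;\Delta)^m .
\end{equation}
Because the step distribution is symmetric under $s \to -s$ (the inversion symmetry used above), $\tilde{p}(k;\Delta)$ is even in $k$, and relabeling $k \to -k$ recovers the form with $e^{-ikf}$ quoted above.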
For the purpose of this discussion, we assume that $p(s;\Delta)$ has the Gaussian form
$p(s;\Delta) = \frac{1}{\sqrt{2\pi\Delta}}e^{-\frac{s^2}{2\Delta}}$, and note that the
1018: results are general. In this case, $\tilde{p}(k;\Delta) =
1019: e^{-\frac{k^2\Delta}{2}}$, and we have
1020: \begin{equation}
    G(0,f;m,\Delta) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \mathrm{d}k\, e^{-m\Delta k^2/2}e^{-ikf} = \frac{1}{\sqrt{2\pi m\Delta}}e^{-f^2/(2m\Delta)}.
1022: \end{equation}
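As a consistency check (a short sketch, not part of the original derivation), this Gaussian form satisfies the composition rule used at the start of the calculation. Convolving the conditional probabilities for $x$ and $m-x$ steps gives
\begin{equation}
    \int_{-\infty}^{\infty} \mathrm{d}y\, \frac{e^{-y^2/(2x\Delta)}}{\sqrt{2\pi x\Delta}}\, \frac{e^{-(f-y)^2/(2(m-x)\Delta)}}{\sqrt{2\pi (m-x)\Delta}} = \frac{1}{\sqrt{2\pi m\Delta}}\, e^{-f^2/(2m\Delta)},
\end{equation}
because the variances of independent Gaussian steps add, $x\Delta + (m-x)\Delta = m\Delta$.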
1023: To determine $A$, we enforce the normalization condition
1024: \begin{equation}
1025:     \int_{-\infty}^{\infty} \mathrm{d}f P(f;m,N,\Delta) = 1,
1026: \end{equation}
1027: which gives
1028: \begin{eqnarray}
1029:     \begin{aligned}
    P(f;m,N,\Delta) &= \frac{1}{\sigma\sqrt{2\pi}}e^{-f^2/(2\sigma^2)} \\
1031:     \sigma(m) &= \sqrt{\Delta\frac{m(N-m)}{N}}.
1032:     \end{aligned}
1033: \end{eqnarray}
1034: Note that from the full distribution, we can immediately identify $\sigma(m) =
1035: \sqrt{\langle F(m)^2 \rangle - \langle F(m) \rangle^2}$, confirming the explicit
1036: calculation above.
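
For completeness, the step from the product form to the final distribution can be sketched as follows. Substituting the Gaussian conditional probability into $P(f;m,N,\Delta) = A\, G(0,f;m,\Delta)\, G(0,f;N-m,\Delta)$, the exponents add,
\begin{equation}
    P(f;m,N,\Delta) \propto \exp\left[ -\frac{f^2}{2\Delta}\left( \frac{1}{m} + \frac{1}{N-m} \right) \right] = \exp\left[ -\frac{f^2 N}{2\Delta\, m (N-m)} \right],
\end{equation}
which is a Gaussian in $f$ with variance $\sigma^2 = \Delta\, m(N-m)/N$; the normalization condition then fixes $A = \sqrt{2\pi N \Delta}$. The width vanishes at the pinned ends, $\sigma(0) = \sigma(N) = 0$, and is largest at the midpoint, $\sigma(N/2) = \sqrt{\Delta N}/2$. As a purely illustrative estimate, suppose $\Delta$ is taken to be the variance $p(1-p)$ of a single GC3 step, where $p$ is the GC3 fraction (an assumption made here for illustration only). For lambda phage, with $N = 40{,}773/3 = 13{,}591$ codons and $p = 0.535$ (Table \ref{tab:phage_properties}), this gives $\sigma(N/2) \approx \sqrt{0.249 \times 13{,}591}/2 \approx 29$.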
1037: 
1038: \subsection{Acknowledgments}\label{sub:acknowledgments}
1039: % (fold)
1040: 
1041: The authors would like to thank Herv\'{e} Isambert, Graham Hatfull, and Roger
1042: Hendrix for conversations and suggestions on this work. JBL and DRN would like
to thank the Institut Curie, Paris, for hospitality during the initial phases
1044: of this work. Work by DRN was supported by the National Science Foundation
1045: through grants DMR-0231631 and DMR-0213805. JBL acknowledges the financial
1046: support of the Fannie and John Hertz Foundation. JBP acknowledges
1047: support from the Burroughs Wellcome Fund.
1048: 
1049: % subsection acknowledgements (end)
1050: 
1051: % subsection null_model_results_for_random_walk_landscapes (end)
1052: 
1053: 
1054: \clearpage
1055: % subsection structural_annotation (end)
1056: % section materials_and_methods (end)
1057: % ---- FIGURES ------- (fold)
1058: \begin{figure}
1059: 	[p] 
1060: 	\begin{center}
1061: 		\begin{tabular}
1062: 			{cc} 
1063: 			\includegraphics[scale=0.8]{Lambda_GC3.pdf} & 
1064: 			\includegraphics[scale=0.8]{Lambda_CAI.pdf} \\
1065: 			\includegraphics[scale=0.8]{Lambda_GC3_histogram.pdf} & 
1066: 			\includegraphics[scale=0.8]{Lambda_CAI_histogram.pdf} \\
1067: 		\end{tabular}
1068: 	\end{center}
1069: 	\caption{{\bf GC3 and CAI landscapes for lambda phage.} Landscapes of GC3
(left) and CAI (right) measures of codon usage in lambda phage. Only
1071: coding sequences are considered, which when concatenated together are 40,773 bp
1072: long (see Table \ref{tab:phage_properties}). The GC3 landscape is the
1073: mean-centered cumulative sum of the GC3 content (GC3=1, AT3=0) of codons. The
1074: CAI landscape is the mean-centered cumulative sum of the log $w$-value for each
1075: codon. For each landscape, a region
1076: exhibiting an uphill slope corresponds to higher than average GC3 or CAI. The
1077: horizontal purple band represents the expected
1078: amount of variation in a random walk of GC3 or AT3 choices, 
1079: given by Equation \eqref{eq:sigma}. Both landscapes exhibit features far
1080: outside of the purple bands, indicating that the patterns of codon
1081: usage are highly non-random. Gene boundaries are represented by the bars in the
histograms below each landscape. The heights of the bars in the histograms indicate
1083: the GC3 and CAI values for each gene.} \label{fig:land_hist}
1084: \end{figure}
1085: \clearpage
1086: \begin{figure}
1087: 	[p] 
1088: 	\begin{center}
1089: 		\begin{tabular}
1090: 			{cc} 
1091: 			(a) \includegraphics[scale=0.8]{lambda_decay_GC3.pdf} & 
1092: 			(b) \includegraphics[scale=0.8]{lambda_decay_CAI.pdf} \\
1093: 		\end{tabular}
1094: 	\end{center}
1095: 	\caption{ {\bf Snapshots of simulated 
1096: synonymous mutation in the lambda phage genome.} Panel (a) shows GC3 and
1097: (b) shows CAI landscapes. In between successive snapshots (labeled 
1098: by integers), $N$ synonymous mutations are introduced into the genome and the
1099: resulting landscape is shown, where $N$ is the number of codons in the lambda
1100: phage genome (see Section \ref{sub:genome_landscapes}). These snapshots show that
1101: the simulated genome landscapes approach the random null model, indicated by
1102: the purple band (see Figure \ref{fig:land_hist}). The final CAI landscape (3) lies
almost completely within the purple band. Using the lambda phage mutation rate
of $7.7\mathrm{x}10^{-8}$ mutations/bp/replication \cite{Drake1991}, we
estimate that approximately $10^7$ genome replications would be required
for the landscapes to relax to within the purple bands.} \label{fig:land_decay}
1108: \end{figure}
1109: 
1110: \clearpage
1111: \begin{figure}
1112: 	[p] 
1113: 	\begin{center}
1114: 		\begin{tabular}
1115: 			{cc} 
1116: 			\includegraphics[scale=0.8]{Lambda_aqua_GC3.pdf} & 
1117: 			\includegraphics[scale=0.8]{Lambda_aqua_CAI.pdf} \\
1118: 			\includegraphics[scale=0.8]{Lambda_aqua_GC3_histogram_filtered.pdf} & 
1119: 			\includegraphics[scale=0.8]{Lambda_aqua_CAI_histogram_filtered.pdf} \\
1120: 		\end{tabular}
1121: 	\end{center}
1122: 	\caption{{\bf Observed and randomized landscapes for lambda phage. } The figure
1123: shows the observed GC3 (left) and CAI (right) landscapes, plotted in black,
1124: along with the mean $\pm 1$, and $\pm 2$ standard deviations of
1125: randomized trials, shown in aqua (bold line, dark and light regions,
1126: respectively). The `aqua' randomization test shown here draws random
1127: synonymous codons that preserve the exact amino acid
1128: sequence, according to probabilities that preserve the global codon
1129: usage
distribution of the lambda genome. For the most part, the observed landscapes
lie significantly outside the distribution of randomized landscapes, implying
that the amino acid content of genes is not responsible for the observed pattern
of the CAI landscape. In the lower panel, however, genes whose GC3 (left) or
CAI (right) values fall between the 0.025 and 0.975 quantiles of the random
trials are shaded in grey; the GC3/CAI values of such genes are not
significantly different from random, given their amino acid sequence.}
1137: \label{fig:aqua}
1138: \end{figure}
1139: \clearpage
1140: \begin{figure}
1141: 	[p] 
1142: 	\begin{center}
1143: 		\includegraphics[scale=0.6]{ecoli_master.pdf} 
1144: 	\end{center}
1145: 	\caption{{\bf \emph{E. coli} codon usage master table.} The table of 61 codons 
	along with their associated $w$-values is shown for \emph{E. coli}. 
1147: 	The $w$-value of each codon reflects its frequency in 
1148: 	highly transcribed \emph{E. coli} genes (see main text). The table 
1149: 	is divided into four regions: codons with high CAI ($w \geq 0.9$) 
1150: 	ending in G or C (dark red); codons with high CAI ending in A or 
	T (dark blue); codons with low CAI ($w < 0.9$) ending in G or C 
1152: 	(light red); codons with low CAI ending in A or T (light blue). 
1153: 	As the table shows, there is 
1154: 	a slight bias for GC3 in the high-CAI codons (58\%), and slight 
1155: 	bias away from GC3 in the low-CAI codons (48\%).} \label{fig:E_coli_master} 
1156: \end{figure}
1157: \clearpage
1158: \begin{figure}
1159: 	[p] 
1160: 	\begin{center}
1161: 		\begin{tabular}
1162: 			{cc} 
1163: 			\includegraphics[scale=0.8]{Lambda_blue_GC3.pdf} & 
1164: 			\includegraphics[scale=0.8]{Lambda_orange_BCAI.pdf} \\
1165: 			\includegraphics[scale=0.8]{Lambda_blue_GC3_histogram.pdf} & 
1166: 			\includegraphics[scale=0.8]{Lambda_orange_BCAI_histogram.pdf} \\
1167: 		\end{tabular}
1168: 	\end{center}
1169: 	\caption{{\bf Observed and randomized landscapes for lambda phage.} 
1170: 	Observed landscapes are shown along with randomized landscapes 
1171: 	associated with the `green' and `orange' tests. 
1172: 	The green randomization procedure tests the 
1173: 	significance of the GC3 landscape controlling for the observed 
1174: 	CAI (actually, BCAI) variation across the genome. The orange 
1175: 	randomization procedure tests the significance of the BCAI landscape, 
1176: 	controlling for the observed GC3 variation across the genome. 
1177: 	Both tests preserve the amino-acid sequence exactly. 
1178: 	Both observed landscapes lie outside the distribution 
1179: 	of random trials, indicating there is non-random GC3 
1180: 	content controlling for CAI, and non-random CAI 
1181: 	content controlling for GC3.} \label{fig:green_orange} 
1182: \end{figure}
1183: \clearpage
1184: \begin{figure}
1185: 	[p] 
1186: 	\begin{center}
1187: 		\begin{tabular}
1188: 			{ccc} 
1189: 			\includegraphics[scale=1]{ecoli_CAI_master_cartoon} & 
1190: 			\includegraphics[scale=1]{paeruginosa_CAI_master_cartoon} & 
1191: 			\includegraphics[scale=1]{llactis_CAI_master_cartoon} \\
1192: 		\end{tabular}
1193: 	\end{center}
	\caption{{\bf Schematics of preferred codon usage tables for 
1195: 	\emph{E. coli}, \emph{P. aeruginosa}, and \emph{L. lactis} following the 
1196: 	conventions of Figure \ref{fig:E_coli_master}.} 
1197: 	Unlike \emph{E. coli},
1198: 	\emph{P. aeruginosa} strongly favors GC3 in high-CAI codons 
1199: 	(94\%), and \emph{L. lactis} strongly favors AT3 in high-CAI 
1200: 	codons (72\%).} \label{fig:master_cartoons} 
1201: \end{figure}
1202: \clearpage
1203: \begin{figure}
1204: 	[p] 
1205: 	% P2 NC_001895
1206: 	% T3 NC_003298
1207: 	% D3112 NC_005178
1208: 	% bIL286 NC_002667
1209: 	\begin{center}
1210: 		\begin{tabular}
1211: 			{cc} 
1212: 		(a) \includegraphics[scale=0.5]{P2_green_GC3.pdf} & 
1213: 			\includegraphics[scale=0.5]{P2_orange_BCAI.pdf} \\
1214: 		(b)	\includegraphics[scale=0.5]{T3_green_GC3.pdf} & 
1215: 			\includegraphics[scale=0.5]{T3_orange_BCAI.pdf} \\
1216: 		(c)	\includegraphics[scale=0.5]{D3112_green_GC3.pdf} & 
1217: 			\includegraphics[scale=0.5]{D3112_orange_BCAI.pdf} \\
1218: 		(d)	\includegraphics[scale=0.5]{bIL286_green_GC3.pdf} & 
1219: 			\includegraphics[scale=0.5]{bIL286_orange_BCAI.pdf} \\
1220: 		\end{tabular}
1221: 	\end{center}
1222: 	\caption{{\bf `Green' (left) and `orange' (right) randomization tests 
for several phages.} Bacteriophages P2 (a) and T3 (b) both
1224: infect \emph{E. coli}. Phage D3112 (c) infects \emph{P. aeruginosa}.
1225: Phage bIL286 (d) 
1226: infects \emph{L.
1227: lactis}. T3 is the only non-temperate phage of this group. See
1228: Table \ref{tab:phage_properties} for combined Fisher p-values for these tests.
Note the lack of evidence for codon bias in the green and orange
tests for bIL286, as confirmed by the insignificant $p$-values in 
1232: Table \ref{tab:phage_properties}. In this case, we cannot rule out the
1233: possibility that the observed pattern in GC3 is determined
1234: completely by the amino acid and CAI sequence (green), or that the observed pattern in
1235: CAI is determined by the amino acid and GC3 sequence (orange).}
1236: \label{fig:green_orange_examples}
1237: \end{figure}
1238: \clearpage
1239: \begin{figure}
1240: 	[p] 
1241: 	\begin{center}
1242: 		\includegraphics[scale=1]{gla_orange_blue_fisher_extreme_hist.pdf} 
1243: 	\end{center}
1244: 	\caption{{\bf Combined Fisher p-values for the `green' and `orange' 
1245: 	randomization tests across 50 phage genomes.} Phage names are 
1246: 	listed on the x-axis, and are sorted by their `orange' p-value. 
1247: 	A total of 29 genomes exhibit non-random 
1248: 	GC3 content controlling for CAI (green test); and a total of 
	22 genomes exhibit non-random 
	CAI content controlling for GC3 (orange test). Seventeen genomes pass both of 
	these tests. The dashed horizontal line indicates the 
	threshold for significance after Bonferroni correction (i.e. 5\%/50). 
1253: 	Upwards arrows indicate p-values that lie beyond the limits of the
1254: 	y-axis. See Table \ref{tab:phage_properties} for phage properties, 
1255: 	including the
	p-values for these tests. Twenty-four phage genomes 
1257: 	that failed the aqua GC3 or CAI control tests 
1258: 	are not included in this figure.} \label{fig:green_orange_pass_genomes} 
1259: \end{figure}
1260: \clearpage
1261: \begin{figure}
1262: 	[p] 
1263: 	\begin{center}
1264: 		\begin{tabular}
1265: 			{cc} 
1266: 			\includegraphics[scale=0.8]{Lambda_aqua_CAI_histogram_structural.pdf} & 
1267: 			\includegraphics[scale=0.8]{Lambda_orange_BCAI_histogram_structural.pdf} \\
1268: 		\end{tabular}
1269: 	\end{center}
1270: 	\caption{{\bf The relationship between codon usage and protein 
1271: 	function in lambda phage.} The figure shows the aqua 
1272: 	(CAI, as in Figure \ref{fig:aqua}) and orange (BCAI, as in 
1273: 	Figure \ref{fig:green_orange}) randomization tests 
1274: 	overlaid with information about protein function: 
1275: 	genes classified as structural are shown with a white background 
1276: 	and all other genes with a grey background. The histograms 
1277: 	indicate a clear relationship between the structural 
1278: 	classification of a gene and its significance under the aqua 
1279: 	and orange tests: structural genes typically have elevated 
1280: 	quantiles in the aqua test, whereas other genes typically have 
1281: 	depressed quantiles. In other words, structural genes 
1282: 	exhibit elevated CAI values when controlling for their 
1283: 	amino acid sequence, compared to codon usage in the 
1284: 	genome as a whole. Moreover, as the orange histograms 
1285: 	indicate, this pattern is not caused by variation in GC3 content: 
1286: 	the structural genes exhibit elevated BCAI values after 
1287: 	controlling for both their amino acid sequence and their 
1288: 	GC3 sequence.} \label{fig:structural} 
1289: \end{figure}
1290: \clearpage
1291: 
1292: % section figures (end)
1293: % ------- Tables ------- (fold)
1294: 
1295: \begin{table}
1296:     \begin{center}
1297:     \begin{tabular}{c|c|c|c}
1298:         Test Name & Genome Properties Constrained & Genome Properties Varied & Figure \\
1299:         \hline
1300:         Aqua & amino acid sequence, global codon distribution & synonymous codons & \ref{fig:aqua} \\
        Orange & amino acid and GC3 sequences & BCAI & \ref{fig:green_orange} \\
        Green & amino acid and BCAI sequences & GC3 & \ref{fig:green_orange} \\
1303:     \end{tabular}
1304:     \end{center}
1305:     \caption{Randomization test descriptions.  
1306: 	 The three randomization tests used in the paper 
1307: 	 are color-coded according to what genome properties 
1308: 	 are constrained in the random trials.}
1309:     \label{tab:tests}
1310: \end{table}
1311: \clearpage
1312: 
1313: \begin{table}
1314:     % \begin{center}
1315:     {\tiny
1316:     \begin{tabular}{c|c|c|c|c|c|c|c|c|c}
1317:         Name & Host & Accession & Lifestyle & \# Genes & 
1318:            Length & Coding Length & \%GC3 & Orange p-value & Green p-value \\
1319:            \hline
1320:            T5 & \ecoli & NC\_005859 & NT  & 161 & 121,750 & 96,051 & 31.6 & $1.38\mathrm{x}10^{-31}$ & $1.71\mathrm{x}10^{-19}$ \\
1321:            RB69 & \ecoli & NC\_004928 &  NT & 273 & 167,560 & 156,147 & 29.0 & $1.25\mathrm{x}10^{-21}$ & $5.21\mathrm{x}10^{-01}$ \\
1322:            phiEL & \paeru & NC\_007623 & NT  & 201 & 211,215 & 194,850 & 57.8 & $7.38\mathrm{x}10^{-20}$ & $2.17\mathrm{x}10^{-09}$ \\
1323:            RB49 & \ecoli & NC\_005066 &NT   & 273 & 164,018 & 152,592 & 36.9 & $2.01\mathrm{x}10^{-18}$ & $2.48\mathrm{x}10^{-01}$ \\
1324:            F116 & \paeru & NC\_006552 &  T & 70 & 65,195 & 60,240 & 76.3 & $1.31\mathrm{x}10^{-10}$ & $6.31\mathrm{x}10^{-16}$ \\
1325:            CTX & \paeru & NC\_003278 &T   & 47 & 35,580 & 31,971 & 81.2 & $1.44\mathrm{x}10^{-09}$ & $6.82\mathrm{x}10^{-32}$ \\
1326:            phiKMV & \paeru & NC\_005045 & NT  & 49 & 42,519 & 38,310 & 79.9 & $3.25\mathrm{x}10^{-09}$ & $9.54\mathrm{x}10^{-03}$ \\
1327:            T4 & \ecoli & NC\_000866 &  NT & 269 & 168,903 & 153,660 & 24.3 & $4.59\mathrm{x}10^{-09}$ & $8.62\mathrm{x}10^{-01}$ \\
1328:            lambda & \ecoli & NC\_001416 & T  & 69 & 48,502 & 40,773 & 53.5 & $6.25\mathrm{x}10^{-09}$ & $5.10\mathrm{x}10^{-68}$ \\
1329:            D3 & \paeru & NC\_002484 & T  & 94 & 56,425 & 49,095 & 68.3 & $1.57\mathrm{x}10^{-08}$ & $3.85\mathrm{x}10^{-07}$ \\
1330:            P2 & \ecoli & NC\_001895 & T  & 42 & 33,593 & 30,411 & 54.7 & $5.60\mathrm{x}10^{-08}$ & $2.54\mathrm{x}10^{-61}$ \\
1331:            P1 & \ecoli & NC\_005856 & T  & 108 & 94,800 & 80,103 & 48.2 & $9.37\mathrm{x}10^{-08}$ & $3.51\mathrm{x}10^{-11}$ \\
1332:            D3112 & \paeru & NC\_005178 & T  & 55 & 37,611 & 34,908 & 80.4 & $3.05\mathrm{x}10^{-07}$ & $4.35\mathrm{x}10^{-05}$ \\
1333:            WPhi & \ecoli & NC\_005056 &T   & 43 & 32,684 & 29,601 & 56.4 & $8.39\mathrm{x}10^{-07}$ & $7.80\mathrm{x}10^{-55}$ \\
1334:            K1F & \ecoli & NC\_007456 & NT  & 43 & 39,704 & 34,629 & 53.4 & $1.75\mathrm{x}10^{-05}$ & $8.03\mathrm{x}10^{-02}$ \\
1335:            T3 & \ecoli & NC\_003298 &  NT & 47 & 38,208 & 29,694 & 54.3 & $3.50\mathrm{x}10^{-05}$ & $3.07\mathrm{x}10^{-04}$ \\
1336:            PaP3 & \paeru & NC\_004466 &  T & 71 & 45,503 & 41,115 & 58.1 & $5.09\mathrm{x}10^{-05}$ & $1.64\mathrm{x}10^{-19}$ \\
1337:            phiV10 & \ecoli & NC\_007804 & T  & 55 & 39,104 & 36,111 & 48.8 & $1.25\mathrm{x}10^{-04}$ & $9.38\mathrm{x}10^{-11}$ \\
1338:            P27 & \ecoli & NC\_003356 &   T& 58 & 42,575 & 37,707 & 50.5 & $2.24\mathrm{x}10^{-04}$ & $2.23\mathrm{x}10^{-20}$ \\
1339:            933W & \ecoli & NC\_000924 &  T & 78 & 61,670 & 52,956 & 50.0 & $4.29\mathrm{x}10^{-04}$ & $8.88\mathrm{x}10^{-09}$ \\
1340:            B3 & \paeru & NC\_006548 &  T & 56 & 38,439 & 36,138 & 77.3 & $4.40\mathrm{x}10^{-04}$ & $3.33\mathrm{x}10^{-05}$ \\
1341:            HK97 & \ecoli & NC\_002167 & T  & 59 & 39,732 & 34,191 & 52.1 & $7.61\mathrm{x}10^{-04}$ & $1.19\mathrm{x}10^{-20}$ \\
1342:            VT2-Sa & \ecoli & NC\_000902 & T  & 83 & 60,942 & 52,647 & 51.3 & $1.31\mathrm{x}10^{-03}$ & $7.40\mathrm{x}10^{-07}$ \\
1343:            PRD1 & \ecoli & NC\_001421 &  NT & 21 & 14,925 & 11,988 & 47.6 & $2.99\mathrm{x}10^{-03}$ & $5.97\mathrm{x}10^{-02}$ \\
1344:            JK06 & \ecoli & NC\_007291 &  U & 71 & 46,072 & 32,841 & 43.0 & $3.84\mathrm{x}10^{-03}$ & $1.63\mathrm{x}10^{-03}$ \\
1345:            T1 & \ecoli & NC\_005833 & NT  & 77 & 48,836 & 44,010 & 47.7 & $7.45\mathrm{x}10^{-03}$ & $3.64\mathrm{x}10^{-01}$ \\
1346:            Pf1 & \paeru & NC\_001331 &  U & 12 & 7,349 & 6,282 & 75.7 & $9.66\mathrm{x}10^{-03}$ & $6.67\mathrm{x}10^{-01}$ \\
1347:            HK022 & \ecoli & NC\_002166 & T  & 57 & 40,751 & 33,885 & 52.7 & $1.25\mathrm{x}10^{-02}$ & $4.36\mathrm{x}10^{-18}$ \\
1348:            4268 & \llact & NC\_004746 &  NT & 49 & 36,596 & 33,759 & 24.7 & $1.59\mathrm{x}10^{-02}$ & $3.20\mathrm{x}10^{-01}$ \\
1349:            BP-4795 & \ecoli & NC\_004813 & T  & 48 & 57,930 & 22,356 & 48.1 & $1.66\mathrm{x}10^{-02}$ & $3.29\mathrm{x}10^{-10}$ \\
1350:            186 & \ecoli & NC\_001317 &T   & 43 & 30,624 & 27,747 & 58.7 & $4.02\mathrm{x}10^{-02}$ & $1.79\mathrm{x}10^{-22}$ \\
1351:            I2-2 & \ecoli & NC\_001332 &  U & 8 & 6,744 & 5,166 & 35.0 & $6.91\mathrm{x}10^{-02}$ & $1.01\mathrm{x}10^{-01}$ \\
1352:            phiKZ & \paeru & NC\_004629 & NT  & 306 & 280,334 & 243,384 & 26.8 & $1.32\mathrm{x}10^{-01}$ & $1.79\mathrm{x}10^{-14}$ \\
1353:            bIL312 & \llact & NC\_002671 &  T & 27 & 15,179 & 11,292 & 28.1 & $1.49\mathrm{x}10^{-01}$ & $8.85\mathrm{x}10^{-04}$ \\
1354:            HK620 & \ecoli & NC\_002730 &  T & 58 & 38,297 & 33,717 & 45.9 & $1.61\mathrm{x}10^{-01}$ & $1.41\mathrm{x}10^{-05}$ \\
1355:            Mu & \ecoli & NC\_000929 & T  & 54 & 36,717 & 33,900 & 54.1 & $1.68\mathrm{x}10^{-01}$ & $4.49\mathrm{x}10^{-10}$ \\
1356:            P4 & \ecoli & NC\_001609 &  T & 14 & 11,624 & 9,765 & 52.4 & $1.71\mathrm{x}10^{-01}$ & $4.17\mathrm{x}10^{-18}$ \\
1357:            N15 & \ecoli & NC\_001901 &  T & 59 & 46,375 & 41,472 & 54.9 & $2.17\mathrm{x}10^{-01}$ & $1.38\mathrm{x}10^{-09}$ \\
1358:            Stx2 I & \ecoli & NC\_003525 & T  & 97 & 61,765 & 34,932 & 48.4 & $3.04\mathrm{x}10^{-01}$ & $4.23\mathrm{x}10^{-04}$ \\
1359:            bIL286 & \llact & NC\_002667 &  T & 61 & 41,834 & 38,694 & 24.8 & $3.68\mathrm{x}10^{-01}$ & $1.17\mathrm{x}10^{-01}$ \\
1360:            Tuc2009 & \llact & NC\_002703 &  T & 56 & 38,347 & 35,178 & 28.0 & $4.08\mathrm{x}10^{-01}$ & $1.81\mathrm{x}10^{-02}$ \\
1361:            Stx2 II & \ecoli & NC\_004914 &T   & 99 & 62,706 & 34,755 & 50.1 & $5.85\mathrm{x}10^{-01}$ & $9.94\mathrm{x}10^{-03}$ \\
1362:            BK5-T & \llact & NC\_002796 &  T & 52 & 40,003 & 33,267 & 24.0 & $5.91\mathrm{x}10^{-01}$ & $6.68\mathrm{x}10^{-01}$ \\
1363:            Stx1 & \ecoli & NC\_004913 &  T & 93 & 59,866 & 33,444 & 49.5 & $6.75\mathrm{x}10^{-01}$ & $2.97\mathrm{x}10^{-03}$ \\
1364:            LC3 & \llact & NC\_005822 &T   & 51 & 32,172 & 29,607 & 24.6 & $7.31\mathrm{x}10^{-01}$ & $4.90\mathrm{x}10^{-01}$ \\
1365:            ul36 & \llact & NC\_004066 &  NT & 58 & 36,798 & 32,400 & 27.7 & $8.64\mathrm{x}10^{-01}$ & $4.66\mathrm{x}10^{-02}$ \\
1366:            Pf3 & \paeru & NC\_001418 &U   & 9 & 5,833 & 5,487 & 35.9 & $8.70\mathrm{x}10^{-01}$ & $1.64\mathrm{x}10^{-06}$ \\
1367:            bIL285 & \llact & NC\_002666 &T   & 62 & 35,538 & 32,646 & 26.7 & $9.20\mathrm{x}10^{-01}$ & $9.93\mathrm{x}10^{-01}$ \\
1368:            r1t & \llact & NC\_004302 &T   & 50 & 33,350 & 30,315 & 25.4 & $9.53\mathrm{x}10^{-01}$ & $6.03\mathrm{x}10^{-01}$ \\
1369:            bIL170 & \llact & NC\_001909 & T  & 63 & 31,754 & 27,663 & 27.1 & $9.91\mathrm{x}10^{-01}$ & $8.71\mathrm{x}10^{-01}$ \\
1370:     \end{tabular}
1371:     }
1372:     % \end{center}
1373:     \caption{Phage properties. Properties are listed for all phages included in
1374:     Figure \ref{fig:green_orange_pass_genomes}, in the same order based on the
1375:     orange p-value. Lifestyle annotations are T (temperate), NT (non-temperate),
1376:     U (unknown). The coding length refers to the length of all coding sequences
    concatenated together (see Methods).}
1378:     \label{tab:phage_properties}
1379: \end{table}
1380: 
1381: \begin{table}
1382: 	\begin{center}
1383: 		\begin{tabular}
1384: 		    {c|c|c}
1385: 		     & Lambda & All Phage Genes \\
1386: 		     \hline
1387: 		     Number structural & 7 & 279 \\
1388: 		     Number non-structural & 18 & 1022 \\
1389: 		     \hline
1390: 		     \multicolumn{3}{c}{Aqua CAI Randomization Test} \\
1391: 		     \hline
1392: 		     median $p^{>}$ structural & $1.3\mathrm{x}10^{-4}$ & $8.0\mathrm{x}10^{-3}$ \\
1393: 		     median $p^{>}$ non-structural & 1.0 & 1.0 \\
1394: 		     ANOVA significance & $p=4.5\mathrm{x}10^{-5}$ & $p=4.7\mathrm{x}10^{-12}$ \\
1395: 		     \hline
1396: 		     \multicolumn{3}{c} {Orange BCAI Randomization Test} \\
1397: 		     \hline
1398: 		     median $p^{>}$ structural & $2.8\mathrm{x}10^{-2}$ & $2.0\mathrm{x}10^{-1}$ \\
1399: 		     median $p^{>}$ non-structural & 0.98 & 0.73 \\
1400: 		     ANOVA significance & $p=1.8\mathrm{x}10^{-4}$ & $p=1.6\mathrm{x}10^{-15}$ \\
1401: 		\end{tabular}
1402: 	\end{center}
	\caption{Structural annotation versus codon usage. The table shows
	the median $p^>$ values among structural and non-structural genes,
	under the aqua and orange randomization tests. Small $p^>$ values indicate
	significantly elevated CAI, controlling for the amino acid sequence
	(aqua test) and the GC3 sequence (orange test). We also report the
	significance of non-parametric ANOVAs that compare median $p^>$-values between
1409: 	the structural and non-structural genes. Analyses are limited to 
1410: 	those genes that pass the aqua test, as described in the main text;
1411: 	similar results are found without this restriction.
1412: } 
1413: 	\label{tab:lambda_all_struct_non_aqua_orange} 
1414: \end{table}
1415: \clearpage
1416: 
1417: \begin{table}
1418: 	\begin{center}
1419: 		\begin{tabular}
1420: 		    {c|c}
1421: 		     & All Phage Genes \\
1422: 		     \hline
1423: 		     Number `Head'  & 145 \\
1424: 		     Number `Tail'  & 134 \\
1425: 		     Number non-structural (NS) & 1022 \\
1426: 		     \hline
1427: 		     \multicolumn{2}{c}{Aqua CAI Randomization Test} \\
1428: 		     \hline
1429: 		     median $p^{>}$ head &  $2.0\mathrm{x}10^{-3}$ \\
1430: 		     median $p^{>}$ tail &  $2.0\mathrm{x}10^{-2}$ \\
1431: 		     median $p^{>}$ NS &  1.0 \\
1432: 		     ANOVA Head vs NS & $p=6.4\mathrm{x}10^{-19}$ \\
1433: 		     ANOVA Tail vs NS & $p=1.8\mathrm{x}10^{-1}$ \\
1434: 		     ANOVA Head vs Tail & $p=2.1\mathrm{x}10^{-8}$ \\
1435: 		     \hline
1436: 		     \multicolumn{2}{c} {Orange BCAI Randomization Test} \\
1437: 		     \hline
1438: 		     median $p^{>}$ head &  $7.0\mathrm{x}10^{-2}$ \\
1439: 		     median $p^{>}$ tail &  $4.3\mathrm{x}10^{-1}$ \\
1440: 		     median $p^{>}$ NS &  0.73 \\
1441: 		     ANOVA Head vs NS  & $p=4.2\mathrm{x}10^{-21}$ \\
1442: 		     ANOVA Tail vs NS  & $p=1.7\mathrm{x}10^{-2}$ \\
1443: 		     ANOVA Head vs Tail  & $p=6.0\mathrm{x}10^{-8}$ \\
1444: 		\end{tabular}
1445: 	\end{center}
1446: 	\caption{Comparison between codon usage and refined structural
1447: 	annotations.
1448: 	As in Table \ref{tab:lambda_all_struct_non_aqua_orange}, 
1449: 	we compare the median aqua and orange $p^>$ values among head genes, tail
1450: 	genes, and non-structural genes. We report the significance of
1451: 	pairwise non-parametric ANOVAs comparing head to non-structural, tail
1452: 	to non-structural, and head to tail genes.
1453: 	These analyses are limited to genes that pass the aqua test; 
1454: 	similar results are found without this
1455: 	restriction.
1456: }
1457: 	\label{tab:all_head_tail_aqua_orange} 
1458: \end{table}
1459: 
1460: \clearpage
1461: \begin{table}
1462: 	\begin{center}
1463: 		\begin{tabular}
1464: 		    {c|c}
1465: 		     \multicolumn{2}{c}{Median $p_{\text{combined}}^{\mathrm{orange}}$} \\
1466: 		     \hline
1467: 		     Temperate & $1.4\mathrm{x}10^{-2}$ \\
1468: 		     Non-temperate & $2.6\mathrm{x}10^{-5}$ \\
1469: 		     Un-identified & $4\mathrm{x}10^{-2}$ \\
1470: 		     ANOVA significance & $p = 0.1$ \\
1471: 		     \hline
1472: 		     \multicolumn{2}{c}{Median $p_{\text{combined}}^{\mathrm{green}}$} \\
1473: 		     \hline
1474: 		     Temperate & $5.1\mathrm{x}10^{-9}$ \\
1475: 		     Non-temperate & $7.0\mathrm{x}10^{-2}$ \\
1476: 		     Un-identified & $5\mathrm{x}10^{-2}$ \\
1477: 		     ANOVA significance & $p = 0.009$ \\
1478: 		\end{tabular}
1479: 	\end{center}
1480: \caption{{\bf Phage lifestyle versus codon usage}. The table shows the median
1481: $p_{\text{combined}}^{\mathrm{orange}}$ and
1482: $p_{\text{combined}}^{\mathrm{green}}$ values among phages classified as
1483: temperate, non-temperate, or un-identified for all phages included in Figure
1484: \ref{fig:green_orange_pass_genomes} and Table \ref{tab:phage_properties}. Small
1485: median $p_{\text{combined}}^{\mathrm{orange}}$ values indicate that these phages
1486: have significantly non-random (in either direction) BCAI, controlling for the
1487: amino acid sequence and the GC3 sequence, while small median
1488: $p_{\text{combined}}^{\mathrm{green}}$ values indicate that these phages have
1489: significantly non-random (in either direction) GC3, controlling for the amino
1490: acid sequence and the BCAI sequence. We also report the significance of
non-parametric ANOVAs that compare these medians between these groups of phages.
1492: }
1493: 	\label{tab:temperate_non} 
1494: \end{table}
1495: % section tables (end)
1496: \clearpage
1497: \bibliography{GLA,GLA_lux}
1498: 
1499: \end{document} 
1500: 
1501: