1: \documentclass{e1}
2: \usepackage{amssymb}
3: \usepackage{rotating}
4: \usepackage{longtable}
5: \usepackage{graphicx}
6: \usepackage{natbib}
7: %\usepackage{biograph}
8: \setlength{\textwidth}{16cm} \setlength{\textheight}{22.5cm} \setlength{\oddsidemargin}{10mm}
9: \setlength{\evensidemargin}{10mm}
10: \begin{document}
11: \begin{frontmatter}
12:
13: \title{CODON USAGE BIAS MEASURED THROUGH ENTROPY APPROACH}%\thanks{This paper is supported, in part, by Krasnoyarsk Science Foundation, grant 13F105.}
14: \author[ibf,defakto]{Michael G.Sadovsky\corauthref{cor1}\thanksref{label2}}
15: \thanks[label2]{To whom the correspondence should be addressed.}
16: \corauth[cor1]{660036 Russia, Krasnoyarsk, Akademgorodok; Institute of computational modelling of RAS; tel.
17: +7(3912)907469, fax: +7(3912)907454}
18: \address[ibf]{Institute of computational modelling of RAS}
19: \ead{msad@icm.krasn.ru}
20: \author[defakto]{Julia A.Putintzeva}
\address[defakto]{Siberian Federal University, Institute of Natural Sciences \& Humanities}
22: \ead{kinomanka85@mail.ru}
23:
24: \begin{abstract}
A measure of codon usage bias is defined as the mutual entropy of the real codon frequency distribution
calculated against a quasi-equilibrium one. The latter is defined in three ways: (1) the frequencies of
synonymous codons are supposed to be equal (i.e., each equals the arithmetic mean of their frequencies); (2) it
coincides with the frequency distribution of triplets; and, finally, (3) the quasi-equilibrium frequency
distribution is defined as the expected frequency of codons derived from the dinucleotide frequency
distribution. The measure of bias in codon usage is calculated for $115$ bacterial genomes.
31: \end{abstract}
32:
33: \begin{keyword}
34: frequency \sep expected frequency \sep information value \sep entropy \sep correlation \sep classification
35: \end{keyword}
36:
37: \end{frontmatter}
38:
39: \newpage
40: \section{Introduction}\label{intro}
41:
It is well known that the genetic code is degenerate. All amino acids (except two) are encoded by
two or more codons; such codons are called synonymous and usually differ in the nucleotide occupying the third
position of the codon. The synonymous codons occur with different frequencies, and this difference is observed
both between various genomes \citep{1,2,3,4} and between different genes of the same genome \citep{3,4,5,6}. A
synonymous codon usage bias could be explained in various ways, including mutational bias (shaping genomic
$\mathsf{G}$+$\mathsf{C}$ composition) and translational selection by tRNA abundance (acting mainly on highly
expressed genes). Still, the reported results are somewhat contradictory \citep{6}. The contradiction may
result from the differences in statistical methods used to estimate the codon usage bias. Here one should
clearly understand what factors affect the method and the numerical result. Boltzmann entropy theory
\citep{bolz,e2} has been applied to estimate the degree of deviation from equal codon usage \citep{x,3}.
52:
The key point here is that the measure of codon usage bias should be independent of biological
issues. It is highly desirable to avoid any biological assumptions (such as mutational
bias or translational selection); the measure must be defined in a purely mathematical way. The idea of entropy
seems to suit this purpose best. The additional constraints on codon usage resulting from the amino acid
frequency distribution affect the entropy values, thus obscuring the effects directly linked to biases in
synonymous codon usage.
59:
Here we propose three new indices of codon usage bias, which take into account all three important
aspects of amino acid usage, i.e. (1) the number of distinct amino acids, (2) their relative frequencies, and
(3) their degree of codon degeneracy. All the indices are based on the calculation of mutual entropy
$\overline{S}$. They differ in the codon frequency distribution supposed to be ``quasi-equilibrium''; that is,
the indices differ only in the definition of the latter.
65:
Consider a genetic entity, say, a genome, of the length $N$; the latter is the number of nucleotides
composing the entity. A word $\omega$ of the length $q$, $1 \leq q \leq N$, is a string of $q$ nucleotides
observed within the entity. The set of all the words occurring within an entity makes the support $\mathsf{V}$
of the entity (or $q$--support, if indication of the length $q$ is necessary). Accompanying each element
$\omega \in \mathsf{V}$ with the number $n_{\omega}$ of its copies, one gets the (finite)
dictionary of the entity. Changing $n_{\omega}$ for the frequency \[f_{\omega}= \frac{n_{\omega}}{N}\,,\] one
gets the frequency dictionary $W_q$ of the entity (of the thickness $q$).
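
As a small illustration (a toy sequence, not taken from the data studied here), consider the sequence
$\mathsf{ACGACG}$ of the length $N=6$, closed into a ring so that exactly $N$ words of any length $q$ are
counted; the ring closure is an assumption of this example. For $q=2$ the words read along the ring are
$\mathsf{AC}$, $\mathsf{CG}$, $\mathsf{GA}$, $\mathsf{AC}$, $\mathsf{CG}$, $\mathsf{GA}$, so that
\[
W_2 = \left\{ f_{\mathsf{AC}} = f_{\mathsf{CG}} = f_{\mathsf{GA}} = \frac{2}{6} = \frac{1}{3} \right\}\,,
\qquad \sum_{\omega \in \mathsf{V}} f_{\omega} = 1\,.
\]
Without the ring closure only $N-q+1$ words are observed, and the counts should be divided by that number
if the frequencies are to sum to one.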
73:
Everywhere below we shall distinguish the codon frequency distribution from the
triplet frequency distribution. A triplet frequency distribution is the frequency dictionary $W_3$ of the
thickness $q=3$, where triplets are counted with no respect to the specific position of a triplet
within the sequence. On the contrary, the codon distribution is the frequency distribution of the triplets
occupying specific places within an entity: a codon is a triplet embedded into a sequence at a coding
position only. Thus, the number of copies of the words of the length $q=3$ involved in the codon
distribution is three times smaller than that in the frequency dictionary $W_3$ of triplets.
Further, we shall denote the codon frequency dictionary as $\mathfrak{W}$; no lower index will be used, since
the thickness of the dictionary is fixed (and equal to $q=3$).
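
The difference between the two dictionaries may be seen from a toy example (an illustrative sequence, not one
of the genomes studied here). Take the sequence $\mathsf{ACGACG}$ closed into a ring, and assume that the
reading frame starts at the first nucleotide. The triplets read at every position are $\mathsf{ACG}$,
$\mathsf{CGA}$, $\mathsf{GAC}$, $\mathsf{ACG}$, $\mathsf{CGA}$, $\mathsf{GAC}$, while the codons are
$\mathsf{ACG}$, $\mathsf{ACG}$ only. Hence
\[
W_3 = \left\{ f_{\mathsf{ACG}} = f_{\mathsf{CGA}} = f_{\mathsf{GAC}} = \frac{1}{3} \right\}\,, \qquad
\mathfrak{W} = \left\{ f_{\mathsf{ACG}} = 1 \right\}\,,
\]
i.e., six triplets against two codons, and the two distributions differ although they are derived from the
same sequence.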
83:
84: \section{Materials and methods}\label{sec:1}
85: \subsection{Sequences and Codon Tabulations}\label{sec:2}
The tables of codon usage frequency were taken from the Kazusa Institute site\footnote{www.kazusa.ac.jp/codons}.
The corresponding genome sequences have been retrieved from EMBL--bank\footnote{www.ebi.ac.uk/genomes}. Only
codon usage tables containing no fewer than $10000$ codons have been used. Here we studied bacterial genomes
(see Table~\ref{T1}).
90:
\subsection{Codon usage bias indices}\label{sec:2-2}
Let $F$ denote the codon frequency distribution, $F = \{f_{\nu_1\nu_2\nu_3}\}$; here $f_{\nu_1\nu_2\nu_3}$ is
the frequency of a codon $\nu_1\nu_2\nu_3$. Further, let $\widetilde{F}$ denote a quasi-equilibrium frequency
distribution of codons. Then the measure $I$ of the codon usage bias is defined as the mutual entropy of
the real frequency distribution $F$ calculated against the quasi-equilibrium one $\widetilde{F}$:
96: \begin{equation}\label{eq:1}
97: I = \sum_{\omega = 1}^{64} f_{\omega} \cdot \ln \left( \frac{f_{\omega}}{\tilde{f}_{\omega}} \right)\;.
98: \end{equation}
Here the index $\omega$ enumerates the codons, and $\tilde{f}_{\omega} \in \widetilde{F}$ is the
quasi-equilibrium frequency. The measure (\ref{eq:1}) itself is rather simple and clear; the definition of the
quasi-equilibrium distribution of codons is the matter of discussion here. We propose three ways to define the
distribution $\widetilde{F}$; they provide three different indices of codon usage bias. The relation between
the values of these indices observed for the same genome is the key issue of our study.
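
Two general properties of the measure (\ref{eq:1}) are worth recalling; they follow from the standard Gibbs
inequality and hold provided that both $F$ and $\widetilde{F}$ are normalized and that $\tilde{f}_{\omega} > 0$
whenever $f_{\omega} > 0$ (the terms with $f_{\omega} = 0$ are taken to be zero by convention):
\[
I \geq 0\,, \qquad I = 0 \ \ \textrm{if and only if} \ \ f_{\omega} = \tilde{f}_{\omega} \ \ \textrm{for all}
\ \ \omega\,.
\]
As a toy numerical example (with two hypothetical codons rather than $64$), let $F = \{3/4, 1/4\}$ and
$\widetilde{F} = \{1/2, 1/2\}$; then
\[
I = \frac{3}{4} \ln \frac{3}{2} + \frac{1}{4} \ln \frac{1}{2} \approx 0.1308\,.
\]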
104:
105: \subsubsection{Locally equilibrium codon distribution}
It is a well-known fact that various amino acids occur with different frequencies within a genome or
a gene. Synonymous codons, in turn, also occur with different frequencies within such genetic entities.
Thus, an equality of frequencies of all the synonymous codons encoding the same amino acid
\begin{equation}\label{eq:2}
\tilde{f}_j = \frac{1}{L_i} \sum_{k \in J_i} f_k\,, \qquad \sum_{j \in J_i} f_j = \sum_{j \in J_i} \tilde{f}_j
= \varphi_i \;,
\end{equation}
is the first way to determine a quasi-equilibrium codon frequency distribution. Here $J_i$ is the set of
synonymous codons encoding the $i{\textrm{-th}}$ amino acid, $L_i$ is the number of codons in $J_i$, the
indices $j$ and $k$ enumerate the codons of $J_i$, and $\varphi_i$ is the frequency of that amino acid. Surely,
the list of amino acids must be extended with the {\sl stop} signal (encoded by three codons). Obviously,
$\tilde{f}_j = \tilde{f}_k$ for any couple $j,k \in J_i$.
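
It may be instructive to see what the measure (\ref{eq:1}) looks like for this choice of $\widetilde{F}$. The
following is a short sketch of the substitution; here $L_i$ denotes the number of codons in $J_i$, so that
$\tilde{f}_j = \varphi_i / L_i$ for $j \in J_i$, and $H_i$ is an auxiliary symbol introduced for this
illustration only:
\[
I = \sum_{i} \sum_{j \in J_i} f_j \ln \frac{f_j L_i}{\varphi_i} = \sum_{i} \varphi_i \left( \ln L_i - H_i
\right), \qquad H_i = - \sum_{j \in J_i} \frac{f_j}{\varphi_i} \ln \frac{f_j}{\varphi_i}\,.
\]
Thus the index is the average, weighted by amino acid frequencies, of the deficit of the entropy $H_i$ of
synonymous codon choice against its maximal value $\ln L_i$. An amino acid encoded by a single codon
($L_i = 1$) contributes nothing, and the index vanishes if and only if the synonymous codons of every amino
acid are used equally often.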
118:
119: \subsubsection{Codon distribution vs. triplet distribution}
The triplet distribution gives the second way to define the quasi-equilibrium codon frequency distribution.
Since the codon frequency is determined with respect to the specific locations of the strings of the length
$q=3$, two thirds of the copies of these strings fall outside the calculation of the codon
frequency distribution. Thus, one can compare the codon frequency distribution with the similar distribution
determined over the entire sequence, with no gaps in the location of strings. So, the frequency dictionary
$W_3$ of the thickness $q=3$
\begin{equation}\label{eq:3}
\tilde{f}_l = \hat{f}_l\,, \qquad 1 \leq l \leq 64
\end{equation}
is the quasi-equilibrium codon distribution here; $\hat{f}_l$ is the frequency of the $l{\textrm{-th}}$ triplet
in $W_3$.
130:
131: \subsubsection{The most expected codon frequency distribution}
Finally, the third way to define the quasi-equilibrium codon frequency distribution is to derive it from the
frequency distribution of the dinucleotides composing the codons. Having the codon frequency distribution $F$,
one can always derive the frequency distribution $F_2$ of the dinucleotides composing the codons. To do that,
one must sum up the frequencies of the codons differing in the third (or in the first) nucleotide. Such a
transformation is unambiguous\footnote{Here one must close up a sequence into a ring.}. The situation
gets worse when one tries to obtain a codon distribution through the inverse transformation. An upward
transformation yields a family of dictionaries $\{F\}$, instead of the single one $F$. To eliminate the
ambiguity, one should apply some basic principle that prevents any extra, additional
information from being introduced into the reconstructed codon frequency distribution. The principle of
maximum entropy of the extended (i.e., codon) frequency distribution makes sense here \citep{n1,n2,n3,n4}. It
means that a researcher must figure out the extended (or reconstructed) codon distribution $\widetilde{F}$ with
maximal entropy among the entities composing the family $\{F\}$. This approach allows one to calculate the
frequencies of codons explicitly:
145: \begin{equation}\label{eq:4}
146: \widetilde{f}_{ijk} = \frac{f_{ij}\times f_{jk}}{f_{j}}\;,
147: \end{equation}
148: where $\widetilde{f}_{ijk}$ is the expected frequency of codon $ijk$, $f_{ij}$ is the frequency of a
149: dinucleotide $ij$, and $f_j$ is the frequency of nucleotide $j$; here $i,j,k \in \{\mathsf{A}, \mathsf{C},
150: \mathsf{G}, \mathsf{T}\}$.
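
A short consistency check of (\ref{eq:4}) may be sketched as follows, under the assumption that the
dinucleotide and nucleotide frequencies are consistent marginals of the codon distribution, i.e., that
$\sum_{k} f_{jk} = \sum_{i} f_{ij} = f_j$ and $f_j > 0$. Then the reconstructed distribution reproduces the
dinucleotide frequencies it was derived from and is normalized:
\[
\sum_{k} \widetilde{f}_{ijk} = \frac{f_{ij}}{f_j} \sum_{k} f_{jk} = f_{ij}\,, \qquad \sum_{i}
\widetilde{f}_{ijk} = \frac{f_{jk}}{f_j} \sum_{i} f_{ij} = f_{jk}\,, \qquad \sum_{i,j,k} \widetilde{f}_{ijk} =
1\,.
\]
Substituting (\ref{eq:4}) into (\ref{eq:1}) yields
\[
I = \sum_{i,j,k} f_{ijk} \ln \frac{f_{ijk}\, f_{j}}{f_{ij}\, f_{jk}}\,,
\]
so that, under the same assumption, this index vanishes if and only if the first and the third nucleotides of
a codon are statistically independent, given the second one.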
151:
Thus, the calculation of the measure (\ref{eq:1}) maps each genome into a three-dimensional space.
Table~\ref{T1} shows the data calculated for 115 bacterial genomes.
154:
155: \section{Results}\label{res}
We have examined 115 bacterial genomes. The values of the three indices (\ref{eq:1}~-- \ref{eq:4}) and the
absolute entropy of the codon distribution are shown in Table~\ref{T1}.
158: \begin{longtable}{|p{8.4cm}|c|c|c|c|c|}
\caption{\label{T1} Indices of codon usage bias; $I$ is the index calculated according to (\ref{eq:2}),
$S^{\ast}$ stands for the index defined by (\ref{eq:3}),
and $T$ is the index defined by (\ref{eq:4}). $S$ is the absolute entropy of the codon distribution. $C$ is the class attribution (see Section~\ref{classif}).}\\
162: \hline \multicolumn{1}{|c|}{Genomes}& \multicolumn{1}{c|}{$I$} & $S^{\ast}$ & $T$ & $S$ & $C$\\
163: \hline
164: \endfirsthead
165: \multicolumn{6}{r}%
166: {{\tablename\ \thetable{} -- continued}} \\
167: \hline \multicolumn{1}{|c|}{Genomes} & \multicolumn{1}{c|}{$I$} & $S^{\ast}$ & $T$ & $S$ & $C$\\
168: \hline
169: \endhead
170: \hline \multicolumn{6}{|r|}{{continued on the next page}} \\ \hline
171: \endfoot
172: \endlastfoot
173: Acinetobacter sp.ADP1&0.1308&0.1526&0.1332&3.9111&1\\
174: Aeropyrum pernix K1&0.1381&0.1334&0.1611&3.9302&2\\
175: Agrobacterium tumefaciens str. C58&0.1995&0.1730&0.2681&3.8504&2\\
176: Aquifex aeolicus VF5&0.1144&0.1887&0.2273&3.8507&2\\
177: Archaeoglobus fulgidus DSM 4304&0.1051&0.2008&0.2264&3.9011&2\\
178: Bacillus anthracis str. Ames&0.1808&0.1880&0.1301&3.8232&1\\
179: Bacillus anthracis str. Sterne&0.1800&0.1873&0.1300&3.8236&1\\
180: Bacillus anthracis str.'Ames Ancestor'&0.1788&0.1850&0.1278&3.8246&1\\
181: Bacillus cereus ATCC 10987&0.1750&0.1791&0.1254&3.8291&1\\
182: Bacillus cereus ATCC 14579&0.1807&0.1853&0.1290&3.8220&1\\
183: Bacillus halodurans C-125&0.0538&0.1296&0.0967&3.9733&1\\
184: Bacillus subtilis subsp.subtilis str. 168&0.0581&0.1231&0.1117&3.9605&2\\
185: Bacteroides fragilis YCH46&0.0499&0.1201&0.1305&3.9824&2\\
186: Bacteroides thetaiotaomicron VPI-5482&0.0557&0.1258&0.1364&3.9713&2\\
187: Bartonella henselae str. Houston-1&0.1555&0.1650&0.1077&3.8913&1\\
188: Bartonella quintana str. Toulouse&0.1525&0.1616&0.1039&3.8954&1\\
189: Bdellovibrio bacteriovorus HD100&0.1197&0.1593&0.2404&3.9232&2\\
190: Bifidobacterium longum NCC2705&0.2459&0.2315&0.3666&3.8011&2\\
191: Bordetella bronchiseptica RB50&0.4884&0.3165&0.5598&3.5485&2\\
Borrelia burgdorferi B31&0.2330&0.1555&0.0988&3.6709&1\\
193: Borrelia garinii Pbi&0.2421&0.1616&0.1008&3.6630&1\\
194: Bradyrhizobium japonicum USDA 110&0.3163&0.2236&0.3789&3.7368&2\\
195: Campylobacter jejuni RM1221&0.2839&0.1994&0.1357&3.6617&1\\
196: Campylobacter jejuni subsp. Jejuni NCTC 11168&0.2846&0.2010&0.1379&3.6660&1\\
197: Caulobacter crescentus CB15&0.4250&0.2890&0.5045&3.6062&2\\
198: Chlamydophila caviae GPIC&0.1079&0.1199&0.0990&3.9445&1\\
199: Chlamydophila pneumoniae CWL029&0.0803&0.1054&0.0778&3.9748&1\\
200: Chlamydophila pneumoniae J138&0.0801&0.1050&0.0772&3.9755&1\\
201: Chlamydophila pneumoniae TW-183&0.0802&0.1037&0.0764&3.9760&1\\
202: Chlorobium tepidum TLS&0.1767&0.1809&0.2935&3.8777&2\\
203: Chromobacterium violaceum ATCC 12472&0.4245&0.3004&0.5354&3.6218&2\\
Chlamydophila pneumoniae AR39&0.0804&0.1055&0.0773&3.9748&2\\
205: Clostridium acetobutylicum ATCC 824&0.2431&0.1951&0.1305&3.7142&1\\
206: Clostridium perfringens str. 13&0.3602&0.2752&0.1943&3.5816&1\\
207: Clostridium tetani E88&0.3240&0.2381&0.1767&3.6088&1\\
208: Corynebacterium efficiens YS-314&0.2983&0.2379&0.3980&3.7494&2\\
209: Corynebacterium glutamicum ATCC 13032&0.0964&0.1510&0.1674&3.9498&2\\
210: Coxiella burnetii RSA 493&0.0843&0.1050&0.0892&3.9648&2\\
211: Desulfovibrio vulgaris subsp.vulgaris str. Hildenborough&0.2459&0.1980&0.3183&3.8090&2\\
212: Enterococcus faecalis V583&0.1592&0.1838&0.1295&3.8453&1\\
213: Escherichia coli CFT073&0.1052&0.1305&0.1734&3.9576&2\\
214: Escherichia coli K12 MG1655&0.1206&0.1463&0.1933&3.9372&2\\
215: Helicobacter hepaticus ATCC 51449&0.1760&0.1513&0.1065&3.8315&1\\
216: Helicobacter pylori 26695&0.1420&0.1646&0.1843&3.8454&2\\
217: Helicobacter pylori J99&0.1404&0.1660&0.1895&3.8479&2\\
218: Lactobacillus johnsonii NCC 533&0.2113&0.1937&0.1481&3.7856&1\\
219: Lactobacillus plantarum WCFS1&0.0813&0.1453&0.1544&3.9537&2\\
220: Lactococcus lactis subsp. Lactis Il1403&0.1923&0.1857&0.1173&3.8068&1\\
221: Legionella pneumophila subsp. Pneumophila str. Philadelphia 1&0.1018&0.1098&0.0880&3.9339&1\\
222: Leifsonia xyli subsp. Xyli str. CTCB07&0.3851&0.2411&0.4032&3.6490&2\\
Listeria monocytogenes str. 4b F2365&0.1389&0.1766&0.1012&3.8600&1\\
224: Mannheimia succiniciproducens MBEL55E&0.1390&0.1624&0.1571&3.8943&1\\
225: Mesorhizobium loti MAFF303099&0.2734&0.2019&0.3402&3.7751&2\\
226: Methanocaldococcus jannaschii DSM 2661&0.2483&0.2108&0.1324&3.6751&2\\
227: Methanopyrus kandleri AV19&0.2483&0.2108&0.1324&3.6751&1\\
228: Methanosarcina acetivorans C2A&0.0530&0.1223&0.0876&3.9718&1\\
229: Methanosarcina mazei Go1&0.0739&0.1314&0.0889&3.9468&1\\
230: Methylococcus capsulatus str. Bath&0.2847&0.2096&0.3738&3.7709&2\\
231: Mycobacterium avium subsp. Paratuberculosis str. K10&0.4579&0.2779&0.4819&3.6038&2\\
232: Mycobacterium bovis AF2122/97&0.2449&0.1688&0.2862&3.7931&2\\
233: Mycobacterium leprae TN&0.1075&0.1216&0.1717&3.9513&2\\
Mycobacterium tuberculosis CDC1551&0.2387&0.1618&0.2749&3.8029&2\\
235: Mycobacterium tuberculosis H37Rv&0.2457&0.1696&0.2878&3.7929&2\\
236: Mycoplasma mycoides subsp. mycoides SC&0.4748&0.2571&0.2247&3.4356&1\\
237: Mycoplasma penetrans HF-2&0.4010&0.2320&0.2047&3.5294&1\\
238: Neisseria gonorrhoeae FA 1090&0.1610&0.1740&0.2343&3.8852&2\\
239: Neisseria meningitidis MC58&0.1481&0.1708&0.2244&3.8969&2\\
Neisseria meningitidis Z2491 serogroup A str. Z2491&0.1541&0.1786&0.2342&3.8898&2\\
Nitrosomonas europaea ATCC 19718&0.0824&0.1104&0.1587&3.9806&2\\
242: Nocardia farcinica IFM 10152&0.4842&0.2917&0.4968&3.5343&2\\
243: Nostoc sp.PCC7120&0.0877&0.1308&0.1124&3.9638&1\\
244: Parachlamydia sp. UWE25&0.1689&0.1397&0.1027&3.8561&1\\
245: Photorhabdus luminescens subsp. Laumondii TTO1&0.0704&0.1183&0.1068&3.9838&1\\
246: Porphyromonas gingivalis W83&0.0476&0.1167&0.1559&4.0034&2\\
247: Prochlorococcus marinus str. MIT 9313&0.0472&0.0956&0.0773&4.0203&1\\
248: Prochlorococcus marinus subsp. Marinus str. CCMP1375&0.1729&0.1423&0.1177&3.8697&1\\
249: Prochlorococcus marinus subsp. Pastoris str. CCMP1986&0.2556&0.1671&0.1412&3.7354&1\\
250: Propionibacterium acnes KPA171202&0.1277&0.1338&0.1700&3.9293&2\\
251: Pseudomonas aeruginosa PAO1&0.4648&0.3204&0.5733&3.5827&2\\
252: Pseudomonas putida KT2440&0.2847&0.2255&0.4061&3.7696&2\\
253: Pseudomonas syringae pv. Tomato str. DC3000&0.1960&0.1736&0.3013&3.8633&2\\
254: Pyrococcus abyssi GE5&0.0983&0.1962&0.1996&3.8887&2\\
255: Pyrococcus furiosus DSM 3638&0.1000&0.1641&0.1079&3.8847&1\\
256: Pyrococcus horikoshii OT3&0.0899&0.1508&0.1260&3.9105&1\\
257: Salmonella enterica subsp. Enterica serovar Typhi Ty2&0.1272&0.1465&0.2068&3.9327&2\\
258: Salmonella typhimurium LT2&0.1293&0.1490&0.2100&3.9300&2\\
259: Shewanella oneidensis MR-1&0.0700&0.1320&0.1329&3.9795&2\\
260: Shigella flexneri 2a str. 2457T&0.1196&0.1429&0.1913&3.9416&2\\
261: Shigella flexneri 2a str. 301&0.1097&0.1343&0.1791&3.9529&2\\
262: Sinorhizobium meliloti 1021&0.1960&0.2199&0.3013&3.8633&2\\
263: Staphylococcus aureus subsp. Aureus MRSA252&0.2338&0.2086&0.1531&3.7572&1\\
264: Staphylococcus aureus subsp. Aureus MSSA476&0.2356&0.2071&0.1554&3.7557&1\\
265: Staphylococcus aureus subsp. Aureus Mu50&0.2318&0.2056&0.1522&3.7591&1\\
266: Staphylococcus aureus subsp. Aureus MW2&0.2368&0.2106&0.1562&3.7535&1\\
267: Staphylococcus aureus subsp. Aureus N315&0.2348&0.2083&0.1543&3.7564&1\\
268: Staphylococcus epidermidis ATCC 12228&0.2277&0.2036&0.1399&3.7613&1\\
269: Staphylococcus haemolyticus JCSC1435&0.2304&0.2043&0.1526&3.7619&1\\
270: Streptococcus agalactiae 2603V/R&0.1690&0.1794&0.1200&3.8372&1\\
271: Streptococcus agalactiae NEM316&0.1679&0.1790&0.1209&3.8371&1\\
272: Streptococcus mutans UA159&0.1577&0.1783&0.1240&3.8468&1\\
273: Streptococcus pneumoniae R6&0.0952&0.1529&0.1210&3.9152&1\\
274: Streptococcus pneumoniae TIGR4&0.0957&0.1525&0.1209&3.9168&1\\
275: Streptococcus pyogenes M1 GAS&0.1227&0.1619&0.1137&3.8900&1\\
276: Streptococcus pyogenes MGAS10394&0.1167&0.1596&0.1101&3.8974&1\\
277: Streptococcus pyogenes MGAS315&0.1189&0.1636&0.1108&3.8929&1\\
278: Streptococcus pyogenes MGAS5005&0.1215&0.1612&0.1115&3.8929&1\\
279: Streptococcus pyogenes MGAS8232&0.1194&0.1608&0.1114&3.8932&1\\
280: Streptococcus pyogenes SSI-1&0.1189&0.1597&0.1111&3.8932&1\\
281: Streptococcus thermophilus CNRZ1066&0.1210&0.1710&0.1325&3.8908&1\\
282: Streptococcus thermophilus LMG 18311&0.1235&0.1737&0.1339&3.8881&1\\
283: Sulfolobus tokodaii str. 7&0.1932&0.1639&0.1253&3.7954&1\\
284: Thermoplasma acidophilum DSM 1728&0.0920&0.1668&0.2228&3.9315&2\\
285: Thermoplasma volcanium GSS1&0.0692&0.1345&0.1247&3.9379&2\\
Treponema pallidum str.Nichols&0.0548&0.0894&0.1095&4.0205&2\\
Ureaplasma parvum serovar 3 str. ATCC 700970&0.4111&0.2316&0.1950&3.5023&1\\
288: \hline
289: \end{longtable}
Thus, each genome is mapped into the three-dimensional space determined by the indices (\ref{eq:1}~--
\ref{eq:4}). The Table also provides the fourth dimension, that is, the absolute entropy of the codon
distribution. Further (see Section~\ref{classif}), we shall not take this dimension into consideration, since
it deteriorates the pattern observed in the three-dimensional case.

Meanwhile, the data on the absolute entropy of the codon distribution for various bacterial genomes
are rather interesting. Keeping in mind that the maximal value of the entropy is equal to $S_{\max} = \ln 64 =
4.1589\ldots$, one sees that the absolute entropy values observed over the set of genomes vary rather
significantly. {\sl Treponema pallidum str.Nichols} exhibits the maximal absolute entropy value equal to
$4.0205$, and {\sl Mycoplasma mycoides subsp. mycoides SC} has the minimal level of absolute entropy (equal
to $3.4356$).
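
To put these figures on a common scale, one may relate them to $S_{\max}$. The sketch below assumes that the
absolute entropy is the usual Shannon entropy of the codon distribution taken with natural logarithms (the
formula is an assumption of this illustration, consistent with the value of $S_{\max}$ quoted above):
\[
S = - \sum_{\omega = 1}^{64} f_{\omega} \ln f_{\omega} \leq \ln 64\,, \qquad \frac{4.0205}{\ln 64} \approx
0.967\,, \qquad \frac{3.4356}{\ln 64} \approx 0.826\,.
\]
In other words, the codon distributions of the genomes studied reach from about $83\%$ to about $97\%$ of the
entropy of the uniform distribution over $64$ codons.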
301:
302: \subsection{Classification}\label{classif}
Consider the dispersion of the genomes in the space defined by the indices (\ref{eq:1}~-- \ref{eq:4}). The
scattering is shown in Figure~\ref{F1}. The dispersion pattern shown in this figure is two-horned; thus, a
two-class pattern of the dispersion is hypothesized. Moreover, the genomes in the three-dimensional space
determined by the indices (\ref{eq:1}~-- \ref{eq:4}) occupy a nearly planar subspace. Accordingly, the
genomes in this space are supposed to fall into two classes.
308:
Whether the proximity of genomes observed in the space defined by the three indices (\ref{eq:1}~-- \ref{eq:4})
corresponds to a proximity in some other sense is the key question of our investigation. Taxonomy is the most
natural idea of proximity for genomes. Thus, the question arises whether the genomes closely located in the
space of the indices (\ref{eq:1}~-- \ref{eq:4}) belong to the same or closely related taxa. To answer this
question, we developed an unsupervised classification of the genomes in the three-dimensional space determined
by the indices (\ref{eq:1}~-- \ref{eq:4}).
315:
316: \begin{figure}
317: \includegraphics[width=16cm]{figGENE2.eps}
318: \caption{\label{F1} The distribution of genomes in the space determined by the indices (\ref{eq:1}~--
319: \ref{eq:4}). $\mathsf{S}_1$~is $I$~based index, $\mathsf{S}_2$~is $S^{\ast}$~based index, and
320: $\mathsf{S}_3$~is $T$~based index of codon usage bias.}
321: \end{figure}
322:
To develop such a classification, one must first split the genomes into $K$ classes randomly. Then, for each
class the center is determined; the latter is the arithmetic mean of each coordinate corresponding to the
specific index. Then each genome (i.e., each point in the three-dimensional space) is checked for proximity to
each of the $K$ classes. If a genome is closer to the center of another class than to that of the class it was
originally attributed to, it must be transferred to this other class. As soon as all the genomes are
redistributed among the classes, the centers must be recalculated, and all the genomes are checked again for
proximity to their class; a redistribution takes place where necessary. This procedure runs until no genome
changes its class attribution. Then, the discernibility of classes must be verified. There are various
discernibility conditions (see, e.g., \citep{n5}).
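
The procedure described above is the standard $K$-means scheme, and its two alternating steps may be written
down explicitly. In the sketch below, $x_g$ is the point of the three-dimensional space representing a genome
$g$, $C_k$ is the set of genomes currently attributed to the $k{\textrm{-th}}$ class, $m_k$ is the center of
that class, and $c(g)$ is the class attribution of the genome $g$; these symbols, as well as the choice of the
Euclidean distance, are assumptions of this illustration:
\[
m_k = \frac{1}{|C_k|} \sum_{g \in C_k} x_g\,, \qquad c(g) = \arg\min_{1 \leq k \leq K} \left\| x_g - m_k
\right\|\,.
\]
The two steps are repeated in turn until $c(g)$ stops changing for every genome $g$.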
332:
Here we executed a simplified version of the unsupervised classification. First, we did not check the class
discernibility; second, the center of a class differs from the regular one: a straight line in the space
determined by the indices (\ref{eq:1}~-- \ref{eq:4}), rather than a point in it, is supposed to be the center
of a class. So, the classification was developed with respect to these two modifications. Table~\ref{T1} also
shows the class attribution for each genome (see the last column, indicated as $C$).
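
When a straight line plays the role of the class center, the proximity of a genome to a class is naturally
measured by the distance from the point to the line. A minimal sketch, assuming the Euclidean metric and
describing the line $\ell_k$ of the $k{\textrm{-th}}$ class by a point $a_k$ on it and a unit direction vector
$u_k$ (both symbols are introduced for this illustration only, and the way the lines are fitted is not
specified here), reads
\[
d(x, \ell_k) = \left\| (x - a_k) - \bigl( (x - a_k) \cdot u_k \bigr)\, u_k \right\|\,, \qquad \| u_k \| = 1\,,
\]
where $x$ is the point representing a genome; the genome is then attributed to the class with the minimal
$d(x, \ell_k)$.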
338:
339: \section{Discussion}\label{diskus}
A clear, concise and comprehensive investigation of the peculiarities of codon usage bias may reveal
valuable new knowledge on the relation between the function (in a general sense) and the structure of
nucleotide sequences. Indeed, here we studied the relation between the taxonomy of a genome bearer and the
structure of the genome. A structure may be defined in many ways, and here we explore the idea of an ensemble
of (considerably short) fragments of a sequence. In particular, the structure here is understood in terms of
the frequency dictionary (see Section~\ref{intro}; see also \citep{n1,n2,n3,n4} for details).
346:
347: Figure~\ref{F1} shows the dispersion of genomes in three-dimensional space determined by the indices
348: (\ref{eq:1}~-- \ref{eq:4}). The projection shown in this Figure yields the most suitable view of the pattern;
349: a comprehensive study of the distribution pattern seen in various projections shows that it is located in a
350: plane (or close to a plane). Thus, the three indices (\ref{eq:1}~-- \ref{eq:4}) are not independent.
351:
Next, the dispersion of the genomes in the space of the indices (\ref{eq:1}~-- \ref{eq:4}) suggests
a two-class distribution of the entities. Indeed, the unsupervised classification developed for the set of
genomes supports this hypothesis. First of all, the genomes of the same genus belong to the same class, as a
rule. Some rare exceptions to this rule result from a specific location of the entities within the ``bullet''
shown in Figure~\ref{F1}.
357:
A measure of codon usage bias has been studied by many researchers (see, e.g., \citep{e3,e4,e5,e6,e8}).
Numerous approaches to the definition of a bias index have been explored. Some indices are
based on the statistical or probabilistic features of the codon frequency distribution \citep{1,2,e3},
others are based on the entropy of the distribution \citep{3,x}, and still others rely on
multidimensional data analysis and visualization techniques \citep{e5,e5-1}. The choice of
an index (or of a set of indices) strongly affects the sense and meaning of the observed data; here the
question arises of how similar the observations obtained through various indices are, and of how to
discern the fine peculiarities standing behind those indices.
366:
Entropy seems to be the most universal and robust characteristic of a frequency distribution of any
nature \citep{bolz,obhod}. Thus, the entropy based approach to the study of codon usage bias seems to be the
most powerful. In particular, this approach was used by \cite{6}, where the entropy of the codon frequency
distribution has been calculated for various genomes and various fragments of a genome. The data presented in
that paper show a significant correspondence to those shown above; here we take advantage of the
general approach provided by \cite{6} by calculating a more specific index, namely, a mutual
entropy.
374:
An index (or indices) of codon usage bias is of merit not in itself, but when it brings a
new comprehension of the biological issues standing behind it. Some biological mechanisms affecting the codon
usage bias are rather well known \citep{e8,e4,2,e9,4,5}. The rate of translation processes is the key issue
here. Quantitatively, the codon usage bias shows a significant correlation with the $\mathsf{C}+\mathsf{G}$
content of a genetic entity. Thus, the $\mathsf{C}+\mathsf{G}$ content seems to be an important factor (see,
e.\,g. \citep{e5,e5-1}); an intriguing observation on the correspondence between the
$\mathsf{C}+\mathsf{G}$ content and the taxonomy of bacteria is considered in \citep{mist}.
382:
Probably, the distribution of genomes shown in Figure~\ref{F1} could result from the $\mathsf{C}+\mathsf{G}$
content; yet, one may not exclude some other mechanisms and biological issues determining it. An exact and
reliable consideration of the relation between the structure (that is, the codon usage bias indices) and the
function encoded in a sequence is still hampered by the wide variety of functions observed in
different sites of a sequence. Thus, a comprehensive study of such a relation strongly requires the
clarification and identification of the function to be considered as an entity. Moreover, one should make
some additional efforts to prove the absence of interference between two (or more) functions encoded by the
sites.
391:
The relation between the structure (that is, the codon usage bias) and taxonomy seems to be less obscured
by the variety of features to be considered. Previously, a significant dependence between the triplet
composition of 16S\,RNA of bacteria and their taxonomy has been reported \citep{g1,g2}. We have pursued
a similar approach here. We studied the correlation between the class determined by the proximity in the space
defined by the codon usage bias indices (\ref{eq:1}~-- \ref{eq:4}), and the taxonomy of bacterial genomes.
397:
The data shown in Table~\ref{T1} reveal a significant correlation of the class attribution with the taxonomy of
bacterial genomes. First of all, the correlation is the highest at the species and/or strain levels. Some
exceptions observed for the {\sl Bacillus} genus may result from the modification of the unsupervised
classification used here; on the other hand, the entities of that genus are located at the head of the bullet
(see Figure~\ref{F1}). The distribution of genomes over the two classes looks rather complicated and quite
irregular. This fact may follow from the general situation with the arrangement of higher taxa of bacteria.

Nevertheless, the introduced indices of codon usage bias provide a researcher with a new tool for knowledge
retrieval concerning the relation between structure and function, and between structure and taxonomy of the
bearers of genetic entities.
408:
\section*{Acknowledgements} We are thankful to Professor Alexander Gorban from the University of Leicester for encouraging discussions of this work.
410:
411: \begin{thebibliography}{}
412: \bibitem[Bierne, Eyre-Walker(2006)]{e8}
413: Bierne, N., Eyre-Walker, A. Variation in synonymous codon use and DNA polymorphism within the Drosophila
414: genome. J.\,Evol.\,Biol. \textbf{19}(1), 1--11 (2006)
415:
\bibitem[Bugaenko et al.(1996)]{n1}
417: Bugaenko N.N., Gorban A.N., Sadovsky M.G. Towards the determination of information content of nucleotide
418: sequences. Russian J.of Mol.Biol. {\textbf{30}}, 529--541 (1996)
419:
\bibitem[Bugaenko et al.(1998)]{n2}
Bugaenko N.N., Gorban A.N., Sadovsky M.G. Maximum entropy method in analysis of genetic text and measurement
of its information content. Open Sys.\ \& Information Dyn. {\textbf{5}}, 265--278 (1998)
423:
424: \bibitem[Carbone et al.(2003)]{e5}
425: Carbone, A., Zinovyev, A., K\'{e}p\`{e}s, F. Codon adaptation index as a measure of dominating codon bias.
426: Bioinformatics. \textbf{19}(16), 2005--2015 (2003)
427:
428: \bibitem[Carbone et al.(2005)]{e5-1}
Carbone, A., K\'{e}p\`{e}s, F., Zinovyev, A. Codon bias signatures, organization of microorganisms in codon
space, and lifestyle. Mol.\,Biol.\,Evol. \textbf{22}(3), 547--561 (2005)
431:
432: \bibitem[Frappat et al.(2003)]{x}
Frappat, L., Minichini, C., Sciarrino, A., Sorba, P. Universality and Shannon entropy of codon usage.
Phys.\,Rev.\,E \textbf{68}, 061910 (2003)
435:
436: \bibitem[Fuglsang(2006)]{e7}
437: Fuglsang, A. Estimating the ``Effective Number of Codons'': The Wright Way of Determining Codon Homozygosity
438: Leads to Superior Estimates. Genetics. \textbf{172}, 1301--1307 (2006)
439:
440: \bibitem[Galtier et al.(2006)]{e4}
441: Galtier, N., Bazin, E., Bierne, N. GC-biased segregation of non-coding polymorphisms in Drosophila. Genetics.
442: \textbf{172}, 221--228 (2006)
443:
444: \bibitem[Gibbs(1902)]{bolz}
445: Gibbs, J.W. Elementary Principles in Statistical Mechanics, Developed with Especial Reference to the Rational
446: Foundation of Thermodynamics. C.~Scribner's Sons, New Haven (1902)
447:
448: \bibitem[Gorban, Zinovyev(2007)]{mist}
449: Gorban, A.N., Zinovyev, A.Yu. The Mystery of Two Straight Lines in Bacterial Genome Statistics. Release 2007
450: arXiv:q-bio/0412015
451:
452: \bibitem[Gorban, Karlin(2005)]{e2}
453: Gorban, A.N., Karlin, I.V. Invariant Manifolds for Physical and Chemical Kinetics, Lect. Notes Phys. 660,
454: Springer, Berlin, Heidelberg (2005).
455:
456: \bibitem[Gorban, Rossiev(2004)]{n5}
457: Gorban, A.N., Rossiev, D.A. Neurocomputers on PC. Nauka plc., Novosibirsk (2004).
458:
459: \bibitem[Gorban et al.(2001)]{g2}
Gorban, A.N., Popova, T.G., Sadovsky, M.G., Wunsch, D.C. Information content of the frequency dictionaries,
re-construction, transformation and classification of dictionaries and genetic texts. In: Intelligent
Engineering Systems through Artificial Neural Networks: \textbf{11}~-- {\sl Smart Engineering System Design}.
ASME Press, New York, 657--663 (2001)
464:
465: \bibitem[Gorban et al.(2000)]{g1}
Gorban, A.N., Popova, T.G., Sadovsky, M.G. Classification of symbol sequences over their frequency
dictionaries: towards the connection between structure and natural taxonomy. Open Systems \& Information
Dynamics. \textbf{7}(1), 1--17 (2000)
469:
470: \bibitem[Gorban(1984)]{obhod}
471: Gorban, A.N. Equilibrium Encircling. Equations of Chemical Kinetics and their Thermodynamic Analysis.
472: Novosibirsk, Nauka Publ. (1984) 256 p.
473:
474: \bibitem[Jansen et al.(2003)]{2}
Jansen, R., Bussemaker, H.J., Gerstein, M. Revisiting the codon adaptation index from a whole-genome
perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety
of models. Nucleic Acids Res. \textbf{31}, 2242--2251 (2003)
478:
479: \bibitem[Nakamura et al.(2000)]{e3}
480: Nakamura, Y., Gojobori, T., Ikemura, T. Codon usage tabulated from international DNA sequence databases:
481: status for the year 2000. Nucleic Acids Res. \textbf{28}, 292 (2000).
482:
483: \bibitem[Sadovsky(2003)]{n3}
484: Sadovsky, M.G. Comparison of real frequencies of strings vs. the expected ones reveals the information
485: capacity of macromoleculae. Journal of Biol.Phys. {\textbf{29}}, 23--38 (2003)
486:
487: \bibitem[Sadovsky(2006)]{n4}
488: Sadovsky, M.G. Information capacity of nucleotide sequences and its applications. Bulletin of Math.Biology.
489: {\textbf{68}}, 156--178 (2006)
490:
491: \bibitem[Sharp, Li(1987)]{1}
Sharp, P.M., Li, W.-H. The codon adaptation index --- a measure of directional synonymous codon usage
bias, and its potential applications. Nucleic Acids Res. \textbf{15}, 1281--1295 (1987)
494:
495: \bibitem[Sharp et al.(2005)]{e9}
496: Sharp, P.M., Bailes, E., Grocock, R.J., Peden, J.F., Sockett, R.E. Variation in the strength of selected
497: codon usage bias among bacteria. Nucleic Acids Research. \textbf{33}, 1141--1153 (2005)
498:
499: \bibitem[Sueoka, Kawanishi(2000)]{e6}
500: Sueoka, N., Kawanishi, Y. DNA G+C content of the third codon position and codon usage biases of human genes.
501: Gene. \textbf{261}(1), 53--62 (2000)
502:
503: \bibitem[Supek, Vlahovi\v{c}ek(2005)]{4}
Supek, F., Vlahovi\v{c}ek, K. Comparison of codon usage measures and their applicability in prediction of
microbial gene expressivity. BMC Bioinformatics. \textbf{6}, 182--197 (2005)
506:
507: \bibitem[Suzuki et al.(2004)]{6}
Suzuki, H., Saito, R., Tomita, M. The `weighted sum of relative entropy': a new index for synonymous codon
usage bias. Gene. \textbf{335}, 19--23 (2004)
510:
\bibitem[Wan et al.(2004)]{5}
Wan, X.-F., Xu, D., Kleinhofs, A., Zhou, J. Quantitative relationship between synonymous codon usage
bias and GC composition across unicellular genomes. BMC Evolutionary Biology. \textbf{4}, 19--30 (2004)
514:
515: \bibitem[Zeeberg(2002)]{3}
516: Zeeberg, B. Shannon Information Theoretic Computation of Synonymous Codon Usage Biases in Coding Regions of
517: Human and Mouse Genomes. Genome Res. {\textbf{12}}, 944--955 (2002)
518:
519: \end{thebibliography}
520:
521: \end{document}
522: