
1: \documentclass{e1}

2: \usepackage{amssymb}

3: \usepackage{rotating}

4: \usepackage{longtable}

5: \usepackage{graphicx}

6: \usepackage{natbib}

7: %\usepackage{biograph}

8: \setlength{\textwidth}{16cm} \setlength{\textheight}{22.5cm} \setlength{\oddsidemargin}{10mm}

9: \setlength{\evensidemargin}{10mm}

10: \begin{document}

11: \begin{frontmatter}

12:

13: \title{CODON USAGE BIAS MEASURED THROUGH ENTROPY APPROACH}%\thanks{This paper is supported, in part, by Krasnoyarsk Science Foundation, grant 13F105.}

14: \author[ibf,defakto]{Michael G.Sadovsky\corauthref{cor1}\thanksref{label2}}

15: \thanks[label2]{To whom the correspondence should be addressed.}

16: \corauth[cor1]{660036 Russia, Krasnoyarsk, Akademgorodok; Institute of computational modelling of RAS; tel.

17: +7(3912)907469, fax: +7(3912)907454}

18: \address[ibf]{Institute of computational modelling of RAS}

19: \ead{msad@icm.krasn.ru}

20: \author[defakto]{Julia A.Putintzeva}

21: \address[defakto]{Siberian Federal university, Institute of natural sciences \& humanities}

22: \ead{kinomanka85@mail.ru}

23:

24: \begin{abstract}

25: A measure of codon usage bias is defined through the mutual entropy of the real codon frequency

26: distribution calculated against a quasi-equilibrium one. The latter is defined in three ways: (1) the frequencies of

27: synonymous codons are supposed to be equal (i.e., each equals the arithmetic mean of their frequencies); (2) it coincides

28: with the frequency distribution of triplets; and, finally, (3) the quasi-equilibrium frequency distribution is

29: defined as the expected frequency of codons derived from the dinucleotide frequency distribution. The measure

30: of bias in codon usage is calculated for $115$ bacterial genomes.

31: \end{abstract}

32:

33: \begin{keyword}

34: frequency \sep expected frequency \sep information value \sep entropy \sep correlation \sep classification

35: \end{keyword}

36:

37: \end{frontmatter}

38:

39: \newpage

40: \section{Introduction}\label{intro}

41:

42: It is well known that the genetic code is degenerate. All amino acids except two are encoded by

43: two or more codons; such codons are called synonymous and usually differ in the nucleotide occupying the third

44: position of the codon. The synonymous codons occur with different frequencies, and this difference is observed

45: both between various genomes \citep{1,2,3,4}, and different genes of the same genome \citep{3,4,5,6}. A

46: synonymous codon usage bias could be explained in various ways, including mutational bias (shaping genomic

47: $\mathsf{G}$+$\mathsf{C}$ composition) and translational selection by tRNA abundance (acting mainly on highly

48: expressed genes). Still, the reported results are somewhat contradictory \citep{6}. A contradiction may

49: result from the differences in statistical methods used to estimate the codon usage bias. Here one should

50: clearly understand what factors affect the method and numerical result. Boltzmann entropy theory

51: \citep{bolz,e2} has been applied to estimate the degree of deviation from equal codon usage \citep{x,3}.

52:

53: The key point here is that the measure of codon usage bias should be independent of any biological

54: issue. It is highly desirable to avoid any biological assumptions (such as mutational

55: bias or translational selection); the measure must be defined in a purely mathematical way. The idea of entropy seems to

56: suit best of all here. The additional constraints on codon usage resulting from the amino acid frequency

57: distribution affect the entropy values, thus obscuring the effects directly linked to biases in synonymous

58: codon usage.

59:

60: Here we propose three new indices of codon usage bias, which take into account all of the three important

61: aspects of amino acid usage, i.e. (1) the number of distinct amino acids, (2) their relative frequencies, and

62: (3) their degree of codon degeneracy. All the indices are based on mutual entropy $\overline{S}$ calculation.

63: They differ in the codon frequency distribution supposed to be ``quasi-equilibrium''. Indeed, the only difference

64: between the indices lies in the definition of the latter.

65:

66: Consider a genetic entity, say, a genome, of the length $N$; the latter is the number of nucleotides

67: composing the entity. A word $\omega$ (of the length $q$) is a string of the length $q$, $1 \leq q \leq N$,

68: observed within the entity. The set of all the words occurring within an entity makes the support $\mathsf{V}$

69: of the entity (or $q$--support, if indication of the length $q$ is necessary). Accompanying each element

70: $\omega$, $\omega \in \mathsf{V}$, with the number $n_{\omega}$ of its copies, one gets the (finite)

71: dictionary of the entity. Changing $n_{\omega}$ for the frequency \[f_{\omega}= \frac{n_{\omega}}{N}\,,\] one

72: gets the frequency dictionary $W_q$ of the entity (of the thickness $q$).
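The construction of a frequency dictionary can be illustrated by a minimal Python sketch. The function name \texttt{frequency\_dictionary} and the \texttt{circular} switch are illustrative assumptions, not part of the original method description; with the sequence closed into a ring there are exactly $N$ words of the length $q$, so that the frequencies $n_{\omega}/N$ sum to one.

```python
from collections import Counter

def frequency_dictionary(seq, q, circular=True):
    """Return {word: frequency} for all words of length q read with step 1.

    Assumption: with circular=True the sequence is closed into a ring,
    so exactly N = len(seq) words are counted and f = n / N.
    With circular=False only N - q + 1 words exist and that number
    is used as the denominator instead.
    """
    n = len(seq)
    if not 1 <= q <= n:
        raise ValueError("q must satisfy 1 <= q <= len(seq)")
    text = seq + seq[:q - 1] if circular else seq
    total = n if circular else n - q + 1
    counts = Counter(text[i:i + q] for i in range(total))
    return {word: c / total for word, c in counts.items()}

# Toy example: the ring ACGTAC yields the words AC, CG, GT, TA, AC, CA.
W2 = frequency_dictionary("ACGTAC", 2)
```

Here \texttt{W2} plays the role of $W_2$ for the toy sequence; the support is the set of its keys.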

73:

74: Everywhere below, we shall distinguish the codon frequency distribution from the

75: triplet frequency distribution. A triplet frequency distribution is the frequency dictionary $W_3$ of the

76: thickness $q=3$, where triplets are counted without regard to the specific position of a triplet

77: within the sequence. On the contrary, the codon distribution is the frequency distribution of the triplets

78: occupying specific places within an entity: a codon is a triplet embedded into a sequence at a coding

79: position only. Thus, the number of copies of the words of the length $q=3$ involved in the codon

80: distribution is three times smaller than that in the frequency dictionary $W_3$ of triplets.

81: Further, we shall denote the codon frequency dictionary as $\mathfrak{W}$; no lower index will be used, since

82: the thickness of the dictionary is fixed (and equal to $q=3$).

83:

84: \section{Materials and methods}\label{sec:1}

85: \subsection{Sequences and Codon Tabulations}\label{sec:2}

86: The tables of codon usage frequency were taken from the Kazusa Institute site\footnote{www.kazusa.ac.jp/codons}.

87: The corresponding genome sequences were retrieved from the EMBL--bank\footnote{www.ebi.ac.uk/genomes}. Only the

88: codon usage tables containing no fewer than $10000$ codons were used. Here we studied bacterial genomes

89: (see Table~\ref{T1}).

90:

91: \subsection{Codon usage bias indices}\label{sec:2-2}

92: Let $F$ denote the codon frequency distribution, $F = \{f_{\nu_1\nu_2\nu_3}\}$; here $f_{\nu_1\nu_2\nu_3}$ is

93: the frequency of a codon $\nu_1\nu_2\nu_3$. Further, let $\widetilde{F}$ denote a quasi-equilibrium frequency

94: distribution of codons. Hence, the measure $I$ of the codon usage bias is defined as the mutual entropy of

95: the real frequency distribution $F$ calculated against the quasi-equilibrium $\widetilde{F}$ one:

96: \begin{equation}\label{eq:1}

97: I = \sum_{\omega = 1}^{64} f_{\omega} \cdot \ln \left( \frac{f_{\omega}}{\tilde{f}_{\omega}} \right)\;.

98: \end{equation}

99: Here the index $\omega$ enumerates the codons, and $\tilde{f}_{\omega} \in \widetilde{F}$ is the quasi-equilibrium

100: frequency. The measure (\ref{eq:1}) itself is rather simple and clear; the definition of the quasi-equilibrium

101: distribution of codons is the matter of discussion here. We propose three ways to define the distribution

102: $\widetilde{F}$; they provide three different indices of codon usage bias. The relation between the values of

103: these indices observed for the same genome is the key issue of our study.
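A minimal Python sketch of the measure (\ref{eq:1}) is given for illustration. The function name \texttt{mutual\_entropy} and the dictionary representation of $F$ and $\widetilde{F}$ are assumptions made here; codons with zero real frequency are skipped, since $f \ln f \to 0$ as $f \to 0$, and a zero reference frequency for an observed codon is treated as an error.

```python
import math

def mutual_entropy(f, f_ref):
    """I = sum_w f_w * ln(f_w / f~_w) over codons with f_w > 0.

    f and f_ref map codons to frequencies; both are assumed to be
    normalized to one.
    """
    total = 0.0
    for codon, p in f.items():
        if p == 0.0:
            continue  # the limit of p * ln(p) as p -> 0 is zero
        q = f_ref.get(codon, 0.0)
        if q == 0.0:
            raise ValueError(f"reference frequency of {codon} is zero")
        total += p * math.log(p / q)
    return total

# Toy two-codon family compared against equal usage.
real = {"AAA": 0.75, "AAG": 0.25}
reference = {"AAA": 0.5, "AAG": 0.5}
bias = mutual_entropy(real, reference)
```

The value vanishes when the two distributions coincide and is positive otherwise.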

104:

105: \subsubsection{Locally equilibrium codon distribution}

106: It is a well known fact that various amino acids manifest different occurrence frequencies within a genome or

107: a gene. Synonymous codons, in turn, exhibit different occurrence frequencies within the same genetic entities.

108: Thus, an equality of frequencies of all the synonymous codons encoding the same amino acid

109: \begin{equation}\label{eq:2}

110: \tilde{f}_j = \frac{1}{L_i} \sum_{k \in J_i} f_k\,, \qquad \sum_{j \in J_i} f_j = \sum_{j \in J_i} \tilde{f}_j

111: = \varphi_i \;,

112: \end{equation}

113: is the first way to determine a quasi-equilibrium codon frequency distribution. Here the index $j$ enumerates

114: the synonymous codons encoding the same amino acid, $J_i$ is the set of such codons for the $i{\textrm{-th}}$

115: amino acid, $L_i$ is the number of codons in $J_i$, and $\varphi_i$ is the frequency of that amino acid. Surely, the list of amino acids must be extended

116: with the {\sl stop} signal (encoded by three codons). Obviously, $\tilde{f}_j = \tilde{f}_k$ for any couple $j,k

117: \in J_i$.
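The locally equilibrium distribution (\ref{eq:2}) may be sketched in Python as follows. The sketch assumes the standard genetic code, written in the usual \texttt{TCAG} ordering, with the stop signal treated as a twenty-first ``amino acid''; the names \texttt{CODE} and \texttt{locally\_equilibrium} are illustrative.

```python
from itertools import product

# Assumption: the standard genetic code, codons ordered TTT, TTC, TTA, TTG, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def locally_equilibrium(f):
    """Replace each codon frequency by the mean over its synonymous family."""
    family_sum = {}
    family_size = {}
    for codon, aa in CODE.items():
        family_sum[aa] = family_sum.get(aa, 0.0) + f.get(codon, 0.0)
        family_size[aa] = family_size.get(aa, 0) + 1
    return {codon: family_sum[aa] / family_size[aa]
            for codon, aa in CODE.items()}

# Toy table: lysine (AAA, AAG) used unequally, methionine (ATG) alone.
toy = {"AAA": 0.3, "AAG": 0.1, "ATG": 0.6}
toy_eq = locally_equilibrium(toy)
```

The amino acid frequencies $\varphi_i$ are preserved by construction, while the frequencies within each synonymous family become equal.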

118:

119: \subsubsection{Codon distribution vs. triplet distribution}

120: A triplet distribution gives the second way to define the quasi-equilibrium codon frequency distribution.

121: Since the codon frequency is determined with respect to the specific locations of the strings of the length

122: $q=3$, two thirds of the copies of these strings fall beyond the calculation of the codon

123: frequency distribution. Thus, one can compare the codon frequency distribution with the similar distribution

124: implemented over the entire sequence, with no gaps in strings location. So, the frequency dictionary $W_3$ of the

125: thickness $q=3$

126: \begin{equation}\label{eq:3}

127: \tilde{f}_l = \hat{f}_l\,, \qquad 1 \leq l \leq 64

128: \end{equation}

129: is the quasi-equilibrium codon distribution here; $\hat{f}_l$ denotes the frequency of the $l$-th triplet in $W_3$.
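The difference between reading a sequence in frame (codons) and with unit step (triplets) can be shown on a toy example. The sketch below is an illustration only: the sequence \texttt{cds} is invented, the ring closure of the sequence is assumed for the triplet count, and both distributions are computed from the same short string, whereas the study itself combines codon usage tables with whole genome sequences.

```python
import math
from collections import Counter

def codon_frequencies(cds):
    """Frequencies of triplets read in frame (step 3)."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    return {c: n / len(codons) for c, n in counts.items()}

def triplet_frequencies(seq):
    """Frequencies of all triplets read with step 1 over the ring-closed sequence."""
    ring = seq + seq[:2]
    counts = Counter(ring[i:i + 3] for i in range(len(seq)))
    return {t: n / len(seq) for t, n in counts.items()}

def mutual_entropy(f, f_ref):
    """Mutual entropy of f against f_ref, summed over words with f > 0."""
    return sum(p * math.log(p / f_ref[w]) for w, p in f.items() if p > 0.0)

cds = "ATGGCTGCTAAAGCTTGA"           # hypothetical toy coding sequence
codon_f = codon_frequencies(cds)     # 6 words counted
triplet_f = triplet_frequencies(cds) # 18 words counted
s_star = mutual_entropy(codon_f, triplet_f)
```

Every in-frame codon also occurs among the unit-step triplets of the same sequence, so the reference frequencies in the logarithm are never zero in this setting.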

130:

131: \subsubsection{The most expected codon frequency distribution}

132: Finally, the third way to define the quasi-equilibrium codon frequency distribution is to derive it from the

133: frequency distribution of dinucleotides composing the codon. Having the codon frequency distribution $F$,

134: one can always derive the frequency distribution $F_2$ of the dinucleotides composing the codons. To do that,

135: one must sum up the frequencies of the codons differing in the third (or the first) nucleotide. Such a

136: transformation is unambiguous\footnote{Here one must close up a sequence into a ring.}. The situation

137: gets worse when one tries to obtain a codon distribution through the inverse transformation. An upward

138: transformation yields a family of dictionaries $\{F\}$, instead of the single one $F$. To eliminate the

139: ambiguity, one should implement some basic principle in order to avoid bringing extra, additional

140: information into the reconstructed codon frequency distribution. The principle of maximum entropy of the

141: extended (i.e., codon) frequency distribution makes sense here \citep{n1,n2,n3,n4}. It means that a

142: researcher must figure out the extended (or reconstructed) codon distribution $\widetilde{F}$ with maximal

143: entropy among the members of the family $\{F\}$. This approach allows one to calculate the frequencies

144: of codons explicitly:

145: \begin{equation}\label{eq:4}

146: \widetilde{f}_{ijk} = \frac{f_{ij}\times f_{jk}}{f_{j}}\;,

147: \end{equation}

148: where $\widetilde{f}_{ijk}$ is the expected frequency of codon $ijk$, $f_{ij}$ is the frequency of a

149: dinucleotide $ij$, and $f_j$ is the frequency of nucleotide $j$; here $i,j,k \in \{\mathsf{A}, \mathsf{C},

150: \mathsf{G}, \mathsf{T}\}$.
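The reconstruction (\ref{eq:4}) can be sketched in Python under the assumption that the dinucleotide and nucleotide frequencies are obtained by summing the codon table itself over the third, the first, or both outer positions. The function name \texttt{expected\_from\_dinucleotides} is illustrative.

```python
from itertools import product

def expected_from_dinucleotides(f):
    """Expected codon frequencies f_ij * f_jk / f_j derived from a codon table f."""
    bases = "ACGT"
    left, right, middle = {}, {}, {}
    for i, j, k in product(bases, repeat=3):
        p = f.get(i + j + k, 0.0)
        left[i + j] = left.get(i + j, 0.0) + p    # sum over the third nucleotide
        right[j + k] = right.get(j + k, 0.0) + p  # sum over the first nucleotide
        middle[j] = middle.get(j, 0.0) + p        # frequency of the central nucleotide
    expected = {}
    for i, j, k in product(bases, repeat=3):
        if middle[j] > 0.0:
            expected[i + j + k] = left[i + j] * right[j + k] / middle[j]
        else:
            expected[i + j + k] = 0.0
    return expected

# Toy table with three codons only.
toy = {"AAA": 0.5, "AAC": 0.25, "CAA": 0.25}
toy_expected = expected_from_dinucleotides(toy)
```

In the toy table the codon \texttt{CAC} is never observed, yet it gets a nonzero expected frequency, because both of its dinucleotides occur; the reconstructed frequencies still sum to one.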

151:

152: Thus, the calculation of the measure (\ref{eq:1}) maps each genome into three-dimensional space. Table~\ref{T1}

153: shows the data calculated for 115 bacterial genomes.

154:

155: \section{Results}\label{res}

156: We have examined 115 bacterial genomes. The values of the three indices (\ref{eq:1}~-- \ref{eq:4}) and the

157: absolute entropy of codon distribution are shown in Table~\ref{T1}.

158: \begin{longtable}{|p{8.4cm}|c|c|c|c|c|}

159: \caption{\label{T1} Indices of codon usage bias; $I$ is the index calculated according to (\ref{eq:2}),

160: $S^{\ast}$ stands for the index defined due to (\ref{eq:3}),

161: and $T$ is the index defined due to (\ref{eq:4}). $S$ is the absolute entropy of codon distribution. $C$ is the class attribution (see Section~\ref{classif}).}\\

162: \hline \multicolumn{1}{|c|}{Genomes}& \multicolumn{1}{c|}{$I$} & $S^{\ast}$ & $T$ & $S$ & $C$\\

163: \hline

164: \endfirsthead

165: \multicolumn{6}{r}%

166: {{\tablename\ \thetable{} -- continued}} \\

167: \hline \multicolumn{1}{|c|}{Genomes} & \multicolumn{1}{c|}{$I$} & $S^{\ast}$ & $T$ & $S$ & $C$\\

168: \hline

169: \endhead

170: \hline \multicolumn{6}{|r|}{{continued on the next page}} \\ \hline

171: \endfoot

172: \endlastfoot

173: Acinetobacter sp.ADP1&0.1308&0.1526&0.1332&3.9111&1\\

174: Aeropyrum pernix K1&0.1381&0.1334&0.1611&3.9302&2\\

175: Agrobacterium tumefaciens str. C58&0.1995&0.1730&0.2681&3.8504&2\\

176: Aquifex aeolicus VF5&0.1144&0.1887&0.2273&3.8507&2\\

177: Archaeoglobus fulgidus DSM 4304&0.1051&0.2008&0.2264&3.9011&2\\

178: Bacillus anthracis str. Ames&0.1808&0.1880&0.1301&3.8232&1\\

179: Bacillus anthracis str. Sterne&0.1800&0.1873&0.1300&3.8236&1\\

180: Bacillus anthracis str.'Ames Ancestor'&0.1788&0.1850&0.1278&3.8246&1\\

181: Bacillus cereus ATCC 10987&0.1750&0.1791&0.1254&3.8291&1\\

182: Bacillus cereus ATCC 14579&0.1807&0.1853&0.1290&3.8220&1\\

183: Bacillus halodurans C-125&0.0538&0.1296&0.0967&3.9733&1\\

184: Bacillus subtilis subsp.subtilis str. 168&0.0581&0.1231&0.1117&3.9605&2\\

185: Bacteroides fragilis YCH46&0.0499&0.1201&0.1305&3.9824&2\\

186: Bacteroides thetaiotaomicron VPI-5482&0.0557&0.1258&0.1364&3.9713&2\\

187: Bartonella henselae str. Houston-1&0.1555&0.1650&0.1077&3.8913&1\\

188: Bartonella quintana str. Toulouse&0.1525&0.1616&0.1039&3.8954&1\\

189: Bdellovibrio bacteriovorus HD100&0.1197&0.1593&0.2404&3.9232&2\\

190: Bifidobacterium longum NCC2705&0.2459&0.2315&0.3666&3.8011&2\\

191: Bordetella bronchiseptica RB50&0.4884&0.3165&0.5598&3.5485&2\\

192: Borrelia burgdorferi B31&0.2330&0.1555&0.0988&3.6709&1\\

193: Borrelia garinii Pbi&0.2421&0.1616&0.1008&3.6630&1\\

194: Bradyrhizobium japonicum USDA 110&0.3163&0.2236&0.3789&3.7368&2\\

195: Campylobacter jejuni RM1221&0.2839&0.1994&0.1357&3.6617&1\\

196: Campylobacter jejuni subsp. Jejuni NCTC 11168&0.2846&0.2010&0.1379&3.6660&1\\

197: Caulobacter crescentus CB15&0.4250&0.2890&0.5045&3.6062&2\\

198: Chlamydophila caviae GPIC&0.1079&0.1199&0.0990&3.9445&1\\

199: Chlamydophila pneumoniae CWL029&0.0803&0.1054&0.0778&3.9748&1\\

200: Chlamydophila pneumoniae J138&0.0801&0.1050&0.0772&3.9755&1\\

201: Chlamydophila pneumoniae TW-183&0.0802&0.1037&0.0764&3.9760&1\\

202: Chlorobium tepidum TLS&0.1767&0.1809&0.2935&3.8777&2\\

203: Chromobacterium violaceum ATCC 12472&0.4245&0.3004&0.5354&3.6218&2\\

204: Chlamydophila pneumoniae AR39&0.0804&0.1055&0.0773&3.9748&2\\

205: Clostridium acetobutylicum ATCC 824&0.2431&0.1951&0.1305&3.7142&1\\

206: Clostridium perfringens str. 13&0.3602&0.2752&0.1943&3.5816&1\\

207: Clostridium tetani E88&0.3240&0.2381&0.1767&3.6088&1\\

208: Corynebacterium efficiens YS-314&0.2983&0.2379&0.3980&3.7494&2\\

209: Corynebacterium glutamicum ATCC 13032&0.0964&0.1510&0.1674&3.9498&2\\

210: Coxiella burnetii RSA 493&0.0843&0.1050&0.0892&3.9648&2\\

211: Desulfovibrio vulgaris subsp.vulgaris str. Hildenborough&0.2459&0.1980&0.3183&3.8090&2\\

212: Enterococcus faecalis V583&0.1592&0.1838&0.1295&3.8453&1\\

213: Escherichia coli CFT073&0.1052&0.1305&0.1734&3.9576&2\\

214: Escherichia coli K12 MG1655&0.1206&0.1463&0.1933&3.9372&2\\

215: Helicobacter hepaticus ATCC 51449&0.1760&0.1513&0.1065&3.8315&1\\

216: Helicobacter pylori 26695&0.1420&0.1646&0.1843&3.8454&2\\

217: Helicobacter pylori J99&0.1404&0.1660&0.1895&3.8479&2\\

218: Lactobacillus johnsonii NCC 533&0.2113&0.1937&0.1481&3.7856&1\\

219: Lactobacillus plantarum WCFS1&0.0813&0.1453&0.1544&3.9537&2\\

220: Lactococcus lactis subsp. Lactis Il1403&0.1923&0.1857&0.1173&3.8068&1\\

221: Legionella pneumophila subsp. Pneumophila str. Philadelphia 1&0.1018&0.1098&0.0880&3.9339&1\\

222: Leifsonia xyli subsp. Xyli str. CTCB07&0.3851&0.2411&0.4032&3.6490&2\\

223: Listeria monocytogenes str. 4b F2365&0.1389&0.1766&0.1012&3.8600&1\\

224: Mannheimia succiniciproducens MBEL55E&0.1390&0.1624&0.1571&3.8943&1\\

225: Mesorhizobium loti MAFF303099&0.2734&0.2019&0.3402&3.7751&2\\

226: Methanocaldococcus jannaschii DSM 2661&0.2483&0.2108&0.1324&3.6751&2\\

227: Methanopyrus kandleri AV19&0.2483&0.2108&0.1324&3.6751&1\\

228: Methanosarcina acetivorans C2A&0.0530&0.1223&0.0876&3.9718&1\\

229: Methanosarcina mazei Go1&0.0739&0.1314&0.0889&3.9468&1\\

230: Methylococcus capsulatus str. Bath&0.2847&0.2096&0.3738&3.7709&2\\

231: Mycobacterium avium subsp. Paratuberculosis str. K10&0.4579&0.2779&0.4819&3.6038&2\\

232: Mycobacterium bovis AF2122/97&0.2449&0.1688&0.2862&3.7931&2\\

233: Mycobacterium leprae TN&0.1075&0.1216&0.1717&3.9513&2\\

234: Mycobacterium tuberculosis CDC1551&0.2387&0.1618&0.2749&3.8029&2\\

235: Mycobacterium tuberculosis H37Rv&0.2457&0.1696&0.2878&3.7929&2\\

236: Mycoplasma mycoides subsp. mycoides SC&0.4748&0.2571&0.2247&3.4356&1\\

237: Mycoplasma penetrans HF-2&0.4010&0.2320&0.2047&3.5294&1\\

238: Neisseria gonorrhoeae FA 1090&0.1610&0.1740&0.2343&3.8852&2\\

239: Neisseria meningitidis MC58&0.1481&0.1708&0.2244&3.8969&2\\

240: Neisseria meningitidis Z2491 serogroup A str. Z2491&0.1541&0.1786&0.2342&3.8898&2\\

241: Nitrosomonas europaea ATCC 19718&0.0824&0.1104&0.1587&3.9806&2\\

242: Nocardia farcinica IFM 10152&0.4842&0.2917&0.4968&3.5343&2\\

243: Nostoc sp.PCC7120&0.0877&0.1308&0.1124&3.9638&1\\

244: Parachlamydia sp. UWE25&0.1689&0.1397&0.1027&3.8561&1\\

245: Photorhabdus luminescens subsp. Laumondii TTO1&0.0704&0.1183&0.1068&3.9838&1\\

246: Porphyromonas gingivalis W83&0.0476&0.1167&0.1559&4.0034&2\\

247: Prochlorococcus marinus str. MIT 9313&0.0472&0.0956&0.0773&4.0203&1\\

248: Prochlorococcus marinus subsp. Marinus str. CCMP1375&0.1729&0.1423&0.1177&3.8697&1\\

249: Prochlorococcus marinus subsp. Pastoris str. CCMP1986&0.2556&0.1671&0.1412&3.7354&1\\

250: Propionibacterium acnes KPA171202&0.1277&0.1338&0.1700&3.9293&2\\

251: Pseudomonas aeruginosa PAO1&0.4648&0.3204&0.5733&3.5827&2\\

252: Pseudomonas putida KT2440&0.2847&0.2255&0.4061&3.7696&2\\

253: Pseudomonas syringae pv. Tomato str. DC3000&0.1960&0.1736&0.3013&3.8633&2\\

254: Pyrococcus abyssi GE5&0.0983&0.1962&0.1996&3.8887&2\\

255: Pyrococcus furiosus DSM 3638&0.1000&0.1641&0.1079&3.8847&1\\

256: Pyrococcus horikoshii OT3&0.0899&0.1508&0.1260&3.9105&1\\

257: Salmonella enterica subsp. Enterica serovar Typhi Ty2&0.1272&0.1465&0.2068&3.9327&2\\

258: Salmonella typhimurium LT2&0.1293&0.1490&0.2100&3.9300&2\\

259: Shewanella oneidensis MR-1&0.0700&0.1320&0.1329&3.9795&2\\

260: Shigella flexneri 2a str. 2457T&0.1196&0.1429&0.1913&3.9416&2\\

261: Shigella flexneri 2a str. 301&0.1097&0.1343&0.1791&3.9529&2\\

262: Sinorhizobium meliloti 1021&0.1960&0.2199&0.3013&3.8633&2\\

263: Staphylococcus aureus subsp. Aureus MRSA252&0.2338&0.2086&0.1531&3.7572&1\\

264: Staphylococcus aureus subsp. Aureus MSSA476&0.2356&0.2071&0.1554&3.7557&1\\

265: Staphylococcus aureus subsp. Aureus Mu50&0.2318&0.2056&0.1522&3.7591&1\\

266: Staphylococcus aureus subsp. Aureus MW2&0.2368&0.2106&0.1562&3.7535&1\\

267: Staphylococcus aureus subsp. Aureus N315&0.2348&0.2083&0.1543&3.7564&1\\

268: Staphylococcus epidermidis ATCC 12228&0.2277&0.2036&0.1399&3.7613&1\\

269: Staphylococcus haemolyticus JCSC1435&0.2304&0.2043&0.1526&3.7619&1\\

270: Streptococcus agalactiae 2603V/R&0.1690&0.1794&0.1200&3.8372&1\\

271: Streptococcus agalactiae NEM316&0.1679&0.1790&0.1209&3.8371&1\\

272: Streptococcus mutans UA159&0.1577&0.1783&0.1240&3.8468&1\\

273: Streptococcus pneumoniae R6&0.0952&0.1529&0.1210&3.9152&1\\

274: Streptococcus pneumoniae TIGR4&0.0957&0.1525&0.1209&3.9168&1\\

275: Streptococcus pyogenes M1 GAS&0.1227&0.1619&0.1137&3.8900&1\\

276: Streptococcus pyogenes MGAS10394&0.1167&0.1596&0.1101&3.8974&1\\

277: Streptococcus pyogenes MGAS315&0.1189&0.1636&0.1108&3.8929&1\\

278: Streptococcus pyogenes MGAS5005&0.1215&0.1612&0.1115&3.8929&1\\

279: Streptococcus pyogenes MGAS8232&0.1194&0.1608&0.1114&3.8932&1\\

280: Streptococcus pyogenes SSI-1&0.1189&0.1597&0.1111&3.8932&1\\

281: Streptococcus thermophilus CNRZ1066&0.1210&0.1710&0.1325&3.8908&1\\

282: Streptococcus thermophilus LMG 18311&0.1235&0.1737&0.1339&3.8881&1\\

283: Sulfolobus tokodaii str. 7&0.1932&0.1639&0.1253&3.7954&1\\

284: Thermoplasma acidophilum DSM 1728&0.0920&0.1668&0.2228&3.9315&2\\

285: Thermoplasma volcanium GSS1&0.0692&0.1345&0.1247&3.9379&2\\

286: Treponema pallidum str.Nichols&0.0548&0.0894&0.1095&4.0205&2\\

287: Ureaplasma parvum serovar 3 str. ATCC 700970&0.4111&0.2316&0.1950&3.5023&1\\

288: \hline

289: \end{longtable}

290: Thus, each genome is mapped into three-dimensional space determined by the indices (\ref{eq:1}~--

291: \ref{eq:4}). The Table provides also the fourth dimension, that is the absolute entropy of a codon

292: distribution. Further (see Section~\ref{classif}), we shall not take this dimension into consideration, since

293: it deteriorates the pattern observed in three-dimensional case.

294:

295: Meanwhile, the data on absolute entropy calculation of the codon distribution for various bacterial genomes

296: are rather interesting. Keeping in mind that the maximal value of the entropy is equal to $S_{\max} = \ln 64 =

297: 4.1589\ldots$, one sees that the absolute entropy values observed over the set of genomes vary rather

298: significantly. {\sl Treponema pallidum str.Nichols} exhibits the maximal absolute entropy value equal to

299: $4.0205$, and {\sl Mycoplasma mycoides subsp. mycoides SC} has the minimal level of absolute entropy (equal

300: to $3.4356$).

301:

302: \subsection{Classification}\label{classif}

303: Consider the dispersion of the genomes in the space defined by the indices (\ref{eq:1}~-- \ref{eq:4}). The

304: scattering is shown in Figure~\ref{F1}. The dispersion pattern shown in this figure is two-horned; thus,

305: a two-class pattern of the dispersion is hypothesized. Moreover, the genomes in the three-dimensional space

306: determined by the indices (\ref{eq:1}~-- \ref{eq:4}) occupy a nearly planar subset. Hence, the

307: dispersion of the genomes in the space is supposed to consist of two classes.

308:

309: Whether the proximity of genomes observed in the space defined by the three indices (\ref{eq:1}~-- \ref{eq:4})

310: meets a proximity in some other sense is the key question of our investigation. Taxonomy is the most natural idea

311: of proximity for genomes. Thus, the question arises whether the genomes closely located in the space of

312: indices (\ref{eq:1}~-- \ref{eq:4}) belong to the same or closely related taxa. To answer this question, we

313: developed an unsupervised classification of the genomes in the three-dimensional space determined by the indices

314: (\ref{eq:1}~-- \ref{eq:4}).

315:

316: \begin{figure}

317: \includegraphics[width=16cm]{figGENE2.eps}

318: \caption{\label{F1} The distribution of genomes in the space determined by the indices (\ref{eq:1}~--

319: \ref{eq:4}). $\mathsf{S}_1$~is $I$~based index, $\mathsf{S}_2$~is $S^{\ast}$~based index, and

320: $\mathsf{S}_3$~is $T$~based index of codon usage bias.}

321: \end{figure}

322:

323: To develop such a classification, one must split the genomes into $K$ classes randomly. Then, for each class the

324: center is determined; the latter is the arithmetic mean of each coordinate corresponding to the specific

325: index. Then each genome (i.e., each point in the three-dimensional space) is checked for proximity to each of the

326: $K$ classes. If a genome is closer to another class than the one it was originally attributed to, then it must be

327: transferred to that class. As soon as all the genomes are redistributed among the classes, the centers must

328: be recalculated, and all the genomes are checked again for proximity to their class; a redistribution

329: takes place where necessary. This procedure runs until no genome changes its class attribution. Then, the

330: discernibility of classes must be verified. There are various discernibility conditions (see, e.g.,

331: \citep{n5}).
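The iterative procedure described above (random split, class centers as coordinate means, reassignment until nothing changes) may be sketched in Python as follows. This is a minimal illustration with point-like centers and Euclidean distance; the function name \texttt{classify}, the fixed random seed, the iteration cap and the re-seeding of an empty class from a random point are assumptions introduced here, and the discernibility check is not included.

```python
import random

def classify(points, k, seed=0, max_iter=100):
    """Split points into k classes by iterating center and label updates."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]  # random initial split
    centers = []
    for _ in range(max_iter):
        centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:                      # assumption: re-seed an empty class
                members = [rng.choice(points)]
            centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:                 # no genome changed its class
            break
        labels = new_labels
    return labels, centers

# Four hypothetical points in the space of the three indices.
pts = [(0.05, 0.10, 0.08), (0.06, 0.11, 0.09),
       (0.45, 0.30, 0.55), (0.46, 0.31, 0.56)]
labels, centers = classify(pts, 2)
```

The class numbers returned by the sketch are arbitrary; only the partition of the points is meaningful.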

332:

333: Here we executed a simplified version of the unsupervised classification. First, we did not check the class

334: discernibility; next, the center of a class differs from a regular one. A straight line in the space determined

335: by the indices (\ref{eq:1}~-- \ref{eq:4}) is supposed to be the center of a class, rather than a point in it.

336: So, the classification was developed with respect to these two issues. Table~\ref{T1} also shows the

337: class attribution for each genome (see the last column indicated as $C$).
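When a straight line serves as the center of a class, the proximity of a genome to the class is naturally measured by the distance from the point to the line. The Python sketch below shows only this step; how the lines themselves are chosen is not specified in the text, so the lines in the example are purely hypothetical, as are the names \texttt{distance\_to\_line} and \texttt{assign\_to\_lines}.

```python
import math

def distance_to_line(p, a, d):
    """Euclidean distance from point p to the line through a with direction d."""
    norm = math.sqrt(sum(x * x for x in d))
    u = [x / norm for x in d]                    # unit direction vector
    v = [pi - ai for pi, ai in zip(p, a)]
    t = sum(vi * ui for vi, ui in zip(v, u))     # length of the projection on the line
    perp = [vi - t * ui for vi, ui in zip(v, u)]
    return math.sqrt(sum(x * x for x in perp))

def assign_to_lines(points, lines):
    """Attribute each point to the nearest line; lines are (point, direction) pairs."""
    return [
        min(range(len(lines)), key=lambda c: distance_to_line(p, *lines[c]))
        for p in points
    ]

# Hypothetical class centers: the first and the third coordinate axes.
lines = [((0.0, 0.0, 0.0), (2.0, 0.0, 0.0)),
         ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))]
attribution = assign_to_lines([(5.0, 0.1, 0.0), (0.0, 0.1, 7.0)], lines)
```

The returned class numbers start from zero, whereas the column $C$ of Table~\ref{T1} uses the labels $1$ and $2$.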

338:

339: \section{Discussion}\label{diskus}

340: A clear, concise and comprehensive investigation of the peculiarities of codon bias distribution may reveal

341: valuable new knowledge on the relation between the function (in a general sense) and the structure of

342: nucleotide sequences. Indeed, here we studied the relation between the taxonomy of a genome bearer and the

343: structure of the genome. A structure may be defined in many ways, and here we explore the idea of an ensemble

344: of (considerably short) fragments of a sequence. In particular, the structure here is understood in terms of

345: a frequency dictionary (see Section~\ref{intro}; see also \citep{n1,n2,n3,n4} for details).

346:

347: Figure~\ref{F1} shows the dispersion of genomes in three-dimensional space determined by the indices

348: (\ref{eq:1}~-- \ref{eq:4}). The projection shown in this Figure yields the most suitable view of the pattern;

349: a comprehensive study of the distribution pattern seen in various projections shows that it is located in a

350: plane (or close to a plane). Thus, the three indices (\ref{eq:1}~-- \ref{eq:4}) are not independent.

351:

352: Next, the dispersion of the genomes in the space of the indices (\ref{eq:1}~-- \ref{eq:4}) suggests

353: a two-class distribution of the entities. Indeed, the unsupervised classification developed for the set of

354: genomes supports it. First of all, the genomes of the same genus belong to the same class, as a rule. Rare

355: exceptions to this rule result from a specific location of the entities within the ``bullet'' shown in

356: Figure~\ref{F1}.

357:

358: A measure of codon usage bias is a matter of study for many researchers (see, e.g., \citep{e3,e4,e5,e6,e8}).

359: Numerous approaches to the bias index implementation have been explored. Basically, some indices are

360: based on the statistical or probabilistic features of codon frequency distribution \citep{1,2,e3},

361: others are based on the entropy calculation of the distribution \citep{3,x}, and yet others on the

362: issues of multidimensional data analysis and visualization techniques \citep{e5,e5-1}. An implementation of

363: an index (or a set of indices) strongly affects the sense and meaning of the observed data; here the question

364: arises towards the similarity of the observations obtained through various indices implementation, and the

365: discrimination of the fine peculiarities standing behind those indices.

366:

367: Entropy seems to be the most universal and sustainable characteristic of a frequency distribution of any

368: nature \citep{bolz,obhod}. Thus, the entropy based approach to a study of codon usage bias seems to be the

369: most powerful. In particular, this approach was used by \cite{6}, where the entropy of the codon frequency

370: distribution has been calculated for various genomes and various fragments of a genome. The data presented in

371: this paper manifest a significant correspondence to those reported there; here we take advantage of the

372: general approach provided by \cite{6} through the calculation of a more specific index, that is, the mutual

373: entropy.

374:

375: An implementation of an index (or indices) of codon usage bias is of merit not in itself, but when it brings a

376: new comprehension of the biological issues standing behind it. Some biological mechanisms affecting the codon usage

377: bias are rather well known \citep{e8,e4,2,e9,4,5}. The rate of translation processes is the key issue here.

378: Quantitatively, the codon usage bias manifests a significant correlation to $\mathsf{C}+\mathsf{G}$ content

379: of a genetic entity. Thus, the $\mathsf{C}+\mathsf{G}$ content seems to be an important factor (see,

380: e.\,g., \citep{e5,e5-1}); some intriguing observations on the correspondence between

381: $\mathsf{C}+\mathsf{G}$ content and the taxonomy of bacteria are considered in \citep{mist}.

382:

383: Probably, the distribution of genomes shown in Figure~\ref{F1} could result from $\mathsf{C}+\mathsf{G}$

384: content; yet, one may not exclude some other mechanisms and biological issues determining it. An exact and

385: reliable consideration of the relation between structure (that is, the codon usage bias indices) and the

386: function encoded in a sequence is still obscured by the wide variety of the functions observed in

387: different sites of a sequence. Thus, a comprehensive study of such a relation strongly requires the

388: clarification and identification of the function to be considered as an entity. Moreover, one should make

389: some additional efforts to prove an absence of interference between two (or more) functions encoded by the

390: sites.

391:

392: A relation between the structure (that is, the codon usage bias) and taxonomy seems to be less deteriorated

393: by the variety of features to be considered. Previously, a significant dependence between the triplet

394: composition of 16S\,RNA of bacteria and their taxonomy has been reported \citep{g1,g2}. We have pursued

395: a similar approach here. We studied the correlation between the class determined by the proximity in the space

396: defined by the codon usage bias indices (\ref{eq:1}~-- \ref{eq:4}), and the taxonomy of bacterial genomes.

397:

398: The data shown in Table~\ref{T1} reveal a significant correlation of class attribution to the taxonomy of

399: bacterial genomes. First of all, the correlation is the highest at the species and/or strain levels. Some

400: exceptions observed for the {\sl Bacillus} genus may result from the modification of the unsupervised classification

401: implementation; on the other hand, the entities of that genus are located at the head of the bullet (see

402: Figure~\ref{F1}). The distribution of genomes over the two classes looks rather complicated and quite irregular.

403: This fact may follow from the general situation with the disposition of higher taxa of bacteria.

404:

405: Nevertheless, the introduced indices of codon usage bias provide a researcher with a new tool for knowledge

406: retrieval concerning the relation between structure and function, and structure and taxonomy of the bearers

407: of genetic entities.

408:

409: \section*{Acknowledgements} We are thankful to Professor Alexander Gorban from the University of Leicester for encouraging discussions of this work.

410:

411: \begin{thebibliography}{}

412: \bibitem[Bierne, Eyre-Walker(2006)]{e8}

413: Bierne, N., Eyre-Walker, A. Variation in synonymous codon use and DNA polymorphism within the Drosophila

414: genome. J.\,Evol.\,Biol. \textbf{19}(1), 1--11 (2006)

415:

416: \bibitem[Bugaenko et al.(1996)]{n1}

417: Bugaenko N.N., Gorban A.N., Sadovsky M.G. Towards the determination of information content of nucleotide

418: sequences. Russian J.of Mol.Biol. {\textbf{30}}, 529--541 (1996)

419:

420: \bibitem[Bugaenko et al.(1998)]{n2}

421: Bugaenko N.N., Gorban A.N., Sadovsky M.G. Maximum entropy method in analysis of genetic text and measurement

422: of its information content. Open Sys.\ \& Information Dyn. {\textbf{5}}, 265--278 (1998)

423:

424: \bibitem[Carbone et al.(2003)]{e5}

425: Carbone, A., Zinovyev, A., K\'{e}p\`{e}s, F. Codon adaptation index as a measure of dominating codon bias.

426: Bioinformatics. \textbf{19}(16), 2005--2015 (2003)

427:

428: \bibitem[Carbone et al.(2005)]{e5-1}

429: Carbone, A., K\'{e}p\`{e}s, F., Zinovyev, A. Codon bias signatures, organization of microorganisms in codon

430: space, and lifestyle. Mol.\,Biol.\,Evol. \textbf{22}(3), 547--561 (2005)

431:

432: \bibitem[Frappat et al.(2003)]{x}

433: Frappat, L., Minichini, C., Sciarrino, A., Sorba, P. Universality and Shannon entropy of codon usage.

434: Phys.\,Rev.\,E {\textbf{68}}, 061910 (2003)

435:

436: \bibitem[Fuglsang(2006)]{e7}

437: Fuglsang, A. Estimating the ``Effective Number of Codons'': The Wright Way of Determining Codon Homozygosity

438: Leads to Superior Estimates. Genetics. \textbf{172}, 1301--1307 (2006)

439:

440: \bibitem[Galtier et al.(2006)]{e4}

441: Galtier, N., Bazin, E., Bierne, N. GC-biased segregation of non-coding polymorphisms in Drosophila. Genetics.

442: \textbf{172}, 221--228 (2006)

443:

444: \bibitem[Gibbs(1902)]{bolz}

445: Gibbs, J.W. Elementary Principles in Statistical Mechanics, Developed with Especial Reference to the Rational

446: Foundation of Thermodynamics. C.~Scribner's Sons, New Haven (1902)

447:

448: \bibitem[Gorban, Zinovyev(2007)]{mist}

449: Gorban, A.N., Zinovyev, A.Yu. The Mystery of Two Straight Lines in Bacterial Genome Statistics. Release 2007

450: arXiv:q-bio/0412015

451:

452: \bibitem[Gorban, Karlin(2005)]{e2}

453: Gorban, A.N., Karlin, I.V. Invariant Manifolds for Physical and Chemical Kinetics, Lect. Notes Phys. 660,

454: Springer, Berlin, Heidelberg (2005).

455:

456: \bibitem[Gorban, Rossiev(2004)]{n5}

457: Gorban, A.N., Rossiev, D.A. Neurocomputers on PC. Nauka plc., Novosibirsk (2004).

458:

459: \bibitem[Gorban et al.(2001)]{g2}

460: Gorban, A.N., Popova, T.G., Sadovsky, M.G., Wunsch, D.C. Information content of the frequency dictionaries,

461: re-construction, transformation and classification of dictionaries and genetic texts // Intelligent

462: Engineering Systems through Artificial Neural Networks: \textbf{11}~-- {\sl Smart Engineering System Design},

463: N.-Y.: ASME Press 657--663 (2001)

464:

465: \bibitem[Gorban et al.(2000)]{g1}

466: Gorban, A.N., Popova, T.G., Sadovsky, M.G. Classification of symbol sequences over their frequency

467: dictionaries: towards the connection between structure and natural taxonomy. Open Systems \& Information

468: Dynamics. \textbf{7}(1), 1--17 (2000)

469:

470: \bibitem[Gorban(1984)]{obhod}

471: Gorban, A.N. Equilibrium Encircling. Equations of Chemical Kinetics and their Thermodynamic Analysis.

472: Novosibirsk, Nauka Publ. (1984) 256 p.

473:

474: \bibitem[Jansen et al.(2003)]{2}

475: Jansen, R., Bussemaker, H.J. and Gerstein, M. Revisiting the codon adaptation index from a whole-genome

476: perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety

477: of models. Nucleic Acids Res. {\textbf{31}}, 2242--2251 (2003)

478:

479: \bibitem[Nakamura et al.(2000)]{e3}

480: Nakamura, Y., Gojobori, T., Ikemura, T. Codon usage tabulated from international DNA sequence databases:

481: status for the year 2000. Nucleic Acids Res. \textbf{28}, 292 (2000).

482:

483: \bibitem[Sadovsky(2003)]{n3}

484: Sadovsky, M.G. Comparison of real frequencies of strings vs. the expected ones reveals the information

485: capacity of macromoleculae. Journal of Biol.Phys. {\textbf{29}}, 23--38 (2003)

486:

487: \bibitem[Sadovsky(2006)]{n4}

488: Sadovsky, M.G. Information capacity of nucleotide sequences and its applications. Bulletin of Math.Biology.

489: {\textbf{68}}, 156--178 (2006)

490:

491: \bibitem[Sharp, Li(1987)]{1}

492: Sharp, P.M., Li, W.-H. The codon adaptation index --- a measure of directional synonymous codon usage

493: bias, and its potential applications. Nucleic Acids Res. {\textbf{15}}, 1281--1295 (1987)

494:

495: \bibitem[Sharp et al.(2005)]{e9}

496: Sharp, P.M., Bailes, E., Grocock, R.J., Peden, J.F., Sockett, R.E. Variation in the strength of selected

497: codon usage bias among bacteria. Nucleic Acids Research. \textbf{33}, 1141--1153 (2005)

498:

499: \bibitem[Sueoka, Kawanishi(2000)]{e6}

500: Sueoka, N., Kawanishi, Y. DNA G+C content of the third codon position and codon usage biases of human genes.

501: Gene. \textbf{261}(1), 53--62 (2000)

502:

503: \bibitem[Supek, Vlahovi\v{c}ek(2005)]{4}

504: Supek, F. and Vlahovi\v{c}ek, K. Comparison of codon usage measures and their applicability in prediction of

505: microbial gene expressivity. BMC Bioinformatics. {\textbf{6}}, 182--197 (2005)

506:

507: \bibitem[Suzuki et al.(2004)]{6}

508: Suzuki, H., Saito, R. and Tomita, M. The `weighted sum of relative entropy': a new index for synonymous codon

509: usage bias. Gene. {\textbf{335}}, 19--23 (2004)

510:

511: \bibitem[Xiu-Feng et al.(2004)]{5}

512: Wan, X.-F., Xu, D., Kleinhofs, A., Zhou, J. Quantitative relationship between synonymous codon usage

513: bias and GC composition across unicellular genomes. BMC Evolutionary Biology. {\textbf{4}}, 19--30 (2004)

514:

515: \bibitem[Zeeberg(2002)]{3}

516: Zeeberg, B. Shannon Information Theoretic Computation of Synonymous Codon Usage Biases in Coding Regions of

517: Human and Mouse Genomes. Genome Res. {\textbf{12}}, 944--955 (2002)

518:

519: \end{thebibliography}

520:

521: \end{document}

522: