0412:q-bio0412028/iden.tex

1: % Template article for preprint document class `elsart'

2: % SP 2001/01/05

3:

4:  %\documentclass{article}

5:  \documentclass[aps]{revtex4}

6:

7: % Use the option doublespacing or reviewcopy to obtain double line spacing

8: % \documentclass[reviewcopy]{elsart3}

9:

10: % if you use PostScript figures in your article

11: % use the graphics package for simple commands

12: % \usepackage{graphics}

13: % or use the graphicx package for more complicated commands

14:  \usepackage{graphicx}

15: % or use the epsfig package if you prefer to use the old commands

16: % \usepackage{epsfig}

17:

18: % The amssymb package provides various useful mathematical symbols

19: \usepackage{amssymb}

20:

21: \begin{document}

22:

23: \title{Forcing reversibility in the no strand-bias substitution model

24: allows for the theoretical and practical identifiability of its

25: 5 parameters from pairwise DNA sequence comparisons.}

26:

27: % use optional labels to link authors explicitly to addresses:

28: \author{Osvaldo Zagordi}

29: \email{zagordi@sissa.it}

30: \affiliation{International School of Advanced Studies SISSA-ISAS\\ via Beirut 2-4, 34013 Trieste, Italy}

31:

32: \author{Jean R. Lobry}

33:

34: \affiliation{Laboratoire BBE-CNRS-UMR-5558, Univ. C. Bernard - Lyon I\\

35: 43 Bd 11/11/1918, F-69622 Villeurbanne CEDEX, France}

36:

37: \begin{abstract}

38: Because of the base pairing rules in DNA, some mutations experienced by a portion of DNA during its

39: evolution result in the same substitution, as we can only observe differences in coupled nucleotides.

40: Then, in the absence of a bias between the two DNA strands, a model with at most

41: 6 different parameters instead of 12 is sufficient to study the evolutionary relationship between homologous sequences derived from a common

42: ancestor. On the other hand the same symmetry reduces the number of independent observations which can be made. Such a reduction

43: can in some cases invalidate the calculation of the parameters. A compromise between biologically acceptable hypotheses and tractability

44: is introduced and a five parameter \textit{reversible no-strand-bias condition} (\textbf{RNSB}) is presented.

45: The identifiability of the parameters under this model is shown by examples.

46: \end{abstract}

47:

48:

49: % keywords here, in the form: keyword \sep keyword

50: \keywords{Parity rules no-strand-bias}

51:

52: % PACS codes here, in the form: \PACS code \sep code

53: \pacs{02.50.Ey 02.50.Ga 87.14.Gg 87.23.Kg}

54:

55: \maketitle

56:

57: % main text

58:

59: \section{\label{intro}Introduction}

60: Darwinian evolution is based upon the interplay of two driving forces: \textbf{mutation} of an organism's features, and \textbf{natural

61: selection} acting on the living organisms. Nowadays the role of DNA in evolutionary processes has been recognised, and the physical

62: basis of the mutation process has been identified. Mutation acts on the DNA, and we call \textsl{mutation rate} the probability that the

63: genome of a descendant differs from that of its parents.

64: The substitution rate is the probability of finding a difference when comparing the genome of a species to that of one of its ancestors.

65:

66: While mutation is closely related to biophysical processes such as DNA damage or replication error, a substitution

67: is the result of a mutation and of a population-dynamics process which has spread that mutation through the whole population.

68: M. Kimura argued in 1968 \cite{ki68} that, in the case of neutral mutations (i.e. those mutations which have

69: no apparent effect on the adaptation of an organism to the environment), the mutation rate can be deduced from the substitutions, as

70: the two rates are actually the same.

71:

72: Consider an ancestor $O$ at time $t=0$ which splits into two different evolutionary lineages,

73: resulting in two different species, $A$ and $B$ at time $t$.

74: It would be useful to define a distance between $A$ and $B$ and to have a tool to calculate it by just comparing the

75: genomes of $A$ and $B$.

76:

77: In order to study evolutionary distance between homologous DNA sequences (descending from a common ancestor) and their

78: consequent relationship, a model for nucleotide substitution can be introduced.

79: Generally, the process is assumed to be a Markov chain, if some assumptions are made about the underlying process.

80: The general hypotheses are:

81: \begin{itemize}

82: \item substitution rates do not depend on the position along the DNA sequence;

83: \item they are constant during evolutionary time;

84: \item the two evolutionary lineages have the same rates;

85: \item DNA sequences are at the compositional equilibrium when they start to diverge (nucleotide frequencies are constant).

86: \end{itemize}

87: We will see that some calculations can still be performed when the last two hypotheses are relaxed, but

88: it is worth noting that compositional equilibrium, if the last assumption is satisfied, is maintained during the course of evolution.

89:

90: Let $f_{i}$ denote the compositional equilibrium frequency of the nucleotide $i$, with $i \in  \{ \sf{A, T, G, C} \}$,

91: and $r_{ij}=r_{i\leftarrow j}$ the substitution rate from nucleotide $j$ to $i$ per unit time.

92: The distance between two sequences can now be defined as

93:

94: \begin{equation}

95: d=2t\sum_{i}f_{i}\mu_{i}=2t\sum_{i}f_{i}\sum_{j(\neq i)}r_{ji} \quad .

96: \label{distance}

97: \end{equation}

98:

99: \begin{figure}[c]

100: \begin{center}

101: \includegraphics[width=7cm]{nsbc}

102: \caption{\footnotesize{Explanation of the \textsl{no strand-bias condition}. If the rates for a certain substitution are the same on

103: both strands of DNA, one can deduce the equivalence of this rate to the one between the complementary bases.}}\label{nsbc_image}

104: \end{center}

105: \end{figure}

106:

107: Since 1969, when Jukes and Cantor proposed their first one-parameter model for nucleotide substitution in DNA, many different models of

108: increasing complexity have been published. The general 4-state Markov model has 12 independent parameters, \textbf{G12} in

109: fig.\ref{schema} (for a review see Zharkikh \cite{zh94}). This number, and consequently the

110: model complexity, can be decreased by further conditions on the parameters, leading to a plethora of different models. A possible choice is

111: to take into account the property of \textsl{no strand-bias}, explained in fig.\ref{nsbc_image}.

112: It was introduced by Sueoka in 1995 \cite{su95} and is generally referred to as the

113: \textit{type 1 parity rule} or \textit{PR1}. The rule is easily understood by noting that, when substitutions are scored on one strand,

114: the same substitution can arise in two ways: $\sf{A} \rightarrow \sf{C}$ is also observed if, on the opposite strand,

115: $\sf{T} \rightarrow \sf{G}$ occurs.

116:

117: \begin{figure}[c]

118: \begin{center}

119: \includegraphics[width=7cm]{schema}

120: \caption{\footnotesize{Hierarchy of DNA substitution models. Simplifications leading from a model to a simpler one are indicated by arrows.

121: Only those directly referring to our discussion are drawn. This figure has been adapted from Robert Schmidt's work.}}\label{schema}

122: \end{center}

123: \end{figure}

124:

125: This means that we cannot discriminate substitutions between two bases from those between their complementary bases. In symbols:

126: \begin{equation}

127: r_{ij}=r_{\bar{\imath}\bar{\jmath}},

128: \end{equation}

129: where the bar denotes the complementary nucleotide: $\bar{\mathsf{A}}=\mathsf{T}$ and vice versa, and similarly $\bar{\mathsf{C}}=\mathsf{G}$.

130:

131: The number of independent parameters is then halved, so that the following substitution rates can be introduced:

132:

133: \begin{eqnarray}\label{rates}

134: a&\equiv& r_{\sf{AT}}=r_{\sf{TA}}\nonumber\\

135: b&\equiv& r_{\sf{AG}}=r_{\sf{TC}}\nonumber\\

136: c&\equiv& r_{\sf{CT}}=r_{\sf{GA}}\nonumber\\

137: d&\equiv& r_{\sf{AC}}=r_{\sf{TG}}\nonumber\\

138: e&\equiv& r_{\sf{CA}}=r_{\sf{GT}}\nonumber\\

139: f&\equiv& r_{\sf{CG}}=r_{\sf{GC}}.\nonumber

140: \end{eqnarray}

141:

142: The notation introduced here is consistent with the one previously used by Sueoka \cite{su95} and Lobry \cite{lo95}.
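
As a brief illustration (a sketch only; the symbol $d_{AB}$ is introduced here purely to avoid a clash between the distance $d$ and the rate $d$), the total rate of change away from each nucleotide follows from the definitions above as $\mu_{\mathsf{A}}=\mu_{\mathsf{T}}=a+c+e$ and $\mu_{\mathsf{G}}=\mu_{\mathsf{C}}=b+d+f$, so that eq.(\ref{distance}) becomes
\begin{displaymath}
d_{AB}=2t\left[(f_{\mathsf{A}}+f_{\mathsf{T}})(a+c+e)+(f_{\mathsf{G}}+f_{\mathsf{C}})(b+d+f)\right].
\end{displaymath}
If all six rates are set equal to a common value $\alpha$ and all $f_{i}=1/4$, this reduces to $d_{AB}=6\alpha t$, as expected when each of the three possible substitutions occurs at rate $\alpha$ along both lineages.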

143:

144: Equilibrium frequencies for such a model are easily derived from the \textsl{master equations}:

145: $$

146: \dot{q_{i}}=\sum_{j}(r_{ij}q_{j}-r_{ji}q_{i}),

147: $$

148: where $q_{i}$ denotes in general the probability of state $i$.

149:

150: These frequencies are given by:

151: \begin{eqnarray}\label{equil}

152: f_{1} & \equiv & q^{\infty}_{\sf{A}}=q^{\infty}_{\sf{T}}=\frac{1}{2}\frac{b+d}{b+c+d+e}\nonumber\\

153: \mbox{}\\

154: f_{2} & \equiv & q^{\infty}_{\sf{G}}=q^{\infty}_{\sf{C}}=\frac{1}{2}\frac{c+e}{b+c+d+e}.\nonumber

155: \end{eqnarray}

156: The intrinsic symmetry of the model is evident. In this framework, in other words, there is only \textbf{one} independent frequency, the

157: other being deduced by the normalization condition $2f_{1}+2f_{2}=1$.

158: We now stress the fact that this is valid in a single strand (\textit{type 2 parity rule} or \textit{PR2}). If \textit{PR1} is

159: satisfied, then as a consequence the frequency of a nucleotide in a strand must be equal to that of its complement in the same strand.
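
The equilibrium frequencies can be checked with a short calculation (a sketch; the auxiliary variables $u$, $w$ and $z$ are introduced here for illustration only). Writing the master equations for the four nucleotides with the rates defined above and setting $u\equiv q_{\mathsf{A}}+q_{\mathsf{T}}$, so that $q_{\mathsf{G}}+q_{\mathsf{C}}=1-u$, one finds
\begin{displaymath}
\dot{u}=(b+d)(1-u)-(c+e)u ,
\end{displaymath}
whose stationary value is $u^{\infty}=(b+d)/(b+c+d+e)$. The differences $w\equiv q_{\mathsf{A}}-q_{\mathsf{T}}$ and $z\equiv q_{\mathsf{G}}-q_{\mathsf{C}}$ obey the closed linear system
\begin{eqnarray*}
\dot{w}&=&-(2a+c+e)\,w+(b-d)\,z\\
\dot{z}&=&(c-e)\,w-(b+d+2f)\,z ,
\end{eqnarray*}
whose only stationary point, for strictly positive rates, is $w=z=0$. Hence $q^{\infty}_{\mathsf{A}}=q^{\infty}_{\mathsf{T}}=u^{\infty}/2$ and $q^{\infty}_{\mathsf{G}}=q^{\infty}_{\mathsf{C}}=(1-u^{\infty})/2$, in agreement with the expressions for $f_{1}$ and $f_{2}$.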

160:

161: In the following we will summarize some general results regarding the \textit{PR1} algebra, showing that, in many cases, it is not possible to

162: reconstruct the supposed underlying mutation pattern because the independent parameters outnumber the possible independent observations.

163:

164: \section{\label{pr1}Materials and Methods}

165: In this section we will give some results regarding the model introduced above, focusing on the number of actual independent

166: possible observations.

167:

168: \subsection{\label{general}General model}

169:

170: Given the substitution matrix $\mathsf{R}_{[4,4]}$, whose entries are the mutation rates per nucleotide per unit of time,

171: one can deduce the \textsl{evolutionary matrix} $\mathsf{P}_{[4,4]}(t)$,

172: whose entries $p_{ij}(t)$

173: represent the probability of finding at a certain site the base $i$ at time $t$, given the base $j$ at $t=0$. Likewise, the

174: \textsl{divergence matrix} $\mathsf{X}_{[4,4]}(t)$ can be deduced, whose entries $x_{ij}(t)$ are the joint probabilities of

175: finding at time $t$ the base $i$ at a given site of one sequence and the base $j$ at the same site of the other sequence.

176: Obviously, if the substitution pattern is the same for both sequences, it follows that $x_{ij}(t)=x_{ji}(t)$.

177:

178: It is worth noting that the divergence matrix at initial time is nothing but the diagonal matrix with nucleotide

179: frequencies on the diagonal.

180:

181: The result of an evolutionary process can be represented compactly as an initial diagonal divergence matrix,

182: multiplied on the left and on the right by a certain number of substitution matrices (corresponding to the generation steps

183: in the two evolutionary lineages), producing a final matrix

184:

185: \begin{eqnarray}

186: \mathsf{X}(t) & = & \mathsf{R'_m}\cdots\mathsf{R'_2}\mathsf{R'_1}~\mathsf{X}(0)~\mathsf{R^{t}_1}\mathsf{R^{t}_2}\cdots\mathsf{R^{t}_n}\nonumber\\

187: \mathsf{X}(t) & = & \mathsf{P'}~\mathsf{X}(0)~\mathsf{P^{t}}\label{discretex}\\

188: x_{ij}(t) & = & \sum_{k=1}^{4}p'_{ik}(t)f_{k}p_{jk}(t)\nonumber

189: \end{eqnarray}

190: where the substitution matrices can, in principle, all be different.

191:

192: The entries of the divergence matrix are the experimentally observable quantities.

193:

194: In our case the substitution matrix is $\mathsf{R}_{[4,4]}$:

195:

196: \begin{displaymath}

197: \begin{array}{|c|c|c|c|c|}

198: \hline

199: \Rsh    & \sf{A}                    & \sf{T}                   & \sf{G}                   & \sf{C}\\

200: \hline

201: \sf{A}  & 1-a-c-e & a                   & c                   & e                   \\

202: \hline

203: \sf{T}  & a                   & 1-a-e-c & e                   & c                     \\

204: \hline

205: \sf{G}  & b                   & d                   & 1-b-d-f & f                       \\

206: \hline

207: \sf{C}  & d                   & b                   & f                   & 1- d - b - f  \\

208: \hline

209: \end{array}

210: \end{displaymath}

211:

212: obtained under the hypothesis of \textit{no strand-bias}, i.e. \textit{PR1}.

213:

214: \subsection{Non identifiability of some models}

215: In the following we show that the mathematical properties of the \textit{PR1} algebra are such that,

216: for the general model, the parameters to be estimated outnumber the possible independent observations, so that the model

217: is intractable.

218: As seen in eq.(\ref{discretex})

219: $$

220: \mathsf{X}(t) = \mathsf{P'}~\mathsf{X}(0)~\mathsf{P^{t}}.

221: $$

222: Now, several cases are possible, depending on whether $\mathsf{P'}=\mathsf{P}$ or not.

223: In the following, we will assume that $\mathsf{X}(0)$ is already at compositional equilibrium, i.e.

224: \begin{eqnarray}

225: q^{0}_{\sf{A}} = & q^{0}_{\sf{T}} = f_1 = & x_{AA}(t=0) = x_{TT}(t=0) \nonumber\\

226: \mbox{}\\

227: q^{0}_{\sf{C}} = & q^{0}_{\sf{G}} = f_2 = & x_{CC}(t=0) = x_{GG}(t=0) \nonumber

228: \end{eqnarray}

229:

230: \subsubsection{$\mathsf{P'}=\mathsf{P}$} \label{counting}

231: As $\mathsf{P'}=\mathsf{P}$ it is clear that  $\mathsf{X}(t)$ is symmetric ($\mathsf{X}(t) = \mathsf{X^{t}}(t)$).

232: We have to estimate 6 parameters (6 mutation rates) and we have only 5 independent observations.

233: This happens because of the symmetry $x_{ij}=x_{ji}$,

234: the normalization conditions and because $x_{ij}=x_{\bar{\imath}\bar{\jmath}}$.

235: In more detail:

236: \begin{eqnarray}\label{4par1}

237: x_{AG} & = & x_{GA}=x_{TC}=x_{CT}\nonumber\\

238: x_{AC} & = & x_{CA}=x_{TG}=x_{GT}\nonumber\\

239: x_{AT} & = & x_{TA}\nonumber\\

240: x_{CG} & = & x_{GC}\nonumber\\

241: %\end{eqnarray}

242: %\begin{eqnarray}\label{S}

243: x_{AA} & = & x_{TT}\nonumber\\

244: x_{CC} & = & x_{GG}\nonumber

245: \end{eqnarray}

246: where $x_{AA} = x_{TT}$ and $x_{CC} = x_{GG}$ can be deduced from the other four using the normalization ($\sum_{j}x_{ij}=f_{i}$)

247: and the equilibrium frequencies.

248: We find that $x_{AG}$, $x_{AC}$, $x_{AT}$, $x_{CG}$ and one equilibrium frequency are the only independent observable quantities.

249:

250: \subsubsection{$\mathsf{P'} \neq \mathsf{P}$}

251:

252: In this case the number of mutation rates doubles to 12,

253: so we have 12 parameters to estimate. The number of independent observations, on the other hand, increases only to 7, because the

254: symmetry $x_{ij}=x_{ji}$ is lost. The model is still intractable.
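
One way to recover this count (a sketch, assuming that the complement symmetry $x_{ij}=x_{\bar{\imath}\bar{\jmath}}$ still holds when both lineages satisfy \textit{PR1}) is to group the 16 entries of $\mathsf{X}$ into the eight classes
\begin{displaymath}
\begin{array}{llll}
\{x_{AA},x_{TT}\} & \{x_{CC},x_{GG}\} & \{x_{AT},x_{TA}\} & \{x_{CG},x_{GC}\}\\
\{x_{AG},x_{TC}\} & \{x_{GA},x_{CT}\} & \{x_{AC},x_{TG}\} & \{x_{CA},x_{GT}\}
\end{array}
\end{displaymath}
The overall normalization $\sum_{i,j}x_{ij}=1$ removes one degree of freedom, leaving $8-1=7$ independent quantities. When $x_{ij}=x_{ji}$ also holds, the classes in the second row merge in pairs, giving $6-1=5$.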

255:

256: \subsection{\label{case}Reversible \textit{PR1} model}

257: In this section we will deal with one of the previous models, the simplest one where $\mathsf{P'}=\mathsf{P}$.

258: In this case simple calculations lead to an analytical expression for the divergence matrix, but the model

259: remains intractable. Yet we will see that by the imposition of a certain property the model becomes tractable, and a way to

260: estimate the parameters for a real data set will be proposed.

261:

262: In the following we will assume again that the initial divergence matrix is already at compositional equilibrium.

263: Further, we will treat the evolutionary process as a continuous-time process, since the time elapsed since the divergence is very long.

264: This allows us to write the following equations to solve the problem.

265: The expression for the evolutionary matrix is

266: \begin{equation}

267: \mathsf{P}(t)=\exp\{\mathsf{R}t\};

268: \label{pdt}

269: \end{equation}

270: as it is the solution of the differential equations (see Rodriguez et al. \cite{ro90})

271:

272: \begin{eqnarray}

273: \frac{d\mathsf{P}(t)}{dt} & = & \mathsf{P}(t)\mathsf{R}\\

274: \frac{dp_{ij}(t)}{dt} & = & \sum_{k=1}^{4}p_{ik}(t)r_{kj}.

275: \label{dpdt}

276: \end{eqnarray}

277:

278: The divergence matrix is then given by

279:

280: \begin{eqnarray}

281: \mathsf{X}(t) & = & \mathsf{P'}(t)\mathsf{X}(t=0)\mathsf{P}^{T}(t)\\

282: x_{ij}(t) & = & \sum_{k=1}^{4}p'_{ik}(t)f_{k}p_{jk}(t);

283: \label{xdt}

284: \end{eqnarray}

285:

286: It is easily verified that, if $\mathsf{P'}=\mathsf{P}$, then $x_{ij}(t)=x_{ji}(t)$.

287:

288: Now, the expressions for $x_{ij}(t)$ (the observables) can be inverted to obtain the rates and then the distance.

289:

290: The strategy could be:

291: \begin{itemize}

292: \item solve the model, that is find the $x_{ij}(t)$ as a function of rates;

293: \item invert the above equations to get an expression for the rates;

294: \item substitute the observed quantities $\bar{x}_{ij}$ in order to have a numerical estimation of the rates;

295: \item use these estimates to obtain the distance.

296: \end{itemize}

297:

298: The expressions for $x_{ij}$ can be deduced in a manner analogous to that proposed by Takahata \&  Kimura in 1981 \cite{tk81}

299: who deal with a slightly less general model than this (model \textbf{TK5} in fig.\ref{schema}).

300: In this way we get an expression for every entry of the divergence matrix, but with five

301: independent expressions, as stated above. We repeat here the reasons:

302: \begin{itemize}

303: \item the symmetry of the matrix $x_{ij}=x_{ji}$;

304: \item the intrinsic symmetry of the model $x_{ij}=x_{\bar{\imath}\bar{\jmath}}$;

305: \item the normalization conditions $\sum_{j}x_{ij}=f_{i}$.

306: \end{itemize}

307: Thus, we can write down the entire divergence matrix by means of the following quantities:

308:

309: \begin{eqnarray}\label{4par}

310: P~ & \equiv & x_{AG}=x_{GA}=x_{TC}=x_{CT}\nonumber\\

311: R~ & \equiv & x_{AC}=x_{CA}=x_{TG}=x_{GT}\nonumber\\

312: Q_{1} & \equiv & x_{AT}=x_{TA}\nonumber\\

313: Q_{2} & \equiv & x_{CG}=x_{GC},\nonumber\\

314: %\end{eqnarray}

315: %\begin{eqnarray}\label{S}

316: S_{1} & \equiv & x_{AA}=x_{TT}\nonumber\\

317: S_{2} & \equiv & x_{CC}=x_{GG}\nonumber

318: \end{eqnarray}

319: where, as stated above, $S_{1}$ and $S_{2}$ can be deduced from the other four using the normalization and the equilibrium frequencies.

320: We find that $P, R, Q_{1},Q_{2}$ and one equilibrium frequency are the only independent observable quantities.
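
Explicitly (a short worked step, using only the normalization of the rows of $\mathsf{X}$), the row sums for $\mathsf{A}$ and $\mathsf{C}$ give
\begin{eqnarray*}
S_{1}&=&f_{1}-Q_{1}-P-R\\
S_{2}&=&f_{2}-Q_{2}-P-R ,
\end{eqnarray*}
with $f_{2}=\frac{1}{2}-f_{1}$, so that all six quantities are fixed once $P$, $R$, $Q_{1}$, $Q_{2}$ and $f_{1}$ are known.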

321:

322:

323: \subsection{Solution of the model}

324: Deriving an analytical expression for the divergence matrix is quite an easy task following \cite{tk81}.

325: Let's consider for example the element $x_{\mathsf{AC}}$; its derivative will be

326: \begin{equation}

327: \frac{dx_{\mathsf{AC}}}{dt}=\frac{d(q_{\mathsf{A}} q_{\mathsf{C}})}{dt}=q_{\mathsf{C}}\dot q_{\mathsf{A}} +q_{\mathsf{A}}\dot q_{\mathsf{C}}.\label{dt}

328: \end{equation}

329: It is worth giving a brief explanation of this.

330: We said that we are considering the two lineages at compositional equilibrium at the initial time,

331: so one might naturally conclude that $\dot q_{i} = 0$ and hence that the above derivative vanishes.

332: However, compositional equilibrium means that, \textbf{sampling the whole considered sequence},

333: the nucleotide frequencies $f_i$ do not change (apart from finite-size fluctuations). It does not mean that

334: there is no mutation at all at each site; had this been the case, there would be no evolution to study.

335: The probability for each nucleotide to mutate into another is given by the master equation, and this is why we

336: can write $x_{ij}$ as $q_i$ times $q_j$, take the derivative, and re-express it in terms of other $q_i q_j$ products,

337: i.e. other $\mathsf{X}$ entries.

338:

339: One such derivative is, for example,

340: \begin{eqnarray}\label{dotadotc}

341: \dot q_{\mathsf{A}}=(dq_{\mathsf{C}}+bq_{\mathsf{G}}+aq_{\mathsf{T}})-(a+c+e)q_{\mathsf{A}}.

342: \end{eqnarray}

343: Substituting this and the analogue for $\dot q_{\mathsf{C}}$ in eq.(\ref{dt}) and doing the same for all $\mathsf{X}$ entries we obtain a

344: set of linear coupled first order differential equations which can be diagonalized and solved.
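
As an illustration of this step (a sketch in the notation $P$, $R$, $Q_{1}$, $Q_{2}$, $S_{1}$, $S_{2}$ introduced above; the expression for $\dot q_{\mathsf{C}}$ is written out here from the rate definitions), one has
\begin{displaymath}
\dot q_{\mathsf{C}}=(eq_{\mathsf{A}}+cq_{\mathsf{T}}+fq_{\mathsf{G}})-(b+d+f)q_{\mathsf{C}},
\end{displaymath}
and inserting both derivatives in eq.(\ref{dt}) gives
\begin{displaymath}
\frac{dR}{dt}=e\,S_{1}+d\,S_{2}+c\,Q_{1}+b\,Q_{2}+(a+f)\,P-(a+b+c+d+e+f)\,R .
\end{displaymath}
The same procedure applied to $x_{\mathsf{AG}}$ yields
\begin{displaymath}
\frac{dP}{dt}=c\,S_{1}+b\,S_{2}+e\,Q_{1}+d\,Q_{2}+(a+f)\,R-(a+b+c+d+e+f)\,P ,
\end{displaymath}
so that the equation for the sum $P+R$ involves the rates only through the combinations $b+d$ and $c+e$. Here $R$ denotes the observable $x_{\mathsf{AC}}$, not the substitution matrix $\mathsf{R}$.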

345:

346: More details on the derivation are given in appendix \ref{app1}.

347:

348: \subsection{Reversibility}

349:

350: Until now we have stated that it is possible to write the divergence matrix for this model, but it would be of no use because we could

351: never invert five expressions and obtain six independent rates as functions of the matrix entries. What can be done is to reduce

352: the number of independent parameters by adding a relation between them. Many choices are possible. One could be, following \cite{tk81}, $a=f$.

353: Another possible choice is to make the model time reversible. Recall that time reversibility is satisfied when

354: \begin{equation}

355: p_{ij}f_{j}=p_{ji}f_{i} \qquad \forall i,j.

356: \end{equation}

357: where $p_{ij}$ are the entries of the evolutionary matrix and $f_{i}$ the equilibrium frequencies.

358: It is possible to demonstrate that this property is equivalent to the \textit{detailed balance} (see appendix \ref{app2}) which reads

359: \begin{equation}

360: r_{ij}f_{j}=r_{ji}f_{i} \qquad \forall i,j.

361: \end{equation}

362: In our model detailed balance holds if and only if

363: \begin{equation}

364: be=cd.

365: \end{equation}

366: This can be deduced by inspection of equilibrium frequencies expressions, or by a simpler rule \cite{luca},

367: reported here in appendix \ref{app3}.

368: A general version of the reversible model has been studied by Yang \cite{ya94}, who pointed out its ability to fit the data better than

369: other models. Gu and Li \cite{gu96} have shown its robustness against violation of time reversibility.
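
The condition $be=cd$ can be checked directly (an illustrative derivation, assuming non-zero $c$ and $e$; it is not the rule of \cite{luca}). Writing detailed balance for each pair of nucleotides with the rates defined above, the pairs $(\mathsf{A},\mathsf{T})$ and $(\mathsf{C},\mathsf{G})$ are satisfied identically, while the remaining pairs give
\begin{eqnarray*}
b\,f_{2}&=&c\,f_{1} \qquad \textrm{for}~(\mathsf{A},\mathsf{G})~\textrm{and}~(\mathsf{T},\mathsf{C})\\
d\,f_{2}&=&e\,f_{1} \qquad \textrm{for}~(\mathsf{A},\mathsf{C})~\textrm{and}~(\mathsf{T},\mathsf{G}).
\end{eqnarray*}
Both require $f_{1}/f_{2}=b/c=d/e$, that is $be=cd$. Conversely, if $be=cd$ then $(b+d)c=b(c+e)$, so the equilibrium ratio $f_{1}/f_{2}=(b+d)/(c+e)$ equals $b/c$ and both conditions hold.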

370:

371:

372:

373: \section{\label{res}Results and discussion}

374:

375: \subsection{Estimation of the substitution rates}

376: Due to the complexity of the expressions coming from this model, it is unlikely that one can

377: find an analytic way to invert them and express the rates as functions of the observables. Therefore we chose a statistical

378: approach to perform this inversion, based on a $\chi^2$ criterion. We write the $\chi^2$ as

379: \begin{equation}

380: \chi^2 = \sum_{i,j} \frac{(\bar{x}_{i,j}-x_{i,j})^2}{\bar{x}_{i,j}} = \sum_{i,j} \frac{x_{i,j}^2}{\bar{x}_{i,j}} - 1.

381: \end{equation}

382:

383: It is easily seen that this quantity is always non-negative, being zero when $\bar{x}_{i,j}=x_{i,j}$, i.e. when the model perfectly

384: fits the observations. Clearly, minimizing it amounts to searching for the best parameters.

385: In this context, trying to minimize the $\chi^2$ as a function of six parameters would fail completely: the algorithm would

386: wander among the infinite number of equivalent solutions. Enforcing reversibility makes the estimation possible, as will be shown

387: below.
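
The second equality in the expression for $\chi^2$ can be verified in one line (a worked step, assuming that both the observed and the predicted matrices are normalized, $\sum_{i,j}\bar{x}_{i,j}=\sum_{i,j}x_{i,j}=1$):
\begin{displaymath}
\sum_{i,j}\frac{(\bar{x}_{i,j}-x_{i,j})^2}{\bar{x}_{i,j}}=\sum_{i,j}\bar{x}_{i,j}-2\sum_{i,j}x_{i,j}+\sum_{i,j}\frac{x_{i,j}^2}{\bar{x}_{i,j}}=1-2+\sum_{i,j}\frac{x_{i,j}^2}{\bar{x}_{i,j}} .
\end{displaymath}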

388:

389:

390: \subsection{A realistic example}

391:

392: As an application, we started from the multiple alignment of rRNA sequences

393: used in \cite{Gouy89}. The observed divergence

394: matrix (unnormalized) between Xenopus and Homo is reported here below.

395: \mbox{}

396: \newline

397: % --- Table of observed nucleotide differences ---

398: \begin{center}

399: \begin{tabular}{c c c c c c}

400: \hline \hline

401: & & \multicolumn{4}{c}{ Xenopus }\\

402: \hline

403: & & A & T & G & C\\

404: & A & 647 & 1 & 17 & 2 \\

405: Homo & T & 3 & 523 & 11 & 18 \\

406: & G & 17 & 9 & 903 & 28 \\

407: & C & 8 & 21 & 25 & 691 \\

408: \hline \hline

409: \end{tabular}

410: \end{center}

411: \mbox{}

412: \newline
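
As a sketch of how these counts might be reduced to the observables $P$, $R$, $Q_{1}$, $Q_{2}$, $S_{1}$ and $S_{2}$ (assuming, for illustration only, that each observable is estimated by averaging the counts over its symmetry class; this averaging is our assumption and not necessarily the procedure used for the fit), the total number of sites is $N=2924$ and
\begin{eqnarray*}
\bar{P}&=&\frac{17+17+18+21}{4N}=\frac{73}{11696}\simeq 0.0062\\
\bar{R}&=&\frac{2+8+11+9}{4N}=\frac{30}{11696}\simeq 0.0026\\
\bar{Q}_{1}&=&\frac{1+3}{2N}=\frac{4}{5848}\simeq 0.0007\\
\bar{Q}_{2}&=&\frac{28+25}{2N}=\frac{53}{5848}\simeq 0.0091\\
\bar{S}_{1}&=&\frac{647+523}{2N}=\frac{1170}{5848}\simeq 0.2001\\
\bar{S}_{2}&=&\frac{903+691}{2N}=\frac{1594}{5848}\simeq 0.2726 ,
\end{eqnarray*}
which satisfy $2\bar{S}_{1}+2\bar{S}_{2}+2\bar{Q}_{1}+2\bar{Q}_{2}+4\bar{P}+4\bar{R}=1$ and give $\bar{f}_{1}=\bar{S}_{1}+\bar{Q}_{1}+\bar{P}+\bar{R}\simeq 0.2096$.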

413:

414: By changing parameter values over six orders of magnitude we found that the $\chi^2$ criterion was well behaved, with only one global minimum

415: (fig. \ref{paramfig}).

416: A systematic exploration of all possible pairs of parameters showed that there were no strong structural correlations between parameters,

417: except between $b$ and $c$ (fig. \ref{pairsfig}). As a consequence, parameter values are easily estimated using standard nonlinear minimization tools

418: (note that it is advisable to enforce parameter positivity during optimisation). This example shows that the parameters can be estimated

419: in practice from a realistically sized dataset.

420:

421: \begin{figure}[c]

422: \begin{center}

423: \includegraphics[width=7cm]{figparatonce}

424: \caption{\footnotesize{Shape of the $\chi^2$ criterion as the parameter values are varied over six orders of magnitude; a single global minimum is found.}}\label{paramfig}

425: \end{center}

426: \end{figure}

427:

428:

429: \begin{figure}[c]

430: \begin{center}

431: \includegraphics[width=7cm]{pairsbc}

432: \caption{\footnotesize{Near the optimal values for the parameters, only $b$ and $c$ show a structural correlation.}}\label{pairsfig}

433: \end{center}

434: \end{figure}

435:

436:

437:

438: \subsection{Discussion}

439:

440: The most general model of evolution at the DNA level has 12 parameters, and this is

441: too many for practical purposes. If we try to simplify it by forcing some parameters

442: to be equal, then the number of possible sub-models rapidly increases, because this can be done in many ways.

443: At the opposite extreme we find the only model which requires all the parameters to be equal (JC).

444:

445: It is clear that the models published in the literature do not cover all possible ones, and that only those

446: with some biological or mathematical justification have been explored.

447:

448: Under {\it PR1 hypothesis}, we are dealing with {\it no strand-bias} models

449: whose most general form has 6 parameters.

450: We do not claim that models of this class are the best in any way, but that they are an interesting starting point.

451: An important property of these models is their convergence

452: towards {\it PR2 state} even if substitution rates are modified during

453: the course of evolution \cite{lolo99}. {\it PR2 state} is a strong assumption and strand asymmetry has been observed in

454: many cases. But, as {\it PR2} is usually

455: observed at a genome scale level \cite{lo95}, the hope is that, {\it on average},

456: with local deviations from {\it PR1 hypothesis} canceling out, this

457: class of model is not too bad an approximation.

458: The {\bf biological} motivation leading to the {\it no strand-bias}

459: models has an important {\bf mathematical} consequence:

460: even if it is biologically reasonable to study these models, one must be aware

461: that the symmetry involved inexorably reduces the number of independent observations,

462: making the general model mathematically intractable.

463:

464: \subsection{Conclusion}

465: As we have shown in section \ref{counting} by comparing the number of unknowns

466: with the number of possible independent observations, there is definitely no

467: hope of estimating the 6 parameters of the general form of the

468: {\it no strand-bias} model from pairwise DNA sequence comparisons.

469: There is no unique solution to a system of $M$ equations in $N > M$ unknowns:

470: in our case there is an infinite number of ways to choose the six rates $a, b, c, d, e, f$ in order to satisfy the

471: five independent equations defining the matrix $\mathsf{X}$.

472: This result is extremely

473: unpleasant because it corresponds to the most common situation with

474: experimental

475: data from present day DNA: fossil DNA data are scarce and from a relatively

476: recent past. We clearly need further simplifications.

477:

478: We have exhibited here an example of a model, denoted RNSB in fig.\ref{schema}, that

479: combines the properties of reversible models and {\it no strand-bias} models.

480: It is important to note that this model still has 5 free parameters: if the intersection

481: between the reversible model class and the {\it no strand-bias} class

482: contained only, say,

483: models with 3 free parameters, there would not have been much flexibility left for further

484: research. We do not claim that this new RNSB model is the best intersection between

485: the two classes. We just claim that the RNSB model proves that such an intersection is possible

486: with 5 free parameters, so that there is no bottleneck here for further theoretical

487: work on the parametric forms for this class of DNA substitution models.

488:

489:

490: \subsection*{Acknowledgements}

491: This contribution partly comes from the thesis OZ presented at Naples University in October 2002.

492: The authors warmly thank prof. Luca Peliti for putting them in contact during the \textsl{strapp 04} meeting (Dresden, Germany, July 5-10 2004).

493: OZ also thanks him for introducing him to the beauties of biological systems.

494: They thank Manolo Gouy for kindly providing the multiple alignment of rRNA sequences and for many constructive suggestions.

495: The manuscript was also improved thanks to the comments from three anonymous reviewers.

496:

497: % The Appendices part is started with the command \appendix;

498: % appendix sections are then done as normal sections

499: % \appendix

500:

501: % \section{}

502: % \label{}

503: \appendix

504: \section{Derivation of the divergence matrix}

505: \label{app1}

506:

507: In order to obtain the expressions for the divergence matrix we define (following the notation introduced above)

508: \begin{eqnarray}\label{xyz}

509: X_{\pm}&\equiv& 2S_{1} \pm 2Q_{1}\nonumber\\

510: Y_{\pm}&\equiv& 2S_{2} \pm 2Q_{2}\\

511: Z_{\pm}&\equiv& 4P     \pm 4R.\nonumber

512: \end{eqnarray}

513:

514: These combinations reduce the problem to six coupled first-order ordinary differential equations. The system is block-diagonal

515: and can easily be solved; its solution is:

516: \begin{eqnarray}\label{xyz+}

517: X_{+}&=&\omega[\omega+(1-\omega)e^{\lambda_{0}t}] \nonumber\\

518: Y_{+}&=&(1-\omega)(1-\omega+\omega e^{\lambda_{0}t})\\

519: Z_{+}&=&2\omega(1-\omega)(1-e^{\lambda_{0}t})\nonumber

520: \end{eqnarray}

521:

522: and

523:

524: \begin{eqnarray}\label{xyz-}

525: X_{-}&=&\frac{1}{g^{2}}\{2\beta[\alpha\omega-\beta(1-\omega)]e^{\lambda_{1}t}+\nonumber\\

526:    {}&{}& +[\zeta\omega+\beta^{2}(1-\omega)]e^{\lambda_{2}t}+\nonumber\\

527:    {}&{}& +[\eta\omega+\beta^{2}(1-\omega)]e^{\lambda_{3}t}\}\nonumber\\

528: Y_{-}&=&\frac{1}{g^{2}}\{-2\alpha[\alpha\omega-\beta(1-\omega)]e^{\lambda_{1}t}+\nonumber\\

529:    {}&{}&+[\alpha^{2}\omega+\eta(1-\omega)]e^{\lambda_{2}t}+\nonumber\\

530:    {}&{}& +[\alpha^{2}\omega+\zeta(1-\omega)]e^{\lambda_{3}t}\}\\

531: Z_{-}&=&\frac{1}{g^{2}}\{-2(\delta-\gamma)[\alpha\omega-\beta(1-\omega)]e^{\lambda_{1}t}+\nonumber\\

532:    {}&{}& +[\alpha(\delta-\gamma+g)\omega-\beta(\delta-\gamma-g)(1-\omega)]e^{\lambda_{2}t}+\nonumber\\

533:    {}&{}& +[\alpha(\delta-\gamma-g)\omega-\beta(\delta-\gamma+g)(1-\omega)]e^{\lambda_{3}t}\}\nonumber

534: \end{eqnarray}

535:

536: where

537:

538: \begin{eqnarray}\label{ab-}

539:           \alpha & \equiv & c-e\nonumber\\

540:           \beta  & \equiv & b-d\nonumber\\

541:           \gamma & \equiv & 2a+c+e\nonumber\\

542:           \delta & \equiv & b+d+2f\nonumber\\

543:      \omega & \equiv & 2f_{1}=2f_{A}=2f_{T}\nonumber\\

544: \lambda_{0} & \equiv & -2(b+c+d+e)\nonumber\\

545: \lambda_{1} & \equiv & -(2a+b+c+d+e+2f)\nonumber\\

546: \lambda_{2} & \equiv & \lambda_{1}+g\nonumber\\

547: \lambda_{3} & \equiv & \lambda_{1}-g\nonumber\\

548:           g & \equiv & \sqrt{(\delta-\gamma)^{2}+4\alpha\beta}\nonumber\\

549:       \zeta & \equiv & \frac{1}{2}(\delta-\gamma)(\delta-\gamma+g)+\alpha\beta\nonumber\\

550:        \eta & \equiv & \frac{1}{2}(\delta-\gamma)(\delta-\gamma-g)+\alpha\beta\nonumber

551: \end{eqnarray}

552:

553: Combining all these, the entries of the divergence matrix are obtained.
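
Explicitly, inverting the definitions in eq.(\ref{xyz}) gives (a direct algebraic step, added here for convenience)
\begin{eqnarray*}
S_{1}=\frac{X_{+}+X_{-}}{4}, &\quad& Q_{1}=\frac{X_{+}-X_{-}}{4},\\
S_{2}=\frac{Y_{+}+Y_{-}}{4}, &\quad& Q_{2}=\frac{Y_{+}-Y_{-}}{4},\\
P=\frac{Z_{+}+Z_{-}}{8}, &\quad& R=\frac{Z_{+}-Z_{-}}{8}.
\end{eqnarray*}
Two quick consistency checks on eq.(\ref{xyz+}): at $t=0$ one recovers $X_{+}=\omega$, $Y_{+}=1-\omega$ and $Z_{+}=0$, and at all times $X_{+}+Y_{+}+Z_{+}=1$, as required by normalization.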

554:

555:

556: \section{Reversibility and detailed balance}

557: \label{app2}

558: We will show here the equivalence between time reversibility and detailed balance.

559:

560: \subsection{DETAILED BALANCE $\Rightarrow $ TIME REVERSIBILITY}

561: Recall that

562: $$

563: \mathsf{P}(t)=\exp\{\mathsf{R}t\},

564: $$

565: which can be expanded as

566: \begin{equation}

567: \mathsf{P}(t)=\mathbb{I} +\mathsf{R}t + \frac{1}{2} \mathsf{R}^{2}t^{2} + \cdots,

568: \end{equation}

569: or

570: \begin{equation}

571: \label{ij}

572: p_{ij}=\delta_{ij} + r_{ij}t + \frac{1}{2} \sum_k r_{ik}r_{kj}t^{2} + \cdots

573: \end{equation}

574:

575: %\begin{equation}

576: %\label{ji}

577: %p_{ji}=\delta_{ji} + r_{ji}t + \frac{1}{2} r_{jk}r_{ki}t^{2} + \cdots.

578: %\end{equation}

579:

580: Equation (\ref{ij}) can also be written as:

581: \begin{eqnarray}

582: p_{ij}&=&\delta_{ij}+\nonumber\\

583: {}&{}&+ \sum_{n=1}^{\infty}\frac{s_{ij}^{(n)}}{n!}t^n,

584: \end{eqnarray}

585: where

586: \begin{eqnarray}

587: s_{ij}^{(n)}&= &\sum_{k_{1}k_{2}\cdots k_{n-1}}r_{i,k_{1}}r_{k_{1},k_{2}}\cdots r_{k_{n-2},k_{n-1}} r_{k_{n-1},j}\nonumber\\

588: {}&{}&\quad \textrm{for}~ n\geq 2\nonumber\\

589:         {}&{}&{}\\

590: s_{ij}^{(n)}&= &r_{ij}, \qquad \qquad \qquad \qquad  \textrm{for}~ n=1. \nonumber

591: \end{eqnarray}

592: Now we will show that, if detailed balance is satisfied, then

593: \begin{equation}\label{s=}

594: s_{ij}^{(n)}f_{j}=s_{ji}^{(n)}f_{i}, \qquad \forall i,j,n.

595: \end{equation}

596: In fact, exploiting detailed balance,

597: \begin{eqnarray}

598: s_{ij}^{(n)}f_{j}=\sum_{k_{1}\cdots k_{n-1}}r_{i,k_{1}}\cdots r_{k_{n-1},j}f_{j}

599: \end{eqnarray}

600: becomes

601: \begin{eqnarray}

602: {}&\sum_{k_{1}\cdots k_{n-1}}r_{i,k_{1}}\cdots r_{j,k_{n-1}}f_{k_{n-1}}=\nonumber\\

603: =&\sum_{k_{1}\cdots k_{n-1}}r_{i,k_{1}}\cdots r_{k_{n-1},k_{n-2}}r_{j,k_{n-1}}f_{k_{n-2}}=\cdots\nonumber

604: \end{eqnarray}

605: and finally

606: \begin{equation}

607: \cdots=\sum_{k_{1}\cdots k_{n-1}}r_{k_{1},i}r_{k_2,k_1}\cdots r_{j,k_{n-1}}f_{i}.

608: \end{equation}

609: Reordering all the factors

610: \begin{eqnarray}\label{last}

611: \sum_{k_{1}\cdots k_{n-1}}r_{k_{1},i}r_{k_2,k_1}\cdots r_{j,k_{n-1}}f_{i}=\nonumber\\

612: \sum_{k_{1}\cdots k_{n-1}}r_{j,k_{n-1}}r_{k_{n-1},k_{n-2}}r_{k_{n-2},k_{n-3}}\cdots r_{k_1,i}f_{i} .

613: \end{eqnarray}

614: As the sum runs over all the dummy indices $k_{1} \cdots k_{n-1}$, which can be relabelled freely,

615: the expression in (\ref{last}) is equal to $s_{ji}^{(n)}f_{i}$ for all $n \geq 2$.

616: So (\ref{s=}) holds for $n > 1$, and for $n=1$ it is detailed balance itself. Further, as

617: $\delta_{ij}f_{j}=\delta_{ji}f_{i}$, summing the series term by term gives $p_{ij}f_{j}=p_{ji}f_{i}$. \textit{Q. E. D.}
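To make the index manipulation concrete, here is the lowest non-trivial case, $n=2$, written out in full (an illustrative sketch that uses only detailed balance in the form $r_{ij}f_{j}=r_{ji}f_{i}$):
\begin{eqnarray*}
s_{ij}^{(2)}f_{j}&=&\sum_{k}r_{ik}\,r_{kj}f_{j}=\sum_{k}r_{ik}\,r_{jk}f_{k}\\
{}&=&\sum_{k}r_{ki}\,r_{jk}f_{i}=\sum_{k}r_{jk}\,r_{ki}f_{i}=s_{ji}^{(2)}f_{i}.
\end{eqnarray*}
The first step moves the frequency from $j$ to $k$ through the last factor, the second moves it from $k$ to $i$ through the first factor. The general case repeats this move $n$ times.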

618:

619:

620:

621: \subsection{DETAILED BALANCE $\Leftarrow $ TIME REVERSIBILITY}

622: Let us recall the formula

623: \begin{equation}\label{aga}

624: \frac{d\mathsf{P}(t)}{dt}=\mathsf{P}(t)\mathsf{R}; \quad \frac{dp_{ij}(t)}{dt}=\sum_{k}p_{ik}(t)r_{kj}.

625: \end{equation}

626: Let us compute the time derivative of $p_{ij}f_{j}$; if time reversibility holds, it will be equal to the time derivative

627: of $p_{ji}f_{i}$.

628: From formula (\ref{aga}), as the equilibrium frequencies do not depend on time,

629: \begin{equation}\label{dpdt2}

630: \frac{d}{dt}(p_{ij}(t)f_{j})=f_{j}\frac{dp_{ij}(t)}{dt}=\sum_{k}p_{ik}(t)r_{kj}f_{j}.

631: \end{equation}

632: But

633: $$

634: \frac{dp_{ij}(t)}{dt}=\sum_{k}r_{ik}p_{kj}(t),

635: $$

636: as $\mathsf{P}$ and $\mathsf{R}$ commute (evident from the solution).

637: The second expression in (\ref{dpdt2}) can be written as

638: \begin{equation}\label{dpdt3}

639: \sum_{k}p_{ik}(t)r_{kj}f_{j}=\sum_{k}r_{ik}p_{kj}(t)f_{j}.

640: \end{equation}

641: Because of time reversibility, $p_{kj}(t)f_{j}=p_{jk}(t)f_{k}$, so the last expression in (\ref{dpdt3}) becomes

642: \begin{equation}\label{dpdt4}

643: \sum_{k}r_{ik}p_{kj}(t)f_{j}=\sum_{k}r_{ik}p_{jk}(t)f_{k}.

644: \end{equation}

645: Finally

646: \begin{equation}\label{dpdt5}

647: \frac{d}{dt}(p_{ji}(t)f_{i})=f_{i}\frac{dp_{ji}(t)}{dt}=\sum_{k}p_{jk}(t)r_{ki}f_{i}.

648: \end{equation}

649: Subtracting (\ref{dpdt5}) from (\ref{dpdt4}), which are equal by time reversibility, and factoring out $p_{jk}(t)$, we finally obtain

650: \begin{equation}\label{dpdt6}

651: \sum_{k}p_{jk}(t)(r_{ik}f_{k}-r_{ki}f_{i})=0,

652: \end{equation}

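653: which holds for every $t$. In particular, at $t=0$, where $p_{jk}(0)=\delta_{jk}$, it reduces to $r_{ij}f_{j}=r_{ji}f_{i}$, and detailed balance is satisfied. \textit{Q. E. D.}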

654:

655: \section{Detailed balance: simple check}

656: \label{app3}

657: A nice property of detailed balance is that there is a very easy way to check whether it holds,

658: even without calculating the equilibrium frequencies.

659: Until now we have seen that detailed balance is fulfilled when the equilibrium frequencies and the mutation rates (on which the

660: former depend) cancel every term in the master equations.

661:

662: Another way to check detailed balance is to consider three states of the system and the rates connecting them. If, for every such triple, the product of the

663: three rates that lead from a state back to itself ``clockwise'' is equal to the product calculated ``counter-clockwise'', then detailed balance

664: holds. For three states $i, j, k$ this property reads

665: $$

666: r_{ik}r_{kj}r_{ji}=r_{ij}r_{jk}r_{ki}.

667: $$
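That this condition is necessary can be seen in one step (a short sketch, assuming all the equilibrium frequencies are strictly positive). Writing detailed balance for the three pairs of states,
$$
r_{ij}f_{j}=r_{ji}f_{i}, \qquad r_{jk}f_{k}=r_{kj}f_{j}, \qquad r_{ki}f_{i}=r_{ik}f_{k},
$$
and multiplying the three equalities side by side gives
$$
r_{ij}r_{jk}r_{ki}\,f_{i}f_{j}f_{k}=r_{ji}r_{kj}r_{ik}\,f_{i}f_{j}f_{k},
$$
so that, dividing by $f_{i}f_{j}f_{k}$, the frequencies drop out and only the rates remain.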

668:

669:

670: \begin{thebibliography}{20}

671:

672: \bibitem{ki68} Kimura M., 1968. Evolutionary rate at the molecular level. \textit{Nature} \textbf{217}, 624-626.

673:

674: \bibitem{zh94} Zharkikh A., 1994. Estimation of evolutionary distances between nucleotide sequences.

675: \textit{J. of Mol. Evol.} \textbf{39}, 315-329.

676:

677: \bibitem{su95} Sueoka N., 1995. Intrastrand parity rules of DNA base composition and usage biases of synonymous codons.

678: \textit{J. of Mol. Evol.} \textbf{40}, 318-325.

679: \emph{Errata} \textbf{42}, 323.

680:

681: \bibitem{lo95} Lobry J. R., 1995. Properties of a general model of DNA evolution under no-strand bias conditions.

682: \textit{J. of Mol. Evol.} \textbf{40}, 326-330.

683: \emph{Errata} \textbf{41}, 680.

684:

685: \bibitem{ro90} Rodriguez F., Oliver J. L., Mar\'\i n A., Medina J. R., 1990. The general stochastic model of nucleotide substitution.

686: \textit{J. of Theor. Biol.} \textbf{142}, 485-501.

687:

688: \bibitem{tk81} Takahata N., Kimura M., 1981. A model of evolutionary base substitution and its application with special reference to

689: rapid changes of pseudogenes. \textit{Genetics} \textbf{98}, 641-657.

690:

691: \bibitem{luca} Peliti L., 2003. Appunti di meccanica statistica.

692: Bollati Boringhieri, Torino, Italy.

693:

694: \bibitem{ya94} Yang Z., 1994. Estimating the pattern of nucleotide substitution.

695: \textit{J. of Mol. Evol.} \textbf{39}, 105-111.

696:

697: \bibitem{gu96} Gu X., Li W.-H., 1996. A general additive distance with time-reversibility and rate variation among nucleotide sites.

698: \textit{Proc. Natl. Acad. Sci. USA} \textbf{93}, 4671-4676.

699:

700: \bibitem{Gouy89} Gouy M., Li W.-H., 1989. Phylogenetic analysis based on rRNA sequences supports the archaebacterial tree rather than the eocyte tree.

701: \textit{Nature} \textbf{339}, 145-147.

702:

703: \bibitem{lolo99} Lobry J. R., Lobry C., 1999. Evolution of DNA base composition under no-strand-bias conditions when the substitution rates

704: are not constant.

705: \textit{Mol. Biol. Evol.} \textbf{16}, 719-723.

706:

720:

721: \end{thebibliography}

722:

723: \end{document}

724: