1: % Template article for preprint document class `elsart'
2: % SP 2001/01/05
3:
4: %\documentclass{article}
5: \documentclass[aps]{revtex4}
6:
7: % Use the option doublespacing or reviewcopy to obtain double line spacing
8: % \documentclass[reviewcopy]{elsart3}
9:
10: % if you use PostScript figures in your article
11: % use the graphics package for simple commands
12: % \usepackage{graphics}
13: % or use the graphicx package for more complicated commands
14: \usepackage{graphicx}
15: % or use the epsfig package if you prefer to use the old commands
16: % \usepackage{epsfig}
17:
18: % The amssymb package provides various useful mathematical symbols
19: \usepackage{amssymb}
20:
21: \begin{document}
22:
23: \title{Forcing reversibility in the no strand-bias substitution model
24: allows for the theoretical and practical identifiability of its
25: 5 parameters from pairwise DNA sequence comparisons.}
26:
27: % use optional labels to link authors explicitly to addresses:
28: \author{Osvaldo Zagordi}
29: \email{zagordi@sissa.it}
30: \affiliation{International School of Advanced Studies SISSA-ISAS\\ via Beirut 2-4, 34013 Trieste, Italy}
31:
32: \author{Jean R. Lobry}
33:
34: \affiliation{Laboratoire BBE-CNRS-UMR-5558, Univ. C. Bernard - Lyon I\\
35: 43 Bd 11/11/1918, F-69622 Villeurbanne CEDEX, France}
36:
37: \begin{abstract}
38: Because of the base pairing rules in DNA, some mutations experienced by a portion of DNA during its
39: evolution result in the same substitution, as we can only observe differences in coupled nucleotides.
40: Then, in the absence of a bias between the two DNA strands, a model with at most
41: 6 different parameters instead of 12 is sufficient to study the evolutionary relationship between homologous sequences derived from a common
42: ancestor. On the other hand the same symmetry reduces the number of independent observations which can be made. Such a reduction
43: can in some cases invalidate the calculation of the parameters. A compromise between biologically acceptable hypotheses and tractability
is introduced and a five-parameter \textit{reversible no-strand-bias condition} (\textbf{RNSB}) is presented.
45: The identifiability of the parameters under this model is shown by examples.
46: \end{abstract}
47:
48:
49: % keywords here, in the form: keyword \sep keyword
50: \keywords{Parity rules no-strand-bias}
51:
52: % PACS codes here, in the form: \PACS code \sep code
53: \pacs{02.50.Ey 02.50.Ga 87.14.Gg 87.23.Kg}
54:
55: \maketitle
56:
57: % main text
58:
59: \section{\label{intro}Introduction}
Darwinian evolution is based upon the interplay of two driving forces: \textbf{mutation} of an organism's features, and \textbf{natural
selection} acting on living organisms. The role of DNA in evolutionary processes is now recognised, and the physical
basis of the mutation process has been identified. Mutation acts on DNA, and we call \textsl{mutation rate} the probability that a
descendant's genome differs from that of its parents.
The \textsl{substitution rate} is the probability of finding a difference when comparing the genome of a species to that of one of its ancestors.
65:
While mutation is closely related to biophysical processes such as DNA damage or replication error, a substitution
is the result of a mutation followed by a population-dynamics process that has spread it to the whole population.
M. Kimura argued in 1968 \cite{ki68} that, for neutral mutations (i.e. mutations which have
no apparent effect on the adaptation of an organism to its environment), the mutation rate can be deduced from the substitutions, as
the two rates are actually the same.
71:
Let us consider an ancestor $O$ at time $t=0$ which splits into two different evolutionary lineages,
73: resulting in two different species, $A$ and $B$ at time $t$.
74: It would be useful to define a distance between $A$ and $B$ and to have a tool to calculate it by just comparing the
75: genomes of $A$ and $B$.
76:
77: In order to study evolutionary distance between homologous DNA sequences (descending from a common ancestor) and their
78: consequent relationship, a model for nucleotide substitution can be introduced.
79: Generally, the process is assumed to be a Markov chain, if some assumptions are made about the underlying process.
80: The general hypotheses are:
81: \begin{itemize}
82: \item substitution rates do not depend on the position along the DNA sequence;
83: \item they are constant during evolutionary time;
84: \item the two evolutionary lineages have the same rates;
85: \item DNA sequences are at the compositional equilibrium when they start to diverge (nucleotide frequencies are constant).
86: \end{itemize}
We will see that some calculations can still be performed when the last two hypotheses are relaxed, but
it is worth noting that, if the last assumption is satisfied, compositional equilibrium is maintained during the course of evolution.
89:
Let $f_{i}$ denote the compositional equilibrium frequency of nucleotide $i$, with $i \in \{ \mathsf{A, T, G, C} \}$,
and let $r_{ij}=r_{i\leftarrow j}$ denote the substitution rate from nucleotide $j$ to nucleotide $i$ per unit time.
The distance between two sequences can now be defined as
93:
94: \begin{equation}
95: d=2t\sum_{i}f_{i}\mu_{i}=2t\sum_{i}f_{i}\sum_{j(\neq i)}r_{ji} \quad .
96: \label{distance}
97: \end{equation}
98:
\begin{figure}[htbp]
\begin{center}
\includegraphics[width=7cm]{nsbc}
\caption{\footnotesize{Explanation of the \textsl{no strand-bias condition}. If the rates for a certain substitution are the same on
both strands of DNA, one can deduce the equivalence of this rate to the one between the complementary bases.}}\label{nsbc_image}
104: \end{center}
105: \end{figure}
106:
Since 1969, when Jukes and Cantor proposed their first one-parameter model for nucleotide substitution in DNA, many models of
increasing complexity have been published. The general 4-state Markov model has 12 independent parameters (\textbf{G12} in
fig.~\ref{schema}; for a review see Zharkikh \cite{zh94}). This number, and consequently the
model complexity, can be decreased by imposing further conditions on the parameters, leading to a plethora of different models. One possible choice is
to take into account the property of \textsl{no strand-bias}, explained in fig.~\ref{nsbc_image}.
It was introduced by Sueoka in 1995 \cite{su95} and is generally referred to as the
\textit{type 1 parity rule} or \textit{PR1}. The rule is easily understood by noting that, when substitutions are scored on one strand,
the same substitution can arise in two ways: $\sf{A} \rightarrow \sf{C}$ is also observed if
$\sf{T} \rightarrow \sf{G}$ occurs on the opposite strand.
116:
\begin{figure}[htbp]
118: \begin{center}
119: \includegraphics[width=7cm]{schema}
120: \caption{\footnotesize{Hierarchy of DNA substitution models. Simplifications leading from a model to a simpler one are indicated by arrows.
121: Only those directly referring to our discussion are drawn. This figure has been adapted from Robert Schmidt's work.}}\label{schema}
122: \end{center}
123: \end{figure}
124:
125: This means that we cannot discriminate substitutions between two bases from those between their complementary bases. In symbols:
126: \begin{equation}
127: r_{ij}=r_{\bar{\imath}\bar{\jmath}},
128: \end{equation}
where the bar denotes the complementary nucleotide: $\bar{\mathsf{A}}=\mathsf{T}$ and vice versa, and similarly $\bar{\mathsf{C}}=\mathsf{G}$.
130:
131: The number of independent parameters is then halved, so that the following substitution rates can be introduced:
132:
133: \begin{eqnarray}\label{rates}
134: a&\equiv& r_{\sf{AT}}=r_{\sf{TA}}\nonumber\\
135: b&\equiv& r_{\sf{AG}}=r_{\sf{TC}}\nonumber\\
136: c&\equiv& r_{\sf{CT}}=r_{\sf{GA}}\nonumber\\
137: d&\equiv& r_{\sf{AC}}=r_{\sf{TG}}\nonumber\\
138: e&\equiv& r_{\sf{CA}}=r_{\sf{GT}}\nonumber\\
139: f&\equiv& r_{\sf{CG}}=r_{\sf{GC}}.\nonumber
140: \end{eqnarray}
141:
The notation introduced here is consistent with the one previously used by Sueoka \cite{su95} and Lobry \cite{lo95}.
143:
144: Equilibrium frequencies for such a model are easily derived from the \textsl{master equations}:
145: $$
146: \dot{q_{i}}=\sum_{j}(r_{ij}q_{j}-r_{ji}q_{i}),
147: $$
148: where $q_{i}$ denotes in general the probability of state $i$.
149:
150: These frequencies are given by:
151: \begin{eqnarray}\label{equil}
152: f_{1} & \equiv & q^{\infty}_{\sf{A}}=q^{\infty}_{\sf{T}}=\frac{1}{2}\frac{b+d}{b+c+d+e}\nonumber\\
153: \mbox{}\\
154: f_{2} & \equiv & q^{\infty}_{\sf{G}}=q^{\infty}_{\sf{C}}=\frac{1}{2}\frac{c+e}{b+c+d+e}.\nonumber
155: \end{eqnarray}
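
As a brief check of eq.~(\ref{equil}), offered as a sketch rather than as the authors' own derivation, one can sum the master equations for $\mathsf{A}$ and $\mathsf{T}$. With the shorthand $q_{\mathsf{AT}}\equiv q_{\mathsf{A}}+q_{\mathsf{T}}$ (introduced here only for this check, so that $q_{\mathsf{G}}+q_{\mathsf{C}}=1-q_{\mathsf{AT}}$), the terms in $a$ cancel, $f$ does not appear, and
\begin{displaymath}
\dot{q}_{\mathsf{AT}}=(b+d)\,(1-q_{\mathsf{AT}})-(c+e)\,q_{\mathsf{AT}} .
\end{displaymath}
Setting the left-hand side to zero gives
\begin{displaymath}
q^{\infty}_{\mathsf{AT}}=\frac{b+d}{b+c+d+e}=2f_{1}, \qquad 1-q^{\infty}_{\mathsf{AT}}=\frac{c+e}{b+c+d+e}=2f_{2},
\end{displaymath}
and the equal sharing within each complementary pair then reproduces the two expressions above. The same equation shows that the $\mathsf{A}+\mathsf{T}$ content relaxes towards equilibrium at rate $b+c+d+e$.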
156: The intrinsic symmetry of the model is evident. In this framework, in other words, there is only \textbf{one} independent frequency, the
157: other being deduced by the normalization condition $2f_{1}+2f_{2}=1$.
158: We now stress the fact that this is valid in a single strand (\textit{type 2 parity rule} or \textit{PR2}). If \textit{PR1} is
159: satisfied, then as a consequence the frequency of a nucleotide in a strand must be equal to that of its complement in the same strand.
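
To illustrate eq.~(\ref{distance}) under \textit{PR1}, note that the total rates of change away from each nucleotide are $\mu_{\mathsf{A}}=\mu_{\mathsf{T}}=a+c+e$ and $\mu_{\mathsf{G}}=\mu_{\mathsf{C}}=b+d+f$. With the frequencies of eq.~(\ref{equil}), the distance becomes (written $D$ here, a symbol not used elsewhere in the text, only to avoid confusion with the rate $d$)
\begin{displaymath}
D = 2t\,\bigl[\,2f_{1}(a+c+e)+2f_{2}(b+d+f)\,\bigr]
  = 2t\,\frac{(b+d)(a+c+e)+(c+e)(b+d+f)}{b+c+d+e}.
\end{displaymath}
As a sanity check, setting all six rates equal to a common value $r$ gives $D=6rt$, which is also what eq.~(\ref{distance}) yields directly with $f_{i}=1/4$ and all twelve rates equal to $r$.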
160:
In the following we summarize some general results on the \textit{PR1} algebra, showing that, in many cases, it is not possible to
reconstruct the supposed underlying mutation pattern because the independent parameters outnumber the possible independent observations.
163:
164: \section{\label{pr1}Materials and Methods}
165: In this section we will give some results regarding the model introduced above, focusing on the number of actual independent
166: possible observations.
167:
168: \subsection{\label{general}General model}
169:
Given the substitution matrix $\mathsf{R}_{[4,4]}$, whose entries are the mutation rates per nucleotide per unit of time,
one can deduce the \textsl{evolutionary matrix} $\mathsf{P}_{[4,4]}(t)$,
whose entries $p_{ij}(t)$
represent the probability of finding base $i$ at a given site at time $t$, given base $j$ at $t=0$. The
\textsl{divergence matrix} $\mathsf{X}_{[4,4]}(t)$ can also be deduced; its entries $x_{ij}(t)$ are the joint probabilities of
finding, at time $t$, base $i$ at a given site of one sequence and base $j$ at the same site of the other sequence.
If the substitution pattern is the same for both sequences, then $x_{ij}(t)=x_{ji}(t)$.
177:
178: It is worth noting that the divergence matrix at initial time is nothing but the diagonal matrix with nucleotide
179: frequencies on the diagonal.
180:
The result of an evolutionary process can be represented compactly as an initial diagonal divergence matrix,
multiplied on the left and on the right by a certain number of substitution matrices (corresponding to the generation steps
in the two evolutionary lineages), producing a final matrix
184:
185: \begin{eqnarray}
186: \mathsf{X}(t) & = & \mathsf{R'_m}\cdots\mathsf{R'_2}\mathsf{R'_1}~\mathsf{X}(0)~\mathsf{R^{t}_1}\mathsf{R^{t}_2}\cdots\mathsf{R^{t}_n}\nonumber\\
187: \mathsf{X}(t) & = & \mathsf{P'}~\mathsf{X}(0)~\mathsf{P^{t}}\label{discretex}\\
188: x_{ij}(t) & = & \sum_{k=1}^{4}p'_{ik}(t)f_{k}p_{jk}(t)\nonumber
189: \end{eqnarray}
190: where the substitution matrices can, in principle, all be different.
191:
192: The entries of the divergence matrix are the experimentally observable quantities.
193:
194: In our case the substitution matrix is $\mathsf{R}_{[4,4]}$:
195:
196: \begin{displaymath}
197: \begin{array}{|c|c|c|c|c|}
198: \hline
199: \Rsh & \sf{A} & \sf{T} & \sf{G} & \sf{C}\\
200: \hline
201: \sf{A} & 1-a-c-e & a & c & e \\
202: \hline
203: \sf{T} & a & 1-a-e-c & e & c \\
204: \hline
205: \sf{G} & b & d & 1-b-d-f & f \\
206: \hline
207: \sf{C} & d & b & f & 1- d - b - f \\
208: \hline
209: \end{array}
210: \end{displaymath}
211:
obtained under the hypothesis of \textit{no strand-bias}, i.e. \textit{PR1}.
213:
214: \subsection{Non identifiability of some models}
In the following we show that the mathematical properties of the \textit{PR1} algebra are such that,
for the general model, the parameters to be estimated outnumber the possible independent observations, so that the model
is intractable.
As seen in eq.~(\ref{discretex}),
$$
\mathsf{X}(t) = \mathsf{P'}~\mathsf{X}(0)~\mathsf{P^{t}}.
$$
Several cases are possible, depending on whether $\mathsf{P'}=\mathsf{P}$ or not.
In the following, we will assume that $\mathsf{X}(0)$ is already at compositional equilibrium, i.e.
224: \begin{eqnarray}
225: q^{0}_{\sf{A}} = & q^{0}_{\sf{T}} = f_1 = & x_{AA}(t=0) = x_{TT}(t=0) \nonumber\\
226: \mbox{}\\
227: q^{0}_{\sf{C}} = & q^{0}_{\sf{G}} = f_2 = & x_{CC}(t=0) = x_{GG}(t=0) \nonumber
228: \end{eqnarray}
229:
230: \subsubsection{$\mathsf{P'}=\mathsf{P}$} \label{counting}
231: As $\mathsf{P'}=\mathsf{P}$ it is clear that $\mathsf{X}(t)$ is symmetric ($\mathsf{X}(t) = \mathsf{X^{t}}(t)$).
232: We have to estimate 6 parameters (6 mutation rates) and we have only 5 independent observations.
233: This happens because of the symmetry $x_{ij}=x_{ji}$,
234: the normalization conditions and because $x_{ij}=x_{\bar{\imath}\bar{\jmath}}$.
235: In more detail:
236: \begin{eqnarray}\label{4par1}
237: x_{AG} & = & x_{GA}=x_{TC}=x_{CT}\nonumber\\
238: x_{AC} & = & x_{CA}=x_{TG}=x_{GT}\nonumber\\
239: x_{AT} & = & x_{TA}\nonumber\\
240: x_{CG} & = & x_{GC}\nonumber\\
241: %\end{eqnarray}
242: %\begin{eqnarray}\label{S}
243: x_{AA} & = & x_{TT}\nonumber\\
244: x_{CC} & = & x_{GG}\nonumber
245: \end{eqnarray}
where $x_{AA} = x_{TT}$ and $x_{CC} = x_{GG}$ can be deduced from the other four using the normalization ($\sum_{j}x_{ij}=f_{i}$)
and the equilibrium frequencies.
We find that $x_{AG}$, $x_{AC}$, $x_{AT}$, $x_{CG}$ and one equilibrium frequency are the only independent observable quantities.
249:
250: \subsubsection{$\mathsf{P'} \neq \mathsf{P}$}
251:
In this case the number of mutation rates doubles to 12,
so there are 12 parameters to estimate. The number of independent observations, on the other hand, increases only to 7, because the
symmetry $x_{ij}=x_{ji}$ is lost. The model is still intractable.
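
The count of seven can be reproduced as follows; this is a sketch of the bookkeeping, under the assumption that only the complementarity symmetry $x_{ij}=x_{\bar{\imath}\bar{\jmath}}$ and the overall normalization constrain $\mathsf{X}$. The map $(i,j)\mapsto(\bar{\imath},\bar{\jmath})$ has no fixed point among the 16 entries, so it groups them into 8 pairs of equal entries:
\begin{eqnarray}
x_{AA} = x_{TT}, &\quad& x_{AT} = x_{TA},\nonumber\\
x_{AG} = x_{TC}, &\quad& x_{AC} = x_{TG},\nonumber\\
x_{GA} = x_{CT}, &\quad& x_{GT} = x_{CA},\nonumber\\
x_{GG} = x_{CC}, &\quad& x_{GC} = x_{CG}.\nonumber
\end{eqnarray}
The normalization $\sum_{i,j}x_{ij}=1$ removes one of these, leaving $8-1=7$ independent observations against 12 unknown rates.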
255:
256: \subsection{\label{case}Reversible \textit{PR1} model}
257: In this section we will deal with one of the previous models, the simplest one where $\mathsf{P'}=\mathsf{P}$.
258: In this case simple calculations lead to an analytical expression for the divergence matrix, but the model
259: remains intractable. Yet we will see that by the imposition of a certain property the model becomes tractable, and a way to
260: estimate the parameters for a real data set will be proposed.
261:
In the following we will again assume that the initial divergence matrix is already at compositional equilibrium.
Further, we will treat the evolutionary process as a continuous-time process, since the time elapsed since the divergence is very long.
This allows us to write the following equations to solve the problem.
265: The expression for the evolutionary matrix is
266: \begin{equation}
267: \mathsf{P}(t)=\exp\{\mathsf{R}t\};
268: \label{pdt}
269: \end{equation}
270: as it is the solution of the differential equations (see Rodriguez et al. \cite{ro90})
271:
272: \begin{eqnarray}
273: \frac{d\mathsf{P}(t)}{dt} & = & \mathsf{P}(t)\mathsf{R}\\
274: \frac{dp_{ij}(t)}{dt} & = & \sum_{k=1}^{4}p_{ik}(t)r_{kj}.
275: \label{dpdt}
276: \end{eqnarray}
277:
The divergence matrix, in turn, is given by
279:
280: \begin{eqnarray}
281: \mathsf{X}(t) & = & \mathsf{P'}(t)\mathsf{X}(t=0)\mathsf{P}^{T}(t)\\
282: x_{ij}(t) & = & \sum_{k=1}^{4}p'_{ik}(t)f_{k}p_{jk}(t);
283: \label{xdt}
284: \end{eqnarray}
285:
286: It is easily verified that, if $\mathsf{P'}=\mathsf{P}$, then $x_{ij}(t)=x_{ji}(t)$.
287:
288: Now, the expressions for $x_{ij}(t)$ (the observables) can be inverted to obtain the rates and then the distance.
289:
290: The strategy could be:
291: \begin{itemize}
292: \item solve the model, that is find the $x_{ij}(t)$ as a function of rates;
293: \item invert the above equations to get an expression for the rates;
294: \item substitute the observed quantities $\bar{x}_{ij}$ in order to have a numerical estimation of the rates;
295: \item use these estimates to obtain the distance.
296: \end{itemize}
297:
298: The expressions for $x_{ij}$ can be deduced in a manner analogous to that proposed by Takahata \& Kimura in 1981 \cite{tk81}
299: who deal with a slightly less general model than this (model \textbf{TK5} in fig.\ref{schema}).
300: In this way we get an expression for every entry of the divergence matrix, but with five
301: independent expressions, as stated above. We repeat here the reasons:
302: \begin{itemize}
303: \item the symmetry of the matrix $x_{ij}=x_{ji}$;
304: \item the intrinsic symmetry of the model $x_{ij}=x_{\bar{\imath}\bar{\jmath}}$;
305: \item the normalization conditions $\sum_{j}x_{ij}=f_{i}$.
306: \end{itemize}
307: Thus, we can write down the entire divergence matrix by means of the following quantities:
308:
309: \begin{eqnarray}\label{4par}
310: P~ & \equiv & x_{AG}=x_{GA}=x_{TC}=x_{CT}\nonumber\\
311: R~ & \equiv & x_{AC}=x_{CA}=x_{TG}=x_{GT}\nonumber\\
312: Q_{1} & \equiv & x_{AT}=x_{TA}\nonumber\\
313: Q_{2} & \equiv & x_{CG}=x_{GC},\nonumber\\
314: %\end{eqnarray}
315: %\begin{eqnarray}\label{S}
316: S_{1} & \equiv & x_{AA}=x_{TT}\nonumber\\
317: S_{2} & \equiv & x_{CC}=x_{GG}\nonumber
318: \end{eqnarray}
where, as stated above, $S_{1}$ and $S_{2}$ can be deduced from the other four using the normalization and the equilibrium frequencies.
We find that $P, R, Q_{1},Q_{2}$ and one equilibrium frequency are the only independent observable quantities.
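
Explicitly, with rows and columns ordered as $\mathsf{A},\mathsf{T},\mathsf{G},\mathsf{C}$, the definitions above correspond to the following layout (a direct transcription of the listed symmetries, included for convenience):
\begin{displaymath}
\mathsf{X}=\left(
\begin{array}{cccc}
S_{1} & Q_{1} & P & R\\
Q_{1} & S_{1} & R & P\\
P & R & S_{2} & Q_{2}\\
R & P & Q_{2} & S_{2}
\end{array}
\right).
\end{displaymath}
The row sums $\sum_{j}x_{ij}=f_{i}$ then read
\begin{displaymath}
S_{1}=f_{1}-Q_{1}-P-R, \qquad S_{2}=f_{2}-Q_{2}-P-R,
\end{displaymath}
and, since $2f_{1}+2f_{2}=1$, the six quantities satisfy $2S_{1}+2S_{2}+2Q_{1}+2Q_{2}+4P+4R=1$, which leaves five of them independent.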
321:
322:
323: \subsection{Solution of the model}
324: Deriving an analytical expression for the divergence matrix is quite an easy task following \cite{tk81}.
325: Let's consider for example the element $x_{\mathsf{AC}}$; its derivative will be
326: \begin{equation}
327: \frac{dx_{\mathsf{AC}}}{dt}=\frac{d(q_{\mathsf{A}} q_{\mathsf{C}})}{dt}=q_{\mathsf{C}}\dot q_{\mathsf{A}} +q_{\mathsf{A}}\dot q_{\mathsf{C}}.\label{dt}
328: \end{equation}
A brief explanation is in order.
We assumed that the two lineages are at compositional equilibrium at the initial time,
so one might naturally conclude that $\dot q_{i} = 0$ and hence that the derivative above vanishes.
However, being at compositional equilibrium means that, \textbf{sampling the whole sequence under consideration},
the nucleotide frequencies $f_i$ do not change (apart from finite-size fluctuations). It does not mean that
no mutation occurs at any site; had this been the case, there would be no evolution to study.
The probability for each nucleotide to mutate into another is given by the master equation, and this is why we
can write $x_{ij}$ as $q_i$ times $q_j$, take the derivative, and re-express the result in terms of other $q_i q_j$ products,
i.e. other $\mathsf{X}$ entries.
338:
For example, the derivative of $q_{\mathsf{A}}$ is
340: \begin{eqnarray}\label{dotadotc}
341: \dot q_{\mathsf{A}}=(dq_{\mathsf{C}}+bq_{\mathsf{G}}+aq_{\mathsf{T}})-(a+c+e)q_{\mathsf{A}}.
342: \end{eqnarray}
343: Substituting this and the analogue for $\dot q_{\mathsf{C}}$ in eq.(\ref{dt}) and doing the same for all $\mathsf{X}$ entries we obtain a
344: set of linear coupled first order differential equations which can be diagonalized and solved.
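
As a worked instance of this substitution (written out here under the conventions above, not quoted from \cite{tk81}), the master equation for $\mathsf{C}$ reads $\dot q_{\mathsf{C}}=(eq_{\mathsf{A}}+cq_{\mathsf{T}}+fq_{\mathsf{G}})-(b+d+f)q_{\mathsf{C}}$. Inserting it together with eq.~(\ref{dotadotc}) into eq.~(\ref{dt}), and using the quantities $P, R, Q_{1}, Q_{2}, S_{1}, S_{2}$ defined above, gives
\begin{displaymath}
\frac{dR}{dt}=eS_{1}+cQ_{1}+dS_{2}+bQ_{2}+(a+f)P-(a+b+c+d+e+f)R .
\end{displaymath}
The analogous calculation for $x_{\mathsf{AG}}$ gives
\begin{displaymath}
\frac{dP}{dt}=cS_{1}+eQ_{1}+bS_{2}+dQ_{2}+(a+f)R-(a+b+c+d+e+f)P .
\end{displaymath}
Each derivative is a linear combination of entries of $\mathsf{X}$, which is why the resulting system is linear. Subtracting the two equations, the combination $P-R$ decays at rate $2a+b+c+d+e+2f$, apart from source terms proportional to $S_{1}-Q_{1}$ and $S_{2}-Q_{2}$.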
345:
More details on the derivation are given in appendix~\ref{app1}.
347:
348: \subsection{Reversibility}
349:
So far we have stated that it is possible to write the divergence matrix for this model, but this would be of no use because we could
never invert five expressions to obtain six independent rates as functions of the matrix entries. What can be done is to reduce
the number of independent parameters by adding a relation between them. Many choices are possible. One, following \cite{tk81}, is $a=f$.
Another possible choice is to make the model time reversible. Recall that time reversibility is satisfied when
\begin{equation}
p_{ij}f_{j}=p_{ji}f_{i} \qquad \forall i,j,
\end{equation}
where $p_{ij}$ are the entries of the evolutionary matrix and $f_{i}$ the equilibrium frequencies.
358: It is possible to demonstrate that this property is equivalent to the \textit{detailed balance} (see appendix \ref{app2}) which reads
359: \begin{equation}
360: r_{ij}f_{j}=r_{ji}f_{i} \qquad \forall i,j.
361: \end{equation}
362: In our model detailed balance holds if and only if
363: \begin{equation}
364: be=cd.
365: \end{equation}
This can be deduced by inspecting the expressions for the equilibrium frequencies, or by a simpler rule \cite{luca},
reported here in appendix~\ref{app3}.
A general version of the reversible model has been studied by Yang \cite{ya94}, who pointed out its ability to fit the data better than
other models. Gu and Li \cite{gu96} have shown its robustness against violation of time reversibility.
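
The condition $be=cd$ can be verified directly from the equilibrium frequencies of eq.~(\ref{equil}); the short check below is spelled out for convenience. For the pairs $(\mathsf{A},\mathsf{T})$ and $(\mathsf{C},\mathsf{G})$ detailed balance holds identically, since both the rates and the frequencies coincide. For the pair $(\mathsf{A},\mathsf{G})$ it requires
\begin{displaymath}
r_{\mathsf{AG}}f_{\mathsf{G}}=r_{\mathsf{GA}}f_{\mathsf{A}}
\;\Longleftrightarrow\; b\,f_{2}=c\,f_{1}
\;\Longleftrightarrow\; b\,(c+e)=c\,(b+d)
\;\Longleftrightarrow\; be=cd ,
\end{displaymath}
and for the pair $(\mathsf{A},\mathsf{C})$
\begin{displaymath}
r_{\mathsf{AC}}f_{\mathsf{C}}=r_{\mathsf{CA}}f_{\mathsf{A}}
\;\Longleftrightarrow\; d\,f_{2}=e\,f_{1}
\;\Longleftrightarrow\; d\,(c+e)=e\,(b+d)
\;\Longleftrightarrow\; cd=be .
\end{displaymath}
The remaining pairs $(\mathsf{T},\mathsf{C})$ and $(\mathsf{T},\mathsf{G})$ give the same two equations by complementarity, so a single relation among the six rates is imposed and five free parameters remain.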
370:
371:
372:
373: \section{\label{res}Results and discussion}
374:
375: \subsection{Estimation of the substitution rates}
Due to the complexity of the expressions arising from this model, it is unlikely that they can be
inverted analytically to express the rates as functions of the observables. We therefore chose a statistical
approach to this inversion, based on a $\chi^2$ criterion. We write the $\chi^2$ as
379: \begin{equation}
380: \chi^2 = \sum_{i,j} \frac{(\bar{x}_{i,j}-x_{i,j})^2}{\bar{x}_{i,j}} = \sum_{i,j} \frac{x_{i,j}^2}{\bar{x}_{i,j}} - 1.
381: \end{equation}
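
The second equality relies on both matrices being normalized, $\sum_{i,j}\bar{x}_{i,j}=\sum_{i,j}x_{i,j}=1$, an assumption that is implicit in the formula. Expanding the square makes the step explicit:
\begin{eqnarray}
\sum_{i,j} \frac{(\bar{x}_{i,j}-x_{i,j})^2}{\bar{x}_{i,j}}
 & = & \sum_{i,j}\bar{x}_{i,j} - 2\sum_{i,j}x_{i,j} + \sum_{i,j}\frac{x_{i,j}^2}{\bar{x}_{i,j}}\nonumber\\
 & = & 1 - 2 + \sum_{i,j}\frac{x_{i,j}^2}{\bar{x}_{i,j}}
 \; = \; \sum_{i,j}\frac{x_{i,j}^2}{\bar{x}_{i,j}} - 1 .\nonumber
\end{eqnarray}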
382:
It is easily seen that this quantity is always non-negative, being zero when $\bar{x}_{i,j}=x_{i,j}$, i.e. when the model perfectly
fits the observations. Minimizing it therefore amounts to searching for the best parameters.
In this context, trying to minimize the $\chi^2$ as a function of six parameters would fail completely: the algorithm would
wander among the infinitely many equivalent solutions. Enforcing reversibility makes the estimation possible, as shown
below.
388:
389:
390: \subsection{A realistic example}
391:
392: As an application, we started from the multiple alignment of rRNA sequences
393: used in \cite{Gouy89}. The observed divergence
394: matrix (unnormalized) between Xenopus and Homo is reported here below.
395: \mbox{}
396: \newline
397: % --- Table of observed nucleotide differences ---
398: \begin{center}
399: \begin{tabular}{c c c c c c}
400: \hline \hline
401: & & \multicolumn{4}{c}{ Xenopus }\\
402: \hline
403: & & A & T & G & C\\
404: & A & 647 & 1 & 17 & 2 \\
405: Homo & T & 3 & 523 & 11 & 18 \\
406: & G & 17 & 9 & 903 & 28 \\
407: & C & 8 & 21 & 25 & 691 \\
408: \hline \hline
409: \end{tabular}
410: \end{center}
411: \mbox{}
412: \newline
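
For illustration, the table can be reduced to the observables of the model as follows. This is one natural symmetrization, assumed here for the sake of the example and not necessarily the procedure actually used for the fit. The total number of compared sites is $N=2924$, and averaging the counts over each class of entries that the model treats as equal gives
\begin{eqnarray}
\bar{P} & = & \frac{17+17+18+21}{4N}=\frac{73}{11696}\approx 0.00624\nonumber\\
\bar{R} & = & \frac{2+8+11+9}{4N}=\frac{30}{11696}\approx 0.00256\nonumber\\
\bar{Q}_{1} & = & \frac{1+3}{2N}=\frac{4}{5848}\approx 0.000684\nonumber\\
\bar{Q}_{2} & = & \frac{28+25}{2N}=\frac{53}{5848}\approx 0.00906\nonumber\\
\bar{S}_{1} & = & \frac{647+523}{2N}=\frac{1170}{5848}\approx 0.200\nonumber\\
\bar{S}_{2} & = & \frac{903+691}{2N}=\frac{1594}{5848}\approx 0.273 .\nonumber
\end{eqnarray}
These values satisfy $2\bar{S}_{1}+2\bar{S}_{2}+2\bar{Q}_{1}+2\bar{Q}_{2}+4\bar{P}+4\bar{R}=1$, and the corresponding $\mathsf{A}+\mathsf{T}$ content is $2(\bar{S}_{1}+\bar{Q}_{1}+\bar{P}+\bar{R})=2451/5848\approx 0.419$.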
413:
By varying the parameter values over six orders of magnitude we found that the $\chi^2$ criterion was well behaved, with a single global minimum
(fig.~\ref{paramfig}).
A systematic exploration of all possible pairs of parameters showed that there were no strong structural correlations between parameters,
except between $b$ and $c$ (fig.~\ref{pairsfig}). As a consequence, parameter values are easily estimated using standard nonlinear minimization tools
(note that it is advisable to enforce parameter positivity during optimisation). This example shows that the parameters can be estimated
in practice from a realistically sized dataset.
420:
\begin{figure}[htbp]
422: \begin{center}
423: \includegraphics[width=7cm]{figparatonce}
424: \caption{\footnotesize{$\chi^2$ shaped as a minimum over 6 orders of magnitude.}}\label{paramfig}
425: \end{center}
426: \end{figure}
427:
428:
\begin{figure}[htbp]
430: \begin{center}
431: \includegraphics[width=7cm]{pairsbc}
432: \caption{\footnotesize{Near the optimal values for the parameters, only $b$ and $c$ show a structural correlation.}}\label{pairsfig}
433: \end{center}
434: \end{figure}
435:
436:
437:
438: \subsection{Discussion}
439:
The most general model of evolution at the DNA level has 12 parameters, which is
too many for practical purposes. If we try to simplify it by forcing some parameters
to be equal, the number of possible sub-models increases rapidly, because this can be done in many ways.
At the opposite extreme we find the only model which requires all the parameters to be equal (JC).

Clearly, the models published in the literature do not cover all possible ones, and only those
supported by some biological or mathematical justification have been explored.
447:
448: Under {\it PR1 hypothesis}, we are dealing with {\it no strand-bias} models
449: whose most general form has 6 parameters.
450: We do not claim that models of this class are the best in any way, but that they are an interesting starting point.
451: An important property of these models is their convergence
452: towards {\it PR2 state} even if substitution rates are modified during
453: the course of evolution \cite{lolo99}. {\it PR2 state} is a strong assumption and strand asymmetry has been observed in
454: many cases. But, as {\it PR2} is usually
455: observed at a genome scale level \cite{lo95}, the hope is that, {\it on average},
456: with local deviations from {\it PR1 hypothesis} canceling out, this
457: class of model is not too bad an approximation.
The {\bf biological} motivation leading to the {\it no strand-bias}
models has an important {\bf mathematical} consequence:
even if it is biologically reasonable to study these models, one must be aware
that the symmetry involved inexorably reduces the number of independent observations,
making the model mathematically intractable.
463:
464: \subsection{Conclusion}
As shown in section~\ref{counting} by comparing the number of unknowns
with the number of possible independent observations, there is definitely no
hope of estimating the 6 parameters of the general form of the
{\it no strand-bias} model from pairwise DNA sequence comparisons.
There is no unique solution to a system of $M$ equations in $N > M$ unknowns:
in our case there are infinitely many ways to choose the six rates $a, b, c, d, e, f$ so as to satisfy the
five independent equations defining the matrix $\mathsf{X}$.
This result is extremely
unpleasant because it corresponds to the most common situation with
experimental
data, which come from present-day DNA: fossil DNA data are scarce and from a relatively
recent past. We clearly need further simplifications.
477:
We have exhibited here an example of a model, denoted RNSB in fig.~\ref{schema}, that
combines the properties of reversible models and {\it no strand-bias} models.
It is important to note that this model still has 5 free parameters: if the intersection
between the reversible model class and the {\it no strand-bias} class
contained only, say,
3-parameter models, there would not have been much flexibility left for further
research. We do not claim that this new RNSB model is the best intersection between
the two classes. We only claim that the RNSB model proves that it is possible to combine them
with 5 free parameters, so that there is no bottleneck here for further theoretical
work on the parametric forms for this class of DNA substitution models.
488:
489:
490: \subsection*{Acknowledgements}
This contribution partly derives from the thesis that OZ presented at Naples University in October 2002.
The authors warmly thank Prof. Luca Peliti for putting them in contact during the \textsl{strapp 04} meeting (Dresden, Germany, July 5-10 2004).
OZ also thanks him for introducing him to the beauties of biological systems.
They thank Manolo Gouy for kindly providing the multiple alignment of rRNA sequences and for many constructive suggestions.
The manuscript was also improved thanks to the comments of three anonymous reviewers.
496:
497: % The Appendices part is started with the command \appendix;
498: % appendix sections are then done as normal sections
499: % \appendix
500:
501: % \section{}
502: % \label{}
503: \appendix
504: \section{Derivation of the divergence matrix}
505: \label{app1}
506:
507: In order to obtain the expressions for the divergence matrix we define (following the notation introduced above)
508: \begin{eqnarray}\label{xyz}
509: X_{\pm}& \equiv 2S_{1} \pm 2Q_{1}\nonumber\\
510: Y_{\pm}& \equiv 2S_{2} \pm 2Q_{2}\\
511: Z_{\pm}& \equiv 4P \pm 4R.\nonumber
512: \end{eqnarray}
513:
514: These expressions reduce the problem to six first order ordinary coupled differential equations. This system is block-diagonal,
515: can easily be inverted and its solution is:
516: \begin{eqnarray}\label{xyz+}
517: X_{+}&=&\omega[\omega+(1-\omega)e^{\lambda_{0}t}] \nonumber\\
518: Y_{+}&=&(1-\omega)(1-\omega+\omega e^{\lambda_{0}t})\\
519: Z_{+}&=&2\omega(1-\omega)(1-e^{\lambda_{0}t})\nonumber
520: \end{eqnarray}
521:
522: and
523:
524: \begin{eqnarray}\label{xyz-}
525: X_{-}&=&\frac{1}{g^{2}}\{2\beta[\alpha\omega-\beta(1-\omega)]e^{\lambda_{1}t}+\nonumber\\
526: {}&{}& +[\zeta\omega+\beta^{2}(1-\omega)]e^{\lambda_{2}t}+\nonumber\\
527: {}&{}& +[\eta\omega+\beta^{2}(1-\omega)]e^{\lambda_{3}t}\}\nonumber\\
528: Y_{-}&=&\frac{1}{g^{2}}\{-2\alpha[\alpha\omega-\beta(1-\omega)]e^{\lambda_{1}t}+\nonumber\\
529: {}&{}&+[\alpha^{2}\omega+\eta(1-\omega)]e^{\lambda_{2}t}+\nonumber\\
530: {}&{}& +[\alpha^{2}\omega+\zeta(1-\omega)]e^{\lambda_{3}t}\}\\
531: Z_{-}&=&\frac{1}{g^{2}}\{-2(\delta-\gamma)[\alpha\omega-\beta(1-\omega)]e^{\lambda_{1}t}+\nonumber\\
532: {}&{}& +[\alpha(\delta-\gamma+g)\omega-\beta(\delta-\gamma-g)(1-\omega)]e^{\lambda_{2}t}+\nonumber\\
533: {}&{}& +[\alpha(\delta-\gamma-g)\omega-\beta(\delta-\gamma+g)(1-\omega)]e^{\lambda_{3}t}\}\nonumber
534: \end{eqnarray}
535:
536: where
537:
538: \begin{eqnarray}\label{ab-}
539: \alpha & \equiv & c-e\nonumber\\
540: \beta & \equiv & b-d\nonumber\\
541: \gamma & \equiv & 2a+c+e\nonumber\\
542: \delta & \equiv & b+d+2f\nonumber\\
543: \omega & \equiv & 2f_{1}=2f_{A}=2f_{T}\nonumber\\
544: \lambda_{0} & \equiv & -2(b+c+d+e)\nonumber\\
545: \lambda_{1} & \equiv & -(2a+b+c+d+e+2f)\nonumber\\
546: \lambda_{2} & \equiv & \lambda_{1}+g\nonumber\\
547: \lambda_{3} & \equiv & \lambda_{1}-g\nonumber\\
548: g & \equiv & \sqrt{(\delta-\gamma)^{2}+4\alpha\beta}\nonumber\\
549: \zeta & \equiv & \frac{1}{2}(\delta-\gamma)(\delta-\gamma+g)+\alpha\beta\nonumber\\
550: \eta & \equiv & \frac{1}{2}(\delta-\gamma)(\delta-\gamma-g)+\alpha\beta\nonumber
551: \end{eqnarray}
552:
Combining all these, the entries of the divergence matrix are obtained.
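
Explicitly, inverting the definitions (\ref{xyz}) gives
\begin{eqnarray}
S_{1}=\frac{X_{+}+X_{-}}{4}, &\quad& Q_{1}=\frac{X_{+}-X_{-}}{4},\nonumber\\
S_{2}=\frac{Y_{+}+Y_{-}}{4}, &\quad& Q_{2}=\frac{Y_{+}-Y_{-}}{4},\nonumber\\
P=\frac{Z_{+}+Z_{-}}{8}, &\quad& R=\frac{Z_{+}-Z_{-}}{8}.\nonumber
\end{eqnarray}
Two limits provide a quick consistency check of the solution; they are added here as a verification and are not part of the derivation itself. At $t=0$, using $\zeta+\eta+2\alpha\beta=g^{2}$, one finds $X_{+}=X_{-}=\omega$, $Y_{+}=Y_{-}=1-\omega$ and $Z_{+}=Z_{-}=0$, so that $S_{1}=f_{1}$, $S_{2}=f_{2}$ and all off-diagonal entries vanish, as required for a diagonal $\mathsf{X}(0)$. For $t\to\infty$, provided all the exponents $\lambda_{0},\ldots,\lambda_{3}$ are negative, $X_{-}$, $Y_{-}$ and $Z_{-}$ vanish while $X_{+}\to\omega^{2}$, $Y_{+}\to(1-\omega)^{2}$ and $Z_{+}\to 2\omega(1-\omega)$, so that $x_{ij}\to f_{i}f_{j}$: the two sequences become statistically independent.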
554:
555:
556: \section{Reversibility and detailed balance}
557: \label{app2}
558: We will show here the equivalence between time reversibility and detailed balance.
559:
560: \subsection{DETAILED BALANCE $\Rightarrow $ TIME REVERSIBILITY}
Recall that
562: $$
563: \mathsf{P}(t)=\exp\{\mathsf{R}t\},
564: $$
which can be expanded as
566: \begin{equation}
567: \mathsf{P}(t)=\mathbb{I} +\mathsf{R}t + \frac{1}{2} \mathsf{R}^{2}t^{2} + \cdots,
568: \end{equation}
569: or
570: \begin{equation}
571: \label{ij}
572: p_{ij}=\delta_{ij} + r_{ij}t + \frac{1}{2} \sum_k r_{ik}r_{kj}t^{2} + \cdots
573: \end{equation}
574:
575: %\begin{equation}
576: %\label{ji}
577: %p_{ji}=\delta_{ji} + r_{ji}t + \frac{1}{2} r_{jk}r_{ki}t^{2} + \cdots.
578: %\end{equation}
579:
580: Equation (\ref{ij}) can be also written as:
581: \begin{eqnarray}
582: p_{ij}&=&\delta_{ij}+\nonumber\\
583: {}&{}&+ \sum_{n=1}^{\infty}\frac{s_{ij}^{(n)}}{n!}t^n,
584: \end{eqnarray}
585: where
586: \begin{eqnarray}
587: s_{ij}^{(n)}&= &\sum_{k_{1}k_{2}\cdots k_{n-1}}r_{i,k_{1}}r_{k_{1},k_{2}}\cdots r_{k_{n-2},k_{n-1}} r_{k_{n-1},j}\nonumber\\
588: {}&{}&\quad \textrm{for}~ n\geq 2\nonumber\\
589: {}&{}&{}\\
590: s_{ij}^{(n)}&= &r_{ij}, \qquad \qquad \qquad \qquad \textrm{for}~ n=1. \nonumber
591: \end{eqnarray}
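For concreteness, the lowest orders read (this merely spells out the definition above):
$$
s_{ij}^{(2)}=\sum_{k}r_{ik}r_{kj},\qquad
s_{ij}^{(3)}=\sum_{k_{1}k_{2}}r_{i,k_{1}}r_{k_{1},k_{2}}r_{k_{2},j},
$$
that is, $s_{ij}^{(n)}$ is the $(i,j)$ entry of the matrix power $\mathsf{R}^{n}$, and the $n=2$ term reproduces the quadratic term of (\ref{ij}).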
592: Now we will show that, if detailed balance is satisfied, then
593: \begin{equation}\label{s=}
594: s_{ij}^{(n)}f_{j}=s_{ji}^{(n)}f_{i}, \qquad \forall i,j,n.
595: \end{equation}
In fact, exploiting detailed balance, $r_{ij}f_{j}=r_{ji}f_{i}$, one factor at a time,
597: \begin{eqnarray}
598: s_{ij}^{(n)}f_{j}=\sum_{k_{1}\cdots k_{n-1}}r_{i,k_{1}}\cdots r_{k_{n-1},j}f_{j}
599: \end{eqnarray}
600: becomes
601: \begin{eqnarray}
602: {}&\sum_{k_{1}\cdots k_{n-1}}r_{i,k_{1}}\cdots r_{j,k_{n-1}}f_{k_{n-1}}=\nonumber\\
603: =&\sum_{k_{1}\cdots k_{n-1}}r_{i,k_{1}}\cdots r_{k_{n-1},k_{n-2}}r_{j,k_{n-1}}f_{k_{n-2}}=\cdots\nonumber
604: \end{eqnarray}
605: and finally
606: \begin{equation}
607: \cdots=\sum_{k_{1}\cdots k_{n-1}}r_{k_{1},i}r_{k_2,k_1}\cdots r_{j,k_{n-1}}f_{i}.
608: \end{equation}
609: Reordering all the factors
610: \begin{eqnarray}\label{last}
611: \sum_{k_{1}\cdots k_{n-1}}r_{k_{1},i}r_{k_2,k_1}\cdots r_{j,k_{n-1}}f_{i}=\nonumber\\
612: \sum_{k_{1}\cdots k_{n-1}}r_{j,k_{n-1}}r_{k_{n-1},k_{n-2}}r_{k_{n-2},k_{n-3}}\cdots r_{k_1,i}f_{i} .
613: \end{eqnarray}
Since the indices $k_{1}, \ldots, k_{n-1}$ are summed over, they can be relabelled freely, and
the expression in (\ref{last}) is equal to $s_{ji}^{(n)}f_{i}$ for all $n \geq 2$.
Thus (\ref{s=}) holds for $n > 1$, and it is evident for $n=1$. Further, as
$\delta_{ij}f_{j}=\delta_{ji}f_{i}$, summing the series term by term we obtain $p_{ij}f_{j}=p_{ji}f_{i}$. \textit{Q. E. D.}
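
As an illustration, the case $n=2$ can be written out in full; each step uses detailed balance, $r_{ij}f_{j}=r_{ji}f_{i}$, once:
\begin{eqnarray}
s_{ij}^{(2)}f_{j}&=&\sum_{k}r_{ik}r_{kj}f_{j}=\sum_{k}r_{ik}r_{jk}f_{k}=\nonumber\\
{}&=&\sum_{k}r_{ki}r_{jk}f_{i}=\sum_{k}r_{jk}r_{ki}f_{i}=s_{ji}^{(2)}f_{i}.\nonumber
\end{eqnarray}
The same argument can be phrased in matrix notation. Introducing, as a shorthand not used elsewhere in this paper, the diagonal matrix $\mathsf{F}$ with entries $f_{i}\delta_{ij}$, detailed balance states that $\mathsf{R}\mathsf{F}$ is symmetric, i.e. $\mathsf{R}\mathsf{F}=\mathsf{F}\mathsf{R}^{T}$. Applying this relation $n$ times gives $\mathsf{R}^{n}\mathsf{F}=\mathsf{F}(\mathsf{R}^{T})^{n}=(\mathsf{R}^{n}\mathsf{F})^{T}$, whose $(i,j)$ entry is $s_{ij}^{(n)}f_{j}=s_{ji}^{(n)}f_{i}$.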
618:
619:
620:
621: \subsection{DETAILED BALANCE $\Leftarrow $ TIME REVERSIBILITY}
Recall the formula
623: \begin{equation}\label{aga}
624: \frac{d\mathsf{P}(t)}{dt}=\mathsf{P}(t)\mathsf{R}; \quad \frac{dp_{ij}(t)}{dt}=\sum_{k}p_{ik}(t)r_{kj}.
625: \end{equation}
We compute the time derivative of $p_{ij}f_{j}$; if time reversibility holds, it is equal to the time derivative
of $p_{ji}f_{i}$.
From (\ref{aga}), as the equilibrium frequencies do not depend on time,
629: \begin{equation}\label{dpdt2}
630: \frac{d}{dt}(p_{ij}(t)f_{j})=f_{j}\frac{dp_{ij}(t)}{dt}=\sum_{k}p_{ik}(t)r_{kj}f_{j}.
631: \end{equation}
632: But
633: $$
634: \frac{dp_{ij}(t)}{dt}=\sum_{k}r_{ik}p_{kj}(t),
635: $$
as $\mathsf{P}$ and $\mathsf{R}$ commute (evident from the solution $\mathsf{P}(t)=\exp\{\mathsf{R}t\}$).
Hence the last expression in (\ref{dpdt2}) can be written as
638: \begin{equation}\label{dpdt3}
639: \sum_{k}p_{ik}(t)r_{kj}f_{j}=\sum_{k}r_{ik}p_{kj}(t)f_{j}.
640: \end{equation}
Because of time reversibility, $p_{kj}(t)f_{j}=p_{jk}(t)f_{k}$, the last expression in (\ref{dpdt3}) becomes
642: \begin{equation}\label{dpdt4}
643: \sum_{k}r_{ik}p_{kj}(t)f_{j}=\sum_{k}r_{ik}p_{jk}(t)f_{k}.
644: \end{equation}
On the other hand,
646: \begin{equation}\label{dpdt5}
647: \frac{d}{dt}(p_{ji}(t)f_{i})=f_{i}\frac{dp_{ji}(t)}{dt}=\sum_{k}p_{jk}(t)r_{ki}f_{i}.
648: \end{equation}
Subtracting (\ref{dpdt5}) from (\ref{dpdt4}), which are equal, and factoring out $p_{jk}(t)$, we finally obtain
650: \begin{equation}\label{dpdt6}
651: \sum_{k}p_{jk}(t)(r_{ik}f_{k}-r_{ki}f_{i})=0,
652: \end{equation}
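which holds for all $t$. In particular, at $t=0$, where $p_{jk}(0)=\delta_{jk}$, it reduces to $r_{ij}f_{j}=r_{ji}f_{i}$, and detailed balance is satisfied. \textit{Q. E. D.}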
654:
655: \section{Detailed balance: simple check}
656: \label{app3}
A convenient property of detailed balance is that there is a simple way to check whether it holds,
even without calculating the equilibrium frequencies.
So far we have seen that detailed balance is fulfilled when the equilibrium frequencies and the mutation rates (on which the
former depend) cancel term by term in the master equations.
661:
Another way to check detailed balance is to consider three states of the system and the rates connecting them. If the product of the
three rates that lead from a state back to itself ``clockwise'' is equal to the product taken ``counter-clockwise'', then detailed balance
holds. For three states $i, j, k$ this property reads
665: $$
666: r_{ik}r_{kj}r_{ji}=r_{ij}r_{jk}r_{ki}.
667: $$
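To see why this condition is necessary, one may multiply the three detailed balance relations for the pairs $(i,j)$, $(j,k)$ and $(k,i)$, assuming strictly positive equilibrium frequencies:
$$
(r_{ij}f_{j})(r_{jk}f_{k})(r_{ki}f_{i})=(r_{ji}f_{i})(r_{kj}f_{j})(r_{ik}f_{k}).
$$
Dividing both sides by $f_{i}f_{j}f_{k}$ gives $r_{ij}r_{jk}r_{ki}=r_{ik}r_{kj}r_{ji}$, in which the frequencies no longer appear. For the four nucleotides there are four such triples to check, namely $\{A,C,G\}$, $\{A,C,T\}$, $\{A,G,T\}$ and $\{C,G,T\}$.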
668:
669:
670: \begin{thebibliography}{20}
671:
672: \bibitem{ki68} Kimura M., 1968. Evolutionary rate at the molecular level. \textit{Nature} \textbf{217}, 624-626.
673:
674: \bibitem{zh94} Zharkikh A., 1994. Estimation of evolutionary distances between nucleotide sequences.
675: \textit{J. of Mol. Evol.} \textbf{39}, 315-329.
676:
677: \bibitem{su95} Sueoka N., 1995. Intrastrand parity rules of DNA base composition and usage biases of synonymous codons.
678: \textit{J. of Mol. Evol.} \textbf{40}, 318-325.
\emph{Errata} \textbf{42}, 323.
680:
681: \bibitem{lo95} Lobry J. R., 1995. Properties of a general model of DNA evolution under no-strand bias conditions.
682: \textit{J. of Mol. Evol.} \textbf{40}, 326-330.
683: \emph{Errata} \textbf{41}, 680.
684:
685: \bibitem{ro90} Rodriguez F., Oliver J. L., Mar\'\i n A., Medina J. R., 1990. The general stochastic model of nucleotide substitution.
686: \textit{J. of Theor. Biol.} \textbf{142}, 485-501.
687:
688: \bibitem{tk81} Takahata N., Kimura M., 1981. A model of evolutionary base substitution and its application with special reference to
689: rapid changes of pseudogenes. \textit{Genetics} \textbf{98}, 641-657.
690:
\bibitem{luca} Peliti L., 2003. \textit{Appunti di meccanica statistica}.
Bollati Boringhieri, Torino, Italy.
693:
694: \bibitem{ya94} Yang Z., 1994. Estimating the pattern of nucleotide substitution.
695: \textit{J. of Mol. Evol.} \textbf{39}, 105-111.
696:
\bibitem{gu96} Gu X., Li W.-H., 1996. A general additive distance with time-reversibility and rate variation among nucleotide sites.
698: \textit{Proc. Natl. Acad. Sci. USA} \textbf{93}, 4671-4676.
699:
700: \bibitem{Gouy89} Gouy M., Li W.-H., 1989. Phylogenetic analysis based on rRNA sequences supports the archaebacterial tree rather than the eocyte tree.
701: \textit{Nature} \textbf{339}, 145-147.
702:
703: \bibitem{lolo99} Lobry J. R., Lobry C., 1999. Evolution of DNA base composition under no-strand-bias conditions when the substitution rates
704: are not constant.
705: \textit{Mol. Biol. Evol.} \textbf{16}, 719-723.
706:
720:
721: \end{thebibliography}
722:
723: \end{document}
724: