1:
2: \documentclass[letterpaper, 11pt]{article}
3: \usepackage{amsmath}
4: \usepackage{amsfonts}
5: \usepackage{amsthm}
6: \usepackage{amssymb}
7: \usepackage{mathrsfs}
8: \usepackage[hang]{subfigure}
9: \usepackage{graphicx,epsfig,fancyheadings,wasysym,psfrag}
10:
11: \usepackage{times}
12:
13: \pagestyle{fancy}
14:
15: \rhead[\thepage]{\thepage}
16: \cfoot{}
17: \usepackage{natbib}
18: \bibliographystyle{apalike}
19:
20: \DeclareMathOperator{\var}{var}
21: \DeclareMathOperator{\cov}{cov}
22: \DeclareMathOperator{\corr}{corr}
23:
24: \newcommand{\al}{\alpha}
25: \newcommand{\be}{\beta}
26: \newcommand{\de}{\delta}
27: \newcommand{\e}{\epsilon}
28: \newcommand{\g}{\gamma}
29: \newcommand{\ka}{\kappa}
30: \newcommand{\la}{\lambda}
31: \newcommand{\sig}{\sigma}
32:
33: \newcommand{\bR}{{\mathbb R}}
34: \newcommand{\bZ}{{\mathbb Z}}
35: \newcommand{\bQ}{{\mathbb Q}}
36: \newcommand{\bT}{{\mathbb T}}
37: \newcommand{\cC}{{\mathcal C}}
38: \newcommand{\cA}{{\mathcal A}}
39: \newcommand{\cF}{{\mathcal F}}
40: \newcommand{\cG}{{\mathcal G}}
41: \newcommand{\cH}{\mathcal H}
42: \newcommand{\cU}{\mathcal U}
43: \newcommand{\cY}{{\mathcal Y}}
44: \newcommand{\cZ}{{\mathcal Z}}
45: \newcommand{\cR}{{\mathcal R}}
46: \newcommand{\cL}{{\mathcal L}}
47: \newcommand{\cN}{{\mathcal N}}
48: \newcommand{\cV}{{\mathcal V}}
49: \newcommand{\cW}{{\mathcal W}}
50: \newcommand{\cM}{{\mathcal M}}
51: \newcommand{\cO}{{\mathcal O}}
52: \newcommand{\cP}{{\mathcal P}}
53: \newcommand{\cT}{{\mathcal T}}
54: \newcommand{\cB}{{\mathcal B}}
55: \newcommand{\cS}{{\mathcal S}}
56: \newcommand{\cE}{{\mathcal E}}
57:
58:
59: \newcommand{ \dist}{\mathrm{dist}}
60: \newcommand{ \co}{\mathrm{co}}
61: \newcommand{ \xor}{{\,\mathrm{xor}\,}}
62: \newcommand{\conn}{\leftrightsquigarrow}
63: \newcommand{\notconn}{{\,\,\leftrightsquigarrow\!\!\!\!\!\!\!\!/\;\,\,\,}}
64:
65: \newcommand{\frhalf}{{\textstyle \frac 12}}
66: \newcommand{\frquarter}{{\textstyle \frac 14}}
67:
68: \newcommand{\aas}{a.~a.~s.}
69:
70: \setlength{\textwidth}{16cm}
71: \setlength{\textheight}{21cm}
72: \setlength{\oddsidemargin}{0cm}
73: \setlength{\evensidemargin}{0cm}
74: \setlength{\topmargin}{0cm}
75: \setlength{\parskip}{1ex}
76:
77: \newtheorem{thm}{Theorem}
78: \newtheorem{lemma}{Lemma}[section]
79:
80:
81: \begin{document}
82:
83: \title{Percolation on fitness landscapes: effects of correlation, phenotype, and incompatibilities}
84: \author{Janko Gravner$^*$, Damien Pitman$^*$, and Sergey Gavrilets$^{\dag\ddag}$\\
85: $^*$Department of Mathematics, University of California, Davis, CA 95616,\\
86: $^{\dag}$Departments of Ecology and Evolutionary Biology
87: and Mathematics, \\
88: University of Tennessee, Knoxville, TN 37996, USA.\\
89: $^\ddag$corresponding author.
90: Phone: 865-974-8136,\
91: fax: 865-974-3067,\\
92: email: gavrila@tiem.utk.edu}
93:
94: \maketitle
95:
96:
97: \newpage
98:
99:
100: {\bf Abstract}\quad
101: We study how correlations in the random fitness assignment may affect the structure
102: of fitness landscapes. We consider three classes of fitness models. The
103: first is a continuous phenotype space in which individuals are characterized
104: by a large number of continuously varying traits such as size, weight, color, or
105: concentrations of gene products which directly affect fitness.
106: The second is a simple model that explicitly describes genotype-to-phenotype
107: and phenotype-to-fitness maps allowing for neutrality at both phenotype and fitness
108: levels and resulting in a fitness landscape with tunable correlation length.
109: The third is a class of models in which particular combinations of alleles or
110: values of phenotypic characters are ``incompatible'' in the sense that the
111: resulting genotypes or phenotypes have reduced (or zero) fitness.
112: This class of models
113: can be viewed as a generalization of the canonical Bateson-Dobzhansky-Muller
114: model of speciation.
115: We also demonstrate that the discrete $NK$ model shares some signature properties of models
116: with high correlations.
117: Throughout the paper, our focus is on the percolation threshold, on the number, size and
118: structure of connected clusters, and on the number of viable genotypes. \\
119:
120:
121: {\bf Key words}: fitness landscapes, percolation, nearly neutral networks, genetic incompatibilities
122:
123: \section{Introduction}
124:
The notion of fitness landscapes, introduced by the theoretical evolutionary biologist Sewall
126: Wright in \citeyear{wri32} (see also \citealt{kau93,gav04}), has proved extremely useful both in
127: biology and well outside of it. In the standard interpretation, a fitness landscape is a relationship
128: between a set of genes (or a set of quantitative characters) and a measure of fitness
129: (e.g. viability, fertility, or mating success). In Wright's original formulation the set of
130: genes (or quantitative characters) is the property of an individual. However, the notion of
131: fitness landscapes can be generalized to the level of a mating pair, or even a population of
132: individuals \citep{gav04}.
133:
134: To date, most empirical information on fitness landscapes in biological applications has come from studies
135: of RNA (e.g., \citealt{sch95,huy96b,fon98b}),
136: proteins (e.g., \citealt{lip91,mar96,ros97}),
137: viruses (e.g., \citealt{bur99,bur04}),
138: bacteria (e.g., \citealt{ele03,woo06}),
139: and artificial life (e.g., \citealt{len99,wil01c}).
140: The three paradigmatic landscapes --- rugged, single-peak,
141: and flat --- emphasizing particular
142: features of fitness landscapes have been the focus of most of the earlier theoretical work
(reviewed in \citealt{kau93,gav04}). These landscapes have found numerous applications with regard to the dynamics
144: of adaptation (e.g., \citealt{kau87,kau93,orr06a,orr06b})
145: and neutral molecular evolution (e.g., \citealt{der91}).
146:
147: More recently, it was realized that the dimensionality of most biologically interesting
148: fitness landscapes is enormous and that this huge dimensionality brings some new properties
149: which one does not observe in low-dimensional landscapes (e.g. in two- or three-dimensional
150: geographic landscapes). In particular, multidimensional landscapes are generically characterized
151: by the existence of neutral and nearly neutral networks (also referred to as holey fitness
152: landscapes) that extend throughout the landscapes
153: and that can dramatically affect the evolutionary dynamics of the populations
154: \citep{gav97,gav97b,rei97b,gav04,rei01a,rei01b,rei02}.
155:
156: An important property of fitness landscapes is their correlation pattern. A common measure
157: for the strength of dependence
158: is the {\it correlation function\/} $\rho$ measuring the correlation of
fitnesses of pairs of individuals at a distance (e.g., Hamming) $d$ from each other in the
160: genotype (or phenotype) space:
161: \begin{equation} \label{rho}
162: \rho(d)=\frac{\cov[w(.),w(.)]_d}{\var(w)}
163: \end{equation}
164: \citep{eig89}. Here, the term in the numerator is the covariance of fitnesses
165: of two individuals conditioned on them being at distance $d$, and
166: $\var(w)$ is the variance in fitness over the whole fitness landscape.
167: For uncorrelated landscapes, $\rho(d)=0$ for $d > 0$. In contrast,
168: for highly correlated landscapes, $\rho(d)$ decreases with $d$ very slowly.
169:
The aim of this paper is to extend our previous work \citep{gav97b} in a number of directions,
171: paying special attention to the question of how correlations in the
172: random fitness assignment may affect the structure of genotype and phenotype spaces.
173: For the resulting random fitness landscapes, we shed some
174: light on issues such as the number of viable genotypes,
175: number of connected clusters of viable genotypes and
176: their size distribution, existence thresholds, and
177: number of possible fitnesses.
178:
179: To this end, we introduce a variety of models,
180: which could be divided into two essentially different
181: classes: those with local correlations, and
182: those with global correlations. As we will see, techniques
183: used to analyze these models, and answers we obtain, differ
184: significantly. We use a mixture of analytical and computational techniques;
185: it is perhaps necessary to point out that these models
186: are very far from trivial, and one is quickly led to
187: outstanding open problems in probability theory and computer science.
188:
189: We start (in Section 2) by briefly reviewing some results from \cite{gav97b}.
190: In Section 3 we generalize these results for the case of a continuous
phenotype space in which individuals are characterized by a large number
192: of continuously varying traits such as size, weight, color, or the
193: concentrations of some gene products. The latter interpretation
194: of the phenotype space may be particularly relevant given the rise of
195: proteomics and the growing interest in gene regulatory networks.
196:
The main idea behind our local correlations model, studied in Section 4,
198: is fitness assignment {\it conformity\/}. Namely, one randomly divides
199: the genotype space into components which are forced to have
200: the same phenotype; then, each different phenotype is independently assigned a random fitness.
201: This leads to a simple two-parameter
202: model, in which one parameter determines the density of viable genotypes,
203: and the other the correlations between them.
204: We argue that the probability of existence of a giant cluster (which swallows a positive
205: proportion of all viable genotypes) is a non-monotone function of the correlation
206: parameter and identify the critical surface at which this probability jumps
207: almost from 0 to 1. In Section 4 we also investigate the effects of
208: interaction between conformity structure and fitness assignment.
209:
210: Section 5 introduces our basic global correlation
211: model, one in which genotypes are eliminated due to random pairwise
212: {\it incompatibilities\/} between alleles. This is
equivalent to a random version of the {\tt SAT} problem,
214: which is the canonical constraint satisfaction problem in computer
215: science. In general, a {\tt SAT} problem involves a set of Boolean variables
216: and their negations that are strung together with {\tt OR} symbols into
217: {\it clauses\/}. The {\it clauses\/} are joined by {\tt AND} symbols
into a {\it formula\/}. A {\tt SAT} problem asks one to decide whether
219: the variables can be assigned values that will make the formula true.
220: An important special case, $K$-{\tt SAT}, has the length of each clause fixed at $K$.
221: Arguably, {\tt SAT} is the most important class of problems in complexity theory.
222: In fact, the general {\tt SAT} was the first known
223: NP-complete problem and was established as such by S. Cook in 1971 (\citealt{Coo}).
224: Even considerable simplifications, such as the {\tt $3$-SAT} (see Section 5.4), remain NP-complete,
225: although {\tt $2$-SAT} (see Section 5.1) can be solved efficiently by a simple algorithm.
226: See e.g. \cite{KV} for a comprehensive presentation of the theory. Difficulties
227: in analyzing random {\tt SAT} problems, in which formulas are chosen at random,
228: in many ways mirror their complexity classes, but even random {\tt $2$-SAT}
229: presents significant challenges \citep{dlV, BKL2}. In our present interpretation, the main reason
230: for these difficulties is that correlations are so high that the expected number
231: of viable genotypes
232: may be exponentially large, while at the same time the probability
233: that even one viable genotype exists is very low. In Section 5, we further
234: illuminate this issue by showing that connected viable clusters
235: must contain fairly large sub-cubes, and that the number of such clusters
is, in a proper interpretation, finite. The relevance of these results to
models of both discrete and continuous
phenotype spaces is also discussed, with particular emphasis on the
239: existence of viable phenotypes in the presence of incompatibilities.
240: Section 5 also contains a brief review
241: of the existing theory on higher order incompatibilities.
242:
243:
244: In Section 6 we demonstrate how the discrete
245: $NK$ model shares some signature properties of models
246: with high correlations. In Section 7 we summarize our results
247: and discuss their biological relevance.
248: The proofs of our major results are relegated to Appendices A--E.
249:
250:
251: \section{The basic case: binary hypercube and independent binary fitness}
252:
253: We begin with a brief review of the basic setup, from \cite{gav97b}
254: and \cite{gav04}. The {\it binary hypercube\/}
255: consists of all $n$--long arrays of bits, or {\it alleles\/}, that is
256: $\cG=\{0, 1\}^n$. This is our {\it genotype space\/}.
Genotypes are linked by edges induced by bit-flips, i.e., {\it mutations\/} at a single locus.
For example, for $n=4$, a sequence of mutations might look like
\[
0000\leftrightarrow 1000\leftrightarrow 1001\leftrightarrow 1101\leftrightarrow 1100.
\]
260: The (Hamming) {\it distance\/} $d(x,y)$ between $x\in \cG$ and $y\in \cG$ is the
261: number of coordinates in which $x$ and $y$ differ or, equivalently,
262: the least number of mutations which connect $x$ and $y$.
263:
264: The {\it fitness\/} of each genotype $x$ is denoted by $w(x)$.
265: We will describe several ways to prescribe the fitness $w$ at random, according
266: to some probability measure $P$ on the $2^{2^n}$ possible assignments. Then we say that
267: an event $A_n$ happens {\it asymptotically almost surely\/} (\aas)
268: if $P(A_n)\to 1$ as $n\to\infty$. Typically, $A_n$ will capture
269: some important property of (random) clusters of genotypes.
270:
271: We commonly assume that $w(x)\in \{0,1\}$ so that $x$ is either viable
272: ($w(x)=1$) or inviable ($w(x)=0$).
273: As a natural starting point, \cite{gav97b} considered uncorrelated landscapes,
274: in which $w(x)$ is chosen to be 1 with probability $p_v$, for each $x$ independently of
275: others. We assume
276: this setup for the rest of this section and note that this
277: is a well-studied problem in mathematical literature,
278: although it presents considerable technical difficulties and
279: some issues are still not completely resolved.
280:
281: Given a particular fitness assignment, viable genotypes form
282: a subset of $\cG$, which is divided into
283: connected {\it components\/} or {\it clusters\/}.
284: For example, with $n=4$, if $0000$ is viable, but its 4 neighbors
285: $1000$, $0100$, $0010$, and $0001$ are not, then it is isolated in its own
286: cluster.
287:
288: Perhaps the most basic result determines the {\it connectivity
289: threshold\/} \citep{Tom}: when $p_v>1/2$, the set of all viable genotypes is connected a.~a.~s.
290: By contrast, when $p_v<1/2$, the set of viable genotypes is {\it not\/} connected
291: {\aas } This is easily understood, as the connectedness is closely linked to
292: isolated genotypes, whose expected number is $2^np_v(1-p_v)^n$. This expectation
293: makes a transition from exponentially large to exponentially small at $p_v=1/2$.
294: The events $\{x$ is isolated$\}$, $x\in \cG$, are only weakly
295: correlated, which implies that when $p_v<1/2$ there are exponentially
296: many isolated genotypes with high probability, while when $p_v>1/2$,
297: a separate argument shows that the event that the set of viable genotypes contains no isolated vertex
298: but is not connected becomes very unlikely for large $n$.
299: This is perhaps the clearest instance of the
300: {\it local method\/}: a local property (no isolated genotypes)
301: is \aas~equivalent to a global one (connectivity).
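
As a quick check of where the value $1/2$ comes from (a worked rewriting of the expectation above, not an additional result), recall that a genotype is isolated when it is viable and all $n$ of its neighbors are inviable, which has probability $p_v(1-p_v)^n$. Summing over the $2^n$ genotypes gives
\[
2^n p_v (1-p_v)^n = p_v\,\bigl[2(1-p_v)\bigr]^n .
\]
For fixed $p_v$ this grows exponentially in $n$ exactly when $2(1-p_v)>1$, that is, when $p_v<1/2$, and it decays exponentially when $p_v>1/2$. At $p_v=1/2$ the expectation equals $1/2$ for every $n$.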
302:
303: Connectivity is clearly too much to ask for, as $p_v$ above $1/2$ is
304: not biologically realistic. Instead, one should look for a weaker
305: property which has a chance of occurring at small $p_v$. Such a
306: property is {\it percolation\/}, a.~k.~a.~existence of the {\it giant component\/}.
307: For this, we scale $p_v=\la_v/n$, for a constant $\la_v$.
308: When $\la_v>1$, the set of viable genotypes percolates, that is, it a.~a.~s.~contains a
309: component of at least $c\cdot n^{-1} 2^n$ genotypes, with all other
310: components of at most polynomial (in $n$) size.
311: When $\la_v<1$,
312: the largest component is a.~a.~s.~of size $Cn$. Here and below, $c$ and $C$ are
313: some constants. These are results from \cite{BKL2}.
314:
315: The local method that correctly identifies the percolation threshold
316: is a little
317: more sophisticated than the one for the connectivity threshold, and
318: uses branching processes with Poisson offspring distribution --- hence we introduce notation
319: Poisson($\la$) for a Poisson distribution with mean $\la$.
320: Viewed from, say, genotype $0\dots0$, the binary hypercube locally approximates a tree with
321: uniform degree $n$. Thus viable genotypes approximate
322: a branching process
323: in which every node has the number of successors distributed binomially
with parameters $n-1$ and $p_v$, hence this random number has mean about $\la_v$ and
325: is approximately Poisson($\la_v$).
326: When $\la_v>1$, such a branching process survives forever with probability
327: $1-\delta>0$, where $\delta=\delta(\la_v)$, and $\delta(\la)$ is given by the
328: implicit equation
329: \begin{equation}\label{delta}
330: \delta=e^{\la(\delta-1)}.
331: \end{equation}
332: (e.g., \citealt{AN}).
333: Large trees of viable genotypes created by the
334: branching processes which emanate from viable genotypes
335: merge into a very large (``giant'') connected set.
336: On the other hand, when $\la_v<1$ the branching process dies out with probability 1.
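
As a rough numerical illustration of equation~(\ref{delta}) (the rounded values below are only indicative), note that $\delta=1$ solves the equation for every $\la$. The map $\delta\mapsto e^{\la(\delta-1)}$ is convex, is positive at $\delta=0$, equals $1$ at $\delta=1$, and has slope $\la$ there, so a second solution in $(0,1)$ exists exactly when $\la>1$; this smaller solution is the extinction probability. For instance, for $\la=2$,
\[
\delta = e^{2(\delta-1)} \quad\Longrightarrow\quad \delta\approx 0.203,\qquad 1-\delta\approx 0.797,
\]
so a Poisson($2$) branching process survives with probability of about $0.8$.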
337:
338: The condition $\la_v>1$ for the existence of the giant component can be loosely
339: rewritten as
340: \begin{equation} \label{basic}
341: p_v > \frac{1}{n}.
342: \end{equation}
This shows that the larger the dimensionality $n$ of the genotype space, the smaller
the viability probability $p_v$ that suffices for the existence of
the giant component. See \cite{gav97b,gav97,gav04,ski04,pig06} for discussions of the biological
significance and implications of this important result.
347:
348: \section{Percolation in a continuous phenotype space}
349:
350: In this section we will assume that individuals are characterized by $n$ continuous
351: traits (such as size, weight, color, or concentrations of particular gene products).
352: To be precise, we let $\cP =[0,1]^n$ be the {\em phenotype space}.
353:
354: We begin with the extension of the notion of independent viability.
355: The most straightforward analogue of the discrete genotype space considered in the
356: previous section involves Poisson point location
in $\cP$, obtained by generating a Poisson($\lambda$) random variable $N$, and then
358: choosing points $x_1,\dots,x_N\in \cP$ uniformly at random.
359: These will be interpreted as {\it peaks\/}
360: of equal height in the fitness landscape.
361: Another parameter is a small $r>0$, which can be interpreted as measuring
362: how harsh the environment is: any phenotype within $r$
363: of one of the peaks is declared viable and any phenotype not within $r$ of one of the peaks
364: is declared inviable. For simplicity, we will assume ``within
365: $r$'' to mean that ``every coordinate differs by at most $r$,''
366: i.e., distance is measured in the ($n$-dimensional) $\ell^\infty$ norm $||\cdot||_\infty$.
Note that this makes the set of viable phenotypes correlated, albeit
368: the range of correlations is limited to $2r$.
369:
370: Our most basic question is whether a positive proportion of
371: viable phenotypes is connected together into a giant cluster.
372: Note that the probability $p_v$ that a random point in $\cP$ is viable
373: is equal to the probability that there is a ``peak'' within $r$ from this
374: point. Therefore,
375: $$
376: p_v=1-\exp\left[-\lambda (2r)^n\right]\approx \lambda (2r)^n.
377: $$
378: This is also the expected combined volume of viable phenotypes.
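
The formula for $p_v$ can be unpacked in one line; the following sketch assumes that the point $y\in\cP$ is at least $r$ away from the boundary of $\cP$, so that boundary effects can be ignored. The peaks within $r$ of $y$ (in the $\ell^\infty$ norm) are those falling into a cube of side $2r$ centered at $y$. Their number $N_y$ is Poisson with mean $\lambda(2r)^n$, and $y$ is inviable exactly when $N_y=0$, so that
\[
p_v = 1-P(N_y=0) = 1-\exp\left[-\lambda(2r)^n\right].
\]
The approximation $p_v\approx\lambda(2r)^n$ then follows from $1-e^{-x}\approx x$ for small $x$, and is therefore appropriate when $\lambda(2r)^n$ is small.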
379:
380: We will consider peaks
381: $x_i$ and $x_j$ to be {\it neighbors\/} if they share a viable phenotype,
382: that is, if their $r$-neighborhoods overlap, or
383: equivalently, if $||x_i-x_j||_\infty<2r$.
384: Two viable phenotypes $y_1$ and $y_2$ are {\it connected\/} if they are,
385: respectively, within $r$ of peaks $x_1$ and $x_2$, and $x_1$ and $x_2$ are
386: connected to each other via a chain of neighboring peaks.
387:
388: By the standard branching process comparison,
a necessary condition for the existence of a giant cluster is that a ``peak'' $x$ is connected
to more than one other ``peak'' on average.
391: All peaks within $2r$ of the focal peak are connected to the latter.
392: Therefore, if $\mu$ is the expected number of peaks connected to $x$,
393: then
394: $$
395: \mu= \lambda \cdot (4r)^n,
396: $$
397: and $\mu>1$ is necessary for percolation.
398: As demonstrated by \cite{Pen} (for a different choice of
399: the norm, but the proof is the same),
400: this condition becomes sufficient when $n$ is large.
401: Note that the expected number $\lambda$ of peaks can be written as $\mu\cdot (4r)^{-n}$.
402:
403: If $\mu>1$ and fixed, then \aas~a positive proportion of
404: all peaks (that is, $cN$ peaks, where $c=c(\mu)>0$) are connected
405: in one ``giant'' component, while the remaining connected components are all of size $\cO(\log N)$.
406: On the other hand, if $\mu<1$, all components are \aas~of size $\cO(\log N)$.
407:
408: The condition $\mu>1$ for the existence of the giant component of viable phenotypes can be
409: loosely rewritten as
410: \begin{equation} \label{cont}
411: p_v > \frac{1}{2^n}.
412: \end{equation}
413: This shows that viable phenotypes are likely to form a large connected cluster even when
414: one is {\it very\/} unlikely to hit one of them at random, if
415: $n$ is even moderately large. The same conclusion and the same threshold are valid
416: if instead of $n$-cubes we use $n$-spheres of a constant radius.
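
The step from $\mu>1$ to inequality~(\ref{cont}) can be made explicit as follows (a sketch that uses the small-$p_v$ approximation $p_v\approx\lambda(2r)^n$ and the identity $\lambda=\mu\cdot(4r)^{-n}$):
\[
p_v \approx \lambda (2r)^n = \mu\,(4r)^{-n}(2r)^n = \frac{\mu}{2^n},
\]
so $\mu>1$ corresponds to $p_v>2^{-n}$. As a purely illustrative comparison, for $n=20$ this threshold is $2^{-20}\approx 9.5\times 10^{-7}$, whereas inequality~(\ref{basic}) gives $1/n=0.05$.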
417:
418: The percolation threshold in the continuous phenotype space given by inequality~(\ref{cont})
419: is much smaller than that in the discrete genotype space which is given by inequality~(\ref{basic}).
420: An intuitive reason for this is that continuous space offers a viable point a much greater opportunity
421: to be connected to a large cluster. Indeed, in the discrete genotype space there are $n$
neighbors per genotype. In contrast, in the continuous phenotype space, the ratio
of the volume of the space where neighboring peaks can be located (which has radius $2r$)
424: to the volume of the focal $n$-cube (which has radius $r$) is $2^n$.
425:
426: \section{Percolation in a correlated landscape with phenotypic neutrality}
427:
428: The standard paradigm in biology is that the relationship between genotype and fitness
429: is mediated by phenotype (i.e., observable characteristics of individuals). Both the
430: genotype-to-phenotype and phenotype-to-fitness maps are typically not one-to-one.
431: Here, we formulate a simple model capturing these properties which also results in a
432: correlated fitness landscape.
433: Below we will call mutations that do not change phenotype {\em conformist}. These mutations
434: represent a subset of {\em neutral} mutations that do not change fitness.
435:
436: We propose the following two-step model. To begin the {\it first step\/},
437: we make each {\it pair\/} of genotypes $x$ and $y$ in a binary hypercube $\cG$ independently
438: {\it conformist\/} with probability $p_{d(x,y)}$ where $d(x,y)$
439: is the Hamming distance between $x$ and $y$. We then declare
440: $x$ and $y$ to belong to the same {\it conformist cluster\/} if they are linked
by a chain of conformist pairs. This version of the long-range percolation model (cf.~\citealt{Ber,Bis})
442: divides the set of genotypes $\cG$ into conformist clusters.
443: We postulate that all genotypes in the same conformist
444: cluster have the same phenotype. Therefore, genetic changes represented by
445: a change from one member of a conformist cluster to another (i.e., single or
446: multiple mutations) are phenotypically neutral.
447:
448: In the {\it second step\/}, we make each conformist cluster independently viable with
449: probability $p_v=\la_v/n$. This generates a random set of viable genotypes,
450: and we aim to investigate when this set has a large connected component.
451:
452: For example, the ``genotype'' can be a linear RNA sequence.
453: This sequence folds into a 2-dimensional molecule which has a particular structure
454: (or ``shape''), and corresponds to our ``phenotype.'' Finally, the molecule
455: itself has a particular function, e.g., to bind to a specific part of the cell or
456: to another molecule. A measure of how well this can be accomplished is represented by
457: our ``fitness.''
458:
459: The distribution of conformist clusters depends on the probabilities
460: $p_1, p_2, p_3, \dots $ which determine how the conformity probability
461: varies with distance.
Here we will study the case when $p_1=p_e>0$, $p_2=p_3=\dots=0$ \citep{Hag}.
463: It is then very convenient for the mathematical analysis that a pair $x$ and
464: $y$ can be conformist only when they are linked by an edge --- therefore
465: we can talk about {\it conformist edges\/} or equivalently {\it conformist mutations\/}.
466: (Note however that it is possible that nearest neighbors $x$ and $y$ are in the
467: same conformist cluster even if the edge between them is non-conformist.)
468:
469: Figure 1 illustrates our 2-step procedure on a four-dimensional example.
470:
471: We expect that a more general model with $p_i$ declining fast enough with $i$
472: is just a smeared version of this basic one, and its properties are not likely
473: to differ from those of the simpler model. We conjecture that for our purposes,
474: ``fast enough'' decrease should be exponential with a rate logarithmically
475: increasing in the dimension $n$, e.g. for large $k$,
476: \[
477: p_k \le \exp(-\alpha(\log n)k),
478: \]
479: for some $\alpha>1$. (This is expected to be so because in this case the expected number of
480: neighbors of the focal genotype is finite.)
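
To see why such a decay rate would keep the number of long-range conformist partners small, here is a sketch under the simplifying assumption (ours, for illustration only) that the bound $p_k\le\exp(-\alpha(\log n)k)=n^{-\alpha k}$ holds for every $k\ge 2$ rather than only for large $k$. A genotype has $\binom nk\le n^k/k!$ genotypes at distance $k$, so the expected number of conformist partners at distance at least $2$ is at most
\[
\sum_{k\ge 2}\binom nk\, n^{-\alpha k}
\le \sum_{k\ge 2}\frac{n^{-(\alpha-1)k}}{k!}
= e^{x}-1-x,\qquad x=n^{-(\alpha-1)},
\]
which tends to $0$ as $n\to\infty$ whenever $\alpha>1$.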
481:
482: We observe that the first step of our procedure is an
483: edge version of the percolation model discussed in the second section, with a
484: similar giant component transition \citep{BKL1}.
485: Namely, let $p_1=p_e=\lambda_e/n$. Then, if $\la_e>1$, there
486: is a.~a.~s.~one giant conformist cluster of size $c\cdot 2^n$, with all others
487: of size at most $Cn$. In contrast, if $\la_e<1$ all conformist clusters
488: are of size at most $Cn$. Note that the number of conformist
489: clusters is always on the order $2^n$. In fact, even the number
490: of ``non-conformist'' (i.e., isolated) clusters is a.~a.~s.~asymptotic to
491: $e^{-\lambda_e} 2^n$, as $P(x\ \text{is isolated})=(1-\lambda_e/n)^n$.
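
The isolation probability above is a one-line computation: a genotype has $n$ incident edges, each conformist independently with probability $\lambda_e/n$, and
\[
P(x\ \text{is isolated})=\left(1-\frac{\lambda_e}{n}\right)^n \longrightarrow e^{-\lambda_e}
\qquad\text{as } n\to\infty.
\]
As a rough numerical illustration, for $\lambda_e=1$ the limit is $e^{-1}\approx 0.368$, while for $n=20$ the exact value is $0.95^{20}\approx 0.358$; thus roughly a third of all genotypes form conformist clusters of size one at this parameter value.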
492:
493: \begin{figure*}[t]
494: \begin{center}
495: {\includegraphics
496: [clip, viewport= 140 325 475 680, height=4cm]{4q.ps}
497: \hspace{1cm}
498: \includegraphics[clip, viewport= 140 325 475 680, height=4cm]{4edge-config.ps} \\
499: \vspace{.5cm}
500: \includegraphics[clip,viewport= 140 325 475 680, height=4cm]{4viability.ps}
501: \hspace{1cm}
502: \includegraphics[clip,viewport= 140 325 475 680, height=4cm]{4neut.ps}
503: }\end{center}
504: \caption{A four-dimensional example: start with the cube $\cG^4$ (top left), create conformist clusters by randomly eliminating each edge with probability
505: $1-p_e$ (top right), remove each conformist cluster with probability
506: $1-p_v$ (bottom left, removed vertices are black) and finally consider
507: connected components of the remaining vertices (bottom right,
508: there is just one component in this case).}
509: \end{figure*}
510:
511: Denote by $x\conn y$ (resp.~$x\notconn y$) the event that
512: $x$ and $y$ are (resp.~are not) in the same conformist cluster.
513: First, we note that the probability $P(x \conn y)$ that two genotypes belong to
514: the same conformist cluster depends on the Hamming distance $d(x,y)$ between them, and on
515: $p_e=\lambda_e/n$. In particular,
516: we show in Appendix A that, if $\la_e<1$ and $d(x,y)=k$ is fixed, then
517: \begin{equation} \label{Px-y}
518: k!p_e^k (1 - O(n^{-2})) \leq P(x \conn y) \leq k!p_e^k (1 + O(n^{-1} \log{n})).
519: \end{equation}
520: The dominant contribution $k!p_e^k$ is simply the expected number of conformist pathways between $x$ and $y$
521: that are of shortest possible length.
522:
It is also important to note that, for every $x\in \cG$,
$P(x\text{ is viable})=p_v$, which does not depend on $p_e$.
525: Moreover, for $x,y\in \cG$,
526: $$
527: \begin{aligned}
528: &P(x\text{ and }y\text{ viable})-p_v^2\\
529: &=P(x\text{ and }y\text{ viable},x\conn y)+ P(x\text{ and }y\text{ viable},x\notconn y)-p_v^2\\
530: &=p_vP(x\conn y)+ p_v^2\cdot P(x\notconn y)-p_v^2\\
531: &=p_v(1-p_v)P(x\conn y)\ge 0.
532: \end{aligned}
533: $$
534: Therefore, the correlation function~(\ref{rho}) is
535: \begin{equation}
536: \rho(x,y)=P(x\conn y),
537: \end{equation}
538: which clearly increases with $p_e$ and, thus, with $\lambda_e$.
539: Therefore, this model
540: has tunable positive correlations controlled by the parameter $\la_e$, whose value does
541: not affect the expected number of viable genotypes.
542: The correlation function $\rho(x,y)$ decreases exponentially with distance
$d(x,y)$ when $\la_e<1$, and is bounded away from zero when $\la_e>1$. Nevertheless,
544: as we will see below, we can effectively use local methods for all values of $\la_e$.
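
For completeness, the normalization behind the identity $\rho(x,y)=P(x\conn y)$ can be spelled out. Since $w(x)\in\{0,1\}$ with $P(w(x)=1)=p_v$, we have $\var(w)=p_v(1-p_v)$, and the covariance computed above gives
\[
\rho(x,y)=\frac{p_v(1-p_v)\,P(x\conn y)}{p_v(1-p_v)}=P(x\conn y).
\]
Combined with the bounds~(\ref{Px-y}), this suggests that for $\la_e<1$ and fixed $d(x,y)=k$ the correlation is, to leading order in $n$,
\[
\rho(x,y)\approx k!\left(\frac{\la_e}{n}\right)^k .
\]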
545:
546: \subsection{Threshold surface for percolation}
547:
548: Proceeding by the local branching process heuristics,
549: we reason that a surviving node on the branching tree can have
550: two types of descendants: those that are connected by conformist mutations
551: and those that are in different conformist clusters and thus
552: independently viable. Therefore the number
553: of descendants is approximately Poisson($\la_e+\la_v$).
554: This can only work when $\la_e<1$, as otherwise the correlations are global.
555:
556: If $\la_e>1$, we need to eliminate the
557: entire conformist giant component, which is \aas~inviable.
558: Locally, we condition on the
559: (supercritical) branching process of the supposed descendant to die out.
560: Such conditioned process is a subcritical branching process, with
561: Poisson $(\la_e\delta)$ distribution of successors \citep{AN}
562: where $\delta=\delta(\lambda_e)$ is given by the equation~(\ref{delta}).
563: This gives the
564: conformist contribution, to which we add the independent Poisson$(\la_v\delta)$ contribution.
565:
566: \begin{figure*}[t]
567: \begin{center}
568: \vspace{5pt}
569: \includegraphics[clip=true,height=10cm]{nt2.ps} \hspace{1.5cm}
570: \includegraphics[clip=true,height=10cm]{nt1.ps}
571: \end{center}
572:
573: \caption{Simulated $\la_v^m$ (long dashes) and $\la_v^{M}$ (short dashes), and $\zeta$
574: (solid) plotted against $\la_e$, for $n=10, \dots, 20$, and models from Section
575: 4.1 (left frame) and Section 4.2 (right frame). Lower bounds increase with $n$, and
576: upper bounds decrease, for this range of $n$. }
577: \end{figure*}
578:
579: To have a convenient summary of the conclusions above,
580: assume that $\la_e$ is fixed and let $\zeta(\la_e)$
581: be the smallest $\la_v$ which \aas~ensures the giant component, i.e.,
582: \[
583: \zeta(\la_e)=\inf\{\la_v: \text{a cluster of at least }cn^{-1} 2^n
584: \text{ viable genotypes exists \aas~for some } c>0\}.
585: \]
586: One would expect that for $\la_v<\zeta(\la_e)$ all components are \aas~of size at most $Cn$.
587: The asymptotic critical curve is given by
588: $\la_v=\zeta(\la_e)$, where
589: \begin{equation} \label{pheno}
590: \zeta(\la)=
591: \begin{cases}
592: 1-\la &\qquad\text{if } \la\in [0,1],\\
593: \frac 1{\delta}-\la&\qquad\text{if } \la\in [1,\infty).
594: \end{cases}
595: \end{equation}
596:
597: Having only a heuristic proof of this, we resort to computer
598: simulations for confirmation. For this, we
599: indicate
600: global connectivity with the event $A$ that a genotype
601: within distance 2 of $0\dots 0$ is connected
602: (through viable genotypes) to a genotype
603: within distance 2 of $1\dots 1$.
604: We make this choice because the
605: distance 2 is the smallest that works with asymptotic certainty.
606: Indeed, the genotypes $0\dots0$ and $1\dots1$ are likely to be inviable.
607: Even the number of viable genotypes within distance one of each of these is only of constant order,
608: so even in the percolation regime the probability of connectivity between
609: a viable genotype within distance one
610: of $0\dots0$ and a viable one within distance one of $1\dots1$ does not converge to 1 but is of
611: a nontrivial constant order. By contrast, there are about $n^2$ vertices
612: within distance 2 of $0\dots0$ among which of order $n$ are viable.
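
As a rough count supporting this choice, and assuming (as the notation suggests) that a given genotype is viable with probability $p_v=\la_v/n$, the number of genotypes within distance 2 of $0\dots0$ and the expected number of viable ones among them are
\[
1+n+\binom n2\sim \frac{n^2}{2}, \qquad \left(1+n+\binom n2\right)\frac{\la_v}{n}\sim\frac{\la_v n}{2},
\]
whereas within distance one the corresponding expectation is $(1+n)\la_v/n\to\la_v$, which stays bounded as $n$ grows.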
613:
614: When $\la_v>\zeta(\la_e)$ the probability of the event $A$
615: should therefore be (exponentially) close to 1. On the other hand, when $\la_v<\zeta(\la_e)$
616: the probability that a connected component within distance 2 of either
$0\dots0$ or $1\dots1$ extends for a distance of order $n$
618: is exponentially small. We further define the critical curves
619: $$
620: \begin{aligned}
621: &\text{$\la_v^{m}=\;$the smallest $\la_v$ for which
622: $P(A)>0.1$,}\\
623: &\text{$\la_v^{M}=\;$the largest $\la_v$ for which
624: $P(A)<0.9$.}
625: \end{aligned}
626: $$
627:
628: We approximated $\la_v^m$ and $\la_v^{M}$ for
629: $n=10, \dots, 20$ and $\la_e=0(0.1)2$, with 1000
630: independent realizations
631: of each choice of $n$, $\la_e$, and $\la_v$. We used the linear
632: cluster algorithm described in \cite{Sed}.
The results are depicted in Figure 2.
634: Unfortunately, simulations above $n\approx 20$
635: are not feasible.
636:
637: From Figure 2 we observe that:
638: \begin{itemize}
639:
640: \item Even for low $n$, both critical curves approximate well the
641: overall shape of the theoretical limit curve $\zeta$.
642: \item $\la_v^{m}$ and $\la_v^{M}$ get
643: closer faster than they converge to $\zeta$. Consequently,
644: one can expect that $P(A)$ makes a very sharp jump from near 0
645: to near 1 even for moderate $n$.
646: \item For $\la_e<1$, $\la_v^{m}$ tends to be above the limit curve. This is
647: not really surprising, as the local argument always gives an upper
648: bound on the probability $P(A)$ of event $A$. Further, the approximation of $\la_v^m$ deteriorates
649: near $\la_e=2$, which stems from the possibility of survival of the
650: giant component in this regime.
651: \end{itemize}
652:
653: What is clear from the heuristics and simulations is that
654: conformist mutations, and thus correlations, significantly affect
655: the probability of long range
656: connectivity in the genotype space. The effect is not monotone:
657: the most advantageous choice
is when the correlations are at the point of phase transition between local and global.
659:
660:
To understand intuitively why percolation occurs most easily when $\la_e \approx 1$, it helps
662: to think of the model as a branching process on clusters rather than on genotypes.
663: For a genotype on a viable cluster,
664: there is a number of neighboring clusters and each of these is viable with
665: probability $p_v$. If $\lambda_e < 1$, then the probability that any two of the neighboring
666: genotypes are in the same cluster is $o(1)$, so there are asymptotically exactly $n$ clusters
667: neighboring the present cluster. Consequently, the overall number of descendants will be greater
668: if the size of these clusters is greater on average; which is exactly what happens as $\lambda_e$
669: increases towards 1. If $\lambda_e > 1$, then there is a positive proportion of the neighboring
670: genotypes that are in the giant cluster. This giant cluster is likely to be inviable, so the parameter
671: $\lambda_v$ must be greater to compensate for its loss.
672:
673: \subsection{Correlations between conformity and viability}
674:
675: In the previous model, the viability probability $p_v$
676: was independent of the conformity structure. Mainly to
677: investigate the robustness of our conclusions,
678: we consider a simple generalization in which there
679: are either positive or negative correlations between conformity
680: and fitness. While more sophisticated models are possible,
681: the one below is chosen for its amenability to relatively simple analysis.
682:
683: Assume now that conformist clusters are formed as before (i.e.,
684: with edges being conformist with probability $p_e=\lambda_e/n$),
685: are still independently viable, but
686: now the probability of their viability depends on their
687: size. We will consider the simple case when an isolated genotype
688: (one might call it {\it non-conformist\/}) is viable with probability $p_0=\la_0/n$,
689: while a conformist cluster of size larger than 1 is viable with probability $p_1=\la_1/n$.
690:
691: In this case
692: $$
693: P(x\text{ is viable})=(1-p_e)^np_0+(1-(1-p_e)^n)p_1\sim \frac 1n\left(
694: e^{-\la_e}\la_0+(1-e^{-\la_e})\la_1\right).
695: $$
696: Moreover, by a similar calculation as before,
697: $$
698: \begin{aligned}
699: &P(x\text{ and }y\text{ viable})-P(x\text{ viable})^2\\
700: &=p_1(1-p_1)P(x\conn y)+P(x\text{ non-conformist})^2p_e(p_0-p_1)^2\cdot 1_{\{d(x,y)=1\}}.
701: \end{aligned}
702: $$
703: Here, the last factor is the indicator of the set $\{(x,y), d(x,y)=1\}$, which equals
704: $1$ if $d(x,y)=1$ and $0$ otherwise.
705: Therefore, for $d(x,y)\ge 2$, the correlation function (\ref{rho})
706: is
707: $$
708: \rho(x,y)\sim\frac {\la_1}{e^{-\la_e}\la_0+(1-e^{-\la_e})\la_1}P(x\conn y),
709: $$
710: which is smaller than before iff $\la_1<\la_0$. However, it has the same
711: asymptotic properties unless $\la_1=0$.
712:
713:
714: Assume first that $\la_e<1$.
715: The local analysis now leads to
716: a {\it multi-type\/} branching process \citep{AN} with three types: NC (non-conformist node),
717: CI (non-isolated node independently viable, so no conformist edge is
718: accounted for), and CC (non-isolated node viable by conformity, so
719: a conformist edge is accounted for).
720:
721: Note first that a genotype is
722: non-conformist with probability about $e^{-\la_e}$.
Hence a node of any of the three types creates a Poisson($e^{-\la_e}\la_0$) number
of type NC descendants, and a Poisson($(1-e^{-\la_e})\la_1$) number of type CI
725: descendants. In addition, the type CI creates a Poisson($\la_e$), conditioned
726: on being nonzero, number of descendants of type CC and type CC creates a
727: Poisson($\la_e$) number of descendants of type CC. Thus
728: the matrix of expectations, in which the $ij$th entry is the expectation of the number
729: of type $j$ descendants from type $i$, is
730: \[
731: M=
732: \begin{bmatrix}
733: e^{-\la_e}\la_0 & \left(1- e^{-\la_e}\right)\la_1 & 0\\
734: e^{-\la_e}\la_0 & \left(1- e^{-\la_e}\right)\la_1 & \la_e/(1-e^{-\la_e})\\
735: e^{-\la_e}\la_0 & \left(1- e^{-\la_e}\right)\la_1 & \la_e \end{bmatrix}\quad .
736: \]
737: When $\la_e>1$, $\la_e$ needs to be replaced by $\la_e\delta$, and
738: $\la_1$ by $\la_1\delta$, where $\delta=\delta(\la_e)$ is given by ~(\ref{delta}).
739:
It follows from the theory of multi-type branching processes \citep{AN} that
the critical surface for survival of a multi-type
branching process is given by $\det(M-I)=0$, where $I$ is the identity matrix.
743:
744: The simplest case is when only non-conformist genotypes may be viable,
745: i.e., $\la_1=0$. In this case the critical surface is given by $\la_0 e^{-\la_e}=1$ (Pitman, unpub.).
746: Not surprisingly, the critical $\la_0$ to achieve global connectivity strictly
747: increases with $\la_e$, which is the result of negative correlations between
748: conformity and viability.
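
This case can be checked directly from the determinant condition. The following is a short verification for $\la_e<1$, where $I$ denotes the $3\times 3$ identity matrix and $\al=e^{-\la_e}\la_0$, $\be=\la_e/(1-e^{-\la_e})$ are shorthand used only here. With $\la_1=0$, expanding along the first row gives
\[
\det(M-I)=\det\begin{bmatrix} \al-1 & 0 & 0\\ \al & -1 & \be\\ \al & 0 & \la_e-1\end{bmatrix}
=(\al-1)(-1)(\la_e-1)=(\al-1)(1-\la_e),
\]
which, for $\la_e<1$, vanishes only when $\al=\la_0e^{-\la_e}=1$.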
749:
750: The other extreme is when non-conformist genotypes are inviable,
751: i.e., $\la_0=0$. As an easy computation demonstrates,
752: the critical curve is now given by $\la_1=\zeta(\la_e)$, where
753: \begin{equation}\label{phenocorr}
754: \zeta(\la)=
755: \begin{cases}
\frac{1-\la}{\la e^{-\la}+1-e^{-\la}} &\qquad\text{if } \la\in (0,1],\\
\frac{\delta^{-1} -\la}{ \la e^{-\la}+1-e^{-\la\delta}}&\qquad\text{if } \la\in [1,\infty).
758: \end{cases}
759: \end{equation}
760: Note that $\zeta(\la)\to \infty$ as $\la\to 0$. We carried out exactly the
761: same simulations as before. These are also featured in Figure 2 (right frame), and again
762: confirm our local heuristics. We conclude that positive correlations
763: between viability and conformity tend to lead to a V-shaped critical
764: curve, whose sharpness at critical conformity $\la_e=1$ increases with
765: the size of correlations. In short, then, correlations help more
766: if viability probability increases with size of conformist clusters.
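
The computation behind (\ref{phenocorr}) can be sketched as follows for $\la_e<1$; here $I$ is the $3\times 3$ identity matrix, and $\be=(1-e^{-\la_e})\la_1$ and $\ka=\la_e/(1-e^{-\la_e})$ are shorthand used only in this sketch. With $\la_0=0$,
\[
\det(M-I)=\det\begin{bmatrix} -1 & \be & 0\\ 0 & \be-1 & \ka\\ 0 & \be & \la_e-1\end{bmatrix}
=-\left[(\be-1)(\la_e-1)-\be\ka\right].
\]
Setting this to zero gives $\be\,(1+\ka-\la_e)=1-\la_e$, and dividing by $1-e^{-\la_e}$ yields
\[
\la_1=\frac{1-\la_e}{(1-e^{-\la_e})(1-\la_e)+\la_e}=\frac{1-\la_e}{\la_e e^{-\la_e}+1-e^{-\la_e}}.
\]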
767:
768: \section{Percolation in incompatibility models}
769:
770: In the model considered in the previous section
771: correlations rapidly decreased with distance. This property
772: made local analysis possible. The models we introduce now
773: are fundamentally different in the sense that correlations are
774: so high that the local method gives a wrong answer.
775:
776: In the previous sections, in constructing fitness landscapes we were assigning fitness
777: to individual genotypes or phenotypes. Here, we make certain assumptions about ``fitness'' of
778: particular combinations of alleles or the values of phenotypic characters. Specifically,
779: we will assume that some of these combinations are ``incompatible'' in the sense that the
780: resulting genotypes or phenotypes have reduced (or zero) fitness \citep{orr95,orr96,gav04}.
781: The resulting models can be viewed as a generalization of the Bateson-Dobzhansky-Muller
782: model \citep{orr95,orr96,orr97,orr01,gav96b,gav97,gav97b,gav03d,gav04,coy04}
783: which represents a canonical model of speciation.
784:
785: \subsection{Diallelic loci}
786:
787: We begin by assuming that viability of a genotype is determined by
788: a set $F$ of pairwise incompatibilities. $F$ is thus
789: a subset of $4\cdot \binom{n}{2}$ pairs $(u_i, v_j)$,
790: where $1\le i<j\le n$ and $u,v\in\{0,1\}$. In this nonstandard notation, $(0_1,0_2)\in F$,
791: for example, means that allele $0$ at locus $1$ and allele $0$ at locus $2$
792: are incompatible. In general, if $(u_i, v_j)\in F$,
793: all genotypes with $u$ in position $i$ and $v$ in position $j$
794: are inviable.
795: A genotype $x$ is then inviable if and only if there exist $i$ and $j$, with $i<j$,
796: so that $u$ and $v$ are, respectively, the alleles of $x$ at loci $i$ and $j$,
797: and $(u_i, v_j)\in F$.
798: For example, if $F_1=\{(0_1, 0_2), (1_2, 0_3), (1_1, 1_2)\}$, viable genotypes may have
799: $011$, $100$, and $101$ as their first three alleles. For $F_2=F_1\cup \{(0_1, 1_3), (1_1, 0_2)\}$,
800: no viable genotype remains.
801:
802: Incompatibility $(0_1, 0_2)$ is equivalent to two implications: $0_1\implies 1_2$ and
803: $0_2\implies 1_1$ or to the single {\tt OR} statement $1_1$ {\tt OR} $1_2$. In this interpretation,
804: the problem of whether, for a given list of incompatibilities $F$, there is a viable genotype is
805: known as the {\tt $2$-SAT} problem \citep{KV}.
806: The associated {\it digraph\/} $D_F$ is a graph on $2n$ vertices $x_i$, $i=1,\dots n$, $x=0,1$,
807: with oriented edges determined by the implications. A well-known theorem \citep{KV} states
808: that a viable genotype exists iff $D_F$ contains no oriented cycle
809: from $0_i$ to $1_i$ and back to $0_i$ for any $i=1,\dots n$ in $D_F$.
810: For example, for the incompatibilities $F_2$ as above,
811: one such cycle is $0_1\to1_2\to 1_3\to 1_1\to 1_2\to 0_1$.
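
For concreteness, the oriented edges of $D_{F_2}$ can be listed by applying the rule above to each incompatibility in $F_2$ (each one contributes two implications):
\[
\begin{aligned}
(0_1,0_2):&\quad 0_1\to 1_2,\quad 0_2\to 1_1,\\
(1_2,0_3):&\quad 1_2\to 1_3,\quad 0_3\to 0_2,\\
(1_1,1_2):&\quad 1_1\to 0_2,\quad 1_2\to 0_1,\\
(0_1,1_3):&\quad 0_1\to 0_3,\quad 1_3\to 1_1,\\
(1_1,0_2):&\quad 1_1\to 1_2,\quad 0_2\to 0_1.
\end{aligned}
\]
The cycle quoted above uses, in order, the edges $0_1\to 1_2$, $1_2\to 1_3$, $1_3\to 1_1$, $1_1\to 1_2$ and $1_2\to 0_1$, and so passes through both $0_1$ and $1_1$.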
812:
813: Now assume that each possible incompatibility is adjoined to $F$ at random, independently
814: with probability
815: \[
816: p=\frac c{2n}.
817: \]
818: (We use the generic notation $p$ for a probability parameter
819: in all our models, even though the nature of probabilistic assignments differs from model to model.)
820:
821:
822: {\bf Existence of viable genotypes.}\quad
823: Let $N$ be the number of viable genotypes. Then
824: \begin{itemize}
825: \item if $c>1$, then a.~a.~s.~$N=0$.
826: \item if $c<1$, then a.~a.~s.~$N>0$.
827: \end{itemize}
This result first appeared in the computer science literature in the 1990s
829: (see \citealt{dlV} for a review), and it is an
830: extension of the celebrated Erd\"os-R\'enyi random graph results
831: \citep{Bol,JLR} to the oriented case.
832:
833: Note that the expectation
834: $E(N)=2^n(1-p)^{\binom{ n}{2}}\approx 2^ne^{-cn/4}$,
835: which grows exponentially whenever $c<4\log 2\approx 2.77$. Neglecting
836: correlations would therefore suggest a wrong threshold for $N>0$. The local method
837: (e.g., used in \citealt[Chapter 6]{gav04}) is
838: even farther off, as it suggests an \aas~giant component when $p<(1-\e)\log n/n$
839: for any $\e>0$.
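
The estimate of $E(N)$ above follows from a one-line expansion, recorded here as a check. A fixed genotype is viable iff none of the $\binom n2$ pairs of alleles it carries belongs to $F$, and
\[
(1-p)^{\binom n2}=\exp\left(\binom n2\log\left(1-\frac c{2n}\right)\right)
=\exp\left(-\frac{cn}{4}+\cO(1)\right),
\]
so that $n^{-1}\log E(N)\to \log 2-\frac c4$, which is positive exactly when $c<4\log 2$.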
840:
841: {\bf The number of viable genotypes.}\quad
842: Assume that $c<1$. Sophisticated, but not mathematically rigorous
843: methods based on {\it replica symmetry\/} \citep{MZ,BMW} from statistical physics suggest that,
844: as $n\to\infty$,
845: $\lim n^{-1}\log N$ varies almost linearly between
846: $\log 2\approx 0.69$ (for small $c$, when, as we prove below, this limit is
847: $\log 2+\cO(c)$) and about $0.38$ (for $c$ close to $1$).
848: One can however prove that $n^{-1}\log N$ is for large $n$ sharply
849: concentrated around its mean \citep{dlV}.
850:
851: Upper and lower bounds on $N$ can also be obtained
852: rigorously. For example, if $X$ is a number of
853: incompatibilities which involve {\it disjoint\/} pairs of loci
854: (i.e., those for which every locus is represented at most once among the
855: incompatibilities),
856: then $N\le \exp(n\log 2+X\log(3/4))$, as each of the $X$ incompatibilities
857: reduces the number of viable genotypes by the factor $3/4$.
858: If we imagine
859: adding incompatibilities one by one at random until
860: there are about $cn$ of them, then after we have $k$
861: incompatibilities on disjoint pairs of loci the waiting time (measured by
862: the number of incompatibilities added)
863: for a new disjoint one is geometric with expectation $\binom{n} {2}/\binom{n-2k} {2}$.
864: Therefore,
865: $X$ is \aas~at least $Kn$, where
866: $K$ solves the approximate equation
867: $$
868: \binom{n} {2} \left(\sum_{k=0}^{Kn}\frac 1{ \binom{n-2k} {2}} \right)\sim cn,
869: $$
870: or
871: $$
872: \int_{0}^{Kn}\frac 1{(n-2k)^2}\, dk \sim \frac cn,
873: $$
which reduces to $K=c/(1+2c)$. This yields the following upper bound on $N$:
876: \begin{equation} \label{up_bound}
877: \limsup \frac 1n\log N\le \frac {1}{1+2c}\log 2+\frac {c}{1+2c}\log 3.
878: \end{equation}
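The intermediate steps, written out as a check, are as follows. The integral evaluates to
\[
\int_0^{Kn}\frac{dk}{(n-2k)^2}=\frac12\left(\frac1{n-2Kn}-\frac1n\right)=\frac{K}{n(1-2K)},
\]
and equating this with $c/n$ gives $K=c(1-2K)$, hence $K=c/(1+2c)$. Substituting $X\ge Kn$ into $N\le \exp(n\log 2+X\log(3/4))$ then gives the exponent
\[
\log 2+\frac{c}{1+2c}\left(\log 3-2\log 2\right)=\frac1{1+2c}\log 2+\frac c{1+2c}\log 3.
\]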
879:
880: A lower bound is even easier to obtain. Namely,
881: the probability that a fixed location (i.e., locus) $i$ does not appear in $F$ is $(1-p)^{4(n-1)}
882: \to e^{-2c}$, and then it is easy to see that the number of loci represented in $F$
883: is asymptotically $(1-e^{-2c})n$. As the other loci are neutral (in the sense that changing
884: their alleles does not affect fitness),
885: $n^{-1}\log N$ is asymptotically at least $e^{-2c}\log 2$. Clearly, this gives
886: a lower bound on the exponential size of any cluster of viable genotypes.
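
As a numerical illustration of the gap between the two bounds (the value $c=\frac12$ is chosen here only as an example), the upper bound (\ref{up_bound}) and the lower bound $e^{-2c}\log 2$ become
\[
\frac12\log 2+\frac14\log 3\approx 0.621
\qquad\text{and}\qquad
e^{-1}\log 2\approx 0.255,
\]
to be compared with $\log 2\approx 0.693$ for the full genotype space.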
887:
If this were an accurate bound, it would imply that the space of
genotypes is rather simple, in that almost all its entropy would come from neutral loci. Appendix B presents two arguments which will
890: demonstrate that this is not the case. The derivations there are somewhat technical,
891: but do provide more insight into random pair incompatibilities.
892:
893: {\bf The structure of clusters.}\quad
894: The derivations in Appendix B show that every viable genotype is connected
895: through mutation to a fairly substantial
896: viable sub-cube. In this sub-cube, alleles on at most a proportion $r_u(c)<1$ of loci
897: are fixed (to 0 or 1) while the remaining proportion $1-r_u(c)$ could be
varied without effect on fitness. Note from Figure 4 in
Appendix B that $1-r_u(c)\ge 0.3$ for
900: all $c$, and that such a phenomenon is
901: extremely unlikely on uncorrelated landscapes.
902: Note also that, for $c<1$, $N\ge 2^{(1-r_u(c))n}$ \aas~and so the lower
903: bound on $N$ can be written as
904: \begin{equation} \label{low_bound}
905: \liminf\frac 1n\log N\ge (1-r_u(c))\log 2.
906: \end{equation}
907:
908: {\bf The number of clusters.}\quad
909: The natural next question concerns the number of clusters
910: $R$ when $c<1$. This again has quite a surprising answer, unparalleled in
911: landscapes with rapidly decaying correlations. Namely,
912: $R$ is {\it stochastically bounded\/}, that
is, for every $\e>0$ there exists a $z=z(\e)$ such that $P(R\le z \text{ for all } n)>1-\e$.
914: As there is some confusion in the literature as to whether it is even possible
915: to get more than one cluster \citep{BMW}, Appendix C
916: presents a sketch of the results which will appear in Pitman (unpub.).
917: There we also show that the limiting probability of a unique cluster is
918: $\sqrt{(1-c)e^c}$.
919:
920: Asymptotically, a unique cluster has a better than even chance of
921: occurring for $c$ below about $0.9$, and is {\it very\/} likely to occur
922: for small $c$, though of course not
923: \aas~so. To confirm, we have done simulations for $n=20$ and $c=0.01 (0.01) 1$
924: (again 1000 trials in each case) and got distribution of clusters depicted
925: in Figure~3. The results suggest that the convergence to limiting distribution
926: is rather slow for $c$ close to 1, and that the likelihood of a unique
927: cluster increases for low $n$.
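
To give a feeling for the limiting formula, a few values of $\sqrt{(1-c)e^c}$ (computed here only for illustration) are
\[
c=0.1:\ \approx 0.997,\qquad c=0.5:\ \approx 0.908,\qquad c=0.9:\ \approx 0.496 .
\]
For small $c$, the expansion $c+\log(1-c)=-\frac{c^2}{2}-\frac{c^3}{3}-\dots$ gives $\sqrt{(1-c)e^c}=1-\frac{c^2}{4}+\cO(c^3)$, which explains why a unique cluster is so likely in that range. The even-chance point solves $(1-c)e^c=\frac14$, that is, $c\approx 0.90$.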
928:
929: \begin{figure*}[t]
930: \begin{center}
931: {\includegraphics[clip=true,height=5cm]{cls.ps}
932: }
933: \end{center}
934:
935: \caption{Simulated number of clusters, vs. $c$ for $n=20$. The
936: proportion (out of 1000) of trials with exactly one, exactly two, and at least three clusters
937: is plotted respectively with $+$'s, $\times$'s and $*$'s. The solid curve is
938: $\sqrt{(1-c)e^c}$.
939: }
940: \label{number_clusters}
941: \end{figure*}
942:
943:
944: To summarize, in the presence of random pairwise incompatibilities, the set
945: of viable genotypes is, when nonempty,
946: divided into a stochastically bounded number of connected clusters,
947: where a unique cluster is usually the most likely possibility.
These clusters are all of exponentially large size
(with bounds given by equations (\ref{up_bound}) and (\ref{low_bound})); in fact, they all contain
sub-cubes of dimension at least $(1-r_u(c))n$.
951: However, the proportion
952: of viable genotypes among all $2^n$ genotypes is exponentially small, by
953: equation (\ref{up_bound}).
954:
955: \subsection{Multiallelic loci}
956:
957: Here we assume that at each locus there can be $a\ (\ge 2)$
958: alleles (cf., \citealt{Rei}). In this case, the genotype space is
959: the generalized hypercube
960: $\cG_a=\{0,\dots, a-1\}^n$. For $a=3$
961: this could be interpreted as the genotype space of diploid
962: organisms without {\it cis-trans\/} effects \citep{gav97b},
963: $a=4$ corresponds to DNA sequences, and $a=20$ corresponds to proteins.
964: Much larger values of $a$ can correspond to a number of alleles at a protein
965: coding locus and we will see later that
966: there is not much difference between this model and a
967: natural continuous space model.
968:
969: We will assume that each pair of alleles, out of total
970: number of $a^2\binom{n}{2}$ is independently incompatible
971: with probability
972: $$p=\frac{c}{2n}.$$
973: The main question we are interested in
974: here is for which values of $c$ viable genotypes exist {\aas }
975:
Clearly, if $N$ is the number of viable genotypes, then the expectation
$$
E(N)=a^n(1-p)^{\binom{n}{2}}\approx\exp(n \log a-{\textstyle\frac 14}cn),
$$
and so there are \aas~no viable genotypes when $c>4\log a$. On the
981: other hand, clearly there are viable genotypes
982: (with all positions filled by 0's and 1's) when $c<1$. It turns out that the
983: first
984: of these trivial bounds is much closer to the critical value when $a$
985: is large. Before we proceed, however, we state a sharp
986: threshold result from \cite{Mol}: there exists a function $\gamma=\gamma(n,a)$
987: so that for every $\e>0$,
988: \begin{itemize}
989: \item if $c>\gamma+\e$, then a.~a.~s.~$N=0$.
990: \item if $c<\gamma-\e$, then a.~a.~s.~$N>0$.
991: \end{itemize}
In words, for a fixed $a$, the probability of the event that $N\ge 1$
transitions sharply from large to small
as $np$ increases. As it is not proved that
995: $\lim_{n\to\infty}\gamma(n,a)$ exists, it is in principle possible
996: that the place of this sharp transition fluctuates as $n$ increases
997: (although it must of course remain within $[1, 4\log a]$).
998:
999: Our main result in this section is
1000: \begin{equation}\label{gamma}
1001: \gamma=4\log a-o(1), \text{ as }a\to\infty.
1002: \end{equation}
This somewhat surprising result is proven in Appendix D by the
1004: second moment method, as developed in \cite{AM} and \cite{AP}.
1005:
1006: \subsection{Continuous phenotype spaces}
1007:
Here we extend the model of pair incompatibilities to the case of a continuous
phenotypic space $\cP$. Again, we have a small $r>0$ as a parameter.
For each pair $(i,j)$, $i<j$, we consider an independent Poisson point process $\Pi_{ij}$
in the unit square $[0,1]\times[0,1]$, of rate $\la=c/(2n)$. (Equivalently, choose a Poisson($\la$) number of
1012: points uniformly at random in $[0,1]\times[0,1]$.) Then we declare $a\in \cP$ inviable
1013: if there exist $i<j$ so that $(a_i,a_j)$ is within $r$ of $\Pi_{ij}$.
1014: Again, we use the two-dimensional $\ell^\infty$ norm for distance.
1015: Our procedure can be visualized as throwing a random number of
1016: $(n-2)$-dimensional square tubes of inviable phenotypes into the phenotype space.
1017:
Our main result here is that the existence threshold is of order $c\approx -\log r/r^{2}$.
Namely, we prove in Appendix E that there exists a constant $C>0$ so that for small enough $r$,
1020:
1021: \begin{itemize}
1022: \item if $c>4\frac{-\log r}{r^2}$, then a.~a.~s.~$N=0$.
1023: \item if $c<\frac{-\log r-C}{r^2}$, then a.~a.~s.~$N>0$.
1024: \end{itemize}
1025:
1026: \subsection{Complex incompatibilities}
1027:
Here we assume that incompatibilities involve $K\ (\geq 2)$ diallelic loci \citep{orr96,gav04}.
The question whether a viable combination of genes exists is then equivalent to
the {\tt $K$-SAT} problem \citep{KV}. Even for $K=3$, this is an NP-complete problem
\citep{KV}, so there is no known polynomial-time algorithm to answer this question.
1032: The random case, which we now describe, is also much harder to analyze
1033: than the {\tt $2$-SAT} one.
1034: Let $F$ be a random set
1035: to which any of the $2^K\binom n K$ incompatibilities belong independently with
1036: probability
1037: $$
1038: p=\frac {K!}{2^K}\cdot \frac c{n^{K-1}}.
1039: $$
1040: Here $c=c(K)$ is a constant, and the above form has been
1041: chosen to make the number of incompatibilities in $F$ asymptotically $cn$.
1042: (Note also the agreement with the definition of $p$ in Section 5.1
1043: when $K=2$.) For a fixed $K$,
it has been proved \citep{Fri} that the probability that a viable genotype exists
drops sharply from 1 to 0 as $c$ increases. However, the location of the
1046: jump has not been proved to converge as $n\to\infty$. Instead,
1047: a lot of effort has been
invested in obtaining good bounds. For example \citep{AP}, for $K=3$, $c<3.42$ implies \aas~existence
of a viable genotype, while $c>4.51$ implies \aas~nonexistence (while the sharp
1051: constant
1052: is estimated to be about $4.48$, see e.g. \citealt{BMW}).
1053: For $K=4$ the best current bounds are $7.91$ and $10.23$. For large
1054: $K$, the transition occurs at $c=2^K\log 2-\cO(K)$ \citep{AP}.
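
The normalization of $p$ used in this subsection can be verified by a direct count, given here as a check. The expected number of incompatibilities in $F$ is
\[
2^K\binom nK\, p\sim 2^K\cdot\frac{n^K}{K!}\cdot\frac{K!}{2^K}\cdot\frac{c}{n^{K-1}}=cn,
\]
and for $K=2$ the formula reads $p=\frac{2}{4}\cdot\frac cn=\frac c{2n}$.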
1055:
1056: Techniques from statistical physics \citep{BMW} strongly suggest
1057: that, for $K\ge 3$, there is another phase transition, which
1058: for $K=3$ occurs at about $c=3.96$. For smaller $c$, the
1059: viable genotypes are conjectured to
1060: be contained in a {\it single\/} cluster.
1061: For larger $c$, the space of viable genotypes
1062: (if nonempty) is divided into exponentially many connected clusters.
1063:
1064:
1065: Perhaps more relevant to genetic incompatibilities is the following
{\it mixed\/} model (commonly known as {\tt $(2+p)$-SAT}; \citealt{MZ}). Assume that
1067: every 2-incompatibility is present with probability $c_2/(2n)$,
1068: while every 3-incompatibility is present with probability $3c_3/(4n^2)$.
1069: The normalizations are chosen so that the numbers of the two types of
1070: incompatibilities are asymptotically $c_2 n$ and $c_3 n$, respectively.
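
Explicitly, since there are $4\binom n2$ possible 2-incompatibilities and $8\binom n3$ possible 3-incompatibilities, the expected numbers are
\[
4\binom n2\cdot\frac{c_2}{2n}\sim c_2n,
\qquad
8\binom n3\cdot\frac{3c_3}{4n^2}\sim\frac{8n^3}{6}\cdot\frac{3c_3}{4n^2}=c_3n.
\]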
1071:
1072: If $c_2$ (resp. $c_3$) is very small, then the respective incompatibility
1073: set affects a very small proportion of loci, therefore
1074: $c_3$ (resp. $c_2$) determines whether a viable genotype is likely to exist.
1075: Intuitively, one also expects that 2-incompatibilities should be more
1076: important than 3-incompatibilities
1077: as one of the former type excludes more genotypes than one of the latter type. A careful
1078: analysis confirms this. First observe
1079: that $c_2>1$ implies \aas~non-existence of a viable genotype. The surprise
1080: \citep{MZ,AKKK} is that if $c_3$ is small enough, $c_2<1$
1081: implies \aas~existence of viable genotypes, so the 3-incompatibilities
1082: do not change the threshold. This is established in \cite{MZ} by a physics argument
1083: for $c_3<0.703$, while
1084: \cite{AKKK} gives a rigorous argument for $c_3<2/3$. Therefore, even if their numbers are
1085: on the same scale, if the more
1086: complex incompatibilities are rare enough compared to the pairwise
1087: ones, their contribution to the structure of the space of
1088: viable genotypes is not essential.
1089:
1090:
1091: \section{Notes on neutral clusters in the discrete {\it NK\/} model}
1092:
1093: The model considered here is a special case of the discretized NK model \citep{kau93},
1094: introduced in \cite{NE}.
1095: This model features $n$ diallelic loci each of which interacts with $K$ other loci.
1096: To have a concrete example, assume that the loci are arranged on a
circle, so that $n+1\equiv 1$, $n+2\equiv 2$, etc., and let the
1098: interaction {\it neighborhood\/} of the $i$'th locus consist of itself
1099: and $K$ loci to its right $i+1, \dots, i+K$. For a given
1100: genotype $x\in\cG=\{0,1\}^n$,
1101: the neighborhood configuration of the
1102: $i$'th locus is then given by $\cN_i(x)= (x_i, x_{i+1}, \dots, x_{i+K})\in \{0,1\}^{K+1}$.
1103: To each locus and to each possible configuration
1104: in its neighborhood
we independently assign a binary fitness contribution.
1106: To be more precise,
1107: we choose the $2^{K+1}n$ numbers $v_i(y)$, $i=1, \dots, n$ and $y\in \{0,1\}^{K+1}$,
1108: to be independently 0 or 1 with equal probability, and interpret $v_i(y)$
1109: as the fitness contribution of locus $i$ when its neighborhood configuration
1110: is $y$. The fitness
1111: of a genotype $x$ is then the sum of contributions from each locus:
1112: $$
1113: w(x)=\sum_{i=1}^n v_i(\cN_i(x)).
1114: $$
1115: In \cite{kau93}, the values $v_i$ were taken from a continuous distribution.
1116: In \cite{NE}, these values were integers in the range $[0,F-1]$ so that our model
1117: is a special case $F=2$.
{\it Neutral clusters\/} are connected components of genotypes with the same
fitness.
1120:
1121: The $K=0$ case is easy but nevertheless illustrative.
1122: Namely, a mutation at locus $i$ will not change fitness iff
1123: $v_i(0)=v_i(1)$; let $D$ be the number of such loci.
Then $D\sim n/2$ \aas, the number of different fitnesses is $n-D+1$,
1125: each neutral cluster is a sub-cube
1126: of dimension $D$, and there are exactly $2^{n-D}$ neutral
1127: clusters.
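
To make the $K=0$ count explicit, here is a short sketch (the set $S$ is notation introduced only for this remark). Each locus independently satisfies $v_i(0)=v_i(1)$ with probability $\frac12$, which gives $D\sim n/2$. Let $S$ be the set of the remaining $n-D$ loci. Then
\[
w(x)=\sum_{i\notin S} v_i(0)+\sum_{i\in S} v_i(x_i),
\]
so $w(x)$ depends only on the alleles of $x$ at the loci in $S$, while a mutation at any locus in $S$ changes $w$ by exactly $1$. Hence each of the $2^{n-D}$ allele assignments on $S$ determines one neutral cluster, namely the sub-cube obtained by varying the $D$ loci outside $S$.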
1128:
1129: The next simplest situation is when $K=1$. Let
1130: $D_1$ be the number of loci $i$
1131: for which $v_i$ is constant. Then
1132: $D_1\sim n/8$ \aas, and each neutral cluster contains a
1133: sub-cube of dimension $D_1$. Moreover, let $D_2$ be
the number of loci $i$ for which $v_i(00)=v_i(01)\ne v_i(10)=v_i(11)$.
Note that any two genotypes that differ at such a locus $i$ must belong to
different neutral clusters, and so the number
1137: of different neutral clusters is at least $2^{D_2}$. Thus there
1138: are exponentially many of them, as
1139: again $D_2\sim n/8$ {\aas }
1140: This division of genotype space into exponentially many clusters
1141: of exponential size persists for every $K$, although
1142: the distribution of numbers and sizes of these clusters is not well understood (see
1143: \citealt{NE} for simulations for $n=20$).
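
The two frequencies of $n/8$ quoted for $K=1$ come from a simple count. For each locus, $v_i$ is one of the $2^4=16$ equally likely functions from $\{0,1\}^2$ to $\{0,1\}$, of which two are constant and two satisfy $v_i(00)=v_i(01)\ne v_i(10)=v_i(11)$. Therefore
\[
P(v_i\text{ is constant})=\frac{2}{16}=\frac18,
\qquad
P\bigl(v_i(00)=v_i(01)\ne v_i(10)=v_i(11)\bigr)=\frac{2}{16}=\frac18,
\]
and the law of large numbers applied to the $n$ independent loci gives $D_1\sim n/8$ and $D_2\sim n/8$.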
1144:
1145: Finally, we mention that the question of whether a
1146: genotype with the maximal possible fitness $n$
exists for a given $K$ is in many ways related to issues in incompatibility models
1148: \citep{CJK}.
1149:
1150: \section{Discussion}
1151:
1152: In this section we summarize our major findings and provide their biological interpretation.
1153:
1154: The previous work on neutral and nearly neutral networks in multidimensional fitness
1155: landscapes has concentrated exclusively on genotype spaces in which each individual
1156: (or a group of individuals) is characterized by a discrete set of genes. However
1157: many features of biological organisms that are actually observable and/or measurable are described by
1158: continuously varying variables such as size, weight, color, or concentration. A question
1159: of particular biological interest is whether (nearly) neutral networks are as prominent
1160: in a continuous phenotype space as they are in the discrete genotype space. Our results
1161: provide an affirmative answer to this question. Specifically, we have shown that in a simple
1162: model of random fitness assignment, viable phenotypes are likely to form a large connected
1163: cluster even if their overall frequency is very low provided the dimensionality of the phenotype
1164: space, $n$, is sufficiently large. In fact, the percolation threshold for the probability
1165: of being viable scales with $n$ as $1/2^n$ and, thus, decreases much faster than $1/n$ which is
1166: characteristic of the analogous discrete genotype space model.
1167:
1168: Earlier work on nearly neutral networks has been limited to consideration of the relationship
1169: between genotype and fitness. Any phenotypic properties that usually mediate this relationship
1170: in real biological organisms have been neglected. In Section 4, we proposed a novel model in which
1171: phenotype is introduced explicitly. In our model, the relationships both between genotype and
1172: phenotype and between phenotype and fitness are of many-to-one type, so that neutrality is present
1173: at both the phenotype and fitness levels. Moreover, this model results in a correlated fitness
1174: landscape in which the correlation function can be found explicitly. We studied the effects
1175: of phenotypic neutrality and correlation between fitnesses on the percolation threshold and
showed that the most conducive
conditions for the formation of the giant component are those in which the correlations are at the point
of phase transition between local and global.
To explore the robustness of our conclusions, we then looked at a simplistic but
mathematically illuminating model in which there is a correlation between conformity (i.e.,
phenotypic neutrality) and fitness. The model has supported our conclusions.
1182:
In Section 5, we studied a number of models that have been recently proposed
1184: and explored within the context of studying speciation. In these models, fitness is assigned to
1185: particular gene/trait combinations and the fitness of the whole organisms depends on the presence
1186: or absence of incompatible combinations of genes or traits. In these models, the correlations
1187: of fitnesses are so high that local methods lead to wrong conclusions.
1188: First, we established the connection between these models and $K$-{\tt SAT} problems, prominent
1189: in computer science. Then we analyzed the conditions for the existence of viable genotypes,
1190: their number, as well as the structure and the number of clusters of viable genotypes.
1191: These questions have not been studied previously. Among other things we showed that the number
1192: of clusters is stochastically bounded and each cluster contains a very large sub-cube.
1193: The majority of our results are for the case of pairwise incompatibilities between diallelic
1194: loci, but we also looked at multiple alleles and complex incompatibilities. Moreover, we generalized
1195: some of our results to continuous phenotype spaces.
1196:
1197: At the end, we provided some additional results on the size, number and structure of
1198: neutral clusters in the discrete $NK$ model.
1199:
1200: Some more general lessons of our work are that
1201: \begin{itemize}
1202: \item Correlations may help or hinder connectivity in fitness landscapes. Even when
1203: correlations are positive and tunable by a single
1204: parameter, it may be advantageous
1205: (for higher connectivity) to increase
1206: them only to a limited extent.
1207: \item Averages (i.e., expected values) can easily lead to wrong conclusions,
1208: especially when correlations are strong. Nevertheless, they may still
1209: be useful with a crafty choice of relevant statistics.
1210: \item Very high correlations may fundamentally change the structure of connected
1211: clusters. For example, clusters may look locally more like cubes than trees and
1212: their number may be reduced dramatically.
1213: \item Necessary analytical techniques may be unexpected and quite sophisticated;
1214: for example, they may require
1215: detailed understanding of random graphs, spin-glass machinery, or decision algorithms.
1216: \end{itemize}
1217:
1218:
1219: {\small ACKNOWLEDGMENTS.
1220: This work was supported by the Defense Advanced Research Projects Agency (DARPA),
by the National Institutes of Health (grant GM56693),
by the National Science Foundation (grants DMS-0204376 and DMS-0135345),
and by the Republic of Slovenia's Ministry of Science (program P1-285).}
1224:
1225: \begin{thebibliography}{}
1226:
1227: \bibitem[Achlioptas et~al., 2001]{AKKK}
1228: Achlioptas, D., Kirousis, L.~M., Kranakis, E., and Krizanc, D. (2001).
1229: \newblock Rigorous results for $(2+p)$-{SAT}.
1230: \newblock {\em Theoretical Computer Science}, 265:109--129.
1231:
1232: \bibitem[Achlioptas and Moore, 2004]{AM}
1233: Achlioptas, D. and Moore, C. (2004).
1234: \newblock Random k-{SAT}: two moments suffice to cross a sharp threshold.
1235: \newblock {\em SIAM Journal on Computing}, 17:947--973.
1236:
1237: \bibitem[Achlioptas and Peres, 2004]{AP}
1238: Achlioptas, D. and Peres, Y. (2004).
1239: \newblock The threshold for random $k$-{SAT} is $2\sp k\log 2-o(k)$.
1240: \newblock {\em Journal of the American Mathematical Society}, 17:947--973.
1241:
1242: \bibitem[Athreya and Ney, 1971]{AN}
1243: Athreya, K. and Ney, P. (1971).
1244: \newblock {\em Branching processes}.
1245: \newblock Springer-Verlag (reprinted by Dover 2004).
1246:
1247: \bibitem[Barbour et~al., 1992]{BHJ}
1248: Barbour, A.~D., Holst, L., and Janson, S. (1992).
1249: \newblock {\em Poisson Approximation}.
1250: \newblock Oxford University Press.
1251:
1252: \bibitem[Berger, 2004]{Ber}
1253: Berger, N. (2004).
1254: \newblock A lower bound for the chemical distance in sparse long-range
1255: percolation models.
1256: \newblock {\em http://arxiv.org/abs/math/0409021}.
1257:
1258: \bibitem[Biroli et~al., 2000]{BMW}
1259: Biroli, G., Monasson, R., and Weigt, M. (2000).
1260: \newblock A variational description of the ground state structure in random
1261: satisfiability problems.
1262: \newblock {\em European Physical Journal B-Condensed Matter}, 14:551--568.
1263:
1264: \bibitem[Biskup, 2004]{Bis}
1265: Biskup, M. (2004).
1266: \newblock On the scaling of the chemical distance in long-range percolation
1267: models.
1268: \newblock {\em Annals of Probability}, 32:2938--2977.
1269:
1270: \bibitem[Bollob\'as, 2001]{Bol}
1271: Bollob\'as, B. (2001).
1272: \newblock {\em Random Graphs}.
1273: \newblock Cambridge University Press.
1274:
1275: \bibitem[Bollob\'as et~al., 1992]{BKL1}
1276: Bollob\'as, B., Kohayakawa, Y., and \L{}uczak, T. (1992).
1277: \newblock The evolution of random subgraphs of the cube.
1278: \newblock {\em Random Structures and Algorithms}, 3:55--90.
1279:
1280: \bibitem[Bollob\'as et~al., 1994]{BKL2}
1281: Bollob\'as, B., Kohayakawa, Y., and \L{}uczak, T. (1994).
1282: \newblock On the evolution of random {Boolean} functions.
1283: \newblock In {\em Extremal problems for finite sets (Visegr\'ad, 1991)}, pages
1284: 137--156. Bolyai Society Mathematical Studies, 3, J\'anos Bolyai Mathematical
1285: Society, Budapest.
1286:
1287: \bibitem[Boufkhad and Dubois, 1999]{BD}
1288: Boufkhad, Y. and Dubois, O. (1999).
1289: \newblock Length of prime implicants and number of solutions of random {CNF}
1290: formulae.
1291: \newblock {\em Theoretical~Computer~Science}, 215:1--30.
1292:
1293: \bibitem[Burch and Chao, 1999]{bur99}
1294: Burch, C.~L. and Chao, L. (1999).
1295: \newblock Evolution by small steps and rugged landscapes in the {RNA} virus phi
1296: 6.
1297: \newblock {\em Genetics}, 151:921--927.
1298:
1299: \bibitem[Burch and Chao, 2004]{bur04}
1300: Burch, C.~L. and Chao, L. (2004).
1301: \newblock Epistasis and its relationship to canalization in the {RNA} virus phi
1302: 6.
1303: \newblock {\em Genetics}, 167:559--567.
1304:
1305: \bibitem[Choi et~al., 2005]{CJK}
1306: Choi, S.-S., Jung, K., and Kim, J.~H. (2005).
1307: \newblock Phase transition in a random {NK} landscape model.
1308: \newblock In {\em Proceedings of the 2005 Conference on Genetic and
1309: Evolutionary Computation, {Washington, DC}}, pages 1241--1248. ACM Press.
1310:
1311: \bibitem[Cook, 1971]{Coo}
1312: Cook, S.~A. (1971).
1313: \newblock The complexity of theorem proving procedures.
1314: \newblock In {\em Proceedings of the Third Annual ACM Symposium on the Theory
1315: of Computing}, pages 151--158. ACM.
1316:
1317: \bibitem[Coyne and Orr, 2004]{coy04}
1318: Coyne, J. and Orr, H.~A. (2004).
1319: \newblock {\em Speciation}.
1320: \newblock Sinauer Associates, Inc., Sunderland, Massachusetts.
1321:
1322: \bibitem[de~la Vega, 2001]{dlV}
1323: de~la Vega, W.~F. (2001).
1324: \newblock Random {2-SAT}: results and problems.
1325: \newblock {\em Theoretical Computer Science}, 265:131--146.
1326:
1327: \bibitem[Derrida and Peliti, 1991]{der91}
1328: Derrida, B. and Peliti, L. (1991).
1329: \newblock Evolution in flat landscapes.
1330: \newblock {\em Bulletin of Mathematical Biology}, 53:255--282.
1331:
1332: \bibitem[Eigen et~al., 1989]{eig89}
1333: Eigen, M., Mc{C}askill, J., and Schuster, P. (1989).
1334: \newblock The molecular quasispecies.
1335: \newblock {\em Advances in Chemical Physics}, 75:149--263.
1336:
1337: \bibitem[Elena and Lenski, 2003]{ele03}
1338: Elena, S.~F. and Lenski, R.~E. (2003).
1339: \newblock Evolution experiments with microorganisms: The dynamics and genetic
1340: bases of adaptation.
1341: \newblock {\em Nature Reviews Genetics}, 4:457--469.
1342:
1343: \bibitem[Fontana and Schuster, 1998]{fon98b}
1344: Fontana, W. and Schuster, P. (1998).
1345: \newblock Continuity in evolution: on the nature of transitions.
1346: \newblock {\em Science}, 280:1451--1455.
1347:
1348: \bibitem[Friedgut, 1999]{Fri}
1349: Friedgut, E. (1999).
\newblock Necessary and sufficient conditions for sharp thresholds of graph
1351: properties, and the $k$-{SAT} problem.
1352: \newblock {\em Journal of the American Mathematical Society}, 12:1017--1054.
1353:
1354: \bibitem[Gavrilets, 1997]{gav97}
1355: Gavrilets, S. (1997).
1356: \newblock Evolution and speciation on holey adaptive landscapes.
1357: \newblock {\em Trends in Ecology and Evolution}, 12:307--312.
1358:
1359: \bibitem[Gavrilets, 2003]{gav03d}
1360: Gavrilets, S. (2003).
1361: \newblock Models of speciation: what have we learned in 40 years?
1362: \newblock {\em Evolution}, 57:2197--2215.
1363:
1364: \bibitem[Gavrilets, 2004]{gav04}
1365: Gavrilets, S. (2004).
1366: \newblock {\em Fitness landscapes and the origin of species}.
1367: \newblock Princeton University Press, Princeton, NJ.
1368:
1369: \bibitem[Gavrilets and Gravner, 1997]{gav97b}
1370: Gavrilets, S. and Gravner, J. (1997).
1371: \newblock Percolation on the fitness hypercube and the evolution of
1372: reproductive isolation.
1373: \newblock {\em Journal of Theoretical Biology}, 184:51--64.
1374:
1375: \bibitem[Gavrilets and Hastings, 1996]{gav96b}
1376: Gavrilets, S. and Hastings, A. (1996).
1377: \newblock Founder effect speciation: a theoretical reassessment.
1378: \newblock {\em American Naturalist}, 147:466--491.
1379:
1380: \bibitem[H\"aggstr\"om, 2001]{Hag}
1381: H\"aggstr\"om, O. (2001).
1382: \newblock Coloring percolation clusters at random.
1383: \newblock {\em Stochastic Processes and their Applications}, 96:213--242.
1384:
1385: \bibitem[Huynen et~al., 1996]{huy96b}
1386: Huynen, M.~A., Stadler, P.~F., and Fontana, W. (1996).
1387: \newblock Smoothness within ruggedness: the role of neutrality in adaptation.
1388: \newblock {\em Proceedings of the National Academy of Sciences USA},
1389: 93:397--401.
1390:
1391: \bibitem[Janson et~al., 2000]{JLR}
1392: Janson, S., \L{}uczak, T., and Rucinski, A. (2000).
1393: \newblock {\em Random Graphs}.
1394: \newblock Wiley.
1395:
1396: \bibitem[Kauffman, 1993]{kau93}
1397: Kauffman, S.~A. (1993).
1398: \newblock {\em The origins of order}.
1399: \newblock Oxford University Press, Oxford.
1400:
1401: \bibitem[Kauffman and Levin, 1987]{kau87}
1402: Kauffman, S.~A. and Levin, S. (1987).
1403: \newblock Towards a general theory of adaptive walks on rugged landscapes.
1404: \newblock {\em Journal of Theoretical Biology}, 128:11--45.
1405:
1406: \bibitem[Korte and Vygen, 2005]{KV}
1407: Korte, B. and Vygen, J. (2005).
1408: \newblock {\em Combinatorial Optimization, Theory and Algorithms}.
1409: \newblock Springer, 3rd edition.
1410:
1411: \bibitem[Lenski et~al., 1999]{len99}
1412: Lenski, R.~E., Ofria, C., Collier, T.~C., and Adami, C. (1999).
1413: \newblock Genome complexity, robustness and genetic interactions in digital
1414: organisms.
1415: \newblock {\em Nature}, 400:661--664.
1416:
1417: \bibitem[Lipman and Wilbur, 1991]{lip91}
1418: Lipman, D.~J. and Wilbur, W.~J. (1991).
1419: \newblock Modeling neutral and selective evolution of protein folding.
1420: \newblock {\em Proceedings of the Royal Society London B}, 245:7--11.
1421:
1422: \bibitem[Martinez et~al., 1996]{mar96}
1423: Martinez, M.~A., Pezo, V., Marli\`{e}re, P., and Wain-Hobson, S. (1996).
1424: \newblock Exploring the functional robustness of an enzyme by {\em in vitro}
1425: evolution.
1426: \newblock {\em EMBO Journal}, 15:1203--1210.
1427:
1428: \bibitem[Molloy, 2003]{Mol}
1429: Molloy, M. (2003).
1430: \newblock Models for random constraint satisfaction problems.
1431: \newblock {\em SIAM Journal on Computing}, 32:935--949.
1432:
1433: \bibitem[Monasson and Zecchina, 1997]{MZ}
1434: Monasson, R. and Zecchina, R. (1997).
1435: \newblock Statistical mechanics of the random {K}-satisfiability model.
1436: \newblock {\em Physical Review E}, 56:1357--1370.
1437:
1438: \bibitem[Newman and Engelhardt, 1998]{NE}
1439: Newman, M. E.~J. and Engelhardt, R. (1998).
1440: \newblock Effects of selective neutrality on the evolution of molecular
1441: species.
1442: \newblock {\em Proceedings of the Royal Society London B}, 265:1333--1338.
1443:
1444: \bibitem[Orr, 1995]{orr95}
1445: Orr, H.~A. (1995).
1446: \newblock The population genetics of speciation: the evolution of hybrid
1447: incompatibilities.
1448: \newblock {\em Genetics}, 139:1803--1813.
1449:
1450: \bibitem[Orr, 1997]{orr97}
1451: Orr, H.~A. (1997).
1452: \newblock Dobzhansky, {Bateson}, and the genetics of speciation.
1453: \newblock {\em Genetics}, 144:1331--1335.
1454:
1455: \bibitem[Orr, 2006a]{orr06b}
1456: Orr, H.~A. (2006a).
1457: \newblock The distribution of fitness effects among beneficial mutations in
1458: {Fisher}'s geometric model of adaptation.
1459: \newblock {\em Journal of Theoretical Biology}, 238:279--285.
1460:
1461: \bibitem[Orr, 2006b]{orr06a}
1462: Orr, H.~A. (2006b).
1463: \newblock The population genetics of adaptation on correlated fitness
1464: landscapes: The block model.
1465: \newblock {\em Evolution}, 60:1113--1124.
1466:
1467: \bibitem[Orr and Orr, 1996]{orr96}
1468: Orr, H.~A. and Orr, L.~H. (1996).
1469: \newblock Waiting for speciation: the effect of population subdivision on the
1470: waiting time to speciation.
1471: \newblock {\em Evolution}, 50:1742--1749.
1472:
1473: \bibitem[Orr and Turelli, 2001]{orr01}
1474: Orr, H.~A. and Turelli, M. (2001).
1475: \newblock The evolution of postzygotic isolation: accumulating
1476: {Dobzhansky}-{Muller} incompatibilities.
1477: \newblock {\em Evolution}, 55:1085--1094.
1478:
1479: \bibitem[Palasti, 1971]{Pal}
1480: Palasti, I. (1971).
1481: \newblock On the threshold distribution function of cycles in a directed random
1482: graph.
1483: \newblock {\em Studia Scientiarum Mathematicarum Hungarica}, 6:67--73.
1484:
1485: \bibitem[Penrose, 1996]{Pen}
1486: Penrose, M.~D. (1996).
1487: \newblock Continuum percolation and {Euclidean} minimal spanning trees in high
1488: dimensions.
1489: \newblock {\em The Annals of Applied Probability}, 6:528--544.
1490:
1491: \bibitem[Pigliucci, 2006]{pig06}
1492: Pigliucci, M. (2006).
1493: \newblock {\em Making Sense of Evolution: The Conceptual Foundations of
1494: Evolutionary Biology}.
1495: \newblock University of Chicago Press, Chicago.
1496:
1497: \bibitem[Reidys, 2006]{Rei}
1498: Reidys, C.~M. (2006).
1499: \newblock Combinatorics of genotype-phenotype maps: an {RNA} case study.
1500: \newblock In Percus, A., Istrate, G., and Moore, C., editors, {\em
1501: Computational Complexity and Statistical Physics}, pages 271--284. Oxford
1502: University Press.
1503:
1504: \bibitem[Reidys et~al., 2001]{rei01a}
1505: Reidys, C.~M., Forst, C.~V., and Schuster, P. (2001).
1506: \newblock Replication and mutation on neutral networks.
1507: \newblock {\em Bulletin of Mathematical Biology}, 63:57--94.
1508:
1509: \bibitem[Reidys and Stadler, 2001]{rei01b}
1510: Reidys, C.~M. and Stadler, P.~F. (2001).
1511: \newblock Neutrality in fitness landscapes.
1512: \newblock {\em Applied Mathematics and Computation}, 117:321--350.
1513:
1514: \bibitem[Reidys and Stadler, 2002]{rei02}
1515: Reidys, C.~M. and Stadler, P.~F. (2002).
1516: \newblock Combinatorial landscapes.
1517: \newblock {\em SIAM Review}, 44:3--54.
1518:
1519: \bibitem[Reidys et~al., 1997]{rei97b}
1520: Reidys, C.~M., Stadler, P.~F., and Schuster, P. (1997).
1521: \newblock Generic properties of combinatory maps: neutral networks of {RNA}
1522: secondary structures.
1523: \newblock {\em Bulletin of Mathematical Biology}, 59:339--397.
1524:
1525: \bibitem[Rost, 1997]{ros97}
1526: Rost, B. (1997).
1527: \newblock Protein structures sustain evolutionary drift.
1528: \newblock {\em Folding \& Design}, 2:S19--S24.
1529:
1530: \bibitem[Schuster, 1995]{sch95}
1531: Schuster, P. (1995).
\newblock How to search for {RNA} structures. {T}heoretical concepts in
1533: evolutionary biotechnology.
1534: \newblock {\em Journal of Biotechnology}, 41:239--257.
1535:
1536: \bibitem[Sedgewick, 1997]{Sed}
1537: Sedgewick, R. (1997).
1538: \newblock {\em Algorithms in {C, Parts 1-4}: Fundamentals, Data Structures,
1539: Sorting, Searching.}
1540: \newblock Addison-Wesley.
1541:
1542: \bibitem[Skipper, 2004]{ski04}
1543: Skipper, R.~A. (2004).
1544: \newblock The heuristic role of {Sewall Wright}'s 1932 adaptive landscape
1545: diagram.
1546: \newblock {\em Philosophy of Science}, 71:1176--1188.
1547:
1548: \bibitem[Toman, 1979]{Tom}
1549: Toman, E. (1979).
1550: \newblock The geometric structure of random boolean functions.
1551: \newblock {\em Problemy Kibernet. (in Russian)}, 35:111--132.
1552:
1553: \bibitem[Wilke et~al., 2001]{wil01c}
1554: Wilke, C.~O., Wang, J.~L., Ofria, C., Lenski, R.~E., and Adami, C. (2001).
1555: \newblock Evolution of digital organisms at high mutation rates leads to
1556: survival of the flattest.
1557: \newblock {\em Nature}, 412:331--333.
1558:
1559: \bibitem[Woods et~al., 2006]{woo06}
1560: Woods, R., Schneider, D., Winkworth, C.~L., Riley, M.~A., and Lenski, R.~E.
1561: (2006).
1562: \newblock Tests of parallel molecular evolution in a long-term experiment with
1563: {{\em Escherichia coli}}.
1564: \newblock {\em Proceedings of the National Academy of Sciences USA},
1565: 103:9107--9112.
1566:
1567: \bibitem[Wright, 1932]{wri32}
1568: Wright, S. (1932).
1569: \newblock The roles of mutation, inbreeding, crossbreeding and selection in
1570: evolution.
1571: \newblock In Jones, D.~F., editor, {\em Proceedings of the Sixth International
1572: Congress on Genetics}, volume~1, pages 356--366, Austin, Texas.
1573:
1574: \end{thebibliography}
1575:
1576:
1577:
1578:
1579: \newpage
1580:
1581: \section*{Appendix}
1582:
1583: \subsection*{Appendix A. Proof of equation~(\ref{Px-y}).}
1584:
1585: To prove equation (5), we assume that $\lambda_e<1$ and
1586: show that for a fixed $k$ (which does not grow with $n$), the
1587: event that $x$ and $y$ at distance $k$ are in the same conformist cluster is most likely to
1588: occur because $x$ and $y$ are connected via the shortest possible path. Indeed,
1589: the dominant term $k!p_e^k$ is the expected number of conformist pathways between $x$ and $y$
1590: that are of shortest possible length $k$. This easily follows from the observation that
1591: on a shortest path there
1592: is no opportunity to backtrack; each mutation must be toward the other genotype.
1593: We can assume that $x$
1594: is the all 0's genotype and $y$ is the genotype with 1's in the
1595: first $k$ positions and 0's elsewhere.
1596: There are $k!$ orders in which the 1's can be added.
1597:
1598: To obtain the lower bound we use inclusion-exclusion on the probability
1599: that $x \conn y$ through a shortest path. Let $\mathcal{I}_l=\mathcal{I}_l(x,y)$
1600: be the set of all paths of length $l$ between $x$ and $y$.
1601: Then
1602: $$P(x \conn y) \geq \sum_{\alpha \in \mathcal{I}_k} P(A_\alpha) -
1603: \sum_{\alpha \neq \beta \in \mathcal{I}_k} P(A_\alpha\cap A_\beta)$$
1604: where $A_\alpha$ is the event that a particular path $\alpha$ consists entirely
1605: of conformist edges.
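Notice that two distinct paths of the same length differ by at least two edges,
so each event $A_\alpha\cap A_\beta$ requires at least $k+2$ conformist edges.
Since there are fewer than $(k!)^2$ pairs of shortest paths, we get the upper bound
$$\sum_{\alpha \neq \beta \in \mathcal{I}_k} P(A_\alpha\cap A_\beta) < (k!)^2 p_e^{k+2},$$
and the lower bound in (5) follows.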
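
To make the last step explicit, here is a short worked computation. It uses only the two bounds above and assumes the relation $\la_e=np_e$, which is what the later computation in this appendix indicates:
$$P(x \conn y)\;\geq\; k!\,p_e^k-(k!)^2p_e^{k+2}\;=\;k!\,p_e^k\left(1-k!\,p_e^2\right),$$
so for fixed $k$ the relative correction to the leading term $k!\,p_e^k$ is at most $k!\,\la_e^2/n^2=O(n^{-2})$.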
1610:
1611: The upper bound is a little more difficult to obtain (it is only here
1612: that we use $\lambda_e<1$) and we need some notation.
1613: Each genotype can be identified with the set of 1's that it contains,
1614: so for any two genotypes $u$ and $v$ we let $u \bigtriangleup v$ denote the set
of loci on which they differ. Notice that if $|u \bigtriangleup v|$
is even (resp.\ odd) then every path between $u$ and $v$ is of even (resp.\ odd)
1617: length because each mutation which alters the allele at a locus not in $u \bigtriangleup v$
1618: must later be compensated for.
1619:
To estimate the expected number of conformist pathways,
we will need to bound the number of paths of length $l$ between $x$ and $y$. This number is at most
$$ k!\binom{l}{m}m!n^{m}\quad \text{ where }\quad m=\frac{l-k}{2}.$$
We show this by the method of \cite{BKL1},
who obtain an estimate for the number of cycles of a given length through a fixed vertex of the cube.
1625:
1626: Given a path, say $x=v_0,v_1,\ldots,v_l=y$, between $x$ and $y$,
1627: let us associate the sequence
1628: $(\epsilon_1i_1,\ldots,\epsilon_l i_l)$
1629: where
1630: $$v_j \bigtriangleup v_{j-1}=\{i_j\}
1631: \quad\text{and}\quad
1632: \epsilon_j=
1633: \left\{
1634: \begin{array}{l}
+1\qquad\text{ if } v_j=v_{j-1}\cup\{i_j\} \\
1636: -1\qquad\text{ if } v_j=v_{j-1}\setminus\{i_j\}
1637: \end{array}
1638: \right.$$
1639: $j=1,\ldots,l$. Since distinct paths will have distinct sequences we
1640: can bound the number of paths by finding an upper bound for the
1641: number of sequences.
1642:
1643: Note that there must be $m+k$ positive entries, which occur at
1644: $\binom{l}{m+k}=\binom{l}{m}$ possible locations. The absolute
1645: values of $m$ of these entries are chosen freely from $\{1,\dots, n\}$, while
1646: the remaining $k$ must be the integers $1,\ldots,k$. There are
1647: $n^mk!$ ways to do this. We are free to order the $m$ negative
1648: entries and the bound follows.
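
As a quick sanity check on this bound (an illustration only, not part of the original argument), take $l=k$, so that $m=0$: the formula reduces to $k!\binom{k}{0}0!\,n^0=k!$, which is exactly the number of shortest paths counted earlier. For $l=k+2$, so that $m=1$, it gives
$$k!\binom{k+2}{1}1!\,n=(k+2)\,k!\,n,$$
reflecting one extra locus (at most $n$ choices) that is mutated and later restored, with $k+2$ possible positions for the single negative entry.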
1649:
1650: We now assume that $d(x,y)$ is even and relabel $d(x,y)=2k$.
1651: We omit the similar calculation for odd distances. Define
$b=-3k/(2\log\la_e)$ and $t=\lfloor b\log n\rfloor$. Then the
expected number of conformist paths between $x$ and $y$ of length greater than $2k$ can be bounded as follows:
1654: \begin{eqnarray*}\sum_{l\geq k+1} \sum_{\mathcal{I}_{2l}} p_e^{2l}&=&
1655: \sum_{k+1\leq l< t}
1656: \sum_{\mathcal{I}_{2l}}p_e^{2l}+\sum_{l\geq t}
1657: \sum_{\mathcal{I}_{2l}}p_e^{2l} \\
1658: &<&\sum_{k+1\leq l< t} \binom{2l}{l-k}n^{l-k}(l-k)!(2k)!p_e^{2l}
1659: +\sum_{l\geq t}n^{2l}p_e^{2l} \\
&\leq&\sum_{k+1\leq l< t}
1661: (2l)^{l-k}n^{l-k}p_e^{2(l-k)}(2k)!p_e^{2k}
1662: +\sum_{l\geq t}\la_e^{2l} \\
1663: &<&(2k)!p_e^{2k}\sum_{l\geq k+1}(2b\la_e p_e\log n)^{l-k}+O(\la_e^{2b\log n})
1664: \\
1665: &=&k (2k)!p_e^{2k} O(p_e\log{n})+O(n^{2b\log \la_e}) \\
1666: &=&k (2k)!p_e^{2k} O\left( n^{-1} \log{n} \right) .
1667: \end{eqnarray*}
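
The error term in the next-to-last line is absorbed because of the choice of $b$. As a sketch of this step, using $\la_e=np_e<1$ as in the third line of the display:
$$\la_e^{2b\log n}=n^{2b\log\la_e}=n^{-3k},\qquad
(2k)!\,p_e^{2k}\,\frac{\log n}{n}=(2k)!\,\la_e^{2k}\,\frac{\log n}{n^{2k+1}},$$
and $n^{-3k}=O(n^{-2k-1}\log n)$ because $3k\geq 2k+1$ for $k\geq 1$. Likewise, the geometric sum over $l\geq k+1$ is dominated by its first term, $2b\la_e p_e\log n$, since its ratio tends to $0$; the factor $k$ in the last two lines appears because $b$ is proportional to $k$.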
1668:
1669: \subsection*{Appendix B. Cluster structure under random pair incompatibilities.}
1670:
Here we show that, under the random pairwise incompatibilities model introduced in Section 5.1,
1672: connected clusters include large subcubes. The basic idea
1673: comes from \cite{BD}. A configuration $a\in \{0,1,*\}^n$
1674: is a way to specify a sub-cube of $\cG$, if $*$'s are thought of as places which could be filled
1675: by either a 0 or a 1. The number of non-$*$'s is the {\it length\/} of $a$. Call $a$
1676: an {\it implicant\/} if the entire sub-cube specified by $a$ is viable.
1677:
1678: We present two arguments, beginning with the one which
1679: works better for small $c$. Let the auxiliary random
1680: variable $X$ be the number of pairs of loci $(i,j)$, $i<j$, for which:
1681: \begin{itemize}
1682: \item[(E1)] There is exactly one incompatibility involving alleles on $i$ and $j$.
1683: \item[(E2)] There is no incompatibility involving an allele on either $i$ or $j$,
1684: and an allele on $k\notin\{i,j\}$.
1685: \end{itemize}
1686: Assume, without loss of generality, that the incompatibility
1687: which satisfies (E1) is $(1_i, 1_j)$. Then fitness of all
1688: genotypes which have any of the allele assignments $0_i0_j$, $0_i1_j$ and $1_i0_j$,
1689: and agree on other loci, is the same.
1690: Note also that all pairs of loci which satisfy (E1) and (E2) must be
1691: disjoint.
1692: Therefore, if $x$ is any viable genotype, its cluster contains
1693: an implicant with the number of $*$'s at least $X$ plus the number
1694: of free loci. To determine the size of $X$, note that the expectation
1695: $$
1696: E(X)={\binom{n}{2}}4p(1-p)^3(1-p)^{8(n-2)}\sim ce^{-4c}n
1697: $$
1698: and furthermore, by an equally easy computation,
1699: $$
1700: E(X^2)-E(X)^2=\cO(n),
1701: $$
1702: so that $X\sim ce^{-4c}n$ {\aas }
1703: It follows that every cluster
contains \aas~at least $\exp(((e^{-2c}+ce^{-4c})\log 2-\e)n)$
viable genotypes, for any $\e>0$.
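
For convenience, here is the computation behind $E(X)\sim ce^{-4c}n$, written as a sketch under the assumption that $p=c/(2n)$ (this is the scaling consistent with the stated limit; the precise definition is the one given in Section 5.1):
$$\binom{n}{2}4p(1-p)^3\sim\frac{n^2}{2}\cdot\frac{2c}{n}=cn,
\qquad (1-p)^{8(n-2)}\to e^{-4c}.$$
Under the same assumption, and taking a free locus to mean one involved in no incompatibility, a fixed locus is free with probability $(1-p)^{4(n-1)}\to e^{-2c}$, so the expected number of free loci is about $e^{-2c}n$. Each of the $X$ pairs contributes one further $*$, which gives a sub-cube with about $(e^{-2c}+ce^{-4c})n$ $*$'s and hence about $2^{(e^{-2c}+ce^{-4c})n}$ viable genotypes.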
1706:
1707:
1708: The second argument is a refinement of the one in \cite{BD}
1709: and only works better for larger $c$.
1710: Call an implicant $a$ a
1711: {\it prime implicant (PI)\/} if at any locus
1712: $i$, replacement of either $0_i$ or $1_i$
1713: by $*_i$ results in a non-implicant. Moreover, we call $a$ the {\it least prime
1714: implicant (LPI)\/} if it is a PI, and the following two conditions are
1715: satisfied. First, if all the $*$'s
1716: are changed to 0's, then no change from $1_i$ to $0_i$ results in a
1717: viable genotype.
1718: Second,
no change $*_i1_j$ to $1_i*_j$, where $i<j$, results in an implicant.
1720:
1721: Now, every viable genotype must have an LPI in its cluster.
1722: To see this, assume we have a PI for which the first condition is not satisfied. Make the
1723: indicated change, then replace some 0's and 1's by $*$'s
until you get a prime implicant. If the second
1725: condition is violated, make the resulting switch, then again
1726: make some replacement by $*$'s until you arrive at a PI.
1727: Either of these two operations moves within the same cluster, and
1728: keeps the number of 1's nonincreasing
1729: and their positions more to the left. Therefore, the procedure
1730: must at some point end, resulting in an LPI in the same cluster.
1731:
1732: For a sub-cube $a$ to be an LPI,
1733: the following conditions need to be satisfied:
1734: \begin{itemize}
1735: \item[(I1)] Every non-$*$ has to be compatible with every other non-$*$,
1736: and with both 0 and 1 on each of the $*$'s.
1737: \item[(I2)] Any of the four 0,1 combinations on any pair of $*$'s must be compatible.
1738: \item[(LPI1)] Pick an $i$ with allele 1, that is, a $1_i$.
1739: Then $0_i$ must be incompatible with at least
1740: one non-$*$, or at least one 0 on a $*$. Furthermore, if $0_i$ has an
1741: incompatibility
1742: with a 0 on a $*$ to its left, it has to have another incompatibility, either
1743: with a non-$*$, or with a 0 or a 1 on a $*$.
1744: \item[(LPI2)] Pick a $0_i$.
1745: Then $1_i$ must be incompatible with a non-$*$, or a 0 or a 1 on a $*$.
1746: \end{itemize}
1747: The first two conditions make $a$ an implicant, and the last two an LPI.
1748: Note also that these conditions are independent.
1749:
1750: Let now $X$ be the number of LPI of length $rn$. We will identify a
1751: function $L_4=L_4(r,c)$ such that
1752: $$
1753: \frac 1n\log E(X)\le L_4.
1754: $$
1755: Let
1756: $$
1757: L_1=L_1(\be,p,z)=z(\be\log p+(1-\be)\log(1-p)-\be\log\be-(1-\be)\log(1-\be)).
1758: $$
This is the exponential rate for the probability that in $zn$
Bernoulli trials with success probability $p$ there are exactly $\be zn$
successes, i.e., this probability is $\approx \exp(L_1n)$. Further,
1762: if $\kappa, \e,\de\in(0,1)$ are fixed, then among sub-cubes
1763: with $rn$ non-$*$'s and $\al n$ 1's ($\al\le r$), the proportion
1764: which have $\e n$ 1's in $[\kappa n, n]$ and $\de n$ $*$'s in
1765: $[1,\kappa n]$ has exponential rate
1766: $$
1767: \begin{aligned}
1768: L_2=&L_2(r,c,\kappa, \al, \e, \de)\\
1769: =&L_1((\al-\e)/\kappa, \al, \kappa)+L_1(\e/(1-\kappa), \al, 1-\kappa)\\
1770: &+L_1(\de/(\kappa-\al+\e), 1-r, \kappa-\al+\e)+ L_1((1-r-\de)/(1-\kappa-\e), 1-r, 1-\kappa-\e).
1771: \end{aligned}
1772: $$
1773: (Here all four first arguments in $L_1$ are in $[0, 1]$,
1774: or else the rate is $-\infty$.)
1775:
1776: The expected number of LPI, with $r,\kappa, \e,\de$ given as above, has exponential rate
1777: at most (and this is only an upper bound)
1778: $$
1779: \begin{aligned}
1780: L_3=&L_3(r,c,\kappa, \al, \e, \de)\\
1781: =&-(1-r)\log(1-r)-\al\log\al-(r-\al)\log(r-\al)\\
1782: &-c(1-r/2)^2\\
1783: &+(r-\al)\log(1-\exp(-c(1-r/2)))\\
1784: &+(\al-\e)\log(1-\exp(-c/2))+\e\log(1-\exp(-c/2)-{\textstyle\frac 12}\de c\exp(-c(1-r/2)))\\
1785: &+L_2(r,c,\kappa, \al, \e, \de).
1786: \end{aligned}
1787: $$
1788: The next to last line is obtained from (LPI1), as $\e n$ 1's must have
1789: $\de n$ $*$'s on their left.
1790:
1791: It follows that $L_4$ can be obtained by
1792: $$
1793: L_4(r,c)=\inf_\kappa\sup_{\al, \e,\de} L_3(r,c,\kappa, \al, \e, \de).
1794: $$
If $L_4(r,c)<0$, all LPI (for this $c$) \aas~have length at most $rn$. Numerical computations
1796: show that this gives a better
1797: bound than $1-e^{-2c}-ce^{-4c}$ for $c\ge 0.38$. Let us denote the
1798: best upper bound from the two estimates by $r_u(c)$. This function
1799: is computed numerically and plotted in Figure 3.
1800:
1801: \begin{figure*}[t]
1802: \begin{center}
1803: {\includegraphics[clip, height=5cm]{rm.ps}
1804: }
1805: \end{center}
1806:
1807: \caption{The upper bound $r_u(c)$ for the number of non-$*$'s
1808: in the implicant of smallest length included in every cluster
1809: of viable genotypes, plotted against $c$.
1810: }
1811: \label{fig_ap_a}
1812: \end{figure*}
1813:
1814: \subsection*{Appendix C. Number of clusters under random pair incompatibilities}
1815:
1816: In this section we briefly explain why the number of clusters
1817: under random pair incompatibilities is asymptotically
1818: a function of a Poisson random variable. There is a
1819: clear way to separate the genotype space into disconnected clusters.
1820: For example, if $F_1=\{(0_1,0_2), (1_2,0_3),(1_1,1_2)\}$, we
1821: see that every viable genotype has one of these two allele configurations
1822: on the first two loci: $C=0_11_2$ or $\overline{C}=1_10_2$.
Since there are no viable genotypes with $0_10_2$ or $1_11_2$,
1824: there is no way to mutate from the viable genotypes with $0_11_2$
1825: to the viable genotypes with $1_10_2$ without passing through an inviable genotype.
1826: However, if we add one incompatibility to $F_1$ to make
1827: $F_2=F_1\cup\{(0_1,1_2)\}$,
1828: then there are no longer any genotypes with the alleles $0_11_2$
1829: and we return to a single cluster of viable genotypes.
1830:
1831: Notice that the digraph $D_{F_1}$ contains the directed
1832: cycle $1_1 \to 0_2 \to 1_1$ and equivalently the directed cycle
$1_2 \to 0_1 \to 1_2$. $D_{F_2}$ also contains these
1834: cycles but there are paths between them as well: $0_2 \to 0_1$ and $1_1 \to 1_2$.
1835:
1836: Formally, a pair of complementary allele configurations
1837: $(C,\overline{C})$ on a set of $k \geq 2$ loci is defined to
1838: be a {\it splitting pair\/} if the digraph $D_F$ contains a directed cycle
1839: (in any order) on the alleles in $C$ (and equivalently on those in $\overline{C}$,
1840: which consist of reversed alleles in $C$)
1841: and does not contain a path between the alleles in $C$ and the alleles in $\overline{C}$.
1842: It should be clear from the example $F_1$ above that the existence
1843: of a splitting pair will create a barrier in the genotype
1844: space through which it is not possible to pass by mutations on viable genotypes.
1845: In fact, it is proved in Pitman (unpub.) that
1846: any two viable genotypes $u$ and $v$ will be disconnected
1847: in the fitness landscape if and only if the loci on which they
1848: differ contain a splitting pair.
1849:
1850:
1851: Thus, the existence of viable genotypes on either side of
1852: a splitting pair (with each configuration of complementary alleles)
1853: ensures disconnected clusters. If there are $k$ splitting pairs in the
1854: formula $F$ and there are viable genotypes with each of the allele
1855: configurations in each of the splitting pairs then there are $2^k$ clusters
1856: of viable genotypes.
1857: The restriction that there be viable genotypes on either side is asymptotically
1858: unlikely to make a difference as we can
1859: fix one of the $2^k$ configurations of alleles and \aas~find a
1860: viable genotype on the remaining loci. Therefore the number of
1861: clusters of viable genotypes is \aas~equal to $2^X$, where
$X$ is the number of splitting pairs, provided that $X$ is
stochastically bounded. This is indeed the case, since we will see shortly that the expectation
$E(X)$ is bounded. In fact, the next paragraph suggests
that $X$ converges in distribution to a Poisson limit.
1866: (A detailed discussion of this issue will appear in Pitman (unpub.).)
1867:
1868:
1869: It follows from \cite{Pal} or \cite{Bol}
that the number of directed cycles of length $k$ in $D_F$ is asymptotically
Poisson$(\lambda_k)$ with $\lambda_k = (2k)^{-1}c^k$.
In particular, the expected number of splitting pairs converges to
$\lambda=-\frac{1}{2} (\ln(1-c)+c)$.
1874: Moreover, the probability that there is no splitting pair
1875: converges to the product of the probabilities that the cycle of each length is absent
1876: \citep{Pal}, which is
1877: \begin{equation}
1878: \prod_{k=2}^\infty \exp{\left(-\frac{c^k}{2k}\right)} =
1879: \exp{\left(\frac{ \ln{(1-c)}+c}{2}\right)} = [(1-c)e^c]^{\frac{1}{2}}.
1880: \end{equation}
1881: In particular, this gives the limiting probability of a unique cluster.
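
To see where the closed form comes from, recall the series $-\ln(1-c)=\sum_{k\ge 1}c^k/k$, valid for $0\le c<1$. Summing the Poisson parameters over the cycle lengths $k\ge 2$ gives
$$
\sum_{k=2}^\infty \frac{c^k}{2k}=\frac 12\left(-\ln(1-c)-c\right)=\lambda.
$$
As a purely illustrative numerical instance (the value $c=1/2$ is our choice and is not singled out in the argument above), $\lambda\approx 0.0966$ and the limiting probability of a unique cluster is $[(1-c)e^c]^{1/2}\approx 0.908$.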
1882:
1883: \subsection*{Appendix D. Proof of equation~(\ref{gamma}).}
1884:
1885: In this section we assume that genotypes have multiallelic loci, which are
1886: subject to random pair incompatibilities. The model introduced in Section 5.2
1887: is the most natural, but is not best suited for our second moment approach.
1888: Instead, we will work with the equivalent modified
1889: model with $m$ pair incompatibilities, each
1890: chosen independently at random, and the first and the second member of each pair
1891: chosen independently from the $an$ available alleles. We will assume
1892: that $m=\frac 14ca^2n$, label $c'=\frac 14c$, and denote, as usual, the resulting set
1893: of incompatibilities by $F$.
1894:
1895: To see that these two models are equivalent for our purposes,
1896: first note that the number of incompatibilities which are
1897: {\it not legitimate\/}, in the sense that the two alleles are chosen
1898: from the same locus, is stochastically bounded in $n$. (In fact, it
1899: converges in distribution to a Poisson($c'a^2$) random variable.)
1900: Moreover, by the Poisson approximation to the birthday problem
1901: \citep{BHJ}, the number of pairs of
1902: choices which result in the same incompatibility in this model is
1903: asymptotically Poisson($c'a^2/2$).
In short, then, the procedure results in $m-\cO(1)$ distinct legitimate
incompatibilities. If $m$ in the modified model is increased to, say, $m'=m+n^{2/3}$, then the
two models can be coupled so that
the incompatibilities in the original model are included in those in the modified model. As
the existence of a viable genotype becomes less likely when $m$ is increased, this demonstrates
1909: that~(\ref{gamma}) will follow once we show
1910: the following for the modified model:
1911: for every $\e>0$ there exists a large enough $a$ so
1912: that $c'<\log a-\e$ implies that
1913: $N\ge 1$ \aas
1914:
1915: To show this, we introduce the auxiliary random variable
1916: $$
1917: X=\sum_{\sigma \in \cG_a}\prod_{I\in F}\left(w_01_{\{|I\cap\sigma|=0\}}+
1918: w_11_{\{|I\cap\sigma|=1\}}\right),
1919: $$
1920: where $1_A$ is the indicator of the set $A$.
1921: The size of the intersection $I\cap\sigma$ is computed by transforming
1922: both the incompatibility $I$ and
1923: the genotype $\sigma$ to
1924: sets of (indexed) alleles, and
1925: the weights $w_0$ and $w_1$ will be chosen later. To intuitively understand the
1926: statistic $X$, note that when $w_0=w_1=1$, the product is exactly the indicator of the
1927: event that $\sigma$ is viable and $X$ is then the number of viable genotypes $N$. In general,
$X$ gives different scores to different viable genotypes; however, the crucial fact to note
is that, for positive weights $w_0$ and $w_1$, $X>0$ iff $N>0$. Therefore
1930: $$
1931: P(N>0)= P(X>0)\ge (E(X))^2/E(X^2),
1932: $$
1933: which is how the second moment method is used \citep{AM}.
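
The inequality is an instance of the Cauchy--Schwarz inequality; we sketch the standard argument for completeness. Assuming nonnegative weights, $X\ge 0$ and $X=X\,1_{\{X>0\}}$, so
$$
(E(X))^2=\left(E\left(X\,1_{\{X>0\}}\right)\right)^2\le E(X^2)\,E\left(1_{\{X>0\}}^2\right)=E(X^2)\,P(X>0).
$$
In particular, any bound of the form $E(X^2)\le C\,(E(X))^2$ yields $P(N>0)\ge 1/C$.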
1934:
1935: As
1936: $$
1937: \begin{aligned}
1938: &P(|\sigma\cap I|=0)=\left(\frac {a-1}a\right)^2, \\
1939: &P(|\sigma\cap I|=1)=\frac {2(a-1)}{a^2}, \\
1940: \end{aligned}
1941: $$
1942: we have
1943: $$
1944: E(X)=a^n\left(w_0\left(\frac {a-1}a\right)^2+w_1\frac {2(a-1)}{a^2}\right)^m.
1945: $$
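
For orientation, consider the unweighted case $w_0=w_1=1$, in which $X=N$. The expression in parentheses then equals $1-a^{-2}$ and, with $m=c'a^2n$,
$$
\left(E(N)\right)^{1/n}=a\left(1-\frac 1{a^2}\right)^{c'a^2}\le a\,e^{-c'}.
$$
Hence $E(N)\to 0$ exponentially fast whenever $c'>\log a$, that is, whenever $c>4\log a$. This first moment computation is consistent with the easy upper bound $4\log a$ on $\gamma$; the second moment argument that follows addresses the lower bound.
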
1946: Moreover
1947: $$
E(X^2)=\sum_{k=0}^n a^n \binom{n}{k}(a-1)^k\left(w_0^2 P(00)+2w_0w_1P(01)+w_1^2P(11)\right)^m,
1949: $$
1950: where $P(01)$ is the probability that $I$ has
1951: intersection of size $0$ with $\sigma=0_1\dots0_k0_{k+1}\dots 0_n$ and of size $1$ with
1952: $\tau=1_1\dots1_k0_{k+1}\dots 0_n$, and $P(00)$ and $P(11)$ are defined analogously. Thus, if $k=\al n$,
1953: $$
1954: \begin{aligned}
1955: &P(00)=\left(1-\frac{1+\al}a\right)^2,\\
1956: &P(01)=\frac{2\al}a\left(1-\frac{1+\al}a\right),\\
1957: &P(11)=\frac{2(1-\al)}a\left(1-\frac{1+\al}a\right)+2\left(\frac\al a\right)^2.
1958: \end{aligned}
1959: $$
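
These formulas follow from counting alleles; the following short justification uses only the definitions above. The genotypes $\sigma$ and $\tau$ share $n-k$ alleles, each has $k$ alleles not present in the other, and the remaining $an-(n+k)$ alleles belong to neither. A uniformly chosen allele therefore lies in $\sigma\cap\tau$, in $\sigma\setminus\tau$, in $\tau\setminus\sigma$, or outside $\sigma\cup\tau$ with respective probabilities
$$
\frac{1-\al}a,\qquad \frac \al a,\qquad \frac \al a,\qquad 1-\frac{1+\al}a.
$$
For instance, $P(11)$ collects two cases: either one member of $I$ lies in $\sigma\cap\tau$ and the other outside $\sigma\cup\tau$, or one member lies in $\sigma\setminus\tau$ and the other in $\tau\setminus\sigma$; the factors of $2$ account for the order of the two members. As a check, at $\al=0$ (so that $\tau=\sigma$) the formulas reduce to $P(00)=\left(\frac{a-1}a\right)^2$, $P(01)=0$ and $P(11)=\frac{2(a-1)}{a^2}$, in agreement with the probabilities used for $E(X)$.
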
Let $\Lambda=\Lambda_{a, w_0, w_1}(\al)$ be the limit, as $n\to\infty$, of the $n$'th root of the ratio between the
$k=(\al n)$'th term in the sum for $E(X^2)$ and $E(X)^2$. By Stirling's formula, and as $m/n=c'a^2$,
1962: $$
1963: \begin{aligned}
1964: \Lambda=&\frac{(a-1)^\al}{a\cdot \al^\al(1-\al)^{1-\al}}\\
1965: &\times \frac
1966: {\left( w_0^2\left(1-\frac{1+\al}a\right)^2+4w_0w_1\frac{\al}a\left(1-\frac{1+\al}a\right)
1967: +2w_1^2\left(\frac{(1-\al)}a\left(1-\frac{1+\al}a\right)+\left(\frac\al a\right)^2\right)
1968: \right)^{c'a^2}}
1969: {\left(w_0\left(\frac {a-1}a\right)^2+w_1\frac {2(a-1)}{a^2}\right)^{2c'a^2}}.
1970: \end{aligned}
1971: $$
1972: Let $\al^*=(a-1)/a$. A short computation shows that $\Lambda=1$ when $\al=\al^*$.
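
The computation can be organized as follows. At $\al=\al^*$ we have $1-\al^*=1/a$, so $(\al^*)^{\al^*}(1-\al^*)^{1-\al^*}=(a-1)^{\al^*}/a$ and the first factor of $\Lambda$ equals $1$. Moreover,
$$
1-\frac{1+\al^*}a=\frac{(a-1)^2}{a^2},\qquad \frac{\al^*}a=\frac{a-1}{a^2},\qquad \frac{1-\al^*}a=\frac 1{a^2},
$$
so the base of the numerator in the second factor becomes
$$
\frac{w_0^2(a-1)^4+4w_0w_1(a-1)^3+4w_1^2(a-1)^2}{a^4}
=\left(w_0\left(\frac{a-1}a\right)^2+w_1\frac{2(a-1)}{a^2}\right)^2,
$$
which is exactly the square of the base of the denominator. Hence $\Lambda(\al^*)=1$ for any choice of the weights.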
1973:
1974: If $\Lambda>1$ for some $\al$, then $E(X^2)/(E(X))^2$ increases exponentially and
1975: the method fails (as we will see below,
1976: this always happens when $w_0=w_1=1$, i.e.,
1977: when $X=N$). On the other hand, if $\Lambda<1$ for $\al\ne\al^*$, and
1978: $\frac{d^2\Lambda}{d\al^2}(\al^*)<0$, then
1979: Lemma 3 from \cite{AM} implies that $E(X^2)/(E(X))^2\le C$ for some constant
1980: $C$, which in turn implies that $P(N>0)\ge 1/C$. The sharp threshold
1981: result then finishes off the proof of~(\ref{gamma}).
1982:
1983: Our aim then is to show that $w_0$ and $w_1$ can be chosen so that, for $c'=\log a-\e$,
1984: $\Lambda$ has the properties described in the above paragraph.
1985: We have thus reduced the proof of~(\ref{gamma}) to a calculus problem.
1986:
Certainly a necessary condition is that $\frac{d\Lambda}{d\al}(\al^*)=0$, and
1988: $$
1989: \frac{d\Lambda}{d\al}(\al^*)=-\frac 2{a^3}(w_0(a-1)-w_1(a-2))^2,
1990: $$
1991: so we choose $w_0=a-2$ and $w_1=a-1$. (Only the quotient between $w_0$ and $w_1$
1992: matters, so a single equation is enough.) This simplifies $\Lambda$ to
1993: $$
1994: \Lambda=\Lambda_a(\al)=\frac{(a-1)^\al}{a\al^\al(1-\al)^{1-\al}}
1995: \cdot
1996: \frac
{\left(\left(\al -\frac{a-1}a\right)^2+\frac{(a-1)^4}{a^2}\right)^{c'a^2}}
1998: {\left(\frac{(a-1)^2}a\right)^{2c'a^2}}.
1999: $$
2000: Let $\varphi=\log\Lambda$. We need to demonstrate that $\varphi<0$ for $\al\in[0,\al^*)\cup (\al^*, 1]$
2001: and that $\varphi''(\al^*)<0$. A further simplification can be obtained
2002: by using $x-Cx^2\le \log(1+x)\le x$ (valid for all nonnegative $x$),
2003: which enables us to transform $\varphi$ (without changing the
2004: notation) to
2005: $$
2006: \varphi(\al)=c'\frac{a^4}{(a-1)^4}\left(\al -\frac{a-1}a\right)^2
2007: -\al \log\al-(1-\al)\log(1-\al)+\al\log(a-1)-\log a.
2008: $$
2009: Now
2010: $$
2011: \varphi''(\al)=2c'\frac{a^4}{(a-1)^4}-\frac 1{\al(1-\al)}.
2012: $$
In particular, when $c'$ is large but $c'=o(a)$, $\varphi''(\al^*)<0$ for large $a$. Moreover,
$\varphi$ cannot have a local maximum where $\varphi''>0$. If
2015: $\varphi(\al)\ge 0$ for some $\al\ne\al^*$, then this must happen for an $\al$
2016: in one of the two intervals
2017: $[0, 1/(2c')+\cO((c')^{-2})]$ or $[1- 1/(2c')-\cO((c')^{-2}), 1]$.
2018: Now, $\varphi$ has a unique
2019: maximum at $\al^*$ in the second interval. In the first interval,
2020: a short computation shows that
2021: $$
2022: \varphi(\al)=-\e-\al \log a+\cO\left(\frac{\log\log a}{\log a}\right),
2023: $$
2024: which is negative for large $a$. This ends the proof.
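
The two facts about $\varphi''$ used above can be made explicit. Since $\al^*(1-\al^*)=(a-1)/a^2$,
$$
\varphi''(\al^*)=2c'\frac{a^4}{(a-1)^4}-\frac{a^2}{a-1},
$$
which is negative as soon as $2c'a^2<(a-1)^3$, in particular for large $a$ when $c'=o(a)$. Likewise, $\varphi''(\al)\le 0$ exactly when $\al(1-\al)\le (a-1)^4/(2c'a^4)$. For large $a$ and $c'$ the right-hand side is approximately $1/(2c')$, so the solution set consists of a neighborhood of $0$ and a neighborhood of $1$, each of width approximately $1/(2c')$; this is the origin of the two intervals.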
2025:
2026: This method yields nontrivial lower bounds for $\gamma$ for all $a\ge 3$,
2027: cf.~Table 1.
2028:
2029: \begin{center}
2030: \renewcommand{\arraystretch}{.75}
2031: \begin{table}[t]
\caption{The lower bounds on $\gamma$ obtained by the method described in
the text, compared to the easy upper bound $4\log a$.
2034: }%
2035: \label{t2x1_prel}%
2036: {\normalsize \vspace{.2in} }
2037: \par
2038: \begin{center}
2039: {\normalsize
2040: \begin{tabular} {|r||r|r|}\hline
2041: $a$ & l.~b.~on $\gamma$ & $4\log a$\\ \hline\hline
2042: 3 & 1.679 & 4.395\\
2043: 4 & 2.841 & 5.546\\
2044: 5 & 3.848 & 6.438\\
2045: 6 & 4.714 & 7.168\\
2046: 7 & 5.467 & 7.784\\
2047: 8 & 6.128 & 8.318\\
2048: 9 & 6.715 & 8.789\\
2049: 10 & 7.242 & 9.211\\
2050: 20 & 10.672 & 11.983\\
2051: 30 & 12.608 & 13.605\\
2052: 40 & 13.944 & 14.756\\
2053: 50 & 14.960 & 15.649\\
2054: 100 & 18.017 & 18.421\\
2055: 200 & 20.982 & 21.194\\
2056: 300 & 22.663 & 22.816\\
2057: 400 & 23.846 & 23.966\\
2058: 500 & 24.759 & 24.859\\
2059: \hline
2060: \end{tabular}
2061: }
2062: \end{center}
2063: \end{table}
2064: \end{center}
2065:
2066: \subsection*{Appendix E. Existence of viable phenotypes.}
2067:
2068: In this section we describe a comparison between models from Sections 5.2 and 5.3
2069: that will yield the result in Section 5.3.
2070: We begin by assuming that $a=1/r$ is an integer, which we can do without loss of generality.
2071: Divide the $i$'th coordinate interval $[0,1]$ into $a$ disjoint intervals $I_{i0},\dots, I_{i,{a-1}}$
2072: of length $r$. For a phenotype $x\in \cP$ let $\Delta(x)\in \cG_a$ be determined
2073: so that $\Delta(x)_i=j$ iff $x_i\in I_{ij}$.
2074:
2075: Note that, as soon as $I_{i_1j_1}\times I_{i_2j_2}$ contains a point in
$\Pi_{i_1i_2}$, no $x$ with $\Delta(x)_{i_1}=j_1$ and $\Delta(x)_{i_2}=j_2$
2077: is viable. This happens independently for each such Cartesian product,
with probability $1-\exp(-\la r^2)\sim cr^2/(2n)$, for large $n$.
2079: Therefore, using the result from Section 5.2, when $cr^2>4\log a=-4\log r$, there is
2080: \aas~no viable
2081: genotype.
2082:
2083: On the other hand, let $I^\e$ be the closed $\e$-neighborhood of the interval
2084: $I$ in $[0,1]$ (the set of points within $\e$ of $I$), and consider
2085: the events that $I_{i_1j_1}^{r/2}\times I_{i_2j_2}^{r/2}$ contains a point in
2086: $\Pi_{i_1i_2}$. These events are independent if we restrict $j_1,j_2$
2087: to even integers. Moreover, each has probability
2088: $1-\exp(-4\la r^2)\sim 4cr^2/(2n)$, for large $n$.
It again follows from Section 5.2 that a viable phenotype $x$
with $\Delta(x)_i$ even for all $i$ \aas~exists as soon as
$4cr^2<4(\log (a/2)-o(1))=-4\log r-4\log 2-o(1)$.
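
Dividing by four, the last condition reads $cr^2<\log(a/2)-o(1)=-\log r-\log 2-o(1)$. Together with the first comparison, this locates the transition, for small $r$, in the range
$$
-\log r-\log 2-o(1)\;\le\; cr^2\;\le\; -4\log r.
$$
Below the lower end a viable phenotype \aas~exists, and above the upper end \aas~none does. This summary only restates the two comparisons above and inherits their assumptions.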
2092:
2093: \end{document}