0611:q-bio0611074/fl.tex

1:

2: \documentclass[letterpaper, 11pt]{article}

3: \usepackage{amsmath}

4: \usepackage{amsfonts}

5: \usepackage{amsthm}

6: \usepackage{amssymb}

7: \usepackage{mathrsfs}

8: \usepackage[hang]{subfigure}

9: \usepackage{graphicx,epsfig,fancyheadings,wasysym,psfrag}

10:

11: \usepackage{times}

12:

13: \pagestyle{fancy}

14:

15: \rhead[\thepage]{\thepage}

16: \cfoot{}

17: \usepackage{natbib}

18: \bibliographystyle{apalike}

19:

20: \DeclareMathOperator{\var}{var}

21: \DeclareMathOperator{\cov}{cov}

22: \DeclareMathOperator{\corr}{corr}

23:

24: \newcommand{\al}{\alpha}

25: \newcommand{\be}{\beta}

26: \newcommand{\de}{\delta}

27: \newcommand{\e}{\epsilon}

28: \newcommand{\g}{\gamma}

29: \newcommand{\ka}{\kappa}

30: \newcommand{\la}{\lambda}

31: \newcommand{\sig}{\sigma}

32:

33: \newcommand{\bR}{{\mathbb R}}

34: \newcommand{\bZ}{{\mathbb Z}}

35: \newcommand{\bQ}{{\mathbb Q}}

36: \newcommand{\bT}{{\mathbb T}}

37: \newcommand{\cC}{{\mathcal C}}

38: \newcommand{\cA}{{\mathcal A}}

39: \newcommand{\cF}{{\mathcal F}}

40: \newcommand{\cG}{{\mathcal G}}

41: \newcommand{\cH}{\mathcal H}

42: \newcommand{\cU}{\mathcal U}

43: \newcommand{\cY}{{\mathcal Y}}

44: \newcommand{\cZ}{{\mathcal Z}}

45: \newcommand{\cR}{{\mathcal R}}

46: \newcommand{\cL}{{\mathcal L}}

47: \newcommand{\cN}{{\mathcal N}}

48: \newcommand{\cV}{{\mathcal V}}

49: \newcommand{\cW}{{\mathcal W}}

50: \newcommand{\cM}{{\mathcal M}}

51: \newcommand{\cO}{{\mathcal O}}

52: \newcommand{\cP}{{\mathcal P}}

53: \newcommand{\cT}{{\mathcal T}}

54: \newcommand{\cB}{{\mathcal B}}

55: \newcommand{\cS}{{\mathcal S}}

56: \newcommand{\cE}{{\mathcal E}}

57:

58:

59: \newcommand{ \dist}{\mathrm{dist}}

60: \newcommand{ \co}{\mathrm{co}}

61: \newcommand{ \xor}{{\,\mathrm{xor}\,}}

62: \newcommand{\conn}{\leftrightsquigarrow}

63: \newcommand{\notconn}{{\,\,\leftrightsquigarrow\!\!\!\!\!\!\!\!/\;\,\,\,}}

64:

65: \newcommand{\frhalf}{{\textstyle \frac 12}}

66: \newcommand{\frquarter}{{\textstyle \frac 14}}

67:

68: \newcommand{\aas}{a.~a.~s.}

69:

70: \setlength{\textwidth}{16cm}

71: \setlength{\textheight}{21cm}

72: \setlength{\oddsidemargin}{0cm}

73: \setlength{\evensidemargin}{0cm}

74: \setlength{\topmargin}{0cm}

75: \setlength{\parskip}{1ex}

76:

77: \newtheorem{thm}{Theorem}

78: \newtheorem{lemma}{Lemma}[section]

79:

80:

81: \begin{document}

82:

83: \title{Percolation on fitness landscapes: effects of correlation, phenotype, and incompatibilities}

84: \author{Janko Gravner$^*$, Damien Pitman$^*$, and Sergey Gavrilets$^{\dag\ddag}$\\

85: $^*$Department of Mathematics, University of California, Davis, CA 95616,\\

86: $^{\dag}$Departments of Ecology and Evolutionary Biology

87: and Mathematics, \\

88: University of Tennessee, Knoxville, TN 37996, USA.\\

89: $^\ddag$corresponding author.

90: Phone: 865-974-8136,\

91: fax: 865-974-3067,\\

92: email: gavrila@tiem.utk.edu}

93:

94: \maketitle

95:

96:

97: \newpage

98:

99:

100: {\bf Abstract}\quad

101: We study how correlations in the random fitness assignment may affect the structure

102: of fitness landscapes. We consider three classes of fitness models. The

103: first is a continuous phenotype space in which individuals are characterized

104: by a large number of continuously varying traits such as size, weight, color, or

105: concentrations of gene products which directly affect fitness.

106: The second is a simple model that explicitly describes genotype-to-phenotype

107: and phenotype-to-fitness maps allowing for neutrality at both phenotype and fitness

108: levels and resulting in a fitness landscape with tunable correlation length.

109: The third is a class of models in which particular combinations of alleles or

110: values of phenotypic characters are ``incompatible'' in the sense that the

111: resulting genotypes or phenotypes have reduced (or zero) fitness.

112: This class of models

113: can be viewed as a generalization of the canonical Bateson-Dobzhansky-Muller

114: model of speciation.

115: We also demonstrate that the discrete $NK$ model shares some signature properties of models

116: with high correlations.

117: Throughout the paper, our focus is on the percolation threshold, on the number, size and

118: structure of connected clusters, and on the number of viable genotypes. \\

119:

120:

121: {\bf Key words}: fitness landscapes, percolation, nearly neutral networks, genetic incompatibilities

122:

123: \section{Introduction}

124:

125: The notion of fitness landscapes, introduced by the theoretical evolutionary biologist Sewall

126: Wright in \citeyear{wri32} (see also \citealt{kau93,gav04}), has proved extremely useful both in

127: biology and well outside of it. In the standard interpretation, a fitness landscape is a relationship

128: between a set of genes (or a set of quantitative characters) and a measure of fitness

129: (e.g. viability, fertility, or mating success). In Wright's original formulation the set of

130: genes (or quantitative characters) is the property of an individual. However, the notion of

131: fitness landscapes can be generalized to the level of a mating pair, or even a population of

132: individuals \citep{gav04}.

133:

134: To date, most empirical information on fitness landscapes in biological applications has come from studies

135: of RNA (e.g., \citealt{sch95,huy96b,fon98b}),

136: proteins (e.g., \citealt{lip91,mar96,ros97}),

137: viruses (e.g.,  \citealt{bur99,bur04}),

138: bacteria (e.g., \citealt{ele03,woo06}),

139: and artificial life (e.g.,  \citealt{len99,wil01c}).

140: The three paradigmatic landscapes --- rugged, single-peak,

141: and flat --- emphasizing particular

142: features of fitness landscapes have been the focus of most of the earlier theoretical work

143: (reviewed in \citealt{kau93,gav04}). These landscapes have found numerous applications with regard to the dynamics

144: of adaptation (e.g., \citealt{kau87,kau93,orr06a,orr06b})

145: and neutral molecular evolution (e.g., \citealt{der91}).

146:

147: More recently, it was realized that the dimensionality of most biologically interesting

148: fitness landscapes is enormous and that this huge dimensionality brings some new properties

149: which one does not observe in low-dimensional landscapes (e.g. in two- or three-dimensional

150: geographic landscapes). In particular, multidimensional landscapes are generically characterized

151: by the existence of neutral and nearly neutral networks (also referred to as holey fitness

152: landscapes) that extend throughout the landscapes

153: and that can dramatically affect the evolutionary dynamics of the populations

154: \citep{gav97,gav97b,rei97b,gav04,rei01a,rei01b,rei02}.

155:

156: An important property of fitness landscapes is their correlation pattern. A common measure

157: for the strength of dependence

158: is the {\it correlation function\/} $\rho$ measuring the correlation of

159: fitnesses of pairs of individuals at a distance (e.g., Hamming) $d$ from each other in the

160: genotype (or phenotype) space:

161: 	\begin{equation} \label{rho}

162: 		\rho(d)=\frac{\cov[w(.),w(.)]_d}{\var(w)}

163: 	\end{equation}

164: \citep{eig89}. Here, the term in the numerator is the covariance of fitnesses

165: of two individuals conditioned on them being at distance $d$, and

166: $\var(w)$ is the variance in fitness over the whole fitness landscape.

167: For uncorrelated landscapes, $\rho(d)=0$ for $d > 0$. In contrast,

168: for highly correlated landscapes, $\rho(d)$ decreases with $d$ very slowly.

169:

170: The aim of this paper is to extend our previous work \citep{gav97b} in a number of directions

171: paying special attention to the question of how correlations in the

172: random fitness assignment may affect the structure of genotype and phenotype spaces.

173: For the resulting random fitness landscapes, we shed some

174: light on issues such as the number of viable genotypes,

175: number of connected clusters of viable genotypes and

176: their size distribution, existence thresholds, and

177: number of possible fitnesses.

178:

179: To this end, we introduce a variety of models,

180: which could be divided into two essentially different

181: classes: those with local correlations, and

182: those with global correlations. As we will see, techniques

183: used to analyze these models, and answers we obtain, differ

184: significantly. We use a mixture of analytical and computational techniques;

185: it is perhaps necessary to point out that these models

186: are very far from trivial, and one is quickly led to

187: outstanding open problems in probability theory and computer science.

188:

189: We start (in Section 2) by briefly reviewing some results from \cite{gav97b}.

190: In Section 3 we generalize these results for the case of a continuous

191: phenotype space when individuals are characterized by a large number

192: of continuously varying traits such as size, weight, color, or the

193: concentrations of some gene products. The latter interpretation

194: of the phenotype space may be particularly relevant given the rise of

195: proteomics and the growing interest in gene regulatory networks.

196:

197: The main idea behind our local correlations model studied in Section 4

198: is fitness assignment {\it conformity\/}. Namely, one randomly divides

199: the genotype space into components which are forced to have

200: the same phenotype; then, each different phenotype is independently assigned a random fitness.

201: This leads to a simple two-parameter

202: model, in which one parameter determines the density of viable genotypes,

203: and the other the correlations between them.

204: We argue that the probability of existence of a giant cluster (which swallows a positive

205: proportion of all viable genotypes) is a non-monotone function of the correlation

206: parameter and identify the critical surface at which this probability jumps

207: almost from 0 to 1. In Section 4 we also investigate the effects of

208: interaction between conformity structure and fitness assignment.

209:

210: Section 5 introduces our basic global correlation

211: model, one in which genotypes are eliminated due to random pairwise

212: {\it incompatibilities\/} between alleles. This is

213: equivalent to a random version of the {\tt SAT} problem,

214: which is the canonical constraint satisfaction problem in computer

215: science. In general, a {\tt SAT} problem involves a set of Boolean variables

216: and their negations that are strung together with {\tt OR} symbols into

217: {\it clauses\/}.  The {\it clauses\/} are joined by {\tt AND} symbols

218: into a {\it formula\/}. A {\tt SAT} problem asks one to decide whether

219: the variables can be assigned values that will make the formula true.

220: An important special case, $K$-{\tt SAT}, has the length of each clause fixed at $K$.

221: Arguably, {\tt SAT} is the most important class of problems in complexity theory.

222: In fact, the general {\tt SAT} was the first known

223: NP-complete problem and was established as such by S. Cook in 1971 (\citealt{Coo}).

224: Even considerable simplifications, such as the {\tt $3$-SAT} (see Section 5.4), remain NP-complete,

225: although {\tt $2$-SAT} (see Section 5.1) can be solved efficiently by a simple algorithm.

226: See e.g. \cite{KV} for a comprehensive presentation of the theory. Difficulties

227: in analyzing random  {\tt SAT} problems, in which formulas are chosen at random,

228: in many ways mirror their complexity classes, but even random {\tt $2$-SAT}

229: presents significant challenges \citep{dlV, BKL2}. In our present interpretation, the main reason

230: for these difficulties is that correlations are so high that the expected number

231: of viable genotypes

232: may be exponentially large, while at the same time the probability

233: that even one viable genotype exists is very low. In Section 5, we further

234: illuminate this issue by showing that connected viable clusters

235: must contain fairly large sub-cubes, and that the number of such clusters

236: is, in a proper interpretation, finite. The relevance to both types of

237: models for discrete and continuous

238: phenotype spaces is also discussed, with particular emphasis on the

239: existence of viable phenotypes in the presence of incompatibilities.

240: Section 5 also contains a brief review

241: of the existing theory on higher order incompatibilities.
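
To make the correspondence with {\tt SAT} concrete, here is a small illustrative instance (our own toy example with hypothetical incompatibilities, not one analyzed in Section 5). Encode the allele at locus $i$ by a Boolean variable $x_i$, which is true when the allele is $1$. Suppose that allele $1$ at locus 1 is incompatible with allele $1$ at locus 2, and that allele $0$ at locus 2 is incompatible with allele $1$ at locus 3. Each incompatibility forbids one pair of values and thus contributes one clause of length 2:
\[
(\neg x_1\vee \neg x_2)\wedge (x_2\vee \neg x_3).
\]
A genotype is viable exactly when it satisfies the formula. For $n=3$, the first clause eliminates $110$ and $111$, the second eliminates $001$ and $101$, and the four remaining genotypes $000$, $010$, $011$, $100$ are viable; in this instance they happen to form a single connected cluster.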

242:

243:

244: In Section 6 we demonstrate how the discrete

245: $NK$ model shares some signature properties of models

246: with high correlations. In Section 7 we summarize our results

247: and discuss their biological relevance.

248: The proofs of our major results are relegated to Appendices A--E.

249:

250:

251: \section{The basic case: binary hypercube and independent binary fitness}

252:

253: We begin with a brief review of the basic setup, from \cite{gav97b}

254: and \cite{gav04}. The {\it binary hypercube\/}

255: consists of all $n$--long arrays of bits, or {\it alleles\/}, that is

256: $\cG=\{0, 1\}^n$. This is our {\it genotype space\/}.

257: Genotypes are linked by edges induced by bit-flips, i.e., {\it mutations\/} at a single locus,

258: for example, for $n=4$, a sequence of mutations might look like \[ 0000\leftrightarrow 1000\leftrightarrow 1001\leftrightarrow 1101\leftrightarrow 1100.

259: \]

260: The (Hamming) {\it distance\/} $d(x,y)$ between $x\in \cG$ and $y\in \cG$ is the

261: number of coordinates in which $x$ and $y$ differ or, equivalently,

262: the least number of mutations which connect $x$ and $y$.
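
As a quick check of the definition, take the chain of mutations displayed above. Its endpoints $0000$ and $1100$ differ only in the first two coordinates, so
\[
d(0000,1100)=2,
\]
although the displayed chain uses four mutations. A shortest connecting chain is $0000\leftrightarrow 1000\leftrightarrow 1100$.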

263:

264: The {\it fitness\/} of each genotype $x$ is denoted by $w(x)$.

265: We will describe several ways to prescribe the fitness $w$ at random, according

266: to some probability measure $P$ on the $2^{2^n}$ possible assignments. Then we say that

267: an event $A_n$ happens {\it asymptotically almost surely\/} (\aas)

268: if $P(A_n)\to 1$ as $n\to\infty$. Typically, $A_n$ will capture

269: some important property of (random) clusters of genotypes.

270:

271: We commonly assume that $w(x)\in \{0,1\}$ so that $x$ is either viable

272: ($w(x)=1$) or inviable ($w(x)=0$).

273: As a natural starting point, \cite{gav97b} considered  uncorrelated landscapes,

274: in which $w(x)$ is chosen to be 1  with probability $p_v$, for each $x$ independently of

275: others. We assume

276: this setup for the rest of this section and note that this

277: is a well-studied problem in mathematical literature,

278: although it presents considerable technical difficulties and

279: some issues are still not completely resolved.

280:

281: Given a particular fitness assignment, viable genotypes form

282: a subset of $\cG$, which is divided into

283: connected {\it components\/} or {\it clusters\/}.

284: For example, with $n=4$, if $0000$ is viable, but its 4 neighbors

285: $1000$, $0100$, $0010$, and $0001$ are not, then it is isolated in its own

286: cluster.

287:

288: Perhaps the most basic result determines the {\it connectivity

289: threshold\/} \citep{Tom}: when $p_v>1/2$, the set of all viable genotypes is connected a.~a.~s.

290: By contrast, when $p_v<1/2$,  the set of viable genotypes is {\it not\/} connected

291: {\aas } This is easily understood, as the connectedness is closely linked to

292: isolated genotypes, whose expected number is $2^np_v(1-p_v)^n$. This expectation

293: makes a transition from exponentially large to exponentially small at $p_v=1/2$.

294: The events $\{x$ is isolated$\}$, $x\in \cG$, are only weakly

295: correlated, which implies that when $p_v<1/2$ there are exponentially

296: many isolated genotypes with high probability, while when $p_v>1/2$,

297: a separate argument shows that the event that the set of viable genotypes contains no isolated vertex

298: but is not connected becomes very unlikely for large $n$.

299: This is perhaps the clearest instance of the

300: {\it local method\/}: a local property (no isolated genotypes)

301: is \aas~equivalent to a global one (connectivity).
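
The location of this transition can be read off directly from the formula for the expectation. Each of the $2^n$ genotypes is viable with probability $p_v$ and has all $n$ neighbors inviable with probability $(1-p_v)^n$, and
\[
2^n p_v (1-p_v)^n = p_v\,[2(1-p_v)]^n.
\]
The base $2(1-p_v)$ exceeds 1 exactly when $p_v<1/2$. As an illustration, for $p_v=1/4$ the expected number of isolated viable genotypes is $\frac 14 (3/2)^n$, while for $p_v=3/4$ it is $\frac 34\, 2^{-n}$.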

302:

303: Connectivity is clearly too much to ask for, as $p_v$ above $1/2$ is

304: not biologically realistic. Instead, one should look for a weaker

305: property which has a chance of occurring at small $p_v$. Such a

306: property is {\it percolation\/}, a.~k.~a.~existence of the {\it giant component\/}.

307: For this, we scale $p_v=\la_v/n$, for a constant $\la_v$.

308: When $\la_v>1$, the set of viable genotypes percolates, that is, it a.~a.~s.~contains a

309: component of at least $c\cdot n^{-1} 2^n$ genotypes, with all other

310: components of at most polynomial (in $n$) size.

311: When $\la_v<1$,

312: the largest component is a.~a.~s.~of size at most $Cn$. Here and below, $c$ and $C$ are

313: some constants. These are results from \cite{BKL2}.

314:

315: The local method that correctly identifies the percolation threshold

316: is a little

317: more sophisticated than the one for the connectivity threshold, and

318: uses branching processes with Poisson offspring distribution; hence we introduce the notation

319: Poisson($\la$) for a Poisson distribution with mean $\la$.

320: Viewed from, say, genotype $0\dots0$, the binary hypercube locally approximates a tree with

321: uniform degree $n$. Thus viable genotypes approximate

322: a branching process

323: in which every node has the number of successors distributed binomially

324: with parameters $n-1$ and $p_v$, hence this random number has mean about $\la_v$ and

325: is approximately Poisson($\la_v$).

326: When $\la_v>1$, such a branching process survives forever with probability

327: $1-\delta>0$, where $\delta=\delta(\la_v)$, and $\delta(\la)$ is given by the

328: implicit equation

329: \begin{equation}\label{delta}

330: \delta=e^{\la(\delta-1)}.

331: \end{equation}

332: (e.g., \citealt{AN}).

333: Large trees of viable genotypes created by the

334: branching processes which emanate from viable genotypes

335: merge into a very large (``giant'') connected set.

336: On the other hand, when $\la_v<1$ the branching process dies out with probability 1.
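
For orientation, we add a short worked illustration of equation~(\ref{delta}); the numerical values are rounded. The probability generating function of Poisson($\la$) is $s\mapsto e^{\la(s-1)}$, and the extinction probability $\delta$ is its smallest fixed point in $[0,1]$. For $\la=2$, solving $\delta=e^{2(\delta-1)}$ numerically gives
\[
\delta(2)\approx 0.203, \qquad 1-\delta(2)\approx 0.797.
\]
Just above the threshold, write $\la=1+\epsilon$ and $\eta=1-\delta$. Expanding $1-\eta=e^{-\la\eta}$ to second order in $\eta$ gives $\eta(\la-1)\approx \la^2\eta^2/2$, so that
\[
1-\delta(1+\epsilon)\approx 2\epsilon
\]
for small $\epsilon>0$. The survival probability therefore grows continuously from 0 as $\la$ passes through 1.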

337:

338: The condition $\la_v>1$ for the existence of the giant component can be loosely

339: rewritten as

340: 	\begin{equation} \label{basic}

341: 		p_v > \frac{1}{n}.

342: 	\end{equation}

343: This shows that the larger the dimensionality $n$ of the genotype space, the smaller

344: values of the probability of being viable $p_v$ will result in the existence of

345: the giant component. See \cite{gav97b,gav97,gav04,ski04,pig06}  for discussions of biological

346: significance and implications of this important result.

347:

348: \section{Percolation in a continuous phenotype space}

349:

350: In this section we will assume that individuals are characterized by $n$ continuous

351: traits (such as size, weight, color, or concentrations of particular gene products).

352: To be precise, we let $\cP =[0,1]^n$ be the {\em phenotype space}.

353:

354: We begin with the extension of the notion of independent viability.

355: The most straightforward analogue of the discrete genotype space considered in the

356: previous section involves Poisson point location

357: in $\cP$, obtained by generating a Poisson($\lambda$) random variable $N$, and then

358: choosing points  $x_1,\dots,x_N\in \cP$ uniformly at random.

359: These will be interpreted as {\it peaks\/}

360: of equal height in the fitness landscape.

361: Another parameter is a small $r>0$, which can be interpreted as measuring

362: how harsh the environment is: any phenotype within $r$

363: of one of the peaks is declared viable and any phenotype not within $r$ of one of the peaks

364: is declared inviable. For simplicity, we will assume ``within

365: $r$'' to mean that ``every coordinate differs by at most $r$,''

366: i.e., distance is measured in the ($n$-dimensional) $\ell^\infty$ norm $||\cdot||_\infty$.

367: Note that this makes the set of viable phenotypes correlated, albeit

368: the range of correlations is limited to $2r$.

369:

370: Our most basic question is whether a positive proportion of

371: viable phenotypes is connected together into a giant cluster.

372: Note that the probability $p_v$ that a random point in $\cP$ is viable

373: is equal to the probability that there is a ``peak'' within $r$ from this

374: point. Therefore,

375: $$

376: p_v=1-\exp\left[-\lambda (2r)^n\right]\approx \lambda (2r)^n.

377: $$

378: This is also the expected combined volume of viable phenotypes.
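
The formula for $p_v$ follows from a one-line Poisson computation, sketched here under the simplifying assumption that boundary effects near the faces of $\cP$ can be ignored. A point $y\in\cP$ is viable exactly when at least one peak falls into the cube of side $2r$ centered at $y$. The number of peaks in that cube is Poisson with mean $\lambda(2r)^n$, so
\[
P(y\text{ is inviable})=\exp\left[-\lambda(2r)^n\right],
\]
and the expansion $1-e^{-t}=t-t^2/2+\dots$ with $t=\lambda(2r)^n$ shows that the linear approximation is accurate when $\lambda(2r)^n$ is small.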

379:

380: We will consider peaks

381: $x_i$ and $x_j$ to be {\it neighbors\/} if they share a viable phenotype,

382: that is, if their $r$-neighborhoods overlap, or

383: equivalently, if $||x_i-x_j||_\infty<2r$.

384: Two viable phenotypes $y_1$ and $y_2$ are {\it connected\/} if they are,

385: respectively, within $r$ of peaks $x_1$ and $x_2$, and $x_1$ and $x_2$ are

386: connected to each other via a chain of neighboring peaks.

387:

388: By the standard branching process comparison,

389: the necessary condition for the existence of a giant cluster is that a ``peak'' $x$ is connected

390: to more than one other ``peak'' on the average.

391: All peaks within $2r$ of the focal peak are connected to the latter.

392: Therefore, if $\mu$ is the expected number of peaks connected to $x$,

393: then

394: $$

395: \mu= \lambda \cdot (4r)^n,

396: $$

397: and $\mu>1$ is necessary for percolation.

398: As demonstrated by \cite{Pen} (for a different choice of

399: the norm, but the proof is the same),

400: this condition becomes sufficient when $n$ is large.

401: Note that the expected number $\lambda$ of peaks can be written as $\mu\cdot (4r)^{-n}$.

402:

403: If $\mu>1$ and fixed, then \aas~a positive proportion of

404: all peaks (that is, $cN$ peaks, where $c=c(\mu)>0$) are connected

405: in one ``giant'' component, while the remaining connected components are all of size $\cO(\log N)$.

406: On the other hand, if $\mu<1$, all components are \aas~of size $\cO(\log N)$.

407:

408: The condition $\mu>1$ for the existence of the giant component of viable phenotypes can be

409: loosely rewritten as

410: 	\begin{equation} \label{cont}

411: 		p_v > \frac{1}{2^n}.

412: 	\end{equation}

413: This shows that viable phenotypes are likely to form a large connected cluster even when

414: one is {\it very\/} unlikely to hit one of them at random, if

415: $n$ is even moderately large. The same conclusion and the same threshold are valid

416: if instead of $n$-cubes we use $n$-spheres of a constant radius.
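
The step from $\mu>1$ to inequality~(\ref{cont}) can be made explicit. Using $\lambda=\mu\cdot(4r)^{-n}$ and the approximation $p_v\approx\lambda(2r)^n$,
\[
p_v\approx \mu\cdot(4r)^{-n}(2r)^n=\frac{\mu}{2^n},
\]
so $\mu>1$ corresponds to $p_v>2^{-n}$, up to the error in the approximation for $p_v$. As a numerical illustration, for $n=20$ this threshold is $2^{-20}\approx 9.5\times 10^{-7}$, compared to $1/n=0.05$ in inequality~(\ref{basic}).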

417:

418: The percolation threshold in the continuous phenotype space given by inequality~(\ref{cont})

419: is much smaller than that in the discrete genotype space which is given by inequality~(\ref{basic}).

420: An intuitive reason for this is that continuous space offers a viable point a much greater opportunity

421: to be connected to a large cluster. Indeed, in the discrete genotype space there are $n$

422: neighbors per genotype. In contrast, in the continuous phenotype space, the ratio

423: of the volume of the space where neighboring peaks can be located (which has radius $2r$)

424: to the volume of the focal $n$-cube (which has radius $r$) is $2^n$.

425:

426: \section{Percolation in a correlated landscape with phenotypic neutrality}

427:

428: The standard paradigm in biology is that the relationship between genotype and fitness

429: is mediated by phenotype (i.e., observable characteristics of individuals). Both the

430: genotype-to-phenotype and phenotype-to-fitness maps are typically not one-to-one.

431: Here, we formulate a simple model capturing these properties which also results in a

432: correlated fitness landscape.

433: Below we will call mutations that do not change phenotype {\em conformist}. These mutations

434: represent a subset of {\em neutral} mutations that do not change fitness.

435:

436: We propose the following two-step model. To begin the {\it first step\/},

437: we make each  {\it pair\/} of genotypes $x$ and $y$ in a binary hypercube  $\cG$ independently

438: {\it conformist\/} with probability $p_{d(x,y)}$ where $d(x,y)$

439: is the Hamming distance between $x$ and $y$. We then declare

440: $x$ and $y$ to belong to the same {\it conformist cluster\/} if they are linked

441: by a chain of conformist pairs. This version of the long-range percolation model (cf., \citealt{Ber,Bis})

442: divides the set of genotypes $\cG$ into conformist clusters.

443: We postulate that all genotypes in the same conformist

444: cluster have the same phenotype. Therefore, genetic changes represented by

445: a change from one member of a conformist cluster to another (i.e., single or

446: multiple mutations) are phenotypically  neutral.

447:

448: In the {\it second step\/}, we make each conformist cluster independently viable with

449: probability $p_v=\la_v/n$. This generates a random set of viable genotypes,

450: and we aim to investigate  when this set has a large connected component.

451:

452: For example, the ``genotype'' can be a linear RNA sequence.

453: This sequence folds into a 2-dimensional molecule which has a particular structure

454: (or ``shape''), and corresponds to our ``phenotype.'' Finally, the molecule

455: itself has a particular function, e.g., to bind to a specific part of the cell or

456: to another molecule. A measure of how well this can be accomplished is represented by

457: our ``fitness.''

458:

459: The distribution of conformist clusters depends on the probabilities

460: $p_1, p_2, p_3, \dots $ which determine how the conformity probability

461: varies with distance.

462: Here we will study the case when $p_1=p_e>0$, $p_2=p_3=\dots=0$ \citep{Hag}.

463: It is then very convenient for the mathematical analysis that a pair $x$ and

464: $y$ can be conformist only when they are linked by an edge --- therefore

465: we can talk about {\it conformist edges\/} or equivalently {\it conformist mutations\/}.

466: (Note however that it is possible that nearest neighbors $x$ and $y$ are in the

467: same conformist cluster even if the edge between them is non-conformist.)

468:

469: Figure 1 illustrates our 2-step procedure on a four-dimensional example.

470:

471: We expect that a more general model with $p_i$ declining fast enough with $i$

472: is just a smeared version of this basic one, and its properties are not likely

473: to differ from those of the simpler model. We conjecture that for our purposes,

474: ``fast enough'' decrease should be exponential with a rate logarithmically

475: increasing in the dimension $n$, e.g. for large $k$,

476: \[

477:       p_k \le \exp(-\alpha(\log n)k),

478: \]

479: for some $\alpha>1$. (This is expected to be so because in this case the expected number of

480: neighbors of the focal genotype is finite.)
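
The finiteness claim can be checked by a short computation, given here as a sketch under the simplifying assumption that the bound on $p_k$ holds for all $k\ge 1$ rather than only for large $k$. Since $\exp(-\alpha(\log n)k)=n^{-\alpha k}$ and the number of genotypes at distance $k$ from the focal one is $\binom nk\le n^k/k!$, the expected number of conformist partners at distance $k$ is at most
\[
\binom nk p_k\le \frac{n^k}{k!}\, n^{-\alpha k}=\frac{n^{-(\alpha-1)k}}{k!}.
\]
Summing over $k\ge 1$ gives at most $\exp\left(n^{-(\alpha-1)}\right)-1$, which is bounded in $n$ when $\alpha\ge 1$ and tends to 0 when $\alpha>1$.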

481:

482: We observe that the first step of our procedure is an

483: edge version of the percolation model discussed in the second section, with a

484: similar giant component transition \citep{BKL1}.

485: Namely, let $p_1=p_e=\lambda_e/n$. Then, if $\la_e>1$, there

486: is a.~a.~s.~one giant conformist cluster of size $c\cdot 2^n$, with all others

487: of size at most $Cn$. In contrast, if $\la_e<1$ all conformist clusters

488: are of size at most $Cn$. Note that the number of conformist

489: clusters is always of the order of $2^n$. In fact, even the number

490: of ``non-conformist'' (i.e., isolated) clusters is a.~a.~s.~asymptotic to

491: $e^{-\lambda_e} 2^n$, as $P(x\ \text{is isolated})=(1-\lambda_e/n)^n$.

492:

493: \begin{figure*}[t]

494:    \begin{center}

495:     {\includegraphics

496:     [clip, viewport= 140 325 475 680, height=4cm]{4q.ps}

497:     \hspace{1cm}

498:     \includegraphics[clip, viewport= 140 325 475 680, height=4cm]{4edge-config.ps} \\

499:     \vspace{.5cm}

500:     \includegraphics[clip,viewport= 140 325 475 680, height=4cm]{4viability.ps}

501:     \hspace{1cm}

502:     \includegraphics[clip,viewport= 140 325 475 680, height=4cm]{4neut.ps}

503:     }\end{center}

504: \caption{A four-dimensional example: start with the cube $\cG^4$ (top left),  create conformist clusters by randomly eliminating each edge with probability

505:  $1-p_e$ (top right), remove each conformist cluster with probability

506:  $1-p_v$ (bottom left, removed vertices are black) and finally consider

507:  connected components of the remaining vertices (bottom right,

508:  there is just one component in this case).}

509: \end{figure*}

510:

511: Denote by $x\conn y$ (resp.~$x\notconn y$) the event that

512: $x$ and $y$ are (resp.~are not) in the same conformist cluster.

513: First, we note that the probability $P(x \conn y)$ that two genotypes belong to

514: the same conformist cluster depends on the Hamming distance $d(x,y)$ between them, and on

515: $p_e=\lambda_e/n$. In particular,

516: we show in Appendix A that, if $\la_e<1$ and $d(x,y)=k$ is fixed, then

517: \begin{equation} \label{Px-y}

518: k!p_e^k (1 - O(n^{-2})) \leq P(x \conn y) \leq k!p_e^k (1 + O(n^{-1} \log{n})).

519: \end{equation}

520: The dominant contribution $k!p_e^k$ is simply the expected number of conformist pathways between $x$ and $y$

521: that are of shortest possible length.
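
The count behind the factor $k!$ is elementary. A shortest path from $x$ to $y$ mutates each of the $k$ loci at which they differ exactly once, and the $k$ mutations can be performed in any of $k!$ orders. Each such path consists of $k$ edges, all of which are conformist with probability $p_e^k$. For example, for $k=2$ there are two shortest paths, and their expected number is
\[
2p_e^2=\frac{2\la_e^2}{n^2}.
\]
Longer conformist paths between $x$ and $y$ account for the correction terms in~(\ref{Px-y}).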

522:

523: It is also important to note that, for every $x\in \cG$,

524: the probability $P( x$ is viable$)$ equals $p_v$ and therefore does not depend on $p_e$.

525: Moreover, for $x,y\in \cG$,

526: $$

527: \begin{aligned}

528: &P(x\text{ and }y\text{ viable})-p_v^2\\

529: &=P(x\text{ and }y\text{ viable},x\conn y)+ P(x\text{ and }y\text{ viable},x\notconn y)-p_v^2\\

530: &=p_vP(x\conn y)+ p_v^2\cdot P(x\notconn y)-p_v^2\\

531: &=p_v(1-p_v)P(x\conn y)\ge 0.

532: \end{aligned}

533: $$

534: Therefore, the correlation function~(\ref{rho}) is

535: \begin{equation}

536: \rho(x,y)=P(x\conn y),

537: \end{equation}

538: which clearly increases with $p_e$ and, thus, with $\lambda_e$.

539: Therefore, this model

540: has tunable positive correlations controlled by the parameter $\la_e$, whose value does

541: not affect the expected number of viable genotypes.

542: The correlation function $\rho(x,y)$ decreases exponentially with distance

543: $d(x,y)$ when $\la_e<1$, and is bounded below when $\la_e>1$.  Nevertheless,

544: as we will see below, we can effectively use local methods for all values of $\la_e$.
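
For completeness, here is the step that turns the covariance computation into the correlation function. Since $w(x)\in\{0,1\}$ with $P(w(x)=1)=p_v$, the variance is $\var(w)=p_v(1-p_v)$, and therefore
\[
\rho(x,y)=\frac{\cov[w(x),w(y)]}{\var(w)}=\frac{p_v(1-p_v)P(x\conn y)}{p_v(1-p_v)}=P(x\conn y).
\]
Together with~(\ref{Px-y}), this gives $\rho(x,y)\approx k!\,(\la_e/n)^k$ when $\la_e<1$ and $d(x,y)=k$ is fixed.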

545:

546: \subsection{Threshold surface for percolation}

547:

548: Proceeding by the local branching process heuristics,

549: we reason that a surviving node on the branching tree can have

550: two types of descendants: those that are connected by conformist mutations

551: and those that are in different conformist clusters and thus

552: independently viable. Therefore the number

553: of descendants is approximately Poisson($\la_e+\la_v$).

554: This can only work when $\la_e<1$, as otherwise the correlations are global.
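
The Poisson approximation can be spelled out as follows; this is our heuristic reading of the argument, not a rigorous statement. A viable genotype has about $n$ neighbors. Each neighbor is joined to it by a conformist edge with probability $\la_e/n$, in which case it is automatically viable. Otherwise the neighbor is, to leading order, in a different conformist cluster and is viable with probability $\la_v/n$. The number of viable neighbors is thus approximately binomial with $n$ trials and success probability
\[
\frac{\la_e}{n}+\left(1-\frac{\la_e}{n}\right)\frac{\la_v}{n}\approx\frac{\la_e+\la_v}{n},
\]
that is, approximately Poisson($\la_e+\la_v$). The branching process is supercritical when $\la_e+\la_v>1$, or equivalently $\la_v>1-\la_e$.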

555:

556: If $\la_e>1$, we need to eliminate the

557: entire conformist giant component, which is \aas~inviable.

558: Locally, we condition on the

559: (supercritical) branching process of the supposed descendant to die out.

560: Such conditioned process is a subcritical branching process, with

561: Poisson($\la_e\delta$) distribution of successors \citep{AN}

562: where $\delta=\delta(\lambda_e)$ is given by the equation~(\ref{delta}).

563: This gives the

564: conformist contribution, to which we add the independent Poisson($\la_v\delta$) contribution.

565:

566:  \begin{figure*}[t]

567:   \begin{center}

568:   \vspace{5pt}

569:    \includegraphics[clip=true,height=10cm]{nt2.ps} \hspace{1.5cm}

570:    \includegraphics[clip=true,height=10cm]{nt1.ps}

571:   \end{center}

572:

573:   \caption{Simulated $\la_v^m$ (long dashes) and  $\la_v^{M}$ (short dashes), and $\zeta$

574:   (solid) plotted against $\la_e$, for $n=10, \dots, 20$, and models from Section

575:   4.1 (left frame) and Section 4.2 (right frame). Lower bounds increase with $n$, and

576:   upper bounds decrease, for this range of $n$. }

577: \end{figure*}

578:

579: To have a convenient summary of the conclusions above,

580: assume that $\la_e$ is fixed and let $\zeta(\la_e)$

581: be the smallest $\la_v$ which \aas~ensures the giant component, i.e.,

582: \[

583: \zeta(\la_e)=\inf\{\la_v: \text{a cluster of at least }cn^{-1} 2^n

584: \text{ viable genotypes exists \aas~for some } c>0\}.

585: \]

586: One would expect that for $\la_v<\zeta(\la_e)$ all components are \aas~of size at most $Cn$.

587: The asymptotic critical curve is given by

588: $\la_v=\zeta(\la_e)$, where

589: \begin{equation}  \label{pheno}

590: \zeta(\la)=

591: \begin{cases}

592: 1-\la &\qquad\text{if } \la\in [0,1],\\

593: \frac 1{\delta}-\la&\qquad\text{if } \la\in [1,\infty).

594: \end{cases}

595: \end{equation}
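
As a sketch of how (\ref{pheno}) follows from the heuristics above, recall that the local branching process is critical when its mean offspring number equals one. We assume here, as is standard for a Poisson($\la$) branching process conditioned on extinction, that $\delta$ in (\ref{delta}) is the extinction probability solving $\delta=e^{-\la(1-\delta)}$. Criticality then reads
\[
\la_e+\la_v=1\quad(\la_e<1),\qquad \delta\,(\la_e+\la_v)=1\quad(\la_e>1),
\]
and solving for $\la_v$ gives the two cases of (\ref{pheno}). Under the same assumption $\delta(1)=1$, so both branches give $\zeta(1)=0$, while for $\la_e=2$ a numerical solution gives $\delta\approx 0.203$ and hence $\zeta(2)\approx 2.92$.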

596:

597: Having only a heuristic proof of this, we resort to computer

598: simulations for confirmation. For this, we

599: indicate

600: global connectivity with the event $A$ that a genotype

601: within distance 2 of $0\dots 0$ is connected

602: (through viable genotypes) to a genotype

603: within distance 2 of $1\dots 1$.

604: We make this choice because the

605: distance 2 is the smallest that works with asymptotic certainty.

606: Indeed, the genotypes $0\dots0$ and $1\dots1$ are likely to be inviable.

607: Even the number of viable genotypes within distance one of each of these is only of constant order,

608: so even in the percolation regime the probability of connectivity between

609: a viable genotype within distance one

610: of $0\dots0$ and a viable one within distance one of $1\dots1$  does not converge to 1 but is of

611: a nontrivial constant order. By contrast, there are about $n^2$ vertices

612: within distance 2 of $0\dots0$ among which of order $n$ are viable.

613:

614: When $\la_v>\zeta(\la_e)$ the probability of the event $A$

615: should therefore be (exponentially) close to 1. On the other hand, when $\la_v<\zeta(\la_e)$

616: the probability that a connected component within distance 2 of either

617: $0\dots0$ or $1\dots1$ extends for distance of the order $n$

618: is exponentially small. We further define the critical curves

619: $$

620: \begin{aligned}

621: &\text{$\la_v^{m}=\;$the smallest $\la_v$ for which

622: $P(A)>0.1$,}\\

623: &\text{$\la_v^{M}=\;$the largest $\la_v$ for which

624: $P(A)<0.9$.}

625: \end{aligned}

626: $$

627:

628: We approximated $\la_v^m$ and $\la_v^{M}$ for

629: $n=10, \dots, 20$ and $\la_e=0(0.1)2$, with 1000

630: independent realizations

631: of each choice of $n$, $\la_e$, and $\la_v$. We used the linear

632: cluster algorithm described in \cite{Sed}.

633: The results are depicted in Figure 2.

634: Unfortunately, simulations above $n\approx 20$

635: are not feasible.

636:

637: From Figure 2 we observe that:

638: \begin{itemize}

639:

640: \item Even for low $n$, both critical curves approximate well the

641: overall shape of the theoretical limit curve $\zeta$.

642: \item $\la_v^{m}$ and $\la_v^{M}$ get

643: closer faster than they converge to $\zeta$. Consequently,

644: one can expect that $P(A)$ makes a very sharp jump from near 0

645: to near 1 even for moderate $n$.

646: \item For $\la_e<1$, $\la_v^{m}$ tends to be above the limit curve. This is

647: not really surprising, as the local argument always gives an upper

648: bound on the probability $P(A)$ of event $A$. Further, the approximation of $\la_v^m$ deteriorates

649: near $\la_e=2$, which stems from the possibility of survival of the

650: giant component in this regime.

651: \end{itemize}

652:

653: What is clear from the heuristics and simulations is that

654: conformist mutations, and thus correlations, significantly affect

655: the probability of long range

656: connectivity in the genotype space. The effect is not monotone:

657: the most advantageous choice

658: is when the correlations are at the point of phase transition between local and global.

659:

660:

661: To understand intuitively why percolation occurs most easily when $\la_e \approx 1$, it helps

662: to think of the model as a branching process on clusters rather than on genotypes.

663: For a genotype on a viable cluster,

664: there is a number of neighboring clusters and each of these is viable with

665: probability $p_v$. If $\lambda_e < 1$, then the probability that any two of the neighboring

666: genotypes are in the same cluster is $o(1)$, so there are asymptotically exactly $n$ clusters

667: neighboring the present cluster. Consequently, the overall number of descendants will be greater

668: if the size of these clusters is greater on average, which is exactly what happens as $\lambda_e$

669: increases towards 1. If $\lambda_e > 1$, then there is a positive proportion of the neighboring

670: genotypes that are in the giant cluster. This giant cluster is likely to be inviable, so the parameter

671: $\lambda_v$ must be greater to compensate for its loss.

672:

673: \subsection{Correlations between conformity and viability}

674:

675: In the previous model, the viability probability $p_v$

676: was independent of the conformity structure. Mainly to

677: investigate the robustness of our conclusions,

678: we consider a simple generalization in which there

679: are either positive or negative correlations between conformity

680: and fitness. While more sophisticated models are possible,

681: the one below is chosen for its amenability to relatively simple analysis.

682:

683: Assume now that conformist clusters are formed as before (i.e.,

684: with edges being conformist with probability $p_e=\lambda_e/n$),

685: are still independently viable, but

686: now the probability of their viability depends on their

687: size. We will consider the simple case when an isolated genotype

688: (one might call it {\it non-conformist\/}) is viable with probability $p_0=\la_0/n$,

689: while a conformist cluster of size larger than 1 is viable with probability $p_1=\la_1/n$.

690:

691: In this case

692: $$

693: P(x\text{ is viable})=(1-p_e)^np_0+(1-(1-p_e)^n)p_1\sim \frac 1n\left(

694: e^{-\la_e}\la_0+(1-e^{-\la_e})\la_1\right).

695: $$

696: Moreover, by a similar calculation as before,

697: $$

698: \begin{aligned}

699: &P(x\text{ and }y\text{ viable})-P(x\text{ viable})^2\\

700: &=p_1(1-p_1)P(x\conn y)+P(x\text{ non-conformist})^2p_e(p_0-p_1)^2\cdot 1_{\{d(x,y)=1\}}.

701: \end{aligned}

702: $$

703: Here, the last factor is  the indicator of the set $\{(x,y), d(x,y)=1\}$, which equals

704: $1$ if $d(x,y)=1$ and $0$ otherwise.

705: Therefore, for $d(x,y)\ge 2$, the correlation function (\ref{rho})

706: is

707: $$

708: \rho(x,y)\sim\frac {\la_1}{e^{-\la_e}\la_0+(1-e^{-\la_e})\la_1}P(x\conn y),

709: $$

710: which is smaller than before iff $\la_1<\la_0$. However, it has the same

711: asymptotic properties unless $\la_1=0$.
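
The comparison with the previous model can be checked in one line. Since the denominator is positive, the prefactor of $P(x\conn y)$ is smaller than $1$ exactly when
\[
\la_1< e^{-\la_e}\la_0+(1-e^{-\la_e})\la_1
\iff e^{-\la_e}\la_1<e^{-\la_e}\la_0
\iff \la_1<\la_0 .
\]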

712:

713:

714: Assume first that $\la_e<1$.

715: The local analysis now leads to

716: a {\it multi-type\/} branching process \citep{AN} with three types: NC (non-conformist node),

717: CI (non-isolated node independently viable, so no conformist edge is

718: accounted for), and CC (non-isolated node viable by conformity, so

719: a conformist edge is accounted for).

720:

721: Note first that a genotype is

722: non-conformist with probability about $e^{-\la_e}$.

723: Hence a node of any of the three types creates a Poisson($e^{-\la_e}\la_0$) number

724: of type NC descendants, and a Poisson($(1-e^{-\la_e})\la_1$) number of type CI

725: descendants. In addition, the type CI creates a Poisson($\la_e$), conditioned

726: on being nonzero, number of descendants of type CC and type CC creates a

727: Poisson($\la_e$) number of descendants of type CC. Thus

728: the matrix of expectations, in which the $ij$th entry is the expectation of the number

729: of type $j$ descendants from type $i$, is

730: \[

731: M=

732: \begin{bmatrix}

733: e^{-\la_e}\la_0 & \left(1- e^{-\la_e}\right)\la_1 & 0\\

734: e^{-\la_e}\la_0 & \left(1- e^{-\la_e}\right)\la_1 & \la_e/(1-e^{-\la_e})\\

735: e^{-\la_e}\la_0 & \left(1- e^{-\la_e}\right)\la_1 & \la_e \end{bmatrix}\quad .

736: \]

737: When $\la_e>1$, $\la_e$ needs to be replaced by $\la_e\delta$, and

738: $\la_1$ by $\la_1\delta$, where $\delta=\delta(\la_e)$ is given by ~(\ref{delta}).

739:

740: It follows from the theory of multi-type branching processes \citep{AN} that

741: the critical surface for survival of a multi-type

742: branching process is given by $\det(M-I)=0$, where $I$ is the identity matrix.

743:

744: The simplest case is when only non-conformist genotypes may be viable,

745: i.e., $\la_1=0$. In this case the critical surface is given by $\la_0 e^{-\la_e}=1$ (Pitman, unpub.).

746: Not surprisingly, the critical $\la_0$ to achieve global connectivity strictly

747: increases with $\la_e$, which is the result of negative correlations between

748: conformity and viability.
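
For the reader who wishes to verify this, write $a=e^{-\la_e}\la_0$ and $b=\la_e/(1-e^{-\la_e})$ (shorthand introduced here only). With $\la_1=0$ the middle column of $M$ vanishes, and expanding along the first row gives
\[
\det(M-I)=\det
\begin{bmatrix}
a-1 & 0 & 0\\
a & -1 & b\\
a & 0 & \la_e-1
\end{bmatrix}
=(a-1)(1-\la_e),
\]
where $I$ is the identity matrix. For $\la_e<1$ this vanishes only when $a=\la_0e^{-\la_e}=1$, that is, at $\la_0=e^{\la_e}$.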

749:

750: The other extreme is when non-conformist genotypes are inviable,

751: i.e., $\la_0=0$. As an easy computation demonstrates,

752: the critical curve is now given by $\la_1=\zeta(\la_e)$, where

753: \begin{equation}\label{phenocorr}

754: \zeta(\la)=

755: \begin{cases}

756: \frac{1-\la}{\la e^{-\la}+1-e^{-\la}} &\qquad\text{if } \la\in [0,1],\\

757: \frac{\delta^{-1} -\la}{ \la e^{-\la}+1-e^{-\la\delta}}&\qquad\text{if } \la\in [1,\infty).

758: \end{cases}

759: \end{equation}
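
A sketch of the computation for $\la<1$: with $\la_0=0$ the first column of $M$ vanishes, and writing $m=(1-e^{-\la})\la_1$ and $b=\la/(1-e^{-\la})$ (shorthand used only here), the condition $\det(M-I)=0$, with $I$ the identity matrix, becomes
\[
(m-1)(\la-1)-bm=0, \quad\text{i.e.,}\quad m=\frac{1-\la}{1+b-\la}.
\]
Dividing by $1-e^{-\la}$ and using $(1-e^{-\la})(1+b-\la)=\la e^{-\la}+1-e^{-\la}$ recovers the first case. For $\la>1$ one substitutes $\la\delta$ for $\la$ and $\la_1\delta$ for $\la_1$. If $\delta$ satisfies the usual extinction equation $\delta=e^{-\la(1-\delta)}$ (an assumption about (\ref{delta}) made here), then $\la\delta e^{-\la\delta}=\la e^{-\la}$, which is why the term $\la e^{-\la}$ appears unchanged in the denominator of the second case.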

760: Note that $\zeta(\la)\to \infty$ as $\la\to 0$. We carried out exactly the

761: same simulations as before. These are also featured in Figure 2 (right frame), and again

762: confirm our local heuristics. We conclude that positive correlations

763: between viability and conformity tend to lead to a V-shaped critical

764: curve, whose sharpness at critical conformity $\la_e=1$ increases with

765: the size of correlations. In short, then, correlations help more

766: if viability probability increases with size of conformist clusters.

767:

768: \section{Percolation in incompatibility models}

769:

770: In the model considered in the previous section

771: correlations rapidly decreased with distance. This property

772: made local analysis possible. The models we introduce now

773: are fundamentally different in the sense that correlations are

774: so high that the local method gives a wrong answer.

775:

776: In the previous sections, in constructing fitness landscapes we were assigning fitness

777: to individual genotypes or phenotypes. Here, we make certain assumptions about ``fitness'' of

778: particular combinations of alleles or the values of phenotypic characters. Specifically,

779: we will assume that some of these combinations are ``incompatible'' in the sense that the

780: resulting genotypes or phenotypes have reduced (or zero) fitness \citep{orr95,orr96,gav04}.

781: The resulting models can be viewed as a generalization of the Bateson-Dobzhansky-Muller

782: model \citep{orr95,orr96,orr97,orr01,gav96b,gav97,gav97b,gav03d,gav04,coy04}

783: which represents a canonical model of speciation.

784:

785: \subsection{Diallelic loci}

786:

787: We begin by assuming that viability of a genotype is determined by

788: a set $F$ of pairwise incompatibilities. $F$ is thus

789: a subset of  $4\cdot \binom{n}{2}$ pairs $(u_i, v_j)$,

790: where $1\le i<j\le n$ and $u,v\in\{0,1\}$. In this nonstandard notation, $(0_1,0_2)\in F$,

791: for example, means that allele $0$ at locus $1$ and allele $0$ at locus $2$

792: are incompatible. In general, if $(u_i, v_j)\in F$,

793: all genotypes with $u$ in position $i$ and $v$ in position $j$

794: are inviable.

795: A genotype $x$ is then inviable if and only if there exist $i$ and $j$, with $i<j$,

796: so that $u$ and $v$ are, respectively, the alleles of $x$ at loci $i$ and $j$,

797: and $(u_i, v_j)\in F$.

798: For example, if $F_1=\{(0_1, 0_2), (1_2, 0_3), (1_1, 1_2)\}$, viable genotypes may have

799: $011$, $100$, and $101$ as their first three alleles. For $F_2=F_1\cup \{(0_1, 1_3), (1_1, 0_2)\}$,

800: no viable genotype remains.

801:

802: Incompatibility $(0_1, 0_2)$ is equivalent to two implications: $0_1\implies 1_2$ and

803: $0_2\implies 1_1$ or to the single {\tt OR} statement $1_1$ {\tt OR} $1_2$. In this interpretation,

804: the problem of whether, for a given list of incompatibilities $F$, there is a viable genotype is

805: known as the {\tt $2$-SAT} problem \citep{KV}.

806: The associated {\it digraph\/} $D_F$ is a graph on $2n$ vertices $x_i$, $i=1,\dots n$, $x=0,1$,

807: with oriented edges determined by the implications. A well-known theorem \citep{KV} states

808: that a viable genotype exists iff $D_F$ contains no oriented cycle

809: from $0_i$ to $1_i$ and back to $0_i$ for any $i=1,\dots, n$.

810: For example, for the incompatibilities $F_2$ as above,

811: one such cycle is $0_1\to1_2\to 1_3\to 1_1\to 1_2\to 0_1$.
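
To make the example concrete, the five incompatibilities in $F_2$ generate the following ten implications (each incompatibility contributes an implication and its contrapositive):
\[
\begin{aligned}
(0_1,0_2)&:\quad 0_1\to 1_2,\quad 0_2\to 1_1,\\
(1_2,0_3)&:\quad 1_2\to 1_3,\quad 0_3\to 0_2,\\
(1_1,1_2)&:\quad 1_1\to 0_2,\quad 1_2\to 0_1,\\
(0_1,1_3)&:\quad 0_1\to 0_3,\quad 1_3\to 1_1,\\
(1_1,0_2)&:\quad 1_1\to 1_2,\quad 0_2\to 0_1.
\end{aligned}
\]
Each arrow of the cycle above appears in this list, and the cycle passes through both $0_1$ and $1_1$, so neither allele at locus $1$ is consistent with all the implications.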

812:

813: Now assume that each possible incompatibility is adjoined to $F$ at random, independently

814: with probability

815: \[

816: p=\frac c{2n}.

817: \]

818: (We use the generic notation $p$ for a probability parameter

819: in all our models, even though the nature of probabilistic assignments differs from model to model.)

820:

821:

822: {\bf Existence of viable genotypes.}\quad

823: Let $N$ be the number of viable genotypes. Then

824: \begin{itemize}

825: \item if $c>1$, then a.~a.~s.~$N=0$.

826: \item if $c<1$, then a.~a.~s.~$N>0$.

827: \end{itemize}

828: This result first appeared in the computer science literature in the 90's

829: (see \citealt{dlV} for a review), and it is an

830: extension of the celebrated Erd\H{o}s-R\'enyi random graph results

831: \citep{Bol,JLR} to the oriented case.

832:

833: Note that the expectation

834: $E(N)=2^n(1-p)^{\binom{ n}{2}}\approx 2^ne^{-cn/4}$,

835: which grows exponentially whenever $c<4\log 2\approx 2.77$. Neglecting

836: correlations would therefore suggest a wrong threshold for $N>0$. The local method

837: (e.g., used in \citealt[Chapter 6]{gav04}) is

838: even farther off, as it suggests an \aas~giant component when $p<(1-\e)\log n/n$

839: for any $\e>0$.
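
The formula for $E(N)$ can be checked as follows. A fixed genotype carries exactly one allele combination at each of the $\binom n2$ pairs of loci, so it is viable precisely when none of these $\binom n2$ combinations belongs to $F$, an event of probability $(1-p)^{\binom n2}$. Summing over the $2^n$ genotypes and using $\log(1-p)\approx -p$,
\[
\log E(N)=n\log 2+\binom n2\log\left(1-\frac{c}{2n}\right)\approx n\log 2-\frac{n^2}{2}\cdot\frac{c}{2n}
=n\left(\log 2-\frac c4\right),
\]
which is positive for large $n$ exactly when $c<4\log 2$.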

840:

841: {\bf The number of viable genotypes.}\quad

842: Assume that $c<1$. Sophisticated, but not mathematically rigorous

843: methods based on {\it replica symmetry\/} \citep{MZ,BMW}  from statistical physics suggest that,

844: as $n\to\infty$,

845: $\lim n^{-1}\log N$ varies almost linearly between

846: $\log 2\approx 0.69$ (for small $c$, when, as we prove below, this limit is

847: $\log 2+\cO(c)$) and about $0.38$ (for $c$ close to $1$).

848: One can however prove that $n^{-1}\log N$ is for large $n$ sharply

849: concentrated around its mean \citep{dlV}.

850:

851: Upper and lower bounds on $N$ can also be obtained

852: rigorously. For example, if $X$ is a number of

853: incompatibilities which involve {\it disjoint\/} pairs of loci

854: (i.e., those for which every locus is represented at most once among the

855: incompatibilities),

856: then $N\le \exp(n\log 2+X\log(3/4))$, as each of the $X$ incompatibilities

857: reduces the number of viable genotypes by the factor $3/4$.

858: If we imagine

859: adding incompatibilities one by one at random until

860: there are about $cn$ of them, then after we have $k$

861: incompatibilities on disjoint pairs of loci the waiting time (measured by

862: the number of incompatibilities added)

863: for a new disjoint one is geometric with expectation $\binom{n} {2}/\binom{n-2k} {2}$.

864: Therefore,

865: $X$ is \aas~at least $Kn$, where

866: $K$ solves the approximate equation

867: $$

868: \binom{n} {2} \left(\sum_{k=0}^{Kn}\frac 1{ \binom{n-2k} {2}} \right)\sim cn,

869: $$

870: or

871: $$

872: \int_{0}^{Kn}\frac 1{(n-2k)^2}\, dk \sim \frac cn,

873: $$

874: which reduces to $K=c/(1+2c)$. This gives the following upper bound on the

875: exponential growth rate of $N$:

876: \begin{equation}  \label{up_bound}

877: \limsup \frac 1n\log N\le \frac {1}{1+2c}\log 2+\frac {c}{1+2c}\log 3.

878: \end{equation}
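
The value of $K$ and the bound (\ref{up_bound}) follow from a short calculation, sketched here. Evaluating the integral,
\[
\int_0^{Kn}\frac{dk}{(n-2k)^2}=\frac1{2n}\left(\frac1{1-2K}-1\right)=\frac{K}{n(1-2K)},
\]
and setting this equal to $c/n$ gives $K=c(1-2K)$, that is, $K=c/(1+2c)$. Substituting $X\ge Kn$ into $N\le\exp(n\log 2+X\log(3/4))$ then yields
\[
\frac1n\log N\le \log 2+\frac{c}{1+2c}\left(\log 3-2\log 2\right)
=\frac{1}{1+2c}\log 2+\frac{c}{1+2c}\log 3 .
\]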

879:

880: A lower bound is even easier to obtain. Namely,

881: the probability that a fixed location (i.e., locus) $i$ does not appear in $F$ is $(1-p)^{4(n-1)}

882: \to e^{-2c}$, and then it is easy to see that the number of loci represented in $F$

883: is asymptotically $(1-e^{-2c})n$. As the other loci are neutral (in the sense that changing

884: their alleles does not affect fitness),

885: $n^{-1}\log N$ is asymptotically at least $e^{-2c}\log 2$. Clearly, this gives

886: a lower bound on the exponential size of any cluster of viable genotypes.

887:

888: If this were an accurate bound, it would imply that the space of

889: genotypes is rather simple, in that almost all its entropy would come from neutral loci. Appendix B presents two arguments which will

890: demonstrate that this is not the case. The derivations there are somewhat technical,

891: but do provide more insight into random pair incompatibilities.
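
For a sense of scale, the bounds can be compared numerically near the threshold. As $c\uparrow 1$, the right-hand side of (\ref{up_bound}) tends to $\frac13\log 6\approx 0.60$, while the neutral-loci bound $e^{-2c}\log 2$ tends to $e^{-2}\log 2\approx 0.094$. The replica-symmetry estimate of about $0.38$ quoted above is roughly four times the neutral-loci bound, which is consistent with the claim that neutral loci account for only a small part of the entropy.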

892:

893: {\bf The structure of clusters.}\quad

894: The derivations in Appendix B show that every viable genotype is connected

895: through mutation to a fairly substantial

896: viable sub-cube. In this sub-cube, alleles on at most a proportion $r_u(c)<1$ of loci

897: are fixed (to 0 or 1) while the remaining proportion $1-r_u(c)$ could be

898: varied without effect on fitness. Note from Figure 4 in

899: Appendix B that $1-r_u(c)\ge 0.3$ for

900: all $c$, and that such a phenomenon is

901: extremely unlikely on uncorrelated landscapes.

902: Note also that, for $c<1$, $N\ge 2^{(1-r_u(c))n}$ \aas~and so  the lower

903: bound on $N$ can be written as

904: \begin{equation} \label{low_bound}

905: \liminf\frac 1n\log N\ge (1-r_u(c))\log 2.

906: \end{equation}

907:

908: {\bf The number of clusters.}\quad

909: The natural next question concerns the number of clusters

910: $R$ when $c<1$. This again has quite a surprising answer, unparalleled in

911: landscapes with rapidly decaying correlations. Namely,

912: $R$ is {\it stochastically bounded\/}, that

913: is, for every $\e>0$ there exists a $z=z(\e)$ such that $P(R\le z)>1-\e$ for all $n$.

914: As there is some confusion in the literature as to whether it is even possible

915: to get more than one cluster \citep{BMW}, Appendix C

916: presents a sketch of the results which will appear in Pitman (unpub.).

917: There we also show that the limiting probability of a unique cluster is

918: $\sqrt{(1-c)e^c}$.

919:

920: Asymptotically, a unique cluster has a better than even chance of

921: occurring for $c$ below about $0.9$, and is {\it very\/} likely to occur

922: for small $c$, though of course not

923: \aas~so. To confirm, we have done simulations for $n=20$ and $c=0.01 (0.01) 1$

924: (again 1000 trials in each case) and obtained the distribution of the number of clusters depicted

925: in Figure~3. The results suggest that the convergence to limiting distribution

926: is rather slow for $c$ close to 1, and that the likelihood of a unique

927: cluster increases for low $n$.
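
Two quick consequences of the formula $\sqrt{(1-c)e^c}$ may help in reading Figure~3. For small $c$, expanding $\log(1-c)=-c-c^2/2-\dots$ gives
\[
\sqrt{(1-c)e^c}=\exp\left(\tfrac12\left(c+\log(1-c)\right)\right)=1-\frac{c^2}{4}+\cO(c^3),
\]
so the limiting probability of more than one cluster is only of order $c^2$. The even-chance point solves $(1-c)e^c=\frac14$, and a numerical solution gives $c\approx 0.898$, consistent with the value of about $0.9$ quoted above.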

928:

929: \begin{figure*}[t]

930:   \begin{center}

931:    {\includegraphics[clip=true,height=5cm]{cls.ps}

932:     }

933:   \end{center}

934:

935:   \caption{Simulated number of clusters, vs. $c$ for $n=20$. The

936:   proportion (out of 1000) of trials with exactly one, exactly two, and at least three clusters

937:   is plotted respectively with $+$'s, $\times$'s and $*$'s. The solid curve is

938:   $\sqrt{(1-c)e^c}$.

939:   }

940: \label{number_clusters}

941: \end{figure*}

942:

943:

944: To summarize, in the presence of random pairwise incompatibilities, the set

945: of viable genotypes is, when nonempty,

946: divided into a stochastically bounded number of connected clusters,

947: where a unique cluster is usually the most likely possibility.

948: These clusters are all of exponentially large size

949: (with bounds given by equations \ref{up_bound} and \ref{low_bound}); in fact, they all contain

950: sub-cubes of dimension at least $(1-r_u(c))n$.

951: However, the proportion

952: of viable genotypes among all $2^n$ genotypes is exponentially small, by

953: equation (\ref{up_bound}).

954:

955: \subsection{Multiallelic loci}

956:

957: Here we assume that at each locus there can be $a\ (\ge 2)$

958: alleles (cf., \citealt{Rei}). In this case, the genotype space is

959:  the generalized hypercube

960: $\cG_a=\{0,\dots, a-1\}^n$. For $a=3$

961: this could be interpreted as the genotype space of diploid

962: organisms without {\it cis-trans\/} effects \citep{gav97b},

963: $a=4$ corresponds to DNA sequences, and $a=20$ corresponds to proteins.

964: Much larger values of $a$ can correspond to a number of alleles at a protein

965: coding locus and we will see later that

966: there is not much difference between this model and a

967: natural continuous space model.

968:

969: We will assume that each pair of alleles, out of a total

970: of $a^2\binom{n}{2}$, is independently incompatible

971: with probability

972: $$p=\frac{c}{2n}.$$

973: The main question we are interested in

974: here is for which values of $c$ viable genotypes exist {\aas }

975:

976: Clearly, if $N$ is the number of viable genotypes, then the expectation

977: $$

978: E(N)=a^n(1-p)^{\binom{n}{2}}\approx\exp(n \log a-{\textstyle\frac 14}cn),

979: $$

980: and so there are \aas~no viable genotypes when $c>4\log a$. On the

981: other hand, clearly there are viable genotypes

982: (with all positions filled by 0's and 1's) when $c<1$. It turns out that the

983: first

984: of these trivial bounds is much closer to the critical value when $a$

985: is large. Before we proceed, however, we state a sharp

986: threshold result from \cite{Mol}: there exists a function $\gamma=\gamma(n,a)$

987: so that for every $\e>0$,

988: \begin{itemize}

989: \item if $c>\gamma+\e$, then a.~a.~s.~$N=0$.

990: \item if $c<\gamma-\e$, then a.~a.~s.~$N>0$.

991: \end{itemize}

992: In words, for a fixed $a$, the probability of the event that $N\ge 1$

993: transitions sharply from large to small

994: as $np$ varies. As it is not proved that

995: $\lim_{n\to\infty}\gamma(n,a)$ exists, it is in principle possible

996: that the place of this sharp transition fluctuates as $n$ increases

997: (although it must of course remain within $[1, 4\log a]$).

998:

999: Our main result in this section is

1000: \begin{equation}\label{gamma}

1001: \gamma=4\log a-o(1), \text{ as }a\to\infty.

1002: \end{equation}

1003: This somewhat surprising result is proven in Appendix D by the

1004: second moment method, as developed in \cite{AM} and \cite{AP}.

1005:

1006: \subsection{Continuous phenotype spaces}

1007:

1008: Here we extend the model of pair incompatibilities to the case of a continuous

1009: phenotypic space $\cP$. Again, we have a small $r>0$ as a parameter.

1010: For each pair $(i,j)$, $i<j$, we consider an independent Poisson point process $\Pi_{ij}$

1011: in the unit square $[0,1]\times[0,1]$, of rate $\la=c/(2n)$. (Equivalently, choose a Poisson($\la$) number of

1012: points uniformly at random in $[0,1]\times[0,1]$.) Then we declare $a\in \cP$ inviable

1013: if there exist $i<j$ so that $(a_i,a_j)$ is within $r$ of $\Pi_{ij}$.

1014: Again, we use the two-dimensional $\ell^\infty$ norm for distance.

1015: Our procedure can be visualized as throwing a random number of

1016: $(n-2)$-dimensional square tubes of inviable phenotypes into the phenotype space.

1017:

1018: Our main result here is that the existence threshold is on the order $c\approx -\log r/r^{2}$.

1019: Namely, we prove in Appendix E that there exists a constant $C>0$ so that for small enough $r$,

1020:

1021:  \begin{itemize}

1022: \item if $c>4\frac{-\log r}{r^2}$, then a.~a.~s.~$N=0$.

1023: \item if $c<\frac{-\log r-C}{r^2}$, then a.~a.~s.~$N>0$.

1024: \end{itemize}

1025:

1026: \subsection{Complex incompatibilities}

1027:

1028: Here we assume that incompatibilities involve $K\ (\geq 2)$ diallelic loci \citep{orr96,gav04}.

1029: The question whether a viable combination of genes exist is then equivalent to

1030: the {\tt $K$-SAT} problem \citep{KV}. Even for $K=3$, this is an NP-complete problem

1031: \citep{KV}, so there is no known polynomial algorithm to answer this question.

1032: The random case, which we now describe, is also much harder to analyze

1033: than the {\tt $2$-SAT} one.

1034: Let $F$ be a random set

1035: to which any of the $2^K\binom n K$ incompatibilities belong independently with

1036: probability

1037: $$

1038: p=\frac {K!}{2^K}\cdot \frac c{n^{K-1}}.

1039: $$

1040: Here $c=c(K)$ is a constant, and the above form has been

1041: chosen to make the number of incompatibilities in $F$ asymptotically $cn$.

1042: (Note also the agreement with the definition of $p$ in Section 5.1

1043: when $K=2$.)  For a fixed $K$,

1044: it has been proved \citep{Fri} that the probability that a viable genotype exists

1045: drops sharply from 1 to 0 as $c$ increases. However, the location of the

1046: jump has not been proved to converge as $n\to\infty$. Instead,

1047: a lot of effort has been

1048: invested in obtaining good bounds. For example \citep{AP}, for $K=3$, $c<3.42$ implies {\aas }

1049: existence

1050: of a viable genotype, while $c>4.51$ implies \aas~nonexistence (while the sharp

1051: constant

1052: is estimated to be about $4.27$, see e.g. \citealt{BMW}).

1053: For $K=4$ the best current bounds are $7.91$ and $10.23$. For large

1054: $K$, the transition occurs at $c=2^K\log 2-\cO(K)$ \citep{AP}.
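
The normalization of $p$ used in this section can be verified directly (a routine check, included for convenience). Since $\binom nK\sim n^K/K!$ for fixed $K$, the expected number of incompatibilities in $F$ is
\[
2^K\binom nK\, p\sim 2^K\cdot\frac{n^K}{K!}\cdot\frac{K!}{2^K}\cdot\frac{c}{n^{K-1}}=cn .
\]
For $K=2$ the formula gives $p=c/(2n)$, and for $K=3$ it gives $p=3c/(4n^2)$.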

1055:

1056: Techniques from statistical physics \citep{BMW} strongly suggest

1057: that, for $K\ge 3$, there is another phase transition, which

1058: for $K=3$ occurs at about $c=3.96$. For smaller $c$, the

1059: viable genotypes are conjectured to

1060: be contained in a {\it single\/} cluster.

1061: For larger $c$, the space of viable genotypes

1062: (if nonempty) is divided into exponentially many connected clusters.

1063:

1064:

1065: Perhaps more relevant to genetic incompatibilities is the following

1066: {\it mixed\/} model (commonly known as {\tt $(2+p)$-SAT}; \citealt{MZ}). Assume that

1067: every 2-incompatibility is present with probability $c_2/(2n)$,

1068: while every 3-incompatibility is present with probability $3c_3/(4n^2)$.

1069: The normalizations are chosen so that the numbers of the two types of

1070: incompatibilities are asymptotically $c_2 n$ and $c_3 n$, respectively.

1071:

1072: If $c_2$ (resp. $c_3$) is very small, then the respective incompatibility

1073: set affects a very small proportion of loci, therefore

1074: $c_3$ (resp. $c_2$) determines whether a viable genotype is likely to exist.

1075: Intuitively, one also expects that 2-incompatibilities should be more

1076: important than 3-incompatibilities

1077: as one of the former type excludes more genotypes than one of the latter type.  A careful

1078: analysis confirms this. First observe

1079: that $c_2>1$ implies \aas~non-existence of a viable genotype. The surprise

1080: \citep{MZ,AKKK} is that if $c_3$ is small enough, $c_2<1$

1081: implies \aas~existence of viable genotypes, so the 3-incompatibilities

1082: do not change the threshold. This is established in \cite{MZ} by a physics argument

1083: for $c_3<0.703$, while

1084: \cite{AKKK} gives a rigorous argument for $c_3<2/3$. Therefore, even if their numbers are

1085: on the same scale, if the more

1086: complex incompatibilities are rare enough compared to the pairwise

1087: ones, their contribution to the structure of the space of

1088: viable genotypes is not essential.

1089:

1090:

1091: \section{Notes on neutral clusters in the discrete {\it NK\/} model}

1092:

1093: The model considered here is a special case of the discretized NK model \citep{kau93},

1094: introduced in \cite{NE}.

1095: This model features $n$ diallelic loci each of which interacts with $K$ other loci.

1096: To have a concrete example, assume that the loci are arranged on a

1097: circle, so that $n+1\equiv 1$, $n+2\equiv 2$, etc., and let the

1098: interaction {\it neighborhood\/} of the $i$'th locus consist of itself

1099: and $K$ loci to its right $i+1, \dots, i+K$. For a given

1100: genotype $x\in\cG=\{0,1\}^n$,

1101: the neighborhood configuration of the

1102: $i$'th locus is then given by $\cN_i(x)= (x_i, x_{i+1}, \dots, x_{i+K})\in \{0,1\}^{K+1}$.

1103: To each locus and to each possible configuration

1104: in its neighborhood

1105: we independently assign a binary fitness contribution.

1106: To be more precise,

1107: we choose the $2^{K+1}n$ numbers $v_i(y)$, $i=1, \dots, n$ and $y\in \{0,1\}^{K+1}$,

1108: to be independently 0 or 1 with equal probability, and interpret $v_i(y)$

1109: as the fitness contribution of locus $i$ when its neighborhood configuration

1110: is $y$. The fitness

1111: of a genotype $x$ is then the sum of contributions from each locus:

1112: $$

1113: w(x)=\sum_{i=1}^n v_i(\cN_i(x)).

1114: $$

1115: In \cite{kau93}, the values $v_i$ were taken from a continuous distribution.

1116: In \cite{NE}, these values were integers in the range $[0,F-1]$ so that our model

1117: is a special case $F=2$.

1118: {\it Neutral clusters\/} are connected components of same

1119: fitness.

1120:

1121: The $K=0$ case is easy but nevertheless illustrative.

1122: Namely, a mutation at locus $i$ will not change fitness iff

1123: $v_i(0)=v_i(1)$; let $D$ be the number of such loci.

1124: Then $D\sim n/2$ \aas, the number of different fitnesses is $n-D+1$,

1125: each neutral cluster is a sub-cube

1126: of dimension $D$, and there are exactly $2^{n-D}$ neutral

1127: clusters.
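
To see this explicitly, let $S$ be the set of the $n-D$ loci with $v_i(0)\ne v_i(1)$ (notation introduced here for illustration only). Then
\[
w(x)=\sum_{i\notin S}v_i(0)+\#\{i\in S: v_i(x_i)=1\},
\]
so the fitness depends on $x$ only through its alleles on $S$, and it changes by exactly $\pm 1$ under any mutation at a locus in $S$. A neutral cluster is therefore obtained by fixing the alleles on $S$ and varying the other $D$ loci freely, which gives $2^{n-D}$ sub-cubes of dimension $D$.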

1128:

1129: The next simplest situation is when $K=1$. Let

1130: $D_1$ be the number of loci $i$

1131: for which $v_i$ is constant. Then

1132: $D_1\sim n/8$ \aas, and each neutral cluster contains a

1133: sub-cube of dimension $D_1$. Moreover, let $D_2$ be

1134: the number of loci $i$ for which $v_i(00)=v_i(01)\ne v_i(10)=v_i(11)$.

1135: Note that any two genotypes that differ at such a locus $i$ must belong to

1136: different neutral clusters, and so the number

1137: of different neutral clusters is at least $2^{D_2}$. Thus there

1138: are exponentially many of them, as

1139: again $D_2\sim n/8$ {\aas }

1140: This division of genotype space into exponentially many clusters

1141: of exponential size persists for every $K$, although

1142: the distribution of numbers and sizes of these clusters is not well understood (see

1143: \citealt{NE} for simulations for $n=20$).
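
The two frequencies above can be checked by counting. For $K=1$, each $v_i$ is one of the $2^4=16$ equally likely functions on $\{0,1\}^2$. Exactly two of them are constant, and exactly two satisfy $v_i(00)=v_i(01)\ne v_i(10)=v_i(11)$, so each event has probability $2/16=1/8$. Since the $v_i$ are independent across loci, $D_1$ and $D_2$ are both Binomial($n$, $1/8$), which gives $D_1\sim n/8$ and $D_2\sim n/8$ by the law of large numbers.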

1144:

1145: Finally, we mention that the question of whether a

1146: genotype with the maximal possible fitness $n$

1147: exists for a given $K$ is in many ways related to issues in incompatibility models

1148: \citep{CJK}.

1149:

1150: \section{Discussion}

1151:

1152: In this section we summarize our major findings and provide their biological interpretation.

1153:

1154: The previous work on neutral and nearly neutral networks in multidimensional fitness

1155: landscapes has concentrated exclusively on genotype spaces in which each individual

1156: (or a group of individuals) is characterized by a discrete set of genes. However

1157: many features of biological organisms that are actually observable and/or measurable are described by

1158: continuously varying variables such as size, weight, color, or concentration. A question

1159: of particular biological interest is whether (nearly) neutral networks are as prominent

1160: in a continuous phenotype space as they are in the discrete genotype space. Our results

1161: provide an affirmative answer to this question. Specifically, we have shown that in a simple

1162: model of random fitness assignment, viable phenotypes are likely to form a large connected

1163: cluster even if their overall frequency is very low, provided the dimensionality of the phenotype

1164: space, $n$, is sufficiently large. In fact, the percolation threshold for the probability

1165: of being viable scales with $n$ as $1/2^n$ and, thus, decreases much faster than $1/n$, which is

1166: characteristic of the analogous discrete genotype space model.
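To get a feel for the difference between the two scalings, one can compare the bare quantities $1/2^n$ and $1/n$, ignoring the multiplicative constants (which the statement above does not specify):
$$
n=10:\quad \frac{1}{2^{n}}=\frac{1}{1024}\approx 9.8\times 10^{-4},\quad \frac 1n=0.1;
\qquad
n=20:\quad \frac{1}{2^{n}}=\frac{1}{1048576}\approx 9.5\times 10^{-7},\quad \frac 1n=0.05.
$$
Thus, at twenty dimensions the two thresholds already differ by several orders of magnitude.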

1167:

1168: Earlier work on nearly neutral networks has been limited to consideration of the relationship

1169: between genotype and fitness. Any phenotypic properties that usually mediate this relationship

1170: in real biological organisms have been neglected. In Section 4, we proposed a novel model in which

1171: phenotype is introduced explicitly. In our model, the relationships both between genotype and

1172: phenotype and between phenotype and fitness are of many-to-one type, so that neutrality is present

1173: at both the phenotype and fitness levels. Moreover, this model results in a correlated fitness

1174: landscape in which the correlation function can be found explicitly. We studied the effects

1175: of phenotypic neutrality and correlation between fitnesses on the percolation threshold and

1176: showed that the conditions most conducive

1177: to the formation of the giant component arise when the correlations are at the point

1178: of phase transition between local and global.

1179: To explore the robustness of our conclusions, we then looked at a simplistic but

1180: mathematically illuminating model in which there is a correlation between conformity (i.e.,

1181: phenotypic neutrality) and fitness. The model has supported our conclusions.

1182:

1183: In Section 5, we studied a number of models that have been recently proposed

1184: and explored within the context of studying speciation. In these models, fitness is assigned to

1185: particular gene/trait combinations and the fitness of the whole organism depends on the presence

1186: or absence of incompatible combinations of genes or traits.  In these models, the correlations

1187: of fitnesses are so high that local methods lead to wrong conclusions.

1188: First, we established the connection between these models and $K$-{\tt SAT} problems, prominent

1189: in computer science. Then we analyzed the conditions for the existence of viable genotypes,

1190: their number, as well as the structure and the number of clusters of viable genotypes.

1191: These questions have not been studied previously. Among other things, we showed that the number

1192: of clusters is stochastically bounded and each cluster contains a very large sub-cube.

1193: The majority of our results are for the case of pairwise incompatibilities between diallelic

1194: loci, but we also looked at multiple alleles and complex incompatibilities.  Moreover, we generalized

1195: some of our results to continuous phenotype spaces.

1196:

1197: At the end, we provided some additional results on the size, number and structure of

1198: neutral clusters in the discrete $NK$ model.

1199:

1200: Some more general lessons of our work are that

1201: \begin{itemize}

1202: \item Correlations may help or hinder connectivity in fitness landscapes. Even  when

1203: correlations are positive and tunable by a single

1204: parameter, it may be advantageous

1205: (for higher connectivity) to increase

1206: them only to a limited extent.

1207: \item Averages (i.e., expected values) can easily lead to wrong conclusions,

1208: especially when correlations are strong. Nevertheless, they may still

1209: be useful with a crafty choice of relevant statistics.

1210: \item Very high correlations may fundamentally change the structure of connected

1211: clusters. For example, clusters may look locally more like cubes than trees and

1212: their number may be reduced dramatically.

1213: \item Necessary analytical techniques may be unexpected and quite sophisticated;

1214: for example, they may require

1215: detailed understanding of random graphs, spin-glass machinery, or decision algorithms.

1216: \end{itemize}

1217:

1218:

1219: {\small ACKNOWLEDGMENTS.

1220: This work was supported by the Defense Advanced Research Projects Agency (DARPA),

1221: by the National Institutes of Health (grant GM56693),

1222: by the National Science Foundation (grants DMS-0204376 and DMS-0135345),

1223: and by the Republic of Slovenia's Ministry of Science (program P1-285).}

1224:

1225: \begin{thebibliography}{}

1226:

1227: \bibitem[Achlioptas et~al., 2001]{AKKK}

1228: Achlioptas, D., Kirousis, L.~M., Kranakis, E., and Krizanc, D. (2001).

1229: \newblock Rigorous results for $(2+p)$-{SAT}.

1230: \newblock {\em Theoretical Computer Science}, 265:109--129.

1231:

1232: \bibitem[Achlioptas and Moore, 2004]{AM}

1233: Achlioptas, D. and Moore, C. (2004).

1234: \newblock Random $k$-{SAT}: two moments suffice to cross a sharp threshold.

1235: \newblock {\em SIAM Journal on Computing}, 17:947--973.

1236:

1237: \bibitem[Achlioptas and Peres, 2004]{AP}

1238: Achlioptas, D. and Peres, Y. (2004).

1239: \newblock The threshold for random $k$-{SAT} is $2^k\log 2-O(k)$.

1240: \newblock {\em Journal of the American Mathematical Society}, 17:947--973.

1241:

1242: \bibitem[Athreya and Ney, 1971]{AN}

1243: Athreya, K. and Ney, P. (1971).

1244: \newblock {\em Branching processes}.

1245: \newblock Springer-Verlag (reprinted by Dover 2004).

1246:

1247: \bibitem[Barbour et~al., 1992]{BHJ}

1248: Barbour, A.~D., Holst, L., and Janson, S. (1992).

1249: \newblock {\em Poisson Approximation}.

1250: \newblock Oxford University Press.

1251:

1252: \bibitem[Berger, 2004]{Ber}

1253: Berger, N. (2004).

1254: \newblock A lower bound for the chemical distance in sparse long-range

1255:   percolation models.

1256: \newblock {\em http://arxiv.org/abs/math/0409021}.

1257:

1258: \bibitem[Biroli et~al., 2000]{BMW}

1259: Biroli, G., Monasson, R., and Weigt, M. (2000).

1260: \newblock A variational description of the ground state structure in random

1261:   satisfiability problems.

1262: \newblock {\em European Physical Journal B-Condensed Matter}, 14:551--568.

1263:

1264: \bibitem[Biskup, 2004]{Bis}

1265: Biskup, M. (2004).

1266: \newblock On the scaling of the chemical distance in long-range percolation

1267:   models.

1268: \newblock {\em Annals of Probability}, 32:2938--2977.

1269:

1270: \bibitem[Bollob\'as, 2001]{Bol}

1271: Bollob\'as, B. (2001).

1272: \newblock {\em Random Graphs}.

1273: \newblock Cambridge University Press.

1274:

1275: \bibitem[Bollob\'as et~al., 1992]{BKL1}

1276: Bollob\'as, B., Kohayakawa, Y., and \L{}uczak, T. (1992).

1277: \newblock The evolution of random subgraphs of the cube.

1278: \newblock {\em Random Structures and Algorithms}, 3:55--90.

1279:

1280: \bibitem[Bollob\'as et~al., 1994]{BKL2}

1281: Bollob\'as, B., Kohayakawa, Y., and \L{}uczak, T. (1994).

1282: \newblock On the evolution of random {Boolean} functions.

1283: \newblock In {\em Extremal problems for finite sets (Visegr\'ad, 1991)}, pages

1284:   137--156. Bolyai Society Mathematical Studies, 3, J\'anos Bolyai Mathematical

1285:   Society, Budapest.

1286:

1287: \bibitem[Boufkhad and Dubois, 1999]{BD}

1288: Boufkhad, Y. and Dubois, O. (1999).

1289: \newblock Length of prime implicants and number of solutions of random {CNF}

1290:   formulae.

1291: \newblock {\em Theoretical~Computer~Science}, 215:1--30.

1292:

1293: \bibitem[Burch and Chao, 1999]{bur99}

1294: Burch, C.~L. and Chao, L. (1999).

1295: \newblock Evolution by small steps and rugged landscapes in the {RNA} virus phi

1296:   6.

1297: \newblock {\em Genetics}, 151:921--927.

1298:

1299: \bibitem[Burch and Chao, 2004]{bur04}

1300: Burch, C.~L. and Chao, L. (2004).

1301: \newblock Epistasis and its relationship to canalization in the {RNA} virus phi

1302:   6.

1303: \newblock {\em Genetics}, 167:559--567.

1304:

1305: \bibitem[Choi et~al., 2005]{CJK}

1306: Choi, S.-S., Jung, K., and Kim, J.~H. (2005).

1307: \newblock Phase transition in a random {NK} landscape model.

1308: \newblock In {\em Proceedings of the 2005 Conference on Genetic and

1309:   Evolutionary Computation, {Washington, DC}}, pages 1241--1248. ACM Press.

1310:

1311: \bibitem[Cook, 1971]{Coo}

1312: Cook, S.~A. (1971).

1313: \newblock The complexity of theorem proving procedures.

1314: \newblock In {\em Proceedings of the Third Annual ACM Symposium on the Theory

1315:   of Computing}, pages 151--158. ACM.

1316:

1317: \bibitem[Coyne and Orr, 2004]{coy04}

1318: Coyne, J. and Orr, H.~A. (2004).

1319: \newblock {\em Speciation}.

1320: \newblock Sinauer Associates, Inc., Sunderland, Massachusetts.

1321:

1322: \bibitem[de~la Vega, 2001]{dlV}

1323: de~la Vega, W.~F. (2001).

1324: \newblock Random {2-SAT}: results and problems.

1325: \newblock {\em Theoretical Computer Science}, 265:131--146.

1326:

1327: \bibitem[Derrida and Peliti, 1991]{der91}

1328: Derrida, B. and Peliti, L. (1991).

1329: \newblock Evolution in flat landscapes.

1330: \newblock {\em Bulletin of Mathematical Biology}, 53:255--282.

1331:

1332: \bibitem[Eigen et~al., 1989]{eig89}

1333: Eigen, M., Mc{C}askill, J., and Schuster, P. (1989).

1334: \newblock The molecular quasispecies.

1335: \newblock {\em Advances in Chemical Physics}, 75:149--263.

1336:

1337: \bibitem[Elena and Lenski, 2003]{ele03}

1338: Elena, S.~F. and Lenski, R.~E. (2003).

1339: \newblock Evolution experiments with microorganisms: The dynamics and genetic

1340:   bases of adaptation.

1341: \newblock {\em Nature Reviews Genetics}, 4:457--469.

1342:

1343: \bibitem[Fontana and Schuster, 1998]{fon98b}

1344: Fontana, W. and Schuster, P. (1998).

1345: \newblock Continuity in evolution: on the nature of transitions.

1346: \newblock {\em Science}, 280:1451--1455.

1347:

1348: \bibitem[Friedgut, 1999]{Fri}

1349: Friedgut, E. (1999).

1350: \newblock Necessary and sufficient conditions for sharp thresholds of graph

1351:   properties, and the $k$-{SAT} problem.

1352: \newblock {\em Journal of the American Mathematical Society}, 12:1017--1054.

1353:

1354: \bibitem[Gavrilets, 1997]{gav97}

1355: Gavrilets, S. (1997).

1356: \newblock Evolution and speciation on holey adaptive landscapes.

1357: \newblock {\em Trends in Ecology and Evolution}, 12:307--312.

1358:

1359: \bibitem[Gavrilets, 2003]{gav03d}

1360: Gavrilets, S. (2003).

1361: \newblock Models of speciation: what have we learned in 40 years?

1362: \newblock {\em Evolution}, 57:2197--2215.

1363:

1364: \bibitem[Gavrilets, 2004]{gav04}

1365: Gavrilets, S. (2004).

1366: \newblock {\em Fitness landscapes and the origin of species}.

1367: \newblock Princeton University Press, Princeton, NJ.

1368:

1369: \bibitem[Gavrilets and Gravner, 1997]{gav97b}

1370: Gavrilets, S. and Gravner, J. (1997).

1371: \newblock Percolation on the fitness hypercube and the evolution of

1372:   reproductive isolation.

1373: \newblock {\em Journal of Theoretical Biology}, 184:51--64.

1374:

1375: \bibitem[Gavrilets and Hastings, 1996]{gav96b}

1376: Gavrilets, S. and Hastings, A. (1996).

1377: \newblock Founder effect speciation: a theoretical reassessment.

1378: \newblock {\em American Naturalist}, 147:466--491.

1379:

1380: \bibitem[H\"aggstr\"om, 2001]{Hag}

1381: H\"aggstr\"om, O. (2001).

1382: \newblock Coloring percolation clusters at random.

1383: \newblock {\em Stochastic Processes and their Applications}, 96:213--242.

1384:

1385: \bibitem[Huynen et~al., 1996]{huy96b}

1386: Huynen, M.~A., Stadler, P.~F., and Fontana, W. (1996).

1387: \newblock Smoothness within ruggedness: the role of neutrality in adaptation.

1388: \newblock {\em Proceedings of the National Academy of Sciences USA},

1389:   93:397--401.

1390:

1391: \bibitem[Janson et~al., 2000]{JLR}

1392: Janson, S., \L{}uczak, T., and Ruci\'nski, A. (2000).

1393: \newblock {\em Random Graphs}.

1394: \newblock Wiley.

1395:

1396: \bibitem[Kauffman, 1993]{kau93}

1397: Kauffman, S.~A. (1993).

1398: \newblock {\em The origins of order}.

1399: \newblock Oxford University Press, Oxford.

1400:

1401: \bibitem[Kauffman and Levin, 1987]{kau87}

1402: Kauffman, S.~A. and Levin, S. (1987).

1403: \newblock Towards a general theory of adaptive walks on rugged landscapes.

1404: \newblock {\em Journal of Theoretical Biology}, 128:11--45.

1405:

1406: \bibitem[Korte and Vygen, 2005]{KV}

1407: Korte, B. and Vygen, J. (2005).

1408: \newblock {\em Combinatorial Optimization, Theory and Algorithms}.

1409: \newblock Springer, 3rd edition.

1410:

1411: \bibitem[Lenski et~al., 1999]{len99}

1412: Lenski, R.~E., Ofria, C., Collier, T.~C., and Adami, C. (1999).

1413: \newblock Genome complexity, robustness and genetic interactions in digital

1414:   organisms.

1415: \newblock {\em Nature}, 400:661--664.

1416:

1417: \bibitem[Lipman and Wilbur, 1991]{lip91}

1418: Lipman, D.~J. and Wilbur, W.~J. (1991).

1419: \newblock Modeling neutral and selective evolution of protein folding.

1420: \newblock {\em Proceedings of the Royal Society London B}, 245:7--11.

1421:

1422: \bibitem[Martinez et~al., 1996]{mar96}

1423: Martinez, M.~A., Pezo, V., Marli\`{e}re, P., and Wain-Hobson, S. (1996).

1424: \newblock Exploring the functional robustness of an enzyme by {\em in vitro}

1425:   evolution.

1426: \newblock {\em EMBO Journal}, 15:1203--1210.

1427:

1428: \bibitem[Molloy, 2003]{Mol}

1429: Molloy, M. (2003).

1430: \newblock Models for random constraint satisfaction problems.

1431: \newblock {\em SIAM Journal on Computing}, 32:935--949.

1432:

1433: \bibitem[Monasson and Zecchina, 1997]{MZ}

1434: Monasson, R. and Zecchina, R. (1997).

1435: \newblock Statistical mechanics of the random {K}-satisfiability model.

1436: \newblock {\em Physical Review E}, 56:1357--1370.

1437:

1438: \bibitem[Newman and Engelhardt, 1998]{NE}

1439: Newman, M. E.~J. and Engelhardt, R. (1998).

1440: \newblock Effects of selective neutrality on the evolution of molecular

1441:   species.

1442: \newblock {\em Proceedings of the Royal Society London B}, 265:1333--1338.

1443:

1444: \bibitem[Orr, 1995]{orr95}

1445: Orr, H.~A. (1995).

1446: \newblock The population genetics of speciation: the evolution of hybrid

1447:   incompatibilities.

1448: \newblock {\em Genetics}, 139:1803--1813.

1449:

1450: \bibitem[Orr, 1997]{orr97}

1451: Orr, H.~A. (1997).

1452: \newblock Dobzhansky, {Bateson}, and the genetics of speciation.

1453: \newblock {\em Genetics}, 144:1331--1335.

1454:

1455: \bibitem[Orr, 2006a]{orr06b}

1456: Orr, H.~A. (2006a).

1457: \newblock The distribution of fitness effects among beneficial mutations in

1458:   {Fisher}'s geometric model of adaptation.

1459: \newblock {\em Journal of Theoretical Biology}, 238:279--285.

1460:

1461: \bibitem[Orr, 2006b]{orr06a}

1462: Orr, H.~A. (2006b).

1463: \newblock The population genetics of adaptation on correlated fitness

1464:   landscapes: The block model.

1465: \newblock {\em Evolution}, 60:1113--1124.

1466:

1467: \bibitem[Orr and Orr, 1996]{orr96}

1468: Orr, H.~A. and Orr, L.~H. (1996).

1469: \newblock Waiting for speciation: the effect of population subdivision on the

1470:   waiting time to speciation.

1471: \newblock {\em Evolution}, 50:1742--1749.

1472:

1473: \bibitem[Orr and Turelli, 2001]{orr01}

1474: Orr, H.~A. and Turelli, M. (2001).

1475: \newblock The evolution of postzygotic isolation: accumulating

1476:   {Dobzhansky}-{Muller} incompatibilities.

1477: \newblock {\em Evolution}, 55:1085--1094.

1478:

1479: \bibitem[Palasti, 1971]{Pal}

1480: Palasti, I. (1971).

1481: \newblock On the threshold distribution function of cycles in a directed random

1482:   graph.

1483: \newblock {\em Studia Scientiarum Mathematicarum Hungarica}, 6:67--73.

1484:

1485: \bibitem[Penrose, 1996]{Pen}

1486: Penrose, M.~D. (1996).

1487: \newblock Continuum percolation and {Euclidean} minimal spanning trees in high

1488:   dimensions.

1489: \newblock {\em The Annals of Applied Probability}, 6:528--544.

1490:

1491: \bibitem[Pigliucci, 2006]{pig06}

1492: Pigliucci, M. (2006).

1493: \newblock {\em Making Sense of Evolution: The Conceptual Foundations of

1494:   Evolutionary Biology}.

1495: \newblock University of Chicago Press, Chicago.

1496:

1497: \bibitem[Reidys, 2006]{Rei}

1498: Reidys, C.~M. (2006).

1499: \newblock Combinatorics of genotype-phenotype maps: an {RNA} case study.

1500: \newblock In Percus, A., Istrate, G., and Moore, C., editors, {\em

1501:   Computational Complexity and Statistical Physics}, pages 271--284. Oxford

1502:   University Press.

1503:

1504: \bibitem[Reidys et~al., 2001]{rei01a}

1505: Reidys, C.~M., Forst, C.~V., and Schuster, P. (2001).

1506: \newblock Replication and mutation on neutral networks.

1507: \newblock {\em Bulletin of Mathematical Biology}, 63:57--94.

1508:

1509: \bibitem[Reidys and Stadler, 2001]{rei01b}

1510: Reidys, C.~M. and Stadler, P.~F. (2001).

1511: \newblock Neutrality in fitness landscapes.

1512: \newblock {\em Applied Mathematics and Computation}, 117:321--350.

1513:

1514: \bibitem[Reidys and Stadler, 2002]{rei02}

1515: Reidys, C.~M. and Stadler, P.~F. (2002).

1516: \newblock Combinatorial landscapes.

1517: \newblock {\em SIAM Review}, 44:3--54.

1518:

1519: \bibitem[Reidys et~al., 1997]{rei97b}

1520: Reidys, C.~M., Stadler, P.~F., and Schuster, P. (1997).

1521: \newblock Generic properties of combinatory maps: neutral networks of {RNA}

1522:   secondary structures.

1523: \newblock {\em Bulletin of Mathematical Biology}, 59:339--397.

1524:

1525: \bibitem[Rost, 1997]{ros97}

1526: Rost, B. (1997).

1527: \newblock Protein structures sustain evolutionary drift.

1528: \newblock {\em Folding \& Design}, 2:S19--S24.

1529:

1530: \bibitem[Schuster, 1995]{sch95}

1531: Schuster, P. (1995).

1532: \newblock How to search for {RNA} structures. {T}heoretical concepts in

1533:   evolutionary biotechnology.

1534: \newblock {\em Journal of Biotechnology}, 41:239--257.

1535:

1536: \bibitem[Sedgewick, 1997]{Sed}

1537: Sedgewick, R. (1997).

1538: \newblock {\em Algorithms in {C, Parts 1-4}: Fundamentals, Data Structures,

1539:   Sorting, Searching.}

1540: \newblock Addison-Wesley.

1541:

1542: \bibitem[Skipper, 2004]{ski04}

1543: Skipper, R.~A. (2004).

1544: \newblock The heuristic role of {Sewall Wright}'s 1932 adaptive landscape

1545:   diagram.

1546: \newblock {\em Philosophy of Science}, 71:1176--1188.

1547:

1548: \bibitem[Toman, 1979]{Tom}

1549: Toman, E. (1979).

1550: \newblock The geometric structure of random boolean functions.

1551: \newblock {\em Problemy Kibernet. (in Russian)}, 35:111--132.

1552:

1553: \bibitem[Wilke et~al., 2001]{wil01c}

1554: Wilke, C.~O., Wang, J.~L., Ofria, C., Lenski, R.~E., and Adami, C. (2001).

1555: \newblock Evolution of digital organisms at high mutation rates leads to

1556:   survival of the flattest.

1557: \newblock {\em Nature}, 412:331--333.

1558:

1559: \bibitem[Woods et~al., 2006]{woo06}

1560: Woods, R., Schneider, D., Winkworth, C.~L., Riley, M.~A., and Lenski, R.~E.

1561:   (2006).

1562: \newblock Tests of parallel molecular evolution in a long-term experiment with

1563:   {{\em Escherichia coli}}.

1564: \newblock {\em Proceedings of the National Academy of Sciences USA},

1565:   103:9107--9112.

1566:

1567: \bibitem[Wright, 1932]{wri32}

1568: Wright, S. (1932).

1569: \newblock The roles of mutation, inbreeding, crossbreeding and selection in

1570:   evolution.

1571: \newblock In Jones, D.~F., editor, {\em Proceedings of the Sixth International

1572:   Congress on Genetics}, volume~1, pages 356--366, Austin, Texas.

1573:

1574: \end{thebibliography}

1575:

1576:

1577:

1578:

1579: \newpage

1580:

1581: \section*{Appendix}

1582:

1583: \subsection*{Appendix A. Proof of equation~(\ref{Px-y}).}

1584:

1585: To prove equation~(\ref{Px-y}), we assume that $\lambda_e<1$ and

1586: show that for a fixed $k$ (which does not grow with $n$), the

1587: event that $x$ and $y$ at distance $k$ are in the same conformist cluster is most likely to

1588: occur because $x$ and $y$ are connected via the shortest possible path. Indeed,

1589: the dominant term $k!p_e^k$ is the expected number of conformist pathways between $x$ and $y$

1590: that are of shortest possible length $k$. This easily follows from the observation that

1591: on a shortest path there

1592: is no opportunity to backtrack; each mutation must be toward the other genotype.

1593: We can assume that $x$

1594: is the all 0's genotype and $y$ is the genotype with 1's in the

1595: first $k$ positions and 0's elsewhere.

1596: There are $k!$ orders in which the 1's can be added.

1597:

1598: To obtain the lower bound we use inclusion-exclusion on the probability

1599: that $x \conn y$ through a shortest path. Let $\mathcal{I}_l=\mathcal{I}_l(x,y)$

1600: be the set of all paths of length $l$ between $x$ and $y$.

1601: Then

1602: $$P(x \conn y) \geq \sum_{\alpha \in \mathcal{I}_k} P(A_\alpha) -

1603: \sum_{\alpha \neq \beta \in \mathcal{I}_k} P(A_\alpha\cap A_\beta)$$

1604: where $A_\alpha$ is the event that a particular path $\alpha$ consists entirely

1605: of conformist edges.

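1606: Notice that two distinct paths of the same length differ by at least two edges,

1607: so their union contains at least $k+2$ edges. Since there are fewer than $(k!)^2$ pairs of shortest paths, we get the upper bound

1608: $$\sum_{\alpha \neq \beta \in \mathcal{I}_k} P(A_\alpha\cap A_\beta) < (k!)^2 p_e^{k+2},$$

1609: and the lower bound in (\ref{Px-y}) follows.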
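Combining the two estimates makes the size of the correction explicit. Assuming $p_e=\la_e/n$ with $\la_e$ fixed (the normalization used in the calculation at the end of this appendix) and $k$ fixed, a short computation gives
$$
P(x\conn y)\;\ge\; k!\,p_e^k-(k!)^2p_e^{k+2}
\;=\;k!\,p_e^k\left(1-k!\,p_e^2\right)
\;=\;k!\,p_e^k\left(1-O(n^{-2})\right),
$$
so the inclusion-exclusion correction is of smaller order than the leading term $k!\,p_e^k$.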

1610:

1611: The upper bound is a little more difficult to obtain (it is only here

1612: that we use $\lambda_e<1$) and we need some notation.

1613: Each genotype can be identified with the set of 1's that it contains,

1614: so for any two genotypes $u$ and $v$ we let $u \bigtriangleup v$ denote the set

1615: of loci on which they differ. Notice that if $|u \bigtriangleup v|$

1616: is even (resp. odd) then every path between $u$ and $v$ is of even (resp. odd)

1617: length because each mutation which alters the allele at a locus not in $u \bigtriangleup v$

1618: must later be compensated for.

1619:

1620: To estimate the expected number of conformist pathways,

1621: we will need to bound the number of paths of length $l$ between $x$ and $y$. This is given by

1622: $$ k!\binom{l}{m}m!n^{m}\quad \text{ where }\quad m=\frac{l-k}{2}.$$

1623: We show this via the methods of \cite{BKL1}.

1624: They obtain an estimate for the number of cycles of a given length through a fixed vertex of the cube.

1625:

1626: Given a path, say $x=v_0,v_1,\ldots,v_l=y$, between $x$ and $y$,

1627: let us associate the sequence

1628: $(\epsilon_1i_1,\ldots,\epsilon_l i_l)$

1629: where

1630: $$v_j \bigtriangleup v_{j-1}=\{i_j\}

1631: \quad\text{and}\quad

1632: \epsilon_j=

1633: \left\{

1634: \begin{array}{l}

1635: +1\qquad\text{ if } v_j=v_{j-1}\cup\{i_j\} \\

1636: -1\qquad\text{ if } v_j=v_{j-1}\setminus\{i_j\}

1637: \end{array}

1638: \right.$$

1639: for $j=1,\ldots,l$. Since distinct paths have distinct sequences, we

1640: can bound the number of paths by finding an upper bound for the

1641: number of sequences.
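As a small illustration of this encoding (our own example, with $n=3$ and $k=1$), take $x=\emptyset$, $y=\{1\}$ and the path $\emptyset,\{2\},\{1,2\},\{1\}$ of length $l=3$. The associated sequence is
$$
(\epsilon_1 i_1,\epsilon_2 i_2,\epsilon_3 i_3)=(+2,\,+1,\,-2),
$$
which has two positive entries and one negative entry; the negative entry undoes the single mutation at a locus outside $x\bigtriangleup y$, in agreement with $m=(l-k)/2=1$.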

1642:

1643: Note that there must be $m+k$ positive entries, which occur at

1644: $\binom{l}{m+k}=\binom{l}{m}$ possible locations. The absolute

1645: values of $m$ of these entries are chosen freely from $\{1,\dots, n\}$, while

1646: the remaining $k$ must be the integers $1,\ldots,k$. There are

1647: $n^mk!$ ways to do this. We are free to order the $m$ negative

1648: entries and the bound follows.

1649:

1650: We now assume that $d(x,y)$ is even and relabel $d(x,y)=2k$.

1651: We omit the similar calculation for odd distances. Define

1652: $b=-3k/(2\log\la_e)$ and $t=\lfloor b\log  n\rfloor$. Then the

1653: expected number of conformist paths between $x$ and $y$ of length greater than $2k$ can be bounded as follows:

1654: \begin{eqnarray*}\sum_{l\geq k+1} \sum_{\mathcal{I}_{2l}} p_e^{2l}&=&

1655: \sum_{k+1\leq l< t}

1656: \sum_{\mathcal{I}_{2l}}p_e^{2l}+\sum_{l\geq t}

1657: \sum_{\mathcal{I}_{2l}}p_e^{2l} \\

1658: &<&\sum_{k+1\leq l< t} \binom{2l}{l-k}n^{l-k}(l-k)!(2k)!p_e^{2l}

1659: +\sum_{l\geq t}n^{2l}p_e^{2l} \\

1660: &\le&\sum_{k+1\leq l< t}

1661: (2l)^{l-k}n^{l-k}p_e^{2(l-k)}(2k)!p_e^{2k}

1662: +\sum_{l\geq t}\la_e^{2l} \\

1663: &<&(2k)!p_e^{2k}\sum_{l\geq k+1}(2b\la_e p_e\log n)^{l-k}+O(\la_e^{2b\log n})

1664: \\

1665: &=&k (2k)!p_e^{2k} O(p_e\log{n})+O(n^{2b\log \la_e}) \\

1666: &=&k (2k)!p_e^{2k} O\left( n^{-1} \log{n} \right)  .

1667: \end{eqnarray*}
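Two steps in this chain may deserve a word of explanation; the following is a sketch that uses only the definitions above. First, the combinatorial factor satisfies
$$
\binom{2l}{l-k}(l-k)!=\frac{(2l)!}{(l+k)!}=\prod_{j=l+k+1}^{2l} j\;\le\; (2l)^{l-k},
$$
while $n\,p_e^2=\la_e p_e$ because $\la_e=np_e$ (as used in the third line), and $2l<2t\le 2b\log n$ throughout the first sum. Second, the choice of $b$ gives $2b\log\la_e=-3k$, so that
$$
\la_e^{2b\log n}=n^{2b\log\la_e}=n^{-3k},
$$
which, for $k\ge 1$, is of smaller order than $p_e^{2k}\,n^{-1}\log n\asymp n^{-2k-1}\log n$ and can therefore be absorbed into the final term.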

1668:

1669: \subsection*{Appendix B. Cluster structure under random pair incompatibilities.}

1670:

1671: Here we show that, under the random pairwise incompatibilities model introduced in Section 5.1,

1672: connected clusters include large subcubes. The basic idea

1673: comes from \cite{BD}. A configuration $a\in \{0,1,*\}^n$

1674: is a way to specify a sub-cube of $\cG$, if $*$'s are thought of as places which could be filled

1675: by either a 0 or a 1. The number of non-$*$'s is the {\it length\/} of $a$. Call $a$

1676: an {\it implicant\/} if the entire sub-cube specified by $a$ is viable.

1677:

1678: We present two arguments, beginning with the one which

1679: works better for small $c$. Let the auxiliary random

1680: variable $X$ be the number of pairs of loci $(i,j)$, $i<j$, for which:

1681: \begin{itemize}

1682: \item[(E1)] There is exactly one incompatibility involving alleles on $i$ and $j$.

1683: \item[(E2)] There is no incompatibility involving an allele on either $i$ or $j$,

1684: and  an allele on $k\notin\{i,j\}$.

1685: \end{itemize}

1686: Assume, without loss of generality, that the incompatibility

1687: which satisfies (E1) is $(1_i, 1_j)$. Then the fitness of all

1688: genotypes which have any of the allele assignments $0_i0_j$, $0_i1_j$ and $1_i0_j$,

1689: and agree on other loci, is the same.

1690: Note also that all pairs of loci which satisfy (E1) and (E2) must be

1691: disjoint.

1692: Therefore, if $x$ is any viable genotype, its cluster contains

1693: an implicant  with the number of $*$'s at least $X$ plus the number

1694: of free loci. To determine the size of $X$, note that the expectation

1695: $$

1696: E(X)={\binom{n}{2}}4p(1-p)^3(1-p)^{8(n-2)}\sim ce^{-4c}n

1697: $$

1698: and furthermore, by an equally easy computation,

1699: $$

1700: E(X^2)-E(X)^2=\cO(n),

1701: $$

1702: so that $X\sim ce^{-4c}n$ {\aas }

1703: It follows that every cluster

1704: contains \aas~at least $\exp(((e^{-2c}+ce^{-4c})\log 2-\e)n)$

1705: viable genotypes, for any $\e>0$.
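The asymptotics of $E(X)$ can be checked numerically. The sketch below assumes that each incompatibility is present independently with probability $p=c/(2n)$, which is the normalization under which the displayed formula yields $ce^{-4c}n$; this value of $p$ is inferred from the formula rather than quoted, and the function names are ours.

```python
import math

def expected_pairs(n, c):
    """Exact E(X) from the displayed formula, with p = c/(2n) (assumed)."""
    p = c / (2 * n)
    return math.comb(n, 2) * 4 * p * (1 - p) ** 3 * (1 - p) ** (8 * (n - 2))

def expected_pairs_limit(n, c):
    """Leading-order approximation c * exp(-4c) * n."""
    return c * math.exp(-4 * c) * n

for n in (10**2, 10**4, 10**6):
    # The ratio should approach 1 as n grows.
    print(n, expected_pairs(n, 1.0) / expected_pairs_limit(n, 1.0))
```

The ratio differs from 1 by a term of order $1/n$, so the approximation is already accurate for moderately large $n$.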

1706:

1707:

1708: The second argument is a refinement of the one in \cite{BD}

1709: and only works better for larger $c$.

1710: Call an implicant $a$ a

1711: {\it prime implicant (PI)\/} if at any locus

1712: $i$, replacement of either $0_i$ or $1_i$

1713: by $*_i$ results in a non-implicant. Moreover, we call $a$ the {\it least prime

1714: implicant (LPI)\/} if it is a PI, and the following two conditions are

1715: satisfied. First, if all the $*$'s

1716: are changed to 0's, then  no change from $1_i$ to $0_i$ results in a

1717: viable genotype.

1718: Second,

1719: no change $*_i1_j$ to $1_i*_j$, where $i<j$, results in an implicant.

1720:

1721: Now, every viable genotype must have an LPI in its cluster.

1722: To see this, assume we have a PI for which the first condition is not satisfied. Make the

1723: indicated change, then replace some 0's and 1's by $*$'s

1724: until you get a prime implicant. If the second

1725: condition is violated, make the resulting switch, then again

1726: make some replacement by $*$'s until you arrive at a PI.

1727: Either of these two operations moves within the same cluster, and

1728: keeps the number of 1's nonincreasing

1729: and their positions more to the left. Therefore, the procedure

1730: must at some point end, resulting in an LPI in the same cluster.

1731:

1732: For a sub-cube $a$ to be an LPI,

1733: the following conditions need to be satisfied:

1734: \begin{itemize}

1735: \item[(I1)] Every non-$*$ has to be compatible with every other non-$*$,

1736: and with both 0 and 1 on each of the $*$'s.

1737: \item[(I2)] Any of the four 0,1 combinations on any pair of $*$'s must be compatible.

1738: \item[(LPI1)] Pick an $i$ with allele 1, that is, a $1_i$.

1739: Then $0_i$ must be incompatible with at least

1740: one non-$*$, or at least one 0 on a $*$. Furthermore, if $0_i$ has an

1741: incompatibility

1742: with a 0 on a $*$ to its left, it has to have another incompatibility, either

1743: with a non-$*$, or with a 0 or a 1 on a $*$.

1744: \item[(LPI2)] Pick a $0_i$.

1745: Then $1_i$ must be incompatible with a non-$*$, or a 0 or a 1 on a $*$.

1746: \end{itemize}

1747: The first two conditions make $a$ an implicant, and the last two an LPI.

1748: Note also that these conditions are independent.

1749:

1750: Let now $X$ be the number of LPI of length $rn$. We will identify a

1751: function $L_4=L_4(r,c)$ such that

1752: $$

1753: \frac 1n\log E(X)\le L_4.

1754: $$

1755: Let

1756: $$

1757: L_1=L_1(\be,p,z)=z(\be\log p+(1-\be)\log(1-p)-\be\log\be-(1-\be)\log(1-\be)).

1758: $$

1759: This is the exponential rate for the probability that in $zn$

1760: Bernoulli trials with success probability $p$ there are exactly $\be zn$

1761: successes, i.e., this probability is $\approx \exp(L_1n)$. Further,

1762: if $\kappa, \e,\de\in(0,1)$ are fixed, then among sub-cubes

1763: with $rn$ non-$*$'s and $\al n$ 1's ($\al\le r$), the proportion

1764: which have $\e n$ 1's in $[\kappa n, n]$ and $\de n$ $*$'s in

1765: $[1,\kappa n]$ has exponential rate

1766: $$

1767: \begin{aligned}

1768: L_2=&L_2(r,c,\kappa, \al, \e, \de)\\

1769: =&L_1((\al-\e)/\kappa, \al, \kappa)+L_1(\e/(1-\kappa), \al, 1-\kappa)\\

1770: &+L_1(\de/(\kappa-\al+\e), 1-r, \kappa-\al+\e)+ L_1((1-r-\de)/(1-\kappa-\e), 1-r, 1-\kappa-\e).

1771: \end{aligned}

1772: $$

1773: (Here all four first arguments in $L_1$ are in $[0, 1]$,

1774: or else the rate is $-\infty$.)
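The interpretation of $L_1$ as an exponential rate can be sanity-checked against the exact binomial probability. The sketch below is only an illustration with arbitrarily chosen parameter values; it takes the number of successes to be $\be zn$ out of $zn$ trials, which is the normalization consistent with how $L_1$ enters $L_2$, and the helper names are ours.

```python
import math

def L1(beta, p, z):
    """Exponential rate L_1(beta, p, z) as defined above (0 < beta, p < 1)."""
    return z * (beta * math.log(p) + (1 - beta) * math.log(1 - p)
                - beta * math.log(beta) - (1 - beta) * math.log(1 - beta))

def log_binom_pmf(trials, successes, p):
    """Natural log of the Binomial(trials, p) probability of `successes`."""
    log_choose = (math.lgamma(trials + 1) - math.lgamma(successes + 1)
                  - math.lgamma(trials - successes + 1))
    return (log_choose + successes * math.log(p)
            + (trials - successes) * math.log(1 - p))

n, z, beta, p = 10**6, 0.5, 0.3, 0.4
trials, successes = round(z * n), round(beta * z * n)
# The two numbers should agree up to a correction of order log(n)/n.
print(log_binom_pmf(trials, successes, p) / n, L1(beta, p, z))
```

The remaining discrepancy comes from the polynomial prefactor in Stirling's formula, which does not affect the exponential rate.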

1775:

1776: The expected number of LPI, with $r,\kappa, \e,\de$  given as above, has exponential rate

1777: at most (and this is only an upper bound)

1778: $$

1779: \begin{aligned}

1780: L_3=&L_3(r,c,\kappa, \al, \e, \de)\\

1781: =&-(1-r)\log(1-r)-\al\log\al-(r-\al)\log(r-\al)\\

1782: &-c(1-r/2)^2\\

1783: &+(r-\al)\log(1-\exp(-c(1-r/2)))\\

1784: &+(\al-\e)\log(1-\exp(-c/2))+\e\log(1-\exp(-c/2)-{\textstyle\frac 12}\de c\exp(-c(1-r/2)))\\

1785: &+L_2(r,c,\kappa, \al, \e, \de).

1786: \end{aligned}

1787: $$

1788: The next to last line is obtained from (LPI1), as $\e n$ 1's must have

1789: $\de n$ $*$'s on their left.

1790:

1791: It follows that $L_4$ can be obtained by

1792: $$

1793: L_4(r,c)=\inf_\kappa\sup_{\al, \e,\de} L_3(r,c,\kappa, \al, \e, \de).

1794: $$

1795: If $L_4(r,c)<0$, all LPI (for this $c$) \aas~have length at most $r$. Numerical computations

1796: show that this gives a better

1797: bound than $1-e^{-2c}-ce^{-4c}$ for $c\ge 0.38$. Let us denote the

1798: best upper bound from the two estimates by $r_u(c)$. This function

1799: is computed numerically and plotted in Figure~\ref{fig_ap_a}.
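For orientation, the first argument bounds the scaled length by $1-e^{-2c}-ce^{-4c}$ because it produces, in every cluster, an implicant with at least $(e^{-2c}+ce^{-4c})n$ $*$'s, up to lower-order terms. At the crossover value $c=0.38$ this bound evaluates (rounded to three decimals) to
$$
1-e^{-0.76}-0.38\,e^{-1.52}\approx 1-0.468-0.083=0.449,
$$
so at this value of $c$ the first argument already leaves more than half of the loci free.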

1800:

1801: \begin{figure*}[t]

1802:   \begin{center}

1803:    {\includegraphics[clip, height=5cm]{rm.ps}

1804:     }

1805:   \end{center}

1806:

1807:   \caption{The upper bound $r_u(c)$ for the number of non-$*$'s

1808:   in the implicant of smallest length included in every cluster

1809:   of viable genotypes, plotted against $c$.

1810:   }

1811: \label{fig_ap_a}

1812: \end{figure*}

1813:

1814: \subsection*{Appendix C. Number of clusters under random pair incompatibilities}

1815:

1816: In this section we briefly explain why the number of clusters

1817: under random pair incompatibilities is asymptotically

1818: a function of a Poisson random variable. There is a

1819: clear way to separate the genotype space into disconnected clusters.

1820: For example, if $F_1=\{(0_1,0_2), (1_2,0_3),(1_1,1_2)\}$, we

1821: see that every viable genotype has one of these two allele configurations

1822: on the first two loci: $C=0_11_2$ or $\overline{C}=1_10_2$.

Since there are no viable genotypes with $0_10_2$ or $1_11_2$,

1824: there is no way to mutate from the viable genotypes with $0_11_2$

1825: to the viable genotypes with $1_10_2$ without passing through an inviable genotype.

1826: However, if we add one incompatibility to $F_1$ to make

1827: $F_2=F_1\cup\{(0_1,1_2)\}$,

1828: then there are no longer any genotypes with the alleles $0_11_2$

1829: and we return to a single cluster of viable genotypes.

1830:

1831: Notice that the digraph $D_{F_1}$ contains the directed

1832: cycle $1_1 \to 0_2 \to 1_1$ and equivalently the directed cycle

$1_2 \to 0_1 \to 1_2$. $D_{F_2}$ also contains these

1834: cycles but there are paths between them as well: $0_2 \to 0_1$ and $1_1 \to 1_2$.

1835:

1836: Formally, a pair of complementary allele configurations

1837: $(C,\overline{C})$ on a set of  $k \geq 2$ loci is defined to

1838: be a {\it splitting pair\/} if the digraph $D_F$ contains a directed cycle

1839: (in any order) on the alleles in $C$ (and equivalently on those in $\overline{C}$,

1840: which consist of reversed alleles in $C$)

1841: and does not contain a path between the alleles in $C$ and the alleles in $\overline{C}$.

1842: It should be clear from the example $F_1$ above that the existence

1843: of a splitting pair will create a barrier in the genotype

1844: space through which it is not possible to pass by mutations on viable genotypes.

1845: In fact, it is proved in Pitman (unpub.) that

1846: any two viable genotypes $u$ and $v$ will be disconnected

1847: in the fitness landscape if and only if the loci on which they

1848: differ contain a splitting pair.

1849:

1850:

1851: Thus, the existence of viable genotypes on either side of

1852: a splitting pair (with each configuration of complementary alleles)

1853: ensures disconnected clusters. If there are $k$ splitting pairs in the

1854: formula $F$ and there are viable genotypes with each of the allele

1855: configurations in each of the splitting pairs then there are $2^k$ clusters

1856: of viable genotypes.

1857: The restriction that there be viable genotypes on either side is asymptotically

1858: unlikely to make a difference as we can

1859: fix one of the $2^k$ configurations of alleles and \aas~find a

1860: viable genotype on the remaining loci. Therefore the number of

1861: clusters of viable genotypes is \aas~equal to $2^X$, where

$X$ is the number of splitting pairs, provided that $X$ is
stochastically bounded; this holds because, as we will see shortly, the expectation
$E(X)$ is bounded. In fact, the next paragraph suggests

1865: that $X$ converges to a  Poisson limiting distribution.

1866: (A detailed discussion of this issue will appear in Pitman (unpub.).)

1867:

1868:

1869: It follows from \cite{Pal} or \cite{Bol}

that the number of directed cycles of length $k$ in $D_F$ is asymptotically
Poisson$(\lambda_k)$ with $\lambda_k = (2k)^{-1}c^k$.
In particular, the expected number of splitting pairs converges to
$\lambda=\sum_{k\ge 2}\lambda_k=-\frac{1}{2} (\ln(1-c)+c)$.

1874: Moreover, the probability that there is no splitting pair

converges to the product, over $k\ge 2$, of the probabilities that no cycle of length $k$ is present

1876: \citep{Pal}, which is

1877: \begin{equation}

1878: \prod_{k=2}^\infty \exp{\left(-\frac{c^k}{2k}\right)} =

1879: \exp{\left(\frac{ \ln{(1-c)}+c}{2}\right)} = [(1-c)e^c]^{\frac{1}{2}}.

1880: \end{equation}

1881: In particular, this gives the limiting probability of a unique cluster.
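
For the reader's convenience, the sum behind the last display can be checked directly.
Assuming $0\le c<1$, the series $-\ln(1-c)=\sum_{k\ge 1}c^k/k$ gives
$$
\sum_{k=2}^\infty \frac{c^k}{2k}=\frac 12\left(-\ln(1-c)-c\right)=\lambda.
$$
As an illustration (the value $c=1/2$ is chosen arbitrarily), $\lambda\approx 0.0966$ and
the limiting probability of a unique cluster is $[(1-c)e^c]^{\frac 12}=e^{-\lambda}\approx 0.908$.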

1882:

1883: \subsection*{Appendix D. Proof of equation~(\ref{gamma}).}

1884:

1885: In this section we assume that genotypes have multiallelic loci, which are

1886: subject to random pair incompatibilities. The model introduced in Section 5.2

1887: is the most natural, but is not best suited for our second moment approach.

1888: Instead, we will work with the equivalent modified

1889: model with $m$ pair incompatibilities, each

1890: chosen independently at random, and the first and the second member of each pair

1891: chosen independently from the $an$ available alleles. We will assume

1892: that $m=\frac 14ca^2n$, label $c'=\frac 14c$, and denote, as usual, the resulting set

1893: of incompatibilities by $F$.

1894:

1895: To see that these two models are equivalent for our purposes,

1896: first note that the number of incompatibilities which are

1897: {\it not legitimate\/}, in the sense that the two alleles are chosen

1898: from the same locus, is  stochastically bounded in $n$. (In fact, it

1899: converges in distribution to a Poisson($c'a^2$) random variable.)

1900: Moreover, by the Poisson approximation to the birthday problem

1901: \citep{BHJ}, the number of pairs of

1902: choices which result in the same incompatibility in this model is

1903: asymptotically Poisson($c'a^2/2$).

1904: In short, then, the procedure results in the number $m-\cO(1)$ of different legitimate

1905: incompatibilities. If $m$ in the modified model is increased to, say, $m'=m+n^{2/3}$, then the

1906: two models could be coupled so that

1907: the incompatibilities in the original model are included in those in the modified model. As

the existence of a viable genotype becomes less likely when $m$ is increased, this demonstrates

1909: that~(\ref{gamma}) will follow once we show

1910: the following for the modified model:

1911: for every $\e>0$ there exists a large enough $a$ so

1912: that $c'<\log a-\e$ implies that

1913: $N\ge 1$ \aas

1914:

1915: To show this, we introduce the auxiliary random variable

1916: $$

1917: X=\sum_{\sigma \in \cG_a}\prod_{I\in F}\left(w_01_{\{|I\cap\sigma|=0\}}+

1918:                                         w_11_{\{|I\cap\sigma|=1\}}\right),

1919: $$

1920: where $1_A$ is the indicator of the set $A$.

1921: The size of the intersection $I\cap\sigma$ is computed by transforming

1922: both the incompatibility $I$ and

1923: the genotype $\sigma$ to

1924: sets of (indexed) alleles, and

1925: the weights $w_0$ and $w_1$ will be chosen later. To intuitively understand the

1926: statistic $X$, note that when $w_0=w_1=1$, the product is exactly the indicator of the

1927: event that $\sigma$ is viable and $X$ is then the number of viable genotypes $N$. In general,

1928: $X$ gives different scores to different viable genotypes --- however, the crucial fact to note

is that $X>0$ iff $N>0$ (as long as $w_0,w_1>0$). Therefore

1930: $$

1931: P(N>0)= P(X>0)\ge (E(X))^2/E(X^2),

1932: $$

1933: which is how the second moment method is used \citep{AM}.
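
The inequality above is a standard consequence of the Cauchy--Schwarz inequality.
Assuming positive weights, so that $X\ge 0$ and $X=X1_{\{X>0\}}$, we have
$$
(E(X))^2=\left(E\left(X1_{\{X>0\}}\right)\right)^2\le E(X^2)\,P(X>0),
$$
and dividing by $E(X^2)$ gives the stated lower bound on $P(X>0)$.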

1934:

1935: As

1936: $$

1937: \begin{aligned}

1938: &P(|\sigma\cap I|=0)=\left(\frac {a-1}a\right)^2, \\

1939: &P(|\sigma\cap I|=1)=\frac {2(a-1)}{a^2}, \\

1940: \end{aligned}

1941: $$

1942: we have

1943: $$

1944: E(X)=a^n\left(w_0\left(\frac {a-1}a\right)^2+w_1\frac {2(a-1)}{a^2}\right)^m.

1945: $$
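
To see where these expressions come from, note that each member of $I$ is chosen uniformly
from the $an$ alleles, independently of the other member, while a fixed genotype $\sigma$
contains exactly $n$ of them (one per locus). Each member therefore lies in $\sigma$ with
probability $1/a$, so that
$$
P(|\sigma\cap I|=0)=\left(1-\frac 1a\right)^2,\qquad
P(|\sigma\cap I|=1)=2\cdot\frac 1a\left(1-\frac 1a\right).
$$
The formula for $E(X)$ then follows from the independence of the $m$ incompatibilities and
summation over the $a^n$ genotypes. As an illustrative check with $a=3$, the two probabilities
are $4/9$ and $4/9$; the remaining $1/9$ is the probability that both members of $I$ lie in
$\sigma$, which makes $\sigma$ inviable.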

1946: Moreover

1947: $$

E(X^2)=\sum_{k=0}^n a^n \binom{n}{k}(a-1)^k\left(w_0^2 P(00)+2w_0w_1P(01)+w_1^2P(11)\right)^m,

1949: $$

1950: where $P(01)$ is the probability that $I$ has

1951: intersection of size $0$ with $\sigma=0_1\dots0_k0_{k+1}\dots 0_n$ and of size $1$ with

1952: $\tau=1_1\dots1_k0_{k+1}\dots 0_n$, and $P(00)$ and $P(11)$ are defined analogously. Thus, if $k=\al n$,

1953: $$

1954: \begin{aligned}

1955: &P(00)=\left(1-\frac{1+\al}a\right)^2,\\

1956: &P(01)=\frac{2\al}a\left(1-\frac{1+\al}a\right),\\

1957: &P(11)=\frac{2(1-\al)}a\left(1-\frac{1+\al}a\right)+2\left(\frac\al a\right)^2.

1958: \end{aligned}

1959: $$

1960: Let $\Lambda=\Lambda_{a, w_0, w_1}(\al)$ be the $n$'th root of the

1961: $k=(\al n)$'th term in the sum for $E(X^2)$, divided by $E(X)^2$. Hence

1962: $$

1963: \begin{aligned}

1964: \Lambda=&\frac{(a-1)^\al}{a\cdot \al^\al(1-\al)^{1-\al}}\\

1965: &\times \frac

1966: {\left( w_0^2\left(1-\frac{1+\al}a\right)^2+4w_0w_1\frac{\al}a\left(1-\frac{1+\al}a\right)

1967:         +2w_1^2\left(\frac{(1-\al)}a\left(1-\frac{1+\al}a\right)+\left(\frac\al a\right)^2\right)

1968:  \right)^{c'a^2}}

1969: {\left(w_0\left(\frac {a-1}a\right)^2+w_1\frac {2(a-1)}{a^2}\right)^{2c'a^2}}.

1970: \end{aligned}

1971: $$

1972: Let $\al^*=(a-1)/a$. A short computation shows that $\Lambda=1$ when $\al=\al^*$.
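
The short computation can be sketched as follows. At $\al=\al^*$ we have
$1-\frac{1+\al^*}a=\left(\frac{a-1}a\right)^2$, so that
$$
P(00)=\left(\frac{a-1}a\right)^4,\qquad
P(01)=\frac{2(a-1)}{a^2}\left(\frac{a-1}a\right)^2,\qquad
P(11)=\frac{4(a-1)^2}{a^4},
$$
and therefore
$$
w_0^2P(00)+2w_0w_1P(01)+w_1^2P(11)=
\left(w_0\left(\frac{a-1}a\right)^2+w_1\frac{2(a-1)}{a^2}\right)^2,
$$
which cancels the denominator of $\Lambda$. Moreover,
$(\al^*)^{\al^*}(1-\al^*)^{1-\al^*}=(a-1)^{\al^*}/a$, so the first factor of $\Lambda$
also equals $1$.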

1973:

1974: If $\Lambda>1$ for some $\al$, then $E(X^2)/(E(X))^2$ increases exponentially and

1975: the method fails (as we will see below,

1976: this always happens when $w_0=w_1=1$, i.e.,

1977: when $X=N$). On the other hand, if $\Lambda<1$ for $\al\ne\al^*$, and

1978: $\frac{d^2\Lambda}{d\al^2}(\al^*)<0$, then

1979: Lemma 3 from \cite{AM} implies that $E(X^2)/(E(X))^2\le C$ for some constant

1980: $C$, which in turn implies that $P(N>0)\ge 1/C$. The sharp threshold

1981: result then finishes off the proof of~(\ref{gamma}).

1982:

1983: Our aim then is to show that $w_0$ and $w_1$ can be chosen so that, for $c'=\log a-\e$,

1984: $\Lambda$ has the properties described in the above paragraph.

1985: We have thus reduced the proof of~(\ref{gamma}) to a calculus problem.

1986:

Certainly a necessary condition is that $\frac{d\Lambda}{d\al}(\al^*)=0$, and

1988: $$

1989: \frac{d\Lambda}{d\al}(\al^*)=-\frac 2{a^3}(w_0(a-1)-w_1(a-2))^2,

1990: $$

so we choose $w_0=a-2$ and $w_1=a-1$. (Only the ratio of $w_0$ to $w_1$

1992: matters, so a single equation is enough.) This simplifies $\Lambda$ to

1993: $$

1994: \Lambda=\Lambda_a(\al)=\frac{(a-1)^\al}{a\al^\al(1-\al)^{1-\al}}

1995: \cdot

1996: \frac

{\left(\left(\al -\frac{a-1}a\right)^2+\frac{(a-1)^4}{a^2}\right)^{c'a^2}}

1998: {\left(\frac{(a-1)^2}a\right)^{2c'a^2}}.

1999: $$

2000: Let $\varphi=\log\Lambda$. We need to demonstrate that $\varphi<0$ for $\al\in[0,\al^*)\cup (\al^*, 1]$

2001: and that $\varphi''(\al^*)<0$. A further simplification can be obtained

2002: by using $x-Cx^2\le \log(1+x)\le x$ (valid for all nonnegative $x$),

2003: which enables us to transform $\varphi$ (without changing the

2004: notation) to

2005: $$

2006: \varphi(\al)=c'\frac{a^4}{(a-1)^4}\left(\al -\frac{a-1}a\right)^2

2007: -\al \log\al-(1-\al)\log(1-\al)+\al\log(a-1)-\log a.

2008: $$
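
In more detail, here is a sketch of the step leading to the quadratic term. With
$w_0=a-2$ and $w_1=a-1$, the quantity raised to the power $c'a^2$ in the numerator of
$\Lambda$ is a quadratic in $\al$ with leading coefficient $1$, minimized at $\al^*$, where
it equals $(a-1)^4/a^2$. Writing
$$
x=\frac{a^2}{(a-1)^4}\left(\al-\frac{a-1}a\right)^2\ge 0,
$$
the second factor of $\Lambda$ is therefore $(1+x)^{c'a^2}$, and $\log(1+x)\le x$ gives
$$
c'a^2\log(1+x)\le c'\frac{a^4}{(a-1)^4}\left(\al -\frac{a-1}a\right)^2.
$$
The remaining terms of $\varphi$ are the logarithm of the first factor of $\Lambda$.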

2009: Now

2010: $$

2011: \varphi''(\al)=2c'\frac{a^4}{(a-1)^4}-\frac 1{\al(1-\al)}.

2012: $$

2013: So automatically, for $c'$ large but $c'=o(a)$, $\varphi''(\al^*)<0$ for large $a$. Moreover,

2014: $\varphi$ cannot have another local maximum when $\varphi''>0$. If

2015: $\varphi(\al)\ge 0$ for some $\al\ne\al^*$, then this must happen for an $\al$

2016: in one of the two intervals

2017: $[0, 1/(2c')+\cO((c')^{-2})]$ or $[1- 1/(2c')-\cO((c')^{-2}), 1]$.

2018: Now, $\varphi$ has a unique

2019: maximum at $\al^*$ in the second interval. In the first interval,

2020: a short computation shows that

2021: $$

2022: \varphi(\al)=-\e-\al \log a+\cO\left(\frac{\log\log a}{\log a}\right),

2023: $$

2024: which is negative for large $a$. This ends the proof.
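
For concreteness, the curvature condition used in the proof can be made explicit.
Since $\al^*(1-\al^*)=(a-1)/a^2$,
$$
\varphi''(\al^*)=2c'\frac{a^4}{(a-1)^4}-\frac{a^2}{a-1}<0
\quad\text{if and only if}\quad c'<\frac{(a-1)^3}{2a^2}.
$$
The right-hand side grows like $a/2$, while $c'=\log a-\e$ grows only logarithmically.
As an illustration, for $a=10$ the bound is $729/200=3.645$, which exceeds $\log 10\approx 2.303$.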

2025:

2026: This method yields nontrivial lower bounds for $\gamma$ for all $a\ge 3$,

2027: cf.~Table 1.

2028:

2029: \begin{center}

2030: \renewcommand{\arraystretch}{.75}

2031: \begin{table}[t]

2032: \caption{The lower bounds on $\gamma$ obtained by the method described in

the text, compared to the easy upper bounds $4\log a$.

2034: }%

2035: \label{t2x1_prel}%

2036: {\normalsize \vspace{.2in} }

2037: \par

2038: \begin{center}

2039: {\normalsize

2040: \begin{tabular} {|r||r|r|}\hline

2041: $a$ & l.~b.~on $\gamma$ & $4\log a$\\ \hline\hline

2042: 3 & 1.679 & 4.395\\

2043: 4 & 2.841 & 5.546\\

2044: 5 & 3.848 & 6.438\\

2045: 6 & 4.714 & 7.168\\

2046: 7 & 5.467 & 7.784\\

2047: 8 & 6.128 & 8.318\\

2048: 9 & 6.715 & 8.789\\

2049: 10 & 7.242 & 9.211\\

2050: 20 & 10.672 & 11.983\\

2051: 30 & 12.608 & 13.605\\

2052: 40 & 13.944 & 14.756\\

2053: 50 & 14.960 & 15.649\\

2054: 100 & 18.017 & 18.421\\

2055: 200 & 20.982 & 21.194\\

2056: 300 & 22.663 & 22.816\\

2057: 400 & 23.846 & 23.966\\

2058: 500 & 24.759 & 24.859\\

2059: \hline

2060: \end{tabular}

2061: }

2062: \end{center}

2063: \end{table}

2064: \end{center}

2065:

2066: \subsection*{Appendix E. Existence of viable phenotypes.}

2067:

2068: In this section we describe a comparison between models from Sections 5.2 and 5.3

2069: that will yield the result in Section 5.3.

2070: We begin by assuming that $a=1/r$ is an integer, which we can do without loss of generality.

2071: Divide the $i$'th coordinate interval $[0,1]$ into $a$ disjoint intervals $I_{i0},\dots, I_{i,{a-1}}$

2072: of length $r$. For a phenotype $x\in \cP$ let $\Delta(x)\in \cG_a$ be determined

2073: so that $\Delta(x)_i=j$ iff $x_i\in I_{ij}$.

2074:

2075: Note that, as soon as $I_{i_1j_1}\times I_{i_2j_2}$ contains a point in

$\Pi_{i_1i_2}$, no $x$ with $\Delta(x)_{i_1}=j_1$ and $\Delta(x)_{i_2}=j_2$

2077: is viable. This happens independently for each such Cartesian product,

with probability $1-\exp(-\la r^2)\sim cr^2/(2n)$, for large $n$.

2079: Therefore, using the result from Section 5.2, when $cr^2>4\log a=-4\log r$, there is

2080: \aas~no viable

2081: genotype.

2082:

2083: On the other hand, let $I^\e$ be the closed $\e$-neighborhood of the interval

2084: $I$ in $[0,1]$ (the set of points within $\e$ of $I$), and consider

2085: the events that $I_{i_1j_1}^{r/2}\times I_{i_2j_2}^{r/2}$ contains a point in

2086: $\Pi_{i_1i_2}$. These events are independent if we restrict $j_1,j_2$

2087: to even integers. Moreover, each has probability

2088: $1-\exp(-4\la r^2)\sim 4cr^2/(2n)$, for large $n$.

It again follows from Section 5.2 that a viable phenotype $x$

2090: with $\Delta(x)_i$ even for all $i$, \aas~exists as soon as

$4cr^2<4(\log (a/2)-o(1))=(-4\log r-4\log 2-o(1))$.
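
To summarize the two halves of the comparison, the condition
$4cr^2<4(\log(a/2)-o(1))$ can be divided by $4$ and written in terms of $r=1/a$:
$$
cr^2<\log (a/2)-o(1)=-\log r-\log 2-o(1)
\quad\Longrightarrow\quad \text{a viable phenotype a.~a.~s.~exists},
$$
$$
cr^2>4\log a=-4\log r
\quad\Longrightarrow\quad \text{a.~a.~s.~no viable phenotype exists}.
$$
Under this reading, the two bounds on $cr^2$ differ by roughly a factor of $4$ when $r$ is small.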

2092:

2093: \end{document}