
1: %

2: \documentclass[12pt,final]{IEEEtran}

3: %

4: \usepackage{makeidx}  % allows for indexgeneration

5: \usepackage{amsfonts}

6: \usepackage{epsfig}

7: \usepackage{amsmath}

8: \usepackage{subfigure}

9: %\usepackage{wrapfig}

10: %\usepackage{boxedminipage}

11: %\usepackage{harvard}

12: \usepackage[dutch,USenglish]{babel}

13: %

14: %

15: \numberwithin{equation}{section} \numberwithin{figure}{section}

16: %

17: \newtheorem{lemma}{Lemma}

18: \newtheorem{observation}{Observation}

19: \newtheorem{definition}{Definition}

20: %

21: \begin{document}

22: %

23: \title{On the Complexity of the Single Individual SNP Haplotyping Problem\thanks{Part of this research has been funded by the Dutch BSIK/BRICKS project.}}

24: %

25: \markboth{On the Complexity of the Single Individual SNP

26: Haplotyping Problem}{Cilibrasi \MakeLowercase{\textit{et al.}}}

27: %

28: \author{Rudi Cilibrasi\thanks{Rudi Cilibrasi is supported in part by NWO project 612.55.002, and by the IST Programme of the European

29: Community, under the PASCAL Network of Excellence,

30: IST-2002-506778. This publication only reflects the authors'

31: views.}, Leo van Iersel, Steven Kelk and John Tromp}

32: %

33: % \institute{Technische Universiteit Eindhoven (TU/e), Den Dolech 2, 5612 AX Eindhoven, Netherlands\\

34: % \email{l.j.j.v.iersel@tue.nl}\\

35: % \and

36: % Centrum voor Wiskunde en Informatica (CWI), Kruislaan 413, 1098 SJ Amsterdam, Netherlands \\

37: % \email{Rudi.Cilibrasi@cwi.nl, S.M.Kelk@cwi.nl, John.Tromp@cwi.nl}\\

38: % }

39: %

40: \maketitle              % typeset the title of the contribution

41: %

42: \begin{abstract}

43: We present several new results pertaining to haplotyping. These

44: results concern the combinatorial problem of reconstructing

45: haplotypes from incomplete and/or imperfectly sequenced haplotype

46: fragments. We consider the complexity of the problems

47: \emph{Minimum Error Correction} (MEC) and \emph{Longest Haplotype

48: Reconstruction} (LHR) for different restrictions on the input

49: data. Specifically, we look at the \emph{gapless} case, where

50: every row of the input corresponds to a gapless

51: haplotype-fragment, and the \emph{1-gap} case, where at most one

52: gap per fragment is allowed. We prove that MEC is APX-hard in the

53: 1-gap case and still NP-hard in the gapless case. In addition, we

54: question earlier claims that MEC is NP-hard even when the input

55: matrix is restricted to being completely binary. Concerning LHR,

56: we show that this problem is NP-hard and APX-hard in the 1-gap

case (and thus also in the general case), but is polynomial-time

58: solvable in the gapless case.

59: \end{abstract}

60: %

61: %

62: %

63: \begin{keywords}

64: Combinatorial algorithms, Biology and genetics, Complexity

65: hierarchies

66: \end{keywords}

67: %

68: %

69: %

70: \section{Introduction}

71: %

72: If we abstractly consider the human genome as a string over the

73: nucleotide alphabet $\{ A, C, G, T \}$, it is widely known that

the genomes of any two humans have the same nucleotide at more than
99\% of the sites. The sites at which variability is observed

76: across the human population are called \emph{Single Nucleotide

77: Polymorphisms} (SNPs), which are formally defined as the sites on

78: the human genome where, across the human population, two or more

79: nucleotides are observed and each such nucleotide occurs in at

80: least 5\% of the population. These sites, which occur (on average)

81: approximately once per thousand bases, capture the bulk of human

82: genetic variability; the string of nucleotides found at the SNP

83: sites of a human - the \emph{haplotype} of that individual - can

84: thus be thought of as a ``fingerprint'' for that individual.\\

85: \\

86: It has been observed that, for most SNP sites, only two

87: nucleotides are seen; sites where three or four nucleotides are

88: found are comparatively rare. Thus, from a combinatorial

89: perspective, a haplotype can be abstractly expressed as a string

90: over the alphabet $\{ 0,1 \}$. Indeed, the biologically-motivated

91: field of SNP and haplotype analysis has spawned a rich variety of

92: combinatorial problems, which are well described in surveys such

93: as \cite{bonizzoni} and \cite{halldorsson}.\\

94: \\

95: We focus on two such combinatorial problems, both variants of the

96: \emph{Single Individual Haplotyping Problem} (SIH), introduced in

97: \cite{lanciabafna}. SIH amounts to determining the haplotype of an

98: individual using (potentially) incomplete and/or imperfect

99: fragments of sequencing data. The situation is further complicated

100: by the fact that, being a \emph{diploid} organism, a human has two

101: versions of each chromosome; one each from the individual's mother

102: and father. Hence, for a given interval of the genome, a human has

103: two haplotypes. Thus, SIH can be more accurately described as

104: finding the two haplotypes of an individual given fragments of

105: sequencing data where the fragments potentially have read errors

106: and, crucially, where it is \emph{not} known which of the two

107: chromosomes each fragment was read from. We consider two

108: well-known variants of the problem: \emph{Minimum Error

109: Correction} (MEC), and \emph{Longest Haplotype Reconstruction}

110: (LHR).\\

111: \\

112: The input to these problems is a matrix $M$ of SNP fragments. Each

113: column of $M$ represents an SNP site and thus each entry of the

114: matrix denotes the (binary) choice of nucleotide seen at that SNP

115: location on that fragment. An entry of the matrix can thus either

116: be `0', `1' or a \emph{hole}, represented by `-', which denotes

117: lack of knowledge or uncertainty about the nucleotide at that

118: site. We use $M[i,j]$ to refer to the value found at row $i$,

119: column $j$ of $M$, and use $M[i]$ to refer to the $i$th row. Two

120: rows $r_1, r_2$ of the matrix \emph{conflict} if there exists a

121: column $j$ such that $M[r_1, j] \neq M[r_2, j]$ and $M[r_1,j],

122: M[r_2, j] \in \{0,1\}$.\\

123: \\

124: A matrix is \emph{feasible} iff the rows of the matrix can be

125: partitioned into two sets such that all rows

126: within each set are pairwise non-conflicting.\\

127: \\

128: The objective in MEC is to ``correct'' (or ``flip'') as few

129: entries of the input matrix as possible (i.e. convert 0 to 1 or

130: vice-versa) to arrive at a feasible matrix. The motivation behind

131: this is that all rows of the input matrix were sequenced from one

132: haplotype or the other, and that any deviation from

133: that haplotype occurred because of read-errors during sequencing.\\

134: \\

135: The problem LHR has the same input as MEC but a different

136: objective. Recall that the rows of a feasible matrix $M$ can be

137: partitioned into two sets such that all rows within each set are

138: pairwise non-conflicting. Having obtained such a partition, we can

139: reconstruct a haplotype from each set by merging all the rows in

140: that set together. (We define this formally later in Section

141: \ref{sec:lhr}.) With LHR the objective is to remove \emph{rows}

142: such that the resulting matrix is feasible and such that the sum

143: of the

144: lengths of the two resulting haplotypes is maximised.\\

145: \\

146: In the context of haplotyping, MEC and LHR have been discussed -

147: sometimes under different names - in papers such as

148: \cite{bonizzoni}, \cite{fasthare}, \cite{greenberg} and

149: (implicitly) \cite{lanciabafna}. One question arising from this

150: discussion is how the distribution of holes in the input data

151: affects computational complexity. To explain, let us first define

152: a \emph{gap} (in a string over the alphabet $\{0,1,-\}$) as a

153: maximal contiguous block of holes that is flanked on both sides by

154: non-hole values. For example, the string \texttt{---0010---} has

155: no gaps, \texttt{-0--10-111} has two gaps, and \texttt{-0-----1--}

156: has one gap. Two special cases of MEC and LHR that are considered

157: to be practically relevant are the ungapped case and the 1-gap

158: case. The ungapped variant is where every row of the input matrix

159: is ungapped, i.e. all holes appear at the start or end. In the

160: 1-gap case every row has at most one gap.\\

161: %

162: In Section \ref{subsec:umec} we offer what we believe is the first

163: proof that Ungapped-MEC (and hence 1-gap MEC and also the general

164: MEC) is NP-hard. We do so by reduction from MAX-CUT. (As far as we

165: are aware, other claims of this result are based explicitly or

166: implicitly on results found in \cite{kleinberg}; as we discuss in

167: Section \ref{subsec:bmec}, we conclude that the results in

168: \cite{kleinberg} cannot be used for this purpose.)\\

169: \\

170: The NP-hardness of 1-gap MEC (and general MEC) follows immediately

171: from the proof that Ungapped-MEC is NP-hard. However, our

172: NP-hardness proof for Ungapped-MEC is not

173: approximation-preserving, and consequently tells us little about

174: the (in)approximability of Ungapped-MEC, 1-gap MEC and general

175: MEC. In light of this we provide (in Section \ref{subsec:gmec}) a

176: proof that 1-gap MEC is APX-hard, thus excluding (unless P=NP) the

177: existence of a \emph{Polynomial Time

Approximation Scheme} (PTAS) for 1-gap MEC (and general MEC).\\

179: \\

180: We define (in Section \ref{subsec:bmec}) the problem

181: \emph{Binary-MEC}, where the input matrix contains no holes; as

182: far as we know the complexity of this problem is still -

183: intriguingly - open. We also consider a parameterised version of

Binary-MEC, where the number of haplotypes is not fixed as two,

185: but is part of the input. We prove that this problem is NP-hard in

186: Section \ref{subsec:pbmec}. (In the Appendix we also prove an

187: ``auxiliary'' lemma which, besides being interesting in its own

188: right, takes on a new significance in light of the open complexity

189: of

190: Binary-MEC.)\\

191: \\

192: In Section \ref{subsec:lhrpoly} we show that \emph{Ungapped-LHR}

193: is polynomial-time solvable and give a dynamic programming

194: algorithm for this which runs in time $O(n^{2}m+n^{3})$ for an $n

195: \times m$ input matrix. This improves upon the result of

196: \cite{lanciabafna} which also showed a polynomial-time algorithm

197: for Ungapped-LHR but

198: under the restricting assumption of non-nested input rows.\\

199: \\

200: We also prove, in Section \ref{subsec:lhrhard}, that LHR is

201: APX-hard (and thus also NP-hard) in the general case, by proving

202: the much stronger result that 1-gap LHR is APX-hard. This is the

203: first proof of hardness (for both 1-gap LHR and general LHR)

appearing in the literature.\footnote{In \cite{lanciabafna} there

205: is a claim, made very briefly, that LHR is NP-hard in general, but

206: it is not substantiated.}

207: %

208: %

209: %

210: \section{Minimum Error Correction (MEC)}

211: \label{sec:mec}

212: %

213: For a length-$m$ string $X \in \{0,1,-\}^m$, and a length-$m$

214: string $Y \in \{0,1\}^m$, we define $d(X,Y)$ as the number of

215: \emph{mismatches} between the strings i.e. positions where $X$ is

216: 0 and $Y$ is 1, or vice-versa; holes do not contribute to the

217: mismatch count. Recall the definition of \emph{feasible} from

218: earlier; an alternative, and equivalent, definition (which we use

219: in the following proofs) is as follows. An $n \times m$ SNP matrix

220: $M$ is \emph{feasible} iff there exist two strings (haplotypes)

221: $H_1, H_2 \in \{0,1\}^m$,

such that for all rows $r$ of $M$, $d( r, H_1) = 0$ or $d( r, H_2 )=0$.\\

223: \\

224: Finally, a \emph{flip} is where a 0 entry is converted to a 1, or

225: vice-versa. Flipping to or from holes is not allowed and the

226: haplotypes $H_1$ and $H_2$ may not contain holes.
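\par\noindent\textbf{Illustrative example.} The following small instance is
our own (hypothetical) illustration of the definitions above; it is not taken
from real sequencing data. Consider the ungapped $3 \times 3$ matrix
\[
M = \left(
\begin{array}{ccc}
0 & 0 & - \\
0 & 1 & 1 \\
1 & 1 & -
\end{array}
\right).
\]
Rows 1 and 2 conflict in column 2, rows 2 and 3 conflict in column 1, and rows
1 and 3 conflict in columns 1 and 2. Since the three rows conflict pairwise,
every partition of the rows into two sets places two conflicting rows
together, so $M$ is not feasible. Flipping $M[2,1]$ from 0 to 1 turns row 2
into $111$, and with $H_1 = 000$ and $H_2 = 111$ we then have
\[
d(M[1], H_1) = 0, \quad d(M[2], H_2) = 0, \quad d(M[3], H_2) = 0,
\]
because holes do not contribute to $d$. Hence one flip suffices, and since
zero flips do not, the minimum number of flips for this particular $M$ is 1.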

227: %

228: %

229: %

230: \subsection{Ungapped-MEC}

231: \label{subsec:umec}

232: \noindent\textbf{Problem:} \emph{Ungapped-MEC}\\

233: \textbf{Input:} An ungapped SNP matrix $M$\\

\textbf{Output:} Ungapped-MEC($M$), which we define as the smallest
number of flips needed to make $M$ feasible.\footnote{In
subsequent problem definitions we regard it as implicit that $P(I)$
represents the optimal output of a problem $P$ on input $I$.}\\

238: %

239: %

240: %

241: \begin{lemma}

242: \label{lem:mechard} Ungapped-MEC is NP-hard.\\

243: \end{lemma}

244: \begin{proof}

245: We give a polynomial-time reduction from MAX-CUT, which is the

246: problem of computing the size of a maximum cardinality cut in a

247: graph.\footnote{The reduction given here can easily be converted

248: into a Karp reduction from the decision version of MAX-CUT to the

249: decision version of Ungapped-MEC.} Let $G=(V,E)$ be the input to

250: MAX-CUT, where $E$ is undirected. (We identify, wlog, $V$ with

$\{1, 2, \ldots, |V|\}$.) We construct an input matrix $M$ for

252: Ungapped-MEC with $2k|V| + |E|$ rows and $2|V|$ columns where $k =

253: 2|E||V|$. We use $M_0$ to refer to the first $k|V|$ rows of $M$,

254: $M_1$ to refer to the second $k|V|$ rows of $M$, and $M_G$ to

255: refer to the remaining $|E|$ rows. $M_0$ consists of $|V|$

256: consecutive blocks of $k$ identical rows. Each row in the $i$-th

257: block (for $1 \leq i \leq |V|$) contains a $0$ at columns $2i-1$

and $2i$ and holes at all other columns. $M_1$ is defined similarly

259: to $M_0$ with $1$-entries instead of $0$-entries. Each row of

260: $M_G$ encodes an edge from $E$: for edge $\{i,j\}$ (with $i<j$) we

261: specify that columns $2i-1$ and $2i$ contain 0s, columns $2j-1$

262: and $2j$ contain 1s, and for all $h \neq i, j$, column $2h-1$

263: contains 0 and column $2h$ contains 1. (See Figures

264: \ref{fig:mecgraph} and \ref{fig:mecmatrix} for an example of how

265: $M$ is constructed.)\\

266: %

267: %

268: %

269: \begin{figure}

270: \begin{centering}

271: \epsfig{file=./mec.eps} \caption{Example input to MAX-CUT (see

272: Lemma \ref{lem:mechard})} \label{fig:mecgraph}

273: \end{centering}

274: \end{figure}

275: \begin{figure}

276: \begin{centering}

277: \[

278: \begin{tabular}{rl}

279: $\left(

280: \begin{array}{cccccccc}

281: 0 & 0 & - & - & - & - & - & - \\

282: - & - & 0 & 0 & - & - & - & - \\

283: - & - & - & - & 0 & 0 & - & - \\

284: - & - & - & - & - & - & 0 & 0 \\

285: 1 & 1 & - & - & - & - & - & - \\

286: - & - & 1 & 1 & - & - & - & - \\

287: - & - & - & - & 1 & 1 & - & - \\

288: - & - & - & - & - & - & 1 & 1 \\

289: 0 & 0 & 1 & 1 & 0 & 1 & 0 & 1 \\

290: 0 & 0 & 0 & 1 & 1 & 1 & 0 & 1 \\

291: 0 & 0 & 0 & 1 & 0 & 1 & 1 & 1 \\

292: 0 & 1 & 0 & 1 & 0 & 0 & 1 & 1 %

293: \end{array}

294: \right) \hspace{-27pt}$ &

295: \begin{tabular}{l}

296: $\left.

297: \begin{array}{l}

298: \\

299: \\

300: \\

301: \\

302: \\

303: \\

304: \\

305: \\

306: \end{array}

307: \right\} 32$ copies \\

308: $\left.

309: \begin{array}{l}

310: \\

311: \\

312: \\

313: \\

314: \end{array}

315: \right\} M_G $\\

316: \end{tabular}

317: \end{tabular}

318: \]

319: \caption{Construction of matrix $M$ (from Lemma \ref{lem:mechard})

320: for graph in Figure \ref{fig:mecgraph}} \label{fig:mecmatrix}

321: \end{centering}

322: \vspace{-12pt}

323: \end{figure}

324: %

325: %

326: %

327: \\

328: Suppose $t$ is the largest cut possible in $G$ and $s$ is the

329: minimum number of flips needed to make $M$ feasible. We claim that

330: the following holds:

331: \begin{equation}

332: \label{maxcut} s=|E|(|V|-2)+2(|E|-t).

333: \end{equation}

From this, $t$ (the optimal value of MAX-CUT) can easily be
computed. First, note that Ungapped-MEC($M$) is
trivially bounded from above by $|V||E|$. This follows because we could

337: simply flip every 1 entry in $M_G$ to 0; the resulting overall

338: matrix would be feasible because we could just take $H_1$ as the

339: all-0 string and $H_2$ as the all-1 string. Now, we say a

340: haplotype $H$ has the \emph{double-entry} property if, for all

341: odd-indexed positions (i.e. columns) $j$ in $H$, the entry at

342: position $j$ of $H$ is the same as the entry at position $j+1$. We

343: argue that a minimal number of feasibility-inducing flips will

344: \emph{always} lead to two haplotypes $H_1, H_2$ such that both

345: haplotypes have the double-entry property and, further, $H_1$ is

346: the bitwise complement of $H_2$. (We describe such a pair of

347: haplotypes as \emph{partition-encoding}.) This is because, if

348: $H_1, H_2$ are not partition-encoding, then at least $k > |V||E|$

349: (in contrast with zero) entries in $M_0$ and/or $M_1$ will have to

350: be flipped, meaning this strategy is doomed to begin with.\\

351: \\

352: Now, for a given partition-encoding pair of haplotypes, it follows

353: that - for each row in $M_G$ - we will have to flip either $|V|-2$

354: or $|V|$ entries to reach its nearest haplotype. This is because,

355: irrespective of which haplotype we move a row to, the $|V|-2$

356: pairs of columns \emph{not} encoding end-points (for a given row)

357: will always cost 1 flip each to fix. Then either 2 or 0 of the 4

358: ``endpoint-encoding'' entries will also need to be flipped; 4

359: flips will never be necessary because then the row could move to

360: the other haplotype, requiring no extra flips. Ungapped-MEC thus

361: maximises the number of rows which require $|V|-2$ rather than

362: $|V|$ flips. If we think of $H_1$ and $H_2$ as encoding a

363: partition of the vertices of $V$ (i.e. a vertex $i$ is on one side

364: of the partition if $H_1$ has 1s in columns $2i-1$ and $2i$, and

365: on the other side if $H_2$ has 1s in those columns), it follows

366: that each row requiring $|V|-2$ flips corresponds to a cut-edge in

367: the vertex partition defined by $H_1$ and $H_2$. The expression

368: (\ref{maxcut}) follows.\\

369: \end{proof}
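\noindent\textbf{Worked example.} As a sanity check of (\ref{maxcut}),
consider the instance of Figure \ref{fig:mecmatrix}. We assume here that the
underlying graph has exactly the edges that can be read off the four rows of
$M_G$, namely $\{1,2\}$, $\{1,3\}$, $\{1,4\}$ and $\{3,4\}$, so that
$|V| = |E| = 4$ and $k = 2|E||V| = 32$. The vertices $1, 3, 4$ form a
triangle, so not all four edges can be cut, while the partition
$\{1\}, \{2,3,4\}$ cuts three edges; hence $t = 3$. This partition is encoded
by the partition-encoding pair $H_1 = 11000000$, $H_2 = 00111111$. Every row
of $M_0$ and $M_1$ is at distance 0 from one of these two haplotypes, and for
the rows of $M_G$ we obtain:
\[
\begin{array}{lcc}
\text{row of } M_G & d(\cdot, H_1) & d(\cdot, H_2) \\
00110101 \ \ (\{1,2\}) & 6 & 2 \\
00011101 \ \ (\{1,3\}) & 6 & 2 \\
00010111 \ \ (\{1,4\}) & 6 & 2 \\
01010011 \ \ (\{3,4\}) & 4 & 4
\end{array}
\]
The three cut edges each cost $|V|-2 = 2$ flips and the uncut edge
$\{3,4\}$ costs $|V| = 4$ flips, giving $2+2+2+4 = 10$ flips in total. This
agrees with the value predicted by (\ref{maxcut}):
\[
s = |E|(|V|-2) + 2(|E|-t) = 4 \cdot 2 + 2 \cdot (4-3) = 10.
\]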

370: %

371: %

372: %

373: \subsection{1-gap MEC}

374: \label{subsec:gmec}

375: \noindent\textbf{Problem:} \emph{1-gap MEC}\\

376: \textbf{Input:} SNP matrix $M$ with at most 1 gap per row\\

377: \textbf{Output:} The smallest number of flips needed to make $M$ feasible.\\

378: \\

379: To prove that 1-gap MEC is APX-hard (and therefore also NP-hard)

380: we will give an L-reduction\footnote{An L-reduction is a specific

381: type of \emph{approximation-preserving} reduction, first

382: introduced in \cite{lreduc}. If there exists an L-reduction from a

383: problem X to a problem Y, then a PTAS for Y can be used to build a

384: PTAS for X. Conversely, if there exists an L-reduction from X to

385: Y, and X is APX-hard, so is Y. See (for example) \cite{sched} for

386: a succinct discussion of this.} from CUBIC-MIN-UNCUT, which is the

387: problem of finding the minimum number of edges that have to be

388: removed from a 3-regular graph in order to make it bipartite. Our

389: first goal is thus to prove the APX-hardness of CUBIC-MIN-UNCUT,

390: which itself will be proven using an L-reduction from the APX-hard problem CUBIC-MAX-CUT.\\

391: \\

392: To aid the reader, we reproduce here the definition of an

393: L-reduction.\\

394: \begin{definition}

395: (Papadimitriou and Yannakakis \cite{lreduc}) Let A and B be two

396: optimisation problems. An \emph{L-reduction} from A to B is a pair

397: of functions R and S, both computable in polynomial time, such

398: that for any instance I of A with optimum cost Opt(I), R(I) is an

399: instance of B with optimum cost Opt(R(I)) and for every

400: feasible\footnote{Note that \emph{feasible} in this context has a

401: different meaning to \emph{feasible} in the context of SNP

402: matrices.} solution s of R(I), S(s) is a feasible solution of I

403: such that:

404: \begin{equation}

405: \label{eq:L1}

406: Opt(R(I)) \leq \alpha Opt(I),

407: \end{equation}

408: for some positive constant $\alpha$ and:

409: \begin{equation}

410: \label{eq:L2}

411: |Opt(I) - c(S(s))| \leq \beta |Opt(R(I))-c(s)|,

412: \end{equation}

413: for some positive constant $\beta$, where c(S(s)) and c(s)

414: represent the costs of S(s) and s, respectively.\\

415: \end{definition}
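\noindent\textbf{Remark.} The following short derivation is a standard
consequence of the definition, included here only as a sketch of why
properties (\ref{eq:L1}) and (\ref{eq:L2}) transfer approximation schemes.
Suppose B is a minimisation problem and $s$ is a solution of R(I) with
$c(s) \leq (1+\epsilon)\, Opt(R(I))$ for some $\epsilon > 0$. Then
\[
|Opt(I) - c(S(s))| \leq \beta\, |Opt(R(I)) - c(s)|
\leq \beta \epsilon\, Opt(R(I))
\leq \alpha \beta \epsilon\, Opt(I),
\]
where the first inequality is (\ref{eq:L2}), the second is the assumption on
$s$, and the third is (\ref{eq:L1}). Thus S(s) has relative error at most
$\alpha \beta \epsilon$, which can be made arbitrarily small by choosing
$\epsilon$ small enough.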

416: \begin{observation}

417: \label{obs:mincuthard} CUBIC-MIN-UNCUT is APX-hard.\\

418: \end{observation}

419: \begin{proof}

420: We give an L-reduction from CUBIC-MAX-CUT, the problem of finding

421: the maximum cardinality of a cut in a 3-regular graph. (This

422: problem is shown to be APX-hard in \cite{alimontikann}; see also

423: \cite{bermankarpinski}.) Let $G=(V,E)$

424: be the input to CUBIC-MAX-CUT.\\

425: \\

426: Note that CUBIC-MIN-UNCUT is the ``complement'' of CUBIC-MAX-CUT,

427: as expressed by the following relationship:

428: %

429: \begin{equation}

430: \begin{array}{l}

431: \label{eq:duality} \text{\emph{CUBIC-MAX-CUT(G)}}\\

432: = |E| - \text{\emph{CUBIC-MIN-UNCUT(G)}}.

433: \end{array}

434: \end{equation}

435: %

436: To see why this holds, note that for every cut $C$, the removal of

437: the edges $E \setminus C$ will lead to a bipartite graph. On the

438: other hand, given a set of edges $E'$ whose removal makes $G$

439: bipartite, the complement is not necessarily a cut. However, given

440: a bipartition induced by the removal of $E'$, the edges from the

441: original graph that cross this bipartition form a cut $C'$, such

442: that $|C'| \geq |E \setminus E'|$. This proves (\ref{eq:duality}),

443: and the mapping (just described) from $E'$ to $C'$ is the mapping

444: we use in the L-reduction.\\

445: \\

446: Now, note that property (\ref{eq:L1}) of the L-reduction is easily

447: satisfied (taking $\alpha=1$) because the optimal value of

448: CUBIC-MIN-UNCUT is always less than or equal to the optimal value

449: of CUBIC-MAX-CUT. This follows from the combination of

450: (\ref{eq:duality}) with the fact that a maximum cut in a 3-regular

451: graph always contains at least $2/3$ of the edges: if a vertex has

fewer than two incident edges in the cut then we can get a larger

453: cut by moving this vertex to the other side of the partition.\\

454: \\

455: To see that property (\ref{eq:L2}) of the L-reduction is easily

456: satisfied (taking $\beta = 1$), let $E'$ be any set of edges whose

457: removal makes $G$ bipartite. Property (\ref{eq:L2}) is satisfied

458: because $E'$ gets mapped to a cut $C'$, as defined above, and

459: combined with (\ref{eq:duality}) this gives:

460: \begin{equation}

461: \begin{array}{l}

462: \text{\emph{CUBIC-MAX-CUT(G)}} - |C'|\\

463: \leq \text{\emph{CUBIC-MAX-CUT(G)}} - |E \setminus E'| \\

464: = |E'| - \text{\emph{CUBIC-MIN-UNCUT(G)}}.

465: \end{array}

466: \end{equation}

467: %

468: This completes the L-reduction from CUBIC-MAX-CUT to

469: CUBIC-MIN-UNCUT, proving the APX-hardness of CUBIC-MIN-UNCUT.\\

470: \end{proof}
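\noindent\textbf{Worked example.} To illustrate (\ref{eq:duality}) and the
mapping from $E'$ to $C'$, consider (as our own illustrative choice) the
complete graph $K_4$ on vertices $\{1,2,3,4\}$, which is 3-regular with
$|E| = 6$. A partition with two vertices on each side cuts 4 edges, while a
partition with one vertex on one side cuts only 3, so
\[
\text{\emph{CUBIC-MAX-CUT}}(K_4) = 4 = \tfrac{2}{3}|E|, \qquad
\text{\emph{CUBIC-MIN-UNCUT}}(K_4) = 6 - 4 = 2.
\]
Now take $E' = \{ \{1,2\}, \{3,4\}, \{1,3\} \}$. Removing $E'$ leaves the path
on vertices $1, 4, 2, 3$ (in that order), which is bipartite with bipartition
$\{1,2\} \cup \{3,4\}$. The remaining edge set
$E \setminus E' = \{ \{1,4\}, \{2,4\}, \{2,3\} \}$ is not itself a cut, but
the edges of $K_4$ crossing the bipartition form the cut
$C' = \{ \{1,3\}, \{1,4\}, \{2,3\}, \{2,4\} \}$, and indeed
$|C'| = 4 \geq |E \setminus E'| = 3$.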

471: %

472: We also need the following observation.\\

473: %

474: \begin{observation}

475: \label{obs:orient} Let $G = (V,E)$ be an undirected, 3-regular

476: graph. Then we can find, in polynomial time, an orientation of the

477: edges of $G$ so that each vertex has either in-degree 2 and

478: out-degree 1 (``in-in-out'') or

479: out-degree 2 and in-degree 1 (``out-out-in'').\\

480: \end{observation}

481: \begin{proof}

482: (We assume that $G$ is connected; if $G$ is not connected, we can

483: apply the following argument to each component of $G$ in turn, and

484: the overall result still holds.) Every cubic graph has an even

485: number of vertices, because every graph must have an even number

486: of odd-degree vertices. We add an arbitrary perfect matching to

487: the graph, which may create multiple edges. The graph is now

488: 4-regular and therefore has an Euler tour. We direct the edges

489: following the Euler-tour; every vertex is now in-in-out-out. If we

490: remove the perfect matching edges we added, we are left with an

491: oriented version of $G$ where every vertex is in-in-out or

492: out-out-in. This can all be done in polynomial time.\\

493: \end{proof}
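\noindent\textbf{Worked example.} The construction can be traced on the
complete graph $K_4$ with vertices $\{1,2,3,4\}$; the choice of matching and
of Euler tour below is ours, and other choices are equally valid. Add the
perfect matching $\{1,2\}, \{3,4\}$, which creates a second copy of each of
these two edges (marked with a prime). One Euler tour of the resulting
4-regular multigraph is
\[
1 \xrightarrow{\{1,2\}'} 2 \to 3 \to 1 \to 4 \xrightarrow{\{3,4\}'} 3 \to 4
\to 2 \to 1.
\]
Deleting the two primed arcs leaves the orientation
\[
2 \to 3, \quad 3 \to 1, \quad 1 \to 4, \quad 3 \to 4, \quad 4 \to 2, \quad
2 \to 1
\]
of $K_4$, in which vertices 1 and 4 are in-in-out and vertices 2 and 3 are
out-out-in.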

494: %

495: \begin{lemma}

\label{apxhard} 1-gap MEC is APX-hard.\\

497: \end{lemma}

498: \begin{proof}

499: We give a reduction from CUBIC-MIN-UNCUT. Consider an arbitrary

500: 3-regular graph $G = (V,E)$ and orient the edges as described in

501: Observation \ref{obs:orient} to obtain an oriented version of $G$,

502: $\overrightarrow{G} = (V, \overrightarrow{E})$, where each vertex

503: is either in-in-out or out-out-in. We construct an $|E| \times

504: |V|$ input matrix $M$ for 1-gap MEC as follows. The columns of $M$

505: correspond to the vertices of $\overrightarrow{G}$ and every row

506: of $M$ encodes an oriented edge of $\overrightarrow{G}$; it has a

507: $0$ in the column corresponding to the tail of the edge (i.e. the

508: vertex from which the edge leaves), a $1$ in the column

509: corresponding to the head of the edge,

510: and the rest holes.\\

511: \\

512: We prove the following:

513: \begin{equation}

514: \label{eq:uncutmec} \text{\emph{CUBIC-MIN-UNCUT(G)}} =

515: \text{\emph{1-gap MEC(M)}}.

516: \end{equation}

517: %

518: We first prove that:

519: \begin{equation}

520: \label{eq:uncutmec2} \text{\emph{1-gap MEC(M)}} \leq

521: \text{\emph{CUBIC-MIN-UNCUT(G)}}.

522: \end{equation}

To see this, let $E'$ be a minimum-cardinality set of edges whose removal

524: makes $G$ bipartite, and let $|E'| = k$. Let $B = (L \cup R, E

525: \setminus E')$ be the bipartite graph (with bipartition $L \cup

526: R$) obtained from $G$ by removing the edges $E'$. Let $H_1$

527: (respectively, $H_2$) be the haplotype that has 1s in the columns

528: representing vertices of $L$ (respectively, $R$) and 0s elsewhere.

529: It is possible to make $M$ feasible with $k$ flips, by the

530: following process: for each edge in $E'$, flip the 0 bit in the

corresponding row of $M$ to 1. For each row $r$ of $M$ it is now

532: true that $d(r, H_1) = 0$ or $d(r,H_2) = 0$, proving the feasibility of $M$.\\

533: \\

534: The proof that,

535: \begin{equation}

536: \label{eq:uncutmec3} \text{\emph{CUBIC-MIN-UNCUT(G)}}\leq

537: \text{\emph{1-gap MEC(M)}},

538: \end{equation}

539: is more subtle. Suppose we can render $M$ feasible using $j$

540: flips, and let $H_1$ and $H_2$ be any two haplotypes such that,

541: after the $j$ flips, each row of $M$ is distance 0 from either

542: $H_1$ or $H_2$. If $H_1$ and $H_2$ are bitwise complementary then

543: we can make $G$ bipartite by removing an edge whenever we had to

544: flip a bit in the corresponding row. The idea is, namely, that the

545: 1s in $H_1$ (respectively, $H_2$) represent the vertices $L$

546: (respectively, $R$) in the resulting bipartition $L \cup R$.\\

547: \\

548: However, suppose the two haplotypes $H_1$ and $H_2$ are not

549: bitwise complementary. In this case it is sufficient to

demonstrate that there also exist bitwise complementary

551: haplotypes $H'_1$ and $H'_2$ such that, after $j$ (or fewer)

552: flips, every row of $M$ is distance 0 from either $H'_1$ or

553: $H'_2$. Consider thus a column of $H_1$ and $H_2$ where the two

554: haplotypes are not complementary. Crucially, the orientation of

555: $\overrightarrow{G}$ ensures that every column of $M$ contains

556: \emph{either} one $1$ and two $0$s \emph{or} two $1$s and one $0$

557: (and the rest holes). A simple case analysis shows that, because

558: of this, we can always change the value of one of the haplotypes

559: in that column, without increasing the number of flips. (The

560: number of flips might decrease.) Repeating this process for all

561: columns of $H_1$ and $H_2$ where the same value is observed thus

562: creates complementary haplotypes $H'_1$ and $H'_2$, and - as

563: described in the previous paragraph - these haplotypes then

564: determine which edges of $G$ should be removed to make $G$

565: bipartite. This completes the proof of (\ref{eq:uncutmec}).\\

566: \\

567: The above reduction can be computed in polynomial time and is an

568: L-reduction. From (\ref{eq:uncutmec}) it follows directly that

569: property (\ref{eq:L1}) of an L-reduction is satisfied with

570: $\alpha=1$. Property (\ref{eq:L2}), with $\beta=1$, follows from

571: the proof of (\ref{eq:uncutmec3}), combined with

572: (\ref{eq:uncutmec}). Namely, whenever we use (say) $t$ flips to

573: make $M$ feasible, we can find $s \leq t$ edges of $G$ that can be

574: removed to make $G$ bipartite. Combined with (\ref{eq:uncutmec})

575: this gives:

576: \begin{equation}

577: \begin{array}{l}

578: |\text{\emph{CUBIC-MIN-UNCUT(G)}} - s |\\

579: \leq | \text{\emph{1-gap MEC(M)}} - t |.

580: \end{array}

581: \end{equation}

582: \end{proof}
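\noindent\textbf{Worked example.} As an illustration of the reduction (with
an instance of our own choosing), let $G = K_4$ on vertices $\{1,2,3,4\}$
and assume the orientation $2 \to 3$, $3 \to 1$, $1 \to 4$, $3 \to 4$,
$4 \to 2$, $2 \to 1$, in which every vertex is in-in-out or out-out-in.
Listing the arcs in this order as rows gives
\[
M = \left(
\begin{array}{cccc}
- & 0 & 1 & - \\
1 & - & 0 & - \\
0 & - & - & 1 \\
- & - & 0 & 1 \\
- & 1 & - & 0 \\
1 & 0 & - & -
\end{array}
\right).
\]
Every row has at most one gap, and every column contains either one $1$ and
two $0$s or two $1$s and one $0$. A maximum cut of $K_4$ contains 4 of its 6
edges, so \emph{CUBIC-MIN-UNCUT}$(K_4) = 2$, and (\ref{eq:uncutmec}) then
gives \emph{1-gap MEC}$(M) = 2$. Two flips are indeed sufficient: take
$L = \{1,2\}$, $R = \{3,4\}$, $E' = \{ \{1,2\}, \{3,4\} \}$, $H_1 = 1100$ and
$H_2 = 0011$. Rows 2 and 5 are at distance 0 from $H_1$ and rows 1 and 3 are
at distance 0 from $H_2$. Flipping the 0 in row 4 (arc $3 \to 4$) yields
\texttt{--11}, at distance 0 from $H_2$, and flipping the 0 in row 6 (arc
$2 \to 1$) yields \texttt{11--}, at distance 0 from $H_1$.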

583: %

584: %

585: %

586: \subsection{Binary-MEC}

587: \label{subsec:bmec}

588: %

589: From a mathematical point of view it is interesting to determine

590: whether MEC stays NP-hard when the input matrix is

591: further restricted. Let us therefore define the following problem.\\

592: \\

593: \textbf{Problem:} \emph{Binary-MEC}\\

594: \textbf{Input:} An SNP matrix $M$ that does not contain any holes\\

595: \textbf{Output:} As for Ungapped-MEC\\

596: \\

597: Like all optimisation problems, the problem Binary-MEC has

598: different variants, depending on how the problem is defined. The

599: above definition is technically speaking the \emph{evaluation}

variant of the Binary-MEC problem.\footnote{See \cite{ausiello}
for a more detailed explanation of terminology in this area.}

602: Consider the closely-related \emph{constructive} version:\\

603: \\

604: \textbf{Problem:} \emph{Binary-Constructive-MEC}\\

605: \textbf{Input:} An SNP matrix $M$ that does not contain any holes\\

606: \textbf{Output:} For an input matrix $M$ of size $n \times m$, two

haplotypes $H_1, H_2 \in \{0,1\}^m$ minimising:

608: \begin{equation}

\label{eq:witsum} D_M(H_1, H_2) = \sum_{\text{rows } r \text{ of } M} \min(

610: d(r,H_1), d(r, H_2) ).

611: \end{equation}
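\noindent\textbf{Illustrative example.} The following small instance is our
own and serves only to illustrate (\ref{eq:witsum}). Let
\[
M = \left(
\begin{array}{ccc}
0 & 0 & 0 \\
0 & 0 & 1 \\
1 & 1 & 0 \\
1 & 1 & 1
\end{array}
\right), \qquad H_1 = 000, \quad H_2 = 111.
\]
Then
\[
D_M(H_1, H_2) = \min(0,3) + \min(1,2) + \min(2,1) + \min(3,0) = 0+1+1+0 = 2.
\]
No pair of haplotypes does better for this $M$: the four rows are pairwise
distinct, so at most two of them can be at distance 0 from a haplotype, and
each of the remaining (at least two) rows contributes at least 1 to the sum.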

612: In the appendix, we prove that Binary-Constructive-MEC is

613: polynomial-time Turing interreducible with its evaluation

614: counterpart, Binary-MEC. This proves that Binary-Constructive-MEC

is solvable in polynomial time iff Binary-MEC is solvable in
polynomial time. We mention this correspondence because, when

617: expressed as a constructive problem, it can be seen that MEC is in

618: fact a specific type of \emph{clustering} problem, a topic of

619: intensive study in the literature. More specifically, we are

620: trying to find two representative ``median'' (or ``consensus'')

621: strings such that the sum, over all input strings, of the distance

622: between each input string and its nearest median, is minimised.

623: This interreducibility is potentially useful because we now argue,

624: in contrast to claims in the existing literature, that the

625: complexity

626: of Binary-MEC / Binary-Constructive-MEC is actually still open.\\

627: \\

628: To elaborate, it is claimed in several papers (e.g. \cite{alon})

629: that a problem equivalent to Binary-Constructive-MEC is NP-hard.

630: Such claims inevitably refer to the seminal paper

631: \emph{Segmentation Problems} by Kleinberg, Papadimitriou, and

632: Raghavan (KPR), which has appeared in multiple different forms

633: since 1998 (e.g. \cite{kleinberg}, \cite{kleinbergEco} and

\cite{kleinberg2004}). However, the KPR papers actually discuss

635: two superficially similar, but essentially different, problems:

636: one problem is essentially equivalent to Binary-Constructive-MEC,

637: and the other is a more general (and thus, potentially, a more

638: difficult) problem.\footnote{In this more general problem, rows

639: and haplotypes are viewed as vectors and the distance between a

640: row and a haplotype is their dot product. Further, unlike

641: Binary-Constructive-MEC, this problem allows entries of the input

642: matrix to be drawn arbitrarily from $\mathbb{R}$. This extra

643: degree of freedom - particularly the ability to simultaneously use

644: positive, negative and zero values in the input matrix - is what

645: (when coupled with a dot product distance measure) provides the

646: ability to encode NP-hard problems.} Communication with the

647: authors \cite{christos} has confirmed that they have no proof of

648: hardness for the former problem i.e. the problem that is

649: essentially equivalent to Binary-Constructive-MEC.\\

650: \\

651: Thus we conclude that the complexity of Binary-Constructive-MEC /

652: Binary-MEC is still open. From an approximation viewpoint the

653: problem has been quite well-studied; the problem has a

654: \emph{Polynomial Time Approximation Scheme} (PTAS) because it is a

655: special form of the \emph{Hamming 2-Median Clustering Problem},

656: for which a PTAS is demonstrated in \cite{li}. Other approximation

657: results appear in \cite{kleinberg}, \cite{alon},

658: \cite{kleinberg2004}, \cite{geometric} and a heuristic for a

659: similar (but not identical) problem appears in \cite{fasthare}. We

660: also know that, if the number of haplotypes to be found is

661: specified as part of the input (and not fixed as 2), the problem

662: becomes NP-hard; we prove this in the following section. Finally,

663: it may also be relevant that the ``geometric'' version of the

664: problem (where rows of the input matrix are not drawn from

665: $\{0,1\}^m$ but from $\mathbb{R}^{m}$, and Euclidean distance is

666: used instead of Hamming distance) is also open from a complexity

667: viewpoint \cite{geometric}. (However, the version using

668: Euclidean-distance-squared \emph{is} known to be NP-hard

669: \cite{drineas}.)

670: %

671: %

672: %

673: \subsection{Parameterised Binary-MEC}

674: \label{subsec:pbmec}

675: %

676: Let us now consider a generalisation of the problem Binary-MEC,

677: where the number of haplotypes is not fixed as two, but part of

678: the input.\\

679: \\

680: \textbf{Problem:} \emph{Parameterised-Binary-MEC (PBMEC)}\\

681: \textbf{Input:} An SNP matrix $M$ that contains no holes, and $k \in \mathbb{N} \setminus \{0\}$\\

682: \textbf{Output:} The smallest number of flips needed to make $M$

683: feasible under $k$ haplotypes.\\

684: %

685: The notion of \emph{feasible} generalises easily to $k \geq 1$

686: haplotypes: an SNP matrix $M$ is \emph{feasible} under $k$

687: haplotypes if $M$ can be partitioned into $k$ segments such that

688: all the rows within each segment are pairwise non-conflicting. The

689: definition of $D_{M}$ also generalises easily to $k$ haplotypes;

690: we define $D_{M, k}(H_1, H_2, \ldots, H_k)$ as:

691: %

692: \begin{equation}

693: \label{eq:kmedsum} \sum_{\text{rows } r \text{ of } M} \min(d(r,H_1), d(r,

694: H_2), \ldots, d(r,H_k) ).

695: \end{equation}

696: %

697: We define $OptTuples(M,k)$ as the set of unordered optimal

698: $k$-tuples of haplotypes for $M$ i.e. those $k$-tuples of

699: haplotypes which have a $D_{M,k}$ score equal to PBMEC$(M,k)$.\\
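To make these definitions concrete, the following Python sketch evaluates $D_{M,k}$ for a given tuple of haplotypes and computes PBMEC$(M,k)$ by exhaustive search. The names \texttt{hamming}, \texttt{d\_mk} and \texttt{pbmec\_bruteforce} are hypothetical helpers introduced only for this illustration; the search is exponential in the number of columns and is intended purely for tiny, hole-free matrices.

```python
from itertools import combinations_with_replacement, product


def hamming(row, haplotype):
    """Hamming distance d(r, H) between a hole-free row and a haplotype."""
    return sum(a != b for a, b in zip(row, haplotype))


def d_mk(rows, haplotypes):
    """D_{M,k}: each row is charged its distance to the nearest haplotype."""
    return sum(min(hamming(r, h) for h in haplotypes) for r in rows)


def pbmec_bruteforce(rows, k):
    """Smallest D_{M,k} over all unordered k-tuples of binary haplotypes."""
    m = len(rows[0])
    candidates = list(product((0, 1), repeat=m))
    return min(d_mk(rows, hs)
               for hs in combinations_with_replacement(candidates, k))


M = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0)]
print(d_mk(M, [(0, 0, 0), (1, 1, 1)]))  # cost of one particular 2-tuple
print(pbmec_bruteforce(M, 2))           # optimum over all 2-tuples
```

Unordered tuples are enumerated with repetition allowed, which matches the definition of $OptTuples(M,k)$ as a set of unordered $k$-tuples.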

700: %

701: \begin{lemma}

702: \label{lem:pbmechard}

703: PBMEC is NP-hard.\\

704: \end{lemma}

705: %

706: \begin{proof}

707: We reduce from the NP-hard problem MINIMUM-VERTEX-COVER. Let

708: $G=(V,E)$ be an undirected graph. A subset $V' \subseteq V$ is

709: said to \emph{cover} an edge $(u,v) \in E$ iff $u \in V'$ or $v

710: \in V'$. A \emph{vertex cover} of an undirected graph $G = (V,E)$

711: is a subset $U$ of the vertices such that every edge in $E$ is

712: covered by $U$. MINIMUM-VERTEX-COVER is the problem of, given a

713: graph $G$, computing the size of

714: a minimum cardinality vertex cover $U$ of $G$.\\

715: \\

716: Let $G = (V,E)$ be the input to MINIMUM-VERTEX-COVER. We construct

717: an SNP matrix $M$ as follows. $M$ has $|V|$ columns and

718: $3|E||V|+|E|$ rows. We name the first $3|E||V|$ rows $M_0$ and the

719: remaining $|E|$ rows $M_{G}$. $M_0$ is the matrix obtained by

720: taking the $|V| \times |V|$ identity matrix (i.e. 1s on the

721: diagonal, 0s everywhere else) and making $3|E|$ copies of each

722: row. Each row in $M_G$ encodes an edge of $G$: the row has

723: 1-entries at the endpoints of the edge, and the rest of the row is

724: 0. We argue shortly that, to compute the size of the smallest

725: vertex cover in $G$, we call PBMEC($M,k$) for increasing values of

726: $k$ (starting with $k=2$) until we first encounter a $k$ such

727: that:

728: \begin{equation}

729: \label{eq:kmed} PBMEC(M,k) = 3|E|(|V|-(k-1)) + |E|.

730: \end{equation}

731: Once the smallest such $k$ has been found, we can output that the

732: size of the smallest vertex cover in $G$ is $k-1$. (Actually, if

733: we haven't yet found a value $k < |V|-2$ satisfying the above

734: equation, we can check by brute force in polynomial-time whether

735: $G$ has a vertex cover of size $|V|-3$, $|V|-2$, $|V|-1$, or

736: $|V|$. The reason for wanting to ensure that PBMEC($M,k$) is not

737: called with $k \geq |V|-2$ is explained later in the

738: analysis.\footnote{Note that, should we wish to build a Karp

739: reduction from the decision version of MINIMUM-VERTEX-COVER to the

740: decision version of PBMEC, it is not a problem to make this brute

741: force checking fit into the framework of a Karp reduction. The

742: Karp reduction can do the brute force checking itself and use

743: trivial inputs to the decision

744: version of PBMEC to communicate its ``yes'' or ``no'' answer.})\\

745: \\

746: It remains only to prove that (for $k < |V|-2$)

747: (\ref{eq:kmed}) holds iff $G$ has a vertex cover of size $k-1$.\\

748: \\

749: To prove this we need to first analyze $OptTuples(M_{0},k)$.

750: Recall that $M_0$ was obtained by duplicating the rows of the $|V|

751: \times |V|$ identity matrix. Let $I_{|V|}$ be shorthand for the

752: $|V| \times |V|$ identity matrix. Given that $M_0$ is simply a

753: ``scaled up'' version of $I_{|V|}$, it follows that:

754: \begin{equation}

755: OptTuples(M_0,k) = OptTuples(I_{|V|},k).

756: \end{equation}

757: Now, we argue that all the $k$-tuples in $OptTuples(I_{|V|},k)$

758: (for $k < |V|-2$) have the following form: one haplotype from the

759: tuple contains only 0s, and the remaining $k-1$ haplotypes from

760: the tuple each have precisely one entry set to 1. Let us name such

761: a $k$-tuple

762: a \emph{candidate} tuple.\\

763: %

764: \begin{figure}

765: \begin{centering}

766: \epsfig{file=./mec.eps}

767: \caption{Example input graph to

768: MINIMUM-VERTEX-COVER (see Lemma \ref{lem:pbmechard})}

769: \label{fig:pbmecgraph}

770: \end{centering}

771: \end{figure}

772: %

773: \begin{figure}

774: \begin{centering}

775: \[

776: \begin{tabular}{ll}

777: $\left(

778: \begin{array}{cccc}

779: 1 & 0 & 0 & 0 \\

780: 0 & 1 & 0 & 0 \\

781: 0 & 0 & 1 & 0 \\

782: 0 & 0 & 0 & 1 \\

783: 1 & 1 & 0 & 0 \\

784: 1 & 0 & 1 & 0 \\

785: 1 & 0 & 0 & 1 \\

786: 0 & 0 & 1 & 1 \\

787: \end{array}

788: \right)$ \hspace{-28pt} &

789: \begin{tabular}{l}

790: $\left.

791: \begin{array}{c}

792: \\

793: \\

794: \\

795: \\

796: \end{array}

797: \right\} 12$ copies \\

798: $\left.

799: \begin{array}{c}

800: \\

801: \\

802: \\

803: \\

804: \end{array}

805: \right\} M_G $\\

806: \end{tabular}

807: \end{tabular}

808: \]

809: \end{centering}

810: \caption{Construction of matrix $M$ for graph from Figure

811: \ref{fig:pbmecgraph}}

812: \end{figure}

813: %

814: %

815: \\

816: First, note that $PBMEC(I_{|V|},k) \leq |V|-(k-1)$, because

817: $|V|-(k-1)$ is the value of the $D$ measure - defined in

818: (\ref{eq:kmedsum}) - under any candidate tuple. Secondly, under an

819: arbitrary $k$-tuple there can be at most $k$ rows of $I_{|V|}$

820: which contribute 0 to the $D$ measure. However, if precisely $k$

821: rows of $I_{|V|}$ contribute 0 to the $D$ measure (i.e. every

822: haplotype has precisely one entry set to 1, and the haplotypes are

823: all distinct) then there are $|V|-k$ rows which each contribute 2

824: to the $D$ measure; such a $k$-tuple cannot be optimal because it

825: has a $D$ measure of $2(|V|-k) > |V|-(k-1)$. So we reason that at

826: most $k-1$ rows contribute 0 to the $D$ measure. In fact,

827: \emph{precisely} $k-1$ rows must contribute 0 to the $D$ measure

828: because, otherwise, there would be at least $|V|-(k-2)$ rows

829: contributing at least 1, and this is not possible because

830: $PBMEC(I_{|V|},k) \leq |V|-(k-1)$. So $k-1$ of the haplotypes

831: correspond to rows of $I_{|V|}$, and the remaining $|V|-(k-1)$

832: rows of $I_{|V|}$ must each contribute 1 to the $D$ measure. But

833: the only way to do this (given that $|V|-(k-1) > 2$) is to make

834: the $k$th haplotype the haplotype where every entry is 0. Hence:

835: \begin{equation}

836: PBMEC(I_{|V|},k) = |V|-(k-1)

837: \end{equation}

838: and:

839: \begin{equation}

840: PBMEC(M_0,k) = 3|E|(|V|-(k-1)).

841: \end{equation}

842: $OptTuples(I_{|V|},k)$ ($= OptTuples(M_0,k)$) is, by extension,

843: precisely the set of candidate $k$-tuples.\\

844: \\

845: The next step is to observe that $OptTuples(M,k) \subseteq

846: OptTuples(M_0,k)$. To see this, suppose (by way of contradiction)

847: that it is not true, and there exists a $k$-tuple $H^{*} \in

848: OptTuples(M,k)$ that is not in $OptTuples(M_0,k)$. But then

849: replacing $H^{*}$ by any $k$-tuple out of $OptTuples(M_0,k)$ would

850: reduce the number of flips needed in $M_0$ by at least $3|E|$, in

851: contrast to an increase in the number of flips needed in $M_{G}$

852: of at most $2|E|$, thus leading to an overall reduction in the

853: number of flips; contradiction! (The $2|E|$ figure is the number

854: of flips

855: required to make all rows in $M_G$ equal to the all-0 haplotype.)\\

856: \\

857: Because $OptTuples(M,k) \subseteq OptTuples(M_0,k)$, we can

858: restrict our attention to the $k$-tuples in $OptTuples(M_0,k)$.

859: Observe that there is a natural 1-1 correspondence between the

860: elements of $OptTuples(M_0,k)$ and all size $k-1$ subsets of $V$:

861: a vertex $v \in V$ is in the subset corresponding to $H^{*} \in

862: OptTuples(M_0,k)$ iff one of the haplotypes in $H^{*}$ has a 1 in

863: the column corresponding to vertex $v$.\\

864: \\

865: Now, for a $k$-tuple $H^{*} \in OptTuples(M_0,k)$ we let $Cov( G,

866: H^{*} )$ be the set of edges in $G$ which are covered by the

867: subset of $V$ corresponding to $H^{*}$. (Thus, $|Cov(G,

868: H^{*})|=|E|$ iff $H^{*}$ represents a vertex cover of $G$.) It is

869: easy to check that, for $H^{*} \in OptTuples(M_0,k)$:

870: \begin{equation}

871: \begin{array}{lll}

872: D_{M,k}( H^{*} )&{}={}&3|E|(|V|-(k-1))\\

873: &&{+}\:|Cov(G, H^{*})|\\

874: &&{+}\:2( |E| - |Cov(G,H^{*})| ) \\

875: &{}={}&3|E|(|V|-(k-1))\\

876: &&{+}\:2|E| - |Cov(G, H^{*})|.\nonumber

877: \end{array}

878: \end{equation}

879: Hence, for $H^{*} \in OptTuples(M_0,k)$, $D_{M,k}(H^{*})$ equals

880: $3|E|(|V|-(k-1)) + |E|$ iff $H^{*}$ represents a size $k-1$ vertex

881: cover of $G$.\\

882: \end{proof}
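The construction in the proof of Lemma \ref{lem:pbmechard} can be checked numerically on a small instance. The sketch below builds $M$ from a graph, computes PBMEC$(M,k)$ by brute force, and compares it with the right-hand side of (\ref{eq:kmed}). The graph (6 vertices, 5 edges, minimum vertex cover $\{0,1\}$) and all function names are assumptions made for this illustration; the brute-force search is exponential and is not part of the reduction itself.

```python
from collections import Counter
from itertools import combinations, combinations_with_replacement, product


def build_matrix(num_vertices, edges):
    """M_0 (3|E| copies of each identity row) followed by M_G (one row per edge)."""
    identity = [tuple(int(i == j) for j in range(num_vertices))
                for i in range(num_vertices)]
    m0 = [row for row in identity for _ in range(3 * len(edges))]
    mg = [tuple(int(j in e) for j in range(num_vertices)) for e in edges]
    return m0 + mg


def pbmec_bruteforce(rows, k):
    """Exhaustive PBMEC; duplicate rows are merged and weighted."""
    weighted = Counter(rows)
    haps = list(product((0, 1), repeat=len(rows[0])))
    dist = {r: [sum(a != b for a, b in zip(r, h)) for h in haps]
            for r in weighted}
    return min(
        sum(w * min(dist[r][i] for i in idx) for r, w in weighted.items())
        for idx in combinations_with_replacement(range(len(haps)), k))


def target(num_vertices, edges, k):
    """Right-hand side of the test equation: 3|E|(|V|-(k-1)) + |E|."""
    return 3 * len(edges) * (num_vertices - (k - 1)) + len(edges)


def min_vertex_cover_size(num_vertices, edges):
    """Brute-force minimum vertex cover, used only as a cross-check."""
    for size in range(num_vertices + 1):
        for cover in combinations(range(num_vertices), size):
            if all(u in cover or v in cover for u, v in edges):
                return size


edges = [(0, 1), (0, 2), (0, 5), (1, 3), (1, 4)]
M = build_matrix(6, edges)
for k in (2, 3):  # only values with k < |V| - 2
    print(k, pbmec_bruteforce(M, k), target(6, edges, k))
```

On this instance the equality first holds for $k=3$, in agreement with a minimum vertex cover of size $k-1=2$.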

883: %

884: %

885: %

886: \section{Longest Haplotype Reconstruction (LHR)}

887: \label{sec:lhr} \setcounter{equation}{0} Suppose an SNP matrix $M$

888: is feasible. Then we can partition the rows of $M$ into two sets,

889: $M_l$ and $M_r$, such that the rows within each set are pairwise

890: non-conflicting. (The partition might not be unique.) From $M_i$

891: ($i \in \{l,r\}$) we can then build a haplotype $H_i$ by combining

892: the rows of $M_i$ as follows: The $j$th column of $H_i$ is set to

893: 1 if at least one row from $M_i$ has a 1 in column $j$, is set to

894: 0 if at least one row from $M_i$ has a 0 in column $j$, and is set

895: to a hole if all rows in $M_i$ have a hole in column $j$. Note

896: that, in contrast to MEC, this leads to haplotypes that

897: potentially contain holes. For example, suppose one side of the

898: partition contains rows \texttt{10--, -0--} and \texttt{---1};

899: then the haplotype we get from this is \texttt{10-1}. We define

900: the \emph{length} of a haplotype $H$, denoted as $|H|$, as the

901: number of positions where it does not contain a hole; the

902: haplotype \texttt{10-1} thus has length 3, for example. Now, the

903: objective with LHR is to remove \emph{rows} from $M$ to make it

904: feasible but also such that the sum of the lengths of the two

905: resulting haplotypes is maximised. We define the function LHR(M)

906: (which gives a natural number as output) as the largest value this

907: sum-of-lengths value can take,

908: ranging over all feasibility-inducing row-removals and subsequent partitions.\\
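The rule for combining pairwise non-conflicting rows into a haplotype, and the resulting notion of length, can be sketched in Python as follows. Rows are represented as strings over \texttt{0}, \texttt{1} and \texttt{-} (hole); this representation and the names \texttt{combine} and \texttt{length} are assumptions made for the illustration.

```python
HOLE = "-"


def combine(rows):
    """Build the haplotype induced by a set of pairwise non-conflicting rows."""
    haplotype = []
    for column in zip(*rows):
        symbols = set(column) - {HOLE}
        if len(symbols) > 1:
            raise ValueError("rows are in conflict")
        # A column is a hole only if every row has a hole there.
        haplotype.append(symbols.pop() if symbols else HOLE)
    return "".join(haplotype)


def length(haplotype):
    """Number of positions of the haplotype that are not holes."""
    return sum(symbol != HOLE for symbol in haplotype)


h = combine(["10--", "-0--", "---1"])
print(h, length(h))  # the example from the text: 10-1 has length 3
```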

909: \\

910: In Section \ref{subsec:lhrpoly} we provide a polynomial-time

911: dynamic programming algorithm for the ungapped variant of LHR,

912: Ungapped-LHR. In Section \ref{subsec:lhrhard} we show that LHR

913: becomes APX-hard and NP-hard when at most one gap per input row is

914: allowed, automatically also proving the hardness of LHR in the

915: general case.

916: %

917: \subsection{A polynomial-time algorithm for Ungapped-LHR}

918: \label{subsec:lhrpoly}

919: %

920: \noindent\textbf{Problem:} \emph{Ungapped-LHR}\\

921: \textbf{Input: } An ungapped SNP matrix $M$\\

922: \textbf{Output: } The value LHR(M), as defined above\\

923: \\

924: The LHR problem for ungapped matrices was proved to be

925: polynomial-time solvable by Lancia et al.\ in \cite{lanciabafna},

926: but only with the genuine restriction that no fragments are

927: included in other fragments. Our algorithm improves this in the

928: sense that it works for all ungapped input matrices; our algorithm

929: is similar in style to the algorithm that solves

930: MFR\footnote{Minimum Fragment Removal: in this problem the

931: objective is not to maximise the length of the haplotypes, but to

932: minimise the number of rows removed} in the ungapped case by Bafna

933: et al.\ in \cite{bafna2005}. Note that our dynamic-programming

934: algorithm computes Ungapped-LHR(M) but it can easily be adapted to

935: generate the rows that must be removed (and subsequently, the

936: partition that must be made) to achieve this value.\\

937: %

938: \begin{lemma}

939: Ungapped-LHR can be solved in time $O(n^{2}m + n^{3})$.\\

940: \end{lemma}

941: \begin{proof}

942: Let $M$ be the input to Ungapped-LHR, and assume the matrix has

943: size $n \times m$. For row $i$ define $l(i)$ as the leftmost

944: column that is not a hole and define $r(i)$ as the rightmost

945: column that is not a hole. The rows of $M$ are ordered such that

946: $l(i)\leq l(j)$ if $i<j$. Define the matrix $M_{i}$ as the matrix

947: consisting of the first $i$ rows of $M$ and two extra rows at the

948: top: row $0$ and row $-1$, both consisting of all holes. Define

949: $W(i)$ as the set of rows $j<i$ that are not in conflict with row

950: $i$.\\

951: \\

952: For $h,k\leq i$ and $h,k\geq -1$ and $r(h)\leq r(k)$ define

953: $D[h,k;i]$ as the maximum sum of lengths of two haplotypes such

954: that:

955: \begin{itemize}

956: \item each haplotype is built up as a combination of rows from

957: $M_i$ (in the sense explained above);

958: \item each row from $M_{i}$

959: can be used to build at most one haplotype (i.e. it cannot be used

960: for both haplotypes);

961: \item row $k$ is one of the rows used to build a haplotype and among such rows maximises $%

962: r(\cdot )$; \item row $h$ is one of the rows used to build the

963: haplotype for which $k$ is not used and among such rows maximises

964: $r(\cdot )$.\\

965: \end{itemize}

966: %

967: The optimal solution of the problem, $LHR(M)$, is given by:

968: %

969: \begin{equation}

970: \max_{h,k|r(h)\leq r(k)}D[h,k;n].

971: \end{equation}

972: %

973: This optimal solution can be calculated by starting with

974: $D[h,k;0]=0$ for $h,k\in \{-1,0\}$ and using the following recursive

975: formulas. We distinguish three different cases. The first is that

976: $h,k<i$. Under these circumstances:

977: \begin{equation}

978: \label{eq:lhr1} D[h,k;i]=D[h,k;i-1].

979: \end{equation}

980: %

981: This is because:

982: \begin{itemize}

983: \item if $r(i)>r(k)$: row $i$ cannot be used for the haplotype

984: that row $k$ is used for, because row $k$ has maximal $r(\cdot )$

985: among all rows that are used for a haplotype; \item if $r(i)\leq

986: r(k)$: row $i$ cannot increase the length of the haplotype that

987: row $k$ is used for (because also $l(i)\geq l(k)$);

988: \item the same arguments hold for $h$.\\

989: \end{itemize}

990: %

991: The second case is when $h=i$; $D[i,k;i]$ is equal to:

992: %

993: \begin{equation}

994: \label{eq:lhr}

995: \max_{\substack{ j\in W(i),\text{ \ ~}j\neq k \\ r(j)\leq r(i)}}%

996: D[j,k;i-1]+f(i,j).

997: \end{equation}

998: %

999: Here, $f(i,j)=r(i)-\max \{r(j),l(i)-1\}$ is the increase of the

1000: haplotype's length. Equation (\ref{eq:lhr}) results from the following.

1001: The definition of $D[i,k;i]$ says that row $%

1002: i$ has to be used for the haplotype for which $k$ is not used and

1003: amongst such rows maximises $r(\cdot )$. Therefore, the optimal

1004: solution is achieved by adding row $i$ to some solution that has a

1005: row $j$ as the rightmost-ending row, for some $j$ that agrees

1006: with $i$, is not equal to $k$ and ends before $i$. Adding row $i$

1007: to the haplotype leads to an increase of its length of

1008: $f(i,j)=r(i)-\max \{r(j),l(i)-1\}$. This term is fixed, for fixed

1009: $i$ and $j$ and therefore we only have to consider extensions of

1010: solutions that

1011: were already optimal. Note that this reasoning does not hold for more general, ``gapped'', data.\\
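As a small illustration of the term $f(i,j)$ (the column indices are hypothetical and chosen only for this example): if row $j$ covers columns $2$ to $5$ and row $i$ covers columns $4$ to $8$, then adding row $i$ extends the haplotype by
\[
f(i,j) = 8 - \max\{5,\, 4-1\} = 3
\]
columns, namely columns $6$, $7$ and $8$. If instead row $j$ covers columns $1$ to $2$ and row $i$ covers columns $5$ to $7$, the two rows do not overlap and
\[
f(i,j) = 7 - \max\{2,\, 5-1\} = 3,
\]
which is the full length of row $i$; columns $3$ and $4$ remain holes in the haplotype.\\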

1012: \\

1013: The last case is when $k=i$; $D[h,i;i]$ is equal to:

1014: \begin{equation}

1015: \max_{\substack{ j\in W(i),\text{ \ ~}j\neq h  \\ r(j)\leq r(i)}}%

1016: \left\{

1017: \begin{array}{l}

1018: D[j,h;i-1]+f(i,j)\text{ if }r(h)\geq r(j),\\

1019: D[h,j;i-1]+f(i,j)\text{ if }r(h)<r(j).%

1020: \end{array}%

1021: \right. \nonumber

1022: \end{equation}

1023: %

1024: The above algorithm can be sped up by using the fact that, as a

1025: direct consequence of (\ref{eq:lhr1}), $D[h,k;i]=D[h,k;\max(h,k)]$

1026: for all $h,k\leq i \leq n$. It is thus unnecessary to calculate

1027: the

1028: values $D[h,k;i]$ for $h,k<i$.\\

1029: \\

1030: The time for calculating all the $W(i)$ is $O(n^{2}m)$. When all

1031: the $W(i)$ are known, it takes $O(n^{3})$ time to calculate all

1032: the $D[h,k;\max(h,k)]$. This is because we need to calculate

1033: $O(n^{2})$ values $D[i,k;i]$ and also $O(n^{2})$ values $D[h,i;i]$

1034: that take $O(n)$ time each. This leads to an overall time

1035: complexity of $O(n^{2}m+n^{3})$.\\

1036: \end{proof}
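The recursion above can be sketched in Python as follows. This is an illustrative reading of the proof, not a reference implementation: it drops the third index (using $D[h,k;i]=D[h,k;\max(h,k)]$), stores the table symmetrically under the unordered pair of rightmost-ending rows, and therefore merges the cases $h=i$ and $k=i$ into one update. Rows are assumed to be ungapped strings over \texttt{0}, \texttt{1}, \texttt{-} with at least one non-hole entry; all names are hypothetical.

```python
HOLE = "-"


def conflict(a, b):
    """True if rows a and b disagree in a column where neither has a hole."""
    return any(x != y and HOLE not in (x, y) for x, y in zip(a, b))


def span(row):
    """1-based leftmost and rightmost non-hole columns, l(i) and r(i)."""
    cols = [c for c, x in enumerate(row, start=1) if x != HOLE]
    return cols[0], cols[-1]


def ungapped_lhr(matrix):
    rows = sorted(matrix, key=lambda row: span(row)[0])  # order by l(i)
    n = len(rows)
    # Dummy rows -1 and 0 consist of holes only; we give them r = 0.
    l = {-1: 0, 0: 0}
    r = {-1: 0, 0: 0}
    for i, row in enumerate(rows, start=1):
        l[i], r[i] = span(row)
    # W[i]: earlier rows (dummies included) not in conflict with row i.
    W = {i: [-1, 0] + [j for j in range(1, i)
                       if not conflict(rows[j - 1], rows[i - 1])]
         for i in range(1, n + 1)}

    def key(a, b):
        return (min(a, b), max(a, b))

    # best[{a, b}]: maximum total length when a and b are the
    # rightmost-ending rows of the two haplotypes.
    best = {key(-1, 0): 0}
    for i in range(1, n + 1):
        for b in range(-1, i):
            # Row i extends the haplotype whose rightmost-ending row was j;
            # the gain is f(i, j) = r(i) - max(r(j), l(i) - 1).
            best[key(i, b)] = max(
                best[key(j, b)] + r[i] - max(r[j], l[i] - 1)
                for j in W[i] if j != b and r[j] <= r[i])
    return max(best.values())


print(ungapped_lhr(["00--", "01--", "-11-", "--10"]))
```

The three nested loops over $i$, $b$ and $j$ give the $O(n^{3})$ term, and building the sets $W(i)$ gives the $O(n^{2}m)$ term.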

1037: %

1038: \vspace{-12pt}

1039: %

1040: \subsection{1-gap LHR is NP-hard and APX-hard}

1041: \label{subsec:lhrhard}

1042: %

1043: \noindent\textbf{Problem:} \emph{1-gap LHR}\\

1044: \textbf{Input: } SNP matrix $M$ with at most one gap per row\\

1045: \textbf{Output: } The value LHR(M), as defined earlier\\

1046: \\

1047: In this section we prove that 1-gap LHR is APX-hard (and thus also

1048: NP-hard.) We prove this by demonstrating (indirectly) an

1049: L-reduction from the problem CUBIC-MAX-INDEPENDENT-SET - the

1050: problem of computing the maximum cardinality of an independent set

1051: in a cubic graph - which is itself proven

1052: APX-hard in \cite{alimontikann}.\\

1053: \\

1054: We do this in several steps. We first show an L-reduction from

1055: \emph{Single Haplotype} LHR (SH-LHR), the version of LHR where

1056: only one haplotype is used\footnote{More formally:- rows of the

1057: input matrix $M$ must be removed until the remaining rows are

1058: mutually non-conflicting. The length of the resulting single

1059: haplotype, which we seek to maximise, is the number of columns

1060: (amongst the remaining rows) that have at least one non-hole

1061: entry.}, to LHR, such that the number of gaps per row is

1062: unchanged. We then show an L-reduction from

1063: CUBIC-MAX-INDEPENDENT-SET to 2-gap SH-LHR. Then, using an

1064: observation pertaining to the structure of cubic graphs, we show

1065: how this reduction can be adapted to give an L-reduction from

1066: CUBIC-MAX-INDEPENDENT-SET to 1-gap SH-LHR. This proves

1067: the APX-hardness of 1-gap SH-LHR and thus (by transitivity of L-reductions) also 1-gap LHR.\\

1068: \begin{lemma}

1069: \label{lem:shequiv} SH-LHR is L-reducible to LHR, such that the

1070: number of gaps per row is unchanged.\\

1071: \end{lemma}

1072: \begin{proof}

1073: Let $M$ be the $n \times m$ input to SH-LHR. We may assume that

1074: $M$ contains no duplicate rows, because duplicate rows are

1075: entirely redundant when working with only one haplotype. We map

1076: the SH-LHR input, $M$, to the $2n \times m$ LHR input, $M'$, by

1077: taking each row of $M$ and making a copy of it. Informally, the

1078: idea is that the influence of the second haplotype can be neutralised

1079: by doubling the rows of the input matrix. Note that this construction

1080: clearly preserves the maximum number of gaps per row.\\

1081: \\

1082: Now, let $SOL(M')$ be the set that contains all pairs of

1083: haplotypes $(H_1,H_2)$ that can be induced by removing some rows

1084: of $M'$, partitioning the remaining rows of $M'$ into two mutually

1085: non-conflicting sets, and then reading off the two induced

1086: haplotypes. Similarly, let $SOL(M)$ be the set that contains all

1087: haplotypes $H$ that can be induced by removing some rows of $M$

1088: (such that the remaining rows are mutually non-conflicting) and

1089: then reading off the single, induced haplotype. Note the following

1090: pair of observations, which both follow directly from the

1091: construction of $M'$:

1092: \begin{equation}

1093: \label{eq:shl1} (H_1,H_2) \in SOL(M') \Rightarrow H_1, H_2 \in

1094: SOL(M),

1095: \end{equation}

1096: \begin{equation}

1097: \label{eq:shl2} H \in SOL(M) \Rightarrow (H,H) \in SOL(M').

1098: \end{equation}

1099: To satisfy the L-reduction we need to show how elements from

1100: $SOL(M')$ are mapped back to elements of $SOL(M)$ in polynomial

1101: time. So, let $(H_1, H_2)$ be any pair from $SOL(M')$. If $|H_1|

1102: \geq |H_2|$ map the pair $(H_1,H_2)$ to $H_1$, otherwise to $H_2$.

1103: This completes the L-reduction, and we now prove its correctness.

1104: Central to this is the proof of the following:

1105: \begin{equation}

1106: \label{eq:dubbel} \text{\emph{SH-LHR}}(M) = \frac{1}{2}

1107: \text{\emph{LHR}}(M').

1108: \end{equation}

1109: %

1110: The fact that SH-LHR(M) $\geq \frac{1}{2} \text{\emph{LHR}}(M')$

1111: follows immediately from (\ref{eq:shl1}) and the mapping described

1112: above. (This lets us fulfil condition (\ref{eq:L1}) of the

1113: L-reduction definition, taking $\alpha=2$.) The fact that

1114: SH-LHR(M) $\leq \frac{1}{2} \text{\emph{LHR}}(M')$ follows

1115: because, by (\ref{eq:shl2}), every element in $SOL(M)$ is

1116: guaranteed to have a counterpart in $SOL(M')$ which has a total length twice as large.\\

1117: \\

1118: We can fulfil condition (\ref{eq:L2}) of the L-reduction by taking

1119: $\beta=\frac{1}{2}$. To see this, let $(H_1, H_2)$ be any pair

1120: from $SOL(M')$, and (wlog) assume that $|H_1| \geq |H_2|$. Let

1121: $r=\text{\emph{LHR}}(M')$; the distance of $(H_1,H_2)$ from

1122: optimal is then:

1123: \begin{equation}

1124: r - (|H_1|+|H_2|) \geq r - 2|H_1|.

1125: \end{equation}

1126: %

1127: Let $l=\text{\emph{SH-LHR}}(M)$; then:

1128: \begin{equation}

1129: \begin{array}{ll}

1130: l - |H_1|&{}={}\frac{r}{2} - |H_1|\\

1131: &{}={}\frac{1}{2} \bigg ( r-2|H_1| \bigg )\\

1132: &{}\leq{}\frac{1}{2} \bigg (r - (|H_1|+|H_2|) \bigg).

1133: \end{array}

1134: \end{equation}

1135: %

1136: Thus, taking $\beta = \frac{1}{2}$ satisfies condition

1137: (\ref{eq:L2}) of the L-reduction.\\

1138: \end{proof}
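The identity (\ref{eq:dubbel}) can be checked on a toy instance with the brute-force sketch below. The matrix and all function names are assumptions made for the illustration, and both searches are exponential in the number of rows.

```python
from itertools import combinations, product

HOLE = "-"


def conflict(a, b):
    """True if rows a and b disagree in a column where neither has a hole."""
    return any(x != y and HOLE not in (x, y) for x, y in zip(a, b))


def compatible(rows):
    """True if the rows are pairwise non-conflicting."""
    return all(not conflict(a, b) for a, b in combinations(rows, 2))


def hap_length(rows):
    """Length of the induced haplotype: columns with at least one non-hole."""
    return sum(any(x != HOLE for x in column) for column in zip(*rows))


def sh_lhr(matrix):
    """Single-haplotype LHR: best compatible subset of rows."""
    return max(hap_length(subset)
               for size in range(len(matrix) + 1)
               for subset in combinations(matrix, size)
               if compatible(subset))


def lhr(matrix):
    """LHR: each row is removed (0) or assigned to haplotype 1 or 2."""
    best = 0
    for assignment in product((0, 1, 2), repeat=len(matrix)):
        parts = [[row for row, a in zip(matrix, assignment) if a == side]
                 for side in (1, 2)]
        if all(compatible(part) for part in parts):
            best = max(best, sum(hap_length(part) for part in parts))
    return best


M = ["10-", "-01", "0-1"]
M_doubled = M + M  # the instance M' from the proof
print(sh_lhr(M), lhr(M_doubled))
```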

1139: %

1140: \begin{lemma}

1141: \label{lem:2gapAPX} 2-gap SH-LHR is APX-hard.\\

1142: \end{lemma}

1143: \begin{proof}

1144: We reduce from CUBIC-MAX-INDEPENDENT-SET. Let $G = (V,E)$ be the

1145: undirected, cubic input to CUBIC-MAX-INDEPENDENT-SET. We direct

1146: the edges of $G$ in the manner described by Observation

1147: \ref{obs:orient}, to give $\overrightarrow{G} = (V,

1148: \overrightarrow{E})$. Thus, every vertex of $\overrightarrow{G}$

1149: is now out-out-in or in-in-out. A vertex $w$ is a \emph{child} of

1150: a vertex $v$ if there is an edge leaving $v$ in the direction of

1151: $w$ i.e. $(v,w) \in \overrightarrow{E}$, and in this case

1152: $v$ is said to be the \emph{parent} of $w$.\\

1153: \\

1154: Let $v_{in}$ be the number of vertices in $\overrightarrow{G}$

1155: that are in-in-out, and $v_{out}$ be the number of vertices that

1156: are out-out-in. We build a matrix $M$, to be used as input to

1157: 2-gap SH-LHR, which has $|V|$ rows and $2v_{in} + v_{out}$

1158: columns. The construction of $M$ is as follows. (Each row of $M$

1159: will represent a vertex from $V$, so we henceforth index the rows

1160: of $M$ using vertices of $V$.) Now, to each in-in-out vertex of

1161: $\overrightarrow{G}$, we allocate two \emph{adjacent} columns of

1162: $M$, and for each out-out-in vertex, we allocate one column of

1163: $M$. (A column may not be allocated to more than one

1164: vertex.)\footnote{Note that, for this lemma, it is not important

1165: how the columns are allocated; in the proof of Lemma

1166: \ref{lem:lhrhard}, the ordering is crucial.} For simplicity, we

1167: also impose an arbitrary total order

1168: $P$ on the vertices of $V$.\\

1169: \\

1170: Now, for each vertex $v \in V$, we build row $v$ as follows.

1171: Firstly, we put 1(s) in the column(s) representing $v$. Secondly,

1172: consider each child $w$ of $v$. If $w$ is an out-out-in vertex, we

1173: put a $0$ in the column representing $w$. Alternatively, $w$ is an

1174: in-in-out vertex, so $w$ is represented by two columns; in this

1175: case we put a 0 in the left such column (if $v$ comes before the

1176: other parent of $w$ in the total order $P$) or, alternatively, in

1177: the right column (if $v$ comes after the other parent of $w$ in the

1178: total order $P$). The rest of the row is holes.\\

1179: %

1180: \begin{figure}

1181: \begin{center}

1182: \epsfig{file=./lhrapx.eps}

1183: \end{center}

1184: \caption{Example input graph to CUBIC-MAX-INDEPENDENT-SET (see

1185: Lemmas \ref{lem:2gapAPX} and \ref{lem:shlhrhard}) after an

1186: appropriate edge orientation has been applied.}

1187: \label{fig:lhrgraph}

1188: \end{figure}

1189: %

1190: \newlength{\blb}

1191: \setlength{\blb}{-3.0pt}

1192: \newlength{\bl}

1193: \setlength{\bl}{-4pt}

1194: \newlength{\lb}

1195: \setlength{\lb}{-12.0pt}

1196: %

1197: \begin{figure}

1198: \begin{center}

1199: \begin{tabular}{l}

1200: $\left.

1201: \begin{array}{ccccccccccccc}

1202: \hspace{28pt} & v_3 \hspace{\bl} & v_1 \hspace{\bl} & v_2

1203: \hspace{\bl} & v_5 \hspace{\bl} & v_5 \hspace{\bl} & v_7

1204: \hspace{\bl} & v_8 \hspace{\bl} & v_8 \hspace{\bl} &

1205: v_4 \hspace{\bl} & v_4 \hspace{\bl} & v_6 \hspace{\bl} & v_6%

1206: \end{array}

1207: \right.$\\

1208: %

1209: \begin{tabular}{ll}

1210: %

1211: $\left.

1212: \begin{array}{c}

1213: v_1\\

1214: v_2\\

1215: v_3\\

1216: v_4\\

1217: v_5\\

1218: v_6\\

1219: v_7\\

1220: v_8%

1221: \end{array}

1222: \right. $

1223: %

1224: &

1225: %

1226: $\hspace{\lb}\left(

1227: \begin{array}{cccccccccccc}

1228: - \hspace{\blb} & 1 \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \\

1229: - \hspace{\blb} & - \hspace{\blb} & 1 \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 0 \hspace{\blb} & - \\

1230: 1 \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \\

1231: - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 1 \hspace{\blb} & 1 \hspace{\blb} & - \hspace{\blb} & - \\

1232: - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 1 \hspace{\blb} & 1 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \\

1233: - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 1 \hspace{\blb} & 1 \\

1234: 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 1 \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \\

1235: - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 1 \hspace{\blb} & 1 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 0 %

1236: \end{array}

1237: \right)$

1238: %

1239: \end{tabular}

1240: \end{tabular}

1241: \caption{Construction of matrix $M$ (from Lemma \ref{lem:2gapAPX}

1242: and \ref{lem:shlhrhard}) for graph in Figure \ref{fig:lhrgraph}}

1243: \label{fig:lhrmatrix}

1244: \end{center}

1245: \vspace{-12pt}

1246: \end{figure}

1247: %

1248: \\

1249: This completes the construction of $M$. Note that rows encoding

1250: in-in-out vertices contain two adjacent 1s and one 0, with at most

1251: one gap in the row, and rows encoding out-out-in vertices contain

1252: one 1 and two 0s, with at most two gaps in the row. In either case

1253: there are precisely 3 non-hole elements per row. It is also

1254: crucial to note that, reading

1255: down any one column of $M$, one sees exactly one 1 and exactly one 0.\\

1256: \\

1257: Let $K$ be any submatrix of $M$ obtained by removing rows from

1258: $M$, and let $V[K] \subseteq V$ be the set of vertices whose rows

1259: appear in $K$. If the rows of $K$ are mutually non-conflicting,

1260: then the haplotype induced by $K$ has length $3r$ where $r$ is the

1261: number of rows in $K$. This follows from the aforementioned facts

1262: that every column of $M$ contains exactly one 1 and

1263: one 0, and that every row has exactly 3 non-hole elements.\\

1264: \\

1265: We now prove that the rows of $K$ are in conflict iff $V[K]$ is

1266: not an independent set. First, suppose $V[K]$ is not an

1267: independent set. Then there exist $u, v \in V[K]$ such that $(u,v)

1268: \in \overrightarrow{E}$. In row $v$ of $K$ there are thus 1(s) in

1269: the column(s) representing vertex $v$. However, there is also (in

1270: row $u$) a 0 in the column (or one of the columns) representing

1271: vertex $v$, causing a conflict. Hence, if $V[K]$ is not an

1272: independent set, $K$ is in conflict. Now consider the other

1273: direction. Suppose $K$ is in conflict. Then in some column of $K$

1274: there is a 0 and a 1. Let $u$ be the row where the 0 is seen, and

1275: $v$ be the row where the 1 is seen. So both $u$ and $v$ are in

1276: $V[K]$. Further, we know that there is an out-edge $(u,v)$ in

1277: $\overrightarrow{E}$, and thus an edge between $u$ and $v$ in $E$,

1278: proving

1279: that $V[K]$ is not an independent set. This completes the proof of the iff relationship.\\

1280: \\

1281: It follows that:

1282: \begin{equation}

1283: \begin{array}{l}

1284: \text{\emph{CUBIC-MAX-INDEPENDENT-SET}}(G)\\

1285: = \frac{1}{3} \text{\emph{SH-LHR}}(M).

1286: \end{array}

1287: \end{equation}

1288: %

1289: The conditions of the L-reduction definition are now easily

1290: satisfied, because of the 1-1 correspondence between haplotypes

1291: induced (after row-removals) and independent sets in $G$, and the

1292: fact that a size-$r$ independent set of $G$ corresponds to a

1293: length-$3r$ haplotype (or, equivalently, to $r$ mutually

1294: non-conflicting rows of $M$.) The L-reduction is formally

1295: satisfied by taking $\alpha = 3$ and $\beta = \frac{1}{3}$. The

1296: two functions that comprise the L-reduction are both polynomial

1297: time computable.\\
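\\
As a sketch of how the two conditions can be checked, assume that
they take their usual form: the optimum of the target instance is at
most $\alpha$ times the optimum of the source instance, and the error
of the recovered source solution is at most $\beta$ times the error
of the target solution. Write $opt_G$ and $opt_M$ (shorthand
introduced here) for CUBIC-MAX-INDEPENDENT-SET$(G)$ and
SH-LHR$(M)$, and let $K$ be any set of $r$ mutually non-conflicting
rows, which has value $3r$ and maps back to the independent set
$V[K]$ of size $r$. Then
\begin{align*}
opt_M & = 3\, opt_G \leq \alpha \cdot opt_G & & \text{with } \alpha = 3,\\
opt_G - r & = \tfrac{1}{3}( opt_M - 3r ) \leq \beta ( opt_M - 3r ) & & \text{with } \beta = \tfrac{1}{3},
\end{align*}
so both conditions hold, in fact with equality.\\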

1298: \end{proof}

1299: %

1300: \begin{lemma}

1301: \label{lem:shlhrhard} 1-gap SH-LHR is APX-hard.\\

1302: \end{lemma}

1303: \begin{proof}

1304: This proof is almost identical to the proof of Lemma

1305: \ref{lem:2gapAPX}; the difference is the manner in which columns

1306: of $M$ are assigned to vertices of $G$. The informal motivation is

as follows. In the previous allocation of columns to vertices, it was

1308: possible for a row corresponding to an out-out-in vertex to have 2

1309: gaps. Suppose, for each out-out-in vertex, we could ensure that

1310: one of the 0s in its row was adjacent to the 1 in the row, with no

1311: holes in between. Then every row of the matrix would have (at

1312: most) 1 gap, and we would be finished. We now show that, by

1313: exploiting a rather subtle property

1314: of cubic graphs, it is indeed possible to allocate columns to vertices such that this is possible.\\

1315: \\

Assume that we have ordered the edges of $G$ as before to obtain

1317: $\overrightarrow{G}$. Let $V_{out} \subseteq V$ be those vertices

1318: in $V$ that are out-out-in. Now, suppose we could compute (in

1319: polynomial time) an injective function $favourite: V_{out}

1320: \rightarrow V$ with the following properties:

1321: %

1322: \begin{itemize}

1323: \item for every $v \in V_{out}$, $(v,favourite(v))\in

1324: \overrightarrow{E}$;

1325: % \item for every $u,v \in V_{out}$, $favourite(u) = favourite(v)$

1326: % iff $u=v$; % follows from the fact that the function is injective

1327: \item the subgraph of $\overrightarrow{G}$ induced by edges of the

1328: form $(v,favourite(v))$, henceforth called the

1329: \emph{favourite-induced subgraph}, is acyclic.\\

1330: \end{itemize}

1331: %

1332: Given such a function it is easy to create a total enumeration of

1333: the vertices of $V$ such that every out-out-in vertex is

1334: immediately followed by its \emph{favourite} vertex. This

1335: enumeration can then be used to allocate the columns of $M$ to the

1336: vertices of $V$, such that every row of $M$ has at most one gap.

1337: To ensure this property, it is necessary to stipulate that, where

1338: $favourite(v)$ is an in-in-out vertex, the 0 encoding the edge

1339: $(v,favourite(v))$ is placed in the \emph{left} of the two columns

1340: encoding $favourite(v)$. This is not a problem because every

1341: vertex is the favourite of at most one other vertex.\\

1342: \\

1343: It remains to prove that the function \emph{favourite} exists and

1344: that it can be constructed in polynomial time. This is equivalent

1345: to finding vertex disjoint directed paths in $\overrightarrow{G}$

1346: such that every out-out-in vertex is on such a path and all paths

1347: end in an in-in-out vertex. Lemma \ref{lem:bert} tells us how to

1348: find such paths. We thank Bert Gerards for invaluable

1349: help with this.\\

1350: \\

1351: This completes the proof that 1-gap SH-LHR is APX-hard. (See

1352: Figures \ref{fig:lhrgraph} and \ref{fig:lhrmatrix} for

1353: an example of the whole reduction in action.)\\

1354: \end{proof}

1355: %

1356: \begin{lemma}

1357: \label{lem:bert} Let $\overrightarrow{G}$ be a directed, cubic

1358: graph with a partition $(V_{out},V_{in})$ of the vertices such

1359: that the vertices in $V_{out}$ are out-out-in and the vertices in

1360: $V_{in}$ are in-in-out. Then $V_{out}$ can be covered, in

1361: polynomial time, by vertex-disjoint directed paths ending in

1362: $V_{in}$.\\

1363: \end{lemma}

1364: \begin{proof}

1365: Observe that any two directed circuits contained entirely within

1366: $V_{out}$ are pairwise vertex disjoint. Let $V'_{out}$ be obtained

1367: from $V_{out}$ by shrinking each directed circuit in $V_{out}$ to

1368: a single vertex, and let $\overrightarrow{G'}$ be the resulting

1369: new graph. (Note that each vertex in $V'_{out}$ has outdegree at

1370: least 2 and indegree at most 1 and that the indegree of each node

in $V_{in}$ is still 2, because we do not delete multiple edges.)

1372: We now argue that it is possible to find a set of edges $F'$ in

1373: $\overrightarrow{G'}$, with $|F'| = |V'_{out}|$, such that - for

1374: each $v \in V'_{out}$ - precisely one edge from $F'$ begins at

1375: $v$, and such that no two edges in $F'$ have the same endpoint. We

1376: prove this by construction. For each vertex $u \in V'_{out}$ that

1377: has a child $v$ in $V'_{out}$, we can add the edge $(u,v)$ to

1378: $F'$, because $v$ has indegree 1 and therefore no other edges can

1379: end at $v$. (In case $u$ has two such children, we can choose one

1380: of the edges to add to $F'$). Thus we are left to deal with a

1381: subset of vertices $L \subseteq V'_{out}$ where every vertex in

1382: $L$ has all its children in $V_{in}$. Now consider the bipartite

1383: graph $B$ with bipartition $(L, V_{in})$ and an edge for every

1384: directed edge of $\overrightarrow{G'}$ going from $L$ to $V_{in}$.

1385: If we can find a matching in $B$ of size $|L|$, we can complete

the construction of $F'$ by adding the edges from this

1387: matching. Hall's Theorem states that a bipartite graph with

1388: bipartition $(X,Y)$ has a matching of size $|X|$ iff, for all $X'

1389: \subseteq X$, $|N(X')| \geq |X'|$, where $N(X')$ is the set of all

1390: neighbours of $X'$. Now, note that each vertex in $L$ sends at

1391: least two edges across the partition of $B$, and each vertex in

1392: $V_{in}$ can accept at most two such edges, so for each $L'

1393: \subseteq L$ it is clear that $|N(L')| \geq |L'|$. Hence, the

1394: graph $(L, V_{in})$ does indeed have a matching of size $|L|$ and

1395: the construction of $F'$ can be completed.\\

1396: \\
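The verification of Hall's condition can be spelled out by double
counting. In this sketch, $e(L')$ is notation introduced here for the
number of edges of $B$, counted with multiplicity, that have an
endpoint in $L'$; every such edge has its other endpoint in $N(L')$.
Then
\begin{equation}
2|L'| \leq e(L') \leq 2|N(L')|,\nonumber
\end{equation}
where the first inequality holds because each vertex of $L$ sends at
least two edges into $V_{in}$, and the second because each vertex of
$V_{in}$ has indegree 2 in $\overrightarrow{G'}$. Dividing by 2 gives
$|N(L')| \geq |L'|$.\\
\\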

1397: Now, given that the graph induced by $V'_{out}$ is acyclic, so is

1398: $F'$. Let $F$ be the set of edges in $\overrightarrow{G}$

1399: corresponding to those in $F'$. $F$ is acyclic and each directed

1400: circuit $C$ in $V_{out}$ has exactly one vertex $v_{C}$ that is a

1401: tail of an edge of $F$ and no vertex that is a head of an edge in

1402: $F$. Let $P_C$ be the longest directed path in $C$ that ends in

1403: $v_C$. Then the union of $F$ and all $P_C$ over all directed

1404: circuits $C$ in $V_{out}$ is a collection of paths ending in

1405: $V_{in}$ and covering $V_{out}$.\\

1406: \\

1407: Finding cycles in a graph and finding a maximum matching in a

1408: bipartite graph are both polynomial-time computable, so the whole

1409: process described above is polynomial-time computable.\\

1410: \end{proof}

1411: %

1412: \begin{lemma}

1413: \label{lem:lhrhard}

1414: 1-gap LHR is APX-hard.\\

1415: \end{lemma}

1416: \begin{proof}

1417: Follows from Lemma \ref{lem:shlhrhard} and Lemma

1418: \ref{lem:shequiv}.\\

1419: \end{proof}

1420: %

1421: \section{Conclusion}

1422: %

This paper studies the complexity (under various input

1424: restrictions) of the haplotyping problems Minimum Error Correction

1425: (MEC) and Longest Haplotype Reconstruction (LHR). The state of

knowledge about MEC and LHR after this paper is summarised in

1427: Table \ref{tab:after}. We also include Minimum Fragment Removal

1428: (MFR) and Minimum SNP Removal (MSR) in the table because they are

1429: two other well-known Single Individual Haplotyping problems. MSR

1430: (MFR) is the problem of removing the minimum number of columns

1431: (rows) from an SNP-matrix in order to make it feasible.\\

1432: %

1433: %

1434: \begin{table}[h]

1435: \begin{centering}

1436: \begin{tabular}{|c||c|c|}

1437: \hline

1438: & Binary (i.e. no holes) & ? (Section \ref{subsec:bmec})\\

1439: & & PTAS known \cite{li}\\

1440: \cline{2-3}

1441: MEC & Ungapped & NP-hard (Section \ref{subsec:umec})\\

1442: \cline{2-3}

1443:  & 1-Gap & NP-hard (Section \ref{subsec:gmec}),\\

1444:  & & APX-hard (Section \ref{subsec:gmec})\\

1445: % \cline{2-3}

1446: %  & General & NP-hard (implicit in \cite{kleinberg})\\

1447: %  & & APX-hard (Section \ref{subsec:gmec})\\

1448: \hline

1449: %

1450: % & Binary (i.e. no holes) & P (trivially)\\

1451: % \cline{2-3}

1452:  & Ungapped & P (Section \ref{subsec:lhrpoly})\\

1453: \cline{2-3}

1454: LHR & 1-Gap & NP-hard (Section \ref{subsec:lhrhard})\\

1455:  & & APX-hard (Section \ref{subsec:lhrhard})\\

1456: % \cline{2-3}

1457: %  & General & NP-hard (Section \ref{subsec:lhrhard})\\

1458: %  & & APX-hard (Section \ref{subsec:lhrhard})\\

1459: \hline

1460: %

1461: & Ungapped & P \cite{bafna2005}\\

1462: \cline{2-3}

1463: MFR & 1-Gap & NP-hard \cite{lanciabafna}\\

1464:  & & APX-hard \cite{bafna2005}\\

1465: % \cline{2-3}

1466: %  & General & NP-hard \cite{lanciabafna}\\

1467: %  & & APX-hard \cite{bafna2005}\\

1468: \hline

1469: %

1470: & Ungapped & P \cite{lanciabafna}\\

1471: \cline{2-3}

1472: MSR & 1-Gap & NP-hard \cite{bafna2005}\\

1473:  & & APX-hard \cite{bafna2005}\\

1474: % \cline{2-3}

1475: %  & General & NP-hard \cite{lanciabafna}\\

1476: %  & & APX-hard \cite{bafna2005}\\

1477: \hline

1478: %

1479: \end{tabular}

1480: \caption{The new state of knowledge following our work}

1481: \label{tab:after}

1482: \end{centering}

1483: \vspace{-12pt}

1484: \end{table}

1485: \\

From a complexity perspective, the most intriguing open

1487: problem is to ascertain the complexity of the ``re-opened''

1488: problem Binary-MEC. It would also be interesting to study the

1489: approximability of Ungapped-MEC.\\

1490: % ; we conjecture that (in an

1491: % approximation complexity sense) it is somewhat easier

1492: % than 1-gap MEC.\\

1493: \\

1494: From a more practical perspective, the next logical step is to

1495: study the complexity of these problems under more restricted

1496: classes of input, ideally under classes of input that have direct

1497: biological relevance. It would also be of interest to study some

1498: of these problems in a ``weighted'' context i.e. where the cost of

1499: the operation in question (row removal, column removal, error

1500: correction) is some function of (for example) an \emph{a priori}

1501: specified confidence in the correctness of the data being changed.

1502: %

1503: \section{Acknowledgements}

1504: %

1505: We thank Leen Stougie and Judith Keijsper for many useful

1506: conversations during the writing of this paper.

1507: %

1508: %

1509: %

1510: \begin{thebibliography}{77} % start the bibliography

1511: %

\bibitem{alimontikann} Paola Alimonti, Viggo Kann, Hardness of approximating problems on cubic graphs, \emph{Proceedings of the Third Italian Conference on Algorithms and Complexity}, 288-298 (1997)

1513: %

1514: \bibitem{alon} Noga Alon, Benny Sudakov, On Two Segmentation Problems, \emph{Journal of Algorithms} 33, 173-184 (1999)

1515: %

1516: \bibitem{ausiello} G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, M. Protasi,

1517: Complexity and Approximation - Combinatorial optimization problems

1518: and their approximability properties, Springer Verlag (1999)

1519: %

1520: \bibitem{bafna2005} Vineet Bafna, Sorin Istrail, Giuseppe Lancia, Romeo Rizzi, Polynomial and APX-hard cases of the individual haplotyping problem,

1521: \emph{Theoretical Computer Science}, 335(1), 109-125 (2005)

1522: %

1523: \bibitem{bermankarpinski} Piotr Berman, Marek Karpinski, On Some Tighter Inapproximability Results (Extended Abstract), Proceedings of the 26th International Colloquium on Automata, Languages and Programming, 200-209 (1999)

1524: %

1525: \bibitem{bonizzoni} Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi, Jing Li, The Haplotyping Problem: An Overview of Computational Models and Solutions,

1526: \emph{Journal of Computer Science and Technology} 18(6), 675-688

1527: (November 2003)

1528: %

\bibitem{drineas} P. Drineas, A. Frieze, R. Kannan, S. Vempala, V. Vinay, Clustering in large graphs via Singular Value Decomposition, \emph{Machine Learning} 56, 9-33 (2004)

1530: %

1531: %\bibitem{gavril} F. Gavril, Testing for equality between maximum matching and minimum node

1532: %covering, \emph{Information processing letters} 6, 199-202 (1977)

1533: %

1534: \bibitem{greenberg} Harvey J. Greenberg, William E. Hart, Giuseppe Lancia, Opportunities for Combinatorial Optimisation in Computational Biology, \emph{INFORMS Journal on Computing}, 16(3), 211-231 (2004)

1535: %

1536: \bibitem{halldorsson} Bjarni V. Halldorsson, Vineet Bafna, Nathan Edwards, Ross Lippert, Shibu Yooseph, and Sorin Istrail, A Survey of Computational Methods for Determining Haplotypes,

1537: \emph{Proceedings of the First RECOMB Satellite on Computational

1538: Methods for SNPs and Haplotype Inference}, Springer Lecture Notes

1539: in Bioinformatics, LNBI 2983, pp. 26-47  (2003)

1540: %

1541: \bibitem{sched} Hoogeveen, J.A., Schuurman, P., and Woeginger, G.J., Non-approximability results for scheduling problems with minsum criteria, \emph{INFORMS Journal on Computing}, 13(2), 157-168 (Spring 2001)

1542: %

1543: %\bibitem{hopcroftkarp} J.E. Hopcroft, R.M. Karp, An $n^{5/2}$ algorithm for maximum matching in bipartite graphs, \emph{SIAM Journal on Computing} 2, 225-231 (1973)

1544: %

1545: \bibitem{li} Yishan Jiao, Jingyi Xu, Ming Li, On the k-Closest Substring and k-Consensus Pattern Problems, \emph{Combinatorial Pattern Matching: 15th Annual Symposium} (CPM 2004) 130-144

1546: %

1547: \bibitem{kleinberg} Jon Kleinberg, Christos Papadimitriou, Prabhakar Raghavan, Segmentation Problems,

1548: \emph{Proceedings of STOC 1998}, 473-482 (1998)

1549: %

1550: \bibitem{kleinbergEco} Jon Kleinberg, Christos Papadimitriou, Prabhakar Raghavan, A Microeconomic View of Data Mining,

1551: \emph{Data Mining and Knowledge Discovery} 2, 311-324 (1998)

1552: %

1553: \bibitem{kleinberg2004} Jon Kleinberg, Christos Papadimitriou, Prabhakar Raghavan, Segmentation Problems,

1554: \emph{Journal of the ACM} 51(2), 263-280 (March 2004) Note: this

1555: paper is somewhat different to the 1998 version.

1556: %

\bibitem{lanciabafna} Giuseppe Lancia, Vineet Bafna, Sorin Istrail, Ross Lippert, and Russell Schwartz, SNPs Problems, Complexity and Algorithms,

1558: \emph{Proceedings of the 9th Annual European Symposium on

1559: Algorithms}, 182-193 (2001)

1560: %

\bibitem{pureparsimony} Giuseppe Lancia, Maria Cristina Pinotti, Romeo Rizzi, Haplotyping Populations by Pure Parsimony: Complexity of Exact and Approximation Algorithms, \emph{INFORMS Journal on Computing}, Vol. 16, No.4, 348-359 (Fall 2004)

1562: %

1563: \bibitem{newlancia} Giuseppe Lancia, Romeo Rizzi, A polynomial

1564: solution to a special case of the parsimony haplotyping problem,

1565: to appear in \emph{Operations Research Letters}

1566: %

1567: \bibitem{geometric} Rafail Ostrovsky and Yuval Rabani, Polynomial-Time Approximation Schemes for Geometric Min-Sum Median Clustering,

1568: \emph{Journal of the ACM} 49(2), 139-156 (March 2002)

1569: %

1570: \bibitem{fasthare} Alessandro Panconesi and Mauro Sozio, Fast Hare: A Fast Heuristic for Single Individual SNP Haplotype Reconstruction,

1571: \emph{Proceedings of 4th Workshop on Algorithms in Bioinformatics}

1572: (WABI 2004), LNCS Springer-Verlag, 266-277

1573: %

1574: \bibitem{christos} Personal communication with Christos H. Papadimitriou, June 2005

1575: %

1576: \bibitem{lreduc} C.H. Papadimitriou and M. Yannakakis, Optimization, approximation, and complexity classes, \emph{Journal of Computer and System Sciences} 43, 425-440 (1991)

1577: %

1578: \bibitem{fixedparam} Romeo Rizzi, Vineet Bafna, Sorin Istrail, Giuseppe Lancia: Practical Algorithms and Fixed-Parameter Tractability for the Single Individual SNP Haplotyping Problem,

1579: \emph{2nd Workshop on Algorithms in Bioinformatics} (WABI 2002)

1580: 29-43

1581: %

1582: \end{thebibliography}       % end the bibliography

1583: %

1584: %

1585: \vspace{30pt}

1586: %

1587: %

1588: \begin{biography}{Rudi Cilibrasi}

1589: received his bachelor's degree at Caltech in 1996. He spent

1590: several years in industry doing network programming, Linux kernel

1591: programming, and a variety of software development work until

1592: returning to academia with CWI in 2001. He is now nearing

1593: completion of his doctoral work that has been largely concerned

1594: with robust methods of approximating bioinformatics and related

1595: clustering problems. He currently maintains CompLearn (

1596: http://complearn.org/ ), an open-source data-mining package that

1597: can be used for phylogenetic tree construction.

1598:

1599: \end{biography}

1600: \begin{biography}{Leo van Iersel}

received his Master of Science degree in Applied
Mathematics from the Universiteit Twente in the Netherlands in 2004. He is

1603: now working as a PhD student at the Technische Universiteit

1604: Eindhoven, also in the Netherlands. His research is mainly

1605: concerned with the search for combinatorial algorithms for

1606: biological problems.

1607: \end{biography}

1608: \begin{biography}{Steven Kelk}

1609: received his PhD in Computer Science in 2004 from the University

1610: of Warwick, in England. He is now working as a postdoc at the

1611: Centrum voor Wiskunde en Informatica (CWI) in Amsterdam, the

1612: Netherlands, where he is focussing on the combinatorial aspects of

1613: computational biology.

1614: \end{biography}

1615: \begin{biography}{John Tromp}

1616: received the bachelor's and PhD degrees in Computer Science from

1617: the University of Amsterdam in 1989 and 1993 respectively, where

1618: he studied with Paul Vit\'{a}nyi. He then spent two years as a

1619: postdoctoral fellow with Ming Li at the University of Waterloo in

1620: Canada. In 1996 he returned as a postdoc to the Centre for

1621: Mathematics and Computer Science (CWI) in Amsterdam. He spent 2001

working as a software developer at Bioinformatics Solutions Inc. in

1623: Waterloo, to return once more to CWI, where he currently holds a

1624: permanent position. He is the recipient of a Canada International

1625: Fellowship. See http://www.cwi.nl/~tromp/ for more information.

1626: \end{biography}

1627: %

1628: %

1629: %

1630: %

1631: \clearpage

1632: %

1633: \section*{Appendix: Interreducibility of MEC and Constructive-MEC}

1634: \label{app:inter}

1635: %

1636: \renewcommand{\theequation}{A\arabic{equation}}

1637: \setcounter{section}{0}

1638: %

1639: \renewcommand{\thesection}{A.\arabic{section}}

1640: %

1641: \numberwithin{equation}{section} \numberwithin{figure}{section}

1642: %

1643: % \section{Interreducibility of MEC and Constructive-MEC}

1644: %

1645: \begin{lemma}

1646: \label{lem:int} MEC and Constructive-MEC are polynomial-time

1647: Turing interreducible. (Also: Binary-MEC and

1648: Binary-Constructive-MEC are polynomial-time Turing

1649: interreducible.)\\

1650: \end{lemma}

1651: \begin{proof}

1652: We show interreducibility of MEC and Constructive-MEC in such a

1653: way that the interreducibility of Binary-MEC with

1654: Binary-Constructive-MEC also follows immediately from the

1655: reduction. This makes the reduction from Constructive-MEC to

1656: MEC quite complicated because we must thus avoid the use of holes.\\

1657: \\

1658: 1. Reducing MEC to Constructive-MEC is trivial because, given an

1659: optimal haplotype pair $(H_1, H_2)$, $D_M(H_1, H_2)$ can easily be

1660: computed in polynomial-time by summing $\min( d(H_1,r), d(H_2,r)

1661: )$ over all rows $r$ of the

1662: input matrix $M$.\\

1663: \\

1664: 2. Reducing Constructive-MEC to MEC is more involved. To prevent a

1665: particular special case which could complicate our reduction, we

1666: first check whether every row of $M$ (i.e. the input to

1667: Constructive-MEC) is identical. If this is so, we can complete the

1668: reduction by simply returning $(H_1, H_1)$ where $H_1$ is the

1669: first row of $M$. Hence,

1670: from this point onwards, we assume that $M$ has at least two distinct rows.\\

1671: \\

1672: Let $OptPairs(M)$ be the set of all unordered optimal haplotype

1673: pairs for $M$ i.e. the set of all $(H_1, H_2)$ such that $D_M(H_1,

H_2) = MEC(M)$. Given that not all rows of $M$ are identical, we

1675: observe that there are no pairs of the form $(H_1, H_1)$ in

1676: $OptPairs(M)$.\footnote{This is because $D_{M}(H_1,H_1)$ is always

1677: larger than $D_{M}(H_1, r)$ for any row $r$ in $M$ that is not

1678: equal to $H_1$.} Let $OptPairs(M,H') \subseteq OptPairs(M)$ be

1679: those elements $(H_1,H_2) \in OptPairs(M)$ such that $H_1 = H'$ or

1680: $H_2 = H'$. Let $g(r, H_1, H_2)$

1681: be defined as $\min( d(r, H_1), d(r, H_2) )$.\\

1682: \\

1683: Consider the following two subroutines:\\

1684: \\

1685: \textbf{Subroutine: } \emph{DFN} (``Distance From Nearest Optimal Haplotype Pair'')\\

1686: \textbf{Input: } An $n \times m$ SNP matrix $M$ and a vector $r \in \{0,1\}^m$.\\

1687: \textbf{Output: } The value $d_{dfn}$ which we define as follows:

1688: \begin{equation}

1689: d_{dfn} = \min_{ (H_1, H_2) \in OptPairs(M) } g( r, H_1, H_2

1690: ).\nonumber

1691: \end{equation}

1692: \\

1693: \textbf{Subroutine: } \emph{ANCHORED-DFN} (``Anchored Distance From Nearest Optimal Haplotype Pair'')\\

1694: \textbf{Input: } An $n \times m$ SNP matrix $M$, a vector $r \in

1695: \{0,1\}^m$, and a haplotype $H'$ such that

1696: $(H', H_2) \in OptPairs(M)$ for some $H_2$.\\

1697: \textbf{Output: } The value $d_{adfn}$, defined as:

1698: \begin{equation}

1699: d_{adfn} = \min_{ (H_1, H_2) \in OptPairs(M, H') } g( r, H_1, H_2

1700: ).\nonumber

1701: \end{equation}

1702: \\

1703: We assume the existence of implementations of DFN and ANCHORED-DFN

1704: which run in polynomial-time whenever MEC runs in polynomial-time.

1705: We use these two subroutines to reduce Constructive-MEC to MEC and

then, to complete the proof, demonstrate and prove correctness of

1707: implementations for

1708: DFN and ANCHORED-DFN.\\

1709: \\

1710: The general idea of the reduction from Constructive-MEC to MEC is

1711: to find some pair $(H_1, H_2) \in OptPairs(M)$ by first finding

1712: $H_1$ (using repeated calls to DFN) and then finding $H_2$ (by

1713: using repeated calls to ANCHORED-DFN with $H_1$ specified as the

1714: ``anchoring'' haplotype.) Throughout the reduction, the following

1715: two observations are important. Both follow immediately from the

1716: definition of $D$ - i.e. (\ref{eq:witsum}).\\

1717: \begin{observation}

1718: \label{obs:expand} Let $M_1 \cup M_2$ be a partition of rows of

1719: the matrix $M$ into two sets. Then, for all $H_1$ and $H_2$,

1720: $D_{M}(H_1, H_2) = D_{M_1}(H_1, H_2) + D_{M_2}(H_1,H_2)$.\\

1721: \end{observation}

1722: \begin{observation}

1723: \label{obs:baseline} Suppose an SNP matrix $M_1$ can be obtained

1724: from an SNP matrix $M_2$ by removing 0 or more rows from $M_2$.

1725: Then $MEC(M_1) \leq MEC(M_2)$.\\

1726: \end{observation}

1727: To begin the reduction, note that, for an arbitrary

1728: haplotype $X$, DFN$(M,X)=0$ iff $(X, H_2) \in OptPairs(M)$ for

1729: some haplotype $H_2$. Our idea is thus that we initialise $X$ to

1730: be all-0 and flip one entry of $X$ at a time (i.e. change a 0 to a

1731: 1 or vice-versa) until DFN$(M,X)=0$; at that point $X = H_1$ (for

1732: some $(H_1, H_2) \in OptPairs(M)$.) More specifically, suppose

1733: DFN$(M,X) = d$ where $0 < d < m$. \footnote{It is not possible

1734: that DFN$(M,X)=m$, because all $(H_1, H_2) \in OptPairs(M)$ are of

1735: the form $H_1 \neq H_2$, and if $H_1 \neq H_2$ we know that

1736: $g(X,H_1,H_2) < m$.} If we define $flip(X,i)$ as the haplotype

1737: obtained by flipping the entry in the $i$th column of $X$, then we

1738: know that there exists $i$ ($1 \leq i \leq m$) such that DFN$(M,

1739: flip(X,i)) < d$. Such a position must exist because we can flip

1740: some entry in $X$ to bring it closer to the haplotype (which we

1741: know exists) that it was distance $d$ from. It is clear that we

1742: can find a position $i$ in polynomial-time by calling DFN$(M,

1743: flip(X,j))$ for $1 \leq j \leq m$ until it is found.

1744: Having found such an $i$, we set $X = flip(X,i)$.\\

1745: \\
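The existence of an improving flip can be made explicit. This is a
sketch that assumes $d(\cdot,\cdot)$ is the Hamming distance between
hole-free vectors of length $m$. Let $(H_1, H_2) \in OptPairs(M)$
attain DFN$(M,X) = d > 0$, labelled so that $d(X,H_1) = d$, and let
$i$ be any column in which $X$ and $H_1$ differ. Then
\begin{align*}
\text{DFN}(M, flip(X,i)) & \leq g( flip(X,i), H_1, H_2 )\\
& \leq d( flip(X,i), H_1 )\\
& = d - 1,
\end{align*}
so an improving position always exists among the $m$ candidates.\\
\\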

1746: Clearly this process can be iterated, finding one entry to flip in

1747: every iteration, until DFN$(M,X)=0$ and at this point setting $H_1

1748: = X$ gives us the desired result. Given that DFN$(M,X)$ decreases

1749: by at least 1 every iteration, at most $m-1$ iterations

1750: are required.\\

1751: \\

1752: Thus, having found $H_1$, we need to find some $H_2$ such that

1753: $(H_1, H_2)$ is in $OptPairs(M)$.\\

1754: \\

1755: First, we initialise $X$ to be the complement of $H_1$ (i.e. the

1756: row obtained by flipping every entry of $H_1$). Now, observe that

1757: if $X \neq H_1$ and ANCHORED-DFN$(M, X, H_1) = 0$ then $(H_1, X)

1758: \in OptPairs(M)$ and we are finished. The tactic is thus to find,

1759: at each iteration, some position $i$ of $X$ such that

1760: ANCHORED-DFN$(M, flip(X,i), H_1)$ is less than ANCHORED-DFN$(M,

1761: X,H_1)$, and then setting $X$ to be $flip(X,i)$. As before we

1762: repeat this process until our call to ANCHORED-DFN returns zero.

1763: The ``trick'' in this case is to prevent $X$ converging on $H_1$,

1764: because (knowing that $M$ has at least two different types of row)

1765: $(H_1, H_1) \not \in OptPairs(M)$. The initialisation of $X$ to

1766: the complement of $H_1$ guarantees this. To see why this is,

1767: observe that, if $X$ is the complement of $H_1$, $d(X,H_1)=m$.

1768: Thus, we would need at least $m$ flips to transform $X$ into

1769: $H_1$. However, if $X$ is the complement of $H_1$, then - because

1770: we have guaranteed that $OptPairs(M)$ contains no pairs of the

1771: form $(H_1,H_1)$ - we know that ANCHORED-DFN$(M, X, H_1) < m$.

1772: Given that we can guarantee that ANCHORED-DFN$(M,X,H_1)$ can be

1773: reduced by at least 1 at every iteration, it is clear that we can

1774: find an $X$ such that ANCHORED-DFN$(M,X,H_1)=0$ after making no

1775: more than $m-1$ iterations, which ensures that $X$ cannot have

1776: been transformed into $H_1$. Once we have such an $X$ we can set

1777: $H_2 = X$ and

1778: return $(H_1, H_2)$.\\

1779: \\

1780: To complete the proof of Lemma \ref{lem:int} it remains only to

1781: demonstrate and prove the correctness of algorithms for DFN and

1782: ANCHORED-DFN, which we do below. Note that both DFN and

1783: ANCHORED-DFN run in polynomial-time if MEC runs in

1784: polynomial-time.\\

1785: \\

1786: \textbf{Subroutine: } \emph{DFN} (``Distance From Nearest Optimal Haplotype Pair'')\\

1787: \textbf{Input: } An $n \times m$ SNP matrix $M$ and a vector $r \in \{0,1\}^m$.\\

1788: \textbf{Output: } The value $d_{dfn}$ which we define as follows:

1789: \begin{equation}

1790: d_{dfn} = \min_{ (H_1, H_2) \in OptPairs(M) } g( r, H_1, H_2

1791: ).\nonumber

1792: \end{equation}

The following is a three-step algorithm to compute DFN$(M,r)$ which uses an oracle for MEC.\\

1794: \\

1795: 1. Compute $d = $MEC$(M)$.\\

1796: 2. Let $M'$ be the $n(m+1) \times m$ matrix obtained from $M$ by

1797: making $m+1$ copies of every row of $M$.\\

1798: 3. Return MEC$( M' \cup \{r\} ) - (m+1)d$ where $M' \cup \{r\}$ is

1799: the matrix obtained by adding the single row $r$ to the matrix

1800: $M'$.\\

1801: \\
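As a small sanity check of the three steps, consider the following
toy instance (ours, chosen for illustration): let $M$ have the two
rows $00$ and $11$, so $n = m = 2$, and let $r = 01$. Step 1 gives
$d = $MEC$(M) = 0$, attained only by the pair $(00, 11)$. In Step 2,
$M'$ consists of $m+1 = 3$ copies of each row. For Step 3, the pair
$(00,11)$ gives
\begin{equation}
D_{M' \cup \{r\}}(00, 11) = 3 \cdot 0 + 3 \cdot 0 + \min( d(01,00), d(01,11) ) = 1,\nonumber
\end{equation}
and no pair can achieve 0 because $M' \cup \{r\}$ contains three
distinct rows. Hence MEC$(M' \cup \{r\}) = 1$ and the subroutine
returns $1 - 3 \cdot 0 = 1$, which agrees with $d_{dfn} =
g(01, 00, 11) = 1$.\\
\\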

1802: To prove the correctness of the above we first make a further

1803: observation, which (as with the two previous observations) follows

1804: directly from (\ref{eq:witsum}).\\

1805: \begin{observation}

\label{obs:scale} Suppose a $kn \times m$ SNP matrix $M_1$ is
obtained from an $n \times m$ SNP matrix $M_2$ by making $k \geq
1$ copies of every row of $M_2$. Then $MEC(M_1) = k \cdot MEC(M_2)$, and

1809: $OptPairs(M_1) = OptPairs(M_2)$.\\

1810: \end{observation}

1811: By the above observation we know that MEC$(M') = (m+1)d$ and

1812: $OptPairs(M') = OptPairs(M)$. Now, we argue that $OptPairs(M' \cup

1813: \{r\}) \subseteq OptPairs(M)$. To see why this is, suppose there

1814: existed $(H_3, H_4)$ such that $(H_3, H_4) \in OptPairs(M' \cup

1815: \{r\})$ but $(H_3, H_4) \not \in OptPairs(M)$. This would mean

1816: $D_{M}(H_3, H_4) > d$ where $d = $MEC$(M)$. Now:

1817: \begin{align*}

1818: D_{M' \cup \{r\}}(H_3, H_4) & \geq D_{M'}(H_3, H_4)\\

1819: & = (m+1)D_{M}(H_3, H_4)\\

1820: & \geq (m+1)(d+1).

1821: \end{align*}

1822: However, if we take any $(H_1, H_2) \in OptPairs(M)$, we see that:

1823: \begin{align*}

1824: D_{M' \cup \{r\}}(H_1, H_2) & \leq (m+1)d + g(r,H_1, H_2)\\

1825: & \leq (m+1)d + m.

1826: \end{align*}

1827: Now, $(m+1)d + m < (m+1)(d+1)$ so $(H_3, H_4)$ could not possibly

1828: be in $OptPairs(M' \cup \{r\})$ - contradiction! The relationship

1829: $OptPairs(M' \cup \{r\}) \subseteq OptPairs(M)$ thus follows. It

1830: further follows, from Observation \ref{obs:expand}, that the

1831: members of $OptPairs(M' \cup \{r\})$ are precisely those pairs

1832: $(H_1, H_2) \in OptPairs(M)$ that minimise the expression

1833: $g(r,H_1,H_2)$. The minimal value of $g(r, H_1, H_2)$ has already

1834: been defined as $d_{dfn}$, so we have:

1835: \begin{equation}

1836: MEC(M' \cup \{r\}) = (m+1)d + d_{dfn}.\nonumber

1837: \end{equation}

1838: This proves the correctness of Step 3 of the subroutine.\\

1839: \\

1840: \textbf{Subroutine: } \emph{ANCHORED-DFN} (``Anchored Distance From Nearest Optimal Haplotype Pair'')\\

1841: \textbf{Input: } An $n \times m$ SNP matrix $M$, a vector $r \in

1842: \{0,1\}^m$, and a haplotype $H'$ such that

1843: $(H', H_2) \in OptPairs(M)$ for some $H_2$.\\

1844: \textbf{Output: } The value $d_{adfn}$, defined as:

1845: \begin{equation}

1846: d_{adfn} = \min_{ (H_1, H_2) \in OptPairs(M, H') } g( r, H_1, H_2

1847: ).\nonumber

1848: \end{equation}

1849: Given that $H'$ is one half of some optimal haplotype pair for

1850: $M$, it can be shown that ANCHORED-DFN$(M, r, H')$ =  DFN$( M \cup

1851: \{H'\}, r)$, thus demonstrating how ANCHORED-DFN can be easily

1852: reduced to DFN in polynomial-time. To prove the equation it is

1853: sufficient to demonstrate that $OptPairs( M \cup \{H'\}) =

1854: OptPairs(M, H')$, which we do now. Let $d=$MEC$(M)$. It follows

1855: that MEC$( M \cup \{ H' \} ) \geq d$. In fact, MEC$(M \cup \{H'\})

1856: = d$ because $D_{M \cup \{H'\}}( H', H_2 ) = d$ for all $(H', H_2)

1857: \in OptPairs(M,H')$. Hence $OptPairs(M,H') \subseteq OptPairs(M

1858: \cup \{ H' \})$. To prove the other direction, suppose there

1859: existed some pair $(H_1, H_2) \in OptPairs(M \cup \{ H' \})$ such

1860: that $H_1 \neq H'$ and $H_2 \neq H'$. But then, from Observation

1861: \ref{obs:expand}, we would have:

1862: \begin{align*}

1863: D_{M \cup \{H' \}} (H_1, H_2) &= D_{M}(H_1, H_2) + g(H', H_1, H_2) \\

1864: &\geq D_{M}(H_1, H_2) + 1\\

1865: &> d.

1866: \end{align*}

1867: Thus, $(H_1, H_2)$ could not have been in $OptPairs(M \cup

1868: \{H'\})$ in the first place, giving us a contradiction. Thus

1869: $OptPairs(M \cup \{ H' \}) \subseteq OptPairs(M, H')$ and hence

1870: $OptPairs(M \cup \{H' \}) = OptPairs(M, H')$, proving the

1871: correctness of subroutine ANCHORED-DFN.
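\\
\\
The two distance facts used in this argument can be written out as
follows; this is a sketch assuming that haplotypes are hole-free and
that $d(\cdot,\cdot)$ is the Hamming distance. For $(H', H_2) \in
OptPairs(M,H')$, Observation \ref{obs:expand} gives
\begin{align*}
D_{M \cup \{H'\}}(H', H_2) & = D_{M}(H', H_2) + g(H', H', H_2)\\
& = d + 0,
\end{align*}
since $d(H',H') = 0$. Conversely, if $H_1 \neq H'$ and $H_2 \neq
H'$, then
\begin{equation}
g(H', H_1, H_2) = \min( d(H',H_1), d(H',H_2) ) \geq 1,\nonumber
\end{equation}
because two distinct hole-free vectors differ in at least one
position.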

1872: %

1873: \end{proof}

1874: \end{document}

1875: %

1876: %

1877: %

1878: %

1879: %

1880: