q-bio0508012/tcbb.tex
1: %
2: \documentclass[12pt,final]{IEEEtran}
3: %
4: \usepackage{makeidx}  % allows for indexgeneration
5: \usepackage{amsfonts}
6: \usepackage{epsfig}
7: \usepackage{amsmath}
8: \usepackage{subfigure}
9: %\usepackage{wrapfig}
10: %\usepackage{boxedminipage}
11: %\usepackage{harvard}
12: \usepackage[dutch,USenglish]{babel}
13: %
14: %
15: \numberwithin{equation}{section} \numberwithin{figure}{section}
16: %
17: \newtheorem{lemma}{Lemma}
18: \newtheorem{observation}{Observation}
19: \newtheorem{definition}{Definition}
20: %
21: \begin{document}
22: %
23: \title{On the Complexity of the Single Individual SNP Haplotyping Problem\thanks{Part of this research has been funded by the Dutch BSIK/BRICKS project.}}
24: %
25: \markboth{On the Complexity of the Single Individual SNP
26: Haplotyping Problem}{Cilibrasi \MakeLowercase{\textit{et al.}}}
27: %
28: \author{Rudi Cilibrasi\thanks{Rudi Cilibrasi is supported in part by NWO project 612.55.002, and by the IST Programme of the European
29: Community, under the PASCAL Network of Excellence,
30: IST-2002-506778. This publication only reflects the authors'
31: views.}, Leo van Iersel, Steven Kelk and John Tromp}
32: %
33: % \institute{Technische Universiteit Eindhoven (TU/e), Den Dolech 2, 5612 AX Eindhoven, Netherlands\\
34: % \email{l.j.j.v.iersel@tue.nl}\\
35: % \and
36: % Centrum voor Wiskunde en Informatica (CWI), Kruislaan 413, 1098 SJ Amsterdam, Netherlands \\
37: % \email{Rudi.Cilibrasi@cwi.nl, S.M.Kelk@cwi.nl, John.Tromp@cwi.nl}\\
38: % }
39: %
40: \maketitle              % typeset the title of the contribution
41: %
42: \begin{abstract}
43: We present several new results pertaining to haplotyping. These
44: results concern the combinatorial problem of reconstructing
45: haplotypes from incomplete and/or imperfectly sequenced haplotype
46: fragments. We consider the complexity of the problems
47: \emph{Minimum Error Correction} (MEC) and \emph{Longest Haplotype
48: Reconstruction} (LHR) for different restrictions on the input
49: data. Specifically, we look at the \emph{gapless} case, where
50: every row of the input corresponds to a gapless
51: haplotype-fragment, and the \emph{1-gap} case, where at most one
52: gap per fragment is allowed. We prove that MEC is APX-hard in the
53: 1-gap case and still NP-hard in the gapless case. In addition, we
54: question earlier claims that MEC is NP-hard even when the input
55: matrix is restricted to being completely binary. Concerning LHR,
56: we show that this problem is NP-hard and APX-hard in the 1-gap
57: case (and thus also in the general case), but is polynomial time
58: solvable in the gapless case.
59: \end{abstract}
60: %
61: %
62: %
63: \begin{keywords}
64: Combinatorial algorithms, Biology and genetics, Complexity
65: hierarchies
66: \end{keywords}
67: %
68: %
69: %
70: \section{Introduction}
71: %
72: If we abstractly consider the human genome as a string over the
73: nucleotide alphabet $\{ A, C, G, T \}$, it is widely known that
74: the genomes of any two humans have the same nucleotide at more than
75: 99\% of the sites. The sites at which variability is observed
76: across the human population are called \emph{Single Nucleotide
77: Polymorphisms} (SNPs), which are formally defined as the sites on
78: the human genome where, across the human population, two or more
79: nucleotides are observed and each such nucleotide occurs in at
80: least 5\% of the population. These sites, which occur (on average)
81: approximately once per thousand bases, capture the bulk of human
82: genetic variability; the string of nucleotides found at the SNP
83: sites of a human - the \emph{haplotype} of that individual - can
84: thus be thought of as a ``fingerprint'' for that individual.\\
85: \\
86: It has been observed that, for most SNP sites, only two
87: nucleotides are seen; sites where three or four nucleotides are
88: found are comparatively rare. Thus, from a combinatorial
89: perspective, a haplotype can be abstractly expressed as a string
90: over the alphabet $\{ 0,1 \}$. Indeed, the biologically-motivated
91: field of SNP and haplotype analysis has spawned a rich variety of
92: combinatorial problems, which are well described in surveys such
93: as \cite{bonizzoni} and \cite{halldorsson}.\\
94: \\
95: We focus on two such combinatorial problems, both variants of the
96: \emph{Single Individual Haplotyping Problem} (SIH), introduced in
97: \cite{lanciabafna}. SIH amounts to determining the haplotype of an
98: individual using (potentially) incomplete and/or imperfect
99: fragments of sequencing data. The situation is further complicated
100: by the fact that, being a \emph{diploid} organism, a human has two
101: versions of each chromosome; one each from the individual's mother
102: and father. Hence, for a given interval of the genome, a human has
103: two haplotypes. Thus, SIH can be more accurately described as
104: finding the two haplotypes of an individual given fragments of
105: sequencing data where the fragments potentially have read errors
106: and, crucially, where it is \emph{not} known which of the two
107: chromosomes each fragment was read from. We consider two
108: well-known variants of the problem: \emph{Minimum Error
109: Correction} (MEC), and \emph{Longest Haplotype Reconstruction}
110: (LHR).\\
111: \\
112: The input to these problems is a matrix $M$ of SNP fragments. Each
113: column of $M$ represents an SNP site and thus each entry of the
114: matrix denotes the (binary) choice of nucleotide seen at that SNP
115: location on that fragment. An entry of the matrix can thus either
116: be `0', `1' or a \emph{hole}, represented by `-', which denotes
117: lack of knowledge or uncertainty about the nucleotide at that
118: site. We use $M[i,j]$ to refer to the value found at row $i$,
119: column $j$ of $M$, and use $M[i]$ to refer to the $i$th row. Two
120: rows $r_1, r_2$ of the matrix \emph{conflict} if there exists a
121: column $j$ such that $M[r_1, j] \neq M[r_2, j]$ and $M[r_1,j],
122: M[r_2, j] \in \{0,1\}$.\\
123: \\
124: A matrix is \emph{feasible} iff the rows of the matrix can be
125: partitioned into two sets such that all rows
126: within each set are pairwise non-conflicting.\\
127: \\
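To illustrate these two definitions (with a small example of our own), consider
the three rows \texttt{00-}, \texttt{-01} and \texttt{11-}. The first two
rows do not conflict: the only column in which both have a non-hole value
is column 2, where both are 0. The third row conflicts with the first
(in column 1) and with the second (in column 2). The matrix is therefore
feasible, by placing the first two rows in one set and the third row in
the other. In contrast, the hole-free rows \texttt{000}, \texttt{011} and
\texttt{111} are pairwise conflicting, so however they are split into two
sets, one set contains two conflicting rows and the matrix is not feasible.\\
\\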
128: The objective in MEC is to ``correct'' (or ``flip'') as few
129: entries of the input matrix as possible (i.e. convert 0 to 1 or
130: vice-versa) to arrive at a feasible matrix. The motivation behind
131: this is that all rows of the input matrix were sequenced from one
132: haplotype or the other, and that any deviation from
133: that haplotype occurred because of read-errors during sequencing.\\
134: \\
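As a small illustration (an example of our own), take the hole-free matrix
with rows \texttt{000}, \texttt{011} and \texttt{111}. Every pair of these
rows differs in at least one column, so any partition into two sets puts
two conflicting rows together; the matrix is not feasible and at least one
flip is needed. Flipping the first entry of \texttt{011} turns that row
into \texttt{111}, after which placing \texttt{000} in one set and the two
copies of \texttt{111} in the other shows feasibility. The optimal MEC
value for this matrix is therefore 1.\\
\\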
135: The problem LHR has the same input as MEC but a different
136: objective. Recall that the rows of a feasible matrix $M$ can be
137: partitioned into two sets such that all rows within each set are
138: pairwise non-conflicting. Having obtained such a partition, we can
139: reconstruct a haplotype from each set by merging all the rows in
140: that set together. (We define this formally later in Section
141: \ref{sec:lhr}.) With LHR the objective is to remove \emph{rows}
142: such that the resulting matrix is feasible and such that the sum
143: of the
144: lengths of the two resulting haplotypes is maximised.\\
145: \\
146: In the context of haplotyping, MEC and LHR have been discussed -
147: sometimes under different names - in papers such as
148: \cite{bonizzoni}, \cite{fasthare}, \cite{greenberg} and
149: (implicitly) \cite{lanciabafna}. One question arising from this
150: discussion is how the distribution of holes in the input data
151: affects computational complexity. To explain, let us first define
152: a \emph{gap} (in a string over the alphabet $\{0,1,-\}$) as a
153: maximal contiguous block of holes that is flanked on both sides by
154: non-hole values. For example, the string \texttt{---0010---} has
155: no gaps, \texttt{-0--10-111} has two gaps, and \texttt{-0-----1--}
156: has one gap. Two special cases of MEC and LHR that are considered
157: to be practically relevant are the ungapped case and the 1-gap
158: case. The ungapped variant is where every row of the input matrix
159: is ungapped, i.e. all holes appear at the start or end. In the
160: 1-gap case every row has at most one gap.\\
161: %
162: In Section \ref{subsec:umec} we offer what we believe is the first
163: proof that Ungapped-MEC (and hence 1-gap MEC and also the general
164: MEC) is NP-hard. We do so by reduction from MAX-CUT. (As far as we
165: are aware, other claims of this result are based explicitly or
166: implicitly on results found in \cite{kleinberg}; as we discuss in
167: Section \ref{subsec:bmec}, we conclude that the results in
168: \cite{kleinberg} cannot be used for this purpose.)\\
169: \\
170: The NP-hardness of 1-gap MEC (and general MEC) follows immediately
171: from the proof that Ungapped-MEC is NP-hard. However, our
172: NP-hardness proof for Ungapped-MEC is not
173: approximation-preserving, and consequently tells us little about
174: the (in)approximability of Ungapped-MEC, 1-gap MEC and general
175: MEC. In light of this we provide (in Section \ref{subsec:gmec}) a
176: proof that 1-gap MEC is APX-hard, thus excluding (unless P=NP) the
177: existence of a \emph{Polynomial Time
178: Approximation Scheme} (PTAS) for 1-gap MEC (and general MEC).\\
179: \\
180: We define (in Section \ref{subsec:bmec}) the problem
181: \emph{Binary-MEC}, where the input matrix contains no holes; as
182: far as we know the complexity of this problem is still -
183: intriguingly - open. We also consider a parameterised version of
184: binary-MEC, where the number of haplotypes is not fixed as two,
185: but is part of the input. We prove that this problem is NP-hard in
186: Section \ref{subsec:pbmec}. (In the Appendix we also prove an
187: ``auxiliary'' lemma which, besides being interesting in its own
188: right, takes on a new significance in light of the open complexity
189: of
190: Binary-MEC.)\\
191: \\
192: In Section \ref{subsec:lhrpoly} we show that \emph{Ungapped-LHR}
193: is polynomial-time solvable and give a dynamic programming
194: algorithm for this which runs in time $O(n^{2}m+n^{3})$ for an $n
195: \times m$ input matrix. This improves upon the result of
196: \cite{lanciabafna}, which also gave a polynomial-time algorithm
197: for Ungapped-LHR but
198: under the restrictive assumption of non-nested input rows.\\
199: \\
200: We also prove, in Section \ref{subsec:lhrhard}, that LHR is
201: APX-hard (and thus also NP-hard) in the general case, by proving
202: the much stronger result that 1-gap LHR is APX-hard. This is the
203: first proof of hardness (for both 1-gap LHR and general LHR)
204: appearing in the literature.\footnote{In \cite{lanciabafna} there
205: is a claim, made very briefly, that LHR is NP-hard in general, but
206: it is not substantiated.}
207: %
208: %
209: %
210: \section{Minimum Error Correction (MEC)}
211: \label{sec:mec}
212: %
213: For a length-$m$ string $X \in \{0,1,-\}^m$, and a length-$m$
214: string $Y \in \{0,1\}^m$, we define $d(X,Y)$ as the number of
215: \emph{mismatches} between the strings, i.e. positions where $X$ is
216: 0 and $Y$ is 1, or vice-versa; holes do not contribute to the
217: mismatch count. Recall the definition of \emph{feasible} from
218: earlier; an alternative, and equivalent, definition (which we use
219: in the following proofs) is as follows. An $n \times m$ SNP matrix
220: $M$ is \emph{feasible} iff there exist two strings (haplotypes)
221: $H_1, H_2 \in \{0,1\}^m$,
222: such that for all rows $r$ of $M$, $d(r, H_1) = 0$ or $d(r, H_2) = 0$.\\
223: \\
224: Finally, a \emph{flip} is where a 0 entry is converted to a 1, or
225: vice-versa. Flipping to or from holes is not allowed and the
226: haplotypes $H_1$ and $H_2$ may not contain holes.
227: %
228: %
229: %
230: \subsection{Ungapped-MEC}
231: \label{subsec:umec}
232: \noindent\textbf{Problem:} \emph{Ungapped-MEC}\\
233: \textbf{Input:} An ungapped SNP matrix $M$\\
234: \textbf{Output:} Ungapped-MEC(M), which we define as the smallest
235: number of flips needed to make $M$ feasible.\footnote{In
236: subsequent problem definitions we regard it as implicit that P(I)
237: represents the optimal output of a problem $P$ on input $I$.}\\
238: %
239: %
240: %
241: \begin{lemma}
242: \label{lem:mechard} Ungapped-MEC is NP-hard.\\
243: \end{lemma}
244: \begin{proof}
245: We give a polynomial-time reduction from MAX-CUT, which is the
246: problem of computing the size of a maximum cardinality cut in a
247: graph.\footnote{The reduction given here can easily be converted
248: into a Karp reduction from the decision version of MAX-CUT to the
249: decision version of Ungapped-MEC.} Let $G=(V,E)$ be the input to
250: MAX-CUT, where $E$ is undirected. (We identify, wlog, $V$ with
251: $\{1, 2,...,|V|\}$.) We construct an input matrix $M$ for
252: Ungapped-MEC with $2k|V| + |E|$ rows and $2|V|$ columns where $k =
253: 2|E||V|$. We use $M_0$ to refer to the first $k|V|$ rows of $M$,
254: $M_1$ to refer to the second $k|V|$ rows of $M$, and $M_G$ to
255: refer to the remaining $|E|$ rows. $M_0$ consists of $|V|$
256: consecutive blocks of $k$ identical rows. Each row in the $i$-th
257: block (for $1 \leq i \leq |V|$) contains a $0$ at columns $2i-1$
258: and $2i$ and holes at all other columns. $M_1$ is defined similarly
259: to $M_0$ with $1$-entries instead of $0$-entries. Each row of
260: $M_G$ encodes an edge from $E$: for edge $\{i,j\}$ (with $i<j$) we
261: specify that columns $2i-1$ and $2i$ contain 0s, columns $2j-1$
262: and $2j$ contain 1s, and for all $h \neq i, j$, column $2h-1$
263: contains 0 and column $2h$ contains 1. (See Figures
264: \ref{fig:mecgraph} and \ref{fig:mecmatrix} for an example of how
265: $M$ is constructed.)\\
266: %
267: %
268: %
269: \begin{figure}
270: \begin{centering}
271: \epsfig{file=./mec.eps} \caption{Example input to MAX-CUT (see
272: Lemma \ref{lem:mechard})} \label{fig:mecgraph}
273: \end{centering}
274: \end{figure}
275: \begin{figure}
276: \begin{centering}
277: \[
278: \begin{tabular}{rl}
279: $\left(
280: \begin{array}{cccccccc}
281: 0 & 0 & - & - & - & - & - & - \\
282: - & - & 0 & 0 & - & - & - & - \\
283: - & - & - & - & 0 & 0 & - & - \\
284: - & - & - & - & - & - & 0 & 0 \\
285: 1 & 1 & - & - & - & - & - & - \\
286: - & - & 1 & 1 & - & - & - & - \\
287: - & - & - & - & 1 & 1 & - & - \\
288: - & - & - & - & - & - & 1 & 1 \\
289: 0 & 0 & 1 & 1 & 0 & 1 & 0 & 1 \\
290: 0 & 0 & 0 & 1 & 1 & 1 & 0 & 1 \\
291: 0 & 0 & 0 & 1 & 0 & 1 & 1 & 1 \\
292: 0 & 1 & 0 & 1 & 0 & 0 & 1 & 1 %
293: \end{array}
294: \right) \hspace{-27pt}$ &
295: \begin{tabular}{l}
296: $\left.
297: \begin{array}{l}
298: \\
299: \\
300: \\
301: \\
302: \\
303: \\
304: \\
305: \\
306: \end{array}
307: \right\} 32$ copies \\
308: $\left.
309: \begin{array}{l}
310: \\
311: \\
312: \\
313: \\
314: \end{array}
315: \right\} M_G $\\
316: \end{tabular}
317: \end{tabular}
318: \]
319: \caption{Construction of matrix $M$ (from Lemma \ref{lem:mechard})
320: for graph in Figure \ref{fig:mecgraph}} \label{fig:mecmatrix}
321: \end{centering}
322: \vspace{-12pt}
323: \end{figure}
324: %
325: %
326: %
327: \\
328: Suppose $t$ is the largest cut possible in $G$ and $s$ is the
329: minimum number of flips needed to make $M$ feasible. We claim that
330: the following holds:
331: \begin{equation}
332: \label{maxcut} s=|E|(|V|-2)+2(|E|-t).
333: \end{equation}
334: From this, $t$ (the optimal value of MAX-CUT) can easily be
335: computed. First, note that the solution to Ungapped-MEC(M) is
336: trivially upper-bounded by $|V||E|$. This follows because we could
337: simply flip every 1 entry in $M_G$ to 0; the resulting overall
338: matrix would be feasible because we could just take $H_1$ as the
339: all-0 string and $H_2$ as the all-1 string. Now, we say a
340: haplotype $H$ has the \emph{double-entry} property if, for all
341: odd-indexed positions (i.e. columns) $j$ in $H$, the entry at
342: position $j$ of $H$ is the same as the entry at position $j+1$. We
343: argue that a minimal number of feasibility-inducing flips will
344: \emph{always} lead to two haplotypes $H_1, H_2$ such that both
345: haplotypes have the double-entry property and, further, $H_1$ is
346: the bitwise complement of $H_2$. (We describe such a pair of
347: haplotypes as \emph{partition-encoding}.) This is because, if
348: $H_1, H_2$ are not partition-encoding, then at least $k > |V||E|$
349: (in contrast with zero) entries in $M_0$ and/or $M_1$ will have to
350: be flipped, meaning this strategy is doomed to begin with.\\
351: \\
352: Now, for a given partition-encoding pair of haplotypes, it follows
353: that - for each row in $M_G$ - we will have to flip either $|V|-2$
354: or $|V|$ entries to reach its nearest haplotype. This is because,
355: irrespective of which haplotype we move a row to, the $|V|-2$
356: pairs of columns \emph{not} encoding end-points (for a given row)
357: will always cost 1 flip each to fix. Then either 2 or 0 of the 4
358: ``endpoint-encoding'' entries will also need to be flipped; 4
359: flips will never be necessary because then the row could move to
360: the other haplotype, requiring no extra flips. Ungapped-MEC thus
361: maximises the number of rows which require $|V|-2$ rather than
362: $|V|$ flips. If we think of $H_1$ and $H_2$ as encoding a
363: partition of the vertices of $V$ (i.e. a vertex $i$ is on one side
364: of the partition if $H_1$ has 1s in columns $2i-1$ and $2i$, and
365: on the other side if $H_2$ has 1s in those columns), it follows
366: that each row requiring $|V|-2$ flips corresponds to a cut-edge in
367: the vertex partition defined by $H_1$ and $H_2$. The expression
368: (\ref{maxcut}) follows.\\
369: \end{proof}
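As a check of (\ref{maxcut}) on the instance of Figure
\ref{fig:mecmatrix} (our own calculation, added for illustration): the
rows of $M_G$ encode the edges $\{1,2\}$, $\{1,3\}$, $\{1,4\}$ and
$\{3,4\}$, so $|V| = |E| = 4$ and $k = 2|E||V| = 32$. Vertices 1, 3 and 4
form a triangle, of which at most two edges can be cut, so $t = 3$,
attained by the partition $\{1\}$, $\{2,3,4\}$. Taking
$H_1 = \texttt{11000000}$ and $H_2 = \texttt{00111111}$, each of the three
rows encoding a cut edge needs $|V|-2 = 2$ flips to reach $H_2$, and the
row encoding $\{3,4\}$ needs $|V| = 4$ flips, giving $s = 10$. This agrees
with (\ref{maxcut}):
\[
s = |E|(|V|-2) + 2(|E|-t) = 4 \cdot 2 + 2 \cdot 1 = 10,
\]
and, solving for $t$,
\[
t = \frac{|V||E| - s}{2} = \frac{16 - 10}{2} = 3.
\]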
370: %
371: %
372: %
373: \subsection{1-gap MEC}
374: \label{subsec:gmec}
375: \noindent\textbf{Problem:} \emph{1-gap MEC}\\
376: \textbf{Input:} SNP matrix $M$ with at most 1 gap per row\\
377: \textbf{Output:} The smallest number of flips needed to make $M$ feasible.\\
378: \\
379: To prove that 1-gap MEC is APX-hard (and therefore also NP-hard)
380: we will give an L-reduction\footnote{An L-reduction is a specific
381: type of \emph{approximation-preserving} reduction, first
382: introduced in \cite{lreduc}. If there exists an L-reduction from a
383: problem X to a problem Y, then a PTAS for Y can be used to build a
384: PTAS for X. Conversely, if there exists an L-reduction from X to
385: Y, and X is APX-hard, so is Y. See (for example) \cite{sched} for
386: a succinct discussion of this.} from CUBIC-MIN-UNCUT, which is the
387: problem of finding the minimum number of edges that have to be
388: removed from a 3-regular graph in order to make it bipartite. Our
389: first goal is thus to prove the APX-hardness of CUBIC-MIN-UNCUT,
390: which itself will be proven using an L-reduction from the APX-hard problem CUBIC-MAX-CUT.\\
391: \\
392: To aid the reader, we reproduce here the definition of an
393: L-reduction.\\
394: \begin{definition}
395: (Papadimitriou and Yannakakis \cite{lreduc}) Let A and B be two
396: optimisation problems. An \emph{L-reduction} from A to B is a pair
397: of functions R and S, both computable in polynomial time, such
398: that for any instance I of A with optimum cost Opt(I), R(I) is an
399: instance of B with optimum cost Opt(R(I)) and for every
400: feasible\footnote{Note that \emph{feasible} in this context has a
401: different meaning to \emph{feasible} in the context of SNP
402: matrices.} solution s of R(I), S(s) is a feasible solution of I
403: such that:
404: \begin{equation}
405: \label{eq:L1}
406: Opt(R(I)) \leq \alpha Opt(I),
407: \end{equation}
408: for some positive constant $\alpha$ and:
409: \begin{equation}
410: \label{eq:L2}
411: |Opt(I) - c(S(s))| \leq \beta |Opt(R(I))-c(s)|,
412: \end{equation}
413: for some positive constant $\beta$, where c(S(s)) and c(s)
414: represent the costs of S(s) and s, respectively.\\
415: \end{definition}
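To see why these two properties preserve approximability (a standard
calculation, sketched here under the assumption that both optima are
positive), divide (\ref{eq:L2}) by $Opt(I)$ and apply (\ref{eq:L1}) in the
form $1/Opt(I) \leq \alpha / Opt(R(I))$:
\begin{align*}
\frac{|Opt(I) - c(S(s))|}{Opt(I)} &\leq \beta \, \frac{|Opt(R(I)) - c(s)|}{Opt(I)}\\
&\leq \alpha\beta \, \frac{|Opt(R(I)) - c(s)|}{Opt(R(I))}.
\end{align*}
Thus a solution $s$ of $R(I)$ with relative error $\epsilon$ is mapped to
a solution $S(s)$ of $I$ with relative error at most
$\alpha\beta\epsilon$.\\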
416: \begin{observation}
417: \label{obs:mincuthard} CUBIC-MIN-UNCUT is APX-hard.\\
418: \end{observation}
419: \begin{proof}
420: We give an L-reduction from CUBIC-MAX-CUT, the problem of finding
421: the maximum cardinality of a cut in a 3-regular graph. (This
422: problem is shown to be APX-hard in \cite{alimontikann}; see also
423: \cite{bermankarpinski}.) Let $G=(V,E)$
424: be the input to CUBIC-MAX-CUT.\\
425: \\
426: Note that CUBIC-MIN-UNCUT is the ``complement'' of CUBIC-MAX-CUT,
427: as expressed by the following relationship:
428: %
429: \begin{equation}
430: \begin{array}{l}
431: \label{eq:duality} \text{\emph{CUBIC-MAX-CUT(G)}}\\
432: = |E| - \text{\emph{CUBIC-MIN-UNCUT(G)}}.
433: \end{array}
434: \end{equation}
435: %
436: To see why this holds, note that for every cut $C$, the removal of
437: the edges $E \setminus C$ will lead to a bipartite graph. On the
438: other hand, given a set of edges $E'$ whose removal makes $G$
439: bipartite, the complement is not necessarily a cut. However, given
440: a bipartition induced by the removal of $E'$, the edges from the
441: original graph that cross this bipartition form a cut $C'$, such
442: that $|C'| \geq |E \setminus E'|$. This proves (\ref{eq:duality}),
443: and the mapping (just described) from $E'$ to $C'$ is the mapping
444: we use in the L-reduction.\\
445: \\
446: Now, note that property (\ref{eq:L1}) of the L-reduction is easily
447: satisfied (taking $\alpha=1$) because the optimal value of
448: CUBIC-MIN-UNCUT is always less than or equal to the optimal value
449: of CUBIC-MAX-CUT. This follows from the combination of
450: (\ref{eq:duality}) with the fact that a maximum cut in a 3-regular
451: graph always contains at least $2/3$ of the edges: if a vertex has
452: fewer than two incident edges in the cut then we can get a larger
453: cut by moving this vertex to the other side of the partition.\\
454: \\
455: To see that property (\ref{eq:L2}) of the L-reduction is easily
456: satisfied (taking $\beta = 1$), let $E'$ be any set of edges whose
457: removal makes $G$ bipartite. Property (\ref{eq:L2}) is satisfied
458: because $E'$ gets mapped to a cut $C'$, as defined above, and
459: combined with (\ref{eq:duality}) this gives:
460: \begin{equation}
461: \begin{array}{l}
462: \text{\emph{CUBIC-MAX-CUT(G)}} - |C'|\\
463: \leq \text{\emph{CUBIC-MAX-CUT(G)}} - |E \setminus E'| \\
464: = |E'| - \text{\emph{CUBIC-MIN-UNCUT(G)}}.
465: \end{array}
466: \end{equation}
467: %
468: This completes the L-reduction from CUBIC-MAX-CUT to
469: CUBIC-MIN-UNCUT, proving the APX-hardness of CUBIC-MIN-UNCUT.\\
470: \end{proof}
471: %
472: We also need the following observation.\\
473: %
474: \begin{observation}
475: \label{obs:orient} Let $G = (V,E)$ be an undirected, 3-regular
476: graph. Then we can find, in polynomial time, an orientation of the
477: edges of $G$ so that each vertex has either in-degree 2 and
478: out-degree 1 (``in-in-out'') or
479: out-degree 2 and in-degree 1 (``out-out-in'').\\
480: \end{observation}
481: \begin{proof}
482: (We assume that $G$ is connected; if $G$ is not connected, we can
483: apply the following argument to each component of $G$ in turn, and
484: the overall result still holds.) Every cubic graph has an even
485: number of vertices, because every graph must have an even number
486: of odd-degree vertices. We add an arbitrary perfect matching to
487: the graph, which may create multiple edges. The graph is now
488: 4-regular and therefore has an Euler tour. We direct the edges
489: following the Euler-tour; every vertex is now in-in-out-out. If we
490: remove the perfect matching edges we added, we are left with an
491: oriented version of $G$ where every vertex is in-in-out or
492: out-out-in. This can all be done in polynomial time.\\
493: \end{proof}
494: %
495: \begin{lemma}
496: \label{apxhard} 1-gap MEC is APX-hard.\\
497: \end{lemma}
498: \begin{proof}
499: We give a reduction from CUBIC-MIN-UNCUT. Consider an arbitrary
500: 3-regular graph $G = (V,E)$ and orient the edges as described in
501: Observation \ref{obs:orient} to obtain an oriented version of $G$,
502: $\overrightarrow{G} = (V, \overrightarrow{E})$, where each vertex
503: is either in-in-out or out-out-in. We construct an $|E| \times
504: |V|$ input matrix $M$ for 1-gap MEC as follows. The columns of $M$
505: correspond to the vertices of $\overrightarrow{G}$ and every row
506: of $M$ encodes an oriented edge of $\overrightarrow{G}$; it has a
507: $0$ in the column corresponding to the tail of the edge (i.e. the
508: vertex from which the edge leaves), a $1$ in the column
509: corresponding to the head of the edge,
510: and the rest holes.\\
511: \\
512: We prove the following:
513: \begin{equation}
514: \label{eq:uncutmec} \text{\emph{CUBIC-MIN-UNCUT(G)}} =
515: \text{\emph{1-gap MEC(M)}}.
516: \end{equation}
517: %
518: We first prove that:
519: \begin{equation}
520: \label{eq:uncutmec2} \text{\emph{1-gap MEC(M)}} \leq
521: \text{\emph{CUBIC-MIN-UNCUT(G)}}.
522: \end{equation}
523: To see this, let $E'$ be a minimum-size set of edges whose removal
524: makes $G$ bipartite, and let $|E'| = k$. Let $B = (L \cup R, E
525: \setminus E')$ be the bipartite graph (with bipartition $L \cup
526: R$) obtained from $G$ by removing the edges $E'$. Let $H_1$
527: (respectively, $H_2$) be the haplotype that has 1s in the columns
528: representing vertices of $L$ (respectively, $R$) and 0s elsewhere.
529: It is possible to make $M$ feasible with $k$ flips, by the
530: following process: for each edge in $E'$, flip the 0 bit in the
531: corresponding row of $M$ to 1. For each row $r$ of $M$ it is now
532: true that $d(r, H_1) = 0$ or $d(r,H_2) = 0$, proving the feasibility of $M$.\\
533: \\
534: The proof that
535: \begin{equation}
536: \label{eq:uncutmec3} \text{\emph{CUBIC-MIN-UNCUT(G)}}\leq
537: \text{\emph{1-gap MEC(M)}},
538: \end{equation}
539: is more subtle. Suppose we can render $M$ feasible using $j$
540: flips, and let $H_1$ and $H_2$ be any two haplotypes such that,
541: after the $j$ flips, each row of $M$ is distance 0 from either
542: $H_1$ or $H_2$. If $H_1$ and $H_2$ are bitwise complementary then
543: we can make $G$ bipartite by removing an edge whenever we had to
544: flip a bit in the corresponding row. The idea is that the
545: 1s in $H_1$ (respectively, $H_2$) represent the vertices $L$
546: (respectively, $R$) in the resulting bipartition $L \cup R$.\\
547: \\
548: However, suppose the two haplotypes $H_1$ and $H_2$ are not
549: bitwise complementary. In this case it is sufficient to
550: demonstrate that there also exist bitwise complementary
551: haplotypes $H'_1$ and $H'_2$ such that, after $j$ (or fewer)
552: flips, every row of $M$ is distance 0 from either $H'_1$ or
553: $H'_2$. Consider thus a column of $H_1$ and $H_2$ where the two
554: haplotypes are not complementary. Crucially, the orientation of
555: $\overrightarrow{G}$ ensures that every column of $M$ contains
556: \emph{either} one $1$ and two $0$s \emph{or} two $1$s and one $0$
557: (and the rest holes). A simple case analysis shows that, because
558: of this, we can always change the value of one of the haplotypes
559: in that column, without increasing the number of flips. (The
560: number of flips might decrease.) Repeating this process for all
561: columns of $H_1$ and $H_2$ where the same value is observed thus
562: creates complementary haplotypes $H'_1$ and $H'_2$, and - as
563: described in the previous paragraph - these haplotypes then
564: determine which edges of $G$ should be removed to make $G$
565: bipartite. This completes the proof of (\ref{eq:uncutmec}).\\
566: \\
567: The above reduction can be computed in polynomial time and is an
568: L-reduction. From (\ref{eq:uncutmec}) it follows directly that
569: property (\ref{eq:L1}) of an L-reduction is satisfied with
570: $\alpha=1$. Property (\ref{eq:L2}), with $\beta=1$, follows from
571: the proof of (\ref{eq:uncutmec3}), combined with
572: (\ref{eq:uncutmec}). Namely, whenever we use (say) $t$ flips to
573: make $M$ feasible, we can find $s \leq t$ edges of $G$ that can be
574: removed to make $G$ bipartite. Combined with (\ref{eq:uncutmec})
575: this gives:
576: \begin{equation}
577: \begin{array}{l}
578: |\text{\emph{CUBIC-MIN-UNCUT(G)}} - s |\\
579: \leq | \text{\emph{1-gap MEC(M)}} - t |.
580: \end{array}
581: \end{equation}
582: \end{proof}
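To illustrate the construction (an example of our own; the labelling and
orientation below are one arbitrary valid choice), let $G$ be the complete
graph on vertices $1,\ldots,4$, oriented as $1 \to 2$, $2 \to 3$,
$3 \to 1$, $4 \to 1$, $4 \to 2$, $3 \to 4$. Vertices 1 and 2 are
in-in-out, vertices 3 and 4 are out-out-in, and the rows of $M$ (one per
edge, in the order listed) are:
\[
\left(
\begin{array}{cccc}
0 & 1 & - & - \\
- & 0 & 1 & - \\
1 & - & 0 & - \\
1 & - & - & 0 \\
- & 1 & - & 0 \\
- & - & 0 & 1
\end{array}
\right)
\]
Every row has at most one gap, and every column contains either one 0 and
two 1s or two 0s and one 1. A maximum cut of $G$ has 4 edges, so
\emph{CUBIC-MIN-UNCUT(G)} $= 6 - 4 = 2$ by (\ref{eq:duality}).
Correspondingly, with $L = \{1,2\}$, $R = \{3,4\}$,
$H_1 = \texttt{1100}$ and $H_2 = \texttt{0011}$, flipping the 0 in the
first row and in the last row makes those rows agree with $H_1$ and $H_2$
respectively, while the four remaining rows already agree with $H_1$ or
$H_2$. Hence 2 flips suffice, in line with (\ref{eq:uncutmec}).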
583: %
584: %
585: %
586: \subsection{Binary-MEC}
587: \label{subsec:bmec}
588: %
589: From a mathematical point of view it is interesting to determine
590: whether MEC stays NP-hard when the input matrix is
591: further restricted. Let us therefore define the following problem.\\
592: \\
593: \textbf{Problem:} \emph{Binary-MEC}\\
594: \textbf{Input:} An SNP matrix $M$ that does not contain any holes\\
595: \textbf{Output:} As for Ungapped-MEC\\
596: \\
597: Like all optimisation problems, the problem Binary-MEC has
598: different variants, depending on how the problem is defined. The
599: above definition is technically speaking the \emph{evaluation}
600: variant of the Binary-MEC problem.\footnote{See \cite{ausiello}
601: for a more detailed explanation of terminology in this area.}
602: Consider the closely-related \emph{constructive} version:\\
603: \\
604: \textbf{Problem:} \emph{Binary-Constructive-MEC}\\
605: \textbf{Input:} An SNP matrix $M$ that does not contain any holes\\
606: \textbf{Output:} For an input matrix $M$ of size $n \times m$, two
607: haplotypes $H_1, H_2 \in \{0,1\}^m$ minimizing:
608: \begin{equation}
609: \label{eq:witsum} D_M(H_1, H_2) = \sum_{\text{rows r of M}} \min(
610: d(r,H_1), d(r, H_2) ).
611: \end{equation}
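For instance (an illustration of our own), for the matrix $M$ with rows
\texttt{000}, \texttt{011} and \texttt{111}, the choice
$H_1 = \texttt{000}$, $H_2 = \texttt{111}$ gives
\[
D_M(H_1, H_2) = \min(0,3) + \min(2,1) + \min(3,0) = 1,
\]
which is optimal: $M$ has three distinct hole-free rows, so no pair of
haplotypes can achieve $D_M = 0$.
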
612: In the appendix, we prove that Binary-Constructive-MEC is
613: polynomial-time Turing interreducible with its evaluation
614: counterpart, Binary-MEC. This proves that Binary-Constructive-MEC
615: is solvable in polynomial-time iff Binary-MEC is solvable in
616: polynomial-time. We mention this correspondence because, when
617: expressed as a constructive problem, it can be seen that MEC is in
618: fact a specific type of \emph{clustering} problem, a topic of
619: intensive study in the literature. More specifically, we are
620: trying to find two representative ``median'' (or ``consensus'')
621: strings such that the sum, over all input strings, of the distance
622: between each input string and its nearest median, is minimised.
623: This interreducibility is potentially useful because we now argue,
624: in contrast to claims in the existing literature, that the
625: complexity
626: of Binary-MEC / Binary-Constructive-MEC is actually still open.\\
627: \\
628: To elaborate, it is claimed in several papers (e.g. \cite{alon})
629: that a problem equivalent to Binary-Constructive-MEC is NP-hard.
630: Such claims inevitably refer to the seminal paper
631: \emph{Segmentation Problems} by Kleinberg, Papadimitriou, and
632: Raghavan (KPR), which has appeared in multiple different forms
633: since 1998 (e.g. \cite{kleinberg}, \cite{kleinbergEco} and
634: \cite{kleinberg2004}). However, the KPR papers actually discuss
635: two superficially similar, but essentially different, problems:
636: one problem is essentially equivalent to Binary-Constructive-MEC,
637: and the other is a more general (and thus, potentially, a more
638: difficult) problem.\footnote{In this more general problem, rows
639: and haplotypes are viewed as vectors and the distance between a
640: row and a haplotype is their dot product. Further, unlike
641: Binary-Constructive-MEC, this problem allows entries of the input
642: matrix to be drawn arbitrarily from $\mathbb{R}$. This extra
643: degree of freedom - particularly the ability to simultaneously use
644: positive, negative and zero values in the input matrix - is what
645: (when coupled with a dot product distance measure) provides the
646: ability to encode NP-hard problems.} Communication with the
647: authors \cite{christos} has confirmed that they have no proof of
648: hardness for the former problem i.e. the problem that is
649: essentially equivalent to Binary-Constructive-MEC.\\
650: \\
651: Thus we conclude that the complexity of Binary-Constructive-MEC /
652: Binary-MEC is still open. From an approximation viewpoint the
653: problem has been quite well-studied; the problem has a
654: \emph{Polynomial Time Approximation Scheme} (PTAS) because it is a
655: special form of the \emph{Hamming 2-Median Clustering Problem},
656: for which a PTAS is demonstrated in \cite{li}. Other approximation
657: results appear in \cite{kleinberg}, \cite{alon},
658: \cite{kleinberg2004}, \cite{geometric} and a heuristic for a
659: similar (but not identical) problem appears in \cite{fasthare}. We
660: also know that, if the number of haplotypes to be found is
661: specified as part of the input (and not fixed as 2), the problem
662: becomes NP-hard; we prove this in the following section. Finally,
663: it may also be relevant that the ``geometric'' version of the
664: problem (where rows of the input matrix are not drawn from
665: $\{0,1\}^m$ but from $\mathbb{R}^{m}$, and Euclidean distance is
666: used instead of Hamming distance) is also open from a complexity
667: viewpoint \cite{geometric}. (However, the version using
668: Euclidean-distance-squared \emph{is} known to be NP-hard
669: \cite{drineas}.)
670: %
671: %
672: %
673: \subsection{Parameterised Binary-MEC}
674: \label{subsec:pbmec}
675: %
676: Let us now consider a generalisation of the problem Binary-MEC,
677: where the number of haplotypes is not fixed as two, but part of
678: the input.\\
679: \\
680: \textbf{Problem:} \emph{Parameterised-Binary-MEC (PBMEC)}\\
681: \textbf{Input:} An SNP matrix $M$ that contains no holes, and $k \in \mathbb{N} \setminus \{0\}$\\
682: \textbf{Output:} The smallest number of flips needed to make $M$
683: feasible under $k$ haplotypes.\\
684: %
685: The notion of \emph{feasible} generalises easily to $k \geq 1$
686: haplotypes: an SNP matrix $M$ is \emph{feasible} under $k$
687: haplotypes if $M$ can be partitioned into $k$ segments such that
688: all the rows within each segment are pairwise non-conflicting. The
689: definition of $D_{M}$ also generalises easily to $k$ haplotypes;
we define $D_{M, k}(H_1, H_2, \ldots, H_k)$ as:
%
\begin{equation}
\label{eq:kmedsum} \sum_{\text{rows } r \text{ of } M} \min(d(r,H_1), d(r,
H_2), \ldots, d(r,H_k) ).
695: \end{equation}
696: %
697: We define $OptTuples(M,k)$ as the set of unordered optimal
698: $k$-tuples of haplotypes for $M$ i.e. those $k$-tuples of
699: haplotypes which have a $D_{M,k}$ score equal to PBMEC$(M,k)$.\\
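As a small illustration of (\ref{eq:kmedsum}), consider the following hypothetical instance (the matrix and haplotypes are chosen here purely for illustration, and $d$ is taken to be the Hamming distance, which is what it reduces to when there are no holes). Let $M$ have rows \texttt{1100}, \texttt{1000}, \texttt{0011} and \texttt{0111}, let $k=2$, and take $H_1 = \texttt{1100}$ and $H_2 = \texttt{0011}$. Then:
\[
D_{M,2}(H_1,H_2) = \min(0,4) + \min(1,3) + \min(4,0) + \min(3,1) = 2.
\]
Since the four rows are distinct, at most two of them can be at distance 0 from a haplotype in any pair, so no pair scores below 2 and PBMEC$(M,2) = 2$ for this instance.\\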
700: %
701: \begin{lemma}
702: \label{lem:pbmechard}
PBMEC is NP-hard.\\
704: \end{lemma}
705: %
706: \begin{proof}
707: We reduce from the NP-hard problem MINIMUM-VERTEX-COVER. Let
708: $G=(V,E)$ be an undirected graph. A subset $V' \subseteq V$ is
709: said to \emph{cover} an edge $(u,v) \in E$ iff $u \in V'$ or $v
710: \in V'$. A \emph{vertex cover} of an undirected graph $G = (V,E)$
711: is a subset $U$ of the vertices such that every edge in $E$ is
712: covered by $U$. MINIMUM-VERTEX-COVER is the problem of, given a
713: graph $G$, computing the size of
714: a minimum cardinality vertex cover $U$ of $G$.\\
715: \\
716: Let $G = (V,E)$ be the input to MINIMUM-VERTEX-COVER. We construct
717: an SNP matrix $M$ as follows. $M$ has $|V|$ columns and
718: $3|E||V|+|E|$ rows. We name the first $3|E||V|$ rows $M_0$ and the
719: remaining $|E|$ rows $M_{G}$. $M_0$ is the matrix obtained by
720: taking the $|V| \times |V|$ identity matrix (i.e. 1s on the
721: diagonal, 0s everywhere else) and making $3|E|$ copies of each
722: row. Each row in $M_G$ encodes an edge of $G$: the row has
723: 1-entries at the endpoints of the edge, and the rest of the row is
0. We argue shortly that, to compute the size of the smallest
vertex cover in $G$, it suffices to call PBMEC($M,k$) for increasing
values of $k$ (starting with $k=2$) until we first encounter a $k$ such
that:
728: \begin{equation}
729: \label{eq:kmed} PBMEC(M,k) = 3|E|(|V|-(k-1)) + |E|.
730: \end{equation}
731: Once the smallest such $k$ has been found, we can output that the
732: size of the smallest vertex cover in $G$ is $k-1$. (Actually, if
733: we haven't yet found a value $k < |V|-2$ satisfying the above
734: equation, we can check by brute force in polynomial-time whether
735: $G$ has a vertex cover of size $|V|-3$, $|V|-2$, $|V|-1$, or
736: $|V|$. The reason for wanting to ensure that PBMEC($M,k$) is not
737: called with $k \geq |V|-2$ is explained later in the
738: analysis.\footnote{Note that, should we wish to build a Karp
739: reduction from the decision version of MINIMUM-VERTEX-COVER to the
740: decision version of PBMEC, it is not a problem to make this brute
741: force checking fit into the framework of a Karp reduction. The
742: Karp reduction can do the brute force checking itself and use
743: trivial inputs to the decision
744: version of PBMEC to communicate its ``yes'' or ``no'' answer.})\\
745: \\
746: It remains only to prove that (for $k < |V|-2$)
747: (\ref{eq:kmed}) holds iff $G$ has a vertex cover of size $k-1$.\\
748: \\
749: To prove this we need to first analyze $OptTuples(M_{0},k)$.
750: Recall that $M_0$ was obtained by duplicating the rows of the $|V|
751: \times |V|$ identity matrix. Let $I_{|V|}$ be shorthand for the
752: $|V| \times |V|$ identity matrix. Given that $M_0$ is simply a
753: ``scaled up'' version of $I_{|V|}$, it follows that:
754: \begin{equation}
755: OptTuples(M_0,k) = OptTuples(I_{|V|},k).
756: \end{equation}
757: Now, we argue that all the $k$-tuples in $OptTuples(I_{|V|},k)$
758: (for $k < |V|-2$) have the following form: one haplotype from the
759: tuple contains only 0s, and the remaining $k-1$ haplotypes from
760: the tuple each have precisely one entry set to 1. Let us name such
761: a $k$-tuple
762: a \emph{candidate} tuple.\\
763: %
764: \begin{figure}
765: \begin{centering}
766: \epsfig{file=./mec.eps}
767: \caption{Example input graph to
768: MINIMUM-VERTEX-COVER (see Lemma \ref{lem:pbmechard})}
769: \label{fig:pbmecgraph}
770: \end{centering}
771: \end{figure}
772: %
773: \begin{figure}
774: \begin{centering}
775: \[
776: \begin{tabular}{ll}
777: $\left(
778: \begin{array}{cccc}
779: 1 & 0 & 0 & 0 \\
780: 0 & 1 & 0 & 0 \\
781: 0 & 0 & 1 & 0 \\
782: 0 & 0 & 0 & 1 \\
783: 1 & 1 & 0 & 0 \\
784: 1 & 0 & 1 & 0 \\
785: 1 & 0 & 0 & 1 \\
786: 0 & 0 & 1 & 1 \\
787: \end{array}
788: \right)$ \hspace{-28pt} &
789: \begin{tabular}{l}
790: $\left.
791: \begin{array}{c}
792: \\
793: \\
794: \\
795: \\
796: \end{array}
797: \right\} 12$ copies \\
798: $\left.
799: \begin{array}{c}
800: \\
801: \\
802: \\
803: \\
804: \end{array}
805: \right\} M_G $\\
806: \end{tabular}
807: \end{tabular}
808: \]
809: \end{centering}
810: \caption{Construction of matrix $M$ for graph from Figure
811: \ref{fig:pbmecgraph}}
812: \end{figure}
813: %
814: %
815: \\
816: First, note that $PBMEC(I_{|V|},k) \leq |V|-(k-1)$, because
817: $|V|-(k-1)$ is the value of the $D$ measure - defined in
818: (\ref{eq:kmedsum}) - under any candidate tuple. Secondly, under an
819: arbitrary $k$-tuple there can be at most $k$ rows of $I_{|V|}$
820: which contribute 0 to the $D$ measure. However, if precisely $k$
821: rows of $I_{|V|}$ contribute 0 to the $D$ measure (i.e. every
822: haplotype has precisely one entry set to 1, and the haplotypes are
823: all distinct) then there are $|V|-k$ rows which each contribute 2
824: to the $D$ measure; such a $k$-tuple cannot be optimal because it
825: has a $D$ measure of $2(|V|-k) > |V|-(k-1)$. So we reason that at
826: most $k-1$ rows contribute 0 to the $D$ measure. In fact,
827: \emph{precisely} $k-1$ rows must contribute 0 to the $D$ measure
828: because, otherwise, there would be at least $|V|-(k-2)$ rows
829: contributing at least 1, and this is not possible because
830: $PBMEC(I_{|V|},k) \leq |V|-(k-1)$. So $k-1$ of the haplotypes
831: correspond to rows of $I_{|V|}$, and the remaining $|V|-(k-1)$
832: rows of $I_{|V|}$ must each contribute 1 to the $D$ measure. But
833: the only way to do this (given that $|V|-(k-1) > 2$) is to make
834: the $k$th haplotype the haplotype where every entry is 0. Hence:
835: \begin{equation}
836: PBMEC(I_{|V|},k) = |V|-(k-1)
837: \end{equation}
838: and:
839: \begin{equation}
840: PBMEC(M_0,k) = 3|E|(|V|-(k-1)).
841: \end{equation}
842: $OptTuples(I_{|V|},k)$ ($= OptTuples(M_0,k)$) is, by extension,
843: precisely the set of candidate $k$-tuples.\\
844: \\
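For example (with values chosen here only for illustration), take $|V| = 6$ and $k=3$, so that $k < |V|-2$. A candidate tuple such as $\{\texttt{000000}, \texttt{100000}, \texttt{010000}\}$ matches two rows of $I_{6}$ exactly and leaves the other four rows each at distance 1 from the all-0 haplotype, giving a $D$ measure of $4 = |V|-(k-1)$. By contrast, a tuple of three distinct rows of $I_{6}$ matches three rows exactly but leaves three rows each at distance 2, giving a $D$ measure of $2(|V|-k) = 6$.\\
\\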
845: The next step is to observe that $OptTuples(M,k) \subseteq
846: OptTuples(M_0,k)$. To see this, suppose (by way of contradiction)
847: that it is not true, and there exists a $k$-tuple $H^{*} \in
848: OptTuples(M,k)$ that is not in $OptTuples(M_0,k)$. But then
849: replacing $H^{*}$ by any $k$-tuple out of $OptTuples(M_0,k)$ would
850: reduce the number of flips needed in $M_0$ by at least $3|E|$, in
851: contrast to an increase in the number of flips needed in $M_{G}$
852: of at most $2|E|$, thus leading to an overall reduction in the
853: number of flips; contradiction! (The $2|E|$ figure is the number
854: of flips
855: required to make all rows in $M_G$ equal to the all-0 haplotype.)\\
856: \\
857: Because $OptTuples(M,k) \subseteq OptTuples(M_0,k)$, we can
858: restrict our attention to the $k$-tuples in $OptTuples(M_0,k)$.
859: Observe that there is a natural 1-1 correspondence between the
860: elements of $OptTuples(M_0,k)$ and all size $k-1$ subsets of $V$:
861: a vertex $v \in V$ is in the subset corresponding to $H^{*} \in
862: OptTuples(M_0,k)$ iff one of the haplotypes in $H^{*}$ has a 1 in
863: the column corresponding to vertex $v$.\\
864: \\
865: Now, for a $k$-tuple $H^{*} \in OptTuples(M_0,k)$ we let $Cov( G,
866: H^{*} )$ be the set of edges in $G$ which are covered by the
867: subset of $V$ corresponding to $H^{*}$. (Thus, $|Cov(G,
868: H^{*})|=|E|$ iff $H^{*}$ represents a vertex cover of $G$.) It is
869: easy to check that, for $H^{*} \in OptTuples(M_0,k)$:
870: \begin{equation}
871: \begin{array}{lll}
872: D_{M,k}( H^{*} )&{}={}&3|E|(|V|-(k-1))\\
873: &&{+}\:|Cov(G, H^{*})|\\
&&{+}\:2( |E| - |Cov(G,H^{*})| ) \\
875: &{}={}&3|E|(|V|-(k-1))\\
876: &&{+}\:2|E| - |Cov(G, H^{*})|.\nonumber
877: \end{array}
878: \end{equation}
879: Hence, for $H^{*} \in OptTuples(M_0,k)$, $D_{M,k}(H^{*})$ equals
880: $3|E|(|V|-(k-1)) + |E|$ iff $H^{*}$ represents a size $k-1$ vertex
881: cover of $G$.\\
882: \end{proof}
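To illustrate the final identity in the proof, consider (as a sketch) the construction applied to the graph of Figure \ref{fig:pbmecgraph}. We read the four edges off the rows of $M_G$ in the accompanying matrix figure and assume, for this illustration, that column $i$ corresponds to a vertex labelled $v_i$; this gives $E = \{(v_1,v_2),(v_1,v_3),(v_1,v_4),(v_3,v_4)\}$, so $|V|=|E|=4$ and $M_0$ contains $3|E| = 12$ copies of each row of $I_{4}$. This graph is too small to satisfy $k<|V|-2$, so the example only shows how $D_{M,k}$ is evaluated on candidate tuples, not the optimality argument. For $k=3$, the candidate tuple $H^{*} = \{\texttt{0000}, \texttt{1000}, \texttt{0010}\}$ corresponds to the subset $\{v_1,v_3\}$, which covers all four edges, and:
\[
D_{M,3}(H^{*}) = 12 \cdot 2 + 4 + 2(4-4) = 28 = 3|E|(|V|-(k-1)) + |E|.
\]
The candidate tuple $\{\texttt{0000}, \texttt{1000}, \texttt{0100}\}$ corresponds instead to $\{v_1,v_2\}$, which leaves the edge $(v_3,v_4)$ uncovered, and its score is $12 \cdot 2 + 3 + 2(4-3) = 29$.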
883: %
884: %
885: %
886: \section{Longest Haplotype Reconstruction (LHR)}
887: \label{sec:lhr} \setcounter{equation}{0} Suppose an SNP matrix $M$
888: is feasible. Then we can partition the rows of $M$ into two sets,
889: $M_l$ and $M_r$, such that the rows within each set are pairwise
890: non-conflicting. (The partition might not be unique.) From $M_i$
891: ($i \in \{l,r\}$) we can then build a haplotype $H_i$ by combining
892: the rows of $M_i$ as follows: The $j$th column of $H_i$ is set to
893: 1 if at least one row from $M_i$ has a 1 in column $j$, is set to
894: 0 if at least one row from $M_i$ has a 0 in column $j$, and is set
895: to a hole if all rows in $M_i$ have a hole in column $j$. Note
896: that, in contrast to MEC, this leads to haplotypes that
897: potentially contain holes. For example, suppose one side of the
898: partition contains rows \texttt{10--, -0--} and \texttt{---1};
899: then the haplotype we get from this is \texttt{10-1}. We define
900: the \emph{length} of a haplotype $H$, denoted as $|H|$, as the
901: number of positions where it does not contain a hole; the
902: haplotype \texttt{10-1} thus has length 3, for example. Now, the
903: objective with LHR is to remove \emph{rows} from $M$ to make it
904: feasible but also such that the sum of the lengths of the two
905: resulting haplotypes is maximised. We define the function LHR(M)
906: (which gives a natural number as output) as the largest value this
907: sum-of-lengths value can take,
908: ranging over all feasibility-inducing row-removals and subsequent partitions.\\
909: \\
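As a small hypothetical example (constructed here for illustration), let $M$ have rows $r_1 = \texttt{11-}$, $r_2 = \texttt{-00}$ and $r_3 = \texttt{001}$. Every pair of rows conflicts ($r_1$ and $r_2$ in column 2, $r_2$ and $r_3$ in column 3, $r_1$ and $r_3$ in column 1), so $M$ is not feasible and at least one row must be removed. Removing $r_3$ leaves the haplotypes \texttt{11-} and \texttt{-00}, with total length $2+2=4$, whereas removing $r_1$ (or $r_2$) leaves \texttt{001} together with a haplotype of length 2, with total length $3+2=5$. Removing further rows cannot increase the total, so LHR($M$) $=5$ for this instance.\\
\\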
910: In Section \ref{subsec:lhrpoly} we provide a polynomial-time
911: dynamic programming algorithm for the ungapped variant of LHR,
912: Ungapped-LHR. In Section \ref{subsec:lhrhard} we show that LHR
913: becomes APX-hard and NP-hard when at most one gap per input row is
914: allowed, automatically also proving the hardness of LHR in the
915: general case.
916: %
917: \subsection{A polynomial-time algorithm for Ungapped-LHR}
918: \label{subsec:lhrpoly}
919: %
920: \noindent\textbf{Problem:} \emph{Ungapped-LHR}\\
921: \textbf{Input: } An ungapped SNP matrix $M$\\
922: \textbf{Output: } The value LHR(M), as defined above\\
923: \\
924: The LHR problem for ungapped matrices was proved to be
polynomial-time solvable by Lancia et al.\ in \cite{lanciabafna},
926: but only with the genuine restriction that no fragments are
927: included in other fragments. Our algorithm improves this in the
928: sense that it works for all ungapped input matrices; our algorithm
929: is similar in style to the algorithm that solves
MFR\footnote{Minimum Fragment Removal: in this problem the
objective is not to maximise the length of the haplotypes, but to
minimise the number of rows removed.} in the ungapped case by Bafna
et al.\ in \cite{bafna2005}. Note that our dynamic-programming
934: algorithm computes Ungapped-LHR(M) but it can easily be adapted to
935: generate the rows that must be removed (and subsequently, the
936: partition that must be made) to achieve this value.\\
937: %
938: \begin{lemma}
Ungapped-LHR can be solved in time $O(n^{2}m + n^{3})$.\\
940: \end{lemma}
941: \begin{proof}
942: Let $M$ be the input to Ungapped-LHR, and assume the matrix has
943: size $n \times m$. For row $i$ define $l(i)$ as the leftmost
944: column that is not a hole and define $r(i)$ as the rightmost
945: column that is not a hole. The rows of $M$ are ordered such that
946: $l(i)\leq l(j)$ if $i<j$. Define the matrix $M_{i}$ as the matrix
947: consisting of the first $i$ rows of $M$ and two extra rows at the
948: top: row $0$ and row $-1$, both consisting of all holes. Define
949: $W(i)$ as the set of rows $j<i$ that are not in conflict with row
950: $i$.\\
951: \\
952: For $h,k\leq i$ and $h,k\geq -1$ and $r(h)\leq r(k)$ define
953: $D[h,k;i]$ as the maximum sum of lengths of two haplotypes such
954: that:
955: \begin{itemize}
956: \item each haplotype is built up as a combination of rows from
957: $M_i$ (in the sense explained above);
958: \item each row from $M_{i}$
959: can be used to build at most one haplotype (i.e. it cannot be used
960: for both haplotypes);
961: \item row $k$ is one of the rows used to build a haplotype and among such rows maximises $%
962: r(\cdot )$; \item row $h$ is one of the rows used to build the
963: haplotype for which $k$ is not used and among such rows maximises
964: $r(\cdot )$.\\
965: \end{itemize}
966: %
967: The optimal solution of the problem, $LHR(M)$, is given by:
968: %
969: \begin{equation}
970: \max_{h,k|r(h)\leq r(k)}D[h,k;n].
971: \end{equation}
972: %
This optimal solution can be calculated by starting with
$D[h,k;0]=0$ for $h,k\in \{-1,0\}$ and using the following recursive
formulas. We distinguish three different cases. The first is that
$h,k<i$. Under these circumstances:
977: \begin{equation}
978: \label{eq:lhr1} D[h,k;i]=D[h,k;i-1].
979: \end{equation}
980: %
981: This is because:
982: \begin{itemize}
983: \item if $r(i)>r(k)$: row $i$ cannot be used for the haplotype
984: that row $k$ is used for, because row $k$ has maximal $r(\cdot )$
985: among all rows that are used for a haplotype; \item if $r(i)\leq
986: r(k)$: row $i$ cannot increase the length of the haplotype that
987: row $k$ is used for (because also $l(i)\geq l(k)$);
988: \item the same arguments hold for $h$.\\
989: \end{itemize}
990: %
991: The second case is when $h=i$; $D[i,k;i]$ is equal to:
992: %
993: \begin{equation}
994: \label{eq:lhr}
995: \max_{\substack{ j\in W(i),\text{ \ ~}j\neq k \\ r(j)\leq r(i)}}%
D[j,k;i-1]+f(i,j),
\end{equation}
%
where $f(i,j)=r(i)-\max \{r(j),l(i)-1\}$ is the increase of the
1000: haplotype's length. Equation (\ref{eq:lhr}) results from the following.
1001: The definition of $D[i,k;i]$ says that row $%
1002: i$ has to be used for the haplotype for which $k$ is not used and
1003: amongst such rows maximises $r(\cdot )$. Therefore, the optimal
1004: solution is achieved by adding row $i$ to some solution that has a
row $j$ as the rightmost-ending row, for some $j$ that agrees
1006: with $i$, is not equal to $k$ and ends before $i$. Adding row $i$
1007: to the haplotype leads to an increase of its length of
1008: $f(i,j)=r(i)-\max \{r(j),l(i)-1\}$. This term is fixed, for fixed
1009: $i$ and $j$ and therefore we only have to consider extensions of
1010: solutions that
1011: were already optimal. Note that this reasoning does not hold for more general, ``gapped'', data.\\
1012: \\
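To illustrate the term $f(i,j)$ with hypothetical values: if row $j$ covers columns 2 to 5 and row $i$ covers columns 4 to 8, then $l(i)=4$, $r(i)=8$, $r(j)=5$ and
\[
f(i,j) = 8 - \max \{5, 3\} = 3,
\]
corresponding to the three newly covered columns 6, 7 and 8. If instead row $j$ covers only columns 1 and 2, then $f(i,j) = 8 - \max \{2,3\} = 5$: the two rows do not overlap, and all five columns of row $i$ are new.\\
\\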
1013: The last case is when $k=i$; $D[h,i;i]$ is equal to:
1014: \begin{equation}
1015: \max_{\substack{ j\in W(i),\text{ \ ~}j\neq h  \\ r(j)\leq r(i)}}%
1016: \left\{
1017: \begin{array}{l}
1018: D[j,h;i-1]+f(i,j)\text{ if }r(h)\geq r(j),\\
1019: D[h,j;i-1]+f(i,j)\text{ if }r(h)<r(j).%
1020: \end{array}%
1021: \right. \nonumber
1022: \end{equation}
1023: %
1024: The above algorithm can be sped up by using the fact that, as a
direct consequence of (\ref{eq:lhr1}), $D[h,k;i]=D[h,k;\max(h,k)]$
1026: for all $h,k\leq i \leq n$. It is thus unnecessary to calculate
1027: the
1028: values $D[h,k;i]$ for $h,k<i$.\\
1029: \\
1030: The time for calculating all the $W(i)$ is $O(n^{2}m)$. When all
1031: the $W(i)$ are known, it takes $O(n^{3})$ time to calculate all
the $D[h,k;\max(h,k)]$. This is because we need to calculate
1033: $O(n^{2})$ values $D[i,k;i]$ and also $O(n^{2})$ values $D[h,i;i]$
1034: that take $O(n)$ time each. This leads to an overall time
1035: complexity of $O(n^{2}m+n^{3})$.\\
1036: \end{proof}
1037: %
1038: \vspace{-12pt}
1039: %
1040: \subsection{1-gap LHR is NP-hard and APX-hard}
1041: \label{subsec:lhrhard}
1042: %
1043: \noindent\textbf{Problem:} \emph{1-gap LHR}\\
1044: \textbf{Input: } SNP matrix $M$ with at most one gap per row\\
1045: \textbf{Output: } The value LHR(M), as defined earlier\\
1046: \\
1047: In this section we prove that 1-gap LHR is APX-hard (and thus also
NP-hard). We prove this by demonstrating (indirectly) an
1049: L-reduction from the problem CUBIC-MAX-INDEPENDENT-SET - the
1050: problem of computing the maximum cardinality of an independent set
1051: in a cubic graph - which is itself proven
1052: APX-hard in \cite{alimontikann}.\\
1053: \\
1054: We do this in several steps. We first show an L-reduction from
1055: \emph{Single Haplotype} LHR (SH-LHR), the version of LHR where
only one haplotype is used\footnote{More formally: rows of the
1057: input matrix $M$ must be removed until the remaining rows are
1058: mutually non-conflicting. The length of the resulting single
1059: haplotype, which we seek to maximise, is the number of columns
1060: (amongst the remaining rows) that have at least one non-hole
entry.}, to LHR, such that the number of gaps per row is
1062: unchanged. We then show an L-reduction from
1063: CUBIC-MAX-INDEPENDENT-SET to 2-gap SH-LHR. Then, using an
1064: observation pertaining to the structure of cubic graphs, we show
1065: how this reduction can be adapted to give an L-reduction from
1066: CUBIC-MAX-INDEPENDENT-SET to 1-gap SH-LHR. This proves
1067: the APX-hardness of 1-gap SH-LHR and thus (by transitivity of L-reductions) also 1-gap LHR.\\
1068: \begin{lemma}
1069: \label{lem:shequiv} SH-LHR is L-reducible to LHR, such that the
1070: number of gaps per row is unchanged.\\
1071: \end{lemma}
1072: \begin{proof}
1073: Let $M$ be the $n \times m$ input to SH-LHR. We may assume that
1074: $M$ contains no duplicate rows, because duplicate rows are
1075: entirely redundant when working with only one haplotype. We map
1076: the SH-LHR input, $M$, to the $2n \times m$ LHR input, $M'$, by
1077: taking each row of $M$ and making a copy of it. Informally, the
1078: idea is that the influence of the second haplotype can be neutralised
1079: by doubling the rows of the input matrix. Note that this construction
1080: clearly preserves the maximum number of gaps per row.\\
1081: \\
1082: Now, let $SOL(M')$ be the set that contains all pairs of
1083: haplotypes $(H_1,H_2)$ that can be induced by removing some rows
1084: of $M'$, partitioning the remaining rows of $M'$ into two mutually
1085: non-conflicting sets, and then reading off the two induced
1086: haplotypes. Similarly, let $SOL(M)$ be the set that contains all
1087: haplotypes $H$ that can be induced by removing some rows of $M$
1088: (such that the remaining rows are mutually non-conflicting) and
1089: then reading off the single, induced haplotype. Note the following
1090: pair of observations, which both follow directly from the
1091: construction of $M'$:
1092: \begin{equation}
1093: \label{eq:shl1} (H_1,H_2) \in SOL(M') \Rightarrow H_1, H_2 \in
1094: SOL(M),
1095: \end{equation}
1096: \begin{equation}
1097: \label{eq:shl2} H \in SOL(M) \Rightarrow (H,H) \in SOL(M').
1098: \end{equation}
1099: To satisfy the L-reduction we need to show how elements from
1100: $SOL(M')$ are mapped back to elements of $SOL(M)$ in polynomial
1101: time. So, let $(H_1, H_2)$ be any pair from $SOL(M')$. If $|H_1|
1102: \geq |H_2|$ map the pair $(H_1,H_2)$ to $H_1$, otherwise to $H_2$.
1103: This completes the L-reduction, and we now prove its correctness.
1104: Central to this is the proof of the following:
1105: \begin{equation}
1106: \label{eq:dubbel} \text{\emph{SH-LHR}}(M) = \frac{1}{2}
1107: \text{\emph{LHR}}(M').
1108: \end{equation}
1109: %
1110: The fact that SH-LHR(M) $\geq \frac{1}{2} \text{\emph{LHR}}(M')$
1111: follows immediately from (\ref{eq:shl1}) and the mapping described
above. (This lets us fulfil condition (\ref{eq:L1}) of the
1113: L-reduction definition, taking $\alpha=2$.) The fact that
1114: SH-LHR(M) $\leq \frac{1}{2} \text{\emph{LHR}}(M')$ follows
1115: because, by (\ref{eq:shl2}), every element in $SOL(M)$ is
1116: guaranteed to have a counterpart in $SOL(M')$ which has a total length twice as large.\\
1117: \\
1118: We can fulfil condition (\ref{eq:L2}) of the L-reduction by taking
1119: $\beta=\frac{1}{2}$. To see this, let $(H_1, H_2)$ be any pair
1120: from $SOL(M')$, and (wlog) assume that $|H_1| \geq |H_2|$. Let
$r=\text{\emph{LHR}}(M')$; the distance of $(H_1,H_2)$ from
1122: optimal is then:
1123: \begin{equation}
1124: r - (|H_1|+|H_2|) \geq r - 2|H_1|.
1125: \end{equation}
1126: %
1127: Let $l=\text{\emph{SH-LHR}}(M)$, then:
1128: \begin{equation}
1129: \begin{array}{ll}
1130: l - |H_1|&{}={}\frac{r}{2} - |H_1|\\
1131: &{}={}\frac{1}{2} \bigg ( r-2|H_1| \bigg )\\
1132: &{}\leq{}\frac{1}{2} \bigg (r - (|H_1|+|H_2|) \bigg).
1133: \end{array}
1134: \end{equation}
1135: %
1136: Thus, taking $\beta = \frac{1}{2}$ satisfies condition
1137: (\ref{eq:L2}) of the L-reduction.\\
1138: \end{proof}
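As a sketch of the doubling construction on a hypothetical input (chosen here for illustration), let $M$ have the two rows \texttt{10--} and \texttt{-111}. These rows conflict in column 2, so SH-LHR($M$) $=3$, obtained by keeping only \texttt{-111}. The matrix $M'$ contains two copies of each row. Using both copies of \texttt{-111}, one per haplotype, gives the pair (\texttt{-111}, \texttt{-111}) of total length 6, whereas the pair (\texttt{10--}, \texttt{-111}) has total length only 5. Hence LHR($M'$) $= 6 = 2 \cdot$ SH-LHR($M$), in agreement with (\ref{eq:dubbel}). The back-mapping sends the suboptimal pair (\texttt{10--}, \texttt{-111}) to its longer member \texttt{-111}, whose distance from optimal, $3-3=0$, is indeed at most half the distance $6-5=1$ of the pair.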
1139: %
1140: \begin{lemma}
\label{lem:2gapAPX} 2-gap SH-LHR is APX-hard.\\
1142: \end{lemma}
1143: \begin{proof}
1144: We reduce from CUBIC-MAX-INDEPENDENT-SET. Let $G = (V,E)$ be the
1145: undirected, cubic input to CUBIC-MAX-INDEPENDENT-SET. We direct
1146: the edges of $G$ in the manner described by Observation
1147: \ref{obs:orient}, to give $\overrightarrow{G} = (V,
1148: \overrightarrow{E})$. Thus, every vertex of $\overrightarrow{G}$
1149: is now out-out-in or in-in-out. A vertex $w$ is a \emph{child} of
1150: a vertex $v$ if there is an edge leaving $v$ in the direction of
1151: $w$ i.e. $(v,w) \in \overrightarrow{E}$, and in this case
1152: $v$ is said to be the \emph{parent} of $w$.\\
1153: \\
1154: Let $v_{in}$ be the number of vertices in $\overrightarrow{G}$
1155: that are in-in-out, and $v_{out}$ be the number of vertices that
1156: are out-out-in. We build a matrix $M$, to be used as input to
1157: 2-gap SH-LHR, which has $|V|$ rows and $2v_{in} + v_{out}$
1158: columns. The construction of $M$ is as follows. (Each row of $M$
1159: will represent a vertex from $V$, so we henceforth index the rows
1160: of $M$ using vertices of $V$.) Now, to each in-in-out vertex of
1161: $\overrightarrow{G}$, we allocate two \emph{adjacent} columns of
1162: $M$, and for each out-out-in vertex, we allocate one column of
1163: $M$. (A column may not be allocated to more than one
1164: vertex.)\footnote{Note that, for this lemma, it is not important
1165: how the columns are allocated; in the proof of Lemma
1166: \ref{lem:lhrhard}, the ordering is crucial.} For simplicity, we
1167: also impose an arbitrary total order
1168: $P$ on the vertices of $V$.\\
1169: \\
1170: Now, for each vertex $v \in V$, we build row $v$ as follows.
1171: Firstly, we put 1(s) in the column(s) representing $v$. Secondly,
1172: consider each child $w$ of $v$. If $w$ is an out-out-in vertex, we
1173: put a $0$ in the column representing $w$. Alternatively, $w$ is an
1174: in-in-out vertex, so $w$ is represented by two columns; in this
1175: case we put a 0 in the left such column (if $v$ comes before the
1176: other parent of $w$ in the total order $P$) or, alternatively, in
1177: the right column (if $v$ comes after the other parent of $w$ in the
1178: total order $P$). The rest of the row is holes.\\
1179: %
1180: \begin{figure}
1181: \begin{center}
1182: \epsfig{file=./lhrapx.eps}
1183: \end{center}
1184: \caption{Example input graph to CUBIC-MAX-INDEPENDENT-SET (see
1185: Lemmas \ref{lem:2gapAPX} and \ref{lem:shlhrhard}) after an
1186: appropriate edge orientation has been applied.}
1187: \label{fig:lhrgraph}
1188: \end{figure}
1189: %
1190: \newlength{\blb}
1191: \setlength{\blb}{-3.0pt}
1192: \newlength{\bl}
1193: \setlength{\bl}{-4pt}
1194: \newlength{\lb}
1195: \setlength{\lb}{-12.0pt}
1196: %
1197: \begin{figure}
1198: \begin{center}
1199: \begin{tabular}{l}
1200: $\left.
1201: \begin{array}{ccccccccccccc}
1202: \hspace{28pt} & v_3 \hspace{\bl} & v_1 \hspace{\bl} & v_2
1203: \hspace{\bl} & v_5 \hspace{\bl} & v_5 \hspace{\bl} & v_7
1204: \hspace{\bl} & v_8 \hspace{\bl} & v_8 \hspace{\bl} &
1205: v_4 \hspace{\bl} & v_4 \hspace{\bl} & v_6 \hspace{\bl} & v_6%
1206: \end{array}
1207: \right.$\\
1208: %
1209: \begin{tabular}{ll}
1210: %
1211: $\left.
1212: \begin{array}{c}
1213: v_1\\
1214: v_2\\
1215: v_3\\
1216: v_4\\
1217: v_5\\
1218: v_6\\
1219: v_7\\
1220: v_8%
1221: \end{array}
1222: \right. $
1223: %
1224: &
1225: %
1226: $\hspace{\lb}\left(
1227: \begin{array}{cccccccccccc}
1228: - \hspace{\blb} & 1 \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \\
1229: - \hspace{\blb} & - \hspace{\blb} & 1 \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 0 \hspace{\blb} & - \\
1230: 1 \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \\
1231: - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 1 \hspace{\blb} & 1 \hspace{\blb} & - \hspace{\blb} & - \\
1232: - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 1 \hspace{\blb} & 1 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \\
1233: - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 1 \hspace{\blb} & 1 \\
1234: 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 1 \hspace{\blb} & 0 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \\
1235: - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 1 \hspace{\blb} & 1 \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & - \hspace{\blb} & 0 %
1236: \end{array}
1237: \right)$
1238: %
1239: \end{tabular}
1240: \end{tabular}
1241: \caption{Construction of matrix $M$ (from Lemma \ref{lem:2gapAPX}
1242: and \ref{lem:shlhrhard}) for graph in Figure \ref{fig:lhrgraph}}
1243: \label{fig:lhrmatrix}
1244: \end{center}
1245: \vspace{-12pt}
1246: \end{figure}
1247: %
1248: \\
1249: This completes the construction of $M$. Note that rows encoding
1250: in-in-out vertices contain two adjacent 1s and one 0, with at most
1251: one gap in the row, and rows encoding out-out-in vertices contain
1252: one 1 and two 0s, with at most two gaps in the row. In either case
1253: there are precisely 3 non-hole elements per row. It is also
1254: crucial to note that, reading
1255: down any one column of $M$, one sees exactly one 1 and exactly one 0.\\
1256: \\
1257: Let $K$ be any submatrix of $M$ obtained by removing rows from
1258: $M$, and let $V[K] \subseteq V$ be the set of vertices whose rows
1259: appear in $K$. If the rows of $K$ are mutually non-conflicting,
1260: then the haplotype induced by $K$ has length $3r$ where $r$ is the
1261: number of rows in $K$. This follows from the aforementioned facts
1262: that every column of $M$ contains exactly one 1 and
one 0, and that every row has exactly 3 non-hole elements.\\
1264: \\
1265: We now prove that the rows of $K$ are in conflict iff $V[K]$ is
1266: not an independent set. First, suppose $V[K]$ is not an
1267: independent set. Then there exist $u, v \in V[K]$ such that $(u,v)
1268: \in \overrightarrow{E}$. In row $v$ of $K$ there are thus 1(s) in
1269: the column(s) representing vertex $v$. However, there is also (in
1270: row $u$) a 0 in the column (or one of the columns) representing
1271: vertex $v$, causing a conflict. Hence, if $V[K]$ is not an
1272: independent set, $K$ is in conflict. Now consider the other
1273: direction. Suppose $K$ is in conflict. Then in some column of $K$
1274: there is a 0 and a 1. Let $u$ be the row where the 0 is seen, and
1275: $v$ be the row where the 1 is seen. So both $u$ and $v$ are in
1276: $V[K]$. Further, we know that there is an out-edge $(u,v)$ in
1277: $\overrightarrow{E}$, and thus an edge between $u$ and $v$ in $E$,
1278: proving
1279: that $V[K]$ is not an independent set. This completes the proof of the iff relationship.\\
1280: \\
1281: It follows that:
1282: \begin{equation}
1283: \begin{array}{l}
1284: \text{\emph{CUBIC-MAX-INDEPENDENT-SET}}(G)\\
1285: = \frac{1}{3} \text{\emph{SH-LHR}}(M).
1286: \end{array}
1287: \end{equation}
1288: %
1289: The conditions of the L-reduction definition are now easily
1290: satisfied, because of the 1-1 correspondence between haplotypes
1291: induced (after row-removals) and independent sets in $G$, and the
1292: fact that a size-$r$ independent set of $G$ corresponds to a
1293: length-$3r$ haplotype (or, equivalently, to $r$ mutually
non-conflicting rows of $M$). The L-reduction is formally
1295: satisfied by taking $\alpha = 3$ and $\beta = \frac{1}{3}$. The
1296: two functions that comprise the L-reduction are both polynomial
1297: time computable.\\
1298: \end{proof}
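For instance, in the matrix of Figure \ref{fig:lhrmatrix}, rows $v_1$, $v_5$ and $v_7$ are mutually non-conflicting; combining them gives the haplotype \texttt{010111000---}, of length $9 = 3 \cdot 3$. This matches the fact (as read off from the 0-entries of the matrix) that no two of $v_1$, $v_5$ and $v_7$ are adjacent. Rows $v_1$ and $v_2$, on the other hand, conflict in the column representing $v_2$ (a 0 in row $v_1$ against a 1 in row $v_2$), reflecting the edge $(v_1,v_2)$.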
1299: %
1300: \begin{lemma}
1301: \label{lem:shlhrhard} 1-gap SH-LHR is APX-hard.\\
1302: \end{lemma}
1303: \begin{proof}
1304: This proof is almost identical to the proof of Lemma
1305: \ref{lem:2gapAPX}; the difference is the manner in which columns
of $M$ are assigned to vertices of $G$. The informal motivation is
as follows. In the previous allocation of columns to vertices, it was
1308: possible for a row corresponding to an out-out-in vertex to have 2
1309: gaps. Suppose, for each out-out-in vertex, we could ensure that
1310: one of the 0s in its row was adjacent to the 1 in the row, with no
1311: holes in between. Then every row of the matrix would have (at
1312: most) 1 gap, and we would be finished. We now show that, by
1313: exploiting a rather subtle property
1314: of cubic graphs, it is indeed possible to allocate columns to vertices such that this is possible.\\
1315: \\
Assume that we have ordered the edges of $G$ as before to obtain
1317: $\overrightarrow{G}$. Let $V_{out} \subseteq V$ be those vertices
1318: in $V$ that are out-out-in. Now, suppose we could compute (in
1319: polynomial time) an injective function $favourite: V_{out}
1320: \rightarrow V$ with the following properties:
1321: %
1322: \begin{itemize}
1323: \item for every $v \in V_{out}$, $(v,favourite(v))\in
1324: \overrightarrow{E}$;
1325: % \item for every $u,v \in V_{out}$, $favourite(u) = favourite(v)$
1326: % iff $u=v$; % follows from the fact that the function is injective
1327: \item the subgraph of $\overrightarrow{G}$ induced by edges of the
1328: form $(v,favourite(v))$, henceforth called the
1329: \emph{favourite-induced subgraph}, is acyclic.\\
1330: \end{itemize}
1331: %
1332: Given such a function it is easy to create a total enumeration of
1333: the vertices of $V$ such that every out-out-in vertex is
1334: immediately followed by its \emph{favourite} vertex. This
1335: enumeration can then be used to allocate the columns of $M$ to the
1336: vertices of $V$, such that every row of $M$ has at most one gap.
1337: To ensure this property, it is necessary to stipulate that, where
1338: $favourite(v)$ is an in-in-out vertex, the 0 encoding the edge
1339: $(v,favourite(v))$ is placed in the \emph{left} of the two columns
1340: encoding $favourite(v)$. This is not a problem because every
1341: vertex is the favourite of at most one other vertex.\\
1342: \\
1343: It remains to prove that the function \emph{favourite} exists and
1344: that it can be constructed in polynomial time. This is equivalent
1345: to finding vertex disjoint directed paths in $\overrightarrow{G}$
1346: such that every out-out-in vertex is on such a path and all paths
1347: end in an in-in-out vertex. Lemma \ref{lem:bert} tells us how to
1348: find such paths. We thank Bert Gerards for invaluable
1349: help with this.\\
1350: \\
1351: This completes the proof that 1-gap SH-LHR is APX-hard. (See
1352: Figures \ref{fig:lhrgraph} and \ref{fig:lhrmatrix} for
1353: an example of the whole reduction in action.)\\
1354: \end{proof}
1355: %
1356: \begin{lemma}
1357: \label{lem:bert} Let $\overrightarrow{G}$ be a directed, cubic
1358: graph with a partition $(V_{out},V_{in})$ of the vertices such
1359: that the vertices in $V_{out}$ are out-out-in and the vertices in
1360: $V_{in}$ are in-in-out. Then $V_{out}$ can be covered, in
1361: polynomial time, by vertex-disjoint directed paths ending in
1362: $V_{in}$.\\
1363: \end{lemma}
1364: \begin{proof}
1365: Observe that any two directed circuits contained entirely within
1366: $V_{out}$ are pairwise vertex disjoint. Let $V'_{out}$ be obtained
1367: from $V_{out}$ by shrinking each directed circuit in $V_{out}$ to
1368: a single vertex, and let $\overrightarrow{G'}$ be the resulting
1369: new graph. (Note that each vertex in $V'_{out}$ has outdegree at
1370: least 2 and indegree at most 1 and that the indegree of each node
in $V_{in}$ is still 2, because we do not delete multiple edges.)
1372: We now argue that it is possible to find a set of edges $F'$ in
1373: $\overrightarrow{G'}$, with $|F'| = |V'_{out}|$, such that - for
1374: each $v \in V'_{out}$ - precisely one edge from $F'$ begins at
1375: $v$, and such that no two edges in $F'$ have the same endpoint. We
1376: prove this by construction. For each vertex $u \in V'_{out}$ that
1377: has a child $v$ in $V'_{out}$, we can add the edge $(u,v)$ to
1378: $F'$, because $v$ has indegree 1 and therefore no other edges can
1379: end at $v$. (In case $u$ has two such children, we can choose one
1380: of the edges to add to $F'$). Thus we are left to deal with a
1381: subset of vertices $L \subseteq V'_{out}$ where every vertex in
1382: $L$ has all its children in $V_{in}$. Now consider the bipartite
1383: graph $B$ with bipartition $(L, V_{in})$ and an edge for every
1384: directed edge of $\overrightarrow{G'}$ going from $L$ to $V_{in}$.
If we can find a matching in $B$ of size $|L|$, we can complete
the construction of $F'$ by adding the edges of this
matching. Hall's Theorem states that a bipartite graph with
1388: bipartition $(X,Y)$ has a matching of size $|X|$ iff, for all $X'
1389: \subseteq X$, $|N(X')| \geq |X'|$, where $N(X')$ is the set of all
1390: neighbours of $X'$. Now, note that each vertex in $L$ sends at
1391: least two edges across the partition of $B$, and each vertex in
1392: $V_{in}$ can accept at most two such edges, so for each $L'
1393: \subseteq L$ it is clear that $|N(L')| \geq |L'|$. Hence, the
1394: graph $(L, V_{in})$ does indeed have a matching of size $|L|$ and
1395: the construction of $F'$ can be completed.\\
1396: \\
1397: Now, given that the graph induced by $V'_{out}$ is acyclic, so is
1398: $F'$. Let $F$ be the set of edges in $\overrightarrow{G}$
1399: corresponding to those in $F'$. $F$ is acyclic and each directed
1400: circuit $C$ in $V_{out}$ has exactly one vertex $v_{C}$ that is a
1401: tail of an edge of $F$ and no vertex that is a head of an edge in
1402: $F$. Let $P_C$ be the longest directed path in $C$ that ends in
1403: $v_C$. Then the union of $F$ and all $P_C$ over all directed
1404: circuits $C$ in $V_{out}$ is a collection of paths ending in
1405: $V_{in}$ and covering $V_{out}$.\\
1406: \\
1407: Finding cycles in a graph and finding a maximum matching in a
1408: bipartite graph are both polynomial-time computable, so the whole
1409: process described above is polynomial-time computable.\\
1410: \end{proof}
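The matching step in the proof of Lemma \ref{lem:bert} can be illustrated by the following Python sketch. It is not part of the original argument: the function name \texttt{match\_into\_vin} and the dictionary representation of the bipartite graph $B$ are assumptions made for illustration, and the simple augmenting-path method used here is only one of several polynomial-time ways to find a matching of size $|L|$.

```python
def match_into_vin(adj):
    """Match every vertex of L to a distinct vertex of V_in.

    adj maps each vertex of L to the list of V_in vertices reached by its
    out-edges (parallel edges simply appear as repeated entries).
    Returns a dict from L vertices to their matched V_in vertex.
    """
    owner = {}  # V_in vertex -> L vertex currently matched to it

    def try_assign(u, seen):
        # Look for an augmenting path starting at the L vertex u.
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            # v is free, or its current owner can be moved elsewhere.
            if v not in owner or try_assign(owner[v], seen):
                owner[v] = u
                return True
        return False

    for u in adj:
        if not try_assign(u, set()):
            raise ValueError("no matching of size |L| exists")
    return {u: v for v, u in owner.items()}


# Hypothetical instance: every L vertex sends two edges into V_in and
# every V_in vertex receives at most two of them, as in the proof.
adj = {"a": ["x", "y"], "b": ["x", "y"], "c": ["z", "z"]}
matching = match_into_vin(adj)
```

On instances satisfying the degree conditions of the proof, Hall's condition holds, so the \texttt{ValueError} branch should never be reached; it is kept only as a safeguard for malformed input.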
1411: %
1412: \begin{lemma}
1413: \label{lem:lhrhard}
1414: 1-gap LHR is APX-hard.\\
1415: \end{lemma}
1416: \begin{proof}
1417: Follows from Lemma \ref{lem:shlhrhard} and Lemma
1418: \ref{lem:shequiv}.\\
1419: \end{proof}
1420: %
1421: \section{Conclusion}
1422: %
This paper studies the complexity (under various input
restrictions) of the haplotyping problems Minimum Error Correction
(MEC) and Longest Haplotype Reconstruction (LHR). The state of
knowledge about MEC and LHR after this paper is summarised in
Table \ref{tab:after}. We also include Minimum Fragment Removal
1428: (MFR) and Minimum SNP Removal (MSR) in the table because they are
1429: two other well-known Single Individual Haplotyping problems. MSR
1430: (MFR) is the problem of removing the minimum number of columns
1431: (rows) from an SNP-matrix in order to make it feasible.\\
1432: %
1433: %
1434: \begin{table}[h]
1435: \begin{centering}
1436: \begin{tabular}{|c||c|c|}
1437: \hline
1438: & Binary (i.e. no holes) & ? (Section \ref{subsec:bmec})\\
1439: & & PTAS known \cite{li}\\
1440: \cline{2-3}
1441: MEC & Ungapped & NP-hard (Section \ref{subsec:umec})\\
1442: \cline{2-3}
1443:  & 1-Gap & NP-hard (Section \ref{subsec:gmec}),\\
1444:  & & APX-hard (Section \ref{subsec:gmec})\\
1445: % \cline{2-3}
1446: %  & General & NP-hard (implicit in \cite{kleinberg})\\
1447: %  & & APX-hard (Section \ref{subsec:gmec})\\
1448: \hline
1449: %
1450: % & Binary (i.e. no holes) & P (trivially)\\
1451: % \cline{2-3}
1452:  & Ungapped & P (Section \ref{subsec:lhrpoly})\\
1453: \cline{2-3}
1454: LHR & 1-Gap & NP-hard (Section \ref{subsec:lhrhard})\\
1455:  & & APX-hard (Section \ref{subsec:lhrhard})\\
1456: % \cline{2-3}
1457: %  & General & NP-hard (Section \ref{subsec:lhrhard})\\
1458: %  & & APX-hard (Section \ref{subsec:lhrhard})\\
1459: \hline
1460: %
1461: & Ungapped & P \cite{bafna2005}\\
1462: \cline{2-3}
1463: MFR & 1-Gap & NP-hard \cite{lanciabafna}\\
1464:  & & APX-hard \cite{bafna2005}\\
1465: % \cline{2-3}
1466: %  & General & NP-hard \cite{lanciabafna}\\
1467: %  & & APX-hard \cite{bafna2005}\\
1468: \hline
1469: %
1470: & Ungapped & P \cite{lanciabafna}\\
1471: \cline{2-3}
1472: MSR & 1-Gap & NP-hard \cite{bafna2005}\\
1473:  & & APX-hard \cite{bafna2005}\\
1474: % \cline{2-3}
1475: %  & General & NP-hard \cite{lanciabafna}\\
1476: %  & & APX-hard \cite{bafna2005}\\
1477: \hline
1478: %
1479: \end{tabular}
1480: \caption{The new state of knowledge following our work}
1481: \label{tab:after}
1482: \end{centering}
1483: \vspace{-12pt}
1484: \end{table}
1485: \\
1486: Indeed, from a complexity perspective, the most intriguing open
1487: problem is to ascertain the complexity of the ``re-opened''
1488: problem Binary-MEC. It would also be interesting to study the
1489: approximability of Ungapped-MEC.\\
1490: % ; we conjecture that (in an
1491: % approximation complexity sense) it is somewhat easier
1492: % than 1-gap MEC.\\
1493: \\
1494: From a more practical perspective, the next logical step is to
1495: study the complexity of these problems under more restricted
1496: classes of input, ideally under classes of input that have direct
1497: biological relevance. It would also be of interest to study some
1498: of these problems in a ``weighted'' context i.e. where the cost of
1499: the operation in question (row removal, column removal, error
1500: correction) is some function of (for example) an \emph{a priori}
1501: specified confidence in the correctness of the data being changed.
1502: %
1503: \section{Acknowledgements}
1504: %
1505: We thank Leen Stougie and Judith Keijsper for many useful
1506: conversations during the writing of this paper.
1507: %
1508: %
1509: %
1510: \begin{thebibliography}{77} % start the bibliography
1511: %
\bibitem{alimontikann} Paola Alimonti, Viggo Kann, Hardness of approximating problems on cubic graphs, \emph{Proceedings of the Third Italian Conference on Algorithms and Complexity}, 288-298 (1997)
1513: %
1514: \bibitem{alon} Noga Alon, Benny Sudakov, On Two Segmentation Problems, \emph{Journal of Algorithms} 33, 173-184 (1999)
1515: %
1516: \bibitem{ausiello} G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, M. Protasi,
1517: Complexity and Approximation - Combinatorial optimization problems
1518: and their approximability properties, Springer Verlag (1999)
1519: %
1520: \bibitem{bafna2005} Vineet Bafna, Sorin Istrail, Giuseppe Lancia, Romeo Rizzi, Polynomial and APX-hard cases of the individual haplotyping problem,
1521: \emph{Theoretical Computer Science}, 335(1), 109-125 (2005)
1522: %
1523: \bibitem{bermankarpinski} Piotr Berman, Marek Karpinski, On Some Tighter Inapproximability Results (Extended Abstract), Proceedings of the 26th International Colloquium on Automata, Languages and Programming, 200-209 (1999)
1524: %
1525: \bibitem{bonizzoni} Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi, Jing Li, The Haplotyping Problem: An Overview of Computational Models and Solutions,
1526: \emph{Journal of Computer Science and Technology} 18(6), 675-688
1527: (November 2003)
1528: %
\bibitem{drineas} P. Drineas, A. Frieze, R. Kannan, S. Vempala, V. Vinay, Clustering in large graphs via Singular Value Decomposition, \emph{Machine Learning} 56, 9-33 (2004)
1530: %
1531: %\bibitem{gavril} F. Gavril, Testing for equality between maximum matching and minimum node
1532: %covering, \emph{Information processing letters} 6, 199-202 (1977)
1533: %
1534: \bibitem{greenberg} Harvey J. Greenberg, William E. Hart, Giuseppe Lancia, Opportunities for Combinatorial Optimisation in Computational Biology, \emph{INFORMS Journal on Computing}, 16(3), 211-231 (2004)
1535: %
1536: \bibitem{halldorsson} Bjarni V. Halldorsson, Vineet Bafna, Nathan Edwards, Ross Lippert, Shibu Yooseph, and Sorin Istrail, A Survey of Computational Methods for Determining Haplotypes,
1537: \emph{Proceedings of the First RECOMB Satellite on Computational
1538: Methods for SNPs and Haplotype Inference}, Springer Lecture Notes
1539: in Bioinformatics, LNBI 2983, pp. 26-47  (2003)
1540: %
1541: \bibitem{sched} Hoogeveen, J.A., Schuurman, P., and Woeginger, G.J., Non-approximability results for scheduling problems with minsum criteria, \emph{INFORMS Journal on Computing}, 13(2), 157-168 (Spring 2001)
1542: %
1543: %\bibitem{hopcroftkarp} J.E. Hopcroft, R.M. Karp, An $n^{5/2}$ algorithm for maximum matching in bipartite graphs, \emph{SIAM Journal on Computing} 2, 225-231 (1973)
1544: %
1545: \bibitem{li} Yishan Jiao, Jingyi Xu, Ming Li, On the k-Closest Substring and k-Consensus Pattern Problems, \emph{Combinatorial Pattern Matching: 15th Annual Symposium} (CPM 2004) 130-144
1546: %
1547: \bibitem{kleinberg} Jon Kleinberg, Christos Papadimitriou, Prabhakar Raghavan, Segmentation Problems,
1548: \emph{Proceedings of STOC 1998}, 473-482 (1998)
1549: %
1550: \bibitem{kleinbergEco} Jon Kleinberg, Christos Papadimitriou, Prabhakar Raghavan, A Microeconomic View of Data Mining,
1551: \emph{Data Mining and Knowledge Discovery} 2, 311-324 (1998)
1552: %
1553: \bibitem{kleinberg2004} Jon Kleinberg, Christos Papadimitriou, Prabhakar Raghavan, Segmentation Problems,
1554: \emph{Journal of the ACM} 51(2), 263-280 (March 2004) Note: this
1555: paper is somewhat different to the 1998 version.
1556: %
\bibitem{lanciabafna} Giuseppe Lancia, Vineet Bafna, Sorin Istrail, Ross Lippert, and Russell Schwartz, SNPs Problems, Complexity and Algorithms,
1558: \emph{Proceedings of the 9th Annual European Symposium on
1559: Algorithms}, 182-193 (2001)
1560: %
\bibitem{pureparsimony} Giuseppe Lancia, Maria Cristina Pinotti, Romeo Rizzi, Haplotyping Populations by Pure Parsimony: Complexity of Exact and Approximation Algorithms, \emph{INFORMS Journal on Computing}, Vol. 16, No.4, 348-359 (Fall 2004)
1562: %
1563: \bibitem{newlancia} Giuseppe Lancia, Romeo Rizzi, A polynomial
1564: solution to a special case of the parsimony haplotyping problem,
1565: to appear in \emph{Operations Research Letters}
1566: %
1567: \bibitem{geometric} Rafail Ostrovsky and Yuval Rabani, Polynomial-Time Approximation Schemes for Geometric Min-Sum Median Clustering,
1568: \emph{Journal of the ACM} 49(2), 139-156 (March 2002)
1569: %
1570: \bibitem{fasthare} Alessandro Panconesi and Mauro Sozio, Fast Hare: A Fast Heuristic for Single Individual SNP Haplotype Reconstruction,
1571: \emph{Proceedings of 4th Workshop on Algorithms in Bioinformatics}
1572: (WABI 2004), LNCS Springer-Verlag, 266-277
1573: %
1574: \bibitem{christos} Personal communication with Christos H. Papadimitriou, June 2005
1575: %
1576: \bibitem{lreduc} C.H. Papadimitriou and M. Yannakakis, Optimization, approximation, and complexity classes, \emph{Journal of Computer and System Sciences} 43, 425-440 (1991)
1577: %
1578: \bibitem{fixedparam} Romeo Rizzi, Vineet Bafna, Sorin Istrail, Giuseppe Lancia: Practical Algorithms and Fixed-Parameter Tractability for the Single Individual SNP Haplotyping Problem,
1579: \emph{2nd Workshop on Algorithms in Bioinformatics} (WABI 2002)
1580: 29-43
1581: %
1582: \end{thebibliography}       % end the bibliography
1583: %
1584: %
1585: \vspace{30pt}
1586: %
1587: %
1588: \begin{biography}{Rudi Cilibrasi}
1589: received his bachelor's degree at Caltech in 1996. He spent
1590: several years in industry doing network programming, Linux kernel
1591: programming, and a variety of software development work until
1592: returning to academia with CWI in 2001. He is now nearing
1593: completion of his doctoral work that has been largely concerned
1594: with robust methods of approximating bioinformatics and related
1595: clustering problems. He currently maintains CompLearn (
1596: http://complearn.org/ ), an open-source data-mining package that
1597: can be used for phylogenetic tree construction.
1598: 
1599: \end{biography}
1600: \begin{biography}{Leo van Iersel}
received his Master of Science degree in Applied
Mathematics from the Universiteit Twente in the Netherlands in 2004. He is
1603: now working as a PhD student at the Technische Universiteit
1604: Eindhoven, also in the Netherlands. His research is mainly
1605: concerned with the search for combinatorial algorithms for
1606: biological problems.
1607: \end{biography}
1608: \begin{biography}{Steven Kelk}
1609: received his PhD in Computer Science in 2004 from the University
1610: of Warwick, in England. He is now working as a postdoc at the
1611: Centrum voor Wiskunde en Informatica (CWI) in Amsterdam, the
1612: Netherlands, where he is focussing on the combinatorial aspects of
1613: computational biology.
1614: \end{biography}
1615: \begin{biography}{John Tromp}
1616: received the bachelor's and PhD degrees in Computer Science from
1617: the University of Amsterdam in 1989 and 1993 respectively, where
1618: he studied with Paul Vit\'{a}nyi. He then spent two years as a
1619: postdoctoral fellow with Ming Li at the University of Waterloo in
1620: Canada. In 1996 he returned as a postdoc to the Centre for
1621: Mathematics and Computer Science (CWI) in Amsterdam. He spent 2001
working as a software developer at Bioinformatics Solutions Inc. in
1623: Waterloo, to return once more to CWI, where he currently holds a
1624: permanent position. He is the recipient of a Canada International
Fellowship. See http://www.cwi.nl/\textasciitilde{}tromp/ for more information.
1626: \end{biography}
1627: %
1628: %
1629: %
1630: %
1631: \clearpage
1632: %
1633: \section*{Appendix: Interreducibility of MEC and Constructive-MEC}
1634: \label{app:inter}
1635: %
1636: \renewcommand{\theequation}{A\arabic{equation}}
1637: \setcounter{section}{0}
1638: %
1639: \renewcommand{\thesection}{A.\arabic{section}}
1640: %
1641: \numberwithin{equation}{section} \numberwithin{figure}{section}
1642: %
1643: % \section{Interreducibility of MEC and Constructive-MEC}
1644: %
1645: \begin{lemma}
1646: \label{lem:int} MEC and Constructive-MEC are polynomial-time
1647: Turing interreducible. (Also: Binary-MEC and
1648: Binary-Constructive-MEC are polynomial-time Turing
1649: interreducible.)\\
1650: \end{lemma}
1651: \begin{proof}
1652: We show interreducibility of MEC and Constructive-MEC in such a
1653: way that the interreducibility of Binary-MEC with
1654: Binary-Constructive-MEC also follows immediately from the
1655: reduction. This makes the reduction from Constructive-MEC to
1656: MEC quite complicated because we must thus avoid the use of holes.\\
1657: \\
1658: 1. Reducing MEC to Constructive-MEC is trivial because, given an
1659: optimal haplotype pair $(H_1, H_2)$, $D_M(H_1, H_2)$ can easily be
1660: computed in polynomial-time by summing $\min( d(H_1,r), d(H_2,r)
1661: )$ over all rows $r$ of the
1662: input matrix $M$.\\
1663: \\
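As an illustration of Step 1, the following Python sketch evaluates $D_M(H_1, H_2)$ for a given haplotype pair. The representation is an assumption made for this example only: rows and haplotypes are strings, the character \texttt{-} stands for a hole, and $d(H,r)$ is taken to count the columns in which $r$ has a non-hole entry that disagrees with $H$.

```python
HOLE = "-"  # assumed symbol for a hole in a row of the SNP matrix


def d(h, r):
    """Count the columns where row r is not a hole and disagrees with h."""
    return sum(1 for a, b in zip(h, r) if b != HOLE and a != b)


def D(M, h1, h2):
    """Sum, over all rows of M, the distance to the nearer of h1 and h2."""
    return sum(min(d(h1, r), d(h2, r)) for r in M)


# Hypothetical 4 x 4 input matrix with one hole in the last row.
M = ["0011", "0010", "1100", "11-1"]
cost = D(M, "0011", "1100")
```

Both functions run in time linear in the size of $M$, which is all that the reduction from MEC to Constructive-MEC requires.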
2. Reducing Constructive-MEC to MEC is more involved. To rule out a
special case that could complicate our reduction, we
1666: first check whether every row of $M$ (i.e. the input to
1667: Constructive-MEC) is identical. If this is so, we can complete the
1668: reduction by simply returning $(H_1, H_1)$ where $H_1$ is the
1669: first row of $M$. Hence,
1670: from this point onwards, we assume that $M$ has at least two distinct rows.\\
1671: \\
1672: Let $OptPairs(M)$ be the set of all unordered optimal haplotype
1673: pairs for $M$ i.e. the set of all $(H_1, H_2)$ such that $D_M(H_1,
H_2) = MEC(M)$. Given that not all rows of $M$ are identical, we
1675: observe that there are no pairs of the form $(H_1, H_1)$ in
1676: $OptPairs(M)$.\footnote{This is because $D_{M}(H_1,H_1)$ is always
1677: larger than $D_{M}(H_1, r)$ for any row $r$ in $M$ that is not
1678: equal to $H_1$.} Let $OptPairs(M,H') \subseteq OptPairs(M)$ be
1679: those elements $(H_1,H_2) \in OptPairs(M)$ such that $H_1 = H'$ or
1680: $H_2 = H'$. Let $g(r, H_1, H_2)$
1681: be defined as $\min( d(r, H_1), d(r, H_2) )$.\\
1682: \\
1683: Consider the following two subroutines:\\
1684: \\
1685: \textbf{Subroutine: } \emph{DFN} (``Distance From Nearest Optimal Haplotype Pair'')\\
1686: \textbf{Input: } An $n \times m$ SNP matrix $M$ and a vector $r \in \{0,1\}^m$.\\
1687: \textbf{Output: } The value $d_{dfn}$ which we define as follows:
1688: \begin{equation}
1689: d_{dfn} = \min_{ (H_1, H_2) \in OptPairs(M) } g( r, H_1, H_2
1690: ).\nonumber
1691: \end{equation}
1692: \\
1693: \textbf{Subroutine: } \emph{ANCHORED-DFN} (``Anchored Distance From Nearest Optimal Haplotype Pair'')\\
1694: \textbf{Input: } An $n \times m$ SNP matrix $M$, a vector $r \in
1695: \{0,1\}^m$, and a haplotype $H'$ such that
1696: $(H', H_2) \in OptPairs(M)$ for some $H_2$.\\
1697: \textbf{Output: } The value $d_{adfn}$, defined as:
1698: \begin{equation}
1699: d_{adfn} = \min_{ (H_1, H_2) \in OptPairs(M, H') } g( r, H_1, H_2
1700: ).\nonumber
1701: \end{equation}
1702: \\
1703: We assume the existence of implementations of DFN and ANCHORED-DFN
1704: which run in polynomial-time whenever MEC runs in polynomial-time.
1705: We use these two subroutines to reduce Constructive-MEC to MEC and
then, to complete the proof, demonstrate and prove correctness of
1707: implementations for
1708: DFN and ANCHORED-DFN.\\
1709: \\
1710: The general idea of the reduction from Constructive-MEC to MEC is
1711: to find some pair $(H_1, H_2) \in OptPairs(M)$ by first finding
1712: $H_1$ (using repeated calls to DFN) and then finding $H_2$ (by
1713: using repeated calls to ANCHORED-DFN with $H_1$ specified as the
1714: ``anchoring'' haplotype.) Throughout the reduction, the following
1715: two observations are important. Both follow immediately from the
1716: definition of $D$ - i.e. (\ref{eq:witsum}).\\
1717: \begin{observation}
1718: \label{obs:expand} Let $M_1 \cup M_2$ be a partition of rows of
1719: the matrix $M$ into two sets. Then, for all $H_1$ and $H_2$,
1720: $D_{M}(H_1, H_2) = D_{M_1}(H_1, H_2) + D_{M_2}(H_1,H_2)$.\\
1721: \end{observation}
1722: \begin{observation}
1723: \label{obs:baseline} Suppose an SNP matrix $M_1$ can be obtained
1724: from an SNP matrix $M_2$ by removing 0 or more rows from $M_2$.
1725: Then $MEC(M_1) \leq MEC(M_2)$.\\
1726: \end{observation}
1727: To begin the reduction, note that, for an arbitrary
1728: haplotype $X$, DFN$(M,X)=0$ iff $(X, H_2) \in OptPairs(M)$ for
1729: some haplotype $H_2$. Our idea is thus that we initialise $X$ to
1730: be all-0 and flip one entry of $X$ at a time (i.e. change a 0 to a
1731: 1 or vice-versa) until DFN$(M,X)=0$; at that point $X = H_1$ (for
1732: some $(H_1, H_2) \in OptPairs(M)$.) More specifically, suppose
1733: DFN$(M,X) = d$ where $0 < d < m$. \footnote{It is not possible
1734: that DFN$(M,X)=m$, because all $(H_1, H_2) \in OptPairs(M)$ are of
1735: the form $H_1 \neq H_2$, and if $H_1 \neq H_2$ we know that
1736: $g(X,H_1,H_2) < m$.} If we define $flip(X,i)$ as the haplotype
1737: obtained by flipping the entry in the $i$th column of $X$, then we
1738: know that there exists $i$ ($1 \leq i \leq m$) such that DFN$(M,
1739: flip(X,i)) < d$. Such a position must exist because we can flip
1740: some entry in $X$ to bring it closer to the haplotype (which we
1741: know exists) that it was distance $d$ from. It is clear that we
1742: can find a position $i$ in polynomial-time by calling DFN$(M,
1743: flip(X,j))$ for $1 \leq j \leq m$ until it is found.
1744: Having found such an $i$, we set $X = flip(X,i)$.\\
1745: \\
1746: Clearly this process can be iterated, finding one entry to flip in
1747: every iteration, until DFN$(M,X)=0$ and at this point setting $H_1
1748: = X$ gives us the desired result. Given that DFN$(M,X)$ decreases
1749: by at least 1 every iteration, at most $m-1$ iterations
1750: are required.\\
1751: \\
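The iterative search for $H_1$ can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in the proof: the matrix is binary, the helper \texttt{dfn} is a brute-force stand-in that evaluates DFN directly from its definition (exponential in $m$), and all function names are hypothetical. Columns are indexed from 0 in the code, whereas the text counts them from 1.

```python
from itertools import product


def d(a, b):
    """Hamming distance between two binary strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


def opt_pairs(M):
    """Brute-force OptPairs(M) for a binary matrix (exponential in m)."""
    haps = ["".join(p) for p in product("01", repeat=len(M[0]))]
    cost = {(a, b): sum(min(d(a, r), d(b, r)) for r in M)
            for a in haps for b in haps}
    best = min(cost.values())
    return [pair for pair, c in cost.items() if c == best]


def dfn(M, x):
    """Stand-in for the DFN subroutine, computed from its definition."""
    return min(min(d(x, a), d(x, b)) for a, b in opt_pairs(M))


def flip(x, i):
    """Return x with the entry in column i flipped."""
    return x[:i] + ("1" if x[i] == "0" else "0") + x[i + 1:]


def find_h1(M, dfn):
    """Start from the all-0 vector and flip entries until DFN reaches 0."""
    x = "0" * len(M[0])
    dist = dfn(M, x)
    while dist > 0:
        for i in range(len(x)):
            cand_dist = dfn(M, flip(x, i))
            if cand_dist < dist:      # an improving flip always exists
                x, dist = flip(x, i), cand_dist
                break
        else:
            raise RuntimeError("no improving flip found")
    return x


M = ["110", "110", "011", "011"]  # hypothetical input with two row types
h1 = find_h1(M, dfn)
```

Each pass of the outer loop makes at most $m$ calls to \texttt{dfn} and lowers the distance by at least 1, matching the iteration bound given in the text.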
1752: Thus, having found $H_1$, we need to find some $H_2$ such that
1753: $(H_1, H_2)$ is in $OptPairs(M)$.\\
1754: \\
1755: First, we initialise $X$ to be the complement of $H_1$ (i.e. the
1756: row obtained by flipping every entry of $H_1$). Now, observe that
1757: if $X \neq H_1$ and ANCHORED-DFN$(M, X, H_1) = 0$ then $(H_1, X)
1758: \in OptPairs(M)$ and we are finished. The tactic is thus to find,
1759: at each iteration, some position $i$ of $X$ such that
1760: ANCHORED-DFN$(M, flip(X,i), H_1)$ is less than ANCHORED-DFN$(M,
1761: X,H_1)$, and then setting $X$ to be $flip(X,i)$. As before we
1762: repeat this process until our call to ANCHORED-DFN returns zero.
1763: The ``trick'' in this case is to prevent $X$ converging on $H_1$,
1764: because (knowing that $M$ has at least two different types of row)
1765: $(H_1, H_1) \not \in OptPairs(M)$. The initialisation of $X$ to
1766: the complement of $H_1$ guarantees this. To see why this is,
1767: observe that, if $X$ is the complement of $H_1$, $d(X,H_1)=m$.
1768: Thus, we would need at least $m$ flips to transform $X$ into
1769: $H_1$. However, if $X$ is the complement of $H_1$, then - because
1770: we have guaranteed that $OptPairs(M)$ contains no pairs of the
1771: form $(H_1,H_1)$ - we know that ANCHORED-DFN$(M, X, H_1) < m$.
1772: Given that we can guarantee that ANCHORED-DFN$(M,X,H_1)$ can be
1773: reduced by at least 1 at every iteration, it is clear that we can
1774: find an $X$ such that ANCHORED-DFN$(M,X,H_1)=0$ after making no
1775: more than $m-1$ iterations, which ensures that $X$ cannot have
1776: been transformed into $H_1$. Once we have such an $X$ we can set
1777: $H_2 = X$ and
1778: return $(H_1, H_2)$.\\
1779: \\
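The search for $H_2$ differs from the search for $H_1$ only in its starting point and in the subroutine it calls. The Python sketch below illustrates this under the same assumptions as before: a binary matrix, hypothetical function names, and a brute-force stand-in \texttt{anchored\_dfn} that evaluates ANCHORED-DFN from its definition rather than through an MEC oracle.

```python
from itertools import product


def d(a, b):
    """Hamming distance between two binary strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


def opt_pairs(M):
    """Brute-force OptPairs(M) for a binary matrix (exponential in m)."""
    haps = ["".join(p) for p in product("01", repeat=len(M[0]))]
    cost = {(a, b): sum(min(d(a, r), d(b, r)) for r in M)
            for a in haps for b in haps}
    best = min(cost.values())
    return [pair for pair, c in cost.items() if c == best]


def anchored_dfn(M, x, h):
    """Stand-in for ANCHORED-DFN: minimise g only over pairs containing h."""
    return min(min(d(x, a), d(x, b))
               for a, b in opt_pairs(M) if h in (a, b))


def flip(x, i):
    """Return x with the entry in column i (0-based) flipped."""
    return x[:i] + ("1" if x[i] == "0" else "0") + x[i + 1:]


def find_h2(M, h1, anchored_dfn):
    """Start from the complement of h1 and descend to a partner of h1."""
    x = "".join("1" if c == "0" else "0" for c in h1)
    dist = anchored_dfn(M, x, h1)
    while dist > 0:
        for i in range(len(x)):
            cand_dist = anchored_dfn(M, flip(x, i), h1)
            if cand_dist < dist:
                x, dist = flip(x, i), cand_dist
                break
        else:
            raise RuntimeError("no improving flip found")
    return x


M = ["110", "110", "011", "011"]  # hypothetical input with two row types
h2 = find_h2(M, "110", anchored_dfn)
```

Because the search starts at distance $m$ from $H_1$ and needs fewer than $m$ flips, the returned vector cannot coincide with $H_1$.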
1780: To complete the proof of Lemma \ref{lem:int} it remains only to
1781: demonstrate and prove the correctness of algorithms for DFN and
1782: ANCHORED-DFN, which we do below. Note that both DFN and
1783: ANCHORED-DFN run in polynomial-time if MEC runs in
1784: polynomial-time.\\
1785: \\
1786: \textbf{Subroutine: } \emph{DFN} (``Distance From Nearest Optimal Haplotype Pair'')\\
1787: \textbf{Input: } An $n \times m$ SNP matrix $M$ and a vector $r \in \{0,1\}^m$.\\
1788: \textbf{Output: } The value $d_{dfn}$ which we define as follows:
1789: \begin{equation}
1790: d_{dfn} = \min_{ (H_1, H_2) \in OptPairs(M) } g( r, H_1, H_2
1791: ).\nonumber
1792: \end{equation}
The following three-step algorithm computes DFN$(M,r)$ using an oracle for MEC.\\
1794: \\
1795: 1. Compute $d = $MEC$(M)$.\\
1796: 2. Let $M'$ be the $n(m+1) \times m$ matrix obtained from $M$ by
1797: making $m+1$ copies of every row of $M$.\\
1798: 3. Return MEC$( M' \cup \{r\} ) - (m+1)d$ where $M' \cup \{r\}$ is
1799: the matrix obtained by adding the single row $r$ to the matrix
1800: $M'$.\\
1801: \\
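The three steps above can be sketched in Python as follows. The function \texttt{mec} is only a brute-force stand-in for the MEC oracle (it enumerates all haplotype pairs and is exponential in $m$), and the string representation with \texttt{-} for holes is an assumption of this example; any correct MEC oracle could be passed in its place.

```python
from itertools import product


def d(h, r):
    """Count the columns where row r is not a hole and disagrees with h."""
    return sum(1 for a, b in zip(h, r) if b != "-" and a != b)


def mec(M):
    """Brute-force stand-in for the MEC oracle (illustration only)."""
    haps = list(product("01", repeat=len(M[0])))
    return min(sum(min(d(h1, r), d(h2, r)) for r in M)
               for h1 in haps for h2 in haps)


def dfn(M, r, oracle=mec):
    """Compute DFN(M, r) with two oracle calls, following Steps 1 to 3."""
    m = len(M[0])
    base = oracle(M)                                    # Step 1
    M_prime = [row for row in M for _ in range(m + 1)]  # Step 2
    return oracle(M_prime + [r]) - (m + 1) * base       # Step 3


M = ["000", "000", "111", "110"]  # hypothetical binary input
near = dfn(M, "000")
far = dfn(M, "011")
```

Apart from the two oracle calls, the work is limited to building a matrix with $n(m+1)+1$ rows, which is polynomial in the input size.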
1802: To prove the correctness of the above we first make a further
1803: observation, which (as with the two previous observations) follows
1804: directly from (\ref{eq:witsum}).\\
1805: \begin{observation}
\label{obs:scale} Suppose a $kn \times m$ SNP matrix $M_1$ is
obtained from an $n \times m$ SNP matrix $M_2$ by making $k \geq
1$ copies of every row of $M_2$. Then $MEC(M_1) = k \cdot MEC(M_2)$, and
1809: $OptPairs(M_1) = OptPairs(M_2)$.\\
1810: \end{observation}
1811: By the above observation we know that MEC$(M') = (m+1)d$ and
1812: $OptPairs(M') = OptPairs(M)$. Now, we argue that $OptPairs(M' \cup
1813: \{r\}) \subseteq OptPairs(M)$. To see why this is, suppose there
1814: existed $(H_3, H_4)$ such that $(H_3, H_4) \in OptPairs(M' \cup
1815: \{r\})$ but $(H_3, H_4) \not \in OptPairs(M)$. This would mean
1816: $D_{M}(H_3, H_4) > d$ where $d = $MEC$(M)$. Now:
1817: \begin{align*}
1818: D_{M' \cup \{r\}}(H_3, H_4) & \geq D_{M'}(H_3, H_4)\\
1819: & = (m+1)D_{M}(H_3, H_4)\\
1820: & \geq (m+1)(d+1).
1821: \end{align*}
1822: However, if we take any $(H_1, H_2) \in OptPairs(M)$, we see that:
1823: \begin{align*}
1824: D_{M' \cup \{r\}}(H_1, H_2) & \leq (m+1)d + g(r,H_1, H_2)\\
1825: & \leq (m+1)d + m.
1826: \end{align*}
1827: Now, $(m+1)d + m < (m+1)(d+1)$ so $(H_3, H_4)$ could not possibly
1828: be in $OptPairs(M' \cup \{r\})$ - contradiction! The relationship
1829: $OptPairs(M' \cup \{r\}) \subseteq OptPairs(M)$ thus follows. It
1830: further follows, from Observation \ref{obs:expand}, that the
1831: members of $OptPairs(M' \cup \{r\})$ are precisely those pairs
1832: $(H_1, H_2) \in OptPairs(M)$ that minimise the expression
1833: $g(r,H_1,H_2)$. The minimal value of $g(r, H_1, H_2)$ has already
1834: been defined as $d_{dfn}$, so we have:
1835: \begin{equation}
1836: MEC(M' \cup \{r\}) = (m+1)d + d_{dfn}.\nonumber
1837: \end{equation}
1838: This proves the correctness of Step 3 of the subroutine.\\
1839: \\
1840: \textbf{Subroutine: } \emph{ANCHORED-DFN} (``Anchored Distance From Nearest Optimal Haplotype Pair'')\\
1841: \textbf{Input: } An $n \times m$ SNP matrix $M$, a vector $r \in
1842: \{0,1\}^m$, and a haplotype $H'$ such that
1843: $(H', H_2) \in OptPairs(M)$ for some $H_2$.\\
1844: \textbf{Output: } The value $d_{adfn}$, defined as:
1845: \begin{equation}
1846: d_{adfn} = \min_{ (H_1, H_2) \in OptPairs(M, H') } g( r, H_1, H_2
1847: ).\nonumber
1848: \end{equation}
1849: Given that $H'$ is one half of some optimal haplotype pair for
1850: $M$, it can be shown that ANCHORED-DFN$(M, r, H')$ =  DFN$( M \cup
1851: \{H'\}, r)$, thus demonstrating how ANCHORED-DFN can be easily
1852: reduced to DFN in polynomial-time. To prove the equation it is
1853: sufficient to demonstrate that $OptPairs( M \cup \{H'\}) =
OptPairs(M, H')$, which we do now. Let $d=$MEC$(M)$. It follows from Observation \ref{obs:baseline}
1855: that MEC$( M \cup \{ H' \} ) \geq d$. In fact, MEC$(M \cup \{H'\})
1856: = d$ because $D_{M \cup \{H'\}}( H', H_2 ) = d$ for all $(H', H_2)
1857: \in OptPairs(M,H')$. Hence $OptPairs(M,H') \subseteq OptPairs(M
1858: \cup \{ H' \})$. To prove the other direction, suppose there
1859: existed some pair $(H_1, H_2) \in OptPairs(M \cup \{ H' \})$ such
1860: that $H_1 \neq H'$ and $H_2 \neq H'$. But then, from Observation
1861: \ref{obs:expand}, we would have:
1862: \begin{align*}
1863: D_{M \cup \{H' \}} (H_1, H_2) &= D_{M}(H_1, H_2) + g(H', H_1, H_2) \\
1864: &\geq D_{M}(H_1, H_2) + 1\\
1865: &> d.
1866: \end{align*}
1867: Thus, $(H_1, H_2)$ could not have been in $OptPairs(M \cup
1868: \{H'\})$ in the first place, giving us a contradiction. Thus
1869: $OptPairs(M \cup \{ H' \}) \subseteq OptPairs(M, H')$ and hence
1870: $OptPairs(M \cup \{H' \}) = OptPairs(M, H')$, proving the
1871: correctness of subroutine ANCHORED-DFN.
1872: %
1873: \end{proof}
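The identity ANCHORED-DFN$(M, r, H')$ = DFN$(M \cup \{H'\}, r)$ can be checked on small binary inputs with the Python sketch below. Everything here is an illustrative assumption: the function names are hypothetical, and both quantities are evaluated by brute force from their definitions rather than through an MEC oracle.

```python
from itertools import product


def d(a, b):
    """Hamming distance between two binary strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


def opt_pairs(M):
    """Brute-force OptPairs(M) for a binary matrix (exponential in m)."""
    haps = ["".join(p) for p in product("01", repeat=len(M[0]))]
    cost = {(a, b): sum(min(d(a, r), d(b, r)) for r in M)
            for a in haps for b in haps}
    best = min(cost.values())
    return [pair for pair, c in cost.items() if c == best]


def g(r, a, b):
    """Distance from r to the nearer haplotype of the pair (a, b)."""
    return min(d(r, a), d(r, b))


def dfn(M, r):
    """DFN evaluated from its definition."""
    return min(g(r, a, b) for a, b in opt_pairs(M))


def anchored_dfn_by_definition(M, r, h):
    """ANCHORED-DFN evaluated from its definition."""
    return min(g(r, a, b) for a, b in opt_pairs(M) if h in (a, b))


def anchored_dfn(M, r, h):
    """ANCHORED-DFN obtained by appending the anchor h as an extra row."""
    return dfn(M + [h], r)


M = ["000", "000", "111", "110"]  # hypothetical binary input
vectors = ["".join(p) for p in product("01", repeat=3)]
agree = all(anchored_dfn(M, r, h) == anchored_dfn_by_definition(M, r, h)
            for h in ("000", "111", "110") for r in vectors)
```

The three anchors used in the check each belong to some optimal pair for this $M$, as the subroutine's input condition requires.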
1874: \end{document}
1875: %
1876: %
1877: %
1878: %
1879: %
1880: