q-bio0605024/final.tex
1: \documentclass[12pt,draftcls,onecolumn]{IEEEtran}
2: %\documentclass[onecolumn]{IEEEtran}
3: 
4: %\numberwithin{equation}{section} \numberwithin{figure}{section}
5: 
6: \newtheorem{lemma}{Lemma}
7: \newtheorem{observation}{Observation}
8: \newtheorem{definition}{Definition}
9: \newtheorem{theorem}{Theorem}
10: \newtheorem{corollary}{Corollary}
11: 
12: \usepackage[USenglish]{babel}
13: 
14: \usepackage{makeidx}
15: \usepackage{amsfonts}
16: \usepackage{epsfig}
17: \usepackage{amsmath}
18: \usepackage{amssymb}
19: \pagestyle{plain}
20: \pagenumbering{arabic}
21: 
22: 
23: %%%%%%%% BEGIN COMMENT %%%%%%%%%%
24: 
25: \newif\ifcomment\commentfalse
26: \def\commentON{\commenttrue}
27: \def\commentOFF{\commentfalse}
28: 
29: \long\outer\def\bc#1\ec{{\ifcomment \sloppy  $[${\bf suggest}]
30: {{#1}} \textbf{[end]} \fi }}
31: 
32: \long\outer\def\br#1\er{{\ifcomment \sloppy  $[${\bf suggest remove}]
33: {{#1}} \textbf{[end]} \fi }}
34: 
35: \long\outer\def\bo#1\eo{{\ifcomment \sloppy  $[${\bf instead of}]
36: {\textit{#1}} \textbf{[end]}  \fi }}
37: 
38: \long\outer\def\BC#1\EC{{\ifcomment \sloppy \par \#  \dotfill
39: {\textsc{#1}} \dotfill \# \par \fi }}
40: 
41: \long\outer\def\ph#1{$PH(*,{#1})$} \long\outer\def\phmin#1{$PH^{nt}(*,{#1})$}
42: 
43: \long\outer\def\mpph#1{$MPPH(*,{#1})$} \long\outer\def\mpphmin#1{$MPPH^{nt}(*,{#1})$}
44: 
45: \long\outer\def\phminZ{$PH^{nt}$} \long\outer\def\mpphminZ{$MPPH^{nt}$}
46: 
47: \long\outer\def\lbmidmin#1#2{\ensuremath{LB^{nt}_{mid}({#1},{#2})}}
48: \long\outer\def\lbwith#1#2{\ensuremath{LB_{mid}({#1},{#2})}}
49: 
50: \commentOFF
51: %\commentON
52: 
53: %%%%%%%%% END COMMENT %%%%%%%%%%%
54: 
55: \ifcomment
56: \pagestyle{plain}
57: \pagenumbering{arabic}
58: \fi
59: 
60: \begin{document}
61: 
62: %\markboth{Shorelines of islands of tractability...}{Van Iersel\MakeLowercase{\textit{et al.}}}
63: 
64: \title{Shorelines of islands of tractability: Algorithms for parsimony and minimum perfect phylogeny haplotyping problems\thanks{Supported by the Dutch
65: BSIK/BRICKS
66: project.}}
67: 
68: 
69: %\titlerunning{Shorelines of islands of tractability}
70: 
71: \author{Leo van Iersel, Judith Keijsper, Steven Kelk and Leen Stougie}
72: 
73: %\authorrunning{Leo van Iersel, Judith Keijsper, Steven Kelk \and Leen Stougie}
74: %\institute{Technische Universiteit Eindhoven (TU/e), Den Dolech 2, 5612 AX Eindhoven, Netherlands,\\
75: %\email{l.j.j.v.iersel@tue.nl, j.c.m.keijsper@tue.nl},\\
76: %\texttt{http://www.tue.nl} \and
77: %Centrum voor Wiskunde en Informatica (CWI), Kruislaan 413, 1098 SJ Amsterdam, Netherlands, \\
78: %\email{steven.kelk@cwi.nl, leen.stougie@cwi.nl}, \\
79: %\texttt{http://www.cwi.nl} }
80: 
81: \maketitle
82: 
83: \begin{abstract}
84: \noindent The problem \emph{Parsimony Haplotyping} ($PH$) asks for the smallest set of haplotypes which can explain a
85: given set of genotypes, and the problem \emph{Minimum Perfect Phylogeny Haplotyping} ($MPPH$) asks for the smallest
86: such set which also allows the haplotypes to be embedded in a \emph{perfect phylogeny}, an evolutionary tree with
87: biologically-motivated restrictions. For $PH$, we extend recent work by further mapping the interface between ``easy''
88: and ``hard'' instances, within the framework of $(k,\ell)$-\emph{bounded instances} where the number of 2's per column
89: and row of the input matrix is restricted. By exploring, in the same way, the tractability frontier of $MPPH$ we
90: provide the first concrete, positive results for this problem, and the algorithms underpinning these results offer new
91: insights about how $MPPH$ might be further tackled in the future. In addition, we construct for both $PH$ and $MPPH$
92: polynomial time approximation algorithms, based on properties of the columns of the input matrix. We conclude with an
93: overview of intriguing open problems in $PH$ and $MPPH$.
94: \end{abstract}
95: 
96: \begin{keywords}
97: Combinatorial algorithms, Biology and genetics, Complexity hierarchies
98: \end{keywords}
99: 
100: 
101: \section{Introduction}
102: \noindent The computational problem of inferring biologically-meaningful haplotype data from the genotype data of a population
103: continues to generate considerable interest at the interface of biology and computer science/mathematics. A popular
104: underlying abstraction for this model (in the context of diploid organisms) represents a genotype as a string
105: over a $\{0,1,2\}$ alphabet, and a haplotype as a string over $\{0,1\}$. The exact goal depends on the
106: biological model being applied but a common, minimal algorithmic requirement is that, given a set of genotypes, a
107: set of haplotypes must be produced which resolves the genotypes.
108: \medskip
109: 
110: To be precise, we are given a \emph{genotype matrix} $G$ with elements in $\{0,1,2\}$, the rows of which
111: correspond to genotypes, while its columns correspond to sites on the genome, called SNPs. A \emph{haplotype matrix}
112: has elements from $\{0,1\}$, and rows corresponding to haplotypes. Haplotype matrix $H$ \emph{resolves} genotype
113: matrix $G$ if for each row $g_i$ of $G$, containing at least one $2$, there are two rows $h_{i_1}$ and $h_{i_2}$ of
114: $H$, such that $g_i(j) = h_{i_1}(j)$ for all $j$ with $h_{i_1}(j)= h_{i_2}(j)$ and $g_i(j) = 2$ otherwise, in which
115: case we say that $h_{i_1}$ and $h_{i_2}$ resolve $g_i$, we write $g_i=h_{i_1}+h_{i_2}$, and we call $h_{i_1}$ the {\em
116: complement} of $h_{i_2}$ with respect to $g_i$, and vice versa. A row $g_i$ without 2's is itself a haplotype and is
117: uniquely resolved by this haplotype, which thus has to be contained in $H$.
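
\medskip

\noindent The definition of resolution can be made concrete with a small Python sketch. It is an illustration only and not part of the algorithms of this paper; the function names \texttt{resolves\_pair} and \texttt{resolves} are hypothetical, and genotypes and haplotypes are assumed to be given as tuples over $\{0,1,2\}$ and $\{0,1\}$ respectively.

\begin{verbatim}
```python
def resolves_pair(h1, h2, g):
    """True if haplotypes h1 and h2 resolve genotype g, i.e. g = h1 + h2."""
    # Where h1 and h2 agree, g must carry their common value;
    # where they differ, g must carry a 2.
    return all(
        (a == b and x == a) or (a != b and x == 2)
        for a, b, x in zip(h1, h2, g)
    )


def resolves(H, G):
    """True if haplotype matrix H resolves genotype matrix G."""
    rows = {tuple(h) for h in H}
    for g in map(tuple, G):
        if 2 not in g:
            # A genotype without 2's is itself a haplotype and must be in H.
            if g not in rows:
                return False
        elif not any(resolves_pair(h1, h2, g) for h1 in rows for h2 in rows):
            return False
    return True
```
\end{verbatim}

\noindent For example, under these assumptions the genotypes $(2,0)$ and $(0,2)$ are resolved by the three haplotypes $(0,0)$, $(1,0)$ and $(0,1)$, but not by $(0,0)$ and $(1,0)$ alone.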
118: 
119: We define the first of the two problems that we study in this paper.
120: 
121: \medskip
122: 
123: \noindent \textbf{Problem:} Parsimony Haplotyping ($PH$)\\
124: \textbf{Input:} A genotype matrix $G$.\\
125: \textbf{Output:} A haplotype matrix $H$ with a minimum number of rows that resolves $G$.
126: 
127: \medskip
128: 
129: \noindent \looseness=-2 There is a rich literature in this area, of which recent papers such as \cite{brown} give a
130: good overview. The problem is APX-hard \cite{lanciaApx}\cite{islands} and, in terms of approximation algorithms with
131: performance \emph{guarantees}, existing methods remain rather unsatisfactory, as will be shortly explained. This has
132: led many authors to consider methods based on Integer Linear Programming (ILP)
133: \cite{brown}\cite{gusfieldparsimony}\cite{halldorson}\cite{lanciaApx}. A different response to the hardness is to
134: search for ``islands of tractability'' amongst special, restricted cases of the problem, exploring the frontier
135: between hardness and polynomial-time solvability. In the literature available in this direction
136: \cite{wabi}\cite{lanciaApx}\cite{lancia}\cite{islands}, this investigation has specified classes of
137: $(k,\ell)$-\emph{bounded instances}: in a $(k,\ell)$-\emph{bounded instance} the input genotype matrix $G$ has at most
138: $k$ $2$'s per row and at most $\ell$ $2$'s per column (cf. \cite{islands}). If $k$ or $\ell$ is a ``$*$'' we mean
139: instances that are bounded only by the number of $2$'s per column or per row, respectively. In this paper we
140: supplement this ``tractability'' literature with mainly positive results, and in doing so almost complete the bounded
141: instance complexity landscape.
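
\medskip

\noindent As a minimal sketch of the $(k,\ell)$-boundedness condition, the following Python fragment counts the 2's per row and per column of a genotype matrix. The names \texttt{two\_counts} and \texttt{is\_bounded} are our own, and \texttt{None} is used here (by assumption) to play the role of ``$*$''.

\begin{verbatim}
```python
def two_counts(G):
    """Return (max number of 2's in a row, max number of 2's in a column)."""
    per_row = max((list(row).count(2) for row in G), default=0)
    per_col = max((col.count(2) for col in zip(*G)), default=0)
    return per_row, per_col


def is_bounded(G, k=None, ell=None):
    """(k, ell)-boundedness of G; None stands for '*' (no bound)."""
    per_row, per_col = two_counts(G)
    return (k is None or per_row <= k) and (ell is None or per_col <= ell)
```
\end{verbatim}

\noindent For instance, the matrix with rows $(2,0,2)$ and $(0,2,2)$ is $(2,2)$-bounded, and hence an instance of $PH(2,*)$, but it is not an instance of $PH(*,1)$.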
142: 
143: Next to the $PH$ problem we study the \emph{Minimum Perfect Phylogeny Haplotyping} ($MPPH$)
144: model \cite{nphardnote}. Again a minimum-size set of resolving haplotypes is required but this time under the
145: additional, biologically-motivated restriction that the produced haplotypes permit a \emph{perfect phylogeny}, i.e.,
146: they can be placed at the leaves of an evolutionary tree within which each site mutates at most once. Haplotype
147: matrices admitting a perfect phylogeny are completely characterised \cite{gusfieldbook}\cite{gusfieldnetwork} by the
148: absence of the forbidden submatrix\\
149: \[F = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}.\] \\
150: \noindent
151: \textbf{Problem:} Minimum Perfect Phylogeny Haplotyping ($MPPH$)\\
152: \textbf{Input:} A genotype matrix $G$.\\
153: \textbf{Output:} A haplotype matrix $H$ with a minimum number of rows that resolves $G$ and admits a perfect
154: phylogeny.
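
\medskip

\noindent The forbidden-submatrix characterisation translates directly into a test on pairs of columns. The Python sketch below is illustrative only; it reads ``submatrix'' as any choice of two columns and four rows, in any order, and the name \texttt{admits\_perfect\_phylogeny} is hypothetical.

\begin{verbatim}
```python
from itertools import combinations

# The four rows of the forbidden submatrix F.
FORBIDDEN = {(1, 1), (0, 0), (1, 0), (0, 1)}


def admits_perfect_phylogeny(H):
    """True if no pair of columns of haplotype matrix H contains all rows of F."""
    columns = list(zip(*H))
    for c1, c2 in combinations(columns, 2):
        if FORBIDDEN <= set(zip(c1, c2)):
            return False
    return True
```
\end{verbatim}

\noindent This brute-force check inspects every pair of columns; it is meant to illustrate the condition and makes no attempt to match the linear-time methods cited below for $PPH$.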
155: 
156: \medskip
157: 
158: \noindent The feasibility question ($PPH$) - given a genotype matrix $G$, find any haplotype matrix $H$ that resolves
159: $G$ and admits a perfect phylogeny, or state that no such $H$ exists - is solvable in linear time
160: \cite{gusfieldlinear}\cite{anOptimal}. Researchers in this area are now moving on to explore the $PPH$ question on
161: phylogenetic \emph{networks} \cite{gusnetwork}.
162: 
163: The $MPPH$ problem, however, has so far hardly been studied beyond an NP-hardness result \cite{nphardnote}
164: and occasional comments within $PH$ and $PPH$ literature \cite{mpphref2}\cite{anOptimal}\cite{mpphref}. In this paper we
165: thus provide what is one of the first attempts to analyse the parsimony optimisation criterion within a well-defined
166: and widely applicable biological framework. We seek namely to map the $MPPH$ complexity landscape in the same way as
167: the $PH$ complexity landscape: using the concept of $(k,\ell)$-boundedness. We write $PH(k,\ell)$ and $MPPH(k,\ell)$ for these
168: problems restricted to $(k,\ell)$-bounded instances.\\
169: 
170: 
171: \noindent
172: \textbf{Previous work and our contribution}
173: 
174: \medskip
175: 
176: \noindent In \cite{lanciaApx} it was shown that $PH(3,*)$ is APX-hard. In \cite{wabi}\cite{lancia} it was shown that
177: $PH(2,*)$ is polynomial-time solvable. Recently, in \cite{islands}, it was shown (amongst other results) that
178: $PH(4,3)$ is APX-hard. In \cite{islands} it was also proven that $PH(*,2)$ is polynomial-time solvable in the restricted subcase
179: where the \emph{compatibility graph} of the input genotype matrix is a clique. (Informally, the compatibility graph shows
180: for every pair of genotypes whether those two genotypes can use common haplotypes in their resolution.)
181: 
182: In this paper, we bring the boundaries between hard and easy classes closer by showing that $PH(3,3)$ is APX-hard and
183: that $PH(*,1)$ is polynomial-time solvable.
184: 
185: As far as $MPPH$ is concerned there have been, prior to this paper, no concrete results beyond the above mentioned
186: NP-hardness result. We show that $MPPH(3,3)$ is APX-hard and that, like their $PH$ counterparts, $MPPH(2,*)$ and
187: $MPPH(*,1)$ are polynomial-time solvable (in both cases using a reduction to the $PH$ counterpart). We also show that
188: the clique result from \cite{islands} holds in the case of $MPPH(*,2)$ as well. As with its $PH$ counterpart the
189: complexity of $MPPH(*,2)$ remains open.
190: 
191: \medskip
192: \noindent The fact that both $PH$ and $MPPH$ already become APX-hard for $(3,3)$-bounded instances means that,
193: in terms of deterministic approximation algorithms, the best that we can in general hope for is constant
194: approximation ratios. Lancia et al.\ \cite{lanciaApx}\cite{lancia} have given two separate approximation algorithms with approximation
195: ratios of $\sqrt{n}$ and $2^{k-1}$ respectively, where $n$ is the number of genotypes in the input, and $k$ is the maximum
196: number of 2's appearing in a row of the genotype matrix\footnote{It would
197: be overly restrictive to write $PH(k,*)$ here
198: because their algorithm runs in polynomial time even if $k$ is not a constant.}. An $O(\log n)$
199: approximation algorithm has been given in \cite{log} but this only runs in polynomial time if the set of all possible haplotypes that
200: can participate in feasible solutions, can be enumerated in polynomial time. The obvious problem with the $2^{k-1}$ and the
201: $O(\log n)$ approximation algorithms is thus that either the accuracy decays exponentially (as in the former case) or the running
202: time increases exponentially (as in the latter case) with an increasing number of 2's per row. Here we offer a
203: simple, alternative approach which achieves (in polynomial time) approximation ratios linear in $\ell$ for
204: $PH(*,\ell)$ and
205: $MPPH(*,\ell)$ instances, and
206: actually also achieves these ratios in polynomial time when $\ell$ is not constant. These ratios are
207: shown in Table~\ref{tab:ratios}; note how improved
208: ratios can be obtained if every genotype is guaranteed to have at least one 2.
209: \begin{table}
210: \centering
211: \caption{Approximation ratios achieved in this paper}
212: \label{tab:ratios}
213: \begin{tabular}{|c|c|}
214: \hline
215: Problem $(\ell \geq 2)$ & Approximation ratio\\
216: \hline
217: \hline
218: $PH(*,\ell)$ & $\frac{3}{2}\ell + \frac{1}{2}$\\
219: \hline
220: $PH(*,\ell)$ where every genotype has at least one 2 & $\frac{3}{4}\ell + \frac{7}{4} - \frac{3}{2}\frac{1}{\ell +1}$\\
221: \hline
222: $MPPH(*,\ell)$ & $2 \ell$\\
223: \hline
224: $MPPH(*,\ell)$ where every genotype has at least one 2 & $\ell + 2 - \frac{2}{\ell+1}$\\
225: \hline
226: \end{tabular}
227: \end{table}
228: 
229: We have thus decoupled the approximation ratio from the maximum number of 2's per row, and instead made the ratio
230: conditional on the maximum number of 2's per column. Our approximation scheme is hence an improvement to the
231: $2^{k-1}$-approximation algorithm except in cases where the maximum number of 2's per row is exponentially small
232: compared to the maximum number of 2's per column. Our approximation scheme yields also the first approximation results
233: for $MPPH$.
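
\medskip

\noindent To make the ratios of Table~\ref{tab:ratios} concrete, we evaluate them at the smallest admissible value $\ell = 2$. This is plain arithmetic on the table entries, added for illustration:
\begin{align*}
PH(*,2): \quad & \tfrac{3}{2}\cdot 2 + \tfrac{1}{2} = \tfrac{7}{2},\\
PH(*,2) \text{ with a 2 in every genotype}: \quad & \tfrac{3}{4}\cdot 2 + \tfrac{7}{4} - \tfrac{3}{2}\cdot\tfrac{1}{3} = \tfrac{11}{4},\\
MPPH(*,2): \quad & 2 \cdot 2 = 4,\\
MPPH(*,2) \text{ with a 2 in every genotype}: \quad & 2 + 2 - \tfrac{2}{3} = \tfrac{10}{3}.
\end{align*}
For comparison, the ratio $2^{k-1}$ evaluates to $4$ at $k=3$ and to $16$ at $k=5$, whereas the values above do not depend on $k$.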
234: 
235: \medskip
236: 
237: \noindent As explained by Sharan et al. in their ``islands of tractability'' paper \cite{islands}, identifying
238: tractable special classes can be practically useful for constructing high-speed subroutines within ILP solvers, but
239: perhaps the most significant aspect of this paper is the analysis underpinning the results, which - by deepening our
240: understanding of how this problem behaves - assists the search for better, faster approximation algorithms and for
241: determining the exact shorelines of the islands of tractability.
242: 
243: Furthermore, the fact that - prior to this paper - concrete and positive results for $MPPH$ had not been
244: obtained (except for rather pessimistic modifications to ILP models \cite{brown}), means that the algorithms given
245: here for the $MPPH$ cases, and the
246:  data structures used in their analysis (e.g. the \emph{restricted compatibility graph} in
247: Section~\ref{sec:posres}), assume particular importance.
248: 
249: Finally, this paper yields some interesting open problems, of which the outstanding $(*,2)$ case (for both
250: $PH$ and $MPPH$) is only one; prominent amongst these questions (which are discussed at the end of the paper) is the
251: question of whether $MPPH$ and $PH$ instances are inter-reducible, at least within the bounded-instance framework.
252: 
253: \medskip
254: 
255: \noindent The paper is organised as follows. In Section~\ref{sec:negres} we give the hardness results, in
256: Section~\ref{sec:posres} we present the polynomial-time solvable cases, in Section~\ref{sec:approx} we give
257: approximation algorithms and we finish in Section~\ref{sec:concl} with conclusions and open problems.
258: %
259: \section{Hard problems}
260: \label{sec:negres}
261: \begin{theorem}
262: \label{lem:33phyloAPX} $MPPH(3,3)$ is APX-hard.
263: \end{theorem}
264: \begin{proof}
265: The proof in \cite{nphardnote} that $MPPH$ is NP-hard uses a reduction from {\sc Vertex Cover}, which can be modified
266: to yield NP-hardness and APX-hardness for (3,3)-bounded instances. Given a graph $T=(V,E)$ the reduction in
267: \cite{nphardnote} constructs a genotype matrix $G(T)$ of $MPPH$ with $|V|+|E|$ rows and $2|V|+|E|$ columns. For every
268: vertex $v_i \in V$ there is a genotype (row) $g_i$ in $G(T)$ with $g_i(i)=1$, $g_i(i+|V|)=1$ and $g_i(j)=0$ for every
269: other position $j$. In addition, for every edge $e_k=\{v_{h},v_{l}\}$ there is a genotype $g_k$ with $g_k(h)=2$,
270: $g_k(l)=2$, $g_k(2|V|+k)=2$ and $g_k(j)=0$ for every other position $j$. Bafna et al. \cite{nphardnote} prove that an
271: optimal solution for $MPPH$ with input $G(T)$ contains $|V| + |E| + VC(T)$ haplotypes, where $VC(T)$ is the size of
272: the smallest vertex cover in $T$.
273: 
274: {\sc 3-Vertex Cover} is the vertex cover problem when every vertex in the input graph has degree at most 3. It is
275: known to be APX-hard \cite{deg3}\cite{cubic}. Let $T$ be an instance of {\sc 3-Vertex Cover}. We assume that $T$ is
276: connected. Observe that for such a $T$ the reduction described above yields a $MPPH$ instance $G(T)$ that is
277: $(3,3)$-bounded. We show that existence of a polynomial-time $(1+\epsilon)$ approximation algorithm $A(\epsilon)$ for
278: $MPPH$ would imply a polynomial-time $(1+\epsilon')$ approximation algorithm for {\sc 3-Vertex Cover} with
279: $\epsilon'=8\epsilon$.\footnote[1]{Strictly speaking this is insufficient to prove APX-hardness but it is not
280: difficult to show that the described reduction is actually an L-reduction \cite{deg3}, from which APX-hardness
281: follows.}
282: 
283: Let $t$ be the solution value for $MPPH(G(T))$ returned by $A(\epsilon)$, and $t^*$ the optimal value for
284: $MPPH(G(T))$. By the argument mentioned above from \cite{nphardnote} we obtain a solution with value $d = t - |V| -
285: |E|$ as an approximation of $VC(T)$. Since $t \leq (1+\epsilon)t^*$, we have $d \leq VC(T) + \epsilon VC(T) + \epsilon
286: |V| + \epsilon |E|$. Connectedness of $T$ implies that $|V|-1 \leq |E|$. In {\sc 3-Vertex Cover}, a single vertex can
287: cover at most 3 edges in $T$, implying that $VC(T) \geq |E|/3 \geq (|V|-1)/3$. Hence $|E| \leq 3 VC(T)$ and, for $|V| \geq 2$,
288: $|V| \leq 3 VC(T) + 1 \leq 4 VC(T)$, so that:
289: \begin{align*}
290: d & \leq VC(T) + \epsilon VC(T) + 4\epsilon VC(T) + 3\epsilon VC(T)\\
291: & \leq VC(T) + 8 \epsilon VC(T)\\
292: & \leq (1 + 8\epsilon) VC(T).
293: \end{align*}
294: \end{proof}
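
\medskip

\noindent The construction of $G(T)$ used in the proof above can be sketched in a few lines of Python. This is an illustration under our own conventions: the function name \texttt{mpph\_instance\_from\_graph} is hypothetical, and vertices, edges and columns are indexed from 0 rather than from 1 as in the text.

\begin{verbatim}
```python
def mpph_instance_from_graph(num_vertices, edges):
    """Build G(T): |V| + |E| rows and 2|V| + |E| columns (0-indexed).

    edges is a list of pairs (h, l) of vertex indices.
    """
    n, m = num_vertices, len(edges)
    width = 2 * n + m
    G = []
    # One genotype per vertex v_i: 1's in columns i and i + |V|.
    for i in range(n):
        row = [0] * width
        row[i] = 1
        row[i + n] = 1
        G.append(row)
    # One genotype per edge e_k = {v_h, v_l}: 2's in columns h, l and 2|V| + k.
    for k, (h, l) in enumerate(edges):
        row = [0] * width
        row[h] = 2
        row[l] = 2
        row[2 * n + k] = 2
        G.append(row)
    return G
```
\end{verbatim}

\noindent Every edge-genotype has exactly three 2's, and column $h$ receives one 2 for each edge incident to $v_h$, so a graph of maximum degree 3 yields a $(3,3)$-bounded matrix, as observed in the proof.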
295: \begin{theorem}
296: \label{lem:33apx} $PH(3,3)$ is APX-hard.
297: \end{theorem}
298: \begin{proof}
299: \looseness=-1
300: The proof by Sharan et al. \cite{islands} that $PH(4,3)$ is APX-hard can be modified slightly to obtain
301: APX-hardness of $PH(3,3)$. The reduction is from {\sc 3-Dimensional Matching} with each element occurring in at most
302: three triples (3DM3): given disjoint sets $X$, $Y$ and $Z$ containing $\nu$ elements each and a set
303: $C=\{c_0,\ldots,c_{\mu-1}\}$ of $\mu$ triples in $X\times Y\times Z$ such that each element occurs in at most three
304: triples in $C$, find a maximum cardinality set $C' \subseteq C$ of disjoint triples.
305: 
306: From an instance of 3DM3 we build a genotype matrix $G$ with $3 \nu
307: + 3\mu$ rows and $6\nu+4\mu$ columns. The first $3\nu$ rows are
308: called \emph{element-genotypes} and the last $3\mu$ rows are called
309: \emph{matching-genotypes}. We specify non-zero entries of the
310: genotypes only.\footnote[2]{Only in this proof we index haplotypes,
311: genotypes and matrices starting with 0, which makes notation
312: consistent with \cite{islands}.} For every element $x_i \in X$
313: define element-genotype $g^x_i$ with $g^x_i(3\nu+i)=1$;
314: $g^x_i(6\nu+4k)=2$ for all $k$ with $x_i \in c_k$. If $x_i$ occurs
315: in at most two triples we set $g^x_i(i)=2$. For every element $y_i
316: \in Y$ there is an element-genotype $g^y_i$ with $g^y_i(4\nu+i)=1$;
317: $g^y_i(6\nu+4k)=2$ for all $k$ with $y_i \in c_k$ and if $y_i$
318: occurs in at most two triples then we set $g^y_i(\nu + i)=2$. For
319: every element $z_i \in Z$ there is an element-genotype $g^z_i$ with
320: $g^z_i(5\nu+i)=1$; $g^z_i(6\nu+4k)=2$ for all $k$ with $z_i \in c_k$
321: and if $z_i$ occurs in at most two triples then we set
322: $g^z_i(2\nu+i)=2$. For each triple $c_k=\{ x_{i_1},
323: y_{i_2},z_{i_3}\} \in C$ there are three matching-genotypes $c_k^x$,
324: $c_k^y$ and $c_k^z$: $c_k^x$ has $c_k^x(3\nu+i_1)=2$,
325: $c_k^x(6\nu+4k)=1$ and $c_k^x(6\nu+4k+1)=2$; $c_k^y$ has
326: $c_k^y(4\nu+i_2)=2$, $c_k^y(6\nu+4k)=1$ and $c_k^y(6\nu+4k+2)=2$;
327: $c_k^z$ has $c_k^z(5\nu+i_3)=2$, $c_k^z(6\nu+4k)=1$ and
328: $c_k^z(6\nu+4k+3)=2$.
329: 
330: Notice that the element-genotypes only have a 2 in the first $3\nu$ columns if the element occurs in at most two
331: triples. This is the only difference with the reduction from \cite{islands}, where every element-genotype has a 2 in
332: the first $3\nu$ columns: i.e., for elements $x_i\in X$, $y_i\in Y$ or $z_i\in Z$ a 2 in column $i$, $\nu+i$ or
333: $2\nu+i$, respectively. As a direct consequence our genotype matrix has only three 2's per row in contrast to the four
334: 2's per row in the original reduction.
335: 
336: We claim that for this (3,3)-bounded instance exactly the same arguments can be used as for the (4,3)-bounded
337: instance. In the original reduction the left-most 2's ensured that, for each element-genotype, at most one of the two
338: haplotypes used to resolve it was used in the resolution of other genotypes. Clearly this remains true in our modified
339: reduction for elements appearing in two or fewer triples, because the corresponding left-most 2's have been retained.
340: So consider an element $x_i$ appearing in three triples and suppose, by way of contradiction, that \emph{both}
341: haplotypes used to resolve $g^x_i$ are used in the resolution of other genotypes. Now, the 1 in position $3\nu+i$
342: prevents this element-genotype from sharing haplotypes with other element-genotypes, so genotype $g^x_i$ must share
343: both its haplotypes with matching-genotypes. Note that, because $g^x_i(3\nu+i)=1$, the genotype $g^x_i$ can only
344: possibly share haplotypes with matching-genotypes corresponding to triples that contain $x_i$. Indeed, if $x_i$ is in
345: triples $c_{k_1}$, $c_{k_2}$ and $c_{k_3}$ then the only genotypes with which $g^x_i$ can potentially share haplotypes
346: are $c^x_{k_1}$, $c^x_{k_2}$ and $c^x_{k_3}$. Genotype $g^x_i$ cannot share both its haplotypes with the same
347: matching-genotype (e.g. $c^{x}_{k_1}$) because both haplotypes of $g^x_i$ will have a 1 in column $3\nu +i$ whilst
348: only one of the two haplotypes for $c^{x}_{k_1}$ will have a 1 in that column. So, without loss of generality, $g^x_i$
349: is resolved by a haplotype that $c^x_{k_1}$ uses and a haplotype that $c^x_{k_2}$ uses. However, this is not possible,
350: because $g^x_i$ has a 2 in the column corresponding to $c_{k_3}$, whilst both $c^{x}_{k_1}$ and $c^{x}_{k_2}$ have a 0
351: in that column, yielding a contradiction.
352: 
353: Note that, in the original reduction, it was not only true that each element-genotype shared at most one of its
354: haplotypes, but - more strongly - it was also true that such a shared haplotype was used by exactly one other genotype
355: (i.e. the genotype corresponding to the triple the element gets assigned to). To see that this property is also
356: retained in the modified reduction observe that if (say) $g^x_i$ shares one haplotype with two genotypes $c^{x}_{k_1}$
357: and $c^{x}_{k_2}$ then $x_i$ must be in both triples $c_{k_1}$ and $c_{k_2}$, but this is not possible because, in the
358: two columns corresponding to triples $c_{k_1}$ and $c_{k_2}$, $c^{x}_{k_1}$ has 1 and 0 whilst $c^{x}_{k_2}$ has 0 and
359: 1.\\
360: \end{proof}
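
\medskip

\noindent The modified reduction can also be written out as a short Python sketch, which makes it easy to check the $(3,3)$ bound on small examples. The function name \texttt{ph33\_instance} and the encoding of a triple $\{x_{i_1},y_{i_2},z_{i_3}\}$ as an index tuple $(i_1,i_2,i_3)$ are our assumptions; indices start at 0, as in the proof.

\begin{verbatim}
```python
def ph33_instance(nu, triples):
    """Genotype matrix with 3*nu + 3*mu rows and 6*nu + 4*mu columns.

    triples[k] = (i1, i2, i3) encodes c_k = {x_i1, y_i2, z_i3}.
    """
    mu = len(triples)
    width = 6 * nu + 4 * mu
    rows = []
    # Element-genotypes, for X (s = 0), then Y (s = 1), then Z (s = 2).
    for s in range(3):
        for i in range(nu):
            row = [0] * width
            row[(3 + s) * nu + i] = 1
            occurrences = [k for k, c in enumerate(triples) if c[s] == i]
            for k in occurrences:
                row[6 * nu + 4 * k] = 2
            # The left-most 2 is kept only for elements in at most two triples.
            if len(occurrences) <= 2:
                row[s * nu + i] = 2
            rows.append(row)
    # Matching-genotypes c_k^x, c_k^y, c_k^z for every triple c_k.
    for k, c in enumerate(triples):
        for s in range(3):
            row = [0] * width
            row[(3 + s) * nu + c[s]] = 2
            row[6 * nu + 4 * k] = 1
            row[6 * nu + 4 * k + 1 + s] = 2
            rows.append(row)
    return rows
```
\end{verbatim}

\noindent For $\nu=3$ and the triples $(0,0,0)$, $(0,1,1)$, $(0,2,2)$, $(1,0,1)$, the element $x_0$ occurs in three triples and therefore loses its left-most 2, and the resulting matrix has at most three 2's in every row and every column.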
361: 
362: \section{Polynomial-time solvability}
363: \label{sec:posres}
364: \subsection{Parsimony haplotyping}
365: 
366: \noindent We will prove polynomial-time solvability of $PH$ on (*,1)-bounded instances.
367: 
368: We say that two genotypes $g_1$ and $g_2$ are \emph{compatible}, denoted as $g_1 \sim g_2$, if $g_1(j) =
369: g_2(j)$ or $g_1(j) = 2$ or $g_2(j) = 2$ for all $j$. A genotype $g$ and a haplotype $h$ are \emph{consistent} if $h$
370: can be used to resolve $g$, i.e.\ if $g(j)=h(j)$ or $g(j)=2$ for all $j$. The \emph{compatibility graph} is the graph
371: with vertices for the genotypes and an edge between two genotypes if they are compatible.
372: 
373: \medskip
374: 
375: \begin{lemma} \label{lem:labelling} If $g_1$ and $g_2$ are compatible rows of a genotype matrix with at most one $2$ per column
376: then there exists exactly one haplotype that is consistent with both $g_1$ and
377: $g_2$.\end{lemma}
378: \begin{proof}
379: The only haplotype that is consistent with both $g_1$ and $g_2$ is $h$ with $h(j) = g_1(j)$ for all $j$ with $g_1(j)
380: \neq 2$ and $h(j) = g_2(j)$ for all $j$ with $g_2(j) \neq 2$. There are no columns where $g_1$ and $g_2$ are both
381: equal to $2$ because there is at most one $2$ per column. In columns where $g_1$ and $g_2$ are both not equal to $2$
382: they are equal because $g_1$ and $g_2$ are compatible.\\
383: \end{proof}
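
\medskip

\noindent The notions of compatibility and consistency, and the haplotype constructed in the proof of Lemma~\ref{lem:labelling}, can be sketched in Python as follows. The function names are hypothetical, and \texttt{common\_haplotype} assumes (as the lemma does) that its two arguments are compatible and never both equal to 2 in the same column.

\begin{verbatim}
```python
def compatible(g1, g2):
    """g1 ~ g2: in every column the entries agree or one of them is a 2."""
    return all(a == b or a == 2 or b == 2 for a, b in zip(g1, g2))


def consistent(g, h):
    """Haplotype h can be used to resolve genotype g."""
    return all(a == b or a == 2 for a, b in zip(g, h))


def common_haplotype(g1, g2):
    """The haplotype consistent with both g1 and g2 (Lemma on (*,1) instances)."""
    if not compatible(g1, g2):
        raise ValueError("genotypes are not compatible")
    if any(a == 2 and b == 2 for a, b in zip(g1, g2)):
        raise ValueError("both genotypes have a 2 in the same column")
    # Take the entry of g1 unless it is a 2, in which case take that of g2.
    return tuple(b if a == 2 else a for a, b in zip(g1, g2))
```
\end{verbatim}

\noindent Applied to the genotypes $(0,0,1,0,2,0,1)$ and $(0,0,1,2,0,0,1)$, this yields the haplotype $(0,0,1,0,0,0,1)$.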
384: \medskip
385: We use the notation $g_1 \sim_h g_2$ if $g_1$ and $g_2$ are compatible and $h$ is consistent with both. We prove that
386: the compatibility graph has a specific structure. A \emph{1-sum} of two graphs is the result of identifying a vertex
387: of one graph with a vertex of the other graph. A 1-sum of $n+1$ graphs is the result of identifying a vertex of a
388: graph with a vertex of a 1-sum of $n$ graphs. See Figure~\ref{fig:compgraph} for an example of a 1-sum of three
389: cliques ($K_3$, $K_4$ and $K_2$).
390: 
391: \medskip
392: 
393: \begin{lemma} \label{lem:1sum} If $G$ is a genotype matrix with at most one $2$ per column then every connected component
394: of the compatibility graph of $G$ is a 1-sum of cliques, where edges in the same clique are labelled with the same
395: haplotype.
396: \end{lemma}
397: \begin{proof}
398: Let $C$ be the compatibility graph of $G$ and let $g_1,g_2,\ldots,g_k$ be a cycle in $C$. It suffices to show that
399: there exists a haplotype $h_c$ such that $g_{i} \sim_{h_c} g_{i'}$ for all $i,i'\in\{1,\ldots,k\}$. Consider an arbitrary
400: column $j$. If there is no genotype with a $2$ in this column then $g_1 \sim g_2 \sim \ldots \sim g_k$ implies that
401: $g_1(j)=g_2(j)=\ldots = g_k(j)$. Otherwise, let $g_{i_j}$ be the unique genotype with a $2$ in column $j$. Then $g_1
402: \sim g_2 \sim \ldots \sim g_{i_j-1}$ together with $g_1 \sim g_k \sim g_{k-1}\sim \ldots \sim g_{i_j+1}$ implies that
403: $g_{i}(j)=g_{i'}(j)$ for all $i,i' \in \{1,\ldots,k\} \setminus \{i_j\}$. Set $h_c(j)=g_i(j)$ for any $i \neq i_j$. Repeating
404: this for each column $j$ produces a haplotype $h_c$ such that indeed $g_{i} \sim_{h_c} g_{i'}$ for all
405: $i,i'\in\{1,\ldots,k\}$.\\
406: \end{proof}
407: 
408: \begin{figure}
409: \vspace{-24pt}
410: \begin{minipage}{.45\textwidth}
411: \begin{center}
412: \begin{tabular}{ll}
413: $\begin{array}{c}
414: g_1\\
415: g_2\\
416: g_3\\
417: g_4\\
418: g_5\\
419: g_6\\
420: g_7\\
421: \end{array}$
422: & $\begin{bmatrix}
423: 0 & 0 & 1 & 0 & 2 & 0 & 1\\
424: 2 & 0 & 2 & 0 & 0 & 0 & 1\\
425: 0 & 0 & 1 & 2 & 0 & 0 & 1\\
426: 0 & 0 & 1 & 0 & 0 & 0 & 2\\
427: 0 & 0 & 1 & 1 & 0 & 2 & 1\\
428: 1 & 2 & 0 & 0 & 0 & 0 & 1\\
429: 0 & 0 & 1 & 1 & 0 & 0 & 1\\
430: \end{bmatrix}$
431: \end{tabular}
432: \end{center}
433: \end{minipage}
434: \begin{minipage}{.45\textwidth}
435: \begin{center}
436: \epsfig{file=./compgraph2.eps} \end{center}
437: \end{minipage}
438: \caption{Example of a genotype matrix and the corresponding compatibility graph, with $h_1=(0,0,1,1,0,0,1)$,
439: $h_2=(0,0,1,0,0,0,1)$ and $h_3=(1,0,0,0,0,0,1)$.} \label{fig:compgraph} \vspace{-12pt}
440: \end{figure}
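
\medskip

\noindent The structure asserted by Lemma~\ref{lem:1sum} can be checked on the matrix of Figure~\ref{fig:compgraph} with the following Python sketch. It groups the edges of the compatibility graph by their haplotype label and tests that each group spans a clique; the names \texttt{label\_groups} and \texttt{groups\_are\_cliques} are hypothetical, and rows are indexed from 0.

\begin{verbatim}
```python
from collections import defaultdict
from itertools import combinations


def compatible(g1, g2):
    return all(a == b or a == 2 or b == 2 for a, b in zip(g1, g2))


def common_haplotype(g1, g2):
    # Assumes g1 ~ g2 and at most one 2 per column.
    return tuple(b if a == 2 else a for a, b in zip(g1, g2))


def label_groups(G):
    """Map each edge label (a haplotype) to the set of rows it touches."""
    groups = defaultdict(set)
    for (i, g1), (j, g2) in combinations(enumerate(G), 2):
        if compatible(g1, g2):
            groups[common_haplotype(g1, g2)].update((i, j))
    return dict(groups)


def groups_are_cliques(G):
    """True if all rows sharing a label are pairwise joined by that label."""
    return all(
        compatible(G[i], G[j]) and common_haplotype(G[i], G[j]) == h
        for h, rows in label_groups(G).items()
        for i, j in combinations(sorted(rows), 2)
    )
```
\end{verbatim}

\noindent On the matrix of Figure~\ref{fig:compgraph} the sketch finds three groups, labelled $h_2$ on $\{g_1,g_2,g_3,g_4\}$, $h_1$ on $\{g_3,g_5,g_7\}$ and $h_3$ on $\{g_2,g_6\}$, in agreement with the 1-sum of $K_4$, $K_3$ and $K_2$ described in the text.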
441: \medskip
442: From this lemma, it follows directly that in $PH(*,1)$ the compatibility graph is {\em chordal}, meaning
443: that all its induced cycles are triangles. Every chordal graph has a \emph{simplicial} vertex, a vertex whose (closed)
444: neighbourhood is a clique. Deleting a vertex in a chordal graph gives again a chordal graph (see for example
445: \cite{blair} for an introduction to chordal graphs). The following lemma leads almost immediately to polynomial
446: solvability of $PH(*,1)$. We use set-operations for the rows of matrices: thus, e.g., $h\in H$ says $h$ is a row of
447: matrix $H$, $H\cup h$ says $h$ is added to $H$ as a row, and $H'\subset H$ says $H'$ is a submatrix consisting of rows
448: of $H$.
449: 
450: \medskip
451: 
452: \begin{lemma} \label{lem:starone} Given haplotype matrix $H'$ and genotype
453: matrix $G$ with at most one 2 per column it is possible to find, in polynomial time, a haplotype matrix $H$ that
454: resolves $G$, has $H'$ as a submatrix and has a minimum number of rows.
455: \end{lemma}
456: \begin{proof}
457: \looseness=-1 The proof is constructive. Let problem $(G,H')$ denote the above problem on input matrices $G$ and $H'$.
458: Let $C$ be the compatibility graph of $G$, which by Lemma~\ref{lem:1sum} is chordal. Suppose $g$ corresponds
459: to a simplicial vertex of $C$. Let $h_c$ be the unique haplotype consistent with any genotype in the closed
460: neighbourhood clique of $g$. We extend matrix $H'$ to $H''$ and update graph $C$ as follows.
461: \begin{enumerate}
462: \item If $g$ has no $2$'s it can be resolved with only one haplotype $h=g$. We set $H''=H'\cup h$ and remove $g$ from $C$.
463: \item Else, if there exist rows $h_1\in H'$ and $h_2\in H'$ that resolve $g$ we set $H''=H'$ and remove $g$ from $C$.
464: \item Else, if there exists $h_1\in H'$ such that $g=h_1+h_c$ we set $H''=H'\cup h_c$ and remove $g$ from $C$.
465: \item Else, if there exist $h_1\in H'$ and $h_2\notin H'$ such that $g=h_1+h_2$ we set $H''=H'\cup h_2$ and remove $g$ from $C$.
466: \item Else, if $g$ is not an isolated vertex in $C$ then there exists a haplotype $h_1$ such that $g=h_1+h_c$ and we set
467: $H''=H'\cup \{h_1, h_c\}$ and remove $g$ from $C$.
468: \item Otherwise, $g$ is an isolated vertex in $C$ and we set $H''=H'\cup \{h_1, h_2\}$ for any $h_1$ and $h_2$ such that
469: $g=h_1+h_2$ and remove $g$ from $C$.
470: \end{enumerate}
471: The resulting graph is again chordal and we repeat the above procedure for $H'=H''$ until all vertices are removed from $C$.
472: Let $H$ be the final haplotype matrix $H''$. It is clear from the construction that $H$ resolves $G$.
473: 
474: We prove that $H$ has a minimum number of rows by induction on the number of genotypes. Clearly, if $G$ has only one
475: genotype the algorithm constructs the only, and hence optimal, solution. The induction hypothesis is that the
476: algorithm finds an optimal solution to the problem $(G,H')$ for any haplotype matrix $H'$ if $G$ has at most $n-1$
477: rows. Now consider haplotype matrix $H'$ and genotype matrix $G$ with $n$ rows. The first step of the algorithm
478: selects a simplicial vertex $g$ and proceeds with one of the cases 1 to 6. The algorithm then finds (by the induction
479: hypothesis) an optimal solution $H$ to problem $(G\setminus\{g\},H'')$. It remains to prove that $H$ is also an
480: optimal solution to problem $(G,H')$. We do this by showing that an optimal solution $H^*$ to problem $(G,H')$ can be
481: modified  to include $H''$. We prove this for every case of the algorithm separately.
482: 
483: \begin{enumerate}
484: \item In this case $h\in H^*$, since $g$ can only be resolved by $h$.\smallskip \item In this case $H''=H'$ and hence
485: $H''\subseteq H^*$.\smallskip \item Suppose that $h_c \notin H^*$. Because we are not in case $2$ we know that there
486: are two rows in $H^*$ that resolve $g$ and at least one of the two, say $h^*$, is not a row of $H'$. Since $h_c$ is
487: the unique haplotype consistent with (the simplicial) $g$ and any compatible genotype, $h^*$ cannot be consistent
488: with any other genotype than $g$. Thus, replacing $h^*$ by $h_c$ gives a solution with the same number of rows but
489: containing $h_c$. \smallskip \item Suppose that $h_2\notin H^*$. Because we are not in case $2$ or $3$ we know that
490: there is a haplotype $h^*\in H^*$ consistent with $g$, $h^*\notin H'$ and $h^*\neq h_c$. Hence it is not consistent
491: with any other genotypes than $g$ and we can replace $h^*$ by $h_2$. \smallskip \item Suppose that $h_1\notin H^*$ or
$h_c\notin H^*$. Because we are not in case $2$, $3$ or $4$, there are haplotypes $h^*\in H^*\backslash H'$ and
$h^{**}\in H^*\backslash H'$ that resolve $g$. If neither $h^*$ nor $h^{**}$ is equal to $h_c$ then they are not
494: consistent with any other genotype than $g$. Replacing $h^*$ and $h^{**}$ by $h_1$ and $h_c$ leads to another optimal
495: solution. If one of $h^*$ and $h^{**}$ is equal to $h_c$ then we can replace the other one by $h_1$. \smallskip \item
496: \looseness=-1 Suppose that $h_1\notin H^*$ or $h_2\notin H^*$. There are haplotypes $h^*,h^{**}\in H^*\backslash H'$
497: that resolve $g$ and just $g$ since $g$ is an isolated vertex. Replacing $h^*$ and $h^{**}$ by $h_1$ and $h_2$ gives
498: an optimal solution containing $h_1$ and $h_2$.
499: \end{enumerate}
500: \end{proof}
501: \begin{theorem} \label{prop:starone}
502: The problem $PH(*,1)$ can be solved in polynomial time. \end{theorem}
503: \begin{proof}
504: The proof follows from Lemma~\ref{lem:starone}. Construction of the compatibility graph takes $O(n^2m)$ time, for an
$n \times m$ input matrix. Finding an ordering in which to delete the simplicial vertices can be done in time
$O(n^2)$ \cite{rose} and resolving each of the $n$ vertices takes $O(n^2m)$ time. The overall running time of the algorithm is
therefore $O(n^3m)$.\\
508: \end{proof}
509: 
510: \subsection{Minimum perfect phylogeny haplotyping}
511: 
512: \noindent
513: \looseness=-1 Polynomial-time solvability of $PH$ on $(2,*)$-bounded instances has been shown in \cite{wabi} and
514: \cite{lancia}. We prove it for $MPPH(2,*)$. We start with a definition.
515: 
516: \medskip
517: 
518: \begin{definition} \label{def:redres} For two columns of a genotype matrix we say that a \emph{reduced resolution} of these columns
519: is the result of applying the following rules as often as possible to the submatrix induced by these columns: deleting
520: one of two identical rows and the replacement rules\\ $\begin{bmatrix} 2 & a \end{bmatrix} \rightarrow
521: \begin{bmatrix} 1 & a \\ 0 & a \end{bmatrix}$, $\begin{bmatrix} a & 2 \end{bmatrix} \rightarrow \begin{bmatrix} a & 1
522: \\ a & 0 \end{bmatrix}$, $\begin{bmatrix} 2 & 2 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix}$ and
523: $\begin{bmatrix} 2 & 2 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, for $a \in \{0,1\}$.\\
524: \end{definition}
525: %
526: Note that two columns can have more than one reduced resolution if there is a genotype with a 2 in both these columns.
527: The reduced resolutions of a column pair of a genotype matrix $G$ are submatrices of (or equal to) $F$ and represent
528: all possibilities for the submatrix induced by the corresponding two columns of a minimal haplotype matrix $H$
529: resolving $G$, after collapsing identical rows.
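
\medskip

\noindent As a small illustration of Definition~\ref{def:redres}, consider a hypothetical column pair inducing the
submatrix with rows $[2~2]$, $[0~1]$ and $[1~0]$. Here we take $F$, up to the order of its rows, to be the matrix with
rows $[0~0]$, $[0~1]$, $[1~0]$ and $[1~1]$; this reading is an assumption of the example. The two replacement rules for
$[2~2]$ give
\[
\begin{bmatrix} 2 & 2 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \rightarrow
\begin{bmatrix} 1 & 1 \\ 0 & 0 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}
\quad {\rm and} \quad
\begin{bmatrix} 2 & 2 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \rightarrow
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix},
\]
where in the second case the duplicated rows $[1~0]$ and $[0~1]$ have been deleted. The first reduced resolution
contains all four rows and is therefore equal to $F$; the second one is a proper submatrix of $F$.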
530: 
531: \medskip
532: 
533: \begin{theorem} \label{prop:startwophylo}
534: The problem $MPPH(2,*)$ can be solved in polynomial time.
535: \end{theorem}
536: \begin{proof}
We reduce $MPPH(2,*)$ to $PH(2,*)$, which can be solved in polynomial time (see above). Let $G$ be an instance of
538: $MPPH(2,*)$. We may assume that any two rows are different.
539: 
Take the submatrix of any two columns of $G$. If it does not contain a [2~2] row, then in terms of
Definition~\ref{def:redres} there is only one reduced resolution. If it contains two or more [2~2] rows then,
since by assumption all genotypes are different and each genotype has at most two 2's, $G$ must have $\begin{bmatrix} 2 & 2 & 0 \\
2 & 2 & 1 \end{bmatrix}$ and therefore $\begin{bmatrix} 2 & 0 \\
2 & 1 \end{bmatrix}$ as a submatrix, which can only be resolved by a haplotype matrix containing the forbidden
submatrix $F$. It follows that in this case the instance is infeasible. If it contains exactly one [2~2] row, then
there are clearly two reduced resolutions. Thus we may assume that for each column pair there are at most two reduced
resolutions.
548: 
549: Observe that if for some column pair all reduced resolutions are equal to $F$ the instance is again infeasible. On the
550: other hand, if for all column pairs none of the reduced resolutions is equal to $F$ then $MPPH(2,*)$ is equivalent to
551: $PH(2,*)$ because any minimal haplotype matrix $H$ that resolves $G$ admits a perfect phylogeny. Finally, consider a
552: column pair with two reduced resolutions, one of them containing $F$. Because there are two reduced resolutions there
553: is a genotype $g$ with a 2 in both columns. Let $h_1$ and $h_2$ be the haplotypes that correspond to the resolution of
554: $g$ that does not lead to $F$. Then we replace $g$ in $G$ by $h_1$ and $h_2$, ensuring that a minimal haplotype matrix
555: $H$ resolving $G$ can not have $F$ as a submatrix in these two columns.
556: 
557: Repeating this procedure for every column pair either tells us that the matrix $G$ was an infeasible instance or
558: creates a genotype matrix $G'$ such that any minimal haplotype matrix $H$ resolves $G'$ if and only if $H$ resolves
559: $G$, and $H$ admits a perfect phylogeny.\\
560: \end{proof}
561: 
562: \medskip
563: 
564: \begin{theorem} \label{prop:staronephylo} The problem $MPPH(*,1)$ can be solved in polynomial time.
565: \end{theorem}
566: \begin{proof}
567: Similar to the proof of Theorem~\ref{prop:startwophylo} we reduce $MPPH(*,1)$ to $PH(*,1)$. As there, consider for any
568: pair of columns of the input genotype matrix $G$ its reduced resolutions, according to Definition~\ref{def:redres}. Since
569: $G$ has at most one $2$ per column there is at most one genotype with 2's in both columns. Hence there are at most two
570: reduced resolutions. If all reduced resolutions are equal to the forbidden submatrix $F$ the instance is infeasible.
571: If on the other hand for all column pairs no reduced resolution is equal to $F$ then in fact $MPPH(*,1)$ is equivalent
572: to $PH(*,1)$, because any minimal haplotype matrix resolving $G$ admits a perfect phylogeny.
573: 
574: As in the proof of Theorem~\ref{prop:startwophylo} we are left with considering column pairs for which one of the two
575: reduced resolutions is equal to $F$. For such a column pair there must be a genotype $g$ that has 2's in both these
576: columns. The other genotypes have only 0's and 1's in them. Suppose we get a forbidden submatrix $F$ in these columns
577: of the solution if $g$ is resolved by haplotypes $h_1$ and $h_2$, where $h_1$ has $a$ and $b$ and therefore $h_2$ has
578: $1-a$ and $1-b$ in these columns, $a,b\in \{0,1\}$. We will change the input matrix $G$ such that if $g$ gets resolved
579: by such a \emph{forbidden resolution} these haplotypes are not consistent with any other genotypes. We do this by
580: adding an extra column to $G$ as follows. The genotype $g$ gets a $1$ in this new column. Every genotype with $a$ and
581: $b$ or with $1-a$ and $1-b$ in the considered columns gets a $0$ in the new column. Every other genotype gets a $1$ in
582: the new column. For example, the matrix
583: \[
584: \begin{bmatrix} 2 & 2 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}
585: {\rm \ gets\ one\ extra\ column\ and\ becomes}
586: \begin{bmatrix} 2 & 2 & 1 \\ 0 & 1 & 1\\ 1 & 0 & 1\\ 1 & 1 & 0\end{bmatrix}.
587: \]
588: \noindent Denote by  $G_{mod}$  the result of modifying $G$ by adding such a column for every pair of columns with
589: exactly one `bad' and one `good' reduced resolution. It is not hard to see that any optimal solution to $PH(*,1)$ on
590: $G_{mod}$ can be transformed into a solution to $MPPH(*,1)$ on $G$ of the same cardinality (indeed, any two haplotypes
591: used in a forbidden resolution of a genotype $g$ in $G_{mod}$ are not consistent with any other genotype of $G_{mod}$,
592: and hence may be replaced by two other haplotypes resolving $g$ in a non-forbidden way). Now, let $H$ be an optimal
593: solution to $MPPH(*,1)$ on $G$. We can modify $H$ to obtain a solution to $PH(*,1)$ on $G_{mod}$ of the same
594: cardinality as follows. We modify every haplotype in $H$ in the same way as the genotypes it resolves. From the
595: construction of $G_{mod}$ it follows that two compatible genotypes are only modified differently if the haplotype they
596: are both consistent with is in a forbidden resolution. However, in $H$ no genotypes are resolved with a forbidden
597: resolution since $H$ is a solution to $MPPH(*,1)$. We conclude that optimal solutions to $PH(*,1)$ on $G_{mod}$
598: correspond to optimal solutions to $MPPH(*,1)$ on $G$ and hence the latter problem can be solved in polynomial time,
599: by Theorem \ref{prop:starone}.
600: 
If we use the algorithm from the proof of Lemma~\ref{lem:starone} as a subroutine we get an overall running time of
$O(n^3m^2)$, for an $n \times m$ input matrix, since $G_{mod}$ has $n$ rows and at most $m+\binom{m}{2}=O(m^2)$ columns.\\
603: \end{proof}
604: \medskip
605: \medskip
606: The borderline open complexity problems are now $PH(*,2)$ and $MPPH(*,2)$. Unfortunately, we have not found the answer
607: to these complexity questions. However, the borders have been pushed slightly further. In \cite{islands} $PH(*,2)$ is
shown to be polynomially solvable if the input genotypes have the complete graph as compatibility graph; we call this
609: problem $PH(*,2)$-$C1$. We will give the counterpart result for $MPPH(*,2)$-$C1$.
610: 
Let $G$ be an $n \times m$ $MPPH(*,2)$-$C1$ input matrix. Since the compatibility graph is a clique, every column of
$G$ contains only one symbol besides possible 2's. If we replace in every 1-column of $G$ (a column containing only
1's and 2's) the 1's by 0's and mark the SNP corresponding to this column `flipped', then we obtain an equivalent
problem on a $\{0,2\}$-matrix $G'$. To see that this problem is indeed equivalent, suppose $H'$ is a haplotype matrix
resolving this modified genotype matrix $G'$ and suppose $H'$ does not contain the forbidden submatrix $F$. Then by
interchanging 0's and 1's in every column of $H'$ corresponding to a flipped SNP, one obtains a haplotype matrix $H$
without the forbidden submatrix which resolves the original input matrix $G$. The converse follows in the same way.
Hence, from now on we will assume, without loss of generality, that the input matrix $G$ is a $\{0,2\}$-matrix.
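
\medskip

\noindent To illustrate the flipping argument on a hypothetical instance (chosen here for concreteness), let
\[
G = \begin{bmatrix} 2 & 1 & 0 \\ 2 & 2 & 0 \\ 0 & 1 & 2 \end{bmatrix},
\qquad
G' = \begin{bmatrix} 2 & 0 & 0 \\ 2 & 2 & 0 \\ 0 & 0 & 2 \end{bmatrix},
\]
where $G'$ is obtained from $G$ by flipping the second column. The matrix $H'$ with rows $(0,0,0)$, $(1,0,0)$,
$(0,1,0)$ and $(0,0,1)$ resolves $G'$, and since no row of $H'$ has two 1's it cannot contain $F$. Interchanging 0's and
1's in the second column of $H'$ gives the matrix $H$ with rows $(0,1,0)$, $(1,1,0)$, $(0,0,0)$ and $(0,1,1)$. One
checks directly that $(1,1,0)$ and $(0,1,0)$ resolve the first row of $G$, $(1,1,0)$ and $(0,0,0)$ the second, and
$(0,1,0)$ and $(0,1,1)$ the third. Moreover, for every pair of columns at least one of the four possible rows is
missing from $H$, so $H$ does not contain $F$ either.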
623: 
624: If we assume moreover that $n\geq 3$, which we do from here on, the \emph{trivial haplotype} $h_t$ defined as the
625: all-0 haplotype of length $m$ is the only haplotype consistent with all genotypes in $G$.
626: 
627: We define the \emph{restricted} compatibility graph $C_{R}(G)$ of
628: $G$ as follows. As in the normal compatibility graph, the vertices of
629: $C_{R}(G)$ are the genotypes of $G$. However, there is an edge
$\{g,g'\}$ in $C_{R}(G)$ only if $g \sim_{h} g'$ for some $h \neq
631: h_t$, or, equivalently, if there is a column where both $g$ and $g'$
632: have a 2.
633: 
634: \medskip
635: 
636: \begin{lemma}
637: \label{lem:deg2} If $G$ is a feasible instance of $MPPH(*,2)$-$C1$
638: then every vertex in $C_R(G)$ has degree at most 2.
639: \end{lemma}
640: \begin{proof}
Any vertex of degree higher than 2 in $C_R(G)$ implies the existence
in $G$ of the submatrix
643: 
644: \medskip
645: \[
646: B= \begin{bmatrix} 2 & 2 & 2 \\ 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{bmatrix}
647: \]
648: \medskip
649: 
650: \noindent \looseness=+1 It is easy to verify that no resolution of this submatrix permits a perfect phylogeny.\\
651: \end{proof}
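
\medskip

\noindent The verification for $B$ can be sketched as follows, taking $F$ (up to row order) to be the matrix with rows
$[0~0]$, $[0~1]$, $[1~0]$ and $[1~1]$, which is an assumption of this sketch. The last three rows of $B$ each contain a
single 2 and are therefore resolved uniquely, which forces the haplotypes
\[
(0,0,0),\quad (1,0,0),\quad (0,1,0),\quad (0,0,1)
\]
into any solution. The first row of $B$ must be resolved by one of the complementary pairs
\[
\{(0,0,0),(1,1,1)\},\quad \{(1,0,0),(0,1,1)\},\quad \{(0,1,0),(1,0,1)\},\quad \{(0,0,1),(1,1,0)\}.
\]
Each pair contains a haplotype with 1's in two columns $i$ and $j$. Restricted to these two columns, that haplotype
contributes the row $[1~1]$, while the forced haplotypes contribute $[0~0]$, $[1~0]$ and $[0~1]$. Hence every
resolution of $B$ contains $F$ in some pair of columns.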
652: 
653: \medskip
654: 
655: Suppose that $G$ has two identical columns. There are either 0, 1 or 2 rows with 2's in both these columns.
656: In each case it is easy to see that any haplotype matrix $H$ resolving $G$ can be modified, without introducing
657:  a forbidden submatrix,  to make
658: the corresponding columns in $H$ equal as well (simply delete one column and duplicate another). This leads to the
659: first step of the algorithm {\bf A} that we propose for solving $MPPH(*,2)$-$C1$:
660: 
661: \medskip
662: 
663: \noindent {\bf Step 1 of A}: Collapse all identical columns in $G$.
664: 
665: \medskip
666: 
\noindent From now on, we assume that there are no identical columns. Let us partition the genotypes into $G_0$, $G_1$
and $G_2$, denoting the sets of genotypes in $G$ with, respectively, degree 0, 1 and 2 in $C_R(G)$. For any genotype
$g$ of degree 1 in $C_R(G)$ there is exactly one other genotype with a 2 in the same column as $g$. Because there are no
identical columns, it follows that any genotype $g$ of degree 1 in $C_R(G)$ can have at most two 2's. Similarly any
genotype of degree 2 in $C_R(G)$ has at most three 2's. Accordingly we define $G_1^1$ and $G_1^2$ as the genotypes in
$G_1$ that have one 2 and two 2's, respectively, and similarly $G_2^2$ and $G_2^3$ as the genotypes in $G_2$ with two
and three 2's, respectively.
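
\medskip

\noindent As an illustration of this classification, consider the following hypothetical $\{0,2\}$-instance without
identical columns (the matrix and the row labels $g_1,\ldots,g_4$ are introduced only for this example):
\[
\begin{array}{l}
g_1\\ g_2\\ g_3\\ g_4
\end{array}
\begin{bmatrix} 2 & 0 & 0 & 0 \\ 2 & 2 & 0 & 0 \\ 0 & 2 & 2 & 0 \\ 0 & 0 & 0 & 2 \end{bmatrix}.
\]
In $C_R(G)$ the only edges are $\{g_1,g_2\}$ (first column) and $\{g_2,g_3\}$ (second column). Thus $g_4$ has degree 0
and lies in $G_0$; $g_1$ has degree 1 and one 2, so $g_1\in G_1^1$; $g_3$ has degree 1 and two 2's, so $g_3\in G_1^2$;
and $g_2$ has degree 2 and two 2's, so $g_2\in G_2^2$. In this instance $G_2^3$ is empty.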
677: 
678: The following lemma states how genotypes in these sets  must  be resolved if no submatrix $F$ is allowed in
679: the solution. If genotype $g$ has $k$ 2's we denote by  $g[a_1,a_2,\ldots,a_k]$ the haplotype
680:  with entry $a_i$ in the position  where $g$ has its $i$-th 2  and 0 everywhere else.
681: 
682: \medskip
683: 
684: \begin{lemma}
685: \label{lem:usegeno} A haplotype matrix is a feasible solution to the problem $MPPH(*,2)$-$C1$ if and only if all genotypes are resolved in one of the following ways:
686: 
687: \noindent {\em (i)} A genotype $g\in G_1^1$ is resolved by $g[1]$ and $g[0]=h_t$. \\
688: {\em (ii)} A genotype $g\in G_2^2$ is resolved by $g[0,1]$ and $g[1,0]$. \\
689: {\em (iii)} A genotype $g\in G_1^2$ is either resolved by $g[0,0]=h_t$
690: and $g[1,1]$ or by $g[0,1]$ and $g[1,0]$. \\
691: {\em (iv)} A genotype $g\in G_2^3$ is either resolved by $g[1,0,0]$
692: and $g[0,1,1]$ or by $g[0,1,0]$ and $g[1,0,1]$ (assuming that
693:  the two neighbours of $g$ have a 2 in the first two positions where $g$ has a 2).
694: \end{lemma}
695: \begin{proof}
696: A genotype $g\in G_2^2$ has degree 2 in $C_R(G)$, which implies the existence in $G$ of a submatrix:
697: \medskip
698: \begin{center}
699: $D =$
700: \begin{tabular}{ll}
701: $\begin{array}{l}
702: g\\
703: g'\\
704: g''\\
705: \end{array}
706: \begin{bmatrix} 2 & 2 \\ 2 & 0 \\ 0 & 2 \end{bmatrix}$
707: \end{tabular}.
708: \end{center}
709: \medskip
710: \noindent Resolving $g$ with $g[0,0]$ and $g[1,1]$ clearly leads to the forbidden submatrix $F$. Similarly, resolving
711: a genotype $g\in G_2^3$ with $g[0,0,1]$ and $g[1,1,0]$ or with $g[0,0,0]$ and $g[1,1,1]$ leads to a forbidden
submatrix in the first two columns where $g$ has a 2. It follows that resolving the genotypes in a way other than
described in the lemma yields a haplotype matrix which does not admit a perfect phylogeny.
716: 
717: Now suppose that all genotypes are resolved as described in the lemma and assume that there is a forbidden submatrix
718: $F$ in the solution. Without loss of generality,  we assume $F$ can be found in the first two columns of the solution
719: matrix. We may also assume that no haplotype can be deleted from the solution. Then, since $F$ contains [1 1], there
720: is a genotype $g$ starting with [2~2]. Since there are no identical columns there are only two possibilities. The
721: first possibility is that there is exactly one other genotype $g'$ with a 2 in exactly one of the first two columns.
722: Since all genotypes different from $g$ and $g'$ start with [0 0], none of the resolutions of $g$ can have created the
723: complete submatrix $F$. Contradiction. The other possibility is that there is exactly one genotype with a 2 in the
724: first column and exactly one genotype with a 2 in the second column, but these are different genotypes, i.e. we have
725: the submatrix $D$. Then $g\in G_2^3$ or $g\in G_2^2$ and it can again be checked that none of the resolutions in (ii)
726: and (iv) leads to the forbidden submatrix.\\
727: \end{proof}
728: 
729: \medskip
730: 
731: \begin{lemma} Let $G$ be an instance of $MPPH(*,2)$ and $G_1^2$, $G_2^3$ as defined above.
732: \label{lem:private} \\ {\em (i)} Any nontrivial haplotype is consistent
733: with at most two genotypes in $G$.\\
734: {\em (ii)} A genotype $g\in G_1^2\cup G_2^3$ must be resolved using at least one haplotype that is not consistent with
735: any other genotype.
736: \end{lemma}
737: \begin{proof} {\em (i)} Let $h$ be a nontrivial haplotype.
738: There is a column where $h$ has a 1 and there are at most
739: two genotypes with a 2 in that column. \\
740: {\em (ii)} A genotype $g\in G_1^2\cup G_2^3$ has a 2 in a column that has no other 2's. Hence there is a haplotype
741: with a 1 in this column and this haplotype is not consistent with any other genotypes.\\
742: \end{proof}
743: 
744: \medskip
745: 
746:  A haplotype that is only consistent with $g$ is called a \emph{private haplotype} of $g$. Based on (i) and
747: (ii) of Lemma~\ref{lem:usegeno} we propose the next step of {\bf A}:
748: 
749: \medskip
750: 
751: \noindent {\bf Step 2 of A}: \looseness=-1 Resolve all $g\in G_1^1 \cup G_2^2$ by the unique haplotypes allowed to
752: resolve them according to Lemma~\ref{lem:usegeno}. Also resolve each $g\in G_0$ with $h_t$ and the complement of $h_t$
753: with respect to $g$. This leads to a partial haplotype matrix $H_2^p$.
754: 
755: \medskip
756: 
757: \noindent The next step of {\bf A} is based on Lemma~\ref{lem:private} (ii).
758: 
759: \medskip
760: \noindent {\bf Step 3 of A}: \looseness=-1 For each $g\in G_1^2 \cup G_2^3$ with $g\sim_{h'}g'$ for some $h'\in H_2^p$
761: that is allowed to resolve $g$ according to Lemma~\ref{lem:usegeno}, resolve $g$ by adding the complement $h''$ of
762: $h'$ w.r.t. $g$ to the set of haplotypes, i.e. set $H_2^p := H_2^p \cup \{h''\}$, and repeat this step as long as new
763: haplotypes get added. This leads to partial haplotype matrix $H_3^p$.
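
\medskip

\noindent A small hypothetical run of Steps 2 and 3 may help to fix ideas; the instance and the labels
$g_1,\ldots,g_4$ are ours and serve only as an example. Let $G$ have rows $g_1=(2,0,0,0)$, $g_2=(2,2,0,0)$,
$g_3=(0,2,2,0)$ and $g_4=(0,0,0,2)$, so that $g_1\in G_1^1$, $g_2\in G_2^2$, $g_3\in G_1^2$ and $g_4\in G_0$. Step 2
resolves $g_1$ by $g_1[1]=(1,0,0,0)$ and $h_t$, resolves $g_2$ by $g_2[0,1]=(0,1,0,0)$ and $g_2[1,0]=(1,0,0,0)$, and
resolves $g_4$ by $h_t$ and $(0,0,0,1)$, so that
\[
H_2^p = \{(0,0,0,0),\ (1,0,0,0),\ (0,1,0,0),\ (0,0,0,1)\}.
\]
In Step 3, one admissible choice is $h'=(0,1,0,0)=g_3[1,0]\in H_2^p$, which is consistent with both $g_3$ and $g_2$ and
is allowed for $g_3$ by (iii) of Lemma~\ref{lem:usegeno}. Its complement with respect to $g_3$ is $h''=(0,0,1,0)$, giving
\[
H_3^p = H_2^p \cup \{(0,0,1,0)\}.
\]
All four genotypes are now resolved with five haplotypes, no genotype is left over, and since no haplotype has two 1's
the matrix cannot contain a row $[1~1]$ in any column pair.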
764: 
765: \medskip
766: 
767: \noindent Notice that $H_3^p$ does not contain any haplotype that is
768: allowed to resolve any of the genotypes that have not been resolved
in Steps 2 and 3. Let us denote this set of leftover, unresolved
genotypes by $GL$, the degree 1 vertices among those by $GL_1\subseteq G_1^2$, and the
degree 2 vertices among those by $GL_2\subseteq G_2^3$. The restricted
compatibility graph induced by $GL$, which we denote by $C_R(GL)$,
consists of paths and cycles. We first give the final steps of
algorithm {\bf A} and argue optimality afterwards.
775: 
776: \medskip
777: 
778: \noindent {\bf Step 4 of A}: Resolve each cycle in $C_R(GL)$, necessarily consisting of $GL_2$-vertices, by starting
779: with an arbitrary vertex and, following the cycle, resolving each next pair $g,g'$ of vertices by haplotype $h \neq
780: h_t$ such that $g\sim_h g'$ and the two complements of $h$ w.r.t. $g$ and $g'$ respectively. In case of an odd cycle
781: the last vertex is resolved by any pair of haplotypes that is allowed to resolve it. Note that $h$ has a 1 in the
782: column where both $g$ and $g'$ have a 2 and otherwise 0. It follows easily that $g$ and $g'$ are both allowed to use
783: $h$ (and its complement) according to (iv) of Lemma~\ref{lem:usegeno}.
784: 
785: \medskip
786: 
787: \noindent {\bf Step 5 of A}: Resolve each path in $C_R(GL)$ with both endpoints in $GL_1$ by first resolving the
788: $GL_1$ endpoints by the trivial haplotype $h_t$ and the complements of $h_t$ w.r.t. the two endpoint genotypes,
789: respectively. The remaining path contains only $GL_2$-vertices and is resolved according to Step 6.
790: 
791: \medskip
792: 
793: \noindent {\bf Step 6 of A}: Resolve each remaining path by starting in (one of) its $GL_2$-endpoint(s), and following
794: the path, resolving each next pair of vertices as in Step 4. In case of a path with an odd number of vertices, resolve
795: the last vertex by any pair of haplotypes that is allowed to resolve it in case it is a $GL_2$-vertex, and resolve it
796: by the trivial haplotype and its complement w.r.t. the vertex in case it is a $GL_1$ vertex.
797: 
798: \medskip
799: 
By construction the haplotype matrix $H$ resulting from {\bf A} resolves $G$. In addition, it follows from
Lemma~\ref{lem:usegeno} that $H$ admits a perfect phylogeny.
802: 
803: To argue minimality of the solution, first observe that the haplotypes added in Step 2 and Step 3 are
804: unavoidable by Lemma~\ref{lem:usegeno} (i) and (ii) and Lemma~\ref{lem:private} (ii). Lemma~\ref{lem:private} tells us
805: moreover that the resolution of a cycle of $k$ genotypes in $GL_2$ requires at least $k+\lceil\frac{k}{2}\rceil$
806: haplotypes that can not be used to resolve any other genotypes in $GL$. This proves optimality of Step 4. To prove
807: optimality of the last two steps we need to take into account that genotypes in $GL_1$ can potentially share the
808: trivial haplotype. Observe that to resolve a path with $k$ vertices one needs at least $k+\lceil\frac{k}{2}\rceil$
809: haplotypes. Indeed {\bf A} does not use more than that in Steps 5 and 6. Moreover, since these paths are disjoint,
810: they cannot share haplotypes for resolving their genotypes except for the endpoints if they are in $GL_1$, which can
811: share the trivial haplotype. Indeed, {\bf A} exploits the possibility of sharing the trivial haplotype in a maximal
812: way, except on a path with an even number of vertices and one endpoint in $GL_1$. Such a path, with $k$ (even)
813: vertices, is resolved in {\bf A} by $3\frac{k}{2}$ haplotypes that can not be used to resolve any other genotypes. The
814: degree 1 endpoint might alternatively be resolved by the trivial haplotype and its complement w.r.t. the corresponding
815: genotype, adding the latter private haplotype, but then for resolving the remaining path with $k-1$ (odd) vertices
816: only from $GL_2$ we still need $k-1+\lceil\frac{k-1}{2}\rceil$, which together with the private haplotype of the
817: degree 1 vertex gives $3\frac{k}{2}$ haplotypes also (not even counting $h_t$).
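
\medskip

\noindent As a quick check of the count for cycles, suppose (as we read Step 4) that each of the
$\lfloor k/2 \rfloor$ consecutive pairs on a cycle with $k$ vertices receives three haplotypes and that, for odd $k$,
the last vertex receives two. The number of haplotypes used is then
\[
3\Big\lfloor \frac{k}{2} \Big\rfloor + 2\Big(k-2\Big\lfloor \frac{k}{2} \Big\rfloor\Big)
= 2k - \Big\lfloor \frac{k}{2} \Big\rfloor
= k + \Big\lceil \frac{k}{2} \Big\rceil ,
\]
which matches the lower bound. For example, $k=4$ gives $6=4+2$ haplotypes and $k=5$ gives $8=5+3$.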
818: 
819: As a result we have polynomial-time solvability of $MPPH(*,2)$-$C1$.
820: 
821: \medskip
822: 
823: \begin{theorem}
824: $MPPH(*,2)$ is solvable in polynomial time if the compatibility graph is a clique.
825: \flushright
826: \QEDclosed
827: \end{theorem}
828: 
829: \section{Approximation algorithms}
830: \label{sec:approx}
In this section we construct polynomial-time approximation algorithms for $PH$ and $MPPH$, where the accuracy depends
832: on the number of 2's per column of the input matrix. We describe genotypes without 2's as \emph{trivial} genotypes,
833: since they have to be resolved in a trivial way by one haplotype. Genotypes with at least one 2 will be described as
834: \emph{nontrivial} genotypes. We write \phminZ{} and \mpphminZ{} to denote the restricted versions of the problems
835: where each genotype is nontrivial. We make this distinction between the problems because we have better lower bounds
836: (and thus approximation ratios) for the restricted variants.
837: 
838: \subsection{$PH$ and $MPPH$ where all input genotypes are nontrivial}
839: To prove approximation guarantees we need good lower bounds on the number of haplotypes in the solution. We start with
two bounds from \cite{islands}, whose proofs we give because the first one is short but based on a crucial observation, and the second one was incomplete in \cite{islands}. We use these bounds to obtain a different lower
841: bound that we need for our approximation algorithms. \medskip
842: \begin{lemma}
843: \label{lem:minBound} \cite{islands} Let $G$ be an $n \times m$ instance of \phminZ{} (or \mpphminZ). Then at
844: least
845: \begin{eqnarray*}
846: LB_{sqrt}(n) = \bigg \lceil \frac{ 1 + \sqrt{1+8n} }{2} \bigg \rceil
847: \end{eqnarray*}
848: haplotypes are required to resolve $G$.
849: \end{lemma}
850: \begin{proof}
851: The proof follows directly from the observation that $q$ haplotypes can resolve at most $\binom{q}{2} = q(q-1)/2$
852: nontrivial genotypes.\\
853: \end{proof}
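
\medskip

\noindent Spelled out, the observation gives the bound as follows: if $q$ haplotypes resolve $n$ nontrivial
genotypes, then
\[
n \leq \frac{q(q-1)}{2}
\;\Longleftrightarrow\;
q^2 - q - 2n \geq 0
\;\Longleftrightarrow\;
q \geq \frac{1+\sqrt{1+8n}}{2},
\]
where the last equivalence uses $q > 0$. For instance, $n=10$ gives $q \geq (1+\sqrt{81})/2 = 5$, and indeed
$\binom{5}{2}=10$.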
854: \medskip
855: \begin{lemma}
856: \label{lem:theirbound} \cite{islands} Let $G$ be an $n \times m$ instance of \phmin{\ell}, for some $\ell \geq 1$,
857: such that the compatibility graph of $G$ is a clique. Then at least
858: \begin{eqnarray*}
859: LB_{sha}(n,\ell) = \bigg \lceil \frac{2n}{\ell+1} + 1 \bigg \rceil
860: \end{eqnarray*}
861: haplotypes are required to resolve $G$.
862: \end{lemma}
863: \begin{proof}
864: Recall that, after relabeling if necessary, the trivial haplotype $h_t$ is the all-0 haplotype and is consistent with all genotypes. Suppose a solution of $G$ has $q$
865: non-trivial haplotypes. Observe that $h_t$ can be used in the resolution of at most $q$ genotypes. Also observe (by
866: Lemma 5 in \cite{islands}) that each non-trivial haplotype can be used in the resolution of at most $\ell$ genotypes.
867: Now distinguish two cases. First consider the case where $h_t$ is in the solution. Then from the two observations
868: above it follows that $n \leq (q+\ell q)/2$ and hence the solution consists of at least $q+1 \geq 2n/(\ell +1)+1$
869: haplotypes. Now consider the second case i.e. where $h_t$ is not in the solution. Then we have that $n \leq \ell q/2$
870: and hence that the solution consists of at least $2n/\ell$ haplotypes. If $n \geq \ell (\ell+1)/2$ we have that
$2n/\ell \geq 2n/(\ell+1)+1$, and the claim follows. If $n < \ell (\ell+1)/2$ then $\ell
>\frac{\sqrt{1+8n}-1}{2}$. Combining this with the bound $q\geq \frac{\sqrt{1+8n}+1}{2}$ from Lemma~\ref{lem:minBound} gives
$(\ell+1)(q-1) > \frac{1}{4}(\sqrt{1+8n} + 1)(\sqrt{1+8n} - 1)$, which is equal to $2n$. It follows that $q >
2n/(\ell +1)+1$.\\
875: \end{proof}
876: \medskip
877: The $LB_{sha}$ bound has been proven only for \phminZ{} (and \mpphminZ) instances where the compatibility graph is a
878: clique. We now prove a different bound which, in terms of cliques, is slightly weaker (for large $n$) than $LB_{sha}$,
879: but which allows us to generalise the bound to more general inputs. (Indeed it remains an open question whether
880: $LB_{sha}$ applies as a lower bound not just for cliques but also for general instances.)
881: \medskip
882: \begin{lemma}
883: \label{lem:boundmin} Let $G$ be an $n \times m$ instance of \phmin{\ell}, for some $\ell \geq 1$. Then at least
884: \begin{equation}
885: \lbmidmin{n}{\ell} = \bigg \lceil \frac{2(n+\ell)(\ell+1)}{\ell(\ell+3)}
886: \bigg \rceil
887: \end{equation}
888: haplotypes are required to resolve $G$.
889: \end{lemma}
890: \begin{proof}
891: Let $C(G)$ be the compatibility graph of $G$. We may assume without loss of generality that $C(G)$ is connected. First
892: consider the case where $C(G)$ is a clique. If $n \geq \ell ( \ell +1)/2$, it suffices to notice that
893: $\lbmidmin{n}{\ell}\leq LB_{sha}(n,\ell)$ for each value of $\ell \geq 1$, since the function
894: \begin{equation}
895: f(n) = \frac{2n}{\ell +1}+1 - \frac{2(n+\ell)(\ell+1)}{\ell(\ell+3)}
896: \end{equation}
897: is equal to $0$ if $n= \ell ( \ell +1)/2$ and has nonnegative derivative
898: $f'(n)=\frac{2}{\ell+1}-2\frac{\ell+1}{\ell(\ell+3)}\geq 0$.\\
899: Secondly, if $1 \leq n \leq \ell (\ell +1)/2$, straightforward but tedious calculations show that for all $\ell \geq
900: 1$ the function
901: \begin{equation}
902:  F(n)= \frac{ 1 + \sqrt{1+8n}}{2} - \frac{2(n+\ell)(\ell+1)}{\ell(\ell+3)}
903: \end{equation}
904: has value $0$ for $n= \ell ( \ell +1)/2$ and for some $n$ in the interval $[0,1]$, whereas in between these values it
905: has positive value. Hence, $\lbmidmin{n}{\ell}\leq LB_{sqrt}(n)$ for $1 \leq n \leq \ell (\ell +1)/2$.
906: 
907: To prove that the bound also holds if $C(G)$ is not a clique we use induction on $n$. Suppose that
908: for each $n'< n$ the lemma
909: holds for all $n' \times m$ instances $G'$ of \phmin{\ell '} for every $m$ and $\ell '$.
910: Since $C(G)$ is not a clique there exist two genotypes $g_1$ and $g_2$ in $G$ and a
911: column $j$ such that $g_1(j)=0$ and $g_2(j)=1$. Given that $G$ is a \phmin{\ell} instance
912: $t \leq \ell$ genotypes have a 2 in column $j$.
913: Deleting these $t$ genotypes yields an instance $G^d$ with disconnected compatibility graph $C(G^d)$, since the absence of a $2$ in column $j$ prevents the existence of any path from $g_1$ to $g_2$. Let $C(G^d)$ have $p \geq 2$
914: components $C(G_1), ..., C(G_p)$, and let $n_i \geq 1$ denote the number of genotypes in $G_i$. Thus, $n = n_1 +
915: ... + n_p + t$. We use the induction hypothesis on $G_1,\ldots,G_p$ to conclude that the number of haplotypes required to resolve $G$ is at least
916: \begin{eqnarray*}
917: \sum_{i=1}^p \bigg \lceil \frac{2(n_i + \ell)(\ell+1)}{\ell(\ell+3)} \bigg \rceil
918:               & \geq & \bigg \lceil \frac{2(\sum_{i=1}^p n_i + p\ell)(\ell+1)}{\ell(\ell+3)} \bigg \rceil
919:              \geq \bigg \lceil \frac{2(\sum_{i=1}^p n_i + 2\ell)(\ell+1)}{\ell(\ell+3)} \bigg \rceil \\
920:              & \geq & \bigg \lceil \frac{2(\sum_{i=1}^p n_i + t+ \ell)(\ell+1)}{\ell(\ell+3)} \bigg \rceil
921:              = \bigg \lceil \frac{2(n + \ell)(\ell+1)}{\ell(\ell+3)} \bigg \rceil
922: \end{eqnarray*}
923: \end{proof}
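
\medskip

\noindent For a numerical feel of how the three bounds compare, take $\ell=2$ (the values of $n$ are chosen only for
illustration, and $LB_{sha}$ applies only when the compatibility graph is a clique):
\[
\begin{array}{c|ccc}
n & LB_{sqrt}(n) & LB_{sha}(n,2) & LB^{nt}_{mid}(n,2) \\ \hline
3 & 3 & 3 & 3 \\
20 & 7 & 15 & 14
\end{array}
\]
At $n=\ell(\ell+1)/2=3$ all three bounds coincide, in line with $f(n)=F(n)=0$ in the proof. At $n=20$ we have
$LB^{nt}_{mid}(20,2)=\lceil 132/10 \rceil = 14$, slightly below $LB_{sha}(20,2)=\lceil 40/3+1 \rceil = 15$ and well
above $LB_{sqrt}(20)=\lceil (1+\sqrt{161})/2 \rceil = 7$.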
924: \medskip
925: \begin{corollary}
926: \label{cor:easyapproxMin} Let $G$ be an $n \times m$ instance of \phmin{\ell} or \mpphmin{\ell}, for some $\ell \geq
927: 1$. Any feasible solution for $G$ is within a ratio $\ell + 2 - \frac{2}{\ell+1}$ from optimal.
928: \end{corollary}
929: \begin{proof}
930: Immediate from the fact that any solution for $G$ has at most $2n$ haplotypes. In the case of $MPPH$ we can check
931: whether feasible solutions exist, and if so obtain such a solution, by using the algorithm in for example
932: \cite{gusfieldlinear}.\\
933: \end{proof}
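
\medskip

\noindent The calculation behind the ratio, as we read it, combines the trivial upper bound of $2n$ haplotypes with
the lower bound of Lemma~\ref{lem:boundmin}:
\[
\frac{2n}{LB^{nt}_{mid}(n,\ell)}
\leq \frac{2n\,\ell(\ell+3)}{2(n+\ell)(\ell+1)}
< \frac{\ell(\ell+3)}{\ell+1}
= \ell + 2 - \frac{2}{\ell+1},
\]
where the last equality uses $(\ell+2)(\ell+1)-2=\ell(\ell+3)$. For $\ell=1$ this gives ratio $2$ and for $\ell=2$ it
gives $10/3$.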
934: \medskip
935: Not surprisingly, better approximation ratios can be achieved. The following simple algorithm computes
936: approximations of \phmin{\ell}. (The algorithm does not work for $MPPH$, however.)
937: 
938: \medskip
939: 
940: \noindent
941: \textbf{Algorithm:} $PH^{nt}M$ \\
942: \textbf{Step 1:} construct the compatibility graph $C(G)$.\\
943: \textbf{Step 2:} find a maximal matching $M$ in $C(G)$.\\
944: \textbf{Step 3:} for every edge $\{g_1,g_2\}\in M$, resolve $g_1$ and $g_2$ by in total 3 haplotypes: any haplotype
945: consistent with both $g_1$ and $g_2$, and its complements with respect to $g_1$ and $g_2$.\\
946: \textbf{Step 4:} resolve each remaining genotype by two haplotypes.
947: \medskip
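
\noindent As an illustration of $PH^{nt}M$ on a hypothetical instance (the matrix and the labels $g_1,g_2,g_3$ are
introduced only for this example), let
\[
\begin{array}{l}
g_1\\ g_2\\ g_3
\end{array}
\begin{bmatrix} 2 & 2 & 0 \\ 2 & 0 & 0 \\ 1 & 1 & 2 \end{bmatrix}.
\]
Here $g_1$ is compatible with both $g_2$ and $g_3$, while $g_2$ and $g_3$ conflict in the second column, so $C(G)$ is
the path $g_2 - g_1 - g_3$. The single edge $M=\{\{g_1,g_2\}\}$ is a maximal matching. In Step 3 we may take the
haplotype $(1,0,0)$, consistent with $g_1$ and $g_2$, together with its complements $(0,1,0)$ with respect to $g_1$ and
$(0,0,0)$ with respect to $g_2$. In Step 4 the remaining genotype $g_3$ is resolved by $(1,1,0)$ and $(1,1,1)$. The
solution has $5 = 2n-q$ haplotypes, with $n=3$ and $q=|M|=1$.

\medskip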
948: \begin{theorem}
949: $PH^{nt}M$ computes a solution to \phmin{\ell} in polynomial time within an approximation
950: ratio of $c(\ell)=\frac{3}{4}\ell +\frac{7}{4}-\frac{3}{2}\frac{1}{\ell +1}$, for every $\ell \geq 1$.
951: \end{theorem}
952: \begin{proof}
953: Since constructing $C(G)$ given $G$ takes $O(n^2m)$ time and finding a maximal matching in any graph takes linear
954: time, $O(n^2m)$ running time follows directly.
955: 
Let $q$ be the size of the maximal matching. Then $PH^{nt}M$ gives a solution with
$3q+2(n-2q) = 2n-q$ haplotypes. Since the vertices not covered by the
maximal matching form an independent set of size $n-2q$, any solution must contain at least $2(n-2q)$
haplotypes to resolve the genotypes in this independent set.
The theorem thus holds if $\frac{2n-q}{2n-4q} \leq c(\ell)$. If
$\frac{2n-q}{2n-4q} > c(\ell)$, implying that $q > \frac{2-2c(\ell)}{1-4c(\ell)}n$, we use the lower bound of Lemma
964: \ref{lem:boundmin} to obtain
965: \[
967: \frac{2n-q}{ LB^{nt}_{mid}(n,\ell) } < \frac{2n-\frac{2-2c(\ell)}{1-4c(\ell)}n}{LB^{nt}_{mid}(n,\ell)} <
968: \frac{(2n-\frac{2-2c(\ell)}{1-4c(\ell)}n)\ell(\ell+3)}{2n(\ell +1)}= \frac{3\ell
969: c(\ell)}{4c(\ell)-1}\frac{\ell+3}{\ell+1}= c(\ell).
970: \]
971: The last equality follows directly since $(4c(\ell)-1)(\ell+1)=3\ell(\ell+3)$.\\
972: \end{proof}
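\par\noindent
To make the ratio concrete, the following values are obtained by direct substitution into the definition of $c(\ell)$ (a worked check added for illustration, not part of the original argument):
\[
c(1) = \tfrac{3}{4}+\tfrac{7}{4}-\tfrac{3}{4} = \tfrac{7}{4}, \qquad
c(2) = \tfrac{3}{2}+\tfrac{7}{4}-\tfrac{1}{2} = \tfrac{11}{4}, \qquad
c(3) = \tfrac{9}{4}+\tfrac{7}{4}-\tfrac{3}{8} = \tfrac{29}{8}.
\]
The identity used in the last step of the proof can be verified in the same way: multiplying $4c(\ell)-1 = 3\ell+6-\frac{6}{\ell+1}$ by $\ell+1$ gives
\[
(4c(\ell)-1)(\ell+1) = (3\ell+6)(\ell+1)-6 = 3\ell^2+9\ell = 3\ell(\ell+3),
\]
and, for instance, $\ell=2$ yields $(11-1)\cdot 3 = 30 = 3\cdot 2\cdot 5$.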
973: 
974: \subsection{$PH$ and $MPPH$ where not all input genotypes are nontrivial}
975: 
976: Given an instance $G$ of $PH$ or $MPPH$ containing $n$ genotypes, $n_{nt}$ denotes the number of nontrivial
977: genotypes in $G$ and $n_t$ the
978: number of trivial genotypes; clearly $n = n_{nt} + n_t.$
979: \medskip
980: \begin{lemma}
981: \label{lem:cliquewith}
982: Let $G$ be an $n \times m$ instance of $PH(*,\ell)$, for some $\ell \geq 2$, where the compatibility
983: graph of the nontrivial genotypes in $G$ is a clique, $G$ is not equal to a single trivial genotype,
984: and no nontrivial genotype in $G$ is the sum of two trivial genotypes in $G$. Then at least
985: \[
986: \lbwith{n}{\ell} = \bigg \lceil \frac{n}{\ell} + 1 \bigg \rceil
987: \]
988: haplotypes are needed to resolve $G$.
989: \end{lemma}
990: \begin{proof}
991: Note that the lemma holds if $n_t \geq n/\ell + 1$. So we assume from now on that $n_t < n/\ell + 1$.
992: 
We first prove that the bound holds for $n_{nt} \leq \ell$. Since $\ell \geq 2$, the assumption $n_t < n/\ell + 1$ implies $n_t < n/2 + 1$; combining this with $n_{nt} \leq \ell$ gives $n = n_{nt} + n_t < \ell + n/2 + 1$, that is, $n < 2\ell
+ 2$. Thus $n/\ell + 1 < 3 + 2/\ell \leq 4$. Hence if $n_t \geq 4$ then we are done. Thus we only have to consider cases where both
$n_t \in \{0,1,2,3\}$ and $\ell \geq \max \{2,n_{nt}\}$. We verify these cases in Table \ref{tab:case}; note the
importance of the fact that no nontrivial genotype is the sum of two trivial genotypes in verifying that these are
correct lower bounds. (Also, there is no $n_t = 1, n_{nt} = 0$ case because of the lemma's precondition.)
998: %
999: \begin{table}
1000: \centering \caption{Case $n_t < 4$, $n_{nt}\leq \ell$ in proof of Lemma \ref{lem:cliquewith}} \label{tab:case}
1001: \begin{tabular}{|c|c|c|}
1002: \hline
1003: $n_t$&$n_{nt}$&$\lceil n/\ell +1 \rceil$\\
1004: \hline
1005: 0 & 1 & 2 \\
1006: 0 & $z \geq 2$ & $\leq \lceil z/z + 1 \rceil = 2$\\
1007: 1 & 1 & 2\\
1008: 1 & $z \geq 2$ & $\leq \lceil (z+1)/z + 1 \rceil = 3$\\
1009: 2 & 0 & 2\\
1010: 2 & 1 & $\leq 3$\\
1011: 2 & $z\geq 2$ & $\leq \lceil (z+2)/z + 1 \rceil = 3$\\
1012: 3 & 0 & $\leq 3$\\
1013: 3 & 1 & $\leq 3$\\
1014: 3 & 2 & $\leq 4$\\
1015: 3 & $z \geq 3$ & $\leq \lceil (z+3)/z + 1 \rceil = 3$\\
1016: \hline
1017: \end{tabular}
1018: \end{table}
1019: 
1020: We now prove the lemma for $n_{nt} > \ell$. Note that in this case there exists a unique trivial haplotype $h_t$
1021: consistent with all nontrivial genotypes. Suppose, by way of contradiction, that $N = N_t + N_{nt}$ is the size of the
1022: smallest instance $G'$ for which the bound does not hold. Let $H$ be an optimal solution for $G'$ and let $h = |H|$.
1023: 
Observe firstly that $N \equiv 1 \pmod{\ell}$, because if this is not true we have that $\lbwith{N-1}{\ell} =
\lbwith{N}{\ell}$ and we can find a smaller instance for which the bound does not hold, simply by removing an
arbitrary genotype from $G'$, contradicting the minimal choice of $N$.
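For instance (an illustrative check with $\ell = 3$, a value chosen here only as an example),
\[
LB_{mid}(6,3) = \lceil 3 \rceil = 3, \quad
LB_{mid}(7,3) = \lceil 10/3 \rceil = 4, \quad
LB_{mid}(9,3) = \lceil 4 \rceil = 4, \quad
LB_{mid}(10,3) = \lceil 13/3 \rceil = 5,
\]
so the bound increases exactly when passing from $N-1$ to $N$ with $N \equiv 1 \pmod{3}$ (here $N=7$ and $N=10$).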
1027: 
1028: Similarly we argue that $h = \lbwith{N}{\ell}-1$, since if $h \leq \lbwith{N}{\ell}-2$ we could remove an arbitrary
1029: genotype to yield a size $N-1$ instance and still have that $h < \lbwith{N-1}{\ell}$.
1030: 
1031: We choose a specific resolution of $G'$ using $H$ and represent it as a \emph{haplotype graph}. The vertices of this
1032: graph are the haplotypes in $H$. For each nontrivial genotype $g \in G'$ there is an edge between the two haplotypes
1033: that resolve it. For each trivial genotype $g \in G'$ there is a loop on the corresponding haplotype. There are no
1034: edges between looped haplotypes because of the precondition that no nontrivial genotype is the sum of two trivial
1035: genotypes.
1036: 
1037: From Lemma 5 of \cite{islands} it follows that, with the exception of the possibly present trivial haplotype and
1038: disregarding loops, each haplotype in the graph has degree at most $\ell$. In addition, if an unlooped haplotype has
1039: degree less than or equal to $\ell$, or a looped haplotype has degree (excluding its loop) strictly smaller than
1040: $\ell$, then deleting this haplotype and all its at most $\ell$ incident genotypes creates an instance $G''$
1041: containing at least $N-\ell$ genotypes that can be resolved using $h-1$ haplotypes, yielding a contradiction to the
1042: minimality of $N$. (Note that, because $N_{nt}>\ell$, it is not possible that the instance $G''$ is empty or equal to
1043: a single trivial genotype.)
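To spell out the arithmetic behind this contradiction (a sketch that uses only the two observations made above), write $N = k\ell + 1$ with $k \geq 1$, so that $LB_{mid}(N,\ell) = k+2$ and $h = k+1$. The instance $G''$ has $N'' \geq N - \ell = (k-1)\ell + 1$ genotypes, and hence
\[
LB_{mid}(N'',\ell) \geq \bigg\lceil (k-1) + \frac{1}{\ell} + 1 \bigg\rceil = k+1 > k = h-1,
\]
so $G''$ would be a smaller instance for which the bound does not hold.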
1044: 
1045: The only case that remains is when, apart from the possibly present trivial haplotype, every haplotype in the
1046: haplotype graph is looped and has degree $\ell$ (excluding its loop). However, there are no edges between looped
1047: vertices and they can therefore only be adjacent to the trivial haplotype, yielding a contradiction.\\
1048: \end{proof}
1049: \medskip
1050: \begin{lemma}
1051: \label{lem:withgeneral}
1052: Let $G$ be an $n \times m$ instance of $PH(*,\ell)$, for some $\ell \geq 2$, where $G$ is not equal to a
1053: single trivial genotype, and no nontrivial genotype in $G$ is the sum of two trivial genotypes in $G$. Then
1054: at least $\lbwith{n}{\ell}$ haplotypes are needed to resolve $G$.
1055: \end{lemma}
1056: \begin{proof}
Essentially the same inductive argument as used in Lemma \ref{lem:boundmin} works: it is always possible to disconnect
the compatibility graph of $G$ into at least two components by removing at most $\ell$ nontrivial genotypes, and
cliques (Lemma \ref{lem:cliquewith}) form the base of the induction. The presence of trivial genotypes in the input (which we can simply
1060: exclude from the compatibility graph) does not alter the analysis. The fact that (in the inductive step) at least two
1061: components are created, each of which contains at least one nontrivial genotype, ensures that the inductive argument
1062: is not harmed by the presence of single trivial genotypes (for which the bound does not hold).\\
1063: \end{proof}
1064: \medskip
1065: \begin{corollary}
1066: \label{cor:withfirstbound} Let $G$ be an $n \times m$ instance of $PH(*,\ell)$ or $MPPH(*,\ell)$, for some $\ell
1067: \geq 2$. Any feasible solution for $G$ is within a ratio of $2\ell$ from optimal.
1068: \end{corollary}
1069: \begin{proof}
Immediate, because any feasible solution uses at most $2n$ haplotypes and $2n/(n/\ell+1) < 2\ell$. (As before, the algorithm from e.g.\ \cite{gusfieldlinear} can be used to
generate feasible solutions for $MPPH$, or to determine that they do not exist.)\\
1072: \end{proof}
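\par\noindent
The inequality can be checked directly (a one-line derivation; the numerical values below are chosen only for illustration):
\[
\frac{2n}{n/\ell + 1} = \frac{2n\ell}{n+\ell} = 2\ell\cdot\frac{n}{n+\ell} < 2\ell .
\]
For example, $n = 10$ and $\ell = 2$ give $20/6 = 10/3 < 4$.
\par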
1073: The algorithm $PH^{nt}M$ can easily be adapted to solve $PH(*,\ell)$ approximately.
1074: 
1075: \medskip
1076: 
1077: \noindent
1078: \textbf{Algorithm:} $PHM$\\
\textbf{Step 1:} remove from $G$ all genotypes that are the sum of two trivial genotypes.\\
1080: \textbf{Step 2:} construct the compatibility graph $C(G')$ of the leftover instance $G'$.\\
1081: \textbf{Step 3:} find a maximal matching $M$ in $C(G')$.\\
1082: \textbf{Step 4:} for every edge $\{g_1,g_2\}\in M$, resolve $g_1$ and $g_2$ by
1083: three haplotypes if $g_1$ and $g_2$ are both nontrivial and by two haplotypes if one of them is trivial.\\
1084: \textbf{Step 5:} resolve each remaining nontrivial genotype by two haplotypes and each remaining trivial genotype by its corresponding haplotype.
1085: \medskip
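\par\noindent
As a small illustration of Steps 1 and 4 (a sketch with hypothetical genotypes, assuming the usual convention that a trivial genotype contains no 2 and is resolved by a single haplotype): the genotype $(2,2,0)$ is the sum of the trivial genotypes $(0,0,0)$ and $(1,1,0)$, so if both are present it is removed in Step 1, since the haplotypes forced by the two trivial genotypes already resolve it. If an edge of $M$ joins the trivial genotype $g_1 = (0,0,0)$ and the nontrivial genotype $g_2 = (0,2,2)$, then
\[
g_1 = h + h, \qquad g_2 = h + h' \qquad \text{with } h = (0,0,0),\ h' = (0,1,1),
\]
so two haplotypes suffice for the pair, one fewer than the three used when the two genotypes are resolved separately.
\medskip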
1086: \begin{theorem}
1087: $PHM$ computes a solution to $PH(*,\ell)$ in polynomial time within an approximation
1088: ratio of $d(\ell)=\frac{3}{2}\ell +\frac{1}{2}$, for every $\ell \geq 2$.
1089: \end{theorem}
1090: \begin{proof}
1091: Since constructing $C(G)$ given $G$ takes $O(n^2m)$ time and finding a maximal matching in any graph takes linear
1092: time, $O(n^2m)$ running time follows directly.
1093: 
Let $q$ be the size of the maximal matching, $n$ the number of genotypes
after Step 1 and $n_t$ the number of trivial genotypes in $G'$.
Then $PHM$
gives a solution with $2n-q-n_t$ haplotypes.
Since the genotypes not covered by the
maximal matching form an independent set of size $n-2q$ in $C(G')$, and each of them (trivial or nontrivial) needs at least one haplotype that it cannot share with the others, any solution must contain at least $n-2q$
haplotypes to resolve the genotypes in this independent set.
1101: The theorem thus holds if $\frac{2n-q-n_t}{n-2q} \leq d(\ell)$.
1102: If $\frac{2n-q-n_t}{n-2q}
1103: > d(\ell)$, implying that $q > \frac{(d(\ell)-2)n+n_t}{2d(\ell)-1}$, we use the lower bound of Lemma
1104: \ref{lem:withgeneral} and obtain
1105: \[
1106: \frac{2n-q-n_t}{LB_{mid}(n,\ell)} < \frac{2n-\frac{(d(\ell)-2)n+n_t}{2d(\ell)-1}}{\lceil \frac{n}{\ell} + 1 \rceil} <
1107: \frac{2n-\frac{(d(\ell)-2)n}{2d(\ell)-1}}{\frac{n}{\ell}}= \frac{3d(\ell)\ell}{2d(\ell)-1}= d(\ell).
1108: \]
1110: The last equality follows directly since $2d(\ell)-1 = 3\ell$.\\
1111: \end{proof}
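\par\noindent
For concreteness (values obtained by direct substitution, added here as an illustration): $d(2) = \frac{7}{2}$ and $d(3) = 5$, compared with the ratios $2\ell = 4$ and $2\ell = 6$ that Corollary \ref{cor:withfirstbound} guarantees for arbitrary feasible solutions. More generally,
\[
2\ell - d(\ell) = \frac{\ell - 1}{2} > 0 \quad \text{for every } \ell \geq 2,
\]
and the threshold on $q$ used in the proof comes from rearranging $2n-q-n_t > d(\ell)(n-2q)$ into $(2d(\ell)-1)\,q > (d(\ell)-2)\,n + n_t$.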
1112: 
1113: \section{Postlude}
1114: \label{sec:concl}
There remain a number of open problems to be solved. The complexity of $PH(*,2)$ and $MPPH(*,2)$ is still unknown. An
approach that might yield the necessary insight is to study the $PH(*,2)\text{-}Cq$ and $MPPH(*,2)\text{-}Cq$ variants
1117: of these problems (i.e. where the compatibility graph is the sum of $q$ cliques) for small $q$. If a complexity result
1118: nevertheless continues to be elusive then it would be interesting to try and improve approximation ratios for
1119: $PH(*,2)$ and $MPPH(*,2)$; might it even be possible to find a PTAS (\emph{Polynomial-time Approximation Scheme}) for
1120: each of these problems? Note also that the complexity of $PH(k,2)$ and $MPPH(k,2)$ remains open for constant $k \geq
1121: 3$.
1122: 
Another intriguing open question concerns the relative complexity of $PH$ and $MPPH$ instances. Does $PH(k,\ell)$ always
have the same complexity as $MPPH(k,\ell)$, in terms of well-known complexity measures (polynomial-time solvability,
NP-hardness, APX-hardness)? For hard instances, do approximability ratios differ? A related question is whether it is possible to directly
1126: encode $PH$ instances as $MPPH$ instances, and/or vice-versa, and if so whether/how this affects the bounds on the number
1127: of 2's in columns and rows.
1128: 
1129: For hard $PH(k,\ell)$ instances it would also be interesting to see if those approximation algorithms that yield
1130: approximation ratios as functions of $k$, can be intelligently combined with the approximation algorithms in this
1131: paper (having approximation ratios determined by $\ell$), perhaps with superior approximation ratios as a consequence.
1132: In terms of approximation algorithms for $MPPH$ there is a lot of work to be done because the
1133: approximation algorithms presented in this paper actually do little more than return an arbitrary feasible solution.
It is also not clear whether the approximation ratio of the $2^{k-1}$-approximation algorithms for $PH(k,*)$ can be attained (or improved) for $MPPH$.
1135: More generally, it seems likely that big improvements in approximation ratios (for both $PH$ and $MPPH$) will require
1136: more sophisticated, input-sensitive lower bounds and algorithms. What are the limits of approximability for these
1137: problems, and how far will algorithms with formal performance-guarantees (such as in this paper) have to improve to
1138: make them competitive with dominant ILP-based methods?
1139: 
Finally, with respect to $MPPH$, it would be worthwhile to
explore how parsimonious the solutions produced by the
various $PPH$ feasibility algorithms are, and whether searching through
the entire space of $PPH$ solutions (as proposed in \cite{anOptimal})
yields practical algorithms for solving $MPPH$.
1145: 
1146: \section*{Acknowledgements}
1147: 
1148: All authors contributed equally to this paper and were supported by the Dutch BSIK/BRICKS project. A preliminary
1149: version of this paper appeared in \emph{Proceedings of the 6th International Workshop on Algorithms in Bioinformatics} (WABI 2006)
1150: \cite{wabibeaches}.
1151: 
1152: 
1153: \begin{thebibliography}{50}
1154: \small
1155: 
1156: \bibitem{cubic} Alimonti, P., Kann, V., Hardness of approximating problems
1157: on cubic graphs, Proceedings of the Third Italian Conference on Algorithms and Complexity, 288-298 (1997)
1158: 
1159: \bibitem{nphardnote} Bafna, V., Gusfield, D., Hannenhalli, S., Yooseph, S.,
1160: A Note on Efficient Computation of Haplotypes via Perfect Phylogeny, \emph{Journal of Computational Biology}, 11(5),
1161: pp. 858-866 (2004)
1162: 
1163: \bibitem{blair} Blair, J.R.S., Peyton, B., An introduction to chordal graphs and clique trees, in \emph{Graph theory and sparse matrix computation}, pp. 1-29, Springer (1993)
1164: 
1165: \bibitem{mpphref2} Bonizzoni, P., Vedova, G.D., Dondi, R., Li, J., The haplotyping problem: an overview of computational models and solutions,
1166: \emph{Journal of Computer Science and Technology} 18(6), pp. 675-688 (2003)
1167: 
\bibitem{brown} Brown, D., Harrower, I., Integer programming approaches to haplotype inference by pure
parsimony, \emph{IEEE/ACM Transactions on Computational Biology and Bioinformatics} 3(2) (2006)
1170: 
1171: \bibitem{wabi} Cilibrasi, R., Iersel, L.J.J. van, Kelk, S.M., Tromp, J., On the Complexity of Several Haplotyping Problems, Proceedings
1172: of the 5th International Workshop on Algorithms in Bioinformatics (WABI 2005), LNBI 3692, Springer Verlag, Berlin, pp.
1173: 128-139 (2005)
1174: 
1175: \bibitem{gusfieldlinear} Ding, Z., Filkov, V., Gusfield, D., A linear-time algorithm for the perfect phylogeny
1176: haplotyping (PPH) problem, \emph{Journal of Computational Biology}, 13(2) pp. 522-533 (2006)
1177: 
1178: \bibitem{gusfieldbook} Gusfield, D., \emph{Algorithms on Strings, Trees, and Sequences: Computer Science and Computational
1179: Biology}, Cambridge University Press (1997)
1180: 
1181: \bibitem{gusfieldnetwork} Gusfield, D., Efficient algorithms for inferring evolutionary history, \emph{Networks} 21,
1182: pp. 19-28 (1991)
1183: 
1184: \bibitem{gusfieldparsimony} Gusfield, D., Haplotype inference by pure parsimony, Proc. 14th
1185: Ann. Symp. Combinatorial Pattern Matching, pp. 144-155 (2003)
1186: 
1187: \bibitem{halldorson} Halld\'orsson, B.V., Bafna, V., Edwards, N., Lippert, R., Yooseph, S.,
1188: Istrail, S., A survey of computational methods for determining haplotypes, Proc. DIMACS/RECOMB Satellite Workshop:
1189: Computational Methods for SNPs and Haplotype Inference, pp. 26-47 (2004)
1190: 
1191: \bibitem{wabibeaches} Iersel, L.J.J. van, Keijsper, J., Kelk, S.M., Stougie, L., Beaches of Islands of Tractability: Algorithms for Parsimony
1192: and Minimum Perfect Phylogeny Haplotyping Problems, Proceedings of the 6th International Workshop on Algorithms in Bioinformatics (WABI 2006),
1193: LNCS 4175, Springer, pp. 80-91 (2006)
1194: 
1195: \bibitem{lanciaApx} Lancia, G., Pinotti, M., Rizzi, R., Haplotyping populations by pure
1196: parsimony: complexity of exact and approximation algorithms, \emph{INFORMS Journal on Computing} 16(4) pp. 348-359
1197: (2004)
1198: 
1199: \bibitem{lancia} Lancia, G., Rizzi, R.,
1200: A polynomial case of the parsimony haplotyping problem, \emph{Operations Research Letters} 34(3) pp. 289-295 (2006)
1201: 
1202: \bibitem{deg3} Papadimitriou, C.H., Yannakakis, M.,
1203: Optimization, approximation, and complexity classes, \emph{J. Comput. System Sci.} 43, pp. 425-440 (1991)
1204: 
1205: \bibitem{rose} Rose, D.J., Tarjan, R.E., Lueker, G.S., Algorithmic aspects of vertex elimination on graphs, \emph{SIAM
1206: J. Comput.}, 5, pp. 266-283 (1976)
1207: 
1208: \bibitem{islands} Sharan, R., Halld\'orsson, B.V., Istrail, S., Islands of tractability for parsimony haplotyping,
1209: \emph{IEEE/ACM Transactions on Computational Biology and Bioinformatics} 3(3), pp. 303-311 (2006)
1210: 
\bibitem{gusnetwork} Song, Y.S., Wu, Y., Gusfield, D., Algorithms for imperfect phylogeny haplotyping
(IPPH) with single homoplasy or recombination event, Proceedings of the 5th International Workshop on Algorithms in
Bioinformatics (WABI 2005), LNBI 3692, Springer Verlag, Berlin, pp. 152-164 (2005)
1214: 
1215: \bibitem{anOptimal} VijayaSatya, R., Mukherjee, A., An optimal algorithm for perfect phylogeny haplotyping,
1216: \emph{Journal of Computational Biology} 13(4), pp. 897-928 (2006)
1217: 
\bibitem{mpphref} Xiang-Sun Zhang, Rui-Sheng Wang, Ling-Yun Wu, Luonan Chen, Models and Algorithms for Haplotyping
Problem, \emph{Current Bioinformatics} 1, pp. 105-114 (2006)
1220: 
\bibitem{log} Yao-Ting Huang, Kun-Mao Chao, Ting Chen, An approximation algorithm for haplotype inference by
maximum parsimony, \emph{Journal of Computational Biology} 12(10) pp. 1261-1274 (2005)
1223: 
1224: \end{thebibliography}
1225: 
1226: \clearpage
1227: 
1228: \begin{biography}{Leo van Iersel}
received his Master of Science degree in Applied Mathematics in 2004 from the Universiteit Twente in the Netherlands.
1230: He is now working as a PhD student at the Technische Universiteit Eindhoven, also in the Netherlands. His research is
1231: mainly concerned with the search for combinatorial algorithms for biological problems.
1232: \end{biography}
1233: 
1234: \begin{biography}{Judith Keijsper}
1235: received her master's and PhD degrees in 1994 and 1998 respectively from the Universiteit van Amsterdam in The
1236: Netherlands, where she worked with Lex Schrijver on combinatorial algorithms for graph problems. After working as a
1237: postdoc at Leibniz-IMAG in Grenoble, France, and as an assistant professor at the Universiteit Twente in the
1238: Netherlands for short periods of time, she moved to the Technische Universiteit Eindhoven in the Netherlands in the
1239: year 2000. She is an assistant professor there, and her current research focus is combinatorial algorithms for
1240: problems from computational biology.
1241: \end{biography}
1242: 
1243: \begin{biography}{Steven Kelk}
1244: received his PhD in Computer Science in 2004 from the University
1245: of Warwick, in England. He is now working as a postdoc at the
1246: Centrum voor Wiskunde en Informatica (CWI) in Amsterdam, the
1247: Netherlands, where he is focussing on the combinatorial aspects of
1248: computational biology.
1249: \end{biography}
1250: 
1251: \begin{biography}{Leen Stougie}
1252: received his PhD in 1985 from the Erasmus Universiteit of Rotterdam, The Netherlands. He is currently working at the
1253: Centrum voor Wiskunde en Informatica (CWI) in Amsterdam and at the Technische Universiteit Eindhoven as an associate
1254: professor.
1255: \end{biography}
1256: 
1257: \end{document}
1258: