
1: \documentclass[12pt,draftcls,onecolumn]{IEEEtran}

2: %\documentclass[onecolumn]{IEEEtran}

3:

4: %\numberwithin{equation}{section} \numberwithin{figure}{section}

5:

6: \newtheorem{lemma}{Lemma}

7: \newtheorem{observation}{Observation}

8: \newtheorem{definition}{Definition}

9: \newtheorem{theorem}{Theorem}

10: \newtheorem{corollary}{Corollary}

11:

12: \usepackage[USenglish]{babel}

13:

14: \usepackage{makeidx}

15: \usepackage{amsfonts}

16: \usepackage{epsfig}

17: \usepackage{amsmath}

18: \usepackage{amssymb}

19: \pagestyle{plain}

20: \pagenumbering{arabic}

21:

22:

23: %%%%%%%% BEGIN COMMENT %%%%%%%%%%

24:

25: \newif\ifcomment\commentfalse

26: \def\commentON{\commenttrue}

27: \def\commentOFF{\commentfalse}

28:

29: \long\outer\def\bc#1\ec{{\ifcomment \sloppy  $[${\bf suggest}]

30: {{#1}} \textbf{[end]} \fi }}

31:

32: \long\outer\def\br#1\er{{\ifcomment \sloppy  $[${\bf suggest remove}]

33: {{#1}} \textbf{[end]} \fi }}

34:

35: \long\outer\def\bo#1\eo{{\ifcomment \sloppy  $[${\bf instead of}]

36: {\textit{#1}} \textbf{[end]}  \fi }}

37:

38: \long\outer\def\BC#1\EC{{\ifcomment \sloppy \par \#  \dotfill

39: {\textsc{#1}} \dotfill \# \par \fi }}

40:

41: \long\outer\def\ph#1{$PH(*,{#1})$} \long\outer\def\phmin#1{$PH^{nt}(*,{#1})$}

42:

43: \long\outer\def\mpph#1{$MPPH(*,{#1})$} \long\outer\def\mpphmin#1{$MPPH^{nt}(*,{#1})$}

44:

45: \long\outer\def\phminZ{$PH^{nt}$} \long\outer\def\mpphminZ{$MPPH^{nt}$}

46:

47: \long\outer\def\lbmidmin#1#2{\ensuremath{LB^{nt}_{mid}({#1},{#2})}}

48: \long\outer\def\lbwith#1#2{\ensuremath{LB_{mid}({#1},{#2})}}

49:

50: \commentOFF

51: %\commentON

52:

53: %%%%%%%%% END COMMENT %%%%%%%%%%%

54:

55: \ifcomment

56: \pagestyle{plain}

57: \pagenumbering{arabic}

58: \fi

59:

60: \begin{document}

61:

62: %\markboth{Shorelines of islands of tractability...}{Van Iersel\MakeLowercase{\textit{et al.}}}

63:

64: \title{Shorelines of islands of tractability: Algorithms for parsimony and minimum perfect phylogeny haplotyping problems\thanks{Supported by the Dutch

65: BSIK/BRICKS

66: project.}}

67:

68:

69: %\titlerunning{Shorelines of islands of tractability}

70:

71: \author{Leo van Iersel, Judith Keijsper, Steven Kelk and Leen Stougie}

72:

73: %\authorrunning{Leo van Iersel, Judith Keijsper, Steven Kelk \and Leen Stougie}

74: %\institute{Technische Universiteit Eindhoven (TU/e), Den Dolech 2, 5612 AX Eindhoven, Netherlands,\\

75: %\email{l.j.j.v.iersel@tue.nl, j.c.m.keijsper@tue.nl},\\

76: %\texttt{http://www.tue.nl} \and

77: %Centrum voor Wiskunde en Informatica (CWI), Kruislaan 413, 1098 SJ Amsterdam, Netherlands, \\

78: %\email{steven.kelk@cwi.nl, leen.stougie@cwi.nl}, \\

79: %\texttt{http://www.cwi.nl} }

80:

81: \maketitle

82:

83: \begin{abstract}

84: \noindent The problem \emph{Parsimony Haplotyping} ($PH$) asks for the smallest set of haplotypes which can explain a

85: given set of genotypes, and the problem \emph{Minimum Perfect Phylogeny Haplotyping} ($MPPH$) asks for the smallest

86: such set which also allows the haplotypes to be embedded in a \emph{perfect phylogeny}, an evolutionary tree with

87: biologically-motivated restrictions. For $PH$, we extend recent work by further mapping the interface between ``easy''

88: and ``hard'' instances, within the framework of $(k,\ell)$-\emph{bounded instances} where the number of 2's per column

89: and row of the input matrix is restricted. By exploring, in the same way, the tractability frontier of $MPPH$ we

90: provide the first concrete, positive results for this problem, and the algorithms underpinning these results offer new

91: insights about how $MPPH$ might be further tackled in the future. In addition, we construct for both $PH$ and $MPPH$

92: polynomial time approximation algorithms, based on properties of the columns of the input matrix. We conclude with an

93: overview of intriguing open problems in $PH$ and $MPPH$.

94: \end{abstract}

95:

96: \begin{keywords}

97: Combinatorial algorithms, Biology and genetics, Complexity hierarchies

98: \end{keywords}

99:

100:

101: \section{Introduction}

102: \noindent The computational problem of inferring biologically-meaningful haplotype data from the genotype data of a population

103: continues to generate considerable interest at the interface of biology and computer science/mathematics. A popular

104: underlying abstraction for this model (in the context of diploid organisms) represents a genotype as a string

105: over a $\{0,1,2\}$ alphabet, and a haplotype as a string over $\{0,1\}$. The exact goal depends on the

106: biological model being applied but a common, minimal algorithmic requirement is that, given a set of genotypes, a

107: set of haplotypes must be produced which resolves the genotypes.

108: \medskip

109:

110: To be precise, we are given a \emph{genotype matrix} $G$ with elements in $\{0,1,2\}$, the rows of which

111: correspond to genotypes, while its columns correspond to sites on the genome, called SNPs. A \emph{haplotype matrix}

112: has elements from $\{0,1\}$, and rows corresponding to haplotypes. Haplotype matrix $H$ \emph{resolves} genotype

113: matrix $G$ if for each row $g_i$ of $G$, containing at least one $2$, there are two rows $h_{i_1}$ and $h_{i_2}$ of

114: $H$, such that $g_i(j) = h_{i_1}(j)$ for all $j$ with $h_{i_1}(j)= h_{i_2}(j)$ and $g_i(j) = 2$ otherwise, in which

115: case we say that $h_{i_1}$ and $h_{i_2}$ resolve $g_i$, we write $g_i=h_{i_1}+h_{i_2}$, and we call $h_{i_1}$ the {\em

116: complement} of $h_{i_2}$ with respect to $g_i$, and vice versa. A row $g_i$ without 2's is itself a haplotype and is

117: uniquely resolved by this haplotype, which thus has to be contained in $H$.
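
\medskip

\noindent As a small illustration (an example constructed here for exposition; it does not appear in the cited literature), the genotype $g=(0,2,1,2)$ is resolved by the haplotypes $h_1=(0,0,1,1)$ and $h_2=(0,1,1,0)$:
\[
\begin{array}{rcl}
h_1 & = & (0,\,0,\,1,\,1)\\
h_2 & = & (0,\,1,\,1,\,0)\\ \hline
g = h_1+h_2 & = & (0,\,2,\,1,\,2)
\end{array}
\]
The two haplotypes agree in columns 1 and 3, where $g$ copies their common value, and differ in columns 2 and 4, where $g$ has a 2. The pair $(0,0,1,0)$, $(0,1,1,1)$ resolves $g$ equally well; in general a genotype with $p \geq 1$ entries equal to 2 has $2^{p-1}$ such resolving pairs.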

118:

119: We define the first of the two problems that we study in this paper.

120:

121: \medskip

122:

123: \noindent \textbf{Problem:} Parsimony Haplotyping ($PH$)\\

124: \textbf{Input:} A genotype matrix $G$.\\

125: \textbf{Output:} A haplotype matrix $H$ with a minimum number of rows that resolves $G$.

126:

127: \medskip

128:

129: \noindent \looseness=-2 There is a rich literature in this area, of which recent papers such as \cite{brown} give a

130: good overview. The problem is APX-hard \cite{lanciaApx}\cite{islands} and, in terms of approximation algorithms with

131: performance \emph{guarantees}, existing methods remain rather unsatisfactory, as will be shortly explained. This has

132: led many authors to consider methods based on Integer Linear Programming (ILP)

133: \cite{brown}\cite{gusfieldparsimony}\cite{halldorson}\cite{lanciaApx}. A different response to the hardness is to

134: search for ``islands of tractability'' amongst special, restricted cases of the problem, exploring the frontier

135: between hardness and polynomial-time solvability. In the literature available in this direction

136: \cite{wabi}\cite{lanciaApx}\cite{lancia}\cite{islands}, this investigation has specified classes of

137: $(k,\ell)$-\emph{bounded instances}: in a $(k,\ell)$-\emph{bounded instance} the input genotype matrix $G$ has at most

138: $k$ $2$'s per row and at most $\ell$ $2$'s per column (cf. \cite{islands}). If $k$ or $\ell$ is a ``$*$'' we mean

139: instances that are bounded only by the number of $2$'s per column or per row, respectively. In this paper we

140: supplement this ``tractability'' literature with mainly positive results, and in doing so almost complete the bounded

141: instance complexity landscape.

142:

143: Next to the $PH$ problem we study the \emph{Minimum Perfect Phylogeny Haplotyping} ($MPPH$)

144: model \cite{nphardnote}. Again a minimum-size set of resolving haplotypes is required but this time under the

145: additional, biologically-motivated restriction that the produced haplotypes permit a \emph{perfect phylogeny}, i.e.,

146: they can be placed at the leaves of an evolutionary tree within which each site mutates at most once. Haplotype

147: matrices admitting a perfect phylogeny are completely characterised \cite{gusfieldbook}\cite{gusfieldnetwork} by the

148: absence of the forbidden submatrix\\

149: \[F = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}.\] \\

150: \noindent

151: \textbf{Problem:} Minimum Perfect Phylogeny Haplotyping ($MPPH$)\\

152: \textbf{Input:} A genotype matrix $G$.\\

153: \textbf{Output:} A haplotype matrix $H$ with a minimum number of rows that resolves $G$ and admits a perfect

154: phylogeny.

155:

156: \medskip

157:

158: \noindent The feasibility question ($PPH$) - given a genotype matrix $G$, find any haplotype matrix $H$ that resolves

159: $G$ and admits a perfect phylogeny, or state that no such $H$ exists - is solvable in linear-time

160: \cite{gusfieldlinear}\cite{anOptimal}. Researchers in this area are now moving on to explore the $PPH$ question on

161: phylogenetic \emph{networks} \cite{gusnetwork}.

162:

163: The $MPPH$ problem, however, has so far hardly been studied beyond an NP-hardness result \cite{nphardnote}

164: and occasional comments within $PH$ and $PPH$ literature \cite{mpphref2}\cite{anOptimal}\cite{mpphref}. In this paper we

165: thus provide what is one of the first attempts to analyse the parsimony optimisation criterion within a well-defined

166: and widely applicable biological framework. Specifically, we seek to map the $MPPH$ complexity landscape in the same way as

167: the $PH$ complexity landscape: using the concept of $(k,\ell)$-boundedness. We write $PH(k,\ell)$ and $MPPH(k,\ell)$ for these

168: problems restricted to $(k,\ell)$-bounded instances.\\

169:

170:

171: \noindent

172: \textbf{Previous work and our contribution}

173:

174: \medskip

175:

176: \noindent In \cite{lanciaApx} it was shown that $PH(3,*)$ is APX-hard. In \cite{wabi}\cite{lancia} it was shown that

177: $PH(2,*)$ is polynomial-time solvable. Recently, in \cite{islands}, it was shown (amongst other results) that

178: $PH(4,3)$ is APX-hard. In \cite{islands} it was also proven that $PH(*,2)$ is polynomial-time solvable in the restricted subcase

179: where the \emph{compatibility graph} of the input genotype matrix is a clique. (Informally, the compatibility graph shows

180: for every pair of genotypes whether those two genotypes can use common haplotypes in their resolution.)

181:

182: In this paper, we bring the boundaries between hard and easy classes closer by showing that $PH(3,3)$ is APX-hard and

183: that $PH(*,1)$ is polynomial-time solvable.

184:

185: As far as $MPPH$ is concerned there have been, prior to this paper, no concrete results beyond the above mentioned

186: NP-hardness result. We show that $MPPH(3,3)$ is APX-hard and that, like their $PH$ counterparts, $MPPH(2,*)$ and

187: $MPPH(*,1)$ are polynomial-time solvable (in both cases using a reduction to the $PH$ counterpart). We also show that

188: the clique result from \cite{islands} holds in the case of $MPPH(*,2)$ as well. As with its $PH$ counterpart the

189: complexity of $MPPH(*,2)$ remains open.

190:

191: \medskip

192: \noindent The fact that both $PH$ and $MPPH$ already become $APX$-hard for $(3,3)$-bounded instances means that,

193: in terms of deterministic approximation algorithms, the best that we can in general hope for is constant

194: approximation ratios. Lancia et al.~\cite{lanciaApx}\cite{lancia} have given two separate approximation algorithms with approximation

195: ratios of $\sqrt{n}$ and $2^{k-1}$ respectively, where $n$ is the number of genotypes in the input, and $k$ is the maximum

196: number of 2's appearing in a row of the genotype matrix\footnote{It would

197: be overly restrictive to write $PH(k,*)$ here

198: because their algorithm runs in polynomial time even if $k$ is not a constant.}. An $O(\log n)$

199: approximation algorithm has been given in \cite{log} but this only runs in polynomial time if the set of all possible haplotypes that

200: can participate in feasible solutions can be enumerated in polynomial time. The obvious problem with the $2^{k-1}$ and the

201: $O(\log n)$ approximation algorithms is thus that either the accuracy decays exponentially (as in the former case) or the running

202: time increases exponentially (as in the latter case) with an increasing number of 2's per row. Here we offer a

203: simple, alternative approach which achieves (in polynomial time) approximation ratios linear in $\ell$ for

204: $PH(*,\ell)$ and

205: $MPPH(*,\ell)$ instances, and

206: actually also achieves these ratios in polynomial time when $\ell$ is not constant. These ratios are

207: shown in Table~\ref{tab:ratios}; note how improved

208: ratios can be obtained if every genotype is guaranteed to have at least one 2.

209: \begin{table}

210: \centering

211: \caption{Approximation ratios achieved in this paper}

212: \label{tab:ratios}

213: \begin{tabular}{|c|c|}

214: \hline

215: Problem $(\ell \geq 2)$ & Approximation ratio\\

216: \hline

217: \hline

218: $PH(*,\ell)$ & $\frac{3}{2}\ell + \frac{1}{2}$\\

219: \hline

220: $PH(*,\ell)$ where every genotype has at least one 2 & $\frac{3}{4}\ell + \frac{7}{4} - \frac{3}{2}\frac{1}{\ell +1}$\\

221: \hline

222: $MPPH(*,\ell)$ & $2 \ell$\\

223: \hline

224: $MPPH(*,\ell)$ where every genotype has at least one 2 & $\ell + 2 - \frac{2}{\ell+1}$\\

225: \hline

226: \end{tabular}

227: \end{table}
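
\noindent As a quick sanity check of these expressions (plain arithmetic on the formulas in Table~\ref{tab:ratios}, added here for illustration only), taking $\ell=2$ gives
\begin{align*}
PH(*,2):&\quad \tfrac{3}{2}\cdot 2 + \tfrac{1}{2} = \tfrac{7}{2}, & \text{with a 2 in every genotype:}&\quad \tfrac{3}{4}\cdot 2 + \tfrac{7}{4} - \tfrac{3}{2}\cdot\tfrac{1}{3} = \tfrac{11}{4},\\
MPPH(*,2):&\quad 2\cdot 2 = 4, & \text{with a 2 in every genotype:}&\quad 2 + 2 - \tfrac{2}{3} = \tfrac{10}{3}.
\end{align*}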

228:

229: We have thus decoupled the approximation ratio from the maximum number of 2's per row, and instead made the ratio

230: conditional on the maximum number of 2's per column. Our approximation scheme is hence an improvement over the

231: $2^{k-1}$-approximation algorithm except in cases where the maximum number of 2's per row is exponentially small

232: compared to the maximum number of 2's per column. Our approximation scheme also yields the first approximation results

233: for $MPPH$.

234:

235: \medskip

236:

237: \noindent As explained by Sharan et al. in their ``islands of tractability'' paper \cite{islands}, identifying

238: tractable special classes can be practically useful for constructing high-speed subroutines within ILP solvers, but

239: perhaps the most significant aspect of this paper is the analysis underpinning the results, which - by deepening our

240: understanding of how this problem behaves - assists the search for better, faster approximation algorithms and for

241: determining the exact shorelines of the islands of tractability.

242:

243: Furthermore, the fact that - prior to this paper - concrete and positive results for $MPPH$ had not been

244: obtained (except for rather pessimistic modifications to ILP models \cite{brown}), means that the algorithms given

245: here for the $MPPH$ cases, and the

246:  data structures used in their analysis (e.g. the \emph{restricted compatibility graph} in

247: Section~\ref{sec:posres}), assume particular importance.

248:

249: Finally, this paper yields some interesting open problems, of which the outstanding $(*,2)$ case (for both

250: $PH$ and $MPPH$) is only one; prominent amongst these questions (which are discussed at the end of the paper) is the

251: question of whether $MPPH$ and $PH$ instances are inter-reducible, at least within the bounded-instance framework.

252:

253: \medskip

254:

255: \noindent The paper is organised as follows. In Section~\ref{sec:negres} we give the hardness results, in

256: Section~\ref{sec:posres} we present the polynomial-time solvable cases, in Section~\ref{sec:approx} we give

257: approximation algorithms and we finish in Section~\ref{sec:concl} with conclusions and open problems.

258: %

259: \section{Hard problems}

260: \label{sec:negres}

261: \begin{theorem}

262: \label{lem:33phyloAPX} $MPPH(3,3)$ is APX-hard.

263: \end{theorem}

264: \begin{proof}

265: The proof in \cite{nphardnote} that $MPPH$ is NP-hard uses a reduction from {\sc Vertex Cover}, which can be modified

266: to yield NP-hardness and APX-hardness for (3,3)-bounded instances. Given a graph $T=(V,E)$ the reduction in

267: \cite{nphardnote} constructs a genotype matrix $G(T)$ of $MPPH$ with $|V|+|E|$ rows and $2|V|+|E|$ columns. For every

268: vertex $v_i \in V$ there is a genotype (row) $g_i$ in $G(T)$ with $g_i(i)=1$, $g_i(i+|V|)=1$ and $g_i(j)=0$ for every

269: other position $j$. In addition, for every edge $e_k=\{v_{h},v_{l}\}$ there is a genotype $g_k$ with $g_k(h)=2$,

270: $g_k(l)=2$, $g_k(2|V|+k)=2$ and $g_k(j)=0$ for every other position $j$. Bafna et al. \cite{nphardnote} prove that an

271: optimal solution for $MPPH$ with input $G(T)$ contains $|V| + |E| + VC(T)$ haplotypes, where $VC(T)$ is the size of

272: the smallest vertex cover in $T$.
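
\medskip

\noindent To make the construction concrete, consider the smallest non-trivial case (a toy instance constructed here for illustration, not taken from \cite{nphardnote}): $T$ consists of a single edge $e_1=\{v_1,v_2\}$, so $|V|=2$, $|E|=1$ and $VC(T)=1$. Then $G(T)$ has $|V|+|E|=3$ rows and $2|V|+|E|=5$ columns:
\[
G(T)=\begin{bmatrix}
1 & 0 & 1 & 0 & 0\\
0 & 1 & 0 & 1 & 0\\
2 & 2 & 0 & 0 & 2
\end{bmatrix},
\]
where the first two rows are the vertex genotypes of $v_1$ and $v_2$ and the third row is the edge genotype of $e_1$. The two vertex genotypes are themselves haplotypes, and the edge genotype can be resolved by $(1,0,0,0,1)$ and $(0,1,0,0,0)$; the resulting four haplotypes contain no copy of the forbidden submatrix $F$, and their number agrees with $|V|+|E|+VC(T)=4$.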

273:

274: {\sc 3-Vertex Cover} is the vertex cover problem when every vertex in the input graph has degree at most 3. It is

275: known to be APX-hard \cite{deg3}\cite{cubic}. Let $T$ be an instance of {\sc 3-Vertex Cover}. We assume that $T$ is

276: connected. Observe that for such a $T$ the reduction described above yields an $MPPH$ instance $G(T)$ that is

277: $(3,3)$-bounded. We show that existence of a polynomial-time $(1+\epsilon)$ approximation algorithm $A(\epsilon)$ for

278: $MPPH$ would imply a polynomial-time $(1+\epsilon')$ approximation algorithm for {\sc 3-Vertex Cover} with

279: $\epsilon'=8\epsilon$.\footnote[1]{Strictly speaking this is insufficient to prove APX-hardness but it is not

280: difficult to show that the described reduction is actually an L-reduction \cite{deg3}, from which APX-hardness

281: follows.}

282:

283: Let $t$ be the solution value for $MPPH(G(T))$ returned by $A(\epsilon)$, and $t^*$ the optimal value for

284: $MPPH(G(T))$. By the argument mentioned above from \cite{nphardnote} we obtain a solution with value $d = t - |V| -

285: |E|$ as an approximation of $VC(T)$. Since $t \leq (1+\epsilon)t^*$, we have $d \leq VC(T) + \epsilon VC(T) + \epsilon

286: |V| + \epsilon |E|$. Connectedness of $T$ implies that $|V|-1 \leq |E|$. In {\sc 3-Vertex Cover}, a single vertex can

287: cover at most 3 edges in $T$, implying that $VC(T) \geq |E|/3 \geq (|V|-1)/3$. Hence $|E| \leq 3 VC(T)$ and, for $|V|\geq

288: 2$, $|V| \leq 3 VC(T) + 1 \leq 4 VC(T)$, so that:

289: \begin{align*}

290: d & \leq VC(T) + \epsilon VC(T) + 4\epsilon VC(T) + 3\epsilon VC(T)\\

291: & \leq VC(T) + 8 \epsilon VC(T)\\

292: & \leq (1 + 8\epsilon) VC(T).

293: \end{align*}

294: \end{proof}

295: \begin{theorem}

296: \label{lem:33apx} $PH(3,3)$ is APX-hard.

297: \end{theorem}

298: \begin{proof}

299: \looseness=-1

300: The proof by Sharan et al. \cite{islands} that $PH(4,3)$ is APX-hard can be modified slightly to obtain

301: APX-hardness of $PH(3,3)$. The reduction is from {\sc 3-Dimensional Matching} with each element occurring in at most

302: three triples (3DM3): given disjoint sets $X$, $Y$ and $Z$ containing $\nu$ elements each and a set

303: $C=\{c_0,\ldots,c_{\mu-1}\}$ of $\mu$ triples in $X\times Y\times Z$ such that each element occurs in at most three

304: triples in $C$, find a maximum cardinality set $C' \subseteq C$ of disjoint triples.

305:

306: From an instance of 3DM3 we build a genotype matrix $G$ with $3 \nu

307: + 3\mu$ rows and $6\nu+4\mu$ columns. The first $3\nu$ rows are

308: called \emph{element-genotypes} and the last $3\mu$ rows are called

309: \emph{matching-genotypes}. We specify non-zero entries of the

310: genotypes only.\footnote[2]{Only in this proof we index haplotypes,

311: genotypes and matrices starting with 0, which makes notation

312: consistent with \cite{islands}.} For every element $x_i \in X$

313: define element-genotype $g^x_i$ with $g^x_i(3\nu+i)=1$;

314: $g^x_i(6\nu+4k)=2$ for all $k$ with $x_i \in c_k$. If $x_i$ occurs

315: in at most two triples we set $g^x_i(i)=2$. For every element $y_i

316: \in Y$ there is an element-genotype $g^y_i$ with $g^y_i(4\nu+i)=1$;

317: $g^y_i(6\nu+4k)=2$ for all $k$ with $y_i \in c_k$ and if $y_i$

318: occurs in at most two triples then we set $g^y_i(\nu + i)=2$. For

319: every element $z_i \in Z$ there is an element-genotype $g^z_i$ with

320: $g^z_i(5\nu+i)=1$; $g^z_i(6\nu+4k)=2$ for all $k$ with $z_i \in c_k$

321: and if $z_i$ occurs in at most two triples then we set

322: $g^z_i(2\nu+i)=2$. For each triple $c_k=\{ x_{i_1},

323: y_{i_2},z_{i_3}\} \in C$ there are three matching-genotypes $c_k^x$,

324: $c_k^y$ and $c_k^z$: $c_k^x$ has $c_k^x(3\nu+i_1)=2$,

325: $c_k^x(6\nu+4k)=1$ and $c_k^x(6\nu+4k+1)=2$; $c_k^y$ has

326: $c_k^y(4\nu+i_2)=2$, $c_k^y(6\nu+4k)=1$ and $c_k^y(6\nu+4k+2)=2$;

327: $c_k^z$ has $c_k^z(5\nu+i_3)=2$, $c_k^z(6\nu+4k)=1$ and

328: $c_k^z(6\nu+4k+3)=2$.

329:

330: Notice that the element-genotypes only have a 2 in the first $3\nu$ columns if the element occurs in at most two

331: triples. This is the only difference with the reduction from \cite{islands}, where every element-genotype has a 2 in

332: the first $3\nu$ columns: i.e., for elements $x_i\in X$, $y_i\in Y$ or $z_i\in Z$ a 2 in column $i$, $\nu+i$ or

333: $2\nu+i$, respectively. As a direct consequence our genotype matrix has at most three 2's per row in contrast to the four

334: 2's per row in the original reduction.

335:

336: We claim that for this (3,3)-bounded instance exactly the same arguments can be used as for the (4,3)-bounded

337: instance. In the original reduction the left-most 2's ensured that, for each element-genotype, at most one of the two

338: haplotypes used to resolve it was used in the resolution of other genotypes. Clearly this remains true in our modified

339: reduction for elements appearing in two or fewer triples, because the corresponding left-most 2's have been retained.

340: So consider an element $x_i$ appearing in three triples and suppose, by way of contradiction, that \emph{both}

341: haplotypes used to resolve $g^x_i$ are used in the resolution of other genotypes. Now, the 1 in position $3\nu+i$

342: prevents this element-genotype from sharing haplotypes with other element-genotypes, so genotype $g^x_i$ must share

343: both its haplotypes with matching-genotypes. Note that, because $g^x_i(3\nu+i)=1$, the genotype $g^x_i$ can only

344: possibly share haplotypes with matching-genotypes corresponding to triples that contain $x_i$. Indeed, if $x_i$ is in

345: triples $c_{k_1}$, $c_{k_2}$ and $c_{k_3}$ then the only genotypes with which $g^x_i$ can potentially share haplotypes

346: are $c^x_{k_1}$, $c^x_{k_2}$ and $c^x_{k_3}$. Genotype $g^x_i$ cannot share both its haplotypes with the same

347: matching-genotype (e.g. $c^{x}_{k_1}$) because both haplotypes of $g^x_i$ will have a 1 in column $3\nu +i$ whilst

348: only one of the two haplotypes for $c^{x}_{k_1}$ will have a 1 in that column. So, without loss of generality, $g^x_i$

349: is resolved by a haplotype that $c^x_{k_1}$ uses and a haplotype that $c^x_{k_2}$ uses. However, this is not possible,

350: because $g^x_i$ has a 2 in the column corresponding to $c_{k_3}$, whilst both $c^{x}_{k_1}$ and $c^{x}_{k_2}$ have a 0

351: in that column, yielding a contradiction.

352:

353: Note that, in the original reduction, it was not only true that each element-genotype shared at most one of its

354: haplotypes, but - more strongly - it was also true that such a shared haplotype was used by exactly one other genotype

355: (i.e. the genotype corresponding to the triple the element gets assigned to). To see that this property is also

356: retained in the modified reduction observe that if (say) $g^x_i$ shares one haplotype with two genotypes $c^{x}_{k_1}$

357: and $c^{x}_{k_2}$ then $x_i$ must be in both triples $c_{k_1}$ and $c_{k_2}$, but this is not possible because, in the

358: two columns corresponding to triples $c_{k_1}$ and $c_{k_2}$, $c^{x}_{k_1}$ has 1 and 0 whilst $c^{x}_{k_2}$ has 0 and

359: 1.\\

360: \end{proof}

361:

362: \section{Polynomial-time solvability}

363: \label{sec:posres}

364: \subsection{Parsimony haplotyping}

365:

366: \noindent We will prove polynomial-time solvability of $PH$ on (*,1)-bounded instances.

367:

368: We say that two genotypes $g_1$ and $g_2$ are \emph{compatible}, denoted as $g_1 \sim g_2$, if $g_1(j) =

369: g_2(j)$ or $g_1(j) = 2$ or $g_2(j) = 2$ for all $j$. A genotype $g$ and a haplotype $h$ are \emph{consistent} if $h$

370: can be used to resolve $g$, i.e. if $g(j)=h(j)$ or $g(j)=2$ for all $j$. The \emph{compatibility graph} is the graph

371: with vertices for the genotypes and an edge between two genotypes if they are compatible.

372:

373: \medskip

374:

375: \begin{lemma} \label{lem:labelling} If $g_1$ and $g_2$ are compatible rows of a genotype matrix with at most one $2$ per column

376: then there exists exactly one haplotype that is consistent with both $g_1$ and

377: $g_2$.\end{lemma}

378: \begin{proof}

379: The only haplotype that is consistent with both $g_1$ and $g_2$ is $h$ with $h(j) = g_1(j)$ for all $j$ with $g_1(j)

380: \neq 2$ and $h(j) = g_2(j)$ for all $j$ with $g_2(j) \neq 2$. There are no columns where $g_1$ and $g_2$ are both

381: equal to $2$ because there is at most one $2$ per column. In columns where $g_1$ and $g_2$ are both not equal to $2$

382: they are equal because $g_1$ and $g_2$ are compatible.\\

383: \end{proof}
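
\medskip

\noindent For instance, in the matrix of Figure~\ref{fig:compgraph} (which has exactly one 2 in every column) the rows $g_2$ and $g_6$ are compatible, and filling in each position from whichever row does not have a 2 there gives
\[
\begin{array}{rcl}
g_2 & = & (2,\,0,\,2,\,0,\,0,\,0,\,1)\\
g_6 & = & (1,\,2,\,0,\,0,\,0,\,0,\,1)\\ \hline
h & = & (1,\,0,\,0,\,0,\,0,\,0,\,1),
\end{array}
\]
which is the haplotype $h_3$ of the figure caption. This is only a spot check of the lemma on the example of the figure, not an additional argument.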

384: \medskip

385: We use the notation $g_1 \sim_h g_2$ if $g_1$ and $g_2$ are compatible and $h$ is consistent with both. We prove that

386: the compatibility graph has a specific structure. A \emph{1-sum} of two graphs is the result of identifying a vertex

387: of one graph with a vertex of the other graph. A 1-sum of $n+1$ graphs is the result of identifying a vertex of a

388: graph with a vertex of a 1-sum of $n$ graphs. See Figure~\ref{fig:compgraph} for an example of a 1-sum of three

389: cliques ($K_3$, $K_4$ and $K_2$).

390:

391: \medskip

392:

393: \begin{lemma} \label{lem:1sum} If $G$ is a genotype matrix with at most one $2$ per column then every connected component

394: of the compatibility graph of $G$ is a 1-sum of cliques, where edges in the same clique are labelled with the same

395: haplotype.

396: \end{lemma}

397: \begin{proof}

398: Let $C$ be the compatibility graph of $G$ and let $g_1,g_2,\ldots,g_k$ be a cycle in $C$. It suffices to show that

399: there exists a haplotype $h_c$ such that $g_{i} \sim_{h_c} g_{i'}$ for all $i,i'\in\{1,...,k\}$. Consider an arbitrary

400: column $j$. If there is no genotype with a $2$ in this column then $g_1 \sim g_2 \sim \ldots \sim g_k$ implies that

401: $g_1(j)=g_2(j)=\ldots = g_k(j)$. Otherwise, let $g_{i_j}$ be the unique genotype with a $2$ in column $j$. Then $g_1

402: \sim g_2 \sim \ldots \sim g_{i_j-1}$ together with $g_1 \sim g_k \sim g_{k-1}\sim \ldots \sim g_{i_j+1}$ implies that

403: $g_{i}(j)=g_{i'}(j)$ for all $i,i' \in \{1,...,k\} \setminus \{i_j\}$. Set $h_c(j)=g_i(j)$, $i \neq i_j$. Repeating

404: this for each column $j$ produces a haplotype $h_c$ such that indeed $g_{i} \sim_{h_c} g_{i'}$ for all

405: $i,i'\in\{1,...,k\}$.\\

406: \end{proof}

407:

408: \begin{figure}

409: \vspace{-24pt}

410: \begin{minipage}{.45\textwidth}

411: \begin{center}

412: \begin{tabular}{ll}

413: $\begin{array}{c}

414: g_1\\

415: g_2\\

416: g_3\\

417: g_4\\

418: g_5\\

419: g_6\\

420: g_7\\

421: \end{array}$

422: & $\begin{bmatrix}

423: 0 & 0 & 1 & 0 & 2 & 0 & 1\\

424: 2 & 0 & 2 & 0 & 0 & 0 & 1\\

425: 0 & 0 & 1 & 2 & 0 & 0 & 1\\

426: 0 & 0 & 1 & 0 & 0 & 0 & 2\\

427: 0 & 0 & 1 & 1 & 0 & 2 & 1\\

428: 1 & 2 & 0 & 0 & 0 & 0 & 1\\

429: 0 & 0 & 1 & 1 & 0 & 0 & 1\\

430: \end{bmatrix}$

431: \end{tabular}

432: \end{center}

433: \end{minipage}

434: \begin{minipage}{.45\textwidth}

435: \begin{center}

436: \epsfig{file=./compgraph2.eps} \end{center}

437: \end{minipage}

438: \caption{Example of a genotype matrix and the corresponding compatibility graph, with $h_1=(0,0,1,1,0,0,1)$,

439: $h_2=(0,0,1,0,0,0,1)$ and $h_3=(1,0,0,0,0,0,1)$.} \label{fig:compgraph} \vspace{-12pt}

440: \end{figure}

441: \medskip

442: From this lemma, it follows directly that in $PH(*,1)$ the compatibility graph is {\em chordal}, meaning

443: that all its induced cycles are triangles. Every chordal graph has a \emph{simplicial} vertex, a vertex whose (closed)

444: neighbourhood is a clique. Deleting a vertex in a chordal graph gives again a chordal graph (see for example

445: \cite{blair} for an introduction to chordal graphs). The following lemma leads almost immediately to polynomial

446: solvability of $PH(*,1)$. We use set-operations for the rows of matrices: thus, e.g., $h\in H$ says $h$ is a row of

447: matrix $H$, $H\cup h$ says $h$ is added to $H$ as a row, and $H'\subset H$ says $H'$ is a submatrix consisting of rows

448: of $H$.

449:

450: \medskip

451:

452: \begin{lemma} \label{lem:starone} Given haplotype matrix $H'$ and genotype

453: matrix $G$ with at most one 2 per column it is possible to find, in polynomial time, a haplotype matrix $H$ that

454: resolves $G$, has $H'$ as a submatrix and has a minimum number of rows.

455: \end{lemma}

456: \begin{proof}

457: \looseness=-1 The proof is constructive. Let problem $(G,H')$ denote the above problem on input matrices $G$ and $H'$.

458: Let $C$ be the compatibility graph of $G$, which by Lemma~\ref{lem:1sum} is chordal. Suppose $g$ corresponds

459: to a simplicial vertex of $C$. Let $h_c$ be the unique haplotype consistent with every genotype in the closed

460: neighbourhood clique of $g$. We extend matrix $H'$ to $H''$ and update graph $C$ as follows.

461: \begin{enumerate}

462: \item If $g$ has no $2$'s it can be resolved with only one haplotype $h=g$. We set $H''=H'\cup h$ and remove $g$ from $C$.

463: \item Else, if there exist rows $h_1\in H'$ and $h_2\in H'$ that resolve $g$ we set $H''=H'$ and remove $g$ from $C$.

464: \item Else, if there exists $h_1\in H'$ such that $g=h_1+h_c$ we set $H''=H'\cup h_c$ and remove $g$ from $C$.

465: \item Else, if there exist $h_1\in H'$ and $h_2\notin H'$ such that $g=h_1+h_2$ we set $H''=H'\cup h_2$ and remove $g$ from $C$.

466: \item Else, if $g$ is not an isolated vertex in $C$ then there exists a haplotype $h_1$ such that $g=h_1+h_c$ and we set

467: $H''=H'\cup \{h_1, h_c\}$ and remove $g$ from $C$.

468: \item Otherwise, $g$ is an isolated vertex in $C$ and we set $H''=H'\cup \{h_1, h_2\}$ for any $h_1$ and $h_2$ such that

469: $g=h_1+h_2$ and remove $g$ from $C$.

470: \end{enumerate}

471: The resulting graph is again chordal and we repeat the above procedure for $H'=H''$ until all vertices are removed from $C$.

472: Let $H$ be the final haplotype matrix $H''$. It is clear from the construction that $H$ resolves $G$.
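
\medskip

\noindent As an illustration, the following trace (worked out here by hand for exposition, with one of several valid elimination orders chosen arbitrarily) runs the procedure on the matrix of Figure~\ref{fig:compgraph}, starting from an empty $H'$ and removing the simplicial vertices in the order $g_7,g_5,g_6,g_1,g_4,g_2,g_3$. The column $h_c$ refers to the closed neighbourhood of the vertex in the graph remaining at that step, and $h_1,h_2,h_3$ are the haplotypes named in the figure caption:
\begin{center}
\begin{tabular}{|c|c|c|l|}
\hline
Vertex & $h_c$ & Case & Rows added to $H'$\\
\hline
$g_7$ & $h_1$ & 1 & $h_1=(0,0,1,1,0,0,1)$\\
$g_5$ & $h_1$ & 4 & $(0,0,1,1,0,1,1)$\\
$g_6$ & $h_3$ & 5 & $(1,1,0,0,0,0,1)$ and $h_3=(1,0,0,0,0,0,1)$\\
$g_1$ & $h_2$ & 5 & $(0,0,1,0,1,0,1)$ and $h_2=(0,0,1,0,0,0,1)$\\
$g_4$ & $h_2$ & 4 & $(0,0,1,0,0,0,0)$\\
$g_2$ & $h_2$ & 2 & none ($g_2=h_2+h_3$)\\
$g_3$ & -- & 2 & none ($g_3=h_1+h_2$)\\
\hline
\end{tabular}
\end{center}
The final matrix $H$ in this run therefore has seven rows.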

473:

474: We prove that $H$ has a minimum number of rows by induction on the number of genotypes. Clearly, if $G$ has only one

475: genotype the algorithm constructs the only, and hence optimal, solution. The induction hypothesis is that the

476: algorithm finds an optimal solution to the problem $(G,H')$ for any haplotype matrix $H'$ if $G$ has at most $n-1$

477: rows. Now consider haplotype matrix $H'$ and genotype matrix $G$ with $n$ rows. The first step of the algorithm

478: selects a simplicial vertex $g$ and proceeds with one of the cases 1 to 6. The algorithm then finds (by the induction

479: hypothesis) an optimal solution $H$ to problem $(G\setminus\{g\},H'')$. It remains to prove that $H$ is also an

480: optimal solution to problem $(G,H')$. We do this by showing that an optimal solution $H^*$ to problem $(G,H')$ can be

481: modified  to include $H''$. We prove this for every case of the algorithm separately.

482:

483: \begin{enumerate}

484: \item In this case $h\in H^*$, since $g$ can only be resolved by $h$.\smallskip \item In this case $H''=H'$ and hence

485: $H''\subseteq H^*$.\smallskip \item Suppose that $h_c \notin H^*$. Because we are not in case $2$ we know that there

486: are two rows in $H^*$ that resolve $g$ and at least one of the two, say $h^*$, is not a row of $H'$. Since $h_c$ is

487: the unique haplotype consistent with (the simplicial) $g$ and any compatible genotype, $h^*$ can not be consistent

488: with any other genotype than $g$. Thus, replacing $h^*$ by $h_c$ gives a solution with the same number of rows but

489: containing $h_c$. \smallskip \item Suppose that $h_2\notin H^*$. Because we are not in case $2$ or $3$ we know that

490: there is a haplotype $h^*\in H^*$ consistent with $g$, $h^*\notin H'$ and $h^*\neq h_c$. Hence it is not consistent

491: with any other genotypes than $g$ and we can replace $h^*$ by $h_2$. \smallskip \item Suppose that $h_1\notin H^*$ or

492: $h_c\notin H^*$. Because we are not in case $2$, $3$ or $4$, there are haplotypes $h^*\in H^*\backslash H'$ and

493: $h^{**}\in H^*\backslash H'$ that resolve $g$. If $h^*$ and $h^{**}$ are both not equal to $h_c$ then they are not

494: consistent with any other genotype than $g$. Replacing $h^*$ and $h^{**}$ by $h_1$ and $h_c$ leads to another optimal

495: solution. If one of $h^*$ and $h^{**}$ is equal to $h_c$ then we can replace the other one by $h_1$. \smallskip \item

496: \looseness=-1 Suppose that $h_1\notin H^*$ or $h_2\notin H^*$. There are haplotypes $h^*,h^{**}\in H^*\backslash H'$

497: that resolve $g$ and just $g$ since $g$ is an isolated vertex. Replacing $h^*$ and $h^{**}$ by $h_1$ and $h_2$ gives

498: an optimal solution containing $h_1$ and $h_2$.

499: \end{enumerate}

500: \end{proof}

501: \begin{theorem} \label{prop:starone}

502: The problem $PH(*,1)$ can be solved in polynomial time. \end{theorem}

503: \begin{proof}

504: The proof follows from Lemma~\ref{lem:starone}. Construction of the compatibility graph takes $O(n^2m)$ time, for an

505: $n$ times $m$ input matrix. Finding an ordering in which to delete the simplicial vertices can be done in time

506: $O(n^2)$ \cite{rose} and resolving each vertex takes $O(n^2m)$ time. The overall running time of the algorithm is

507: therefore $O(n^3m)$.\\

508: \end{proof}

509:

510: \subsection{Minimum perfect phylogeny haplotyping}

511:

512: \noindent

513: \looseness=-1 Polynomial-time solvability of $PH$ on $(2,*)$-bounded instances has been shown in \cite{wabi} and

514: \cite{lancia}. We prove it for $MPPH(2,*)$. We start with a definition.

515:

516: \medskip

517:

518: \begin{definition} \label{def:redres} For two columns of a genotype matrix we say that a \emph{reduced resolution} of these columns

519: is the result of applying the following rules as often as possible to the submatrix induced by these columns: deleting

520: one of two identical rows and the replacement rules\\ $\begin{bmatrix} 2 & a \end{bmatrix} \rightarrow

521: \begin{bmatrix} 1 & a \\ 0 & a \end{bmatrix}$, $\begin{bmatrix} a & 2 \end{bmatrix} \rightarrow \begin{bmatrix} a & 1

522: \\ a & 0 \end{bmatrix}$, $\begin{bmatrix} 2 & 2 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix}$ and

523: $\begin{bmatrix} 2 & 2 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, for $a \in \{0,1\}$.\\

524: \end{definition}

525: %

526: Note that two columns can have more than one reduced resolution if there is a genotype with a 2 in both these columns.

527: The reduced resolutions of a column pair of a genotype matrix $G$ are submatrices of (or equal to) $F$ and represent

528: all possibilities for the submatrix induced by the corresponding two columns of a minimal haplotype matrix $H$

529: resolving $G$, after collapsing identical rows.
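
\medskip

\noindent As a small illustration of Definition~\ref{def:redres} (the rows below are chosen for this example only and are not taken from a particular instance), consider a column pair whose induced submatrix is $\begin{bmatrix} 2 & 2 \\ 2 & 0 \end{bmatrix}$. The row [2~0] can only be replaced by [1~0] and [0~0], whereas the row [2~2] can be replaced in two ways. After deleting repeated rows this gives the two reduced resolutions
\[
\begin{bmatrix} 1 & 1 \\ 0 & 0 \\ 1 & 0 \end{bmatrix}
\quad \mbox{and} \quad
\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}.
\]
Each of them has only three distinct rows, so neither is equal to $F$ if, as the discussion above suggests, $F$ consists of all four rows [0~0], [0~1], [1~0] and [1~1].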

530:

531: \medskip

532:

533: \begin{theorem} \label{prop:startwophylo}

534: The problem $MPPH(2,*)$ can be solved in polynomial time.

535: \end{theorem}

536: \begin{proof}

537: We reduce $MPPH(2,*)$ to $PH(2,*)$, which can be solved in polynomial time (see above). Let $G$ be an instance of

538: $MPPH(2,*)$. We may assume that any two rows are different.

539:

540: Take the submatrix of any two columns of $G$. If it does not contain a [2~2] row, then in terms of

541: Definition~\ref{def:redres} there is only one reduced resolution. If $G$ contains two or more [2~2] rows then,

542: since by assumption all genotypes are different, $G$ must have $\begin{bmatrix} 2 & 2 & 0 \\

543: 2 & 2 & 1 \end{bmatrix}$ and therefore $\begin{bmatrix} 2 & 0 \\

544: 2 & 1 \end{bmatrix}$ as a submatrix, which can only be resolved by a haplotype matrix containing the forbidden

545: submatrix $F$. It follows that in this case the instance is infeasible. If it contains exactly one [2~2] row, then

546: there are clearly two reduced resolutions. Thus we may assume that for each column pair there are at most two reduced

547: resolutions.

548:

549: Observe that if for some column pair all reduced resolutions are equal to $F$ the instance is again infeasible. On the

550: other hand, if for all column pairs none of the reduced resolutions is equal to $F$ then $MPPH(2,*)$ is equivalent to

551: $PH(2,*)$ because any minimal haplotype matrix $H$ that resolves $G$ admits a perfect phylogeny. Finally, consider a

552: column pair with two reduced resolutions, one of them containing $F$. Because there are two reduced resolutions there

553: is a genotype $g$ with a 2 in both columns. Let $h_1$ and $h_2$ be the haplotypes that correspond to the resolution of

554: $g$ that does not lead to $F$. Then we replace $g$ in $G$ by $h_1$ and $h_2$, ensuring that a minimal haplotype matrix

555: $H$ resolving $G$ can not have $F$ as a submatrix in these two columns.

556:

557: Repeating this procedure for every column pair either tells us that the matrix $G$ was an infeasible instance or

558: creates a genotype matrix $G'$ such that any minimal haplotype matrix $H$ resolves $G'$ if and only if $H$ resolves

559: $G$, and $H$ admits a perfect phylogeny.\\

560: \end{proof}
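
\medskip

\noindent To illustrate the reduction, consider the following hypothetical instance, chosen only for this purpose. We read $F$, as the text above suggests, as the two-column matrix containing each of the rows [0~0], [0~1], [1~0] and [1~1] once.
\[
G = \begin{bmatrix} 2 & 2 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 1 \end{bmatrix}.
\]
In the first two columns the single [2~2] row can be resolved by [1~1] and [0~0], which together with [1~0] and [0~1] gives $F$, or by [1~0] and [0~1], which gives the reduced resolution $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$. The procedure therefore replaces the first genotype by $h_1=[1~0~0]$ and $h_2=[0~1~0]$, and after deleting the repeated row we obtain
\[
G' = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}.
\]
The other two column pairs of $G$ each have a unique reduced resolution with three rows, hence not equal to $F$. Since $G'$ contains no 2's, its only minimal resolution is $H=G'$, which has three rows, resolves $G$ and does not contain $F$.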

561:

562: \medskip

563:

564: \begin{theorem} \label{prop:staronephylo} The problem $MPPH(*,1)$ can be solved in polynomial time.

565: \end{theorem}

566: \begin{proof}

567: Similar to the proof of Theorem~\ref{prop:startwophylo} we reduce $MPPH(*,1)$ to $PH(*,1)$. As there, consider for any

568: pair of columns of the input genotype matrix $G$ its reduced resolutions, according to Definition~\ref{def:redres}. Since

569: $G$ has at most one $2$ per column there is at most one genotype with 2's in both columns. Hence there are at most two

570: reduced resolutions. If all reduced resolutions are equal to the forbidden submatrix $F$ the instance is infeasible.

571: If on the other hand for all column pairs no reduced resolution is equal to $F$ then in fact $MPPH(*,1)$ is equivalent

572: to $PH(*,1)$, because any minimal haplotype matrix resolving $G$ admits a perfect phylogeny.

573:

574: As in the proof of Theorem~\ref{prop:startwophylo} we are left with considering column pairs for which one of the two

575: reduced resolutions is equal to $F$. For such a column pair there must be a genotype $g$ that has 2's in both these

576: columns. The other genotypes have only 0's and 1's in them. Suppose we get a forbidden submatrix $F$ in these columns

577: of the solution if $g$ is resolved by haplotypes $h_1$ and $h_2$, where $h_1$ has $a$ and $b$ and therefore $h_2$ has

578: $1-a$ and $1-b$ in these columns, $a,b\in \{0,1\}$. We will change the input matrix $G$ such that if $g$ gets resolved

579: by such a \emph{forbidden resolution} these haplotypes are not consistent with any other genotypes. We do this by

580: adding an extra column to $G$ as follows. The genotype $g$ gets a $1$ in this new column. Every genotype with $a$ and

581: $b$ or with $1-a$ and $1-b$ in the considered columns gets a $0$ in the new column. Every other genotype gets a $1$ in

582: the new column. For example, the matrix

583: \[

584: \begin{bmatrix} 2 & 2 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}

585: {\rm \ gets\ one\ extra\ column\ and\ becomes}

586: \begin{bmatrix} 2 & 2 & 1 \\ 0 & 1 & 1\\ 1 & 0 & 1\\ 1 & 1 & 0\end{bmatrix}.

587: \]

588: \noindent Denote by  $G_{mod}$  the result of modifying $G$ by adding such a column for every pair of columns with

589: exactly one `bad' and one `good' reduced resolution. It is not hard to see that any optimal solution to $PH(*,1)$ on

590: $G_{mod}$ can be transformed into a solution to $MPPH(*,1)$ on $G$ of the same cardinality (indeed, any two haplotypes

591: used in a forbidden resolution of a genotype $g$ in $G_{mod}$ are not consistent with any other genotype of $G_{mod}$,

592: and hence may be replaced by two other haplotypes resolving $g$ in a non-forbidden way). Now, let $H$ be an optimal

593: solution to $MPPH(*,1)$ on $G$. We can modify $H$ to obtain a solution to $PH(*,1)$ on $G_{mod}$ of the same

594: cardinality as follows. We modify every haplotype in $H$ in the same way as the genotypes it resolves. From the

595: construction of $G_{mod}$ it follows that two compatible genotypes are only modified differently if the haplotype they

596: are both consistent with is in a forbidden resolution. However, in $H$ no genotypes are resolved with a forbidden

597: resolution since $H$ is a solution to $MPPH(*,1)$. We conclude that optimal solutions to $PH(*,1)$ on $G_{mod}$

598: correspond to optimal solutions to $MPPH(*,1)$ on $G$ and hence the latter problem can be solved in polynomial time,

599: by Theorem \ref{prop:starone}.

600:

601: If we use the algorithm from the proof of Lemma~\ref{lem:starone} as a subroutine we get an overall running time of

602: $O(n^3m^2)$, for an $n \times m$ input matrix.\\

603: \end{proof}
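
\medskip

\noindent For the four-row example in the proof above, the effect of the extra column can be checked by hand. This is only a sketch, with $F$ read as the two-column matrix whose rows are [0~0], [0~1], [1~0] and [1~1]. The forbidden resolution of $g=[2~2]$ uses [1~1] and [0~0], so $a=b=1$ and only the genotype [1~1] receives a $0$ in the new column. In $G_{mod}$ the genotype [2~2~1] is resolved in the forbidden way by [1~1~1] and [0~0~1]. Neither of these is consistent with [0~1~1], [1~0~1] or [1~1~0], so this choice costs $2+3=5$ haplotypes. Resolving [2~2~1] by [1~0~1] and [0~1~1] instead reuses two haplotypes that are needed anyway and gives the solution
\[
\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \end{bmatrix}
\]
with three rows. Deleting the last column yields $\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}$, which resolves the original matrix and does not contain $F$.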

604: \medskip

605: \medskip

606: The borderline open complexity problems are now $PH(*,2)$ and $MPPH(*,2)$. Unfortunately, we have not found the answer

607: to these complexity questions. However, the borders have been pushed slightly further. In \cite{islands} $PH(*,2)$ is

608: shown to be polynomially solvable if the input genotypes have the complete graph as compatibility graph; we call this

609: problem $PH(*,2)$-$C1$. We will give the counterpart result for $MPPH(*,2)$-$C1$.

610:

611: Let $G$ be an $n \times m$ $MPPH(*,2)$-$C1$ input matrix. Since the compatibility graph is a clique, every column of

612: $G$ contains only one symbol besides possible 2's. If we replace in every 1-column of $G$ (a column containing only

613: 1's and 2's) the 1's by 0's and mark the SNP corresponding to this column `flipped', then

614:  we obtain an equivalent problem

615: on a $\{0,2\}$-matrix $G'$.

616: To see that this problem is indeed equivalent, suppose $H'$ is a haplotype matrix

617: resolving this modified genotype

618: matrix $G'$ and suppose $H'$ does not contain the forbidden submatrix $F$.

619:  Then by interchanging 0's and 1's in every column of $H'$

620: corresponding to a flipped SNP, one obtains a haplotype matrix $H$ without the forbidden submatrix

621: which resolves the original input matrix $G$. And vice versa.

622: Hence, from now on we will assume, without loss of generality, that the input matrix $G$ is a $\{0,2\}$-matrix.

623:

624: If we assume moreover that $n\geq 3$, which we do from here on, the \emph{trivial haplotype} $h_t$ defined as the

625: all-0 haplotype of length $m$ is the only haplotype consistent with all genotypes in $G$.

626:

627: We define the \emph{restricted} compatibility graph $C_{R}(G)$ of

628: $G$ as follows. As in the normal compatibility graph, the vertices of

629: $C_{R}(G)$ are the genotypes of $G$. However, there is an edge

630: $\{g,g'\}$ in $C_{R}(G)$ only if $g \sim_{h} g'$ for some $h \neq

631: h_t$, or, equivalently, if there is a column where both $g$ and $g'$

632: have a 2.

633:

634: \medskip

635:

636: \begin{lemma}

637: \label{lem:deg2} If $G$ is a feasible instance of $MPPH(*,2)$-$C1$

638: then every vertex in $C_R(G)$ has degree at most 2.

639: \end{lemma}

640: \begin{proof}

641: Any vertex of degree higher than 2 in $C_R(G)$ implies the existence

642: in $G$ of submatrix:

643:

644: \medskip

645: \[

646: B= \begin{bmatrix} 2 & 2 & 2 \\ 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{bmatrix}

647: \]

648: \medskip

649:

650: \noindent \looseness=+1 It is easy to verify that no resolution of this submatrix permits a perfect phylogeny.\\

651: \end{proof}
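
\medskip

\noindent The verification can be sketched as follows, reading $F$ (as the surrounding text suggests) as the two-column matrix whose rows are [0~0], [0~1], [1~0] and [1~1]. The last three rows of $B$ can only be resolved by [1~0~0], [0~1~0] and [0~0~1], each together with [0~0~0]. The first row is resolved by two haplotypes $h$ and $h'$ with $h'=[1~1~1]-h$ entrywise, so $h$ and $h'$ contain three 1's in total and one of them has a 1 in at least two columns, say $i$ and $j$. Restricted to columns $i$ and $j$, the solution then contains the rows [1~1], [1~0], [0~1] and [0~0], that is, a copy of $F$.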

652:

653: \medskip

654:

655: Suppose that $G$ has two identical columns. There are either 0, 1 or 2 rows with 2's in both these columns.

656: In each case it is easy to see that any haplotype matrix $H$ resolving $G$ can be modified, without introducing

657:  a forbidden submatrix,  to make

658: the corresponding columns in $H$ equal as well (simply delete one column and duplicate another). This leads to the

659: first step of the algorithm {\bf A} that we propose for solving $MPPH(*,2)$-$C1$:

660:

661: \medskip

662:

663: \noindent {\bf Step 1 of A}: Collapse all identical columns in $G$.

664:

665: \medskip

666:

667: \noindent From now on,

668:  we assume that there are no identical columns. Let us partition the genotypes into $G_0$, $G_1$

669: and $G_2$, denoting the sets of genotypes in $G$ with, respectively, degree 0, 1 and 2 in $C_R(G)$. For any genotype

670: $g$ of degree 1 in $C_R(G)$ there is exactly

671:  one genotype with a 2 in the same column as $g$. Because there are no

672: identical columns,

673:  it follows that any genotype $g$ of degree 1 in $C_R(G)$ can have at most two 2's. Similarly any

674: genotype of degree 2 in $C_R(G)$ has at most three 2's. Accordingly we define $G_1^1$ and $G_1^2$ as the genotypes in

675: $G_1$ that have one 2 and two 2's, respectively, and similarly $G_2^2$ and $G_2^3$ as the genotypes in $G_2$ with two

676: and three 2's, respectively.

677:

678: The following lemma states how genotypes in these sets  must  be resolved if no submatrix $F$ is allowed in

679: the solution. If genotype $g$ has $k$ 2's we denote by  $g[a_1,a_2,\ldots,a_k]$ the haplotype

680:  with entry $a_i$ in the position  where $g$ has its $i$-th 2  and 0 everywhere else.
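
\medskip

\noindent For instance (an illustrative genotype, not taken from a specific instance), if $g=[0~2~0~2]$ then $k=2$ and we have $g[1,0]=[0~1~0~0]$, $g[0,1]=[0~0~0~1]$, $g[0,0]=[0~0~0~0]=h_t$ and $g[1,1]=[0~1~0~1]$. The pairs $g[1,0],g[0,1]$ and $g[0,0],g[1,1]$ are exactly the two ways of resolving $g$.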

681:

682: \medskip

683:

684: \begin{lemma}

685: \label{lem:usegeno} A haplotype matrix is a feasible solution to the problem $MPPH(*,2)$-$C1$ if and only if all genotypes are resolved in one of the following ways:

686:

687: \noindent {\em (i)} A genotype $g\in G_1^1$ is resolved by $g[1]$ and $g[0]=h_t$. \\

688: {\em (ii)} A genotype $g\in G_2^2$ is resolved by $g[0,1]$ and $g[1,0]$. \\

689: {\em (iii)} A genotype $g\in G_1^2$ is either resolved by $g[0,0]=h_t$

690: and $g[1,1]$ or by $g[0,1]$ and $g[1,0]$. \\

691: {\em (iv)} A genotype $g\in G_2^3$ is either resolved by $g[1,0,0]$

692: and $g[0,1,1]$ or by $g[0,1,0]$ and $g[1,0,1]$ (assuming that

693:  the two neighbours of $g$ have a 2 in the first two positions where $g$ has a 2).

694: \end{lemma}

695: \begin{proof}

696: A genotype $g\in G_2^2$ has degree 2 in $C_R(G)$, which implies the existence in $G$ of a submatrix:

697: \medskip

698: \begin{center}

699: $D =$

700: \begin{tabular}{ll}

701: $\begin{array}{l}

702: g\\

703: g'\\

704: g''\\

705: \end{array}

706: \begin{bmatrix} 2 & 2 \\ 2 & 0 \\ 0 & 2 \end{bmatrix}$

707: \end{tabular}.

708: \end{center}

709: \medskip

710: \noindent Resolving $g$ with $g[0,0]$ and $g[1,1]$ clearly leads to the forbidden submatrix $F$. Similarly, resolving

711: a genotype $g\in G_2^3$ with $g[0,0,1]$ and $g[1,1,0]$ or with $g[0,0,0]$ and $g[1,1,1]$ leads to a forbidden

712: submatrix in the first two columns where $g$ has a 2. It follows that

713:  resolving the genotypes in a way other than

714:  described in the lemma

715: yields a haplotype matrix which does not admit a perfect phylogeny.

716:

717: Now suppose that all genotypes are resolved as described in the lemma and assume that there is a forbidden submatrix

718: $F$ in the solution. Without loss of generality,  we assume $F$ can be found in the first two columns of the solution

719: matrix. We may also assume that no haplotype can be deleted from the solution. Then, since $F$ contains [1 1], there

720: is a genotype $g$ starting with [2~2]. Since there are no identical columns there are only two possibilities. The

721: first possibility is that there is exactly one other genotype $g'$ with a 2 in exactly one of the first two columns.

722: Since all genotypes different from $g$ and $g'$ start with [0 0], none of the resolutions of $g$ can have created the

723: complete submatrix $F$. Contradiction. The other possibility is that there is exactly one genotype with a 2 in the

724: first column and exactly one genotype with a 2 in the second column, but these are different genotypes, i.e. we have

725: the submatrix $D$. Then $g\in G_2^3$ or $g\in G_2^2$ and it can again be checked that none of the resolutions in (ii)

726: and (iv) leads to the forbidden submatrix.\\

727: \end{proof}

728:

729: \medskip

730:

731: \begin{lemma} Let $G$ be an instance of $MPPH(*,2)$-$C1$ and $G_1^2$, $G_2^3$ as defined above.

732: \label{lem:private} \\ {\em (i)} Any nontrivial haplotype is consistent

733: with at most two genotypes in $G$.\\

734: {\em (ii)} A genotype $g\in G_1^2\cup G_2^3$ must be resolved using at least one haplotype that is not consistent with

735: any other genotype.

736: \end{lemma}

737: \begin{proof} {\em (i)} Let $h$ be a nontrivial haplotype.

738: There is a column where $h$ has a 1 and there are at most

739: two genotypes with a 2 in that column. \\

740: {\em (ii)} A genotype $g\in G_1^2\cup G_2^3$ has a 2 in a column that has no other 2's. Hence there is a haplotype

741: with a 1 in this column and this haplotype is not consistent with any other genotypes.\\

742: \end{proof}

743:

744: \medskip

745:

746:  A haplotype that is only consistent with $g$ is called a \emph{private haplotype} of $g$. Based on (i) and

747: (ii) of Lemma~\ref{lem:usegeno} we propose the next step of {\bf A}:

748:

749: \medskip

750:

751: \noindent {\bf Step 2 of A}: \looseness=-1 Resolve all $g\in G_1^1 \cup G_2^2$ by the unique haplotypes allowed to

752: resolve them according to Lemma~\ref{lem:usegeno}. Also resolve each $g\in G_0$ with $h_t$ and the complement of $h_t$

753: with respect to $g$. This leads to a partial haplotype matrix $H_2^p$.

754:

755: \medskip

756:

757: \noindent The next step of {\bf A} is based on Lemma~\ref{lem:private} (ii).

758:

759: \medskip

760: \noindent {\bf Step 3 of A}: \looseness=-1 For each $g\in G_1^2 \cup G_2^3$ with $g\sim_{h'}g'$ for some $h'\in H_2^p$

761: that is allowed to resolve $g$ according to Lemma~\ref{lem:usegeno}, resolve $g$ by adding the complement $h''$ of

762: $h'$ w.r.t. $g$ to the set of haplotypes, i.e. set $H_2^p := H_2^p \cup \{h''\}$, and repeat this step as long as new

763: haplotypes get added. This leads to partial haplotype matrix $H_3^p$.

764:

765: \medskip

766:

767: \noindent Notice that $H_3^p$ does not contain any haplotype that is

768: allowed to resolve any of the genotypes that have not been resolved

769: in Steps 2 and 3. Let us denote this set of leftover, unresolved

770: genotypes by $GL$, the degree 1 vertices among those by  $GL_1\subseteq G_1^2$, and the

771: degree 2 vertices among those  by $GL_2\subseteq G_2^3$. The restricted

772: compatibility graph induced by $GL$, which we denote by $C_R(GL)$

773: consists of paths and cycles. We first give the final steps of

774: algorithm A and argue optimality afterwards.

775:

776: \medskip

777:

778: \noindent {\bf Step 4 of A}: Resolve each cycle in $C_R(GL)$, necessarily consisting of $GL_2$-vertices, by starting

779: with an arbitrary vertex and, following the cycle, resolving each next pair $g,g'$ of vertices by haplotype $h \neq

780: h_t$ such that $g\sim_h g'$ and the two complements of $h$ w.r.t. $g$ and $g'$ respectively. In case of an odd cycle

781: the last vertex is resolved by any pair of haplotypes that is allowed to resolve it. Note that $h$ has a 1 in the

782: column where both $g$ and $g'$ have a 2 and otherwise 0. It follows easily that $g$ and $g'$ are both allowed to use

783: $h$ (and its complement) according to (iv) of Lemma~\ref{lem:usegeno}.
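
\medskip

\noindent As an illustration of Step 4, consider the following hypothetical instance, constructed only for this example, in which Steps 2 and 3 resolve nothing:
\[
\begin{array}{l} g_1 \\ g_2 \\ g_3 \end{array}
\begin{bmatrix} 2 & 0 & 2 & 2 & 0 & 0 \\ 2 & 2 & 0 & 0 & 2 & 0 \\ 0 & 2 & 2 & 0 & 0 & 2 \end{bmatrix}.
\]
Here $C_R(GL)$ is a cycle on three vertices and every genotype has three 2's, two of them shared with its neighbours. Starting at $g_1$, the pair $g_1,g_2$ is resolved by $h=[1~0~0~0~0~0]$ and its complements $[0~0~1~1~0~0]$ w.r.t. $g_1$ and $[0~1~0~0~1~0]$ w.r.t. $g_2$. The last vertex $g_3$ can be resolved by, for example, $g_3[1,0,0]=[0~1~0~0~0~0]$ and $g_3[0,1,1]=[0~0~1~0~0~1]$. This uses five haplotypes in total, which equals $k+\lceil\frac{k}{2}\rceil$ for $k=3$.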

784:

785: \medskip

786:

787: \noindent {\bf Step 5 of A}: Resolve each path in $C_R(GL)$ with both endpoints in $GL_1$ by first resolving the

788: $GL_1$ endpoints by the trivial haplotype $h_t$ and the complements of $h_t$ w.r.t. the two endpoint genotypes,

789: respectively. The remaining path contains only $GL_2$-vertices and is resolved according to Step 6.

790:

791: \medskip

792:

793: \noindent {\bf Step 6 of A}: Resolve each remaining path by starting in (one of) its $GL_2$-endpoint(s), and following

794: the path, resolving each next pair of vertices as in Step 4. In case of a path with an odd number of vertices, resolve

795: the last vertex by any pair of haplotypes that is allowed to resolve it in case it is a $GL_2$-vertex, and resolve it

796: by the trivial haplotype and its complement w.r.t. the vertex in case it is a $GL_1$ vertex.

797:

798: \medskip

799:

800: By construction the haplotype matrix $H$ resulting from {\bf A} resolves $G$. In addition, it follows from

801: Lemma~\ref{lem:usegeno} that $H$ admits a perfect phylogeny.

802:

803: To argue minimality of the solution, first observe that the haplotypes added in Step 2 and Step 3 are

804: unavoidable by Lemma~\ref{lem:usegeno} (i) and (ii) and Lemma~\ref{lem:private} (ii). Lemma~\ref{lem:private} tells us

805: moreover that the resolution of a cycle of $k$ genotypes in $GL_2$ requires at least $k+\lceil\frac{k}{2}\rceil$

806: haplotypes that can not be used to resolve any other genotypes in $GL$. This proves optimality of Step 4. To prove

807: optimality of the last two steps we need to take into account that genotypes in $GL_1$ can potentially share the

808: trivial haplotype. Observe that to resolve a path with $k$ vertices one needs at least $k+\lceil\frac{k}{2}\rceil$

809: haplotypes. Indeed {\bf A} does not use more than that in Steps 5 and 6. Moreover, since these paths are disjoint,

810: they cannot share haplotypes for resolving their genotypes except for the endpoints if they are in $GL_1$, which can

811: share the trivial haplotype. Indeed, {\bf A} exploits the possibility of sharing the trivial haplotype in a maximal

812: way, except on a path with an even number of vertices and one endpoint in $GL_1$. Such a path, with $k$ (even)

813: vertices, is resolved in {\bf A} by $3\frac{k}{2}$ haplotypes that can not be used to resolve any other genotypes. The

814: degree 1 endpoint might alternatively be resolved by the trivial haplotype and its complement w.r.t. the corresponding

815: genotype, adding the latter private haplotype, but then for resolving the remaining path with $k-1$ (odd) vertices

816: only from $GL_2$ we still need $k-1+\lceil\frac{k-1}{2}\rceil$, which together with the private haplotype of the

817: degree 1 vertex gives $3\frac{k}{2}$ haplotypes also (not even counting $h_t$).

818:

819: As a result we have polynomial-time solvability of $MPPH(*,2)$-$C1$.

820:

821: \medskip

822:

823: \begin{theorem}

824: $MPPH(*,2)$ is solvable in polynomial time if the compatibility graph is a clique.

825: \flushright

826: \QEDclosed

827: \end{theorem}

828:

829: \section{Approximation algorithms}

830: \label{sec:approx}

831: In this section we construct polynomial time approximation algorithms for $PH$ and $MPPH$, where the accuracy depends

832: on the number of 2's per column of the input matrix. We describe genotypes without 2's as \emph{trivial} genotypes,

833: since they have to be resolved in a trivial way by one haplotype. Genotypes with at least one 2 will be described as

834: \emph{nontrivial} genotypes. We write \phminZ{} and \mpphminZ{} to denote the restricted versions of the problems

835: where each genotype is nontrivial. We make this distinction between the problems because we have better lower bounds

836: (and thus approximation ratios) for the restricted variants.

837:

838: \subsection{$PH$ and $MPPH$ where all input genotypes are nontrivial}

839: To prove approximation guarantees we need good lower bounds on the number of haplotypes in the solution. We start with

840: two bounds from \cite{islands}, whose proofs we give because the first is short but rests on a crucial observation, and the second was incomplete in \cite{islands}. We use these bounds to obtain a different lower

841: bound that we need for our approximation algorithms. \medskip

842: \begin{lemma}

843: \label{lem:minBound} \cite{islands} Let $G$ be an $n \times m$ instance of \phminZ{} (or \mpphminZ). Then at

844: least

845: \begin{eqnarray*}

846: LB_{sqrt}(n) = \bigg \lceil \frac{ 1 + \sqrt{1+8n} }{2} \bigg \rceil

847: \end{eqnarray*}

848: haplotypes are required to resolve $G$.

849: \end{lemma}

850: \begin{proof}

851: The proof follows directly from the observation that $q$ haplotypes can resolve at most $\binom{q}{2} = q(q-1)/2$

852: nontrivial genotypes.\\

853: \end{proof}
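
\medskip

\noindent Spelled out, and assuming that the rows of $G$ are pairwise distinct: if $q$ haplotypes resolve $n$ nontrivial genotypes then
\[
n \leq \frac{q(q-1)}{2} \quad \Longleftrightarrow \quad q^2 - q - 2n \geq 0 ,
\]
and since $q>0$ this holds exactly when $q \geq \frac{1+\sqrt{1+8n}}{2}$. As $q$ is an integer, rounding up gives $LB_{sqrt}(n)$. For example, for $n=10$ we have $\sqrt{1+8n}=9$ and $LB_{sqrt}(10)=5$, and indeed $\binom{5}{2}=10$.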

854: \medskip

855: \begin{lemma}

856: \label{lem:theirbound} \cite{islands} Let $G$ be an $n \times m$ instance of \phmin{\ell}, for some $\ell \geq 1$,

857: such that the compatibility graph of $G$ is a clique. Then at least

858: \begin{eqnarray*}

859: LB_{sha}(n,\ell) = \bigg \lceil \frac{2n}{\ell+1} + 1 \bigg \rceil

860: \end{eqnarray*}

861: haplotypes are required to resolve $G$.

862: \end{lemma}

863: \begin{proof}

864: Recall that, after relabeling if necessary, the trivial haplotype $h_t$ is the all-0 haplotype and is consistent with all genotypes. Suppose a solution of $G$ has $q$

865: non-trivial haplotypes. Observe that $h_t$ can be used in the resolution of at most $q$ genotypes. Also observe (by

866: Lemma 5 in \cite{islands}) that each non-trivial haplotype can be used in the resolution of at most $\ell$ genotypes.

867: Now distinguish two cases. First consider the case where $h_t$ is in the solution. Then from the two observations

868: above it follows that $n \leq (q+\ell q)/2$ and hence the solution consists of at least $q+1 \geq 2n/(\ell +1)+1$

869: haplotypes. Now consider the second case i.e. where $h_t$ is not in the solution. Then we have that $n \leq \ell q/2$

870: and hence that the solution consists of at least $2n/\ell$ haplotypes. If $n \geq \ell (\ell+1)/2$ we have that

871: $2n/\ell \geq 2n/(\ell+1)+1$, and the claim follows. If $n < \ell (\ell+1)/2$ then this implies that $\ell

872: >\frac{\sqrt{1+8n}-1}{2}$. Combining this with the bound $q\geq \frac{\sqrt{1+8n}+1}{2}$ from Lemma~\ref{lem:minBound} gives

873: that $(\ell+1)(q-1) > \frac{1}{4}(\sqrt{1+8n} + 1)(\sqrt{1+8n} - 1)$, which is equal to $2n$. It follows that $q >

874: 2n/(\ell +1)+1$.\\

875: \end{proof}

876: \medskip

877: The $LB_{sha}$ bound has been proven only for \phminZ{} (and \mpphminZ) instances where the compatibility graph is a

878: clique. We now prove a different bound which, in terms of cliques, is slightly weaker (for large $n$) than $LB_{sha}$,

879: but which allows us to generalise the bound to more general inputs. (Indeed it remains an open question whether

880: $LB_{sha}$ applies as a lower bound not just for cliques but also for general instances.)

881: \medskip

882: \begin{lemma}

883: \label{lem:boundmin} Let $G$ be an $n \times m$ instance of \phmin{\ell}, for some $\ell \geq 1$. Then at least

884: \begin{equation}

885: \lbmidmin{n}{\ell} = \bigg \lceil \frac{2(n+\ell)(\ell+1)}{\ell(\ell+3)}

886: \bigg \rceil

887: \end{equation}

888: haplotypes are required to resolve $G$.

889: \end{lemma}

890: \begin{proof}

891: Let $C(G)$ be the compatibility graph of $G$. We may assume without loss of generality that $C(G)$ is connected. First

892: consider the case where $C(G)$ is a clique. If $n \geq \ell ( \ell +1)/2$, it suffices to notice that

893: $\lbmidmin{n}{\ell}\leq LB_{sha}(n,\ell)$ for each value of $\ell \geq 1$, since the function

894: \begin{equation}

895: f(n) = \frac{2n}{\ell +1}+1 - \frac{2(n+\ell)(\ell+1)}{\ell(\ell+3)}

896: \end{equation}

897: is equal to $0$ if $n= \ell ( \ell +1)/2$ and has nonnegative derivative

898: $f'(n)=\frac{2}{\ell+1}-2\frac{\ell+1}{\ell(\ell+3)}\geq 0$.\\

899: Secondly, if $1 \leq n \leq \ell (\ell +1)/2$, straightforward but tedious calculations show that for all $\ell \geq

900: 1$ the function

901: \begin{equation}

902:  F(n)= \frac{ 1 + \sqrt{1+8n}}{2} - \frac{2(n+\ell)(\ell+1)}{\ell(\ell+3)}

903: \end{equation}

904: has value $0$ for $n= \ell ( \ell +1)/2$ and for some $n$ in the interval $[0,1]$, whereas in between these values it

905: has positive value. Hence, $\lbmidmin{n}{\ell}\leq LB_{sqrt}(n)$ for $1 \leq n \leq \ell (\ell +1)/2$.

906:

907: To prove that the bound also holds if $C(G)$ is not a clique we use induction on $n$. Suppose that

908: for each $n'< n$ the lemma

909: holds for all $n' \times m$ instances $G'$ of \phmin{\ell '} for every $m$ and $\ell '$.

910: Since $C(G)$ is not a clique there exist two genotypes $g_1$ and $g_2$ in $G$ and a

911: column $j$ such that $g_1(j)=0$ and $g_2(j)=1$. Given that $G$ is a \phmin{\ell} instance

912: $t \leq \ell$ genotypes have a 2 in column $j$.

913: Deleting these $t$ genotypes yields an instance $G^d$ with disconnected compatibility graph $C(G^d)$, since the absence of a $2$ in column $j$ prevents the existence of any path from $g_1$ to $g_2$. Let $C(G^d)$ have $p \geq 2$

914: components $C(G_1), ..., C(G_p)$, and let $n_i \geq 1$ denote the number of genotypes in $G_i$. Thus, $n = n_1 +

915: ... + n_p + t$. We use the induction hypothesis on $G_1,\ldots,G_p$ to conclude that the number of haplotypes required to resolve $G$ is at least

916: \begin{eqnarray*}

917: \sum_{i=1}^p \bigg \lceil \frac{2(n_i + \ell)(\ell+1)}{\ell(\ell+3)} \bigg \rceil

918:               & \geq & \bigg \lceil \frac{2(\sum_{i=1}^p n_i + p\ell)(\ell+1)}{\ell(\ell+3)} \bigg \rceil

919:              \geq \bigg \lceil \frac{2(\sum_{i=1}^p n_i + 2\ell)(\ell+1)}{\ell(\ell+3)} \bigg \rceil \\

920:              & \geq & \bigg \lceil \frac{2(\sum_{i=1}^p n_i + t+ \ell)(\ell+1)}{\ell(\ell+3)} \bigg \rceil

921:              = \bigg \lceil \frac{2(n + \ell)(\ell+1)}{\ell(\ell+3)} \bigg \rceil

922: \end{eqnarray*}

923: \end{proof}
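
\medskip

\noindent As a numerical sanity check of the bounds (the values of $n$ and $\ell$ are chosen only for illustration), take $\ell=2$, so that $LB^{nt}_{mid}(n,2)=\lceil 3(n+2)/5 \rceil$ and $LB_{sha}(n,2)=\lceil 2n/3+1\rceil$. At the crossover point $n=\ell(\ell+1)/2=3$ all three bounds coincide: $LB^{nt}_{mid}(3,2)=LB_{sha}(3,2)=LB_{sqrt}(3)=3$. For $n=8$ we get $LB^{nt}_{mid}(8,2)=6$ and $LB_{sha}(8,2)=\lceil 19/3\rceil=7$, and for $n=2$ we get $LB^{nt}_{mid}(2,2)=\lceil 12/5\rceil=3=LB_{sqrt}(2)$, in agreement with the two cases treated in the proof.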

924: \medskip

925: \begin{corollary}

926: \label{cor:easyapproxMin} Let $G$ be an $n \times m$ instance of \phmin{\ell} or \mpphmin{\ell}, for some $\ell \geq

927: 1$. Any feasible solution for $G$ is within a ratio $\ell + 2 - \frac{2}{\ell+1}$ from optimal.

928: \end{corollary}

929: \begin{proof}

930: Immediate from the fact that any solution for $G$ has at most $2n$ haplotypes. In the case of $MPPH$ we can check

931: whether feasible solutions exist, and if so obtain such a solution, by using, for example, the algorithm in

932: \cite{gusfieldlinear}.\\

933: \end{proof}
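
\medskip

\noindent The ratio can be read off from Lemma~\ref{lem:boundmin} as follows (a short calculation added for illustration). Since a feasible solution has at most $2n$ haplotypes and an optimal one has at least $LB^{nt}_{mid}(n,\ell)$,
\[
\frac{2n}{LB^{nt}_{mid}(n,\ell)} \leq \frac{2n\,\ell(\ell+3)}{2(n+\ell)(\ell+1)}
= \frac{n}{n+\ell}\cdot\frac{\ell(\ell+3)}{\ell+1}
< \frac{\ell(\ell+3)}{\ell+1} = \ell+2-\frac{2}{\ell+1},
\]
where the last equality uses $(\ell+2)(\ell+1)-2=\ell(\ell+3)$. For $\ell=1$ this gives ratio $2$ and for $\ell=2$ it gives $10/3$.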

934: \medskip

935: Not surprisingly, better approximation ratios can be achieved. The following simple algorithm computes

936: approximations of \phmin{\ell}. (The algorithm does not work for $MPPH$, however.)

937:

938: \medskip

939:

940: \noindent

941: \textbf{Algorithm:} $PH^{nt}M$ \\

942: \textbf{Step 1:} construct the compatibility graph $C(G)$.\\

943: \textbf{Step 2:} find a maximal matching $M$ in $C(G)$.\\

944: \textbf{Step 3:} for every edge $\{g_1,g_2\}\in M$, resolve $g_1$ and $g_2$ by in total 3 haplotypes: any haplotype

945: consistent with both $g_1$ and $g_2$, and its complements with respect to $g_1$ and $g_2$.\\

946: \textbf{Step 4:} resolve each remaining genotype by two haplotypes.

947: \medskip

948: \begin{theorem}

949: $PH^{nt}M$ computes a solution to \phmin{\ell} in polynomial time within an approximation

950: ratio of $c(\ell)=\frac{3}{4}\ell +\frac{7}{4}-\frac{3}{2}\frac{1}{\ell +1}$, for every $\ell \geq 1$.

951: \end{theorem}

952: \begin{proof}

953: Since constructing $C(G)$ given $G$ takes $O(n^2m)$ time and finding a maximal matching in any graph takes linear

954: time, $O(n^2m)$ running time follows directly.

955:

Let $q$ be the size of the maximal matching $M$.
Then $PH^{nt}M$ gives a solution with
$3q+2(n-2q) = 2n-q$ haplotypes. Since $M$ is maximal, the $n-2q$ genotypes
not covered by $M$ form an independent set in $C(G)$. No haplotype can be shared by two incompatible
genotypes, so any solution must contain at least $2(n-2q)$
haplotypes to resolve the genotypes in this independent set.

961: The theorem thus holds if $\frac{2n-q}{2n-4q} \leq c(\ell)$. If

962: $\frac{2n-q}{2n-4q}

963: > c(\ell)$, implying that $q > \frac{2-2c(\ell)}{1-4c(\ell)}n$, we use the lower bound of Lemma

964: \ref{lem:boundmin} to obtain

965: \[

966: % NOTE THAT I COULD NOT USE THE LBMIDMIN MACRO HIER

967: \frac{2n-q}{ LB^{nt}_{mid}(n,\ell) } < \frac{2n-\frac{2-2c(\ell)}{1-4c(\ell)}n}{LB^{nt}_{mid}(n,\ell)} <

968: \frac{(2n-\frac{2-2c(\ell)}{1-4c(\ell)}n)\ell(\ell+3)}{2n(\ell +1)}= \frac{3\ell

969: c(\ell)}{4c(\ell)-1}\frac{\ell+3}{\ell+1}= c(\ell).

970: \]

971: The last equality follows directly since $(4c(\ell)-1)(\ell+1)=3\ell(\ell+3)$.\\

972: \end{proof}
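As a quick sanity check of the closing identity, the following Python sketch evaluates $c(\ell)$ exactly for small $\ell$ and compares it with the ratio $\ell + 2 - \frac{2}{\ell+1}$ of Corollary \ref{cor:easyapproxMin}. The function names are hypothetical and the range of $\ell$ is an arbitrary choice; the check illustrates the algebra and does not replace it.

```python
from fractions import Fraction


def trivial_ratio(l):
    """Ratio guaranteed for any feasible solution: l + 2 - 2/(l+1)."""
    return Fraction(l) + 2 - Fraction(2, l + 1)


def c(l):
    """Ratio claimed for PH^{nt}M: 3l/4 + 7/4 - (3/2)/(l+1)."""
    return Fraction(3, 4) * l + Fraction(7, 4) - Fraction(3, 2) / (l + 1)


def check(l_max=10):
    """Verify the identity used in the proof and that c improves on the trivial ratio."""
    for l in range(1, l_max + 1):
        # Identity behind the last equality of the proof.
        assert (4 * c(l) - 1) * (l + 1) == 3 * l * (l + 3)
        # The matching-based algorithm has the better guarantee.
        assert c(l) < trivial_ratio(l)
    return True
```

For instance, \texttt{c(1)} evaluates to $7/4$ against a trivial ratio of $2$, and \texttt{c(2)} to $11/4$ against $10/3$.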

973:

974: \subsection{$PH$ and $MPPH$ where not all input genotypes are nontrivial}

975:

976: Given an instance $G$ of $PH$ or $MPPH$ containing $n$ genotypes, $n_{nt}$ denotes the number of nontrivial

977: genotypes in $G$ and $n_t$ the

978: number of trivial genotypes; clearly $n = n_{nt} + n_t.$

979: \medskip

980: \begin{lemma}

981: \label{lem:cliquewith}

982: Let $G$ be an $n \times m$ instance of $PH(*,\ell)$, for some $\ell \geq 2$, where the compatibility

983: graph of the nontrivial genotypes in $G$ is a clique, $G$ is not equal to a single trivial genotype,

984: and no nontrivial genotype in $G$ is the sum of two trivial genotypes in $G$. Then at least

985: \[

986: \lbwith{n}{\ell} = \bigg \lceil \frac{n}{\ell} + 1 \bigg \rceil

987: \]

988: haplotypes are needed to resolve $G$.

989: \end{lemma}

990: \begin{proof}

991: Note that the lemma holds if $n_t \geq n/\ell + 1$. So we assume from now on that $n_t < n/\ell + 1$.

992:

We first prove that the bound holds for $n_{nt} \leq \ell$. Combining this with $n_t < n/2 + 1$ (which follows from
$n_t < n/\ell + 1$ and $\ell \geq 2$) gives that $n < 2\ell
+ 2$. Thus $n/\ell + 1 < 4$. Hence if $n_t \geq 4$ then we are done. Thus we only have to consider cases where both
$n_t \in \{0,1,2,3\}$ and $\ell \geq \max \{2,n_{nt}\}$. We verify these cases in Table \ref{tab:case}; note the
importance of the fact that no nontrivial genotype is the sum of two trivial genotypes in verifying that these are

998: %

999: \begin{table}

1000: \centering \caption{Case $n_t < 4$, $n_{nt}\leq \ell$ in proof of Lemma \ref{lem:cliquewith}} \label{tab:case}

1001: \begin{tabular}{|c|c|c|}

1002: \hline

1003: $n_t$&$n_{nt}$&$\lceil n/\ell +1 \rceil$\\

1004: \hline

1005: 0 & 1 & 2 \\

1006: 0 & $z \geq 2$ & $\leq \lceil z/z + 1 \rceil = 2$\\

1007: 1 & 1 & 2\\

1008: 1 & $z \geq 2$ & $\leq \lceil (z+1)/z + 1 \rceil = 3$\\

1009: 2 & 0 & 2\\

1010: 2 & 1 & $\leq 3$\\

1011: 2 & $z\geq 2$ & $\leq \lceil (z+2)/z + 1 \rceil = 3$\\

1012: 3 & 0 & $\leq 3$\\

1013: 3 & 1 & $\leq 3$\\

1014: 3 & 2 & $\leq 4$\\

1015: 3 & $z \geq 3$ & $\leq \lceil (z+3)/z + 1 \rceil = 3$\\

1016: \hline

1017: \end{tabular}

1018: \end{table}

1019:

1020: We now prove the lemma for $n_{nt} > \ell$. Note that in this case there exists a unique trivial haplotype $h_t$

1021: consistent with all nontrivial genotypes. Suppose, by way of contradiction, that $N = N_t + N_{nt}$ is the size of the

1022: smallest instance $G'$ for which the bound does not hold. Let $H$ be an optimal solution for $G'$ and let $h = |H|$.

1023:

Observe firstly that $N \equiv 1 \pmod{\ell}$, because if this is not true we have that $\lbwith{N-1}{\ell} =

1025: \lbwith{N}{\ell}$ and we can find a smaller instance for which the bound does not hold, simply by removing an

1026: arbitrary genotype from $G'$, contradicting the minimal choice of $N$.

1027:

1028: Similarly we argue that $h = \lbwith{N}{\ell}-1$, since if $h \leq \lbwith{N}{\ell}-2$ we could remove an arbitrary

1029: genotype to yield a size $N-1$ instance and still have that $h < \lbwith{N-1}{\ell}$.

1030:

1031: We choose a specific resolution of $G'$ using $H$ and represent it as a \emph{haplotype graph}. The vertices of this

1032: graph are the haplotypes in $H$. For each nontrivial genotype $g \in G'$ there is an edge between the two haplotypes

1033: that resolve it. For each trivial genotype $g \in G'$ there is a loop on the corresponding haplotype. There are no

1034: edges between looped haplotypes because of the precondition that no nontrivial genotype is the sum of two trivial

1035: genotypes.

1036:

1037: From Lemma 5 of \cite{islands} it follows that, with the exception of the possibly present trivial haplotype and

1038: disregarding loops, each haplotype in the graph has degree at most $\ell$. In addition, if an unlooped haplotype has

1039: degree less than or equal to $\ell$, or a looped haplotype has degree (excluding its loop) strictly smaller than

1040: $\ell$, then deleting this haplotype and all its at most $\ell$ incident genotypes creates an instance $G''$

1041: containing at least $N-\ell$ genotypes that can be resolved using $h-1$ haplotypes, yielding a contradiction to the

1042: minimality of $N$. (Note that, because $N_{nt}>\ell$, it is not possible that the instance $G''$ is empty or equal to

1043: a single trivial genotype.)

1044:

1045: The only case that remains is when, apart from the possibly present trivial haplotype, every haplotype in the

1046: haplotype graph is looped and has degree $\ell$ (excluding its loop). However, there are no edges between looped

1047: vertices and they can therefore only be adjacent to the trivial haplotype, yielding a contradiction.\\

1048: \end{proof}
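The entries of Table \ref{tab:case} can be re-derived mechanically. The sketch below is a hedged illustration with hypothetical names; it assumes that, since $\lceil n/\ell + 1 \rceil$ is nonincreasing in $\ell$, the largest value of the bound in each row is attained at the smallest admissible $\ell = \max\{2, n_{nt}\}$, and it checks the parametrised rows only up to an arbitrary cut-off \texttt{z\_max}.

```python
from fractions import Fraction
from math import ceil


def lb_mid(n, l):
    """Lower bound of the lemma: ceil(n / l + 1), computed exactly."""
    return ceil(Fraction(n, l) + 1)


def worst_case_bound(n_t, n_nt):
    """Largest value of the bound over all admissible l >= max(2, n_nt)."""
    return lb_mid(n_t + n_nt, max(2, n_nt))


# Rows of the table with fixed (n_t, n_nt) and their stated bounds.
FIXED_ROWS = {(0, 1): 2, (1, 1): 2, (2, 0): 2, (2, 1): 3,
              (3, 0): 3, (3, 1): 3, (3, 2): 4}
# Rows parametrised by z: n_t -> (smallest z, stated bound).
Z_ROWS = {0: (2, 2), 1: (2, 3), 2: (2, 3), 3: (3, 3)}


def table_is_consistent(z_max=50):
    """Check every table entry against the worst-case value of the bound."""
    fixed_ok = all(worst_case_bound(n_t, n_nt) == bound
                   for (n_t, n_nt), bound in FIXED_ROWS.items())
    z_ok = all(worst_case_bound(n_t, z) == bound
               for n_t, (z_min, bound) in Z_ROWS.items()
               for z in range(z_min, z_max + 1))
    return fixed_ok and z_ok
```

The check only confirms the arithmetic in the third column; whether each value is a valid lower bound on the number of haplotypes still relies on the argument in the proof.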

1049: \medskip

1050: \begin{lemma}

1051: \label{lem:withgeneral}

1052: Let $G$ be an $n \times m$ instance of $PH(*,\ell)$, for some $\ell \geq 2$, where $G$ is not equal to a

1053: single trivial genotype, and no nontrivial genotype in $G$ is the sum of two trivial genotypes in $G$. Then

1054: at least $\lbwith{n}{\ell}$ haplotypes are needed to resolve $G$.

1055: \end{lemma}

1056: \begin{proof}

Essentially the same inductive argument as used in Lemma \ref{lem:boundmin} works: it is always possible to disconnect
the compatibility graph of $G$ into at least two components by removing at most $\ell$ nontrivial genotypes, with
cliques (Lemma \ref{lem:cliquewith}) as the base of the induction. The presence of trivial genotypes in the input (which we can actually simply

1060: exclude from the compatibility graph) does not alter the analysis. The fact that (in the inductive step) at least two

1061: components are created, each of which contains at least one nontrivial genotype, ensures that the inductive argument

1062: is not harmed by the presence of single trivial genotypes (for which the bound does not hold).\\

1063: \end{proof}

1064: \medskip

1065: \begin{corollary}

1066: \label{cor:withfirstbound} Let $G$ be an $n \times m$ instance of $PH(*,\ell)$ or $MPPH(*,\ell)$, for some $\ell

1067: \geq 2$. Any feasible solution for $G$ is within a ratio of $2\ell$ from optimal.

1068: \end{corollary}

1069: \begin{proof}

1070: Immediate because $2n/(n/\ell+1) < 2\ell$. (As before the algorithm from e.g. \cite{gusfieldlinear} can be used to

1071: generate feasible solutions for $MPPH$, or to determine that they do not exist.)\\

1072: \end{proof}

1073: The algorithm $PH^{nt}M$ can easily be adapted to solve $PH(*,\ell)$ approximately.

1074:

1075: \medskip

1076:

1077: \noindent

1078: \textbf{Algorithm:} $PHM$\\

1079: \textbf{Step 1:} remove from $G$ all genotypes that are the sum of two trivial genotypes \\

1080: \textbf{Step 2:} construct the compatibility graph $C(G')$ of the leftover instance $G'$.\\

1081: \textbf{Step 3:} find a maximal matching $M$ in $C(G')$.\\

1082: \textbf{Step 4:} for every edge $\{g_1,g_2\}\in M$, resolve $g_1$ and $g_2$ by

1083: three haplotypes if $g_1$ and $g_2$ are both nontrivial and by two haplotypes if one of them is trivial.\\

1084: \textbf{Step 5:} resolve each remaining nontrivial genotype by two haplotypes and each remaining trivial genotype by its corresponding haplotype.

1085: \medskip
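The five steps of $PHM$ admit a similar Python sketch. As before this is an illustration under assumptions, not a definitive implementation: genotypes are distinct tuples over $\{0,1,2\}$, a genotype is trivial when it contains no 2, the compatibility graph is taken to include the trivial genotypes (two distinct trivial genotypes are then never adjacent), and the matching is greedy. All function names are hypothetical.

```python
from itertools import combinations


def is_trivial(g):
    """A trivial genotype has no heterozygous (2) site."""
    return 2 not in g


def compatible(g1, g2):
    """Assumed compatibility test: no site has 0 in one genotype and 1 in the other."""
    return all(a == b or 2 in (a, b) for a, b in zip(g1, g2))


def genotype_sum(h1, h2):
    """Genotype resolved by two haplotypes: equal sites are kept, differing sites become 2."""
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))


def complement(h, g):
    """Flip h at the 2-sites of g; for a trivial g this returns h itself."""
    return tuple(1 - x if s == 2 else x for x, s in zip(h, g))


def phm(genotypes):
    """Return (haplotypes, leftover instance G', matching)."""
    # Step 1: drop genotypes that are the sum of two trivial genotypes.
    trivial = [g for g in genotypes if is_trivial(g)]
    sums = {genotype_sum(a, b) for a, b in combinations(trivial, 2)}
    rest = [g for g in genotypes if g not in sums]
    # Steps 2-3: greedy maximal matching in C(G').
    matched, matching = set(), []
    for i, j in combinations(range(len(rest)), 2):
        if i not in matched and j not in matched and compatible(rest[i], rest[j]):
            matched.update((i, j))
            matching.append((i, j))
    haplotypes = set()
    # Step 4: a shared haplotype plus its complements (3 or 2 haplotypes).
    for i, j in matching:
        g1, g2 = rest[i], rest[j]
        h = tuple(a if a != 2 else (b if b != 2 else 0) for a, b in zip(g1, g2))
        haplotypes.update((h, complement(h, g1), complement(h, g2)))
    # Step 5: unmatched genotypes (2 haplotypes if nontrivial, 1 if trivial).
    for i, g in enumerate(rest):
        if i not in matched:
            h = tuple(0 if s == 2 else s for s in g)
            haplotypes.update((h, complement(h, g)))
    return haplotypes, rest, matching
```

The genotypes removed in Step 1 need no extra haplotypes: every trivial genotype contributes its own haplotype in Step 4 or Step 5, so each removed genotype is resolved by two haplotypes already in the output.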

1086: \begin{theorem}

1087: $PHM$ computes a solution to $PH(*,\ell)$ in polynomial time within an approximation

1088: ratio of $d(\ell)=\frac{3}{2}\ell +\frac{1}{2}$, for every $\ell \geq 2$.

1089: \end{theorem}

1090: \begin{proof}

Since constructing $C(G')$ given $G'$ takes $O(n^2m)$ time and finding a maximal matching in any graph takes linear
time, $O(n^2m)$ running time follows directly.

1093:

Let $q$ be the size of the maximal matching, $n$ the number of genotypes
after Step 1 and $n_t$ the number of trivial genotypes in $G'$.
Then $PHM$
gives a solution with $2n-q-n_t$ haplotypes.
Since the genotypes not covered by the
maximal matching form an independent set of size $n-2q$ in $C(G')$, and no haplotype can be shared by two
incompatible genotypes, any solution must contain at least $n-2q$
haplotypes to resolve the genotypes in this independent set (each of which may be trivial).

1101: The theorem thus holds if $\frac{2n-q-n_t}{n-2q} \leq d(\ell)$.

1102: If $\frac{2n-q-n_t}{n-2q}

1103: > d(\ell)$, implying that $q > \frac{(d(\ell)-2)n+n_t}{2d(\ell)-1}$, we use the lower bound of Lemma

1104: \ref{lem:withgeneral} and obtain

1105: \[

1106: \frac{2n-q-n_t}{LB_{mid}(n,\ell)} < \frac{2n-\frac{(d(\ell)-2)n+n_t}{2d(\ell)-1}}{\lceil \frac{n}{\ell} + 1 \rceil} <

1107: \frac{2n-\frac{(d(\ell)-2)n}{2d(\ell)-1}}{\frac{n}{\ell}}= \frac{3d(\ell)\ell}{2d(\ell)-1}= d(\ell).

1108: \]

1109: %\end{array} \end{equation}

1110: The last equality follows directly since $2d(\ell)-1 = 3\ell$.\\

1111: \end{proof}
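The arithmetic behind $d(\ell)$ can be checked in the same spirit. The short Python sketch below (hypothetical function names, arbitrary ranges for $n$ and $\ell$) verifies the identity $2d(\ell)-1 = 3\ell$ exactly and confirms that $d(\ell)$ lies below the ratio $2\ell$ of Corollary \ref{cor:withfirstbound} for $\ell \geq 2$.

```python
from fractions import Fraction
from math import ceil


def lb_mid(n, l):
    """Lower bound ceil(n / l + 1) on the number of haplotypes."""
    return ceil(Fraction(n, l) + 1)


def d(l):
    """Ratio claimed for PHM: 3l/2 + 1/2."""
    return Fraction(3, 2) * l + Fraction(1, 2)


def check(l_max=10, n_max=60):
    """Verify the identity of the proof and compare d(l) with the trivial ratio 2l."""
    for l in range(2, l_max + 1):
        assert 2 * d(l) - 1 == 3 * l
        assert d(l) < 2 * l
        for n in range(1, n_max + 1):
            # Any feasible solution uses at most 2n haplotypes.
            assert Fraction(2 * n, lb_mid(n, l)) < 2 * l
    return True
```

For example, \texttt{d(2)} evaluates to $7/2$, compared with the ratio $4$ that holds for an arbitrary feasible solution.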

1112:

1113: \section{Postlude}

1114: \label{sec:concl}

1115: There remain a number of open problems to be solved. The complexity of $PH(*,2)$ and $MPPH(*,2)$ is still unknown. An

approach that might yield the necessary insight is to study the $PH(*,2)\text{-}Cq$ and $MPPH(*,2)\text{-}Cq$ variants

1117: of these problems (i.e. where the compatibility graph is the sum of $q$ cliques) for small $q$. If a complexity result

1118: nevertheless continues to be elusive then it would be interesting to try and improve approximation ratios for

1119: $PH(*,2)$ and $MPPH(*,2)$; might it even be possible to find a PTAS (\emph{Polynomial-time Approximation Scheme}) for

1120: each of these problems? Note also that the complexity of $PH(k,2)$ and $MPPH(k,2)$ remains open for constant $k \geq

1121: 3$.

1122:

Another intriguing open question concerns the relative complexity of $PH$ and $MPPH$ instances. Does $PH(k,\ell)$ always
have the same complexity as $MPPH(k,\ell)$, in terms of well-known complexity measures (polynomial-time solvability,
NP-hardness, APX-hardness)? For hard instances, do approximability ratios differ? A related question is whether it is possible to directly
encode $PH$ instances as $MPPH$ instances, and/or vice versa, and if so whether/how this affects the bounds on the number

1127: of 2's in columns and rows.

1128:

1129: For hard $PH(k,\ell)$ instances it would also be interesting to see if those approximation algorithms that yield

1130: approximation ratios as functions of $k$, can be intelligently combined with the approximation algorithms in this

1131: paper (having approximation ratios determined by $\ell$), perhaps with superior approximation ratios as a consequence.

1132: In terms of approximation algorithms for $MPPH$ there is a lot of work to be done because the

1133: approximation algorithms presented in this paper actually do little more than return an arbitrary feasible solution.

1134: It is also not clear if the $2^{k-1}$-approximation algorithms for $PH(k,*)$ can be attained (or improved) for $MPPH$.

1135: More generally, it seems likely that big improvements in approximation ratios (for both $PH$ and $MPPH$) will require

1136: more sophisticated, input-sensitive lower bounds and algorithms. What are the limits of approximability for these

1137: problems, and how far will algorithms with formal performance-guarantees (such as in this paper) have to improve to

1138: make them competitive with dominant ILP-based methods?

1139:

Finally, with respect to $MPPH$, it would be worthwhile to

1141: explore how parsimonious the solutions are that are produced by the

1142: various $PPH$ feasibility algorithms, and whether searching through

1143: the entire space of $PPH$ solutions (as proposed in \cite{anOptimal})

1144: yields practical algorithms for solving $MPPH$.

1145:

1146: \section*{Acknowledgements}

1147:

1148: All authors contributed equally to this paper and were supported by the Dutch BSIK/BRICKS project. A preliminary

1149: version of this paper appeared in \emph{Proceedings of the 6th International Workshop on Algorithms in Bioinformatics} (WABI 2006)

1150: \cite{wabibeaches}.

1151:

1152:

1153: \begin{thebibliography}{50}

1154: \small

1155:

1156: \bibitem{cubic} Alimonti, P., Kann, V., Hardness of approximating problems

1157: on cubic graphs, Proceedings of the Third Italian Conference on Algorithms and Complexity, 288-298 (1997)

1158:

1159: \bibitem{nphardnote} Bafna, V., Gusfield, D., Hannenhalli, S., Yooseph, S.,

1160: A Note on Efficient Computation of Haplotypes via Perfect Phylogeny, \emph{Journal of Computational Biology}, 11(5),

1161: pp. 858-866 (2004)

1162:

1163: \bibitem{blair} Blair, J.R.S., Peyton, B., An introduction to chordal graphs and clique trees, in \emph{Graph theory and sparse matrix computation}, pp. 1-29, Springer (1993)

1164:

1165: \bibitem{mpphref2} Bonizzoni, P., Vedova, G.D., Dondi, R., Li, J., The haplotyping problem: an overview of computational models and solutions,

1166: \emph{Journal of Computer Science and Technology} 18(6), pp. 675-688 (2003)

1167:

1168: \bibitem{brown} Brown, D., Harrower, I., Integer programming approaches to haplotype inference by pure

parsimony, \emph{IEEE/ACM Transactions on Computational Biology and Bioinformatics} 3(2) (2006)

1170:

1171: \bibitem{wabi} Cilibrasi, R., Iersel, L.J.J. van, Kelk, S.M., Tromp, J., On the Complexity of Several Haplotyping Problems, Proceedings

1172: of the 5th International Workshop on Algorithms in Bioinformatics (WABI 2005), LNBI 3692, Springer Verlag, Berlin, pp.

1173: 128-139 (2005)

1174:

1175: \bibitem{gusfieldlinear} Ding, Z., Filkov, V., Gusfield, D., A linear-time algorithm for the perfect phylogeny

1176: haplotyping (PPH) problem, \emph{Journal of Computational Biology}, 13(2) pp. 522-533 (2006)

1177:

1178: \bibitem{gusfieldbook} Gusfield, D., \emph{Algorithms on Strings, Trees, and Sequences: Computer Science and Computational

1179: Biology}, Cambridge University Press (1997)

1180:

1181: \bibitem{gusfieldnetwork} Gusfield, D., Efficient algorithms for inferring evolutionary history, \emph{Networks} 21,

1182: pp. 19-28 (1991)

1183:

1184: \bibitem{gusfieldparsimony} Gusfield, D., Haplotype inference by pure parsimony, Proc. 14th

1185: Ann. Symp. Combinatorial Pattern Matching, pp. 144-155 (2003)

1186:

1187: \bibitem{halldorson} Halld\'orsson, B.V., Bafna, V., Edwards, N., Lippert, R., Yooseph, S.,

1188: Istrail, S., A survey of computational methods for determining haplotypes, Proc. DIMACS/RECOMB Satellite Workshop:

1189: Computational Methods for SNPs and Haplotype Inference, pp. 26-47 (2004)

1190:

1191: \bibitem{wabibeaches} Iersel, L.J.J. van, Keijsper, J., Kelk, S.M., Stougie, L., Beaches of Islands of Tractability: Algorithms for Parsimony

1192: and Minimum Perfect Phylogeny Haplotyping Problems, Proceedings of the 6th International Workshop on Algorithms in Bioinformatics (WABI 2006),

1193: LNCS 4175, Springer, pp. 80-91 (2006)

1194:

1195: \bibitem{lanciaApx} Lancia, G., Pinotti, M., Rizzi, R., Haplotyping populations by pure

1196: parsimony: complexity of exact and approximation algorithms, \emph{INFORMS Journal on Computing} 16(4) pp. 348-359

1197: (2004)

1198:

1199: \bibitem{lancia} Lancia, G., Rizzi, R.,

1200: A polynomial case of the parsimony haplotyping problem, \emph{Operations Research Letters} 34(3) pp. 289-295 (2006)

1201:

1202: \bibitem{deg3} Papadimitriou, C.H., Yannakakis, M.,

1203: Optimization, approximation, and complexity classes, \emph{J. Comput. System Sci.} 43, pp. 425-440 (1991)

1204:

1205: \bibitem{rose} Rose, D.J., Tarjan, R.E., Lueker, G.S., Algorithmic aspects of vertex elimination on graphs, \emph{SIAM

1206: J. Comput.}, 5, pp. 266-283 (1976)

1207:

1208: \bibitem{islands} Sharan, R., Halld\'orsson, B.V., Istrail, S., Islands of tractability for parsimony haplotyping,

1209: \emph{IEEE/ACM Transactions on Computational Biology and Bioinformatics} 3(3), pp. 303-311 (2006)

1210:

1211: \bibitem{gusnetwork} Song, Y.S., Wu, Y., Gusfield, D., Algorithms for imperfect phylogeny haplotyping

(IPPH) with a single homoplasy or recombination event, Proceedings of the 5th International Workshop on Algorithms in

1213: Bioinformatics (WABI 2005), LNBI 3692, Springer Verlag, Berlin, pp. 152-164 (2005)

1214:

1215: \bibitem{anOptimal} VijayaSatya, R., Mukherjee, A., An optimal algorithm for perfect phylogeny haplotyping,

1216: \emph{Journal of Computational Biology} 13(4), pp. 897-928 (2006)

1217:

\bibitem{mpphref} Xiang-Sun Zhang, Rui-Sheng Wang, Ling-Yun Wu, Luonan Chen, Models and Algorithms for Haplotyping

1219: Problem, \emph{Current Bioinformatics} 1, pp. 105-114 (2006)

1220:

1221: \bibitem{log} Yao-Ting Huang, Kun-Mao Chao, Ting Chen, An approximation algorithm for haplotype inference by

1222: maximum parsimony, \emph{Journal of Computational Biology} 12(10) pp. 1261-74 (2005)

1223:

1224: \end{thebibliography}

1225:

1226: \clearpage

1227:

1228: \begin{biography}{Leo van Iersel}

1229: received in 2004 his Master of Science degree in Applied Mathematics from the Universiteit Twente in The Netherlands.

1230: He is now working as a PhD student at the Technische Universiteit Eindhoven, also in the Netherlands. His research is

1231: mainly concerned with the search for combinatorial algorithms for biological problems.

1232: \end{biography}

1233:

1234: \begin{biography}{Judith Keijsper}

1235: received her master's and PhD degrees in 1994 and 1998 respectively from the Universiteit van Amsterdam in The

1236: Netherlands, where she worked with Lex Schrijver on combinatorial algorithms for graph problems. After working as a

1237: postdoc at Leibniz-IMAG in Grenoble, France, and as an assistant professor at the Universiteit Twente in the

1238: Netherlands for short periods of time, she moved to the Technische Universiteit Eindhoven in the Netherlands in the

1239: year 2000. She is an assistant professor there, and her current research focus is combinatorial algorithms for

1240: problems from computational biology.

1241: \end{biography}

1242:

1243: \begin{biography}{Steven Kelk}

1244: received his PhD in Computer Science in 2004 from the University

1245: of Warwick, in England. He is now working as a postdoc at the

1246: Centrum voor Wiskunde en Informatica (CWI) in Amsterdam, the

1247: Netherlands, where he is focussing on the combinatorial aspects of

1248: computational biology.

1249: \end{biography}

1250:

1251: \begin{biography}{Leen Stougie}

1252: received his PhD in 1985 from the Erasmus Universiteit of Rotterdam, The Netherlands. He is currently working at the

1253: Centrum voor Wiskunde en Informatica (CWI) in Amsterdam and at the Technische Universiteit Eindhoven as an associate

1254: professor.

1255: \end{biography}

1256:

1257: \end{document}

1258: