
1: \documentclass[11pt]{article}

2:

3: \usepackage{amstext}

4: \usepackage{enumerate}

5: \usepackage{citesort}

6: \usepackage[mathscr]{eucal}

7: \usepackage{epsfig}

8: \usepackage{amsmath}

9: \usepackage{theorem}

10:

11: \renewcommand{\baselinestretch}{1.1}

12:

13: \addtolength{\textheight}{1.4in}

14: \addtolength{\textwidth}{1in}

15: \addtolength{\topmargin}{-0.7in}

16: %\addtolength{\evensidemargin}{-0.7in}

17: \addtolength{\oddsidemargin}{-0.3in}

18:

19: \def\myendproof{{\ \vbox{\hrule\hbox{%

20:    \vrule height1.3ex\hskip0.8ex\vrule}\hrule }}\par}

21:  \newtheorem{theorem}{Theorem}[section]

22: \newtheorem{lemma}[theorem]{Lemma}

23: \newtheorem{corollary}[theorem]{Corollary}

24: \newtheorem{fact}[theorem]{Fact}

25: \newtheorem{definition}{Definition}

26: \newenvironment{proof}{{\it Proof. }}{\myendproof}

27:

28:

29: % shorthands

30: \newcommand{\bigO}{\mathscr{O}}

31:

32: \newcommand{\mnote}[1]{\marginpar{\scriptsize\it #1}}

33:

34: \newcommand{\comment}[1]{}

35:

36:

37: \title{Predicting RNA Secondary Structures with Arbitrary Pseudoknots

38: by Maximizing the Number of Stacking Pairs}

39:

40: \author{\hspace*{.5in}

41: Samuel Ieong\thanks{Department of Computer Science,

42: Yale University, New Haven, CT 06520.}

43: \and

44: Ming-Yang Kao\thanks{Department of Computer Science,

45: Northwestern University, Evanston, IL 60201 (kao@cs.northwestern.edu).

46: This research was supported in part by NSF Grant EIA-0112934.}

47: \and

48: Tak-Wah Lam\thanks{Department of Computer Science,

49: The University of Hong Kong, Hong Kong

50: (\{twlam, smyiu\}@cs.hku.hk).

51: This research was supported in part by Hong Kong RGC grant HKU-7027/98E.}

52: \hspace*{.5in}

53: \and

54: Wing-Kin Sung\thanks{Department of Computer Science,

55: National University of Singapore, 3 Science Drive 2,

56: Singapore 117543 (ksung@comp.nus.edu.sg).}

57: \and

58: Siu-Ming Yiu\footnotemark[3]

59: }

60:

61: \begin{document}

62: \date{}

63: \maketitle

64:

65: \begin{abstract}

66:

67: The paper investigates the computational problem of predicting

68: RNA secondary structures.

69: The general belief is that allowing pseudoknots makes the problem hard.

70: Existing polynomial-time algorithms are heuristic algorithms

71: with no performance guarantee and can only handle limited

72: types of pseudoknots.

73: In this paper we initiate the study of

74: predicting RNA secondary structures with

75: a maximum number of stacking pairs while allowing arbitrary pseudoknots.

76: We obtain two approximation algorithms with worst-case approximation ratios

77: of $1/2$ and $1/3$ for planar and general secondary structures,

78: respectively. For an RNA sequence of $n$ bases,

79: the approximation algorithm for planar secondary structures

80: runs in

81: $O(n^3)$ time while that for the general case runs in linear time.

82: Furthermore, we prove that allowing

83: pseudoknots makes it NP-hard to maximize

84: the number of stacking pairs in a planar secondary structure.

85: This result is in contrast with the recent

86: NP-hardness results on pseudoknots

87: which are based on optimizing some general and complicated

88: energy functions.

89:

90:

91: \end{abstract}

92:

93: \section{Introduction}

94: Ribonucleic acids (RNAs) are molecules that are responsible for regulating many genetic

95: and metabolic activities in cells.

96: An RNA is single-stranded and

97: can be considered as a sequence of nucleotides (also known as bases).  There are four

98: basic nucleotides, namely, Adenine (A), Cytosine (C), Guanine (G), and Uracil (U).  An

99: RNA folds into a 3-dimensional structure by forming pairs of bases.  Paired bases tend to

100: stabilize the RNA (i.e., have negative free energy).  Yet base pairing does not occur

101: arbitrarily.  In particular, A-U and C-G form stable pairs and

102: are known as the {\em Watson-Crick}

103: base pairs. Other base pairings are less stable and often ignored.  An example of a

104: folded RNA is shown in Figure~\ref{fig:RNA-structure}.

105: Note that this figure is just schematic;

106: in practice, RNAs are 3-dimensional molecules.

107:

108: \begin{figure}[t]

109: \begin{centering}

110: \epsfig{file=secondary_structure.eps, height=2.3in}

111: \caption{Example of a folded RNA}

112: \label{fig:RNA-structure}

113: \end{centering}

114: \end{figure}

115:

116: The 3-dimensional structure is related to the function of the RNA.

117: Yet existing experimental techniques for determining the

118: 3-dimensional structures of RNAs are

119: often very costly and time consuming (see, e.g., \cite{Meidanis:1997:ICM}).

120: The secondary structure of an RNA is the set of base pairings formed

121: in its 3-dimensional structure.

122: To determine the 3-dimensional structure of a given RNA sequence,

123: it is useful to determine the corresponding secondary structure.

124: As a result, it is important to design efficient algorithms to

125: predict the secondary structure with computers.

126:

127: {From} a computational viewpoint, the challenge of the RNA secondary

128: structure prediction

129: problem arises from some special structures called pseudoknots,

130: which are defined as follows.

131: Let $S$ be an RNA sequence $s_1, s_2, \cdots, s_n$.   A

132: {\it pseudoknot} is composed of two interleaving base

133: pairs, i.e., $(s_i, s_j)$ and $(s_k, s_\ell)$

134: such that $i < k < j < \ell$.  See Figure~\ref{fig:simple-pk} for examples.

135:

136:

137: If we assume that the secondary structure of an RNA contains no pseudoknots,

138: the secondary structure can be decomposed into a few types of loops: stacking pairs,

139: hairpins, bulges, internal loops, and multiple loops

140: (see, e.g., Tompa's lecture notes \cite{Tompa:2000:LNB} or

141: Waterman's book \cite{Waterman:1995:ICB}).   A

142: {\it stacking pair} is a loop formed by two pairs of consecutive

143: bases $(s_i, s_j)$ and $(s_{i+1}, s_{j-1})$ with $i+4 \leq j$.

144: See Figure~\ref{fig:RNA-structure} for an example.

145: By definition, a stacking pair contains no unpaired bases, whereas every other kind

146: of loop contains one or more unpaired bases.  Since

147: unpaired bases are destabilizing and have positive free energy,

148: stacking pairs are the only type of

149: loops that have negative free energy and stabilize the secondary structure.

150: It is also natural to

151: assume that the free energies of loops are independent.

152: Then an optimal pseudoknot-free secondary structure can be

153: computed using dynamic programming in $O(n^3)$ time

154: \cite{Lyngso:1999:FEI,Lyngso:1999:ILR,Zuker:1984:RSS,Zuker:1989:UDP}.

155:

156: \begin{figure}[t]

157: \begin{centering}

158: \epsfig{file=pseudoknots.eps, height=1.4in}

159: \caption{Examples of pseudoknots}

160: \label{fig:simple-pk}

161: \end{centering}

162: \end{figure}

163:

164:

165: However, pseudoknots are known to exist in some RNAs.

166: For predicting secondary structures with pseudoknots,

167: Nussinov et al.~\cite{Nussinov:1978:ALM} have studied the case where

168: the energy function is minimized when the number of base pairs is

169: maximized and have obtained an $O(n^3)$-time algorithm

170: for predicting secondary structures.

171: Based on some special energy functions, Lyngs{\o} and Pedersen

172: \cite{Lyngso:2000:RPP} have proven that determining the optimal secondary structure

173: possibly with

174: pseudoknots is NP-hard. Akutsu \cite{Akutsu:2000:DPA}

175: has shown that it is NP-hard to determine an optimal

176: planar secondary structure, where a secondary structure is {\it planar}

177: if the

178: graph formed by the base pairings and the backbone connections of adjacent bases is

179: planar (see Section 2 for a more detailed definition).

180: Rivas and Eddy \cite{Rivas:1999:DPA}, Uemura et al. \cite{Uemura:1999:TAG},

181: and

182: Akutsu \cite{Akutsu:2000:DPA} have also proposed polynomial-time algorithms that can

183: handle limited types of pseudoknots; note that the exact types of

184: such pseudoknots are implicit in these algorithms and difficult to

185: determine.

186:

187: Although it might be desirable to have a better classification of pseudoknots and

188: better algorithms that

189: can handle a wider class of pseudoknots,

190: this paper approaches the problem in a different general direction.

191: We initiate the study of predicting RNA secondary

192: structures that allow arbitrary pseudoknots while maximizing

193: the number of stacking pairs.

194: Such a simple energy function is meaningful as

195: stacking pairs are the only loops that stabilize

196: secondary structures.  We obtain two approximation algorithms with worst-case ratios of

197: 1/2 and 1/3 for planar and general secondary structures, respectively.

198: The planar

199: approximation algorithm makes use of a geometric observation

200: that allows us to

201: visualize the planarity of stacking pairs on a rectangular grid;

202: interestingly, such an

203: observation does not hold if our aim is to maximize the number of base pairs.

204: This algorithm runs in $O(n^3)$ time.

205: The second approximation algorithm is more complicated and

206: is based on a combination of multiple  ``greedy'' strategies.

207: A straightforward analysis does not yield the approximation ratio of $1/3$.

208: We make use of amortization over different steps to obtain the desired

209: ratio. This algorithm runs in $O(n)$ time.

210:

211: To complement these two algorithms, we also prove

212: that allowing pseudoknots makes it NP-hard to

213: find the planar secondary structure with the

214: largest number of stacking pairs.

215: The proof makes use of a reduction from a

216: well-known NP-complete problem called Tripartite Matching

217: \cite{Garey:1979:CIG}.

218: This result indicates that the hardness of the RNA secondary

219: structure prediction problem may be inherent in the pseudoknot structures

220: and may not be necessarily due to the complication of the energy functions.

221: This is in contrast to the other NP-hardness results discussed

222: earlier.

223:

224: The rest of this paper is organized into five sections.  Section 2

225: discusses some basic properties.

226: Sections 3 and 4 present the approximation algorithms for

227: planar and general secondary structures, respectively.

228: Section 5 details the NP-hardness result.

229: Section 6 concludes the paper with open problems.

230:

231: \section{Preliminaries}

232:

233: Let $S=s_1 s_2 \cdots s_n$ be an RNA sequence of $n$ bases.

234: A {\it secondary structure} ${\cal P}$ of $S$ is a set of

235: Watson-Crick pairs $(s_{i_1}, s_{j_1}),\ldots, (s_{i_p},

236: s_{j_p})$, where $i_r+2 \leq j_r$ for all $r=1, \ldots, p$

237: and no two pairs share a base.

238: We denote $q$ ($q \geq 1$)

239: consecutive stacking pairs ($s_i, s_j$), ($s_{i+1}, s_{j-1}$);

240: ($s_{i+1}, s_{j-1}$), ($s_{i+2}, s_{j-2}$)

241: $\ldots$ ($s_{i+q-1}, s_{j-q+1}$), ($s_{i+q}, s_{j-q}$) of

242: ${\cal P}$ by ($s_i,s_{i+1}, \ldots, s_{i+q};$

243: \linebreak[4] $s_{j-q}, \ldots, s_{j-1}, s_j$).
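
As a small illustration of this notation (the indices are chosen here purely as an example), take $q=2$, $i=1$, and $j=10$.
Then $(s_1, s_2, s_3; s_8, s_9, s_{10})$ denotes the three base pairs
$(s_1,s_{10})$, $(s_2,s_9)$, and $(s_3,s_8)$, which form the $q=2$ consecutive stacking pairs
\[
(s_1, s_2; s_9, s_{10}) \quad \mbox{and} \quad (s_2, s_3; s_8, s_9).
\]
In general, $q$ consecutive stacking pairs involve $q+1$ base pairs and $2(q+1)$ bases.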

244:

245: \begin{definition}

246: Given a secondary structure ${\cal P}$,

247: we define an undirected

248: graph $G({\cal P})$ such that the bases

249: of $S$ are the nodes of $G({\cal P})$ and $(s_i, s_j)$ is

250: an edge of $G({\cal P})$ if $j = i+1$ or $(s_i, s_j)$ is

251: a base pair in ${\cal P}$.

252: \end{definition}

253:

254: \begin{definition}

255: A secondary structure ${\cal P}$ is planar if $G({\cal P})$ is

256: a planar graph.

257: \end{definition}

258:

259: \begin{definition}

260: A secondary structure ${\cal P}$ is said to contain an

261: {\it interleaving block} if ${\cal P}$ contains three

262: stacking pairs

263: $(s_i, s_{i+1}; s_{j-1}, s_j)$, $(s_{i'}, s_{i'+1}; s_{j'-1}, s_{j'})$,

264: $(s_{i''}, s_{i''+1}; s_{j''-1},s_{j''})$ where $i < i' < i'' < j < j' < j''$.

265: \end{definition}

266:

267: \begin{lemma}

268: \label{interleavingblock}

269: If a secondary structure ${\cal P}$ contains

270: an interleaving block, ${\cal P}$ is non-planar.

271: \end{lemma}

272:

273: \begin{proof}

274: Suppose ${\cal P}$ contains an interleaving block.  Without

275: loss of generality, we assume that ${\cal P}$ contains

276: the stacking pairs ($s_1, s_2; s_7, s_8$),

277: ($s_3, s_4; s_9, s_{10}$), and ($s_5, s_6; s_{11}, s_{12}$).

278: Figure \ref{interblock}(a) shows the

279: subgraph of $G({\cal P})$ corresponding to these

280: stacking pairs. Since this subgraph contains a homeomorphic copy of

281: $K_{3,3}$ (see Figure \ref{interblock}(b)),

282: $G({\cal P})$ and ${\cal P}$ are non-planar.

283: \end{proof}

284:

285: \begin{figure*}[hbtp]

286: \begin{center}

287: \scalebox{0.5}[0.5]{\includegraphics{interblock2.eps}}

288: \caption{Interleaving block}

289: \label{interblock}

290: \end{center}

291: \end{figure*}

292:

293: \section{An Approximation Algorithm for Planar Secondary Structures}

294: We present an algorithm which,

295: given an RNA sequence $S = s_1 s_2 \ldots s_n$,

296: constructs a {\it planar} secondary structure of $S$

297: to approximate one with the maximum number of stacking pairs

298: with a ratio of at least $1/2$. This

299: approximation algorithm is based on the subtle

300: observation in Lemma \ref{planarembedding}

301: that if a secondary structure ${\cal P}$ is planar,

302: the subgraph of $G({\cal P})$ which contains {\it only} the stacking pairs

303: of ${\cal P}$ can be embedded in a grid with a useful property.

304: This property enables us to consider only the secondary structures of

305: $S$ {\it without pseudoknots} in order to achieve an approximation

306: ratio of $1/2$.

307:

308: \begin{definition}

309: Given a secondary structure ${\cal P}$, we define a

310: {\it stacking pair embedding} of

311: ${\cal P}$ on a grid as follows.

312: Represent the bases of $S$ as $n$ consecutive grid points on the

313: same horizontal grid line $L$ such that $s_i$ and $s_{i+1}$

314: $(1 \leq i < n)$ are connected directly by a horizontal grid edge.

315: If $(s_i, s_{i+1}; s_{j-1}, s_j)$ is a stacking pair of ${\cal P}$,

316: $s_i$ and $s_{i+1}$ are connected to $s_j$ and $s_{j-1}$ respectively

317: by a sequence of grid edges such that the two sequences must

318: be either both above or both below $L$.

319: \end{definition}

320:

321: Figure \ref{embedding-eg} shows a stacking pair embedding

322: (Figure \ref{embedding-eg}(b))

323: of a given secondary structure (Figure \ref{embedding-eg}(a)).

324: Note that ($s_3,s_9$)

325: does not form a stacking pair with any other base pair, so $s_3$

326: is not connected to $s_9$ in the stacking pair embedding.

327: Similarly, $s_4$ is not connected to $s_{10}$ in the

328: embedding.

329:

330: \begin{figure*}[hbtp]

331: \begin{center}

332: \scalebox{0.5}[0.5]{\includegraphics{embedding.eps}}

333: \caption{An example of a stacking pair embedding}

334: \label{embedding-eg}

335: \end{center}

336: \end{figure*}

337:

338: \begin{definition}

339: A stacking pair embedding is said to be {\it planar} if

340: it can be drawn in such a way that

341: no lines cross or overlap with each other in the grid.

342: \end{definition}

343:

344: The embedding shown in Figure \ref{embedding-eg}(b) is planar.

345:

346: \begin{lemma}

347: \label{planarembedding}

348: Let ${\cal P}$ be a secondary structure of an RNA sequence $S$.

349: Let $E$ be a stacking pair embedding of ${\cal P}$.

350: If ${\cal P}$ is planar, then $E$ must be planar.

351: \end{lemma}

352:

353: \begin{proof}

354: If ${\cal P}$ does not have a planar stacking

355: pair embedding, we claim that ${\cal P}$ contains an

356: interleaving block. Let $L$ be the horizontal grid line

357: that contains the bases of $S$ in $E$.

358: Since ${\cal P}$ does not have a planar

359: stacking pair embedding, we can assume that $E$ has

360: two stacking pairs that intersect

361: above $L$ (see Figure \ref{non-planar-sec-struct}(a)).

362:

363: \begin{figure*}[hbtp]

364: \begin{center}

365: \scalebox{0.5}[0.5]{\includegraphics{nonplanar2.eps}}

366: \caption{Non-planar stacking pair embedding}

367: \label{non-planar-sec-struct}

368: \end{center}

369: \end{figure*}

370:

371:

372: If there is no other stacking pair underneath these two

373: pairs, we can flip one of the pairs below $L$ as shown

374: in Figure \ref{non-planar-sec-struct}(b). So, there must be

375: at least one stacking pair underneath these two

376: pairs. By checking all

377: possible cases (all non-symmetric cases are shown in

378: Figures \ref{non-planar-sec-struct}(c) to (i)), it can be

379: shown that $E$ cannot be redrawn without crossing or overlapping

380: lines only if it contains an interleaving block

381: (Figures \ref{non-planar-sec-struct}(h) and (i)). So, by

382: Lemma \ref{interleavingblock}, ${\cal P}$ is non-planar.

383: \end{proof}

384:

385: By Lemma \ref{planarembedding},

386: we can relate two secondary structures having the maximum

387: number of stacking pairs with and without pseudoknots

388: in the following lemma.

389:

390: \begin{lemma}

391: \label{1/2-ratio}

392: Given an RNA sequence $S$, let $N^*$ be the maximum number of

393: stacking pairs that can be formed by a planar secondary

394: structure of $S$ and let $W$ be the maximum

395: number of stacking pairs that can be formed by $S$ without

396: pseudoknots. Then, $W \geq \frac{N^*}{2}$.

397: \end{lemma}

398:

399: \begin{proof}

400: Let ${\cal P}^*$ be a planar secondary structure of $S$ with $N^*$

401: stacking pairs. Since ${\cal P}^*$ is planar, by Lemma

402: \ref{planarembedding}, any stacking pair embedding of ${\cal P}^*$

403: is planar.

404:

405: Let $E$ be a stacking pair embedding of ${\cal P}^*$

406: such that no lines cross each other in the grid.

407: Let $L$ be the horizontal grid line of $E$ which

408: contains all bases of $S$.

409: Let $n_1$ and $n_2$ be the number of stacking pairs which

410: are drawn above and below $L$, respectively.

411: Without loss of generality,

412: assume that $n_1 \geq n_2$. Now, we construct another planar

413: secondary structure ${\cal P}$ from $E$ by deleting all stacking

414: pairs which are drawn below $L$.

415: Obviously, ${\cal P}$ is a planar secondary structure of $S$ without

416: pseudoknots. Since $n_1 \geq n_2$, $n_1 \geq \frac{N^*}{2}$.

417: As $W \geq n_1$, $W \geq \frac{N^*}{2}$.

418: \end{proof}
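
The counting step of this proof can be written as a single chain. This is only a restatement under the proof's assumption $n_1 \geq n_2$:
\[
N^* = n_1 + n_2 \leq 2 n_1 \leq 2W,
\quad \mbox{hence} \quad
W \geq n_1 \geq \frac{N^*}{2}.
\]
For instance (hypothetical numbers), if a planar stacking pair embedding of ${\cal P}^*$ draws $n_1 = 4$ stacking pairs above $L$ and $n_2 = 3$ below $L$, then $N^* = 7$, and keeping only the stacking pairs above $L$ leaves a pseudoknot-free structure with $4 \geq 7/2$ stacking pairs.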

419:

420: Based on Lemma \ref{1/2-ratio}, we now present the dynamic programming

421: algorithm $MaxSP$ which computes the maximum number of

422: stacking pairs that can be formed by an RNA

423: sequence $S=s_1 s_2 \ldots s_n$ without pseudoknots.

424:

425: \vspace{5pt}

426: \noindent

427: {\bf Algorithm $MaxSP$}

428:

429: Define $V(i,j)$ (for $j \geq i$) as the maximum number of stacking

430: pairs without pseudoknots that can be formed by $s_i \ldots s_j$

431: {\it if $s_i$ and $s_j$ form a Watson-Crick pair}.

432: Let $W(i,j)$ ($j \geq i$) be the maximum number

433: of stacking pairs without pseudoknots that can be formed by

434: $s_i \ldots s_j$. Obviously, $W(1,n)$ gives the maximum

435: number of stacking pairs that can be formed by $S$ without

436: pseudoknots.

437:

438: \noindent \fbox{Basis:}

439:

440: For $j = i, i+1, i+2 \mbox{~or~} i+3$ ($j \leq n$),

441: \[

442: \begin{array}{lll}

443: V(i,j) & = 0  & \mbox{ if $s_i, s_{j}$ form a Watson-Crick pair;} \\

444: W(i,j) & = 0.  &

445: \end{array}

446: \]

447:

448: \noindent \fbox{Recurrence:}

449:

450: For $j > i+3$,

451: %\[W(i,j) = \max \left\{

452: \[

453: \begin{array}{llll}

454: W(i,j) & = & \max \left\{

455: \begin{array}{ll}

456: V(i,j) & \mbox{ if $s_i$, $s_j$ form a Watson-Crick pair} \\

457: W(i+1, j) & \\

458: W(i, j-1) & \\ \max_{i+1 \leq k \leq j-2}{\{W(i,k)+W(k+1,j)\}} &

459: \end{array}

460: \right\}; \\

461: &&\\

462: V(i,j) & = & \max \left\{

463: \begin{array}{l}

464: V(i+1, j-1) + 1 \mbox{~~~~if $s_{i+1}$, $s_{j-1}$ form a Watson-Crick pair} \\

465: \max_{i+1 \leq k \leq j-2}{\{W(i+1,k)+W(k+1,j-1)\}}

466: \end{array}

467: \right\}.

468: \end{array}

469: \]

470:

471: \begin{lemma}

472: Given an RNA sequence $S$ of length $n$, Algorithm $MaxSP$

473: computes the maximum number of stacking pairs that can be

474: formed by $S$ without pseudoknots in $O(n^3)$ time and

475: $O(n^2)$ space.

476: \end{lemma}

477:

478: \begin{proof}

479: There are $O(n^2)$ entries $V(i,j)$ and $W(i,j)$ to be

480: filled. To fill an entry of $V(i,j)$ or $W(i,j)$, we check

481: at most $O(n)$ values, one for each split point $k$.

482: The total time complexity for filling all entries

483: is $O(n^3)$. Storing all entries requires $O(n^2)$ space.

484: \end{proof}
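
To illustrate the recurrence, consider the sequence $S = \mbox{GGGACCC}$ with $n=7$. This is a toy example introduced here for illustration only. The entries along the nested pairs $(s_3,s_5)$, $(s_2,s_6)$, and $(s_1,s_7)$ evaluate to
\[
\begin{array}{lll}
V(3,5) & = 0 & \mbox{(basis, since $j = i+2$);} \\
V(2,6) & = V(3,5) + 1 = 1 & \mbox{($s_3$, $s_5$ form a Watson-Crick pair);} \\
V(1,7) & = V(2,6) + 1 = 2 & \mbox{($s_2$, $s_6$ form a Watson-Crick pair);} \\
W(1,7) & = V(1,7) = 2. &
\end{array}
\]
The split terms in $V(2,6)$ and $V(1,7)$ contribute $0$, because every interval they involve spans at most four bases and therefore falls under the basis. No other candidate for $W(1,7)$ can exceed $2$, since three G-C pairs form at most two stacking pairs. The resulting structure is $(s_1, s_2, s_3; s_5, s_6, s_7)$.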

485:

486: Although Algorithm $MaxSP$ as presented above only

487: computes the number of stacking pairs, it can be easily modified

488: to compute the secondary structure.

489: Thus we have the following theorem.

490:

491: \begin{theorem}

492: Algorithm $MaxSP$ is a $(1/2)$-approximation algorithm

493: for the problem of constructing a planar secondary structure which

494: maximizes the number of stacking pairs for an RNA sequence $S$.

495: \end{theorem}

496:

497: \section{An Approximation Algorithm for General Secondary Structures}

498: We present Algorithm $GreedySP()$ which,

499: given an RNA sequence $S = s_1 s_2 \ldots s_n$,

500: constructs a secondary structure of $S$ (not necessarily planar)

501: with at least $1/3$ of the maximum possible number of

502: stacking pairs.

503: The approximation algorithm uses a greedy approach.

504: Figure \ref{1/3-approx-alg} shows

505: the algorithm $GreedySP()$.

506:

507: \begin{figure}[htbp]

508: \fbox{

509: \begin{minipage}{.95\textwidth}

510: \noindent // Let $S=s_1 s_2 \ldots s_n$ be the input RNA sequence.

511: Initially, all $s_j$ are unmarked.

512:

513: \noindent // Let $E$ be the set of base pairs output by the algorithm.

514: Initially, $E = \emptyset$.

515:

516: \vspace{5pt}

517: \noindent $GreedySP(S, i)$

518: \hspace{10pt}

519: // $i \geq 3$

520:

521: \begin{enumerate}

522: \item Repeatedly find the {\it leftmost} $i$ consecutive stacking pairs

523:       $SP$ (i.e., find $(s_p,\ldots,s_{p+i};s_{q-i},\ldots,s_q)$ such that

524:       $p$ is as small as possible) formed by unmarked bases.

525:       Add $SP$ to $E$ and mark all these bases.

526: \item For $k = i-1$ downto $2$, \\

527:       Repeatedly find {\it any} $k$ consecutive stacking pairs $SP$

528:       formed by unmarked bases.

529:       Add $SP$ to $E$ and mark all these bases.

530: \item Repeatedly find the {\it leftmost} stacking pair $SP$ formed

531:       by unmarked bases.

532:       Add $SP$ to $E$ and mark all these bases.

533: \end{enumerate}

534: \end{minipage}

535: } %% end of fbox

536: \caption{A 1/3-Approximation Algorithm}

537: \label{1/3-approx-alg}

538: \end{figure}
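
As an informal trace (the input below is a hypothetical example, not one analyzed in this paper), consider $GreedySP(S,3)$ on the $17$-base sequence
\[
S = \mbox{GGGG\,AAA\,CCCC\,GG\,AA\,CC}.
\]
Step 1 finds the leftmost $3$ consecutive stacking pairs $SP_1 = (s_1, \ldots, s_4; s_8, \ldots, s_{11})$ and marks these eight bases; no further $3$ consecutive stacking pairs can be formed by the unmarked bases $s_5, s_6, s_7, s_{12}, \ldots, s_{17}$.
Step 2 (with $k=2$) finds nothing, since no three consecutive unmarked bases can be paired with three other consecutive unmarked bases.
Step 3 finds the leftmost stacking pair $SP_2 = (s_{12}, s_{13}; s_{16}, s_{17})$.
The output $E$ therefore contains six base pairs forming $|SP_1| + |SP_2| = 3 + 1 = 4$ stacking pairs.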

539:

540: In the following, we analyze the approximation ratio of

541: this algorithm.

542: The algorithm $GreedySP(S, i)$ will generate a sequence of $SP$'s

543: denoted by $SP_1, SP_2, \ldots, SP_h$.

544:

545: \begin{fact}

546: \label{spdisjoint}

547: For any $SP_j$ and $SP_k$ $(j \neq k)$, the

548: stacking pairs in $SP_j$ do not share any base with those

549: in $SP_k$.

550: \end{fact}

551:

552: For each $SP_j = (s_p, \ldots, s_{p+t}; s_{q-t}, \ldots, s_q)$,

553: we define two intervals of indexes, ${\cal I}_j$ and

554: ${\cal J}_j$, as $[p .. p+t]$

555: and $[q-t .. q]$, respectively.

556: In order to compare

557: the number of stacking pairs formed with that in the optimal

558: case, we have the following definition.

559:

560: %\begin{fact}

561: %We have the following facts:

562: %\begin{itemize}

563: %\item All ${\cal I}_j$ and ${\cal J}_j$ (for all $j$) intervals are disjoint.

564: %\item $|SP_j| = |{\cal I}_j|-1$ where $|{\cal I}_j|$ denotes the

565: %      number of bases in the interval.

566: %\end{itemize}

567: %\end{fact}

568:

569: \begin{definition}

570: \label{xpi}

571: Let ${\cal P}$ be an optimal secondary structure of $S$ with

572: the maximum number of stacking pairs. Let

573: ${\cal F}$ be the set of all stacking pairs of ${\cal P}$.

574: For each $SP_j$ computed by

575: $GreedySP(S,i)$ and $\beta = {\cal I}_j$ or ${\cal J}_j$,

576: \[\mbox{let~}{\cal X}_{\beta} = \{ (s_k, s_{k+1}; s_{w-1}, s_w) \in

577: {\cal F} \mid \mbox{~at least one of the indexes~} k, k+1, w-1, w

578: \mbox{~is in~} \beta\}. \]

579: \end{definition}

580:

581: Note that ${\cal X}_\beta$'s may not be disjoint.

582:

583: \begin{lemma}

584: \label{complete}

585: $\bigcup_{1 \leq j \leq h} \{{\cal X}_{{\cal I}_j} \cup

586: {\cal X}_{{\cal J}_j}\} = {\cal F}$.

587: \end{lemma}

588:

589: \begin{proof}

590: We prove this lemma by contradiction. Suppose that there exists a

591: stacking pair ($s_k,s_{k+1};s_{w-1},s_w$) in ${\cal F}$ but not in

592: any of ${\cal X}_{{\cal I}_j}$ and ${\cal X}_{{\cal J}_j}$.

593: By Definition \ref{xpi}, none of the indexes, $k,k+1,w-1,w$

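594: is in any of ${\cal I}_j$ and ${\cal J}_j$, so this stacking pair is formed by bases that are still unmarked when the algorithm terminates. This contradicts

595: Step 3 of Algorithm $GreedySP(S,i)$, which stops only when no stacking pair can be formed by unmarked bases.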

596: \end{proof}

597:

598: \begin{definition}

599: \label{x'pi}

600: For each ${\cal X}_{{\cal I}_j}$,

601: \[

602: \mbox{let~} {\cal X}'_{{\cal I}_j} =

603: {\cal X}_{{\cal I}_j} -

604: \bigcup_{k<j} \{ {\cal X}_{{\cal I}_k} \cup

605:                  {\cal X}_{{\cal J}_k} \},

606: \mbox{~and let~} {\cal X}'_{{\cal J}_j} =

607: {\cal X}_{{\cal J}_j} -

608: \bigcup_{k<j} \{ {\cal X}_{{\cal I}_k} \cup

609:                  {\cal X}_{{\cal J}_k} \} - {\cal X}_{{\cal I}_j} \]

610: \end{definition}

611:

612: Let $|SP_j|$ be the number of stacking pairs represented by

613: $SP_j$. Let $|{\cal I}_j|$ and $|{\cal J}_j|$ be the numbers

614: of indexes in the intervals ${\cal I}_j$ and ${\cal J}_j$,

615: respectively.

616:

617: \begin{lemma}

618: \label{sumfraction}

619: Let $N$ be the number of stacking pairs computed by

620: Algorithm $GreedySP(S,i)$ and $N^*$ be the maximum number of

621: stacking pairs that can be formed by $S$.

622: If for all $j$, we have

623: $|SP_j| \geq \frac{1}{r} \times

624: |({\cal X}'_{{\cal I}_j} \cup {\cal X}'_{{\cal J}_j})|$, then

625: $N \geq \frac{1}{r} \times N^*$.

626: \end{lemma}

627:

628: \begin{proof}

629: By Definition \ref{x'pi},

630: $\bigcup_k \{{\cal X}_{{\cal I}_k} \cup {\cal X}_{{\cal J}_k}\} =

631:  \bigcup_k \{{\cal X}'_{{\cal I}_k} \cup {\cal X}'_{{\cal J}_k}\}$.

632: Then by Fact \ref{spdisjoint}, $N = \sum_j |SP_j|$. Thus,

633: $N \geq \frac{1}{r} \times

634: |\bigcup_k \{{\cal X}_{{\cal I}_k} \cup {\cal X}_{{\cal J}_k}\}|$.

635: By Lemma \ref{complete}, $N \geq \frac{1}{r} \times N^*$.

636: \end{proof}
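
The last two steps of this proof can be spelled out as follows. By Definition \ref{x'pi}, the sets ${\cal X}'_{{\cal I}_j}$ and ${\cal X}'_{{\cal J}_j}$ (over all $j$) are pairwise disjoint, so the cardinality of their union is the sum of their cardinalities. Combined with the hypothesis of the lemma and Lemma \ref{complete}, this gives
\[
N = \sum_j |SP_j|
\;\geq\; \frac{1}{r} \sum_j |{\cal X}'_{{\cal I}_j} \cup {\cal X}'_{{\cal J}_j}|
\;=\; \frac{1}{r} \Bigl| \bigcup_j \{{\cal X}'_{{\cal I}_j} \cup {\cal X}'_{{\cal J}_j}\} \Bigr|
\;=\; \frac{1}{r} |{\cal F}|
\;=\; \frac{N^*}{r}.
\]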

637:

638: \begin{lemma}

639: \label{boundforspi}

640: For each $SP_j$ computed by $GreedySP(S,i)$, we have

641: $|SP_j| \geq \frac{1}{3}

642: \times

643: |({\cal X}'_{{\cal I}_j} \cup {\cal X}'_{{\cal J}_j})|$.

644: \end{lemma}

645:

646: \begin{proof}

647: There are three cases as follows.

648:

649: \vspace{5pt}

650: \noindent

651: {\it Case 1:} $SP_j$ is computed by $GreedySP(S, i)$ in Step 1.

652: Note that $SP_j = (s_p, \ldots, s_{p+i};$ $s_{q-i}, \ldots, s_q)$ is

653: the leftmost $i$ consecutive stacking pairs, i.e., $p$ is the

654: smallest possible.

655: By definition, $|{\cal X}'_{{\cal I}_j}|, |{\cal X}'_{{\cal J}_j}| \leq i+2$.

656: We further claim that $|{\cal X}'_{{\cal I}_j}| \leq i+1$.

657: Then $|SP_j| / | {\cal X}'_{{\cal I}_j} \cup {\cal X}'_{{\cal J}_j}|

658: \geq i/((i+1)+(i+2)) \geq 1/3$ (as $i \geq 3$).

659:

660: We prove the claim by contradiction. Assume that

661: $|{\cal X}'_{{\cal I}_j}| = i+2$. That is,

662: for some integer $t$, ${\cal F}$ has $i+2$ consecutive stacking pairs

663: $(s_{p-1}, \ldots, s_{p+i+1}; s_{t-i-1}, \ldots, s_{t+1})$.

664: Furthermore, none of the bases $s_{p-1}, \ldots, s_{p+i+1}, s_{t-i-1}, \ldots, s_{t+1}$

665: are marked before $SP_j$ is chosen; otherwise,

666: suppose one such base, say $s_a$, is marked

667: when the algorithm chooses $SP_\ell$ for $\ell < j$;

668: then a stacking pair adjacent to $s_a$ does not belong to

669: ${\cal X}'_{{\cal I}_j}$ and belongs to ${\cal X}'_{{\cal I}_\ell}$

670: or ${\cal X}'_{{\cal J}_\ell}$ instead.

671: Therefore, $(s_{p-1}, \ldots, s_{p+i-1}; s_{t-i+1}, \ldots, s_{t+1})$

672: is the leftmost $i$ consecutive stacking pairs formed by unmarked bases

673: before $SP_j$ is chosen.

674: As $SP_j$ is not the leftmost $i$ consecutive stacking pairs,

675: this contradicts the selection criteria of $SP_j$.

676: The claim follows.

677:

678: \vspace{5pt}

679: \noindent

680: {\it Case 2:} $SP_j$ is computed by $GreedySP(S, i)$ in Step 2.

681: Let $|SP_j| = k \geq 2$. Let

682: $SP_j = (s_p, \ldots, s_{p+k}; s_{q-k}, \ldots, s_q)$.

683: By definition, $|{\cal X}'_{{\cal I}_j}|, |{\cal X}'_{{\cal J}_j}| \leq k+2$.

684: We claim that $|{\cal X}'_{{\cal I}_j}|, |{\cal X}'_{{\cal J}_j}| \leq k+1$.

685: Then $|SP_j| / | {\cal X}'_{{\cal I}_j} \cup {\cal X}'_{{\cal J}_j}|

686: \geq k/((k+1)+(k+1))$,

687: which is at least $1/3$ as $k \geq 2$.

688:

689: To show that $|{\cal X}'_{{\cal I}_j}| \leq k+1$ by contradiction,

690: assume $|{\cal X}'_{{\cal I}_j}| = k+2$. Thus, for some integer $t$,

691: there exist $k+2$ consecutive stacking pairs

692: $(s_{p-1}, \ldots, s_{p+k+1}; s_{t-k-1}, \ldots, s_{t+1})$.

693: As in Case 1, we can show that

694: none of the bases $s_{p-1}, \ldots, s_{p+k+1}, s_{t-k-1}, \ldots, s_{t+1}$

695: are marked before $SP_j$ is chosen.

696: Thus, $GreedySP(S, i)$ should select some $k+1$ or $k+2$ consecutive

697: stacking pairs

698: instead of the chosen $k$ consecutive stacking pairs,

699: reaching a contradiction.

700: Similarly, we can show $|{\cal X}'_{{\cal J}_j}| \leq k+1$.

701:

702: \vspace{5pt}

703: \noindent

704: {\it Case 3:} $SP_j$ is computed by $GreedySP(S, i)$ in Step 3.

705: $SP_j$ is the leftmost stacking pair when it is chosen.

706: Let $SP_j = (s_p, s_{p+1}; s_{q-1}, s_q)$.

707: By the same approach as in Case 2,

708: we can show $|{\cal X}'_{{\cal I}_j}|, |{\cal X}'_{{\cal J}_j}| \leq 2$.

709: We further claim $|{\cal X}'_{{\cal I}_j}| \leq 1$.

710: Then $|SP_j| / | {\cal X}'_{{\cal I}_j} \cup {\cal X}'_{{\cal J}_j}| \geq 1/(1+2) = 1/3$.

711:

712: To verify $|{\cal X}'_{{\cal I}_j}| \leq 1$,

713: we consider all possible cases with $|{\cal X}'_{{\cal I}_j}| = 2$

714: while there are no two consecutive stacking pairs.

715: The only possible case is that for some integers $r, t$,

716: both $(s_{p-1}, s_p; s_{r-1}, s_r)$

717: and $(s_p, s_{p+1}; s_{t-1}, s_t)$ belong to ${\cal X}'_{{\cal I}_j}$.

718: Then, $SP_j$ cannot be the leftmost stacking pair formed by unmarked bases,

719: contradicting the selection criteria of $SP_j$.

720: \end{proof}
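
The arithmetic behind the three cases can be checked directly; the following is only a restatement of the bounds used above:
\[
\frac{i}{(i+1)+(i+2)} \geq \frac{1}{3} \iff i \geq 3,
\qquad
\frac{k}{(k+1)+(k+1)} \geq \frac{1}{3} \iff k \geq 2,
\qquad
\frac{1}{1+2} = \frac{1}{3}.
\]
This also indicates why Case 2 needs the stronger bound on both ${\cal X}'_{{\cal I}_j}$ and ${\cal X}'_{{\cal J}_j}$: with a bound of $k+2$ on one side, the ratio for $k=2$ would only be $2/(3+4) = 2/7 < 1/3$.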

721:

722: \begin{theorem}

723: Let $S$ be an RNA sequence. Let $N^*$ be the maximum number of stacking

724: pairs that can be formed by any secondary structure of $S$. Let

725: $N$ be the number of stacking pairs output by $GreedySP(S,i)$. Then,

726: $N \geq \frac{N^*}{3}$.

727: \end{theorem}

728:

729: \begin{proof}

730: By Lemmas \ref{sumfraction} and \ref{boundforspi}, the result follows.

731: \end{proof}

732:

733: We remark that by setting $i=3$ in $GreedySP(S,i)$, we can already

734: achieve the approximation ratio of 1/3. The following theorem gives

735: the time and space complexity of the algorithm.

736:

737: \begin{theorem}

738: Given an RNA sequence $S$ of length $n$ and a constant $k$,

739: Algorithm \linebreak[4] $GreedySP(S,k)$

740: can be implemented in $O(n)$ time and $O(n)$ space.

741: \end{theorem}

742:

743: \begin{proof}

744: Recall that the bases of an RNA sequence are chosen from the

745: alphabet $\{A,U,G,C\}$. If $k$ is a constant, there

746: are only a constant number of different patterns of consecutive

747: stacking pairs that we must consider. For any $1 \leq j \leq k$,

748: there are only $4^j$ different strings that can be formed by

749: the four characters $\{A,U,G,C\}$. So, the locations of the

750: occurrences of these possible strings in the

751: RNA sequence can be recorded in an array of linked lists

752: indexed by the pattern of the string using $O(n)$ time preprocessing.

753: There are at most $4^j$ linked lists for any fixed $j$ and

754: there are at most $n$ entries in these linked lists. In total,

755: there are at most $kn$ entries in all linked lists for all

756: possible values of $j$.
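
As a quick check of this counting (a sketch only; the sums below are not stated in this form in the text), the number of lists and the number of entries satisfy
\[
\sum_{j=1}^{k} 4^j = \frac{4^{k+1}-4}{3},
\qquad
\sum_{j=1}^{k} (n-j+1) \le kn ,
\]
since a sequence of length $n$ has exactly $n-j+1$ substrings of length $j$.
For constant $k$ the first quantity is a constant and the second is $O(n)$.
For instance, $k=3$ gives $4+16+64 = 84$ lists.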

757:

758: Now, we fix a constant $j$.

759: To locate all $j$ consecutive

760: stacking pairs,

761: we scan the RNA sequence from left to right. For each substring of

762: $j$ consecutive characters, we look up the array to see whether

763: we can form $j$ consecutive stacking pairs. By simple

764: bookkeeping, we can keep track of which bases have been used

765: already. Each entry in the linked lists will only be

766: scanned at most once, so

767: the whole procedure takes only $O(n)$ time. Since $k$ is a constant,

768: we can repeat the whole procedure for $k$ different values of $j$, and the

769: total time complexity is still $O(n)$. The linked lists hold at most $kn$ entries in total, so the space used is also $O(n)$.

770: \end{proof}

771:

772: \newcommand{\encode}[1]{\langle #1 \rangle}

773:

774: \section{NP-completeness}

775:

776: In this section, we show that it is NP-hard to find a planar

777: secondary structure with the largest number of stacking pairs.

778: We consider the following decision problem.

779: Given an RNA sequence $S$ and an integer $h$, we wish to determine

780: whether the largest possible number of stacking pairs in a planar

781: secondary structure of $S$, denoted sp($S$), is at least $h$.  Below we show

782: that this decision problem is NP-complete by reducing the tripartite

783: matching problem \cite{Garey:1979:CIG} to it, which is defined as follows.

784:

785: Given three node sets $X$, $Y$, and $Z$ with the same cardinality

786: $n$ and

787: an edge set $E \subseteq X \times Y \times Z$ of size $m$,

788: the {\it tripartite matching problem} is to

789: determine whether $E$ contains a perfect matching, i.e.,

790: a set of $n$ edges which touches every node of $X$, $Y$, and $Z$

791: exactly once.

792:

793: The remainder of this section is organized as follows.

794: Section~\ref{sec-construction} shows how we construct in polynomial

795: time an RNA sequence $S_E$ and an integer $h$ from a given instance

796: $(X,Y,Z, E)$ of the tripartite matching problem, where $h$

797: depends on $n$ and $m$.  Section~\ref{sec-if} shows that if $E$

798: contains a perfect matching, then sp($S_E$) $\ge h$.

799: Section~\ref{sec-only-if} is the non-trivial part, showing that if $E$

800: does not contain a perfect matching, then sp($S_E$) $< h$.  Combining

801: these three sections, we can conclude that it is NP-hard to

802: maximize the

803: number of stacking pairs for planar RNA secondary structures.

804:

805: \subsection{Construction of the RNA sequence $S_E$} \label{sec-construction}

806:

807: Consider any instance $(X,Y,Z, E)$ of the tripartite matching problem.

808: We construct an RNA sequence $S_E$ and an integer $h$ as follows.

809: Let $X = \{x_1, \cdots, x_n\}$, $Y = \{y_1, \cdots, y_n\}$, and $Z =

810: \{z_1, \cdots, z_n\}$.  Furthermore, let $E = \{ e_1, e_2, \cdots, e_m

811: \}$, where each edge $e_j = (x_{p_j}, y_{q_j}, z_{r_j})$.  Recall that

812: an RNA sequence contains characters chosen from the alphabet $\{A, U,

813: G, C\}$.  Below we denote $A^i$, where $i$ is any positive integer, as

814: the sequence of $i$ $A$'s. Furthermore, $A^+$ means a sequence of one

815: or more $A$'s.

816:

817: \newcommand{\od}[1]{\overline{\delta(#1)}}

818: \newcommand{\op}[1]{\overline{\pi(#1)}}

819:

820: Let $d = \max\{ 6n, 4(m+1) \} + 1$.  Define the following four RNA

821: sequences for every positive integer $k < d$.

822: \begin{itemize}

823: \item $\delta(k)$ is the sequence $U^dA^kGU^dA^{d-k}$, and

824: $\overline{\delta(k)}$ is the sequence $U^{d-k}A^dGU^kA^d$.

825: \item $\pi(k)$ is the sequence $C^{2d+2k} AG C^{4d-2k}$, and

826: $\overline{\pi(k)}$ is the sequence $G^{4d-2k}A G^{2d+2k}$.

827: \end{itemize}
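
To make these definitions concrete, the following sketch works out the lengths of the four sequences and one small instance. The values $n = m = 1$ are chosen here purely for illustration.
\begin{align*}
|\delta(k)| &= d + k + 1 + d + (d-k) = 3d+1, &
|\overline{\delta(k)}| &= (d-k) + d + 1 + k + d = 3d+1,\\
|\pi(k)| &= (2d+2k) + 2 + (4d-2k) = 6d+2, &
|\overline{\pi(k)}| &= (4d-2k) + 1 + (2d+2k) = 6d+1.
\end{align*}
So the lengths do not depend on $k$. With $n = m = 1$ we get $d = \max\{6, 8\} + 1 = 9$, and, for example,
$\delta(1) = U^{9}A^{1}GU^{9}A^{8}$ and $\overline{\delta(1)} = U^{8}A^{9}GU^{1}A^{9}$, each of length $28$.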

828:

829: {\small\bf Fragments:} Note that the sequences

830: $\delta(k)$ and $\od{k}$ are each composed of

831: two substrings in the form of $U^+ A^+$, separated by a character $G$.

832: Each of these two substrings is called a {\it fragment}. Similarly,

833: the two substrings of the form $C^+$ separated by $AG$ in $\pi(k)$

834: and the two substrings of the form $G^+$ separated by the character

835: $A$ in $\overline{\pi(k)}$ are also called fragments.

836:

837: {\small\bf Node Encoding:} Each node in the three node sets $X$, $Y$,

838: and $Z$ is associated with a unique sequence.  For $1 \le i \le n$,

839: let $\encode{x_i}$, $\encode{y_i}$, $\encode{z_i}$ denote the

840: sequences $\delta(i)$, $\delta(n+i)$, $\delta(2n+i)$, respectively.

841: Intuitively, $\encode{x_i}$ is the encoding of the node $x_i$, and

842: similarly $\encode{y_i}$ and $\encode{z_i}$ are for the nodes $y_i$ and

843: $z_i$, respectively.  Furthermore, define $\encode{\overline{x_i}} =

844: \od{i}$, $\encode{\overline{y_i}} = \od{n+i}$, and

845: $\encode{\overline{z_i}} = \od{2n+i}$.

846:

847: The node set $X$ is associated with two sequences $\cal X$ =

848: $\encode{x_1} G \encode{x_2} G \cdots G \encode{x_n}$ and

849: $\overline{\cal X}$ = $\encode{\overline{x_n}} G

850: \encode{\overline{x_{n-1}}} G \cdots G \encode{\overline{x_1}}$.

851: Let ${\cal X} - x_i$

852: = $\encode{x_1} G \cdots G \encode{x_{i-1}} G

853: \encode{x_{i+1}} G \cdots G \encode{x_n}$ and $\overline{{\cal X} -

854: x_i}$ = $\encode{\overline{x_n}} G \cdots G

855: \encode{\overline{x_{i+1}}} G \encode{\overline{x_{i-1}}} G \cdots G

856: \encode{\overline{x_1}}$, where $x_i$ is any node in $X$.

857: Similarly, the node sets $Y$ and $Z$ are

858: associated with sequences ${\cal Y}$, $\overline{\cal Y}$, and

859: $\cal Z$, $\overline{\cal Z}$, respectively.

860:

861: {\small\bf Edge Encoding:} For each edge $e_j$ (where $1 \le j \le

862: m$), we define four delimiter sequences, namely,

863: $V_j = \pi(j)$, $W_j = \pi(m+1+j)$, $\overline{V_j} = \overline{\pi(j)}$,

864: and $\overline{W_j} = \overline{\pi(m+1+j)}$.

865: Assume that $e_j = (x_{p_j}, y_{q_j},

866: z_{r_j})$. Then $e_j$ is encoded by the sequence $S_j$ defined as

867: \[

868:  AG~V_j~AG~W_j~AG~{\cal X}~G~{\cal Y}~G~{\cal Z}~G~

869:  \overline{({\cal Z} - z_{r_j})}~G~\overline{({\cal Y} - y_{q_j})}~G~

870:  \overline{({\cal X} - x_{p_j})}~\overline{V_j}~A~\overline{W_j}.

871: \]

872: Let $S_{m+1}$ be a special sequence defined as $AG~V_{m+1}~AG~W_{m+1}~AG~

873: \overline{\cal Z}~G~\overline{\cal Y}~G~\overline{\cal X}~\;

874: \overline{V_{m+1}}~A~\overline{W_{m+1}}$.  In the following

875: discussion, each $S_j$ is referred to as a {\em region}.

876:

877: Finally, we define $S_E$ to be the sequence $S_{m+1} S_m \cdots

878: S_1$.

879: Let $\sigma = 3n(3d-2) + 6d - 1$ and

880: let $h = m \sigma + n (6d - 4) + 12 d - 5$.  Note that $S_E$ has $O((n+m)^3)$

881: characters and can be constructed in $O(|S_E|)$ time.

882: In Sections \ref{sec-if} and \ref{sec-only-if}, we show that

883: sp($S_E$) $\ge h$ if and only if $E$ contains a perfect matching.
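
The size bound can be seen by a rough count (a sketch; only the orders of magnitude matter).
Each $\delta(k)$ or $\overline{\delta(k)}$ has $3d+1$ characters, and each $\pi(k)$ or $\overline{\pi(k)}$ has at most $6d+2$ characters.
A region contains at most $6n$ node encodings, four delimiter sequences, and $O(n)$ separator characters, so
\[
|S_j| = O(nd)
\quad\mbox{and}\quad
|S_E| = \sum_{j=1}^{m+1} |S_j| = O((m+1)nd) = O((n+m)^3),
\]
using $d = \max\{6n, 4(m+1)\} + 1 = O(n+m)$.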

884:

885: \subsection{Correctness of the if-part} \label{sec-if}

886: This section shows that if $E$ has a perfect matching,

887: we can construct a planar secondary structure for $S_E$

888: containing at least $h$ stacking pairs.  Therefore,

889: sp($S_E$) $\geq h$.

890:

891:

892: First of all, we establish several basic steps for constructing

893: stacking pairs on $S_E$.

894: \begin{itemize}

895: \setlength{\itemsep}{-1pt}

896: \item $\delta(i)$ or $\overline{\delta(i)}$ itself can form

897:       $d-1$ stacking pairs, while

898:       $\delta(i)$ and $\overline{\delta(i)}$ together can form

899:       $3d - 2$ stacking pairs.

900: \item

901:         $\pi(i)$ and $\overline{\pi(i)}$ together can form

902:         $6d - 2$ stacking pairs.

903: \item

904: For any $i \neq j$,

905:         $\pi(i)$ and $\overline{\pi(j)}$ together can form

906:         $6d - 3$ stacking pairs.

907: \end{itemize}
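
The counts for two sequences taken together can be checked by pairing runs of identical characters. The following is a sketch of that counting; it uses only the observation that two fully paired substrings of length $t$ yield $t-1$ stacking pairs.
For $\delta(i)$ and $\overline{\delta(i)}$, pair $U^dA^i$ with $U^iA^d$ and $U^dA^{d-i}$ with $U^{d-i}A^d$:
\[
(d+i-1) + (2d-i-1) = 3d-2 .
\]
For $\pi(i)$ and $\overline{\pi(i)}$, pair $C^{2d+2i}$ with $G^{2d+2i}$ and $C^{4d-2i}$ with $G^{4d-2i}$:
\[
(2d+2i-1) + (4d-2i-1) = 6d-2 .
\]
For $\pi(i)$ and $\overline{\pi(j)}$ with $i < j$, pair $C^{2d+2i}$ with $2d+2i$ of the $G$'s in $G^{2d+2j}$. The run $C^{4d-2i}$ is then paired with the $2j-2i$ leftover $G$'s of that run and with $G^{4d-2j}$, which are separated by the character $A$:
\[
(2d+2i-1) + (2j-2i-1) + (4d-2j-1) = 6d-3 .
\]
The case $i > j$ is symmetric, with the run $C^{2d+2i}$ split instead.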

908:

909:

910: \begin{lemma}

911: If $E$ has a perfect matching, then sp($S_E$) $\geq h$.

912: \end{lemma}

913: \begin{proof}

914: Let  $M = \{ e_{j_1}, e_{j_2}, \ldots, e_{j_n} \}$ be a perfect matching.

915: Without loss of generality, we assume that $1 \le j_1 < j_2 < \ldots < j_n

916: \le m$.  Define $j_{n+1} = m+1$.

917: To obtain a planar secondary structure

918: for $S_E$ with at least $h$ stacking pairs,

919: we consider the regions one by one. There are three cases.

920:

921: \noindent

922: {\it Case 1:} We consider any region $S_j$ such that $e_j \not\in M$.

923: Our goal is to show that $\sigma = 3n(3d-2) +6d -1$

924: stacking pairs can be formed within $S_j$.  Note that

925: there are $(m-n)$ edges not in $M$.  Thus, we can obtain a total

926: of  $(m-n)\sigma$ stacking pairs in this case.  Details are as follows.

927: Assume that $e_j = (x_{p_j}, y_{q_j}, z_{r_j})$.

928: \begin{itemize}

929: \item $6d-2$ stacking pairs can be formed between $V_j$ and $\overline{V_j}$,

930:       and another $6d-2$ between $W_j$ and $\overline{W_j}$.

931: \item $3d-2$ stacking pairs can be formed

932:         between $\encode{x_i}$ and $\encode{\overline{x_i}}$

933:         for all $i \neq p_j$,

934:         and between $\encode{y_i}$ and $\encode{\overline{y_i}}$

935:         for all $i \neq q_j$,

936:         and between $\encode{z_i}$ and $\encode{\overline{z_i}}$

937:         for all $i \neq r_j$.

938: \item   $\encode{x_{p_j}}$, $\encode{y_{q_j}}$, and

939:         $\encode{z_{r_j}}$, whose barred counterparts do not appear in $S_j$, can each

940:         form $d-1$ stacking pairs.

941: \end{itemize}

942: The total  number of stacking pairs that can be formed within $S_j$

943: is $2(6d-2) + 3(n-1)(3d-2) + 3(d-1)$

944: = $3n(3d - 2) + 6d - 1$ = $\sigma$.
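
For readers who wish to check this total, the expansion is
\begin{align*}
2(6d-2) + 3(n-1)(3d-2) + 3(d-1)
 &= (12d-4) + 3n(3d-2) - (9d-6) + (3d-3) \\
 &= 3n(3d-2) + 6d - 1 .
\end{align*}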

945:

946: \noindent

947: {\it Case 2:} We consider the edges $e_{j_1}, e_{j_2}, \ldots, e_{j_n}$

948: in $M$. Our goal is to

949: show that each corresponding region  accounts for $\sigma + 6d -4$

950: stacking pairs. Thus, we obtain a total of $n\sigma + n(6d -4)$ stacking

951: pairs in this case.  Details are as follows.

952: Unlike Case 1, each region $S_{j_k}$, where $1 \le k \le n$,

953: may have some of its bases paired with those of $S_{j_{k+1}}$.

954: \begin{itemize}

955: \item $6d-3$ stacking pairs can be formed between $W_{j_k}$ in $S_{j_k}$

956:       and $\overline{W_{j_{k+1}}}$ in $S_{j_{k+1}}$.

957: \item $6d-2$ stacking pairs can be formed between $V_{j_k}$ in $S_{j_k}$

958:       and $\overline{V_{j_k}}$ in $S_{j_k}$.

959:

960:

961: \item $3d-2$ stacking pairs can be formed between $\encode{x_i}$ in

962:       $S_{j_k}$

963:       and $\encode{\overline{x_i}}$ in $S_{j_k}$ for any

964:       $i \neq p_{j_1}, \ldots, p_{j_k}$,

965:       and between $\encode{y_i}$ in $S_{j_k}$

966:    and $\encode{\overline{y_i}}$ in $S_{j_k}$ for any

967:    $i \neq q_{j_1}, \ldots, q_{j_k}$, and

968:    between $\encode{z_i}$ in $S_{j_k}$

969:    and $\encode{\overline{z_i}}$ in $S_{j_k}$ for any $i \neq r_{j_1},

970:    \ldots, r_{j_k}$.

971:

972: \item

973:       $3d-2$ stacking pairs can be formed between $\encode{x_i}$ in

974:       $S_{j_k}$  and $\encode{\overline{x_i}}$ in $S_{j_{k+1}}$ for any

975:       $i = p_{j_1}, \ldots, p_{j_k}$,

976:       and between $\encode{y_i}$ in $S_{j_k}$

977:       and $\encode{\overline{y_i}}$ in $S_{j_{k+1}}$ for any

978:       $i = q_{j_1}, \ldots, q_{j_k}$, and

979:       between $\encode{z_i}$ in $S_{j_k}$

980:       and $\encode{\overline{z_i}}$ in $S_{j_{k+1}}$ for any $i = r_{j_1},

981:      \ldots, r_{j_k}$.

982:

983: \end{itemize}

984: The total number of stacking pairs charged to $S_{j_k}$ is

985: $6d-3 + 6d -2 + 3n (3d -2)$ = $\sigma + 6d - 4$.

986:

987: \noindent

988: {\it Case 3:} We consider $S_{m+1}$.

989: We can form $6d-2$ stacking pairs between $V_{m+1}$ and

990: $\overline{V_{m+1}}$, and

991: $6d-3$ stacking pairs between $W_{m+1}$ and $\overline{W_{j_1}}$.

992: The number of such stacking pairs is $12d - 5$.

993:

994: Combining the three cases, the number of stacking pairs that

995: can be formed on $S_E$ is $(m-n)\sigma + n(\sigma + 6d - 4) + 12d - 5$,

996: which is exactly $h$.  Notice that no two stacking pairs formed

997: cross each other.  Thus, sp($S_E$) $\ge h$.

998: \end{proof}
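
The arithmetic in Case 2 and in the final combination can be spelled out as follows; the numerical instance at the end uses $n = m = 1$, values chosen here only for illustration.
\begin{align*}
(6d-3) + (6d-2) + 3n(3d-2) &= 3n(3d-2) + 12d - 5 = \sigma + 6d - 4, \\
(m-n)\sigma + n(\sigma + 6d - 4) + 12d - 5 &= m\sigma + n(6d-4) + 12d - 5 = h .
\end{align*}
With $n = m = 1$ we have $d = 9$, so $\sigma = 3 \cdot 25 + 54 - 1 = 128$ and $h = 128 + 50 + 103 = 281$.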

999:

1000: \subsection{Correctness of the only-if part} \label{sec-only-if}

1001:

1002: This section shows that if $E$ has no perfect matching, then

1003: sp($S_E$)$<h$. We first give the framework of the proof in

1004: Section~\ref{sec-only-if-framework}.

1005: Then, some basic definitions and concepts are

1006: presented in Section~\ref{sec-only-if-definition}.

1007: The proof of the only-if part

1008: is given in Section~\ref{sec-only-if-proof}.

1009:

1010: \newcommand{\opt}{\mbox{\rm OPT}}

1011:

1012: \subsubsection{Framework of the proof} \label{sec-only-if-framework}

1013: Let $\opt$ be a planar secondary structure of $S_E$ with the maximum

1014: number of stacking pairs. Let $\#\opt$ be the number of stacking pairs

1015: in $\opt$. That is, $\#\opt =$ sp($S_E$). In this section,

1016: we will establish

1017: an upper bound for $\#\opt$. Recall that we only consider

1018: Watson-Crick base pairs, i.e., $A-U$ and $C-G$ pairs.

1019: We define a conjugate of a

1020: substring in $S_E$ as follows.

1021:

1022: \vspace{5pt}

1023: \noindent

1024: {\bf Conjugates:}

1025: For every substring $R = s_1 s_2 \ldots s_k$ of $S_E$,

1026: the {\it conjugate} of $R$ is

1027: $\hat{R} = \hat{s_k} \ldots \hat{s_1}$,

1028: where $\hat{A} = U$, $\hat{U} = A$, $\hat{C} = G$, and $\hat{G} = C$.

1029:

1030: \vspace{5pt}

1031: For example, $AA$'s conjugate is $UU$ and $UA$'s conjugate is $UA$.

1032: To form a stacking pair, two adjacent bases must be paired

1033: with another two adjacent bases. So, we concentrate on the possible

1034: patterns of adjacent bases in $S_E$.
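
As a further illustration, applying the definition (reverse the string, then complement each base) to the fragments of Section~\ref{sec-construction} gives
\[
\widehat{U^dA^k} = U^kA^d, \qquad
\widehat{U^dA^{d-k}} = U^{d-k}A^d, \qquad
\widehat{C^{2d+2k}} = G^{2d+2k}, \qquad
\widehat{C^{4d-2k}} = G^{4d-2k},
\]
so the two fragments of $\overline{\delta(k)}$ are exactly the conjugates of the two fragments of $\delta(k)$, and likewise for $\overline{\pi(k)}$ and $\pi(k)$.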

1035:

1036: \vspace{5pt}

1037: \noindent

1038: {\bf 2-substrings:}

1039: In $S_E$, any two adjacent characters are referred to as a 2-substring.

1040: By construction, $S_E$ has only ten different types of 2-substrings:

1041: $UU$, $AA$, $UA$, $GG$, $CC$, $GC$, $AG$, $GA$, $GU$, and $CA$-substrings.

1042: A 2-substring can only form a stacking pair with its conjugate.

1043: If they actually form a stacking pair in $\opt$, they are said to

1044: be {\it paired}.

1045:

1046: \vspace{5pt}

1047: Since the conjugates of $AG$, $GA$, $GU$, and $CA$-substrings do not

1048: exist in $S_E$,

1049: there is no stacking pair in $S_E$ which involves these 2-substrings.

1050: We only need to consider $AA$, $UU$, $UA$, $GG$, $CC$, $GC$-substrings.

1051: Table \ref{occ-2substrings} shows the numbers of

1052: occurrences of these 2-substrings

1053: in $S_j$ ($1 \leq j \le m+1$) and the total occurrences of these

1054: substrings in $S_E$.
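
The conjugates in question can be listed explicitly (a direct application of the definition of conjugates):
\[
\widehat{AG} = CU, \qquad \widehat{GA} = UC, \qquad \widehat{GU} = AC, \qquad \widehat{CA} = UG,
\]
and none of $CU$, $UC$, $AC$, $UG$ is among the ten types of 2-substrings of $S_E$.
For the remaining six types,
\[
\widehat{AA} = UU, \qquad \widehat{GG} = CC, \qquad \widehat{UA} = UA, \qquad \widehat{GC} = GC ,
\]
so $AA$ pairs with $UU$, $GG$ pairs with $CC$, while $UA$ and $GC$ are self-conjugate and each such stacking pair uses two substrings of the same type.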

1055:

1056: {\begin{table*}

1057: \footnotesize

1058: \begin{center}

1059: \begin{tabular}{|l|l|l||l|}

1060: \hline

1061: Substring & \multicolumn{3}{c|}{Total number of occurrences of $t$ in} \\ \cline{2-4}

1062: ($t$) & $S_j$ ($j=1,2, \ldots, m$) & $S_{m+1}$ & $S_E$ \\ \hline

1063: AA & $3n(d-2)+(3n-3)(2d-2)$ & $3n(2d-2)$ & $m(3n(d-2) + (3n-3)(2d-2)) + 3n(2d-2)$\\

1064: UU & $3n(2d-2)+(3n-3)(d-2)$ & $3n(d-2)$ & $m(3n(2d-2) + (3n-3)(d-2)) + 3n(d-2)$\\

1065: UA & $2(6n-3)$ & $6n$ & $2m(6n-3) + 6n$\\

1066: GG & $2(6d-2)$ & $2(6d-2)$ & $2(m+1)(6d-2)$\\

1067: CC & $2(6d-2)$ & $2(6d-2)$ & $2(m+1)(6d-2)$\\

1068: GC & $4$ & $4$ & $4m+4$\\ \hline

1069: \end{tabular}

1070: \caption{Number of occurrences of different 2-substrings}

1071: \label{occ-2substrings}

1072: \end{center}

1073: \end{table*}

1074: }

1075:

1076:

1077: Let $\#AA$ denote the number of occurrences of $AA$-substrings in $S_E$.

1078: We use the $\#$ notation for other types of 2-substrings in $S_E$ similarly.

1079: The following fact gives a straightforward upper bound for $\#\opt$.

1080:

1081:

1082: \begin{fact}  \label{lem-interval-very-basic}

1083: \begin{tabbing}

1084: ABCDEF \= $\#\opt$ \= $\le$ \= \kill

1085: \> $\#\opt$ \> $\le$ \> $\min\{\#AA, \#UU\} + \min\{\#GG, \#CC\} + \#UA / 2 + \#GC / 2$ \\

1086: \> \> $=$ \> $h + n + 1 + (2m+2)$.

1087: \end{tabbing}

1088: \end{fact}
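
The equality in Fact~\ref{lem-interval-very-basic} can be verified from Table~\ref{occ-2substrings}. The sketch below assumes $m \ge n$ (otherwise $E$ is too small to contain a perfect matching), so that $\#UU - \#AA = 3d(m-n) \ge 0$ and the first minimum equals $\#AA$.
\begin{align*}
\#AA &= m(9nd - 12n - 6d + 6) + 6nd - 6n, \\
\min\{\#GG, \#CC\} &= (m+1)(12d - 4), \\
\#UA/2 &= m(6n-3) + 3n, \\
\#GC/2 &= 2m+2 .
\end{align*}
Adding the four lines and collecting the terms with factor $m$ gives
\begin{align*}
 & m(9nd - 6n + 6d - 1) + (6nd - 3n + 12d - 4) + (2m+2) \\
 &\quad = m\sigma + \bigl(n(6d-4) + 12d - 5\bigr) + n + 1 + (2m+2) \\
 &\quad = h + n + 1 + (2m+2),
\end{align*}
because $\sigma = 3n(3d-2) + 6d - 1 = 9nd - 6n + 6d - 1$.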

1089:

1090:

1091: Note that $\opt$ may not pair all $AA$-substrings with $UU$-substrings.

1092: Let $\diamondsuit AA$ be the number of $AA$-substrings that

1093: are not paired in $\opt$.  Again, we use the $\diamondsuit$ notation

1094: for other types of 2-substrings.

1095: Fact~\ref{lem-interval-very-basic} can be strengthened as follows.

1096:

1097: \begin{fact} \label{lem-interval-basic}

1098: $\#\opt \le \min\{\#AA-\diamondsuit AA, \#UU-\diamondsuit UU\} +

1099: \min\{\#GG-\diamondsuit GG, \#CC-\diamondsuit CC\} +

1100: (\#UA-\diamondsuit UA)/2 + (\#GC-\diamondsuit GC) / 2$.

1101: \end{fact}

1102:

1103: The upper bound given in Fact \ref{lem-interval-basic} forms

1104: the basis of our proof for showing that $\#\opt < h$.

1105: In the following sections, we consider the possible structure of

1106: $\opt$. For each possible case, we show that the lower

1107: bounds for some $\diamondsuit$ values, such as

1108: $\diamondsuit AA$ and $\diamondsuit CC$, are sufficiently

1109: large so that $\#\opt$ can be shown to be less than $h$.

1110: In particular, in one of the cases, we must make use of the fact

1111: that $E$ does not have a perfect matching in order to prove the

1112: lower bound for $\diamondsuit AA$, $\diamondsuit UA$, and $\diamondsuit

1113: UU$. We give some basic definitions and concepts in Section

1114: \ref{sec-only-if-definition}. The lower bounds and the

1115: proof are given in Section \ref{sec-only-if-proof}.

1116:

1117: \subsubsection{Definitions and concepts} \label{sec-only-if-definition}

1118: In this section, we give some definitions and concepts which are

1119: useful in deriving lower bounds for $\diamondsuit$ values.

1120: We first classify each region $S_j$ in $S_E$

1121: as either {\it open} or {\it closed} with

1122: respect to $\opt$. Then, extending the definitions of fragments and

1123: conjugates, we introduce {\it conjugate fragments} and

1124: {\it delimiter fragments}. Finally, we present a property

1125: of delimiter fragments in open regions.

1126:

1127: \paragraph{Open and closed regions:}

1128: With respect to $\opt$, a region

1129: $S_j$ in $S_E$ is said to be an {\it open region}

1130: if some $UU$, $AA$, or $UA$-substrings in $S_j$ are paired

1131: with some 2-substrings outside $S_j$;

1132: otherwise, it is a {\it closed region}.

1133:

1134: \begin{lemma} \label{lem-s_m+1}

1135: If $S_{m+1}$ is a closed region, then $\#\opt < h$.

1136: \end{lemma}

1137: \begin{proof}

1138: $S_{m+1}$ has $3nd$ more $AA$-substrings than $UU$-substrings.

1139: If $S_{m+1}$ is a closed region, these $3nd$ $AA$-substrings

1140: are not paired by $\opt$.

1141: Thus, $\diamondsuit AA \geq 3nd$.

1142: By Fact~\ref{lem-interval-basic}, $\#\opt \le h+(n+1) + (2m+2) - 3nd < h$.

1143: \end{proof}
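
The two numerical claims in this proof can be checked as follows (a sketch based on Table~\ref{occ-2substrings} and the definition of $d$).
The surplus of $AA$-substrings in $S_{m+1}$ is
\[
3n(2d-2) - 3n(d-2) = 3nd .
\]
Since $d = \max\{6n, 4(m+1)\} + 1$, we have $d \ge 6n+1$ and $d \ge 4m+5$, hence
\[
3nd \;\ge\; 3d \;=\; d + 2d \;\ge\; (6n+1) + 2(4m+5) \;=\; 6n + 8m + 11 \;>\; (n+1) + (2m+2),
\]
which gives the final strict inequality.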

1144:

1145: %By Lemma \ref{lem-s_m+1}, it suffices to assume that

1146: %$S_{m+1}$ is an open region.

1147: Recall that $S_E$ is a sequence

1148: composed of $\delta$'s, $\overline{\delta}$'s,

1149: $\pi$'s, and $\overline{\pi}$'s.

1150: Each $\delta(k)$ (respectively $\overline{\delta(k)}$) consists of

1151: two substrings of the form $U^+ A^+$; each of these substrings

1152: is called a {\em fragment}.  Furthermore,

1153: each $\pi(k)$ (resp.\ $\overline{\pi(k)}$) consists of

1154: two substrings of the form $C^+$ (respectively $G^+$); each of these

1155: substrings is also called a fragment.

1156:

1157: \paragraph{Conjugate fragments and delimiter fragments:}

1158: Consider any fragment $F$ in $S_E$.

1159: Another fragment $F'$ in $S_E$ is called a {\em conjugate fragment}\/

1160: of $F$ if $F'$ is the conjugate of $F$.

1161: Note that if $F$ is a fragment of a certain $\delta(k)$ (resp. $\pi(k)$), then

1162: $F'$ appears only in some $\overline{\delta(k)}$ (respectively

1163: $\overline{\pi(k)}$),

1164: and vice versa.

1165: By construction, if $F$ is a fragment of some delimiter sequence

1166: $V_j$ or $W_j$, then

1167: $F$ has a unique conjugate fragment in $S_E$, which

1168: is located in $\overline{V_j}$ or $\overline{W_j}$, respectively.

1169: However, if $F$ is a fragment of some non-delimiter sequence,

1170: say, $\encode{x_i}$, then for every instance of $\encode{\overline{x_i}}$ in $S_E$,

1171: $F$ has one conjugate fragment in that instance of $\encode{\overline{x_i}}$.

1172:

1173: A fragment $F$ is said to be {\em paired}\/ with

1174: its conjugate fragment $F'$ by $\opt$ if $\opt$ includes

1175: all the pairs of bases between $F$ and $F'$.

1176:

1177: For $1 \leq j \leq m+1$,

1178: the fragment $F$ in $V_j$ or $W_j$

1179: is called a {\it delimiter fragment}.

1180: Note that the delimiter fragment $F$ should be of

1181: the form $C^{2d+k}$ for $2d > k > 0$.

1182:

1183: The following lemma shows a property of delimiter fragments

1184: in open regions.

1185:

1186: \begin{lemma} \label{lem-delimiter-fragment}

1187: If $S_j$ is an open region, then both delimiter

1188: fragments of either $V_j$ or $W_j$

1189: must not pair with their conjugate fragments in $\opt$.

1190: \end{lemma}

1191: \begin{proof}

1192: We prove the statement by contradiction.

1193: Suppose one fragment of $V_j$ and one fragment of $W_j$

1194: are paired with their conjugate fragments.

1195: Let $(s_x, s_{x+1}; s_{y-1}, s_y)$ and $(s_{x'}, s_{x'+1}; s_{y'-1}, s_{y'})$

1196: be some particular stacking pairs in $V_j$ and $W_j$, respectively.

1197: Since $S_j$ is an open region,

1198: we can identify a stacking pair $(s_{x''}, s_{x''+1}; s_{y''-1}, s_{y''})$

1199: where $s_{x''} s_{x''+1}$ and $s_{y''-1} s_{y''}$

1200: are 2-substrings within and outside $S_j$, respectively.

1201: Note that these three stacking pairs form an interleaving block.

1202: By Lemma~\ref{interleavingblock}, ${\opt}$ is not planar,

1203: reaching a contradiction.

1204: \end{proof}

1205:

1206:

1207: \subsubsection{Proof of the only-if part} \label{sec-only-if-proof}

1208: By Lemma~\ref{lem-s_m+1}, it suffices to assume that

1209: $S_{m+1}$ is an open region.

1210: Before we give the proof of the only-if part, let us consider the

1211: following lemma.

1212:

1213: \begin{lemma} \label{lem-open-delimiter}

1214: Let $\alpha$ be the number of delimiter fragments that

1215: are not paired with their conjugate fragments.

1216: Then,

1217: $\diamondsuit CC + \diamondsuit GG \geq \alpha + (\#GC - \diamondsuit GC)$.

1218: \end{lemma}

1219: \begin{proof}

1220: By construction, a $GC$-substring

1221: must be next to the left end of a delimiter fragment $F$, which is

1222: of the form $C^+$.

1223: No other $GC$-substrings can exist. If this $GC$-substring is

1224: paired, the leftmost $CC$-substring of $F$

1225: must not be paired as there is no $GGC$ pattern in $S_E$.

1226: Thus, $F$ must be one of the $\alpha$ delimiter fragments

1227: that are not paired with their conjugate fragments.

1228: Based on this observation, we classify

1229: the $\alpha$ delimiter fragments into two groups:

1230: (1) the $(\#GC - \diamondsuit GC)$ delimiter fragments whose

1231: $GC$-substrings at the left end are paired; and

1232: (2) the remaining $\alpha - (\#GC - \diamondsuit GC)$ delimiter fragments whose

1233: $GC$-substrings at the left end are not paired.

1234:

1235: For each delimiter fragment $F = C^{2d+k}$ in group (1),

1236: since the $GC$-substring on the left of $F$ is paired,

1237: the leftmost $CC$-substring of $F$ must not be paired by $\opt$.

1238: For the remaining $2d+k-2$ $CC$-substrings,

1239: we either find a $CC$-substring which is not paired by $\opt$;

1240: or these $2d+k-2$ $CC$-substrings are paired to

1241: $GG$-substrings in some fragment $F' = G^{2d+k'}$ with $2d > k' > k$,

1242: and thus, some $GG$-substring of $F'$ is not paired.

1243: Therefore, each delimiter fragment in group (1) introduces

1244: either (i) two unpaired $CC$-substrings or

1245: (ii) one unpaired $CC$-substring and one unpaired $GG$-substring.

1246: Hence, the total number of unpaired $CC$ and $GG$-substrings due to

1247: delimiter fragments in group (1) $\geq 2 (\#GC - \diamondsuit GC)$.

1248:

1249: For each delimiter fragment $F = C^{2d+k}$ in group (2), consider

1250: the $CC$-substrings in $F$. With a similar argument, we can show

1251: that

1252: each delimiter fragment in group (2) introduces

1253: either (i) one unpaired $CC$-substring

1254: or (ii) one unpaired $GG$-substring.

1255: Hence, the total number of unpaired $CC$ and $GG$-substrings due to

1256: delimiter fragments in group (2) $\geq \alpha - (\#GC - \diamondsuit GC)$.

1257:

1258: In total, we have

1259: $\diamondsuit CC + \diamondsuit GG

1260: \geq \alpha + (\# GC - \diamondsuit GC)$.

1261: \end{proof}
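
The final inequality of this proof is simply the sum of the two group bounds:
\[
2(\#GC - \diamondsuit GC) + \bigl(\alpha - (\#GC - \diamondsuit GC)\bigr)
 = \alpha + (\#GC - \diamondsuit GC) .
\]
For illustration (with numbers chosen here, not taken from the construction), if $\alpha = 6$ and two $GC$-substrings are paired, the bound reads $\diamondsuit CC + \diamondsuit GG \ge 8$.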

1262:

1263: Now, we state a lemma which shows the

1264: lower bounds for some $\diamondsuit$ values in terms of

1265: the number of open regions in $\opt$.

1266:

1267: \begin{lemma} \label{diamondlowerbounds}

1268: Let $\ell \ge 1$ be the number of open regions in $\opt$.

1269:

1270: \vspace{3pt}

1271: \noindent

1272: (1) If $S_{m+1}$ is an open region, then $\diamondsuit UU \geq 3(m+1-\ell) d$.

1273:

1274: \vspace{3pt}

1275: \noindent

1276: (2) $\max \{ \diamondsuit CC, \diamondsuit GG \} \geq

1277:     \ell + (\# GC - \diamondsuit GC) / 2$.

1278:

1279: \vspace{3pt}

1280: \noindent

1281: (3) If $\ell = n+1$, $S_{m+1}$ is an open region,

1282:     and $E$ does not have a perfect matching,

1283:     then either (a) $\diamondsuit UU \geq 3(m-n)d + 1$,

1284:     (b) $\diamondsuit AA \geq 1$, or (c) $\diamondsuit UA \geq 2$.

1285: \end{lemma}

1286:

1287: \begin{proof}

1288:

1289: \noindent

1290: {\small \bf Statement 1.}

1291: Within each closed region $S_j$ where $j \neq m+1$,

1292: $3d$ $UU$-substrings cannot be paired in $\opt$.

1293: As there are $m+1-\ell$ such closed regions, $3(m+1-\ell)d$

1294: $UU$-substrings are not

1295: paired in $\opt$. Thus, $\diamondsuit UU \geq 3(m+1-\ell)d$.
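
The figure $3d$ comes from Table~\ref{occ-2substrings}: in each region $S_j$ with $j \le m$, the surplus of $UU$-substrings over $AA$-substrings is
\begin{align*}
 & \bigl[3n(2d-2) + (3n-3)(d-2)\bigr] - \bigl[3n(d-2) + (3n-3)(2d-2)\bigr] \\
 &\quad = 3nd - (3n-3)d = 3d ,
\end{align*}
and in a closed region these surplus $UU$-substrings have no $AA$-substring inside the region to pair with.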

1296:

1297: \vspace{5pt}

1298: \noindent

1299: {\small \bf Statement 2.}

1300: By Lemma~\ref{lem-delimiter-fragment}, we can identify $2 \ell$ fragments

1301: in $V_j$ and $W_j$ of all open regions

1302: which are not paired with their conjugate fragments.

1303: Then, by Lemma \ref{lem-open-delimiter}, we have

1304: $\diamondsuit CC + \diamondsuit GG \geq 2\ell + (\# GC - \diamondsuit GC)$.

1305: Thus, $\max\{ \diamondsuit CC, \diamondsuit GG \} \geq

1306: \ell + (\# GC - \diamondsuit GC) / 2$.

1307:

1308: \vspace{5pt}

1309: \noindent

1310: {\small \bf Statement 3.}

1311: By a similar argument to the proof for Statement 1,

1312: within the $m+1-\ell = m-n$ closed regions,

1313: $3(m-n)d$ $UU$-substrings are not paired in $\opt$.

1314:

1315: For the $\ell = n+1$ open regions,

1316: one of them must be $S_{m+1}$.

1317: Let

1318: $S_{j_1}, \ldots, S_{j_n}$ be the remaining $n$ open regions.

1319: Recall that $e_{j_1}, \ldots, e_{j_n}$

1320: are the corresponding edges of these $n$ open regions.

1321: Since these $n$ edges cannot form a perfect matching,

1322: some node, say $x_k$, is adjacent to these $n$

1323: edges more than once.

1324: Thus, within $S_{j_1}, \ldots, S_{j_n}, S_{m+1}$,

1325: we have more $\encode{x_k}$ than

1326: $\encode{\overline{x_k}}$.

1327: Therefore, at least two of the fragments in all $\encode{x_k}$

1328: are not paired

1329: with their conjugate fragments.

1330:

1331: Let $F$ be one of such fragments.

1332: Note that $F$ is of the form $U^d A^k$.

1333: Since $F$ is not paired with its conjugate fragment,

1334: one of the following three cases occurs in $\opt$:

1335:

1336: \vspace{3pt}

1337: \noindent

1338: Case 1: A $UU$-substring of $F$ is not paired.

1339:

1340: \vspace{3pt}

1341: \noindent

1342: Case 2: An $AA$-substring of $F$ is not paired.

1343:

1344: \vspace{3pt}

1345: \noindent

1346: Case 3: All $UU$-substrings and $AA$-substrings of $F$ are paired.

1347: In this case, $U^d$ of $F$ is paired with $A^d$ of a fragment

1348: $F' = U^{k'}A^d$;

1349: and $A^k$ of $F$ is paired with some substring $U^k$ of some fragment $F''$.

1350: As $F'$ and $F''$ are not the same fragment, the $UA$-substrings of both $F$

1351: and $F'$ are not paired.

1352:

1353: \vspace{3pt}

1354: In summary, we have either

1355: (a) $\diamondsuit UU \geq 3(m-n)d + 1$, or

1356: (b) $\diamondsuit AA \geq 1$, or

1357: (c) $\diamondsuit UA \geq 2$.

1358: \end{proof}

1359:

1360: Based on Lemma \ref{diamondlowerbounds}, we prove the only-if part

1361: by a case analysis in the following lemma.

1362:

1363: \begin{lemma}

1364: If $E$ does not have a perfect matching,

1365: then $\# \opt < h$.

1366: \end{lemma}

1367: \begin{proof}

1368: Recall that if $S_{m+1}$ is a closed region, then

1369: $\#\opt < h$. Now, suppose that $S_{m+1}$ is an

1370: open region. We show

1371: $\# \opt < h$ in three cases $\ell < n+1$, $\ell > n+1$ and $\ell = n+1$.

1372:

1373: \vspace{5pt}

1374: \noindent

1375: {\it Case 1:} $\ell < n+1$. By Lemma~\ref{diamondlowerbounds} (1),

1376: $\diamondsuit UU \geq 3(m+1-\ell)d$.

1377: By Fact~\ref{lem-interval-basic},

1378: we can conclude that $\#\opt \le h + n+1 + (2m+2) - 3(n+1-\ell)d

1379: \leq h + n+1 + (2m+2) - 3d < h$.
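
In more detail (a sketch; the shorthand $B = \min\{\#GG, \#CC\} + \#UA/2 + \#GC/2$ is introduced here only for this calculation, and $m \ge n$ is assumed so that $\#AA + B = h + n + 1 + (2m+2)$ as in Fact~\ref{lem-interval-very-basic}):
by Table~\ref{occ-2substrings}, $\#UU = \#AA + 3(m-n)d$, hence
\begin{align*}
\#\opt &\le (\#UU - \diamondsuit UU) + B \\
       &\le \#AA + 3(m-n)d - 3(m+1-\ell)d + B \\
       &= h + n + 1 + (2m+2) - 3(n+1-\ell)d .
\end{align*}
Since $\ell \le n$, we have $3(n+1-\ell)d \ge 3d \ge 6n + 8m + 11 > n + 1 + (2m+2)$, where the middle inequality uses $d \ge 6n+1$ and $d \ge 4m+5$.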

1380:

1381: \vspace{5pt}

1382: \noindent

1383: {\it Case 2:} $\ell > n+1$. By Lemma~\ref{diamondlowerbounds} (2),

1384: $\max \{ \diamondsuit CC, \diamondsuit GG \} \geq \ell + (\# GC - \diamondsuit GC)/2$.

1385: By Fact~\ref{lem-interval-basic},

1386: $\#\opt \leq h + n + 1 - \ell$, which is smaller than $h$

1387: because $\ell > n+1$.
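
The bound in this case can be derived as follows (a sketch based on Table~\ref{occ-2substrings} and Fact~\ref{lem-interval-very-basic}).
Since $\#GG = \#CC$, we have
$\min\{\#GG - \diamondsuit GG, \#CC - \diamondsuit CC\} = \#CC - \max\{\diamondsuit CC, \diamondsuit GG\}$, and therefore
\begin{align*}
\#\opt &\le \min\{\#AA, \#UU\} + \#CC - \max\{\diamondsuit CC, \diamondsuit GG\} + \#UA/2 + (\#GC - \diamondsuit GC)/2 \\
       &\le \min\{\#AA, \#UU\} + \#CC + \#UA/2 - \ell \\
       &= h + n + 1 + (2m+2) - \#GC/2 - \ell \\
       &= h + n + 1 - \ell ,
\end{align*}
using $\#GC/2 = 2m+2$ in the last step.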

1388:

1389: \vspace{5pt}

1390: \noindent

1391: {\it Case 3:} $\ell = n+1$. By Lemma~\ref{diamondlowerbounds} (3),

1392: either

1393: (a) $\diamondsuit UU \geq 3(m-n)d + 1$, or

1394: (b) $\diamondsuit AA \geq 1$, or

1395: (c) $\diamondsuit UA \geq 2$.

1396: By Fact~\ref{lem-interval-basic},

1397: $\#\opt \leq h + n - \max \{ \diamondsuit CC, \diamondsuit GG \}

1398: + (\#GC - \diamondsuit GC) / 2$.

1399: By Lemma~\ref{diamondlowerbounds} (2),

1400: we have $\#\opt < h$.

1401: \end{proof}
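
The bound used in Case 3 can be spelled out as follows (a sketch, assuming $m \ge n$ as in Fact~\ref{lem-interval-very-basic}).
Each of the alternatives (a), (b), and (c) forces
\[
\min\{\#AA - \diamondsuit AA, \#UU - \diamondsuit UU\} + (\#UA - \diamondsuit UA)/2 \;\le\; \#AA + \#UA/2 - 1 ;
\]
for (a) this follows from $\#UU - \diamondsuit UU \le \#AA + 3(m-n)d - 3(m-n)d - 1$.
Together with $\#AA + \#CC + \#UA/2 = h + n + 1$, this gives
\begin{align*}
\#\opt &\le h + n - \max\{\diamondsuit CC, \diamondsuit GG\} + (\#GC - \diamondsuit GC)/2 \\
       &\le h + n - (n+1) = h - 1 ,
\end{align*}
where the second line applies Lemma~\ref{diamondlowerbounds} (2) with $\ell = n+1$.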

1402:

1403: We conclude that if $E$ does not have a perfect matching,

1404: then $\#\opt < h$. Equivalently,

1405: if $\#\opt \geq h$, then

1406: $E$ has a perfect matching.

1407:

1408: \section{Conclusions}

1409: In this paper, we have studied the problem of predicting RNA secondary

1410: structures that allow arbitrary pseudoknots with a simple free

1411: energy function that is minimized when the number of stacking

1412: pairs is maximized. We have proved that this problem is NP-hard if the

1413: secondary structure is required to be planar. We conjecture that

1414: the problem is also NP-hard for the general case.

1415: We have also given two approximation algorithms for this problem with

1416: worst-case approximation ratios of 1/2 and 1/3 for planar and general

1417: secondary structures, respectively. It would be of interest to

1418: improve these approximation ratios.

1419:

1420: Another direction is to study the problem using an

1421: energy function that is minimized when the number of base pairs is

1422: maximized. It is known that this problem can be solved in cubic time

1423: if the secondary structure can be non-planar \cite{Nussinov:1978:ALM}.

1424: However, the computational complexity of the problem is still open if the

1425: secondary structure is required to be planar. We conjecture that

1426: the problem becomes NP-hard under this additional condition.

1427: We would like to point out that the observation that has

1428: enabled us to visualize the planarity of stacking pairs on a rectangular

1429: grid does not hold in the case of maximizing base pairs.

1430:

1431: \bibliographystyle{plain}

1432: \bibliography{rnastruct}

1433:

1434: \end{document}

1435: