0312:q-bio0312007/lc.tex

1: \documentclass[a4paper, 11pt]{article}

2: \usepackage{epsf}

3: \usepackage{epsfig}

4: \usepackage{fancyhdr}

5: \usepackage{amsmath}

6: \usepackage{amssymb}

7: \usepackage{theorem}

8: \topmargin=15pt

9: \oddsidemargin=20pt

10: \headsep=20pt

11: \footskip=30pt

12: \textheight=600pt

13: \textwidth=400pt

14: \renewcommand{\theequation}{\arabic{equation}}

15: \numberwithin{equation}{section}

16:

17: \begin{document}

18: \title{What can we learn from noncoding regions of similarity between genomes?}

19: \author{Thomas A. Down and Tim J. P. Hubbard \\

20: {\it \{td2,th\}@sanger.ac.uk}}

21: \date{\today}

22: \maketitle

23:

24: \begin{abstract}

25: {\bf Background:} In addition to known protein-coding genes, large amounts of apparently non-coding

26: sequence are conserved between the human and mouse genomes.  It seems reasonable

27: to assume that these conserved regions are more likely to contain functional

28: elements than less-conserved portions of the genome.  Here we used a motif-oriented

29: machine learning method to extract the strongest signal from a set of non-coding

30: conserved sequences.

31:

32: {\bf Results:} We successfully fitted models to reflect the non-coding sequences,

33: and showed that the results were quite consistent for repeated training runs.

34: Using the learned model to scan genomic sequence, we found that it often

35: made predictions close to the start of annotated genes.  We compared this method

36: with other published promoter-prediction systems, and showed that the set of promoters

37: which are detected by this method seems to be substantially similar to that

38: detected by existing methods.

39:

40: {\bf Conclusions:} The results presented here indicate that the promoter signal

41: is the strongest single motif-based signal in the non-coding functional fraction

42: of the genome.  They also lend support to the belief that there exists a substantial

43: subset of promoter regions which share common features and are detectable by

44: a variety of computational methods.

45: \end{abstract}

46:

47: \section{Background}

48:

49: Since the publication of draft sequences for the human \cite{human.genome} and

50: mouse \cite{mouse.genome} genomes, several groups have run large-scale comparisons

51: of the sequences to detect regions of conserved sequence.  An initial

52: survey of these was published along with the draft mouse genome \cite{mouse.genome}.

53: Briefly, protein coding genes are -- as we might expect -- among the most strongly conserved

54: regions, but homologous sequences can be found throughout the genome. In total, it is

55: possible to align up to 40\% of the mouse genome to human sequence \cite{schwartz.blastz}, but it seems

56: likely that at least some of this is just random ``comparative noise'' -- regions of

57: sequence which serve no particular purpose but which, purely by chance, have not

58: yet accumulated enough mutations to make their evolutionary relationship unrecognizable.  However, it

59: is widely accepted that

60: some of the noncoding-but-similar regions, especially those with the highest

61: levels of sequence identity between the two species, are preferentially conserved

62: because they perform some important function.  It has been estimated that around

63: 5\% of the genome is under purifying selection \cite{mouse.genome}, indicating that

64: mutations in these regions have deleterious effects: a strong suggestion of some

65: important function.

66:

67: Here, we apply the Eponine Windowed Sequence (EWS) sequence analysis method \cite{down.rvmseq},

68: which uses a Relevance Vector Machine \cite{tipping.rvm} to extract a

69: minimal set of short motifs which are able to discriminate

70: between two sets of sequences: in this case, a positive set of conserved non-coding

71: sequences and a negative set of randomly picked sequences.  The EWS model is

72: an adaptation of the Eponine Anchored Sequence (EAS) model first described in \cite{down1}

73: and subsequently used to predict a range of additional biological features including

74: translation start sites and transcription termination sites [A. Ramadass, unpublished].

75: While EAS is designed to classify individual points in a sequence -- a feature

76: which allows the EponineTSS model to predict precise locations for transcription

77: start sites -- EWS classifies

78: complete blocks (windows) of sequence.  The design and implementation of the EWS

79: model are described in detail in \cite{down.rvmseq}.

80:

81: \section{Results}

82:

83: We considered a set of `tight' alignments made by the blastz program \cite{schwartz.blastz} between

84: release NCBI33 of the human genome and release NCBIM30 of the mouse genome.  In total,

85: this method reported 787173 blocks alignable between the two genomes.  We considered

86: only those blocks assigned to human chromosome 6, a 170Mb chromosome which

87: has recently undergone manual annotation of gene structures and other features

88: \cite{mungall.chr6}.  This chromosome included 44105

89: (5.6\%) of the total alignments.  These varied in length from 34 to 9382 bases,

90: with a length distribution skewed towards relatively short alignments, as shown in figure

91: \ref{length.histogram}.

92:

93: \begin{figure}[!bth]

94: \begin{center}

95: \includegraphics[scale=0.66]{length-histogram.eps}

96: \caption{Number of blastz alignments of specific lengths to human chromosome 6}

97: \label{length.histogram}

98: \end{center}

99: \end{figure}

100:

101: Since we were interested in non-coding features of the genome, we ignored all

102: regions where an alignment overlaps an annotated gene structure.  This

103: removed 20.8\% of aligned bases.  It is

104: possible that some genes, and especially pseudogenes, have been missed by the

105: annotation process, so we also removed portions covered by {\it ab initio} gene

106: predictions from the Genscan program \cite{burge.genscan}.  This eliminated

107: an additional 4.3\% of aligned bases.

108: Finally, repetitive sequence elements annotated by the programs RepeatMasker \cite{smit.repeatmasker} and

109: trf \cite{benson.trf} (5.9\%) were removed from the working set.  The remainder of the aligned

110: regions were split into non-overlapping 200 base windows, ignoring any portions

111: shorter than 200 bases.  This gave a set of 13925 sequences which are well-conserved

112: between human and mouse -- and therefore likely to be functional --

113: but which are very unlikely to be part of the protein-coding repertoire.  These

114: formed the positive training set for our machine learning strategy.
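
As a minimal illustration of the windowing step, assume a filtered aligned region of length $\ell$ bases, with windows laid end to end from the start of the region (both the symbols $\ell$ and $n$ and the choice of starting point are assumptions made here for illustration).  The number of 200-base windows taken from that region is then
\[
  n = \left\lfloor \frac{\ell}{200} \right\rfloor ,
\]
so that, for example, a 450-base region would contribute two windows and its remaining 50 bases would be discarded.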

115:

116: A negative training set of equal size was prepared by picking 200-base windows

117: at random from the non-coding, non-repetitive portions of chromosome 6, using

118: the same criteria to define repeats and coding sequence.  While

119: it is probable that this set also included some functional sequences, we would

120: expect them to be represented at a substantially lower level than in the

121: conserved set.

122:

123: These two sets of sequence were presented to the Eponine Windowed Sequence

124: machine learning system.  Randomly chosen 5-base words were used as seed motifs,

125: and three models were trained, each for 2000 cycles.  The set of motifs used

126: in model 1 is shown in table \ref{model.motifs}.

127:

128: \begin{table}[!bth]

129: \begin{center}

130: \input{model1-motifs}

131: \caption{Motifs used in EWS homology model 1.  The entries in this table show

132:   consensus sequences of the weight matrices used in the model (note that it

133:   is possible for two distinct weight matrices to have the same consensus

134:   sequence).  Motifs are listed in both forwards and reverse-complement orientation,

135:   and the two sections of the table indicate whether that motif is given a

136:   positive or negative weight in the learned linear model.}

137: \label{model.motifs}

138: \end{center}

139: \end{table}

140:

141: While the exact set of motifs used in the model varied

142: somewhat from run to run, testing pairs of models on non-overlapping windows

143: from a 1Mb region of human chromosome 22 and plotting the scores showed that

144: the model outputs were highly correlated ({\it e.g.} figure \ref{scan.scatter}).

145: We calculated the Pearson correlation coefficient for all pairs, and in all cases

146: this was greater than $0.96$.  From this strong correlation, we concluded that

147: any variations in the model were simply the result of the trainer picking one

148: representative from a group of motifs which provide similar information.
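
For reference, write $x_i$ and $y_i$ for the scores assigned to window $i$ by the two models being compared, and $\bar{x}$ and $\bar{y}$ for their means over the $N$ test windows (this notation is introduced here for illustration only).  The Pearson correlation coefficient is then
\[
  r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
           {\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}} ,
\]
which takes the value $1$ when the two sets of scores are related by an exact increasing linear function.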

149:

150: \begin{figure}[!bth]

151: \begin{center}

152: \includegraphics[scale=1.0]{scan-scatter.eps}

153: \caption{Scatter plot showing the scores of models 1 and 2 on a set of sequences}

154: \label{scan.scatter}

155: \end{center}

156: \end{figure}

157:

158: We scanned genomic sequences using these models at a range of thresholds,

159: and examined the results using the Ensembl genome browser \cite{hubbard.ensembl}.

160: Visual inspection showed that

161: many of the highest-scoring regions were localized near the start of genes.

162: This prompted us to look at the distribution of high-scoring regions with respect to

163: the starts of a set of well-annotated genes.  We considered the GD\_mRNA genes

164: from version 2.3 of the human chromosome 22 annotation.  These are confidently

165: annotated genes with experimental evidence as described in \cite{collins.c22},

166: which confirms at least the approximate location of the ends of the transcripts, and

167: are entirely independent from the chromosome 6 training data.

168: Figure \ref{c22.density} shows the density of predictions with GLM scores $\geq0.90$ relative to the

169: annotated 5' ends of these genes.  This shows a strong peak of predictions close

170: to the annotated starts, demonstrating that the model is predicting some sequences

171: commonly located around the transcription start site of genes.  Combining this

172: observation with the fact that the model was trained from conserved (and therefore

173: presumed functional) sequences, we believe that it is detecting signals found

174: in the promoter regions of genes.

175:

176: \begin{figure}[!bth]

177: \begin{center}

178: \includegraphics[scale=1.0]{ehnew-density.ps}

179: \caption{Density of predictions from one of the homology models around known gene

180:   starts on human chromosome 22}

181: \label{c22.density}

182: \end{center}

183: \end{figure}

184:

185: Evaluation of promoter-prediction methods on a large scale is a difficult exercise,

186: since there are no large pieces of genomic sequence for which we can be certain

187: we know the complete set of transcribed regions, and even in the case of well-known

188: genes we often do not know the precise location at which transcription begins.

189: In \cite{down1}, we developed a pseudochromosome, derived from release 2.3 of

190: the chromosome 22 annotation.  As described above, this includes a subset of

191: 284 experimentally verified gene structures.  The pseudochromosome was constructed

192: to include these genes while omitting all other annotated genes (which could

193: be substantially truncated).  We considered predictions (groups of one or more

194: overlapping windows which all have scores greater than some chosen threshold)

195: to be correct if they lie within 2kb of an annotated gene start, and false otherwise.

196: Plotting accuracy (proportion of predictions which are correct) against

197: coverage (proportion of transcript starts which are detected by one of the

198: correct predictions) gives an ROC curve.  This is plotted for three different

199: models in figure \ref{ehnew-roc}.  Firstly, this shows that predictive

200: performance for all three models is rather similar.  It also shows

201: that they can function as accurate promoter predictors, with accuracy rising

202: to a plateau of around $0.7$.
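
Stated as formulas, and using symbols that are introduced here purely for illustration: suppose that, at a given score threshold, $P$ predictions are made, $C$ of them lie within 2kb of an annotated gene start, and $D$ of the $T$ annotated transcript starts are detected by at least one correct prediction.  Then
\[
  \mathrm{accuracy} = \frac{C}{P} , \qquad \mathrm{coverage} = \frac{D}{T} .
\]
These two quantities correspond to what are often called precision and recall.  Note that $C$ and $D$ need not be equal, since several correct predictions may lie close to the same gene start.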

203:

204: \begin{figure}[!bth]

205: \begin{center}

206: \includegraphics[scale=1.0]{ehnew-roc.ps}

207: \caption{Accuracy vs. coverage at a range of score thresholds for three

208:   homology models}

209: \label{ehnew-roc}

210: \end{center}

211: \end{figure}

212:

213: We picked model 1 for further study.  Using a score threshold of $0.91$, this gives

214: an accuracy of $0.68$ and a coverage of $0.31$.  We compared the set of genes

215: correctly detected by this model to two other methods: firstly, the EponineTSS

216: predictor described in \cite{down1}, and secondly, the published results from

217: the PromoterInspector program \cite{scherf.c22}.  PromoterInspector results were

218: mapped to pseudochromosome coordinates using the procedure described in

219: \cite{down1}.  Figure \ref{intersection} shows how the set of promoters detected

220: by these three distinct methods overlaps.  There are clearly strong correlations

221: between all three methods.  In particular, at this threshold homology model 1 (EpoHomol)

222: detects 98 promoters which were found by at least one of the other methods, but

223: only 4 novel promoters.

224:

225: \begin{figure}[!bth]

226: \begin{center}

227: \includegraphics[scale=0.66]{intersection-homol.ps}

228: \caption{Sets of pseudochromosome promoters correctly predicted by three

229: different prediction methods: EponineTSS \cite{down1} with a score threshold

230: of 0.999, PromoterInspector (labeled ``Pro'spector''), and the homology-EWS

231: model 1 with a score threshold of 0.91 (``Homol\_1'').}

232: \label{intersection}

233: \end{center}

234: \end{figure}

235:

236: \section{Conclusions}

237:

238: We have shown here that, when presented with a set of non-coding sequences which are

239: strongly conserved between human and mouse, a simple motif-oriented machine

240: learning system consistently builds models which are able to detect a

241: substantial fraction of human promoter regions with good accuracy.  This

242: strongly suggests that this promoter signal represents the most widely used

243: motif-based signal in functional non-coding sequence.  While the model learned

244: here can clearly be applied for the purpose of genome-wide promoter annotation, in practice existing

245: methods offer better coverage and (in the case of the EponineTSS predictor)

246: predictions for the precise location of the transcription start site.

247:

248: It is interesting that the promoter model learned by this technique detected

249: substantially the same set of promoters as found by the EponineTSS and PromoterInspector

250: methods.  It has previously been remarked that these two methods detect similar

251: sets \cite{down1}, but this could perhaps be explained by the fact that both

252: methods were initially derived from similar sets of known promoter sequences (in both

253: cases, training data was extracted from the EPD database \cite{bucher.epd}).  In

254: the case of the homology models described here, there is no connection with EPD,

255: or any similar set of known promoters: the training data was picked purely

256: on the basis of its high similarity to corresponding portions of the mouse genome.

257: These results therefore support the alternative view that there is a particular

258: `easily detected' subclass of promoter sequences.

259:

260: One distinct group of promoters, which previous results show may correspond to

261: this easily detected family, is those promoters associated with CpG islands [ref].

262: However, while a number of the motifs listed in table \ref{model.motifs} are G/C

263: rich and/or contain the CpG dinucleotide, by no means all of the motifs match

264: this description, and indeed one motif containing CpG has a negative weight in the

265: linear model -- its presence reduces the model output score -- while some A/T

266: rich motifs have positive weights.  We therefore believe that the signals detected

267: here are significantly more complex than a simple overrepresentation of CpG

268: dinucleotides.

269:

270: \section{Materials and Methods}

271:

272: \subsection{Genomic sequence and annotation}

273:

274: Human genome sequence release NCBI33 and mouse genome release NCBIM30 were extracted

275: from Ensembl databases \cite{hubbard.ensembl}, which also contained gene predictions

276: from Genscan \cite{burge.genscan} and repeat data from RepeatMasker \cite{smit.repeatmasker}

277: and trf \cite{benson.trf}.  Curated annotation of gene

278: structures on human chromosome 6 was obtained from the Vega database \cite{vega}.

279: Vega and Ensembl data were extracted directly from the SQL databases using

280: the BioJava toolkit with biojava-ensembl extensions \cite{biojava}.

281:

282: \subsection{Genome alignments}

283:

284: Human-mouse genome alignments were generated by the blastz alignment program.  These were

285: subsequently re-scored and filtered to give a `tight' set of high-confidence

286: alignments, as described in \cite{schwartz.blastz}.  We downloaded the tight

287: alignment set from the UCSC genome website \cite{ucsc}.

288:

289: \subsection{Pseudochromosome for testing promoter-finding methods}

290:

291: A 16.3Mb pseudochromosome sequence was produced based on version 2.3 of the

292: curated annotation for human chromosome 22.  This includes all the experimentally-validated

293: gene structures and their upstream regions, while omitting regions containing

294: genes that are predicted but not fully verified.  In the case of a pair of

295: divergent genes where one has been verified and the second has not, their shared

296: upstream region was cut at the midpoint.  More information about

297: pseudochromosome construction is given in \cite{down1}.

298:

299: \subsection{Eponine Windowed Sequence learning}

300:

301: The Eponine Windowed Sequence (EWS) model is a machine learning system for identifying

302: a small set of motifs which can be effectively used to classify some set of training

303: sequences \cite{down.rvmseq}.  In this case, we applied a slightly restricted

304: version of the EWS trainer which omitted the ``Append Column'' sampling

305: rule, restricting the model to learning motifs with length less than or equal

306: to the length of the seed motifs.
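
As a rough sketch of the kind of model involved, and not a statement of the exact EWS formulation (which is given in \cite{down.rvmseq}), let $\phi_j(s)$ denote a score for motif $j$ within a window $s$, $w_j$ the learned weight for that motif, and $w_0$ a bias term.  A generalized linear model with a logistic link would then give the output
\[
  y(s) = \sigma\Bigl( w_0 + \sum_{j} w_j \, \phi_j(s) \Bigr) , \qquad \sigma(a) = \frac{1}{1 + e^{-a}} .
\]
The logistic link and the symbols $\phi_j$, $w_j$, $w_0$ and $\sigma$ are assumptions made here for illustration.  Under this form, a motif with positive $w_j$ raises the output score and a motif with negative $w_j$ lowers it, and the output always lies between $0$ and $1$, which is consistent with the score thresholds quoted in the Results.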

307:

308: \section{Acknowledgments}

309:

310: Chromosome 22 annotation data version 2.3 were produced by the Chromosome 22 Annotation

311: Group at the Sanger Institute and were obtained from the World Wide Web at

312: http://www.sanger.ac.uk/HGP/Chr22 (Dunham {\it et al.} unpublished data).  TD

313: would like to thank the Wellcome Trust for funding.

314:

315:

316: \begin{thebibliography}{99}

317:

318: \bibitem{human.genome} International Human Genome Sequencing Consortium: {\bf Initial

319:   sequencing and analysis of the human genome.} {\it Nature} 2001 {\bf 409:}860-921

320: \bibitem{mouse.genome} The Mouse Genome Sequencing Consortium: {\bf Initial

321:   sequencing and comparative analysis of the mouse genome.} {\it Nature} 2002 {\bf 420:}520-562

322: \bibitem{schwartz.blastz} Schwartz S, Kent WJ, Smit A, Zhang Z,

323:   Baertsch R, Hardison RC, Haussler D, Miller W: {\bf Human-Mouse

324:   Alignments with BLASTZ.} {\it Genome Res.} 2003 {\bf 13:}103-107.

325: \bibitem{down.rvmseq} Down TA, Hubbard TJP: {\bf Relevance Vector Machines for

326:   classifying points and regions of biological sequences.} submitted.

327: \bibitem{tipping.rvm} Tipping ME: {\bf Sparse Bayesian learning and

328:   the relevance vector machine.} {\it Journal of Machine Learning Research} 2001

329:   {\bf 1:}211--244.

330: \bibitem{down1} Down TA, Hubbard TJP: {\bf Computational Detection

331:   and Location of Transcription Start Sites in Mammalian Genomic DNA.}

332:   {\it Genome Res.} 2002 {\bf 12:}652-658.

333: \bibitem{mungall.chr6} A. J. Mungall {\it et al.}: {\bf The DNA sequence and analysis

334:   of human chromosome 6.} {\it Nature} 2003 {\bf 425:}805-811.

335: \bibitem{burge.genscan} Burge C, Karlin S: {\bf Prediction of complete gene

336:   structures in human genomic DNA.} {\it Journal of Molecular Biology} 1997 {\bf 268:}78-94

337: \bibitem{smit.repeatmasker} Smit AFA, Green P: {\bf RepeatMasker}\newline

338:   [http://ftp.genome.washington.edu/RM/RepeatMasker.html]

339: \bibitem{benson.trf} Benson G: {\bf Tandem repeats finder: a program to analyze

340:   DNA sequences} {\it Nucleic Acids Res.} 1999 {\bf 27:}573-580

341: \bibitem{hubbard.ensembl} T. J. P. Hubbard {\it et al.}: {\bf The Ensembl genome

342:   database project} {\it Nucleic Acids Res.} 2002 {\bf 30:}38--41.

343: \bibitem{collins.c22} Collins JE, Goward ME, Cole CG, Smink LJ, Huckle EJ, Knowles

344:   S, Bye JM, Beare DM, Dunham I: {\bf Reevaluating Human Gene Annotation: A Second-

345:   Generation Analysis of Chromosome 22}. {\it Genome Res} 2003, {\bf 13:}27-36

346: \bibitem{scherf.c22} Scherf M, Klingenhoff A, Frech K, Quandt K, Schneider R,

347:   Grote K, Frisch M, Gailus-Durner V, Seidel A, Brack-Werner R, Werner T:

348:   {\bf First pass annotation of promoters on human chromosome 22} {\it Genome Res.} 2001

349:   {\bf 11:}333-340

350: \bibitem{bucher.epd} Perier RC, Praz V, Junier T, Bonnard C, Bucher P:

351:   {\bf The Eukaryotic Promoter Database (EPD)} {\it Nucleic Acids Res.} 2000

352:   {\bf 28:}307-309

353: \bibitem{vega} {\bf Vega Genome Browser} [http://vega.sanger.ac.uk/].

354: \bibitem{biojava} {\bf BioJava} [http://www.biojava.org/].

355: \bibitem{ucsc} {\bf UCSC Genome Bioinformatics} [http://genome.cse.ucsc.edu/].

356:

357: \end{thebibliography}

358:

359: \end{document}

360: