1: \documentclass[a4paper, 11pt]{article}
2: \usepackage{epsf}
3: \usepackage{epsfig}
4: \usepackage{fancyhdr}
5: \usepackage{amsmath}
6: \usepackage{amssymb}
7: \usepackage{theorem}
8: \topmargin=15pt
9: \oddsidemargin=20pt
10: \headsep=20pt
11: \footskip=30pt
12: \textheight=600pt
13: \textwidth=400pt
14: \renewcommand{\theequation}{\arabic{equation}}
15: \numberwithin{equation}{section}
16:
17: \begin{document}
18: \title{What can we learn from noncoding regions of similarity between genomes?}
19: \author{Thomas A. Down and Tim J. P. Hubbard \\
20: {\it \{td2,th\}@sanger.ac.uk}}
21: \date{\today}
22: \maketitle
23:
24: \begin{abstract}
{\bf Background:} In addition to known protein-coding genes, large amounts of apparently non-coding
26: sequence are conserved between the human and mouse genomes. It seems reasonable
27: to assume that these conserved regions are more likely to contain functional
28: elements than less-conserved portions of the genome. Here we used a motif-oriented
29: machine learning method to extract the strongest signal from a set of non-coding
30: conserved sequences.
31:
{\bf Results:} We successfully fitted models to reflect the non-coding sequences,
and showed that the results were consistent across repeated training runs.
Using the learned model to scan genomic sequence, we found that it often
made predictions close to the start of annotated genes. We compared this method
with other published promoter-prediction systems, and showed that the set of promoters
which are detected by this method seems to be substantially similar to that
detected by existing methods.
39:
40: {\bf Conclusions:} The results presented here indicate that the promoter signal
41: is the strongest single motif-based signal in the non-coding functional fraction
42: of the genome. They also lend support to the belief that there exists a substantial
43: subset of promoter regions which share common features and are detectable by
44: a variety of computational methods.
45: \end{abstract}
46:
47: \section{Background}
48:
49: Since the publication of draft sequences for the human \cite{human.genome} and
50: mouse \cite{mouse.genome} genomes, several groups have run large-scale comparisons
51: of the sequences to detect regions of conserved sequence. An initial
52: survey of these was published along with the draft mouse genome \cite{mouse.genome}.
53: Briefly, protein coding genes are -- as we might expect -- among the most strongly conserved
54: regions, but homologous sequences can be found throughout the genome. In total, it is
55: possible to align up to 40\% of the mouse genome to human sequence \cite{schwartz.blastz}, but it seems
56: likely that at least some of this is just random ``comparative noise'' -- regions of
57: sequence which serve no particular purpose but which, purely by chance, have not
58: yet accumulated enough mutations to make their evolutionary relationship unrecognizable. However, it
59: is widely accepted that
60: some of the noncoding-but-similar regions, especially those with the highest
61: levels of sequence identity between the two species, are preferentially conserved
62: because they perform some important function. It has been estimated that around
63: 5\% of the genome is under purifying selection \cite{mouse.genome}, indicating that
64: mutations in these regions have deleterious effects: a strong suggestion of some
important function.
66:
Here, we apply the Eponine Windowed Sequence (EWS) sequence analysis method \cite{down.rvmseq},
which uses a Relevance Vector Machine \cite{tipping.rvm} to extract a
minimal set of short motifs which are able to discriminate
between two sets of sequences: in this case, a positive set of conserved non-coding
sequences and a negative set of randomly picked sequences. The EWS model is
an adaptation of the Eponine Anchored Sequence (EAS) model first described in \cite{down1}
and subsequently used to predict a range of additional biological features including
translation start sites and transcription termination sites [A. Ramadass, unpublished].
75: While EAS is designed to classify individual points in a sequence -- a feature
76: which allows the EponineTSS model to predict precise locations for transcription
77: start sites -- EWS classifies
78: complete blocks (windows) of sequence. The design and implementation of the EWS
79: model is described in detail in \cite{down.rvmseq}.
80:
81: \section{Results}
82:
83: We considered a set of `tight' alignments made by the blastz program \cite{schwartz.blastz} between
84: release NCBI33 of the human genome and release NCBIM30 of the mouse genome. In total,
85: this method reported 787173 blocks alignable between the two genomes. We considered
86: only those blocks assigned to human chromosome 6, a 170Mb chromosome which
87: has recently undergone manual annotation of gene structures and other features
88: \cite{mungall.chr6}. This chromosome included 44105
89: (5.6\%) of the total alignments. These varied in length from 34 to 9382 bases,
90: with a length distribution skewed towards relatively short alignments, as shown in figure
91: \ref{length.histogram}.
92:
93: \begin{figure}[!bth]
94: \begin{center}
95: \includegraphics[scale=0.66]{length-histogram.eps}
96: \caption{Number of blastz alignments of specific lengths to human chromosome 6}
97: \label{length.histogram}
98: \end{center}
99: \end{figure}
100:
Since we were interested in non-coding features of the genome, we ignored all
regions where an alignment overlapped an annotated gene structure. This
removed 20.8\% of aligned bases. It is
possible that some genes, and especially pseudogenes, have been missed by the
annotation process, so we also removed portions covered by {\it ab initio} gene
predictions from the Genscan program \cite{burge.genscan}. This eliminated
an additional 4.3\% of aligned bases.
Finally, repetitive sequence elements annotated by the programs RepeatMasker \cite{smit.repeatmasker} and
trf \cite{benson.trf} (5.9\%) were removed from the working set. The remainder of the aligned
regions were split into non-overlapping 200 base windows, ignoring any portions
shorter than 200 bases. This gave a set of 13925 sequences which are well-conserved
112: between human and mouse -- and therefore likely to be functional --
113: but which are very unlikely to be part of the protein-coding repertoire. These
114: formed the positive training set for our machine learning strategy.
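
As a rough check on the scale of this training set (a worked calculation for illustration, not a figure reported by the analysis itself), a filtered aligned fragment of length $\ell$ bases contributes $\lfloor \ell / w \rfloor$ windows of width $w = 200$, assuming the windows are tiled from one end of the fragment. The symbols $\ell$ and $w$ are notation introduced only for this illustration. The positive set therefore covers
\[
13925 \times 200 = 2\,785\,000 \text{ bases} \approx 2.8\,\text{Mb},
\]
which is roughly $1.6\%$ of the 170Mb chromosome.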
115:
116: A negative training set of equal size was prepared by picking 200-base windows
117: at random from the non-coding, non-repetitive portions of chromosome 6, using
118: the same criteria to define repeats and coding sequence. While
119: it is probable that this set also included some functional sequences, we would
120: expect them to be represented at a substantially lower level than in the
121: conserved set.
122:
123: These two sets of sequence were presented to the Eponine Windowed Sequence
124: machine learning system. Randomly chosen 5-base words were used as seed motifs,
125: and three models were trained, each for 2000 cycles. The set of motifs used
in model 1 is shown in table \ref{model.motifs}.
127:
128: \begin{table}[!bth]
129: \begin{center}
\input{model1-motifs}
131: \caption{Motifs used in EWS homology model 1. The entries in this table show
132: consensus sequences of the weight matrices used in the model (note that it
133: is possible for two distinct weight matrices to have the same consensus
134: sequence). Motifs are listed in both forwards and reverse-complement orientation,
135: and the two sections of the table indicate whether that motif is given a
136: positive or negative weight in the learned linear model.}
137: \label{model.motifs}
138: \end{center}
139: \end{table}
140:
141: While the exact set of motifs used in the model varied
142: somewhat from run to run, testing pairs of models on non-overlapping windows
143: from a 1Mb region of human chromosome 22 and plotting the scores showed that
144: the model outputs were highly correlated ({\it e.g.} figure \ref{scan.scatter}).
145: We calculated the Pearson correlation coefficient for all pairs, and in all cases
146: this was greater than $0.96$. From this strong correlation, we concluded that
147: any variations in the model were simply the result of the trainer picking one
148: representative from a group of motifs which provide similar information.
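
For reference, the Pearson correlation coefficient has the standard form given below. The symbols are notation introduced for this illustration: $a_i$ and $b_i$ denote the scores assigned by two models to window $i$ of $n$ test windows, and $\bar{a}$ and $\bar{b}$ are the corresponding mean scores.
\[
r = \frac{\sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b})}
         {\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^2}\,\sqrt{\sum_{i=1}^{n} (b_i - \bar{b})^2}}.
\]
With three trained models there are three such pairs to compare.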
149:
150: \begin{figure}[!bth]
151: \begin{center}
152: \includegraphics[scale=1.0]{scan-scatter.eps}
153: \caption{Scatter plot showing the scores of models 1 and 2 on a set of sequences}
154: \label{scan.scatter}
155: \end{center}
156: \end{figure}
157:
158: We scanned genomic sequences using these models at a range of thresholds,
159: and examined the results using the Ensembl genome browser \cite{hubbard.ensembl}.
160: Visual inspection showed that
161: many of the highest-scoring regions were localized near the start of genes.
This prompted us to look at the distribution of high-scoring regions with respect to
163: the starts of a set of well-annotated genes. We considered the GD\_mRNA genes
164: from version 2.3 of the human chromosome 22 annotation. These are confidently
165: annotated genes with experimental evidence as described in \cite{collins.c22},
166: which confirms at least the approximate location of the ends of the transcripts, and
167: are entirely independent from the chromosome 6 training data.
168: Figure \ref{c22.density} shows the density of predictions with GLM scores $\geq0.90$ relative to the
169: annotated 5' ends of these genes. This shows a strong peak of predictions close
170: to the annotated starts, demonstrating that the model is predicting some sequences
171: commonly located around the transcription start site of genes. Combining this
172: observation with the fact that the model was trained from conserved (and therefore
173: presumed functional) sequences, we believe that it is detecting signals found
174: in the promoter regions of genes.
175:
176: \begin{figure}[!bth]
177: \begin{center}
178: \includegraphics[scale=1.0]{ehnew-density.ps}
179: \caption{Density of predictions from one of the homology models around known gene
180: starts on human chromosome 22}
181: \label{c22.density}
182: \end{center}
183: \end{figure}
184:
185: Evaluation of promoter-prediction methods on a large scale is a difficult exercise,
186: since there are no large pieces of genomic sequence for which we can be certain
we know the complete set of transcribed regions, and even in the case of well-known
188: genes we often do not know the precise location at which transcription begins.
189: In \cite{down1}, we developed a pseudochromosome, derived from release 2.3 of
190: the chromosome 22 annotation. As described above, this includes a subset of
191: 284 experimentally verified gene structures. The pseudochromosome was constructed
192: to include these genes while omitting all other annotated genes (which could
193: be substantially truncated). We considered predictions (groups of one or more
194: overlapping windows which all have scores greater than some chosen threshold)
to be correct if they lie within 2kb of an annotated gene start, and false otherwise.
Plotting accuracy (proportion of predictions which are correct) against
197: coverage (proportion of transcript starts which are detected by one of the
correct predictions) gives a curve analogous to an ROC curve. This is plotted for three different
199: models in figure \ref{ehnew-roc}. Firstly, this shows that predictive
performance for all three models is very similar. It also shows
201: that they can function as accurate promoter predictors, with accuracy rising
202: to a plateau of around $0.7$.
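
Written out, the two quantities plotted can be expressed as follows. The symbols are shorthand introduced here for illustration: at score threshold $t$, $P(t)$ is the number of predictions made, $C(t)$ is the number of those predictions lying within 2kb of an annotated gene start, $D(t)$ is the number of annotated gene starts with at least one correct prediction nearby, and $G$ is the total number of annotated gene starts on the pseudochromosome.
\[
\mathrm{accuracy}(t) = \frac{C(t)}{P(t)}, \qquad
\mathrm{coverage}(t) = \frac{D(t)}{G}.
\]
The counts $C(t)$ and $D(t)$ need not be equal, since several distinct predictions may fall near the same gene start.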
203:
204: \begin{figure}[!bth]
205: \begin{center}
206: \includegraphics[scale=1.0]{ehnew-roc.ps}
207: \caption{Accuracy vs. coverage at a range of score thresholds for three
208: homology models}
209: \label{ehnew-roc}
210: \end{center}
211: \end{figure}
212:
213: We picked model 1 for further study. Using a score threshold of $0.91$, this gives
214: an accuracy of $0.68$ and a coverage of $0.31$. We compared the set of genes
215: correctly detected by this model to two other methods: firstly, the EponineTSS
216: predictor described in \cite{down1}, and secondly, the published results from
217: the PromoterInspector program \cite{scherf.c22}. PromoterInspector results were
218: mapped to pseudochromosome coordinates using the procedure described in
219: \cite{down1}. Figure \ref{intersection} shows how the set of promoters detected
220: by these three distinct methods overlaps. There are clearly strong correlations
221: between all three methods. In particular, at this threshold the EpoHomol model
222: detects 98 promoters which were found by at least one of the other methods, but
223: only 4 novel promoters.
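
Taking these counts at face value, the fraction of promoters detected by this model which were also found by at least one of the other methods is
\[
\frac{98}{98 + 4} = \frac{98}{102} \approx 0.96 .
\]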
224:
225: \begin{figure}[!bth]
226: \begin{center}
227: \includegraphics[scale=0.66]{intersection-homol.ps}
228: \caption{Sets of pseudochromosome promoters correctly predicted by three
229: different prediction methods: EponineTSS \cite{down1} with a score threshold
230: of 0.999, PromoterInspector (labeled ``Pro'spector''), and the homology-EWS
231: model 1 with a score threshold of 0.91 (``Homol\_1'').}
232: \label{intersection}
233: \end{center}
234: \end{figure}
235:
236: \section{Conclusions}
237:
238: We have shown here that, when presented with a set of non-coding sequences which are
239: strongly conserved between human and mouse, a simple motif-oriented machine
240: learning system consistently builds models which are able to detect a
241: substantial fraction of human promoter regions with good accuracy. This
242: strongly suggests that this promoter signal represents the most widely used
243: motif-based signal in functional non-coding sequence. While the model learned
244: here can clearly be applied for the purpose of genome-wide promoter annotation, in practice existing
245: methods offer better coverage and (in the case of the EponineTSS predictor)
246: predictions for the precise location of the transcription start site.
247:
248: It is interesting that the promoter model learned by this technique detected
249: substantially the same set of promoters as found by the EponineTSS and PromoterInspector
250: methods. It has previously been remarked that these two methods detect similar
251: sets \cite{down1}, but this could perhaps be explained by the fact that both
methods were initially derived from similar sets of known promoter sequences (in both
cases, training data was extracted from the EPD database \cite{bucher.epd}). In
254: the case of the homology models described here, there is no connection with EPD,
255: or any similar set of known promoters: the training data was picked purely
256: on the basis of its high similarity to corresponding portions of the mouse genome.
These results therefore support the alternative view that there is a particular
258: `easily detected' subclass of promoter sequences.
259:
260: One distinct group of promoters, which previous results show may correspond to
261: this easily detected family, is those promoters associated with CpG islands [ref].
262: However, while a number of the motifs listed in table \ref{model.motifs} are G/C
263: rich and/or contain the CpG dinucleotide, by no means all of the motifs match
this description, and indeed one motif containing CpG has a negative weight in the
linear model -- its presence reduces the model output score -- while some A/T
266: rich motifs have positive weights. We therefore believe that the signals detected
267: here are significantly more complex than a simple overrepresentation of CpG
268: dinucleotides.
269:
270: \section{Materials and Methods}
271:
272: \subsection{Genomic sequence and annotation}
273:
274: Human genome sequence release NCBI33 and mouse genome release NCBIM30 were extracted
275: from Ensembl databases \cite{hubbard.ensembl}, which also contained gene predictions
276: from Genscan \cite{burge.genscan} and repeat data from RepeatMasker \cite{smit.repeatmasker}
277: and trf \cite{benson.trf}. Curated annotation of gene
278: structures on human chromosome 6 was obtained from the Vega database \cite{vega}.
Vega and Ensembl data were extracted directly from the SQL databases using
280: the BioJava toolkit with biojava-ensembl extensions \cite{biojava}.
281:
282: \subsection{Genome alignments}
283:
Human-mouse genome alignments were generated by the blastz alignment program. These were
285: subsequently re-scored and filtered to give a `tight' set of high-confidence
286: alignments, as described in \cite{schwartz.blastz}. We downloaded the tight
287: alignment set from the UCSC genome website \cite{ucsc}.
288:
289: \subsection{Pseudochromosome for testing promoter-finding methods}
290:
291: A 16.3Mb pseudochromosome sequence was produced based on version 2.3 of the
292: curated annotation for human chromosome 22. This includes all the experimentally-validated
293: gene structures and their upstream regions, while omitting regions containing
294: genes that are predicted but not fully verified. In the case of a pair of
295: divergent genes where one has been verified and the second has not, their shared
296: upstream region was cut at the midpoint. More information about
297: pseudochromosome construction is given in \cite{down1}.
298:
299: \subsection{Eponine Windowed Sequence learning}
300:
301: The Eponine Windowed Sequence (EWS) model is a machine learning system for identifying
302: a small set of motifs which can be effectively used to classify some set of training
303: sequences \cite{down.rvmseq}. In this case, we applied a slightly restricted
304: version of the EWS trainer which omitted the ``Append Column'' sampling
305: rule, restricting the model to learning motifs with length less than or equal
306: to the length of the seed motifs.
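
As a schematic illustration only (the precise formulation is given in \cite{down.rvmseq}), a model of this kind can be thought of as a generalized linear model over motif scores. Assuming a logistic link, as is usual for Relevance Vector Machine classifiers \cite{tipping.rvm}, the output for a window $x$ would take the form
\[
y(x) = \sigma\Bigl(w_0 + \sum_{j=1}^{m} w_j\,\phi_j(x)\Bigr), \qquad
\sigma(z) = \frac{1}{1 + e^{-z}},
\]
where $\phi_j(x) \geq 0$ is a hypothetical score summarizing how well motif $j$ matches within the window, $w_j$ is its learned weight, $w_0$ is a bias term, and $m$ is the number of motifs retained. These symbols are notation assumed for this sketch rather than taken from the EWS implementation. Under this form, a motif with $w_j > 0$ raises the output score when it matches and one with $w_j < 0$ lowers it, which is the sense in which motifs are described as having positive or negative weights.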
307:
308: \section{Acknowledgments}
309:
310: Chromosome 22 annotation data version 2.3 were produced by the Chromosome 22 Annotation
311: Group at the Sanger Institute and were obtained from the World Wide Web at
312: http://www.sanger.ac.uk/HGP/Chr22 (Dunham {\it et al.} unpublished data). TD
313: would like to thank the Wellcome Trust for funding.
314:
315:
316: \begin{thebibliography}{1}
317:
\bibitem{human.genome} International Human Genome Sequencing Consortium: {\bf Initial
sequencing and analysis of the human genome.} {\it Nature} 2001 {\bf 409:}860-921.
\bibitem{mouse.genome} Mouse Genome Sequencing Consortium: {\bf Initial
sequencing and comparative analysis of the mouse genome.} {\it Nature} 2002 {\bf 420:}520-562.
322: \bibitem{schwartz.blastz} Schwartz S, Kent WJ, Smit A, Zhang Z,
323: Baertsch R, Hardison RC, Haussler D, Miller W: {\bf Human-Mouse
324: Alignments with BLASTZ.} {\it Genome Res.} 2003 {\bf 13:}103-107.
325: \bibitem{down.rvmseq} Down TA, Hubbard TJP: {\bf Relevance Vector Machines for
326: classifying points and regions of biological sequences.} submitted.
\bibitem{tipping.rvm} Tipping ME: {\bf Sparse Bayesian learning and
the relevance vector machine.} {\it Journal of Machine Learning Research} 2001
{\bf 1:}211-244.
330: \bibitem{down1} Down TA, Hubbard TJP: {\bf Computational Detection
331: and Location of Transcription Start Sites in Mammalian Genomic DNA.}
332: {\it Genome Res.} 2002 {\bf 12:}652-658.
333: \bibitem{mungall.chr6} A. J. Mungall {\it et al.} {\bf The DNA sequence and analysis
334: of human chromosome 6} {\it Nature} 2003 {\bf 425:}805-811.
335: \bibitem{burge.genscan} Burge C, Karlin S: {\bf Prediction of complete gene
structures in human genomic DNA.} {\it Journal of Molecular Biology} 1997 {\bf 268:}78-94.
337: \bibitem{smit.repeatmasker} Smit AFA, Green P: {\bf RepeatMasker}\newline
338: [http://ftp.genome.washington.edu/RM/RepeatMasker.html]
339: \bibitem{benson.trf} Benson G: {\bf Tandem repeats finder: a program to analyze
340: DNA sequences} {\it Nucleic Acids Res.} 1999 {\bf 27:}573-580
341: \bibitem{hubbard.ensembl} T. J. P. Hubbard {\it et al.}: {\bf The Ensembl genome
database project.} {\it Nucleic Acids Res.} 2002 {\bf 30:}38-41.
343: \bibitem{collins.c22} Collins JE, Goward ME, Cole CG, Smink LJ, Huckle EJ, Knowles
344: S, Bye JM, Beare DM, Dunham I: {\bf Reevaluating Human Gene Annotation: A Second-
345: Generation Analysis of Chromosome 22}. {\it Genome Res} 2003, {\bf 13:}27-36
346: \bibitem{scherf.c22} Scherf M, Klingenhoff A, Frech K, Quandt K, Schneider R,
347: Grote K, Frisch M, Gailus-Durner V, Seidel A, Brack-Werner R, Werner T:
348: {\bf First pass annotation of promoters on human chromosome 22} {\it Genome Res.} 2001
349: {\bf 11:}333-340
350: \bibitem{bucher.epd} Perier RC, Praz V, Junier T, Bonnard C, Bucher P:
351: {\bf The Eukaryotic Promoter Database (EPD)} {\it Nucleic Acids Res.} 2000
352: {\bf 28:}307-309
353: \bibitem{vega} {\bf Vega Genome Browser} [http://vega.sanger.ac.uk/].
354: \bibitem{biojava} {\bf BioJava} [http://www.biojava.org/].
355: \bibitem{ucsc} {\bf UCSC Genome Bioinformatics} [http://genome.cse.ucsc.edu/].
356:
357: \end{thebibliography}
358:
359: \end{document}
360: