1: \documentclass[11pt,twocolumn]{article}
2:
3: \usepackage{fullpage}
4: \usepackage{times}
5: \usepackage{algorithm}
6: \usepackage{algorithmic}
7: \usepackage{amsthm}
8: \usepackage{graphicx}
9:
10: \newcommand{\keyword}[1]{\texttt{#1}}
11: \newcommand{\latcom}[1]{\texttt{$\backslash$#1}}
12:
13: \newtheorem{thm}{Theorem}[section]
14: \newtheorem{lem}{Lemma}[section]
15: \newtheorem{cor}{Corollary}[section]
16: %
17: \renewcommand{\baselinestretch}{.95}
18: \normalsize
19:
20: \newcommand{\RelaxFloats}{
21: \renewcommand{\topfraction}{0.9}
22: \renewcommand{\floatpagefraction}{0.9}
23: \renewcommand{\textfraction}{0.1}
24: }
25:
26:
27: \begin{document}
28:
29: \RelaxFloats
30:
31: \begin{titlepage}
32:
33: \flushleft
34:
35:
36: \vspace{0.00in}
37: \parbox{6.5in}{\large \noindent
38: Draft prepared for \textbf{arXiv}.
39: }
40:
41:
42: \vspace{0.10in}
43: \parbox{6.5in}{\large \noindent
44: Manuscript information:
45: {10} text pages,
46: {2} figures,
47: {3} tables.
48: }
49:
50:
51:
52: \vspace{2.00in}
53: \parbox{6.5in}{\LARGE \centering
54: Mining Mass Spectra: Metric Embeddings and \\
55: Fast Near Neighbor Search
56: }
57:
58:
59: \vspace{0.5in}
60: \parbox{6.5in}{\large \centering
61: Debojyoti Dutta,
62: Ting Chen\footnotemark[1]
63: }
64:
65:
66: \vspace{0.5in}
67: \parbox{6.5in}{\large \centering
68: Molecular and Computational Biology Program \\
69: University of Southern California \\
70: Los Angeles, CA 90089-2910
71: }
72:
73: \vspace{0.5in}
74: \parbox{6.5in}{\large \centering
75: \today
76: }
77:
78:
79: \footnotetext[1]{
80: To whom correspondence should be addressed.
81: Molecular and Computational Biology Program,
82: University of Southern California.
83: MCB 201, 1050 Childs Way,
84: Los Angeles, CA 90089-2910.
E-mail: ddutta@usc.edu,
tingchen@usc.edu.
87: Tel: (213)740-2416,
88: (213)740-2415.
89: Fax: (213)740-8631.
90: }
91: \end{titlepage}
92:
93:
94:
95: \iffalse
96:
97: \title
98: \author{Debojyoti Dutta\footnote{The authors are with the Department
99: of Computational Biology, University of Southern California, Los Angeles
100: 90089. They can be contacted at ddutta@usc.edu, tingchen@usc.edu
101: respectively }\,
102: Ting Chen\addtocounter{footnote}{-1}\footnotemark\
103: }
104:
105: \maketitle
106:
107: \fi
108:
109: \normalsize
110:
111: \begin{small}
112: \begin{abstract}
113:
Mining large-scale, high-throughput tandem mass spectrometry data sets
is an important problem in mass spectrometry based protein
identification.
A fundamental challenge in large-scale mining of spectra is to
design metrics and algorithms that avoid all-pairs
comparisons of spectra. In this paper, we present a general framework
based on vector spaces that avoids pairwise comparisons.
We first robustly embed spectra in a high-dimensional space in a novel
fashion and then apply fast approximate near neighbor algorithms to
tasks such as constructing filters for database search, indexing, and
similarity searching. We formally prove that our embedding has low
distortion with respect to the cosine similarity. Combining the embedding
with locality sensitive hashing (LSH), we design filters for database search that
filter out more than 99\% of peptides (a 118-fold reduction)
while missing at most 0.29\%
of the correct sequences. We then show how our framework can be used
for similarity searching, which in turn can be used to detect tight
clusters or replicates. On average, for a cluster size of 16
spectra, LSH misses only 1 spectrum and admits only 1 false spectrum.
In addition, our framework, in conjunction with dimension reduction
techniques, allows us to visualize large datasets in 2D space. Our
framework also has the potential to embed and compare datasets with
post-translational modifications (PTMs).
145:
146:
147: \end{abstract}
148: \end{small}
149:
150: \section{Introduction}
151:
152:
153: %
154:
Proteomics aims to analyze the proteins and peptides expressed by the
dynamic biological processes within
cells~\cite{Pandey00,Aebersold00}. Proteins are responsible for many
inter- and intra-cellular activities, such as metabolism and cell
signaling, and are often modified after
translation within cells~\cite{Mann03NatBiot,Yates95}.
In the post-genomic era, one
of the most important problems is to characterize the {\em proteome},
i.e., the set of proteins within an organism.
168:
169:
170: %
171:
Tandem mass spectrometry is one of the most promising and widely used
high-throughput techniques for analyzing proteins and
peptides~\cite{Pandey00,Aebersold00}. It comprises two stages. A protein mixture
is enzymatically digested and separated by HPLC (High Performance
Liquid Chromatography) before being injected into a mass spectrometer through
a capillary. The peptides are then ionized
and their precursor ion masses, or
mass/charge ratios, are measured. This is the MS1
stage. The peaks (or ionized peptides) from the MS1 stage are
selected and further fragmented in a second stage, using techniques
such as Collision Induced Dissociation (CID), to yield the MS2 fragment
ions. Ideally, each peptide is cleaved into two parts: the N-terminal
ion (b-ion) represents the prefix, while the C-terminal ion (y-ion) represents the suffix.
This second stage is also known as the tandem MS or MS/MS
stage. For more details beyond this simplified description, the
reader is directed to the survey~\cite{Aebersold00}.
190:
191:
192: %
193:
There are two main approaches to analyzing tandem mass spectra.
The first, and the most widely used, is the
database search method~\cite{Keller02AC,Nsvski03,Zhang,Bafna01}.
Here, peptides from a sequence database are digested in silico and the
resulting virtual spectra are matched against (or scored with) the real
spectra. High-scoring peptides are typically chosen as the peptide
candidates. This method leads to a combinatorial explosion
when used to search for Post Translational Modifications
(PTMs)~\cite{Yates95}. The second, the de novo
method~\cite{Dancik99,Chen01,Ma}, reconstructs the sequence without the
help of a database.
Other approaches combine de novo sequencing and database search by
first generating sequence tags, or subsequences, and then using these
tags~\cite{pepnovo} as filters for database search with and
without PTMs~\cite{inspect}.
215:
216:
217: %
218:
The promise of tandem mass spectrometry has led research groups to
routinely use this method to probe proteomes.
A single run of a mass spectrometer can generate several thousand
spectra, and both the size and the number of real-life mass
spectra datasets are predicted to grow at an unprecedented rate, with
laboratories operating several spectrometers in parallel around the
clock. Thus, efficient mining of these large-scale mass spectra datasets
to obtain useful clues for biological discovery is an important
problem.
233:
234:
235: %
236:
Mining large collections of spectra poses several challenges, some of which
are presented below.
1) Indexing huge databases of mass spectra is not
standardized. Commonly used methods index by precursor ion mass, but this
approach has two main problems: i) precursor ion
masses can contain errors, and ii) many spectra (several thousands of them) may
have masses close to each other.
2) It is difficult to search for similar spectra on a large scale
quickly, i.e., in sublinear time. This is a core function used by several
data mining applications.
3) Clustering large databases of spectra is a daunting task. Most
similarity measures proposed in tandem mass spectrometry are pairwise.
Such pairwise methods lead to an explosion of
similarity calculations, i.e., $O(n^2)$ for a set of $n$ spectra. Thus,
a key open problem is to find methods that avoid the pairwise
similarity calculations. If objects can be embedded in metric
spaces, problems such as similarity searching and clustering become
easier. Thus we need methods to robustly embed spectra in
metric spaces.
4) Visualization of large groups of mass spectra is an important
problem; a good visualization can also be used to qualitatively identify
outliers among the huge number of spectra produced.
267:
268:
269: %
270:
271:
In this paper, we present a general framework for large-scale mining of
tandem mass spectra. Our main contributions are the following:
1) We robustly embed spectra into a metric space.
2) We show, both formally and empirically, that distances under our
embedding are as good as those based on the well-known cosine similarity.
3) We apply a geometric fast near neighbor search technique,
Locality Sensitive Hashing (LSH)~\cite{datar04}, to solve several
problems such as fast filtering for database search, similarity
searching of mass spectra, and visualization of large spectral
databases.
4) Our embedding, in conjunction with PCA and manifold learning, can be
used to visualize large groups of spectra.
5) Our embedding holds promise for comparing spectra with Post
Translational Modifications (PTMs).
291:
292:
Our idea of robustly embedding spectra into vector spaces in order to mine them is
novel. Previous work embedded spectra into vector spaces using vectors
of amino acid counts for database search~\cite{Halligan04,Halligan05}.
That work focused on clustering sequence databases based on these amino
acid counts in order to search for mass spectra, given amino acid counts or
sequence tags. However, obtaining an accurate estimate of amino acid
composition is itself a hard problem, especially when the quality of
the spectra is not high. In contrast, our method embeds the ion fragments of
spectra directly into a vector space and avoids estimating higher-level
features such as amino acid composition. Our scheme is also more
general: using a single embedding, we can either compare spectra with
each other or compare spectra with peptide sequences by generating
their virtual, or in-silico digested, spectra.
In addition, we demonstrate that our framework can be used in concrete
mining applications. We first use our embedding along with Locality
Sensitive Hashing to speed up database search. We demonstrate that we
can filter out more than 99.152\% of the candidate spectra with a false negative rate
of 0.29\%. The average query time for a spectrum is 0.21s.
Then, we answer similarity queries and find replicates or tight
clusters. For clusters with an average size of 16 spectra, LSH misses
an average of 1 spectrum per cluster while admitting only 1 false
spectrum.
315: %
316: %
317: %
318: %
319: %
320: %
321: %
322: %
323:
324:
To the best of our knowledge, no other work robustly
embeds spectra in metric spaces with provable guarantees and then uses fast
approximate near neighbor techniques to solve mass spectrometry data mining
problems.
329:
330:
331: \section{Methods}
332:
333:
Our approach is to use vector spaces, which have been successful in
numerous data mining applications including web searching {\bf cite web
mining}. Several fast mining algorithms become simpler to design in
these spaces than in non-metric spaces,
e.g., spaces where the only available measure is a pairwise similarity
measure. Thus, the key problem in this approach is to robustly embed
spectra into a high-dimensional metric space and define appropriate
distances. These distances must also be correlated with the well-known
cosine similarities. In other words, we desire an embedding with
bounded distortion with respect to the cosine similarity.
344:
345:
346: \subsection{Embedding Spectra}
347:
348:
349: \subsection*{Noise Removal}
350:
351:
\begin{figure}
\includegraphics[width=\linewidth,height=3in]{figs/snr.ps}
\caption{Signal and noise distributions of peak intensities in different
regions of spectra (from the training set).}
\label{fig:SNRdist}
\end{figure}
357:
The Achilles' heel of tandem mass spectra analysis is the amount of
noise in the mass spectra. In fact, most peaks (around 80\%) cannot be
explained and are called {\em noise} peaks. {\em Signal} peaks
(such as $b$ and $y$ ions) are useful for interpretation. As a first step,
we remove noise peaks, thereby improving the signal-to-noise ratio (SNR).
363:
364:
We use a statistical method to increase the SNR. We first find the
intensity distributions of signal and noise peaks in a set of
annotated spectra. For this, we consider a set of good-quality
annotated spectra, as described in Section~\ref{sec:results}, and
generate the virtual spectrum $v_p$ for each real spectrum $r_p$
of a peptide $p$. For the virtual spectrum generation we consider
the following ions: $b$, $b-H_2O$, $b-NH_3$, $y$, $y-H_2O$, $y-NH_3$.
Then we divide the mass range of $r_p$ into $k=10$ sections (regions). Within each
region, we assign each real peak its intensity rank,
i.e., the most intense peak has rank 0, and so on. We divide the peaks
of $r_p$ into two sets, $S_p$ and $N_p$. $S_p$ contains all those
peaks, and their intensity ranks, that have a match in the virtual
spectrum $v_p$; $N_p$ contains the remaining peaks. Thus, for each region,
we obtain a distribution of
signal and noise intensity ranks, as shown in
Figure~\ref{fig:SNRdist}.
380:
381:
We define the signal-to-noise ratio (SNR) of a peak $(mz_j,I_j)$ as follows:
\[
SNR(j)\ =\ \frac{P[\mbox{rank}(j)|(mz_j,I_j)\in S_p]}{P[\mbox{rank}(j)|(mz_j,I_j)\in N_p]}
\]
The larger the SNR, the more likely the peak is a useful signal peak; a peak with a small SNR is likely to be noise.
From Figure~\ref{fig:SNRdist} we can conclude
that the SNR is very poor at the ends of the spectra, i.e., in the low-mass
and high-mass regions. This statistical observation reinforces
the mass spectrometry folklore that the {\em middle region} is the most
suitable for finding signal peaks.
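
As a purely illustrative calculation (the probabilities below are hypothetical and are not taken from the training set), suppose that in some mass region a peak of intensity rank 2 occurs with probability $0.12$ among signal peaks and with probability $0.03$ among noise peaks. Then
\[
SNR(j)\ =\ \frac{0.12}{0.03}\ =\ 4 ,
\]
so a peak of that rank is four times as likely under the signal distribution as under the noise distribution. Under the assumed convention that peaks with $SNR(j)>1$ are kept, such a peak would be retained, whereas a rank with, say, probabilities $0.02$ and $0.08$ would give $SNR(j)=0.25$ and the peak would be discarded.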
392:
393:
394: \subsection*{Features and Distances}
395:
There are several possible ways to embed tandem mass spectra into a
vector space that supports the most common operation, namely comparing two
spectra and finding similarities. For example, the cosine similarity
measure~\cite{Keller02AC} and its variants have been very
popular in recent papers. Unfortunately, the cosine measure does
not yield a metric embedding because the triangle inequality is
violated. Moreover, the cosine similarity leads to algorithms that
consider pairs of spectra. Clearly, such algorithms are difficult to
scale due to the $O(n^2)$ similarity calculations.
405:
406:
For metric embeddings, the design space is quite large.
A simple idea is to directly bin the peaks and use the intensities to
form a vector space. However, spectra from different datasets have
different intensities, and we would like a single embedding
that could potentially integrate multiple spectral databases.
We therefore use only the presence or absence of peaks, as described next.
415: %
416: %
417: %
418:
419:
420:
\begin{figure}
\begin{tabular}{c c}
\includegraphics[width=1.5in,height=1.5in]{figs/msmine-cube.ps} &
\includegraphics[width=1.5in,height=1.5in]{figs/msmine-circle.ps}\\
(i) & (ii)\\
\end{tabular}
\caption{(i) Embedding spectra in an $n$-dimensional cube. (ii)
A 2-dimensional example illustrating the correlation between the
Euclidean distance and the well-known cosine similarity.}
\label{fig:cube}
\end{figure}
431:
432:
433:
We first {\em clean} the spectra as described in the previous subsection.
Then we divide the entire mass range (from 0 to some maximum mass)
into discrete intervals of 2~Da. For each interval,
a bit is set to 1 if the cleaned spectrum
contains a peak in that interval, and to 0 otherwise. This embeds
each spectrum into a vertex of an $n$-dimensional hypercube, where $n$ is
the number of intervals. A 3D
version is shown in Figure~\ref{fig:cube}. Our feature vectors
are defined to be the {\em unit} vectors in the direction of the
corresponding vertices of the hypercube. Thus the space
of our embedding is the $n$-dimensional unit hypersphere.
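
As an illustration (the peak positions are hypothetical, and numbering the intervals so that interval $i$ covers $[2i,2i+2)$~Da is a convention adopted for this example only), consider a cleaned spectrum with peaks at $m/z$ values $113.1$, $114.0$ and $227.2$. The peaks fall into the intervals
\[
\lfloor 113.1/2 \rfloor = 56, \qquad \lfloor 114.0/2 \rfloor = 57, \qquad \lfloor 227.2/2 \rfloor = 113 ,
\]
so exactly $k=3$ bits are set. The corresponding feature vector $x$ has
\[
x_{56} = x_{57} = x_{113} = \frac{1}{\sqrt{3}}, \qquad x_i = 0 \ \mbox{ otherwise},
\]
and therefore $||x||^2 = 3\cdot\frac{1}{3} = 1$, i.e., $x$ lies on the unit hypersphere. Two peaks that fall into the same interval set the same bit and are counted once.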
444:
445:
446:
We define the spectral distance (dissimilarity) between spectra
$x$, $y$ as $D(x,y)=||x-y||$, the Euclidean distance between their feature
vectors. If the angle between two similar spectra $x$, $y$ is
$\theta$, then $\cos \theta$ is close to 1, i.e., $1-\cos \theta$ is
very small. Since $x$ and $y$ are unit vectors, $D(x,y)^2 = 2(1-\cos\theta)$,
so their Euclidean distance is small exactly when $1-\cos\theta$ is small. It is easy
to show that as $n$, the number of dimensions, increases, the minimum
angle for pairs of very similar spectra $x$, $y$ becomes
smaller. Thus, instead of calculating $1- \cos \theta$, we
calculate $D(x,y)$. The natural question that arises is the
{\em distortion} of our embedding. We now show that the distortion is
bounded in theory, and we later show empirically that the embedding
agrees closely with the cosine similarity.
460:
461:
462:
We now prove some properties of the embedding, starting with the following theorem.
\begin{thm}
The embedding discussed above defines a metric space.
\end{thm}
\begin{proof}
To show that our embedding defines a metric space, we need to prove three things:
1) $||x-y||=0$ iff $x=y$, 2) $||x-y||=||y-x||$, and 3) the distance measure
obeys the triangle inequality. All three properties hold
because our embedding uses Euclidean distances.
\end{proof}
474:
475:
We next show that the maximum Euclidean distance is bounded by $\sqrt{2}$.
\begin{lem}
The distance between the feature vectors of any two mass spectra is
bounded above by $\sqrt{2}$.
\end{lem}
\begin{proof}
Consider two spectra $x$ and $y$. We shall use the
names of the spectra and their feature vectors
interchangeably. According to our scheme, we first filter the noisy
peaks and generate the binary vector after binning. Now assume $x$ has
$k$ bits set to 1 and $y$ has $k'$ bits set to 1, and that $c$
of these bits are common to both. After normalization, each nonzero
coordinate of $x$ equals $\frac{1}{\sqrt{k}}$ and each nonzero coordinate of
$y$ equals $\frac{1}{\sqrt{k'}}$. Since $c$ bits are common, the number of
bits in which $x$ and $y$ differ is $(k-c)+(k'-c)$. When $k=k'$ the common
coordinates cancel exactly, and we have
\begin{small}
\begin{eqnarray}
||x-y|| & = & \sqrt{(k-c) \cdot \left( \frac{1}{\sqrt{k}}\right)^2\ +
\ (k'-c) \cdot \left( \frac{1}{\sqrt{k'}}\right)^2 }\\
& = & \sqrt{ 2 - c \cdot \left( \frac{1}{k} + \frac{1}{k'} \right) }
\end{eqnarray}
\end{small}
For $k\neq k'$ the common coordinates also contribute, and expanding
$||x-y||^2 = ||x||^2 + ||y||^2 - 2\, x\cdot y$ gives
$||x-y||^2 = 2 - \frac{2c}{\sqrt{kk'}}$. In either case, since $c\geq 0$,
the quantity under the square root is at most 2, so $||x-y||\leq\sqrt{2}$.
\end{proof}
498:
499:
Next we show that our embedding has bounded distortion with respect
to the well-known cosine similarity. We have the following theorem:
\begin{thm}
If $\theta$ is the angle between the feature vectors of two distinct spectra $x$,
$y$, and the number of ones in each of the vectors after binning is
the same, then
$0<\frac{1-\cos{\theta}}{||x-y||}\leq\frac{1}{\sqrt{2}}$. In other
words, the distortion between our Euclidean embedding and the cosine
similarity is bounded.
\end{thm}
\begin{proof}
Assume $k=k'$. Then, as in the previous lemma,
$
||x-y|| = \sqrt{ 2 - c \cdot \left( \frac{1}{k} + \frac{1}{k'} \right) }.
$
The cosine of the angle $\theta$ between $x$, $y$ can be written as
$\cos\theta = \frac{c}{\sqrt{kk'}}$.
Thus, we have
\begin{eqnarray}
\frac { 1 - \cos\theta } {||x-y||} & = &
\frac { 1\ -\ \frac{c}{\sqrt{kk'}} }
{ \sqrt{ 2 - c \cdot \left( \frac{1}{k} + \frac{1}{k'} \right) } } \\
& = & \frac{1-\frac{c}{k}}{\sqrt{2-\frac{2c}{k}}}\\
& = & \frac{1}{\sqrt{2}} \sqrt{1-\frac{c}{k}}
\end{eqnarray}
Since $x\neq y$ and $k=k'$, we have $0\leq \frac{c}{k}< 1$, hence
$0< 1- \frac{c}{k}\leq 1$, and the theorem follows.
\end{proof}
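
As a quick numerical check of this bound (the values of $k$ and $c$ are hypothetical and chosen only for illustration), take $k=k'=20$ set bits and $c=18$ common bits. Then
\[
\cos\theta = \frac{18}{20} = 0.9, \qquad
||x-y|| = \sqrt{2 - 18\left(\frac{1}{20}+\frac{1}{20}\right)} = \sqrt{0.2} \approx 0.447 ,
\]
\[
\frac{1-\cos\theta}{||x-y||} \approx \frac{0.1}{0.447} \approx 0.224
= \frac{1}{\sqrt{2}}\sqrt{1-\frac{18}{20}} ,
\]
which is well below $\frac{1}{\sqrt{2}}\approx 0.707$. The ratio approaches $\frac{1}{\sqrt{2}}$ only as $c$ approaches 0, i.e., for spectra that share almost no peaks.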
528:
529:
Thus our embedding performs almost as well as the standard cosine
measure. We show in the next section that this is indeed the case
empirically. Moreover, since the points lie in a Euclidean space, we can
use elegant geometric techniques that yield fast approximate algorithms
for mining the data.
535:
\subsection{Fast Near Neighbor Search}
537:
538:
The ability to calculate distances, as opposed to cosines, is an
important feature of our framework. It allows us to apply
near neighbor algorithms that answer queries quickly but
approximately. The basic query
primitive we use is the following:

\noindent{Primitive 1}: Given a spectrum $x$ and a set of spectra $S$,
find the set $S_r \subseteq S$ of all spectra that are similar to $x$,
i.e., $y\in S_r$ iff $D(x,y)<r_q$, where $D$ is the Euclidean
distance and $r_q$ is the query radius.

A very simple approach would be to do a linear scan of the database
and output every spectrum $y$ such that $D(x,y)<r_q$. This takes
$O(n)$ time per query, where $n=|S|$. However, if $S$ is very large and the number
of queries is also $O(n)$, we obtain an $O(n^2)$ algorithm, which is
clearly unacceptable for our problem. Thus, we desire methods that
answer near neighbor queries in {\em sub-linear} time. For this we
are willing to trade off some accuracy for speed. Several sub-linear
near neighbor methods exist, but we leverage Locality Sensitive
Hashing~\cite{datar04} since, unlike others, it offers provable
guarantees and is also easy to implement. We briefly present the idea
below.
564:
565:
566:
567: \subsection*{Locality Sensitive Hashing}
568:
The basic idea is to use a class of hash functions, built from random
projections, that are locality sensitive, i.e., if two points $p$ and $q$ are close
(small $|p-q|$), they hash to the same value with high
probability, and if they are far apart, they collide with small
probability.
574:
\noindent{Definition 1}: A family $H = \{h: S \rightarrow U \}$ is
called locality-sensitive if, for any point $q$, the function
\[
p(t) = \mbox{Pr}_H[h(q) = h(v) : |q-v| = t]
\]
is strictly decreasing in $t$. That
is, the probability of collision of points $q$ and $v$ decreases
with the distance between them.

\noindent{Definition 2}: A family $H=\{h:S\rightarrow U\}$ is called
$(r_1,r_2,p_1,p_2)$-sensitive for a distance function $D$ if for any $v,q \in S$,
we have
584: \begin{itemize}
585: \item if $v\in B(q,r_1)$ then $\mbox{Pr}[h(q)=h(v)]\geq p_1$
586: \item if $v\notin B(q,r_2)$ then $\mbox{Pr}[h(q)=h(v)]\leq p_2$
587: \end{itemize}
588: Here $B(q,r)$ represents a ball around point $q$ with a radius $r$.
589: Thus a good family of hash functions will try to {\em amplify}
590: the gap between $p_1$ and $p_2$.
591:
592:
Indyk et~al.~\cite{datar04} showed that $s$-stable distributions can be
used to construct such families of locality sensitive hash
functions. An $s$-stable distribution is defined as follows.

\noindent{Definition 3}: A distribution $D$ over $R$ is called {\em
$s$-stable} if there exists $s \geq 0$ such that for any $n$ real numbers $v_1,
\ldots, v_n$ and i.i.d.\ variables $X_1, \ldots, X_n$ with distribution $D$, the
random variable $\sum_i{v_i X_i}$ has the same distribution as the
variable $(\sum_i{|v_i|^s})^{\frac{1}{s}}X$, where $X$ is a random
variable with distribution $D$.
603:
604:
Consider a random vector $a$ of $n$ dimensions whose entries are drawn
i.i.d.\ from an $s$-stable distribution. For any two
$n$-dimensional vectors $p$, $q$, the difference between their projections,
$a \cdot p - a \cdot q$, is distributed as $|p-q|_s X$, where $X$ is a random
variable with an $s$-stable
distribution. We {\em chop} the real line into equal-width segments of
appropriate size and assign hash values to vectors based on which
segment they project onto. This construction can be shown to be locality
preserving.
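
For concreteness, the construction of~\cite{datar04} can be written as follows; the symbols $w$ (segment width) and $b$ (random offset) do not appear in the description above and are introduced here for illustration only. Each hash function has the form
\[
h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{w} \right\rfloor ,
\]
where the entries of $a$ are drawn independently from an $s$-stable distribution (the Gaussian distribution for $s=2$, the case relevant to a Euclidean embedding) and $b$ is drawn uniformly from $[0,w]$. Two vectors whose shifted projections fall into the same segment of width $w$ receive the same hash value, so nearby vectors collide with higher probability than distant ones.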
612:
613:
LSH has two tuning parameters (the symbols $k$ and $c$ below are unrelated
to the bit counts used in the previous subsection). Given a family $H$ of hash
functions as defined above, the LSH algorithm chooses $k$ of them and
concatenates them to amplify the gap between $p_1$ and $p_2$. Thus,
for a point $v$, $g(v)=(h_1(v),\ldots,h_k(v))$. In addition, $L$ such groups of
hash functions, $g_1,\ldots,g_L$, are chosen independently and uniformly at random
to reduce the error. During pre-processing, each
point $v$ is hashed by each of the $L$ functions and stored in the
bucket given by $g_i(v)$, for $i=1,\ldots,L$. For any query point $q$, all the
buckets $g_1(q),\ldots,g_L(q)$ are searched. For each point $x$ in these
buckets, if the distance between $q$ and $x$ is within the query
distance, we output $x$ as a near neighbor. Thus, the parameters
$k$ and $L$ are crucial. It has been shown~\cite{indyk99,datar04}
that $k=\log_{1/p_2}{n}$ and $L=n^\rho$, where
$\rho=\frac{\log{1/p_1}}{\log{1/p_2}}$, ensure the locality sensitive
properties. In Ref.~\cite{datar04}, the authors consider $\ell_2$ spaces
and bound $\rho$ above empirically by $\frac{1}{c}$, where $c$ is the
approximation factor, i.e., for a given radius $R$, the algorithm
returns points whose distance is within $c\times R$. The time
complexity of an LSH query has been shown to be $O(dn^\rho \log{n})$, where $d$
is the number of dimensions and $\rho$ is as defined above. Thus, if
we accept a coarse level of approximation, LSH can guarantee
sub-linear run times for geometric queries.
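
To give a feel for these formulas, consider the following purely illustrative values, which are not taken from our experiments: $n=10^6$ points, $p_1=0.9$ and $p_2=0.5$. Then
\[
k = \log_{1/p_2} n = \log_2 10^6 \approx 19.9, \qquad
\rho = \frac{\log (1/0.9)}{\log (1/0.5)} \approx 0.152, \qquad
L = n^{\rho} \approx 8.2 .
\]
After rounding up, this suggests about $k=20$ hash functions per group and $L=9$ groups, so a query probes about 9 buckets instead of scanning all $10^6$ points. In practice both parameters would be tuned on the data rather than taken directly from these expressions.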
636:
637:
638:
639:
640: \subsection{Similarity Searching}
641:
642:
Using our embedding and a fast near neighbor algorithm, we can
find spectra similar to a given query spectrum. The
key is to use an appropriate query radius $r$; we
show in the next section how it can be chosen.
If the radius is too large, the query may return a
large set of spectra, and if it is too small, it may not return any
neighbor.
650:
651:
If an appropriate query radius is chosen,
it is easy to find tight clusters using the following heuristic.

\noindent{ANN-cluster}: 1) Embed the spectra into a Euclidean space and
form the set $S$ of feature vectors. 2) Hash the feature vectors in $S$ using LSH. 3)
Choose some $k$ random spectra from $S$ and find their near neighbors (tight
clusters). Add each chosen spectrum and its neighbors to a set $C$. 4)
$S=S-C$. 5) Go to step 3 until $S$ is empty.
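
The heuristic can be sketched in pseudocode as follows. This is only one possible reading of the steps above: the names $\mathit{LSHQuery}$, $N_x$ and $r$ are introduced here for illustration, and we assume that each chosen spectrum is removed together with its neighbors so that the loop terminates.

\begin{algorithm}
\caption{ANN-cluster (illustrative sketch)}
\begin{algorithmic}[1]
\REQUIRE a set of spectra, a query radius $r$, a batch size $k$
\ENSURE a list of tight clusters
\STATE $S \leftarrow$ feature vectors of all spectra
\STATE build the LSH tables for $S$
\WHILE{$S \neq \emptyset$}
\STATE $C \leftarrow \emptyset$
\STATE choose up to $k$ spectra from $S$ at random
\FORALL{chosen spectra $x$}
\STATE $N_x \leftarrow \{x\} \cup (\mathit{LSHQuery}(x,r) \cap S)$
\STATE report $N_x$ as a tight cluster
\STATE $C \leftarrow C \cup N_x$
\ENDFOR
\STATE $S \leftarrow S - C$
\ENDWHILE
\end{algorithmic}
\end{algorithm}

Because every iteration removes at least the chosen spectra from $S$, the loop performs at most $|S|$ near neighbor queries in total.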
659:
660:
Another immediate application of our framework is finding outliers. To
check whether a spectrum is an outlier, we determine whether it has at most
1 or 2 neighbors. If the set of neighbors remains unchanged even when
the query radius is increased by $\delta$, the spectrum is indeed an outlier. Since
each near neighbor query takes sub-linear time with LSH, all outliers can be detected
in sub-quadratic time.
667:
668:
669:
670:
\subsection{Speeding Up Database Search}



In this section, we discuss a sample application of our mining
framework. Database search is the primary tandem mass spectrometry
data mining application. Given a query spectrum $x$ and a mass
spectra database $MSDB$ (described in Section~\ref{sec:results}), the
problem is to find out which peptide $p\in MSDB$ corresponds to $x$.
680:
681:
Database search is a well-explored topic; see~\cite{Wan05} for
example. Most tools index the MSDB by peptide mass. For a query
spectrum $x$, the precursor mass $m_x$ is found. Then all the spectra in
$S_p=\{y \mid y\in MSDB,\ |m_y-m_x|<\delta\}$ are compared with $x$,
where $\delta$ is some pre-defined mass tolerance. Each comparison
between the query spectrum and a candidate spectrum takes
time that depends on the scoring function used. We reduce the size of
$S_p$ by filtering out unrelated spectra, thereby speeding up the search. Our
goal is to avoid filtering out the true peptide for a
given spectrum while discarding most of the unrelated peptides.
692:
693:
We generate the virtual spectrum of each peptide
sequence in the database and embed these virtual spectra in the
Euclidean space as described above. For filtering, we choose an appropriate
threshold radius $r$ and query the LSH algorithm for all the
candidates within a ball of radius $r$ around the query spectrum. We define
the speedup as the number of peptides within the mass tolerance divided by
the number of candidates returned.
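
In symbols (the names $N_{\mathrm{tol}}$ and $N_{\mathrm{cand}}$ are introduced here only for illustration), if $N_{\mathrm{tol}}$ peptides fall within the mass tolerance and the filter returns $N_{\mathrm{cand}}$ candidates, then
\[
\mbox{speedup} = \frac{N_{\mathrm{tol}}}{N_{\mathrm{cand}}}, \qquad
\mbox{fraction filtered out} = 1 - \frac{N_{\mathrm{cand}}}{N_{\mathrm{tol}}} = 1 - \frac{1}{\mbox{speedup}} .
\]
For instance, a 118-fold reduction corresponds to filtering out $1 - \frac{1}{118} \approx 99.15\%$ of the peptides within the mass tolerance.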
701:
702:
703:
704:
705: \subsection{Visualization and Dimension Reduction}
706:
As mentioned earlier, visualizing thousands of spectra is a very hard
problem. We are not aware of any previous work that allows
visualization of large mass spectrometry data sets. Our embedding, followed
by dimension reduction, allows us to view spectra in a two- or three-dimensional
space. As a bonus, it allows us to qualitatively identify
outliers in the data set.
713:
714:
Once we have embedded the spectra in a Euclidean space, we can use
common dimensionality reduction techniques to visualize the
high-dimensional data. The most common linear method is
PCA~\cite{strang}. Recently, several non-linear methods for
dimensionality reduction have been proposed, the majority of them
exploiting the low-dimensional manifold structure of the dataset. In
this paper, we leverage one of these techniques, the Isomap method, to
project the high-dimensional data onto a 2D plane. Due to lack of space,
we do not describe the method here.
724:
725:
726:
727: \section{Experimental Results}
728: \label{sec:results}
729:
In this section, we describe the empirical evaluation of our embedding,
followed by some representative data mining tasks. Unless otherwise
stated, we use the dataset from Keller
et~al.~\cite{Keller02Data}. To compute statistics, we used a random 80\%
of the 1618 annotated spectra. The statistics
were independent of the exact choice of spectra. Note that our
techniques are unsupervised except for the selection of query radii.
Of these spectra, 1014 came from proteins digested with trypsin and were used for
the database search filter.
739:
740:
For the database search filters, we used a non-redundant protein sequence database
called MSDB, which is maintained by Imperial College London. The
release (20042301) has 1,454,651 protein sequences (around 550M amino
acids) from multiple organisms. Peptide sequences were generated by
in-silico digestion, and the list of peptides was grouped into
different files by precursor ion mass, with a separate file for
each 10~Da range.
748:
749:
750: \subsection{Empirical evaluation of the embedding}
751:
In this section, we critically analyze our embedding and different
distance metrics. For these analyses, we chose a set of 1014 curated
spectra of proteins digested with trypsin and reported by Keller
et al. We cleaned the spectra and picked the peaks most likely to be
signal peaks. Then we constructed the binary bit vectors as discussed
earlier. For this set of spectra, we knew that there were 100-odd
clusters with 15 spectra per cluster on average. We calculate the
pairwise distances between spectra within the same cluster and term
this the similar set $SS$. We then choose a representative from each
cluster at random, calculate the pairwise distances between these
representatives, and call this set the
dissimilar set $DS$. Since $DS$ and $SS$ contain a similar number of
pairwise distances, we plot their frequency distributions together in
Figure~\ref{fig:inter} for three metrics: Hamming, 1-cosine and
Euclidean. Hamming distance is clearly unsuitable as a metric, as it
discriminates poorly between the two sets. As expected, 1-cosine and
Euclidean look very similar, with little overlap between $DS$ and $SS$.
Also note that the cosine metric used here is not exactly the one used by
others: we do not take the intensities into consideration after we
have selected the peaks.
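
The construction of the two sets can be sketched as follows. This is
an illustrative sketch only: the helper names are hypothetical, the
spectra are assumed to be non-zero binary tuples of equal length, and
the Euclidean distance is computed after scaling each vector to unit
length, which is an assumption of this example.

```python
import math
import random
from itertools import combinations


def hamming(u, v):
    """Number of positions where the two bit vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def cosine_distance(u, v):
    """1 - cosine similarity; intensities are ignored (bits only)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def unit_euclidean(u, v):
    """Euclidean distance after scaling both vectors to unit length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.sqrt(sum((a / norm_u - b / norm_v) ** 2 for a, b in zip(u, v)))


def similar_set(clusters, dist):
    """SS: all pairwise distances between members of the same cluster."""
    return [dist(u, v) for members in clusters
            for u, v in combinations(members, 2)]


def dissimilar_set(clusters, dist, rng):
    """DS: pairwise distances between one random representative per cluster."""
    reps = [rng.choice(members) for members in clusters]
    return [dist(u, v) for u, v in combinations(reps, 2)]


# Toy data: two clusters of bit vectors (hypothetical, for illustration).
clusters = [[(1, 1, 0, 0), (1, 1, 1, 0)], [(0, 0, 1, 1)]]
ss = similar_set(clusters, unit_euclidean)
ds = dissimilar_set(clusters, unit_euclidean, random.Random(0))
```

Histograms of the two returned lists, one pair per metric, correspond
to the kind of comparison shown in the figure.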
771:
772:
773:
\begin{figure}
\includegraphics[width=\linewidth,height=4in]{figs/interVSintraCluster.ps}
\caption{Distribution of distances between real spectra using different
metrics (Hamming, 1-cosine, Euclidean). The dotted curve shows the
inter-cluster distances, while the solid curve shows the
intra-cluster distances.}
\label{fig:inter}
\end{figure}
781:
782:
783: \iffalse
784:
785: In the previous case, we plotted distances between real spectra. Now
786: we plot distances between spectra and their corresponding virtual
787: spectra and compare with distances between spectra and virtual spectra
788: generated from totally dissimilar peptides in
789: Figure~\ref{fig:trueVSfalse}. Note that even in this case, there is
790: very good separation (with < 5\% overlap).
791:
792:
793:
794:
795: \begin{figure}\label{fig:trueVSfalse}
796: \includegraphics[width=\linewidth,height=4in]{figs/compareRealVirtualDistance.ps}
797: \caption{Distribution of distance between real and virtual spectra using
798: different metric. The dotten curve represents the distance between
799: real spectra and distances to virtual spectra from different peptides.
800: The other curve shows the distribution of distances between spectra
801: and the virtual spectra from the true peptides.}
802: \end{figure}
803: \fi
804:
Now, we consider the database of tryptic peptides, MSDB. For each
peptide, we generate its virtual spectrum and then construct the
feature vector as above. For each real spectrum, we calculate the
distance to the correct virtual spectrum, and we call this set of
scores $SS$. Then we choose, from the database, 100 random
peptides having almost the same mass as the precursor ion mass of the
given spectrum, and add the distances to their virtual spectra to the
dissimilar set $DS$. We then plot the probability distributions of
$SS$ and $DS$ in
Figure~\ref{fig:trueVSfalseDB}. Again we see a clear separation
between the two sets of distances (with $<1\%$ overlap). This
indicates that Euclidean distance in our embedded
space is a good metric for designing filters for database search. Note the
sharp impulse at 1.414 (approximately $\sqrt{2}$), corresponding to
distances between real spectra
and completely dissimilar peptides within a mass tolerance of 2~Da,
which provides empirical evidence for Lemma 2.2.
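
The position of this impulse can be checked with a short calculation.
As a sketch, assume that each feature vector is scaled to unit length
(an assumption made here for illustration; it is consistent with the
observed value). Then, for two such vectors $x$ and $y$ with angle
$\theta_{xy}$ between them,
\[
\|x-y\|^2 \;=\; \|x\|^2 + \|y\|^2 - 2\,x\cdot y \;=\; 2\,(1-\cos\theta_{xy}).
\]
If the two spectra share no peaks, then $x\cdot y = 0$ and
$\|x-y\| = \sqrt{2} \approx 1.414$. Under the same assumption, the
Euclidean distance is a monotone function of the 1-cosine distance,
which is in line with the similar shapes of the two distributions in
Figure~\ref{fig:inter}.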
820:
821:
822:
\begin{figure}
\includegraphics[width=\linewidth,height=4in]{figs/realVStrueANDfalseVirtual.ps}
\caption{Distribution of distances between real and virtual spectra.
The dotted curve shows the distances between
real spectra and the virtual spectra of 100 different peptides
of similar precursor mass. The sequences are from MSDB.
The other curve shows the distribution of distances between real spectra
and the virtual spectra of the true peptides.}
\label{fig:trueVSfalseDB}
\end{figure}
832:
833:
834: \subsection{Post Translational Modifications}
835:
Now we present some very preliminary results on a set of spectra from
the PFTau protein. We picked 8 good quality spectra with known
phosphorylations. We wanted to study whether our metric can help
design filters that might work for PTM studies. From Figure~\ref{fig:ptm},
we note that distances between spectra and their PTM variants
are more likely to be classified as similar than as dissimilar.
This can be seen by comparing these distances with the distributions
in Figure~\ref{fig:inter}.
843:
\begin{figure}
\begin{small}
\begin{tabular}{ |c | c |}
\hline
Peptide and its PTM variant & Distance\\
\hline
R.LTQAPVPMPDLKNVK.S & 1.23\\
R.LTQAPVPMPDLK\#NVK.S & \\
R.HLSNVSSTGSIDMVDSPQLATLADEV & 1.27\\
R.HLSNVSST\textasciicircum{}GS\textasciicircum{}IDMVDS\textasciicircum{}PQLATLADEV & \\
R.TPSLPTPPTR.E & 0.98\\
R.TPSLPT*PPTR.E & \\
R.QEFEVMVMEDHAGTYGLGLGDR.K & 1.19\\
R.QEFEVMVMEDHAGT\textasciicircum{}YGLGLGDR.K & \\
\hline
\end{tabular}
\end{small}
\caption{Some sample distances between spectra and their PTM variants.
Note the low distances within each pair. Distances between spectra of
different peptides had mean $\mu=1.388$ and standard deviation
$\sigma=0.017$.}
\label{fig:ptm}
\label{tab:ptm}
\end{figure}
864:
865:
866: \subsection{Query processing using LSH}
867:
868: In this section, we quantify the accuracy of our framework for
869: similarity searching and clustering. As mentioned earlier, we use LSH
870: to answer queries with bounded errors in expected sub-linear time.
871:
872:
We first indexed the 1014 spectra using our embedding followed by LSH.
For each of the 1014 spectra, we queried LSH with a radius $r$, and we
varied $r$. We plot the
number of missed spectra, i.e., spectra in the cluster of
the query spectrum that were not reported, in
Figure~\ref{fig:LSH-misses} and the number of
false positives in Figure~\ref{fig:LSH-fpos}. As we increased the
radius, the number of misses decreased. This is expected, as increasing
the radius of the {\em query ball} increases the number of data
points that can be considered. As expected, the number of false
positives also increased with $r$. This indirectly demonstrates
the accuracy of any clustering algorithm based on LSH: we
miss an average of 1 spectrum within each cluster
while admitting only 1 false spectrum.
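
The two quantities plotted can be computed as in the following sketch.
It is an illustration under stated assumptions, not our
implementation: an exact range query stands in for the LSH query
(LSH may additionally miss points inside the ball), and the function
names and toy data are hypothetical.

```python
import math


def range_query(points, query_index, radius):
    """Exact stand-in for an LSH query: indices within radius of the query."""
    q = points[query_index]
    return {
        i for i, p in enumerate(points)
        if i != query_index and math.dist(p, q) <= radius
    }


def misses_and_false_positives(points, labels, radius):
    """Average number of misses and false positives per query."""
    total_missed = 0
    total_false = 0
    for qi in range(len(points)):
        reported = range_query(points, qi, radius)
        # Ground truth: all other points carrying the query's cluster label.
        same_cluster = {
            i for i, lab in enumerate(labels)
            if lab == labels[qi] and i != qi
        }
        total_missed += len(same_cluster - reported)  # in cluster, not reported
        total_false += len(reported - same_cluster)   # reported, not in cluster
    n = len(points)
    return total_missed / n, total_false / n


# Toy 2D data with two clusters (hypothetical values).
points = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.2, 0.0), (1.0, 0.0)]
labels = ["a", "a", "b", "b", "a"]
small_r = misses_and_false_positives(points, labels, 0.6)
large_r = misses_and_false_positives(points, labels, 2.1)
```

On this toy data, the smaller radius gives misses but no false
positives, and the larger radius gives the opposite, mirroring the
trend described above.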
886:
887:
888:
For $r$ between 1.0 and 1.1, the number of false positives is not very
high. This might be
important when we want to query for similar spectra in order to
generate consensus spectra. In such situations, it might be acceptable
to miss some bad quality spectra (distances to bad quality spectra
are usually higher). Also, consider situations where we would like to
coarsely partition the data set (e.g., for clustering). Then
we can afford a few false positives, but we cannot
miss any true positives. In such cases, we increase the radius to at most
1.25, as the likelihood of an intra-cluster distance being greater than
1.25 is low according to Figure~\ref{fig:inter}.
899:
900:
901:
\begin{figure}
\includegraphics[width=\linewidth,height=1.75in]{figs/LSH-misses.ps}
\caption{The average number of spectra that are present in the cluster
containing the query spectrum but are missed by LSH.}
\label{fig:LSH-misses}
\end{figure}
907:
908:
909:
\begin{figure}
\includegraphics[width=\linewidth,height=1.75in]{figs/LSH-fpos.ps}
\caption{The average number of spectra that are not present in the cluster
containing the query spectrum but are reported by LSH.}
\label{fig:LSH-fpos}
\end{figure}
915:
916:
917:
918:
919:
920: \subsection{Speeding up Database Search}
921:
922:
To test the efficacy of our framework in speeding up database search,
we first use our metric to filter out candidate spectra. Since our
distance calculation is much faster than the detailed scoring of two
spectra, we define speedup as the ratio of the total number of candidate
peptides within a mass tolerance of 2 daltons to the total number of
those peptides that are within a distance $\Delta$ of the query
spectrum. We then increase $\Delta$ and calculate
the number of true peptides missed in this filtering process. In
Figure~\ref{fig:speedup}, we plot the speedup on a logarithmic scale
against the miss percentage. This gives us the tradeoff between speedup
(or quality of filtering) and accuracy when using our framework. For a 2
dalton range, the number of peptides is around 100--200K. For a
set of around 100K peptides, LSH takes 0.21s on average to answer a query.
As we see from Figure~\ref{fig:speedup}, we can get an
average speedup of 118 if we allow 0.19\% misses.
This may be reasonable for
many applications. In fact, we found that our errors were due to low
quality spectra in our test dataset.
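
In symbols, the speedup can be written as follows. The notation
$C_q$ and $F_q(\Delta)$ is introduced only for this illustration, and
we write the ratio of totals over all queries; a per-query average
would be an alternative reading. Let $C_q$ be the set of candidate
peptides within the 2 dalton mass tolerance of query spectrum $q$, and
let $d$ be the distance in the embedded space. Then
\[
F_q(\Delta) = \{\, p \in C_q : d(q,p) \le \Delta \,\},
\qquad
\mathrm{speedup}(\Delta) = \frac{\sum_q |C_q|}{\sum_q |F_q(\Delta)|},
\]
and the miss percentage is the fraction of queries $q$ whose true
peptide is not in $F_q(\Delta)$. As a rough example, with about 100K
candidates and a speedup of 118, around $100{,}000/118 \approx 850$
peptides per query would remain for detailed scoring.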
941:
942:
\begin{figure}
\includegraphics[width=\linewidth,height=2in]{figs/speedup-cos.ps}
\caption{Filtering of spectra for database search: speedup (logarithmic
scale) against the percentage of true peptides missed.}
\label{fig:speedup}
\end{figure}
947:
948:
949:
950: \subsection{Visualization and Dimension Reduction}
951:
952:
Consider the training dataset of mass spectra. We first generate
Euclidean feature vectors for each spectrum. We then apply PCA and
plot the first two components on the x-axis and the y-axis, as shown
in Figure~\ref{fig:pca}(i). The clusters are visible, and so are the
outliers, but the visualization is coarse grained.


We then apply Isomap to the same dataset. Recall that Isomap
first needs to compute the near neighbors of each point. Thus, in our
plot, we also
show the near neighbor graph along with the projected points, as shown
in Figure~\ref{fig:pca}(ii). The cluster structure seems to be
qualitatively clearer than with PCA.
965:
966:
\begin{figure}
\begin{tabular}{c c}
\includegraphics[width=1.5in,height=1.75in]{figs/spectra-pca.ps} &
\includegraphics[width=1.5in,height=1.75in]{figs/spectra-isomap.ps}\\
(i) PCA & (ii) Isomap\\
\end{tabular}
\caption{Dimension reduction with (i) PCA and (ii) Isomap.}
\label{fig:pca}
\end{figure}
975:
976:
977: \section{Discussion}
978:
979:
The results in the previous section look promising. The clear
separation between the $DS$ and $SS$ sets in the comparison of metrics
was initially a surprise to us. One of the reasons for the good
result is the quality of the dataset: we first wanted to validate our
simple assumptions and claims on a dataset with reliable
interpretations. Since we first transform the spectra into binary bit
strings, we avoid the huge variations of density in spectra. The
signal to noise ratio pilot study also underscored the fact that we
need to study spectra by segmenting them. Note that one reason why we
obtained clear separations between $DS$ and $SS$ in all cases with our
embedding is that we avoided using precursor ion mass as a feature.
Even though it is fine to use the precursor mass as a coarse-grained
filter, using it as a feature would lead to less robust embeddings, as
such masses are
prone to errors due to isotope effects. Moreover, our theoretical results
would no longer hold.
995:
996:
997:
For LSH, the speed and the accuracy are quite satisfactory. However,
there are two implementation issues. First, our current indexing is memory
bound: we need a large amount of memory to index millions of mass
spectra. Even though this is possible with current 64-bit
machines, we need to design disk-based LSH schemes. We are working on
a large scale implementation of our framework based on such
techniques. The second issue is the choice of the number of bins and the
mass coverage. Increasing the number of bins leads to the curse of
dimensionality, which would slow down LSH and reduce the filtering
speedup. If we choose fine-grained bins with a lower maximum mass, our
embedding will result in a pseudo-metric space, as several different
spectra will now satisfy assumption one in Theorem 2.1.
1010:
1011:
1012:
1013:
1014: \section{Conclusions and Future Work}
1015:
In this paper, we showed that our embedding, combined with geometric
algorithms, provides a good framework for mining mass spectra. In
particular, we
have demonstrated, both theoretically and empirically, that our
embedding coupled with Euclidean distance performs as well as the well
known cosine similarity while providing the benefits of a
metric space and enabling the use of approximate sub-linear time near
neighbor techniques for data mining. Using this framework, we showed
how to do similarity searches and find tight clusters. We also
demonstrated that we can get two orders of magnitude of filtering for
database search. As an aside, we are also able to visualize large
datasets in two dimensions, qualitatively identifying the outliers.
1027:
1028:
This work is a first step in the direction of an integrated
framework for large scale mining of tandem mass spectra using simple
techniques from embeddings, vector spaces and computational
geometry. Several directions are being investigated at this point. The
main areas of investigation are (1) better embeddings that offer better
resolution for PTM spectra; (2) faster external database searching
algorithms that use embeddings; (3) more effective blind PTM searching
using embeddings; (4) large scale clustering and visualization of mass
spectrometry data; and (5) integrating data from different sources using
our embeddings.
1039:
1040:
We should note that several sections of the paper could be of
independent interest. For example, the
probabilistic cleaning of mass spectra needs to be explored in more
detail. Our embedding
promises to work across datasets, and this general method could be used
for integrated studies of other biological datasets, e.g., microarray
data sets.
1047:
1048:
1049: \iffalse
1050: \section{Acknowledgments}
1051:
1052: Debojyoti would like to thank Vidhya Navalpakkam for her invaluable
1053: help. The authors would like to thank
1054: Prof. Piotr Indyk who provided insight into the LSH algorithm
1055: and also provided the initial LSH code.
1056: Debojyoti would also like to thank Yunhu Wan and Lijuan Mo
1057: for extremely helpful discussions.
1058: \fi
1059:
1060: \bibliographystyle{plain}
1061: \bibliography{msms,lsh}
1062:
1063:
1064:
1065:
1066:
1067: \end{document}
1068:
1069: