q-bio0603002/arxiv.tex
1: \documentclass[11pt,twocolumn]{article}
2: 
3: \usepackage{fullpage}
4: \usepackage{times}
5: \usepackage{algorithm}
6: \usepackage{algorithmic}
7: \usepackage{amsthm}
8: \usepackage{graphicx}
9: 
10: \newcommand{\keyword}[1]{\texttt{#1}}
11: \newcommand{\latcom}[1]{\texttt{$\backslash$#1}}
12: 
13: \newtheorem{thm}{Theorem}[section]
14: \newtheorem{lem}{Lemma}[section]
15: \newtheorem{cor}{Corollary}[section]
16: %
17: \renewcommand{\baselinestretch}{.95}
18: \normalsize
19: 
20: \newcommand{\RelaxFloats}{
21:         \renewcommand{\topfraction}{0.9}
22:         \renewcommand{\floatpagefraction}{0.9}
23:         \renewcommand{\textfraction}{0.1}
24: }
25: 
26: 
27: \begin{document}
28: 
29: \RelaxFloats
30: 
31: \begin{titlepage}
32: 
33: \flushleft
34: 
35: 
36: \vspace{0.00in}
37: \parbox{6.5in}{\large \noindent
38:    Draft prepared for \textbf{arXiv}.
39: }
40: 
41: 
42: \vspace{0.10in}
43: \parbox{6.5in}{\large \noindent
44:    Manuscript information:
45:    {10} text pages,
46:    {2} figures,
47:    {3} tables.
48: }
49: 
50: 
51: 
52: \vspace{2.00in}
53: \parbox{6.5in}{\LARGE \centering
54:    Mining Mass Spectra: Metric Embeddings and \\
55:    Fast Near Neighbor Search
56: }
57: 
58: 
59: \vspace{0.5in}
60: \parbox{6.5in}{\large \centering
61:    Debojyoti Dutta,
62:    Ting Chen\footnotemark[1]
63: }
64: 
65: 
66: \vspace{0.5in}
67: \parbox{6.5in}{\large \centering
68:    Molecular and Computational Biology Program \\
69:    University of Southern California \\
70:    Los Angeles, CA 90089-2910
71: }
72: 
73: \vspace{0.5in}
74: \parbox{6.5in}{\large \centering
75:    \today
76: }
77: 
78: 
79: \footnotetext[1]{
80: To whom correspondence should be addressed.
81: Molecular and Computational Biology Program,
82: University of Southern California.
83: MCB 201, 1050 Childs Way,
84: Los Angeles, CA 90089-2910.
85: E-mail: ddutta@usc.edu,
86:        tingchen@usc.edu
87: Tel:   (213)740-2416,
88:       (213)740-2415.
89: Fax:   (213)740-8631.
90: }
91: \end{titlepage}
92: 
93: 
94: 
95: \iffalse
96: 
97: \title
98: \author{Debojyoti Dutta\footnote{The authors are with the Department
99: of Computational Biology, University of Southern California, Los Angeles 
100: 90089. They can be contacted at ddutta@usc.edu, tingchen@usc.edu 
101: respectively }\,
102: Ting Chen\addtocounter{footnote}{-1}\footnotemark\
103: }
104: 
105: \maketitle
106: 
107: \fi
108: 
109: \normalsize
110: 
111: \begin{small}
112: \begin{abstract}
113: 
114: Mining large-scale high-throughput tandem mass spectrometry data sets
115: is a very important problem in mass spectrometry based protein
116: identification.
117: %
118: %
119: %
120: One of the fundamental problems in large scale mining of spectra is to
121: design appropriate metrics and algorithms to avoid all-pair-wise
122: comparisons of spectra.  In this paper, we present a general framework
123: based on vector spaces to avoid pair-wise comparisons.
124: %
125: %
126: %
127: %
128: %
129: We first robustly embed spectra in a high dimensional space in a novel
130: fashion and then apply fast approximate near neighbor algorithms for
131: tasks such as constructing filters for database search, indexing and
132: similarity searching.  We formally prove that our embedding has low
133: distortion compared to the cosine similarity, and, along with locality
134: sensitive hashing (LSH), we design filters for database search that
135: can filter out more than 99.152\% of peptides (a 118-fold reduction)
136: while missing at most 0.29\%
137: of the correct sequences. We then show how our framework can be used
138: in similarity searching, which can then be used to detect tight
139: clusters or replicates. On average, for a cluster size of 16
140: spectra, LSH only misses 1 spectrum and admits only 1 false spectrum.
141: In addition, our framework in conjunction with dimension reduction
142: techniques allows us to visualize large datasets in 2D space. Our
143: framework also has the potential to embed and compare datasets with
144: post-translational modifications (PTMs).
145: 
146: 
147: \end{abstract}
148: \end{small}
149: 
150: \section{Introduction}
151: 
152: 
153: %
154: 
155: Proteomics aims to analyze proteins and peptides expressed by the
156: dynamic biological processes within
157: cells~\cite{Pandey00,Aebersold00}. Proteins are responsible for many
158: inter and intra-cellular activities such as metabolism and cell
159: signaling where proteins are often modified after
160: translation within cells~\cite{Mann03NatBiot,Yates95}. 
161: %
162: %
163: %
164: %
165: In the post-genomic era, one
166: of the most important problems is to characterize the {\em proteome},
167: i.e. the set of proteins within an organism.
168: 
169: 
170: %
171: 
172: Tandem mass spectrometry is one of the most promising and widely used
173: high throughput techniques to analyze proteins and peptides
174: \cite{Pandey00,Aebersold00}. It comprises two stages. A protein mixture 
175: is enzymatically digested and separated by HPLC (High Performance 
176: Liquid Chromatography) before being injected into a mass spectrometer through 
177: a capillary. The peptides are then ionized 
178: and their precursor ion masses, or 
179: mass/charge ratios,  are measured. This is the MS1
180: stage. The peaks (or ionized peptides) from the MS1 stage are
181: selected and further fragmented in a second stage using techniques
182: such as Collision Induced Dissociation (CID) to yield the MS2 fragment
183: ions. Ideally, each peptide gets cleaved into two parts. The N-terminal
184: ion (b-ion) represents the prefix while the C-terminal ion (y-ion) is the suffix.  
185: %
186: %
187: This stage is also known as the tandem MS or the MSMS
188: stage. For more details beyond this oversimplified description, the
189: reader is directed to the wonderful survey~\cite{Aebersold00}.
190: 
191: 
192: %
193: 
194: There are two main approaches to analyzing tandem mass spectra
195: data. First, and the most widely used, is the 
196: database search method~\cite{Keller02AC,Nsvski03,Zhang,Bafna01}.  
197: Here, peptides from a sequence database are digested in-silico and the
198: resultant virtual spectra are matched (or scored) with the real
199: spectra. High-scoring peptides are typically chosen as the peptide
200: candidates.  This method leads to a combinatorial explosion
201: when used to search for Post Translational Modifications
202: (PTMs)~\cite{Yates95}.  Second, the de novo
203: method \cite{Dancik99,Chen01,Ma} reconstructs the sequence without the
204: help of a database.  
205: %
206: %
207: %
208: %
209: %
210: %
211: Other approaches combine de novo sequencing and database search by 
212: first generating sequence tags, or subsequences, and then using these
213: tags~\cite{pepnovo} as filters for database search with and 
214: without PTMs~\cite{inspect}. 
215: 
216: 
217: %
218: 
219: The promise of tandem mass spectrometry has led research groups to
220: routinely use this method to probe the proteomes. 
221: %
222: %
223: %
224: %
225: %
226: A single run of a mass spectrometer can generate several thousands of
227: spectra, and the sheer size as well as the number of real life mass
228: spectra datasets is predicted to grow at an unprecedented rate with
229: laboratories operating several spectrometers in parallel, round the
230: clock. Thus, efficient mining of these large-scale mass spectra data
231: to obtain useful clues for biological discovery is a very important
232: problem.
233: 
234: 
235: %
236: 
237: Mining large spectral datasets poses several challenges, some of which 
238: are presented below. 
239: %
240: 1) Indexing huge databases of mass spectra is not
241: standardized. Commonly used methods index by precursor ion mass, but this
242: approach has two main problems: i) there can be errors in precursor ion
243: masses; ii) there may be many spectra (several thousands of them) that
244: have masses close to each other.
245: %
246: 2) It is difficult to search for similar spectra on a large scale
247: quickly, or in sublinear time. This is a core function used by several
248: data mining applications.
249: %
250: %
251: %
252: %
253: %
254: 3) Clustering large databases of spectra is a daunting task. Most
255: similarity measures proposed in tandem mass spectrometry are pairwise
256: measures. Such pairwise methods lead to an explosion of
257: similarity calculations, i.e. $O(n^2)$ for a set of $n$ spectra. Thus,
258: a key open problem is to find methods that avoid the pairwise
259: similarity calculations.  If objects can be transformed into metric
260: spaces, problems such as similarity searching and clustering become
261: easier.  Thus we need to find methods to robustly embed spectra in
262: metric spaces.
263: %
264: 4) Visualization of large groups of mass spectra is an important
265: problem which can also be used to qualitatively identify outliers in
266: the huge number of spectra produced.
267: 
268: 
269: %
270: 
271: 
272: In this paper, we present a general framework for large scale mining of 
273: tandem mass spectra. Our main contributions are the following: 
274: %
275: 1) We robustly embed spectra into a metric space.
276: %
277: 2) We show, both formally and empirically, that distances using our
278: embedding are as good as those that use the well known cosine method.
279: %
280: 3) We then apply a geometric fast near neighbor search technique,
281: Locality Sensitive Hashing (LSH)~\cite{datar04}, to solve several
282: problems such as fast filters for database search, similarity
283: searching of mass spectra, and visualization of large spectral
284: databases.
285: %
286: 4) Our embedding in conjunction with PCA and manifold learning can be
287: used to visualize large groups of spectra.
288: %
289: 5) Our embedding holds promise for comparing spectra with Post
290: Translational Modifications (PTM).
291: 
292: 
293: Our idea of robustly embedding spectra into vector spaces to mine mass spectra is
294: novel. Previous work embedded spectra into vector spaces using vectors
295: of amino acid counts for database search~\cite{Halligan04,Halligan05}.
296: It focused on clustering sequence databases based on these amino
297: acid counts to search for mass spectra, given amino acid counts or
298: sequence tags. However, getting an accurate estimate of amino acid
299: composition is itself a hard problem, especially when the quality of
300: spectra is not high. In contrast, our method embeds ion fragments of
301: spectra directly into a vector space and avoids estimating higher
302: level features such as amino acid composition. Also our scheme is more
303: general: using a single embedding, we can either compare spectra with
304: each other or compare spectra with peptide sequences by generating
305: their virtual, or in-silico digested, spectra.
306: In addition, we demonstrate that our framework can be used in concrete
307: mining applications.  We first use our embedding along with Locality
308: Sensitive Hashing to speed-up database search.  We demonstrate that we
309: can filter out more than 99.152\% of the spectra with a false negative rate
310: of 0.29\%. The average query time for a spectrum is 0.21s.
311: Then, we answer similarity queries and find replicates or tight
312: clusters.  For clusters with an average size of 16 spectra, LSH misses
313: an average of 1 spectrum per cluster while admitting only 1 false
314: spectrum.  
315: %
316: %
317: %
318: %
319: %
320: %
321: %
322: %
323: 
324: 
325: To the best of our knowledge, no other work robustly 
326: embeds spectra in metric spaces with provable guarantees and then uses fast 
327: approximate near neighbor techniques to solve mass spectrometry data mining 
328: problems. 
329: 
330: 
331: \section{Methods}
332: 
333: 
334: Our approach is to use vector spaces which have been successful in
335: numerous data mining applications including web searching{\bf cite web
336: mining}.  Several fast mining algorithms become simpler to design in
337: these spaces, compared to designing them in non-metric spaces,
338: e.g. spaces where the only available measure is a pairwise similarity
339: measure.  Thus, the key problem in this approach is to robustly embed
340: spectra into a high dimensional metric space and define appropriate
341: distances. Also, these distances must be correlated with the well
342: known cosine similarities. In other words, we desire an embedding with
343: bounded distortion with respect to the cosine similarity.
344: 
345: 
346: \subsection{Embedding Spectra}
347: 
348: 
349: \subsection*{Noise Removal}
350: 
351: 
352: \begin{figure}
353:     \includegraphics[width=\linewidth,height=3in]{figs/snr.ps}
354:     \caption{Signal and noise distributions of peak intensities in different 
355: regions of spectra (from the training set).}\label{fig:SNRdist}
356: \end{figure}
357: 
358: The Achilles' heel of tandem mass spectra analysis is the amount of
359: noise in the mass spectra. In fact, most peaks (around 80\%) cannot be
360: explained and are called {\em noise} peaks. {\em Signal} peaks
361: (such as $b, y$ ions) are useful for interpretation.  As a first step,
362: we remove noise peaks to improve the signal-to-noise ratio (SNR).
363: 
364: 
365: We use a statistical method to increase SNR.  We first find the
366: intensity distributions of signal and noise peaks in a set of
367: annotated spectra.  For this, we consider a set of good quality
368: annotated spectra as described in Section~\ref{sec:results} and
369: generate the virtual spectrum $v_p$ for each of the real spectra $r_p$
370: for a peptide $p$.  For the virtual spectrum generation we consider
371: the following ions: $b$, $b-H_2O$, $b-NH_3$, $y$, $y-H_2O$, $y-NH_3$.
372: Then we divide the mass range of $r_p$ into $k=10$ sections. For each
373: section, and for each real peak, we consider its intensity rank,
374: i.e. the most intense peak has rank 0 and so on.  We divide the peaks
375: of $r_p$ into two sets $S_p$ and $N_p$. $S_p$ contains all those
376: peaks, and their intensity ranks, which have a match in the virtual
377: spectrum $v_p$; $N_p$ contains the remaining peaks. Thus, we get a distribution of
378: signal and noise intensity ranks for each region, as shown in
379: Figure~\ref{fig:SNRdist}.
380: 
381: 
382: We define the SNR of a peak $(mz_j,I_j)$ as follows:
383: $$ 
384: SNR(j)\ =\ \frac{P[\mbox{rank}(j)|(mz_j,I_j)\in S_p]}{P[\mbox{rank}(j)|(mz_j,I_j)\in N_p]}
385: $$
386: If the SNR is large, the peak is likely to be a useful signal peak; otherwise it is likely to be a noise peak.  
387: From Figure~\ref{fig:SNRdist} we can conclude 
388: that the SNR is very poor at the ends of the spectra, i.e. at low mass
389: regions and high mass regions. This statistical observation reinforces
390: the mass spectrometry folklore that the {\em middle region} is the most 
391: suitable for finding signal peaks.
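
One way to estimate this ratio from the annotated training spectra is
sketched below. The counts $n_S$, $n_N$ and the region index $g$ are
notation introduced here for illustration; the description above does
not fix a particular estimator. For a peak $j$ that falls in region $g$
with intensity rank $\rho$,
$$
SNR(j)\ \approx\ 
\frac{n_S(g,\rho)\,/\,\sum_{\rho'} n_S(g,\rho')}
     {n_N(g,\rho)\,/\,\sum_{\rho'} n_N(g,\rho')}
$$
where $n_S(g,\rho)$ (respectively $n_N(g,\rho)$) is the number of
training peaks of rank $\rho$ in region $g$ that belong to some $S_p$
(respectively some $N_p$). A peak would then be kept when $SNR(j)$
exceeds a chosen threshold, for example 1; this threshold is likewise
an assumption and not a value taken from the experiments.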
392: 
393: 
394: \subsection*{Features and Distances}
395: 
396: There are several possible ways to embed tandem mass spectra into a
397: vector space that supports the most common operation of comparing two
398: spectra and finding similarities. For example, the cosine similarity
399: metric~\cite{Keller02AC} and its different variants have been very
400: popular in recent papers.  Unfortunately the cosine metric does
401: not yield a metric embedding because the triangle inequality is
402: violated.  Also the cosine similarity metric implies algorithms that
403: consider pairs of spectra. Clearly such algorithms are difficult to
404: scale due to the $O(n^2)$ number of similarity calculations.
405: 
406: 
407: For metric embeddings, the design space is quite large. 
408: %
409: %
410: %
411: A simple idea is to directly bin the peaks and use the intensities to
412: form a vector space. However spectra from different datasets have
413: different intensities and we would like to have a single embedding
414: that could potentially integrate multiple spectral databases.
415: %
416: %
417: %
418: 
419: 
420:  
421: \begin{figure}
422: \begin{tabular}{c c}
423: \includegraphics[width=1.5in,height=1.5in]{figs/msmine-cube.ps} &
424: \includegraphics[width=1.5in,height=1.5in]{figs/msmine-circle.ps}\\
425: (i) & (ii)\\
426: \end{tabular}
427: \caption{ (i) Embedding spectra in an $n$-dimensional cube, (ii) 
428: Using a 2-dimensional example to illustrate the correlation between the 
429: Euclidean distance and the well known cosine similarity }\label{fig:cube}
430: \end{figure}
431: 
432: 
433: 
434: We first {\em clean} spectra as mentioned in the previous subsection.
435: Then we divide the entire mass range (from 0 to some maximum range)
436: into discrete intervals of 2~Da.  For each interval of 2~Da,
437: a bit is set to 1 if the cleaned spectrum
438: contains a peak in that interval, else it is 0. This embeds
439: each spectrum onto a vertex of an $n$-dimensional hypercube. A 3D
440: version is shown in Figure~\ref{fig:cube}. Our feature vectors
441: are defined to be the {\em unit} vectors in the direction of the
442: corresponding vertices of the $n$-dimensional hypercube. Thus the space
443: of our embedding is an $n$-dimensional unit hyper-sphere.
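
In symbols, the construction above can be sketched as follows. The bin
convention $[2i,2i+2)$ and the example peak masses are assumptions
chosen for illustration only. Let $b\in\{0,1\}^n$ be the bit vector of
a cleaned spectrum, with $b_i=1$ if and only if the spectrum has a peak
with mass in $[2i,2i+2)$~Da, and let $k=\sum_i b_i\geq 1$ be the number
of occupied bins. The feature vector is
$$
x\ =\ \frac{b}{||b||}\ =\ \frac{b}{\sqrt{k}}
$$
so that each occupied bin contributes a component $\frac{1}{\sqrt{k}}$
and $||x||=1$. For instance, a cleaned spectrum with peaks at 101.2,
175.1 and 176.3~Da occupies bins 50, 87 and 88, so $k=3$ and
$x_{50}=x_{87}=x_{88}=\frac{1}{\sqrt{3}}$, with all other components
equal to zero.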
444: 
445: 
446: 
447: We define the spectral distance between spectra 
448: $x$, $y$, as $||x-y||$. If the angle between two similar spectra $x$, $y$ is
449: $\theta$, $\cos \theta$ will be close to 1, or $1-\cos \theta$ will be
450: very small. Since $x$, $y$ are unit vectors, their
451: Euclidean distance $D(x,y)$ will also be small: $D(x,y)^2 = 2(1-\cos
452: \theta)$, so $D$ is a monotonically increasing function of $1-\cos\theta$. It is easy
453: to show that as $n$ or the number of dimensions increases, the minimum
454: angle for pairs of very similar spectra $x$, $y$ becomes
455: smaller. Thus, instead of calculating $1- \cos \theta$, we 
456: calculate $D(x,y)$.  The natural question that arises is the
457: {\em distortion} of our embedding.  We will now show that it has
458: bounded distortion in theory, and we will later show that the accuracy
459: is empirically quite high in comparison with the cosine similarity.
460: 
461: 
462: 
463: We prove some properties of the embeddings. It is easy to show the following theorem:
464: \begin{thm}
465: The embedding discussed above defines a metric space. 
466: \end{thm}
467: \begin{proof}
468: The proof is very simple.
469: To show that our embedding defines a metric space, we need to prove three things:
470: 1) $||x-y||=0$ iff $x=y$, 2) $||x-y||=||y-x||$ and 3) the distance measure 
471: obeys the triangle inequality. These properties are trivial to prove in our case 
472: as our embedding uses Euclidean distances. 
473: \end{proof}
474: 
475: 
476: We then show that the maximum Euclidean distance is bounded by $\sqrt{2}$. 
477: \begin{lem}
478: The distance between the feature vectors of any two mass spectra is
479: bounded above by $\sqrt{2}$.
480: \end{lem}
481: \begin{proof}
482: Suppose there are two spectra $x$ and $y$. We shall use the
483: names of the spectra and their feature vectors
484: interchangeably. According to our scheme we first filter the noisy
485: peaks and generate the binary vector after binning. Now assume $x$ has
486: $k$ bits set to 1 and $y$ has $k'$ bits set to 1. Also assume that $c$
487: of the common bits are 1.  Then each nonzero component of $x$ is $\frac{1}{\sqrt{k}}$ and
488: each nonzero component of $y$ is $\frac{1}{\sqrt{k'}}$. Since $c$ bits are common, the number of
489: dissimilar bits between $x$ and $y$ is $(k-c)+(k'-c)$. We have the following, which is at most $\sqrt{2}$ because $c\geq 0$:
490: \begin{small}
491: \begin{eqnarray}
492: ||x-y|| &  = &  \sqrt{(k-c)\cdot\left( \frac{1}{\sqrt{k}}\right)^2\ +
493:   \ (k'-c)\cdot\left( \frac{1}{\sqrt{k'}}\right)^2 }\\
494:  &  = & \sqrt{ 2 -  c\cdot\left( \frac{1}{k} + \frac{1}{k'} \right) }
495: \end{eqnarray}
496: \end{small}
497: \end{proof}
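
As a cross-check, the same distance can be obtained from the inner
product, which also covers the case $k\neq k'$. This is a supplementary
sketch that uses only the definitions above. Since $x$ and $y$ are unit
vectors that share $c$ nonzero positions, $x\cdot y =
\frac{c}{\sqrt{kk'}}$ and
\begin{eqnarray*}
||x-y||^2 & = & ||x||^2 + ||y||^2 - 2\, x\cdot y \\
          & = & 2 - \frac{2c}{\sqrt{kk'}}.
\end{eqnarray*}
For $k=k'$ this coincides with
$2-c\left(\frac{1}{k}+\frac{1}{k'}\right)$. For $k\neq k'$ the two
expressions differ by
$c\left(\frac{1}{\sqrt{k}}-\frac{1}{\sqrt{k'}}\right)^2$, which is the
contribution of the $c$ common bits. In either case the distance is at
most $\sqrt{2}$ because $c\geq 0$.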
498: 
499: 
500: Next we show that our embedding has bounded distortion when we compare
501: with the well known cosine similarity.  We have the following theorem:
502: \begin{thm}
503: If $\theta$ is the angle made by the distinct feature vectors of spectra $x$,
504: $y$, and the number of ones in each of the vectors after binning is
505: the same, we must have
506: $0<\frac{1-\cos{\theta}}{||x-y||}\leq\frac{1}{\sqrt{2}}$. In other
507: words, the distortion between our Euclidean embedding and the cosine
508: similarity is bounded.
509: \end{thm}
510: \begin{proof}
511: As in the previous lemma, 
512: $
513: ||x-y|| = \sqrt{ 2 -  c\cdot\left( \frac{1}{k} + \frac{1}{k'} \right) }.
514: $
515: Now the cosine of the angle $\theta$ between $x$, $y$ can be written as 
516: $\cos\theta = \frac{c}{\sqrt{kk'}}$.
517: Assume $k=k'$ and note that $0\leq \frac{c}{k}< 1$ because $x\neq y$. Thus, we must have 
518: \begin{eqnarray}
519: \frac { 1 - \cos\theta } {||x-y||} &  =  & 
520: \frac { 1\ -\ \frac{c}{\sqrt{kk'}} }
521: { \sqrt{ 2 -  c\cdot\left( \frac{1}{k} + \frac{1}{k'} \right) } } \\
522:  & = & \frac{1-\frac{c}{k}}{\sqrt{2-\frac{2c}{k}}}\\
523:  & = & \frac{1}{\sqrt{2}} \sqrt{1-\frac{c}{k}}
524: \end{eqnarray}
525: We note that since $0\leq \frac{c}{k}< 1$, we must also have 
526:  $0< 1- \frac{c}{k}\leq 1$ and the theorem follows. 
527: \end{proof}
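
A small numerical instance may help to check the algebra; the values
$k=k'=4$ and $c=3$ are chosen here purely for illustration. In this
case $\cos\theta=\frac{3}{4}$ and
\begin{eqnarray*}
||x-y|| & = & \sqrt{2-3\cdot\left(\frac{1}{4}+\frac{1}{4}\right)}
         \ =\ \sqrt{0.5}\ \approx\ 0.707, \\
\frac{1-\cos\theta}{||x-y||} & \approx & \frac{0.25}{0.707}
         \ \approx\ 0.354,
\end{eqnarray*}
which agrees with the closed form
$\frac{1}{\sqrt{2}}\sqrt{1-\frac{3}{4}}\approx 0.354$ and lies between
$0$ and $\frac{1}{\sqrt{2}}\approx 0.707$.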
528: 
529: 
530: Thus our embedding will perform almost as well as the standard cosine
531: metric. We show in the next section that this is indeed the case,
532: empirically.  Also, since the points are in a Euclidean space, we can
533: use elegant geometric techniques that yield fast approximate algorithms
534: for mining the data.
535: 
536: \subsection{Similarity Searching} 
537: 
538: 
539: The ability to calculate distances as opposed to cosines is an
540: important feature of our framework. Now, we apply elegant 
541: near neighbor algorithms to answer queries quickly but
542: approximately, as we show in the paper.  The basic query
543: primitive we use is the following:
544: 
545: 
546: \noindent{Primitive 1}: Given a spectrum $x$ and a set of spectra $S$,
547: we want to find all the spectra $S_r$ that are similar to $x$,
548: i.e. spectrum $y\in S_r$, iff $D(x,y)<r_q$, where $D$ is the Euclidean
549: distance and the $r_q$ is a query radius.
550: 
551:  
552: 
553: A very simple approach would be to do a linear scan on the database
554: and output every spectrum $y$ such that $D(x,y)<r_q$. This takes
555: $O(n)$ time. However, if $S$ becomes very large and so does the number
556: of queries, say $O(n)$, then we have an $O(n^2)$ algorithm. This is
557: clearly unacceptable for our problem. Thus, we desire methods that
558: will answer near neighbor queries in {\em sub-linear} time. For this we
559: are willing to trade off some accuracy for speedup. Several sub-linear
560: near neighbor methods exist but we leverage Locality Sensitive
561: Hashing~\cite{datar04} since, unlike others, it promises bounded
562: guarantees and is also easy to implement.  We briefly present the idea
563: below.
564: 
565: 
566: 
567: \subsection*{Locality Sensitive Hashing}
568: 
569: The basic idea behind LSH is to use a class of hash functions
570: that are locality sensitive, i.e. if two points $(p, q)$ are close they
571: will have small $|p-q|$ and they will hash to the same value with high
572: probability. If they are far apart they should collide with small
573: probability.
574: 
575: \noindent{Definition 1}: A family $H = \{ h: S \rightarrow U \}$ is
576: called locality-sensitive, if for any point $q$, the function $$p(t) =
577: Pr_H[h(q) = h(v) : |q-v| = t]$$ is strictly decreasing in $t$. That
578: is, the probability of collision of points $q$ and $v$ is decreasing
579: with the distance between them.
580: 
581: \noindent{Definition 2}: A family $H=\{h:S\rightarrow U\}$ is called
582: $(r_1,r_2,p_1,p_2)$-sensitive for a distance function $D$ if for any $v,q \in S$,
583: we have
584: \begin{itemize}
585: \item if $v\in B(q,r_1)$ then $\mbox{Pr}[h(q)=h(v)]\geq p_1$
586: \item if $v\notin B(q,r_2)$ then $\mbox{Pr}[h(q)=h(v)]\leq p_2$
587: \end{itemize}
588: Here $B(q,r)$ represents a ball around point $q$ with a radius $r$.
589: Thus a good family of hash functions will try to {\em amplify}
590: the gap between $p_1$ and $p_2$.
591: 
592: 
593: Indyk et~al.~\cite{datar04} showed that s-stable distributions can be
594: used to construct such families of locality sensitive hash
595: functions. An s-stable distribution is defined as follows.
596: 
597: \noindent{Definition 3}: A distribution $D$ over $R$ is called {\em
598: s-stable}, if there exists $s\geq 0$ such that for any $n$ real numbers $v_1
599: \ldots v_n$ and i.i.d. variables $X_1 \ldots X_n$ with distribution $D$, the
600: random variable $\sum_i{v_i X_i}$ has the same distribution as the
601: variable $(\sum_i{|v_i|^s})^{\frac{1}{s}}X$, where $X$ is a random
602: variable with distribution $D$.
603: 
604: 
605: Consider a random vector $a$ of $n$ dimensions whose entries are drawn i.i.d. from an s-stable distribution. For any two
606: $n$-dimensional vectors $(p, q)$ the difference between their projections
607: $(a\cdot p - a\cdot q)$ is distributed as $||p-q||_s X$ where $X$ is a random variable with an s-stable
608: distribution. We {\em chop} the real line into equal width segments of
609: appropriate size and assign hash values to vectors based on which
610: segment they project onto. The above can be shown to be locality
611: preserving.
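
Concretely, the chopping step can be written as a hash function of the
following form, which is the usual construction for s-stable LSH. The
symbols $w$ (segment width) and $b$ (random offset) are notation
introduced here and their values are not specified in this section.
$$
h_{a,b}(v)\ =\ \left\lfloor \frac{a\cdot v + b}{w} \right\rfloor
$$
Here $a$ is the random vector described above (with Gaussian entries
for $s=2$, the Euclidean case used in our embedding), $w>0$ is the
segment width and $b$ is drawn uniformly from $[0,w]$. Two points whose
projections differ by much less than $w$ usually fall into the same
segment, while points that are far apart rarely do.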
612: 
613: 
614: There are two parameters to tune LSH. Given a family $H$ of hash
615: functions as defined above, the LSH algorithm chooses $k$ of them and
616: concatenates them to amplify the gap between $p_1$ and $p_2$. Thus,
617: for a point $v$, $g(v)=(h_1(v)...h_k(v))$. Also, $L$ such groups of
618: hash functions are chosen, independently and uniformly at random,
619: (i.e. $g_1...g_L$) to reduce the error.  During pre-processing, each
620: point $v$ is hashed by the $L$ functions and stored in the
621: bucket given by each of $g_i(v)$. For any query point $q$, all the
622: buckets $g_1(q)...g_L(q)$ are searched. For each point $x$ in the
623: buckets, if the distance between $q$ and $x$ is within the query
624: distance, we output it as a near neighbor. Thus, the parameters
625: $k$ and $L$ are crucial.  It has been shown~\cite{indyk99,datar04}
626: that $k=\log_{1/p_2}{n}$ and $L=n^\rho$, where
627: $\rho=\frac{\log{1/p_1}}{\log{1/p_2}}$, ensure locality sensitive
628: properties. In Ref.~\cite{datar04}, the authors consider $L_2$ spaces
629: and bound $\rho$ above empirically by $\frac{1}{c}$, $c$ being the
630: approximation guarantee, i.e. for a given radius $R$, the algorithm
631: returns points whose distance is within $c\times R$.  The time
632: complexity of LSH has been shown to be $O(dn^\rho \log{n})$, where $d$
633: is the number of dimensions and $\rho$ is as defined above.  Thus, if
634: we desire a coarse level of approximation, LSH can guarantee
635: sub-linear run times for geometric queries.
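
The choice of $k$ and $L$ above can be motivated by a short
calculation. This is a sketch of the standard amplification argument
under the $(r_1,r_2,p_1,p_2)$-sensitivity assumption of Definition 2,
and is not a new result. For a concatenated function
$g=(h_1,\ldots,h_k)$,
\begin{eqnarray*}
\mbox{Pr}[g(q)=g(v)] & \geq & p_1^k \quad \mbox{if } v\in B(q,r_1), \\
\mbox{Pr}[g(q)=g(v)] & \leq & p_2^k \quad \mbox{if } v\notin B(q,r_2).
\end{eqnarray*}
With $k=\log_{1/p_2}{n}$ we get $p_2^k=\frac{1}{n}$, so in expectation
at most one far point collides with $q$ in each table, while
$p_1^k=n^{-\rho}$. With $L=n^\rho$ independent tables, a point within
distance $r_1$ of $q$ collides with $q$ in at least one table with
probability at least
$$
1-\left(1-n^{-\rho}\right)^{n^{\rho}}\ \geq\ 1-\frac{1}{e}.
$$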
636: 
637: 
638: 
639: 
640: \subsection{Similarity Searching}
641: 
642: 
643: Using our embedding and a fast near neighbor algorithm, we can 
644: find spectra similar to a given query spectrum. The
645: key is to use the correct query radius $r$. We
646: show in the next section how this can be chosen. 
647: If we give too high a radius, it might yield a
648: large dataset and if the radius is too low, it might not yield any
649: neighbor.
650: 
651: 
652: If an appropriate query radius is chosen, 
653: it is easy to find tight clusters using the following heuristic:
654: \noindent{ANN-cluster}: 1) Embed spectra into a Euclidean space and
655: form the set $S$.  2) Hash the feature vectors, $S$, using LSH. 3)
656: Choose some $k$ random spectra from $S$ and find their near neighbors (tight
657: clusters). For each chosen spectrum, add it and its neighbors to a set $C$. 4)
658: $S=S-C$. 5) Go to step 3 till $S$ is empty.
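
The heuristic can be sketched in pseudocode as follows. This is one
reading of the steps above and not a definitive implementation; the
batch size $k$, the query radius $r$ and the decision to report each
queried spectrum together with its neighbors as one cluster are
assumptions made for illustration.

\begin{algorithmic}
\REQUIRE a set of spectra, a query radius $r$, a batch size $k$
\STATE $S \leftarrow$ feature vectors of all spectra
\STATE build the LSH tables over $S$
\WHILE{$S \neq \emptyset$}
  \STATE $C \leftarrow \emptyset$
  \STATE choose $\min(k,|S|)$ spectra from $S$ at random
  \FOR{each chosen spectrum $x$}
    \STATE $N_x \leftarrow \{y\in S : D(x,y)<r\}$, obtained by an LSH query
    \STATE report $\{x\}\cup N_x$ as a tight cluster
    \STATE $C \leftarrow C \cup \{x\} \cup N_x$
  \ENDFOR
  \STATE $S \leftarrow S - C$
\ENDWHILE
\end{algorithmic}

Each pass removes at least the chosen spectra from $S$, so the loop
terminates. Clusters reported within the same pass may overlap when two
chosen spectra are close to each other; merging such clusters is left
open in this sketch.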
659: 
660: 
661: Another immediate consequence of our framework is the ability to find outliers. To
662: check whether a spectrum is an outlier, we need to determine whether it has at most
663: 1 or 2 neighbors. If the neighbors remain unchanged even on increasing
664: the query radius by $\delta$, the spectrum is indeed an outlier.  Since
665: near neighbor queries take sub-linear time with LSH, outliers can be detected
666: in sub-quadratic time.
667: 
668: 
669: 
670: 
671: \subsection{Speeding up Database Search}
672: 
673: 
674: 
675: In this section, we discuss a sample application using our mining
676: framework.  Database search is the primary tandem mass spectrometry
677: data mining application. Given a query spectrum $x$, and a mass
678: spectra database $MSDB$ (described in Section~\ref{sec:results}), the
679: problem is to find out which peptide $p\in MSDB$ corresponds to $x$.
680: 
681: 
682: Database search is a well explored topic, see~\cite{Wan05} for
683: example. Most tools index the MSDB by the peptide mass. Then for a
684: spectrum $x$, the precursor mass $m_x$ is found. Then all the spectra
685: $S_p=\{y \mid y\in MSDB,\ |m_y-m_x|<\delta\}$ are compared with $x$,
686: where $\delta$ is some pre-defined mass tolerance.  Each comparison
687: operation between the query spectrum and the candidate spectrum takes
688: a while depending on the scoring function used.  We reduce the size of
689: $S_p$ by filtering the unrelated spectra, speeding up the search. We
690: aim to retain the true peptide for a
691: given spectrum while discarding most of the unrelated peptides.
692: 
693: 
694: We generate the virtual spectra from each peptide
695: sequence in the database, and then embed those virtual spectra in the
696: Euclidean space, as mentioned. Then for filtering, we choose an appropriate
697: threshold radius $r$ and query the LSH algorithm to yield all the
698: candidates within a ball of radius $r$. The ratio of the total number
699: of peptides within a mass tolerance to the number of
700: candidates returned is our speedup. 
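
In symbols, and with notation introduced here for illustration, let
$N_\delta$ be the number of peptides within the mass tolerance and
$N_r$ the number of candidates returned by the LSH query. Then
\begin{eqnarray*}
\mbox{speedup} & = & \frac{N_\delta}{N_r}, \\
\mbox{fraction filtered out} & = & 1-\frac{N_r}{N_\delta}
  \ =\ 1-\frac{1}{\mbox{speedup}}.
\end{eqnarray*}
For example, a 118-fold reduction corresponds to filtering out
$1-\frac{1}{118}\approx 99.15\%$ of the peptides.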
701: 
702: 
703: 
704: 
705: \subsection{Visualization and Dimension Reduction}
706: 
707: As mentioned earlier, visualizing thousands of spectra is a very hard
708: problem.  We are not aware of any previous work that allows us to
709: visualize large mass spectrometry data sets.  Our embedding followed
710: by dimension reduction allows us to view spectra in a two or three
711: dimensional space. As a bonus, it qualitatively allows us to identify
712: outliers in the data set.
713: 
714: 
715: Once we have embedded the spectra in a Euclidean space, we can use
716: some of the common techniques to visualize high dimensional data by
717: dimensionality reduction.  The most common linear method is to use
718: PCA~\cite{strang}. Recently, several non-linear methods for
719: dimensionality reduction have been discovered, the majority of them
720: exploiting the low dimensional manifold structure of the dataset.  In
721: this paper, we leverage one of these techniques, the Isomap method, to
722: project the high dimensional data on a 2D plane. Due to lack of space
723: we do not provide a description of the method.
724: 
725: 
726: 
727: \section{Experimental Results}
728: \label{sec:results}
729: 
730: In this section, we describe the empirical evaluation of our embedding
731: followed by some representative data mining tasks. Unless otherwise
732: stated we use the dataset from Keller
733: et~al.~\cite{Keller02Data}. For calculating statistics, we used 80\%
734: of the 1618 annotated spectra, chosen at random.  The statistics
735: were independent of the exact choices of the spectra.  Note that our
736: techniques are unsupervised except for the selection of query radii.
737: Out of this dataset, 1014 spectra came from proteins digested with trypsin and were used for
738: the database search filter.
739: 
740: 
For database search filters, we used MSDB, a non-redundant protein
sequence database maintained by Imperial College London.  The
release (20042301) has 1,454,651 protein sequences (around 550M amino
acids) from multiple organisms.  Peptide sequences were generated by
in-silico digestion, and the resulting peptides were grouped into
different files by their precursor ion mass, with a separate file for
each 10~Da range.
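
The grouping by precursor mass can be sketched as follows. This is a
minimal illustration and not the code used in the experiments: the
function names, the in-memory dictionary (standing in for the
per-range files) and the choice of half-open mass ranges are all
assumptions made for the example.

```python
import math
from collections import defaultdict

BIN_WIDTH_DA = 10.0  # assumed width of each mass range, in daltons


def bin_peptides_by_mass(peptides, bin_width=BIN_WIDTH_DA):
    """Group (sequence, mass) pairs by mass range.

    Returns a dict mapping a bin index to the list of sequences whose
    mass falls in [index * bin_width, (index + 1) * bin_width).
    """
    bins = defaultdict(list)
    for sequence, mass in peptides:
        bins[math.floor(mass / bin_width)].append(sequence)
    return dict(bins)


def candidate_bins(precursor_mass, tolerance, bin_width=BIN_WIDTH_DA):
    """Indices of the bins that overlap precursor_mass +/- tolerance."""
    low = math.floor((precursor_mass - tolerance) / bin_width)
    high = math.floor((precursor_mass + tolerance) / bin_width)
    return list(range(low, high + 1))
```

With such a layout, a query with a given precursor mass and tolerance
only needs to read the one or two bins returned by the hypothetical
\texttt{candidate\_bins} helper instead of the whole peptide list.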
748: 
749: 
750: \subsection{Empirical evaluation of the embedding}
751: 
In this section, we critically analyze our embedding and different
distance metrics. For these analyses, we chose a set of 1014 curated
spectra of proteins digested with trypsin and reported by Keller
et al.  We then cleaned the spectra and picked the peaks most likely
to be signal peaks. Then we constructed the binary bit vector as discussed
earlier. For this set of spectra, we knew that there were 100-odd
clusters with 15 spectra per cluster on average.  We calculate the
pairwise distances between spectra within the same cluster and we term
this the similar set $SS$.  We then choose a representative from each
cluster at random, calculate the pairwise distances between these
representatives, and call this set the dissimilar set $DS$.  Since
$DS$ and $SS$ contain a similar number of pairwise distances, we plot
their frequency distributions in
Figure~\ref{fig:inter} for three metrics: Hamming, 1-cosine and
Euclidean. It is clear that Hamming distance is unsuitable as a metric, as it
has low discriminability. As expected, 1-cosine and Euclidean look
very similar, with low overlap between the sets $DS$ and $SS$. Also note
that the cosine metric used here is not exactly the same as that used by
others: we do not take the intensities into consideration after we
have selected the peaks.
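
The three distances compared above can be sketched for binary peak
vectors as follows. This is a minimal illustration under stated
assumptions, not the code used in the experiments: the function names
are hypothetical, each vector is assumed to contain at least one peak,
and the Euclidean distance is taken after scaling each vector to unit
length, which is an assumption about the embedding.

```python
import math


def hamming(x, y):
    """Number of positions where the two bit vectors differ."""
    return sum(a != b for a, b in zip(x, y))


def one_minus_cosine(x, y):
    """1 - cosine similarity, ignoring peak intensities (0/1 entries)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def euclidean_unit(x, y):
    """Euclidean distance after scaling each vector to unit length."""
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return math.sqrt(sum((a / norm_x - b / norm_y) ** 2
                         for a, b in zip(x, y)))
```

Applying these functions to all pairs within a cluster gives the
distances in $SS$, and applying them to all pairs of cluster
representatives gives the distances in $DS$.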
771: 
772: 
773: 
\begin{figure}
    \includegraphics[width=\linewidth,height=4in]{figs/interVSintraCluster.ps}
    \caption{Distribution of scores with real spectra using different
metrics (Hamming, 1-cosine, Euclidean). The dotted curve plots the
inter-cluster distances while the solid curve plots the
intra-cluster distances.}
\label{fig:inter}
\end{figure}
781: 
782: 
783: \iffalse
784: 
785: In the previous case, we plotted distances between real spectra. Now
786: we plot distances between spectra and their corresponding virtual
787: spectra and compare with distances between spectra and virtual spectra
788: generated from totally dissimilar peptides in
789: Figure~\ref{fig:trueVSfalse}.  Note that even in this case, there is
790: very good separation (with < 5\% overlap).
791: 
792: 
793: 
794: 
795: \begin{figure}\label{fig:trueVSfalse}
796:     \includegraphics[width=\linewidth,height=4in]{figs/compareRealVirtualDistance.ps}
797:     \caption{Distribution of distance between real and virtual spectra using 
798: different metric. The dotten curve represents the distance between 
799: real spectra and distances to virtual spectra from different peptides. 
800: The other curve shows the distribution of distances between spectra 
801: and the virtual spectra from the true peptides.}
802: \end{figure}
803: \fi
804: 
Now, we consider the database of tryptic peptides, $MSDB$. For each
peptide, we generate its virtual spectrum and then construct the
feature vector as above.  For each real spectrum, we calculate the
distance to the correct virtual spectrum and we call this set of
scores $SS$. Then we choose, from the database, 100 random
peptides having almost the same mass as the precursor ion mass of the
given spectrum, and we add the resulting scores to the dissimilar set
$DS$. We then plot the probability distributions of $SS$ and $DS$ in
Figure~\ref{fig:trueVSfalseDB}. Again we can see the clear separation
between the two sets of distances (with $<1\%$ overlap).  This
indicates that Euclidean distance in our embedded
space is a good metric for designing filters for database search. Note the
sharp impulse at 1.414 corresponding to distances between real spectra
and completely dissimilar peptides within a mass tolerance of 2~Da,
providing empirical evidence for Lemma 2.2.
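
To see where the value 1.414 comes from, assume that the feature
vectors $x$ and $y$ are scaled to unit length; this normalization is
an assumption of the short derivation here and is not restated in
this section. Then
\[
  \|x - y\|^2 \;=\; \|x\|^2 + \|y\|^2 - 2\, x \cdot y \;=\; 2 - 2\cos\theta ,
\]
where $\theta$ is the angle between $x$ and $y$. Two spectra with no
peaks in common have $x \cdot y = 0$, so
$\|x - y\| = \sqrt{2} \approx 1.414$. Since the entries of the vectors
are non-negative, $x \cdot y \geq 0$, and this is also the largest
distance that two such vectors can attain.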
820: 
821:       
822: 
\begin{figure}
    \includegraphics[width=\linewidth,height=4in]{figs/realVStrueANDfalseVirtual.ps}
    \caption{Distribution of distances between real and virtual spectra using
different metrics. The dotted curve represents the distances between
real spectra and the virtual spectra of 100 different peptides
of similar precursor mass. The sequences are from MSDB.
The other curve shows the distribution of distances between spectra
and the virtual spectra of the true peptides.}
\label{fig:trueVSfalseDB}
\end{figure}
832: 
833: 
834: \subsection{Post Translational Modifications}
835: 
Now we present some very preliminary results on a set of spectra from
the PFTau protein. We picked 8 good quality spectra with known
phosphorylations. We wanted to study whether our metric can help
design filters that might work for PTM studies. From Figure~\ref{fig:ptm},
we note that distances between spectra and their PTM variants
have a higher likelihood of being classified as similar than dissimilar,
as is evident when these distances are compared with the distributions
in Figure~\ref{fig:inter}.
843: 
\begin{figure}
\begin{small}
\begin{tabular}{ |c | c |} 
\hline
R.LTQAPVPMPDLKNVK.S & 1.23\\ 
R.LTQAPVPMPDLK\# NVK.S & \\
R.HLSNVSSTGSIDMVDSPQLATLADEV & 1.27\\
R.HLSNVSST\textasciicircum{}GS\textasciicircum{}IDMVDS\textasciicircum{}PQLATLADEV & \\
R.TPSLPTPPTR.E & 0.98\\
R.TPSLPT*PPTR.E &  \\
R.QEFEVMVMEDHAGTYGLGLGDR.K & 1.19\\
R.QEFEVMVMEDHAGT\textasciicircum{}YGLGLGDR.K & \\
\hline
\end{tabular}
\end{small}
\caption{Some sample distances between spectra and their PTM variants.
Note the low distances within each pair. Distances between spectra of
different peptides had mean $\mu=1.388$ and standard deviation
$\sigma=0.017$.}
\label{fig:ptm}
\label{tab:ptm}
\end{figure}
864: 
865: 
866: \subsection{Query processing using LSH}
867: 
868: In this section, we quantify the accuracy of our framework for
869: similarity searching and clustering.  As mentioned earlier, we use LSH
870: to answer queries with bounded errors in expected sub-linear time.
871: 
872: 
We first indexed the 1014 spectra using our embedding followed by LSH.
For each of the 1014 spectra, we queried LSH with a radius $r$, which
we varied.  We plot the
number of missed spectra that were actually present in the cluster of
the query spectrum in Figure~\ref{fig:LSH-misses} and the number of
false positives in Figure~\ref{fig:LSH-fpos}.  As we increased the
radius, the number of misses decreased. This is expected, since a
larger {\em query ball} contains more of the data
points that can be considered. As expected, the number of false
positives also increased as $r$ increased. This indirectly demonstrates
the accuracy of any clustering algorithm based on LSH. We
miss an average of 1 spectrum within each cluster
while admitting only 1 false spectrum.
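
The two quantities plotted in these figures can be sketched as
follows. This is a minimal illustration and not the code used in the
experiments: an exact radius query stands in for the approximate LSH
query, and the function names and the cluster-label input are
assumptions made for the example.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def radius_query(points, query_index, r):
    """Exact stand-in for an LSH query: indices within distance r."""
    q = points[query_index]
    return {i for i, p in enumerate(points)
            if i != query_index and euclidean(p, q) <= r}


def average_misses_and_false_positives(points, labels, r):
    """Average, over all queries, of cluster members not returned (misses)
    and of returned points from other clusters (false positives)."""
    total_misses = 0
    total_false_pos = 0
    for i in range(len(points)):
        reported = radius_query(points, i, r)
        same_cluster = {j for j, lab in enumerate(labels)
                        if j != i and lab == labels[i]}
        total_misses += len(same_cluster - reported)
        total_false_pos += len(reported - same_cluster)
    n = len(points)
    return total_misses / n, total_false_pos / n
```

Sweeping $r$ over a range of values and recording both averages would
trace out curves of the kind described above; an approximate LSH
query may additionally miss some points that lie inside the query
ball.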
886: 
887: 
888: 
For $r$ between 1.0 and 1.1, the number of false positives is not very
high.  This might be
important when we want to query for similar spectra in order to
generate the consensus spectra. In such situations, it might be fine
to miss some bad quality spectra (distances to bad quality spectra
are usually higher). Also, consider situations where we would like to
coarsely partition the data set (e.g. for clustering).  Then,
we can afford to have a few false positives but we cannot
miss any true positives. In such cases we increase the radius to at most
1.25, as the likelihood of an intra-cluster distance being greater than
1.25 is low, from Figure~\ref{fig:inter}.
899: 
900: 
901: 
\begin{figure}
    \includegraphics[width=\linewidth,height=1.75in]{figs/LSH-misses.ps}
    \caption{The average number of spectra that are present in the cluster
containing the query spectrum but are missed by LSH.}
\label{fig:LSH-misses}
\end{figure}
907: 
908: 
909: 
\begin{figure}
    \includegraphics[width=\linewidth,height=1.75in]{figs/LSH-fpos.ps}
    \caption{The average number of spectra that are not present in the cluster
containing the query spectrum but are reported by LSH.}
\label{fig:LSH-fpos}
\end{figure}
915: 
916: 
917: 
918: 
919: 
920: \subsection{Speeding up Database Search}
921: 
922:  
To test the efficacy of our framework for speeding up database search,
we first use our metric to filter the candidate peptides. Since our
distance calculation is much faster than the detailed scoring of two
spectra, we define speedup as the ratio of the total number of candidate
peptides within a mass tolerance of 2 daltons to the number of those
peptides that also lie within a distance $\Delta$ of the query
spectrum. We then increase $\Delta$ and calculate
the number of true peptides missed in this filtering process.  In
Figure~\ref{fig:speedup} we plot the speedup on a logarithmic scale
against the miss percentage. This gives us the tradeoff between
speedup (or quality of filtering) and accuracy when using our
framework.  For a 2
dalton range, the number of peptides is around 100--200K. For a
peptide set of around 100K, LSH takes 0.21s on average to answer a query.
As we see from Figure~\ref{fig:speedup}, we can get an
average speedup of 118 if we allow 0.19\% misses.
This may be reasonable for
many applications. In fact, we found that our errors were due to low
quality spectra in our test dataset.
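
The speedup and miss percentage defined above can be sketched as
follows. This is a minimal illustration and not the code used in the
experiments: the function names are hypothetical, and a peptide is
assumed to pass the filter when its distance to the query is at most
$\Delta$.

```python
def filter_speedup(candidate_distances, delta):
    """Ratio of all mass-matched candidates to those within distance delta.

    candidate_distances: distances from the query spectrum to every
    candidate peptide already inside the precursor mass tolerance.
    Returns None when the filter keeps no candidate.
    """
    kept = sum(1 for d in candidate_distances if d <= delta)
    if kept == 0:
        return None
    return len(candidate_distances) / kept


def miss_percentage(true_peptide_distances, delta):
    """Percentage of queries whose true peptide lies beyond delta."""
    missed = sum(1 for d in true_peptide_distances if d > delta)
    return 100.0 * missed / len(true_peptide_distances)
```

Evaluating both functions over a range of $\Delta$ values gives one
point of the speedup versus miss-percentage curve per value of
$\Delta$.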
941: 
942: 
\begin{figure}
    \includegraphics[width=\linewidth,height=2in]{figs/speedup-cos.ps}
    \caption{Filtering of spectra for database search: speedup (on a
logarithmic scale) against the miss percentage.}
\label{fig:speedup}
\end{figure}
947: 
948: 
949: 
950: \subsection{Visualization and Dimension Reduction}
951: 
952: 
Consider the training dataset of mass spectra.  We first generate
a Euclidean feature vector for each spectrum.  Then we use PCA and
plot the first two components on the x-axis and the y-axis, as shown
in Figure~\ref{fig:pca}(i). The clusters are visible and so are the
outliers, but the visualization is coarse grained.
958: 
959: 
Then we use Isomap on the same dataset.  Recall that in Isomap, one
first needs to calculate the near neighbors. Thus in our plot, we also
show the near neighbor graph along with the projected points, as shown
in Figure~\ref{fig:pca}(ii).  The cluster structure seems to be
qualitatively clearer than with PCA.
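
The near neighbor graph that Isomap starts from can be sketched as
follows. This is a minimal illustration and not the code used in the
experiments: the function names are hypothetical, a brute-force
search stands in for a fast near neighbor method, and the use of a
fixed number $k$ of neighbors (rather than a fixed radius) is an
assumption.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_graph(points, k):
    """Undirected k-nearest-neighbor graph as a set of (i, j) edges, i < j."""
    edges = set()
    for i, p in enumerate(points):
        # Sort all other points by their distance to point i.
        others = sorted((euclidean(p, q), j)
                        for j, q in enumerate(points) if j != i)
        for _, j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges
```

Isomap then estimates geodesic distances as shortest paths in this
graph and applies classical multidimensional scaling to them to
obtain the low-dimensional coordinates.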
965: 
966: 
\begin{figure}
\begin{tabular}{c c}
\includegraphics[width=1.5in,height=1.75in]{figs/spectra-pca.ps} &
\includegraphics[width=1.5in,height=1.75in]{figs/spectra-isomap.ps}\\
(i) PCA & (ii) Isomap\\
\end{tabular}
    \caption{Dimension reduction with PCA and Isomap}
\label{fig:pca}
\end{figure}
975: 
976: 
977: \section{Discussion}
978: 
979: 
The results in the previous section look promising. The clear
separation between the $DS$ and $SS$ sets in the comparison of metrics
initially came as a surprise to us. One of the reasons for the good
result is the quality of the dataset. We first wanted to validate our
simple assumptions and claims on a dataset which had reliable
interpretations. Since we first transform the spectra into binary bit
strings, we avoid the huge variations of density in spectra.  The
signal to noise ratio pilot study also underscored the fact that we
need to study spectra by segmenting them. Note that one reason why we
obtained clear separations between $DS$ and $SS$ in all cases with our
embedding is that we avoided using precursor ion mass as a feature.
Even though it is fine to use the precursor mass as a coarser grain
filter, it would lead to less robust embeddings, as such masses are
prone to errors due to isotope effects. Moreover, our theoretical results
would no longer hold.
995: 
996: 
997: 
For LSH, the speed and the accuracy are quite satisfying.  However,
there are two implementation issues.  First, our current indexing is memory
bound. This means we need a large amount of memory to index millions of mass
spectra. Even though this is possible with current 64-bit
machines, we need to design disk-based LSH schemes.  We are working on
a large scale implementation of our framework based on such
techniques. The second issue is the choice of the number of bins and the
mass coverage. Increasing the number of bins leads us to the curse of
dimensionality, which would slow down LSH and reduce the filtering
speedup. If we choose fine grained bins with a lower maximum mass, our
embedding will result in a pseudo-metric space, as several different
spectra will now satisfy assumption one in Theorem 2.1.
1010: 
1011:  
1012: 
1013: 
1014: \section{Conclusions and Future Work}
1015: 
In this paper, we showed that our embedding with geometric algorithms
provides a good framework for mining mass spectra. In particular, we
have demonstrated, both theoretically and empirically, that our
embedding coupled with Euclidean distance performs as well as the well
known cosine similarity while providing us with the benefits of a
metric space and enabling us to use approximate sub-linear time near
neighbor techniques for data mining. Using this framework, we showed
how we can do similarity searches and find tight clusters. Also, we
demonstrated that we can get two orders of magnitude of filtering for
database search. As an aside, we are also able to visualize large
datasets in two dimensions, qualitatively identifying the outliers.
1027: 
1028: 
This work is the first step in the direction of an integrated
framework for large scale mining of tandem mass spectra using simple
techniques from embeddings, vector spaces and computational
geometry. Several directions are being investigated at this point. The
main areas of investigation are 1) better embeddings that offer better
resolution for PTM spectra, 2) faster external database searching
algorithms that use embeddings, 3) more effective blind PTM searching
using embeddings, 4) large scale clustering and visualization of mass
spectrometry data, and 5) integrating data from different sources using
our embeddings.
1039: 
1040: 
We should note that several sections in the paper could be of
independent interest. For example, we need to explore the
probabilistic cleaning of mass spectra in more detail.  Our embedding
promises to work across datasets, and this general method can be used
for integrated studies of other biological datasets, e.g. microarray
data sets.
1047:  
1048: 
1049: \iffalse
1050: \section{Acknowledgments}
1051: 
1052: Debojyoti would like to thank Vidhya Navalpakkam for her invaluable 
1053: help. The authors would like to thank 
1054: Prof. Piotr Indyk who provided insight into the LSH algorithm 
1055: and also provided the initial LSH code. 
1056: Debojyoti would also like to thank Yunhu Wan and Lijuan Mo 
1057: for extremely helpful discussions. 
1058: \fi
1059: 
1060: \bibliographystyle{plain}
1061: \bibliography{msms,lsh}
1062: 
1063: 
1064:  
1065: 
1066: 
1067: \end{document}
1068: 
1069: