0603:q-bio0603002/arxiv.tex

1: \documentclass[11pt,twocolumn]{article}

2:

3: \usepackage{fullpage}

4: \usepackage{times}

5: \usepackage{algorithm}

6: \usepackage{algorithmic}

7: \usepackage{amsthm}

8: \usepackage{graphicx}

9:

10: \newcommand{\keyword}[1]{\texttt{#1}}

11: \newcommand{\latcom}[1]{\texttt{$\backslash$#1}}

12:

13: \newtheorem{thm}{Theorem}[section]

14: \newtheorem{lem}{Lemma}[section]

15: \newtheorem{cor}{Corollary}[section]

16: %

17: \renewcommand{\baselinestretch}{.95}

18: \normalsize

19:

20: \newcommand{\RelaxFloats}{

21:         \renewcommand{\topfraction}{0.9}

22:         \renewcommand{\floatpagefraction}{0.9}

23:         \renewcommand{\textfraction}{0.1}

24: }

25:

26:

27: \begin{document}

28:

29: \RelaxFloats

30:

31: \begin{titlepage}

32:

33: \flushleft

34:

35:

36: \vspace{0.00in}

37: \parbox{6.5in}{\large \noindent

38:    Draft prepared for \textbf{arXiv}.

39: }

40:

41:

42: \vspace{0.10in}

43: \parbox{6.5in}{\large \noindent

44:    Manuscript information:

45:    {10} text pages,

46:    {2} figures,

47:    {3} tables.

48: }

49:

50:

51:

52: \vspace{2.00in}

53: \parbox{6.5in}{\LARGE \centering

54:    Mining Mass Spectra: Metric Embeddings and \\

55:    Fast Near Neighbor Search

56: }

57:

58:

59: \vspace{0.5in}

60: \parbox{6.5in}{\large \centering

61:    Debojyoti Dutta,

62:    Ting Chen\footnotemark[1]

63: }

64:

65:

66: \vspace{0.5in}

67: \parbox{6.5in}{\large \centering

68:    Molecular and Computational Biology Program \\

69:    University of Southern California \\

70:    Los Angeles, CA 90089-2910

71: }

72:

73: \vspace{0.5in}

74: \parbox{6.5in}{\large \centering

75:    \today

76: }

77:

78:

79: \footnotetext[1]{

80: To whom correspondence should be addressed.

81: Molecular and Computational Biology Program,

82: University of Southern California.

83: MCB 201, 1050 Childs Way,

84: Los Angeles, CA 90089-2910.

85: E-mail: ddutta@usc.edu,

86:        tingchen@usc.edu

87: Tel:   (213)740-2416,

88:       (213)740-2415.

89: Fax:   (213)740-8631.

90: }

91: \end{titlepage}

92:

93:

94:

95: \iffalse

96:

97: \title

98: \author{Debojyoti Dutta\footnote{The authors are with the Department

99: of Computational Biology, University of Southern California, Los Angeles

100: 90089. They can be contacted at ddutta@usc.edu, tingchen@usc.edu

101: respectively }\,

102: Ting Chen\addtocounter{footnote}{-1}\footnotemark\

103: }

104:

105: \maketitle

106:

107: \fi

108:

109: \normalsize

110:

111: \begin{small}

112: \begin{abstract}

113:

114: Mining large-scale high-throughput tandem mass spectrometry data sets

115: is a very important problem in mass spectrometry based protein

116: identification.

117: %

118: %

119: %

120: One of the fundamental problems in large scale mining of spectra is to

121: design appropriate metrics and algorithms to avoid all-pair-wise

122: comparisons of spectra.  In this paper, we present a general framework

123: based on vector spaces to avoid pair-wise comparisons.

124: %

125: %

126: %

127: %

128: %

129: We first robustly embed spectra in a high dimensional space in a novel

130: fashion and then apply fast approximate near neighbor algorithms for

131: tasks such as constructing filters for database search, indexing and

132: similarity searching.  We formally prove that our embedding has low

133: distortion compared to the cosine similarity, and, along with locality

134: sensitive hashing (LSH), we design filters for database search that

135: can filter out more than 99.152\% of peptides (a 118-fold reduction)

136: while missing at most 0.29\%

137: of the correct sequences. We then show how our framework can be used

138: in similarity searching, which can then be used to detect tight

139: clusters or replicates. On an average, for a cluster size of 16

140: spectra, LSH only misses 1 spectrum and admits only 1 false spectrum.

141: In addition, our framework in conjunction with dimension reduction

142: techniques allows us to visualize large datasets in 2D space. Our

143: framework also has the potential to embed and compare datasets with

144: post-translational modifications (PTMs).

145:

146:

147: \end{abstract}

148: \end{small}

149:

150: \section{Introduction}

151:

152:

153: %

154:

155: Proteomics aims to analyze proteins and peptides expressed by the

156: dynamic biological processes within

157: cells~\cite{Pandey00,Aebersold00}. Proteins are responsible for many

158: inter and intra-cellular activities such as metabolism and cell

159: signaling where proteins are often modified after

160: translation within cells~\cite{Mann03NatBiot,Yates95}.

161: %

162: %

163: %

164: %

165: In the post-genomic era, one

166: of the most important problems is to characterize the {\em proteome},

167: i.e. the set of proteins within an organism.

168:

169:

170: %

171:

172: Tandem mass spectrometry is one of the most promising and widely used

173: high throughput techniques to analyze proteins and peptides

174: \cite{Pandey00,Aebersold00}. It comprises two stages. A protein mixture

175: is enzymatically digested and separated by HPLC (High Performance

176: Liquid Chromatography) before being introduced into a mass spectrometer through

177: a capillary. Then the peptides get ionized

178: and their precursor ion masses, or

179: mass/charge ratios,  are measured. This is the MS1

180: stage. The peaks (or ionized peptides) from the MS1 stage are

181: selected and further fragmented in a second stage using techniques

182: such as Collision Induced Dissociation (CID) to yield the MS2 fragment

183: ions. Ideally, each peptide gets cleaved into two parts. The N-terminal

184: ion (b-ion) represents the prefix while the C-terminal ion (y-ion) is the suffix.

185: %

186: %

187: This stage is also known as the tandem MS or the MSMS

188: stage. For more details beyond this oversimplified description, the

189: reader is directed to the wonderful survey~\cite{Aebersold00}.

190:

191:

192: %

193:

194: There are two main approaches to analyzing tandem mass spectra

195: data. First, and the most widely used, is the

196: database search method~\cite{Keller02AC,Nsvski03,Zhang,Bafna01}.

197: Here, peptides from a sequence database are digested in-silico and the

198: resultant virtual spectra are matched (or scored) with the real

199: spectra. High-scoring peptides are typically chosen as the peptide

200: candidates.  This method leads to a combinatorial explosion

201: when used to search for Post Translational Modifications

202: (PTMs)~\cite{Yates95}.  Second, the de-novo

203: method \cite{Dancik99,Chen01,Ma} reconstructs the sequence without the

204: help of a database.

205: %

206: %

207: %

208: %

209: %

210: %

211: Other approaches combine de-novo sequencing and database search by

212: first generating sequence tags, or subsequences, and then using these

213: tags~\cite{pepnovo} as filters for database search with and

214: without PTMs~\cite{inspect}.

215:

216:

217: %

218:

219: The promise of tandem mass spectrometry has led research groups to

220: routinely use this method to probe the proteomes.

221: %

222: %

223: %

224: %

225: %

226: A single run of a mass spectrometer can generate several thousands of

227: spectra, and the sheer size as well as the number of real life mass

228: spectra datasets is predicted to grow at an unprecedented rate with

229: laboratories operating several spectrometers in parallel, round the

230: clock. Thus, efficient mining of these large-scale mass spectra data

231: to obtain useful clues for biological discovery is a very important

232: problem.

233:

234:

235: %

236:

237: Mining large spectral datasets poses several challenges, some of which

238: are presented below.

239: %

240: 1) Indexing huge databases of mass spectra is not

241: standardized. Commonly used methods index by precursor ion mass, but this

242: approach has two main problems: i) there can be errors in precursor ion

243: masses, and ii) there may be many spectra (several thousands of them) that

244: have masses close to each other.

245: %

246: 2) It is difficult to search for similar spectra on a large scale

247: quickly, or in sublinear time. This is a core function used by several

248: data mining applications.

249: %

250: %

251: %

252: %

253: %

254: 3) Clustering large databases of spectra is a daunting task. Most

255: similarity measures proposed in tandem mass spectrometry use pairwise

256: metrics for similarity. Such pairwise methods lead to an explosion of

257: similarity calculations, i.e. $O(n^2)$ for a set of $n$ spectra. Thus,

258: a key open problem is to use methods that avoid the pair-wise

259: similarity calculations.  If objects can be transformed into metric

260: spaces, problems such as similarity searching and clustering become

261: easier.  Thus we need to find methods to robustly embed spectra in

262: metric spaces.

263: %

264: 4) Visualization of large groups of mass spectra is an important

265: problem which can also be used to qualitatively identify outliers in

266: the huge number of spectra produced.

267:

268:

269: %

270:

271:

272: In this paper, we present a general framework for large scale mining of

273: tandem mass spectra. Our main contributions are the following:

274: %

275: 1) We robustly embed spectra into a metric space.

276: %

277: 2) We show, both formally and empirically, that distances using our

278: embedding are as good as those that use the well known cosine method.

279: %

280: 3) We then apply a fast geometric near neighbor search technique,

281: Locality Sensitive Hashing (LSH)~\cite{datar04}, to solve several

282: problems such as fast filters for database search, similarity

283: searching of mass spectra, and visualization of large spectral

284: databases.

285: %

286: 4) Our embedding in conjunction with PCA and manifold learning can be

287: used to visualize large groups of spectra.

288: %

289: 5) Our embedding holds promise for comparing spectra with Post

290: Translational Modifications (PTM).

291:

292:

293: Our idea of robustly embedding spectra into vector spaces to mine mass spectra is

294: novel. Previous work embedded spectra into vector spaces using vectors

295: of amino acid counts for database search~\cite{Halligan04,Halligan05}.

296: That work focused on clustering sequence databases based on these amino

297: acid counts to search for mass spectra, given amino acid counts or

298: sequence tags. However, getting an accurate estimate of amino acid

299: composition is itself a hard problem, especially when the quality of

300: spectra is not high. In contrast, our method embeds ion fragments of

301: spectra directly into a vector space and avoids estimating higher

302: level features such as amino acid composition. Also our scheme is more

303: general: using a single embedding, we can either compare spectra with

304: each other or compare spectra with peptide sequences by generating

305: their virtual, or in-silico digested, spectra.

306: In addition, we demonstrate that our framework can be used in concrete

307: mining applications.  We first use our embedding along with Locality

308: Sensitive Hashing to speed-up database search.  We demonstrate that we

309: can filter out more than 99.152\% of the spectra with a false negative rate

310: of 0.29\%. The average query time for a spectrum is 0.21~s.

311: Then, we answer similarity queries and find replicates or tight

312: clusters.  For clusters with an average size of 16 spectra, LSH

313: misses an average of 1 spectrum per cluster while admitting only 1 false

314: spectrum.

315: %

316: %

317: %

318: %

319: %

320: %

321: %

322: %

323:

324:

325: To the best of our knowledge, no other work robustly

326: embeds spectra in metric spaces with provable guarantees and then uses fast

327: approximate near neighbor techniques to solve mass spectrometry data mining

328: problems.

329:

330:

331: \section{Methods}

332:

333:

334: Our approach is to use vector spaces which have been successful in

335: numerous data mining applications including web searching{\bf cite web

336: mining}.  Several fast mining algorithms become simpler to design in

337: these spaces, compared to designing them in non metric spaces

338: e.g. spaces where the only available measure is a pairwise similarity

339: measure.  Thus, the key problem in this approach is to robustly embed

340: spectra into a high dimensional metric space and define appropriate

341: distances. Also, these distances must be correlated with the well

342: known cosine similarities. In other words, we desire an embedding with

343: bounded distortion with respect to the cosine similarity.

344:

345:

346: \subsection{Embedding Spectra}

347:

348:

349: \subsection*{Noise Removal}

350:

351:

352: \begin{figure}

353:     \includegraphics[width=\linewidth,height=3in]{figs/snr.ps}

354:     \caption{Signal and noise distributions of peak intensities in different

355: regions of spectra (from the training set).}\label{fig:SNRdist}

356: \end{figure}

357:

358: The Achilles heel of tandem mass spectra analysis is the amount of

359: noise in the mass spectra. In fact, most peaks (around 80\%) cannot be

360: explained and are called {\em 'noise'} peaks. {\em 'Signal'} peaks

361: (such as $b, y$ ions) are useful for interpretation.  As a first step,

362: we remove noise peaks to improve the signal-to-noise ratio (SNR).

363:

364:

365: We use a statistical method to increase SNR.  We first find the

366: intensity distributions of signal and noise peaks in a set of

367: annotated spectra.  For this, we consider a set of good quality

368: annotated spectra as described in Section~\ref{sec:results} and

369: generate the virtual spectrum $v_p$ for each of the real spectra $r_p$

370: for a peptide $p$.  For the virtual spectrum generation we consider

371: the following ions: $b$, $b-H_2O$, $b-NH_3$, $y$, $y-H_2O$, $y-NH_3$.

372: Then we divide the mass range of $r_p$ into $k=10$ sections. For each

373: section, and for each real peak, we consider its intensity rank

374: i.e. the most intense peak has rank 0 and so on.  We divide the peaks

375: of $r_p$ into two sets $S_p$ and $N_p$. $S_p$ contains all those

376: peaks, and their intensity ranks, which have a match in the virtual

377: spectrum $v_p$; $N_p$ contains the remaining peaks. Thus, we obtain a distribution of

378: signal and noise intensity ranks for each region, as shown in

379: Figure~\ref{fig:SNRdist}.

380:

381:

382: We define the SNR of a peak $(mz_j,I_j)$ as follows:

383: $$

384: SNR(j)\ =\ \frac{P[\mbox{rank}(j)|(mz_j,I_j)\in S_p]}{P[\mbox{rank}(j)|(mz_j,I_j)\in N_p]}

385: $$

386: The larger the SNR, the more likely the peak is a useful signal peak; otherwise it is likely a noise peak.

387: From Figure~\ref{fig:SNRdist} we can conclude

388: that the SNR is very poor at the ends of the spectra, i.e. at low mass

389: regions and high mass regions. This statistical observation reinforces

390: the mass spectrometry folklore that the {\em middle region} is the most

391: suitable for finding signal peaks.
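
The rank-based SNR above can be sketched in Python as follows. This is a minimal illustration under assumptions that are not in the text: ranks are pooled per mass region over the training spectra, and add-one smoothing (the hypothetical \texttt{pseudocount} parameter) keeps the ratio finite for ranks never observed among noise peaks.

```python
from collections import Counter


def rank_snr(signal_ranks, noise_ranks, max_rank, pseudocount=1.0):
    """Estimate SNR(r) = P[rank = r | signal] / P[rank = r | noise], r = 0..max_rank.

    signal_ranks / noise_ranks hold the intensity ranks of the peaks in S_p / N_p
    for one mass region (rank 0 is the most intense peak).
    The pseudocount is an illustrative smoothing choice, not part of the paper.
    """
    n_ranks = max_rank + 1
    sig_counts = Counter(r for r in signal_ranks if 0 <= r <= max_rank)
    noise_counts = Counter(r for r in noise_ranks if 0 <= r <= max_rank)
    sig_total = sum(sig_counts.values()) + pseudocount * n_ranks
    noise_total = sum(noise_counts.values()) + pseudocount * n_ranks
    snr = []
    for r in range(n_ranks):
        p_signal = (sig_counts[r] + pseudocount) / sig_total
        p_noise = (noise_counts[r] + pseudocount) / noise_total
        snr.append(p_signal / p_noise)
    return snr
```

A peak would then be kept when the SNR for its rank and region exceeds a chosen threshold; the threshold itself is not specified here.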

392:

393:

394: \subsection*{Features and Distances}

395:

396: There are several possible ways to embed tandem mass spectra into a

397: vector space that support the most common operation of comparing two

398: spectra and finding similarities. For example, the cosine similarity

399: metric~\cite{Keller02AC} and its different variants have been very

400: popular in recent papers.  Unfortunately, the cosine metric does

401: not yield a metric embedding because the triangle inequality is

402: violated.  Also the cosine similarity metric implies algorithms that

403: consider pairs of spectra. Clearly such algorithms are difficult to

404: scale due to the $O(n^2)$ number of similarity calculations.

405:

406:

407: For metric embeddings, the design space is quite large.

408: %

409: %

410: %

411: A simple idea is to directly bin the peaks and use the intensities to

412: form a vector space. However spectra from different datasets have

413: different intensities and we would like to have a single embedding

414: that could potentially integrate multiple spectral databases.

415: %

416: %

417: %

418:

419:

420:

421: \begin{figure}

422: \begin{tabular}{c c}

423: \includegraphics[width=1.5in,height=1.5in]{figs/msmine-cube.ps} &

424: \includegraphics[width=1.5in,height=1.5in]{figs/msmine-circle.ps}\\

425: (i) & (ii)\\

426: \end{tabular}

427: \caption{ (i) Embedding spectra in an $n$-dimensional cube, (ii)

428: Using a 2-dimensional example to illustrate the correlation between the

429: Euclidean distance and the well known cosine similarity }\label{fig:cube}

430: \end{figure}

431:

432:

433:

434: We first {\em clean} spectra as mentioned in the previous subsection.

435: Then we divide the entire mass range (from 0 to some maximum range)

436: into discrete intervals of 2~Da.  For each 2~Da interval,

437: a bit is set to 1 if the cleaned spectrum

438: contains a peak in that interval, else it is 0. This embeds

439: each spectrum onto a vertex of an $n$-dimensional hypercube. A 3D

440: version is shown in Figure~\ref{fig:cube}. Our feature vectors

441: are defined to be the {\em unit} vectors in the direction of the

442: corresponding vertices of the $n$-dimensional hypercube. Thus the space

443: of our embedding is an $n$-dimensional unit hyper-sphere.
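
A minimal Python sketch of this binning and normalization step is given below. The function names and the default maximum mass of 2000~Da are assumptions made for illustration; only the 2~Da bin width comes from the text, and the input is taken to be the m/z values of an already cleaned spectrum.

```python
import math


def embed_spectrum(peak_mzs, max_mz=2000.0, bin_width=2.0):
    """Map cleaned peak m/z values to a unit vector on the binary hypercube.

    max_mz is an assumed upper mass limit; bin_width follows the 2 Da bins
    described in the text.
    """
    n_bins = int(math.ceil(max_mz / bin_width))
    bits = [0] * n_bins
    for mz in peak_mzs:
        if 0.0 <= mz < max_mz:
            bits[int(mz // bin_width)] = 1  # presence only, intensity is ignored
    k = sum(bits)
    if k == 0:
        return [0.0] * n_bins  # empty spectrum: no direction to normalize
    scale = 1.0 / math.sqrt(k)
    return [b * scale for b in bits]


def euclidean(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

Two peaks that fall into the same 2~Da bin set a single bit, so the number of nonzero coordinates can be smaller than the number of peaks.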

444:

445:

446:

447: We define the distance between spectra

448: $x$, $y$, as $||x-y||$. If the angle between two similar spectra $x$, $y$ is

449: $\theta$, $\cos \theta$ will be close to 1, or $1-\cos \theta$ will be

450: very small. Since $x$, $y$ are unit vectors, their

451: Euclidean distance will also be small.  Indeed, $D(x,y)^2 = 2(1-\cos

452: \theta)$, where $D$ is the Euclidean distance, so the two measures rank pairs of spectra identically. It is easy

453: to show that as $n$ or the number of dimensions increases, the minimum

454: angle for pairs of very similar spectra $x$, $y$ becomes

455: smaller. Thus, instead of calculating the $1- \cos \theta$, we

456: calculate $D(x,y)$.  The natural question that arises is the

457: {\em distortion} of our embedding.  We will now show that the distortion is

458: bounded in theory, and we will later show that the accuracy

459: is empirically quite high in comparison with the cosine similarity.

460:

461:

462:

463: We prove some properties of the embeddings. It is easy to show the following theorem:

464: \begin{thm}

465: The embedding discussed above defines a metric space.

466: \end{thm}

467: \begin{proof}

468: The proof is simple.

469: To show that our embedding defines a metric space, we need to prove three things:

470: 1) $||x-y||=0$ iff $x=y$, 2) $||x-y||=||y-x||$ and 3) the distance measure

471: obeys the triangle inequality. These properties are trivial to prove in our case

472: as our embedding uses Euclidean distances.

473: \end{proof}

474:

475:

476: We then show that the maximum Euclidean distance is bounded by $\sqrt{2}$.

477: \begin{lem}

478: The distance between the feature vectors of any two mass spectra is

479: bounded above by $\sqrt{2}$.

480: \end{lem}

481: \begin{proof}

482: Let $x$ and $y$ be two spectra. We shall use the

483: names of the spectra and their feature vectors

484: interchangeably. According to our scheme we first filter the noisy

485: peaks and generate the binary vector after binning. Now assume $x$ has

486: $k$ bits set to 1 and $y$ has $k'$ bits set to 1. Also assume that $c$

487: bits are set to 1 in both.  After normalization, each nonzero coordinate of $x$ is $\frac{1}{\sqrt{k}}$ and

488: each nonzero coordinate of $y$ is $\frac{1}{\sqrt{k'}}$. The $(k-c)+(k'-c)$ bits set in only one vector

489: contribute $\frac{1}{k}$ or $\frac{1}{k'}$ each to $||x-y||^2$, and the $c$ common bits contribute $\left(\frac{1}{\sqrt{k}}-\frac{1}{\sqrt{k'}}\right)^2$ each. Since $c\geq 0$, the expression below is at most $\sqrt{2}$; when $k=k'$ it equals $\sqrt{2-c\left(\frac{1}{k}+\frac{1}{k'}\right)}$. We have

490: \begin{small}

491: \begin{eqnarray}

492: ||x-y|| &  = &  \sqrt{\frac{k-c}{k}\ +\ \frac{k'-c}{k'}\ +

493:   \ c\left( \frac{1}{\sqrt{k}} - \frac{1}{\sqrt{k'}}\right)^2 }\\

494:  &  = & \sqrt{ 2 -  \frac{2c}{\sqrt{kk'}} }

495: \end{eqnarray}

496: \end{small}

497: \end{proof}

498:

499:

500: Next we show that our embedding has bounded distortion when we compare

501: with the well known cosine similarity.  We have the following theorem:

502: \begin{thm}

503: If $\theta$ is the angle made by the feature vectors of spectra $x$,

504: $y$, and the number of ones in each of the vectors after binning is

505: the same, then for $x\neq y$ we must have

506: $0<\frac{1-\cos{\theta}}{||x-y||}\leq\frac{1}{\sqrt{2}}$. In other

507: words, the distortion between our Euclidean embedding and the cosine

508: similarity is bounded.

509: \end{thm}

510: \begin{proof}

511: As in the previous lemma, and using the assumption $k=k'$,

512: $

513: ||x-y|| = \sqrt{ 2 -  c.\left( \frac{1}{k} + \frac{1}{k'} \right) }.

514: $

515: Now the cosine of the angle $\theta$ between $x$, $y$ can be written as

516: $\cos\theta = \frac{c}{\sqrt{kk'}}$.

517: Since $k=k'$, note that $0\leq \frac{c}{k}\leq 1$. Thus, we must have

518: \begin{eqnarray}

519: \frac { 1 - \cos\theta } {||x-y||} &  =  &

520: \frac { 1\ -\ \frac{c}{\sqrt{kk'}} }

521: { \sqrt{ 2 -  c.\left( \frac{1}{k} + \frac{1}{k'} \right) } } \\

522:  & = & \frac{1-\frac{c}{k}}{\sqrt{2-\frac{2c}{k}}}\\

523:  & = & \frac{1}{\sqrt{2}} \sqrt{1-\frac{c}{k}}

524: \end{eqnarray}

525: We note that since,  $0\leq \frac{c}{k}\leq 1$, we must also have

526:  $0\leq 1- \frac{c}{k}\leq 1$ and the theorem follows.

527: \end{proof}
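
The closed form in the proof can be checked numerically. The following Python sketch uses a hypothetical example with $k=k'=4$ set bits, of which $c=2$ are shared; the helper names are illustrative only.

```python
import math


def unit_binary(on_bits, n):
    """Unit vector pointing at the hypercube vertex with the given bits set."""
    scale = 1.0 / math.sqrt(len(on_bits))
    return [scale if i in on_bits else 0.0 for i in range(n)]


def distortion_ratio(x, y):
    """Return (1 - cos(theta)) / ||x - y|| for two distinct unit vectors."""
    cos_theta = sum(a * b for a, b in zip(x, y))
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return (1.0 - cos_theta) / dist


# Hypothetical example: k = k' = 4 set bits, c = 2 of them shared.
x = unit_binary({0, 1, 2, 3}, 10)
y = unit_binary({2, 3, 4, 5}, 10)
ratio = distortion_ratio(x, y)
# Closed form from the proof: (1 / sqrt(2)) * sqrt(1 - c / k).
closed_form = math.sqrt(1.0 - 2 / 4) / math.sqrt(2.0)
```

Both \texttt{ratio} and \texttt{closed\_form} evaluate to $0.5$ for this example, below the bound $\frac{1}{\sqrt{2}}$.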

528:

529:

530: Thus our embedding will perform almost as well as the standard cosine

531: metric. We show in the next section that this is indeed the case,

532: empirically.  Also, since the points are in a Euclidean space, we can apply

533: elegant geometric techniques that yield fast approximate algorithms

534: for mining the data.

535:

536: \subsection{Similarity Searching}

537:

538:

539: The ability to calculate distances as opposed to cosines is an

540: important feature of our framework. Now, we apply elegant

541: near neighbor algorithms to answer queries quickly but

542: approximately, as we show in the paper.  The basic query

543: primitive we use is the following:

544:

545:

546: \noindent{Primitive 1}: Given a spectrum $x$ and a set of spectra $S$,

547: we want to find all the spectra $S_r$ that are similar to $x$,

548: i.e. spectrum $y\in S_r$, iff $D(x,y)<r_q$, where $D$ is the Euclidean

549: distance and the $r_q$ is a query radius.

550:

551:

552:

553: A very simple approach would be to do a linear scan on the database

554: and output every spectrum $y$ such that $D(x,y)<r_q$. This takes

555: $O(n)$ time per query. However, if $S$ becomes very large and so does the number

556: of queries, say $O(n)$, then we have an $O(n^2)$ algorithm. This is

557: clearly unacceptable for our problem. Thus, we desire methods that

558: will yield near neighbor queries in {\em sub-linear} time. For this we

559: are willing to trade off some accuracy for speed. Several sub-linear

560: near neighbor methods exist but we leverage Locality Sensitive

561: Hashing~\cite{datar04} since, unlike others, it promises bounded

562: guarantees and is also easy to implement.  We briefly present the idea

563: below.

564:

565:

566:

567: \subsection*{Locality Sensitive Hashing}

568:

569: The basic idea is to use random projections to construct a class of hash functions

570: that are locality sensitive, i.e. if two points $p$ and $q$ are close (small $|p-q|$), they

571: hash to the same value with high

572: probability, and if they are far apart they collide with small

573: probability.

574:

575: \noindent{Definition 1}: A family $H = \{ h: S \rightarrow U \}$ is

576: called locality-sensitive, if for any point $q$, the function $$p(t) =

577: \mbox{Pr}_H[h(q) = h(v) : |q-v| = t]$$ is strictly decreasing in $t$. That

578: is, the probability of collision of points $q$ and $v$ is decreasing

579: with the distance between them.

580:

581: \noindent{Definition 2}: A family $H=\{h:S\rightarrow U\}$ is called

582: $(r_1,r_2,p_1,p_2)$-sensitive for a distance function $D$ if for any $v,q \in S$,

583: we have

584: \begin{itemize}

585: \item if $v\in B(q,r_1)$ then $\mbox{Pr}[h(q)=h(v)]\geq p_1$

586: \item if $v\notin B(q,r_2)$ then $\mbox{Pr}[h(q)=h(v)]\leq p_2$

587: \end{itemize}

588: Here $B(q,r)$ represents a ball around point $q$ with a radius $r$.

589: Thus a good family of hash functions will try to {\em amplify}

590: the gap between $p_1$ and $p_2$.

591:

592:

593: Indyk et~al.~\cite{datar04} showed that s-stable distributions can be

594: used to construct such families of locality sensitive hash

595: functions. An s-stable distribution is defined as follows.

596:

597: \noindent{Definition 3}: A distribution $D$ over $R$ is called {\em

598: s-stable}, if there exists $s\geq 0$ such that for any $n$ real numbers $v_1

599: ... v_n$ and i.i.d. variables $X_1 ... X_n$ with distribution $D$, the

600: random variable $\sum_i{v_i X_i}$ has the same distribution as the

601: variable $(\sum_i{|v_i|^s})^{\frac{1}{s}}X$, where $X$ is a random

602: variable with distribution $D$.

603:

604:

605: Consider a random vector $a$ of $n$ dimensions whose entries are drawn i.i.d. from an s-stable distribution. For any two

606: $n$-dimensional vectors $p$ and $q$, the difference between their projections

607: $(a.p - a.q)$ is distributed as $|p-q|_s X$ where $X$ is a random variable with an s-stable

608: distribution. We {\em chop} the real line into equal width segments of

609: appropriate size and assign hash values to vectors based on which

610: segment they project onto. The above can be shown to be locality

611: preserving.
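
A single hash function of this kind can be sketched in Python as follows. The sketch assumes the Euclidean case, so the entries of $a$ are standard Gaussians (a 2-stable distribution), and it adds a random offset $b$ before chopping the line, as in the scheme of~\cite{datar04}. The function names, the segment width and the number of trials are illustrative choices.

```python
import math
import random


def make_projection_hash(dim, width, rng):
    """Build one hash h(v) = floor((a . v + b) / width).

    The entries of a are standard Gaussians (2-stable, matching Euclidean
    distance); b is a random offset drawn uniformly from [0, width].
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)

    def h(v):
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return int(math.floor((projection + b) / width))

    return h


def collision_rate(p, q, width, trials=2000, seed=0):
    """Fraction of independently drawn hash functions under which p and q collide."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_projection_hash(len(p), width, rng)
        if h(p) == h(q):
            hits += 1
    return hits / trials
```

Calling \texttt{collision\_rate} on a close pair and on a distant pair gives an empirical view of the locality-sensitive property: the close pair collides far more often.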

612:

613:

614: There are two parameters to tune LSH. Given a family $H$ of hash

615: functions as defined above, the LSH algorithm chooses $k$ of them and

616: concatenates them to amplify the gap between $p_1$ and $p_2$. Thus,

617: for a point $v$, $g(v)=(h_1(v)...h_k(v))$. Also, $L$ such groups of

618: hash functions are chosen, independently and uniformly at random,

619: (i.e. $g_1...g_L$) to reduce the error.  During pre-processing, each

620: point $v$ is hashed by each of the $L$ functions and stored in the

621: bucket given by $g_i(v)$ for each $i$. For any query point $q$, all the

622: buckets $g_1(q)...g_L(q)$ are searched. For each point $x$ in the

623: buckets, if the distance between $q$ and $x$ is within the query

624: distance, we output it as a near neighbor. Thus, the parameters

625: $k$ and $L$ are crucial.  It has been shown~\cite{indyk99,datar04}

626: that $k=\log_{1/p_2}{n}$ and $L=n^\rho$, where

627: $\rho=\frac{\log{1/p_1}}{\log{1/p_2}}$, ensures locality sensitive

628: properties. In Ref.~\cite{datar04}, the authors consider $L_2$ spaces

629: and bound $\rho$ above empirically by $\frac{1}{c}$, $c$ being the

630: approximation guarantee, i.e. for a given radius $R$, the algorithm

631: returns points whose distance is within $c\times R$.  The time

632: complexity of LSH has been shown to be $O(dn^\rho \log{n})$, where $d$

633: is the number of dimensions and $\rho$ is as defined above.  Thus, if

634: we desire a coarse level of approximation, LSH can guarantee

635: sub-linear run times for geometric queries.
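
The pre-processing and query steps above can be sketched as a small Python class. This is an illustrative, unoptimized sketch rather than the implementation used in the paper: the class name, the Gaussian projections and the segment width parameter are assumptions, while $k$ and $L$ play the roles described in the text.

```python
import math
import random
from collections import defaultdict


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


class LSHIndex:
    """L hash tables, each keyed by k concatenated projection hashes."""

    def __init__(self, dim, k, L, width, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.points = []
        self.tables = []
        for _ in range(L):
            # g_i = (h_1, ..., h_k); each h is stored as a pair (a, b).
            g = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
                 for _ in range(k)]
            self.tables.append((g, defaultdict(list)))

    def _key(self, g, v):
        return tuple(
            int(math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.width))
            for a, b in g
        )

    def add(self, v):
        """Pre-processing: store v in the bucket g_i(v) of every table."""
        index = len(self.points)
        self.points.append(v)
        for g, buckets in self.tables:
            buckets[self._key(g, v)].append(index)
        return index

    def query(self, q, radius):
        """Search buckets g_1(q)..g_L(q) and keep candidates within the radius."""
        candidates = set()
        for g, buckets in self.tables:
            candidates.update(buckets.get(self._key(g, q), ()))
        return sorted(i for i in candidates if euclidean(q, self.points[i]) < radius)
```

Larger $k$ makes each bucket more selective, while larger $L$ lowers the chance that a true near neighbor is missed in every table.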

636:

637:

638:

639:

640: \subsection{Similarity Searching}

641:

642:

643: Using our embedding and a fast near neighbor algorithm, we can

644: find spectra similar to a given query spectrum. The

645: key is to use the correct query radius $r$. We

646: show in the next section how this can be chosen.

647: If we give too high a radius, it might yield a

648: large dataset and if the radius is too low, it might not yield any

649: neighbor.

650:

651:

652: If an appropriate query radius is chosen,

653: it is easy to find tight clusters using the following heuristic:

654: \noindent{ANN-cluster}: 1) Embed spectra into a Euclidean space and

655: form the set $S$.  2) Hash the feature vectors, $S$, using LSH. 3)

656: Choose some $k$ random spectra from $S$ and find their near neighbors (tight

657: clusters). For each chosen spectrum, add its neighbors to a set $C$. 4)

658: $S=S-C$. 5) Go to step 3 until $S$ is empty.
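
The ANN-cluster heuristic can be sketched in Python as follows. To keep the example self-contained, a brute-force neighbor search stands in for the LSH query; the function names and the \texttt{batch} parameter (the number of random spectra chosen per round) are illustrative assumptions, and points are assumed to be already embedded.

```python
import math
import random


def brute_force_neighbors(points, i, radius):
    """Stand-in for an LSH query: indices of points within radius of points[i]."""
    def dist(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return [j for j, p in enumerate(points) if dist(points[i], p) < radius]


def ann_cluster(points, radius, neighbors=brute_force_neighbors, batch=1, seed=0):
    """Greedy tight-cluster heuristic: pick random seeds, peel off their neighbors."""
    rng = random.Random(seed)
    remaining = set(range(len(points)))  # the set S
    clusters = []
    while remaining:
        seeds = rng.sample(sorted(remaining), min(batch, len(remaining)))
        for s in seeds:
            if s not in remaining:
                continue  # already absorbed by an earlier seed in this round
            cluster = {j for j in neighbors(points, s, radius) if j in remaining}
            cluster.add(s)
            clusters.append(sorted(cluster))
            remaining -= cluster  # S = S - C
    return clusters
```

Every round removes at least the chosen seed from $S$, so the loop terminates after at most $|S|$ rounds.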

659:

660:

661: Another immediate consequence of our framework is to find outliers. To

662: check whether a spectrum is an outlier, we determine whether it has at most

663: 1 or 2 neighbors. If the neighbors remain unchanged even on increasing

664: the query radius by $\delta$, a spectrum is indeed an outlier.  Since

665: near neighbor queries take sub-linear time with LSH, outliers can be detected

666: in sub-quadratic time.

667:

668:

669:

670:

671: \subsection{Speeding Up Database Search}

672:

673:

674:

675: In this section, we discuss a sample application using our mining

676: framework.  Database search is the primary tandem mass spectrometry

677: data mining application. Given a query spectrum $x$, and a mass

678: spectra database $MSDB$ (described in Section~\ref{sec:results}), the

679: problem is to find out which peptide $p\in MSDB$ corresponds to $x$.

680:

681:

682: Database search is a well explored topic; see~\cite{Wan05} for

683: example. Most tools index the MSDB by peptide mass. Then for a

684: spectrum $x$, the precursor mass $m_x$ is found, and all the spectra in

685: $S_p=\{y \mid y\in MSDB,\ |m_y-m_x|<\delta\}$ are compared with $x$,

686: where $\delta$ is some pre-defined mass tolerance.  Each comparison

687: operation between the query spectrum and the candidate spectrum takes

688: a while depending on the scoring function used.  We reduce the size of

689: $S_p$ by filtering the unrelated spectra, speeding up the search. We

690: aim to retain the true peptide for a

691: given spectrum while discarding most of the unrelated peptides.

692:

693:

694: We generate the virtual spectra from each peptide

695: sequence in the database, and then embed those virtual spectra in the

696: Euclidean space, as mentioned. Then for filtering, we choose an appropriate

697: threshold radius $r$ and query the LSH algorithm to yield all the

698: candidates within a ball of radius $r$. The ratio of the total number

699: of peptides within the mass tolerance to the number of

700: candidates returned is our speedup.
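
As a worked check using the figures quoted in the introduction, suppose the filter discards a fraction $f$ of the candidates within the mass tolerance. The speedup defined above is then
$$
\mbox{speedup}\ =\ \frac{1}{1-f}\ =\ \frac{1}{1-0.99152}\ \approx\ 118 ,
$$
which matches the 99.152\% filtering rate reported there.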

701:

702:

703:

704:

705: \subsection{Visualization and Dimension Reduction}

706:

707: As mentioned earlier, visualizing thousands of spectra is a very hard

708: problem.  We are not aware of any previous work that allows us to

709: visualize large mass spectrometry data sets.  Our embedding followed

710: by dimension reduction allows us to view spectra in a two or three

711: dimensional space. As a bonus, it qualitatively allows us to identify

712: outliers in the data set.

713:

714:

715: Once we have embedded the spectra in a Euclidean space, we can use

716: some of the common techniques to visualize high dimensional data by

717: dimensionality reduction.  The most common linear method is to use

718: PCA~\cite{strang}. Recently, several non-linear methods for

719: dimensionality reduction have been discovered, the majority of them

720: exploiting the low dimensional manifold structure of the dataset.  In

721: this paper, we leverage one of these techniques, the isomap method, to

722: project the high dimensional data on a 2D plane. Due to lack of space

723: we do not provide a description of the method.

724:

725:

726:

727: \section{Experimental Results}

728: \label{sec:results}

729:

730: In this section, we describe the empirical evaluation of our embedding

731: followed by some representative data mining tasks. Unless otherwise

732: stated, we use the dataset from Keller

733: et~al.~\cite{Keller02Data}. For calculating statistics, we used 80\%

734: of the 1618 annotated spectra, chosen at random.  The statistics

735: were independent of the exact choices of the spectra.  Note that our

736: techniques are unsupervised except for the selection of query radii.

737: Out of this, 1014 spectra were digested with trypsin and were used for

738: database search filter.

739:

740:

741: For database search filters, we used a non-redundant protein sequence database

742: called MSDB, which is maintained by Imperial College London.  The

743: release (20042301) has 1,454,651 protein sequences (around 550M amino

744: acids) from multiple organisms.  Peptide sequences were generated by

745: in-silico digestion and the list of peptides was grouped into

746: different files by precursor ion mass, one file for each

747: 10~Da range.

748:

749:

750: \subsection{Empirical evaluation of the embedding}

751:

752: In this section, we critically analyze our embedding and different

753: distance metrics. For these analyses, we chose a set of 1014 curated

754: spectra of proteins digested with trypsin and reported by Keller

755: et~al.  We then cleaned the spectra and picked the peaks most likely to be

756: signal peaks. Then we constructed the binary bit vector as discussed

757: earlier. For this set of spectra, we knew that there were 100-odd

758: clusters with 15 spectra per cluster on average.  We calculate the

759: pairwise distances between spectra within the same cluster and we term

760: this the similar set $SS$.  We then choose a representative from each

761: cluster at random, calculate the pairwise distances between these

762: representatives, and call this set the dissimilar set $DS$.  Since both

763: sets contain a similar number of pairwise distances, we plot their frequency distributions in

764: Figure~\ref{fig:inter} for three metrics: Hamming, 1-cosine and

765: Euclidean. It is very clear that Hamming is unsuitable as a metric as it

766: has low discriminability. As expected, 1-cosine and Euclidean look

767: very similar, with low overlap between the sets $DS$ and $SS$. Also note

768: that the cosine metric used here is not exactly the same as that used by

769: others: we do not take the intensities into consideration after we

770: have selected the peaks.
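
To make the three metrics concrete, the following Python sketch evaluates
them on small, made-up bit vectors. The function names and the example
vectors are hypothetical, and scaling each vector to unit length before
taking the Euclidean distance is an assumption about the embedding rather
than something stated in this section.

```python
import math


def hamming(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def _norm(a):
    """Euclidean length of a vector; raises if the vector has no peaks."""
    n = math.sqrt(sum(x * x for x in a))
    if n == 0:
        raise ValueError("vector has no peaks set")
    return n


def one_minus_cosine(a, b):
    """1 - cosine similarity of two bit vectors (intensities are ignored)."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (_norm(a) * _norm(b))


def euclidean_unit(a, b):
    """Euclidean distance after scaling both vectors to unit length
    (assumed normalization, see the lead-in)."""
    na, nb = _norm(a), _norm(b)
    return math.sqrt(sum((x / na - y / nb) ** 2 for x, y in zip(a, b)))


# Hypothetical 8-bin spectra: b shares 3 of its 4 peaks with a, c shares none.
a = [1, 1, 0, 1, 0, 0, 1, 0]
b = [1, 1, 0, 0, 0, 1, 1, 0]
c = [0, 0, 1, 0, 1, 0, 0, 0]

for name, other in (("similar", b), ("dissimilar", c)):
    print(name,
          hamming(a, other),
          round(one_minus_cosine(a, other), 3),
          round(euclidean_unit(a, other), 3))
```

On these toy vectors the Hamming distance depends on how many peaks each
vector has, whereas the two normalized scores depend only on the fraction
of shared peaks.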

771:

772:

773:

774: \begin{figure}

775:     \includegraphics[width=\linewidth,height=4in]{figs/interVSintraCluster.ps}

776:     \caption{Distribution of distances between real spectra using different

777: metrics (Hamming, 1-cosine, Euclidean). The dotted curve plots the

778: inter-cluster distances while the solid curve plots the

779: intra-cluster distances.}\label{fig:inter}

780: \end{figure}

781:

782:

783: \iffalse

784:

785: In the previous case, we plotted distances between real spectra. Now

786: we plot distances between spectra and their corresponding virtual

787: spectra and compare with distances between spectra and virtual spectra

788: generated from totally dissimilar peptides in

789: Figure~\ref{fig:trueVSfalse}.  Note that even in this case, there is

790: very good separation (with < 5\% overlap).

791:

792:

793:

794:

795: \begin{figure}

796:     \includegraphics[width=\linewidth,height=4in]{figs/compareRealVirtualDistance.ps}

797:     \caption{Distribution of distances between real and virtual spectra using

798: different metrics. The dotted curve represents the distances between

799: real spectra and virtual spectra from different peptides.

800: The other curve shows the distribution of distances between spectra

801: and the virtual spectra from the true peptides.}\label{fig:trueVSfalse}

802: \end{figure}

803: \fi

804:

805: Now, we consider the database of tryptic peptides, MSDB. For each

806: peptide, we generate its virtual spectrum and then construct the

807: feature vector as above.  For each real spectrum, we calculate the

808: distance to the virtual spectrum of its correct peptide and we call this set of

809: scores $SS$. Then we choose, from the database, 100 random

810: peptides having almost the same mass as the precursor ion mass of the

811: given spectrum. We add the resulting set of distances to the dissimilar set

812: $DS$. We then plot the probability distributions of $SS$ and $DS$ in

813: Figure~\ref{fig:trueVSfalseDB}. Again we can see a clear separation

814: between the two sets of distances (with $<1\%$ overlap).  This

815: indicates that Euclidean distance in our embedded

816: space is a good metric for designing filters for database search. Note the

817: sharp impulse at 1.414 corresponding to distances between real spectra

818: and completely dissimilar peptides within a mass tolerance of 2~Da,

819: providing empirical evidence for Lemma 2.2.
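
The location of this impulse can be checked with a short calculation. This
is only a sketch, under the assumption (not restated in this section) that
the feature vectors $u$ and $v$ are scaled to unit length:
\[
  \|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2\,u \cdot v = 2 - 2\cos\theta ,
\]
where $\theta$ is the angle between $u$ and $v$. When a spectrum and a
peptide share no peaks, $u \cdot v = 0$ and hence
$\|u - v\| = \sqrt{2} \approx 1.414$. Under the same assumption,
$\|u - v\| = \sqrt{2\,(1 - \cos\theta)}$, so the Euclidean distance is a
monotone function of the 1-cosine score.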

820:

821:

822:

823: \begin{figure}

824:     \includegraphics[width=\linewidth,height=4in]{figs/realVStrueANDfalseVirtual.ps}

825:     \caption{Distribution of distances between real and virtual spectra using

826: different metrics. The dotted curve represents the distances between

827: real spectra and the virtual spectra of 100 different peptides

828: of similar precursor mass. The sequences are from MSDB.

829: The other curve shows the distribution of distances between spectra

830: and the virtual spectra of the true peptides.}\label{fig:trueVSfalseDB}

831: \end{figure}

832:

833:

834: \subsection{Post Translational Modifications}

835:

836: Now we present some very preliminary results on a set of spectra from

837: the PFTau protein. We picked 8 good-quality spectra with known

838: phosphorylations. We wanted to study whether our metric can help

839: design filters that might work for PTM studies. From Figure~\ref{fig:ptm},

840: we note that distances between spectra and their PTM variants

841: have a higher likelihood of being classified as similar than dissimilar.

842: This is evident when these distances are compared with the distributions in Figure~\ref{fig:inter}.

843:

844: \begin{figure}

845: \begin{small}

846: \begin{tabular}{ |c | c |}

847: \hline Peptide (unmodified, then modified) & Distance \\ \hline

848: R.LTQAPVPMPDLKNVK.S & 1.23\\

849: R.LTQAPVPMPDLK\#NVK.S & \\

850: R.HLSNVSSTGSIDMVDSPQLATLADEV & 1.27\\

851: R.HLSNVSST\^{}GS\^{}IDMVDS\^{}PQLATLADEV & \\

852: R.TPSLPTPPTR.E & 0.98\\

853: R.TPSLPT*PPTR.E &  \\

854: R.QEFEVMVMEDHAGTYGLGLGDR.K & 1.19\\

855: R.QEFEVMVMEDHAGT\^{}YGLGLGDR.K & \\

856: \hline

858: \end{tabular}

859: \end{small}

860: \caption{Some sample distances between spectra and their PTM variants.

861: Note the low distances within each pair. Distances between spectra of

862: different peptides had mean $\mu=1.388$ and standard deviation $\sigma=0.017$.}\label{fig:ptm}\label{tab:ptm}

863: \end{figure}

864:

865:

866: \subsection{Query processing using LSH}

867:

868: In this section, we quantify the accuracy of our framework for

869: similarity searching and clustering.  As mentioned earlier, we use LSH

870: to answer queries with bounded errors in expected sub-linear time.

871:

872:

873: We first indexed the 1014 spectra using our embedding followed by LSH.

874: For each of the 1014 spectra, we queried LSH with a radius $r$,

875: and we repeated this for several values of $r$.  We plot the

876: number of missed spectra that were actually present in the cluster of

877: the query spectrum in Figure~\ref{fig:LSH-misses} and the number of

878: false positives in Figure~\ref{fig:LSH-fpos}.  As we increased the

879: radius, the number of misses decreased. This is expected, since increasing the

880: radius of the {\em query ball} increases the number of data

881: points that can be considered. As expected, the number of false

882: positives also increased as $r$ increased. This indirectly demonstrates

883: the accuracy of any clustering algorithm based on LSH. On average, we

884: miss 1 spectrum within each cluster

885: while admitting only 1 false spectrum.
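
The evaluation loop described above can be sketched in a few lines of
Python. This is an illustration only: an exact brute-force radius query
stands in for LSH (so it shows how misses and false positives are counted,
not the approximation error of LSH itself), and the function names, the
2D points and the cluster labels are all hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def range_query(points, q_idx, r):
    """Exact stand-in for an LSH radius query: indices of all points
    within distance r of the query point, excluding the query itself."""
    q = points[q_idx]
    return {i for i, p in enumerate(points)
            if i != q_idx and euclidean(p, q) <= r}


def misses_and_false_positives(points, labels, r):
    """Average number of same-cluster points not reported (misses) and of
    other-cluster points reported (false positives), over all queries."""
    n = len(points)
    total_miss = 0
    total_fpos = 0
    for q in range(n):
        reported = range_query(points, q, r)
        members = {i for i in range(n) if i != q and labels[i] == labels[q]}
        total_miss += len(members - reported)
        total_fpos += len(reported - members)
    return total_miss / n, total_fpos / n


# Hypothetical toy data: two annotated clusters, each with one point that
# lies far from the rest of its cluster.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.9),
          (3.0, 3.0), (3.1, 3.0), (0.5, 0.5)]
labels = [0, 0, 0, 1, 1, 1]

for r in (0.5, 1.0):
    miss, fpos = misses_and_false_positives(points, labels, r)
    print(f"r={r}: avg misses={miss:.2f}, avg false positives={fpos:.2f}")
```

On this toy data, enlarging the radius lowers the average number of misses
and raises the average number of false positives, the same qualitative
trend as reported above.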

886:

887:

888:

889: At $r=1.0$--$1.1$ the false positives are not very high.  This might be

890: important when we want to query for similar spectra in order to

891: generate the consensus spectra. In such situations, it might be fine

892: to miss some low-quality spectra (distances to low-quality spectra

893: are usually higher). Also, consider situations where we would like to

894: coarsely partition the data set (e.g., for clustering).  Then,

895: we can afford to have a few false positives but we cannot

896: miss any true positives. In such cases we increase the radius to at most

897: 1.25, as the likelihood of an intra-cluster distance being greater than

898: 1.25 is low, as seen in Figure~\ref{fig:inter}.

899:

900:

901:

902: \begin{figure}

903:     \includegraphics[width=\linewidth,height=1.75in]{figs/LSH-misses.ps}

904:     \caption{The average number of spectra that are present in the cluster

905: containing the query spectrum but are missed by LSH.}\label{fig:LSH-misses}

906: \end{figure}

907:

908:

909:

910: \begin{figure}

911:     \includegraphics[width=\linewidth,height=1.75in]{figs/LSH-fpos.ps}

912:     \caption{The average number of spectra that are not present in the cluster

913: containing the query spectrum but are reported by LSH.}\label{fig:LSH-fpos}

914: \end{figure}

915:

916:

917:

918:

919:

920: \subsection{Speeding up Database Search}

921:

922:

923: To test the efficacy of our framework on speeding up database search,

924: we first use our metric to filter out candidate spectra. Since our

925: distance calculation is much faster than the detailed scoring of two

926: spectra, we define speedup as the ratio of the total number of candidate

927: peptides within a mass tolerance of 2~Da to the total number of

928: peptides that are within a distance of $\Delta$ of the query spectrum and

929: within the same mass tolerance. Then we increase $\Delta$ and calculate

930: the number of true peptides missed in this filtering process.  In

931: Figure~\ref{fig:speedup} we plot the speedup on a logarithmic scale

932: against the miss percentage. This gives us the speedup (or quality of

933: filtering) versus accuracy tradeoff of using our framework.  For a 2~Da

934: range, the number of peptides is around 100--200K. For a

935: set of around 100K peptides, LSH takes 0.21s on average to answer queries.

936: As we see from Figure~\ref{fig:speedup}, we can get an

937: average speedup of 118 if we allow 0.19\% misses.

938: This may be reasonable for

939: many applications. In fact, we found that our errors were due to low-quality

940: spectra in our test dataset.
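
The tradeoff curve can be computed along the following lines. This Python
sketch rests on several assumptions that are not taken from the paper: each
query is summarized by the distance to its true peptide and by the
distances to all candidates in its mass window (true peptide included),
the speedup is averaged per query, and a query that keeps no candidate is
counted as keeping one to avoid division by zero. The function name and the
example distances are hypothetical.

```python
def filter_tradeoff(queries, delta):
    """Return (average speedup, miss percentage) for a distance threshold.

    queries: list of (true_distance, candidate_distances) pairs, where
    candidate_distances holds the distances from the query spectrum to
    every peptide in its mass window, including the true peptide.
    """
    speedups = []
    missed = 0
    for true_distance, candidate_distances in queries:
        # Candidates that survive the filter at this threshold.
        kept = sum(1 for d in candidate_distances if d <= delta)
        # The true peptide is missed if it lies outside the threshold.
        if true_distance > delta:
            missed += 1
        # Assumption: treat an empty result as one kept candidate.
        speedups.append(len(candidate_distances) / max(kept, 1))
    average_speedup = sum(speedups) / len(speedups)
    miss_percentage = 100.0 * missed / len(queries)
    return average_speedup, miss_percentage


# Hypothetical distances for two query spectra with four candidates each.
queries = [
    (0.9, [0.9, 1.41, 1.41, 1.40]),
    (1.3, [1.3, 1.41, 1.39, 1.41]),
]

for delta in (1.0, 1.35, 1.5):
    speedup, miss = filter_tradeoff(queries, delta)
    print(f"delta={delta}: speedup={speedup:.1f}, misses={miss:.0f}%")
```

Sweeping the threshold over a range of values and plotting the resulting
pairs gives a speedup versus miss-percentage curve of the kind described
above.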

941:

942:

943: \begin{figure}

944:     \includegraphics[width=\linewidth,height=2in]{figs/speedup-cos.ps}

945:     \caption{Filtering of spectra for database search: speedup (logarithmic scale) against the miss percentage.}\label{fig:speedup}

946: \end{figure}

947:

948:

949:

950: \subsection{Visualization and Dimension Reduction}

951:

952:

953: Consider the training dataset of mass spectra.  We first generate

954: Euclidean feature vectors for each spectrum.  Then we apply PCA and

955: plot the first two components on the x-axis and the y-axis as shown

956: in Figure~\ref{fig:pca}(i). The clusters are visible and so are the

957: outliers, but the visualization is coarse-grained.

958:

959:

960: Then we use Isomap on the same dataset.  Recall that in Isomap, one

961: first needs to calculate the near neighbors. Thus in our plot, we also

962: show the near neighbor graph along with the projected points as shown

963: in Figure~\ref{fig:pca}(ii).  The cluster structure seems to be

964: qualitatively clearer than with PCA.
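
For readers unfamiliar with the neighbor step, the Python sketch below
builds a $k$-nearest-neighbor graph and computes shortest-path (geodesic)
distances over it, which are the first two stages of the standard Isomap
procedure. The final stage (classical multidimensional scaling on the
geodesic distances) is omitted, the function names and toy points are
hypothetical, and this is not the implementation used for the figure.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph as a dense matrix of edge
    lengths; missing edges are represented by infinity."""
    n = len(points)
    graph = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: euclidean(points[i], points[j]))
        for j in others[:k]:
            d = euclidean(points[i], points[j])
            graph[i][j] = d
            graph[j][i] = d  # keep the graph undirected
    return graph


def geodesic_distances(graph):
    """All-pairs shortest paths over the graph (Floyd-Warshall)."""
    n = len(graph)
    dist = [row[:] for row in graph]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                through_m = dist[i][m] + dist[m][j]
                if through_m < dist[i][j]:
                    dist[i][j] = through_m
    return dist


# Hypothetical points lying along an L-shaped curve in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
geo = geodesic_distances(knn_graph(points, k=2))

print("straight-line distance:", round(euclidean(points[0], points[4]), 3))
print("geodesic distance:     ", geo[0][4])
```

The geodesic distance between the two ends of the curve follows the
neighbor graph rather than the straight line, which is what lets Isomap
respect the manifold structure that a linear projection ignores.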

965:

966:

967: \begin{figure}

968: \begin{tabular}{c c}

969: \includegraphics[width=1.5in,height=1.75in]{figs/spectra-pca.ps} &

970: \includegraphics[width=1.5in,height=1.75in]{figs/spectra-isomap.ps}\\

971: (i) PCA & (ii) Isomap\\

972: \end{tabular}

973:     \caption{Dimension reduction with (i) PCA and (ii) Isomap.}\label{fig:pca}

974: \end{figure}

975:

976:

977: \section{Discussion}

978:

979:

980: The results in the previous section look promising. The clear

981: separation between the $DS$ and $SS$ sets during the metric comparison

982: was initially a surprise to us. One of the reasons for the good

983: result is the quality of the dataset. We first wanted to validate our

984: simple assumptions and claims on a dataset which had reliable

985: interpretations. Since we first transform the spectra into binary bit

986: strings, we avoid the huge variations of density in spectra.  The

987: signal-to-noise ratio pilot study also underscored the fact that we

988: need to study spectra by segmenting them. Note that one reason why we

989: obtained clear separations between $DS$ and $SS$ in all cases with our

990: embedding is that we avoided using precursor ion mass as a feature.

991: Even though it is fine to use the precursor mass as a coarser-grained

992: filter, using it as a feature would lead to less robust embeddings, as such masses are

993: prone to errors due to isotope effects. In addition, our theoretical results

994: would not hold.

995:

996:

997:

998: For LSH, the speed and the accuracy are quite satisfying.  However,

999: there are two implementation issues.  First, our current indexing is memory

1000: bound: we need a lot of memory to index millions of mass

1001: spectra. Even though this is possible with current 64-bit

1002: machines, we need to design disk-based LSH schemes.  We are working on

1003: a large-scale implementation of our framework based on such

1004: techniques. The second issue is the choice of the number of bins and the

1005: mass coverage. Increasing the number of bins leads to the curse of

1006: dimensionality, which would slow down LSH and reduce the filtering

1007: speedup. If we choose fine-grained bins with a lower maximum mass, our

1008: embedding will result in a pseudo-metric space, as several different

1009: spectra will now satisfy assumption one in Theorem 2.1.

1010:

1011:

1012:

1013:

1014: \section{Conclusions and Future Work}

1015:

1016: In this paper, we showed that our embedding with geometric algorithms

1017: provides a good framework for mining mass spectra. In particular, we

1018: have demonstrated, both theoretically and empirically, that our

1019: embedding coupled with Euclidean distance performs as well as the well-known

1020: cosine similarity while providing us with the benefits of a

1021: metric space and enabling us to use approximate sub-linear time near

1022: neighbor techniques for data mining. Using this framework, we showed

1023: how we can do similarity searches and find tight clusters. Also, we

1024: demonstrated that we can get two orders of magnitude of filtering for

1025: database search. As an aside, we are also able to visualize large

1026: datasets in two dimensions, qualitatively identifying the outliers.

1027:

1028:

1029: This work is the first step in the direction of an integrated

1030: framework for large scale mining of tandem mass spectra using simple

1031: techniques from embeddings, vector spaces and computational

1032: geometry. Several directions are being investigated at this point. The

1033: main areas of investigation are: 1) better embeddings that offer better

1034: resolution for PTM spectra; 2) faster external database searching

1035: algorithms that use embeddings; 3) more effective blind PTM searching

1036: using embeddings; 4) large-scale clustering and visualization of mass

1037: spectrometry data; and 5) integrating data from different sources using

1038: our embeddings.

1039:

1040:

1041: We should note that several sections in the paper could be of

1042: independent interest. For example, we need to explore the

1043: probabilistic cleaning of mass spectra in more detail.  Our embedding

1044: promises to work across datasets, and this general method can be used

1045: for integrated studies of other biological datasets, e.g., microarray

1046: data sets.

1047:

1048:

1049: \iffalse

1050: \section{Acknowledgments}

1051:

1052: Debojyoti would like to thank Vidhya Navalpakkam for her invaluable

1053: help. The authors would like to thank

1054: Prof. Piotr Indyk who provided insight into the LSH algorithm

1055: and also provided the initial LSH code.

1056: Debojyoti would also like to thank Yunhu Wan and Lijuan Mo

1057: for extremely helpful discussions.

1058: \fi

1059:

1060: \bibliographystyle{plain}

1061: \bibliography{msms,lsh}

1062:

1063:

1064:

1065:

1066:

1067: \end{document}

1068:

1069: