0303:cond-mat0303473/PN.tex

1: \documentclass[pre, preprint,floatfix]{revtex4}

2: \usepackage{graphicx}

3: \bibliographystyle{apsrev}

4: \renewcommand{\r}{\right}

5: \renewcommand{\l}{\left}

6:

7:

8: \begin{document}

9:

10: \title{Global Snapshot of Protein Interaction Network -- A Percolation Based Approach}

11:

12: \author{Chen-Shan Chin}

13:

14: \affiliation{Department of Biochemistry and Biophysics, University of

15:   California, San Francisco, 94143, CA, USA}

16:

17: \email{cschin@genome.ucsf.edu}

18:

19: \author{Manoj Pratim Samanta}

20:

21: \affiliation{NASA Advanced Supercomputing Division, NASA Ames Research Center,

22:   Moffett Field, 94035, CA, USA}

23:

24: \email{msamanta@nas.nasa.gov}

25:

26:

27: \date{\today}

28:

29: \begin{abstract}

30:   In this paper, we study the large-scale protein interaction network

31:   of yeast utilizing a stochastic method based upon percolation of

32:   random graphs.  In order to find the global features of

33:   connectivities in the network, we introduce numerical measures that

34:   quantify (1) how strongly a protein ties with the other parts of the

35:   network and (2) how significantly an interaction contributes to the

36:   integrity of the network.  Our study shows that the distribution of

37:   essential proteins is distinct from the background in terms of

38:   global connectivities.  This observation highlights a fundamental

39:   difference between the essential and the non-essential proteins in

40:   the network.  Furthermore, we find that the interaction data

41:   obtained from different experimental methods such as

42:   immunoprecipitation and two-hybrid techniques possess different

43:   characteristics.  We discuss the biological implications of these

44:   observations.

45: \end{abstract}

46:

47: \maketitle

48:

49:

50: \section{Introduction}

51:

52: Recent availability of a large amount of data from high-throughput

53: experiments~\cite{Zhu2,Uetz,Ito,Gavin,Ho} has brought about a

54: fundamental change in the way we study biological systems. Unlike the

55: traditional methods which relied on probing a single or a few proteins

56: to identify important pathways, it is now becoming possible to

57: describe larger functional `modules'~\cite{Hartwell} and even the

58: global properties of the entire

59: proteome~\cite{Jeong,Maslov,Mering,Bader}.  Researchers are attempting

60: to connect large-scale protein interaction data with information from

61: phenotype studies~\cite{Jeong,Maslov}.  In one such analysis of data

62: from yeast, Jeong {\it et al.}  observed the connectivities of

63: individual proteins in the network to closely follow a power-law

64: distribution.  Similar to other power-law networks, positive

65: correlation existed between a protein's inviability and its

66: connectivity~\cite{Jeong}.  In another study, Maslov {\it et al.}

67: observed interesting patterns in the distribution of the links between

68: the nearest neighbors in the network and postulated that such patterns

69: give rise to the specificity and the robustness of the

70: network~\cite{Maslov}.

71:

72: One of the shortcomings of the previous approaches is that they drew

73: conclusions about the global nature of the network from its local

74: connectivity properties. It is unclear whether such local studies

75: based on individual nodes or nearest neighbors fully capture the

76: global picture of the network. For example, some essential proteins,

77: namely, those for which null mutants produce inviable

78: strains~\cite{YeastDel}, may have few direct links but

79: still play important roles in the network through the proteins to

80: which they are connected.  Such proteins would not be correctly

81: identified by just counting the number of links as in

82: Ref.~\cite{Jeong}.  To properly recognize such cases, it is necessary

83: to go beyond the nearest neighbor links.  However, it is not clear

84: that the techniques mentioned above can easily be extended to answer

85: such questions.

86:

87: In this paper, we introduce a stochastic method inspired by the

88: percolation model in statistical mechanics~\cite{percolation} that

89: overcomes the shortcomings of the previous approaches.  This method

90: allows us to define a quantity that measures the correlation between

91: any two nodes in the network, taking the topology of the entire

92: network into account.  Biologically, such correlations describe the

93: direct and indirect influences of one protein on another through the

94: protein interaction network.  If such correlations indeed carry

95: biological significance, we expect the essential proteins to be highly

96: correlated, in general, with the rest of the network.  One of our main

97: results is that most essential proteins do possess higher correlations

98: between themselves and the rest of the network.  This is consistent

99: with previous results~\cite{Jeong}, because, to first order, the

100: correlations computed by us are proportional to the connectivities of

101: the proteins. However, we show that it is important to go beyond the

102: first order. Identifying essential proteins by our method performs

103: consistently better than just counting links.  Additionally, we

104: observe that the essential proteins interact more tightly with the

105: other essential proteins, thus forming a `network core'.  This

106: directly agrees with large-scale experiments probing protein

107: networks~\cite{Gavin}.

108:

109: Based on our method, we can also quantify the relative significance of an

110: interaction to the integrity of the network. We observe that the

111: interaction data from different measurement techniques, such as

112: immunoprecipitation (IP) and the two-hybrid test, give distinct

113: distributions.  This suggests that various experimental

114: techniques for probing the protein interaction might explore

115: different regions of the network.

116:

117:

118:

119: \section{Method and Materials}

120: \label{sec:method}

121:

122: \subsection{Bond-percolation on Graph}

123: Given any two nodes in a network, the strength of their connectivity

124: can be estimated in different ways. Some of these measures are local.

125: For example, we can ask whether any two nodes are directly linked, how

126: many common neighbors they share~\cite{Samanta}, {\it etc}.  We can

127: also ask how local properties of a node, such as the degree of links,

128: associate with its function and its importance in the

129: network~\cite{Jeong}.  Furthermore, information about the correlations

130: between nodes involving nonlocal properties, such as the length of the

131: shortest path and clustering structures, will enable us to uncover

132: hidden features buried within the massive data. Here, we present a

133: generic approach that extracts useful information about a node beyond

134: its local connections.

135:

136: Correlations between two nodes may come from numerous other short

137: paths rather than just the shortest path.  A reasonable estimate of

138: correlation should take into account the number and lengths of

139: different paths between two nodes.  One possible way to estimate such

140: correlation between two nodes is to repeatedly remove some fraction

141: $q$ of the links in the network chosen randomly and check whether they

142: still remain connected.  The probability that they remain

143: connected increases with the number of short paths between them

144: and decreases with the length of those paths.  This

145: probability provides a good measurement of the correlation between two

146: nodes that includes the information regarding the non-local topology

147: of the network.  The described process of finding the correlation

148: between two nodes in a network is equivalent to the bond-percolation

149: model in statistical mechanics~\cite{percolation}.

150:

151: Mathematically, a network is treated in the language of graph theory,

152: where a node is denoted as a vertex and a link as an edge.  Given a

153: graph $G$ with vertices $V$ and edges $E$, a percolation configuration

154: is realized as follows.  Each edge $e_{ij}$ linking vertices $i$ and

155: $j$ is assigned a random number $p_{ij}$ distributed uniformly from 0

156: to 1.  If this random number is greater than $p = 1 - q$, a given

157: percolation probability, then the edge is eliminated from the original

158: graph.  The final graph $G^\prime$ consists of the edge set $E^\prime

159: = E - \bar{E}$, where $\bar{E}$ is the set of edges for which $p_{ij} > p$

160: and $E^\prime$ consists of those edges with $p_{ij} \le p$.  Assuming that

161: $G$ is connected, the reduced graph $G^\prime$ may or may not remain a

162: single connected component depending on $p$.
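
As a minimal illustration of how the number and the lengths of paths enter (a sketch under simplifying assumptions, not a general formula), recall that each edge is retained independently with probability $p$.  A given path of $\ell$ edges therefore survives with probability $p^{\ell}$.  If two vertices are joined only by two edge-disjoint paths of lengths $\ell_1$ and $\ell_2$ (symbols introduced here for illustration), the probability $P_{\rm conn}$ that they remain connected is
\[
  P_{\rm conn} = 1 - \left(1 - p^{\ell_1}\right)\left(1 - p^{\ell_2}\right)
               = p^{\ell_1} + p^{\ell_2} - p^{\ell_1+\ell_2},
\]
which grows with the number of paths and falls off geometrically with their lengths.  Paths that share edges do not survive independently, so no such closed form is available for a general graph and the probability has to be estimated by sampling.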

163:

164:

165: \subsection{Susceptibility}

166: The first step in applying the algorithm is to determine the appropriate

167: value of the probability $p$. If $p$ is near one, then we only produce

168: totally connected graphs.  If $p$ is too close to zero, then the network

169: is split into individual vertices and small clusters. An intermediate value of

170: $p$ provides information about the non-local properties of the network.

171:

172: The degree of fragmentation in the graph $G^\prime$ can be quantified

173: by the order parameter $m(p)$, the ratio of the largest connected

174: component to the total graph size.  It is defined as $m(p) = N_{\rm

175:   max}/|V|$, where $N_{\rm max}$ is the number of vertices of the

176: largest connected component and $|V|$ is the total number of vertices.

177: For a connected graph $G$, $m(p)$ varies from $1/|V|$ to 1 as $p$

178: changes from 0 to 1.  Here, $m$ is a stochastic variable, whose

179: fluctuation is defined by

180: \begin{equation}

181:   \chi(p) = \langle (m - \langle m \rangle)^2 \rangle^{\frac{1}{2}}

182: \end{equation}

183: The brackets denote the ensemble average, which is the average over

184: many different realizations of $G^\prime$.  The curve of $\chi(p)$

185: reveals certain aspects of the graph topology.  For example, if $G$ is

186: a regular two-dimensional square lattice, then $\chi$ diverges with

187: power-law behavior as a function of $p-p_{\rm c}$, with $p_{\rm

188:   c}=1/2$.  For other types of regular lattices, like triangular

189: lattices or higher dimensional lattices, $p_{\rm c}$ and/or the power

190: law exponent also change.  A maximum in $\chi(p)$ occurs at the

191: transition point $p_{\rm c}$, indicating a phase transition and

192: critical behavior~\cite{percolation}.  At this critical point, the

193: distribution of the sizes of the connected clusters decays as a power

194: law. Choosing a value of $p$ near this critical value, we get the most

195: non-local information regarding the network.
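
In practice the ensemble averages are estimated from a finite sample.  As a sketch, assume $R$ independent realizations of $G^\prime$ at a fixed $p$, with $m_r$ the value of $m$ in realization $r$ (the symbols $R$, $m_r$ and $\bar{m}$ are introduced here only for illustration).  Then
\[
  \bar{m} = \frac{1}{R}\sum_{r=1}^{R} m_r , \qquad
  \chi(p) \approx \left[ \frac{1}{R}\sum_{r=1}^{R}
            \left(m_r - \bar{m}\right)^2 \right]^{1/2} .
\]
Evaluating this estimate on a grid of $p$ values and locating its maximum gives a working value for $p_{\rm c}$.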

196:

197: \subsection{Correlations and the definition of $v_i$}

198: Whether two arbitrary vertices $i$ and $j$ remain connected in

199: $G^\prime$ can provide more detailed information about $G$.  If two

200: vertices retain their connection, it means that there exist paths in

201: $E^\prime$ from vertex $i$ to vertex $j$.  Define $\delta_{ij}$ as a

202: function of a pair of vertices $i$ and $j$ such that $\delta_{ij} = 1$

203: if vertices $i$ and $j$ are connected, and $\delta_{ij} = 0$

204: otherwise.  The percolation correlation $c_{ij}$ is then defined as the

205: ensemble average of $\delta_{ij}$,

206: \begin{equation}

207:   c_{ij} = \langle \delta_{ij} \rangle.

208: \end{equation}

209:

210: With knowledge of the $c_{ij}$, we are equipped to

211: measure how strongly a vertex $i$ links to the rest of the network

212: counting both direct and indirect connections to vertex $i$.

213: We define the quantity $v_i$ for vertex $i$,

214: \begin{equation}

215:   v_i = \frac{1}{|V|} \sum_{j \in V} c_{ij}

216: \end{equation}

217: This value is sensitive not only to the linking degree at each vertex

218: but also to higher order connections between a vertex and the rest of

219: the random graph.  Thus, $v_i$ effectively ranks the importance of a

220: vertex in the graph. Intuitively, $v_i$ may be interpreted as the

221: fraction of other vertices to which vertex $i$ remains linked, if each

222: edge is broken with probability $q = 1 - p$ in the graph $G$.  In

223: Fig.~\ref{fig:smallnet}, we show the descending ranking order of the

224: $v_i$'s for a small graph.
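
The definition of $v_i$ can be rewritten in a form that is convenient for computation.  As a sketch, let $s_i$ (our notation) denote the number of vertices in the connected component of $G^\prime$ that contains vertex $i$.  Since $\delta_{ij}=1$ exactly for the vertices $j$ in that component, and taking $\delta_{ii}=1$,
\[
  \sum_{j\in V}\delta_{ij} = s_i , \qquad
  v_i = \frac{\langle s_i \rangle}{|V|} .
\]
Hence a single labeling of the connected components of each realization yields the contribution to $v_i$ for every vertex at once, without enumerating pairs of vertices.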

225:

226:

227: \subsection{The definition of $\beta_{ij}$}

228: Using a similar idea, we can define a quantity that allows us to check

229: the influence of an edge on the graph integrity.  The elimination of

230: some edges may fundamentally change the connectivity properties

231: whereas the graph topology may be relatively unchanged against the deletion

232: of others.  For example, for a small fully connected subgraph, termed

233: a clique, removal of a certain number of edges between the vertices of

234: the subgraph tends not to separate the graph into disconnected pieces.

235: Individual links in the subgraph do not play crucial roles in

236: supporting the integrity of the subgraph and the whole graph.  We

237: define the quantity $\beta_{ij}$ to monitor the importance of edge

238: $e_{ij}$ to the integrity of the graph,

239: \begin{widetext}

240: \begin{equation}

241:   \beta_{ij} = \frac{1}{|V|^2} \sum_{l,m\in V}

242:   \l(c_{lm}\l(G^\prime \cup \{e_{ij}\}\r) - c_{lm}\l(G^\prime \setminus \{e_{ij}\}\r)\r).

243: \end{equation}

244: \end{widetext}

245: The first term in the summation is the correlation $c_{lm}$ measured with

246: $e_{ij}$ added to $G^\prime$, independent of $p_{ij}$ and $p$.  The second

247: term is $c_{lm}$ measured with $e_{ij}$ removed from $G^\prime$.  The

248: difference in measurement of $c_{lm}$ under the presence or absence of

249: edge $e_{ij}$ allows us to distinguish edges.  For example, if

250: $e_{ij}$ bridges two clusters, then $\beta_{ij}$ will be elevated

251: (note the edges 1, 2 and 3 in Fig.~\ref{fig:smallnet}).  Suppose edge

252: $e_{ij}$ connects two disjoint connected components $A$ and $B$ with

253: sizes $n_{\rm A}$ and $n_{\rm B}$.  Then, in a realization of

254: $G^\prime$, the contribution to $\beta_{ij}$ is the difference between

255: $\sum_{l,m\in A\cup B} \delta_{lm} = (n_{\rm A}+n_{\rm B})^2$ and $\sum_{l,m\in

256:   A} \delta_{lm} + \sum_{l,m\in B} \delta_{lm} = n_{\rm A}^2+n_{\rm B}^2$.

257: Namely, the contribution to $\beta_{ij}$ is proportional to $n_{\rm

258:   A}n_{\rm B}$.  However, if $e_{ij}$ is embedded within a connected

259: component such that adding or removing $e_{ij}$ does not perturb the

260: component's connectivity, then $e_{ij}$ is redundant and does not

261: contribute to $\beta_{ij}$.  With this interpretation, $\beta_{ij}$

262: measures how well $e_{ij}$ succeeds in connecting different large

263: components or modules.
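
The two-component argument can be written compactly.  As a sketch, let $A$ and $B$ denote the components that contain $i$ and $j$, respectively, in $G^\prime \setminus \{e_{ij}\}$.  When $A \neq B$, the contribution of the realization to $\beta_{ij}$ is
\[
  \frac{1}{|V|^2}\left[ (n_{\rm A}+n_{\rm B})^2 - n_{\rm A}^2 - n_{\rm B}^2 \right]
  = \frac{2\, n_{\rm A} n_{\rm B}}{|V|^2} ,
\]
and it vanishes when $i$ and $j$ are already connected without $e_{ij}$.  For instance, an edge joining components of 10 and 20 vertices adds $2\cdot 10\cdot 20 = 400$ connected ordered pairs, whereas an edge inside a single component adds none.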

264:

265: \begin{figure*}[htbp]

266:   \includegraphics[width=6in]{smallnet.eps}

267:   \caption{We applied our algorithm with $p=0.43$ on a small graph.

268:     The vertices are indexed in the descending order of $v$ and the

269:     parenthesized numbers indicate the degree of connection.  Some

270:     vertices, like vertex 3, have few neighbors but outrank, in

271:     terms of $v_i$, other vertices with more neighbors.  Vertices

272:     with the same degree of connectivity might be ranked very

273:     differently because they have different numbers of next-nearest

274:     neighbors.  The eighteen edges with the largest $\beta_{ij}$ are shown

275:     in gray and are ranked.  If we remove these edges, the graph is

276:     severed into several compact subgraphs.  The edges carrying the

277:     largest $\beta_{ij}$ tend to link different large components.  The

278:     edges within a clique, like that of vertices 5, 4, 9, 13, and 14, have the

279:     smallest $\beta_{ij}$.}

280:   \label{fig:smallnet}

281: \end{figure*}

282:

283: \subsection{Protein interaction data}

284: Here, we apply the described method on the yeast protein interaction

285: data taken from the Database of Interacting

286: Proteins (DIP)~\cite{Deane}.  The dataset contains 14871 interactions

287: between 4692 proteins\footnote{We used the files ``yeast20020901.lst''

288:   and ``dip20020616.xin'', downloaded from the DIP database

289:   ({\tt http://dip.doe-mbi.ucla.edu/}).} and includes interactions measured

290: by different experimental methods.  We treat the interaction network

291: as an undirected graph, with the proteins as vertices. If two proteins

292: are interaction partners in the dataset, the corresponding vertices

293: are joined by an edge.

294:

295:

296: \section{Results and Discussions}

297: \label{sec:DIP}

298:

299: \subsection{Determination of $p$}

300: As a first step in applying this stochastic method on the protein

301: interaction network, we need to determine the appropriate value of $p$. If

302: $p$ is near one, then we will only produce totally connected graphs.

303: If $p$ is too close to zero, then we will only obtain information

304: about small clusters. Some intermediate value of $p$ will give us

305: global properties of the network.

306:

307: In order to determine the proper value of $p$, we need to compute the

308: curve $\chi(p)$.  Such a curve for the DIP data is shown in

309: Fig.~\ref{fig:sus}.  The curve peaks at about $p=0.07$, where the size

310: fluctuations of the largest cluster are maximal.  Most realizations of

311: the percolation graph $G^\prime$ in the neighborhood of this peak

312: yield sparse but still predominantly connected graphs.  Accordingly,

313: computing $v_i$ and $\beta_{ij}$ around this peak in $\chi(p)$ avoids

314: finite-size effects at smaller $p$ and the loss of resolution at

315: larger $p$.
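
As a rough consistency check on this value (a sketch that assumes an uncorrelated random graph with the same degree distribution, which the protein network need not be), the bond-percolation threshold of such a graph is
\[
  p_{\rm c} = \frac{\langle k \rangle}{\langle k^2 \rangle - \langle k \rangle} ,
\]
where $\langle k \rangle$ and $\langle k^2 \rangle$ are the first two moments of the degree distribution.  The DIP dataset has mean degree $\langle k \rangle = 2|E|/|V| = 2\times 14871/4692 \approx 6.3$, so a graph with a Poisson degree distribution, for which $\langle k^2 \rangle = \langle k \rangle^2 + \langle k \rangle$, would have $p_{\rm c} = 1/\langle k \rangle \approx 0.16$.  A peak near $p=0.07$ is therefore what one would expect from a much broader degree distribution, with $\langle k^2 \rangle/\langle k \rangle \approx 1 + 1/0.07 \approx 15$ under this assumption.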

316:

317: \begin{figure}[htbp]

318:   \includegraphics[width=3in]{sus.eps}

319:   \caption{Susceptibility curve of the parameter $m$.  The curve

320:     peaks at $p=0.07$, where the fluctuations of $m$ are greatest.}

321:   \label{fig:sus}

322: \end{figure}

323:

324:

325: \subsection{Distribution of $v_i$}

326: We gathered our data from $10^5$ realizations of the graph at $p =

327: 0.07$.  The distribution of $\log(v_i)$ for the protein interaction

328: network is shown in Fig.~\ref{fig:hist_vi}. We also report the

329: distribution for the subset comprising only the essential

330: proteins\footnote{We obtained the list of essential proteins from the

331:   Saccharomyces Genome Deletion Project~\cite{YeastDel}

332:   ({\tt http://yeastdeletion.stanford.edu/}).}.

333: The distribution of $v_i$ for essential proteins significantly differs

334: from the background distribution and is biased toward greater $v_i$.

335: A protein with a greater $v_i$ ties to the network more strongly than

336: a protein possessing a smaller $v_i$.  Therefore, we would predict

337: that removing from yeast a protein with a greater $v_i$ harms more

338: biologically important pathways and would thereby be more likely to

339: destroy viability.  The percentage of proteins with a given $v_i$

340: that are essential, (number of essential proteins with a given

341: $v_i$)/(number of proteins with that $v_i$), is shown in

342: Fig.~\ref{fig:corr-ess-v}.  This percentage has strong positive

343: correlation with $v_i$, in agreement with the prediction.

344:

345:

346: \begin{figure}[htbp]

347:   \includegraphics[width=3in]{his_log_vi.eps}

348:   \caption{Histogram of $\log(v_i)$.  The distribution of $v_i$ for

349:     essential proteins is skewed toward larger $v$.}

350:   \label{fig:hist_vi}

351: \end{figure}

352:

353:

354: \begin{figure}[htbp]

355:   \includegraphics[width=3in]{corr-ess-v.eps}

356:   \caption{The percentage of proteins which are essential as a

357:     function of $v_i$. }

358:   \label{fig:corr-ess-v}

359: \end{figure}

360:

361:

362: What are the specific connectivity properties that produce a large

363: $v_i$ for a specific protein?  To a first-order approximation, $v_i$

364: is proportional to $k_i$, the degree of connectivity of the $i^{\rm th}$

365: protein, since a protein with $k_i$ interactions remains connected,

366: on average, to at least $p\cdot k_i$ proteins after percolation.

367: However, the protein interaction network

368: displays small world properties\footnote{The graph diameter (the

369:   maximum amongst all the shortest paths between all pairs of

370:   vertices) of the protein interaction network is 12. The average shortest-path

371:   length between any two proteins is 4.23.}.  Therefore,

372: the correction to $v_i$ from higher order connections should be

373: included.  For example, if the number of next-nearest neighbors of a

374: protein is much greater than the number of nearest neighbors, then the

375: contribution from the next-nearest neighbors is comparable to that of

376: the nearest neighbors.  In such a case, the proteins with the same

377: $k_i$ have a broad distribution of $v_i$ as in our results.  The value

378: of $v_i$ gives more extensive information about the protein's

379: connectivity in the network beyond that of its nearest neighbors.
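
The size of the higher-order correction can be gauged with a simple bound.  As a sketch, let $k_i$ be the number of nearest neighbors of protein $i$ and $n_i^{(2)}$ (our notation) the number of its next-nearest neighbors.  Two vertices remain connected at least as often as any single path between them survives, so
\[
  v_i \ge \frac{1}{|V|}\left( 1 + p\, k_i + p^2\, n_i^{(2)} \right) ,
\]
where the leading 1 is the $j=i$ term of the sum.  The second-order term matches the first-order one when $n_i^{(2)} \approx k_i/p$, that is, about $14\,k_i$ at $p=0.07$.  Taking the values for SRP1 from Table~\ref{tab:pList1} ($k_i=196$, with $|V|=4692$), the terms up to first order give $(1+0.07\times 196)/4692 \approx 0.0031$, well below the measured $v_i = 0.0623$, so most of its weight comes from paths longer than a single edge.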

380:

381: Our method is advantageous because we can identify important proteins

382: that might otherwise not be considered significant because they have

383: lower first-order interaction degree.  Such proteins probably control

384: other essential proteins through a few critical interactions.  To

385: illustrate the power of this approach compared to merely counting the

386: nearest neighbor degree of interactions, we rank the proteins by $v_i$

387: and compare the result to the ranking by $k_i$ (see

388: Table~\ref{tab:compare}).  For example, 61\% of the proteins in the

389: top 2\% of $v_i$ are essential, whereas only 52\% of the proteins in

390: the top 2\% of $k_i$ are required for viability.  Such a result

391: suggests the essential proteins with higher $v_i$ not only have more

392: interactions but are also more likely to interact more frequently with

393: other proteins, which also tend to be essential.  A similar

394: observation has been reported by Gavin, {\it et al.}~\cite{Gavin}, and

395: our independent evidence supports their hypothesis.

396:

397: \begin{table}[htbp]

398:   \begin{tabular}{|c||c|c|c|}

399:     \hline

400:     All Proteins & \multicolumn{3}{c|}{Essential Proteins}\\

401:     \hline

402:     \hline

403:     Percentile & by $v_i$ & by $k_i$ & by $v_i$ (randomize)\\

404:     \hline

405:     2\%(94) & 61\% & 52\% & 53\% \\

406:     5\%(234) & 53\% & 47\% &  50\% \\

407:     10\%(469) & 48\% & 46\% & 48\% \\

408:     25\%(1173) &39\% & 38\% & 38\% \\

409:     \hline

410:   \end{tabular}

411:

412:   \caption{The percentage of essential proteins in

413:     selected percentiles ranked by $v_i$ and the degree of connection

414:     $k_i$.  Of the top 94 proteins ranked by $v_i$, 61\%

415:     are essential, whereas only 52\% of the top 94 are essential when

416:     ranked by $k_i$.  The last column is a control in which the $v_i$ are

417:     recalculated for a (quasi-)randomized graph in which edges have

418:     been swapped while retaining the degrees of connection of all vertices in

419:     the original graph. Identifying essential proteins by calculating

420:     $v_i$ performs consistently better than only computing $k_i$,

421:     demonstrating the significance of nonlocal structure beyond

422:     that of nearest neighbor relations.  If we randomly perturb the

423:     global graph structure, the ability to identify essential proteins

424:     drops, even though the degree of connection at each vertex is unchanged.}

425:   \label{tab:compare}

426: \end{table}

427:

428: The ten proteins with the highest $v_i$ are listed in

429: Table~\ref{tab:pList1}.  The full list of proteins with their $v_i$

430: can be found on the supplemental web site\footnote{\tt

431: http://www.nas.nasa.gov/Groups/SciTech/nano/msamanta/projects/percolation/index.php}.

432: A selection of a few essential proteins with high $v_i$ but low $k_i$ is

433: also shown in Table~\ref{tab:pList2}.

434:

435: \begin{table}[htbp]

436:   \begin{tabular}{|c|c|c|c|}

437:     \hline

438:     protein & $v_i$ & $k_i$ & viability \\

439:     \hline

440:     \hline

441:       SRP1 & 0.0623 & 196 & inviable \\

442:       TEM1 & 0.0531 & 115 & inviable \\

443:       JSN1 & 0.0524 & 282 & viable \\

444:       YDL213C & 0.0516 & 58 &  viable\\

445:       CKA1 & 0.0513 & 65 & viable \\

446:       NUP116 & 0.0505 & 146 & inviable \\

447:       ERB1 & 0.0494 & 55 & inviable \\

448:       HHF1 & 0.0486 & 74 & viable \\

449:       NOP2 & 0.0479 & 48 & inviable \\

450:       CDC95 & 0.0475 & 48 & viable\\

451:     \hline

452:   \end{tabular}

453:   \caption{The ten proteins with the highest $v_i$.}

454:   \label{tab:pList1}

455: \end{table}

456:

457: \begin{table}[htbp]

458:   \begin{tabular}{|c|c|c||c|c|c|}

459:     \hline

460:     $k_i$ & protein & $v_i$ & $k_i$ & protein & $v_i$ \\

461:     \hline

462:     \hline

463:       & UTP8 & 0.0084      &    & MAK11 & 0.0127 \\

464:       & YKL088W & 0.0081   &    & BMS1 & 0.0124  \\

465:     3 & DYS1 & 0.0075      &  5 & YPR144C & 0.0117 \\

466:       & TRL1 & 0.0070      &    & ACS2 & 0.0113   \\

467:       & GRS1 & 0.0068      &    & DIP2 & 0.0112   \\

468:     \hline

469:       & RLP24 & 0.0115     &    & NOP14 & 0.0133  \\

470:       & ROK1 & 0.0106      &    & NOC3  & 0.0131   \\

471:     4 & SPB4 & 0.0101      &  6 & SEN1 & 0.0124   \\

472:       & MES1 & 0.0094      &    & YLL034C &0.0123  \\

473:       & SEC18 & 0.00868    &    & DIB1 & 0.0110   \\

474:     \hline

475:   \end{tabular}

476:   \caption{A selection of a few essential proteins with

477: high $v_i$ but low $k_i$.}

478:   \label{tab:pList2}

479: \end{table}

480:

481:

482: \subsection{Distribution of $\beta_{ij}$}

483: The interactions in the network can be grouped by the experimental

484: methods used to detect them.  We score each interaction within the

485: network by $\beta_{ij}$.  The distribution of

486: $\log(\beta_{ij})$ (Fig.~\ref{fig:h_beta}) provides a means to

487: detect differences amongst different subsets of interactions obtained

488: by varied experimental methods.  In Fig.~\ref{fig:h_beta}, we compare

489: the distribution of $\log(\beta_{ij})$ from the whole network to

490: the distributions derived from several subsets of the network.  First, we

491: use the core set of interactions derived

492: by Deane {\it et al.}~\cite{Deane}.  Interactions in the core set are

493: statistically verified to reduce the false positive rate, yielding

494: 1925 interactions (excluding self-interacting pairs).  The

495: distribution of $\log(\beta_{ij})$ for the core set is similar to that

496: obtained for the entire network.  However, upon comparing the

497: distribution of $\log(\beta_{ij})$ for subsets of those interactions

498: obtained from different experimental procedures, differences emerge.

499: For example, interactions measured by immunoprecipitation tend to

500: have a larger $\beta_{ij}$, so that the distribution of

501: $\log(\beta_{ij})$ of this subset shifts to the right.  In contrast,

502: the distribution for the subset of interactions measured with

503: high-throughput two-hybrid tests displays the opposite trend.

504:

505: \begin{figure}[htbp]

506:   \includegraphics[width=3in]{h_beta.eps}

507:   \caption{Normalized distributions of $\log(\beta_{ij})$ for

508:     different subsets of interactions. The solid line represents the

509:     distribution for all interactions in the data.  The dotted line

510:     corresponds to the core set extracted by Deane, {\it et

511:       al.}~\cite{Deane}.  The short dashed line refers to interactions

512:     obtained by immunoprecipitation, and the long dashed line

513:     represents the subset of interactions derived from high-throughput

514:     two-hybrid tests.}

515:   \label{fig:h_beta}

516: \end{figure}

517:

518: If $e_{ij}$ is the only edge linking two clusters, the contribution of

519: a particular realization of the percolation procedure to $\beta_{ij}$

520: is proportional to the product of the sizes of the two clusters.  Hence,

521: an edge with a greater $\beta_{ij}$ has a greater tendency to link two

522: large modules or clusters in the network.  With this notion in mind,

523: an examination of Fig.~\ref{fig:h_beta} suggests that the IP method is

524: possibly more sensitive to interactions between proteins in different

525: large modules while the two-hybrid tests are better suited to

526: detecting interactions which tend not to link larger modules.

527:

528: The discrepancy between the IP method and the two-hybrid tests might

529: reflect the underlying biochemical differences between the two

530: methods.  Unlike IP, the two-hybrid test is an {\it in vivo}

531: technique, and thus it can detect transient and unstable

532: interactions~\cite{Mering}.  Our analysis of the distribution of

533: $\log(\beta_{ij})$ for the two-hybrid data is a quantitative

534: demonstration that these transient and unstable interactions

535: contribute less to the integrity of the interaction network.

536:

537:

538: \section{Conclusion}

539: \label{sec:con}

540:

541: We presented a stochastic algorithm that explored the global

542: connectivity properties of a protein interaction network.  This

543: percolation-based algorithm allowed us to assign weights to vertices and

544: edges according to non-local topological properties.  We applied the

545: algorithm to the protein interaction network for yeast and found that

546: the percentage of essential proteins correlated strongly with $v_i$.

547: Importantly, the values of $v_i$, which incorporated the knowledge of

548: connections beyond the nearest neighbors, could more successfully

549: discriminate essential proteins than a method based solely on local

550: connections.  In addition, the essential proteins with greater $v_i$

551: not only possessed more interactions with any other proteins but also

552: displayed more interactions with other {\em essential} proteins.  This

553: result suggested that essential proteins along with other proteins

554: having greater $v_i$ might form a ``core network'' with a higher

555: density of interactions within the ``core network'' than the

556: background network.  If this unverified hypothesis is confirmed, then

557: we would gain significant insight into the evolution of a protein

558: interaction network.  Are the proteins in this ``core network'' in

559: general more evolutionarily conserved than others?  Fraser {\it et al.}

560: claimed that there is significant negative correlation between each

561: protein's degree of connectivity and protein evolutionary rate, and

562: that evolutionary change may occur largely by coevolution~\cite{Fraser}.

563: If this is indeed so, we expect a stronger correlation between $v_i$ and

564: protein evolutionary rate, since $v_i$ provides a better resolution

565: than the degree of connectivity for proteins' positions in their

566: interaction network.

567:

568: The $\beta_{ij}$ scores for interactions could distinguish

569: between different experimental methods for measuring protein

570: interactions.  Such a quantitative measure of the distinction amongst

571: the experimental approaches will aid the interpretation of the

572: proteomic data.

573:

574: In principle, $c_{ij}$ can be calculated exactly given a percolation

575: probability $p$.  However, this would require recursive iterations

576: over all possible sub-graphs.  Our stochastic approach efficiently

577: approximates the exact values of $c_{ij}$, $v_i$ and

578: $\beta_{ij}$.  In this work, we model the interaction network as a

579: static graph with uniform weight on each edge.  For a biological

580: system, dynamical aspects need to be incorporated.  Various

581: experimental methods for probing the physical interactions between

582: proteins respond differently to the dynamics of biological systems.

583: The two-hybrid test is more sensitive to transient interactions while

584: the IP method is more sensitive to large and stable protein complexes.

585: These differences might reflect different dynamical aspects of

586: the interaction network.

587:
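To illustrate why the exact calculation is costly, the following sketch writes out both the exact sum and its sampled estimate.  It assumes that $c_{ij}$ denotes the probability that vertices $i$ and $j$ remain connected when each edge of the graph is retained independently with probability $p$; the symbols $E$, $E'$, $\chi_{ij}$ and $N$ are notation introduced here for illustration only.  Under that assumption, with $E$ the edge set of the network,
\begin{equation}
  c_{ij} = \sum_{E' \subseteq E} p^{|E'|} (1-p)^{|E|-|E'|} \, \chi_{ij}(E'),
\end{equation}
where $\chi_{ij}(E') = 1$ if $i$ and $j$ are joined by a path that uses only edges in $E'$, and $\chi_{ij}(E') = 0$ otherwise.  The sum runs over $2^{|E|}$ edge subsets, which is why exact evaluation becomes impractical for a large network.  A sampling estimate draws $N$ independent subsets $E'_1, \dots, E'_N$ from the same distribution and averages the indicator,
\begin{equation}
  \hat{c}_{ij} = \frac{1}{N} \sum_{n=1}^{N} \chi_{ij}(E'_n),
\end{equation}
with a standard error of $\sqrt{c_{ij}(1-c_{ij})/N}$, so the statistical accuracy is set by the number of samples rather than by the number of sub-graphs.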

588: With regard to future pursuits, we note that it is also possible to

589: use $\beta_{ij}$ to cluster vertices within a random graph.  The

590: $\beta_{ij}$ score for a random graph is similar to the edge

591: ``betweenness'', defined as the number of shortest paths between all

592: pairs of vertices passing through a given edge.  An edge with a

593: greater $\beta_{ij}$ is likely also an edge with a greater edge

594: ``betweenness'', because such an edge has a strong tendency to bridge two

595: different clusters or modules.  Clustering utilizing edge

596: ``betweenness'' has been successfully applied to certain types of

597: random networks~\cite{Newman}.  We expect that results similar to those

598: shown in Fig.~\ref{fig:smallnet} could be achieved with $\beta_{ij}$

599: not only for this small test graph but more significantly for larger

600: graphs in which the computational cost of calculating edge

601: ``betweenness'' is prohibitive.  For the present, however, the idea of

602: percolation on random networks provides a natural mechanism for

603: revealing dominant cluster structure within a graph.  We hope such

604: natural cluster structure will provide further details about the

605: protein interaction network.

606:
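For reference, the edge ``betweenness'' described above can be written explicitly.  The notation is introduced here only for illustration: $\sigma_{st}(e)$ is the number of shortest paths between vertices $s$ and $t$ that pass through edge $e$, and $\sigma_{st}$ is the total number of shortest paths between $s$ and $t$.  The count of shortest paths through $e$ is then
\begin{equation}
  B(e) = \sum_{s < t} \sigma_{st}(e).
\end{equation}
A commonly used variant gives each pair of vertices unit total weight when several shortest paths exist,
\begin{equation}
  B'(e) = \sum_{s < t} \frac{\sigma_{st}(e)}{\sigma_{st}}.
\end{equation}
Either form requires shortest-path information for all pairs of vertices, which is the source of the computational cost noted above for large graphs.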

607:

608: \acknowledgements{ We thank Hao Li and Shoudan Liang for fruitful

609:   discussion.  C.~S.~Chin would also like to thank Yigal Nochomovitz for

610:   critical reading of the manuscript.  C.~S.~Chin is supported by a Sandler

611:   Opportunity Grant.  M.~P.~Samanta is supported by NASA contract

612:   DTTS59-99-D-00437/A61812D to CSC.}

613:

614:

615: \begin{thebibliography}{10}

616:

617: \bibitem{Zhu2}

618: Zhu, H et~al.

619: \newblock (2000) {\em Nature Genet.} {\bf 26}, 283--289.

620:

621: \bibitem{Uetz}

622: Uetz, P et~al.

623: \newblock (2000) {\em Nature} {\bf 403}, 623--627.

624:

625: \bibitem{Ito}

626: Ito, T, Chiba, T, Ozawa, R, Yoshida, M, Hattori, M,  \& Sakaki, Y.

627: \newblock (2001) {\em Proc. Natl. Acad. Sci. USA} {\bf 98}, 4569--4574.

628:

629: \bibitem{Gavin}

630: Gavin, A.~C et~al.

631: \newblock (2002) {\em Nature} {\bf 415}, 141--147.

632:

633: \bibitem{Ho}

634: Ho, Y et~al.

635: \newblock (2002) {\em Nature} {\bf 415}, 180--183.

636:

637: \bibitem{Hartwell}

638: Hartwell, L.~H, Hopfield, J.~J, Leibler, S,  \& Murray, A.~W.

639: \newblock (1999) {\em Nature} {\bf 402}, C47--C52.

640:

641: \bibitem{Jeong}

642: Jeong, H, Mason, S.~P, Barabasi, A.-L,  \& Oltvai, Z.~N.

643: \newblock (2001) {\em Nature} {\bf 411}, 41--42.

644:

645: \bibitem{Maslov}

646: Maslov, S \& Sneppen, K.

647: \newblock (2002) {\em Science} {\bf 296}, 910.

648:

649: \bibitem{Mering}

650: Mering, C.~V, Krause, R, Snel, B, Cornell, M, Oliver, S.~G, Fields, S,  \&

651:   Bork, P.

652: \newblock (2002) {\em Nature} {\bf 417}, 399--403.

653:

654: \bibitem{Bader}

655: Bader, G.~D \& Hogue, C.~W.~V.

656: \newblock (2002) {\em Nature Biotechnol.} {\bf 20}, 991--997.

657:

658: \bibitem{YeastDel}

659: Winzeler, E.~A et~al.

660: \newblock (1999) {\em Science} {\bf 285}, 901--906.

661:

662: \bibitem{percolation}

663: Stauffer, D \& Aharony, A.

664: \newblock (1994) {\em Introduction to Percolation Theory}.

665: \newblock (Taylor \& Francis).

666:

667: \bibitem{Samanta}

668: Samanta, M.~P \& Liang, S.

669: \newblock (2003) Redundancy in large-scale protein interaction networks.

670: \newblock in preparation.

671:

672: \bibitem{Deane}

673: Deane, C.~M, Salwinski, L, Xenarios, I,  \& Eisenberg, D.

674: \newblock (2002) {\em Mol. Cell Proteomics} {\bf 1}, 349--356.

675:

676: \bibitem{Fraser}

677: Fraser, H.~B, Hirsh, A.~E, Steinmetz, L.~M, Scharfe, C,  \& Feldman, M.~W.

678: \newblock (2002) {\em Science} {\bf 296}, 750--752.

679:

680: \bibitem{Newman}

681: Girvan, M \& Newman, M.~E.~J.

682: \newblock (2002) {\em Proc. Natl. Acad. Sci. USA} {\bf 99}, 7821--7826.

683:

684: \end{thebibliography}

685:

686:

687: \end{document}

688:

689:

690: