1: \documentclass[pre, preprint,floatfix]{revtex4}
2: \usepackage{graphicx}
3: \bibliographystyle{apsrev}
4: \renewcommand{\r}{\right}
5: \renewcommand{\l}{\left}
6:
7:
8: \begin{document}
9:
10: \title{Global Snapshot of Protein Interaction Network -- A Percolation Based Approach}
11:
12: \author{Chen-Shan Chin}
13:
14: \affiliation{Department of Biochemistry and Biophysics, University of
15: California, San Francisco, 94143, CA, USA}
16:
17: \email{cschin@genome.ucsf.edu}
18:
19: \author{Manoj Pratim Samanta}
20:
21: \affiliation{NASA Advanced Supercomputing Division, NASA Ames Research Center,
Moffett Field, 94035, CA, USA}
23:
24: \email{msamanta@nas.nasa.gov}
25:
26:
27: \date{\today}
28:
29: \begin{abstract}
30: In this paper, we study the large-scale protein interaction network
31: of yeast utilizing a stochastic method based upon percolation of
32: random graphs. In order to find the global features of
33: connectivities in the network, we introduce numerical measures that
34: quantify (1) how strongly a protein ties with the other parts of the
35: network and (2) how significantly an interaction contributes to the
36: integrity of the network. Our study shows that the distribution of
37: essential proteins is distinct from the background in terms of
38: global connectivities. This observation highlights a fundamental
39: difference between the essential and the non-essential proteins in
40: the network. Furthermore, we find that the interaction data
41: obtained from different experimental methods such as
42: immunoprecipitation and two-hybrid techniques possess different
43: characteristics. We discuss the biological implications of these
44: observations.
45: \end{abstract}
46:
47: \maketitle
48:
49:
50: \section{Introduction}
51:
52: Recent availability of a large amount of data from high-throughput
53: experiments~\cite{Zhu2,Uetz,Ito,Gavin,Ho} has brought about a
54: fundamental change in the way we study biological systems. Unlike the
55: traditional methods which relied on probing a single or a few proteins
56: to identify important pathways, it is now becoming possible to
57: describe larger functional `modules'~\cite{Hartwell} and even the
58: global properties of the entire
59: proteome~\cite{Jeong,Maslov,Mering,Bader}. Researchers are attempting
60: to connect large-scale protein interaction data with information from
61: phenotype studies~\cite{Jeong,Maslov}. In one such analysis of data
62: from yeast, Jeong {\it et al.} observed the connectivities of
63: individual proteins in the network to closely follow a power-law
distribution. As in other power-law networks, a positive
correlation existed between a protein's inviability and its
66: connectivity~\cite{Jeong}. In another study, Maslov {\it et al.}
67: observed interesting patterns in the distribution of the links between
68: the nearest neighbors in the network and postulated that such patterns
69: give rise to the specificity and the robustness of the
70: network~\cite{Maslov}.
71:
72: One of the shortcomings of the previous approaches is that they drew
73: conclusions about the global nature of the network from its local
74: connectivity properties. It is unclear whether such local studies
75: based on individual nodes or nearest neighbors fully capture the
76: global picture of the network. For example, some essential proteins,
77: namely, those for which null mutants produce inviable
strains~\cite{YeastDel}, may have few direct links but
still play important roles in the network through the proteins to
80: which they are connected. Such proteins would not be correctly
81: identified by just counting the number of links as in
82: Ref.~\cite{Jeong}. To properly recognize such cases, it is necessary
83: to go beyond the nearest neighbor links. However, it is not clear
84: that the techniques mentioned above can easily be extended to answer
85: such questions.
86:
87: In this paper, we introduce a stochastic method inspired by the
percolation model in statistical mechanics~\cite{percolation} that
89: overcomes the shortcomings of the previous approaches. This method
90: allows us to define a quantity that measures the correlation between
91: any two nodes in the network, taking the topology of the entire
92: network into account. Biologically, such correlations describe the
93: direct and indirect influences of one protein on another through the
94: protein interaction network. If such correlations indeed carry
95: biological significance, we expect the essential proteins to be highly
96: correlated, in general, with the rest of the network. One of our main
97: results is that most essential proteins do possess higher correlations
98: between themselves and the rest of the network. This is consistent
with previous results~\cite{Jeong}, because, to first order, the
correlations we compute are proportional to the connectivities of
101: the proteins. However, we show that it is important to go beyond the
102: first order. Identifying essential proteins by our method performs
103: consistently better than just counting links. Additionally, we
104: observe that the essential proteins interact more tightly with the
105: other essential proteins, thus forming a `network core'. This
106: directly agrees with large-scale experiments probing protein
107: networks~\cite{Gavin}.
108:
109: Based on our method, we can also quantify the relative significance of an
110: interaction to the integrity of the network. We observe that the
111: interaction data from different measurement techniques, such as
immunoprecipitation (IP) and the two-hybrid test, give distinct
113: distributions. This suggests that various experimental
114: techniques for probing the protein interaction might explore
115: different regions of the network.
116:
117:
118:
119: \section{Method and Materials}
120: \label{sec:method}
121:
122: \subsection{Bond-percolation on Graph}
123: Given any two nodes in a network, the strength of their connectivity
124: can be estimated in different ways. Some of these measures are local.
For example, we can ask whether any two nodes are directly linked, how
many common neighbors they share~\cite{Samanta}, {\it etc}. We can
also ask how local properties of a node, such as its degree,
are associated with its function and its importance in the
network~\cite{Jeong}. Furthermore, information about the correlations
130: between nodes involving nonlocal properties, such as the length of the
131: shortest path and clustering structures, will enable us to uncover
132: hidden features buried within the massive data. Here, we present a
133: generic approach that extracts useful information about a node beyond
134: its local connections.
135:
Correlations between two nodes may arise from the many short
paths between them rather than from the shortest path alone. A reasonable
estimate of correlation should take into account the number and lengths of
the different paths between two nodes. One way to estimate such a
correlation is to repeatedly remove a randomly chosen fraction
$q$ of the links in the network and check whether the two nodes
still remain connected. The probability that they remain
connected increases with the number of short paths between them
and decreases with the length of those paths. This
probability provides a good measure of the correlation between two
nodes that includes information about the non-local topology
of the network. The described process of finding the correlation
between two nodes in a network is equivalent to the bond-percolation
model in statistical mechanics~\cite{percolation}.
150:
151: Mathematically, a network is treated in the language of graph theory,
152: where a node is denoted as a vertex and a link as an edge. Given a
153: graph $G$ with vertices $V$ and edges $E$, a percolation configuration
154: is realized as follows. Each edge $e_{ij}$ linking vertices $i$ and
155: $j$ is assigned a random number $p_{ij}$ distributed uniformly from 0
156: to 1. If this random number is greater than $p = 1 - q$, a given
157: percolation probability, then the edge is eliminated from the original
graph. The final graph $G^\prime$ has the edge set $E^\prime
= E - \bar{E}$, where $\bar{E}$ is the set of edges with $p_{ij} > p$,
so that $E^\prime$ consists of those edges with $p_{ij} < p$. Assuming that
161: $G$ is connected, the reduced graph $G^\prime$ may or may not remain a
162: single connected component depending on $p$.
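
As a minimal sketch of the construction above (not the implementation used for the results in this paper), the following Python fragment draws one percolation configuration and labels the connected components of $G^\prime$. The function names {\tt percolate} and {\tt component\_labels}, and the convention that vertices are the integers $0,\ldots,|V|-1$, are assumptions made only for illustration.

```python
import random


def percolate(edges, p, rng=random):
    """Return the surviving edge list E' of one percolation configuration.

    Each edge draws a uniform random number in [0, 1) and is kept only if
    that number is below p, so an edge survives with probability p.
    """
    return [e for e in edges if rng.random() < p]


def component_labels(n_vertices, edges):
    """Label each vertex 0..n_vertices-1 by a representative of its component."""
    parent = list(range(n_vertices))

    def find(x):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj  # merge the two components
    return [find(x) for x in range(n_vertices)]
```

Two vertices are connected in $G^\prime$ exactly when they receive the same label, which is the test needed for the correlations defined later.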
163:
164:
165: \subsection{Susceptibility}
166: The first step in applying the algorithm is to determine the appropriate
167: value of the probability $p$. If $p$ is near one, then we only produce
168: totally connected graphs. If $p$ is too close to zero, then the network
169: is split into individual vertices and small clusters. An intermediate value of
170: $p$ provides information about the non-local properties of the network.
171:
172: The degree of fragmentation in the graph $G^\prime$ can be quantified
173: by the order parameter $m(p)$, the ratio of the largest connected
174: component to the total graph size. It is defined as $m(p) = N_{\rm
175: max}/|V|$, where $N_{\rm max}$ is the number of vertices of the
176: largest connected component and $|V|$ is the total number of vertices.
177: For a connected graph $G$, $m(p)$ varies from $1/|V|$ to 1 as $p$
178: changes from 0 to 1. Here, $m$ is a stochastic variable, whose
179: fluctuation is defined by
180: \begin{equation}
181: \chi(p) = \langle (m - \langle m \rangle)^2 \rangle^{\frac{1}{2}}
182: \end{equation}
183: The brackets denote the ensemble average, which is the average over
184: many different realizations of $G^\prime$. The curve of $\chi(p)$
185: reveals certain aspects of the graph topology. For example, if $G$ is
186: a regular two dimensional square lattice, then $\chi$ diverges with a
187: power law behavior as a function of $p-p_{\rm c}$, for $p_{\rm
188: c}=1/2$. For other types of regular lattices, like triangular
189: lattices or higher dimensional lattices, $p_{\rm c}$ and/or the power
190: law exponent also change. A maximum in $\chi(p)$ occurs at the
191: transition point $p_{\rm c}$, indicating a phase transition and
critical behavior~\cite{percolation}. At this critical point, the
distribution of the sizes of the connected clusters decays as a power
law. Choosing a value of $p$ near this critical value, we get the most
non-local information regarding the network.
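
The order parameter and its fluctuation can be estimated by direct sampling. The sketch below is one way to do so, assuming integer vertex labels and an edge list; the names {\tt largest\_fraction} and {\tt order\_parameter\_stats}, the default number of trials, and the fixed seed are illustrative choices rather than the settings used in this work.

```python
import random
from collections import Counter


def largest_fraction(n_vertices, edges):
    """Return m = N_max / |V| for the graph with the given edge list."""
    parent = list(range(n_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
    sizes = Counter(find(x) for x in range(n_vertices))
    return max(sizes.values()) / n_vertices


def order_parameter_stats(n_vertices, edges, p, trials=1000, seed=0):
    """Estimate <m> and chi(p), the standard deviation of m, by sampling."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        kept = [e for e in edges if rng.random() < p]  # one realization of G'
        samples.append(largest_fraction(n_vertices, kept))
    mean = sum(samples) / trials
    variance = sum((m - mean) ** 2 for m in samples) / trials
    return mean, variance ** 0.5
```

Evaluating {\tt order\_parameter\_stats} on a grid of $p$ values and locating the maximum of the second return value gives an estimate of the peak position of $\chi(p)$.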
196:
197: \subsection{Correlations and the definition of $v_i$}
198: Whether two arbitrary vertices $i$ and $j$ remain connected in
199: $G^\prime$ can provide more detailed information about $G$. If two
200: vertices retain their connection, it means that there exist paths in
$E^\prime$ from vertex $i$ to vertex $j$. Define $\delta_{ij}$ as a
function of a pair of vertices $i$ and $j$ such that $\delta_{ij} = 1$
if vertices $i$ and $j$ are connected in $G^\prime$, and $\delta_{ij} = 0$
204: otherwise. The percolation correlation $c_{ij}$ is then defined as the
205: ensemble average of $\delta_{ij}$,
206: \begin{equation}
207: c_{ij} = \langle \delta_{ij} \rangle.
208: \end{equation}
209:
210: With knowledge of the $c_{ij}$, we are equipped to
211: measure how strongly a vertex $i$ links to the rest of the network
212: counting both direct and indirect connections to vertex $i$.
213: We define the quantity $v_i$ for vertex $i$,
214: \begin{equation}
215: v_i = \frac{1}{|V|} \sum_{j \in V} c_{ij}
216: \end{equation}
217: This value is sensitive not only to the linking degree at each vertex
218: but also to higher order connections between a vertex and the rest of
219: the random graph. Thus, $v_i$ effectively ranks the importance of a
220: vertex in the graph. Intuitively, $v_i$ may be interpreted as the
221: fraction of other vertices to which vertex $i$ remains linked, if each
222: edge is broken with probability $q = 1 - p$ in the graph $G$. In
223: Fig.~\ref{fig:smallnet}, we show the descending ranking order of the
224: $v_i$'s for a small graph.
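
A Monte Carlo estimate of $v_i$ follows directly from the definition, because in each realization $\sum_j \delta_{ij}$ is simply the size of the cluster containing vertex $i$. The following sketch illustrates this under the assumption of integer vertex labels; the function name {\tt estimate\_v} and its default arguments are hypothetical and are not taken from the code used for this paper.

```python
import random
from collections import Counter


def estimate_v(n_vertices, edges, p, trials=1000, seed=0):
    """Monte Carlo estimate of v_i = (1/|V|) sum_j c_ij for every vertex i."""
    rng = random.Random(seed)
    totals = [0] * n_vertices
    for _ in range(trials):
        parent = list(range(n_vertices))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for i, j in edges:
            if rng.random() < p:  # edge survives with probability p
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj
        roots = [find(x) for x in range(n_vertices)]
        sizes = Counter(roots)
        for x in range(n_vertices):
            # sum_j delta_xj equals the size of x's cluster (j = x included)
            totals[x] += sizes[roots[x]]
    return [t / (trials * n_vertices) for t in totals]
```

On a star graph, for example, the central vertex receives a larger estimate than any leaf, since every leaf reaches the rest of the graph only through the center.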
225:
226:
227: \subsection{The definition of $\beta_{ij}$}
228: Using a similar idea, we can define a quantity that allows us to check
229: the influence of an edge on the graph integrity. The elimination of
230: some edges may fundamentally change the connectivity properties
231: whereas the graph topology may be relatively unchanged against the deletion
232: of others. For example, for a small fully connected subgraph, termed
233: a clique, removal of a certain number of edges between the vertices of
234: the subgraph tends not to separate the graph into disconnected pieces.
235: Individual links in the subgraph do not play crucial roles in
236: supporting the integrity of the subgraph and the whole graph. We
237: define the quantity $\beta_{ij}$ to monitor the importance of edge
238: $e_{ij}$ to the integrity of the graph,
239: \begin{widetext}
240: \begin{equation}
241: \beta_{ij} = \frac{1}{|V|^2} \sum_{l,m\in V}
242: \l(c_{lm}\l(G^\prime \cup \{e_{ij}\}\r) - c_{lm}\l(G^\prime \setminus \{e_{ij}\}\r)\r).
243: \end{equation}
244: \end{widetext}
The first term in the summation is the correlation $c_{lm}$ measured with
$e_{ij}$ added to $G^\prime$, independent of $p_{ij}$ and $p$. The second
term is $c_{lm}$ measured with $e_{ij}$ removed from $G^\prime$. The
difference in $c_{lm}$ between the presence and the absence of
edge $e_{ij}$ allows us to distinguish edges. For example, if
$e_{ij}$ bridges two clusters, then $\beta_{ij}$ will be elevated
(note the edges 1, 2 and 3 in Fig.~\ref{fig:smallnet}). Suppose edge
$e_{ij}$ connects two disjoint connected components $A$ and $B$ with
sizes $n_{\rm A}$ and $n_{\rm B}$. Then, in a realization of
$G^\prime$, the contribution to $\beta_{ij}$ is the difference between
$\sum_{l,m\in A\cup B} \delta_{lm} = (n_{\rm A}+n_{\rm B})^2$ and $\sum_{l,m\in
A} \delta_{lm} + \sum_{l,m\in B} \delta_{lm} = n_{\rm A}^2+n_{\rm B}^2$,
which equals $2 n_{\rm A} n_{\rm B}$.
Namely, the contribution to $\beta_{ij}$ is proportional to $n_{\rm
A}n_{\rm B}$. However, if $e_{ij}$ is embedded within a connected
component such that adding or removing $e_{ij}$ does not perturb the
component's connectivity, then $e_{ij}$ is redundant and does not
contribute to $\beta_{ij}$. With this interpretation, $\beta_{ij}$
measures how well $e_{ij}$ succeeds in connecting different large
components or modules.
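
The cluster-size argument suggests a simple sampling scheme for $\beta_{ij}$: percolate every edge except $e_{ij}$, and whenever $i$ and $j$ fall in different clusters $A$ and $B$, add $(n_{\rm A}+n_{\rm B})^2 - n_{\rm A}^2 - n_{\rm B}^2 = 2 n_{\rm A} n_{\rm B}$ to the running sum. The sketch below implements this idea; the function name {\tt estimate\_beta} and its arguments are assumptions for illustration, not the code used to produce the results reported here.

```python
import random


def estimate_beta(n_vertices, edges, target, p, trials=1000, seed=0):
    """Monte Carlo estimate of beta_ij for the edge `target` = (i, j)."""
    rng = random.Random(seed)
    i, j = target
    total = 0
    for _ in range(trials):
        parent = list(range(n_vertices))
        size = [1] * n_vertices  # cluster size, valid at each root

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for a, b in edges:
            if {a, b} == {i, j}:
                continue  # e_ij itself is never percolated; see below
            if rng.random() < p:
                ra, rb = find(a), find(b)
                if ra != rb:
                    parent[ra] = rb
                    size[rb] += size[ra]
        ri, rj = find(i), find(j)
        if ri != rj:
            # Adding e_ij would join clusters A and B:
            # (nA + nB)^2 - nA^2 - nB^2 = 2 * nA * nB newly connected pairs.
            total += 2 * size[ri] * size[rj]
    return total / (trials * n_vertices ** 2)
```

For two triangles joined by a single bridge, the bridge obtains a much larger score than any edge inside a triangle, in line with the behavior described above.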
264:
265: \begin{figure*}[htbp]
266: \includegraphics[width=6in]{smallnet.eps}
\caption{We applied our algorithm with $p=0.43$ to a small graph.
The vertices are indexed in descending order of $v_i$, and the
parenthesized numbers indicate the degree of connection. Some
vertices, like vertex 3, have few neighbors but rank higher
in terms of $v_i$ than other vertices with more neighbors. Vertices
with the same degree of connectivity might be ranked very
differently because they have different numbers of next-nearest
neighbors. The eighteen edges with the largest $\beta_{ij}$ are shown
in gray and are ranked. If we remove these edges, the graph is
severed into several compact subgraphs. The edges carrying the
largest $\beta_{ij}$ tend to link different large components. The
edges within a clique, like that formed by vertices 4, 5, 9, 13, and 14, have the
smallest $\beta_{ij}$.}
280: \label{fig:smallnet}
281: \end{figure*}
282:
283: \subsection{Protein interaction data}
284: Here, we apply the described method on the yeast protein interaction
285: data taken from the Database of Interacting
Proteins (DIP)~\cite{Deane}. The dataset contains 14871 interactions
287: between 4692 proteins\footnote{We used the files ``yeast20020901.lst''
288: and ``dip20020616.xin'', downloaded from DIP database
289: ({\tt http://dip.doe-mbi.ucla.edu/}).} and includes interactions measured
290: by different experimental methods. We treat the interaction network
291: as an undirected graph, with the proteins as vertices. If two proteins
292: are interaction partners in the dataset, the corresponding vertices
293: are joined by an edge.
294:
295:
296: \section{Results and Discussions}
297: \label{sec:DIP}
298:
299: \subsection{Determination of $p$}
300: As a first step in applying this stochastic method on the protein
301: interaction network, we need to determine the appropriate value of $p$. If
302: $p$ is near one, then we will only produce totally connected graphs.
303: If $p$ is too close to zero, then we will only obtain information
304: about small clusters. Some intermediate value of $p$ will give us
305: global properties of the network.
306:
307: In order to determine the proper value of $p$, we need to compute the
308: curve $\chi(p)$. Such a curve for the DIP data is shown in
309: Fig.~\ref{fig:sus}. The curve peaks at about $p=0.07$, where the size
310: fluctuations of the largest cluster are maximal. Most realizations of
311: the percolation graph $G^\prime$ in the neighborhood of this peak
312: yield sparse but still predominantly connected graphs. Accordingly,
computing $v_i$ and $\beta_{ij}$ around this peak in $\chi(p)$ avoids
the finite-size effects at smaller $p$ and the loss of resolution at
larger $p$.
316:
317: \begin{figure}[htbp]
318: \includegraphics[width=3in]{sus.eps}
319: \caption{Susceptibility curve of the parameter $m$. The curve
320: peaks at $p=0.07$, where the fluctuations of $m$ are greatest.}
321: \label{fig:sus}
322: \end{figure}
323:
324:
325: \subsection{Distribution of $v_i$}
326: We gathered our data from $10^5$ realizations of the graph at $p =
327: 0.07$. The distribution of $\log(v_i)$ for the protein interaction
network is shown in Fig.~\ref{fig:hist_vi}. We also report the
distribution for the subset comprising only the essential
proteins\footnote{We obtained the list of essential proteins from the
Saccharomyces Genome Deletion Project~\cite{YeastDel}
({\tt http://yeastdeletion.stanford.edu/}).}.
333: The distribution of $v_i$ for essential proteins significantly differs
334: from the background distribution and is biased toward greater $v_i$.
335: A protein with a greater $v_i$ ties to the network more strongly than
336: a protein possessing a smaller $v_i$. Therefore, we would predict
that removing from yeast a protein with a greater $v_i$ harms more
biologically important pathways and would thereby be more likely to
destroy viability. The percentage of proteins with a given $v_i$
that are essential, i.e., (number of essential proteins with a given
$v_i$)/(number of proteins with the given $v_i$), is shown in
Fig.~\ref{fig:corr-ess-v}. This percentage has a strong positive
correlation with $v_i$, in agreement with the prediction.
344:
345:
346: \begin{figure}[htbp]
347: \includegraphics[width=3in]{his_log_vi.eps}
348: \caption{Histogram of $\log(v_i)$. The distribution of $v_i$ for
349: essential proteins is skewed toward larger $v$.}
350: \label{fig:hist_vi}
351: \end{figure}
352:
353:
354: \begin{figure}[htbp]
355: \includegraphics[width=3in]{corr-ess-v.eps}
356: \caption{The percentage of proteins which are essential as a
357: function of $v_i$. }
358: \label{fig:corr-ess-v}
359: \end{figure}
360:
361:
362: What are the specific connectivity properties that produce a large
$v_i$ for a specific protein? To a first-order approximation, $v_i$
is proportional to the degree of connectivity $k_i$ of the $i^{\rm th}$
protein, since a protein with $k_i$ interactions is usually connected
to at least $p\cdot k_i$ proteins. However, the protein interaction network
displays small-world properties\footnote{The graph diameter (the
maximum amongst all the shortest paths between all pairs of
vertices) of the protein interaction network is 12. The average path
length between any two proteins is 4.23.}. Therefore,
372: the correction to $v_i$ from higher order connections should be
373: included. For example, if the number of next-nearest neighbors of a
374: protein is much greater than the number of nearest neighbors, then the
375: contribution from the next-nearest neighbors is comparable to that of
376: the nearest neighbors. In such a case, the proteins with the same
377: $k_i$ have a broad distribution of $v_i$ as in our results. The value
378: of $v_i$ gives more extensive information about the protein's
379: connectivity in the network beyond that of its nearest neighbors.
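
The first-order statement can be made explicit. As a sketch, assuming $p \ll 1$ and writing $A_{ij}$ for the adjacency matrix of $G$ (a symbol introduced here only for illustration), we have $c_{ii} = 1$ and $c_{ij} = p A_{ij} + O(p^2)$ for $i \neq j$, because a single surviving edge is the only connection of order $p$. Summing over $j$ then gives
\begin{equation}
v_i = \frac{1}{|V|} \left( 1 + p\, k_i + O(p^2) \right),
\end{equation}
where the $O(p^2)$ terms arise from paths of length two and therefore depend on the next-nearest neighbors of vertex $i$. This expansion is only a leading-order guide; for highly connected proteins the higher-order terms need not be small.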
380:
381: Our method is advantageous because we can identify important proteins
382: that might otherwise not be considered significant because they have
383: lower first-order interaction degree. Such proteins probably control
384: other essential proteins through a few critical interactions. To
385: illustrate the power of this approach compared to merely counting the
386: nearest neighbor degree of interactions, we rank the proteins by $v_i$
387: and compare the result to the ranking by $k_i$ (see
388: Table~\ref{tab:compare}). For example, 61\% of the proteins in the
389: top 2\% of $v_i$ are essential, whereas only 52\% of the proteins in
the top 2\% of $k_i$ are required for viability. This result
suggests that the essential proteins with higher $v_i$ not only have more
interactions but are also more likely to interact with
other proteins that themselves tend to be essential. A similar
observation has been reported by Gavin {\it et al.}~\cite{Gavin}, and
our independent evidence supports their hypothesis.
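
The percentile comparison described above reduces to ranking proteins by a score and counting the essential ones near the top. The following sketch shows the bookkeeping, assuming the scores are held in a dictionary and the essential proteins in a set; the function name {\tt essential\_fraction\_in\_top}, the rounding rule for the cutoff, and the toy labels in the usage note are hypothetical.

```python
def essential_fraction_in_top(scores, essential, fraction):
    """Fraction of essential proteins among the top `fraction` ranked by score.

    scores    : dict mapping protein name -> score (for example v_i or k_i)
    essential : set of protein names known to be essential
    fraction  : size of the top group as a fraction of all proteins, e.g. 0.02
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_top = max(1, round(fraction * len(ranked)))  # at least one protein
    top = ranked[:n_top]
    return sum(1 for name in top if name in essential) / n_top
```

With the toy input {\tt scores = \{"P1": 0.9, "P2": 0.8, "P3": 0.1, "P4": 0.05\}} and {\tt essential = \{"P1", "P3"\}}, the top half contains one essential protein out of two. Calling the same function once with $v_i$ and once with $k_i$ as the score gives the two rankings being compared.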
396:
397: \begin{table}[htbp]
398: \begin{tabular}{|c||c|c|c|}
399: \hline
All Proteins & \multicolumn{3}{c|}{Essential Proteins}\\
\hline
\hline
Percentile & by $v_i$ & by $k_i$ & by $v_i$ (randomized)\\
\hline
2\% (94) & 61\% & 52\% & 53\% \\
5\% (234) & 53\% & 47\% & 50\% \\
10\% (469) & 48\% & 46\% & 48\% \\
25\% (1173) & 39\% & 38\% & 38\% \\
409: \hline
410: \end{tabular}
411:
\caption{The percentage of essential proteins in
selected percentiles ranked by $v_i$ and by the degree of connection
$k_i$. Of the top 94 proteins ranked by $v_i$, 61\%
are essential, whereas only 52\% of the top 94 proteins
ranked by $k_i$ are essential. The last column is a control in which the $v_i$ are
recalculated for a (quasi-)randomized graph in which edges have
been swapped while retaining the degrees of connection of all vertices in
the original graph. Identifying essential proteins by calculating
$v_i$ performs consistently better than only computing $k_i$,
demonstrating the significance of nonlocal structure beyond
that of nearest neighbor relations. If we randomly perturb the
global graph structure, the ability to identify essential proteins
tends to drop, even though the degree of connection at each vertex is unchanged.}
425: \label{tab:compare}
426: \end{table}
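
The randomized control in Table~\ref{tab:compare} relies on swapping edges while keeping every vertex degree fixed. The exact swap procedure is not spelled out in this paper, so the sketch below shows one common scheme as an assumption: pick two edges $(a,b)$ and $(c,d)$ at random and rewire them to $(a,d)$ and $(c,b)$, rejecting moves that would create self-loops or duplicate edges. The function names and the cap on the number of attempts are illustrative.

```python
import random
from collections import Counter


def degree_sequence(edges):
    """Count how many edges touch each vertex."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Randomize a simple undirected graph while keeping every vertex degree."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        x, y = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[x], edges[y]
        if rng.random() < 0.5:
            c, d = d, c  # choose one of the two possible rewirings at random
        if len({a, b, c, d}) < 4:
            continue  # the two edges share a vertex; skip this proposal
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # the swap would duplicate an existing edge
        present.discard(frozenset((a, b)))
        present.discard(frozenset((c, d)))
        present.update((new1, new2))
        edges[x], edges[y] = (a, d), (c, b)
        done += 1
    return edges
```

Recomputing $v_i$ on the shuffled edge list then gives a control in which the degrees $k_i$ are unchanged but the non-local structure is perturbed.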
427:
The ten proteins with the highest $v_i$ are listed in
Table~\ref{tab:pList1}. The full list of proteins with their $v_i$
can be found on the supplemental web site\footnote{\tt
http://www.nas.nasa.gov/Groups/SciTech/nano/msamanta/projects/percolation/index.php}.
A selection of essential proteins with high $v_i$ but low $k_i$ is
also shown in Table~\ref{tab:pList2}.
434:
435: \begin{table}[htbp]
436: \begin{tabular}{|c|c|c|c|}
437: \hline
438: protein & $v_i$ & $k_i$ & viability \\
439: \hline
440: \hline
441: SRP1 & 0.0623 & 196 & inviable \\
442: TEM1 & 0.0531 & 115 & inviable \\
443: JSN1 & 0.0524 & 282 & viable \\
444: YDL213C & 0.0516 & 58 & viable\\
445: CKA1 & 0.0513 & 65 & viable \\
446: NUP116 & 0.0505 & 146 & inviable \\
447: ERB1 & 0.0494 & 55 & inviable \\
448: HHF1 & 0.0486 & 74 & viable \\
449: NOP2 & 0.0479 & 48 & inviable \\
450: CDC95 & 0.0475 & 48 & viable\\
451: \hline
452: \end{tabular}
\caption{The ten proteins with the highest $v_i$.}
454: \label{tab:pList1}
455: \end{table}
456:
457: \begin{table}[htbp]
458: \begin{tabular}{|c|c|c||c|c|c|}
459: \hline
460: $k_i$ & protein & $v_i$ & $k_i$ & protein & $v_i$ \\
461: \hline
462: \hline
463: & UTP8 & 0.0084 & & MAK11 & 0.0127 \\
464: & YKL088W & 0.0081 & & BMS1 & 0.0124 \\
465: 3 & DYS1 & 0.0075 & 5 & YPR144C & 0.0117 \\
466: & TRL1 & 0.0070 & & ACS2 & 0.0113 \\
467: & GRS1 & 0.0068 & & DIP2 & 0.0112 \\
468: \hline
469: & RLP24 & 0.0115 & & NOP14 & 0.0133 \\
470: & ROK1 & 0.0106 & & NOC3 & 0.0131 \\
471: 4 & SPB4 & 0.0101 & 6 & SEN1 & 0.0124 \\
472: & MES1 & 0.0094 & & YLL034C &0.0123 \\
473: & SEC18 & 0.00868 & & DIB1 & 0.0110 \\
474: \hline
475: \end{tabular}
476: \caption{A selection of a few essential proteins with
477: high $v_i$ but low $k_i$.}
478: \label{tab:pList2}
479: \end{table}
480:
481:
482: \subsection{Distribution of $\beta_{ij}$}
483: The interactions in the network can be grouped by the experimental
484: methods used to detect them. We score each interaction within the
network by $\beta_{ij}$. The distribution of
$\log(\beta_{ij})$ (Fig.~\ref{fig:h_beta}) provides a means to
detect differences amongst subsets of interactions obtained
by different experimental methods. In Fig.~\ref{fig:h_beta}, we compare
the distribution of $\log(\beta_{ij})$ for the whole network to the
distributions derived from several subsets of the network. First, we
use the core set of interactions derived
by Deane {\it et al.}~\cite{Deane}. Interactions in the core set are
statistically verified to reduce the false positive rate, yielding
1925 interactions (excluding self-interacting pairs). The
distribution of $\log(\beta_{ij})$ for the core set is similar to that
obtained for the entire network. However, upon comparing the
distributions of $\log(\beta_{ij})$ for subsets of interactions
obtained from different experimental procedures, differences emerge.
For example, interactions measured by immunoprecipitation tend to
have a larger $\beta_{ij}$, so that the distribution of
$\log(\beta_{ij})$ of this subset shifts to the right. In contrast,
the distribution for the subset of interactions measured with
high-throughput two-hybrid tests displays the opposite trend.
504:
505: \begin{figure}[htbp]
506: \includegraphics[width=3in]{h_beta.eps}
507: \caption{Normalized distributions of $\log(\beta_{ij})$ for
508: different subsets of interactions. The solid line represents the
509: distribution for all interactions in the data. The dotted line
corresponds to the core set extracted by Deane {\it et
al.}~\cite{Deane}. The short dashed line refers to interactions
512: obtained by immunoprecipitation, and the long dashed line
513: represents the subset of interactions derived from high-throughput
514: two-hybrid tests.}
515: \label{fig:h_beta}
516: \end{figure}
517:
518: If $e_{ij}$ is the only edge linking two clusters, the contribution of
519: a particular realization of the percolation procedure to $\beta_{ij}$
520: is proportional to the product of the sizes of the two clusters. Hence,
521: an edge with a greater $\beta_{ij}$ has a greater tendency to link two
522: large modules or clusters in the network. With this notion in mind,
523: an examination of Fig.~\ref{fig:h_beta} suggests that the IP method is
524: possibly more sensitive to interactions between proteins in different
525: large modules while the two-hybrid tests are better suited to
526: detecting interactions which tend not to link larger modules.
527:
528: The discrepancy between the IP method and the two-hybrid tests might
529: reflect the underlying biochemical differences between the two
530: methods. Unlike IP, the two-hybrid test is an {\it in vivo}
531: technique, and thus it can detect transient and unstable
interactions~\cite{Mering}. Our analysis of the distribution of
533: $\log(\beta_{ij})$ for the two-hybrid data is a quantitative
534: demonstration that these transient and unstable interactions
535: contribute less to the integrity of the interaction network.
536:
537:
538: \section{Conclusion}
539: \label{sec:con}
540:
541: We presented a stochastic algorithm that explored the global
542: connectivity properties of a protein interaction network. This
543: percolation-based algorithm allowed us to assign weights to vertices and
544: edges according to non-local topological properties. We applied the
545: algorithm to the protein interaction network for yeast and found that
546: the percentage of essential proteins correlated strongly with $v_i$.
547: Importantly, the values of $v_i$, which incorporated the knowledge of
548: connections beyond the nearest neighbors, could more successfully
549: discriminate essential proteins than a method based solely on local
550: connections. In addition, the essential proteins with greater $v_i$
551: not only possessed more interactions with any other proteins but also
552: displayed more interactions with other {\em essential} proteins. This
553: result suggested that essential proteins along with other proteins
554: having greater $v_i$ might form a ``core network'' with a higher
555: density of interactions within the ``core network'' than the
556: background network. If this unverified hypothesis is confirmed, then
557: we would gain significant insight into the evolution of a protein
558: interaction network. Are the proteins in this ``core network'' in
general more evolutionarily conserved than others? Fraser {\it et al.}
claimed that there is a significant negative correlation between each
protein's degree of connectivity and its evolutionary rate, and
562: that evolutionary change may occur largely by coevolution~\cite{Fraser}.
563: If this is indeed so, we expect a stronger correlation between $v_i$ and
564: protein evolutionary rate, since $v_i$ provides a better resolution
565: than the degree of connectivity for proteins' positions in their
566: interaction network.
567:
The $\beta_{ij}$ scores for interactions could distinguish
between different experimental methods for measuring protein
interactions. Such a quantitative measure of the distinction amongst
571: the experimental approaches will aid the interpretation of the
572: proteomic data.
573:
574: In principle, $c_{ij}$ can be calculated exactly given a percolation
575: probability $p$. However, this would require recursive iterations
over all possible sub-graphs. Our stochastic approach efficiently
approximates the exact values of $c_{ij}$, $v_i$, and
$\beta_{ij}$. In this work, we model the interaction network as a
579: static graph with uniform weight on each edge. For a biological
580: system, dynamical aspects need to be incorporated. Various
581: experimental methods for probing the physical interactions between
582: proteins respond differently to the dynamics of biological systems.
583: The two-hybrid test is more sensitive to transient interactions while
584: the IP method is more sensitive to large and stable protein complexes.
These differences might be explained by different dynamical aspects of
the interaction network.
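
To illustrate why the exact calculation of $c_{ij}$ is impractical for large graphs, the following sketch enumerates all $2^{|E|}$ percolation configurations of a small graph and sums the probabilities of those in which $i$ and $j$ are connected. It is a brute-force illustration under the assumption of integer vertex labels, not the recursive scheme alluded to above; the function names are hypothetical.

```python
from itertools import product


def connected(n_vertices, edges, i, j):
    """Depth-first search: is j reachable from i using only `edges`?"""
    adjacency = {v: [] for v in range(n_vertices)}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    seen, stack = {i}, [i]
    while stack:
        v = stack.pop()
        for w in adjacency[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return j in seen


def exact_correlation(n_vertices, edges, i, j, p):
    """Exact c_ij by summing over all 2^|E| percolation configurations."""
    total = 0.0
    for keep in product((False, True), repeat=len(edges)):
        kept = [e for e, k in zip(edges, keep) if k]
        # Probability of this configuration: kept edges survive, others do not.
        weight = p ** len(kept) * (1 - p) ** (len(edges) - len(kept))
        if connected(n_vertices, kept, i, j):
            total += weight
    return total
```

For a triangle with $p = 0.5$, the enumeration gives $c_{01} = p + (1-p)p^2 = 0.625$. The cost doubles with every additional edge, which is why sampling is the practical choice for a network with thousands of interactions.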
587:
588: With regard to future pursuits, we note that it is also possible to
589: use $\beta_{ij}$ to cluster vertices within a random graph. The
590: $\beta_{ij}$ score for a random graph is similar to the edge
591: ``betweenness'', defined as the number of shortest paths between all
592: pairs of vertices passing through a given edge. An edge with a
593: greater $\beta_{ij}$ is likely also an edge with a greater edge
594: ``betweenness'', because such an edge has great tendency to bridge two
different clusters or modules. Clustering utilizing edge
``betweenness'' has been successfully applied to certain types of
random networks~\cite{Newman}. We expect that results similar to those
598: shown in Fig.~\ref{fig:smallnet} could be achieved with $\beta_{ij}$
599: not only for this small test graph but more significantly for larger
600: graphs in which the computational cost of calculating edge
601: ``betweenness'' is prohibitive. For the present, however, the idea of
602: percolation on random networks provides a natural mechanism for
603: revealing dominant cluster structure within a graph. We hope such
604: natural cluster structure will provide further details about the
605: protein interaction network.
606:
607:
608: \acknowledgements{ We thank Hao Li and Shoudan Liang for fruitful
discussion. C.~S.~Chin would also like to thank Yigal Nochomovitz for
critical reading of the manuscript. C.~S.~Chin is supported by a Sandler
Opportunity Grant. M.~P.~Samanta is supported by NASA contract
612: DTTS59-99-D-00437/A61812D to CSC.}
613:
614:
615: \begin{thebibliography}{10}
616:
617: \bibitem{Zhu2}
618: Zhu, H et~al.
619: \newblock (2000) {\em Nature Genet.} {\bf 26}, 283--289.
620:
621: \bibitem{Uetz}
622: Uetz, P et~al.
623: \newblock (2000) {\em Nature} {\bf 403}, 623--627.
624:
625: \bibitem{Ito}
626: Ito, T, Chiba, T, Ozawa, R, Yoshida, M, Hattori, M, \& Sakaki, Y.
\newblock (2001) {\em Proc. Natl. Acad. Sci. USA} {\bf 98}, 4569--4574.
628:
629: \bibitem{Gavin}
630: Gavin, A.~C et~al.
631: \newblock (2002) {\em Nature} {\bf 415}, 141--147.
632:
633: \bibitem{Ho}
634: Ho, Y et~al.
635: \newblock (2002) {\em Nature} {\bf 415}, 180--183.
636:
637: \bibitem{Hartwell}
Hartwell, L.~H, Hopfield, J.~J, Leibler, S, \& Murray, A.~W.
639: \newblock (1999) {\em Nature} {\bf 402}, C47--C52.
640:
641: \bibitem{Jeong}
642: Jeong, H, Mason, S.~P, Barabasi, A.-L, \& Oltvai, Z.~N.
643: \newblock (2001) {\em Nature} {\bf 411}, 41--42.
644:
645: \bibitem{Maslov}
646: Maslov, S \& Sneppen, K.
647: \newblock (2002) {\em Science} {\bf 296}, 910.
648:
649: \bibitem{Mering}
650: Mering, C.~V, Krause, R, Snel, B, Cornell, M, Oliver, S.~G, Fields, S, \&
651: Bork, P.
652: \newblock (2002) {\em Nature} {\bf 417}, 399--403.
653:
654: \bibitem{Bader}
655: Bader, G.~D \& Hogue, C.~W.~V.
\newblock (2002) {\em Nature Biotechnol.} {\bf 20}, 991--997.
657:
658: \bibitem{YeastDel}
659: Winzeler, E.~A et~al.
660: \newblock (1999) {\em Science} {\bf 285}, 901--906.
661:
662: \bibitem{percolation}
663: Stauffer, D \& Aharony, A.
664: \newblock (1994) {\em Introduction to Percolation Theory}.
665: \newblock (Taylor \& Francis).
666:
667: \bibitem{Samanta}
668: Samanta, M.~P \& Liang, S.
669: \newblock (2003) Redundancy in large-scale protein interaction networks.
670: \newblock in preparation.
671:
672: \bibitem{Deane}
673: Deane, C.~M, Salwinski, L, Xenarios, I, \& Eisenberg, D.
674: \newblock (2002) {\em Mol. Cell Proteomics} {\bf 1}, 349--356.
675:
676: \bibitem{Fraser}
677: Fraser, H.~B, Hirsh, A.~E, Steinmetz, L.~M, Scharfe, C, \& Feldman, M.~W.
678: \newblock (2002) {\em Science} {\bf 296}, 750--752.
679:
680: \bibitem{Newman}
681: Girvan, M \& Newman, M.~E.~J.
\newblock (2002) {\em Proc. Natl. Acad. Sci. USA} {\bf 99}, 7821--7826.
683:
684: \end{thebibliography}
685:
686:
687: \end{document}
688:
689:
690: