1: \documentclass[pre,aps,twocolumn,nofootinbib,floatfix]{revtex4}
2: \usepackage{epsf,amssymb,amsmath,multirow}
3: \begin{document}
4: \newcommand{\knn}{$\langle k_{\rm nn}\rangle (k)$}
5: \thispagestyle{empty}
6: \title{Graph theoretic analysis of protein interaction networks of eukaryotes}
7: \author{K.-I. Goh$^*$, B. Kahng$^{*,\dagger}$ and D. Kim$^*$}
8: \affiliation{$^*$School of Physics and $^{\dagger}$Program in Bioinformatics, Seoul National University, Seoul 151-747, Korea}
9: \date{June 15, 2004}
10: \begin{abstract}
Thanks to recent progress in high-throughput experimental techniques, large-scale protein interaction datasets of prototypical multicellular species, the nematode worm {\em Caenorhabditis elegans} and the fruit fly {\em Drosophila melanogaster}, have become available. The datasets are obtained mainly by the yeast two-hybrid method, which produces both false positives and false negatives. Accordingly, while it is desirable to test such datasets through further wet experiments, here we invoke recently developed network theory to test such high-throughput datasets in a simple way.
Based on the fact that the key biological processes indispensable to maintaining life are universal across eukaryotic species, we compare the structural properties of the protein interaction networks (PINs) of the two species with those of the yeast PIN. We find that while the worm and the yeast PIN datasets exhibit similar structural properties, the current fly dataset, though the most comprehensively screened to date, does not correctly reflect the generic structural properties as it stands: the modularity is suppressed and the connectivity correlation is lacking. Addition of interlogs to the current fly dataset increases the modularity and enhances the occurrence of triangular motifs as well. The connectivity correlation function of the fly, however, remains distinct under such interlog addition, for which we present a possible scenario through {\em in silico} modeling.
13: \end{abstract}
14: \maketitle
15: \renewcommand{\thetable}{{\bf \arabic{table}}}
16: \renewcommand{\thefigure}{{\bf \arabic{figure}}}
17: \renewcommand{\figurename}{{\bf Fig.}}
18: \renewcommand{\tablename}{{\bf Table}}
19:
20: \noindent
21: {\bf Introduction}\\
In the last few years, graph theoretic methods for understanding complex
biomolecular systems have developed very rapidly \cite{nrg}.
This development has advanced the effort to uncover the organizing
principles of cellular networks in post-genomic biology.
The cellular components such as genes, proteins, and other biological
molecules, connected by all physiologically relevant interactions,
form a full weblike molecular architecture in a cell. In such an
architecture, genes, which are expressed through proteins, play a
central role. Proteins rarely act alone; rather, they cooperate with others
to act physiologically. Thus protein interactions play pivotal roles
in various aspects of structural and functional organization,
and their complete description would be the first step toward a thorough
understanding of the web of life. Proteins are viewed as
nodes of a complex protein interaction network (PIN) in which
two proteins are linked if they physically contact each other.
37: The graph theoretic approach has been useful to understand intricate
38: interwoven structures of the PIN \cite{jeong,wagner,maslov}.
39: The key biological processes indispensable to maintaining life
40: are universal across eukaryotic species since many involved genes
41: are evolutionarily conserved \cite{alberts}.
Using this property, one can test whether a newly obtained dataset
contains reasonably complete information on protein
interactions. Moreover, this {\it in silico} approach yields
candidate protein interaction pairs whose number
is considerably smaller than the total number of combinatorial
pairs. Thus, the graph theoretic analysis would provide a useful
guide for further wet studies of protein interactions.
49:
Species with sequenced genomes such as the yeast {\em Saccharomyces
cerevisiae} provide important test beds for the study of the PIN.
Thanks to recent progress in high-throughput experimental
techniques such as the yeast two-hybrid assay \cite{uetz,ito} and
mass spectrometry \cite{gavin,ho},
the dataset of the yeast PIN has been firmly established~\cite{mips,dip}.
Very recently, large-scale protein interactions of multicellular species,
the nematode worm {\em Caenorhabditis elegans} \cite{vidal}
and the fruit fly {\em Drosophila melanogaster} \cite{giot},
have been assayed. While those datasets, mainly based on the yeast
two-hybrid assay, need physiological proof, they contain large numbers
of proteins and protein interactions, making graph theoretic study possible.
62: In this paper, we analyze those datasets and compare them with the
63: more-established set of interactions in the budding yeast~\cite{dip}.
Our graph theoretic analysis suggests that the present interaction dataset
of the fruit fly, based on the yeast two-hybrid (Y2H) assay, may have left out
a significant part of the protein interactions, although it is the most
comprehensively screened to date. This conclusion is reached by comparing
the generic features of the PIN, the modularity and the
connectivity correlations, across the three species. For the fly, those
quantities behave distinctively: the modularity is suppressed
and the connectivity correlation is lacking. Such distinct behavior
can be partially overcome by the addition of yeast interlogs to the
fly dataset.\\
74:
75: \begin{figure*}[t]
76: \centerline{\epsfxsize=15cm \epsfbox{fig1-pk.eps}}
77: \caption{The degree distributions $p_d(k)$ for
78: {\bf (a)} the yeast,
79: {\bf (b)} the prokaryotes {\em Helicobacter pylori} ($\circ$)
80: and {\em Escherichia coli} ($\Box$),
81: {\bf (c)} the worm (Worm-All),
82: {\bf (d)} the Y2H subset of the worm dataset (Worm-Y2H),
83: {\bf (e)} the fly,
84: and {\bf (f)} the Fly+Interlog dataset.
85: }
86: \label{fig:pk}
87: \end{figure*}
88:
89: \noindent{\bf Materials and Methods}\\
90: {\bf Graph theory terminology.}
(i) A network is composed of vertices and edges. In the
protein interaction network, vertices represent proteins and
edges represent protein interactions.
(ii) The degree is the number of edges connected to a given
vertex. The degree distribution $p_d(k)$ is the fraction of vertices
having degree $k$.
(iii) The clustering coefficient of a vertex $i$ is
defined as $C_i=2e_i/[k_i(k_i-1)]$, where $e_i$ is the number
of connections among the $k_i$ neighbors of vertex $i$.
The clustering function $C(k)$ is the mean value of $C_i$ over
the vertices with degree $k$, while the clustering coefficient
$C$ is the mean of $C_i$ over all vertices.
103: When the network contains
104: hierarchical and modular structures within it, it is known
105: that the clustering function $C(k)$ behaves as
106: $C(k)\sim k^{-\beta}$ for large $k$~\cite{ravasz}.
(iv) $\langle k_{\rm nn} \rangle(k)$ is the mean degree
of the neighbors of a vertex with degree $k$.
It is known that $\langle k_{\rm nn}
\rangle (k)\sim k^{-\nu}$ with $\nu > 0$ for the Internet and
the protein interaction network~\cite{knn,maslov}, implying that
vertices with large degree tend to connect to those with small
degree. Such a network is called a disassortative network.
Besides this quantity, the assortativity index $r$ has been
introduced \cite{assort} to characterize the degree-degree
correlation between the two vertices located at the ends of
an edge. It is defined as
118: \begin{equation*}
119: r = \frac{\langle k_1k_2\rangle - \langle (k_1+k_2)/2\rangle^2}{
120: \langle (k_1^2+k_2^2)/2\rangle - \langle (k_1+k_2)/2\rangle^2} \quad,
121: \end{equation*}
122: where $k_1$ and $k_2$ are the degrees of two vertices at the ends of
123: an edge, and $\langle \cdots\rangle$ denotes the average over all
124: edges.\\
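As a quick illustration of items (iii) and (iv) (a hypothetical toy case, not taken from the datasets), consider a vertex $i$ with $k_i=4$ neighbors among which $e_i=2$ edges exist, and a star graph in which one hub is linked to $n\ge 2$ leaves, so that every edge joins a vertex of degree $n$ to a vertex of degree 1:
\begin{align*}
C_i &= \frac{2\cdot 2}{4\cdot 3}=\frac{1}{3},\\
r_{\rm star} &= \frac{n-\left(\frac{n+1}{2}\right)^2}{\frac{n^2+1}{2}-\left(\frac{n+1}{2}\right)^2}
= \frac{-(n-1)^2/4}{(n-1)^2/4} = -1 .
\end{align*}
The star is thus maximally disassortative, the limiting case of the pattern in which vertices of large degree link to vertices of small degree.\\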
125:
126: \noindent
127: {\bf The protein interaction network datasets.}
128: We used the yeast subset of the interaction data compiled in the
129: Database of Interacting Proteins (DIP) as of January 2004
130: (http://dip.doe-mbi.ucla.edu) \cite{dip}. The datasets for the
131: worm and the fly are obtained from the works of Li {\em et al.}~\cite{vidal}
132: and Giot {\em et al.}~\cite{giot},
133: respectively.
134: For the worm, we consider two different versions, the one consisting
135: of only the interactions from the Y2H screens (referred to as Worm-Y2H
136: network in this paper)
137: and the other the full network supplied by Li {\em et al.}~\cite{vidal}
138: (referred to as Worm-All network).
The characteristics of each dataset and the values
of the graph theoretic quantities are tabulated in Table~\ref{table1}.\\
141:
142: \begin{table}[b]
143: \caption{
144: {\bf Protein interaction network datasets.}
For each dataset, we tabulate the size of the proteome $N_{\rm proteome}$,
the number of proteins $N$ and the number of protein--protein
interactions $L$ in the dataset, the mean degree $\langle k\rangle$,
the clustering coefficient $C$, the assortativity $r$,
and the number of proteins forming the largest cluster $N_1$.
Self-interactions are excluded throughout.
151: }
152: \begin{ruledtabular}
153: \begin{tabular}{cccccc}
154: & Yeast & Worm-Y2H & Worm-All & Fly \\
155: \hline
156: $N_{\rm proteome}$ & 6195 & 22246 & 22246 & 16206\\
157: $N$ & 4714 & 2835 & 3216 & 7055 \\
158: $L$ & 14857 & 4438 & 50444 & 20947\\
159: $\langle k\rangle$ & 6.3 & 3.1 & 3.4 & 5.9\\
160: $C$ & 0.12 & 0.047 & 0.15 & 0.014\\
161: $r$ & -0.14 & -0.16 & -0.13 & -0.036\\
162: $N_1$ & 4627 & 2601 & 2898 & 6929
163: \label{table1}
164: \end{tabular}
165: \end{ruledtabular}
166: \end{table}
167:
168: \noindent
169: {\bf Orthologous gene assignment.}
170: For cross-species ortholog information, we used the information from the
171: KOG database \cite{kog}, a eukaryotic extension of the Clusters of
172: Orthologous Genes (COG) database (http://www.ncbi.nlm.nih.gov/COG/new/).
173:
174: \begin{figure*}[t]
175: \centerline{\epsfxsize=15cm \epsfbox{fig2-ck.eps}}
176: \caption{The local clustering function $C(k)$ for
177: {\bf (a)} the yeast,
178: {\bf (b)} the bacteria {\it H. pylori} ($\circ$) and {\em E. coli} ($\Box$),
179: {\bf (c)} the worm (Worm-All),
180: {\bf (d)} the Worm-Y2H dataset,
181: {\bf (e)} the fly, and
182: {\bf (f)} the Fly+Interlog dataset.
183: The abscissae and ordinates are fixed for clear comparison.
184: }
185: \label{fig2-ck}
186: \end{figure*}
187:
188: \noindent
189: {\bf Yeast interlogs in fly.}
Having identified the yeast-fly orthologs, we look for the
interactions in the yeast network between
pairs of yeast proteins that both have orthologs in the fly network.
Such orthologous interactions are called interlogs.
194: If the corresponding fly interaction is present, we call
195: it an {\em overlap interlog}. If not, we call it a {\em potential
196: interlog}. Note that the ortholog relationship is not always one-to-one,
197: resulting in multiple interlogs for a given yeast interaction.
For {\em in silico} analysis of the effect of adding
potential interlogs to the fly network, we include on average one potential
interlog per yeast interaction. Specifically, for each
yeast interaction A-B having no overlap interlog, each potential interlog
is added to the fly network with probability $1/(o_Ao_B)$,
where $o_X$ is the number of fly orthologs of the yeast gene X.
The network obtained in this way
is referred to as the Fly+Interlog network hereafter.
The full list of the 408 overlap and the 55176 potential interlogs
is available on the web
208: (http://komplex0.snu.ac.kr/pin/yeast-fly-interlog.xls).
209: \\
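The phrase ``on average one potential interlog per yeast interaction'' follows directly from the stated probability. As a sketch, assume that the $o_Ao_B$ ortholog pairs of a yeast interaction A-B are all distinct candidates and are added independently of each other (the independence is an assumption made here for illustration). The expected number of added interlogs, denoted here by $\langle n_{AB}\rangle$, is then
\begin{equation*}
\langle n_{AB}\rangle = o_Ao_B\times\frac{1}{o_Ao_B} = 1 .
\end{equation*}
For example, with $o_A=2$ and $o_B=3$ there are six candidates, each added with probability $1/6$; under the same assumption, the chance that none of them is added in a given realization is $(5/6)^6\approx 0.33$.\\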
210:
211: \noindent{\bf Results}\\
212: \noindent{\bf Degree distributions.}
In Fig.~\ref{fig:pk}, we plot the degree distributions of diverse
protein interaction networks. All of them display scale-free
behavior and fit well to the generalized Pareto form,
$p_d(k)\sim (k+k_0)^{-\gamma}$,
making them almost indistinguishable from each other.
While the degree distribution is a fundamental quantity in graph theory,
it characterizes only the global network structure and thus does not give
detailed information on finer structural properties.
221: \\
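The role of the offset $k_0$ in the generalized Pareto form can be made explicit through the local slope on a double-logarithmic plot (a short derivation added for illustration, treating $k$ as a continuous variable):
\begin{equation*}
\frac{d\ln p_d(k)}{d\ln k} = -\gamma\,\frac{k}{k+k_0} .
\end{equation*}
The distribution is therefore nearly flat for $k\ll k_0$ and approaches the pure power law $k^{-\gamma}$ only for $k\gg k_0$.\\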
222:
223: \noindent{\bf Modularity}\\
224: A cellular function is achieved by a set of related
225: proteins, usually forming a pathway or a complex.
Such a functional module manifests itself as a localized dense
subgraph within the whole cellular network.
228: The presence of modules and their hierarchical organization
229: can be visualized by the local clustering function $C(k)$ \cite{ravasz}.
230: For the yeast PIN, $C(k)$ exhibits a plateau
231: for small $k$ and falls off rapidly for large $k$,
232: reflecting the modular structure bridged by the hubs (Fig.~2{\bf a}).
A similar pattern is observed in the worm (Fig.~2{\bf c})
and in the two prokaryotic species, {\em H. pylori}
and {\em E. coli} (Fig.~2{\bf b}).
236: Note that the worm dataset contains the yeast interlogs.
237: For the fly Y2H data, however, $C(k)$ behaves distinctively,
238: almost constant for all $k$ (Fig.~2{\bf e}).
239: To understand this discrepancy, we add
240: the potential yeast interlogs into the current fly Y2H dataset.
Then $C(k)$ behaves in a similar fashion to that of the other datasets,
showing a moderate plateau for small $k$ and a rapid decrease for
large $k$, although the height of the plateau, which is roughly
the clustering coefficient $C$, is not as high as in the yeast and
the worm (Fig.~2{\bf f}).
To find the role of the interlogs in the worm, we consider
the Worm-Y2H dataset, and plot its $C(k)$ in Fig.~2{\bf d}.
Indeed, the signature character of $C(k)$
is lost; in particular, the plateau for small $k$ almost disappears,
implying that the yeast interlogs play a role in forming modules, in which
proteins are closely linked to each other.\\
252:
253: \noindent{\bf Conservation rate of interactions.}
254: We count how many yeast interactions are actually conserved
255: in orthologous form in both the worm and the fly.
The conservation rate found in this way for the Y2H screen datasets
is surprisingly low: 2.7\% for the worm (Worm-Y2H)
and 3.8\% for the fly.
For the worm, we note that such low coverage is in part
due to the insufficient number of baits used in the experiment
(3,024 baits, 833 of which are present in the network).
When we consider the conservation of triangular interaction patterns,
a basic unit of cooperative functional modules \cite{milo},
only 3 out of 1731 are conserved in the worm, and none in the fly
(Fig.~3).
The lack of conserved interaction motifs in the fly data
suggests that the current fly network misses some important
cooperative aspects of the cellular network in the fly.
An effort to fill this gap is timely.\\
270:
271: \begin{figure}
272: \centerline{\epsfxsize=8.5cm \epsfbox{fig3-triangle.eps}}
273: \caption{
Conservation of an interaction motif.
Shown in the middle is a triangular interaction subgraph in the yeast network involved in
ubiquitin-dependent protein catabolism. The corresponding
orthologous counterparts in the worm and the fly are also shown.
This motif is conserved in the worm Y2H data, while only a single interaction
is detected in the fly data.
280: }
281: \end{figure}
282:
283: \begin{table*}
\caption{Network motif structure of the three species.
Tabulated is the number of occurrences of each subgraph in the network.
According to their $Z$- and $E$-scores, the significant motifs (M) and
anti-motifs (AM) are indicated.}
288: \label{motif-table}
289: \begin{tabular}{cccccc}
290: & Yeast & Worm-Y2H & Worm-All & Fly & Fly+Interlog \\
291: \hline
292: \multirow{2}{5mm}{
293: \begin{picture}(0,0)(0,0)
294: \qbezier(0,0)(4,6.8)(4,6.8)
295: \qbezier(4,6.8)(4,6.8)(8,0)
296: \put(0.2,0){\circle*{3}}
297: \put(8.2,0){\circle*{3}}
298: \put(4.2,6.8){\circle*{3}}
299: \end{picture}
300: }
301: & 329961 & 81205 & 87294 & 413926 & 520704$\pm$1358 \\
302: & & & & & \\
303: \hline
304: \multirow{2}{5mm}{
305: \begin{picture}(0,0)(0,0)
306: \qbezier(0,0)(4,6.8)(4,6.8)
307: \qbezier(4,6.8)(4,6.8)(8,0)
308: \qbezier(0,0)(0,0)(8,0)
309: \put(0.2,0){\circle*{3}}
310: \put(8.2,0){\circle*{3}}
311: \put(4.2,6.8){\circle*{3}}
312: \end{picture}
313: }
314: & 7136 & 366 & 1512 & 1549 & 3504$\pm$40 \\
315: & M (Z=80, E=3.3) & & M (Z=29, E=2.5) & & M (Z=45, E=1.4)\\
316: \hline
317: \multirow{2}{5mm}{
318: \begin{picture}(0,0)(0,0)
319: \qbezier(0,0)(0,8)(0,8)
320: \qbezier(0,8)(8,8)(8,8)
321: \qbezier(8,8)(8,0)(8,0)
322: \put(0.2,0){\circle*{3}}
323: \put(8.2,0){\circle*{3}}
324: \put(0.2,8){\circle*{3}}
325: \put(8.2,8){\circle*{3}}
326: \end{picture}
327: }
328: & 4081023 & 604723 & 680485 & 7378808 & 971960$\pm$37157 \\
329: & & & & & \\
330: \hline
331: \multirow{2}{5mm}{
332: \begin{picture}(0,0)(0,0)
333: \qbezier(0,0)(0,8)(0,8)
334: \qbezier(0,8)(8,8)(8,8)
335: \qbezier(0,8)(8,0)(8,0)
336: \put(0.2,0){\circle*{3}}
337: \put(8.2,0){\circle*{3}}
338: \put(0.2,8){\circle*{3}}
339: \put(8.2,8){\circle*{3}}
340: \end{picture}
341: }
342: & 9024723 & 2129609 & 2157048 & 6315922 & 7409320$\pm$24476 \\
343: & & & & & \\
344: \hline
345: \multirow{2}{5mm}{
346: \begin{picture}(0,0)(0,0)
347: \qbezier(0,0)(0,8)(0,8)
348: \qbezier(0,8)(8,8)(8,8)
349: \qbezier(0,0)(0,0)(8,0)
350: \qbezier(0,8)(8,0)(8,0)
351: \put(0.2,0){\circle*{3}}
352: \put(8.2,0){\circle*{3}}
353: \put(0.2,8){\circle*{3}}
354: \put(8.2,8){\circle*{3}}
355: \end{picture}
356: }
357: & 368730 & 46050 & 58520 & 160846 & 263324$\pm$2617 \\
358: & AM (Z=-122, E=-0.7) & & AM (Z=-59, E=-0.7) & & \\
359: \hline
360: \multirow{2}{5mm}{
361: \begin{picture}(0,0)(0,0)
362: \qbezier(0,0)(0,8)(0,8)
363: \qbezier(0,8)(8,8)(8,8)
364: \qbezier(8,8)(8,0)(8,0)
365: \qbezier(8,0)(8,0)(0,0)
366: \put(0.2,0){\circle*{3}}
367: \put(8.2,0){\circle*{3}}
368: \put(0.2,8){\circle*{3}}
369: \put(8.2,8){\circle*{3}}
370: \end{picture}
371: }
372: & 21806 & 4350 & 4686 & 54100 & 60648$\pm$206 \\
373: & & M (Z=9.5, E=0.6) & & M (Z=81, E=2.1) & M (Z=60, E=1.3) \\
374: \hline
375: \multirow{2}{5mm}{
376: \begin{picture}(0,0)(0,0)
377: \qbezier(0,0)(0,8)(0,8)
378: \qbezier(0,8)(8,8)(8,8)
379: \qbezier(8,8)(8,0)(8,0)
380: \qbezier(8,0)(8,0)(0,0)
381: \qbezier(0,8)(8,0)(8,0)
382: \put(0.2,0){\circle*{3}}
383: \put(8.2,0){\circle*{3}}
384: \put(0.2,8){\circle*{3}}
385: \put(8.2,8){\circle*{3}}
386: \end{picture}
387: }
388: & 27455 & 1505 & 4120 & 4029 & 9313$\pm$228 \\
& AM (Z=-49, E=-0.7) & & AM (Z=-25, E=-0.6) & M (Z=12, E=0.8) & \\
390: \hline
391: \multirow{2}{5mm}{
392: \begin{picture}(0,0)(0,0)
393: \qbezier(0,0)(0,8)(0,8)
394: \qbezier(0,8)(8,8)(8,8)
395: \qbezier(8,8)(8,0)(8,0)
396: \qbezier(8,0)(8,0)(0,0)
397: \qbezier(0,8)(8,0)(8,0)
398: \qbezier(8,8)(0,0)(0,0)
399: \put(0.2,0){\circle*{3}}
400: \put(8.2,0){\circle*{3}}
401: \put(0.2,8){\circle*{3}}
402: \put(8.2,8){\circle*{3}}
403: \end{picture}
404: }
405: & 5259 & 30 & 1563 & 82 & 914$\pm$35 \\
406: & & & M (Z=10, E=0.8) & M (Z=11, E=3.5) & M (Z=40, E=6.0)\\
407: \hline
408: \end{tabular}
409: \end{table*}
410:
411: \noindent{\bf Motif structure.}
412: Since the modularity manifested by $C(k)$
413: is closely related to the formation of triangles in the network,
414: here we further perform network motif analysis
415: for the three species datasets.
416: The network motifs are small recurring subgraphs which are overrepresented
417: in a given network and are believed to provide the basic evolutionary
418: and functional signatures of the network \cite{milo}.
Since it was recently discovered that the motif constituents
are more conserved during evolution than the rest \cite{wuchty},
one would expect the density of each motif to be
similar across the three species.
From the comparison of the columns for Yeast, Worm-All, and Fly
in Table \ref{motif-table}, we can see that
the triangle motif is relatively scarce in Fly,
while the square motif is relatively abundant.
Thus, the absolute magnitude of the clustering function
is smaller for the fly than for the yeast or the worm.
The density of the triangle motif is higher in the Fly+Interlog dataset,
indicating that the clustering coefficient is enhanced overall by the
addition of the interlogs to the fly dataset.
432:
433: In Table \ref{motif-table} we have summarized the motif structure
434: for each network.
We follow Milo {\em et al.}~\cite{milo} to calculate the two scores,
the $Z$- and $E$-scores, defined as
$Z=(N-N_{\rm random})/\sigma_{\rm random}$ and $E=(N-N_{\rm random})/N_{\rm random}$,
respectively, where $N$ is the number of occurrences of a subgraph in the network.
We use the following two criteria to specify whether a
subgraph is a motif or an anti-motif (an anti-motif is a subgraph
significantly underrepresented in the network):
\begin{itemize}
\item[(i)] The probability that a count as extreme as $N$ is observed in the randomized networks
is smaller than 0.01.
\item[(ii)] $|E|>E_0$, where we set the threshold $E_0=0.5$,
rather than $E_0=0.1$ as in Milo {\em et al.}~\cite{milo}.
\end{itemize}
Here, $N_{\rm random}$ and $\sigma_{\rm random}$ are the mean and the standard deviation
of the number of occurrences in randomized versions of the network,
obtained from 1000 samples,
where the randomization is performed by the
switching method \cite{milo}. In calculating them for
the 4-node subgraphs, the numbers of 3-node subgraphs are fixed to be
those of the original networks. For the Fly+Interlog network,
10 realizations of interlog addition (see Materials and Methods) are averaged.
455: \\
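As a consistency check on how the two scores relate (a worked example added for illustration; the values of $Z$ and $E$ in Table~\ref{motif-table} are rounded, so the results are only approximate), the definitions can be inverted to recover the statistics of the randomized networks:
\begin{equation*}
N_{\rm random}=\frac{N}{1+E},\qquad \sigma_{\rm random}=\frac{N-N_{\rm random}}{Z}.
\end{equation*}
For the triangle in the yeast network ($N=7136$, $Z=80$, $E=3.3$), this gives $N_{\rm random}\approx 7136/4.3\approx 1.7\times 10^{3}$ and $\sigma_{\rm random}\approx 5476/80\approx 68$.\\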
456:
457: \begin{figure*}
458: \centerline{\epsfxsize=15cm \epsfbox{fig4-knn.eps}}
459: \caption{
460: The average neighbor degree function $\langle k_{\rm nn}\rangle(k)$ for
461: {\bf (a)} the yeast,
462: {\bf (b)} the prokaryotes {\em H. pylori} ($\circ$) and {\em E. coli} ($\Box$),
463: {\bf (c)} the worm (Worm-All),
464: {\bf (d)} the Y2H subset of the worm (Worm-Y2H),
465: {\bf (e)} the fly,
466: and {\bf (f)} the Fly+Interlog dataset.
467: The abscissae and ordinates are fixed for clear comparison.
468: }
469: \label{fig-knn}
470: \end{figure*}
471:
472: \noindent{\bf Degree-degree correlation}\\
473: The mean neighbor degree function \knn~is useful
474: in understanding the degree-degree correlation in a network.
475: In Fig.~\ref{fig-knn}, we plot \knn~for each dataset.
For the yeast, it is known that \knn~decreases
with increasing $k$ \cite{maslov}, which turns out to be true
for some prokaryotic species as well (Figs.~4{\bf a}-{\bf b}).
Such behavior of \knn~is also observed for the worm
(Figs.~4{\bf c}-{\bf d}). For the fly, however,
it is flat, implying a lack of correlation (Fig.~4{\bf e}).
This distinct behavior of the fly is robust under the addition of
the interlogs (Fig.~4{\bf f}), which suggests that the lack of
correlation in the fly network could be intrinsic,
even though we cannot exclude the possibility
that it is again an artifact of the incompleteness of the data.
The hypothesis that the lack of correlation could be intrinsic
may be supported by the following observations.\\
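For reference, a flat $\langle k_{\rm nn}\rangle(k)$ is what one expects in the absence of degree correlations (a standard result, sketched here for illustration). Writing $P(k'|k)$ for the probability that a neighbor of a vertex of degree $k$ has degree $k'$ (a symbol introduced only for this sketch), one has
\begin{equation*}
\langle k_{\rm nn}\rangle(k)=\sum_{k'}k'P(k'|k),\qquad
P(k'|k)=\frac{k'p_d(k')}{\langle k\rangle}\ \Rightarrow\
\langle k_{\rm nn}\rangle(k)=\frac{\langle k^2\rangle}{\langle k\rangle},
\end{equation*}
where the second expression for $P(k'|k)$ holds only for an uncorrelated network. The result is independent of $k$, whereas a decreasing curve signals disassortative mixing.\\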
489:
490: \noindent
491: {\bf Effect of diversification of gene function on $\langle k_{\rm nn}\rangle(k)$.}
492: While the pattern of $C(k)$ of the fly becomes similar to those
493: of the yeast and the worm by the addition of the interlogs,
494: that of $\langle k_{\rm nn} \rangle(k)$ remains distinct.
Thus here we investigate through an {\it in silico} model
whether such flat behavior is intrinsic, finding that
the decreasing behavior of $\langle k_{\rm nn}\rangle(k)$
is indeed moderated as the network evolves through duplication
and divergence processes. Homologs in a genome
are thought to result from gene duplication events, which are
usually followed by diversification to lower the redundancy.
Some computer models aiming to mimic these processes
in proteome evolution exist in the literature \cite{sole,vespig}.
We investigate how the diversification process
affects the topological properties of the proteome network,
in particular, the degree-degree correlation in terms of
$\langle k_{\rm nn} \rangle(k)$.
To this end, we perform the following procedure, motivated by
V\'azquez {\em et al.}~\cite{vespig}:
510: \begin{enumerate}
\item Starting with the yeast protein network, at each step,
a protein A is chosen randomly and is duplicated as A$'$,
so that A and A$'$ initially share the same neighbors.
\item For each neighboring protein of A and A$'$, one of the two edges connecting it
to A and to A$'$ is removed, each with equal probability.
\item Repeat steps 1--2 until the number of proteins reaches $\sim$20,000,
the approximate size of the worm and the fly proteomes.
518: \end{enumerate}
Note that in this procedure, the number of proteins increases while
the number of interactions stays constant. Thus the average degree
decreases as the size of the proteome increases. In reality, such a decrease
would be compensated by, e.g., the acquisition of new interactions
between existing proteins via mutation. However, we do not take
such a process into account, in order to single out the effect of
diversification.
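This bookkeeping can be written out explicitly (a sketch, assuming the procedure starts from the yeast dataset of Table~\ref{table1} with $N=4714$ and $L=14857$). If the chosen protein A has degree $k_{\rm A}$, step 1 adds $k_{\rm A}$ edges and step 2 removes one edge per neighbor, so that
\begin{equation*}
L \to L + k_{\rm A} - k_{\rm A} = L, \qquad k_{\rm A}^{\rm new} + k_{{\rm A}'}^{\rm new} = k_{\rm A},
\end{equation*}
and the mean degree $\langle k\rangle = 2L/N$ is diluted from $6.3$ to about $2\times 14857/20000\approx 1.5$ when $N$ reaches 20,000.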
526:
527: The result of simulation is shown in Fig.~\ref{model}.
528: The local clustering function $C(k)$ is simply shifted downward, due to
529: the overall decrease of the edge density.
On the other hand, the average neighbor degree
\knn~decreases with $k$ but at a smaller rate, indicating that the
diversification process can, although not perfectly, neutralize
the connectivity correlation.
Furthermore, if we {\em assume} that the establishment of new interactions
follows preferential attachment \cite{ba} or random attachment,
the overall correlation would diminish eventually.\\
537:
538: \begin{figure}[!t]
539: \centerline{\epsfxsize=9cm \epsfbox{fig5-model.eps}}
\caption{Effect of gene function diversification on (a) $C(k)$ and (b)
\knn. Red circles are the data
of the original yeast network and blue squares are those after
running the diversification procedure {\it in silico}.
The slopes of the straight lines
(the rates of decrease) in (b) are $-0.3$ (top, green)
and $-0.15$ (bottom, magenta), respectively.}
547: \label{model}
548: \end{figure}
549: \begin{figure*}
550: \centerline{\epsfxsize=15cm \epsfbox{fig6-sampled.eps}}
\caption{Effect of bait selection. Red circles are for the full data,
green diamonds for the randomly sampled data, and blue squares for the data from biased sampling.}
553: \label{bait}
554: \end{figure*}
555:
556: \noindent
557: {\bf Effect of bait selection on $\langle k_{\rm nn}\rangle(k)$.}
There has been an argument that the apparent decreasing trend
in $\langle k_{\rm nn}\rangle(k)$ is an artifact of the limited
selection of baits in the two-hybrid experiment \cite{aloy}.
Indeed, Li {\em et al.}~\cite{vidal} selected the baits by their
own criteria, mainly based on biological indispensability
and potential applicability to human therapeutics.
To check this hypothesis {\it in silico}, we sampled a 30\% subset
of the 4950 baits identified in the fly network of Giot {\em et al.}~\cite{giot} and
reconstructed the network only with the interactions associated with
the sampled baits.
We sampled in two different ways: random sampling
and biased sampling toward the highly connected
baits (the sampling probability is proportional to
the number of bait-interactions).
Both sampled datasets show a decreasing trend
in $\langle k_{\rm nn}\rangle(k)$ (Fig.~\ref{bait}).
One can see that even though the original network has a null slope in
$\langle k_{\rm nn}\rangle (k)$,
a negative slope develops in the sampled ones, demonstrating
that insufficient coverage of baits {\em can} produce
artifactual correlation
in the connectivity. If this scenario holds, one may conjecture that the
$\langle k_{\rm nn} \rangle(k)$ curve will become flatter
as the interaction data accumulate and become more complete.
582: \\
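The two sampling schemes can be written compactly (a sketch; the symbols $B$ for the number of baits and $b_i$ for the number of bait-interactions of bait $i$ are introduced here for illustration). With $B=4950$ baits, a 30\% subset contains $0.3\times 4950=1485$ baits, and a single draw selects bait $i$ with probability
\begin{equation*}
p_i^{\rm random}=\frac{1}{B},\qquad p_i^{\rm biased}=\frac{b_i}{\sum_{j=1}^{B}b_j}.
\end{equation*}
For a subset, drawing without replacement and renormalizing the probabilities after each draw is one natural reading of this prescription.\\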
583:
584: \noindent{\bf Summary and discussion}\\
585: We have investigated in detail the structural properties
586: of the protein interaction networks of three eukaryotic species,
587: the budding yeast, the nematode worm, and the fruit fly.
588: In particular, we have focused on the comparative assessment of
589: the modularity and the degree-degree correlation for those networks.
590: We found that while the worm dataset behaves similarly to the
591: yeast for the two graph theoretic quantities, the fly does not.
592: The difference might be attributed to the presence (absence) of
593: the yeast-interlogs in the current worm (fly) dataset.
594: For the fly dataset, the modularity is suppressed and
the connectivity correlation is lacking. We found that
the clustering function can be restored to a form similar to that of the yeast
dataset by adding to the current dataset interlogs selected randomly
among the candidates.
599: We also performed motif analysis for the three species, finding
600: that the density of the triangle motif is increased by the
601: addition of the interlogs to the current fly dataset.
602: Finally, the candidates of the protein interactions of the fly
603: are provided in the supplementary materials,
604: which could be useful in finding protein interactions missed in
605: the current fly dataset.
606: \\
607:
608: This work is supported by the KOSEF grant No. R14-2002-059-01000-0
609: in the ABRL program and the MOST grant No. M1 03B500000110.
610:
611:
612: \begin{thebibliography}{99}
613: %\setlength{\itemindent}{-1.22cm}
614: \bibitem{alberts} Alberts, B., Bray, D., Johnson, A., Lewis, J., Raff, M., Roberts, K. and Walter, P. (1998) {\it Essential Cell Biology} Garland, New York.
\bibitem{aloy} Aloy, P. and Russell, R. B. (2002) Potential artefacts in protein-interaction networks. {\it FEBS Lett.} {\bf 530}, 253--254.
616: \bibitem{ba} Barab\'asi, A.-L., and Albert, R. (1999) Emergence of scaling in random networks. {\it Science} {\bf 286}, 509--512.
617: \bibitem{nrg} Barab\'asi, A.-L. and Oltvai, Z. N. (2004) Network biology: understanding the cell's functional organization. {\it Nat. Rev. Genet.} {\bf 5}, 101--114.
618: \bibitem{gavin} Gavin, A.-C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J. M., Michon, A. M., Cruciat, C. M., et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. {\it Nature} {\bf 415}, 141--147.
\bibitem{giot} Giot, L., Bader, J. S., Brouwer, C., Chaudhuri, A., Kuang, B., Li, Y., Hao, Y. L., Ooi, C. E., Godwin, B., Vitols, E., et al. (2003) A protein interaction map of {\it Drosophila melanogaster}. {\it Science} {\bf 302}, 1727--1736.
\bibitem{ho} Ho, Y., Gruhler, A., Heilbut, A., Bader, G. D., Moore, L., Adams, S. L., Millar, A., Taylor, P., Bennett, K., Boutilier, K., et al. (2002) Systematic identification of protein complexes in {\it Saccharomyces cerevisiae} by mass spectrometry. {\it Nature} {\bf 415}, 180--183.
621: \bibitem{ito} Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M. and Sakaki, Y. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. {\it Proc. Natl. Acad. Sci. U.S.A.} {\bf 98}, 4569--4574.
622: \bibitem{jeong} Jeong, H., Mason, S. P., Barab\'asi, A.-L. and Oltvai, Z. N. (2001) Lethality and centrality in protein networks. {\it Nature} {\bf 411}, 41--42.
623: \bibitem{vidal} Li, S., Armstrong, C. M., Bertin, N., Ge, H., Milstein, S., Boxem, M., Vidalain, P. O., Han, J. D., Chesneau, A., Hao, T., et al. (2004) A map of the interactome network of the metazoan {\it C. elegans}. {\it Science} {\bf 303}, 540--543.
624: \bibitem{maslov} Maslov, S. and Sneppen, K. (2002) Specificity and stability in topology of protein networks. {\it Science} {\bf 296}, 910--913.
625: \bibitem{milo} Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. and Alon, U. (2002) Network motifs: simple building blocks of complex networks. {\it Science} {\bf 298}, 824--827.
\bibitem{mips} Mewes, H. W., Amid, C., Arnold, R., Frishman, D., Guldener, U., Mannhaupt, G., Munsterkotter, M., Pagel, P., Strack, N., Stumpflen, V., et al. (2004) MIPS: analysis and annotation of proteins from whole genomes. {\it Nucl. Acids Res.} {\bf 32}, D41--D44.
\bibitem{assort} Newman, M. E. J. (2002) Assortative mixing in networks. {\it Phys. Rev. Lett.} {\bf 89}, 208701.
\bibitem{knn} Pastor-Satorras, R., V\'azquez, A. and Vespignani, A. (2001) Dynamical and correlation properties of the Internet. {\it Phys. Rev. Lett.} {\bf 87}, 258701.
629: \bibitem{ravasz} Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N. and Barab\'asi, A.-L. (2002) Hierarchical organization of modularity in metabolic networks. {\it Science} {\bf 297}, 1551--1555.
630: \bibitem{dip} Salwinski, L., Miller, C. S., Smith, A. J., Pettit, F. K., Bowie, J. U. and Eisenberg, D. (2004) The Database of Interacting Proteins: 2004 update. {\it Nucl. Acids Res.} {\bf 32}, D449--D451.
631: \bibitem{sole} Sol\'e, R. V., Pastor-Satorras, R., Smith, E. and Kepler, T. (2002) A model of large-scale proteome evolution. {\it Adv. Complex Syst.} {\bf 5}, 43--54.
632: \bibitem{kog} Tatusov, R. L., Fedorova, N. D., Jackson, J. D., Jacobs, A. R., Kiryutin, B., Koonin, E. V., Krylov, D. M., Mazumder, R., Mekhedov, S. L., Mikolskaya, A. N., et al. (2003) The COG database: an updated version includes eukaryotes. {\it BMC Bioinformatics} {\bf 4}, 41.
633: \bibitem{tong} Tong, A. H. Y., Drees, B., Nardelli, G., Bader, G. D., Brannetti, B., Castagnoli, L., Evangelista, M., Ferracuti, S., Nelson, B., Paoluzi, S., et al. (2001) A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. {\it Science} {\bf 295}, 321--324.
\bibitem{uetz} Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., Judson, R. S., Knight, J. R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., et al. (2000) A comprehensive analysis of protein-protein interactions in {\it Saccharomyces cerevisiae}. {\it Nature} {\bf 403}, 623--627.
635: \bibitem{vespig} V\'azquez, A., Flammini, A., Maritan, A. and Vespignani, A. (2003) Modelling of protein interaction networks. {\it Complexus} {\bf 1}, 38--42.
636: \bibitem{wagner} Wagner, A. (2001) The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. {\it Mol. Biol. Evol.} {\bf 18}, 1283--1292.
637: \bibitem{wuchty} Wuchty, S., Oltvai, Z. N. and Barab\'asi, A.-L. (2003) Evolutionary conservation of motif constituents in the yeast protein interaction network. {\it Nat. Genet.} {\bf 35}, 176--179.
638: \end{thebibliography}
639:
640: \end{document}
641:
642: