
1: \documentclass[pre,aps,twocolumn,nofootinbib,floatfix]{revtex4}

2: \usepackage{epsf,amssymb,amsmath,multirow}

3: \begin{document}

4: \newcommand{\knn}{$\langle k_{\rm nn}\rangle (k)$}

5: \thispagestyle{empty}

6: \title{Graph theoretic analysis of protein interaction networks of eukaryotes}

7: \author{K.-I. Goh$^*$, B. Kahng$^{*,\dagger}$ and D. Kim$^*$}

8: \affiliation{$^*$School of Physics and $^{\dagger}$Program in Bioinformatics, Seoul National University, Seoul 151-747, Korea}

9: \date{June 15, 2004}

10: \begin{abstract}

11: Thanks to recent progress in high-throughput experimental techniques, large-scale protein interaction datasets have been obtained for two prototypical multicellular species, the nematode worm {\em Caenorhabditis elegans} and the fruit fly {\em Drosophila melanogaster}. The datasets are obtained mainly by the yeast two-hybrid method, which yields both false positives and false negatives. Accordingly, while it is desirable to test such datasets through further wet experiments, here we invoke recently developed network theory to test such high-throughput datasets in a simple way.

12: Based on the fact that the key biological processes indispensable to maintaining life are universal across eukaryotic species, we compare the structural properties of the protein interaction networks (PINs) of the two species with those of the yeast PIN. We find that while the worm and the yeast PIN datasets exhibit similar structural properties, the current fly dataset, though the most comprehensively screened to date, does not correctly reflect the generic structural properties as it stands: the modularity is suppressed and the connectivity correlation is lacking. Addition of interlogs to the current fly dataset increases the modularity and enhances the occurrence of triangular motifs as well. The connectivity correlation function of the fly, however, remains distinct under such interlog addition, for which we present a possible scenario through {\em in silico} modeling.

13: \end{abstract}

14: \maketitle

15: \renewcommand{\thetable}{{\bf \arabic{table}}}

16: \renewcommand{\thefigure}{{\bf \arabic{figure}}}

17: \renewcommand{\figurename}{{\bf Fig.}}

18: \renewcommand{\tablename}{{\bf Table}}

19:

20: \noindent

21: {\bf Introduction}\\

22: In the last few years, graph theoretic methods for understanding complex

23: biomolecular systems have developed very rapidly \cite{nrg}.

24: This development has advanced the effort to uncover the organizing

25: principles of cellular networks in post-genomic biology.

26: The cellular components such as genes, proteins, and other biological

27: molecules, connected by all physiologically relevant interactions,

28: form a full weblike molecular architecture in a cell. In such an

29: architecture, genes, which are expressed through proteins, play a

30: central role. Proteins rarely act alone; rather, they cooperate with others

31: to act physiologically. Thus protein interactions play pivotal roles

32: in various aspects of the structural and functional organization of the cell,

33: and their complete description would be the first step toward a thorough

34: understanding of the web of life. Proteins are viewed as

35: nodes of a complex protein interaction network (PIN) in which

36: two proteins are linked if they physically contact each other.

37: The graph theoretic approach has been useful for understanding the intricately

38: interwoven structure of the PIN \cite{jeong,wagner,maslov}.

39: The key biological processes indispensable to maintaining life

40: are universal across eukaryotic species since many involved genes

41: are evolutionarily conserved \cite{alberts}.

42: Using this property, one can test whether a newly obtained dataset

43: contains more or less complete information on protein

44: interactions. Moreover, this {\it in silico} approach yields

45: candidate protein interaction pairs, whose number

46: is considerably smaller than that of all combinatorially possible

47: pairs. Thus, graph theoretic analysis can provide a useful

48: guide for further wet studies of protein interactions.

49:

50: Species with sequenced genomes such as the yeast {\em Saccharomyces

51: cerevisiae} provide important test beds for the study of the PIN.

52: Thanks to recent progress in high-throughput experimental

53: techniques such as the yeast two-hybrid assay \cite{uetz,ito} and

54: mass spectrometry \cite{gavin,ho},

55: the dataset of the yeast PIN has been firmly established~\cite{mips,dip}.

56: Very recently, large-scale protein interactions of multicellular species,

57: the nematode worm {\em Caenorhabditis elegans} \cite{vidal}

58: and the fruit fly {\em Drosophila melanogaster} \cite{giot},

59: have been assayed. While those datasets, mainly based on the yeast

60: two-hybrid assay, need physiological validation, they contain large numbers

61: of proteins and protein interactions, making graph theoretic study possible.

62: In this paper, we analyze those datasets and compare them with the

63: more-established set of interactions in the budding yeast~\cite{dip}.

64: Our graph theoretic analysis suggests that the present interaction dataset

65: of the fruit fly, based on the yeast two-hybrid (Y2H) assay, may have left out

66: a significant part of the protein interactions, though it is the most comprehensively

67: screened to date. This conclusion is reached by comparing

68: the generic features of the PIN, the modularity and the

69: connectivity correlations, across the three species. For the fly, those

70: quantities behave distinctively: the modularity is suppressed

71: and the connectivity correlation is lacking. Such distinct behavior

72: can be overcome partially by the addition of yeast interlogs to the

73: fly dataset.\\

74:

75: \begin{figure*}[t]

76: \centerline{\epsfxsize=15cm \epsfbox{fig1-pk.eps}}

77: \caption{The degree distributions $p_d(k)$ for

78: {\bf (a)} the yeast,

79: {\bf (b)} the prokaryotes {\em Helicobacter pylori} ($\circ$)

80: and {\em Escherichia coli} ($\Box$),

81: {\bf (c)} the worm (Worm-All),

82: {\bf (d)} the Y2H subset of the worm dataset (Worm-Y2H),

83: {\bf (e)} the fly,

84: and {\bf (f)} the Fly+Interlog dataset.

85: }

86: \label{fig:pk}

87: \end{figure*}

88:

89: \noindent{\bf Materials and Methods}\\

90: {\bf Graph theory terminology.}

91: (i) A network is composed of vertices and edges. In the

92: protein interaction network, vertices represent proteins and

93: edges represent protein interactions.

94: (ii) The degree is the number of edges connected to a given

95: vertex. The degree distribution $p_d(k)$ is the fraction of vertices

96: having degree $k$.

97: (iii) The clustering coefficient of a vertex $i$ is

98: defined as $C_i=2e_i/[k_i(k_i-1)]$, where $e_i$ is the number

99: of connections among the $k_i$ neighbors of the vertex $i$.

100: The clustering function $C(k)$ is the mean value of $C_i$ over

101: the vertices with degree $k$, while the clustering coefficient

102: $C$ is the mean of $C_i$ over all vertices.

103: When the network contains

104: hierarchical and modular structures within it, it is known

105: that the clustering function $C(k)$ behaves as

106: $C(k)\sim k^{-\beta}$ for large $k$~\cite{ravasz}.

107: (iv) $\langle k_{\rm nn} \rangle(k)$ is the mean degree

108: of the neighbors of a vertex with degree $k$.

109: It is known that $\langle k_{\rm nn}

110: \rangle (k)\sim k^{-\nu}$ with $\nu > 0$ for the Internet and

111: the protein interaction network~\cite{knn,maslov}, implying that

112: vertices with large degree tend to connect to the ones with small

113: degree. Such a network is called a disassortative network.

114: Besides this quantity, the assortativity index $r$ has been

115: introduced \cite{assort} to characterize the degree-degree

116: correlation between the two vertices located at the ends of

117: an edge, which is defined as

118: \begin{equation*}

119: r = \frac{\langle k_1k_2\rangle - \langle (k_1+k_2)/2\rangle^2}{

120: \langle (k_1^2+k_2^2)/2\rangle - \langle (k_1+k_2)/2\rangle^2} \quad,

121: \end{equation*}

122: where $k_1$ and $k_2$ are the degrees of two vertices at the ends of

123: an edge, and $\langle \cdots\rangle$ denotes the average over all

124: edges.\\
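The quantities defined above can be computed directly from an edge list. The following Python sketch is one minimal way to do so; it is an illustration, not the code used for this paper. The function names and the edge-list input format are assumptions, and vertices with degree below 2 are left out of $C_i$, a convention the text does not specify.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Return {vertex: set of neighbours} for an undirected edge list.

    Self-interactions are dropped, as in the datasets analysed here.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def local_clustering(adj):
    """C_i = 2 e_i / (k_i (k_i - 1)) for every vertex with k_i >= 2."""
    c = {}
    for i, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # assumption: C_i is left undefined for k_i < 2
        # each edge among the neighbours is seen twice in this double loop
        e = sum(1 for u in nbrs for v in adj[u] if v in nbrs) // 2
        c[i] = 2.0 * e / (k * (k - 1))
    return c


def neighbour_degree(adj):
    """Mean degree of the neighbours of each vertex."""
    return {i: sum(len(adj[j]) for j in nbrs) / len(nbrs)
            for i, nbrs in adj.items()}


def average_by_degree(adj, values):
    """Average a per-vertex quantity over vertices of equal degree k."""
    total, count = defaultdict(float), defaultdict(int)
    for i, x in values.items():
        k = len(adj[i])
        total[k] += x
        count[k] += 1
    return {k: total[k] / count[k] for k in total}


def assortativity(adj):
    """Assortativity r, averaging over both orientations of every edge."""
    m = s1 = s2 = s12 = 0
    for u, nbrs in adj.items():
        ku = len(nbrs)
        for v in nbrs:
            m += 1
            s1 += ku
            s2 += ku * ku
            s12 += ku * len(adj[v])
    mean = s1 / m
    # undefined (division by zero) when all vertices have the same degree
    return (s12 / m - mean ** 2) / (s2 / m - mean ** 2)


# toy example: a triangle a-b-c with a pendant vertex d attached to c
adj = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
c_of_k = average_by_degree(adj, local_clustering(adj))      # C(k)
knn_of_k = average_by_degree(adj, neighbour_degree(adj))    # <k_nn>(k)
r = assortativity(adj)
```

Averaging over both orientations of each edge gives the same $r$ as the edge average in the definition, because the expression is symmetric in $k_1$ and $k_2$.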

125:

126: \noindent

127: {\bf The protein interaction network datasets.}

128: We used the yeast subset of the interaction data compiled in the

129: Database of Interacting Proteins (DIP) as of January 2004

130: (http://dip.doe-mbi.ucla.edu) \cite{dip}. The datasets for the

131: worm and the fly are obtained from the works of Li {\em et al.}~\cite{vidal}

132: and Giot {\em et al.}~\cite{giot},

133: respectively.

134: For the worm, we consider two different versions, the one consisting

135: of only the interactions from the Y2H screens (referred to as Worm-Y2H

136: network in this paper)

137: and the other the full network supplied by Li {\em et al.}~\cite{vidal}

138: (referred to as Worm-All network).

139: The characteristics of each dataset and the values

140: of the graph theoretic quantities are tabulated in Table~\ref{table1}.\\

141:

142: \begin{table}[b]

143: \caption{

144: {\bf Protein interaction network datasets.}

145: Tabulated are for each dataset the size of proteome $N_{\rm proteome}$,

146: the number of proteins $N$ and the number of protein--protein

147: interactions $L$ in the dataset, the mean degree $\langle k\rangle$,

148: the clustering coefficient $C$, the assortativity $r$,

149: and the number of proteins forming the largest cluster $N_1$.

150: The self-interactions are excluded throughout.

151: }

152: \begin{ruledtabular}

153: \begin{tabular}{ccccc}

154:  & Yeast & Worm-Y2H & Worm-All & Fly \\

155: \hline

156: $N_{\rm proteome}$ & 6195 & 22246 & 22246 & 16206\\

157: $N$ & 4714 & 2835 & 3216 & 7055 \\

158: $L$ & 14857 & 4438 & 5444 & 20947\\

159: $\langle k\rangle$ & 6.3 & 3.1 & 3.4 & 5.9\\

160: $C$ & 0.12 & 0.047 & 0.15 & 0.014\\

161: $r$ & -0.14 & -0.16 & -0.13 & -0.036\\

162: $N_1$ & 4627 & 2601 & 2898 & 6929

163: \label{table1}

164: \end{tabular}

165: \end{ruledtabular}

166: \end{table}
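As a quick consistency check on Table~\ref{table1}, the mean degree follows from $N$ and $L$ alone, since each interaction contributes to the degree of two proteins. Using the tabulated yeast and fly values as worked examples:
\begin{equation*}
\langle k\rangle = \frac{2L}{N}, \qquad
\frac{2\times 14857}{4714}\approx 6.3 \ \ \mbox{(yeast)}, \qquad
\frac{2\times 20947}{7055}\approx 5.9 \ \ \mbox{(fly)}.
\end{equation*}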

167:

168: \noindent

169: {\bf Orthologous gene assignment.}

170: For cross-species ortholog information, we used the information from the

171: KOG database \cite{kog}, a eukaryotic extension of the Clusters of

172: Orthologous Groups (COG) database (http://www.ncbi.nlm.nih.gov/COG/new/).

173:

174: \begin{figure*}[t]

175: \centerline{\epsfxsize=15cm \epsfbox{fig2-ck.eps}}

176: \caption{The local clustering function $C(k)$ for

177: {\bf (a)} the yeast,

178: {\bf (b)} the bacteria {\it H. pylori} ($\circ$) and {\em E. coli} ($\Box$),

179: {\bf (c)} the worm (Worm-All),

180: {\bf (d)} the Worm-Y2H dataset,

181: {\bf (e)} the fly, and

182: {\bf (f)} the Fly+Interlog dataset.

183: The abscissae and ordinates are fixed for clear comparison.

184: }

185: \label{fig2-ck}

186: \end{figure*}

187:

188: \noindent

189: {\bf Yeast interlogs in fly.}

190: Having identified the yeast-fly orthologs, we look for the

191: interactions in the yeast network between

192: pairs of yeast proteins that both have orthologs in the fly network.

193: Such orthologous interactions are called interlogs.

194: If the corresponding fly interaction is present, we call

195: it an {\em overlap interlog}. If not, we call it a {\em potential

196: interlog}. Note that the ortholog relationship is not always one-to-one,

197: resulting in multiple interlogs for a given yeast interaction.

198: For {\em in silico} analysis of the effect of the addition of

199: potential interlogs to the fly network, we include on average one potential

200: interlog per yeast interaction. Specifically, for each

201: yeast interaction A-B having no overlap interlog, each potential interlog

202: is added to the fly network with probability $1/(o_Ao_B)$,

203: where $o_X$ is the number of fly ortholog(s) of the yeast gene X.

204: The network obtained in this way

205: is referred to as the Fly+Interlog network hereafter.

206: The full lists of the 408 overlap and the 55176 potential interlogs

207: are available on the web

208: (http://komplex0.snu.ac.kr/pin/yeast-fly-interlog.xls).

209: \\
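The interlog addition rule above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the function names and the dictionary format of the ortholog map are hypothetical, and candidate pairs in which both yeast proteins map to the same fly protein are skipped because self-interactions are excluded throughout.

```python
import random


def norm(x, y):
    """Order-independent key for an undirected interaction."""
    return (x, y) if x <= y else (y, x)


def add_potential_interlogs(yeast_edges, fly_edges, orthologs, rng):
    """Return the fly edge set augmented with sampled potential interlogs.

    orthologs maps a yeast protein to a list of its fly orthologs
    (hypothetical input format; the paper takes orthologs from KOG).
    """
    fly = {norm(x, y) for x, y in fly_edges}
    augmented = set(fly)
    for a, b in yeast_edges:
        fa, fb = orthologs.get(a, []), orthologs.get(b, [])
        if not fa or not fb:
            continue  # at least one partner has no fly ortholog
        # candidate fly pairs; self-interactions are excluded
        candidates = sorted({norm(x, y) for x in fa for y in fb if x != y})
        if any(pair in fly for pair in candidates):
            continue  # an overlap interlog exists, so nothing is added
        p = 1.0 / (len(fa) * len(fb))  # probability 1/(o_A o_B)
        for pair in candidates:
            if rng.random() < p:
                augmented.add(pair)
    return augmented


# toy example with made-up protein names
orthologs = {"yA": ["fA"], "yB": ["fB"], "yC": ["fC1", "fC2"]}
yeast_edges = [("yA", "yB"), ("yA", "yC")]
fly_edges = [("fA", "fC1")]
fly_plus = add_potential_interlogs(yeast_edges, fly_edges, orthologs,
                                   random.Random(1))
```

In the toy example the yeast interaction yA-yB has a single candidate, which is added with probability one, while yA-yC already has an overlap interlog and contributes nothing.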

210:

211: \noindent{\bf Results}\\

212: \noindent{\bf Degree distributions.}

213: In Fig.~\ref{fig:pk}, we plot the degree distributions of diverse

214: protein interaction networks. All of them display scale-free

215: behavior, fitting well to the generalized Pareto form

216: $p_d(k)\sim (k+k_0)^{-\gamma}$,

217: and are almost indistinguishable from each other.

218: While the degree distribution is a fundamental quantity in graph theory,

219: it characterizes only the global network structure, so it does not give

220: detailed information on structural properties.

221: \\

222:

223: \noindent{\bf Modularity}\\

224: A cellular function is achieved by a set of related

225: proteins, usually forming a pathway or a complex.

226: Such a functional module manifests itself as a localized dense

227: subgraph within the whole cellular network.

228: The presence of modules and their hierarchical organization

229: can be visualized by the local clustering function $C(k)$ \cite{ravasz}.

230: For the yeast PIN, $C(k)$ exhibits a plateau

231: for small $k$ and falls off rapidly for large $k$,

232: reflecting the modular structure bridged by the hubs (Fig.~2{\bf a}).

233: A similar pattern is observed in the worm (Fig.~2{\bf c})

234: and the two prokaryotic species, {\em H. pylori}

235: and {\em E. coli} (Fig.~2{\bf b}).

236: Note that the worm dataset contains the yeast interlogs.

237: For the fly Y2H data, however, $C(k)$ behaves distinctively,

238: almost constant for all $k$ (Fig.~2{\bf e}).

239: To understand this discrepancy, we add

240: the potential yeast interlogs into the current fly Y2H dataset.

241: Then $C(k)$ behaves in a similar fashion to the other datasets,

242: showing a moderate plateau for small $k$ and a rapid decrease for

243: large $k$, although the height of the plateau, which is roughly

244: the clustering coefficient $C$, is not as high as in the yeast and

245: the worm (Fig.~2{\bf f}).

246: To find the role of the interlogs in the worm, we consider

247: the Worm-Y2H dataset, and plot its $C(k)$ in Fig.~2{\bf d}.

248: Indeed, the signature character of $C(k)$

249: is lost; in particular, the plateau for small $k$ almost disappears,

250: implying that the yeast interlogs play a role in forming modules, in which

251: proteins are closely linked to each other.\\

252:

253: \noindent{\bf Conservation rate of interactions.}

254: We count how many yeast interactions are actually conserved

255: in orthologous form in both the worm and the fly.

256: The conservation rate found in this way for the Y2H screen dataset

257: is surprisingly low: 2.7\% for the worm (Worm-Y2H)

258: and 3.8\% for the fly.

259: For the worm, we note that such low coverage is in part

260: due to the insufficient number of baits used in the experiment

261: (3,024 baits, 833 out of which are present in the network).

262: When we consider the conservation of triangular interaction patterns,

263: a basic unit of cooperative functional module \cite{milo},

264: only 3 out of 1731 are conserved in the worm, and none is conserved in the fly

265: (Fig.~3).

266: The lack of conserved interaction motifs in the fly data

267: suggests that the current fly network misses some of the important

268: cooperative aspects of the cellular network in the fly.

269: The effort to fill this gap is timely.\\

270:

271: \begin{figure}

272: \centerline{\epsfxsize=8.5cm \epsfbox{fig3-triangle.eps}}

273: \caption{

274: Conservation of interaction motif.

275: Shown in the middle is a triangular interaction subgraph within the yeast involved in

276: ubiquitin-dependent protein catabolism. The corresponding

277: orthologous counterparts in the worm and the fly are also shown.

278: This motif is conserved in the worm Y2H data, while only a single interaction

279: is detected in the fly data.

280: }

281: \end{figure}

282:

283: \begin{table*}

284: \caption{Network motif structure of the three species.

285: Tabulated is the number of each subgraph present in the network.

286: According to its $Z$- and $E$-score, the significant motifs (M) and

287: anti-motifs (AM) are indicated.}

288: \label{motif-table}

289: \begin{tabular}{cccccc}

290:  & Yeast & Worm-Y2H & Worm-All & Fly & Fly+Interlog \\

291: \hline

292: \multirow{2}{5mm}{

293: \begin{picture}(0,0)(0,0)

294:  \qbezier(0,0)(4,6.8)(4,6.8)

295:  \qbezier(4,6.8)(4,6.8)(8,0)

296:  \put(0.2,0){\circle*{3}}

297:  \put(8.2,0){\circle*{3}}

298:  \put(4.2,6.8){\circle*{3}}

299: \end{picture}

300: }

301: & 329961 & 81205 & 87294 & 413926 & 520704$\pm$1358 \\

302: &  &  &  &  & \\

303: \hline

304: \multirow{2}{5mm}{

305: \begin{picture}(0,0)(0,0)

306:  \qbezier(0,0)(4,6.8)(4,6.8)

307:  \qbezier(4,6.8)(4,6.8)(8,0)

308:  \qbezier(0,0)(0,0)(8,0)

309:  \put(0.2,0){\circle*{3}}

310:  \put(8.2,0){\circle*{3}}

311:  \put(4.2,6.8){\circle*{3}}

312: \end{picture}

313: }

314: & 7136 & 366 & 1512 & 1549 & 3504$\pm$40 \\

315: & M (Z=80, E=3.3) & & M (Z=29, E=2.5) & & M (Z=45, E=1.4)\\

316: \hline

317: \multirow{2}{5mm}{

318: \begin{picture}(0,0)(0,0)

319:  \qbezier(0,0)(0,8)(0,8)

320:  \qbezier(0,8)(8,8)(8,8)

321:  \qbezier(8,8)(8,0)(8,0)

322:  \put(0.2,0){\circle*{3}}

323:  \put(8.2,0){\circle*{3}}

324:  \put(0.2,8){\circle*{3}}

325:  \put(8.2,8){\circle*{3}}

326: \end{picture}

327: }

328: & 4081023 & 604723 & 680485 & 7378808 & 971960$\pm$37157 \\

329: &  &  &  &  &  \\

330: \hline

331: \multirow{2}{5mm}{

332: \begin{picture}(0,0)(0,0)

333:  \qbezier(0,0)(0,8)(0,8)

334:  \qbezier(0,8)(8,8)(8,8)

335:  \qbezier(0,8)(8,0)(8,0)

336:  \put(0.2,0){\circle*{3}}

337:  \put(8.2,0){\circle*{3}}

338:  \put(0.2,8){\circle*{3}}

339:  \put(8.2,8){\circle*{3}}

340: \end{picture}

341: }

342: & 9024723 & 2129609 & 2157048 & 6315922 & 7409320$\pm$24476 \\

343: &  &  &  &  &  \\

344: \hline

345: \multirow{2}{5mm}{

346: \begin{picture}(0,0)(0,0)

347:  \qbezier(0,0)(0,8)(0,8)

348:  \qbezier(0,8)(8,8)(8,8)

349:  \qbezier(0,0)(0,0)(8,0)

350:  \qbezier(0,8)(8,0)(8,0)

351:  \put(0.2,0){\circle*{3}}

352:  \put(8.2,0){\circle*{3}}

353:  \put(0.2,8){\circle*{3}}

354:  \put(8.2,8){\circle*{3}}

355: \end{picture}

356: }

357: & 368730 & 46050 & 58520 & 160846 & 263324$\pm$2617 \\

358: & AM (Z=-122, E=-0.7) &  & AM (Z=-59, E=-0.7) &  &  \\

359: \hline

360: \multirow{2}{5mm}{

361: \begin{picture}(0,0)(0,0)

362:  \qbezier(0,0)(0,8)(0,8)

363:  \qbezier(0,8)(8,8)(8,8)

364:  \qbezier(8,8)(8,0)(8,0)

365:  \qbezier(8,0)(8,0)(0,0)

366:  \put(0.2,0){\circle*{3}}

367:  \put(8.2,0){\circle*{3}}

368:  \put(0.2,8){\circle*{3}}

369:  \put(8.2,8){\circle*{3}}

370: \end{picture}

371: }

372: & 21806 & 4350 & 4686 & 54100 & 60648$\pm$206 \\

373: &  & M (Z=9.5, E=0.6) &  & M (Z=81, E=2.1)  & M (Z=60, E=1.3) \\

374: \hline

375: \multirow{2}{5mm}{

376: \begin{picture}(0,0)(0,0)

377:  \qbezier(0,0)(0,8)(0,8)

378:  \qbezier(0,8)(8,8)(8,8)

379:  \qbezier(8,8)(8,0)(8,0)

380:  \qbezier(8,0)(8,0)(0,0)

381:  \qbezier(0,8)(8,0)(8,0)

382:  \put(0.2,0){\circle*{3}}

383:  \put(8.2,0){\circle*{3}}

384:  \put(0.2,8){\circle*{3}}

385:  \put(8.2,8){\circle*{3}}

386: \end{picture}

387: }

388: & 27455 & 1505 & 4120 & 4029 & 9313$\pm$228 \\

389: & AM (Z=-49, E=-0.7) &  & AM (Z=-25, E=-0.6) & M (Z=12, E=0.8) &  \\

390: \hline

391: \multirow{2}{5mm}{

392: \begin{picture}(0,0)(0,0)

393:  \qbezier(0,0)(0,8)(0,8)

394:  \qbezier(0,8)(8,8)(8,8)

395:  \qbezier(8,8)(8,0)(8,0)

396:  \qbezier(8,0)(8,0)(0,0)

397:  \qbezier(0,8)(8,0)(8,0)

398:  \qbezier(8,8)(0,0)(0,0)

399:  \put(0.2,0){\circle*{3}}

400:  \put(8.2,0){\circle*{3}}

401:  \put(0.2,8){\circle*{3}}

402:  \put(8.2,8){\circle*{3}}

403: \end{picture}

404: }

405: & 5259 & 30 & 1563 & 82 & 914$\pm$35 \\

406: &  &  & M (Z=10, E=0.8) & M (Z=11, E=3.5) &  M (Z=40, E=6.0)\\

407: \hline

408: \end{tabular}

409: \end{table*}

410:

411: \noindent{\bf Motif structure.}

412: Since the modularity manifested by $C(k)$

413: is closely related to the formation of triangles in the network,

414: here we further perform network motif analysis

415: for the three species datasets.

416: The network motifs are small recurring subgraphs which are overrepresented

417: in a given network and are believed to provide the basic evolutionary

418: and functional signatures of the network \cite{milo}.

419: Since it was recently discovered that the motif constituents

420: are more conserved during evolution than the rest \cite{wuchty},

421: one would expect the density of each motif to be

422: similar across the three species.

423: From the comparison of the columns for Yeast, Worm-All, and Fly

424: in Table \ref{motif-table}, we can see that

425: the triangle motif is relatively scarce in Fly,

426: while the square motif is relatively abundant.

427: Thus, the absolute magnitude of the clustering function

428: is smaller for the fly than for the yeast or the worm.

429: The density of the triangle motif is higher in the Fly+Interlog dataset,

430: indicating that the clustering coefficient is enhanced overall by the

431: addition of the interlogs to the fly dataset.

432:

433: In Table \ref{motif-table} we have summarized the motif structure

434: for each network.

435: We follow Milo {\em et al.}~\cite{milo} to calculate the two scores,

436: $Z$- and $E$-score, defined as

437: $Z=(N-N_{random})/\sigma_{random}$ and $E=(N-N_{random})/N_{random}$,

438: respectively, and use the following two criteria to specify whether a

439: subgraph is a motif or an anti-motif (an anti-motif is a subgraph

440: significantly underrepresented in the network):

441: \begin{itemize}

442: \item[(i)] The probability that $N$ occurrences are observed in the randomized networks

443: is smaller than 0.01.

444: \item[(ii)] $|E|>E_0$, where we set the threshold $E_0=0.5$,

445: rather than $E_0=0.1$ in Milo {\em et al.}~\cite{milo}.

446: \end{itemize}

447: Here, $N_{random}$ and $\sigma_{random}$ are the expected number of occurrences

448: in the randomized version of the network and its standard deviation,

449: respectively, obtained from 1000 samples,

450: where the randomization is performed by the

451: switching method \cite{milo}. In calculating them for

452: the 4-node subgraphs, the numbers of 3-node subgraphs are fixed to be

453: those of the original networks. For the Fly+Interlog network,

454: 10 realizations of interlog addition (see Materials and Methods) are averaged.

455: \\
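The scoring procedure described above can be sketched for the simplest case, the triangle. The Python code below is a minimal illustration and not the implementation used for Table~\ref{motif-table}: the function names, the number of attempted switches per edge, and the use of the population standard deviation are assumptions, and the constraint that fixes the 3-node subgraph counts when scoring 4-node subgraphs is not included. The input is assumed to be a simple edge list without duplicates or self-interactions.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def adjacency(edges):
    """Return {vertex: set of neighbours} for an undirected edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def count_triangles(adj):
    """Number of triangles; every triangle is seen 6 times as a closed walk."""
    closed = sum(1 for u in adj for v in adj[u] for w in adj[v] if w in adj[u])
    return closed // 6


def switch_randomize(edges, n_attempts, rng):
    """Degree-preserving randomization: swap a-b, c-d into a-d, c-b."""
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    for _ in range(n_attempts):
        i, j = rng.randrange(len(edge_list)), rng.randrange(len(edge_list))
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # choose the orientation of the second edge at random
        if len({a, b, c, d}) < 4:
            continue  # would create a self-interaction
        if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
            continue  # would create a duplicate edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


def triangle_scores(edges, n_samples=100, attempts_per_edge=10, seed=0):
    """Z- and E-score of the triangle count against switched networks."""
    rng = random.Random(seed)
    n_real = count_triangles(adjacency(edges))
    counts = []
    for _ in range(n_samples):
        rand = switch_randomize(edges, attempts_per_edge * len(edges), rng)
        counts.append(count_triangles(adjacency(rand)))
    mu, sigma = mean(counts), pstdev(counts)
    z = (n_real - mu) / sigma if sigma > 0 else float("nan")
    e = (n_real - mu) / mu if mu > 0 else float("nan")
    return {"N": n_real, "N_random": mu, "sigma_random": sigma, "Z": z, "E": e}


# toy example: a ring of 10 vertices with next-nearest-neighbour chords
ring = [(i, (i + 1) % 10) for i in range(10)] + \
       [(i, (i + 2) % 10) for i in range(10)]
scores = triangle_scores(ring, n_samples=20, seed=1)
```

Each accepted switch leaves every vertex degree unchanged, so the randomized networks share the degree sequence of the original one.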

456:

457: \begin{figure*}

458: \centerline{\epsfxsize=15cm \epsfbox{fig4-knn.eps}}

459: \caption{

460: The average neighbor degree function $\langle k_{\rm nn}\rangle(k)$ for

461: {\bf (a)} the yeast,

462: {\bf (b)} the prokaryotes {\em H. pylori} ($\circ$) and {\em E. coli} ($\Box$),

463: {\bf (c)} the worm (Worm-All),

464: {\bf (d)} the Y2H subset of the worm (Worm-Y2H),

465: {\bf (e)} the fly,

466: and {\bf (f)} the Fly+Interlog dataset.

467: The abscissae and ordinates are fixed for clear comparison.

468: }

469: \label{fig-knn}

470: \end{figure*}

471:

472: \noindent{\bf Degree-degree correlation}\\

473: The mean neighbor degree function \knn~is useful

474: in understanding the degree-degree correlation in a network.

475: In Fig.~\ref{fig-knn}, we plot \knn~for each dataset.

476: For the yeast, it is known that \knn~decreases

477: with increasing $k$ \cite{maslov}, which turns out to be true

478: for some prokaryotic species as well (Figs.~4{\bf a}-{\bf b}).

479: Such behavior of \knn~is also observed for the worm

480: (Figs.~4{\bf c}-{\bf d}). However,

481: it is flat for the fly, implying a lack of correlation (Fig.~4{\bf e}).

482: Such distinct behavior for the fly is robust under the addition of

483: the interlogs (Fig.~4{\bf f}), which suggests the lack of

484: correlation in the fly network could be intrinsic,

485: even though we cannot exclude the possibility

486: that it is again an artifact of the incompleteness of the data.

487: The hypothesis that the lack of correlation could be intrinsic

488: may be supported by the following observations.\\

489:

490: \noindent

491: {\bf Effect of diversification of gene function on $\langle k_{\rm nn}\rangle(k)$.}

492: While the pattern of $C(k)$ of the fly becomes similar to those

493: of the yeast and the worm by the addition of the interlogs,

494: that of $\langle k_{\rm nn} \rangle(k)$ remains distinct.

495: Thus here we investigate if such a flat behavior is intrinsic

496: through an {\it in silico} model, finding that indeed,

497: the decreasing behavior of $\langle k_{\rm nn}\rangle(k)$

498: becomes moderated through the network evolution with the duplication

499: and divergence processes. Homologs in a genome

500: are thought to result from gene duplication events, which are

501: usually followed by diversification to lower the redundancy.

502: Some computer models aiming to mimic these processes

503: in proteome evolution exist in the literature \cite{sole,vespig}.

504: We investigate how the diversification process

505: affects the topological property of the proteome network,

506: in particular, the degree-degree correlation in terms of

507: $\langle k_{\rm nn} \rangle(k)$.

508: To this end, we perform the following procedure motivated by

509: V\'azquez {\em et al.}~\cite{vespig}:

510: \begin{enumerate}

511: \item Starting with the yeast protein network, at each step,

512: a protein A is chosen randomly and is duplicated as A$'$.

513: Then the proteins A and A$'$ share common neighbors.

514: \item For each neighboring protein of A and A$'$, one of the two edges connecting it

515: to either A or A$'$ is removed with equal probability.

516: \item Repeat 1--2 until the number of proteins reaches $\sim$20,000,

517: the approximate sizes of the worm and the fly proteome.

518: \end{enumerate}

519: Note that in this procedure, the number of proteins increases while

520: the number of interactions stays constant. Thus the average degree

521: decreases as the size of the proteome increases. Such a decrease

522: would be compensated by, e.g., the acquisition of new interactions

523: between existing proteins via mutation. However, we do not take

524: such a process into account, in order to single out the effect of

525: diversification alone.
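The duplication and diversification steps above can be sketched as follows. This Python code is a minimal illustration under stated assumptions, not the simulation code behind Fig.~\ref{model}: the function name and the labels given to duplicated proteins are hypothetical, and proteins that lose all their interactions are simply kept in the network.

```python
import random


def duplicate_and_diverge(adj, target_size, rng):
    """Grow the network in place until it contains target_size proteins.

    adj maps each protein to the set of its interaction partners.
    """
    nodes = list(adj)
    next_id = 0
    while len(adj) < target_size:
        a = rng.choice(nodes)            # step 1: pick a protein at random
        dup = ("dup", next_id)           # hypothetical label for the copy A'
        next_id += 1
        adj[dup] = set()
        # step 2: for every neighbour, A and A' would both be linked to it;
        # exactly one of the two edges is kept, chosen with equal probability
        for nb in list(adj[a]):
            if rng.random() < 0.5:
                adj[a].discard(nb)       # remove A-nb and keep A'-nb
                adj[nb].discard(a)
                adj[dup].add(nb)
                adj[nb].add(dup)
            # otherwise A'-nb is removed, i.e. A-nb is left untouched
        nodes.append(dup)
    return adj


def n_edges(adj):
    """Number of undirected edges in the adjacency map."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2


# toy example: a triangle with one pendant vertex, grown to 50 proteins
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
edges_before = n_edges(adj)
duplicate_and_diverge(adj, 50, random.Random(3))
edges_after = n_edges(adj)
```

Because each neighbour keeps exactly one of its two edges, the edge count is conserved while the number of proteins grows, which is why the mean degree decreases in this model.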

526:

527: The result of the simulation is shown in Fig.~\ref{model}.

528: The local clustering function $C(k)$ is simply shifted downward, due to

529: the overall decrease of the edge density.

530: On the other hand, the average neighbor degree

531: \knn~decreases with $k$ but at a smaller rate, indicating that the

532: diversification process can, although not perfectly, neutralize

533: the connectivity correlation.

534: Furthermore, if we {\em assume} that the establishment of new interactions

535: follows the preferential attachment \cite{ba} or random attachment,

536: the overall correlation would diminish eventually.\\

537:

538: \begin{figure}[!t]

539: \centerline{\epsfxsize=9cm \epsfbox{fig5-model.eps}}

540: \caption{Effect of gene function diversification in (a) $C(k)$ and (b)

541: \knn. Red circles are the data

542: of the original yeast network and the blue squares those after

543: running the diversification procedures {\it in silico}.

544: The slopes of the straight lines

545: (the rates of decrease) in (b) are $-0.3$ (top, green)

546: and $-0.15$ (bottom, magenta), respectively.}

547: \label{model}

548: \end{figure}

549: \begin{figure*}

550: \centerline{\epsfxsize=15cm \epsfbox{fig6-sampled.eps}}

551: \caption{Effect of bait selection. Red circles are for the full data,

552: green diamonds for the randomly sampled data, and blue squares for the data from biased sampling.}

553: \label{bait}

554: \end{figure*}

555:

556: \noindent

557: {\bf Effect of bait selection on $\langle k_{\rm nn}\rangle(k)$.}

558: There has been an argument that the apparent decreasing trend

559: in $\langle k_{\rm nn}\rangle(k)$ is an artifact from the limited

560: selection of baits in the two-hybrid experiment \cite{aloy}.

561: Indeed, Li {\em et al.}~\cite{vidal} selected the baits with their

562: own criteria, mainly based on biological indispensability

563: and potential applicability to human therapeutics.

564: To check this hypothesis {\it in silico}, we sampled a 30\% subset

565: of the 4950 baits identified in the fly network of Giot {\em et al.}~\cite{giot} and

566: reconstructed the network only with the interactions associated with

567: the sampled baits.

568: We sampled in two different ways: random sampling

569: and biased sampling toward the highly connected

570: baits (the sampling probability is proportional to

571: the number of bait-interactions).

572: Both data sets generate the decreasing trend

573: in $\langle k_{\rm nn}\rangle(k)$ (Fig.~\ref{bait}).

574: One can see that even though the original network has zero slope in

575: $\langle k_{\rm nn}\rangle (k)$,

576: a negative slope develops in the sampled ones, demonstrating

577: that the use of an insufficient number of baits {\em can} produce

578: an artifactual correlation

579: in the connectivity. If this scenario holds, one may conjecture that the

580: $\langle k_{\rm nn} \rangle(k)$ curve will become flatter

581: as the interaction data accumulate and become more complete.

582: \\
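The bait-sampling test above can be sketched as follows. This Python code is a minimal illustration under stated assumptions, not the procedure actually used for Fig.~\ref{bait}: the function name and the bait-to-prey dictionary format are hypothetical, and the biased case is implemented as weighted sampling without replacement (random keys $u^{1/w}$, with $w$ the number of bait-interactions), which is only one possible reading of a sampling probability proportional to the number of bait-interactions.

```python
import random


def sample_baits(bait_preys, fraction, rng, biased=False):
    """Keep a fraction of the baits and rebuild the network from them.

    bait_preys maps each bait to the set of preys it pulled down
    (hypothetical input format). Returns (chosen baits, edge set).
    """
    # baits without any interaction cannot contribute edges
    baits = sorted(b for b in bait_preys if bait_preys[b])
    n_keep = int(round(fraction * len(baits)))
    if biased:
        # weighted sampling without replacement: draw a key u**(1/w) for
        # each bait and keep the n_keep baits with the largest keys
        keys = {b: rng.random() ** (1.0 / len(bait_preys[b])) for b in baits}
        chosen = set(sorted(baits, key=keys.get, reverse=True)[:n_keep])
    else:
        chosen = set(rng.sample(baits, n_keep))
    # only interactions that involve a sampled bait survive
    edges = set()
    for b in chosen:
        for p in bait_preys[b]:
            if p != b:  # self-interactions are excluded
                edges.add((b, p) if b <= p else (p, b))
    return chosen, edges


# toy example with made-up bait and prey names
bait_preys = {"b1": {"p1", "p2"}, "b2": {"p2"},
              "b3": {"p3", "b1"}, "b4": {"p4"}}
half_baits, half_edges = sample_baits(bait_preys, 0.5, random.Random(0))
all_baits, all_edges = sample_baits(bait_preys, 1.0, random.Random(0),
                                    biased=True)
```

The function $\langle k_{\rm nn}\rangle(k)$ would then be recomputed on the returned edge set and compared with that of the full network.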

583:

584: \noindent{\bf Summary and discussion}\\

585: We have investigated in detail the structural properties

586: of the protein interaction networks of three eukaryotic species,

587: the budding yeast, the nematode worm, and the fruit fly.

588: In particular, we have focused on the comparative assessment of

589: the modularity and the degree-degree correlation for those networks.

590: We found that while the worm dataset behaves similarly to the

591: yeast for the two graph theoretic quantities, the fly does not.

592: The difference might be attributed to the presence (absence) of

593: the yeast-interlogs in the current worm (fly) dataset.

594: For the fly dataset, the modularity is suppressed and

595: the connectivity correlation is lacking. We found that

596: the form of the clustering function can be restored to that of the yeast

597: dataset by adding to the current dataset interlogs selected randomly

598: among the candidates.

599: We also performed motif analysis for the three species, finding

600: that the density of the triangle motif is increased by the

601: addition of the interlogs to the current fly dataset.

602: Finally, the candidate protein interactions of the fly

603: are provided in the supplementary materials,

604: which could be useful in finding protein interactions missed in

605: the current fly dataset.

606: \\

607:

608: This work is supported by the KOSEF grant No. R14-2002-059-01000-0

609: in the ABRL program and the MOST grant No. M1 03B500000110.

610:

611:

612: \begin{thebibliography}{99}

613: %\setlength{\itemindent}{-1.22cm}

614: \bibitem{alberts} Alberts, B., Bray, D., Johnson, A., Lewis, J., Raff, M., Roberts, K. and Walter, P. (1998) {\it Essential Cell Biology} Garland, New York.

615: \bibitem{aloy} Aloy, P. and Russell, R. B. (2002) Potential artefacts in protein-interaction networks. {\it FEBS Lett.} {\bf 530}, 253--254.

616: \bibitem{ba} Barab\'asi, A.-L. and Albert, R. (1999) Emergence of scaling in random networks. {\it Science} {\bf 286}, 509--512.

617: \bibitem{nrg} Barab\'asi, A.-L. and Oltvai, Z. N. (2004) Network biology: understanding the cell's functional organization. {\it Nat. Rev. Genet.} {\bf 5}, 101--114.

618: \bibitem{gavin} Gavin, A.-C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J. M., Michon, A. M., Cruciat, C. M., et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. {\it Nature} {\bf 415}, 141--147.

619: \bibitem{giot} Giot, L., Bader, J. S., Brouwer, C., Chaudhuri, A., Kuang, B., Li, Y., Hao, Y. L., Ooi, C. E., Godwin, B., Vitols, E., et al. (2003) A protein interaction map of {\it Drosophila melanogaster}. {\it Science} {\bf 302}, 1727--1736.

620: \bibitem{ho} Ho, Y., Gruhler, A., Heilbut, A., Bader, G. D., Moore, L., Adams, S. L., Millar, A., Taylor, P., Bennett, K., Boutilier, K., et al. (2002) Systematic identification of protein complexes in {\it Saccharomyces cerevisiae} by mass spectrometry. {\it Nature} {\bf 415}, 180--183.

621: \bibitem{ito} Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M. and Sakaki, Y. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. {\it Proc. Natl. Acad. Sci. U.S.A.} {\bf 98}, 4569--4574.

622: \bibitem{jeong} Jeong, H., Mason, S. P., Barab\'asi, A.-L. and Oltvai, Z. N. (2001) Lethality and centrality in protein networks. {\it Nature} {\bf 411}, 41--42.

623: \bibitem{vidal} Li, S., Armstrong, C. M., Bertin, N., Ge, H., Milstein, S., Boxem, M., Vidalain, P. O., Han, J. D., Chesneau, A., Hao, T., et al. (2004) A map of the interactome network of the metazoan {\it C. elegans}. {\it Science} {\bf 303}, 540--543.

624: \bibitem{maslov} Maslov, S. and Sneppen, K. (2002) Specificity and stability in topology of protein networks. {\it Science} {\bf 296}, 910--913.

625: \bibitem{milo} Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. and Alon, U. (2002) Network motifs: simple building blocks of complex networks. {\it Science} {\bf 298}, 824--827.

626: \bibitem{mips} Mewes, H. W., Amid, C., Arnold, R., Frishman, D., Guldener, U., Mannhaupt, G., Munsterkotter, M., Pagel, P., Strack, N., Stumpflen, V., et al. (2004) MIPS: analysis and annotation of proteins from whole genomes. {\it Nucl. Acids Res.} {\bf 32}, D41--D44.

627: \bibitem{assort} Newman, M. E. J. (2002) Assortative mixing in networks. {\it Phys. Rev. Lett.} {\bf 89}, 208701.

628: \bibitem{knn} Pastor-Satorras, R., V\'azquez, A. and Vespignani, A. (2001) Dynamical and correlation properties of the Internet. {\it Phys. Rev. Lett.} {\bf 87}, 258701.

629: \bibitem{ravasz} Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N. and Barab\'asi, A.-L. (2002) Hierarchical organization of modularity in metabolic networks. {\it Science} {\bf 297}, 1551--1555.

630: \bibitem{dip} Salwinski, L., Miller, C. S., Smith, A. J., Pettit, F. K., Bowie, J. U. and Eisenberg, D. (2004) The Database of Interacting Proteins: 2004 update. {\it Nucl. Acids Res.} {\bf 32}, D449--D451.

631: \bibitem{sole} Sol\'e, R. V., Pastor-Satorras, R., Smith, E. and Kepler, T. (2002) A model of large-scale proteome evolution. {\it Adv. Complex Syst.} {\bf 5}, 43--54.

632: \bibitem{kog} Tatusov, R. L., Fedorova, N. D., Jackson, J. D., Jacobs, A. R., Kiryutin, B., Koonin, E. V., Krylov, D. M., Mazumder, R., Mekhedov, S. L., Nikolskaya, A. N., et al. (2003) The COG database: an updated version includes eukaryotes. {\it BMC Bioinformatics} {\bf 4}, 41.

633: \bibitem{tong} Tong, A. H. Y., Drees, B., Nardelli, G., Bader, G. D., Brannetti, B., Castagnoli, L., Evangelista, M., Ferracuti, S., Nelson, B., Paoluzi, S., et al. (2001) A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. {\it Science} {\bf 295}, 321--324.

634: \bibitem{uetz} Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., Judson, R. S., Knight, J. R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., et al. (2000) A comprehensive analysis of protein-protein interactions in {\it Saccharomyces cerevisiae}. {\it Nature} {\bf 403}, 623--627.

635: \bibitem{vespig} V\'azquez, A., Flammini, A., Maritan, A. and Vespignani, A. (2003) Modelling of protein interaction networks. {\it Complexus} {\bf 1}, 38--42.

636: \bibitem{wagner} Wagner, A. (2001) The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. {\it Mol. Biol. Evol.} {\bf 18}, 1283--1292.

637: \bibitem{wuchty} Wuchty, S., Oltvai, Z. N. and Barab\'asi, A.-L. (2003) Evolutionary conservation of motif constituents in the yeast protein interaction network. {\it Nat. Genet.} {\bf 35}, 176--179.

638: \end{thebibliography}

639:

640: \end{document}

641:

642: