
1: \documentclass[rmp,twocolumn]{revtex4}

2:

3: \usepackage{graphicx,amsmath,amssymb,txfonts}

4:

5: \begin{document}

6:

7: \title{Role-similarity based functional prediction in networked

8:   systems:\\ Application to the yeast proteome}

9:

10: \author{Petter Holme}

11: \affiliation{Department of Physics, University of Michigan, Ann Arbor,

12:   MI 48109}

13: \author{Mikael Huss}

14: \affiliation{Department of Numerical Analysis and Computer Science,

15:   Royal Institute of Technology, 100 44 Stockholm, Sweden}

16:

17: \begin{abstract}

  We propose a general method to predict the functions of vertices in
  networks where: 1. The wiring of the network is somehow related to the
  vertex functionality. 2. A fraction of the vertices are functionally
  classified. The method is influenced by role-similarity measures of
  social network analysis. The two versions of our prediction scheme
  are tested on model networks where the functions of the vertices are
  designed to match their network surroundings. We also apply these
  methods to the proteome of the yeast \textit{Saccharomyces
    cerevisiae} and find the results compatible with more specialized
  methods.

28: \end{abstract}

29:

30: \maketitle

31:

32: \section{Introduction}

33:

Systems made up of entities that interact pairwise can be modeled as
networks. To comprehend the emergent properties of such systems (the
objective of the study of complex systems and systems biology), one
approach is to investigate the global properties of the corresponding
networks \cite{mejn:rev,ba:rev,harary,wf}. In many cases the
individual entities (or vertices) have distinct functions in the
system. In such cases, provided the wiring of the edges relates to the
function of vertices, one can predict these functions from the
vertices' position in the network. For example, a corporate hierarchy
may be topped by a CEO, followed by a CFO and COO, so a chart of
who reports to whom is enough to identify these positions. Another
problem in this category, of much recent interest, is to predict protein
functions \cite{hodg:pfp} from the networks of protein interactions
\cite{yook:protein,deng:pfp,hish:pfp,leto:pfp,sama:pfp,vaz:pfp}.
These methods, like other methods based on e.g.\ protein sequences,
are important because confirming a protein function requires
function-specific and possibly hard-to-design \textit{in vivo},
genetic or biochemical tests, whereas interaction and sequence data can
be obtained fairly easily.

53:

54: In this paper we propose a general method of predicting the functions

55: of vertices in networked systems where the functions are partly mapped

56: out. The rationale of our algorithm is to match unknown vertices with

57: the most similar (judging from the network structure) categorized

58: vertex and take the functions of the latter vertex as our

59: forecast. The network similarity concept we ground our method on is

60: related to the notion of regular equivalence \cite{eve:sim,wf} or role

61: similarity \cite{regeeco1} of social network theory. Roughly speaking,

62: two vertices are similar, in this sense, if the network looks alike from

63: their respective perspectives. We evaluate our method on model

64: networks where the categories of vertices reflect their placement in

the network. We also apply the method to \textit{S.\ cerevisiae}
protein data obtained from the MIPS database \cite{pagel:mips} (data
extracted January 23, 2005).

68:

69: \section{Role similarity and definition of the prediction scheme}

70:

71: \begin{figure}

72:   \resizebox*{0.95\linewidth}{!}{\includegraphics{equ.eps}}

73:   \caption{

74:     Illustration of structural and regular equivalence. $i$ and $j$

75:     are structurally equivalent in (a) since they have the same

76:     neighborhoods, and regularly equivalent in (b) since there is a

77:     matching of regularly equivalent vertices between the

78:     neighborhoods. In (b) vertices of the same color are regularly

79:     equivalent.

80:   }

81:   \label{fig:equ}

82: \end{figure}

83:

Role similarity refers to a rather broad set of concepts and related
measures. Basically, the \textit{role} of a vertex is determined by
the characteristics of the vertices it is connected to
\cite{wf}.\footnote{Note that the nomenclature is somewhat ambiguous. Another
  use of ``role'' is to say that vertices with similar values of
  vertex-specific structural measures have the same role
  \cite{gui:meta,luss:dolphin}.} Consider
two vertices $i$ and $j$. If their neighborhoods
are similar, we say $i$ and $j$ have high role similarity. The
question of how to define the similarity of the neighborhoods $\Gamma_i$
and $\Gamma_j$ leads to two different concepts. One choice matches the
identity of vertices in the neighborhood. This leads to the
\textit{structural equivalence} relation, which holds if
$\Gamma_i=\Gamma_j$. Another way to compare neighborhoods is to match
the similarity of vertices in the neighborhood, which gives the concept
of \textit{regular equivalence}: if one can pair the vertices of
$\Gamma_i$ with vertices in $\Gamma_j$ such that each pair is
regularly equivalent, then $i$ and $j$ are also regularly
equivalent. Since vertices with the same functions need not, in
general, be close, we will need a similarity score measuring how close
to regular equivalence two vertices are. Following
Refs.\ \cite{simrank,blondel:sim} we define a similarity score based on
iterating the regular equivalence principle ``two vertices are similar
if they are pointed to, or point to, vertices that are themselves
similar.'' In the general case of a directed network with $R$
different types of edges, one implementation of this argument is simply
to sum the similarities between vertices of the neighborhoods:

111: \begin{equation}\label{eq:simdef_i}

112:   \sigma^\mathrm{I}_{n+1}(i,j) = \sum_{r=1}^R\left[

113:     \sum_{i'\in\Gamma_{i,r}^{\mathrm{in}}}

114:     \sum_{j'\in\Gamma_{j,r}^{\mathrm{in}}} \sigma^\mathrm{I}_n (i',j') +

115:     \sum_{i'\in\Gamma_{i,r}^{\mathrm{out}}}

116:     \sum_{j'\in\Gamma_{j,r}^{\mathrm{out}}} \sigma^\mathrm{I}_n

117:     (i',j')\right],

118: \end{equation}

where $\sigma^\mathrm{I}_n(i,j)$ is the similarity between $i$ and $j$
after the $n$th iteration and $\Gamma_{i,r}^{\mathrm{in}}$
($\Gamma_{i,r}^{\mathrm{out}}$) is the in-neighborhood
(out-neighborhood) of $i$ with respect to $r$-edges. To avoid
overflow problems we rescale all similarities so that
$\max_{ij}|\sigma^\mathrm{I}_n(i,j)|=S$ after each iteration. We
stop the iteration when the sum, before the normalization, has
changed by no more than a fraction $10^{-8}$ of its previous value.

126:

127: By the Eq.~\ref{eq:simdef_i} definition, high degree vertices will

128: appear more similar to the average other vertex than low-degree

129: vertices. To compensate for this effect one may divide by the

130: appropriate degrees (numbers of neighbors) to obtain:

131: \begin{widetext}

132: \begin{equation}\label{eq:simdef_ii}

133:   \sigma^\mathrm{II}_{n+1}(i,j) = \sum_{r=1}^R\left[

134:     \frac{1}{k_{i,r}^{\mathrm{in}}\:k_{j,r}^{\mathrm{in}}}

135:     \sum_{i'\in\Gamma_{i,r}^{\mathrm{in}}}

136:     \sum_{j'\in\Gamma_{j,r}^{\mathrm{in}}} \sigma^\mathrm{II}_n (i',j') +

137:     \frac{1}{k_{i,r}^{\mathrm{out}}\:k_{j,r}^{\mathrm{out}}}

138:     \sum_{i'\in\Gamma_{i,r}^{\mathrm{out}}}

139:     \sum_{j'\in\Gamma_{j,r}^{\mathrm{out}}} \sigma^\mathrm{II}_n

140:     (i',j')\right],

141: \end{equation}

142: \end{widetext}

143: where $k_{i,r}^{\mathrm{in}}$ is the in-degree of $i$ with respect to

144: $r$-edges. From now on we call $\sigma^\mathrm{I}(i,j)=

145: \sigma^\mathrm{I}_\infty(i,j)$ of Eq.~\ref{eq:simdef_i} and

146: $\sigma^\mathrm{II}(i,j)$ of Eq.~\ref{eq:simdef_ii} the I- and

147: II-similarity between $i$ and $j$ respectively.
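As a minimal sketch of how Eqs.~\ref{eq:simdef_i} and
\ref{eq:simdef_ii} can be iterated, consider the Python code below. It
is an illustration under stated assumptions and not the implementation
used for the results of this paper: the function name
\texttt{role\_similarity}, the input format (edges given as
source, target, type triples), the identity matrix as starting point,
the use of the sum of absolute values as ``the sum'' in the stopping
rule, and the iteration cap \texttt{max\_iter} are all hypothetical
choices. The override for pairs of classified vertices is left out
here.

```python
def role_similarity(n, edges, normalize=False, S=0.8, tol=1e-8, max_iter=200):
    """Iterate Eq. (1) (normalize=False) or Eq. (2) (normalize=True).

    n     -- number of vertices, labelled 0..n-1
    edges -- list of (source, target, edge_type) triples
    Returns an n-by-n list of lists with the rescaled similarities.
    """
    types = sorted({r for _, _, r in edges})
    # Per edge type: in- and out-neighbourhoods of every vertex.
    inn = {r: [[] for _ in range(n)] for r in types}
    out = {r: [[] for _ in range(n)] for r in types}
    for u, v, r in edges:
        out[r][u].append(v)
        inn[r][v].append(u)

    # Assumed starting point: every vertex is similar only to itself.
    sigma = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    prev_total = None
    for _ in range(max_iter):
        new = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                acc = 0.0
                for r in types:
                    for nb in (inn[r], out[r]):
                        a, b = nb[i], nb[j]
                        if not a or not b:
                            continue  # empty neighbourhood: empty sum
                        s = sum(sigma[x][y] for x in a for y in b)
                        # Eq. (2) divides by the product of the degrees.
                        acc += s / (len(a) * len(b)) if normalize else s
                new[i][j] = acc
        total = sum(abs(x) for row in new for x in row)
        peak = max(abs(x) for row in new for x in row)
        if peak == 0:
            return new  # no edges at all: nothing to rescale
        # Rescale so that the largest absolute similarity equals S.
        sigma = [[S * x / peak for x in row] for row in new]
        # Stop when the pre-normalization sum has settled.
        if prev_total is not None and abs(total - prev_total) <= tol * abs(prev_total):
            break
        prev_total = total
    return sigma
```

In this sketch two vertices with identical neighborhoods (structurally
equivalent vertices) always receive identical rows of similarity
scores. On some networks the sum may keep alternating between two
values instead of settling, in which case the loop simply ends after
\texttt{max\_iter} iterations.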

148:

149: As mentioned, we suppose some of the vertices are functionally

150: categorized. In general we assume one vertex can have many

151: functions. For pairs of such functionally determined vertices the

152: above similarities will add no information. Instead we define

153: a functional similarity

154: \begin{equation}\label{eq:simdef_f}

155:   \sigma_f(i,j) = J(F_i,F_j) - \langle J \rangle ,

156: \end{equation}

for such pairs, where $F_i$ is $i$'s function set (we assume a finite
number of functions), $J(\:\cdot\:)$ denotes the Jaccard index
$J(A,B) = |A\cap B|\:/\:|A\cup B|$, and the average $\langle J
\rangle$ is over all pairs of
categorized vertices. We will later need $\sigma(i,j)=0$ to represent
neutrality, which is why we subtract the mean. Whenever a pair of
classified vertices $(i,j)$ appears in the sums of
Eqs.~\ref{eq:simdef_i} or \ref{eq:simdef_ii} we use the
$\sigma_f(i,j)$ value of Eq.~\ref{eq:simdef_f} instead of
$\sigma^\mathrm{I}(i,j)$ or $\sigma^\mathrm{II}(i,j)$. That is, we assume
the functional classification is more accurate than the
role-similarities and hence do not update the former.
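The functional similarity of Eq.~\ref{eq:simdef_f} can be sketched in
a few lines of Python. The names \texttt{jaccard} and
\texttt{functional\_similarity} are hypothetical, and we assume here
that the average $\langle J\rangle$ runs over unordered pairs of
distinct categorized vertices (at least two such vertices are needed).

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two sets; defined as 0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def functional_similarity(functions):
    """functions: dict mapping each categorized vertex to its function set.

    Returns (sigma_f, mean_j), where sigma_f maps each unordered pair
    (i, j) of distinct categorized vertices to J(F_i, F_j) - <J>.
    """
    pairs = list(combinations(sorted(functions), 2))
    values = {p: jaccard(functions[p[0]], functions[p[1]]) for p in pairs}
    mean_j = sum(values.values()) / len(pairs)
    # Subtracting the mean makes 0 the neutral value.
    sigma_f = {p: v - mean_j for p, v in values.items()}
    return sigma_f, mean_j
```

For three vertices with function sets $\{x,y\}$, $\{y\}$ and $\{z\}$
the Jaccard indices are $1/2$, $0$ and $0$, so $\langle J\rangle=1/6$
and the first pair gets $\sigma_f=1/3$ while the other two get
$\sigma_f=-1/6$.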

168:

169: In general we can now define our prediction scheme as follows:

170: \begin{enumerate}

171: \item \label{enu:init} For vertex pairs with at least one unclassified

172:   vertex initialize $\sigma_0(i,j)$ to $0$ if $i\neq j$ and

173:   to $1 - \langle J \rangle$ otherwise.

174: \item \label{enu:sim} Calculate the similarity scores for all pairs of

175:   unique vertices such that at least one is unclassified.

\item \label{enu:choose} For an unclassified vertex $i$, predict the
  function set $F_{\hat{i}}$, where $\hat{i}$ is the classified
  vertex with highest similarity to $i$. If $\hat{i}$ is not unique,
  but a set $\hat{I} = \{\hat{i}_1,\ldots,\hat{i}_m\}$ of classified
  vertices shares the highest similarity to $i$, then let the set $G$
  of functions present in more than half of the vertices in $\hat{I}$
  be the guess. If $G$ is empty, let $F_j$ for a random $j\in\hat{I}$
  be the guess.

183: \end{enumerate}

The diagonal elements will have maximal functional similarity (which
is why we set them to $1-\langle J \rangle$ in step~\ref{enu:init});
otherwise we assume neutrality. The backup selection rules in
step~\ref{enu:choose} will typically be needed when unclassified
vertices are structurally equivalent to classified vertices. Using
the majority rule instead of only a random guess compensates
for occasional errors in the assignment of functions to classified
vertices. Our parameter $S$ sets the relative importance of the
functional similarities to the subsequent assessments of
$\sigma$. As mentioned above, the functional classification is assumed
to be more accurate than the role-similarities, and it is thus sensible to
choose $S\in [0,1-\langle J\rangle]$. The appropriate $S$ value
is problem dependent. We will use $S=0{.}8$, which is in this interval
for both of our test cases. To summarize, we have proposed two
versions of our prediction scheme, scheme I and II, corresponding to
I- and II-similarity.
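The selection rule of step~\ref{enu:choose} can be sketched as
follows. This is an illustration only: the function name
\texttt{predict\_functions}, the representation of the similarity as a
callable \texttt{sigma(i, j)}, and the use of exact equality to detect
ties are assumptions.

```python
import random

def predict_functions(i, classified, sigma, rng=random):
    """Predict the function set of the unclassified vertex i.

    classified -- dict mapping classified vertices to their function sets
    sigma      -- callable sigma(i, j) returning a similarity score
    """
    best = max(sigma(i, j) for j in classified)
    top = [j for j in classified if sigma(i, j) == best]
    if len(top) == 1:
        # Unique best match: copy its functions.
        return set(classified[top[0]])
    # Tie: keep functions held by more than half of the tied vertices.
    counts = {}
    for j in top:
        for f in classified[j]:
            counts[f] = counts.get(f, 0) + 1
    guess = {f for f, c in counts.items() if 2 * c > len(top)}
    # No majority function: fall back to one tied vertex at random.
    return guess if guess else set(classified[rng.choice(top)])
```

With three tied vertices carrying the function sets $\{A\}$,
$\{A,B\}$ and $\{A,C\}$, only $A$ is present in more than half of
them, so the guess is $\{A\}$.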

200:

201: \section{Application to model networks}

202:

203: To test our prediction algorithm we construct model networks where the

204: assigned functions of the vertices correspond to their position in the

205: network. We test the algorithm's size scaling and performance in

206: sub-ideal conditions by randomly perturbing the network.

207:

208: \subsection{Definition of the model networks}

209:

210: \begin{figure*}

211:   \includegraphics{ill.eps}

212:   \caption{

213:     Model networks where vertex function and position are related. (a)

214:     shows the initial network. (b) shows a realization with 30

215:     vertices and rewiring probability $r=0{.}1$. ``\textbf{*}''

216:     indicates a rewired edge.

217:   }

218:   \label{fig:ill}

219: \end{figure*}

220:

In defining our model, we will metaphorically use the flow of raw
material, products and information in a manufacturing system. For our
purpose we only need networks where the functions of vertices correspond to
their position in their network surroundings, so we will not further
motivate its relevance as a model for manufacturing networks. We
assign five distinct functional classes to the vertices: The
\textit{supply} vertices are the source of the raw material, which
flows along \textit{A-edges} to \textit{assembler} vertices. The
assembled products are transported via \textit{B-edges} to
\textit{delivery} vertices that dispatch the products. From the
delivery vertices informational feedback is sent to the supply
vertices through \textit{C-edges}. Furthermore, the A- and B-edges can
fork at \textit{A-} and \textit{B-distributor} vertices.

234:

235: The precise definition of the model is as follows: Start with the

236: kernel shown in Fig.~\ref{fig:ill}(a), then grow the network vertex by

237: vertex. At each iteration, assign, with equal probability, one of the

238: above functions to the new vertex. Then, depending on the assigned

239: function, form edges including the new vertex as follows.

240: \begin{description}

241: \item[Supply.] Add an A-edge to an assembler or A-distributor, and a

242:   C-edge from a delivery vertex.

243: \item[Assembly.] Add an A-edge from an assembler or A-distributor

244:   vertex, and a B-edge to an assembler or A-distributor.

245: \item[Delivery.] Add a B-edge from an assembler or B-distributor, and

246:   a C-edge to a supplier.

247: \item[A(B)-distribution.] Add an A(B)-edge from an assembler or

248:   A(B)-distributor vertex, and an A(B)-edge to an assembler or

249:   A(B)-distributor.

250: \end{description}

251: The choice of vertex to attach the new vertex to, given its functional

252: category, is done with uniform randomness. Note that the number of

253: edges will on average be twice the number of vertices (two edges are

254: added per vertex).

255:

256: From the definition so far, any vertex is identifiable from its

257: neighborhood---a vertex with incoming C-edges and out-going A-edges is

258: a supplier, and so on. Real data-sets are seldom perfect---neither in

259: the wiring of the edges, nor in the functional classification. To test

260: the prediction scheme under more realistic circumstances we randomize

261: the network as follows: After generating a network according to the

262: above scheme, we go through all edges sequentially. With a probability

263: $r$ detach the from-side of an edge and re-attach it to a randomly

264: chosen vertex such that no self-edge or multiple edge (of the same

265: type---A, B or C) is formed. Rewire the to-side likewise with the same

266: probability. A realization of the algorithm is displayed in

267: Fig.~\ref{fig:ill}(b). After the rewiring there is not necessarily

268: enough information to classify a vertex---$i$ in Fig.~\ref{fig:ill}(b)

269: is an assembler but could just as well have been a B-distributor.
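The rewiring step described above can be sketched in Python as
follows. The function name \texttt{rewire} and the edge representation
(source, target, type triples) are assumptions made for illustration.
We also assume that an edge and its reverse count as different edges,
and that an endpoint is left in place if no admissible vertex exists.

```python
import random

def rewire(edges, n, r, rng=random):
    """Return a rewired copy of a list of (source, target, type) edges.

    Each end of each edge is, with probability r, moved to a randomly
    chosen vertex among 0..n-1 such that no self-edge and no duplicate
    edge of the same type is created.
    """
    edges = list(edges)
    present = set(edges)
    for idx in range(len(edges)):
        for side in (0, 1):  # 0: from-side, 1: to-side
            if rng.random() >= r:
                continue
            u, v, t = edges[idx]
            fixed = v if side == 0 else u  # the end that stays put
            candidates = []
            for w in range(n):
                new = (w, v, t) if side == 0 else (u, w, t)
                if w != fixed and new not in present:
                    candidates.append(new)
            if candidates:
                present.discard(edges[idx])
                edges[idx] = rng.choice(candidates)
                present.add(edges[idx])
    return edges
```

Calling \texttt{rewire(edges, n, 0.0)} returns the network unchanged,
while $r=1$ moves both ends of every edge whenever an admissible
target exists.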

270:

271: \subsection{Prediction performance}

272:

273: \begin{figure}

274:   \resizebox*{\linewidth}{!}{\includegraphics{mod.eps}}

275:   \caption{

    The fraction of correctly predicted functions $s$ for our model
    networks as a function of the rewiring probability $r$. (a) shows
    the results based on I-similarities, (b) is the corresponding plot
    for II-similarities. The points are averaged over $\sim 1000$ runs
    of the network construction and prediction scheme with
    $a=1/50$. Error bars are smaller than the symbol size. The
    horizontal line marks the limit of random guessing $0{.}2$.

283:   }

284:   \label{fig:mod}

285: \end{figure}

286:

To test our prediction scheme we mark a random set of $aN$,
$a\in(0,1)$, vertices unclassified. Then we predict the function of these
vertices and let the average fraction of correctly predicted vertices
$s$ be our performance measure. Fig.~\ref{fig:mod} shows $s$ for
$a=1/50$ and different network sizes, as a function of the
rewiring probability $r$. In the small-$r$ limit the I-similarity
prediction scheme does an almost flawless job with $s>99{.}9\%$ for
$N\geqslant 500$. Note that, since we have five distinct functions, random
guessing could not do better than $s=1/5$. This value, $s=1/5$, is by
necessity attained in the random limit $r=1$. For small $r$-values the
scheme II performs best, but if $r\lesssim 0{.}2$ scheme I performs
slightly better. The size convergence for scheme I is faster, so in
the large network limit II may outperform I. To understand the
performance of the different schemes we note that scheme I has a
tendency to match an unknown vertex to a known vertex of high
degree. When $r=0$ this effect leads to some mispredictions for scheme
I. But the redundant information about high degree vertices makes the
scheme more robust to minor perturbations, hence the slower decay of the
$s(r)$-curves compared with scheme II.

306:

We observe that the performance increases with the system size for
both schemes. This is an important effect since databases in general grow
in size, so our prediction scheme will become more accurate with time.
We surmise that the explanation is, roughly speaking, that the bigger
the network gets, the more likely it is that there is a very good
match. This is an effect that local methods (taking only the surroundings
of a vertex into account) could not utilize. A full explanation of
this effect lies beyond the scope of this paper.

315:

316: \section{Predicting protein function in yeast}

317:

318: \begin{figure}

319:   \resizebox*{0.85\linewidth}{!}{\includegraphics{pex.eps}}

320:   \caption{

    Example from the yeast protein prediction by scheme II on the
    first level functional data. When YJL191w is marked
    unknown it gets matched with YOR133w because their surroundings
    look similar. The arrowed lines mark genetic regulation edges,
    other lines represent physical interaction.

326:   }

327:   \label{fig:pex}

328: \end{figure}

329:

330: \subsection{Functional prediction of proteins}

331:

Specifying protein functions experimentally requires demanding and
potentially expensive tests. If one can obtain good guesses of the
functions of an unknown protein, much is gained. During the last decade,
a great number of methods have been suggested for protein
functional prediction, including methods based on sequence
or structure alignments \cite{paw:seq,irving:struct}, attributes
derived from collections of sequences or
structures \cite{jensen:seq,dobson:struct}, phylogenetic profiles
\cite{pelle:pfp}, or analysis of protein complexes
\cite{gavin:complexes}. Much recent work has concentrated on
functional prediction based on protein-protein interaction data. Many
of these are specialized methods that exploit specific features of
protein-protein interaction data
\cite{vaz:pfp,schw:pfp,marc:pfp1,marc:pfp2,hodg:pfp,leto:pfp,sama:pfp}
(for example, that vertices that
interact physically are likely to share some functionality). The more
general approaches \cite{deng:pfp,hish:pfp} are local in the sense
that they are only based on pairwise statistics. For this reason they
may not share the advantageous size scaling properties of our method.

350:

351: \subsection{Applying the method to protein data}

352:

There are two types of large scale network data available for
\textit{S.\ cerevisiae}: ``physical'' and ``genetic'' protein-protein
interactions. The terms ``physical'' and ``genetic'' refer to the type of
experiment used to deduce the interaction. The genetic experiments
are based on mutation studies, and the evidence from them is of
a more indirect nature. We therefore distinguish
between physical and genetic edges. All edges are undirected. Our data
set, derived from the MIPS database, has $N=4580$ proteins linked together by
$5129$ genetic regulation edges and $7434$ physical interaction
edges. We removed duplicates, self-edges and interactions where one or
both of the interacting substances were not proteins. The assigned
functions are arranged in a hierarchical fashion, according to the
FunCat categorization scheme \cite{ruepp:funcat} used by the MIPS
database. The first level contains the coarsest description of a
protein's function, such as ``metabolism,'' the second level is more
specific, e.g.\ ``amino acid metabolism,'' and so on. We will test our
algorithm on the first and second levels of this hierarchy and thus
treat functions that differ only in a finer classification as equal. There
are three categories with no substantial functional
information: ``ubiquitous expression,'' ``classification not yet
clear-cut'' and ``unclassified proteins.'' We considered vertices with
no assigned categories other than these three as uncategorized.

375:

376: In Fig.~\ref{fig:pex} we show a small example of scheme II in action

377: on the yeast data. Suppose YJL191w is to be classified (we know it has

378: the level-1 functions ``protein with binding function \ldots'' and

379: ``protein synthesis''). The classified protein with highest similarity

380: is YOR133w. This is because YNL041c, which interacts physically with

381: YJL191w, is functionally identical (at level one of the hierarchy) to

382: YBR068c that is physically linked to YOR133w. Similarly, YJL191w is

383: genetically linked with YCR031c, which shares one functional category

384: with YDR385w, which is genetically linked with YOR133w. These two

385: features give a high similarity score to the pair YJL191w and YOR133w,

386: so scheme II guesses that YJL191w has the functional category

387: ``protein synthesis'' but misses the ``protein with binding function

388: \ldots'' category.

389:

390: \subsection{Performance of the scheme}

391:

392: \begin{table}

393:   \caption{\label{tab:perf} The performance of our methods compared to

394:     the neighborhood counting method of Ref.\ \cite{schw:pfp}. $s_+$ is

395:     the average fraction of correct predictions among the predicted

396:     functions averaged over all the classified proteins. $s_-$ is the

397:     average fraction of correct predictions among the actual

398:     functions.}

399:   \begin{ruledtabular}

400:     \begin{tabular}{r|cccccc}

401:       & \multicolumn{3}{c}{level 1} &  \multicolumn{3}{c}{level 2}\\

402:       & NCM & Scheme I & Scheme II & NCM & Scheme I & Scheme II\\\hline

403:       $s_+$ & 0{.}269(6) & 0{.}392(6) & 0{.}337(6) &

404:       0{.}199(5) & 0{.}238(6) & 0{.}220(6) \\

405:       $s_-$ & 0{.}354(6) & 0{.}291(5) & 0{.}346(7) &

406:       0{.}252(6) & 0{.}199(5) & 0{.}231(6) \\

407:     \end{tabular}

408:   \end{ruledtabular}

409: \end{table}

410:

411: For the previously described test networks we know \textit{a priori}

412: that the number of functions to be predicted is one. The same may be

413: true for a variety of systems, but not for proteins. With the number

414: of functions as one variable in the prediction problem we proceed to

415: replace the success rate $s$ by the two measures \textit{precision}

416: $s_+$ and \textit{recall} $s_-$ (the names borrowed from corresponding

417: quantities in the text-mining literature, see e.g.\ Ref.~\cite{rag:tm}

418: and references therein):

419: \begin{equation}\label{eq:spm}

420:   s_+ = \left\langle\frac{n_c}{f_*}\right\rangle \mbox{~and~}

421:   s_- = \left\langle\frac{n_c}{f}\right\rangle ,

422: \end{equation}

where $n_c$ is the number of correctly predicted functions, $f$ is the
actual number of functions and $f_*$ is the number of predicted
functions. $1-s_+$ is thus the expected fraction of false positive
predictions (and $1-s_-$ the expected fraction of actual functions
that are missed). Both these measures take values
in the interval $[0,1]$, with $0$ meaning that no function is predicted
correctly and $1$ representing perfect prediction. The averages are over
the set of predicted functions in the same kind of leave-one-out
estimates as performed for the test networks.
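As a small illustration of Eq.~\ref{eq:spm}, the two measures can be
computed as below. The function name \texttt{precision\_recall} and
the dictionary-based input are assumptions, as is the convention that
a vertex with an empty predicted (or actual) function set contributes
zero to the corresponding average.

```python
def precision_recall(predicted, actual):
    """Return (s_plus, s_minus) averaged over the vertices in `actual`.

    predicted, actual -- dicts mapping each vertex to a set of functions
    """
    s_plus = s_minus = 0.0
    for v in actual:
        n_c = len(predicted[v] & actual[v])  # correctly predicted functions
        s_plus += n_c / len(predicted[v]) if predicted[v] else 0.0
        s_minus += n_c / len(actual[v]) if actual[v] else 0.0
    return s_plus / len(actual), s_minus / len(actual)
```

For example, if one protein with the actual function set $\{a\}$ is
predicted to have $\{a,b\}$ and a second protein with $\{c\}$ is
predicted exactly, then $s_+=(1/2+1)/2=0{.}75$ and $s_-=1$.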

431:

We follow Refs.\ \cite{vaz:pfp,deng:pfp} and use the neighborhood
counting method (NCM) of Ref.\ \cite{schw:pfp} for reference
values. This method assigns the $f_*$ most frequent functions among
the neighbors in the physical interaction network to the unknown
protein (i.e., $f_*$ is a parameter of this method). Considering its
simplicity, compared with the more elaborate
procedures listed above, this is a remarkably efficient method. In our
implementation, if the $f_*$th most frequent function is not unique we
select among the tied functions randomly. Thus proteins
with no neighbors are assigned $f_*$ functions randomly. Precision and
recall values are displayed in Table~\ref{tab:perf}. We use $f_*=2$ for
the NCM, which is the closest value to the average number of functions
per protein for both levels one and two in our data set. The values
may look low compared to similar tables in other papers on protein
prediction, but these often do not include low-degree vertices, or use
other performance measures (such as counting the fraction of proteins
with at least one correctly predicted function, and so on). We note
that, as for the more disordered test networks, scheme II gives better
performance in general (typically having better recall- but slightly
worse precision-values).
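The neighborhood counting rule as described above can be sketched as
follows. This is our own illustrative reading, not code from
Ref.\ \cite{schw:pfp}: the function name \texttt{ncm\_predict}, the
argument names and the shuffle-then-sort way of breaking ties at
random are assumptions.

```python
import random
from collections import Counter

def ncm_predict(v, neighbours, functions, all_functions, f_star=2, rng=random):
    """Assign to v the f_star most frequent functions among its neighbours.

    neighbours    -- dict: vertex -> list of physically interacting vertices
    functions     -- dict: classified vertex -> set of functions
    all_functions -- list of every function that may be assigned
    """
    counts = Counter()
    for u in neighbours.get(v, ()):
        counts.update(functions.get(u, ()))
    # Shuffle first; the stable sort below then keeps tied functions in
    # random order, so ties (and vertices without neighbours) are
    # resolved at random.
    pool = list(all_functions)
    rng.shuffle(pool)
    pool.sort(key=lambda f: counts[f], reverse=True)
    return set(pool[:f_star])
```

If the neighbors of a protein carry the function sets $\{A,B\}$,
$\{A\}$ and $\{B,C\}$, then $A$ and $B$ each occur twice and $C$ once,
so with $f_*=2$ the prediction is $\{A,B\}$.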

451:

452: \section{Summary and discussion}

453:

We have proposed methods for predicting the function of vertices in
networked systems where the function of a vertex relates to its
position. The principle behind our scheme is role equivalence as
related to the regular equivalence concept of social network
analysis. That is, vertices are similar if the network, as seen from the
respective vertices, looks similar. We make two extensions of the method
proposed in Refs.\ \cite{simrank,blondel:sim} to networks where some of
the vertices are functionally categorized. The prediction for an
uncategorized vertex is then done by copying the functions of the
categorized vertex with highest role similarity. Our schemes, corresponding
to our two role similarities, are tested on model networks. These are
designed to have a correspondence between the function of a vertex
and its network surroundings. This correspondence can be tuned by a
randomization parameter. We find that the performance of both schemes
increases with the system size (with the fraction of unknown vertices and
rewired edges fixed), which means that the applicability of our methods
increases with time (as databases, in general, tend to grow). The
differences between scheme I and II can be explained by the fact that
scheme I gives (compared with scheme II) a higher similarity to
vertex pairs containing a high-degree vertex. Furthermore, we apply
our method to the \textit{S.\ cerevisiae} proteome. We use the
networks of protein-protein interactions and obtain results that
compare well with standard methods designed solely with protein
functional prediction in mind. We do not claim that our method
outperforms the best specialized protein prediction methods. Our aim
is to construct a global method for general functional prediction, and
most protein functional prediction schemes would perform poorly on our
test networks. The ideas of this paper might however contribute to
future, more elaborate, methods for prediction of protein functions.

483:

The basic advantage of our method, as we see it, is that it is a very
general method that should apply to functional prediction in many
systems. Moreover, it makes use of global network information,
giving performance that does not decrease as the system gets
larger. The fact that it is a truly global algorithm (the prediction
of every vertex's functions takes the wiring of the whole network into
account) makes it rather slow (compared to e.g.\ specialized protein
functional prediction methods, such as the one proposed in
Ref.\ \cite{schw:pfp}). The execution time scales as $O(M^2)$ (where
$M$ is the total number of edges). But data sets of size $10^4$--$10^5$,
which cover e.g.\ the size of proteomes of known organisms, should be
manageable for present-day computers. We believe the problem of
functional prediction in different types of networked systems is far
from concluded, both in its full generality and regarding the question
of how to utilize the characteristics of more specific systems.

499:

500: \subsection*{Acknowledgments}

501:

502: The authors thank Micha Enevoldsen, Elizabeth Leicht and Mark Newman

503: for comments.

504:

505: \begin{thebibliography}{10}

506:

507: \bibitem{ba:rev}

508: Albert, R.\ \& Barab\'{a}si, A.-L.

509:  (2002) {\em Rev.\ Mod.\ Phys.}\ {\bf 74}, 47--98.

510:

511: \bibitem{harary}

512: Buckley, F.\ \& Harary, F.

513:  (1989) {\em Distance in graphs}.

514:  (Addison-Wesley, Redwood City).

515:

516: \bibitem{mejn:rev}

517: Newman, M. E.~J.

518:  (2003) {\em SIAM Rev.}\ {\bf 45}, 167--256.

519:

520: \bibitem{wf}

521: Wasserman, S.\ \& Faust, K.

522:  (1994) {\em Social network analysis: Methods and applications}.

523:  (Cambridge University Press, Cambridge).

524:

525: \bibitem{hodg:pfp}

526: Hodgman, T.

527:  (2000) {\em Bioinformatics} {\bf 16}, 10--15.

528:

529: \bibitem{yook:protein}

530: Yook, S., Oltvai, Z.\  \& Barab\'{a}si, A.-L.

531:  (2004) {\em Proteomics} {\bf 4}, 928--942.

532:

533: \bibitem{sama:pfp}

534: Samanta, M.~P.\ \& Liang, S.

535:  (2003) {\em Proc.\ Natl.\ Acad.\ Sci.\ USA} {\bf 100}, 12579--12583.

536:

537: \bibitem{vaz:pfp}

Vazquez, A., Flammini, A., Maritan, A., \& Vespignani, A.

539:  (2003) {\em Nature Biotech.}\ {\bf 21}, 697--700.

540:

541: \bibitem{deng:pfp}

542: Deng, M., Zhang, K., Mehta, S., Chen, T.\  \& Sun, F.

543:  (2002) in {\em Proceedings of the IEEE Computer Society

544:   Bioinformatics Conference (CSB 02)}.

545:  (Stanford CA), pp. 197--207.

546:

547: \bibitem{hish:pfp}

Hishigaki, H., Nakai, K., Ono, T., Tanigami, A.\  \& Takagi, T.

549:  (2001) {\em Yeast} {\bf 18}, 523--531.

550:

551: \bibitem{leto:pfp}

552: Letovsky, S.\ \& Kasif, S.

553:  (2003) {\em Bioinformatics} {\bf 19}, 197--204.

554:

555: \bibitem{eve:sim}

556: Everett, M.~G.

557:  (1985) {\em Soc.\ Netw.}\ {\bf 7}, 353--359.

558:

559: \bibitem{regeeco1}

560: Luczkovich, J.~J., Borgatti, S.~P., Johnson, J.~C.,  \& Everett, M.~G.

561:  (2003) {\em J.\ Theor.\ Biol.}\ {\bf 220}, 303--321.

562:

563: \bibitem{pagel:mips}

564: Pagel, P., Kovac, S., Oesterheld, M., Brauner, B., Dunger-Kaltenbach,

565: I., Frishman, G., Montrone, C., Mark, P., St\"{u}mpflen, V., Mewes,

566: H.~W.\ {\em et al.}  (2004) {\em Bioinformatics}, [Epub ahead of

567: print] \url{doi:10.1093/bioinformatics/bti115}.

568:

569: \bibitem{gui:meta}

570: Guimer\`{a}, R.\ \& {Nunes Amaral}, L.~A.

571:  (2005) {\em Nature} {\bf 433}, 895--900.

572:

573: \bibitem{luss:dolphin}

574: Lusseau, D.\ \& Newman, M. E.~J.

575:  (2004) {\em Proc.\ R.\ Soc.\ London B} {\bf 271}, 477--481.

576:

577: \bibitem{blondel:sim}

578: Blondel, V.~D., Gajardo, A., Heymans, M., Senellart, P.,  \& {van Dooren}, P.

579:  (2004) {\em SIAM Rev.}\ {\bf 46}, 647--666.

580:

581: \bibitem{simrank}

Jeh, G.\ \& Widom, J. (2002) in {\em Proceedings of the eighth ACM SIGKDD
  international conference on knowledge discovery and data
  mining}. (Edmonton), pp. 538--543.

585:

586: \bibitem{paw:seq}

587: Pawlowski, K., Jaroszewski, L., Rychlewski, L.\ \& Godzik, A. (2000)

588: {\em Pac.\ Symp.\ Biocomput.}, 42--53.

589:

590: \bibitem{irving:struct}

591: Irving, J.~A., Whisstock, J.~C.\ \& Lesk, A.~M. (2001) {\em Proteins}

592: {\bf 42}, 378--382.

593:

594: \bibitem{jensen:seq}

595: Jensen, L.~J., Staerfeldt, H.\ \& Brunak, S. (2003) {\em

596:   Bioinformatics} {\bf 19}, 635--642.

597:

598: \bibitem{dobson:struct}

Dobson, P.~D.\ \& Doig, A.~J. (2003) {\em J.\ Mol.\ Biol.}\ {\bf 330},

600: 771--783.

601:

602: \bibitem{pelle:pfp}

603: Pellegrini, M., Marcotte, E., Thompson, M.~J., Eisenberg, D.\ \&

604: Yeates, T.~O. (1999) {\em Proc.\ Natl.\ Acad.\ Sci.\ USA} {\bf 96},

605: 4285--4288.

606:

607: \bibitem{gavin:complexes}

Gavin, A.~C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer,
A., Schultz, J., Rick, J.~M., Michon, A.~M., Cruciat, C.~M., Remor,
M.\ {\em et al.} (2002) {\em Nature} {\bf 415},
141--147.

612:

613: \bibitem{marc:pfp2}

614: Marcotte, E.~M., Pellegrini, M., Ng, H.~L., Rice, D.~W., Yeates, T.~O.\ \&

615: Eisenberg, D.  (1999) {\em Science} {\bf 285}, 751--753.

616:

617: \bibitem{marc:pfp1}

618: Marcotte, E.~M., Pellegrini, M., Thompson, M.~J., Yeates, T.~O.\  \&

619: Eisenberg, D. (1999) {\em Nature} {\bf 402}, 83--86.

620:

621: \bibitem{schw:pfp}

622: Schwikowski, B., Uetz, P.\ \& Fields, S. (2000) {\em Nature Biotech.}\

623: {\bf 18}, 1257--1261.

624:

625: \bibitem{ruepp:funcat}

626: Ruepp, A., Zollner, A., Albermann, K., Hani, J., Mokrejs, M., Tetko,

627: I., Guldener, U., Mannhaupt, G., Munsterkotter, M.\ \& Mewes,

628: H.~W. (2004) {\em Nucleic Acids Res.}\ {\bf 32}, 5539--5545.

629:

630: \bibitem{rag:tm}

631:  Raghavan, V.~V., Jung G.~S.\ \& Bollmann, P. (1989)  {\em ACM Trans.\

632:    Inf.\ Syst.}\ {\bf 7}, 205--229.

633:

634: \end{thebibliography}

635:

636:

637: \end{document}

638: