1: \documentclass[rmp,twocolumn]{revtex4}
2:
3: \usepackage{graphicx,amsmath,amssymb,txfonts}
4:
5: \begin{document}
6:
7: \title{Role-similarity based functional prediction in networked
8: systems:\\ Application to the yeast proteome}
9:
10: \author{Petter Holme}
11: \affiliation{Department of Physics, University of Michigan, Ann Arbor,
12: MI 48109}
13: \author{Mikael Huss}
14: \affiliation{Department of Numerical Analysis and Computer Science,
15: Royal Institute of Technology, 100 44 Stockholm, Sweden}
16:
17: \begin{abstract}
18: We propose a general method to predict functions of vertices where:
19: 1. The wiring of the network is somehow related to the vertex
20: functionality. 2. A fraction of the vertices are functionally
21: classified. The method is influenced by role-similarity measures of
social network analysis. The two versions of our prediction scheme
are tested on model networks where the functions of the vertices are
designed to match their network surroundings. We also apply these
25: methods to the proteome of the yeast \textit{Saccharomyces
26: cerevisiae} and find the results compatible with more specialized
27: methods.
28: \end{abstract}
29:
30: \maketitle
31:
32: \section{Introduction}
33:
34: Systems made up of entities that interact pairwise can be modeled as
35: networks. To comprehend the emergent properties of such systems---the
36: objective of the study of complex systems and systems biology---one
37: approach is to investigate the global properties of the corresponding
38: networks \cite{mejn:rev,ba:rev,harary,wf}. In many cases the
39: individual entities (or vertices) have distinct functions in the
40: system. In such cases, provided the wiring of the edges relates to the
41: function of vertices, one can predict these functions from the
42: vertices' position in the network. For example, a corporate hierarchy
43: may be topped by a CEO, followed by a CFO and COO, so a chart of
44: who reports to whom is enough to identify these positions. Another
45: problem in this category of much recent interest is to predict protein
46: functions \cite{hodg:pfp} from the networks of protein interactions
47: \cite{yook:protein,deng:pfp,hish:pfp,leto:pfp,sama:pfp,vaz:pfp}.
48: These methods, like other methods based on e.g. protein sequences,
49: are important because to confirm a protein function one needs
50: function-specific and possibly hard-to-design \textit{in vivo},
51: genetic or biochemical tests, while interaction and sequence data can
52: be obtained fairly easily.
53:
54: In this paper we propose a general method of predicting the functions
55: of vertices in networked systems where the functions are partly mapped
56: out. The rationale of our algorithm is to match unknown vertices with
57: the most similar (judging from the network structure) categorized
58: vertex and take the functions of the latter vertex as our
59: forecast. The network similarity concept we ground our method on is
60: related to the notion of regular equivalence \cite{eve:sim,wf} or role
61: similarity \cite{regeeco1} of social network theory. Roughly speaking,
62: two vertices are similar, in this sense, if the network looks alike from
63: their respective perspectives. We evaluate our method on model
64: networks where the categories of vertices reflect their placement in
65: the network. We also apply the method to \textit{S.\ cerevisiae}
protein data obtained from the MIPS database \cite{pagel:mips} (data
67: extracted January 23, 2005).
68:
69: \section{Role similarity and definition of the prediction scheme}
70:
71: \begin{figure}
72: \resizebox*{0.95\linewidth}{!}{\includegraphics{equ.eps}}
73: \caption{
74: Illustration of structural and regular equivalence. $i$ and $j$
75: are structurally equivalent in (a) since they have the same
76: neighborhoods, and regularly equivalent in (b) since there is a
77: matching of regularly equivalent vertices between the
78: neighborhoods. In (b) vertices of the same color are regularly
79: equivalent.
80: }
81: \label{fig:equ}
82: \end{figure}
83:
Role similarity refers to a rather broad set of concepts and related
measures. Basically, the \textit{role} of a vertex is determined by
the characteristics of the vertices it is connected to
\cite{wf}.\footnote{Note that the nomenclature is somewhat ambiguous. Another
use of ``role'' is to say that vertices with similar values of
vertex-specific structural measures have the same role
\cite{gui:meta,luss:dolphin}.} Consider
91: two vertices $i$ and $j$. If their neighborhoods
92: are similar, we say $i$ and $j$ have high role similarity. The
question of how to define the similarity of the neighborhoods $\Gamma_i$
94: and $\Gamma_j$ leads to two different concepts. One choice matches the
95: identity of vertices in the neighborhood. This leads to the
96: \textit{structural equivalence} relation which is true if
97: $\Gamma_i=\Gamma_j$. Another way to compare neighborhoods is to match
98: the similarity of vertices in the neighborhood which gives the concept
99: of \textit{regular equivalence}---if one can pair the vertices of
100: $\Gamma_i$ with vertices in $\Gamma_j$ such that each pair is
101: regularly equivalent, then $i$ and $j$ are also regularly
102: equivalent. Since vertices with the same functions need not, in
103: general, be close, we will need a similarity score measuring how close
104: to regular equivalence two vertices are. Following
105: Refs.\ \cite{simrank,blondel:sim} we define a similarity score based on
106: iterating the regular equivalence principle ``two vertices are similar
if they are pointed to, or point to, vertices that are themselves
108: similar.'' In the general case of a directed network with $R$
109: different types of edges, one implementation of this argument is just
110: to sum the similarities between vertices of the neighborhoods:
111: \begin{equation}\label{eq:simdef_i}
112: \sigma^\mathrm{I}_{n+1}(i,j) = \sum_{r=1}^R\left[
113: \sum_{i'\in\Gamma_{i,r}^{\mathrm{in}}}
114: \sum_{j'\in\Gamma_{j,r}^{\mathrm{in}}} \sigma^\mathrm{I}_n (i',j') +
115: \sum_{i'\in\Gamma_{i,r}^{\mathrm{out}}}
116: \sum_{j'\in\Gamma_{j,r}^{\mathrm{out}}} \sigma^\mathrm{I}_n
117: (i',j')\right],
118: \end{equation}
where $\sigma^\mathrm{I}_n(i,j)$ is the similarity between $i$ and $j$
after the $n$'th iteration and $\Gamma_{i,r}^{\mathrm{in}}$
($\Gamma_{i,r}^{\mathrm{out}}$) is the in-neighborhood (out-neighborhood)
of $i$ with respect to $r$-edges. To avoid
overflow problems we rescale all similarities so that
$\max_{ij}|\sigma^\mathrm{I}_n(i,j)|=S$ after each iteration. We
break the iteration when the sum, before the normalization, has not
changed by more than a fraction $10^{-8}$ of its previous value.
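
As a compact restatement (a sketch in our own notation, not part of the
original formulation), let $A_r$ denote the adjacency matrix of the
$r$-edges, with the convention assumed here that $(A_r)_{ij}=1$ if there
is an $r$-edge from $i$ to $j$ and $0$ otherwise, and let
$\boldsymbol{\sigma}_n$ be the matrix with elements
$\sigma^\mathrm{I}_n(i,j)$. The double sums of Eq.~\ref{eq:simdef_i} are
then matrix products, and one iteration followed by the rescaling reads
\begin{align*}
\tilde{\boldsymbol{\sigma}}_{n+1} &= \sum_{r=1}^R\left[
A_r^\mathrm{T}\,\boldsymbol{\sigma}_n\,A_r +
A_r\,\boldsymbol{\sigma}_n\,A_r^\mathrm{T}\right],\\
\boldsymbol{\sigma}_{n+1} &=
\frac{S}{\max_{ij}|\tilde{\sigma}_{n+1}(i,j)|}\;
\tilde{\boldsymbol{\sigma}}_{n+1},
\end{align*}
where the tilde marks the value before rescaling. For instance, in a
network with a single edge type and the two edges $1\to 3$ and $2\to 3$,
the in-neighborhoods of $1$ and $2$ are empty and their out-neighborhoods
are both $\{3\}$, so that $\tilde{\sigma}_{n+1}(1,2)=\sigma_n(3,3)$: the
two vertices are similar because they point to the same vertex.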
126:
By the definition of Eq.~\ref{eq:simdef_i}, high-degree vertices will
128: appear more similar to the average other vertex than low-degree
129: vertices. To compensate for this effect one may divide by the
130: appropriate degrees (numbers of neighbors) to obtain:
131: \begin{widetext}
132: \begin{equation}\label{eq:simdef_ii}
133: \sigma^\mathrm{II}_{n+1}(i,j) = \sum_{r=1}^R\left[
134: \frac{1}{k_{i,r}^{\mathrm{in}}\:k_{j,r}^{\mathrm{in}}}
135: \sum_{i'\in\Gamma_{i,r}^{\mathrm{in}}}
136: \sum_{j'\in\Gamma_{j,r}^{\mathrm{in}}} \sigma^\mathrm{II}_n (i',j') +
137: \frac{1}{k_{i,r}^{\mathrm{out}}\:k_{j,r}^{\mathrm{out}}}
138: \sum_{i'\in\Gamma_{i,r}^{\mathrm{out}}}
139: \sum_{j'\in\Gamma_{j,r}^{\mathrm{out}}} \sigma^\mathrm{II}_n
140: (i',j')\right],
141: \end{equation}
142: \end{widetext}
143: where $k_{i,r}^{\mathrm{in}}$ is the in-degree of $i$ with respect to
144: $r$-edges. From now on we call $\sigma^\mathrm{I}(i,j)=
145: \sigma^\mathrm{I}_\infty(i,j)$ of Eq.~\ref{eq:simdef_i} and
146: $\sigma^\mathrm{II}(i,j)$ of Eq.~\ref{eq:simdef_ii} the I- and
147: II-similarity between $i$ and $j$ respectively.
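
In matrix notation (a sketch in our own notation, assuming the
convention that $(A_r)_{ij}=1$ if there is an $r$-edge from $i$ to $j$
and $0$ otherwise), let $D_r^{\mathrm{in}}$ and $D_r^{\mathrm{out}}$ be
the diagonal matrices of the in- and out-degrees with respect to
$r$-edges, and let $\boldsymbol{\sigma}_n$ be the matrix with elements
$\sigma^\mathrm{II}_n(i,j)$. Eq.~\ref{eq:simdef_ii} can then be written
\begin{align*}
\boldsymbol{\sigma}_{n+1} = \sum_{r=1}^R\Bigl[ &
(D_r^{\mathrm{in}})^{-1} A_r^\mathrm{T}\,\boldsymbol{\sigma}_n\,A_r\,
(D_r^{\mathrm{in}})^{-1} \\
& + (D_r^{\mathrm{out}})^{-1} A_r\,\boldsymbol{\sigma}_n\,A_r^\mathrm{T}\,
(D_r^{\mathrm{out}})^{-1}\Bigr],
\end{align*}
where $1/k$ is replaced by $0$ for a vertex with $k=0$, for which the
corresponding sum in Eq.~\ref{eq:simdef_ii} is empty (this is our
reading; the zero-degree case is not spelled out in the definition).
Each bracketed term is thus an average, rather than a sum, over pairs of
neighbors, and so cannot exceed the largest similarity of the previous
iteration, whatever the degrees of $i$ and $j$.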
148:
149: As mentioned, we suppose some of the vertices are functionally
150: categorized. In general we assume one vertex can have many
151: functions. For pairs of such functionally determined vertices the
152: above similarities will add no information. Instead we define
153: a functional similarity
154: \begin{equation}\label{eq:simdef_f}
155: \sigma_f(i,j) = J(F_i,F_j) - \langle J \rangle ,
156: \end{equation}
for such pairs, where $F_i$ is $i$'s function set (we assume a finite
number of functions), $J(\:\cdot\:)$ denotes the Jaccard index
$J(A,B) = |A\cap B|\:/\:|A\cup B|$, and the average $\langle J \rangle$
is taken over all pairs of
categorized vertices. We will later need $\sigma(i,j)=0$ to represent
neutrality, which is why we subtract the mean. Whenever a pair of
162: classified vertices $(i,j)$ appears in the sums of
163: Eqs.~\ref{eq:simdef_i} or \ref{eq:simdef_ii} we use the
164: $\sigma_f(i,j)$ value of Eq.~\ref{eq:simdef_f} instead of
165: $\sigma^\mathrm{I}(i,j)$ or $\sigma^\mathrm{II}(i,j)$. I.e., we assume
166: the functional classification is more accurate than the
167: role-similarities and hence do not update the former.
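
As an illustration with hypothetical function labels $a$, $b$ and $c$
(not taken from any data set), let $F_i=\{a,b\}$ and $F_j=\{b,c\}$. Then
\begin{equation*}
J(F_i,F_j) = \frac{|\{b\}|}{|\{a,b,c\}|} = \frac{1}{3} .
\end{equation*}
If we assume, purely for the sake of the example, that
$\langle J \rangle = 0{.}2$, this pair gets
$\sigma_f(i,j) = 1/3 - 0{.}2 \approx 0{.}13 > 0$, whereas a pair with
disjoint function sets gets $\sigma_f = -0{.}2 < 0$. Pairs sharing more
functions than average thus contribute positively to the sums of
Eqs.~\ref{eq:simdef_i} and \ref{eq:simdef_ii}, and pairs sharing fewer
contribute negatively.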
168:
169: In general we can now define our prediction scheme as follows:
170: \begin{enumerate}
171: \item \label{enu:init} For vertex pairs with at least one unclassified
172: vertex initialize $\sigma_0(i,j)$ to $0$ if $i\neq j$ and
173: to $1 - \langle J \rangle$ otherwise.
174: \item \label{enu:sim} Calculate the similarity scores for all pairs of
175: unique vertices such that at least one is unclassified.
176: \item \label{enu:choose} For an unclassified vertex $i$, predict the
177: function set $F_{\hat{i}}$, where $\hat{i}$ is the classified
178: vertex with highest similarity to $i$. If $\hat{i}$ is not unique,
179: but a set $\hat{I} = \{\hat{i}_1,\cdots,\hat{i}_m\}$ has the highest
180: similarity to $i$, then let the set $G$ of functions present in more
than half of the vertices in $\hat{I}$ be the guess. If $G$ is empty, let
182: $F_j$ for a random $j\in\hat{I}$ be the guess.
183: \end{enumerate}
184: The diagonal elements will have maximal functional similarity (which
185: is why we set them to $1-\langle J \rangle$ in step~\ref{enu:init}),
otherwise we assume neutrality. The backup selection rules in
step~\ref{enu:choose} will typically be needed when unclassified
vertices are structurally equivalent to classified vertices. The use
of the majority rule instead of only a random guess will compensate
for occasional errors in the assignment of functions to classified
vertices. Our parameter $S$ sets the relative importance of the
functional similarities to the subsequent assessments of
$\sigma$. As mentioned above, the functional classification is assumed
to be more accurate than the role-similarities, and it is thus sensible to
choose $S\in [0,1-\langle J\rangle]$. The appropriate $S$ value
is problem dependent. We will use $S=0{.}8$, which is in this interval
for both of our test cases. To summarize, we have proposed two
198: versions of our prediction scheme, scheme I and II, corresponding to
199: I- and II-similarity.
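
To illustrate the tie-breaking rule of step~\ref{enu:choose} with
hypothetical function labels $a$, $b$ and $c$, suppose $m=3$ classified
vertices share the highest similarity to $i$ and have the function sets
\begin{equation*}
F_{\hat{i}_1}=\{a,b\},\quad F_{\hat{i}_2}=\{a,c\},\quad
F_{\hat{i}_3}=\{a,b\} .
\end{equation*}
Then $a$ is present in three of the three sets, $b$ in two and $c$ in
one, so the majority rule gives $G=\{a,b\}$. If instead $m=2$ with
$F_{\hat{i}_1}=\{a\}$ and $F_{\hat{i}_2}=\{b\}$, each function is present
in exactly half of the sets, which is not more than half. $G$ is then
empty and the guess is $\{a\}$ or $\{b\}$, chosen at random.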
200:
201: \section{Application to model networks}
202:
203: To test our prediction algorithm we construct model networks where the
204: assigned functions of the vertices correspond to their position in the
205: network. We test the algorithm's size scaling and performance in
206: sub-ideal conditions by randomly perturbing the network.
207:
208: \subsection{Definition of the model networks}
209:
210: \begin{figure*}
211: \includegraphics{ill.eps}
212: \caption{
213: Model networks where vertex function and position are related. (a)
214: shows the initial network. (b) shows a realization with 30
215: vertices and rewiring probability $r=0{.}1$. ``\textbf{*}''
216: indicates a rewired edge.
217: }
218: \label{fig:ill}
219: \end{figure*}
220:
221: In defining our model, we will metaphorically use the flow of raw
222: material, products and information in a manufacturing system. For our
223: purpose we only need networks where the functions of vertices correspond to
224: their position in their network surroundings---we will not further
225: motivate its relevance as a model for manufacturing networks. We
assign five distinct functional classes to the vertices: The
227: \textit{supply} vertices are the source of the raw material which
228: flows along \textit{A-edges} to \textit{assembler} vertices. The
229: assembled products are transported via \textit{B-edges} to
230: \textit{delivery} vertices that dispatch the products. From the
231: delivery vertices informational feedback is sent to the supply
232: vertices through \textit{C-edges}. Furthermore, the A and B-edges can
233: fork at \textit{A-} and \textit{B-distributor} vertices.
234:
235: The precise definition of the model is as follows: Start with the
236: kernel shown in Fig.~\ref{fig:ill}(a), then grow the network vertex by
237: vertex. At each iteration, assign, with equal probability, one of the
238: above functions to the new vertex. Then, depending on the assigned
239: function, form edges including the new vertex as follows.
240: \begin{description}
241: \item[Supply.] Add an A-edge to an assembler or A-distributor, and a
242: C-edge from a delivery vertex.
\item[Assembly.] Add an A-edge from a supplier or A-distributor
  vertex, and a B-edge to a delivery vertex or B-distributor.
245: \item[Delivery.] Add a B-edge from an assembler or B-distributor, and
246: a C-edge to a supplier.
\item[A(B)-distribution.] Add an A(B)-edge from a supplier (assembler) or
  A(B)-distributor vertex, and an A(B)-edge to an assembler (delivery
  vertex) or A(B)-distributor.
250: \end{description}
251: The choice of vertex to attach the new vertex to, given its functional
252: category, is done with uniform randomness. Note that the number of
253: edges will on average be twice the number of vertices (two edges are
254: added per vertex).
255:
256: From the definition so far, any vertex is identifiable from its
257: neighborhood---a vertex with incoming C-edges and out-going A-edges is
258: a supplier, and so on. Real data-sets are seldom perfect---neither in
259: the wiring of the edges, nor in the functional classification. To test
260: the prediction scheme under more realistic circumstances we randomize
261: the network as follows: After generating a network according to the
262: above scheme, we go through all edges sequentially. With a probability
263: $r$ detach the from-side of an edge and re-attach it to a randomly
264: chosen vertex such that no self-edge or multiple edge (of the same
265: type---A, B or C) is formed. Rewire the to-side likewise with the same
266: probability. A realization of the algorithm is displayed in
267: Fig.~\ref{fig:ill}(b). After the rewiring there is not necessarily
268: enough information to classify a vertex---$i$ in Fig.~\ref{fig:ill}(b)
269: is an assembler but could just as well have been a B-distributor.
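
For orientation, a back-of-the-envelope estimate (assuming the two ends
of an edge are rewired independently, as the procedure above suggests):
an edge keeps both of its original ends with probability $(1-r)^2$, so
the expected fraction of edges with at least one rewired end is
\begin{equation*}
1-(1-r)^2 = 2r - r^2 .
\end{equation*}
For $r=0{.}1$, the value used in Fig.~\ref{fig:ill}(b), this is
$0{.}19$, i.e.\ roughly one edge in five is perturbed.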
270:
271: \subsection{Prediction performance}
272:
273: \begin{figure}
274: \resizebox*{\linewidth}{!}{\includegraphics{mod.eps}}
275: \caption{
276: The fraction of correctly predicted functions $s$ for our model
networks as a function of the rewiring probability $r$. (a) shows
the results based on I-similarities, (b) is the corresponding plot
for II-similarities. The points are averaged over $\sim 1000$ runs
of the network construction and prediction scheme with
$a=1/50$. Error bars are smaller than the symbol size. The
282: horizontal line marks the limit of random guessing $0{.}2$.
283: }
284: \label{fig:mod}
285: \end{figure}
286:
To test our prediction scheme we mark a random set of $aN$,
$a\in(0,1)$, vertices unclassified. Then we predict the function of these
vertices and let the average fraction of correctly predicted vertices
$s$ be our performance measure. Fig.~\ref{fig:mod} shows $s$ for
$a=1/50$ and different network sizes, as a function of the
rewiring probability $r$. In the small-$r$ limit the I-similarity
prediction scheme does an almost flawless job with $s>99{.}9\%$ for
294: $N\geqslant 500$. Note, since we have five distinct functions, random
295: guessing could not do better than $s=1/5$. This value, $s=1/5$, is by
necessity attained in the random limit $r=1$. For small $r$-values
scheme II performs best, but if $r\gtrsim 0{.}2$ scheme I performs
298: slightly better. The size convergence for scheme I is faster, so in
299: the large network limit II may outperform I. To understand the
300: performance of the different schemes we note that scheme I has a
301: tendency to match an unknown vertex to a known vertex of high
302: degree. When $r=0$ this effect leads to some mispredictions for scheme
I. But the redundant information about high-degree vertices makes
scheme I more robust to minor perturbations, hence the slower decay of the
$s(r)$-curves compared with scheme II.
306:
We observe that the performance increases with the system size for
both schemes. This is an important effect since databases in general grow
in size, so our prediction scheme will become more accurate with time.
We surmise that, roughly speaking, the bigger
the network gets, the more likely it is that a very good
match exists. This is an effect local methods (taking only the surrounding
313: of a vertex into account) could not utilize. A full explanation of
314: this effect lies beyond the scope of this paper.
315:
316: \section{Predicting protein function in yeast}
317:
318: \begin{figure}
319: \resizebox*{0.85\linewidth}{!}{\includegraphics{pex.eps}}
320: \caption{
321: Example from the yeast protein prediction by scheme II on the
322: first level functional data. When YJL191w is marked
323: unknown it gets matched with YOR133w because their surroundings
look similar. The arrowed lines mark genetic regulation edges,
325: other lines represent physical interaction.
326: }
327: \label{fig:pex}
328: \end{figure}
329:
330: \subsection{Functional prediction of proteins}
331:
332: Specifying protein functions experimentally requires demanding and
333: potentially expensive tests. If one can obtain good guesses of the
functions of an unknown protein, much is gained. During the last decade,
a great number of methods have been suggested for protein
functional prediction, including methods based on sequence
or structure alignments \cite{paw:seq,irving:struct}, attributes
derived from collections of sequences or
structures \cite{jensen:seq,dobson:struct}, phylogenetic profiles
\cite{pelle:pfp}, or analysis of protein complexes
\cite{gavin:complexes}. Much recent work has concentrated on
342: functional prediction based on protein-protein interaction data. Many
343: of these are specialized methods that exploit specific features of
344: protein-protein interaction data \cite{vaz:pfp,schw:pfp,marc:pfp1,%
345: marc:pfp2,hodg:pfp,leto:pfp,sama:pfp} (such as that vertices that
346: interact physically are likely to share some functionality). The more
347: general approaches \cite{deng:pfp,hish:pfp} are local in the sense
348: that they are only based on pairwise statistics. For this reason they
349: may not share the advantageous size scaling properties of our method.
350:
351: \subsection{Applying the method to protein data}
352:
353: There are two types of large scale network data available for
354: \textit{S.\ cerevisiae}: ``physical'' and ``genetic'' protein-protein
355: interactions. The terms ``physical'' and ``genetic'' refer to the type of
356: experiment used to deduce the interaction. The genetic experiments
357: are based on mutation studies, and the evidence from them is of
358: a more indirect nature. We therefore distinguish
359: between physical and genetic edges. All edges are undirected. Our data
set, derived from the MIPS database, has $N=4580$ proteins linked together by
361: $5129$ genetic regulation edges and $7434$ physical interaction
362: edges. We removed duplicates, self-edges and interactions where one or
363: both of the interacting substances were not proteins. The assigned
364: functions are arranged in a hierarchical fashion, according to the
365: FunCat categorization scheme \cite{ruepp:funcat} used by the MIPS
366: database. The first level contains the coarsest description of a
367: protein's function, such as ``metabolism,'' the second level is more
specific, e.g.\ ``amino acid metabolism,'' and so on. We will test our
algorithm on the first and second levels of this hierarchy and thus
370: treat functions that differ in a finer classification as equal. There
371: are three categories with no substantial functional
372: information---``ubiquitous expression,'' ``classification not yet
373: clear-cut'' and ``unclassified proteins.'' We considered vertices with
374: no other assigned categories than these three uncategorized.
375:
376: In Fig.~\ref{fig:pex} we show a small example of scheme II in action
377: on the yeast data. Suppose YJL191w is to be classified (we know it has
378: the level-1 functions ``protein with binding function \ldots'' and
379: ``protein synthesis''). The classified protein with highest similarity
380: is YOR133w. This is because YNL041c, which interacts physically with
381: YJL191w, is functionally identical (at level one of the hierarchy) to
382: YBR068c that is physically linked to YOR133w. Similarly, YJL191w is
383: genetically linked with YCR031c, which shares one functional category
384: with YDR385w, which is genetically linked with YOR133w. These two
385: features give a high similarity score to the pair YJL191w and YOR133w,
386: so scheme II guesses that YJL191w has the functional category
387: ``protein synthesis'' but misses the ``protein with binding function
388: \ldots'' category.
389:
390: \subsection{Performance of the scheme}
391:
392: \begin{table}
393: \caption{\label{tab:perf} The performance of our methods compared to
the neighborhood counting method (NCM) of Ref.\ \cite{schw:pfp}. The
precision $s_+$ is the fraction of correct predictions among the predicted
functions, averaged over all the classified proteins. The recall $s_-$ is
the corresponding average fraction of the actual functions that are
correctly predicted.}
399: \begin{ruledtabular}
400: \begin{tabular}{r|cccccc}
401: & \multicolumn{3}{c}{level 1} & \multicolumn{3}{c}{level 2}\\
402: & NCM & Scheme I & Scheme II & NCM & Scheme I & Scheme II\\\hline
403: $s_+$ & 0{.}269(6) & 0{.}392(6) & 0{.}337(6) &
404: 0{.}199(5) & 0{.}238(6) & 0{.}220(6) \\
405: $s_-$ & 0{.}354(6) & 0{.}291(5) & 0{.}346(7) &
406: 0{.}252(6) & 0{.}199(5) & 0{.}231(6) \\
407: \end{tabular}
408: \end{ruledtabular}
409: \end{table}
410:
411: For the previously described test networks we know \textit{a priori}
412: that the number of functions to be predicted is one. The same may be
413: true for a variety of systems, but not for proteins. With the number
414: of functions as one variable in the prediction problem we proceed to
415: replace the success rate $s$ by the two measures \textit{precision}
416: $s_+$ and \textit{recall} $s_-$ (the names borrowed from corresponding
417: quantities in the text-mining literature, see e.g.\ Ref.~\cite{rag:tm}
418: and references therein):
419: \begin{equation}\label{eq:spm}
420: s_+ = \left\langle\frac{n_c}{f_*}\right\rangle \mbox{~and~}
421: s_- = \left\langle\frac{n_c}{f}\right\rangle ,
422: \end{equation}
423: where $n_c$ is the number of correctly predicted functions, $f$ is the
424: real number of functions and $f_*$ is the number of predicted
functions. $1-s_+$ is thus the expected fraction of false positive
predictions, and $1-s_-$ is the expected fraction of actual functions
that are missed. Both these measures take values
in the interval $[0,1]$ with $0$ meaning that no function is predicted
correctly and $1$ representing perfect prediction. The averages are over
429: the set of predicted functions in the same kind of leave-one-out
430: estimates as performed for the test networks.
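
As a small worked example of Eq.~\ref{eq:spm} with hypothetical function
labels (not taken from the yeast data), suppose a protein has the actual
functions $\{a,b,c\}$ and the prediction is $\{a,d\}$. Then $n_c=1$,
$f=3$ and $f_*=2$, so this protein contributes
\begin{equation*}
\frac{n_c}{f_*} = \frac{1}{2} \mbox{~to~} s_+ \mbox{~and~}
\frac{n_c}{f} = \frac{1}{3} \mbox{~to~} s_- .
\end{equation*}
Predicting a larger set can only raise $n_c/f$ but may lower $n_c/f_*$,
which is why the two measures have to be read together.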
431:
432: We follow Refs.\ \cite{vaz:pfp,deng:pfp} and use the neighborhood
433: counting method (NCM) of Ref.\ \cite{schw:pfp} for reference
values. This method assigns the $f_*$ most frequent functions among
the neighbors in the physical interaction network to the unknown
protein (i.e., $f_*$ is a parameter of this method). Considering its
simplicity, compared with the more elaborate
procedures listed above, this is a remarkably efficient method. In our
implementation, if the
$f_*$'th function is not unique we select it randomly among the tied
functions. Thus proteins
with no neighbors are assigned $f_*$ functions randomly. Precision and
441: recall values are displayed in Tab.~\ref{tab:perf}. We use $f_*=2$ for
442: the NCM which is the closest value to the average number of functions
443: per protein for both levels one and two in our data set. The values
444: may look low compared to similar tables in other papers on protein
445: prediction, but these often do not include low-degree vertices, or use
446: other performance measures (such as counting the fraction of proteins
447: with at least one correctly predicted function, and so on). We note
448: that, like the more disordered test networks, scheme II gives better
449: performance in general (typically having better recall- but slightly
450: worse precision-values).
451:
452: \section{Summary and discussion}
453:
454: We have proposed methods for predicting the function of vertices in
455: networked systems where the function of a vertex relates to its
456: position. The principle behind our scheme is role equivalence as
457: related to the regular equivalence concept of social network
analysis. I.e., vertices are similar if the network, as seen from the
respective vertices, looks similar. We make two extensions of the method
proposed in Refs.\ \cite{simrank,blondel:sim} to networks where some of
the vertices are functionally categorized. The prediction for an
uncategorized vertex is then done by copying the functions of the
categorized vertex with the highest role similarity. Our schemes, corresponding
to our two role similarities, are tested on model networks. These are
designed to have a correspondence between the function of a vertex
and its network surroundings. This correspondence can be tuned by a
randomization parameter. We find that the performance of both schemes
increases with the system size (with the fraction of unknown vertices and
rewired edges fixed), so the applicability of our methods
increases with time (as databases, in general, tend to grow). The
differences between scheme I and II can be ascribed to the fact that
scheme I gives (compared with scheme II) a higher similarity to
473: vertex-pairs containing a high-degree vertex. Furthermore, we apply
474: our method to the \textit{S.\ cerevisiae} proteome. We use the
475: networks of protein-protein interactions and obtain results that
476: compare well with standard methods designed solely with protein
477: functional prediction in mind. We do not claim that our method
outperforms the best specialized protein prediction methods; our aim
479: is to construct a global method for general functional prediction, and
480: most protein functional prediction schemes would perform poorly on our
481: test networks. The ideas of this paper might however contribute to
482: future, more elaborate, methods for prediction of protein functions.
483:
The basic advantage of our method, as we see it, is that it is a very
general method that should apply to functional prediction in many
systems. Moreover, it makes use of global network information,
giving performance that does not decrease as the system gets
488: larger. The fact that it is a truly global algorithm---the prediction
of every vertex's functions takes the wiring of the whole network into
490: account---makes it rather slow (compared to e.g.\ specialized protein
491: functional prediction methods, such as the one proposed in
492: Ref.\ \cite{schw:pfp}). The execution time scales as $O(M^2)$ (where
$M$ is the total number of edges). But data sets of size $10^4$--$10^5$,
which cover e.g.\ the size of proteomes of known organisms, should be
manageable on present-day computers. We believe the problem of
functional prediction in different types of networked systems is far
from concluded, both in its full generality and in the question of how to
utilize the characteristics of more specific systems.
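
As a rough count behind this scaling (a sketch, assuming that every
$r$-edge is stored as a single directed edge and that all vertex pairs
are updated), note that the in-neighborhood term of
Eq.~\ref{eq:simdef_i} for the pair $(i,j)$ has
$k_{i,r}^{\mathrm{in}}k_{j,r}^{\mathrm{in}}$ summands. Summing over all
pairs gives
\begin{equation*}
\sum_{i,j} k_{i,r}^{\mathrm{in}}\,k_{j,r}^{\mathrm{in}} =
\Bigl(\sum_i k_{i,r}^{\mathrm{in}}\Bigr)^2 = M_r^2 ,
\end{equation*}
where $M_r$ is the number of $r$-edges, and the same holds for the
out-neighborhood term. One iteration thus involves
$2\sum_r M_r^2\leqslant 2M^2$ additions. For the yeast data used above,
$M=5129+7434=12563$, so $M^2\approx 1{.}6\times 10^8$ per iteration.
Treating each undirected edge as a pair of directed edges changes only
the constant factor.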
499:
500: \subsection*{Acknowledgments}
501:
502: The authors thank Micha Enevoldsen, Elizabeth Leicht and Mark Newman
503: for comments.
504:
505: \begin{thebibliography}{10}
506:
507: \bibitem{ba:rev}
508: Albert, R.\ \& Barab\'{a}si, A.-L.
509: (2002) {\em Rev.\ Mod.\ Phys.}\ {\bf 74}, 47--98.
510:
511: \bibitem{harary}
512: Buckley, F.\ \& Harary, F.
513: (1989) {\em Distance in graphs}.
514: (Addison-Wesley, Redwood City).
515:
516: \bibitem{mejn:rev}
517: Newman, M. E.~J.
518: (2003) {\em SIAM Rev.}\ {\bf 45}, 167--256.
519:
520: \bibitem{wf}
521: Wasserman, S.\ \& Faust, K.
522: (1994) {\em Social network analysis: Methods and applications}.
523: (Cambridge University Press, Cambridge).
524:
525: \bibitem{hodg:pfp}
526: Hodgman, T.
527: (2000) {\em Bioinformatics} {\bf 16}, 10--15.
528:
529: \bibitem{yook:protein}
530: Yook, S., Oltvai, Z.\ \& Barab\'{a}si, A.-L.
531: (2004) {\em Proteomics} {\bf 4}, 928--942.
532:
533: \bibitem{sama:pfp}
534: Samanta, M.~P.\ \& Liang, S.
535: (2003) {\em Proc.\ Natl.\ Acad.\ Sci.\ USA} {\bf 100}, 12579--12583.
536:
537: \bibitem{vaz:pfp}
Vazquez, A., Flammini, A., Maritan, A., \& Vespignani, A.
539: (2003) {\em Nature Biotech.}\ {\bf 21}, 697--700.
540:
541: \bibitem{deng:pfp}
542: Deng, M., Zhang, K., Mehta, S., Chen, T.\ \& Sun, F.
543: (2002) in {\em Proceedings of the IEEE Computer Society
544: Bioinformatics Conference (CSB 02)}.
545: (Stanford CA), pp. 197--207.
546:
547: \bibitem{hish:pfp}
Hishigaki, H., Nakai, K., Ono, T., Tanigami, A.\ \& Takagi, T.
549: (2001) {\em Yeast} {\bf 18}, 523--531.
550:
551: \bibitem{leto:pfp}
552: Letovsky, S.\ \& Kasif, S.
553: (2003) {\em Bioinformatics} {\bf 19}, 197--204.
554:
555: \bibitem{eve:sim}
556: Everett, M.~G.
557: (1985) {\em Soc.\ Netw.}\ {\bf 7}, 353--359.
558:
559: \bibitem{regeeco1}
560: Luczkovich, J.~J., Borgatti, S.~P., Johnson, J.~C., \& Everett, M.~G.
561: (2003) {\em J.\ Theor.\ Biol.}\ {\bf 220}, 303--321.
562:
563: \bibitem{pagel:mips}
564: Pagel, P., Kovac, S., Oesterheld, M., Brauner, B., Dunger-Kaltenbach,
565: I., Frishman, G., Montrone, C., Mark, P., St\"{u}mpflen, V., Mewes,
566: H.~W.\ {\em et al.} (2004) {\em Bioinformatics}, [Epub ahead of
567: print] \url{doi:10.1093/bioinformatics/bti115}.
568:
569: \bibitem{gui:meta}
570: Guimer\`{a}, R.\ \& {Nunes Amaral}, L.~A.
571: (2005) {\em Nature} {\bf 433}, 895--900.
572:
573: \bibitem{luss:dolphin}
574: Lusseau, D.\ \& Newman, M. E.~J.
575: (2004) {\em Proc.\ R.\ Soc.\ London B} {\bf 271}, 477--481.
576:
577: \bibitem{blondel:sim}
578: Blondel, V.~D., Gajardo, A., Heymans, M., Senellart, P., \& {van Dooren}, P.
579: (2004) {\em SIAM Rev.}\ {\bf 46}, 647--666.
580:
581: \bibitem{simrank}
Jeh, G.\ \& Widom, J. (2002) {\em Proceedings of the eighth ACM SIGKDD
583: international conference on knowledge discovery and data
584: mining}. (Edmonton), pp. 538--543.
585:
586: \bibitem{paw:seq}
587: Pawlowski, K., Jaroszewski, L., Rychlewski, L.\ \& Godzik, A. (2000)
588: {\em Pac.\ Symp.\ Biocomput.}, 42--53.
589:
590: \bibitem{irving:struct}
591: Irving, J.~A., Whisstock, J.~C.\ \& Lesk, A.~M. (2001) {\em Proteins}
592: {\bf 42}, 378--382.
593:
594: \bibitem{jensen:seq}
595: Jensen, L.~J., Staerfeldt, H.\ \& Brunak, S. (2003) {\em
596: Bioinformatics} {\bf 19}, 635--642.
597:
598: \bibitem{dobson:struct}
Dobson, P.~D.\ \& Doig, A.~J. (2003) {\em J.\ Mol.\ Biol.}\ {\bf 330},
600: 771--783.
601:
602: \bibitem{pelle:pfp}
603: Pellegrini, M., Marcotte, E., Thompson, M.~J., Eisenberg, D.\ \&
604: Yeates, T.~O. (1999) {\em Proc.\ Natl.\ Acad.\ Sci.\ USA} {\bf 96},
605: 4285--4288.
606:
607: \bibitem{gavin:complexes}
608: Gavin, A.~C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer,
609: A., Schultz, J., Rick, J.~M., Michon, A.~M., Cruciat, C.~M., Remor,
M.\ {\em et al.} (2002) {\em Nature} {\bf 415},
141--147.
612:
613: \bibitem{marc:pfp2}
614: Marcotte, E.~M., Pellegrini, M., Ng, H.~L., Rice, D.~W., Yeates, T.~O.\ \&
615: Eisenberg, D. (1999) {\em Science} {\bf 285}, 751--753.
616:
617: \bibitem{marc:pfp1}
618: Marcotte, E.~M., Pellegrini, M., Thompson, M.~J., Yeates, T.~O.\ \&
619: Eisenberg, D. (1999) {\em Nature} {\bf 402}, 83--86.
620:
621: \bibitem{schw:pfp}
622: Schwikowski, B., Uetz, P.\ \& Fields, S. (2000) {\em Nature Biotech.}\
623: {\bf 18}, 1257--1261.
624:
625: \bibitem{ruepp:funcat}
626: Ruepp, A., Zollner, A., Albermann, K., Hani, J., Mokrejs, M., Tetko,
627: I., Guldener, U., Mannhaupt, G., Munsterkotter, M.\ \& Mewes,
628: H.~W. (2004) {\em Nucleic Acids Res.}\ {\bf 32}, 5539--5545.
629:
630: \bibitem{rag:tm}
Raghavan, V.~V., Jung, G.~S.\ \& Bollmann, P. (1989) {\em ACM Trans.\
632: Inf.\ Syst.}\ {\bf 7}, 205--229.
633:
634: \end{thebibliography}
635:
636:
637: \end{document}
638: