1: \documentclass[floatfix,twocolumn,showpacs,preprintnumbers,amsmath,amssymb]{revtex4}
2:
3: \usepackage{graphicx,epsfig,amsfonts}% Include figure files
4: \usepackage{dcolumn}% Align table columns on decimal point
5: \usepackage{psfrag}
6: \usepackage{concmath,charter}
7: \usepackage{subfigure}
8: \begin{document}
9:
10: \newcommand{\e}[1]{\emph{#1}}
11: \newcommand{\avg}[1]{\langle #1 \rangle}
12: \newcommand{\va}[0]{{\mathbf a}}
13: \newcommand{\vb}[0]{{\mathbf b}}
14: \newcommand{\vc}[0]{{\mathbf c}}
15:
16:
17: \title{Global statistical analysis of the protein homology network}
18:
19: \author{C.~Miccio}
20: \email{miccio@mib.infn.it}
21: \affiliation{
22: Dipartimento di Fisica G.Occhialini, Universit\`a di
23: Milano--Bicocca and INFN, Sezione di Milano, Piazza della Scienza 3
24: - I-20126 Milano, Italy}
25: \author{T.~Rattei}
26: \email{t.rattei@wzw.tum.de}
27: \affiliation{
28: Department of Genome Oriented Bioinformatics, Technical University
29: of Munich,
30: Wissenschaftszentrum 5 Weihenstephan, 85350 Freising,
31: Germany }
32:
33: \date{\today}
34:
35: \begin{abstract}
The similarity between protein sequences is a directly and easily
computed quantity from which one can deduce information about their
evolutionary distance and detect homologous proteins. The
\emph{SIMAP} database (\emph{Similarity Matrix of Proteins})
provides a pre-computed similarity matrix covering the similarity
space formed by almost all publicly available amino acid sequences
from public databases and completely sequenced genomes. From SIMAP
we construct the protein homology network, where the proteins are
the nodes and the links represent homology relationships. With more
than $5$ million nodes and about $70 \times 10^9$ edges, it is the
largest protein homology network ever built. We
describe its basic features and perform a global statistical
analysis of the network. Starting from the Smith-Waterman similarity
score, we define for each edge a weight $w$ that measures the
similarity between two nodes. Keeping only edges with a
weight greater than a minimal value $\bar w$, and varying $\bar w$, we
build a family of networks with different degrees of similarity. We
investigate the distribution of connected components (clusters) of
the networks at different $\bar w$ and, in particular, we find a
behaviour similar to a phase transition driven by the formation of a
giant component. Moreover, we study selected sequence features and
protein domains of protein pairs that connect different clusters in
the networks at different levels of similarity. We observe
specific, non-random distributions of the protein features and
domains for proteins connecting clusters at certain weight
intervals.
62: \end{abstract}
63: %\pacs{87.10.+e, 05.10.Ln}
64:
65: \maketitle
66:
67: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
68: \section{Background}
69:
70: The number of known proteins is rapidly growing and the sequence of
71: amino acids is, at the moment, the main source of information for many
72: new proteins which still have unidentified functions. Protein sequence
73: analysis, and more specifically, the analysis of similarities among
74: protein sequences, is therefore the basis of studies trying to
75: understand protein evolutionary processes or to detect unknown
biological functions of new proteins. Proteins with similar sequences
can be found in different organisms and within a single
organism\footnote{Due to duplication and shuffling of coding segments
  in the DNA during evolution.} \cite{revEvol}. From the
degree of similarity obtained by a pairwise sequence comparison it is
possible to deduce information about the evolutionary distance of two
proteins. Specifically, two proteins are homologous if they evolved from a
common ancestral protein sequence and, in most cases, they also have
the same, or a very similar, biological function. Homology can be
deduced from statistically significant sequence similarities. However,
new sequences often have only weak similarities to known proteins, and
single similarity searches are insufficient to assign validated
properties of characterized proteins to new sequences. Instead, a graph
formed by all-against-all comparisons of a large amount of
protein data can be useful. This is the case of {\bf SIMAP}
(\e{Similarity Matrix of Proteins}), a database containing the
similarity space formed by almost all amino acid sequences, with
nearly 5.5 million non-redundant protein sequences drawn from
completely sequenced genomes and public databases. Moreover, the
pre-calculated similarity space allows very rapid access to
significant hits of interest and avoids time-consuming
re-computation. The algorithm that precomputes the sequence
similarities is based on the FASTA heuristic. First it compares
low-complexity masked proteins using FASTA, and then it recalculates
the hits found using non-masked sequences and the Smith-Waterman
algorithm. In both phases of the alignment process the BLOSUM50 amino
acid substitution matrix is used. For each hit the Smith-Waterman
score, the identity, the gapped identity, the overlap and the start
and stop coordinates of the alignment in
both proteins are stored. For more details see \cite{simap}.\\
106: Graphs formed by all-against-all sequence comparisons can be used to
107: derive inheritance patterns of proteins, to reconstruct the
108: evolutionary relationships between proteins and to classify them into
109: protein families by looking for dense clusters disconnected from the
110: rest of the network. To date, this approach has been carefully
111: evaluated by case studies targeted at selected protein families
112: \cite{phn}, but a global analysis of the complete homology network
113: formed by all publicly available proteins has not been published. The
114: aim of this work is to analyze global and local properties of the
115: graph forming the homology network.
116:
117:
118: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
119: \section{SIMAP graph representation}
120:
The information contained in the Simap database can be reorganized by
means of a weighted graph representation, $G(V, E, w)$, where $V$ is
the set of nodes, $E$ the set of edges, and $w$ a weight function on
the edges, $w : E \to [0,1]$. Each node $\va \in V$ represents a
protein sequence, and each edge $e = \{ \va,\vb \} \in E$ between two
nodes $\va$, $\vb$ represents the stored alignment between the
respective protein sequences\footnote{For simplicity we use the
  same notation for the nodes of the graph and the proteins of the
  database.}. In
this way an undirected weighted graph is obtained, since the
symmetry of the alignment procedure leads to undirected edges and the
score of the alignment allows the assignment of a suitable weight to
every edge. (Although a protein sequence can be aligned with
itself, self-edges are not considered.) More
specifically, if $s(\va,\vb)$ is the Smith-Waterman (SW) optimal score
obtained with the FASTA algorithm between sequences $\va$ and $\vb$, a
suitable weight $w(\va,\vb) \in [0,1]$ for the edge $e = \{ \va,\vb
\}$ can be defined as follows:
\begin{equation}
  \label{eq:weight}
  w(\va,\vb) = \frac{s(\va,\vb)}{ \sqrt{ \; s(\va,\va) \;
      s(\vb,\vb)}}.
\end{equation}
From $w(\va,\vb)$ one can define a distance function as $d(\va,\vb)
= 1 - w(\va,\vb)$, whose values lie in $[0, 1]$. Like the distance
functions usually defined on linear spaces, $d$ should satisfy
positivity, identity ($d(\va,\va) = 0$) and symmetry for all pairs of
protein sequences, and also
the triangle inequality, which is fully satisfied for the BLOSUM50
matrix.
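
The normalization in (\ref{eq:weight}) can be illustrated with a minimal Python sketch. The function names \texttt{edge\_weight} and \texttt{distance} and the numerical scores are hypothetical and chosen only for illustration; they are not taken from the database.

```python
import math

def edge_weight(s_ab, s_aa, s_bb):
    """Normalized weight w(a,b) = s(a,b) / sqrt(s(a,a) * s(b,b))."""
    if s_aa <= 0 or s_bb <= 0:
        raise ValueError("self-scores must be positive")
    return s_ab / math.sqrt(s_aa * s_bb)

def distance(w):
    """Distance d = 1 - w associated with a weight w in [0, 1]."""
    return 1.0 - w

# Hypothetical scores: two proteins with self-scores 400 and 900
# whose pairwise SW score is 300.
w = edge_weight(300, 400, 900)   # 300 / sqrt(360000) = 300 / 600
d = distance(w)
```

Dividing by the geometric mean of the two self-scores makes the weight independent of the absolute score scale, so that long and short proteins can be compared on the same footing.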
149:
150: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
151: \section{Polishing procedure}
Strictly speaking, the set of all protein sequences of the Simap
database is not a good space over which to define the distance measure
$d$. There are, in fact, $1538$ pairs of sequences that have distance
equal to zero although they are stored under different sequence
ids. They differ only in the presence of one or two `X' in
their amino acid sequence, where `X' is the standard
symbol for an unknown amino acid residue in a protein sequence. It is
therefore natural to remove, for each of these pairs of
sequences, the one that contains the `X'; in the graph
representation this entails the removal of all
edges connected to the removed nodes. Another improvement of database
consistency is the check of the symmetry of all edges: every time a
directed edge is found, the inverse relation, if absent, is added.
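
The two polishing steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary layout, the helper names \texttt{symmetrize} and \texttt{remove\_nodes} and the toy identifiers are hypothetical. Storing each undirected edge once under a sorted key is used here as an equivalent of adding the missing inverse relation.

```python
def symmetrize(hits):
    """Return an undirected edge map from directed hits.

    `hits` maps a directed pair (a, b) to its weight. Each undirected
    edge is stored once under the sorted pair; self-edges are dropped.
    """
    edges = {}
    for (a, b), w in hits.items():
        if a == b:
            continue                      # self-edges are not considered
        key = (a, b) if a < b else (b, a)
        # If both directions are stored, keep a single entry.
        edges.setdefault(key, w)
    return edges

def remove_nodes(edges, removed):
    """Drop every edge touching a node in `removed`."""
    return {k: w for k, w in edges.items()
            if k[0] not in removed and k[1] not in removed}

# Toy data: one edge stored in both directions, one stored only once,
# and one self-hit.
hits = {("p1", "p2"): 0.8, ("p2", "p1"): 0.8,
        ("p3", "p1"): 0.3, ("p2", "p2"): 1.0}
edges = symmetrize(hits)
polished = remove_nodes(edges, {"p3"})
```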
165:
166: As a final result of these manipulations, a graph with $V = 5,489,907$
167: nodes and $E = 69,500,722,050$ edges can be constructed.
168:
Over the polished Simap protein sequence space the distance $d = 1 -
w(\va,\vb)$ violates the triangle inequality in a few cases (about
$0.2 \%$ of triangles). However, redefining it, for instance, as

\begin{equation}
  \label{eq:distance}
  d(\va,\vb) =\sqrt{1 - w(\va,\vb)},
\end{equation}
we find that the triangle inequality is satisfied for all triples of
linked proteins, so that (\ref{eq:distance}) has all the properties
required of a \e{distance measure}.
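
A brute-force check of the triangle inequality on fully linked triples could look like the Python sketch below. The three weights are invented to show a triple that violates the inequality for $1 - w$ but not for $\sqrt{1 - w}$; the enumeration of all triples is illustrative only and would not scale to the full graph.

```python
import math
from itertools import combinations

def violations(weights, dist):
    """Count fully linked triples that violate the triangle inequality.

    `weights` maps frozenset({a, b}) to w(a, b); `dist` maps a weight
    to a distance. Only triples whose three edges all exist are tested.
    """
    nodes = sorted({n for pair in weights for n in pair})
    count = 0
    for a, b, c in combinations(nodes, 3):
        keys = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if not all(k in weights for k in keys):
            continue
        d = sorted(dist(weights[k]) for k in keys)
        if d[2] > d[0] + d[1]:      # longest side exceeds the sum of the others
            count += 1
    return count

# Hypothetical triple: two very similar pairs and one weaker pair.
weights = {frozenset(("a", "b")): 0.9,
           frozenset(("b", "c")): 0.9,
           frozenset(("a", "c")): 0.7}
linear = violations(weights, lambda w: 1.0 - w)
root = violations(weights, lambda w: math.sqrt(1.0 - w))
```

Because the square root is concave, it increases small distances proportionally more than large ones, which is why the same triple passes the test after the redefinition.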
180:
181: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
182: \section{Characterization of Simap protein space}
183:
184: In the Simap database, protein sequences come from $104,560$ different
185: species. There are, in particular, $3$ species (\e{Homo sapiens},
186: \e{Arabidopsis thaliana}, \e{Rice plants}) with more than $100,000$
187: protein sequences and $72$ with more than $10,000$.
188:
189: \begin{table}[!htb]
190: \begin{center}
191: \begin{tabular}{|c|c|c|}
192: \vspace{-10pt} & & \\ \hline \it{kingdoms} & & \it{number of species} \\
193: \vspace{-10pt} & & \\ \hline
194: bacteria & & $11,130$ \\ \hline
195: viruses & viruses & $13,708$ \\
196: & phages & $923$ \\ \hline
197: plants & & $31,232$ \\ \hline
198: animalia & invertebrates & $25,951$ \\ %\cline{2-4}
199: & vertebrates & $19,341$ \\
200: & (rodents) & $(1,474)$ \\
201: & (mammals) & $(1,854)$ \\
202: & (primates) & $(393)$ \\ \hline
203: environmental samples & & $1,453$ \\ \hline
204: synthetic & & $822$ \\ \hline
205: \end{tabular}
206: \caption{\label{tab1} \small Number of species for each
207: kingdom.}
208: \end{center}
209: \end{table}
210:
A coarse subdivision of all species is shown in
Table~\ref{tab1}; it separates the species into five (non-standard)
main kingdoms: bacteria, viruses, plants, invertebrates (animalia) and
vertebrates (animalia). The classification reveals the presence of
a large number of animalia species, but only eight of these species
are present with their complete genome (the other animalia proteins
were imported from multi-species databases).
Figure~\ref{fig1} shows the protein distribution for
each kingdom. There is also a high number ($546,439$) of unassigned
protein sequences\footnote{These sequences come from the databases
  \e{PDB proteins}, \e{mips non-redundant protein database},
  \e{UNIPROT SWISSPROT}, \e{UNIPROT-TrEMBL}, \e{PFAM sequences} and
  \e{Eukaryotic signature proteins}.}.
224:
225: \begin{figure}[!htb]
226: \begin{center}
227: \includegraphics[height=0.36\textwidth,angle=270]{f1.eps}
    \caption{\label{fig1} {\small Distribution of
        proteins for each kingdom. The inset shows the
        distribution within vertebrates.}}
231: \end{center}
232: \end{figure}
233:
234: \subsection{Length and self-similarity distribution}
235:
236: \begin{figure}[!htb]
237: \includegraphics[height=0.70\textwidth]{f2.eps}
  \caption{\label{fig2}{\small (a) Distribution of protein sequence
      lengths. The inset shows an enlargement of the distribution.
      (b) Length distributions of protein sequences which
      belong to \e{bacteria} ($\avg{l} = 316.9$, $l_{max} = 36805$),
      \e{viruses} ($\avg{l} = 273.9$, $l_{max} = 7312$), \e{plants}
      ($\avg{l} = 314.5$, $l_{max} = 20925$), \e{invertebrates}
      ($\avg{l} = 416.1$, $l_{max} = 23015$), \e{vertebrates}
      ($\avg{l} = 397.1$, $l_{max} = 38031$).}}
246: \end{figure}
247:
The protein sequence space is characterized by the length
distribution shown in Figure~\ref{fig2}a. Figure~\ref{fig2}b
gives the length distributions for sequences belonging to bacteria,
viruses, plants, vertebrates and invertebrates.
252:
253: \begin{figure}[!htb]
254: \vspace{0.2cm}
255: \includegraphics[height=0.36\textwidth,angle=270]{f3.eps}
  \caption{\label{fig3}{\small Distribution of protein sequence
      self-scores. The inset shows an enlargement of the
      distribution.}}
259: \end{figure}
260:
The distribution of the \e{self-similarity scores} of the protein
sequences is shown in Figure~\ref{fig3}. This distribution
is well reproduced by a mixture of normal distributions, one for each
sequence length. The self-similarity score $s(\va,\va)$ of a protein
sequence of length $l$ can be thought of as a sum of $l$ i.i.d. random
variables, i.e. a sum of the self-similarity scores of random amino
acids. Knowing the amino acid background probabilities\footnote{The
  values for the background distribution of amino acids come from the
  data used for the PAM matrix: $\;p_A=0.096;\; p_R=0.034;\; p_N=0.042;\;
  p_D=0.053;\; p_C=0.025;\; p_Q=0.032;\; p_E=0.053;\; p_G=0.090;\;
  p_H=0.034;\; p_I=0.035;\; p_L= 0.084;\; p_K=0.085;\; p_M=0.012;\;
  p_F=0.045;\; p_P=0.041;\; p_S=0.057;\; p_T=0.062;\; p_W=0.012;\;
  p_Y=0.030;\; p_V=0.078$.\\ They can be obtained from \e{{\small
      http://apps.bioneq.qc.ca/twiki/pub/Knowledgebase/PAM/}}
  \e{{\small PAM2.JPG}}} $p_{a}$ and the diagonal values of the
BLOSUM50 score matrix, $B_{aa}$, the self-similarity score of a random
amino acid has mean $\avg{s} =
\sum_a p_a \, B_{aa} \;\; ( \approx 6.727)$ and standard deviation $ \sigma =
\sqrt{\sum_a p_a B_{aa}^2 - \avg{s} ^2} \;\;(\approx 2.067)$.
By the central limit theorem, the self-similarity scores of random
amino acid sequences of length $l$ are approximately normally
distributed, with a density $g(l,s)$ of mean $l\,\avg{s}$ and
standard deviation $\sqrt{l\,\sigma^2}$. Finally, the self-similarity
score distribution is well approximated by the sum $\sum_{l} g(l,s) f(l)$,
where $f(l)$ is the observed length distribution (Figure~\ref{fig4}).
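
The two moments quoted above, and the mixture $\sum_l g(l,s) f(l)$, can be reproduced with the short Python sketch below. The background probabilities are those of the footnote. The BLOSUM50 diagonal entries are not listed in the text and are an assumption here (commonly tabulated values, which the reader should check against the matrix actually used). The two-point length distribution is purely hypothetical.

```python
import math

# Background probabilities p_a as quoted in the footnote.
P = {"A": 0.096, "R": 0.034, "N": 0.042, "D": 0.053, "C": 0.025,
     "Q": 0.032, "E": 0.053, "G": 0.090, "H": 0.034, "I": 0.035,
     "L": 0.084, "K": 0.085, "M": 0.012, "F": 0.045, "P": 0.041,
     "S": 0.057, "T": 0.062, "W": 0.012, "Y": 0.030, "V": 0.078}

# Assumed BLOSUM50 diagonal entries B_aa (not listed in the text).
B = {"A": 5, "R": 7, "N": 7, "D": 8, "C": 13, "Q": 7, "E": 6, "G": 8,
     "H": 10, "I": 5, "L": 5, "K": 6, "M": 7, "F": 8, "P": 10, "S": 5,
     "T": 5, "W": 15, "Y": 8, "V": 5}

# Mean and standard deviation of the self-score of one random residue.
mean_s = sum(P[a] * B[a] for a in P)
sigma = math.sqrt(sum(P[a] * B[a] ** 2 for a in P) - mean_s ** 2)

def g(l, s):
    """Normal density with mean l*mean_s and standard deviation sqrt(l)*sigma."""
    mu, sd = l * mean_s, math.sqrt(l) * sigma
    return math.exp(-0.5 * ((s - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def mixture(s, f):
    """Self-score density sum_l g(l, s) f(l) for a length distribution f."""
    return sum(g(l, s) * fl for l, fl in f.items())

# Hypothetical length distribution concentrated on two lengths.
f = {100: 0.5, 300: 0.5}
density_at_mean = mixture(100 * mean_s, f)
```

With these assumed diagonal values the sketch gives a mean close to $6.727$ and a standard deviation close to $2.067$, in agreement with the numbers quoted in the text.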
285:
286: \begin{figure}[!htb]
287: \vspace{0.2cm}
288: \includegraphics[height=0.36\textwidth,angle=270]{f4.eps}
  \caption{\label{fig4}{\small Comparison between the distribution of
      protein sequence self-scores and the curve obtained as a
      superposition of normal distributions weighted by the protein
      sequence length distribution.}}
293: \end{figure}
294:
295: \subsection{Pairwise similarity distribution}
296:
The distribution of SW optimal similarity scores obtained from all FASTA
sequence alignments has a uniform lower cutoff equal to $80$, which is
used for storing hits in the Simap database. It was chosen independently
of the query and database length, as a compromise between
sensitivity and the need to store a manageable number of hits,
given the high number of protein sequences.
303:
304: \begin{figure}[!htb]
305: \includegraphics[height=0.70\textwidth]{f5.eps}
  \caption{\label{fig5}{\small (a) Distribution of edge weights $w$.
      The inset shows an enlargement of the distribution
      tail. (b) Cumulative distribution of the edge weights.}}
309: \end{figure}
310:
Figure~\ref{fig5}a shows the distribution of the weights $w$, and
Figure~\ref{fig5}b the corresponding cumulative distribution $\rho(w)$. The
values of $\rho(w) \in [0,1]$ represent the fraction of edges which
have weight greater than or equal to $w$. From these plots we see that
most of the edges (about $80\%$ of the total number) have a
very low value of $w$ ($\leq 0.2$).
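
As a small illustration of the definition of $\rho(w)$, the following Python sketch computes the fraction of edges with weight at least $w$. The function name \texttt{fraction\_at\_least} and the ten sample weights are hypothetical.

```python
def fraction_at_least(weights, w):
    """Fraction of edges with weight greater than or equal to w."""
    if not weights:
        raise ValueError("no edges")
    return sum(1 for x in weights if x >= w) / len(weights)

# Hypothetical sample of ten edge weights, eight of them below 0.2.
sample = [0.05, 0.10, 0.15, 0.18, 0.30, 0.90, 0.12, 0.08, 0.16, 0.19]
rho = fraction_at_least(sample, 0.2)
```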
317:
318: \subsection{Coordination and cluster distribution}
319:
The weights $w$ can be used as a parameter to define a collection of
graphs. For a fixed threshold $\bar{w}$ (or, equivalently, $\bar{d}
= \sqrt{1 -\bar{w} }$), a graph is built keeping only edges with $w >
\bar{w}$ ($d < \bar{d}$). For high values of $\bar{w}$, i.e. at
small distances, nodes are linked if, and only if, the corresponding
protein sequences have a high degree of similarity; it is then
reasonable to expect graphs with many small connected components. By
decreasing $\bar{w}$, in other words by also linking proteins
having a lower degree of similarity, graphs with larger connected
components are expected. The graph obtained by considering all
possible edges (by fixing $\bar{w} = 0$) is not the complete graph,
due to the cutoff on the alignment score (it contains about $0.1 \%$ of
the edges of the corresponding complete graph).
333:
We have built graphs for values of $\bar{w}$ from $0.975$ down to
$0.125$ in steps of $0.025$; for each of these values the set of
protein sequences splits into clusters, i.e. isolated connected
components. As proteins at greater and greater distance from each
other are linked (decreasing $\bar{w}$), clusters merge to form larger
clusters: the number of isolated proteins and the number of very small
components decrease, while the number of clusters of medium and
large size increases.
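
The construction of the thresholded graphs and of their clusters can be sketched with a union-find structure, as in the Python example below. This is a minimal sketch and not necessarily the procedure used for the full network; the function name \texttt{cluster\_sizes}, the integer node ids and the four toy edges are assumptions made for illustration.

```python
from collections import Counter

def cluster_sizes(n_nodes, edges, w_bar):
    """Sizes of connected components keeping only edges with w > w_bar.

    `edges` is an iterable of (a, b, w) with integer node ids in
    range(n_nodes). Isolated nodes count as clusters of size 1.
    """
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b, w in edges:
        if w > w_bar:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb             # merge the two clusters
    sizes = Counter(find(x) for x in range(n_nodes))
    return sorted(sizes.values(), reverse=True)

# Toy graph with six nodes; node 5 has no stored hit at all.
edges = [(0, 1, 0.95), (1, 2, 0.60), (3, 4, 0.80), (2, 3, 0.30)]
high = cluster_sizes(6, edges, 0.9)
mid = cluster_sizes(6, edges, 0.5)
low = cluster_sizes(6, edges, 0.2)
```

In the toy graph, lowering the threshold from $0.9$ to $0.2$ merges the small clusters step by step, which is the qualitative behaviour described above.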
346:
347: \begin{figure}[!htb]
348:
349: \subfigure[]{\label{fig6a}
350: \includegraphics[width=0.34\textwidth, angle=270]{f6a.ps}
351: } \vspace{-0.4cm}
352: \subfigure[]{\label{fig6b}
353: \includegraphics[width=0.40\textwidth, height=0.42\textwidth, angle=270]{f6b.ps}
354: }
355:
  \caption{{\small (a) Size distribution of the connected
      components of the protein sequence graph built at $\bar{w} =
      0.975$ (red curve), $\bar{w} = 0.75$ (pink curve) and $\bar{w} =
      0.4$ (blue curve). As $\bar{w}$
      decreases, the number of small connected components
      decreases and the starting region of the power law behaviour
      shifts to larger sizes. (b) Distribution of the
      coordination degree of the protein sequence graph built at
      $\bar{w} = 0.975$ (red curve), $\bar{w} = 0.75$ (pink curve) and
      $\bar{w} = 0.4$ (blue curve). As $\bar{w}$ decreases,
      the number of nodes with small coordination degree decreases and
      the starting region of the power law behaviour shifts to higher
      values of the coordination degree.}}
369: \end{figure}
370:
Measuring the (unnormalized) cluster size distribution, we find that, for
each fixed value of $\bar{w}$, the number of clusters
$n_{\bar{w}}(s)$ of size $s$ follows, in a specific size range, a
power law, $n_{\bar{w}}(s) \sim s^{-\sigma(\bar{w})}$.
Fitted values of $\sigma(\bar{w})$ and the fitting size ranges are
reported in Table~\ref{tab2}, and a log-log plot of the size
distribution $n_{\bar{w}}(s)$ for three different values of $\bar w$
is shown in Figure~\ref{fig6a}. The (unnormalized) coordination
degree distribution $f_{\bar w}(z)$ also follows a power law,
$f_{\bar w}(z) \sim z^{-\alpha(\bar w)}$, for each value of
$\bar{w}$. A log-log plot of the coordination degree distribution
$f_{\bar{w}}(z)$ for three different values of $\bar w$ is shown in
Figure~\ref{fig6b}. Fitted values of $\alpha(\bar{w})$ and the fitting
ranges of the coordination degree are reported in Table~\ref{tab3}.
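
A power law exponent and the associated correlation coefficient can be estimated by a linear least-squares fit in log-log coordinates restricted to a chosen range, as in the Python sketch below. The function name \texttt{fit\_power\_law} and the synthetic counts are hypothetical, and the fitting details used for the tables may differ.

```python
import math

def fit_power_law(counts, s_min, s_max):
    """Least-squares line through (log s, log n(s)) for s_min <= s <= s_max.

    Returns (exponent, r): n(s) ~ s**(-exponent), and r is the Pearson
    correlation coefficient of the log-log data.
    """
    pts = [(math.log(s), math.log(n)) for s, n in counts.items()
           if s_min <= s <= s_max and n > 0]
    if len(pts) < 2:
        raise ValueError("need at least two points in the fitting range")
    m = len(pts)
    mx = sum(x for x, _ in pts) / m
    my = sum(y for _, y in pts) / m
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    syy = sum((y - my) ** 2 for _, y in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy)
    return -slope, r

# Synthetic counts following an exact power law with exponent 2.5.
counts = {s: 1.0e6 * s ** -2.5 for s in range(1, 200)}
exponent, r = fit_power_law(counts, 10, 60)
```

Since the slope of a decaying power law is negative in log-log coordinates, the correlation coefficient is negative, as in the tables.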
385:
386: \begin{table}[!htb]
387: \begin{center}
388: \begin{tabular}{|c|ccc|} \hline
389: \vspace{-10pt} & & & \\ $\bar{w}$ & $\sigma$ & \quad component
390: & \quad correlation \\ & & \quad size range & \quad coefficient \\
391: \vspace{-10pt} & & & \\ \hline
392: \vspace{-10pt} & & & \\
393: $\;$ $0.95$ $\;$ & $\;$ $2.70$ & $10 - 60$ & $-0.995$ \\
394: $\;$ $0.90$ $\;$ & $\;$ $2.70$ & $10 - 60$ & $-0.996$ \\
395: $\;$ $0.85$ $\;$ & $\;$ $2.69$ & $10 - 60$ & $-0.994$ \\
396: $\;$ $0.80$ $\;$ & $\;$ $2.62$ & $10 - 80$ & $-0.996$ \\
397: $\;$ $0.75$ $\;$ & $\;$ $2.52$ & $10 - 80$ & $-0.996$ \\
398: $\;$ $0.70$ $\;$ & $\;$ $2.40$ & $10 - 80$ & $-0.996$ \\
399: $\;$ $0.65$ $\;$ & $\;$ $2.32$ & $10 - 100$ & $-0.997$ \\
400: $\;$ $0.60$ $\;$ & $\;$ $2.21$ & $10 - 100$ & $-0.996$ \\
401: $\;$ $0.55$ $\;$ & $\;$ $2.17$ & $10 - 100$ & $-0.996$ \\
402: $\;$ $0.50$ $\;$ & $\;$ $2.07$ & $10 - 100$ & $-0.997$ \\
403: $\;$ $0.45$ $\;$ & $\;$ $2.01$ & $10 - 100$ & $-0.997$ \\
404: $\;$ $0.40$ $\;$ & $\;$ $2.00$ & $10 - 100$ & $-0.996$ \\
405: $\;$ $0.35$ $\;$ & $\;$ $1.98$ & $10 - 100$ & $-0.997$ \\
406: $\;$ $0.30$ $\;$ & $\;$ $1.98$ & $10 - 100$ & $-0.997$ \\
407: $\;$ $0.25$ $\;$ & $\;$ $2.01$ & $10 - 100$ & $-0.996$ \\ \hline
408: \end{tabular}
    \caption{\label{tab2} \small Fitted values of the exponent
      $\sigma$ of the power law distribution of connected component
      sizes for selected values of $\bar{w}$. For each fit the size
      range and the correlation coefficient are reported.}
413: \end{center}
414: \end{table}
415:
416:
417: \begin{table}[!htb]
418: \begin{center}
419: \begin{tabular}{|c|c|c|ccc|} \hline
420: \vspace{-10pt} & & & &\\
421:
422: $\bar{w}$ & $\avg{z}$ & max $z$ & $\alpha$ & \quad coordination & \quad
423: correlation \\
424:
425: & & & & \quad degree range & \quad coefficient \\
426: \vspace{-10pt} & & & & &\\ \hline
427: \vspace{-10pt} & & & & &\\
428:
429: $0.95$ & $14.4$ & $5735$ & $1.59$ & $25 - 100$ & $-0.990$ \\
430: & & & $1.46$ & $100 - 500$ & $-0.953$ \\ \hline
431:
432: $0.90$ & $73.1$ & $10794$ & $1.58$ & $25 - 100$ & $-0.988$ \\
433: & & & $1.51$ & $100 - 500$ & $-0.939$ \\ \hline
434:
435: $0.85$ & $138.3$ & $16500$ & $1.68$ & $25 - 100$ & $-0.993$ \\
436: & & & $1.42$ & $100 - 800$ & $-0.964$ \\ \hline
437:
438: $0.80$ & $207.2$ & $ 23726$ & $1.73$ & $25 - 100$ & $-0.994$ \\
439: & & & $1.29$ & $100 - 800$ & $-0.941$ \\ \hline
440:
441: $0.75$ & $294.0$ & $33265$ & $1.79$ & $25 - 100$ & $-0.997$ \\
442: & & & $1.22$ & $100 - 1000$ & $-0.956$ \\ \hline
443:
444: $0.70$ & $395.3$ & $35202$ & $1.74$ & $25 - 100$ & $-0.996$ \\
445: & & & $1.28$ & $100 - 1000$ & $-0.946$ \\ \hline
446:
447: $0.65$ & $507.8$ & $36333$ & $1.71$ & $25 - 100$ & $-0.998$ \\
448: & & & $1.39$ & $100 - 1000$ & $-0.950$ \\ \hline
449:
450: $0.60$ & $622.3$ & $37729$ & $1.63$ & $25 - 100$ & $-0.999$ \\
451: & & & $1.32$ & $100 - 1500$ & $-0.930$ \\ \hline
452:
453: $0.55$ & $745.3$ & $41871$ & $1.54$ & $25 - 100$ & $-0.998$ \\
454: & & & $1.44$ & $100 - 1500$ & $-0.927$ \\ \hline
455:
456: $0.50$ & $911.7$ & $49895$ & $1.44$ & $25 - 100$ & $-0.998$ \\
457: & & & $1.56$ & $100 - 2000$ & $-0.944$ \\ \hline
458:
459: $0.45$ & $1108.1$ & $51309$ & $1.38$ & $25 - 100$ & $-0.998$ \\
460: & & & $1.62$ & $100 - 2000$& $-0.951$ \\ \hline
461:
462: $0.40$ & $1314.2$ & $51956$ & $1.28$ & $25 - 100$ & $-0.998$ \\
463: & & & $1.67$ & $100 - 2500$ & $-0.946$ \\ \hline
464:
465: $0.35$ & $1501.9$ & $52513$ & $1.19$ & $25 - 100$ & $-0.998$ \\
466: & & & $1.72$ & $100 - 2500$ & $-0.961$ \\ \hline
467:
468: $0.30$ & $1668.9$ & $60722$ & $1.08$ & $25 - 100$ & $-0.997$ \\
469: & & & $1.74$ & $100 - 3000$ & $-0.969$ \\ \hline
470:
471: $0.25$ & $1826.2$ & $64781$ & $0.97$ & $25 - 100$ & $-0.997$ \\
472: & & & $1.78$ & $100 - 3000$ & $-0.969$ \\ \hline
473: \end{tabular}
  \caption{\label{tab3} \small Fitted values of the exponent $\alpha$ of
    the power law distribution of the coordination degree for selected
    values of $\bar{w}$. We perform two linear fits that differ in the
    fitting range of the coordination degree. For each fit the
    range of coordination degree and the correlation coefficient are
    reported. The second column shows the average degree; the
    third column gives the maximum value of the coordination degree.}
481: \end{center}
482: \end{table}
483:
484: \section{Comparison with generalized random graphs}
485:
It is interesting to compare these behaviours with those of a
model of random graphs. It is well known that, in the classical model
introduced by Erd\H{o}s and R\'enyi \cite{erdos_renyi} (where every
pair of nodes is joined by an edge with probability $p$), all nodes
have the same expected coordination degree, so these graphs are
characterized by a Poissonian coordination
degree distribution with mean value $\avg{z} \sim p V$. Furthermore, as
soon as $\avg{z}$ exceeds $1$, a giant connected
component appears, that is, a component whose size is much greater than
the size of all other components and that contains a significant
fraction of all the nodes of the graph.
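
The threshold at $\avg{z} = 1$ can be illustrated numerically. In the limit of large $V$, a standard result for the classical random graph (quoted here as background, not derived in this work) is that the fraction $S$ of nodes in the giant component solves $S = 1 - e^{-\avg{z} S}$. The Python sketch below, with the hypothetical function name \texttt{giant\_fraction}, solves this equation by fixed-point iteration.

```python
import math

def giant_fraction(mean_degree, iterations=500):
    """Fixed point of S = 1 - exp(-<z> S), iterated from S = 1."""
    s = 1.0
    for _ in range(iterations):
        s = 1.0 - math.exp(-mean_degree * s)
    return s

below = giant_fraction(0.5)   # mean degree below the threshold
above = giant_fraction(2.0)   # mean degree above the threshold
```

Below the threshold the iteration collapses to $S = 0$, while above it a finite fraction of the nodes belongs to the giant component.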
497:
A better theoretical comparison model is provided by
generalized random graphs endowed with a specific degree distribution.
These can be generated via a Monte Carlo algorithm (following the
work of Burda et al.~\cite{burda}). In particular, starting from a
random graph with $V$ nodes and $E$ edges, we make local graph
transformations which leave the number of nodes and the number of
edges constant and accept them with a probability which depends on
the desired equilibrium degree distribution (Metropolis algorithm). In
this way we have generated a collection of random graphs with the same
coordination degree distribution and the same average degree as some
of our protein sequence graphs.
509:
In each case we observe fundamentally different
distributions of connected components in the protein sequence graphs
and in the random graphs. In the latter the power law behaviour
is absent, while there is always a dominant giant connected
component, much larger than the many other small components, whose
size distribution decreases exponentially (see Figure~\ref{fig7}).
516:
517: \begin{figure}[!htb]
518: \includegraphics[height=0.36\textwidth,angle=270]{f7.eps}
  \caption{\label{fig7}{\small Top: coordination degree distribution
      of the collection of random graphs generated via the Monte Carlo
      algorithm, fixing the equilibrium degree distribution equal to
      that observed in the protein sequence graph at $\bar{w} =
      0.99$ and the average degree equal to $\avg{z} = 0.57$.
      Bottom: size distribution of the connected components of the
      random graphs.}}
526: \end{figure}
527:
By comparison, in the Simap protein sequence space the coordination
degree distribution $f_{\bar w}(z)$ and the connected component
distribution $n_{\bar w}(s)$ are strongly correlated. The former, for
example, can be reproduced quite well by means of $n_{\bar w}(s)$. Let
the index $i$ label the connected components, and suppose that all
possible edges between the nodes of a connected component of
size $s_i$ were present; then the cluster would be a complete subgraph
and all its $s_i$ nodes would have coordination degree $z_i = s_i-1$. If
this were true for all connected components, then all clusters would be
complete subgraphs and we would expect a coordination degree
distribution $f_{\bar w}(z) \sim ( s \; n_{\bar w}(s) ) |_{s
  = z + 1} $. In our graphs, although complete connected components
are present, the majority of clusters only have a high average degree,
not equal to the size minus one as in complete graphs.
However, consider a component with size $s_i$ and a number of
edges equal to $m_i$; the quantity $\Delta_i = \frac{2 m_i}{s_i
  (s_i-1)}$ represents the fraction of edges that are present in the
$i$-th component with respect to the number of edges that would be
present if the component were a complete subgraph (i.e. $s_i (s_i-1)/2$).
Introducing $\Delta_i$ as a measure of the edge density of each
component, we can still approximate the coordination degree distribution
$f_{\bar w}(z)$ by means of the component size distribution
$n_{\bar w}(s)$. Specifically, we find that the coordination degree
distribution behaves like $f_{\bar w}(z) \sim \bar{\Delta}(z+1) \;
(z+1) \; n_{\bar w}(z+1) $, where $\bar{\Delta}(s)$ is the edge
density averaged over all components of size $s$: $\bar{\Delta}(s) =
\frac{\sum_{i} \delta_{s_i, s} \Delta_i}{\sum_{i} \delta_{s_i, s}}$.
Figure~\ref{fig8} shows both the observed degree distribution and the
approximated degree distribution obtained by means of $n(s)$ for the
graph at $\bar{w} = 0.95$.
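
The approximation $f_{\bar w}(z) \sim \bar{\Delta}(z+1)\,(z+1)\,n_{\bar w}(z+1)$ can be written as a few lines of Python. The sketch below is only illustrative: the function name \texttt{approx\_degree\_distribution} and the four toy components, given as pairs (size, number of edges), are hypothetical.

```python
from collections import defaultdict

def approx_degree_distribution(components):
    """Approximate f(z) from a list of (size, n_edges) per component.

    Returns {z: mean_density(z+1) * (z+1) * n(z+1)} for sizes >= 2,
    where the density of a component is 2*m / (s*(s-1)).
    """
    density_sum = defaultdict(float)
    n = defaultdict(int)
    for size, m in components:
        if size < 2:
            continue                       # isolated nodes have no edges
        density_sum[size] += 2.0 * m / (size * (size - 1))
        n[size] += 1
    return {s - 1: (density_sum[s] / n[s]) * s * n[s] for s in n}

# Toy clusters: a triangle, a 3-node path, a complete 4-node cluster
# and one isolated node.
components = [(3, 3), (3, 2), (4, 6), (1, 0)]
f = approx_degree_distribution(components)
```

For the two clusters of size three the average density is $5/6$, so the approximation assigns $5/6 \cdot 3 \cdot 2 = 5$ nodes to $z = 2$, whereas four nodes actually have that degree; the estimate becomes exact only when all clusters of a given size are complete.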
558:
559: \begin{figure}[!htb]
560: \includegraphics[height=0.36\textwidth,angle=270]{f8.eps}
561: \caption{\label{fig8}{\small Observed degree distribution (black
562: curve) and the approximated degree distribution (red curve)
563: obtained by means of $n(s)$ of the graph at $\bar{w} = 0.95$.}}
564: \end{figure}
565:
566: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
567: \section{Giant component}
568:
569: \begin{figure}[!htb]
570: \includegraphics[height=0.70\textwidth]{f9.eps}
571: \caption{\label{fig9}{\small (a) Fraction of nodes belonging to the
572: largest cluster for each value of $\bar{w}$. (b) Fraction of
573: species present in the largest cluster for each value of
574: $\bar{w}$.}}
575: \end{figure}
576:
An interesting phenomenon occurs as $\bar{w}$ decreases: we see
the formation of a giant component. Figure~\ref{fig9}a shows the
behaviour of the fraction of nodes belonging to the largest component.
581:
582: \begin{table*}[!htb]
583: \begin{center}
584: \begin{tabular}{|c|c|c|c|c|c|c|c|c|}\hline
585: \vspace{-7pt} & & & & & & & & \\
586:
587: $\bar{w}$ & $\bar{d}$ & size & bacteria & viruses & plants &
588: invertebrates & vertebrates & number of different species \\
589:
590: \vspace{-7pt} & & & & & & & & \\ \hline
591:
592: $0.975$ & $0.1581$ & $8322$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $4$ \\
593: $0.950$ & $0.2236$ & $15955$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $4$ \\
594: $0.925$ & $0.2739$ & $47687$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $10$ \\
595: $0.900$ & $0.3162$ & $50729$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $14$ \\
596: $0.875$ & $0.3536$ & $51028$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $14$ \\
597: $0.850$ & $0.3873$ & $51405$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $14$ \\
598: $0.825$ & $0.4183$ & $51969$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $29$ \\
599: $0.800$ & $0.4472$ & $52097$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $29$ \\
600: $0.775$ & $0.4743$ & $52881$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $29$ \\
601: $0.750$ & $0.5000$ & $63003$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $60$ \\
602: $0.725$ & $0.5244$ & $118777$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $67$ \\
603: $0.700$ & $0.5477$ & $120974$ & $0.000$ & $0.999$ & $0.000$ & $0.000$ & $0.000$ & $106$ \\
604: $0.675$ & $0.5701$ & $145278$ & $0.002$ & $0.997$ & $0.000$ & $0.000$ & $0.000$ & $302$ \\
605: $0.650$ & $0.5916$ & $224310$ & $0.002$ & $0.749$ & $0.001$ & $0.000$ & $0.248$ & $988$ \\
606: $0.625$ & $0.6124$ & $272426$ & $0.014$ & $0.662$ & $0.010$ & $0.007$ & $0.306$ & $4384$ \\
607: $0.600$ & $0.6325$ & $297280$ & $0.028$ & $0.643$ & $0.015$ & $0.011$ & $0.303$ & $7854$ \\
608: $0.575$ & $0.6519$ & $318472$ & $0.032$ & $0.613$ & $0.027$ & $0.015$ & $0.313$ & $9668$ \\
609: $0.550$ & $0.6708$ & $362379$ & $0.047$ & $0.554$ & $0.035$ & $0.024$ & $0.341$ & $11437$ \\
610: $0.525$ & $0.6892$ & $404788$ & $0.049$ & $0.526$ & $0.047$ & $0.029$ & $0.349$ & $15593$ \\
611: $0.500$ & $0.7071$ & $450072$ & $0.065$ & $0.482$ & $0.055$ & $0.033$ & $0.365$ & $16272$ \\
612: $0.475$ & $0.7246$ & $584371$ & $0.084$ & $0.379$ & $0.151$ & $0.037$ & $0.349$ & $20957$ \\
613: $0.450$ & $0.7416$ & $718286$ & $0.114$ & $0.312$ & $0.194$ & $0.041$ & $0.340$ & $35346$ \\
614: $0.425$ & $0.7583$ & $975629$ & $0.151$ & $0.229$ & $0.184$ & $0.095$ & $0.341$ & $68338$ \\
615: $0.400$ & $0.7746$ & $1202753$ & $0.181$ & $0.188$ & $0.209$ & $0.096$ & $0.326$ & $76230$ \\
616: $0.375$ & $0.7906$ & $1435734$ & $0.210$ & $0.160$ & $0.224$ & $0.093$ & $0.312$ & $77970$ \\
617: $0.350$ & $0.8062$ & $1739772$ & $0.254$ & $0.133$ & $0.236$ & $0.087$ & $0.291$ & $80100$ \\
618: $0.325$ & $0.8216$ & $2059217$ & $0.288$ & $0.117$ & $0.239$ & $0.083$ & $0.273$ & $82714$ \\
619: $0.300$ & $0.8367$ & $2383804$ & $0.316$ & $0.102$ & $0.244$ & $0.080$ & $0.258$ & $84953$ \\
620: $0.275$ & $0.8515$ & $2728214$ & $0.350$ & $0.090$ & $0.243$ & $0.078$ & $0.239$ & $86151$ \\
621: $0.250$ & $0.8660$ & $3071192$ & $0.374$ & $0.083$ & $0.240$ & $0.076$ & $0.226$ & $90357$ \\
622: $0.225$ & $0.8803$ & $3420697$ & $0.396$ & $0.078$ & $0.239$ & $0.074$ & $0.213$ & $94210$ \\
623: $0.200$ & $0.8944$ & $3807556$ & $0.416$ & $0.076$ & $0.237$ & $0.073$ & $0.199$ & $101358$ \\
624: $0.175$ & $0.9083$ & $4210208$ & $0.432$ & $0.074$ & $0.234$ & $0.072$ & $0.188$ & $102774$ \\
625: $0.150$ & $0.9220$ & $4651704$ & $0.446$ & $0.072$ & $0.233$ & $0.073$ & $0.177$ & $103831$ \\
626: $0.125$ & $0.9354$ & $5049016$ & $0.455$ & $0.069$ & $0.235$ & $0.073$ & $0.167$ & $104227$ \\ \hline
627:
628: \end{tabular}
\caption{\label{tab4} \small For each fixed value of $\bar{w}$, the
size of the largest component, the fraction of its proteins that come
from each of the five kingdoms, and the number of different species
present in it.}
632: \end{center}
633: \end{table*}
634:
Starting from approximately $\bar{w}\sim 0.65$, the largest component
begins to grow rapidly by absorbing many smaller components. The
components that are still disconnected at $\bar{w} \sim 0.675$ and
merge into the giant component at $\bar{w}\sim 0.65$ span many
different sizes, from small to very large. This phenomenon becomes
more and more evident for lower values of $\bar{w}$, where the
coordination degree distribution of the giant component follows a
power law. The same can be seen in Figure~\ref{fig6b}, where we plot
the distribution of the coordination degree for the whole set of
proteins. The exponent $\alpha(\bar{w})$ of the power law
$f_{\bar w}(z) \sim z^{-\alpha(\bar w)}$ varies slightly between the
region of small coordination degree $z$ and the region of large $z$.
Clearly, when a giant component exists, the large-$z$ region is
largely determined by the giant component itself. In
Table~\ref{tab3} we report the fitted values of the exponent
$\alpha(\bar{w})$ in the two regions of small and large $z$. As
$\bar w$ decreases, the two fitted values of $\alpha(\bar w)$ move
further and further apart: since the largest component is growing, the
tail of the distribution $f_{\bar w}(z)$ carries more and more weight
and follows a power law with a different exponent.
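
To make the two-region fit concrete, note that taking logarithms of
the power law gives a straight line,
\begin{equation*}
\log f_{\bar w}(z) = c(\bar w) - \alpha(\bar w)\,\log z ,
\end{equation*}
so that $\alpha(\bar w)$ is the negative slope in a log-log plot. One
simple way to obtain two exponents, given here only as a sketch and
not necessarily the procedure behind Table~\ref{tab3}, is to fit this
line separately for $z < z^*$ and for $z > z^*$. The constant
$c(\bar w)$ and the crossover value $z^*$ are symbols introduced here
for illustration only.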
657:
The rapid growth of the largest component is accompanied by a
significant change in its composition. In Table~\ref{tab4} we show,
for each $\bar{w}$, the fraction of proteins from each kingdom and the
number of different species that appear in the largest connected
component. Down to around $\bar{w} = 0.675$ the largest component
consists almost exclusively of viral proteins and, moreover, it has
not yet become giant with respect to the smaller clusters. For
$\bar{w} \lesssim 0.675$ the formation of a giant component begins,
and simultaneously all kingdoms enter the composition of the giant
cluster. This is also evident from Figure~\ref{fig9}b, where we plot
the fraction of species belonging to the largest component: this ratio
increases rapidly around the same value of $\bar w$. These processes
continue for lower values of $\bar{w}$: the giant component includes
more and more proteins belonging to many different species, and the
fraction of each kingdom tends towards its fraction in the whole
database. Furthermore, around $\bar w \simeq 0.475$ there is a very
sharp increase both in the size of the giant component and,
especially, in the number of species present in it, as is evident from
Figures~\ref{fig9}a and~\ref{fig9}b.
677:
The processes just described may indicate the presence of a phase
transition between two different phases: one for large values of
$\bar w$, characterized by clusters of similar size, the largest of
which is composed mainly of viral proteins, and a second one
characterized by a giant component containing proteins of many
different species alongside other, much smaller clusters. We note,
however, that the transition is not sharp: the changes in the size and
composition of the largest component are spread over the range
$0.475 < \bar w < 0.675$. We also note that the plot in
Figure~\ref{fig9}b increases very rapidly for $\bar w \sim 0.475$.
688:
689: \begin{table*}[!htb]
690: \begin{center}
691: \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|} \hline
692:
693: $\bar{w}$ & $0.95$ & $0.90$ & $0.85$ & $0.80$ &
694: $0.75$ & $0.70$ & $0.65$ & $0.60$ & $0.55$ &
695: $0.50$ & $0.45$ & $0.35$ & $0.25$ & $0.15$ \\ \hline
696:
697: bacteria & $9.6$ & $12.2$ & $14.2$ & $17.2$ & $21.9$ & $22.6$
698: & $23.6$ & $23.8$ & $23.9$ & $25.1$ & $25.8$ & $29.0$ & $35.6$
699: & $57.0$ \\
700:
701: viruses & $32.7$ & $31.4$ & $24.3$ & $17.6$ & $11.4$ & $7.4$ &
702: $5.2$ & $3.8$ & $2.9$ & $2.7$ & $2.4$ & $2.7$ & $4.2$ & $7.5$
703: \\
704:
705: plants & $9.3$ & $10.8$ & $11.4$ & $9.4$ & $8.3$ & $7.3$ &
706: $7.6$ & $7.8$ & $7.7$ & $7.5$ & $7.5$ & $6.2$ & $4.0$ & $0.0$
707: \\
708:
709: invertebrates & $11.6$ & $8.9$ & $7.4$ & $5.8$ & $3.6$ & $3.2$
710: & $2.5$ & $2.0$ & $1.6$ & $1.5$ & $1.2$ & $1.4$ & $1.3$ &
711: $1.1$ \\
712:
713: vertebrates & $22.9$ & $23.0$ & $25.4$ & $25.7$ & $25.6$ &
714: $25.9$ & $23.6$ & $20.0$ & $17.1$ & $13.0$ & $10.2$ & $5.2$ &
715: $2.8$ & $1.1$ \\
716:
717: bac-vir & $2.7$ & $2.2$ & $2.1$ & $2.1$ & $1.6$ & $1.6$ &
718: $1.4$ & $1.0$ & $1.0$ & $1.1$ & $1.0$ & $1.7$ & $2.4$ & $3.2$
719: \\
720:
721: bac-pla & $1.6$ & $1.8$ & $2.8$ & $2.9$ & $3.5$ & $4.5$ &
722: $5.9$ & $7.0$ & $8.5$ & $8.9$ & $9.1$ & $10.8$ & $11.3$ &
723: $18.3$ \\
724:
725: bac-inv & $0.5$ & $0.4$ & $0.7$ & $0.7$ & $0.8$ & $0.9$ &
726: $1.3$ & $1.7$ & $2.1$ & $2.1$ & $2.0$ & $2.6$ & $3.0$ & $1.1$
727: \\
728:
729: bac-ver & $1.8$ & $2.0$ & $2.4$ & $2.3$ & $1.9$ & $1.9$ &
730: $1.8$ & $1.6$ & $1.5$ & $1.5$ & $1.3$ & $1.1$ & $1.1$ & $1.1$
731: \\
732:
733: vir-pla & $0.2$ & $0.1$ & $0.2$ & $0.4$ & $0.3$ & $0.4$ &
734: $0.3$ & $0.3$ & $0.2$ & $0.2$ & $0.2$ & $0.2$ & $0.5$ & $0.0$
735: \\
736:
737: vir-inv & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ &
738: $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$
739: \\
740:
741: vir-ver & $0.2$ & $0.5$ & $0.7$ & $0.8$ & $0.9$ & $0.7$ &
742: $0.6$ & $0.4$ & $0.3$ & $0.2$ & $0.1$ & $0.2$ & $0.1$ & $0.0$
743: \\
744:
745: pla-inv & $0.9$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.2$ &
746: $0.1$ & $0.1$ & $0.1$ & $0.2$ & $0.3$ & $0.2$ & $0.5$ & $0.0$
747: \\
748:
749: pla-ver & $0.5$ & $0.9$ & $0.8$ & $1.1$ & $1.3$ & $1.0$ &
750: $1.1$ & $1.2$ & $1.2$ & $1.0$ & $0.9$ & $1.3$ & $1.7$ & $1.1$
751: \\
752:
753: inv-ver & $0.5$ & $1.1$ & $2.6$ & $4.5$ & $7.0$ & $8.4$ &
754: $9.2$ & $10.3$ & $10.9$ & $11.2$ & $11.0$ & $9.0$ & $5.5$ &
755: $0.0$ \\
756:
757: bac-vir-pla & $0.0$ & $0.4$ & $0.3$ & $0.5$ & $0.3$ & $0.3$ &
758: $0.4$ & $0.2$ & $0.2$ & $0.2$ & $0.4$ & $0.4$ & $0.7$ & $1.1$
759: \\
760:
761: bac-vir-inv & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ &
762: $0.0$ & $0.0$ & $0.1$ & $0.0$ & $0.1$ & $0.1$ & $0.3$ & $1.1$
763: \\
764:
765: bac-vir-ver & $0.2$ & $0.1$ & $0.0$ & $0.0$ & $0.0$ & $0.1$ &
766: $0.1$ & $0.1$ & $0.1$ & $0.1$ & $0.2$ & $0.2$ & $0.2$ & $0.0$
767: \\
768:
769: bac-pla-inv & $0.0$ & $0.0$ & $0.1$ & $0.2$ & $0.5$ & $0.6$ &
770: $0.8$ & $0.9$ & $1.3$ & $2.0$ & $2.3$ & $2.4$ & $3.1$ & $1.1$
771: \\
772:
773: bac-pla-ver & $0.0$ & $0.1$ & $0.0$ & $0.0$ & $0.1$ & $0.3$ &
774: $0.6$ & $0.6$ & $0.9$ & $1.0$ & $1.3$ & $1.7$ & $1.4$ & $0.0$
775: \\
776:
777: bac-inv-ver & $0.0$ & $0.0$ & $0.1$ & $0.3$ & $0.4$ & $0.4$ &
778: $0.4$ & $0.9$ & $0.8$ & $0.9$ & $0.9$ & $1.0$ & $0.8$ & $1.1$
779: \\
780:
781: vir-pla-inv & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ &
782: $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$
783: \\
784:
785: vir-pla-ver & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.1$ & $0.1$ &
786: $0.1$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$
787: \\
788:
789: vir-inv-ver & $0.0$ & $0.0$ & $0.1$ & $0.2$ & $0.1$ & $0.2$ &
790: $0.3$ & $0.2$ & $0.2$ & $0.2$ & $0.2$ & $0.1$ & $0.1$ & $0.0$
791: \\
792:
793: pla-inv-ver & $0.9$ & $1.4$ & $1.8$ & $5.5$ & $7.3$ & $8.4$ &
794: $9.4$ & $11.0$ & $11.3$ & $12.0$ & $12.4$ & $13.4$ & $11.7$ &
795: $0.0$ \\
796:
797: bac-vir-pla-inv & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ &
798: $0.0$ & $0.0$ & $0.1$ & $0.1$ & $0.1$ & $0.1$ & $0.2$ & $0.0$
799: & $0.0$ \\
800:
801: bac-vir-pla-ver & $0.0$ & $0.0$ & $0.1$ & $0.1$ & $0.1$ &
802: $0.1$ & $0.2$ & $0.1$ & $0.1$ & $0.1$ & $0.1$ & $0.1$ & $0.1$
803: & $0.0$ \\
804:
805: bac-vir-inv-ver & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ &
806: $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.1$ & $0.1$
807: & $0.0$ \\
808:
809: bac-pla-inv-ver & $0.2$ & $0.1$ & $0.4$ & $0.7$ & $1.0$ &
810: $2.1$ & $2.5$ & $3.8$ & $5.1$ & $6.4$ & $8.0$ & $7.6$ & $6.7$
811: & $0.0$ \\
812:
813: vir-pla-inv-ver & $0.0$ & $0.1$ & $0.0$ & $0.0$ & $0.1$ &
814: $0.1$ & $0.2$ & $0.3$ & $0.3$ & $0.3$ & $0.2$ & $0.2$ & $0.1$
815: & $1.1$ \\
816:
817: bac-vir-pla-inv-ver & $0.0$ & $0.0$ & $0.1$ & $0.1$ & $0.1$ &
818: $0.2$ & $0.2$ & $0.1$ & $0.2$ & $0.3$ & $0.5$ & $0.7$ & $0.4$
819: & $1.1$ \\ \hline
820:
821: \end{tabular}
\caption{\label{tab5} \small Spread of kingdoms in connected
components. Each value is the percentage of clusters, computed over
clusters of size greater than $90$, composed of proteins coming from
only one kingdom, from only a pair of kingdoms, etc., up to the
percentage of clusters composed of proteins of all kingdoms.}
828: \end{center}
829: \end{table*}
830:
Table~\ref{tab5} shows, for each $\bar{w}$, how the different kingdoms
are distributed among the connected components. In particular, we
consider the components of size greater than $90$ and record the
percentage of clusters whose proteins come from species of only one
kingdom, from only a pair of kingdoms, etc., up to the percentage of
connected components that contain proteins of all kingdoms. For high
values of $\bar{w}$ the majority of clusters are made up of proteins
belonging to only one kingdom, in particular the viruses; clusters
with proteins of different kingdoms are very scarce. As expected, as
$\bar{w}$ decreases, the percentage of clusters belonging to only one
kingdom decreases in favor of clusters of mixed kingdom composition.
843:
It is interesting to note that viral proteins have a very low tendency
to cluster with those of the other kingdoms, in particular with plants
and animals. Furthermore, for no value of $\bar{w}$ do we see
components (of size greater than $90$) made up of proteins from
viruses and invertebrates only, or from viruses, plants and
invertebrates only. Virus proteins cluster mainly with bacterial
proteins. In addition, we observe that bacterial proteins cluster
mainly with plant proteins and vice versa. Moreover, although plant
proteins cluster infrequently with invertebrate proteins alone or with
vertebrate proteins alone, there are many more clusters containing
plant, invertebrate and vertebrate proteins simultaneously. Finally,
we note that at the lowest value of $\bar{w}$ the majority of the
components not included in the giant component consist of bacterial
proteins, of bacterial and plant proteins, or of virus proteins.
858:
859: \section{Analysis of the proteins that connect clusters}
860:
861: \begin{figure}[!htb]
862: \begin{center}
863: \vspace{-0.4cm}
864: \subfigure[]{\label{fig10a}
865: \includegraphics[width=0.48\textwidth]{f10a.eps}
866: } \vspace{-0.4cm}
867: \subfigure[]{\label{fig10b}
868: \includegraphics[width=0.48\textwidth]{f10b.eps}
869: }
870: \caption{{\small Length representation of (a) proteins joining
871: generic clusters and of (b) proteins joining the largest
872: cluster. The red color encodes overrepresented lengths; the
873: blue color indicates underrepresented lengths.}}
874: \end{center}
875: \end{figure}
876:
877:
878: \begin{figure}[!htb]
879: \begin{center}
880: \vspace{-0.4cm}
881: \subfigure[\label{fig11a}]{
882: \includegraphics[width=0.48\textwidth]{f11a.eps}
883: }\vspace{-0.4cm}
884: \subfigure[\label{fig11b}]{
885: \includegraphics[width=0.48\textwidth]{f11b.eps}
886: }
887: \caption{{\small Representation of the low complexity
888: content of (a) proteins joining generic clusters and of (b)
889: proteins joining the largest cluster. The red color encodes
890: overrepresented values; the blue color indicates
891: underrepresented values. }}
892: \end{center}
893: \end{figure}
894:
895: \begin{figure}[!htb]
896: \begin{center}
897: \vspace{-0.4cm}
898: \subfigure[\label{fig12a}]{
899: \includegraphics[width=0.48\textwidth]{f12a.eps}
900: }\vspace{-0.4cm}
901: \subfigure[\label{fig12b}]{
902: \includegraphics[width=0.48\textwidth]{f12b.eps}
903: }
904: \caption{{\small Representation of the isoelectric
905: points of (a) proteins joining generic clusters and of (b)
906: proteins joining the largest cluster. The red color encodes
907: overrepresented values; the blue color indicates
908: underrepresented values.}}
909: \end{center}
910: \end{figure}
911:
912: \begin{figure}[!htb]
913: \begin{center}
914: \vspace{-0.4cm}
915: \subfigure[\label{fig13a}]{
916: \includegraphics[width=0.48\textwidth]{f13a.eps}
917: }\vspace{-0.4cm}
918: \subfigure[\label{fig13b}]{
919: \includegraphics[width=0.48\textwidth]{f13b.eps}
920: }
921: \caption{{\small Representation of the predicted number
922: of transmembrane helices of (a) proteins joining generic
923: clusters and of (b) proteins joining the largest cluster. The
924: red color encodes overrepresented values; the blue color
925: indicates underrepresented values.}}
926: \end{center}
927: \end{figure}
928:
929:
930: \begin{figure}[!htb]
931: \begin{center}
932: \vspace{-0.4cm}
933: \subfigure[\label{fig14a}]{
934: \includegraphics[width=0.48\textwidth]{f14a.eps}
935: }\vspace{-0.4cm}
936: \subfigure[\label{fig14b}]{
937: \includegraphics[width=0.48\textwidth]{f14b.eps}
938: }
939: \caption{{\small Representation of the predicted signal peptides and
940: protein localization signals of (a) proteins joining generic
941: clusters and of (b) proteins joining the largest cluster. The
942: red color encodes overrepresented values; the blue color
943: indicates underrepresented values. }}
944: \end{center}
945: \end{figure}
946:
947: \begin{figure}[!htb]
948: \begin{center}
949: \vspace{-0.4cm}
950: \subfigure[\label{fig15a}]{
951: \includegraphics[width=0.48\textwidth]{f15a.eps}
952: }\vspace{-0.4cm}
953: \subfigure[\label{fig15b}]{
954: \includegraphics[width=0.48\textwidth]{f15b.eps}
955: }
\caption{{\small Representation of the predicted protein domains of
(a) proteins joining generic clusters and of (b) proteins joining the
largest cluster. Each line in the graph denotes a certain
domain. The red color encodes overrepresented values; the blue
color indicates underrepresented values.}}
961: \end{center}
962: \end{figure}
963:
964:
Protein pairs that connect clusters in the different weight intervals
are of special interest, as they harbor the most conserved sequence
regions shared by the interconnected clusters. We want to know
whether certain sequence features and protein domains are enriched in
these proteins compared to the complete proteome. Therefore we have
calculated the following sequence features for all proteins contained
in SIMAP: \e{length}, \e{isoelectric point} (using the EMBOSS sequence
analysis package \cite{emboss}), \e{low complexity content} (using the
program seg \cite{segprog}) and the number of \e{predicted
transmembrane segments} (using the program TMHMM \cite{tmhmmprog}).
Additionally, in order to derive functional information, we have
predicted \e{signal peptides} (using SignalP 3.0 \cite{signalPprog}),
\e{localization signals} (using TargetP 1.1 \cite{targetPprog}) and
\e{protein domains} (using the databases PFAM, TIGRFAM, PANTHER,
SUPERFAMILY, SMART and PIRSF from InterPro 12.1 \cite{pddb}) for all
SIMAP proteins.
981:
For all weight intervals we have counted the feature occurrences in
the proteins that connect clusters. These proteins are those forming
pairs of sequences that belong to different clusters in the graph
built at $\bar{w}_1$ and to the same cluster in the graph built at
$\bar{w}_2$, where $\bar{w}_2<\bar{w}_1$ are two consecutive values of
the weight $\bar{w}$. We have also distinguished between two disjoint
sets of these proteins: proteins linking the clusters that form the
largest cluster in the graph built at $\bar{w}_2$, and proteins
linking the other, generic clusters.
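
In set notation, the selection just described can be sketched as
follows. The symbols $C_{\bar{w}}(p)$, for the connected component
containing protein $p$ in the graph built at $\bar{w}$, and
$P(\bar{w}_1,\bar{w}_2)$ are introduced here only for illustration:
\begin{align*}
P(\bar{w}_1,\bar{w}_2) = \bigl\{ (p,q) :\ & C_{\bar{w}_1}(p) \neq C_{\bar{w}_1}(q), \\
 & C_{\bar{w}_2}(p) = C_{\bar{w}_2}(q) \bigr\} .
\end{align*}
The two disjoint subsets then correspond to the pairs for which
$C_{\bar{w}_2}(p)$ is the largest component of the graph built at
$\bar{w}_2$, and to the pairs for which it is any other component.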
991:
The enrichment ($e$) of a feature was calculated as the ratio of the
number of occurrences found ($k$) to the number expected ($k_E$):
$e = k/k_E$. The expected number was calculated as $k_E = K\,n/V$,
where $n$ is the number of proteins of interest (e.g. those connecting
clusters in a given weight interval), $K$ denotes the number of
proteins used for clustering that have the given feature and $V$ is
the total number of proteins used for clustering.
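
As a purely illustrative example with hypothetical numbers (not taken
from SIMAP), suppose that $V = 5\times 10^6$ proteins were used for
clustering, that $K = 10^3$ of them carry a given domain, and that
$n = 10^4$ proteins connect clusters in a given weight interval. Then
\begin{equation*}
k_E = \frac{K\,n}{V} = \frac{10^3 \times 10^4}{5\times 10^6} = 2 ,
\end{equation*}
so that observing $k = 30$ connecting proteins with this domain would
give $e = k/k_E = 15$, a fifteen-fold overrepresentation, whereas
$k = 1$ would give $e = 0.5$. Values $e > 1$ thus indicate
overrepresentation and values $e < 1$ underrepresentation.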
999:
1000: \subsection{Results}
1001:
Proteins joining clusters outside the largest cluster show an
overrepresentation of lengths around 400\,aa (Figure~\ref{fig10a}),
are enriched in proteins with small low complexity content
(Figure~\ref{fig11a}), are often neutral or weakly acidic
(Figure~\ref{fig12a}) and contain more transmembrane proteins than
expected (Figure~\ref{fig13a}). Proteins joining clusters in the
giant component are characterized by short and very long lengths
(Figure~\ref{fig10b}), reduced low complexity content
(Figure~\ref{fig11b}), acidic or alkaline isoelectric points,
depending on the weight interval (Figure~\ref{fig12b}), and a high
number of transmembrane domains in the lower weight intervals
(Figure~\ref{fig13b}). Signal peptides were found to be
overrepresented in proteins joining clusters outside the largest
component at the lower weight intervals; at higher weight intervals,
and in proteins joining clusters in the largest component, they were
found to be underrepresented, as were localization signals in all
proteins joining clusters (Figures~\ref{fig14a} and~\ref{fig14b}).
For all considered weight intervals we found interval-specific
overrepresented and underrepresented protein domains
(Figures~\ref{fig15a} and~\ref{fig15b}). Remarkably, these domains
are not only specific to a certain weight interval, but also differ
between proteins joining clusters outside the largest component and
proteins joining clusters in the largest component (see
Table~\ref{tab6}).
1025:
1026: \subsection{Discussion}
1027:
1028: All of the analyzed sequence features indicate that proteins that join
1029: clusters at a certain weight interval are not distributed equally over
1030: the complete protein space. For all of the features we could find
1031: specific under- and over-representation. Proteins joining clusters
1032: outside the largest component and proteins joining clusters in the
1033: largest component are different with respect to almost all considered
1034: features, which indicates that the largest component contains proteins
1035: that are different from those contained in other large clusters. These
1036: findings are complemented by the observation of specific over- and
1037: underrepresented functional domains in the proteins connecting
1038: clusters at certain weight intervals. Thus we conclude that for each
1039: weight interval a small number of protein families is responsible for
1040: cluster interconnections.
1041:
1042: %%%%%%%%%%%%%%%%%%%%%%
1043: \section{Conclusions}
1044:
We investigated the local and global properties of the sequence
similarity space formed by all proteins in the SIMAP database, which
contains more than $5.5$ million amino acid sequences. We represented
this space as a graph whose vertices are proteins and whose edges are
weighted to reflect the similarity between the corresponding pairs of
sequences (high weight, high similarity). The choice of the weight
formula (\ref{eq:weight}) came from the need to compare similarity
scores between pairs of sequences that may have different lengths.
The SW score was therefore modified by means of the geometric mean of
the self-scores, which carries the length information of the two
aligned sequences.
1056:
Then, keeping only edges with $w \geq \bar w$, we built a collection
of graphs by varying $\bar w$. From the analysis of the connected
components we found that these graphs do not belong to the class of
random graphs; rather, they are characterized by power law behaviour
both in the cluster size distribution and in the coordination degree
distribution, and for each fixed $\bar w$ these two distributions are
strongly related to each other.
1064:
As $\bar w$ varies, we found interesting changes in the global
organization of the protein homology networks. We observed two
different phases: one for large values of $\bar w$, characterized by
clusters of similar size, each composed essentially of proteins
belonging to only one kingdom and with the largest one composed mainly
of viral proteins, and a second phase, for lower values of $\bar w$,
characterized by a giant component containing proteins of many
different species alongside other very small clusters.
1074:
Finally, we investigated sequence features and functional information
of the protein pairs that are responsible for connecting clusters in
the different intervals of $\bar w$, since they harbor the most
conserved sequence regions shared by the interconnected clusters. We
found that proteins joining clusters outside the largest component and
proteins joining clusters in the largest component differ with respect
to almost all considered features, which indicates that the largest
component contains proteins that are different from those contained in
other large clusters. Indeed, we found an overrepresentation of a
small set of domains, which shows that a small number of protein
families is responsible for cluster interconnections.
1087:
The analysis we performed gives a first view of the global
organization of the largest protein homology network built so far.
It is a first step and a starting point for addressing other
interesting global or local questions, which could confirm that the
protein homology network is structured with respect to functional and
evolutionary properties.
1094:
1095:
1096: %%%%%%%%%%%%%%%%%%%%%%%%%%%
1097: \section{Acknowledgements}
The authors thank Claudio Destri, Roland Arnold and Mattia Pelizzola
1099: for useful discussions, Michele Caselle for encouraging our
1100: collaboration and Patrick Tischler, Jan Krumsiek and Benedikt
1101: Wachinger for providing the software for protein feature calculation.
1102:
1103: \newpage
1104: \begin{table*}[!htb]
1105: \begin{center}
1106: \begin{tabular}{|c|cc|cc|} \hline
1107:
$\bar{w}_1 \to \bar{w}_2$ & $e$ & \hspace{-0pt} Proteins joining
generic clusters & $e$ & \hspace{-0pt} Proteins joining the largest cluster \\
1110: \hline
1111:
1112: & $0.02$ & PF00598 Flu\_M1 & $0.93$ & PF00078 RVT\_1 \\
1113: & $0.03$ & PF00522 VPR & $1.08$ & PF00075 RnaseH \\
1114: & $0.03$ & PF00540 Gag\_p17 & $1.44$ & PF06815 RVT\_connect \\
1115: & $0.03$ & PF00951 Arteri\_Gl & $1.46$ & PF07075 DUF1343 \\
1116: & $0.03$ & PF00971 EIAV\_GP90 & $2.19$ & PF00665 rve \\
1117: $0.750$ $\to$ $0.725$ & & & & \\
1118: & $9.40$ & PF02916 DNA\_PPF & $15.41$ & PF00607 Gag\_p24 \\
1119: & $11.09$ & PF07095 IgaA & $18.79$ & PF00517 GP41 \\
1120: & $11.25$ & PF08272 Topo\_Zn\_Ribbon & $18.91$ & PF02022 Integrase\_Zn \\
1121: & $11.83$ & PF06899 WzyE & $27.07$ & PF00540 Gag\_p17 \\
1122: & $12.46$ & PF06788 UPF0257 & $137.49$ & PF00516 GP120 \\ \hline
1123:
1124: & & & $0.88$ & PF00078 RVT\_1 \\
1125: & & & $1.16$ & PF00077 RVP \\
1126: & & & $1.91$ & PF06817 RVT\_thumb \\
1127: & & & $3.68$ & PF00075 RnaseH \\
1128: & & & $3.77$ & PF00665 rve \\
1129: $0.725$ $\to$ $0.700$ & & & & \\
1130: & & & $37.19$ & PF00186 DHFR\_1 \\
1131: & & & $80.26$ & PF00098 zf-CCHC \\
1132: & & & $129.77$ & PF00516 GP120 \\
1133: & & & $139.92$ & PF00607 Gag\_p24 \\
1134: & & & $145.50$ & PF00540 Gag\_p17 \\ \hline
1135:
1136:
1137: & $0.01$ & PF00516 GP120 & $0.12$ & PF00098 zf-CCHC \\
1138: & $0.01$ & PF00522 VPR & $0.15$ & PF00271 Helicase\_C \\
1139: & $0.01$ & PF00602 Flu\_PB1 & $0.22$ & PF00078 RVT\_1 \\
1140: & $0.01$ & PF00603 Flu\_PA & $1.02$ & PF01560 HCV\_NS1 \\
1141: & $0.01$ & PF01539 HCV\_env & $1.16$ & PF06817 RVT\_thumb \\
1142: $0.700$ $\to$ $0.675$ & & & & \\
1143: & $10.14$ & PF08435 Calici\_coat\_C & $15.62$ & PF02907 Peptidase\_S29 \\
1144: & $10.22$ & PF03296 Pox\_polyA\_pol & $19.47$ & PF00517 GP41 \\
1145: & $12.94$ & PF05733 Tenui\_N & $57.66$ & PF00516 GP120 \\
1146: & $12.98$ & PF03805 CLAG & $74.03$ & PF00077 RVP \\
1147: & $13.68$ & PF00897 Orbi\_VP7 & $98.38$ & PF02348 CTP\_transf\_3 \\ \hline
1148:
1149: & $0.01$ & PF00064 Neur & $0.10$ & PF00078 RVT\_1 \\
1150: & $0.01$ & PF00469 F-protein & $0.13$ & PF00077 RVP \\
1151: & $0.01$ & PF00506 Flu\_NP & $0.18$ & PF00560 LRR\_1 \\
1152: & $0.01$ & PF00516 GP120 & $0.18$ & PF00607 Gag\_p24 \\
1153: & $0.01$ & PF00540 Gag\_p17 & $0.30$ & PF00665 rve \\
1154: $0.675$ $\to$ $0.650$ & & & & \\
1155: & $11.63$ & PF04310 MukB & $151.92$ & PF02959 Tax \\
1156: & $12.71$ & PF07108 PipA & $168.64$ & PF00758 EPO\_TPO \\
1157: & $13.48$ & PF07429 Fuc4NAc\_transf & $431.37$ & PF08300 HCV\_NS5a\_1 \\
1158: & $15.20$ & PF03506 Flu\_C\_NS1 & $441.03$ & PF08301 HCV\_NS5a\_1b \\
1159: & $15.26$ & PF06593 RBDV\_coat & $483.96$ & PF01506 HCV\_NS5a \\ \hline
1160:
1161: & $0.01$ & PF00506 Flu\_NP & $0.03$ & PF00096 zf-C2H2 \\
1162: & $0.01$ & PF00516 GP120 & $0.04$ & PF00078 RVT\_1 \\
1163: & $0.01$ & PF00540 Gag\_p17 & $0.17$ & PF00023 Ank \\
1164: & $0.01$ & PF00603 Flu\_PA & $0.17$ & PF00589 Phage\_integrase \\
1165: & $0.01$ & PF00695 vMSA & $0.19$ & PF00903 Glyoxalase \\
1166: $0.650 $ $\to$ $ 0.625$ & & & & \\
1167: & $12.57$ & PF06952 PsiA & $202.08$ & PF01002 Flavi\_NS2B \\
1168: & $13.73$ & PF06788 UPF0257 & $221.93$ & PF01349 Flavi\_NS4B \\
1169: & $14.79$ & PF05788 Orbi\_VP1 & $222.59$ & PF01353 GFP \\
1170: & $15.42$ & PF00901 Orbi\_VP5 & $229.23$ & PF01350 Flavi\_NS4A \\
1171: & $16.02$ & PF03753 HHV6-IE & $243.38$ & PF00948 Flavi\_NS1 \\ \hline
1172:
1173: \end{tabular}
1174: \end{center}
1175: \end{table*}
1176:
1177: \begin{table*}[!htb]
1178: \begin{center}
1179: \begin{tabular}{|c|cc|cc|} \hline
1180:
1181: & $0.01$ & PF00124 Photo\_RC & $0.09$ & PF00009 GTP\_EFTU \\
1182: & $0.01$ & PF00603 Flu\_PA & $0.13$ & PF07974 EGF\_2 \\
1183: & $0.01$ & PF00695 vMSA & $0.2$ & PF00096 zf-C2H2 \\
1184: & $0.01$ & PF01560 HCV\_NS1 & $0.22$ & PF00560 LRR\_1 \\
1185: & $0.02$ & PF00223 PsaA\_PsaB & $0.23$ & PF01546 Peptidase\_M20 \\
1186: $0.625$ $\to$ $0.600$ & & & & \\
1187: & $11.95$ & PF06517 Orthopox\_A43R & $376.41$ & PF01002 Flavi\_NS2B \\
1188: & $12.09$ & PF00843 Arena\_nucleocap & $403.70$ & PF00948 Flavi\_NS1 \\
1189: & $13.08$ & PF06802 DUF1231 & $411.72$ & PF01349 Flavi\_NS4B \\
1190: & $14.72$ & PF05273 Pox\_RNA\_Pol\_22 & $425.27$ & PF01350 Flavi\_NS4A \\
1191: & $16.90$ & PF03021 CM2 & $538.21$ & PF05408 Peptidase\_C28 \\ \hline
1192:
1193: & $0.01$ & PF00517 GP41 & $0.06$ & PF00096 zf-C2H2 \\
1194: & $0.01$ & PF00559 Vif & $0.06$ & PF00097 zf-C3HC4 \\
1195: & $0.01$ & PF00600 Flu\_NS1 & $0.09$ & PF00009 GTP\_EFTU \\
1196: & $0.01$ & PF00969 MHC\_II\_beta & $0.09$ & PF01266 DAO \\
1197: & $0.01$ & PF06815 RVT\_connect & $0.11$ & PF01926 MMR\_HSR1 \\
1198: $0.600$ $\to$ $0.575$ & & & & \\
1199: & $10.54$ & PF02477 Nairo\_nucleo & $133.87$ & PF05790 C2-set \\
1200: & $11.95$ & PF07982 Herpes\_UL74 & $139.12$ & PF01353 GFP \\
1201: & $12.30$ & PF06871 TraH\_2 & $150.11$ & PF00518 E6 \\
1202: & $14.14$ & PF02509 Rota\_NS35 & $195.29$ & PF02929 Bgal\_small\_N \\
1203: & $16.04$ & PF06929 Rotavirus\_VP3 & $231.71$ & PF01382 Avidin \\ \hline
1204:
1205: & $0.01$ & PF00016 RuBisCO\_large & $0.02$ & PF00115 COX1 \\
1206: & $0.01$ & PF00113 Enolase\_C & $0.07$ & PF07690 MFS\_1 \\
1207: & $0.01$ & PF00123 Hormone\_2 & $0.08$ & PF07993 NAD\_binding\_4 \\
1208: & $0.01$ & PF00506 Flu\_NP & $0.09$ & PF00517 GP41 \\
1209: & $0.01$ & PF01010 Oxidored\_q1\_C & $0.10$ & PF00583 Acetyltransf\_1 \\
1210: $0.575 $ $\to$ $ 0.550$ & & & & \\
1211: & $10.60$ & PF06134 RhaA & $161.43$ & PF01140 Gag\_MA \\
1212: & $10.95$ & PF07095 IgaA & $168.19$ & PF04528 Adeno\_E4\_34 \\
1213: & $11.75$ & PF00897 Orbi\_VP7 & $173.44$ & PF08377 MAP2\_projctn \\
1214: & $12.13$ & PF03294 Pox\_Rap94 & $184.23$ & PF02093 Gag\_p30 \\
1215: & $13.75$ & PF01295 Adenylate\_cycl & $311.32$ & PF01141 Gag\_p12 \\ \hline
1216:
1217: & $0.01$ & PF00016 RuBisCO\_large & $0.06$ & PF00067 p450 \\
1218: & $0.01$ & PF00516 GP120 & $0.07$ & PF00023 Ank \\
1219: & $0.01$ & PF00522 VPR & $0.08$ & PF00097 zf-C3HC4 \\
1220: & $0.01$ & PF00540 Gag\_p17 & $0.11$ & PF01381 HTH\_3 \\
1221: & $0.01$ & PF01539 HCV\_env & $0.11$ & PF04851 ResIII \\
1222: $0.550$ $\to$ $0.525$ & & & & \\
1223: & $11.29$ & PF05928 Zea\_mays\_MuDR & $101.41$ & PF01537 Herpes\_glycop\_D \\
1224: & $11.62$ & PF06829 DUF1238 & $121.18$ & PF02929 Bgal\_small\_N \\
1225: & $11.63$ & PF03277 Herpes\_UL4 & $123.25$ & PF01376 Enterotoxin\_b \\
1226: & $11.64$ & PF03395 Pox\_P4A & $128.24$ & PF06466 PCAF\_N \\
1227: & $12.73$ & PF08405 Calici\_PP\_N & $147.36$ & PF05806 Noggin \\ \hline
1228:
1229:
1230: & $0.01$ & PF00600 Flu\_NS1 & $0.02$ & PF00106 adh\_short \\
1231: & $0.01$ & PF00869 Flavi\_glycoprot & $0.04$ & PF00270 DEAD \\
1232: & $0.01$ & PF01539 HCV\_env & $0.05$ & PF00037 Fer4 \\
1233: & $0.01$ & PF02461 AMO & $0.06$ & PF02518 HATPase\_c \\
1234: & $0.01$ & PF02788 RuBisCO\_large\_N & $0.08$ & PF00249 Myb\_DNA-binding \\
1235: $0.525$ $\to$ $0.500$ & & & & \\
1236: & $11.36$ & PF07434 CblD & $68.92$ & PF03939 Ribosomal\_L23eN \\
1237: & $11.80$ & PF04913 Baculo\_Y142 & $72.11$ & PF06267 DUF1028 \\
1238: & $11.98$ & PF05880 Fiji\_64\_capsid & $96.66$ & PF02022 Integrase\_Zn \\
1239: & $13.48$ & PF06306 CgtA & $120.34$ & PF00552 Integrase \\
1240: & $13.98$ & PF03317 ELF & $129.98$ & PF02929 Bgal\_small\_N \\ \hline
1241: \end{tabular}
\caption{\label{tab6} \small For proteins joining clusters outside the
largest component or joining the giant component, the five most
underrepresented and five most overrepresented PFAM domains are
given for each interval of the weight $w$.}
1246: \end{center}
1247: \end{table*}
1248:
1249:
1250: \begin{thebibliography}{99}
1251:
\bibitem{revEvol} E.V.~Koonin, {\it Orthologs, Paralogs, and
Evolutionary Genomics}, {\tt Annu. Rev. Genet. {\bf 39}, 309-338 (2005)}
1254:
\bibitem{simap} R.~Arnold, T.~Rattei, P.~Tischler, M.~Truong,
V.~St\"{u}mpflen, W.~Mewes, {\it SIMAP - The similarity matrix of
proteins}, {\tt Bioinformatics {\bf 21}, ii42-ii46 (2005)}
1258:
\bibitem{phn} D.~Medini, A.~Covacci, C.~Donati, {\it Protein homology
network families reveal step-wise diversification of type III and
type IV secretion systems}, {\tt PLoS Computational Biology {\bf 2},
1543-1551 (2006)}
1263:
\bibitem{erdos_renyi} P.~Erd\H{o}s, A.~R\'enyi, {\it On random graphs I},
{\tt Publ. Math. Debrecen {\bf 6}, 290-297 (1959)}
1266:
\bibitem{burda} L.~Bogacz, Z.~Burda, W.~Janke, B.~Waclaw, {\it A
program generating homogeneous random graphs with given weights},
[{\tt cond-mat/0506330}].
1270:
1271: \bibitem{emboss} P.~Rice, I.~Longden, et al., {\it EMBOSS: the
1272: European Molecular Biology Open Software Suite}, {\tt Trends Genet
1273: 16(6): 276-7 (2000)}
1274:
1275: \bibitem{segprog} J.C.~Wootton, {\it Sequences with `unusual' amino
1276: acid compositions.}, {\tt Curr. Opin. Struct. Biol 4: 413-421
1277: (1994)}
1278:
\bibitem{tmhmmprog} A.~Krogh, B.~Larsson, et al., {\it Predicting
transmembrane protein topology with a hidden Markov model:
application to complete genomes}, {\tt J. Mol. Biol. 305(3):
567-580 (2001)}
1283:
\bibitem{signalPprog} J.D.~Bendtsen, H.~Nielsen, et al., {\it Improved
prediction of signal peptides: SignalP 3.0}, {\tt J. Mol. Biol.
340(4): 783-795 (2004)}
1287:
1288: \bibitem{targetPprog} O.~Emanuelsson, H.~Nielsen, et al., {\it
1289: Predicting subcellular localization of proteins based on their
1290: N-terminal amino acid sequence.}, {\tt Journal of Molecular
1291: Biology 300(4): 1005-1016 (2000)}
1292:
1293: \bibitem{pddb} N.J.~Mulder, R.~Apweiler, et al., {\it InterPro,
1294: progress and status in 2005.}, {\tt Nucleic Acids Research 33
1295: (Database issue): D201-5 (2005)}
1296:
1297: \end{thebibliography}
1298:
1299: \end{document}