q-bio0703053/simap.tex
1: \documentclass[floatfix,twocolumn,showpacs,preprintnumbers,amsmath,amssymb]{revtex4}
2: 
3: \usepackage{graphicx,epsfig,amsfonts}% Include figure files
4: \usepackage{dcolumn}% Align table columns on decimal point
5: \usepackage{psfrag}
6: \usepackage{concmath,charter}
7: \usepackage{subfigure}
8: \begin{document}
9: 
10: \newcommand{\e}[1]{\emph{#1}} 
11: \newcommand{\avg}[1]{\langle #1 \rangle}
12: \newcommand{\va}[0]{{\mathbf a}}
13: \newcommand{\vb}[0]{{\mathbf b}}
14: \newcommand{\vc}[0]{{\mathbf c}}
15: 
16: 
17: \title{Global statistical analysis of the protein homology network}
18: 
19: \author{C.~Miccio}
20: \email{miccio@mib.infn.it}
21: \affiliation{
22:   Dipartimento di Fisica G.Occhialini, Universit\`a di
23:   Milano--Bicocca and INFN, Sezione di Milano, Piazza della Scienza 3
24:   - I-20126 Milano, Italy}
25: \author{T.~Rattei} 
26: \email{t.rattei@wzw.tum.de} 
27: \affiliation{
28:   Department of Genome Oriented Bioinformatics, Technical University
29:   of Munich, 
30:   Wissenschaftszentrum 5 Weihenstephan, 85350 Freising,
31:   Germany }
32: 
33: \date{\today}
34: 
35: \begin{abstract}
36:   The similarity between protein sequences is a quantity that can be
37:   computed directly and easily, and from which one can deduce
38:   information about their evolutionary distance and detect homologous
39:   proteins. The \emph{SIMAP} database -- \emph{Similarity Matrix of Proteins} --
40:   provides a pre-computed similarity matrix covering the similarity
41:   space formed by almost all publicly available amino acid sequences
42:   from public databases and completely sequenced genomes.  From SIMAP
43:   we construct the protein homology network, where the proteins are
44:   the nodes and the links represent homology relationships.  With more
45:   than $5$ million nodes and about $70 \times 10^9$ edges it is the
46:   largest protein homology network built to date.  We
47:   describe its basic features and perform a global statistical
48:   analysis of the network. Starting from the Smith-Waterman similarity
49:   score, we define for each edge a weight $w$ that measures the
50:   similarity between two nodes. Keeping only edges with a
51:   weight greater than a minimal $\bar w$, and varying $\bar w$, we
52:   build a family of networks with different degrees of similarity. We
53:   investigate the distribution of connected components (clusters) of
54:   the networks at different $\bar w$ and, in particular, we find a
55:   behaviour similar to a phase transition driven by the formation of a
56:   giant component. Moreover, we study selected sequence features and
57:   protein domains of protein pairs that connect different clusters in
58:   the networks at different levels of similarity.  We observe
59:   specific, non-random distributions of the protein features and
60:   domains for proteins connecting clusters at certain weight
61:   intervals.
62: \end{abstract}
63: %\pacs{87.10.+e, 05.10.Ln}
64: 
65: \maketitle
66: 
67: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
68: \section{Background}
69: 
70: The number of known proteins is rapidly growing and the sequence of
71: amino acids is, at the moment, the main source of information for many
72: new proteins which still have unidentified functions. Protein sequence
73: analysis, and more specifically, the analysis of similarities among
74: protein sequences, is therefore the basis of studies trying to
75: understand protein evolutionary processes or to detect unknown
76: biological functions of new proteins. Proteins with similar sequences
77: can be found in different organisms and within a single
78: organism\footnote{Due to duplication and shuffling of coding segments in the
79:   DNA during evolution.} \cite{revEvol}. By means of the
80: degree of similarity obtained by a pairwise sequence comparison it is
81: possible to deduce information about their evolutionary distance.
82: Specifically, two proteins are homologous if they evolved from a
83: common ancestral protein sequence and, in most cases, they have also
84: the same, or very similar, biological function. Homology can be
85: deduced from statistically significant sequence similarities. However,
86: new sequences often have only weak similarities to known proteins, and
87: single similarity searches are insufficient to assign validated
88: properties of characterized proteins to new sequences. Instead, a graph
89: formed by all-against-all comparisons of a large amount of
90: protein data can be useful. This is the case of {\bf SIMAP} --
91: \e{Similarity Matrix of Proteins} -- a database containing the
92: similarity space formed by almost all amino acid sequences, with
93: nearly 5.5 million non-redundant protein sequences drawn from
94: completely sequenced genomes and public databases. Moreover, the
95: pre-calculated similarity space allows very rapid access to
96: significant hits of interest and avoids time-consuming
97: re-computation. The algorithm that precomputes the sequence
98: similarities is based on the FASTA heuristic. First it compares
99: low-complexity masked proteins using FASTA and then it recalculates
100: the hits found using non-masked sequences and the Smith-Waterman
101: algorithm. In both phases of the alignment process the BLOSUM50 amino
102: acid substitution matrix is used. For each hit the Smith-Waterman
103: score, the identity, the gapped identity, the overlap and the start
104: and the stop coordinates of the alignment in
105: both proteins are stored. For more details see \cite{simap}.\\
106: Graphs formed by all-against-all sequence comparisons can be used to
107: derive inheritance patterns of proteins, to reconstruct the
108: evolutionary relationships between proteins and to classify them into
109: protein families by looking for dense clusters disconnected from the
110: rest of the network. To date, this approach has been carefully
111: evaluated by case studies targeted at selected protein families
112: \cite{phn}, but a global analysis of the complete homology network
113: formed by all publicly available proteins has not been published. The
114: aim of this work is to analyze global and local properties of the
115: graph forming the homology network.
116: 
117: 
118: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 
119: \section{SIMAP graph representation}
120: 
121: The information contained in the Simap database can be reorganized by
122: means of a weighted graph representation, $G(V, E, w)$, where $V$ is
123: the set of nodes, $E$ the set of edges, and $w$ a weight function on
124: the edges: $w : E \to [0,1]$. Each node, $\va \in V$, represents a
125: protein sequence and each edge, $e = \{ \va,\vb \} \in E$ between two
126: nodes $\va$, $\vb$ represents the stored alignment between the
127: respective protein sequences\footnote{For simplicity we will use the
128:   same notation to denote the graph's nodes and the database's proteins.}.  In
129: this way an undirected weighted graph can be obtained, since the
130: symmetry of the alignment procedure leads to undirected edges and the
131: score of the alignment allows the assignment of a suitable weight to
132: every edge. (Despite the possibility of making an alignment between a
133: protein sequence and itself, self-edges are not considered). More
134: specifically if $s(\va,\vb)$ is the Smith-Waterman (SW) optimal score
135: obtained with the FASTA algorithm between sequence $\va$ and $\vb$, a
136: suitable weight $w(\va,\vb) \in [0,1]$ for the edge $e = \{ \va,\vb
137: \}$ can be defined as follows:
138: \begin{equation}
139: \label{eq:weight} 
140:   w(\va,\vb) = \frac{s(\va,\vb)}{ \sqrt{ \; s(\va,\va) \;
141:       s(\vb,\vb)}}.
142: \end{equation}
143: From $w(\va,\vb)$ one could define a distance function as $d(\va,\vb)
144: = 1 - w(\va,\vb)\;$, whose values lie in $[0, 1]$, as for distance functions
145: usually defined on linear spaces. $d$ should satisfy positivity, vanish
146: for identical sequences and be symmetric for all pairs of protein sequences, and also
147: satisfy the triangle inequality, which is fully satisfied for the BLOSUM50
148: matrix.
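As a purely illustrative example of Eq.~(\ref{eq:weight}), consider hypothetical scores (chosen for convenience, not taken from the database) $s(\va,\vb) = 300$, $s(\va,\va) = 900$ and $s(\vb,\vb) = 400$. Then
\[
  w(\va,\vb) = \frac{300}{\sqrt{900 \cdot 400}} = \frac{300}{600} = 0.5,
  \qquad
  d(\va,\vb) = 1 - w(\va,\vb) = 0.5 .
\]
The normalization also gives $w(\va,\va) = s(\va,\va)/\sqrt{s(\va,\va)^2} = 1$, so every sequence is at distance $d = 0$ from itself, and $w(\va,\vb) = w(\vb,\va)$ whenever the score $s$ is symmetric.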
149: 
150: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
151: \section{Polishing procedure}
152: Strictly speaking, the set of all protein sequences of the Simap
153: database is not a good space over which to define the distance measure
154: $d$. There are, in fact, $1538$ pairs of sequences that have distance
155: equal to zero, although they are classified with a different sequence
156: id. However, they differ only in the presence of one or two $'$X$'$ in
157: their amino acid sequence annotation, where $'$X$'$ is the standard
158: symbol for an unknown amino acid residue in a protein sequence.  It is
159: therefore natural to decide to knock out, for each of these pairs of
160: sequences, the one that has the $'$X$'$ in the sequence; this
161: procedure entails the removal, in the graph representation, of all
162: edges connected to the removed nodes. Another improvement for database
163: consistency is to check the symmetry of all edges: every time a
164: directed edge is found, the inverse relation, if absent, is added.
165: 
166: As a final result of these manipulations, a graph with $V = 5,489,907$
167: nodes and $E = 69,500,722,050$ edges can be constructed.
168: 
169: Over the polished Simap protein sequence space the distance $d = 1 -
170: w(\va,\vb)\;$ violates the triangle inequality in a few cases (about
171: $0.2 \%$ of triangles). However, redefining, for instance,
172: 
173: \begin{equation}
174: \label{eq:distance} 
175: d(\va,\vb) =\sqrt{1 - w(\va,\vb)}, 
176: \end{equation}
177: we have that the triangle inequality is satisfied for all triples of
178: linked proteins and (\ref{eq:distance}) has all properties required
179: for a \e{distance measure}.
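To illustrate how the square root can repair a violation, consider three linked proteins with hypothetical weights (chosen only for illustration) $w(\va,\vb) = w(\vb,\vc) = 0.9$ and $w(\va,\vc) = 0.7$. With $d = 1 - w$ the triangle inequality fails,
\[
  d(\va,\vc) = 0.3 \; > \; d(\va,\vb) + d(\vb,\vc) = 0.1 + 0.1 = 0.2 ,
\]
whereas with $d = \sqrt{1 - w}$ it holds,
\[
  d(\va,\vc) = \sqrt{0.3} \approx 0.548 \; \le \; d(\va,\vb) + d(\vb,\vc) = 2\sqrt{0.1} \approx 0.632 .
\]
This sketch only shows the mechanism; that the inequality holds for all triples of linked proteins is the empirical observation stated above, not a consequence of this example.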
180: 
181: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
182: \section{Characterization of Simap protein space}
183: 
184: In the Simap database, protein sequences come from $104,560$ different
185: species.  There are, in particular, $3$ species (\e{Homo sapiens},
186: \e{Arabidopsis thaliana}, \e{Rice plants}) with more than $100,000$
187: protein sequences and $72$ with more than $10,000$.
188: 
189: \begin{table}[!htb]
190:   \begin{center}
191:       \begin{tabular}{|c|c|c|}
192:         \vspace{-10pt} & & \\ \hline \it{kingdoms} & & \it{number of species} \\
193:         \vspace{-10pt} & & \\ \hline
194:         bacteria &                &  $11,130$    \\ \hline
195:         viruses  &  viruses       &  $13,708$    \\
196:         &  phages        &  $923$     \\ \hline
197:         plants   &                &  $31,232$   \\ \hline
198:         animalia & invertebrates  &  $25,951$   \\ %\cline{2-4}
199:         & vertebrates    &  $19,341$   \\
200:         & (rodents)      &  $(1,474)$  \\
201:         & (mammals)      &  $(1,854)$  \\
202:         & (primates)     &  $(393)$   \\ \hline	         
203:         environmental samples  &   &  $1,453$    \\ \hline
204:         synthetic              &   &  $822$      \\ \hline
205:       \end{tabular} 
206:       \caption{\label{tab1} \small Number of species for each
207:         kingdom.}
208:   \end{center}
209: \end{table}
210: 
211: A coarse subdivision of all species is shown in
212: Table~\ref{tab1}; it separates species into five (non-standard)
213: main kingdoms: bacteria, viruses, plants, invertebrates (animalia) and
214: vertebrates (animalia). The classification reveals the presence of
215: a very large number of animalia species, but only eight of these species
216: are present with their complete genome (the other animalia proteins
217: were imported from multiple species databases).
218: Figure~\ref{fig1} shows the protein distribution for
219: each kingdom. There is also a high number ($546,439$) of unassigned
220: protein sequences\footnote{These sequences come from the databases
221:   \e{PDB proteins}, \e{mips non-redundant protein database},
222:   \e{UNIPROT SWISSPROT}, \e{UNIPROT-TrEMBL}, \e{PFAM sequences},
223:   \e{Eukaryotic signature proteins}.}.
224: 
225: \begin{figure}[!htb]
226:   \begin{center}
227:     \includegraphics[height=0.36\textwidth,angle=270]{f1.eps}
228:     \caption{\label{fig1} {\small Distribution of
229:         proteins for each kingdom. The inset shows the
230:         distribution within vertebrates.}}
231:   \end{center}
232: \end{figure}
233: 
234: \subsection{Length and self-similarity distribution}
235: 
236: \begin{figure}[!htb]
237:   \includegraphics[height=0.70\textwidth]{f2.eps}
238:   \caption{\label{fig2}{\small (a) Distribution of protein sequence
239:       lengths. The inset shows an enlargement of the
240:       distribution. (b) Length distributions of protein sequences which
241:       belong to \e{bacteria} ($\avg{l} = 316.9$, $l_{max} = 36805$),
242:       \e{viruses} ($\avg{l} = 273.9$, $l_{max} = 7312$), \e{plants}
243:       ($\avg{l} = 314.5$, $l_{max} = 20925$), \e{invertebrates}
244:       ($\avg{l} = 416.1$, $l_{max} = 23015$), \e{vertebrates}
245:       ($\avg{l} = 397.1$, $l_{max} = 38031$).}}
246: \end{figure}
247: 
248: The protein sequence space is characterized by the length
249: distribution shown in Figure~\ref{fig2}a; Figure~\ref{fig2}b
250: gives the length distributions for sequences belonging to bacteria,
251: viruses, plants, vertebrates and invertebrates.
252: 
253: \begin{figure}[!htb]
254:   \vspace{0.2cm}
255:   \includegraphics[height=0.36\textwidth,angle=270]{f3.eps}
256:   \caption{\label{fig3}{\small Distribution of protein sequence
257:       self-scores. The inset shows an enlargement of the
258:       distribution. }}
259: \end{figure}
260: 
261: The distribution of the \e{self-similarity scores} of the protein sequences
262: is shown in Figure~\ref{fig3}.  The self-similarity score distribution
263: is well reproduced by a mixture of normal distributions, one for each
264: sequence length. The self-similarity score $s(\va,\va)$ of a protein
265: sequence of length $l$ can be thought of as a sum of $l$ i.i.d.  random
266: variables, i.e. a sum of the self-similarity scores of random amino
267: acids. Knowing the amino acid background probabilities\footnote{ The
268:   values for background distribution of amino acids come from data
269:   used for the PAM matrix: $\;p_A=0.096;\; p_R=0.034;\; p_N=0.042;\;
270:   p_D=0.053;\; p_C=0.025;\; p_Q=0.032;\; p_E=0.053;\; p_G=0.090;\;
271:   p_H=0.034;\; p_I=0.035;\; p_L= 0.084;\; p_K=0.085;\; p_M=0.012;\;
272:   p_F=0.045;\; p_P=0.041;\; p_S=0.057;\; p_T=0.062;\; p_W=0.012;\;
273:   p_Y=0.030;\; p_V=0.078$.\\ They can be obtained from \e{{\small
274:       http://apps.bioneq.qc.ca/twiki/pub/Knowledgebase/PAM/}}
275:   \e{{\small PAM2.JPG}}} $p_{a}$ and the diagonal values of the
276: BLOSUM50 score matrix, $B_{aa}$, the self-similarity score of a random
277: amino acid has mean $\avg{s} =
278: \sum_a p_a \, B_{aa} \;\; ( \approx 6.727)$ and standard deviation $ \sigma =
279: \sqrt{\sum_a p_a B_{aa}^2 - \avg{s} ^2} \;\;(\approx 2.067)$.
280: Self-similarity scores of random amino acid sequences of length $l$
281: will then approximately follow a normal distribution $g(l,s)$ with mean $l\,\avg{s}$ and
282: standard deviation $\sqrt{l\,\sigma^2}$.  Finally, the self-similarity score
283: distribution is well approximated by the sum $\sum_{l} g(l,s) f(l)$,
284: where $f(l)$ is the observed length distribution (Figure~\ref{fig4}).
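Written out explicitly, and assuming that $g(l,s)$ denotes a normalized Gaussian density in $s$ and that $\sigma \approx 2.067$ is the standard deviation of the single-residue score (it is defined above through a square root), the mixture reads
\[
  \sum_{l} f(l)\, g(l,s), \qquad
  g(l,s) = \frac{1}{\sqrt{2 \pi\, l\, \sigma^2}}
  \exp\!\left[ - \frac{\left( s - l \avg{s} \right)^2}{2\, l\, \sigma^2} \right] .
\]
As an illustration, for a sequence of length $l = 300$ (an arbitrary choice close to the average lengths quoted in Figure~\ref{fig2}) this gives a mean self-score $l \avg{s} \approx 300 \times 6.727 \approx 2018$ and a standard deviation $\sqrt{l}\,\sigma \approx 17.32 \times 2.067 \approx 35.8$.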
285: 
286: \begin{figure}[!htb]
287:   \vspace{0.2cm}
288:   \includegraphics[height=0.36\textwidth,angle=270]{f4.eps}
289:   \caption{\label{fig4}{\small Comparison between the distribution of
290:       protein sequence self-scores and the curve obtained as a
291:       superposition of normal distributions suitably weighted by the
292:       protein sequence length distribution.}}
293: \end{figure}
294: 
295: \subsection{Pairwise similarity distribution}
296: 
297: The distribution of SW optimal similarity scores obtained from all FASTA
298: sequence alignments presents a uniform cutoff equal to $80$, used
299: for storing hits in the Simap database. It was chosen independently of the
300: query and database length, as an optimal compromise between
301: sensitivity and the ability to store a manageable number of hits,
302: given the high number of protein sequences.
303: 
304: \begin{figure}[!htb]
305:   \includegraphics[height=0.70\textwidth]{f5.eps}
306:   \caption{\label{fig5}{\small (a) Distribution of edge weights $w$.
307:       The inset shows an enlargement of the distribution
308:       tail. (b) Cumulative distribution of the edge weights.}}
309: \end{figure}
310: 
311: In Figure~\ref{fig5}a the distribution of weights $w$ is shown, and in
312: Figure~\ref{fig5}b the corresponding cumulative distribution $\rho(w)$. The
313: values of $\rho(w) \in [0,1]$ represent the fraction of edges which
314: have weight greater than or equal to $w$. From them we see that the
315: majority of the edges (about $80\%$ of the total number of edges) have a
316: very low value of $w$ ($\leq 0.2$).
317:  
318: \subsection{Coordination and cluster distribution}
319: 
320: Weights $w$ can be used as a parameter to define a collection of
321: graphs. For a fixed value of $w = \bar{w}$ (or a value of $d = \bar{d}
322: = \sqrt{1 -\bar{w} }$ ), a graph is built keeping only edges with $w >
323: \bar{w}$ ($d < \bar{d}$). For high values of $\bar{w}$, i.e. at
324: small distances, nodes are linked if, and only if, the corresponding
325: protein sequences have a high degree of similarity; it is then
326: reasonable to expect graphs with many small connected components. By
327: decreasing $\bar{w}$, in other words by also linking proteins
328: having a lower degree of similarity, graphs with larger connected
329: components are expected. The graph obtained by considering all
330: possible edges (by fixing $\bar{w} = 0$) is not the complete graph,
331: due to the cutoff on the alignment score (it contains about $0.1 \%$ of
332: the edges of the corresponding complete graph).
333: 
334: We have built graphs for values of $\bar{w}$ equal to $0.975$, $0.95$,
335: $0.925$, $0.9$, $0.875$, $0.85$, $0.825$, $0.8$, $0.775$, $0.75$,
336: $0.725$, $0.7$, $0.675$, $0.65$, $0.625$, $0.6$, $0.575$, $0.55$,
337: $0.525$, $0.5$, $0.475$, $0.45$, $0.425$, $0.4$, $0.375$, $0.35$,
338: $0.325$, $0.3$, $0.275$, $0.25$, $0.225$, $0.2$, $0.175$, $0.15$,
339: $0.125$; $\;$ for each of these values the set of protein
340: sequences splits into clusters, i.e. isolated connected components.
341: Linking proteins that have a greater and greater distance from each
342: other (decreasing $\bar{w}$), clusters merge to form larger clusters,
343: the number of isolated proteins and the number of components with a
344: very small size decrease, while the number of clusters of medium and
345: large size increases.
346: 
347: \begin{figure}[!htb]
348: 
349:     \subfigure[]{\label{fig6a} 
350:       \includegraphics[width=0.34\textwidth, angle=270]{f6a.ps}
351:     } \vspace{-0.4cm}
352:     \subfigure[]{\label{fig6b}
353:       \includegraphics[width=0.40\textwidth, height=0.42\textwidth, angle=270]{f6b.ps}
354:     }
355: 
356:   \caption{{\small (a) Distribution of the size of connected
357:       components of the protein sequence graph built at $\bar{w} =
358:       0.975$ (red curve), $\bar{w} = 0.75$ (pink curve) and $\bar{w} =
359:       0.4$ (blue curve). As the value of $\bar{w}$
360:       decreases, the number of connected components of small size
361:       decreases and the starting region of the power law behaviour
362:       shifts to higher values of size. (b) Distribution of the
363:       coordination degree of the protein sequence graph built at
364:       $\bar{w} = 0.975$ (red curve), $\bar{w} = 0.75$ (pink curve) and
365:       $\bar{w} = 0.4$ (blue curve). As the value of $\bar{w}$ decreases,
366:       the number of nodes with low coordination degree decreases and the
367:       starting region of the power law behaviour shifts to higher
368:       values of coordination degree.}}
369: \end{figure}
370: 
371: Measuring the (not normalized) cluster distribution, we find that, for
372: each fixed value of $\bar{w}$, the number of clusters
373: $n_{\bar{w}}(s)$ of size $s$ follows, in a specific size range, a
374: power law behaviour, $n_{\bar{w}}(s) \sim s^{-\sigma(\bar{w})}$.
375: Fitted values of $\sigma(\bar{w})$ and fitting size ranges are
376: reported in Table~\ref{tab2} and a log-log plot of the size
377: distribution $n_{\bar{w}}(s)$, for three different values of $\bar w$,
378: is shown in Figure~\ref{fig6a}.  The (not normalized) coordination
379: degree distribution $f_{\bar w}(z)$ also follows a power law,
380: $f_{\bar w}(z) \sim z^{-\alpha(\bar w)}$, for each value of
381: $\bar{w}$. A log-log plot of the coordination degree distribution
382: $f_{\bar{w}}(z)$, for three different values of $\bar w$, is shown in
383: Figure~\ref{fig6b}.  Fitted values of $\alpha(\bar{w})$ and the fitted
384: coordination degree ranges are reported in Table~\ref{tab3}.
385: 
386: \begin{table}[!htb]
387:   \begin{center}
388:       \begin{tabular}{|c|ccc|} \hline
389:         \vspace{-10pt} & & & \\ $\bar{w}$ & $\sigma$ & \quad component
390:         & \quad correlation \\ & & \quad size range & \quad coefficient \\
391:         \vspace{-10pt} & & & \\ \hline
392:         \vspace{-10pt} & & & \\
393:         $\;$ $0.95$ $\;$   &   $\;$ $2.70$  &  $10 - 60$  &  $-0.995$ \\
394:         $\;$ $0.90$ $\;$    &   $\;$ $2.70$  &  $10 - 60$  &  $-0.996$ \\
395:         $\;$ $0.85$ $\;$    &   $\;$ $2.69$  &  $10 - 60$  &  $-0.994$ \\
396:         $\;$ $0.80$ $\;$    &   $\;$ $2.62$  &  $10 - 80$  &  $-0.996$ \\
397:         $\;$ $0.75$ $\;$    &   $\;$ $2.52$  &  $10 - 80$  &  $-0.996$ \\
398:         $\;$ $0.70$ $\;$    &   $\;$ $2.40$  &  $10 - 80$  &  $-0.996$ \\
399:         $\;$ $0.65$ $\;$    &   $\;$ $2.32$  &  $10 - 100$  &  $-0.997$ \\
400:         $\;$ $0.60$ $\;$    &   $\;$ $2.21$  &  $10 - 100$  &  $-0.996$ \\
401:         $\;$ $0.55$ $\;$    &   $\;$ $2.17$  &  $10 - 100$  &  $-0.996$ \\
402:         $\;$ $0.50$ $\;$    &   $\;$ $2.07$  &  $10 - 100$  &  $-0.997$ \\
403:         $\;$ $0.45$ $\;$    &   $\;$ $2.01$  &  $10 - 100$  &  $-0.997$ \\
404:         $\;$ $0.40$ $\;$    &   $\;$ $2.00$  &  $10 - 100$  &  $-0.996$ \\
405:         $\;$ $0.35$ $\;$    &   $\;$ $1.98$  &  $10 - 100$  &  $-0.997$ \\
406:         $\;$ $0.30$ $\;$    &   $\;$ $1.98$  &  $10 - 100$  &  $-0.997$ \\
407:         $\;$ $0.25$ $\;$    &   $\;$ $2.01$  &  $10 - 100$  &  $-0.996$ \\ \hline
408:       \end{tabular}
409:     \caption{\label{tab2} \small Fitted values of the exponent
410:       $\sigma$ of the power law distribution of connected components
411:       for selected values of $\bar{w}$. For each fit the size
412:       range and the correlation coefficient are reported.}
413: \end{center}
414: \end{table}
415: 
416: 
417: \begin{table}[!htb]
418:   \begin{center}
419:     \begin{tabular}{|c|c|c|ccc|} \hline
420:       \vspace{-10pt} & & & &\\ 
421:       
422:       $\bar{w}$ &  $\avg{z}$ & max $z$ & $\alpha$ & \quad coordination & \quad
423:       correlation \\
424:       
425:       & & & & \quad degree range & \quad coefficient \\
426:       \vspace{-10pt} & & & & &\\ \hline
427:       \vspace{-10pt} & & & & &\\
428:       
429:       $0.95$   & $14.4$    & $5735$  & $1.59$  & $25 - 100$    &  $-0.990$ \\
430:       &           &         & $1.46$  &  $100 - 500$  &  $-0.953$ \\ \hline
431:       
432:       $0.90$   & $73.1$    & $10794$  & $1.58$  & $25 - 100$   &  $-0.988$ \\
433:       &           &          & $1.51$  & $100 - 500$  &  $-0.939$ \\ \hline
434:         
435:         $0.85$   & $138.3$   & $16500$  & $1.68$  & $25 - 100$   &  $-0.993$ \\
436:         &           &          & $1.42$  & $100 - 800$  &  $-0.964$ \\ \hline
437:         
438:         $0.80$   & $207.2$   & $ 23726$ & $1.73$  & $25 - 100$   &  $-0.994$ \\
439:         &           &          & $1.29$  & $100 - 800$  &  $-0.941$ \\ \hline
440:         
441:         $0.75$   & $294.0$   & $33265$  & $1.79$  & $25 - 100$   &  $-0.997$ \\
442:         &           &          & $1.22$  & $100 - 1000$ &  $-0.956$ \\ \hline
443:         
444:         $0.70$   & $395.3$   & $35202$  & $1.74$  & $25 - 100$   &  $-0.996$ \\
445:         &           &          & $1.28$  & $100 - 1000$ &  $-0.946$ \\ \hline	
446:         
447:         $0.65$   & $507.8$   & $36333$  & $1.71$  & $25 - 100$   &  $-0.998$ \\
448:         &           &          & $1.39$  & $100 - 1000$ &  $-0.950$ \\ \hline
449:         
450:         $0.60$   & $622.3$   & $37729$  & $1.63$  & $25 - 100$   &  $-0.999$ \\
451:         &           &          & $1.32$  & $100 - 1500$ &  $-0.930$ \\ \hline
452:         
453:         $0.55$   & $745.3$   & $41871$  & $1.54$  & $25 - 100$   &  $-0.998$ \\
454:         &           &          & $1.44$  & $100 - 1500$ &  $-0.927$ \\ \hline
455:         
456:         $0.50$   & $911.7$   & $49895$  & $1.44$  & $25 - 100$   &  $-0.998$ \\
457:         &           &          & $1.56$  & $100 - 2000$ &  $-0.944$ \\ \hline
458:         
459:         $0.45$   & $1108.1$  & $51309$  &  $1.38$  & $25 - 100$  &  $-0.998$ \\
460:         &           &          &  $1.62$  & $100 - 2000$&  $-0.951$ \\ \hline
461:         
462:         $0.40$   & $1314.2$  & $51956$  & $1.28$  & $25 - 100$   &  $-0.998$ \\
463:         &           &          & $1.67$  & $100 - 2500$ &  $-0.946$ \\ \hline
464:         
465:         $0.35$   & $1501.9$  & $52513$  & $1.19$  & $25 - 100$   &  $-0.998$ \\
466:         &           &          & $1.72$  & $100 - 2500$ &  $-0.961$ \\ \hline
467:         
468:         $0.30$   & $1668.9$  & $60722$  & $1.08$  & $25 - 100$   &  $-0.997$ \\
469:         &           &          & $1.74$  & $100 - 3000$ &  $-0.969$ \\ \hline	
470:         
471:         $0.25$   & $1826.2$  & $64781$  & $0.97$  & $25 - 100$   &  $-0.997$ \\
472:         &           &          & $1.78$  & $100 - 3000$ &  $-0.969$ \\ \hline	
473:       \end{tabular} 
474:     \caption{\label{tab3} \small Fitted values of the exponent $\alpha$ of
475:       the power law distribution of the coordination degree for selected
476:       values of $\bar{w}$. We perform two linear fits that differ in the
477:       choice of the fitting range of the coordination degree. For each fit the
478:       range of coordination degree and the correlation coefficient are
479:       reported. The second column shows the average degree; the
480:       third column gives the maximum value of the coordination degree. }
481:   \end{center}
482: \end{table}
483: 
484: \section{Comparison with generalized random graphs}
485: 
486: It is interesting to compare these behaviours with those of a
487: model of random graphs. It is well known that, in the classical model,
488: random graphs (where every pair of nodes is chosen to be an edge with
489: probability $p$, as introduced by Erd\H{o}s and R\'enyi
490: \cite{erdos_renyi}) have the same expected coordination degree at
491: every node, so they are characterized by a Poissonian coordination
492: degree distribution with mean value $\avg{z} \sim p V$. Furthermore, as
493: soon as $\avg{z}$ assumes a value greater than $1$, a giant connected
494: component appears, that is, a component whose size is much greater than
495: the size of all other components, and that represents an important
496: fraction of all the graph's nodes.
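For reference, the Poissonian form mentioned above can be written explicitly. The symbol $P(z)$, the probability that a node has coordination degree $z$, is introduced here only for illustration. In the classical model each node is linked independently to each of the other $V-1$ nodes with probability $p$, so the degree is binomially distributed and, for large $V$ at fixed $\avg{z}$,
\[
  P(z) = \binom{V-1}{z}\, p^{z} (1-p)^{V-1-z}
  \;\longrightarrow\;
  \frac{\avg{z}^{z}}{z!}\, e^{-\avg{z}},
  \qquad \avg{z} = p\,(V-1) \simeq p V .
\]
Such a distribution decays faster than exponentially at large $z$, in contrast with a power law tail.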
497: 
498: A better theoretical comparison model could be represented by
499: generalized random graphs endowed with a specific degree distribution.
500: These can be generated via a Monte Carlo algorithm (following the
501: work of Burda et al.\ \cite{burda}). In particular, starting from a
502: random graph of $V$ nodes and $E$ edges, making local graph
503: transformations which leave the number of nodes and the number of
504: edges constant and accepting them with a probability which depends on
505: the desired equilibrium degree distribution (Metropolis algorithm), we
506: have generated a collection of random graphs with the same
507: coordination degree distribution and the same average degree as some
508: of our protein sequences graphs.
509: 
510: For each of them we observe a fundamentally different
511: distribution of connected components in the protein sequence graphs
512: and in the random graphs. In the latter model the power law behaviour
513: is absent, while there is always a dominant giant connected
514: component, much larger than the many other small components, whose
515: size distribution decreases exponentially (see Figure~\ref{fig7}).
516: 
517: \begin{figure}[!htb]
518:   \includegraphics[height=0.36\textwidth,angle=270]{f7.eps}
519:   \caption{\label{fig7}{\small Top: coordination degree distribution
520:       of the collection of random graphs generated via a Monte Carlo
521:       algorithm, fixing the equilibrium degree distribution equal to
522:       that observed in the protein sequence graph at $\bar{w} =
523:       0.99$ and fixing the average degree equal to $\avg{z} = 0.57$.
524:       Bottom: size distribution of connected components of the random
525:       graphs.}}
526: \end{figure}
527: 
528: By comparison, in the Simap protein sequence space the coordination
529: degree distribution $f_{\bar w}(z)$ and the connected component
530: distribution $n_{\bar w}(s)$ are strongly correlated. The former, for
531: example, can be reproduced quite well by means of $n_{\bar w}(s)$. Let
532: the index $i$ label all connected components and suppose that all
533: possible edges between nodes belonging to a connected component of
534: size $s_i$ were present; then the cluster would be a complete subgraph and all its
535: $s_i$ nodes would have coordination degree equal to $z_i = s_i-1$. If
536: this were true for all connected components then all clusters would be
537: complete subgraphs and we would expect a coordination degree
538: distribution equal to $f_{\bar w}(z) \sim ( s \; n_{\bar w}(s) ) |_{s
539:   = z + 1} $. In our graphs, although complete connected components
540: are present, the majority of clusters have only a high average
541: degree, not equal to their size minus one, as in complete graphs.
542: However, consider a component with size $s_i$ and a number of
543: edges equal to $m_i$; the quantity $\Delta_i = \frac{2 m_i}{s_i
544:   (s_i-1)}$ represents the fraction of edges that are present in the
545: $i$-th component relative to the number of edges that would be present
546: if the component were a complete subgraph (i.e. $s_i (s_i-1)/2$).
547: Introducing $\Delta_i$ as a measure of the edge density of each
548: component, we can still approximate the coordination degree distribution
549: $f_{\bar w}(z)$ by means of the connected component size distribution
550: $n_{\bar w}(s)$. Specifically, we find that the coordination degree
551: distribution behaves like $f_{\bar w}(z) \sim \bar{\Delta}(z+1) \;
552: (z+1) \; n_{\bar w}(z+1) $, where $\bar{\Delta}(s)$ is the edge
553: density averaged over all components of size $s$: $\bar{\Delta}(s) =
554: \frac{\sum_{i} \delta_{s_i, s} \Delta_i}{\sum_{i} \delta_{s_i, s}}$.
555: Figure~\ref{fig8} shows both the observed degree distribution and the
556: approximated degree distribution obtained by means of $n(s)$ of the
557: graph at $\bar{w} = 0.95$.
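As a small worked illustration of the density $\Delta_i$, take a hypothetical component (not one from the database) with $s_i = 5$ nodes and $m_i = 8$ edges:
\[
  \Delta_i = \frac{2 m_i}{s_i (s_i - 1)} = \frac{2 \cdot 8}{5 \cdot 4} = 0.8,
  \qquad
  \frac{2 m_i}{s_i} = \Delta_i \,(s_i - 1) = 3.2 ,
\]
so that $80\%$ of the possible edges are present and the mean coordination degree inside the component is $3.2$ instead of the value $s_i - 1 = 4$ of a complete subgraph. In the limiting case $\bar{\Delta}(s) = 1$ for every $s$, the approximation above reduces to the complete-subgraph expression $f_{\bar w}(z) \sim (z+1)\, n_{\bar w}(z+1)$.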
558: 
559: \begin{figure}[!htb]
560:   \includegraphics[height=0.36\textwidth,angle=270]{f8.eps}
561:   \caption{\label{fig8}{\small Observed degree distribution (black
562:       curve) and the approximated degree distribution (red curve)
563:       obtained by means of $n(s)$ of the graph at $\bar{w} = 0.95$.}}
564: \end{figure}
565: 
566: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
567: \section{Giant component}
568: 
569: \begin{figure}[!htb]
570:   \includegraphics[height=0.70\textwidth]{f9.eps}
571:   \caption{\label{fig9}{\small (a) Fraction of nodes belonging to the
572:       largest cluster for each value of $\bar{w}$. (b) Fraction of
573:       species present in the largest cluster for each value of
574:       $\bar{w}$.}}
575: \end{figure}
576: 
577: An interesting phenomenon occurs as the value of $\bar{w}$ decreases: we see
578: the formation of the giant component.  In Figure~\ref{fig9}a the
579: behaviour of the fraction of nodes belonging to the largest component
580: is shown.
581: 
582: \begin{table*}[!htb]
583:   \begin{center}
584:   \begin{tabular}{|c|c|c|c|c|c|c|c|c|}\hline 
585:     \vspace{-7pt} & & & & & & & & \\
586:     
587:     $\bar{w}$ & $\bar{d}$ & size & bacteria & viruses & plants &
588:     invertebrates & vertebrates & number of different species \\
589:     
590:     \vspace{-7pt} & & & & & & & & \\ \hline
591:     
592:     $0.975$ & $0.1581$ & $8322$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $4$ \\ 
593:     $0.950$ & $0.2236$ & $15955$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $4$ \\ 
594:     $0.925$ & $0.2739$ & $47687$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $10$ \\ 
595:     $0.900$ & $0.3162$ & $50729$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $14$ \\ 
596:     $0.875$ & $0.3536$ & $51028$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $14$ \\ 
597:     $0.850$ & $0.3873$ & $51405$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $14$ \\ 
598:     $0.825$ & $0.4183$ & $51969$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $29$ \\ 
599:     $0.800$ & $0.4472$ & $52097$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $29$ \\ 
600:     $0.775$ & $0.4743$ & $52881$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $29$ \\ 
601:     $0.750$ & $0.5000$ & $63003$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $60$ \\ 
602:     $0.725$ & $0.5244$ & $118777$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $67$ \\ 
603:     $0.700$ & $0.5477$ & $120974$ & $0.000$ & $0.999$ & $0.000$ & $0.000$ & $0.000$ & $106$ \\ 
604:     $0.675$ & $0.5701$ & $145278$ & $0.002$ & $0.997$ & $0.000$ & $0.000$ & $0.000$ & $302$ \\ 
605:     $0.650$ & $0.5916$ & $224310$ & $0.002$ & $0.749$ & $0.001$ & $0.000$ & $0.248$ & $988$ \\ 
606:     $0.625$ & $0.6124$ & $272426$ & $0.014$ & $0.662$ & $0.010$ & $0.007$ & $0.306$ & $4384$ \\ 
607:     $0.600$ & $0.6325$ & $297280$ & $0.028$ & $0.643$ & $0.015$ & $0.011$ & $0.303$ & $7854$ \\ 
608:     $0.575$ & $0.6519$ & $318472$ & $0.032$ & $0.613$ & $0.027$ & $0.015$ & $0.313$ & $9668$ \\ 
609:     $0.550$ & $0.6708$ & $362379$ & $0.047$ & $0.554$ & $0.035$ & $0.024$ & $0.341$ & $11437$ \\ 
610:     $0.525$ & $0.6892$ & $404788$ & $0.049$ & $0.526$ & $0.047$ & $0.029$ & $0.349$ & $15593$ \\ 
611:     $0.500$ & $0.7071$ & $450072$ & $0.065$ & $0.482$ & $0.055$ & $0.033$ & $0.365$ & $16272$ \\ 
612:     $0.475$ & $0.7246$ & $584371$ & $0.084$ & $0.379$ & $0.151$ & $0.037$ & $0.349$ & $20957$ \\ 
613:     $0.450$ & $0.7416$ & $718286$ & $0.114$ & $0.312$ & $0.194$ & $0.041$ & $0.340$ & $35346$ \\ 
614:     $0.425$ & $0.7583$ & $975629$ & $0.151$ & $0.229$ & $0.184$ & $0.095$ & $0.341$ & $68338$ \\ 
615:     $0.400$ & $0.7746$ & $1202753$ & $0.181$ & $0.188$ & $0.209$ & $0.096$ & $0.326$ & $76230$ \\ 
616:     $0.375$ & $0.7906$ & $1435734$ & $0.210$ & $0.160$ & $0.224$ & $0.093$ & $0.312$ & $77970$ \\ 
617:     $0.350$ & $0.8062$ & $1739772$ & $0.254$ & $0.133$ & $0.236$ & $0.087$ & $0.291$ & $80100$ \\ 
618:     $0.325$ & $0.8216$ & $2059217$ & $0.288$ & $0.117$ & $0.239$ & $0.083$ & $0.273$ & $82714$ \\ 
619:     $0.300$ & $0.8367$ & $2383804$ & $0.316$ & $0.102$ & $0.244$ & $0.080$ & $0.258$ & $84953$ \\ 
620:     $0.275$ & $0.8515$ & $2728214$ & $0.350$ & $0.090$ & $0.243$ & $0.078$ & $0.239$ & $86151$ \\ 
621:     $0.250$ & $0.8660$ & $3071192$ & $0.374$ & $0.083$ & $0.240$ & $0.076$ & $0.226$ & $90357$ \\ 
622:     $0.225$ & $0.8803$ & $3420697$ & $0.396$ & $0.078$ & $0.239$ & $0.074$ & $0.213$ & $94210$ \\ 
623:     $0.200$ & $0.8944$ & $3807556$ & $0.416$ & $0.076$ & $0.237$ & $0.073$ & $0.199$ & $101358$ \\ 
624:     $0.175$ & $0.9083$ & $4210208$ & $0.432$ & $0.074$ & $0.234$ & $0.072$ & $0.188$ & $102774$ \\ 
625:     $0.150$ & $0.9220$ & $4651704$ & $0.446$ & $0.072$ & $0.233$ & $0.073$ & $0.177$ & $103831$ \\ 
626:     $0.125$ & $0.9354$ & $5049016$ & $0.455$ & $0.069$ & $0.235$ & $0.073$ & $0.167$ & $104227$ \\ \hline 
627:     
628: \end{tabular}
629: \caption{\label{tab4} \small For each fixed value of $\bar w$, we
630:   computed the fraction of proteins, among those belonging to the
631:   largest component, that come from each of the five kingdoms.}
632:  \end{center}
633: \end{table*}
634: 
635: Starting from approximately $\bar{w}\sim 0.65$, the largest component
636: begins to grow by absorbing many smaller components.
637: Furthermore, the components which are disconnected at $\bar{w} \sim
638: 0.675$ and which merge into the giant component at $\bar{w}\sim 0.65$
639: span many different sizes, from small components to very large
640: components. This phenomenon becomes more and more evident for lower
641: values of $\bar{w}$, when the coordination degree distribution of the
642: giant component follows a power law scaling.  This is evident also
643: from Figure~\ref{fig6b}, where we plot the distribution of the
644: coordination degree for the whole set of proteins. The exponent
645: $\alpha(\bar{w})$ of the power law behavior $f_{\bar w}(z) \sim
646: z^{-\alpha(\bar w)}$ varies slightly between the regions corresponding
647: to small values of the coordination degree $z$ and to large values of
648: $z$. Clearly when a giant component exists, the region with large $z$
649: is largely determined by the giant component itself. In
650: Table~\ref{tab3} we report the fitted values of the exponent
651: $\alpha(\bar{w})$ computed in two regions with small and large values
652: of $z$. As we decrease the value of $\bar w$, the two fitted values
653: of $\alpha(\bar w)$ differ more and more. In fact, since the
654: largest component is growing, the tail of the distribution $f_{\bar
655: w}(z)$ becomes more and more important and assumes a power law
656: behavior characterized by a different exponent.
657: 
658: A significant change accompanies the rapid growth of the largest
659: component. In Table~\ref{tab4} we show, for each $\bar{w}$, the fraction of
660: proteins from each kingdom and the number of different species which appear in
661: the largest connected component.  Down to around $\bar{w} = 0.675$,
662: essentially only proteins coming from viruses belong to the largest component;
663: moreover, this largest cluster has not yet become giant with respect to
664: smaller clusters. For $\bar{w} \lesssim 0.675$ the formation of a
665: giant component begins, and simultaneously all kingdoms enter
666: the species composition of the giant cluster. This is also evident
667: from Figure~\ref{fig9}b, where we plot the fraction of species
668: present in the largest component.  This ratio increases rapidly
669: around the same value of $\bar w$. These processes continue for lower
670: values of $\bar{w}$, with the giant component including more and more
671: proteins belonging to many different species, and the ratio for each
672: kingdom tends to become the same as that of the whole database.
673: Furthermore, around $\bar w \simeq 0.475$ there is a very sharp
674: increase both in the size of the giant component and especially
675: in the number of species present in it, as is evident from
676: Figures~\ref{fig9}a and~\ref{fig9}b.
677: 
678: The processes just described may indicate the presence of a phase
679: transition: we have two different phases, one for large values of
680: $\bar w$, characterized by the presence of clusters with similar
681: sizes and with the largest one composed mainly of viral proteins,
682: and a second phase characterized by the presence of a giant
683: component composed of different species alongside other small
684: clusters. We note, however, that the phase transition is not sharp:
685: the changes in the size and composition of the largest component
686: are spread over the range $0.475 < \bar w < 0.675$. We also note that the
687: plot in Figure~\ref{fig9}b shows a very rapid increase at $\bar w \sim 0.475$.
688: 
689: \begin{table*}[!htb]
690:   \begin{center} 
691:   \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}  \hline 
692: 	
693:     $\bar{w}$ & $0.95$ & $0.90$ & $0.85$ & $0.80$ &
694:     $0.75$ & $0.70$ & $0.65$ & $0.60$ & $0.55$ &
695:     $0.50$ & $0.45$ & $0.35$ & $0.25$ & $0.15$ \\ \hline 
696: 
697:     bacteria & $9.6$ & $12.2$ & $14.2$ & $17.2$ & $21.9$ & $22.6$
698:     & $23.6$ & $23.8$ & $23.9$ & $25.1$ & $25.8$ & $29.0$ & $35.6$
699:     & $57.0$ \\
700:     
701:     viruses & $32.7$ & $31.4$ & $24.3$ & $17.6$ & $11.4$ & $7.4$ &
702:     $5.2$ & $3.8$ & $2.9$ & $2.7$ & $2.4$ & $2.7$ & $4.2$ & $7.5$
703:     \\
704:     
705:     plants & $9.3$ & $10.8$ & $11.4$ & $9.4$ & $8.3$ & $7.3$ &
706:     $7.6$ & $7.8$ & $7.7$ & $7.5$ & $7.5$ & $6.2$ & $4.0$ & $0.0$
707:     \\
708:     
709:     invertebrates & $11.6$ & $8.9$ & $7.4$ & $5.8$ & $3.6$ & $3.2$
710:     & $2.5$ & $2.0$ & $1.6$ & $1.5$ & $1.2$ & $1.4$ & $1.3$ &
711:     $1.1$ \\
712:     
713:     vertebrates & $22.9$ & $23.0$ & $25.4$ & $25.7$ & $25.6$ &
714:     $25.9$ & $23.6$ & $20.0$ & $17.1$ & $13.0$ & $10.2$ & $5.2$ &
715:     $2.8$ & $1.1$ \\
716:     
717:     bac-vir & $2.7$ & $2.2$ & $2.1$ & $2.1$ & $1.6$ & $1.6$ &
718:     $1.4$ & $1.0$ & $1.0$ & $1.1$ & $1.0$ & $1.7$ & $2.4$ & $3.2$
719:     \\
720:     
721:     bac-pla & $1.6$ & $1.8$ & $2.8$ & $2.9$ & $3.5$ & $4.5$ &
722:     $5.9$ & $7.0$ & $8.5$ & $8.9$ & $9.1$ & $10.8$ & $11.3$ &
723:     $18.3$ \\
724:     
725:     bac-inv & $0.5$ & $0.4$ & $0.7$ & $0.7$ & $0.8$ & $0.9$ &
726:     $1.3$ & $1.7$ & $2.1$ & $2.1$ & $2.0$ & $2.6$ & $3.0$ & $1.1$
727:     \\
728:     
729:     bac-ver & $1.8$ & $2.0$ & $2.4$ & $2.3$ & $1.9$ & $1.9$ &
730:     $1.8$ & $1.6$ & $1.5$ & $1.5$ & $1.3$ & $1.1$ & $1.1$ & $1.1$
731:     \\
732:     
733:     vir-pla & $0.2$ & $0.1$ & $0.2$ & $0.4$ & $0.3$ & $0.4$ &
734:     $0.3$ & $0.3$ & $0.2$ & $0.2$ & $0.2$ & $0.2$ & $0.5$ & $0.0$
735:     \\
736:     
737:     vir-inv & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ &
738:     $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$
739:     \\
740:     
741:     vir-ver & $0.2$ & $0.5$ & $0.7$ & $0.8$ & $0.9$ & $0.7$ &
742:     $0.6$ & $0.4$ & $0.3$ & $0.2$ & $0.1$ & $0.2$ & $0.1$ & $0.0$
743:     \\
744:     
745:     pla-inv & $0.9$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.2$ &
746:     $0.1$ & $0.1$ & $0.1$ & $0.2$ & $0.3$ & $0.2$ & $0.5$ & $0.0$
747:     \\
748:     
749:     pla-ver & $0.5$ & $0.9$ & $0.8$ & $1.1$ & $1.3$ & $1.0$ &
750:     $1.1$ & $1.2$ & $1.2$ & $1.0$ & $0.9$ & $1.3$ & $1.7$ & $1.1$
751:     \\
752:     
753:     inv-ver & $0.5$ & $1.1$ & $2.6$ & $4.5$ & $7.0$ & $8.4$ &
754:     $9.2$ & $10.3$ & $10.9$ & $11.2$ & $11.0$ & $9.0$ & $5.5$ &
755:     $0.0$ \\
756:     
757:     bac-vir-pla & $0.0$ & $0.4$ & $0.3$ & $0.5$ & $0.3$ & $0.3$ &
758:     $0.4$ & $0.2$ & $0.2$ & $0.2$ & $0.4$ & $0.4$ & $0.7$ & $1.1$
759:     \\
760:     
761:     bac-vir-inv & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ &
762:     $0.0$ & $0.0$ & $0.1$ & $0.0$ & $0.1$ & $0.1$ & $0.3$ & $1.1$
763:     \\
764:     
765:     bac-vir-ver & $0.2$ & $0.1$ & $0.0$ & $0.0$ & $0.0$ & $0.1$ &
766:     $0.1$ & $0.1$ & $0.1$ & $0.1$ & $0.2$ & $0.2$ & $0.2$ & $0.0$
767:     \\
768:     
769:     bac-pla-inv & $0.0$ & $0.0$ & $0.1$ & $0.2$ & $0.5$ & $0.6$ &
770:     $0.8$ & $0.9$ & $1.3$ & $2.0$ & $2.3$ & $2.4$ & $3.1$ & $1.1$
771:     \\
772:     
773:     bac-pla-ver & $0.0$ & $0.1$ & $0.0$ & $0.0$ & $0.1$ & $0.3$ &
774:     $0.6$ & $0.6$ & $0.9$ & $1.0$ & $1.3$ & $1.7$ & $1.4$ & $0.0$
775:     \\
776:     
777:     bac-inv-ver & $0.0$ & $0.0$ & $0.1$ & $0.3$ & $0.4$ & $0.4$ &
778:     $0.4$ & $0.9$ & $0.8$ & $0.9$ & $0.9$ & $1.0$ & $0.8$ & $1.1$
779:     \\
780:     
781:     vir-pla-inv & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ &
782:     $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$
783:     \\
784:     
785:     vir-pla-ver & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.1$ & $0.1$ &
786:     $0.1$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$
787:     \\
788:     
789:     vir-inv-ver & $0.0$ & $0.0$ & $0.1$ & $0.2$ & $0.1$ & $0.2$ &
790:     $0.3$ & $0.2$ & $0.2$ & $0.2$ & $0.2$ & $0.1$ & $0.1$ & $0.0$
791:     \\
792:     
793:     pla-inv-ver & $0.9$ & $1.4$ & $1.8$ & $5.5$ & $7.3$ & $8.4$ &
794:     $9.4$ & $11.0$ & $11.3$ & $12.0$ & $12.4$ & $13.4$ & $11.7$ &
795:     $0.0$ \\
796:     
797:     bac-vir-pla-inv & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ &
798:     $0.0$ & $0.0$ & $0.1$ & $0.1$ & $0.1$ & $0.1$ & $0.2$ & $0.0$
799:     & $0.0$ \\
800:     
801:     bac-vir-pla-ver & $0.0$ & $0.0$ & $0.1$ & $0.1$ & $0.1$ &
802:     $0.1$ & $0.2$ & $0.1$ & $0.1$ & $0.1$ & $0.1$ & $0.1$ & $0.1$
803:     & $0.0$ \\
804:     
805:     bac-vir-inv-ver & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ &
806:     $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.1$ & $0.1$
807:     & $0.0$ \\
808:     
809:     bac-pla-inv-ver & $0.2$ & $0.1$ & $0.4$ & $0.7$ & $1.0$ &
810:     $2.1$ & $2.5$ & $3.8$ & $5.1$ & $6.4$ & $8.0$ & $7.6$ & $6.7$
811:     & $0.0$ \\
812:     
813:     vir-pla-inv-ver & $0.0$ & $0.1$ & $0.0$ & $0.0$ & $0.1$ &
814:     $0.1$ & $0.2$ & $0.3$ & $0.3$ & $0.3$ & $0.2$ & $0.2$ & $0.1$
815:     & $1.1$ \\
816:     
817:     bac-vir-pla-inv-ver & $0.0$ & $0.0$ & $0.1$ & $0.1$ & $0.1$ &
818:     $0.2$ & $0.2$ & $0.1$ & $0.2$ & $0.3$ & $0.5$ & $0.7$ & $0.4$
819:     & $1.1$ \\ \hline
820:     
821: \end{tabular}
822: \caption{\label{tab5} \small Spread of species in connected
823:   components. Each value indicates the percentage of clusters,
824:   calculated on clusters having size greater than $90$, composed of
825:   proteins coming from only one kingdom, only from a pair of kingdoms,
826:   etc., up to the percentage of clusters composed of proteins of all
827:   kingdoms.}
828:  \end{center}
829: \end{table*}
830: 
831: Table~\ref{tab5} shows, for each $\bar{w}$, how the different
832: kingdoms are distributed among connected components. In particular, we
833: count the components whose size is greater than $90$ and
834: record the percentage of clusters whose proteins come from species of
835: only one kingdom, only from a pair of kingdoms, etc., up to the
836: percentage of connected components which contain proteins of all
837: kingdoms. For high values of $\bar{w}$ the majority of clusters are
838: made up of proteins belonging to only one kingdom, in particular the
839: kingdom of viruses; clusters with proteins of different kingdoms are
840: very scarce. As expected, as $\bar{w}$ decreases, the percentage of
841: clusters belonging to only one kingdom decreases in favor of clusters
842: of mixed kingdom composition.
843: 
844: It is interesting to note that the virus kingdom has a very low
845: tendency to cluster with the other kingdoms, in particular with plants
846: and animals. Furthermore, for no value of $\bar{w}$ do we see the
847: formation of components (of size greater than $90$) with proteins
848: coming only from viruses and invertebrates, or only from viruses, plants and
849: invertebrates. Virus proteins cluster mainly with bacterial proteins.
850: In addition we observe that bacterial proteins cluster mainly with
851: plant proteins and vice versa. Moreover, although plant proteins
852: cluster infrequently with invertebrates and with vertebrates, there
853: are many more clusters consisting simultaneously of plant,
854: invertebrate and vertebrate proteins. Finally we note that at the
855: lowest value of $\bar{w}$, the majority of components which are not
856: included in the giant component are clusters consisting of bacterial
857: proteins, of bacterial and plant proteins and of virus proteins.
858: 
859: \section{Analysis of the proteins that connect clusters}
860: 
861: \begin{figure}[!htb]
862:   \begin{center}
863:     \vspace{-0.4cm}
864:     \subfigure[]{\label{fig10a} 
865:       \includegraphics[width=0.48\textwidth]{f10a.eps}
866:     } \vspace{-0.4cm}
867:     \subfigure[]{\label{fig10b}
868:       \includegraphics[width=0.48\textwidth]{f10b.eps}
869:     }
870:     \caption{{\small Length representation of (a) proteins joining
871:         generic clusters and of (b) proteins joining the largest
872:         cluster. The red color encodes overrepresented lengths; the
873:         blue color indicates underrepresented lengths.}}
874:   \end{center}
875: \end{figure}
876: 
877: 
878: \begin{figure}[!htb]
879:   \begin{center}
880:     \vspace{-0.4cm}
881:     \subfigure[\label{fig11a}]{
882:      \includegraphics[width=0.48\textwidth]{f11a.eps}
883:     }\vspace{-0.4cm}
884:     \subfigure[\label{fig11b}]{
885:       \includegraphics[width=0.48\textwidth]{f11b.eps}
886:     }
887:   \caption{{\small Representation of the low complexity
888:       content of (a) proteins joining generic clusters and of (b)
889:       proteins joining the largest cluster. The red color encodes
890:       overrepresented values; the blue color indicates
891:       underrepresented values.  }}
892:   \end{center}
893: \end{figure}
894: 
895: \begin{figure}[!htb]
896:   \begin{center}
897:     \vspace{-0.4cm}
898:     \subfigure[\label{fig12a}]{
899:      \includegraphics[width=0.48\textwidth]{f12a.eps}
900:     }\vspace{-0.4cm}
901:     \subfigure[\label{fig12b}]{
902:       \includegraphics[width=0.48\textwidth]{f12b.eps}
903:     }
904: \caption{{\small Representation of the isoelectric
905:       points of (a) proteins joining generic clusters and of (b)
906:       proteins joining the largest cluster.  The red color encodes
907:       overrepresented values; the blue color indicates
908:       underrepresented values.}}
909:   \end{center}
910: \end{figure}
911: 
912: \begin{figure}[!htb]
913:   \begin{center}
914:     \vspace{-0.4cm}
915:     \subfigure[\label{fig13a}]{
916:       \includegraphics[width=0.48\textwidth]{f13a.eps}
917:     }\vspace{-0.4cm}
918:     \subfigure[\label{fig13b}]{
919:       \includegraphics[width=0.48\textwidth]{f13b.eps}
920:     }
921:     \caption{{\small Representation of the predicted number
922: 	of transmembrane helices of (a) proteins joining generic
923: 	clusters and of (b) proteins joining the largest cluster. The
924: 	red color encodes overrepresented values; the blue color
925: 	indicates underrepresented values.}}
926:   \end{center}
927: \end{figure}
928: 
929: 
930: \begin{figure}[!htb]
931:   \begin{center}
932:     \vspace{-0.4cm}
933:     \subfigure[\label{fig14a}]{
934:      \includegraphics[width=0.48\textwidth]{f14a.eps}
935:     }\vspace{-0.4cm}
936:     \subfigure[\label{fig14b}]{
937:       \includegraphics[width=0.48\textwidth]{f14b.eps}
938:     }
939: \caption{{\small Representation of the predicted signal peptides and
940:       protein localization signals of (a) proteins joining generic
941:       clusters and of (b) proteins joining the largest cluster. The
942:       red color encodes overrepresented values; the blue color
943:       indicates underrepresented values. }}
944:   \end{center}
945: \end{figure}
946: 
947: \begin{figure}[!htb]
948:   \begin{center}
949:     \vspace{-0.4cm}
950:     \subfigure[\label{fig15a}]{
951:      \includegraphics[width=0.48\textwidth]{f15a.eps}
952:     }\vspace{-0.4cm}
953:     \subfigure[\label{fig15b}]{
954:       \includegraphics[width=0.48\textwidth]{f15b.eps}
955:     }
956: \caption{{\small Representation of the predicted protein domains of
957:       (a) proteins joining generic clusters and of (b) proteins joining the
958:       largest cluster.  Each line in the graph denotes a certain
959:       domain. The red color encodes overrepresented values; the blue
960:       color indicates underrepresented values.}}
961:   \end{center}
962: \end{figure}
963: 
964: 
965: Protein pairs that connect clusters in the different weight intervals
966: are of special interest as they harbor the most conserved sequence
967: regions that are shared by the interconnected clusters. We want to
968: know if certain sequence features and protein domains are enriched in
969: these proteins compared to the complete proteome.  Therefore we have
970: calculated several sequence features for all proteins contained in SIMAP:
971: \e{length}, \e{isoelectric point} (using the EMBOSS sequence analysis
972: package \cite{emboss}), \e{low complexity content} (using the program
973: seg \cite{segprog}) and the number of \e{predicted transmembrane
974: segments} (using the program TMHMM \cite{tmhmmprog}).  Additionally,
975: in order to derive functional information for all proteins, we have
976: predicted \e{signal peptides} (using SignalP 3.0 \cite{signalPprog}),
977: \e{localization signals} (using TargetP 1.1 \cite{targetPprog}) and
978: \e{protein domains} (using the databases PFAM, TIGRFAM, PANTHER,
979: SUPERFAMILY, SMART and PIRSF from InterPro 12.1 \cite{pddb}) for all
980: SIMAP proteins.
981: 
982: For all weight intervals we have counted the occurrence of each feature in the
983: proteins that connect clusters; these proteins are all pairs of
984: sequences which belong to different clusters in the graph built at
985: $\bar{w}_1$ and to the same cluster in the graph built at
986: $\bar{w}_2$, where $\bar{w}_2<\bar{w}_1$ are two consecutive values of
987: the weight $\bar{w}$. We have also distinguished between two disjoint
988: sets of these proteins: proteins linking the clusters that will form
989: the largest cluster in the graph built at $\bar{w}_2$ and proteins
990: linking the other generic clusters.
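Read literally, and writing $C_{\bar w}(a)$ for the connected
component that contains protein $a$ in the graph built at $\bar w$
(a notation introduced here only for illustration), the set of
connecting pairs for the interval $\bar{w}_1 \to \bar{w}_2$ is
\[
  P(\bar{w}_1,\bar{w}_2) = \left\{ (a,b) \, : \,
  C_{\bar{w}_1}(a) \neq C_{\bar{w}_1}(b), \;
  C_{\bar{w}_2}(a) = C_{\bar{w}_2}(b) \right\} ,
\]
and the two disjoint sets are obtained according to whether
$C_{\bar{w}_2}(a)$ is the largest component of the graph built at
$\bar{w}_2$ or not.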
991: 
992: The enrichment ($e$) of a feature was calculated as the ratio of the number
993: of feature occurrences found ($k$) to the number expected ($k_E$):
994: $e = k/k_E$. The expected number was calculated as $k_E =
995: K\,n/V$, where $n$ is the number of proteins of interest (e.g.\
996: those connecting clusters in a given weight interval), $K$ denotes the
997: number of proteins used for clustering that have the given feature, and
998: $V$ is the total number of proteins used for clustering.
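As a purely illustrative example with hypothetical numbers (not taken
from SIMAP), suppose that $V = 10^6$ proteins were clustered, that
$K = 2000$ of them carry a given domain, and that $k = 10$ of the
$n = 500$ proteins connecting clusters in some weight interval carry
it. Then
\[
  k_E = \frac{K\,n}{V} = \frac{2000 \cdot 500}{10^6} = 1 ,
  \qquad e = \frac{k}{k_E} = 10 ,
\]
i.e.\ the domain is found ten times more often than expected; values
$e > 1$ indicate over-representation and values $e < 1$
under-representation.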
999: 
1000: \subsection{Results}
1001: 
1002: Proteins joining clusters outside the largest cluster show an
1003: over-representation of lengths around 400~aa (Figure~\ref{fig10a}),
1004: are enriched in proteins with a small low-complexity content
1005: (Figure~\ref{fig11a}), are often neutral or weakly acidic
1006: (Figure~\ref{fig12a}) and contain more transmembrane proteins than
1007: expected (Figure~\ref{fig13a}).  Proteins joining clusters in the
1008: giant component are characterized by short and very long lengths
1009: (Figure~\ref{fig10b}), reduced low complexity content
1010: (Figure~\ref{fig11b}), acidic or alkaline isoelectric points, depending on the
1011: weight interval (Figure~\ref{fig12b}), and a high number of
1012: transmembrane domains in the lower weight intervals
1013: (Figure~\ref{fig13b}).  Signal peptides were found overrepresented in
1014: proteins joining clusters outside the largest component at the lower
1015: weight intervals; at higher weight intervals and in proteins joining
1016: clusters in the largest component they were found underrepresented, as
1017: were localization signals in all proteins joining clusters
1018: (Figure~\ref{fig14a} and Figure~\ref{fig14b}).  For all considered
1019: weight intervals we could find interval-specific overrepresented and
1020: underrepresented protein domains (Figures~\ref{fig15a} and
1021: \ref{fig15b}). Remarkably, these domains are not only specific to a
1022: certain weight interval, but also differ between proteins joining
1023: clusters outside the largest component and proteins joining clusters
1024: in the largest component (see Table~\ref{tab6}).
1025: 
1026: \subsection{Discussion}
1027: 
1028: All of the analyzed sequence features indicate that proteins that join
1029: clusters at a certain weight interval are not distributed equally over
1030: the complete protein space. For all of the features we could find
1031: specific under- and over-representation. Proteins joining clusters
1032: outside the largest component and proteins joining clusters in the
1033: largest component are different with respect to almost all considered
1034: features, which indicates that the largest component contains proteins
1035: that are different from those contained in other large clusters. These
1036: findings are complemented by the observation of specific over- and
1037: underrepresented functional domains in the proteins connecting
1038: clusters at certain weight intervals. Thus we conclude that for each
1039: weight interval a small number of protein families is responsible for
1040: cluster interconnections.
1041: 
1042: %%%%%%%%%%%%%%%%%%%%%%
1043: \section{Conclusions}
1044: 
1045: We investigated the local and global properties of the sequence
1046: similarity space formed by all proteins in the SIMAP database, which
1047: contains more than $5.5$ million amino acid sequences. We represented
1048: this space as a graph whose vertices are proteins and whose edges are
1049: weighted to reflect the similarity between the corresponding pairs of
1050: sequences (high weight, high similarity). The choice of the weight
1051: formula (\ref{eq:weight}) came from the need to compare the
1052: similarity score between pairs of sequences that could have different
1053: lengths. The SW score was therefore modified by means of the
1054: geometric mean of the self-scores, which contains the length information of the
1055: two aligned sequences.
1056: 
1057: Then, keeping only edges with $w \geq \bar w$, we built a collection
1058: of graphs by varying $\bar w$. From the analysis of the connected
1059: components we found that these graphs do not belong to the class of
1060: random graphs; rather, they are characterized by a power law behaviour
1061: both in the cluster size distribution and in the coordination degree
1062: distribution, and for each fixed $\bar w$ these two distributions are
1063: strongly related to each other.
1064: 
1065: With the variation of $\bar w$, we found interesting changes in the
1066: global organization of the protein homology networks: we observed two
1067: different phases, one for large values of $\bar w $, characterized by
1068: the presence of clusters with similar sizes, each composed
1069: essentially of proteins belonging to only one kingdom and with the
1070: largest one composed mainly of viral proteins, and a second phase, for
1071: lower values of $\bar w$, characterized by the presence of a giant
1072: component composed of different species and other very small
1073: clusters.
1074: 
1075: Finally, we investigated sequence features and functional
1076: information of protein pairs that are responsible for the connection
1077: of clusters in the different intervals of $\bar w$, since they harbor
1078: the most conserved sequence regions that are shared by the
1079: interconnected clusters. We found that proteins joining clusters
1080: outside the largest component and proteins joining clusters in the
1081: largest component are different with respect to almost all considered
1082: features, which indicates that the largest component contains proteins
1083: that are different from those contained in other large
1084: clusters. Indeed we found an overrepresentation of a small set of
1085: domains which shows that a small number of protein families is
1086: responsible for cluster interconnections.
1087: 
1088: The analysis we performed gives a first view of the global
1089: organization of the largest protein homology network built to date.
1090: It is the first step and the starting point for answering other
1091: interesting global or local questions which could confirm that the
1092: protein homology network is structured with respect to functional and
1093: evolutionary properties.
1094:     
1095: 
1096: %%%%%%%%%%%%%%%%%%%%%%%%%%%
1097: \section{Acknowledgements}
1098: The authors thank Claudio Destri, Roland Arnold and Mattia Pelizzola
1099: for useful discussions, Michele Caselle for encouraging our
1100: collaboration and Patrick Tischler, Jan Krumsiek and Benedikt
1101: Wachinger for providing the software for protein feature calculation.
1102: 
1103: \newpage 
1104: \begin{table*}[!htb] 
1105:   \begin{center}
1106:       \begin{tabular}{|c|cc|cc|}  \hline 
1107:     
1108:     $\bar{w}_1 \to \bar{w}_2$ & $e$ & \hspace{-0pt} Proteins joining
1109:     generic clusters & $e$ & \hspace{-0pt} Proteins joining the largest cluster \\
1110:     \hline
1111:     
1112:     & $0.02$	& PF00598 Flu\_M1          & $0.93$	& PF00078 RVT\_1 \\
1113:     & $0.03$	& PF00522 VPR              & $1.08$	& PF00075 RnaseH \\
1114:     & $0.03$	& PF00540 Gag\_p17         & $1.44$	& PF06815 RVT\_connect \\
1115:     & $0.03$	& PF00951 Arteri\_Gl       & $1.46$	& PF07075 DUF1343 \\
1116:     & $0.03$	& PF00971 EIAV\_GP90       & $2.19$	& PF00665 rve \\
1117:     $0.750$ $\to$ $0.725$ & & & & \\
1118:     & $9.40$	& PF02916 DNA\_PPF         & $15.41$	& PF00607 Gag\_p24 \\
1119:     & $11.09$	& PF07095 IgaA             & $18.79$	   & PF00517 GP41 \\ 
1120:     & $11.25$	& PF08272 Topo\_Zn\_Ribbon & $18.91$	& PF02022 Integrase\_Zn \\
1121:     & $11.83$	& PF06899 WzyE             & $27.07$	& PF00540 Gag\_p17 \\
1122:     & $12.46$	& PF06788 UPF0257          & $137.49$	& PF00516 GP120 \\ \hline
1123:     
1124:     & &                                & $0.88$	& PF00078 RVT\_1 \\
1125:     & &                                & $1.16$	& PF00077 RVP \\
1126:     & &                                & $1.91$	& PF06817 RVT\_thumb \\
1127:     & &                                & $3.68$	& PF00075 RnaseH \\
1128:     & &                                & $3.77$	& PF00665 rve \\
1129:     $0.725$ $\to$ $0.700$ & & & & \\ 
1130:     & &                                & $37.19$	& PF00186 DHFR\_1 \\
1131:     & &                                & $80.26$	& PF00098 zf-CCHC \\
1132:     & &                                & $129.77$	& PF00516 GP120 \\
1133:     & &                                & $139.92$	& PF00607 Gag\_p24 \\
1134:     & &                                & $145.50$	& PF00540 Gag\_p17 \\ \hline
1135:     
1136:     
1137:     & $0.01$	& PF00516 GP120            & $0.12$	& PF00098 zf-CCHC \\
1138:     & $0.01$	& PF00522 VPR              & $0.15$	& PF00271 Helicase\_C \\
1139:     & $0.01$	& PF00602 Flu\_PB1         & $0.22$	& PF00078 RVT\_1 \\
1140:     & $0.01$	& PF00603 Flu\_PA          & $1.02$	& PF01560 HCV\_NS1 \\
1141:     & $0.01$	& PF01539 HCV\_env         & $1.16$	& PF06817 RVT\_thumb \\
1142:     $0.700$ $\to$ $0.675$ & & & & \\
1143:     & $10.14$	& PF08435 Calici\_coat\_C  & $15.62$	& PF02907 Peptidase\_S29 \\
1144:     & $10.22$	& PF03296 Pox\_polyA\_pol  & $19.47$	& PF00517 GP41 \\
1145:     & $12.94$	& PF05733 Tenui\_N         & $57.66$	& PF00516 GP120 \\
1146:     & $12.98$	& PF03805 CLAG             & $74.03$	& PF00077 RVP \\
1147:     & $13.68$	& PF00897 Orbi\_VP7        & $98.38$	& PF02348 CTP\_transf\_3 \\ \hline
1148:     
1149:     & $0.01$	& PF00064 Neur             & $0.10$	   & PF00078 RVT\_1 \\
1150:     & $0.01$	& PF00469 F-protein        & $0.13$	& PF00077 RVP \\
1151:     & $0.01$	& PF00506 Flu\_NP          & $0.18$	& PF00560 LRR\_1 \\
1152:     & $0.01$	& PF00516 GP120            & $0.18$	& PF00607 Gag\_p24 \\
1153:     & $0.01$	& PF00540 Gag\_p17         & $0.30$	& PF00665 rve \\
1154:     $0.675$ $\to$ $0.650$ & & & & \\
1155:     & $11.63$	& PF04310 MukB             & $151.92$	& PF02959 Tax \\
1156:     & $12.71$	& PF07108 PipA             & $168.64$	& PF00758 EPO\_TPO \\
1157:     & $13.48$	& PF07429 Fuc4NAc\_transf  & $431.37$	& PF08300 HCV\_NS5a\_1 \\
1158:     & $15.20$	& PF03506 Flu\_C\_NS1      & $441.03$	& PF08301 HCV\_NS5a\_1b \\
1159:     & $15.26$	& PF06593 RBDV\_coat       & $483.96$	& PF01506 HCV\_NS5a \\ \hline 
1160:     
1161:     & $0.01$	& PF00506 Flu\_NP    & $0.03$	& PF00096 zf-C2H2 \\
1162:     & $0.01$	& PF00516 GP120      & $0.04$	& PF00078 RVT\_1 \\
1163:     & $0.01$	& PF00540 Gag\_p17   & $0.17$	& PF00023 Ank \\
1164:     & $0.01$	& PF00603 Flu\_PA    & $0.17$	& PF00589 Phage\_integrase \\
1165:     & $0.01$	& PF00695 vMSA       & $0.19$	& PF00903 Glyoxalase \\
1166:     $0.650 $ $\to$ $ 0.625$ & & & & \\
1167:     & $12.57$	& PF06952 PsiA       & $202.08$	& PF01002 Flavi\_NS2B \\
1168:     & $13.73$	& PF06788 UPF0257    & $221.93$	& PF01349 Flavi\_NS4B \\
1169:     & $14.79$	& PF05788 Orbi\_VP1  & $222.59$	& PF01353 GFP \\
1170:     & $15.42$	& PF00901 Orbi\_VP5  & $229.23$	& PF01350 Flavi\_NS4A \\
1171:     & $16.02$	& PF03753 HHV6-IE    & $243.38$	& PF00948 Flavi\_NS1 \\ \hline
1172:     
1173:  \end{tabular}  
1174:   \end{center} 
1175: \end{table*}
1176: 
1177: \begin{table*}[!htb] 
1178:   \begin{center}
1179:     \begin{tabular}{|c|cc|cc|}  \hline 
1180:    
1181:     & $0.01$	& PF00124 Photo\_RC         & $0.09$  & PF00009 GTP\_EFTU \\
1182:     & $0.01$	& PF00603 Flu\_PA           & $0.13$  & PF07974 EGF\_2 \\
1183:     & $0.01$	& PF00695 vMSA              & $0.20$  & PF00096 zf-C2H2 \\
1184:     & $0.01$	& PF01560 HCV\_NS1          & $0.22$  & PF00560 LRR\_1 \\
1185:     & $0.02$	& PF00223 PsaA\_PsaB        & $0.23$  & PF01546 Peptidase\_M20 \\
1186:     $0.625$ $\to$ $0.600$ & & & & \\
1187:     & $11.95$	& PF06517 Orthopox\_A43R    & $376.41$ & PF01002 Flavi\_NS2B \\
1188:     & $12.09$	& PF00843 Arena\_nucleocap  & $403.70$ & PF00948 Flavi\_NS1 \\
1189:     & $13.08$	& PF06802 DUF1231           & $411.72$ & PF01349 Flavi\_NS4B \\
1190:     & $14.72$	& PF05273 Pox\_RNA\_Pol\_22 & $425.27$ & PF01350 Flavi\_NS4A \\
1191:     & $16.90$	& PF03021 CM2               & $538.21$ & PF05408 Peptidase\_C28 \\ \hline
1192:     
1193:     & $0.01$	& PF00517 GP41             & $0.06$	& PF00096 zf-C2H2 \\
1194:     & $0.01$	& PF00559 Vif              & $0.06$	& PF00097 zf-C3HC4 \\ 
1195:     & $0.01$	& PF00600 Flu\_NS1         & $0.09$	& PF00009 GTP\_EFTU \\
1196:     & $0.01$	& PF00969 MHC\_II\_beta    & $0.09$	& PF01266 DAO \\
1197:     & $0.01$	& PF06815 RVT\_connect     & $0.11$	& PF01926 MMR\_HSR1 \\
1198:     $0.600$ $\to$ $0.575$ & & & & \\
1199:     & $10.54$	& PF02477 Nairo\_nucleo    & $133.87$ & PF05790 C2-set \\
1200:     & $11.95$	& PF07982 Herpes\_UL74     & $139.12$ & PF01353 GFP \\
1201:     & $12.30$	& PF06871 TraH\_2          & $150.11$ & PF00518 E6 \\
1202:     & $14.14$	& PF02509 Rota\_NS35       & $195.29$ & PF02929 Bgal\_small\_N \\
1203:     & $16.04$	& PF06929 Rotavirus\_VP3   & $231.71$ & PF01382 Avidin \\ \hline
1204:     
1205:     & $0.01$	& PF00016 RuBisCO\_large   & $0.02$	& PF00115 COX1 \\
1206:     & $0.01$	& PF00113 Enolase\_C       & $0.07$	& PF07690 MFS\_1 \\
1207:     & $0.01$	& PF00123 Hormone\_2       & $0.08$	& PF07993 NAD\_binding\_4 \\
1208:     & $0.01$	& PF00506 Flu\_NP          & $0.09$	& PF00517 GP41 \\
1209:     & $0.01$	& PF01010 Oxidored\_q1\_C  & $0.10$	& PF00583 Acetyltransf\_1 \\
1210:     $0.575 $ $\to$ $ 0.550$ & & & & \\ 
1211:     & $10.60$	& PF06134 RhaA             & $161.43$ & PF01140 Gag\_MA \\
1212:     & $10.95$	& PF07095 IgaA             & $168.19$ & PF04528 Adeno\_E4\_34 \\
1213:     & $11.75$	& PF00897 Orbi\_VP7        & $173.44$ & PF08377 MAP2\_projctn \\
1214:     & $12.13$	& PF03294 Pox\_Rap94       & $184.23$ & PF02093 Gag\_p30 \\
1215:     & $13.75$	& PF01295 Adenylate\_cycl  & $311.32$ & PF01141 Gag\_p12 \\ \hline
1216:     
1217:     & $0.01$	& PF00016 RuBisCO\_large  & $0.06$	& PF00067 p450 \\
1218:     & $0.01$	& PF00516 GP120           & $0.07$	& PF00023 Ank \\
1219:     & $0.01$	& PF00522 VPR             & $0.08$	& PF00097 zf-C3HC4 \\
1220:     & $0.01$	& PF00540 Gag\_p17        & $0.11$	& PF01381 HTH\_3 \\
1221:     & $0.01$	& PF01539 HCV\_env        & $0.11$	& PF04851 ResIII \\
1222:     $0.550$ $\to$ $0.525$             &       &                         &             &  \\
1223:     & $11.29$	& PF05928 Zea\_mays\_MuDR & $101.41$	& PF01537 Herpes\_glycop\_D \\
1224:     & $11.62$	& PF06829 DUF1238         & $121.18$	& PF02929 Bgal\_small\_N \\
1225:     & $11.63$	& PF03277 Herpes\_UL4     & $123.25$	& PF01376 Enterotoxin\_b \\
1226:     & $11.64$	& PF03395 Pox\_P4A        & $128.24$	& PF06466 PCAF\_N \\
1227:     & $12.73$	& PF08405 Calici\_PP\_N   & $147.36$	& PF05806 Noggin \\ \hline
1228:     
1229:     
1230:     & $0.01$	& PF00600 Flu\_NS1           & $0.02$	& PF00106 adh\_short \\
1231:     & $0.01$	& PF00869 Flavi\_glycoprot   & $0.04$	& PF00270 DEAD \\
1232:     & $0.01$	& PF01539 HCV\_env           & $0.05$	& PF00037 Fer4 \\
1233:     & $0.01$	& PF02461 AMO                & $0.06$	& PF02518 HATPase\_c \\
1234:     & $0.01$	& PF02788 RuBisCO\_large\_N  & $0.08$	& PF00249 Myb\_DNA-binding \\
1235:     $0.525$ $\to$ $0.500$	&     &                           &        & \\
1236:     & $11.36$	& PF07434 CblD              & $68.92$	& PF03939 Ribosomal\_L23eN \\
1237:     & $11.80$	& PF04913 Baculo\_Y142      & $72.11$	& PF06267 DUF1028 \\
1238:     & $11.98$	& PF05880 Fiji\_64\_capsid  & $96.66$	& PF02022 Integrase\_Zn \\
1239:     & $13.48$	& PF06306 CgtA              & $120.34$	& PF00552 Integrase \\
1240:     & $13.98$	& PF03317 ELF               & $129.98$	& PF02929 Bgal\_small\_N \\ \hline
1241:   \end{tabular}   
1242: \caption{\label{tab6} \small For proteins joining clusters outside the
1243:   largest component or joining the giant component, the five most
1244:   underrepresented and five most overrepresented PFAM domains are
1245:   given for each interval of the weight $w$.}
1246:  \end{center}
1247: \end{table*}
1248: 
1249: 
1250: \begin{thebibliography}{99}
1251: 
1252: \bibitem{revEvol} E.V.~Koonin, {\it Orthologs, Paralogs, and
1253:     Evolutionary Genomics}, {\tt Annu. Rev. Genet. {\bf 39}, 309-338 (2005)}
1254:   
1255: \bibitem{simap} R.~Arnold, T.~Rattei, P.~Tischler, M.~Truong,
1256: V.~St\"{u}mpflen, W.~Mewes, {\it SIMAP - The similarity matrix of
1257: proteins}, {\tt Bioinformatics {\bf 21}, ii42-ii46 (2005)}
1258: 
1259: \bibitem{phn} D.~Medini, A.~Covacci, C.~Donati, {\it Protein homology
1260:     network families reveal step-wise diversification of type III and
1261:     type IV secretion systems}, {\tt PLoS Comput. Biol. {\bf 2},
1262:     1543-1551 (2006)}
1263: 
1264: \bibitem{erdos_renyi} P.~Erd\"os, A.~R\'enyi, {\it On random graphs I},
1265: {\tt Publ. Math. Debrecen {\bf 6}, 290-297 (1959)}
1266: 
1267: \bibitem{burda} L.~Bogacz, Z.~Burda, W.~Janke, B.~Waclaw, {\it A
1268: program generating homogeneous random graphs with given weights},
1269: [{\tt cond-mat/0506330}]
1270: 
1271: \bibitem{emboss} P.~Rice, I.~Longden, et al., {\it EMBOSS: the
1272:     European Molecular Biology Open Software Suite}, {\tt Trends
1273:     Genet. {\bf 16}(6), 276-277 (2000)}
1274: 
1275: \bibitem{segprog} J.C.~Wootton, {\it Sequences with `unusual' amino
1276:     acid compositions}, {\tt Curr. Opin. Struct. Biol. {\bf 4}, 413-421
1277:     (1994)}
1278: 
1279: \bibitem{tmhmmprog} A.~Krogh, B.~Larsson, et al., {\it Predicting
1280:     transmembrane protein topology with a hidden Markov model:
1281:     application to complete genomes}, {\tt J. Mol. Biol. {\bf 305}(3),
1282:     567-580 (2001)}
1283: 
1284: \bibitem{signalPprog} J.D.~Bendtsen, H.~Nielsen, et al., {\it Improved
1285:     prediction of signal peptides: SignalP 3.0}, {\tt J. Mol. Biol.
1286:     {\bf 340}(4), 783-795 (2004)}
1287: 
1288: \bibitem{targetPprog} O.~Emanuelsson, H.~Nielsen, et al., {\it
1289:     Predicting subcellular localization of proteins based on their
1290:     N-terminal amino acid sequence}, {\tt J. Mol. Biol. {\bf 300}(4),
1291:     1005-1016 (2000)}
1292: 
1293: \bibitem{pddb} N.J.~Mulder, R.~Apweiler, et al., {\it InterPro,
1294:     progress and status in 2005}, {\tt Nucleic Acids Res. {\bf 33}
1295:     (Database issue), D201-D205 (2005)}
1296: 
1297: \end{thebibliography}
1298: 
1299: \end{document}
1300: 